[00:14:09] serviceops, Operations, ops-codfw: wtp2005 hardware issue - https://phabricator.wikimedia.org/T257903 (Papaul) @Dzahn I noticed that this server is not present in Icinga and has status "failed" in Netbox. Can we turn this task to a decommission task if the server is no longer needed in production an...
[00:21:22] serviceops, MW-on-K8s, Operations, TechCom-RFC: Decide on logging in k8s for ShellBox - https://phabricator.wikimedia.org/T263545 (tstarling) > https://pracucci.com/php-on-kubernetes-application-logging-via-unix-pipe.html provides two options for how php-fpm could be set up to get the logs from w...
[01:46:27] serviceops, MW-on-K8s, Operations: Decide on logging in k8s for ShellBox - https://phabricator.wikimedia.org/T263545 (tstarling)
[07:02:24] hello, I'm trying to create a new page on wikitech with visual editor and when publishing it I'm getting: "Something went wrong", "Error contacting the Parsoid/RESTBase server (HTTP 404)"
[07:03:50] <_joe_> you're in the wrong place
[07:04:19] <_joe_> there is a bug already, and a discussion in the element channel
[07:04:44] * volans followed elukey's trick from -sre
[07:04:56] <_joe_> on friday cscott said "I think we know of one issue on VisualEditor/Parsoid-not-using-RESTBase configs (aka office and wikitech, i think) that leads to 500 errors when creating a brand new page. That should be fixed in next week's train."
[07:05:24] _joe_: where is documented where to ask?
[07:06:30] <_joe_> ?
[07:06:57] <_joe_> https://phabricator.wikimedia.org/T262838
[07:07:35] <_joe_> uh it was a 500 error before
[07:07:53] <_joe_> so not sure if the fix is "live" on wikitech.
[07:48:36] serviceops, Operations, Product-Infrastructure-Team-Backlog, Wikifeeds, Patch-For-Review: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (Joe) a:Joe→None De-assigning from me as the immediate bug is solved, and Petr is working on the long-term fix.
[07:50:45] serviceops, Operations, Product-Infrastructure-Team-Backlog, RESTBase, Wikifeeds: wikifeeds OpenAPI spec test doesn't fail if the response from `feed/featured` is malformed - https://phabricator.wikimedia.org/T263097 (Joe) a:Joe→None
[07:52:43] <_joe_> ottomata / elukey we really need to convert all helmfile.d to the new structure. The only stuff lacking is eventgate-analytics-external eventgate-logging-external eventstreams
[07:52:55] <_joe_> can you please prioritize that?
[07:53:19] <_joe_> I can also just delete the directories and you can re-add them next time you need to do a release, with the correct format
[07:53:23] <_joe_> if you prefer :P
[07:54:28] I don't know much about the eventgate's helmfiles but I'll ask Andrew to see if he can work on them this week
[07:59:00] <_joe_> I should've also pinged razzi and tobias, it will take time getting used to annoy them too
[08:03:21] serviceops, Operations, Product-Infrastructure-Team-Backlog, RESTBase, Wikifeeds: Wikifeeds should send uncachable response in case of some upstream failure - https://phabricator.wikimedia.org/T263100 (Joe) a:Joe→None
[08:32:13] serviceops, MW-on-K8s, Operations: Decide on logging in k8s for ShellBox - https://phabricator.wikimedia.org/T263545 (tstarling) @Joe says that application logs can go to logstash while php-fpm error logs will go to k8s. I want to say that the simplest way to send application logs to logstash is to u...
[09:11:37] _joe_ you never annoy people, it is always a pleasure to listen to your advice and follow the best practices
[09:11:44]
[09:11:46] * elukey runs away
[09:12:12] <_joe_> elukey: you used to be subtler in your teasing
[09:12:27] <_joe_> i'll send you some apache patches to retaliate
[09:12:31] hahahaha
[09:40:40] serviceops, Operations, Prod-Kubernetes, Kubernetes: Move zotero to use TLS only - https://phabricator.wikimedia.org/T255869 (JMeybohm) a:JMeybohm
[10:35:59] serviceops, observability, Patch-For-Review, User-fgiunchedi: Envoy should listen on ipv6 and ipv4 - https://phabricator.wikimedia.org/T255568 (jijiki) This is causing socket errors, which is not a noticeable performance issue. {F32360772}
[11:28:29] serviceops, MediaWiki-Cache, MediaWiki-General, Performance-Team: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (jijiki)
[11:28:50] serviceops, MediaWiki-Cache, MediaWiki-General, Performance-Team, User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (jijiki)
[11:29:19] serviceops, MediaWiki-Cache, MediaWiki-General, Performance-Team, User-jijiki: Use monotonic clock instead of microtime() for perf measures in MW PHP - https://phabricator.wikimedia.org/T245464 (jijiki) @dpifke please ping me to roll out the package when you are ready
[12:52:25] _joe_: we didn't realize I needed to do that; someone did eventgate-main and eventgate-analytics (jay me?)
for us
[13:19:07] <_joe_> I did one, he did one, but I'm sure we said we wanted you to self-serve from there
[13:19:25] <_joe_> we also did the same with hnowlan for the PET services
[14:38:39] serviceops, Product-Infrastructure-Team-Backlog, Push-Notification-Service, User-jijiki: Set up secrets for Token clean-up - https://phabricator.wikimedia.org/T262957 (LGoto)
[15:39:11] serviceops, Operations, Product-Infrastructure-Team-Backlog, RESTBase, and 2 others: wikifeeds OpenAPI spec test doesn't fail if the response from `feed/featured` is malformed - https://phabricator.wikimedia.org/T263097 (LGoto)
[15:39:16] serviceops, Operations, Product-Infrastructure-Team-Backlog, RESTBase, and 2 others: wikifeeds OpenAPI spec test doesn't fail if the response from `feed/featured` is malformed - https://phabricator.wikimedia.org/T263097 (LGoto) p:Triage→Low
[15:50:05] serviceops, Push-Notification-Service, Product-Infrastructure-Team-Backlog (Kanban), User-jijiki: Set up secrets for Token clean-up - https://phabricator.wikimedia.org/T262957 (LGoto)
[15:50:30] serviceops, Push-Notification-Service, Product-Infrastructure-Team-Backlog (Kanban), User-jijiki: Set up secrets for Token clean-up - https://phabricator.wikimedia.org/T262957 (LGoto) a:MSantos→None
[15:51:48] serviceops, Push-Notification-Service, Product-Infrastructure-Team-Backlog (Kanban), User-jijiki: Set up secrets for Token clean-up - https://phabricator.wikimedia.org/T262957 (MSantos) a:jijiki @jijiki thanks! This is not urgent, but please ping us when you're ready for it. I will assign it...
[16:08:17] serviceops, Operations, ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (RobH) All tests passed with no issues. I've updated the firmware to the newest version of bios, which is the mainboard firmware. The system no longer sees the NIC. I suppose we should try to reim...
[16:13:35] serviceops, Operations, ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` mw1360.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202009231613_robh_1862_...
[16:13:40] serviceops, Operations, ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1360.eqiad.wmnet'] ` Of which those **FAILED**: ` ['mw1360.eqiad.wmnet'] `
[16:16:40] serviceops, Operations, ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` mw1360.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202009231616_robh_4519_...
[16:18:08] serviceops, Operations, Parsing-Team, TechCom, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (daniel) a:holger.knust Assigning Holger, per yesterday's meeting.
[16:27:58] serviceops, Operations, ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1360.eqiad.wmnet'] ` Of which those **FAILED**: ` ['mw1360.eqiad.wmnet'] `
[17:09:29] serviceops, Operations, ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (RobH) So both myself and Papaul have looked into this, checking multiple items: * updated bios and idrac to newest firmware revisions * nic is enabled in bios * nic has error message in idrac inven...
[17:23:11] serviceops, Operations, ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (RobH) Papaul suggests that we drain power (unplug since we don't have switched PDUs in normal racks) and let sit for a couple minutes and then plug it all back in. Worth a shot, since it has cleare...
[17:23:19] serviceops, Operations, ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (RobH) a:Cmjohnson
[17:25:27] serviceops, Operations, ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (RobH)
[17:25:32] serviceops, Operations, ops-eqiad: mw1360's NIC is faulty - https://phabricator.wikimedia.org/T262151 (RobH) p:Triage→Medium
[17:59:41] serviceops, Analytics-Radar, Release-Engineering-Team, observability, User-jijiki: Should we create a separate 'mwdebug' cluster? - https://phabricator.wikimedia.org/T262202 (jijiki) @Milimetric that would be great, if it is not too much work, I would appreciate it. I will work on the varnish...
[18:01:22] serviceops, Analytics-Radar, Release-Engineering-Team, observability, User-jijiki: Create a separate 'mwdebug' cluster - https://phabricator.wikimedia.org/T262202 (jijiki)
[18:25:43] serviceops, WMF-JobQueue, Patch-For-Review, Platform Team Workboards (Clinic Duty Team), and 2 others: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 (BPirkle) This is still occurring as of Sept 23, 2020
[18:56:58] serviceops, MW-on-K8s, Operations: Decide on logging in k8s for ShellBox - https://phabricator.wikimedia.org/T263545 (crusnov) p:Triage→Medium
[19:38:47] serviceops, Operations, ops-eqsin: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (RobH) Ok, export of the SEL (have to clear it to run the hw diagnostic or it throws error for errors in SEL) /admin1-> racadm getsel Record: 1 Date/Ti...
[19:39:32] serviceops, Operations, ops-eqsin: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (RobH) Removed system from ganeti cluster via directions on wikitech for extended downtime. will do hw testing on it next.
[20:03:30] serviceops, Operations, ops-eqsin: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (RobH) ` Technical Support will need this information to diagnose the problem. Please record the information below. Service Tag : FLX09X2 Error Code : 2000-0...
[20:05:10] serviceops, Operations, ops-codfw: wtp2005 hardware issue - https://phabricator.wikimedia.org/T257903 (Dzahn) a:Papaul→Dzahn
[20:13:05] serviceops, Operations, ops-codfw: wtp2005 hardware issue - https://phabricator.wikimedia.org/T257903 (Dzahn) >>! In T257903#6307464, @akosiaris wrote: > Yeah I think we can for now.
The replacing hosts have been racked and have the role(insetup) applied so we can take it from here. Thanks! Meanwhil...
[20:14:03] serviceops, Operations, ops-codfw: wtp2005 hardware issue - https://phabricator.wikimedia.org/T257903 (Dzahn) p:Low→Medium
[20:14:39] I was going through my email backlog and realized that https://phabricator.wikimedia.org/T261531 is still outstanding... mutante or someone else, any chance you have some time for it?
[20:15:02] it will probably involve mucking with wiki apache configs so I'm a little out of my area of easy depth
[20:33:34] serviceops, Operations, ops-codfw, Patch-For-Review: wtp2005 hardware issue - https://phabricator.wikimedia.org/T257903 (Dzahn) >>! In T257903#6485866, @Papaul wrote: > I noticed that this server is not present in Icinga and has status "failed" in Netbox. When I tried to run the decom cookbook I...
[20:34:52] serviceops, Operations, ops-codfw, Patch-For-Review: wtp2005 hardware issue - https://phabricator.wikimedia.org/T257903 (Dzahn) Arr.. it still shows up in MediaWiki config: ` Looking for matches in puppetmaster1001.eqiad.wmnet:/var/lib/git/operations/puppet Looking for matches in puppetmaster10...
[20:46:46] serviceops, Operations, ops-codfw, Patch-For-Review: wtp2005 hardware issue - https://phabricator.wikimedia.org/T257903 (Dzahn) ` Found match(es) in the Puppet or mediawiki-config repositories (see above), proceed anyway? Type "done" to proceed > done Scheduling downtime on Icinga server alert100...
[20:52:47] serviceops, Operations, ops-codfw, Patch-For-Review: wtp2005 hardware issue - https://phabricator.wikimedia.org/T257903 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `wtp2005.codfw.wmnet` - wtp2005.codfw.wmnet (**FAIL**) - **Failed downtime host on...
[21:09:26] serviceops, Operations, ops-codfw: decom wtp2005 (was: wtp2005 hardware issue) - https://phabricator.wikimedia.org/T257903 (Dzahn) a:Dzahn→Papaul
[21:14:32] serviceops, Operations, ops-codfw: decom wtp2005 (was: wtp2005 hardware issue) - https://phabricator.wikimedia.org/T257903 (Dzahn) @Papaul This should now be ready for decom. After some initial issue with the DNS removal it is now gone. The only thing left is a MW config change but that can go anytime...
[21:30:35] serviceops, Operations, ops-codfw: decom wtp2005 (was: wtp2005 hardware issue) - https://phabricator.wikimedia.org/T257903 (Papaul) ` [edit interfaces interface-range disabled] member ge-3/0/2 { ... } + member ge-4/0/21; [edit interfaces] - ge-4/0/21 { - description wtp2005; - en...
[21:35:00] serviceops, Operations, ops-codfw: decom wtp2005 (was: wtp2005 hardware issue) - https://phabricator.wikimedia.org/T257903 (Dzahn) >>! In T257903#6489392, @Dzahn wrote: > Arr.. it still shows up in MediaWiki config: removal added to today's evening deploy window (formerly SWAT) https://wikitech.wik...
[21:37:20] serviceops, Operations, Push-Notification-Service, Product-Infrastructure-Team-Backlog (Kanban): Timeout when fetching OAuth token - https://phabricator.wikimedia.org/T263695 (Mholloway)
[21:39:29] serviceops, Operations, Push-Notification-Service, Product-Infrastructure-Team-Backlog (Kanban): Timeout when fetching Google OAuth2 token for FCM - https://phabricator.wikimedia.org/T263695 (Mholloway)
[21:42:09] serviceops, Operations, ops-codfw: decom wtp2005 (was: wtp2005 hardware issue) - https://phabricator.wikimedia.org/T257903 (Papaul) Removed mgmt DNS. What's left is just to remove the disk from the server and unrack it.
[21:45:00] serviceops, Operations, ops-eqsin: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 (RobH) I've created SR1037478758 to dispatch a replacement mainboard. I'll open an inbound shipment ticket with SG3 once I get notification of the shipment,...
[21:55:14] serviceops, Operations, Push-Notification-Service, Product-Infrastructure-Team-Backlog (Kanban): Timeout when fetching Google OAuth2 token for FCM - https://phabricator.wikimedia.org/T263695 (Mholloway) Outgoing requests to APNS via url-downloader are working fine, so I suspect the problem is in...
[22:48:32] serviceops, Operations, Push-Notification-Service, Product-Infrastructure-Team-Backlog (Kanban): Timeout when fetching Google OAuth2 token for FCM - https://phabricator.wikimedia.org/T263695 (Mholloway) The http.Agent with the proxy setup has to be passed to the Credential as well as to AppOption...
[23:02:01] serviceops, Operations, Push-Notification-Service, Patch-For-Review, Product-Infrastructure-Team-Backlog (Kanban): Timeout when fetching Google OAuth2 token for FCM - https://phabricator.wikimedia.org/T263695 (Mholloway) a:Mholloway
[23:58:13] serviceops, Operations, Push-Notification-Service, Patch-For-Review, Product-Infrastructure-Team-Backlog (Kanban): Timeout when fetching Google OAuth2 token for FCM - https://phabricator.wikimedia.org/T263695 (Mholloway) Open→Resolved Messages are now getting through to FCM.