[04:50:10] serviceops, Operations, Core Platform Team Backlog (Later), Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (KartikMistry) >>! In T210704#5277211, @Joe wrote: > @KartikMistry if we trigger a rebuild of the production container, it s...
[08:33:25] <_joe_> fsero: yesterday I rebuilt a few images
[08:33:35] <_joe_> the base images and the nodejs10-slim one
[08:33:44] <_joe_> and they're not present on the eqiad registries
[08:33:58] <_joe_> so I guess the replication is broken?
[08:34:13] <_joe_> how can I check what's wrong?
[08:35:53] They are not yet present?
[08:36:26] <_joe_> yup, not present
[08:36:30] <_joe_> see the icinga alerts too
[08:36:34] You can use swift cli to check the number of objects from each side and also it would output if there is an error on replication
[08:37:45] Let me check
[08:41:29] <_joe_> if you have a script, please add it to the runbook
[08:50:03] serviceops, TechCom-RFC (TechCom-Approved): RfC: Standards for external services in the Wikimedia infrastructure. - https://phabricator.wikimedia.org/T208524 (daniel) Is this now documented somewhere on mediawiki.org? I don't see it linked from .
[09:05:13] if we have the data per "type of objects" we could add a filter to this dashboard https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1
[09:13:47] serviceops, Thumbor, observability, User-jijiki: thumbor haproxy trying to send syslog on wrong port - https://phabricator.wikimedia.org/T225284 (jcrespo) It would have been nice to be subscribed or notified of this.
[09:14:30] serviceops, Thumbor, observability, User-jijiki: thumbor haproxy trying to send syslog on wrong port - https://phabricator.wikimedia.org/T225284 (jijiki) @jcrespo I coordinated with @Marostegui for this :)
[09:18:01] mmm replication has worked according to swift cli
[09:18:14] but content has been altered somehow
[09:18:28] I'll add the commands to the runbook _joe_
[09:18:36] <_joe_> thanks fsero
[09:18:48] I'm still digging into it
[09:19:15] <_joe_> yeah that seems strange indeed
[09:27:52] serviceops: Investigate outgoing discarded packets in the codfw kubernetes cluster - https://phabricator.wikimedia.org/T226237 (akosiaris) Some information in P8652
[09:34:22] serviceops, Thumbor, observability, User-jijiki: thumbor haproxy trying to send syslog on wrong port - https://phabricator.wikimedia.org/T225284 (jijiki) Open→Resolved
[09:43:32] <_joe_> can someone look at kubernetes2001? there is an icinga alert about a failed systemd unit
[09:46:12] I trust alex knows already
[09:47:06] _joe_: https://phabricator.wikimedia.org/T226237
[09:47:08] ignore it
[09:47:23] * akosiaris waist deep in looking into netfilter
[09:57:25] <_joe_> sounds nice
[10:02:19] _joe_: I manually fixed it, replication didn't create this object: Object 'docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/0e/0edd5f8fed2b780f0d4fec2bf857ae3fb0ce656e3fb7b36c824144a789d3222a/data' not found
[10:02:56] I'll dig into the swift logs later to find out what happened; in the meantime I'll extend the runbook with things you can do to debug these issues
[10:03:08] however, as a rule of thumb, in case of failure republish
[10:03:09] :P
[10:04:02] <_joe_> ok
[10:42:47] https://www.cipher-it.co.uk/wp-content/uploads/2017/11/ITCrow.jpg :-P
[11:30:22] serviceops: Investigate outgoing discarded packets in the codfw kubernetes cluster - https://phabricator.wikimedia.org/T226237 (akosiaris) Using dropwatch I get ` akosiaris@kubernetes2001:~$ sudo dropwatch -l kas Initalizing kallsyms db dropwatch> start Enabling monitoring... Kernel monitoring activated. Is...
[11:31:15] serviceops: Investigate outgoing discarded packets in the codfw kubernetes cluster - https://phabricator.wikimedia.org/T226237 (akosiaris) Merging in as in P8652 Trying to figure out what the hell is the reason those icmp redirects get discarded https://grafana.wikimedia.org/d/PRA2F67Zz/t226237?orgId=1 Add...
[11:57:28] serviceops: Investigate outgoing discarded packets in the codfw kubernetes cluster - https://phabricator.wikimedia.org/T226237 (akosiaris) Using perf record also leads to the same conclusion as dropwatch for where the packets are dropped/discarded. ` $ sudo perf record -g -a -e skb:kfree_skb $ sudo per scri...
[12:16:01] hi all, yesterday while rebooting the conf servers I noticed that etcdmirror was enabled and thus started on boot. I have created the following CR, if people could look at it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/518960
[12:16:14] _joe_: ^^^ hopefully you didn't already start fixing this
[12:16:41] <_joe_> jbond42: oh thanks a lot
[12:16:51] <_joe_> It was on my radar but def not for today
[12:17:16] np, I'm on clinic so it was a good task to get out the door
[13:36:04] I followed the track of the missing file; somehow the swift backend decided that the file was already there while it wasn't
[13:36:16] https://www.irccloud.com/pastebin/e1MT37JV/
[13:37:14] <_joe_> are we relying on swift replication for mediawiki too?
[13:37:21] I think so
[13:38:19] <_joe_> I didn't think so until now
[13:38:29] <_joe_> the thought terrifies me :P
[13:38:39] <_joe_> godog: maybe you know?
[13:38:59] <_joe_> IIRC we weren't using swift replication for originals, right
[13:41:11] no we aren't, mediawiki knows about both swift clusters
[13:45:06] <_joe_> ok, *good*
[13:45:12] <_joe_> given what we've just seen
[13:45:47] serviceops, Release-Engineering-Team-TODO, Scap, Release-Engineering-Team (Deployment services), User-jijiki: Deploy scap 3.10.0-1 - https://phabricator.wikimedia.org/T224915 (jijiki) Open→Resolved a:jijiki
[13:56:55] (please ping/hilight if you need me, I might be reading irc on and off)
[13:57:13] also ETOOMANYCHANNELS
[14:06:14] godog: this should help with replication https://gerrit.wikimedia.org/r/519018 take a look pls :)
[14:07:35] fsero: I will! possibly later today, being swamped with backlog flushing :|
[14:47:14] serviceops: deploy CoreDNS as a in-cluster DNS service - https://phabricator.wikimedia.org/T226516 (fsero)
[14:59:06] serviceops, Operations, WMDE-QWERTY-Team, wikidiff2, and 2 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (jijiki) p:High→Normal @thiemowmde @WMDE-Fisch I have installed php-wikidiff2_1.8.2-1~wmf1_amd64 on deployment-mediawi...
[15:01:21] serviceops, Operations, WMDE-QWERTY-Team, wikidiff2, and 2 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (jijiki)
[15:27:29] With CoreDNS release 1.2.0, you'll need to migrate existing CoreDNS related data (if any) on your etcd server to etcdv3 API
[15:27:32] damn
[16:16:23] serviceops, Operations, WMDE-QWERTY-Team, wikidiff2, and 2 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (awight) Smoke testing the diffs shows that nothing seems to have been broken by the upgrade. We haven't been able to verify...
[17:29:48] serviceops, Operations, Release Pipeline, Core Platform Team (RESTBase Split (CDP2)), and 5 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (mobrovac)
[17:35:58] serviceops, Operations, Core Platform Team Backlog (Watching / External), Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (jijiki)
[19:38:35] serviceops, Operations, Core Platform Team Backlog (Watching / External), Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (jijiki)