[04:50:10] <wikibugs>	 serviceops, Operations, Core Platform Team Backlog (Later), Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (KartikMistry) >>! In T210704#5277211, @Joe wrote: > @KartikMistry if we trigger a rebuild of the production container, it s...
[08:33:25] <_joe_>	 fsero: yesterday I rebuilt a few images
[08:33:35] <_joe_>	 the base images and the nodejs10-slim one
[08:33:44] <_joe_>	 and they're not present on the eqiad registries
[08:33:58] <_joe_>	 so I guess the replication is broken?
[08:34:13] <_joe_>	 how can I check what's wrong?
[08:35:53] <fsero>	 They are not yet present?
[08:36:26] <_joe_>	 yup, not present
[08:36:30] <_joe_>	 see the icinga alerts too
[08:36:34] <fsero>	 You can use the swift CLI to check the number of objects on each side; it would also report any replication error
[08:37:45] <fsero>	 Let me check
[08:41:29] <_joe_>	 if you have a script, please add it to the runbook
[08:50:03] <wikibugs>	 serviceops, TechCom-RFC (TechCom-Approved): RfC: Standards for external services in the Wikimedia infrastructure. - https://phabricator.wikimedia.org/T208524 (daniel) Is this now documented somewhere on mediawiki.org? I don't see it linked from <https://www.mediawiki.org/wiki/Development_policy>.
[09:05:13] <volans>	 if we have the data per "type of objects" we could add a filter to this dashboard https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1
[09:13:47] <wikibugs>	 serviceops, Thumbor, observability, User-jijiki: thumbor haproxy trying to send syslog on wrong port - https://phabricator.wikimedia.org/T225284 (jcrespo) It would have been nice to be subscribed or notified of this.
[09:14:30] <wikibugs>	 serviceops, Thumbor, observability, User-jijiki: thumbor haproxy trying to send syslog on wrong port - https://phabricator.wikimedia.org/T225284 (jijiki) @jcrespo I coordinated with @Marostegui for this :)
[09:18:01] <fsero>	 mmm replication has worked according to swift cli 
[09:18:14] <fsero>	 but content has been altered somehow
[09:18:28] <fsero>	 I'll add the commands to the runbook _joe_
[09:18:36] <_joe_>	 thanks fsero
[09:18:48] <fsero>	 im still digging into it
[09:19:15] <_joe_>	 yeah that seems strange indeed
[09:27:52] <wikibugs>	 serviceops: Investigate outgoing discarded packets in the codfw kubernetes cluster - https://phabricator.wikimedia.org/T226237 (akosiaris) Some information in P8652
[09:34:22] <wikibugs>	 serviceops, Thumbor, observability, User-jijiki: thumbor haproxy trying to send syslog on wrong port - https://phabricator.wikimedia.org/T225284 (jijiki) Open→Resolved
[09:43:32] <_joe_>	 can someone look at kubernetes2001? there is an icinga alert about a failed systemd unit
[09:46:12] <jijiki>	 I trust alex knows already 
[09:47:06] <akosiaris>	 _joe_: https://phabricator.wikimedia.org/T226237
[09:47:08] <akosiaris>	 ignore it
[09:47:23] * akosiaris waist deep in looking into netfilter
[09:57:25] <_joe_>	 sounds nice
[10:02:19] <fsero>	 _joe_: I manually fixed it; replication didn't create this object: Object 'docker_registry_eqiad/files/docker/registry/v2/blobs/sha256/0e/0edd5f8fed2b780f0d4fec2bf857ae3fb0ce656e3fb7b36c824144a789d3222a/data' not found
[10:02:56] <fsero>	 I'll dig into the swift logs later to find out what happened; in the meantime I'll extend the runbook with things you can do to debug these issues
[10:03:08] <fsero>	 however, as a rule of thumb, in case of failure republish
[10:03:09] <fsero>	 :P
[10:04:02] <_joe_>	 ok
[10:42:47] <volans>	 https://www.cipher-it.co.uk/wp-content/uploads/2017/11/ITCrow.jpg :-P
[11:30:22] <wikibugs>	 serviceops: Investigate outgoing discarded packets in the codfw kubernetes cluster - https://phabricator.wikimedia.org/T226237 (akosiaris) Using dropwatch I get  ` akosiaris@kubernetes2001:~$ sudo dropwatch -l kas Initalizing kallsyms db dropwatch> start Enabling monitoring... Kernel monitoring activated. Is...
[11:31:15] <wikibugs>	 serviceops: Investigate outgoing discarded packets in the codfw kubernetes cluster - https://phabricator.wikimedia.org/T226237 (akosiaris) Merging in as in P8652  Trying to figure out what the hell is the reason those icmp redirects get discarded https://grafana.wikimedia.org/d/PRA2F67Zz/t226237?orgId=1  Add...
[11:57:28] <wikibugs>	 serviceops: Investigate outgoing discarded packets in the codfw kubernetes cluster - https://phabricator.wikimedia.org/T226237 (akosiaris) Using perf record also leads to the same conclusion as dropwatch for where the packets are dropped/discarded.  ` $ sudo perf record -g -a -e skb:kfree_skb $ sudo per scri...
[12:16:01] <jbond42>	 hi all, yesterday while rebooting the conf servers I noticed that etcdmirror was enabled and thus started on boot. I have created the following CR, if people could look at it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/518960
[12:16:14] <jbond42>	 _joe_: ^^^ hopefully you didn't already start fixing this
[12:16:41] <_joe_>	 jbond42: oh thanks a lot
[12:16:51] <_joe_>	 It was on my radar but def not for today
[12:17:16] <jbond42>	 np, I'm on clinic so it was a good task to get out the door
[13:36:04] <fsero>	 I tracked down the missing file: somehow the swift backend decided that the file was already there when it wasn't
[13:36:16] <fsero>	 https://www.irccloud.com/pastebin/e1MT37JV/
[13:37:14] <_joe_>	 are we relying on swift replication for mediawiki too?
[13:37:21] <fsero>	 i think so
[13:38:19] <_joe_>	 I didn't think so until now
[13:38:29] <_joe_>	 the thought terrifies me :P
[13:38:39] <_joe_>	 godog: maybe you know?
[13:38:59] <_joe_>	 IIRC we weren't using swift replication for originals, right
[13:41:11] <godog>	 no we aren't, mediawiki knows about both swift clusters
[13:45:06] <_joe_>	 ok, *good*
[13:45:12] <_joe_>	 given what we've just seen
[13:45:47] <wikibugs>	 serviceops, Release-Engineering-Team-TODO, Scap, Release-Engineering-Team (Deployment services), User-jijiki: Deploy scap 3.10.0-1 - https://phabricator.wikimedia.org/T224915 (jijiki) Open→Resolved a:jijiki
[13:56:55] <godog>	 (please ping/hilight if you need me, I might be reading irc on and off)
[13:57:13] <godog>	 also ETOOMANYCHANNELS
[14:06:14] <fsero>	 godog: this should help with replication https://gerrit.wikimedia.org/r/519018 take a look pls :) 
[14:07:35] <godog>	 fsero: I will! possibly later today, being swamped with backlog flushing :|
[14:47:14] <wikibugs>	 serviceops: deploy CoreDNS as a in-cluster DNS service - https://phabricator.wikimedia.org/T226516 (fsero)
[14:59:06] <wikibugs>	 serviceops, Operations, WMDE-QWERTY-Team, wikidiff2, and 2 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (jijiki) p:High→Normal @thiemowmde @WMDE-Fisch I have installed php-wikidiff2_1.8.2-1~wmf1_amd64 on deployment-mediawi...
[15:01:21] <wikibugs>	 serviceops, Operations, WMDE-QWERTY-Team, wikidiff2, and 2 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (jijiki)
[15:27:29] <fsero>	 With CoreDNS release 1.2.0, you'll need to migrate existing CoreDNS related data (if any) on your etcd server to etcdv3 API
[15:27:32] <fsero>	 damn
[16:16:23] <wikibugs>	 serviceops, Operations, WMDE-QWERTY-Team, wikidiff2, and 2 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (awight) Smoke testing the diffs shows that nothing seems to have been broken by the upgrade.  We haven't been able to verify...
[17:29:48] <wikibugs>	 serviceops, Operations, Release Pipeline, Core Platform Team (RESTBase Split (CDP2)), and 5 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (mobrovac)
[17:35:58] <wikibugs>	 serviceops, Operations, Core Platform Team Backlog (Watching / External), Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (jijiki)
[19:38:35] <wikibugs>	 serviceops, Operations, Core Platform Team Backlog (Watching / External), Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (jijiki)