[00:03:31] (03CR) 10Alex Monk: "is there a task about this? has it passed security review?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270441 (owner: 10Yurik)
[00:04:33] (03PS8) 10Krinkle: Set $wgResourceBasePath to "/w" for group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268715 (https://phabricator.wikimedia.org/T99096)
[00:04:37] (03PS3) 10Yurik: Enable Kartographer ext in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270441 (https://phabricator.wikimedia.org/T114820)
[00:05:29] Krenair, ^
[00:05:34] updated bug
[00:05:48] Krenair, it was blocked on the sec review until about an hour ago
[00:06:17] (03PS1) 10Krinkle: Set $wgResourceBasePath to "/w" for beta cluster wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270446 (https://phabricator.wikimedia.org/T99096)
[00:06:25] aha, that would explain why I was not aware of it
[00:06:30] ok
[00:06:37] Krenair, can you +2 it pls
[00:06:56] we get to play with it over the weekend :)
[00:07:05] and continue to update it
[00:07:49] you want me to deploy a new extension at midnight on saturday morning?
[00:08:30] PROBLEM - puppet last run on mw2007 is CRITICAL: CRITICAL: puppet fail
[00:09:24] Krenair, only to labs :)))
[00:09:31] greg-g?
[00:09:38] i don't want it public, no way :)
[00:09:41] i mean produciton
[00:09:51] ?
[00:09:52] I kind of had plans other than extension deployments
[00:10:05] !log ruthenium - restarting parsoid, now works out of /srv/
[00:10:06] Krenair, do you actually have to do anything manually?
[00:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:10:23] greg-g, is it ok to do a *labs-only* ext depl
[00:10:52] (03CR) 10Dzahn: "nginx works, parsoid runs out of /srv, checked ruthenium" [puppet] - 10https://gerrit.wikimedia.org/r/270442 (owner: 10Dzahn)
[00:10:59] yurik: if you are comfortable doing it yourself, it's not something I want to get in the habit of, this is bad timing (4pm on a Friday where many people will be leaving early)
[00:11:43] greg-g, i won't sync it - if labs picks it up and enables, great, if not, i will wait until monday
[00:11:52] uh, no
[00:11:59] (03CR) 10Dzahn: "on ruthenium, stop/start parsoid service after this.confirmed it runs out of /srv, then deleted /usr/lib/parsoid/" [puppet] - 10https://gerrit.wikimedia.org/r/269606 (owner: 10Dzahn)
[00:12:12] hashar was saying something about the process being automated
[00:12:13] we don't allow diffs to not be sync'd for config changes in prod, if it's on tin it has to be deployed
[00:12:26] unless I misunderstood you
[00:12:45] greg-g, i could do it, but i would rather observe the policy of not deploying on friday, even if its labs-only files
[00:12:57] yurik: right, so how about you do it all on Monday
[00:13:08] meh, ok :)
[00:13:10] people are so discomfort with your plans here, please respect that
[00:13:12] thanks
[00:13:13] :)
[00:13:28] discomfort with my plans? :/
[00:13:39] we require changes on prod to be synced
[00:13:41] even if labs
[00:13:49] (03CR) 10Krinkle: [C: 032] Set $wgResourceBasePath to "/w" for beta cluster wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270446 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle)
[00:13:58] we forget sometime. But over a week end that is a nono
[00:14:11] is Krinkle deploying something? ^
[00:14:11] specially for ops, they will be left wondering what to do with the patch
[00:14:15] i am so confused
[00:14:20] so just hold and do it on monday morning
[00:14:35] might want to have the patch reviewed anyway. That is a good preparation for prod work
[00:14:37] * Krinkle reading scroll back
[00:14:47] (03Merged) 10jenkins-bot: Set $wgResourceBasePath to "/w" for beta cluster wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270446 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle)
[00:15:03] Krinkle, you just merged the mw-config, which greg-g and hashar were saying is not very nice to merge to on friday
[00:15:25] because deployments to labs must be matched with the sync to prod
[00:15:42] anyway off
[00:15:52] hashar, thx, chat monday )
[00:15:54] jenkins will auto-deploy master on beta, right?
[00:15:55] too late
[00:15:55] have a good weekend has..
[00:15:58] har...
[00:15:59] heh
[00:16:20] Krinkle, yes, but greg-g said not to do it because mw-config has to be always in sync with prod
[00:17:52] greg-g, i think we should make a rule that its ok to merge a mw-config if you only touch labs files, without syncing... i know it has been raised before as an issue, but this way we don't block people from experimenting... but maybe its not worth it
[00:18:17] it's annoying, yeah
[00:18:18] or maybe we could make a auto-sync script that pushes stuff to produciton automatically, just like puppets, if all files
[00:18:26] are in labs
[00:18:57] i mean if all files in a patch belong to a whitelist
[00:19:02] Hm.. slightly confused. mediawiki-config has policy of merging=pulling on tin, and pulling on tin=merging.
[00:19:21] So by that logic, we can't do anything on beta because beta files are in mediawiki-config
[00:19:34] Krinkle, that's what i have been hearing from greg-g
[00:19:43] but we also merge stuff in master for mediawiki-core and other repos on Friday (aka "work")
[00:19:45] which i'm not very happy about either
[00:19:49] and some thigns require config changes.
[00:19:55] we should be able to fix beta on a friday
[00:20:09] (all of which also go to beta)
[00:20:14] we can fix beta on Fridays, but new extensions isn't fixing
[00:20:18] Krinkle, mediawiki-config is supposedly a special repo because it must always match whats in production
[00:20:21] fixing beta is good enough a reason to pull on tin
[00:20:25] if we don't want config changes to beta on Friday, we need to disable beta-update on Friday.
[00:20:41] Krinkle: you're being a little too extreme in your interpretation
[00:20:57] (so that master commit to code repos don't go there either)
[00:21:04] Right. I see.
[00:21:07] fixing is a vague term - it could be an improvement or it could be "totally broken unless this change is made"
[00:21:17] Not a freeze, just no regular changes.
[00:21:20] the uncomfortableness was the new extension and people not being ready to help fix if it breaks
[00:21:39] it being the end of a very long/emotionally draining week at 4:10 pacific
[00:21:40] greg-g, by breaking, you mean in labs on in prod?
[00:21:47] beta cluster
[00:21:55] we don't want a broken beta cluster all weekend, either :)
[00:22:02] true that :)
[00:22:05] is toollabs separate?
[00:22:10] i think so
[00:22:38] yeah, tool labs is it's own beast
[00:22:42] thank goodness
[00:23:30] in any case, no point in pushing it out, just trying to clarify this. So apparently its ok to selfmerge mw-config without dir-syncing it to all of production if its labs files only
[00:23:46] unless its a major major change like new ext
[00:23:57] greg-g, ?
[00:24:12] no, it's OK to deploy beta cluster-only changes to production (to keep icinga happy)
[00:24:27] but, this specific situation was touchy because of...
[00:24:28] 00:21 < greg-g> the uncomfortableness was the new extension and people not being ready to help fix if it breaks
[00:24:31] 00:21 < greg-g> it being the end of a very long/emotionally draining week at 4:10 pacific
[00:24:42] yes yes , i totally understand that point :)
[00:24:49] as i said, just trying to understand the rules :)
[00:24:55] the icinga check is there to keep other deployers happy
[00:25:10] and ops, they'd be pissed if it was complaining all weekend :)
[00:25:16] +1
[00:25:16] does icinga make sure that tin is in sync with prod, or in sync with git master?
[00:25:21] incinga check compares tin against git?
[00:25:28] (or mira)
[00:25:29] yurik: option 1
[00:25:30] hehe
[00:25:50] k
[00:25:51] greg-g, does it have some timeout of half an hour?
[00:26:02] not that I know of
[00:26:03] Should I sync-file my labs change to prod?
[00:26:06] because otherwise it would start complaining the moment i git pull
[00:26:11] Right
[00:26:19] yurik: oh, right, yeah, misinterpreted timeout
[00:26:20] It's okay now since the git remote hasn't been updated yet
[00:26:28] ok. I'l sync then
[00:26:59] greg-g, but if it only checks that tin is in sync with production, +2 does not trigger alarms
[00:27:35] only git pull without dir-sync does
[00:27:44] right, but then the next time someone wants to deploy they'll be confused by this unrelated change being pulled down
[00:27:47] yurik: The check compares tin staging workspace against tin's git-remote perception of Gerrit repo
[00:27:50] actually, the other way around
[00:28:11] !log krinkle@mira Synchronized wmf-config/CommonSettings-labs.php: (no message) (duration: 01m 17s)
[00:28:13] Krinkle, so if i don't git fetch, its all clear
[00:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:28:32] * yurik is scheming a way to get around icingas alarms :)))
[00:28:36] Yeah, but it's a bit unfriendly to leave that behind for someone else to then unknowingly trigger an alarm on a 30min delay
[00:28:54] yes, i get that :)
[00:29:11] ok, i think i have done enough damage here
[00:29:16] off i go to do something productive
[00:29:46] appreciate patiently explaining it all to me
[00:30:21] word
[00:30:25] enjoy your weekend
[00:30:37] thanks, will do.
[00:30:43] you too!
[00:31:10] will do!
[00:32:20] have a good one
[00:33:07] (03PS1) 10Krinkle: Fix-up 88654b3 in beta config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270450
[00:33:20] K, end of a long day.
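
An editor's sketch, not part of the log: the whitelist idea yurik floats above ("if all files in a patch belong to a whitelist") could look roughly like the hypothetical pre-sync check below, which inspects the files touched by the newest mediawiki-config commit and only treats the change as exempt from the "must sync to prod" rule when every file matches a labs-only pattern. The patterns and the /srv/mediawiki-staging path are assumptions for illustration, not actual Wikimedia deployment tooling.

    #!/usr/bin/env python
    # Hypothetical helper: decide whether the newest mediawiki-config commit
    # touches only beta-cluster/labs files. Patterns and repo path are
    # assumptions, not an official whitelist.
    import fnmatch
    import subprocess

    LABS_ONLY_PATTERNS = [
        'wmf-config/*-labs.php',   # assumed naming convention for beta-only config
        'wmf-config/*Labs*.php',
    ]

    def changed_files(repo):
        """List files touched by the most recent commit in the given checkout."""
        out = subprocess.check_output(
            ['git', 'diff', '--name-only', 'HEAD~1', 'HEAD'], cwd=repo)
        return [f for f in out.decode('utf-8').splitlines() if f]

    def labs_only(files):
        """True when every changed file matches a labs-only pattern."""
        return bool(files) and all(
            any(fnmatch.fnmatch(f, p) for p in LABS_ONLY_PATTERNS)
            for f in files)

    if __name__ == '__main__':
        files = changed_files('/srv/mediawiki-staging')  # assumed staging path
        if labs_only(files):
            print('labs-only change: production sync could be deferred')
        else:
            print('non-labs files touched: must sync to production')

Even with such a check, the icinga comparison greg-g and Krinkle describe (tin's staging workspace against its git remote) would still fire, so a deferred sync would also need that alarm to understand the whitelist.
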
[00:33:36] and a long week
[00:33:37] (03CR) 10Krinkle: [C: 032] Fix-up 88654b3 in beta config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270450 (owner: 10Krinkle)
[00:34:24] (03Merged) 10jenkins-bot: Fix-up 88654b3 in beta config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270450 (owner: 10Krinkle)
[00:37:00] RECOVERY - puppet last run on mw2007 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[00:39:05] s/"..pushes stuff to produciton automatically"//g
[00:39:45] 7Blocked-on-Operations, 6operations, 10Deployment-Systems, 6Release-Engineering-Team, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#2024749 (10GWicke)
[00:42:33] !log krinkle@mira Synchronized wmf-config/CommonSettings-labs.php: (no message) (duration: 01m 14s)
[00:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:46:52] 6operations, 10ops-codfw, 10hardware-requests: mw2173 has probably a broken disk, needs substitution and reimaging - https://phabricator.wikimedia.org/T124408#2024751 (10Papaul) a:5Papaul>3Joe re-image complete and sinning puppet cert on mw2173 complete.
[01:00:20] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail
[01:15:47] (03PS4) 10Dzahn: parsoid: move rt/vd roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/269707
[01:18:43] !log omg testing this log feature that logs straight to tickets (T108720)
[01:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:19:00] bd808: you rock!
[01:22:42] (03PS1) 10Dereckson: Remove *.ggpht.com from Wikimedia Commons upload whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270456 (https://phabricator.wikimedia.org/T112500)
[01:27:01] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[01:39:14] (03PS5) 10Dzahn: parsoid: move rt/vd roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/269707
[01:40:11] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0]
[01:43:41] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[01:56:20] PROBLEM - puppet last run on cp3045 is CRITICAL: CRITICAL: puppet fail
[01:57:26] (03PS1) 10MaxSem: Fix GWToolset-related fatal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270459 (https://phabricator.wikimedia.org/T126830)
[01:58:31] greg-g, if we can find someone to review 2 patches, we can deploy ^^^ today to avoid train breaking prod on tuesday
[01:59:01] (03PS6) 10Dzahn: parsoid: move rt/vd roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/269707
[02:08:33] (03CR) 10Dzahn: [C: 032] "noop now http://puppet-compiler.wmflabs.org/1759/" [puppet] - 10https://gerrit.wikimedia.org/r/269707 (owner: 10Dzahn)
[02:09:37] (03CR) 10Dzahn: "noop confirmed on ruthenium" [puppet] - 10https://gerrit.wikimedia.org/r/269707 (owner: 10Dzahn)
[02:10:18] (03PS1) 10Mattflaschen: Add Echo site icons for all of the remaining families. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270460 (https://phabricator.wikimedia.org/T49662)
[02:11:31] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0]
[02:11:48] (03PS2) 10Mattflaschen: Add Echo site icons for all of the remaining families. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270460 (https://phabricator.wikimedia.org/T49662)
[02:12:44] (03CR) 10Dzahn: "@Giuseppe yea, there was some special case where we did not just want to load/unload a module but actually change the module config, defla" [puppet] - 10https://gerrit.wikimedia.org/r/264313 (owner: 10Dzahn)
[02:13:18] (03Abandoned) 10Dzahn: apache: add conf_type "mods" [puppet] - 10https://gerrit.wikimedia.org/r/264313 (owner: 10Dzahn)
[02:13:43] (03PS2) 10Dzahn: phabricator: fix 16 lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/269904
[02:14:35] (03CR) 10Jforrester: "I should protect the files first before we do this. :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270460 (https://phabricator.wikimedia.org/T49662) (owner: 10Mattflaschen)
[02:18:39] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[02:23:10] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[02:29:01] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: puppet fail
[02:57:19] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:37:00] !log (ephemerally) dropping compactor thread count from 10 to 8 on restbase1002.eqiad
[04:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:41:19] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[04:42:40] PROBLEM - puppet last run on mw1044 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:46:39] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0
[05:09:11] RECOVERY - puppet last run on mw1044 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[05:09:38] (03CR) 10BryanDavis: [C: 031] "Should go in 2016-02-15 AM SWAT and stay until 1.27.0-wmf.14 is on all wikis (~2016-02-18 PM SWAT) when the back compat branch should be d" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270459 (https://phabricator.wikimedia.org/T126830) (owner: 10MaxSem)
[05:14:59] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[05:29:01] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0
[05:36:53] PROBLEM - MariaDB disk space on silver is CRITICAL: DISK CRITICAL - free space: / 526 MB (5% inode=62%)
[05:46:59] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[05:53:59] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0
[06:05:03] (03CR) 10BryanDavis: [C: 04-1] "I'm reverting the change that forced this (I84e2ba310c425e2d1db1e7032dc014f88ebef087). The weekend is no time for heroic firefighting in b" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270459 (https://phabricator.wikimedia.org/T126830) (owner: 10MaxSem)
[06:28:04] RECOVERY - MariaDB disk space on silver is OK: DISK OK
[06:29:51] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:07] !log `nodetool stop -- COMPACTION && nodetool cleanup' on restbase1002.eqiad, an abundance of caution (https://phabricator.wikimedia.org/P2612)
[06:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:30:29] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:10] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:19] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:39] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:40] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:20] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:29] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:32:30] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:10] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:48:59] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[06:56:10] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:56:11] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:56:41] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:56:59] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:57:00] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:57:09] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:57:30] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:57:39] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:51] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:58:00] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[07:01:20] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0
[07:20:59] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR
[07:21:09] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR
[07:23:59] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[07:29:10] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0
[07:36:20] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[07:40:21] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[07:40:31] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0
[07:46:49] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0
[07:56:09] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR
[07:56:19] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR
[08:06:49] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[08:06:59] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0
[08:09:39] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[08:15:12] 6operations, 10MediaWiki-Interface, 10Traffic: Purge pages cached with mobile editlinks - https://phabricator.wikimedia.org/T125841#2025077 (10Danny_B) @BBlack: What's the status, please?
[08:23:30] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0
[08:25:50] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0]
[08:34:12] 6operations, 7Availability: Upgrade jobrunners to redis 2.8 - https://phabricator.wikimedia.org/T97909#2025086 (10Joe) p:5Triage>3Low
[09:01:19] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[09:24:41] (03PS1) 10Krinkle: Clean up old "/images" directory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270470
[09:35:16] (03PS1) 10Krinkle: Update mobile config to use /static/images instead of deprecated /images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270471 (https://phabricator.wikimedia.org/T107395)
[09:35:46] (03PS2) 10Krinkle: Clean up old "/images" directory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270470
[09:38:50] (03PS2) 10Krinkle: Update mobile config to use /static/images instead of deprecated /images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270471 (https://phabricator.wikimedia.org/T107395)
[09:41:59] (03PS3) 10Krinkle: Replaces "/images" directory with symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270470
[09:42:30] (03PS4) 10Krinkle: Replaces "/images" directory with symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270470
[10:35:10] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 706
[10:40:10] RECOVERY - check_mysql on db1008 is OK: Uptime: 2142115 Threads: 2 Questions: 14489651 Slow queries: 14388 Opens: 5033 Flush tables: 2 Open tables: 404 Queries per second avg: 6.764 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[11:39:10] (03PS13) 10Gehel: Ship Elasticsearch logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101)
[11:40:20] gerrit has gone down for me.
[11:48:08] wfm
[12:45:03] robla: Can you add an answer to https://phabricator.wikimedia.org/T114322#1961576 ?
[12:59:49] (03PS1) 10Dereckson: Add pt.wikimedia.org in Apache vhosts configuration [puppet] - 10https://gerrit.wikimedia.org/r/270479 (https://phabricator.wikimedia.org/T126832)
[13:01:42] (03CR) 10Luke081515: [C: 031] Add pt.wikimedia.org in Apache vhosts configuration [puppet] - 10https://gerrit.wikimedia.org/r/270479 (https://phabricator.wikimedia.org/T126832) (owner: 10Dereckson)
[13:07:50] (03CR) 10Dereckson: [C: 04-1] "Planning changes: pt.wikimedia is already defined in our Apache configuration, there is also a redirection to remove" [puppet] - 10https://gerrit.wikimedia.org/r/270479 (https://phabricator.wikimedia.org/T126832) (owner: 10Dereckson)
[13:14:06] (03PS2) 10Dereckson: Apache configuration for pt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/270479 (https://phabricator.wikimedia.org/T126832)
[13:14:48] (03CR) 10Dereckson: "PS2: removed redirect" [puppet] - 10https://gerrit.wikimedia.org/r/270479 (https://phabricator.wikimedia.org/T126832) (owner: 10Dereckson)
[13:17:01] (03PS1) 10Dereckson: RESTBase configuration for pt.wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/270481 (https://phabricator.wikimedia.org/T126832)
[13:36:29] (03PS1) 10Dereckson: Remove Wikimedia Foundation English blog from cs.planet [puppet] - 10https://gerrit.wikimedia.org/r/270483
[14:07:54] (03CR) 10Nemo bis: [C: 031] "Language should be respected" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/270483 (owner: 10Dereckson)
[14:18:24] (03PS1) 10Dereckson: Remove 404 face from wikipediste cs.planet entry [puppet] - 10https://gerrit.wikimedia.org/r/270487
[14:18:44] (03CR) 10Dereckson: Remove Wikimedia Foundation English blog from cs.planet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/270483 (owner: 10Dereckson)
[14:26:53] ./go legoktm
[14:26:57] nope!
[14:27:06] sorry lego.ktm
[14:31:34] (03CR) 10Nemo bis: [C: 031] Remove 404 face from wikipediste cs.planet entry [puppet] - 10https://gerrit.wikimedia.org/r/270487 (owner: 10Dereckson)
[15:34:54] 6operations, 10Wikimedia-Site-Requests, 7I18n, 7Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#2025559 (10Danny_B)
[16:00:10] PROBLEM - puppet last run on mw2024 is CRITICAL: CRITICAL: puppet fail
[16:28:19] RECOVERY - puppet last run on mw2024 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[16:45:39] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1538 bytes in 0.147 second response time
[16:57:01] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1549 bytes in 0.234 second response time
[17:29:14] Krenair: why did you remove my assignment from 126872?
[17:29:49] Danny_B, not ready for implementation
[17:30:13] should be considered by others before you do it
[17:33:11] Krenair: well, i expected to wait for the result of the discussion of course
[17:57:19] robla: I got still one question ;): Which type of project do you want? component? (https://www.mediawiki.org/wiki/Phabricator/Project_management#Types_of_Projects)
[17:58:38] * robla looks
[17:59:40] Luke081515: seems like a tag to me
[18:00:31] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 60.87% of data above the critical threshold [5000000.0]
[18:00:34] robla: OK, I will add this to the task.
[18:01:10] thanks!
[18:11:09] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[18:44:06] robla: You can use the tag now
[18:48:34] * robla looks
[18:54:32] Luke081515: thanks! I'm starting to tag upcoming meetings: https://phabricator.wikimedia.org/calendar/query/wiuSD6vCKuhA/
[19:19:02] akosiaris: are you around by any chance ?
[19:28:39] PROBLEM - Host fdb2001 is DOWN: PING CRITICAL - Packet loss = 100%
[19:29:16] eh, fdb?
[19:30:09] PROBLEM - Host heka is DOWN: PING CRITICAL - Packet loss = 100%
[19:30:26] ah, fundraising again
[19:30:37] I think it is only codfw
[19:31:32] yes, codfw frack
[19:31:53] likely the pfw again
[19:32:01] PROBLEM - Host payments2003 is DOWN: PING CRITICAL - Packet loss = 100%
[19:32:08] PROBLEM - Host payments2001 is DOWN: PING CRITICAL - Packet loss = 100%
[19:32:18] aye, seems likely
[19:32:58] so: not public facing etc
[19:33:18] PROBLEM - Host pay-lvs2001 is DOWN: PING CRITICAL - Packet loss = 100%
[19:33:20] no user notice now, I checked it
[19:33:25] PROBLEM - Host saiph is DOWN: PING CRITICAL - Packet loss = 100%
[19:33:28] told -fundraising
[19:33:32] PROBLEM - Host alnilam is DOWN: PING CRITICAL - Packet loss = 100%
[19:33:34] Same stuff again i guess
[19:34:07] mutante: I am working with a contrator for the wmf to set up a django app in labs
[19:34:10] PROBLEM - Host rigel is DOWN: PING CRITICAL - Packet loss = 100%
[19:34:32] any recommandtions for setting it up before it is production grade ?
[19:35:01] (py2/3? mysql/postgres?/packages/pip) etc
[19:35:10] no pip
[19:35:10] RECOVERY - Host fdb2001 is UP: PING OK - Packet loss = 0%, RTA = 38.65 ms
[19:35:13] packages and not pip
[19:35:21] RECOVERY - Host heka is UP: PING OK - Packet loss = 0%, RTA = 36.75 ms
[19:35:21] dont mix packages and pip
[19:35:28] python2.(6,7) is what we have on the cluster
[19:35:28] RECOVERY - Host saiph is UP: PING OK - Packet loss = 0%, RTA = 36.51 ms
[19:35:35] RECOVERY - Host rigel is UP: PING OK - Packet loss = 0%, RTA = 36.49 ms
[19:35:40] might be all 7 now, depends on what precise has
[19:35:41] mysql/postgres we use both in some place
[19:35:42] RECOVERY - Host pay-lvs2001 is UP: PING OK - Packet loss = 0%, RTA = 36.54 ms
[19:35:49] preference for mysql however
[19:35:51] RECOVERY - Host alnilam is UP: PING OK - Packet loss = 0%, RTA = 36.73 ms
[19:35:53] yes
[19:35:57] thanks both
[19:35:59] RECOVERY - Host payments2003 is UP: PING OK - Packet loss = 0%, RTA = 36.60 ms
[19:36:00] yw
[19:36:03] and they're back
[19:36:06] RECOVERY - Host payments2001 is UP: PING OK - Packet loss = 0%, RTA = 36.58 ms
[19:36:35] hey Jeff_Green
[19:37:05] you saw them I guess: all recovered now
[19:39:14] yup. just network gear tripping up, thanks for the heads up
[19:39:34] yeah I figured more of the same
[19:40:14] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: puppet fail
[19:40:34] yeahyeah
[19:45:14] PROBLEM - check_puppetrun on saiph is CRITICAL: CRITICAL: Puppet has 16 failures
[19:45:14] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: puppet fail
[19:45:14] RECOVERY - check_puppetrun on payments2002 is OK: OK: Puppet is currently enabled, last run 132 seconds ago with 0 failures
[19:50:14] PROBLEM - check_puppetrun on saiph is CRITICAL: CRITICAL: Puppet has 16 failures
[19:50:14] RECOVERY - check_puppetrun on payments2003 is OK: OK: Puppet is currently enabled, last run 165 seconds ago with 0 failures
[19:55:14] RECOVERY - check_puppetrun on saiph is OK: OK: Puppet is currently enabled, last run 284 seconds ago with 0 failures
[20:04:55] !bash < apergos> botshed: the action of kicking bots from a channel when it's overrun by alerts at the same time people are working to fix the problem
[20:04:55] bd808: Stored quip at https://tools.wmflabs.org/bash/quip/AVLcPHU5-0X0Il_jxrC1
[20:59:56] 6operations, 10DBA, 6Labs, 10Labs-Infrastructure: db1069 is running low on space - https://phabricator.wikimedia.org/T124464#2025901 (10jcrespo) 5Open>3Resolved 73% -more could be done, but resolving for now.
[21:22:45] PROBLEM - puppet last run on mc2014 is CRITICAL: CRITICAL: puppet fail
[21:47:03] 6operations, 6Services: Package npm 2.14 - https://phabricator.wikimedia.org/T124474#2025940 (10hashar) The nodejs package from Debian does not ship npm (see file list at https://packages.debian.org/sid/amd64/nodejs/filelist ). Instead, npm is split to its own `npm` package. To get NodeJs 4.2/4.3 we backported...
[21:48:25] RECOVERY - puppet last run on mc2014 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[22:09:06] 6operations, 6Services: Package npm 2.14 - https://phabricator.wikimedia.org/T124474#2025952 (10Paladox)
[22:33:52] legoktm: remember that time when my e-mail just kept being unsubscribed due to multiple message delivery failures (to quote the message)?
[22:33:57] legoktm: well that's happening to me again
[22:36:58] or anyone else, really
[22:37:23] I think I hit the maximum number already
[22:48:02] anyone?
[22:49:14] odder: do you remember what needed to be done to fix it?
[22:49:38] bd808: Not really, some limit needed to be zeroed for my e-mail address
[22:50:49] *nod* sounds like it would need a phab ticket in https://phabricator.wikimedia.org/tag/wikimedia-mailing-lists/
[22:51:08] I don't have the super powers to mess with mailman things
[22:51:28] bd808: Ugh, sorry for not making this clear
[22:51:40] bd808: the issue I am having is I'm trying to send a gazillion e-mails
[22:52:02] and my e-mail address that I've got registered for my account keeps being unsubscribed, so I can't send them
[22:52:09] through the wiki, I mean, Special:EmailUser
[22:52:35] oh, ok
[22:53:15] Reedy: ^ any idea what bits need to be twiddled to get odder back in business?
[22:54:07] Worth having a look in the bouncehandler logs to see why he's being unsubscribed
[22:55:09] my internet connection is massively sucking
[22:55:19] 2016-02-13 22:30:38 mw1168 commonswiki 1.27.0-wmf.13 BounceHandler INFO: Un-subscribed global user Odder for exceeding Bounce Limit 5.
[22:55:31] oh, is it just 5 messages?
[22:55:41] bd808: Dunno what's going on because it's happened to me before
[22:56:33] bd808: but I get talk page notifications and Echo mentions and all the rest just fine
[22:56:53] I wonder if BH needs to be smarter
[22:57:00] bd808: I think it might be an issue with the changed headers, ie. perhaps my mail host doesn't like the way that we mess with headers
[22:57:21] If a person gets sent a lot of emails... They should have a higher unsub threshold
[22:57:30] " 550-sorry, external MTA's and unauthenticated MTU's don't havepermission to send email to this server with a header thatstates the email is from twkozlowski.com."
[22:58:01] Yup
[22:58:24] bd808: do I need to say the help desk is closed shut on a Saturday evening :-P
[23:00:55] so there's not a simple counter. it's a select count() thing
[23:02:01] and the bounce limit is global
[23:04:08] bd808: Now I think they're thinking along the lines that if I get an e-mail that is spoofing my actual e-mail address, then that's spam and should not be delivered
[23:04:13] ie. fear of spam
[23:04:47] dunno if there's a Phabricator ticket for that already
[23:05:30] yes, I think so
[23:05:59] not the DMARC one, buuut.... *tried to find it*
[23:06:33] https://phabricator.wikimedia.org/T99444 is same outcome
[23:06:39] https://phabricator.wikimedia.org/T66795 is about the same issue with dmarc, and does note spf as well
[23:08:05] https://phabricator.wikimedia.org/T118648 same 550
[23:09:48] I guess I'll just whitelist myself for the time being
[23:10:04] so one example I'm looking at in the logs is odder sending a user message. The message bounced so BH tried to forward the bounce response to odder. odder's mail server said woah I don't like seeing messages from odder that didn't start on my servers. That bounce came back the BH.
[23:10:27] repeat 4 more times and boom!
[23:10:40] lol
[23:11:05] bd808: it does look logical from their side :-)
[23:11:30] So one fix would be for the From: on the bounce forward to not spoof as the user themselves
[23:13:13] oh wait, it's not even that convoluted. The message is the "send me a copy" message.
[23:13:51] * odder nods
[23:13:57] so you send to user X and ask to get a copy via settiings. The copy says it is From: you and your mail server says "hell no it's not!"
[23:15:20] Precisely
[23:17:17] root problem is the same though. From: should be something at the wiki, not the user's own email
[23:17:42] I'm not quite sure why we send them from the users email anyway...
[23:17:52] Presuambly it should be the wiki, but the return-to be the users email?
[23:18:02] that would make more sense
[23:18:12] or BCC on the original
[23:18:24] although that might have the same problem
[23:18:25] Maybe use $USERNAME I dunno
[23:18:41] Maybe ask our security guys a bit too
[23:18:50] Cause this sort of issue is just going to get more prevalent
[23:18:51] Oh it can't be BCC because that leaks the other user's email
[23:19:44] Reedy: right. email isn't the wild west it was 10 years ago when this was written I'm sure
[23:22:03] Mmm
[23:23:26] https://github.com/wikimedia/mediawiki/blame/master/includes/specials/SpecialEmailuser.php#L379-L391
[23:24:57] Barely touched in 5 years
[23:25:00] it should use the $wgUserEmailUseReplyTo logic from above
[23:25:03] Nearly 6
[23:27:12] that EmailUserCC hook is completely unused in our github repos
[23:27:36] if it wasn't there the fix would be straigh forward
[23:53:54] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 58.33% of data above the critical threshold [5000000.0]
[23:55:44] (03PS1) 10Dereckson: Don't index NS_USER on cs.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270556 (https://phabricator.wikimedia.org/T125068)
[23:57:25] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0]
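
An editor's sketch of the header change bd808 and Reedy converge on above, written in Python only to illustrate the header layout (the real fix would live in the PHP of SpecialEmailuser.php, following its existing $wgUserEmailUseReplyTo branch): the "send me a copy" mail carries a generic wiki address in From:, so the recipient's mail server no longer sees its own domain spoofed, while Reply-To keeps replies going to the user. All addresses below are placeholders.

    #!/usr/bin/env python
    # Editor's sketch of the proposed header layout for the "send me a copy"
    # mail. Addresses are placeholders; this is not MediaWiki code.
    from email.mime.text import MIMEText

    def build_copy_message(user_addr, wiki_addr, subject, body):
        """Copy of a user's own message, sent from the wiki instead of spoofing the user."""
        msg = MIMEText(body)
        msg['Subject'] = subject
        msg['From'] = wiki_addr        # generic wiki sender, survives the recipient's spoofing checks
        msg['To'] = user_addr          # the user who asked for a copy
        msg['Reply-To'] = user_addr    # replies still reach the user directly
        return msg

    if __name__ == '__main__':
        print(build_copy_message('user@example.org', 'wiki@example.org',
                                 'Copy of your message', 'Hello ...').as_string())

With this layout the 550 "header states the email is from twkozlowski.com" rejection quoted earlier would no longer apply, since From: is now a wiki domain the sending servers are actually authorized for.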