[00:03:31] (03CR) 10Alex Monk: "is there a task about this? has it passed security review?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270441 (owner: 10Yurik)
[00:04:33] (03PS8) 10Krinkle: Set $wgResourceBasePath to "/w" for group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268715 (https://phabricator.wikimedia.org/T99096)
[00:04:37] (03PS3) 10Yurik: Enable Kartographer ext in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270441 (https://phabricator.wikimedia.org/T114820)
[00:05:29] Krenair, ^
[00:05:34] updated bug
[00:05:48] Krenair, it was blocked on the sec review until about an hour ago
[00:06:17] (03PS1) 10Krinkle: Set $wgResourceBasePath to "/w" for beta cluster wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270446 (https://phabricator.wikimedia.org/T99096)
[00:06:25] aha, that would explain why I was not aware of it
[00:06:30] ok
[00:06:37] Krenair, can you +2 it pls
[00:06:56] we get to play with it over the weekend :)
[00:07:05] and continue to update it
[00:07:49] you want me to deploy a new extension at midnight on saturday morning?
[00:08:30] PROBLEM - puppet last run on mw2007 is CRITICAL: CRITICAL: puppet fail
[00:09:24] Krenair, only to labs :)))
[00:09:31] greg-g?
[00:09:38] i don't want it public, no way :)
[00:09:41] i mean produciton
[00:09:51] ?
[00:09:52] I kind of had plans other than extension deployments
[00:10:05] !log ruthenium - restarting parsoid, now works out of /srv/
[00:10:06] Krenair, do you actually have to do anything manually?
[00:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:10:23] greg-g, is it ok to do a *labs-only* ext depl
[00:10:52] (03CR) 10Dzahn: "nginx works, parsoid runs out of /srv, checked ruthenium" [puppet] - 10https://gerrit.wikimedia.org/r/270442 (owner: 10Dzahn)
[00:10:59] yurik: if you are comfortable doing it yourself, it's not something I want to get in the habit of, this is bad timing (4pm on a Friday where many people will be leaving early)
[00:11:43] greg-g, i won't sync it - if labs picks it up and enables, great, if not, i will wait until monday
[00:11:52] uh, no
[00:11:59] (03CR) 10Dzahn: "on ruthenium, stop/start parsoid service after this.confirmed it runs out of /srv, then deleted /usr/lib/parsoid/" [puppet] - 10https://gerrit.wikimedia.org/r/269606 (owner: 10Dzahn)
[00:12:12] hashar was saying something about the process being automated
[00:12:13] we don't allow diffs to not be sync'd for config changes in prod, if it's on tin it has to be deployed
[00:12:26] unless I misunderstood you
[00:12:45] greg-g, i could do it, but i would rather observe the policy of not deploying on friday, even if its labs-only files
[00:12:57] yurik: right, so how about you do it all on Monday
[00:13:08] meh, ok :)
[00:13:10] people are so discomfort with your plans here, please respect that
[00:13:12] thanks
[00:13:13] :)
[00:13:28] discomfort with my plans? :/
[00:13:39] we require changes on prod to be synced
[00:13:41] even if labs
[00:13:49] (03CR) 10Krinkle: [C: 032] Set $wgResourceBasePath to "/w" for beta cluster wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270446 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle)
[00:13:58] we forget sometime. But over a week end that is a nono
[00:14:11] is Krinkle deploying something? ^
[00:14:11] specially for ops, they will be left wondering what to do with the patch
[00:14:15] i am so confused
[00:14:20] so just hold and do it on monday morning
[00:14:35] might want to have the patch reviewed anyway. That is a good preparation for prod work
[00:14:37] * Krinkle reading scroll back
[00:14:47] (03Merged) 10jenkins-bot: Set $wgResourceBasePath to "/w" for beta cluster wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270446 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle)
[00:15:03] Krinkle, you just merged the mw-config, which greg-g and hashar were saying is not very nice to merge to on friday
[00:15:25] because deployments to labs must be matched with the sync to prod
[00:15:42] anyway off
[00:15:52] hashar, thx, chat monday )
[00:15:54] jenkins will auto-deploy master on beta, right?
[00:15:55] too late
[00:15:55] have a good weekend has..
[00:15:58] har...
[00:15:59] heh
[00:16:20] Krinkle, yes, but greg-g said not to do it because mw-config has to be always in sync with prod
[00:17:52] greg-g, i think we should make a rule that its ok to merge a mw-config if you only touch labs files, without syncing... i know it has been raised before as an issue, but this way we don't block people from experimenting... but maybe its not worth it
[00:18:17] it's annoying, yeah
[00:18:18] or maybe we could make a auto-sync script that pushes stuff to produciton automatically, just like puppets, if all files
[00:18:26] are in labs
[00:18:57] i mean if all files in a patch belong to a whitelist
[00:19:02] Hm.. slightly confused. mediawiki-config has policy of merging=pulling on tin, and pulling on tin=merging.
[00:19:21] So by that logic, we can't do anything on beta because beta files are in mediawiki-config
[00:19:34] Krinkle, that's what i have been hearing from greg-g
[00:19:43] but we also merge stuff in master for mediawiki-core and other repos on Friday (aka "work")
[00:19:45] which i'm not very happy about either
[00:19:49] and some thigns require config changes.
[00:19:55] we should be able to fix beta on a friday
[00:20:09] (all of which also go to beta)
[00:20:14] we can fix beta on Fridays, but new extensions isn't fixing
[00:20:18] Krinkle, mediawiki-config is supposedly a special repo because it must always match whats in production
[00:20:21] fixing beta is good enough a reason to pull on tin
[00:20:25] if we don't want config changes to beta on Friday, we need to disable beta-update on Friday.
[00:20:41] Krinkle: you're being a little too extreme in your interpretation
[00:20:57] (so that master commit to code repos don't go there either)
[00:21:04] Right. I see.
[00:21:07] fixing is a vague term - it could be an improvement or it could be "totally broken unless this change is made"
[00:21:17] Not a freeze, just no regular changes.
[00:21:20] the uncomfortableness was the new extension and people not being ready to help fix if it breaks
[00:21:39] it being the end of a very long/emotionally draining week at 4:10 pacific
[00:21:40] greg-g, by breaking, you mean in labs on in prod?
[00:21:47] beta cluster
[00:21:55] we don't want a broken beta cluster all weekend, either :)
[00:22:02] true that :)
[00:22:05] is toollabs separate?
[00:22:10] i think so
[00:22:38] yeah, tool labs is it's own beast
[00:22:42] thank goodness
[00:23:30] in any case, no point in pushing it out, just trying to clarify this. So apparently its ok to selfmerge mw-config without dir-syncing it to all of production if its labs files only
[00:23:46] unless its a major major change like new ext
[00:23:57] greg-g, ?
[00:24:12] no, it's OK to deploy beta cluster-only changes to production (to keep icinga happy)
[00:24:27] but, this specific situation was touchy because of...
[00:24:28] 00:21 < greg-g> the uncomfortableness was the new extension and people not being ready to help fix if it breaks
[00:24:31] 00:21 < greg-g> it being the end of a very long/emotionally draining week at 4:10 pacific
[00:24:42] yes yes , i totally understand that point :)
[00:24:49] as i said, just trying to understand the rules :)
[00:24:55] the icinga check is there to keep other deployers happy
[00:25:10] and ops, they'd be pissed if it was complaining all weekend :)
[00:25:16] +1
[00:25:16] does icinga make sure that tin is in sync with prod, or in sync with git master?
[00:25:21] incinga check compares tin against git?
[00:25:28] (or mira)
[00:25:29] yurik: option 1
[00:25:30] hehe
[00:25:50] k
[00:25:51] greg-g, does it have some timeout of half an hour?
[00:26:02] not that I know of
[00:26:03] Should I sync-file my labs change to prod?
[00:26:06] because otherwise it would start complaining the moment i git pull
[00:26:11] Right
[00:26:19] yurik: oh, right, yeah, misinterpreted timeout
[00:26:20] It's okay now since the git remote hasn't been updated yet
[00:26:28] ok. I'l sync then
[00:26:59] greg-g, but if it only checks that tin is in sync with production, +2 does not trigger alarms
[00:27:35] only git pull without dir-sync does
[00:27:44] right, but then the next time someone wants to deploy they'll be confused by this unrelated change being pulled down
[00:27:47] yurik: The check compares tin staging workspace against tin's git-remote perception of Gerrit repo
[00:27:50] actually, the other way around
[00:28:11] !log krinkle@mira Synchronized wmf-config/CommonSettings-labs.php: (no message) (duration: 01m 17s)
[00:28:13] Krinkle, so if i don't git fetch, its all clear
[00:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:28:32] * yurik is scheming a way to get around icingas alarms :)))
[00:28:36] Yeah, but it's a bit unfriendly to leave that behind for someone else to then unknowingly trigger an alarm on a 30min delay
[00:28:54] yes, i get that :)
[00:29:11] ok, i think i have done enough damage here
[00:29:16] off i go to do something productive
[00:29:46] appreciate patiently explaining it all to me
[00:30:21] word
[00:30:25] enjoy your weekend
[00:30:37] thanks, will do.
[00:30:43] you too!
[00:31:10] will do!
[00:32:20] have a good one
[00:33:07] (03PS1) 10Krinkle: Fix-up 88654b3 in beta config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270450
[00:33:20] K, end of a long day.
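
An editor's sketch, not part of the log: the whitelist idea yurik floats above ("if all files in a patch belong to a whitelist") could look roughly like the hypothetical pre-sync check below, which inspects the files touched by the newest mediawiki-config commit and only treats the change as exempt from the "must sync to prod" rule when every file matches a labs-only pattern. The patterns and the /srv/mediawiki-staging path are assumptions for illustration, not actual Wikimedia deployment tooling.

    #!/usr/bin/env python
    # Hypothetical helper: decide whether the newest mediawiki-config commit
    # touches only beta-cluster/labs files. Patterns and repo path are
    # assumptions, not an official whitelist.
    import fnmatch
    import subprocess

    LABS_ONLY_PATTERNS = [
        'wmf-config/*-labs.php',   # assumed naming convention for beta-only config
        'wmf-config/*Labs*.php',
    ]

    def changed_files(repo):
        """List files touched by the most recent commit in the given checkout."""
        out = subprocess.check_output(
            ['git', 'diff', '--name-only', 'HEAD~1', 'HEAD'], cwd=repo)
        return [f for f in out.decode('utf-8').splitlines() if f]

    def labs_only(files):
        """True when every changed file matches a labs-only pattern."""
        return bool(files) and all(
            any(fnmatch.fnmatch(f, p) for p in LABS_ONLY_PATTERNS)
            for f in files)

    if __name__ == '__main__':
        files = changed_files('/srv/mediawiki-staging')  # assumed staging path
        if labs_only(files):
            print('labs-only change: production sync could be deferred')
        else:
            print('non-labs files touched: must sync to production')

Even with such a check, the icinga comparison greg-g and Krinkle describe (tin's staging workspace against its git remote) would still fire, so a deferred sync would also need that alarm to understand the whitelist.
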
[00:33:36] and a long week
[00:33:37] (03CR) 10Krinkle: [C: 032] Fix-up 88654b3 in beta config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270450 (owner: 10Krinkle)
[00:34:24] (03Merged) 10jenkins-bot: Fix-up 88654b3 in beta config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270450 (owner: 10Krinkle)
[00:37:00] RECOVERY - puppet last run on mw2007 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[00:39:05] s/"..pushes stuff to produciton automatically"//g
[00:39:45] 7Blocked-on-Operations, 6operations, 10Deployment-Systems, 6Release-Engineering-Team, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#2024749 (10GWicke)
[00:42:33] !log krinkle@mira Synchronized wmf-config/CommonSettings-labs.php: (no message) (duration: 01m 14s)
[00:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:46:52] 6operations, 10ops-codfw, 10hardware-requests: mw2173 has probably a broken disk, needs substitution and reimaging - https://phabricator.wikimedia.org/T124408#2024751 (10Papaul) a:5Papaul>3Joe re-image complete and sinning puppet cert on mw2173 complete.
[01:00:20] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail
[01:15:47] (03PS4) 10Dzahn: parsoid: move rt/vd roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/269707
[01:18:43] !log omg testing this log feature that logs straight to tickets (T108720)
[01:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:19:00] bd808: you rock!
[01:22:42] (03PS1) 10Dereckson: Remove *.ggpht.com from Wikimedia Commons upload whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270456 (https://phabricator.wikimedia.org/T112500)
[01:27:01] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[01:39:14] (03PS5) 10Dzahn: parsoid: move rt/vd roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/269707
[01:40:11] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0]
[01:43:41] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[01:56:20] PROBLEM - puppet last run on cp3045 is CRITICAL: CRITICAL: puppet fail
[01:57:26] (03PS1) 10MaxSem: Fix GWToolset-related fatal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270459 (https://phabricator.wikimedia.org/T126830)
[01:58:31] greg-g, if we can find someone to review 2 patches, we can deploy ^^^ today to avoid train breaking prod on tuesday
[01:59:01] (03PS6) 10Dzahn: parsoid: move rt/vd roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/269707
[02:08:33] (03CR) 10Dzahn: [C: 032] "noop now http://puppet-compiler.wmflabs.org/1759/" [puppet] - 10https://gerrit.wikimedia.org/r/269707 (owner: 10Dzahn)
[02:09:37] (03CR) 10Dzahn: "noop confirmed on ruthenium" [puppet] - 10https://gerrit.wikimedia.org/r/269707 (owner: 10Dzahn)
[02:10:18] (03PS1) 10Mattflaschen: Add Echo site icons for all of the remaining families. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270460 (https://phabricator.wikimedia.org/T49662)
[02:11:31] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0]
[02:11:48] (03PS2) 10Mattflaschen: Add Echo site icons for all of the remaining families. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270460 (https://phabricator.wikimedia.org/T49662)
[02:12:44] (03CR) 10Dzahn: "@Giuseppe yea, there was some special case where we did not just want to load/unload a module but actually change the module config, defla" [puppet] - 10https://gerrit.wikimedia.org/r/264313 (owner: 10Dzahn)
[02:13:18] (03Abandoned) 10Dzahn: apache: add conf_type "mods" [puppet] - 10https://gerrit.wikimedia.org/r/264313 (owner: 10Dzahn)
[02:13:43] (03PS2) 10Dzahn: phabricator: fix 16 lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/269904
[02:14:35] (03CR) 10Jforrester: "I should protect the files first before we do this. :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270460 (https://phabricator.wikimedia.org/T49662) (owner: 10Mattflaschen)
[02:18:39] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[02:23:10] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[02:29:01] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: puppet fail
[02:57:19] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:37:00] !log (ephemerally) dropping compactor thread count from 10 to 8 on restbase1002.eqiad
[04:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:41:19] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[04:42:40] PROBLEM - puppet last run on mw1044 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:46:39] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0
[05:09:11] RECOVERY - puppet last run on mw1044 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[05:09:38] (03CR) 10BryanDavis: [C: 031] "Should go in 2016-02-15 AM SWAT and stay until 1.27.0-wmf.14 is on all wikis (~2016-02-18 PM SWAT) when the back compat branch should be d" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270459 (https://phabricator.wikimedia.org/T126830) (owner: 10MaxSem)
[05:14:59] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[05:29:01] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0
[05:36:53] PROBLEM - MariaDB disk space on silver is CRITICAL: DISK CRITICAL - free space: / 526 MB (5% inode=62%)
[05:46:59] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[05:53:59] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0
[06:05:03] (03CR) 10BryanDavis: [C: 04-1] "I'm reverting the change that forced this (I84e2ba310c425e2d1db1e7032dc014f88ebef087). The weekend is no time for heroic firefighting in b" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270459 (https://phabricator.wikimedia.org/T126830) (owner: 10MaxSem)
[06:28:04] RECOVERY - MariaDB disk space on silver is OK: DISK OK
[06:29:51] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:07] !log `nodetool stop -- COMPACTION && nodetool cleanup' on restbase1002.eqiad, an abundance of caution (https://phabricator.wikimedia.org/P2612)
[06:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:30:29] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:10] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:19] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:39] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:40] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:20] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:29] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:32:30] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:10] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:48:59] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[06:56:10] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:56:11] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:56:41] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:56:59] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:57:00] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:57:09] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:57:30] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:57:39] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:51] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:58:00] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[07:01:20] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0
[07:20:59] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR
[07:21:09] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR
[07:23:59] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[07:29:10] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0
[07:36:20] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[07:40:21] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[07:40:31] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0
[07:46:49] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0
[07:56:09] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR
[07:56:19] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR
[08:06:49] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[08:06:59] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0
[08:09:39] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps wave]BR
[08:15:12] 6operations, 10MediaWiki-Interface, 10Traffic: Purge pages cached with mobile editlinks - https://phabricator.wikimedia.org/T125841#2025077 (10Danny_B) @BBlack: What's the status, please?
[08:23:30] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0
[08:25:50] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0]
[08:34:12] 6operations, 7Availability: Upgrade jobrunners to redis 2.8 - https://phabricator.wikimedia.org/T97909#2025086 (10Joe) p:5Triage>3Low
[09:01:19] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[09:24:41] (03PS1) 10Krinkle: Clean up old "/images" directory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270470
[09:35:16] (03PS1) 10Krinkle: Update mobile config to use /static/images instead of deprecated /images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270471 (https://phabricator.wikimedia.org/T107395)
[09:35:46] (03PS2) 10Krinkle: Clean up old "/images" directory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270470
[09:38:50] (03PS2) 10Krinkle: Update mobile config to use /static/images instead of deprecated /images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270471 (https://phabricator.wikimedia.org/T107395)
[09:41:59] (03PS3) 10Krinkle: Replaces "/images" directory with symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270470
[09:42:30] (03PS4) 10Krinkle: Replaces "/images" directory with symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270470
[10:35:10] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 706
[10:40:10] RECOVERY - check_mysql on db1008 is OK: Uptime: 2142115 Threads: 2 Questions: 14489651 Slow queries: 14388 Opens: 5033 Flush tables: 2 Open tables: 404 Queries per second avg: 6.764 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[11:39:10] (03PS13) 10Gehel: Ship Elasticsearch logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/269100 (https://phabricator.wikimedia.org/T109101)
[11:40:20] gerrit has gone down for me.
[11:48:08] wfm
[12:45:03] robla: Can you add an answer to https://phabricator.wikimedia.org/T114322#1961576 ?
[12:59:49] (03PS1) 10Dereckson: Add pt.wikimedia.org in Apache vhosts configuration [puppet] - 10https://gerrit.wikimedia.org/r/270479 (https://phabricator.wikimedia.org/T126832)
[13:01:42] (03CR) 10Luke081515: [C: 031] Add pt.wikimedia.org in Apache vhosts configuration [puppet] - 10https://gerrit.wikimedia.org/r/270479 (https://phabricator.wikimedia.org/T126832) (owner: 10Dereckson)
[13:07:50] (03CR) 10Dereckson: [C: 04-1] "Planning changes: pt.wikimedia is already defined in our Apache configuration, there is also a redirection to remove" [puppet] - 10https://gerrit.wikimedia.org/r/270479 (https://phabricator.wikimedia.org/T126832) (owner: 10Dereckson)
[13:14:06] (03PS2) 10Dereckson: Apache configuration for pt.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/270479 (https://phabricator.wikimedia.org/T126832)
[13:14:48] (03CR) 10Dereckson: "PS2: removed redirect" [puppet] - 10https://gerrit.wikimedia.org/r/270479 (https://phabricator.wikimedia.org/T126832) (owner: 10Dereckson)
[13:17:01] (03PS1) 10Dereckson: RESTBase configuration for pt.wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/270481 (https://phabricator.wikimedia.org/T126832)
[13:36:29] (03PS1) 10Dereckson: Remove Wikimedia Foundation English blog from cs.planet [puppet] - 10https://gerrit.wikimedia.org/r/270483
[14:07:54] (03CR) 10Nemo bis: [C: 031] "Language should be respected" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/270483 (owner: 10Dereckson)
[14:18:24] (03PS1) 10Dereckson: Remove 404 face from wikipediste cs.planet entry [puppet] - 10https://gerrit.wikimedia.org/r/270487
[14:18:44] (03CR) 10Dereckson: Remove Wikimedia Foundation English blog from cs.planet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/270483 (owner: 10Dereckson)
[14:26:53] ./go legoktm
[14:26:57] nope!
[14:27:06] sorry lego.ktm
[14:31:34] (03CR) 10Nemo bis: [C: 031] Remove 404 face from wikipediste cs.planet entry [puppet] - 10https://gerrit.wikimedia.org/r/270487 (owner: 10Dereckson)
[15:34:54] 6operations, 10Wikimedia-Site-Requests, 7I18n, 7Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#2025559 (10Danny_B)
[16:00:10] PROBLEM - puppet last run on mw2024 is CRITICAL: CRITICAL: puppet fail
[16:28:19] RECOVERY - puppet last run on mw2024 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[16:45:39] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1538 bytes in 0.147 second response time
[16:57:01] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1549 bytes in 0.234 second response time
[17:29:14] Krenair: why did you remove my assignment from 126872?
[17:29:49] Danny_B, not ready for implementation
[17:30:13] should be considered by others before you do it
[17:33:11] Krenair: well, i expected to wait for the result of the discussion of course
[17:57:19] robla: I got still one question ;): Which type of project do you want? component? (https://www.mediawiki.org/wiki/Phabricator/Project_management#Types_of_Projects)
[17:58:38] * robla looks
[17:59:40] Luke081515: seems like a tag to me
[18:00:31] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 60.87% of data above the critical threshold [5000000.0]
[18:00:34] robla: OK, I will add this to the task.
[18:01:10] thanks!
[18:11:09] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[18:44:06] robla: You can use the tag now
[18:48:34] * robla looks
[18:54:32] Luke081515: thanks! I'm starting to tag upcoming meetings: https://phabricator.wikimedia.org/calendar/query/wiuSD6vCKuhA/
[19:19:02] akosiaris: are you around by any chance ?
[19:28:39] PROBLEM - Host fdb2001 is DOWN: PING CRITICAL - Packet loss = 100%
[19:29:16] eh, fdb?
[19:30:09] PROBLEM - Host heka is DOWN: PING CRITICAL - Packet loss = 100%
[19:30:26] ah, fundraising again
[19:30:37] I think it is only codfw
[19:31:32] yes, codfw frack
[19:31:53] likely the pfw again
[19:32:01] PROBLEM - Host payments2003 is DOWN: PING CRITICAL - Packet loss = 100%
[19:32:08] PROBLEM - Host payments2001 is DOWN: PING CRITICAL - Packet loss = 100%
[19:32:18] aye, seems likely
[19:32:58] so: not public facing etc
[19:33:18] PROBLEM - Host pay-lvs2001 is DOWN: PING CRITICAL - Packet loss = 100%
[19:33:20] no user notice now, I checked it
[19:33:25] PROBLEM - Host saiph is DOWN: PING CRITICAL - Packet loss = 100%
[19:33:28] told -fundraising
[19:33:32] PROBLEM - Host alnilam is DOWN: PING CRITICAL - Packet loss = 100%
[19:33:34] Same stuff again i guess
[19:34:07] mutante: I am working with a contrator for the wmf to set up a django app in labs
[19:34:10] PROBLEM - Host rigel is DOWN: PING CRITICAL - Packet loss = 100%
[19:34:32] any recommandtions for setting it up before it is production grade ?
[19:35:01] (py2/3? mysql/postgres?/packages/pip) etc
[19:35:10] no pip
[19:35:10] RECOVERY - Host fdb2001 is UP: PING OK - Packet loss = 0%, RTA = 38.65 ms
[19:35:13] packages and not pip
[19:35:21] RECOVERY - Host heka is UP: PING OK - Packet loss = 0%, RTA = 36.75 ms
[19:35:21] dont mix packages and pip
[19:35:28] python2.(6,7) is what we have on the cluster
[19:35:28] RECOVERY - Host saiph is UP: PING OK - Packet loss = 0%, RTA = 36.51 ms
[19:35:35] RECOVERY - Host rigel is UP: PING OK - Packet loss = 0%, RTA = 36.49 ms
[19:35:40] might be all 7 now, depends on what precise has
[19:35:41] mysql/postgres we use both in some place
[19:35:42] RECOVERY - Host pay-lvs2001 is UP: PING OK - Packet loss = 0%, RTA = 36.54 ms
[19:35:49] preference for mysql however
[19:35:51] RECOVERY - Host alnilam is UP: PING OK - Packet loss = 0%, RTA = 36.73 ms
[19:35:53] yes
[19:35:57] thanks both
[19:35:59] RECOVERY - Host payments2003 is UP: PING OK - Packet loss = 0%, RTA = 36.60 ms
[19:36:00] yw
[19:36:03] and they're back
[19:36:06] RECOVERY - Host payments2001 is UP: PING OK - Packet loss = 0%, RTA = 36.58 ms
[19:36:35] hey Jeff_Green
[19:37:05] you saw them I guess: all recovered now
[19:39:14] yup. just network gear tripping up, thanks for the heads up
[19:39:34] yeah I figured more of the same
[19:40:14] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: puppet fail
[19:40:34] yeahyeah
[19:45:14] PROBLEM - check_puppetrun on saiph is CRITICAL: CRITICAL: Puppet has 16 failures
[19:45:14] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: puppet fail
[19:45:14] RECOVERY - check_puppetrun on payments2002 is OK: OK: Puppet is currently enabled, last run 132 seconds ago with 0 failures
[19:50:14] PROBLEM - check_puppetrun on saiph is CRITICAL: CRITICAL: Puppet has 16 failures
[19:50:14] RECOVERY - check_puppetrun on payments2003 is OK: OK: Puppet is currently enabled, last run 165 seconds ago with 0 failures
[19:55:14] RECOVERY - check_puppetrun on saiph is OK: OK: Puppet is currently enabled, last run 284 seconds ago with 0 failures
[20:04:55] !bash < apergos> botshed: the action of kicking bots from a channel when it's overrun by alerts at the same time people are working to fix the problem
[20:04:55] bd808: Stored quip at https://tools.wmflabs.org/bash/quip/AVLcPHU5-0X0Il_jxrC1
[20:59:56] 6operations, 10DBA, 6Labs, 10Labs-Infrastructure: db1069 is running low on space - https://phabricator.wikimedia.org/T124464#2025901 (10jcrespo) 5Open>3Resolved 73% -more could be done, but resolving for now.
[21:22:45] PROBLEM - puppet last run on mc2014 is CRITICAL: CRITICAL: puppet fail
[21:47:03] 6operations, 6Services: Package npm 2.14 - https://phabricator.wikimedia.org/T124474#2025940 (10hashar) The nodejs package from Debian does not ship npm (see file list at https://packages.debian.org/sid/amd64/nodejs/filelist ). Instead, npm is split to its own `npm` package. To get NodeJs 4.2/4.3 we backported...
[21:48:25] RECOVERY - puppet last run on mc2014 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[22:09:06] 6operations, 6Services: Package npm 2.14 - https://phabricator.wikimedia.org/T124474#2025952 (10Paladox)
[22:33:52] legoktm: remember that time when my e-mail just kept being unsubscribed due to multiple message delivery failures (to quote the message)?
[22:33:57] legoktm: well that's happening to me again
[22:36:58] or anyone else, really
[22:37:23] I think I hit the maximum number already
[22:48:02] anyone?
[22:49:14] odder: do you remember what needed to be done to fix it?
[22:49:38] bd808: Not really, some limit needed to be zeroed for my e-mail address
[22:50:49] *nod* sounds like it would need a phab ticket in https://phabricator.wikimedia.org/tag/wikimedia-mailing-lists/
[22:51:08] I don't have the super powers to mess with mailman things
[22:51:28] bd808: Ugh, sorry for not making this clear
[22:51:40] bd808: the issue I am having is I'm trying to send a gazillion e-mails
[22:52:02] and my e-mail address that I've got registered for my account keeps being unsubscribed, so I can't send them
[22:52:09] through the wiki, I mean, Special:EmailUser
[22:52:35] oh, ok
[22:53:15] Reedy: ^ any idea what bits need to be twiddled to get odder back in business?
[22:54:07] Worth having a look in the bouncehandler logs to see why he's being unsubscribed
[22:55:09] my internet connection is massively sucking
[22:55:19] 2016-02-13 22:30:38 mw1168 commonswiki 1.27.0-wmf.13 BounceHandler INFO: Un-subscribed global user Odder for exceeding Bounce Limit 5.
[22:55:31] oh, is it just 5 messages?
[22:55:41] bd808: Dunno what's going on because it's happened to me before
[22:56:33] bd808: but I get talk page notifications and Echo mentions and all the rest just fine
[22:56:53] I wonder if BH needs to be smarter
[22:57:00] bd808: I think it might be an issue with the changed headers, ie. perhaps my mail host doesn't like the way that we mess with headers
[22:57:21] If a person gets sent a lot of emails... They should have a higher unsub threshold
[22:57:30] " 550-sorry, external MTA's and unauthenticated MTU's don't havepermission to send email to this server with a header thatstates the email is from twkozlowski.com."
[22:58:01] Yup
[22:58:24] bd808: do I need to say the help desk is closed shut on a Saturday evening :-P
[23:00:55] so there's not a simple counter. it's a select count() thing
[23:02:01] and the bounce limit is global
[23:04:08] bd808: Now I think they're thinking along the lines that if I get an e-mail that is spoofing my actual e-mail address, then that's spam and should not be delivered
[23:04:13] ie. fear of spam
[23:04:47] dunno if there's a Phabricator ticket for that already
[23:05:30] yes, I think so
[23:05:59] not the DMARC one, buuut.... *tried to find it*
[23:06:33] https://phabricator.wikimedia.org/T99444 is same outcome
[23:06:39] https://phabricator.wikimedia.org/T66795 is about the same issue with dmarc, and does note spf as well
[23:08:05] https://phabricator.wikimedia.org/T118648 same 550
[23:09:48] I guess I'll just whitelist myself for the time being
[23:10:04] so one example I'm looking at in the logs is odder sending a user message. The message bounced so BH tried to forward the bounce response to odder. odder's mail server said woah I don't like seeing messages from odder that didn't start on my servers. That bounce came back the BH.
[23:10:27] repeat 4 more times and boom!
[23:10:40] lol
[23:11:05] bd808: it does look logical from their side :-)
[23:11:30] So one fix would be for the From: on the bounce forward to not spoof as the user themselves
[23:13:13] oh wait, it's not even that convoluted. The message is the "send me a copy" message.
[23:13:51] * odder nods
[23:13:57] so you send to user X and ask to get a copy via settiings. The copy says it is From: you and your mail server says "hell no it's not!"
[23:15:20] Precisely
[23:17:17] root problem is the same though. From: should be something at the wiki, not the user's own email
[23:17:42] I'm not quite sure why we send them from the users email anyway...
[23:17:52] Presuambly it should be the wiki, but the return-to be the users email?
[23:18:02] that would make more sense
[23:18:12] or BCC on the original
[23:18:24] although that might have the same problem
[23:18:25] Maybe use $USERNAME I dunno
[23:18:41] Maybe ask our security guys a bit too
[23:18:50] Cause this sort of issue is just going to get more prevalent
[23:18:51] Oh it can't be BCC because that leaks the other user's email
[23:19:44] Reedy: right. email isn't the wild west it was 10 years ago when this was written I'm sure
[23:22:03] Mmm
[23:23:26] https://github.com/wikimedia/mediawiki/blame/master/includes/specials/SpecialEmailuser.php#L379-L391
[23:24:57] Barely touched in 5 years
[23:25:00] it should use the $wgUserEmailUseReplyTo logic from above
[23:25:03] Nearly 6
[23:27:12] that EmailUserCC hook is completely unused in our github repos
[23:27:36] if it wasn't there the fix would be straigh forward
[23:53:54] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 58.33% of data above the critical threshold [5000000.0]
[23:55:44] (03PS1) 10Dereckson: Don't index NS_USER on cs.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270556 (https://phabricator.wikimedia.org/T125068)
[23:57:25] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0]
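
An editor's sketch of the header change bd808 and Reedy converge on above, written in Python only to illustrate the header layout (the real fix would live in the PHP of SpecialEmailuser.php, following its existing $wgUserEmailUseReplyTo branch): the "send me a copy" mail carries a generic wiki address in From:, so the recipient's mail server no longer sees its own domain spoofed, while Reply-To keeps replies going to the user. All addresses below are placeholders.

    #!/usr/bin/env python
    # Editor's sketch of the proposed header layout for the "send me a copy"
    # mail. Addresses are placeholders; this is not MediaWiki code.
    from email.mime.text import MIMEText

    def build_copy_message(user_addr, wiki_addr, subject, body):
        """Copy of a user's own message, sent from the wiki instead of spoofing the user."""
        msg = MIMEText(body)
        msg['Subject'] = subject
        msg['From'] = wiki_addr        # generic wiki sender, survives the recipient's spoofing checks
        msg['To'] = user_addr          # the user who asked for a copy
        msg['Reply-To'] = user_addr    # replies still reach the user directly
        return msg

    if __name__ == '__main__':
        print(build_copy_message('user@example.org', 'wiki@example.org',
                                 'Copy of your message', 'Hello ...').as_string())

With this layout the 550 "header states the email is from twkozlowski.com" rejection quoted earlier would no longer apply, since From: is now a wiki domain the sending servers are actually authorized for.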