[00:00:15] you can unsubscribe them first
[00:00:16] which might be lowered by unsubscribing them, but herald will still resubscribe some
[00:00:26] ehhh. why are we fiddling with this at all. can't we just fix the phab bug? :/
[00:00:39] MatmaRex: you're free to do so
[00:01:03] people who get subscribed to tasks by herald bloody deserve to be spammed
[00:01:06] I'm testing on https://phab-01.wmflabs.org - Danny_B are you getting emails for "test email 2-4" ?
[00:01:15] let me see
[00:01:17] Danny_B: how to disable what?
[00:01:46] twentyafterfour: phabricator to send any email notification
[00:01:50] (just for a bit)
[00:02:02] assuming it might be some variable in config?
[00:02:39] Danny_B: or we could just not touch old bugs, and wait for upstream to fix it properly
[00:02:58] which imo seems the simplest solution
[00:03:06] quiddity: created t8, changed subscribers t7, changed subscribers t6, updated t7, updated t6
[00:03:41] p858snake: which will take ages according to their comments. and will create many dupes of complaining people etc etc...
[00:04:24] omg what is so difficult about shutting off the mail for a couple of minutes? we could have already been done instead of this blahblahing here :-/
[00:04:25] Danny_B, ok. did you get the "quiddity removed a parent task: T5: test email 1. " that I just submitted? If not, then Matma's plan works.
[00:04:25] T5: Get scap logs into logstash - https://phabricator.wikimedia.org/T5
[00:04:37] quiddity: mmt
[00:04:48] i am receiving them with a bit of delay
[00:04:53] so need to wait for a bit
[00:05:00] I don't think there is a straightforward way to disable mail, although it may be possible. I'm not sure about the consequences of unconfiguring the mta
[00:05:34] quiddity removed a parent task: Restricted Maniphest Task.
[00:05:38] :/
[00:06:11] ok, how about configuring smtp or whatever to ignore emails from phabricator?
[00:07:03] i don't know much about phab in particular, but in general, messing with email config sounds like a great way to gain more problems
[00:07:15] exactly
[00:07:18] if this is such a big issue, we should probably undo the phabricator upgrade
[00:07:25] if it isn't, then we should just wait for a fix
[00:07:39] or get twentyafterfour in gear and make him fix it ;)
[00:07:40] well i can simply trigger that batch too.
[00:07:41] PhabricatorMailImplementationTestAdapter: this will completely disable outbound mail
[00:07:44] if this is about the task graph I'm working on a fix
[00:07:51] apergos: <3
[00:07:55] does this not work? maybe it's not for our version but
[00:08:02] https://secure.phabricator.com/book/phabricator/article/configuring_outbound_email/
[00:08:10] and no I know nothing about it, I just asked google
[00:08:54] so we need "Drop in a Hole" adapter ;-)
[00:09:18] Adapter: Disable Outbound Mail You can use the PhabricatorMailImplementationTestAdapter to completely disable outbound mail, if you don't want to send mail or don't want to configure it yet. Just set metamta.mail-adapter to PhabricatorMailImplementationTestAdapter.
[00:09:22] that's what it says anyways
[00:09:42] Operations, DBA, Phabricator, Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2464778 (greg) No worries, thanks Jaime.
[00:09:50] twentyafterfour: what do you think about that? ^^
[00:09:59] I'd be happiest: 1) waiting to see what solutions twentyafterfour is cooking up, and 2) having a discussion about it (in a more appropriate channel), when andre__ is around.
[00:10:46] again, i can trigger that batch without caring about mailspam in people's mailboxes
[00:10:53] what quiddity said
[00:11:00] (I don't do agile well. I *hate* move fast and break things.
;-)
[00:11:03] the deal is that it will fix most of the graphs instantly
[00:11:11] without having to cook anything
[00:11:22] quiddity: this is why I just pass on possibilities and let others who will be here when I go to bed, make the decisions
[00:11:33] that task would go out of dependencies anyway during the time
[00:11:36] apergos: that should do what it says but ... I am 80% finished with a patch that keeps the graphs when they aren't too big and falls back to a regular list when the graph is huge
[00:12:01] Operations, ops-eqiad: Rack/Setup Carbon/Apt Server Replacement - https://phabricator.wikimedia.org/T139171#2464782 (RobH)
[00:12:31] well those who will have to babysit, should decide what to do
[00:12:56] quiddity: that's not about agile. that's simply that i am trying to be nice and prevent mailspam in people's boxes as well as burning the smtp server...
[00:13:30] i could have triggered it without asking to disable the mail
[00:13:43] as MatmaRex sarcastically noted before:
[00:13:51] I think it's good to try to avoid huge amounts of spam
[00:14:02] I think it doesn't have to be solved this minute, it can wait a little while
[00:14:16] if that means cleanup on that task waits a little while, that is ok too
[00:14:30] that's my opinion - it's not that huge of an issue is it?
[00:14:43] 24hours more won't hurt. It'd be good to trigger as little mail as possible, and only remove the old task from where it's preventing the graph from displaying. I vote wait. (and will now return to other windows/work)
[00:15:23] I'm not against the task cleanup but I will have a better hotfix soon barring unforeseen issues
[00:15:29] the issue is many tasks without graphs.
and removing 4007 from dependencies will enable those graphs in most of the tasks back
[00:16:01] and that stands for *any* task, not only direct children of 4007
[00:16:12] for any descendant of 4007
[00:16:19] which is several thousands
[00:16:27] let's give twentyafterfour some time to get his patch ready and tested
[00:16:32] just let him do the better fix
[00:16:36] if there's a huge holdup we can revisit the issue
[00:16:55] why cook something when the easiest solution is to simply push the button "remove selected subtasks"
[00:16:55] I'd like to fix it so that the graph code doesn't walk the graph past a certain level, instead of trying to be thorough
[00:17:22] Danny_B: because that's not a proper fix? let's do it properly in the first place?
[00:18:35] mutante: that's not a better fix. nor worse. it is a parallel solution. because the dependency removal must be done in any case.
[00:19:19] if the issue is broken graphs, it is a solution
[00:19:41] Danny_B: who says the dependency removal must be done in any case?
[00:19:46] or iow: parallel here means that mmodell's fix will solve the displaying, my fix solves the wrong underlying data
[00:20:38] omg, why i simply did not apply "be bold" and didn't hit the button without asking... good lesson for future. never ask for obvious things. :-/
[00:21:11] being nice ends up with being undermined
[00:22:04] it means the approach you advocate isn't always the one adopted, but that's not the same as being undermined
[00:22:15] your instinct to avoid spam was a good one
[00:22:19] twentyafterfour: it looks like the graph code walks the tasks breadth-first… can't it just stop after it has 100 tasks?
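Editor's note: the idea in the last message (walk the task graph breadth-first and stop once enough tasks are collected) can be illustrated with a minimal sketch. This is Python, not Phabricator's PHP, and it is not the code from any patch mentioned in this log; the function name `nearest_tasks`, the plain-dict representation of blocked/blocking links and the default limit of 100 are all assumptions made for illustration.

```python
from collections import deque


def nearest_tasks(start, neighbors, limit=100):
    """Collect up to `limit` task ids around `start`, nearest first.

    `neighbors` is a hypothetical dict mapping a task id to the ids of the
    tasks it blocks or is blocked by. The `seen` set also keeps the walk
    from looping forever on circular dependencies.
    """
    seen = {start}
    order = [start]
    queue = deque([start])
    while queue and len(order) < limit:
        task = queue.popleft()
        for other in neighbors.get(task, ()):
            if other in seen:
                continue
            seen.add(other)
            order.append(other)
            queue.append(other)
            if len(order) >= limit:
                break  # cap reached: stop instead of walking the whole graph
    return order
```

Because the queue is first-in, first-out, directly linked tasks are collected before tasks two links away, so a cap truncates the far end of the graph rather than picking tasks at random.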
[00:23:09] anyways, it's now ridiculous-o-clock here (3:20 am) so I'm out
[00:23:12] see folks tomorrow
[00:23:17] maybe i'm overestimating how simple this is… but it looks rather simple
[00:23:55] limiting to 100 tasks is nonsense
[00:24:07] i have clearly described it in relevant tasks
[00:24:33] it is not a systematic solution
[00:24:50] lies. it's a better solution than showing nothing
[00:24:51] it will give random 100 tasks without
[00:24:55] Danny_B: no
[00:25:19] it will first take the nearest tasks to the current task. directly blocked/blocking first
[00:25:33] then blocked/blocking tasks of the blocked/blocking tasks of the current task
[00:25:41] then blocked/blocking tasks of the blocked/blocking tasks of the blocked/blocking tasks of the current task, etc.
[00:26:58] the most intuitive would be limiting to the first level of dependencies (ie. show direct children, but no further descendants*) and also show only direct line to the root (= don't show other tasks which the parent, grandparent, ... depend on).
[00:27:05] if several people just ask to take it slow, investigate and involve others that is not undermining. that just means there is no consensus, as the discussion shows
[00:27:07] * however, those might be expandable to the next single level via AJAX request
[00:28:41] bugzilla never went to parent direction. iirc, jira also does not. there is no reason to show whole tree than just all direct parents (aka no grand parents, no siblings etc...)
[00:31:21] also, as i have seen, we have circular dependencies
[00:31:28] Operations, Ops-Access-Requests, Labs, Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2464178 (Dzahn) how about +2 in Gerrit?
[00:32:56] Operations, Ops-Access-Requests, Labs, Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2464798 (AlexMonk-WMF) to ops/puppet?
it's given by ldap/ops, which I imagine should be on a generic ops onboarding list somewhere
[00:32:57] twentyafterfour: would this work as a temporary solution, better than the current one? https://phabricator.wikimedia.org/T140333#2464797
[00:33:54] Operations, Ops-Access-Requests, Labs, Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2464178 (Peachey88) >>! In T140422#2464795, @Dzahn wrote: > how about +2 in Gerrit? There is a few LDAP groups that do that inherently (that staff ar...
[00:34:18] mutante: agreeable. but there are several other people who complain that they can't see anything. and i have quick fix for them instead of necessity to wait for somebody to write the patch (which - no offense intended to any possible author - doesn't necessarily have to work properly for the first shot)
[00:34:50] the number of complaining people rises, as clearly seen by the number of duplicate reports and subscribers
[00:36:05] Operations, Ops-Access-Requests, Labs, Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2464801 (Dzahn) https://office.wikimedia.org/wiki/Operations/On%28Off%29boarding
[00:37:13] kaldari added a comment.
[00:37:13] Can we please revert back to the previous version (before Task graphs)? I need to be able to see task parents, but all I get are "Task graph too large" errors. The previous simpler version was a lot better, IMO.
[00:37:26] we're effectively blocking people from their work
[00:39:41] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: puppet fail
[00:40:02] PROBLEM - puppet last run on db2010 is CRITICAL: CRITICAL: puppet fail
[00:41:41] Operations, Ops-Access-Requests, Labs, Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2464816 (Dzahn) - ops mailing list was already done @madhuvishy when you get to it we'll need a GPG key to add you to https://office.wikimedia.org/wi...
[00:42:35] Danny_B: a friend just pointed out to me that i've been too mean/aggressive in this conversation, sorry about that. i should think a bit more before sending messages. but i still think that disabling email should be a last-resort kind of thing, and we're not there just yet. :)
[00:45:12] ok https://secure.phabricator.com/D16304
[00:45:27] ^ this addresses the task graph problem by falling back to the old behavior
[00:45:27] MatmaRex: i am also a bit nervous and thus not communicating with the coolest head ;-) i am (like many others) just simply heavily annoyed by not being able to work with dependencies. furthermore i'm frustrated that i know the fastest solution, but it would make some people grumble...
[00:45:55] twentyafterfour: nice. can we have it deployed on our instance soon, please?
[00:46:00] twentyafterfour: nice!
[00:46:42] ha! speaking about devil kaldari, and devil kaldari is here ;-)
[00:46:43] yes I will deploy it right away, though I wouldn't mind a tiny bit of code review before deploying it to production (I've only tested locally with test data)
[00:46:46] (CR) Aude: [C: 1] admin: add addshore to deployers [puppet] - https://gerrit.wikimedia.org/r/299032 (https://phabricator.wikimedia.org/T140276) (owner: Dzahn)
[00:46:58] twentyafterfour: let me see
[00:47:05] so if anyone feels like looking it over at least, that'd be good
[00:52:03] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:53:23] twentyafterfour could you upload that patch to phabricator.wikimedia.org against the wmf/stable branch please
[00:54:39] twentyafterfour so i can upload it to phab-01 please
[00:54:46] paladox: i am reviewing the code now
[00:54:51] and found an issue there
[00:54:55] Danny_B Oh
[00:54:57] thans
[00:55:00] thanks
[00:59:55] paladox: it's pushed
[01:00:01] thank you
[01:00:02] :)
[01:01:49] twentyafterfour: on our phab?
[01:01:52] twentyafterfour I've deployed it to phab-01
[01:01:55] and yes Danny_B
[01:02:01] https://phabricator.wikimedia.org/rPHAB9ed0e899b209f4262193d23bad577b778a797bd5
[01:02:05] Danny_B ^^
[01:02:33] paladox: sorry for improper wording, i meant if it is live now
[01:02:34] Danny_B: yeah, if more changes need to be made I will make a separate commit rather than amending
[01:02:46] Oh yes
[01:02:53] I'll deploy it to production shortly if nobody else finds any problems
[01:03:00] Danny_B you can now test on phab-01
[01:03:14] Danny_B already caught one stupid mistake so I'm glad I asked for code review
[01:03:21] paladox: does it have any reasonable data set?
[01:03:35] ie task with 100+ dependencies?
[01:03:42] Oh, i think so
[01:03:49] I set the limit to 50, btw
[01:03:55] right
[01:03:55] Per ^^
[01:04:00] since even 100 dependencies is a lot to graph
[01:04:12] It may need to be adjusted further
[01:04:23] though I am not sure what a reasonable cutoff would be
[01:04:24] Someone should create 50+ tasks at https://phab-01.wmflabs.org/
[01:04:27] * Danny_B checking the newer patch
[01:04:49] * Danny_B is pronouncing paladox as mr. someone ;-)
[01:05:00] Oh :)
[01:06:25] twentyafterfour thank you for working on a fix :)
[01:06:54] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[01:07:15] RECOVERY - puppet last run on db2010 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[01:07:21] paladox: you're welcome :) ... I mean it seems a lot of people were bothered by it and rolling back wasn't an option. I still think Danny_B's solution is a good idea too but I'm not sure we could disable email on phabricator without disrupting some people's work
[01:08:01] Yep, i rely on some emails from phabricator but not much due to me getting a ton of emails from spammers.
[01:08:06] :)
[01:08:25] I've deployed it and so now we can create 50 tasks to test
[01:08:41] I think I saw a script somewhere in phab that generates a bunch of test data ...
[01:08:52] I'm looking for it now
[01:09:19] Ok thanks
[01:09:32] twentyafterfour: did you actually do that move of code between conditions i said? i am a bit confused in the new diff, as i can't collapse comments so it's a bit hard to follow
[01:09:51] PhabricatorLipsumGenerateWorkflow.php
[01:11:48] Danny_B: well I'm pretty sure I did but for some reason differential seems to have removed the diff-of-diffs feature? :-/
[01:12:31] line 94 is the end of that block
[01:12:36] in the new file
[01:13:21] ohh diff of diffs isn't gone...
it's just in a tab now under revision contents
[01:14:00] twentyafterfour could your problem with diffs be https://secure.phabricator.com/D16266
[01:14:01] https://secure.phabricator.com/D16304?vs=39210&id=39211&whitespace=ignore-most#toc
[01:14:05] oh
[01:17:54] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[01:20:08] twentyafterfour https://phab-01.wmflabs.org/T9
[01:20:14] Danny_B ^^
[01:21:50] Operations: reinstall snapshot100[1234].eqiad.wmnet with RAID - https://phabricator.wikimedia.org/T140439#2464872 (Dzahn)
[01:22:11] Operations: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2338953 (Dzahn)
[01:25:31] wow, it prevents the circular dependency now!
[01:27:06] that's more than 50 and it still hasn't triggered the fallback
[01:27:11] Yep
[01:27:13] Operations: reinstall maps-test200[1234] with RAID - https://phabricator.wikimedia.org/T140440#2464889 (Dzahn)
[01:27:38] Operations: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2338953 (Dzahn)
[01:27:54] twentyafterfour: the question is if it takes 50 total or 50 one way
[01:28:14] i was trying to speed it up by some secondary dependencies
[01:28:17] Oh yes
[01:28:21] that's a great idea
[01:28:33] what if it will only do it for parent tasks
[01:28:36] 50 that way
[01:28:45] Operations, Research-and-Data-VisualEditor : reinstall osmium with jessie - https://phabricator.wikimedia.org/T132530#2464907 (Dzahn) osmium also appears in T136562 for not having RAID.
so that should also be done as part of this task
[01:28:46] and 50 for subtasks
[01:28:59] Operations: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2338953 (Dzahn)
[01:29:27] We should make it so that it counts 50 for both
[01:29:29] sub tasks
[01:29:31] and parent tasks
[01:30:08] not at once though
[01:30:17] first have 50 on one side
[01:30:28] then remove to 49, have 50 on other side
[01:30:34] then have 50 on both sides
[01:30:43] then we can track the behavior
[01:31:05] o_O the graph is broken
[01:31:17] Operations: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2464912 (Dzahn)
[01:31:32] https://phab-01.wmflabs.org/T63 breaks the graph
[01:31:55] Operations, Patch-For-Review: Migrate hydrogen/chromium to jessie - https://phabricator.wikimedia.org/T123727#1936549 (Dzahn) hydrogen and chromium also appear on T136562 for not having RAID. that should be done as part of this ticket too
[01:32:52] looks to me that the over-limit tasks just simply disallow proper rendering of their part of graph
[01:33:24] Operations: reinstall rcs100[12] with RAID - https://phabricator.wikimedia.org/T140441#2464918 (Dzahn)
[01:34:07] Operations: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442#2464932 (Dzahn)
[01:34:29] that's exactly what i was trying to emphasize earlier:
[01:34:33] Operations: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562#2338953 (Dzahn)
[01:34:39] do not render the parent direction at all
[01:34:48] render only subtasks
[01:36:30] so just list the parent tasks, don't show them as a graph?
[01:36:43] yup
[01:36:56] have the current task as the root of graph
[01:40:15] hmm, if i was tracking the test graph properly, it fails to continue rendering paths after 50
[01:40:34] but that doesn't solve that it should switch to the text only version
[01:41:59] I found
[01:42:01] that
[01:42:06] when i pulled
[01:42:15] again it has some merge conflicts
[01:43:22] All resolved now
[01:43:37] seems the tasks still
[01:44:00] show in graph
[01:45:28] Operations, wikidiff2: Deploy new version of wikidiff2 package - https://phabricator.wikimedia.org/T140443#2464963 (Legoktm)
[01:46:48] Operations, wikidiff2, Patch-For-Review: Deploy new version of wikidiff2 package - https://phabricator.wikimedia.org/T140443#2464981 (Legoktm)
[01:47:42] Our maximum varnish cache rollover time is still 1 month, right?
[01:48:07] Operations, Research-and-Data-VisualEditor : reinstall osmium with jessie - https://phabricator.wikimedia.org/T132530#2464983 (Dzahn)
[01:49:00] https://phab-01.wmflabs.org/T63 even the simplest dependency graph is shown
[01:50:27] * Danny_B is diving into the patch again
[01:51:40] Operations: setup server osmium as parse benchmarking server - https://phabricator.wikimedia.org/T83861#2464990 (Dzahn)
[01:53:03] Operations, Research-and-Data-VisualEditor : reinstall osmium with jessie - https://phabricator.wikimedia.org/T132530#2464996 (Dzahn) see T83861#919541 - this was setup without RAID to be like mw servers
[02:01:01] paladox: can you set the limit to 10?
[02:01:11] Ok, yepo
[02:01:14] yep
[02:01:17] should be line 76 or around
[02:01:17] i will do that now
[02:01:22] Ok
[02:01:23] thanks
[02:01:43] i wonder if i have access to phab01
[02:02:00] Ok, I've set it now
[02:02:08] Yeh you do
[02:02:29] Still didn't stop it
[02:02:35] let's try 0
[02:03:08] Nope 0 doesn't either
[02:04:46] paladox: where is it located?
[02:04:54] the path to file on phab01
[02:05:03] the repo is located at /srv/phab/phabricator
[02:05:10] Danny_B ^^
[02:05:28] it runs from there?
[02:06:15] Yes
[02:07:33] ok. gonna try some dirty stuff ;-)
[02:07:38] Operations, Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Jksamra - https://phabricator.wikimedia.org/T140445#2465026 (Jksamra)
[02:10:02] Danny_B ok, thanks
[02:10:11] looks like somebody is working on that too
[02:10:16] Yep
[02:10:17] me
[02:10:25] I have been trying to find the absolute
[02:10:25] again... ;-)
[02:10:26] code
[02:10:34] that makes the graph
[02:10:48] Yep, i was doing it before you logged in.
[02:11:11] are you sure it is running from that path?
[02:11:46] Yes
[02:11:50] i disabled the graph completely and it is still visible
[02:12:04] I checked the version from phabricator config
[02:12:20] I'm looking at the source code and shows all graph
[02:13:56] it's not possible that it would run from that file. i made a syntax error there and nothing happened
[02:14:47] Oh wait
[02:14:53] What about caching
[02:15:38] Yep
[02:15:46] it's got a syntax error
[02:15:56] we have to restart apache after every php change
[02:16:08] due to one of the cache it uses
[02:16:18] twentyafterfour
[02:16:23] we may have found the reason
[02:16:24] aha!
[02:16:24] ^^
[02:16:40] that sort of sucks actually
[02:16:43] Yep
[02:16:56] anyway, i'm going to do some tests now, so pls don't edit
[02:17:07] Ok, could you revert your syntax error
[02:17:10] quickly please
[02:17:20] it's not there
[02:17:22] I want to restart apache to see if it will start working now
[02:17:22] oh
[02:17:39] Oh there's a syntax error
[02:17:42] from somewhere else
[02:17:55] after i've seen it's not working, i have removed it
[02:18:11] can we disable caching there?
[02:18:19] Yes please
[02:18:24] opcache
[02:18:30] I need to go and edit php setting
[02:18:31] now
[02:19:04] wohoo!
table is there!!!!
[02:19:09] is -operations the best channel for this discussion?
[02:19:31] yeah, we're perhaps quite far from the original topic
[02:19:32] Operations, VisualEditor, Performance: reinstall osmium with jessie - https://phabricator.wikimedia.org/T132530#2465048 (Jdforrester-WMF) Please feel free. I don't think we've used the server for a while, though, so it should be good to return to the pool for other use unless @ori (who set it up) thi...
[02:19:33] sorry
[02:19:38] Maybe we could move it to -devtools
[02:20:52] thanks folks. Totally worthwhile work though so keep hacking
[02:23:13] You're welcome.
[02:23:26] It looks like it can now be deployed to phabricator.wikimedia.org
[02:29:03] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.10) (duration: 07m 54s)
[02:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:35:17] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Jul 15 02:35:16 UTC 2016 (duration 6m 14s)
[02:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:42:30] (CR) Legoktm: [C: 1] admin: add addshore to deployers [puppet] - https://gerrit.wikimedia.org/r/299032 (https://phabricator.wikimedia.org/T140276) (owner: Dzahn)
[04:56:09] PROBLEM - Host mw1280 is DOWN: PING CRITICAL - Packet loss = 100%
[04:58:17] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:00:08] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[05:05:03] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:11:43] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table wikishared.echo_unread_wikis: Cant find record in echo_unread_wikis, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1031-bin.001851, end_log_pos 464018346
[05:23:44] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:28:55] PROBLEM - MariaDB Slave Lag: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1348.25 seconds
[05:32:01] Operations, Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Brentjoseph (bcohn) - https://phabricator.wikimedia.org/T140449#2465141 (Brentjoseph)
[05:32:04] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[05:42:27] (PS4) Mforns: Add white-list for EventLogging auto-purging [puppet] - https://gerrit.wikimedia.org/r/298721 (https://phabricator.wikimedia.org/T108850)
[05:49:03] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:08:25] !log installing libarchive security updates
[06:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:10:29] PROBLEM - puppet last run on ms-be2023 is CRITICAL: CRITICAL: puppet fail
[06:12:28] (PS1) Urbanecm: Enable global abuse filters on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/299111 (https://phabricator.wikimedia.org/T140395)
[06:15:13] Blocked-on-Operations, Operations, Kartographer, Wikimedia-Extension-setup, and 3 others: Enable Interactive Maps (Kartographer) on
Macedonian Wikipedia - https://phabricator.wikimedia.org/T139946#2465173 (Urbanecm)
[06:30:09] PROBLEM - puppet last run on mw2228 is CRITICAL: CRITICAL: puppet fail
[06:30:39] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:50] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:18] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:19] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:29] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:30] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:39] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:33:15] Blocked-on-Operations, Operations, Kartographer, Wikimedia-Extension-setup, and 3 others: Enable Interactive Maps (Kartographer) on Macedonian Wikipedia - https://phabricator.wikimedia.org/T139946#2465186 (Urbanecm) Thanks for the link. As @Yurik wrote above there is another problem which is desc...
[06:36:31] RECOVERY - puppet last run on ms-be2023 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:38:44] Operations: reinstall snapshot100[1234].eqiad.wmnet with RAID - https://phabricator.wikimedia.org/T140439#2465191 (ArielGlenn) I need to decommission 2 and 4. 3 will be decommissioned after the cron jobs are moved off of it, see T133694. I'd like to keep 1 around for a while yet as a canary/testbed, it cou...
[06:44:51] Operations, Commons, media-storage, User-notice: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2465193 (Menner) Besides Pango the library Harfbuzz is mention in the librsvg bugreport on Gnome, too.
[06:47:32] !log installing nspr security updates
[06:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:56:41] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:56:41] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:01] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:00] RECOVERY - puppet last run on mw2228 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:02] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:10] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:03:16] Operations, Commons, media-storage, User-notice: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2465201 (MoritzMuehlenhoff) @Menner On the jessie system which is anti-aliased Harfbuzz is installed in the same version as...
[07:27:28] <_joe_> !log powercycling mw1280
[07:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:29:31] RECOVERY - Host mw1280 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms
[07:29:59] !log installing PHP security updates on jessie systems
[07:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:33:20] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:46:50] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:48:48] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[07:53:07] (PS4) Giuseppe Lavagetto: puppetmaster: puppetize private post-commit hook [puppet] - https://gerrit.wikimedia.org/r/298958 (https://phabricator.wikimedia.org/T98173)
[08:00:09] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:00:19] Operations, Commons, media-storage, User-notice: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2465254 (Menner) From outside it is difficult to estimate when fallback fonts are use. Maybe this is something to look at si...
[08:13:42] !log reimporting x1 partial db copy on dbstore1002 from x1-master
[08:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:35:43] (PS1) Filippo Giunchedi: admin: add marktraceur to statistics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/299115 (https://phabricator.wikimedia.org/T140132)
[08:37:16] Operations, Ops-Access-Requests, Patch-For-Review: Add marktraceur to statistics-privatedata-users for access to stat1002 - https://phabricator.wikimedia.org/T140132#2454200 (fgiunchedi) @MarkTraceur apologies for the delay! we'd need manager approval too
[08:38:22] Hello guys! Sorry if this is off-topic, but I didn't get any good feedback anywhere else... I know that you have some Java webapps in the infra, and you're using icinga too, how do you monitor JVMs (Tomcat or others)?
→ Just looking for feedback, if you're using a closed source plugin I can understand
[08:39:08] Operations, Ops-Access-Requests, Patch-For-Review: Add marktraceur to statistics-privatedata-users for access to stat1002 - https://phabricator.wikimedia.org/T140132#2465302 (fgiunchedi) p:Triage>Normal
[08:41:40] Operations, Parsoid: Delete Parsoid deb 0.4.0 package from releases wikimedia.org - https://phabricator.wikimedia.org/T140279#2465304 (MoritzMuehlenhoff) Open>Resolved a:MoritzMuehlenhoff The parsoid 0.4.0 binary has been removed.
[08:42:33] Operations, Ops-Access-Requests, Patch-For-Review, User-zeljkofilipin: MediaWiki deployment shell access request for zfilipin - https://phabricator.wikimedia.org/T140264#2465309 (fgiunchedi) p:Triage>Normal thanks for submitting the patch already! to be merged next Mon
[08:42:46] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:43:41] Operations, Ops-Access-Requests, Patch-For-Review: MediaWiki deployment shell access request for addshore - https://phabricator.wikimedia.org/T140276#2465312 (fgiunchedi) p:Triage>Normal LGTM, to be merged on Mon
[08:44:11] Operations, Analytics-Kanban, Traffic, Patch-For-Review: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2465314 (elukey) Ran again the query, no empty dt fields for the past hours too. The issue seems solved! We'll might need to tune a...
[08:44:19] 06Operations, 06Analytics-Kanban, 10Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2465315 (10elukey) [08:44:45] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [08:48:33] (03CR) 10Filippo Giunchedi: "minor comment, LGTM otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/298928 (https://phabricator.wikimedia.org/T140342) (owner: 10Addshore) [08:49:39] (03CR) 10Filippo Giunchedi: Introduce wmde-analytics-admins group (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/298928 (https://phabricator.wikimedia.org/T140342) (owner: 10Addshore) [08:50:07] 06Operations, 10Ops-Access-Requests, 06WMDE-Analytics-Engineering, 13Patch-For-Review: Requesting sudo access to analytics-wmde user on stat1002 for Addshore - https://phabricator.wikimedia.org/T140342#2465320 (10fgiunchedi) p:05Triage>03Normal [08:50:53] (03CR) 10Addshore: Introduce wmde-analytics-admins group (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/298928 (https://phabricator.wikimedia.org/T140342) (owner: 10Addshore) [08:50:54] 06Operations, 10Ops-Access-Requests: Platonides access to #mediawiki_security - https://phabricator.wikimedia.org/T140288#2465322 (10fgiunchedi) p:05Triage>03Normal hi, can you elaborate on why you'd need access? thanks! 
[08:56:42] (03PS4) 10Addshore: Introduce wmde-analytics-users group [puppet] - 10https://gerrit.wikimedia.org/r/298928 (https://phabricator.wikimedia.org/T140342) [09:00:40] dbstore should be fixed any time soon [09:01:16] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [09:02:06] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.21 seconds [09:03:36] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: puppet fail [09:04:36] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures [09:05:39] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Determine a core set or a checklist of permissions for deployment purpose - https://phabricator.wikimedia.org/T140270#2465334 (10fgiunchedi) p:05Triage>03Normal [09:06:07] (03PS5) 10Giuseppe Lavagetto: puppetmaster: puppetize private post-commit hook [puppet] - 10https://gerrit.wikimedia.org/r/298958 (https://phabricator.wikimedia.org/T98173) [09:06:24] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Brentjoseph (bcohn) - https://phabricator.wikimedia.org/T140449#2465337 (10fgiunchedi) [09:06:26] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Jksamra - https://phabricator.wikimedia.org/T140445#2465338 (10fgiunchedi) [09:06:28] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Mpany - https://phabricator.wikimedia.org/T140399#2465339 (10fgiunchedi) [09:06:30] 06Operations, 10Ops-Access-Requests: analytics server access request for three users from CPS Data Consulting - https://phabricator.wikimedia.org/T139764#2465336 (10fgiunchedi) [09:06:58] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 
'analytics-privatedata-users' and 'researchers' for Mpany - https://phabricator.wikimedia.org/T140399#2465341 (10fgiunchedi) p:05Triage>03Normal [09:07:05] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Jksamra - https://phabricator.wikimedia.org/T140445#2465343 (10fgiunchedi) p:05Triage>03Normal [09:07:12] (03PS1) 10Muehlenhoff: etcd: Use DOMAIN_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/299116 [09:07:16] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Brentjoseph (bcohn) - https://phabricator.wikimedia.org/T140449#2465345 (10fgiunchedi) p:05Triage>03Normal [09:07:45] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [09:08:04] (03Abandoned) 10Muehlenhoff: etcd: Use PRODUCTION_NETWORKS in ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/295778 (owner: 10Muehlenhoff) [09:08:08] (03PS1) 10Jcrespo: Realign s4-master dns alias with reality (db1040) [dns] - 10https://gerrit.wikimedia.org/r/299117 [09:08:36] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [09:09:03] 06Operations, 10wikidiff2, 13Patch-For-Review: Deploy new version of wikidiff2 package - https://phabricator.wikimedia.org/T140443#2465346 (10fgiunchedi) p:05Triage>03Normal [09:10:28] (03CR) 10Giuseppe Lavagetto: [C: 031] etcd: Use DOMAIN_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/299116 (owner: 10Muehlenhoff) [09:12:45] (03CR) 10Elukey: "Looks good! 
I'd like to get moar metrics if possible :)" (033 comments) [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/299036 (https://phabricator.wikimedia.org/T137302) (owner: 10Ottomata) [09:13:34] (03CR) 10Jcrespo: [C: 032] Realign s4-master dns alias with reality (db1040) [dns] - 10https://gerrit.wikimedia.org/r/299117 (owner: 10Jcrespo) [09:22:14] !log updating dns record for s4-master.eqiad.wmnet [09:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:29:16] Hello guys! Sorry if this is off-topic, but I didn't get any good feedbacks somewhere else... I know that you have some Java webapps in the infra, and you're using icinga too, how do you monitor JVMs (Tomcat or others)? → Just looking for feedback, if your using a closed source plugin I can understand [09:31:19] elacheche: what sort of monitoring you had in mind? usually we check the metrics exported by the service itself via e.g. graphite [09:31:25] !log restarted circular replication from db2019 -> db1040 [09:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:33:34] !log renabling semisync replication throughout s4 [09:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:35:06] godog: anything related to JVM (memory/threads/etc..).. [09:37:26] 06Operations, 10Traffic, 06Wikipedia-Android-App-Backlog, 06Wikipedia-iOS-App-Backlog, and 2 others: Zero: Investigate removing the limit on carrier tagging to m-dot and zero-dot requests - https://phabricator.wikimedia.org/T137990#2465363 (10ema) p:05Triage>03Normal [09:40:25] elacheche: afaik no, though we use jmxtrans in some places to pull out metrics [09:41:36] Thx godog :) [09:41:46] It'll help me find my way ::) [09:43:43] elacheche: no worries! 
[09:43:43] 06Operations, 06Commons, 10media-storage, 07User-notice: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2465369 (10Menner) I've install a Debian Jessie from netinstall on a local virtual machine and added Wikimedia repositories fo... [09:44:12] I don't think we'd be using a closed source plugin :) [09:46:21] (03PS1) 10Elukey: Add G1 to the supported JMX JVM metrics [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/299118 [09:46:39] (03CR) 10Giuseppe Lavagetto: [C: 032] "Will apply carefully" [puppet] - 10https://gerrit.wikimedia.org/r/298958 (https://phabricator.wikimedia.org/T98173) (owner: 10Giuseppe Lavagetto) [09:46:55] PROBLEM - HP RAID on ms-be1022 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [09:47:23] Krenair: who knows x) I was just asking :) x) And trying to get people to answer me → As there is too many bots in here x) [09:48:27] PROBLEM - MD RAID on ms-be1022 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:49:04] 06Operations, 06Commons, 10media-storage, 07User-notice: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2465389 (10Menner) BTW: How do you invoke rsvg-convert / librsvg on image scalers? [09:49:38] (03PS2) 10Elukey: Add G1 GC to the supported JMX JVM metrics [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/299118 [09:50:17] RECOVERY - HP RAID on ms-be1022 is OK: OK: Slot 0: no logical drives --- Slot 0: no drives [09:51:24] (03CR) 10Elukey: Emit zookeeper server JMX metrics in zookeeper::jmxtrans class (031 comment) [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/299036 (https://phabricator.wikimedia.org/T137302) (owner: 10Ottomata) [09:53:33] ms-be1022 is me [09:55:25] also ms-be1023 [09:58:17] RECOVERY - MD RAID on ms-be1022 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [10:01:22] (03CR) 10Elukey: "Looks good! Added some comments!" 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/299039 (https://phabricator.wikimedia.org/T137302) (owner: 10Ottomata) [10:02:01] 06Operations, 10Parsoid, 06Services, 15User-mobrovac: Migrate Parsoid cluster to Jessie / node 4.x - https://phabricator.wikimedia.org/T135176#2465395 (10mobrovac) [10:02:36] (03PS1) 10Jcrespo: Prepare hosts for labsdb1009, -10 and -11 [puppet] - 10https://gerrit.wikimedia.org/r/299121 (https://phabricator.wikimedia.org/T140452) [10:03:58] !log restbase deploy start of 018864b [10:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:04:12] (03PS1) 10Giuseppe Lavagetto: puppetmaster::gitclone: brown-paper-bag fix for template [puppet] - 10https://gerrit.wikimedia.org/r/299122 [10:07:32] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster::gitclone: brown-paper-bag fix for template [puppet] - 10https://gerrit.wikimedia.org/r/299122 (owner: 10Giuseppe Lavagetto) [10:09:45] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [10:09:46] 06Operations, 06Commons, 10media-storage, 07User-notice: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2465432 (10MoritzMuehlenhoff) @Menner: The invocation of librsvg is rather straightforward, e.g. /usr/bin/rsvg-convert -w 109... 
[10:09:56] PROBLEM - puppet last run on mw2173 is CRITICAL: CRITICAL: Puppet has 1 failures [10:10:33] (03CR) 10Jcrespo: [C: 032] "https://puppet-compiler.wmflabs.org/3349/" [puppet] - 10https://gerrit.wikimedia.org/r/299121 (https://phabricator.wikimedia.org/T140452) (owner: 10Jcrespo) [10:10:48] (03PS2) 10Jcrespo: Prepare hosts for labsdb1009, -10 and -11 [puppet] - 10https://gerrit.wikimedia.org/r/299121 (https://phabricator.wikimedia.org/T140452) [10:11:46] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [10:14:26] (03Abandoned) 10Elukey: Remove cronspam coming from Gerrit log deletion [puppet] - 10https://gerrit.wikimedia.org/r/298779 (https://phabricator.wikimedia.org/T132324) (owner: 10Elukey) [10:15:31] !log restbase deploy end of 018864b [10:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:27:24] (03PS1) 10Yuvipanda: labs: Remove nfs for deployment-prep \o/ [puppet] - 10https://gerrit.wikimedia.org/r/299123 (https://phabricator.wikimedia.org/T64835) [10:28:16] (03CR) 10Mobrovac: "LGTM, but I concur with Luca's comments." (031 comment) [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/299036 (https://phabricator.wikimedia.org/T137302) (owner: 10Ottomata) [10:28:21] (03PS2) 10Yuvipanda: labs: Remove nfs for deployment-prep \o/ [puppet] - 10https://gerrit.wikimedia.org/r/299123 (https://phabricator.wikimedia.org/T102953) [10:30:03] !log deployed rPHABacb736547c6595fe09e05bafd7a3b563d3cf67c8 and rPHABcf12fdf248df82dc414d96bddd147c058bc3d636 to address maniphest task dependency graphs. Now related tasks will be shown as a plain list when there are too many tasks to graph. [10:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:34:05] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[10:35:50] (03CR) 10Elukey: Emit zookeeper server JMX metrics in zookeeper::jmxtrans class (031 comment) [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/299036 (https://phabricator.wikimedia.org/T137302) (owner: 10Ottomata) [10:36:34] PROBLEM - very high load average likely xfs on ms-be1024 is CRITICAL: CRITICAL - load average: 125.85, 101.26, 56.03 [10:36:34] RECOVERY - puppet last run on mw2173 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [10:39:08] PROBLEM - MariaDB disk space on labsdb1009 is CRITICAL: DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied [10:39:22] (03CR) 10Mobrovac: [C: 031] "Nice!" [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/299118 (owner: 10Elukey) [10:39:49] something is wrong with that install [10:42:49] which? (ms-be1024 is me) [10:43:24] labsdb1009 [10:45:10] why is disk space checks trying to use /sys/kernel/debug/tracing ? [10:45:43] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [10:46:06] 06Operations, 06Commons, 10media-storage, 07User-notice: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2465494 (10MoritzMuehlenhoff) I ran FC_DEBUG=1024 fc-match "Times" on the trusty and jessie-based image scalers and that r... [10:46:22] jynus: from /proc/mounts perhaps? [10:46:53] RECOVERY - very high load average likely xfs on ms-be1024 is OK: OK - load average: 8.82, 70.82, 70.09 [10:46:59] godog, you are right [10:48:13] I think this is using ancient applications/config thought for precise and I need to generate a new role [10:51:07] heh, it should already exclude virtual fs I think tho [10:54:44] PROBLEM - HP RAID on ms-be1025 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[10:56:18] shush [10:57:10] (03PS1) 10BBlack: upload VCL: prep for easier V4 migration [puppet] - 10https://gerrit.wikimedia.org/r/299126 (https://phabricator.wikimedia.org/T131502) [10:58:19] 06Operations, 06Commons, 10media-storage, 07User-notice: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2465507 (10MoritzMuehlenhoff) I've added such a file locally to one of the new jessie-based scalers and that seems to fix it,... [11:02:07] (03PS1) 10Jcrespo: [WIP] Setup the new labsdb hosts with a new role [puppet] - 10https://gerrit.wikimedia.org/r/299127 (https://phabricator.wikimedia.org/T140452) [11:03:10] !log swift codfw-prod: ms-be202[567] weight 2500 [11:03:12] (03PS3) 10BBlack: cache_upload VCL forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/298744 (https://phabricator.wikimedia.org/T131502) (owner: 10Ema) [11:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:03:42] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Setup the new labsdb hosts with a new role [puppet] - 10https://gerrit.wikimedia.org/r/299127 (https://phabricator.wikimedia.org/T140452) (owner: 10Jcrespo) [11:06:30] 06Operations, 10ops-eqiad, 10DBA: db1034 lag - https://phabricator.wikimedia.org/T139280#2465519 (10fgiunchedi) p:05Triage>03Normal [11:09:05] (03PS2) 10Jcrespo: [WIP] Setup the new labsdb hosts with a new role [puppet] - 10https://gerrit.wikimedia.org/r/299127 (https://phabricator.wikimedia.org/T140452) [11:09:18] RECOVERY - HP RAID on ms-be1025 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [11:10:07] 06Operations: lithium (central syslog server) is starting to run low on disk space - https://phabricator.wikimedia.org/T140189#2465540 (10fgiunchedi) 05Open>03Invalid there was still some space on the vg (and still is, 30G) so I've extended the 
fs, it should give some headroom. We're in the process of procur... [11:10:37] 06Operations, 03Maps-Sprint: Increase frequency of OSM replication - https://phabricator.wikimedia.org/T137939#2465545 (10fgiunchedi) p:05Triage>03Normal [11:10:49] (03CR) 10BBlack: [C: 031] cache_upload VCL forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/298744 (https://phabricator.wikimedia.org/T131502) (owner: 10Ema) [11:11:12] 06Operations, 10Revision-Scoring-As-A-Service-Backlog: Set up oresrdb redis node in codfw - https://phabricator.wikimedia.org/T139372#2465546 (10fgiunchedi) p:05Triage>03Normal [11:12:30] 06Operations, 10Revision-Scoring-As-A-Service-Backlog: Set up oresrdb redis node in codfw - https://phabricator.wikimedia.org/T139372#2430068 (10fgiunchedi) @Halfak Alex is currently on vacation (JFYI) [11:18:01] 06Operations, 10Traffic, 07HTTPS, 05MW-1.27-release-notes, and 2 others: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2465558 (10Liuxinyu970226) [11:23:58] 06Operations, 06Commons, 10media-storage, 07User-notice: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2465563 (10TheDJ) Judging from http://www.ceus-now.com/weird-font-hinting-in-firefox-4/ I suspect this was removed, because t... [11:27:26] (03PS1) 10BBlack: VCL: add calls for cluster/layer vcl_backend_fetch for V4 [puppet] - 10https://gerrit.wikimedia.org/r/299129 (https://phabricator.wikimedia.org/T131502) [11:27:28] (03PS1) 10BBlack: upload VCL: X-Range hack for V4 [puppet] - 10https://gerrit.wikimedia.org/r/299130 (https://phabricator.wikimedia.org/T131502) [11:28:35] 06Operations, 06Commons, 10media-storage: Install mscorefonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T140141#2465571 (10fgiunchedi) p:05Triage>03Normal @kaldari would the same requirement be satistifed by liberation fonts? (Times New Roman, Arial, Courier New) wrt ttf-m... 
[11:29:10] (03CR) 10jenkins-bot: [V: 04-1] VCL: add calls for cluster/layer vcl_backend_fetch for V4 [puppet] - 10https://gerrit.wikimedia.org/r/299129 (https://phabricator.wikimedia.org/T131502) (owner: 10BBlack) [11:29:12] (03PS2) 10BBlack: upload VCL: X-Range hack for V4 [puppet] - 10https://gerrit.wikimedia.org/r/299130 (https://phabricator.wikimedia.org/T131502) [11:29:14] (03PS2) 10BBlack: VCL: add call for cluster/layer vcl_backend_fetch for V4 [puppet] - 10https://gerrit.wikimedia.org/r/299129 (https://phabricator.wikimedia.org/T131502) [11:29:16] (03PS4) 10BBlack: cache_upload VCL forward port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/298744 (https://phabricator.wikimedia.org/T131502) (owner: 10Ema) [11:29:18] (03PS2) 10BBlack: upload VCL: prep for easier V4 migration [puppet] - 10https://gerrit.wikimedia.org/r/299126 (https://phabricator.wikimedia.org/T131502) [11:29:40] (03CR) 10jenkins-bot: [V: 04-1] upload VCL: X-Range hack for V4 [puppet] - 10https://gerrit.wikimedia.org/r/299130 (https://phabricator.wikimedia.org/T131502) (owner: 10BBlack) [11:29:45] (03PS1) 10Muehlenhoff: Provide fontconfig configuration which forces antialiasing [puppet] - 10https://gerrit.wikimedia.org/r/299131 (https://phabricator.wikimedia.org/T139543) [11:30:53] 06Operations, 10Fundraising Tech Backlog, 10Mail: Add granularity limiter (g=) to wikimedia.org DKIM record(s) - https://phabricator.wikimedia.org/T140316#2465591 (10fgiunchedi) p:05Triage>03Normal [11:32:03] (03CR) 10jenkins-bot: [V: 04-1] upload VCL: X-Range hack for V4 [puppet] - 10https://gerrit.wikimedia.org/r/299130 (https://phabricator.wikimedia.org/T131502) (owner: 10BBlack) [11:32:15] (03CR) 10jenkins-bot: [V: 04-1] VCL: add call for cluster/layer vcl_backend_fetch for V4 [puppet] - 10https://gerrit.wikimedia.org/r/299129 (https://phabricator.wikimedia.org/T131502) (owner: 10BBlack) [11:33:41] (03PS3) 10BBlack: upload VCL: X-Range hack for V4 [puppet] - 10https://gerrit.wikimedia.org/r/299130 
(https://phabricator.wikimedia.org/T131502) [11:33:43] (03PS3) 10BBlack: VCL: add call for cluster/layer vcl_backend_fetch for V4 [puppet] - 10https://gerrit.wikimedia.org/r/299129 (https://phabricator.wikimedia.org/T131502) [11:34:49] (03PS2) 10BBlack: cache_text: raise FE mem size to 50% [puppet] - 10https://gerrit.wikimedia.org/r/298972 (https://phabricator.wikimedia.org/T135384) [11:35:14] (03CR) 10BBlack: [C: 032 V: 032] cache_text: raise FE mem size to 50% [puppet] - 10https://gerrit.wikimedia.org/r/298972 (https://phabricator.wikimedia.org/T135384) (owner: 10BBlack) [11:35:24] (03PS2) 10BBlack: cache_upload: raise FE mem size to 50% [puppet] - 10https://gerrit.wikimedia.org/r/298973 (https://phabricator.wikimedia.org/T135384) [11:35:33] (03CR) 10BBlack: [C: 032 V: 032] cache_upload: raise FE mem size to 50% [puppet] - 10https://gerrit.wikimedia.org/r/298973 (https://phabricator.wikimedia.org/T135384) (owner: 10BBlack) [11:39:30] 06Operations, 10ops-eqiad, 10DBA: db1034 lag - https://phabricator.wikimedia.org/T139280#2465609 (10jcrespo) p:05Normal>03Low [11:41:24] (03CR) 10Muehlenhoff: "PCC: http://puppet-compiler.wmflabs.org/3350/" [puppet] - 10https://gerrit.wikimedia.org/r/299131 (https://phabricator.wikimedia.org/T139543) (owner: 10Muehlenhoff) [11:41:29] (03PS2) 10Muehlenhoff: Provide fontconfig configuration which forces antialiasing [puppet] - 10https://gerrit.wikimedia.org/r/299131 (https://phabricator.wikimedia.org/T139543) [11:46:55] 06Operations, 06Commons, 10media-storage, 13Patch-For-Review, 07User-notice: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2465654 (10MoritzMuehlenhoff) @Menner: On these systems rsvg-convert is the only application processing... 
[11:50:44] 06Operations, 06Commons, 10media-storage, 13Patch-For-Review, 07User-notice: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2465659 (10Menner) >>! In T139543#2465507, @MoritzMuehlenhoff wrote: > I've added such a file locally to... [11:51:19] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: puppet fail [11:52:18] 06Operations, 10MediaWiki-Export-or-Import, 10Wikimedia-Site-requests, 13Patch-For-Review, 05WMF-deploy-2016-07-12_(1.28.0-wmf.10): Transwiki import not working in production - https://phabricator.wikimedia.org/T140206#2465660 (10BBlack) We only reported on logged-in account access during the final phase... [11:56:18] !log varnish: starting rolling, depooled restart of text and upload frontend caches [11:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:08:34] !log varnish: rolling frontend restarts for text+upload done [12:08:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:15:16] (03PS1) 10Yuvipanda: tools: Set homedir permissions properly in kube-maintainusers [puppet] - 10https://gerrit.wikimedia.org/r/299133 (https://phabricator.wikimedia.org/T140460) [12:18:58] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:23:14] (03CR) 10Luke081515: [C: 031] Enable global abuse filters on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299111 (https://phabricator.wikimedia.org/T140395) (owner: 10Urbanecm) [12:29:02] (03PS2) 10Yuvipanda: tools: Set homedir permissions properly in kube-maintainusers [puppet] - 10https://gerrit.wikimedia.org/r/299133 (https://phabricator.wikimedia.org/T140460) [12:33:50] (03PS3) 10Yuvipanda: tools: Set homedir permissions properly in kube-maintainusers [puppet] - 10https://gerrit.wikimedia.org/r/299133 (https://phabricator.wikimedia.org/T140460) [12:35:02] 
(03PS4) 10Yuvipanda: tools: Set homedir permissions properly in kube-maintainusers [puppet] - 10https://gerrit.wikimedia.org/r/299133 (https://phabricator.wikimedia.org/T140460) [12:35:16] 06Operations, 06Commons, 10media-storage, 13Patch-For-Review, 07User-notice: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2465748 (10MoritzMuehlenhoff) It provides a system-wide configuration stanza for fontconfig to force ant... [12:41:06] (03PS5) 10Yuvipanda: tools: Set homedir permissions properly in kube-maintainusers [puppet] - 10https://gerrit.wikimedia.org/r/299133 (https://phabricator.wikimedia.org/T140460) [13:13:17] (03PS1) 10Muehlenhoff: Remove access credentials for jzerebecki [puppet] - 10https://gerrit.wikimedia.org/r/299141 [13:14:38] (03CR) 10Muehlenhoff: [C: 032 V: 032] Remove access credentials for jzerebecki [puppet] - 10https://gerrit.wikimedia.org/r/299141 (owner: 10Muehlenhoff) [13:14:49] (03PS3) 10Yuvipanda: labs: Remove nfs for deployment-prep \o/ [puppet] - 10https://gerrit.wikimedia.org/r/299123 (https://phabricator.wikimedia.org/T102953) [13:22:47] (03CR) 10Yuvipanda: [C: 032] labs: Remove nfs for deployment-prep \o/ [puppet] - 10https://gerrit.wikimedia.org/r/299123 (https://phabricator.wikimedia.org/T102953) (owner: 10Yuvipanda) [13:28:19] wow really? awesome! [13:29:14] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0] [13:29:14] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0] [13:30:26] apergos yes, finally. krenair finished it up :) [13:31:42] big spike in 50x from https://grafana.wikimedia.org/dashboard/db/varnish-http-errors [13:32:37] schweeeet [13:35:33] seems to be api related? [13:35:38] (03CR) 10Ottomata: "OK! I think we can clean up some crap I did a while ago too. 
We can reduce the type=GarbageCollector objects to:" (031 comment) [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/299118 (owner: 10Elukey) [13:35:43] at least from a quick peek from oxygen [13:35:46] (03CR) 10Ema: [C: 031] upload VCL: prep for easier V4 migration [puppet] - 10https://gerrit.wikimedia.org/r/299126 (https://phabricator.wikimedia.org/T131502) (owner: 10BBlack) [13:36:00] (03CR) 10Ottomata: Add G1 GC to the supported JMX JVM metrics (031 comment) [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/299118 (owner: 10Elukey) [13:37:55] (03CR) 10Ottomata: "This would result in metric names like:" [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/299118 (owner: 10Elukey) [13:40:17] !log stress-test spinning disks on ms-be102[3-6] [13:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:41:25] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:41:26] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:41:45] (03PS2) 10Ottomata: Emit zookeeper server JMX metrics in zookeeper::jmxtrans class [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/299036 (https://phabricator.wikimedia.org/T137302) [13:43:06] (03CR) 10Ottomata: Emit zookeeper server JMX metrics in zookeeper::jmxtrans class (032 comments) [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/299036 (https://phabricator.wikimedia.org/T137302) (owner: 10Ottomata) [13:43:52] jynus: there was a big spike in 50x for API related to central auth, same issue that you were mentioning yesterday? [13:44:11] when?
[13:44:32] 10 mins ago https://grafana.wikimedia.org/dashboard/db/varnish-http-errors [13:44:40] icinga just recovered [13:45:27] nope, not that, the errors never go over 30/minute [13:45:53] let me check the exceptions [13:45:55] (03PS1) 10Giuseppe Lavagetto: puppetmaster: add test site to palladium [puppet] - 10https://gerrit.wikimedia.org/r/299145 (https://phabricator.wikimedia.org/T98173) [13:46:01] thanks! [13:46:56] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:47:13] (03PS1) 10Filippo Giunchedi: nutcracker: default verbosity to 4 [puppet] - 10https://gerrit.wikimedia.org/r/299146 (https://phabricator.wikimedia.org/T139786) [13:47:20] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: add test site to palladium [puppet] - 10https://gerrit.wikimedia.org/r/299145 (https://phabricator.wikimedia.org/T98173) (owner: 10Giuseppe Lavagetto) [13:47:33] the errors from that last time seem to come from the api [13:47:40] but from redis, not the database [13:47:46] (03PS4) 10Filippo Giunchedi: admin: add test for absented users not in 'absented' group [puppet] - 10https://gerrit.wikimedia.org/r/299003 [13:47:53] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] admin: add test for absented users not in 'absented' group [puppet] - 10https://gerrit.wikimedia.org/r/299003 (owner: 10Filippo Giunchedi) [13:47:54] Duplicate get(): "{key}" fetched {count} times [13:48:44] those happen up to 500 times in 0.1 seconds [13:48:55] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [13:50:21] or 195069 times in the last hour [13:51:01] but they seem to be warnings, not errors [13:51:42] where are you checking? API host or fluorine?
[13:51:45] (just to learn) [13:52:04] on kibana [13:52:32] https://logstash.wikimedia.org [13:52:43] (03PS3) 10Ottomata: Emit zookeeper server JMX metrics in zookeeper::jmxtrans class [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/299036 (https://phabricator.wikimedia.org/T137302) [13:53:04] from what I can see, I do not think those are reflected in mediawiki errors [13:53:19] 06Operations, 06Commons, 10media-storage, 13Patch-For-Review, 07User-notice: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2465900 (10MoritzMuehlenhoff) I'm away next week, but Giuseppe volunteered to review/merge this next wee... [13:53:33] 06Operations, 06Commons, 10media-storage, 13Patch-For-Review, 07User-notice: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2465901 (10MoritzMuehlenhoff) a:03Joe [13:53:34] (03PS3) 10Ottomata: Update zookeeper submodule and configure sending zookeeper jmx stats to statsd [puppet] - 10https://gerrit.wikimedia.org/r/299039 (https://phabricator.wikimedia.org/T137302) [13:53:44] ah I always forget about mediawiki-errors [13:54:47] oh, I found it [13:54:55] it is our good old friend pageviews [13:55:06] so, not mediawiki at all [13:55:24] (03CR) 10Ottomata: Introduce wmde-analytics-users group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/298928 (https://phabricator.wikimedia.org/T140342) (owner: 10Addshore) [13:55:50] jynus: from oxygen I saw pageview issues but the majority was API-related [13:55:50] I may have misled you yesterday [13:55:53] no?
[13:56:14] but check the URLs: /api/rest_v1/metrics/pageviews/* [13:56:24] (03CR) 10Filippo Giunchedi: Introduce wmde-analytics-users group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/298928 (https://phabricator.wikimedia.org/T140342) (owner: 10Addshore) [13:56:51] despite being en.wikimedia.org, they are served by a completely different app [13:57:35] (03PS2) 10Giuseppe Lavagetto: puppetmaster: add test site to palladium [puppet] - 10https://gerrit.wikimedia.org/r/299145 (https://phabricator.wikimedia.org/T98173) [13:57:55] I see some "?centralauthtoken" queries, you are right [13:58:40] but the errors do not appear on the app? [13:59:31] (03CR) 10Ottomata: "OH I see what you are saying about the invalid group_prefix thing. Ja that is weird, I guess we will see." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/299039 (https://phabricator.wikimedia.org/T137302) (owner: 10Ottomata) [14:00:45] jynus: might have been mw1277 misbehaving https://logstash.wikimedia.org/#/dashboard/elasticsearch/hhvm - logstash timing does not match exactly (10 mins later) but it could have been it [14:01:33] even if those are infos [14:01:35] checking logs [14:01:41] no, your initial thought fits: https://logstash.wikimedia.org/#dashboard/temp/AVXu3JKcT4MudYQNSuOT [14:01:47] you should report it on the bug [14:03:30] this is T119736 or T139970 [14:03:30] T139970: Centralauth last deployment creating database contention on CentralAuthUser::saveSettings (Lock wait timeout exceeded; try restarting transaction) - https://phabricator.wikimedia.org/T139970 [14:03:30] T119736: Could not find local user data for {Username}@{wiki} - https://phabricator.wikimedia.org/T119736 [14:04:22] yeah I double checked mw1277, not the culprit but I saw some proxy-server/500 ?centralauthtoken [14:06:30] 06Operations, 06Analytics-Kanban, 10Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2465959 (10Ottomata) NICE WORK!
[14:08:05] (03CR) 10Filippo Giunchedi: "LGTM, a couple of nits" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/298976 (owner: 10Faidon Liambotis) [14:08:34] (03CR) 10Addshore: Introduce wmde-analytics-users group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/298928 (https://phabricator.wikimedia.org/T140342) (owner: 10Addshore) [14:11:57] (03PS3) 10Giuseppe Lavagetto: puppetmaster: add test site to palladium [puppet] - 10https://gerrit.wikimedia.org/r/299145 (https://phabricator.wikimedia.org/T98173) [14:15:44] (03PS4) 10Ema: upload VCL: X-Range hack for V4 [puppet] - 10https://gerrit.wikimedia.org/r/299130 (https://phabricator.wikimedia.org/T131502) (owner: 10BBlack) [14:19:17] (03CR) 10Ema: [C: 031] VCL: add call for cluster/layer vcl_backend_fetch for V4 [puppet] - 10https://gerrit.wikimedia.org/r/299129 (https://phabricator.wikimedia.org/T131502) (owner: 10BBlack) [14:20:18] (03CR) 10Filippo Giunchedi: Introduce wmde-analytics-users group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/298928 (https://phabricator.wikimedia.org/T140342) (owner: 10Addshore) [14:21:44] (03PS1) 10Ottomata: Stop kafka mirror maker on kafka100[12], it is not doing anything anyway [puppet] - 10https://gerrit.wikimedia.org/r/299149 (https://phabricator.wikimedia.org/T138265) [14:22:56] (03CR) 10jenkins-bot: [V: 04-1] Stop kafka mirror maker on kafka100[12], it is not doing anything anyway [puppet] - 10https://gerrit.wikimedia.org/r/299149 (https://phabricator.wikimedia.org/T138265) (owner: 10Ottomata) [14:22:58] (03CR) 10Ottomata: [C: 032] Stop kafka mirror maker on kafka100[12], it is not doing anything anyway [puppet] - 10https://gerrit.wikimedia.org/r/299149 (https://phabricator.wikimedia.org/T138265) (owner: 10Ottomata) [14:24:07] (03PS2) 10Ottomata: Stop kafka mirror maker on kafka100[12], it is not doing anything anyway [puppet] - 10https://gerrit.wikimedia.org/r/299149 (https://phabricator.wikimedia.org/T138265) [14:24:15] (03CR) 10Ottomata: [C: 
032] Emit zookeeper server JMX metrics in zookeeper::jmxtrans class [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/299036 (https://phabricator.wikimedia.org/T137302) (owner: 10Ottomata) [14:25:42] (03CR) 10Ottomata: [C: 032] Stop kafka mirror maker on kafka100[12], it is not doing anything anyway [puppet] - 10https://gerrit.wikimedia.org/r/299149 (https://phabricator.wikimedia.org/T138265) (owner: 10Ottomata) [14:28:54] anomie, I love your comments [14:29:12] and I also want to thank you for your bug fixes [14:29:38] (03CR) 10BBlack: [C: 031] upload VCL: X-Range hack for V4 [puppet] - 10https://gerrit.wikimedia.org/r/299130 (https://phabricator.wikimedia.org/T131502) (owner: 10BBlack) [14:29:56] ottomata: did you still need to talk about hooks or something? I can do that mon. [14:29:56] (03PS4) 10Ottomata: Update zookeeper submodule and configure sending zookeeper jmx stats to statsd [puppet] - 10https://gerrit.wikimedia.org/r/299039 (https://phabricator.wikimedia.org/T137302) [14:30:06] (03CR) 10BBlack: [C: 031] cache_upload VCL forward port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/298744 (https://phabricator.wikimedia.org/T131502) (owner: 10Ema) [14:30:13] * AaronSchulz is still on theoretical "break" today. [14:30:48] AaronSchulz: aye ok cool. i think i will eventually, but i think we are good for the moment. 
we have to revisit a couple of schema design issues before we keep moving forward [14:31:03] thanks for reaching out though, will ping you when we know a little more [14:31:29] jynus: You're welcome [14:31:37] (03CR) 10Ottomata: [C: 032] Update zookeeper submodule and configure sending zookeeper jmx stats to statsd [puppet] - 10https://gerrit.wikimedia.org/r/299039 (https://phabricator.wikimedia.org/T137302) (owner: 10Ottomata) [14:32:36] PROBLEM - Kafka MirrorMaker analytics-eqiad on kafka1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/analytics-eqiad/producer\.properties [14:33:43] OH [14:33:46] woops, that's me [14:33:52] need to remove monitoring from stored configs [14:34:43] man puppetstoredconfigclean.rb is harsh...'Killing kafka1001.eqiad.wmnet...done.' [14:35:36] (03PS1) 10Krinkle: Lower default $wgSquidMaxage from 31 days to 14 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299153 (https://phabricator.wikimedia.org/T124954) [14:35:39] (03CR) 10Filippo Giunchedi: [C: 04-1] "overall LGTM, though user/pass on the command line and using real ldap accounts is a showstopper IMO" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [14:39:37] PROBLEM - puppet last run on db2063 is CRITICAL: CRITICAL: puppet fail [14:43:54] 06Operations, 10hardware-requests: Find and rack 2 EX4200s in rack c1-eqiad - https://phabricator.wikimedia.org/T139752#2466044 (10fgiunchedi) p:05Triage>03Normal [14:43:55] 06Operations, 10ops-eqiad: ms-be1021.eqiad.wmnet: slot=1I:1:2 dev=sdh failed - https://phabricator.wikimedia.org/T139767#2466047 (10fgiunchedi) p:05Triage>03Normal [14:47:18] jynus: ok to leave the DBA tasks untriaged from the operations queue? 
[14:47:50] 06Operations, 13Patch-For-Review: Rotate (nutcracker) logs more frequently on terbium to save disk space - https://phabricator.wikimedia.org/T139786#2466050 (10fgiunchedi) p:05Triage>03Normal [14:47:54] as I said to moritz last week, if they are on the DBA queue, it means they have already been triaged [14:48:55] (03PS4) 10Faidon Liambotis: admin: add an NDA audit helper script [puppet] - 10https://gerrit.wikimedia.org/r/298976 [14:49:20] err, yeah I meant on 'needs triage' priority, ok to leave that alone too I guess (?) [14:53:23] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2466065 (10chasemp) [14:53:25] 06Operations, 06Labs: Soft mount /data/scratch - https://phabricator.wikimedia.org/T127561#2466062 (10chasemp) 05Open>03Resolved a:03chasemp I forgot this had it's own task, was done https://gerrit.wikimedia.org/r/#/c/289903/ and linked to parent [14:53:44] 06Operations, 10Cassandra: Update Cassandra in Wikimedia APT repository - https://phabricator.wikimedia.org/T140409#2466070 (10fgiunchedi) p:05Triage>03Normal agreed, the only cassandra cluster not on 2.2 is maps IIRC? (cc @akosiaris @Yurik) [14:56:13] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2466088 (10chasemp) [14:56:15] 06Operations, 06Labs: Soft mount /data/dumps - https://phabricator.wikimedia.org/T127560#2466085 (10chasemp) 05Open>03Resolved a:03chasemp I forgot this had it's own task, was done https://gerrit.wikimedia.org/r/#/c/289903/ and linked to parent [14:56:39] 06Operations, 10MediaWiki-JobQueue: Restore 30 minutes delayed list update to no waiting, to stop killing sandbox functionality - https://phabricator.wikimedia.org/T139893#2466090 (10fgiunchedi) p:05Triage>03Normal do we know what commit/change broke this and when? 
[14:56:58] 06Operations, 06Discovery, 06Maps, 10Maps-data, 10hardware-requests: 2 servers for maps-beta cluster - https://phabricator.wikimedia.org/T138600#2466092 (10fgiunchedi) p:05Triage>03Normal [14:57:32] 06Operations, 10Analytics-Cluster, 10Packaging: libcglib3-java replaces libcglib-java in Jessie - https://phabricator.wikimedia.org/T137791#2466093 (10fgiunchedi) p:05Triage>03Low [14:57:44] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10EventBus, and 2 others: Better monitoring for Zookeeper - https://phabricator.wikimedia.org/T137302#2466094 (10Ottomata) Woot, preliminary dash here: https://grafana.wikimedia.org/dashboard/db/zookeeper [14:59:19] (03PS6) 10Yuvipanda: tools: Set homedir permissions properly in kube-maintainusers [puppet] - 10https://gerrit.wikimedia.org/r/299133 (https://phabricator.wikimedia.org/T140460) [15:02:09] (03CR) 10BBlack: [C: 031] Lower default $wgSquidMaxage from 31 days to 14 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299153 (https://phabricator.wikimedia.org/T124954) (owner: 10Krinkle) [15:02:26] RECOVERY - puppet last run on db2063 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:04:32] (03PS2) 10Jgreen: SPF/DKIM records for wikipedia.org and domains sharing that zonefile. [dns] - 10https://gerrit.wikimedia.org/r/298500 (https://phabricator.wikimedia.org/T135410) [15:07:41] 06Operations, 10Cassandra: Update Cassandra in Wikimedia APT repository - https://phabricator.wikimedia.org/T140409#2466126 (10Eevans) >>! In T140409#2466070, @fgiunchedi wrote: > agreed, the only cassandra cluster not on 2.2 is maps IIRC? (cc @akosiaris @Yurik) It is worth noting, that any machine currently... 
[15:07:52] (03PS7) 10Rush: tools: Set homedir permissions properly in kube-maintainusers [puppet] - 10https://gerrit.wikimedia.org/r/299133 (https://phabricator.wikimedia.org/T140460) (owner: 10Yuvipanda) [15:12:30] 06Operations, 10Fundraising-Backlog, 10fundraising-tech-ops, 13Patch-For-Review: Allow Fundraising to A/B test wikipedia.org as send domain - https://phabricator.wikimedia.org/T135410#2466161 (10Jgreen) I think all the necessary DNS config changes are in https://gerrit.wikimedia.org/r/#/c/298500/ and that... [15:14:02] (03PS1) 10Ottomata: Include icinga alerts for zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/299160 (https://phabricator.wikimedia.org/T137302) [15:14:36] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10EventBus, and 2 others: Better monitoring for Zookeeper - https://phabricator.wikimedia.org/T137302#2466180 (10mobrovac) Nice! [15:15:20] (03CR) 10jenkins-bot: [V: 04-1] Include icinga alerts for zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/299160 (https://phabricator.wikimedia.org/T137302) (owner: 10Ottomata) [15:17:19] (03PS2) 10Ottomata: Include icinga alerts for zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/299160 (https://phabricator.wikimedia.org/T137302) [15:20:34] (03PS3) 10Elukey: Refactor the JMX GC metrics definition [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/299118 [15:22:06] (03CR) 10Ottomata: [C: 031] "One nit, +1 otherwise, merge away!" (031 comment) [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/299118 (owner: 10Elukey) [15:22:23] ottomata: yeah I cut too much, I was amending :) [15:23:11] ah sorry no, not related to your comment.. anyhow, I think I removed # These only show up for Java 7 etc. [15:23:19] that probably can be refactored in the same way? 
[15:23:49] (03PS1) 10Chad: Gerrit: Don't install defaults file, package provides it [puppet] - 10https://gerrit.wikimedia.org/r/299163 [15:24:32] (03PS1) 10Chad: Minor tweaks to 2.12.2 package [debs/gerrit] - 10https://gerrit.wikimedia.org/r/299164 [15:24:41] (03PS4) 10Elukey: Refactor the JMX GC metrics definition [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/299118 [15:24:44] probably not worth it [15:24:47] this one should be good [15:24:50] pcc then merge [15:25:04] elukey: probably, but ja maybe in another patch if you get motivated :) [15:25:14] exactly :P [15:25:21] (03CR) 10Ottomata: [C: 032] Include icinga alerts for zookeeper [puppet] - 10https://gerrit.wikimedia.org/r/299160 (https://phabricator.wikimedia.org/T137302) (owner: 10Ottomata) [15:26:12] (03PS5) 10Elukey: Refactor the JMX GC metrics definition [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/299118 [15:26:57] 06Operations, 10Parsoid, 06Services, 10service-runner, and 2 others: Replace custom server.js with service-runner - https://phabricator.wikimedia.org/T90668#2466232 (10mobrovac) [15:29:46] !log restarting hadoop-mapreduce-historyserver to apply yarn log aggreation retention settings [15:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:18] OH [15:30:21] oops wrong chat [15:33:14] (03PS8) 10Yuvipanda: tools: Set homedir permissions properly in kube-maintainusers [puppet] - 10https://gerrit.wikimedia.org/r/299133 (https://phabricator.wikimedia.org/T140460) [15:33:27] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10EventBus, and 2 others: Better monitoring for Zookeeper - https://phabricator.wikimedia.org/T137302#2466240 (10Ottomata) Alerts too! https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=conf1001&service=Zookeeper+Alive+Client+Connecti... 
[15:35:02] (03PS9) 10Yuvipanda: tools: Set homedir permissions properly in kube-maintainusers [puppet] - 10https://gerrit.wikimedia.org/r/299133 (https://phabricator.wikimedia.org/T140460) [15:37:22] !log restbase deploy start of 731284b [15:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:37:52] mdholloway: ^ [15:49:37] !log restbase deploy end of 731284b [15:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:51:26] 06Operations, 10MediaWiki-JobQueue: Restore 30 minutes delayed list update to no waiting, to stop killing sandbox functionality - https://phabricator.wikimedia.org/T139893#2466263 (10ManosHacker) No, we do not. It was first reported on July 2, 2016. [15:54:05] (03PS1) 10Mobrovac: [Beta] Change-prop: Adjust the ORES URI in BetaCluster [puppet] - 10https://gerrit.wikimedia.org/r/299169 [15:55:26] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: Puppet has 1 failures [15:57:01] (03CR) 10BryanDavis: [C: 031] "code looks ok; untested" [puppet] - 10https://gerrit.wikimedia.org/r/299133 (https://phabricator.wikimedia.org/T140460) (owner: 10Yuvipanda) [15:58:32] (03CR) 10Mobrovac: "Cherry-picked in beta, works." 
[puppet] - 10https://gerrit.wikimedia.org/r/299169 (owner: 10Mobrovac) [16:06:23] (03CR) 10Mark Bergsma: [C: 032] [Beta] Change-prop: Adjust the ORES URI in BetaCluster [puppet] - 10https://gerrit.wikimedia.org/r/299169 (owner: 10Mobrovac) [16:08:29] (03PS10) 10Yuvipanda: tools: Set homedir permissions properly in kube-maintainusers [puppet] - 10https://gerrit.wikimedia.org/r/299133 (https://phabricator.wikimedia.org/T140460) [16:09:13] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Set homedir permissions properly in kube-maintainusers [puppet] - 10https://gerrit.wikimedia.org/r/299133 (https://phabricator.wikimedia.org/T140460) (owner: 10Yuvipanda) [16:10:00] 06Operations, 06Analytics-Kanban, 10Traffic: Verify why varnishkafka stats and webrequest logs count differs - https://phabricator.wikimedia.org/T136314#2466298 (10Nuria) 05Open>03Resolved [16:20:45] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:21:07] (03PS1) 10Yuvipanda: Followup to I23e06ea5d [puppet] - 10https://gerrit.wikimedia.org/r/299174 [16:21:37] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, 06WMF-NDA-Requests: Determine a core set or a checklist of permissions for deployment purpose - https://phabricator.wikimedia.org/T140270#2466325 (10greg) The only part that I'm not 100% sure about and would like fee... 
[16:22:33] (03PS2) 10Yuvipanda: Followup to I23e06ea5d [puppet] - 10https://gerrit.wikimedia.org/r/299174 [16:22:42] (03CR) 10Yuvipanda: [C: 032 V: 032] Followup to I23e06ea5d [puppet] - 10https://gerrit.wikimedia.org/r/299174 (owner: 10Yuvipanda) [16:23:24] 07Puppet, 10Continuous-Integration-Config, 07Jenkins: jenkins homedir on nodepool slaves is in /home/jenkins but this doesn't seem to be anywhere in puppet - https://phabricator.wikimedia.org/T140417#2466330 (10greg) p:05Triage>03Normal [16:24:46] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [16:25:57] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [16:29:56] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:30:46] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:47:30] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/298976 (owner: 10Faidon Liambotis) [16:51:46] (03PS15) 10Paladox: Add missing roottree, file configs to gerrit.config.erb [puppet] - 10https://gerrit.wikimedia.org/r/298710 [16:55:04] PROBLEM - HP RAID on ms-be1022 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [16:55:53] me ^ stress-testing ongoing [16:59:01] nice godog :) [16:59:37] chasemp: heheh indeed, we'll see, the controller mustn't be amused [17:11:39] Danny_B, hi .. [17:12:28] I saw your update https://phabricator.wikimedia.org/T134423#2466396 ... it could be because it uses a template that uses a self-closing html5 tag .. but, do you have a sample for me to look at? 
[17:16:34] PROBLEM - puppet last run on cp2018 is CRITICAL: CRITICAL: Puppet has 1 failures [17:20:50] (03PS1) 10ArielGlenn: clean up arg parsing, this db host checker will have a number of args [software] - 10https://gerrit.wikimedia.org/r/299180 [17:20:52] (03PS1) 10ArielGlenn: limit list of db hosts to be checked by shards and or dcs [software] - 10https://gerrit.wikimedia.org/r/299181 [17:24:46] Any ops able to delete gallium:/home/demon/jenkins-test/ for me? I can't....? It's been there since October and...I don't know why.... [17:26:51] ostriches: yeah I can do it [17:27:06] ty [17:27:12] subbu: hi, mmt [17:27:30] Danny_B, ok. I responded on the ticket. [17:27:37] ostriches: done but no take-backs :) [17:27:52] ty again :) [17:31:15] (03PS5) 10Faidon Liambotis: admin: add an NDA audit helper script [puppet] - 10https://gerrit.wikimedia.org/r/298976 [17:31:22] subbu: if i put the source of the page to expandtemplates, the result doesn't show any selfclosed tag but those i mentioned [17:31:30] (03CR) 10Faidon Liambotis: [C: 032 V: 032] admin: add an NDA audit helper script [puppet] - 10https://gerrit.wikimedia.org/r/298976 (owner: 10Faidon Liambotis) [17:31:51] Danny_B, ok .. can you open a bug report with a link to a sample page? thanks. [17:34:49] 06Operations, 06Commons, 10media-storage: Install mscorefonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T140141#2466488 (10kaldari) @fgiunchedi: We already have the Liberation fonts installed (I believe), but they only cover 3 substitutions: Arial, Times, and Courier. [17:35:42] subbu: ah, nvm. i finally found it. the deal is, that the category isn't populated well. 
so such template is not in category of selfclosing tags [17:36:00] k [17:36:05] 06Operations, 06Labs: Create an NFS mount manager - https://phabricator.wikimedia.org/T140483#2466489 (10chasemp) [17:36:25] 06Operations, 06Labs: Create an NFS mount manager - https://phabricator.wikimedia.org/T140483#2466506 (10chasemp) [17:36:42] subbu: can the population of the category be enforced somehow? [17:37:52] can we move to #mediawiki-parsoid? can also chat there with scott. [17:38:35] Danny_B, ^ [17:41:14] 06Operations, 10Revision-Scoring-As-A-Service-Backlog: Set up oresrdb redis node in codfw - https://phabricator.wikimedia.org/T139372#2466519 (10Halfak) Thanks @fgiunchedi. [17:42:19] subbu: sure. you poked me here... ;-) [17:43:01] RECOVERY - puppet last run on cp2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:44:29] (03PS1) 10Paladox: Enable gpg keys in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/299182 [17:45:28] (03CR) 10MaxSem: [C: 04-1] Externalize Postgresql user creation from role::osm::master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/297786 (owner: 10Gehel) [17:46:19] (03CR) 10Aklapper: "@Paladox: In the commit message: Could you replace "now" by actual dates/vcersion numbers? Could you fix the typos (doint; gpk)? Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/299182 (owner: 10Paladox) [17:48:43] (03CR) 10Gehel: Externalize Postgresql user creation from role::osm::master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/297786 (owner: 10Gehel) [17:48:52] RECOVERY - HP RAID on ms-be1022 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [17:50:37] (03PS2) 10Paladox: Enable GPG keys in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/299182 [17:50:53] (03PS3) 10Paladox: Enable gpg keys in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/299182 [17:52:04] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06Release-Engineering-Team, and 2 others: Determine a core set or a checklist of permissions for deployment purpose - https://phabricator.wikimedia.org/T140270#2466552 (10greg) Also adding #security-team for their feedback on this proposal and my... [17:52:22] (03PS2) 10Gehel: postgresql: fix user existence check [puppet] - 10https://gerrit.wikimedia.org/r/299075 (owner: 10MaxSem) [17:52:46] (03CR) 10Paladox: "Ok, done and your welcome." [puppet] - 10https://gerrit.wikimedia.org/r/299182 (owner: 10Paladox) [17:58:00] (03CR) 10Gehel: [C: 032] postgresql: fix user existence check [puppet] - 10https://gerrit.wikimedia.org/r/299075 (owner: 10MaxSem) [17:58:44] (03CR) 10Chad: "What's the use case? Do we need signed pushes?" [puppet] - 10https://gerrit.wikimedia.org/r/299182 (owner: 10Paladox) [18:00:37] (03CR) 10Paladox: "The use case is that users are now verified that they have verified there GPG key, it is like GitHub. 
But now git push --signed is now val" [puppet] - 10https://gerrit.wikimedia.org/r/299182 (owner: 10Paladox) [18:04:59] (03PS2) 10Gehel: Externalize Postgresql user creation from role::osm::master [puppet] - 10https://gerrit.wikimedia.org/r/297786 [18:06:11] PROBLEM - puppet last run on cp1060 is CRITICAL: CRITICAL: Puppet has 1 failures [18:25:55] (03CR) 10Ori.livneh: [C: 032] Lower default $wgSquidMaxage from 31 days to 14 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299153 (https://phabricator.wikimedia.org/T124954) (owner: 10Krinkle) [18:26:48] (03Merged) 10jenkins-bot: Lower default $wgSquidMaxage from 31 days to 14 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299153 (https://phabricator.wikimedia.org/T124954) (owner: 10Krinkle) [18:29:25] !log ori@tin Synchronized wmf-config/InitialiseSettings.php: I8bf7c8dd: Lower default $wgSquidMaxage from 31 days to 14 days (duration: 00m 39s) [18:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:31:12] RECOVERY - puppet last run on cp1060 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [19:04:02] (03PS1) 10Andrew Bogott: Add diamond collector for rabbitmq stats [puppet] - 10https://gerrit.wikimedia.org/r/299193 [19:05:12] PROBLEM - puppet last run on rcs1002 is CRITICAL: CRITICAL: Puppet has 1 failures [19:06:13] (03CR) 10jenkins-bot: [V: 04-1] Add diamond collector for rabbitmq stats [puppet] - 10https://gerrit.wikimedia.org/r/299193 (owner: 10Andrew Bogott) [19:07:27] (03PS2) 10Andrew Bogott: Add diamond collector for rabbitmq stats [puppet] - 10https://gerrit.wikimedia.org/r/299193 [19:08:47] (03CR) 10jenkins-bot: [V: 04-1] Add diamond collector for rabbitmq stats [puppet] - 10https://gerrit.wikimedia.org/r/299193 (owner: 10Andrew Bogott) [19:10:04] (03PS3) 10Andrew Bogott: Add diamond collector for rabbitmq stats [puppet] - 10https://gerrit.wikimedia.org/r/299193 [19:11:10] (03CR) 10jenkins-bot: [V: 04-1] Add 
diamond collector for rabbitmq stats [puppet] - 10https://gerrit.wikimedia.org/r/299193 (owner: 10Andrew Bogott) [19:12:00] (03PS4) 10Andrew Bogott: Add diamond collector for rabbitmq stats [puppet] - 10https://gerrit.wikimedia.org/r/299193 [19:18:42] Hi, why can be https://cs.wikipedia.org/w/index.php?title=GEJMR&action=edit&redlink=1 created by anon user even Jan Kovář BK (one of cswiki's sysop) has semiprotected the page? See https://cs.wikipedia.org/w/index.php?title=Speci%C3%A1ln%C3%AD%3AProtokolovac%C3%AD_z%C3%A1znamy&type=protect&user=&page=GEJMR&year=&month=-1&tagfilter=&subtype=&uselang=en for logs. Should I fill a phab ticket? [19:19:57] (03CR) 10Dzahn: [C: 031] "looks alright. just please make sure it's really documented along with the other admin groups and what exactly it is for. maybe you could " [puppet] - 10https://gerrit.wikimedia.org/r/298928 (https://phabricator.wikimedia.org/T140342) (owner: 10Addshore) [19:20:08] Urbanecm: when a page is created, the 'create' protection disappears [19:20:16] Urbanecm: and doesn't magically re-appear if the page is deleted [19:20:38] Urbanecm: https://cs.wikipedia.org/w/index.php?title=Speci%C3%A1ln%C3%AD%3AProtokolovac%C3%AD_z%C3%A1znamy&type=&user=&page=GEJMR&year=&month=-1&tagfilter=&subtype= the pages was created by someone since it was last protected, then deleted; it is again not protected right now [19:20:58] https://cs.wikipedia.org/w/index.php?title=GEJMR&action=info [19:22:11] Thanks for your explanation. So I'm going to semiprotect it again. [19:27:27] bblack: the varnish cache turnover is still 1 month, right? There isn't any new magic that makes new modules added to the HTML sooner than before? [19:27:32] also, hi! 
:) [19:32:12] RECOVERY - puppet last run on rcs1002 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [19:41:03] (03PS1) 10Dzahn: icinga: let Madhu run commands from webui [puppet] - 10https://gerrit.wikimedia.org/r/299196 (https://phabricator.wikimedia.org/T140422) [19:41:59] AndyRussG: that sounds like a complex question :) The original ~30d was just a cap, not an absolute. [19:42:14] AndyRussG: lots of things can make new modules added to the HTML show up sooner than before [19:42:23] AndyRussG: and the caps have been evolving, and there's no one number anymore :) [19:42:41] AndyRussG: in general, things are evolving in the direction of shorter and shorter cache lifetimes [19:43:57] (03PS1) 10Dzahn: nagios_common: add Madhu to sms (ops paging) group [puppet] - 10https://gerrit.wikimedia.org/r/299198 (https://phabricator.wikimedia.org/T140422) [19:46:15] 06Operations, 10Ops-Access-Requests, 06Labs, 13Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2464178 (10Andrew) I granted Madhu admin/admin in keystone so she should be able to view all the stats &c in Horizon. [19:53:51] bblack: K gotcha, thx! [19:54:27] bblack: So if on Wednesday we stopped adding some deprecated RL modules to the HTML, 30 days is a safe delay to actually remove the definitions? 
[20:36:24] (03PS2) 10Hashar: nutcracker: default verbosity to 4 [puppet] - 10https://gerrit.wikimedia.org/r/299146 (https://phabricator.wikimedia.org/T136078) (owner: 10Filippo Giunchedi) [20:37:25] (03CR) 10Hashar: [C: 031] "Poking T136078 which is beta cluster filling disk due to verbosity set at 5 :) The root cause is different (that is hieradata/role not b" [puppet] - 10https://gerrit.wikimedia.org/r/299146 (https://phabricator.wikimedia.org/T136078) (owner: 10Filippo Giunchedi) [21:07:58] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [21:09:40] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5089961 keys - replication_delay is 0 [21:33:54] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures [21:37:26] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2467273 (10ssastry) [21:53:44] PROBLEM - Getent speed check on labstore1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:54:15] PROBLEM - Check size of conntrack table on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:54:34] PROBLEM - DPKG on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:54:34] PROBLEM - Labs LDAP on seaborgium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:54:46] PROBLEM - Disk space on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:55:15] PROBLEM - Tool Labs instance distribution on labcontrol1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:55:16] PROBLEM - Tool Labs instance distribution on labcontrol1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:55:25] PROBLEM - configured eth on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:55:36] PROBLEM - dhclient process on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:55:55] PROBLEM - LibreNMS HTTPS on netmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:55:55] PROBLEM - puppet last run on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:55:56] PROBLEM - SSH on seaborgium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:56:00] According to witnesses, a military helicopter has opened fire over Ankara. [21:56:04] Woops wrong place [21:56:14] PROBLEM - salt-minion processes on seaborgium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:57:04] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [21:57:15] PROBLEM - puppetmaster https on labcontrol1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:59:24] RECOVERY - Getent speed check on labstore1002 is OK: OK: getent group returns within a second [22:02:54] PROBLEM - Puppet catalogue fetch on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/labs-puppetmaster/eqiad - 185 bytes in 54.478 second response time [22:04:55] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/service/start - 274 bytes in 0.298 second response time [22:05:56] PROBLEM - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 299 bytes in 48.742 second response time [22:06:04] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string OK not found on 
http://checker.tools.wmflabs.org:80/ldap - 341 bytes in 29.298 second response time [22:07:04] is it me or is Gerrit unresponsive? [22:07:15] RECOVERY - LibreNMS HTTPS on netmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8685 bytes in 7.358 second response time [22:07:29] some changes (diffs) are loading, others don't [22:07:44] Nope very slow for me [22:07:53] Maybe because ldap is down in labs [22:08:11] POST https://gerrit.wikimedia.org/r/gerrit_ui/rpc/ChangeDetailService hangs [22:08:17] ops are aware [22:08:28] ok [22:08:37] might want to put it in the topic then [22:09:16] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/service/start - 274 bytes in 25.390 second response time [22:09:16] PROBLEM - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/webservice/kubernetes - 299 bytes in 6.332 second response time [22:09:20] !log rebooting seaborgium (labs LDAP), cant be reached [22:10:07] waiting for job ... [22:10:25] PROBLEM - Puppet catalogue fetch on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/labs-puppetmaster/eqiad - 185 bytes in 33.363 second response time [22:10:28] logstash also down due to this? 
[22:11:17] I fixed the topics SPF|Cloud [22:11:20] thanks [22:11:29] in here and -labs [22:11:33] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/ldap - 237 bytes in 0.028 second response time [22:11:37] not -tech because it's not a wiki-facing thing [22:11:40] the VM has rebooted now [22:11:45] RECOVERY - Tool Labs instance distribution on labcontrol1002 is OK: OK: All critical toollabs instances are spread out enough [22:11:45] is it better? [22:11:48] there we go [22:11:54] RECOVERY - puppetmaster https on labcontrol1001 is OK: HTTP OK: Status line output matched 400 - 333 bytes in 2.660 second response time [22:12:04] RECOVERY - Puppet catalogue fetch on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 2.352 second response time [22:12:05] RECOVERY - salt-minion processes on seaborgium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:12:15] RECOVERY - SSH on seaborgium is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u2 (protocol 2.0) [22:12:15] RECOVERY - Check size of conntrack table on seaborgium is OK: OK: nf_conntrack is 0 % full [22:12:24] RECOVERY - Tool Labs instance distribution on labcontrol1001 is OK: OK: All critical toollabs instances are spread out enough [22:12:29] I can log in again [22:12:34] RECOVERY - configured eth on seaborgium is OK: OK - interfaces up [22:12:36] RECOVERY - Labs LDAP on seaborgium is OK: LDAP OK - 0.010 seconds response time [22:12:36] RECOVERY - DPKG on seaborgium is OK: All packages OK [22:12:45] RECOVERY - dhclient process on seaborgium is OK: PROCS OK: 0 processes with command name dhclient [22:12:52] good [22:13:03] confirm that [22:13:05] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [22:13:08] but the Gerrit problem is still there [22:13:15] so it must be 
something else [22:13:24] RECOVERY - Disk space on seaborgium is OK: DISK OK [22:13:34] RECOVERY - Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.087 second response time [22:14:27] SPF|Cloud, okay, so gerrit [22:14:32] can you give me specifics of what doesn't work? [22:14:54] Krenair just stays on Working screen [22:14:59] https://gerrit.wikimedia.org/r/gerrit_ui/rpc/ChangeDetailService [22:15:04] eh https://gerrit.wikimedia.org/r/#/c/299078/ * [22:15:29] https://gerrit.wikimedia.org/r/changes/299078/detail hangs [22:15:34] hmm [22:15:40] but https://gerrit.wikimedia.org/r/#/c/287145/ works [22:15:55] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit nfs-exports is failed [22:15:57] ssh to gerrit (gerrit itself, not the actual sshd on the machine) seems to hang [22:16:32] perhaps a restart will fix it? or can someone look at the error log? [22:17:07] !log restart nfs-exports on labstore1001 [22:17:54] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is OK: OK - nfs-exports is active [22:19:23] So I can ssh to gerrit and get the "Unfortunately, interactive shells are disabled." message, but when I try to make it execute a command once connected, it sits there [22:19:58] I can get on as root [22:20:09] yeah I expect sshd on the machine itself will still work [22:20:18] Paladox also reported Gerrit loading issues on another channel, but i can't confirm them. it works for me.. same links [22:20:40] it seems bunk now to me tho [22:20:45] going to restart [22:20:52] ok [22:20:59] why this happened I have no idea [22:21:27] it's back now [22:21:32] !log restart gerrit on ytterbium [22:21:33] paladox: fixed? 
[22:21:57] mutante yes [22:21:57] wfm now, and I got the 503 earlier [22:22:03] now appears to work [22:22:03] it's fixed now [22:22:05] (probably while it was rebooting) [22:22:08] 503 would've shown during restart [22:22:21] it's normal for a little while during restart, yep [22:22:43] * greg-g nods [22:22:51] thanks [22:22:55] it's possible gerrit (new version?) has some issue w/ ldap taking a vacation [22:23:15] the new version is gerrit-new though [22:23:17] chasemp but we're not on the new gerrit version yet [22:23:21] on another host [22:23:24] new version is still running separately [22:23:29] ah well then idk :) [22:23:31] paladox: is gerrit-new ok? [22:23:35] yess [22:23:36] yes [22:23:39] without restart [22:23:40] that worked for me [22:23:41] ok [22:23:49] appears to be [22:24:51] thanks for fixing the problem [22:24:52] :) [22:25:04] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 6 failures [22:25:52] hard to tell if gerrit-new would have been affected by it or not if people weren't actively using it during the LDAP downtime [22:25:57] inconclusive :) [22:26:06] wb morebots [22:27:26] is kubernetes still broken? 
[22:28:57] I checked the logs and a couple of the checker.tools.wmflabs.org alerts didn't recover [22:29:16] PROBLEM - NTP on seaborgium is CRITICAL: NTP CRITICAL: Offset unknown [22:29:23] http://checker.tools.wmflabs.org/webservice/kubernetes and http://checker.tools.wmflabs.org/service/start show commands failing [22:31:14] chasemp, ^ [22:31:24] RECOVERY - NTP on seaborgium is OK: NTP OK: Offset -0.002656936646 secs [22:36:07] krenair I think that is unrelated and the check itself is racy [22:38:11] They predate the ldap flake out [22:41:36] yes, those are unchanged since before that [22:41:54] i saw them earlier [22:52:22] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:11:03] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:14:52] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [23:16:17] Operations, IRCecho: ircecho should accept input via unix sockets - https://phabricator.wikimedia.org/T95053#2467515 (Danny_B) [23:16:31] Operations, IRCecho: Move ircecho config file to be YAML - https://phabricator.wikimedia.org/T95054#2467516 (Danny_B) [23:16:47] Operations, IRCecho, Patch-For-Review: Convert ircecho init script to a systemd unit - https://phabricator.wikimedia.org/T95055#2467517 (Danny_B) [23:16:57] Operations, IRCecho: Make ircecho run as its own user - https://phabricator.wikimedia.org/T76203#2467519 (Danny_B) [23:17:42] PROBLEM - puppet last run on mw2160 is CRITICAL: CRITICAL: Puppet has 1 failures [23:17:54] Operations, IRCecho: Make ircecho much better - https://phabricator.wikimedia.org/T95052#2467532 (Danny_B) [23:18:09] Operations, IRCecho: Move ircecho config file to be YAML - https://phabricator.wikimedia.org/T95054#1179295 (Danny_B) [23:18:11] Operations, IRCecho, Patch-For-Review: Convert ircecho init script to a systemd unit - 
https://phabricator.wikimedia.org/T95055#1179301 (Danny_B) [23:18:13] Operations, IRCecho: ircecho should accept input via unix sockets - https://phabricator.wikimedia.org/T95053#1179280 (Danny_B) [23:18:15] Operations, IRCecho: Make ircecho much better - https://phabricator.wikimedia.org/T95052#1179271 (Danny_B) [23:18:17] Operations, IRCecho: Make ircecho run as its own user - https://phabricator.wikimedia.org/T76203#792998 (Danny_B) [23:19:10] Operations, IRCecho: Make ircecho much better - https://phabricator.wikimedia.org/T95052#1179271 (Danny_B) If you think none of the information in the comments above is worth to be transferred to task on its own, feel free to close this task, please. [23:19:38] Operations, Project-Admins: Create #IRCecho project - https://phabricator.wikimedia.org/T134961#2467541 (Danny_B) Open>Resolved #ircecho created. [23:23:10] Operations, IRCecho, Project-Admins: Create #IRCecho project - https://phabricator.wikimedia.org/T134961#2467548 (Danny_B) [23:23:13] Operations, IRCecho, Wikimedia-IRC-RC-Server, Patch-For-Review: udpmxircecho spam/not working if unable to connect to irc server - https://phabricator.wikimedia.org/T134875#2467549 (Danny_B) [23:23:15] Operations, IRCecho, Patch-For-Review: Move ircecho out of package into puppet repository - https://phabricator.wikimedia.org/T95038#2467550 (Danny_B) [23:23:17] Operations, IRCecho, Patch-For-Review: IRCEcho package should put files in /usr/bin, not /usr/ircecho/bin - https://phabricator.wikimedia.org/T76208#2467551 (Danny_B) [23:23:19] Operations, IRCecho: ircecho should support nickserv registration - https://phabricator.wikimedia.org/T48254#2467552 (Danny_B) [23:23:21] Operations, IRCecho: ircecho does not handle netsplits well - https://phabricator.wikimedia.org/T45112#2467554 (Danny_B) [23:31:08] Operations, IRCecho, Patch-For-Review: ircd doesnt come back after server reboot - 
https://phabricator.wikimedia.org/T87679#2467564 (Danny_B) [23:31:25] Operations, Gitblit-Deprecate: gitblit blobs not redirecting to the correct moved resource unless .git is part of repo in url - https://phabricator.wikimedia.org/T139027#2467566 (mmodell) Open>declined I don't think we need to care about redirecting ancient / malformed urls. [23:43:02] RECOVERY - puppet last run on mw2160 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [23:46:22] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:48:21] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy