[00:01:20] grrrr [00:01:38] yep [00:01:40] packet loss now [00:01:48] LeslieCarr: ^^ [00:02:46] see it both ways though [00:02:48] so that's good [00:02:55] PROBLEM - Puppet freshness on sq80 is CRITICAL: No successful Puppet run in the last 10 hours [00:15:09] got a response [00:15:14] they had some provider circuit rerouting [00:15:19] i am asking for a timeline of that [00:17:19] thank you both very much for looking into it on your sunday [00:20:59] yw [00:21:04] thanks for noticing it [00:31:41] looks fixed to me [00:34:24] yeah [00:34:25] oi [00:46:19] <3 salt: salt-call file.find /srv/deployment/mediawiki/slot0 name='.gitmodules' [00:46:23] returns a list [00:46:50] with absolute paths [01:18:31] springle: thanks for the comments about the wikidata terms table [01:28:02] aude: np [01:28:34] i'll poke lydia to see if we can take a look at this issue in this sprint [01:28:43] we know it is a high priority [01:28:54] cool [02:04:24] !log Earlier issue identified by Ryan and Leslie as intermittent packet loss between eqiad and esams, due to capacity issue with provider. [02:04:42] Logged the message, Master [02:07:22] aude: https://ishmael.wikimedia.org/more.php?host=db1021&hours=24&checksum=17996563317104387368 [02:07:59] !log LocalisationUpdate completed (1.23wmf2) at Mon Nov 11 02:07:59 UTC 2013 [02:08:16] Logged the message, Master [02:08:51] ori-l: aude went to sleep I think [02:08:56] said she was, at least [02:09:05] oh, ok [02:10:01] (03PS1) 10Ryan Lane: Add recursive submodule support to trebuchet [operations/puppet] - 10https://gerrit.wikimedia.org/r/94688 [02:10:21] ^^ and with that change I think I've pushed in code for all the blockers now :) [02:11:44] (03CR) 10jenkins-bot: [V: 04-1] Add recursive submodule support to trebuchet [operations/puppet] - 10https://gerrit.wikimedia.org/r/94688 (owner: 10Ryan Lane) [02:12:31] jenkins has an impeccable sense of comedic timing [02:13:03] the chance it will ding a changeset is directly proportional to how much attention you call to it immediately before the review is in [02:13:13] that's been my experience, at least [02:13:35] yep :) [02:13:47] well, I knew it was going to get dinged because I didn't fix the patch before it [02:14:09] !log LocalisationUpdate completed (1.23wmf3) at Mon Nov 11 02:14:09 UTC 2013 [02:14:26] Logged the message, Master [02:15:17] !log Continuing inspection of logs on fluorine. memcached-serious.log is flooded with 'Memcached error for key [...]' errors, problem started in May or June judging by log sizes. [02:15:36] Logged the message, Master [02:15:46] (03PS2) 10Ryan Lane: Add mediawiki module for fetch and checkout hooks [operations/puppet] - 10https://gerrit.wikimedia.org/r/94682 [02:16:44] (03CR) 10jenkins-bot: [V: 04-1] Add mediawiki module for fetch and checkout hooks [operations/puppet] - 10https://gerrit.wikimedia.org/r/94682 (owner: 10Ryan Lane) [02:20:19] (03PS3) 10Ryan Lane: Add mediawiki module for fetch and checkout hooks [operations/puppet] - 10https://gerrit.wikimedia.org/r/94682 [02:22:14] (I'm continuing to !log stuff but it's not in reference to a current emergency; I just can't possibly file that many bugs.) [02:22:33] ori-l: slacker! 
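On the memcached-serious.log note above ("problem started in May or June judging by log sizes"): a minimal sketch of that log-size check, assuming rotated archives live under a path like /a/mw-log/archive on fluorine — both the directory and the rotation naming are assumptions, not confirmed paths.

```bash
# Sketch only: the archive directory and rotation naming below are assumptions.
LOGDIR=/a/mw-log/archive
# A sudden jump in the rotated file sizes marks roughly when the flood began.
ls -lh "$LOGDIR"/memcached-serious.log-*.gz 2>/dev/null | awk '{print $5, $NF}'

# Rough per-archive counts of the error signature quoted in the !log entry:
for f in "$LOGDIR"/memcached-serious.log-*.gz; do
    printf '%s\t' "$f"
    zgrep -c 'Memcached error for key' "$f"
done
```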
[02:22:42] in general we'd put in rt tickets for all the issues [02:22:51] or bugzilla bugs [02:23:00] RT is such a joy to use, compared to BZ [02:23:04] this is deliberate agitprop on my part [02:23:04] if it belongs in bugzilla it probably isn't worth !log [02:23:22] because only ops reads !log [02:23:38] and we only ever read it if emergencies occur [02:24:06] and even then we only read it when we need to track down changes ;) [02:24:27] well, probably right, but I'm nearly done, and I'd like to be complete [02:24:52] part of my point is that we're (developers, not ops' fault) entirely too glib about leaving alarms on [02:25:30] so I am being making a point in being overly-sincere about treating every error at face value [02:25:54] * Ryan_Lane nods [02:26:00] without the benefit of "meh, that's the usual rate of severe errors" [02:26:12] yeah. I'm not a fan of the level of errors we have [02:26:12] s/being// [02:26:30] point taken about polluting the logs for ops tho, i'll choose another medium next time [02:26:46] well, it's more that it's not going to be read by others ;) [02:27:12] i'm pretty good at writing cranky e-mails, it's one of my few talents [02:27:16] i'll include a link [02:27:41] ori-l, you should post on Wikimedia-L more then ;) [02:27:56] heh [02:28:17] heh, it took me like all of two weeks to unsubscribe after i joined the WMF [02:29:01] !log Apache logs filled with "SearchPhaseExecutionException[Failed to execute phase [dfs], all shards failed" [02:29:17] Logged the message, Master [02:29:20] (03PS2) 10Ryan Lane: Add recursive submodule support to trebuchet [operations/puppet] - 10https://gerrit.wikimedia.org/r/94688 [02:38:50] (03CR) 10Chad: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94688 (owner: 10Ryan Lane) [02:40:34] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Nov 11 02:40:33 UTC 2013 [02:40:47] Logged the message, Master [02:56:14] (03PS3) 10Ryan Lane: Add recursive submodule support to trebuchet [operations/puppet] - 10https://gerrit.wikimedia.org/r/94688 [02:58:45] (03CR) 10Ryan Lane: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94688 (owner: 10Ryan Lane) [03:53:12] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:55:02] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 0.089 second response time [05:04:35] TimStarling: is there a reason we don't use rsync with --compress? cdb files compress well (gzip achieves a ratio of 3.75) so it seems like an obvious way to speed up scap [05:04:57] just haven't gotten around to clicking +2 yet, I think [05:05:28] oh, OK. No problem, then :) [05:08:26] it was only 24 hours ago [05:08:33] (03PS2) 10Tim Starling: MW_RSYNC_ARGS: '--delay-updates' & '--compress' [operations/puppet] - 10https://gerrit.wikimedia.org/r/94591 (owner: 10Ori.livneh) [05:08:40] (03CR) 10Tim Starling: [C: 032] MW_RSYNC_ARGS: '--delay-updates' & '--compress' [operations/puppet] - 10https://gerrit.wikimedia.org/r/94591 (owner: 10Ori.livneh) [05:08:46] * jeremyb welcomes comments on https://gerrit.wikimedia.org/r/94111 :) [05:09:18] (03CR) 10Tim Starling: [V: 032] MW_RSYNC_ARGS: '--delay-updates' & '--compress' [operations/puppet] - 10https://gerrit.wikimedia.org/r/94591 (owner: 10Ori.livneh) [05:10:11] that wasn't a nudge to review, I can +2 in puppet now, and the repo's conventions call for exaggerated self-confidence [05:10:39] but, i'll take it! 
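For context on the change Tim merges above (MW_RSYNC_ARGS: '--delay-updates' & '--compress'): a standalone sketch of what those two rsync flags buy scap. The source/destination paths and hostname are placeholders, not the real scap layout.

```bash
# Illustrative only: paths and host are placeholders, not the real scap targets.
# --compress      : compress data in transit; cdb localisation files compress well
#                   (the chat quotes a gzip ratio of ~3.75), so syncs move less data.
# --delay-updates : stage all updated files in a temporary area and rename them into
#                   place at the end, so the target tree flips over near-atomically.
rsync -a --compress --delay-updates \
    /srv/mediawiki-staging/ \
    mw-host.example.wmnet:/srv/mediawiki/
```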
[05:11:24] that sounds like quips fodder [05:11:31] (exaggerated self confidence) [05:12:21] jeremyb: patch looks ok, but address management is outside the bounds of what I can review / merge [05:13:50] TimStarling: anyways, thanks, didn't mean to snark. If you're around I can run a scap, it seems safer than just leaving something like that for the next deployer to verify. [05:13:54] ori-l: yeah, wasn't thinking about you doing it. i didn't even know you could +2 until you just said so. i was asking for comments about whether this is the right solution (ideally from someone that knows better than i do) [05:14:06] ori-l: mazel tov i guess? [05:14:07] :) [05:14:13] we have to wait for a puppet cycle first [05:14:16] I had to kill someone from a rival gang [05:15:53] TimStarling: OK, I'll give a heads up in 60 mins and scap in 70 [05:16:15] ok [05:45:38] PROBLEM - Puppet freshness on sq48 is CRITICAL: No successful Puppet run in the last 10 hours [06:15:47] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [06:16:07] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [06:16:42] I'll run scap in 10 [06:19:47] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101298) [06:20:07] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101201) [06:22:57] PROBLEM - Host sq48 is DOWN: PING CRITICAL - Packet loss = 100% [06:25:48] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [06:26:07] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [06:26:50] running scap [06:27:55] !log ori Started syncing Wikimedia installation... : [06:28:07] RECOVERY - Host sq48 is UP: PING OK - Packet loss = 0%, RTA = 35.30 ms [06:28:13] Logged the message, Master [06:28:26] !powercycled hung sq48, took two tries to come up, "NMI received for unknown reason 31 on CPU 0" and "mptbase: ioc0: ERROR - Failed to come READY after reset" [06:28:32] grrr [06:28:50] !log powercycled hung sq48, took two tries to come up, "NMI received for unknown reason 31 on CPU 0" and "mptbase: ioc0: ERROR - Failed to come READY after reset" [06:29:05] Logged the message, Master [06:30:37] PROBLEM - Frontend Squid HTTP on sq48 is CRITICAL: Connection refused [06:31:07] PROBLEM - SSH on sq48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:31:47] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (102298) [06:32:07] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (102345) [06:32:40] !log sq48 repeat of these errors and hung again, so rt #6274 opened [06:32:56] Logged the message, Master [06:33:26] !log ori Finished syncing Wikimedia installation... : [06:33:41] Logged the message, Master [06:33:53] ori-l: how'd it fair with the rsync changes? 
[06:34:33] well, it took five minutes, but i wasn't pushing out a new version [06:34:47] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [06:35:08] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [06:35:12] ori-l: so, at least it didn't break stuff [06:35:51] no, but i would've been very surprised if it had [06:36:29] of courrse [06:36:31] -r [06:37:27] PROBLEM - Host sq80 is DOWN: PING CRITICAL - Packet loss = 100% [06:37:57] RECOVERY - Puppet freshness on sq80 is OK: puppet ran at Mon Nov 11 06:37:55 UTC 2013 [06:38:07] RECOVERY - Host sq80 is UP: PING OK - Packet loss = 0%, RTA = 35.38 ms [06:38:07] RECOVERY - Backend Squid HTTP on sq80 is OK: HTTP OK: HTTP/1.0 200 OK - 486 bytes in 0.079 second response time [06:39:48] !log probably gratuitous powercycle of sq80, it seems fine now in any case [06:40:03] Logged the message, Master [06:43:04] !log scap: "sudo: no tty present and no askpass program specified" for snapshot1 & snapshot4 [06:43:21] Logged the message, Master [06:43:22] i presume that's expected [06:43:48] I should add those to the decom/reclaim list soon [06:44:52] any other whiners? [06:45:47] just me :) [06:45:58] * apergos scaps to ori-l [06:46:53] :D [06:54:57] RECOVERY - SSH on sq48 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [06:58:23] yeah whatever. so not trustworthy that box [07:02:28] Icinga is like the annoying relative that sends you Fwd: Fwd: Fwd: emails about the coming Mayan apocalypse [07:03:07] yeah but that same annoying relative sometimes sends you things that aren't urban legends [07:03:18] and that they heard about while you didn't... [07:03:38] (03PS1) 10ArielGlenn: clean up cruft for db fundraising hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/94704 [07:05:09] so not awake yet *yawn* [07:06:28] (03CR) 10ArielGlenn: [C: 032] clean up cruft for db fundraising hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/94704 (owner: 10ArielGlenn) [07:25:46] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103485) [07:26:06] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (104258) [07:30:16] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [07:30:46] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [07:37:52] ori-l: if you run some grep -c of the poolcounter full errors you'd make jeremyb SO happy [07:45:46] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (102131) [07:46:06] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (102138) [07:47:11] jeremyb: what do you need me to count? 
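A sketch of the grep -c suggested above for the PoolCounter "queue full" errors; the log path and the exact message text are assumptions, to be adjusted to whatever actually appears on fluorine.

```bash
# Hypothetical: both the log location and the message text are assumptions.
LOG=/a/mw-log/poolcounter.log
grep -c -i 'pool queue is full' "$LOG"

# Bucket by hour (assumes lines start with "YYYY-MM-DD HH:MM:SS") to see whether
# the rate actually changed, rather than just getting a grand total:
grep -i 'pool queue is full' "$LOG" | awk '{print $1, substr($2, 1, 2)}' | sort | uniq -c
```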
[07:49:46] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [07:50:06] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [07:53:46] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101281) [07:54:06] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101287) [07:58:46] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [07:59:06] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [08:21:50] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103517) [08:22:10] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103562) [08:23:50] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [08:24:10] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [08:28:50] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103633) [08:29:10] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103653) [08:30:50] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [08:31:11] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [08:35:50] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103760) [08:36:10] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103728) [08:37:50] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [08:38:10] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [08:46:50] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103331) [08:47:10] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (102872) [08:49:50] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [08:50:02] !log - update mwlib.rl to 0.14.4. 
[08:50:10] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [08:50:22] Logged the message, Master [08:50:34] !log restarted all services [08:50:51] Logged the message, Master [09:01:46] (03CR) 10Akosiaris: [C: 032] Fixing ownerships, permissions in various places [operations/puppet] - 10https://gerrit.wikimedia.org/r/94365 (owner: 10Akosiaris) [09:12:04] (03CR) 10Faidon Liambotis: [C: 04-1] "(4 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/92271 (owner: 10Ori.livneh) [09:21:11] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103663) [09:21:32] (03CR) 10Ori.livneh: "(4 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/92271 (owner: 10Ori.livneh) [09:23:08] (03PS3) 10Mark Bergsma: Create a backend_random director [operations/puppet] - 10https://gerrit.wikimedia.org/r/94350 [09:24:11] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [09:24:37] (03CR) 10Mark Bergsma: [C: 032] Create a backend_random director [operations/puppet] - 10https://gerrit.wikimedia.org/r/94350 (owner: 10Mark Bergsma) [09:29:59] (03PS1) 10Mark Bergsma: Fetch default_backend from the vcl_config hash [operations/puppet] - 10https://gerrit.wikimedia.org/r/94709 [09:30:48] mark____: good morning [09:30:51] you need to ghost your nickname [09:31:17] there's someone by the name of "mark" who's not you [09:31:21] (03CR) 10Mark Bergsma: [C: 032] Fetch default_backend from the vcl_config hash [operations/puppet] - 10https://gerrit.wikimedia.org/r/94709 (owner: 10Mark Bergsma) [09:31:46] better get used to it [09:32:32] haha [09:34:41] meh... 27 wikitech pages with references to sockpuppet that I will have to update in order to decomission them. And another 11 for stafford..... sigh [09:34:54] (03CR) 10Mark Bergsma: "What's the point of this? It's implicit." [operations/puppet] - 10https://gerrit.wikimedia.org/r/94365 (owner: 10Akosiaris) [09:35:15] mark: not really [09:35:29] I found out this by switching cp4001 to point to palladium [09:35:40] and files and directories started changing owners and permissions [09:36:08] mark____: ^ [09:36:25] newer puppet? [09:37:10] ubuntu5 > ubuntu3 ? [09:37:24] I kind of doubt it would cause this [09:37:28] don't [09:37:34] ? [09:37:38] there were changes for filesystem permissions and such [09:37:39] (03CR) 10Ori.livneh: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/92271 (owner: 10Ori.livneh) [09:37:44] in the security update, iirc [09:38:10] (03PS1) 10Mark Bergsma: Fix erroneous attribute [operations/puppet] - 10https://gerrit.wikimedia.org/r/94711 [09:39:10] In any case... files were changing from ganglia user to gmetric user [09:39:11] oh, they fixed the performance regression with 2.5 [09:39:12] awesome [09:39:13] (03CR) 10Mark Bergsma: [C: 032] Fix erroneous attribute [operations/puppet] - 10https://gerrit.wikimedia.org/r/94711 (owner: 10Mark Bergsma) [09:39:18] both were wrong [09:39:30] * ori-l sleeps [09:39:39] akosiaris: used to be always root if not specified [09:39:46] and consequently we omitted it very very often [09:39:58] (03PS6) 10Ori.livneh: Add Graphite module & role [operations/puppet] - 10https://gerrit.wikimedia.org/r/92271 [09:40:23] Yes it did... 
but somehow this is not the case now [09:40:41] puppet labs can't do anything right can they [09:40:51] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (102153) [09:41:11] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (100994) [09:43:10] 2.4 was all filesystem changes [09:43:51] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [09:44:11] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [09:44:46] mark____: http://projects.puppetlabs.com/issues/5240 [09:45:13] 3 years now open. So yes ... puppetlabs can do anything right :-) [09:45:31] like leaving it there for 3 years :-( [09:46:31] we can have a File class if you like [09:46:34] to set reasonable defaults [09:46:40] we might solve this with a File { owner => 'root', group => 'root' } in site.pp. What will happen afterwards [09:46:47] yup [09:47:08] although I wouldn't mind being explicit in the manifests themselves [09:47:44] I am pretty confident that while moving stuff to palladium I 'll be noticing this more and more [09:49:28] harmon was a swift test host? [09:49:38] -r-xr-x--- 1 gitpuppet ganglia 187 Nov 8 12:04 post-merge [09:49:39] oh my god [09:49:40] lol [09:49:50] how many swift test hosts did we have [09:50:20] (03PS1) 10Mark Bergsma: Use random backend for CentralAutoLogin/start requests [operations/puppet] - 10https://gerrit.wikimedia.org/r/94712 [09:50:48] akosiaris: yeah but I recall reading somewhere that it only goes like 5 nest levels deep or something fucked up like that [09:50:50] so it just sets the owner/group to whatever is in the repo of the master [09:51:13] (03CR) 10jenkins-bot: [V: 04-1] Use random backend for CentralAutoLogin/start requests [operations/puppet] - 10https://gerrit.wikimedia.org/r/94712 (owner: 10Mark Bergsma) [09:51:16] and I just remembered having met this before [09:51:34] mark____: what ? oh man.... why ? why ? why ? [09:51:41] so then you suddenly get things breaking just because you moved it into a subclass or whatever [09:51:43] yeah I know [09:51:54] if not for that, I would have set the default mode 444 etc with that too [09:52:33] so that's one of the reasons why the style guide deprecates those attribute defaults [09:52:39] simply because the implementation is fucked up [09:53:24] (03PS2) 10Mark Bergsma: Use random backend for CentralAutoLogin/start requests [operations/puppet] - 10https://gerrit.wikimedia.org/r/94712 [09:54:52] (03CR) 10Mark Bergsma: [C: 032] Use random backend for CentralAutoLogin/start requests [operations/puppet] - 10https://gerrit.wikimedia.org/r/94712 (owner: 10Mark Bergsma) [09:56:02] notice: /Stage[main]/Varnish::Common/File[/usr/share/varnish/reload-vcl]/owner: owner changed 'ganglia' to 'root' [09:56:03] oh man [09:56:10] i hope stuff is not really fucked up somewhere [09:58:13] wait, so this was broken before? [10:00:35] i don't think this affected anything [10:00:51] but if that obscure varnish script got a new owner, who knows what else was affected across the cluster? [10:01:15] akosiaris: is that new puppetmaster being used atm? [10:01:20] perhaps we better make sure it's not [10:01:28] the point is that it was ganglia before [10:01:33] and that was with the old puppetmaster [10:01:47] so it has been broken for a while probably? [10:02:23] are you sure this was the old puppetmaster? 
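A minimal way to see the site.pp fix being discussed (the global File { owner => 'root', group => 'root' } default) in action locally, using puppet apply against a scratch manifest. The /tmp paths are just examples; the dynamic-scoping caveat mark raises is the reason the upstream style guide discourages resource defaults.

```bash
# Sketch: exercise a global File resource default with `puppet apply --noop`.
# The /tmp paths are scratch examples; the real default would live in site.pp.
cat > /tmp/file-default-demo.pp <<'EOF'
# Capitalised type = resource default. Note these are dynamically scoped, so they
# also leak into child scopes -- the gotcha discussed above.
File {
    owner => 'root',
    group => 'root',
}

file { '/tmp/file-default-demo.txt':
    ensure  => file,
    content => "owned by root:root via the default\n",
}
EOF
puppet apply --noop /tmp/file-default-demo.pp
```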
[10:02:39] i assumed it was the new one [10:02:51] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (107652) [10:03:11] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (105919) [10:05:06] damnit [10:05:17] just noticed the eqiad varnish servers are configured for 100G storage instead of 300 [10:05:34] I am not sure, no [10:05:53] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [10:06:13] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [10:06:55] and so is ulsfo... [10:08:13] PROBLEM - Puppetmaster HTTPS on strontium is CRITICAL: Connection refused [10:08:49] this almost certainly precludes us from migrating eqiad today [10:09:48] mark____: nothing is on the new puppetmaster yet [10:10:07] akosiaris: so this, on cp1068, was caused by the old puppetmaster? [10:10:16] yes [10:10:20] ok [10:10:23] PROBLEM - Varnish HTTP text-backend on cp4010 is CRITICAL: Connection refused [10:10:42] perhaps the temporary use of that ubuntu5 a while ago [10:11:25] ubuntu4 [10:11:46] ubuntu5 = ubuntu4 security fixes + the performance regression fix [10:11:59] 2.5 even [10:12:29] right [10:12:36] meh [10:12:45] so now I need to restart all varnish instances with 300GB instead of 100G [10:12:53] and some will have been mmapped less than 300GB apart [10:13:23] RECOVERY - Varnish HTTP text-backend on cp4010 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.150 second response time [10:13:46] ah it doesn't matter because if the size changes it gets wiped [10:19:02] paravoid: akosiaris do you know how i can login to https://gdash.wikimedia.org/ ? [10:19:26] or anyone :) [10:20:03] gdash doesn't need login [10:20:21] if i click data browser then it does [10:20:21] (03PS1) 10Akosiaris: Specify puppetmaster::gitclone missing ownerships [operations/puppet] - 10https://gerrit.wikimedia.org/r/94714 [10:21:49] bbl... [10:22:13] with the same nick? [10:22:14] :P [10:27:04] andre__: do you know about logging into graphite.wikimedia.org? [10:27:06] https://bugzilla.wikimedia.org/show_bug.cgi?id=54713 [10:27:15] as staff, maybe it works but not for normal folks like me [10:28:34] aude, hmm, I thought it's really only bound to a Gerrit/Wikitech account, but maybe there's another "staff" boolean that I don't know about [10:28:43] aude, sounds like a question for the Labs channel or list [10:28:44] it doesn't work for me [10:28:55] it works for me :-/ [10:29:04] do you get an error message? [10:29:09] andre__: aude IIRC it requires you to be in a specific LDAP group [10:29:14] that says you are from the wmf [10:29:19] aha! [10:29:25] :( [10:29:34] thanks YuviPanda|away [10:29:58] ok, so then we need the wmde group made [10:30:04] yeah [10:30:14] * aude hopes ryan is not on holiday [10:30:28] Require ldap-group cn=wmf,ou=groups,dc=wikimedia,dc=org [10:30:31] * aude sees  [10:31:24] aude: if you want me to look at particular things, I can help [10:31:34] although bewarned, I hadn't slept since before I last spoke to you... :D [10:31:35] * aude wants to poke around [10:31:41] ah, can't help there, I guess :( [10:31:42] sleep! 
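An aside on the access check above: gdash's data browser and graphite sit behind that Apache "Require ldap-group" rule, so membership in the cn=wmf group is what is being tested. A hedged way to inspect that group, assuming a reachable LDAP server — the URI below is a placeholder; the base DN is the one quoted above.

```bash
# Sketch only: the LDAP URI is a placeholder; the group DN is the one from the
# "Require ldap-group cn=wmf,ou=groups,dc=wikimedia,dc=org" line quoted above.
ldapsearch -x -LLL \
    -H ldap://ldap.example.wikimedia.org \
    -b 'ou=groups,dc=wikimedia,dc=org' \
    '(cn=wmf)' member
# A wmde group, as aude suggests, would mean a cn=wmde entry under the same OU plus
# a second Require ldap-group line in the Apache config.
```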
[10:32:02] trying to investigate issue with wikidata [10:32:42] also, ori-l gave me a Icinga link that i cant see [10:32:58] errr ishmael.wikimedia.org [10:33:04] can't see either [10:35:01] yeah, tied to the same group [10:35:29] alright :( [10:36:07] andre__: you can search for '[Ops] providing restricted tools to a wider audience ' in the ops list for the last known conversation about this [10:36:16] bz was mentioned, but the convo died down [10:36:36] hmm [10:37:03] aude: heh, in the email from way back in July, Sumanah mentions 'And evidently Wikimedia DE people could use a LDAP group as well (for Jenkins/Gerrit permissions); do we know whether that's particularly urgent?' [10:37:04] :) [10:37:12] probably [10:41:20] (03PS1) 10TTO: Make missing.php aware of interwiki prefixes [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94716 [10:43:45] omg, pmacct uses cvs for revision control [10:51:53] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101196) [10:52:53] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [11:02:14] (03PS2) 10TTO: Make missing.php aware of interwiki prefixes [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94716 [11:59:26] (03CR) 10Akosiaris: [C: 032] Specify puppetmaster::gitclone missing ownerships [operations/puppet] - 10https://gerrit.wikimedia.org/r/94714 (owner: 10Akosiaris) [12:01:48] (03PS1) 10Mark Bergsma: Add a 2nd 200GB persistent storage backend per SSD [operations/puppet] - 10https://gerrit.wikimedia.org/r/94720 [12:02:39] (03CR) 10jenkins-bot: [V: 04-1] Add a 2nd 200GB persistent storage backend per SSD [operations/puppet] - 10https://gerrit.wikimedia.org/r/94720 (owner: 10Mark Bergsma) [12:02:54] (03PS2) 10Mark Bergsma: Add a 2nd 200GB persistent storage backend per SSD [operations/puppet] - 10https://gerrit.wikimedia.org/r/94720 [12:05:05] (03CR) 10Mark Bergsma: [C: 032] Add a 2nd 200GB persistent storage backend per SSD [operations/puppet] - 10https://gerrit.wikimedia.org/r/94720 (owner: 10Mark Bergsma) [12:20:55] (03CR) 10Faidon Liambotis: [C: 032] "Looks sane to me. Thanks for taking care of it :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/92271 (owner: 10Ori.livneh) [12:25:18] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (125706) [12:25:38] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (126305) [13:05:29] https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=ext.articleFeedbackv5.st etc [13:05:35] latency 12.9s [13:05:40] X-Cache: cp1057 miss (0), cp3022 miss (0) [13:05:44] X-Varnish: 159566344, 3845145078 [13:05:51] from the netherlands [13:06:03] saw something similar yesterday evening at home [13:11:50] i see some concerns in the log about equid -> esams capacity... might be related ? [13:12:56] (03PS1) 10Mark Bergsma: Create a random director for non-tier1 backends as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/94722 [13:13:05] thedj[work]: indeed it is [13:13:17] 14% packet loss for crying out loud [13:13:18] mark: ^^ [13:13:56] i guess the nsa filter can't keep up anymore :D [13:14:06] so, does udpmcast still exist somewhere? [13:14:15] what's up? 
[13:14:25] 14% packet loss on eqiad-esams [13:14:38] it happened during the weekend too, leslie opened a ticket [13:14:42] where do you see that? [13:14:51] it stopped happening so she stopped investigating, but now it happens again [13:14:55] ping from hooft to bast1001 [13:15:45] 23% packet loss now [13:15:52] * paravoid grumbles [13:19:27] mark: are you investigating or should I do something about it? :) [13:19:39] what do you want to do about it? :) [13:19:54] a) contact vendor b) stop using the link, reinstate udpmcast? [13:20:13] either both or just (a) [13:20:35] go ahead [13:21:20] what do you think? [13:22:30] I think I see a capped 1 Gbps link [13:23:20] so go ahead, you have experience with that ;-) [13:25:26] you know what [13:25:29] I'll stop ospf3 on that link now [13:25:39] that will move upload off... [13:25:43] right, if packet loss is the result of congestion [13:25:47] and not a fault [13:27:33] traffic going down [13:31:19] i found the order form [13:31:23] explicitly says 'rate limit NO' [13:31:37] is it on RT somewhere? [13:32:50] i'm putting it in as we speak [13:34:06] (03PS1) 10Odder: (bug 56899) Extra NS for Collection on enwikisource [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94723 [13:34:38] #6276 [13:35:46] 1000mbps commit, but no rate limit [13:36:24] ffs [13:36:53] ? [13:36:58] nothing, I'm annoyed at them [13:37:27] so, I can handle the communication, should I? [13:38:36] you can [13:38:44] i haven't dealt with them before and I'm busy [13:38:52] okay [13:39:25] my old tinet login no longer seems to work [13:41:00] (03CR) 10Mark Bergsma: [C: 032] Create a random director for non-tier1 backends as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/94722 (owner: 10Mark Bergsma) [13:49:08] (03PS1) 10Mark Bergsma: Remove the .weight parameter [operations/puppet] - 10https://gerrit.wikimedia.org/r/94726 [13:50:51] (03CR) 10Mark Bergsma: [C: 032] Remove the .weight parameter [operations/puppet] - 10https://gerrit.wikimedia.org/r/94726 (owner: 10Mark Bergsma) [14:03:53] (03PS1) 10Mark Bergsma: Expand vcl variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/94728 [14:04:52] (03CR) 10Mark Bergsma: [C: 032] Expand vcl variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/94728 (owner: 10Mark Bergsma) [14:07:01] (03PS1) 10Mark Bergsma: Add .weight again, as a constant [operations/puppet] - 10https://gerrit.wikimedia.org/r/94729 [14:08:02] (03CR) 10Mark Bergsma: [C: 032] Add .weight again, as a constant [operations/puppet] - 10https://gerrit.wikimedia.org/r/94729 (owner: 10Mark Bergsma) [14:20:37] PROBLEM - Varnish HTTP text-backend on cp1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:21:27] RECOVERY - Varnish HTTP text-backend on cp1054 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 5.813 second response time [14:23:07] (03PS1) 10Akosiaris: Temporarily reduce puppet TTLs to 5min [operations/dns] - 10https://gerrit.wikimedia.org/r/94734 [14:28:27] PROBLEM - Varnish HTTP text-backend on amssq54 is CRITICAL: HTTP CRITICAL - No data received from host [14:28:37] PROBLEM - Varnish HTCP daemon on amssq54 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:28:37] PROBLEM - Varnish traffic logger on amssq54 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:31:14] !log rebooting amssq54, XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250) [14:31:37] Logged the message, Master [14:31:37] lol [14:34:08] PROBLEM - SSH on amssq54 is CRITICAL: Connection refused [14:45:08] RECOVERY - SSH on amssq54 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [14:45:28] RECOVERY - Varnish traffic logger on amssq54 is OK: PROCS OK: 2 processes with command name varnishncsa [14:45:28] RECOVERY - Varnish HTCP daemon on amssq54 is OK: PROCS OK: 1 process with UID = 110 (vhtcpd), args vhtcpd [14:55:27] RECOVERY - Varnish HTTP text-backend on amssq54 is OK: HTTP OK: HTTP/1.1 200 OK - 190 bytes in 0.192 second response time [15:06:48] !log Moved eqiad https/ipv6 text traffic from squid to varnish [15:07:05] Logged the message, Master [15:14:16] (03CR) 10Akosiaris: [C: 032] Temporarily reduce puppet TTLs to 5min [operations/dns] - 10https://gerrit.wikimedia.org/r/94734 (owner: 10Akosiaris) [15:14:46] (03PS1) 10Mark Bergsma: Migrate textsvc from text to text-varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/94740 [15:16:14] (03CR) 10Mark Bergsma: [C: 032] Migrate textsvc from text to text-varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/94740 (owner: 10Mark Bergsma) [15:23:42] (03PS1) 10Mark Bergsma: Move eqiad wikipedia traffic to Varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/94741 [15:26:46] (03CR) 10Mark Bergsma: [C: 032] Move eqiad wikipedia traffic to Varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/94741 (owner: 10Mark Bergsma) [15:32:03] !log Moved eqiad wikipedia traffic onto Varnish [15:32:13] yey!!!! [15:32:18] Logged the message, Master [15:33:38] her mark paravoid [15:33:50] hi [15:34:04] hey [15:34:18] grrrrr [15:34:21] saw the above [15:34:24] yeah... [15:34:35] I'm nagging their noc over mail and phone [15:34:40] they're not being very responsive... [15:34:53] since the buyout my favorite provider to work with is falling down the list [15:36:35] hm [15:36:41] i see varnish complaining about gzip a lot [15:40:26] (03PS1) 10Akosiaris: Specify perms for generic::upstart_job [operations/puppet] - 10https://gerrit.wikimedia.org/r/94746 [15:41:31] LeslieCarr: perhaps luis was running the entire network by himself before ;p [15:41:31] Nemo_bis: ori-l: errr? huh? why should you count? [15:41:59] or Lou [15:43:55] haha [15:45:30] (03CR) 10Akosiaris: [C: 032] Specify perms for generic::upstart_job [operations/puppet] - 10https://gerrit.wikimedia.org/r/94746 (owner: 10Akosiaris) [15:45:50] PROBLEM - Puppet freshness on sq48 is CRITICAL: No successful Puppet run in the last 10 hours [16:12:09] (03PS1) 10Reedy: Bump wmgMemoryLimit to 210MB [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94756 [16:13:07] !log upgrading image/video scalers [16:13:21] Logged the message, Master [16:17:17] (03CR) 10Reedy: [C: 032] Bump wmgMemoryLimit to 210MB [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94756 (owner: 10Reedy) [16:20:39] !log - upgrade mwlib to 0.15.12 [16:20:53] Logged the message, Master [16:21:11] !log restarted all services [16:21:14] (03Merged) 10jenkins-bot: Bump wmgMemoryLimit to 210MB [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94756 (owner: 10Reedy) [16:21:14] lol [16:21:19] so efficient [16:22:27] !log reedy synchronized wmf-config/InitialiseSettings.php 'Bump wmgMemoryLimit to 210MB' [16:22:39] Logged the message, Master [16:25:41] I love how the err screen on git.wm.o says "Your cache administrator is nobody." 
[16:30:18] hexmode: there's a bug for it if you want [16:30:33] jeremyb: I count every day [16:31:00] ah, the poolcounter? because you were trying to claim that it increased [16:35:31] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:36:30] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 2.047 second response time [16:37:18] (03PS1) 10Mark Bergsma: Allow caching of CentralAutoLogin/checkLoggedIn [operations/puppet] - 10https://gerrit.wikimedia.org/r/94765 [16:40:35] (03PS7) 10Ori.livneh: Add Graphite module & role [operations/puppet] - 10https://gerrit.wikimedia.org/r/92271 [16:41:54] (03CR) 10Ori.livneh: [C: 032 V: 032] Add Graphite module & role [operations/puppet] - 10https://gerrit.wikimedia.org/r/92271 (owner: 10Ori.livneh) [16:44:21] (03PS2) 10Mark Bergsma: Allow caching of login.wikimedia.org requests [operations/puppet] - 10https://gerrit.wikimedia.org/r/94765 [16:46:08] (03PS1) 10Ori.livneh: Qualify the path to Python [operations/puppet] - 10https://gerrit.wikimedia.org/r/94767 [16:46:53] (03CR) 10Ori.livneh: [C: 032 V: 032] Qualify the path to Python [operations/puppet] - 10https://gerrit.wikimedia.org/r/94767 (owner: 10Ori.livneh) [16:49:27] annnnnd everything applied correctly [16:49:38] \o/ [16:49:38] nice job [16:49:44] thanks :) [16:50:04] Nemo_bis: ugly comments :( [16:52:36] i think we either need to copy all the existing whisper baggage from professor or just leave it running as graphite-old.wikimedia.org for a little longer until the lights go out in tampa [16:52:52] and lose all the perf data? [16:52:56] I don't like that much [16:53:08] but it's so nice and cleannnnnnnnnn now [16:53:16] but yeah, i guess you're right [16:53:58] need to think about it [16:54:20] if the old stuff is not compatible with a new naming scheme it can be renamed to all go under one namespace [16:54:28] deprecated.* or whatever [16:54:45] if we can rename, why not rename to the right locations? [16:54:57] (I have zero experience with whisper) [16:55:50] if the mapping allows us to rename everything into the right place with a few operations, that would work, but i think there are all these subtle variations and duplicated metrics [16:55:57] so it would be a ton of manual work [16:56:31] i guess it could at least be done for the important metrics, i.e., whatever's feeding the current set of gdash dashboards [16:57:13] gdash only goes back 1 month at most I think [16:57:31] but I think having these for archiving purposes if possible is a good idea [16:57:40] it's not like we haven't ever found issues that started 3 months back [16:57:42] ori-l: are now on change of gdash? [16:57:51] nope [16:57:58] not falling for that trap, Nemo_bis :P [16:58:14] hehe [16:58:23] ori-l: more specifically, do you know how to do this https://bugzilla.wikimedia.org/show_bug.cgi?id=41754 [16:58:52] yes, you can submit a patch yourself, let me point you to the right file [17:00:04] Nemo_bis: http://git.wikimedia.org/tree/operations%2Fpuppet.git/ca6fe4efc30c6a4b2606b13aab178b9e71914dca/files%2Fgraphite%2Fgdash%2Fdashboards%2Freqerror [17:01:26] the DSL gdash uses to describe graphs is at https://github.com/ripienaar/graphite-graph-dsl/wiki [17:04:00] (03PS1) 10Akosiaris: ULSFO uses new puppetmasters [operations/dns] - 10https://gerrit.wikimedia.org/r/94768 [17:04:25] what's "funny" is that last day disproved my bug, we got more 5xx than 500 [17:04:28] do we use the puppet.ulsfo.wmnet cnames and such? 
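On the "rename the old whisper baggage into one namespace" idea above: Graphite metric names map one-to-one onto .wsp files and directories on disk, so moving everything under deprecated.* is a filesystem move. A rough sketch, with the storage root as an assumption (it varies by packaging), and with the caveat from the chat that mapping individual metrics onto the right new names would still be manual work.

```bash
# Rough sketch, not a tested migration. The whisper storage root is an assumption
# (commonly /opt/graphite/storage/whisper or /var/lib/graphite/whisper).
WSP=/var/lib/graphite/whisper

mkdir -p "$WSP/deprecated"
for d in "$WSP"/*/; do
    name=$(basename "$d")
    [ "$name" = deprecated ] && continue
    # metric tree foo.bar.baz becomes deprecated.foo.bar.baz
    mv "$d" "$WSP/deprecated/$name"
done
```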
[17:04:58] it looks like we do, interesting [17:05:15] oh, or not [17:05:22] server = puppet [17:05:40] so whatever is first at resolv.conf :) [17:06:34] paravoid: :-) [17:06:45] now... let's see what happens when enabled [17:07:02] (03CR) 10Akosiaris: [C: 032] ULSFO uses new puppetmasters [operations/dns] - 10https://gerrit.wikimedia.org/r/94768 (owner: 10Akosiaris) [17:07:33] famous last words [17:07:39] :-) [17:12:27] RT on duty looks out of date (set in the first half of last week) [17:12:32] is today a holiday? [17:12:41] it is for the US people, yes [17:12:43] yes [17:12:50] k [17:40:36] !log reedy synchronized php-1.23wmf3/extensions/FlaggedRevs 'https://gerrit.wikimedia.org/r/94774' [17:40:51] Logged the message, Master [17:59:40] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:00:21] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.002 second response time [18:13:58] (03PS1) 10Akosiaris: More fixes for file permissions/ownerships [operations/puppet] - 10https://gerrit.wikimedia.org/r/94777 [18:27:08] (03PS1) 10Akosiaris: Remove references to /etc/puppet/software [operations/puppet] - 10https://gerrit.wikimedia.org/r/94779 [18:40:10] (03PS1) 10Yurik: Revert "Revoke Yuri's shell access" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94780 [18:40:35] mark, paravoid ^ :) [18:40:49] are you crazy? [18:40:52] no [18:40:53] new key [18:41:23] paravoid, good point, sorry, regening [18:43:31] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:44:30] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 4.819 second response time [18:59:55] (03PS2) 10Ottomata: Fix indentation in manifests/misc/statistic.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/94624 (owner: 10QChris) [18:59:59] (03CR) 10Ottomata: [C: 032 V: 032] Fix indentation in manifests/misc/statistic.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/94624 (owner: 10QChris) [19:00:27] (03PS2) 10Ottomata: Restrict access to geowiki's data-private checkout [operations/puppet] - 10https://gerrit.wikimedia.org/r/94625 (owner: 10QChris) [19:00:32] (03CR) 10Ottomata: [C: 032 V: 032] Restrict access to geowiki's data-private checkout [operations/puppet] - 10https://gerrit.wikimedia.org/r/94625 (owner: 10QChris) [19:08:30] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:09:20] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.001 second response time [19:22:55] !log reedy updated /a/common to {{Gerrit|Ibab846000}}: Bump wmgMemoryLimit to 210MB [19:22:58] (03PS1) 10Reedy: Non wikipedias to 1.23wmf3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94785 [19:23:13] Logged the message, Master [19:23:32] ori-l: ^ Guess that's related to your change? [19:23:36] is that autogenerated from sartoris? [19:23:39] It's a bit of a lie as I've only committed a locally [19:23:53] I'm not using sartoris [19:24:54] Oh.. 
[19:25:01] It's still a lie [19:25:16] (03CR) 10Reedy: [C: 032] Non wikipedias to 1.23wmf3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94785 (owner: 10Reedy) [19:25:24] (03Merged) 10jenkins-bot: Non wikipedias to 1.23wmf3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94785 (owner: 10Reedy) [19:25:40] reedy@tin:/a/common$ git pull [19:25:41] From ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [19:25:41] 3dfebcd..433496a master -> origin/master [19:25:41] Current branch master is up to date. [19:26:13] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.23wmf3 [19:26:34] Logged the message, Master [19:27:09] * Reedy is slightly confused [19:31:31] (03PS1) 10coren: Tool Labs: install cvs on dev_environ [operations/puppet] - 10https://gerrit.wikimedia.org/r/94786 [19:31:41] cvs? [19:31:44] retro much? [19:31:53] (03CR) 10coren: [C: 032] "Package install" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94786 (owner: 10coren) [19:32:15] Reedy: It's the 8-bit nostalgia version of source control. :-) [19:32:22] I know what is is ;) [19:32:32] More wondering who/what needs/wants it [19:33:00] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:09] krd [19:34:15] For some reason. Tool labs user. [19:34:31] (03CR) 10Yurik: [C: 04-1] "I will need to regen private key - will commit it in a bit." [operations/puppet] - 10https://gerrit.wikimedia.org/r/94780 (owner: 10Yurik) [19:34:50] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.589 second response time [19:35:30] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:30] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.548 second response time [19:36:33] Hi. Have there been any reports of bits.wikimedia.org issues on en.wikipedia.org via HTTPS for logged-in users? [19:36:54] gj Marybelle [19:36:57] L2IRC [19:36:58] I'm getting HTTP 503 errors. [19:37:11] Reedy: What'd I do wrong? :-( [19:37:13] sux2bu [19:37:32] I checked the channel topic! [19:38:04] https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=user&only=scripts&skin=monobook&user=MZMcBride&version=20131015T015532Z&* [19:38:08] for me bits was also down for 1-2min, but's back up for me [19:38:10] Example link that's 503ing for me. [19:38:35] Marybelle's link is intermittent for me [19:38:43] Yeah, intermittent for me now as well. [19:39:19] fine for me every time [19:39:20] bits network traffic is still down 30-50% or so [19:39:31] akosiaris: Because esams is better [19:39:38] akosiaris: that's because it got cached. change the version= and it will be intermittent again [19:39:43] https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=ext.gadget.edittop%7Cext.rtlcite%2Cwikihiero%7Cext.uls.nojs%7Cext.visualEditor.viewPageTarget.noscript%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmw.PopUpMediaTransform%7Cskins.monobook&only=styles&skin=monobook&* [19:39:43] or what Reedy said [19:39:58] https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=user%7Cuser.groups&only=styles&skin=monobook&user=MZMcBride&version=20131015T015532Z&* [19:40:14] Yeah, it's definitely intermittent. [19:40:22] well aware of that... just trying to figure out where the problem lies [19:40:36] Okay, that's all I was checking. Good luck! 
:-) [19:40:47] And thanks for poking at it. [19:42:40] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:43:00] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:43:30] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:43:40] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.332 second response time [19:43:50] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.593 second response time [19:44:13] gj icinga-wm [19:44:16] there we go [19:44:20] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.497 second response time [19:47:00] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:15] https://gdash.wikimedia.org/dashboards/jobq/ seems dead since 4 UTC [19:47:50] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.433 second response time [19:48:01] ori-l: see Nemo_bis ^ [19:48:30] Something for me to watch out about next week [19:49:20] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:49:51] jeremyb: isn't Aaron a better victim for a job queue ping? :) [19:50:16] Nemo_bis: we'll see... [19:50:21] job queue? [19:50:24] In my Wikimedia? [19:50:52] ori-l: siebrand Nikerabbit Weren't the ULS changes already deployed last week? [19:51:10] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.014 second response time [19:51:13] Reedy: Some, not all. [19:51:30] Reedy: And they'll only have hit all wikis be Thursday. [19:51:32] Could they account for a drop of network traffic from bits? Fonts etc? [19:51:36] Hmm.. [19:51:46] Reedy: I saw that drop, too. [19:52:17] Reedy: It could be. Not sure. [19:53:55] Quite a few cirrus related poolqueue full [19:54:23] warnings [19:55:28] poolqueue again? [19:56:18] !log reedy synchronized php-1.23wmf3/extensions/Wikibase 'https://gerrit.wikimedia.org/r/94790' [19:56:31] Logged the message, Master [19:56:54] Network traffic from bits app servers is picking up again [19:57:45] Also amusing [19:57:54] Increase memory limit due to some SVG related OOMs [19:58:17] Now see even more OOMs in GlobalFunctions [20:08:00] Reedy: just because they go further? 
:) [20:08:12] Yup [20:08:45] !log network level of bits application servers eqiad is back to the pre-deploy 18+8 MB/s [20:08:53] too bad, no more ULS savings it seems :) [20:09:02] Logged the message, Master [20:10:26] Reedy: well, it would be worrying if it increased the total number of OOMs I guess [20:10:50] where they OOM exactly matters less (assuming we don't have some ugly bug which corrupts database at some point and not another of course :P) [20:20:30] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:20] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.001 second response time [20:26:28] ottomata: on analytics 1027 and 1010 there are some weird errors from puppet which started early Nov 8, from cdh4 [20:27:09] hm [20:27:46] early = with the first puppet run after 1 am utc [20:27:48] looking [20:28:34] I couldn't see that anything had actually changed in the module or in puppet around that time, so punting to you [20:30:47] I very vaguely wonder if the power incident could have impacted anything there because that was also nov 8 around that time when it took out analytics1012 [20:30:54] no other guesses though [20:34:30] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:35:30] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 5.012 second response time [20:38:30] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:20] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.005 second response time [20:40:10] mark / paravoid? [20:40:17] yes? [20:40:20] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [20:40:30] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [20:40:41] paravoid: were there packet loss between esam and eqiad today which would've affected kafka? [20:40:52] yes [20:41:00] approx time window? [20:42:03] Snaps: 11 20:12:10 < thedj> speaking of weird stuff. analytics cluster just went from 0MB/s to 500MB/s :D [20:42:11] ~10:00 UTC - 13:25 UTC [20:42:19] related i guess but not an answer to your question [20:42:43] also yesterday for quite a while, and the day before for less time [20:43:16] huhm, okay, thanks guys. [20:43:20] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (112034) [20:43:29] jeremyb: unrelated [20:43:30] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (112016) [20:43:50] ottomata: can "we" try to map at least some of the drerrs to the above times? [20:44:09] ottomata got the incident report btw :) [20:44:38] awesome! 
[20:45:54] sorry, we tend to publish those incident reports to a wider audience (wikitech) after they've been resolved [20:46:10] (or not at all, when we're just too busy and fail to do it right :) [20:48:10] https://wikitech.wikimedia.org/wiki/Incident_documentation is the page [20:49:06] cool, thanks [21:10:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [21:10:30] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [21:13:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (106473) [21:13:30] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (106456) [21:13:52] flap, flap, flap [21:18:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [21:18:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [21:21:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (108372) [21:22:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (105476) [21:25:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [21:25:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [21:28:44] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=ms-be1003.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=cpu_report&c=Swift+eqiad [21:28:47] fun [21:30:00] PROBLEM - DPKG on ms-be1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:30:20] PROBLEM - swift-account-auditor on ms-be1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [21:30:21] PROBLEM - swift-object-replicator on ms-be1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [21:30:21] PROBLEM - swift-account-replicator on ms-be1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [21:30:30] PROBLEM - swift-account-server on ms-be1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [21:30:50] PROBLEM - swift-object-updater on ms-be1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [21:31:00] PROBLEM - swift-account-reaper on ms-be1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [21:32:46] !log rebooting ms-be1003, kernel bug, system CPU & I/O wait through the roof [21:33:01] Logged the message, Master [21:34:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (106918) [21:34:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (106999) [21:40:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [21:40:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [21:47:10] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [21:48:00] RECOVERY - swift-account-reaper on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python 
/usr/bin/swift-account-reaper [21:48:11] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [21:48:21] RECOVERY - swift-account-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [21:48:21] RECOVERY - swift-account-auditor on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [21:48:21] RECOVERY - swift-object-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [21:48:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101355) [21:48:30] RECOVERY - swift-account-server on ms-be1003 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [21:48:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101262) [21:48:50] RECOVERY - swift-object-updater on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [21:49:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [21:51:41] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [21:52:30] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [21:54:00] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [21:56:30] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:57:20] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.006 second response time [22:00:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (107794) [22:00:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (107746) [22:09:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [22:09:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [22:14:10] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 100,000 [22:15:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103640) [22:15:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103711) [22:16:02] that's not a useful alert message [22:17:20] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:17:58] well it's flappy and it's missing the wiki DB names somehow [22:18:10] because no single one is over but the total is over? [22:18:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [22:18:37] Haha [22:18:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [22:18:42] Inconsistent too [22:18:55] earlier Nemo_bis reported https://gdash.wikimedia.org/dashboards/jobq/ was broken [22:21:00] PROBLEM - Varnish traffic logger on amssq58 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:21:20] PROBLEM - Varnish HTCP daemon on amssq58 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:21:27] looks like the icinga-wm spam for job q started ~2 hrs after the graphs broke [22:21:30] PROBLEM - Varnish HTTP text-backend on amssq58 is CRITICAL: HTTP CRITICAL - No data received from host [22:25:49] icinga-wm has always always lied about jobqueue [22:26:41] usually, this doesn't: https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=hume.wikimedia.org&v=823574&m=Global_JobQueue_length&r=hour&z=default&jr=&js=&st=1365625056&z=large [22:27:04] Needs moving from hume [22:27:30] so what's that spike a day ago? [22:42:40] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:44:30] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 7.707 second response time [22:49:40] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [22:50:20] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 35.89 ms [22:51:06] jeremyb: it's only 100k items, you wouldn't even have noticed it few weeks ago when parsoid filled the queue with millions jobs :) it can be caused by a couple edits to templates [22:52:22] Nemo_bis: but it's lingering, not going away [22:53:38] jeremyb: how so, it went down from 180 to 100 then up 130 and now slowly down [22:53:40] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:54:16] compare to the exponential growth in https://ganglia.wikimedia.org/latest/graph.php?r=year&z=large&c=Miscellaneous+pmtpa&h=hume.wikimedia.org&v=823574&m=Global_JobQueue_length&jr=&js= , can't be that bad :P [22:55:30] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.001 second response time [22:55:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101160) [22:56:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101446) [22:59:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [22:59:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [23:25:31] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (104238) [23:25:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (104299) [23:29:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [23:29:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [23:39:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (125817) [23:39:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (125812) [23:39:55] icinga-wm, stfu? [23:40:35] maybe I should just edit some popular template to stop its flapping?:P [23:41:06] might want to just put in a real alertable number there.... [23:44:31] what's a real alertable number? 
[23:45:02] a couple of megs?:D [23:46:06] the problem is that the drop-off isn't smooth, because things are continuously added to the queue [23:46:27] so whatever threshold you set, if it is reached, is likely to cause a flapping alert [23:47:43] * greg-g nods [23:47:59] what was the original idea with the alert? [23:48:04] what is a failure state? [23:48:38] Continuously growing [23:48:47] is just "jobqueue increased by XX% in YYminutes"? [23:48:47] and not decreasing [23:49:07] but, it can't always be decreasing :) [23:49:27] Which can be 1 of 2 things - 1) job runners aren't running or 2) someone is doing something "bad" [23:49:28] so, "no drop in XX hours"? [23:49:42] Bad being editing a load of high use templates in a row, etc [23:49:45] but, there just seems like a lot of false alarms here :/ [23:50:19] Previously it was like [23:50:21] Ooh [23:50:24] It's in the millions [23:50:27] And it's still growing [23:51:04] maybe it's fundamentally the wrong approach to set up alerting based on the queue size [23:51:21] what about age of oldest job in the queue? [23:51:24] do we have that? [23:51:31] they're timestamped [23:51:35] And have a retry count [23:51:44] * greg-g nods [23:51:55] wikidata had that [23:53:08] job_attempts, job_token_timestamp [23:53:25] To have a useful metric, it's going to have to take a few things into consideration [23:53:37] * greg-g nods [23:54:38] I think repeatedly failing jobs go away after a while [23:54:43] emphasis on think there
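Sketching the "age of the oldest job" idea the discussion ends on: the job table is timestamped (job_timestamp; job_attempts and job_token_timestamp are also mentioned above), so a check could alert on staleness instead of raw queue size, which is what keeps flapping. Everything below is a hypothetical sketch — the DB host, the attempts cutoff and the age threshold are placeholders, and whether abandoned jobs really age out is exactly the part Reedy was unsure about.

```bash
# Hypothetical check: alert when the oldest still-pending job is older than N hours.
# Host, credentials, the attempts cutoff and the threshold are all placeholders.
WIKI_DB=enwiki
MAX_AGE_HOURS=6

AGE=$(mysql -h db-host.example -N -e "
    SELECT TIMESTAMPDIFF(HOUR,
                         MIN(STR_TO_DATE(job_timestamp, '%Y%m%d%H%i%s')),
                         UTC_TIMESTAMP())
    FROM job
    WHERE job_attempts < 3;   -- ignore jobs that look permanently failed
" "$WIKI_DB")

case "$AGE" in ''|NULL) AGE=0 ;; esac   # empty queue => nothing is stale

if [ "$AGE" -gt "$MAX_AGE_HOURS" ]; then
    echo "CRITICAL: oldest pending job on $WIKI_DB is ${AGE}h old"
    exit 2
fi
echo "OK: oldest pending job on $WIKI_DB is ${AGE}h old"
```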