[00:01:20] grrrr [00:01:38] yep [00:01:40] packet loss now [00:01:48] LeslieCarr: ^^ [00:02:46] see it both ways though [00:02:48] so that's good [00:02:55] PROBLEM - Puppet freshness on sq80 is CRITICAL: No successful Puppet run in the last 10 hours [00:15:09] got a response [00:15:14] they had some provider circuit rerouting [00:15:19] i am asking for a timeline of that [00:17:19] thank you both very much for looking into it on your sunday [00:20:59] yw [00:21:04] thanks for noticing it [00:31:41] looks fixed to me [00:34:24] yeah [00:34:25] oi [00:46:19] <3 salt: salt-call file.find /srv/deployment/mediawiki/slot0 name='.gitmodules' [00:46:23] returns a list [00:46:50] with absolute paths [01:18:31] springle: thanks for the comments about the wikidata terms table [01:28:02] aude: np [01:28:34] i'll poke lydia to see if we can take a look at this issue in this sprint [01:28:43] we know it is a high priority [01:28:54] cool [02:04:24] !log Earlier issue identified by Ryan and Leslie as intermittent packet loss between eqiad and esams, due to capacity issue with provider. [02:04:42] Logged the message, Master [02:07:22] aude: https://ishmael.wikimedia.org/more.php?host=db1021&hours=24&checksum=17996563317104387368 [02:07:59] !log LocalisationUpdate completed (1.23wmf2) at Mon Nov 11 02:07:59 UTC 2013 [02:08:16] Logged the message, Master [02:08:51] ori-l: aude went to sleep I think [02:08:56] said she was, at least [02:09:05] oh, ok [02:10:01] (03PS1) 10Ryan Lane: Add recursive submodule support to trebuchet [operations/puppet] - 10https://gerrit.wikimedia.org/r/94688 [02:10:21] ^^ and with that change I think I've pushed in code for all the blockers now :) [02:11:44] (03CR) 10jenkins-bot: [V: 04-1] Add recursive submodule support to trebuchet [operations/puppet] - 10https://gerrit.wikimedia.org/r/94688 (owner: 10Ryan Lane) [02:12:31] jenkins has an impeccable sense of comedic timing [02:13:03] the chance it will ding a changeset is directly proportional to how much attention you call to it immediately before the review is in [02:13:13] that's been my experience, at least [02:13:35] yep :) [02:13:47] well, I knew it was going to get dinged because I didn't fix the patch before it [02:14:09] !log LocalisationUpdate completed (1.23wmf3) at Mon Nov 11 02:14:09 UTC 2013 [02:14:26] Logged the message, Master [02:15:17] !log Continuing inspection of logs on fluorine. memcached-serious.log is flooded with 'Memcached error for key [...]' errors, problem started in May or June judging by log sizes. [02:15:36] Logged the message, Master [02:15:46] (03PS2) 10Ryan Lane: Add mediawiki module for fetch and checkout hooks [operations/puppet] - 10https://gerrit.wikimedia.org/r/94682 [02:16:44] (03CR) 10jenkins-bot: [V: 04-1] Add mediawiki module for fetch and checkout hooks [operations/puppet] - 10https://gerrit.wikimedia.org/r/94682 (owner: 10Ryan Lane) [02:20:19] (03PS3) 10Ryan Lane: Add mediawiki module for fetch and checkout hooks [operations/puppet] - 10https://gerrit.wikimedia.org/r/94682 [02:22:14] (I'm continuing to !log stuff but it's not in reference to a current emergency; I just can't possibly file that many bugs.) [02:22:33] ori-l: slacker! 
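On the memcached-serious.log note above ("problem started in May or June judging by log sizes"): a minimal sketch of that log-size check, assuming rotated archives live under a path like /a/mw-log/archive on fluorine — both the directory and the rotation naming are assumptions, not confirmed paths.

```bash
# Sketch only: the archive directory and rotation naming below are assumptions.
LOGDIR=/a/mw-log/archive
# A sudden jump in the rotated file sizes marks roughly when the flood began.
ls -lh "$LOGDIR"/memcached-serious.log-*.gz 2>/dev/null | awk '{print $5, $NF}'

# Rough per-archive counts of the error signature quoted in the !log entry:
for f in "$LOGDIR"/memcached-serious.log-*.gz; do
    printf '%s\t' "$f"
    zgrep -c 'Memcached error for key' "$f"
done
```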
[02:22:42] in general we'd put in rt tickets for all the issues [02:22:51] or bugzilla bugs [02:23:00] RT is such a joy to use, compared to BZ [02:23:04] this is deliberate agitprop on my part [02:23:04] if it belongs in bugzilla it probably isn't worth !log [02:23:22] because only ops reads !log [02:23:38] and we only ever read it if emergencies occur [02:24:06] and even then we only read it when we need to track down changes ;) [02:24:27] well, probably right, but I'm nearly done, and I'd like to be complete [02:24:52] part of my point is that we're (developers, not ops' fault) entirely too glib about leaving alarms on [02:25:30] so I am being making a point in being overly-sincere about treating every error at face value [02:25:54] * Ryan_Lane nods [02:26:00] without the benefit of "meh, that's the usual rate of severe errors" [02:26:12] yeah. I'm not a fan of the level of errors we have [02:26:12] s/being// [02:26:30] point taken about polluting the logs for ops tho, i'll choose another medium next time [02:26:46] well, it's more that it's not going to be read by others ;) [02:27:12] i'm pretty good at writing cranky e-mails, it's one of my few talents [02:27:16] i'll include a link [02:27:41] ori-l, you should post on Wikimedia-L more then ;) [02:27:56] heh [02:28:17] heh, it took me like all of two weeks to unsubscribe after i joined the WMF [02:29:01] !log Apache logs filled with "SearchPhaseExecutionException[Failed to execute phase [dfs], all shards failed" [02:29:17] Logged the message, Master [02:29:20] (03PS2) 10Ryan Lane: Add recursive submodule support to trebuchet [operations/puppet] - 10https://gerrit.wikimedia.org/r/94688 [02:38:50] (03CR) 10Chad: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94688 (owner: 10Ryan Lane) [02:40:34] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Nov 11 02:40:33 UTC 2013 [02:40:47] Logged the message, Master [02:56:14] (03PS3) 10Ryan Lane: Add recursive submodule support to trebuchet [operations/puppet] - 10https://gerrit.wikimedia.org/r/94688 [02:58:45] (03CR) 10Ryan Lane: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94688 (owner: 10Ryan Lane) [03:53:12] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:55:02] RECOVERY - HTTP on fenari is OK: HTTP OK: HTTP/1.1 200 OK - 4915 bytes in 0.089 second response time [05:04:35] TimStarling: is there a reason we don't use rsync with --compress? cdb files compress well (gzip achieves a ratio of 3.75) so it seems like an obvious way to speed up scap [05:04:57] just haven't gotten around to clicking +2 yet, I think [05:05:28] oh, OK. No problem, then :) [05:08:26] it was only 24 hours ago [05:08:33] (03PS2) 10Tim Starling: MW_RSYNC_ARGS: '--delay-updates' & '--compress' [operations/puppet] - 10https://gerrit.wikimedia.org/r/94591 (owner: 10Ori.livneh) [05:08:40] (03CR) 10Tim Starling: [C: 032] MW_RSYNC_ARGS: '--delay-updates' & '--compress' [operations/puppet] - 10https://gerrit.wikimedia.org/r/94591 (owner: 10Ori.livneh) [05:08:46] * jeremyb welcomes comments on https://gerrit.wikimedia.org/r/94111 :) [05:09:18] (03CR) 10Tim Starling: [V: 032] MW_RSYNC_ARGS: '--delay-updates' & '--compress' [operations/puppet] - 10https://gerrit.wikimedia.org/r/94591 (owner: 10Ori.livneh) [05:10:11] that wasn't a nudge to review, I can +2 in puppet now, and the repo's conventions call for exaggerated self-confidence [05:10:39] but, i'll take it! 
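For context on the change Tim merges above (MW_RSYNC_ARGS: '--delay-updates' & '--compress'): a standalone sketch of what those two rsync flags buy scap. The source/destination paths and hostname are placeholders, not the real scap layout.

```bash
# Illustrative only: paths and host are placeholders, not the real scap targets.
# --compress      : compress data in transit; cdb localisation files compress well
#                   (the chat quotes a gzip ratio of ~3.75), so syncs move less data.
# --delay-updates : stage all updated files in a temporary area and rename them into
#                   place at the end, so the target tree flips over near-atomically.
rsync -a --compress --delay-updates \
    /srv/mediawiki-staging/ \
    mw-host.example.wmnet:/srv/mediawiki/
```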
[05:11:24] that sounds like quips fodder [05:11:31] (exaggerated self confidence) [05:12:21] jeremyb: patch looks ok, but address management is outside the bounds of what I can review / merge [05:13:50] TimStarling: anyways, thanks, didn't mean to snark. If you're around I can run a scap, it seems safer than just leaving something like that for the next deployer to verify. [05:13:54] ori-l: yeah, wasn't thinking about you doing it. i didn't even know you could +2 until you just said so. i was asking for comments about whether this is the right solution (ideally from someone that knows better than i do) [05:14:06] ori-l: mazel tov i guess? [05:14:07] :) [05:14:13] we have to wait for a puppet cycle first [05:14:16] I had to kill someone from a rival gang [05:15:53] TimStarling: OK, I'll give a heads up in 60 mins and scap in 70 [05:16:15] ok [05:45:38] PROBLEM - Puppet freshness on sq48 is CRITICAL: No successful Puppet run in the last 10 hours [06:15:47] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [06:16:07] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [06:16:42] I'll run scap in 10 [06:19:47] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101298) [06:20:07] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101201) [06:22:57] PROBLEM - Host sq48 is DOWN: PING CRITICAL - Packet loss = 100% [06:25:48] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [06:26:07] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [06:26:50] running scap [06:27:55] !log ori Started syncing Wikimedia installation... : [06:28:07] RECOVERY - Host sq48 is UP: PING OK - Packet loss = 0%, RTA = 35.30 ms [06:28:13] Logged the message, Master [06:28:26] !powercycled hung sq48, took two tries to come up, "NMI received for unknown reason 31 on CPU 0" and "mptbase: ioc0: ERROR - Failed to come READY after reset" [06:28:32] grrr [06:28:50] !log powercycled hung sq48, took two tries to come up, "NMI received for unknown reason 31 on CPU 0" and "mptbase: ioc0: ERROR - Failed to come READY after reset" [06:29:05] Logged the message, Master [06:30:37] PROBLEM - Frontend Squid HTTP on sq48 is CRITICAL: Connection refused [06:31:07] PROBLEM - SSH on sq48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:31:47] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (102298) [06:32:07] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (102345) [06:32:40] !log sq48 repeat of these errors and hung again, so rt #6274 opened [06:32:56] Logged the message, Master [06:33:26] !log ori Finished syncing Wikimedia installation... : [06:33:41] Logged the message, Master [06:33:53] ori-l: how'd it fair with the rsync changes? 
[06:34:33] well, it took five minutes, but i wasn't pushing out a new version [06:34:47] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [06:35:08] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [06:35:12] ori-l: so, at least it didn't break stuff [06:35:51] no, but i would've been very surprised if it had [06:36:29] of courrse [06:36:31] -r [06:37:27] PROBLEM - Host sq80 is DOWN: PING CRITICAL - Packet loss = 100% [06:37:57] RECOVERY - Puppet freshness on sq80 is OK: puppet ran at Mon Nov 11 06:37:55 UTC 2013 [06:38:07] RECOVERY - Host sq80 is UP: PING OK - Packet loss = 0%, RTA = 35.38 ms [06:38:07] RECOVERY - Backend Squid HTTP on sq80 is OK: HTTP OK: HTTP/1.0 200 OK - 486 bytes in 0.079 second response time [06:39:48] !log probably gratuitous powercycle of sq80, it seems fine now in any case [06:40:03] Logged the message, Master [06:43:04] !log scap: "sudo: no tty present and no askpass program specified" for snapshot1 & snapshot4 [06:43:21] Logged the message, Master [06:43:22] i presume that's expected [06:43:48] I should add those to the decom/reclaim list soon [06:44:52] any other whiners? [06:45:47] just me :) [06:45:58] * apergos scaps to ori-l [06:46:53] :D [06:54:57] RECOVERY - SSH on sq48 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [06:58:23] yeah whatever. so not trustworthy that box [07:02:28] Icinga is like the annoying relative that sends you Fwd: Fwd: Fwd: emails about the coming Mayan apocalypse [07:03:07] yeah but that same annoying relative sometimes sends you things that aren't urban legends [07:03:18] and that they heard about while you didn't... [07:03:38] (03PS1) 10ArielGlenn: clean up cruft for db fundraising hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/94704 [07:05:09] so not awake yet *yawn* [07:06:28] (03CR) 10ArielGlenn: [C: 032] clean up cruft for db fundraising hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/94704 (owner: 10ArielGlenn) [07:25:46] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103485) [07:26:06] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (104258) [07:30:16] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [07:30:46] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [07:37:52] ori-l: if you run some grep -c of the poolcounter full errors you'd make jeremyb SO happy [07:45:46] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (102131) [07:46:06] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (102138) [07:47:11] jeremyb: what do you need me to count? 
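A sketch of the grep -c suggested above for the PoolCounter "queue full" errors; the log path and the exact message text are assumptions, to be adjusted to whatever actually appears on fluorine.

```bash
# Hypothetical: both the log location and the message text are assumptions.
LOG=/a/mw-log/poolcounter.log
grep -c -i 'pool queue is full' "$LOG"

# Bucket by hour (assumes lines start with "YYYY-MM-DD HH:MM:SS") to see whether
# the rate actually changed, rather than just getting a grand total:
grep -i 'pool queue is full' "$LOG" | awk '{print $1, substr($2, 1, 2)}' | sort | uniq -c
```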
[07:49:46] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [07:50:06] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [07:53:46] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101281) [07:54:06] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101287) [07:58:46] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [07:59:06] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [08:21:50] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103517) [08:22:10] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103562) [08:23:50] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [08:24:10] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [08:28:50] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103633) [08:29:10] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103653) [08:30:50] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [08:31:11] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [08:35:50] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103760) [08:36:10] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103728) [08:37:50] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [08:38:10] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [08:46:50] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103331) [08:47:10] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (102872) [08:49:50] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [08:50:02] !log - update mwlib.rl to 0.14.4. 
[08:50:10] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [08:50:22] Logged the message, Master [08:50:34] !log restarted all services [08:50:51] Logged the message, Master [09:01:46] (03CR) 10Akosiaris: [C: 032] Fixing ownerships, permissions in various places [operations/puppet] - 10https://gerrit.wikimedia.org/r/94365 (owner: 10Akosiaris) [09:12:04] (03CR) 10Faidon Liambotis: [C: 04-1] "(4 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/92271 (owner: 10Ori.livneh) [09:21:11] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103663) [09:21:32] (03CR) 10Ori.livneh: "(4 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/92271 (owner: 10Ori.livneh) [09:23:08] (03PS3) 10Mark Bergsma: Create a backend_random director [operations/puppet] - 10https://gerrit.wikimedia.org/r/94350 [09:24:11] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [09:24:37] (03CR) 10Mark Bergsma: [C: 032] Create a backend_random director [operations/puppet] - 10https://gerrit.wikimedia.org/r/94350 (owner: 10Mark Bergsma) [09:29:59] (03PS1) 10Mark Bergsma: Fetch default_backend from the vcl_config hash [operations/puppet] - 10https://gerrit.wikimedia.org/r/94709 [09:30:48] mark____: good morning [09:30:51] you need to ghost your nickname [09:31:17] there's someone by the name of "mark" who's not you [09:31:21] (03CR) 10Mark Bergsma: [C: 032] Fetch default_backend from the vcl_config hash [operations/puppet] - 10https://gerrit.wikimedia.org/r/94709 (owner: 10Mark Bergsma) [09:31:46] better get used to it [09:32:32] haha [09:34:41] meh... 27 wikitech pages with references to sockpuppet that I will have to update in order to decomission them. And another 11 for stafford..... sigh [09:34:54] (03CR) 10Mark Bergsma: "What's the point of this? It's implicit." [operations/puppet] - 10https://gerrit.wikimedia.org/r/94365 (owner: 10Akosiaris) [09:35:15] mark: not really [09:35:29] I found out this by switching cp4001 to point to palladium [09:35:40] and files and directories started changing owners and permissions [09:36:08] mark____: ^ [09:36:25] newer puppet? [09:37:10] ubuntu5 > ubuntu3 ? [09:37:24] I kind of doubt it would cause this [09:37:28] don't [09:37:34] ? [09:37:38] there were changes for filesystem permissions and such [09:37:39] (03CR) 10Ori.livneh: "(1 comment)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/92271 (owner: 10Ori.livneh) [09:37:44] in the security update, iirc [09:38:10] (03PS1) 10Mark Bergsma: Fix erroneous attribute [operations/puppet] - 10https://gerrit.wikimedia.org/r/94711 [09:39:10] In any case... files were changing from ganglia user to gmetric user [09:39:11] oh, they fixed the performance regression with 2.5 [09:39:12] awesome [09:39:13] (03CR) 10Mark Bergsma: [C: 032] Fix erroneous attribute [operations/puppet] - 10https://gerrit.wikimedia.org/r/94711 (owner: 10Mark Bergsma) [09:39:18] both were wrong [09:39:30] * ori-l sleeps [09:39:39] akosiaris: used to be always root if not specified [09:39:46] and consequently we omitted it very very often [09:39:58] (03PS6) 10Ori.livneh: Add Graphite module & role [operations/puppet] - 10https://gerrit.wikimedia.org/r/92271 [09:40:23] Yes it did... 
but somehow this is not the case now [09:40:41] puppet labs can't do anything right can they [09:40:51] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (102153) [09:41:11] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (100994) [09:43:10] 2.4 was all filesystem changes [09:43:51] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [09:44:11] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [09:44:46] mark____: http://projects.puppetlabs.com/issues/5240 [09:45:13] 3 years now open. So yes ... puppetlabs can do anything right :-) [09:45:31] like leaving it there for 3 years :-( [09:46:31] we can have a File class if you like [09:46:34] to set reasonable defaults [09:46:40] we might solve this with a File { owner => 'root', group => 'root' } in site.pp. What will happen afterwards [09:46:47] yup [09:47:08] although I wouldn't mind being explicit in the manifests themselves [09:47:44] I am pretty confident that while moving stuff to palladium I 'll be noticing this more and more [09:49:28] harmon was a swift test host? [09:49:38] -r-xr-x--- 1 gitpuppet ganglia 187 Nov 8 12:04 post-merge [09:49:39] oh my god [09:49:40] lol [09:49:50] how many swift test hosts did we have [09:50:20] (03PS1) 10Mark Bergsma: Use random backend for CentralAutoLogin/start requests [operations/puppet] - 10https://gerrit.wikimedia.org/r/94712 [09:50:48] akosiaris: yeah but I recall reading somewhere that it only goes like 5 nest levels deep or something fucked up like that [09:50:50] so it just sets the owner/group to whatever is in the repo of the master [09:51:13] (03CR) 10jenkins-bot: [V: 04-1] Use random backend for CentralAutoLogin/start requests [operations/puppet] - 10https://gerrit.wikimedia.org/r/94712 (owner: 10Mark Bergsma) [09:51:16] and I just remembered having met this before [09:51:34] mark____: what ? oh man.... why ? why ? why ? [09:51:41] so then you suddenly get things breaking just because you moved it into a subclass or whatever [09:51:43] yeah I know [09:51:54] if not for that, I would have set the default mode 444 etc with that too [09:52:33] so that's one of the reasons why the style guide deprecates those attribute defaults [09:52:39] simply because the implementation is fucked up [09:53:24] (03PS2) 10Mark Bergsma: Use random backend for CentralAutoLogin/start requests [operations/puppet] - 10https://gerrit.wikimedia.org/r/94712 [09:54:52] (03CR) 10Mark Bergsma: [C: 032] Use random backend for CentralAutoLogin/start requests [operations/puppet] - 10https://gerrit.wikimedia.org/r/94712 (owner: 10Mark Bergsma) [09:56:02] notice: /Stage[main]/Varnish::Common/File[/usr/share/varnish/reload-vcl]/owner: owner changed 'ganglia' to 'root' [09:56:03] oh man [09:56:10] i hope stuff is not really fucked up somewhere [09:58:13] wait, so this was broken before? [10:00:35] i don't think this affected anything [10:00:51] but if that obscure varnish script got a new owner, who knows what else was affected across the cluster? [10:01:15] akosiaris: is that new puppetmaster being used atm? [10:01:20] perhaps we better make sure it's not [10:01:28] the point is that it was ganglia before [10:01:33] and that was with the old puppetmaster [10:01:47] so it has been broken for a while probably? [10:02:23] are you sure this was the old puppetmaster? 
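A minimal way to see the site.pp fix being discussed (the global File { owner => 'root', group => 'root' } default) in action locally, using puppet apply against a scratch manifest. The /tmp paths are just examples; the dynamic-scoping caveat mark raises is the reason the upstream style guide discourages resource defaults.

```bash
# Sketch: exercise a global File resource default with `puppet apply --noop`.
# The /tmp paths are scratch examples; the real default would live in site.pp.
cat > /tmp/file-default-demo.pp <<'EOF'
# Capitalised type = resource default. Note these are dynamically scoped, so they
# also leak into child scopes -- the gotcha discussed above.
File {
    owner => 'root',
    group => 'root',
}

file { '/tmp/file-default-demo.txt':
    ensure  => file,
    content => "owned by root:root via the default\n",
}
EOF
puppet apply --noop /tmp/file-default-demo.pp
```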
[10:02:39] i assumed it was the new one [10:02:51] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (107652) [10:03:11] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (105919) [10:05:06] damnit [10:05:17] just noticed the eqiad varnish servers are configured for 100G storage instead of 300 [10:05:34] I am not sure, no [10:05:53] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [10:06:13] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [10:06:55] and so is ulsfo... [10:08:13] PROBLEM - Puppetmaster HTTPS on strontium is CRITICAL: Connection refused [10:08:49] this almost certainly precludes us from migrating eqiad today [10:09:48] mark____: nothing is on the new puppetmaster yet [10:10:07] akosiaris: so this, on cp1068, was caused by the old puppetmaster? [10:10:16] yes [10:10:20] ok [10:10:23] PROBLEM - Varnish HTTP text-backend on cp4010 is CRITICAL: Connection refused [10:10:42] perhaps the temporary use of that ubuntu5 a while ago [10:11:25] ubuntu4 [10:11:46] ubuntu5 = ubuntu4 security fixes + the performance regression fix [10:11:59] 2.5 even [10:12:29] right [10:12:36] meh [10:12:45] so now I need to restart all varnish instances with 300GB instead of 100G [10:12:53] and some will have been mmapped less than 300GB apart [10:13:23] RECOVERY - Varnish HTTP text-backend on cp4010 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.150 second response time [10:13:46] ah it doesn't matter because if the size changes it gets wiped [10:19:02] paravoid: akosiaris do you know how i can login to https://gdash.wikimedia.org/ ? [10:19:26] or anyone :) [10:20:03] gdash doesn't need login [10:20:21] if i click data browser then it does [10:20:21] (03PS1) 10Akosiaris: Specify puppetmaster::gitclone missing ownerships [operations/puppet] - 10https://gerrit.wikimedia.org/r/94714 [10:21:49] bbl... [10:22:13] with the same nick? [10:22:14] :P [10:27:04] andre__: do you know about logging into graphite.wikimedia.org? [10:27:06] https://bugzilla.wikimedia.org/show_bug.cgi?id=54713 [10:27:15] as staff, maybe it works but not for normal folks like me [10:28:34] aude, hmm, I thought it's really only bound to a Gerrit/Wikitech account, but maybe there's another "staff" boolean that I don't know about [10:28:43] aude, sounds like a question for the Labs channel or list [10:28:44] it doesn't work for me [10:28:55] it works for me :-/ [10:29:04] do you get an error message? [10:29:09] andre__: aude IIRC it requires you to be in a specific LDAP group [10:29:14] that says you are from the wmf [10:29:19] aha! [10:29:25] :( [10:29:34] thanks YuviPanda|away [10:29:58] ok, so then we need the wmde group made [10:30:04] yeah [10:30:14] * aude hopes ryan is not on holiday [10:30:28] Require ldap-group cn=wmf,ou=groups,dc=wikimedia,dc=org [10:30:31] * aude sees  [10:31:24] aude: if you want me to look at particular things, I can help [10:31:34] although bewarned, I hadn't slept since before I last spoke to you... :D [10:31:35] * aude wants to poke around [10:31:41] ah, can't help there, I guess :( [10:31:42] sleep! 
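An aside on the access check above: gdash's data browser and graphite sit behind that Apache "Require ldap-group" rule, so membership in the cn=wmf group is what is being tested. A hedged way to inspect that group, assuming a reachable LDAP server — the URI below is a placeholder; the base DN is the one quoted above.

```bash
# Sketch only: the LDAP URI is a placeholder; the group DN is the one from the
# "Require ldap-group cn=wmf,ou=groups,dc=wikimedia,dc=org" line quoted above.
ldapsearch -x -LLL \
    -H ldap://ldap.example.wikimedia.org \
    -b 'ou=groups,dc=wikimedia,dc=org' \
    '(cn=wmf)' member
# A wmde group, as aude suggests, would mean a cn=wmde entry under the same OU plus
# a second Require ldap-group line in the Apache config.
```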
[10:32:02] trying to investigate issue with wikidata [10:32:42] also, ori-l gave me a Icinga link that i cant see [10:32:58] errr ishmael.wikimedia.org [10:33:04] can't see either [10:35:01] yeah, tied to the same group [10:35:29] alright :( [10:36:07] andre__: you can search for '[Ops] providing restricted tools to a wider audience ' in the ops list for the last known conversation about this [10:36:16] bz was mentioned, but the convo died down [10:36:36] hmm [10:37:03] aude: heh, in the email from way back in July, Sumanah mentions 'And evidently Wikimedia DE people could use a LDAP group as well (for Jenkins/Gerrit permissions); do we know whether that's particularly urgent?' [10:37:04] :) [10:37:12] probably [10:41:20] (03PS1) 10TTO: Make missing.php aware of interwiki prefixes [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94716 [10:43:45] omg, pmacct uses cvs for revision control [10:51:53] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101196) [10:52:53] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [11:02:14] (03PS2) 10TTO: Make missing.php aware of interwiki prefixes [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94716 [11:59:26] (03CR) 10Akosiaris: [C: 032] Specify puppetmaster::gitclone missing ownerships [operations/puppet] - 10https://gerrit.wikimedia.org/r/94714 (owner: 10Akosiaris) [12:01:48] (03PS1) 10Mark Bergsma: Add a 2nd 200GB persistent storage backend per SSD [operations/puppet] - 10https://gerrit.wikimedia.org/r/94720 [12:02:39] (03CR) 10jenkins-bot: [V: 04-1] Add a 2nd 200GB persistent storage backend per SSD [operations/puppet] - 10https://gerrit.wikimedia.org/r/94720 (owner: 10Mark Bergsma) [12:02:54] (03PS2) 10Mark Bergsma: Add a 2nd 200GB persistent storage backend per SSD [operations/puppet] - 10https://gerrit.wikimedia.org/r/94720 [12:05:05] (03CR) 10Mark Bergsma: [C: 032] Add a 2nd 200GB persistent storage backend per SSD [operations/puppet] - 10https://gerrit.wikimedia.org/r/94720 (owner: 10Mark Bergsma) [12:20:55] (03CR) 10Faidon Liambotis: [C: 032] "Looks sane to me. Thanks for taking care of it :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/92271 (owner: 10Ori.livneh) [12:25:18] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (125706) [12:25:38] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (126305) [13:05:29] https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=ext.articleFeedbackv5.st etc [13:05:35] latency 12.9s [13:05:40] X-Cache: cp1057 miss (0), cp3022 miss (0) [13:05:44] X-Varnish: 159566344, 3845145078 [13:05:51] from the netherlands [13:06:03] saw something similar yesterday evening at home [13:11:50] i see some concerns in the log about equid -> esams capacity... might be related ? [13:12:56] (03PS1) 10Mark Bergsma: Create a random director for non-tier1 backends as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/94722 [13:13:05] thedj[work]: indeed it is [13:13:17] 14% packet loss for crying out loud [13:13:18] mark: ^^ [13:13:56] i guess the nsa filter can't keep up anymore :D [13:14:06] so, does udpmcast still exist somewhere? [13:14:15] what's up? 
[13:14:25] 14% packet loss on eqiad-esams [13:14:38] it happened during the weekend too, leslie opened a ticket [13:14:42] where do you see that? [13:14:51] it stopped happening so she stopped investigating, but now it happens again [13:14:55] ping from hooft to bast1001 [13:15:45] 23% packet loss now [13:15:52] * paravoid grumbles [13:19:27] mark: are you investigating or should I do something about it? :) [13:19:39] what do you want to do about it? :) [13:19:54] a) contact vendor b) stop using the link, reinstate udpmcast? [13:20:13] either both or just (a) [13:20:35] go ahead [13:21:20] what do you think? [13:22:30] I think I see a capped 1 Gbps link [13:23:20] so go ahead, you have experience with that ;-) [13:25:26] you know what [13:25:29] I'll stop ospf3 on that link now [13:25:39] that will move upload off... [13:25:43] right, if packet loss is the result of congestion [13:25:47] and not a fault [13:27:33] traffic going down [13:31:19] i found the order form [13:31:23] explicitly says 'rate limit NO' [13:31:37] is it on RT somewhere? [13:32:50] i'm putting it in as we speak [13:34:06] (03PS1) 10Odder: (bug 56899) Extra NS for Collection on enwikisource [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94723 [13:34:38] #6276 [13:35:46] 1000mbps commit, but no rate limit [13:36:24] ffs [13:36:53] ? [13:36:58] nothing, I'm annoyed at them [13:37:27] so, I can handle the communication, should I? [13:38:36] you can [13:38:44] i haven't dealt with them before and I'm busy [13:38:52] okay [13:39:25] my old tinet login no longer seems to work [13:41:00] (03CR) 10Mark Bergsma: [C: 032] Create a random director for non-tier1 backends as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/94722 (owner: 10Mark Bergsma) [13:49:08] (03PS1) 10Mark Bergsma: Remove the .weight parameter [operations/puppet] - 10https://gerrit.wikimedia.org/r/94726 [13:50:51] (03CR) 10Mark Bergsma: [C: 032] Remove the .weight parameter [operations/puppet] - 10https://gerrit.wikimedia.org/r/94726 (owner: 10Mark Bergsma) [14:03:53] (03PS1) 10Mark Bergsma: Expand vcl variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/94728 [14:04:52] (03CR) 10Mark Bergsma: [C: 032] Expand vcl variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/94728 (owner: 10Mark Bergsma) [14:07:01] (03PS1) 10Mark Bergsma: Add .weight again, as a constant [operations/puppet] - 10https://gerrit.wikimedia.org/r/94729 [14:08:02] (03CR) 10Mark Bergsma: [C: 032] Add .weight again, as a constant [operations/puppet] - 10https://gerrit.wikimedia.org/r/94729 (owner: 10Mark Bergsma) [14:20:37] PROBLEM - Varnish HTTP text-backend on cp1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:21:27] RECOVERY - Varnish HTTP text-backend on cp1054 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 5.813 second response time [14:23:07] (03PS1) 10Akosiaris: Temporarily reduce puppet TTLs to 5min [operations/dns] - 10https://gerrit.wikimedia.org/r/94734 [14:28:27] PROBLEM - Varnish HTTP text-backend on amssq54 is CRITICAL: HTTP CRITICAL - No data received from host [14:28:37] PROBLEM - Varnish HTCP daemon on amssq54 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:28:37] PROBLEM - Varnish traffic logger on amssq54 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:31:14] !log rebooting amssq54, XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250) [14:31:37] Logged the message, Master [14:31:37] lol [14:34:08] PROBLEM - SSH on amssq54 is CRITICAL: Connection refused [14:45:08] RECOVERY - SSH on amssq54 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [14:45:28] RECOVERY - Varnish traffic logger on amssq54 is OK: PROCS OK: 2 processes with command name varnishncsa [14:45:28] RECOVERY - Varnish HTCP daemon on amssq54 is OK: PROCS OK: 1 process with UID = 110 (vhtcpd), args vhtcpd [14:55:27] RECOVERY - Varnish HTTP text-backend on amssq54 is OK: HTTP OK: HTTP/1.1 200 OK - 190 bytes in 0.192 second response time [15:06:48] !log Moved eqiad https/ipv6 text traffic from squid to varnish [15:07:05] Logged the message, Master [15:14:16] (03CR) 10Akosiaris: [C: 032] Temporarily reduce puppet TTLs to 5min [operations/dns] - 10https://gerrit.wikimedia.org/r/94734 (owner: 10Akosiaris) [15:14:46] (03PS1) 10Mark Bergsma: Migrate textsvc from text to text-varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/94740 [15:16:14] (03CR) 10Mark Bergsma: [C: 032] Migrate textsvc from text to text-varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/94740 (owner: 10Mark Bergsma) [15:23:42] (03PS1) 10Mark Bergsma: Move eqiad wikipedia traffic to Varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/94741 [15:26:46] (03CR) 10Mark Bergsma: [C: 032] Move eqiad wikipedia traffic to Varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/94741 (owner: 10Mark Bergsma) [15:32:03] !log Moved eqiad wikipedia traffic onto Varnish [15:32:13] yey!!!! [15:32:18] Logged the message, Master [15:33:38] her mark paravoid [15:33:50] hi [15:34:04] hey [15:34:18] grrrrr [15:34:21] saw the above [15:34:24] yeah... [15:34:35] I'm nagging their noc over mail and phone [15:34:40] they're not being very responsive... [15:34:53] since the buyout my favorite provider to work with is falling down the list [15:36:35] hm [15:36:41] i see varnish complaining about gzip a lot [15:40:26] (03PS1) 10Akosiaris: Specify perms for generic::upstart_job [operations/puppet] - 10https://gerrit.wikimedia.org/r/94746 [15:41:31] LeslieCarr: perhaps luis was running the entire network by himself before ;p [15:41:31] Nemo_bis: ori-l: errr? huh? why should you count? [15:41:59] or Lou [15:43:55] haha [15:45:30] (03CR) 10Akosiaris: [C: 032] Specify perms for generic::upstart_job [operations/puppet] - 10https://gerrit.wikimedia.org/r/94746 (owner: 10Akosiaris) [15:45:50] PROBLEM - Puppet freshness on sq48 is CRITICAL: No successful Puppet run in the last 10 hours [16:12:09] (03PS1) 10Reedy: Bump wmgMemoryLimit to 210MB [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94756 [16:13:07] !log upgrading image/video scalers [16:13:21] Logged the message, Master [16:17:17] (03CR) 10Reedy: [C: 032] Bump wmgMemoryLimit to 210MB [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94756 (owner: 10Reedy) [16:20:39] !log - upgrade mwlib to 0.15.12 [16:20:53] Logged the message, Master [16:21:11] !log restarted all services [16:21:14] (03Merged) 10jenkins-bot: Bump wmgMemoryLimit to 210MB [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94756 (owner: 10Reedy) [16:21:14] lol [16:21:19] so efficient [16:22:27] !log reedy synchronized wmf-config/InitialiseSettings.php 'Bump wmgMemoryLimit to 210MB' [16:22:39] Logged the message, Master [16:25:41] I love how the err screen on git.wm.o says "Your cache administrator is nobody." 
[16:30:18] hexmode: there's a bug for it if you want [16:30:33] jeremyb: I count every day [16:31:00] ah, the poolcounter? because you were trying to claim that it increased [16:35:31] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:36:30] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 2.047 second response time [16:37:18] (03PS1) 10Mark Bergsma: Allow caching of CentralAutoLogin/checkLoggedIn [operations/puppet] - 10https://gerrit.wikimedia.org/r/94765 [16:40:35] (03PS7) 10Ori.livneh: Add Graphite module & role [operations/puppet] - 10https://gerrit.wikimedia.org/r/92271 [16:41:54] (03CR) 10Ori.livneh: [C: 032 V: 032] Add Graphite module & role [operations/puppet] - 10https://gerrit.wikimedia.org/r/92271 (owner: 10Ori.livneh) [16:44:21] (03PS2) 10Mark Bergsma: Allow caching of login.wikimedia.org requests [operations/puppet] - 10https://gerrit.wikimedia.org/r/94765 [16:46:08] (03PS1) 10Ori.livneh: Qualify the path to Python [operations/puppet] - 10https://gerrit.wikimedia.org/r/94767 [16:46:53] (03CR) 10Ori.livneh: [C: 032 V: 032] Qualify the path to Python [operations/puppet] - 10https://gerrit.wikimedia.org/r/94767 (owner: 10Ori.livneh) [16:49:27] annnnnd everything applied correctly [16:49:38] \o/ [16:49:38] nice job [16:49:44] thanks :) [16:50:04] Nemo_bis: ugly comments :( [16:52:36] i think we either need to copy all the existing whisper baggage from professor or just leave it running as graphite-old.wikimedia.org for a little longer until the lights go out in tampa [16:52:52] and lose all the perf data? [16:52:56] I don't like that much [16:53:08] but it's so nice and cleannnnnnnnnn now [16:53:16] but yeah, i guess you're right [16:53:58] need to think about it [16:54:20] if the old stuff is not compatible with a new naming scheme it can be renamed to all go under one namespace [16:54:28] deprecated.* or whatever [16:54:45] if we can rename, why not rename to the right locations? [16:54:57] (I have zero experience with whisper) [16:55:50] if the mapping allows us to rename everything into the right place with a few operations, that would work, but i think there are all these subtle variations and duplicated metrics [16:55:57] so it would be a ton of manual work [16:56:31] i guess it could at least be done for the important metrics, i.e., whatever's feeding the current set of gdash dashboards [16:57:13] gdash only goes back 1 month at most I think [16:57:31] but I think having these for archiving purposes if possible is a good idea [16:57:40] it's not like we haven't ever found issues that started 3 months back [16:57:42] ori-l: are now on change of gdash? [16:57:51] nope [16:57:58] not falling for that trap, Nemo_bis :P [16:58:14] hehe [16:58:23] ori-l: more specifically, do you know how to do this https://bugzilla.wikimedia.org/show_bug.cgi?id=41754 [16:58:52] yes, you can submit a patch yourself, let me point you to the right file [17:00:04] Nemo_bis: http://git.wikimedia.org/tree/operations%2Fpuppet.git/ca6fe4efc30c6a4b2606b13aab178b9e71914dca/files%2Fgraphite%2Fgdash%2Fdashboards%2Freqerror [17:01:26] the DSL gdash uses to describe graphs is at https://github.com/ripienaar/graphite-graph-dsl/wiki [17:04:00] (03PS1) 10Akosiaris: ULSFO uses new puppetmasters [operations/dns] - 10https://gerrit.wikimedia.org/r/94768 [17:04:25] what's "funny" is that last day disproved my bug, we got more 5xx than 500 [17:04:28] do we use the puppet.ulsfo.wmnet cnames and such? 
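On the "rename the old whisper baggage into one namespace" idea above: Graphite metric names map one-to-one onto .wsp files and directories on disk, so moving everything under deprecated.* is a filesystem move. A rough sketch, with the storage root as an assumption (it varies by packaging), and with the caveat from the chat that mapping individual metrics onto the right new names would still be manual work.

```bash
# Rough sketch, not a tested migration. The whisper storage root is an assumption
# (commonly /opt/graphite/storage/whisper or /var/lib/graphite/whisper).
WSP=/var/lib/graphite/whisper

mkdir -p "$WSP/deprecated"
for d in "$WSP"/*/; do
    name=$(basename "$d")
    [ "$name" = deprecated ] && continue
    # metric tree foo.bar.baz becomes deprecated.foo.bar.baz
    mv "$d" "$WSP/deprecated/$name"
done
```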
[17:04:58] it looks like we do, interesting [17:05:15] oh, or not [17:05:22] server = puppet [17:05:40] so whatever is first at resolv.conf :) [17:06:34] paravoid: :-) [17:06:45] now... let's see what happens when enabled [17:07:02] (03CR) 10Akosiaris: [C: 032] ULSFO uses new puppetmasters [operations/dns] - 10https://gerrit.wikimedia.org/r/94768 (owner: 10Akosiaris) [17:07:33] famous last words [17:07:39] :-) [17:12:27] RT on duty looks out of date (set in the first half of last week) [17:12:32] is today a holiday? [17:12:41] it is for the US people, yes [17:12:43] yes [17:12:50] k [17:40:36] !log reedy synchronized php-1.23wmf3/extensions/FlaggedRevs 'https://gerrit.wikimedia.org/r/94774' [17:40:51] Logged the message, Master [17:59:40] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:00:21] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.002 second response time [18:13:58] (03PS1) 10Akosiaris: More fixes for file permissions/ownerships [operations/puppet] - 10https://gerrit.wikimedia.org/r/94777 [18:27:08] (03PS1) 10Akosiaris: Remove references to /etc/puppet/software [operations/puppet] - 10https://gerrit.wikimedia.org/r/94779 [18:40:10] (03PS1) 10Yurik: Revert "Revoke Yuri's shell access" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94780 [18:40:35] mark, paravoid ^ :) [18:40:49] are you crazy? [18:40:52] no [18:40:53] new key [18:41:23] paravoid, good point, sorry, regening [18:43:31] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:44:30] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 4.819 second response time [18:59:55] (03PS2) 10Ottomata: Fix indentation in manifests/misc/statistic.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/94624 (owner: 10QChris) [18:59:59] (03CR) 10Ottomata: [C: 032 V: 032] Fix indentation in manifests/misc/statistic.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/94624 (owner: 10QChris) [19:00:27] (03PS2) 10Ottomata: Restrict access to geowiki's data-private checkout [operations/puppet] - 10https://gerrit.wikimedia.org/r/94625 (owner: 10QChris) [19:00:32] (03CR) 10Ottomata: [C: 032 V: 032] Restrict access to geowiki's data-private checkout [operations/puppet] - 10https://gerrit.wikimedia.org/r/94625 (owner: 10QChris) [19:08:30] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:09:20] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.001 second response time [19:22:55] !log reedy updated /a/common to {{Gerrit|Ibab846000}}: Bump wmgMemoryLimit to 210MB [19:22:58] (03PS1) 10Reedy: Non wikipedias to 1.23wmf3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94785 [19:23:13] Logged the message, Master [19:23:32] ori-l: ^ Guess that's related to your change? [19:23:36] is that autogenerated from sartoris? [19:23:39] It's a bit of a lie as I've only committed a locally [19:23:53] I'm not using sartoris [19:24:54] Oh.. 
[19:25:01] It's still a lie [19:25:16] (03CR) 10Reedy: [C: 032] Non wikipedias to 1.23wmf3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94785 (owner: 10Reedy) [19:25:24] (03Merged) 10jenkins-bot: Non wikipedias to 1.23wmf3 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/94785 (owner: 10Reedy) [19:25:40] reedy@tin:/a/common$ git pull [19:25:41] From ssh://gerrit.wikimedia.org:29418/operations/mediawiki-config [19:25:41] 3dfebcd..433496a master -> origin/master [19:25:41] Current branch master is up to date. [19:26:13] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.23wmf3 [19:26:34] Logged the message, Master [19:27:09] * Reedy is slightly confused [19:31:31] (03PS1) 10coren: Tool Labs: install cvs on dev_environ [operations/puppet] - 10https://gerrit.wikimedia.org/r/94786 [19:31:41] cvs? [19:31:44] retro much? [19:31:53] (03CR) 10coren: [C: 032] "Package install" [operations/puppet] - 10https://gerrit.wikimedia.org/r/94786 (owner: 10coren) [19:32:15] Reedy: It's the 8-bit nostalgia version of source control. :-) [19:32:22] I know what is is ;) [19:32:32] More wondering who/what needs/wants it [19:33:00] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:09] krd [19:34:15] For some reason. Tool labs user. [19:34:31] (03CR) 10Yurik: [C: 04-1] "I will need to regen private key - will commit it in a bit." [operations/puppet] - 10https://gerrit.wikimedia.org/r/94780 (owner: 10Yurik) [19:34:50] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.589 second response time [19:35:30] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:30] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 5.548 second response time [19:36:33] Hi. Have there been any reports of bits.wikimedia.org issues on en.wikipedia.org via HTTPS for logged-in users? [19:36:54] gj Marybelle [19:36:57] L2IRC [19:36:58] I'm getting HTTP 503 errors. [19:37:11] Reedy: What'd I do wrong? :-( [19:37:13] sux2bu [19:37:32] I checked the channel topic! [19:38:04] https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=user&only=scripts&skin=monobook&user=MZMcBride&version=20131015T015532Z&* [19:38:08] for me bits was also down for 1-2min, but's back up for me [19:38:10] Example link that's 503ing for me. [19:38:35] Marybelle's link is intermittent for me [19:38:43] Yeah, intermittent for me now as well. [19:39:19] fine for me every time [19:39:20] bits network traffic is still down 30-50% or so [19:39:31] akosiaris: Because esams is better [19:39:38] akosiaris: that's because it got cached. change the version= and it will be intermittent again [19:39:43] https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=ext.gadget.edittop%7Cext.rtlcite%2Cwikihiero%7Cext.uls.nojs%7Cext.visualEditor.viewPageTarget.noscript%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmw.PopUpMediaTransform%7Cskins.monobook&only=styles&skin=monobook&* [19:39:43] or what Reedy said [19:39:58] https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=user%7Cuser.groups&only=styles&skin=monobook&user=MZMcBride&version=20131015T015532Z&* [19:40:14] Yeah, it's definitely intermittent. [19:40:22] well aware of that... just trying to figure out where the problem lies [19:40:36] Okay, that's all I was checking. Good luck! 
:-) [19:40:47] And thanks for poking at it. [19:42:40] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:43:00] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:43:30] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:43:40] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.332 second response time [19:43:50] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 6.593 second response time [19:44:13] gj icinga-wm [19:44:16] there we go [19:44:20] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.497 second response time [19:47:00] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:15] https://gdash.wikimedia.org/dashboards/jobq/ seems dead since 4 UTC [19:47:50] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 3.433 second response time [19:48:01] ori-l: see Nemo_bis ^ [19:48:30] Something for me to watch out about next week [19:49:20] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:49:51] jeremyb: isn't Aaron a better victim for a job queue ping? :) [19:50:16] Nemo_bis: we'll see... [19:50:21] job queue? [19:50:24] In my Wikimedia? [19:50:52] ori-l: siebrand Nikerabbit Weren't the ULS changes already deployed last week? [19:51:10] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 1.014 second response time [19:51:13] Reedy: Some, not all. [19:51:30] Reedy: And they'll only have hit all wikis be Thursday. [19:51:32] Could they account for a drop of network traffic from bits? Fonts etc? [19:51:36] Hmm.. [19:51:46] Reedy: I saw that drop, too. [19:52:17] Reedy: It could be. Not sure. [19:53:55] Quite a few cirrus related poolqueue full [19:54:23] warnings [19:55:28] poolqueue again? [19:56:18] !log reedy synchronized php-1.23wmf3/extensions/Wikibase 'https://gerrit.wikimedia.org/r/94790' [19:56:31] Logged the message, Master [19:56:54] Network traffic from bits app servers is picking up again [19:57:45] Also amusing [19:57:54] Increase memory limit due to some SVG related OOMs [19:58:17] Now see even more OOMs in GlobalFunctions [20:08:00] Reedy: just because they go further? 
:) [20:08:12] Yup [20:08:45] !log network level of bits application servers eqiad is back to the pre-deploy 18+8 MB/s [20:08:53] too bad, no more ULS savings it seems :) [20:09:02] Logged the message, Master [20:10:26] Reedy: well, it would be worrying if it increased the total number of OOMs I guess [20:10:50] where they OOM exactly matters less (assuming we don't have some ugly bug which corrupts database at some point and not another of course :P) [20:20:30] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:20] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.001 second response time [20:26:28] ottomata: on analytics 1027 and 1010 there are some weird errors from puppet which started early Nov 8, from cdh4 [20:27:09] hm [20:27:46] early = with the first puppet run after 1 am utc [20:27:48] looking [20:28:34] I couldn't see that anything had actually changed in the module or in puppet around that time, so punting to you [20:30:47] I very vaguely wonder if the power incident could have impacted anything there because that was also nov 8 around that time when it took out analytics1012 [20:30:54] no other guesses though [20:34:30] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:35:30] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 5.012 second response time [20:38:30] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:20] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.005 second response time [20:40:10] mark / paravoid? [20:40:17] yes? [20:40:20] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [20:40:30] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [20:40:41] paravoid: were there packet loss between esam and eqiad today which would've affected kafka? [20:40:52] yes [20:41:00] approx time window? [20:42:03] Snaps: 11 20:12:10 < thedj> speaking of weird stuff. analytics cluster just went from 0MB/s to 500MB/s :D [20:42:11] ~10:00 UTC - 13:25 UTC [20:42:19] related i guess but not an answer to your question [20:42:43] also yesterday for quite a while, and the day before for less time [20:43:16] huhm, okay, thanks guys. [20:43:20] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (112034) [20:43:29] jeremyb: unrelated [20:43:30] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (112016) [20:43:50] ottomata: can "we" try to map at least some of the drerrs to the above times? [20:44:09] ottomata got the incident report btw :) [20:44:38] awesome! 
[20:45:54] sorry, we tend to publish those incident reports to a wider audience (wikitech) after they've been resolved [20:46:10] (or not at all, when we're just too busy and fail to do it right :) [20:48:10] https://wikitech.wikimedia.org/wiki/Incident_documentation is the page [20:49:06] cool, thanks [21:10:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [21:10:30] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [21:13:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (106473) [21:13:30] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (106456) [21:13:52] flap, flap, flap [21:18:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [21:18:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [21:21:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (108372) [21:22:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (105476) [21:25:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [21:25:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [21:28:44] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=ms-be1003.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=cpu_report&c=Swift+eqiad [21:28:47] fun [21:30:00] PROBLEM - DPKG on ms-be1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:30:20] PROBLEM - swift-account-auditor on ms-be1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [21:30:21] PROBLEM - swift-object-replicator on ms-be1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [21:30:21] PROBLEM - swift-account-replicator on ms-be1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [21:30:30] PROBLEM - swift-account-server on ms-be1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [21:30:50] PROBLEM - swift-object-updater on ms-be1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [21:31:00] PROBLEM - swift-account-reaper on ms-be1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [21:32:46] !log rebooting ms-be1003, kernel bug, system CPU & I/O wait through the roof [21:33:01] Logged the message, Master [21:34:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (106918) [21:34:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (106999) [21:40:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [21:40:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [21:47:10] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [21:48:00] RECOVERY - swift-account-reaper on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python 
/usr/bin/swift-account-reaper [21:48:11] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [21:48:21] RECOVERY - swift-account-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [21:48:21] RECOVERY - swift-account-auditor on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [21:48:21] RECOVERY - swift-object-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [21:48:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101355) [21:48:30] RECOVERY - swift-account-server on ms-be1003 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [21:48:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101262) [21:48:50] RECOVERY - swift-object-updater on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [21:49:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [21:51:41] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [21:52:30] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [21:54:00] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [21:56:30] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:57:20] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.006 second response time [22:00:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (107794) [22:00:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (107746) [22:09:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [22:09:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [22:14:10] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 100,000 [22:15:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103640) [22:15:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (103711) [22:16:02] that's not a useful alert message [22:17:20] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:17:58] well it's flappy and it's missing the wiki DB names somehow [22:18:10] because no single one is over but the total is over? [22:18:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [22:18:37] Haha [22:18:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [22:18:42] Inconsistent too [22:18:55] earlier Nemo_bis reported https://gdash.wikimedia.org/dashboards/jobq/ was broken [22:21:00] PROBLEM - Varnish traffic logger on amssq58 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:21:20] PROBLEM - Varnish HTCP daemon on amssq58 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:21:27] looks like the icinga-wm spam for job q started ~2 hrs after the graphs broke [22:21:30] PROBLEM - Varnish HTTP text-backend on amssq58 is CRITICAL: HTTP CRITICAL - No data received from host [22:25:49] icinga-wm has always always lied about jobqueue [22:26:41] usually, this doesn't: https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=hume.wikimedia.org&v=823574&m=Global_JobQueue_length&r=hour&z=default&jr=&js=&st=1365625056&z=large [22:27:04] Needs moving from hume [22:27:30] so what's that spike a day ago? [22:42:40] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:44:30] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 7.707 second response time [22:49:40] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [22:50:20] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 35.89 ms [22:51:06] jeremyb: it's only 100k items, you wouldn't even have noticed it few weeks ago when parsoid filled the queue with millions jobs :) it can be caused by a couple edits to templates [22:52:22] Nemo_bis: but it's lingering, not going away [22:53:38] jeremyb: how so, it went down from 180 to 100 then up 130 and now slowly down [22:53:40] PROBLEM - Varnish HTTP text-backend on cp1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:54:16] compare to the exponential growth in https://ganglia.wikimedia.org/latest/graph.php?r=year&z=large&c=Miscellaneous+pmtpa&h=hume.wikimedia.org&v=823574&m=Global_JobQueue_length&jr=&js= , can't be that bad :P [22:55:30] RECOVERY - Varnish HTTP text-backend on cp1055 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.001 second response time [22:55:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101160) [22:56:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (101446) [22:59:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [22:59:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [23:25:31] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (104238) [23:25:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (104299) [23:29:30] RECOVERY - check_job_queue on arsenic is OK: JOBQUEUE OK - all job queues below 100,000 [23:29:40] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 100,000 [23:39:30] PROBLEM - check_job_queue on arsenic is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (125817) [23:39:40] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 99,999 jobs: , Total (125812) [23:39:55] icinga-wm, stfu? [23:40:35] maybe I should just edit some popular template to stop its flapping?:P [23:41:06] might want to just put in a real alertable number there.... [23:44:31] what's a real alertable number? 
[23:45:02] a couple of megs?:D [23:46:06] the problem is that the drop-off isn't smooth, because things are continuously added to the queue [23:46:27] so whatever threshold you set, if it is reached, is likely to cause a flapping alert [23:47:43] * greg-g nods [23:47:59] what was the original idea with the alert? [23:48:04] what is a failure state? [23:48:38] Continuously growing [23:48:47] is just "jobqueue increased by XX% in YYminutes"? [23:48:47] and not decreasing [23:49:07] but, it can't always be decreasing :) [23:49:27] Which can be 1 of 2 things - 1) job runners aren't running or 2) someone is doing something "bad" [23:49:28] so, "no drop in XX hours"? [23:49:42] Bad being editing a load of high use templates in a row, etc [23:49:45] but, there just seems like a lot of false alarms here :/ [23:50:19] Previously it was like [23:50:21] Ooh [23:50:24] It's in the millions [23:50:27] And it's still growing [23:51:04] maybe it's fundamentally the wrong approach to set up alerting based on the queue size [23:51:21] what about age of oldest job in the queue? [23:51:24] do we have that? [23:51:31] they're timestamped [23:51:35] And have a retry count [23:51:44] * greg-g nods [23:51:55] wikidata had that [23:53:08] job_attempts, job_token_timestamp [23:53:25] To have a useful metric, it's going to have to take a few things into consideration [23:53:37] * greg-g nods [23:54:38] I think repeatedly failing jobs go away after a while [23:54:43] emphasis on think there
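Sketching the "age of the oldest job" idea the discussion ends on: the job table is timestamped (job_timestamp; job_attempts and job_token_timestamp are also mentioned above), so a check could alert on staleness instead of raw queue size, which is what keeps flapping. Everything below is a hypothetical sketch — the DB host, the attempts cutoff and the age threshold are placeholders, and whether abandoned jobs really age out is exactly the part Reedy was unsure about.

```bash
# Hypothetical check: alert when the oldest still-pending job is older than N hours.
# Host, credentials, the attempts cutoff and the threshold are all placeholders.
WIKI_DB=enwiki
MAX_AGE_HOURS=6

AGE=$(mysql -h db-host.example -N -e "
    SELECT TIMESTAMPDIFF(HOUR,
                         MIN(STR_TO_DATE(job_timestamp, '%Y%m%d%H%i%s')),
                         UTC_TIMESTAMP())
    FROM job
    WHERE job_attempts < 3;   -- ignore jobs that look permanently failed
" "$WIKI_DB")

case "$AGE" in ''|NULL) AGE=0 ;; esac   # empty queue => nothing is stale

if [ "$AGE" -gt "$MAX_AGE_HOURS" ]; then
    echo "CRITICAL: oldest pending job on $WIKI_DB is ${AGE}h old"
    exit 2
fi
echo "OK: oldest pending job on $WIKI_DB is ${AGE}h old"
```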