[00:01:07] superm401: did you change nicks?
[00:01:22] jeremyb_, no, why?
[00:01:38] i thought i remembered another nick
[00:01:46] anyway, you might want to actually !log that
[00:01:50] instead of just saying it
[00:01:56] ori-l: isAnon +1
[00:02:09] (e.g. mflaschen or something)
[00:02:18] jeremyb_, doesn't scap automatically log when you pass a message?
[00:02:27] Yes
[00:02:28] yeah, forgot about that one earlier
[00:02:30] It does now
[00:02:32] Didn't always
[00:03:05] ok... i can never remember what's automatic
[00:03:24] jeremyb_, mattflaschen on Gerrit, mflaschen for email and a couple obscure other things
[00:04:31] PROBLEM - Puppet freshness on colby is CRITICAL: Puppet has not run in the last 10 hours
[00:04:32] superm401: i was nearly certain i had seen you on IRC at some point before. maybe i imagined it
[00:04:33] mutante: Any idea what results.labs.wikimedia.org is? Resolves to 131.152.80.208.in-addr.arpa name = recursor0.wikimedia.org.
[00:04:53] HTTPSEverywhere has (?:ipv4|ipv6and4|results)\.labs
[00:05:13] The other 2 can be dealt with when we get answers to your email
[00:05:27] Reedy: i asked ops list about those 4
[00:05:35] like a little while ago..pending reply
[00:05:38] Oh, duh
[00:05:42] Reedy: that's not what it resolves to for me
[00:05:43] I didn't see results on the end
[00:06:13] ipv6and4.labs.wikimedia.org.
[00:06:15] whatever
[00:06:23] binasher: on the whole it looks like client-side rendering time dwarfs network latency
[00:06:23] and 91.198.174.7
[00:06:29] ip4.labs and results.labs have the same IP
[00:06:37] so pretty sure it is part of that
[00:07:29]
[00:07:35] We should update this regex
[00:07:41] ganglia3? has HTTPS now
[00:07:48] so does wikitech
[00:08:20] so does shop
[00:08:26] What's harmon?
[00:09:14] not found in site.pp, is up and running, but nothing in motd tells me right away..emm
[00:10:05] I'm killing (?:commonsprototype|mlqt|mobile)\.tesla\.usability as none of them resolve
[00:10:07] it runs 'lldpd'
[00:10:09] ori-l: although, does responseEnd only cover the initial response, and not the fetching of all http resources? if so, that would be counted in rendering
[00:10:16] LLDP is to provide an inter-vendor compatible mechanism to deliver Link-Layer notifications to adjacent network devices.
[00:10:22] apt-cache show lldpd
[00:10:34] binasher: dunno, looking it up
[00:10:45] Reedy: yea, removed tesla stuff from DNS to cleanup a while ago
[00:10:50] jobs redirects to foundation
[00:10:52] except one... the controller itself afair
[00:11:05] because Ryan might still need that to actually shut down stuff
[00:11:20] ori-l: i've been reading through http://dvcs.w3.org/hg/webperf/raw-file/tip/specs/NavigationTiming/Overview.html, but still not sure
[00:13:55] milimetric noticed a while back that the nav timing events correspond to markers in the timeline view in chromium's dev tools, could be useful to cross-reference. my head is exploding from today's deployment though, not able to think through it.
[00:16:02] jeremyb_, scap doesn't log anything when it starts, only when it gets to syncing, and mw-update-l10n before that can take a while.
[00:17:13] (?:(?:commons|de|en|test)\.)?prototype
[00:17:28] !log mflaschen Started syncing Wikimedia installation... : Deployment for Mobile and E3
[00:17:34] Logged the message, Master
[00:18:13]
[00:18:28] he's speaking in regexes again
[00:18:30] 21 minutes later, see what I mean :)
[00:18:31] * ori-l sprinkles holy water
[00:18:32] flaggedrevssandbox and the next are dead too
[00:19:08] nagios points to icinga which also has https
[00:21:30] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[00:22:01] I got an odd error during scap:
[00:22:05] "mw10.pmtpa.wmnet: rsync: send_files failed to open "/php-1.21wmf11/.git/modules/extensions/FormPreloadPostCache/index.lock" (in common): Permission denied (13)"
[00:22:15] Some kind of permissions issue
[00:22:28] "^http://(?:apt|bayes|bayle|brewster|bug-attachment|cs|cz|dataset2|download|dumps|ekrem|emery|ersch|etherpad|harmon|hume|(?:ipv4|ipv6and4|results)\.labs|m|project2|search|sitemap|snapshot3|stafford|statu?s|torrus|ubuntu|wiki-mail|wlm|yongle)\.wikimedia\.org"
[00:22:51] what is the issue with cs and cz?
[00:23:00] I suspect they're hosted offsite...
[00:23:15] oh, yea, true, they redirect to their own TLDs
[00:23:17] Yeah, they resolve elsewhere
[00:23:52] etherpad has https
[00:23:53] That's quite a reduction in number of exceptions though
[00:24:53] you can also remove "m"
[00:24:58] mutante: that's not quite accurate
[00:24:58] https://m.wikimedia.org/
[00:25:31] what is project2?
[00:25:35] jeremyb_: which part
[00:25:41] mutante: rt 2751
[00:26:47] That's etherpad..
[00:27:03] New patchset: Ori.livneh; "Add NavigationTiming to extension-list" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53903
[00:27:33] 15 00:23:52 < mutante> etherpad has https
[00:27:44] Ah
[00:27:53] yeah, it loads originally, then goes back to http
[00:27:54] jeremyb_: ugh, confirmed, well .. kind of .. it redirects from https to http when creating a new pad.. but also.. if i just change the URL back to https.. it stays :p
[00:28:10] mutante: but the ajax is over HTTP?
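On the responseEnd question above: per the Navigation Timing spec, responseEnd marks the last byte of the main document only; subresource fetches (CSS, JS, images) happen afterwards, so a naive network/rendering split attributes them to "rendering", which is what binasher is asking about. A sketch with made-up illustrative values (milliseconds since navigationStart, not real measurements):

```python
# Sketch of the naive network vs. rendering split discussed above.
# All numbers are hypothetical sample values, not measured data.
timing = {
    "navigationStart": 0,
    "responseStart": 180,   # first byte of the main document
    "responseEnd": 250,     # last byte of the main document ONLY
    "loadEventEnd": 1450,   # after subresource fetches + layout/paint
}

# Everything before responseEnd counted as "network", everything after
# as "rendering" -- subresource downloads after responseEnd get
# mis-attributed to rendering under this split.
network_ms = timing["responseEnd"] - timing["navigationStart"]
render_ms = timing["loadEventEnd"] - timing["responseEnd"]
print(network_ms, render_ms)  # → 250 1200
```
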
[00:28:30] iirc
[00:28:35] umpf.. Lost connection with the EtherPad synchronization server. This may be due to a loss of network connectivity.
[00:28:40] yea, as described
[00:29:10] so, "has https" may be a bit of a stretch :)
[00:29:12] but we can remove "m", right
[00:29:28] yea, agreed
[00:30:59] Okay, there were a few scap errors in the "Updating rsync proxies..." section.
[00:31:05] Reedy: fwiw.. https://wikitech.wikimedia.org/wiki/Httpsless_domains
[00:31:16] But it seems to be back to normal operation on mw and srv machines.
[00:31:53] "^http://(?:apt|bayes|bayle|brewster|bug-attachment|cs|cz|dataset2|download|dumps|ekrem|emery|ersch|etherpad|harmon|hume|(?:ipv4|ipv6and4|results)\.labs|m|project2|search|sitemap|snapshot3|stafford|statu?s|torrus|ubuntu|wiki-mail|wlm|yongle)\.wikimedia\.org"
[00:31:58] I think I'll submit that for now then
[00:32:00] the FIXME on rt is also fixed, isn't it
[00:32:12] Reedy: cool
[00:32:26] maybe we should ask mobile about "m"
[00:33:31] marking RT and wikitech as fixed
[00:33:32] Reedy: wikivoyage too? or did it already?
[00:33:46] Eh?
[00:33:53] Ages ago
[00:33:57] yea
[00:33:58] this is HTTPS everywhere, right?
[00:34:05] oh, well it wasn't showing for me in chrome
[00:34:13] https://github.com/reedy/https-everywhere/compare/master...reducewmfexceptions
[00:34:14] idk how long it takes to propagate
[00:34:14] Yeah
[00:34:16] to the stores
[00:34:22]
[00:34:22]
[00:41:20] New patchset: Reedy; "Bug 39482 - Rename "chapcomwiki" to "affcomwiki"" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53894
[00:41:29] ^ jenkins seems to be working now
[00:43:20] project2 - is in DNS but that IP does not appear to be assigned anywhere and it got removed from reverse DNS
[00:43:30] !log mflaschen Finished syncing Wikimedia installation... : Deployment for Mobile and E3
[00:43:38] Logged the message, Master
[00:45:38] And console confirms, scap is done.
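The exception pattern being pruned above is an ordinary regex, so the trimming can be sanity-checked mechanically. A quick sketch using Python's `re` with the pattern quoted verbatim from the log and a few hostnames from the discussion:

```python
import re

# The HTTPS-Everywhere exception pattern quoted above, verbatim.
pattern = re.compile(
    r"^http://(?:apt|bayes|bayle|brewster|bug-attachment|cs|cz|dataset2"
    r"|download|dumps|ekrem|emery|ersch|etherpad|harmon|hume"
    r"|(?:ipv4|ipv6and4|results)\.labs|m|project2|search|sitemap"
    r"|snapshot3|stafford|statu?s|torrus|ubuntu|wiki-mail|wlm|yongle)"
    r"\.wikimedia\.org"
)

# Hosts discussed in the log that are still listed as exceptions:
assert pattern.match("http://results.labs.wikimedia.org/")
assert pattern.match("http://m.wikimedia.org/")
assert pattern.match("http://status.wikimedia.org/")  # statu?s covers stats/status
# Hosts that now have HTTPS and were dropped from the list:
assert not pattern.match("http://wikitech.wikimedia.org/")
# Only plain http URLs are excepted; https is never matched:
assert not pattern.match("https://m.wikimedia.org/")
```
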
[00:46:17] New patchset: Reedy; "Bug 39482 - Rename "chapcomwiki" to "affcomwiki"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53892
[00:46:24] mutante: Fancy testing https://gerrit.wikimedia.org/r/#/c/53894
[00:47:24] !log olivneh synchronized php-1.21wmf11/extensions/NavigationTiming/modules/ext.navigationTiming.js 'Additional fields for NavTiming'
[00:47:30] Logged the message, Master
[00:47:52] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[00:49:59] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53903
[00:51:59] New patchset: Reedy; "Bug 39482 - Rename "chapcomwiki" to "affcomwiki"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53892
[00:55:50] Reedy: Warning: DocumentRoot [/usr/local/apache/common/docroot/affcom] does not exist
[00:56:00] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa
[00:56:57] !log reedy synchronized wmf-config/InitialiseSettings.php 'chapcomwiki readonly'
[00:57:05] Logged the message, Master
[01:00:25] New patchset: Pyoungmeister; "WIP: first bit of stuff for taming the mysql module and making the SANITARIUM" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53907
[01:00:27] New patchset: Reedy; "Bug 39482 - Rename "chapcomwiki" to "affcomwiki"" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53894
[01:06:31] New patchset: Reedy; "Bug 39482 - Rename "chapcomwiki" to "affcomwiki"" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53894
[01:09:41] !log olivneh synchronized php-1.21wmf11/extensions/NavigationTiming 'Missed a file in earlier sync'
[01:10:05] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53894
[01:10:40] Logged the message, Master
[01:10:44] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53892
[01:11:01] syncing
[01:12:16] !log reedy synchronized docroot
[01:12:22] Logged the message, Master
[01:12:39] dzahn is doing a graceful restart of all apaches
[01:13:22] !log dzahn gracefulled all apaches
[01:13:29] Logged the message, Master
[01:13:47] !log reedy synchronized wmf-config/InitialiseSettings.php
[01:13:56] Logged the message, Master
[01:15:13] RECOVERY - MySQL Replication Heartbeat on db71 is OK: OK replication delay 0 seconds
[01:15:22] RECOVERY - MySQL Slave Delay on db71 is OK: OK replication delay 0 seconds
[01:21:10] !log olivneh synchronized php-1.21wmf11/extensions/GettingStarted/CategoryRoulette.php 'Fix to use single-parameter version of SRANDMEMBER'
[01:21:18] Logged the message, Master
[01:27:00] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files:
[01:27:04] Logged the message, Master
[01:28:12] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files:
[01:28:18] Logged the message, Master
[01:34:13] !log reedy synchronized docroot
[01:34:21] Logged the message, Master
[01:34:53] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files:
[01:34:59] Logged the message, Master
[01:35:25] !log reedy synchronized wmf-config/
[01:35:32] Logged the message, Master
[01:40:00] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[01:43:45] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files:
[01:43:52] Logged the message, Master
[01:44:11] !log reedy synchronized wmf-config/
[01:44:17] Logged the message, Master
[01:44:43] !log olivneh synchronized php-1.21wmf11/extensions/GettingStarted/SpecialGettingStarted.php 'Fix deployed bug in check for editability'
[01:44:51] Logged the message, Master
[01:48:57] New patchset: Reedy; "Revert "Bug 39482 - Rename "chapcomwiki" to "affcomwiki""" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53914
[01:50:49] New patchset: Reedy; "Revert "Bug 39482 - Rename "chapcomwiki" to "affcomwiki""" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53915
[01:50:56] Change merged: Asher; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53914
[01:51:01] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53915
[01:51:33] reedy is doing a graceful restart of all apaches
[01:52:04] !log reedy gracefulled all apaches
[01:52:10] Logged the message, Master
[01:52:39] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files:
[01:52:46] Logged the message, Master
[01:53:44] !log reedy synchronized wmf-config/
[01:53:50] Logged the message, Master
[01:59:20] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 185 seconds
[01:59:50] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 192 seconds
[02:02:00] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa
[02:08:43] New patchset: Reedy; "Bug 39482 - Rename "chapcomwiki" to "affcomwiki"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53921
[02:09:03] New patchset: Reedy; "Bug 39482 - Rename "chapcomwiki" to "affcomwiki"" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53922
[02:11:57] sync-apache ..
[02:12:14] done
[02:12:45] dzahn is doing a graceful restart of all apaches
[02:13:26] !log dzahn gracefulled all apaches
[02:13:33] Logged the message, Master
[02:16:14] !log reedy synchronized docroot
[02:16:19] Logged the message, Master
[02:20:10] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[02:30:02] !log LocalisationUpdate completed (1.21wmf11) at Fri Mar 15 02:30:02 UTC 2013
[02:30:11] Logged the message, Master
[02:32:02] New patchset: Ottomata; "hdfs sync to stats.wikimedia.org on *:15" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53925
[02:32:20] PROBLEM - MySQL Replication Heartbeat on db66 is CRITICAL: CRIT replication delay 200 seconds
[02:32:48] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53925
[02:32:50] PROBLEM - MySQL Slave Delay on db66 is CRITICAL: CRIT replication delay 233 seconds
[02:33:10] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa
[02:40:42] New patchset: J; "install libjpeg-turbo-progs for rotate api" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44008
[02:50:51] * jeremyb_ knew that wouldn't work, mutante
[03:00:23] mutante: so... there's a labs env for venus, right? to test the migration when you first switched
[03:00:36] have you been able to reproduce it there?
[03:08:19] i almost want to tell him to stop commenting on the bug. but i don't know python *that* well (or at least it's been too long since i had to deal with this particular issue)
[03:09:50] mutante: want to add me to the planet project and i'll play with it some?
[03:11:18] idk if ori-l still wants in
[03:11:50] RECOVERY - MySQL Slave Delay on db66 is OK: OK replication delay 0 seconds
[03:12:20] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay 0 seconds
[03:23:37] New patchset: Danny B.; "Adding Wikivoyage" [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/53928
[03:32:15] This NavigationTiming shit is exciting.
[03:32:19] Yay performance data.
[03:50:23] StevenW: is that like boomerang?
[03:50:56] Dunno.
[03:53:25] StevenW: http://lognormal.github.com/boomerang/doc/
[03:54:20] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 186 seconds
[03:54:50] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 198 seconds
[04:03:40] Can someone please take a look at ?
[04:14:31] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours
[04:16:04] Susan: i did
[04:16:19] Helpful.
[04:16:20] (a while ago)
[04:16:29] Susan: it was kicked. people didn't like the spamming
[04:16:37] I don't like people.
[04:16:39] I think we're even.
[04:16:42] there was a user trying to get help and had trouble because of the flood
[04:16:50] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 23 seconds
[04:16:52] (wasn't my decision i only saw the scrollback)
[04:17:03] Susan: anyway, there's #mediawiki-feed for now
[04:17:11] It's a terrible replacement.
[04:17:15] As it's missing wikibugs.
[04:17:22] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[04:17:29] it has wm-bot providing a similar service
[04:17:32] i witnessed it
[04:17:51] and you've been there too
[04:18:34] Susan: at some point we should either let it back into #mediawiki or configure it to not join (and remove the ban). also we could maybe get wikibugs to join #mediawiki-feed if you really care
[04:18:48] I do really care.
[04:28:46] where have you gone mr. wikibugs, joltin' joe has left and gone away
[04:32:29] there was no flood
[04:32:56] the channel was quiet while Andre Klapper went about his normal work
[04:35:04] well i was only reading after the fact not contemporaneously and didn't review timestamps. i saw *other* people called it a flood. i think
[04:36:20] jeremyb_: it is exactly like boomerang, except probably not as good. I hadn't heard of it before. They do some nifty things.
[04:36:38] ori-l: is it worth just using boomerang?
[04:37:11] maybe; dunno.
[04:39:25] The thing that seems good about it is that they've gone through some of the work figuring out how to extract useful metrics from the new wealth of client-side performance data. That's a lot of intellectual labor and experimentation but not a lot of code. We could probably just use some of their techniques.
[04:40:51] ori-l: how would i have found out about NavigationTiming besides just reading on IRC? i'm on the EE list. but i guess i don't read everything there. maybe it was mentioned
[04:41:45] i guess there's > [Analytics] RFC: Building a frontend performance analysis platform
[04:41:52] a whole thread :)
[04:42:06] but that's all from 3ish months back
[04:42:29] there wasn't really any work getting done for three months because supporting pieces needed to be put into place
[04:43:06] what are you guys on now, mingle?
[04:43:25] (i guess we should move this convo to someplace more relevant)
[04:43:26] no; I've made a concerted effort to move away from anything proprietary.
[04:43:47] I have to run, actually
[04:44:14] well, but the rest of the team(s)? analytics and EE
[04:44:21] bye :)
[04:46:30] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[04:55:44] New patchset: Tim Starling; "Fix lack of output for empty path part" [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/53930
[04:56:01] Change merged: Tim Starling; [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/53930
[05:15:26] New patchset: Tim Starling; "Update redirector binary" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53935
[05:15:50] New review: Tim Starling; "Tested live." [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/53935
[05:16:00] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53935
[05:19:53] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 192 seconds
[05:20:13] TimStarling: the best way to test
[05:24:30] I tested it on a single squid first
[05:24:56] I was just afraid that it would fail to start up due to incorrect C library version or something
[05:25:07] causing the whole site to instantly fail
[05:25:39] packages do have their positives, hint to ops
[05:26:05] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 13 seconds
[05:27:11] heh
[05:30:10] TimStarling: any chance for CR today?
[05:32:34] quite a list you've got there
[05:34:11] notpeter isn't online anymore it seems
[05:34:27] New patchset: Krinkle; "Ensure package 'doxygen' on contint." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53938
[05:35:02] Can someone in ops please merge/deploy the above package for gallium? I was assuming the package was already there, but looks like it slipped through the cracks when migrating the scripts from svn.pp
[05:35:07] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[05:35:25] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds
[05:35:26] jenkins script is currently erroring on bash command not found :-/
[05:35:39] usually I try to avoid building it up and slow down (doing CR and reading about stuff)
[05:36:09] it kind of doubled lately though
[05:38:05] TimStarling: I wish one could publicly star gerrit changes
[05:40:16] Aaron|home: what would that mean?
[05:40:35] i wish one could publicly ask for review from whoever felt like working the queue. without having to name people
[05:40:59] Well, it would be most useful for the list of changes by author. One could flag the more important ones out.
[05:41:27] jeremyb_: maybe https://www.mediawiki.org/wiki/Git/Reviewers will solve that a little
[05:41:29] AaronSchulz: so with twemproxy, everything will normally connect directly to twemproxy, except for mcc and mctest?
[05:41:37] assuming enough people add themselves
[05:41:44] * Aaron|home hasn't added any regexes himself
[05:41:58] TimStarling: depending on the --noproxy option or whatever passed to the script
[05:42:25] ugh, one would have to keep the internalServers and .yaml file in sync though, meh
[05:42:36] why not use a separate BagOStuff object?
[05:42:51] then you could specify a $wgObjectCaches key on the command line to mcc.php
[05:43:13] php mctest.php --cache=backend
[05:43:13] Aaron|home: ahhhh, interesting star use
[05:43:13] php mctest.php --cache=twemproxy
[05:43:50] I guess having a dummy objectcache won't hurt
[05:44:16] it could be useful to have it there in configuration in case a bug develops in twemproxy and we have to stop using it
[05:44:26] TimStarling: Could you perhaps merge that puppet change for me? It's a one line change, currently being blocked by it. https://gerrit.wikimedia.org/r/53938
[05:44:32] yep
[05:45:09] TimStarling: I abandoned that change
[05:45:15] * Aaron|home will add the --cache param
[05:45:29] thanks
[05:46:10] Aaron|home: hrmmmm, i think i saw that page before. but that doesn't really solve my problem
[05:46:36] Aaron|home: on bugzilla (at least the way mozilla uses it) you can ask for a review from a specific person or from "the wind"
[05:47:53] and how well does that work? :)
[05:48:23] i think it was ok?
[05:48:31] i don't really pay attention there any more
[05:48:54] Krinkle: sucks to be you
[05:49:06] New patchset: Tim Starling; "Ensure package 'doxygen' on contint." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53938
[05:49:06] TimStarling: excuse me?
[05:49:21] I'm a bit of a puppet noob, but I'm learning.
[05:49:24] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53938
[05:49:31] Thanks :)
[05:49:35] I mean, your lot in life is a hard one, you deserve sympathy
[05:49:47] because you were blocked i guess
[05:49:55] since you have puppetized your stuff instead of getting root, and now you need to beg any time you want to get something done
[05:50:32] Indeed. Perhaps one day I'll be crossing over in platform more than I am. I'm certainly interested in it.
[05:50:39] this is on gallium?
[05:50:46] Yes sir.
[05:51:13] I'm doing a manual puppet run there for you, so you will have that package within a minute or so
[05:51:22] I'm not sure what the puppet interval is, if , that'd be great.
[05:51:39] puppet interval is 30 mins
[05:52:34] Seems you got it all worked out :) Nice clean queue: https://gerrit.wikimedia.org/r/#/q/owner:%22Tim+Starling+%253Ctstarling%2540wikimedia.org%253E%22+status:open,n,z
[05:53:11] great, the bin is there.
[05:54:20] * TimStarling is cherry picking the easy stuff out of Aaron's list
[05:54:37] guilty pleasures
[05:55:14] so argument parsing in mcc.php looks fun
[05:55:24] $debug = in_array( '--debug', $argv );
[05:55:25] $help = in_array( '--help', $argv );
[05:57:00] Tyler may have a point on https://gerrit.wikimedia.org/r/#/c/53799/1/includes/ScopedCallback.php,unified
[06:00:24] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 183 seconds
[06:01:06] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 199 seconds
[06:02:58] .. and we're rolling! https://integration.mediawiki.org/ci/job/test-mediawiki-docgen/3/console https://doc.wikimedia.org/mediawiki-core/master/php/html/
[06:03:02] how many search clients do you suppose there are at any given time?
[06:03:24] we don't have stats?
[06:03:43] I mean open connections to lucene
[06:05:26] you think this is it? http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=search_threads&s=by+name&c=Search+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[06:05:52] I'm not sure what that measures, but it looks like a plausible number
[06:06:06] i.e. 60-80
[06:07:19] the yearly graph shows a spike to 150
[06:07:25] PROBLEM - Puppet freshness on mw26 is CRITICAL: Puppet has not run in the last 10 hours
[06:07:49] we could wrap search connections in a PoolCounter, then it would never cause downtime
[06:08:54] assuming poolcounterd can manage ~1.5 req/s
[06:10:58] well, 3 req/s if you count both lock and unlock
[06:11:08] TimStarling: I guess https://gerrit.wikimedia.org/r/#/c/53796/1 and https://gerrit.wikimedia.org/r/#/c/53806/1 are "easy" too
[06:13:30] looks like it's currently doing 550 pps received with 0.5% CPU
[06:14:13] per server
[06:14:15] there are two servers
[06:14:26] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours
[06:15:56] * jeremyb_ wonders if mutante's there
[06:17:32] we could use closures for PoolCounter now
[06:18:17] once there was a closure subclass of PoolCounterWork, it would only take a few lines of code to add more callers
[06:19:12] Aaron|home: how busy are you?
[06:19:18] doesn't poolcounter get much more than 3 req/s when we're being slashdotted?
[06:19:24] (or popedotted)
[06:20:02] TimStarling: what's pps?
[06:20:13] TimStarling: like right now?
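The `in_array( '--debug', $argv )` style quoted above only handles boolean flags; the `--cache=<name>` option Aaron agrees to add needs key=value parsing. mcc.php/mctest.php are PHP, so this is just an illustrative sketch in Python of the shape being discussed (the backend names are hypothetical stand-ins for `$wgObjectCaches` keys):

```python
import argparse

# Illustrative stand-in for the planned mctest.php CLI: a --cache flag
# selecting a named cache backend, plus an existing boolean flag.
parser = argparse.ArgumentParser()
parser.add_argument("--debug", action="store_true")
parser.add_argument("--cache", default="backend",
                    help="which object-cache config key to use (hypothetical)")

# e.g. the 'php mctest.php --cache=twemproxy' invocation from the log:
args = parser.parse_args(["--cache=twemproxy", "--debug"])
print(args.cache, args.debug)  # → twemproxy True
```
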
[06:20:14] packets per second
[06:20:41] no, like middle of next week
[06:21:25] * jeremyb_ can't tell if Tim's serious :P
[06:21:36] I'm serious
[06:21:46] this was always meant to be part of the application space for poolcounter
[06:22:04] no i meant about "middle of next week"
[06:22:20] well, that is when I would want it done by
[06:22:34] but there are lots of other people who could do it if aaron is busy
[06:23:13] some of those nodes are so much more idle than others
[06:23:26] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 13 seconds
[06:24:00] TimStarling: maybe if my commit backlog was small I'd be more likely to look at it :)
[06:24:15] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds
[06:24:19] fair
[06:24:31] bribery
[06:24:38] * Aaron|home should go on another review sprint like last week
[06:24:51] I think it took two people to keep up with brad
[06:25:07] jeremyb_: the time resolution is very low now but maybe it increased the rate from 520 to 670 packets per second
[06:26:14] you would expect very brief spikes each time squid is purged
[06:26:23] that wouldn't show up on this data
[06:30:30] Hm.. quick bash question involving file descriptors 1 and 2 and tee to send to stdout and a file. https://gerrit.wikimedia.org/r/#/c/39212/11/tools/mw-doc-gen.sh
[06:37:24] PROBLEM - Puppet freshness on mw1118 is CRITICAL: Puppet has not run in the last 10 hours
[06:38:44] If |& is used, the standard error of command is connected to command2's standard input through the pipe; it is shorthand for 2>&1 |.
[06:39:08] maybe that's what you want
[06:40:06] $cmd > out.txt |& ( tee error.txt >&2 )
[06:41:06] http://manpages.ubuntu.com/manpages/precise/en/man1/bash.1.html#contenttoc9
[06:46:48] Aaron|home: 53796 is not really that simple
[06:47:53] since there is no calling code, it's hard to know whether the usual problems with purging heavily used caches apply
[06:50:14] Aha, I almost got it.
[06:50:15] Thanks!
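A caveat on the `$cmd > out.txt |& ( tee error.txt >&2 )` suggestion above: as I read the bash manual (the implicit `2>&1` from `|&` is performed *after* the command's own redirections), both streams end up in out.txt and nothing reaches the pipe; a process substitution such as `cmd 2> >(tee error.txt >&2) > out.txt` is usually the form wanted. The intended behavior (stdout to a file, stderr tee'd to a file and the terminal) can be sketched portably in Python; the filenames and command here are hypothetical:

```python
# Sketch: child's stdout goes only to out.txt; its stderr is "tee'd"
# to both error.txt and our own stderr. Filenames are illustrative.
import subprocess
import sys

# Hypothetical command standing in for $cmd.
cmd = [sys.executable, "-c",
       "import sys; print('to stdout'); print('to stderr', file=sys.stderr)"]

with open("out.txt", "w") as out:
    proc = subprocess.run(cmd, stdout=out, stderr=subprocess.PIPE, text=True)

# Manual tee: write captured stderr to a file and echo it to our stderr.
with open("error.txt", "w") as errf:
    errf.write(proc.stderr)
sys.stderr.write(proc.stderr)
```
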
[06:50:19] I get it now.
[06:50:25] PROBLEM - Puppet freshness on europium is CRITICAL: Puppet has not run in the last 10 hours
[06:56:21] * jeremyb_ waits for Krinkle's new patchset
[06:56:49] Oh not yet, I'm doing something else first. the script works fine, the output is in jenkins. It's just the txt file
[06:57:12] jeremyb_: I'm currently trying to figure out why docroot on integration.mediawiki.org and doc.wikimedia.org is being overwritten by puppet
[06:58:21] what exactly is being overwritten?
[06:58:55] gallium now has the docroot out of puppet in integration/docroot.git, so we don't have to have ops merge simple html/css changes to the web portal
[06:59:13] afaik we moved the stuff out of puppet, but somehow it is still enforcing it
[06:59:30] causing the files deployed by git/jenkins to be overridden in their working copy
[07:08:14] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 192 seconds
[07:08:24] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 194 seconds
[07:12:58] Krinkle: you've checked for cron jobs?
[07:13:43] jeremyb_: I know it's puppet doing it based on the specific version of the docroot it is restoring
[07:13:55] idk how you could know that
[07:14:02] other things could pull from git
[07:14:02] it's still managed by puppet, see modules/contint/manifests/website.pp
[07:14:10] i looked there
[07:14:25] ori-l: Only the directory, not the files
[07:14:31] those file resources were removed
[07:16:27] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours
[07:16:27] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours
[07:16:27] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[07:16:27] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours
[07:16:27] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[07:17:30] Krinkle: i must insist that you don't know it's puppet
[07:17:36] unless you have more proof :)
[07:18:07] there are no other crons, when I reset it, it is overwritten within an hour
[07:18:15] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[07:18:17] I wouldn't be surprised if it was, say, every 30 minutes
[07:18:23] did you check *all* of the crontabs?
[07:18:24] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds
[07:18:29] did you check the cron logs too?
[07:19:35] note, there's /etc/crontab, /etc/cron.d, /etc/cron.{hourly,daily,etc.}, and then each user has a crontab too
[07:20:27] Krinkle: re: doc.wikimedia.org, manifests/misc/docs.pp defines '/srv/org/wikimedia/doc/index.html' to have source => 'puppet:///files/misc/jenkins/doc_index.html'
[07:21:10] jeremyb_: do you have root on gallium? Maybe it is an old effect that only needs to be fixed once. The file is not owned by jenkins so it stays modified
[07:21:21] jeremyb_: /srv/org/mediawiki/integration/index.html
[07:21:25] New patchset: Krinkle; "svn: Disable mwdocs and redirect to doc.wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53954
[07:21:56] error: unable to unlink old 'org/mediawiki/integration/index.html' (Permission denied)
[07:21:59] Krinkle: i don't even have bastion access
[07:22:04] so... no
[07:22:17] I keep confusing you with someone else
[07:22:34] i wonder who!
[07:36:32] New patchset: Krinkle; "misc::docsite: Remove file "doc/index.html"." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53955
[07:37:47] so ori-l found it?
[07:43:26] jeremyb_: I knew doc.wikimedia was still in puppet, I'm focussing on integration.wikimedia.org first
[07:43:33] but I did fix that one while at it at his reminder
[07:43:51] the problem is /org/mediawiki/integration/index.html which is stuck in permission denied
[07:44:57] which is causing git-checkout, git-pull, git-stash whatever to fail
[07:45:12] it is unable to move or unlink that file, even though the permissions are fine, it is the parent directory
[07:47:41] find /org/mediawiki/integration -exec ls -ld {} +
[07:47:46] -> pastebin
[07:47:50] Krinkle
[07:48:40] jeremyb_: no, that'll be huge
[07:48:49] jenkins publishes to a sub directory of this
[07:48:56] oh, not mostly empty?
[07:49:13] nightlies for one
[07:49:26] what are you looking for?
[07:49:31] I know the permissions and the problem [07:49:42] then, find /org/mediawiki/integration -maxdepth 1 -exec ls -ld {} + [07:50:03] /srv/mediawiki/integration is 755 by www-data, /srv is a git clone of integration/docroot.git deployed by jenkins [07:50:11] that dir needs to have jenkins as owner [07:50:15] idk, you said "even though the permisisons are fine it is the parent directory" [07:50:26] yes, index.html itself has permissions owned by jenkins [07:50:56] but even on 777, you can't recreate the file without access to the directory it is in [07:51:07] riiiight [07:51:11] who are you? [07:51:15] jenkins i guess [07:51:22] ? [07:51:39] the user recreating is jenkings [07:51:42] jenkins* [07:51:46] not e.g. www-data [07:52:15] when someone merges a change in gerrit, a zuul notification is sent to jenkins which then starts a build, at the end it does git pull in /srv [07:52:22] or rather git checkout ZUUL_REF [07:52:39] which is done by jenkins [07:53:07] I'm simulating that currently on gallium from sudo -su jenkins so I don't have to push fake changes to gerrit all the time [07:57:14] Where is op when you need one? [07:58:25] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [08:04:58] New patchset: Krinkle; "contint: Fix permissions in /srv." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53957 [08:06:21] yeah, that'll do it [08:06:24] PROBLEM - Puppet freshness on tmh1002 is CRITICAL: Puppet has not run in the last 10 hours [08:06:24] PROBLEM - Puppet freshness on sq83 is CRITICAL: Puppet has not run in the last 10 hours [08:08:25] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Puppet has not run in the last 10 hours [08:12:12] Krinkle: sleeping! [08:12:30] speaking of which i should... 
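The point Krinkle makes above — a file's own mode is irrelevant if you cannot write its parent directory — is easy to demonstrate with throwaway paths (these are placeholders, not the real gallium docroot). Note the demo won't reproduce as root, since root bypasses permission checks.

```shell
# Unlinking or replacing a file needs write permission on the *directory*,
# not on the file: even a 0777 file can't be removed from a read-only dir.
tmp=$(mktemp -d)
mkdir "$tmp/docroot"
echo old > "$tmp/docroot/index.html"
chmod 777 "$tmp/docroot/index.html"   # file itself is wide open
chmod 555 "$tmp/docroot"              # but the parent dir is not writable
if rm -f "$tmp/docroot/index.html" 2>/dev/null; then
  echo "unlink succeeded (running as root?)"
else
  echo "unlink failed: parent directory is not writable"
fi
chmod -R u+w "$tmp"
rm -rf "$tmp"
```

This is exactly why chowning index.html to jenkins wasn't enough: git's checkout has to unlink the old file, which requires write access on /srv/org/mediawiki/integration itself.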
[08:14:25] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [08:14:27] New patchset: Krinkle; "Integration: Move to wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513 [08:14:27] New patchset: Krinkle; "misc::docsite: Remove file "doc/index.html"." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53955 [08:14:29] New patchset: Krinkle; "contint: Move docs.pp into contint" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53958 [08:17:07] New patchset: Krinkle; "Integration: Move to wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513 [08:18:09] New patchset: Krinkle; "contint: Move integration site to wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513 [08:19:43] in which repo i can find extract2, search-redirect, missing wiki etc? [08:21:21] Danny_B: operations/mediawiki-config.git [08:21:32] https://gerrit.wikimedia.org/r/gitweb?p=operations/mediawiki-config.git;a=tree [08:35:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:36:13] New review: Nemo bis; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53885 [08:38:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [08:38:53] New review: Nemo bis; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53885 [08:39:09] New patchset: Nemo bis; "Update wgServer and wgCanonicalServer for multi subdomain wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/53885 [09:04:04] join #mediawiki-feed [09:04:13] grrr. 
/ [09:11:54] RECOVERY - Puppet freshness on colby is OK: puppet ran at Fri Mar 15 09:11:51 UTC 2013 [09:35:34] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 188 seconds [09:37:34] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [09:40:24] PROBLEM - MySQL Idle Transactions on es7 is CRITICAL: Timeout while attempting connection [09:41:15] RECOVERY - MySQL Idle Transactions on es7 is OK: OK longest blocking idle transaction sleeps for seconds [09:53:15] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 191 seconds [09:53:34] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 198 seconds [10:06:24] PROBLEM - Puppet freshness on cp1034 is CRITICAL: Puppet has not run in the last 10 hours [10:22:16] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 12 seconds [10:22:26] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [10:22:34] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [11:35:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:36:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 8.342 second response time [11:43:43] New patchset: Silke Meyer; "Added customized sidebar for Wikidata test repos." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/53967 [11:57:07] New patchset: Silke Meyer; "Added more extensions and their settings to Wikidata test clients" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53969 [11:59:09] New review: Demon; "This won't actually remove the cronjobs, you should do the following:" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/53954 [12:02:55] New patchset: Aklapper; "bugzilla_report.php: Add query and formatting for list of urgent issues" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53387 [12:10:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:13:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.784 second response time [12:40:52] New patchset: Demon; "Finish puppetizing wikibugs bot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53973 [12:41:53] awesome :) [12:44:22] New patchset: Demon; "Finish puppetizing wikibugs bot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53973 [12:45:06] New review: Demon; "PS2 removes the start script. The puppetized ircecho takes care of all of this for us--just have to ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53973 [12:46:25] New patchset: Demon; "Finish puppetizing wikibugs bot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53973 [12:46:46] New review: Demon; "And PS3 removes a totally unrelated change that snuck in." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/53973 [13:07:24] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 185 seconds [13:07:34] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 192 seconds [13:12:17] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 187 seconds [13:12:34] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 190 seconds [13:14:25] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 186 seconds [13:14:37] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 191 seconds [13:16:26] hello [13:19:15] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 184 seconds [13:23:16] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 26 seconds [13:26:36] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [13:35:24] RECOVERY - Puppet freshness on amslvs1 is OK: puppet ran at Fri Mar 15 13:35:14 UTC 2013 [13:36:20] Change abandoned: Silke Meyer; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47424 [14:39:36] New review: Hashar; "Indeed, that file has already been moved to integration/docroot.git with https://gerrit.wikimedia.or..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/53955 [14:47:24] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [14:57:15] New review: Hashar; "We can probably use the `jenkins` group as well since we are all part of it or should be part of it..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53957 [14:58:29] New review: JanZerebecki; "Yes, that is the goal. The PCRE (.*\.|)fqdn should match sub.fqdn and fqdn (with nothing before it)...." 
[operations/apache-config] (master) C: 1; - https://gerrit.wikimedia.org/r/53403 [15:02:06] New review: Hashar; "Refactoring to contint module is nice, but I would prefer we get rid of this hack. Puppet should rea..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/53958 [15:23:10] !rt [15:23:10] http://rt.wikimedia.org/Ticket/Display.html?id=$1 [15:23:16] apergos: ^ [15:24:12] !del rt [15:24:12] if you want to delete a key, this is wrong way [15:24:18] hah [15:24:28] !rt del [15:24:28] Successfully removed rt [15:24:30] !rt del [15:24:30] Unable to find the specified key in db [15:24:41] !rt is https://rt.wikimedia.org/Ticket/Display.html?id=$1 [15:24:41] Key was added [15:24:48] !rt foo [15:24:48] https://rt.wikimedia.org/Ticket/Display.html?id=foo [15:39:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:40:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 8.033 second response time [15:51:27] New review: Krinkle; "@Hashar: We still need the apache configuration of course, which I move here. And I'm removing the ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53958 [15:53:52] New patchset: Krinkle; "contint: Move docs.pp into contint" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53958 [15:53:53] New patchset: Krinkle; "contint: Move integration site to wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513 [15:54:59] what was I supposed to notice about that borken rt link, jeremyb_ ? [15:55:51] apergos: just what the format is. you had said 15 09:37:06 < apergos> I have no idea what links to rt are supposed to look like, so the answer is 'not by me' [15:56:27] and Nemo_bis updated the map on meta. so now we just need someone to sync it to wikitech :) [15:56:45] jeremyb_: are you sure it uses that map? [15:58:41] Nemo_bis: if someone chooses to.
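As an aside on JanZerebecki's review above: the `(.*\.|)fqdn` pattern can be sanity-checked from a shell, using `example.org` as a stand-in for the real fqdn (GNU grep's -P gives PCRE semantics).

```shell
# The alternation (.*\.|) makes the subdomain prefix optional, so the
# pattern matches both the bare domain and any subdomain of it.
pattern='^(.*\.|)example\.org$'
printf '%s\n' example.org sub.example.org deep.sub.example.org other.org \
  | grep -P "$pattern"
# Matches the first three; other.org is rejected.
```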
ariel's original wikitech work used the map. AFAIK [15:58:55] well it used that map because it was horrible [15:59:11] what should happen is the interwiki cache db for the cluster needs to be regenerated with that map [15:59:19] then I can steal it and update it for wikitech properly [16:00:20] Krinkle: so yeah I did some review :-] [16:00:38] Krinkle: and I found out how to get jenkins to run with umask 0002 so it creates files as group writable by default :-] [16:00:42] (hopefully) [16:01:11] New patchset: Krinkle; "svn: Disable mwdocs and redirect to doc.wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53954 [16:02:52] New patchset: Andrew Bogott; "Fix up the mediawiki extension class a bit." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51797 [16:02:52] New patchset: Andrew Bogott; "Added optional wiki_name setting" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51796 [16:02:53] New patchset: Andrew Bogott; "Rearrange webserver dependencies so that openstack and mediawiki::singlenode classes can play nice together." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51799 [16:02:53] New patchset: Andrew Bogott; "Switch the openstack manifest to use webserver::php5." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51798 [16:02:54] New patchset: Andrew Bogott; "First pass at a labsconsole puppet setup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53989 [16:04:28] New review: Krinkle; "@Hashar: As mentioned in the commit message, yes these resource descriptors should go out of puppet ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53957 [16:05:19] hashar: jenkins can't git-checkout /srv/ [16:05:23] because of that [16:05:29] it will with that change [16:07:20] that ?
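The umask change hashar mentions above has a simple effect: with umask 0002, newly created files default to mode 0664 (group-writable) instead of 0644, because the mask is subtracted from the 0666 creation mode. A local illustration only, not the actual change to the Jenkins daemon (that is what https://gerrit.wikimedia.org/r/53990 does):

```shell
# umask masks permission bits off newly created files:
#   0666 & ~0022 = 0644 (default), while 0666 & ~0002 = 0664 (group-writable).
tmp=$(mktemp -d)
( umask 0022; touch "$tmp/default.txt" )
( umask 0002; touch "$tmp/group.txt" )
stat -c '%a %n' "$tmp"/*.txt   # GNU stat; shows the two resulting modes
rm -rf "$tmp"
```

Combined with group-writable (775) directories owned by the jenkins group, this lets both the daemon and human admins in that group update /srv without sudo.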
[16:07:30] that is contextless :-] [16:07:38] New patchset: Hashar; "jenkins now creates files group writable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53990 [16:07:41] here it is [16:08:24] PROBLEM - Puppet freshness on mw26 is CRITICAL: Puppet has not run in the last 10 hours [16:08:57] Krinkle: regarding the permissions of /srv/ what do you think about having them belong to the group jenkins ? [16:09:27] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53967 [16:11:17] New review: Hashar; "https://gerrit.wikimedia.org/r/53990 will make the Jenkins daemon to have umask 0002 which should en..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53957 [16:11:20] hashar: sure, so 775>755 as well then, right ? [16:11:29] Krinkle: I think so [16:11:29] for the whole /srv stack [16:11:34] Krinkle: just replied on the /srv/ change [16:11:40] hmm [16:11:41] no [16:11:44] 775 :-] [16:11:50] so we (human) can write to it if needed [16:12:00] without having to sudo jenkins [16:12:33] yeah, the other way around [16:12:39] that's what I meant [16:12:42] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53969 [16:12:45] hashar: I'm including /srv/localhost/qunit as well [16:13:34] yeah good catch :-] [16:13:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:14:43] New patchset: Krinkle; "contint: Fix permissions in /srv." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53957 [16:14:45] New patchset: Krinkle; "contint: Move docs.pp into contint" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53958 [16:14:46] New patchset: Krinkle; "contint: Move integration site to wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513 [16:14:46] New patchset: Krinkle; "misc::docsite: Remove file "doc/index.html"." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/53955 [16:15:19] lovely gerrit :-] [16:15:25] PROBLEM - Puppet freshness on strontium is CRITICAL: Puppet has not run in the last 10 hours [16:16:51] poor timo :-] [16:17:22] New review: Hashar; "Great :-]" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/53957 [16:18:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 8.882 second response time [16:19:11] New review: Hashar; "Need to remove ./files/misc/jenkins/doc_index.html from puppet as well =)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/53955 [16:19:30] Krinkle: https://gerrit.wikimedia.org/r/#/c/53955/ ./files/misc/jenkins/doc_index.html can be removed from puppet, no more used with that change [16:19:52] the docs.pp https://gerrit.wikimedia.org/r/#/c/53958/ you can abandon it [16:20:09] Do we have any general puppet guides on wikitech? e.g. coding style, appropriate uses, etc? [16:20:24] andrewbogott: none I know of. [16:20:32] ok, I'm about to make one [16:20:39] (which will consist entirely of questions and no answers) [16:21:00] andrewbogott: there is an upstream style guide though which puppet-lint let you check your manifests with [16:21:24] andrewbogott: what would be nice is an explanation of our overall design. Like the role classes, what the modules should be made for .. [16:21:33] Right, that's just what I'm thinking. [16:21:38] here is the guide http://docs.puppetlabs.com/guides/style_guide.html [16:21:46] andrewbogott: make sure to involve faidon :-] [16:21:52] What should be in a module, what should be outside. 
Also, what should be puppetized and what should just be hand-installed [16:22:02] I was going to make a page with just topic lines and let Faidon fill in the text :) [16:22:02] andrewbogott: but not m4rk cause he will ask us to use tabs :-] [16:22:07] New patchset: Krinkle; "contint: Move docs.pp into contint" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53958 [16:22:08] New patchset: Krinkle; "contint: Move integration site to wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513 [16:22:09] New patchset: Krinkle; "misc::docsite: Remove file "doc/index.html"." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53955 [16:22:20] The tab ship has long-since sailed, hasn't it? [16:22:35] it still has a captain on board [16:23:10] faidon and I are happily sinking the tab boat though [16:23:12] The tab ship runs aground on a reef, littering the shores of Duluth/Superior with ambiguous whitespace [16:23:30] coffee break [16:23:39] I both a) hate tabs and b) know better than to advocate this position in public [16:23:43] then I might need your super power andrew :-] [16:23:58] we got a bunch of small puppet changes for you to press [merge] on :-D [16:24:18] 17:22:20 The tab ship has long-since sailed, hasn't it? [16:24:20] what is this bullshit? [16:25:04] mark: I just meant, "we have existing conventions for tab usage and there's no point in discussing the merits of our conventions" [16:25:30] the space ship has sailed! [16:25:33] that actually sounds funny [16:25:46] Oh yeah, 'space ship' is better. Hm... [16:26:01] :) [16:26:13] =) [16:30:06] funny having jenkins review changes to itself [16:30:31] I should have named it zuul-the-gatekeeper [16:31:44] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:31:48] New review: Hashar; "That is unneeded now, we can remove the doc generation next week when Openstack merge the change and..."
[operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/53958 [16:34:30] * jeremyb_ glares at the ships [16:37:20] Krinkle: don't you need both http and https access for qunit ? [16:37:29] ? [16:37:31] Krinkle: bah that is on localhost.qunit … disregard [16:37:34] https://wikitech.wikimedia.org/wiki/Puppet_usage [16:37:42] Redirect permanent / https://integration.wikimedia.org/ is nice :-] [16:38:16] Used to be rewrite in an earlier patch version, but I grabbed this from another apache conf in puppet [16:38:25] PROBLEM - Puppet freshness on mw1118 is CRITICAL: Puppet has not run in the last 10 hours [16:38:27] I forgot that Redirect takes care of path query and even hash [16:38:30] just straight forward [16:38:35] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:38:47] hashar: btw, does bin/mv have a way to replace the existing directory (if any) [16:39:04] I need to mv workspace/ext/visualeditor/docs to /srv/org/wm/docs/VisualEditor/master [16:39:27] should I do silent rm -rf, mkdir -p (without /master) and then mv? [16:39:39] or is there a cleaner way [16:39:50] rsync --delete ? :-D [16:40:48] even --delete-after [16:40:59] that will make sure we still serve something during the file copy [16:44:22] New review: Hashar; "Good. Will need:" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/53513 [16:45:05] New review: Krinkle; "Symlink is in place already (in the docroot.git repo, but it can't be deployed due to the permission..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513 [16:45:25] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 191 seconds [16:45:33] hashar: since you're root, can you fix the permissions manually and quickly git pull? Just so we at least have yesterday's changes live. [16:45:34] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 193 seconds [16:45:52] e.g.
minor fixes to portals and addition of links to the integration page [16:45:57] https://integration.mediawiki.org/ is old [16:46:33] puppet will undo the permissions, but it won't undo the update of integration/index.html [16:46:36] andrewbogott: mark: can one of you volunteer to merge a few minor puppet changes please ? :-] [16:48:30] hashar: add me as a reviewer? [16:50:30] andrewbogott: sure [16:51:21] andrewbogott: mails incoming [16:51:25] PROBLEM - Puppet freshness on europium is CRITICAL: Puppet has not run in the last 10 hours [16:51:35] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [16:53:33] andrewbogott: and I replied to your puppet usage questions on the talk page : https://wikitech.wikimedia.org/wiki/Talk:Puppet_usage :-] [16:54:08] cool, thanks [16:54:25] did you get the gerrit emails asking for review or do you want me to link the changes there ? [16:55:08] Got 'em. [16:55:25] hashar, contint server == gallium == doc.wikimedia.org? [16:55:30] yeah [16:55:42] I will phase out your puppet doc generation next week [16:55:49] to stop puppet from doing it in favor of Jenkins [16:56:02] great. [16:56:55] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53957 [16:57:07] … I wonder what happens if two runs of the puppet doc-generation overlap? [16:57:20] I will run puppet on gallium whenever the changes land on sockpuppet [16:57:48] andrewbogott: the jenkins job is not enabled yet :-] [16:57:57] New patchset: Hashar; "contint: Move integration site to wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513 [16:58:45] New review: Hashar; "rebased to get rid of the dependency upon "contint: Move docs.pp into contint" https://gerrit.wikime..."
[operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/53513 [16:58:59] https://gerrit.wikimedia.org/r/#/c/53513/ is the last patch of the series [16:59:13] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53955 [16:59:18] hashar: I mean, if two patches are submitted at the same time, will that prompt Jenkins to regenerate the puppet docs twice, and at the same time? [16:59:32] (for certain values of 'same') [16:59:47] !log stopped puppet on gallium to fix some permissions before puppet changes are merged in and applied. [16:59:53] Logged the message, Master [17:01:05] andrewbogott: Hm.. if that were true it might even cause a race condition where they end up outdated. [17:01:20] I don't think parallel jobs within 1 target repository are enabled [17:01:22] for this reason [17:01:23] and others [17:01:33] Krinkle: Yeah, I think there needs to be a lock of some sort if it's called by jenkins. [17:01:35] so it'd never run simultaneously [17:01:48] zuul/jenkins makes sure of that [17:01:54] also because it needs to test against merge conflicts [17:02:07] andrewbogott: but yet, I do think it would queue it twice. [17:02:43] Depending on the job it may be important that you run it for each version. Perhaps there is a way to cancel old jobs in the queue. But it's not like mediawiki's job queue. [17:02:56] it does run postmerge [17:03:08] so gerrit doesn't have to wait for it to vote or anything [17:03:27] !log gallium : update /srv/ to latest version of integration/docroot.git ( aaa4b0c ) and moved all content from /srv/org/mediawiki/integration to the new /srv/org/wikimedia/integration [17:03:34] Logged the message, Master [17:03:43] !log gallium : restarted puppet [17:03:49] Logged the message, Master [17:04:09] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53990 [17:04:19] hashar: That stuff isn't on sockpuppet yet...
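The rsync approach hashar suggested earlier for publishing freshly built docs (rather than rm/mkdir/mv) can be sketched like this. Directories are placeholders, not the real /srv layout; the point of --delete-after is that deletions are postponed until the transfer finishes, so the docroot never goes empty mid-copy.

```shell
# Publish a newly built docs tree over the currently served one.
# --delete-after removes files that vanished upstream only once the
# new content is fully in place.
src=$(mktemp -d)   # stand-in for workspace/ext/visualeditor/docs
dst=$(mktemp -d)   # stand-in for the served VisualEditor/master dir
echo new   > "$src/index.html"
echo stale > "$dst/removed-page.html"
rsync -a --delete-after "$src/" "$dst/"
ls "$dst"   # only index.html remains; the stale page is gone
rm -rf "$src" "$dst"
```

Note the trailing slashes: `"$src/"` means "the contents of src", so the destination directory itself is updated in place rather than nested.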
[17:04:29] andrewbogott: I can handle the jobs dependency in Zuul. Never tried it but apparently Zuul can cancel old jobs if there is a new patchset that overrides it [17:04:39] andrewbogott: yeah just restarted puppet :-] [17:04:54] to make sure I don't leave the box without puppet running :-] [17:05:03] Krinkle: /srv/ should be fine now [17:05:45] hashar: "You don't have permission to access / on this server." [17:05:49] https://integration.mediawiki.org/ [17:06:16] either we don't have followsymlinks or it needs a clause for /org/wikimedia/integration [17:06:20] or both [17:06:28] symlink is wrong :-] [17:06:35] y'all are ready for the s/mediawiki/wikimedia/g patch to drop? [17:06:48] ll /srv/org/mediawiki/integration [17:06:49] lrwxrwxrwx 1 jenkins jenkins 25 Mar 15 17:01 /srv/org/mediawiki/integration -> org/wikimedia/integration [17:06:53] That would work too [17:07:02] hashar: You mean cancel a pending job or a running job? [17:07:37] andrewbogott: so given change A -> B -> C , you merge them all in sequence. Zuul would first trigger A, when it detects B, it cancels A and runs B instead. When C lands, it cancels A. [17:07:48] Krinkle: hashar: re Redirect, no, I don't think it does the hash. I don't think your browser even sends the anchor to the server at all [17:08:12] andrewbogott: if C is a success, zuul assumes A and B to be valid changes and does not retrigger them. If C is a failure, it tries B. [17:08:27] andrewbogott: something like that. But for merged commits that is not really useful. [17:08:30] hashar: I understand the logic but I'm curious what you mean by 'cancel'. Is it actually killing the running job? [17:08:41] andrewbogott: yeah it should :-] [17:09:00] Hm… well hopefully the doc-generator will handle that gracefully :) [17:09:23] hashar: Is that all the patches you needed?
[17:09:24] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513 [17:09:24] will have to find out :-] [17:09:32] andrewbogott: checking my dashboard [17:10:39] andrewbogott: yeah seems good :-] [17:11:35] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 6 seconds [17:11:35] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 2 seconds [17:11:52] hashar: OK, all merged on sockpuppet now. [17:12:00] time to see what breaks :) [17:12:36] running puppet [17:12:55] puppet [17:12:57] I hate you [17:13:31] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: File[/srv/org/wikimedia] is already defined in file /var/lib/git/operations/puppet/modules/contint/manifests/website.pp at line 20; cannot redefine at /var/lib/git/operations/puppet/modules/contint/manifests/website.pp:53 on node gallium.wikimedia.org [17:13:33] :-] [17:13:48] I forgot to check them [17:14:01] * andrewbogott stands by to merge another tiny patch [17:14:35] RECOVERY - Puppet freshness on mw1109 is OK: puppet ran at Fri Mar 15 17:14:28 UTC 2013 [17:16:01] incoming [17:16:20] * greg-g ducks [17:16:20] New patchset: Hashar; "contint: dupe /srv/org/wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54000 [17:16:43] andrewbogott: the tiny fix https://gerrit.wikimedia.org/r/#/c/54000/ [17:17:25] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [17:17:25] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [17:17:45] * marktraceur jsducks [17:17:46] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54000 [17:18:11] marktraceur: andrew is there doing all the awesome review + deployment of Timo's changes [17:18:14] hashar: ok, merged [17:18:20] running puppet [17:18:46] <^demon> The symlink in /srv/org/mediawiki is broken :/ [17:18:56]
^demon: https://gerrit.wikimedia.org/r/#/c/53999/ [17:19:08] andrewbogott: catalog compiled :] [17:19:31] !log integration website going down while we deploy a new layout. [17:19:38] Logged the message, Master [17:21:15] !log gallium : reloading Zuul so it reports the status url with the wikimedia.org domain [17:21:22] Logged the message, Master [17:22:08] !log mlitn synchronized php-1.21wmf11/extensions/ArticleFeedbackv5/ 'Update ArticleFeedbackv5 to master' [17:22:15] Logged the message, Master [17:23:34] hashar: So, mostly stable again? I'm about to go to lunch. [17:23:43] andrewbogott: yeah sounds fine [17:23:44] * andrewbogott is back in CDT [17:23:51] andrewbogott: doing some checks but yeah that seems good [17:23:58] andrewbogott: thank you a lot for your assistance! [17:24:51] np [17:25:34] RECOVERY - Puppet freshness on amslvs2 is OK: puppet ran at Fri Mar 15 17:25:30 UTC 2013 [17:34:45] New review: Krinkle; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513 [17:36:23] !log jenkins: updated jenkins configuration to have its URL default to wikimedia.org domain instead of the legacy mediawiki.org [17:36:29] Logged the message, Master [17:40:43] New patchset: Hashar; "contint: integration.wikimedia.org was missing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54003 [17:44:51] New patchset: Hashar; "contint: update codecoverage directory" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54004 [17:46:04] New review: Hashar; "Did the change for you Timo : https://gerrit.wikimedia.org/r/#/c/54003/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53513 [17:48:30] Krinkle|detached: the two changes above should fix the remaining issues :-] [17:49:07] Krinkle|detached: I am out for now, might come back later tonight [17:49:12] Krinkle|detached: congratulations :-] [17:52:12] !log mlitn Started syncing Wikimedia installation... 
: Update ArticleFeedbackv5 to master [17:52:19] Logged the message, Master [17:53:17] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54003 [17:53:41] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54004 [17:56:35] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 182 seconds [17:56:35] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 184 seconds [17:59:24] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [18:00:24] PROBLEM - Puppet freshness on es10 is CRITICAL: Puppet has not run in the last 10 hours [18:03:25] PROBLEM - Puppet freshness on mw1045 is CRITICAL: Puppet has not run in the last 10 hours [18:03:25] PROBLEM - Puppet freshness on mw1117 is CRITICAL: Puppet has not run in the last 10 hours [18:03:25] PROBLEM - Puppet freshness on mw1129 is CRITICAL: Puppet has not run in the last 10 hours [18:07:24] !log mlitn Finished syncing Wikimedia installation... : Update ArticleFeedbackv5 to master [18:07:24] PROBLEM - Puppet freshness on tmh1002 is CRITICAL: Puppet has not run in the last 10 hours [18:07:24] PROBLEM - Puppet freshness on sq83 is CRITICAL: Puppet has not run in the last 10 hours [18:07:30] Logged the message, Master [18:12:21] Reedy, Ryan_Lane: you cleaned up all two-level domains [18:12:27] but did you see en.m.mobile? :) [18:15:17] !log mlitn synchronized php-1.21wmf11/extensions/ArticleFeedbackv5/ [18:15:23] Logged the message, Master [18:19:06] !log ran puppetd --enable on all mw hosts [18:19:08] !log mlitn synchronized php-1.21wmf11/extensions/ArticleFeedbackv5/ [18:19:13] Logged the message, notpeter [18:19:18] Logged the message, Master [18:20:29] mlitn: people are getting database errors on enwiki from aft tables not existing [18:20:49] paravoid: that just redirects, right? 
[18:20:50] indeed, just rolled back that change [18:20:54] paravoid: we can't really fix those [18:21:01] I'll now look into what was wrong [18:21:08] but db error should no longer happen [18:21:26] Ryan_Lane: we will be soon [18:21:37] paravoid: will be soon? [18:21:47] using them? [18:21:53] no, able to fix [18:21:59] how? [18:22:24] the only reason we have en.m.mobile & en.zero.mobile is the whole langlist machinery [18:22:28] (I think) [18:22:32] ah. ok [18:22:36] yes. it is [18:22:45] yeah, this will be gone :) [18:22:58] but, realistically they shouldn't be referenced anywhere anyway :) [18:23:04] I know [18:23:05] so, it shouldn't be a problem [18:23:09] I just realized today [18:23:13] * Ryan_Lane nods [18:23:13] when I was diffing zonefiles [18:23:24] we have a lot of DNS entries that go nowhere [18:23:31] and I found it really funny [18:25:31] so, role::cache::bits requires class geoip which includes geoip::packages which should install geoip-bin [18:25:37] but i don't see that package on arsenic [18:25:45] anyone with more puppet foo want to tell me why ? [18:26:02] oh well it's theoretically installed [18:26:09] nm [18:26:17] so why are the cli commands not working ? [18:26:34] ahha [18:26:37] they changed the names [18:26:39] nm [18:26:47] why did the package change the names? the world may never know [18:30:54] LeslieCarr: As a rule, that'd be because of a name conflict with some other default package [18:31:12] geoip-lookup became geoiplookup [18:31:18] LeslieCarr: Like the dolphin emulator has executable 'dolphin-emu' in Ubuntu because of the file manager. [18:31:19] i say that's just fucking with us [18:31:34] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 27 seconds [18:31:42] LeslieCarr: Maybe it is. It's a conspiracy. 
:-) [18:32:25] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [18:33:51] !log mlitn synchronized php-1.21wmf11/extensions/ArticleFeedbackv5/ [18:33:57] Logged the message, Master [18:38:35] RECOVERY - Puppet freshness on mw26 is OK: puppet ran at Fri Mar 15 18:38:24 UTC 2013 [18:44:11] !log fixing broken dpkg/APT on maerlant - dpkg --configure -a , re-run puppet and let it install package upgrades [18:44:18] Logged the message, Master [18:44:44] RECOVERY - Puppet freshness on maerlant is OK: puppet ran at Fri Mar 15 18:44:36 UTC 2013 [18:45:05] RECOVERY - Puppet freshness on mw1045 is OK: puppet ran at Fri Mar 15 18:44:54 UTC 2013 [18:46:34] RECOVERY - Puppet freshness on mw1118 is OK: puppet ran at Fri Mar 15 18:46:25 UTC 2013 [18:46:46] !log installing package upgrades on formey [18:46:52] Logged the message, Master [18:50:24] !log installing more package upgrades on maerlant (wikimedia-lvs-realserver, libc6, nginx...) [18:50:29] Logged the message, Master [18:54:25] RECOVERY - Puppet freshness on mw1129 is OK: puppet ran at Fri Mar 15 18:54:18 UTC 2013 [18:55:54] RECOVERY - Puppet freshness on mw1117 is OK: puppet ran at Fri Mar 15 18:55:52 UTC 2013 [18:59:34] so, I just made a gerrit project with an empty first commit, pulled it, added shit, wanted to push that stuff back in, and the project is now gone... [18:59:37] I'm.... confused [18:59:44] ^demon: any clues? [19:02:25] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 181 seconds [19:02:34] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 185 seconds [19:04:23] paravoid: We didn't yet... [19:04:31] We only fixed pa.us.wikimedia.org [19:04:37] the arbcom ones still need doing [19:04:41] Another 5 or 6 IIRC [19:06:25] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 193 seconds [19:07:24] <^demon> notpeter: That sounds wrong. 
[19:07:37] ^demon: ok I think https://gerrit.wikimedia.org/r/#/c/53882/ is ready for cr [19:09:01] notpeter: We prefer you not to commit shit to gerrit [19:09:28] ^demon: indeed! [19:09:31] <^demon> notpeter: Define, "gone" [19:09:45] Reedy: me too... but aparently this code needs to be deployed urgently. [19:10:07] ^demon: I'd like to deploy that when it's merged [19:10:15] ^demon: there is no project operations/debs/flask-login [19:10:21] when I list projects [19:13:04] <^demon> You made a repo called "flask-login" [19:13:11] <^demon> not operations/debs/flask-login [19:13:23] !log maerlant - dpkg: error processing ganglia-monitor | backing up /var/www/ to tridge /data/other then deleting contents | removing sites-enabled/ipv6and4 | nginx can't bind to [2620:0:862:1::80:2]:443 [19:13:31] Logged the message, Master [19:13:37] ^demon: damnit [19:13:44] is it possible to rename? [19:14:11] <^demon> Since you haven't pushed anything yet, yes :) [19:14:21] <^demon> We're working on it more generally for all repos :) [19:14:24] I was assuming that by making operations/debs its parent, it would include that in the name [19:14:35] I assume that things are like filesystems [19:14:36] silly me [19:14:41] ^demon: yay! [19:14:42] awesome [19:15:43] <^demon> You're set :) [19:16:02] <^demon> The repos are on the filesystem, but if you move a repo with history it confuses the stuff that's in the database. [19:16:36] ^demon: SVN still active, pywikipedia .. see details in mail [19:17:02] <^demon> No, I meant the wikimedia repo. [19:17:07] <^demon> pywikipedia has its own repo. [19:17:21] <^demon> I mean this: https://svn.wikimedia.org/viewvc/wikimedia/ [19:17:21] that was also still used [19:17:35] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 185 seconds [19:17:35] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 185 seconds [19:17:36] ^demon: sweet! thank you very much! 
[19:17:44] ^demon: r2596 | prolineserver | 2013-03-13 17:31:27 +0000 (Wed, 13 Mar 2013) [19:18:07] <^demon> Yeah, I'm asking people to fess up and move to gerrit ;-) [19:18:29] i know, i replied to that mail [19:19:01] we also have the mysql repo.. last touched in 2008 :) [19:21:24] hashar: I am (finally) back from lunch… everything working ok? [19:22:24] RECOVERY - Puppet freshness on es10 is OK: puppet ran at Fri Mar 15 19:22:17 UTC 2013 [19:22:33] !log taking down raskin for decommissioning [19:22:39] Logged the message, Master [19:37:09] !log maerlant - re-adding 2620:0:862:1::80:2 to eth0, starting nginx [19:37:15] Logged the message, Master [19:43:36] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 183 seconds [19:44:25] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 199 seconds [19:58:53] PROBLEM - Host ms-be1001 is DOWN: CRITICAL - Plugin timed out after 15 seconds [19:58:53] PROBLEM - Host ms-be1007 is DOWN: CRITICAL - Plugin timed out after 15 seconds [19:58:53] PROBLEM - Host cp1007 is DOWN: CRITICAL - Plugin timed out after 15 seconds [19:58:53] PROBLEM - Host ms-be1011 is DOWN: CRITICAL - Plugin timed out after 15 seconds [19:58:53] PROBLEM - Host cp1011 is DOWN: CRITICAL - Plugin timed out after 15 seconds [19:58:53] PROBLEM - Host ms-be1009 is DOWN: CRITICAL - Plugin timed out after 15 seconds [19:59:06] Faidon, have any spare cycles? 
[19:59:20] um… paravoid ^^ [19:59:44] RECOVERY - Host cp1011 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [19:59:44] RECOVERY - Host cp1007 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [19:59:45] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 1.17 ms [19:59:45] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [19:59:45] RECOVERY - Host ms-be1009 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [19:59:45] RECOVERY - Host ms-be1011 is UP: PING OK - Packet loss = 0%, RTA = 3.69 ms [20:00:26] PROBLEM - Recursive DNS on 208.80.154.50 is CRITICAL: CRITICAL - Plugin timed out while executing system call [20:01:28] RECOVERY - Recursive DNS on 208.80.154.50 is OK: DNS OK: 0.023 seconds response time. www.wikipedia.org returns 208.80.154.225 [20:07:07] PROBLEM - Puppet freshness on cp1034 is CRITICAL: Puppet has not run in the last 10 hours [20:07:46] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 185 seconds [20:08:16] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 199 seconds [20:14:21] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51796 [20:19:56] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 182 seconds [20:20:17] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 190 seconds [20:21:31] binasher: notpeter: is there anything special about db36.pmtpa.wmnet (or at least different from db32)? 
[20:21:49] attempting a mysqldump gives me mysqldump: Couldn't execute 'START TRANSACTION WITH CONSISTENT INNODB SNAPSHOT': You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'INNODB SNAPSHOT' at line 1 (1064) [20:21:53] so, i deleted a site config from nginx, restarted it, and the site is still up..wtf [20:22:04] mysqldump --single-transaction --add-locks --comments --create-options --disable-keys -e --quick --skip-lock-tables -h db36.pmtpa.wmnet enwiki user [20:23:08] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [20:28:33] hi [20:29:14] * jeremyb_ spies a mutante mailing... but is he here? [20:29:41] jeremyb_: so did the ssl guy figure stuff out ? [20:29:59] !log DNS update - removing ipv4/ipv6/ipv6and4/results.labs.wm [20:30:00] LeslieCarr: i didn't reply yet :( [20:30:06] Logged the message, Master [20:30:24] wooo, spring cleaning [20:30:52] hello [20:30:55] so, who broke qunit on jenkins? [20:31:00] !log restarting pdns on ns0 [20:31:03] maybe hashar or Krinkle|detached ? [20:31:07] Logged the message, Master [20:31:11] it's timing out immediately, apparently - see last comments on https://gerrit.wikimedia.org/r/#/c/39792/ [20:31:27] bug 658 ... [20:31:40] forget the bug, look at the tests failing :P [20:31:50] (although you could review it, sure) [20:32:03] i love low-numbered bugs, i fixed something like 258 some time ago, too. [20:32:20] how about bug 1? [20:32:29] jeremyb_: whats up [20:32:50] 19:28:24 Warning: PhantomJS timed out, possibly due to a missing QUnit start() call. Use --force to continue. [20:32:56] jeremyb_: worked on that one, too [20:33:19] mutante: left you a note about feedparser. and do you want to add me to the planet labs project? i can try to reproduce it there? 
[20:33:29] !log puppet runs on neon broken due to mysql puppet issue - mysql-client-5.5 can not create alias mysql-client: object already exists at /var/lib/git/operations/puppet/manifests/mysql.pp:513 [20:33:34] Logged the message, Master [20:33:43] jeremyb_: change 3ff7461 ;) [20:34:10] jeremyb_: sure, adding you to project [20:34:18] MatmaRex: your change is broken : https://gerrit.wikimedia.org/r/#/c/39792/9/tests/qunit/suites/resources/mediawiki/mediawiki.util.test.js,unified [20:34:29] MatmaRex: that is why jshint break [20:34:41] hashar: oh, poop. [20:34:46] MatmaRex: there are some <<<<< and >>>>> in it [20:34:53] yeah, merge conflict markers [20:34:54] dammit [20:35:02] i assumed it was just in release-notes, as always [20:35:09] MatmaRex: though the jshint report is not very helpful :-] ( https://integration.wikimedia.org/ci/job/mediawiki-core-jslint/3818/checkstyleResult/ ) [20:35:39] MatmaRex: ahh here it is : https://integration.wikimedia.org/ci/job/mediawiki-core-jslint/3818/checkstyleResult/file.1648977917/ :::: Expected an identifier and instead saw '<<'. [20:35:57] hashar: hidden among 20 false positives [20:35:57] MatmaRex: we should find out a way to verify whether the .js is valid javascript before running jshint :-] [20:36:01] Ryan_Lane: no more sub.sub.labs in DNS, besides the old projects we are redirecting, killed ipv4/ipv6/results... [20:36:13] mutante: hooray! [20:36:22] hashar: the checkstyle reports are utterly useless [20:36:24] with all the phpcs stuff we don't care about cluttering them [20:37:38] jeremyb_: Successfully added jeremyb to planet. Successfully added Jeremyb to projectadmin. [20:38:18] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 21 seconds [20:38:22] mutante: do you use mars or mostly venus? [20:38:36] (is there an actual project called mars?) 
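[Editor's note] The jshint breakage above came from leftover `<<<<<<<`/`>>>>>>>` merge-conflict markers, and hashar wished for a way to sanity-check a .js file before running jshint. A minimal sketch of such a pre-lint guard; the file name, path, and contents are invented for illustration, not taken from the incident:

```shell
# Hypothetical file with leftover conflict markers (invented example).
cat > /tmp/example.js <<'EOF'
var a = 1;
<<<<<<< HEAD
var b = 2;
=======
var b = 3;
>>>>>>> feature
EOF

check_conflict_markers() {
    # Git conflict markers are runs of exactly 7 characters in column 0.
    grep -nE '^(<{7}( |$)|={7}$|>{7} )' "$1"
}

# Cheap guard: skip the linter entirely if markers are present.
if check_conflict_markers /tmp/example.js > /tmp/markers.txt; then
    echo "conflict markers found, skipping jshint"
fi
```

This only catches the one failure mode discussed in the log (conflict markers); a real "is this valid JavaScript" check would need an actual parser pass.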
[20:38:46] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [20:40:15] jeremyb_: mars was just my own invention, there is no actual project with that name. venus is the one with the public IP.. planet.wmflabs.org .. mars was created later just to confirm everything was puppetized on a fresh instance [20:40:32] ok. so they should be carbon copies [20:40:41] and i can nuke mars if i like? [20:40:52] should be, i thought one of them was using puppetmaster::self but i must have been wrong [20:40:59] you can [20:44:17] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 18 seconds [20:44:56] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [20:46:55] New patchset: Hashar; "restore integration.mediawiki.org config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54068 [20:47:05] !log integration.mediawiki.org is not redirecting to the new integration.wikimedia.org (apache conf is disabled). Patch in {{gerrit|54068}} [20:47:11] Logged the message, Master [20:47:28] mutante: guten tag :-] If you get 2 secs to merge https://gerrit.wikimedia.org/r/54068 and merge on sock puppet that would be nice :-] [20:48:00] mutante: we broke the contint website earlier today. The change will enable the conf that redirect the old URL to the new one (we migrated under wikimedia.org \O/ ) [20:48:05] pgehres: why are you trying to run mysqldump on db36? [20:48:23] binasher: grabbing updates for db29 [20:48:39] what kind of updates? [20:48:53] I should have asked in here. [20:48:53] new users, updated info for the rest of the rows [20:49:10] pgehres: you won't find anything like that on db36 [20:49:19] Yo, mutante, can you have a look at https://gerrit.wikimedia.org/r/#/c/47026/ sometime in the next couple of days? Would be nice to get that project moving again. 
(The review bit is the least of it) [20:50:08] binasher: aha, now that comment in db-pmtpa.php makes sense [20:50:23] sbernardin: Ok, So we are clear to work on professor [20:50:30] I am going to shut it down, once it powers off it is all yours [20:50:39] once you have the fan swapped, power it back up and we should be ok to go from there [20:50:52] replacing the fan will clear the million entries on log so we can see which dimm slot is bad more easily =] [20:51:02] !log professor shutting down for fan swap, short downtime (hopefully) [20:51:08] Logged the message, RobH [20:51:32] RobH: OK [20:51:51] it should be turning off now, once its back up ping me will ya please (I wanna ensure services recover and clear log) [20:51:58] thx =] [20:52:04] andrewbogott: a follow up for the contint website :-] https://gerrit.wikimedia.org/r/54068 we disabled apache conf for the integration.mw.o entry :-/ that breaks back compat. [20:52:29] New review: Dzahn; "(1 comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/54068 [20:52:44] hashar: there's a problem with that [20:52:53] ah [20:52:53] see inline comment [20:53:00] mutante: sorry, thought you were out for lunch :] [20:53:35] andrewbogott: bookmarking :p [20:53:36] binasher: my algorithm for identifying the pmtpa snapshot host fails on s1, so, my bad, thanks for identifying my stupidity [20:53:46] mutante, thanks. [20:54:17] PROBLEM - Host professor is DOWN: PING CRITICAL - Packet loss = 100% [20:55:04] New review: Hashar; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54068 [20:55:07] poor professor [20:55:22] mutante: files is unneeded in modules :/ (replied on https://gerrit.wikimedia.org/r/#/c/54068/1/modules/contint/manifests/website.pp,unified ) [20:55:27] mutante: yet another inconsistency.
[20:57:05] oh, i see [20:57:44] +2 , needs verified [20:58:01] i hate that inconstancy, I always have to copy paste from another module :( [20:58:23] Zuul is quite busy :/ https://integration.mediawiki.org/zuul/status [20:58:31] it is waiting for something before reporting [20:58:33] untrusted connection [20:58:35] heh [20:58:48] is that the reason for switching? yea, hm [20:58:49] argh [20:58:51] RobH: professor booting up [20:59:15] mutante: looks like I need yet another change to update the cert on integration.mediawiki.org =] [20:59:36] sbernardin: cool, so once we have it back online [20:59:41] i should be able to clear its event log [20:59:45] and now we will see memory erros [20:59:52] before they were getting lost in millions of fan alerts. [20:59:56] Yup [21:00:48] New patchset: Dzahn; "add krinkle to sudo ALL users on gallium (RT-4735)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54073 [21:00:58] hashar: no need to ensure the old one absent, but we should delete it (cert and key) afterwards [21:01:08] and see change above [21:01:52] ooh, ouch, wait for patch set 2 :p [21:02:46] RECOVERY - Host professor is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [21:05:17] PROBLEM - carbon-cache.py on professor is CRITICAL: PROCS CRITICAL: 0 processes with command name carbon-cache.py [21:05:17] PROBLEM - profiling collector on professor is CRITICAL: PROCS CRITICAL: 0 processes with command name collector [21:05:37] PROBLEM - profiler-to-carbon on professor is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/udpprofile/sbin/profiler-to-carbon [21:05:55] New review: Hashar; "recheck" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54068 [21:06:05] New patchset: Dzahn; "add krinkle to sudo ALL users on gallium (RT-4735)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54073 [21:07:00] so zuul is locked and I have no idea why :( [21:09:01] !log professor services seem back up 
[21:09:07] Logged the message, RobH [21:09:07] binasher: ^ wsorry been up a few minutes [21:09:11] but seems ok [21:09:38] its collectors are showing off in icinga though [21:10:23] http://integration.mediawiki.org/ is down? [21:10:25] or... something? [21:10:32] !g I84579ad01a10fd9377795862bfb262b635925ca6 [21:10:32] https://gerrit.wikimedia.org/r/#q,I84579ad01a10fd9377795862bfb262b635925ca6,n,z [21:10:43] YuviPanda: use integration.wikimedia.org , will be redirected in a few [21:10:44] New patchset: Dzahn; "bugzilla_report.php: Add query and formatting for list of urgent issues" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53387 [21:10:52] ah, okay [21:11:07] RobH: thanks, all up now [21:11:14] mutante: thanks :) [21:11:16] RECOVERY - profiling collector on professor is OK: PROCS OK: 2 processes with command name collector [21:11:36] RECOVERY - profiler-to-carbon on professor is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/udpprofile/sbin/profiler-to-carbon [21:11:47] New review: Dzahn; "manual verify - zuul issue" [operations/puppet] (production); V: 2 - https://gerrit.wikimedia.org/r/54068 [21:11:57] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54068 [21:12:28] mutante: sorry trying to debug that issue :/ [21:12:45] hashar: np, i just thought let's fix the redirect while youre on that [21:12:50] YuviPanda: ^ [21:13:24] :) ok. [21:13:37] mutante: the wikimedia commons app nightly apk lives on that, so was just asking :) [21:14:12] yes, it moved, change your bookmark https://integration.wikimedia.org/nightly/mobile/android-commons/ :) [21:14:18] from mediawiki to wikimedia [21:14:36] !log restarting Zuul, it is locked somehow. 
Definitely need to update it :( [21:14:42] Logged the message, Master [21:15:04] New patchset: Asher; "udpprofile should be started at boot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54079 [21:15:13] YuviPanda: yeah we broke integration.mediawiki.org earlier today. Will be back soon (c) [21:15:23] :) [21:15:24] ok [21:15:39] mutante: will do :) It's also embedded as a link in the app, so will change that too [21:15:50] mutante: but old urls will (at some point) redirect to new ones, right? [21:15:53] for a while, at least [21:16:01] hashar: YuviPanda, and there you go, redirects [21:16:07] sweeet :) [21:16:16] mutante: thank you! [21:16:21] thanks :) [21:16:24] hashar: just one warning. [warn] NameVirtualHost *:443 has no VirtualHosts [21:16:27] np [21:17:21] hashar: do you mind if i also install package upgrades while on it? apt itself, ganglia-monitor, php-pear, php5-dev and a few libs [21:17:37] New review: Hashar; "Deployed by Daniel. That fixed https://integration.mediawiki.org/nightly/mobile/android-commons/ whi..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54068 [21:17:51] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53873 [21:17:51] mutante: sure :-] [21:18:03] mutante: I don't do package upgrade on the box since I am lacking console access [21:18:05] !log installing package upgrades on gallium [21:18:11] Logged the message, Master [21:18:20] yea, ok [21:18:34] I am going to kill zuul I guess [21:18:39] it's installing new APT gpg keys for ubuntu repo and stuff [21:18:55] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54079 [21:19:10] done [21:20:35] !log hard restarted Zuul. 
It was locked down again :/ [21:20:43] Logged the message, Master [21:21:09] gotta remember that term "hard restart" sounds better than kill :) [21:25:11] !log added new build of mariadb 5.5.30 to precise-wikimedia on brewster [21:25:18] Logged the message, Master [21:27:51] !log downgrading Zuul to safe version ff79197 (aka stop doing `git remote update` on ref-update events) [21:27:57] Logged the message, Master [21:29:17] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 185 seconds [21:29:46] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 198 seconds [21:29:46] !log restarted Zuul with the safe version ff79197 [21:29:52] Logged the message, Master [21:32:45] New patchset: Pyoungmeister; "initial import of flask-login" [operations/debs/flask-login] (master) - https://gerrit.wikimedia.org/r/54083 [21:34:18] PROBLEM - mysqld processes on db59 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [21:35:16] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 3 seconds [21:35:47] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [21:36:16] RECOVERY - mysqld processes on db59 is OK: PROCS OK: 1 process with command name mysqld [21:36:32] !log db59 -> mariadb 5.5.30 [21:36:39] Logged the message, Master [21:38:19] PROBLEM - MySQL Slave Delay on db59 is CRITICAL: CRIT replication delay 368 seconds [21:43:36] andrewbogott: got a sec? [21:43:41] mail or irc clients? [21:43:43] * AaronSchulz lols [21:44:14] notpeter: what's up? [21:44:36] I dumped some python into a repo [21:44:37] https://gerrit.wikimedia.org/r/#/c/54083/ [21:44:41] I want to build a package [21:44:55] but I'd really like a dev to look over it to make sure that it's a reasonable thing to deploy on our cluster [21:44:59] it's small [21:45:15] would you be willing to look over it real quick and give it your blessing? [21:45:21] yep, sure.
also, I'm not sure what should be pulled from there and what shouldn't [21:47:17] notpeter: https://github.com/hychen/dh-make-python [21:47:37] notpeter: So, this means that we are effectively forking that github project, right? [21:48:36] andrewbogott: I just wanted to package what was there [21:48:38] but yes [21:48:54] Yeah, makes sense. I guess it's only a fork if we change anything :) [21:49:10] binasher: thanks! [21:50:11] andrewbogott: asher may have made this all un-needed.... [21:50:16] PROBLEM - MySQL Slave Delay on db59 is CRITICAL: CRIT replication delay 196 seconds [21:51:28] !log aaron synchronized php-1.21wmf11/includes 'deployed 08295adcdc4a4130954f52fb1eb5377638cd97bc' [21:51:34] Logged the message, Master [21:52:31] notpeter: Great! So I will do nothing for the moment...? [21:52:52] andrewbogott: well, I'd still like a sanity check on the code :) [21:54:16] RECOVERY - MySQL Slave Delay on db59 is OK: OK replication delay 30 seconds [21:55:12] notpeter: for security reviews you might want to ask Chris Steipp :-] [21:55:27] hashar: Just saw your log. What are the signs of Zuul being locked down? [21:56:12] siebrand: https://integration.wikimedia.org/zuul/status full of changes having completed jobs and jenkins-bot taking a few minutes to report back to Gerrit. [21:56:34] siebrand: hopefully going to be solved next week if I manage to get Zuul upgraded :-] [21:56:35] hashar: Ah, okay. Thanks. [21:58:36] hashar: true! [21:58:49] csteipp: got a sec?
[21:58:49] everyone's favorite question ;) [21:58:52] notpeter: sure [21:59:38] csteipp: reqest to put this on cluster: https://gerrit.wikimedia.org/r/#/c/54083/ [21:59:51] is this: https://github.com/maxcountryman/flask-login [21:59:59] would love a quick pair of security eyes on it [22:00:01] it's small [22:03:06] siebrand: finally opened a bug about Zuul slowness https://bugzilla.wikimedia.org/show_bug.cgi?id=46176 [22:03:20] !log Zuul : opened a bug about this week slowness, breakage: https://bugzilla.wikimedia.org/show_bug.cgi?id=46176 [22:03:26] Logged the message, Master [22:04:21] and I am out of there :-]  Have a nice weekend [22:12:51] binasher: damn, that totes barfs on what I'm trying to build :( [22:12:55] but thank you for the link! [22:17:07] argh [22:17:12] why did hashar quit just now [22:17:20] i have a real jenkins weirdness report this time [22:17:45] https://gerrit.wikimedia.org/r/#/c/53963/ , PS2: it reports issues on nonexistent lines. those lines are added two commits later (see dependencies). [22:18:59] binasher: well, there are lots of eqiad errors there too [22:19:26] yeah, some are legit [22:20:15] mediawiki does do retries when getting a slave connection, right? [22:20:27] it's like at certain time, there is a server that just drops a bunch of packets for a short while [22:20:31] *at a [22:20:43] binasher: it can try other slaves yes [22:22:03] looks like the per-server attempts is set to 2 [22:22:38] in application logic [22:22:47] binasher: so there are indeed per-server retries, though not many [22:22:48] !log deployed change 54093 to wikitech [22:22:53] howdy good peoples, anyone know if Pete Youngmeister is around? 
[22:22:54] Logged the message, Master [22:23:06] AaronSchulz: a lot of errors i see are clustered specifically to the snapshot slaves for es2 and es3 [22:23:10] !log changed wikitech configuration to require email address on account creation [22:23:16] Logged the message, Master [22:23:28] !log hid domain dropdown for account creation and login on wikitech [22:23:35] Logged the message, Master [22:23:49] AaronSchulz: and there's an xfs kernel bug in the mix :) [22:24:19] ori-l: hey! I have a question for you :) [22:24:25] notpeter: This all looks reasonable. Any real security concerns would be in the frameworks that it uses (werkzeug, flask) which I don't know much about. [22:25:07] andrewbogott: cool! thakn you [22:25:10] ah, notpeter, I was looking for you :) ^^ [22:25:37] milimetric: hi, what's up? [22:25:40] when do you have time to help me puppetize the cron job [22:25:48] andrew otto's off today [22:25:50] ah, ok [22:25:58] now, monday, whenever [22:26:10] now's good [22:26:11] it's just a script and a cron that runs the script, yes? [22:26:22] gotta do something with all that adrenaline Leslie induced :) [22:26:24] heh [22:26:26] yep [22:26:32] should we take this to PM? [22:26:37] oh, sure [22:26:39] not sure how your channel works but I don't wanna intrude [22:26:47] either is fine [22:29:04] notpeter: hey! [22:29:12] hey! [22:29:28] exclamation marks! [22:29:32] so you said that you've been packaging up python modules [22:29:35] I currently need to do this [22:29:48] would you be willing to help me out with this? [22:30:04] sure! i'm terrible at it! [22:30:12] heh [22:30:13] no worries [22:30:53] are there tools yo'uve been using? [22:30:59] packaging for debian is a notch above writing map-reduce implementations in haskell in terms of difficulty. faidon disputes this bitterly but only because he is a debian foundation shill [22:31:23] I generally tend to agree :) [22:31:40] not about faidon, of course. 
I think of him more as a very skilled debian zealot :) [22:31:54] FYI: I'm about to debize oursql and requests 1.1 [22:32:25] So you know, let's not do this painful thing twice if we can avoid it and all. :-) [22:32:36] notpeter: one requirement I had was available upstream in the next ubuntu release, so I used git-import-dsc [22:33:24] which might work for oursql or requests [22:33:28] !log asher synchronized wmf-config/db-eqiad.php 'temporarily pulling es1007, es1010' [22:33:34] Logged the message, Master [22:33:44] yes re: requests: http://packages.ubuntu.com/search?keywords=python-requests [22:35:23] !log rebooted es1007,1010 for kernel upgrades [22:35:28] binasher: grep -P 'Error connecting to \d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}' -o dberror.log | grep -P '\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}' -o | uniq -c | sort -g -r [22:35:30] Logged the message, Master [22:35:51] * AaronSchulz looks at the top boxes [22:36:56] PROBLEM - Host es1007 is DOWN: PING CRITICAL - Packet loss = 100% [22:37:03] Coren / notpeter: faidon's suggestion when there was an upstream package was: "Just import the source into git (man git-import-dsc from git-buildpackage), commit changes (just dch -v 0.9-1 && debcommit hopefully) and see if it builds." [22:37:16] which worked for me [22:37:17] PROBLEM - Host es1010 is DOWN: PING CRITICAL - Packet loss = 100% [22:37:34] AaronSchulz: the top, db1020 is also an lvm snapshot box [22:37:41] ori-l: That seems... remarkably optimistic. :-) [22:37:45] heh [22:38:02] ori-l: ok, cool. I think the one that I'm up against doesn't have any packages anywhere... [22:38:09] so I may be sol [22:38:14] i hope to do away with lvm snapshots [22:38:37] RECOVERY - Host es1010 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [22:38:40] well, second to top [22:38:48] RECOVERY - Host es1007 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [22:39:06] binasher: what problems is that causing? 
[22:39:14] ori-l: milimetric notpeter does 'puppetize everything' mean virtualenv is banned from the cluster? [22:39:37] YuviPanda: depends on the use-case, I guess [22:39:51] well, running stat scripts, in this particular instance. [22:40:00] yeah, i think nothing is banned a priori [22:40:14] AaronSchulz: the top top is a box in tampa that i took down, so bogus [22:40:27] but things have to be requested in the shape of git review to the puppet repo [22:40:33] hmm, virtualenv doesn't let you change anything outside you home dir, so is okay? [22:40:35] (notpeter, digging up my notes about packaging something from scratch..) [22:40:50] AaronSchulz: re problem, depends on the server [22:40:53] yeah, with a virtualenv you don't 'need' to go through the puppet repo, so I suppose that is not considered a good thing? [22:41:18] ori-l: oh, can fumble my way through proper debianization [22:41:26] I was just hoping you had a silver bullet :) [22:41:26] PROBLEM - mysqld processes on es1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [22:41:35] es1007 and es1010 were hitting http://oss.sgi.com/archives/xfs/2012-09/msg00315.html [22:42:03] !seen ^demon [22:42:27] RECOVERY - mysqld processes on es1010 is OK: PROCS OK: 1 process with command name mysqld [22:42:59] * AaronSchulz looks at grep -P 'Error connecting to \d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}' dberror.log | grep -P '\w{3} \d{1,2} \d\d:\d\d' -o | uniq -c [22:43:54] New patchset: Asher; "lower weight of es snapshot hosts" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54095 [22:44:15] notpeter: nope, fumbled through it. I could perhaps spare you some anguish by pointing out two pitfalls: 1) the target distribution should be "-wikimedia" (e.g. 'precise-wikimedia', rather than 'precise'). 2) gerrit hates you and will merge your debian branch onto master, mistaking it for a feature branch, unless you are careful. 
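[Editor's note] The `grep ... | uniq -c | sort` pipelines AaronSchulz is running above have a subtle counting bug: `uniq -c` only merges *adjacent* duplicate lines, so feeding it unsorted IPs undercounts, and the unescaped dots in the IP pattern can match any character. A reworked version under the same assumptions; the sample dberror.log lines are invented, not real log content:

```shell
# Invented sample of dberror.log-style lines.
cat > /tmp/dberror.log <<'EOF'
Mar 15 22:01 Error connecting to 10.0.6.21: timeout
Mar 15 22:02 Error connecting to 10.0.6.48: timeout
Mar 15 22:03 Error connecting to 10.0.6.21: timeout
EOF

# Escape the dots, strip the fixed prefix, and sort BEFORE uniq -c so
# every occurrence of an IP is counted, not just adjacent repeats.
grep -oE 'Error connecting to [0-9]{1,3}(\.[0-9]{1,3}){3}' /tmp/dberror.log \
    | sed 's/^Error connecting to //' \
    | sort | uniq -c | sort -rn > /tmp/counts.txt
```

With GNU grep's `-P` (as in the original commands), `\K` could drop the prefix without the `sed` step; `-E` plus `sed` is used here only for portability.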
[22:44:49] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54095 [22:45:07] or grep -P 'Error connecting to \d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}' dberror.log | grep -P '\w{3} \d{1,2} \d\d' -o | uniq -c [22:45:22] ^ Coren the caveats above might be useful for you as well [22:45:25] definitely spams some hours and is quite others [22:45:36] !log asher synchronized wmf-config/db-eqiad.php 'returning es1007,10 at lower weights' [22:45:42] Logged the message, Master [22:47:02] AaronSchulz: the last facebook build occasionally hangs or crashes under heavy load and has a slow memory leak leading to oom kills and auto restarts from mysqld_safe [22:47:38] AaronSchulz: there are a variety of issues but I think they're generally hidden from users [22:48:03] well every error in the log is an exception for some user [22:48:18] I suppose if it was a bot it would be more ok, heh [22:48:25] ori-l: cool! thank you :) [22:48:35] AaronSchulz: are you sure about that? [22:48:55] ori-l: Noted. :-) [22:48:58] that's how the code works, it logs the exception and then throws it [22:55:30] AaronSchulz: what about LoadBalancer::reallyOpenConnection ? 
New review: CSteipp; "(3 comments)" [operations/debs/flask-login] (master) - https://gerrit.wikimedia.org/r/54083 [22:57:28] yeah, it uses try/catch and getReaderIndex loops through the others [22:57:50] so for slaves, only if they all failed would the user get the error [22:58:05] so that means there can be log entries with no visible errors [22:58:09] New review: Dzahn; "yep, security_map.group_id != 15 looks good, confirmed the group id is 15" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/53387 [22:58:23] maybe the log should be split up then [22:58:52] it might be nice to know about those but they are kind of spam if another slave was able to be reached [22:59:01] AaronSchulz: and if they've all been walked thru and failed [22:59:03] $this->mLastError = 'No working slave server: ' . $this->mLastError; [22:59:21] should that message make it to the log? [22:59:28] as long as the user notices [22:59:35] which they would one way or another [23:00:07] New patchset: Dzahn; "delete ipv6and4.erb nginx site template, ipv6and4.labs has been removed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54098 [23:00:18] "No working slave server" doesn't appear in dberror.log or exception.log [23:00:34] that would be 'return $this->reportConnectionError( $this->mErrorConnection );' in getConnection [23:01:02] so it must not actually be happening [23:02:02] all the user visible ones are in exception.log [23:02:22] which is less spammy in that regard [23:02:31] New patchset: Dzahn; "delete ipv6and4.erb nginx site template and class from protoproxy, ipv6and4.labs has been removed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54098 [23:03:24] though that has its own flood of errors [23:03:33] some of which really should be fixed [23:03:59] AaronSchulz: http://pastebin.mozilla.org/2220590 [23:05:51] db1020, 1021, 1006 [23:06:13] db1017 is the enwiki master, it has one fail in the log [23:10:48] PROBLEM
- Apache HTTP on mw1092 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:10:48] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:11:47] PROBLEM - Apache HTTP on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:11:47] PROBLEM - Apache HTTP on mw1069 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:11:50] PROBLEM - Apache HTTP on mw1094 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:11:50] PROBLEM - Apache HTTP on mw1067 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:11:50] PROBLEM - Apache HTTP on mw1174 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:11:50] PROBLEM - Apache HTTP on mw1084 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:11:50] PROBLEM - Apache HTTP on mw1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:11:59] PROBLEM - Apache HTTP on mw1096 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:11:59] PROBLEM - Apache HTTP on mw1170 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:11:59] PROBLEM - LVS HTTPS IPv4 on wikivoyage-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:12:16] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:12:26] PROBLEM - Apache HTTP on mw1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:12:27] PROBLEM - Apache HTTP on mw1111 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:12:27] PROBLEM - Apache HTTP on mw1065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:12:27] PROBLEM - Apache HTTP on mw1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:12:27] PROBLEM - Apache HTTP on mw1104 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:12:27] PROBLEM - Apache HTTP on mw1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:12:47] PROBLEM - Apache HTTP on mw1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
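[Editor's note: the failover behavior discussed above at 22:57-23:01 (getReaderIndex walking the slave list, with "No working slave server" only surfacing once every slave has failed) can be sketched roughly as below. This is an illustrative sketch, not MediaWiki's actual LoadBalancer code; the function and variable names are made up.]

```python
import logging

log = logging.getLogger("dberror")

def get_reader_connection(replicas, connect):
    """Try each replica in turn; only surface an error if *all* fail.

    Per-replica failures are logged but hidden from the user as long as
    any other replica can be reached -- the "spam" discussed above.
    """
    last_error = None
    for host in replicas:
        try:
            return connect(host)
        except ConnectionError as e:
            # Logged, but not user-visible: another replica may still work.
            log.warning("replica %s failed: %s", host, e)
            last_error = e
    # Only once the whole list has been walked through and failed does a
    # user-visible error ("No working slave server: ...") get raised.
    raise RuntimeError(f"No working slave server: {last_error}")
```

This matches the observation in the log that the aggregated message belongs in exception.log (user-visible) while individual connection failures land in the spammier dberror.log.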
[23:12:47] PROBLEM - Apache HTTP on mw1178 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:12:47] PROBLEM - Apache HTTP on mw1066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:12:47] PROBLEM - Apache HTTP on mw1171 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:12:47] PROBLEM - Apache HTTP on mw1163 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:13:56] PROBLEM - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:14:14] wee [23:14:15] wee! [23:14:40] notpeter: did you break everything? [23:14:45] :) [23:14:51] PROBLEM - Apache HTTP on mw1188 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:14:52] PROBLEM - Apache HTTP on mw1050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:14:52] PROBLEM - Apache HTTP on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:14:52] PROBLEM - Apache HTTP on mw1162 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:14:52] PROBLEM - Apache HTTP on mw1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:14:52] PROBLEM - Apache HTTP on mw1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:14:52] PROBLEM - Apache HTTP on mw1058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:14:53] PROBLEM - Apache HTTP on mw1183 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:14:53] PROBLEM - Apache HTTP on mw1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:14:53] PROBLEM - Apache HTTP on mw1179 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:14:54] PROBLEM - Apache HTTP on mw1110 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:14:55] PROBLEM - Apache HTTP on mw1049 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:14:55] PROBLEM - Apache HTTP on mw1108 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:14:55] PROBLEM - Apache HTTP on mw1098 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
[23:14:56] PROBLEM - Apache HTTP on mw1041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:14:57] PROBLEM - Apache HTTP on mw1088 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:14:58] AaronSchulz: I need to fail search! [23:15:02] PROBLEM - Apache HTTP on mw1182 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:15:02] PROBLEM - Apache HTTP on mw1176 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:15:02] PROBLEM - Apache HTTP on mw1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:15:02] PROBLEM - Apache HTTP on mw1091 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:15:56] PROBLEM - Apache HTTP on mw1180 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:16:45] PROBLEM - MySQL Slave Delay on db1030 is CRITICAL: CRIT replication delay 183 seconds [23:18:52] RECOVERY - Apache HTTP on mw1180 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.709 second response time [23:18:52] RECOVERY - Apache HTTP on mw1183 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.846 second response time [23:19:02] RECOVERY - Apache HTTP on mw1179 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.955 second response time [23:19:42] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.542 second response time [23:19:42] RECOVERY - Apache HTTP on mw1188 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.799 second response time [23:19:42] RECOVERY - Apache HTTP on mw1031 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.483 second response time [23:19:51] RECOVERY - Apache HTTP on mw1087 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.049 second response time [23:19:52] RECOVERY - Apache HTTP on mw1058 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 4.770 second response time [23:19:52] RECOVERY - Apache HTTP on mw1169 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 
bytes in 0.069 second response time [23:19:52] RECOVERY - Apache HTTP on mw1050 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 5.086 second response time [23:19:52] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.349 second response time [23:19:52] RECOVERY - Apache HTTP on mw1090 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.989 second response time [23:19:52] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.073 second response time [23:19:52] RECOVERY - Apache HTTP on mw1032 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.090 second response time [23:19:53] RECOVERY - Apache HTTP on mw1164 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.997 second response time [23:19:54] RECOVERY - Apache HTTP on mw1049 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 4.576 second response time [23:19:54] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.776 second response time [23:19:55] RECOVERY - Apache HTTP on mw1176 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.432 second response time [23:19:56] RECOVERY - Apache HTTP on mw1044 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.089 second response time [23:19:56] RECOVERY - Apache HTTP on mw1060 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.848 second response time [23:19:57] RECOVERY - Apache HTTP on mw1089 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.975 second response time [23:19:57] RECOVERY - Apache HTTP on mw1166 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.781 second response time [23:19:57] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.781 second response time [23:19:57] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 
bytes in 9.906 second response time [23:20:03] RECOVERY - Apache HTTP on mw1079 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.360 second response time [23:20:51] RECOVERY - Apache HTTP on mw1098 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.788 second response time [23:20:51] RECOVERY - Apache HTTP on mw1088 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.803 second response time [23:20:51] RECOVERY - Apache HTTP on mw1168 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.150 second response time [23:20:51] RECOVERY - Apache HTTP on mw1041 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 4.073 second response time [23:20:52] RECOVERY - Apache HTTP on mw1054 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 4.729 second response time [23:20:52] RECOVERY - Apache HTTP on mw1174 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 5.414 second response time [23:20:52] RECOVERY - Apache HTTP on mw1062 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.786 second response time [23:20:52] RECOVERY - Apache HTTP on mw1043 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.189 second response time [23:20:53] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.049 second response time [23:20:53] RECOVERY - Apache HTTP on mw1056 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.050 second response time [23:20:54] RECOVERY - Apache HTTP on mw1083 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.863 second response time [23:21:02] RECOVERY - Apache HTTP on mw1080 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.799 second response time [23:21:02] RECOVERY - Apache HTTP on mw1091 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 5.043 second response time [23:21:02] RECOVERY - Apache HTTP on mw1030 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 
bytes in 8.832 second response time [23:21:42] RECOVERY - Apache HTTP on mw1046 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.277 second response time [23:21:52] RECOVERY - Apache HTTP on mw1111 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.415 second response time [23:21:52] RECOVERY - Apache HTTP on mw1061 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.401 second response time [23:21:52] RECOVERY - Apache HTTP on mw1177 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.331 second response time [23:21:52] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 4.190 second response time [23:21:52] RECOVERY - Apache HTTP on mw1037 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 4.227 second response time [23:21:52] RECOVERY - Apache HTTP on mw1113 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 5.121 second response time [23:21:53] RECOVERY - Apache HTTP on mw1028 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 5.212 second response time [23:21:53] RECOVERY - Apache HTTP on mw1067 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.909 second response time [23:21:54] RECOVERY - Apache HTTP on mw1175 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.042 second response time [23:21:54] RECOVERY - Apache HTTP on mw1052 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.119 second response time [23:21:55] RECOVERY - LVS HTTPS IPv4 on wikivoyage-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 757 bytes in 0.202 second response time [23:21:55] RECOVERY - Apache HTTP on mw1182 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.907 second response time [23:21:56] RECOVERY - Apache HTTP on mw1105 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 5.345 second response time [23:21:57] RECOVERY - Apache HTTP on mw1068 is OK: HTTP OK: HTTP/1.1 
301 Moved Permanently - 747 bytes in 9.956 second response time [23:22:52] RECOVERY - Apache HTTP on mw1178 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.067 second response time [23:22:52] RECOVERY - Apache HTTP on mw1051 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.064 second response time [23:22:52] RECOVERY - Apache HTTP on mw1102 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.246 second response time [23:22:52] RECOVERY - Apache HTTP on mw1042 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.155 second response time [23:22:52] RECOVERY - Apache HTTP on mw1025 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.992 second response time [23:22:52] RECOVERY - Apache HTTP on mw1063 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [23:22:53] RECOVERY - Apache HTTP on mw1103 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [23:22:54] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.807 second response time [23:22:54] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.696 second response time [23:22:54] RECOVERY - Apache HTTP on mw1185 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.841 second response time [23:22:54] RECOVERY - Apache HTTP on mw1070 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.224 second response time [23:22:55] RECOVERY - Apache HTTP on mw1059 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.488 second response time [23:22:56] RECOVERY - Apache HTTP on mw1110 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.291 second response time [23:22:56] RECOVERY - Apache HTTP on mw1034 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.319 second response time [23:23:01] RECOVERY - Apache HTTP on mw1112 is OK: HTTP OK: HTTP/1.1 301 
Moved Permanently - 747 bytes in 6.738 second response time [23:23:01] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.957 second response time [23:23:51] RECOVERY - Apache HTTP on mw1076 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 4.160 second response time [23:23:52] RECOVERY - Apache HTTP on mw1167 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.074 second response time [23:23:52] RECOVERY - Apache HTTP on mw1092 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.218 second response time [23:23:52] RECOVERY - Apache HTTP on mw1071 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.588 second response time [23:23:52] RECOVERY - Apache HTTP on mw1036 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.379 second response time [23:23:52] RECOVERY - Apache HTTP on mw1161 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.500 second response time [23:23:56] !log laner synchronized wmf-config/CommonSettings.php 'Disabling aft5' [23:24:02] RECOVERY - Apache HTTP on mw1066 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.683 second response time [23:24:03] RECOVERY - Apache HTTP on mw1184 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.714 second response time [23:24:03] Logged the message, Master [23:24:03] RECOVERY - Apache HTTP on mw1170 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.391 second response time [23:24:42] PROBLEM - MySQL Replication Heartbeat on db64 is CRITICAL: CRIT replication delay 202 seconds [23:24:52] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.006 second response time [23:24:52] RECOVERY - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 62797 bytes in 0.207 second response time [23:25:00] RECOVERY - Apache HTTP on mw1108 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 
7.140 second response time [23:25:00] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.492 second response time [23:25:01] RECOVERY - Apache HTTP on mw1073 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.666 second response time [23:25:01] RECOVERY - Apache HTTP on mw1187 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.710 second response time [23:25:03] PROBLEM - MySQL Slave Delay on db64 is CRITICAL: CRIT replication delay 183 seconds [23:25:52] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 5.322 second response time [23:25:52] PROBLEM - Apache HTTP on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:26:02] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:26:03] RECOVERY - Apache HTTP on mw1040 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.411 second response time [23:26:03] RECOVERY - MySQL Slave Delay on db64 is OK: OK replication delay 0 seconds [23:26:42] RECOVERY - MySQL Replication Heartbeat on db64 is OK: OK replication delay 0 seconds [23:26:57] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.180 second response time [23:26:57] RECOVERY - Apache HTTP on mw1038 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 5.799 second response time [23:26:57] RECOVERY - Apache HTTP on mw1029 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.257 second response time [23:26:57] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.898 second response time [23:27:01] PROBLEM - Apache HTTP on mw1170 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:27:51] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.210 second response time [23:27:52] RECOVERY - Apache HTTP on mw1094 is OK: HTTP OK: HTTP/1.1 
301 Moved Permanently - 747 bytes in 7.017 second response time [23:28:01] RECOVERY - Apache HTTP on mw1170 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.408 second response time [23:28:31] notpeter: would it be possible to have a hume counterpart in eqiad? [23:28:44] RECOVERY - MySQL Slave Delay on db1030 is OK: OK replication delay seconds [23:28:44] RECOVERY - Apache HTTP on mw1100 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.051 second response time [23:28:44] RECOVERY - Apache HTTP on mw1047 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.048 second response time [23:28:44] RECOVERY - Apache HTTP on mw1162 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.054 second response time [23:28:44] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.064 second response time [23:29:52] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 181 seconds [23:30:05] AaronSchulz: take a look at asher@fenari:/home/asher/ [23:30:11] db1030-proclist [23:30:42] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 196 seconds [23:32:54] New patchset: Ryan Lane; "Disabling aft5" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54101 [23:32:59] binasher: ^^ [23:33:17] New review: Asher; "+10000000000" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/54101 [23:33:17] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54101 [23:34:13] binasher: I did: git reset --hard c6c37d407874741c9163571cd013153d66f1ea45 [23:37:44] New patchset: Milimetric; "stat1 cron job that emails out aggregate pageviews" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54102 [23:49:06] New review: MZMcBride; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54101 [23:49:34] * Susan eyes the channel. 
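[Editor's note: the back-out described above at 23:34 ("git reset --hard c6c37d4...") is the standard pattern of resetting a config checkout to a known-good revision. A self-contained sketch follows; the repository, file name, and file contents here are made up for illustration, not the actual wmf-config deploy procedure.]

```shell
# Sketch: back out a bad config commit by resetting to a known-good sha.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "ops@example.org"
git config user.name "ops"

echo 'aft5 disabled (known-good)' > CommonSettings.php
git add CommonSettings.php
git commit -qm 'known-good config'
good=$(git rev-parse HEAD)       # record the good sha before deploying

echo 'aft5 enabled' > CommonSettings.php
git commit -qam 'enable aft5'    # the change being backed out

git reset --hard -q "$good"      # discard the bad commit entirely
grep -q 'known-good' CommonSettings.php   # working tree is good again
```

Note that `reset --hard` rewrites both HEAD and the working tree, so anything synced from this checkout afterwards reflects the known-good state.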
[23:49:38] Beware the ides of March. [23:50:13] Susan, apparently even the cluster doesn't like aft :( [23:50:26] Heh. :-( [23:50:59] I was just to skim scrollback. It's unclear from just looking at that changeset, what's going on. [23:51:39] That whole message was a sin against grammar. about to * [23:54:04] Matthias is trying to move AFT5 to the new refactored backend & dedicated DB backend, I'm guessing that this caused the brief outage earlier, but Ryan_Lane and binasher would know more. [23:54:24] s/DB backend/DB cluster/ [23:54:36] Okay, yeah, I just looked at scrollback in here and didn't see much off-hand. [23:55:08] Isn't it like Friday at five o'clock there? Odd deployment time. ;-) [23:55:53] i'm writing an email about the incident [23:56:06] i thought the AFT5 deploy was actually backed out today [23:56:41] well, now it is, but I thought mlitn backed it out [23:56:48] it's listed on the deployment calendar for 9am-1pm pacific [23:56:49] Eloquence: Not that I need more e-mail, but is there a policy for being on engineering-l? [23:57:27] Susan: The policy is, if you're too awesome, you can't join. So sorry. [23:57:32] :D [23:58:04] New patchset: Milimetric; "stat1 cron job that emails out aggregate pageviews" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54102 [23:59:15] oh wow, I'm just catching up - did I break something? [23:59:38] apparently everything.