[00:00:51] our python linter must suck [00:01:01] New patchset: Ryan Lane; "Report on repo dependencies when reporting on repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43375 [00:01:06] because it just verified something with two syntax errors [00:01:21] Is it non-voting, perhaps? [00:01:25] there isn't one [00:01:28] it seems [00:01:29] Oh hah [00:01:49] flake8 ftw [00:02:24] ah. it also doesn't see this as a python file [00:02:33] becase it doesn't end in .py [00:02:40] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43375 [00:02:47] otherwise pep8 probably would have caught it [00:03:46] Damianz: flake8 is nifty, thanks [00:04:47] flake8 --exclude=migrations,.git --ignore=E501,E225,E121,E123,E124,E125,E127,E128,W404 --exit-zero in your MakeFile is useful [00:05:18] what's a makefile? [00:05:18] :) [00:07:06] * Damianz looks at Ryan_Lane with that old school autoconf look.... hipster cmake kids these days [00:07:15] :D [00:07:45] uuuuggghhhh [00:07:53] New patchset: Brion VIBBER; "Update FirefoxOS app submodule" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43377 [00:08:20] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [00:09:04] I should really just use puppetmaster::self [00:09:27] New patchset: Ryan Lane; "Use proper variable reference" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43379 [00:09:56] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43379 [00:10:15] or at least pull from the fake branch rather than merging [00:10:30] I tested it on beta first this time [00:11:08] it would be interesting to get jenkins to update beta via git-deploy via a peer runner [00:11:13] ok. now the report shows which minions haven't finished yet, in each stage [00:11:20] and if a repo has dependencies, it shows their state too [00:11:28] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43377 [00:11:39] that's doable [00:12:12] one of the problems is that they are in different salt clusters [00:12:56] Imo doing jenkins <> labs api <> salt api <> salt <> beta would be acceptable... assuming we make an api sometime before never [00:13:09] heh [00:13:15] yes, that would also work [00:13:25] or jenkins -> salt-api [00:13:29] why the extra step? [00:13:57] assuming salt-api will support groups and multiple auth backends that's do-able [00:14:01] yep [00:14:05] it would just need keystone [00:14:13] give jenkins an openstack user [00:14:17] assign it to the project [00:14:18] done [00:14:28] for that usage it has keystone... for osm integration keystone needs w0rk.... acl wise it needs work anyway [00:14:51] yep [00:15:06] this is my end-goal for salt/openstack integration [00:16:38] Damianz, are you still actively working on salt/keystone or should I take on some of that? 
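The flake8 invocation quoted above uses real flake8 options (--exclude, --ignore, --exit-zero). Below is a minimal sketch of wrapping that same check in a script a CI job could vote with, assuming flake8 is installed; the flag set is copied from the conversation, and passing explicit file paths addresses the "doesn't end in .py" problem mentioned above. This is not the actual Wikimedia lint job.

    import subprocess
    import sys

    # Flags copied from the chat above; --exit-zero is deliberately dropped so a
    # voting CI job actually fails when flake8 complains.
    FLAKE8_CMD = [
        "flake8",
        "--exclude=migrations,.git",
        "--ignore=E501,E225,E121,E123,E124,E125,E127,E128,W404",
    ]

    def lint(paths):
        # Passing explicit file paths also covers scripts that lack a .py suffix,
        # which is the case that slipped past the linter above.
        return subprocess.call(FLAKE8_CMD + list(paths))

    if __name__ == "__main__":
        sys.exit(lint(sys.argv[1:] or ["."]))

In a Makefile, the same command would just sit behind a lint target, which is all the suggestion above amounts to.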
[00:16:51] Magic would be jenkins making vms, deploying code, testing and then destroying via salt heh [00:16:55] I haven't been following your progress but I stand at the ready if there are things I can do [00:17:14] andrewbogott: he started it some [00:17:18] andrewbogott: I've not really touched it tbh - there's a backend for basic password auth [00:17:26] he made a keystone external_auth plugin [00:17:33] but it only does basic keyston auth [00:17:34] Need a backend for token auth, a class for osm writing (maybe one that uses exec for now) etc [00:17:41] it doesn't fetch tokens or validate them [00:18:04] Damianz: osm writing? [00:18:06] I'm hoping to work on acls in general for saltstack for a work project, so might re-vist tokens then or duuno - feel free to jump in [00:18:26] Ryan_Lane: well it should have a class that abstracts salt-api to osm... I think [00:18:29] like it does nova [00:18:30] ok. Is your work merged into salt or on github someplace or just on your laptop? [00:18:34] oh [00:18:40] you mean osm -> salt-api [00:18:45] yeah [00:18:49] andrewbogott: it's merged upstream [00:18:54] andrewbogott: Module is merged in dev branch [00:18:59] ah, great. [00:19:07] I will try to catch up :) [00:19:13] password module isn't a bad start [00:19:21] but token would be best :) [00:20:15] ok. let's see how reporting is working for repo dependencies :) [00:20:21] token is only useful if you're using it behind something ;) [00:20:26] indeed [00:20:43] without group support, meh [00:20:52] heh [00:21:10] badly need group support - gonna sugest a change to eauth modules... and probably get rejected [00:21:30] hm. I think i have too many apache processes running on tin [00:21:52] it becomes pretty damn slow when I tell everything to fetch [00:23:11] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 195 seconds [00:23:39] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 203 seconds [00:33:28] !log awjrichards synchronized php-1.21wmf7/extensions/MobileFrontend/ [00:33:40] Logged the message, Master [00:34:07] !log awjrichards synchronized php-1.21wmf6/extensions/MobileFrontend/ [00:34:13] RECOVERY - MySQL Slave Running on db59 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [00:34:17] Logged the message, Master [00:36:49] New patchset: Asher; "setting db59 weight to 0 while replication catches up" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43382 [00:37:02] New patchset: awjrichards; "Enable forcing https for mobile login account/creation everywhere" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43383 [00:37:04] PROBLEM - MySQL Slave Delay on db59 is CRITICAL: CRIT replication delay 181135 seconds [00:37:20] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43382 [00:37:48] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43383 [00:38:07] !log asher synchronized wmf-config/db.php 'setting db59 weight to zero' [00:38:18] Logged the message, Master [00:39:36] !log awjrichards synchronized wmf-config/InitialiseSettings.php 'Force https for mobile login and account creation everywhere' [00:39:47] Logged the message, Master [00:40:49] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [00:46:04] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.001 second response time on port 11000 [00:48:46] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 3 
seconds [00:48:55] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 3 seconds [00:49:12] New patchset: Ryan Lane; "Display concise information by default" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43385 [00:49:23] reporting is hard [00:51:01] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43385 [00:57:45] New patchset: Ryan Lane; "Fix obvious syntax error (we need linting :( )" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43386 [00:58:43] anyone have questions for [[Theodore_Ts'o]]? ;-) [00:58:50] he's about 10 feet away [00:59:21] awjr: wooot, re that last sync [00:59:28] jeremyb: :D [01:03:57] New patchset: Ryan Lane; "Fix obvious syntax error (we need linting :( )" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43386 [01:08:50] is it OK if I sync-file a small EventLogging change? It improves event logging for mobile who just wrapped their deploy [01:11:48] spagewmf: mobile is done with it's deployment [01:11:53] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43386 [01:12:23] anyone have objections? [01:12:37] go for it [01:12:42] Ryan_Lane: still making progress? [01:12:52] yes [01:12:56] I broke some stuff [01:13:00] fixing it now [01:13:46] heh [01:13:58] the system breaking in the middle of a deployment is bad news [01:14:11] also the way I'm reporting on fetches isn't working :( [01:14:48] at minimum not when re-pushing the same thing [01:21:26] New patchset: Ryan Lane; "Also pull tags for submodules on fetch" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43388 [01:21:32] ^^ glad I found this bug before deployment and not during :) [01:21:55] !log spage synchronized php-1.21wmf7/extensions/EventLogging/EventLogging.hooks.php 'Log field fix for mobile' [01:22:08] Logged the message, Master [01:22:43] New patchset: Ryan Lane; "Also pull tags for submodules on fetch" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43388 [01:25:33] New patchset: Ryan Lane; "Also pull tags for submodules on fetch" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43388 [01:26:14] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43388 [01:29:55] New review: Reedy; "Guess I should note here that I'm removing mwversionsinuse as puppet provides an update version of it" [operations/debs/wikimedia-task-appserver] (master) C: 0; - https://gerrit.wikimedia.org/r/43356 [01:33:33] !log spage synchronized php-1.21wmf6/extensions/EventLogging/EventLogging.hooks.php 'Log field fix for mobile' [01:33:44] Logged the message, Master [01:34:31] and I am done 2660 enwiki {"token":"","userId":18235977,"userName":"ACUX mobile test 2","isSelfMade":true,"userBuckets":"","displayMobile":true," <-- !! [01:34:45] !log reedy synchronized php-1.21wmf7/extensions/ProofreadPage [01:34:51] thanks y'all [01:34:55] Logged the message, Master [01:35:18] !log Ran namespaceDupes.php on cswikisource [01:35:29] Logged the message, Master [01:35:49] Reedy, is your "deployment window" 167 hours wide? 
:) [01:48:41] New patchset: Dzahn; "replace SSH key for mflaschen - RT-4114" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43392 [01:48:58] New patchset: Ryan Lane; "git fetch and git fetch --tags is necessary" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43393 [01:50:09] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43393 [01:50:50] I'm learning all kinds of things I didn't care to know about git [01:52:03] ok. time to add the localization script in \o/ [01:52:19] New review: Dzahn; "gpg: Signature made Thu 10 Jan 2013 05:03:19 PM PST using DSA key ID 3BBDED59" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/43392 [01:52:20] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43392 [02:12:36] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:00] !log LocalisationUpdate completed (1.21wmf7) at Fri Jan 11 02:26:00 UTC 2013 [02:26:12] Logged the message, Master [02:32:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:33:06] !log killed convert processes on srv222 [02:33:16] Logged the message, Master [02:34:04] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.061 second response time [02:36:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.051 seconds [02:41:21] Hey anyone, should "udp://$wmfUdp2logDest/eventlogging; on a production server wind up in fluorine:/a/mw-log/eventlogging ? [02:43:45] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:44:39] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:45:24] RECOVERY - LVS HTTPS IPv4 on wikimedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 93078 bytes in 0.501 seconds [02:46:10] PROBLEM - LVS HTTP IPv6 on wikiversity-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:46:18] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 93078 bytes in 1.235 seconds [02:47:48] RECOVERY - LVS HTTP IPv6 on wikiversity-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 64068 bytes in 0.140 seconds [02:49:46] !log LocalisationUpdate completed (1.21wmf6) at Fri Jan 11 02:49:45 UTC 2013 [02:49:56] Logged the message, Master [02:52:54] !log killed convert processes on all image scalers [02:53:05] Logged the message, Master [02:53:14] bad gateway for enwiki [02:53:19] (502) [02:55:08] Site seems slow. [02:55:29] I got one load without CSS (maybe bits?). [02:55:38] Seems better now. [02:55:58] Or not. [02:56:15] yeah it's laggy [02:56:23] packet loss? [02:56:24] interestingly no messages in this channel about LVS [02:56:48] PROBLEM - SSH on lvs1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:57:04] TimStarling: https://en.wikipedia.org/wiki/Main_Page browse around while logged in? [02:57:14] Error 101 (net::ERR_CONNECTION_RESET): The connection was reset. [02:57:43] Just seems to hang intermittently. 
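On the udp2log question earlier in this stretch ("should udp://$wmfUdp2logDest/eventlogging ... wind up in fluorine:/a/mw-log/eventlogging?"): such a destination works by sending each log line as a UDP datagram tagged with a stream name, which the collector demultiplexes into per-stream files. A minimal sketch of that idea follows; the host, port and prefix behaviour here are assumptions for illustration, not the production configuration.

    import socket

    def udp_log(line, stream="eventlogging", host="127.0.0.1", port=8420):
        # One datagram per line, prefixed with the stream name so a collector
        # can route it to a per-stream file. Host and port are placeholders.
        payload = "{0} {1}\n".format(stream, line.rstrip("\n"))
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            sock.sendto(payload.encode("utf-8"), (host, port))
        finally:
            sock.close()

    udp_log("test event from a sketch")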
[02:58:01] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:58:27] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [02:59:40] RECOVERY - LVS HTTPS IPv4 on wikimedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 93078 bytes in 0.908 seconds [03:08:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:22:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.033 seconds [03:28:45] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [03:36:46] Ryan_Lane- Well, I've hung around late enough. Time for me to go to bed. [03:37:13] anomie: night [03:37:22] I think I'll be able to get it done soonish [03:37:59] If anything comes up that I should look at, send me an email or something. [03:38:27] will do [03:54:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:57:56] New patchset: Ryan Lane; "Add a retry option for fetch and checkout stages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43396 [04:01:12] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43396 [04:05:32] New patchset: Ryan Lane; "Prefer the deploy-info check to the runner output" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43397 [04:05:59] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43397 [04:06:28] I found a problem with today's E3 deployment. ori-l only ran sync-dir of the EventLogging extension in the 1.21wmf7 branch, so my later sync-file of one file to both wmf7 and wmf6 left wmf6 in an inconsistent state. The fix is to sync-dir of EventLogging in the 1.21wmf6 branch. Please sirs, may I? [04:07:11] you guys rolled back the earlier deployment, or things are broken right now? [04:07:47] if it's broken, then yes, fix it. make sure you aren't going into anyone's deploy window [04:08:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [04:09:01] Ryan_Lane No roll back, I just diagnosed this as the problem with mobile account creations on wikis still running wmf6. [04:09:17] if it's broken, yes fix it [04:10:16] Ryan_Lane. thanks, will do and I'll monitor it. Our other extensions are enwiki-only, thus wmf7 only, but EventLogging is more widespread. [04:10:25] <3 new deployment system [04:16:39] !log spage synchronized php-1.21wmf6/extensions/EventLogging 'EventLogging meant to be on 1.21wmf6 wikis as well' [04:16:49] Logged the message, Master [04:19:25] ... and now account creation on wmf6 produces frwiki {"token":"","userId":1464258,"userName":"S Page ACUX GS test 0110-8", "displayMobile":true, ..._valid: "true"} yay! [04:19:39] spagewmf: ... [04:27:56] TimStarling: rather than the cron on the deployment host pushing out the localization update, would it make sense to have the apaches pull it via a cron instead? 
[04:28:22] it's somewhat difficult to figure out when the fetch stage is properly done from an automated process on the deployment host [04:28:46] especially if some hosts aren't working properly [04:29:30] if, instead, the cron simply does a git deploy start/sync that doesn't push, it can be picked up by the client [04:29:49] though I guess that's problematic if times are off on the apaches [04:29:58] since they'd get messages at differenttimes [04:30:48] meh. I can push it. I'll move to the checkout stage if all report back, or if a timeout occurs [04:31:50] it can report to the log with statistics [04:35:58] New patchset: Aaron Schulz; "Set $wgMaxBacklinksInvalidate." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43398 [04:37:14] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43398 [04:38:29] !log aaron synchronized wmf-config/CommonSettings.php 'Set $wgMaxBacklinksInvalidate.' [04:38:40] Logged the message, Master [04:41:28] New patchset: Ryan Lane; "Use cmd.retcode for git clone" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43399 [04:59:48] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [04:59:49] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [05:04:00] New patchset: Ryan Lane; "Adding support for logging messages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43400 [05:07:59] New patchset: Ryan Lane; "Use cmd.retcode for git clone" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43399 [05:08:05] New patchset: Ryan Lane; "Adding support for logging messages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43400 [05:19:23] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43399 [05:19:38] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43400 [05:25:00] New patchset: Ryan Lane; "Import the correct module, this time" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43401 [05:27:34] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43401 [05:32:25] New patchset: Ryan Lane; "Add back in accidentally removed functions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43404 [05:33:15] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43404 [05:38:03] !log aaron synchronized php-1.21wmf7/includes/DefaultSettings.php 'deployed 9709a7d7e423c21e38a0436faa233ac0dcf04226 ' [05:38:14] Logged the message, Master [05:38:44] !log aaron synchronized php-1.21wmf7/includes/cache/HTMLCacheUpdate.php 'deployed 9709a7d7e423c21e38a0436faa233ac0dcf04226 ' [05:38:48] Logged the message, Master [05:43:30] New patchset: Ryan Lane; "Don't specify a cwd with git clone" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43408 [05:44:54] !log aaron synchronized php-1.21wmf7/includes/job/jobs/HTMLCacheUpdateJob.php 'deployed f0f528e254346f04d4afdb85b439cd20c0df93fe' [05:44:56] * Aaron|home wonders if Ryan has any midnight oil left [05:45:01] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43408 [05:45:06] Logged the message, Master [05:45:12] I'm honestly getting to the point where I don't think I do [05:46:21] just watch for the stupid mistake hockey stick graph [05:46:28] oh, I'm there [05:46:39] my last like 8 
commits were fucked up [05:47:58] stupid deployment host isn't on a public IP, so I can't run ircecho on it [05:48:14] now I'm going to need to modify ircecho to pop messages from a queue in redis [05:48:28] I've wanted to do this for a while anyway, but I didn't want to do it right now :D [05:48:46] * Aaron|home is still trying to understand how all the salt stuff comes together [05:48:59] there's only three parts to that system [05:49:05] deployment host [05:49:08] salt master [05:49:10] minions [05:49:31] well puppet doesn't look quite that simple :) [05:49:55] (the stuff in puppet) [05:50:09] well, that's because I have to configure salt with puppet [05:50:20] which is somewhat silly, but whatever [05:50:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:51:19] I'm using a number of components of the salt system: runners, modules, returners, pillars, grains, and the peer system [05:51:40] the deployment host makes a peer call to run a runner on the master [05:52:06] the runner calls fetch or checkout on the minions, specifying that they should return their data to the deploy_redis returner [05:52:31] configuration is stored in pillars (it's basically global configuration) [05:53:03] grains are static data that exists on each minion. it's like facts in puppet [05:53:29] I only use that to determine which site a minion in [05:54:12] * Aaron|home has been reading salt docs [05:54:38] I'm looking forward to using reactors :) [05:55:10] take actions anywhere based on events from anywhere? yes please [06:03:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.392 seconds [06:08:23] New patchset: Tim Starling; "Comment out db59" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43409 [06:08:51] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43409 [06:09:58] !log tstarling synchronized wmf-config/db.php 'depool db59' [06:10:11] Logged the message, Master [06:21:52] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [06:21:52] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [06:21:52] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [06:21:52] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [06:21:52] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [06:36:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:47:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.236 seconds [07:23:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:25:36] New patchset: Ryan Lane; "Add localization update as the l10n dep script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43410 [07:28:39] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43410 [07:39:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.052 seconds [07:54:53] New patchset: Ryan Lane; "Add parameters after the file check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43412 [08:05:03] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43412 
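Ryan's walkthrough above (runners call fetch/checkout on the minions, results flow back through a deploy_redis returner, and the deployment host waits for everyone or a timeout) maps onto two small pieces. A sketch of both follows, assuming the standard Salt returner interface (a module-level returner(ret) function) and a local Redis; the key names, helper names and timeouts are made up for illustration and are not the actual WMF modules.

    import json
    import time

    import redis

    def _conn():
        # Assumed Redis endpoint; a real returner would read this from config.
        return redis.StrictRedis(host="localhost", port=6379, db=0)

    def returner(ret):
        # Salt invokes this with the minion id, jid, function name and return
        # value once a job finishes on a minion.
        serv = _conn()
        serv.set("deploy:{0}:{1}".format(ret["jid"], ret["id"]),
                 json.dumps(ret["return"]))
        # Track which minions have reported for this jid.
        serv.sadd("deploy:{0}:minions".format(ret["jid"]), ret["id"])

    def wait_for_minions(jid, expected, timeout=120, poll_interval=5):
        # Deployment-host side: block until every targeted minion has reported
        # its fetch, or until the timeout, then return the stragglers.
        serv = _conn()
        deadline = time.time() + timeout
        missing = set(expected)
        while missing and time.time() < deadline:
            reported = serv.smembers("deploy:{0}:minions".format(jid))
            missing = set(expected) - {m.decode("utf-8") for m in reported}
            if missing:
                time.sleep(poll_interval)
        return sorted(missing)  # empty means it is safe to move to checkout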
[08:11:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:25:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.072 seconds [08:30:58] can someone review https://gerrit.wikimedia.org/r/#/c/43218/ please? [08:35:57] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 186 seconds [08:37:00] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 224 seconds [08:42:10] MaxSem: is it clean-killlist or clear-killlist? [08:42:45] fixing [08:44:37] renamed it at the very last moment:( [08:44:38] New patchset: MaxSem; "Cronjobs for GeoData" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43218 [08:45:19] not again... [08:45:42] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [08:46:11] New patchset: MaxSem; "Cronjobs for GeoData" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43218 [08:46:27] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [08:47:40] ok [08:47:52] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43218 [08:48:15] thanks a lot, apergos [08:48:38] lemme merge on sockpuppet first [08:49:12] running puppet on hume now [08:50:44] looks good [08:53:10] morning [08:54:11] yes it is [08:55:52] localization cache is going to suck [08:56:02] just saying [08:56:15] I'm looking at ways to make it suck less [09:04:45] New review: ArielGlenn; "I don't know why I'm on this list but sure, seems fine to me. I think Jeff Green might have been la..." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/42908 [09:07:18] New patchset: Hashar; "move PHP linter to a new `wmfscripts` module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29937 [09:07:31] New review: Hashar; "rebased" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/29937 [09:15:53] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 192 seconds [09:16:27] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 208 seconds [09:17:47] New patchset: Hashar; "wikimedia module placeholder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43420 [09:17:47] New patchset: Hashar; "move PHP linter under `wikimedia` module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29937 [09:18:54] New review: Hashar; "rebased on top of https://gerrit.wikimedia.org/r/#/c/43420/ which creates the wikimedia puppet module." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/29937 [09:23:30] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [09:26:29] cool [09:26:37] we should get more opinions on this [09:26:44] maybe mark [09:29:54] paravoid: wikimedia module ? 
:-D [09:30:10] I am going to move the manifests/misc/contint.pp pile of mess under it too [09:34:05] yeah [09:35:24] ideally wee would want to move everything as modules [09:35:34] then we can fully take advantages of puppet autoloading system [09:35:40] that might make puppet run a bit faster [09:35:52] and will definitely let us start writing rspec integration tests [09:37:09] oh yes, this is the goal [09:37:09] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [09:37:19] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [09:37:24] the wikimedia/ module is that I'd like to think about a bit [09:37:41] I know it's my idea but as I said yesterday, I haven't realy thought about it thoroughly [09:38:25] be bold, think later :-] [09:40:46] hey paravoid, you said you could review the OSM packages [09:41:46] New patchset: Silke Meyer; "Puppet files to install Wikidata repo / client on different labs instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42786 [09:43:11] paravoid: I will poke mark about the wikimedia module whenever he wake [09:43:11] up [09:44:36] New review: Silke Meyer; "Thanks for the proposals! I added two templates, repo and client. Now the role/labsmediawiki.pp is m..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/42786 [10:10:00] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [10:12:51] RECOVERY - MySQL Slave Delay on db59 is OK: OK replication delay 0 seconds [10:40:59] New patchset: Silke Meyer; "Puppet files to install Wikidata repo / client on different labs instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42786 [10:42:32] New review: Silke Meyer; "In this patchset, the xml dump for the client does no longer contain the language links." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/42786 [11:08:19] yeahh moar refactoring [11:08:23] New patchset: Hashar; "refactor continuous integration manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43429 [11:11:56] New patchset: Hashar; "refactor continuous integration manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43429 [11:29:47] memcached died on virt0 :-D [11:29:50] that is recurring [11:30:07] yes [11:30:23] oh [11:30:26] maybe I just lot my session [11:34:09] lunch bbl [12:36:44] back [12:49:55] New patchset: Hashar; "refactor continuous integration manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43429 [12:52:08] New patchset: Hashar; "refactor continuous integration manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43429 [12:53:14] lovely puppet [12:53:15] (File[/srv/org/mediawiki/integration/index.html] => Class[Wikimedia::Contint::Webserver] => Class[Wikimedia::Contint::Jenkins] => User[jenkins] => File[/srv/org/mediawiki/integration] => Class[Wikimedia::Contint::Webserver]) [12:53:16] ;) [13:03:53] New patchset: Hashar; "refactor continuous integration manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43429 [13:28:51] New patchset: MaxSem; "Raise throttle for the Pune event" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43432 [13:30:08] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [13:31:54] New patchset: Demon; "Add support for RT & Bugzilla tracking ids" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43433 [13:33:55] <^demon> MaxSem: ??? is not an int ;-) [13:34:07] ^demon, it's intentional;) [13:34:26] we don't know the number yet [13:39:37] MaxSem: OVER 9000 [13:40:44] !log reedy synchronized php-1.21wmf7/extensions/FlaggedRevs [13:40:57] Logged the message, Master [13:44:56] New patchset: Hashar; "refactor continuous integration manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43429 [13:45:14] New patchset: Demon; "Remove extension distributor mess from fenari" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41976 [13:49:03] New patchset: Demon; "Remove extension distributor mess from fenari" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41976 [13:50:13] New patchset: Demon; "Remove extension distributor mess from fenari" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41976 [13:51:45] New patchset: Hashar; "refactor continuous integration manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43429 [13:52:09] ^demon: what is replacing the lame extension distributor ? [13:52:36] <^demon> The awesome extension distributor :) [13:53:12] <^demon> We deployed it yesterday. [13:56:54] New review: ArielGlenn; "yay!" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/41976 [13:56:55] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41976 [13:58:22] New review: Hashar; "See discussion in Ops mailing list." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/43420 [14:03:43] New patchset: Demon; "Allow ensure => absent on systemuser" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43435 [14:05:02] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43435 [14:08:37] !log updating Zuul 23ec1ba..ff79197 . Bring up better support for LOST builds. [14:08:49] Logged the message, Master [14:13:33] New patchset: Demon; "Remove xinetd service entirely" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43436 [14:14:32] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43436 [14:23:29] New patchset: Demon; "Don't ensure the group as absent, it conflicts with the dependency" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43437 [14:24:10] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43437 [14:25:54] <^demon> 4th time's the charm? [14:26:26] dunno yet [14:26:32] there's "puppet run in progress" [14:26:36] so I'm sitting here waiting it out [14:29:38] had the old catalog [14:29:40] running now [14:32:19] notice: /Stage[main]/Mediawiki::Former-extdist-removesoon/Systemuser[extdist]/User[extdist]/ensure: removed [14:32:21] \o/ [14:33:22] <^demon> Yay [14:33:28] however [14:36:05] <^demon> However? [14:36:41] ah no [14:36:52] the cron job got put back on the previous run [14:36:56] but this run it stayed gone [14:37:02] notice: /Stage[main]/Mediawiki::Former-extdist-removesoon/Systemuser[extdist]/Group[extdist]/ensure: created [14:37:11] we do have that anomaly but whatever :-D [14:40:00] <^demon> http://wikitech.wikimedia.org/view/Swift/Open_Issues_Aug_-_Sept_2012/Cruft_on_ms7#ext-dist - updated status. [14:40:16] oh yyyaaayyy [14:40:52] reedy cleared profiling data [14:41:00] sooo close [14:41:34] <^demon> Thanks for your help :) I'm glad to see this thing go away. [14:41:45] me too, thanks for slugging away at it! [14:43:18] PROBLEM - Varnish HTCP daemon on cp1043 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [14:44:03] PROBLEM - Varnish HTCP daemon on cp1044 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [14:45:20] New patchset: Reedy; "Make clear-profile runs be logged to SAL" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43440 [14:48:33] RECOVERY - Varnish HTCP daemon on cp1043 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [14:49:28] RECOVERY - Varnish HTCP daemon on cp1044 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [14:50:11] wonder why they died in the first place (I restarted em) [15:01:00] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [15:01:01] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [15:01:33] heads up, Ganglia is down: http://ganglia.wmflabs.org/latest/ returns There was an error collecting ganglia data (127.0.0.1:8654): fsockopen error: Connection refused [15:02:07] eh, labs [15:02:19] was reported on #-tech [15:02:23] nvm [15:10:14] Could someone please review a few simple script changes? 
https://gerrit.wikimedia.org/r/#/c/43440/ https://gerrit.wikimedia.org/r/#/c/40766/ and https://gerrit.wikimedia.org/r/#/c/40311/ [15:13:20] Reedy: did the one I trust :-] [15:13:54] https://gerrit.wikimedia.org/r/#/c/43440/1/files/misc/scripts/clear-profile did this one want a path for dologmsg? [15:14:35] Not sure if it needs one, but might make sense to put one anyway [15:15:06] incoming [15:15:16] New patchset: Reedy; "Make clear-profile runs be logged to SAL" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43440 [15:18:00] hm also wondr if we need ' instead of " with the ! [15:18:28] or an escape char [15:20:06] echo '!'"log" works, echo "!log" fails from bash command line (Reedy) [15:20:12] $BINDIR/dologmsg "!log $USER synchronized $DIR $MESSAGE" [15:20:12] $BINDIR/deploy2graphite sync-common-file [15:20:16] ^ sync-common-file [15:20:27] huh [15:20:34] * apergos does a script test [15:20:47] /usr/local/bin/sync-common-file [15:21:11] ok from script not from terminal [15:21:13] ok wfm [15:21:24] heh [15:21:31] hell yeah, consistency [15:21:34] sorry to be paranoid [15:21:44] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43440 [15:21:47] but I am :-P [15:22:45] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40766 [15:23:02] I'll push em all at once [15:23:19] thanks [15:23:36] <^demon> I think I'm going to step out and grab a late breakfast. [15:24:10] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40311 [15:25:49] ok live next time puppet runs [15:26:00] great [15:32:31] PROBLEM - Host analytics1006 is DOWN: PING CRITICAL - Packet loss = 100% [15:35:27] New patchset: MaxSem; "Raise throttle for the Pune event" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43432 [15:38:21] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43432 [15:41:45] !log maxsem synchronized wmf-config/throttle.php 'https://gerrit.wikimedia.org/r/#/c/43432/' [15:41:55] Logged the message, Master [16:08:54] do you guys have any tool to parse the log files and create a human friendly report out of it [16:08:55] ? [16:09:12] something that would aggregate log lines based on some regex and dump a daily mail [16:12:09] I did something like that as a one off [16:12:17] I think it was for db errors [16:15:55] Lock wait timeout exceeded, Deadlock found when trying to get lock, Duplicate entry [16:16:00] Really a one trick pony :p [16:18:43] New review: Hashar; "Paul Belanger has a Debian directory to build a Debian package at : https://github.com/pabelanger/je..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/24620 [16:19:25] Reedy: might end up writing one myself if I don't think anything already existing [16:19:44] paul? [16:19:45] really? [16:19:48] I know him [16:20:03] Paul belanger ? 
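For the log-report tool asked about above (aggregate log lines by regex and dump a daily summary), here is a minimal sketch; the patterns are the DB errors listed in the conversation and the default log path is only an example.

    import re
    import sys
    from collections import Counter

    PATTERNS = {
        "lock wait timeout": re.compile(r"Lock wait timeout exceeded"),
        "deadlock": re.compile(r"Deadlock found when trying to get lock"),
        "duplicate entry": re.compile(r"Duplicate entry"),
    }

    def summarize(lines):
        # Count how many lines match each pattern; unmatched lines are ignored.
        counts = Counter()
        for line in lines:
            for name, pattern in PATTERNS.items():
                if pattern.search(line):
                    counts[name] += 1
        return counts

    if __name__ == "__main__":
        path = sys.argv[1] if len(sys.argv) > 1 else "/a/mw-log/dberror.log"
        with open(path) as f:
            for name, count in summarize(f).most_common():
                print("{0:6d}  {1}".format(count, name))

Piping the output into mail from a daily cron gives roughly the report described above.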
[16:20:14] I am talking with him at least once a week in #openstack-infra [16:20:47] he packaged jenkins-job-builder for Debian :-] [16:21:10] hah [16:21:15] I know him from Debian VoIP [16:21:16] small world [16:21:47] he used to work in Digium [16:21:52] afaik [16:22:45] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [16:22:46] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [16:22:46] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [16:22:46] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [16:22:46] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [16:23:22] paravoid: passed him your salutations on your behalf [16:23:28] talking to him now :) [16:23:36] :-] [16:24:16] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 211 seconds [16:24:25] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 216 seconds [16:29:22] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [16:29:40] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [16:35:17] !log creating Jenkins jobs for GettingStarted and PHPExcel extensions. [16:35:28] Logged the message, Master [16:55:36] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 210 seconds [16:56:03] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 205 seconds [16:56:13] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 211 seconds [16:56:23] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 245 seconds [17:01:55] !log creating Jenkins jobs for Semantic* extensions [17:02:06] Logged the message, Master [17:04:46] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds [17:05:04] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [17:08:49] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [17:09:52] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [18:28:30] New patchset: Faidon; "Disable the IPv4 route cache across the cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43478 [18:49:57] New review: Andrew Bogott; "This is looking good! I've added a few inline comments -- those changes should be pretty trivial." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/42786 [18:55:48] !log xenon & caesium installing as parsoid [18:55:58] Logged the message, RobH [19:05:55] RobH: you're the old man of the server mountain --- where can I find something that operations people would find handy about geographic caching? [19:06:34] you mean like how we do it? [19:06:37] Client has an EU presence and runs servers in US. They're want their EU people to use their MW based in the US but they're saying it is slow. 
[19:06:39] yes [19:06:58] http://wikitech-static.wikimedia.org/articles/d/n/s/DNS_ed5f.html#Geographic_DNS [19:07:01] ack, wait [19:07:04] wrong link [19:07:04] thats bad [19:07:11] * hexmode does not click [19:07:17] i googled rather than direct to wikitech link [19:07:18] my bad [19:07:26] http://wikitech.wikimedia.org/view/Dns#Geographic_DNS [19:08:04] So we only load balance based on location the hits to the caching layer [19:08:20] we dont have proper load balancing for multiple datacenter sites as primary sites 'yet' [19:08:34] I'll see if they can do geo-dns [19:08:57] what is the plan for doing "proper" load balancing? [19:10:54] afaik the focus is on making eqiad the priamry [19:11:03] RobH: if we did another dc-hackathon, would you go? [19:11:04] then settling the load balance on backend issue [19:11:15] hexmode: I live in SF now, so prolly not ;] [19:11:25] WHAT????? [19:11:35] you moved without sending me a postcard [19:11:40] haha [19:11:48] I should probably pay closer attn to fb [19:24:43] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [19:33:33] hi Ryan_Lane, how'd the l10n stuff turn out? [19:43:04] New patchset: Reedy; "RT #2295: Run cleanupUploadStash across all wikis daily" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37968 [19:43:33] New patchset: Pyoungmeister; "migrating first round of eqiad DBs to coredb module. woo!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43485 [19:47:54] !log anomie synchronized php-1.21wmf6/maintenance/rebuildLocalisationCache.php 'Backported for git-deploy l10n work' [19:47:56] hashar: this is probably a better channel [19:48:00] or there [19:48:05] Logged the message, Master [19:48:12] so yeah beta needs mw/ext.git as a submodule which is itself a submodule [19:48:14] !log anomie synchronized php-1.21wmf7/maintenance/rebuildLocalisationCache.php 'Backported for git-deploy l10n work' [19:48:16] hashar: so, dealing with sub-sub modules isn't the most straightforward thing [19:48:24] Logged the message, Master [19:48:32] isn't git submodule update --init --recursive supposed to handle that ? [19:48:32] because the system needs to modify the .gitmodules file [19:48:40] (I have no idea where the minion script is) [19:48:46] nor what it does [19:48:51] otherwise they'll pull from gerrit, rather than the deployment host [19:49:04] oh of course [19:49:26] we could still rewrite the remote url in .gitmodules :-) [19:49:28] I can probably do a recursive find to find them and modify them [19:50:11] anyway right now we have a while() loop that does a git pull on each submodule and thus hit gerrit [19:50:18] so I guess it is not going to be worth [19:51:16] New review: Lcarr; "this is a security issue anyways." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/43478 [19:53:15] Ryan_Lane: anyway it is probably better to focus on the l10n cache stuff [19:54:35] l10n stuff is basically done [19:54:42] wheee [19:54:46] but for the long-term I'm going to need to switch it away from git [19:54:50] I'm writing an email about that right now [19:55:16] (switch away from git for l10n) [19:55:19] Ryan_Lane: also do you have a component in bugzilla for git-deploy ? 
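A sketch of the recursive .gitmodules rewrite discussed a few minutes earlier (so submodules and sub-submodules fetch from the deployment host instead of gerrit), assuming the URLs to rewrite match the gerrit anonymous-clone prefix; the deployment-host URL is a placeholder. After rewriting, something like git submodule sync would still be needed to propagate the new URLs into each submodule's config.

    import os
    import re

    GERRIT_PREFIX = re.compile(r"https://gerrit\.wikimedia\.org/r/p/")
    DEPLOY_PREFIX = "http://deployment-host.example/git/"  # placeholder URL

    def rewrite_gitmodules(root):
        # Walk a checkout (including nested submodules) and point every
        # .gitmodules URL at the deployment host instead of gerrit.
        for dirpath, _dirnames, filenames in os.walk(root):
            if ".gitmodules" not in filenames:
                continue
            path = os.path.join(dirpath, ".gitmodules")
            with open(path) as f:
                original = f.read()
            updated = GERRIT_PREFIX.sub(DEPLOY_PREFIX, original)
            if updated != original:
                with open(path, "w") as f:
                    f.write(updated)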
[19:55:23] no [19:55:25] we need to add one [19:55:35] I can do that [19:55:41] (trying to help) [19:56:43] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43485 [19:57:26] done Wikimedia > git-deploy [19:57:32] you and I are CC by default [19:57:53] cool [19:57:54] thanks [19:58:01] ok. lunch. back in a bit [19:58:04] almost wrote back in a git [19:58:09] ;) [19:58:13] enjoy your lunch [19:58:15] I need to work on something else :D [19:58:26] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [19:59:01] New patchset: Faidon; "Disable the IPv4 route cache across the cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43478 [19:59:11] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [19:59:31] Today is decision day, right? [19:59:57] Supposed to be, if I remember right from the meeting. [20:00:21] Or maybe it's pre-decision decision day [20:00:32] "We made the decision and jfdi" [20:01:35] * anomie checks meeting notes: "go/no go decision on git-deploy by 1/11" [20:01:58] "While Ryan_Lane was out for Lunch, the platform team got bored. git-deploy is now live" [20:03:33] New patchset: Pyoungmeister; "db1001 ganglia aggregator" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43493 [20:03:40] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [20:04:22] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43493 [20:05:22] New patchset: Asher; "returning db59 to s1" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43495 [20:06:02] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43495 [20:08:16] !log asher synchronized wmf-config/db.php 'repooling db59' [20:08:27] Logged the message, Master [20:10:52] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [20:13:35] PROBLEM - MySQL Slave Delay on es2 is CRITICAL: CRIT replication delay 4587782 seconds [20:15:23] PROBLEM - MySQL Slave Delay on es3 is CRITICAL: CRIT replication delay 4587890 seconds [20:15:41] PROBLEM - MySQL Slave Delay on es1001 is CRITICAL: CRIT replication delay 4587908 seconds [20:16:00] what's this es stuff about? [20:16:27] scale=3;4587782/86400 [20:16:27] 53.099 [20:16:30] ;-) [20:24:09] New patchset: Pyoungmeister; "mograting all eqiad slaves to coredb roles" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43501 [20:25:11] notpeter: typo! :-] [20:25:18] hashar: where? [20:25:28] also, thank you! 
[20:25:29] "mograting all eqiad slaves to coredb roles" [20:25:30] mograting -> migrating [20:25:31] hehe [20:25:36] bahahaha [20:25:48] I meant mograte :) [20:25:49] we should pass a spell checker on commit summaries :) [20:26:04] no such package: mograte [20:28:45] New patchset: Pyoungmeister; "mograting all eqiad slaves to coredb roles" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43501 [20:29:30] New patchset: Pyoungmeister; "mograting all eqiad slaves to coredb roles" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43501 [20:30:25] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43501 [20:33:18] New patchset: MaxSem; "WIP: OSM module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36222 [20:33:33] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.055 second response time [20:43:50] Is /usr/local/apache/conf used now for anything? [20:45:31] New patchset: Pyoungmeister; "migrating eqiad masters to coredb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43571 [20:46:45] There's not actually much work to make the /usr/local/apache symlink "farm". Get the /srv/deplyoment/mediawiki on the apaches [20:47:04] then just update common/common-local and uncommon [20:47:28] sync [20:47:31] update apache configs [20:47:33] graceful [20:47:34] profit [20:50:28] New patchset: Pyoungmeister; "migrating eqiad masters to coredb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43571 [20:52:02] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43571 [20:52:47] New patchset: Demon; "Fix gerrit patchset-created hook" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43574 [21:06:29] Reedy: today is indeed decision day [21:06:59] there's a few minor issues I'd like to take care of, but I think we'd good for monday from the system's POV, if you guys are OK with the configuration [21:07:04] New patchset: Pyoungmeister; "correcting regex" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43577 [21:08:10] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43577 [21:21:19] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43433 [21:22:12] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43574 [21:23:17] ^demon: ^^ [21:23:25] ^demon: any others? [21:23:39] https://gerrit.wikimedia.org/r/#/c/39385/ ? [21:23:49] <^demon> Yes, that one. [21:23:59] <^demon> https://gerrit.wikimedia.org/r/#/c/39196/ [21:24:23] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39196 [21:24:26] New review: Demon; "Ok, I cleaned this up. Plugins are stored in the git repo referenced--and the files are built using ..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/38798 [21:24:29] any others? [21:24:46] <^demon> I think 38798 is ready, if you're fine with how I'm doing the build-deploy cycle. [21:24:51] <^demon> For plugins. [21:25:23] that's fine [21:25:41] we could also deploy them via git-deploy [21:25:49] but maybe we handle that later? 
[21:25:50] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/11589 [21:26:05] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38798 [21:26:32] <^demon> Yeah, sounds fine when we get a chance to set that up. [21:26:44] we'll need a symlink for that [21:26:58] <^demon> Ah, gotta rebase 39385. [21:27:12] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/23872 [21:27:26] I'm not really supporting destination directories that are different than the deployment host directory [21:28:48] New patchset: Demon; "Make github replication config forward compatible" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39385 [21:29:20] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39385 [21:29:42] running puppet on manganese [21:31:35] Is gerrit down..? [21:31:44] yes [21:31:54] it's restarting [21:32:10] it's back up [21:32:14] Thanks! [21:33:22] New patchset: Pyoungmeister; "correct master logic for coredb role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43584 [21:34:09] <^demon> Ryan_Lane: You finished running puppet? [21:34:38] thanks, ryan. for a glorious moment, this was me: http://xkcd.com/303/ [21:34:56] :) [21:35:05] ^demon: yes the clone failed [21:35:11] <^demon> Bah. [21:35:15] err: /Stage[main]/Gerrit::Jetty/Git::Clone[operations/gerrit/plugins]/Exec[git_clone_operations/gerrit/plugins]/returns: change from notrun to 0 failed: git clone -b master https://gerrit.wikimedia.org/r/p/operations/gerrit/plugins.git /var/lib/gerrit2/review_site/plugins returned 128 instead of one of [0] at /var/lib/git/operations/puppet/manifests/generic-definitions.pp:671 [21:35:33] of course gerrit was down ;) [21:35:37] <^demon> Hahah. [21:35:39] <^demon> I'll clone by hand. [21:35:44] no [21:35:47] just re-run puppet [21:35:50] <^demon> Oh, yeah. [21:36:08] <^demon> Gerrit won't restart since config is already in place. [21:36:47] <^demon> executed successfully. [21:38:01] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43584 [22:01:25] New patchset: Pyoungmeister; "organize eqiad db node defs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43586 [22:05:44] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43586 [22:05:46] it would be good to have Tim in on this conversation [22:05:49] <^demon> So...rebuilding the cdbs is a pain in the ass, but the performance is really good. [22:05:51] we have a meeting about this later, right? [22:05:57] <^demon> Ryan_Lane: Indeed, he wrote the cdb stuff. [22:06:09] it would be great to have a saner way of handling this [22:06:19] siebrand: there :) [22:06:28] at minimum we'll need to switch deployment methods for this [22:06:46] adding new methods is easy enough [22:06:46] Tim-away wrote the CDB implementation, so if I may make a suggestion, I'd consult him and ask him to explain what and why, before looking into what to do next :). [22:06:54] siebrand: we're brainstorming more than anything [22:07:05] Creating one or more issues in bugzilla about what you would like to see improved, would aleo improve visibility and persistence of the observed issue. 
[22:08:26] maybe we could generate more cdb using the beginning of message keys as a prefix [22:08:43] so 'special-foobar' would end up in the special..cdb [22:08:47] <^demon> I don't think that's the problem? [22:08:51] it's not [22:09:01] then you get less cdb regenerated and less to transfer around [22:09:04] it's the fact that we're needing to deploy almost 1GB of messages [22:09:07] New patchset: Pyoungmeister; "arg, whitespace cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43587 [22:09:14] what are you trying to improve? Revision size when rebuilding localisation cache? [22:09:37] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43587 [22:09:39] siebrand: no, the fact that we need to deploy 1GB of data for minor changes [22:09:40] <^demon> Yeah. Putting them in git for git-deploy doesn't work well, since git doesn't find a good delta for the cdb files. [22:09:47] indeed [22:09:53] we all kind of knew that would be the case [22:10:12] I wasn't considering that to be a long-term solution [22:10:16] just a stop-gap for the switchover [22:10:17] one person says yes ,the other no. Are you talking about two different things, or is at least one of you confused, ^demon , Ryan_Lane ? [22:10:51] we're talking about the same thing [22:11:03] the problem is transferring large amounts of data for small changes [22:11:08] Then yeah and no as answers confuses me. [22:11:15] <^demon> I'm talking short term. [22:11:22] the root cause is the cdb files being huge and git not being able to generate a small delta when they change [22:11:26] indeed [22:11:31] so we end up having lot of data to transfer over the network [22:11:38] hashar: we'd have that either way [22:11:40] I believe we have that issue with rsync [22:11:46] generating on app servers locally might actually be better [22:11:47] the transfer size stays the same [22:11:49] that's using a new deployment model, different from the current one. Correct? [22:11:52] mark: it takes ages [22:11:56] mark: and that's kind of wasteful [22:11:57] how long? [22:11:59] mark: it drains too much CPU :/ [22:12:16] I can try out on gallium [22:12:28] it can take 5-10 minutes [22:12:32] <^demon> Hmm, what if we came up with a non-php way to rebuild localization files? [22:12:33] and it eats a lot of CPU the whole time [22:12:37] we general localisation caches on the twn server with 14 threads in about 15-20 seconds. [22:12:48] and we have a relatively slow vps. [22:12:59] general = generate. [22:13:00] siebrand: seriously? I ran it on tin and it took almost minutes [22:13:10] s/almost// [22:13:25] <^demon> Maybe we're doing it wrong ;-) [22:13:30] maybe [22:13:34] real 0m7.976s [22:13:35] Wow, I want one of those... it takes me almost 2 hours on my laptop [22:13:35] user 0m7.828s [22:13:36] sys 0m0.120s [22:13:40] Ryan_Lane: I'll make a "time" --force run now. There's a few extensions difference, but let me check. [22:13:47] (on gallium) [22:13:56] hashar: is that for just english, or all languages? [22:14:02] 371 languages rebuilt out of 371 [22:14:36] ahh [22:14:38] that is only core [22:14:41] indeed [22:14:44] real 0m20.012s [22:14:55] time b php maintenance/rebuildLocalisationCache.php --force --threads=14 [22:15:11] on beta it takes roughly 2 hours and a half [22:15:22] is it due to our thread count? [22:15:48] if we get them on all app servers we end up with a huge CPU spike [22:15:50] I've never ever heard anything longer than minutes. 
[22:16:09] so something is very wrong if it's longer than minutes. [22:16:38] does it have to do with LU maybe? [22:16:51] did anything change on stafford ? [22:16:57] all of a sudden it's not going into insane cpu spikes [22:16:58] we're not running LU on twn. [22:17:32] LeslieCarr: tim changed how often the clients restart [22:17:36] otherwise I don;t know [22:17:59] siebrand: I don't really know for beta. I guess it uses only one thread and most of the time is spent in waiting for I/O [22:18:05] siebrand: not a big deal anyway [22:18:14] hashar: is that running on glusterfs? [22:18:29] probably [22:18:32] that's why [22:18:44] hashar: well, with ~300 locales, 9000 seconds is outrageous. [22:18:45] blamin poor gluster [22:18:51] cool [22:18:53] every filesystem write goes across to two servers [22:18:54] glad that works [22:18:57] it's slow [22:19:03] and it's writing to SATA [22:19:10] that's in a raid 6 [22:19:29] we should have used raid10 for those [22:19:54] I should migrate that so the files are generated on /mnt and then replicate the .cdb files to the instance :-] [22:19:57] Hoe large is WMF's L10n cdb collection? twn's is 318M . [22:20:01] hoe = how [22:20:04] :) [22:20:08] siebrand: 750MB [22:20:19] siebrand: don't call me a large hoe [22:20:21] ;) [22:20:25] per branch [22:20:29] yeah [22:20:30] Ryan_Lane: hoe is how in Dutch :) [22:20:32] well, per slot [22:20:44] Ryan_Lane: Does gluster actually wait until `sync` is done for the file on 2 nodes before returning the fh? that would suck ass [22:20:51] Damianz: yes [22:21:08] I'd say generate the L10n cache locally, with max threads and nice. [22:21:11] it's a fully synchronous filesystem from what I've read [22:21:33] sure, cpu and io will spike, but that's what you have it for. [22:21:41] Hoe echt waar? (<-- google dutch) [22:21:50] w00t... so if you have x nodes it doesn't wait until 2 are synced and async x-2? [22:22:02] siebrand: actually, we have cpu to serve client requests ;) [22:22:05] siebrand - we're in america and we speak american here [22:22:10] Ryan_Lane: heh [22:22:18] yeah max threads would kill [22:22:21] LeslieCarr: more tired is more typos... [22:22:42] maybe we could use core affinity? [22:23:04] but then regenerating the cdb files on our hundreds of servers is not really power efficient :) [22:23:11] yeah, James_F claims that too with all his fancy u's ;) [22:23:11] agreed [22:23:18] I don't think doing the cdbs locally is a great idea [22:23:40] can't we deploy them asynchronously to another destination [22:23:41] it beats redesigning your deployment.... [22:23:43] it's really just a matter of transferring the files [22:23:46] <^demon> ms7? [22:23:47] * ^demon hides [22:23:51] power efficiency is a nonsense reason. [22:23:59] then once all minions have finished their bittorent transfer we can switch the app servers to the new mediawiki version [22:24:00] hashar: yes, that's what the idea in my email suggests [22:24:09] ah [22:24:13] I should read my emails maybe [22:24:19] have you considered using torrents for deployments yet? [22:24:26] Ha, I was totally going to say we should just bittorrent these. [22:24:29] Ryan_Lane: so can we generate a different format on tin, push that out, and have each box convert that to cdb? [22:24:33] some format that diffs well [22:24:34] Totally should just hiphop everything and bittorrent the binary out :D [22:24:38] siebrand: wasting cpu and power on hundreds of machines is wasteful to just cut down on data transfer [22:24:42] AaronSchulz: XML! 
[22:24:50] Damianz: we're going to use hphpvm, not hphp
[22:24:59] <^demon> siebrand, csteipp: Yes, we did discuss torrents.
[22:25:12] AaronSchulz: if it's efficient enough, yeah
[22:25:14] so anyway you got 1.21wmf7 with its cache. Get the new 1.21wmf7, regenerate the .cdb files, ask the minions to grab the new files and the new 1.21wmf7 code base. Once both processes are complete, ask the minions to switch to the new code base.
[22:25:21] that would require two more slots though
[22:25:25] AaronSchulz: it's still going to eat a bunch of memory and/or cpu to do it that way, though
[22:25:29] (two more slots per branch we support)
[22:25:31] so torrents fix the data transfer issues, even over WAN pipes.
[22:25:36] hashar: read your email :D
[22:25:44] problem solved, next one.
[22:25:51] Ryan_Lane: far less than from i18n source
[22:25:51] jesus. both of you read your email :D
[22:25:57] or we find out a way to transfer less data :-]
[22:26:05] AaronSchulz: true
[22:26:10] no point in transferring 750MB of data for just a few message changes
[22:26:13] hashar: drop l10n support
[22:26:20] lololol
[22:26:24] and I'm not convinced it would eat to much memory, if the file can be buffered in as you read (depending on format)
[22:26:28] *too much
[22:26:35] Reedy: na I love l10n. That is the first thing I started committing in mw iirc
[22:26:40] AaronSchulz: true… that may be a good way to do it….
[22:26:40] I mean yeah if you load the whole thing into an array :)
[22:26:46] then we could still use git for it
[22:26:54] ah, right
[22:26:59] I think we should make Wikipedia an internal site, and go back to biannual deployments.
[22:27:08] Then there's much less deployment traffic.
[22:27:23] Also fewer servers needed.
[22:27:30] Or just never update it, ever :D
[22:28:03] stable version reached. Done.
[22:28:17] <^demon> AaronSchulz, Ryan_Lane: So send out something that does delta well (heck, php arrays), then just have a dummy script that turns that into cdb.
[22:28:39] ^demon: it probably shouldn't be php arrays
[22:28:44] message files are already php arrays
[22:28:44] <^demon> Well, they'd delta well.
[22:28:57] <^demon> That, and we've already got the cdb support in php on all apaches.
[22:29:01] ^demon: we'd need to load the entire file into memory
[22:29:06] <^demon> So turning the array -> file is like 5 lines of php.
[22:29:11] Nikerabbit has a long cherished wish for our i18n format to not be executable.
[22:29:23] or we could get a base .cdb file that contains the messages as they were at midnight. Whenever a change needs to be deployed we generate a patch .cdb file that contains the new message.
[22:29:35] We have plenty of experience with YAML, so we could use that. Different project, though.
[22:29:42] Mediawiki could then load the midnight cdb file that contains everything and override its content with the patch .cdb
[22:29:48] ^demon: and loading it into memory on the app server would put it into apc
[22:29:53] <^demon> Bah.
[22:29:53] ^demon: that's bad times :)
[22:29:55] <^demon> True.
[22:29:55] so we end up having to deploy small cdbs
[22:30:09] hashar: we really need something properly delta-able
[22:30:17] xml or json or yaml?
[22:30:27] for god sake
[22:30:43] json presumably
[22:30:45] xml bloats
[22:30:48] yeah
[22:30:49] XML is too verbose.
[22:30:55] yep
[22:30:57] json and yaml are the same
[22:31:01] For readability, YAML would be preferred.
[22:31:01] yep
[22:31:02] effectively
[22:31:12] can you stream parse json?
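A rough sketch of the base-plus-patch idea described at 22:29:23 and 22:29:42, using plain dicts as stand-ins for the midnight .cdb and the small patch .cdb; this is only an illustration, not MediaWiki's actual LocalisationCache code.

    class PatchedMessageStore:
        # look in the small 'patch' store first, fall back to the big
        # 'midnight' base; stand-ins for the two CDB files described above
        def __init__(self, base: dict, patch: dict):
            self.base = base
            self.patch = patch

        def get(self, key: str):
            if key in self.patch:
                return self.patch[key]
            return self.base.get(key)

    def build_patch(midnight: dict, current: dict) -> dict:
        # only messages that changed since the midnight snapshot need to ship
        return {k: v for k, v in current.items() if midnight.get(k) != v}

The base file would only be redistributed once a day; between snapshots, deploys would move a patch that is a tiny fraction of the 750MB full set.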
[22:31:33] so we end up with an even larger dataset to move around?
[22:31:47] sounds to me json would be larger than a cdb
[22:31:49] hashar: no. because we'd only pass deltas
[22:31:51] Ryan_Lane: It's theoretically possible
[22:31:54] Not sure if anyone's done it
[22:32:00] <^demon> It'd only be larger on the first push.
[22:32:07] <^demon> Subsequent pushes would be small for json.
[22:32:11] * AaronSchulz likes it when RoanKattouw says "theoretically"
[22:32:22] so we might as well use the PHP files can't we?
[22:32:26] deltas are already relatively small if cdb is not part of the current deployment repo.
[22:32:28] I didn't actually mean "theoretically" in that sense this time? :)
[22:32:29] (regardless of the APC cache issue)
[22:32:32] indeed, hashar
[22:32:34] well, it may not actually be a big deal if we load the entire set into memory per json file
[22:32:39] I'm pretty damn sure a stream parser for JSON is feasible if not easy
[22:32:44] But I don't know whether one already exists
[22:32:59] we don't want JSON formatting. See before.
[22:33:10] readability won't matter much
[22:33:10] yaml, json, same same
[22:33:15] it would just be an intermediary
[22:33:18] AaronSchulz: well, we could actually switch the source to yaml
[22:33:21] rather than php
[22:33:21] siebrand: Why don't you want JSON again?
[22:33:30] RoanKattouw: Readability.
[22:33:35] http://fr2.php.net/manual/en/apc.configuration.php#ini.apc.filters A comma-separated list of POSIX extended regular expressions. If any pattern matches the source filename, the file will not be cached.
[22:33:39] * hashar whistles
[22:33:40] RoanKattouw: JSON has more formatting than YAML.
[22:33:41] Yeah minified JSON isn't readable
[22:33:41] Ryan_Lane: I'd prefer the source be the smallest possible thing
[22:33:42] they're very indifferent to each other
[22:33:44] But formatted JSON is
[22:33:47] hashar: CSV!!!!!
[22:33:59] ini
[22:34:19] AaronSchulz: I'm saying if we switched sway from using php files in mediawiki code
[22:34:25] *away
[22:34:38] oh, I thought you meant the cdb cache
[22:34:55] then we don't even need an intermediate format
[22:35:15] well, how soon do we want the git-deploy thing working well? :)
[22:35:27] but, yeah, that might make sense eventually
[22:35:29] true
[22:35:51] don't take a shortcut for short-term goals that hurts a long-term goal.
[22:35:55] though you end up loading all the stuff anyway the way the code is
[22:36:05] if you want to get rid of php files to hold the messages
[22:36:09] technical stuff
[22:36:11] and still want something which is flat
[22:36:18] just use .po files
[22:36:26] we can solve the short term problem with the method I outlined in my email
[22:36:28] or bittorrent
[22:36:28] it would only help tin use less ram then ;)
[22:36:29] hashar: sqlite db files
[22:36:36] <^demon> Ryan_Lane: I like bittorrent idea.
[22:36:55] the entries look roughly like:
[22:36:56] I'd like to solve the long-term problem at some point, though :)
[22:36:58] msgid "My name is %s.\n"
[22:36:59] msgstr "Je m'appelle %s.\n"
[22:37:05] so that should produce nice delta
[22:37:07] fuck gettext.
[22:37:22] burn fucking gettext.
[22:37:23] <^demon> Ryan_Lane: Long term--deploy everything with bittorrent ;-)
[22:37:40] ^demon: git is actually more efficient for text files
[22:37:42] <^demon> siebrand: I heard you don't like gettext.
[22:37:42] gettext is an i18n nightmare that should never have been invented.
[22:37:57] we can still use {{PLURAL}} in a gettext message, can't we?
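A small sketch of the "push a delta-friendly text format, convert locally" flow discussed above: compare the previous and new per-language message dumps (JSON is used here purely as an example serialization; the exact intermediate format was left open in the discussion) and emit only what changed. A tiny script on each app server would then fold the result into its local cdb, using the cdb support the chat says is already present on all apaches.

    import json
    from pathlib import Path

    def write_delta(old_file: Path, new_file: Path, delta_file: Path) -> None:
        # hypothetical file names; one such trio per language code
        old = json.loads(old_file.read_text()) if old_file.exists() else {}
        new = json.loads(new_file.read_text())
        delta = {
            'changed': {k: v for k, v in new.items() if old.get(k) != v},
            'removed': [k for k in old if k not in new],
        }
        # for a typical deploy this is a few KB instead of a multi-MB cdb
        delta_file.write_text(json.dumps(delta, ensure_ascii=False, sort_keys=True))

Because the delta is plain text keyed by message name, git and rsync both handle it well, which is the property the cdb files lack.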
[22:38:01] Rewrite MediaWiki in java
[22:38:06] <^demon> ^ That.
[22:38:07] 1 parameter per key.
[22:38:12] no gender
[22:38:24] http://en.wikipedia.org/wiki/BitTorrent
[22:38:29] 'Since 2010, more than 200,000 users of the protocol have been sued by copyright trolls.[6]'
[22:38:33] lol, npov
[22:38:36] :D
[22:38:39] PROBLEM - MySQL Slave Delay on es2 is CRITICAL: CRIT replication delay 203 seconds
[22:38:40] I should be clear: 1 plural parameter per key.
[22:38:42] <^demon> Hahah
[22:38:48] PROBLEM - MySQL Slave Delay on es1001 is CRITICAL: CRIT replication delay 211 seconds
[22:38:58] let me see how hard it would be to set up bittorrent
[22:39:02] New patchset: Pyoungmeister; "account for master/master case in coredb roles" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43590
[22:39:06] it may be a good short-term solution for this
[22:39:11] bittorrent is being used successfully for deploying software.
[22:39:15] Reedy: hashar doesn't that work automatically? :) "Our nTile PtoJ product is an automated software migrator that translates PHP source code into Java™/Java Enterprise Edition."
[22:39:24] siebrand: of course. it's used by a bunch of places
[22:39:25] And windows updates (bt)
[22:39:28] siebrand: my idea was to use gettext to simply load the message then let our usual l10n system transform it
[22:39:53] PROBLEM - MySQL Slave Delay on es3 is CRITICAL: CRIT replication delay 275 seconds
[22:39:57] hashar: gettext features are just not up to par with our i18n features.
[22:40:39] <^demon> Ryan_Lane: https://github.com/lg/murder
[22:40:52] siebrand: aka use .po to represent our key => messages, use the gettext PHP function to fetch it but not interpret it.
[22:40:56] meh
[22:40:58] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43590
[22:41:01] siebrand: gettext() is not going to understand $1 anyway :]
[22:41:04] I'm not replacing the entire deployment system
[22:41:10] <^demon> Bah, it's a gem.
[22:41:15] <^demon> *shudder*
[22:41:40] the new deployment system works very well for everything except huge binaries
[22:42:09] <^demon> Indeed. And it provides you with a more consistent state. Torrents you can end up with 1 file done and 3 files pending.
[22:42:13] and we can add different fetch methods for different repos as we wish
[22:42:14] <^demon> Which sucks for code.
[22:42:16] ^demon: yes
[22:42:31] I'm only going to use bittorrent for the fetch stage
[22:42:52] the checkout stage will copy over files that don't match
[22:42:59] <^demon> *nod*
[22:43:02] bittorrent will stick files into a cache
[22:44:06] Doesn't rsync sync individual blocks?
[22:44:21] (if you're looking at alternate fetch methods)
[22:44:57] Ryan_Lane: ^demon: what was the objection in having i18n files in APC? Having like 2GB of memory used by them?
[22:46:08] hashar: yes. that's an absurd amount of memory to waste
[22:46:15] and it's memory we don't have
[22:47:03] csteipp: yes. http://rsync.samba.org/tech_report/node4.html
[22:47:13] <^demon> We already hit apc limits with 2 deployments anyway.
[22:47:18] <^demon> 3 deployments usually breaks it.
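A sketch of the checkout step described at 22:42:31-22:43:02: after the fetch stage has dropped the new file set into a local cache by whatever transport (bittorrent, rsync, ...), only copy files whose content actually differs. Directory names are hypothetical; the fetch itself and any deletion of removed files are left out.

    import hashlib
    import shutil
    from pathlib import Path

    def sha1(path: Path) -> str:
        h = hashlib.sha1()
        with path.open('rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        return h.hexdigest()

    def checkout(cache: Path, live: Path) -> None:
        # copy over only the files that don't match the live tree
        for src in cache.rglob('*'):
            if not src.is_file():
                continue
            dst = live / src.relative_to(cache)
            if dst.exists() and sha1(dst) == sha1(src):
                continue  # unchanged, nothing to copy
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)

This keeps the fetch transport swappable per repo, as mentioned above, while the checkout stage stays the same everywhere.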
[22:47:35] :/
[22:47:43] or we get apaches per languages :-]
[22:47:47] New patchset: Pyoungmeister; "moving m2 nodes in eqiad to coredb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43592
[22:47:49] I should go to bed probably
[22:48:00] RECOVERY - MySQL Slave Delay on es2 is OK: OK replication delay 4 seconds
[22:48:01] RECOVERY - MySQL Slave Delay on es1001 is OK: OK replication delay 4 seconds
[22:48:36] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43592
[22:48:46] <^demon> I'm done tonight too. Have a good weekend everyone.
[22:48:53] New patchset: Lcarr; "moving ganglios to a more proper location" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43593
[22:50:05] notpeter: mutante ^^
[22:51:01] PROBLEM - MySQL Slave Delay on es3 is CRITICAL: CRIT replication delay 184 seconds
[22:51:17] good weekend!
[22:51:19] * hashar waves
[22:51:25] LeslieCarr: oh, rightly proper
[22:51:33] it's practically sipping tea :)
[22:52:46] :)
[22:52:49] RECOVERY - MySQL Slave Delay on es3 is OK: OK replication delay 0 seconds
[22:58:31] New patchset: Lcarr; "moving ganglios to a more proper location" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43593
[23:21:24] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43593
[23:25:24] New patchset: Pyoungmeister; "migrating es1007 and es1010 to coredb roles" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43597
[23:26:54] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43597
[23:31:13] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[23:35:53] New patchset: Pyoungmeister; "rest of eqiad es2 and es3 to coredb roles" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43599
[23:36:33] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43599
[23:37:12] paravoid: how is ceph going?
[23:37:46] it's 1:30 am on a saturday morning in athens
[23:37:51] faidon is probably...
[23:37:55] that means nothing
[23:37:55] well, he might be here :)
[23:38:01] true story
[23:45:54] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[23:58:57] LeslieCarr: hey, neon be spammin', yo