[00:00:51] our python linter must suck [00:01:01] New patchset: Ryan Lane; "Report on repo dependencies when reporting on repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43375 [00:01:06] because it just verified something with two syntax errors [00:01:21] Is it non-voting, perhaps? [00:01:25] there isn't one [00:01:28] it seems [00:01:29] Oh hah [00:01:49] flake8 ftw [00:02:24] ah. it also doesn't see this as a python file [00:02:33] becase it doesn't end in .py [00:02:40] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43375 [00:02:47] otherwise pep8 probably would have caught it [00:03:46] Damianz: flake8 is nifty, thanks [00:04:47] flake8 --exclude=migrations,.git --ignore=E501,E225,E121,E123,E124,E125,E127,E128,W404 --exit-zero in your MakeFile is useful [00:05:18] what's a makefile? [00:05:18] :) [00:07:06] * Damianz looks at Ryan_Lane with that old school autoconf look.... hipster cmake kids these days [00:07:15] :D [00:07:45] uuuuggghhhh [00:07:53] New patchset: Brion VIBBER; "Update FirefoxOS app submodule" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43377 [00:08:20] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [00:09:04] I should really just use puppetmaster::self [00:09:27] New patchset: Ryan Lane; "Use proper variable reference" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43379 [00:09:56] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43379 [00:10:15] or at least pull from the fake branch rather than merging [00:10:30] I tested it on beta first this time [00:11:08] it would be interesting to get jenkins to update beta via git-deploy via a peer runner [00:11:13] ok. now the report shows which minions haven't finished yet, in each stage [00:11:20] and if a repo has dependencies, it shows their state too [00:11:28] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43377 [00:11:39] that's doable [00:12:12] one of the problems is that they are in different salt clusters [00:12:56] Imo doing jenkins <> labs api <> salt api <> salt <> beta would be acceptable... assuming we make an api sometime before never [00:13:09] heh [00:13:15] yes, that would also work [00:13:25] or jenkins -> salt-api [00:13:29] why the extra step? [00:13:57] assuming salt-api will support groups and multiple auth backends that's do-able [00:14:01] yep [00:14:05] it would just need keystone [00:14:13] give jenkins an openstack user [00:14:17] assign it to the project [00:14:18] done [00:14:28] for that usage it has keystone... for osm integration keystone needs w0rk.... acl wise it needs work anyway [00:14:51] yep [00:15:06] this is my end-goal for salt/openstack integration [00:16:38] Damianz, are you still actively working on salt/keystone or should I take on some of that? 
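The flake8 invocation quoted above uses real flake8 options (--exclude, --ignore, --exit-zero). Below is a minimal sketch of wrapping that same check in a script a CI job could vote with, assuming flake8 is installed; the flag set is copied from the conversation, and passing explicit file paths addresses the "doesn't end in .py" problem mentioned above. This is not the actual Wikimedia lint job.

    import subprocess
    import sys

    # Flags copied from the chat above; --exit-zero is deliberately dropped so a
    # voting CI job actually fails when flake8 complains.
    FLAKE8_CMD = [
        "flake8",
        "--exclude=migrations,.git",
        "--ignore=E501,E225,E121,E123,E124,E125,E127,E128,W404",
    ]

    def lint(paths):
        # Passing explicit file paths also covers scripts that lack a .py suffix,
        # which is the case that slipped past the linter above.
        return subprocess.call(FLAKE8_CMD + list(paths))

    if __name__ == "__main__":
        sys.exit(lint(sys.argv[1:] or ["."]))

In a Makefile, the same command would just sit behind a lint target, which is all the suggestion above amounts to.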
[00:16:51] Magic would be jenkins making vms, deploying code, testing and then destroying via salt heh [00:16:55] I haven't been following your progress but I stand at the ready if there are things I can do [00:17:14] andrewbogott: he started it some [00:17:18] andrewbogott: I've not really touched it tbh - there's a backend for basic password auth [00:17:26] he made a keystone external_auth plugin [00:17:33] but it only does basic keyston auth [00:17:34] Need a backend for token auth, a class for osm writing (maybe one that uses exec for now) etc [00:17:41] it doesn't fetch tokens or validate them [00:18:04] Damianz: osm writing? [00:18:06] I'm hoping to work on acls in general for saltstack for a work project, so might re-vist tokens then or duuno - feel free to jump in [00:18:26] Ryan_Lane: well it should have a class that abstracts salt-api to osm... I think [00:18:29] like it does nova [00:18:30] ok. Is your work merged into salt or on github someplace or just on your laptop? [00:18:34] oh [00:18:40] you mean osm -> salt-api [00:18:45] yeah [00:18:49] andrewbogott: it's merged upstream [00:18:54] andrewbogott: Module is merged in dev branch [00:18:59] ah, great. [00:19:07] I will try to catch up :) [00:19:13] password module isn't a bad start [00:19:21] but token would be best :) [00:20:15] ok. let's see how reporting is working for repo dependencies :) [00:20:21] token is only useful if you're using it behind something ;) [00:20:26] indeed [00:20:43] without group support, meh [00:20:52] heh [00:21:10] badly need group support - gonna sugest a change to eauth modules... and probably get rejected [00:21:30] hm. I think i have too many apache processes running on tin [00:21:52] it becomes pretty damn slow when I tell everything to fetch [00:23:11] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 195 seconds [00:23:39] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 203 seconds [00:33:28] !log awjrichards synchronized php-1.21wmf7/extensions/MobileFrontend/ [00:33:40] Logged the message, Master [00:34:07] !log awjrichards synchronized php-1.21wmf6/extensions/MobileFrontend/ [00:34:13] RECOVERY - MySQL Slave Running on db59 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [00:34:17] Logged the message, Master [00:36:49] New patchset: Asher; "setting db59 weight to 0 while replication catches up" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43382 [00:37:02] New patchset: awjrichards; "Enable forcing https for mobile login account/creation everywhere" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43383 [00:37:04] PROBLEM - MySQL Slave Delay on db59 is CRITICAL: CRIT replication delay 181135 seconds [00:37:20] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43382 [00:37:48] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43383 [00:38:07] !log asher synchronized wmf-config/db.php 'setting db59 weight to zero' [00:38:18] Logged the message, Master [00:39:36] !log awjrichards synchronized wmf-config/InitialiseSettings.php 'Force https for mobile login and account creation everywhere' [00:39:47] Logged the message, Master [00:40:49] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [00:46:04] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.001 second response time on port 11000 [00:48:46] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 3 
seconds [00:48:55] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 3 seconds [00:49:12] New patchset: Ryan Lane; "Display concise information by default" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43385 [00:49:23] reporting is hard [00:51:01] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43385 [00:57:45] New patchset: Ryan Lane; "Fix obvious syntax error (we need linting :( )" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43386 [00:58:43] anyone have questions for [[Theodore_Ts'o]]? ;-) [00:58:50] he's about 10 feet away [00:59:21] awjr: wooot, re that last sync [00:59:28] jeremyb: :D [01:03:57] New patchset: Ryan Lane; "Fix obvious syntax error (we need linting :( )" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43386 [01:08:50] is it OK if I sync-file a small EventLogging change? It improves event logging for mobile who just wrapped their deploy [01:11:48] spagewmf: mobile is done with it's deployment [01:11:53] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43386 [01:12:23] anyone have objections? [01:12:37] go for it [01:12:42] Ryan_Lane: still making progress? [01:12:52] yes [01:12:56] I broke some stuff [01:13:00] fixing it now [01:13:46] heh [01:13:58] the system breaking in the middle of a deployment is bad news [01:14:11] also the way I'm reporting on fetches isn't working :( [01:14:48] at minimum not when re-pushing the same thing [01:21:26] New patchset: Ryan Lane; "Also pull tags for submodules on fetch" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43388 [01:21:32] ^^ glad I found this bug before deployment and not during :) [01:21:55] !log spage synchronized php-1.21wmf7/extensions/EventLogging/EventLogging.hooks.php 'Log field fix for mobile' [01:22:08] Logged the message, Master [01:22:43] New patchset: Ryan Lane; "Also pull tags for submodules on fetch" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43388 [01:25:33] New patchset: Ryan Lane; "Also pull tags for submodules on fetch" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43388 [01:26:14] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43388 [01:29:55] New review: Reedy; "Guess I should note here that I'm removing mwversionsinuse as puppet provides an update version of it" [operations/debs/wikimedia-task-appserver] (master) C: 0; - https://gerrit.wikimedia.org/r/43356 [01:33:33] !log spage synchronized php-1.21wmf6/extensions/EventLogging/EventLogging.hooks.php 'Log field fix for mobile' [01:33:44] Logged the message, Master [01:34:31] and I am done 2660 enwiki {"token":"","userId":18235977,"userName":"ACUX mobile test 2","isSelfMade":true,"userBuckets":"","displayMobile":true," <-- !! [01:34:45] !log reedy synchronized php-1.21wmf7/extensions/ProofreadPage [01:34:51] thanks y'all [01:34:55] Logged the message, Master [01:35:18] !log Ran namespaceDupes.php on cswikisource [01:35:29] Logged the message, Master [01:35:49] Reedy, is your "deployment window" 167 hours wide? 
:) [01:48:41] New patchset: Dzahn; "replace SSH key for mflaschen - RT-4114" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43392 [01:48:58] New patchset: Ryan Lane; "git fetch and git fetch --tags is necessary" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43393 [01:50:09] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43393 [01:50:50] I'm learning all kinds of things I didn't care to know about git [01:52:03] ok. time to add the localization script in \o/ [01:52:19] New review: Dzahn; "gpg: Signature made Thu 10 Jan 2013 05:03:19 PM PST using DSA key ID 3BBDED59" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/43392 [01:52:20] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43392 [02:12:36] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:00] !log LocalisationUpdate completed (1.21wmf7) at Fri Jan 11 02:26:00 UTC 2013 [02:26:12] Logged the message, Master [02:32:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:33:06] !log killed convert processes on srv222 [02:33:16] Logged the message, Master [02:34:04] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.061 second response time [02:36:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.051 seconds [02:41:21] Hey anyone, should "udp://$wmfUdp2logDest/eventlogging; on a production server wind up in fluorine:/a/mw-log/eventlogging ? [02:43:45] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:44:39] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:45:24] RECOVERY - LVS HTTPS IPv4 on wikimedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 93078 bytes in 0.501 seconds [02:46:10] PROBLEM - LVS HTTP IPv6 on wikiversity-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:46:18] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 93078 bytes in 1.235 seconds [02:47:48] RECOVERY - LVS HTTP IPv6 on wikiversity-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 64068 bytes in 0.140 seconds [02:49:46] !log LocalisationUpdate completed (1.21wmf6) at Fri Jan 11 02:49:45 UTC 2013 [02:49:56] Logged the message, Master [02:52:54] !log killed convert processes on all image scalers [02:53:05] Logged the message, Master [02:53:14] bad gateway for enwiki [02:53:19] (502) [02:55:08] Site seems slow. [02:55:29] I got one load without CSS (maybe bits?). [02:55:38] Seems better now. [02:55:58] Or not. [02:56:15] yeah it's laggy [02:56:23] packet loss? [02:56:24] interestingly no messages in this channel about LVS [02:56:48] PROBLEM - SSH on lvs1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:57:04] TimStarling: https://en.wikipedia.org/wiki/Main_Page browse around while logged in? [02:57:14] Error 101 (net::ERR_CONNECTION_RESET): The connection was reset. [02:57:43] Just seems to hang intermittently. 
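On the udp2log question earlier in this stretch ("should udp://$wmfUdp2logDest/eventlogging ... wind up in fluorine:/a/mw-log/eventlogging?"): such a destination works by sending each log line as a UDP datagram tagged with a stream name, which the collector demultiplexes into per-stream files. A minimal sketch of that idea follows; the host, port and prefix behaviour here are assumptions for illustration, not the production configuration.

    import socket

    def udp_log(line, stream="eventlogging", host="127.0.0.1", port=8420):
        # One datagram per line, prefixed with the stream name so a collector
        # can route it to a per-stream file. Host and port are placeholders.
        payload = "{0} {1}\n".format(stream, line.rstrip("\n"))
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            sock.sendto(payload.encode("utf-8"), (host, port))
        finally:
            sock.close()

    udp_log("test event from a sketch")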
[02:58:01] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:58:27] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [02:59:40] RECOVERY - LVS HTTPS IPv4 on wikimedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 93078 bytes in 0.908 seconds [03:08:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:22:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.033 seconds [03:28:45] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [03:36:46] Ryan_Lane- Well, I've hung around late enough. Time for me to go to bed. [03:37:13] anomie: night [03:37:22] I think I'll be able to get it done soonish [03:37:59] If anything comes up that I should look at, send me an email or something. [03:38:27] will do [03:54:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:57:56] New patchset: Ryan Lane; "Add a retry option for fetch and checkout stages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43396 [04:01:12] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43396 [04:05:32] New patchset: Ryan Lane; "Prefer the deploy-info check to the runner output" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43397 [04:05:59] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43397 [04:06:28] I found a problem with today's E3 deployment. ori-l only ran sync-dir of the EventLogging extension in the 1.21wmf7 branch, so my later sync-file of one file to both wmf7 and wmf6 left wmf6 in an inconsistent state. The fix is to sync-dir of EventLogging in the 1.21wmf6 branch. Please sirs, may I? [04:07:11] you guys rolled back the earlier deployment, or things are broken right now? [04:07:47] if it's broken, then yes, fix it. make sure you aren't going into anyone's deploy window [04:08:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [04:09:01] Ryan_Lane No roll back, I just diagnosed this as the problem with mobile account creations on wikis still running wmf6. [04:09:17] if it's broken, yes fix it [04:10:16] Ryan_Lane. thanks, will do and I'll monitor it. Our other extensions are enwiki-only, thus wmf7 only, but EventLogging is more widespread. [04:10:25] <3 new deployment system [04:16:39] !log spage synchronized php-1.21wmf6/extensions/EventLogging 'EventLogging meant to be on 1.21wmf6 wikis as well' [04:16:49] Logged the message, Master [04:19:25] ... and now account creation on wmf6 produces frwiki {"token":"","userId":1464258,"userName":"S Page ACUX GS test 0110-8", "displayMobile":true, ..._valid: "true"} yay! [04:19:39] spagewmf: ... [04:27:56] TimStarling: rather than the cron on the deployment host pushing out the localization update, would it make sense to have the apaches pull it via a cron instead? 
[04:28:22] it's somewhat difficult to figure out when the fetch stage is properly done from an automated process on the deployment host [04:28:46] especially if some hosts aren't working properly [04:29:30] if, instead, the cron simply does a git deploy start/sync that doesn't push, it can be picked up by the client [04:29:49] though I guess that's problematic if times are off on the apaches [04:29:58] since they'd get messages at differenttimes [04:30:48] meh. I can push it. I'll move to the checkout stage if all report back, or if a timeout occurs [04:31:50] it can report to the log with statistics [04:35:58] New patchset: Aaron Schulz; "Set $wgMaxBacklinksInvalidate." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43398 [04:37:14] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43398 [04:38:29] !log aaron synchronized wmf-config/CommonSettings.php 'Set $wgMaxBacklinksInvalidate.' [04:38:40] Logged the message, Master [04:41:28] New patchset: Ryan Lane; "Use cmd.retcode for git clone" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43399 [04:59:48] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [04:59:49] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [05:04:00] New patchset: Ryan Lane; "Adding support for logging messages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43400 [05:07:59] New patchset: Ryan Lane; "Use cmd.retcode for git clone" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43399 [05:08:05] New patchset: Ryan Lane; "Adding support for logging messages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43400 [05:19:23] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43399 [05:19:38] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43400 [05:25:00] New patchset: Ryan Lane; "Import the correct module, this time" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43401 [05:27:34] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43401 [05:32:25] New patchset: Ryan Lane; "Add back in accidentally removed functions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43404 [05:33:15] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43404 [05:38:03] !log aaron synchronized php-1.21wmf7/includes/DefaultSettings.php 'deployed 9709a7d7e423c21e38a0436faa233ac0dcf04226 ' [05:38:14] Logged the message, Master [05:38:44] !log aaron synchronized php-1.21wmf7/includes/cache/HTMLCacheUpdate.php 'deployed 9709a7d7e423c21e38a0436faa233ac0dcf04226 ' [05:38:48] Logged the message, Master [05:43:30] New patchset: Ryan Lane; "Don't specify a cwd with git clone" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43408 [05:44:54] !log aaron synchronized php-1.21wmf7/includes/job/jobs/HTMLCacheUpdateJob.php 'deployed f0f528e254346f04d4afdb85b439cd20c0df93fe' [05:44:56] * Aaron|home wonders if Ryan has any midnight oil left [05:45:01] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43408 [05:45:06] Logged the message, Master [05:45:12] I'm honestly getting to the point where I don't think I do [05:46:21] just watch for the stupid mistake hockey stick graph [05:46:28] oh, I'm there [05:46:39] my last like 8 
commits were fucked up [05:47:58] stupid deployment host isn't on a public IP, so I can't run ircecho on it [05:48:14] now I'm going to need to modify ircecho to pop messages from a queue in redis [05:48:28] I've wanted to do this for a while anyway, but I didn't want to do it right now :D [05:48:46] * Aaron|home is still trying to understand how all the salt stuff comes together [05:48:59] there's only three parts to that system [05:49:05] deployment host [05:49:08] salt master [05:49:10] minions [05:49:31] well puppet doesn't look quite that simple :) [05:49:55] (the stuff in puppet) [05:50:09] well, that's because I have to configure salt with puppet [05:50:20] which is somewhat silly, but whatever [05:50:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:51:19] I'm using a number of components of the salt system: runners, modules, returners, pillars, grains, and the peer system [05:51:40] the deployment host makes a peer call to run a runner on the master [05:52:06] the runner calls fetch or checkout on the minions, specifying that they should return their data to the deploy_redis returner [05:52:31] configuration is stored in pillars (it's basically global configuration) [05:53:03] grains are static data that exists on each minion. it's like facts in puppet [05:53:29] I only use that to determine which site a minion in [05:54:12] * Aaron|home has been reading salt docs [05:54:38] I'm looking forward to using reactors :) [05:55:10] take actions anywhere based on events from anywhere? yes please [06:03:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.392 seconds [06:08:23] New patchset: Tim Starling; "Comment out db59" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43409 [06:08:51] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43409 [06:09:58] !log tstarling synchronized wmf-config/db.php 'depool db59' [06:10:11] Logged the message, Master [06:21:52] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [06:21:52] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [06:21:52] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [06:21:52] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [06:21:52] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [06:36:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:47:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.236 seconds [07:23:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:25:36] New patchset: Ryan Lane; "Add localization update as the l10n dep script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43410 [07:28:39] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43410 [07:39:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.052 seconds [07:54:53] New patchset: Ryan Lane; "Add parameters after the file check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43412 [08:05:03] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43412 
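Ryan's walkthrough above (runners call fetch/checkout on the minions, results flow back through a deploy_redis returner, and the deployment host waits for everyone or a timeout) maps onto two small pieces. A sketch of both follows, assuming the standard Salt returner interface (a module-level returner(ret) function) and a local Redis; the key names, helper names and timeouts are made up for illustration and are not the actual WMF modules.

    import json
    import time

    import redis

    def _conn():
        # Assumed Redis endpoint; a real returner would read this from config.
        return redis.StrictRedis(host="localhost", port=6379, db=0)

    def returner(ret):
        # Salt invokes this with the minion id, jid, function name and return
        # value once a job finishes on a minion.
        serv = _conn()
        serv.set("deploy:{0}:{1}".format(ret["jid"], ret["id"]),
                 json.dumps(ret["return"]))
        # Track which minions have reported for this jid.
        serv.sadd("deploy:{0}:minions".format(ret["jid"]), ret["id"])

    def wait_for_minions(jid, expected, timeout=120, poll_interval=5):
        # Deployment-host side: block until every targeted minion has reported
        # its fetch, or until the timeout, then return the stragglers.
        serv = _conn()
        deadline = time.time() + timeout
        missing = set(expected)
        while missing and time.time() < deadline:
            reported = serv.smembers("deploy:{0}:minions".format(jid))
            missing = set(expected) - {m.decode("utf-8") for m in reported}
            if missing:
                time.sleep(poll_interval)
        return sorted(missing)  # empty means it is safe to move to checkout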
[08:11:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:25:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.072 seconds [08:30:58] can someone review https://gerrit.wikimedia.org/r/#/c/43218/ please? [08:35:57] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 186 seconds [08:37:00] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 224 seconds [08:42:10] MaxSem: is it clean-killlist or clear-killlist? [08:42:45] fixing [08:44:37] renamed it at the very last moment:( [08:44:38] New patchset: MaxSem; "Cronjobs for GeoData" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43218 [08:45:19] not again... [08:45:42] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [08:46:11] New patchset: MaxSem; "Cronjobs for GeoData" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43218 [08:46:27] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [08:47:40] ok [08:47:52] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43218 [08:48:15] thanks a lot, apergos [08:48:38] lemme merge on sockpuppet first [08:49:12] running puppet on hume now [08:50:44] looks good [08:53:10] morning [08:54:11] yes it is [08:55:52] localization cache is going to suck [08:56:02] just saying [08:56:15] I'm looking at ways to make it suck less [09:04:45] New review: ArielGlenn; "I don't know why I'm on this list but sure, seems fine to me. I think Jeff Green might have been la..." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/42908 [09:07:18] New patchset: Hashar; "move PHP linter to a new `wmfscripts` module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29937 [09:07:31] New review: Hashar; "rebased" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/29937 [09:15:53] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 192 seconds [09:16:27] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 208 seconds [09:17:47] New patchset: Hashar; "wikimedia module placeholder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43420 [09:17:47] New patchset: Hashar; "move PHP linter under `wikimedia` module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29937 [09:18:54] New review: Hashar; "rebased on top of https://gerrit.wikimedia.org/r/#/c/43420/ which creates the wikimedia puppet module." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/29937 [09:23:30] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [09:26:29] cool [09:26:37] we should get more opinions on this [09:26:44] maybe mark [09:29:54] paravoid: wikimedia module ? 
:-D [09:30:10] I am going to move the manifests/misc/contint.pp pile of mess under it too [09:34:05] yeah [09:35:24] ideally wee would want to move everything as modules [09:35:34] then we can fully take advantages of puppet autoloading system [09:35:40] that might make puppet run a bit faster [09:35:52] and will definitely let us start writing rspec integration tests [09:37:09] oh yes, this is the goal [09:37:09] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [09:37:19] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [09:37:24] the wikimedia/ module is that I'd like to think about a bit [09:37:41] I know it's my idea but as I said yesterday, I haven't realy thought about it thoroughly [09:38:25] be bold, think later :-] [09:40:46] hey paravoid, you said you could review the OSM packages [09:41:46] New patchset: Silke Meyer; "Puppet files to install Wikidata repo / client on different labs instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42786 [09:43:11] paravoid: I will poke mark about the wikimedia module whenever he wake [09:43:11] up [09:44:36] New review: Silke Meyer; "Thanks for the proposals! I added two templates, repo and client. Now the role/labsmediawiki.pp is m..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/42786 [10:10:00] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [10:12:51] RECOVERY - MySQL Slave Delay on db59 is OK: OK replication delay 0 seconds [10:40:59] New patchset: Silke Meyer; "Puppet files to install Wikidata repo / client on different labs instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42786 [10:42:32] New review: Silke Meyer; "In this patchset, the xml dump for the client does no longer contain the language links." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/42786 [11:08:19] yeahh moar refactoring [11:08:23] New patchset: Hashar; "refactor continuous integration manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43429 [11:11:56] New patchset: Hashar; "refactor continuous integration manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43429 [11:29:47] memcached died on virt0 :-D [11:29:50] that is recurring [11:30:07] yes [11:30:23] oh [11:30:26] maybe I just lot my session [11:34:09] lunch bbl [12:36:44] back [12:49:55] New patchset: Hashar; "refactor continuous integration manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43429 [12:52:08] New patchset: Hashar; "refactor continuous integration manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43429 [12:53:14] lovely puppet [12:53:15] (File[/srv/org/mediawiki/integration/index.html] => Class[Wikimedia::Contint::Webserver] => Class[Wikimedia::Contint::Jenkins] => User[jenkins] => File[/srv/org/mediawiki/integration] => Class[Wikimedia::Contint::Webserver]) [12:53:16] ;) [13:03:53] New patchset: Hashar; "refactor continuous integration manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43429 [13:28:51] New patchset: MaxSem; "Raise throttle for the Pune event" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43432 [13:30:08] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [13:31:54] New patchset: Demon; "Add support for RT & Bugzilla tracking ids" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43433 [13:33:55] <^demon> MaxSem: ??? is not an int ;-) [13:34:07] ^demon, it's intentional;) [13:34:26] we don't know the number yet [13:39:37] MaxSem: OVER 9000 [13:40:44] !log reedy synchronized php-1.21wmf7/extensions/FlaggedRevs [13:40:57] Logged the message, Master [13:44:56] New patchset: Hashar; "refactor continuous integration manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43429 [13:45:14] New patchset: Demon; "Remove extension distributor mess from fenari" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41976 [13:49:03] New patchset: Demon; "Remove extension distributor mess from fenari" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41976 [13:50:13] New patchset: Demon; "Remove extension distributor mess from fenari" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41976 [13:51:45] New patchset: Hashar; "refactor continuous integration manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43429 [13:52:09] ^demon: what is replacing the lame extension distributor ? [13:52:36] <^demon> The awesome extension distributor :) [13:53:12] <^demon> We deployed it yesterday. [13:56:54] New review: ArielGlenn; "yay!" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/41976 [13:56:55] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/41976 [13:58:22] New review: Hashar; "See discussion in Ops mailing list." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/43420 [14:03:43] New patchset: Demon; "Allow ensure => absent on systemuser" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43435 [14:05:02] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43435 [14:08:37] !log updating Zuul 23ec1ba..ff79197 . Bring up better support for LOST builds. [14:08:49] Logged the message, Master [14:13:33] New patchset: Demon; "Remove xinetd service entirely" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43436 [14:14:32] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43436 [14:23:29] New patchset: Demon; "Don't ensure the group as absent, it conflicts with the dependency" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43437 [14:24:10] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43437 [14:25:54] <^demon> 4th time's the charm? [14:26:26] dunno yet [14:26:32] there's "puppet run in progress" [14:26:36] so I'm sitting here waiting it out [14:29:38] had the old catalog [14:29:40] running now [14:32:19] notice: /Stage[main]/Mediawiki::Former-extdist-removesoon/Systemuser[extdist]/User[extdist]/ensure: removed [14:32:21] \o/ [14:33:22] <^demon> Yay [14:33:28] however [14:36:05] <^demon> However? [14:36:41] ah no [14:36:52] the cron job got put back on the previous run [14:36:56] but this run it stayed gone [14:37:02] notice: /Stage[main]/Mediawiki::Former-extdist-removesoon/Systemuser[extdist]/Group[extdist]/ensure: created [14:37:11] we do have that anomaly but whatever :-D [14:40:00] <^demon> http://wikitech.wikimedia.org/view/Swift/Open_Issues_Aug_-_Sept_2012/Cruft_on_ms7#ext-dist - updated status. [14:40:16] oh yyyaaayyy [14:40:52] reedy cleared profiling data [14:41:00] sooo close [14:41:34] <^demon> Thanks for your help :) I'm glad to see this thing go away. [14:41:45] me too, thanks for slugging away at it! [14:43:18] PROBLEM - Varnish HTCP daemon on cp1043 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [14:44:03] PROBLEM - Varnish HTCP daemon on cp1044 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [14:45:20] New patchset: Reedy; "Make clear-profile runs be logged to SAL" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43440 [14:48:33] RECOVERY - Varnish HTCP daemon on cp1043 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [14:49:28] RECOVERY - Varnish HTCP daemon on cp1044 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [14:50:11] wonder why they died in the first place (I restarted em) [15:01:00] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [15:01:01] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [15:01:33] heads up, Ganglia is down: http://ganglia.wmflabs.org/latest/ returns There was an error collecting ganglia data (127.0.0.1:8654): fsockopen error: Connection refused [15:02:07] eh, labs [15:02:19] was reported on #-tech [15:02:23] nvm [15:10:14] Could someone please review a few simple script changes? 
https://gerrit.wikimedia.org/r/#/c/43440/ https://gerrit.wikimedia.org/r/#/c/40766/ and https://gerrit.wikimedia.org/r/#/c/40311/ [15:13:20] Reedy: did the one I trust :-] [15:13:54] https://gerrit.wikimedia.org/r/#/c/43440/1/files/misc/scripts/clear-profile did this one want a path for dologmsg? [15:14:35] Not sure if it needs one, but might make sense to put one anyway [15:15:06] incoming [15:15:16] New patchset: Reedy; "Make clear-profile runs be logged to SAL" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43440 [15:18:00] hm also wondr if we need ' instead of " with the ! [15:18:28] or an escape char [15:20:06] echo '!'"log" works, echo "!log" fails from bash command line (Reedy) [15:20:12] $BINDIR/dologmsg "!log $USER synchronized $DIR $MESSAGE" [15:20:12] $BINDIR/deploy2graphite sync-common-file [15:20:16] ^ sync-common-file [15:20:27] huh [15:20:34] * apergos does a script test [15:20:47] /usr/local/bin/sync-common-file [15:21:11] ok from script not from terminal [15:21:13] ok wfm [15:21:24] heh [15:21:31] hell yeah, consistency [15:21:34] sorry to be paranoid [15:21:44] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43440 [15:21:47] but I am :-P [15:22:45] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40766 [15:23:02] I'll push em all at once [15:23:19] thanks [15:23:36] <^demon> I think I'm going to step out and grab a late breakfast. [15:24:10] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40311 [15:25:49] ok live next time puppet runs [15:26:00] great [15:32:31] PROBLEM - Host analytics1006 is DOWN: PING CRITICAL - Packet loss = 100% [15:35:27] New patchset: MaxSem; "Raise throttle for the Pune event" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43432 [15:38:21] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43432 [15:41:45] !log maxsem synchronized wmf-config/throttle.php 'https://gerrit.wikimedia.org/r/#/c/43432/' [15:41:55] Logged the message, Master [16:08:54] do you guys have any tool to parse the log files and create a human friendly report out of it [16:08:55] ? [16:09:12] something that would aggregate log lines based on some regex and dump a daily mail [16:12:09] I did something like that as a one off [16:12:17] I think it was for db errors [16:15:55] Lock wait timeout exceeded, Deadlock found when trying to get lock, Duplicate entry [16:16:00] Really a one trick pony :p [16:18:43] New review: Hashar; "Paul Belanger has a Debian directory to build a Debian package at : https://github.com/pabelanger/je..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/24620 [16:19:25] Reedy: might end up writing one myself if I don't think anything already existing [16:19:44] paul? [16:19:45] really? [16:19:48] I know him [16:20:03] Paul belanger ? 
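For the log-report tool asked about above (aggregate log lines by regex and dump a daily summary), here is a minimal sketch; the patterns are the DB errors listed in the conversation and the default log path is only an example.

    import re
    import sys
    from collections import Counter

    PATTERNS = {
        "lock wait timeout": re.compile(r"Lock wait timeout exceeded"),
        "deadlock": re.compile(r"Deadlock found when trying to get lock"),
        "duplicate entry": re.compile(r"Duplicate entry"),
    }

    def summarize(lines):
        # Count how many lines match each pattern; unmatched lines are ignored.
        counts = Counter()
        for line in lines:
            for name, pattern in PATTERNS.items():
                if pattern.search(line):
                    counts[name] += 1
        return counts

    if __name__ == "__main__":
        path = sys.argv[1] if len(sys.argv) > 1 else "/a/mw-log/dberror.log"
        with open(path) as f:
            for name, count in summarize(f).most_common():
                print("{0:6d}  {1}".format(count, name))

Piping the output into mail from a daily cron gives roughly the report described above.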
[16:20:14] I am talking with him at least once a week in #openstack-infra [16:20:47] he packaged jenkins-job-builder for Debian :-] [16:21:10] hah [16:21:15] I know him from Debian VoIP [16:21:16] small world [16:21:47] he used to work in Digium [16:21:52] afaik [16:22:45] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [16:22:46] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [16:22:46] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [16:22:46] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [16:22:46] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [16:23:22] paravoid: passed him your salutations on your behalf [16:23:28] talking to him now :) [16:23:36] :-] [16:24:16] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 211 seconds [16:24:25] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 216 seconds [16:29:22] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [16:29:40] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [16:35:17] !log creating Jenkins jobs for GettingStarted and PHPExcel extensions. [16:35:28] Logged the message, Master [16:55:36] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 210 seconds [16:56:03] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 205 seconds [16:56:13] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 211 seconds [16:56:23] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 245 seconds [17:01:55] !log creating Jenkins jobs for Semantic* extensions [17:02:06] Logged the message, Master [17:04:46] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds [17:05:04] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [17:08:49] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [17:09:52] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [18:28:30] New patchset: Faidon; "Disable the IPv4 route cache across the cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43478 [18:49:57] New review: Andrew Bogott; "This is looking good! I've added a few inline comments -- those changes should be pretty trivial." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/42786 [18:55:48] !log xenon & caesium installing as parsoid [18:55:58] Logged the message, RobH [19:05:55] RobH: you're the old man of the server mountain --- where can I find something that operations people would find handy about geographic caching? [19:06:34] you mean like how we do it? [19:06:37] Client has an EU presence and runs servers in US. They're want their EU people to use their MW based in the US but they're saying it is slow. 
[19:06:39] yes [19:06:58] http://wikitech-static.wikimedia.org/articles/d/n/s/DNS_ed5f.html#Geographic_DNS [19:07:01] ack, wait [19:07:04] wrong link [19:07:04] thats bad [19:07:11] * hexmode does not click [19:07:17] i googled rather than direct to wikitech link [19:07:18] my bad [19:07:26] http://wikitech.wikimedia.org/view/Dns#Geographic_DNS [19:08:04] So we only load balance based on location the hits to the caching layer [19:08:20] we dont have proper load balancing for multiple datacenter sites as primary sites 'yet' [19:08:34] I'll see if they can do geo-dns [19:08:57] what is the plan for doing "proper" load balancing? [19:10:54] afaik the focus is on making eqiad the priamry [19:11:03] RobH: if we did another dc-hackathon, would you go? [19:11:04] then settling the load balance on backend issue [19:11:15] hexmode: I live in SF now, so prolly not ;] [19:11:25] WHAT????? [19:11:35] you moved without sending me a postcard [19:11:40] haha [19:11:48] I should probably pay closer attn to fb [19:24:43] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [19:33:33] hi Ryan_Lane, how'd the l10n stuff turn out? [19:43:04] New patchset: Reedy; "RT #2295: Run cleanupUploadStash across all wikis daily" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37968 [19:43:33] New patchset: Pyoungmeister; "migrating first round of eqiad DBs to coredb module. woo!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43485 [19:47:54] !log anomie synchronized php-1.21wmf6/maintenance/rebuildLocalisationCache.php 'Backported for git-deploy l10n work' [19:47:56] hashar: this is probably a better channel [19:48:00] or there [19:48:05] Logged the message, Master [19:48:12] so yeah beta needs mw/ext.git as a submodule which is itself a submodule [19:48:14] !log anomie synchronized php-1.21wmf7/maintenance/rebuildLocalisationCache.php 'Backported for git-deploy l10n work' [19:48:16] hashar: so, dealing with sub-sub modules isn't the most straightforward thing [19:48:24] Logged the message, Master [19:48:32] isn't git submodule update --init --recursive supposed to handle that ? [19:48:32] because the system needs to modify the .gitmodules file [19:48:40] (I have no idea where the minion script is) [19:48:46] nor what it does [19:48:51] otherwise they'll pull from gerrit, rather than the deployment host [19:49:04] oh of course [19:49:26] we could still rewrite the remote url in .gitmodules :-) [19:49:28] I can probably do a recursive find to find them and modify them [19:50:11] anyway right now we have a while() loop that does a git pull on each submodule and thus hit gerrit [19:50:18] so I guess it is not going to be worth [19:51:16] New review: Lcarr; "this is a security issue anyways." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/43478 [19:53:15] Ryan_Lane: anyway it is probably better to focus on the l10n cache stuff [19:54:35] l10n stuff is basically done [19:54:42] wheee [19:54:46] but for the long-term I'm going to need to switch it away from git [19:54:50] I'm writing an email about that right now [19:55:16] (switch away from git for l10n) [19:55:19] Ryan_Lane: also do you have a component in bugzilla for git-deploy ? 
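A sketch of the recursive .gitmodules rewrite discussed a few minutes earlier (so submodules and sub-submodules fetch from the deployment host instead of gerrit), assuming the URLs to rewrite match the gerrit anonymous-clone prefix; the deployment-host URL is a placeholder. After rewriting, something like git submodule sync would still be needed to propagate the new URLs into each submodule's config.

    import os
    import re

    GERRIT_PREFIX = re.compile(r"https://gerrit\.wikimedia\.org/r/p/")
    DEPLOY_PREFIX = "http://deployment-host.example/git/"  # placeholder URL

    def rewrite_gitmodules(root):
        # Walk a checkout (including nested submodules) and point every
        # .gitmodules URL at the deployment host instead of gerrit.
        for dirpath, _dirnames, filenames in os.walk(root):
            if ".gitmodules" not in filenames:
                continue
            path = os.path.join(dirpath, ".gitmodules")
            with open(path) as f:
                original = f.read()
            updated = GERRIT_PREFIX.sub(DEPLOY_PREFIX, original)
            if updated != original:
                with open(path, "w") as f:
                    f.write(updated)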
[19:55:23] no [19:55:25] we need to add one [19:55:35] I can do that [19:55:41] (trying to help) [19:56:43] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43485 [19:57:26] done Wikimedia > git-deploy [19:57:32] you and I are CC by default [19:57:53] cool [19:57:54] thanks [19:58:01] ok. lunch. back in a bit [19:58:04] almost wrote back in a git [19:58:09] ;) [19:58:13] enjoy your lunch [19:58:15] I need to work on something else :D [19:58:26] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [19:59:01] New patchset: Faidon; "Disable the IPv4 route cache across the cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43478 [19:59:11] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [19:59:31] Today is decision day, right? [19:59:57] Supposed to be, if I remember right from the meeting. [20:00:21] Or maybe it's pre-decision decision day [20:00:32] "We made the decision and jfdi" [20:01:35] * anomie checks meeting notes: "go/no go decision on git-deploy by 1/11" [20:01:58] "While Ryan_Lane was out for Lunch, the platform team got bored. git-deploy is now live" [20:03:33] New patchset: Pyoungmeister; "db1001 ganglia aggregator" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43493 [20:03:40] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [20:04:22] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43493 [20:05:22] New patchset: Asher; "returning db59 to s1" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43495 [20:06:02] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43495 [20:08:16] !log asher synchronized wmf-config/db.php 'repooling db59' [20:08:27] Logged the message, Master [20:10:52] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [20:13:35] PROBLEM - MySQL Slave Delay on es2 is CRITICAL: CRIT replication delay 4587782 seconds [20:15:23] PROBLEM - MySQL Slave Delay on es3 is CRITICAL: CRIT replication delay 4587890 seconds [20:15:41] PROBLEM - MySQL Slave Delay on es1001 is CRITICAL: CRIT replication delay 4587908 seconds [20:16:00] what's this es stuff about? [20:16:27] scale=3;4587782/86400 [20:16:27] 53.099 [20:16:30] ;-) [20:24:09] New patchset: Pyoungmeister; "mograting all eqiad slaves to coredb roles" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43501 [20:25:11] notpeter: typo! :-] [20:25:18] hashar: where? [20:25:28] also, thank you! 
[20:25:29] "mograting all eqiad slaves to coredb roles" [20:25:30] mograting -> migrating [20:25:31] hehe [20:25:36] bahahaha [20:25:48] I meant mograte :) [20:25:49] we should pass a spell checker on commit summaries :) [20:26:04] no such package: mograte [20:28:45] New patchset: Pyoungmeister; "mograting all eqiad slaves to coredb roles" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43501 [20:29:30] New patchset: Pyoungmeister; "mograting all eqiad slaves to coredb roles" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43501 [20:30:25] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43501 [20:33:18] New patchset: MaxSem; "WIP: OSM module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36222 [20:33:33] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.055 second response time [20:43:50] Is /usr/local/apache/conf used now for anything? [20:45:31] New patchset: Pyoungmeister; "migrating eqiad masters to coredb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43571 [20:46:45] There's not actually much work to make the /usr/local/apache symlink "farm". Get the /srv/deplyoment/mediawiki on the apaches [20:47:04] then just update common/common-local and uncommon [20:47:28] sync [20:47:31] update apache configs [20:47:33] graceful [20:47:34] profit [20:50:28] New patchset: Pyoungmeister; "migrating eqiad masters to coredb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43571 [20:52:02] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43571 [20:52:47] New patchset: Demon; "Fix gerrit patchset-created hook" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43574 [21:06:29] Reedy: today is indeed decision day [21:06:59] there's a few minor issues I'd like to take care of, but I think we'd good for monday from the system's POV, if you guys are OK with the configuration [21:07:04] New patchset: Pyoungmeister; "correcting regex" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43577 [21:08:10] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43577 [21:21:19] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43433 [21:22:12] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43574 [21:23:17] ^demon: ^^ [21:23:25] ^demon: any others? [21:23:39] https://gerrit.wikimedia.org/r/#/c/39385/ ? [21:23:49] <^demon> Yes, that one. [21:23:59] <^demon> https://gerrit.wikimedia.org/r/#/c/39196/ [21:24:23] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39196 [21:24:26] New review: Demon; "Ok, I cleaned this up. Plugins are stored in the git repo referenced--and the files are built using ..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/38798 [21:24:29] any others? [21:24:46] <^demon> I think 38798 is ready, if you're fine with how I'm doing the build-deploy cycle. [21:24:51] <^demon> For plugins. [21:25:23] that's fine [21:25:41] we could also deploy them via git-deploy [21:25:49] but maybe we handle that later? 
[21:25:50] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/11589 [21:26:05] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38798 [21:26:32] <^demon> Yeah, sounds fine when we get a chance to set that up. [21:26:44] we'll need a symlink for that [21:26:58] <^demon> Ah, gotta rebase 39385. [21:27:12] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/23872 [21:27:26] I'm not really supporting destination directories that are different than the deployment host directory [21:28:48] New patchset: Demon; "Make github replication config forward compatible" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39385 [21:29:20] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39385 [21:29:42] running puppet on manganese [21:31:35] Is gerrit down..? [21:31:44] yes [21:31:54] it's restarting [21:32:10] it's back up [21:32:14] Thanks! [21:33:22] New patchset: Pyoungmeister; "correct master logic for coredb role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43584 [21:34:09] <^demon> Ryan_Lane: You finished running puppet? [21:34:38] thanks, ryan. for a glorious moment, this was me: http://xkcd.com/303/ [21:34:56] :) [21:35:05] ^demon: yes the clone failed [21:35:11] <^demon> Bah. [21:35:15] err: /Stage[main]/Gerrit::Jetty/Git::Clone[operations/gerrit/plugins]/Exec[git_clone_operations/gerrit/plugins]/returns: change from notrun to 0 failed: git clone -b master https://gerrit.wikimedia.org/r/p/operations/gerrit/plugins.git /var/lib/gerrit2/review_site/plugins returned 128 instead of one of [0] at /var/lib/git/operations/puppet/manifests/generic-definitions.pp:671 [21:35:33] of course gerrit was down ;) [21:35:37] <^demon> Hahah. [21:35:39] <^demon> I'll clone by hand. [21:35:44] no [21:35:47] just re-run puppet [21:35:50] <^demon> Oh, yeah. [21:36:08] <^demon> Gerrit won't restart since config is already in place. [21:36:47] <^demon> executed successfully. [21:38:01] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43584 [22:01:25] New patchset: Pyoungmeister; "organize eqiad db node defs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43586 [22:05:44] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43586 [22:05:46] it would be good to have Tim in on this conversation [22:05:49] <^demon> So...rebuilding the cdbs is a pain in the ass, but the performance is really good. [22:05:51] we have a meeting about this later, right? [22:05:57] <^demon> Ryan_Lane: Indeed, he wrote the cdb stuff. [22:06:09] it would be great to have a saner way of handling this [22:06:19] siebrand: there :) [22:06:28] at minimum we'll need to switch deployment methods for this [22:06:46] adding new methods is easy enough [22:06:46] Tim-away wrote the CDB implementation, so if I may make a suggestion, I'd consult him and ask him to explain what and why, before looking into what to do next :). [22:06:54] siebrand: we're brainstorming more than anything [22:07:05] Creating one or more issues in bugzilla about what you would like to see improved, would aleo improve visibility and persistence of the observed issue. 
[22:08:26] maybe we could generate more cdb using the beginning of message keys as a prefix [22:08:43] so 'special-foobar' would end up in the special..cdb [22:08:47] <^demon> I don't think that's the problem? [22:08:51] it's not [22:09:01] then you get less cdb regenerated and less to transfer around [22:09:04] it's the fact that we're needing to deploy almost 1GB of messages [22:09:07] New patchset: Pyoungmeister; "arg, whitespace cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43587 [22:09:14] what are you trying to improve? Revision size when rebuilding localisation cache? [22:09:37] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43587 [22:09:39] siebrand: no, the fact that we need to deploy 1GB of data for minor changes [22:09:40] <^demon> Yeah. Putting them in git for git-deploy doesn't work well, since git doesn't find a good delta for the cdb files. [22:09:47] indeed [22:09:53] we all kind of knew that would be the case [22:10:12] I wasn't considering that to be a long-term solution [22:10:16] just a stop-gap for the switchover [22:10:17] one person says yes ,the other no. Are you talking about two different things, or is at least one of you confused, ^demon , Ryan_Lane ? [22:10:51] we're talking about the same thing [22:11:03] the problem is transferring large amounts of data for small changes [22:11:08] Then yeah and no as answers confuses me. [22:11:15] <^demon> I'm talking short term. [22:11:22] the root cause is the cdb files being huge and git not being able to generate a small delta when they change [22:11:26] indeed [22:11:31] so we end up having lot of data to transfer over the network [22:11:38] hashar: we'd have that either way [22:11:40] I believe we have that issue with rsync [22:11:46] generating on app servers locally might actually be better [22:11:47] the transfer size stays the same [22:11:49] that's using a new deployment model, different from the current one. Correct? [22:11:52] mark: it takes ages [22:11:56] mark: and that's kind of wasteful [22:11:57] how long? [22:11:59] mark: it drains too much CPU :/ [22:12:16] I can try out on gallium [22:12:28] it can take 5-10 minutes [22:12:32] <^demon> Hmm, what if we came up with a non-php way to rebuild localization files? [22:12:33] and it eats a lot of CPU the whole time [22:12:37] we general localisation caches on the twn server with 14 threads in about 15-20 seconds. [22:12:48] and we have a relatively slow vps. [22:12:59] general = generate. [22:13:00] siebrand: seriously? I ran it on tin and it took almost minutes [22:13:10] s/almost// [22:13:25] <^demon> Maybe we're doing it wrong ;-) [22:13:30] maybe [22:13:34] real 0m7.976s [22:13:35] Wow, I want one of those... it takes me almost 2 hours on my laptop [22:13:35] user 0m7.828s [22:13:36] sys 0m0.120s [22:13:40] Ryan_Lane: I'll make a "time" --force run now. There's a few extensions difference, but let me check. [22:13:47] (on gallium) [22:13:56] hashar: is that for just english, or all languages? [22:14:02] 371 languages rebuilt out of 371 [22:14:36] ahh [22:14:38] that is only core [22:14:41] indeed [22:14:44] real 0m20.012s [22:14:55] time b php maintenance/rebuildLocalisationCache.php --force --threads=14 [22:15:11] on beta it takes roughly 2 hours and a half [22:15:22] is it due to our thread count? [22:15:48] if we get them on all app servers we end up with a huge CPU spike [22:15:50] I've never ever heard anything longer than minutes. 
[22:16:09] so something is very wrong if it's longer than minutes. [22:16:38] does it have to do with LU maybe? [22:16:51] did anything change on stafford ? [22:16:57] all of a sudden it's not going into insane cpu spikes [22:16:58] we're not running LU on twn. [22:17:32] LeslieCarr: tim changed how often the clients restart [22:17:36] otherwise I don;t know [22:17:59] siebrand: I don't really know for beta. I guess it uses only one thread and most of the time is spent in waiting for I/O [22:18:05] siebrand: not a big deal anyway [22:18:14] hashar: is that running on glusterfs? [22:18:29] probably [22:18:32] that's why [22:18:44] hashar: well, with ~300 locales, 9000 seconds is outrageous. [22:18:45] blamin poor gluster [22:18:51] cool [22:18:53] every filesystem write goes across to two servers [22:18:54] glad that works [22:18:57] it's slow [22:19:03] and it's writing to SATA [22:19:10] that's in a raid 6 [22:19:29] we should have used raid10 for those [22:19:54] I should migrate that so the files are generated on /mnt and then replicate the .cdb files to the instance :-] [22:19:57] Hoe large is WMF's L10n cdb collection? twn's is 318M . [22:20:01] hoe = how [22:20:04] :) [22:20:08] siebrand: 750MB [22:20:19] siebrand: don't call me a large hoe [22:20:21] ;) [22:20:25] per branch [22:20:29] yeah [22:20:30] Ryan_Lane: hoe is how in Dutch :) [22:20:32] well, per slot [22:20:44] Ryan_Lane: Does gluster actually wait until `sync` is done for the file on 2 nodes before returning the fh? that would suck ass [22:20:51] Damianz: yes [22:21:08] I'd say generate the L10n cache locally, with max threads and nice. [22:21:11] it's a fully synchronous filesystem from what I've read [22:21:33] sure, cpu and io will spike, but that's what you have it for. [22:21:41] Hoe echt waar? (<-- google dutch) [22:21:50] w00t... so if you have x nodes it doesn't wait until 2 are synced and async x-2? [22:22:02] siebrand: actually, we have cpu to serve client requests ;) [22:22:05] siebrand - we're in america and we speak american here [22:22:10] Ryan_Lane: heh [22:22:18] yeah max threads would kill [22:22:21] LeslieCarr: more tired is more typos... [22:22:42] maybe we could use core affinity? [22:23:04] but then regenerating the cdb files on our hundreds of servers is not really power efficient :) [22:23:11] yeah, James_F claims that too with all his fancy u's ;) [22:23:11] agreed [22:23:18] I don't think doing the cdbs locally is a great idea [22:23:40] can't we deploy them asynchronously to another destination [22:23:41] it beats redesigning your deployment.... [22:23:43] it's really just a matter of transferring the files [22:23:46] <^demon> ms7? [22:23:47] * ^demon hides [22:23:51] power efficiency is a nonsense reason. [22:23:59] then once all minions have finished their bittorent transfer we can switch the app servers to the new mediawiki version [22:24:00] hashar: yes, that's what the idea in my email suggests [22:24:09] ah [22:24:13] I should read my emails maybe [22:24:19] have you considered using torrents for deployments yet? [22:24:26] Ha, I was totally going to say we should just bittorrent these. [22:24:29] Ryan_Lane: so can we generate a different format on tin, push that out, and have each box convert that to cdb? [22:24:33] some format that diffs well [22:24:34] Totally should just hiphop everything and bittorrent the binary out :D [22:24:38] siebrand: wasting cpu and power on hundreds of machines is wasteful to just cut down on data transfer [22:24:42] AaronSchulz: XML! 
[22:24:50] Damianz: we're going to use hphpvm, not hphp
[22:24:59] <^demon> siebrand, csteipp: Yes, we did discuss torrents.
[22:25:12] AaronSchulz: if it's efficient enough, yeah
[22:25:14] so anyway you got 1.21wmf7 with its cache. Get the new 1.21wmf7, regenerate the .cdb files, ask the minions to grab the new files and the new 1.21wmf7 code base. Once both processes are complete, ask the minions to switch to the new code base.
[22:25:21] that would require two more slots though
[22:25:25] AaronSchulz: it's still going to eat a bunch of memory and/or cpu to do it that way, though
[22:25:29] (two more slots per branch we support)
[22:25:31] so torrents fix the data transfer issues, even over WAN pipes.
[22:25:36] hashar: read your email :D
[22:25:44] problem solved, next one.
[22:25:51] Ryan_Lane: far less than from i18n source
[22:25:51] jesus. both of you read your email :D
[22:25:57] or we find out a way to transfer less data :-]
[22:26:05] AaronSchulz: true
[22:26:10] no point in transferring 750MB of data for just a few message changes
[22:26:13] hashar: drop l10n support
[22:26:20] lololol
[22:26:24] and I'm not convinced it would eat to much memory, if the file can be buffered in as you read (depending on format)
[22:26:28] *too much
[22:26:35] Reedy: na I love l10n. That is the first thing I started committing in mw iirc
[22:26:40] AaronSchulz: true… that may be a good way to do it….
[22:26:40] I mean yeah if you load the whole thing into an array :)
[22:26:46] then we could still use git for it
[22:26:54] ah, right
[22:26:59] I think we should make Wikipedia an internal site, and go back to biannual deployments.
[22:27:08] Then there's much less deployment traffic.
[22:27:23] Also fewer servers needed.
[22:27:30] Or just never update it, ever :D
[22:28:03] stable version reached. Done.
[22:28:17] <^demon> AaronSchulz, Ryan_Lane: So send out something that does delta well (heck, php arrays), then just have a dummy script that turns that into cdb.
[22:28:39] ^demon: it probably shouldn't be php arrays
[22:28:44] message files are already php arrays
[22:28:44] <^demon> Well, they'd delta well.
[22:28:57] <^demon> That, and we've already got the cdb support in php on all apaches.
[22:29:01] ^demon: we'd need to load the entire file into memory
[22:29:06] <^demon> So turning the array -> file is like 5 lines of php.
[22:29:11] Nikerabbit has a long cherished wish for our i18n format to not be executable.
[22:29:23] or we could get a base .cdb file that contains the messages as they were at midnight. Whenever a change needs to be deployed we generate a patch .cdb file that contains the new message.
[22:29:35] We have plenty of experience with YAML, so we could use that. Different project, though.
[22:29:42] Mediawiki could then load the midnight cdb file that contains everything and override its content with the patch .cdb
[22:29:48] ^demon: and loading it into memory on the app server would put it into apc
[22:29:53] <^demon> Bah.
[22:29:53] ^demon: that's bad times :)
[22:29:55] <^demon> True.
[22:29:55] so we end up having to deploy small cdbs
[22:30:09] hashar: we really need something properly delta-able
[22:30:17] xml or json or yaml?
[22:30:27] for god sake
[22:30:43] json presumably
[22:30:45] xml bloats
[22:30:48] yeah
[22:30:49] XML is too verbose.
[22:30:55] yep
[22:30:57] json and yaml are the same
[22:31:01] For readability, YAML would be preferred.
[22:31:01] yep
[22:31:02] effectively
[22:31:12] can you stream parse json?
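A rough sketch of the base-plus-patch idea described at 22:29:23 and 22:29:42, using plain dicts as stand-ins for the midnight .cdb and the small patch .cdb; this is only an illustration, not MediaWiki's actual LocalisationCache code.

    class PatchedMessageStore:
        # look in the small 'patch' store first, fall back to the big
        # 'midnight' base; stand-ins for the two CDB files described above
        def __init__(self, base: dict, patch: dict):
            self.base = base
            self.patch = patch

        def get(self, key: str):
            if key in self.patch:
                return self.patch[key]
            return self.base.get(key)

    def build_patch(midnight: dict, current: dict) -> dict:
        # only messages that changed since the midnight snapshot need to ship
        return {k: v for k, v in current.items() if midnight.get(k) != v}

The base file would only be redistributed once a day; between snapshots, deploys would move a patch that is a tiny fraction of the 750MB full set.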
[22:31:33] so we end up with an even larger dataset to move around?
[22:31:47] sounds to me json would be larger than a cdb
[22:31:49] hashar: no. because we'd only pass deltas
[22:31:51] Ryan_Lane: It's theoretically possible
[22:31:54] Not sure if anyone's done it
[22:32:00] <^demon> It'd only be larger on the first push.
[22:32:07] <^demon> Subsequent pushes would be small for json.
[22:32:11] * AaronSchulz likes it when RoanKattouw says "theoretically"
[22:32:22] so we might as well use the PHP files can't we?
[22:32:26] deltas are already relatively small if cdb is not part of the current deployment repo.
[22:32:28] I didn't actually mean "theoretically" in that sense this time? :)
[22:32:29] (regardless of the APC cache issue)
[22:32:32] indeed, hashar
[22:32:34] well, it may not actually be a big deal if we load the entire set into memory per json file
[22:32:39] I'm pretty damn sure a stream parser for JSON is feasible if not easy
[22:32:44] But I don't know whether one already exists
[22:32:59] we don't want JSON formatting. See before.
[22:33:10] readability won't matter much
[22:33:10] yaml, json, same same
[22:33:15] it would just be an intermediary
[22:33:18] AaronSchulz: well, we could actually switch the source to yaml
[22:33:21] rather than php
[22:33:21] siebrand: Why don't you want JSON again?
[22:33:30] RoanKattouw: Readability.
[22:33:35] http://fr2.php.net/manual/en/apc.configuration.php#ini.apc.filters A comma-separated list of POSIX extended regular expressions. If any pattern matches the source filename, the file will not be cached.
[22:33:39] * hashar whistles
[22:33:40] RoanKattouw: JSON has more formatting than YAML.
[22:33:41] Yeah minified JSON isn't readable
[22:33:41] Ryan_Lane: I'd prefer the source be the smallest possible thing
[22:33:42] they're very indifferent to each other
[22:33:44] But formatted JSON is
[22:33:47] hashar: CSV!!!!!
[22:33:59] ini
[22:34:19] AaronSchulz: I'm saying if we switched sway from using php files in mediawiki code
[22:34:25] *away
[22:34:38] oh, I thought you meant the cdb cache
[22:34:55] then we don't even need an intermediate format
[22:35:15] well, how soon do we want the git-deploy thing working well? :)
[22:35:27] but, yeah, that might make sense eventually
[22:35:29] true
[22:35:51] don't take a shortcut for short-term goals that hurts a long-term goal.
[22:35:55] though you end up loading all the stuff anyway the way the code is
[22:36:05] if you want to get rid of php files to hold the messages
[22:36:09] technical stuff
[22:36:11] and still want something which is flat
[22:36:18] just use .po files
[22:36:26] we can solve the short term problem with the method I outlined in my email
[22:36:28] or bittorrent
[22:36:28] it would only help tin use less ram then ;)
[22:36:29] hashar: sqlite db files
[22:36:36] <^demon> Ryan_Lane: I like bittorrent idea.
[22:36:55] the entries look roughly like:
[22:36:56] I'd like to solve the long-term problem at some point, though :)
[22:36:58] msgid "My name is %s.\n"
[22:36:59] msgstr "Je m'appelle %s.\n"
[22:37:05] so that should produce nice delta
[22:37:07] fuck gettext.
[22:37:22] burn fucking gettext.
[22:37:23] <^demon> Ryan_Lane: Long term--deploy everything with bittorrent ;-)
[22:37:40] ^demon: git is actually more efficient for text files
[22:37:42] <^demon> siebrand: I heard you don't like gettext.
[22:37:42] gettext is an i18n nightmare that should never have been invented.
[22:37:57] we can still use {{PLURAL}} in a gettext message, can't we?
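A small sketch of the "push a delta-friendly text format, convert locally" flow discussed above: compare the previous and new per-language message dumps (JSON is used here purely as an example serialization; the exact intermediate format was left open in the discussion) and emit only what changed. A tiny script on each app server would then fold the result into its local cdb, using the cdb support the chat says is already present on all apaches.

    import json
    from pathlib import Path

    def write_delta(old_file: Path, new_file: Path, delta_file: Path) -> None:
        # hypothetical file names; one such trio per language code
        old = json.loads(old_file.read_text()) if old_file.exists() else {}
        new = json.loads(new_file.read_text())
        delta = {
            'changed': {k: v for k, v in new.items() if old.get(k) != v},
            'removed': [k for k in old if k not in new],
        }
        # for a typical deploy this is a few KB instead of a multi-MB cdb
        delta_file.write_text(json.dumps(delta, ensure_ascii=False, sort_keys=True))

Because the delta is plain text keyed by message name, git and rsync both handle it well, which is the property the cdb files lack.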
[22:38:01] Rewrite MediaWiki in java
[22:38:06] <^demon> ^ That.
[22:38:07] 1 parameter per key.
[22:38:12] no gender
[22:38:24] http://en.wikipedia.org/wiki/BitTorrent
[22:38:29] 'Since 2010, more than 200,000 users of the protocol have been sued by copyright trolls.[6]'
[22:38:33] lol, npov
[22:38:36] :D
[22:38:39] PROBLEM - MySQL Slave Delay on es2 is CRITICAL: CRIT replication delay 203 seconds
[22:38:40] I should be clear: 1 plural parameter per key.
[22:38:42] <^demon> Hahah
[22:38:48] PROBLEM - MySQL Slave Delay on es1001 is CRITICAL: CRIT replication delay 211 seconds
[22:38:58] let me see how hard it would be to set up bittorrent
[22:39:02] New patchset: Pyoungmeister; "account for master/master case in coredb roles" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43590
[22:39:06] it may be a good short-term solution for this
[22:39:11] bittorrent is being used successfully for deploying software.
[22:39:15] Reedy: hashar doesn't that work automatically? :) "Our nTile PtoJ product is an automated software migrator that translates PHP source code into Java™/Java Enterprise Edition."
[22:39:24] siebrand: of course. it's used by a bunch of places
[22:39:25] And windows updates (bt)
[22:39:28] siebrand: my idea was to use gettext to simply load the message then let our usual l10n system transform it
[22:39:53] PROBLEM - MySQL Slave Delay on es3 is CRITICAL: CRIT replication delay 275 seconds
[22:39:57] hashar: gettext features are just not up to par with our i18n features.
[22:40:39] <^demon> Ryan_Lane: https://github.com/lg/murder
[22:40:52] siebrand: aka use .po to represent our key => messages, use the gettext PHP function to fetch it but not interpret it.
[22:40:56] meh
[22:40:58] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43590
[22:41:01] siebrand: gettext() is not going to understand $1 anyway :]
[22:41:04] I'm not replacing the entire deployment system
[22:41:10] <^demon> Bah, it's a gem.
[22:41:15] <^demon> *shudder*
[22:41:40] the new deployment system works very well for everything except huge binaries
[22:42:09] <^demon> Indeed. And it provides you with a more consistent state. Torrents you can end up with 1 file done and 3 files pending.
[22:42:13] and we can add different fetch methods for different repos as we wish
[22:42:14] <^demon> Which sucks for code.
[22:42:16] ^demon: yes
[22:42:31] I'm only going to use bittorrent for the fetch stage
[22:42:52] the checkout stage will copy over files that don't match
[22:42:59] <^demon> *nod*
[22:43:02] bittorrent will stick files into a cache
[22:44:06] Doesn't rsync sync individual blocks?
[22:44:21] (if you're looking at alternate fetch methods)
[22:44:57] Ryan_Lane: ^demon: what was the objection in having i18n files in APC? Having like 2GB of memory used by them?
[22:46:08] hashar: yes. that's an absurd amount of memory to waste
[22:46:15] and it's memory we don't have
[22:47:03] csteipp: yes. http://rsync.samba.org/tech_report/node4.html
[22:47:13] <^demon> We already hit apc limits with 2 deployments anyway.
[22:47:18] <^demon> 3 deployments usually breaks it.
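A sketch of the checkout step described at 22:42:31-22:43:02: after the fetch stage has dropped the new file set into a local cache by whatever transport (bittorrent, rsync, ...), only copy files whose content actually differs. Directory names are hypothetical; the fetch itself and any deletion of removed files are left out.

    import hashlib
    import shutil
    from pathlib import Path

    def sha1(path: Path) -> str:
        h = hashlib.sha1()
        with path.open('rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        return h.hexdigest()

    def checkout(cache: Path, live: Path) -> None:
        # copy over only the files that don't match the live tree
        for src in cache.rglob('*'):
            if not src.is_file():
                continue
            dst = live / src.relative_to(cache)
            if dst.exists() and sha1(dst) == sha1(src):
                continue  # unchanged, nothing to copy
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)

This keeps the fetch transport swappable per repo, as mentioned above, while the checkout stage stays the same everywhere.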
[22:47:35] :/
[22:47:43] or we get apaches per languages :-]
[22:47:47] New patchset: Pyoungmeister; "moving m2 nodes in eqiad to coredb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43592
[22:47:49] I should go to bed probably
[22:48:00] RECOVERY - MySQL Slave Delay on es2 is OK: OK replication delay 4 seconds
[22:48:01] RECOVERY - MySQL Slave Delay on es1001 is OK: OK replication delay 4 seconds
[22:48:36] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43592
[22:48:46] <^demon> I'm done tonight too. Have a good weekend everyone.
[22:48:53] New patchset: Lcarr; "moving ganglios to a more proper location" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43593
[22:50:05] notpeter: mutante ^^
[22:51:01] PROBLEM - MySQL Slave Delay on es3 is CRITICAL: CRIT replication delay 184 seconds
[22:51:17] good weekend!
[22:51:19] * hashar waves
[22:51:25] LeslieCarr: oh, rightly proper
[22:51:33] it's practically sipping tea :)
[22:52:46] :)
[22:52:49] RECOVERY - MySQL Slave Delay on es3 is OK: OK replication delay 0 seconds
[22:58:31] New patchset: Lcarr; "moving ganglios to a more proper location" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43593
[23:21:24] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43593
[23:25:24] New patchset: Pyoungmeister; "migrating es1007 and es1010 to coredb roles" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43597
[23:26:54] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43597
[23:31:13] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[23:35:53] New patchset: Pyoungmeister; "rest of eqiad es2 and es3 to coredb roles" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43599
[23:36:33] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43599
[23:37:12] paravoid: how is ceph going?
[23:37:46] it's 1:30 am on a saturday morning in athens
[23:37:51] faidon is probably...
[23:37:55] that means nothing
[23:37:55] well, he might be here :)
[23:38:01] true story
[23:45:54] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[23:58:57] LeslieCarr: hey, neon be spammin', yo