[00:16:16] Dzahn said he'd do the DNS settings today, what happened
[00:17:10] he had to take off unexpectedly
[00:20:03] (CR) Ori.livneh: "Cherry-picked in beta and deployed to beta cluster apaches" [operations/puppet] - https://gerrit.wikimedia.org/r/138891 (owner: Ori.livneh)
[00:25:55] sounds serious then
[00:26:50] should I get the change into SWAT
[00:26:57] No
[00:27:15] It needs ops to deploy it
[00:46:49] is there another along with https://gerrit.wikimedia.org/r/#/c/134939/
[00:46:52] that one looks merged
[00:47:11] if it is something fairly simple I can probably knock it out
[00:50:06] chasemp: What for?
[00:50:36] sorry, are you guys waiting one some DNS stuff?
[00:50:47] waiting on, even
[00:51:31] https://gerrit.wikimedia.org/r/#/c/140186/
[00:51:46] Pretty simple change :)
[00:53:59] Reedy would you mind +1'ing that so it has more context
[00:54:14] I can verify technical, but the should be I think I get it but you would know :)
[00:54:21] if you +1 it I'm prepared to push it out
[00:54:22] (CR) Reedy: [C: 1] DNS settings for wikimania 2015 wiki [operations/dns] - https://gerrit.wikimedia.org/r/140186 (https://bugzilla.wikimedia.org/66370) (owner: Withoutaname)
[00:55:43] (CR) Rush: [C: 2] "Based on my understanding of the previous years configuration, the bugzilla bug referenced, and since this is adding and not modifying exi" [operations/dns] - https://gerrit.wikimedia.org/r/140186 (https://bugzilla.wikimedia.org/66370) (owner: Withoutaname)
[00:56:29] ok, merged and synced on the dns servers
[00:56:39] it will take a bit but should be gtg
[00:56:47] thanks :)
[00:56:57] yeah
[00:57:05] !log added dns for wikimania 2015 (gerrit 140186)
[00:57:11] Logged the message, Master
[00:57:17] there's still no wiki actually running at the target yet
[00:58:11] heh, that seems like a good next step :)
[01:31:35] (PS2) Springle: Make dbstore1002 handle s7 analytics queries [operations/dns] -
https://gerrit.wikimedia.org/r/141849 (https://bugzilla.wikimedia.org/66068) (owner: QChris)
[01:32:11] (CR) Springle: [C: 2] Make dbstore1002 handle s7 analytics queries [operations/dns] - https://gerrit.wikimedia.org/r/141849 (https://bugzilla.wikimedia.org/66068) (owner: QChris)
[01:35:59] thanks Reedy
[01:36:10] and chasemp
[02:09:31] (PS1) Springle: Reassign db1049 to s5 [operations/puppet] - https://gerrit.wikimedia.org/r/141882
[02:14:30] !log LocalisationUpdate completed (1.24wmf9) at 2014-06-25 02:13:27+00:00
[02:14:38] Logged the message, Master
[02:15:32] (CR) Springle: [C: 2] Reassign db1049 to s5 [operations/puppet] - https://gerrit.wikimedia.org/r/141882 (owner: Springle)
[02:17:57] (PS1) Springle: Depool db1049 for reassignment [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141885
[02:18:30] (CR) Springle: [C: 2] Depool db1049 for reassignment [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141885 (owner: Springle)
[02:18:36] (Merged) jenkins-bot: Depool db1049 for reassignment [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141885 (owner: Springle)
[02:19:17] !log springle Synchronized wmf-config/db-eqiad.php: depool db1049 (duration: 00m 11s)
[02:19:22] Logged the message, Master
[02:25:13] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: Fetching origin
[02:25:23] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: Fetching origin
[02:27:00] !log LocalisationUpdate completed (1.24wmf10) at 2014-06-25 02:25:57+00:00
[02:27:04] Logged the message, Master
[02:39:41] !log xtrabackup clone db1005 to db1049
[02:39:46] Logged the message, Master
[02:40:05] (PS4) BBlack: [HAT] text-frontend VCL: set Content-Type if not set [operations/puppet] - https://gerrit.wikimedia.org/r/141086 (owner: Ori.livneh)
[02:40:34] ori: I just rebased and added back the "syslog is temporary" comment ^
[02:41:10] I've tried
it on some live caches and not seen any log entries, so looks much better, will merge it again tonight and see how things go
[02:49:59] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jun 25 02:48:53 UTC 2014 (duration 48m 52s)
[02:50:07] Logged the message, Master
[03:00:36] (CR) BBlack: [C: 2] [HAT] text-frontend VCL: set Content-Type if not set [operations/puppet] - https://gerrit.wikimedia.org/r/141086 (owner: Ori.livneh)
[03:03:07] springle: db1049 to s5 good to go for puppet-merge?
[03:04:10] (this, I mean: https://gerrit.wikimedia.org/r/#/c/141882/1 , it's outstanding on puppet-merge on palladium)
[03:04:17] oh sorry
[03:04:18] yes
[03:04:22] ok, merging
[03:04:29] thanks
[03:04:51] RECOVERY - Unmerged changes on repository puppet on palladium is OK: Fetching origin
[03:09:51] PROBLEM - Unmerged changes on repository puppet on virt0 is CRITICAL: Fetching origin
[03:10:11] PROBLEM - Unmerged changes on repository puppet on virt1000 is CRITICAL: Fetching origin
[03:11:57] heh
[03:12:13] RECOVERY - Unmerged changes on repository puppet on strontium is OK: Fetching origin
[03:13:12] RECOVERY - Unmerged changes on repository puppet on virt1000 is OK: Fetching origin
[03:13:51] RECOVERY - Unmerged changes on repository puppet on virt0 is OK: Fetching origin
[03:14:19] ^ fixed perms issues on the master checkouts from the manual fixups earlier
[03:18:09] ori: the DefaultType thing has rolled out to a bunch of various varnishes and no log messages yet, so I think that means we're no longer catching silly cases
[03:18:28] (and thus we should get useful logs, if any, when the related apache config goes out. maybe next time we're both here)
[03:30:00] bblack: hey! that's cool, thanks!
[03:30:40] i deployed the apache change on the beta cluster, too
[03:31:10] it's been a few hours now, so let's see if there are any log records
[03:35:08] there aren't any
[03:35:58] not surprising, but i'd like to validate the positive case too
[03:41:24] oh, since the above, I found a case I'm looking into
[03:41:45] I get syslog hits for public access to e.g. https://doc.wikimedia.org/puppetsource/files/
[03:41:57] turns out gallium has DefaultType None in its apache2.conf is why
[03:42:13] which I didn't find in an earlier puppet grep, so I'm trying to see what's up with that now
[03:42:52] maybe that file on gallium isn't puppet-managed
[03:44:32] well, it's the default
[03:44:41] and we don't override it except on the app servers, afaik
[03:44:53] so it's probably just the apache2.conf that comes with the apache2 package
[03:46:10] well hopefully that doesn't break anything?
[03:46:33] why would it?
[03:46:34] (switching various non-app servers to sending a new content-type: application/octet-stream where they had one before, in varnish)
[03:46:44] oh, i see what you're saying
[03:47:12] it should only be misc servers involved anyways, right? so nothing really production-critical?
[03:47:55] i can't think of anything especially critical, no
[03:47:57] but probably before that change, the browser could guess text/plain on some of those docs.wm.o files
[03:48:19] not sure if ci stuff also pulls from there and gets affected in any notable way
[03:48:50] well, we could make it a cluster_option something-or-other, but i'm inclined to wait
[03:48:54] Jun 25 03:20:37 cp1044 varnishd[19884]: DefaultType: /puppetsource/files/snmp/snmptt.init
[03:48:57] Jun 25 03:35:10 cp1044 varnishd[19884]: DefaultType: /puppetsource/README
[03:49:00] Jun 25 03:36:35 cp1044 varnishd[19884]: DefaultType: /puppetsource/files/ipmitool/ipmi_mgmt
[03:49:03] Jun 25 03:43:02 cp1044 varnishd[19884]: DefaultType: /nightly/gerrit/gerrit-2013-06-03.war
[03:49:10] (the middle two were from me, the outer two are organic)
[03:51:17] i'm trying not to be overconfident and to take seriously the possibility that something would be broken by this change, but it seems unlikely. but let me poke and see if i can get something to break.
[03:52:12] well what we're really looking for is new log entries when you switch off defaulttype in apache.
the higher risk is there and might cause us to change strategy regardless
[03:52:21] may as well go forward with that test at this point
[03:53:00] fine by me
[03:53:18] the md5sum of gallium's apache2.conf has a bunch of hits on google, btw
[03:53:36] i think it came from the apache module of yore rather than the deb, but either way it's generic
[03:54:03] ok
[03:55:10] this is the apache change:
[03:55:26] (PS5) BBlack: [HAT] Remove DefaultType directive from Apache config [operations/puppet] - https://gerrit.wikimedia.org/r/138891 (owner: Ori.livneh)
[03:55:36] yeah I'm rebasing
[03:56:10] (CR) BBlack: [C: 2 V: 2] [HAT] Remove DefaultType directive from Apache config [operations/puppet] - https://gerrit.wikimedia.org/r/138891 (owner: Ori.livneh)
[03:56:57] ok so it's rolling out slowly
[03:57:15] I have a command set up to periodically grep all the cache syslogs for DefaultType hits
[03:57:21] we'll see what happens
[03:57:34] excellent
[03:57:45] hey, i really appreciate this by the way :)
[03:57:52] np
[04:29:58] i don't see any other the ones you have pasted
[04:30:29] # salt 'cp*' cmd.run 'grep ": DefaultType:" /var/log/syslog' | grep varnishd
[04:30:41] shows just seven entries
[04:30:45] all gallium
[04:46:32] bblack: is it ok if i step out for half an hour? so far the impact looks minimal
[04:47:46] still just eight entries
[04:48:18] * ori does, but will be back shortly
[04:49:05] yeah
[04:49:18] cool, bbiab then
[04:49:20] it's all that docs stuff so far still
[05:05:31] !log springle Synchronized wmf-config/db-eqiad.php: repool db1049, warm up (duration: 00m 08s)
[05:05:36] Logged the message, Master
[05:10:11] (PS1) Aaron Schulz: Give "mergehistory" to sysops [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141892
[05:25:41] (CR) MZMcBride: "Why not uncomment "$wgGroupPermissions['sysop']['mergehistory'] = true;" in MediaWiki core's DefaultSettings.php?"
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141892 (owner: Aaron Schulz)
[05:40:44] still nothing
[05:40:48] (CR) Nemo bis: "That's what I expected too from the bug" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141892 (owner: Aaron Schulz)
[05:41:24] if things don't change drastically over the course of the next few days, we could probably drop that VCL
[05:42:15] i wish we had *something* from the app servers though, just to confirm we're sane
[05:59:38] (PS1) Springle: Depool db1021 for upgrade. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141895
[06:00:09] (CR) Springle: [C: 2] Depool db1021 for upgrade. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141895 (owner: Springle)
[06:00:15] (Merged) jenkins-bot: Depool db1021 for upgrade. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141895 (owner: Springle)
[06:01:10] !log springle Synchronized wmf-config/db-eqiad.php: depool db1021, db1049 to normal load (duration: 00m 07s)
[06:01:15] Logged the message, Master
[06:29:39] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:30:09] (PS1) Springle: Repool db1021, enable traffic sampling. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141896
[06:30:31] (CR) Springle: [C: 2] Repool db1021, enable traffic sampling. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141896 (owner: Springle)
[06:30:38] (Merged) jenkins-bot: Repool db1021, enable traffic sampling.
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141896 (owner: Springle)
[06:31:25] !log springle Synchronized wmf-config/db-eqiad.php: repool db1021 with traffic sampling (duration: 00m 09s)
[06:31:31] Logged the message, Master
[06:33:49] PROBLEM - MySQL Processlist on db1068 is CRITICAL: CRIT 89 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[06:34:49] RECOVERY - MySQL Processlist on db1068 is OK: OK 8 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[06:40:33] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.092 second response time
[07:11:53] <_joe_> good morning
[07:17:56] (PS1) QChris: Document that m2's configuration in coredb need not be accurate [operations/puppet] - https://gerrit.wikimedia.org/r/141899
[07:25:46] (PS4) Ori.livneh: Add a lightweight apache::site resource [operations/puppet] - https://gerrit.wikimedia.org/r/140242
[07:26:19] <_joe_> ori: while I'm at it, should I take a shot at this also?
[07:26:40] review it, you mean?
[07:26:44] sure, i'd love that
[07:26:49] i updated it to manage the symlink too
[07:26:50] <_joe_> yes
[07:27:03] as you and akosiaris suggested
[07:27:35] <_joe_> ok, I do have to prepare rcstream ssl and the main site SSL patches
[07:27:47] don't worry, i'm going to sleep, so i won't be breathing down your neck :P
[07:27:49] <_joe_> then I'll dedicate the day to CRs :)
[07:28:11] <_joe_> that was just my convoluted way to tell you rcstream goes live :)
[07:28:15] thanks! i'd like akosiaris to look too, i rushed things a little before
[07:28:33] _joe_: yes i have been bouncing in the office all day earlier because of it :)
[07:28:54] <_joe_> I'll sync with Krinkle|detached when he becomes attached
[07:29:07] _joe_: did you forget me ? :)
[07:29:09] cool cool
[07:29:11] good night!
[07:29:14] <_joe_> matanya: eheh
[07:29:19] <_joe_> ori: good night
[07:29:24] <_joe_> matanya: I did not
[07:30:21] <_joe_> matanya: in ~ 1-2 hours, I'll show you what we need to do with templates linting :)
[07:30:32] <_joe_> more like 2 hours
[07:30:45] sure, thanks
[07:35:20] (PS1) Springle: Sampling seems stable enough; raise load so the boxes are also useful. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141901
[07:36:20] (CR) Springle: [C: 2] Sampling seems stable enough; raise load so the boxes are also useful. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141901 (owner: Springle)
[07:36:26] (Merged) jenkins-bot: Sampling seems stable enough; raise load so the boxes are also useful. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141901 (owner: Springle)
[07:37:54] !log springle Synchronized wmf-config/db-eqiad.php: incremental LB bump on db1009 and db1021 traffic samplers (duration: 00m 07s)
[07:37:58] Logged the message, Master
[07:38:11] oh dear, icinga isn't a module
[07:38:17] matanya: any plans to modulize icinga? :)
[07:38:57] also anyone knows where I should look for prod's icinga server config?
[07:39:02] seems to be spread around everywhere
[07:39:03] YuviPanda: akosiaris was planning to do it, so i didn't
[07:39:08] ah
[07:39:20] what do you want to do YuviPanda?
[07:39:37] matanya: setup icinga for toollabs
[07:40:21] mostly manifests/misc/icinga.pp and modules/nrpe/
[07:40:22] I see nagios.pp
[07:40:31] should I discount that completely?
[07:41:01] YuviPanda: ops are planning to move off icinga, so i don't know if it is worth your efforts
[07:41:06] oh
[07:41:07] to what?
[07:41:17] any links / email about that?
[07:41:24] not sure if it 100% set yet
[07:41:42] diamond? but I thought that was for a different use case...
[07:41:59] you should ask mark or faidon i guess
[07:42:07] <_joe_> YuviPanda: it is rightfully spread everywhere
[07:42:18] alex would probably know too
[07:42:19] <_joe_> YuviPanda: you define monitoring entities in the modules
[07:42:22] _joe_: even the serverside stuff?
[07:42:43] _joe_: I'd expect an icinga module to define the things other modules would use
[07:42:49] that is manifests/misc/icinga.pp
[07:42:49] <_joe_> YuviPanda: what do you mean by 'serverside'?
[07:42:56] _joe_: neon, I presume.
[07:42:57] <_joe_> if you mean the configurations of services, yes
[07:43:03] install icinga and config it
[07:43:09] what matanya said
[07:43:28] that is mostly the case now
[07:43:45] <_joe_> YuviPanda: ok if you want, we can move in query and I can assist you on understanding how it works
[07:43:47] just the monitored services are called all around
[07:44:19] _joe_: sure! but before that do you know of current plans to move off icinga? I don't want to set that up on toollabs and then have prod move to something else...
[07:44:21] * matanya is also interested in such an overview
[07:44:51] <_joe_> YuviPanda: speak with alex, I don't think its something for the near-future anyway
[07:44:56] _joe_: yeah, doing it here anyway would be nice as well, I think. logged, etc.
[07:45:02] <_joe_> ok
[07:45:37] _joe_: ok, will do when akosiaris is around.
[07:45:40] <_joe_> so, the basic server-side config is _all_ defined in icinga.pp which admittedly is a mess and we should move to a module soon
[07:45:54] right.
and also has a lot of hard coded prod stuff in there
[07:46:11] <_joe_> then you have the problem of collecting monitoring entities
[07:46:33] <_joe_> so let's take as an example me and ori: when the rcstream service has been created, you'
[07:47:03] <_joe_> ll need to define monitoring and alerts for that too
[07:47:04] <_joe_> the right way to do that in puppet
[07:47:07] <_joe_> is defining an exported resource in the module
[07:47:24] <_joe_> and then collect resources on the icinga node
[07:47:46] right.
[07:48:10] <_joe_> what we also do is, we collect resources and then run (on neon) a script called 'naggen2' that generates the services.cfg and all other files from those collected resources
[07:48:29] <_joe_> this is probably going to be done differently in labs, where we don't collect stored configs
[07:48:38] <_joe_> (and we should, probably)
[07:48:41] right.
[07:48:43] line 649 in icinga.pp
[07:49:33] _joe_: also consider that I'm looking only to do it for toollabs. the current icinga setup is for all of labs (and isn't really puppetized, I think?). Setting it up for just toollabs means we can do exported resources too, since all of toollabs infrastructure is puppetized
[07:50:23] matanya: right. I also see the python script
[07:53:13] btw _joe_ did you fix ganglia_aggregator scoping issues ?
[07:54:10] <_joe_> matanya: ?
[07:54:28] <_joe_> matanya: it's defined at node-scope, so it's not a problem in puppet 3
[07:55:18] _joe_: manifests/ganglia.pp line 39
[07:56:24] <_joe_> matanya: as I said, that is a node-scope variable
[07:56:28] <_joe_> :)
[07:56:41] _joe_: matanya I'm considering just waiting for icinga.pp to be module'd before taking that on to toollabs, since there's a *lot* of prod specific things there.
[07:57:11] right, _joe_ my question is shouldn't it be defined on site.pp ?
[07:57:29] or i'm missing some scoping thing here
[07:59:00] <_joe_> matanya: it is.
[07:59:14] <_joe_> whenever it's not, it's undefined, hence false
[07:59:15] * matanya got lost
[07:59:26] <_joe_> sorry guys, gotta get back to PFS
[07:59:27] oh, now i get it
[07:59:27] <_joe_> :)
[07:59:49] thanks _joe_, enjoy, and rule out IE!
[08:00:03] <_joe_> we can't, sadly
[08:00:22] yeah, just wishing out loud
[08:03:20] good morning
[08:03:50] <_joe_> hi hashar
[08:04:07] _joe_: thanks for the help :) I'll poke around some more
[08:04:35] _joe_: may I divert you from PFS to talk about the Zuul puppet changeS?
[08:04:53] <_joe_> hashar: not now pls :)
[08:05:02] :-D
[08:05:09] ping me when you have some bandwidth
[08:11:18] <_joe_> did I ever told you I hate gerrit?
[08:12:04] _joe_: everyone does :)
[08:12:04] (PS7) Giuseppe Lavagetto: Improve nginx TLS/SSL settings. [operations/puppet] - https://gerrit.wikimedia.org/r/132393 (https://bugzilla.wikimedia.org/53259) (owner: JanZerebecki)
[08:12:27] apergos: what is the future of dataset2 ?
[08:15:08] (CR) Giuseppe Lavagetto: [C: 1] "I removed any reference to DHE ciphers, as having removed the ssl_dhparam directive broke them, and we don't want to use them anyways." [operations/puppet] - https://gerrit.wikimedia.org/r/132393 (https://bugzilla.wikimedia.org/53259) (owner: JanZerebecki)
[08:15:25] * YuviPanda waits for ^ to be merged so he could use those in labs
[08:19:22] thanks thanks thanks _joe_
[08:19:37] <_joe_> Nemo_bis: do you read ops@?
[08:19:43] no
[08:20:30] but I'm whitelisted to I can be seamlessly cc'd
[08:20:51] (not that anyone would ever *need* to do so :p)
[08:21:53] matanya: it will die eventually, once new data center is up an a dataset is there
[08:22:06] <_joe_> eheh no just to forward you my conclusions on PFS testing
[08:24:16] thanks apergos
[09:22:48] (PS1) Giuseppe Lavagetto: nginx: adding ssl config to the module [operations/puppet/nginx] - https://gerrit.wikimedia.org/r/141909
[09:25:49] (CR) Mglaser: [C: 1] "Go for it!"
[operations/puppet] - https://gerrit.wikimedia.org/r/140634 (owner: Catrope)
[09:42:38] _joe_: Hi
[09:42:52] ori:
[09:42:55] something to sync?
[09:45:59] <_joe_> Krinkle: so, I'm preparing SSL for rcstream
[09:46:32] Cool
[09:46:39] <_joe_> I was asking myself if we need to support old browsers or not.
[09:46:53] _joe_: I heard something the other day about rcstream serving lots of errors. Did you hear about that?
[09:46:57] <_joe_> I don't think so, right?
[09:47:02] _joe_: how old?
[09:47:10] <_joe_> also, do you know of any tool to load-test websockets?
[09:47:18] <_joe_> Krinkle: no I did not
[09:47:32] Hm.. in what way are you worried about browser support with SSL?
[09:48:12] <_joe_> Krinkle: if I disable SSL3 it will basically kill IE on XP
[09:48:27] <_joe_> I don't think that's an issue
[09:48:44] Hm.. I must say I didn't know about that.
[09:48:48] How do we do that in production?
[09:49:26] I'd really like to not add more columns to our browser support investigation when building an application I'd like to be able to assume that at least on the transport http/https layer stream is not going to be different than other wikimedia servers.
[09:50:00] <_joe_> Krinkle: the point is - IE6 or 7 did not support websockets, so...
[09:50:09] <_joe_> no point in supporting them :)
[09:50:18] In general IE6/IE7 should totally work. socket.io has built-in support for non-websockets. Remember that websockets is used here as a protocol, not as a browser feature per se. It falls back to mimicking the protocol over JSON-P, xhr polling, comet and even flash.
[09:50:29] _joe_: Be careful about assuming that, that's quite wrong.
[09:51:17] <_joe_> ok, it's just flipping a switch anyways :)
[09:51:37] <_joe_> (do we think people will use rcstream via a browser?)
[09:51:59] <_joe_> Krinkle: point is, any non-carefully-written client will then use SSL3 which is insecure
[09:52:03] Yes, that will be like 80% of usage initially.
[09:52:10] Because all other uses are using irc.wikimedia.org
[09:52:15] <_joe_> that will make me sad
[09:52:44] <_joe_> as we will give up better security for programmed clients for 0.22% of our user base
[09:52:58] <_joe_> which IMO is understandable on wikipedias
[09:53:28] <_joe_> but your argument about having equal browser support everywhere may be right.
[09:54:24] I can't really discuss that. I'd say we should inherit whatever we do for Wikipedia, if it makes sense there, here too. But at this point there is no user base or target difference for rcstream. Both target client-side and server-side, and no subset of audience kind. Rcstream will be lower traffic naturally but I don't think there is a reasonable assumptions clients will have typically more mode
[09:54:24] rn browsers with rcstream. They're the same users.
[09:54:44] e.g. not like VisualEditor.
[09:55:10] overall it's the same audience as api.php
[09:55:30] <_joe_> Krinkle: well, are those mostly browsers or programs?
[09:56:00] <_joe_> (I don't know if you got my original argument - it will not mean rcstream does not work with those browsers, just no SSL)
[09:56:47] connecting to non-ssl will not work without throwing insecure content warnings for people using https for Wikipedia. Which we're only intending to increase for general traffic afaik.
[09:56:50] <_joe_> if they're programs, we are doing everyone a bad service by allowing SSL3, and we should think about it. For now, however, let's not block because of this :)
[09:57:15] <_joe_> Krinkle: ok, agreed
[09:57:38] <_joe_> We'll phase our support for IE6 for https at the same time everywhere.
[09:57:41] I mean, I think we're going to a point where people don't choose HTTPS, we're just giving it to them. Which makes this more difficult, but oh well :)
[09:57:47] Sounds good.
[09:57:51] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[09:58:18] <_joe_> shit
[09:58:41] Yeah, mostly browsers.
Existing tools can much more easily migrate from api.php to rcstream (browser gadgets) than from irc.wikimedia.org to rcstream (server programs).
[09:58:42] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 3.004 second response time on port 8123
[09:59:09] and at least all the tools I'll be working on this year are mediawiki extensions operating client-side in the browser.
[09:59:46] <_joe_> Krinkle: ok. We should test ssl once I'm done with the changes
[09:59:52] _joe_: one question (2 really), we need better icinga checks for rcstream, where are those maintained?
[10:00:38] whatever is checking it now didn't test it well enough, maybe it bypassed some of the front-end layers and was only testing redis or the websocket backend.
[10:00:49] <_joe_> Krinkle: they should be maintained in the module
[10:01:01] <_joe_> Krinkle: what should we test?
[10:01:32] <_joe_> I mean, how can we understand easily if something's wrong without connecting via websockets :)
[10:01:40] I'm trying to find an example of e.g. the tests I see here from icinga-wm. e.g. where a script to count the number of processes is written, how it is configured with icinga, what it returns etc.
[10:01:42] <_joe_> we do have a stats endpoint
[10:01:47] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[10:02:08] <_joe_> Krinkle: sorry, I have to take a look at the search outage
[10:02:17] Why not make a connection? We do make /wiki/Main_Page requests and /wiki/Special:Random for mediawiki health as well, right?
[10:02:21] yeah, no worries
[10:02:40] Though opening and closing a socket is more expensive I guess.
[10:03:21] <_joe_> Krinkle: just KISS, also, what should I look at to understand if the stream is correct?
[10:03:34] <_joe_> remember, icinga checks are point-in-time checks
[10:04:40] Yeah, forgot the rcstream part though, I'm also interested in general. I tried various times over the last year but to no avail.
I want to know what those scripts look like (are they written in bash?), where they're stored, and how they're loaded into icinga.
[10:05:52] <_joe_> maybe we could write the error rate to /stats, track that in graphite, alarm on trends
[10:06:47] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123
[10:06:50] <_joe_> !log restarted lucene on search1015, it was stuck
[10:06:55] Logged the message, Master
[10:07:12] <_joe_> of course this is a fake recovery, still...
[10:08:02] Search really down?
[10:08:49] <_joe_> chasemp: isn't it like 3 AM there? don't worry I'm on it :)
[10:09:50] Got paged :) but sounds good call me if I can help
[10:11:18] <_joe_> chasemp: just for fun... 2014-06-25 10:10:51,690 [Thread-9] WARN org.wikimedia.lsearch.frontend.HttpMonitor - Thread[Thread-5006316,5,main] is waiting for 252245573 ms on /search/enwikinews/us?limit=1
[10:11:30] <_joe_> 252245573 ms :P
[10:14:51] (PS1) QChris: Take advantage of redis module again [operations/puppet/wikimetrics] - https://gerrit.wikimedia.org/r/141918 (https://bugzilla.wikimedia.org/66911)
[10:15:32] (PS4) QChris: Have Wikimetrics use the redis module's configuration again [operations/puppet] - https://gerrit.wikimedia.org/r/141120 (https://bugzilla.wikimedia.org/66911)
[10:17:00] (CR) jenkins-bot: [V: -1] Have Wikimetrics use the redis module's configuration again [operations/puppet] - https://gerrit.wikimedia.org/r/141120 (https://bugzilla.wikimedia.org/66911) (owner: QChris)
[10:17:13] (PS2) Faidon Liambotis: MX switch, part 2 [operations/dns] - https://gerrit.wikimedia.org/r/141427
[10:22:59] (CR) Giuseppe Lavagetto: [C: 2] nginx: adding ssl config to the module [operations/puppet/nginx] - https://gerrit.wikimedia.org/r/141909 (owner: Giuseppe Lavagetto)
[10:24:38] (PS1) Faidon Liambotis: spamassassin: add bayes path [operations/puppet] - https://gerrit.wikimedia.org/r/141919
[10:25:04] (CR)
Faidon Liambotis: [C: 2 V: 2] spamassassin: add bayes path [operations/puppet] - https://gerrit.wikimedia.org/r/141919 (owner: Faidon Liambotis)
[10:34:29] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: Fetching origin
[10:35:39] <_joe_> !log restarted lucene on search1016 as it was stuck there as well, once search1015 is up and running
[10:35:44] Logged the message, Master
[10:38:11] (PS16) QChris: Add backup role and scripts to wikimetrics [operations/puppet/wikimetrics] - https://gerrit.wikimedia.org/r/139557 (https://bugzilla.wikimedia.org/66119) (owner: Milimetric)
[10:38:48] (PS9) QChris: Enable the new backup role in wikimetrics if set [operations/puppet] - https://gerrit.wikimedia.org/r/139558 (https://bugzilla.wikimedia.org/66119) (owner: Milimetric)
[10:40:19] (CR) jenkins-bot: [V: -1] Enable the new backup role in wikimetrics if set [operations/puppet] - https://gerrit.wikimedia.org/r/139558 (https://bugzilla.wikimedia.org/66119) (owner: Milimetric)
[10:45:23] (PS1) Hashar: zuul: get rid of git_dir and zuul_url in server conf [operations/puppet] - https://gerrit.wikimedia.org/r/141924
[10:51:31] (PS2) Hashar: zuul: migrate merger definitions to merger.pp [operations/puppet] - https://gerrit.wikimedia.org/r/141487
[10:51:33] (PS2) Hashar: zuul: migrate server definitions to server.pp [operations/puppet] - https://gerrit.wikimedia.org/r/141488
[10:52:18] (PS2) Hashar: zuul: prefix server default template with 'zuul.'
[operations/puppet] - https://gerrit.wikimedia.org/r/141501
[10:52:41] (PS3) Hashar: zuul: merger now has its own default file [operations/puppet] - https://gerrit.wikimedia.org/r/141502
[10:54:27] (PS2) Hashar: zuul: split conf file for server and merger [operations/puppet] - https://gerrit.wikimedia.org/r/141572
[10:54:37] (PS2) Hashar: zuul: migrate statsd_host to zuul::server [operations/puppet] - https://gerrit.wikimedia.org/r/141657
[10:54:48] (PS5) Hashar: zuul: patch of doom (WIP) [operations/puppet] - https://gerrit.wikimedia.org/r/141663
[10:55:05] rebases rebases
[10:55:53] gotta get down to the rebases?
[10:56:25] (Abandoned) Hashar: Add quoted 'true' and 'false' to the typos file [operations/puppet] - https://gerrit.wikimedia.org/r/109073 (owner: Andrew Bogott)
[10:58:32] <_joe_> hashar: I'm terribly late, I'm figuring out now how we do manage SSL certs in production
[10:58:35] <_joe_> sorry
[10:58:46] (Abandoned) Hashar: lvs: generic::upstart_job() now uses boolean values [operations/puppet] - https://gerrit.wikimedia.org/r/118717 (owner: Hashar)
[10:59:09] <_joe_> (also, needing to install a new CA chain, sigh)
[10:59:22] _joe_: i can't really help on the SSL front :-(
[10:59:36] <_joe_> yeah don't worry I can manage that
[10:59:42] <_joe_> I simply need more time :)
[10:59:47] but RobH should have a good knowledge about how we deploy private certs and maintain chains
[11:00:13] <_joe_> yes, but RobH is sleeping, and it's a good thing I understand how it works
[11:01:35] _joe_: some lame doc at https://office.wikimedia.org/wiki/SSL_Certificates
[11:11:44] (CR) Faidon Liambotis: [C: 2] MX switch, part 2 [operations/dns] - https://gerrit.wikimedia.org/r/141427 (owner: Faidon Liambotis)
[11:11:58] here goes
[11:12:34] !log switching inbound email for wikimedia.org to polonium/mchenry
[11:12:39] Logged the message, Master
[11:12:48] (CR) Alexandros Kosiaris: [C: -1] "LGTM, only a lint alert."
(031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/141677 (owner: 10Filippo Giunchedi) [11:14:35] hashar: ping [11:14:44] i just sent my invoice paravoid, it better work :P [11:14:58] lol [11:14:59] Krinkle: pong [11:15:04] <_joe_> lol [11:15:18] (03CR) 10QChris: Add backup role and scripts to wikimetrics (033 comments) [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/139557 (https://bugzilla.wikimedia.org/66119) (owner: 10Milimetric) [11:15:32] hashar: Remember we talked about potentially proxying some (or all) traffic for integration slaves to speed up dependency downloading? [11:15:42] yup [11:16:04] hashar: I'd like to priorise that because npm is having stability issues. The local .npm cache folder is not reliable (and once we have separate vm execution, we won't have that local folder anyway) [11:16:15] I think both npm and pip maintains a per instance cache in jenkins-deploy homedir [11:16:32] ah good point regarding vm [11:16:37] https://integration.wikimedia.org/ci/job/oojs-core-npm/136/console [11:16:46] got corrupted, had to manually sudo rm-rf [11:17:03] just mailed myself, it went via polonium [11:17:36] (03CR) 10QChris: Add backup role and scripts to wikimetrics (032 comments) [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/139557 (https://bugzilla.wikimedia.org/66119) (owner: 10Milimetric) [11:20:26] (03PS1) 10Giuseppe Lavagetto: rcstream: add SSL support [operations/puppet] - 10https://gerrit.wikimedia.org/r/141931 [11:21:07] hashar: also, nodejs needs to be update to 0.10.x soon. Things are already starting to break downstream, work arounds are getting very annoying. 
[11:21:34] Krinkle: yeah I talked to gwicke about it a while back nothing moved since then though [11:21:54] Krinkle: cxserver by the language team depends on nodejs 0.10.x but it is going to be deployed on Ubuntu Trusty which has 0.10.x [11:21:59] Since these are browser modules tested using nodejs (not nodejs applications), production having 0.8 is not a problem. For Parsoid will want 0.8 testing. We could use nvm (node version manager), or separate slaves for each. [11:22:27] <_joe_> nvm [11:22:38] :) [11:22:40] <_joe_> if it's half as lame as rvm, good luck [11:22:41] we should upgrade to 0.10 [11:22:45] I would just bump everything to 0.10 [11:22:50] we tried a while back, but parsoid borked at the time [11:22:51] <_joe_> +1 for 0.10 [11:22:59] it was a full blown parsoid outage [11:23:02] iirc the blocker was some garbage collector issue in 0.10 which was preventing Parsoid to upgrade [11:23:04] but gabriel has fixed it since [11:23:05] but got fixed up [11:23:05] but for grunt/jshint/qunit/jscs etc. the common npm-test, we really need 0.10 because those have actually dropped 08 support. [11:23:10] 11:19:01 npm WARN engine jscs@1.4.5: wanted: {"node":">= 0.10.0"} (current: {"node":"v0.8.2","npm":"1.4.13"}) [11:23:15] that's emitted on every build right now [11:23:16] but we have packages ready and everything [11:23:50] ok, I'll file an RT ticket [11:24:09] hashar: Maybe we can upgrade the integration slaves first, separate from Parsoid/cxserver/production. So that at least the environment for tests aimed at browsers (not nodejs production apps) continues to work as it is broken right now. [11:24:14] we have https://bugzilla.wikimedia.org/show_bug.cgi?id=66056 ""Jenkins: Upgrade nodejs from 0.8.x to 0.10.x on wmflabs integration slaves """ [11:24:19] I am not aware of any ticket for parsoid [11:24:22] Ah, right. 
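Note on the engine warning quoted above (jscs wanted node ">= 0.10.0", current "v0.8.2"): a minimal sketch of why that check fails, and why such versions have to be compared as numbers rather than as strings. The helper names here are hypothetical and this is not how npm itself implements the check.

```python
def parse_version(text):
    """Turn 'v0.8.2' or '0.10.0' into a tuple of ints such as (0, 8, 2)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def meets_minimum(current, minimum):
    """Return True if `current` is at least `minimum`, compared numerically."""
    return parse_version(current) >= parse_version(minimum)


# The node version from the warning does not satisfy the requirement.
print(meets_minimum("v0.8.2", "0.10.0"))    # False
# A 0.10.x release would satisfy it.
print(meets_minimum("v0.10.29", "0.10.0"))  # True
# Plain string comparison gets this wrong, because "8" sorts after "1".
print("0.8.2" >= "0.10.0")                  # True
```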
[11:24:30] hashar: What's stopping us from doing that _today_ [11:26:05] Krinkle: nodejs is provided via debian package in apt.wikimedia.org which is also used by Parsoid [11:26:11] so gotta upgrade everyhing [11:26:23] you don't /have/ to, we can do pinning etc. [11:26:27] but we want to upgrade prod anyway [11:26:28] hashar: Well, this is labs. If I wnat to upgrade nodejs there, I can do that in 5minutes. [11:26:37] so you might just as well wait for that [11:26:38] using a ppa for all I care. [11:26:49] but I'd rather do it more correctly and maintainable. [11:27:11] Krinkle: then the slave will no more be fully puppetized. That is already the case with the local npm install [11:27:18] hashar: yep [11:27:29] so whenever a new slave is added we will have to manually update npm and figure out which nodejs ppa to pick [11:27:31] that is not nice [11:27:32] :D [11:27:56] I would prefer not having to administrate / manually tweak the slave but fully rely on puppet / apt.wm.o [11:27:58] i have a shell script attache to bugzilla . I know you forgot last time, but ut's quite easy. Run it once on a new slave and done. [11:28:23] if there is a way to ship both node 0.8 and 0.10 in apt, keep production to 0.8 but pin 0.10 for our slaves that would be ideal [11:28:31] I prefer puppet too, but we both know it hasn't been easy for it's been several months and it's still not puppetized. [11:28:48] I'm gonna have to finish this by today, one way or another. [11:29:20] Krinkle: faidon talked about pinning the package above. He might have a good solution :] [11:29:37] I don't know anything about apt, pinning, how to write that puppet or how to deploy that. [11:30:14] no clue either how two different versions of a package can be added [11:30:16] But I know a fair bit about aptitude and puppet [11:31:09] RT #7746 [11:31:23] hashar: Is https://github.com/wikimedia/operations-puppet/blob/production/modules/jenkins/manifests/slave.pp all that is puppetized? [11:31:24] any takers? 
[11:31:26] akosiaris maybe? [11:31:33] There's another manifest somewhere, right? [11:32:17] paravoid: wasn't this done ? [11:32:21] Where is 'role::ci::slave::labs' defined? [11:32:34] akosiaris: it was done, rolled back, and needs to be done again [11:32:49] rebuild the package or upgrade ? [11:32:51] akosiaris: I mention all that in the ticket, including the CI/Krinkle/hashar needs :) [11:33:03] rebuild the package probably, as my backport is probably outdated by now [11:33:06] then deploy [11:33:24] last time around it was just a matter of apt-get upgrade; restart parsoid [11:33:58] minor version outdated you mean I suppose. OK that seems easy [11:34:02] can we provide both versions in apt? [11:34:02] yes [11:34:09] just wondering [11:34:12] no [11:34:28] I guess you'll update the "one" version, but apply selectively [11:34:44] since we have ensure=>present, not latest [11:34:52] Krinkle: yes [11:35:42] and hope we keep track of where nodes have a version locally that doesn't match the latest one from our apt repository, so that in case of a server failure and setting up a new one, we know to downgrade. [11:35:51] Is there some kind of script that tracks that somewhere? [11:36:08] we can run salt across the cluster to check [11:36:14] we used to have a tool for that too [11:36:18] but someone uninstalled it [11:36:19] *g* [11:36:25] paravoid: :P [11:36:29] well, reformatted that server I should say [11:36:29] :P [11:36:35] not even that [11:36:38] A sort of 'todo' list for manual upgrades (like for apache right now) [11:36:53] more like migrated the server and did not migrate that part [12:02:45] !log upgraded etherpad.wikimedia.org to etherpad-lite 1.4.0 [12:02:50] Logged the message, Master [12:02:51] that was uneventful [12:02:53] PROBLEM - etherpad_lite_process_running on zirconium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^node node_modules/ep_etherpad-lite/node/server.js [12:02:59] meh... 
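Note on the alert above: the check counts processes whose arguments match an anchored regex. The sketch below only illustrates the matching logic, not the monitoring plugin itself. It assumes the upgraded service is launched with an absolute script path; that path is taken from the recovery message later in this log.

```python
import re

# Pattern from the CRITICAL alert above.
OLD_PATTERN = r"^node node_modules/ep_etherpad-lite/node/server.js"
# Pattern from the later RECOVERY message in this log.
NEW_PATTERN = r"^node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js"


def count_matches(pattern, command_lines):
    """Count the command lines that match the given regex."""
    return sum(1 for line in command_lines if re.search(pattern, line))


# Assumed command line of the running process after the upgrade.
running = [
    "node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js",
]

print(count_matches(OLD_PATTERN, running))  # 0
print(count_matches(NEW_PATTERN, running))  # 1
```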
[12:03:37] lol [12:03:40] hm, they change we way they call it [12:03:50] it is an alert problem not the service [12:04:18] Was gonna say, it did seem to be up :) [12:04:52] SCRIPTPATH=`pwd -P` [12:04:52] node $SCRIPTPATH/node_modules/ep_etherpad-lite/node/server.js $* [12:04:54] sigh.... [12:06:44] bd808|BUFFER: for when you wake up: I still see more APC spam in the logs then I expect. Is that stuff actually fatal or do we retry on it? [12:07:29] AFAIK it's not fatal [12:07:37] just maybe shlower [12:12:21] (03PS1) 10Alexandros Kosiaris: Update etherpad icinga check [operations/puppet] - 10https://gerrit.wikimedia.org/r/141942 [12:14:10] (03CR) 10Alexandros Kosiaris: [C: 032] Update etherpad icinga check [operations/puppet] - 10https://gerrit.wikimedia.org/r/141942 (owner: 10Alexandros Kosiaris) [12:14:43] RECOVERY - Unmerged changes on repository puppet on strontium is OK: Fetching origin [12:15:53] RECOVERY - etherpad_lite_process_running on zirconium is OK: PROCS OK: 1 process with regex args ^node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js [12:17:23] yey [12:18:49] <_joe_> \o/ [12:21:54] everyone getting their emails? [12:22:00] :) [12:25:10] !log Upgraded Zuul 9839edb..b7fc126 Brings patchset 20 of Zuul cloner ( https://review.openstack.org/#/c/70373/ ) [12:25:15] Logged the message, Master [12:31:54] (03CR) 10Giuseppe Lavagetto: [C: 031] [HAT] Load mod_version on application servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/141059 (owner: 10Ori.livneh) [12:32:36] _joe_: I think we need to decide on the puppet vs. 
apache-config issue soon [12:32:45] we have a similar issue with twemproxy fwiw [12:34:22] (03PS23) 10KartikMistry: cxserver configuration for beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [12:35:05] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Upgrade to 1.4.0 upstream version [operations/debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/141707 (owner: 10Alexandros Kosiaris) [12:36:28] <_joe_> paravoid: we will have this issue with anything that is config-related and is modified often and deployed in the way software releases are done [12:36:59] When does the twemproxy config actually change? [12:37:23] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Wed 25 Jun 2014 09:36:50 UTC [12:37:53] <_joe_> and I don't see tagged puppet runs as a good solution - still, configs should just be in the same place so it's probably the solution that sucks less. [12:38:31] (03CR) 10Alexandros Kosiaris: [C: 031] "Same here" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141059 (owner: 10Ori.livneh) [12:39:32] hashar: Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class role::cxserver::beta for i-00000421.eqiad.wmflabs on node i-00000421.eqiad.wmflabs - what it can be? [12:40:56] kart_: does the puppetmaster has a role::cxserver::beta ? [12:41:17] kart_: in some case you have to restart puppetmaster :/ on deployment-salt.eqiad.wmflabs : service puppetmaster restart [12:41:57] okay! [12:43:11] same. [12:43:11] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Reviewing your patch, I agree with Jan." [operations/apache-config] - 10https://gerrit.wikimedia.org/r/141062 (owner: 10Ori.livneh) [12:45:13] (03CR) 10Matanya: "need to include https://gerrit.wikimedia.org/r/141942 as well." [operations/puppet] - 10https://gerrit.wikimedia.org/r/107567 (owner: 10Matanya) [12:47:58] hashar: okay now. 
broken patch :) [12:48:09] kart_: :-] [12:48:22] kart_: have you made it a puppet module ? [12:48:42] !log cirrus rebuild update: started rebuilding group1's indexes yesterday. commons and wikidata finished their in place pass and started their from mediawiki pass. The remaining wikis are running their in place pass in alphabetical order and currently on frwiktionary. [12:48:46] Logged the message, Master [12:50:22] (03PS1) 10Alexandros Kosiaris: Sanitize nrpe::check title parameter [operations/puppet] - 10https://gerrit.wikimedia.org/r/141951 [12:52:55] !log cirrus rebuild update: starting from mediawiki reindex step for all alphabetical wikis that have finished so far [12:53:03] Logged the message, Master [12:53:04] (03CR) 10Alexandros Kosiaris: [C: 04-1] "styling change, otherwise LGTM" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/140242 (owner: 10Ori.livneh) [12:56:06] hashar: yes. still work-in-progress. Nikerabbit did it. [12:56:23] (03PS24) 10KartikMistry: cxserver configuration for beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [13:00:02] (03PS1) 10Matanya: gitblit: remove service monitoring to role class [operations/puppet] - 10https://gerrit.wikimedia.org/r/141952 [13:00:05] K4-713: The time is nigh to deploy Fundraising (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140625T1300) [13:02:25] (03PS2) 10Matanya: gitblit: move service monitoring to role class [operations/puppet] - 10https://gerrit.wikimedia.org/r/141952 [13:05:15] akosiaris: Can please trouble you to merge this small change please ? 
[13:05:38] (03PS25) 10KartikMistry: cxserver configuration for beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [13:08:17] (03CR) 10Alexandros Kosiaris: [C: 032] gitblit: move service monitoring to role class [operations/puppet] - 10https://gerrit.wikimedia.org/r/141952 (owner: 10Matanya) [13:08:42] thank you [13:11:15] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM, this will actually be very useful." [operations/puppet] - 10https://gerrit.wikimedia.org/r/140242 (owner: 10Ori.livneh) [13:11:16] _joe_: the strontium thing not getting the change this on puppet-merge happened again. Just now [13:11:50] <_joe_> akosiaris: it has happened yesterday as well [13:12:37] (03PS1) 10Hashar: contint: python-simplejson on slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/141959 [13:12:49] http://pastebin.com/hgSKekFK [13:12:53] could use a new python package on the jenkins boxes please ^^^ :-) [13:15:54] <_joe_> hashar: on it [13:15:57] (03CR) 10Manybubbles: "I'd love to look at this list!" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136338 (owner: 10Chad) [13:16:05] (03CR) 10Giuseppe Lavagetto: [C: 032] contint: python-simplejson on slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/141959 (owner: 10Hashar) [13:16:31] thanks :) [13:17:24] <_joe_> akosiaris: it's a mistery [13:17:27] <_joe_> really [13:17:51] it is git for some reason not figuring out the commiter's name/email [13:18:11] but why... and I have never witnessed that before the user changes [13:18:20] I mean when we used to login as root [13:18:25] <_joe_> it's not the committer, gitpuppet maybe? [13:18:40] <_joe_> akosiaris: how do you run puppet-merge? 
[13:18:45] sudo -s [13:18:47] puppet-merge [13:19:00] <_joe_> akosiaris: sudo -i should do the trick of being equal to login as root [13:19:13] and it works every single time until this one time a few minutes ago [13:20:14] ori: one deftype hit that's not immediately-obviously docs/integration: Jun 25 12:53:53 cp1043 varnishd[26311]: DefaultType: /font/fontawesome-webfont.woff?v=3.2.1 [13:20:24] _joe_: yeah I know but this could also be coincidental and have nothing to do with the user changes [13:20:58] <_joe_> akosiaris: I think it happened forever and we noticed because of icinga [13:21:12] <_joe_> we did not have any warning on this before [13:21:13] niah. I 've never gotten that error before [13:21:34] it is the first time I see the git pull failing [13:21:59] and if anyone else saw it and they did not speak up ... well shame on them [13:22:40] Warning: /Stage[main]/Gitblit/File[/etc/apache2/sites-enabled/git.wikimedia.org]: Ensure set to :present but file type is link so no content will be synced [13:22:43] sigh... [13:23:01] it happened yesterday too while I was unsubmoduling, because I went into /var/lib/git/ops/puppet to rm/pull, and that left a subtree owned by user root instead of gitpuppet [13:23:15] but I went back and fixed that with chown -R gitpuppet later [13:24:10] what was wrong with it this morning, exactly (i.e. what was git status saying there)? [13:25:20] <_joe_> bblack: simply enough, git refuses to run because it cannot determine the identity of user gitpuppet on strontium [13:25:21] bblack: it sure was not because of permissions this time. Cause I reran manually the /bin/sh -c 'cd /var/lib/git/operations/puppet && git pull && git submodule update --init' as gitpuppet on strontium and it was ok [13:25:24] <_joe_> mmmmh [13:25:42] <_joe_> akosiaris: I guess it's some sort of foolish env variables circus [13:25:54] could be [13:26:06] akosiaris: you did that as user gitpuppet, right? 
[13:26:10] yes [13:26:12] sudo -i [13:26:37] which is the command run on strontium when you ran puppet-merge on palladium [13:26:47] _joe_: where do you see the cannot determine identity bit? [13:27:00] <_joe_> bblack: http://pastebin.com/hgSKekFK [13:27:01] bblack: http://pastebin.com/hgSKekFK [13:27:03] lol [13:27:09] <_joe_> lol [13:27:32] <_joe_> (a perfect simmetry is always preferrable) [13:27:40] wow, lots of warnings spam on our masters these days [13:27:41] Jun 25 13:26:27 strontium puppet-master[4127]: Variable access via 'nameservers' is deprecated. Use '@nameservers' instead. template[/etc/puppet/modules/base/ [13:27:44] templates/resolv.conf.erb]:11 [13:27:48] x1000000 [13:28:17] <_joe_> bblack: puppet 3 [13:28:36] <_joe_> bblack: that's the ton of warnings we should lint sooner or later [13:29:10] <_joe_> (or change the logging level of the puppet master) [13:29:19] _joe_: but the universe is not symmetrical as it seems. There was even a nobel prize for that a couple of years ago. But I will not shut up because you obviously know better [13:29:39] hashar: I'm going to depool integration-slave1003 because it is throwing up on npm corruption almost every other build. [13:30:47] <_joe_> akosiaris: the funny thing is phylosophically our actual model of fundamental physics is really really similar to what Epicurus theorized [13:30:54] !log Depooling integration-slave1003 as almost every other -npm build on this node fails due to corrupted ~/.npm cache [13:30:58] Logged the message, Master [13:33:42] _joe_: wait what am I missing about puppet-merge? I look at the script in /usr/local/bin and it doesnt even have code to hit strontium [13:34:04] is it tied in via git hooks? [13:34:47] oh it is [13:35:38] matanya: heads up - it looks like in our march of progress we've scheduled switching cirrus to primary for hewiki - also, I believe I've repaired the missing pages [13:35:58] Thanks manybubbles! 
[13:36:00] I'm reasonably sure they came from the busted analyzer - the saneitizer tool we built indexed all the missing pages [13:36:08] Krinkle: why don't you just clear out the .npm cache? [13:36:13] please let me know if there is rebelion in the village pump [13:36:17] that is very good news [13:36:17] or just complaining [13:36:20] hashar: I've done that 20 times already today [13:36:33] i will do manybubbles [13:36:35] thanks! [13:36:44] Krinkle: could it use a different npm version than on the other boxes? [13:37:32] hashar: yes... [13:37:46] Is the local cache shared between instances? [13:37:54] na it is local to the instance [13:38:01] ok [13:38:09] There is a minor version difference indeed. [13:38:15] I'll see if that fixes it mabe [13:38:20] somewhere under /mnt/home/jenkins-deploy [13:38:24] Yeah [13:41:45] Krinkle: the other slaves have npm 1.4.5 [13:41:52] Yeah, 1.4.13 [13:41:56] 1003 has version 1.4.13 [13:42:10] The newer version shoudn't be causing issues, if anthing it should prevent/fix [13:42:13] I don't want to downgrade 1003 [13:42:18] checking https://github.com/npm/npm/releases now [13:43:06] I am not sure why it would suddenly cause issue though [13:43:37] 1.4.15 has cache: atomic de-race-ified package.json writing (isaacs) [13:45:45] https://github.com/npm/npm/issues/5472 [13:46:41] how are we versioning what we get from npm? [13:47:39] bblack: We aren't. npm manages this internally. We don't do global installs like apt, but it's all kept locally in versioned subdirectories by npm. [13:47:47] The bug here though is about npm itself [13:48:16] Krinkle: might want to try with 1.4.15 ? 
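Note on the npm 1.4.15 changelog entry quoted above ("atomic de-race-ified package.json writing"): a minimal sketch of the general write-to-temp-then-rename technique, which keeps a concurrent reader from ever seeing a half-written file. This is an illustration in Python with a hypothetical function name, not npm's actual code.

```python
import json
import os
import tempfile


def write_json_atomically(path, data):
    """Write `data` as JSON to `path` without exposing a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(data, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace swaps the file in as a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "package.json")
    write_json_atomically(target, {"name": "example", "version": "1.0.0"})
    with open(target) as handle:
        print(json.load(handle)["version"])  # 1.0.0
```

Readers either see the old file or the complete new one, which is the property a shared cache directory needs when several builds write to it at once.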
[13:48:20] Yes [13:48:41] or upgrade to 1.4.10 since you say on the bug report that 1.4.11 might have introduce the issue [13:48:49] I guess what I mean is, are we installing specific versions of modules, or just generic >=1.2.3 pulled in from dep chains [13:48:53] though I guess you have a good reason to get a newer version :] [13:49:09] which means runinng your updates with the same input at a slightly later point in time might get a different set of installed versions [13:49:23] (03CR) 10Milimetric: Take advantage of redis module again (032 comments) [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/141918 (https://bugzilla.wikimedia.org/66911) (owner: 10QChris) [13:49:33] bblack: parsoid uses a syntax like async": "~0.8.0", [13:49:54] no clue what the ~ mean. Maybe "something reasonably similar as 0.8.0 (i.e. 0.8.2 might be fine) [13:49:58] hashar: It's on .13, you mean downgrade instead of upgrade to .10? [13:50:09] Krinkle: yup downgrade to .10 [13:50:15] if you suspect .11 cause the issue [13:50:35] bblack: The way versions are specified isn't really relevant since this is for continuous integration. What is installed is user-land and not our concern. [13:50:52] bblack: e.g. whatever-gerrit-repo/package.json [13:51:09] just sayin: if the exact version of every installed node.js module isn't a controlled thing, you're asking for mysterious diffs between machines that cause bugs [13:51:13] these are not kept after the lifetime of a build and only installed once. [13:51:24] for production the packages are put in a git repo and we deploy that copy instead of fetching from npm registry [13:51:33] shouldn't the modules installed for CI match what modules will be used when deployed? 
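Note on the "~0.8.0" question above: in npm's range syntax a tilde on a full MAJOR.MINOR.PATCH version allows patch-level updates only, so "~0.8.0" means at least 0.8.0 and below 0.9.0. A minimal sketch of that rule, with hypothetical helper names and assuming the spec always has three numeric components:

```python
def parse_version(text):
    """Turn '0.8.2' into a tuple of ints such as (0, 8, 2)."""
    return tuple(int(part) for part in text.split("."))


def satisfies_tilde(version, spec):
    """Check `version` against a '~MAJOR.MINOR.PATCH' range."""
    base = parse_version(spec.lstrip("~"))
    candidate = parse_version(version)
    # The upper bound is the next minor release, exclusive.
    upper = (base[0], base[1] + 1, 0)
    return base <= candidate < upper


print(satisfies_tilde("0.8.2", "~0.8.0"))  # True
print(satisfies_tilde("0.9.0", "~0.8.0"))  # False
```

This is why two installs of the same package.json at different times can end up with different patch releases.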
[13:51:51] (03CR) 10Ottomata: Add backup role and scripts to wikimetrics (032 comments) [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/139557 (https://bugzilla.wikimedia.org/66119) (owner: 10Milimetric) [13:51:56] bblack: vague versions is a bad habit yes, but that would never cause a bug here for us, that would be downstream gerrit developers giving themselves a hardtime. [13:52:01] (03CR) 10Ottomata: Add backup role and scripts to wikimetrics (031 comment) [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/139557 (https://bugzilla.wikimedia.org/66119) (owner: 10Milimetric) [13:52:04] bblack: and actually not true since it's a fresh install each build [13:52:09] so it's the same between all instances. [13:52:18] bblack: yup we do that for Parsoid. We run the test suite using whatever is provided by packages.json AND run the same tests using the git repo that holds packages for production. [13:52:43] and the few packages we do use globally between builds are at fixed versions. [13:54:14] I'll stop bugging about it since I haven't even looked. it sounds like the same issues every script language with a library repo has, though. If you're not locking down a fixed set of every involved package and using that for everything, eventually one of the packages that you depend on will have one of its sub-sub-dependencies upgrade on your silently and cause a bug, etc [13:54:18] (03PS26) 10KartikMistry: cxserver configuration for beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [13:54:46] (or cause a lack of a bug, of it's CI that's uncontrolled and prod that is) [13:54:51] bblack: yep, but if people do that it's not our problem, and yes people shouldn't do that and afaik we aren't using ~ or <= or .x version types [13:55:44] bblack: I guess you got thrown off by the fact that npm had a different version on two hosts. 
That's no different than woould've happened with apt if you ugprade it in the repo but only apt upgrade one of the nodes.. [13:55:51] I don't get the "not our problem", and I don't think just using == for your explicit dependencies solves it. Because when those are fetched, they have their own dependencies specified by their authors, which might contains loose version comparisons... [13:55:54] (03CR) 10Alexandros Kosiaris: cxserver configuration for beta labs (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [13:56:30] It ran different versions because I forgot to upgrade the (explicitly versioned install) on both nodes last month [13:56:51] (this has nothing to do with the npm version thing, that just coincidentally got me thinking) [13:58:01] the 'not our problem' refers to the case of individual gerrit repos (e.g. visualeditor) using tilde or range versions as dependencies in their package.json. if they do that and get a build failure if 1.3.x matches 1.3.10 instead of 1.3.9 when it is released due to a regression in 1.3.10 of package x, that's their own choice and probably what they want. [13:58:48] yeah, so long as CI on 1.3.10 -> deploy on 1.3.10 in that case [13:58:49] _joe_ Krinkle: gotta rush out (daughter sick) [13:58:51] either way, we consistently re-evaluate and fresh install on each build so it doesn't race or stale between execution slaves. [13:59:09] _joe_: I guess we will look at the Zuul puppet patches tomorrow :]  Hope you will have your SSL issue solved! [13:59:12] and on deployment, node_modules is always recompiled. [13:59:15] <_joe_> hashar: no problem [13:59:16] it seems like CI process failure to say it ran ok (on 1.3.10) and then deploy to prod which has 1.4.9 [13:59:20] <_joe_> hashar: I did I think [13:59:22] Krinkle: ping me by email if you need anything. Will be back later this evening [13:59:27] 1.3.9, I meant, [13:59:28] _joe_: congrats! 
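Note on the point above that exact versions for direct dependencies do not pin what those dependencies pull in themselves: a minimal sketch that walks a dependency graph and reports loose version specs at any depth. The manifests, package names, and helper functions are hypothetical illustration data, not taken from any real repository.

```python
LOOSE_MARKERS = ("~", "^", ">", "<", "*")


def is_loose(spec):
    """Treat range operators, wildcards and '.x' suffixes as loose specs."""
    return spec.startswith(LOOSE_MARKERS) or spec.endswith(".x")


def find_loose_specs(root, manifests):
    """Return (parent, dependency, spec) for every loose spec reachable from root."""
    found = []
    seen = set()
    stack = [root]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        for dep, spec in sorted(manifests.get(name, {}).items()):
            if is_loose(spec):
                found.append((name, dep, spec))
            stack.append(dep)
    return found


# Hypothetical manifests: the app pins lib-a exactly, but lib-a does not pin lib-b.
manifests = {
    "app": {"lib-a": "1.3.9"},
    "lib-a": {"lib-b": "~2.0.0"},
    "lib-b": {},
}

print(find_loose_specs("app", manifests))  # [('lib-a', 'lib-b', '~2.0.0')]
```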
[13:59:37] bblack: Yep [13:59:43] <_joe_> hashar: take care of your daughter! [14:00:45] and the problem is compounded because even if packages.json is explicit about the 3 packages it depends on, it's not explicitly about the 45 other packages that those 3 depend on [14:01:08] (03CR) 10Krinkle: rcstream: add SSL support (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/141931 (owner: 10Giuseppe Lavagetto) [14:02:20] akosiaris: hey! I heard rumors that you were 1. planning on moving icinga into a module, (and/or) 2. considering replacing icinga. Are either true? Asking because I'm planning on working on getting icinga monitoring setup for toollabs... [14:02:39] <_joe_> bblack: oh wait, people are trying to reinvent apt/dpkg in node and it's a FAIL? [14:02:52] <_joe_> how strange... [14:03:11] YuviPanda: I'm the sure module thing is planned either way, but the possible move was to shinken and I think there hasn't been much on that front yet. just fyi [14:03:16] it's not really just a node problem. I haven't seen one script language + modules repo fix this problem well, ever. [14:03:37] chasemp: checking out shinken... [14:03:38] chasemp: thanks! [14:03:52] YuviPanda: yes and yes. Altough 1) will become irrelevant if 2) comes to fruition [14:04:00] everyone ends up hacking around it by freezing static sets of module versions and treating them like one big binary dependency blob [14:04:07] shinken is the answer to 2 indeed as chasemp has already pointed out [14:04:22] akosiaris: true. I don't think I can do icinga on toollabs at least until (1) is done since icinga.pp has a lot of hardcoding in it. [14:04:28] or alternatively, mirroring the upstream package repo locally and freezing it at fixed points-in-time to version it as a whole [14:04:32] (03CR) 10Krinkle: "Hm.. ideally we wouldn't hardcode stream.wikimedia.org inside the generic manifest for rcstream. 
Aside from general best practice, this ac" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141931 (owner: 10Giuseppe Lavagetto) [14:04:36] akosiaris: I could experiment with putting shinken on toollabs if you guys are looking to move to it long term [14:04:58] we got a module already in puppet but it is in very early stages https://gerrit.wikimedia.org/r/#/c/124861/ [14:05:30] YuviPanda: you mean toollabs specifically right ? Cause labs already has an icinga instance [14:05:37] akosiaris: woo, cool! [14:05:40] the one maintained by DamianZaremba [14:06:01] akosiaris: yeah, toollabs specifically. Also from what I understand the icinga instance on labs isn't that great - nobody I know uses it, plus I don't know how puppetized it is (I was told not very) [14:06:48] akosiaris: plus there are long term plans to let individual tool authors be able to monitor things and have some form of notification setup (looong term), so having a tools specific one would be nice. [14:07:16] akosiaris: we've also been notified several times of things not working because their disks were full by people on labs-l and #-labs rather than by a tool... [14:07:32] YuviPanda: not surprised :-( [14:08:07] well, I am not so sure even shinken is going to help you much [14:08:25] all in all it is backwards compatible with nagios/icinga and all that [14:08:27] akosiaris: sure, I just don't want to spend time setting up icinga and then have to move. Plus I want to mirror prod's config, and icinga.pp seems very prod specific. [14:08:39] but the big problem you have is exported resources [14:09:05] we use it a lot to populate nagios/icinga config and labs can not have exported resources [14:09:07] akosiaris: right. I don't see why we can't use those with just tools, since it's fully puppetized. [14:09:10] akosiaris: oh, 'can not'? 
[14:10:06] yeah, exported resources come with quite some problems attached [14:10:15] !log Upgraded npm from v1.4.13 to v1.4.16 on integration-slave1003 to fix https://github.com/npm/npm/issues/5472 and repooling [14:10:20] Logged the message, Master [14:10:20] it is for example trivial for someone to clutter the puppet database [14:10:39] and fill up the tables with millions of useless resources (been there, done that btw) [14:10:44] ah [14:10:47] i... see. [14:10:55] !log Upgrade npm from v1.4.5 to v1.4.16 on integration-slave1001 and integration-slave1002 [14:11:00] Logged the message, Master [14:11:10] and then puppet runs are going to suffer horribly and blah blah [14:11:18] performance wise I mean [14:11:22] akosiaris: so assuming exported resources are a 'no' for toollabs, I guess I can't re-use prod's icinga? [14:11:36] I fear not [14:11:54] well there is a workaround however [14:12:03] well multiple [14:12:07] one is https://github.com/DamianZaremba/labsnagiosbuilder [14:12:23] not the project as it is, but ideas taken from it are not that bad [14:12:39] or even the project as it is if you feel like it, but it needs some love [14:12:48] hmm, right [14:13:09] another approach would be to have toollabs have their own puppetmaster, not shared with the master one for labs [14:13:29] hmm, right. That's something I've been considering, but does come with its own problems (generally just more complexity) [14:14:06] hashar has his own ones for the deployment prep project [14:14:16] akosiaris: but yes, having its own puppetmaster should enable us to use exported resources I guess [14:14:17] true [14:14:21] exactly [14:14:52] in fact I like that approach (the own puppetmaster one) [14:14:59] hmm, right. [14:15:11] that would let us mirror prod [14:15:59] * YuviPanda waves at andrewbogott [14:16:50] akosiaris: I guess it'll have to wait for modularization anyway. 
[14:17:38] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Wed Jun 25 14:17:32 UTC 2014 [14:18:14] YuviPanda: sure but it might take a while. In fact, I will be putting resources in shinken first cause it is more urgently needed [14:18:38] akosiaris: aaah, ok. either way, I'll wait for the prod montioring story to solidify before taking this on [14:22:48] mutante: can you add me to the 'graphite' project in labs? [14:23:48] YuviPanda: You might be better asking someone else if there is anyone.. [14:24:02] mutante: Reedy yeah, nevermind. andrewbogott added me [14:24:07] heh [14:38:39] _joe_: I have been looking at https://gerrit.wikimedia.org/r/#/c/141931/1 . Out of curiosity, why a new class for ssl and not a parameter ? [14:39:10] <_joe_> akosiaris: because I felt I had too many ifs at one point :) [14:39:16] that use_ssl = false line in proxy.pp came as a surprise [14:39:46] <_joe_> that can change but I wanted to avoid has_variable [14:39:56] <_joe_> but yes, that may sound weird [14:40:02] <_joe_> I can make that better in fact [14:40:28] well it is one extra if to merge them [14:41:05] <_joe_> mmmh no I count 3 [14:41:07] hmm yeah not so much. I see some extra variables here and there, but still [14:41:32] <_joe_> akosiaris: I can make it better, I think [14:42:11] ok. I just spoke because it got me well out of the blue and I had a hard time following the flow... [14:42:32] <_joe_> which kind of proves I did the wrong choice :) [14:44:48] Reedy: Something weird is happening with APC -- https://logstash.wikimedia.org/#/dashboard/elasticsearch/APC%20thrash [14:45:49] * aude wish i could click [14:45:56] +1 [14:46:15] <_joe_> akosiaris: oh I got to why I did duplicate the class. [14:46:41] <_joe_> akosiaris: I have to declare both classes on the same node. [14:46:57] The errors look to be pretty evenly spread across the cluster, but there is a very atypical bump starting around 8:00Z today. 
[14:47:04] <_joe_> the alternative is to transform that into a define, but I think that's even more complicated. [14:47:12] I wonder if we just need more memory assigned to APC [14:47:13] _joe_: yeah I figured that much. So use_ssl=both ? [14:47:26] <_joe_> so, has_variable [14:47:28] mutante: Can't we (you) create a ldap group for people not in the wmf who have an nda signed (aude, me, Daniel Kinzler and matanya probably)? [14:47:30] <_joe_> :) [14:47:32] I have no problem with both classes [14:47:42] but at least don't share the template then [14:48:13] or use content => template('template.erb', 'template-ssl.erb') [14:48:16] Reedy: I think we need to sample the APC utilization and hit rate on some servers. [14:48:28] if that is at all possible (and it might not be) [14:48:31] <_joe_> akosiaris: I thought I could share the template and use has_variable there. Sharing the template is a good idea I think [14:48:46] <_joe_> so that you don't have to duplicate common changes [14:48:59] bd808: Maybe _joe_ could do it this afternoon as it's not nearly bed time :) [14:49:00] hmm, true [14:49:58] <_joe_> Reedy: ehm ehm... I can do that _later_ [14:50:20] manybubbles: So which of us wants to SWAT today? [14:50:25] <_joe_> now I need to correct my rcstream patch, then I have to run an errand [14:50:29] _joe_: I wasn't meaning drop everything and do it now ;) [14:50:51] <_joe_> bd808: I think the best thing we can do is monitor apc via our monitoring tools [14:51:19] _joe_: That seems like a wise thing to do. [14:52:49] manybubbles: I guess I'll do it [14:52:57] hey, I can do it [14:53:02] anomie: I'm here [14:53:09] In the last hour the hosts affected are non-uniform. mw1176, mw1097, mw1223 and mw504 have far more errors than the rest of the cluster. 6832 errors in total too which is not what I'd expect for a wednesday. [14:53:12] manybubbles: Up to you, if you want it go ahead [14:53:16] got it [14:53:19] bd808: We'll need to put apc.php somewhere on the docroot... 
And presumably make it not externally accessible [14:54:09] manybubbles: submodule patches coming in a minute [14:54:13] for our thing [14:54:20] Reedy: Agreed. Or make a custom page that outputs something easy to feed to ganglia/graphite/icinga [14:54:24] greg-g: yt? [14:54:24] aude: cool [14:54:35] manybubbles: we need to talk about the 3 new es nodes again [14:54:38] now that the SSDs are back in them [14:54:53] aude: that patch thats on Deployments page is for wmf8 which isn't in production any more [14:54:55] ottomata: sure [14:55:04] I'm happy to have a look when you are ready to add them [14:55:14] reedy@tin:/tmp$ wc -l apc.php [14:55:14] 1362 apc.php [14:55:19] well, i'm going to try to reinstall 1017 today [14:55:33] ottomata: yeah [14:55:44] we need to look at this again (is it ok, I don't remembe rwhat it is): https://gerrit.wikimedia.org/r/#/c/138012/ [14:55:53] greg-g: not sure who to ask about this [14:55:53] https://rt.wikimedia.org/Ticket/Display.html?id=7737 [14:55:56] (03PS1) 10Reedy: Add apc.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141966 [14:55:56] but maybe you? [14:55:59] probably ok, ja? [14:56:37] manybubbles: we are on 'wmf8' version of our extension [14:56:38] (03CR) 10Reedy: "From php-apc 3.1.7-1" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141966 (owner: 10Reedy) [14:56:48] aude: oh, confusing [14:56:55] with wmf10 and wmf9 core [14:56:57] yes [14:57:57] thanks greg-g, I think that's good enough for me [14:58:07] :) [14:59:02] (03PS3) 10Ottomata: Add myself to releasers-mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/140634 (owner: 10Catrope) [14:59:14] (03CR) 10Manybubbles: "This patch will prevent an Elasticsearch node from joining the cluster when it doesn't have one of the plugins we need. 
Without it the no" [operations/puppet] - 10https://gerrit.wikimedia.org/r/138012 (owner: 10Ottomata) [14:59:16] (03PS2) 10Reedy: Add apc.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141966 [14:59:23] (03CR) 10Alexandros Kosiaris: [C: 032] zuul: mv python-voluptous in the array of packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/141486 (owner: 10Hashar) [14:59:25] (03CR) 10Reedy: [C: 04-1] Add apc.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141966 (owner: 10Reedy) [14:59:32] (03CR) 10Greg Grossmeier: [C: 031] "Yep." [operations/puppet] - 10https://gerrit.wikimedia.org/r/140634 (owner: 10Catrope) [14:59:34] (03CR) 10Ottomata: [C: 032 V: 032] "Greg G also +1ed this over at: https://rt.wikimedia.org/Ticket/Display.html?id=7737" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140634 (owner: 10Catrope) [14:59:37] (03PS1) 10Mark Bergsma: Add mr1-esams.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/141969 [15:00:04] manybubbles, anomie, aude: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140625T1500) [15:00:18] linked to the patches [15:00:28] (03CR) 10Mark Bergsma: [C: 032] Add mr1-esams.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/141969 (owner: 10Mark Bergsma) [15:00:48] (03CR) 10Alexandros Kosiaris: [C: 032] zuul: get rid of git_dir and zuul_url in server conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/141924 (owner: 10Hashar) [15:00:51] (03CR) 10Mark Bergsma: [V: 032] Add mr1-esams.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/141969 (owner: 10Mark Bergsma) [15:01:31] There is no limit to the number of times an aluminum can can be recycled. [15:01:49] manybubbles: you might want to touch our javascript (at least Wikibase/lib/resources/jquery.wikibase) [15:02:24] Wikidata/extensions/Wikibase... 
[15:02:30] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I'm not sure we need this full-blown script. It would be more than enough to have one cli tool that can export apc stats, that can thus be" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141966 (owner: 10Reedy) [15:02:38] aude: Shouldn't be needed over here [15:02:42] (03CR) 10Alexandros Kosiaris: "Oops, I just saw the "I will pair with Giusepe" part. Sorry. It is pending for merge due to another change though, so no harm done" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141486 (owner: 10Hashar) [15:03:07] hoo: shouldn't be but might be [15:03:07] aude: after I do the submodule update? [15:03:17] we can try without [15:03:32] i can verify on test.wikidata [15:06:19] manybubbles: busy? call later? [15:06:39] Lydia_WMDE: sorry! was busy but let me join [15:06:41] fine now [15:07:01] too demanding of manybubbles ;) [15:08:50] <_joe_> akosiaris: another great piece of puppet awesomeness, in case you missed it: http://docs.puppetlabs.com/guides/templating.html#testing-for-undefined-variables [15:09:52] _joe_: oh I have seen that a long time now. puppet sux [15:10:05] <_joe_> akosiaris: me too, but I keep forgetting that [15:10:06] (03PS2) 10Giuseppe Lavagetto: rcstream: add SSL support [operations/puppet] - 10https://gerrit.wikimedia.org/r/141931 [15:10:14] <_joe_> it's like the casting rules in php [15:10:24] <_joe_> ok, bbiab [15:10:25] php casting has rules?! :) [15:11:33] php has casting ? 
:-) [15:14:07] aude: deploy on hold while I talk to lydia for a bit [15:17:57] manybubbles: ping me again when ready [15:18:38] aude: back [15:19:41] ok [15:19:57] (03CR) 10Alexandros Kosiaris: [C: 04-1] contint: install Zuul on all CI slaves (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/141758 (owner: 10Hashar) [15:20:09] !log manybubbles Synchronized php-1.24wmf10/extensions/Wikidata/: SWAT - fix rtl issue (duration: 00m 12s) [15:20:09] aude: ^^^^^ that is wmf10 [15:20:14] Logged the message, Master [15:20:23] checking [15:21:15] so I just deployed wikidata like it was a regular submodule - update core then did a submodule update [15:21:19] test.wikidata looks good [15:21:26] manybubbles: yep :) [15:21:28] yep [15:21:44] sweet - I'll do wmf9 [15:22:19] and test2 is good [15:22:49] gah, i keep forgetting it's wednesday and i can also check wikidata instead of test.wikidata [15:22:55] PROBLEM - Host elastic1017 is DOWN: PING CRITICAL - Packet loss = 100% [15:22:58] (03PS4) 10Ottomata: Make sure that elasticsearch won't start if required plugins aren't available [operations/puppet] - 10https://gerrit.wikimedia.org/r/138012 [15:23:19] !log reinstalling elastic1017,1018,1019 [15:23:24] Logged the message, Master [15:23:30] (not running puppet yet, just reinstalling) [15:24:09] wikidata is good [15:26:59] aude: ready for doing wmf9 [15:27:06] ready to verify [15:27:17] aude: who is on wmf9 now then?
[15:27:22] wikipedia [15:27:23] !log manybubbles Synchronized php-1.24wmf9/extensions/Wikidata/: SWAT - fix rtl issue (duration: 00m 10s) [15:27:28] Logged the message, Master [15:27:32] I didn't know they had wikidata [15:27:50] they have the client [15:28:05] RECOVERY - Host elastic1017 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [15:28:18] i'm still getting cached js, but might have to wait a minute [15:28:50] try https://fa.wikipedia.org/wiki/%DA%AF%D8%B3%DA%A9%D9%85%DB%8C%D9%86%D8%AC%D8%A7%D9%86 [15:29:00] "add links" [15:29:11] I'm searching for a hewiki page w/o sitelinks [15:29:26] went to special:random about 10 times now :P [15:29:33] try the farsi one [15:29:36] same issue [15:29:55] * aude can actually understand some of it, like "random article" [15:30:15] PROBLEM - SSH on elastic1017 is CRITICAL: Connection refused [15:30:20] nice :) [15:30:23] now it work [15:30:24] s [15:30:24] Still broken for me, though [15:30:40] (03CR) 10Ottomata: [C: 032 V: 032] Make sure that elasticsearch won't start if required plugins aren't available [operations/puppet] - 10https://gerrit.wikimedia.org/r/138012 (owner: 10Ottomata) [15:30:41] :/ [15:30:48] it took a minute [15:31:11] fine for me now as well :) [15:31:16] manybubbles: We're good, thanks :) [15:31:28] ah crap [15:31:30] hoo and aude: cool! [15:31:31] impossible [15:31:31] oh no? [15:31:33] impossible [15:31:35] hoo: ? [15:31:37] dah [15:31:41] autoloader stuffs [15:31:42] not fixed? [15:31:46] will fix per hand [15:31:48] hang on [15:31:59] what? [15:32:16] autoloader stuff [15:32:21] paravoid, regarding my question from yesterday (the customer with the "i cannot connect to Wikipedia"): I've asked for more info, they are getting a timeout accessing Wikipedia through their "mcafee web gateway," according to them since Saturday, 1:53am UTC. Any idea? 
[15:32:25] see fatal log on fluorine [15:32:29] not supposed to be [15:32:36] *pm [15:33:12] mh [15:33:33] might also have happened during the deploy [15:34:51] i don't see the issue [15:35:09] I was on the box and grepped for the old autoloader name, no longer there [15:35:16] probably just a thing in transition [15:35:19] just one entry [15:35:43] out of 100k [15:35:56] Reedy: Maybe something like this to dump APC data for graphing/monitoring? -- https://gist.github.com/bd808/867dda34698717f11e8b [15:36:21] aude: I always look onto the tail of the fatal.log during Wikidata deploys [15:36:35] and whenever I see something I hop onto the box and fix it per hand :/ [15:36:36] yep [15:36:41] bblack: agreed on your points re dependencies & versioning [15:37:01] in Parsoid we are freezing everything in a deploy repo, and use that throughout [15:37:35] it's a pragmatic solution until the mystical package-based solution materializes.. [15:38:41] * gwicke wished there were stronger mechanisms like sonames in dynamic language module systems [15:39:51] PROBLEM - Host elastic1017 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:10] RECOVERY - SSH on elastic1017 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [15:41:20] RECOVERY - Host elastic1017 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [15:43:21] ok, manybubbles, all 3 nodes have the OS installed [15:43:27] should we try to run puppet on elastic1017? [15:43:35] or should we wait for a better time? 
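On the idea raised above of dumping APC data for graphing instead of shipping the full apc.php, here is a minimal sketch of what such an exporter could emit. It assumes the counters have already been collected into a dict; the key names (`hits`, `misses`, `mem_used`) and the `apc.<host>.<metric>` naming scheme are hypothetical. The one real convention relied on is Graphite's plaintext format, one `path value timestamp` triple per line.

```python
import time


def hit_rate(hits, misses):
    """Fraction of cache lookups that were hits; 0.0 when there is no traffic."""
    total = hits + misses
    return hits / total if total else 0.0


def graphite_lines(host, stats, now=None):
    """Format APC counters as Graphite plaintext lines: 'path value timestamp'."""
    ts = int(now if now is not None else time.time())
    metrics = dict(stats)
    metrics["hit_rate"] = round(hit_rate(stats["hits"], stats["misses"]), 4)
    return [f"apc.{host}.{key} {value} {ts}" for key, value in sorted(metrics.items())]


if __name__ == "__main__":
    # Hypothetical sample for one app server.
    sample = {"hits": 900, "misses": 100, "mem_used": 1024}
    print("\n".join(graphite_lines("mw1176", sample)))
```

Sampling a handful of hosts this way would show whether the hit rate drops when the error bump starts, which is the question behind "do we just need more memory assigned to APC".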
[15:55:55] (03CR) 10Chad: "Net change is Cirrus becoming primary on anwiki,arwiki,azwiki,be_x_oldwiki,bewiki,bgwiki,bnwiki,brwiki,bswiki,cebwiki,ckbwiki,cswiki,cywik" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136338 (owner: 10Chad) [15:59:54] (03PS1) 10BBlack: Turn reload-vcl failures into persistent puppet failures [operations/puppet] - 10https://gerrit.wikimedia.org/r/141980 [16:00:04] manybubbles, ^d: The time is nigh to deploy Cirrus (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140625T1600) [16:02:11] (03PS1) 10Mark Bergsma: Add mgmt subnet addresses for mr1-esams [operations/dns] - 10https://gerrit.wikimedia.org/r/141986 [16:02:36] (03CR) 10Mark Bergsma: [C: 032] Add mgmt subnet addresses for mr1-esams [operations/dns] - 10https://gerrit.wikimedia.org/r/141986 (owner: 10Mark Bergsma) [16:03:01] (03CR) 10BBlack: [C: 032] Turn reload-vcl failures into persistent puppet failures [operations/puppet] - 10https://gerrit.wikimedia.org/r/141980 (owner: 10BBlack) [16:03:52] (03CR) 10Alexandros Kosiaris: rcstream: add SSL support (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/141931 (owner: 10Giuseppe Lavagetto) [16:05:21] (03CR) 10Nikerabbit: cxserver configuration for beta labs (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [16:06:09] ok that didn't quite work right :p [16:07:06] (03CR) 10Chad: [C: 032] Move remaining pool 4 lsearchd wikis (except commons) to Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136338 (owner: 10Chad) [16:08:42] <^d> jenkins? [16:08:46] <^d> you about sir? [16:12:21] (03Merged) 10jenkins-bot: Move remaining pool 4 lsearchd wikis (except commons) to Cirrus [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136338 (owner: 10Chad) [16:15:11] ^d: you going to sync? 
[16:15:19] <^d> Yes, was just waiting for you to get back :) [16:16:04] sorry [16:16:14] traffic took longer though I thought - left turns [16:16:14] (03PS1) 10Rush: git::install updates (diff/file lock/diff notice) [operations/puppet] - 10https://gerrit.wikimedia.org/r/141991 [16:16:16] !log demon Synchronized wmf-config/InitialiseSettings.php: I4c54357a: most remaining wikis getting Cirrus as primary (duration: 00m 04s) [16:16:21] Logged the message, Master [16:16:35] !log demon Synchronized wmf-config/CommonSettings.php: I4c54357a: most remaining wikis getting Cirrus as primary (duration: 00m 04s) [16:16:39] Logged the message, Master [16:17:10] ^d: pool queue errors [16:17:22] yeah, lets roll that back [16:17:23] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Elasticsearch+cluster+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [16:17:59] or recovery [16:19:10] (03PS1) 10BBlack: naive fix for vcl-reload checking [operations/puppet] - 10https://gerrit.wikimedia.org/r/141992 [16:19:14] <^d> pool queue seems ok. [16:19:48] some nodes are under much higher load then other [16:19:49] others [16:19:59] <^d> I see that, yeah. [16:20:13] I can't actually search on enwiki with the betafeature any more though [16:20:20] like I'm banging against the pool queue [16:20:59] <^d> Yeah you're right. [16:21:02] <^d> Let's roll back. 
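To make "banging against the pool queue" concrete, here is a toy model of a pool counter: a fixed number of requests may run at once, a bounded number may wait, and anything beyond that is turned away immediately. The class and the `workers` / `maxqueue` parameter names are illustrative assumptions; this is neither the PoolCounter daemon nor the actual Cirrus settings.

```python
class PoolCounter:
    """Toy model: at most `workers` run at once, at most `maxqueue` wait."""

    def __init__(self, workers, maxqueue):
        self.workers = workers
        self.maxqueue = maxqueue
        self.running = 0
        self.waiting = 0

    def acquire(self):
        """Return 'run', 'queued' or 'rejected' for a newly arriving request."""
        if self.running < self.workers:
            self.running += 1
            return "run"
        if self.waiting < self.maxqueue:
            self.waiting += 1
            return "queued"
        return "rejected"

    def release(self):
        """A finished request hands its slot to a waiter, if there is one."""
        if self.waiting:
            self.waiting -= 1  # the waiter takes over the slot, running is unchanged
        elif self.running:
            self.running -= 1


if __name__ == "__main__":
    pool = PoolCounter(workers=2, maxqueue=1)
    print([pool.acquire() for _ in range(4)])
```

In this model, moving many more wikis onto the same pool raises the arrival rate without raising `workers` or `maxqueue`, so a larger share of searches lands in the rejected branch.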
[16:21:39] (03PS1) 10Chad: Revert "Move remaining pool 4 lsearchd wikis (except commons) to Cirrus" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141993 [16:22:21] (03CR) 10Chad: [C: 032] Revert "Move remaining pool 4 lsearchd wikis (except commons) to Cirrus" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141993 (owner: 10Chad) [16:22:29] (03Merged) 10jenkins-bot: Revert "Move remaining pool 4 lsearchd wikis (except commons) to Cirrus" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141993 (owner: 10Chad) [16:23:05] !log demon Synchronized wmf-config/CommonSettings.php: Roll back previous Cirrus deploy (duration: 00m 04s) [16:23:09] Logged the message, Master [16:23:20] !log demon Synchronized wmf-config/InitialiseSettings.php: Roll back previous Cirrus deploy (duration: 00m 05s) [16:23:25] Logged the message, Master [16:23:50] <^d> Ok, so that didn't work. What did we learn? [16:24:32] (03CR) 10BBlack: [C: 032] naive fix for vcl-reload checking [operations/puppet] - 10https://gerrit.wikimedia.org/r/141992 (owner: 10BBlack) [16:25:52] ^d: its not recovered yet, oddly [16:27:59] <^d> We need to tweak our pool counter settings. [16:28:04] <^d> We obviously need more headroom. [16:28:39] ^d: so - I still can't search enwiki - I can search other wikis - but I don't get an error in terbium [16:28:46] sorry, on fluorine [16:29:06] <^d> me either :\ [16:30:46] <^d> Did we break the pool counter? [16:30:55] ^d: not sure - commons is working [16:31:06] <^d> mw.org isn't. [16:31:08] hewiki isn't [16:32:51] I see it on logstash - but I'm not sure what is causing it [16:33:33] its spitting out the number of shards that failed rather then the message [16:33:37] <^d> See what? [16:33:44] <^d> Oh, hmm. [16:34:34] (03PS1) 10BBlack: Comment out reload-vcl check for now... [operations/puppet] - 10https://gerrit.wikimedia.org/r/141994 [16:34:59] (03CR) 10BBlack: [C: 032 V: 032] Comment out reload-vcl check for now... 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/141994 (owner: 10BBlack) [16:35:40] bleh - when I do a query from localhost against production it works [16:36:16] (03PS2) 10Rush: git::install updates (diff/file lock/diff notice) [operations/puppet] - 10https://gerrit.wikimedia.org/r/141991 [16:36:37] <^d> manybubbles: Search timeout requesting http://10.2.2.16:8123/search/commo... [16:36:39] (03CR) 10Rush: [C: 032 V: 032] "merging as this doesn't affect anything, not used atm" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141991 (owner: 10Rush) [16:36:43] <^d> Timeouts look suspicious. [16:36:51] ^d: can you snap everyone back to lsearchd? [16:37:39] what is 10.2.2.16? [16:37:57] (03PS1) 10Chad: Disbale Cirrus everywhere. Something is broken. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141995 [16:38:09] (03CR) 10Chad: [C: 032 V: 032] Disbale Cirrus everywhere. Something is broken. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141995 (owner: 10Chad) [16:38:34] thats search-pool5 - lsearchd [16:38:37] PROBLEM - puppetmaster backend https on palladium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error [16:39:10] <^d> dur, mwsearch [16:39:11] <^d> wrong log [16:39:58] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:40:28] <^d> manybubbles: Confirm one last time before I press enter on tin: disabling cirrus everywhere? [16:40:44] ^d: keep the updates going - just stop the searches [16:41:11] (03PS1) 10Chad: Revert "Disbale Cirrus everywhere. Something is broken." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141996 [16:41:19] (03CR) 10Chad: [C: 032 V: 032] Revert "Disbale Cirrus everywhere. Something is broken." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141996 (owner: 10Chad) [16:42:28] (03PS1) 10Chad: Revert "Revert "Disbale Cirrus everywhere. 
Something is broken."" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141998 [16:42:34] (03CR) 10Chad: [C: 032 V: 032] Revert "Revert "Disbale Cirrus everywhere. Something is broken."" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141998 (owner: 10Chad) [16:42:52] !log demon Synchronized wmf-config/InitialiseSettings.php: Disable Cirrus everywhere but testwiki (duration: 00m 04s) [16:42:56] Logged the message, Master [16:43:58] ^d: wtf: https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=foo&fulltext=Search [16:43:58] <^d> manybubbles: It's secondary everywhere now, no more primary. [16:44:06] ah, but beta [16:44:34] can you disable us as a betafeature? [16:44:42] <^d> Umm. [16:44:47] I'll keep digging into wtf this is [16:48:18] PROBLEM - puppetmaster https on palladium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error [16:49:34] !log restarting puppetmaster on palladium [16:49:37] Logged the message, Master [16:49:54] !log it didn't help [16:49:59] Logged the message, Master [16:50:18] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.024 second response time [16:51:16] (03PS1) 10Dr0ptp4kt: Temporarily disable OM on 470-07. Rollout is in two phases. [operations/puppet] - 10https://gerrit.wikimedia.org/r/142001 [16:51:19] !log restarted apache on palladium -- _that_ helped [16:51:24] Logged the message, Master [16:51:38] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.022 second response time [16:52:24] ^^^ bblack, would you please review and, if appropriate, +2 https://gerrit.wikimedia.org/r/142001 ? 
the operator wants to rollout for direct connections, and will add om support i a week or so [16:53:36] (03CR) 10Ori.livneh: Add a lightweight apache::site resource (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/140242 (owner: 10Ori.livneh) [16:55:54] (03PS5) 10Ori.livneh: Add a lightweight apache::site resource [operations/puppet] - 10https://gerrit.wikimedia.org/r/140242 [16:56:39] (03PS2) 10QChris: Take advantage of redis module again [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/141918 (https://bugzilla.wikimedia.org/66911) [16:57:26] (03PS5) 10QChris: Have Wikimetrics use the redis module's configuration again [operations/puppet] - 10https://gerrit.wikimedia.org/r/141120 (https://bugzilla.wikimedia.org/66911) [16:58:35] (03PS2) 10Ori.livneh: [HAT] Load mod_version on application servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/141059 [16:58:52] (03CR) 10jenkins-bot: [V: 04-1] Have Wikimetrics use the redis module's configuration again [operations/puppet] - 10https://gerrit.wikimedia.org/r/141120 (https://bugzilla.wikimedia.org/66911) (owner: 10QChris) [16:59:02] (03CR) 10Giuseppe Lavagetto: "I agree with Krinkle that the module should not be tied to the host name, correcting that." (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/141931 (owner: 10Giuseppe Lavagetto) [16:59:12] (03CR) 10Ori.livneh: [C: 032 V: 032] [HAT] Load mod_version on application servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/141059 (owner: 10Ori.livneh) [16:59:14] (03CR) 10QChris: Take advantage of redis module again (032 comments) [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/141918 (https://bugzilla.wikimedia.org/66911) (owner: 10QChris) [17:00:57] (03CR) 10Alexandros Kosiaris: "Not sure I get the reasoning behind this either. 
I can accept the 2.4 stanzas having the same content as 2.2 stanzas and diverging down th" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/141062 (owner: 10Ori.livneh) [17:03:52] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.726 second response time [17:04:24] akosiaris: yt? [17:04:32] what's your thoughts on new kafka + gradle [17:04:33] ? [17:04:34] ? [17:04:41] gonna happen? :) [17:05:10] gonna go with "having no thoughts yet" [17:05:26] could be though. Not negative to it [17:05:57] ok cool [17:06:12] you have successfully compiled it right ? [17:06:15] no eta though? I have a new package revision i shoudl probably deploy [17:06:16] yes [17:06:18] via gradle [17:06:19] yeah [17:06:21] !log manybubbles Synchronized wmf-config/: try to fix cirrus (duration: 00m 04s) [17:06:24] Logged the message, Master [17:06:40] the brokers are currently running with an unofficial 0.8.0 revision i just built [17:06:40] !log success! [17:06:44] i should make it a rev and put it in apt [17:06:45] Logged the message, Master [17:06:52] was waiting to see what we were gonna do with 0.8.1.1 [17:06:55] well we don't have to couple new kafka revision and gradle right now [17:06:58] yeah [17:07:24] tell you what. 
I am gonna give it a try on the morrow and see if it makes sense [17:07:35] it definitely looks a lot cooler than sbt TBH [17:07:51] akosiaris, _joe_: thanks for the reviews [17:08:01] although it does share the same ideas [17:08:14] it is probably way more maintained/maintainable [17:08:35] aye cool, thank you [17:08:37] typesafe.org sucks at creating tools (or so says google) [17:10:02] (03CR) 10Ori.livneh: "couple of formatting nits" (034 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/141931 (owner: 10Giuseppe Lavagetto) [17:11:32] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data exceeded the critical threshold [500.0] [17:15:10] <_joe_> ori: I can't wrap my head around those whitespace rules of ours [17:15:13] <_joe_> :) [17:15:30] i'm annoying and OCD about it, sorry [17:15:39] <_joe_> but you're right [17:15:43] i wish we could automate it [17:15:48] <_joe_> it's just puppet gets me in perl-mode [17:15:53] <_joe_> my mind I mean [17:15:56] maybe using https://github.com/cloudsmith/geppetto [17:16:06] <_joe_> so I try not to waste whitespace [17:16:13] <_joe_> "geppetto", gosh [17:16:29] i haven't actually used it but i've been meaning to check it out [17:16:36] <_joe_> you know that's a slang term for "little Giuseppe"? [17:16:45] heheh, no, i didn't! [17:16:52] Geppetto! [17:16:57] from the Pinokio tale! Yay! [17:17:06] <_joe_> that's also Pinocchio's father [17:17:23] <_joe_> whose original name is clearly Giuseppe [17:19:00] Hi folks! I just joined the fundraising tech team and switched my bugzilla account (ejegg) to use my .org address. Seems that didn't get me the permissions that signing up with a wikimedia.org address would have [17:19:22] not sure if there's anything automatic, honestly [17:19:27] what's the email? [17:19:36] eeggleston@wikimedia.org [17:19:56] I don't think there's any magic inside Bugzilla for that [17:20:42] I don't see your account... 
[17:20:48] https://bugzilla.wikimedia.org/editusers.cgi?action=list&matchvalue=realname&matchstr=eggleston&matchtype=substr&groupid=7&enabled_only=1 [17:20:49] twkozlowski: how you doin'? :) [17:20:57] Ah, oops, I meant the gerrit permissions [17:20:59] (03PS3) 10Giuseppe Lavagetto: rcstream: add SSL support [operations/puppet] - 10https://gerrit.wikimedia.org/r/141931 [17:21:02] andre__: halp [17:21:11] andre__: user group edit for new wmf employee [17:21:21] andre__: nevermind! [17:21:33] (03CR) 10Giuseppe Lavagetto: "this should address most suggestions." [operations/puppet] - 10https://gerrit.wikimedia.org/r/141931 (owner: 10Giuseppe Lavagetto) [17:21:48] ejegg: ahhh, so, that means you need to get added to the WMF LDAP group, James_F is working on that [17:22:06] hmm [17:22:18] i don't like the http 5xx graph: https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color(cactiStyle(alias(reqstats.500,%22500%20resp/min%22)),%22red%22)&target=color(cactiStyle(alias(reqstats.5xx,%225xx%20resp/min%22)),%22blue%22) [17:22:22] ori: Hi! I'm fabulous! How are you? :) [17:22:29] twkozlowski: a bit groggy today but ok :) [17:22:39] greg-g: Cool, thanks! [17:22:42] i don't see application-level errors so i worry it's related to my change [17:22:48] but i don't see how/why [17:23:01] ori: Not 11 AM UTC yet, wouldn't expect anything else! 
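The 5xx alert wording ("6.67% of data exceeded the critical threshold [500.0]") suggests a check that counts what share of recent datapoints sit above a threshold. A minimal sketch of that logic follows; the 1% cut-off and the function names are assumptions, and the real Icinga check may well differ in detail.

```python
def percent_over(datapoints, threshold):
    """Percentage of non-missing datapoints strictly above the threshold."""
    values = [v for v in datapoints if v is not None]  # missing points are skipped
    if not values:
        return 0.0
    over = sum(1 for v in values if v > threshold)
    return 100.0 * over / len(values)


def check(datapoints, threshold, max_percent):
    """Return an alert-style status line for a series of datapoints."""
    pct = percent_over(datapoints, threshold)
    state = "CRITICAL" if pct > max_percent else "OK"
    return f"{state}: {pct:.2f}% of data exceeded the threshold [{threshold}]"


if __name__ == "__main__":
    # One bad minute out of fifteen is 1/15, i.e. 6.67%.
    series = [100] * 14 + [600]
    print(check(series, 500.0, 1.0))
```

Under this reading a single spiky minute in a fifteen-point window is enough to trip the alert, which fits a graph that looks alarming for a moment and then flattens again.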
[17:23:16] ejegg: I just pinged the relevant person, I think [17:23:25] heh [17:23:43] Though, on the other hand, it's just 17:23 UTC, you should be in bed, sonnyjim :-P [17:23:44] oh, the 5xx rate just flattened again [17:23:53] hmmm [17:24:10] thanks again [17:24:32] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [17:24:41] akosiaris, _joe_: assuming the apache::site resource looks good, let's still be deliberate about how we switch over to it [17:24:50] because right now everything is using concrete files in sites-enabled [17:25:07] ori: not really [17:25:12] well some are [17:25:27] akosiaris: well, i cleaned up after the last change by purging the symlinks so they were recreated as files [17:26:11] and it's still the approach taken by apache::vhost (which i hope we can get rid of) [17:26:50] but fortunately (or unfortunately) the app servers don't use sites-* at all, so it's at least just a matter of porting a bunch of misc apaches on the periphery of our infrastructure [17:27:23] ori: antimony disagrees Warning: /Stage[main]/Gitblit/File[/etc/apache2/sites-enabled/git.wikimedia.org]: Ensure set to :present but file type is link so no content will be synced [17:27:35] d'oh [17:28:06] sigh, i'll clean that up [17:28:34] so trying to get to a well known state before going to the new one ? [17:29:00] it does make sense... [17:29:34] it won't be too hard, nothing is using apache::site at the moment [17:29:38] so we can port over things to it one by one [17:29:58] <_joe_> sounds like tons of fun [17:30:00] <_joe_> :) [17:30:34] yeah, kind of annoying. but still, i think things are looking better than they did a month ago [17:30:50] yeah they do [17:31:12] (03CR) 10Ori.livneh: [C: 031] "Haven't tested, but LGTM." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/141931 (owner: 10Giuseppe Lavagetto) [17:32:40] <_joe_> ori: while we're both here, Krinkle said that he had reports of rcstream spawning errors [17:32:49] oh? [17:33:09] Its back up now but a few days back it was consistently broken [17:33:12] 50x or 40x errors [17:33:15] forgot what it was [17:33:31] <_joe_> Krinkle: I'll take a look at the logs [17:33:39] <_joe_> it's something we should check from icinga [17:33:57] hm [17:33:59] Exactly [17:36:23] heh [17:36:23] 127.0.0.1 - - [2014-06-11 14:35:15] "GET /w00tw00t.at.blackhats.romanian.anti-sec:) HTTP/1.1" 404 96 0.000145 [17:36:38] 127.0.0.1 - - [2014-06-09 20:19:51] "GET /w00tw00t.at.blackhats.romanian.anti-sec:) HTTP/1.1" 404 96 0.000153 [17:36:40] etc. [17:36:41] <_joe_> eheh that's not strange [17:37:05] <_joe_> seen that a thousand times in logs [17:37:08] (03PS1) 10Tadasv: remove extra quote from a template and add statsd support [operations/puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/142008 [17:37:09] i see some misbehaving clients [17:37:14] sending garbage that doesn't decode as json [17:37:22] probably a few things could be handled better [17:37:57] it looks like one of the instances restarted but failed to bind to one of its ports [17:38:04] probably because its previous incarnation had not been cleaned up after [17:38:45] (03PS1) 10Chad: Cirrus back on for wikis that had it [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142009 [17:39:05] hmm yeah there's an opportunity to make a few things more robust [17:39:21] <_joe_> ori: also, we're logging the proxy IP [17:39:25] <_joe_> which is unfortunate [17:39:43] <_joe_> ori: where do you see that? 
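For the failed bind ori describes above (a restarted instance unable to take a port its previous incarnation still held), one possible mitigation is to set socket reuse options before binding. The sketch below is generic socket code, not rcstream's; whether rcstream's server stack exposes these options is an assumption left unchecked here. `SO_REUSEPORT` is only available on some platforms (Linux 3.9 and later, for instance), hence the `hasattr` guard.

```python
import socket


def bind_listener(host, port, reuse_port=True):
    """Create a listening TCP socket with address (and optionally port) reuse."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_REUSEADDR lets a restarted server rebind while old connections
    # are still lingering in TIME_WAIT.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    if reuse_port and hasattr(socket, "SO_REUSEPORT"):
        # SO_REUSEPORT lets a new process bind while the old listener is still
        # open, provided both sockets set the option.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(16)
    return sock


if __name__ == "__main__":
    old = bind_listener("127.0.0.1", 0)  # port 0: let the OS pick a free port
    port = old.getsockname()[1]
    new = bind_listener("127.0.0.1", port)  # would raise without SO_REUSEPORT
    print("both listeners bound to port", port)
    old.close()
    new.close()
```

This only papers over the overlap between incarnations; making sure the old process is cleaned up before the new one starts is still the more robust fix.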
[17:39:57] (03CR) 10Chad: [C: 032] Cirrus back on for wikis that had it [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142009 (owner: 10Chad) [17:40:03] (03Merged) 10jenkins-bot: Cirrus back on for wikis that had it [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142009 (owner: 10Chad) [17:40:28] <_joe_> I mean those errors [17:40:32] https://dpaste.de/wD33/raw [17:40:40] /var/log/upstart [17:40:45] the filename pattern of the log files is awful too [17:41:09] i think i am growing out of my taste for writing to stdout and having upstart manage logging/log rotation [17:41:37] <_joe_> ori: on what server? [17:41:42] <_joe_> ori: :) [17:41:46] rcs1001 [17:41:48] !log demon Synchronized wmf-config/: Cirrus back on for wikis that had it before. Back to square 1 (duration: 00m 04s) [17:41:52] Logged the message, Master [17:42:26] <_joe_> ori: I do see quite some errors in the nginx error log as well [17:43:05] trusty has SO_REUSEPORT too, i forgot.. could be interesting [17:43:27] i mean it has kernel 3.13 which has it [17:44:14] i'll go over all the logs and submit fixes for any issues i identify later today [17:44:46] official "launch" is now waiting on the design folks to put up a nice and splashy api doc page so we have some time [17:45:24] also gives us time to try hammering the server using some evil websocket spamming tool we invent :P [17:45:35] <_joe_> ok :) [17:46:16] ooo, rcs! [17:50:08] hello Chris [17:51:06] cmjohson1 [17:51:30] cmjognson1 [17:51:51] cmjohnson1 [17:52:01] hi papaul [17:52:11] how are you [17:52:32] peachy... what's up? [17:53:05] nothing just checking on your guys and testing this new laptop on my lunch break [17:53:12] if Rob ok? [17:53:13] oh..cool. 
have you tried the mifi yet [17:53:35] yes i am using the mifi [17:53:51] that way you can stay online in the cage area...you will probably be okay without the mifi since you are no longer using windows [17:53:58] is Rob ok i haven't heard from him in two days [17:54:25] yep..he's okay. working on procurement things [17:54:31] ok [17:54:35] how are things coming along in the data center? [17:55:07] i am planning on finishing row A by tomorrow and will be starting row B on friday [17:55:31] cool!...have you plugged anything in and powered on yet? [17:55:39] not yet [17:55:53] i am to finish the data lines first [17:56:02] i would like you to power up a rack ..i want to see how the power cables look [17:56:04] i want to finish the data lines first [17:56:12] but it's not an immediate need [17:56:25] ok i will do that tomorrow and send you a pic [17:56:32] cool..thx [17:56:45] no, thanks to you [17:57:07] ok buddy i have to get back in the cage like a lion lol [17:57:11] talk to you later [17:57:11] do you need anything from us? [17:57:19] no i am good thank you [17:57:29] yes [17:57:30] ok..take the laptop in the cage with you and lmk if it works in there [17:57:34] i have no velcro [17:57:39] it should with the mifi if the wifi is shit [17:58:03] okay on velcro..make a ticket in RT in procurement queue [17:58:12] robh will see it and order it for you [17:58:22] ok [17:58:30] i will talk to you later [17:58:33] k [17:58:56] yea i'll order more [17:59:03] i just keep forgetting cuz there isn't a ticket ;] [17:59:43] manybubbles, ^d done? [17:59:49] yurikR: I'm out of the way [17:59:56] <^d> Yes, I'm done too.
[18:00:02] thx [18:00:04] yurik: The time is nigh to deploy Wikipedia Zero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140625T1800) [18:02:55] (03PS2) 10Yurik: Removing ZeroRatedMobileAccess ext (obsolete) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140581 [18:05:23] (03CR) 10Yurik: [C: 032] Removing ZeroRatedMobileAccess ext (obsolete) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140581 (owner: 10Yurik) [18:07:03] (03Merged) 10jenkins-bot: Removing ZeroRatedMobileAccess ext (obsolete) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140581 (owner: 10Yurik) [18:08:11] Can I add a new group (of users) in Gerrit? Or is that something I would ask in this channel? [18:08:58] I'd like: fr-tech = Ejegg, Katie Horn, Ssmith, Mwalker, Awight [18:09:44] <^d> {{done}} [18:11:00] ^d: woot! thank you [18:11:11] While I'm rattling cages... Anyone want to kick this over the cliff? https://gerrit.wikimedia.org/r/#/c/137804/ [18:11:15] <^d> yw [18:15:05] (03PS1) 10RobH: setting up radon and correcting service entry location [operations/dns] - 10https://gerrit.wikimedia.org/r/142022 [18:16:12] (03CR) 10RobH: [C: 031] "standard dns additions (plus one move purely for formatting/ordering)" [operations/dns] - 10https://gerrit.wikimedia.org/r/142022 (owner: 10RobH) [18:16:21] (03CR) 10RobH: [C: 032] "standard dns additions (plus one move purely for formatting/ordering)" [operations/dns] - 10https://gerrit.wikimedia.org/r/142022 (owner: 10RobH) [18:16:32] clicked wrong radio button =P [18:17:16] !log yurik Started scap: Removing ZeroRatedMobileAccess ext settings, depl latest JsonConfig/ZeroBanner/Portal [18:17:21] Logged the message, Master [18:19:14] manybubbles: not sure if I missed you reply earlier [18:19:22] when is a good time to bring up ES on the new nodes? [18:19:23] hey, new contributor here, if I want to start contributing do I just grab a ticket off bugzilla? 
[18:19:27] ottomata: any time you want [18:19:42] omg any time I want?!, hehe [18:20:01] ok, let's do 1017 ummmm early tomorrow afternoon? [18:21:07] ACKNOWLEDGEMENT - DPKG on elastic1017 is CRITICAL: Connection refused by host ottomata We will be bringing this node back online June 26th around 1pm EST [18:21:07] ACKNOWLEDGEMENT - Disk space on elastic1017 is CRITICAL: Connection refused by host ottomata We will be bringing this node back online June 26th around 1pm EST [18:21:08] ACKNOWLEDGEMENT - NTP on elastic1017 is CRITICAL: NTP CRITICAL: No response from NTP server ottomata We will be bringing this node back online June 26th around 1pm EST [18:21:08] ACKNOWLEDGEMENT - RAID on elastic1017 is CRITICAL: Connection refused by host ottomata We will be bringing this node back online June 26th around 1pm EST [18:21:08] ACKNOWLEDGEMENT - check configured eth on elastic1017 is CRITICAL: Connection refused by host ottomata We will be bringing this node back online June 26th around 1pm EST [18:21:08] ACKNOWLEDGEMENT - check if dhclient is running on elastic1017 is CRITICAL: Connection refused by host ottomata We will be bringing this node back online June 26th around 1pm EST [18:21:08] ACKNOWLEDGEMENT - puppet disabled on elastic1017 is CRITICAL: Connection refused by host ottomata We will be bringing this node back online June 26th around 1pm EST [18:22:56] hey _joe_, yt? [18:24:30] ottomata: cool [18:37:28] (03PS1) 10RobH: setting install parameters for radon [operations/puppet] - 10https://gerrit.wikimedia.org/r/142027 [18:37:41] (03PS2) 10RobH: setting install parameters for radon [operations/puppet] - 10https://gerrit.wikimedia.org/r/142027 [18:38:03] why is there a space, it should be rt:23423 not rt: 134123 ;P [18:38:17] * RobH fixes commit message anyhow [18:38:40] <^d> Regex should make space optional. 
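[editor's note] The point about making the space optional can be illustrated with a minimal sketch. This is not Gerrit's actual link or search configuration; the pattern, the name RT_REF and the helper find_rt_tickets are hypothetical and only show how a single regex could accept both "rt:23423" and "rt: 134123".

```python
import re

# Hypothetical pattern (an assumption, not Gerrit's real config):
# the colon is required, whitespace after it is optional.
RT_REF = re.compile(r"\brt:\s*(\d+)", re.IGNORECASE)


def find_rt_tickets(message):
    """Return the RT ticket numbers referenced in a commit message."""
    return RT_REF.findall(message)


if __name__ == "__main__":
    # Both spellings from the discussion yield a ticket number.
    print(find_rt_tickets("setting install parameters (rt:23423)"))
    print(find_rt_tickets("setting install parameters (rt: 134123)"))
```

The word boundary before "rt" keeps words that merely end in "rt" (such as "start:") from matching.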
[18:39:48] (03CR) 10RobH: [C: 032] setting install parameters for radon [operations/puppet] - 10https://gerrit.wikimedia.org/r/142027 (owner: 10RobH) [18:41:13] (03PS1) 10Ottomata: Add exception for smaller main storage size for upload caches cp301[5-8] [operations/puppet] - 10https://gerrit.wikimedia.org/r/142030 [18:41:18] i thought it was RT XXXX [18:41:56] <^d> The colon is necessary for the searching. I think it's optional for linking though. [18:42:04] <^d> Stupid gerrit. [18:42:19] (03CR) 10jenkins-bot: [V: 04-1] Add exception for smaller main storage size for upload caches cp301[5-8] [operations/puppet] - 10https://gerrit.wikimedia.org/r/142030 (owner: 10Ottomata) [18:46:26] !log yurik Finished scap: Removing ZeroRatedMobileAccess ext settings, depl latest JsonConfig/ZeroBanner/Portal (duration: 29m 09s) [18:46:29] Logged the message, Master [18:46:34] (03PS1) 10BBlack: Revert 3b7e966b + 4179c38c (does not work right) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142034 [18:46:36] (03PS1) 10BBlack: A new method for persistent vcl reload failure [operations/puppet] - 10https://gerrit.wikimedia.org/r/142035 [18:48:37] (03CR) 10jenkins-bot: [V: 04-1] Revert 3b7e966b + 4179c38c (does not work right) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142034 (owner: 10BBlack) [18:48:55] (03CR) 10jenkins-bot: [V: 04-1] A new method for persistent vcl reload failure [operations/puppet] - 10https://gerrit.wikimedia.org/r/142035 (owner: 10BBlack) [18:51:37] (03PS2) 10Ottomata: Add exception for smaller main storage size for upload caches cp301[5-8] [operations/puppet] - 10https://gerrit.wikimedia.org/r/142030 [18:51:47] (03PS2) 10BBlack: Revert 3b7e966b + 4179c38c (does not work right) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142034 [18:51:49] (03PS2) 10BBlack: A new method for persistent vcl reload failure [operations/puppet] - 10https://gerrit.wikimedia.org/r/142035 [18:52:29] (03CR) 10BBlack: [C: 032 V: 032] Revert 3b7e966b + 4179c38c 
(does not work right) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142034 (owner: 10BBlack) [18:52:42] (03PS3) 10Ottomata: Add exception for smaller main storage size for upload caches cp301[5-8] [operations/puppet] - 10https://gerrit.wikimedia.org/r/142030 [18:53:04] (03PS4) 10Ottomata: Add exception for smaller main storage size for upload caches cp301[5-8] [operations/puppet] - 10https://gerrit.wikimedia.org/r/142030 [18:53:55] (03CR) 10BBlack: [C: 031] Add exception for smaller main storage size for upload caches cp301[5-8] [operations/puppet] - 10https://gerrit.wikimedia.org/r/142030 (owner: 10Ottomata) [18:53:59] !log yurik Synchronized wmf-config/PrivateSettings.php: Removed obsolete ZRMA user/pswd (duration: 01m 06s) [18:54:04] Logged the message, Master [18:54:14] (03CR) 10jenkins-bot: [V: 04-1] Add exception for smaller main storage size for upload caches cp301[5-8] [operations/puppet] - 10https://gerrit.wikimedia.org/r/142030 (owner: 10Ottomata) [18:54:19] (03CR) 10BBlack: [C: 032 V: 032] A new method for persistent vcl reload failure [operations/puppet] - 10https://gerrit.wikimedia.org/r/142035 (owner: 10BBlack) [18:57:04] MaxSem, I have searched for JsonConfig and keep seeing - PHP Warning: require(): GC cache entry '/usr/local/apache/common-local/php-1.24wmf9/extensions/JsonConfig/JsonConfig.php' (dev=2049 ino=4727814) was on gc-list for 601 seconds in /usr/local/apache/common-local/wmf-config/CommonSettings.php on line 153 [18:57:17] is that something i can fix? [18:58:01] nope, just usual APC poop [18:58:06] matanya: you there? 
[18:58:21] thx [19:00:01] (03PS5) 10Ottomata: Add exception for smaller main storage size for upload caches cp301[5-8] [operations/puppet] - 10https://gerrit.wikimedia.org/r/142030 [19:02:22] (03PS6) 10Ottomata: Add exception for smaller main storage size for upload caches cp301[5-8] [operations/puppet] - 10https://gerrit.wikimedia.org/r/142030 [19:06:18] (03CR) 10Ottomata: [C: 032 V: 032] Add exception for smaller main storage size for upload caches cp301[5-8] [operations/puppet] - 10https://gerrit.wikimedia.org/r/142030 (owner: 10Ottomata) [19:13:04] cmjohnson1: So I'm putting together the fiber order [19:13:13] and i have a cable routing issue, i want to review with you if you have a moment? [19:13:22] ok..i have a moment [19:13:28] https://office.wikimedia.org/wiki/Operations/CODFW_Planning#Mark.27s_calculations [19:13:44] So the uplink ports for the switches to the routers are not in the front of them [19:13:45] but in the back [19:14:46] I was confused before and didn't realize the stacking ports are also the uplink ports [19:14:46] really? i didn't look [19:14:53] but they are the same use case for the port [19:15:00] either to tie the stack OR to tie to router [19:15:17] so, we have a 1U space for the power cable routing for switches, correct? [19:15:23] yes [19:15:25] if so, this issue is solved with some flexduct [19:15:29] ok, excellent [19:15:41] 1u under the mgmt switch [19:15:43] So, I still think we should mount the fiber raceway in the back of the rack (where the top of rack holes are) [19:16:00] and then use a .5 meter flexduct for fibers from raceway to back of switch [19:16:17] just stick them all in the flexduct so they are protected, and nylon tie the duct into place on one side of the blank U panel [19:16:34] which fibers anyway? [19:16:40] all uplinks are to QFX right? [19:16:50] stacking cables, yes [19:16:55] yes, but the only holes in the top of the racks are in the back [19:16:57] back in a bit... [19:17:39] it's an annoying oversight.
we've never needed to route out the front of the rack in previous installs [19:17:48] in this case it would have been useful. [19:18:16] (would avoid having to move fibers from the back of switches (midrack) to the rear of the rack to route to the fiber raceways [19:18:18] ) [19:18:31] (03PS1) 10BBlack: test vcl compile failure via puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/142043 [19:18:52] its all annoying due to these racks having no space at all in the sidewall (it has the rail directly against the sidewall) [19:20:57] cmjohnson1: So does what I state above make sense? Use the fiber lengths we reviewed, as they all got rounded upwards in length anyhow, traversing from the mid to back of rack shouldnt cause them to be too short [19:20:58] * bblack injects 500mg of caffeine into jenkins' arm, hoping for a reaction [19:21:12] then a flexduct to route to back of rack and upwards into raceway [19:21:15] (03CR) 10BBlack: [C: 032] test vcl compile failure via puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/142043 (owner: 10BBlack) [19:21:26] which then goes to the router racks and terminates in a patch panel [19:21:48] i think the lengths will be okay but how are we going to do this so it looks clean. [19:21:50] from said panel, we can go directly 40Gb into line cards (if available) or use the pigtail fibers mark requested to break them out into 4 10Gb connections [19:22:20] the flexduct can go from inside the enclosure fiber raceway [19:22:24] and just nylon tie into place [19:22:27] so it should look ok [19:22:45] just have to cut the vertical raceways cover in a spot [19:22:55] so its split in half where the flexduct emerges [19:23:06] i could make this look neat ;] [19:24:42] I may have to fly out and assist on this [19:25:12] if i have a template single one done for papaul he can replicate it all day long [19:25:35] (03PS1) 10BBlack: Trying to induce VCL failure, again... 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/142044 [19:25:49] (03CR) 10BBlack: [C: 032 V: 032] Trying to induce VCL failure, again... [operations/puppet] - 10https://gerrit.wikimedia.org/r/142044 (owner: 10BBlack) [19:26:42] (03PS5) 1020after4: Move the ordered_json parser function to a shared module and remove the copy-pasta found in deployment, statsd and gdash modules. [operations/puppet] - 10https://gerrit.wikimedia.org/r/139921 [19:26:55] (03PS6) 10Rush: Move the ordered_json parser function to a shared module and remove the copy-pasta found in deployment, statsd and gdash modules. [operations/puppet] - 10https://gerrit.wikimedia.org/r/139921 (owner: 1020after4) [19:27:30] (03CR) 10Rush: [C: 032 V: 032] "Ori signed off and it's his code, the overall change I think is good." [operations/puppet] - 10https://gerrit.wikimedia.org/r/139921 (owner: 1020after4) [19:38:27] !log restarted Cirrus scripts after incident - the index rebuilds had to be completely restarted - sanity checking was simply paused [19:38:31] Logged the message, Master [19:39:56] (03PS7) 10Rush: Phabricator module [operations/puppet] - 10https://gerrit.wikimedia.org/r/132505 (owner: 10Dzahn) [19:40:06] (03PS1) 10BBlack: improve reload-vcl retry mechanism (no double-exec on fix) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142049 [19:40:38] (03CR) 10BBlack: [C: 032 V: 032] improve reload-vcl retry mechanism (no double-exec on fix) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142049 (owner: 10BBlack) [19:43:29] (03PS1) 10BBlack: remove temporarily-induced VCL failure [operations/puppet] - 10https://gerrit.wikimedia.org/r/142050 [19:43:59] (03CR) 10BBlack: [C: 032 V: 032] remove temporarily-induced VCL failure [operations/puppet] - 10https://gerrit.wikimedia.org/r/142050 (owner: 10BBlack) [19:47:05] (03PS8) 10Rush: Phabricator module [operations/puppet] - 10https://gerrit.wikimedia.org/r/132505 (owner: 10Dzahn) [19:53:00] (03PS1) 10BBlack: require to fix ordering 
issues on vcl reload [operations/puppet] - 10https://gerrit.wikimedia.org/r/142054 [19:53:28] (03CR) 10BBlack: [C: 032 V: 032] require to fix ordering issues on vcl reload [operations/puppet] - 10https://gerrit.wikimedia.org/r/142054 (owner: 10BBlack) [19:57:07] (03PS1) 10BBlack: induce temporary VCL fail on cp4010 again [operations/puppet] - 10https://gerrit.wikimedia.org/r/142055 [19:57:30] (03CR) 10BBlack: [C: 032 V: 032] induce temporary VCL fail on cp4010 again [operations/puppet] - 10https://gerrit.wikimedia.org/r/142055 (owner: 10BBlack) [19:58:28] cmjohnson1: so the fiber runner http://www.panduit.com/wcs/Satellite?c=Page&childpagename=Panduit_Global%2FPG_Layout&cid=1345564329071&packedargs=classification_id%3D1789%26item_id%3DE2X2BL6%26locale%3Den_us&pagename=PG_Wrapper and cover http://www.panduit.com/wcs/Satellite?c=Page&childpagename=Panduit_Global%2FPG_Layout&cid=1345564329071&packedargs=classification_id%3D1789%26item_id%3DC2BL6%26locale%3Den_us&pagename=PG_Wrapper [19:58:31] andrewbogott: have you ever been able to get the openstack extension tests to pass? [19:58:34] and then the mount http://www.panduit.com/wcs/Satellite?c=Page&childpagename=Panduit_Global%2FPG_Layout&cid=1345564329071&packedargs=classification_id%3D1797%26item_id%3DFLB%26locale%3Den_us&pagename=PG_Wrapper [19:58:41] seem right to you? [19:59:02] there is also a tool for cutting them to fit when needed [19:59:12] (like when we have to take an opening out of the cover for the flextube) [19:59:20] ori: which? 
[20:00:05] gwicke, subbu, cscott: The time is nigh to deploy Parsoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140625T2000) [20:01:11] (03PS1) 10BBlack: remove temporary VCL failure on cp4010, again [operations/puppet] - 10https://gerrit.wikimedia.org/r/142057 [20:01:31] (03CR) 10BBlack: [C: 032 V: 032] remove temporary VCL failure on cp4010, again [operations/puppet] - 10https://gerrit.wikimedia.org/r/142057 (owner: 10BBlack) [20:01:55] (03PS9) 10Rush: Phabricator module [operations/puppet] - 10https://gerrit.wikimedia.org/r/132505 (owner: 10Dzahn) [20:04:17] (03PS2) 10BBlack: Temporarily disable OM on 470-07. Rollout is in two phases. [operations/puppet] - 10https://gerrit.wikimedia.org/r/142001 (owner: 10Dr0ptp4kt) [20:07:21] yurik: ping ^ is the above ready and should go out now? [20:09:42] (03PS1) 10Rush: Phabricator for iridium [operations/puppet] - 10https://gerrit.wikimedia.org/r/142059 [20:10:06] (03Abandoned) 10Rush: phabricator trial [operations/puppet] - 10https://gerrit.wikimedia.org/r/137956 (owner: 10Rush) [20:11:55] (03Abandoned) 10Andrew Bogott: Added 'adminadd' tool to auto-generate new user entries. [operations/puppet] - 10https://gerrit.wikimedia.org/r/98700 (owner: 10Andrew Bogott) [20:12:17] (03Abandoned) 10Andrew Bogott: Moved nfs manifest into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/69682 (owner: 10Andrew Bogott) [20:14:02] (03CR) 1020after4: [C: 031] "looks good to me. nice work Chase!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/132505 (owner: 10Dzahn) [20:16:06] (03CR) 10Ottomata: [C: 032 V: 032] "Thanks so much!" 
[operations/puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/142008 (owner: 10Tadasv) [20:18:14] (03PS3) 10Rush: modules/mysql_multi_instance/ sans systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137991 [20:18:50] (03PS1) 10RobH: radon needs trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/142063 [20:19:01] (03PS2) 10RobH: radon needs trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/142063 [20:19:10] (03CR) 10Ottomata: "Does this actually buy us anything though? I kinda liked having the dependency on the redis module removed, especially since vagrant and " [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/141918 (https://bugzilla.wikimedia.org/66911) (owner: 10QChris) [20:20:01] andrewbogott, cscott synced new code (to deploy a new version of parsoid) .. but he is getting auth failures trying to restart the service via dsh ... (https://wikitech.wikimedia.org/wiki/Parsoid#Deploying_the_latest_version_of_Parsoid) /cc gwicke [20:21:37] or anyone other ops who can help with this. [20:21:43] subbu: what kind of auth failures? [20:22:20] andrewbogott, " "Too many authentication failures for cscott" when he tried to run "dsh -g parsoid sudo service parsoid restart" [20:22:38] andrewbogott: cscott is likely not in the right group for ssh access on the parsoid hosts [20:22:47] was he ever? [20:22:54] probably not [20:22:58] no, it's his first attempt [20:22:59] i'm a parsoid-deploy virgin [20:23:40] also no dice ssh'ing directly into one of the hosts, despite agent forwarding being active [20:23:45] i am running the dsh command in the meantime. [20:23:51] <_joe_> ok so it's usually advisable to ask ops if he's in the right group before trying :) [20:24:03] (03CR) 10Rush: [C: 032] "over and over I checked this to make sure my brain isn't playing tricks on me. this is a noop. 
no changes from old config, just takes ge" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137991 (owner: 10Rush) [20:24:19] what's the 'right group'? we could document it. [20:24:22] _joe_, yes, we should have thought of that. [20:24:28] _joe_: my hazy recollection was that this was recently simplified [20:24:35] <_joe_> chasemp: did you check with the compiler? [20:24:52] mutante: do you remember the details? [20:24:56] I don't know how any of this works... [20:24:56] <_joe_> gwicke: management of groups has been simplified AFAIK :) [20:25:05] (03CR) 10QChris: "> Does this actually buy us anything though?" [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/141918 (https://bugzilla.wikimedia.org/66911) (owner: 10QChris) [20:25:21] <_joe_> andrewbogott: I can take a look, give me the time to pull the repo on the laptop [20:25:32] I thought that all of wikitech had access now [20:25:40] _joe_: is this something in the yaml formerly known as admins.pp? [20:25:46] _joe_: I didn't, but that's good advice. [20:27:17] <_joe_> andrewbogott: yes, cscott is apparently not in parsoid-admin or parsoid-roots [20:27:28] <_joe_> no idea which one is the right group though [20:27:28] _joe_: oh, that's easy then... [20:27:48] what about parsoid-wheel or parsoid-fans? j/k [20:28:13] it depends on whether he just needs an account or ability to become root [20:28:36] well… subbu is an 'admin' and gwicke is a 'root' [20:28:48] So, gwicke, subbu, which of you should he resemble? [20:28:50] being like subbu is good enough for me. [20:29:01] admin then. [20:29:31] <_joe_> andrewbogott: do you need further assistance? [20:29:35] (03PS1) 10Andrew Bogott: Add cscott to parsoid-admins [operations/puppet] - 10https://gerrit.wikimedia.org/r/142066 [20:29:41] _joe_: a +1 maybe? [20:29:41] <_joe_> seems not [20:29:45] <_joe_> ok [20:30:05] I don't want to be the stick in the mud but should we get more there than irc? [20:30:08] maybe a manager? 
[20:30:19] to +1 I mean :) [20:30:34] <_joe_> I was about to make the same request [20:30:47] hm... [20:30:48] <_joe_> chasemp: OTOH, this channel is logged [20:31:19] I think you're right, should be an rt ticket. [20:31:44] (03CR) 10Giuseppe Lavagetto: [C: 031] "change is good, but we'd need an RT ticket or manager approval here IMHO." [operations/puppet] - 10https://gerrit.wikimedia.org/r/142066 (owner: 10Andrew Bogott) [20:31:47] I don't see where the access was removed in teh consolidation, which means it's new access, etc etc [20:31:59] cscott, subbu, gwicke, can you log an RT request and get robla to approve? [20:32:03] I could be missing it tho, and if we can find where it was removed by accident then different story [20:32:04] cscott got tin/deploy access based on an rt ticket iirc cscott do you have it? [20:32:09] yes, hang on [20:32:11] (Or, whatever manager is appropriate) [20:32:25] (03CR) 10Ottomata: "Ja, I still say we should use the redis module, just not depend on it via the wikimetrics module. I think the wikimetrics should just say" [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/141918 (https://bugzilla.wikimedia.org/66911) (owner: 10QChris) [20:32:53] this is just last mile access to restart the service since he already synced code to the parsoid cluster. [20:32:53] (03CR) 10Ottomata: "The role class is already responsible for setting up redis, and it does so via the module. Why make a cross module dependency if we don't" [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/141918 (https://bugzilla.wikimedia.org/66911) (owner: 10QChris) [20:32:57] <_joe_> we don't like to be a pain, but in these cases process is still quite important :) [20:33:02] but, happy to write to terry. [20:33:08] our/his manager. 
[20:33:11] https://gerrit.wikimedia.org/r/#/c/135418/ seems like it *should* have added me to parsoid-admins [20:33:21] _joe_, sure, np :) [20:33:26] rt #7542 is the original deploy access request [20:33:47] ah https://gerrit.wikimedia.org/r/#/c/135418/2/modules/admin/data/data.yaml [20:34:01] I mean it's kind of sticky because parsoid is considered extra privs in our model [20:34:11] as in lots of ppl deploy code, few need parsoid privs [20:34:22] so requesting deployment access just didn't translate I think [20:34:29] i'm happy to file a new RT ticket if that's the best way to do it. subbu finished the actual deploy, so we're not time-constrained. [20:34:33] <_joe_> good for me, anyways [20:34:41] !log deployed parsoid 4ef9d6be [20:34:47] Logged the message, Master [20:34:50] <_joe_> chasemp: I think we can classify this as completing the ticket [20:34:56] <_joe_> I've already done that [20:35:14] (03PS1) 10Ottomata: Add cp3015-cp3018 to list of esams upload caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/142087 [20:35:41] (03CR) 10Ottomata: "Don't think I want to merge this evening, would prefer to wait until tomorrow." [operations/puppet] - 10https://gerrit.wikimedia.org/r/142087 (owner: 10Ottomata) [20:35:41] I think knowing nothing else, if gabriel says that rt request meant parsoid specifically [20:35:51] that is as good as it gets [20:36:05] <_joe_> andrewbogott: just add the RT reference to the commit message I'd say [20:36:41] (03CR) 10RobH: [C: 032] radon needs trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/142063 (owner: 10RobH) [20:36:46] ok... [20:36:57] <_joe_> andrewbogott: do you agree? [20:37:18] OK, RT #7752 is the new ticket (if i did this correctly) [20:37:26] I don't know. the ticket doesn't mention parsoid, but I'm not clear if parsoid should actually be a different level of security vs. deploy [20:38:12] cscott: that looks fine, can you get Terry to reply? 
[20:38:15] the original ticket was also for the new PDF backend... hopefully that's not Yet Another level of security. ;) [20:38:19] andrewbogott: since you're in the mix, I'll defer to you dude, but from a purely admin module standpoint you're on the right track :) [20:40:15] <_joe_> cscott: probably is :P [20:43:15] andrewbogott: the fun part is that anybody can swap out the code already [20:43:28] anybody in wikitech, that is [20:44:17] if salt wasn't broken for non-roots we'd also use that for the restarts [20:44:47] <_joe_> gwicke: yes that does not make much sense, I agree [20:46:30] (03PS4) 10Rush: jenkins - replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/137992 [20:46:54] (03PS6) 10QChris: Have Wikimetrics use the redis module's configuration again [operations/puppet] - 10https://gerrit.wikimedia.org/r/141120 (https://bugzilla.wikimedia.org/66911) [20:47:50] (03CR) 10Hashar: [C: 031] "sounds all good to me. I wasn't sure what systemuser and managedhome do but Chase clarified it to me." [operations/puppet] - 10https://gerrit.wikimedia.org/r/137992 (owner: 10Rush) [20:49:08] (03CR) 10QChris: "Like ...
abandoning this change and adapting only" [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/141918 (https://bugzilla.wikimedia.org/66911) (owner: 10QChris) [20:49:21] PROBLEM - puppetmaster backend https on strontium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:50:11] RECOVERY - puppetmaster backend https on strontium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.017 second response time [20:50:22] (03Abandoned) 10QChris: Take advantage of redis module again [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/141918 (https://bugzilla.wikimedia.org/66911) (owner: 10QChris) [20:50:29] (03PS2) 10Andrew Bogott: Add cscott to parsoid-admins [operations/puppet] - 10https://gerrit.wikimedia.org/r/142066 [20:50:53] (03CR) 10Rush: [C: 032 V: 032] "wheeeeeeeee" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137992 (owner: 10Rush) [20:53:25] !log puppet broken on gallium.wikimedia.org and lanthanum.eqiad.wmnet . That is being looked at. [20:53:34] Logged the message, Master [20:54:46] hm, apache died on palladium earlier. [20:54:56] hashar, are you investigating? [20:55:14] andrewbogott: yes with chasemp [20:55:35] <_joe_> andrewbogott: again? [20:55:40] caused by replacement of generic::systemuser with user hehe [20:55:46] but the root cause is in my lame manifests [20:55:52] _joe_: A few hours ago it did. [20:56:05] <_joe_> andrewbogott: ok this is happening on a daily basis [20:56:15] <_joe_> we should revert to a run each 30 minutes [20:56:27] <_joe_> and/or debug the new passenger [20:56:37] <_joe_> I don't know how, honestly [20:56:51] revert to each 30? I thought that's how it was now. [20:57:23] <_joe_> we're at each 20 [20:58:08] Is it dying at the same time each day? 
Today I restarted it at ~16:45 UTC [20:59:27] <_joe_> no it dies randomly [21:06:05] (03PS1) 10Rush: jenkins needs explicit jenkins-slave group [operations/puppet] - 10https://gerrit.wikimedia.org/r/142133 [21:10:40] (03CR) 10JanZerebecki: [C: 031] Improve nginx TLS/SSL settings. [operations/puppet] - 10https://gerrit.wikimedia.org/r/132393 (https://bugzilla.wikimedia.org/53259) (owner: 10JanZerebecki) [21:17:19] (03CR) 10Rush: [C: 032] jenkins needs explicit jenkins-slave group [operations/puppet] - 10https://gerrit.wikimedia.org/r/142133 (owner: 10Rush) [21:20:22] (03CR) 10Gergő Tisza: [C: 031] Remove completed surveys [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/138634 (owner: 10MarkTraceur) [21:22:15] !log puppet fixed on gallium / lanthanum . It was missing a group definition. All fixed! Thanks Chase. [21:22:19] Logged the message, Master [21:29:03] ottomata: did you look for me ? [21:30:12] PROBLEM - check_fundraising_jobs on db1025 is CRITICAL: CRITICAL missing_thank_yous=786 [critical =500]: recurring_gc_contribs_missed=0: recurring_gc_failures_missed=0: recurring_gc_jobs_required=1084: recurring_gc_schedule_sanity=0 [21:31:09] andrewbogott: looks like terry has approved adding me to parsoid-admin [21:31:40] (03CR) 10Hashar: "Worked like a charm. Thank you!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142133 (owner: 10Rush) [21:35:12] PROBLEM - check_fundraising_jobs on db1025 is CRITICAL: CRITICAL missing_thank_yous=668 [critical =500]: recurring_gc_contribs_missed=0: recurring_gc_failures_missed=0: recurring_gc_jobs_required=1084: recurring_gc_schedule_sanity=0 [21:38:42] greg-g, Reedy : Flow in 1.24wmf12 (July 3rd) will be using new Mantle (awesome pun by jdlrobson) extension to share code with MobileFrontend. I think it makes sense to deploy Mantle earlier to shake out any deploy bugs and have it ready for Flow to use on beta labs. 
[21:40:04] PROBLEM - check_fundraising_jobs on db1025 is CRITICAL: CRITICAL missing_thank_yous=668 [critical =500]: recurring_gc_contribs_missed=0: recurring_gc_failures_missed=0: recurring_gc_jobs_required=1084: recurring_gc_schedule_sanity=0 [21:40:26] spagewmf: yeah, lets get it set up in beta cluster ASAP [21:40:35] cscott: I will merge as soon as my browser starts responding again... [21:40:49] no worries, no rush. [21:40:51] spagewmf: it might be worth making the switch (updating those extensions) separately from the normal train deploy [21:41:00] greg-g: heh, I just read "Your new extension has been deployed and tested on the mw:beta cluster for weeks, right? Otherwise, STOP and talk to experts." what I rote myself :) [21:41:14] :) [21:41:17] (03CR) 10Andrew Bogott: [C: 032] Add cscott to parsoid-admins [operations/puppet] - 10https://gerrit.wikimedia.org/r/142066 (owner: 10Andrew Bogott) [21:41:27] * cscott has the power [21:41:39] cscott: it'll take half an hour or so before the change is applied. [21:41:51] (03CR) 10LVilla (WMF): [C: 031] "Content in the patch looks right to me, but haven't validated/verified the various HTML changes." [operations/puppet] - 10https://gerrit.wikimedia.org/r/141671 (owner: 10Filippo Giunchedi) [21:42:00] so long as it finishes before monday's parsoid deploy window. ;) [21:42:07] jgage: btw, http://tickets.gamh.com/events/426605 [21:42:24] matanya: ja [21:42:32] had someone pinging me about how to start volunteering with ops [21:42:38] about dogeydogey ? [21:42:38] was going to point thon to you [21:42:40] yup [21:42:42] greg-g: so I'll get Mantle on betalabs ASAP. It's still safe to deploy in 1.24wmf11 since it's unused. Should I add it to Deployments calendar? [21:43:01] found him ottomata. thank you for this. 
[21:43:06] spagewmf: yeah, I guess in the "next month" section right now since "next week" isn't created yet [21:43:17] cool, great [21:45:04] PROBLEM - check_fundraising_jobs on db1025 is CRITICAL: CRITICAL missing_thank_yous=668 [critical =500]: recurring_gc_contribs_missed=0: recurring_gc_failures_missed=0: recurring_gc_jobs_required=1084: recurring_gc_schedule_sanity=0 [21:50:04] RECOVERY - check_fundraising_jobs on db1025 is OK: OK missing_thank_yous=0: recurring_gc_contribs_missed=0: recurring_gc_failures_missed=0: recurring_gc_jobs_required=1084: recurring_gc_schedule_sanity=0 [22:04:04] Some root should maybe quickly remove the no longer needed /a/common from fenari?! [22:08:26] (PS17) QChris: Add backup role and scripts to wikimetrics [operations/puppet/wikimetrics] - https://gerrit.wikimedia.org/r/139557 (https://bugzilla.wikimedia.org/66119) (owner: Milimetric) [22:09:11] (PS10) QChris: Enable the new backup role in wikimetrics if set [operations/puppet] - https://gerrit.wikimedia.org/r/139558 (https://bugzilla.wikimedia.org/66119) (owner: Milimetric) [22:10:51] (CR) jenkins-bot: [V: -1] Enable the new backup role in wikimetrics if set [operations/puppet] - https://gerrit.wikimedia.org/r/139558 (https://bugzilla.wikimedia.org/66119) (owner: Milimetric) [22:13:54] (PS1) Spage: add new Mantle extension, required by coming Flow [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142142 (https://bugzilla.wikimedia.org/66094) [22:14:50] (PS2) Spage: new Mantle extension on labs, required by coming Flow [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142142 (https://bugzilla.wikimedia.org/66094) [22:14:53] (CR) QChris: Add backup role and scripts to wikimetrics (2 comments) [operations/puppet/wikimetrics] - https://gerrit.wikimedia.org/r/139557 (https://bugzilla.wikimedia.org/66119) (owner: Milimetric) [22:16:20] (CR) QChris: Add backup role and scripts to wikimetrics (1 comment)
[operations/puppet/wikimetrics] - https://gerrit.wikimedia.org/r/139557 (https://bugzilla.wikimedia.org/66119) (owner: Milimetric) [22:16:50] hey when I download the puppet repo I get this error: warning: remote HEAD refers to nonexistent ref, unable to checkout. [22:17:10] do i just git checkout remotes/origin/production [22:17:46] bblack: this is the varnish issue ^ ? [22:17:57] (CR) Hashar: [C: 1] "Sounds good to me. Mantle is in mediawiki/extensions.git" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142142 (https://bugzilla.wikimedia.org/66094) (owner: Spage) [22:18:02] you used HEAD as the name of your remote here, dogeydogey, if I'm not much mistaken [22:18:13] hence the phrase "remote HEAD refers to..." [22:18:16] twkozlowski I just did git clone https://git.wikimedia.org/git/operations/puppet.git [22:18:52] I think the HTTPS cloning doesn't work too well for us; I always do it over SSH [22:18:54] ottomata: see hoo|away's request above^^ [22:19:24] twkozlowski don't have ssh access [22:19:36] so is git checkout remotes/origin/production okay? [22:19:40] or is there a better way to solve this? [22:19:51] dogeydogey: do you have a Gerrit account yet? [22:20:21] twkozlowski yes [22:20:29] so upload your keys there [22:21:19] dogeydogey: https://wikitech.wikimedia.org/wiki/Git [22:22:35] dogeydogey: also: https://www.mediawiki.org/wiki/Git [22:22:42] oh wow i registered for gerrit back in february [22:22:42] and some links around there [22:22:42] hah [22:22:46] oooh, I see [22:22:58] dogeydogey: puppet/.git/config [22:23:09] does it refer to branch master or production? [22:23:21] and of course: https://www.mediawiki.org/wiki/Gerrit/Tutorial [22:25:16] (git checkout production worked fine for me, it appears.) [22:25:54] weird, the correct ssh keys are in there [22:26:02] but still getting permission denied [22:26:22] dogeydogey: ^^ [22:26:58] twkozlowski fetch = +refs/heads/*:refs/remotes/origin/* [22:27:15] oh...
it says [remote "origin"] [22:28:38] Well, I cloned the repo just as you did and then did git checkout production and it works [22:29:47] dogeydogey: do a clean clone with ssh [22:30:28] someone feel like +2'ing a wmf-config change that's beta cluster only? [22:30:35] https://gerrit.wikimedia.org/r/#/c/142142/ [22:34:55] matanya twkozlowski this worked: git clone ssh://username@gerrit.wikimedia.org:29418/operations/puppet.git [22:34:59] thanks for your help [22:37:08] _joe_: I noticed you mentioned icinga alerts for anomalies (for the APC monitoring that would graph out in graphite as well). I'd be interested to see how that works. Could I tag along? [22:42:52] matanya: no, I don't think I did anything to do with that checkout issue [22:43:31] thanks bblack [22:52:58] (CR) BBlack: "Just repeating from irc:" [operations/puppet] - https://gerrit.wikimedia.org/r/142087 (owner: Ottomata) [22:54:05] (CR) BBlack: "Is this good to go for deployment, or are we waiting on some other feedback first?" [operations/puppet] - https://gerrit.wikimedia.org/r/142001 (owner: Dr0ptp4kt) [22:54:31] (CR) Dr0ptp4kt: "It's good to go." [operations/puppet] - https://gerrit.wikimedia.org/r/142001 (owner: Dr0ptp4kt) [22:54:43] (PS3) BBlack: Temporarily disable OM on 470-07. Rollout is in two phases. [operations/puppet] - https://gerrit.wikimedia.org/r/142001 (owner: Dr0ptp4kt) [22:55:18] (CR) BBlack: [C: 2 V: 2] Temporarily disable OM on 470-07. Rollout is in two phases.
[operations/puppet] - https://gerrit.wikimedia.org/r/142001 (owner: Dr0ptp4kt) [22:55:36] bblack thx [22:55:43] np [22:58:37] (PS1) Spage: add new Mantle extension, required by coming Flow [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142151 (https://bugzilla.wikimedia.org/66094) [23:00:04] RoanKattouw, mwalker, ori, MaxSem, aude: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140625T2300) [23:00:52] I dont want to do it because I have a pile of other work to do [23:00:53] but I can [23:01:21] no one else around? [23:01:29] guess I'm doing it :p [23:01:31] I can do it [23:01:35] I haven't done it in a wahile [23:01:37] *while [23:01:39] I'd rather not deploy myself, too slepy:| [23:01:43] ok [23:01:59] RoanKattouw, sold! :) [23:02:00] * Reedy wonders how MaxSem deploys himself [23:02:15] our patch should improve the db load on s5 a bit [23:02:23] * mwalker goes back to poking at this task that needs to be done for the end of the day [23:02:28] RoanKattouw: do you have a deployment degree yet? [23:02:40] aude, you're undeploying wikidata? :P [23:02:46] MaxSem: heh [23:03:14] (PS6) Reedy: Remove remnants of . replaced with _ in "lang" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/134948 [23:04:03] spagewmf: Roan Kattouw, MD [23:04:06] Oh wait, that's something else :P [23:04:11] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data exceeded the critical threshold [500.0] [23:04:34] It's never Lupus. [23:09:09] doctor, doctor: after/during SWAT could you +2 https://gerrit.wikimedia.org/r/#/c/142142 ? Should have no effect in production [23:12:46] (CR) Spage: "See also https://gerrit.wikimedia.org/r/#/c/142142 to enable Mantle on beta labs.
We could enable Mantle in production in1.24wmf11 in adva" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142151 (https://bugzilla.wikimedia.org/66094) (owner: Spage) [23:12:56] Oh yay my SWAT things merged [23:12:58] spagewmf: Looking [23:13:34] greg-g: What is the deployment policy re new extensions on labs? Do they require security review or whatever? [23:14:12] RoanKattouw: 142142 , not 142151. csteipp gave us the OK, https://bugzilla.wikimedia.org/show_bug.cgi?id=66238 [23:14:15] RoanKattouw: yeah [23:14:20] and mantel has had it.... see above [23:16:13] aude: which patch will affect s5? [23:16:23] (sounds good -- just interested) [23:16:36] greg-g: Cool [23:16:45] spagewmf: Random question, where did javascripts/common/EventEmitter.js come from? [23:16:49] springle: https://gerrit.wikimedia.org/r/#/c/141997/ [23:16:52] *eventemitter.js lowercase [23:17:11] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [23:17:28] aude: nice, thanks. linking that in the incident report [23:17:48] RoanKattouw: that's MobileFrontend JavaScript, ask Jon Robson. Flow isn't using it. [23:17:51] Hah so this is exactly why we need de-silo-ization [23:17:54] i wonder how the bad code ever got merged and probably has been tehre a while [23:17:59] Look at Class.js and eventemitter.js [23:18:10] They provide the same feature set as oojs, with similar names even [23:18:20] (Not oojs-ui, oojs core) [23:18:55] RoanKattouw: indeed, I almost e-mailed Mantle's OO stuff as an example. [23:20:11] aude: well, it's found now. also s5 has another slave anyway, too.
i guess you guys will find some other load for it ;) [23:20:24] RoanKattouw: and shahyar's Flow front-end code has a "Registers a given FlowComponent into the component registry, and also extends the class with FlowComponent", sounds familiar 8-) [23:20:45] springle: definitely [23:20:48] Yeah we have registries and factories [23:20:55] We should be able to at least standardize *this* [23:21:03] I mean, come on! It's backend code [23:21:12] If M has features oojs doesn't have, let's just merge them [23:21:41] Offhand it seems like our event emitter has more features (like. connect()) because we wrote our own rather than wrapping jQuery's [23:22:02] But if there's some magical thing that M gives you that oojs doesn't, we should make that a feature of oojs most likely [23:23:04] !log catrope Started scap: Updating Wikidata and TimedMediaHandler [23:23:09] Logged the message, Master [23:25:11] to be fair, thats not new code, it was created in Feb 2013, and just moved from one extension to another now [23:25:30] (the eventemitter and friends) [23:25:52] (PS1) MaxSem: Disable mobile upload CTA on wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142155 (https://bugzilla.wikimedia.org/66958) [23:25:57] I was about to ask when it was created [23:26:06] RoanKattouw: are you talking about EventEmitter? [23:26:11] jdlrobso_: That, and Class.js [23:26:20] this is exactly why Mantle was created :) [23:26:24] I was about to look up when oojs was broken out, probably around that time too [23:26:26] RoanKattouw: mobile has been using that code since the dawn of time [23:26:29] Right [23:26:42] *dawn of mobile time [23:26:42] I was told this is where the party is [23:26:58] so RoanKattouw what did i miss? [23:26:58] Similarly VisualEditor used ve.extendObject in ve.js since the dawn of VE time (2011ish) [23:27:12] Then we broke it out to oojs in ... the summer of 2013?
[23:27:17] RoanKattouw: yup jgonera pointed this out when we started VE work [23:27:25] Or something [23:27:25] I'll look it up [23:27:25] So… two years later it got re-created? Great. [23:27:29] !log catrope Finished scap: Updating Wikidata and TimedMediaHandler (duration: 04m 24s) [23:27:32] but you've been working in a silo and so have we which is why we want to make Mantle [23:27:33] Logged the message, Master [23:27:39] jdlrobso_: I was ranting that this is exactly why we need to unify and standardize things [23:27:39] This is why we need to work together. [23:27:40] if VisualEditor can depend on Mantle even better [23:27:45] we can start consolidating all this code debt [23:27:45] Umm. [23:27:49] James_F: no shit sherlock :P [23:27:53] It's ridiculous that we have two separate projects that do class inheritance and event emission [23:28:02] RoanKattouw: sure. Silos. Time to fix it. [23:28:06] Mantle is newer. Why would we switch from the existing code to new implementations of existing code? [23:28:08] Yeah [23:28:13] OOjs is the pre-existing code here. [23:28:19] James_F: It sounds like Mantle is more than just that [23:28:24] I’m sure there are more than 2 projects doing class inheritance and/or event emission :) [23:28:28] James_F: So think of Mantle like 'mediawiki core code purgatory' [23:28:35] But yeah for classes and event emission, OOjs is more mature and is already in MW core [23:28:41] RoanKattouw: Oh, sure, the actual system is totally different, and using HandleBars is a different use case. [23:29:12] RoanKattouw: We should do more "brown bags" or whatever to help teams understand what tools they have available so they don't re-invent the wheel. [23:29:18] Yeah that would be nice [23:29:26] Because I didn't know this particular wheel had been reinvented [23:29:33] And if I didn't volunteer for SWAT today I still wouldn't know [23:29:40] * James_F sighs. 
[23:29:47] And I'm sure the people that did the reinvention also weren't aware there was already a wheel [23:30:02] Oh, indeed. [23:30:05] James_F: RoanKattouw I don't really appreciate your moaning here. This is precisely why I asked Flow to share code with us. [23:30:08] It's not anyone's fault. [23:30:09] it doesn't mean it can't change [23:30:16] it just means we start surfacing these better [23:30:20] Brief on-topic announcement: aude, tgr, your SWATs are done [23:30:31] RoanKattouw: did you do wmf10 also? [23:30:34] Yes [23:30:34] It's just aversion to actually discussing thing with other teams. [23:30:37] k [23:30:48] If all the teams depended on core and Mantle ie. worked on the same code base we'd notice these things better [23:30:50] jdlrobso_: I'm not blaming anyone, I'm just saying that this situation is awful [23:31:06] Well [23:31:11] We did put OOjs into MW core [23:31:19] I don't know how much better we can do "same code base" [23:31:29] James_F: i'm actually surprised you have both only seen the EventEmitter. I'm pretty damn sure jgonera has talked to Trevor and/or Krinkle about it [23:31:52] The event emitter in Mantle seems to just be a wrapper around jQuery anyway [23:31:52] jdlrobso_: I care about features not code. Obviously. [23:31:56] RoanKattouw: well as i understand it the multimedia team is using oojs no? [23:31:59] looks ok, nothing to notice different except maybe faster purges / saves [23:32:01] Yes they are [23:32:03] so that's good. [23:32:06] jdlrobso_: Indeed. 
[23:32:10] Flow and mobile are converging too [23:32:10] seems to be [23:32:11] And I'm not objecting to people not using oojs, that's fine [23:32:20] so it seems like we're heading in the right direction [23:32:39] when VE/MultimediaStuff and Mobile/Flow want to converge it will be even simpler [23:32:48] RoanKattouw: it would be nice to have a process for announcing "this piece of code is intended to go to core, eventually" [23:32:57] Yeah, I guess it would be [23:33:24] since making changes in core is so much more difficult, it's just not worth porting the code while it is being actively developed [23:33:29] Hmmmm [23:33:42] It looks like Mantle's event emitter might actually be fully compatible with OOjs's [23:33:51] It seems like its API is a subset of ours [23:33:56] So we could switch across? Neat. [23:33:57] RoanKattouw: correct [23:33:58] If that is true then migrating would be trivial [23:34:02] which means it can remain stuck in some extension for months, so it id easy to duplicate work unintentionally [23:34:09] i know that juliusz tinkered a little bit with it [23:34:19] You could do M.eventemitter = OO.EventEmitter; and things would just keep working [23:34:54] and now we are working in a shared code base with tests that show how things are suppose to work we can achieve this quicker :-') [23:35:02] Oooh, tests? [23:35:05] * RoanKattouw goes to read tests [23:35:44] RoanKattouw: the SWAT works for me, thanks! [23:35:55] jdlrobso_: What is this.sandbox.spy()? Is that a qunit thing? 
It looks pretty neat [23:36:06] RoanKattouw: yeh it allows you to mock stuff easily [23:36:09] uses sinon [23:36:19] that's a core thing now, afiak [23:36:25] for qunit [23:36:31] That's cool [23:36:38] so yeh i'm fine with us changing any of the implementations in Mantle/upstreaming stuff to core as long as we don't break the tests :-) [23:36:38] Makes writing tests for things like event emitters much easier [23:36:38] I'm not sure about the details of the discussion jdlrobso_ and RoanKattouw, but we have what we have in MF/Mantle because OO.js/OO.js-ui did not meet our needs when we evaluated it (which was a while back) [23:37:01] jgonera: For UI stuff, sure, but for OO/event stuff? [23:37:18] we aren’t using OOjs right now, but we could. I am interesting in migrating our current Flow class inheritance stuff to something cleaner. [23:37:18] What about OOjs didn't meet your needs? [23:37:24] it's probably much better today, although we would probably need to make it a bit more modular (RL-module wise) to use it on mobile [23:37:30] I ask because Mantle's OO and event systems are pretty much a subset [23:37:45] RoanKattouw, it seemed to have a lot of functionality we did not need [23:37:47] exactly [23:37:51] interested*. however, a lot of people are pushing us to use OOui if we go that route, which is not at all what we want. [23:37:52] shahyar: Yeah migrating OO stuff is less trivial so if you wanna collaborate on that that would be sweet [23:37:53] a subset that we need [23:37:59] jgonera: So you were worried about load? [23:38:01] RoanKattouw: which sounds about right. Mantle should only be a temporary thing whilst we push for standardisation. [23:38:16] Oh, I remember this discussion now [23:38:21] James_F, yes, although we could make a better job at comparing the load now [23:38:25] yeah, we’ve had this conversation before :) [23:38:28] jgonera: *nods* [23:38:28] Something about OO.EventEmitter being too much code or something [23:38:29] well, kind of. 
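Editor's note on the spy discussion above: the idea of testing an event emitter with a spy can be sketched without any library. This is a minimal illustration only. `TinyEmitter` and `makeSpy` are hypothetical names invented for this note; they are not Mantle's, OOjs's, or sinon's actual code, and the real sinon spy API is richer than this.

```javascript
// Hypothetical minimal emitter with on()/emit(); not Mantle's or OOjs's code.
class TinyEmitter {
  constructor() {
    this.handlers = new Map();
  }
  on(event, fn) {
    if (!this.handlers.has(event)) {
      this.handlers.set(event, []);
    }
    this.handlers.get(event).push(fn);
    return this;
  }
  emit(event, ...args) {
    const list = this.handlers.get(event) || [];
    // Copy the list so handlers added during emit do not run in this pass.
    list.slice().forEach((fn) => fn(...args));
    return list.length > 0;
  }
}

// Hand-rolled spy: a function that records the arguments of every call,
// similar in spirit to what a sinon spy provides.
function makeSpy() {
  const spy = (...args) => {
    spy.calls.push(args);
  };
  spy.calls = [];
  return spy;
}

const emitter = new TinyEmitter();
const spy = makeSpy();
emitter.on('saved', spy);
emitter.emit('saved', 42, 'ok'); // the spy is called once
emitter.emit('ignored');         // no handler registered, spy untouched
console.log(spy.calls.length, spy.calls[0]);
```

The test then only has to inspect `spy.calls`, which is why spies make emitter tests short.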
[23:38:52] shahyar: We've had the meta-conversation, probably, but this is a concrete and non-controversial thing, so it actually stands a chance of getting done :) [23:39:29] it’s not in our plans atm, but maybe more towards the end of the quarter, or possibly Q2 [23:39:43] Right [23:39:53] If I get bored on an airplane in the near future I might write a patch that does it :) [23:39:56] So RoanKattouw the plan with Mantle is to continue to move stuff out of MobileFrontend and Flow that's shared as much as possible. [23:40:04] jdlrobson: That's awesome [23:40:15] Once it's in there it'll get more visibility [23:40:23] we’d take another look at how we’re doing classes and events. however, I like what is happening with Mantle, as the code for it is quite small and leverages jQuery, which we already have everywhere. [23:40:23] But the idea would be we'd eventually want that to get into core in some form, so if you guys can work on that repository too maybe we can make this standardisation happen quicker. [23:40:25] And I might cannibalize some of those things for OOjs/OOUI in turn [23:41:40] jdlrobson: Sounds good. At some point, though, there are probably conflicting goals we need to talk about [23:41:48] Like, right now there are two event implementations [23:41:51] RoanKattouw: of course [23:41:56] One in Mantle which is a thin wrapper around jQuery [23:42:11] there are going to be some clashes, but this is the purpose of Mantle [23:42:11] And a more feature-rich one in OOjs which does not use / depend on jQuery at all [23:42:20] * jdlrobson wishes he called Mantle TheThunderDome now [23:42:23] hahaha [23:42:32] Complicated factor being, oojs is in core already [23:43:15] And used by three teams' code. [23:43:15] jdlrobson: Hmmm, how is M.EventEmitter meant to be used, as a base class that is inherited from? Or as a mixin? 
[23:44:37] RoanKattouw: it has two main uses 1) M.emit and M.on - ability for JavaScript modules to communicate with each other via global events 2) as a base class for our Classes so we have local events [23:44:40] if that makes sense [23:44:46] RoanKattouw, as a base class originally, although this got more complicated when we wanted our Api to inherit from mw.Api and we have a weird hack there [23:45:04] RoanKattouw, I'm all for having one EventEmitter [23:45:18] jdlrobson: I understand why .emit and .on are there, it's exactly the same in OO. I'm just asking, if I want to use it, is it required that my class inherit from EventEmitter (or from a class that indirectly inherits from it)? [23:45:25] RoanKattouw, I do not think though that "not depending on jQuery" is an advantage [23:45:32] jQuery isn't going anywhere [23:45:34] Yeah we had the multiple inheritance issue too [23:45:40] So we made it a mixin [23:45:52] jgonera: That's not why we don't depend on jQuery, we're not afraid of it or anything [23:45:59] jgonera: RoanKattouw +1 and i'm confident we can have one. [23:46:07] so why don't you depend on it? [23:46:09] jgonera: Remember that "we" (in the form of Timo) also write jQuery. ;-) [23:46:09] The only reason we wrote our own event emitter is because we wanted features that jQuery's doesn't provide, like .connect() [23:46:15] jgonera: OOUI does [23:46:19] I see [23:46:20] Just OOjs doesn't because there's no reason to [23:46:40] It just organically happened to not have any code that uses jQuery, also because it just does OO stuff and no DOM stuff [23:46:59] we (MF) should probably have a second look at OO.js [23:47:18] jgonera: You've got a bunch going on, though. :-) [23:47:23] I also remember Krinkle also saying you guys were doing something similar to us in RL but doing it in a better way but since VE is a big bag of code and mobile is a big bag of code it wasn't that visible. Mantle bubbles this stuff up a bit. 
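Editor's note on the base class versus mixin question above: when a class already has a parent (for example an API client), events can be added by copying emitter methods onto its prototype instead of inheriting from an emitter. The sketch below is plain JavaScript under stated assumptions. `EmitterMixin`, `BaseApi` and `EventedApi` are hypothetical names for illustration; this is not how OOjs or Mantle implement it.

```javascript
// Hypothetical emitter mixin: plain methods that keep state on the instance.
const EmitterMixin = {
  on(event, fn) {
    this._handlers = this._handlers || {};
    (this._handlers[event] = this._handlers[event] || []).push(fn);
    return this;
  },
  emit(event, ...args) {
    const list = (this._handlers && this._handlers[event]) || [];
    list.slice().forEach((fn) => fn.apply(this, args));
    return list.length > 0;
  },
};

// Stand-in for an existing parent class, such as an API client.
class BaseApi {
  get(query) {
    return { query };
  }
}

// Inherits from BaseApi and gains events by mixing in the emitter methods,
// so no second base class is needed.
class EventedApi extends BaseApi {
  get(query) {
    const result = super.get(query);
    this.emit('request', query);
    return result;
  }
}
Object.assign(EventedApi.prototype, EmitterMixin);

const api = new EventedApi();
const seen = [];
api.on('request', (q) => seen.push(q));
api.get('titles=Main_Page');
console.log(seen);
```

The trade-off is that the mixin's methods and its `_handlers` property share a namespace with the host class, so names have to be chosen to avoid collisions.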
[23:47:26] I'm all for sharing more code, even if it leads to some compromises on how we have to write our code [23:47:42] By all means do (at any time convenient for you), I'd be happy to give you a tour or explain things [23:47:48] And if there are things you don't like, we can fix them [23:47:56] or compromise on them [23:48:02] that sounds great RoanKattouw [23:48:03] RoanKattouw: if you want to file a bug against Mantle around EventEmitter please do it - it is on bugzilla [23:48:08] Oh cool [23:48:20] will make sure we track these opportunities [23:48:58] Yeah that's good [23:49:00] I'd generally say that new components or products in Bugzilla are generally worthy of a wikitech-l post. If nothing else, it would help bubble up awareness. :-) [23:49:23] (I've totally violated my newly-proposed suggestion in the past.) [23:49:38] Mantle is new enough that there's still time :) [23:49:46] People have been getting work done instead of talking about it ;) [23:49:51] (says the person talking about work rather than doing it) [23:50:25] James_F: RoanKattouw i thought i had mailed out about Mantle, but if i haven't I will be sure to do so when mobile and Flow are both dependent on it to outline the goals and the mission of it. [23:50:38] Yeah [23:50:39] come the end of the fiscal year I hope it is dead and we have suceeded. :) [23:50:39] thanks springle :) [23:50:55] :) [23:51:14] Hmm to your credit S did mention Mantle on wikitech-l at some point [23:51:18] But specifically in the context of templating [23:51:26] jdlrobson: That'd be great. [23:51:30] Which, BTW, I'm glad you found a slightly better home for the template RL module thing [23:51:51] jdlrobson: The end of the fiscal is on Monday… I assume to mean a year from Monday? :-) [23:52:06] James_F: that might be too optimistic even for me ;-) [23:52:09] jdlrobson: :-D [23:52:20] James_F: ok noted. I'll send a mail once Mantle is deployed. [23:52:51] jdlrobson: Cool. 
[23:52:54] and by that i mean betalabs or whatever - good to point at code [23:53:03] jdlrobson: But even before that is useful. [23:55:26] (CR) Catrope: [C: 2] new Mantle extension on labs, required by coming Flow [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142142 (https://bugzilla.wikimedia.org/66094) (owner: Spage) [23:55:33] (Merged) jenkins-bot: new Mantle extension on labs, required by coming Flow [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142142 (https://bugzilla.wikimedia.org/66094) (owner: Spage) [23:55:35] James_F: Right now my main focus is getting the frontend rewrite deployed for Flow, and getting mobile dependent on Mantle. :) [23:55:44] jdlrobson: ---^^ Right now a Jenkins job is deploying Mantle to beta labs :) [23:55:49] ( spagewmf ---^^ ) [23:55:53] but i will write such email as soon as i have time to give it the email it deserves :) [23:56:00] heh, you beaten me to it [23:56:02] (PS1) Rush: allow setting mysql_mode though mysql::config [operations/puppet] - https://gerrit.wikimedia.org/r/142162