[00:00:09] Logged the message, Mr. Obvious [00:00:24] !log maxsem synchronized wmf-config 'https://gerrit.wikimedia.org/r/#/c/37157/' [00:00:33] Logged the message, Master [00:02:57] Hrmph [00:03:02] /etc/dsh/group isn't on bast1001 either [00:05:33] !log rebooting db1043 [00:05:41] Logged the message, Master [00:06:17] !log aaron synchronized php-1.21wmf5/extensions/SwiftCloudFiles/php-cloudfiles-wmf/cloudfiles_http.php [00:06:25] Logged the message, Master [00:07:00] PROBLEM - Host db1043 is DOWN: PING CRITICAL - Packet loss = 100% [00:08:38] RECOVERY - Host db1043 is UP: PING OK - Packet loss = 0%, RTA = 26.65 ms [00:08:41] Change abandoned: Demon; "Actually, I'm thinking of just killing the package entirely. All it does is get in the way when we'r..." [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/27531 [00:12:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:24:58] notpeter: Rargh this deployment thing is getting annoying. To make it work I would have to deploy from /home (so fenari), and I'd have to modify the rsyncd config [00:25:12] The alternative is putting it in /h/w/common , which would slow down MediaWiki deploys tremendously [00:25:18] hrm, gotcha [00:25:25] well shoot [00:25:34] I mean I'm happy to run a temporary rsyncd on bast1001 [00:25:42] But, you know, this web is weaving itself fast [00:25:51] heh, true [00:26:03] how about we think about possible solutions and regroup and tlak about this tomorrow? [00:26:04] Hmm [00:26:07] * RoanKattouw has an idea [00:26:21] git is already installed on the Parsoid boxes by puppet [00:26:39] "Deploy" could be dsh git pull [00:27:01] That requires that I initially transfer the node_modules dir but that's fine, that's a one-off [00:27:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.274 seconds [00:28:42] RoanKattouw: where can I learn about parsoid and its architecture? my interest is mostly from an ops perspective but more generally wouldn't hurt (I've read mw.org/wiki/Parsoid, not enough...) [00:28:59] I'm not sure it's readily documented [00:29:08] The project leader is gwicke and he and his team hang out in #mediawiki-parsoid [00:30:32] RECOVERY - mysqld processes on db1043 is OK: PROCS OK: 1 process with command name mysqld [00:32:03] paravoid: what time is it there? [00:33:23] PROBLEM - MySQL Slave Delay on db1043 is CRITICAL: CRIT replication delay 3118 seconds [00:33:24] 2:30am [00:33:29] EET [00:34:53] RECOVERY - MySQL Slave Delay on db1043 is OK: OK replication delay seconds [00:35:54] paravoid: what if swift auth just have an IP storage url? :p [00:37:37] I'd prefer not [00:37:44] let's solve the issue, not hide it under the carpet [00:38:14] !log testing mariadb-server-5.5 5.5.28-mariadb-wmf201212041~precise on db1043 (slaving enwiki) [00:38:17] any luck into writing a reproducable test case? [00:38:22] Logged the message, Master [00:38:24] no [00:38:25] binasher: \o/ [00:39:41] PROBLEM - MySQL Slave Delay on db1043 is CRITICAL: CRIT replication delay 2708 seconds [00:41:39] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37155 [00:43:40] !log Added mexia, tola, kuo, lardner to /h/w/common/docroot/noc/pybal/pmtpa/parsoid , they appear to be pooled just fine now [00:43:48] Logged the message, Mr. 
Obvious [00:43:58] paravoid: I don't see this getting fixed soon [00:45:43] New patchset: Catrope; "Make /var/lib/parsoid owned by wikidev, makes deploying easier" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37165 [00:49:46] paravoid: alright, I give up on that for now [00:50:26] notpeter: Hmm, the Parsoid group isn't showing up in Ganglia yet, when can I expect that to happen? [00:50:38] (Puppet has run on wtp1 (aggregator), but I suppose it may not have run on the main Ganglia box?) [00:51:12] RoanKattouw: hhhhmmmm, is mercury in retrograde right now? [00:52:12] I don't know [00:52:22] I also don't see Rob's new machines /anywhere/ in Ganglia [00:52:34] Although that makes some sense, they think they belong to a group that doesn't show in Ganglia [00:52:35] RECOVERY - MySQL Slave Delay on db1043 is OK: OK replication delay 28 seconds [00:52:47] And wtp1 (aggregator) shows up in misc, but as down (!!) [00:53:05] New patchset: Asher; "pulling db59 from s1 for upgrade to precise" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37168 [00:53:43] See http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Miscellaneous%2520pmtpa&tab=m&vn= . It says wtp1 is down, which is bull. And there's no mexia/kuo/tola/lardner [00:53:51] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37168 [00:54:03] puppet probably need to run on nickel [00:55:25] !log asher synchronized wmf-config/db.php 'pulling db59 from s1 for system upgrade' [00:55:33] Logged the message, Master [00:56:29] !log shutting down mysql on db59, upgrading to precise [00:56:38] Logged the message, Master [01:00:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:04:08] PROBLEM - mysqld processes on db59 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [01:17:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.039 seconds [01:19:28] !log rebooting db59 [01:19:38] Logged the message, Master [01:21:14] PROBLEM - Host db59 is DOWN: PING CRITICAL - Packet loss = 100% [01:23:02] RECOVERY - Host db59 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [01:42:41] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [01:49:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:50:38] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [02:04:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.438 seconds [02:24:10] !log LocalisationUpdate completed (1.21wmf5) at Thu Dec 6 02:24:10 UTC 2012 [02:24:19] Logged the message, Master [02:29:42] New patchset: Reedy; "Defrag commented out readonly statements" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36594 [03:00:10] New patchset: Asher; "returning db59 to service" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37181 [03:00:32] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37181 [03:19:04] i took mysql down on db59 rebooted it again, and no nagios messages here :/ [03:25:00] !log asher synchronized wmf-config/db.php 'returning db59 to s1' [03:25:11] Logged the message, Master [03:26:11] !log asher synchronized wmf-config/db.php 'lowering db59 weight for warmup' [03:26:20] Logged the message, Master [03:27:44] 
!log asher synchronized wmf-config/db.php 'setting db59 to full weight' [03:27:52] Logged the message, Master [03:47:47] New patchset: Pyoungmeister; "adding nagios groups for parsoid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37183 [03:52:42] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37183 [04:26:35] New patchset: Ryan Lane; "Fix nagios groups that are breaking nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37185 [04:26:53] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37185 [05:58:12] RECOVERY - mysqld processes on db59 is OK: PROCS OK: 1 process with command name mysqld [05:58:30] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [05:58:30] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [05:58:30] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [06:00:45] PROBLEM - Parsoid on mexia is CRITICAL: (null) [06:01:12] PROBLEM - Frontend Squid HTTP on sq48 is CRITICAL: Connection refused [06:01:21] PROBLEM - Parsoid on wtp1001 is CRITICAL: (null) [06:01:30] PROBLEM - Parsoid on kuo is CRITICAL: (null) [06:01:39] PROBLEM - Parsoid on tola is CRITICAL: (null) [06:01:57] PROBLEM - Parsoid on lardner is CRITICAL: (null) [06:01:57] PROBLEM - Parsoid on wtp1 is CRITICAL: (null) [06:30:46] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [06:31:44] maybe sq48 needs a cmjohnson1 [06:39:52] New patchset: Catrope; "Make /var/lib/parsoid owned by wikidev and g+ws, makes deploying easier" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37165 [07:13:53] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [07:13:53] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [07:30:55] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [07:30:55] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [07:48:25] New patchset: Jeremyb; "bug 42765 - throttle.php: Mumbai outreach event" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37196 [07:58:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:00:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.051 seconds [08:00:48] New patchset: Jeremyb; "bug 42765 - throttle.php: Mumbai outreach event" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37196 [08:03:42] New review: Jeremyb; "thanks Matthewrbowker for CR via IRC" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/37196 [08:31:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:38:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.027 seconds [08:50:16] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [09:12:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:28:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.382 seconds [09:33:19] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 
hours [09:36:20] doh [09:36:21] hello [09:36:26] so I broke ops/puppet.git :( [09:44:55] yep [09:45:04] luckily the workaround is pretty simple [09:54:20] I really need to train other people to modify the workflow :-] [09:57:20] should be fixed now [09:57:40] Change restored: Hashar; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37009 [09:58:32] Something's wrong with the job queue. Translation notification jobs are getting "stuck" somehow. I'd like someone to check the job queue for metawiki and DELETE any present translation notification jobs. How can I get that done? [09:58:41] New patchset: Hashar; "validate jenkins job" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37009 [09:58:53] hashar: ^^ Who can I poke? [10:00:52] siebrand: at this hour me I guess [10:01:14] though I am supposed to deploy the new workflow on Mediawiki/core this morning [10:01:33] hashar: This is something that's very disruptive in wikis at the moment. [10:01:52] hashar: See https://bugzilla.wikimedia.org/show_bug.cgi?id=42715 [10:02:16] hashar: I'm looking up queue IDs to make it easier for you. If you are to do something, please report back in the bug what you were able to delete. [10:03:15] hashar: The job id is probably translationNotificationJob [10:03:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:04:18] siebrand: I will deploy the new workflow first [10:04:28] siebrand: should take roughly an hour then I look at the job queue issue [10:04:49] hashar: so "job_cmd = 'translationNotificationJob'" [10:05:10] hashar: great. I'll add this info to the issue and will add you in CC. [10:08:39] !log applying new CI workflow on mediawiki/core.git . Changes now needs CR+2 to trigger unit tests. [10:08:51] Logged the message, Master [10:12:26] yeah new workflow applied to mw/core@master [10:18:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.055 seconds [10:39:29] bah I am locked [10:50:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:07:00] hashar: What's the status on the job queue question I asked? [11:07:18] on hold till I finish deploying the new workflow [11:07:20] doing regression tests locally right now [11:07:21] should not be long :) [11:08:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.068 seconds [11:16:05] siebrand: ok finished my tests :) [11:19:41] soo hm [11:19:57] metawiki has 2213 translationNotificationJobs pending [11:30:23] hashar: remove the lot :) [11:30:24] hashar: please nuke them all. [11:30:29] snap [11:30:48] hashar: (make a dump if possible) [11:31:12] hashar: it's semi-sensitive data, so please only distribute to AaronS and the likes. [11:31:31] well looking at the dump for https://meta.wikimedia.org/w/index.php?title=User_talk:Vladimir_Penov&action=history [11:31:47] I get three translationNotificationJob jobs for it [11:31:51] still? [11:31:55] from Nov 27th, Dec 04 and Dec 01. [11:32:01] each of them have job_attempts = 3 [11:32:33] siebrand: ^^^ [11:32:43] looking over the edits, it seems like the job gets half done and then stops and starts all over again [11:32:45] (if you're able to save the data, nuke from production after that, and report on the bug, then we'll get someone to fix it, hopefully.) [11:33:00] I'm working for another client, under high pressure, and don't have the time at the moment. 
[11:33:09] ok ok [11:33:13] this really needs someone involved who did the recent job queue reowrk. [11:33:23] should I just nuke all translationNotificationJob jobs meanwhile ? [11:33:30] then report back too Aaron so he looks at it? [11:34:37] hashar: yes [11:34:51] but aaron will need the data [11:35:12] it would be better not to throw further tests on users' heads [11:35:46] what puzzle me is that our runJobs log does not show any translationNotificationJob :( [11:38:10] hashar: would you mind me quoting you in the bug? [11:38:28] Seddon: yes please :) [11:38:47] hashar: cool, Ill do that now [11:39:29] ahh [11:39:35] Seddon: I used the wrong grep command :) [11:39:50] so they are there in the log? [11:40:55] apparently [11:40:58] looking for them [11:41:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:41:47] the jobs do end up with an error [11:41:55] but that is an empty error :/ [11:43:22] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [11:46:01] hm [11:46:23] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [11:51:04] I shall (since I am so out of my league with allof this) assume that should not be the case [11:51:19] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [11:51:56] fuuauufuazueazeu [11:54:43] lolwut [11:55:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.040 seconds [11:56:05] hashar: Niklas is suggesting that previous behaviour of the job queue was to drop a failed job, and now it may be re-run. [11:56:14] siebrand: I think so [11:56:26] siebrand: I am enhancing the translationnotification job [11:56:29] so it returns a proper value [11:56:32] hashar: The empty status may have to do with it. So we're indeed dealing with a pretty serious regression in the job queue redesign. [11:56:37] apparently returns always NULL [11:56:44] which is recognized as a failure by the job queue [11:56:57] will follow up with Nikerabbit [11:57:12] hashar: He's having his day off. Finnish independence day. [11:57:18] ah nice [11:57:29] well will sort it out with someone else so :-] [11:58:10] ill post some of this on the job queue bug [12:13:37] hashar: Thanks for the help. [12:14:22] Seddon: hold on till I get a patch to send :-] [12:14:27] ok cool :) [12:14:30] fighting with our Status class right now [12:14:35] WICH IS A HUGE PILE OF MESS [12:14:37] :-D [12:15:39] I am beginning to understand why noone wants to work on this job queue :P [12:16:47] well [12:16:57] Seddon: it got rewritten recently [12:17:05] so there is not a lot of people knowing the source code [12:17:23] AND making a mistake in that code tends to cause major issues on the live wikis such as the one with translation notifications [12:17:37] so most people are probably to afraid of breaking the site [12:18:03] luckily, the WMF has a team to handle this (I am part of that team) and we have very smart people (I am not part of them). [12:18:15] lol, oh dear.... Ill brace for impact :P [12:19:38] I genuinely appreciate the help with this :) [12:19:51] my job :-] [12:20:12] I am a contractor for WMF > Platform Engineering > MediaWiki maintenance. 
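(Aside: the failure mode hashar diagnoses above — translationNotificationJob's run() method returning NULL, which the reworked job queue counts as a failure and therefore retries, sending duplicate notifications — comes down to the run() return value. A minimal sketch of the pattern, with illustrative class and parameter names rather than the actual TranslationNotifications code:)

```php
<?php
// Minimal sketch of the failure mode diagnosed above: a Job::run() that
// falls off the end returns NULL, which the reworked job queue treats as
// a failure, so the job is retried (job_attempts climbing to 3) and users
// get duplicate notifications. Class and parameter names here are
// illustrative, not the actual TranslationNotifications code.
class ExampleNotificationJob extends Job {
	public function __construct( $title, $params ) {
		parent::__construct( 'exampleNotificationJob', $title, $params );
	}

	public function run() {
		$user = User::newFromName( $this->params['username'] );
		if ( !$user ) {
			// Record a real error so runJobs.log shows more than the
			// empty error message seen above.
			$this->setLastError( 'Invalid username in job params' );
			return false;
		}

		// ... deliver the notification ...

		return true; // explicit success, so the queue does not re-run the job
	}
}
```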
[12:20:19] so whenever something is broken, that ends up to us :-] [12:20:29] I have said it to others, count yourself lucky that we arn't doing the international fundraiser at the moment [12:20:58] I would not have been as calm and collected over the last four days as I have been :P [12:21:03] we would probably not have deployed the new job queue system in such case [12:21:24] so I get a patch [12:21:28] no I need to write tests for it [12:21:51] hashar: the new job queue was deployed well before the decision not to do the FR [12:21:51] New patchset: Reedy; "Kill of secure.wm.o related configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37209 [12:22:01] * Reedy waves bye bye to secure.wm.o [12:22:27] Reedy: I thought we wanted to keep it around to still serve the old URLs [12:22:28] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36659 [12:22:34] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36595 [12:22:40] Reedy: ah maybe that is handled on the apache side [12:23:13] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.048 second response time [12:23:47] AFAIK all the redirects are inplace at the apache level [12:28:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:32:29] new User::newFromName( 'foobar' ); [12:32:29] ... [12:32:32] I am getting tired [12:36:56] Reedy: I have enabled this morning the new workflow on mediawiki/core.git [12:36:59] Reedy: so +2 to get tests [12:37:08] Reedy: will enable auto submit later this afternoon [12:43:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.693 seconds [13:02:59] hashar: so, should I dist-upgrade gallium? [13:06:02] paravoid: sure :) [13:06:13] paravoid: not sure which packages are going to be upgraded foo [13:06:16] though [13:06:16] grg [13:06:25] paravoid: I am going to grab a snack first [13:08:03] New patchset: Reedy; "Defrag commented out readonly statements" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36594 [13:08:26] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36594 [13:08:34] Snnnack time [13:08:35] brb [13:13:29] back [13:17:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:20:51] hashar: all is well [13:21:51] paravoid: while you are at it, we could get PHPUnit upgraded :) [13:21:57] paravoid: though that is done via PEAR :( [13:32:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.424 seconds [14:05:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:23:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.041 seconds [14:56:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:58:55] New patchset: Dereckson; "bug 42765 - throttle.php: Mumbai outreach event" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37196 [14:59:59] New review: Dereckson; "PS3: adding 2 hours at event end" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/37196 [15:01:18] hashar: apergos: you both have been involved with throttles in the past and now I'm realizing I don't understand entirely how it works... 
what happens after the time period has elapsed if there were more than the standard limit created that day? do the accounts from during the period not count for afterwards? [15:02:14] jeremyb: IIRC the throttles are only valid during a specific time range [15:02:42] hashar: but there's a default throttle that applies all the time [15:02:50] for all IPs [15:03:26] then once the throttle is expired, the old default apply again [15:03:34] and I guess people will be prevented from creating new account :-] [15:03:37] I have no idea honestly [15:03:55] someone should document it ;) [15:04:04] I don't know how we keep track of how many # of account creation have been made for a given IP [15:04:05] me neither, I've never really loked into what happens when the time period is over [15:04:08] maybe that is in memcached [15:04:16] *looked [15:05:01] $key = wfMemcKey( 'acctcreate', 'ip', $ip ); [15:05:02] ah yeah [15:05:05] memcached :-] [15:05:19] jeremyb: so if the default is 5 account per IP over 24hours [15:05:28] we raise it to 50 for an hour [15:05:37] 50 accounts are created, the value is at 50 [15:05:47] no more account can be created for the next 24 hours from that IP [15:05:57] regardless of the $wgAccountCreationThrottle value [15:06:01] well that's inconvenient [15:06:14] i'm thinking we maybe have a separate counter for throttle exceptions [15:06:21] you want a way to flush that value when the time is expired [15:06:34] includes/specials/SpecialUserlogin.php look below the hook call named "ExemptFromAccountCreationThrottle" [15:06:35] and revert to the old counter (from earlier that day) when we revert to default behavior [15:06:56] that could work too [15:07:11] or make the memcached key to contain the value of wgAccountCreationThrottle [15:07:29] this way whenever we change wgAccountCreationThrottle a new key will be hit thus reseting the throttle [15:07:31] hashar: what if there's 2 events in a day? ;) [15:07:41] also, what's 24 hrs? is it rolling? or calendar day UTC? or? [15:07:49] jeremyb: from the same IP ? Then we could raise the throttle to 1000 :-) [15:07:52] oh [15:07:58] 24 hours is the timeout of the memcached key [15:08:04] so that is since the last write to it [15:08:04] huh [15:08:35] if on Dec 5th at 3pm someone create the last account, nobody would be able to create one till Dec 5th 3pm [15:08:51] then the key will expire in memcached [15:08:54] there will be no more value [15:09:01] and the account will be allowed :- [15:09:07] dec 5th == dec 5th [15:09:10] afaik [15:09:13] so 24 hours form when the limit was reaached [15:09:14] oh yeah [15:09:33] if on Dec 5th at 3pm someone create the last account, nobody would be able to create one till Dec SIX 3pm [15:09:37] :-) [15:09:41] New patchset: Dereckson; "(bug 42767) Throttle rule for Goa Event" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37221 [15:09:52] apergos: thanks for the rephrase :-]]]] [15:10:01] yw [15:10:16] jeremyb: feel free to submit a documentation block in our config, I will be happy to approve it [15:10:45] hashar: k, maybe later [15:12:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.081 seconds [15:24:06] New patchset: Mark Bergsma; "Initial version of swiftrepl, a Swift-to-Swift API replicator script" [operations/software] (master) - https://gerrit.wikimedia.org/r/37223 [15:28:31] New review: Dereckson; "Ok, I ask on the bug." 
[operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/37196 [15:29:14] Dereckson: did you see my queries above too? [15:32:42] jeremyb: yes, I've just red your remark about lack of request for the extra margin on the bug. I submitted the patchset before to read your statement, as I needed to have it stabilized to use it as a dependency for the next bug (another throttle request for an event occuring the 12). [15:33:03] i saw the dep [15:33:27] Dereckson: i meant w/ hash ar/aper gos [15:35:08] Red. This comforts me in the idea we should add a little margin for time and participants on each throttle rule. [15:36:26] There is no real risk of abuse, as the required IP are by definition trusted. And we mitigate the risk they can't create an account at the end of the event or if they got too many participants. [15:40:21] * jeremyb doesn't know what "Red" means [15:40:23] but ok [15:41:38] Dereckson: as a general rule I'd like us (shell and related people) to not be event planners. teach people to submit accurate requests and they can themselves figure in buffers as they wish. by the time it gets to us it should be clear what to do. no guesswork [15:42:41] you know what I would like is this: [15:42:49] admin say "oh we have an event on our project" [15:42:57] there are existing pages like https://en.wikipedia.org/wiki/Wikipedia:How_to_run_an_edit-a-thon and https://outreach.wikimedia.org/wiki/GLAM/Model_projects/Edit-a-thon_How-To [15:43:08] we can add / change the instructions there if needed [15:43:12] admin goes to the exptension page, sets the ip range(s), the time frame, and the number of accounts, or simply removes the limit for them [15:43:17] for that time frame [15:43:25] apergos: yeah, i knew that was coming! [15:43:29] wmf staff is never involved :-D [15:43:43] win for everyone [15:44:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:33] * ^demon hides from apergos' suggestion [15:47:49] <^demon> Since that's something on my really really back burner. [15:47:59] * ^demon goes into hiding [15:50:55] cookie licker! [15:51:01] * apergos gices daemon more cookies to lick [15:51:05] gives them too [15:51:34] <^demon> Why eat the cookies when you can just lick them endlessly? :) [15:51:58] cmjohnson1: did the 720s arrive? [15:52:21] yes...they are all here [15:52:35] racked but nothing else...i will be working on getting them up and running today and tomorrow [15:53:06] cool, thanks a lot! [15:59:16] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [15:59:16] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [15:59:16] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [15:59:32] apergos: did you see sq48? [15:59:39] again? [15:59:41] no I didn't [16:01:05] funny, there was a new alert on it just before i was already typing a mention of it ;) [16:01:51] err, just before my message when i was already typing a mention of it* [16:01:54] * jeremyb is sleepy [16:02:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.040 seconds [16:02:58] joy [16:03:06] jeremyb: s/Red/Read. I understand your objection about event planning, but I have some difficulties to understand the advantages to act as a bot and have a very formal approach on this matter. It seemed to be common sense to add such a margin. 
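(Aside: the throttle behaviour hashar describes above boils down to a single per-IP counter in memcached. This is a rough paraphrase of the logic he points at — see includes/specials/SpecialUserlogin.php for the real code; this is a simplified sketch, not a verbatim copy:)

```php
<?php
// Simplified sketch of the account-creation throttle discussed above.
// The counter lives in memcached under one key per IP, which is why
// raising $wgAccountCreationThrottle for an event does not reset a
// counter that has already been accumulated that day.
function exampleAcctCreateThrottle( $ip ) {
	global $wgMemc, $wgAccountCreationThrottle;

	$key = wfMemcKey( 'acctcreate', 'ip', $ip );
	$count = $wgMemc->get( $key );

	if ( $count && $count >= $wgAccountCreationThrottle ) {
		// Limit reached: no more creations from this IP until the key
		// expires, regardless of the configured limit in the meantime.
		return false;
	}

	if ( !$count ) {
		$wgMemc->set( $key, 1, 86400 ); // counter lives roughly 24 hours
	} else {
		$wgMemc->incr( $key );
	}
	return true;
}

// hashar's suggested tweak above -- bake the configured limit into the
// key so that changing $wgAccountCreationThrottle for an event starts a
// fresh counter (hypothetical variant, not what is deployed):
//   $key = wfMemcKey( 'acctcreate', 'ip', $wgAccountCreationThrottle, $ip );
```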
[16:03:54] I can ssh in but [16:03:59] -bash: /etc/profile: Input/output error [16:04:05] -bash: /usr/bin/free: Input/output error [16:04:06] :-D [16:04:08] bad disk [16:04:13] off to cmjohnson1 ;) [16:04:13] apergos: nice suggestion ; I could try to prepare such an extension if you get me the documentation about how to send the throttle information to relevant caches. [16:04:41] Dereckson: how would caching be involved? [16:05:01] the memcache key yeah? [16:05:32] we don't touch caches now so why should an extension? [16:05:44] jeremyb: sq48? [16:05:52] jeremyb: hashar noted the account create setting where stored in Memcache ($key = wfMemcKey( 'acctcreate', 'ip', $ip );) [16:05:54] anyways smething is surely out of whack on sq48, http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&c=Upload+squids+pmtpa&h=sq48.wikimedia.org&tab=m&vn=&mc=2&z=small&metric_group=ALLGROUPS [16:05:58] were stored [16:06:09] cmjohnson1: yeah. not that i have any authority to give it to you ;) broke yesterday too [16:06:39] The last Puppet run was at Thu Dec 6 16:00:25 UTC 2012 (2 minutes ago). [16:06:42] really! [16:06:46] cmjohnson1: just that apergos' paste sounds like a bad disk or something between disk and rest of system [16:06:51] I am nearly certain it is dead...older Dell PE1950...a capacitor on teh HDD controller goes bad and kills the system [16:07:05] ok well I'm getting off [16:07:13] it has killed many of our squids and older srv's [16:07:16] should I open an rt ticket for it? [16:07:59] apergos: sure...i will confirm the issue. I have servers onsite that are set for donation I can pull the card from one of those and get it working again. [16:08:18] yeah typical 1950 fault [16:08:26] the capacitor from hell [16:08:41] dell has a factory in hell? [16:08:42] but it takes years to manifest in my experience [16:08:43] ;P [16:08:58] 1950s were quite resilient [16:09:10] do I give this to you cmjohnson1? [16:09:26] no give it to sbernardin or put in pmtpa que w/nobody [16:10:08] done (nobody) [16:10:42] cool..thx apergos...hopefully have it running again later today [16:13:58] Dereckson: i imagined (hoped?) memcache wouldn't be relevant for such an extension (and still do). anyway, we know the relevant config option. so just grep for the option and see where it's used and how [16:14:20] thank you cm johnson [16:15:03] ok well back in a while, got some things to do [16:29:33] New patchset: Mark Bergsma; "Handle many more error conditions" [operations/software] (master) - https://gerrit.wikimedia.org/r/37231 [16:32:07] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [16:34:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:36:54] hrmmm, puppetcamp in ghent right before fosdem. 
FYI [16:47:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.895 seconds [17:15:01] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [17:15:01] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [17:22:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:07] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [17:32:08] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [17:38:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.269 seconds [17:56:37] RobH: ping [17:58:40] preilly: sup [18:03:03] RobH: Do we have a spare box to run Varnish on right now? [18:03:21] RobH: I want to put Varnish in from of Parsoid for the Visual Editor [18:03:24] we have some misc servers, single cpu, non SSD, no not spec for varnish [18:03:33] they would be identical to the ones running parsoid [18:03:43] and i assume it has to be in tampa? [18:04:00] I only have 1 spare server left in tampa. [18:04:03] RobH: is Parsoid's boxes in PMTPA right now? [18:04:07] yep [18:04:14] i just spun up 4 more for roan yesterday [18:04:25] wtp1, and now mexia, tola, lardner, and kuo [18:04:34] i have a single identical server to those available [18:04:36] RobH: so you have 1 more box that I could use then? [18:04:37] and thats it. [18:04:41] if CT approves it [18:04:43] you can have it [18:04:48] RobH: okay cool [18:05:03] its all wired up and such, so to make it work is just a vlan change and some OS isntall stuff [18:05:25] oh wait [18:05:27] preilly: i have more than 1 [18:05:36] heh, i have 6 servers left in tampa unallocated [18:05:43] (that should make it easier for you to get approved) [18:05:58] RobH: Okay as soon as CT gets in I'll request it [18:06:05] all identical to what wtp1 and such are [18:06:06] cool [18:06:24] ping me when he has it all approved and we'll get it all deployed for ya [18:06:38] RobH: okay great [18:06:39] its dual disk, hardware raid1 [18:06:54] so i assume just basic same disk parititon as wtp1? (or else list what you want in ticket please) [18:07:08] RobH: same partition scheme is fine [18:07:09] that way i dont keep you guys waiting on questions =] [18:07:11] cool [18:07:22] and i assume outside IP as it is varnish? [18:07:38] RobH: which queue do I use for this in RT? [18:07:42] procurement [18:07:47] RobH: yes outside IP [18:09:07] RobH: Does something like, "We need to provision 1 misc server for Varnish in PMTPA to place in front of NodeJS boxes used for Parsoid." work for the RT ticket? [18:09:31] yep, you can note that i said there were 6 available identical to wtp1 and such if you want [18:09:31] if it's going to be used in production, please provision at least 2... [18:09:46] seems reasonable. [18:09:54] (what paravoid says) [18:10:05] I have no idea of the plans wrt parsoid, I'm looking forward to today's metrics meeting [18:10:21] i imagine the caching for the temp parsoid cluster will also be just as temp [18:10:32] and eventualy replaced with to spec hardware, like all the wtp cluster will be [18:11:04] RobH: created ticket #4042. [18:11:29] I'll add a bit to it for background and clarification [18:11:53] paravoid: Parsoid is launching on ENWIKI on Monday [18:12:01] yikes. [18:12:23] I wonder when people were planning to tell ops... 
[18:12:29] paravoid: yeah it was a surprise to us all [18:12:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:13:08] I don't mind you helping us out -- esp. considering the time constraints [18:13:19] but isn't setting up caches and everything our team's job? [18:14:16] deciding how many boxes to use and which and things like that? [18:14:31] the entire parsoid cluster is a rushed fubar project [18:14:39] its running on misc servers for now, but has to move [18:14:45] and doesnt seem actually planned. [18:15:04] no hardware is to spec, nothing was planned for over a day in advance of request it seems =P [18:15:16] speaking of procurement... RobH, any ideas what's the ETA for yttrium? [18:15:29] MaxSem: uhhh, yttrium was reinstalled already i thought, was it not? [18:15:32] and why launching on enwiki? [18:15:51] MaxSem: i stand corrected! [18:15:56] I'll take a look at it now. [18:16:09] cheers Rob [18:16:36] yea its network ports been moved, i'll update the puppet files and reinstall it now [18:16:45] sorry about that [18:18:39] woosters: ping [18:18:50] preilly: he just mailed that he's OOO sick. [18:19:04] woosters: can you approve ticket #4042 in RT [18:19:06] also, you kind of ignored me :-) [18:19:06] paravoid: thanks [18:19:38] paravoid: I hadn't read your comments yet [18:19:43] paravoid: reading now [18:19:58] paravoid: yes it is your teams job [18:20:15] New patchset: Demon; "Install git on all servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37247 [18:20:45] Platonides: that's a great question for James_F [18:21:49] MaxSem: So if looks like yttrium was manually partitioned [18:21:56] which is nasty and i try to not do now a days [18:22:33] preilly, seems like wanting it to break on deployment, instead of going at a slower pace [18:22:59] its dual 1tb disks [18:23:01] (iirc) [18:23:07] RobH, I don't have any special requirements for partitioning, a few gigs for /var is all I need [18:23:08] That wasn't our choice, it was decided for us [18:23:15] ok, then i do the stock raid lvm [18:23:19] Platonides: it's not my call at all [18:23:19] keep it simple [18:23:24] Platonides: talk to James_F [18:23:52] And it was announced in various places so I'm surprised no one in ops knew. But yes, talk to James_F [18:25:04] !log Reweighted pmtpa Parsoid pool according to # of cores [18:25:17] Logged the message, Mr. Obvious [18:25:36] RobH: Erik is looking at the RT ticket #4042 right now [18:25:40] New patchset: RobH; "retasking yttrium to internal vlan" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37248 [18:25:49] RobH: He should be approving it right now'ish [18:26:03] Either way, we screwed up by not realizing the need for more boxes earlier on, my apologies for that and thanks to you all in ops for helping us make it happen quickly [18:26:53] I imagine there is some level of frustration since we have a dozen ops projects we had to freeze for fundraising [18:27:05] and you guys keep making us push other things that technicallllly should be frozen ;] [18:27:22] so if the entire site catches fire, we're blaming you guys [18:27:28] RobH: Erik just approved the ticket [18:27:42] RobH: When would you be able to provision that boxes? [18:27:59] preilly - just approved it. robh - pls provision them [18:28:04] Right, that makes sense [18:28:12] will do [18:28:13] woosters: Okay great thanks! 
[18:28:23] woosters: by-the-way I hope that you feel better soon [18:28:42] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37248 [18:28:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [18:28:54] woosters: no one likes a sore throat [18:29:17] thks. .. ya, sucks :-( [18:29:44] preilly: yea I'll allocate and get the puppet work done to install them and drop a network ticket [18:29:51] then we can beg mark or leslie to change vlan [18:29:55] New patchset: Dereckson; "(bug 42771) Logo for mr.wikiquote" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37249 [18:29:57] RobH: Okay great [18:30:00] woosters: maybe you caught it from maggie? ;-P [18:30:04] RobH: Will they be running precise? [18:30:08] yes [18:30:15] unless you need them to do something else? [18:30:20] the defult is precise. [18:30:20] paravoid: I like the thread you just started [18:30:25] default even [18:30:42] RobH: okay great [18:30:52] jeremyb - someone sneezed into my face earlier this week :-( [18:31:02] on caltrain [18:31:17] paravoid: can I PM [18:31:31] always :) [18:31:49] woosters: wow, i guess that means you could have a timestamp for infection time ;) [18:32:21] (which i guess is rare to be able know) [18:32:50] anyway, stock up on chicken soup [18:34:30] MaxSem: yttrium is now reinstalling, will ping you when done [18:34:32] <^demon> !log running `mwscript extensions/Wikibase/repo/maintenance/rebuildEntityPerPage.php --wiki=wikidatawiki` in screen on hume. If it causes problems, kill it and ping me [18:34:37] ok, working on parsoid varnish allocation now. [18:34:41] Logged the message, Master [18:36:40] RobH: awesome thanks so much [18:40:22] RobH: The "freeze" was never declared to the rest of Engineering. Sorry if it was for Ops. :-( [18:40:56] well, changing major site infrastructure when site downtime would be horrible... [18:41:05] but yea, i just passing what i know, im not actually mad [18:41:20] i just spin up servers man, doesnt matter what for as long as the rest of you stay happy. [18:41:21] =] [18:51:12] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [19:01:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:06:28] !log authdns-update for celsus and constable servers [19:06:39] Logged the message, RobH [19:08:21] sbernardin: You about? [19:08:29] I just allocated this thing, and i dont feel like having to change [19:08:34] i have a bad drac connection i need fixed [19:12:33] !log kaldari Started syncing Wikimedia installation... : [19:12:41] Logged the message, Master [19:17:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.169 seconds [19:21:08] New patchset: RobH; "adding servers celsus and constable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37253 [19:21:26] preilly: Ok, I have 90% of the stuff done for this. LeslieCarr is going to fix the ports, and sbernardin needs to troubleshoot the DRAC for constable. [19:21:35] i expect both of those items done in a couple of hours [19:21:41] and then i'll install later today. [19:21:51] RobH: Okay great thanks [19:21:59] RobH: I really appreciate it [19:22:02] if for some reason it takes too long and its way past quitting time for me [19:22:11] i have set it up so its all just goign to be boot and pxe load [19:22:15] and it will do the rest unattended. 
[19:22:19] sweet [19:22:25] i'll update ticket with details as well [19:22:51] LeslieCarr: nice non answer [19:23:00] hehe [19:24:16] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37253 [19:26:24] New review: Alex Monk; "Actually it looks like I2c6ab07d needs to be rebased first." [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/34113 [19:26:52] huh... its 14:47 [19:26:56] i didnt eat breakfast or lunch [19:27:03] im hungry. [19:27:31] * RobH has not left his chair since 930 [19:27:35] sbernardin: Hey! [19:27:48] (you about see pm?) [19:30:18] preilly: when did you want this deployed by? [19:30:29] ie: if steve isnt onsite and i have to deploy tomorrow AM is that too late? [19:30:42] if it is i can just allocate and get another server setup, but annoying since i did all the work for this one [19:31:30] RobH: I was hoping today [19:31:44] RobH: tomorrow AM would be fine too [19:32:05] is one of two ok for today? [19:32:09] i can install celsus now [19:32:17] its constable's drac that is wonky [19:32:49] (i dont wanna hold up deployment, so i figtured one of two is ok? [19:32:55] New review: Alex Monk; "Needs rebase" [operations/apache-config] (master) C: -1; - https://gerrit.wikimedia.org/r/13293 [19:33:02] RobH: dude, you need noms [19:33:13] too busy to move today. [19:33:26] though my head hurts so badly its making me quesy [19:33:29] yay lack of sleep! [19:33:38] they should come to you then! [19:34:15] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [19:35:57] New patchset: Aaron Schulz; "Removed old swift hacks." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37255 [19:36:52] i broke site.pp [19:36:56] New patchset: RobH; "broke with , fixing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37256 [19:37:00] fixing, dunno why it didnt catch it [19:37:47] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37255 [19:38:10] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37256 [19:38:38] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37037 [19:38:54] bleehhhh [19:38:58] someone merged a bunch of shit with mine! [19:39:01] damn it [19:39:47] AaronSchulz: and binasher [19:39:56] so your alls changes im about to merge on sockpuppet [19:40:02] its not going to break blogs is it? [19:40:27] RobH: aaron was just mediawiki config? [19:40:31] RobH: i merged it. if it breaks blogs, it isn't your fault. [19:40:51] ok, its merged on sockpuppet [19:41:01] !log aaron synchronized wmf-config 'deployed 7c21802fc48ef4c5ee0edab3118d03cbbdc1866d' [19:41:09] Logged the message, Master [19:41:49] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36197 [19:42:38] !log demon synchronized wmf-config/InitialiseSettings.php 'Deploying Idf3b7f0c' [19:42:47] Logged the message, Master [19:48:02] Ok, celsus is finishing its install. [19:48:10] when its done im totally leaving and going to buy food. [19:48:29] i have not stood up in 5 hours. [19:48:43] RobH: GET UP STAND UP! [19:48:48] damn it... [19:48:53] * RobH pulls up itunes and plays the song [19:48:58] hehe [19:49:33] alternatively you could read the article! 
https://en.wikipedia.org/wiki/Get_Up,_Stand_Up [19:50:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:54:06] preilly: So celsus is now installed and is doing its initial puppet run (just the standard, since nothing else has been defined but that) [19:54:26] when its done its all yours, constable will be fixed first thing when Steve is onsite tomorrow at 9AM EST [19:54:43] and when that is done I will get it installed and online for you (around 10AM EST) [19:54:52] New patchset: MaxSem; "Set up yttrium as a temp Solr server for GeoData" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35931 [19:55:13] can someone please review^^^ [19:55:54] i would but i really have NO IDEA what the solr stuff should be [19:56:08] who would other than you? ;] [19:56:17] (ie: who the heck is qualified to review it) [19:56:39] though its a new file, that has references ONLY to the one server you allocated for it [19:56:47] so im not sure what else would go wrong by it not being right anyhow [19:57:18] RobH: okay sweet [19:57:29] New review: RobH; "Seems to only affect the one server he defines, but someone with a passing knowledge to solr config ..." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/35931 [19:59:22] New patchset: Demon; "Adding wikidata repo prune script to hume's crons" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37258 [20:04:58] New patchset: Demon; "Adding wikidata repo prune script to hume's crons" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37258 [20:05:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds [20:24:04] !log Running sync-common on hume [20:24:12] Logged the message, Master [20:26:51] Can someone please do chmod g+r /home/wikipedia/common$ ls -al php-1.21wmf5/maintenance/.mwsql_history [20:26:54] fail [20:27:09] Can someone please do this on fenari please: chmod g+r /home/wikipedia/common/php-1.21wmf5/maintenance/.mwsql_history [20:31:46] !log kaldari Finished syncing Wikimedia installation... : [20:31:54] Logged the message, Master [20:31:55] scap finished :) [20:32:01] and I'm not dead yet [20:37:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:14] New patchset: Kaldari; "Turning off wgResourceLoaderExperimentalAsyncLoading on test2" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37289 [20:39:45] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37289 [20:41:58] !log kaldari synchronized wmf-config/InitialiseSettings.php 'Turning off wgResourceLoaderExperimentalAsyncLoading on test2' [20:42:07] Logged the message, Master [20:45:50] New review: MaxSem; "This file was taken from GeoData master, where it was already reviewed." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/35931 [20:56:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [21:20:01] New patchset: Ottomata; "Including and requiring nrpe::packages in nrpe::check, rather than just specifying the dependency." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37295 [21:22:22] New review: Ottomata; "I had just enabled udp2log varnishncsa logging for blog.wikimedia.org. varnish::logging uses nrpe::..." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/37295 [21:22:42] binasher, or sombaawwdy: [21:22:42] https://gerrit.wikimedia.org/r/#/c/37295/ [21:22:58] that'll fix an angry puppet agent on marmontel [21:23:01] i hope! [21:25:00] mike_wang: How are things going? Do you have some reasonable things to work on? [21:28:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:41:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.918 seconds [21:41:55] New review: RobH; "looks sane" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/37258 [21:41:55] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37258 [21:44:26] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [21:45:31] heya, notpeter, could you give this a peek? https://gerrit.wikimedia.org/r/#/c/37295/ [21:47:54] New patchset: Demon; "cron syntax expects an array of values" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37302 [21:48:26] New review: RobH; "once more with feeling!" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/37302 [21:48:27] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37302 [21:50:50] !log reedy synchronized php-1.21wmf5/includes/media/DjVuImage.php [21:50:58] Logged the message, Master [21:51:04] <^demon> Hmm, so wikidata user? [21:51:04] crontab: user `wikidev' unknown [21:51:04] err: /Stage[main]/Misc::Maintenance::Wikidata/File[/var/log/wikidata]/ensure: change from absent to directory failed: Could not set 'directory on ensure: Could not find user wikidev at /var/lib/git/operations/puppet/manifests/misc/maintenance.pp:133 [21:51:15] <^demon> Oh, wikidev. [21:51:16] <^demon> Weird. [21:51:48] <^demon> We can do mwdeploy:wikidev [21:51:50] heh [21:51:53] <^demon> I'll tweak it [21:51:58] hume has no wikidev group included [21:52:03] the gid 500 usually right? [21:52:15] thats included on a LOT of misc hosts for dev [21:52:20] but not on hume. [21:52:20] <^demon> No wikidev group? Other hume crons have it [21:52:23] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [21:52:33] but it's recently been reinstalled... [21:52:36] New patchset: Ori.livneh; "Update config for Extension:EventLogging" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37304 [21:52:37] indeed [21:52:43] and wikidev isnt included inthe puppet run [21:52:51] if we can confirm that other scripts need it [21:52:54] <^demon> I lied. [21:52:57] we can just include the user in site.pp call [21:53:00] (i would think) [21:53:07] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37304 [21:53:16] ^demon: you lied as in no other scripts need it? [21:53:38] at this point its not breaking hume, so its not an ASAP fix it this second or hum eis out of sync [21:53:46] <^demon> Yes, no other scripts are using it. [21:53:46] it just fails that and applies everything else (it appears) [21:53:52] ahh, then fix yer script! [21:53:53] ;] [21:54:11] then yea i rather not include a group just for a single script if we can avoid it. 
[21:54:24] New patchset: Demon; "Use mwdeploy instead of wikidev" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37305 [21:54:34] New patchset: Ori.livneh; "Update EventLogging config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37306 [21:54:57] <^demon> We can all sudo as mwdeploy anyway, so it's not like I can't see the logs if I want. [21:54:59] New review: RobH; "note to self: insist on payment of bribes before agreeing to code review" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/37305 [21:54:59] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37305 [21:55:28] yea i just dont like more site.pp cruft ;] [21:55:45] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37306 [21:56:14] applying on hume now [21:56:49] binasher: heh doOperations is faster on swift than nas1 [21:57:03] I have gotten to spend all this week doing non-onsite things.... [21:57:11] i could really, really, really get used to this. [21:57:19] ^demon: done, looks like its working [21:57:22] <^demon> I've always got puppet stuff needing merging :) [21:57:24] <^demon> Thanks! [21:57:27] well tp90 [21:57:29] directories were created and seem good [21:57:31] !log olivneh synchronized wmf-config/CommonSettings.php 'Updating Extension:EventLogging config' [21:57:40] Logged the message, Master [21:58:03] AaronSchulz: o_O [21:58:18] <^demon> RobH: Cron should run in 2m. I'll keep an eye on the log [21:58:26] tp50 is close, sometime on is faster than the other [22:01:48] <^demon> RobH: Script ran fine, thanks for merging this! [22:04:23] ottomata: sorry I was slow. still need review? [22:05:11] if you would danke [22:05:28] i think its fine, i just want to be sure I dont' break other stuff [22:05:30] it shouldn't [22:05:38] afaik, this wouldn't work anywhere else anyway unless nrpe was already included [22:05:48] it just so happens that marmontel did not have nrpe included [22:05:57] and we are setting up varnishncsa loggers there [22:06:03] which by default use nrpe for monitoring [22:06:22] so, this keeps the dependency in place, but also includes the dependency if it hasn't already been [22:09:00] ^demon: welcome =] [22:10:19] New patchset: Ryan Lane; "Some nagios ganglia and nagios fixes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37316 [22:10:33] ottomata: hmmm, alright, sounds legit [22:12:37] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37316 [22:14:11] ottomata: wait a second [22:14:36] ja? [22:14:43] if $::network_zone == "internal" { include nrpe [22:14:50] hmm [22:14:54] I would habeeb that that is there for a reason [22:14:55] zat in varnish? [22:15:03] that's in base::standard-packages [22:15:13] hm [22:15:22] I would like some confirmation on why that's there before merging [22:15:28] as I smell security implications [22:15:28] what's network zone on monmartel [22:15:36] yeah, me too, which is what I was asking :) [22:15:44] it seemed the right thing to do, but I don't know the whole picture there [22:16:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:16:46] where does network_zone ge set? 
[22:16:50] so, it's marmontel.wikimedia.org [22:16:57] soooo, I'm going to assume that that's not too internal [22:17:34] I'd like to talk to probably ma_rk or ryan about this before proceding [22:18:30] woosters: would anyone in Ops be able to help me resolve bug 42452 by clearing the bits varnish caches one by one? [22:18:55] notpeter: what about marmontel? [22:18:57] notpeter, ok cool, thanks, [22:19:05] Ryan_Lane: [22:19:05] https://gerrit.wikimedia.org/r/#/c/37295/ [22:21:48] ottomata: ok, we talked a wee bit [22:21:55] and if oyu set up some ip tables rules [22:21:59] should be legit [22:22:03] hm [22:22:25] ok, or maybe I could just say monitor => false [22:22:28] on the varnish::logging instance? [22:22:29] that "internal only" thing is for code exec [22:22:51] that could also be legit [22:23:07] kaldari: Any particular context behind testing with wgResourceLoaderExperimentalAsyncLoading? Some issue? [22:23:31] i'm really just trying to make as few changes as possible for this that allows webstatscollector to get blog stats [22:23:38] I'm not sure I understand your question. I just turned it off, not on. [22:24:08] so, notpeter, Ryan_Lane, better to do iptables or just not monitor this process? [22:24:22] kaldari: Yes, I'm just curious as to why the sudden change? [22:24:29] why do you need nrpe to monitor the service? [22:24:29] either is fine. if you're cool with no monitoring, then go for it [22:24:54] !log catrope synchronized php-1.21wmf5/includes/Title.php '67de9f2cd4e37b7b7081d47a2652aa0158680c5c' [22:25:03] Logged the message, Master [22:25:06] !log catrope synchronized php-1.21wmf5/includes/EditPage.php '67de9f2cd4e37b7b7081d47a2652aa0158680c5c' [22:25:14] Logged the message, Master [22:25:18] !log catrope synchronized php-1.21wmf5/resources/mediawiki/mediawiki.Uri.js '67de9f2cd4e37b7b7081d47a2652aa0158680c5c' [22:25:27] Logged the message, Master [22:25:31] this is just to see if the service is running? [22:25:53] from talking with chrismcmahon we thought it would be better if test2 wasn't using wgResourceLoaderExperimentalAsyncLoading so that it would be a more accurate test environment for cluster deployments. [22:26:18] also Roan said no one had been doing anything with wgResourceLoaderExperimentalAsyncLoading in a while [22:27:23] k [22:27:51] I'm pretty sure it affects some browser behavior in unpredictable ways. [22:28:12] Ryan_Lane, that's what it is doing [22:28:14] yes, which is why we enabled it on test2 and not test. So that we can see how it behaves. [22:28:15] Krinkle: is anyone actively evaluating wgResourceLoaderExperimentalAsyncLoading or is it basically shelved for now? [22:28:22] I don't really care, that's just what varnish::logging does by default [22:28:22] shelved for now [22:28:34] what're you testing on test2 though? [22:28:42] everything [22:28:53] RoanKattouw: That hash doesn't exist in the repo, the merge was done locally? [22:29:05] ottomata: what does the service do? [22:29:26] sorry if I'm missing things, but about a billion people are talking to me right now [22:29:33] New patchset: Demon; "Enabling changes table for wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37324 [22:29:35] haha, its ok [22:29:36] Krinkle: It should exist ... 
[22:29:48] um, varnishncsa instance sends varnish access logs to udp2log hosts [22:29:52] It's a Gerrit-generated merge [22:29:54] RoanKattouw: local commit "67de9f2 Merge changes Ieb9eee45,Ib0e40714 into wmf/1.21wmf5" [22:29:55] there are 3 instance (one for each udp2log host) [22:29:58] Hm.. [22:29:59] odd [22:30:16] Krinkle: pull core and you'll get it too [22:30:23] so, i think the automatic nrpe stuff will notify us if the varnishncsa process dies [22:30:39] RoanKattouw: I filed it in the gerrit ui [22:30:47] Krinkle: I was told test2 is the current preferred test environment specifically for the reason that it's closer to the live wiki configs than test [22:30:59] kaldari: test environment for what? [22:31:10] new code that's going to be deployed [22:31:33] test2 is in the deployment chain, we expose new stuff there before it goes live [22:31:41] that's where chris's automated tests are running as well [22:32:06] maybe it would make more sense to turn wgResourceLoaderExperimentalAsyncLoading on on test.wiki instead [22:32:07] New patchset: Catrope; "Use the service IP for Parsoid rather than using wtp1 directly" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37327 [22:32:25] sure, but that's mostly operations pov from what I know. code quality / safety and user creations stuff should ideally not be done on the production cluster as it'll go in global centralauth [22:32:33] isn't the beta labs ready for this yet? [22:32:40] It can have 100% the same configuration if we want to [22:32:46] kaldari: I have some running vs. test2, some vs. beta, some on the mobile site. [22:33:11] Krinkle: beta labs doesn't get every update to every extension yet. [22:33:24] rather late, since at that point it'll be running on the production cluster. [22:33:26] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37327 [22:33:44] chrismcmahon: It can run on a wmf branch instead of master. [22:33:58] or we make a script to let it simulate the wmf-branch set up for the latest master [22:34:00] !log olivneh synchronized php-1.21wmf5/extensions/EventLogging [22:34:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds [22:34:06] (i.e. the same extensions and their latest masters) [22:34:08] Logged the message, Master [22:34:35] no problem, just surprised that we're using test2 for this. we should at least be working on getting labs to be usable for this. [22:35:01] Krinkle: wouldn't that be nice. :-) we are working on getting labs usable for that, but we're not there yet. [22:35:19] what's the official purpose of test2 anyway? [22:35:25] for het deploy [22:35:37] so we have a test wiki on the next and prev wmf branch [22:35:55] New patchset: Ryan Lane; "Fix wtp1001 aggregator config as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37329 [22:35:58] ah [22:36:04] and then it got lost and someone decides to push the next branch to both [22:36:07] that's been bothering me too [22:36:27] yeah, that's would be useful [22:36:32] that's = that [22:36:37] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37329 [22:37:06] especially for deploying backported bugfixes [22:37:27] same for people updating extensions to latest master in before-last wmf branches. On more than one occasion has it caused compatibility issues. 
We should agree on freezing extension updates for the older wmf branches (only mediawiki/core updates if absolutely needed) we branch and deploy often enough to get those updates out. [22:37:49] (e.g. extension developed against master, not compatible with state in previous wmf branch) [22:38:39] New patchset: Ryan Lane; "Allow/Deny use space as a delimiter, not a comma" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37331 [22:38:41] oh well, the gripes of he observer :) [22:38:45] the* [22:38:52] New patchset: Demon; "Enabling changes table for wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37324 [22:39:15] Krinkle: it's step by step [22:39:15] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37331 [22:39:28] Change merged: Catrope; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37324 [22:39:30] Krinkle: I think you bring up some good points [22:40:45] would anyone in Ops be able to help me resolve bug 42452 by clearing the bits varnish caches one by one? [22:41:00] !log demon synchronized wmf-config/CommonSettings.php 'Deploying Ia1208778, Ib80a42fa' [22:41:09] Logged the message, Master [22:41:22] New patchset: Ottomata; "Not monitoring varnishncsa instances for blog.wikimedia.org." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37295 [22:41:24] kaldari: what? Is that happening again? how is that possible [22:41:38] kaldari: that was fixed with the wmf5 patch in ext.Vector and core by adding h3 [22:41:47] notpeter, zat ok then? [22:41:47] https://gerrit.wikimedia.org/r/#/c/37295/ [22:41:50] I saw it myself [22:42:09] yes, and I even deployed the CSS patches a day before the wmf5 deployment [22:42:21] but people are still seeing the broken CSS [22:42:26] New patchset: Catrope; "Attempting to fix Ganglia aggregation on wtp1/wtp1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37332 [22:42:36] Ryan_Lane: ---^^ [22:43:05] Ryan_Lane: I think I found the problem (again) [22:43:15] Krinkle: I think there must be some stuck caches on one of the bits varnish servers [22:43:31] kaldari: there can't be such thing as a stuck cache [22:43:47] it has a timestamp on it, the old cache would be no longer referenced with the next timestamp. [22:43:57] and the unversioned ones are purged every 5 minutes [22:44:11] kaldari: got a live link to see? [22:44:16] no [22:44:22] the ones I have have all been purged [22:44:29] I can't reproduce myself either [22:44:42] I need to see it for it to confirm. purging it manually in bits wouldn't make a difference if it hasn't purged itself by now. [22:45:10] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37332 [22:48:04] Krinkle: personally I have no idea how it would be possible for people to still be getting the old CSS (thus my "stuck cache" theory). Do you have any ideas? [22:48:27] kaldari: I have a few, but too many to spill. I'd have to see it [22:50:01] There's a summary of what I know so far here: https://bugzilla.wikimedia.org/show_bug.cgi?id=42452#c23 [22:52:19] kaldari - ryan_lane is on duty .. 
he will be able to help [22:53:07] !log aaron synchronized php-1.21wmf5/includes/job/JobQueueDB.php 'deployed 471f6840613ae5137464b6449db88bc30608e85f' [22:53:11] kaldari: I find your last comment more interesting :) [22:53:15] Logged the message, Master [22:53:16] c28 [22:54:54] Krinkle: Yes, and I've heard other reports via IRC of people solving the problem by clearing their cookies (resetting session) [22:55:25] kaldari: Either bad headers or a bad session, I don't see a solution. [22:55:56] Is it happening for logged out users? [22:56:16] Not that I know of, at least since the CSS patches we did [22:58:08] notpeter, Ryan_Lane, I'm going to self review this one, since it is much less dangerous and shouldn't change anything that already exists [22:58:08] It doesn't seem to be a rare problem though, as even a few people here in the office reported the problem [22:58:12] it'll just make puppet happy [22:58:13] https://gerrit.wikimedia.org/r/#/c/37295/ [22:58:14] s'ok? [22:58:22] ^demon: do you know if the parser cache is split by skin as well? i.e. if we were to change the html of editsections, would that have to be for all skins? I recall the parser using some kind of placeholders to allow those to be user/skin specific without cache fragementation [22:58:42] editsection-links* [22:58:44] [edit] [22:58:45] <^demon> No, we don't vary on skin. [22:58:51] <^demon> Right, that's done after parse. [22:58:54] cool [22:59:06] That didn't used to be like that always though,right? [22:59:12] (which is great) [22:59:15] <^demon> Right, we used to vary on skin. [22:59:34] okay. Just wanted to verify that that was indeed the reason we didn't change this in 2009. [22:59:57] (the usability initiative section-edit-links improvement was done with javascript) [23:00:38] mutante: *where* on https://doc.wikimedia.org/ should I look at site.pp ? [23:02:29] <^demon> sumanah: I don't think that's correct, apache config must be wrong. Check out http://doc.wikimedia.org/ [23:02:32] <^demon> (no s) [23:02:50] New review: Ottomata; "I switched this to just not monitor these varnishncsa processes, instead of dealing with network sec..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/37295 [23:02:51] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37295 [23:03:54] ^demon: https: "This is our continous integration server for MediaWiki." http: puppet & puppetsource. You wanna file the bug or shall I? [23:04:32] <^demon> Lemme just look at the apache config. [23:04:39] <^demon> It's probably just because stuff was copy+pasted. [23:05:12] okay. [23:05:49] doc.wikimedia.org? Instead of noc? [23:06:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:06:53] Susan: schlock.wikimedia.org [23:07:47] I'm just not sure I understand doc.wm.o. [23:08:39] Look at its content? [23:09:01] It looks server configuration files. [23:09:05] looks like [23:09:29] Or docs for Puppet, I guess. [23:11:03] New patchset: Demon; "Setup SSL properly for doc.wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37336 [23:11:17] <^demon> sumanah: Right now it's just puppet. We're going to move mediawiki docs there too. [23:11:27] <^demon> Meant that for Susan, damn autocomplete. [23:11:37] <^demon> sumanah: That patchset ^ should fix it [23:11:45] Heh. [23:11:54] MediaWiki docs? [23:12:02] <^demon> The stuff that's at svn.wm.o/doc/ right now [23:12:04] I don't follow. 
[23:12:16] <^demon> doc.wm.org is for auto-generated docs. [23:12:28] All right. [23:12:31] <^demon> We want that off the svn box :) [23:12:36] ^demon: speaking of which, which machine actually runs doxygen to produce the output? [23:12:52] <^demon> ori-l: For svn.wm.o, or doc.wm.o? [23:12:55] http://svn.wikimedia.org/doc/ is far more attractive now than I remember. [23:13:01] svn.wm.o [23:13:06] kaulen [23:13:06] <^demon> That's on the svn box, formey. [23:13:11] oh, duh [23:13:20] * ori-l wants a paper copy and doxygen has latex output.. i wonder how many pages it would amount to [23:13:27] ^demon, the SSL certificate for doc.wikimedia.org seems to be broken FYI [23:13:37] ori-l: One could hope that it is OVER 9000 [23:13:43] <^demon> Thehelpfulone: Yes, I put a patch in gerrit. [23:13:44] doc.wm.o instead of docs.wm.o? [23:13:49] Just one. [23:13:55] <^demon> We could serveralias easily. [23:13:57] ok [23:14:36] Thanks ^demon and is there anything I could help with there? [23:14:48] <^demon> poke an opsen with a stick :) [23:15:07] <^demon> I'll do it. [23:15:46] <^demon> Ryan_Lane: SSL on the new doc.wm.o is busted, but I think I fixed it. https://gerrit.wikimedia.org/r/#/c/37336/ [23:16:19] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37336 [23:16:22] that looks right [23:16:23] gj [23:16:39] it was kind of magical and fairy-tale esque to have http and https serving very different things [23:16:42] which host is that on? [23:16:52] <^demon> gallium. [23:16:56] ok [23:17:14] <^demon> sumanah: It's because it matched the only :443 vhost it could find, which is integeration.mw.o [23:17:15] <^demon> :) [23:17:16] salt 'gallium*' cmd.run 'puppetd -tv' [23:17:19] <3 [23:18:05] heh, nice [23:18:10] screw ssh [23:18:48] * Damianz hugs his SOL [23:19:10] notice: Run of Puppet configuration client already in progress; skipping [23:19:15] I hate puppet [23:19:21] hmm, websearching for the man page for "salt" is a little hard [23:19:28] sumanah: saltstack [23:19:57] ahhhh [23:20:01] <^demon> sumanah: My first result for "man salt" is Salt N Pepa - Whatta Man [23:20:03] * ^demon shrugs [23:20:17] ManSalt - Muscle Soak for Men, Bath Salts for Men [23:20:18] man page probably won't help too much [23:20:24] it'll show you the matching options [23:20:25] binasher: how much ram do the db servers (say 63) have? [23:20:32] you need to look at the modules lists [23:20:35] or the state lists [23:20:36] That's a pretty good song actually [23:20:42] https://wikitech.wikimedia.org/index.php?search=saltstack&title=Special%3ASearch&fulltext=1 no results [23:20:48] or how to use events or how to use returners (which are really awesome) [23:20:51] sumanah: no docs there [23:21:08] I should likely write some docs [23:21:24] though they'd really mostly just point back to saltstack's docs [23:21:37] <^demon> Docs would be good, if you expect any of us to be your guinea pigs. 
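^demon's one-line diagnosis at 23:17 is the core of the doc.wikimedia.org breakage: with no *:443 vhost of its own, an HTTPS request for doc.wikimedia.org fell through to the only SSL vhost Apache could find on gallium, the integeration.mw.o one, which is why HTTP and HTTPS served different things. A hedged sketch of the general shape of the fix in change 37336, including the docs.wikimedia.org ServerAlias ^demon says could easily be added; the file path, certificate names and DocumentRoot below are assumptions, not the merged patch:

    # Sketch only; the real change lives in operations/puppet and its paths differ.
    file { '/etc/apache2/sites-available/doc.wikimedia.org':
        ensure  => present,
        content => "<VirtualHost *:443>
      ServerName doc.wikimedia.org
      ServerAlias docs.wikimedia.org
      SSLEngine on
      SSLCertificateFile    /etc/ssl/certs/star.wikimedia.org.pem
      SSLCertificateKeyFile /etc/ssl/private/star.wikimedia.org.key
      DocumentRoot /srv/org/wikimedia/doc
    </VirtualHost>
    ",
    }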
[23:21:45] !log awjrichards synchronized php-1.21wmf5/extensions/MobileFrontend/ 'fix for MobileFrontend bug 42749' [23:21:47] I have docs for the deployment system ;) [23:21:53] Logged the message, Master [23:22:02] even developer-like docs [23:22:31] I just want to mention this, because it's awesome: http://docs.saltstack.org/en/latest/ref/returners/index.html [23:22:42] Ryan_Lane: basically I just linked https://wikitech.wikimedia.org/view/Git-deploy#Basic_design which is the first/only place "salt" appears on that wiki to saltstack.org -- at least that will point people to the right place [23:22:57] <^demon> So, totally different subject...I just thought of something that ruined my plan of offloading gitviewing from manganese. [23:23:01] you can run commands and have the returned data run through a function, which can store the returned data places, like redis [23:23:12] sumanah: thanks [23:23:18] yw [23:23:34] ^demon: what's that? [23:23:38] not sure I like that [23:23:40] <^demon> The jsession wouldn't carry across different hostnames. [23:23:45] Damianz: not sure you like what? [23:23:50] returners [23:23:54] <^demon> So we couldn't expect gerrit to manage permissions. [23:23:54] Damianz: why not? [23:23:58] ^demon: use a proxy? [23:24:18] ah [23:24:19] right [23:24:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.020 seconds [23:24:38] Damianz: I was thinking of using a returner to present data from deployment [23:24:40] AaronSchulz: db63 and servers above that are getting 96, most servers below that number have 64 [23:24:41] As far as I understand it the return is run on the client - so using it for stashing data back to a central system is a bit of a headache. I can see where it might be useful though [23:24:54] <^demon> I thought of running gerrit as a second slave, but gerrit only operates in sshd mode as a slave. [23:25:04] <^demon> It doesn't do a read-only UI x_x [23:25:13] ah yeah, it runs on the client [23:25:27] Ryan_Lane: was wondering if you (or anyone in Ops) have any thoughts on bug 42452. I was thinking perhaps we should try clearing the bits varnish caches one by one, but Krinkle didn't think that would help. [23:25:51] <^demon> So yeah, I gotta rethink this somehow :\ Because I want it to work properly, but I'm afraid of overloading manganese eventually. [23:26:00] <^demon> I wish multi-master support worked :\ [23:26:04] Damianz: yeah, it would have the minions put data into something like redis, rather than report back to the master [23:26:04] we need to be able to reproduce it in order to get anything useful. [23:26:31] kaldari: it's only 1 url though right? Could purge it, should be fine [23:26:54] 1 url? [23:26:56] or touch the file and sync it. but then again, we touched it many times since then. [23:26:58] Ryan_Lane: That's fine if you trust the clients ;) Split opinion on how I'd want to route info back....could be interesting using something like casandra for crazy data grabbing though [23:27:18] Damianz: yeah. in the situation I want to use it, I trust the client [23:27:21] kaldari: the bits url to that module? [23:27:43] yeah, but that URL veries considerably [23:27:45] <^demon> Ryan_Lane: Did puppet finish on gallium? I'm still getting served the old cert. [23:27:46] varies [23:27:54] no [23:27:58] puppet is being an asshole [23:28:03] <^demon> stupid puppet. [23:29:30] can someone help out with a mobile varnish purge? binasher, mutante, LeslieCarr? 
[23:30:05] kaldari: ^ watcg and learn ;) [23:30:30] heh, if awjr manages to skip me in line I'll be pissed ;) [23:30:53] We have a line? [23:30:56] the secret is whiskey, kaldari [23:31:05] shit, I forgot! [23:31:29] which i will happily provide to whoever flushes the mobile varnish cache first next time im in SF [23:33:07] awjr: I was trying to get a non-mobile varnish cache flush :) [23:33:13] ohho! [23:33:34] between the two of us we will destroy all WMF varnish caching [23:35:42] kaldari: how so (veries considerably) ? [23:36:07] I believe that URL includes the gadgets as well [23:36:15] so varies per user [23:36:27] skins css is loaded separately [23:36:49] I think the problem is also affect the Vector ext css [23:36:54] affect = affecting [23:37:05] i.e. the collapse triangles [23:37:40] yes, ext.vector only affects collapsibility. since it isn't confined to that the problem isn't those urls. [23:37:44] kaldari, shouldn't RL take care of it for you? [23:37:51] the skin css is doing the big headings [23:38:00] MaxSem: yes, one would expect that [23:38:15] see bug 42452 [23:39:18] MaxSem: during development html structure was changed and stylesheets without backwards compatibility, that was wrong. It was fixed in wmf [23:39:31] somehow it is still happening in some users browsers [23:39:54] which suggests either their browser is keeping cache longer than it should, or there is a server with an outdated mediawiki checkout. [23:40:23] but since no dev seems to be able to reproduce it.. [23:41:26] !log awjrichards synchronized php-1.21wmf5/extensions/MobileFrontend/javascripts/common/mf-application.js 'Touch file' [23:41:35] Logged the message, Master [23:42:06] I saw the bug on Monday, but not since [23:44:31] paravoid: thank you for starting that thread this morning [23:50:19] preilly: yw [23:50:39] paravoid: :) [23:50:40] let us know if we can help you on your work for this [23:53:01] !log reedy synchronized php-1.21wmf5/extensions/ParserFunctions/Expr.php [23:53:09] Logged the message, Master [23:54:17] New patchset: Pyoungmeister; "coredb_mysql: attempting laner's hackaround for module/role class issue" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37343 [23:55:22] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37343 [23:57:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
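The coredb_mysql patchset at 23:54 refers to the then-familiar Puppet headache of module code depending on variables set in role classes (dynamic scoping). The log does not show what "laner's hackaround" in change 37343 actually was; the sketch below is only the generic workaround of that era, with hypothetical parameter names: have the role class pass what the module needs as explicit class parameters instead of letting the module reach into role scope.

    # Illustrative only; whether change 37343 did exactly this is an assumption.
    class coredb_mysql($shard = 's1', $read_only = false) {
        # The module consumes its own parameters rather than looking up variables
        # in whichever role class happened to include it.
        notify { "coredb_mysql configured for shard ${shard}": }
    }

    class role::coredb::s1 {
        class { 'coredb_mysql':
            shard     => 's1',
            read_only => false,
        }
    }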