[00:00:09] Logged the message, Mr. Obvious [00:00:24] !log maxsem synchronized wmf-config 'https://gerrit.wikimedia.org/r/#/c/37157/' [00:00:33] Logged the message, Master [00:02:57] Hrmph [00:03:02] /etc/dsh/group isn't on bast1001 either [00:05:33] !log rebooting db1043 [00:05:41] Logged the message, Master [00:06:17] !log aaron synchronized php-1.21wmf5/extensions/SwiftCloudFiles/php-cloudfiles-wmf/cloudfiles_http.php [00:06:25] Logged the message, Master [00:07:00] PROBLEM - Host db1043 is DOWN: PING CRITICAL - Packet loss = 100% [00:08:38] RECOVERY - Host db1043 is UP: PING OK - Packet loss = 0%, RTA = 26.65 ms [00:08:41] Change abandoned: Demon; "Actually, I'm thinking of just killing the package entirely. All it does is get in the way when we'r..." [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/27531 [00:12:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:24:58] notpeter: Rargh this deployment thing is getting annoying. To make it work I would have to deploy from /home (so fenari), and I'd have to modify the rsyncd config [00:25:12] The alternative is putting it in /h/w/common , which would slow down MediaWiki deploys tremendously [00:25:18] hrm, gotcha [00:25:25] well shoot [00:25:34] I mean I'm happy to run a temporary rsyncd on bast1001 [00:25:42] But, you know, this web is weaving itself fast [00:25:51] heh, true [00:26:03] how about we think about possible solutions and regroup and tlak about this tomorrow? [00:26:04] Hmm [00:26:07] * RoanKattouw has an idea [00:26:21] git is already installed on the Parsoid boxes by puppet [00:26:39] "Deploy" could be dsh git pull [00:27:01] That requires that I initially transfer the node_modules dir but that's fine, that's a one-off [00:27:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.274 seconds [00:28:42] RoanKattouw: where can I learn about parsoid and its architecture? my interest is mostly from an ops perspective but more generally wouldn't hurt (I've read mw.org/wiki/Parsoid, not enough...) [00:28:59] I'm not sure it's readily documented [00:29:08] The project leader is gwicke and he and his team hang out in #mediawiki-parsoid [00:30:32] RECOVERY - mysqld processes on db1043 is OK: PROCS OK: 1 process with command name mysqld [00:32:03] paravoid: what time is it there? [00:33:23] PROBLEM - MySQL Slave Delay on db1043 is CRITICAL: CRIT replication delay 3118 seconds [00:33:24] 2:30am [00:33:29] EET [00:34:53] RECOVERY - MySQL Slave Delay on db1043 is OK: OK replication delay seconds [00:35:54] paravoid: what if swift auth just have an IP storage url? :p [00:37:37] I'd prefer not [00:37:44] let's solve the issue, not hide it under the carpet [00:38:14] !log testing mariadb-server-5.5 5.5.28-mariadb-wmf201212041~precise on db1043 (slaving enwiki) [00:38:17] any luck into writing a reproducable test case? [00:38:22] Logged the message, Master [00:38:24] no [00:38:25] binasher: \o/ [00:39:41] PROBLEM - MySQL Slave Delay on db1043 is CRITICAL: CRIT replication delay 2708 seconds [00:41:39] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37155 [00:43:40] !log Added mexia, tola, kuo, lardner to /h/w/common/docroot/noc/pybal/pmtpa/parsoid , they appear to be pooled just fine now [00:43:48] Logged the message, Mr. 
Obvious [00:43:58] paravoid: I don't see this getting fixed soon [00:45:43] New patchset: Catrope; "Make /var/lib/parsoid owned by wikidev, makes deploying easier" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37165 [00:49:46] paravoid: alright, I give up on that for now [00:50:26] notpeter: Hmm, the Parsoid group isn't showing up in Ganglia yet, when can I expect that to happen? [00:50:38] (Puppet has run on wtp1 (aggregator), but I suppose it may not have run on the main Ganglia box?) [00:51:12] RoanKattouw: hhhhmmmm, is mercury in retrograde right now? [00:52:12] I don't know [00:52:22] I also don't see Rob's new machines /anywhere/ in Ganglia [00:52:34] Although that makes some sense, they think they belong to a group that doesn't show in Ganglia [00:52:35] RECOVERY - MySQL Slave Delay on db1043 is OK: OK replication delay 28 seconds [00:52:47] And wtp1 (aggregator) shows up in misc, but as down (!!) [00:53:05] New patchset: Asher; "pulling db59 from s1 for upgrade to precise" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37168 [00:53:43] See http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Miscellaneous%2520pmtpa&tab=m&vn= . It says wtp1 is down, which is bull. And there's no mexia/kuo/tola/lardner [00:53:51] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37168 [00:54:03] puppet probably need to run on nickel [00:55:25] !log asher synchronized wmf-config/db.php 'pulling db59 from s1 for system upgrade' [00:55:33] Logged the message, Master [00:56:29] !log shutting down mysql on db59, upgrading to precise [00:56:38] Logged the message, Master [01:00:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:04:08] PROBLEM - mysqld processes on db59 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [01:17:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.039 seconds [01:19:28] !log rebooting db59 [01:19:38] Logged the message, Master [01:21:14] PROBLEM - Host db59 is DOWN: PING CRITICAL - Packet loss = 100% [01:23:02] RECOVERY - Host db59 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [01:42:41] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [01:49:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:50:38] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [02:04:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.438 seconds [02:24:10] !log LocalisationUpdate completed (1.21wmf5) at Thu Dec 6 02:24:10 UTC 2012 [02:24:19] Logged the message, Master [02:29:42] New patchset: Reedy; "Defrag commented out readonly statements" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36594 [03:00:10] New patchset: Asher; "returning db59 to service" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37181 [03:00:32] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37181 [03:19:04] i took mysql down on db59 rebooted it again, and no nagios messages here :/ [03:25:00] !log asher synchronized wmf-config/db.php 'returning db59 to s1' [03:25:11] Logged the message, Master [03:26:11] !log asher synchronized wmf-config/db.php 'lowering db59 weight for warmup' [03:26:20] Logged the message, Master [03:27:44] 
!log asher synchronized wmf-config/db.php 'setting db59 to full weight' [03:27:52] Logged the message, Master [03:47:47] New patchset: Pyoungmeister; "adding nagios groups for parsoid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37183 [03:52:42] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37183 [04:26:35] New patchset: Ryan Lane; "Fix nagios groups that are breaking nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37185 [04:26:53] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37185 [05:58:12] RECOVERY - mysqld processes on db59 is OK: PROCS OK: 1 process with command name mysqld [05:58:30] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [05:58:30] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [05:58:30] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [06:00:45] PROBLEM - Parsoid on mexia is CRITICAL: (null) [06:01:12] PROBLEM - Frontend Squid HTTP on sq48 is CRITICAL: Connection refused [06:01:21] PROBLEM - Parsoid on wtp1001 is CRITICAL: (null) [06:01:30] PROBLEM - Parsoid on kuo is CRITICAL: (null) [06:01:39] PROBLEM - Parsoid on tola is CRITICAL: (null) [06:01:57] PROBLEM - Parsoid on lardner is CRITICAL: (null) [06:01:57] PROBLEM - Parsoid on wtp1 is CRITICAL: (null) [06:30:46] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [06:31:44] maybe sq48 needs a cmjohnson1 [06:39:52] New patchset: Catrope; "Make /var/lib/parsoid owned by wikidev and g+ws, makes deploying easier" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37165 [07:13:53] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [07:13:53] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [07:30:55] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [07:30:55] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [07:48:25] New patchset: Jeremyb; "bug 42765 - throttle.php: Mumbai outreach event" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37196 [07:58:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:00:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.051 seconds [08:00:48] New patchset: Jeremyb; "bug 42765 - throttle.php: Mumbai outreach event" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37196 [08:03:42] New review: Jeremyb; "thanks Matthewrbowker for CR via IRC" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/37196 [08:31:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:38:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.027 seconds [08:50:16] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [09:12:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:28:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.382 seconds [09:33:19] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 
hours [09:36:20] doh [09:36:21] hello [09:36:26] so I broke ops/puppet.git :( [09:44:55] yep [09:45:04] luckily the workaround is pretty simple [09:54:20] I really need to train other people to modify the workflow :-] [09:57:20] should be fixed now [09:57:40] Change restored: Hashar; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37009 [09:58:32] Something's wrong with the job queue. Translation notification jobs are getting "stuck" somehow. I'd like someone to check the job queue for metawiki and DELETE any present translation notification jobs. How can I get that done? [09:58:41] New patchset: Hashar; "validate jenkins job" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37009 [09:58:53] hashar: ^^ Who can I poke? [10:00:52] siebrand: at this hour me I guess [10:01:14] though I am supposed to deploy the new workflow on Mediawiki/core this morning [10:01:33] hashar: This is something that's very disruptive in wikis at the moment. [10:01:52] hashar: See https://bugzilla.wikimedia.org/show_bug.cgi?id=42715 [10:02:16] hashar: I'm looking up queue IDs to make it easier for you. If you are to do something, please report back in the bug what you were able to delete. [10:03:15] hashar: The job id is probably translationNotificationJob [10:03:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:04:18] siebrand: I will deploy the new workflow first [10:04:28] siebrand: should take roughly an hour then I look at the job queue issue [10:04:49] hashar: so "job_cmd = 'translationNotificationJob'" [10:05:10] hashar: great. I'll add this info to the issue and will add you in CC. [10:08:39] !log applying new CI workflow on mediawiki/core.git . Changes now needs CR+2 to trigger unit tests. [10:08:51] Logged the message, Master [10:12:26] yeah new workflow applied to mw/core@master [10:18:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.055 seconds [10:39:29] bah I am locked [10:50:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:07:00] hashar: What's the status on the job queue question I asked? [11:07:18] on hold till I finish deploying the new workflow [11:07:20] doing regression tests locally right now [11:07:21] should not be long :) [11:08:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.068 seconds [11:16:05] siebrand: ok finished my tests :) [11:19:41] soo hm [11:19:57] metawiki has 2213 translationNotificationJobs pending [11:30:23] hashar: remove the lot :) [11:30:24] hashar: please nuke them all. [11:30:29] snap [11:30:48] hashar: (make a dump if possible) [11:31:12] hashar: it's semi-sensitive data, so please only distribute to AaronS and the likes. [11:31:31] well looking at the dump for https://meta.wikimedia.org/w/index.php?title=User_talk:Vladimir_Penov&action=history [11:31:47] I get three translationNotificationJob jobs for it [11:31:51] still? [11:31:55] from Nov 27th, Dec 04 and Dec 01. [11:32:01] each of them have job_attempts = 3 [11:32:33] siebrand: ^^^ [11:32:43] looking over the edits, it seems like the job gets half done and then stops and starts all over again [11:32:45] (if you're able to save the data, nuke from production after that, and report on the bug, then we'll get someone to fix it, hopefully.) [11:33:00] I'm working for another client, under high pressure, and don't have the time at the moment. 
[11:33:09] ok ok [11:33:13] this really needs someone involved who did the recent job queue reowrk. [11:33:23] should I just nuke all translationNotificationJob jobs meanwhile ? [11:33:30] then report back too Aaron so he looks at it? [11:34:37] hashar: yes [11:34:51] but aaron will need the data [11:35:12] it would be better not to throw further tests on users' heads [11:35:46] what puzzle me is that our runJobs log does not show any translationNotificationJob :( [11:38:10] hashar: would you mind me quoting you in the bug? [11:38:28] Seddon: yes please :) [11:38:47] hashar: cool, Ill do that now [11:39:29] ahh [11:39:35] Seddon: I used the wrong grep command :) [11:39:50] so they are there in the log? [11:40:55] apparently [11:40:58] looking for them [11:41:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:41:47] the jobs do end up with an error [11:41:55] but that is an empty error :/ [11:43:22] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [11:46:01] hm [11:46:23] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [11:51:04] I shall (since I am so out of my league with allof this) assume that should not be the case [11:51:19] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [11:51:56] fuuauufuazueazeu [11:54:43] lolwut [11:55:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.040 seconds [11:56:05] hashar: Niklas is suggesting that previous behaviour of the job queue was to drop a failed job, and now it may be re-run. [11:56:14] siebrand: I think so [11:56:26] siebrand: I am enhancing the translationnotification job [11:56:29] so it returns a proper value [11:56:32] hashar: The empty status may have to do with it. So we're indeed dealing with a pretty serious regression in the job queue redesign. [11:56:37] apparently returns always NULL [11:56:44] which is recognized as a failure by the job queue [11:56:57] will follow up with Nikerabbit [11:57:12] hashar: He's having his day off. Finnish independence day. [11:57:18] ah nice [11:57:29] well will sort it out with someone else so :-] [11:58:10] ill post some of this on the job queue bug [12:13:37] hashar: Thanks for the help. [12:14:22] Seddon: hold on till I get a patch to send :-] [12:14:27] ok cool :) [12:14:30] fighting with our Status class right now [12:14:35] WICH IS A HUGE PILE OF MESS [12:14:37] :-D [12:15:39] I am beginning to understand why noone wants to work on this job queue :P [12:16:47] well [12:16:57] Seddon: it got rewritten recently [12:17:05] so there is not a lot of people knowing the source code [12:17:23] AND making a mistake in that code tends to cause major issues on the live wikis such as the one with translation notifications [12:17:37] so most people are probably to afraid of breaking the site [12:18:03] luckily, the WMF has a team to handle this (I am part of that team) and we have very smart people (I am not part of them). [12:18:15] lol, oh dear.... Ill brace for impact :P [12:19:38] I genuinely appreciate the help with this :) [12:19:51] my job :-] [12:20:12] I am a contractor for WMF > Platform Engineering > MediaWiki maintenance. 
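(Aside: the failure mode hashar diagnoses above — translationNotificationJob's run() method returning NULL, which the reworked job queue counts as a failure and therefore retries, sending duplicate notifications — comes down to the run() return value. A minimal sketch of the pattern, with illustrative class and parameter names rather than the actual TranslationNotifications code:)

```php
<?php
// Minimal sketch of the failure mode diagnosed above: a Job::run() that
// falls off the end returns NULL, which the reworked job queue treats as
// a failure, so the job is retried (job_attempts climbing to 3) and users
// get duplicate notifications. Class and parameter names here are
// illustrative, not the actual TranslationNotifications code.
class ExampleNotificationJob extends Job {
	public function __construct( $title, $params ) {
		parent::__construct( 'exampleNotificationJob', $title, $params );
	}

	public function run() {
		$user = User::newFromName( $this->params['username'] );
		if ( !$user ) {
			// Record a real error so runJobs.log shows more than the
			// empty error message seen above.
			$this->setLastError( 'Invalid username in job params' );
			return false;
		}

		// ... deliver the notification ...

		return true; // explicit success, so the queue does not re-run the job
	}
}
```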
[12:20:19] so whenever something is broken, that ends up to us :-] [12:20:29] I have said it to others, count yourself lucky that we arn't doing the international fundraiser at the moment [12:20:58] I would not have been as calm and collected over the last four days as I have been :P [12:21:03] we would probably not have deployed the new job queue system in such case [12:21:24] so I get a patch [12:21:28] no I need to write tests for it [12:21:51] hashar: the new job queue was deployed well before the decision not to do the FR [12:21:51] New patchset: Reedy; "Kill of secure.wm.o related configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37209 [12:22:01] * Reedy waves bye bye to secure.wm.o [12:22:27] Reedy: I thought we wanted to keep it around to still serve the old URLs [12:22:28] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36659 [12:22:34] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36595 [12:22:40] Reedy: ah maybe that is handled on the apache side [12:23:13] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.048 second response time [12:23:47] AFAIK all the redirects are inplace at the apache level [12:28:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:32:29] new User::newFromName( 'foobar' ); [12:32:29] ... [12:32:32] I am getting tired [12:36:56] Reedy: I have enabled this morning the new workflow on mediawiki/core.git [12:36:59] Reedy: so +2 to get tests [12:37:08] Reedy: will enable auto submit later this afternoon [12:43:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.693 seconds [13:02:59] hashar: so, should I dist-upgrade gallium? [13:06:02] paravoid: sure :) [13:06:13] paravoid: not sure which packages are going to be upgraded foo [13:06:16] though [13:06:16] grg [13:06:25] paravoid: I am going to grab a snack first [13:08:03] New patchset: Reedy; "Defrag commented out readonly statements" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36594 [13:08:26] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36594 [13:08:34] Snnnack time [13:08:35] brb [13:13:29] back [13:17:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:20:51] hashar: all is well [13:21:51] paravoid: while you are at it, we could get PHPUnit upgraded :) [13:21:57] paravoid: though that is done via PEAR :( [13:32:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.424 seconds [14:05:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:23:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.041 seconds [14:56:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:58:55] New patchset: Dereckson; "bug 42765 - throttle.php: Mumbai outreach event" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37196 [14:59:59] New review: Dereckson; "PS3: adding 2 hours at event end" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/37196 [15:01:18] hashar: apergos: you both have been involved with throttles in the past and now I'm realizing I don't understand entirely how it works... 
what happens after the time period has elapsed if there were more than the standard limit created that day? do the accounts from during the period not count for afterwards? [15:02:14] jeremyb: IIRC the throttles are only valid during a specific time range [15:02:42] hashar: but there's a default throttle that applies all the time [15:02:50] for all IPs [15:03:26] then once the throttle is expired, the old default apply again [15:03:34] and I guess people will be prevented from creating new account :-] [15:03:37] I have no idea honestly [15:03:55] someone should document it ;) [15:04:04] I don't know how we keep track of how many # of account creation have been made for a given IP [15:04:05] me neither, I've never really loked into what happens when the time period is over [15:04:08] maybe that is in memcached [15:04:16] *looked [15:05:01] $key = wfMemcKey( 'acctcreate', 'ip', $ip ); [15:05:02] ah yeah [15:05:05] memcached :-] [15:05:19] jeremyb: so if the default is 5 account per IP over 24hours [15:05:28] we raise it to 50 for an hour [15:05:37] 50 accounts are created, the value is at 50 [15:05:47] no more account can be created for the next 24 hours from that IP [15:05:57] regardless of the $wgAccountCreationThrottle value [15:06:01] well that's inconvenient [15:06:14] i'm thinking we maybe have a separate counter for throttle exceptions [15:06:21] you want a way to flush that value when the time is expired [15:06:34] includes/specials/SpecialUserlogin.php look below the hook call named "ExemptFromAccountCreationThrottle" [15:06:35] and revert to the old counter (from earlier that day) when we revert to default behavior [15:06:56] that could work too [15:07:11] or make the memcached key to contain the value of wgAccountCreationThrottle [15:07:29] this way whenever we change wgAccountCreationThrottle a new key will be hit thus reseting the throttle [15:07:31] hashar: what if there's 2 events in a day? ;) [15:07:41] also, what's 24 hrs? is it rolling? or calendar day UTC? or? [15:07:49] jeremyb: from the same IP ? Then we could raise the throttle to 1000 :-) [15:07:52] oh [15:07:58] 24 hours is the timeout of the memcached key [15:08:04] so that is since the last write to it [15:08:04] huh [15:08:35] if on Dec 5th at 3pm someone create the last account, nobody would be able to create one till Dec 5th 3pm [15:08:51] then the key will expire in memcached [15:08:54] there will be no more value [15:09:01] and the account will be allowed :- [15:09:07] dec 5th == dec 5th [15:09:10] afaik [15:09:13] so 24 hours form when the limit was reaached [15:09:14] oh yeah [15:09:33] if on Dec 5th at 3pm someone create the last account, nobody would be able to create one till Dec SIX 3pm [15:09:37] :-) [15:09:41] New patchset: Dereckson; "(bug 42767) Throttle rule for Goa Event" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37221 [15:09:52] apergos: thanks for the rephrase :-]]]] [15:10:01] yw [15:10:16] jeremyb: feel free to submit a documentation block in our config, I will be happy to approve it [15:10:45] hashar: k, maybe later [15:12:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.081 seconds [15:24:06] New patchset: Mark Bergsma; "Initial version of swiftrepl, a Swift-to-Swift API replicator script" [operations/software] (master) - https://gerrit.wikimedia.org/r/37223 [15:28:31] New review: Dereckson; "Ok, I ask on the bug." 
[operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/37196 [15:29:14] Dereckson: did you see my queries above too? [15:32:42] jeremyb: yes, I've just red your remark about lack of request for the extra margin on the bug. I submitted the patchset before to read your statement, as I needed to have it stabilized to use it as a dependency for the next bug (another throttle request for an event occuring the 12). [15:33:03] i saw the dep [15:33:27] Dereckson: i meant w/ hash ar/aper gos [15:35:08] Red. This comforts me in the idea we should add a little margin for time and participants on each throttle rule. [15:36:26] There is no real risk of abuse, as the required IP are by definition trusted. And we mitigate the risk they can't create an account at the end of the event or if they got too many participants. [15:40:21] * jeremyb doesn't know what "Red" means [15:40:23] but ok [15:41:38] Dereckson: as a general rule I'd like us (shell and related people) to not be event planners. teach people to submit accurate requests and they can themselves figure in buffers as they wish. by the time it gets to us it should be clear what to do. no guesswork [15:42:41] you know what I would like is this: [15:42:49] admin say "oh we have an event on our project" [15:42:57] there are existing pages like https://en.wikipedia.org/wiki/Wikipedia:How_to_run_an_edit-a-thon and https://outreach.wikimedia.org/wiki/GLAM/Model_projects/Edit-a-thon_How-To [15:43:08] we can add / change the instructions there if needed [15:43:12] admin goes to the exptension page, sets the ip range(s), the time frame, and the number of accounts, or simply removes the limit for them [15:43:17] for that time frame [15:43:25] apergos: yeah, i knew that was coming! [15:43:29] wmf staff is never involved :-D [15:43:43] win for everyone [15:44:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:33] * ^demon hides from apergos' suggestion [15:47:49] <^demon> Since that's something on my really really back burner. [15:47:59] * ^demon goes into hiding [15:50:55] cookie licker! [15:51:01] * apergos gices daemon more cookies to lick [15:51:05] gives them too [15:51:34] <^demon> Why eat the cookies when you can just lick them endlessly? :) [15:51:58] cmjohnson1: did the 720s arrive? [15:52:21] yes...they are all here [15:52:35] racked but nothing else...i will be working on getting them up and running today and tomorrow [15:53:06] cool, thanks a lot! [15:59:16] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [15:59:16] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [15:59:16] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [15:59:32] apergos: did you see sq48? [15:59:39] again? [15:59:41] no I didn't [16:01:05] funny, there was a new alert on it just before i was already typing a mention of it ;) [16:01:51] err, just before my message when i was already typing a mention of it* [16:01:54] * jeremyb is sleepy [16:02:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.040 seconds [16:02:58] joy [16:03:06] jeremyb: s/Red/Read. I understand your objection about event planning, but I have some difficulties to understand the advantages to act as a bot and have a very formal approach on this matter. It seemed to be common sense to add such a margin. 
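(Aside: the throttle behaviour hashar describes above boils down to a single per-IP counter in memcached. This is a rough paraphrase of the logic he points at — see includes/specials/SpecialUserlogin.php for the real code; this is a simplified sketch, not a verbatim copy:)

```php
<?php
// Simplified sketch of the account-creation throttle discussed above.
// The counter lives in memcached under one key per IP, which is why
// raising $wgAccountCreationThrottle for an event does not reset a
// counter that has already been accumulated that day.
function exampleAcctCreateThrottle( $ip ) {
	global $wgMemc, $wgAccountCreationThrottle;

	$key = wfMemcKey( 'acctcreate', 'ip', $ip );
	$count = $wgMemc->get( $key );

	if ( $count && $count >= $wgAccountCreationThrottle ) {
		// Limit reached: no more creations from this IP until the key
		// expires, regardless of the configured limit in the meantime.
		return false;
	}

	if ( !$count ) {
		$wgMemc->set( $key, 1, 86400 ); // counter lives roughly 24 hours
	} else {
		$wgMemc->incr( $key );
	}
	return true;
}

// hashar's suggested tweak above -- bake the configured limit into the
// key so that changing $wgAccountCreationThrottle for an event starts a
// fresh counter (hypothetical variant, not what is deployed):
//   $key = wfMemcKey( 'acctcreate', 'ip', $wgAccountCreationThrottle, $ip );
```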
[16:03:54] I can ssh in but [16:03:59] -bash: /etc/profile: Input/output error [16:04:05] -bash: /usr/bin/free: Input/output error [16:04:06] :-D [16:04:08] bad disk [16:04:13] off to cmjohnson1 ;) [16:04:13] apergos: nice suggestion ; I could try to prepare such an extension if you get me the documentation about how to send the throttle information to relevant caches. [16:04:41] Dereckson: how would caching be involved? [16:05:01] the memcache key yeah? [16:05:32] we don't touch caches now so why should an extension? [16:05:44] jeremyb: sq48? [16:05:52] jeremyb: hashar noted the account create setting where stored in Memcache ($key = wfMemcKey( 'acctcreate', 'ip', $ip );) [16:05:54] anyways smething is surely out of whack on sq48, http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&c=Upload+squids+pmtpa&h=sq48.wikimedia.org&tab=m&vn=&mc=2&z=small&metric_group=ALLGROUPS [16:05:58] were stored [16:06:09] cmjohnson1: yeah. not that i have any authority to give it to you ;) broke yesterday too [16:06:39] The last Puppet run was at Thu Dec 6 16:00:25 UTC 2012 (2 minutes ago). [16:06:42] really! [16:06:46] cmjohnson1: just that apergos' paste sounds like a bad disk or something between disk and rest of system [16:06:51] I am nearly certain it is dead...older Dell PE1950...a capacitor on teh HDD controller goes bad and kills the system [16:07:05] ok well I'm getting off [16:07:13] it has killed many of our squids and older srv's [16:07:16] should I open an rt ticket for it? [16:07:59] apergos: sure...i will confirm the issue. I have servers onsite that are set for donation I can pull the card from one of those and get it working again. [16:08:18] yeah typical 1950 fault [16:08:26] the capacitor from hell [16:08:41] dell has a factory in hell? [16:08:42] but it takes years to manifest in my experience [16:08:43] ;P [16:08:58] 1950s were quite resilient [16:09:10] do I give this to you cmjohnson1? [16:09:26] no give it to sbernardin or put in pmtpa que w/nobody [16:10:08] done (nobody) [16:10:42] cool..thx apergos...hopefully have it running again later today [16:13:58] Dereckson: i imagined (hoped?) memcache wouldn't be relevant for such an extension (and still do). anyway, we know the relevant config option. so just grep for the option and see where it's used and how [16:14:20] thank you cm johnson [16:15:03] ok well back in a while, got some things to do [16:29:33] New patchset: Mark Bergsma; "Handle many more error conditions" [operations/software] (master) - https://gerrit.wikimedia.org/r/37231 [16:32:07] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [16:34:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:36:54] hrmmm, puppetcamp in ghent right before fosdem. 
FYI [16:47:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.895 seconds [17:15:01] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [17:15:01] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [17:22:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:07] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [17:32:08] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [17:38:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.269 seconds [17:56:37] RobH: ping [17:58:40] preilly: sup [18:03:03] RobH: Do we have a spare box to run Varnish on right now? [18:03:21] RobH: I want to put Varnish in from of Parsoid for the Visual Editor [18:03:24] we have some misc servers, single cpu, non SSD, no not spec for varnish [18:03:33] they would be identical to the ones running parsoid [18:03:43] and i assume it has to be in tampa? [18:04:00] I only have 1 spare server left in tampa. [18:04:03] RobH: is Parsoid's boxes in PMTPA right now? [18:04:07] yep [18:04:14] i just spun up 4 more for roan yesterday [18:04:25] wtp1, and now mexia, tola, lardner, and kuo [18:04:34] i have a single identical server to those available [18:04:36] RobH: so you have 1 more box that I could use then? [18:04:37] and thats it. [18:04:41] if CT approves it [18:04:43] you can have it [18:04:48] RobH: okay cool [18:05:03] its all wired up and such, so to make it work is just a vlan change and some OS isntall stuff [18:05:25] oh wait [18:05:27] preilly: i have more than 1 [18:05:36] heh, i have 6 servers left in tampa unallocated [18:05:43] (that should make it easier for you to get approved) [18:05:58] RobH: Okay as soon as CT gets in I'll request it [18:06:05] all identical to what wtp1 and such are [18:06:06] cool [18:06:24] ping me when he has it all approved and we'll get it all deployed for ya [18:06:38] RobH: okay great [18:06:39] its dual disk, hardware raid1 [18:06:54] so i assume just basic same disk parititon as wtp1? (or else list what you want in ticket please) [18:07:08] RobH: same partition scheme is fine [18:07:09] that way i dont keep you guys waiting on questions =] [18:07:11] cool [18:07:22] and i assume outside IP as it is varnish? [18:07:38] RobH: which queue do I use for this in RT? [18:07:42] procurement [18:07:47] RobH: yes outside IP [18:09:07] RobH: Does something like, "We need to provision 1 misc server for Varnish in PMTPA to place in front of NodeJS boxes used for Parsoid." work for the RT ticket? [18:09:31] yep, you can note that i said there were 6 available identical to wtp1 and such if you want [18:09:31] if it's going to be used in production, please provision at least 2... [18:09:46] seems reasonable. [18:09:54] (what paravoid says) [18:10:05] I have no idea of the plans wrt parsoid, I'm looking forward to today's metrics meeting [18:10:21] i imagine the caching for the temp parsoid cluster will also be just as temp [18:10:32] and eventualy replaced with to spec hardware, like all the wtp cluster will be [18:11:04] RobH: created ticket #4042. [18:11:29] I'll add a bit to it for background and clarification [18:11:53] paravoid: Parsoid is launching on ENWIKI on Monday [18:12:01] yikes. [18:12:23] I wonder when people were planning to tell ops... 
[18:12:29] paravoid: yeah it was a surprise to us all [18:12:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:13:08] I don't mind you helping us out -- esp. considering the time constraints [18:13:19] but isn't setting up caches and everything our team's job? [18:14:16] deciding how many boxes to use and which and things like that? [18:14:31] the entire parsoid cluster is a rushed fubar project [18:14:39] its running on misc servers for now, but has to move [18:14:45] and doesnt seem actually planned. [18:15:04] no hardware is to spec, nothing was planned for over a day in advance of request it seems =P [18:15:16] speaking of procurement... RobH, any ideas what's the ETA for yttrium? [18:15:29] MaxSem: uhhh, yttrium was reinstalled already i thought, was it not? [18:15:32] and why launching on enwiki? [18:15:51] MaxSem: i stand corrected! [18:15:56] I'll take a look at it now. [18:16:09] cheers Rob [18:16:36] yea its network ports been moved, i'll update the puppet files and reinstall it now [18:16:45] sorry about that [18:18:39] woosters: ping [18:18:50] preilly: he just mailed that he's OOO sick. [18:19:04] woosters: can you approve ticket #4042 in RT [18:19:06] also, you kind of ignored me :-) [18:19:06] paravoid: thanks [18:19:38] paravoid: I hadn't read your comments yet [18:19:43] paravoid: reading now [18:19:58] paravoid: yes it is your teams job [18:20:15] New patchset: Demon; "Install git on all servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37247 [18:20:45] Platonides: that's a great question for James_F [18:21:49] MaxSem: So if looks like yttrium was manually partitioned [18:21:56] which is nasty and i try to not do now a days [18:22:33] preilly, seems like wanting it to break on deployment, instead of going at a slower pace [18:22:59] its dual 1tb disks [18:23:01] (iirc) [18:23:07] RobH, I don't have any special requirements for partitioning, a few gigs for /var is all I need [18:23:08] That wasn't our choice, it was decided for us [18:23:15] ok, then i do the stock raid lvm [18:23:19] Platonides: it's not my call at all [18:23:19] keep it simple [18:23:24] Platonides: talk to James_F [18:23:52] And it was announced in various places so I'm surprised no one in ops knew. But yes, talk to James_F [18:25:04] !log Reweighted pmtpa Parsoid pool according to # of cores [18:25:17] Logged the message, Mr. Obvious [18:25:36] RobH: Erik is looking at the RT ticket #4042 right now [18:25:40] New patchset: RobH; "retasking yttrium to internal vlan" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37248 [18:25:49] RobH: He should be approving it right now'ish [18:26:03] Either way, we screwed up by not realizing the need for more boxes earlier on, my apologies for that and thanks to you all in ops for helping us make it happen quickly [18:26:53] I imagine there is some level of frustration since we have a dozen ops projects we had to freeze for fundraising [18:27:05] and you guys keep making us push other things that technicallllly should be frozen ;] [18:27:22] so if the entire site catches fire, we're blaming you guys [18:27:28] RobH: Erik just approved the ticket [18:27:42] RobH: When would you be able to provision that boxes? [18:27:59] preilly - just approved it. robh - pls provision them [18:28:04] Right, that makes sense [18:28:12] will do [18:28:13] woosters: Okay great thanks! 
[18:28:23] woosters: by-the-way I hope that you feel better soon [18:28:42] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37248 [18:28:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [18:28:54] woosters: no one likes a sore throat [18:29:17] thks. .. ya, sucks :-( [18:29:44] preilly: yea I'll allocate and get the puppet work done to install them and drop a network ticket [18:29:51] then we can beg mark or leslie to change vlan [18:29:55] New patchset: Dereckson; "(bug 42771) Logo for mr.wikiquote" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37249 [18:29:57] RobH: Okay great [18:30:00] woosters: maybe you caught it from maggie? ;-P [18:30:04] RobH: Will they be running precise? [18:30:08] yes [18:30:15] unless you need them to do something else? [18:30:20] the defult is precise. [18:30:20] paravoid: I like the thread you just started [18:30:25] default even [18:30:42] RobH: okay great [18:30:52] jeremyb - someone sneezed into my face earlier this week :-( [18:31:02] on caltrain [18:31:17] paravoid: can I PM [18:31:31] always :) [18:31:49] woosters: wow, i guess that means you could have a timestamp for infection time ;) [18:32:21] (which i guess is rare to be able know) [18:32:50] anyway, stock up on chicken soup [18:34:30] MaxSem: yttrium is now reinstalling, will ping you when done [18:34:32] <^demon> !log running `mwscript extensions/Wikibase/repo/maintenance/rebuildEntityPerPage.php --wiki=wikidatawiki` in screen on hume. If it causes problems, kill it and ping me [18:34:37] ok, working on parsoid varnish allocation now. [18:34:41] Logged the message, Master [18:36:40] RobH: awesome thanks so much [18:40:22] RobH: The "freeze" was never declared to the rest of Engineering. Sorry if it was for Ops. :-( [18:40:56] well, changing major site infrastructure when site downtime would be horrible... [18:41:05] but yea, i just passing what i know, im not actually mad [18:41:20] i just spin up servers man, doesnt matter what for as long as the rest of you stay happy. [18:41:21] =] [18:51:12] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [19:01:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:06:28] !log authdns-update for celsus and constable servers [19:06:39] Logged the message, RobH [19:08:21] sbernardin: You about? [19:08:29] I just allocated this thing, and i dont feel like having to change [19:08:34] i have a bad drac connection i need fixed [19:12:33] !log kaldari Started syncing Wikimedia installation... : [19:12:41] Logged the message, Master [19:17:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.169 seconds [19:21:08] New patchset: RobH; "adding servers celsus and constable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37253 [19:21:26] preilly: Ok, I have 90% of the stuff done for this. LeslieCarr is going to fix the ports, and sbernardin needs to troubleshoot the DRAC for constable. [19:21:35] i expect both of those items done in a couple of hours [19:21:41] and then i'll install later today. [19:21:51] RobH: Okay great thanks [19:21:59] RobH: I really appreciate it [19:22:02] if for some reason it takes too long and its way past quitting time for me [19:22:11] i have set it up so its all just goign to be boot and pxe load [19:22:15] and it will do the rest unattended. 
[19:22:19] sweet [19:22:25] i'll update ticket with details as well [19:22:51] LeslieCarr: nice non answer [19:23:00] hehe [19:24:16] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37253 [19:26:24] New review: Alex Monk; "Actually it looks like I2c6ab07d needs to be rebased first." [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/34113 [19:26:52] huh... its 14:47 [19:26:56] i didnt eat breakfast or lunch [19:27:03] im hungry. [19:27:31] * RobH has not left his chair since 930 [19:27:35] sbernardin: Hey! [19:27:48] (you about see pm?) [19:30:18] preilly: when did you want this deployed by? [19:30:29] ie: if steve isnt onsite and i have to deploy tomorrow AM is that too late? [19:30:42] if it is i can just allocate and get another server setup, but annoying since i did all the work for this one [19:31:30] RobH: I was hoping today [19:31:44] RobH: tomorrow AM would be fine too [19:32:05] is one of two ok for today? [19:32:09] i can install celsus now [19:32:17] its constable's drac that is wonky [19:32:49] (i dont wanna hold up deployment, so i figtured one of two is ok? [19:32:55] New review: Alex Monk; "Needs rebase" [operations/apache-config] (master) C: -1; - https://gerrit.wikimedia.org/r/13293 [19:33:02] RobH: dude, you need noms [19:33:13] too busy to move today. [19:33:26] though my head hurts so badly its making me quesy [19:33:29] yay lack of sleep! [19:33:38] they should come to you then! [19:34:15] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [19:35:57] New patchset: Aaron Schulz; "Removed old swift hacks." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37255 [19:36:52] i broke site.pp [19:36:56] New patchset: RobH; "broke with , fixing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37256 [19:37:00] fixing, dunno why it didnt catch it [19:37:47] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37255 [19:38:10] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37256 [19:38:38] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37037 [19:38:54] bleehhhh [19:38:58] someone merged a bunch of shit with mine! [19:39:01] damn it [19:39:47] AaronSchulz: and binasher [19:39:56] so your alls changes im about to merge on sockpuppet [19:40:02] its not going to break blogs is it? [19:40:27] RobH: aaron was just mediawiki config? [19:40:31] RobH: i merged it. if it breaks blogs, it isn't your fault. [19:40:51] ok, its merged on sockpuppet [19:41:01] !log aaron synchronized wmf-config 'deployed 7c21802fc48ef4c5ee0edab3118d03cbbdc1866d' [19:41:09] Logged the message, Master [19:41:49] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36197 [19:42:38] !log demon synchronized wmf-config/InitialiseSettings.php 'Deploying Idf3b7f0c' [19:42:47] Logged the message, Master [19:48:02] Ok, celsus is finishing its install. [19:48:10] when its done im totally leaving and going to buy food. [19:48:29] i have not stood up in 5 hours. [19:48:43] RobH: GET UP STAND UP! [19:48:48] damn it... [19:48:53] * RobH pulls up itunes and plays the song [19:48:58] hehe [19:49:33] alternatively you could read the article! 
https://en.wikipedia.org/wiki/Get_Up,_Stand_Up [19:50:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:54:06] preilly: So celsus is now installed and is doing its initial puppet run (just the standard, since nothing else has been defined but that) [19:54:26] when its done its all yours, constable will be fixed first thing when Steve is onsite tomorrow at 9AM EST [19:54:43] and when that is done I will get it installed and online for you (around 10AM EST) [19:54:52] New patchset: MaxSem; "Set up yttrium as a temp Solr server for GeoData" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35931 [19:55:13] can someone please review^^^ [19:55:54] i would but i really have NO IDEA what the solr stuff should be [19:56:08] who would other than you? ;] [19:56:17] (ie: who the heck is qualified to review it) [19:56:39] though its a new file, that has references ONLY to the one server you allocated for it [19:56:47] so im not sure what else would go wrong by it not being right anyhow [19:57:18] RobH: okay sweet [19:57:29] New review: RobH; "Seems to only affect the one server he defines, but someone with a passing knowledge to solr config ..." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/35931 [19:59:22] New patchset: Demon; "Adding wikidata repo prune script to hume's crons" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37258 [20:04:58] New patchset: Demon; "Adding wikidata repo prune script to hume's crons" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37258 [20:05:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds [20:24:04] !log Running sync-common on hume [20:24:12] Logged the message, Master [20:26:51] Can someone please do chmod g+r /home/wikipedia/common$ ls -al php-1.21wmf5/maintenance/.mwsql_history [20:26:54] fail [20:27:09] Can someone please do this on fenari please: chmod g+r /home/wikipedia/common/php-1.21wmf5/maintenance/.mwsql_history [20:31:46] !log kaldari Finished syncing Wikimedia installation... : [20:31:54] Logged the message, Master [20:31:55] scap finished :) [20:32:01] and I'm not dead yet [20:37:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:14] New patchset: Kaldari; "Turning off wgResourceLoaderExperimentalAsyncLoading on test2" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37289 [20:39:45] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37289 [20:41:58] !log kaldari synchronized wmf-config/InitialiseSettings.php 'Turning off wgResourceLoaderExperimentalAsyncLoading on test2' [20:42:07] Logged the message, Master [20:45:50] New review: MaxSem; "This file was taken from GeoData master, where it was already reviewed." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/35931 [20:56:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [21:20:01] New patchset: Ottomata; "Including and requiring nrpe::packages in nrpe::check, rather than just specifying the dependency." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37295 [21:22:22] New review: Ottomata; "I had just enabled udp2log varnishncsa logging for blog.wikimedia.org. varnish::logging uses nrpe::..." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/37295 [21:22:42] binasher, or sombaawwdy: [21:22:42] https://gerrit.wikimedia.org/r/#/c/37295/ [21:22:58] that'll fix an angry puppet agent on marmontel [21:23:01] i hope! [21:25:00] mike_wang: How are things going? Do you have some reasonable things to work on? [21:28:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:41:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.918 seconds [21:41:55] New review: RobH; "looks sane" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/37258 [21:41:55] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37258 [21:44:26] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [21:45:31] heya, notpeter, could you give this a peek? https://gerrit.wikimedia.org/r/#/c/37295/ [21:47:54] New patchset: Demon; "cron syntax expects an array of values" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37302 [21:48:26] New review: RobH; "once more with feeling!" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/37302 [21:48:27] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37302 [21:50:50] !log reedy synchronized php-1.21wmf5/includes/media/DjVuImage.php [21:50:58] Logged the message, Master [21:51:04] <^demon> Hmm, so wikidata user? [21:51:04] crontab: user `wikidev' unknown [21:51:04] err: /Stage[main]/Misc::Maintenance::Wikidata/File[/var/log/wikidata]/ensure: change from absent to directory failed: Could not set 'directory on ensure: Could not find user wikidev at /var/lib/git/operations/puppet/manifests/misc/maintenance.pp:133 [21:51:15] <^demon> Oh, wikidev. [21:51:16] <^demon> Weird. [21:51:48] <^demon> We can do mwdeploy:wikidev [21:51:50] heh [21:51:53] <^demon> I'll tweak it [21:51:58] hume has no wikidev group included [21:52:03] the gid 500 usually right? [21:52:15] thats included on a LOT of misc hosts for dev [21:52:20] but not on hume. [21:52:20] <^demon> No wikidev group? Other hume crons have it [21:52:23] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [21:52:33] but it's recently been reinstalled... [21:52:36] New patchset: Ori.livneh; "Update config for Extension:EventLogging" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37304 [21:52:37] indeed [21:52:43] and wikidev isnt included inthe puppet run [21:52:51] if we can confirm that other scripts need it [21:52:54] <^demon> I lied. [21:52:57] we can just include the user in site.pp call [21:53:00] (i would think) [21:53:07] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37304 [21:53:16] ^demon: you lied as in no other scripts need it? [21:53:38] at this point its not breaking hume, so its not an ASAP fix it this second or hum eis out of sync [21:53:46] <^demon> Yes, no other scripts are using it. [21:53:46] it just fails that and applies everything else (it appears) [21:53:52] ahh, then fix yer script! [21:53:53] ;] [21:54:11] then yea i rather not include a group just for a single script if we can avoid it. 
[21:54:24] New patchset: Demon; "Use mwdeploy instead of wikidev" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37305 [21:54:34] New patchset: Ori.livneh; "Update EventLogging config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37306 [21:54:57] <^demon> We can all sudo as mwdeploy anyway, so it's not like I can't see the logs if I want. [21:54:59] New review: RobH; "note to self: insist on payment of bribes before agreeing to code review" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/37305 [21:54:59] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37305 [21:55:28] yea i just dont like more site.pp cruft ;] [21:55:45] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37306 [21:56:14] applying on hume now [21:56:49] binasher: heh doOperations is faster on swift than nas1 [21:57:03] I have gotten to spend all this week doing non-onsite things.... [21:57:11] i could really, really, really get used to this. [21:57:19] ^demon: done, looks like its working [21:57:22] <^demon> I've always got puppet stuff needing merging :) [21:57:24] <^demon> Thanks! [21:57:27] well tp90 [21:57:29] directories were created and seem good [21:57:31] !log olivneh synchronized wmf-config/CommonSettings.php 'Updating Extension:EventLogging config' [21:57:40] Logged the message, Master [21:58:03] AaronSchulz: o_O [21:58:18] <^demon> RobH: Cron should run in 2m. I'll keep an eye on the log [21:58:26] tp50 is close, sometime on is faster than the other [22:01:48] <^demon> RobH: Script ran fine, thanks for merging this! [22:04:23] ottomata: sorry I was slow. still need review? [22:05:11] if you would danke [22:05:28] i think its fine, i just want to be sure I dont' break other stuff [22:05:30] it shouldn't [22:05:38] afaik, this wouldn't work anywhere else anyway unless nrpe was already included [22:05:48] it just so happens that marmontel did not have nrpe included [22:05:57] and we are setting up varnishncsa loggers there [22:06:03] which by default use nrpe for monitoring [22:06:22] so, this keeps the dependency in place, but also includes the dependency if it hasn't already been [22:09:00] ^demon: welcome =] [22:10:19] New patchset: Ryan Lane; "Some nagios ganglia and nagios fixes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37316 [22:10:33] ottomata: hmmm, alright, sounds legit [22:12:37] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37316 [22:14:11] ottomata: wait a second [22:14:36] ja? [22:14:43] if $::network_zone == "internal" { include nrpe [22:14:50] hmm [22:14:54] I would habeeb that that is there for a reason [22:14:55] zat in varnish? [22:15:03] that's in base::standard-packages [22:15:13] hm [22:15:22] I would like some confirmation on why that's there before merging [22:15:28] as I smell security implications [22:15:28] what's network zone on monmartel [22:15:36] yeah, me too, which is what I was asking :) [22:15:44] it seemed the right thing to do, but I don't know the whole picture there [22:16:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:16:46] where does network_zone ge set? 
[22:16:50] so, it's marmontel.wikimedia.org [22:16:57] soooo, I'm going to assume that that's not too internal [22:17:34] I'd like to talk to probably ma_rk or ryan about this before proceding [22:18:30] woosters: would anyone in Ops be able to help me resolve bug 42452 by clearing the bits varnish caches one by one? [22:18:55] notpeter: what about marmontel? [22:18:57] notpeter, ok cool, thanks, [22:19:05] Ryan_Lane: [22:19:05] https://gerrit.wikimedia.org/r/#/c/37295/ [22:21:48] ottomata: ok, we talked a wee bit [22:21:55] and if oyu set up some ip tables rules [22:21:59] should be legit [22:22:03] hm [22:22:25] ok, or maybe I could just say monitor => false [22:22:28] on the varnish::logging instance? [22:22:29] that "internal only" thing is for code exec [22:22:51] that could also be legit [22:23:07] kaldari: Any particular context behind testing with wgResourceLoaderExperimentalAsyncLoading? Some issue? [22:23:31] i'm really just trying to make as few changes as possible for this that allows webstatscollector to get blog stats [22:23:38] I'm not sure I understand your question. I just turned it off, not on. [22:24:08] so, notpeter, Ryan_Lane, better to do iptables or just not monitor this process? [22:24:22] kaldari: Yes, I'm just curious as to why the sudden change? [22:24:29] why do you need nrpe to monitor the service? [22:24:29] either is fine. if you're cool with no monitoring, then go for it [22:24:54] !log catrope synchronized php-1.21wmf5/includes/Title.php '67de9f2cd4e37b7b7081d47a2652aa0158680c5c' [22:25:03] Logged the message, Master [22:25:06] !log catrope synchronized php-1.21wmf5/includes/EditPage.php '67de9f2cd4e37b7b7081d47a2652aa0158680c5c' [22:25:14] Logged the message, Master [22:25:18] !log catrope synchronized php-1.21wmf5/resources/mediawiki/mediawiki.Uri.js '67de9f2cd4e37b7b7081d47a2652aa0158680c5c' [22:25:27] Logged the message, Master [22:25:31] this is just to see if the service is running? [22:25:53] from talking with chrismcmahon we thought it would be better if test2 wasn't using wgResourceLoaderExperimentalAsyncLoading so that it would be a more accurate test environment for cluster deployments. [22:26:18] also Roan said no one had been doing anything with wgResourceLoaderExperimentalAsyncLoading in a while [22:27:23] k [22:27:51] I'm pretty sure it affects some browser behavior in unpredictable ways. [22:28:12] Ryan_Lane, that's what it is doing [22:28:14] yes, which is why we enabled it on test2 and not test. So that we can see how it behaves. [22:28:15] Krinkle: is anyone actively evaluating wgResourceLoaderExperimentalAsyncLoading or is it basically shelved for now? [22:28:22] I don't really care, that's just what varnish::logging does by default [22:28:22] shelved for now [22:28:34] what're you testing on test2 though? [22:28:42] everything [22:28:53] RoanKattouw: That hash doesn't exist in the repo, the merge was done locally? [22:29:05] ottomata: what does the service do? [22:29:26] sorry if I'm missing things, but about a billion people are talking to me right now [22:29:33] New patchset: Demon; "Enabling changes table for wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37324 [22:29:35] haha, its ok [22:29:36] Krinkle: It should exist ... 
[22:29:48] um, varnishncsa instance sends varnish access logs to udp2log hosts [22:29:52] It's a Gerrit-generated merge [22:29:54] RoanKattouw: local commit "67de9f2 Merge changes Ieb9eee45,Ib0e40714 into wmf/1.21wmf5" [22:29:55] there are 3 instance (one for each udp2log host) [22:29:58] Hm.. [22:29:59] odd [22:30:16] Krinkle: pull core and you'll get it too [22:30:23] so, i think the automatic nrpe stuff will notify us if the varnishncsa process dies [22:30:39] RoanKattouw: I filed it in the gerrit ui [22:30:47] Krinkle: I was told test2 is the current preferred test environment specifically for the reason that it's closer to the live wiki configs than test [22:30:59] kaldari: test environment for what? [22:31:10] new code that's going to be deployed [22:31:33] test2 is in the deployment chain, we expose new stuff there before it goes live [22:31:41] that's where chris's automated tests are running as well [22:32:06] maybe it would make more sense to turn wgResourceLoaderExperimentalAsyncLoading on on test.wiki instead [22:32:07] New patchset: Catrope; "Use the service IP for Parsoid rather than using wtp1 directly" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37327 [22:32:25] sure, but that's mostly operations pov from what I know. code quality / safety and user creations stuff should ideally not be done on the production cluster as it'll go in global centralauth [22:32:33] isn't the beta labs ready for this yet? [22:32:40] It can have 100% the same configuration if we want to [22:32:46] kaldari: I have some running vs. test2, some vs. beta, some on the mobile site. [22:33:11] Krinkle: beta labs doesn't get every update to every extension yet. [22:33:24] rather late, since at that point it'll be running on the production cluster. [22:33:26] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37327 [22:33:44] chrismcmahon: It can run on a wmf branch instead of master. [22:33:58] or we make a script to let it simulate the wmf-branch set up for the latest master [22:34:00] !log olivneh synchronized php-1.21wmf5/extensions/EventLogging [22:34:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds [22:34:06] (i.e. the same extensions and their latest masters) [22:34:08] Logged the message, Master [22:34:35] no problem, just surprised that we're using test2 for this. we should at least be working on getting labs to be usable for this. [22:35:01] Krinkle: wouldn't that be nice. :-) we are working on getting labs usable for that, but we're not there yet. [22:35:19] what's the official purpose of test2 anyway? [22:35:25] for het deploy [22:35:37] so we have a test wiki on the next and prev wmf branch [22:35:55] New patchset: Ryan Lane; "Fix wtp1001 aggregator config as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37329 [22:35:58] ah [22:36:04] and then it got lost and someone decides to push the next branch to both [22:36:07] that's been bothering me too [22:36:27] yeah, that's would be useful [22:36:32] that's = that [22:36:37] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37329 [22:37:06] especially for deploying backported bugfixes [22:37:27] same for people updating extensions to latest master in before-last wmf branches. On more than one occasion has it caused compatibility issues. 
We should agree on freezing extension updates for the older wmf branches (only mediawiki/core updates if absolutely needed) we branch and deploy often enough to get those updates out. [22:37:49] (e.g. extension developed against master, not compatible with state in previous wmf branch) [22:38:39] New patchset: Ryan Lane; "Allow/Deny use space as a delimiter, not a comma" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37331 [22:38:41] oh well, the gripes of he observer :) [22:38:45] the* [22:38:52] New patchset: Demon; "Enabling changes table for wikidata" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37324 [22:39:15] Krinkle: it's step by step [22:39:15] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37331 [22:39:28] Change merged: Catrope; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37324 [22:39:30] Krinkle: I think you bring up some good points [22:40:45] would anyone in Ops be able to help me resolve bug 42452 by clearing the bits varnish caches one by one? [22:41:00] !log demon synchronized wmf-config/CommonSettings.php 'Deploying Ia1208778, Ib80a42fa' [22:41:09] Logged the message, Master [22:41:22] New patchset: Ottomata; "Not monitoring varnishncsa instances for blog.wikimedia.org." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37295 [22:41:24] kaldari: what? Is that happening again? how is that possible [22:41:38] kaldari: that was fixed with the wmf5 patch in ext.Vector and core by adding h3 [22:41:47] notpeter, zat ok then? [22:41:47] https://gerrit.wikimedia.org/r/#/c/37295/ [22:41:50] I saw it myself [22:42:09] yes, and I even deployed the CSS patches a day before the wmf5 deployment [22:42:21] but people are still seeing the broken CSS [22:42:26] New patchset: Catrope; "Attempting to fix Ganglia aggregation on wtp1/wtp1001" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37332 [22:42:36] Ryan_Lane: ---^^ [22:43:05] Ryan_Lane: I think I found the problem (again) [22:43:15] Krinkle: I think there must be some stuck caches on one of the bits varnish servers [22:43:31] kaldari: there can't be such thing as a stuck cache [22:43:47] it has a timestamp on it, the old cache would be no longer referenced with the next timestamp. [22:43:57] and the unversioned ones are purged every 5 minutes [22:44:11] kaldari: got a live link to see? [22:44:16] no [22:44:22] the ones I have have all been purged [22:44:29] I can't reproduce myself either [22:44:42] I need to see it for it to confirm. purging it manually in bits wouldn't make a difference if it hasn't purged itself by now. [22:45:10] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37332 [22:48:04] Krinkle: personally I have no idea how it would be possible for people to still be getting the old CSS (thus my "stuck cache" theory). Do you have any ideas? [22:48:27] kaldari: I have a few, but too many to spill. I'd have to see it [22:50:01] There's a summary of what I know so far here: https://bugzilla.wikimedia.org/show_bug.cgi?id=42452#c23 [22:52:19] kaldari - ryan_lane is on duty .. 
he will be able to help [22:53:07] !log aaron synchronized php-1.21wmf5/includes/job/JobQueueDB.php 'deployed 471f6840613ae5137464b6449db88bc30608e85f' [22:53:11] kaldari: I find your last comment more interesting :) [22:53:15] Logged the message, Master [22:53:16] c28 [22:54:54] Krinkle: Yes, and I've heard other reports via IRC of people solving the problem by clearing their cookies (resetting session) [22:55:25] kaldari: Either bad headers or a bad session, I don't see a solution. [22:55:56] Is it happening for logged out users? [22:56:16] Not that I know of, at least since the CSS patches we did [22:58:08] notpeter, Ryan_Lane, I'm going to self review this one, since it is much less dangerous and shouldn't change anything that already exists [22:58:08] It doesn't seem to be a rare problem though, as even a few people here in the office reported the problem [22:58:12] it'll just make puppet happy [22:58:13] https://gerrit.wikimedia.org/r/#/c/37295/ [22:58:14] s'ok? [22:58:22] ^demon: do you know if the parser cache is split by skin as well? i.e. if we were to change the html of editsections, would that have to be for all skins? I recall the parser using some kind of placeholders to allow those to be user/skin specific without cache fragementation [22:58:42] editsection-links* [22:58:44] [edit] [22:58:45] <^demon> No, we don't vary on skin. [22:58:51] <^demon> Right, that's done after parse. [22:58:54] cool [22:59:06] That didn't used to be like that always though,right? [22:59:12] (which is great) [22:59:15] <^demon> Right, we used to vary on skin. [22:59:34] okay. Just wanted to verify that that was indeed the reason we didn't change this in 2009. [22:59:57] (the usability initiative section-edit-links improvement was done with javascript) [23:00:38] mutante: *where* on https://doc.wikimedia.org/ should I look at site.pp ? [23:02:29] <^demon> sumanah: I don't think that's correct, apache config must be wrong. Check out http://doc.wikimedia.org/ [23:02:32] <^demon> (no s) [23:02:50] New review: Ottomata; "I switched this to just not monitor these varnishncsa processes, instead of dealing with network sec..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/37295 [23:02:51] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37295 [23:03:54] ^demon: https: "This is our continous integration server for MediaWiki." http: puppet & puppetsource. You wanna file the bug or shall I? [23:04:32] <^demon> Lemme just look at the apache config. [23:04:39] <^demon> It's probably just because stuff was copy+pasted. [23:05:12] okay. [23:05:49] doc.wikimedia.org? Instead of noc? [23:06:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:06:53] Susan: schlock.wikimedia.org [23:07:47] I'm just not sure I understand doc.wm.o. [23:08:39] Look at its content? [23:09:01] It looks server configuration files. [23:09:05] looks like [23:09:29] Or docs for Puppet, I guess. [23:11:03] New patchset: Demon; "Setup SSL properly for doc.wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37336 [23:11:17] <^demon> sumanah: Right now it's just puppet. We're going to move mediawiki docs there too. [23:11:27] <^demon> Meant that for Susan, damn autocomplete. [23:11:37] <^demon> sumanah: That patchset ^ should fix it [23:11:45] Heh. [23:11:54] MediaWiki docs? [23:12:02] <^demon> The stuff that's at svn.wm.o/doc/ right now [23:12:04] I don't follow. 
[23:12:16] <^demon> doc.wm.org is for auto-generated docs. [23:12:28] All right. [23:12:31] <^demon> We want that off the svn box :) [23:12:36] ^demon: speaking of which, which machine actually runs doxygen to produce the output? [23:12:52] <^demon> ori-l: For svn.wm.o, or doc.wm.o? [23:12:55] http://svn.wikimedia.org/doc/ is far more attractive now than I remember. [23:13:01] svn.wm.o [23:13:06] kaulen [23:13:06] <^demon> That's on the svn box, formey. [23:13:11] oh, duh [23:13:20] * ori-l wants a paper copy and doxygen has latex output.. i wonder how many pages it would amount to [23:13:27] ^demon, the SSL certificate for doc.wikimedia.org seems to be broken FYI [23:13:37] ori-l: One could hope that it is OVER 9000 [23:13:43] <^demon> Thehelpfulone: Yes, I put a patch in gerrit. [23:13:44] doc.wm.o instead of docs.wm.o? [23:13:49] Just one. [23:13:55] <^demon> We could serveralias easily. [23:13:57] ok [23:14:36] Thanks ^demon and is there anything I could help with there? [23:14:48] <^demon> poke an opsen with a stick :) [23:15:07] <^demon> I'll do it. [23:15:46] <^demon> Ryan_Lane: SSL on the new doc.wm.o is busted, but I think I fixed it. https://gerrit.wikimedia.org/r/#/c/37336/ [23:16:19] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37336 [23:16:22] that looks right [23:16:23] gj [23:16:39] it was kind of magical and fairy-tale esque to have http and https serving very different things [23:16:42] which host is that on? [23:16:52] <^demon> gallium. [23:16:56] ok [23:17:14] <^demon> sumanah: It's because it matched the only :443 vhost it could find, which is integeration.mw.o [23:17:15] <^demon> :) [23:17:16] salt 'gallium*' cmd.run 'puppetd -tv' [23:17:19] <3 [23:18:05] heh, nice [23:18:10] screw ssh [23:18:48] * Damianz hugs his SOL [23:19:10] notice: Run of Puppet configuration client already in progress; skipping [23:19:15] I hate puppet [23:19:21] hmm, websearching for the man page for "salt" is a little hard [23:19:28] sumanah: saltstack [23:19:57] ahhhh [23:20:01] <^demon> sumanah: My first result for "man salt" is Salt N Pepa - Whatta Man [23:20:03] * ^demon shrugs [23:20:17] ManSalt - Muscle Soak for Men, Bath Salts for Men [23:20:18] man page probably won't help too much [23:20:24] it'll show you the matching options [23:20:25] binasher: how much ram do the db servers (say 63) have? [23:20:32] you need to look at the modules lists [23:20:35] or the state lists [23:20:36] That's a pretty good song actually [23:20:42] https://wikitech.wikimedia.org/index.php?search=saltstack&title=Special%3ASearch&fulltext=1 no results [23:20:48] or how to use events or how to use returners (which are really awesome) [23:20:51] sumanah: no docs there [23:21:08] I should likely write some docs [23:21:24] though they'd really mostly just point back to saltstack's docs [23:21:37] <^demon> Docs would be good, if you expect any of us to be your guinea pigs. 
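^demon's one-line diagnosis at 23:17 is the core of the doc.wikimedia.org breakage: with no *:443 vhost of its own, an HTTPS request for doc.wikimedia.org fell through to the only SSL vhost Apache could find on gallium, the integeration.mw.o one, which is why HTTP and HTTPS served different things. A hedged sketch of the general shape of the fix in change 37336, including the docs.wikimedia.org ServerAlias ^demon says could easily be added; the file path, certificate names and DocumentRoot below are assumptions, not the merged patch:

    # Sketch only; the real change lives in operations/puppet and its paths differ.
    file { '/etc/apache2/sites-available/doc.wikimedia.org':
        ensure  => present,
        content => "<VirtualHost *:443>
      ServerName doc.wikimedia.org
      ServerAlias docs.wikimedia.org
      SSLEngine on
      SSLCertificateFile    /etc/ssl/certs/star.wikimedia.org.pem
      SSLCertificateKeyFile /etc/ssl/private/star.wikimedia.org.key
      DocumentRoot /srv/org/wikimedia/doc
    </VirtualHost>
    ",
    }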
[23:21:45] !log awjrichards synchronized php-1.21wmf5/extensions/MobileFrontend/ 'fix for MobileFrontend bug 42749' [23:21:47] I have docs for the deployment system ;) [23:21:53] Logged the message, Master [23:22:02] even developer-like docs [23:22:31] I just want to mention this, because it's awesome: http://docs.saltstack.org/en/latest/ref/returners/index.html [23:22:42] Ryan_Lane: basically I just linked https://wikitech.wikimedia.org/view/Git-deploy#Basic_design which is the first/only place "salt" appears on that wiki to saltstack.org -- at least that will point people to the right place [23:22:57] <^demon> So, totally different subject...I just thought of something that ruined my plan of offloading gitviewing from manganese. [23:23:01] you can run commands and have the returned data run through a function, which can store the returned data places, like redis [23:23:12] sumanah: thanks [23:23:18] yw [23:23:34] ^demon: what's that? [23:23:38] not sure I like that [23:23:40] <^demon> The jsession wouldn't carry across different hostnames. [23:23:45] Damianz: not sure you like what? [23:23:50] returners [23:23:54] <^demon> So we couldn't expect gerrit to manage permissions. [23:23:54] Damianz: why not? [23:23:58] ^demon: use a proxy? [23:24:18] ah [23:24:19] right [23:24:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.020 seconds [23:24:38] Damianz: I was thinking of using a returner to present data from deployment [23:24:40] AaronSchulz: db63 and servers above that are getting 96, most servers below that number have 64 [23:24:41] As far as I understand it the return is run on the client - so using it for stashing data back to a central system is a bit of a headache. I can see where it might be useful though [23:24:54] <^demon> I thought of running gerrit as a second slave, but gerrit only operates in sshd mode as a slave. [23:25:04] <^demon> It doesn't do a read-only UI x_x [23:25:13] ah yeah, it runs on the client [23:25:27] Ryan_Lane: was wondering if you (or anyone in Ops) have any thoughts on bug 42452. I was thinking perhaps we should try clearing the bits varnish caches one by one, but Krinkle didn't think that would help. [23:25:51] <^demon> So yeah, I gotta rethink this somehow :\ Because I want it to work properly, but I'm afraid of overloading manganese eventually. [23:26:00] <^demon> I wish multi-master support worked :\ [23:26:04] Damianz: yeah, it would have the minions put data into something like redis, rather than report back to the master [23:26:04] we need to be able to reproduce it in order to get anything useful. [23:26:31] kaldari: it's only 1 url though right? Could purge it, should be fine [23:26:54] 1 url? [23:26:56] or touch the file and sync it. but then again, we touched it many times since then. [23:26:58] Ryan_Lane: That's fine if you trust the clients ;) Split opinion on how I'd want to route info back....could be interesting using something like casandra for crazy data grabbing though [23:27:18] Damianz: yeah. in the situation I want to use it, I trust the client [23:27:21] kaldari: the bits url to that module? [23:27:43] yeah, but that URL veries considerably [23:27:45] <^demon> Ryan_Lane: Did puppet finish on gallium? I'm still getting served the old cert. [23:27:46] varies [23:27:54] no [23:27:58] puppet is being an asshole [23:28:03] <^demon> stupid puppet. [23:29:30] can someone help out with a mobile varnish purge? binasher, mutante, LeslieCarr? 
[23:30:05] kaldari: ^ watcg and learn ;) [23:30:30] heh, if awjr manages to skip me in line I'll be pissed ;) [23:30:53] We have a line? [23:30:56] the secret is whiskey, kaldari [23:31:05] shit, I forgot! [23:31:29] which i will happily provide to whoever flushes the mobile varnish cache first next time im in SF [23:33:07] awjr: I was trying to get a non-mobile varnish cache flush :) [23:33:13] ohho! [23:33:34] between the two of us we will destroy all WMF varnish caching [23:35:42] kaldari: how so (veries considerably) ? [23:36:07] I believe that URL includes the gadgets as well [23:36:15] so varies per user [23:36:27] skins css is loaded separately [23:36:49] I think the problem is also affect the Vector ext css [23:36:54] affect = affecting [23:37:05] i.e. the collapse triangles [23:37:40] yes, ext.vector only affects collapsibility. since it isn't confined to that the problem isn't those urls. [23:37:44] kaldari, shouldn't RL take care of it for you? [23:37:51] the skin css is doing the big headings [23:38:00] MaxSem: yes, one would expect that [23:38:15] see bug 42452 [23:39:18] MaxSem: during development html structure was changed and stylesheets without backwards compatibility, that was wrong. It was fixed in wmf [23:39:31] somehow it is still happening in some users browsers [23:39:54] which suggests either their browser is keeping cache longer than it should, or there is a server with an outdated mediawiki checkout. [23:40:23] but since no dev seems to be able to reproduce it.. [23:41:26] !log awjrichards synchronized php-1.21wmf5/extensions/MobileFrontend/javascripts/common/mf-application.js 'Touch file' [23:41:35] Logged the message, Master [23:42:06] I saw the bug on Monday, but not since [23:44:31] paravoid: thank you for starting that thread this morning [23:50:19] preilly: yw [23:50:39] paravoid: :) [23:50:40] let us know if we can help you on your work for this [23:53:01] !log reedy synchronized php-1.21wmf5/extensions/ParserFunctions/Expr.php [23:53:09] Logged the message, Master [23:54:17] New patchset: Pyoungmeister; "coredb_mysql: attempting laner's hackaround for module/role class issue" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37343 [23:55:22] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37343 [23:57:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
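The coredb_mysql patchset at 23:54 refers to the then-familiar Puppet headache of module code depending on variables set in role classes (dynamic scoping). The log does not show what "laner's hackaround" in change 37343 actually was; the sketch below is only the generic workaround of that era, with hypothetical parameter names: have the role class pass what the module needs as explicit class parameters instead of letting the module reach into role scope.

    # Illustrative only; whether change 37343 did exactly this is an assumption.
    class coredb_mysql($shard = 's1', $read_only = false) {
        # The module consumes its own parameters rather than looking up variables
        # in whichever role class happened to include it.
        notify { "coredb_mysql configured for shard ${shard}": }
    }

    class role::coredb::s1 {
        class { 'coredb_mysql':
            shard     => 's1',
            read_only => false,
        }
    }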