[08:56:07] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[08:58:13] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[10:14:41] are we having overload issues? I just got a red warning and: Sorry, the servers are overloaded at the moment.
[10:14:42] Too many users are trying to view this page. Please wait a while before you try to access this page again.
[10:14:42] • Pool queue is full
[10:14:50] (on the enWiki main page)
[10:15:20] the whole interface loaded just not page content (even the article feedback tool loaded, that red box and warning loaded in place of content)
[10:15:25] refreshing worked...
[10:21:28] TimStarling: ^ (if your not paying attention in here)
[10:21:42] for what page?
[10:21:50] it was on the en main page
[10:27:08] I guess I should increase the limit
[10:28:52] !log tstarling synchronized wmf-config/PoolCounterSettings.php 'increased max queue from 50 to 100 on reports that the limit was reached on the enwiki main page in normal operation'
[10:28:55] Logged the message, Master
[10:29:10] is it a new limit?
[10:30:17] no, it's was introduced a year ago
[10:30:26] weird...
[10:35:30] ok, bed time for now
[10:35:44] thanks Tim
[10:43:53] PROBLEM - Host cp1017 is DOWN: PING CRITICAL - Packet loss = 100%
[11:37:26] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[12:11:10] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours
[12:22:16] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:24:13] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[12:58:33] !log hashar synchronized php-1.19/includes/SiteStats.php 'Reenable SiteStatsInit::articles() for bug 35169. SiteStatsInit::doAllAndCommit() still disabled since it breaks the site'
[12:58:36] Logged the message, Master
[12:59:23] if the DB start having huge load, that *could* be the cause
[13:11:02] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[13:13:08] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[13:23:02] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds
[13:23:29] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds
[13:56:47] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[14:12:17] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[15:36:20] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[15:37:50] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.8406035135 (gt 8.0)
[15:50:53] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[15:52:32] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.0018782883 (gt 8.0)
[15:56:44] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.50495070796
[16:36:47] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 14.6404059459 (gt 8.0)
[16:41:20] WM file storage is on Swift now, right?
[16:42:14] Kindof
[16:42:29] thumbs
[16:42:34] not the originals
[16:42:46] apergos: I heard a lot of complains about files deletion being slow
[16:43:11] if people can document it a little more specifically
[16:43:20] they can bugzilla it and it can be looked into
[16:43:56] (is it all files, only onces in a while. what does "slow" mean, where in the process is it slow to respond)
[16:44:02] *once in a while
[16:48:53] apergos: well, I do not really understand what do you mean "more specifically"
[16:49:03] It's everywhere, as far as I can see
[16:49:14] Deletion of file takes about 8 seconds
[16:52:46] if a bug report just goes in that says "deleting files is slow"
[16:52:58] I guarantee you that the next response will be "we need more information"
[16:53:21] apergos is getting into the forecasting business
[16:53:34] so if you or someone else can give a specific example (I was on wikiX, I deleted file Y, id showed me this and then I had to wait Z time before it did Q"
[16:53:40] then that would help
[16:54:17] and it's fine to say that this behavior is on all fles you've deleted
[16:54:22] just we have to have a starting point
[16:54:32] I am ?
[16:54:38] which forecasting business is that?
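
The 10:27-10:28 exchange above is the PoolCounter queue ceiling being raised in wmf-config/PoolCounterSettings.php after the "Pool queue is full" reports. In MediaWiki that ceiling is set through the PoolCounter extension's $wgPoolCounterConf; the sketch below shows such a pool definition, with key names taken from the extension's configuration and every value other than the 50-to-100 maxqueue change being an illustrative assumption, since the log does not show the actual file:

    <?php
    // Sketch of a PoolCounter pool definition, not the actual contents of
    // wmf-config/PoolCounterSettings.php (the log only shows the commit
    // message). Key names follow the PoolCounter extension's configuration;
    // the timeout and worker values are illustrative.
    $wgPoolCounterConf = array(
        'ArticleView' => array(
            'class'    => 'PoolCounter_Client',
            'timeout'  => 15,   // seconds a client waits for the lock
            'workers'  => 2,    // concurrent re-renders allowed per article
            'maxqueue' => 100,  // raised from 50; beyond this, requests fail
                                // immediately with "Pool queue is full"
        ),
    );

With maxqueue at 100, twice as many viewers of a popular page can wait for a single re-parse before the error shown at 10:14 is returned.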
[16:54:41] ( jeremyb )
[16:54:56] tomorrow is supposed to be weather like early summer
[16:54:57] apergos: i guarantee you the next response ;)
[16:55:12] *today* is early summer
[16:55:14] I might have to go out and enjoy it some
[16:55:21] it's not even spring yet
[16:55:23] * jeremyb too
[16:55:39] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[16:56:37] see, maybe it's not all wikis, maybe it;s something recent that didn't happen a week ago, etc etc
[16:56:47] so the more info people have the more likely they are to be able to find the issue
[17:10:44] apergos: https://bugzilla.wikimedia.org/show_bug.cgi?id=35326 — is this good?
[17:11:22] can you give one file name that you deleted? just so we have it?
[17:12:07] apergos: https://test.wikipedia.org/w/index.php?title=File:Gpl-dos.png&action=edit&redlink=1 — one of them
[17:12:13] please add it to the bug report
[17:12:26] OK
[17:12:41] and the other thing, I take it the 8 seconds was from the time you clicked delete or submit or whatever and it came back with any display?
[17:13:56] I think so
[17:14:06] apergos: well, I actually took time from the comment
[17:14:21] that's the other thing I would want in the report, is where that 8 seconds is observed
[17:14:26] other than that, good to go
[17:42:00] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[17:43:39] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[17:45:36] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[17:53:33] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[17:53:33] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[17:58:11] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[18:13:01] crap
[18:13:26] Nikerabbit: ?
[18:16:02] RECOVERY - Squid on brewster is OK: TCP OK - 0.009 second response time on port 8080
[18:16:33] jeremyb: got conflicts while doing svn up
[18:17:32] Nikerabbit: did you join the git bandwagon?
[18:18:17] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.68127149123 (gt 8.0)
[18:18:21] jeremyb: huh?
[18:19:14] !log nikerabbit synchronized php-1.19/extensions/WebFonts/resources/ext.webfonts.fontlist.js 'i18ndeploy r114160'
[18:19:17] Logged the message, Master
[18:19:43] !log nikerabbit synchronized php-1.19/includes/Linker.php 'i18ndeploy r114160'
[18:19:46] Logged the message, Master
[18:20:46] Nikerabbit: git merging is pretty well established to be less pain than svn
[18:45:53] PROBLEM - Host brewster is DOWN: CRITICAL - Host Unreachable (208.80.152.171)
[18:47:41] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.19666785714 (gt 8.0)
[18:54:49] !log Disabled all production CiviCRM Jenkins jobs, for CiviCRM upgrade.
[18:54:53] Logged the message, Master
[18:58:57] !log Put production civicrm / drupal instance in offline mode for upgrade
[18:59:00] Logged the message, Master
[19:04:56] RECOVERY - Host brewster is UP: PING OK - Packet loss = 0%, RTA = 0.15 ms
[19:09:06] !log Started the CiviCRM 4.1.1 update script on prod.
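
The detail apergos asks for around 17:12-17:14 (which wiki, which file, where the 8 seconds is spent) can be gathered with a small client-side probe instead of reading timestamps off the deletion comment. A sketch follows; the wiki, title and token values are placeholders, and a real run needs a logged-in session with the delete right and a freshly fetched delete token:

    <?php
    // Client-side timing probe: measure how long one action=delete API round
    // trip takes, as seen from the user's end. Placeholders below are not real
    // credentials; cookie handling and error checking are omitted.
    $wiki  = 'test.wikipedia.org';
    $title = 'File:Example.png';
    $token = 'PLACEHOLDER+\\';

    $ch = curl_init( "https://$wiki/w/api.php" );
    curl_setopt_array( $ch, array(
        CURLOPT_POST           => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POSTFIELDS     => array(
            'action' => 'delete',
            'title'  => $title,
            'reason' => 'timing test',
            'token'  => $token,
            'format' => 'json',
        ),
    ) );

    $start    = microtime( true );
    $response = curl_exec( $ch );
    $elapsed  = microtime( true ) - $start;
    printf( "%s: delete of %s answered after %.2f s\n", $wiki, $title, $elapsed );

Numbers like this, per wiki and per file, are exactly the "starting point" the bug report discussion above is asking for.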
[19:09:10] Logged the message, Master
[19:16:02] PROBLEM - Host virt3 is DOWN: PING CRITICAL - Packet loss = 100%
[19:18:48] !log CiviCRM 4.1.1 update script finished executing on prod.
[19:18:51] Logged the message, Master
[19:22:05] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.42543630631
[19:27:11] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds
[19:27:38] RECOVERY - MySQL Slave Delay on db1033 is OK: OK replication delay 0 seconds
[19:29:38] can anybody guess why this page http://pt.wikipedia.org/w/index.php?title=Especial:P%C3%A1ginas_%C3%B3rf%C3%A3s&limit=500&offset=5000 is giving me a "no results page"?
[19:30:10] I mean I know there is more results to be shown, but it stops in the 5000th
[19:30:36] well te english messages says that it's cached and there is a max of 5000 resukts available in the cache
[19:30:37] so
[19:30:52] The following data is cached, and was last updated 11:13, 19 March 2012. A maximum of 5,000 results are available in the cache.
[19:31:02] oh, the portugues message does not say that
[19:31:14] * chicocvenancio goes look in the translate wiki
[19:31:32] RECOVERY - Host virt3 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms
[19:31:48] yeah, probably want to fix that up
[19:38:45] the message in translatewiki is wrong
[19:38:57] hence the my mistake...
[19:39:02] apergos:^^
[19:42:14] ok, good to know
[19:42:26] easy to fix if you have translate rights
[19:42:38] I dont...
[19:42:50] How do I get them?
[19:42:50] can you see who does over there/
[19:42:58] well hmm you could do that too
[19:43:06] I think there are intrusctions off the main page
[19:43:20] it's not hard, it might take a couple days is all
[19:43:28] but then you can work on all the messages
[19:43:33] I think many users in ptwiki have them... If it is too much work to get Ill ask them to change it...
[19:44:23] ok
[19:45:09] it's not work for you, you just create yourself an account there, put a note or so about your babels on your user page and leave a note on a request page, unless they changed it
[19:45:16] so maybe 2-3 minutes for everything
[19:45:20] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 12.9389685345 (gt 8.0)
[19:45:22] it's just waiting later to get them
[19:45:36] but you don't have to do anything after that excpt check back on it
[19:48:10] done!
[19:48:24] is there a translate wiki channel?
[19:48:55] got it #mediawiki-i18n
[19:55:43] apergos: asked over there and they gave me the rights immediately
[19:55:56] sweet!
[20:07:24] hello
[20:07:53] !log catrope synchronized php-1.19/extensions/ArticleFeedback/modules/jquery.articleFeedback/jquery.articleFeedback.js 'r114176'
[20:07:53] Logged the message, Master
[20:07:54] there's a special page on the pt.wiki that hasn't been updated since 2009, how can we update it?
[20:12:20] apergos: would you know something about that?
[20:12:20] some of those we might not update again if they are too much of a burden on the servers
[20:12:20] you could ask someone who's looked at that more recently or look at bugzilla to see if it's come up on other projects
[20:12:20] I'm sorry to be so vague (also it'a late evening here)
[20:12:26] then wouldn't it be better to remove the page?
[20:12:29] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.797791875
[20:12:58] tschis: see e.g. https://bugzilla.wikimedia.org/show_bug.cgi?id=15434
[20:13:41] (and the other bugs at https://bugzilla.wikimedia.org/showdependencytree.cgi?id=29782&hide_resolved=1)
[20:16:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:20:40] apergos: thanks for your help :}
[20:20:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.026 seconds
[20:31:50] !log awjrichards synchronized php/extensions/MobileFrontend/MobileFormatter.php 'Adding debugging code to MobileFormatter'
[20:31:54] Logged the message, Master
[20:36:39] !log awjrichards synchronized php/extensions/MobileFrontend/MobileFormatter.php 'Addin more debugging code to MobileFormatter'
[20:36:43] Logged the message, Master
[20:43:58] !log awjrichards synchronized php/extensions/MobileFrontend/MobileFormatter.php 'Addin more debugging code to MobileFormatter'
[20:44:01] Logged the message, Master
[20:53:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:54:31] !log awjrichards synchronized php/extensions/MobileFrontend/MobileFormatter.php 'Addin more debugging code to MobileFormatter'
[20:54:34] Logged the message, Master
[20:57:31] !log awjrichards synchronized php/extensions/MobileFrontend/MobileFormatter.php 'Removing debugging code from MobileFormatter'
[20:57:34] Logged the message, Master
[20:59:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.488 seconds
[21:00:10] Hello. I am a java-script writer at Wikimedia Commons and I would like to know if the edit-rate limits are secret. If not, is there a summary somewhere so I can adjust our tools to prevent "You've exceeded your rate limit. Please wait some time and try again"
[21:01:13] This occurs especially when requesting file deletions because a lot of changes are required for this task.
[21:02:12] rillke: there was a thread on wikitech-l recently on rate limits
[21:05:20] rillke: http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/59543
[21:05:45] thanks
[21:05:51] * rillke is reading
[21:09:07] That does not tell me facts about edit-rates.
[21:09:27] If no one will tell me here, I will have to test to find out myself.
[21:09:35] Poor servers.
[21:09:39] the edit rate limits are: respect the db lag of the server
[21:09:49] and don't run concurrent jobs, run in serial
[21:09:57] It is a javascript that performs a deletion request
[21:10:29] I can't tell the user that he cannot request deletion because of the big lag,
[21:10:31] :-)
[21:10:39] you can wait to do it
[21:10:59] or you can return a message that says to try again in a bit because the servers are overloaded
[21:11:03] there are scripts that do it
[21:11:07] yep
[21:11:31] But especially for IPs you have a very limited hardcoded edit limit.
[21:13:19] My tools __always__ wait until one request has finished before starting the next one.
[21:13:27] At least for IPs
[21:14:01] good. they should do that in all cases
[21:14:12] we don't have a global limit
[21:14:22] the projects have their own limits for scripts
[21:14:32] you will want to check the policy for the particular project
[21:14:59] Oh, I mean to ask, how *do* you stop web tools from creating concurrent API requests (specifically one user who submits a form, say)?
[21:15:02] *meant
[21:15:16] Jarry1250: we don't?
[21:15:31] if we notice server load we'll do sometihing about it
[21:15:34] Oh, I meant "you" as in "me"? :)
[21:15:44] ah
[21:15:45] Jarry1250: if someone misbehaves we just cut them off entirely (or maybe try to contact them first)
[21:15:46] Hehe: Cat-a-lot is sending up to 200 request at one time (not developed by me)
[21:15:49] *didn't mean the ? to be there
[21:16:13] And it is trying until it gets it edits done.
[21:16:23] I need to sleep
[21:16:31] If you prefer this way, you get it.
[21:16:31] talk to folks later
[21:16:42] Jarry1250: one user interactive posting something (few requests via JS) can't do much harm
[21:16:49] did you guys just roll out an new "sandbox" gadget?
[21:17:17] Good night, apergos.
[21:17:31] it points to the incubator - it appears to be on by default - and points to the incubator, not sure if they'll appreciate that..
[21:17:43] Versa_Versa: We don't roll out Gadgets, that's done by local wiki communities
[21:17:44] night
[21:18:20] * Versa_Versa pokes around for en.wiki types who may know something about the gadget
[21:18:38] Versa_Versa: Look at the history of MediaWiki:Gadgets-definition
[21:18:54] Versa_Versa: Or do you just mean the whole "My sandbox" link thing?
[21:19:33] the whole link thing - points to the wrong wiki..
[21:20:11] mySandbox[ResourceLoader|dependencies=mediawiki.util|default|rights=createpage]|mySandbox.js
[21:20:12] PROBLEM - MySQL Replication Heartbeat on db36 is CRITICAL: CRIT replication delay 299 seconds
[21:20:39] PROBLEM - MySQL Slave Delay on db36 is CRITICAL: CRIT replication delay 327 seconds
[21:25:57] heh.. it only goes to incubator is you are accessing via https!
[21:33:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:39:15] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[21:39:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.011 seconds
[21:46:06] I know that there is $wgRateLimits and that it is "still in use"
[21:46:28] array( 'edit' => array( 'anon' => null, // for any and all anonymous edits (aggregate)
[21:46:45] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , frwiktionary (25984)
[21:46:46] oops wrong
[22:12:24] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours
[22:15:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:23:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.019 seconds
[22:40:16] Nemo_bis: fyi, I did inform the students a priori to make accounts at home and some of them did, but some didn't, so they had to create them on the spot and in the first timeslot, everything was fine, but in the second one, 4 users hit the throttle, so I had to create accounts for them. which was fine, because it was scalable, but imagine if I had not told them to create accounts and I'd have to open 30 accounts at a time... :)
[22:43:24] I'm surprised no one is discussing the massive lag on db36. 5280 seconds!
[22:43:46] anomie: Exactly a mile ;)
[22:56:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:01:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.889 seconds
[23:08:05] db36 is at 6769 seconds lag now. Does anyone know what's going on with it?
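
The fragment quoted at 21:46 appears to come from core's DefaultSettings.php and shows the shape of the $wgRateLimits setting rillke was asking about. Each limit is an array of (number of actions, window in seconds), keyed by user class; a sketch follows with illustrative numbers, since the log does not show Wikimedia's production values:

    <?php
    // Shape of the $wgRateLimits setting quoted above. Each entry is
    // array( actions, seconds ); the numbers here are illustrative only.
    $wgRateLimits = array(
        'edit' => array(
            'anon'   => array( 8, 60 ),  // all anonymous edits, aggregated
            'ip'     => array( 8, 60 ),  // per IP address
            'newbie' => array( 8, 60 ),  // per recently created account
            'user'   => null,            // no limit for established accounts
        ),
    );

Hitting any of these produces the "You've exceeded your rate limit" message rillke mentions at 21:00, which is why the advice above is to issue requests serially and back off rather than fire many at once.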
[23:09:15] That's probably Asher messing with it
[23:09:19] See his post on wikitech-l today
[23:10:27] indeed
[23:10:52] It's killing any bot that uses maxlag. :(
[23:11:51] binasher: Can't we just comment out db36 from db.php then?
[23:11:53] wow. upload is being incredibly slow
[23:12:28] The way it is now, maxlag is broken, and maintenance scripts would be broken in a similar way
[23:12:46] RoanKattouw: well, don't do any maintenance :)
[23:12:57] RoanKattouw: then when the migration moves on to the next db, we'll be down two enwiki db's for some number of hours and maxlag will be broken again
[23:13:37] Oh it's a script?
[23:15:00] RoanKattouw: same as all the other 1.19 schema migrations (and 1.18, 1.17, etc)
[23:15:04] domas used to do it by hand
[23:15:32] so maybe just back to 1.17
[23:15:34] he probably used to open up the hard drives and flip bits with his laser eyes
[23:15:38] hehe
[23:15:54] I did it by hand once, it wasn't fun, so I wrote a script
[23:16:01] TimStarling: BTW, Diederik asked me to remind you that he'd like his udp-filter merge reviewed
[23:16:44] I'm pretty busy at the moment, with the upcoming security release
[23:17:01] Ah yes, OK
[23:17:02] maybe we need a schema migration flag that tells maxlag to ignore up to one lagged slave per cluster if there's at least one or two that aren't
[23:17:04] shit. I should have pushed that nginx logging fix while I was doing this video push
[23:32:49] gn8 folfs
[23:32:51] gn8 folks
[23:35:00] !log fixed a few files, on commons and other wikis, with empty oi_archive_name values even though the file was on NFS
[23:35:04] Logged the message, Master
[23:36:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:40:45] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.05151869919 (gt 8.0)
[23:43:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.039 seconds
[23:46:00] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 9.22160618321 (gt 8.0)
[23:51:10] !log awjrichards synchronizing Wikimedia installation... : Pushing changes to MobileFrontend per https://www.mediawiki.org/wiki/Extension:MobileFrontend/Deployments#19_March.2C_2012
[23:51:14] Logged the message, Master
[23:54:30] !log awjrichards synchronizing Wikimedia installation... : Redoing accidentally aborted scap, Pushing changes to MobileFrontend per https://www.mediawiki.org/wiki/Extension:MobileFrontend/Deployments#19_March.2C_2012
[23:54:33] Logged the message, Master
[23:55:17] RoanKattouw: http://commons.wikimedia.org/w/api.php?action=query&titles=File:Fuendetodos_-_Nevera_de_Culroya.JPG&prop=imageinfo&iilimit=5&iiprop=timestamp|comment|sha1|size
[23:55:19] :)
[23:55:51] * AaronSchulz feels like he is turning over a rock
[23:56:00] Whoa
[23:56:15] Same timestamp, SAME SHA-1, same dimensions, different file size?
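
The maxlag complaints above refer to the API convention where well-behaved clients pass maxlag=N on every request and stop editing while any replica reports more than N seconds of replication lag. A minimal sketch of such a client loop follows; the endpoint and the five-second retry interval are assumptions, not a quote of any particular bot:

    <?php
    // Minimal sketch of a maxlag-respecting API client: send maxlag=5 with
    // each request and back off while the API answers with the "maxlag"
    // error code. Endpoint and retry policy are illustrative.
    function apiGet( array $params ) {
        $params += array( 'format' => 'json', 'maxlag' => 5 );
        while ( true ) {
            $url  = 'https://en.wikipedia.org/w/api.php?' . http_build_query( $params );
            $data = json_decode( file_get_contents( $url ), true );
            if ( isset( $data['error']['code'] ) && $data['error']['code'] === 'maxlag' ) {
                sleep( 5 ); // a replica is lagged; wait and try again
                continue;
            }
            return $data;
        }
    }

    $siteinfo = apiGet( array( 'action' => 'query', 'meta' => 'siteinfo' ) );

With db36 thousands of seconds behind during the schema migration, a loop like this backs off indefinitely, which is what "killing any bot that uses maxlag" means in practice.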
[23:56:50] so I'm going through 341 commons file, thinking there was one bug case, but I'm seeing all kinds of different bugs
[23:57:03] heh
[23:57:13] the good news is that more stuff is recoverable than I thought
[23:57:24] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is -5.39289333333
[23:57:40] the ugly error case I thought covered most of the files only covers a small fraction
[23:58:21] RoanKattouw: we reeeally should make oldimage/image more like revision/page
[23:58:34] it would cut down on this kind of crap by reducing the amount of state
[23:59:21] Yeah we really should
[23:59:26] There's a branch that's rotting somewhere
[23:59:32] Timo, Bryan and I created it for WP10
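
For reference, the symptom AaronSchulz points out at 23:56 (two revisions of one file sharing a timestamp and SHA-1 but disagreeing on size) can be spotted from the same imageinfo query he pasted at 23:55. The checker below is hypothetical; the file name and revision limit are placeholders:

    <?php
    // Hypothetical checker for duplicate oldimage timestamps with differing
    // sizes, built on the imageinfo API query shown above.
    $url = 'https://commons.wikimedia.org/w/api.php?action=query'
         . '&titles=' . rawurlencode( 'File:Example.jpg' )
         . '&prop=imageinfo&iilimit=50&iiprop=timestamp|sha1|size&format=json';

    $data = json_decode( file_get_contents( $url ), true );
    $page = reset( $data['query']['pages'] );
    $seen = array();
    foreach ( isset( $page['imageinfo'] ) ? $page['imageinfo'] : array() as $rev ) {
        $ts = $rev['timestamp'];
        if ( isset( $seen[$ts] ) && $seen[$ts] !== $rev['size'] ) {
            echo "$ts: two revisions share a timestamp but differ in size\n";
        }
        $seen[$ts] = $rev['size'];
    }

The duplicated state between the image and oldimage tables is exactly what the 23:58 suggestion to model them like page and revision is meant to remove.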