[00:00:09] old news [00:00:15] Roan said those were harmless [00:00:19] noisy though [00:00:24] Yeah [00:00:32] I'll go through and clear out old paths in the DB [00:00:33] * AaronSchulz is piping through grep for php-1.19 [00:00:37] After we fully migrate to 1.19 [00:00:45] Also, 1.19 shouldn't throw those notices at all [00:00:50] we're getting about 20 per second [00:01:01] (And I'm very surprised those paths even manage to get in there, it's supposed to be impossible) [00:01:51] moar unit tests [00:02:19] real coders use C, do all the code in one go, and don't need tests [00:02:21] Is editing meant to be broken on MW wiki atm? [00:02:27] No? is it? [00:02:30] * AaronSchulz wishes he was that good [00:02:36] Cannot find section [00:02:36] You tried to edit a section that does not exist. It may have been moved or deleted while you were viewing the page. [00:02:46] Link? [00:03:00] PROBLEM - MySQL Replication Heartbeat on db50 is CRITICAL: CRIT replication delay 241 seconds [00:03:19] I just vandalised my user page fine [00:03:24] hmm seems to be a pile of old image description pages [00:03:27] PROBLEM - MySQL Slave Delay on db50 is CRITICAL: CRIT replication delay 270 seconds [00:03:37] Reedy: is that metaphysically possible? [00:04:00] http://www.mediawiki.org/wiki/Category:Images_with_unknown_copyright_status (at least that i've noticed) [00:04:45] so https://www.mediawiki.org/w/index.php?title=File:Flash_Video_Extension_Screenshot_with_explanations.png&action=edit [00:04:48] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out [00:05:23] well yes, that is one [00:05:24] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.876 seconds [00:05:24] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.883 seconds [00:05:26] What the hell [00:05:30] That's a bug for sure [00:06:05] &section=0 works [00:06:11] Makes you wonder what the default value is, if any [00:08:07] if ( $this->section != '' ) { [00:08:07] // Get section edit text (returns $def_text for invalid sections) [00:08:07] $text = $wgParser->getSection( $this->getOriginalContent(), $this->section, $def_text ); [00:09:18] PROBLEM - Lucene on search6 is CRITICAL: Connection timed out [00:09:18] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:09:18] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:09:51] but that error is shown whenever $this->textbox1 === false [00:11:37] var_dump( $page->getRevision()->getText() ); [00:11:39] bool(false) [00:12:00] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.631 seconds [00:12:00] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.625 seconds [00:12:05] same with $rev = Revision::newFromId( 114629 ); [00:12:11] though it gets the user name and such [00:13:12] RECOVERY - Lucene on search6 is OK: TCP OK - 8.992 second response time on port 8123 [00:14:06] DiffHistoryBlob::patch: incorrect base checksum [00:14:30] this is the bug apergos found [00:14:34] about a year ago [00:14:51] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [00:15:54] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:15:54] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:16:07] !log tstarling synchronized php-1.19/includes/HistoryBlob.php 'temp fix for checksum bug' [00:16:09] Logged the
message, Master [00:17:43] TimStarling: is there a bug #? [00:17:55] gn8 folks [00:18:16] no, I'm filing one now [00:18:57] at the time I was unhappy about apergos just hacking around it in the ugliest way possible, I thought he should have fixed it properly [00:19:41] are the line numbers in CodeReview new? [00:19:46] yeah [00:19:49] nice [00:19:57] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.672 seconds [00:19:57] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.699 seconds [00:20:01] something hashar added in relation to adding comments to lines [00:20:04] * robla pokes around for more substantial breakage [00:20:07] * AaronSchulz vaguely recalls this [00:20:28] we're going to miss being able to hack on our code review tool, aren't we? [00:20:28] PHP Fatal error: Cannot access protected property WikiPage::$mTouched in /home/wikipedia/common/php-1.19/includes/Article.php on line 1743 [00:20:30] robla: there's a few other stats things I added.... We can use them for like the next month we still have CR! [00:20:52] magic get [00:20:52] yum [00:21:18] PROBLEM - Lucene on search6 is CRITICAL: Connection timed out [00:22:12] AaronSchulz: probably saner just fixing the actual caller to get Touched [00:22:30] I couldn't find it yesterday [00:22:38] ah [00:22:40] * AaronSchulz will use the logs [00:22:53] Why have we got __get and __set again? :/ [00:23:46] are we ready to do some more? [00:24:25] !log aaron synchronized php-1.19/includes/Article.php [00:24:27] Logged the message, Master [00:24:38] Reedy: that b/c code [00:24:43] *that is [00:24:43] indeed [00:25:12] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:25:12] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:25:39] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [00:25:53] which one of these next? https://www.mediawiki.org/wiki/MediaWiki_1.19/Communications [00:26:15] RECOVERY - Lucene on search6 is OK: TCP OK - 0.002 second response time on port 8123 [00:26:16] let's just get strategywiki and usability out of the way now [00:26:36] and what the heck, throw in simplewiki and simplewiktionary [00:27:11] Need to drop hewikisource from the list [00:28:59] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: strategywiki, usabilitywiki, simplewiki and simplewiktionary to 1.19wmf1 [00:29:02] Logged the message, Master [00:29:03] well, I'm not thrilled about doing it ambassador free, but I think we should still do it [00:29:22] we need an RTL wiki [00:29:31] lemme try to scare up someone on hewiki [00:29:33] i wonder if there's anyone about in their irc channel [00:30:20] Reedy: I tried that...there's just one person [00:30:58] ask, um, asaf? 
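The fatal above ("Cannot access protected property WikiPage::$mTouched") comes out of the magic-getter back-compat layer being discussed: since the Article/WikiPage split, old code that reads fields like $article->mTouched is routed through __get() to the wrapped WikiPage, and a magic getter can only forward properties that are actually visible to it, so a protected WikiPage field still blows up. A minimal sketch of that kind of b/c shim, with an invented class name rather than the actual Article.php code:

    class PageCompatShim {
        /** @var WikiPage */
        protected $mPage; // the wrapped page object

        // Old callers still read $article->mFoo directly; forward the access.
        public function __get( $fname ) {
            if ( property_exists( $this->mPage, $fname ) ) {
                // Fatals if the property is protected on WikiPage, which is
                // exactly the error quoted above.
                return $this->mPage->$fname;
            }
            trigger_error( "Inaccessible property via __get(): $fname", E_USER_NOTICE );
            return null;
        }

        public function __set( $fname, $fvalue ) {
            if ( property_exists( $this->mPage, $fname ) ) {
                $this->mPage->$fname = $fvalue;
            } else {
                trigger_error( "Inaccessible property via __set(): $fname", E_USER_NOTICE );
            }
        }
    }

As noted in the channel, the saner fix is usually on the caller's side: use the public accessor instead of poking the field.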
[00:31:02] he lives upstairs [00:31:05] well he doesn't live there [00:31:10] but you know [00:31:48] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.865 seconds [00:31:48] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.863 seconds [00:35:05] * robla checks on staff channel [00:35:07] !log tstarling synchronized php-1.19/includes/HistoryBlob.php [00:35:10] Logged the message, Master [00:36:34] Reedy: haha, the mTouched bug is in robots.php [00:36:41] * AaronSchulz would not have guessed that [00:36:49] Niice [00:38:47] !log aaron synchronized live-1.5/robots.php 'fixed access to protected Page field' [00:38:49] Logged the message, Master [00:39:03] lol [00:40:59] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:41:08] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:42:47] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , enwiki (24029) [00:43:06] 24k, nice :) [00:43:12] jobs? [00:43:19] yeah [00:43:33] Are we solving US unemployment now? [00:43:37] Someone is updating templates again [00:44:15] why weren't we mentioned in the state of the union ? [00:44:25] !log starting to delete broken thumbnails from swift and squid. job running in a screen session on ms-fe1 [00:44:27] Logged the message, Master [00:44:32] * AaronSchulz is surprised how little error log spam there is for 1.19 [00:44:44] TimStarling: do you think it's reasonable to start in on squid purging now? [00:44:53] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.856 seconds [00:44:57] not enough spam? the server kittehs will starve! [00:45:02] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.530 seconds [00:45:13] AaronSchulz: anything exciting? [00:45:17] robla: purging what from squid? [00:45:27] maplebed wants to start his work [00:45:33] oh, thumbnails? [00:45:37] Reedy: aside from tuuuumbleweed!...no [00:45:40] yeah [00:45:45] TimStarling: digging swift for broken thumbnails and purging them from both swift and squid. [00:46:00] have you written the script already? [00:46:04] I have. [00:46:12] can I review it? [00:46:16] you want to take a look? [00:46:17] sure. [00:46:30] it's on ms-fe1 in ~root/purgebadimages/delete-stuff.py [00:46:31] public flogging has commenced [00:46:38] it'll be called from ./caller.py [00:46:45] delete-stuff is an excellent name for a script [00:46:50] to get 20 concurrent purgers, each working on one bucket. [00:46:54] werdna: inorite! [00:47:07] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: meta to 1.19wmf1 [00:47:09] Logged the message, Master [00:47:30] hrm [00:47:34] TimStarling: I warn you, it's ugly as shit. [00:47:42] and for that I apologize. [00:47:53] half way [00:48:16] function HTCPPurge( $url ) { [00:48:16] print "baz"; [00:48:56] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:49:05] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:49:09] :) [00:49:11] Helpful output is helpful [00:49:24] it looks slow [00:49:25] hey, it told me the thing was getting successfully called. [00:49:28] I can probably pull it. [00:49:30] TimStarling: it is slow. [00:49:32] how many images have we got to purge? 
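For the robots.php fix logged just above, the caller-side approach suggested earlier ("fixing the actual caller to get Touched") looks like the snippet below. This is a hypothetical before/after, assuming the script only wanted the page-touched timestamp; the actual robots.php diff isn't quoted here:

    // Before: reads a protected WikiPage field through the b/c magic getter,
    // which is what triggered the fatal on 1.19.
    $lastMod = $page->mTouched;

    // After: go through the public accessor instead.
    $lastMod = $page->getTouched();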
[00:49:42] !log aaron synchronized extract2.php 'fixed access to protected Page field' [00:49:44] Logged the message, Master [00:49:49] but given my test on one commons bucket, my estimate is it'll finish in 1.5 days. [00:50:45] !log dist-upgrading prototype.wikimedia.org [00:50:48] Logged the message, Master [00:51:24] I guess that's tolerable [00:51:34] TimStarling: the other thing that'll make it be fast enough is that since the buckets are all broken apart into shards, I'm gonig to run 20 copies simultaneously. [00:51:44] if it's too slow, I'll bump it to 30. or 50. [00:51:46] :P [00:51:57] ah right [00:52:08] (doing them serially would take weeks, it's true.) [00:52:32] RECOVERY - MySQL Replication Heartbeat on db50 is OK: OK replication delay 0 seconds [00:52:32] RECOVERY - MySQL Slave Delay on db50 is OK: OK replication delay 0 seconds [00:53:34] (that estimate came from a test yesterday where going through one commons bucket took about 4 hours and we have 256 commons buckets, 256 enwiki buckets, and then many smaller buckets.) [00:54:14] hm. I think I didn't do math right yesterday. [00:54:29] * maplebed sets it to run 30 copies. [00:54:44] at least there's no risk of overloading squid with a high rate of HTCP purges [00:56:08] it's also only 1.6% of thumbs that are affected, so I'll have to make 100 calls to swift for every 16 HTCP packets. [00:56:21] err.. 116 calls to swift for every 16 HTCP packets. [00:56:27] (100 stats and 16 deletes) [00:56:53] New patchset: Lcarr; "Adding mobile wap to redirect to new mobile site" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2620 [00:57:10] binasher: that look right for rewriting the url ? https://gerrit.wikimedia.org/r/#change,2620 [00:57:13] line 122, that should use the actual correct container name not a random shard, right? [00:57:20] * maplebed looks [00:57:32] PHP Fatal error: Allowed memory size of 125829120 bytes exhausted (tried to allocate 7864320 bytes) in /usr/local/apache/common-local/php-1.19/extensions/WikimediaMessages/WikimediaLicenseTexts.i18n.php on line 17315 [00:57:33] thanks! [00:57:36] piles of these [00:57:47] PROBLEM - MySQL Replication Heartbeat on db50 is CRITICAL: CRIT replication delay 204 seconds [00:57:47] PROBLEM - MySQL Slave Delay on db50 is CRITICAL: CRIT replication delay 206 seconds [00:57:59] AaronSchulz: that's a lot of lines [00:58:12] TimStarling: fixed. reload? [00:58:15] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2606 [00:58:15] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2606 [00:58:18] * AaronSchulz opens the file in NetBeans [00:58:39] TimStarling: "Opening the file could cause OutOfMemoryError, which.." [00:58:47] it's over 25,000 [00:58:56] lol [00:58:56] * AaronSchulz chuckles at his IDE's warning and opens anyway [00:59:37] file is 4.64MB [00:59:45] over 40k lines in WikimediaMessages alone [00:59:49] Just a bit ;) [01:00:13] just take it out of CommonSettings.php [01:00:27] LeslieCarr: that wouldn't send a redirect at all, but would internally rewrite the query to lang.mobile which then gets internally rewritten again to lang.wiki.. 
should work for serving the requested content under the original wap domain name [01:00:31] $wmgUseWikimediaLicenseTexts [01:00:47] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.547 seconds [01:01:02] oh [01:01:05] i think it's fine to do that [01:01:19] okay cool :) [01:01:23] but [01:01:39] Won't we have people complaining about the messages being missing? [01:01:41] i'm guessing there's a better way to do that :) [01:02:44] Reedy: which is worse, messages being missing or the site? [01:03:05] Depends who you ask ;) [01:03:08] maplebed: I see variable names pix, byts, pixes, sorted_byts [01:03:17] what to requests for en.wap look like that aren't for / [01:03:17] is this your own special language or some sort of regional dialect? [01:03:26] lol [01:03:32] I was feeling uninventive. [01:03:39] and wanted to write hard-to-read code. [01:03:44] that file went up from 4.0M to 4.6M this release cycle [01:04:12] I do actually usually choose better names (delete-stuff?) [01:04:19] just not this time. [01:04:44] !log reedy synchronized wmf-config/CommonSettings.php 'Disable WikimediaLicenseTexts for the time being' [01:04:47] Logged the message, Master [01:04:50] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:05:09] Reedy: you made the right choice Sam, you made the right choice :) [01:05:11] TimStarling: so... does it look ok to run? [01:05:18] so pix = width? [01:05:22] AaronSchulz: if enwiki ask, it wasn't me [01:05:22] binasher wap.wikipedia.org:80 208.80.152.75 - - [15/Feb/2012:21:29:51 +0000] "GET /images/81px-Wikipedia-logo.gif HTTP/1.0" 302 599 "http://aoliva.com/blog/2011/12/29/banco-de-imagenes-y-sonidos-del-ite/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.11) Gecko/20101013 Ubuntu/9.04 (jaunty) Firefox/3.6.11" or wap.wikipedia.org:80 208.80.152.47 - - [15/Feb/2012:22:20:42 +0000] "GET /wiki/A_Midsummer_Night's_Dream HTTP/1.0" 302 603 "-" [01:05:23] MCS 9.0" [01:05:34] pix is the ###px pulled out of teh URL - I think that's width? [01:05:49] people are using wap for image links on blogs? ugh [01:05:54] byt = file length? [01:05:57] yeah. [01:06:01] robla: shall we put enwikiquote and enwikibooks over? [01:06:02] LeslieCarr: do you see anything that looks like an article request [01:06:24] did we just disable license text for 1.18 and 1.19, or just 1.19? [01:06:33] * robla wants to understand that change some more [01:06:44] everywhere for the moment [01:06:46] binasher: " wap.wikipedia.org:80 208.80.152.88 - - [15/Feb/2012:22:24:01 +0000] "GET /wiki/%C3%89tats_pontificaux HTTP/1.0" 302 600 "-" "Sevenval FIT MCS 9.0" " [01:06:51] and sizes is also file length except indexed by width? [01:07:00] yup. [01:07:07] Eh [01:07:17] robla: it's disabled everywhere except testwiki and commons [01:07:26] i'll just turn it off on testwiki then [01:07:33] LeslieCarr: can you see if anything contains ?go= or &go= ? [01:07:36] binasher: relatively rare though, would adding a .* at the end and have it all redirect to the frontpage of X.mobile.wikipedia.org ? [01:07:52] what is sorted_byts? 
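The $wmgUseWikimediaLicenseTexts knob being discussed is the usual wmf-config pattern: a per-wiki flag declared in InitialiseSettings.php and tested in CommonSettings.php before the big WikimediaLicenseTexts.i18n.php file gets registered. A rough sketch of that pattern from memory; the exact wiring in the real config files isn't shown in the log:

    // InitialiseSettings.php: default plus per-wiki overrides, keyed by dbname.
    'wmgUseWikimediaLicenseTexts' => array(
        'default'  => true,
        'testwiki' => false, // switched off there while chasing the OOMs
    ),

    // CommonSettings.php: only register the ~4.6 MB message file where the flag is on.
    if ( $wmgUseWikimediaLicenseTexts ) {
        $wgExtensionMessagesFiles['WikimediaLicenseTexts'] =
            "$IP/extensions/WikimediaMessages/WikimediaLicenseTexts.i18n.php";
    }

As Tim points out a bit further down, enabling it on only some wikis splits the shared localisation cache, so a flag like this is only a stopgap.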
[01:08:07] no, you're only rewriting the Host: header of the request [01:08:14] binasher: wap.wikipedia.org:80 10.64.0.130 - - [15/Feb/2012:14:48:05 +0000] "GET /transcode.php?go=eduardo+cobi%C3%A1n+y+roffignac&seg=2&phpsessid=36df2daa30ee6686168b6eb99f687222 HTTP/1.0" 302 791 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; sbcydsl 3.12; YComp 5.0.0.0; yplus 5.1.02b)" [01:08:19] !log reedy synchronized wmf-config/CommonSettings.php 'Scrap that' [01:08:21] Logged the message, Master [01:08:40] !log reedy synchronized wmf-config/InitialiseSettings.php 'Just disable wmgUseWikimediaLicenseTexts for testwiki' [01:08:42] Logged the message, Master [01:09:02] you seem to be checking the file sizes sorted by file size against the file sizes sorted by width [01:09:17] TimStarling: After making a list of all sizes ordered by width, I create a copy of the list and sort it by sizes. If they're the same, all images with increasing width have increasing sizes and we're cool. If they're not the same, then a "larger" image has a smaller size meaning it's truncated. [01:09:38] you think this will be reliable? [01:09:42] it's not a perfect indicator but it's a decent heuristic. [01:09:46] what if there's only one thumbnail? [01:09:55] oh those crazy old transcode.php urls [01:09:56] 1.18wmf1 it is 4.15MB, 1.19wmf it's 4.52MB [01:09:59] or what if the smallest one is truncated? [01:10:00] I throw those out into a seprate file that I'll look at later. [01:10:05] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.260 seconds [01:10:10] if the smallest one is truncated it won't catch it. [01:10:33] LeslieCarr: i think your varnish change is good, just rewrite the comment above it s/Redirect/Rewrite/ [01:10:38] I'm ok only hitting a large majority since the problem can be fixed for an individual image by doing ?action=purge. [01:10:49] (hitting a large majority but not 100%) [01:11:20] what if the truncation happened towards the end of the file? [01:11:29] probably won't catch it. [01:11:43] we've seen a few files truncated at around 64KB [01:12:22] I'd really like to figure out the OOM and 500 thing [01:12:33] fwiw, this is the same algorithm that generated my "affects ~4.5% of all images, 1.6% of all thumbs" estimate. [01:12:40] we may need to scrub the rest of the deployment if we don't figure that out [01:13:11] so you don't have any figures on what percentage of errors this will actually catch? [01:13:12] well, WM is only going to be an issue when we want to do commons [01:13:25] because you've not written any script which compares the thumbnails against the source? [01:13:26] It's literallly not got enough memory to load in the file [01:13:30] New patchset: Lcarr; "Adding mobile wap to redirect to new mobile site" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2620 [01:13:34] TimStarling: that's correct. [01:13:50] why not just do a HEAD request to ms5 and check the Content-Length header? [01:14:08] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:14:18] What's the 500 thing? [01:14:28] binasher: ^^ [01:14:39] I'm happy to do that as a second pass; I'd like to get this started to at least start clearing out the ones I know are broken. [01:14:43] Reedy: Erik was getting 500 errors loading recent changes on meta [01:15:00] but i won't be able to do the HEAD to ms5 comparison and get it running before I leave for the day. [01:15:05] Oh, didn't see anyone say that.. [01:15:07] you mean swift is still enabled in squid? 
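The heuristic maplebed describes above is purely ordinal: for each original image, list its thumbnails by pixel width, then check whether sorting the same list by byte size changes the order. If it does, some "larger" thumbnail is smaller on disk than a narrower one and is assumed truncated. The real script is the Python delete-stuff.py on ms-fe1; the fragment below is just a small PHP illustration of the same check, with an invented function name:

    /**
     * $thumbs maps thumbnail width (px) => size (bytes) for one original image.
     * Returns true when byte size is not monotonic in width, i.e. some wider
     * thumbnail is suspiciously small and is presumed truncated.
     */
    function looksTruncated( array $thumbs ) {
        if ( count( $thumbs ) < 2 ) {
            return false; // nothing to compare for a lone thumbnail (Tim's objection)
        }
        ksort( $thumbs );                   // order by width
        $byWidth = array_values( $thumbs ); // byte sizes in width order
        $bySize = $byWidth;
        sort( $bySize );                    // byte sizes in size order
        return $byWidth !== $bySize;
    }

As discussed, this misses single-thumbnail cases, a truncated smallest thumbnail, and truncation near the end of a file; the HEAD-to-ms5 Content-Length comparison Tim suggests would catch those at the cost of one extra request per thumbnail.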
[01:15:13] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2620 [01:15:14] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2620 [01:15:17] no, but broken images are still cached in qsuid. [01:15:28] robla: do you know where/what he was doing? [01:15:39] squid is currently backed by ms5, so broken images whose squid caches expire will come back ok, [01:15:43] looking at Special:RecentChanges :) [01:15:51] I can't repro, for what it's worth [01:15:51] but those expiration times are potentially as large as 10 day.s [01:16:06] Ahh [01:16:10] robla: try loading 500 RC items [01:16:24] loading 500 makes an error 500 [01:16:26] * robla does [01:16:26] how fitting [01:16:35] 100 is ok, 250 isn't [01:17:20] yup [01:17:29] well, I think it should be ok to run this until you write a better script [01:17:50] ok. [01:18:09] if you replaced the shelling out to the swift client with a proper swift client library, and did the HEAD request to ms5, it should be faster overall as well as giving more results [01:18:11] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.373 seconds [01:18:35] I'm a bit nervous about calling out to ms5 for every thumbnail; it'd be pretty easy to overwhelm ms5. [01:18:51] I see memory errors on 1.19 for wmlicensestext still [01:18:54] *oom [01:19:34] it's still in ExtensionMessages-1.19.php [01:19:50] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.946 seconds [01:20:06] ahh [01:20:26] I'm still unclear on the effect of disabling wmlicensestext [01:21:38] ha [01:21:55] you can't disable it on all wikis except commons, it'll split the l10n cache [01:22:12] it'll cause constant refreshing of the cache file [01:22:14] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:22:19] ok, it's running. [01:23:27] anyway I imagine disabling it will break commons horribly [01:23:43] removing all non-english messages would probably be better [01:24:05] Even whilst commons is still on 1.18? [01:24:20] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:24:56] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.831 seconds [01:25:09] it'll only break the 1.18 wikis [01:25:32] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [01:27:56] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:28:49] !log tstarling synchronized php-1.19/extensions/WikimediaMessages/WikimediaLicenseTexts.i18n.php [01:28:50] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:28:51] Logged the message, Master [01:29:27] !log tstarling synchronized wmf-config/CommonSettings.php 're-enabled WikimediaLicenseTexts' [01:29:29] Logged the message, Master [01:29:47] Query: DELETE FROM `msg_resource` [01:29:48] Function: MessageBlobStore::clear [01:29:50] Error: 1213 Deadlock found when trying to get lock; try restarting transaction (10.0.6.47) [01:30:55] meh, transient [01:31:00] never mind, I'm wrong [01:31:08] ;) [01:31:36] !log distupgrading formey [01:31:39] Logged the message, Master [01:32:20] TimStarling: it looks like it's purging about 3 objects per second. 
[01:32:33] I feel like that's a little higher than I'd like [01:32:33] AaronSchulz: fixing that problem is what extension-list is for [01:33:20] !log tstarling synchronized wmf-config/InitialiseSettings.php [01:33:22] http://ganglia.wikimedia.org/latest/graph_all_periods.php?m=swift%20object%20change&z=small&h=Swift%20pmtpa%20prod&c=Swift%20pmtpa&r=hour <-- is measured in deletions per 30s period, so divide by 30 to get deletes per second [01:33:22] Logged the message, Master [01:34:00] we're about to have a short svn downtime [01:34:14] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.483 seconds [01:34:32] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.402 seconds [01:34:33] maplebed: squid can handle hundreds of deletions per second [01:34:33] is that going to severly affect anyone? [01:34:41] but can ms5? [01:34:43] Ryan_Lane: for 1 min or so? [01:34:50] sure. let me know when it's ok [01:34:52] err [01:34:56] yeah, just a reboot [01:34:57] you're worried about a reduced hit rate? [01:35:02] Ryan_Lane: ok [01:35:06] yeah, ms5 should be able to, since it's handling 60qps, this would bump it to 63qps. [01:35:08] that ok? [01:35:08] TimStarling: yeah. [01:35:16] ok. [01:35:22] !log rebooting formey [01:35:23] Ryan_Lane: sure, be quick :) [01:35:25] Logged the message, Master [01:35:40] that's not how it works... [01:36:09] Ryan_Lane: this isn't good timing [01:36:19] TimStarling: I'm assuming the worst case, where every image I delete is immediately requested [01:36:24] well, that's why I was asking.... [01:36:48] probably won't be during our deploy window [01:37:23] PROBLEM - Host formey is DOWN: CRITICAL - Host Unreachable (208.80.152.147) [01:38:11] and it's back [01:38:15] formey is back [01:38:17] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:38:20] maplebed: just run it and watch the hit rate [01:38:26] RECOVERY - Host formey is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [01:38:30] it hasn't changed appreciably yet. [01:38:35] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:38:40] robla: I thought this was inbetween deployment windows... [01:38:43] robla: meh, t'was quick [01:39:03] I think a better model would be to consider what percentage of the cache you're deleting and how that will affect hit rate [01:39:14] to be fair, if the dist-upgrade went bad, it could have been longer [01:39:34] Ryan_Lane: hmm, I thought you were just rebooting [01:39:53] AaronSchulz: I had already done the dist-upgrade [01:40:04] it didn't affect any packages that svn relied on [01:40:09] I was thinking we had a longer window scheduled today. well, live and learn [01:40:36] * AaronSchulz sees 3-7 in his calender [01:40:44] the cache has maybe 97% of the images [01:40:45] hm [01:40:49] I only put 3-5 on http://wikitech.wikimedia.org/view/Software_deployments [01:40:49] ah well [01:41:05] if you delete 1% of the cache, it will have say 96% of the images [01:41:06] robla: and there you have it :) [01:41:08] am I missing this calendar? [01:41:22] occupy the cache! [01:41:25] the WMF Engineering calendar [01:41:28] can someone add me to it? [01:41:30] so the miss rate would go from 3% to 4% and you'd have a 33% increase in backend traffic [01:41:42] oh. I'm on it [01:41:55] crap. 
it does indeed say 3-7 [01:41:56] sorry [01:42:33] werdna: I could make a size comment about that… [01:43:00] no prob [01:43:25] 93.97% 15.347369 2 - query-m: INSERT IGNORE INTO `msg_resource` (mr_lang,mr_resource,mr_blob,mr_timestamp) VALUES ('X') [01:43:35] http://meta.wikimedia.org/w/index.php?title=Special:RecentChanges&limit=120&forceprofile=true [01:44:06] maplebed: your estimate is that you're deleting 1.5% of thumbnails [01:44:17] right. [01:44:31] I see how that logic works. [01:44:42] ms5 should be able to cope with that almost instantaneously [01:44:48] I agree. [01:44:54] maplebed: tim's logic always works :) [01:45:02] well, I sent out email with instructions on how to kill it if shit hits the fan. [01:45:05] just in case. [01:46:38] I imagine these 1.5% don't get heavy traffic, or else they would have already been purged [01:47:26] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 5.538 seconds [01:48:04] aaron cleared profiling data [01:48:09] is meta the only one with broken RecentChanges? [01:48:12] seemingly [01:48:22] Is it worth creating a symlink to reduce the spam of all these RL stat calls? [01:48:33] Reedy: I noticed 250 on meta actually works [01:48:42] Reedy: not really, use grep :) [01:49:02] AaronSchulz: grep for what where? [01:49:07] 250 doesn't for me [01:49:13] to filter out spam [01:49:14] oh, you mean filter the log [01:49:15] lol [01:50:02] finding out what is causing recentchanges to 500 would be saner.. [01:50:35] RECOVERY - MySQL Slave Delay on db50 is OK: OK replication delay 0 seconds [01:50:44] RECOVERY - MySQL Replication Heartbeat on db50 is OK: OK replication delay 0 seconds [01:51:03] Reedy: lots of missing profileout calls, sigh [01:51:28] Not suprised [01:51:38] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:52:10] AaronSchulz: do we get any useful info as to where they are? [01:52:50] PROBLEM - MySQL Replication Heartbeat on db12 is CRITICAL: CRIT replication delay 196 seconds [01:54:11] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.382 seconds [01:55:01] * Reedy tries Platonides' script [01:55:32] RECOVERY - MySQL Replication Heartbeat on db12 is OK: OK replication delay 0 seconds [01:55:59] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 613s [01:56:17] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 628s [01:56:45] Aha [01:56:46] Hits [01:57:59] static function gender( $parser, $username ) { [01:58:13] by manual inspection [01:59:04] !log aaron synchronized php-1.19/includes/parser/CoreParserFunctions.php 'fixed profiling calls' [01:59:06] Logged the message, Master [01:59:21] aaron cleared profiling data [02:00:48] 80.92% 8.379425 6 - LocalisationCache::recache [02:00:56] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:01:00] Reedy: why would that get called 6 times? [02:01:14] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.744 seconds [02:01:59] PROBLEM - MySQL Slave Delay on db12 is CRITICAL: CRIT replication delay 216 seconds [02:02:08] PROBLEM - MySQL Replication Heartbeat on db12 is CRITICAL: CRIT replication delay 223 seconds [02:03:38] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.884 seconds [02:03:55] AaronSchulz: does it happen every time? 
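The "missing profileout calls" behind the CoreParserFunctions sync above are the classic wfProfileIn()/wfProfileOut() pairing bug: a function opens a profiling section and then returns early on some branch without closing it, which skews the per-function numbers. A generic sketch of the failure mode and the fix (not the actual gender() code; lookupSomething() is a made-up helper):

    function doSomething( $parser, $username ) {
        wfProfileIn( __METHOD__ );
        if ( $username === '' ) {
            return ''; // BUG: early return leaves the profiling section open
        }
        $result = lookupSomething( $username );
        wfProfileOut( __METHOD__ );
        return $result;
    }

    // Fixed: every return path closes the section it opened.
    function doSomethingFixed( $parser, $username ) {
        wfProfileIn( __METHOD__ );
        $result = ( $username === '' ) ? '' : lookupSomething( $username );
        wfProfileOut( __METHOD__ );
        return $result;
    }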
[02:04:27] not always [02:04:43] might be a certain entry triggering it [02:05:04] that falls off on refresh (since I have a "top X" query param) [02:05:08] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:06:54] what's a good way to get the version in StartProfiler.php? [02:06:55] !log reedy synchronized php-1.19/includes/filerepo/backend/FileBackend.php 'Add wfProfileOut( __METHOD__ )' [02:06:57] Logged the message, Master [02:07:10] we should have separate profiling sections in http://noc.wikimedia.org/cgi-bin/report.py for different versions [02:08:53] $IP? [02:10:45] robla: did you notice the new "followed-up revisions"? [02:11:10] * robla looks [02:11:19] !log tstarling synchronized wmf-config/StartProfiler.php 'split out 1.19' [02:11:21] Logged the message, Master [02:11:33] tstarling cleared profiling data [02:12:04] Reedy: that's convenient [02:12:30] yeah [02:12:53] right, that file is a decoy... [02:13:23] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:13:23] it's a trap! [02:13:26] robla: also check stats, fixme/new revisions for paths, in this case just /trunk/phase3 [02:14:17] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:14:40] !log tstarling synchronized php-1.19/StartProfiler.php [02:14:43] Logged the message, Master [02:14:53] uhh... [02:15:02] yeah ok [02:15:10] !log tstarling synchronized php-1.19/StartProfiler.php [02:15:13] Logged the message, Master [02:15:39] for the record, it was only the 1.19 wikis that I broke [02:15:50] db12 is lagging by over 1000 [02:15:52] heh, yup, i checked [02:15:57] Due to high database server lag, changes newer than 1,033 seconds may not appear in this list. [02:16:00] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [02:16:10] I wonder if that's related to those msg queries [02:16:10] Betacommand: it's doing schema changes [02:16:19] or if it's just binasher [02:16:37] an enwiki slave is getting migrated [02:16:43] ah then, nvm [02:16:56] is that what's causing the upward spiral of database lag, binasher? [02:16:59] ah, db12 where all the recent change queries go [02:17:14] and watchlist, and contribs... [02:17:34] How long will the update take? [02:17:34] BarkingFish: yup [02:17:41] it's enwiki, a long time ;) [02:17:47] ok second attempt [02:17:50] Nascar1996: which update? [02:17:57] Reedy: Ok then, it's just people mentioning the lag going well over 1000 seconds now [02:18:05] MediaWiki 1.19 [02:18:07] !log tstarling synchronized php-1.19/StartProfiler.php [02:18:08] but if it's a biggy, that's no biggy :) [02:18:09] Logged the message, Master [02:18:23] !log LocalisationUpdate completed (1.18) at Thu Feb 16 02:18:23 UTC 2012 [02:18:26] Logged the message, Master [02:18:26] where is that lag visible? [02:18:32] http://meta.wikimedia.org/wiki/Special:Log/translationreview?limit=10&forceprofile=true [02:18:34] watchlist [02:18:34] 95.86% 13.298977 8 - LocalisationCache::recache [02:18:38] !log tstarling synchronized php-1.18/StartProfiler.php [02:18:40] Logged the message, Master [02:18:41] TimStarling: I think it's logging [02:18:50] * AaronSchulz tries to narrow down [02:19:01] tstarling cleared profiling data [02:19:15] binasher: as we realised it has no extra covering indexes.. can we point stuff at another box [02:19:16] ? [02:19:32] TimStarling: any chance 1.18 wikis can still log profiling data to "all" ? 
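For the StartProfiler.php "split out 1.19" change just synced, the $IP suggestion works because each deployment branch lives in its own directory (php-1.18, php-1.19), so the branch can simply be read off the include path and used to pick a separate profiling bucket for report.py. A hedged sketch of the idea; the option names below are illustrative, not the real wmf-config contents:

    // StartProfiler.php (sketch): choose a profiling id per deployment branch.
    if ( strpos( $IP, 'php-1.19' ) !== false ) {
        $wgProfiler['profileID'] = '1.19';
    } else {
        $wgProfiler['profileID'] = 'all'; // keep 1.18 wikis feeding the existing graphs
    }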
[02:19:55] what's wrong with 1.18? [02:19:57] Reedy: yes, but then the other box will start migrating in a few hours [02:20:05] New patchset: Pyoungmeister; "new logrotate for search nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2619 [02:20:22] TimStarling: the "all" db is what gets fed into graphite [02:20:24] Well, if we ask nicely, Tim might move it back when db12 has finished [02:20:43] ok [02:21:10] !log tstarling synchronized php-1.18/StartProfiler.php [02:21:12] Logged the message, Master [02:21:21] what am I moving? [02:21:53] New patchset: Pyoungmeister; "new logrotate for search nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2619 [02:22:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2619 [02:22:18] * AaronSchulz blames "Translation review log" [02:22:19] TimStarling: making sure enwiki isn't using a really lagged slave for watchlists et al [02:22:28] AaronSchulz: for RC? [02:22:33] after db12 has finished migrating [02:22:36] TimStarling: if i move all of the groupLoadsByDB db's now pointing to db12 to another db, can you move it back when the migrations move on to whatever i move it to? [02:22:48] sure [02:23:31] i'll move them to db53 which is last in the s1 array.. the migrations might not get there before i log in tomorrow morning [02:23:44] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.256 seconds [02:24:02] where's the log from the schema change script? [02:24:05] the enwiki migrations are skipping the revision table alter for now though, so they may not take too long [02:25:35] !log asher synchronized wmf-config/db.php 'moving watchlist etc from db12 to db53' [02:25:37] Logged the message, Master [02:26:46] Nascar1996: BarkingFish should be somewhat better now [02:27:15] ok, I'll see [02:27:38] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:28:44] TimStarling: you can tail /home/asher/db/119-migration/coredbs-1.out [02:28:58] New patchset: Pyoungmeister; "new logrotate for search nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2619 [02:29:10] ok [02:29:27] * AaronSchulz reads formatTranslationreviewLogEntry [02:29:50] thanks Reedy, I'm out for now. [02:30:00] Look forward to seeing the finished article in the morning :) [02:30:01] return wfMessage( 'logentry-translationreview-message' )->params( [02:30:02] '', // User link in the new system [02:30:02] night [02:30:04] '#', // User name for gender in the new system [02:30:05] Message::rawParam( $link ) [02:30:07] )->inLanguage( $language )->text(); [02:30:08] ok, that is scary [02:30:23] cheers binasher [02:30:25] BarkingFish ruined the code-flood [02:30:38] combobreaker [02:32:12] !log asher synchronized wmf-config/StartProfiler.php 'setting 1.18 wiki profiling id to all' [02:32:14] Logged the message, Master [02:32:52] Aaron's going to try something [02:33:21] he thinks that the logging of translation messages is the problem where someone works in a bunch of different languages [02:34:07] well, maybe he's capitulating... 
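The db.php change above is a query-group move: in the WMF load-balancer config, groups such as watchlist, recentchanges and contributions can be pinned to specific slaves per wiki via groupLoadsByDB, and here they were pointed away from db12 (mid-migration) to db53. A rough sketch of what such a stanza looks like; the weights and exact group list are assumed rather than copied from the real file:

    // wmf-config/db.php (sketch): route expensive query groups for enwiki
    // to a specific slave instead of the general pool.
    'groupLoadsByDB' => array(
        'enwiki' => array(
            'watchlist'     => array( 'db53' => 1 ), // previously db12
            'recentchanges' => array( 'db53' => 1 ),
            'contributions' => array( 'db53' => 1 ),
        ),
    ),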
[02:34:30] !log LocalisationUpdate completed (1.19) at Thu Feb 16 02:34:30 UTC 2012 [02:34:32] Logged the message, Master [02:35:00] I'll explain what he thought: he thought that it was doing a recache per language [02:35:19] (or rather, my muddy explanation of what he tried to explain to me) [02:35:43] so the code I posted above isn't the scary code, it's the code below that [02:35:57] after the 'if ( $action === 'group' ) {' [02:37:06] no one is playing with en.wiki at the moment are they? http://pastie.org/private/fnux8rt1koq1docgtc8g [02:38:51] oh the sandbox thing is a gadget by the looks [02:39:13] p858snake|l: there's also db migration work going on. [02:39:18] binasher: ^ [02:39:26] I'm getting a block option when reverting http://i638.photobucket.com/albums/uu105/Busabout/Untitled-1.png First time I've got it and I'm not a Sysop [02:39:47] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.441 seconds [02:41:09] Bidgee: on enwiki? [02:41:24] yer [02:41:54] No code has been changed on enwiki... [02:42:15] TimStarling: also bugstatus gadget got updated and bit reworked (and reenabled) if you havn't gotten to look at it but not as default currently [02:42:17] Strange. [02:42:53] * AaronSchulz blames the Translate extension hook translateMessageDocumentationLanguage() [02:43:32] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:44:11] Bidgee: did you click the block button? [02:44:26] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:44:35] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:46:56] !log aaron synchronized php-1.19/extensions/Translate/TranslateHooks.php 'live hack to deal with 500s on log/RC views' [02:46:59] Logged the message, Master [02:48:11] \o/ [02:49:41] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.916 seconds [02:50:59] edit toolbar gone on meta? [02:51:15] fine for me [02:51:28] checked javascript console? [02:53:44] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:54:28] Reedy: d'oh, I probably should have done that. I've got it fixed [02:54:34] heh [02:54:50] I had to disable, then reenable [02:54:54] in prefs [02:56:17] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.652 seconds [02:58:11] robla: have all stage 1 wikis been pushed? [02:58:20] no [02:58:20] not yet [02:58:30] we should probably end this now though [02:58:54] I was just about to call it a night and go to bed now [02:59:02] * AaronSchulz is getting twitchy and scatterbrained [02:59:10] yeah, ok...let's reschedule the remaining ones [02:59:23] maybe we'll lump them in with commons [03:00:05] glad we got as far as we did, though! [03:00:11] aaron cleared profiling data [03:00:22] I was worried we'd have to revert on meta [03:01:35] Could almost just do the enwikibooks/enwikiquote tomorrow as I'm going to be around for most of the day [03:01:47] doing non english ones is not so simple [03:02:32] If there's anything that still needs dealing with later, if someone can ping/email me, and I'll see about taking care of it post sleep [03:02:43] ok, thanks Reedy! [03:03:36] TimStarling: everything look good to you? 
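The working theory above for the 500s on Meta's RecentChanges and log views: each translation-review log line was formatted with ->inLanguage( $language ) for the language of the reviewed message, so one page of entries touching many languages makes the LocalisationCache load (and, on a miss, recache) every one of those languages in a single request, which matches the profiler showing LocalisationCache::recache called 6-8 times. A simplified illustration of the pattern, not the actual Translate code; the field names are invented:

    foreach ( $logEntries as $entry ) {
        $lang = $entry['messageLanguage']; // 'de', 'he', 'fr', ... varies per entry
        // Every distinct $lang here can pull in (or rebuild) a separate l10n cache entry.
        $out .= wfMessage( 'logentry-translationreview-message' )
            ->params( '', '#', Message::rawParam( $entry['link'] ) )
            ->inLanguage( $lang )
            ->text();
    }
    // One possible mitigation (whether the live hack did exactly this isn't shown in
    // the log): render the line in the page's own language and merely name the
    // reviewed language, so a single l10n cache entry serves the whole page.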
[03:04:14] yes [03:04:23] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:04:59] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 5.572 seconds [03:09:12] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:10:32] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.927 seconds [03:11:42] TimStarling: now I wonder....was WikimediaLicenseTexts a red herring, or was that something we also needed to do? [03:12:16] I removed all the non-english translations from it [03:12:44] we'll be hearing more about that when we deploy to commons [03:13:33] I don't remember, though, were we trying to solve the problem that Aaron just fixed? or was that another unrelated problem? [03:13:51] that was another problem, I think [03:14:26] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:18:31] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.433 seconds [03:20:08] why does MW1.19 only show "undo" buttons next to edits without edit summaries? [03:20:17] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.360 seconds [03:21:10] file a bug [03:21:24] okay [03:25:12] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:26:08] seems fine for me: http://www.mediawiki.org/w/index.php?title=User_talk:Dantman/Analytics_integration&curid=79056&diff=500043&oldid=494636 vs http://www.mediawiki.org/w/index.php?title=Mobile_Full_Screen_Search_Results&curid=79482&diff=500040&oldid=498045 [03:27:06] I mean on history pages [03:27:35] see http://www.mediawiki.org/w/index.php?title=Mobile_Full_Screen_Search_Results&action=history [03:28:12] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:38:42] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.886 seconds [03:38:42] RECOVERY - MySQL Replication Heartbeat on db12 is OK: OK replication delay 0 seconds [03:38:51] RECOVERY - MySQL Slave Delay on db12 is OK: OK replication delay 0 seconds [03:41:15] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 195 seconds [03:42:09] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 196 seconds [03:42:45] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:43:21] !log tstarling synchronized wmf-config/db.php 'restored db12 in query groups now that the schema changes have finished' [03:43:24] Logged the message, Master [03:45:45] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.149 seconds [03:46:39] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.472 seconds [03:48:39] Is it known that the IRC feed when doing blocks displays the blocker's name twice? [03:49:48] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:49:56] not surprising [03:50:31] file a bug [03:50:51] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:52:21] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [03:56:24] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [03:56:33] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:57:45] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [03:58:21] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [04:00:09] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.563 seconds [04:03:18] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.258 seconds [04:04:12] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:07:21] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:14:59] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.341 seconds [04:16:11] !log tstarling synchronized php-1.18/StartProfiler.php [04:16:13] Logged the message, Master [04:16:16] tstarling cleared profiling data [04:17:05] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.385 seconds [04:17:26] tstarling cleared profiling data [04:20:50] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:21:08] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:29:45] !log tstarling synchronized wmf-config/CommonSettings.php [04:29:47] Logged the message, Master [04:29:49] tstarling cleared profiling data [04:30:17] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours [04:46:20] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:48:53] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [04:49:11] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.296 seconds [04:51:17] PROBLEM - Puppet freshness on ganglia1001 is CRITICAL: Puppet has not run in the last 10 hours [04:52:56] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [04:53:32] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [04:54:35] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:54:41] I can't load any wiki [04:55:56] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:56:05] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:57:08] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [04:57:17] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time [05:00:35] PROBLEM - MySQL Slave Delay on db38 is CRITICAL: CRIT replication delay 196 seconds [05:01:20] PROBLEM - MySQL Replication Heartbeat on db38 is CRITICAL: CRIT replication delay 241 seconds [05:06:44] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.554 seconds [05:11:00] !log on db40: truncating a few shards to free up space for the OS [05:11:02] Logged the message, Master [05:11:06] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:19:02] !log truncated pc000, pc001, pc002, pc003 [05:19:04] Logged the message, Master [05:21:30] !log on hume: running mwscript purgeParserCache.php --wiki=enwiki --age=7776000 [05:21:32] Logged the message, Master [05:23:15] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.704 seconds [05:26:33] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket 
timeout after 10 seconds. [05:27:18] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:29:06] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [05:53:15] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.856 seconds [05:57:18] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:06:57] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.843 seconds [06:09:12] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:09:39] RECOVERY - MySQL Replication Heartbeat on db38 is OK: OK replication delay 0 seconds [06:09:57] RECOVERY - MySQL Slave Delay on db38 is OK: OK replication delay 0 seconds [06:09:57] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.733 seconds [06:10:24] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [06:13:42] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:14:00] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:21:30] PROBLEM - MySQL Replication Heartbeat on db52 is CRITICAL: CRIT replication delay 213 seconds [06:21:39] PROBLEM - MySQL Slave Delay on db52 is CRITICAL: CRIT replication delay 221 seconds [06:31:29] Namespace filtering in RC isn't working for the Mediawiki namespace [06:31:35] (MW 1.19) [06:31:47] It's called "MediaWiki" [06:31:53] Not Mediawiki [06:32:05] https://meta.wikimedia.org/w/index.php?namespace=8&tagfilter=&translations=filter&limit=250&title=Special%3ARecentChanges [06:32:28] Oh, nevermind [06:32:32] It shows now.. [06:33:04] It doesn't for me [06:33:28] you need to not to filter the translations [06:33:30] p858snake|l: Because you have to do "No action" for the Filter translations. [06:33:33] yeah [06:38:48] RECOVERY - MySQL Slave Delay on db52 is OK: OK replication delay 0 seconds [06:38:48] RECOVERY - MySQL Replication Heartbeat on db52 is OK: OK replication delay 0 seconds [06:39:48] uh [06:42:51] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 245 seconds [06:43:00] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 254 seconds [06:49:00] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:50:12] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [06:50:17] Dear Wiki technicians., I work for sa-ws and there is no user account creation log displayed on the RC page., [06:50:29] can u help me out? [06:54:09] New review: Hashar; "Thanks Leslie!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2606 [07:01:09] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:03:42] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [07:36:18] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [07:36:18] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [07:53:33] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [08:14:37] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:15:49] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [09:02:37] I got an error while uploading on Commons [09:02:38] :( [09:04:33] http://i638.photobucket.com/albums/uu105/Busabout/Error.png [09:21:20] Bidgee: still around? [09:34:09] Yer [09:35:30] Bidgee: have you opened a bug for http://i638.photobucket.com/albums/uu105/Busabout/Error.png ? [09:35:34] if not , I will open one [09:36:25] have you reproduced the issue? [09:37:47] Bidgee: if you still have the error window, can you copy paste it to http://dpaste.org/ so I can copy it? Thanks! [09:40:50] So far so good. [09:41:01] I didn't copy the page, sorry. :( [09:41:05] it is ok :) [09:41:22] did you get an error when retrying? [09:42:59] Bidgee: also do you remember the full filename you tried to upload? [09:46:08] !log hashar synchronized wmf-config/swift.php 'Add a wfDebugLog call for bug 34440: swift list_objects giving InvalidResponseException' [09:46:11] Logged the message, Master [09:48:26] File:Union Club Hotel in 2004.jpg [09:48:46] No error on the retry [09:49:48] !log on db40: truncated pc004, pc005, pc006, pc007 [09:49:51] Logged the message, Master [09:50:25] Bidgee: I have opened a bug report ( https://bugzilla.wikimedia.org/34440 ) [09:50:41] Bidgee: our swift guru will probably have a look at it whenever he wake up (he is in the US) [09:51:18] Bidgee: thanks for reporting! [09:54:55] When you want the message again, you never get it! :P [09:55:17] So far the other uploads have uploaded without any issue [10:07:36] hey [10:07:47] ipv6 [10:07:58] are we implementing it anytime soon> [10:14:38] Steven_Zhang, it depends on what you mean [10:14:53] Steven_Zhang, http://wikitech.wikimedia.org/view/IPv6_deployment [10:15:08] well someone wants to know if we're participating in the july launch [10:15:15] bleh [10:15:17] gtg [10:40:42] PROBLEM - LVS Lucene on search-pool2.svc.pmtpa.wmnet is CRITICAL: Connection timed out [10:44:27] RECOVERY - LVS Lucene on search-pool2.svc.pmtpa.wmnet is OK: TCP OK - 0.003 second response time on port 8123 [10:49:08] New patchset: Catrope; "Define BINDIR in purge-checkuser" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2621 [10:49:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2621 [10:58:18] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2621 [10:58:19] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2621 [11:02:54] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:04:06] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [11:21:18] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 120 MB (1% inode=62%): /var/lib/ureadahead/debugfs 120 MB (1% inode=62%): [11:25:30] RECOVERY - Disk space on srv224 is OK: DISK OK [12:18:09] PROBLEM - Host lily is DOWN: PING CRITICAL - Packet loss = 100% [13:04:25] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 189 MB (2% inode=62%): /var/lib/ureadahead/debugfs 189 MB (2% inode=62%): [13:09:58] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 194 MB (2% inode=62%): /var/lib/ureadahead/debugfs 194 MB (2% inode=62%): [13:11:19] RECOVERY - Disk space on srv223 is OK: DISK OK [13:25:47] !log Running cleanupUploadStash.php across all wikis [13:25:50] Logged the message, Master [13:37:15] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2616 [13:37:15] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2616 [13:37:39] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2619 [13:37:39] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2619 [13:53:36] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [13:57:39] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [13:59:36] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [14:23:17] !log hello world [14:23:19] Logged the message, Master [14:25:33] PROBLEM - Host search1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:26:06] !log rebooting search1003 [14:26:08] Logged the message, Master [14:30:39] PROBLEM - Puppet freshness on search1002 is CRITICAL: Puppet has not run in the last 10 hours [14:30:57] RECOVERY - Host search1003 is UP: PING OK - Packet loss = 0%, RTA = 30.88 ms [14:33:39] PROBLEM - SSH on search1003 is CRITICAL: Connection refused [14:33:57] PROBLEM - RAID on search1003 is CRITICAL: Connection refused by host [14:33:57] PROBLEM - DPKG on search1003 is CRITICAL: Connection refused by host [14:34:24] PROBLEM - Disk space on search1003 is CRITICAL: Connection refused by host [14:37:42] PROBLEM - Lucene on search1003 is CRITICAL: Connection refused [14:43:15] PROBLEM - Disk space on mw40 is CRITICAL: DISK CRITICAL - free space: /tmp 60 MB (3% inode=87%): [14:44:18] RECOVERY - SSH on search1003 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:50:04] RECOVERY - Disk space on mw40 is OK: DISK OK [14:52:19] PROBLEM - Puppet freshness on ganglia1001 is CRITICAL: Puppet has not run in the last 10 hours [15:02:48] New patchset: Pyoungmeister; "usinga more up to date db list, by roan's suggestion" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2622 [15:03:23] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2622 [15:03:24] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2622 [15:03:52] PROBLEM - NTP on search1003 is CRITICAL: NTP CRITICAL: No response from NTP server [15:46:55] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[15:48:16] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [16:03:10] !log clearing cache on srv1 [16:03:12] Logged the message, Master [16:08:50] hello brion you little rascal you [16:18:11] PROBLEM - Host search1001 is DOWN: PING CRITICAL - Packet loss = 100% [16:20:31] !log rebooting search1001 [16:20:33] Logged the message, Master [16:23:35] RECOVERY - Host search1001 is UP: PING OK - Packet loss = 0%, RTA = 31.01 ms [16:26:26] PROBLEM - DPKG on search1001 is CRITICAL: Connection refused by host [16:26:35] PROBLEM - Disk space on search1001 is CRITICAL: Connection refused by host [16:27:02] PROBLEM - RAID on search1001 is CRITICAL: Connection refused by host [16:27:47] PROBLEM - SSH on search1001 is CRITICAL: Connection refused [16:30:38] PROBLEM - Lucene on search1001 is CRITICAL: Connection refused [16:30:42] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: enwikibooks and enwikiquote to 1.19wmf1 [16:30:45] Logged the message, Master [16:33:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:35:35] RECOVERY - SSH on search1001 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:35:44] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.68473669565 (gt 8.0) [16:37:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.755 seconds [16:39:32] uttlepr: which DC is search1001 in? [16:40:30] jeremyb: anything with 1XXX is eqiad [16:41:41] yup [16:44:35] jeremyb: uttlepr is a troll which has been banned several times from here and other channels [16:44:52] oh, good [16:45:29] RECOVERY - RAID on search1001 is OK: OK: no RAID installed [16:45:30] Vito: thanks! [16:45:58] yw! [16:46:14] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 1.93967701754 [16:46:14] RECOVERY - DPKG on search1001 is OK: All packages OK [16:46:14] RECOVERY - Disk space on search1001 is OK: DISK OK [16:55:32] RECOVERY - Lucene on search1001 is OK: TCP OK - 0.031 second response time on port 8123 [16:55:50] !log reedy synchronized php-1.18/extensions/FeaturedFeeds/ 'r111650' [16:55:52] Logged the message, Master [16:56:08] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2438* [16:58:51] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: frwikisource to 1.19wmf1 [16:58:54] Logged the message, Master [17:08:53] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 172 MB (2% inode=62%): /var/lib/ureadahead/debugfs 172 MB (2% inode=62%): [17:08:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:11] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 268 MB (3% inode=62%): /var/lib/ureadahead/debugfs 268 MB (3% inode=62%): [17:12:47] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 272 MB (3% inode=62%): /var/lib/ureadahead/debugfs 272 MB (3% inode=62%): [17:15:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.952 seconds [17:15:29] RECOVERY - Disk space on srv221 is OK: DISK OK [17:26:26] aharoni: should be live now [17:26:31] first quick impression is ok [17:26:57] We've not really found many issues so far [17:27:47] hashar: most of the errors in fatalmonitor are 1.18 [17:30:08] my own RTL fix works well. 
[17:30:16] compare https://he.wikisource.org/wiki/Test! in Chrome and in Firefox. [17:30:33] Chrome supports dir="auto", Firefox still doesn't (but will soon). [17:30:39] :) [17:31:42] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: betawikiversity to 1.19wmf1 [17:31:43] Logged the message, Master [17:43:05] for info, nothing bad pour betawikiversity by passing to 1.19 [17:43:21] -pour +for [17:44:05] dcrochet: thanks [17:48:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:53:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.255 seconds [18:08:19] PROBLEM - Host search1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:13:43] RECOVERY - Host search1002 is UP: PING OK - Packet loss = 0%, RTA = 30.87 ms [18:16:52] PROBLEM - Disk space on search1002 is CRITICAL: Connection refused by host [18:17:01] PROBLEM - RAID on search1002 is CRITICAL: Connection refused by host [18:17:04] Reedy: have you deployed to enwiki yet? [18:17:19] PROBLEM - SSH on search1002 is CRITICAL: Connection refused [18:17:28] PROBLEM - DPKG on search1002 is CRITICAL: Connection refused by host [18:20:46] PROBLEM - Lucene on search1002 is CRITICAL: Connection refused [18:25:41] Function: User::invalidateCache [18:25:43] Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.0.6.46) [18:25:44] huh, that was random [18:26:37] RECOVERY - SSH on search1002 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [18:26:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:32:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.469 seconds [18:39:41] New patchset: Ottomata; "Buncha mini changes + hackiness to parse a few things. This really needs more work" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2623 [18:42:43] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2623 [18:44:55] PROBLEM - NTP on search1002 is CRITICAL: NTP CRITICAL: No response from NTP server [19:02:43] PROBLEM - Lighttpd HTTP on dataset2 is CRITICAL: Connection refused [19:02:43] RECOVERY - Auth DNS on ns0.wikimedia.org is OK: DNS OK: 0.020 seconds response time. 
www.wikipedia.org returns 208.80.154.225 [19:05:16] RECOVERY - Lighttpd HTTP on dataset2 is OK: HTTP OK HTTP/1.0 200 OK - 4906 bytes in 0.022 seconds [19:06:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:13:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.594 seconds [19:24:15] New patchset: Lcarr; "Moving generic::tcptweaks to "standard" server class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2624 [19:25:14] !log aaron synchronized php-1.19/includes/api/ApiQueryAllUsers.php [19:25:16] Logged the message, Master [19:26:04] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2624 [19:26:05] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2624 [19:26:15] !log aaron synchronized php-1.19/includes/api/ApiQueryAllUsers.php [19:26:16] Logged the message, Master [19:29:18] !log aaron synchronized php-1.18/includes/api/ApiQueryAllUsers.php [19:29:19] Logged the message, Master [19:29:56] !log doing some debugging for bug 34451 [19:29:58] Logged the message, Master [19:30:21] !log aaron synchronized php-1.18/includes/api/ApiQueryAllUsers.php [19:30:23] Logged the message, Master [19:30:33] hmm, seems like a lack of equality propagation [19:36:27] !log aaron synchronized php-1.18/includes/api/ApiQueryAllUsers.php [19:36:29] Logged the message, Master [19:39:58] AaronSchulz: looking a thte active users thing? [19:45:14] !log aaron synchronized php-1.18/includes/api/ApiQueryAllUsers.php 'pushing comment changes :)' [19:45:17] Logged the message, Master [19:46:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:46:57] Reedy: can you test http://en.wikipedia.org/w/api.php?action=query&list=allusers queries? [19:47:01] they look fine too me [19:47:31] * AaronSchulz tried with/without group/activeusers param and different limits [19:50:32] AaronSchulz yo [19:50:38] you have a minute? [19:51:05] it's better to just ask the basic question :) [19:51:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.942 seconds [19:51:53] Reedy: want to port that to wmf1.19? :p [19:52:10] * AaronSchulz is ready to nom [19:52:19] I'll have a look in a few mins [19:55:42] AaronSchulz: https://bugzilla.wikimedia.org/34427 -- swift? [19:57:11] hexmode: I don't understand what that bug is getting at. [19:57:21] i.e. what's incorrect about what's boing shown? [19:57:45] when saibo looked, there was no file, but now timestamps show a file should have been there [19:58:32] fascinating. [19:58:49] but no, it's definitively not related to swift; swift was not in production service at 19:39 yesterday. [19:58:51] (UTC) [19:59:03] hrm [19:59:04] ok [19:59:18] forgot you guys backed it out [20:10:49] PROBLEM - Disk space on mw15 is CRITICAL: DISK CRITICAL - free space: /tmp 18 MB (1% inode=87%): [20:13:35] robla: I managed to recreate a truncated thumb. :( (granted, once out of about 30 tries, but still.) [20:13:40] maybe we need to look more closely at https://mikewest.org/2008/11/generating-etags-for-static-content-using-nginx [20:13:57] bummer [20:14:31] (this is in comparison to before, where I created one on every attempt) [20:14:53] that's still kinda dicey though [20:15:08] maybe we should go with Tim's original suggestion on this [20:15:33] i.e. 
write the file prior to sending it [20:15:55] actually, I have a different theory - I might have requested it for the second time before it was finished getting generated. [20:16:14] oh, race condition? [20:16:23] yeah; [20:16:40] my test was to connect, drop, connect, get first 10 packtes, [20:17:02] so if the first 10 packets had the "whole" file because it was half way written to NFS, [20:17:42] AaronSchulz: did you catch all that? [20:18:38] Reedy: don't forget 1.19wmf :) [20:18:51] AaronSchulz: I've merged it already [20:18:56] I just had to svn up before I could svn ci [20:18:57] :P [20:19:23] I can't recreate it. [20:19:32] fluke? I'll keep trying. [20:19:35] Reedy: ah, I see [20:19:48] maplebed: was the file already on ms5? [20:19:54] no. [20:20:16] I'm requesting a file with $RAND for the width. [20:20:34] (but I'm requesting it several times so I can try different styles of abort) [20:21:42] is there someone here who can copy a single file for me into dumps.wikimedia.org/android/ [20:21:42] ? [20:23:32] yuvipanda: anyone from ops [20:23:43] Reedy: does that count you? [20:23:47] :D [20:23:57] binasher: did you seem tim's profiling changes? [20:24:10] to the c collector? [20:24:13] yep [20:24:18] yup [20:24:41] yuvipanda: I'm not ops! ;) [20:25:03] Reedy: can you play one on ssh? :D [20:25:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:25:24] anyway, who do you suggest I poke? [20:25:32] yuvipanda: upload it somewhere, dump the url in here [20:25:46] For dumps related things, I usually ask apergos nicely [20:28:00] http://dl.dropbox.com/u/8768784/Wikipedia-v1.1-beta1.apk [20:28:06] * yuvipanda pokes apergos nicely [20:28:19] can you copy that url's apk into dumps.wikimedia.org/android? [20:28:54] AaronSchulz: (and robla) https://wikitech.wikimedia.org/view/User:Bhartshorne/truncated_thumbnail_issue [20:29:04] bah. [20:29:15] AaronSchulz: (and robla) http://wikitech.wikimedia.org/view/User:Bhartshorne/truncated_thumbnail_issue [20:29:16] RECOVERY - Disk space on mw15 is OK: DISK OK [20:29:39] !log reedy synchronized php-1.19/includes/api/ApiQueryAllUsers.php 'r111675' [20:29:42] Logged the message, Master [20:29:52] done [20:30:04] I though I made that directoy a symlink into other [20:30:08] and now it seems not to be [20:30:18] * apergos resolves to have a word with the relevant people... [20:30:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.918 seconds [20:30:41] apergos: thanks :D [20:30:43] the less things that live in /data/xmldumps that aren't actually dumps, the better [20:32:00] well, if we have some place to distribute these off, I'd gladly put them there instead of putting them in dumps.* [20:32:35] in dumps is find, there's just a subdirectory "other" that I'd rather have them in [20:32:53] *fine [20:33:55] PROBLEM - HTTP on singer is CRITICAL: Connection refused [20:34:49] AaronSchulz: swift code is still apparently being hit in wmf-config/swift.php [20:34:59] incalid argument supplied for foreach on line 45 [20:45:20] Secure server down? [20:45:48] what secure server? just type https:// and the usual url [20:45:55] secure.wikimedia.org [20:46:12] see http://status.wikimedia.org/ [20:46:37] don't use secure [20:46:55] I know :-) Still down though... [20:47:00] we want it to be down [20:47:01] dead [20:47:03] gone [20:47:07] stomped into the ground even :-D [20:47:49] Was up 10 min ago. [20:48:37] the day suddenly got better [20:48:48] oh? 
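The race theorised around 20:15-20:17 above (a second request arriving while the thumbnail is still being written out to NFS, so the first packets look like the "whole" file) is the classic partial-read hazard, and the "write the file prior to sending it" suggestion amounts to making that write atomic. The following is only a minimal sketch of the write-to-temp-then-rename pattern, not the scaler's actual code; store_thumbnail_atomically and its data argument are hypothetical:

    import os
    import tempfile

    def store_thumbnail_atomically(dest_path, data):
        # Write data so that concurrent readers never observe a partial file.
        # Sketch only: data stands in for scaler output; a real pipeline would
        # stream rather than hold the whole thumbnail in memory.
        dest_dir = os.path.dirname(dest_path)
        fd, tmp_path = tempfile.mkstemp(dir=dest_dir)  # same filesystem, so rename() is atomic
        try:
            with os.fdopen(fd, 'wb') as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())  # make sure the bytes have hit the disk (or NFS server)
            os.rename(tmp_path, dest_path)  # readers see either no file or a complete one
        except Exception:
            os.unlink(tmp_path)
            raise

With this shape a concurrent GET either misses the file entirely (and can regenerate or 404) or reads a complete one; it can never stream the first few packets of a half-written thumbnail.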
[20:50:39] without secure I mean [20:50:56] :-) [20:51:31] I wonder if that's singer [20:51:34] I don't remember any more [20:52:26] yeah, that's it [21:03:31] We found a bug on MW 1.19. The special page Deleted Contribution is no more update. See https://bugzilla.wikimedia.org/show_bug.cgi?id=34456 [21:04:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:10:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.953 seconds [21:12:39] New patchset: Lcarr; "Fixing varnish language" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2625 [21:13:08] PROBLEM - Varnish HTTP mobile-frontend on cp1042 is CRITICAL: Connection refused [21:13:26] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2625 [21:13:26] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2625 [21:15:11] New patchset: Diederik; "Added full support for ip address and ip range filtering Added full support for regular expression matching" [analytics/udp-filters] (refactoring) - https://gerrit.wikimedia.org/r/2626 [21:18:57] !log the most recent apache update (thanks puppet) must have broke things on singer. the url.wm.o config wants /srv/org/wikimedia/url/ but I have no idea what that service ever did or what is supposed to be in there. will someone who knows this undocumented information please check it? thanks. [21:19:00] Logged the message, Master [21:19:03] * apergos grouches [21:21:19] <^demon> url.wm.o? [21:24:41] yeah [21:24:47] on singer [21:25:07] I have no idea what it is, whether it's new, old, should be there, should be gone, if it should be there what the contents should be [21:25:08] *nothing* [21:26:18] <^demon> I've never ever heard of it either. [21:26:30] <^demon> And url.wm.o is kind of bulky to be a shortening service :p [21:32:51] I could take it out of sites-enabled and bring apache back up I guess... dunno. [21:34:47] http://en.planet.wikimedia.org/ is down [21:35:35] yeah [21:35:39] guess so [21:36:24] it's failing for some worse reason [21:36:34] the url.wm.o docroot warning is just a warning [21:36:39] * apergos hates on singer [21:40:19] !log reedy synchronized php-1.19/extensions/Vector/modules/ext.vector.collapsibleNav.js 'r111687' [21:40:21] Logged the message, Master [21:44:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:44:32] [Thu Feb 16 20:30:54 2012] [error] Server should be SSL-aware but has no certificate configured [Hint: SSLCertificateFile] ((null):0) [21:45:02] !log singer certificate issues, looks like [21:45:05] Logged the message, Master [21:48:38] maplebed: weird, testing with eqiad, I don't get those container 404 errors [21:49:00] it just get an empty array as expected [21:49:49] that is odd. [21:49:54] what happens if the container doesn't exist? [21:50:27] well swift.php won't die over that, it always caught the "no container" exception [21:50:44] excellent. [21:50:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.383 seconds [21:51:59] maplebed: https://github.com/rackspace/php-cloudfiles/blob/master/cloudfiles_http.php [21:52:15] where am I looking? [21:52:15] odd, for for '$this->error_str = "Container has no Objects.";' [21:52:23] there is a 204 and a 404 case [21:52:27] I wonder what the difference is [21:52:43] maybe 404 is the container not existing? 
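On the 204-versus-404 question at the end of the exchange above: in the Swift API a listing of a container that exists but is empty normally comes back as 204 No Content, while a listing of a container that does not exist returns 404, which would explain the two branches in cloudfiles_http.php. A quick way to check against a cluster is a raw listing request; the sketch below is illustrative only, with placeholder host, token and account values:

    import httplib

    def list_container(host, token, account, container):
        # 200: container has objects (names come back one per line)
        # 204: container exists but is empty
        # 404: container does not exist
        conn = httplib.HTTPConnection(host)
        conn.request('GET', '/v1/%s/%s' % (account, container),
                     headers={'X-Auth-Token': token})
        resp = conn.getresponse()
        body = resp.read()
        conn.close()
        if resp.status == 404:
            raise LookupError('container %r does not exist' % container)
        return body.splitlines() if resp.status == 200 else []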
[21:53:15] which would be odd if I just checked that it existed beforehand [21:54:07] * AaronSchulz wonders if its eventual consistency [21:54:19] if the container was created a half a second ago [21:54:48] but we don't create them automatically yet [21:54:52] so that seems unlikely [21:55:39] New patchset: Lcarr; "Fixing ")" for wap redirection" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2627 [21:56:18] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2627 [21:56:18] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2627 [21:59:29] RECOVERY - Varnish HTTP mobile-frontend on cp1042 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.062 seconds [21:59:45] TimStarling: the tl;dr version: maplebed confirmed that there will be cases where copying from ms5 NFS to swift fails due to lack of md5 [22:00:43] however, he believes that it'll be a much smaller %, so his plan is to log all images and follow up 5min or so later with a cleanup script [22:01:26] why would it be a small percentage? [22:01:48] I would have thought almost all of them would fail [22:01:57] I'm going to let maplebed defend that one. he didn't convince me :) [22:02:41] here's the basic logic, though: [22:02:45] TimStarling: aborting the connection (pre-etag) on generated images caused a broken image 100% of the time. aborting a connection on an existing image causes a truncated image in about 35% of my tests. [22:02:50] s/35/5/ [22:03:16] sounds like an artifact of your test setup [22:03:27] why don't you just send the Content-Length header? [22:03:36] it's right there in the response header from ms5 [22:03:41] ooooh [22:03:45] all you have to do is copy it through, then you have 0% failures [22:03:59] I said this in my original email [22:04:00] AaronSchulz: does Swift take a content-length? [22:04:13] TimStarling: that assumes swift buffers the entire file before sending it on, right? [22:04:16] the python HTTP library should check it [22:04:24] robla: what do you mean? [22:04:33] * AaronSchulz is writing a response to someone on CR [22:04:42] the swift proxy? no, the proxy doesn't have to buffer it [22:04:43] so, when you send an etag to swift, and it doesn't match, the upload fails [22:04:51] the server will buffer, obviously, it has to compute the MD5 [22:05:05] is there a similar optoin for content-length? [22:05:17] TimStarling: I'm not sure I understand then. Would it be easier to show me what you mean with a patch to rewrite.py? [22:05:24] :P [22:06:18] robla: yes, but not for chunked transfer [22:06:27] d'oh [22:06:43] ok [22:06:58] it should work for chunked transfer [22:07:40] it does etag but not content-length? gah.... [22:08:12] patch swift! [22:08:27] (I'm only half joking) [22:11:12] robla: http://docs.openstack.org/cactus/openstack-object-storage/developer/content/chunked-transfer-encoding.html [22:11:25] http://paste.tstarling.com/p/CAvcAF.html [22:11:27] this also conforms to HTTP spec [22:11:54] http://www.ietf.org/rfc/rfc2616.txt [22:12:29] TimStarling: that header will then be ignored inside wmf.client.Put_object_chunked, as AaronSchulz points out (in the docs). 
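The crux of the suggestion above (22:03-22:12) is that with the default identity encoding the proxy can forward the backend's Content-Length on the PUT, so if the stream from ms5 is cut short Swift receives fewer bytes than promised and fails the upload instead of storing a truncated object; with chunked encoding there is no up-front total to check against. Below is a minimal sketch of that shape, assuming a file-like upstream response with a known length; this is not the actual rewrite.py change, and all names are placeholders:

    import httplib

    def put_streaming(host, token, path, src, content_length, chunk_size=64 * 1024):
        # PUT src (a file-like object) into Swift without chunked encoding.
        conn = httplib.HTTPConnection(host)
        conn.putrequest('PUT', path)
        conn.putheader('X-Auth-Token', token)
        conn.putheader('Content-Length', str(content_length))
        conn.endheaders()
        sent = 0
        while sent < content_length:
            block = src.read(min(chunk_size, content_length - sent))
            if not block:
                break  # upstream ended early
            conn.send(block)
            sent += len(block)
        if sent < content_length:
            # Short read: close the connection so Swift discards the partial
            # body rather than completing the object.
            conn.close()
            return None
        resp = conn.getresponse()
        return resp.status  # 201 Created on success

If the md5 were available, an ETag header would give an even stronger check, but the point here is that the length alone, copied through from ms5, is enough to catch a dropped transfer.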
[22:12:41] well, don't use chunked encoding then [22:13:57] http://paste.tstarling.com/p/DOXUXt.html [22:13:59] easy [22:14:15] I spent about 20 min trying to explore that from different angles [22:14:20] (at least) [22:14:32] I wasn't able to twist maplebed's arm [22:15:38] the default transfer encoding (identity) just has all the data on the wire, with the Content-Length header to indicate the end of it [22:15:44] I don't think that diff is going to work; I think it'll take more than that. [22:16:20] but I'll read it again more carefully first. [22:16:31] so all you need to do to support it is to send the data as it comes [22:16:59] :q [22:17:02] moops. [22:17:57] !log catrope synchronized wmf-config/InitialiseSettings.php 'Enable wgResourceLoaderExperimentalAsyncLoading on test2wiki' [22:17:59] Logged the message, Master [22:21:37] PROBLEM - Auth DNS on ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [22:24:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:24:10] RECOVERY - Auth DNS on ns2.wikimedia.org is OK: DNS OK: 0.115 seconds response time. www.wikipedia.org returns 208.80.154.225 [22:28:15] maplebed: it might work [22:28:27] I'm slowly convincing myself of the same thing. [22:28:48] I was reading the send() docs to make sure it wasn't redoing headers and such and it's not [22:29:12] so that should work for streaming, which is what we need [22:29:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.662 seconds [22:35:25] New patchset: Bhartshorne; "trying Tim's suggestion of abandoning chunked-encoding for swift puts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2628 [22:35:31] AaronSchulz: I'm going to load Tim's diff into ms-fe1 ^^^ [22:36:20] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2628 [22:36:20] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2628 [22:37:11] !log catrope synchronized wmf-config/CommonSettings.php 'Add live hack to enable $wgResourceLoaderExperimentalAsyncLoading on meta only for me (User:Catrope)' [22:37:13] Logged the message, Master [22:37:29] RoanKattouw: :) [22:37:50] * AaronSchulz should start hacking in features for User:Aaron Schulz [22:38:37] New patchset: Hashar; "puppet local linter!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2629 [22:39:10] RECOVERY - Host db1035 is UP: PING OK - Packet loss = 0%, RTA = 27.08 ms [22:40:36] * AaronSchulz reads http://en.wikipedia.org/wiki/Chunked_transfer_encoding#Rationale [22:40:59] maplebed: I can only assume the reason rewrite used chunked-transfer was the result of a coin flip [22:41:08] *rewrite.py [22:41:47] what do you mean? [22:42:01] PROBLEM - mysqld processes on db1035 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [22:42:08] that it made no sense [22:42:24] maybe it was due to a misunderstanding of how streams work [22:42:45] thinking that each send() has to be self-contained in some sense at the HTTP level [22:42:54] It's true, I was working under the assumption that using chunked encoding was a conscious choice with a valid reason. 
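As a side note on the framing difference under discussion at 22:12-22:15: with identity encoding the body is delimited purely by the Content-Length header, whereas chunked encoding prefixes every piece with its own size and terminates with a zero-length chunk, so the receiver never learns the total up front. A small self-contained illustration, unrelated to any production code:

    def identity_frame(body):
        # The body is sent as-is; Content-Length tells the receiver where it ends.
        return 'Content-Length: %d\r\n\r\n%s' % (len(body), body)

    def chunked_frame(body, chunk_size=4):
        # Each chunk is '<hex size>\r\n<data>\r\n'; a zero-size chunk ends the body.
        out = ['Transfer-Encoding: chunked\r\n\r\n']
        for i in range(0, len(body), chunk_size):
            piece = body[i:i + chunk_size]
            out.append('%x\r\n%s\r\n' % (len(piece), piece))
        out.append('0\r\n\r\n')
        return ''.join(out)

    print identity_frame('hello swift')
    print chunked_frame('hello swift')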
[22:43:55] * TimStarling just found a string for which adler32($s) == 0 [22:44:09] for https://bugzilla.wikimedia.org/show_bug.cgi?id=34428 [22:44:57] lol [22:45:12] I have no idea what adler32 is but I'm facepalming just reading that [22:45:42] How can you implement a checksum algorithm wrong? I mean surely there's a reference implementation, and some canonical input,output pairs [22:46:02] tests only proof that the given cases work :) [22:46:08] [/captain obvious] [22:46:10] Well sure [22:46:39] But if you port MD5 to a different language, the first thing you do is try it on the empty string and on e.g. the file itself [22:46:43] Or /dev/random maybe [22:47:16] RECOVERY - mysqld processes on db1035 is OK: PROCS OK: 1 process with command name mysqld [22:47:32] AaronSchulz, TimStarling: non-chunked encoding is currently running on ms-fe1 if you want to throw anything at it. (chunked is still running on ms-fe2 for comparison) [22:47:37] I'm just saying that not testing the single most unit-testable kind of code in the history of programming is unforgiveable [22:47:51] RoanKattouw: did tim say $s was ''? [22:48:07] It's probably not [22:48:31] so how would testing '' help? [22:48:59] I think the author figured that the details didn't matter, because the hash would only be used internally [22:50:58] anyway I pondered for a while how to find a string to initialise it with [22:51:00] RoanKattouw: it's certainly easier to test the MW, that's for sure [22:51:10] PROBLEM - MySQL Slave Running on db1035 is CRITICAL: CRIT replication Slave_IO_Running: No Slave_SQL_Running: No Last_Error: Rollback done for prepared transaction because its XID was not in the [22:51:23] I eventually settled on using a base string for which A (per the wikipedia article definition) is zero [22:51:57] and then perturbing it by subtracting a number from a moving location, and adding it to the last byte to keep A the same [22:52:08] then you're only searching a 16-bit space [22:53:06] You had to find a string for which A'($s) == 0, where A' is a buggy implementation of A? [22:53:56] no, I had to find a string where A($s) == 0, which enables you to efficiently compute A' as A($s . $x) [22:54:09] because the hash state is the whole of the hash [22:54:35] so a string which hashes to zero initialises the state of the algorithm to zero, which is what A' incorrectly does [22:56:53] hrm, you guys realize you're just searching for a 15 year old's spelling of ass? 
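For anyone following along: adler32 is a rolling checksum whose standard initial state is A=1, B=0, and zlib exposes it with an optional running value, which is what makes the trick above work. If a string s can be found whose correct adler32 is 0, then hashing anything after s continues from an all-zero state, i.e. it reproduces an implementation that wrongly seeds the state with zero. A small sketch of the identity; the magic zero-hash string itself is not reproduced here, and broken_adler32 is only a model of the bug:

    import zlib

    def broken_adler32(data):
        # Model of the bug: seed the running state with 0 instead of the
        # standard initial value of 1.
        return zlib.adler32(data, 0)

    s = 'any prefix'   # stand-in; the real trick needs a prefix with zlib.adler32(s) == 0
    x = 'payload'

    # adler32 is resumable: hashing s, then continuing with x, equals hashing s + x.
    assert zlib.adler32(x, zlib.adler32(s)) == zlib.adler32(s + x)

    # Hence, for a prefix with zlib.adler32(s) == 0, zlib.adler32(s + x) equals
    # broken_adler32(x) for every x -- the correct function can simulate the buggy one.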
:) [22:57:13] Aah [22:57:25] So then you can simulate the broken implementation, I see [22:57:30] yep [22:57:57] * robla resumes paying attn to this conversation [22:59:48] !log catrope synchronized wmf-config/CommonSettings.php 'Also give User:Cmcmahon experimental async loading on meta' [22:59:50] Logged the message, Master [23:03:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:04:13] RECOVERY - MySQL Slave Running on db1035 is OK: OK replication [23:06:55] PROBLEM - mysqld processes on db1035 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:08:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.934 seconds [23:09:03] !log catrope synchronized php-1.19/includes/resourceloader/ 'r111699' [23:09:05] Logged the message, Master [23:09:52] !log catrope synchronized php-1.19/resources/mediawiki/mediawiki.js 'r111699, r111700' [23:09:53] Logged the message, Master [23:10:17] !log catrope synchronized php-1.19/resources/startup.js 'touch' [23:10:19] Logged the message, Master [23:12:45] maplebed: did it work? [23:12:56] looks like yes, though I need to run another test. [23:28:49] robla: futzing the content-length header makes swift return a 500 [23:30:00] maplebed: not a 404? [23:30:06] nope. [23:30:11] it just blew a gasket. [23:30:17] hrm [23:30:31] (by futzing with, I mean content_length += 20) [23:31:01] that seems appropriate though; it is finding the file, but what it's getting doesn't match what it thinks it should be. [23:31:30] lemme make sure I understand you correctly: 1. hack in +=20 to content length 2. upload 3. try accessing that URL 4. get 500 [23:31:36] maplebed: is that what you mean? [23:31:40] not quite. [23:31:44] or do you mean it returns 500 from the write? [23:32:14] 1. modify the copy-into-swift portion of the 404 handler to do content_length += 20 [23:32:30] 2. request a file that doesn't exist, triggering the copy-into-swift portion. [23:33:13] 3. get 500. [23:33:32] ok.....that's a 500 from the PUT request then, I'm assuming [23:33:40] that's actually what we want, I think [23:33:57] me too. [23:34:30] but, the important question is what happens when you make a subsequent GET request on the URL [23:34:45] same thing. [23:34:50] 500? [23:34:56] (aka swift didn't put in the file before returning the 500) [23:35:05] it failed the put entirely [23:35:59] I guess what I'm asking is "does it return the exact same result as if you had never tried to write to that location in the first place?" [23:36:01] so the GET is 500? [23:36:25] yes to both. [23:36:49] swift returns 500s on file not found? that's ....unique [23:37:03] no, it's returning 500 on a failed put. [23:37:10] (which is a subrequest of the 404) [23:37:29] I guess I'm confused. [23:37:30] oh, that makes sense then [23:37:43] so...the real test is the 4 steps I described [23:38:06] "upload" doesn't exist as a step in the current swift infrastructure. [23:38:27] PUT does, though, right? [23:38:44] by hand, sure. as part of the actual flow of bits in production, no. [23:38:52] PUT only happens as a byproduct of 404. [23:39:13] the only production-like way to trigger it is by a GET to an object that doesn't exist. 
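To restate the flow being tested at 23:31-23:39: the only write path is the 404 handler, which on a thumbnail miss fetches the image from the backend, PUTs it into Swift, and then serves it, so corrupting the forwarded Content-Length (the += 20 experiment) makes the inner PUT fail and the whole request surface as a 500 with nothing stored. A rough outline of that shape, using hypothetical swift and backend client objects rather than the real rewrite.py internals:

    def handle_get(swift, backend, path):
        obj = swift.get(path)
        if obj is not None:
            return 200, obj                         # already in Swift

        data, content_length = backend.fetch(path)  # e.g. the image scaler / ms5
        if data is None:
            return 404, ''                          # backend doesn't have it either

        # Store-on-miss: forward the backend's Content-Length so Swift can verify
        # it received the whole body.  A failed PUT must not leave a truncated
        # object behind, and nothing is served from it.
        status = swift.put(path, data, content_length)
        if status != 201:
            return 500, ''                          # failed PUT, nothing stored

        return 200, data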
[23:39:15] I'm talking about the Swift client-server interactoin, which I'm assuming happens over HTTP [23:40:02] the stuff outside of the Swift client-server interaction doesn't interest me for purposes of this conversation [23:40:21] what I'm trying to make sure you've checked is that a 500 error doesn't leave junk in Swift [23:41:01] that's correct; after mucking with the content-length header the object I was requesting was not written into swift. [23:41:15] k...great! [23:42:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:43:24] maplebed: so, can you turn swift on again? [23:43:38] you mean put it back in production? [23:43:59] http://ganglia.wikimedia.org/latest/graph_all_periods.php?m=swift%20object%20change&z=small&h=Swift%20pmtpa%20prod&c=Swift%20pmtpa&r=hour shows it's still deleting existing truncated thumbs. [23:44:18] I'd rather wait for that to finish. [23:48:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.844 seconds [23:50:32] sure [23:51:58] maplebed: how close is your fixup script to completing? [23:53:00] robla: the sweeper? I've been working on testing non-chunked encoding instead. [23:53:48] !log catrope synchronized wmf-config/CommonSettings.php 'Tweak live hack so async loading is enabled for all logged-in users on all 1.19 wikis' [23:53:50] Logged the message, Master [23:54:19] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [23:54:55] maplebed: did you stop the scripts that you started yesterday? [23:55:06] no, tehy're still going. [23:55:17] any idea how close they are to finishing? [23:56:08] hi [23:56:13] can someon help [23:56:19] !ask [23:56:19] with? [23:56:49] I'm afraid they won't finish today - one is in the 6s (counting up from 0) and the other is in the ds (counting down from f) [23:57:01] hi can someone help me with something [23:57:07] that's still 7/16 that have to finish. [23:57:17] Guest50878: Can you please describe what you need help with? [23:57:18] Guest50878: Don't ask if you can ask, just ask your question :) [23:57:19] PROBLEM - Puppet freshness on search1003 is CRITICAL: Puppet has not run in the last 10 hours [23:57:24] ok thanks [23:57:39] I'm on meta wiki now, and it just had 1.19 deployed [23:57:51] RoanKattouw: he didn't ask about asking - s/he wanted to check our availability and readiness :) [23:58:00] one of the new changes is there is the +/- diff change for contribs pages [23:58:00] http://meta.wikimedia.org/wiki/Special:Contributions/Okeyes_%28WMF%29 [23:58:04] my question is [23:58:17] is there a way to use CSS to remove those +/- things [23:58:36] What do you want to remove exactly [23:58:38] +239 +1 ? [23:58:41] The whole (+834) thing? [23:58:43] 03? [23:58:59] yes [23:59:02] let me clarify [23:59:11] # 23:44, 16 February 2012 (diff | hist) . . (+834)‎ . . User talk:Kudpung ‎ (reply) (top) [23:59:12] .mw-plusminus-pos { display: none } [23:59:16] Yeah [23:59:24] ok let me try that [23:59:29] That will probably also hide (+nnn) markers elsewhere though [23:59:48] oh...yeah. how do I limit it to contribs pages