[00:00:49] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [00:01:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [00:03:41] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [00:04:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [00:08:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [00:17:53] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [00:18:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [00:21:20] * Damianz looks at Krinkle and wonders if they're ok [00:22:11] looks pretty much like what is already in place, but puppetized. [00:22:21] I don't use the bots project that much though [00:23:41] Yeah, it's trying to make it less ad-hoc for when we move into restricted instances... some stuff will need refactoring later (like hard coded mysql passwords, though they depend on fixing the mysql class to write a .my.cnf file). [00:23:58] New review: Ryan Lane; "inline comment." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/26441 [00:27:06] we have an API log now [00:32:23] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [00:32:29] Ryan_Lane: do you happen to know anything about how caching works; specifically in esams; and more specifically why purging the cache via action=purge would clear ptmpa but not esams? [00:33:05] maybe purge messages aren't working properly? [00:33:11] it's a multicast purge [00:33:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [00:33:33] any easy way to actually debug that? [00:33:40] the multicast -> unicast -> unicast -> multicast relay may be broken [00:34:15] that sounds... frightening! [00:34:20] mwalker: I think mark wrote some tool for this a while back [00:34:26] well, you can't do multicast across the WAN [00:34:34] yep; got that :) [00:34:35] you need a relay [00:34:52] mwalker: hey:) [00:35:02] mutante: *waves* [00:35:27] mutante: did you meet kelsey hightower? [00:35:29] mwalker: i just noticed some error related to your shell access. do you miss anything you need? [00:35:40] jeremyb: heh, no [00:35:53] hrmmm, how odd [00:35:58] mutante: erm... not that I have noticed [00:36:00] http://wikitech.wikimedia.org/view/Multicast_HTCP_purging [00:37:42] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [00:38:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [00:38:49] jeremyb: oh, why? i see its someone at Puppet Labs [00:39:04] mutante: you were talking about the forge [00:39:17] i think [00:41:19] oh yeah, maybe i have listened to his talk without remembering the name :p [00:41:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:41:33] they kept saying how the forge has now over 500 modules [00:41:39] heh [00:41:43] so i searched for mediawiki [00:47:10] Ryan_Lane: ok; so parsing of the HTCP page makes sense given what I'm seeing in the code -- as I don't have access to the cluster, how many cookies do I have to provide to you/someone to look into if the udpmcast.py script is still running? [00:47:34] this assumes I know whereit is running [00:47:48] :p always the question... supposedly bayle... [00:48:05] really convincing that it's monitored then :D [00:48:08] oxygen is one [00:48:17] puppet is the source of truth [00:48:19] don't trust wikitech [00:48:30] linne and oxygen [00:49:01] Ryan_Lane: Do you know whats up with gerrit and viewing changes in file that contain foreign characters (i.e. i18n files), even small updates aren't shown: https://gerrit.wikimedia.org/r/#/c/21034/7/MoodBar.i18n.php [00:49:23] uncaught exception in javascript [00:49:33] blank page [00:49:37] Krinkle: no clue. ask chad [00:49:53] also running on linne [00:49:59] ^damnit [00:51:56] I don't see traffic on linne [00:53:00] I see it on oxygen being sent to linne [00:53:53] Krinkle: used to work... [00:53:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.665 seconds [00:54:13] that extra line wasn't there beofre [00:54:36] besides its "wrong" (the instance name is not a FQDM and the instance is already in the table ) [00:54:39] hm [00:54:44] oxygen is relaying to a multicast group [00:55:05] this must not be the relay I'm looking for [00:55:55] ah. that's for squid logging [00:55:58] * Ryan_Lane sighs [00:56:16] let me guess the other relay isn't puppetized [00:56:25] Ryan_Lane: probably not :p [00:56:32] that would make too much sense [00:56:33] -_- [00:57:03] maybe it's either emery or locke [00:57:38] Hipster admin checks config management then monitoring then realises it was the guy with a beard that installed it. [00:58:21] bayle is now tarin [00:58:24] and runs poolcounter [00:58:36] I have no clue where this runs [00:58:43] it's not documented and I can't find it in puppet [00:58:55] maybe it doesnt anymore! and that's why its broken :) [00:59:18] it's unlikely that it's been broken for a very long time [00:59:33] people are pretty good about telling us when shit is broken [01:00:28] it's not listed as down in nagios [01:00:32] but that assumes we have a check for it [01:00:46] if it isn't in puppet it probably doesn't have a check either ;) [01:01:08] mwalker: the person that would know about this is mark [01:01:24] ok; I'll ping him in the morning [01:01:29] thanks much kindly for looking :) [01:01:54] yw [01:03:16] !log updating a lot of mailing list descriptions per [[BZ:37537]] / thehelpfulone [01:03:27] Logged the message, Master [01:29:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:41:04] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 230 seconds [01:41:22] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 247 seconds [01:42:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.720 seconds [01:48:52] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [01:48:52] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [01:48:52] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [01:48:52] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [01:48:52] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [01:49:10] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 1 seconds [01:50:22] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds [02:17:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:22:55] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [02:27:22] !log LocalisationUpdate completed (1.21wmf1) at Fri Oct 5 02:27:21 UTC 2012 [02:27:37] Logged the message, Master [02:32:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.032 seconds [02:38:58] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [02:45:59] !log LocalisationUpdate completed (1.20wmf12) at Fri Oct 5 02:45:59 UTC 2012 [02:46:12] Logged the message, Master [03:49:55] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [04:39:17] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [06:01:39] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [07:00:53] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100% [07:03:26] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 1.77 ms [07:23:52] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [07:23:52] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [07:37:34] New patchset: Dereckson; "(bug 40776) Namespace configuration for gu.wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26747 [08:02:25] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100% [08:05:16] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [08:29:44] mark: I'm here [08:29:50] you said something about "instructing me" for something :P [08:34:49] New patchset: ArielGlenn; "DO NOT MERGE YET migrate from ms7 upload6 to nas1 upload7" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26749 [08:48:59] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100% [08:51:05] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [09:47:24] paravoid: nag about range requests! :) [09:54:41] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100% [09:57:05] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [10:00:15] will do [10:00:36] told Tollef about the std.integer wraparound on breakfast [10:00:44] heh [10:00:54] I told him you worked around it but he said he'll fix it nevertheless :) [10:00:59] thanks :) [10:01:12] they should [10:01:19] they actually use strtol() into an int [10:03:36] so, what's the story with range requests? not supported at all? [10:03:47] and video players use them to seek I presume [10:03:56] i have to investigate it, just starting on that [10:04:06] but as far as I know currently, varnish doesn't support them towards its backends [10:04:15] so it can only serve them for data it already has [10:05:34] ouch [10:06:28] wow, NetApp is filing up quite well I see [10:06:30] almost there [10:09:17] but for cached objects, I don't yet understand what the difference would be between varnish/swift [10:10:27] back in a while (in time for the window), gotta get lunch [10:20:11] seems like the squids are not caching those videos at all [10:41:39] and it's not working great with the esams squids either [10:41:55] my current impression is that those videos may have worked reasonably well in pmtpa just since the content is so closeby [10:41:59] but not anywhere else [10:45:16] back with food [10:49:32] Apache logs are noisy this morning [10:49:45] With swift related warnings [10:50:07] uh oh [10:51:29] 26/1000 [10:52:25] what's that mean? [10:52:42] when? [10:52:46] I restarted ms-fe2 just moments ago [10:52:53] er, swift on ms-fe2 [10:52:58] ah [10:53:09] That'd probably explain the transfer closed errors [10:53:12] but it shouldn't have made an effect [10:53:18] hmm [10:53:30] does MW retry? [10:53:53] and where do you see that? fluorine? [10:54:15] tail -n 1000 /home/wikipedia/syslog/apache.log | grep " Invalid response" | wc -l [10:54:17] on fenari [10:54:43] so in a little bit those should be gone [10:54:59] Most of them are 8 minutes or so ago [10:55:08] ok [10:55:17] yeah, over a 3 or 4 second period [10:55:17] sounds about right [10:55:36] can ignore them then! [10:55:39] I probably was too quick between depooling and restarting [10:55:42] does it retry? [10:55:47] MW that is [10:56:16] we need to restart swift frontends every few days until we fix the memory leak [10:56:26] (grumble) [10:56:36] Probably not [10:56:40] I guess Arron would know [10:57:02] okay, I'll be a lot more careful then [10:57:34] video download from esams over ipv6 goes at an appalling rate of 300 kbyte/s [10:57:54] *ouch* [10:58:42] Connect time: 113 ms [10:58:42] Request to response headers: 117 ms [10:58:42] Request to first data byte: 118 ms [10:58:42] Received 18948862 bytes, at 301000 bytes/s average [10:58:42] Request to end of data: 63025 ms [10:58:43] Total time: 63138 ms [10:58:47] (from fenari) [10:59:25] apergos: ok [10:59:28] are you doing the migration? [10:59:48] I guess [11:00:01] i can do it too [11:00:24] so there is the one rsync left on ms7 [11:00:36] it's in the middle of comons/temp and wob't finish any tmie soon [11:00:44] so I'm gonna shoot that [11:00:50] we don't care about temp [11:01:06] temp is... well, temp [11:01:17] this was the one sweep to get stuff close to up to date [11:01:25] anyways it will be close enough [11:01:36] okay, just saying :) [11:02:01] uh huh, it was doing upload/wikipedia/* I believe [11:02:32] so first is: are we gonna try to do this without readonly? [11:03:39] what's "this"? swap the MW config? [11:03:47] upload6 for upload7? [11:04:28] yeah, all that [11:05:46] I just did a quick check to see if upload7 is mounted everywhere [11:05:58] I just see a few misc hosts like fenari (different mountpoint), spence, snapshot* which don't have it [11:06:21] only for hosts that have upload6 mounted currently [11:06:28] dsh -Mcg mediawiki-installation "[ -d /mnt/upload6 ] && ls -d /mnt/upload7" [11:06:30] that's what I used [11:06:42] fenari should get it when the change to manifests/misc-servers.pp:misc::extension-distributor gets done [11:07:08] did you check whether ext distributor actually works? [11:07:14] i've heard it's been broken for some time [11:07:42] I hadn't heard that so I didn't check [11:08:06] yeah it's not loading for me [11:08:07] so nevermind that [11:08:24] I don't see what read-only would give you [11:08:41] you'll swap it, then what? you can't test it, so you'll then do it rw again [11:11:16] even if it doesn'tl oad we probably shouldn't break the cron job, unless there's been a decision to retire the extension [11:11:31] sure it's not critical right now but let's not make someone else's clenaup job harder later [11:11:44] the cleanup is to have that thing rewritten [11:11:47] that's way overdue [11:12:33] it's broken already, I don't want to spend any more time on trying to fix it until it's rewritten as a proper service which doesn't impose dependencies on our core infrastructure [11:12:49] nt trying to "fix it", just maintaining the status quo [11:13:13] Depending on what's up with it, it shouldn't take much to kick it back to life [11:13:26] it shouldn't take much to rewrite it [11:13:47] In theory, it's mostly not needed though [11:15:10] as long as it's depending on an NFS share on all app servers, it's needed :) [11:15:55] mark: briefly chatted with Tollef [11:16:40] no easy way around that limitation, varnish doesn't support caching partial objects [11:16:45] yeah I know [11:16:47] I don't think that snaps need it for anything (if they do I'll deal with it later but I can't imagine why) [11:16:47] so range on the backend doesn't really work [11:16:57] can it do it in pass mode? [11:17:02] he suggested m3u8, but that's a much larger change [11:17:03] I don't think it can, but that would already help [11:17:12] sorry, pipe mode [11:17:31] I wonder why spence has the upload6 mount [11:17:32] hehe, I thought of pipe too, but didn't have a chance to ask yet [11:17:37] the next presentation started [11:17:47] anything interesting? [11:18:00] phk's talk was very nice [11:18:24] there was a KPN talk, nothing too interesting, another completely boring out-of-place talk [11:18:28] and now the VGNETT talk [11:18:52] which should be interesting, since varnish was created for them [11:18:55] yes [11:19:35] I'm going to declare that spence doesn't need it so we can move on [11:19:46] yes [11:20:08] heh [11:20:47] Reedy: do you know what the status of work on Captcha is? [11:20:48] I'll prep the commonsettings and initialisesettgins changes and shove em out to gerrit [11:21:59] Nope.. I don't think i've seen anyone working on it [11:23:24] this would be /mnt/upload7/private/captcha (after the move), lemme make sure we copied it over [11:23:38] and it's served outof private so that should all be mw doing the lifting, no need for ms7 [11:23:43] 's webserver [11:23:56] so it's not pulled from http://upload. ? [11:24:20] good [11:24:27] well I watched dtrace for a half hour yesterday [11:24:33] so we can actually kill ms7 after today [11:24:36] and while I saw some math and other crap I saw not one captcha [11:24:40] no we can't [11:24:54] those math are from cached pages, we can kill it after those expire [11:25:05] ah math urls have changed? [11:25:05] couple of weeks? [11:25:07] but soon [11:25:11] uh huh [11:25:19] moved to top level [11:25:19] will they expire? [11:25:28] this happened over a month ago [11:25:32] does mediawiki realize this? [11:25:46] the URL change [11:25:47] or is this another case of squids revalidating them and mediawiki saying sure, unchanged! [11:25:57] that I can't tell you [11:26:04] people need to think about $wgCacheEpoch :P [11:27:33] the hash is the same, we could do 301s [11:27:35] $wgCacheEpoch = '20110101000000'; [11:27:36] lol [11:27:43] oh myyy [11:27:48] what is that? [11:27:53] New patchset: ArielGlenn; "migration from ms7 upload6 to nas1 upload7" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26773 [11:28:03] $wgThumbnailEpoch = '20110101000000'; [11:28:22] We should probably move that forward at least a year again ;) [11:28:39] to a few weeks back, properly [11:28:42] to just after this change [11:28:47] yeah [11:28:52] what is that epoch thing? [11:29:03] it means that mediawiki considers everything cached before that date to be invalid [11:29:07] for all caches [11:29:15] ohrly [11:29:21] we use it to selectively clear the cache for certain wikis in some cases [11:29:30] I've tagged you for the two gerrit changes of config files, Reedy [11:29:57] https://gerrit.wikimedia.org/r/#/c/26773/ and https://gerrit.wikimedia.org/r/#/c/26749/ [11:30:06] when we are ready to make it happen. [11:30:13] i woke up ready [11:30:35] ah that's my problem, never really woke up [11:30:57] so captchas are not being served by ms7's webserver? are we sure about that? [11:31:06] apergos is sure about that [11:31:07] (since you're asking for review... :) [11:31:10] no, I'm not sure of anything. but [11:31:25] I watched for 1/2 hour all the gets via dtrace [11:31:27] not one [11:31:39] that's as good as I'm gonig to get [11:31:46] apergos is as sure as apergos will ever be ;) [11:31:52] no, we can have someone who knows MW to confirm :) [11:31:56] that, and private/ is not served out of upload generally [11:32:06] I'm sure 300G will be enough for captchas for like a day [11:32:21] it will be enoough for years [11:32:24] move it [11:32:27] hehe [11:32:34] i don't have time for this sillyness :P [11:32:41] I already have that move in the commit [11:32:41] heh [11:32:46] if it breaks I want to know now [11:32:49] typical me, typical mark :) [11:32:58] me chickening out I mean [11:33:04] and mark being bold [11:33:44] Changes both look sane [11:33:51] well next in the list as far as I can tell is deploy em [11:34:02] who wants to push da button? [11:34:09] you do [11:34:22] pushing the button doesn't really do that much ;) [11:35:01] hahaha [11:36:14] reedy (or someone) please merge the changes in gerrit and I will pull em to fenari and sync common file on ... meh. three separate files, so three separate sync common files [11:36:32] nah [11:36:36] just use sync-dir wmf-config [11:36:41] nice [11:36:50] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26773 [11:36:53] I shall [11:36:55] Sanity for a change! [11:36:58] heh [11:37:02] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26749 [11:37:23] You could have at least changed the commit message =D [11:37:48] :-D [11:37:51] funner this way [11:38:03] WAIT YOU DIDN'T really merge that! :-P [11:38:16] :D [11:38:21] Get out clause for when it breaks :) [11:40:38] going around now [11:40:55] shouldn't it log? I forget what logs which [11:41:31] !log running sync-dir wmf-config to migrate media etc from ms7 upload6 to nas1 upload7 [11:41:41] yeah it should do.. and you can append a summary [11:41:42] Logged the message, Master [11:41:58] oh I ran it with a message [11:42:05] I just don't see it logging in some channel [11:42:11] !log ariel synchronized wmf-config 'migrate media etc from ms7 upload6 to nas1 upload7' [11:42:18] thanks :-/ [11:42:22] Logged the message, Master [11:42:27] 199 and 281 fail [11:42:32] as expected [11:42:44] 83 PHP Warning: opendir(/mnt/upload7/private/captcha) [function.opendir]: failed to open dir: No such file or directory in /usr/loc [11:42:44] al/apache/common-local/php-1.20wmf12/extensions/ConfirmEdit/FancyCaptcha.class.php on line 103 [11:42:58] bbaaahhh [11:43:12] the number doesn't seem to be increasing thought [11:43:23] oh, doubled [11:43:34] well I verified that the dir is on the netapp [11:43:35] hmm [11:43:47] got a host for that? [11:43:50] reedy@srv190:~$ cd /mnt/upload7/private/captcha [11:43:50] -bash: cd: /mnt/upload7/private/captcha: No such file or directory [11:43:57] uh oh [11:44:06] no private folder [11:44:19] ah [11:44:28] eh? [11:44:28] it's upload7/images/ [11:44:29] bah [11:44:34] it's upload7/upload/private [11:44:47] whaaat [11:44:49] then everything's broken [11:44:51] eh? [11:44:52] rollback [11:44:54] how did we miss that [11:44:56] Actual: /mnt/upload7/upload/private/captcha [11:45:05] or actually let me just fix that [11:45:19] Looking for: /mnt/upload7/private/captcha [11:45:24] it won't be just private, it'd be for everything [11:45:27] yes [11:45:42] shall I just move that entire dir now? [11:45:44] it'll break the rsyncs [11:45:46] it should fix everything else [11:45:50] no rsyncs going [11:45:55] then I'll do that [11:46:03] ExtensionDistributor errors (expected) [11:46:05] it'll take a while [11:46:21] rollback until that happens? [11:46:22] no it won't [11:46:23] why would it [11:46:24] it's done [11:46:27] lol [11:46:46] oh right [11:46:49] !log Moved /mnt/upload7/upload/* to /mnt/upload7, removed upload /mnt/upload7/upload dir [11:46:59] thanks [11:47:00] Logged the message, Master [11:47:03] are things looking better now? [11:48:04] numbers of warnings has stopped increasing [11:48:15] good [11:48:19] ok [11:48:55] let's start that final rsync with --update and not --delete [11:49:01] and mind the path change [11:49:15] I'm watching ms7 bevahoir for a bit [11:49:23] you think it's gonna cry? [11:49:26] no [11:49:31] just want to be sure we have no more writes [11:49:31]