[00:00:49] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441
[00:01:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441
[00:03:41] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441
[00:04:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441
[00:08:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds
[00:17:53] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441
[00:18:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441
[00:21:20] * Damianz looks at Krinkle and wonders if they're ok
[00:22:11] looks pretty much like what is already in place, but puppetized.
[00:22:21] I don't use the bots project that much though
[00:23:41] Yeah, it's trying to make it less ad-hoc for when we move into restricted instances... some stuff will need refactoring later (like hard coded mysql passwords, though they depend on fixing the mysql class to write a .my.cnf file).
[00:23:58] New review: Ryan Lane; "inline comment." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/26441
[00:27:06] we have an API log now
[00:32:23] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441
[00:32:29] Ryan_Lane: do you happen to know anything about how caching works; specifically in esams; and more specifically why purging the cache via action=purge would clear pmtpa but not esams?
[00:33:05] maybe purge messages aren't working properly?
[00:33:11] it's a multicast purge
[00:33:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441
[00:33:33] any easy way to actually debug that?
[00:33:40] the multicast -> unicast -> unicast -> multicast relay may be broken
[00:34:15] that sounds... frightening!
[00:34:20] mwalker: I think mark wrote some tool for this a while back
[00:34:26] well, you can't do multicast across the WAN
[00:34:34] yep; got that :)
[00:34:35] you need a relay
[00:34:52] mwalker: hey:)
[00:35:02] mutante: *waves*
[00:35:27] mutante: did you meet kelsey hightower?
[00:35:29] mwalker: i just noticed some error related to your shell access. do you miss anything you need?
[00:35:40] jeremyb: heh, no
[00:35:53] hrmmm, how odd
[00:35:58] mutante: erm... not that I have noticed
[00:36:00] http://wikitech.wikimedia.org/view/Multicast_HTCP_purging
[00:37:42] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441
[00:38:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441
[00:38:49] jeremyb: oh, why? i see its someone at Puppet Labs
[00:39:04] mutante: you were talking about the forge
[00:39:17] i think
[00:41:19] oh yeah, maybe i have listened to his talk without remembering the name :p
[00:41:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:41:33] they kept saying how the forge has now over 500 modules
[00:41:39] heh
[00:41:43] so i searched for mediawiki
[00:47:10] Ryan_Lane: ok; so parsing of the HTCP page makes sense given what I'm seeing in the code -- as I don't have access to the cluster, how many cookies do I have to provide to you/someone to look into if the udpmcast.py script is still running?
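[The multicast -> unicast -> unicast -> multicast relay described above can be sketched roughly as follows. This is a hypothetical illustration of what a script like udpmcast.py does conceptually, not its actual code; the group address, port, and relay hostname are made-up placeholders.]

```python
import socket
import struct

# Placeholder values -- NOT the real production group/port.
MCAST_GROUP = "239.0.0.1"
MCAST_PORT = 4827

def join_multicast(group: str, port: int) -> socket.socket:
    """Return a UDP socket subscribed to `group`, ready for recvfrom()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # Join the group on all interfaces (INADDR_ANY).
    mreq = struct.pack("4sl", socket.inet_aton(group), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock

def relay_once(packet: bytes, targets, send) -> int:
    """Forward one purge datagram to each unicast target; the far end
    would re-emit it onto its own local multicast group."""
    for addr in targets:
        send(packet, addr)
    return len(targets)

# Demo with a stubbed send() so the sketch runs without a network:
seen = []
count = relay_once(b"<htcp purge datagram>",
                   [("relay-esams.example", MCAST_PORT)],
                   lambda pkt, addr: seen.append((pkt, addr)))
print(count, seen[0][1][0])
```

[Since multicast doesn't cross the WAN, each datacenter needs one of these: purges arrive on the local group, get forwarded unicast to the remote relay, which re-multicasts them for its local caches.]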
[00:47:34] this assumes I know where it is running
[00:47:48] :p always the question... supposedly bayle...
[00:48:05] really convincing that it's monitored then :D
[00:48:08] oxygen is one
[00:48:17] puppet is the source of truth
[00:48:19] don't trust wikitech
[00:48:30] linne and oxygen
[00:49:01] Ryan_Lane: Do you know what's up with gerrit and viewing changes in files that contain foreign characters (i.e. i18n files), even small updates aren't shown: https://gerrit.wikimedia.org/r/#/c/21034/7/MoodBar.i18n.php
[00:49:23] uncaught exception in javascript
[00:49:33] blank page
[00:49:37] Krinkle: no clue. ask chad
[00:49:53] also running on linne
[00:49:59] ^damnit
[00:51:56] I don't see traffic on linne
[00:53:00] I see it on oxygen being sent to linne
[00:53:53] Krinkle: used to work...
[00:53:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.665 seconds
[00:54:13] that extra line wasn't there before
[00:54:36] besides it's "wrong" (the instance name is not a FQDN and the instance is already in the table )
[00:54:39] hm
[00:54:44] oxygen is relaying to a multicast group
[00:55:05] this must not be the relay I'm looking for
[00:55:55] ah. that's for squid logging
[00:55:58] * Ryan_Lane sighs
[00:56:16] let me guess the other relay isn't puppetized
[00:56:25] Ryan_Lane: probably not :p
[00:56:32] that would make too much sense
[00:56:33] -_-
[00:57:03] maybe it's either emery or locke
[00:57:38] Hipster admin checks config management then monitoring then realises it was the guy with a beard that installed it.
[00:58:21] bayle is now tarin
[00:58:24] and runs poolcounter
[00:58:36] I have no clue where this runs
[00:58:43] it's not documented and I can't find it in puppet
[00:58:55] maybe it doesn't anymore! and that's why it's broken :)
[00:59:18] it's unlikely that it's been broken for a very long time
[00:59:33] people are pretty good about telling us when shit is broken
[01:00:28] it's not listed as down in nagios
[01:00:32] but that assumes we have a check for it
[01:00:46] if it isn't in puppet it probably doesn't have a check either ;)
[01:01:08] mwalker: the person that would know about this is mark
[01:01:24] ok; I'll ping him in the morning
[01:01:29] thanks much kindly for looking :)
[01:01:54] yw
[01:03:16] !log updating a lot of mailing list descriptions per [[BZ:37537]] / thehelpfulone
[01:03:27] Logged the message, Master
[01:29:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:41:04] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 230 seconds
[01:41:22] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 247 seconds
[01:42:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.720 seconds
[01:48:52] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[01:48:52] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[01:48:52] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[01:48:52] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[01:48:52] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[01:49:10] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 1 seconds
[01:50:22] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds
[02:17:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:22:55] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours
[02:27:22] !log LocalisationUpdate completed (1.21wmf1) at Fri Oct 5 02:27:21 UTC 2012
[02:27:37] Logged the message, Master
[02:32:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.032 seconds
[02:38:58] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[02:45:59] !log LocalisationUpdate completed (1.20wmf12) at Fri Oct 5 02:45:59 UTC 2012
[02:46:12] Logged the message, Master
[03:49:55] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[04:39:17] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[06:01:39] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[07:00:53] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100%
[07:03:26] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 1.77 ms
[07:23:52] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[07:23:52] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[07:37:34] New patchset: Dereckson; "(bug 40776) Namespace configuration for gu.wiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26747
[08:02:25] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100%
[08:05:16] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[08:29:44] mark: I'm here
[08:29:50] you said something about "instructing me" for something :P
[08:34:49] New patchset: ArielGlenn; "DO NOT MERGE YET migrate from ms7 upload6 to nas1 upload7" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26749
[08:48:59] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100%
[08:51:05] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms
[09:47:24] paravoid: nag about range requests! :)
[09:54:41] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100%
[09:57:05] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[10:00:15] will do
[10:00:36] told Tollef about the std.integer wraparound at breakfast
[10:00:44] heh
[10:00:54] I told him you worked around it but he said he'll fix it nevertheless :)
[10:00:59] thanks :)
[10:01:12] they should
[10:01:19] they actually use strtol() into an int
[10:03:36] so, what's the story with range requests? not supported at all?
[10:03:47] and video players use them to seek I presume
[10:03:56] i have to investigate it, just starting on that
[10:04:06] but as far as I know currently, varnish doesn't support them towards its backends
[10:04:15] so it can only serve them for data it already has
[10:05:34] ouch
[10:06:28] wow, NetApp is filling up quite well I see
[10:06:30] almost there
[10:09:17] but for cached objects, I don't yet understand what the difference would be between varnish/swift
[10:10:27] back in a while (in time for the window), gotta get lunch
[10:20:11] seems like the squids are not caching those videos at all
[10:41:39] and it's not working great with the esams squids either
[10:41:55] my current impression is that those videos may have worked reasonably well in pmtpa just since the content is so close by
[10:41:59] but not anywhere else
[10:45:16] back with food
[10:49:32] Apache logs are noisy this morning
[10:49:45] With swift related warnings
[10:50:07] uh oh
[10:51:29] 26/1000
[10:52:25] what's that mean?
[10:52:42] when?
[10:52:46] I restarted ms-fe2 just moments ago
[10:52:53] er, swift on ms-fe2
[10:52:58] ah
[10:53:09] That'd probably explain the transfer closed errors
[10:53:12] but it shouldn't have had an effect
[10:53:18] hmm
[10:53:30] does MW retry?
[10:53:53] and where do you see that? fluorine?
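[Context for the range-request discussion above: a Range header names a byte span, and a cache that cannot do partial backend fetches (as varnish couldn't towards its backends at the time) can only satisfy it from an already-complete cached object. A minimal, simplified sketch of the byte-range arithmetic follows -- a hypothetical helper, not varnish or squid code; multi-range requests are ignored.]

```python
def parse_range(header: str, size: int):
    """Parse a single-span HTTP Range header like 'bytes=500-999'
    against an object of `size` bytes; return (start, end) inclusive,
    or None if the header can't be satisfied. Only the 'bytes' unit
    and a single range are handled, for simplicity."""
    if not header.startswith("bytes="):
        return None
    start_s, sep, end_s = header[len("bytes="):].partition("-")
    if not sep:
        return None
    if start_s == "":                 # suffix form: last N bytes
        n = int(end_s)
        return (max(size - n, 0), size - 1) if n > 0 else None
    start = int(start_s)
    end = min(int(end_s), size - 1) if end_s else size - 1
    return (start, end) if start <= end else None

print(parse_range("bytes=0-1023", 4096))   # (0, 1023)
print(parse_range("bytes=-500", 4096))     # (3596, 4095)
print(parse_range("bytes=4000-", 4096))    # (4000, 4095)
```

[Video players seeking mid-file issue exactly the open-ended `bytes=N-` form, which is why a cache that can't fetch partial objects from its backend hurts seek performance.]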
[10:54:15] tail -n 1000 /home/wikipedia/syslog/apache.log | grep " Invalid response" | wc -l
[10:54:17] on fenari
[10:54:43] so in a little bit those should be gone
[10:54:59] Most of them are 8 minutes or so ago
[10:55:08] ok
[10:55:17] yeah, over a 3 or 4 second period
[10:55:17] sounds about right
[10:55:36] can ignore them then!
[10:55:39] I probably was too quick between depooling and restarting
[10:55:42] does it retry?
[10:55:47] MW that is
[10:56:16] we need to restart swift frontends every few days until we fix the memory leak
[10:56:26] (grumble)
[10:56:36] Probably not
[10:56:40] I guess Aaron would know
[10:57:02] okay, I'll be a lot more careful then
[10:57:34] video download from esams over ipv6 goes at an appalling rate of 300 kbyte/s
[10:57:54] *ouch*
[10:58:42] Connect time: 113 ms
[10:58:42] Request to response headers: 117 ms
[10:58:42] Request to first data byte: 118 ms
[10:58:42] Received 18948862 bytes, at 301000 bytes/s average
[10:58:42] Request to end of data: 63025 ms
[10:58:43] Total time: 63138 ms
[10:58:47] (from fenari)
[10:59:25] apergos: ok
[10:59:28] are you doing the migration?
[10:59:48] I guess
[11:00:01] i can do it too
[11:00:24] so there is the one rsync left on ms7
[11:00:36] it's in the middle of commons/temp and won't finish any time soon
[11:00:44] so I'm gonna shoot that
[11:00:50] we don't care about temp
[11:01:06] temp is... well, temp
[11:01:17] this was the one sweep to get stuff close to up to date
[11:01:25] anyways it will be close enough
[11:01:36] okay, just saying :)
[11:02:01] uh huh, it was doing upload/wikipedia/* I believe
[11:02:32] so first is: are we gonna try to do this without readonly?
[11:03:39] what's "this"? swap the MW config?
[11:03:47] upload6 for upload7?
[11:04:28] yeah, all that
[11:05:46] I just did a quick check to see if upload7 is mounted everywhere
[11:05:58] I just see a few misc hosts like fenari (different mountpoint), spence, snapshot* which don't have it
[11:06:21] only for hosts that have upload6 mounted currently
[11:06:28] dsh -Mcg mediawiki-installation "[ -d /mnt/upload6 ] && ls -d /mnt/upload7"
[11:06:30] that's what I used
[11:06:42] fenari should get it when the change to manifests/misc-servers.pp:misc::extension-distributor gets done
[11:07:08] did you check whether ext distributor actually works?
[11:07:14] i've heard it's been broken for some time
[11:07:42] I hadn't heard that so I didn't check
[11:08:06] yeah it's not loading for me
[11:08:07] so nevermind that
[11:08:24] I don't see what read-only would give you
[11:08:41] you'll swap it, then what? you can't test it, so you'll then do it rw again
[11:11:16] even if it doesn't load we probably shouldn't break the cron job, unless there's been a decision to retire the extension
[11:11:31] sure it's not critical right now but let's not make someone else's cleanup job harder later
[11:11:44] the cleanup is to have that thing rewritten
[11:11:47] that's way overdue
[11:12:33] it's broken already, I don't want to spend any more time on trying to fix it until it's rewritten as a proper service which doesn't impose dependencies on our core infrastructure
[11:12:49] not trying to "fix it", just maintaining the status quo
[11:13:13] Depending on what's up with it, it shouldn't take much to kick it back to life
[11:13:26] it shouldn't take much to rewrite it
[11:13:47] In theory, it's mostly not needed though
[11:15:10] as long as it's depending on an NFS share on all app servers, it's needed :)
[11:15:55] mark: briefly chatted with Tollef
[11:16:40] no easy way around that limitation, varnish doesn't support caching partial objects
[11:16:45] yeah I know
[11:16:47] I don't think that snaps need it for anything (if they do I'll deal with it later but I can't imagine why)
[11:16:47] so range on the backend doesn't really work
[11:16:57] can it do it in pass mode?
[11:17:02] he suggested m3u8, but that's a much larger change
[11:17:03] I don't think it can, but that would already help
[11:17:12] sorry, pipe mode
[11:17:31] I wonder why spence has the upload6 mount
[11:17:32] hehe, I thought of pipe too, but didn't have a chance to ask yet
[11:17:37] the next presentation started
[11:17:47] anything interesting?
[11:18:00] phk's talk was very nice
[11:18:24] there was a KPN talk, nothing too interesting, another completely boring out-of-place talk
[11:18:28] and now the VGNETT talk
[11:18:52] which should be interesting, since varnish was created for them
[11:18:55] yes
[11:19:35] I'm going to declare that spence doesn't need it so we can move on
[11:19:46] yes
[11:20:08] heh
[11:20:47] Reedy: do you know what the status of work on Captcha is?
[11:20:48] I'll prep the CommonSettings and InitialiseSettings changes and shove em out to gerrit
[11:21:59] Nope.. I don't think i've seen anyone working on it
[11:23:24] this would be /mnt/upload7/private/captcha (after the move), lemme make sure we copied it over
[11:23:38] and it's served out of private so that should all be mw doing the lifting, no need for ms7
[11:23:43] 's webserver
[11:23:56] so it's not pulled from http://upload. ?
[11:24:20] good
[11:24:27] well I watched dtrace for a half hour yesterday
[11:24:33] so we can actually kill ms7 after today
[11:24:36] and while I saw some math and other crap I saw not one captcha
[11:24:40] no we can't
[11:24:54] those math requests are from cached pages, we can kill it after those expire
[11:25:05] ah math urls have changed?
[11:25:05] couple of weeks?
[11:25:07] but soon
[11:25:11] uh huh
[11:25:19] moved to top level
[11:25:19] will they expire?
[11:25:28] this happened over a month ago
[11:25:32] does mediawiki realize this?
[11:25:46] the URL change
[11:25:47] or is this another case of squids revalidating them and mediawiki saying sure, unchanged!
[11:25:57] that I can't tell you
[11:26:04] people need to think about $wgCacheEpoch :P
[11:27:33] the hash is the same, we could do 301s
[11:27:35] $wgCacheEpoch = '20110101000000';
[11:27:36] lol
[11:27:43] oh myyy
[11:27:48] what is that?
[11:27:53] New patchset: ArielGlenn; "migration from ms7 upload6 to nas1 upload7" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26773
[11:28:03] $wgThumbnailEpoch = '20110101000000';
[11:28:22] We should probably move that forward at least a year again ;)
[11:28:39] to a few weeks back, properly
[11:28:42] to just after this change
[11:28:47] yeah
[11:28:52] what is that epoch thing?
[11:29:03] it means that mediawiki considers everything cached before that date to be invalid
[11:29:07] for all caches
[11:29:15] ohrly
[11:29:21] we use it to selectively clear the cache for certain wikis in some cases
[11:29:30] I've tagged you for the two gerrit changes of config files, Reedy
[11:29:57] https://gerrit.wikimedia.org/r/#/c/26773/ and https://gerrit.wikimedia.org/r/#/c/26749/
[11:30:06] when we are ready to make it happen.
[11:30:13] i woke up ready
[11:30:35] ah that's my problem, never really woke up
[11:30:57] so captchas are not being served by ms7's webserver? are we sure about that?
[11:31:06] apergos is sure about that
[11:31:07] (since you're asking for review... :)
[11:31:10] no, I'm not sure of anything. but
[11:31:25] I watched for 1/2 hour all the gets via dtrace
[11:31:27] not one
[11:31:39] that's as good as I'm going to get
[11:31:46] apergos is as sure as apergos will ever be ;)
[11:31:52] no, we can have someone who knows MW to confirm :)
[11:31:56] that, and private/ is not served out of upload generally
[11:32:06] I'm sure 300G will be enough for captchas for like a day
[11:32:21] it will be enough for years
[11:32:24] move it
[11:32:27] hehe
[11:32:34] i don't have time for this silliness :P
[11:32:41] I already have that move in the commit
[11:32:41] heh
[11:32:46] if it breaks I want to know now
[11:32:49] typical me, typical mark :)
[11:32:58] me chickening out I mean
[11:33:04] and mark being bold
[11:33:44] Changes both look sane
[11:33:51] well next in the list as far as I can tell is deploy em
[11:34:02] who wants to push da button?
[11:34:09] you do
[11:34:22] pushing the button doesn't really do that much ;)
[11:35:01] hahaha
[11:36:14] reedy (or someone) please merge the changes in gerrit and I will pull em to fenari and sync common file on ... meh. three separate files, so three separate sync common files
[11:36:32] nah
[11:36:36] just use sync-dir wmf-config
[11:36:41] nice
[11:36:50] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26773
[11:36:53] I shall
[11:36:55] Sanity for a change!
[11:36:58] heh
[11:37:02] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26749
[11:37:23] You could have at least changed the commit message =D
[11:37:48] :-D
[11:37:51] funner this way
[11:38:03] WAIT YOU DIDN'T really merge that! :-P
[11:38:16] :D
[11:38:21] Get out clause for when it breaks :)
[11:40:38] going around now
[11:40:55] shouldn't it log? I forget what logs which
[11:41:31] !log running sync-dir wmf-config to migrate media etc from ms7 upload6 to nas1 upload7
[11:41:41] yeah it should do.. and you can append a summary
[11:41:42] Logged the message, Master
[11:41:58] oh I ran it with a message
[11:42:05] I just don't see it logging in some channel
[11:42:11] !log ariel synchronized wmf-config 'migrate media etc from ms7 upload6 to nas1 upload7'
[11:42:18] thanks :-/
[11:42:22] Logged the message, Master
[11:42:27] 199 and 281 fail
[11:42:32] as expected
[11:42:44] 83 PHP Warning: opendir(/mnt/upload7/private/captcha) [function.opendir]: failed to open dir: No such file or directory in /usr/local/apache/common-local/php-1.20wmf12/extensions/ConfirmEdit/FancyCaptcha.class.php on line 103
[11:42:58] bbaaahhh
[11:43:12] the number doesn't seem to be increasing though
[11:43:23] oh, doubled
[11:43:34] well I verified that the dir is on the netapp
[11:43:35] hmm
[11:43:47] got a host for that?
[11:43:50] reedy@srv190:~$ cd /mnt/upload7/private/captcha
[11:43:50] -bash: cd: /mnt/upload7/private/captcha: No such file or directory
[11:43:57] uh oh
[11:44:06] no private folder
[11:44:19] ah
[11:44:28] eh?
[11:44:28] it's upload7/images/
[11:44:29] bah
[11:44:34] it's upload7/upload/private
[11:44:47] whaaat
[11:44:49] then everything's broken
[11:44:51] eh?
[11:44:52] rollback
[11:44:54] how did we miss that
[11:44:56] Actual: /mnt/upload7/upload/private/captcha
[11:45:05] or actually let me just fix that
[11:45:19] Looking for: /mnt/upload7/private/captcha
[11:45:24] it won't be just private, it'd be for everything
[11:45:27] yes
[11:45:42] shall I just move that entire dir now?
[11:45:44] it'll break the rsyncs
[11:45:46] it should fix everything else
[11:45:50] no rsyncs going
[11:45:55] then I'll do that
[11:46:03] ExtensionDistributor errors (expected)
[11:46:05] it'll take a while
[11:46:21] rollback until that happens?
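[The missing /mnt/upload7/private/captcha path above was only discovered after the config went live: content sat under an extra upload/ level on the new mount. A pre-switchover sanity check along the following lines would have caught it before deploy. The helper and the expected-subdirectory list are illustrative assumptions, not an existing tool.]

```python
import os
import tempfile

# Subdirectories MediaWiki expects directly under the upload mount;
# an assumed, abbreviated list for illustration only.
EXPECTED_SUBDIRS = ["private/captcha", "wikipedia"]

def missing_paths(mount_root: str, expected=EXPECTED_SUBDIRS):
    """Return the expected subdirectories that do not exist under
    mount_root. An empty list means the mount looks safe to switch to."""
    return [p for p in expected
            if not os.path.isdir(os.path.join(mount_root, p))]

# Demo against a throwaway tree that reproduces the incident:
# everything landed one level down, under <root>/upload/.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "upload", "private", "captcha"))
os.makedirs(os.path.join(root, "upload", "wikipedia"))
print(missing_paths(root))                          # both missing at the top level
print(missing_paths(os.path.join(root, "upload")))  # correct root -> []
```

[Run across the app-server fleet (e.g. via dsh, as the mount check earlier in the log was), this turns a post-deploy PHP warning storm into a pre-deploy no-go.]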
[11:46:22] no it won't
[11:46:23] why would it
[11:46:24] it's done
[11:46:27] lol
[11:46:46] oh right
[11:46:49] !log Moved /mnt/upload7/upload/* to /mnt/upload7, removed upload /mnt/upload7/upload dir
[11:46:59] thanks
[11:47:00] Logged the message, Master
[11:47:03] are things looking better now?
[11:48:04] number of warnings has stopped increasing
[11:48:15] good
[11:48:19] ok
[11:48:55] let's start that final rsync with --update and not --delete
[11:49:01] and mind the path change
[11:49:15] I'm watching ms7 behaviour for a bit
[11:49:23] you think it's gonna cry?
[11:49:26] no
[11:49:31] just want to be sure we have no more writes
[11:49:31] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[11:49:31] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[11:49:31] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[11:49:31] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[11:49:31] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[11:49:54] hahahahahaha
[11:50:49] brb lunch
[11:50:58] enjoy
[11:51:02] I'm eating mine still
[11:51:15] am I the only one who's hungry for this migration?
[11:52:58] I ate just before too
[11:54:12] looks like we are in good shape
[11:54:48] so we probably lost a few originals while we had the switchover done but the directory wasn't right?
[11:54:54] can we sync those with the php sync script?
[11:55:08] yeah that script will catch any inconsistencies
[11:55:22] how nice
[11:56:01] yes it is a lifesaver
[11:56:15] no
[11:56:17] but it is an image saver
[11:56:38] (in more ways than one)
[11:58:12] What do we want to do with the cache epochs?
[11:58:45] I think we should try to set that to just after the change date
[11:58:55] if that results in too many problems, I believe we can put it back, right
[11:59:12] please do it, fundraising is having other issues as well
[11:59:16] so it's about time we up that
[11:59:23] ok, will do in a few minutes
[11:59:26] What was the change date?
[12:00:30] i'm not sure about that
[12:00:43] "over a month ago" is what paravoid says
[12:00:57] and there was a fundraising change around 8/9, so it should be no earlier than that either
[12:01:12] rsync -PrltDp --ignore-existing --stats --update --include-from=/etc/rsync.includes /export/upload/ /mnt right?
[12:01:26] why --ignore-existing?
[12:01:40] ah yer right
[12:02:02] paths look sane?
[12:02:07] I never know that
[12:02:11] rsync always surprises me
[12:02:12] hahaha
[12:02:40] yes seems right
[12:02:48] an additional / at the source means "copy the contents of this dir"
[12:02:59] uh huh
[12:03:23] ok well I can shoot it if it starts copying piles of crap right away
[12:03:33] go ahead
[12:04:36] !log started one last rsync op upload/ from ms7 to nas1 with --update
[12:04:38] !log reedy synchronized wmf-config/CommonSettings.php 'Update wgCacheEpoch and wgThumbnailEpoch to the start of 2012 (+ 1 year)'
[12:04:47] Logged the message, Master
[12:04:58] Logged the message, Master
[12:05:01] Reedy: i'd like to raise it to about a month ago
[12:05:05] but let's first see what this does
[12:05:47] I was just thinking of using it as an intermediary
[12:06:24] intermediary?
[12:06:32] a step
[12:06:39] ok
[12:06:53] i don't think we'll have much cached from before 2012
[12:06:55] at least I would hope not ;)
[12:06:57] indeed
[12:06:59] exactly!
[12:07:07] if we work out a date, I'll change it in a couple of hours
[12:07:21] in a couple hours? can't do it now?
[12:07:22] we have a window now
[12:07:28] and also i'll be gone after 4.30 pm
[12:07:30] my time
[12:07:37] long window is long
[12:07:42] heh
[12:07:53] what date?
[12:07:57] that we need to work out
[12:08:03] whenever the math change was synced out...
[12:08:10] I have no idea how to check that really
[12:08:21] 2nd October
[12:08:22] 09:06 paravoid: squid deploy all: switching math & timeline to be served from swift
[12:08:30] so after that
[12:08:41] 3 days ago?
[12:08:48] er no then
[12:09:06] paravoid should be back after lunch
[12:09:10] he can probably tell us more
[12:09:34] there's logs of doing copies on 19th september
[12:09:46] still time to get it done today though
[12:10:40] https://gerrit.wikimedia.org/r/#/q/math,n,z
[12:11:20] added file backend support, right
[12:11:48] but when was this synced out
[12:12:18] september 7 also I think
[12:12:19] according to SAL
[12:12:37] i'm thinking that 9/8 would be a good cache epoch date
[12:12:41] it's about a month ago
[12:12:48] should not leave us with empty caches
[12:12:53] but will fix the fundraising and math issues
[12:13:10] sounds reasonable
[12:13:17] We can always move it forward again in a week or 2
[12:13:25] if necessary yeah
[12:13:26] Date: Thu Sep 6 12:06:38 2012 -0700
[12:13:28] or backwards if we have load issues now
[12:13:30] shows the merge on fenari
[12:14:03] i'm not seeing effects of your earlier raise
[12:14:08] not much anyway
[12:14:42] wanna raise it now?
[12:14:42] good! :p
[12:15:58] 8th September?
[12:16:05] we could do steps again
[12:16:07] first 8/9
[12:16:13] that's the fundraising change
[12:16:28] Thumbnail too?
[12:16:35] no
[12:16:40] not needed I think
[12:16:45] it's for banners or something
[12:17:31] !log reedy synchronized wmf-config/CommonSettings.php 'wgCacheEpoch to 20120908000000'
[12:17:41] Logged the message, Master
[12:17:55] i meant 20120809 hehe
[12:17:56] oh well
[12:18:22] if we're fine we're fine
[12:18:52] :-D
[12:19:16] load is slightly up
[12:21:18] apergos: that final rsync should take a few hours I think?
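[How the $wgCacheEpoch mechanism discussed above works: MediaWiki timestamps are YYYYMMDDHHMMSS strings, so "cached before the epoch means invalid" reduces to a plain string comparison. A simplified sketch, not MediaWiki's actual code:]

```python
def is_stale(cached_at: str, epoch: str) -> bool:
    """True if an object cached at `cached_at` predates the cache epoch.
    Both arguments are MediaWiki-style TS_MW timestamps (YYYYMMDDHHMMSS),
    for which lexicographic order equals chronological order."""
    return cached_at < epoch

EPOCH = "20120908000000"  # the value synced out in the log above
print(is_stale("20120815120000", EPOCH))  # cached mid-August -> stale
print(is_stale("20121001090600", EPOCH))  # cached in October -> still valid
```

[This is why bumping the epoch forward selectively invalidates everything older without touching newer cache entries -- and why moving it too far forward at once risks emptying the caches and a load spike.]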
[12:21:25] lots longer
[12:21:40] normally takes a few hours
[12:21:43] doesn't matter
[12:21:44] to ms1002
[12:21:50] to ms1002?
[12:21:54] yes
[12:22:15] on an in-sync replica I mean
[12:23:17] well when my daily rsyncs from ms8 to ms1001 were taking most of a day (figure 30gb of data but lots of files to check)
[12:23:24] s/when//
[12:23:45] http://ganglia.wikimedia.org/latest/?c=MySQL%20pmtpa&h=pc1.pmtpa.wmnet&m=load_one&r=hour&s=by%20name&hc=4&mc=2
[12:23:51] the parser cache is taking a bit of a hit
[12:23:56] but looks fine
[12:24:16] hmm
[12:24:28] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours
[12:24:46] yeah looks well in bounds, just a small jump
[12:26:04] yeah aborted clients is nothing out of the ordinary either
[12:26:34] good
[12:34:32] argh
[12:34:38] ownerships are totally different on the netapp
[12:37:02] bbuuttt
[12:37:18] *sigh*
[12:38:46] apergos: can you restart rsyncs with -o and -g?
[12:38:53] uh huh
[12:39:14] hmm
[12:39:17] perhaps we should rollback
[12:39:22] that bad?
[12:39:31] well
[12:39:38] who knows what's breaking right now
[12:39:49] ownerships are fucked up all over the place
[12:40:00] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[12:40:24] deletes and moves would work in swift but not on the netapp
[12:40:32] reads would be fine of course
[12:40:38] you mean the sync script can fix it up?
[12:41:07] will it delete files and such?
[12:41:08] it would catch those, but we have to fix ownerships on all the dirs in the meantime
[12:41:18] does it not assume stuff did work?
[12:41:42] I believe it will toss stuff on the flat filesystem that shouldn't be there
[12:41:54] along with its other fixes
[12:42:02] alright then
[12:42:10] as long as the changes took place in swift, we should verify that
[12:44:09] have you restarted the rsync yet?
[12:44:51] yes
[12:45:14] what is it working on now?
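[A reconstruction of the rsync invocation with the missing -o/-g flags folded in. The flag meanings are real rsync semantics; the exact command line is an assumption pieced together from the log, not a verified production command.]

```python
import shlex

# -o/-g preserve owner and group -- the options whose absence left
# everything owned root:root on the NetApp. The trailing slash on the
# source means "copy the contents of this directory", per rsync's rules.
args = [
    "rsync",
    "-PrltDp", "-o", "-g",   # owner/group preservation added to the original set
    "--stats",
    "--update",              # don't clobber files newer on the destination
    "--include-from=/etc/rsync.includes",
    "/export/upload/",       # trailing slash: contents, not the dir itself
    "/mnt",
]
print(shlex.join(args))
```

[Note that -o and -g only take effect when the receiving side runs as root (or, for -g, has the right group memberships), which matches these root-driven server-to-server copies.]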
[12:45:37] New review: Demon; "The RT stuff is probably going to be spun out to a plugin and I've considered doing the same thing f..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/26040
[12:45:37] some math dir that's worthless
[12:45:48] but we don't have to do ownership of all the files to fix this up I don't think
[12:45:52] just the dirs
[12:46:01] ok
[12:46:04] let's run some find commands then
[12:46:05] of which for the images it's two levels per project right?
[12:46:07] what should we set it to?
[12:46:28] in other words wikipedia/xx/yy/xx that far down
[12:46:42] we'll have to do wiki*/xx/archive/ separately
[12:46:48] I don't remember the layout of those
[12:48:03] you know these dirs 0/00 0/01 etc are all 777 on ms7
[12:48:30] same with the netapp
[12:48:33] running now
[12:48:47] why archive separately?
[12:49:12] ah it's also two levels
[12:49:18] there's something that's three, I forget what
[12:49:25] what does it matter what the structure is?
[12:49:49] if you don't want to do a stat on several million files
[12:50:30] you could set maxdepth
[12:50:36] good idea
[12:52:55] back
[12:53:54] what did I miss? I saw a couple of notifies in the awaylog, but they may be stale
[12:54:12] problems with deletion
[12:54:29] see wikitech-l (short scrollback)
[12:54:51] (sorry, lunch times are not really in my control here at the event)
[12:54:58] file permissions are totally fucked up on the netapp
[12:55:00] since they were not rsynced
[12:55:02] you're not really on the hook here anyways
[12:55:44] i'm wondering if we're not better off rolling back
[12:55:46] perms actually did go over, but not ownerships.
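[The "set maxdepth" idea above: fix ownership on just the hashed directory levels without stat()ing millions of files. A sketch of the `find root -mindepth 1 -maxdepth N -type d` traversal in Python; the actual chown is left out so the demo is side-effect free, and the depth limit of 2 mirrors the two-level wikipedia/xx/y/yz layout discussed in the log.]

```python
import os
import tempfile

def dirs_to_depth(root: str, maxdepth: int):
    """Yield directories at most `maxdepth` levels below `root`,
    like `find root -mindepth 1 -maxdepth N -type d`, pruning the
    walk so deeper levels are never even listed."""
    root = root.rstrip(os.sep)
    base = root.count(os.sep)
    for dirpath, dirnames, _files in os.walk(root):
        depth = dirpath.count(os.sep) - base
        if depth >= maxdepth:
            dirnames[:] = []      # prune: os.walk won't descend further
        if depth >= 1:
            yield dirpath

# Demo on a throwaway tree shaped like the hashed upload layout:
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "wikipedia", "en", "d", "d1"))
found = sorted(os.path.relpath(d, root) for d in dirs_to_depth(root, 2))
print(found)   # only the first two levels are visited
```

[Pruning `dirnames` in place is what keeps this cheap: os.walk never enters the pruned subtrees, so the millions of leaf files below them are never listed at all.]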
[12:55:51] right
[12:57:07] oh joy
[12:59:04] i think we should either rollback, or temporarily turn off writing to the netapp
[12:59:07] this is gonna take too long
[12:59:17] rolling back is probably better
[12:59:21] there's not much of a downside to it
[12:59:45] if it makes any difference, I agree
[13:00:13] this does mean that the rsyncing will get pretty fucked up though
[13:00:28] ms7 is now out of date
[13:00:32] we'll have to rely on the php script then
[13:05:28] however we may have to do that now too
[13:05:32] since the netapp is probably fucked up too
[13:10:53] ok got another faster find running to fix up just the typical dirs
[13:11:07] ok, and I just did archive for en wp
[13:18:41] and now I started an even faster for loop which doesn't need to scan directories
[13:18:50] yay
[13:20:18] so much junk in there
[13:21:26] well the bad news is I just found out that we need to set the ownership on all those files
[13:21:33] really dumb, delete should just need the dir perms but
[13:21:39] how come?
[13:22:05] I tested one delete with ownership of all the parent dirs fixed but not the file (fail) and another with
[13:22:18] the ownership of the file itself fixed (success)
[13:22:19] no idea
[13:22:27] but that sucks majorly
[13:23:23] here's the file (with failure):
[13:23:30] -rw-r--r-- 1 root root 627710 2006-10-23 16:26 /mnt/originals/wikipedia/en/d/d1/Tabemas.jpg
[13:23:43] publicly readable >_< and yet...
[13:23:58] Error deleting file: Could not read or write file "mwstore://local-multiwrite/local-public/d/d1/Tabemas.jpg" due to insufficient permissions or missing directories/containers [13:25:33] ok so [13:25:36] let's rollback to ms7 [13:25:38] so at this point [13:25:39] yeah [13:25:40] run the sync script [13:25:42] dangit [13:25:48] after that finishes, rsync over properly [13:25:50] then try again [13:26:11] this was a big fail on our part [13:26:16] how did we not notice that [13:26:30] ok, I don't know how to run aaron's script so we'll have to wait for him for that [13:26:45] argh [13:26:45] really [13:26:52] will you do the rollback? [13:26:55] yep [13:29:47] New patchset: ArielGlenn; "rollback ms7 to nas1 upload6/upload7 changes, need perms/owners fixup" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26788 [13:29:54] Reedy: still around? [13:30:02] do it yourself [13:30:32] if he responds in a minute I prefer he check it and gerritmerge [13:30:37] if he doesn't then I'll proceed anyways [13:30:39] you didn't just use gerrit rollback? [13:30:57] no [13:31:06] why not? [13:31:13] can I do that with multiple commits? [13:31:17] of course [13:31:25] ok well didn't know that [13:33:08] New patchset: Mark Bergsma; "Revert "DO NOT MERGE YET migrate from ms7 upload6 to nas1 upload7"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26790 [13:33:22] New patchset: Mark Bergsma; "Revert "migration from ms7 upload6 to nas1 upload7"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26791 [13:33:42] wait so you're doing them? [13:33:54] well I was waiting for you to do it but nothing happened [13:34:05] fine I'll abandon it [13:34:21] go ahead and merge those 2 changes [13:34:23] Change abandoned: ArielGlenn; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26788 [13:34:42] no need to wait on a review for a rollback [13:35:28] lol [13:35:55] apergos? [13:36:05] yes?
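The multi-commit revert mentioned above is plain git underneath: git revert accepts several commits and, with --no-commit, stages all of the reversions so they can be recorded as one commit. A minimal sketch against a throwaway repository (commit messages and file names are placeholders):

```shell
# Throwaway repo; commit messages and file names are placeholders.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email you@example.com
git config user.name you
echo a > f; git add f; git commit -qm base
echo b > f; git commit -qam change1
echo c > f; git commit -qam change2

# Revert both commits in one pass (newest first), then record a single
# revert commit covering them:
git revert --no-commit HEAD HEAD~1
git commit -qm "Revert change1 and change2"
final=$(cat f)
echo "$final"   # contents are back to the base commit: a
```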
[13:36:10] for christ sake merge it [13:36:13] the more you talk at me the less I can actually get it done [13:36:41] Change merged: ArielGlenn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26791 [13:37:06] Change merged: ArielGlenn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26790 [13:38:24] Reedy: do you know where this sync script lives for filebackend? [13:38:45] syncFileBackend.php [13:38:47] in maintenance [13:38:50] maintenance/syncFileBackend.php ? [13:38:53] that's probably it yeah [13:38:58] there's some log file stuff he feeds to it [13:39:22] apergos: so are you syncing that out? [13:39:22] Journal positions [13:39:41] yes [13:39:43] ok [13:40:41] the more I rush, the more typos etc I make [13:40:52] however it is going around now [13:42:03] !log ariel synchronized wmf-config 'rollback ms7 to nas1 upload6 to upload7 migration' [13:42:14] Logged the message, Master [13:42:28] I think we can wait on the sync; from what aaron was telling me yesterday the front end will be fine (now that we don't have problems with the ownership causing things to outright break) [13:42:47] even if the back end is not in sync with the front end [13:42:56] yeah [13:43:10] mwscriptwikiset maintenance/RecordFileBackendPos.php all.dblist --src local-multiwrite --posdir /home/aaron/backend-prenetapp-pos --days 10 [13:44:27] that's serious overkill for us right [13:44:30] 10 days back [13:44:34] we want like one [13:46:22] let's do syncing during the weekend [13:46:26] then cutover again afterwards [13:46:29] monday or tuesday [13:46:58] tuesday cause monday is a wmf holiday and I'd like to have reedy around again [13:47:21] Just to let you know that User:GrahamHardy posted on w:en:VPT about a problem related to the media migration: 'Could not read or write file "mwstore://local-multiwrite/local-public/6/66/SweeneyAstray.jpg" due to insufficient permissions or missing directories/containers' [13:47:29] yep [13:47:29] PROBLEM -
Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:48] Richardguk: can you see if the user is still having that problem? [13:48:25] i've posted a reply to VPT and linked to the wikitech-l announcement Ariel posted yesterday [13:48:38] have asked if it's ongoing [13:48:47] !log stopping puppet on brewster [13:48:58] Logged the message, notpeter [13:50:38] I'm leaving the rsync --update to run for now, I'll check in with aaron later and we'll coordinate all the pieces *sigh* [13:50:59] guess I'd better send some email [13:51:04] i'm thinking of reverting varnish to squid now too [13:51:05] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [13:51:11] i'm pretty done with this week [13:51:22] i don't want to debug yet another evening [13:51:42] GrahamHardy has replied at VPT saying upload is ok now: http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Picture_upload_problems_.3F [13:51:48] mark: while you're still about I have a question or two. [13:52:14] shoot [13:52:24] the first is: I took srv191 out of the bits backends array in puppet, but after a full day it's still getting traffic. why would that be? [13:52:38] would love for traffic to be 0 before test-upgrading... [13:53:05] i don't know [13:53:09] is it still in the varnish configs? [13:53:10] fair enough [13:53:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.036 seconds [13:54:29] !log Rollback of upload-lb.eqiad; sending traffic back to pmtpa [13:54:39] Logged the message, Master [13:54:42] they shouldn't be... but I can poke around more closely. [13:54:53] a quick check should tell you eh [13:54:59] yeah [13:55:15] was just wondering if something popped to mind for you [13:55:58] the second thing is: can you check the switch in eqiad that the mc* hosts are plugged into and verify that all of the ports are configured?
[13:56:08] only the first 8 are [13:56:31] ah. would you be willing to enable all 16? [13:56:38] no there are no cables yet [13:56:58] could you at least do 14? that has some kinda test cable [13:57:27] https://rt.wikimedia.org/Ticket/Display.html?id=3587 [13:57:43] 14 has no cable atm [13:57:50] and iirc, it didn't work before [13:57:57] hrm, ok. [13:58:08] will wait until robh is onsite [13:58:14] robh has no cables either ;) [13:58:20] blarg. [13:58:25] I mean [13:58:26] you see, simple stupid shit like this is why we postponed that switchover [13:58:27] hurray.... [13:58:35] ugh. [13:58:41] the cables we ordered didn't work either [13:58:46] the previous ones didn't [13:58:48] :/ [13:58:54] the sfps on the other end were not accepted by the NICs [13:59:09] that's..... frustrating. to say the least. [13:59:27] so nevermind until next week [13:59:30] okie dokie [13:59:33] and even then we'll probably have to order something [14:00:18] ok. I was not 100% clear on the status of this, so I was going to try to image a couple this morning, but if it's a no-go, then I can do it whenever. [14:00:53] !log restarting puppet on brewster [14:01:04] Logged the message, notpeter [14:17:09] New patchset: Ottomata; "Installing MongoDB on stat1 for Aaron Halfaker." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26799 [14:17:26] meester markers, if you got a sec I would love a review of that one [14:17:47] particularly, i'm not sure if you'd rather I made a more generic MongoDB module or something [14:17:51] or found one somewhere [14:17:58] this is probably all we need for stat1 though [14:18:07] mark ^ [14:18:12] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26799 [14:26:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:32:48] ottomata: sorry, no time today, I have to leave now [14:34:09] ok thanks anyway, it is a pretty simple change (install package, start service), i'll ask someone else to review, but would love for your comments in post review if you get a chance on monday or tuesday or something [14:39:59] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [14:42:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.223 seconds [15:15:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:22:54] New review: Reedy; "ohnoes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26725 [15:25:31] New review: Reedy; "Heracy!" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/26799 [15:31:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds [16:02:51] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [16:03:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:12:27] Is the Fatal exception of type MWException on Commons by clicking https://commons.wikimedia.org/w/index.php?title=Special:UserLogin&type=signup&returnto=Special:UserLogin known? [16:17:50] heya notpeter, could I bug you for a review? [16:17:59] https://gerrit.wikimedia.org/r/#/c/26799/ [16:19:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.456 seconds [16:42:59] Anyone with root about? [16:44:55] yeah [16:46:30] Jeff_Green: can you run on fenari: cd /mnt/upload6/private/captcha && ln -s .
captcha-render [16:50:37] done [16:51:02] cheers [16:51:04] are you working on the partition 99% issue? [16:51:27] nope, signup is broken on all projects for anons [16:51:49] 99% /mnt/upload6 [16:51:54] oh [16:51:58] csteipp: ^^ [16:52:23] Hmm, 30T 30T 317G [16:52:26] 317GB free... [16:52:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:57:44] Change abandoned: MaxSem; "Whee" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17964 [16:59:13] yeah 317gb [16:59:21] it will last til the next attempt [17:06:26] !log reedy synchronized wmf-config/ [17:06:38] Logged the message, Master [17:07:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.607 seconds [17:08:50] New patchset: Andrew Bogott; "Generate an admin password during mediawiki install." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26812 [17:09:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26812 [17:14:26] New patchset: Ottomata; "Fixing orange ivory coast log for Wikipedia Zero" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26814 [17:15:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26814 [17:15:42] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26814 [17:17:03] New review: Andrew Bogott; "Ryan, I welcome your comments on this. It works fine, but maybe there's an existing pattern for doi..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/26812 [17:17:38] !log reedy synchronized php-1.21wmf1/extensions/ConfirmEdit/FancyCaptcha.class.php 'New memcached key' [17:17:49] Logged the message, Master [17:21:40] New review: MaxSem; "It's too webscale for us!" 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/26799 [17:22:31] !log Live hacked 1.21wmf1 ConfirmEdit/FancyCaptcha.class.php cache key for directory list. Needs reverting after 24 hours or so [17:22:41] Logged the message, Master [17:24:54] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [17:24:54] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [17:32:57] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26812 [17:42:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.086 seconds [17:57:17] Ryan_Lane: so what's wrong with zfs? :) [17:57:26] in linux? [17:57:34] yes [17:57:38] * apergos eyerolls [17:59:42] apergos: what about it? [17:59:53] I don't even like it on solaris [18:00:03] too fragile [18:00:32] if you use replication anyways [18:01:26] this issue list https://github.com/zfsonlinux/zfs/issues doesn't inspire confidence though [18:01:41] hmm, seems like the native linux port is still 0.x and not fully stable [18:05:19] yes [18:05:32] also, it's owned by oracle [18:05:35] same with lustre [18:05:43] yeah I saw :p [18:05:46] both of those things make it kind of terrifying to use [18:05:48] eerrgghh [18:05:54] they were both sun products [18:06:00] I see lots of Oracle signs and buses in the city...they own everything! :) [18:06:02] so, oracle keeping them forever is unlikely [18:06:21] AaronSchulz: Including Howard Street between 3rd and 4th *shakes fist* [18:06:34] * AaronSchulz likes the oracle ad on a bus about it being "the cheapest" in addition to other stuff [18:07:06] RoanKattouw: I mean the bus had no one but the driver in it ;) [18:07:11] oh, are they doing their yearly convention? 
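For the record, the `ln -s . captcha-render` fix requested earlier works because a symlink named captcha-render that points at its own directory makes legacy paths like .../captcha/captcha-render/foo.png resolve to .../captcha/foo.png. A throwaway demonstration (file names are placeholders):

```shell
# Throwaway demo; file names are placeholders.
d=$(mktemp -d)
mkdir "$d/captcha"
touch "$d/captcha/image_ab12.png"
( cd "$d/captcha" && ln -s . captcha-render )

# The direct path and the legacy captcha-render/ path reach the same file:
ok=""
[ -f "$d/captcha/image_ab12.png" ] &&
  [ -f "$d/captcha/captcha-render/image_ab12.png" ] && ok=yes
echo "$ok"   # -> yes
```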
[18:07:12] haha [18:07:29] all windows were covered by the ads [18:08:00] apergos: Yeah, and VMWorld was a few weeks ago. So they're doing their annual closure of Howard (between 3rd and 4th, where the two Moscone buildings are across the street from each other) again, which screws up traffic on my commute [18:08:34] boooo [18:08:42] makes you love em even more [18:08:43] (And they also randomly closed a block of Taylor for a party last week, but that doesn't affect me) [18:08:44] Ryan_Lane: are you terrified of mysql now? [18:08:56] AaronSchulz: there's two forks, so no [18:09:02] MariaDB [18:09:07] and percona [18:09:16] drizzle is dead, so it doesn't count [18:09:20] does percona actually do core work or addon stuff? [18:09:32] I know Maria has lots of mysql talent [18:09:41] I was under the assumption that percona did core, but I could be wrong [18:21:32] !log reedy synchronized php-1.21wmf1/extensions/ProofreadPage [18:21:43] Logged the message, Master [18:29:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.829 seconds [19:18:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:32:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.034 seconds [19:44:16] New patchset: Pyoungmeister; "re-adding srv191 to bits backends, and removing mw61" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26835 [19:45:16] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26835 [19:46:15] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26835 [19:47:34] PROBLEM - MySQL Slave Delay on db36 is CRITICAL: CRIT replication delay 202 seconds [19:47:43] PROBLEM - MySQL Replication Heartbeat on db36 is CRITICAL: CRIT replication delay 208 seconds [19:47:57] * Reedy whistles [19:50:34] RECOVERY - MySQL Slave Delay on db36 is OK: OK replication delay 0 seconds [19:50:52] RECOVERY - MySQL Replication Heartbeat on db36 is OK: OK replication delay 0 seconds [20:02:49] !log removing srv190 from imagescaling pool due to misbuilt package [20:03:00] Logged the message, notpeter [20:07:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:22:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [20:23:29] New patchset: Andrew Bogott; "Include fqdn in the 'DNS name' entry in instance status." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26840 [20:24:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26840 [20:25:27] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26840 [20:44:39] !log updated mwlib.zim to 0.3.0 [20:44:49] Logged the message, Master [20:48:29] if I add /etc/hosts rules for relevant hostnames (en,upload,bits, etc.) to point to esams.wikimedia.org load balancers, is that roughly equivalent to accessing the site from Europe?
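A hedged sketch of what those /etc/hosts overrides might look like. The addresses below are TEST-NET placeholders, not the real esams load balancer IPs; in practice you would take each one from a DNS lookup of the corresponding esams-facing service name, and each hostname may need a different address:

```
203.0.113.10    en.wikipedia.org
203.0.113.11    upload.wikimedia.org
203.0.113.12    bits.wikimedia.org
```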
[20:55:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:09:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.629 seconds [21:39:37] Eloquence: yes [21:39:51] we have a bunch of IPs [21:39:56] about 12 [21:40:31] right [21:41:18] this might help the fundraiser team debug the banner loading issue, since there are reports that centralnotices on cached pages are sometimes very slow to load from europe [21:42:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:01] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [21:51:01] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [21:51:01] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [21:51:01] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [21:51:01] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [22:01:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.021 seconds [22:25:58] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [22:33:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:40:58] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [22:48:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.038 seconds [23:21:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:22:12] New patchset: Dzahn; "adding account for Steven Walling and adding to stat1 per RT-3653" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26903 [23:23:15] New review: 
gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/26903 [23:23:23] New patchset: Dzahn; "adding account for Steven Walling and adding to stat1 per RT-3653" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26903 [23:24:24] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26903 [23:36:26] stat1 /usr/sbin/gmond[1145]: slurpfile() read() buffer overflow on file /proc/stat [23:37:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.021 seconds [23:49:52] New patchset: Dzahn; "fix a minor syntax error and remove a ," [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26908 [23:50:52] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26908 [23:52:04] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours