[00:00:17] Susan: ha ha [00:00:22] additionally, I don't currently expose modules directly on minions [00:00:43] it goes through a runner, which imposes additional restrictions [00:00:54] to deploy new modules, you need root access [00:01:09] you can't just let any non-root user run commands on all apaches like we do with dsh [00:01:54] we could have a cmd.run module that drops its privileges to a non-root users [00:01:56] *user [00:02:54] then it would be a drop-in replacement for dsh [00:03:58] Is there a page explaining the deficiencies of using dsh? I'm looking at "Scap" and "Git-deploy" on wikitech.wikimedia.org. [00:04:26] TimStarling: have you measured IOPS from the NetApp? [00:04:30] Susan: there's only one major complaint with dsh right now, and it's not dsh's issue [00:04:45] Susan: it's that the dsh groups must be manually kept up to date [00:04:53] Ah. [00:04:56] Is that tedious? [00:05:15] its method of handling timeouts, non-matching host keys, and such isn't great either, but it's something that can be dealt with [00:05:20] Krinkle: https://gerrit.wikimedia.org/r/#/c/45117/1 [00:05:38] Susan: it's more that it's bad if it isn't kept up to date with what's pooled [00:05:59] Because you end up with out-of-sync servers? [00:06:02] yes [00:06:05] Got it. [00:06:23] so, if you guys want to replace what I wrote in salt, you need to: [00:06:28] 1. have proper reporting [00:06:50] 2. have a way to get to that reporting outside of a deployment [00:07:15] 3. have a way of fetching things to the systems without modifying the current code [00:07:28] that's roughly it. 
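[Editor's note: the "cmd.run module that drops its privileges to a non-root user" idea above is essentially what Salt's `runas` option does (it comes up again later in this log). A minimal sketch of the privilege drop such a wrapper performs, with an invented function name and not Salt's actual API:]

```python
import os
import pwd
import subprocess

def run_as(username, cmd):
    """Run a shell command as the given user, sketching the privilege
    drop a non-root cmd.run wrapper would perform. The function name
    is hypothetical; this is not the Salt implementation."""
    pw = pwd.getpwnam(username)

    def demote():
        # Order matters: drop the group first, because after setuid()
        # to a non-root uid the process may no longer change groups.
        os.setgid(pw.pw_gid)
        os.setuid(pw.pw_uid)

    result = subprocess.run(
        cmd, shell=True, preexec_fn=demote,
        capture_output=True, text=True,
    )
    return result.stdout.strip()
```

[Calling `run_as` with the invoking user's own name is a no-op demotion, which makes the sketch testable without root.]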
[00:07:45] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 00:07:40 UTC 2013 [00:07:46] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 00:07:44 UTC 2013 [00:08:06] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [00:08:06] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 00:08:03 UTC 2013 [00:08:15] RECOVERY - Puppet freshness on mw15 is OK: puppet ran at Thu Jan 24 00:08:11 UTC 2013 [00:08:22] Ryan_Lane, there's also the issue of testwiki over NFS [00:08:35] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [00:08:45] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 00:08:34 UTC 2013 [00:08:57] MaxSem: when I said I would add that into the deployment system I was told it wasn't necessary and that we should push people to beta [00:09:06] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [00:09:13] bleh [00:09:15] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 00:09:05 UTC 2013 [00:09:25] beta is not 100% equivalent to prod [00:09:32] my solution was: git deploy test [00:09:35] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [00:09:37] and will not be this year [00:09:46] which would fetch/checkout on a single system [00:09:49] update dsh group from Nagios config. 
@fenari:~$ grep "function updategroup" `which upgrade-helper` -A 30 [00:10:05] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [00:10:25] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 00:10:21 UTC 2013 [00:10:35] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [00:10:55] Ryan_Lane, if fast-deploying to testwiki is a matter of one command, it could be acceptable [00:11:09] well, in the long run I think even scap will no longer use nfs [00:11:24] so this is an issue no matter what [00:12:19] Ryan_Lane, is there a touch-over-salt script already? could be massively useful even before we switch to git-deploy:) [00:12:41] MaxSem: not yet. should be easy to add, though [00:13:06] actually, adding the cmd.run as non-root user would likely be the best thing to add [00:15:56] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 00:15:45 UTC 2013 [00:16:05] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [00:16:25] * Ryan_Lane shrugs [00:16:50] I'll just go back to working on labs if we're not planning on using any of the new deployment code [00:18:20] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 196 seconds [00:18:46] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 226 seconds [00:19:05] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 238 seconds [00:20:05] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [00:20:45] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 22 seconds [00:21:47] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 27 seconds [00:23:50] New patchset: Pyoungmeister; "removing ill-defined memory check from kaulen" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45498 [00:24:02] Ryan_Lane: Speaking of Labs, I see that the Wikimedia Foundation is hiring 
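[Editor's note: the "dsh groups must be manually kept up to date with what's pooled" complaint above is usually solved by regenerating the group file from pool state, which is what the quoted `updategroup` helper does from the Nagios config. A toy sketch of that regeneration; the input format is assumed, not taken from the actual upgrade-helper script:]

```python
def render_dsh_group(pooled_hosts):
    """Render a dsh group file (one hostname per line) from the list
    of currently pooled hosts, so the group tracks the pool instead
    of being hand-edited."""
    # Deduplicate and sort so repeated regenerations are byte-stable,
    # which keeps the generated file diff-friendly.
    return "\n".join(sorted(set(pooled_hosts))) + "\n"
```

[Feeding it the pooled list on each depool/repool keeps the group in sync, which is the out-of-sync-servers problem described earlier.]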
someone to work on Tool Labs (as a contractor)? [00:24:10] yep [00:24:24] that will involve setting things up inside of projects [00:24:26] Will that primarily be setting up DB replication, then? [00:24:29] no [00:24:32] Hmm. [00:24:37] that's being handled by the current team [00:24:40] New review: Dzahn; "thanks" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45498 [00:24:41] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45498 [00:24:51] What things inside projects? [00:25:13] setting up a working environment and helping people transition [00:25:36] puppetizing everything as it's needed, walking people through the process, etc. [00:25:40] Ah, okay. [00:26:05] will also set up services that aren't supported labs-wide as well, likely [00:26:14] Has it been decided what OS will be used? [00:26:19] Solaris? Debian? Something else? [00:26:19] ubuntu [00:26:23] All right. [00:27:00] And has it been decided whether per-user projects will be the norm? Like we currently have toolserver.org/~user/ and such. [00:27:11] There was a push for multi-maintainer projects/tools at some point. [00:27:16] Among other failed pushes. [00:28:13] lol, solaris [00:28:17] ideally it will not be per-user [00:28:28] the norm should be multi-maintainer [00:28:35] Susan: better hope apergos isn't listening [00:28:50] it's possible to create multi-maintainer accounts by creating a labsconsole user [00:29:09] Reedy: I fear felicity much more than apergos. :-) [00:29:10] or by adding system users via puppet [00:29:23] Change abandoned: Tim Starling; "Let's just abandon this for now. It can be rewritten when we actually need it. I'm not sure if it wi..." [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/43356 [00:30:26] Ryan_Lane: Yeah... users will need a lot of hand-holding, I think. [00:30:44] AaronSchulz: Thanks for the link [00:30:51] Many Toolserver users can barely SSH. 
[00:30:57] yep. and it's more than the current labs team can handle while also working on the infrastructure [00:31:03] hence the contractor ;) [00:31:08] * Susan nods. [00:31:16] I think there will also be one on the wmde side? [00:31:20] What's the goal for stable DB replication? End of year? [00:31:34] 4 months ago? :) [00:31:37] well, it's supposed to be end of this month or beginning of next [00:31:37] Heh. [00:31:46] nah. roadmap showed this quarter [00:31:57] end of february would actually be the deadline [00:32:09] we hoped to get to it before then, though [00:32:24] Hoped? [00:32:31] well, there's still time ;) [00:32:42] s/Hoped/Hope/ [00:33:01] New patchset: Tim Starling; "adding acct pkg to base::standard-packages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43991 [00:33:08] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43991 [00:33:15] First quarter would be nice. [00:33:27] It might also be nice to fix the Toolserver's DBs. [00:33:36] The replica are all horribly corrupt. [00:33:40] replicas [00:33:54] they easily corrupt because users are allowed to write to tables in them [00:33:55] Not horribly, but just enough corruption to be super-annoying. [00:34:12] I thought you were going to say "because MySQL sucks." [00:34:14] Susan: I have been talking to nosy and DaB about getting them dumps of our dbs so they can fix things [00:34:22] notpeter: That'd be awesome. :-) [00:34:23] I just gave them s4 and s7 bumps [00:34:25] *dumps [00:34:38] this is one of the reasons we don't want to allow writable tables in labs db replicas [00:34:38] s1 is all that matters. 
;-) [00:34:39] waiting to hear back from them about what they'd like next [00:35:09] New patchset: Tim Starling; "Add .gitreview" [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/43347 [00:35:14] Change merged: Tim Starling; [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/43347 [00:35:25] Ryan_Lane: Yeah, I vaguely remember that coming up. Joining against DBs is awfully convenient, though. [00:35:44] yes, but at the expense of it being broken all the time and ops' sanity [00:36:31] I hadn't realized that user writes were the suspect (culprit) of frequent DB corruption. I thought it was just general network hiccups and such. [00:59:09] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [01:03:05] csteipp: yay, extension RSS! i did not expect that to even still be alive..awesome how it grew from this http://meta.wikimedia.org/wiki/User:Mutante/RSSFeed#Source [01:03:16] # $input = mysql_escape_string($input); haha [01:03:43] hah, mutante, if I'd known that was from you I would have vetoed it on principle ;) [01:04:03] arrr:) [01:05:13] Yeah, that escaping was pretty l33t :) [01:05:42] I'm ashamed to say I have worse floating around the nets. [01:08:45] i forgot all of this: "we were thinking of even making this an automatic feature, (read from "interwiki" sql table and auto-include RCs of other Wikis in according wiki page like /Feeds/OtherWiki/RecentChanges. 
(original idea by mattis manzel)" [01:09:25] Mattis Manzel always wanted to build the "wiki net" [01:12:35] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jan 24 01:12:33 UTC 2013 [01:36:16] TimStarling: http://docs.saltstack.org/en/latest/ref/modules/all/salt.modules.cmdmod.html#salt.modules.cmdmod.run <— already has a "runas" option which drops privileges [01:36:55] TimStarling: we can further wrap that in a runner to limit scoping (grains, regex, etc) or further limit things by the system calling the runner [01:45:56] but there's no authentication for executing publish jobs apart from host-based authentication? [01:46:34] so you couldn't, say, prevent the apache user from executing jobs normally used by admins? [01:47:35] in this situation I'd use the mwdeploy user on the system [01:47:41] and the runner would always set mwdeploy [01:51:03] or if you wanted to trust the deployment system itself to enforce users, you could have the calling user sent along with the call [01:51:55] wait. do you mean an apache user on one of the apaches itself? [01:52:48] it's not actually host based authentication. it's certificate based authentication, and you set acls based on the certificate name [01:54:15] can different users have different certificates? [01:55:12] though it's possible, the system isn't set up for that [01:55:40] salt-api, though, if we choose to use it would offer that level of flexibility. 
I don't think it's in a state that's ready for use, though [02:07:35] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [02:08:06] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [02:11:30] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42506 [02:16:39] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Thu Jan 24 02:16:16 UTC 2013 [02:25:28] New patchset: Dzahn; "replace new index.html with old redirect to meta planet page" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45508 [02:28:22] !log LocalisationUpdate completed (1.21wmf8) at Thu Jan 24 02:28:21 UTC 2013 [02:28:34] Logged the message, Master [02:36:05] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [02:43:29] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [02:46:29] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [02:47:42] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45267 [02:52:59] !log LocalisationUpdate completed (1.21wmf7) at Thu Jan 24 02:52:58 UTC 2013 [02:53:09] Logged the message, Master [02:59:10] PROBLEM - HTTP on formey is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:59:31] New review: Dzahn; "yeah, ugly HTML, but starting out with exactly what was on old pre-puppet setup to enhance later" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45508 [02:59:32] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45508 [02:59:59] RECOVERY - HTTP on formey is OK: HTTP OK: HTTP/1.1 200 OK - 3596 bytes in 6.708 second response time [03:08:39] TimStarling: any idea why djvu files have such massive metadata? 
[03:08:55] they have OCR text in them [03:09:03] it's a ProofreadPage feature [03:11:33] !log DNS update - switching planet over to zirconium / new planet [03:11:43] Logged the message, Master [03:16:13] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [03:22:53] PROBLEM - HTTP on formey is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:24:13] PROBLEM - SSH on formey is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:26:03] PROBLEM - HTTP on formey is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:26:09] hm. formey died [03:26:23] PROBLEM - HTTPS on formey is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:27:05] connect: com2 port is currently in use [03:27:06] :D [03:27:13] * Ryan_Lane stabs drac5 [03:28:19] no output on console [03:28:22] !log powercycling formey [03:28:32] Logged the message, Master [03:28:36] PROBLEM - HTTPS on formey is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:28:45] PROBLEM - SSH on formey is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:31:04] RECOVERY - SSH on formey is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:31:13] RECOVERY - HTTPS on formey is OK: OK - Certificate will expire on 08/22/2015 22:23. [03:31:18] RECOVERY - HTTP on formey is OK: HTTP OK HTTP/1.1 200 OK - 3596 bytes in 0.074 seconds [03:31:44] RECOVERY - HTTP on formey is OK: HTTP OK: HTTP/1.1 200 OK - 3596 bytes in 0.055 second response time [03:32:03] RECOVERY - HTTPS on formey is OK: OK - Certificate will expire on 08/22/2015 22:23. 
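[Editor's note: the certificate-name ACL scheme described earlier in the log ("you set acls based on the certificate name") boils down to a table mapping a cert name to the functions it may call. A toy sketch, loosely modeled on that idea; the names and module globs are invented. Its limitation is exactly the one raised above: with one certificate per host, it cannot distinguish users on the same host.]

```python
import fnmatch

# Hypothetical ACL table keyed by certificate (client) name.
ACLS = {
    "deploy-master": ["deploy.*", "cmd.run"],
    "monitoring": ["status.*"],
}

def allowed(cert_name, function):
    """Return True if the named certificate may call the function.
    Unknown certificates are denied by default."""
    return any(fnmatch.fnmatch(function, pattern)
               for pattern in ACLS.get(cert_name, []))
```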
[03:32:12] RECOVERY - SSH on formey is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:34:25] TimStarling: I made https://gerrit.wikimedia.org/r/#/c/45510/ [03:34:38] still feels awkward, oh well [03:34:40] * AaronSchulz should go home [04:04:18] New patchset: Tim Starling; "Disable E3Experiments due to bug 44298" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45511 [04:06:46] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45511 [04:07:45] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 04:07:37 UTC 2013 [04:07:54] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [04:07:54] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 04:07:49 UTC 2013 [04:08:04] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 04:07:56 UTC 2013 [04:08:05] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jan 24 04:07:58 UTC 2013 [04:08:09] !log tstarling synchronized wmf-config/InitialiseSettings.php [04:08:10] TimStarling: if you haven't merged yet, could you hang on a minute or three? [04:08:12] oh. [04:08:19] Logged the message, Master [04:08:34] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with args jenkins [04:08:35] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Thu Jan 24 04:08:32 UTC 2013 [04:08:35] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [04:08:35] !log tstarling synchronized wmf-config/CommonSettings.php [04:08:44] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 04:08:40 UTC 2013 [04:08:45] Logged the message, Master [04:08:54] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [04:09:04] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 04:08:58 UTC 2013 [04:09:20] lol. 
[04:09:35] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [04:09:54] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [04:11:34] PROBLEM - jenkins_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with args jenkins [04:13:46] New patchset: Reedy; "Everything non 'pedia to 1.21wmf8" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45513 [04:14:24] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 04:14:21 UTC 2013 [04:14:46] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45513 [04:14:54] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [05:07:46] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [05:08:16] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [06:04:09] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [06:11:54] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [06:16:59] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:25:48] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 06:25:45 UTC 2013 [06:25:59] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:25:59] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 06:25:57 UTC 2013 [06:26:08] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 06:26:01 UTC 2013 [06:26:48] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [06:26:58] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:34:39] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 06:34:36 UTC 2013 [06:34:48] PROBLEM - Puppet freshness on ms1 is CRITICAL: 
Puppet has not run in the last 10 hours [06:43:18] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 06:43:15 UTC 2013 [06:43:59] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [07:59:28] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [07:59:58] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [08:05:28] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with args jenkins [08:07:48] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 08:07:38 UTC 2013 [08:07:49] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 08:07:42 UTC 2013 [08:07:58] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [08:08:08] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 08:08:01 UTC 2013 [08:08:29] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [08:08:29] PROBLEM - jenkins_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with args jenkins [08:08:58] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [08:09:28] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with args jenkins [08:14:29] PROBLEM - jenkins_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with args jenkins [08:15:38] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 08:15:36 UTC 2013 [08:16:28] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [08:54:48] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [09:03:42] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [09:04:33] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [09:14:43] PROBLEM - Host mw1085 is DOWN: PING CRITICAL 
- Packet loss = 100% [09:15:13] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [09:19:19] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [09:19:19] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [09:19:19] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [09:19:19] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [09:19:19] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [09:19:20] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [09:19:20] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [09:47:55] New review: Nikerabbit; "This file is a wonderful mix of spaces and tabs indentation." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45349 [10:11:04] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [10:12:15] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [10:12:34] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [10:14:49] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [10:15:14] rsyncs I suppose [10:25:21] !log mlitn synchronized wmf-config/CommonSettings.php 'Reinstate AFTv5 test groups' [10:25:33] Logged the message, Master [10:56:23] New patchset: Matthias Mullie; "Reinstate AFTv5 test groups" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45527 [10:58:56] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [10:59:46] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [11:00:41] New patchset: Matthias Mullie; "Reinstate AFTv5 test groups" 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45527 [11:04:48] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45527 [11:06:00] !log mlitn synchronized wmf-config/CommonSettings.php 'Reinstate AFTv5 test groups' [11:06:11] Logged the message, Master [11:55:43] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [11:56:02] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [12:07:42] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 12:07:37 UTC 2013 [12:07:43] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [12:07:43] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 12:07:39 UTC 2013 [12:08:07] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [12:08:07] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 12:07:56 UTC 2013 [12:08:07] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 12:07:58 UTC 2013 [12:08:42] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [12:08:53] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 12:08:46 UTC 2013 [12:09:02] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [12:09:03] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 12:09:00 UTC 2013 [12:09:43] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [12:10:02] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [12:18:53] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 12:18:50 UTC 2013 [12:19:42] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [12:19:43] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 12:19:40 UTC 2013 [12:20:02] PROBLEM - 
Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [13:00:20] what's ms1 or ms2 ? [13:16:21] they are media storage servers [15:04:26] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [16:05:31] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [16:07:49] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 16:07:44 UTC 2013 [16:08:40] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [16:08:40] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 16:08:36 UTC 2013 [16:09:39] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [16:09:58] robh: you around yet? [16:11:59] cmjohnson1 or robh, ahem ahem ahem ahem cough cough cough https://rt.wikimedia.org/Ticket/Display.html?id=3946 [16:12:00] :) [16:12:55] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45358 [16:13:12] New patchset: Diederik; "Replace custom CS carrier codes with MCC-MNC mobile carrier codes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45353 [16:13:18] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [16:13:38] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [16:14:29] ottomata: I know...other things have taken priority...my issue is that it is the partman recipe is failing during install. why? idk yet [16:14:39] haha, yeah [16:14:46] i know its cool [16:15:11] its not a huge priority for us to have it, it just would be nice and it has been a while [16:15:23] mind if I just bump you every week or so about it? you can put it off as long as you need [16:17:06] ottomata: thx [16:28:01] Anyone knows if Sumana usually hangs around IRC and what his nick is? [16:28:42] her* [16:28:44] her nick is sumanah [16:28:57] paravoid: That's too easy. 
:-) [16:29:04] I'd try #wikimedia-tech though [16:29:17] she doesn't join -operations afaik [16:29:36] I didn't expect she would, but I was pretty sure you guys'd know. :-) [16:38:21] Coren- #mediawiki and #wikimedia-dev are also good places to look [17:01:31] anyone around in case of emergency for a mobile redeployment of last week's changes? we will also need a mobile varnish cache flush - we were hoping to get started soon and would likely need the flush ~15mins and to keep an eye on cluster health (particularly bits) for the next hour or so to make sure nothing explodes [17:04:46] awjr: what actually happens if you don't do the cache flush? [17:05:33] mark broken mobile experience; mostly a result of serving out-of-date js/css, or a combination of up to date and out of date js/css [17:06:08] so basically no backwards compatibility [17:06:15] is it really that bad? [17:06:21] unfortunately, yes. [17:06:28] it's not always - it depends on the kinds of changes going out [17:06:33] but this one will require a cache flush [17:06:47] mark we are in the process of fixing this [17:07:21] hey mark, drdee is asking me to give stefan petrea (a contractor who already has access on stat1) access to the analytics nodes so he can run some hadoop shell commands, do we need to put in an RT for this? [17:07:45] my best guess is within the next month we will drastically reduce the need for mobile varnish cache flushes but not necessarily totally remove the need [17:07:47] https://noc.wikimedia.org/cgi-bin/report.py [17:07:48] ottomata: yes, RT is needed for every access request [17:07:52] [17:07:53] cool, thought so, just double checking [17:07:59] but i am hoping we'll have it all resolved by the end of the fiscal [17:08:01] danke [17:09:51] mark if we do our redeployment now, would you be available to help with a mobile varnish cache flush in 10-15 mins and help respond if things explode (like they did last week with bits dying?) 
- i don't expect they will, but just in case [17:10:00] yeah I guess [17:10:08] mark thank you [17:14:30] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [17:14:45] bits dying = back to 2001 look. It's nostalgic. ;-) [17:15:19] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [17:15:56] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [17:16:00] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [17:17:41] awjr: can we possibly purge just a subset of bits (even if some complicated regexp is needed) or really all of it? [17:18:03] mark we dont need to purge anything from bits [17:18:08] mark we just need to purge the mobile varnish cache [17:18:10] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [17:18:28] ok [17:18:38] mark, what's typically needed is to purge older HTML because newer resources screw it [17:18:46] understood [17:18:47] we've never had to purge anything from bits, afaik [17:19:38] we can purge one mobile box at a time [17:20:07] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [17:20:09] i forgot we're going to have to scap :| [17:20:24] but i heard scap takes a lot less time nowadays … we'll need the flush shortly after scap is complete [17:21:38] when do you want to start? 
[17:23:18] im getting the changes up on fenari now, we'll do a quick sanity check then i'll start the scap [17:23:23] so probably scap in ~5-10 mins [17:23:26] ok [17:29:03] New patchset: Mark Bergsma; "Reduce mobile varnish backend connection count to 600 (4x normal use)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45558 [17:29:22] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45558 [17:34:11] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.128 second response time [17:34:22] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.053 second response time [17:40:25] ok, running scap now [17:55:57] how is scap doing? [17:56:08] !log awjrichards Started syncing Wikimedia installation... : Redploy rolled back MobileFrontend changes from Jan 17 [17:56:19] Logged the message, Master [17:57:11] mark just finished the l10n updates [18:00:35] !log authdns update "fixing typo on db1060.mgmt.eqiad" [18:00:46] Logged the message, Master [18:06:58] Change merged: preilly; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45353 [18:07:12] mutante: ping [18:08:10] thank you preilly! [18:08:27] drdee: it will be live shortly [18:09:30] preilly: merged on sockpuppet [18:09:45] mutante: okay thanks [18:10:06] ok why are you guys doing this in the middle of a deployment with no advance warning? [18:10:46] mark: Thursday 18:00-19:00 10 a.m. - 11 a.m. Wikipedia Zero partner testing [Dan / Patrick] [18:11:03] mark: taken from http://wikitech.wikimedia.org/view/Deployments [18:11:05] so we're doing two things at once now? [18:11:37] mark: what is the other thing? 
[18:11:59] mobilefrontend redeploy [18:12:04] Thursday, Jan 24 [18:12:04] 17:00-18:00 [18:12:04] 09:00-10:00 [18:12:04] Re-deployment of http://www.mediawiki.org/wiki/Extension:MobileFrontend/Deployments#17_January.2C_2013 [Arthur/Max] [18:12:06] mark: are you referring to Thursday, Jan 24 17:00-18:00 09:00-10:00 Re-deployment of MobileFrontend after Jan 18 rollback [Arthur/Max] [18:12:08] still waiting on scap to finish [18:12:27] awjr: you're 12 minutes over your window [18:12:43] preilly: thanks, i realize that. like i said, waiting for scap to finish [18:12:58] awjr: how long has scap been running? [18:13:31] preilly: i started scap at 940am pst [18:13:35] 33 minutes [18:13:45] awjr: that seems really long [18:13:57] awjr: what is the fan out? [18:14:02] fan out? [18:14:28] last time i ran scap it took a couple of hours (!) but apparently it has been fixed since then [18:14:34] for ddsh [18:14:39] Is it back to 30 yet/ [18:14:46] *? [18:14:48] AaronSchulz: it was the other day [18:14:54] AaronSchulz: it must be lower again [18:15:08] dsh -F30 -cM -f /tmp/tmp.VkLOkm2Hn4 -o -oSetupTimeout=10 /usr/bin/scap-1 "mw1010.eqiad.wmnet mw1070.eqiad.wmnet mw40.pmtpa.... [18:15:14] !log awjrichards Finished syncing Wikimedia installation... : Redploy rolled back MobileFrontend changes from Jan 17 [18:15:22] awjr: are you using watch? [18:15:25] just finished - i just need a few more minutes for the cache flush, etc [18:15:27] preilly: yes [18:15:27] Logged the message, Master [18:15:44] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [18:16:03] awjr: I don't have anything other than the Varnish change that already got merged [18:16:08] !log awjrichards synchronized php-1.21wmf8/extensions/MobileFrontend 'touch files' [18:16:18] Logged the message, Master [18:16:27] mark: ok, can you flush the mobile varnish cache?
[18:16:40] !log awjrichards synchronized php-1.21wmf7/extensions/MobileFrontend 'touch files' [18:16:40] awjr: Why do you need that? [18:16:51] preilly: on account of the changes that are being deployed [18:16:52] Logged the message, Master [18:17:04] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 192 seconds [18:17:05] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 192 seconds [18:17:15] purged cp1041 [18:17:21] New patchset: Demon; "MediaWiki repository is now offline" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45564 [18:17:22] waiting a bit before I do the others [18:17:26] ok [18:17:29] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 201 seconds [18:17:34] mark: I really don't think that it's needed [18:17:56] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 211 seconds [18:18:05] preilly: we needed to purge the mobile varnish cache the last time these changes were deployed [18:18:10] it is needed. [18:18:20] i would not ask for this if it weren't [18:18:28] awjr: I really don't think that is the case [18:18:55] we'll find out easily enough, as 3/4 mobile varnish servers still have cache [18:19:07] mark: yeah I was just thinking that [18:19:31] i'm browsing around some [18:19:31] we pushed out js/css changes to stable MobileFrontend - MobileFrontend hard-codes requests for those resources (with the timestamp) in the html, which gets cached by varnish [18:19:33] site looks fine to me [18:19:46] what would break? [18:19:54] preilly: seems to be 30 [18:20:00] looking at puppet [18:20:13] AaronSchulz: any objections on unmounting /mnt/upload6, /mnt/upload7, /mnt/thumbs, /mnt/thumbs2 from everywhere? [18:20:19] New review: Pgehres; "This does not affect fundraising." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/45564 [18:20:34] New review: Demon; "Don't merge until tomorrow."
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/45564 [18:21:03] preilly: believe me, i'm the one who's always been screaming loudest about those mobile cache purges, I don't do them lightly ;) [18:21:07] sometimes breakage is non-obvious, or you have to browse around a fair amount to see it. in this particular case, you would definitely see breakage in the beta version of the site on phones that do not support jquery [18:21:23] preilly, mark: as stated elsewhere, we are working on fixing the issue [18:21:27] I think we will purge, but we'll do it slowly and controlled [18:21:38] no need to completely wipe everything in seconds [18:22:19] awjr: probably not the right time to have this discussion but is there anywhere I can read about these plans? [18:22:20] paravoid: no [18:22:32] e.g. are you planning to use ESI? [18:22:34] especially not since scap itself took ages [18:23:03] cache hit rate for cp1041 is at 20% now [18:23:04] paravoid, we have ESI support in a branch [18:23:11] bits around 99.6% [18:23:24] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [18:23:52] also, we have preparations to further the ResourceLoader support [18:24:11] which will also reduce the need in purges [18:24:48] paravoid starting next week we are improving RL support in MobileFrontend like MaxSem just mentioned and getting rid of support for a lot of dynamic features for devices that don't support jquery [18:24:55] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 26.67 ms [18:25:02] which will greatly reduce when we need to do the varnish cache flushes [18:25:02] sbernardin: around? [18:25:22] sbernardin: can you please kill srv278 once and for all? 
[18:25:24] once we have that in place, we'd only need to do cache flushes when html for those devices changes, which is pretty rare [18:25:54] paravoid: we have been experimenting with ESI and are talking about getting it fully implemented in MF hopefully by the end of the fiscal [18:26:14] paravoid: will do [18:26:26] sbernardin: thanks! [18:26:44] wiped cp1042 [18:26:46] if someone could also help me to beat our front-end devs into submission to make backwards-compatible changes to HTML and CSS, we will not need any purges at all:P [18:26:46] paravoid: sbernardin : how about srv266 , that is its evil twin ?:) [18:26:53] paravoid: it's being decommissioned ...right? [18:26:59] sbernardin: correct [18:27:04] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [18:28:10] New patchset: Faidon; "Unmount upload-related NFS from everywhere" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45565 [18:28:53] New review: Faidon; "Checked with Aaron, no objections." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/45565 [18:28:54] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45565 [18:29:15] oh man, what a good day today [18:29:17] no more NFS [18:29:25] not exactly [18:29:26] close :) [18:29:28] i've been wanting that since like 2004 [18:29:36] well, no more NFS, but we still depend on ms7 [18:29:38] working on that [18:29:38] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [18:29:51] that's a different thing [18:30:13] and we still have /home in NFS :) [18:30:13] mark: does busy threads on application servers eqiad seem a bit high to you? [18:30:18] but sure, progress! [18:30:50] mark: no more NFS — now that's progress [18:30:56] AaronSchulz: btw, there was that review extension of yours that was still using NFS; did that get fixed? [18:30:59] AaronSchulz, paravoid: nice work!
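For context on the ESI experiments mentioned above: ESI lets the cache store a page skeleton and stitch in per-device fragments at request time, which reduces how often the full HTML has to be purged. A minimal, hypothetical Varnish 3 VCL fragment showing how ESI processing is enabled (illustrative, not the production config):

```vcl
sub vcl_fetch {
    # Only parse ESI markup in HTML coming back from the apaches
    if (beresp.http.Content-Type ~ "text/html") {
        set beresp.do_esi = true;
    }
}
```

The backend still has to emit `<esi:include src="..."/>` tags for this to do anything; the skeleton and each fragment are then cached and invalidated independently.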
paravoid: it's still disabled, and I don't maintain it [18:31:28] features should have disabled it ages ago so I don't know how much a shit I give [18:31:35] haha :) [18:31:40] AaronSchulz: ha ha ha [18:32:34] preilly: yep, we need to increase the limit as far as memory allows [18:32:38] those boxes could do with more memory nowadays [18:33:44] New patchset: Aaron Schulz; "Removed unused NFS file backends." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45566 [18:34:33] mark: I wonder if it makes sense to use those new apaches in eqiad instead of Tampa [18:34:44] mark: and then order another set for Tampa later [18:34:54] why not both? [18:34:56] paravoid: https://gerrit.wikimedia.org/r/#/c/45566/1? [18:35:30] mark: well the first set is ordered already right? [18:35:48] mark: I just wanted them as fast as possible [18:36:02] mark: but I think they're needed in both for sure [18:36:31] we're not that tight on capacity really [18:36:53] but I believe we can use more too [18:37:12] we should also possibly look into upgrading ram in the existing eqiad boxes [18:37:20] mark: well having 96 hosts up for application servers scares me a bit [18:37:23] that would give them a lot more capacity for low cost [18:37:27] really? [18:37:31] dude, you should have been here in 2004-2005 [18:37:32] mark: that's a great point [18:37:39] mark: ha ha ha [18:38:03] this feels rather comfortable comparatively ;) [18:38:07] mark: I'm trying to figure out what's up with mw1103 [18:38:15] mark: yeah I bet [18:38:16] New patchset: Aaron Schulz; "Removed unused NFS file backends."
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45566 [18:38:27] preilly: I had to fix that doc typo :) [18:38:38] "files backends" [18:39:04] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [18:39:04] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [18:39:17] preilly: high cpu utilization you mean? [18:39:20] AaronSchulz: s/NFS and Swift files/ the file [18:39:24] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [18:39:25] mark: yeah [18:39:33] preilly: mw1103 has hyperthreading disabled [18:39:34] PROBLEM - NTP on srv278 is CRITICAL: NTP CRITICAL: Offset unknown [18:39:41] we should make that consistent across the cluster [18:39:57] mark: yeah I agree I was talking to Tim about that the other day [18:40:09] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [18:40:24] mark: do you think we should just have it off everywhere? [18:40:32] yes [18:40:43] i thought it was actually disabled everywhere [18:41:01] mark: so did I at one point [18:41:05] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.134 second response time [18:41:07] mark: I'm not sure what happened [18:41:13] preilly: merging [18:41:20] AaronSchulz: sweet [18:41:37] that's quite sucky [18:41:42] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45566 [18:41:48] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.070 second response time [18:42:33] !log aaron synchronized wmf-config/InitialiseSettings.php [18:42:44] Logged the message, Master [18:42:54] !log aaron synchronized wmf-config/filebackend.php 'Removed unused NFS file backends' [18:43:05] Logged the message, Master [18:43:35] RECOVERY - NTP on srv278 is OK: NTP OK: Offset 6.186962128e-05 secs [18:43:40] preilly: that changes things, then we have less capacity than I 
thought [18:43:55] mark: yeah [18:44:14] let's order two apache racks for eqiad [18:44:44] * mark puts in a ticket [18:44:48] mark: awesome [18:45:22] seems 60 servers were already ordered [18:45:23] wtf [18:46:33] mark: for Tampa [18:46:40] no, also for eqiad [18:46:46] 60 servers [18:46:50] mark: wait seriously? [18:46:58] mark: I thought it was just 60 total [18:47:05] that count doesn't make a lot of sense from a servers per rack pov, but whatever [18:47:19] mark: so there is 120 on order? [18:47:23] i believe so [18:47:37] ordered yesterday [18:48:47] awjr: app server capacity doesn't allow purging the remaining two varnish boxes at this time, so I'll do that one by one in a bit, after dinner [18:48:58] ok mark, thanks [18:51:07] yeah RobH was mentioning that yesterday [18:52:33] awjr: that was the point I was trying to make in the mobile channel earlier [18:52:51] preilly: sorry, i don't follow? [18:53:54] awjr: app server capacity [18:54:08] awjr: no worries it's no big deal [18:54:14] that's the point I was already making to preilly well over a year ago [18:54:27] preilly: for sure, i understand [18:54:30] mark: heh heh [18:54:42] so shush :P [18:54:54] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [18:55:01] we are digging ourselves out of this, i promise :) [19:01:10] AaronSchulz: any ideas on how to move the predetermined sized thumb idea forward? 
[19:04:12] paravoid: I wonder if a patch with the optional feature could be added [19:04:19] or at least shown in gerrit [19:04:46] paravoid: but another thing would be to look at typical thumb sizes [19:04:55] New patchset: Reedy; "Add .gitreview and .gitignore" [operations/debs/libanon] (master) - https://gerrit.wikimedia.org/r/45577 [19:05:04] New patchset: Reedy; "Add .gitreview and .gitignore" [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/45578 [19:05:20] http://monitor.us.archive.org/weathermap/weathermap.html [19:05:34] New patchset: Reedy; "Add .gitreview and .gitignore" [operations/debs/wikibugs] (master) - https://gerrit.wikimedia.org/r/45579 [19:06:53] Would someone mind reviewing and submitting them please? [19:09:36] Reedy, is a platform deployment going on right now? [19:09:47] No... [19:09:52] Not that I know of [19:10:19] ok, then I'll squeeze in a little fix [19:10:44] maxsem@fenari:/home/wikipedia/common/php-1.21wmf7$ git pull [19:10:44] Permission denied (publickey). [19:10:44] fatal: The remote end hung up unexpectedly [19:11:08] did someone change the remotes? [19:11:25] worked fine for me [19:11:39] http://p.defau.lt/?uSwiXXJhsuUPXK3RhCPkgA [19:12:02] done wmf8 too [19:12:29] meh, proxycommand [19:12:57] I thought we checked out via https before? 
[19:13:47] extensions definitely [19:15:06] New patchset: Faidon; "Kill remaining references to pybaltestfile.txt" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45582 [19:18:43] New patchset: Reedy; "Remove $urlprotocol as it's set to """ [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42995 [19:19:16] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42995 [19:19:31] !log aaron synchronized php-1.21wmf8/thumb.php 'deployed c6d478f73f33f051c1060ffe4b9c5855e07d358f' [19:19:42] Logged the message, Master [19:20:48] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [19:20:49] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [19:20:49] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [19:20:49] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [19:20:49] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [19:20:49] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [19:20:49] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [19:20:53] Fail [19:21:06] Don't submit those 3 changes :p [19:21:21] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45582 [19:21:24] looks like it was a putty problem, restarting it helped [19:23:00] !log maxsem synchronized php-1.21wmf7/extensions/MobileFrontend/includes/skins/SkinMobile.php 'https://gerrit.wikimedia.org/r/#/c/45562/' [19:23:11] Logged the message, Master [19:24:17] !log maxsem synchronized php-1.21wmf8/extensions/MobileFrontend/includes/skins/SkinMobile.php 'https://gerrit.wikimedia.org/r/#/c/45562/' [19:24:28] Logged the message, Master [19:27:05] Change abandoned: Reedy; "(no reason)" [operations/debs/libanon] (master) - 
https://gerrit.wikimedia.org/r/45577 [19:27:08] Change abandoned: Reedy; "(no reason)" [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/45578 [19:27:10] Change abandoned: Reedy; "(no reason)" [operations/debs/wikibugs] (master) - https://gerrit.wikimedia.org/r/45579 [19:29:40] New patchset: Reedy; "Add .gitreview and .gitignore" [operations/debs/libanon] (master) - https://gerrit.wikimedia.org/r/45585 [19:29:49] New patchset: Reedy; "Add .gitreview and .gitignore" [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/45586 [19:30:16] New patchset: Reedy; "Add .gitreview and .gitignore" [operations/debs/wikibugs] (master) - https://gerrit.wikimedia.org/r/45587 [19:30:48] Change abandoned: Reedy; "(no reason)" [operations/debs/libanon] (master) - https://gerrit.wikimedia.org/r/45585 [19:30:50] Change abandoned: Reedy; "(no reason)" [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/45586 [19:30:55] Change abandoned: Reedy; "(no reason)" [operations/debs/wikibugs] (master) - https://gerrit.wikimedia.org/r/45587 [19:31:27] ottomata, I see a delicious mix of iostreams and printf in https://gerrit.wikimedia.org/r/#/c/45569/1/srcmisc/packet-loss.cpp [19:31:33] looks good other than that [19:32:19] OOPS [19:32:19] thanks [19:33:05] ottomata, and spaces instead of tabs [19:35:49] ah good call too [19:35:49] danke [19:38:28] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [19:39:19] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [19:39:36] New patchset: Reedy; "Add .gitreview and .gitignore" [operations/debs/libanon] (master) - https://gerrit.wikimedia.org/r/45588 [19:39:45] New patchset: Reedy; "Add .gitreview and .gitignore" [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/45589 [19:40:04] Change abandoned: Reedy; "(no reason)" [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/45589 
[19:40:10] Change abandoned: Reedy; "(no reason)" [operations/debs/libanon] (master) - https://gerrit.wikimedia.org/r/45588 [19:43:21] New patchset: Reedy; "Add .gitreview and .gitignore" [operations/debs/libanon] (master) - https://gerrit.wikimedia.org/r/45590 [19:43:32] New patchset: Reedy; "Add .gitreview and .gitignore" [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/45591 [19:43:56] New patchset: Reedy; "Add .gitreview and .gitignore" [operations/debs/wikibugs] (master) - https://gerrit.wikimedia.org/r/45592 [19:44:44] nth time lucky [19:47:35] dschoon: try to take over the world? [19:47:44] always [19:58:37] * AaronSchulz hopes the reference was received [19:59:42] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:00:21] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [20:01:03] paravoid: I don't like how LocalFile caches negatives for a week [20:01:14] uhm [20:01:15] okay? [20:01:37] seems like asking for trouble :) [20:01:42] * Reedy hands AaronSchulz a template [20:04:18] AaronSchulz: Nope. Don't know what Brain you are talking about. 
[20:05:32] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: HTTP CRITICAL - No data received from host [20:07:42] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 20:07:36 UTC 2013 [20:07:42] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 20:07:37 UTC 2013 [20:08:21] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [20:08:32] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 20:08:25 UTC 2013 [20:08:32] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: HTTP CRITICAL - No data received from host [20:08:42] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:08:42] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 20:08:32 UTC 2013 [20:09:21] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [20:09:32] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 20:09:22 UTC 2013 [20:09:42] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:10:22] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [20:11:32] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: HTTP CRITICAL - No data received from host [20:12:06] Does someone fancy approving 3 easy commits please? https://gerrit.wikimedia.org/r/45590 https://gerrit.wikimedia.org/r/45591 https://gerrit.wikimedia.org/r/45592 [20:13:32] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 20:13:23 UTC 2013 [20:14:21] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [20:14:31] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: Connection refused [20:15:12] gotcha Reedy... 
[20:17:31] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: Connection refused [20:17:37] Change merged: Ottomata; [operations/debs/libanon] (master) - https://gerrit.wikimedia.org/r/45590 [20:17:44] Change merged: Ottomata; [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/45591 [20:17:48] Change merged: Ottomata; [operations/debs/wikibugs] (master) - https://gerrit.wikimedia.org/r/45592 [20:18:11] thanks [20:20:07] New patchset: Ottomata; "Adding Stefan's laptop ssh key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45597 [20:20:32] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: Connection refused [20:20:56] New patchset: Ottomata; "Adding Stefan's laptop ssh key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45597 [20:27:11] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 20:27:08 UTC 2013 [20:27:23] New patchset: Reedy; "Add .gitignore and .gitreview" [operations/debs/nginx] (master) - https://gerrit.wikimedia.org/r/45598 [20:27:41] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:29:44] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45597 [20:31:20] purged cp1043 [20:31:55] New patchset: Reedy; "Add .gitignore and .gitreview" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/45599 [20:32:18] silly repo [20:33:27] New patchset: Ottomata; "Removing space in Stefan's ssh key id" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45600 [20:33:42] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45600 [20:34:02] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: Connection refused [20:45:22] New patchset: Tpt; "Add two new blogs to the French planet as requested in meta:Planet_Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45602 [20:47:20] PROBLEM - LVS Lucene on 
search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [20:48:21] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [20:50:14] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 26.78 ms [20:51:10] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [20:52:09] New patchset: Faidon; "Switch Swift's thumbhost to eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45604 [20:52:42] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45604 [20:53:03] New patchset: Ottomata; "filters.oxygen.erb - capturing lines X-CS headers to zero-x-cs.log" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45605 [20:53:20] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [20:53:31] all mobile varnish servers have bans [20:53:56] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45605 [20:54:20] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [20:56:16] !log gradual restart of ms-fe[1-4] proxies [20:56:26] Logged the message, Master [20:56:45] New patchset: Reedy; "Add script to update the interwiki cache" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42133 [20:57:29] binasher: have you looked at http://riemann.io/ ? [21:02:21] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45602 [21:02:46] dschoon: hey, that looks neat [21:03:07] i've liked what i've read so far. [21:03:28] and YourKit is fucking fantastic [21:03:31] so i trust their engineers [21:06:23] MaxSem, got a sec to do this one? [21:06:24] https://gerrit.wikimedia.org/r/#/c/45650/ [21:06:26] since you commented on the other? 
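The bans placed on the mobile varnish servers above are Varnish's lazy form of invalidation: a ban expression is appended to a list, and each cached object is tested against it on its next lookup, so the cache drains gradually instead of being wiped in one go (which is what protects the app servers). A hedged sketch of the admin commands (Varnish 3 syntax; the `-T`/`-S` values and the host pattern are illustrative):

```
# Ban every object whose Host header matches the mobile sites;
# evaluated lazily as objects are next requested
varnishadm -T localhost:6082 -S /etc/varnish/secret \
    'ban req.http.host ~ "\.m\.wikipedia\.org$"'

# Inspect the outstanding ban list
varnishadm -T localhost:6082 -S /etc/varnish/secret ban.list
```

Because evaluation happens per request, the hit rate drops and backend traffic ramps up only as fast as clients re-request pages, which is why the purge could be spread over hours here.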
[21:06:45] ottomata, I'm in a meeting, will look in an hour [21:08:11] cool [21:08:14] s'ok no worries [21:08:23] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: Connection refused [21:09:09] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [21:12:16] uh [21:12:28] anyone know why backend squid instances are throwing mad 404s? [21:12:30] in nagios at least [21:13:20] notpeter: 404s or refusing connection? saw the one above on knsq18? [21:14:35] oh, i see, hmm. they are "just" WARNINGS .. but it sounds crit.. [21:14:44] 404 is not so good... [21:15:22] yea, just saying thats probably why the bot didnt spam ..or did it [21:17:14] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: HTTP CRITICAL - No data received from host [21:18:12] New patchset: Asher; "handle current situation where both datacenters are writeable, with the secondary datacenter using remote masters" [operations/software] (master) - https://gerrit.wikimedia.org/r/45652 [21:18:33] it's all upload [21:18:38] paravoid: you there? [21:18:44] I am [21:18:56] can you test wikipedia? [21:19:03] sorry? [21:19:09] all of the esams and pmtpa upload squids are showing all 404 [21:19:14] check_http_upload_on_port!3128 [21:19:22] I changed the check before [21:19:29] ah:) [21:19:59] $USER1$/check_http -H upload.wikimedia.org -I $HOSTADDRESS$ -u /monitoring/backend -p $ARG1$ [21:20:03] somehow that gets 404s [21:20:12] it's probably old, dupe defs [21:20:19] lemme test something [21:20:20] and makes it a WARN, not a CRIT though [21:21:24] nope. drat [21:21:29] New patchset: Asher; "handle current situation where both datacenters are writeable, with the secondary datacenter using remote masters" [operations/software] (master) - https://gerrit.wikimedia.org/r/45652 [21:22:36] !log taking srv278 down for decommissioning per paravoid [21:22:46] Logged the message, Master [21:23:12] where are these? 
[21:23:16] root@spence:/etc/nagios# /usr/local/nagios/libexec/check_http -H upload.wikimedia.org -I 91.198.174.57 -u /monitoring/backend -p 3128 [21:23:19] HTTP WARNING: HTTP/1.0 404 Not Found [21:23:20] that's a full commandline [21:23:25] for amssq47 [21:23:38] why aren't these warnings printed by nagios-wm? [21:23:47] paravoid: that's a very good question [21:24:11] because they are warnings? did we change something maybe to only show CRITs in IRC? [21:24:15] does nagios-wm alert for warnings? [21:24:30] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [21:25:02] oh [21:25:03] anyway [21:25:04] looking [21:25:14] ./check_http has the option -e (expect) [21:25:22] you can tell it which return code you want [21:25:27] to make the 404s criticals [21:25:52] /usr/local/nagios/libexec/check_http -H upload.wikimedia.org -I 91.198.174.57 -u /monitoring/backend -p 3128 -e 200 [21:25:55] HTTP CRITICAL - Invalid HTTP response received from host on port 3128: HTTP/1.0 404 Not Found [21:25:59] yeah I know that this is [21:26:05] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [21:26:23] maybe [21:26:38] Change merged: Asher; [operations/software] (master) - https://gerrit.wikimedia.org/r/45652 [21:28:26] !log fixed dbtree to support our current situation of db-pmpta containing eqiad write masters [21:28:37] Logged the message, Master [21:29:13] i see the difference in the checkcommand between puppet and what is on spence.. yep [21:29:44] paravoid: -u /pybaltestfile.txt vs. -u /monitoring/backend is the (unrelated) change, right [21:29:53] or is that it [21:29:56] it's the related change [21:30:00] ah:) [21:30:33] it ends up on ms7, most likely my squid ACL is wrong [21:30:39] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: HTTP CRITICAL - No data received from host [21:31:00] well, it works with -u /pybaltestfile.txt [21:31:09] is that the new version?
then its just not on spence yet [21:31:16] /usr/local/nagios/libexec/check_http -H upload.wikimedia.org -I 91.198.174.57 -u /pybaltestfile.txt -p 3128 -e 200 [21:31:18] HTTP OK HTTP/1.0 200 OK [21:32:04] no that's the old version [21:32:21] gotcha, git pulled [21:33:39] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: HTTP CRITICAL - No data received from host [21:34:00] yeah check_http does HTTP/1.0 [21:34:07] and we have a url_regex instead of urlpath_regex [21:34:34] New patchset: Dzahn; "add -e 200 to monitoring check_http_upload_on_port - so that if it returns a 404 that also becomes a crit and not just a warn, so eventually the IRC bot tells us about it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45657 [21:35:06] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: HTTP CRITICAL - No data received from host [21:35:41] paravoid: you want that? i guess then the bot would have talked [21:36:22] or of course the return code could be $ARG2$ [21:36:45] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: HTTP CRITICAL - No data received from host [21:38:20] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [21:39:17] PROBLEM - Backend Squid HTTP on sq41 is CRITICAL: Connection refused [21:39:19] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [21:39:24] !log deploying squid config for upload's /monitoring/ switch from url_regex to urlpath_regex [21:39:30] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: HTTP CRITICAL - No data received from host [21:39:34] Logged the message, Master [21:40:30] PROBLEM - Backend Squid HTTP on sq41 is CRITICAL: Connection refused [21:40:40] it's rebuilding [21:40:43] looking to see why it died [21:42:31] checks looking better now? 
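The ACL bug diagnosed above hinges on the difference between the two squid ACL types: `url_regex` matches the entire URL including scheme and host, so a pattern anchored at `^/` never matches, while `urlpath_regex` matches only the path. A hedged squid.conf sketch (the ACL name and policy lines are illustrative, not the production config):

```
# urlpath_regex sees only "/monitoring/backend", so the anchor works
acl monitoring urlpath_regex ^/monitoring/

# the same intent with url_regex would have to account for scheme + host:
# acl monitoring url_regex ^http://[^/]+/monitoring/

http_access allow monitoring
```

This is why the check against `/monitoring/backend` fell through to ms7 and returned 404 until the config switched to `urlpath_regex`.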
[21:42:34] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: HTTP CRITICAL - No data received from host [21:43:30] sorry for the noise [21:44:08] paravoid: yeah, stuff is recovering [21:44:34] RECOVERY - Backend Squid HTTP on sq41 is OK: HTTP OK: HTTP/1.0 200 OK - 486 bytes in 0.065 second response time [21:44:35] RECOVERY - Backend Squid HTTP on knsq18 is OK: HTTP OK: HTTP/1.0 200 OK - 656 bytes in 0.423 second response time [21:44:35] it's going to be cached on some [21:44:49] any ideas how to send a purge everywhere? [21:45:00] no idea how to push something over htcp [21:45:12] I thought that we just did that by restarting squid/varnish ;) [21:45:25] no I pushed a new ACL [21:45:30] this doesn't invalidate caches [21:45:38] (varnish wasn't affected, just squid) [21:45:44] gotcha [21:45:44] RECOVERY - Backend Squid HTTP on knsq18 is OK: HTTP OK HTTP/1.0 200 OK - 657 bytes in 0.494 seconds [21:46:02] tbh, I try to avoid cache invalidation duties, so I'm not sure [21:46:11] RECOVERY - Backend Squid HTTP on sq41 is OK: HTTP OK HTTP/1.0 200 OK - 494 bytes in 0.001 seconds [21:46:27] I'd just let them expire [21:46:55] binasher: you'll probably know: how can I send a purge to all squids? 
htcp on wikitech is really old and I don't think it can be trusted [21:47:04] he's afk [21:47:22] oh [21:47:56] there's purgeList but needs a wikiname [21:48:09] oh, heh [21:50:04] use --aawiki [21:50:31] paravoid: in /home/wikipedia/common/php/maintenance do: echo 'http://blahblahblah' | php ./purgeList.php --wiki aawiki [21:50:37] from wikitech [21:50:48] needs a wiki [21:50:52] I want to do it on upload [21:50:57] anyway, doesn't matter much [21:51:24] all i know is that you use aawiki when there is no "real" wiki [21:52:45] New patchset: Pyoungmeister; "backing out precise maria db repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45664 [21:52:56] paravoid: yes, the warnings are gone from nagios/icinga [21:53:10] thanks [21:53:44] still want to add the -e 200 ? [21:54:08] just to turn warnings into criticals [21:54:24] ^demon: it seems webplatform has a top level repo [21:54:42] ^demon: but they're really only going to be hosting mediawiki extensions with us [21:54:52] all of their repos are empty [21:55:12] can they be deleted and extension repos in mediawiki be created for them? [21:56:24] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [21:56:44] <^demon> Ryan_Lane: I was under the impression they were weird extensions not fit for general consumption. [21:57:25] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [21:57:43] ^demon: they are fit for general consumption [21:58:16] ah, there is actually a puppet camp in Belgium one day before FOSDEM [21:58:29] http://puppetcampghent2013-eanrecl.eventbrite.com/?ref=eanrec&utm_source=eb_email&utm_medium=email&utm_campaign=attnews&utm_term=attlink# [21:58:39] <^demon> Ryan_Lane: There's already an extension called Comments. 
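Besides the `purgeList.php` route above (which sends HTCP purges from MediaWiki), Squid also accepts the HTTP `PURGE` method directly when its ACLs allow it. A hedged sketch of issuing one such purge per cache box; the `purge` ACL is assumed to exist on the squids, and the demo at the bottom runs against a local stand-in server rather than a real cache:

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

def send_purge(cache_host, port, url_path, site="upload.wikimedia.org"):
    """Issue an HTTP PURGE for url_path against one cache server.

    Squid only honors PURGE if its config allows it, e.g.:
        acl purge method PURGE
        http_access allow purge localhost
    """
    conn = HTTPConnection(cache_host, port, timeout=5)
    try:
        conn.request("PURGE", url_path, headers={"Host": site})
        return conn.getresponse().status
    finally:
        conn.close()

# --- demo against a local stand-in for a squid box ---
purged = []

class FakeSquid(BaseHTTPRequestHandler):
    def do_PURGE(self):            # dispatched for the PURGE method
        purged.append(self.path)
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), FakeSquid)
threading.Thread(target=server.serve_forever, daemon=True).start()
status = send_purge("127.0.0.1", server.server_address[1],
                    "/monitoring/backend", site="localhost")
server.shutdown()
```

In practice you would loop `send_purge` over the cache host list (the same list dsh or pybal uses) with the URL to invalidate; HTCP multicast purges remain the normal production path.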
[21:58:56] ^demon: I'd like to work them into the normal mediawiki processes as much as possibl [21:59:00] *possible [21:59:03] binasher: "FOSDEM: MySQL And Friends Community Dinner 2013" [21:59:05] mutante: Are you going to go and troll and ask why it's so slow? :p [21:59:16] !log olivneh synchronized php-1.21wmf8/extensions/EventLogging [21:59:16] <^demon> Ryan_Lane: And is their Translate a fork or branch of the normal Translate? [21:59:19] Reedy: they are going to tell us to use puppetdb [21:59:25] they have a translate extension? [21:59:27] Logged the message, Master [21:59:34] I think it's a different extension [22:00:20] <^demon> I need to go back and look at the logs. I really remember them saying that Comments and Translate were like forks of the originals. [22:00:38] hm [22:01:14] ^demon: could they just be a branch? [22:01:16] "Monitorama 2013" - March 28, Cambridge, MA, Microsoft R&D center [22:01:21] ^demon: also their repos are empty [22:01:23] <^demon> I'm fine with branches. [22:01:25] <^demon> I know. [22:01:37] <^demon> I'm saying, let's not make more new empty repos if branches are better. [22:01:40] I'm asking them about it [22:01:49] Okay cool [22:02:19] yeah. I'd prefer we don't create more new empty repos, too :) [22:02:27] !log olivneh synchronized php-1.21wmf7/extensions/EventLogging [22:02:37] we really need better outreach with high profile users [22:02:38] Logged the message, Master [22:03:25] !log olivneh synchronized php-1.21wmf8/extensions/E3Experiments [22:03:36] Logged the message, Master [22:03:39] Ryan_Lane: indeed [22:04:01] !log olivneh synchronized php-1.21wmf7/extensions/E3Experiments [22:04:11] Logged the message, Master [22:08:36] New patchset: Pyoungmeister; "adding mariadb component to precise-wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45666 [22:09:51] ^demon: translate can just be deleted [22:10:16] ^demon: comments is a fork. 
they say it's "dramatically reworked" [22:12:05] New patchset: Lcarr; "Switching icinga to page us!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45668 [22:12:13] <^demon> WebPlatformTranslation deleted. [22:12:16] ^demon: so, new extension? branch? [22:12:18] okay, the icinga page-mageddon button is going to be turned on! [22:12:41] New patchset: Ori.livneh; "Re-enable E3Experiments on enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45669 [22:12:47] <^demon> If it's a fork, new repo with a new name sounds fine. [22:13:06] ContextComments [22:13:57] <^demon> New repo created. [22:14:21] New patchset: Lcarr; "having icinga redirect to icinga.wm.org instead of neon.wm.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45670 [22:14:21] is it possible for those webplatform.org extensions to be moved to mediawiki/extensions? [22:15:05] I don't think we want a top-level project for every organisation that decides to write a MW extension [22:16:05] <^demon> TimStarling: They're all empty repositories. [22:16:22] <^demon> They never committed to them. We're making a few new repos in extensions as needed. [22:16:59] right, that makes it easy [22:17:15] <^demon> Ryan_Lane: So, what about WebPlatformAuth and WebPlatformSearchAutocomplete? [22:17:20] <^demon> New repos in extensions for those? [22:17:24] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [22:17:58] I'm asking about autocomplete [22:18:01] I don't really get that one [22:18:05] auth for sure [22:18:14] that's a currently existing repo [22:18:24] on their cluster [22:18:51] maybe we should keep webplatform.org [22:18:59] as their configuration repo [22:19:07] <^demon> I already deleted them all. 
[22:19:09] heh [22:19:12] nevermind then :) [22:19:12] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45669 [22:19:23] I'll create a repo later if we decide to host it [22:20:10] <^demon> Ok, well ContextComments and WebPlatformAuth both created in extensions. [22:20:27] awjr: oops i misread your email as you have a deployment tomorrow [22:20:32] not that you sent it and had it actually today [22:20:50] :) [22:22:41] ^demon: autocomplete is also written [22:22:46] and deployed on their cluster [22:22:56] I think that's all that's needed [22:23:39] <^demon> Ok, created WebPlatformSearchAutocomplete [22:23:49] thanks [22:24:43] is mw1085.eqiad.wmnet depooled? [22:26:57] notpeter, mutante ^^ [22:28:17] !log olivneh synchronized wmf-config/InitialiseSettings.php [22:28:24] yes [22:28:26] Logged the message, Master [22:28:31] as is 1072 [22:28:53] preilly: ^ [22:28:59] notpeter: got it thanks [22:29:05] notpeter: I was just making sure [22:29:09] cool [22:29:12] andre__: should planet have a component in BZ ? [22:29:22] in product wikimedia? 
[22:29:51] andre__: just writing the answer to "where should i report issues"..tell me what's best [22:31:50] there, new planet docs: http://wikitech.wikimedia.org/view/Planet.wikimedia.org [22:32:09] gone with the unpackaged, unpuppetized, SVN-using, running on singer, old planet [22:35:37] New patchset: Pyoungmeister; "temporarily adding db1033 and db1036 to decom.pp to get them out of nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45675 [22:38:04] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45675 [22:38:06] New patchset: Jalexander; "adding requested blogs to planet config, 1 de, 1 it, 1 cs, 4 fr, 4 en" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45676 [22:40:08] Jamesofur|away: there is one duplicate in "fr" because i merged this https://gerrit.wikimedia.org/r/#/c/45602/1/templates/planet/fr_config.erb [22:40:17] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45668 [22:40:22] arg, wrong James [22:40:32] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45670 [22:41:03] New review: Dzahn; "there is a duplicate or two in "fr" because i already merged this https://gerrit.wikimedia.org/r/#/..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/45676 [22:43:09] New patchset: Dzahn; "adding requested blogs to planet config, 1 de, 1 it, 1 cs, 4 fr, 4 en" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45676 [22:43:38] New review: Dzahn; "removed 2 from "fr", amended" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45676 [22:43:39] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45676 [22:46:52] mutante: for technical issues it makes sense (bug report welcome against Wikimedia>Bugzilla so I don't forget).
For request of people to get added I would ask James first what fits best for him. [22:47:41] andre__: ok, creating bug. and for adding feeds, yep, ideally people just create gerrit patch sets by themselves :) [22:47:50] now that they can :) [22:48:22] that said, i guess would have to still ask James what is ok to merge [22:48:40] Gerrit patchsets themselves? Dream on. ;) [22:49:03] For techy folks okay, but for average Wikimedia contributors that are not into code that sounds hard. [22:49:12] heh, just 3 steps http://wikitech.wikimedia.org/view/Planet.wikimedia.org#How_do_i_add.2Fremove_feed_URLs.3F [22:49:23] clone, edit a file, git review [22:49:33] we'll see:) [22:50:30] and i even saw that coming and added "[edit] How do i request changes if i can't or don't want to submit a change set myself? " :p [22:50:48] "Bring me a patch and you'll get added within 48 hours. Have no patch, wait for two months."? :D [22:51:24] There already was one https://gerrit.wikimedia.org/r/#/c/45602/ [22:51:25] Finally a sustainable business model for Wikimedia. [22:51:37] Yay, nice [22:52:34] * andre__ should go to sleep [22:52:53] right, good night then [22:53:09] ttyl [22:54:36] eek, why I see write queries on a slave? https://ishmael.wikimedia.org/more.php?host=db1001&hours=24&checksum=2841018436396406640 [22:55:12] !log changed all labs sudoers from "sudoUser: ALL" to "sudoUser: project-" [22:55:23] because someone didn't set read_only on them? 
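The three-step feed-change workflow mentioned above (clone, edit a file, git review) can be sketched as follows. The repo URL and template filename are assumptions inferred from the change sets linked in the log (cf. `templates/planet/fr_config.erb`); the steps are printed rather than run.

```shell
# Sketch of the three-step planet feed change: clone the puppet repo,
# edit the per-language planet config template, send it for review.
FEED='http://example.invalid/blog/feed.atom'   # placeholder feed URL
STEPS=$(cat <<EOF
git clone https://gerrit.wikimedia.org/r/operations/puppet
# edit templates/planet/en_config.erb and add: $FEED
git review
EOF
)
echo "$STEPS"
```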
[22:55:24] Logged the message, Master [22:55:48] !log changed all lab sudoer policies to !authenticate [22:55:49] !log updating all planets [22:55:58] Logged the message, Master [22:56:08] Logged the message, Master [23:07:12] New patchset: Pyoungmeister; "coredb: removing slave-related nagios checks from shards with no master" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45679 [23:09:09] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45679 [23:17:16] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [23:17:22] icinga-wm: -_- [23:17:49] Ryan_Lane, so now it's at least being monitored... [23:18:07] it was before too [23:18:15] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.027 second response time on port 11000 [23:18:51] ok. I'm going to make virt1 the new controller [23:31:20] New patchset: Pyoungmeister; "removing db1036 and db1033 from decom.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45683 [23:31:48] RECOVERY - Host silicon is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [23:44:06] New patchset: Asher; "select mobile apache backend based on $mw_primary instead of hardcoding. this matches how the mobile api backed is selected." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45688 [23:44:46] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45688 [23:47:33] notpeter: i'm adding you as a reviewer for a schema change (https://gerrit.wikimedia.org/r/#/c/45672/) [23:48:36] hehe [23:48:52] !log olivneh synchronized php-1.21wmf7/extensions/GettingStarted [23:49:03] Logged the message, Master [23:49:28] !log olivneh synchronized php-1.21wmf8/extensions/GettingStarted [23:49:39] Logged the message, Master [23:49:58] !log installing package upgrades on zirconium [23:50:11] Logged the message, Master
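On the write-queries-on-a-slave question above: the usual guard is the MySQL `read_only` system variable, which makes writes from non-superuser accounts fail on a replica. A minimal sketch of checking and setting it, with an illustrative hostname and the commands printed rather than executed:

```shell
# Sketch: verify and enable read_only on a MySQL replica.
# Hostname is illustrative; nothing here touches a real server.
HOST='db1001'
CHECK="mysql -h $HOST -e \"SHOW GLOBAL VARIABLES LIKE 'read_only'\""
FIX="mysql -h $HOST -e \"SET GLOBAL read_only = 1\""
echo "$CHECK"
echo "$FIX"
```

Note that `read_only` does not block writes from accounts with the SUPER privilege, which is one way write queries can still appear on a replica.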