[00:00:17] Susan: ha ha [00:00:22] additionally, I don't currently expose modules directly on minions [00:00:43] it goes through a runner, which imposes additional restrictions [00:00:54] to deploy new modules, you need root access [00:01:09] you can't just let any non-root user run commands on all apaches like we do with dsh [00:01:54] we could have a cmd.run module that drops its privileges to a non-root users [00:01:56] *user [00:02:54] then it would be a drop-in replacement for dsh [00:03:58] Is there a page explaining the deficiencies of using dsh? I'm looking at "Scap" and "Git-deploy" on wikitech.wikimedia.org. [00:04:26] TimStarling: have you measured IOPS from the NetApp? [00:04:30] Susan: there's only one major complaint with dsh right now, and it's not dsh's issue [00:04:45] Susan: it's that the dsh groups must be manually kept up to date [00:04:53] Ah. [00:04:56] Is that tedious? [00:05:15] its method of handling timeouts, non-matching host keys, and such isn't great either, but it's something that can be dealt with [00:05:20] Krinkle: https://gerrit.wikimedia.org/r/#/c/45117/1 [00:05:38] Susan: it's more that it's bad if it isn't kept up to date with what's pooled [00:05:59] Because you end up with out-of-sync servers? [00:06:02] yes [00:06:05] Got it. [00:06:23] so, if you guys want to replace what I wrote in salt, you need to: [00:06:28] 1. have proper reporting [00:06:50] 2. have a way to get to that reporting outside of a deployment [00:07:15] 3. have a way of fetching things to the systems without modifying the current code [00:07:28] that's roughly it. 
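[Editor's note: the "cmd.run module that drops its privileges to a non-root user" idea above is essentially what Salt's `runas` option does (it comes up again later in this log). A minimal sketch of the privilege drop such a wrapper performs, with an invented function name and not Salt's actual API:]

```python
import os
import pwd
import subprocess

def run_as(username, cmd):
    """Run a shell command as the given user, sketching the privilege
    drop a non-root cmd.run wrapper would perform. The function name
    is hypothetical; this is not the Salt implementation."""
    pw = pwd.getpwnam(username)

    def demote():
        # Order matters: drop the group first, because after setuid()
        # to a non-root uid the process may no longer change groups.
        os.setgid(pw.pw_gid)
        os.setuid(pw.pw_uid)

    result = subprocess.run(
        cmd, shell=True, preexec_fn=demote,
        capture_output=True, text=True,
    )
    return result.stdout.strip()
```

[Calling `run_as` with the invoking user's own name is a no-op demotion, which makes the sketch testable without root.]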
[00:07:45] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 00:07:40 UTC 2013 [00:07:46] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 00:07:44 UTC 2013 [00:08:06] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [00:08:06] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 00:08:03 UTC 2013 [00:08:15] RECOVERY - Puppet freshness on mw15 is OK: puppet ran at Thu Jan 24 00:08:11 UTC 2013 [00:08:22] Ryan_Lane, there's also the issue of testwiki over NFS [00:08:35] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [00:08:45] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 00:08:34 UTC 2013 [00:08:57] MaxSem: when I said I would add that into the deployment system I was told it wasn't necessary and that we should push people to beta [00:09:06] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [00:09:13] bleh [00:09:15] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 00:09:05 UTC 2013 [00:09:25] beta is not 100% equivalent to prod [00:09:32] my solution was: git deploy test [00:09:35] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [00:09:37] and will not be this year [00:09:46] which would fetch/checkout on a single system [00:09:49] update dsh group from Nagios config. 
@fenari:~$ grep "function updategroup" `which upgrade-helper` -A 30 [00:10:05] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [00:10:25] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 00:10:21 UTC 2013 [00:10:35] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [00:10:55] Ryan_Lane, if fast-deploying to testwiki is a matter of one command, it could be acceptable [00:11:09] well, in the long run I think even scap will no longer use nfs [00:11:24] so this is an issue no matter what [00:12:19] Ryan_Lane, is there a touch-over-salt script already? could be massively useful even before we switch to git-deploy:) [00:12:41] MaxSem: not yet. should be easy to add, though [00:13:06] actually, adding the cmd.run as non-root user would likely be the best thing to add [00:15:56] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 00:15:45 UTC 2013 [00:16:05] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [00:16:25] * Ryan_Lane shrugs [00:16:50] I'll just go back to working on labs if we're not planning on using any of the new deployment code [00:18:20] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 196 seconds [00:18:46] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 226 seconds [00:19:05] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 238 seconds [00:20:05] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [00:20:45] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 22 seconds [00:21:47] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 27 seconds [00:23:50] New patchset: Pyoungmeister; "removing ill-defined memory check from kaulen" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45498 [00:24:02] Ryan_Lane: Speaking of Labs, I see that the Wikimedia Foundation is hiring 
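[Editor's note: the "dsh groups must be manually kept up to date with what's pooled" complaint above is usually solved by regenerating the group file from pool state, which is what the quoted `updategroup` helper does from the Nagios config. A toy sketch of that regeneration; the input format is assumed, not taken from the actual upgrade-helper script:]

```python
def render_dsh_group(pooled_hosts):
    """Render a dsh group file (one hostname per line) from the list
    of currently pooled hosts, so the group tracks the pool instead
    of being hand-edited."""
    # Deduplicate and sort so repeated regenerations are byte-stable,
    # which keeps the generated file diff-friendly.
    return "\n".join(sorted(set(pooled_hosts))) + "\n"
```

[Feeding it the pooled list on each depool/repool keeps the group in sync, which is the out-of-sync-servers problem described earlier.]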
someone to work on Tool Labs (as a contractor)? [00:24:10] yep [00:24:24] that will involve setting things up inside of projects [00:24:26] Will that primarily be setting up DB replication, then? [00:24:29] no [00:24:32] Hmm. [00:24:37] that's being handled by the current team [00:24:40] New review: Dzahn; "thanks" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45498 [00:24:41] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45498 [00:24:51] What things inside projects? [00:25:13] setting up a working environment and helping people transition [00:25:36] puppetizing everything as it's needed, walking people through the process, etc. [00:25:40] Ah, okay. [00:26:05] will also set up services that aren't supported labs-wide as well, likely [00:26:14] Has it been decided what OS will be used? [00:26:19] Solaris? Debian? Something else? [00:26:19] ubuntu [00:26:23] All right. [00:27:00] And has it been decided whether per-user projects will be the norm? Like we currently have toolserver.org/~user/ and such. [00:27:11] There was a push for multi-maintainer projects/tools at some point. [00:27:16] Among other failed pushes. [00:28:13] lol, solaris [00:28:17] ideally it will not be per-user [00:28:28] the norm should be multi-maintainer [00:28:35] Susan: better hope apergos isn't listening [00:28:50] it's possible to create multi-maintainer accounts by creating a labsconsole user [00:29:09] Reedy: I fear felicity much more than apergos. :-) [00:29:10] or by adding system users via puppet [00:29:23] Change abandoned: Tim Starling; "Let's just abandon this for now. It can be rewritten when we actually need it. I'm not sure if it wi..." [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/43356 [00:30:26] Ryan_Lane: Yeah... users will need a lot of hand-holding, I think. [00:30:44] AaronSchulz: Thanks for the link [00:30:51] Many Toolserver users can barely SSH. 
[00:30:57] yep. and it's more than the current labs team can handle while also working on the infrastructure [00:31:03] hence the contractor ;) [00:31:08] * Susan nods. [00:31:16] I think there will also be one on the wmde side? [00:31:20] What's the goal for stable DB replication? End of year? [00:31:34] 4 months ago? :) [00:31:37] well, it's supposed to be end of this month or beginning of next [00:31:37] Heh. [00:31:46] nah. roadmap showed this quarter [00:31:57] end of february would actually be the deadline [00:32:09] we hoped to get to it before then, though [00:32:24] Hoped? [00:32:31] well, there's still time ;) [00:32:42] s/Hoped/Hope/ [00:33:01] New patchset: Tim Starling; "adding acct pkg to base::standard-packages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43991 [00:33:08] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43991 [00:33:15] First quarter would be nice. [00:33:27] It might also be nice to fix the Toolserver's DBs. [00:33:36] The replica are all horribly corrupt. [00:33:40] replicas [00:33:54] they easily corrupt because users are allowed to write to tables in them [00:33:55] Not horribly, but just enough corruption to be super-annoying. [00:34:12] I thought you were going to say "because MySQL sucks." [00:34:14] Susan: I have been talking to nosy and DaB about getting them dumps of our dbs so they can fix things [00:34:22] notpeter: That'd be awesome. :-) [00:34:23] I just gave them s4 and s7 bumps [00:34:25] *dumps [00:34:38] this is one of the reasons we don't want to allow writable tables in labs db replicas [00:34:38] s1 is all that matters. 
;-) [00:34:39] waiting to hear back from them about what they'd like next [00:35:09] New patchset: Tim Starling; "Add .gitreview" [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/43347 [00:35:14] Change merged: Tim Starling; [operations/debs/wikimedia-task-appserver] (master) - https://gerrit.wikimedia.org/r/43347 [00:35:25] Ryan_Lane: Yeah, I vaguely remember that coming up. Joining against DBs is awfully convenient, though. [00:35:44] yes, but at the expense of it being broken all the time and ops' sanity [00:36:31] I hadn't realized that user writes were the suspect (culprit) of frequent DB corruption. I thought it was just general network hiccups and such. [00:59:09] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [01:03:05] csteipp: yay, extension RSS! i did not expect that to even still be alive..awesome how it grew from this http://meta.wikimedia.org/wiki/User:Mutante/RSSFeed#Source [01:03:16] # $input = mysql_escape_string($input); haha [01:03:43] hah, mutante, if I'd known that was from you I would have vetoed it on principle ;) [01:04:03] arrr:) [01:05:13] Yeah, that escaping was pretty l33t :) [01:05:42] I'm ashamed to say I have worse floating around the nets. [01:08:45] i forgot all of this: "we were thinking of even making this an automatic feature, (read from "interwiki" sql table and auto-include RCs of other Wikis in according wiki page like /Feeds/OtherWiki/RecentChanges. 
(original idea by mattis manzel)" [01:09:25] Mattis Manzel always wanted to build the "wiki net" [01:12:35] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jan 24 01:12:33 UTC 2013 [01:36:16] TimStarling: http://docs.saltstack.org/en/latest/ref/modules/all/salt.modules.cmdmod.html#salt.modules.cmdmod.run <— already has a "runas" option which drops privileges [01:36:55] TimStarling: we can further wrap that in a runner to limit scoping (grains, regex, etc) or further limit things by the system calling the runner [01:45:56] but there's no authentication for executing publish jobs apart from host-based authentication? [01:46:34] so you couldn't, say, prevent the apache user from executing jobs normally used by admins? [01:47:35] in this situation I'd use the mwdeploy user on the system [01:47:41] and the runner would always set mwdeploy [01:51:03] or if you wanted to trust the deployment system itself to enforce users, you could have the calling user sent along with the call [01:51:55] wait. do you mean an apache user on one of the apaches itself? [01:52:48] it's not actually host based authentication. it's certificate based authentication, and you set acls based on the certificate name [01:54:15] can different users have different certificates? [01:55:12] though it's possible, the system isn't set up for that [01:55:40] salt-api, though, if we choose to use it would offer that level of flexibility. 
I don't think it's in a state that's ready for use, though [02:07:35] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [02:08:06] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [02:11:30] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42506 [02:16:39] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Thu Jan 24 02:16:16 UTC 2013 [02:25:28] New patchset: Dzahn; "replace new index.html with old redirect to meta planet page" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45508 [02:28:22] !log LocalisationUpdate completed (1.21wmf8) at Thu Jan 24 02:28:21 UTC 2013 [02:28:34] Logged the message, Master [02:36:05] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [02:43:29] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [02:46:29] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [02:47:42] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45267 [02:52:59] !log LocalisationUpdate completed (1.21wmf7) at Thu Jan 24 02:52:58 UTC 2013 [02:53:09] Logged the message, Master [02:59:10] PROBLEM - HTTP on formey is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:59:31] New review: Dzahn; "yeah, ugly HTML, but starting out with exactly what was on old pre-puppet setup to enhance later" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45508 [02:59:32] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45508 [02:59:59] RECOVERY - HTTP on formey is OK: HTTP OK: HTTP/1.1 200 OK - 3596 bytes in 6.708 second response time [03:08:39] TimStarling: any idea why djvu files have such massive metadata? 
[03:08:55] they have OCR text in them [03:09:03] it's a ProofreadPage feature [03:11:33] !log DNS update - switching planet over to zirconium / new planet [03:11:43] Logged the message, Master [03:16:13] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [03:22:53] PROBLEM - HTTP on formey is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:24:13] PROBLEM - SSH on formey is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:26:03] PROBLEM - HTTP on formey is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:26:09] hm. formey died [03:26:23] PROBLEM - HTTPS on formey is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:27:05] connect: com2 port is currently in use [03:27:06] :D [03:27:13] * Ryan_Lane stabs drac5 [03:28:19] no output on console [03:28:22] !log powercycling formey [03:28:32] Logged the message, Master [03:28:36] PROBLEM - HTTPS on formey is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:28:45] PROBLEM - SSH on formey is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:31:04] RECOVERY - SSH on formey is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:31:13] RECOVERY - HTTPS on formey is OK: OK - Certificate will expire on 08/22/2015 22:23. [03:31:18] RECOVERY - HTTP on formey is OK: HTTP OK HTTP/1.1 200 OK - 3596 bytes in 0.074 seconds [03:31:44] RECOVERY - HTTP on formey is OK: HTTP OK: HTTP/1.1 200 OK - 3596 bytes in 0.055 second response time [03:32:03] RECOVERY - HTTPS on formey is OK: OK - Certificate will expire on 08/22/2015 22:23. 
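[Editor's note: the certificate-name ACL scheme described earlier in the log ("you set acls based on the certificate name") boils down to a table mapping a cert name to the functions it may call. A toy sketch, loosely modeled on that idea; the names and module globs are invented. Its limitation is exactly the one raised above: with one certificate per host, it cannot distinguish users on the same host.]

```python
import fnmatch

# Hypothetical ACL table keyed by certificate (client) name.
ACLS = {
    "deploy-master": ["deploy.*", "cmd.run"],
    "monitoring": ["status.*"],
}

def allowed(cert_name, function):
    """Return True if the named certificate may call the function.
    Unknown certificates are denied by default."""
    return any(fnmatch.fnmatch(function, pattern)
               for pattern in ACLS.get(cert_name, []))
```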
[03:32:12] RECOVERY - SSH on formey is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:34:25] TimStarling: I made https://gerrit.wikimedia.org/r/#/c/45510/ [03:34:38] still feels awkward, oh well [03:34:40] * AaronSchulz should go home [04:04:18] New patchset: Tim Starling; "Disable E3Experiments due to bug 44298" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45511 [04:06:46] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45511 [04:07:45] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 04:07:37 UTC 2013 [04:07:54] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [04:07:54] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 04:07:49 UTC 2013 [04:08:04] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 04:07:56 UTC 2013 [04:08:05] RECOVERY - Puppet freshness on zinc is OK: puppet ran at Thu Jan 24 04:07:58 UTC 2013 [04:08:09] !log tstarling synchronized wmf-config/InitialiseSettings.php [04:08:10] TimStarling: if you haven't merged yet, could you hang on a minute or three? [04:08:12] oh. [04:08:19] Logged the message, Master [04:08:34] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with args jenkins [04:08:35] RECOVERY - Puppet freshness on virt0 is OK: puppet ran at Thu Jan 24 04:08:32 UTC 2013 [04:08:35] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [04:08:35] !log tstarling synchronized wmf-config/CommonSettings.php [04:08:44] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 04:08:40 UTC 2013 [04:08:45] Logged the message, Master [04:08:54] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [04:09:04] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 04:08:58 UTC 2013 [04:09:20] lol. 
[04:09:35] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [04:09:54] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [04:11:34] PROBLEM - jenkins_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with args jenkins [04:13:46] New patchset: Reedy; "Everything non 'pedia to 1.21wmf8" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45513 [04:14:24] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 04:14:21 UTC 2013 [04:14:46] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45513 [04:14:54] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [05:07:46] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [05:08:16] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [06:04:09] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [06:11:54] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [06:16:59] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:25:48] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 06:25:45 UTC 2013 [06:25:59] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:25:59] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 06:25:57 UTC 2013 [06:26:08] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 06:26:01 UTC 2013 [06:26:48] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [06:26:58] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:34:39] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 06:34:36 UTC 2013 [06:34:48] PROBLEM - Puppet freshness on ms1 is CRITICAL: 
Puppet has not run in the last 10 hours [06:43:18] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 06:43:15 UTC 2013 [06:43:59] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [07:59:28] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [07:59:58] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [08:05:28] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with args jenkins [08:07:48] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 08:07:38 UTC 2013 [08:07:49] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 08:07:42 UTC 2013 [08:07:58] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [08:08:08] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 08:08:01 UTC 2013 [08:08:29] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [08:08:29] PROBLEM - jenkins_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with args jenkins [08:08:58] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [08:09:28] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with args jenkins [08:14:29] PROBLEM - jenkins_service_running on gallium is CRITICAL: PROCS CRITICAL: 0 processes with args jenkins [08:15:38] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 08:15:36 UTC 2013 [08:16:28] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [08:54:48] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [09:03:42] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [09:04:33] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [09:14:43] PROBLEM - Host mw1085 is DOWN: PING CRITICAL 
- Packet loss = 100% [09:15:13] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [09:19:19] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [09:19:19] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [09:19:19] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [09:19:19] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [09:19:19] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [09:19:20] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [09:19:20] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [09:47:55] New review: Nikerabbit; "This file is a wonderful mix of spaces and tabs indentation." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45349 [10:11:04] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [10:12:15] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [10:12:34] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [10:14:49] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123 [10:15:14] rsyncs I suppose [10:25:21] !log mlitn synchronized wmf-config/CommonSettings.php 'Reinstate AFTv5 test groups' [10:25:33] Logged the message, Master [10:56:23] New patchset: Matthias Mullie; "Reinstate AFTv5 test groups" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45527 [10:58:56] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [10:59:46] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [11:00:41] New patchset: Matthias Mullie; "Reinstate AFTv5 test groups" 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45527 [11:04:48] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45527 [11:06:00] !log mlitn synchronized wmf-config/CommonSettings.php 'Reinstate AFTv5 test groups' [11:06:11] Logged the message, Master [11:55:43] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [11:56:02] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [12:07:42] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 12:07:37 UTC 2013 [12:07:43] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [12:07:43] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 12:07:39 UTC 2013 [12:08:07] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [12:08:07] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 12:07:56 UTC 2013 [12:08:07] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 12:07:58 UTC 2013 [12:08:42] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [12:08:53] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 12:08:46 UTC 2013 [12:09:02] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [12:09:03] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 12:09:00 UTC 2013 [12:09:43] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [12:10:02] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [12:18:53] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 12:18:50 UTC 2013 [12:19:42] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [12:19:43] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 12:19:40 UTC 2013 [12:20:02] PROBLEM - 
Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [13:00:20] what's ms1 or ms2 ? [13:16:21] they are media storage servers [15:04:26] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [16:05:31] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [16:07:49] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 16:07:44 UTC 2013 [16:08:40] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [16:08:40] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 16:08:36 UTC 2013 [16:09:39] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [16:09:58] robh: you around yet? [16:11:59] cmjohnson1 or robh, ahem ahem ahem ahem cough cough cough https://rt.wikimedia.org/Ticket/Display.html?id=3946 [16:12:00] :) [16:12:55] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45358 [16:13:12] New patchset: Diederik; "Replace custom CS carrier codes with MCC-MNC mobile carrier codes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45353 [16:13:18] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [16:13:38] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [16:14:29] ottomata: I know...other things have taken priority...my issue is that it is the partman recipe is failing during install. why? idk yet [16:14:39] haha, yeah [16:14:46] i know its cool [16:15:11] its not a huge priority for us to have it, it just would be nice and it has been a while [16:15:23] mind if I just bump you every week or so about it? you can put it off as long as you need [16:17:06] ottomata: thx [16:28:01] Anyone knows if Sumana usually hangs around IRC and what his nick is? [16:28:42] her* [16:28:44] her nick is sumanah [16:28:57] paravoid: That's too easy. 
:-) [16:29:04] I'd try #wikimedia-tech though [16:29:17] she doesn't join -operations afaik [16:29:36] I didn't expect she would, but I was pretty sure you guys'd know. :-) [16:38:21] Coren- #mediawiki and #wikimedia-dev are also good places to look [17:01:31] anyone around in case of emergency for a mobile redeployment of last week's changes? we will also need a mobile varnish cache flush - we were hoping to get started soon and would likely need the flush ~15mins and to keep an eye on cluster health (particularly bits) for the next hour or so to make sure nothing explodes [17:04:46] awjr: what actually happens if you don't do the cache flush? [17:05:33] mark broken mobile experience; mostly a result of serving out-of-date js/css, or a combination of up to date and out of date js/css [17:06:08] so basically no backwards compatibility [17:06:15] is it really that bad? [17:06:21] unfortunately, yes. [17:06:28] it's not always - it depends on the kinds of changes going out [17:06:33] but this one will require a cache flush [17:06:47] mark we are in the process of fixing this [17:07:21] hey mark, drdee is asking me to give stefan petrea (a contractor who already has access on stat1) access to the analytics nodes so he can run some hadoop shell commands, do we need to put in an RT for this? [17:07:45] my best guess is within the next month we will drastically reduce the need for mobile varnish cache flushes but not necessarily totally remove the need [17:07:47] https://noc.wikimedia.org/cgi-bin/report.py [17:07:48] ottomata: yes, RT is needed for every access request [17:07:52] [17:07:53] cool, thought so, just double checking [17:07:59] but i am hoping we'll have it all resolved by the end of the fiscal [17:08:01] danke [17:09:51] mark if we do our redeployment now, would you be available to help with a mobile varnish cache flush in 10-15 mins and help respond if things explode (like they did last week with bits dying?) 
- i don't expect they will, but just in case [17:10:00] yeah I guess [17:10:08] mark thank you [17:14:30] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [17:14:45] bits dying = back to 2001 look. It's nostalgic. ;-) [17:15:19] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [17:15:56] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [17:16:00] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [17:17:41] awjr: can we possibly purge just a subset of bits (even if some complicated regexp is needed) or really all of it? [17:18:03] mark we dont need to purge anything from bits [17:18:08] mark we just need to purge the mobile varnish cache [17:18:10] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [17:18:28] ok [17:18:38] mark, what's typically needed is to purge older HTML because newer resources screw it [17:18:46] understood [17:18:47] we've never had to purge anything from bits, afaik [17:19:38] we can purge one mobile box at a time [17:20:07] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [17:20:09] i forgot we're going to have to scap :| [17:20:24] but i heard scap takes a lot less time nowadays … we'll need the flush shortly after scap is complete [17:21:38] when do you want to start? 
[17:23:18] im getting the changes up on fenari now, we'll do a quick sanity check then i'll start the scap [17:23:23] so probably scap in ~5-10 mins [17:23:26] ok [17:29:03] New patchset: Mark Bergsma; "Reduce mobile varnish backend connection count to 600 (4x normal use)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45558 [17:29:22] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45558 [17:34:11] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.128 second response time [17:34:22] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.053 second response time [17:40:25] ok, running scap now [17:55:57] how is scap doing? [17:56:08] !log awjrichards Started syncing Wikimedia installation... : Redploy rolled back MobileFrontend changes from Jan 17 [17:56:19] Logged the message, Master [17:57:11] mark just finished the l10n updates [18:00:35] !log authdns update "fixing typo on db1060.mgmt.eqiad" [18:00:46] Logged the message, Master [18:06:58] Change merged: preilly; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45353 [18:07:12] mutante: ping [18:08:10] thank you preilly! [18:08:27] drdee: it will be live shortly [18:09:30] preilly: merged on sockpuppet [18:09:45] mutante: okay thanks [18:10:06] ok why are you guys doing this in the middle of a deployment with no advance warning? [18:10:46] mark: Thursday 18:00-19:00 10 a.m. - 11 a.m. Wikipedia Zero partner testing [Dan / Patrick] [18:11:03] mark: taken from http://wikitech.wikimedia.org/view/Deployments [18:11:05] so we're doing two things at once now? [18:11:37] mark: what is the other thing? 
[18:11:59] mobilefrontend redeploy [18:12:04] Thursday, Jan 24 [18:12:04] 17:00-18:00 [18:12:04] 09:00-10:00 [18:12:04] Re-deployment of http://www.mediawiki.org/wiki/Extension:MobileFrontend/Deployments#17_January.2C_2013 [Arthur/Max] [18:12:06] mark: are you referring to Thursday, Jan 24 17:00-18:00 09:00-10:00 Re-deployment of MobileFrontend after Jan 18 rollback [Arthur/Max] [18:12:08] still waiting on scap to finish [18:12:27] awjr: you're 12 minutes over your window [18:12:43] preilly: thanks, i realize that. like i said, waiting for scap to finish [18:12:58] awjr: how long has scap been running? [18:13:31] preilly: i started scap at 940am pst [18:13:35] 33 minutes [18:13:45] awjr: that seems really long [18:13:57] awjr: what is the fan out? [18:14:02] fan out? [18:14:28] last time i ran scap it took a couple of hours (!) but apparently it has been fixed since then [18:14:34] for ddsh [18:14:39] Is it back to 30 yet/ [18:14:46] *? [18:14:48] AaronSchulz: it was the other day [18:14:54] AaronSchulz: it must be lower again [18:15:08] dsh -F30 -cM -f /tmp/tmp.VkLOkm2Hn4 -o -oSetupTimeout=10 /usr/bin/scap-1 "mw1010.eqiad.wmnet mw1070.eqiad.wmnet mw40.pmtpa.... [18:15:14] !log awjrichards Finished syncing Wikimedia installation... : Redploy rolled back MobileFrontend changes from Jan 17 [18:15:22] awjr: are you using watch? [18:15:25] just finished - i just need a few more minutes for the cache flush, etc [18:15:27] preilly: yes [18:15:27] Logged the message, Master [18:15:44] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [18:16:03] awjr: I don't have anything other than the Varnish change that already got merged [18:16:08] !log awjrichards synchronized php-1.21wmf8/extensions/MobileFrontend 'touch files' [18:16:18] Logged the message, Master [18:16:27] mark: ok, can you flush the mobile varnish cache?
[18:16:40] !log awjrichards synchronized php-1.21wmf7/extensions/MobileFrontend 'touch files' [18:16:40] awjr: Why do you need that? [18:16:51] preilly: on account of the changes that are being deployed [18:16:52] Logged the message, Master [18:17:04] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 192 seconds [18:17:05] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 192 seconds [18:17:15] purged cp1041 [18:17:21] New patchset: Demon; "MediaWiki repository is now offline" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45564 [18:17:22] waiting a bit before I do the others [18:17:26] ok [18:17:29] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 201 seconds [18:17:34] mark: I really don't think that it's needed [18:17:56] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 211 seconds [18:18:05] preilly: we needed to purge the mobile varnish cache the last time these changes were deployed [18:18:10] it is needed. [18:18:20] i would not ask for this if it weren't [18:18:28] awjr: I really don't think that is the case [18:18:55] we'll find out easily enough, as 3/4 mobile varnish servers still have cache [18:19:07] mark: yeah I was just thinking that [18:19:31] i'm browsing around some [18:19:31] we pushed out js/css changes to stable MobileFrontend - MobileFrontend hard-codes requests for those resources (with the timestamp) in the html, which gets cached by varnish [18:19:33] site looks fine to me [18:19:46] what would break? [18:19:54] preilly: seems to be 30 [18:20:00] looking at puppet [18:20:13] AaronSchulz: any objections on unmounting /mnt/upload6, /mnt/upload7, /mnt/thumbs, /mnt/thumbs2 from everywhere? [18:20:19] New review: Pgehres; "This does not affect fundraising." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/45564 [18:20:34] New review: Demon; "Don't merge until tomorrow."
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/45564 [18:21:03] preilly: believe me, i'm the one who's always been screaming loudest about those mobile cache purges, I don't do them lightly ;) [18:21:07] sometimes breakage is non-obvious, or you have to browse around a fair amount to see it. in this particular case, you would definitely see breakage in the beta version of the site on phones that do not support jquery [18:21:23] preilly, mark: as stated elsewhere, we are working on fixing the issue [18:21:27] I think we will purge, but we'll do it slowly and controlled [18:21:38] no need to completely wipe everything in seconds [18:22:19] awjr: probably not the right time to have this discussion but is there anywhere I can read about these plans? [18:22:20] paravoid: no [18:22:32] e.g. are you planning to use ESI? [18:22:34] especially not since scap itself took ages [18:23:03] cache hit rate for cp1041 is at 20% now [18:23:04] paravoid, we have ESI support in a branch [18:23:11] bits around 99.6% [18:23:24] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [18:23:52] also, we have preparations to further the ResourceLoader support [18:24:11] which will also reduce the need in purges [18:24:48] paravoid starting next week we are improving RL support in MobileFrontend like MaxSem just mentioned and getting rid of support for a lot of dynamic features for devices that don't support jquery [18:24:55] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 26.67 ms [18:25:02] which will greatly reduce when we need to do the varnish cache flushes [18:25:02] sbernardin: around? [18:25:22] sbernardin: can you please kill srv278 once and for all? 
[18:25:24] once we have that in place, we'd only need to do cache flushes when html for those devices changes, which is pretty rare [18:25:54] paravoid: we have been experimenting with ESI and are talking about getting it fully implemented in MF hopefully by the end of the fiscal [18:26:14] paravoid: will do [18:26:26] sbernardin: thanks! [18:26:44] wiped cp1042 [18:26:46] if someone could also help me to beat our front-end devs into submission to make backwards-compatible changes to HTML and CSS, we will not need any purges at all:P [18:26:46] paravoid: sbernardin : how about srv266 , that is its evil twin ?:) [18:26:53] paravoid: it's being decommissioned ...right? [18:26:59] sbernardin: correct [18:27:04] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [18:28:10] New patchset: Faidon; "Unmount upload-related NFS from everywhere" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45565 [18:28:53] New review: Faidon; "Checked with Aaron, no objections." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/45565 [18:28:54] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45565 [18:29:15] oh man, what a good day today [18:29:17] no more NFS [18:29:25] not exactly [18:29:26] close :) [18:29:28] i've been wanting that since like 2004 [18:29:36] well, no more NFS, but we still depend on ms7 [18:29:38] working on that [18:29:38] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [18:29:51] that's a different thing [18:30:13] and we still have /home in NFS :) [18:30:13] mark: does busy threads on application servers eqiad seem a bit high to you? [18:30:18] but sure, progress! [18:30:50] mark: no more NFS — now that's progress [18:30:56] AaronSchulz: btw, there was that review extension of yours that was still using NFS; did that get fixed? [18:30:59] AaronSchulz, paravoid: nice work!
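For context on the ESI experiments mentioned above: ESI lets the cache store a page skeleton and stitch in per-device fragments at request time, which reduces how often the full HTML has to be purged. A minimal, hypothetical Varnish 3 VCL fragment showing how ESI processing is enabled (illustrative, not the production config):

```vcl
sub vcl_fetch {
    # Only parse ESI markup in HTML coming back from the apaches
    if (beresp.http.Content-Type ~ "text/html") {
        set beresp.do_esi = true;
    }
}
```

The backend still has to emit `<esi:include src="..."/>` tags for this to do anything; the skeleton and each fragment are then cached and invalidated independently.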
paravoid: it's still disabled, and I don't maintain it [18:31:28] features should have disabled it ages ago so I don't know how much a shit I give [18:31:35] haha :) [18:31:40] AaronSchulz: ha ha ha [18:32:34] preilly: yep, we need to increase the limit as far as memory allows [18:32:38] those boxes could do with more memory nowadays [18:33:44] New patchset: Aaron Schulz; "Removed unused NFS file backends." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45566 [18:34:33] mark: I wonder if it makes sense to use those new apaches in eqiad instead of Tampa [18:34:44] mark: and then order another set for Tampa later [18:34:54] why not both? [18:34:56] paravoid: https://gerrit.wikimedia.org/r/#/c/45566/1? [18:35:30] mark: well the first set is ordered already right? [18:35:48] mark: I just wanted them as fast as possible [18:36:02] mark: but I think they're needed in both for sure [18:36:31] we're not that tight on capacity really [18:36:53] but I believe we can use more too [18:37:12] we should also possibly look into upgrading ram in the existing eqiad boxes [18:37:20] mark: well having 96 hosts up for application servers scares me a bit [18:37:23] that would give them a lot more capacity for low cost [18:37:27] really? [18:37:31] dude, you should have been here in 2004-2005 [18:37:32] mark: that's a great point [18:37:39] mark: ha ha ha [18:38:03] this feels rather comfortable comparatively ;) [18:38:07] mark: I'm trying to figure out what's up with mw1103 [18:38:15] mark: yeah I bet [18:38:16] New patchset: Aaron Schulz; "Removed unused NFS file backends."
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45566 [18:38:27] preilly: I had to fix that doc typo :) [18:38:38] "files backends" [18:39:04] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [18:39:04] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [18:39:17] preilly: high cpu utilization you mean? [18:39:20] AaronSchulz: s/NFS and Swift files/ the file [18:39:24] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [18:39:25] mark: yeah [18:39:33] preilly: mw1103 has hyperthreading disabled [18:39:34] PROBLEM - NTP on srv278 is CRITICAL: NTP CRITICAL: Offset unknown [18:39:41] we should make that consistent across the cluster [18:39:57] mark: yeah I agree I was talking to Tim about that the other day [18:40:09] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [18:40:24] mark: do you think we should just have it off everywhere? [18:40:32] yes [18:40:43] i thought it was actually disabled everywhere [18:41:01] mark: so did I at one point [18:41:05] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.134 second response time [18:41:07] mark: I'm not sure what happened [18:41:13] preilly: merging [18:41:20] AaronSchulz: sweet [18:41:37] that's quite sucky [18:41:42] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45566 [18:41:48] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.070 second response time [18:42:33] !log aaron synchronized wmf-config/InitialiseSettings.php [18:42:44] Logged the message, Master [18:42:54] !log aaron synchronized wmf-config/filebackend.php 'Removed unused NFS file backends' [18:43:05] Logged the message, Master [18:43:35] RECOVERY - NTP on srv278 is OK: NTP OK: Offset 6.186962128e-05 secs [18:43:40] preilly: that changes things, then we have less capacity than I 
thought [18:43:55] mark: yeah [18:44:14] let's order two apache racks for eqiad [18:44:44] * mark puts in a ticket [18:44:48] mark: awesome [18:45:22] seems 60 servers were already ordered [18:45:23] wtf [18:46:33] mark: for Tampa [18:46:40] no, also for eqiad [18:46:46] 60 servers [18:46:50] mark: wait seriously? [18:46:58] mark: I thought it was just 60 total [18:47:05] that count doesn't make a lot of sense from a servers per rack pov, but whatever [18:47:19] mark: so there is 120 on order? [18:47:23] i believe so [18:47:37] ordered yesterday [18:48:47] awjr: app server capacity doesn't allow purging the remaining two varnish boxes at this time, so I'll do that one by one in a bit, after dinner [18:48:58] ok mark, thanks [18:51:07] yeah RobH was mentioning that yesterday [18:52:33] awjr: that was the point I was trying to make in the mobile channel earlier [18:52:51] preilly: sorry, i don't follow? [18:53:54] awjr: app server capacity [18:54:08] awjr: no worries it's no big deal [18:54:14] that's the point I was already making to preilly well over a year ago [18:54:27] preilly: for sure, i understand [18:54:30] mark: heh heh [18:54:42] so shush :P [18:54:54] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [18:55:01] we are digging ourselves out of this, i promise :) [19:01:10] AaronSchulz: any ideas on how to move the predetermined sized thumb idea forward? 
[19:04:12] paravoid: I wonder if a patch with the optional feature could be added [19:04:19] or at least shown in gerrit [19:04:46] paravoid: but another thing would be to look at typical thumb sizes [19:04:55] New patchset: Reedy; "Add .gitreview and .gitignore" [operations/debs/libanon] (master) - https://gerrit.wikimedia.org/r/45577 [19:05:04] New patchset: Reedy; "Add .gitreview and .gitignore" [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/45578 [19:05:20] http://monitor.us.archive.org/weathermap/weathermap.html [19:05:34] New patchset: Reedy; "Add .gitreview and .gitignore" [operations/debs/wikibugs] (master) - https://gerrit.wikimedia.org/r/45579 [19:06:53] Would someone mind reviewing and submitting them please? [19:09:36] Reedy, is a platform deployment going on right now? [19:09:47] No... [19:09:52] Not that I know of [19:10:19] ok, then I'll squeeze in a little fix [19:10:44] maxsem@fenari:/home/wikipedia/common/php-1.21wmf7$ git pull [19:10:44] Permission denied (publickey). [19:10:44] fatal: The remote end hung up unexpectedly [19:11:08] did someone change the remotes? [19:11:25] worked fine for me [19:11:39] http://p.defau.lt/?uSwiXXJhsuUPXK3RhCPkgA [19:12:02] done wmf8 too [19:12:29] meh, proxycommand [19:12:57] I thought we checked out via https before? 
[19:13:47] extensions definitely [19:15:06] New patchset: Faidon; "Kill remaining references to pybaltestfile.txt" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45582 [19:18:43] New patchset: Reedy; "Remove $urlprotocol as it's set to """ [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42995 [19:19:16] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42995 [19:19:31] !log aaron synchronized php-1.21wmf8/thumb.php 'deployed c6d478f73f33f051c1060ffe4b9c5855e07d358f' [19:19:42] Logged the message, Master [19:20:48] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [19:20:49] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [19:20:49] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [19:20:49] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [19:20:49] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [19:20:49] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [19:20:49] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [19:20:53] Fail [19:21:06] Don't submit those 3 changes :p [19:21:21] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45582 [19:21:24] looks like it was a putty problem, restarting it helped [19:23:00] !log maxsem synchronized php-1.21wmf7/extensions/MobileFrontend/includes/skins/SkinMobile.php 'https://gerrit.wikimedia.org/r/#/c/45562/' [19:23:11] Logged the message, Master [19:24:17] !log maxsem synchronized php-1.21wmf8/extensions/MobileFrontend/includes/skins/SkinMobile.php 'https://gerrit.wikimedia.org/r/#/c/45562/' [19:24:28] Logged the message, Master [19:27:05] Change abandoned: Reedy; "(no reason)" [operations/debs/libanon] (master) - 
https://gerrit.wikimedia.org/r/45577 [19:27:08] Change abandoned: Reedy; "(no reason)" [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/45578 [19:27:10] Change abandoned: Reedy; "(no reason)" [operations/debs/wikibugs] (master) - https://gerrit.wikimedia.org/r/45579 [19:29:40] New patchset: Reedy; "Add .gitreview and .gitignore" [operations/debs/libanon] (master) - https://gerrit.wikimedia.org/r/45585 [19:29:49] New patchset: Reedy; "Add .gitreview and .gitignore" [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/45586 [19:30:16] New patchset: Reedy; "Add .gitreview and .gitignore" [operations/debs/wikibugs] (master) - https://gerrit.wikimedia.org/r/45587 [19:30:48] Change abandoned: Reedy; "(no reason)" [operations/debs/libanon] (master) - https://gerrit.wikimedia.org/r/45585 [19:30:50] Change abandoned: Reedy; "(no reason)" [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/45586 [19:30:55] Change abandoned: Reedy; "(no reason)" [operations/debs/wikibugs] (master) - https://gerrit.wikimedia.org/r/45587 [19:31:27] ottomata, I see a delicious mix of iostreams and printf in https://gerrit.wikimedia.org/r/#/c/45569/1/srcmisc/packet-loss.cpp [19:31:33] looks good other than that [19:32:19] OOPS [19:32:19] thanks [19:33:05] ottomata, and spaces instead of tabs [19:35:49] ah good call too [19:35:49] danke [19:38:28] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [19:39:19] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [19:39:36] New patchset: Reedy; "Add .gitreview and .gitignore" [operations/debs/libanon] (master) - https://gerrit.wikimedia.org/r/45588 [19:39:45] New patchset: Reedy; "Add .gitreview and .gitignore" [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/45589 [19:40:04] Change abandoned: Reedy; "(no reason)" [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/45589 
[19:40:10] Change abandoned: Reedy; "(no reason)" [operations/debs/libanon] (master) - https://gerrit.wikimedia.org/r/45588 [19:43:21] New patchset: Reedy; "Add .gitreview and .gitignore" [operations/debs/libanon] (master) - https://gerrit.wikimedia.org/r/45590 [19:43:32] New patchset: Reedy; "Add .gitreview and .gitignore" [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/45591 [19:43:56] New patchset: Reedy; "Add .gitreview and .gitignore" [operations/debs/wikibugs] (master) - https://gerrit.wikimedia.org/r/45592 [19:44:44] nth time lucky [19:47:35] dschoon: try to take over the world? [19:47:44] always [19:58:37] * AaronSchulz hopes the reference was received [19:59:42] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:00:21] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [20:01:03] paravoid: I don't like how LocalFile caches negatives for a week [20:01:14] uhm [20:01:15] okay? [20:01:37] seems like asking for trouble :) [20:01:42] * Reedy hands AaronSchulz a template [20:04:18] AaronSchulz: Nope. Don't know what Brain you are talking about. 
[20:05:32] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: HTTP CRITICAL - No data received from host [20:07:42] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 20:07:36 UTC 2013 [20:07:42] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 20:07:37 UTC 2013 [20:08:21] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [20:08:32] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 20:08:25 UTC 2013 [20:08:32] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: HTTP CRITICAL - No data received from host [20:08:42] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:08:42] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 20:08:32 UTC 2013 [20:09:21] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [20:09:32] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 20:09:22 UTC 2013 [20:09:42] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:10:22] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [20:11:32] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: HTTP CRITICAL - No data received from host [20:12:06] Does someone fancy approving 3 easy commits please? https://gerrit.wikimedia.org/r/45590 https://gerrit.wikimedia.org/r/45591 https://gerrit.wikimedia.org/r/45592 [20:13:32] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Thu Jan 24 20:13:23 UTC 2013 [20:14:21] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [20:14:31] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: Connection refused [20:15:12] gotcha Reedy... 
[20:17:31] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: Connection refused [20:17:37] Change merged: Ottomata; [operations/debs/libanon] (master) - https://gerrit.wikimedia.org/r/45590 [20:17:44] Change merged: Ottomata; [operations/debs/librsvg] (master) - https://gerrit.wikimedia.org/r/45591 [20:17:48] Change merged: Ottomata; [operations/debs/wikibugs] (master) - https://gerrit.wikimedia.org/r/45592 [20:18:11] thanks [20:20:07] New patchset: Ottomata; "Adding Stefan's laptop ssh key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45597 [20:20:32] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: Connection refused [20:20:56] New patchset: Ottomata; "Adding Stefan's laptop ssh key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45597 [20:27:11] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Thu Jan 24 20:27:08 UTC 2013 [20:27:23] New patchset: Reedy; "Add .gitignore and .gitreview" [operations/debs/nginx] (master) - https://gerrit.wikimedia.org/r/45598 [20:27:41] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:29:44] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45597 [20:31:20] purged cp1043 [20:31:55] New patchset: Reedy; "Add .gitignore and .gitreview" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/45599 [20:32:18] silly repo [20:33:27] New patchset: Ottomata; "Removing space in Stefan's ssh key id" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45600 [20:33:42] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45600 [20:34:02] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: Connection refused [20:45:22] New patchset: Tpt; "Add two new blogs to the French planet as requested in meta:Planet_Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45602 [20:47:20] PROBLEM - LVS Lucene on 
search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [20:48:21] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [20:50:14] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 26.78 ms [20:51:10] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [20:52:09] New patchset: Faidon; "Switch Swift's thumbhost to eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45604 [20:52:42] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45604 [20:53:03] New patchset: Ottomata; "filters.oxygen.erb - capturing lines X-CS headers to zero-x-cs.log" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45605 [20:53:20] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [20:53:31] all mobile varnish servers have bans [20:53:56] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45605 [20:54:20] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [20:56:16] !log gradual restart of ms-fe[1-4] proxies [20:56:26] Logged the message, Master [20:56:45] New patchset: Reedy; "Add script to update the interwiki cache" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42133 [20:57:29] binasher: have you looked at http://riemann.io/ ? [21:02:21] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45602 [21:02:46] dschoon: hey, that looks neat [21:03:07] i've liked what i've read so far. [21:03:28] and YourKit is fucking fantastic [21:03:31] so i trust their engineers [21:06:23] MaxSem, got a sec to do this one? [21:06:24] https://gerrit.wikimedia.org/r/#/c/45650/ [21:06:26] since you commented on the other? 
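The bans placed on the mobile varnish servers above are Varnish's lazy form of invalidation: a ban expression is appended to a list, and each cached object is tested against it on its next lookup, so the cache drains gradually instead of being wiped in one go (which is what protects the app servers). A hedged sketch of the admin commands (Varnish 3 syntax; the `-T`/`-S` values and the host pattern are illustrative):

```
# Ban every object whose Host header matches the mobile sites;
# evaluated lazily as objects are next requested
varnishadm -T localhost:6082 -S /etc/varnish/secret \
    'ban req.http.host ~ "\.m\.wikipedia\.org$"'

# Inspect the outstanding ban list
varnishadm -T localhost:6082 -S /etc/varnish/secret ban.list
```

Because evaluation happens per request, the hit rate drops and backend traffic ramps up only as fast as clients re-request pages, which is why the purge could be spread over hours here.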
[21:06:45] ottomata, I'm in a meeting, will look in an hour [21:08:11] cool [21:08:14] s'ok no worries [21:08:23] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: Connection refused [21:09:09] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [21:12:16] uh [21:12:28] anyone know why backend squid instances are throwing mad 404s? [21:12:30] in nagios at least [21:13:20] notpeter: 404s or refusing connection? saw the one above on knsq18? [21:14:35] oh, i see, hmm. they are "just" WARNINGS .. but it sounds crit.. [21:14:44] 404 is not so good... [21:15:22] yea, just saying thats probably why the bot didnt spam ..or did it [21:17:14] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: HTTP CRITICAL - No data received from host [21:18:12] New patchset: Asher; "handle current situation where both datacenters are writeable, with the secondary datacenter using remote masters" [operations/software] (master) - https://gerrit.wikimedia.org/r/45652 [21:18:33] it's all upload [21:18:38] paravoid: you there? [21:18:44] I am [21:18:56] can you test wikipedia? [21:19:03] sorry? [21:19:09] all of the esams and pmtpa upload squids are showing all 404 [21:19:14] check_http_upload_on_port!3128 [21:19:22] I changed the check before [21:19:29] ah:) [21:19:59] $USER1$/check_http -H upload.wikimedia.org -I $HOSTADDRESS$ -u /monitoring/backend -p $ARG1$ [21:20:03] somehow that gets 404s [21:20:12] it's probably old, dupe defs [21:20:19] lemme test something [21:20:20] and makes it a WARN, not a CRIT though [21:21:24] nope. drat [21:21:29] New patchset: Asher; "handle current situation where both datacenters are writeable, with the secondary datacenter using remote masters" [operations/software] (master) - https://gerrit.wikimedia.org/r/45652 [21:22:36] !log taking srv278 down for decommissioning per paravoid [21:22:46] Logged the message, Master [21:23:12] where are these? 
[21:23:16] root@spence:/etc/nagios# /usr/local/nagios/libexec/check_http -H upload.wikimedia.org -I 91.198.174.57 -u /monitoring/backend -p 3128 [21:23:19] HTTP WARNING: HTTP/1.0 404 Not Found [21:23:20] that's a full commandline [21:23:25] for amssq47 [21:23:38] why aren't these warnings printed by nagios-wm? [21:23:47] paravoid: that's a very good question [21:24:11] because they are warnings? did we change something maybe to only show CRITs in IRC? [21:24:15] does nagios-wm alert for warnings? [21:24:30] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [21:25:02] oh [21:25:03] anyway [21:25:04] looking [21:25:14] ./check_http has the option -e (expect) [21:25:22] you can tell it which return code you want [21:25:27] to make the 404s criticals [21:25:52] /usr/local/nagios/libexec/check_http -H upload.wikimedia.org -I 91.198.174.57 -u /monitoring/backend -p 3128 -e 200 [21:25:55] HTTP CRITICAL - Invalid HTTP response received from host on port 3128: HTTP/1.0 404 Not Found [21:25:59] yeah I know that this is [21:26:05] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [21:26:23] maybe [21:26:38] Change merged: Asher; [operations/software] (master) - https://gerrit.wikimedia.org/r/45652 [21:28:26] !log fixed dbtree to support our current situation of db-pmpta containing eqiad write masters [21:28:37] Logged the message, Master [21:29:13] i see the difference in the checkcommand between puppet and what is on spence.. yep [21:29:44] paravoid: -u /pybaltestfile.txt vs. -u /monitoring/backend is the (unrelated) change, right [21:29:53] or is that it [21:29:56] it's the related change [21:30:00] ah:) [21:30:33] it ends up on ms7, most likely my squid ACL is wrong [21:30:39] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: HTTP CRITICAL - No data received from host [21:31:00] well, it works with -u /pybaltestfile.txt [21:31:09] is that the new version?
then its just not on spence yet [21:31:16] /usr/local/nagios/libexec/check_http -H upload.wikimedia.org -I 91.198.174.57 -u /pybaltestfile.txt -p 3128 -e 200 [21:31:18] HTTP OK HTTP/1.0 200 OK [21:32:04] no that's the old version [21:32:21] gotcha, git pulled [21:33:39] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: HTTP CRITICAL - No data received from host [21:34:00] yeah check_http does HTTP/1.0 [21:34:07] and we have a url_regex instead of urlpath_regex [21:34:34] New patchset: Dzahn; "add -e 200 to monitoring check_http_upload_on_port - so that if it returns a 404 that also becomes a crit and not just a warn, so eventually the IRC bot tells us about it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45657 [21:35:06] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: HTTP CRITICAL - No data received from host [21:35:41] paravoid: you want that? i guess then the bot would have talked [21:36:22] or of course the return code could be $ARG2$ [21:36:45] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: HTTP CRITICAL - No data received from host [21:38:20] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [21:39:17] PROBLEM - Backend Squid HTTP on sq41 is CRITICAL: Connection refused [21:39:19] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [21:39:24] !log deploying squid config for upload's /monitoring/ switch from url_regex to urlpath_regex [21:39:30] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: HTTP CRITICAL - No data received from host [21:39:34] Logged the message, Master [21:40:30] PROBLEM - Backend Squid HTTP on sq41 is CRITICAL: Connection refused [21:40:40] it's rebuilding [21:40:43] looking to see why it died [21:42:31] checks looking better now? 
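The ACL bug diagnosed above hinges on the difference between the two squid ACL types: `url_regex` matches the entire URL including scheme and host, so a pattern anchored at `^/` never matches, while `urlpath_regex` matches only the path. A hedged squid.conf sketch (the ACL name and policy lines are illustrative, not the production config):

```
# urlpath_regex sees only "/monitoring/backend", so the anchor works
acl monitoring urlpath_regex ^/monitoring/

# the same intent with url_regex would have to account for scheme + host:
# acl monitoring url_regex ^http://[^/]+/monitoring/

http_access allow monitoring
```

This is why the check against `/monitoring/backend` fell through to ms7 and returned 404 until the config switched to `urlpath_regex`.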
[21:42:34] PROBLEM - Backend Squid HTTP on knsq18 is CRITICAL: HTTP CRITICAL - No data received from host [21:43:30] sorry for the noise [21:44:08] paravoid: yeah, stuff is recovering [21:44:34] RECOVERY - Backend Squid HTTP on sq41 is OK: HTTP OK: HTTP/1.0 200 OK - 486 bytes in 0.065 second response time [21:44:35] RECOVERY - Backend Squid HTTP on knsq18 is OK: HTTP OK: HTTP/1.0 200 OK - 656 bytes in 0.423 second response time [21:44:35] it's going to be cached on some [21:44:49] any ideas how to send a purge everywhere? [21:45:00] no idea how to push something over htcp [21:45:12] I thought that we just did that by restarting squid/varnish ;) [21:45:25] no I pushed a new ACL [21:45:30] this doesn't invalidate caches [21:45:38] (varnish wasn't affected, just squid) [21:45:44] gotcha [21:45:44] RECOVERY - Backend Squid HTTP on knsq18 is OK: HTTP OK HTTP/1.0 200 OK - 657 bytes in 0.494 seconds [21:46:02] tbh, I try to avoid cache invalidation duties, so I'm not sure [21:46:11] RECOVERY - Backend Squid HTTP on sq41 is OK: HTTP OK HTTP/1.0 200 OK - 494 bytes in 0.001 seconds [21:46:27] I'd just let them expire [21:46:55] binasher: you'll probably know: how can I send a purge to all squids? 
htcp on wikitech is really old and I don't think it can be trusted [21:47:04] he's afk [21:47:22] oh [21:47:56] there's purgeList but needs a wikiname [21:48:09] oh, heh [21:50:04] use --aawiki [21:50:31] paravoid: in /home/wikipedia/common/php/maintenance do: echo 'http://blahblahblah' | php ./purgeList.php --wiki aawiki [21:50:37] from wikitech [21:50:48] needs a wiki [21:50:52] I want to do it on upload [21:50:57] anyway, doesn't matter much [21:51:24] all i know is that you use aawiki when there is no "real" wiki [21:52:45] New patchset: Pyoungmeister; "backing out precise maria db repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45664 [21:52:56] paravoid: yes, the warnings are gone from nagios/icinga [21:53:10] thanks [21:53:44] still want to add the -e 200 ? [21:54:08] just to turn warnings into criticals [21:54:24] ^demon: it seems webplatform has a top level repo [21:54:42] ^demon: but they're really only going to be hosting mediawiki extensions with us [21:54:52] all of their repos are empty [21:55:12] can they be deleted and extension repos in mediawiki be created for them? [21:56:24] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out [21:56:44] <^demon> Ryan_Lane: I was under the impression they were weird extensions not fit for general consumption. [21:57:25] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.000 second response time on port 8123 [21:57:43] ^demon: they are fit for general consumption [21:58:16] ah, there is actually a puppet camp in Belgium one day before FOSDEM [21:58:29] http://puppetcampghent2013-eanrecl.eventbrite.com/?ref=eanrec&utm_source=eb_email&utm_medium=email&utm_campaign=attnews&utm_term=attlink# [21:58:39] <^demon> Ryan_Lane: There's already an extension called Comments. 
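Besides the `purgeList.php` route above (which sends HTCP purges from MediaWiki), Squid also accepts the HTTP `PURGE` method directly when its ACLs allow it. A hedged sketch of issuing one such purge per cache box; the `purge` ACL is assumed to exist on the squids, and the demo at the bottom runs against a local stand-in server rather than a real cache:

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

def send_purge(cache_host, port, url_path, site="upload.wikimedia.org"):
    """Issue an HTTP PURGE for url_path against one cache server.

    Squid only honors PURGE if its config allows it, e.g.:
        acl purge method PURGE
        http_access allow purge localhost
    """
    conn = HTTPConnection(cache_host, port, timeout=5)
    try:
        conn.request("PURGE", url_path, headers={"Host": site})
        return conn.getresponse().status
    finally:
        conn.close()

# --- demo against a local stand-in for a squid box ---
purged = []

class FakeSquid(BaseHTTPRequestHandler):
    def do_PURGE(self):            # dispatched for the PURGE method
        purged.append(self.path)
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), FakeSquid)
threading.Thread(target=server.serve_forever, daemon=True).start()
status = send_purge("127.0.0.1", server.server_address[1],
                    "/monitoring/backend", site="localhost")
server.shutdown()
```

In practice you would loop `send_purge` over the cache host list (the same list dsh or pybal uses) with the URL to invalidate; HTCP multicast purges remain the normal production path.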
[21:58:56] ^demon: I'd like to work them into the normal mediawiki processes as much as possibl [21:59:00] *possible [21:59:03] binasher: "FOSDEM: MySQL And Friends Community Dinner 2013" [21:59:05] mutante: Are you going to go and troll and ask why it's so slow? :p [21:59:16] !log olivneh synchronized php-1.21wmf8/extensions/EventLogging [21:59:16] <^demon> Ryan_Lane: And is their Translate a fork or branch of the normal Translate? [21:59:19] Reedy: they are going to tell us to use puppetdb [21:59:25] they have a translate extension? [21:59:27] Logged the message, Master [21:59:34] I think it's a different extension [22:00:20] <^demon> I need to go back and look at the logs. I really remember them saying that Comments and Translate were like forks of the originals. [22:00:38] hm [22:01:14] ^demon: could they just be a branch? [22:01:16] "Monitorama 2013" - March 28, Cambridge, MA, Microsoft R&D center [22:01:21] ^demon: also their repos are empty [22:01:23] <^demon> I'm fine with branches. [22:01:25] <^demon> I know. [22:01:37] <^demon> I'm saying, let's not make more new empty repos if branches are better. [22:01:40] I'm asking them about it [22:01:49] Okay cool [22:02:19] yeah. I'd prefer we don't create more new empty repos, too :) [22:02:27] !log olivneh synchronized php-1.21wmf7/extensions/EventLogging [22:02:37] we really need better outreach with high profile users [22:02:38] Logged the message, Master [22:03:25] !log olivneh synchronized php-1.21wmf8/extensions/E3Experiments [22:03:36] Logged the message, Master [22:03:39] Ryan_Lane: indeed [22:04:01] !log olivneh synchronized php-1.21wmf7/extensions/E3Experiments [22:04:11] Logged the message, Master [22:08:36] New patchset: Pyoungmeister; "adding mariadb component to precise-wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45666 [22:09:51] ^demon: translate can just be deleted [22:10:16] ^demon: comments is a fork. 
they say it's "dramatically reworked" [22:12:05] New patchset: Lcarr; "Switching icinga to page us!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45668 [22:12:13] <^demon> WebPlatformTranslation deleted. [22:12:16] ^demon: so, new extension? branch? [22:12:18] okay, the icinga page-mageddon button is going to be turned on! [22:12:41] New patchset: Ori.livneh; "Re-enable E3Experiments on enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45669 [22:12:47] <^demon> If it's a fork, new repo with a new name sounds fine. [22:13:06] ContextComments [22:13:57] <^demon> New repo created. [22:14:21] New patchset: Lcarr; "having icinga redirect to icinga.wm.org instead of neon.wm.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45670 [22:14:21] is it possible for those webplatform.org extensions to be moved to mediawiki/extensions? [22:15:05] I don't think we want a top-level project for every organisation that decides to write a MW extension [22:16:05] <^demon> TimStarling: They're all empty repositories. [22:16:22] <^demon> They never committed to them. We're making a few new repos in extensions as needed. [22:16:59] right, that makes it easy [22:17:15] <^demon> Ryan_Lane: So, what about WebPlatformAuth and WebPlatformSearchAutocomplete? [22:17:20] <^demon> New repos in extensions for those? [22:17:24] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [22:17:58] I'm asking about autocomplete [22:18:01] I don't really get that one [22:18:05] auth for sure [22:18:14] that's a currently existing repo [22:18:24] on their cluster [22:18:51] maybe we should keep webplatform.org [22:18:59] as their configuration repo [22:19:07] <^demon> I already deleted them all. 
[22:19:09] heh [22:19:12] nevermind then :) [22:19:12] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45669 [22:19:23] I'll create a repo later if we decide to host it [22:20:10] <^demon> Ok, well ContextComments and WebPlatformAuth both created in extensions. [22:20:27] awjr: oops i misread your email as you have a deployment tomorrow [22:20:32] not that you sent it and had it actually today [22:20:50] :) [22:22:41] ^demon: autocomplete is also written [22:22:46] and deployed on their cluster [22:22:56] I think that's all that's needed [22:23:39] <^demon> Ok, created WebPlatformSearchAutocomplete [22:23:49] thanks [22:24:43] is mw1085.eqiad.wmnet depooled? [22:26:57] notpeter, mutante ^^ [22:28:17] !log olivneh synchronized wmf-config/InitialiseSettings.php [22:28:24] yes [22:28:26] Logged the message, Master [22:28:31] as is 1072 [22:28:53] preilly: ^ [22:28:59] notpeter: got it thanks [22:29:05] notpeter: I was just making sure [22:29:09] cool [22:29:12] andre__: should planet have a component in BZ ? [22:29:22] in product wikimedia? 
[22:29:51] andre__: just writing the answer to "where should i report issues"..tell me what's best [22:31:50] there, new planet docs: http://wikitech.wikimedia.org/view/Planet.wikimedia.org [22:32:09] gone with the unpackaged, unpuppetized, SVN-using, running on singer, old planet [22:35:37] New patchset: Pyoungmeister; "temporarily adding db1033 and db1036 to decom.pp to get them out of nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45675 [22:38:04] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45675 [22:38:06] New patchset: Jalexander; "adding requested blogs to planet config, 1 de, 1 it, 1 cs, 4 fr, 4 en" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45676 [22:40:08] Jamesofur|away: there is one duplicate in "fr" because i merged this https://gerrit.wikimedia.org/r/#/c/45602/1/templates/planet/fr_config.erb [22:40:17] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45668 [22:40:22] arg, wrong James [22:40:32] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45670 [22:41:03] New review: Dzahn; "there is a duplicate or two in "fr" because i already merged this https://gerrit.wikimedia.org/r/#/..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/45676 [22:43:09] New patchset: Dzahn; "adding requested blogs to planet config, 1 de, 1 it, 1 cs, 4 fr, 4 en" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45676 [22:43:38] New review: Dzahn; "removed 2 from "fr", amended" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45676 [22:43:39] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45676 [22:46:52] mutante: for technical issues it makes sense (bug report welcome against Wikimedia>Bugzilla so I don't forget).
For request of people to get added I would ask James first what fits best for him. [22:47:41] andre__: ok, creating bug. and for adding feeds, yep, ideally people just create gerrit patch sets by themselves :) [22:47:50] now that they can :) [22:48:22] that said, i guess would have to still ask James what is ok to merge [22:48:40] Gerrit patchsets themselves? Dream on. ;) [22:49:03] For techy folks okay, but for average Wikimedia contributors that are not into code that sounds hard. [22:49:12] heh, just 3 steps http://wikitech.wikimedia.org/view/Planet.wikimedia.org#How_do_i_add.2Fremove_feed_URLs.3F [22:49:23] clone, edit a file, git review [22:49:33] we'll see:) [22:50:30] and i even saw that coming and added "[edit] How do i request changes if i can't or don't want to submit a change set myself? " :p [22:50:48] "Bring me a patch and you'll get added within 48 hours. Have no patch, wait for two months."? :D [22:51:24] There already was one https://gerrit.wikimedia.org/r/#/c/45602/ [22:51:25] Finally a sustainable business model for Wikimedia. [22:51:37] Yay, nice [22:52:34] * andre__ should go to sleep [22:52:53] right, good night then [22:53:09] ttyl [22:54:36] eek, why I see write queries on a slave? https://ishmael.wikimedia.org/more.php?host=db1001&hours=24&checksum=2841018436396406640 [22:55:12] !log changed all labs sudoers from "sudoUser: ALL" to "sudoUser: project-" [22:55:23] because someone didn't set read_only on them? 
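The three-step feed-change workflow mentioned above (clone, edit a file, git review) can be sketched as follows. The repo URL and template filename are assumptions inferred from the change sets linked in the log (cf. `templates/planet/fr_config.erb`); the steps are printed rather than run.

```shell
# Sketch of the three-step planet feed change: clone the puppet repo,
# edit the per-language planet config template, send it for review.
FEED='http://example.invalid/blog/feed.atom'   # placeholder feed URL
STEPS=$(cat <<EOF
git clone https://gerrit.wikimedia.org/r/operations/puppet
# edit templates/planet/en_config.erb and add: $FEED
git review
EOF
)
echo "$STEPS"
```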
[22:55:24] Logged the message, Master [22:55:48] !log changed all lab sudoer policies to !authenticate [22:55:49] !log updating all planets [22:55:58] Logged the message, Master [22:56:08] Logged the message, Master [23:07:12] New patchset: Pyoungmeister; "coredb: removing slave-related nagios checks from shards with no master" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45679 [23:09:09] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45679 [23:17:16] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [23:17:22] icinga-wm: -_- [23:17:49] Ryan_Lane, so now it's at least being monitored... [23:18:07] it was before too [23:18:15] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.027 second response time on port 11000 [23:18:51] ok. I'm going to make virt1 the new controller [23:31:20] New patchset: Pyoungmeister; "removing db1036 and db1033 from decom.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45683 [23:31:48] RECOVERY - Host silicon is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [23:44:06] New patchset: Asher; "select mobile apache backend based on $mw_primary instead of hardcoding. this matches how the mobile api backed is selected." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45688 [23:44:46] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45688 [23:47:33] notpeter: i'm adding you as a reviewer for a schema change (https://gerrit.wikimedia.org/r/#/c/45672/) [23:48:36] hehe [23:48:52] !log olivneh synchronized php-1.21wmf7/extensions/GettingStarted [23:49:03] Logged the message, Master [23:49:28] !log olivneh synchronized php-1.21wmf8/extensions/GettingStarted [23:49:39] Logged the message, Master [23:49:58] !log installing package upgrades on zirconium [23:50:11] Logged the message, Master
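On the write-queries-on-a-slave question above: the usual guard is the MySQL `read_only` system variable, which makes writes from non-superuser accounts fail on a replica. A minimal sketch of checking and setting it, with an illustrative hostname and the commands printed rather than executed:

```shell
# Sketch: verify and enable read_only on a MySQL replica.
# Hostname is illustrative; nothing here touches a real server.
HOST='db1001'
CHECK="mysql -h $HOST -e \"SHOW GLOBAL VARIABLES LIKE 'read_only'\""
FIX="mysql -h $HOST -e \"SET GLOBAL read_only = 1\""
echo "$CHECK"
echo "$FIX"
```

Note that `read_only` does not block writes from accounts with the SUPER privilege, which is one way write queries can still appear on a replica.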