[00:03:35] !log maxsem synchronized php-1.21wmf9/extensions/MobileFrontend 'Touchhhhh'
[00:03:36] Logged the message, Master
[00:20:27] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[00:20:27] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[00:20:27] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[00:22:03] preilly: ahoy
[00:33:07] RobH: hey, have you been working on the blog stuffs?
[00:33:49] or was that daniel?
[00:35:48] notpeter: me
[00:36:00] did you make some dumps of the db on db9?
[00:36:20] yep, and anyone older than the most recent can go away, want me to go police them?
[00:36:42] i make one right before I roll an upgrade
[00:36:46] nah, but I'm going to move them to the /a partition
[00:36:53] because the root partition is really full
[00:37:01] ok, i'll be sure to put on there in future
[00:37:05] cool!
[00:37:31] I'll make /a/blog_dumps for them
[00:38:17] just wanted to make sure that I didn't disappear them on you :)
[00:38:40] nah, they are pretty much useless after i verify the upgrade
[00:39:06] ah, gotcha
[00:39:08] New patchset: Krinkle; "admins.pp: Add new key for 'krinkle'. Invalidate old key." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48766
[00:39:23] i just keep the most recent one because i am paranoid
[00:39:29] makes sense
[00:39:34] and like the ability to roll back major updates
[00:39:40] so if we have room, keeping them is never bad.
[00:39:58] yeah, /a has a good amount of space
[00:40:03] cool
[00:40:10] should be enough to last us until we start using new boxes....
[00:41:14] New review: Krinkle; "Patch Set 1:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48766
[00:42:26] !log restarting varnishncsa on cp1043
[00:42:28] Logged the message, notpeter
[00:49:55] New review: Pyoungmeister; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48766
[00:50:03] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48766
[00:52:18] RECOVERY - Puppet freshness on ms-fe1002 is OK: puppet ran at Wed Feb 13 00:52:04 UTC 2013
[01:13:19] Hm. I should probably catch up on as much Labs infrastructure docs as I can get my hands on. Is what's on labsconsole it?
[01:13:39] Pretty much, yes.
[01:13:42] rfaulkner: nose should good to me
[01:14:07] I was hoping that there was a hidden cache I could tap once I learn the hidden handshake. :-)
[01:14:48] Coren: not really. If you see obvious holes or have suggestions about how to organize… I'm happy to explain and/or rewrite.
[01:15:49] andrewbogott: Well, I don't yet know what parts I don't know, but once I know I'll let you know. :-)
[01:16:01] sounds good :)
[01:20:12] RECOVERY - NTP on mw1182 is OK: NTP OK: Offset -0.004286646843 secs
[01:20:13] RECOVERY - NTP on mw1188 is OK: NTP OK: Offset 0.001575231552 secs
[01:20:21] RECOVERY - NTP on mw1165 is OK: NTP OK: Offset -0.0001726150513 secs
[01:20:57] RECOVERY - NTP on mw1176 is OK: NTP OK: Offset -0.00748705864 secs
[01:26:10] New patchset: Pyoungmeister; "create a test.w.o role class to increase maxclients" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48770
[01:28:02] New review: Pyoungmeister; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48770
[01:28:12] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48770
[01:46:03] New patchset: Pyoungmeister; "explicitly passing maxclients to applicationserver::config::apache" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48771
[02:02:03] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer:
[02:03:42] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[02:15:22] preilly: https://gerrit.wikimedia.org/r/4877[4|5|6|7]
[02:15:41] started defining unit tests plus some fixes
[02:28:07] !log LocalisationUpdate completed (1.21wmf9) at Wed Feb 13 02:28:06 UTC 2013
[02:28:11] Logged the message, Master
[02:40:36] PROBLEM - Puppet freshness on mw37 is CRITICAL: Puppet has not run in the last 10 hours
[02:52:39] !log LocalisationUpdate completed (1.21wmf8) at Wed Feb 13 02:52:39 UTC 2013
[02:52:41] Logged the message, Master
[03:17:48] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[03:59:35] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[04:23:45] RECOVERY - MySQL disk space on neon is OK: DISK OK
[04:44:41] New review: Tim Starling; "Patch Set 3: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/46907
[04:44:50] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46907
[05:04:25] Change abandoned: Tim Starling; "git-deploy didn't happen." [operations/apache-config] (newdeploy) - https://gerrit.wikimedia.org/r/43148
[05:22:49] New patchset: Tim Starling; "Make mwscript sudo to apache if an admin tries to run a script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48785
[05:23:12] New review: Tim Starling; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48785
[05:23:20] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48785
[05:30:10] New patchset: Tim Starling; "Maybe also avoiding running scripts as root would be good?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48786
[05:33:12] New patchset: Tim Starling; "Prevent MediaWiki maintenance scripts from running as privileged users" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44200
[05:33:45] New review: Tim Starling; "Patch Set 3: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/44200
[05:33:46] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44200
[05:43:39] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[05:49:13] New review: Tim Starling; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48786
[05:49:22] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48786
[05:50:44] !log tstarling synchronized multiversion/MWScript.php
[05:50:46] Logged the message, Master
[06:16:44] New patchset: Krinkle; "(bug 39380) Enabling secure login (HTTPS)." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21322
[06:17:16] New review: Krinkle; "Patch Set 11:" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21322
[06:43:36] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[07:14:46] RECOVERY - MySQL disk space on neon is OK: DISK OK
[07:52:25] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[08:17:16] hello
[08:17:33] apergos: good morning :-] Will you be around for the next hour or so ?
[08:17:48] morning
[08:17:50] I'm here
[08:18:01] and wishing you were a mobile front end person :-D
[08:18:04] cool! I am going to update Jenkins API tokens and might poke you to get a key updated :)
[08:18:07] oh
[08:18:09] sure thing
[08:18:11] mobile causing trouble again ?
[08:18:17] dumps broken
[08:18:28] Fatal error: Call to a member function getText() on a non-object in /a/usr/local/apache/common-local/php-1.21wmf8/extensions/MobileFrontend/includes/MobileContext.php on line 273
[08:18:35] damn
[08:18:42] someone didn't run test suite and broke them
[08:18:48] surely MaxSem could help with MobileFrontend issues
[08:18:56] that's the hope
[08:19:01] we'll see when he gets on
[08:19:18] who, me?
[08:19:32] looking
[08:19:32] MobileFrontend seems to echo some fatal errors :/
[08:20:43] [13-Feb-2013 15:17:51]
[08:20:48] hmm how is that possible? :-D
[08:20:58] a date in the future!
[08:21:14] wmf8
[08:21:46] that is from a request on the Thailand's wiki
[08:21:51] so most probably local time instead of gmt
[08:22:04] [13-Feb-2013 15:17:51] Fatal error: Call to a member function getDBkey() on a non-object at /usr/local/apache/common-local/php-1.21wmf8/extensions/MobileFrontend/includes/skins/SkinMobile.php on line 264
[08:22:09] MaxSem: another one :)
[08:22:24] hm didn't see those
[08:23:37] the first one breaks abstracts and stubs
[08:23:38] and therefore also page content dumps
[08:23:55] RECOVERY - MySQL disk space on neon is OK: DISK OK
[08:25:26] !log jenkins: changing encryption key and regenerating secrets. See {{bug|44592}}
[08:25:27] Logged the message, Master
[08:26:10] apergos, hashar: https://gerrit.wikimedia.org/r/48797
[08:26:55] * apergos is already enjoying the improved gerrit
[08:28:38] MaxSem: I don't know anything about MF but that change is not going to get things any worse :-]
[08:29:03] MaxSem: CR +2
[08:29:10] um, is that really going to prevent the fatal or will it just happen in the if?
[08:29:13] * apergos is looking
[08:32:50] apergos, I can deploy it
[08:33:59] oh, this is the second error
[08:34:05] sure
[08:34:44] I'll +2 that
[08:34:55] no I won't, someone else did ;-D
[08:35:06] thanks hashar
[08:35:36] now the wmf branch need an update :-]
[08:35:41] branches
[08:35:43] yup
[08:35:53] but max knows how to do that
[08:36:17] I was running on wmf8 for one of these
[08:36:19] so I could easily test that again
[08:36:36] (command already cued up)
[08:39:21] PROBLEM - Puppet freshness on cp3022 is CRITICAL: Puppet has not run in the last 10 hours
[08:40:37] I'm on it
[08:42:06] thanks a lot btw
[08:43:53] apergos, do you need it only on wmf8?
[08:44:11] don't know but I was going to say let's test it there first
[08:44:23] I can check the other failures and see what version they had
[08:45:00] there are failures also in 9
[08:48:53] grrrr, new gerrit merges so slowwwwwly
[08:49:30] MaxSem: might be jenkins / unit tests
[08:49:43] then it's you whom I should be biting
[08:49:49] indeed
[08:49:51] * MaxSem bites hashar
[08:50:02] merge into what ?
[08:50:05] mw/core ?
[08:50:27] this is an extension
[08:52:45] !log maxsem synchronized php-1.21wmf8/extensions/MobileFrontend 'https://gerrit.wikimedia.org/r/#/c/48797/'
[08:52:45] Logged the message, Master
[08:52:48] here we go
[08:53:35] ok
[08:53:57] if it fixed wmf8, I can deploy it to wmf9 too
[08:54:09] I'll find out shortly
[08:54:17] ope
[08:54:26] PHP Fatal error: Call to a member function getText() on a non-object in /a/usr/local/apache/common-local/php-1.21wmf8/extensions/MobileFrontend/includes/MobileContext.php on line 273
[08:54:34] well it fixed ne but now
[08:54:34] eh
[08:54:36] we have the second
[08:54:53] sorry for the horrible tying
[08:54:57] *typing!
[08:55:49] or maybe it didn't fix the one (since I haven't encountered that error in the dumps)
[08:55:52] anyways....
[08:56:29] I'm an idiot
[08:56:59] I forgive you.
[08:57:06] * apergos whacks Susan
[08:57:15] Now the other cheek.
[08:57:23] that's pretty cheeky of you
[08:57:59] :D
[08:57:59] * apergos could go somewhere raunchy with this but this is a publically logged channel
[08:58:49] https://gerrit.wikimedia.org/r/48804
[09:00:14] all right let's try that
[09:01:04] merged
[09:07:12] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[09:10:02] !log maxsem synchronized php-1.21wmf8/extensions/MobileFrontend 'https://gerrit.wikimedia.org/r/#/c/48804/'
[09:10:04] Logged the message, Master
[09:11:20] looks better
[09:11:43] I'd say push them both out to 9
[09:17:59] !log maxsem synchronized php-1.21wmf9/extensions/MobileFrontend 'https://gerrit.wikimedia.org/r/#/c/48804/'
[09:18:00] Logged the message, Master
[09:24:12] it wil be a little while before I have a complete run and can move onto a test under 9, but I presume it will be successful
[09:24:25] thanks again
[09:24:38] :)
[09:35:33] apergos: Jenkins is still processing .. :D
[09:35:44] daaannnggg :-)
[09:36:36] RECOVERY - MySQL disk space on neon is OK: DISK OK
[09:36:41] my test dump run is still going, it's in meta-current
[09:37:30] ah now it's in meta-history
[09:54:06] so... how is jenkins?
[09:58:02] rekeying stuff
[09:58:09] apparently it parse all the .xml files there
[09:58:14] I guess it will take a while
[09:58:23] feel free to get to lunch / out / bed whatever :-]
[09:58:30] it is probably not going to cause any issues
[10:00:11] heh
[10:00:16] not bed, its very early here :-)
[10:00:32] also it is pouring buckets and there is occasional thunder and lightning
[10:00:46] a nice day for hot chocolate which I will fix soon
[10:07:35] yummm
[10:14:57] today I learned another english idiomatic: "pouring buckets"
[10:14:59] I guess that is the same as "it is raining cats and dogs"
[10:15:09] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[10:16:30] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 197 seconds
[10:16:30] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 197 seconds
[10:17:30] Yes.
[10:19:28] it is, though the image is not of buckets falling from the sky (as the other one invokes the image of cats and dogs falling) but of buckets of water being emptied onto the passersby below, at least that is how I envision it
[10:21:54] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[10:21:54] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[10:21:54] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[10:44:24] RECOVERY - MySQL disk space on neon is OK: DISK OK
[11:18:16] apergos: yeah that "pouring buckets" image makes a lot of sense. Much more than the cats and dogs failing upon us :-]
[11:19:11] ah
[11:19:17] Jenkins has completed its rekeying stuff
[11:20:27] and how does it look?
[11:20:44] Zuul is still able to communicate with Jenkins using its API key
[11:20:46] so I guess it is fine
[11:23:37] apergos: I have closed the bug. Rekeying is a success as far as I am concerned.
[11:23:42] apergos: thanks for staying around :-]
[11:23:56] sure!
[11:23:59] it's hailing her enow
[11:24:04] the fun never ends....
[11:34:02] lunch time
[11:34:06] bb soon
[11:57:31] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[12:30:13] RECOVERY - MySQL disk space on neon is OK: DISK OK
[12:41:37] PROBLEM - Puppet freshness on mw37 is CRITICAL: Puppet has not run in the last 10 hours
[13:12:04] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 1 seconds
[13:13:16] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 1 seconds
[13:26:18] New patchset: Mark Bergsma; "Remove statically configured test backend in favor of dynamic director" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48829
[13:27:39] New review: Mark Bergsma; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48829
[13:27:47] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48829
[13:33:06] !log reedy synchronized php-1.21wmf9/resources/mediawiki
[13:33:07] Logged the message, Master
[13:48:48] https://gerrit.wikimedia.org/r/48771 "currently every box has the default, this will get the apis up to their intended 100, and imagescalers down to their intended 18"
[13:48:49] Oops :D
[14:00:23] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[14:06:23] PROBLEM - Puppet freshness on tin is CRITICAL: Puppet has not run in the last 10 hours
[14:12:23] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[14:42:48] RECOVERY - MySQL disk space on neon is OK: DISK OK
[15:06:22] New review: Dzahn; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/47795
[15:06:31] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47795
[15:11:09] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer:
[15:12:06] <-- hmm, that looks temporary.. ssh_exchange_identification: Connection closed by remote host
[15:12:15] but 3 seconds later.. login just fine
[15:12:48] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[15:19:38] Unable to open CDB file for write "/home/wikipedia/common/php-master/cache/l10n/l10n_cache-ab.cdb"
[15:19:38] pff
[15:19:40] (on labs)
[15:20:08] Invalid escape flag: j
[15:20:14] pff (on RT)
[15:21:03] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host
[15:21:37] hashar: it's already owned by another user?
[15:24:17] yeah the l10n cache files were owned by mwdeploy
[15:24:26] our permissions are SUCH as mess :-]
[15:26:47] mutante: I guess the mail to cron would be fixed now.
[15:27:46] hashar: great:) i did not get a new one yet. before i got it every 5 minutes or something
[15:29:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:30:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.565 seconds
[15:44:46] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[15:51:11] RECOVERY - MySQL disk space on neon is OK: DISK OK
[16:05:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:16:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.036 seconds
[16:25:32] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 192 seconds
[16:26:04] heya mark, you around?
[16:26:48] q about the best way to make some aggregated kraken data available for graphing
[16:28:59] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[16:35:36] New patchset: Ottomata; "Including stats system user on analytics nodes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48838
[16:36:05] New review: Ottomata; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48838
[16:36:14] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48838
[16:40:29] ottomata: yes?
[16:46:06] so, in hdfs
[16:46:14] there is /wmf/public directory
[16:46:45] this directory is meant to be used as a place to store output hadoop jobs that can be used as input to limn to graph
[16:47:16] previously, we were pointing limn at a .csv file that was available over http via the proxy
[16:47:35] since we're not doing that now, i need to figure out a new way to make that data availble.
[16:47:51] i could copy that directory over periodically to stat1001 and host it from stats.wikimedia.org
[16:48:26] also, stefan is working on creating a .deb for Limn. once that is done we'd like to puppetize and host reportcard (and other limn sites) on stat1001
[16:48:50] if Limn is on stat1001, it has access to the /wmf/public directory hdfs over http via webhdfs
[16:49:37] without knowing any more specifics, the latter seems like the cleanest solution
[16:50:15] i think so too, and that should be ok since stat1001 is in eqiad and already on the backend network, so no public proxy is needed, right?
[16:50:35] if it were in pmtpa it would be no different
[16:50:39] right right
[16:50:43] but
[16:50:45] i guess i mean just on the backend network in general
[16:50:49] how is access control handled by webhdfs?
[16:51:23] the files in /wmf/public are world readable, so webhdfs will allow access to them
[16:51:38] so basically webhdfs just looks at unix file permissions?
[16:51:48] right, and webhdfs is configured to run as a particular hdfs user
[16:51:56] that doesn't seem very secure, does it?
[16:52:01] too easy to make mistakes
[16:52:14] yeah probably
[16:52:56] also, since it'll cross the analytics VLANs acl, you'll need to add it to the ticket
[16:53:32] that's true, oh yeah, I wanted to talk about that yesterday in the meeting, but forgot to bring it up with all the other stuff
[16:53:47] i assume webhdfs is some FUSE thing?
[16:54:39] no, rest http api that comes with hadoop
[16:54:58] right, but how is it accessed by limn?
[16:55:09] just a url
[16:55:53] so basically, webhdfs needs a proper setup, with puppetization, good access control, SSL, security review
[16:56:22] right (i'm verifying that this url actually uses webhdfs at the moment, one sec…it might just be a datanode thing…), but yeah
[16:56:32] (been a few months since I looked at this)
[16:56:58] so I think you can't use that yet
[16:57:09] if something is needed now, some temporary rsync is probably better
[16:57:46] New patchset: Alex Monk; "(bug 44587) Multiple changes for trwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48841
[17:03:25] yeah, sounds good, i'm not entirely sure if the whole process at the moment. It looks like this url is not actually using webhdfs (hue web ui is), but its just a datanode service that turns on with the datanode
[17:03:50] is webhdfs currently configured?
[17:04:37] hue uses it, yes
[17:05:26] what is hue and where does it run?
[17:05:42] hue is a web interface for a lot of generic hadoop services
[17:05:49] it is running on analytics1027
[17:06:09] ok
[17:06:18] is webhdfs configured to only allow access from the analytics cluster?
[17:11:43] morning
[17:12:00] re: hue and the proxy, I was told that it's not temporary as I initially thought
[17:12:10] and that it will be the portal for analysts
[17:12:32] whatever that's gonna be, it needs to be locked down now
[17:12:33] hue, yes, proxy and how people would needs figuring out
[17:12:34] yeah
[17:12:35] if that's the case, then it should get a service IP and hostname and get behind the normal SSL cluster
[17:12:43] yeah totally
[17:12:52] there is no proxy running
[17:12:54] not set up an nginx in an1027 as we briefly discusses
[17:13:03] and nothing running on analytics1001 right now
[17:13:04] s/s$/d
[17:13:06] ottomata: there may not be a proxy running
[17:13:12] but right, mark sorry
[17:13:14] was checking about webhdfs
[17:13:18] but does webhdfs honour requests from the rest of the network?
[17:13:27] it really shouldn't at this poit
[17:13:28] point
[17:13:35] just checked, and yes it does
[17:13:39] so we should turn it off
[17:13:49] either that
[17:13:53] or lock it down
[17:13:55] you need to turn it off or lock it down sufficiently rightaway
[17:14:24] i would prefer to lock it down (iptables on analytics nodes for basically the same things that are in that RT ticket)
[17:14:29] but if you would rather me turn it off I will do so
[17:14:42] i don't like just iptables
[17:14:53] i'm guessing webhdfs can be configured with access control as well
[17:15:03] that would be a good start
[17:15:18] i'm looking into it…but I think with kerberos :(
[17:17:32] When security is on, authentication is performed by either Hadoop delegation token or Kerberos SPNEGO. If a token is set in the delegationquery parameter, the authenticated user is the user encoded in the token. If the delegation parameter is not set, the user is authenticated by Kerberos SPNEGO.
[17:18:47] if it doesn't have anything better, I think it should be turned off at this point
[17:18:58] link...?
[17:19:21] http://archive.cloudera.com/cdh4/cdh/4/hadoop/hadoop-yarn/hadoop-yarn-site/WebHDFS.html
[17:19:39] what's webhdfs? a java app? a cgi?
[17:19:48] hdfs rest API that ships with hadoop
[17:20:09] that are its interfaces
[17:20:13] but what is *it*?
[17:20:56] a fuse filesystem? a jar that runs under a java servlet container? a cgi?
[17:20:57] its part of the namenod
[17:20:59] namenode
[17:21:05] so hadoop java app
[17:21:35] ugh
[17:21:42] its disableable though
[17:21:43] doesn't look like it has any ip based access control
[17:21:45] just a config setting
[17:21:55] yeah, just kerberos :/
[17:21:55] then do that now until we can have a thorough look at it
[17:21:56] run it on loopback and set up a reverse proxy in fornt of it
[17:22:03] hmmmmmm
[17:22:09] that's a good idea
[17:22:13] that does whatever the hell we want
[17:22:29] that's not going to help us with authz though
[17:22:50] well, we can at least keep it restricted to analytics nodes with that, and only people with shell access to those could use it anyway
[17:22:54] is that ok?
[17:23:09] i mean, that's doable with iptables too though, is reverse proxy better?
[17:24:18] yes, reverse proxy is better
[17:24:34] happy to do that, can I ask why?
[17:25:31] because application level security always trumps firewalling?
[17:25:32] really both is best
[17:25:39] but we're gonna have that ACL too
[17:25:50] and any application should be secure when the firewall is inactive
[17:26:41] aye
[17:26:45] ok
[17:26:51] q then
[17:28:23] this would be easy to puppetize in the analytics branch where variables are avaiable, but that is no longer active. I can commit a reverse proxy setup to production puppet and get it reviewed by paravoid, but doing it immediately won't be as pretty as the final product once everything is properly reviewed and puppetized
[17:28:37] more hardcoded crap
[17:28:39] etc.
[17:28:48] stop thinking of some large overhaul far into the future
[17:29:04] i'm thinking of an iterative one that we haven't really started
[17:29:05] start fixing things now
[17:29:17] yes
[17:29:18] and we'll iteratively improve them
[17:29:20] can't it be iterative with band aids :)
[17:29:23] ?
[17:29:28] this is your active setup, it needs to be locked down now
[17:29:35] without puppet if it's not properly done in puppet yet
[17:29:50] hm
[17:29:51] ok so
[17:30:15] i can do it now without puppet.
[17:30:15] i can do it now with puppet but in band aid form
[17:30:15] i can do it slowly and beautifully in puppet in pretty form
[17:30:24] you do it now without puppet
[17:30:31] and then you do it slowly and beautifully in puppet in pretty form later
[17:30:34] ack
[17:30:35] like it.
[17:30:39] haha paravoid does not :p
[17:30:43] I do
[17:30:46] oh ok
[17:30:46] haha
[17:30:49] as long as later is not in 6 months
[17:30:52] nono
[17:30:58] i want to work on this whenever you are available to do so
[17:31:03] later means as soon as you close the immediate hole
[17:31:25] this should have been there from the very start
[17:31:30] since right after you did apt-get install hadoop :P
[17:31:38] aye
[17:31:54] or at the very least, before you started putting real data in
[17:32:02] paravoid, we were talking last night about having a pow wow about the current status and best way to proceed last night.
[17:32:11] i will set up reverse proxy on an1027 now,
[17:32:21] you avail in an hour or two to pow wow?
[17:32:53] i don't know
[17:33:14] ah you asked pv ;)
[17:33:18] right
[17:33:21] I am in an hour, not two
[17:33:22] but
[17:33:26] he's my puppet nit picker
[17:33:33] I would really like to see an architectural overview first
[17:33:37] right
[17:33:38] in mail preferrably, so others can see it as well
[17:33:41] that's what the pow wow is for
[17:33:43] hmmmMMMmm
[17:33:49] I kept telling this yesterday
[17:34:05] agreed
[17:34:11] architectural overview is needed for puppet review
[17:34:18] to get the full picture anyway
[17:34:43] mark obviously has an opinion, asher seems to have one too and others would benefit from learning how this whole infrastructure works and will work
[17:35:03] (incl. myself)
[17:35:20] cool, you want how things are now, how things should be, or both?
[17:35:28] both I think
[17:35:59] I mean, David showed us a diagram yesterday
[17:36:00] ok
[17:36:05] it was the first time I was seeing this
[17:36:23] not all ops were there yesterday either
[17:36:45] ok
[17:36:48] will do.
[17:37:00] ok right now, will do reverse proxy thing on an27
[17:37:01] this, plus a few more details
[17:37:19] I've heard names of components and had to Google them individually
[17:37:25] i will also ask you guys for a review of that work, even though its not in puppet
[17:37:28] like what Hue is, what Storm is and several others that I forgot
[17:37:32] right, totally
[17:37:35] oh right, "pig" too, still no idea what that is
[17:37:37] i had to do that for a few months myself
[17:38:15] quick pig tutorial by erosen:
[17:38:21] https://www.mediawiki.org/wiki/Analytics/Kraken/Tutorial
[17:42:38] mutante, re 4515 - would removing the access requests tag from the tickets help somehow?
[17:44:20] ottomata: to clarify, I'm not asking for in-depth tutorials on each of the components
[17:44:38] New patchset: Mark Bergsma; "Enable RPS (Receive Packet Steering) on LVS balancers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48843
[17:44:42] just one-line description of what each component that we're using so that we can talk the same language
[17:45:51] and an overview to the design; if it's documented somewhere already, a link is probably sufficient
[17:46:36] New review: Mark Bergsma; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48843
[17:46:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48843
[17:47:10] mark: hah, you rock
[17:47:10] hm, ok, cool, can do paravoid
[17:47:56] we'll see about that as I run puppet ;-)
[17:48:32] you probably need a notify, don't you?
[17:48:41] so it takes immediate effect rather than at reboot
[17:48:56] yeah I think so
[17:49:00] although a reboot is required anyway
[17:49:06] for other reasons
[17:49:12] but this one is easy to fix
[17:49:16] which ones?
[17:49:29] selecting the deadline scheduler for example
[17:49:32] which we do in grub now
[17:49:39] oh you mean after we provision a box?
[17:49:47] yes
[17:49:50] yeah, definitely
[17:49:54] i'm gonna run it on all now anyway
[17:49:56] root@amslvs4:~# start enable-rps
[17:49:56] start: Job failed to start
[17:49:58] initcwnd also takes effect only after a reboot
[17:49:59] alas ;-)
[17:51:08] ah right
[17:51:12] it's not run in a shell
[17:53:25] Thehelpfulone: just replied on ticket
[17:54:27] Thehelpfulone: no, you don't need to do manual stuff, we can either ignore the old ones or drop it from db .. but i will look closer tomorrow before touching all
[17:54:46] mutante, there's a bulk update that you can move them all into the queue in one go
[17:55:00] even talked to support channel ,it's on irc.perl.org aka. irc.infobot.org
[17:55:09] yea, i know the bulk update
[17:55:33] maybe we can make it NOT send emails
[17:55:51] oh you mean it sends emails to you when a new ticket appears?
[17:55:59] not sure, i think it does
[17:56:02] yeah if you can disable the email interface or something
[17:56:20] when I try to do the query: 'CF.{tags}' = 'access requests' it seems to time out
[17:56:22] eh, yea, it sends email on that
[17:56:27] but if I do it for a single word tag it works
[17:56:32] i meant if it sends email when using bulk update and just moving tickets
[17:57:04] that might be related to the bug?! shrug
[17:57:22] in the db tables it isnt a problem to select them
[17:57:34] its not like it really is that much data
[17:57:52] yeah, not really many tickets
[17:58:19] need to be sure if it doesnt break when deleting an id from CustomFieldValues
[17:58:31] removing it from tickets seems unproblematic
[17:58:44] New patchset: Mark Bergsma; "Run in a shell" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48846
[17:58:45] New patchset: Mark Bergsma; "Allow upstart jobs to be started after install" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48847
[17:58:48] will refine/merge after dinner
[18:00:51] mutante, ah I got it, it was a chrome issue, so there's 54 tickets that have the 'access request' tag
[18:01:08] I could bulk update to move into access-requests and remove the tag so you can kill it?
[18:02:14] Thehelpfulone: if you can make it not spam everybody and they are not all re-opened or something.. sure.
[18:02:56] Thehelpfulone: but you can also let me do it tomorrow.. on db after making a backup ..
[18:03:21] sure, I'll add the query to the ticket, I think there's a Suppress All Outgoing Mail
[18:03:21] setting too so that could work
[18:03:24] and it doesnt matter much because they are all closed/history anyways
[18:03:54] it's cleaner though for historic purporses. yep
[18:04:22] just right now i am about to have guests/food
[18:04:28] so ttyl, k?
[18:04:31] yep
[18:04:36] k, cya
[18:05:23] RoanKattouw_away: your cron on fenari is mad spammin', yo
[18:06:17] Thehelpfulone: ah, last thing. there is a BCC: thing on everything global ..
[18:06:18] RoanKattouw_away: forwarded you a smaple
[18:18:37] notpeter: have you seen https://gerrit.wikimedia.org/r/#/c/48664/1 ?
[18:21:01] AaronSchulz: ah, I saw it come through my inbox
[18:24:45] New patchset: awjrichards; "Include forceHTTPS cookies in list of mobile cookies passed to apaches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48854
[18:26:16] AaronSchulz: gimme a sec and I'll for rael review it
[18:27:27] !log aaron synchronized php-1.21wmf9/extensions/TranslationNotifications 'deployed d7bf60b57f472e5f75101f58505f61842848f961'
[18:27:28] Logged the message, Master
[18:31:33] paravoid: Do you happen to have a familiarity on what has to happen to lvs servers in eqiad to use row c? Chris ran the connections, but the config needs to be updated to work with new ips https://rt.wikimedia.org/Ticket/Display.html?id=3294
[18:32:08] New review: Jdlrobson; "Patch Set 1: Code-Review+1" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/48854
[18:34:55] New review: Pyoungmeister; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48771
[18:35:04] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48771
[18:35:53] small bug on the mobile site that should be fixed by https://gerrit.wikimedia.org/r/#/c/48854/1 - can someone take a look/push?
[18:36:30] RobH: I haven't done this before but it looks trivial enough
[18:36:34] just copy the stanza in site.pp?
[18:36:59] New patchset: RobH; "adding in mw1201-1220 into api & apache service" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48855 [18:37:10] site.pp:1137 onwards [18:37:45] I'll make the changes and ping you then, cuz im not certain at all ;] [18:37:59] PROBLEM - Puppet freshness on locke is CRITICAL: Puppet has not run in the last 10 hours [18:38:35] preilly: ^^ [18:38:37] paravoid: have you looked at: https://gerrit.wikimedia.org/r/#/c/48854/1/templates/varnish/mobile-frontend.inc.vcl.erb ? [18:38:55] thanks preilly [18:39:41] cmjohnson1: You want to do some code review (look at my patchset) [18:39:45] Ryan_Lane: can you look at: https://gerrit.wikimedia.org/r/#/c/48854/1/templates/varnish/mobile-frontend.inc.vcl.erb [18:39:47] * RobH is attempting to not self review. [18:39:57] k [18:40:03] csteipp: Are you okay with: https://gerrit.wikimedia.org/r/#/c/48854/1/templates/varnish/mobile-frontend.inc.vcl.erb [18:40:13] if it looks good to you, you can merge it in gerrit [18:40:19] robh: also...when equinix runs an oob cat 5 link...would they run it to the cabinet [18:40:19] and i'll take care of merge on sockpuppet [18:40:21] RobH: self-review is awesome. it's all about being radically self-reliant [18:40:23] https://rt.wikimedia.org/Ticket/Display.html?id=4026 [18:40:28] can't find the cable [18:40:59] PROBLEM - Puppet freshness on cp3022 is CRITICAL: Puppet has not run in the last 10 hours [18:41:08] cmjohnson1: cabinet 307 is c7 [18:41:12] i have no idea why it would go there [18:41:14] this seems wrong. [18:41:27] all cross connects are supposed to terminate in the dmarc cabinet [18:41:46] Ryan_Lane: $this->setCookie( 'forceHTTPS', 'true', time() + 2592000, false ); //30 days [18:41:55] I take it there isn't some wildly out of place rj45 connection port on the dmarc eh? =] [18:42:05] Ryan_Lane: from includes/User.php:3008 [18:42:28] cmjohnson1: Uhh, Z-Side Customer and Cage Number EQUINIX INC. 
DC6:1:62260:EQUINIX [18:42:30] that isnt our cage [18:42:47] New review: preilly; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48854 [18:42:57] Change merged: preilly; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48854 [18:43:29] thanks preilly - does that automatically get pushed out by puppet now/ [18:43:30] ? [18:43:49] cmjohnson1: I updated the ticket, but I think they put our out of band connection in someone elses cage. [18:43:59] the work order lists some other cage, not ours. [18:44:24] awjr: once it's merged on sockpuppet [18:44:55] merged [18:45:01] * aude wonders what sockpuppet is? :) [18:45:04] groovy thanks paravoid preilly [18:45:22] awjr: don't forget to thank paravoid and Ryan_Lane [18:45:31] awjr: Oh, and that csteipp guy [18:45:41] thanks gents :) [18:45:48] aude: its the puppetmaster certificate server [18:45:49] awjr: and thank you for writing the VCL change ;-) [18:45:51] awjr: we were discussing in person about what that cookie is [18:45:56] RobH: ah, okay [18:46:01] so any gerrit changes have an additional operations only step for root to merge it live onto cluster [18:46:06] (in operations puppet that is) [18:46:13] and based of the explanation of what that is, I threw the idea that maybe the http->https redirect could be done in VCL [18:46:24] robh: my review link is not there...but looks ok [18:46:39] i.e. if it's merely a http->https redirect based on a cookie value, there's no reason to go all the way back to appservers and load half of mediawiki core [18:46:41] are you logged in? [18:46:41] paravoid: I think that is a better plan long-term [18:46:47] that makes sense paravoid [18:47:01] paravoid: do you want to make that change? [18:47:03] robh: i thought 307 was c7 but didn't see the cable [18:47:18] what are the cookie values? [18:47:31] cmjohnson1: So you should be able to review, check if you are logged in, and if so, log out and back in. 
[18:47:31] nope not logged in [18:47:33] paravoid: 'true' [18:47:42] ha [18:47:43] cmjohnson1: I updated the ticket, they put our out of band cable in someone elses cabinat [18:47:48] can it also be "false"? [18:47:48] its written right on work order =] [18:48:08] New review: Cmjohnson; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48855 [18:48:12] paravoid: not as far as i can tell looking through where it gets set in MW core [18:48:16] customer side cabinet dc6:62260 [18:48:16] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48855 [18:48:22] thats not us, heh [18:48:40] Ryan_Lane: can you force a puppet run on the mobile caches to pick up awjr 's change [18:48:43] nope [18:48:58] robh: your changes are merged if you wanna run on sockpuppet [18:49:09] paravoid: yeah, it's either true or empty/unset [18:49:14] yep, merged when i saw it scroll past =] [18:49:15] yep [18:49:27] thanks Ryan_Lane [18:51:52] cmjohnson1: thx dude [18:52:15] yw [18:59:51] !log mw1041 is borked and will be powered down and removed from network for h/w checks [18:59:52] Logged the message, Master [19:01:42] paravoid, are you bribable into package review?:P [19:02:13] if you're talking about the OSM stuff, there's no way I can find time for them this week [19:02:19] smaller changes I can fit somewhere [19:02:49] I did see some of the OSM ones though and they do need more work [19:03:16] :( [19:03:59] the osm2pgsql one would be a good one to start with [19:04:14] the one with the stylesheets might be better to break up differently [19:04:35] notpeter: damn you! I never wanted to remember these sun commands! [19:04:41] now i have to recall all the pain.... [19:04:58] hey [19:05:03] New patchset: Pyoungmeister; "reviving db29 for use by pgehres" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48862 [19:05:04] I don't even know what the sun commands are! 
[19:05:17] which sun commands? [19:05:17] heh [19:05:21] the ilom stuff [19:05:30] i recall them, i just wanna give notpeter shit about it. [19:06:07] oh that's the least of my problems with Sun servers [19:06:08] im so upset im crying (its not allergies, sun ilom makes me cry) [19:06:24] all lights-out suck [19:06:28] you might be allergic to sun hardware [19:06:31] there's not a single one of them that's sane [19:06:37] that I've seen at least [19:07:21] ok, so db29 isnt rebooting, it may be borked, looking into it [19:07:27] notpeter: was it offline due to issues or anything that you are aware of? [19:07:36] MaxSem: so where did these packages come from? [19:07:44] RobH: nope [19:07:56] was on and even had mysql running about 5 minutes ago [19:07:59] paravoid, OSM dev Kai Krueger [19:08:19] were previously maintained in one shared repo [19:08:23] some of the files mention a different name, Frederik Ramm [19:08:35] and I found some packages on the web with that name [19:08:53] yes, there are different versions [19:09:17] guess Kai based upon Frederik's work [19:10:31] New review: Faidon; "Patch Set 1: Code-Review-1" [operations/debs/osm2pgsql] (master) C: -1; - https://gerrit.wikimedia.org/r/48605 [19:10:35] notpeter: its borked [19:10:40] i need to drop a ticket for sbernardin to check it out [19:10:50] it wont take LOM commands, needs full power removal to reset [19:10:56] known sun ilom issue. [19:11:04] ok, then how about db27 :) [19:11:15] lemme make these tickets first, then will take a look at that one, same thing? 
[19:11:19] yep [19:11:47] MaxSem: so, there [19:11:59] aha, I've seen your comment [19:12:08] py is doing a graceful restart of all apaches [19:12:12] MaxSem: I can go through and review the rest at some point, but before that I'd like some introduction to what we're doing and how [19:12:22] and why those components were picked instead of others [19:12:32] !log py gracefulled all apaches [19:12:32] I know virtually nothing about the project [19:12:33] Logged the message, Master [19:13:27] paravoid, do you need a general architecture overview? [19:13:30] https for mobile on non-wikipedia projects are causing ssl warnings - it looks like the certs are for *.wikipedia.org (eg for https://en.m.wikivoyage.org) [19:13:36] RobH, Ryan_Lane ^ [19:13:46] is that something we can get fixed quick? [19:13:48] awjr, a long known bug [19:14:28] MaxSem: that'd be great, yes [19:15:11] awjr: this is the same bug as reported before [19:15:25] Ryan_Lane: yeah - just found the rt ticket https://rt.wikimedia.org/Ticket/Display.html?id=2136 [19:16:03] wow that's been open since 2011... [19:16:55] no [19:16:55] this is different [19:17:09] Ryan_Lane: do you know rt/bug #? [19:17:21] I'm nearly positive you are the one that filed it [19:17:29] lol could be ... [19:17:42] we don't have enough IP addresses to handle mobile [19:17:56] New patchset: Aude; "Enable Wikibase on enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48866 [19:18:40] Ryan_Lane: oh - is there an alternative way to deal with it? [19:18:45] no [19:18:48] PROBLEM - Host mw1045 is DOWN: PING CRITICAL - Packet loss = 100% [19:19:12] notpeter: got bash? :) [19:19:38] Ryan_Lane: manage-volumes runs as a cron on labstore2, right? What user does it run as? [19:20:17] paravoid, basically: mod_tile is an Apache module that serves map tiles, it requests absent tiles from the renderd demon which uses Mapnik for rendering. osm2pgsql and Osmosis are used for import/updates of the OSM DB. 
The DB is PG+PostGis [19:20:20] https://rt.wikimedia.org/Ticket/Display.html?id=2541 [19:20:31] andrewbogott: glustermanager [19:20:41] Sorry, the servers are overloaded at the moment. [19:20:41] Too many users are trying to view this page. Please wait a while before you try to access this page again. [19:20:44] Timeout waiting for the lock [19:20:45] not again..... [19:20:49] http://en.wikipedia.org/wiki/New_York_City [19:21:01] anyone seen that? :o [19:21:03] AaronSchulz: what is this I don't even [19:21:10] yeah, sorry, getting distracted [19:21:23] heh [19:21:29] !log aaron synchronized php-1.21wmf9/maintenance/runJobs.php 'deployed b03384f0ebd95e7f79638fb14ccd55da9c186d97' [19:21:31] Logged the message, Master [19:21:32] Ryan_Lane: thx [19:22:41] thanks for the link Ryan_Lane [19:22:44] New patchset: Pyoungmeister; "reviving db27 for use by pgehres" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48862 [19:23:33] paravoid, at least initially we're going to use the same configuration that OSM uses: http://wiki.openstreetmap.org/wiki/Creating_your_own_tiles#Creating_tiles_using_Mapnik_and_mod_tile [19:23:36] [19:23:40] for the error i got [19:23:56] paravoid: MaxSem we can ask kai about the packages [19:24:14] not sure if the ubuntu one is sufficient but it might be [19:24:19] he maintains the package [19:24:47] New patchset: Hashar; "beta: remove deprecated $urlprotocol" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48867 [19:25:14] New patchset: Alex Monk; "(bug 44893) Set up redirect from tartupeedia.ee to a page on etwiki" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/48868 [19:25:32] awjr: so I proposed a solution, RobH is going to have a look at it [19:25:55] paravoid: re the ssl certs/mobile? [19:26:01] yes. [19:26:04] sweet, thanks paravoid [19:26:29] paravoid: what are you proposing: [19:26:30] ? 
[19:26:31] we chatted a bit in person [19:26:34] ryan, rob and me [19:26:36] New review: Hashar; "Patch Set 1: Code-Review+2" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/48867 [19:26:50] paravoid: actually, can you put it in the rt ticket? [19:26:51] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48867 [19:26:52] sorry, being usually remote I know how that feels :) [19:27:00] hehehe :) [19:27:06] i appreciate it [19:27:52] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 194 seconds [19:28:14] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 198 seconds [19:29:06] Request: POST http://en.wikipedia.org/w/index.php?title=New_York_City&action=submit, from 69.164.222.250 via cp1020.eqiad.wmnet (squid/2.7.STABLE9) to 10.64.0.138 (10.64.0.138) [19:29:11] Error: ERR_READ_TIMEOUT, errno [No Error] at Wed, 13 Feb 2013 19:28:34 GMT [19:29:14] that's without wikibase [19:29:58] * aude thinks its' somewhat normal but not nice [19:30:05] !log aaron rebuilt wikiversions.cdb and synchronized wikiversions files: Moved remaning wikis to 1.21wmf9 [19:30:07] Logged the message, Master [19:30:20] MaxSem: I don't think we can meaningfully do this discussion over IRC [19:30:26] :o [19:30:34] I'm basically late to the game and I have multiple questions [19:30:59] AaronSchulz: /nonexistent/ [19:31:02] like why postgres, how are we going to scale those apaches (varnish I presume), how large those tiles are and what kind of storage requirements we have [19:31:06] ooops [19:31:10] why modtile and not tilelite, etc. 
[19:31:11] Reedy: AaronSchulz https://gerrit.wikimedia.org/r/#/c/48863/ [19:31:33] i'll try resubmitting it to make jenkins happy [19:31:55] New patchset: Aaron Schulz; "Moved remaning wikis to 1.21wmf9" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48869 [19:32:06] New review: Aaron Schulz; "Patch Set 1: Code-Review+2" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/48869 [19:32:18] "why modtile and not tilelite" - because the SW ran by OSM itself is best ATM, all the rest are a gambling [19:32:19] paravoid: it's the tiles that can be cached with squid or varnish [19:32:21] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48869 [19:32:44] aude: what is this for? [19:32:46] aude: We'll do wmf9 then poke at wikidata stuff :) [19:32:56] the database isn't as big of a deal, although if we render lots of styles then we want more database instances [19:33:01] Reedy: that's fine [19:33:13] paravoid, "why PG" - because it's the basic OSM design decision, and MySQL doesn't have alternatives to PostGIS [19:33:14] AaronSchulz: reedy can probably handle our deployment [19:33:23] MaxSem: +1 [19:33:41] osm originally had mysql but it wasn't very good [19:35:17] paravoid, "how are we going to scale those apaches" - that's the most interesting part:) currently, OSM runs off one mighty tileserver backed by multiple Squids [19:35:32] MaxSem: there are geodns caches [19:35:37] e.g. additional tile servers [19:35:50] geodns spreads the requests around [19:35:51] aude, Squids != tileservers [19:35:56] yes [19:36:19] "tileserver" === "Apache that serves/renders tiles" [19:36:24] yep [19:36:37] there can be multiple + squids [19:36:38] paravoid, the whole point of the Copenhagen event we're dragging you to is to scale this architecture to multiple tileservers [19:36:41] aren't we supposed to do this discussion in Copenhagen? 
[19:36:45] heh [19:36:47] :) [19:37:27] considering we don't have something right now, I don't think the topic is /just/ scaling [19:37:46] "setting up something that scales" sounds better [19:37:50] New review: Pyoungmeister; "Patch Set 1: Code-Review+1" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/48664 [19:37:54] paravoid, I wanted to have basic packages by then [19:38:21] * aude agrees [19:38:22] notpeter: thanks, maybe apergos can look [19:38:27] there's a loop here [19:38:31] AaronSchulz: I give you +1, as I think another set of eyes would be wise. that said, if you want me to just +2 it, I'd be ok with that as well [19:38:53] dunno either you were looking for review or just merge :) [19:38:59] * AaronSchulz never knows what apergos is up to :) [19:39:08] setting up something that scales means choosing the components/software used [19:39:11] s/either/whether/ [19:39:24] paravoid, no there isn't. we will start with these packages [19:39:56] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [19:40:07] at least, there aren't feasible alternatives to these 3 - mod_tile, osm2pgsql and styles [19:40:23] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [19:40:43] (styles can be changed at any moment before the thing officially goes live, but here's the start) [19:40:51] Hello! I could use a review/merge related to beta They are both related to the Apache manifests: https://gerrit.wikimedia.org/r/45115 (get rid of a dupe definition) https://gerrit.wikimedia.org/r/47398 (let apache start automatically on beta) [19:43:45] paravoid, the main point of choice would be renderd vs. 
Tirex, but that's basically all for the first stage [19:46:02] there's also node.js based rendering :) but renderd is well-tested (not very flexible) [19:46:10] tirex is more flexible and good [19:46:38] anyway, i'm busy with wikibase deployment ot enwiki :D [19:47:17] !log reedy synchronized php-1.21wmf9/extensions/Wikibase/ [19:47:17] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [19:47:18] Logged the message, Master [19:47:29] sorry, I'd like to make those calls. [19:47:43] me and the rest of ops that is [19:48:24] hrm [19:52:05] PROBLEM - Apache HTTP on mw1201 is CRITICAL: Connection refused [19:52:59] PROBLEM - Apache HTTP on mw1202 is CRITICAL: Connection refused [19:57:39] paravoid, these calls will be made final in Copenhagen - however, all three packages currently up for review are required in either case;) [19:57:56] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 192 seconds [19:58:42] feel free to work on them, don't expect me to merge them/put them up in apt before then. 
[19:59:17] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 227 seconds [19:59:21] !log reedy synchronized wmf-config/InitialiseSettings.php 'Enable Wikibase Client on enwiki' [19:59:22] Logged the message, Master [20:01:23] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [20:02:44] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [20:06:51] New patchset: Tpt; "(bug 40759) Let ProofrzadPage setup namespaces for is Wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48877 [20:07:54] New patchset: Tpt; "(bug 40759) Let Proofread Page setup namespaces for is Wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48877 [20:09:32] New patchset: Mark Bergsma; "Allow upstart jobs to be started after install" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48847 [20:10:08] New review: Mark Bergsma; "Patch Set 2: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48847 [20:10:21] New review: Mark Bergsma; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48846 [20:10:29] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48846 [20:10:53] New review: Siebrand; "Patch Set 2:" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48877 [20:15:05] New patchset: Tpt; "(bug 40759) Let Proofread Page setup namespaces for fi.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48877 [20:17:33] Who broke the sql script? :( [20:18:04] reedy@fenari:/home/wikipedia/common$ sql enwiki [20:18:04] Cannot run a MediaWiki script as a user in the group wikidev [20:18:38] RECOVERY - MySQL disk space on neon is OK: DISK OK [20:18:51] Reedy: was that from Tim's change last night? 
[20:19:08] I'm guessing so, yeah [20:19:23] I can use mwscript sql.php --wiki=foo, but it's not the mysql cli app [20:20:34] New patchset: Mark Bergsma; "Some NICs support > 10 queues" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48879 [20:21:07] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48847 [20:21:14] New patchset: Siebrand; "(bug 40759) Let Proofread Page setup namespaces for fi.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48877 [20:21:20] New review: Mark Bergsma; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48879 [20:21:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48879 [20:22:10] New patchset: Siebrand; "(bug 40759) Let Proofread Page setup namespaces for fi.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48877 [20:22:32] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 183 seconds [20:23:26] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 190 seconds [20:23:27] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [20:23:27] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [20:23:27] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [20:25:49] !log reedy synchronized wmf-config/CommonSettings.php 'Uncommenting enwiki from localClientDatabases' [20:25:50] Logged the message, Master [20:25:59] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [20:26:53] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [20:33:56] PROBLEM - Apache HTTP on mw1208 is CRITICAL: Connection refused [20:34:32] PROBLEM - Apache HTTP on mw1206 is CRITICAL: Connection refused [20:34:59] PROBLEM - 
Apache HTTP on mw1205 is CRITICAL: Connection refused [20:39:20] PROBLEM - Apache HTTP on mw1207 is CRITICAL: Connection refused [20:40:40] New patchset: CSteipp; "Enable Global Abuse Filters on test, mediawiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48070 [20:42:07] New patchset: Reedy; "Enable wikidata client on enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48883 [20:42:45] New review: Reedy; "Patch Set 1: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48883 [20:42:46] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48883 [20:49:02] New patchset: Krinkle; "Deploy Global AbuseFilters to Meta-Wiki, MediaWiki and test.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48070 [20:49:14] PROBLEM - Apache HTTP on mw1204 is CRITICAL: Connection refused [20:50:56] New review: Krinkle; "Patch Set 5: Code-Review-1" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/48070 [20:54:32] New patchset: CSteipp; "Deploy Global AbuseFilters to Meta-Wiki, MediaWiki and test.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48070 [20:56:40] New review: Reedy; "Patch Set 6: Code-Review+2" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/48070 [20:58:34] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48070 [20:59:24] New patchset: Aude; "change wikidata link url to have www prefix" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48893 [21:04:24] New review: Daniel Kinzler; "Patch Set 1: Code-Review+1" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/48893 [21:08:26] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.122 second response time [21:09:35] !log 
reedy cleared profiling data [21:09:37] Logged the message, Master [21:09:37] AaronSchulz: [21:09:54] I'm looking at https://gerrit.wikimedia.org/r/#/c/48664/1/modules/mediawiki_new/templates/jobrunner/jobs-loop.sh.erb [21:10:32] RECOVERY - Puppet freshness on mw37 is OK: puppet ran at Wed Feb 13 21:10:16 UTC 2013 [21:10:38] am I blind or is it possible that you could have runJobs called with --procs= some negative value ? [21:11:13] I don't think you are blind [21:11:38] yeah it needs another check [21:11:58] !log csteipp synchronized wmf-config [21:11:58] Logged the message, Master [21:12:29] PROBLEM - Apache HTTP on mw1203 is CRITICAL: Connection refused [21:12:35] apergos: I must have been thinking that sleep 1 was already in a loop [21:12:37] which it's not atm [21:12:51] apergos: anything else? [21:12:58] that's all I saw [21:15:38] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.276 second response time [21:16:32] RECOVERY - Apache HTTP on mw1206 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.136 second response time [21:18:33] New review: Reedy; "Patch Set 1: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48893 [21:18:35] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48893 [21:18:38] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.123 second response time [21:19:11] !log reedy synchronized wmf-config/CommonSettings.php [21:19:12] Logged the message, Master [21:19:41] RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.133 second response time [21:23:07] AaronSchulz: I am going to afk soon so if you want a merge soon by me you should get the new patchset in now [21:24:24] rfaulkner: Adam Werbach got arrested today [21:24:50] apergos: ok [21:25:37] rfaulkner: 
https://sphotos-a.xx.fbcdn.net/hphotos-ash3/529857_10200664297509691_1959035885_n.jpg [21:26:00] apergos: when are you back? [21:26:23] well tomorrow morning :-D [21:26:42] it's 11 pm or so here right now [21:26:49] notpeter: confirming its db27 i can reboot and reinstall, and it presently has no data we need? (It is online right now) [21:26:55] im going to kick it now. [21:27:15] Change abandoned: Aude; "already enabled :D" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48866 [21:27:29] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.142 second response time [21:27:34] RobH: go for it [21:27:37] !log db29 offline due to hardware issue, related ticket 4526 [21:27:38] Logged the message, RobH [21:27:43] !log db27 rebooting for reinstallation [21:27:44] Logged the message, RobH [21:29:32] AaronSchulz: is it ok to wander off or should I stick around for a few more mins? [21:29:44] apergos: a few min :) [21:29:52] ok! [21:30:32] sbernardin: You still onsite by chance? (I know its getting later there) [21:30:33] !log Running sync-common on mw1202 [21:30:34] Logged the message, Master [21:30:42] notpeter: So db27 also wont take reboot commands via ilom ;] [21:30:52] hurray.... [21:30:57] while its online and i can manually reboot via OS, i dont want to push some half broken item into service [21:31:05] yeah [21:31:11] ie: i can reinstall, but if we ever have a hung OS its screwed [21:31:18] im going to drop another trouble ticket for steve to fix [21:31:22] yeah [21:31:25] whichever one he gets to work first i'll install [21:31:27] let's just get it fully powercycled [21:31:31] cool [21:31:32] thanks! [21:32:11] New patchset: Aaron Schulz; "Modified jobs-loop script to keep a fuller pipeline." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/48664 [21:32:16] apergos: ^ [21:32:22] !log Running sync-common on mw1204 [21:32:23] Logged the message, Master [21:32:31] looking [21:34:22] I think this is ok [21:34:32] I'm going to +2 it [21:34:44] thanks [21:34:57] New review: ArielGlenn; "Patch Set 2: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48664 [21:35:06] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48664 [21:35:38] I'll even stick aorund for a few more minutes in case you see anything weird in the logs from the job runners [21:35:45] but only a few more mins :-) [21:36:42] apergos: testing on one box? [21:37:09] um [21:37:14] that could be done [21:37:31] oh woops [21:37:32] sec [21:37:44] need to merge in puppet heh [21:38:15] New patchset: Andrew Bogott; "Detect and log volumes from deleted projects." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48941 [21:38:30] apergos: nice catch on that missing loop [21:38:33] that would have sucked [21:38:38] yw [21:39:05] rsync: rename "/usr/local/apache/common-local/php-1.21wmf9/extensions/UploadWizard/.UploadWizard.i18n.php.K5yKeS" -> "php-1.21wmf9/extensions/UploadWizard/UploadWizard.i18n.php": No such file or directory (2) [21:39:09] o_0 [21:39:22] puppet run on mw1010 [21:39:35] heh [21:40:25] uh oh [21:40:29] no good [21:40:37] I saw this: [21:40:39] /usr/bin/python /usr/lib/command-not-found -- jobs-loop [21:40:42] and now nothing [21:40:45] RECOVERY - Apache HTTP on mw1204 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.139 second response time [21:40:55] apergos: what did you type? [21:40:59] it needs .sh [21:41:06] puppetd --test :-D [21:41:14] anyway, the real problem is syntax on line 99 [21:41:39] subprocscreate=$$(12 - subproccount)' [21:41:58] woops [21:42:19] apergos: I don't see the problem [21:42:31] no? 
[21:42:55] apergos: try "jobs-loop.sh" [21:43:05] you'll see the error, but I don't see what's wrong [21:43:13] gah, nvm [21:43:14] how about $subproccount ? [21:43:16] :-D [21:43:17] that should be $(( )) [21:43:18] !log reedy synchronized php-1.21wmf9 'resync to fix file permission errors' [21:43:19] Logged the message, Master [21:43:25] for math expressions [21:43:27] ok fixing [21:43:37] apergos: I made that mistake elsewhere but caught it first [21:43:43] I want to do $$() instead of $(( )) [21:43:52] yep that too [21:44:29] !log Running sync-common on mw1203 [21:44:30] Logged the message, Master [21:44:30] RECOVERY - Apache HTTP on mw1203 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.128 second response time [21:44:42] New patchset: Aaron Schulz; "Fixed math expression syntax." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48942 [21:44:45] apergos: ^ [21:44:57] already looking [21:45:19] I didn't see any other instances [21:45:24] don't you need $subproccount ? [21:45:37] not in math expressions [21:45:51] ok [21:46:21] New review: ArielGlenn; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48942 [21:46:31] well let's try it again [21:46:32] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48942 [21:48:23] no dice [21:48:28] New patchset: Nemo bis; "(bug 44974) Add localised/v2 logos for Wikipedias without one (first installment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48952 [21:48:36] apergos: looking [21:48:59] /usr/local/bin/jobs-loop.sh: line 120: syntax error near unexpected token `morehpjobs=n' [21:49:13] is there a lint type thing for this? [21:49:25] (after puppet resolves vars) [21:49:26] RobH: will be here for another hour or so....need something?
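The arithmetic fix discussed above can be sketched roughly as follows. This is only an illustration, not the contents of jobs-loop.sh: the function name slots_to_fill is hypothetical, and the limit of 12 is taken from the quoted line. In shell arithmetic, $(( )) evaluates the expression and variables inside it need no $ prefix; clamping at zero is one possible way to cover the negative --procs case raised in review.

```shell
# Hypothetical helper, not taken from jobs-loop.sh.
# Prints how many new subprocesses could be started, never less than 0.
slots_to_fill() {
    limit=$1
    running=$2
    # $(( )) is arithmetic expansion; $$ would expand to the shell's PID.
    free=$(( limit - running ))
    if [ "$free" -lt 0 ]; then
        free=0
    fi
    echo "$free"
}

slots_to_fill 12 5     # prints 7
slots_to_fill 12 15    # prints 0
```

With this shape, a value such as the result of slots_to_fill could be passed to --procs without ever going negative, assuming the caller skips the run when the result is 0.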
[21:49:53] no idea, sadly [21:50:35] I assume you want "n" there [21:50:52] no, that part is old [21:51:00] I think it may not like the func() though [21:51:03] right above it [21:51:15] sbernardin: Ya, can you pull power on db29 and db27? [21:51:23] apergos: it wants just func right? [21:51:24] sbernardin: they each have tickets in pmtpa queue i added today, so brand new issues [21:51:37] the ilom isnt working, so power removal is the best fix (usually) [21:51:42] yep [21:52:00] sbernardin: https://rt.wikimedia.org/Ticket/Display.html?id=4528 & https://rt.wikimedia.org/Ticket/Display.html?id=4526 [21:52:15] i just need one of the two systems to work =] (db27 seems more promising) [21:52:26] notpeter: ^ sbernardin is on it [21:52:45] uh I think it wants "function" [21:52:46] New patchset: Aaron Schulz; "Call functions properly." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48954 [21:52:47] apergos: ^ sigh [21:52:58] how long does it take for resource loader stuff to be recached? e.g. for logged in users? [21:53:02] woo! thank you, sbernardin [21:53:07] once i edit MediaWiki:Common.css [21:53:21] oh [21:53:21] We can touch some files to try and help it.. [21:53:23] :-D yeah [21:53:26] http://en.wikipedia.org/wiki/Main_Page?action=purge [21:53:36] preilly: wow! 
[21:53:39] um [21:53:40] seems to do nothing but ?debug=true makes the "edit links" disappear [21:53:51] for the main page, wikidata is supressed [21:54:02] ah, bash -n [21:54:05] it's a bug that we don't supress the link also but we can hide it for now [21:54:15] hmm we'll see [21:54:21] oh, yeah you can do that [21:54:26] but it's not very foolproof [21:54:27] preilly: i suppose better that it's for a good cause … at first i thought it may be something shady [21:54:38] rfaulkner: ah [21:55:02] New review: ArielGlenn; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48954 [21:55:10] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48954 [21:55:17] apergos: I'll remember that next time ;) [21:55:28] the puppet stuff doesn't seem to cause much trouble for it [21:55:30] I keep thinking you test these things :-P [21:55:48] apergos: kind of hard to do locally [21:55:58] labs might help [21:56:18] doo dee doo dee doo [21:56:22] waiting for puppet run again [21:56:23] alright the link is gone from the main page :D [21:56:50] apergos: we have jobs-loop running in labs right? [21:57:06] I have no idea tbh [21:57:19] * AaronSchulz can bother hashar about that [21:57:30] wow holy crap there are a billion job runners now on 1010 [21:57:53] 333 [21:58:01] must b something broken there :-D [21:58:56] Ryan_Lane: So the unified cert rquires someone at digicert who is out of the office. [21:59:03] PROBLEM - Apache HTTP on mw1051 is CRITICAL: Connection refused [21:59:03] apergos: I thought it got slow [21:59:06] they just said will do this evening and get it back to us [21:59:07] -_- [21:59:10] * Ryan_Lane nods [21:59:12] so will be tomorrow i suppose =P [21:59:18] going to shoot them all [21:59:24] oh well, atleast its not wiating on me now [21:59:26] \o/ [21:59:34] apergos: why are there none now? 
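Note on the `func()` and `bash -n` remarks above: a minimal sketch, assuming the line-120 error came from parentheses on the calling side of a function. The log does not show the script itself, so `count_children` and the two temporary files are hypothetical stand-ins.

```shell
#!/usr/bin/env bash
# Hypothetical function, not taken from jobs-loop.sh.
count_children() {
    # Print the first argument plus one (stand-in for real work).
    echo $(( $1 + 1 ))
}

# Call a function with its bare name and arguments. Writing
# "count_children()" on the calling side starts a function definition,
# so bash expects a compound command next and reports a syntax error
# at whatever token follows.
result=$(count_children 2)
echo "result: $result"   # prints "result: 3"

# bash -n parses a script without executing it.
good=$(mktemp)
bad=$(mktemp)
printf 'f() { echo hi; }\nf\n' > "$good"
printf 'f()\nx=n\n' > "$bad"

bash -n "$good"
good_status=$?
bash -n "$bad" 2>/dev/null
bad_status=$?
rm -f "$good" "$bad"

echo "good=$good_status bad=$bad_status"
```

As noted in the log, `bash -n` is not foolproof: it only checks syntax, so a misspelled command name or a wrong value still passes.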
[21:59:39] (like it was before) [21:59:45] no idea [21:59:48] didn't get to them [21:59:58] still on 1010 right? [22:00:08] apergos: maybe OOM killer? :) [22:00:18] hahaha [22:00:22] yeah 1010 [22:00:29] jesus [22:00:30] though the sh script itself wouldn't have much mem use [22:00:56] shot the main script [22:01:14] the rest will complete in a bit I guess [22:01:26] um maybe you want to look at that math bit again :-) [22:01:36] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [22:01:48] you think that's it? [22:02:26] dunno but somehow subprocscreate must be getting borked [22:02:40] I tested the ps command to get $subproccount yesterday [22:02:48] hm [22:02:52] It seems odd that the little subtraction would be the problem [22:03:18] I tested that too (constant - var) [22:03:27] oh and all of these are procs=12 too :-D [22:05:26] apergos: that expression definitely works [22:05:31] hm [22:07:49] the if and while are ok too [22:08:00] I'm a bit baffled [22:08:10] you said it was ~300? [22:09:24] 333 [22:09:29] what user does htis need to run as? [22:09:42] PROBLEM - SSH on mw1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:09:47] should be apache I guess, since Tim's recent change [22:09:52] apergos: maybe it $$ vs $BASHPID ? [22:10:27] PROBLEM - SSH on mw1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:10:32] *it is [22:12:13] 201 now [22:12:45] apergos: do think the ps command should use $BASHPID? 
[22:12:48] New review: Faidon; "Patch Set 3: Code-Review+1" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/40784 [22:12:59] maybe somehow $$ is for a parent of the script itself [22:13:19] !log adding mw1201-1208 to eqiad apache pool [22:13:20] Logged the message, RobH [22:14:27] notpeter: its amazing how much faster these work when lvs actually can connect to them ;] [22:14:29] apergos: I'll try that and give up if that doesn't help [22:14:35] ok [22:14:35] hahaha [22:14:48] RECOVERY - Apache HTTP on mw1051 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.094 second response time [22:14:57] is anyone else unable to log in to RT? [22:14:57] RECOVERY - SSH on mw1009 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:15:06] did someone just bring mw1051 back up? [22:15:14] i even did a password reset and it said it was successful [22:15:26] wfm [22:15:28] AaronSchulz: I think the jobrunners aren't entirely happy: https://ganglia.wikimedia.org/latest/?c=Jobrunners%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [22:15:38] we know [22:15:42] ok! [22:15:45] I shut up now [22:15:46] im glad i undid my jobrunner change so thats not my fault ;] [22:15:51] they aren't happy cause they are runnig 300 processes on each box [22:15:57] thanks though [22:15:59] RobH: what change? heh [22:16:03] freakin' over acheivers [22:16:05] i allocated more boxen to jobrunners [22:16:17] but then notpeter told me about how too many jobrunners causes them to lock one another [22:16:26] so meh [22:16:36] (allocated them to general apache instead) [22:16:38] RobH: totally, there could be even more jobrunners tripping over each other right now ;) [22:16:50] also, our apapches were insanely underprovisioned... [22:16:51] well, if i left them, and if lvs was setup, indeed, heh [22:17:18] going to add 12 more to general apache pool today [22:17:27] then an additional 30 or so this week when lvs is fixed. 
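Note on the `$$` versus `$BASHPID` question above: a minimal sketch of how the two differ in bash. It only illustrates the distinction; the log later reports that switching did not help, and the actual `ps` command used in the script is not shown, so nothing here is taken from it.

```shell
#!/usr/bin/env bash
# At the top level of a script, $$ and $BASHPID are the same process ID.
top_dollar=$$
top_bashpid=$BASHPID

# Inside a subshell (here a command substitution), $$ still reports the
# top-level shell, while $BASHPID reports the subshell's own PID.
sub=$(echo "$$ $BASHPID")
sub_dollar=${sub% *}     # text before the space
sub_bashpid=${sub#* }    # text after the space

echo "top level: \$\$=$top_dollar BASHPID=$top_bashpid"
echo "subshell:  \$\$=$sub_dollar BASHPID=$sub_bashpid"
```

So a child count keyed on `$$` refers to the top-level script even when it is computed inside `$( )`, whereas one keyed on `$BASHPID` would refer to the short-lived subshell.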
[22:17:30] PROBLEM - SSH on mw1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:17:30] PROBLEM - SSH on mw1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:17:51] apergos: no that won't help [22:18:04] bah I triesd set -x but it didn't give me useful output, too much flooding [22:18:04] that doesn't seem to do what I want locally [22:18:17] oh, well, I guess it's revert time for now [22:18:24] PROBLEM - SSH on mw1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:18:40] yep [22:18:59] huh... [22:19:09] so we have mw1189-mw1193 in the ganglia api apache group [22:19:09] RECOVERY - SSH on mw1013 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:19:13] trying to clean up my mess on 1010 first [22:19:13] yet they are not in pybal at all. [22:19:17] * RobH goes to check node lists [22:19:27] PROBLEM - SSH on mw1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:19:38] urrrrgh [22:19:46] we have apaches sitting idle not pooled in any node or pybal list [22:19:48] whyyyyyyy [22:20:30] PROBLEM - SSH on mw1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:20:38] well that was painful (shot them finallly) [22:20:46] apergos: I hate how git revert interacts with git-review [22:20:57] RECOVERY - SSH on mw1008 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:20:58] uh oh [22:21:06] RECOVERY - SSH on mw1012 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:21:10] New patchset: Aaron Schulz; "Revert "Fixed math expression syntax."" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48959 [22:21:24] New patchset: Aaron Schulz; "Revert "Call functions properly."" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48960 [22:21:30] New patchset: Aaron Schulz; "Revert "Modified jobs-loop script to keep a fuller pipeline."" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48961 [22:21:35] apergos: buttons work faster ;) 
[22:21:40] oh golly :-D [22:21:50] notpeter: is there any reason you know of why we have mw1189-mw1193 allocated to api in site.pp and installation/ganglia, but not in dsh lists or pybal config? [22:22:03] no clue [22:22:05] I didn't set those up [22:22:07] they seem to be fine to me, puppet runs on them all, going to add them into service unless i have a reason not to [22:22:07] apergos: I guess you merge those [22:22:16] yeah sec [22:22:20] sorry for keeping you up a little more than "a few minutes" [22:22:24] :p [22:22:24] New review: ArielGlenn; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48959 [22:22:34] notpeter: cool, if you didnt know and I didnt know, im gonna assume no one does about these and just use them. [22:22:35] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48959 [22:22:36] we can mess with this later [22:22:43] RobH: go for it! [22:22:51] we are underutilized on apaches, and we have some sitting idle =P [22:22:54] RECOVERY - SSH on mw1014 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:22:54] PROBLEM - SSH on mw1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:23:08] s/underutilized/underallocated [22:23:10] underutilized? [22:23:11] ah [22:23:12] two different things [22:23:19] bleh [22:23:19] New review: ArielGlenn; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48960 [22:23:24] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48960 [22:23:37] ugh [22:23:39] RECOVERY - SSH on mw1005 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:23:46] last change not liked by jenkins [22:23:53] https://gerrit.wikimedia.org/r/#/c/48961/ [22:23:55] that one [22:24:10] oh wait, those are the new ones i added... im a dumbass. [22:24:16] i cannot add them in until lvs is setup, damn it. [22:24:26] AaronSchulz: ? 
[22:24:33] apergos: not sure how such a simple revert series can fail [22:24:37] I guess I can do rebase [22:24:42] PROBLEM - SSH on mw1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:24:46] it's just that one, the other two I pushed through [22:25:00] does RT suddenly stop allowing login by email address once you've set yourself a nickname? [22:25:42] * apergos merges the two changes as they are, better no jobrunners than overloaded boxes [22:25:45] apergos: but you'd think undoing everything in reverse order would work [22:25:47] for a few mins anyways [22:25:49] that's my only explanation. but i'm also sleep deprived atm. i did finally manage to get in [22:25:50] yes I would [22:26:06] so rebase gives no conflicts... [22:26:08] so irritating [22:26:11] hahaha figures [22:26:30] PROBLEM - SSH on mw1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:26:41] New patchset: Aaron Schulz; "Revert "Modified jobs-loop script to keep a fuller pipeline."" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48961 [22:27:06] waiting for jenkins... 
[22:27:21] New review: ArielGlenn; "Patch Set 2: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48961 [22:27:30] it liked it, who knows why [22:27:31] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48961 [22:28:06] well this is what I get for trying to check bash syntax at 11pm at night [22:28:09] RECOVERY - SSH on mw1006 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:28:18] PROBLEM - SSH on mw1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:28:36] PROBLEM - SSH on mw1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:29:12] PROBLEM - SSH on mw1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:29:20] apergos: ganglia looks pretty lol atm [22:29:23] heh [22:29:42] well that was an unhappy mw1012 [22:29:48] PROBLEM - SSH on mw1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:30:18] do you have the numbers of the job runners handy? [22:30:32] like is it 1001 through something? [22:30:45] 1001-10016 [22:30:50] ok [22:31:09] PROBLEM - SSH on mw1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:31:09] PROBLEM - SSH on mw1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:31:22] 1010 is saved, the rest not so much [22:33:49] apergos: are they able to even do the puppet run? [22:34:01] very doubtful [22:34:06] can Ihelp in any way [22:34:07] ? [22:34:16] RECOVERY - MySQL disk space on neon is OK: DISK OK [22:34:16] the jobrunners are starting to dos the dbs... 
[22:34:20] ugh [22:34:29] well it would be liek this [22:34:32] we can just powercycle [22:34:33] New patchset: RobH; "storage1 decom, storage2 decom" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48963 [22:34:37] powercycle and immdiately run puppet on it [22:34:38] it'll be fine [22:34:40] yeah [22:34:41] yep [22:34:49] apergos: I'll take 1-8 [22:34:51] you do 9-16 [22:34:52] if you don't immediately run puppet then the job runner will kill it [22:34:55] yeah [22:34:55] okay [22:34:56] PROBLEM - SSH on mw1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:36:37] PROBLEM - SSH on mw1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:36:50] !log powercycling all eqiad jobrunners [22:36:51] Logged the message, notpeter [22:36:53] RECOVERY - SSH on mw1013 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:37:14] 1010 is ok you can leave that one [22:37:18] or I can woops [22:37:23] heh [22:37:43] New review: RobH; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48963 [22:37:51] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48963 [22:38:06] awww, gerrit-wm doesnt output my silly commit messages [22:38:12] ^demon: I miss my commit messages! 
[22:38:28] ;[ [22:38:41] RECOVERY - SSH on mw1008 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:39:24] RECOVERY - SSH on mw1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:39:24] RECOVERY - SSH on mw1015 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:39:51] PROBLEM - NTP on mw1008 is CRITICAL: NTP CRITICAL: Offset unknown [22:40:27] RECOVERY - SSH on mw1004 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:40:27] RECOVERY - SSH on mw1012 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:40:45] PROBLEM - NTP on mw1001 is CRITICAL: NTP CRITICAL: Offset unknown [22:40:45] PROBLEM - NTP on mw1003 is CRITICAL: NTP CRITICAL: Offset unknown [22:41:21] PROBLEM - SSH on mw1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:41:26] New patchset: RobH; "virt1001-1003 changed into pc1001-1003" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48965 [22:41:39] RECOVERY - NTP on mw1008 is OK: NTP OK: Offset 0.001338362694 secs [22:41:39] RECOVERY - SSH on mw1006 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:42:23] New review: RobH; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48965 [22:42:32] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48965 [22:42:33] RECOVERY - NTP on mw1003 is OK: NTP OK: Offset 0.0003992319107 secs [22:43:00] RECOVERY - SSH on mw1005 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:43:00] RECOVERY - SSH on mw1007 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:43:39] apergos: you got anymore ? [22:44:17] I'm still working on them [22:44:26] want me to grab a couple more? [22:44:45] if you guys arent using for loops [22:44:47] you are doing it wrong. 
[22:45:13] or be really lazy and use the node group =] [22:45:13] I'm not, I have to log on to each one the instant it comes up and do the puppet run [22:45:19] otherwise it will just fall over again [22:45:32] bleh [22:45:51] !log mw27 locked up, powercycling [22:45:52] Logged the message, RobH [22:46:00] PROBLEM - SSH on mw1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:46:14] paravoid: so on that list, storage1, virt1001-1003, srv266 are handled (or in RT) [22:46:29] and db61 is known [22:46:36] RECOVERY - SSH on mw1013 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:46:54] PROBLEM - Apache HTTP on mw1098 is CRITICAL: Connection refused [22:48:24] RECOVERY - Host mw27 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [22:49:23] apergos: I'm getting 1015 and 1016 [22:50:07] I'm on 1015 [22:50:11] pok [22:50:11] so just do 1016 [22:50:12] RECOVERY - SSH on mw1009 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:50:13] I got 16 [22:50:41] ok [22:51:15] RECOVERY - SSH on mw1014 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:51:15] RECOVERY - SSH on mw1012 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:52:27] PROBLEM - Apache HTTP on mw27 is CRITICAL: Connection refused [22:52:36] PROBLEM - NTP on mw1014 is CRITICAL: NTP CRITICAL: Offset unknown [22:54:19] ok, these are looking sane again [22:54:24] RECOVERY - NTP on mw1014 is OK: NTP OK: Offset 0.00131046772 secs [22:54:51] it looks like there are still some crazylocks on the dbs [22:54:54] but they're clearing up [22:55:56] good whew [22:57:11] apergos: go to sleep now :) [22:57:33] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.059 second response time [22:57:38] hahaha [22:57:54] well that was basically a disaster [22:58:01] lucky I didn't kil the site [22:58:09] RECOVERY - NTP on mw1001 is OK: NTP OK: Offset 0.00101852417 secs [22:58:20] need to find a nice way to test these cause that was not 
it :-D [22:58:37] livehack on one node ;) [22:58:51] is this not how we test things? [22:58:54] notpeter: where were most the queries? slaves, master, or both? [22:59:09] notpeter: it was until the next puppet run ;) [22:59:17] heheheh [22:59:19] like a moving wall of doom... [22:59:36] probably not the best approach, heh [22:59:55] well, nothing has ever gone wrong with a while loop that forks before... [23:00:03] so, there's gotta be a frist time for everything ;) [23:00:14] it was only 300 little tiny jobs :pD [23:00:16] per host [23:01:00] notpeter: db1013 isnt shown in use anyplace that i can see [23:01:12] but its got nagios down errors and the like [23:01:28] Did you have someplace you want it, or can i add to decomissioning.pp for a few days so its no longer monitored? [23:01:39] RobH: I habeeb jeff was using that until recently [23:01:39] check rt [23:01:53] i have no open tickets with db1013 in them [23:01:57] not peter, thanks for your help, otherwise I would still be powercycling boxes [23:02:13] ahjj [23:02:14] https://rt.wikimedia.org/Ticket/Display.html?id=4272 [23:02:25] wiped, reimaged, and powered off until needed... [23:02:29] that breaks puppet checking ;] [23:02:31] powering on. [23:03:33] RECOVERY - Puppet freshness on ocg3 is OK: puppet ran at Wed Feb 13 23:03:21 UTC 2013 [23:04:54] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 205 seconds [23:05:01] notpeter: dbs looked OKish in ganglia [23:05:12] notpeter: was it slave lag? [23:05:21] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 216 seconds [23:06:44] AaronSchulz: yep, they're back to normal [23:06:52] good to know [23:07:46] RECOVERY - Apache HTTP on mw1098 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.119 second response time [23:08:02] AaronSchulz: no, max connections... [23:08:12] for wikiadmin? 
[23:08:24] oohhh [23:08:30] that makes complete sense [23:08:35] that was asher's hack to avoid replag due to jobqueue [23:08:40] 300 conns per host... [23:08:42] I guess it may have saved the sight then [23:08:45] *site [23:08:46] that would about get it [23:08:48] Error connecting to 10.64.16.23: Too many connections [23:09:34] wait, i did something to save the site? [23:09:44] yes, ain't it grand? :-) [23:10:18] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [23:10:35] ok I am so outa here [23:10:46] apergos: night [23:10:51] cya apergos [23:10:54] sorry that didn't go well AaronSchulz, we can look at it again tomorrow I guess [23:10:59] 'night! [23:11:49] !log db1013 online and puppet updated, needs to be pushed into some db cluster service [23:11:50] Logged the message, RobH [23:15:12] RobH: https://rt.wikimedia.org/Ticket/Display.html?id=4528 [23:15:29] ? [23:15:31] Am I doing this to db27 or db29? [23:15:41] ahh, i cut and pased it [23:15:47] both =] but on that ticket 27 [23:15:49] my bad [23:15:55] thx dude [23:17:04] !log rebooting db27 per https://rt.wikimedia.org/Ticket/Display.html?id=4528 [23:17:05] Logged the message, Master [23:17:56] !log rebooting db29 per https://rt.wikimedia.org/Ticket/Display.html?id=4528 [23:17:57] Logged the message, Master [23:19:00] PROBLEM - Host db29 is DOWN: PING CRITICAL - Packet loss = 100% [23:20:39] PROBLEM - Host db27 is DOWN: PING CRITICAL - Packet loss = 100% [23:21:38] New patchset: Ryan Lane; "Disabling password authentication in production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48972 [23:22:36] !log db27 and db29 will be up and down for reinstallations [23:22:38] Logged the message, RobH [23:22:49] notpeter: steve fixed them, i'll get one spun up for you [23:23:23] RobH: woo! [23:23:33] sbernardin: thanks for taking care of that so promptly! 
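Note on the "while loop that forks" above: a minimal sketch of one way to cap concurrent children in bash, assuming the goal was a fixed number of workers per host. This is not the reverted jobs-loop.sh change, and the log never pins down why the count reached roughly 300. `run_task`, `max_procs`, and `total_tasks` are hypothetical.

```shell
#!/usr/bin/env bash
max_procs=4       # hypothetical cap; the log mentions procs=12
total_tasks=10    # hypothetical amount of work
peak=0
done_file=$(mktemp)

run_task() {      # hypothetical stand-in for one job runner
    sleep 0.1
    echo "$1" >> "$done_file"
}

for i in $(seq 1 "$total_tasks"); do
    # Wait while the number of running background jobs is at the cap.
    while [ "$(jobs -rp | wc -l)" -ge "$max_procs" ]; do
        sleep 0.05
    done
    run_task "$i" &
    # Record the highest number of simultaneously running jobs seen.
    running=$(jobs -rp | wc -l)
    if [ "$running" -gt "$peak" ]; then
        peak=$running
    fi
done
wait   # let the remaining children finish

completed=$(wc -l < "$done_file")
rm -f "$done_file"
echo "completed=$completed peak=$peak"
```

Recording the peak makes the cap checkable on a single test box before a wider rollout. That matters here because each worker holds a database connection, which is how 300 per host ran into "Too many connections".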
[23:23:38] New review: Faidon; "Patch Set 1: Code-Review+1" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/48972 [23:23:48] RECOVERY - NTP on ocg3 is OK: NTP OK: Offset -0.008704662323 secs [23:23:57] RECOVERY - Host db29 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [23:25:36] can anyone check how the job queue is for enwiki? [23:26:31] New patchset: RobH; "virt1004 isnt presently in service, yanking from nagios checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48973 [23:26:34] not too bad aude: https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics [23:27:19] nah, i see my edits :D [23:27:25] didn't take too long [23:27:40] (diff | hist) . . Ucieszków (Q4540140); 18:21 . . Aude (talk | contribs) (Language link added: pl:Ucieszków) [23:27:46] New review: RobH; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48973 [23:27:56] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48973 [23:28:49] i see stuff in recent changes, although the volume of edits might not be so much [23:28:52] aude: < 1000 [23:29:00] since bots added most stuff already [23:29:00] Reedy: thanks [23:29:31] it might be necessary to choose to see more than the last 50 edits, though [23:29:44] since it might take a minute for the jobs to process and appear in RC [23:31:14] New review: Ryan Lane; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48972 [23:31:23] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48972 [23:35:11] AaronSchulz: do you use copper/zinc/magnesium for swift testing or can I give them back to RobH? 
[23:36:11] I haven't used it in a while, since I've mostly been testing locally with ceph [23:36:24] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [23:36:30] paravoid: I guess if there is a real use for them then they should be repurposed [23:36:50] plus they are sitting in nagios errors presently [23:36:51] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [23:37:03] So if cool with you guys, I am going to reclaim them. [23:37:08] ok [23:37:21] cool, thx [23:37:26] if you need more test nodes, just let me know [23:37:44] (just cuz im taking these back doesnt mean i wont give you new ones if hte need arises ;) [23:40:18] RECOVERY - MySQL disk space on neon is OK: DISK OK [23:41:19] RobH: https://www.youtube.com/watch?v=eNiR5ZTb_MA [23:42:26] mineeeee [23:42:32] all the severs are mine! [23:42:34] ahem. [23:42:54] * Reedy grins [23:42:57] RobH: whatchu gonna do with all them servers? [23:43:12] push them into a pile and scrooge mcduck through them [23:43:20] Same as he always done with the servers [23:43:24] Try to take over the world! [23:43:31] RobH: that sounds so insanely painful.... [23:43:40] notpeter: i think i hate you for asking me to work on db27/29 [23:48:24] PROBLEM - Frontend Squid HTTP on sq48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:48:25] !log reinstalling sq48 [23:48:26] Logged the message, Master [23:49:41] ok, perhaps rather than horde them i'll just dispatch them out in a fair manner based on RT requests... [23:49:43] business as usual. [23:50:08] !log db29 reinstalling [23:50:09] Logged the message, RobH [23:50:12] PROBLEM - Host db29 is DOWN: PING CRITICAL - Packet loss = 100% [23:56:03] RECOVERY - Host db29 is UP: PING OK - Packet loss = 0%, RTA = 2.36 ms [23:56:39] PROBLEM - SSH on sq48 is CRITICAL: Connection refused