[01:15:19] nagios-plugins drama: https://bugzilla.redhat.com/show_bug.cgi?id=1054340 [01:17:50] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 06:59:52 PM UTC [01:36:25] (03PS1) 10Springle: repool db1039, warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/108478 [01:38:47] (03CR) 10Springle: [C: 032] repool db1039, warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/108478 (owner: 10Springle) [01:39:55] !log springle synchronized wmf-config/db-eqiad.php 'repool db1039, warm up' [01:40:04] Logged the message, Master [02:10:06] (03PS1) 10Springle: repool db1022 & db1039 full steam [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/108481 [02:10:30] (03CR) 10Springle: [C: 032] repool db1022 & db1039 full steam [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/108481 (owner: 10Springle) [02:10:37] (03Merged) 10jenkins-bot: repool db1022 & db1039 full steam [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/108481 (owner: 10Springle) [02:12:17] !log springle synchronized wmf-config/db-eqiad.php 'repool db1022, db1039 full steam' [02:12:26] Logged the message, Master [02:12:45] !log LocalisationUpdate completed (1.23wmf10) at 2014-01-20 02:12:45+00:00 [02:12:52] Logged the message, Master [02:16:47] (03CR) 10MZMcBride: "Does this file need a(n explicit) license?" (031 comment) [operations/software] - 10https://gerrit.wikimedia.org/r/108467 (owner: 10Tim Landscheidt) [02:22:29] !log LocalisationUpdate completed (1.23wmf11) at 2014-01-20 02:22:28+00:00 [02:22:35] Logged the message, Master [02:39:42] (03CR) 10Tim Landscheidt: "I'm not sure. Much of the Puppet/software/configuration stuff has no explicit licence, so I didn't want to step out of line. I'm fine wi" [operations/software] - 10https://gerrit.wikimedia.org/r/108467 (owner: 10Tim Landscheidt) [02:40:25] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-01-20 02:40:25+00:00 [02:40:32] Logged the message, Master [02:50:29] (03PS1) 10Ori.livneh: Revoke anomie's key per his request [operations/puppet] - 10https://gerrit.wikimedia.org/r/108483 [02:50:50] (03PS2) 10Ori.livneh: Revoke anomie's key per his request [operations/puppet] - 10https://gerrit.wikimedia.org/r/108483 [02:50:57] (03CR) 10Ori.livneh: [C: 032 V: 032] Revoke anomie's key per his request [operations/puppet] - 10https://gerrit.wikimedia.org/r/108483 (owner: 10Ori.livneh) [03:08:12] o.O [04:18:50] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 06:59:52 PM UTC [04:51:46] (03PS1) 10Ori.livneh: Gzip SVGs on front & back upload varnishes [operations/puppet] - 10https://gerrit.wikimedia.org/r/108484 [04:51:56] (03PS2) 10Ori.livneh: Gzip SVGs on front & back upload varnishes [operations/puppet] - 10https://gerrit.wikimedia.org/r/108484 [05:34:00] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:37:00] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [06:55:38] (03PS2) 10Andrew Bogott: gridengine: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107035 (owner: 10Matanya) [06:57:37] (03CR) 10Andrew Bogott: [C: 032] gridengine: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107035 (owner: 10Matanya) [07:04:49] (03PS2) 10Andrew Bogott: deployment: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107343 (owner: 10Matanya) [07:05:05] (03PS2) 10Andrew Bogott: dynamicproxy: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107339 (owner: 10Matanya) [07:17:37] (03CR) 10Andrew Bogott: [C: 032] dynamicproxy: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107339 (owner: 10Matanya) [07:19:50] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 06:59:52 PM UTC [07:27:08] (03CR) 10Andrew Bogott: [C: 032] deployment: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107343 (owner: 10Matanya) [07:40:48] (03PS1) 10Matanya: coredb_mysql: puppet 3 compatibility fix: fully qualify variables [operations/puppet] - 10https://gerrit.wikimedia.org/r/108488 [07:41:10] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [07:42:21] andrewbogott: Leslie Carr is leaving??? [07:42:40] Yeah, I think Friday was her last day :( [07:42:45] Yes. [07:43:35] :/ [07:43:49] to where? [07:45:36] I… can't remember. Cumulus, maybe? She's taking some time off first, though. [07:45:44] (03PS4) 10Andrew Bogott: base: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107355 (owner: 10Matanya) [07:46:05] so all SF based ops are leaving? :) [07:46:58] Not quite all… but yeah, I visited the office on last week and it was very quiet. [07:47:10] Daniel and Rob and Gage are in SF. And Ken is about to move there. [07:47:42] I feel like there's another who I'm forgetting :) Ori is semi-ops I guess. [07:47:53] Which Rob? [07:48:09] Halsell [07:48:17] He moved to SF? [07:48:20] He was in DC until this year, recently moved. [07:48:20] yeah [07:48:26] Oh, right. [07:48:28] I'm in D.C. [07:48:29] Seems /much/ happier in CA :) [07:48:32] Heh. [07:48:50] I'm visiting SF this week. [07:49:13] There are still lots of folks @ the office, just not so many ops. [07:49:45] http://orgcharts.wmflabs.org/ [07:49:48] is down [07:50:25] hm, is that the one linked to from the staff page? [07:50:52] yes [07:51:34] Beta indeed. [07:51:48] Gloria: not meaning to knock DC, I just think it's no fun to be constantly plagued with data-center requests from people you never see. I'm not sure how Chris manages it. [07:52:08] Well, yeah. :-) [07:52:57] DC work is the worst ops responisbility [07:53:18] Looks like orgcharts is marktraceur's baby. [07:53:22] Most likely out today. [07:53:47] Yes and yes. [07:57:17] orgcharts says "(111)Connection refused: proxy: HTTP: attempt to connect to 127.0.0.1:8888 (localhost) failed" [07:57:34] could be anything. [08:00:26] (03CR) 10Andrew Bogott: [C: 032] base: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107355 (owner: 10Matanya) [08:00:53] ^ scary one! [08:01:57] yes, agreed [08:04:33] clean runs, though, seems safe. [08:04:39] * andrewbogott keeps fingers crossed [08:05:00] andrewbogott: can i call $::site from an erb? [08:06:14] I think for globals you have to do that ugly _lookup thing. I think I would copy into a local var first and refer to that in .erb. 
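A quick way to settle the $::site-in-ERB question being discussed here is a throwaway puppet apply run on any box with puppet installed. The sketch below is hypothetical: it uses the $::hostname fact as a stand-in for $::site and made-up /tmp paths, and it shows both options mentioned above, scope.lookupvar for an out-of-scope variable and a local copy of the global that the template can read as a plain @-variable.

mkdir -p /tmp/erb-test
cat > /tmp/erb-test/demo.erb <<'ERB'
via lookupvar:  <%= scope.lookupvar('::hostname') %>
via local copy: <%= @local_copy %>
ERB
cat > /tmp/erb-test/demo.pp <<'PP'
# Copy the global into a local variable so the template can use @local_copy.
$local_copy = $::hostname
notify { 'erb-test':
  message => template('/tmp/erb-test/demo.erb'),
}
PP
puppet apply /tmp/erb-test/demo.pp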
[08:06:31] although, maybe that's silly… not sure. [08:06:53] not sure $site can be a local var easily :/ [08:07:27] Can't just do $local_site=$::site? [08:07:43] and that is not ugly? :P [08:07:55] Oh it's definitely ugly! [08:08:10] Probably it's fine to use it directly, I just don't know if it needs special syntax... [08:08:18] Maybe it'll totally work, I should've just led with 'I don't know' [08:09:06] i'm doing puppet 3 compatibility fixes, so this is the source of my question [08:09:16] i guess testing is the best answer [08:09:29] yep [08:41:10] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [10:01:11] (03PS12) 10Matanya: Move mail manifests to a module called 'exim' [operations/puppet] - 10https://gerrit.wikimedia.org/r/68584 (owner: 10Andrew Bogott) [10:04:59] (03CR) 10Matanya: "patchset 11 was rebased. a heck of a work." [operations/puppet] - 10https://gerrit.wikimedia.org/r/68584 (owner: 10Andrew Bogott) [10:09:00] (03Abandoned) 10Matanya: gitblit: service has no status [operations/puppet] - 10https://gerrit.wikimedia.org/r/108321 (owner: 10Matanya) [10:20:50] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 06:59:52 PM UTC [10:24:50] PROBLEM - swift-container-auditor on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:24:50] PROBLEM - swift-object-updater on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:24:50] PROBLEM - swift-account-reaper on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:24:50] PROBLEM - swift-container-server on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:24:50] PROBLEM - swift-object-replicator on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:24:50] PROBLEM - swift-account-replicator on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:24:50] PROBLEM - swift-object-server on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:25:00] PROBLEM - swift-account-auditor on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:25:00] PROBLEM - DPKG on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:25:00] PROBLEM - swift-account-server on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:25:10] PROBLEM - swift-container-replicator on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:25:20] PROBLEM - swift-container-updater on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:25:30] PROBLEM - RAID on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:25:30] PROBLEM - swift-object-auditor on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:25:40] PROBLEM - SSH on ms-be1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:25:40] PROBLEM - puppet disabled on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:30:51] bye bye ms-be1003 [10:35:00] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:36:50] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [10:56:49] huh [10:57:09] just in time akosiaris [10:57:10] bug: soft lockups... I can't regain control... powercycling [10:57:39] don't forget to log :) [10:58:50] !log powercycling ms-be1003. 
Console full of messages BUG: soft lockup - CPU#stuck for #s [10:58:57] Logged the message, Master [10:59:38] akosiaris: thanks [11:00:00] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [11:00:26] now a cron question please. when does torrus-discovery run on netmon1001? [11:00:30] !log restarting Gitblit on antimony to test upstart script [11:00:37] Logged the message, Master [11:00:59] or on streber [11:01:00] RECOVERY - swift-container-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [11:01:10] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [11:01:10] RECOVERY - swift-container-updater on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [11:01:20] RECOVERY - swift-object-auditor on ms-be1003 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [11:01:20] RECOVERY - RAID on ms-be1003 is OK: OK: optimal, 14 logical, 14 physical [11:01:30] RECOVERY - SSH on ms-be1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [11:01:30] RECOVERY - puppet disabled on ms-be1003 is OK: OK [11:01:40] RECOVERY - swift-container-server on ms-be1003 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [11:01:40] RECOVERY - swift-container-auditor on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:01:40] RECOVERY - swift-object-server on ms-be1003 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [11:01:40] RECOVERY - swift-account-reaper on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [11:01:40] RECOVERY - swift-object-updater on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [11:01:40] RECOVERY - swift-object-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [11:01:41] RECOVERY - swift-account-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [11:01:50] RECOVERY - swift-account-auditor on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [11:01:50] RECOVERY - DPKG on ms-be1003 is OK: All packages OK [11:01:50] RECOVERY - swift-account-server on ms-be1003 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [11:03:00] PROBLEM - gitblit process on antimony is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar gitblit.jar [11:03:01] matanya: never ? [11:03:10] not good [11:03:20] is torrus broken? [11:03:40] no [11:03:59] torrus is neither of streber nor netmon1001 for what is see btw [11:04:05] neither on* [11:04:10] from what* [11:04:13] damn... [11:05:00] RECOVERY - gitblit process on antimony is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar gitblit.jar [11:05:14] gitblit is me [11:06:17] so it is down [11:06:36] i'm just testing something [11:06:43] i'll be done in a moment; i !logged it [11:07:01] akosiaris: sorry on manutius [11:09:21] matanya: daily [11:09:43] thanks, so i'll convert the exec to a cron job [11:13:10] !log ms-be1003 kernel log show first lockups around Jan 19 06:30 UTC being XFS related. 
[11:13:15] Logged the message, Master [11:17:29] (03PS1) 10Ori.livneh: Add upstart job definition file for Gitblit [operations/puppet] - 10https://gerrit.wikimedia.org/r/108492 [11:18:33] matanya: ^ [11:18:41] good night [11:20:19] see ori [11:20:28] 3:20 AM :D [11:20:37] my thoughts exactly :) [11:20:38] thanks ori [11:20:51] sleep well ori [11:20:55] (03CR) 10Matanya: [C: 031] Add upstart job definition file for Gitblit [operations/puppet] - 10https://gerrit.wikimedia.org/r/108492 (owner: 10Ori.livneh) [11:21:13] thanks! (to both.) bye [11:54:44] akosiaris: Know anything about opendj? [11:54:59] https://rt.wikimedia.org/Ticket/Display.html?id=6676 [12:04:43] (03PS1) 10Matanya: torrus: move into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/108498 [12:05:36] andrewbogott: not much [12:05:39] andrewbogott: not related to https://gerrit.wikimedia.org/r/#/c/102629/ ? [12:06:10] andrewbogott: want help to investigate that issue with replication ? [12:06:13] matanya: Don't think so, they've been out of sync for a couple of days at least. [12:06:28] akosiaris: Yes please… I'm staring at log files but nothing obvious is jumping out. [12:06:43] It seems vaguely possible that they got out of sync when the pmtpa cable was cut [12:08:00] andrewbogott: ok gimme a sec to wrap up an email. [12:12:10] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [12:13:55] so andrewbogott, virt1000 is not replicating from virt0... let's see [12:15:15] I've restarted opendj on both servers and it seems fairly happy. [12:15:24] it synced ? [12:15:33] Ah, no. [12:15:38] I mean -- the log files seem happy. [12:15:44] But that account is still present on virt0 and not on virt1000 [12:22:17] andrewbogott: since 3 Jan from what i see.. [12:22:35] at least that is when the logs start complaining [12:23:24] that's relatively close to when the certs were changed [12:23:36] But do you agree that they're not complaining now? [12:23:58] are you looking at replication log or error log? [12:24:17] replication log [12:24:38] yes they are not complaining now [12:25:12] in fact, not since 16 Jan [12:25:33] Yeah, I think that's when I restarted opendj on virt1000 [12:26:34] Hm, apparently I didn't log it :( [12:29:19] so, the easy way forward is probably to reinitialize the slave from the master [12:30:48] Does replication only happen as the transactions occur? It won't resync itself? [12:30:59] normally yes [12:31:24] Ah, ok. Then I was just misunderstanding; I thought that if both servers were communicating they would gradually sync. [12:31:45] So is there a master/slave relationship or are they peers? Is one db guaranteed to be a subset of the other? [12:32:03] I think this software supports multimaster [12:32:15] i don't know if it is configured that way though [12:33:10] https://wikitech.wikimedia.org/wiki/LDAP#Initialize_replica [12:33:17] this could prove useful [12:33:42] not yet clear to me where to run it though and with what args... [12:34:07] Yeah -- I'm uneasy because of the 'guaranteed to be a subset' question... [12:34:50] a replication is done on a whole tree always [12:35:02] so dc=wikimedia,dc=org must be the same on both [12:36:16] I don't understand. [12:36:20] Or my question didn't make sense... [12:37:04] Isn't it possible that uid=foo,ou=people,dc=wikimedia,dc=org is on virt0 and not virt1000 [12:37:13] only if they are unsynced [12:37:15] but uid=bar,ou=people,dc=wikimedia,dc=org on virt1000 but not virt0? 
[12:37:16] not by design [12:37:21] Ah, right. [12:37:22] aaah [12:37:28] OK. But clearly they /are/ unsynced atm [12:37:48] so you mean that they both have changes ? [12:37:57] since the time the got unsynced... [12:38:01] split brain [12:38:06] that would be bad.... [12:38:41] i sure hope it hasn't happened [12:38:52] I don't know if it did or didn't :( [12:39:20] ok i will try to change something in my account on virt1000 [12:39:29] Ah, good idea. [12:39:31] hopefully it will answer that it is a readonly replica [12:44:17] crap [12:44:28] so... local changes on virt1000 are possible [12:45:02] akosiaris andrewbogott it would be good to check the head of the tree and see if it is in sync [12:45:19] :( [12:45:23] then you can see if the entire tree is screwed up or just part [12:49:47] huh [12:49:54] so virt1000 => virt0 changes [12:49:59] are propagating normally [12:50:04] no probs there [12:51:38] ok, I'm trying to think this through... [12:51:51] It's not obvious to me what uses virt1000 and what uses virt0. [12:52:08] For instance I looked at ldap.conf on gerrit.wikimedia.org and it didn't specify an ldap server at all. [12:52:35] It seems likely that tampa stuff (= labs) writes to virt0 and eqiad stuff (= everything else) to virt1000 [12:52:42] But I don't know that that's right. [12:52:47] huh [12:52:55] Suppose there's a way to just git a big diff, and then resolve by hand? [12:52:56] also virt0=> virt1000 works normally [12:53:10] i just added an attribute to my account on one server [12:53:16] it showd up on the other [12:53:29] deleted it from the second one and it got deleted on the first one two [12:53:35] but not the other way around? [12:53:46] so both directions work fine now [12:53:58] but something got lost at some point... [12:54:02] so you say we have a gap [12:54:26] what if you regenrate his config by hand on one side? [12:54:32] matanya: I know of at least one user account that's on virt0 but not virt1000 [12:54:48] yes ilmerovingio [12:54:50] matanya: right, if we know the total diff we can do that... [12:54:55] he approched me at first [12:55:09] hello [12:55:28] I'd try to fix him only as a start to see if we can sort out the root cause [12:55:44] I think the root cause is already resolved. [12:55:52] even better [12:56:17] diffing between the servers will be costy? [12:56:31] yes [12:56:55] too bad [12:57:09] Well, here, let me just do a dump of all accounts and compare just to see order of magnitude... [12:57:23] (though of course more than just accounts are involved probably) [12:57:24] can we force clean replection? [12:57:43] a quick dump speaks about 400k of data [12:57:46] in ldif format [12:57:54] let's see now... [12:58:53] are the user credentials stored in ldap? [12:59:17] yes [13:00:07] problem is we can't just figure out the diff and load those on virt1000. Afterwards virt0 will try to get the changes and it will fail [13:00:36] about 79 accounts differ [13:00:41] i got the point [13:01:25] akosiaris: can you put the server in lockdown mode until you fix? [13:01:28] akosiaris, does that failure matter though? [13:01:33] what about stopping the replication on the server that has problem, reimport the ldap database and restart the replication? [13:01:50] andrewbogott: yes because it will no longer get the changes from virt1000 [13:02:04] I am back at thinking reinitialize the one of two from the other [13:02:06] Wait, once a single rep fails it will stop forever after? 
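A rough sketch of the two steps under discussion: diffing the account lists on the two servers, then re-seeding the stale replica from the good one. Hostnames, ports, the install path and the credentials are placeholders, it assumes the uids are readable with a simple bind, and the dsreplication options are quoted from memory of the OpenDJ 2.x documentation, so verify them with dsreplication initialize --help before running anything.

# 1. Dump just the uids from each server and diff the sorted lists.
for host in virt0.wikimedia.org virt1000.wikimedia.org; do
  ldapsearch -x -LLL -H "ldap://$host" \
    -b 'ou=people,dc=wikimedia,dc=org' '(objectClass=*)' uid \
    | awk '/^uid: /{print $2}' | sort > "/tmp/uids.$host"
done
diff -u /tmp/uids.virt0.wikimedia.org /tmp/uids.virt1000.wikimedia.org

# 2. Re-seed virt1000 from virt0 over OpenDJ's admin port (placeholder path and
#    credentials; -X trusts the certificate, -n skips the interactive prompt).
/opt/opendj/bin/dsreplication initialize \
  --baseDN 'dc=wikimedia,dc=org' \
  --adminUID admin --adminPassword 'CHANGEME' \
  --hostSource virt0.wikimedia.org --portSource 4444 \
  --hostDestination virt1000.wikimedia.org --portDestination 4444 \
  -X -n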
[13:02:47] virt1000 has more user accounts than virt0 [13:02:50] That surprises me... [13:02:52] it used to in the previous generations (Sun DS, Fedora DS etc, all those are the same software more or less) [13:03:03] well, wait, let me verify that [13:03:36] i think you should use the fact it is a day off in USA and take a lockdown for short time to fix [13:03:40] wrong, virt0 has more [13:03:51] for some reason the ldif is bigger on virt1000, but more accounts on virt0 :( [13:04:06] maybe you passed -LLL on ldapsearch ? [13:04:10] comments etc... [13:04:14] i know i did [13:04:17] so diffing again [13:04:47] akosiaris: 77 of the 79 accounts that differ are on virt0 and not virt1000. So, doing virt0=>virt1000 seems pretty sound to me. [13:04:58] titleblacklist.wmflabs.org [13:05:04] this seems to be only on virt1000 [13:05:16] dc=208.80.153.160 [13:05:20] so it is not just accounts [13:05:24] also labs machines ? [13:05:36] Yep, that should also be on virt0 and not virt1000 [13:05:40] since labs is in tampa [13:06:25] http://opendj.forgerock.org/opendj-server/doc/admin-guide/index/chap-troubleshooting.html#troubleshoot-repl [13:06:36] this helped me in the past ^ [13:07:54] if the default purge delay is three days... [13:08:04] then we don't have historical info any more :( [13:08:31] too bad [13:08:49] i am getting pretty confident this is what happened [13:09:10] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [13:10:23] ok so [13:10:33] I say we reinitialized virt1000 from virt0 [13:10:59] and then readd the few missing entries [13:11:11] that are present on virt1000 but not virt0 [13:11:32] And readding by hand won't break replication forever? [13:12:02] you mean after virt0 basically overwrites all data present on virt1000 ? [13:12:11] no [13:12:18] Earlier you said 'problem is we can't just figure out the diff and load those on virt1000. Afterwards virt0 will try to get the changes and it will fail' [13:12:43] Now you are saying 'and then readd the few missing entries' [13:12:49] Aren't those the same thing, just different scale? [13:13:08] yes but the key word here is "after reinitialization" [13:13:30] cause what will happen is that virt1000 will be a mirror of virt0 [13:13:58] the problem earlier was that [13:14:16] if we added record B on virt1000 and it already was present on virt0 [13:14:20] that would break [13:14:36] cause hey... you tried to add an already existing record [13:14:54] but if we don't do that, but add nonexistent records, that is fine [13:15:09] Ooooh, ok now I see what you are saying :) [13:15:45] So next you're going to say "I know how to do all those things, you should just stand by and watch" [13:15:58] heh... [13:16:10] I have done them in the past indeed [13:16:13] just not on this software [13:16:20] more like it's grandpa [13:16:32] hopefully not much has changed :-) [13:18:02] all the ds stuff is very similar, even active directory, tfoo [13:21:50] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 06:59:52 PM UTC [13:23:49] ^ this is because puppet is disabled on hooft. [13:25:06] why though ? [13:25:19] i don't see something in SAL [13:25:34] I don't know what/whose hooft is [13:33:56] andrewbogott: can you please guide me how to test etherpad on labs? I wasn't able to understnad how i get to the site from labs [13:34:28] matanya, you mean, test a new etherpad install? 
[13:34:31] yes [13:35:10] Hm… hopefully you can just set up a proxy pointing to the instance and it'll just work. [13:35:40] an apache proxy? [13:35:50] Since I think that etherpad just uses http [13:35:51] https://wikitech.wikimedia.org/wiki/Special:NovaProxy [13:36:04] You can just assign your etherpad instance a public address there. [13:36:08] And then browse! [13:36:19] (Yuvi wrote most of that, it's new :) ) [13:37:23] matanya: Am I answering your question, sort of? [13:38:11] I hope, i'll need to check this out. I didn't know I have the right of using public IP [13:39:07] You don't need one, that's what the proxy is for. [13:39:26] I see [13:39:32] The proxy box has its own public IP. That GUI assigns a new DNS name to the proxy box, and automatically routes traffic using that DNS name to your labs machine. [13:39:48] I'll test it at night [13:40:07] my night ... [13:40:12] It's very simple, you just type the name you want, select the instance, submit, wait a few minutes. [13:40:17] And then clean up afterwards, ideally :) [13:41:02] andrewbogott: out of cirousity, what else is on your endless list? [13:41:25] Um… labs to eqiad (and many subtasks thereof) [13:41:37] Refactoring all of puppet to use modules [13:41:47] …some other stuff :) [13:42:00] PROBLEM - LDAPS on virt0 is CRITICAL: Connection refused [13:42:25] Mostly I'm just fixated on the labs DC migration and don't have a lot of brain space for other long-term plans. [13:42:40] PROBLEM - LDAP on virt0 is CRITICAL: Connection refused [13:42:57] BTW, i have rebased you exim module, as it started to get too far from prod. [13:43:02] *your [13:43:36] ok. I think mark sort of hated that module, so it may end up getting redone at some point. [13:43:40] RECOVERY - LDAP on virt0 is OK: TCP OK - 0.038 second response time on port 389 [13:44:00] RECOVERY - LDAPS on virt0 is OK: TCP OK - 0.036 second response time on port 636 [13:44:39] too bad, wasted some nice portion of time on it [13:44:41] !log stopped, backed up virt1000 and virt0 and started them up. Preparing for reinitializing virt1000 from virt0 [13:44:49] Logged the message, Master [13:45:30] PROBLEM - LDAP on virt1000 is CRITICAL: Connection refused [13:45:40] PROBLEM - LDAPS on virt1000 is CRITICAL: Connection refused [13:47:30] RECOVERY - LDAP on virt1000 is OK: TCP OK - 0.002 second response time on port 389 [13:47:40] RECOVERY - LDAPS on virt1000 is OK: TCP OK - 0.000 second response time on port 636 [13:51:32] !log started virt1000 reinitialization from virt0 [13:51:39] Logged the message, Master [13:51:45] !log finished virt1000 reinitalization from virt0 [13:51:47] that was fast... [13:51:51] Logged the message, Master [13:51:52] sooo let's see... [13:52:29] andrewbogott: your user is there [13:52:39] So I see! [13:52:55] I think we are ok... I could not even find something that has really changed on virt1000 and needed to be added manually [13:52:59] ilmerovingio: so… can you log into gerrit now? [13:53:14] * andrewbogott figures this will turn out not to have been the problem :/ [13:53:23] is the network disruption also affecting dns? [13:53:35] heh [13:53:43] oh! akosiaris, fun fact, you have to restart pdns when you tinker with opendj. [13:53:45] Shall I? [13:53:51] ??? [13:53:54] sure [13:53:54] JohannesK_WMDE: Give me a minute… [13:54:00] please tell me how [13:54:20] Oh, just service pdns restart [13:54:22] I did it already. [13:54:35] It's a bug in pdns, it flips out if there's an ldap interruption. [13:54:36] only one server ? [13:54:36] Sometimes. 
[13:54:44] Both virt0 and virt1000 [13:54:49] JohannesK_WMDE: better? [13:54:52] andrewbogott: ah wow! yes that looks better. [13:54:57] ok :-) [13:55:42] andrewbogott: let me check [13:56:05] akosiaris: I'm about to punch out for the night. Thanks a million for sorting this, would've taken me 5x as long on my own. [13:56:31] andrewbogott: YES :) [13:56:38] \o/ [13:56:42] well done [13:57:24] andrewbogott: happy to be of service. good night :-) [13:58:48] akosiaris: Next time I see that failure on virt1000 I will remember that it MUST BE FIXED within three days :) [14:00:14] * andrewbogott glances at icinga before bed [14:00:56] ahahaha [14:01:26] do we have anything to catch that next time ? [14:01:40] already gone... no worries, I 'll ask again tomorrow :-) [14:02:32] hmmm... i still can't connect to apache. it worked a few days ago. i can see apache2 listening on *:80, and i can connect to other services listening on other tcp ports on the same host. [14:02:53] is there a special role i need to enable to make apache work? i only ticked 'apache2' [14:03:48] i mean 'webserver::apache2' [14:04:00] i think not. that should be sufficient [14:04:15] if you see something listening on 80 but not being able to connect to it [14:04:23] then it is firewall [14:04:37] which in this case... sounds weird [14:04:56] akosiaris: might be. but i didn't make any configurations to iptables myself [14:05:03] try iptables -nxvL and see if you have any rules there [14:06:52] akosiaris: nah. looks empty: [14:06:58] jkroll@sylvester:~> sudo iptables -nxvL [14:06:58] Chain INPUT (policy ACCEPT 25259783 packets, 99414245407 bytes) [14:06:58] pkts bytes target prot opt in out source destination [14:06:58] Chain FORWARD (policy ACCEPT 0 packets, 0 bytes) [14:06:58] pkts bytes target prot opt in out source destination [14:06:58] Chain OUTPUT (policy ACCEPT 8764915 packets, 519661372 bytes) [14:06:58] pkts bytes target prot opt in out source destination [14:07:02] sry for spam [14:07:35] any other ideas... anybody? :) [14:18:26] JohannesK_WMDE: can you telnet to 80? [14:20:47] matanya: yes from the same host, no from any other host including tools-login [14:21:57] which is strange, because other ports work [14:23:13] whay are you trying to do? browse to an instance? [14:23:32] yes, i have apache running on my host and want to connect to it [14:24:11] what is "my host" and from where you are trying to connect to it? [14:24:49] the host is the sylvester.wmflabs.org instance [14:25:18] and your are trying from? [14:25:57] i have tried to connect to port 80 from sylvester (works), tools-login (timeout), toolserver (timeout), my machine at home (timeout) [14:26:23] what is the ip of sylvester? [14:26:38] but i can connect to another service on the same host: http://sylvester.wmflabs.org:8090/list-graphs [14:26:56] sylvester.wmflabs.org has address 208.80.153.176 [14:27:29] so it might be a misconfiguration of apache, but "netstat -lpt" shows it listening on *:80, so i think it should work. [14:29:00] what does netstat -an say? [14:29:09] netstat -an |grep 80 [14:30:05] matanya: $ sudo netstat -anp [14:30:08] [...] [14:30:09] tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 22647/apache2 [14:30:34] and anything in /var/log/apache2/error.log [14:30:47] or /var/log/apache2/access.log [14:30:54] or you don't even get there? [14:32:05] a lot, because i have been trying things like "a2ensite default / service apache2 reload" etc. 
but the last lines are: [14:32:12] [Mon Jan 20 13:58:46 2014] [notice] Apache/2.2.22 (Ubuntu) configured -- resuming normal operations [14:32:12] [Mon Jan 20 13:59:12 2014] [notice] caught SIGTERM, shutting down [14:32:12] [Mon Jan 20 13:59:12 2014] [notice] Apache/2.2.22 (Ubuntu) configured -- resuming normal operations [14:32:12] [Mon Jan 20 14:01:29 2014] [notice] caught SIGTERM, shutting down [14:32:12] [Mon Jan 20 14:01:30 2014] [notice] Apache/2.2.22 (Ubuntu) configured -- resuming normal operations [14:32:12] [Mon Jan 20 14:03:59 2014] [notice] SIGUSR1 received. Doing graceful restart [14:32:12] [Mon Jan 20 14:03:59 2014] [notice] Apache/2.2.22 (Ubuntu) configured -- resuming normal operations [14:32:13] [Mon Jan 20 14:19:39 2014] [error] [client 127.0.0.1] client sent HTTP/1.1 request without hostname (see RFC2616 section 14.23): / [14:33:17] i think the last line was me typing "GET / HTTP/1.1" into telnet. [14:34:51] is there any ferm rule there? [14:35:22] ferm rule? [14:36:41] http://ferm.foo-projects.org/ [14:37:42] no, not that i know of. ferm is installed, at least not via apt. [14:38:15] can i try to log in to this machine? [14:38:20] (that is, "dpkg -l | grep ferm" is empty) [14:42:07] i guess i could add you to the project members list temporarily and that should give you access. now the question is, how do i actually do that :D [14:43:01] JohannesK_WMDE: you can do it via wikitech add member [14:43:16] what is the project name? [14:43:58] right, here it is. what's your wikitech username? [14:44:07] the projectname is catgraph [14:44:22] matanya [14:44:45] okay, i've added you. see if you can log in. [14:45:21] i;m in [14:45:58] where did you get the 208.80.153.176 address from? [14:46:14] it's a public ip address [14:46:21] given by admins [14:47:40] i see you only have 10.4.0.216 [14:47:46] that seems to be the reason [14:48:09] did you include the puppet proxy class ? [14:49:20] akosiaris: nslcd[1111]: [14914e] error writing to client: Broken pipe [14:49:48] is this related to the ldap issue? ^ [14:52:17] matanya: the 208.80.153.176 address is simply what i get by e.g. "host sylvester.wmflabs.org" from outside of labs... it's the public ip address, this should be alright. and i can connect to that address, too. just not to apache. [14:52:39] yeah, I see that [14:52:57] no, i didn't add a puppet proxy class. i didn't know i had to do that. [14:54:55] JohannesK_WMDE: i'm not sure you have to [14:55:12] but i see something weird in the logs: [warn] NameVirtualHost *:80 has no VirtualHosts [14:55:46] did you config apache to serve anything? [14:55:46] you think i need to enable protoproxy::proxy_sites? it seems to be related to https? i can't really tell what that role does at a glance... [14:56:03] yes, that's why i thought it might be a misconfiguration [14:56:32] the role allows you access private ip instances using a proxy [14:57:03] I can't ping 208.80.153.176 from anywhere [14:57:24] not even from sylvester [14:57:44] strange, works from here [14:58:29] it is possible you can't reach the external IP from inside labs... but it should work from outside [15:01:07] ok, it seems to be an issue specific to port 80 [15:02:34] ok, i found your issue [15:02:43] nothing is served on *:80 [15:03:01] JohannesK_WMDE: since nothign is served on 80 you timeuot [15:03:33] matanya: are you sure? 
try "curl localhost" on sylvester [15:04:08] 404 [15:04:12] it's possible that no site is enabled, but from localhost, apache does answer the request [15:04:22] from outside, i don't even get an answer [15:04:57] okay, i ran a2ensite default. try "curl localhost" now [15:05:31] yeah, i see what you mean [15:05:47] ... working on localhost now, but not from the outside [15:05:55] yes [15:06:05] (03CR) 10Alexandros Kosiaris: [C: 032] Add Niharika Kohli's blog to the English Planet [operations/puppet] - 10https://gerrit.wikimedia.org/r/108213 (owner: 10Odder) [15:08:27] something specific to port 80 [15:08:27] hmmm. [15:08:54] is port 80 block from outside access by some openstack stuff? [15:09:09] * matanya isn't very familiar with labs setup [15:09:42] doubtful [15:11:22] pong [15:11:55] labs has a firewall by default, you need to explicitly open ports [15:11:57] oh, I see Niharika's blog is already there, thanks akosiaris [15:12:41] i rest my case JohannesK_WMDE :) [15:12:53] mark: really? huh... here is something i did not know... [15:13:22] akosiaris: treat me with a cookie :) [15:13:55] matanya: thanks a lot for you help, i'm not familiar with labs setup yet, either [15:14:02] what kind of cookie would you like ? [15:14:20] but i didn't know about a default firewall either, mark, any info on how to configure that? [15:14:28] yeah I'm wondering too [15:14:29] code review cookie [15:14:33] it's similar to how you manage projects [15:14:41] but god knows where all that's being hidden now [15:14:55] somewhere on wikitech :P [15:14:57] manage security groups [15:15:14] yes that [15:15:32] heh... I even have a rule with entries there [15:15:34] damn... [15:15:44] I did not remember that at all [15:17:55] * matanya marks the issue as solved [15:19:27] what about the question above akosiaris? the nslcd issue [15:20:03] right! yes, that's where i configured the other ports in the first place... i remember now... doh. :) thanks for putting up with my stupidity everybody. ;) [15:20:56] JohannesK_WMDE: don't forget to remove me from the project [15:21:57] matanya: yes. unless you'd like to stay, and check out what we're doing. [15:22:32] if you don't mind leaving me in, i'll be glad to sniff around :) [15:22:48] matanya: so still experience that ? [15:23:10] https://wikitech.wikimedia.org/wiki/LDAP#Client have a look here [15:23:26] not me, was reported on #wikimedia-labs when you did the ldap stuff [15:24:05] ok. it should not manifest anymore [15:24:33] that it is what i said, lets see later on [15:24:38] matanya: ok. :) stuff is in /home/local-catgraph. i made sure you're not in the sudoers list though. ;) [15:24:47] sure [15:26:04] (03PS2) 10Ottomata: Updating debian/control to use newer standards, no longer depending on nagios3, removing pyversions file [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/107608 [15:26:06] (03PS1) 10Ottomata: Fixing syntax error [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/108520 [15:26:08] (03PS1) 10Ottomata: ganglia_parser - log level should be ERROR by default [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/108521 [15:26:16] strange thing is: a few days ago, i could connect to apache from outside, without any change in security groups. that was around the time when the network problems happened. 
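For reference, both the symptom and the fix discussed above can be expressed from a shell. The reachability check works from any outside host; the firewall itself is normally managed through wikitech's "Manage Security Groups" page, and the nova commands below are only a rough CLI equivalent on a stock OpenStack setup (group name and CIDR are assumptions).

# Symptom: the instance answers on 8090 but times out on 80 from outside.
nc -zv -w 5 sylvester.wmflabs.org 8090
nc -zv -w 5 sylvester.wmflabs.org 80

# Fix: open TCP/80 in the project's security group (done via the wikitech GUI
# in labs; shown here with the plain nova client for illustration).
nova secgroup-list-rules default
nova secgroup-add-rule default tcp 80 80 0.0.0.0/0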
[15:26:34] (03CR) 10Ottomata: [C: 032 V: 032] Updating debian/control to use newer standards, no longer depending on nagios3, removing pyversions file [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/107608 (owner: 10Ottomata) [15:26:44] (03CR) 10Ottomata: [C: 032 V: 032] Fixing syntax error [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/108520 (owner: 10Ottomata) [15:26:57] (03CR) 10Ottomata: [C: 032 V: 032] ganglia_parser - log level should be ERROR by default [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/108521 (owner: 10Ottomata) [15:27:56] ottomata: didn't you decide to drop ganglios? [15:29:33] not dropped, it it still used for some non generic things [15:29:36] memory stuff [15:29:37] not sure [15:29:45] i dropped it for my use case and a few others [15:30:00] but, apparently I left debug logging on when I was messing with it [15:30:04] and it was filling up logs on neon [15:30:05] fixing that now [15:32:21] (03PS1) 10Ottomata: Incrementing debian version [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/108522 [15:32:53] (03PS2) 10Ottomata: Incrementing debian version [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/108522 [15:33:14] (03CR) 10Ottomata: [C: 032 V: 032] Incrementing debian version [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/108522 (owner: 10Ottomata) [15:35:42] (03PS1) 10Ottomata: Adding gbp.conf [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/108523 [15:36:01] (03CR) 10Ottomata: [C: 032 V: 032] Adding gbp.conf [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/108523 (owner: 10Ottomata) [15:36:42] (03PS1) 10Ottomata: Adding section header to gbp.conf [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/108524 [15:36:52] (03CR) 10Ottomata: [C: 032 V: 032] Adding section header to gbp.conf [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/108524 (owner: 10Ottomata) [15:49:10] today is a WMF holiday, right? [15:49:36] (03CR) 10Ottomata: "Ah, sorry I didn't review this harder before. Matanya, that would work, but the $webrequset_filter_directory variable is really only rele" [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 (owner: 10Matanya) [15:49:53] yup [16:06:56] (03CR) 10JanZerebecki: [C: 04-1] "According to http://docs.puppetlabs.com/guides/templating.html#referencing-variables this will not work, how about scope.lookupvar('coredb" [operations/puppet] - 10https://gerrit.wikimedia.org/r/108488 (owner: 10Matanya) [16:22:50] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 06:59:52 PM UTC [17:37:26] i get an error at https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=all&search=Comparison+of+Windows+7+editions&fulltext=Search [17:37:30] "An error has occurred while searching: The search backend returned an error: " [17:56:12] (03PS1) 10BryanDavis: logstash: Updates for udp2log filtering [operations/puppet] - 10https://gerrit.wikimedia.org/r/108533 [18:02:00] (03CR) 10Matanya: "Thanks, I forgot to put the lookupvar, will fix shortly." [operations/puppet] - 10https://gerrit.wikimedia.org/r/108488 (owner: 10Matanya) [18:02:29] jackmcbarn: what is the error? [18:03:37] matanya: There's none. Search for "Comparison of Windows" works, "Comparison of Windows 7" fails. [18:04:25] I see [18:04:35] "7" alone works. That doesn't look /totally/ critical to me. jackmcbarn, could you file a bug? 
[18:12:58] (03PS7) 10Matanya: udp2log: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 [18:13:37] (03CR) 10jenkins-bot: [V: 04-1] udp2log: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 (owner: 10Matanya) [18:14:25] heya ori, you there? [18:14:27] working? [18:14:37] i've got a pip mediawiki-vagrant + wikimetrics problem [18:15:04] (03PS8) 10Matanya: udp2log: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 [18:15:42] (03CR) 10jenkins-bot: [V: 04-1] udp2log: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 (owner: 10Matanya) [18:17:03] matanya: the setting of template_variables => is in role/logging.pp [18:17:08] for the oxygen udp2log instance [18:17:22] yeah, i think i'm drunk. thanks ottomata [18:17:26] hah [18:17:32] yup :) [18:19:34] (03PS9) 10Matanya: udp2log: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 [18:19:52] jenkins shoot me if it doesn't work [18:20:10] PROBLEM - MySQL Slave Delay Port 3308 on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:20:11] (03CR) 10jenkins-bot: [V: 04-1] udp2log: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 (owner: 10Matanya) [18:20:20] this is enough [18:20:54] ottomata: what does he want for me? [18:21:00] RECOVERY - MySQL Slave Delay Port 3308 on labsdb1003 is OK: OK replication delay 0 seconds [18:22:19] jackmcbarn: Filed https://bugzilla.wikimedia.org/show_bug.cgi?id=60261 [18:23:09] matanya [18:23:18] it is a parameter to udp2log::instance [18:23:19] so [18:23:27] i didn't mean put it exactly on line 301 [18:23:39] i meant add it to the parameters passed to the udp2log::instnace { 'oxygen' [18:23:43] it is part of the class? [18:23:46] yes [18:23:50] well [18:23:51] of the define [18:23:51] yeah [18:24:00] i have zero knowlegde of the udp2log [18:24:15] maybe i should read about it before fixing stuff there [18:24:19] https://gist.github.com/ottomata/8525912 [18:24:34] there is documentation of misc::udp2log::instance define [18:24:36] in udp2log.pp [18:24:45] starting line 71 [18:25:21] I'm a proud proven dumb [18:25:23] for a usage example [18:25:32] see class role::logging::mediawiki [18:25:41] on line 65 in logging.pp [18:26:08] ja its cool! not dumb just gotta learn to use it. it isn't the best puppetization out there, but its ok. [18:27:48] so line 66 should look like: template_variables => { 'webrequest_filter_directory' => webrequest_filter_directory }, [18:27:58] is this what you mean ottomata ? [18:29:50] i'm unsure if our line numbers are the same [18:29:56] grep logging.pp for [18:30:00] misc::udp2log::instance [18:30:07] do you see that? 
[18:30:07] mine is on line 301 [18:30:57] oh, i see [18:31:25] I have it on 287 [18:31:39] ok eyah [18:31:44] so that is the define usage [18:31:50] it is declaring the oxygen udp2log instance [18:31:52] using the define [18:31:56] (defines are kinda like functions) [18:31:58] so [18:32:04] you need to pass the template_variables parameter in there [18:32:07] (03PS10) 10Matanya: udp2log: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 [18:32:08] like I pasted in the gist [18:32:19] * matanya crosses fingers [18:32:41] also, matanya, you might want to revert your change to udp2log.pp altogether, as you aren't actually adding anything there [18:32:43] just some formatting changes [18:32:51] yeah [18:32:57] if you want to reformat udp2log.pp (yes please, tabs -> spaces, etc.) go right ahead in a different commit :) [18:33:15] logging.pp [18:33:16] SO CLOSE [18:33:17] aalmost [18:33:24] template_variables => { 'webrequest_filter_directory' => $webrequest_filter_directory }, [18:33:28] note the dollar sine [18:33:30] dign [18:33:31] sign [18:33:33] * [18:33:38] it is defined a few lines up in that class [18:34:15] as you have it, it will set the value of $template_variables['webrequest_filter_directory'] to the literal string 'webrequest_filter_directory [18:34:18] you want to set it to the variable [18:35:57] (03PS11) 10Matanya: udp2log: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 [18:36:40] so lame 11 commits for such a trivial fix [18:36:56] yeah, soryr, i shoulda reviewed harder earlier and given you good advice to start with [18:37:26] oh [18:37:27] also [18:37:28] :/ [18:37:44] erb variables don't start with $ [18:37:49] they are either unqualified [18:37:51] or start with @ [18:37:55] they shoudl start with @ [18:38:01] so in filters.oxygen.erb [18:38:06] this [18:38:06] $template_variables['webrequest_filter_directory'] [18:38:07] shoudl be [18:38:10] @template_variables['webrequest_filter_directory'] [18:38:13] yeah [18:38:21] also, don't forget to revert udp2log.pp changes in your next PS :) [18:38:38] yes, that too [18:46:40] PROBLEM - Varnish traffic logger on cp1067 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:46:50] PROBLEM - Varnish HTCP daemon on cp1067 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:47:10] PROBLEM - Varnish HTTP text-backend on cp1067 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:31] RECOVERY - Varnish traffic logger on cp1067 is OK: PROCS OK: 2 processes with command name varnishncsa [18:50:40] RECOVERY - Varnish HTCP daemon on cp1067 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [18:51:00] RECOVERY - Varnish HTTP text-backend on cp1067 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.001 second response time [18:58:39] (03PS12) 10Matanya: udp2log: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 [18:59:07] oh god [18:59:16] I have removed the freaking file! [18:59:26] git said i did! [18:59:45] * matanya gives up on this patchset [19:00:16] haha [19:00:16] nooo [19:00:17] just do [19:00:24] git checkout manifests/misc/udp2log.pp [19:00:30] i thikn that will fix everything [19:00:30] I did [19:00:34] you did! 
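Much of the patchset churn above can be caught locally before pushing. A small sketch of the usual pre-push checks in an operations/puppet checkout; the erb incantation is the one from the old Puppet templating guide, so double-check it against the local erb version, and the template path is a guess based on the file names mentioned here.

# Syntax-check the manifest and lint it.
puppet parser validate manifests/misc/udp2log.pp
puppet-lint manifests/misc/udp2log.pp

# Syntax-check an ERB template by converting it to Ruby and running ruby -c.
erb -P -x -T '-' templates/udp2log/filters.oxygen.erb | ruby -c

# Make sure no unintended files ended up in the commit.
git status
git diff --stat origin/production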
[19:00:57] i don't see the file removed in gerrit [19:01:04] i just see a line addition in udp2log.pp [19:01:11] and also to make sure git reset e202b manifests/misc/udp2log.pp [19:02:38] no matter what I do, it shows the file as modified [19:04:37] ottomata: do me a favor, just merge it so i won't need to see the PS anymore :P [19:05:18] bah internet [19:07:49] ottomata1: here now but leaving in a minute. what was the q? [19:07:59] (03PS2) 10Matanya: coredb_mysql: puppet 3 compatibility fix: fully qualify variables [operations/puppet] - 10https://gerrit.wikimedia.org/r/108488 [19:09:56] enough for today. I do only stupid things now [19:10:03] awwww [19:10:32] (03PS1) 10Ottomata: Updating git::clone so that gerrit urls can be assumed by default [operations/puppet] - 10https://gerrit.wikimedia.org/r/108537 [19:10:37] it looks fine in gerrit! :p [19:10:56] matanya: , want me to just finish it off and merge? you got everything right in there, just need to revert udp2log.pp [19:11:07] feel free [19:11:21] i won't get it right today [19:12:18] sometimes it better to stay in bed on monday in order not to fix monday's work the rest of the week [19:12:53] haha [19:12:56] yeha [19:12:57] hah [19:23:00] (03PS13) 10Ottomata: udp2log: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 (owner: 10Matanya) [19:23:50] (03CR) 10Ottomata: [C: 032 V: 032] udp2log: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 (owner: 10Matanya) [19:23:50] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 06:59:52 PM UTC [19:26:03] thank you [19:26:45] yup [19:26:47] ! [19:26:55] looking good too [19:26:57] just ran this on oxygen [19:27:02] thanks for the patchsets!~ [19:55:02] who knows apache well? [19:55:42] (in particular i'm battling the wrong vhost being used for a req. while testing protrel fixes.) [20:02:20] jeremyb: i know it ok [20:02:23] what'sup? [20:05:13] jeremyb: i would be able to help, if it wasn't monday [20:12:02] ksnider: got a minute for a quick pm? [20:12:17] Yep1 [20:12:20] ! [20:12:31] QueenOfFrance: i agree with your pm [20:19:42] ottomata: well as i said, the wrong vhost is being used when processing a request [20:19:55] so i comment out some stuff from all.conf and it works better [20:20:02] but idk why that makes a difference [20:20:10] almost seems like it should be broken in prod [20:20:20] jeremyb: does it go by name? [20:20:55] ottomata: speaking of which... apache version in prod? [20:21:01] also, where should i be testing this stuff in labs? what's a good project? [20:21:14] (apache-config git repo changes) [20:21:28] jeremyb: what node? I don't know the apaches well [20:21:40] ha, i dunno! [20:21:42] hm [20:22:00] hm [20:22:08] jeremyb: gist or paste your configs somewhere? 
[20:22:32] often vhosts are wrong due to the server name not being set properly or something [20:22:36] matanya: i don't understand your question [20:22:38] and load order matters with vhosts [20:22:50] ottomata: it's just gerrit 106109 PS2 [20:22:51] matanya is asking if it is a NameVirtualHost [20:23:04] exactly [20:23:35] oh, i glanced at that change [20:23:41] ottomata: so, first i comment out nonexistant.conf from all.conf and that makes a diff but it's wrong a different way [20:23:56] ottomata: then i comment out everything and just uncomment redirects.conf and that seems more normal [20:24:03] oof, ok, when I said earlier i know apache well, i meant generic apache, i don't know much about wmf setup [20:24:04] :p [20:24:24] ottomata: anyway, pls dpkg -l apache\* on any prod text box [20:25:55] mutante: andrewbogott_afk: where should i be testing? [20:26:12] jeremyb, do you happen to know the hostname of one of them? [20:26:14] Gloria: (look up) :) [20:26:20] ottomata: one of what? [20:26:34] oh, prod text? [20:26:34] sure [20:26:45] ah foudn one [20:26:48] mw1017 [20:26:50] ms2017 [20:26:50] ha [20:26:51] thanks [20:26:52] yeah that [20:27:17] yeah, that should be ok. not prod though [20:27:20] 2.2.22-1ubuntu1.4 [20:27:45] ok, so exactly the same thing i was testing against [20:27:59] jeremyb: you can check in labs i guess [20:28:04] matanya: project? [20:28:25] ottomata: so, what about other possible differences? in /etc/apache2 but outside /etc/apache2/wmf [20:29:32] oof, i dunno jeremyb, i can maybe help you more in depth in an hourish, i'm in a hangout with dan atm working on something [20:29:41] where are you testing currently? [20:30:18] jeremyb: I think https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep [20:30:25] would be a good place [20:30:29] matanya: i guess i could do that... [20:30:55] ottomata: k. i'll move to deployment-prep [20:31:28] hrmm, i wonder how it's custom (beta) hostnames are done though [20:33:04] btw, do you know http://ottoportland.com/ ? [20:33:39] jeremyb: you can use the public ip [20:34:37] ha, nice jeremyb, i do not [20:34:39] but i do konw [20:36:25] http://www.otto-usa.com/index.html [20:48:50] ksnider: is there a policy on copying out a task from RT? [20:49:42] matanya: not unless it's from the security, procurement or other secure queue (or otherwise is information that one would expect needed protecting as such) [20:50:25] ok, thanks, simple ops-request. apt-pinning task. [20:50:54] Should be no problem. :) [20:54:23] ottomata: Hi. [20:54:47] ottomata: You're looking at Jeremy's redirect changes? [20:55:25] i was talking to him about it [20:55:26] haven't really looked at them yet [20:56:48] Gloria: had some trouble testing. also, wasn't sure where to test. matanya suggested deployment-prep but after having a look there i'm not so sure [20:57:05] why jeremyb ? [20:57:05] Ah. [20:57:21] Redirect changes are scary. [20:57:36] Don't want to break *.wikimedia.org again or anything like that. [20:57:50] again :) [20:57:58] Gloria: well in this case i can't even get the old version to work (before my changes) [20:58:37] Nice. [20:59:00] Gloria: i was going to diff HTTP responses... [20:59:12] Didn't we discuss writing tests? [20:59:38] I wonder if there's a bug about that. [21:01:26] https://git.wikimedia.org/blob/operations%2Fapache-config.git/70ec6d15cff8b678f51f469800ed7c6618ff7f55/refreshDomainRedirects [21:01:31] Interesting syntax highlighting. 
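Two quick checks that usually answer the "wrong vhost" question raised above: dump Apache's parsed vhost list (for a given address:port the first VirtualHost loaded is the default when no ServerName or ServerAlias matches, which is why include order in all.conf matters), and ask the local Apache directly with an explicit Host header. The URL below is just the failing case mentioned earlier.

# Which vhosts did Apache actually load, and which one is the default for *:80?
apache2ctl -S

# Which vhost answers for this Host header, and where does it redirect?
curl -sI -H 'Host: wikimediafoundation.org' http://127.0.0.1/fundraising \
  | grep -iE '^(HTTP|Location)'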
[21:05:44] https://bugzilla.wikimedia.org/show_bug.cgi?id=43266 [21:06:59] Gloria: i wrote tests [21:07:21] not sure how to automate them or where to check in. and maybe you want to add some cases [21:08:20] I'd at least like the cases from the bug report covered. [21:08:26] Because that bug has been marked fixed quite a few times now. [21:09:14] Gloria: well it was never marked fixed by me :) [21:09:17] but yes, i agree [21:09:23] and i did attempt to test! [21:09:47] https://bugzilla.wikimedia.org/show_bug.cgi?id=31369 is the bug we're discussing, for lurkers. [21:10:03] Gloria: here's what i started with: http://dpaste.com/1561574/plain/ [21:10:33] Gloria: with both HTTP and HTTPS tests for all of them [21:10:52] Are you programmatically comparing output? [21:10:57] comparing response, I mean. [21:11:11] i was just using diff -u [21:11:21] Blergh. [21:11:24] and apache-fast-test [21:11:37] I guess the language doesn't matter. [21:11:58] anyway, can't get the currently deployed conf (in prod) to give the right results [21:12:01] But a test script should have a simple dictionary and reasonable logic. [21:12:10] maybe someone else could set up an apache for me [21:12:16] Well, we know that prod is broken. ;-) [21:12:19] Thus the bug. [21:12:21] no [21:12:24] What isn't working? [21:12:31] You've yet to explain. [21:12:37] i did explain [21:12:46] stuff is being handled by the wrong vhost [21:12:58] Where are you testing? [21:12:59] so, e.g. wikimediafoundation.org/fundraising [21:13:08] is just redirecting to wikipedia homepage [21:13:13] locally [21:13:47] With what config? [21:13:56] Does it only happen for wikimediafoundation.org? [21:15:10] no [21:15:16] it happens with lots of domains [21:16:08] as i said, commenting/uncommenting in all.conf makes a big diff [21:17:07] Oh, I see scrollback now. [21:17:18] So much noise. [21:20:03] Gloria: anyway, look at the testcases [21:21:21] Make me. [21:22:47] ... [21:23:50] (03PS1) 10OliverKeyes: Add r-base to Hadoop worker machines [operations/puppet] - 10https://gerrit.wikimedia.org/r/108633 [21:25:12] maybe someone could add me back to deployment-prep projectadmin and then i'll test on a brand new box. or suggest another project for this [21:25:16] * jeremyb runs away, bbl [21:29:12] (03PS2) 10OliverKeyes: Add r-base to Hadoop worker machines [operations/puppet] - 10https://gerrit.wikimedia.org/r/108633 [21:33:10] (03PS3) 10OliverKeyes: Add r-base to Hadoop worker machines [operations/puppet] - 10https://gerrit.wikimedia.org/r/108633 [22:24:50] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 06:59:52 PM UTC [22:30:34] (03PS1) 10Odder: Add namespace aliases for NS_PROJECT on zhwikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/108641 [22:44:55] (03PS1) 10Ottomata: Puppetizing wikimetrics [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/108643 [22:45:38] (03CR) 10Ottomata: [C: 04-1] "Original work done here:" [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/108643 (owner: 10Ottomata) [22:45:58] (03CR) 10Ottomata: "Moving to submodule here:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96042 (owner: 10Milimetric) [22:46:08] (03Abandoned) 10Ottomata: [not ready for review] Puppetizing Wikimetrics in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/96042 (owner: 10Milimetric)
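One way to turn the test cases into something diffable, in the spirit of the apache-fast-test plus diff -u approach above: grab the status line and Location header for each case from two Apaches and diff the results. Everything below is a sketch with placeholder names; the two endpoints and the short case list are made up, and the list should grow to cover the cases from bug 31369.

cases='
wikimediafoundation.org /fundraising
wikipedia.com /
'
check() {  # $1 = the Apache to poke; the Host header stays canonical
  printf '%s\n' "$cases" | while read -r host path; do
    [ -n "$host" ] || continue
    printf '== http://%s%s\n' "$host" "$path"
    curl -sI -H "Host: $host" "http://$1$path" | grep -iE '^(HTTP|Location)'
  done
}
# bash: compare production against a test instance running the new config.
diff -u <(check text-lb.eqiad.wikimedia.org) <(check redirects-test.wmflabs.org)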