[01:15:19] nagios-plugins drama: https://bugzilla.redhat.com/show_bug.cgi?id=1054340 [01:17:50] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 06:59:52 PM UTC [01:36:25] (03PS1) 10Springle: repool db1039, warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/108478 [01:38:47] (03CR) 10Springle: [C: 032] repool db1039, warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/108478 (owner: 10Springle) [01:39:55] !log springle synchronized wmf-config/db-eqiad.php 'repool db1039, warm up' [01:40:04] Logged the message, Master [02:10:06] (03PS1) 10Springle: repool db1022 & db1039 full steam [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/108481 [02:10:30] (03CR) 10Springle: [C: 032] repool db1022 & db1039 full steam [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/108481 (owner: 10Springle) [02:10:37] (03Merged) 10jenkins-bot: repool db1022 & db1039 full steam [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/108481 (owner: 10Springle) [02:12:17] !log springle synchronized wmf-config/db-eqiad.php 'repool db1022, db1039 full steam' [02:12:26] Logged the message, Master [02:12:45] !log LocalisationUpdate completed (1.23wmf10) at 2014-01-20 02:12:45+00:00 [02:12:52] Logged the message, Master [02:16:47] (03CR) 10MZMcBride: "Does this file need a(n explicit) license?" (031 comment) [operations/software] - 10https://gerrit.wikimedia.org/r/108467 (owner: 10Tim Landscheidt) [02:22:29] !log LocalisationUpdate completed (1.23wmf11) at 2014-01-20 02:22:28+00:00 [02:22:35] Logged the message, Master [02:39:42] (03CR) 10Tim Landscheidt: "I'm not sure. Much of the Puppet/software/configuration stuff has no explicit licence, so I didn't want to step out of line. I'm fine wi" [operations/software] - 10https://gerrit.wikimedia.org/r/108467 (owner: 10Tim Landscheidt) [02:40:25] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-01-20 02:40:25+00:00 [02:40:32] Logged the message, Master [02:50:29] (03PS1) 10Ori.livneh: Revoke anomie's key per his request [operations/puppet] - 10https://gerrit.wikimedia.org/r/108483 [02:50:50] (03PS2) 10Ori.livneh: Revoke anomie's key per his request [operations/puppet] - 10https://gerrit.wikimedia.org/r/108483 [02:50:57] (03CR) 10Ori.livneh: [C: 032 V: 032] Revoke anomie's key per his request [operations/puppet] - 10https://gerrit.wikimedia.org/r/108483 (owner: 10Ori.livneh) [03:08:12] o.O [04:18:50] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 06:59:52 PM UTC [04:51:46] (03PS1) 10Ori.livneh: Gzip SVGs on front & back upload varnishes [operations/puppet] - 10https://gerrit.wikimedia.org/r/108484 [04:51:56] (03PS2) 10Ori.livneh: Gzip SVGs on front & back upload varnishes [operations/puppet] - 10https://gerrit.wikimedia.org/r/108484 [05:34:00] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:37:00] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [06:55:38] (03PS2) 10Andrew Bogott: gridengine: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107035 (owner: 10Matanya) [06:57:37] (03CR) 10Andrew Bogott: [C: 032] gridengine: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107035 (owner: 10Matanya) [07:04:49] (03PS2) 10Andrew Bogott: deployment: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107343 (owner: 10Matanya) [07:05:05] (03PS2) 10Andrew Bogott: dynamicproxy: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107339 (owner: 10Matanya) [07:17:37] (03CR) 10Andrew Bogott: [C: 032] dynamicproxy: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107339 (owner: 10Matanya) [07:19:50] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 06:59:52 PM UTC [07:27:08] (03CR) 10Andrew Bogott: [C: 032] deployment: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107343 (owner: 10Matanya) [07:40:48] (03PS1) 10Matanya: coredb_mysql: puppet 3 compatibility fix: fully qualify variables [operations/puppet] - 10https://gerrit.wikimedia.org/r/108488 [07:41:10] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [07:42:21] andrewbogott: Leslie Carr is leaving??? [07:42:40] Yeah, I think Friday was her last day :( [07:42:45] Yes. [07:43:35] :/ [07:43:49] to where? [07:45:36] I… can't remember. Cumulus, maybe? She's taking some time off first, though. [07:45:44] (03PS4) 10Andrew Bogott: base: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107355 (owner: 10Matanya) [07:46:05] so all SF based ops are leaving? :) [07:46:58] Not quite all… but yeah, I visited the office on last week and it was very quiet. [07:47:10] Daniel and Rob and Gage are in SF. And Ken is about to move there. [07:47:42] I feel like there's another who I'm forgetting :) Ori is semi-ops I guess. [07:47:53] Which Rob? [07:48:09] Halsell [07:48:17] He moved to SF? [07:48:20] He was in DC until this year, recently moved. [07:48:20] yeah [07:48:26] Oh, right. [07:48:28] I'm in D.C. [07:48:29] Seems /much/ happier in CA :) [07:48:32] Heh. [07:48:50] I'm visiting SF this week. [07:49:13] There are still lots of folks @ the office, just not so many ops. [07:49:45] http://orgcharts.wmflabs.org/ [07:49:48] is down [07:50:25] hm, is that the one linked to from the staff page? [07:50:52] yes [07:51:34] Beta indeed. [07:51:48] Gloria: not meaning to knock DC, I just think it's no fun to be constantly plagued with data-center requests from people you never see. I'm not sure how Chris manages it. [07:52:08] Well, yeah. :-) [07:52:57] DC work is the worst ops responisbility [07:53:18] Looks like orgcharts is marktraceur's baby. [07:53:22] Most likely out today. [07:53:47] Yes and yes. [07:57:17] orgcharts says "(111)Connection refused: proxy: HTTP: attempt to connect to 127.0.0.1:8888 (localhost) failed" [07:57:34] could be anything. [08:00:26] (03CR) 10Andrew Bogott: [C: 032] base: lint clean [operations/puppet] - 10https://gerrit.wikimedia.org/r/107355 (owner: 10Matanya) [08:00:53] ^ scary one! [08:01:57] yes, agreed [08:04:33] clean runs, though, seems safe. [08:04:39] * andrewbogott keeps fingers crossed [08:05:00] andrewbogott: can i call $::site from an erb? [08:06:14] I think for globals you have to do that ugly _lookup thing. I think I would copy into a local var first and refer to that in .erb. 
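A quick way to settle the $::site-in-ERB question being discussed here is a throwaway puppet apply run on any box with puppet installed. The sketch below is hypothetical: it uses the $::hostname fact as a stand-in for $::site and made-up /tmp paths, and it shows both options mentioned above, scope.lookupvar for an out-of-scope variable and a local copy of the global that the template can read as a plain @-variable.

mkdir -p /tmp/erb-test
cat > /tmp/erb-test/demo.erb <<'ERB'
via lookupvar:  <%= scope.lookupvar('::hostname') %>
via local copy: <%= @local_copy %>
ERB
cat > /tmp/erb-test/demo.pp <<'PP'
# Copy the global into a local variable so the template can use @local_copy.
$local_copy = $::hostname
notify { 'erb-test':
  message => template('/tmp/erb-test/demo.erb'),
}
PP
puppet apply /tmp/erb-test/demo.pp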
[08:06:31] although, maybe that's silly… not sure. [08:06:53] not sure $site can be a local var easily :/ [08:07:27] Can't just do $local_site=$::site? [08:07:43] and that is not ugly? :P [08:07:55] Oh it's definitely ugly! [08:08:10] Probably it's fine to use it directly, I just don't know if it needs special syntax... [08:08:18] Maybe it'll totally work, I should've just led with 'I don't know' [08:09:06] i'm doing puppet 3 compatibility fixes, so this is the source of my question [08:09:16] i guess testing is the best answer [08:09:29] yep [08:41:10] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [10:01:11] (03PS12) 10Matanya: Move mail manifests to a module called 'exim' [operations/puppet] - 10https://gerrit.wikimedia.org/r/68584 (owner: 10Andrew Bogott) [10:04:59] (03CR) 10Matanya: "patchset 11 was rebased. a heck of a work." [operations/puppet] - 10https://gerrit.wikimedia.org/r/68584 (owner: 10Andrew Bogott) [10:09:00] (03Abandoned) 10Matanya: gitblit: service has no status [operations/puppet] - 10https://gerrit.wikimedia.org/r/108321 (owner: 10Matanya) [10:20:50] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 06:59:52 PM UTC [10:24:50] PROBLEM - swift-container-auditor on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:24:50] PROBLEM - swift-object-updater on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:24:50] PROBLEM - swift-account-reaper on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:24:50] PROBLEM - swift-container-server on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:24:50] PROBLEM - swift-object-replicator on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:24:50] PROBLEM - swift-account-replicator on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:24:50] PROBLEM - swift-object-server on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:25:00] PROBLEM - swift-account-auditor on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:25:00] PROBLEM - DPKG on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:25:00] PROBLEM - swift-account-server on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:25:10] PROBLEM - swift-container-replicator on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:25:20] PROBLEM - swift-container-updater on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:25:30] PROBLEM - RAID on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:25:30] PROBLEM - swift-object-auditor on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:25:40] PROBLEM - SSH on ms-be1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:25:40] PROBLEM - puppet disabled on ms-be1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:30:51] bye bye ms-be1003 [10:35:00] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:36:50] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical [10:56:49] huh [10:57:09] just in time akosiaris [10:57:10] bug: soft lockups... I can't regain control... powercycling [10:57:39] don't forget to log :) [10:58:50] !log powercycling ms-be1003. 
Console full of messages BUG: soft lockup - CPU#stuck for #s [10:58:57] Logged the message, Master [10:59:38] akosiaris: thanks [11:00:00] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [11:00:26] now a cron question please. when does torrus-discovery run on netmon1001? [11:00:30] !log restarting Gitblit on antimony to test upstart script [11:00:37] Logged the message, Master [11:00:59] or on streber [11:01:00] RECOVERY - swift-container-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [11:01:10] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [11:01:10] RECOVERY - swift-container-updater on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [11:01:20] RECOVERY - swift-object-auditor on ms-be1003 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [11:01:20] RECOVERY - RAID on ms-be1003 is OK: OK: optimal, 14 logical, 14 physical [11:01:30] RECOVERY - SSH on ms-be1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [11:01:30] RECOVERY - puppet disabled on ms-be1003 is OK: OK [11:01:40] RECOVERY - swift-container-server on ms-be1003 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [11:01:40] RECOVERY - swift-container-auditor on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:01:40] RECOVERY - swift-object-server on ms-be1003 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [11:01:40] RECOVERY - swift-account-reaper on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [11:01:40] RECOVERY - swift-object-updater on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [11:01:40] RECOVERY - swift-object-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [11:01:41] RECOVERY - swift-account-replicator on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [11:01:50] RECOVERY - swift-account-auditor on ms-be1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [11:01:50] RECOVERY - DPKG on ms-be1003 is OK: All packages OK [11:01:50] RECOVERY - swift-account-server on ms-be1003 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [11:03:00] PROBLEM - gitblit process on antimony is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/java .*-jar gitblit.jar [11:03:01] matanya: never ? [11:03:10] not good [11:03:20] is torrus broken? [11:03:40] no [11:03:59] torrus is neither of streber nor netmon1001 for what is see btw [11:04:05] neither on* [11:04:10] from what* [11:04:13] damn... [11:05:00] RECOVERY - gitblit process on antimony is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar gitblit.jar [11:05:14] gitblit is me [11:06:17] so it is down [11:06:36] i'm just testing something [11:06:43] i'll be done in a moment; i !logged it [11:07:01] akosiaris: sorry on manutius [11:09:21] matanya: daily [11:09:43] thanks, so i'll convert the exec to a cron job [11:13:10] !log ms-be1003 kernel log show first lockups around Jan 19 06:30 UTC being XFS related. 
[11:13:15] Logged the message, Master [11:17:29] (03PS1) 10Ori.livneh: Add upstart job definition file for Gitblit [operations/puppet] - 10https://gerrit.wikimedia.org/r/108492 [11:18:33] matanya: ^ [11:18:41] good night [11:20:19] see ori [11:20:28] 3:20 AM :D [11:20:37] my thoughts exactly :) [11:20:38] thanks ori [11:20:51] sleep well ori [11:20:55] (03CR) 10Matanya: [C: 031] Add upstart job definition file for Gitblit [operations/puppet] - 10https://gerrit.wikimedia.org/r/108492 (owner: 10Ori.livneh) [11:21:13] thanks! (to both.) bye [11:54:44] akosiaris: Know anything about opendj? [11:54:59] https://rt.wikimedia.org/Ticket/Display.html?id=6676 [12:04:43] (03PS1) 10Matanya: torrus: move into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/108498 [12:05:36] andrewbogott: not much [12:05:39] andrewbogott: not related to https://gerrit.wikimedia.org/r/#/c/102629/ ? [12:06:10] andrewbogott: want help to investigate that issue with replication ? [12:06:13] matanya: Don't think so, they've been out of sync for a couple of days at least. [12:06:28] akosiaris: Yes please… I'm staring at log files but nothing obvious is jumping out. [12:06:43] It seems vaguely possible that they got out of sync when the pmtpa cable was cut [12:08:00] andrewbogott: ok gimme a sec to wrap up an email. [12:12:10] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [12:13:55] so andrewbogott, virt1000 is not replicating from virt0... let's see [12:15:15] I've restarted opendj on both servers and it seems fairly happy. [12:15:24] it synced ? [12:15:33] Ah, no. [12:15:38] I mean -- the log files seem happy. [12:15:44] But that account is still present on virt0 and not on virt1000 [12:22:17] andrewbogott: since 3 Jan from what i see.. [12:22:35] at least that is when the logs start complaining [12:23:24] that's relatively close to when the certs were changed [12:23:36] But do you agree that they're not complaining now? [12:23:58] are you looking at replication log or error log? [12:24:17] replication log [12:24:38] yes they are not complaining now [12:25:12] in fact, not since 16 Jan [12:25:33] Yeah, I think that's when I restarted opendj on virt1000 [12:26:34] Hm, apparently I didn't log it :( [12:29:19] so, the easy way forward is probably to reinitialize the slave from the master [12:30:48] Does replication only happen as the transactions occur? It won't resync itself? [12:30:59] normally yes [12:31:24] Ah, ok. Then I was just misunderstanding; I thought that if both servers were communicating they would gradually sync. [12:31:45] So is there a master/slave relationship or are they peers? Is one db guaranteed to be a subset of the other? [12:32:03] I think this software supports multimaster [12:32:15] i don't know if it is configured that way though [12:33:10] https://wikitech.wikimedia.org/wiki/LDAP#Initialize_replica [12:33:17] this could prove useful [12:33:42] not yet clear to me where to run it though and with what args... [12:34:07] Yeah -- I'm uneasy because of the 'guaranteed to be a subset' question... [12:34:50] a replication is done on a whole tree always [12:35:02] so dc=wikimedia,dc=org must be the same on both [12:36:16] I don't understand. [12:36:20] Or my question didn't make sense... [12:37:04] Isn't it possible that uid=foo,ou=people,dc=wikimedia,dc=org is on virt0 and not virt1000 [12:37:13] only if they are unsynced [12:37:15] but uid=bar,ou=people,dc=wikimedia,dc=org on virt1000 but not virt0? 
[12:37:16] not by design [12:37:21] Ah, right. [12:37:22] aaah [12:37:28] OK. But clearly they /are/ unsynced atm [12:37:48] so you mean that they both have changes ? [12:37:57] since the time the got unsynced... [12:38:01] split brain [12:38:06] that would be bad.... [12:38:41] i sure hope it hasn't happened [12:38:52] I don't know if it did or didn't :( [12:39:20] ok i will try to change something in my account on virt1000 [12:39:29] Ah, good idea. [12:39:31] hopefully it will answer that it is a readonly replica [12:44:17] crap [12:44:28] so... local changes on virt1000 are possible [12:45:02] akosiaris andrewbogott it would be good to check the head of the tree and see if it is in sync [12:45:19] :( [12:45:23] then you can see if the entire tree is screwed up or just part [12:49:47] huh [12:49:54] so virt1000 => virt0 changes [12:49:59] are propagating normally [12:50:04] no probs there [12:51:38] ok, I'm trying to think this through... [12:51:51] It's not obvious to me what uses virt1000 and what uses virt0. [12:52:08] For instance I looked at ldap.conf on gerrit.wikimedia.org and it didn't specify an ldap server at all. [12:52:35] It seems likely that tampa stuff (= labs) writes to virt0 and eqiad stuff (= everything else) to virt1000 [12:52:42] But I don't know that that's right. [12:52:47] huh [12:52:55] Suppose there's a way to just git a big diff, and then resolve by hand? [12:52:56] also virt0=> virt1000 works normally [12:53:10] i just added an attribute to my account on one server [12:53:16] it showd up on the other [12:53:29] deleted it from the second one and it got deleted on the first one two [12:53:35] but not the other way around? [12:53:46] so both directions work fine now [12:53:58] but something got lost at some point... [12:54:02] so you say we have a gap [12:54:26] what if you regenrate his config by hand on one side? [12:54:32] matanya: I know of at least one user account that's on virt0 but not virt1000 [12:54:48] yes ilmerovingio [12:54:50] matanya: right, if we know the total diff we can do that... [12:54:55] he approched me at first [12:55:09] hello [12:55:28] I'd try to fix him only as a start to see if we can sort out the root cause [12:55:44] I think the root cause is already resolved. [12:55:52] even better [12:56:17] diffing between the servers will be costy? [12:56:31] yes [12:56:55] too bad [12:57:09] Well, here, let me just do a dump of all accounts and compare just to see order of magnitude... [12:57:23] (though of course more than just accounts are involved probably) [12:57:24] can we force clean replection? [12:57:43] a quick dump speaks about 400k of data [12:57:46] in ldif format [12:57:54] let's see now... [12:58:53] are the user credentials stored in ldap? [12:59:17] yes [13:00:07] problem is we can't just figure out the diff and load those on virt1000. Afterwards virt0 will try to get the changes and it will fail [13:00:36] about 79 accounts differ [13:00:41] i got the point [13:01:25] akosiaris: can you put the server in lockdown mode until you fix? [13:01:28] akosiaris, does that failure matter though? [13:01:33] what about stopping the replication on the server that has problem, reimport the ldap database and restart the replication? [13:01:50] andrewbogott: yes because it will no longer get the changes from virt1000 [13:02:04] I am back at thinking reinitialize the one of two from the other [13:02:06] Wait, once a single rep fails it will stop forever after? 
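A rough sketch of the two steps under discussion: diffing the account lists on the two servers, then re-seeding the stale replica from the good one. Hostnames, ports, the install path and the credentials are placeholders, it assumes the uids are readable with a simple bind, and the dsreplication options are quoted from memory of the OpenDJ 2.x documentation, so verify them with dsreplication initialize --help before running anything.

# 1. Dump just the uids from each server and diff the sorted lists.
for host in virt0.wikimedia.org virt1000.wikimedia.org; do
  ldapsearch -x -LLL -H "ldap://$host" \
    -b 'ou=people,dc=wikimedia,dc=org' '(objectClass=*)' uid \
    | awk '/^uid: /{print $2}' | sort > "/tmp/uids.$host"
done
diff -u /tmp/uids.virt0.wikimedia.org /tmp/uids.virt1000.wikimedia.org

# 2. Re-seed virt1000 from virt0 over OpenDJ's admin port (placeholder path and
#    credentials; -X trusts the certificate, -n skips the interactive prompt).
/opt/opendj/bin/dsreplication initialize \
  --baseDN 'dc=wikimedia,dc=org' \
  --adminUID admin --adminPassword 'CHANGEME' \
  --hostSource virt0.wikimedia.org --portSource 4444 \
  --hostDestination virt1000.wikimedia.org --portDestination 4444 \
  -X -n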
[13:02:47] virt1000 has more user accounts than virt0 [13:02:50] That surprises me... [13:02:52] it used to in the previous generations (Sun DS, Fedora DS etc, all those are the same software more or less) [13:03:03] well, wait, let me verify that [13:03:36] i think you should use the fact it is a day off in USA and take a lockdown for short time to fix [13:03:40] wrong, virt0 has more [13:03:51] for some reason the ldif is bigger on virt1000, but more accounts on virt0 :( [13:04:06] maybe you passed -LLL on ldapsearch ? [13:04:10] comments etc... [13:04:14] i know i did [13:04:17] so diffing again [13:04:47] akosiaris: 77 of the 79 accounts that differ are on virt0 and not virt1000. So, doing virt0=>virt1000 seems pretty sound to me. [13:04:58] titleblacklist.wmflabs.org [13:05:04] this seems to be only on virt1000 [13:05:16] dc=208.80.153.160 [13:05:20] so it is not just accounts [13:05:24] also labs machines ? [13:05:36] Yep, that should also be on virt0 and not virt1000 [13:05:40] since labs is in tampa [13:06:25] http://opendj.forgerock.org/opendj-server/doc/admin-guide/index/chap-troubleshooting.html#troubleshoot-repl [13:06:36] this helped me in the past ^ [13:07:54] if the default purge delay is three days... [13:08:04] then we don't have historical info any more :( [13:08:31] too bad [13:08:49] i am getting pretty confident this is what happened [13:09:10] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [13:10:23] ok so [13:10:33] I say we reinitialized virt1000 from virt0 [13:10:59] and then readd the few missing entries [13:11:11] that are present on virt1000 but not virt0 [13:11:32] And readding by hand won't break replication forever? [13:12:02] you mean after virt0 basically overwrites all data present on virt1000 ? [13:12:11] no [13:12:18] Earlier you said 'problem is we can't just figure out the diff and load those on virt1000. Afterwards virt0 will try to get the changes and it will fail' [13:12:43] Now you are saying 'and then readd the few missing entries' [13:12:49] Aren't those the same thing, just different scale? [13:13:08] yes but the key word here is "after reinitialization" [13:13:30] cause what will happen is that virt1000 will be a mirror of virt0 [13:13:58] the problem earlier was that [13:14:16] if we added record B on virt1000 and it already was present on virt0 [13:14:20] that would break [13:14:36] cause hey... you tried to add an already existing record [13:14:54] but if we don't do that, but add nonexistent records, that is fine [13:15:09] Ooooh, ok now I see what you are saying :) [13:15:45] So next you're going to say "I know how to do all those things, you should just stand by and watch" [13:15:58] heh... [13:16:10] I have done them in the past indeed [13:16:13] just not on this software [13:16:20] more like it's grandpa [13:16:32] hopefully not much has changed :-) [13:18:02] all the ds stuff is very similar, even active directory, tfoo [13:21:50] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 06:59:52 PM UTC [13:23:49] ^ this is because puppet is disabled on hooft. [13:25:06] why though ? [13:25:19] i don't see something in SAL [13:25:34] I don't know what/whose hooft is [13:33:56] andrewbogott: can you please guide me how to test etherpad on labs? I wasn't able to understnad how i get to the site from labs [13:34:28] matanya, you mean, test a new etherpad install? 
[13:34:31] yes [13:35:10] Hm… hopefully you can just set up a proxy pointing to the instance and it'll just work. [13:35:40] an apache proxy? [13:35:50] Since I think that etherpad just uses http [13:35:51] https://wikitech.wikimedia.org/wiki/Special:NovaProxy [13:36:04] You can just assign your etherpad instance a public address there. [13:36:08] And then browse! [13:36:19] (Yuvi wrote most of that, it's new :) ) [13:37:23] matanya: Am I answering your question, sort of? [13:38:11] I hope, i'll need to check this out. I didn't know I have the right of using public IP [13:39:07] You don't need one, that's what the proxy is for. [13:39:26] I see [13:39:32] The proxy box has its own public IP. That GUI assigns a new DNS name to the proxy box, and automatically routes traffic using that DNS name to your labs machine. [13:39:48] I'll test it at night [13:40:07] my night ... [13:40:12] It's very simple, you just type the name you want, select the instance, submit, wait a few minutes. [13:40:17] And then clean up afterwards, ideally :) [13:41:02] andrewbogott: out of cirousity, what else is on your endless list? [13:41:25] Um… labs to eqiad (and many subtasks thereof) [13:41:37] Refactoring all of puppet to use modules [13:41:47] …some other stuff :) [13:42:00] PROBLEM - LDAPS on virt0 is CRITICAL: Connection refused [13:42:25] Mostly I'm just fixated on the labs DC migration and don't have a lot of brain space for other long-term plans. [13:42:40] PROBLEM - LDAP on virt0 is CRITICAL: Connection refused [13:42:57] BTW, i have rebased you exim module, as it started to get too far from prod. [13:43:02] *your [13:43:36] ok. I think mark sort of hated that module, so it may end up getting redone at some point. [13:43:40] RECOVERY - LDAP on virt0 is OK: TCP OK - 0.038 second response time on port 389 [13:44:00] RECOVERY - LDAPS on virt0 is OK: TCP OK - 0.036 second response time on port 636 [13:44:39] too bad, wasted some nice portion of time on it [13:44:41] !log stopped, backed up virt1000 and virt0 and started them up. Preparing for reinitializing virt1000 from virt0 [13:44:49] Logged the message, Master [13:45:30] PROBLEM - LDAP on virt1000 is CRITICAL: Connection refused [13:45:40] PROBLEM - LDAPS on virt1000 is CRITICAL: Connection refused [13:47:30] RECOVERY - LDAP on virt1000 is OK: TCP OK - 0.002 second response time on port 389 [13:47:40] RECOVERY - LDAPS on virt1000 is OK: TCP OK - 0.000 second response time on port 636 [13:51:32] !log started virt1000 reinitialization from virt0 [13:51:39] Logged the message, Master [13:51:45] !log finished virt1000 reinitalization from virt0 [13:51:47] that was fast... [13:51:51] Logged the message, Master [13:51:52] sooo let's see... [13:52:29] andrewbogott: your user is there [13:52:39] So I see! [13:52:55] I think we are ok... I could not even find something that has really changed on virt1000 and needed to be added manually [13:52:59] ilmerovingio: so… can you log into gerrit now? [13:53:14] * andrewbogott figures this will turn out not to have been the problem :/ [13:53:23] is the network disruption also affecting dns? [13:53:35] heh [13:53:43] oh! akosiaris, fun fact, you have to restart pdns when you tinker with opendj. [13:53:45] Shall I? [13:53:51] ??? [13:53:54] sure [13:53:54] JohannesK_WMDE: Give me a minute… [13:54:00] please tell me how [13:54:20] Oh, just service pdns restart [13:54:22] I did it already. [13:54:35] It's a bug in pdns, it flips out if there's an ldap interruption. [13:54:36] only one server ? [13:54:36] Sometimes. 
[13:54:44] Both virt0 and virt1000 [13:54:49] JohannesK_WMDE: better? [13:54:52] andrewbogott: ah wow! yes that looks better. [13:54:57] ok :-) [13:55:42] andrewbogott: let me check [13:56:05] akosiaris: I'm about to punch out for the night. Thanks a million for sorting this, would've taken me 5x as long on my own. [13:56:31] andrewbogott: YES :) [13:56:38] \o/ [13:56:42] well done [13:57:24] andrewbogott: happy to be of service. good night :-) [13:58:48] akosiaris: Next time I see that failure on virt1000 I will remember that it MUST BE FIXED within three days :) [14:00:14] * andrewbogott glances at icinga before bed [14:00:56] ahahaha [14:01:26] do we have anything to catch that next time ? [14:01:40] already gone... no worries, I 'll ask again tomorrow :-) [14:02:32] hmmm... i still can't connect to apache. it worked a few days ago. i can see apache2 listening on *:80, and i can connect to other services listening on other tcp ports on the same host. [14:02:53] is there a special role i need to enable to make apache work? i only ticked 'apache2' [14:03:48] i mean 'webserver::apache2' [14:04:00] i think not. that should be sufficient [14:04:15] if you see something listening on 80 but not being able to connect to it [14:04:23] then it is firewall [14:04:37] which in this case... sounds weird [14:04:56] akosiaris: might be. but i didn't make any configurations to iptables myself [14:05:03] try iptables -nxvL and see if you have any rules there [14:06:52] akosiaris: nah. looks empty: [14:06:58] jkroll@sylvester:~> sudo iptables -nxvL [14:06:58] Chain INPUT (policy ACCEPT 25259783 packets, 99414245407 bytes) [14:06:58] pkts bytes target prot opt in out source destination [14:06:58] Chain FORWARD (policy ACCEPT 0 packets, 0 bytes) [14:06:58] pkts bytes target prot opt in out source destination [14:06:58] Chain OUTPUT (policy ACCEPT 8764915 packets, 519661372 bytes) [14:06:58] pkts bytes target prot opt in out source destination [14:07:02] sry for spam [14:07:35] any other ideas... anybody? :) [14:18:26] JohannesK_WMDE: can you telnet to 80? [14:20:47] matanya: yes from the same host, no from any other host including tools-login [14:21:57] which is strange, because other ports work [14:23:13] whay are you trying to do? browse to an instance? [14:23:32] yes, i have apache running on my host and want to connect to it [14:24:11] what is "my host" and from where you are trying to connect to it? [14:24:49] the host is the sylvester.wmflabs.org instance [14:25:18] and your are trying from? [14:25:57] i have tried to connect to port 80 from sylvester (works), tools-login (timeout), toolserver (timeout), my machine at home (timeout) [14:26:23] what is the ip of sylvester? [14:26:38] but i can connect to another service on the same host: http://sylvester.wmflabs.org:8090/list-graphs [14:26:56] sylvester.wmflabs.org has address 208.80.153.176 [14:27:29] so it might be a misconfiguration of apache, but "netstat -lpt" shows it listening on *:80, so i think it should work. [14:29:00] what does netstat -an say? [14:29:09] netstat -an |grep 80 [14:30:05] matanya: $ sudo netstat -anp [14:30:08] [...] [14:30:09] tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 22647/apache2 [14:30:34] and anything in /var/log/apache2/error.log [14:30:47] or /var/log/apache2/access.log [14:30:54] or you don't even get there? [14:32:05] a lot, because i have been trying things like "a2ensite default / service apache2 reload" etc. 
but the last lines are: [14:32:12] [Mon Jan 20 13:58:46 2014] [notice] Apache/2.2.22 (Ubuntu) configured -- resuming normal operations [14:32:12] [Mon Jan 20 13:59:12 2014] [notice] caught SIGTERM, shutting down [14:32:12] [Mon Jan 20 13:59:12 2014] [notice] Apache/2.2.22 (Ubuntu) configured -- resuming normal operations [14:32:12] [Mon Jan 20 14:01:29 2014] [notice] caught SIGTERM, shutting down [14:32:12] [Mon Jan 20 14:01:30 2014] [notice] Apache/2.2.22 (Ubuntu) configured -- resuming normal operations [14:32:12] [Mon Jan 20 14:03:59 2014] [notice] SIGUSR1 received. Doing graceful restart [14:32:12] [Mon Jan 20 14:03:59 2014] [notice] Apache/2.2.22 (Ubuntu) configured -- resuming normal operations [14:32:13] [Mon Jan 20 14:19:39 2014] [error] [client 127.0.0.1] client sent HTTP/1.1 request without hostname (see RFC2616 section 14.23): / [14:33:17] i think the last line was me typing "GET / HTTP/1.1" into telnet. [14:34:51] is there any ferm rule there? [14:35:22] ferm rule? [14:36:41] http://ferm.foo-projects.org/ [14:37:42] no, not that i know of. ferm is installed, at least not via apt. [14:38:15] can i try to log in to this machine? [14:38:20] (that is, "dpkg -l | grep ferm" is empty) [14:42:07] i guess i could add you to the project members list temporarily and that should give you access. now the question is, how do i actually do that :D [14:43:01] JohannesK_WMDE: you can do it via wikitech add member [14:43:16] what is the project name? [14:43:58] right, here it is. what's your wikitech username? [14:44:07] the projectname is catgraph [14:44:22] matanya [14:44:45] okay, i've added you. see if you can log in. [14:45:21] i;m in [14:45:58] where did you get the 208.80.153.176 address from? [14:46:14] it's a public ip address [14:46:21] given by admins [14:47:40] i see you only have 10.4.0.216 [14:47:46] that seems to be the reason [14:48:09] did you include the puppet proxy class ? [14:49:20] akosiaris: nslcd[1111]: [14914e] error writing to client: Broken pipe [14:49:48] is this related to the ldap issue? ^ [14:52:17] matanya: the 208.80.153.176 address is simply what i get by e.g. "host sylvester.wmflabs.org" from outside of labs... it's the public ip address, this should be alright. and i can connect to that address, too. just not to apache. [14:52:39] yeah, I see that [14:52:57] no, i didn't add a puppet proxy class. i didn't know i had to do that. [14:54:55] JohannesK_WMDE: i'm not sure you have to [14:55:12] but i see something weird in the logs: [warn] NameVirtualHost *:80 has no VirtualHosts [14:55:46] did you config apache to serve anything? [14:55:46] you think i need to enable protoproxy::proxy_sites? it seems to be related to https? i can't really tell what that role does at a glance... [14:56:03] yes, that's why i thought it might be a misconfiguration [14:56:32] the role allows you access private ip instances using a proxy [14:57:03] I can't ping 208.80.153.176 from anywhere [14:57:24] not even from sylvester [14:57:44] strange, works from here [14:58:29] it is possible you can't reach the external IP from inside labs... but it should work from outside [15:01:07] ok, it seems to be an issue specific to port 80 [15:02:34] ok, i found your issue [15:02:43] nothing is served on *:80 [15:03:01] JohannesK_WMDE: since nothign is served on 80 you timeuot [15:03:33] matanya: are you sure? 
try "curl localhost" on sylvester [15:04:08] 404 [15:04:12] it's possible that no site is enabled, but from localhost, apache does answer the request [15:04:22] from outside, i don't even get an answer [15:04:57] okay, i ran a2ensite default. try "curl localhost" now [15:05:31] yeah, i see what you mean [15:05:47] ... working on localhost now, but not from the outside [15:05:55] yes [15:06:05] (03CR) 10Alexandros Kosiaris: [C: 032] Add Niharika Kohli's blog to the English Planet [operations/puppet] - 10https://gerrit.wikimedia.org/r/108213 (owner: 10Odder) [15:08:27] something specific to port 80 [15:08:27] hmmm. [15:08:54] is port 80 block from outside access by some openstack stuff? [15:09:09] * matanya isn't very familiar with labs setup [15:09:42] doubtful [15:11:22] pong [15:11:55] labs has a firewall by default, you need to explicitly open ports [15:11:57] oh, I see Niharika's blog is already there, thanks akosiaris [15:12:41] i rest my case JohannesK_WMDE :) [15:12:53] mark: really? huh... here is something i did not know... [15:13:22] akosiaris: treat me with a cookie :) [15:13:55] matanya: thanks a lot for you help, i'm not familiar with labs setup yet, either [15:14:02] what kind of cookie would you like ? [15:14:20] but i didn't know about a default firewall either, mark, any info on how to configure that? [15:14:28] yeah I'm wondering too [15:14:29] code review cookie [15:14:33] it's similar to how you manage projects [15:14:41] but god knows where all that's being hidden now [15:14:55] somewhere on wikitech :P [15:14:57] manage security groups [15:15:14] yes that [15:15:32] heh... I even have a rule with entries there [15:15:34] damn... [15:15:44] I did not remember that at all [15:17:55] * matanya marks the issue as solved [15:19:27] what about the question above akosiaris? the nslcd issue [15:20:03] right! yes, that's where i configured the other ports in the first place... i remember now... doh. :) thanks for putting up with my stupidity everybody. ;) [15:20:56] JohannesK_WMDE: don't forget to remove me from the project [15:21:57] matanya: yes. unless you'd like to stay, and check out what we're doing. [15:22:32] if you don't mind leaving me in, i'll be glad to sniff around :) [15:22:48] matanya: so still experience that ? [15:23:10] https://wikitech.wikimedia.org/wiki/LDAP#Client have a look here [15:23:26] not me, was reported on #wikimedia-labs when you did the ldap stuff [15:24:05] ok. it should not manifest anymore [15:24:33] that it is what i said, lets see later on [15:24:38] matanya: ok. :) stuff is in /home/local-catgraph. i made sure you're not in the sudoers list though. ;) [15:24:47] sure [15:26:04] (03PS2) 10Ottomata: Updating debian/control to use newer standards, no longer depending on nagios3, removing pyversions file [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/107608 [15:26:06] (03PS1) 10Ottomata: Fixing syntax error [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/108520 [15:26:08] (03PS1) 10Ottomata: ganglia_parser - log level should be ERROR by default [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/108521 [15:26:16] strange thing is: a few days ago, i could connect to apache from outside, without any change in security groups. that was around the time when the network problems happened. 
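For reference, both the symptom and the fix discussed above can be expressed from a shell. The reachability check works from any outside host; the firewall itself is normally managed through wikitech's "Manage Security Groups" page, and the nova commands below are only a rough CLI equivalent on a stock OpenStack setup (group name and CIDR are assumptions).

# Symptom: the instance answers on 8090 but times out on 80 from outside.
nc -zv -w 5 sylvester.wmflabs.org 8090
nc -zv -w 5 sylvester.wmflabs.org 80

# Fix: open TCP/80 in the project's security group (done via the wikitech GUI
# in labs; shown here with the plain nova client for illustration).
nova secgroup-list-rules default
nova secgroup-add-rule default tcp 80 80 0.0.0.0/0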
[15:26:34] (03CR) 10Ottomata: [C: 032 V: 032] Updating debian/control to use newer standards, no longer depending on nagios3, removing pyversions file [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/107608 (owner: 10Ottomata) [15:26:44] (03CR) 10Ottomata: [C: 032 V: 032] Fixing syntax error [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/108520 (owner: 10Ottomata) [15:26:57] (03CR) 10Ottomata: [C: 032 V: 032] ganglia_parser - log level should be ERROR by default [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/108521 (owner: 10Ottomata) [15:27:56] ottomata: didn't you decide to drop ganglios? [15:29:33] not dropped, it it still used for some non generic things [15:29:36] memory stuff [15:29:37] not sure [15:29:45] i dropped it for my use case and a few others [15:30:00] but, apparently I left debug logging on when I was messing with it [15:30:04] and it was filling up logs on neon [15:30:05] fixing that now [15:32:21] (03PS1) 10Ottomata: Incrementing debian version [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/108522 [15:32:53] (03PS2) 10Ottomata: Incrementing debian version [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/108522 [15:33:14] (03CR) 10Ottomata: [C: 032 V: 032] Incrementing debian version [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/108522 (owner: 10Ottomata) [15:35:42] (03PS1) 10Ottomata: Adding gbp.conf [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/108523 [15:36:01] (03CR) 10Ottomata: [C: 032 V: 032] Adding gbp.conf [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/108523 (owner: 10Ottomata) [15:36:42] (03PS1) 10Ottomata: Adding section header to gbp.conf [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/108524 [15:36:52] (03CR) 10Ottomata: [C: 032 V: 032] Adding section header to gbp.conf [operations/software/ganglios] - 10https://gerrit.wikimedia.org/r/108524 (owner: 10Ottomata) [15:49:10] today is a WMF holiday, right? [15:49:36] (03CR) 10Ottomata: "Ah, sorry I didn't review this harder before. Matanya, that would work, but the $webrequset_filter_directory variable is really only rele" [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 (owner: 10Matanya) [15:49:53] yup [16:06:56] (03CR) 10JanZerebecki: [C: 04-1] "According to http://docs.puppetlabs.com/guides/templating.html#referencing-variables this will not work, how about scope.lookupvar('coredb" [operations/puppet] - 10https://gerrit.wikimedia.org/r/108488 (owner: 10Matanya) [16:22:50] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 06:59:52 PM UTC [17:37:26] i get an error at https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=all&search=Comparison+of+Windows+7+editions&fulltext=Search [17:37:30] "An error has occurred while searching: The search backend returned an error: " [17:56:12] (03PS1) 10BryanDavis: logstash: Updates for udp2log filtering [operations/puppet] - 10https://gerrit.wikimedia.org/r/108533 [18:02:00] (03CR) 10Matanya: "Thanks, I forgot to put the lookupvar, will fix shortly." [operations/puppet] - 10https://gerrit.wikimedia.org/r/108488 (owner: 10Matanya) [18:02:29] jackmcbarn: what is the error? [18:03:37] matanya: There's none. Search for "Comparison of Windows" works, "Comparison of Windows 7" fails. [18:04:25] I see [18:04:35] "7" alone works. That doesn't look /totally/ critical to me. jackmcbarn, could you file a bug? 
[18:12:58] (03PS7) 10Matanya: udp2log: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 [18:13:37] (03CR) 10jenkins-bot: [V: 04-1] udp2log: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 (owner: 10Matanya) [18:14:25] heya ori, you there? [18:14:27] working? [18:14:37] i've got a pip mediawiki-vagrant + wikimetrics problem [18:15:04] (03PS8) 10Matanya: udp2log: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 [18:15:42] (03CR) 10jenkins-bot: [V: 04-1] udp2log: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 (owner: 10Matanya) [18:17:03] matanya: the setting of template_variables => is in role/logging.pp [18:17:08] for the oxygen udp2log instance [18:17:22] yeah, i think i'm drunk. thanks ottomata [18:17:26] hah [18:17:32] yup :) [18:19:34] (03PS9) 10Matanya: udp2log: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 [18:19:52] jenkins shoot me if it doesn't work [18:20:10] PROBLEM - MySQL Slave Delay Port 3308 on labsdb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:20:11] (03CR) 10jenkins-bot: [V: 04-1] udp2log: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 (owner: 10Matanya) [18:20:20] this is enough [18:20:54] ottomata: what does he want for me? [18:21:00] RECOVERY - MySQL Slave Delay Port 3308 on labsdb1003 is OK: OK replication delay 0 seconds [18:22:19] jackmcbarn: Filed https://bugzilla.wikimedia.org/show_bug.cgi?id=60261 [18:23:09] matanya [18:23:18] it is a parameter to udp2log::instance [18:23:19] so [18:23:27] i didn't mean put it exactly on line 301 [18:23:39] i meant add it to the parameters passed to the udp2log::instnace { 'oxygen' [18:23:43] it is part of the class? [18:23:46] yes [18:23:50] well [18:23:51] of the define [18:23:51] yeah [18:24:00] i have zero knowlegde of the udp2log [18:24:15] maybe i should read about it before fixing stuff there [18:24:19] https://gist.github.com/ottomata/8525912 [18:24:34] there is documentation of misc::udp2log::instance define [18:24:36] in udp2log.pp [18:24:45] starting line 71 [18:25:21] I'm a proud proven dumb [18:25:23] for a usage example [18:25:32] see class role::logging::mediawiki [18:25:41] on line 65 in logging.pp [18:26:08] ja its cool! not dumb just gotta learn to use it. it isn't the best puppetization out there, but its ok. [18:27:48] so line 66 should look like: template_variables => { 'webrequest_filter_directory' => webrequest_filter_directory }, [18:27:58] is this what you mean ottomata ? [18:29:50] i'm unsure if our line numbers are the same [18:29:56] grep logging.pp for [18:30:00] misc::udp2log::instance [18:30:07] do you see that? 
[18:30:07] mine is on line 301 [18:30:57] oh, i see [18:31:25] I have it on 287 [18:31:39] ok eyah [18:31:44] so that is the define usage [18:31:50] it is declaring the oxygen udp2log instance [18:31:52] using the define [18:31:56] (defines are kinda like functions) [18:31:58] so [18:32:04] you need to pass the template_variables parameter in there [18:32:07] (03PS10) 10Matanya: udp2log: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 [18:32:08] like I pasted in the gist [18:32:19] * matanya crosses fingers [18:32:41] also, matanya, you might want to revert your change to udp2log.pp altogether, as you aren't actually adding anything there [18:32:43] just some formatting changes [18:32:51] yeah [18:32:57] if you want to reformat udp2log.pp (yes please, tabs -> spaces, etc.) go right ahead in a different commit :) [18:33:15] logging.pp [18:33:16] SO CLOSE [18:33:17] aalmost [18:33:24] template_variables => { 'webrequest_filter_directory' => $webrequest_filter_directory }, [18:33:28] note the dollar sine [18:33:30] dign [18:33:31] sign [18:33:33] * [18:33:38] it is defined a few lines up in that class [18:34:15] as you have it, it will set the value of $template_variables['webrequest_filter_directory'] to the literal string 'webrequest_filter_directory [18:34:18] you want to set it to the variable [18:35:57] (03PS11) 10Matanya: udp2log: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 [18:36:40] so lame 11 commits for such a trivial fix [18:36:56] yeah, soryr, i shoulda reviewed harder earlier and given you good advice to start with [18:37:26] oh [18:37:27] also [18:37:28] :/ [18:37:44] erb variables don't start with $ [18:37:49] they are either unqualified [18:37:51] or start with @ [18:37:55] they shoudl start with @ [18:38:01] so in filters.oxygen.erb [18:38:06] this [18:38:06] $template_variables['webrequest_filter_directory'] [18:38:07] shoudl be [18:38:10] @template_variables['webrequest_filter_directory'] [18:38:13] yeah [18:38:21] also, don't forget to revert udp2log.pp changes in your next PS :) [18:38:38] yes, that too [18:46:40] PROBLEM - Varnish traffic logger on cp1067 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:46:50] PROBLEM - Varnish HTCP daemon on cp1067 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:47:10] PROBLEM - Varnish HTTP text-backend on cp1067 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:50:31] RECOVERY - Varnish traffic logger on cp1067 is OK: PROCS OK: 2 processes with command name varnishncsa [18:50:40] RECOVERY - Varnish HTCP daemon on cp1067 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [18:51:00] RECOVERY - Varnish HTTP text-backend on cp1067 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.001 second response time [18:58:39] (03PS12) 10Matanya: udp2log: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 [18:59:07] oh god [18:59:16] I have removed the freaking file! [18:59:26] git said i did! [18:59:45] * matanya gives up on this patchset [19:00:16] haha [19:00:16] nooo [19:00:17] just do [19:00:24] git checkout manifests/misc/udp2log.pp [19:00:30] i thikn that will fix everything [19:00:30] I did [19:00:34] you did! 
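Much of the patchset churn above can be caught locally before pushing. A small sketch of the usual pre-push checks in an operations/puppet checkout; the erb incantation is the one from the old Puppet templating guide, so double-check it against the local erb version, and the template path is a guess based on the file names mentioned here.

# Syntax-check the manifest and lint it.
puppet parser validate manifests/misc/udp2log.pp
puppet-lint manifests/misc/udp2log.pp

# Syntax-check an ERB template by converting it to Ruby and running ruby -c.
erb -P -x -T '-' templates/udp2log/filters.oxygen.erb | ruby -c

# Make sure no unintended files ended up in the commit.
git status
git diff --stat origin/production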
[19:00:57] i don't see the file removed in gerrit [19:01:04] i just see a line addition in udp2log.pp [19:01:11] and also to make sure git reset e202b manifests/misc/udp2log.pp [19:02:38] no matter what I do, it shows the file as modified [19:04:37] ottomata: do me a favor, just merge it so i won't need to see the PS anymore :P [19:05:18] bah internet [19:07:49] ottomata1: here now but leaving in a minute. what was the q? [19:07:59] (03PS2) 10Matanya: coredb_mysql: puppet 3 compatibility fix: fully qualify variables [operations/puppet] - 10https://gerrit.wikimedia.org/r/108488 [19:09:56] enough for today. I do only stupid things now [19:10:03] awwww [19:10:32] (03PS1) 10Ottomata: Updating git::clone so that gerrit urls can be assumed by default [operations/puppet] - 10https://gerrit.wikimedia.org/r/108537 [19:10:37] it looks fine in gerrit! :p [19:10:56] matanya: , want me to just finish it off and merge? you got everything right in there, just need to revert udp2log.pp [19:11:07] feel free [19:11:21] i won't get it right today [19:12:18] sometimes it better to stay in bed on monday in order not to fix monday's work the rest of the week [19:12:53] haha [19:12:56] yeha [19:12:57] hah [19:23:00] (03PS13) 10Ottomata: udp2log: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 (owner: 10Matanya) [19:23:50] (03CR) 10Ottomata: [C: 032 V: 032] udp2log: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - 10https://gerrit.wikimedia.org/r/107828 (owner: 10Matanya) [19:23:50] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 06:59:52 PM UTC [19:26:03] thank you [19:26:45] yup [19:26:47] ! [19:26:55] looking good too [19:26:57] just ran this on oxygen [19:27:02] thanks for the patchsets!~ [19:55:02] who knows apache well? [19:55:42] (in particular i'm battling the wrong vhost being used for a req. while testing protrel fixes.) [20:02:20] jeremyb: i know it ok [20:02:23] what'sup? [20:05:13] jeremyb: i would be able to help, if it wasn't monday [20:12:02] ksnider: got a minute for a quick pm? [20:12:17] Yep1 [20:12:20] ! [20:12:31] QueenOfFrance: i agree with your pm [20:19:42] ottomata: well as i said, the wrong vhost is being used when processing a request [20:19:55] so i comment out some stuff from all.conf and it works better [20:20:02] but idk why that makes a difference [20:20:10] almost seems like it should be broken in prod [20:20:20] jeremyb: does it go by name? [20:20:55] ottomata: speaking of which... apache version in prod? [20:21:01] also, where should i be testing this stuff in labs? what's a good project? [20:21:14] (apache-config git repo changes) [20:21:28] jeremyb: what node? I don't know the apaches well [20:21:40] ha, i dunno! [20:21:42] hm [20:22:00] hm [20:22:08] jeremyb: gist or paste your configs somewhere? 
[20:22:32] often vhosts are wrong due to the server name not being set properly or something [20:22:36] matanya: i don't understand your question [20:22:38] and load order matters with vhosts [20:22:50] ottomata: it's just gerrit 106109 PS2 [20:22:51] matanya is asking if it is a NameVirtualHost [20:23:04] exactly [20:23:35] oh, i glanced at that change [20:23:41] ottomata: so, first i comment out nonexistant.conf from all.conf and that makes a diff but it's wrong a different way [20:23:56] ottomata: then i comment out everything and just uncomment redirects.conf and that seems more normal [20:24:03] oof, ok, when I said earlier i know apache well, i meant generic apache, i don't know much about wmf setup [20:24:04] :p [20:24:24] ottomata: anyway, pls dpkg -l apache\* on any prod text box [20:25:55] mutante: andrewbogott_afk: where should i be testing? [20:26:12] jeremyb, do you happen to know the hostname of one of them? [20:26:14] Gloria: (look up) :) [20:26:20] ottomata: one of what? [20:26:34] oh, prod text? [20:26:34] sure [20:26:45] ah foudn one [20:26:48] mw1017 [20:26:50] ms2017 [20:26:50] ha [20:26:51] thanks [20:26:52] yeah that [20:27:17] yeah, that should be ok. not prod though [20:27:20] 2.2.22-1ubuntu1.4 [20:27:45] ok, so exactly the same thing i was testing against [20:27:59] jeremyb: you can check in labs i guess [20:28:04] matanya: project? [20:28:25] ottomata: so, what about other possible differences? in /etc/apache2 but outside /etc/apache2/wmf [20:29:32] oof, i dunno jeremyb, i can maybe help you more in depth in an hourish, i'm in a hangout with dan atm working on something [20:29:41] where are you testing currently? [20:30:18] jeremyb: I think https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep [20:30:25] would be a good place [20:30:29] matanya: i guess i could do that... [20:30:55] ottomata: k. i'll move to deployment-prep [20:31:28] hrmm, i wonder how it's custom (beta) hostnames are done though [20:33:04] btw, do you know http://ottoportland.com/ ? [20:33:39] jeremyb: you can use the public ip [20:34:37] ha, nice jeremyb, i do not [20:34:39] but i do konw [20:36:25] http://www.otto-usa.com/index.html [20:48:50] ksnider: is there a policy on copying out a task from RT? [20:49:42] matanya: not unless it's from the security, procurement or other secure queue (or otherwise is information that one would expect needed protecting as such) [20:50:25] ok, thanks, simple ops-request. apt-pinning task. [20:50:54] Should be no problem. :) [20:54:23] ottomata: Hi. [20:54:47] ottomata: You're looking at Jeremy's redirect changes? [20:55:25] i was talking to him about it [20:55:26] haven't really looked at them yet [20:56:48] Gloria: had some trouble testing. also, wasn't sure where to test. matanya suggested deployment-prep but after having a look there i'm not so sure [20:57:05] why jeremyb ? [20:57:05] Ah. [20:57:21] Redirect changes are scary. [20:57:36] Don't want to break *.wikimedia.org again or anything like that. [20:57:50] again :) [20:57:58] Gloria: well in this case i can't even get the old version to work (before my changes) [20:58:37] Nice. [20:59:00] Gloria: i was going to diff HTTP responses... [20:59:12] Didn't we discuss writing tests? [20:59:38] I wonder if there's a bug about that. [21:01:26] https://git.wikimedia.org/blob/operations%2Fapache-config.git/70ec6d15cff8b678f51f469800ed7c6618ff7f55/refreshDomainRedirects [21:01:31] Interesting syntax highlighting. 
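Two quick checks that usually answer the "wrong vhost" question raised above: dump Apache's parsed vhost list (for a given address:port the first VirtualHost loaded is the default when no ServerName or ServerAlias matches, which is why include order in all.conf matters), and ask the local Apache directly with an explicit Host header. The URL below is just the failing case mentioned earlier.

# Which vhosts did Apache actually load, and which one is the default for *:80?
apache2ctl -S

# Which vhost answers for this Host header, and where does it redirect?
curl -sI -H 'Host: wikimediafoundation.org' http://127.0.0.1/fundraising \
  | grep -iE '^(HTTP|Location)'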
[21:05:44] https://bugzilla.wikimedia.org/show_bug.cgi?id=43266 [21:06:59] Gloria: i wrote tests [21:07:21] not sure how to automate them or where to check in. and maybe you want to add some cases [21:08:20] I'd at least like the cases from the bug report covered. [21:08:26] Because that bug has been marked fixed quite a few times now. [21:09:14] Gloria: well it was never marked fixed by me :) [21:09:17] but yes, i agree [21:09:23] and i did attempt to test! [21:09:47] https://bugzilla.wikimedia.org/show_bug.cgi?id=31369 is the bug we're discussing, for lurkers. [21:10:03] Gloria: here's what i started with: http://dpaste.com/1561574/plain/ [21:10:33] Gloria: with both HTTP and HTTPS tests for all of them [21:10:52] Are you programmatically comparing output? [21:10:57] comparing response, I mean. [21:11:11] i was just using diff -u [21:11:21] Blergh. [21:11:24] and apache-fast-test [21:11:37] I guess the language doesn't matter. [21:11:58] anyway, can't get the currently deployed conf (in prod) to give the right results [21:12:01] But a test script should have a simple dictionary and reasonable logic. [21:12:10] maybe someone else could set up an apache for me [21:12:16] Well, we know that prod is broken. ;-) [21:12:19] Thus the bug. [21:12:21] no [21:12:24] What isn't working? [21:12:31] You've yet to explain. [21:12:37] i did explain [21:12:46] stuff is being handled by the wrong vhost [21:12:58] Where are you testing? [21:12:59] so, e.g. wikimediafoundation.org/fundraising [21:13:08] is just redirecting to wikipedia homepage [21:13:13] locally [21:13:47] With what config? [21:13:56] Does it only happen for wikimediafoundation.org? [21:15:10] no [21:15:16] it happens with lots of domains [21:16:08] as i said, commenting/uncommenting in all.conf makes a big diff [21:17:07] Oh, I see scrollback now. [21:17:18] So much noise. [21:20:03] Gloria: anyway, look at the testcases [21:21:21] Make me. [21:22:47] ... [21:23:50] (03PS1) 10OliverKeyes: Add r-base to Hadoop worker machines [operations/puppet] - 10https://gerrit.wikimedia.org/r/108633 [21:25:12] maybe someone could add me back to deployment-prep projectadmin and then i'll test on a brand new box. or suggest another project for this [21:25:16] * jeremyb runs away, bbl [21:29:12] (03PS2) 10OliverKeyes: Add r-base to Hadoop worker machines [operations/puppet] - 10https://gerrit.wikimedia.org/r/108633 [21:33:10] (03PS3) 10OliverKeyes: Add r-base to Hadoop worker machines [operations/puppet] - 10https://gerrit.wikimedia.org/r/108633 [22:24:50] PROBLEM - Puppet freshness on hooft is CRITICAL: Last successful Puppet run was Fri 17 Jan 2014 06:59:52 PM UTC [22:30:34] (03PS1) 10Odder: Add namespace aliases for NS_PROJECT on zhwikivoyage [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/108641 [22:44:55] (03PS1) 10Ottomata: Puppetizing wikimetrics [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/108643 [22:45:38] (03CR) 10Ottomata: [C: 04-1] "Original work done here:" [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/108643 (owner: 10Ottomata) [22:45:58] (03CR) 10Ottomata: "Moving to submodule here:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/96042 (owner: 10Milimetric) [22:46:08] (03Abandoned) 10Ottomata: [not ready for review] Puppetizing Wikimetrics in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/96042 (owner: 10Milimetric)
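One way to turn the test cases into something diffable, in the spirit of the apache-fast-test plus diff -u approach above: grab the status line and Location header for each case from two Apaches and diff the results. Everything below is a sketch with placeholder names; the two endpoints and the short case list are made up, and the list should grow to cover the cases from bug 31369.

cases='
wikimediafoundation.org /fundraising
wikipedia.com /
'
check() {  # $1 = the Apache to poke; the Host header stays canonical
  printf '%s\n' "$cases" | while read -r host path; do
    [ -n "$host" ] || continue
    printf '== http://%s%s\n' "$host" "$path"
    curl -sI -H "Host: $host" "http://$1$path" | grep -iE '^(HTTP|Location)'
  done
}
# bash: compare production against a test instance running the new config.
diff -u <(check text-lb.eqiad.wikimedia.org) <(check redirects-test.wmflabs.org)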