[00:03:56] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:05:56] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 9.848 second response time
[00:08:56] PROBLEM - HTTP on ekrem is CRITICAL: Connection timed out
[00:10:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: Connection timed out
[00:11:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:17:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 335 bytes in 3.929 second response time
[00:17:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.347 seconds
[00:17:46] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 7.356 second response time
[00:21:56] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:23:46] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 6.071 second response time
[00:26:55] i think i've asked this before but idk if i ever heard a resolution. why is a 400 considered "OK"? maybe we need to be checking a different path on that server? (and then checking for 200 or 30x or something)
[00:27:19] idk if my past inquiries were about the same service or nto
[00:27:21] not*
[00:27:43] (see icinga-wm and nagios-wm above)
[00:29:56] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:32:46] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 7.750 second response time
[00:35:56] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:36:58] New patchset: Ryan Lane; "Remove project groups from sudo, add ops group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5216
[00:37:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5216
[00:41:45] !log adding interface for per-project sudo on OpenStackManager
[00:41:47] Logged the message, Master
[00:50:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: Connection timed out
[00:51:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:58:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.640 seconds
[00:58:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 335 bytes in 6.602 second response time
[01:26:43] New review: Ryan Lane; "Inline messages." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/3238
[01:30:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: Connection timed out
[01:32:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:34:37] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 9.486 second response time
[01:37:37] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:38:27] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 200 seconds
[01:39:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 335 bytes in 3.953 second response time
[01:39:21] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 251 seconds
[01:39:27] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 17 seconds
[01:39:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.560 seconds
[01:42:21] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 189 seconds
[01:42:27] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 197 seconds
[01:46:19] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 18 seconds
[01:46:27] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 0 seconds
[01:51:37] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 9.678 second response time
[01:54:37] PROBLEM - HTTP on ekrem is CRITICAL: Connection timed out
[02:10:37] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 9.223 second response time
[02:11:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:11:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:12:39] hi tim, does udp2log generate the data that is consumed by domas's webstatscollector?
[02:13:37] PROBLEM - HTTP on ekrem is CRITICAL: Connection timed out
[02:14:37] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 9.032 second response time
[02:17:37] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:19:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.455 seconds
[02:19:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 335 bytes in 5.064 second response time
[02:37:28] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours
[02:42:37] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 9.708 second response time
[02:45:37] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:47:57] PROBLEM - mysqld processes on blondel is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[02:51:37] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 9.014 second response time
[03:01:37] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:02:37] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 8.654 second response time
[03:05:37] PROBLEM - HTTP on ekrem is CRITICAL: Connection timed out
[03:07:37] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 9.118 second response time
[03:13:28] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[03:13:37] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:15:37] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 9.538 second response time
[03:18:37] PROBLEM - HTTP on ekrem is CRITICAL: Connection timed out
[03:26:37] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 6.357 second response time
[03:37:37] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:42:37] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 9.056 second response time
[03:45:37] PROBLEM - HTTP on ekrem is CRITICAL: Connection timed out
[03:53:37] RECOVERY - HTTP on ekrem is OK: HTTP OK: HTTP/1.1 200 OK - 453 bytes in 8.735 second response time
[03:53:47] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours
[03:53:47] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours
[03:53:47] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[03:53:47] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: Puppet has not run in the last 10 hours
[03:53:47] PROBLEM - Puppet freshness on amssq31 is CRITICAL: Puppet has not run in the last 10 hours
[05:24:10] PROBLEM - Puppet freshness on gilman is CRITICAL: Puppet has not run in the last 10 hours
[08:05:05] New review: ArielGlenn; "ugly but necessary :-P" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5171
[08:05:49] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4896
[08:05:51] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/5171
[08:05:53] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/4896
[09:24:57] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[09:24:57] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[09:24:57] PROBLEM - Puppet freshness on es1004 is CRITICAL: Puppet has not run in the last 10 hours
[09:32:54] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[09:32:54] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[09:50:07] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[09:52:57] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time
[10:48:31] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2650*
[10:51:22] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2675*
[10:58:25] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2663*
[11:03:58] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2363
[11:08:22] !log Sending European bits traffic back to esams
[11:08:25] Logged the message, Master
[11:16:34] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2625*
[11:20:46] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2675*
[11:31:45] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[11:31:54] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2275
[11:43:00] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[11:45:51] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2613*
[11:48:42] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2325
[11:52:54] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563*
[11:59:57] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2388
[12:12:42] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2638*
[12:14:12] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2350
[12:21:06] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563*
[12:23:57] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2613*
[12:32:26] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588*
[12:35:26] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2338
[12:37:59] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours
[12:39:38] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563*
[12:43:50] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2400
[12:53:20] !log restarting/fixing etherpad issue
[12:53:22] Logged the message, Master
[12:53:58] * Reedy wonders why mutante isn't in -tech
[12:54:38] joins
[12:54:57] my screen died earlier
[12:55:02] ahh
[13:13:59] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[13:16:23] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[13:17:53] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[13:30:13] !log applied a patch to etherpad that allows admins to delete pads
[13:30:16] Logged the message, Master
[14:46:55] New patchset: Pyoungmeister; "adding add-new-wiki functionality to lucene utility script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5227
[14:47:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5227
[14:49:25] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5227
[14:49:28] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5227
[15:25:04] PROBLEM - Puppet freshness on gilman is CRITICAL: Puppet has not run in the last 10 hours
[15:56:05] New patchset: Mark Bergsma; "Revert "Make pmtpa equal to esams"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5235
[15:56:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5235
[15:56:50] New patchset: Mark Bergsma; "Revert "Stop doing translations on tier2 pmtpa hosts"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5236
[15:57:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5236
[15:57:18] New patchset: Mark Bergsma; "Revert "Make pmtpa equal to esams"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5237
[15:57:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5237
[15:57:36] Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5237
[15:58:13] New patchset: Mark Bergsma; "Revert "Temporarily place bits.pmtpa behind bits.eqiad to test if sess leakage occurs"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5238
[15:58:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5238
[15:58:30] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5235
[15:58:33] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5235
[15:58:47] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5236
[15:58:50] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5236
[15:59:06] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5238
[15:59:08] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5238
[17:22:50] New patchset: Bhartshorne; "fixed bug in regex matching only 0-9 instead of hex digits" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5241
[17:23:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5241
[17:23:18] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5241
[17:23:22] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5241
[18:24:20] apergos: just to remind you, tomorrow's our only chance to talk about the o2s stuff until May 1st
[18:24:20] I'm ooo 4/20-4/30
[18:24:21] no worries
[18:24:21] I'm not going to do it tonight, I am off for the day (9:30 pm)
[18:24:21] of course.
[18:24:21] if all goes according to plan tomorrow, I'll be looking at your code, my code, and the various use cases, to see what's needed
[18:24:21] sweet.
[18:24:22] and then checking in tomorrow evening my time (or sending some mail)
[18:24:54] I don't have to start running it right now obviously, I just want to have either a basic app or a plan that you'll do it ( :-P ) by the time the migration starts.
[18:27:35] hopefully the logger will get deployed while I'm gone
[18:27:55] and then I (or we) can finish the o2s stuff and make sure it does everything we need before the migration starts.
[18:28:46] uh huh
[18:29:07] I will be looking at what the logger writes, though I've already checked the code so I'm pretty sure I know what its output will look like.
[18:32:35] I haven't started anything to consume the output of the logger and throw it into the DB from which the filemover pulls.
[18:32:56] ok
[18:33:22] well like I say I'll look at it tomorrow
[18:33:22] no worries
[18:33:22] have fun!
[18:33:22] heh
[18:33:31] thanks
[18:33:49] the worst part is over, that was digging through the mw code to see what was going on
[18:34:07] I hadn't actually poked around in the file repo / backend stuff for several months
[18:38:17] Hey ops people?
[18:38:27] Something weird is going on with the memc check in Nagios
[18:38:29] MEMCACHED CRITICAL - Can not connect to 10.0.2.251:11000 (Connection timed out)
[18:38:31] Problem is:
[18:38:44] 10.0.2.251 does not appear to be a live IP on our network
[18:38:56] And it's commented out in mc.php, so the cluster isn't actually using it for memc
[18:40:36] what's our power plug type ?
[18:40:42] RoanKattouw: can you put that in a ticket please ?
[18:41:32] I will after investigating a little bit more
[18:41:43] Trying to determine if there is a memc server somewhere that's down / broken
[18:42:22] hah, it IS using 10.0.2.251
[18:42:24] wtf
[18:42:41] Ah, reading comprehension fail, nm
[18:43:23] OK, so new problem: 10.0.2.251 is a legit memc server and everything, except that it's down
[18:43:40] OK, so new problem: 10.0.2.251 is a legit memc server and everything, except that it's down
[18:43:42] Oops
[18:44:13] And there's a couple others that are up but look whacky
[18:45:34] it happens, apparently
[18:45:54] write a puppet rule to fix that
[18:47:00] (generally need to swap out for another, whacky servers may or may not be investigated too)
[18:47:31] The whacky ones have incr: 98
[18:47:38] So it ran a hundred incr()s and got 98 back
[18:48:30] can be a test failure, try again, and they won't be
[18:48:59] Hmm, yeah I ran it again, now it's 1 IP instead of 2 and it's different
[18:49:03] So I guess that's just intermittent
[18:49:18] But 251 is down. I'll swap it with a spare
[18:49:34] Ahm
[18:49:37] The ONE spare
[18:49:54] Now we won't have any spares any more, either that or mc.php is out of date
[18:51:02] maintaining a good spare list is also needed, sometimes
[18:51:03] some day we could check what those intermittent things are
[18:51:12] mc.php doesn't rebuild itself :)
[18:51:19] RoanKattouw: would that be an explanation for occasional loss of session data? There apparently have been reports, but I hadn't had time to follow up
[18:51:40] logging MC errors like we do for dberror could be useful
[18:51:44] especially if followed up
[18:51:51] robla: yes
[18:51:54] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[18:52:16] robla: That would explain it yes
[18:52:28] domas: People paying attention to Nagios reporting a memc down for 7 days would also be useful
[18:53:01] Just like people maintaining the spares list in mc.php so the number of spare servers is >1
[18:53:10] * RoanKattouw pokes notpeter about the mc.php thing
[18:53:49] roankattouw: :-)
[18:53:55] there's more staff than ever!
[18:56:57] useful to have dberror.log in all the different timezones :)
[18:57:08] haha
[18:57:12] Wed Apr 18 15:15:11 UTC 2012 mw25 enwiki Error connecting to 10.0.6.26: Can't connect to MySQL server on '10.0.6.26' (4)
[18:57:12] Thu Apr 19 0:16:59 KST 2012 mw22 kowiki Error connecting to 10.0.6.66: Can't connect to MySQL server on '10.0.6.66' (4) (10.0.6.66)
[18:57:12] Wed Apr 18 17:17:01 CEST 2012 mw6 dewiki Error connecting to 10.0.6.54: Can't connect to MySQL server on '10.0.6.54' (4) (10.0.6.54)
[19:05:06] what time is the switchover to 1.20?
[19:05:38] Thehelpfulone, on which project?
[19:05:48] meta
[19:07:31] Thehelpfulone, it's running, the Wiktionaries have just been moved to 1.20
[19:08:29] ok
[19:08:47] Thehelpfulone, take a look at #wikimedia-tech to follow the version changes
[19:09:18] I'm in there already, thanks I'll keep an eye on it
[19:09:49] so check for logmsgbot speaking
[19:25:27] PROBLEM - Puppet freshness on es1004 is CRITICAL: Puppet has not run in the last 10 hours
[19:25:27] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[19:25:27] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[19:33:33] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[19:33:33] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[19:59:17] hey notpeter, do you have a guess on when oxygen will be ready as a filter box?
[20:22:45] diederik: working on it right now. there's some difficulty in that it's working as both the multicast relay and a logging host, and both things are trying to use the same port, but I would say tomorrow or friday
[20:23:49] awesome, thanks so much!
[20:24:58] yep. I'm currently working on cleaning up some crappy code that some idiot wrote. damn you past-peter!
[20:28:01] wouldn't that be past-notpeter? or can peter be only observed in the past and not the present? nevermind
[20:28:49] hehehe
[20:28:59] all I can say is get me out of this box, there might be a dead cat in here
[20:29:16] eww
[20:30:00] schrodinger's cat is never a pretty topic. well, sometimes...
[20:31:44] <^demon> It's both a pretty and ugly topic.
[20:40:00] a pretty ugly topic? :-P
[21:04:51] who knows a bit more about the webstatscollector besides domas?
[21:05:18] samod
[21:06:08] whatsup
[21:06:20] I doubt anyone knows much, I've forgotten most of the stuff too!
[21:11:24] hey
[21:11:50] basic question
[21:12:40] filter.c parses the squid logs and prints them to stdout, but i don't understand how that data is sent to port 3815 where collector.c is running
[21:13:16] diederik: | log2udp :)
[21:13:28] diederik: check out the configuration in udp2log/squid
[21:13:37] nice
[21:14:11] so filter.c prints to stdout, log2udp reads that in and sends it to port 3815, and webstatscollector puts it back in a db and generates the files?
[21:14:31] webstatscollector is a db
[21:14:35] i know
[21:15:03] collector is the daemon and puts it in a berkeley-db (which you know obviously)
[21:15:17] but is that the correct data flow?
[21:16:39] squids -> udp2log -> filter -> log2udp -> collector -> files, yes
[21:17:27] thanks!
[21:17:40] that's all (for now :D)
[21:17:51] the BDB is limited to 2G iirc
[21:17:52] or was that 1G
[21:18:00] I've never checked if it is hitting the limit :)
[21:18:13] i'll look into that
[21:18:24] other concerns that you have?
[21:18:40] how to query that thing realtime!
[21:19:21] nice
[21:23:04] will my labs account work for gerrit?
[21:24:06] should do :P
[21:24:15] domas: regarding BDB limits I think we are fine: http://doc.gnu-darwin.org/am_misc/dbsizes.html
[21:24:41] well how do I make a gerrit account stwalkerster? it says public key access denied
[21:24:48] permission*
[21:24:53] diederik: no, I mean, the manually specified one
[21:24:59] diederik: it doesn't have a backing file store, iirc
[21:25:12] ssh://thehelpfulone@gerrit.wikimedia.org:29418/operations/puppet
[21:25:15] that's what I'm using
[21:25:19] you need to put your ssh key into gerrit
[21:25:24] +git clone at the front
[21:25:29] <^demon> Thehelpfulone: Gerrit doesn't copy ldap's ssh keys automatically. You have to login to gerrit the first time and add them there.
[21:25:59] is my username/password the same?
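[Editor's note: the data flow confirmed above (squids -> udp2log -> filter -> log2udp -> collector -> files) can be sketched in-process. This is a toy illustration, not the real filter.c/collector.c code: the log-line layout, the field position of the URL, and the /wiki/ path convention are all assumptions made for the example.]

```python
from collections import Counter
from urllib.parse import urlparse

def filter_line(line):
    """Toy stand-in for filter.c: extract (project, title) from one
    squid-style log line. In this toy format the request URL is
    assumed to be the last whitespace-separated field."""
    url = line.split()[-1]
    parsed = urlparse(url)
    if not parsed.path.startswith("/wiki/"):
        return None                      # not a page view; filter drops it
    project = parsed.hostname.split(".")[0]   # e.g. "en" from en.wikipedia.org
    return project, parsed.path[len("/wiki/"):]

def collector(events):
    """Toy stand-in for collector.c: aggregate per-page hit counts
    (in-process here, instead of receiving them over UDP port 3815)
    the way the hourly pagecounts files do."""
    counts = Counter()
    for event in events:
        if event is not None:
            counts[event] += 1
    return counts

lines = [
    "seq ip GET http://en.wikipedia.org/wiki/Main_Page",
    "seq ip GET http://en.wikipedia.org/wiki/Main_Page",
    "seq ip GET http://de.wikipedia.org/wiki/Hauptseite",
]
print(collector(filter_line(l) for l in lines))
```

In production the two halves run as separate processes, with log2udp carrying filter's stdout to the collector over UDP; the sketch only mimics the aggregation logic.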
[21:26:07] should be
[21:26:28] when you're logged in, go to settings (top right) -> ssh public keys
[21:26:45] so yay, making icinga have a proper mail server. right now it uses the "mail" command for many things
[21:27:36] hmm still denied
[21:27:42] it should be open SSH in gerrit right?
[21:30:17] I just copied and pasted what I put into labs stwalkerster from labs
[21:30:21] doesn't seem to like it
[21:31:03] hmm... idk then
[21:31:09] mine seemed to like it
[21:32:26] LeslieCarr: sorry, was mid-email to pediapress and I wanted to get that out
[21:32:29] ok so . . .
[21:33:13] no problem
[21:33:14] :)
[21:33:15] domas: i cannot find a manually specified limit in the source code, is it in collector.c?
[21:33:49] yeah
[21:33:58] cache size
[21:34:01] ah so it was working but it hung so I killed it
[21:34:08] LeslieCarr: the main issue I would flag is the fact that the icinga host (I forget which it is) has been configured to relay its outbound mail through mchenry etc.
[21:34:15] now I get "destination path "puppet" already exists"
[21:34:17] that's just exim config
[21:34:22] any idea how to delete that? ;)
[21:34:31] or overwrite
[21:34:34] Jeff_Green: it's neon
[21:34:39] k
[21:34:46] hm
[21:35:19] oh wait
[21:35:22] it has a backing database
[21:35:26] LeslieCarr: see neon:/etc/exim4/exim4.conf
[21:35:31] I'm talking about something else then, udpprofiler maybe
[21:35:48] okay, cool
[21:36:11] ahhh, I think I livehacked cache in once
[21:36:12] :)
[21:36:18] didn't commit back in
[21:36:21] it is in locke source
[21:36:32] that was when locke had issues
[21:36:50] LeslieCarr: smart_route is the section
[21:37:08] diederik: how do I find the path that it's cloned to?
[21:37:13] I need to remove it so that I can start again
[21:37:19] dir
[21:37:31] it should just clone it in your current working directory
[21:37:42] I don't know what that is...
[21:38:14] dir says command not found hmm
[21:40:04] Jeff_Green: so what should we change it to ? localhost ?
[21:40:17] also, if we do that, will the external mail (the pages for verizon et al) still work ?
[21:40:50] * Thehelpfulone is a git n00b, how do I figure out what my working directory is?
[21:40:57] LeslieCarr: sorry, Lily's barfing again
[21:41:06] look at the same file on aluminium
[21:41:30] Jeff_Green: no worries, sick kid > new monitoring system that will page you at 1am
[21:41:35] heh
[21:42:05] i'll try to come back in an hour or so
[21:44:10] good luck
[21:49:57] New patchset: Catrope; "Apply a custom skin to Gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3285
[21:50:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3285
[21:59:15] New patchset: Catrope; "Gerrit CSS tweak: word-wrap commit summaries" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5289
[21:59:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5289
[22:00:36] New review: Trevor Parscal; "We need this badly. Horizontal scroll bars suck." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/5289
[22:10:20] hi Ryan_Lane
[22:10:33] for some reason my public key is not working in gerrit, does it need to be in a different format to labs?
[22:11:03] well, you need to upload one there
[22:11:09] yeah I did that ;)
[22:11:11] I'm using git clone ssh://thehelpfulone@gerrit.wikimedia.org:29418/operations/puppet
[22:11:11] it can likely be in the same format
[22:11:17] what format is it in?
[22:11:28] open SSH
[22:11:31] like on labs
[22:11:33] that would work fine
[22:11:53] it tells me Permission denied (publickey)
[22:17:59] any idea what could be causing it Ryan_Lane? I'm able to access labs fine...
[22:18:11] lemme see
[22:18:19] thanks
[22:18:59] Thehelpfulone: try for me. I'm tailing the log
[22:19:50] okay, did it a couple of times
[22:20:32] I don't see any attempt
[22:20:47] huh
[22:21:05] I get a fatal: The remote end hung up unexpectedly too
[22:21:15] can you telnet to the port for me?
[22:21:33] and if that works, then try sshing to it
[22:22:01] ok one sec
[22:23:11] wow it's slow to load windows features to enable it..
[22:26:38] brb
[22:38:38] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours
[22:38:43] back Ryan_Lane, just trying to enable telnet.
[22:39:47] eh?
[22:39:52] you don't need a telnet server
[22:39:55] just a client
[22:40:09] windows doesn't have a telnet client installed by default these days :(
[22:40:13] I know, but the client is not enabled on windows by default
[22:40:16] I fucking hate windows
[22:42:17] I hate to state the possibly annoying now that it's too late, but doesn't putty have telnet/raw connection types that don't require system reboots to use?
[22:42:32] probably
[22:42:53] people should really stop using windows :)
[22:42:58] well windows is not giving me telnet... an error occurred :(
[22:43:05] try putty
[22:43:10] or use cygwin
[22:43:19] or a linux virtual machine
[22:43:35] or an operating system that doesn't actively hate tech people
[22:43:36] :)
[22:46:06] I'll try putty
[22:46:52] hmm it comes out odd, mangled text
[22:47:15] that's fine
[22:47:19] it connected, though?
[22:47:22] now try ssh
[22:47:25] with your key
[22:47:28] ok
[22:47:49] yeah it took me to manganese.wikimedia.org
[22:47:50] on telnet
[22:48:17] ahah! putty fatal error
[22:48:44] you are trying port 29418, right?
[22:48:45] what should the port be Ryan_Lane? 22 or 29418?
[22:48:55] 22 won't work
[22:49:08] looks like it worked to me
[22:49:17] so putty's just disappeared
[22:49:33] it connected and disconnected you
[22:49:45] because you aren't actually allowed to ssh to the service
[22:49:49] ok
[22:49:53] but, it's working
[22:50:02] there's something wrong with your git config, or something
[22:50:05] so it's something to do with the evil that is windows
[22:50:09] or git yeah
[22:50:15] how do I uninstall git and start again?
[22:50:20] I have no clue
[22:50:26] I haven't used Windows in years
[22:50:34] oh it's a windows program, I'll just use the normal uninstall process
[22:50:36] THO|Cloud, how did you install it?
[22:50:46] :)
[22:50:52] I tried the netinstaller and the normal one stwalkerster
[22:51:05] I think that might be causing the problems, I don't think I got rid of the netinstaller properly
[22:54:54] * THO|Cloud grumbles about windows
[22:55:41] !log updating exim4.conf on mchenry to not allow old ranges
[22:55:44] Logged the message, Mistress of the network gear.
[22:56:11] github.com/whym
[23:09:34] domas: where is the code on locke?
[23:13:32] domas: nm, got it
[23:14:47] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
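[Editor's note: the connectivity test Ryan_Lane walked Thehelpfulone through earlier (telnet to gerrit's SSH port, see some "mangled text", then get disconnected) amounts to reading the server's SSH identification banner, which an SSH server sends before any authentication. A minimal sketch; the host and port come from the clone URL quoted in the log, and this is an illustration rather than a tool anyone in the channel actually ran.]

```python
import socket

def read_ssh_banner(host, port, timeout=10):
    """Connect and return the server's SSH identification line.
    SSH servers send a line like 'SSH-2.0-...' as soon as the TCP
    connection opens, so getting a banner proves the port is
    reachable even though an unauthenticated session is then
    dropped (the 'connected and disconnected you' above)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        reader = sock.makefile("rb")
        return reader.readline().decode("ascii", "replace").strip()

# e.g. read_ssh_banner("gerrit.wikimedia.org", 29418)
# (port 22 would reach the host's regular sshd, not Gerrit's SSH service)
```

The "mangled text" seen in PuTTY is this banner followed by the binary key-exchange packets, which a telnet client renders as garbage.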