[00:11:10] PROBLEM - Puppet freshness on knsq9 is CRITICAL: Puppet has not run in the last 10 hours
[00:45:18] zzz ==_____==
[00:47:11] New review: tstarling; "(no comment)" [analytics/udp-filters] (master) - https://gerrit.wikimedia.org/r/2142
[01:36:03] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 1321 seconds
[01:40:57] New review: tstarling; "Looks good overall, it's certainly much better than the collection of filters that came before. Ther..." [analytics/udp-filters] (master) C: 0; - https://gerrit.wikimedia.org/r/2142
[01:43:33] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:06:09] !log LocalisationUpdate completed (1.18) at Tue Jan 31 02:06:09 UTC 2012
[02:06:10] Logged the message, Master
[02:08:23] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 865s
[02:21:21] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1642s
[02:30:01] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:41:21] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:43:41] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[02:44:41] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:02:31] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 39s
[03:07:11] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.920 seconds
[03:15:50] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 120 MB (1% inode=60%): /var/lib/ureadahead/debugfs 120 MB (1% inode=60%):
[03:15:50] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 17 MB (0% inode=60%): /var/lib/ureadahead/debugfs 17 MB (0% inode=60%):
[03:17:51] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.744 seconds
[03:27:00] RECOVERY - Disk space on srv223 is OK: DISK OK
[03:49:40] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=60%): /var/lib/ureadahead/debugfs 0 MB (0% inode=60%):
[03:51:30] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:51:50] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:02:50] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.421 seconds
[04:05:13] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.750 seconds
[04:07:23] RECOVERY - Disk space on srv221 is OK: DISK OK
[04:12:06] who knows about the new mailman setup?
[04:12:12] how many web backends?
[04:12:19] is there an LVS or a squid or a varnish?
[04:12:41] Dmcdevit: is getting vastly different results than i but ping gives him the same IP i have
[04:13:13] and i see no relevant headers to tell me if there's a proxy
[04:18:33] RECOVERY - Disk space on srv219 is OK: DISK OK
[04:23:23] RECOVERY - MySQL disk space on es1004 is OK: DISK OK
[04:23:33] RECOVERY - Disk space on es1004 is OK: DISK OK
[04:35:12] * jeremyb gives up on searching the puppet repo
[04:36:08] Ryan_Lane: multichill: any ideas? ^^
[04:36:22] err, that was supposed to be mutante
[04:36:32] jeremyb: huh?
[04:36:48] multichill: < jeremyb> err, that was supposed to be mutante
[04:37:08] multichill: wow, it's early there! good morning!
[04:37:09] Ah, I'm generally not awake at half past 6 in the morning ;-)
[04:37:22] yeah, i chose based on TZ. i left out ma rk
[04:38:53] I was just called out of bed
[04:39:01] oh, page?
[04:39:30] * jeremyb has some trouble reconciling half past 6. isn't it half past 5?
[04:39:33] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No
[04:40:24] Yes, it is ..... too early
[04:40:25] multichill: are you up for good?
[04:42:31] i guess not!
[04:43:00] wb multichill ;)
[04:43:42] wtf
[04:45:13] Looks like I'm not getting anymore sleep
[04:56:27] well in the absence of an actual sysadmin, maybe i could have a couple guinea pigs send themselves password reminders for any list via the web UI?
[04:56:38] i just want to know if the form submits ok
[04:57:19] RoanKattouw: hey, you're the one with kinda root but don't really use it?
[04:57:25] New patchset: tstarling; "Disable wmerrors log_backtrace since it is buggy and causes segfaults." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2152
[04:57:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2152
[04:57:52] That's me
[04:57:57] New review: tstarling; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2152
[04:58:36] where did the +2 option go?
[04:59:00] TimStarling: looks like you only did +1?
[04:59:13] that was the maximum
[04:59:22] someone changed your bits?
[04:59:23] Then you don't have sufficient rights on that repo
[04:59:38] I used to have +2 on the test branch, I don't have that anymore either
[04:59:42] can you poke sodium and see if it looks normal? also any idea if it's behind some kind of load balancing? (lvs/squid/varnish)
[05:00:05] I don't even know what sodium is or does
[05:00:29] Dmcdevit is having issues with the web UI (chrome saying "Error 324 (net::ERR_EMPTY_RESPONSE): The server closed the connection without sending any data")
[05:00:30] presumably it is a mistake
[05:00:42] RoanKattouw: mailman
[05:00:47] Ryan also said he'd accidentally kicked me out of the testlabs group
[05:01:13] he and I get the same IP with ping and it works fine for me
[05:01:44] Sodium responds to ping and ssh, and it seems to be idle according to top(1)
[05:02:42] Oh, hmm, OK
[05:02:56] New review: tstarling; "(no comment)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2152
[05:03:15] heh, No score
[05:04:06] lighttpd is running and I see nothing weird in its logs
[05:06:27] the fact that puppet gives mailman the star cert makes me think that it's not proxied
[05:13:00] !log since puppet is broken, disabled wmerrors backtrace logging by adding a separate configuration file in /etc/php5/conf.d and reloading apache
[05:13:02] Logged the message, Master
[05:13:33] puppet is broken or gerrit is?
[05:14:08] puppet is broken in that it uses a broken instance of gerrit
[05:14:16] which is broken in the sense that it won't let me do things on it
[05:23:39] !log the segfaults didn't stop, so I'm disabling wmerrors entirely for now
[05:23:40] Logged the message, Master
[05:24:47] guess what?
[05:25:38] they stopped?
[05:25:48] not giving me a lot of choices
[05:25:49] they didn't stop
[05:26:12] is this dsh or just one box or what?
[05:27:27] never mind, it looks like the wmerrors-related segfaults did stop
[05:27:35] we've just got other segfaults now
[05:27:51] and it's everywhere, not one box
[05:28:46] basically you get a fatal error and then a segfault as the process shuts down
[05:29:31] Jan 31 05:18:26 10.0.11.49 apache2[18807]: PHP Fatal error: Allowed memory size of 125829120 bytes exhausted (tried to allocate 72 bytes) in /usr/local/apache/common-local/php-1.18/includes/parser/Preprocessor_DOM.php on line 797
[05:29:31] Jan 31 05:18:26 10.0.11.49 apache2[5585]: [notice] child pid 18807 exit signal Segmentation fault (11)
[05:29:34] like that
[05:30:42] this is the same thing you enabled core dumps for the other day?
[05:30:48] yes
[05:31:12] but I don't have time to fix it right now so I'm trying to get the site into some approximate working order
[05:31:47] did someone disable core dumps on that box?
[05:32:00] i've not a clue
[05:32:20] so, that limit it's hitting is exactly 120MB fwiw
[05:32:30] actually I put it in a puppetized configuration file, and puppet is running again
[05:32:33] so it'll be reverted
[05:34:25] ok there's been no more segfaults since 05:24
[05:34:40] I only did a graceful restart so it was probably just a process that hadn't finished yet
[05:34:53] no more relevant segfaults, I should say
[05:42:38] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
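For reference, the 125829120-byte limit in the fatal error pasted at 05:29:31 is the same figure as the "exactly 120MB" noted at 05:32:20. A minimal Python sketch of that arithmetic, assuming binary megabytes (1 MB = 1024 * 1024 bytes); the helper name `bytes_to_mib` is made up for illustration and the log line is shortened to the part that matters:

```python
import re

# Excerpt of the fatal error pasted in channel at 05:29:31 (file path omitted).
LOG_LINE = (
    "PHP Fatal error: Allowed memory size of 125829120 bytes exhausted "
    "(tried to allocate 72 bytes)"
)

def bytes_to_mib(n_bytes: int) -> float:
    """Convert a byte count to binary megabytes (MiB). Hypothetical helper."""
    return n_bytes / (1024 * 1024)

# Pull the configured limit out of the log line and convert it.
match = re.search(r"Allowed memory size of (\d+) bytes", LOG_LINE)
limit_bytes = int(match.group(1))
limit_mib = bytes_to_mib(limit_bytes)
print(limit_mib)  # 120.0
```

The failed allocation was only 72 bytes, so the process was already sitting at the limit when the error fired.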
[05:43:37] Change abandoned: tstarling; "superseded by configuration pushed out with dsh" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2152
[05:48:09] PROBLEM - Full LVS Snapshot on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:56:54] RECOVERY - Full LVS Snapshot on db42 is OK: OK no full LVM snapshot volumes
[05:57:04] PROBLEM - MySQL disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:08:14] RECOVERY - MySQL disk space on db42 is OK: DISK OK
[06:34:26] !log added myself to the gerrit "administrators" group
[06:34:27] Logged the message, Master
[07:02:32] TimStarling: hm. you should be able to do everything in gerrit
[07:02:41] lemme make sure you are in the ops groups
[07:03:04] you're in the ops group...
[07:12:33] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:16:51] Any server admins around?
[07:17:07] Got a tech question re pending changes
[07:17:31] Go ahead and ask
[07:17:36] So, if pending changes were to be disabled on new
[07:17:39] Enwp
[07:17:55] Would it make a mess of the logs?
[07:18:13] Read: disabled = extension deleted
[07:19:15] I'm looking at closing an RFC on the removal of pending changes on enwp
[07:19:41] Some think its complete removal including the user right could make a mess
[07:19:45] hmm, I'm not sure
[07:19:52] You'd have to ask Aaron Schulz
[07:20:02] Hence I thought I should ask.
[07:20:05] Who is not on IRC right now, so I suggest you e-mail him
[07:20:18] What's their email?
[07:20:34] aschulz at wikimedia
[07:20:41] Ta
[07:20:48] :)
[07:22:53] RoanKattouw: which change were you guys trying to merge?
[07:23:17] I wasn't trying to do anything
[07:23:23] Tim was trying to change something in wmerrors I think
[07:23:26] I don't see one merged or still around
[07:23:45] When he mentioned he couldn't +2 anymore, I mentioned I had a similar issue a while ago but that you'd fixed it by putting me back into testlabs
[07:23:50] ah https://gerrit.wikimedia.org/r/2152
[07:24:00] that's the production branch
[07:24:02] he's in the ops group
[07:24:05] it should just work
[07:24:16] I'm betting it had a hiccup with ldap at that point
[07:24:24] I need to figure out how to point it to multiple LDAP servers
[07:24:39] damn documentation doesn't say how
[07:40:39] Ryan_Lane: run LDAP behind LVS?
[07:41:00] LDAP failover is client-side, not server-side
[07:41:08] i don't follow
[07:41:17] i'm just saying if gerrit doesn't have that option...
[07:41:20] clients automatically failover to another server
[07:41:24] I'm sure gerrit has it
[07:41:34] it's using LDAP libraries for java
[07:41:41] it's absolutely supported there
[07:42:09] worst case I just read the code and see how they are putting in the server lists
[07:42:44] i saw a place that had a custom service that would just check backends periodically and update iptables to change where requests went. i.e. all reqs went to the same place until a change was made and then all went to the new place
[07:43:02] o.O
[07:43:12] was called pmilb. poor mans iptables lb
[07:43:13] that's a terrible hack
[07:44:00] i noticed
[07:44:06] I guess it works when you have nothing else :D
[07:45:32] the thing that really sucked about it was that it only worked on traffic coming from other machines, not on traffic from the box where the iptables was. (maybe fixable i guess but i didn't really try) so services on the same box that used LDAP auth had to hard code a backend
[07:45:59] wait, this was for LDAP?
[07:46:04] * Ryan_Lane groans
[07:46:16] yes...
[07:46:25] I don't know of a single LDAP library that doesn't support client failover
[07:46:33] actually the backends where AD
[07:46:35] were*
[07:46:42] that doesn't matter
[07:46:49] of course not
[07:46:57] AD is just LDAP with a bastardized kerberos and a weird schema
[07:47:10] the main consumers were java and apache i think
[07:47:12] *weird proprietary schema
[07:47:21] and both support client failover
[07:47:38] * jeremyb has no answers
[07:47:40] heh
[07:47:57] someone needs to hit that admin with a cluestick :)
[08:23:18] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[08:23:18] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours
[08:44:14] TimStarling: the wikitech page should probably mention that it runs under jetty. in particular I wonder how this part works: Gerrit shells out to GitWeb, a Perl application which is installed using the stock package.
[08:44:19] does it use jetty's CGI support or some other method?
[08:52:00] * jeremyb wonders if the prod apc is somehow shared between servers or each has its own cache?
[08:52:06] * jeremyb sleeps
[08:54:12] That's wrong, it doesn't shell out
[08:54:16] afaict
[08:54:20] gitweb is just a web app afaict
[09:14:13] is there any reason we use svn+ssh:// to fetch the wmf branch on fenari ?
[09:14:34] It's for laziness, allows people to commit from fenari
[09:14:47] that also prevents me from fetching the change
[09:14:53] since my ssh key is not there :-)
[09:16:18] Can't you just forward your agent or put your key there?
[09:16:53] not really willing to share that private key to every WMF roots :D
[09:16:59] but I have a WMF ssh key on the cluster
[09:17:18] guess I can have it added to svn.wm.org
[09:17:43] In the meantime, would you like me to run svn up for you?
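The client-side failover described around 07:41 (the client holds a list of servers and moves on to the next one when a connection fails) can be sketched in Python. This is a generic illustration only, not how Gerrit or any particular LDAP library implements it; `connect_with_failover`, the `connect` callable and the host names are all hypothetical.

```python
from typing import Callable, Iterable, Tuple, TypeVar

Conn = TypeVar("Conn")

def connect_with_failover(
    servers: Iterable[str],
    connect: Callable[[str], Conn],
) -> Tuple[str, Conn]:
    """Try each server in order; return the first one that accepts a connection."""
    errors = []
    for host in servers:
        try:
            return host, connect(host)
        except OSError as exc:  # refused, timed out, unreachable, ...
            errors.append(f"{host}: {exc}")
    raise ConnectionError("all servers failed: " + "; ".join(errors))

# Stand-in for a real connect call: the first (hypothetical) host is down.
def fake_connect(host: str) -> str:
    if host == "ldap1.example.org":
        raise OSError("connection refused")
    return f"connection to {host}"

host, conn = connect_with_failover(
    ["ldap1.example.org", "ldap2.example.org"], fake_connect
)
print(host)  # ldap2.example.org
```

With this shape nothing on the server side has to change when a backend goes away, which is the contrast with the iptables-based approach mentioned at 07:42.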
[09:17:52] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours
[09:18:04] http://www.google.com/search?q=ssh+commit <-- we are the first link
[09:18:09] hashar: use SSH agent...
[09:18:49] lol
[09:19:06] RoanKattouw: that is r110001 , makes Badtitle error pages emit a 400 HTTP status code
[09:19:18] RoanKattouw: that was requested by the mobile team since it seems to break some mobile agents
[09:19:26] bug report : https://bugzilla.wikimedia.org/show_bug.cgi?id=33646
[09:19:33] rev: https://www.mediawiki.org/wiki/Special:Code/MediaWiki/110001
[09:19:48] You haven't actually merged it to 1.18wmf1
[09:20:00] saper: yeah I could use SSH agent forwarding. But our guru told me it was bad practice to forward a key :-D
[09:20:13] yeah I have also lost access to 1.18wmf1
[09:20:21] Oh, meh
[09:20:24] I'll merge it then
[09:20:31] * hashar wonders why I tried to find out how to sync when I can't even merge ...
[09:20:34] sorry roan :-(
[09:20:40] I am probably still asleep
[09:20:47] * RoanKattouw has been up since 5am
[09:21:22] been up for 20 hours yesterday, reached bed at 3am
[09:21:30] ouch
[09:21:40] daughter has teething pain
[09:21:46] I was up for 24 hours on Saturday/Sunday
[09:22:03] at your age, you could probably stay up for 48 hours straight
[09:22:27] Well, I did discover what it takes for me to sleep on a plane on this trip
[09:22:54] I got on a plane at 11pm, flew for 13 hours, walked around an airport for two hours, got back on the plane, flew for another 2 hours, then slept for 3
[09:23:55] So I'd been up for, let's see ...
[09:24:19] have you slept during the 13-hour flight?
[09:24:20] 9am to 4pm the next day, so that's 31 hours
[09:24:28] No, I didn't sleep during the first leg
[09:24:38] doh
[09:24:45] Tried to but couldn't
[09:25:06] I felt really tired around 2am, tried to sleep, then at 3:30am I felt wide awake and watched a movie
[09:25:34] Grr, someone's been touching 1.18wmf1 without syncing
[09:25:38] * RoanKattouw finds out who
[09:25:43] hashar: the agent does not send the key anywhere, it just responds to challenges. If PKCS#11 smartcard API is used, it even cannot have a key
[09:27:47] saper: as I understand it, any root user on the server would be able to hijack my identity
[09:27:50] Reedy, of course
[09:27:50] !log catrope synchronized php-1.18/includes/Exception.php 'r110368'
[09:27:52] Logged the message, Master
[09:28:05] yeah Reedy is our merging / shell bug bot
[09:28:25] !log catrope synchronized php-1.18/includes/Wiki.php 'r110368'
[09:28:26] Logged the message, Master
[09:29:14] hashar: There you go ---^^
[09:29:38] $ curl -I 'http://en.wikipedia.org/wiki/%5B%5B'
[09:29:38] HTTP/1.0 400 Bad Request
[09:29:40] \o/
[09:31:22] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:31:30] RoanKattouw: thanks a ton!
[09:46:52] PROBLEM - MySQL Slave Running on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:46:52] PROBLEM - Disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:49:10] PROBLEM - DPKG on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:55:10] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 404454 MB (3% inode=99%):
[09:58:30] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 383892 MB (3% inode=99%):
[10:01:07] RECOVERY - DPKG on db42 is OK: All packages OK
[10:01:07] RECOVERY - Disk space on db42 is OK: DISK OK
[10:01:07] RECOVERY - MySQL Slave Running on db42 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[10:16:20] when is the editing from the current diff thing going to be fixed
[10:16:42] cause it is getting really tiresome
[10:18:20] What "thing"?
[10:18:23] I'm not aware of this bug
[10:18:48] You can't edit from the current diff window?
[10:18:51] Or at least I cannot
[10:19:01] link?
[10:19:11] http://en.wikipedia.org/w/index.php?title=User_talk%3ARyulong&action=historysubmit&diff=474158643&oldid=474158598
[10:19:15] no edit links anywhere
[10:19:28] and by "edit from the current diff window" I mean edit sections
[10:19:40] aaah
[10:19:43] there's a bugzilla report about it
[10:20:10] how was this something that the dev went "Hmm, no I don't think anyone wants that anymore"?
[10:20:45] it might not have been removed deliberately
[10:22:00] PROBLEM - Puppet freshness on knsq9 is CRITICAL: Puppet has not run in the last 10 hours
[10:23:24] New review: Dzahn; "https://developer.mozilla.org/en/Mobile/Viewport_meta_tag" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2110
[10:23:24] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2110
[10:31:12] New review: Dzahn; "looks good and already got a +2, just not verified" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2109
[10:31:13] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2109
[11:08:43] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:11:23] RECOVERY - MySQL slave status on es1004 is OK: OK:
[11:11:53] PROBLEM - MySQL Idle Transactions on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:12:43] PROBLEM - MySQL Slave Running on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:12:43] PROBLEM - Disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:13:23] PROBLEM - mysqld processes on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:22:43] PROBLEM - RAID on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:25:43] PROBLEM - Full LVS Snapshot on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:25:53] PROBLEM - MySQL disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:27:23] PROBLEM - DPKG on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:33:23] PROBLEM - MySQL Recent Restart on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:50:08] New patchset: Tim Starling; "Disable wmerrors since it is buggy and causes segfaults." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2154
[11:50:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2154
[11:52:20] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2154
[11:52:26] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2154
[11:52:27] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2154
[12:14:59] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours
[12:20:47] TimStarling: the wikitech page should probably mention that it runs under jetty. in particular I wonder how this part works: Gerrit shells out to GitWeb, a Perl application which is installed using the stock package.
[12:20:49] That's wrong, it doesn't shell out
[12:20:55] final Process proc =
[12:20:56] Runtime.getRuntime().exec(new String[] {gitwebCgi.getAbsolutePath()},
[12:20:56] makeEnv(req, project), repo.getDirectory());
[12:21:07] looks like shelling out to me
[12:22:43] this is GitWebServlet.java
[12:23:06] Ah, so it does
[12:23:16] I thought gitweb had its own web frontend, guess not
[12:23:58] gitweb is a perl web app
[12:24:25] gerrit is pretending to be a regular web server
[12:24:31] Oh, that's right
[12:24:38] shelling out to perl with the environment set up like ordinary CGI
[12:24:38] Gerrit impersonates every server type on the planet
[12:24:53] Or at least HTTP, SSH and git
[12:57:20] RECOVERY - mysqld processes on db42 is OK: PROCS OK: 1 process with command name mysqld
[12:57:20] RECOVERY - DPKG on db42 is OK: All packages OK
[12:57:50] RECOVERY - MySQL Slave Running on db42 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[12:59:30] RECOVERY - MySQL disk space on db42 is OK: DISK OK
[12:59:30] RECOVERY - Full LVS Snapshot on db42 is OK: OK no full LVM snapshot volumes
[13:01:20] RECOVERY - MySQL Idle Transactions on db42 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[13:03:10] RECOVERY - RAID on db42 is OK: OK: State is Optimal, checked 2 logical device(s)
[13:09:00] RECOVERY - Disk space on db42 is OK: DISK OK
[13:14:20] RECOVERY - MySQL Recent Restart on db42 is OK: OK 3416479 seconds since restart
[13:45:50] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:50:10] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:30:01] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:35:11] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
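The `Runtime.getRuntime().exec(...)` call quoted from GitWebServlet.java at 12:20:55 takes a command, an environment and a working directory, which is the usual shape of a CGI invocation. A rough Python analogue follows, as a sketch only: `run_cgi` and the variable values are hypothetical, the child process is a stand-in Python one-liner rather than gitweb, and the variables shown are standard CGI names, not necessarily what Gerrit's `makeEnv` sets.

```python
import os
import subprocess
import sys
from typing import Dict, List

def run_cgi(argv: List[str], cgi_env: Dict[str, str], cwd: str) -> str:
    """Run a CGI-style program with a replaced environment; return its stdout."""
    # As with exec(cmd, envp, dir) in Java, the child sees only this environment
    # (PATH is carried over so the interpreter can still be located).
    env = {"PATH": os.environ.get("PATH", "")}
    env.update(cgi_env)
    result = subprocess.run(
        argv, env=env, cwd=cwd, capture_output=True, text=True, check=True
    )
    return result.stdout

# Stand-in "CGI script": echoes two of the variables it was handed.
script = "import os; print(os.environ['REQUEST_METHOD'], os.environ['QUERY_STRING'])"
output = run_cgi(
    [sys.executable, "-c", script],
    {
        "GATEWAY_INTERFACE": "CGI/1.1",
        "REQUEST_METHOD": "GET",
        "QUERY_STRING": "p=example.git",
    },
    cwd=os.getcwd(),
)
print(output.strip())  # GET p=example.git
```

The request reaches the child purely through environment variables and the working directory, which is why the 12:24:38 message describes it as the environment being "set up like ordinary CGI".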
[14:58:11] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:02:43] saper: are you on?
[15:06:08] mutante: yes
[15:06:18] hallochen
[15:06:22] !log changed nameservers for wikimedia.pl per [[RT:2277]]/[[bugzilla:33509]]
[15:06:24] Logged the message, Master
[15:06:31] <<-- :)
[15:06:42] saper: i _just_ hit save
[15:06:55] great
[15:07:19] gerrit is taking over the world
[15:07:21] https://github.com/openstack-ci/git-review/pull/3#issuecomment-3740052
[15:07:41] my pull request to github was dismissed automatically and forced me to use … gerrit!! :D
[15:07:50] saper: fns1.42.pl 79.98.145.34 / fns2.42.pl 195.80.237.194
[15:09:00] perfect
[15:09:14] 86400 seconds are cached anyway
[15:09:50] mutante: do they accept IPv6 glues?
[15:10:01] fns2.42.pl is 2a02:2978::a503:4209:2
[15:11:48] saper: "Wrong format of Ip Address!" :/ ..i'll try to find out
[15:12:08] don't worry
[15:12:12] it's in a different domain
[15:12:16] ok
[15:12:23] so glue is not really necessary
[15:12:35] alright
[15:12:36] you might ask them or change registrars
[15:13:01] i'll bring it up, might be something for IPv6 enable day or so
[15:13:22] good test when looking for registrars - do they support IPv6 glues and DNSSEC
[15:13:37] * mutante nods
[15:17:34] sixxs has a list http://www.sixxs.net/faq/dns/?faq=ipv6glue
[15:20:58] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 4.968 seconds
[15:21:08] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Tue Jan 31 15:20:53 UTC 2012
[15:28:15] mutante: maybe not up to date as many things there unfortunately
[15:41:55] Dmcdevit: still reliably broken?
[15:59:38] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.373 seconds
[16:03:18] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:03:48] PROBLEM - Host ekrem is DOWN: CRITICAL - Host Unreachable (208.80.152.178)
[16:13:48] is the RC feed down?
[16:15:17] probably a TS issue
[16:15:19] Thehelpfulone: you mean the wmf ircd?
[16:15:56] hrmm, ekrem is dead? is that still the WAP proxy? can't remember what else
[16:16:12] Thehelpfulone: http://en.wikipedia.org/w/index.php?title=Special:RecentChanges&feed=rss ? worksforme ?
[16:16:22] yes
[16:16:26] sorry the IRC one
[16:16:27] mutante, the RC to IRC bridge
[16:16:29] Oh, that'd be the IRC box, yeah
[16:17:12] 16:20 <+nagios-wm> RECOVERY - HTTP on ekrem is OK:
[16:17:13] hmm
[16:17:15] oh, ekrem is also IRC? i forgot
[16:17:29] mutante: whole host down is more recent
[16:17:39] !log ekrem suddenly died around 16:03 UTC, breaking the RC IRC feed
[16:17:41] Logged the message, Mr. Obvious
[16:17:57] mark: ----^^
[16:19:19] connecting to ekrem mgmt
[16:19:46] last output i can see is just: * Stopping web
[16:20:06] frozen, will powercycle
[16:21:02] !log powercycling ekrem - mgmt just showed "Stopping web" and was frozen completely
[16:21:03] Logged the message, Master
[16:22:23] ..recovering journal..
[16:24:18] /-\|/-\|/-\
[16:24:32] ekrem login:
[16:24:47] enjoy this hold music while we replay the journal
[16:25:29] whats the process you need for the IRC bridge? checking
[16:26:22] does it not just fix itself on boot?
[16:26:28] RECOVERY - Host ekrem is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms
[16:27:09] jeremyb: well, docs say "you will want to start up the bot if it is not running already. :)
[16:27:20] how useful
[16:27:28] It should start itself
[16:27:30] ircd
[16:27:51] At least according to puppet, ircd should start automatically
[16:28:07] !log ekrem - su -c /usr/local/ircd-ratbox/bin/ircd irc
[16:28:09] Logged the message, Master
[16:28:11] Nikerabbit you about?
[16:28:58] RoanKattouw: umpf, it cant run puppet :/
[16:29:21] Well ircd seems to be running
[16:29:25] good
[16:29:42] The IRC server let me connect
[16:30:07] !log reedy synchronized php-1.18/extensions/SpamBlacklist/SpamBlacklist_body.php 'r110401'
[16:30:08] !log ekrem - gets Error 500 on SERVER when running puppet
[16:30:08] Logged the message, Master
[16:30:10] Logged the message, Master
[16:31:11] However, the bridge is not running
[16:32:06] which is weird
[16:32:19] I mean the relay does seem to be running, it's just not working
[16:32:27] killed it again, started again (because it says to start AFTER ircd runs (of course))
[16:32:30] hmm
[16:32:48] i looked at http://wikitech.wikimedia.org/view/IRC#Starting_the_bot
[16:32:53] aah
[16:33:01] I guess that dependency isn't in puppet
[16:33:13] well, puppet run breaks with Error 500 here
[16:33:35] That shouldn't matter, puppet should have installed the init scripts a long time ago, right?
[16:33:38] Hm, or maybe not
[16:33:48] Well the relay thingy is running
[16:33:52]