[00:19:26] PROBLEM host: asher1 is DOWN address: asher1 CRITICAL - Host Unreachable (asher1) [00:19:46] PROBLEM host: hugglewa-w1 is DOWN address: hugglewa-w1 CRITICAL - Host Unreachable (hugglewa-w1) [00:19:46] PROBLEM host: puppet-lucid is DOWN address: puppet-lucid CRITICAL - Host Unreachable (puppet-lucid) [00:19:56] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [00:20:46] PROBLEM host: pad2 is DOWN address: pad2 CRITICAL - Host Unreachable (pad2) [00:20:46] PROBLEM host: webserver-lcarr is DOWN address: webserver-lcarr CRITICAL - Host Unreachable (webserver-lcarr) [00:23:46] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [00:23:46] PROBLEM host: dumps-4 is DOWN address: dumps-4 CRITICAL - Host Unreachable (dumps-4) [00:49:26] PROBLEM host: asher1 is DOWN address: asher1 CRITICAL - Host Unreachable (asher1) [00:49:46] PROBLEM host: hugglewa-w1 is DOWN address: hugglewa-w1 CRITICAL - Host Unreachable (hugglewa-w1) [00:49:46] PROBLEM host: puppet-lucid is DOWN address: puppet-lucid CRITICAL - Host Unreachable (puppet-lucid) [00:49:56] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [00:50:46] PROBLEM host: pad2 is DOWN address: pad2 CRITICAL - Host Unreachable (pad2) [00:50:46] PROBLEM host: webserver-lcarr is DOWN address: webserver-lcarr CRITICAL - Host Unreachable (webserver-lcarr) [00:53:46] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [00:53:46] PROBLEM host: dumps-4 is DOWN address: dumps-4 CRITICAL - Host Unreachable (dumps-4) [01:19:26] PROBLEM host: asher1 is DOWN address: asher1 CRITICAL - Host Unreachable (asher1) [01:19:46] PROBLEM host: hugglewa-w1 is DOWN address: hugglewa-w1 CRITICAL - Host Unreachable (hugglewa-w1) [01:19:46] PROBLEM host: puppet-lucid is DOWN address: puppet-lucid CRITICAL - Host Unreachable (puppet-lucid) [01:19:56] PROBLEM host: turnkey-1 is DOWN 
address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [01:20:46] PROBLEM host: pad2 is DOWN address: pad2 CRITICAL - Host Unreachable (pad2) [01:20:46] PROBLEM host: webserver-lcarr is DOWN address: webserver-lcarr CRITICAL - Host Unreachable (webserver-lcarr) [01:23:46] PROBLEM host: dumps-4 is DOWN address: dumps-4 CRITICAL - Host Unreachable (dumps-4) [01:23:46] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [01:49:26] PROBLEM host: asher1 is DOWN address: asher1 CRITICAL - Host Unreachable (asher1) [01:49:46] PROBLEM host: hugglewa-w1 is DOWN address: hugglewa-w1 CRITICAL - Host Unreachable (hugglewa-w1) [01:49:46] PROBLEM host: puppet-lucid is DOWN address: puppet-lucid CRITICAL - Host Unreachable (puppet-lucid) [01:49:56] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [01:50:46] PROBLEM host: webserver-lcarr is DOWN address: webserver-lcarr CRITICAL - Host Unreachable (webserver-lcarr) [01:50:46] PROBLEM host: pad2 is DOWN address: pad2 CRITICAL - Host Unreachable (pad2) [01:53:46] PROBLEM host: dumps-4 is DOWN address: dumps-4 CRITICAL - Host Unreachable (dumps-4) [01:53:46] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [02:05:24] petan|wk: hi, still need reboots? 
[02:19:26] PROBLEM host: asher1 is DOWN address: asher1 CRITICAL - Host Unreachable (asher1) [02:19:46] PROBLEM host: hugglewa-w1 is DOWN address: hugglewa-w1 CRITICAL - Host Unreachable (hugglewa-w1) [02:19:46] PROBLEM host: puppet-lucid is DOWN address: puppet-lucid CRITICAL - Host Unreachable (puppet-lucid) [02:19:56] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [02:20:46] PROBLEM host: pad2 is DOWN address: pad2 CRITICAL - Host Unreachable (pad2) [02:20:46] PROBLEM host: webserver-lcarr is DOWN address: webserver-lcarr CRITICAL - Host Unreachable (webserver-lcarr) [02:23:46] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [02:23:46] PROBLEM host: dumps-4 is DOWN address: dumps-4 CRITICAL - Host Unreachable (dumps-4) [02:36:09] !log testlabs - rebooting asher1, webserver-lcarr [02:36:26] RECOVERY host: webserver-lcarr is UP address: webserver-lcarr PING OK - Packet loss = 0%, RTA = 0.87 ms [02:36:27] oh [02:36:28] those? [02:36:33] those are all on virt3 [02:36:37] I left them down on purpose [02:36:46] RECOVERY host: asher1 is UP address: asher1 PING OK - Packet loss = 0%, RTA = 0.81 ms [02:36:47] though virt3 doesn't look overly overloaded [02:37:07] arr, well < petan|wk> mutante, ssmollett can you reboot all these instances reported by nagios? [02:37:17] I'd say let's mute them in nagios [02:37:46] yeah, thats why i hesitated, i figured some might be down on purpose [02:37:48] it's possible we'll run the system out of memory again [02:38:06] yeah ok, let me shut them down again [02:38:10] since we're two dimms down [02:39:20] oh..hm. shutting down not just a click away .P hrmm [02:39:34] nope. 
there's no command for doing so [02:39:44] either need to ssh into it and shutdown, or do it from the system its on [02:39:52] using virsh destroy [02:40:01] destroy sounds evil :) [02:40:04] yeah [02:40:13] it isn't a graceful shutdown [02:40:27] neither is reboot, though [02:42:26] RECOVERY Total Processes is now: OK on essex-9 essex-9 output: PROCS OK: 99 processes [02:43:50] Ryan_Lane: whats the correct domain name? trying instance name, hostname or fqdn, but getting "failed to get domain" [02:43:56] RECOVERY Current Load is now: OK on essex-9 essex-9 output: OK - load average: 0.87, 1.09, 0.63 [02:44:07] what are you trying to do? [02:44:10] ssh into it? [02:44:13] destroy instances [02:44:19] virsh list [02:44:21] in virsh shell [02:44:29] then, list [02:44:36] RECOVERY Current Users is now: OK on essex-9 essex-9 output: USERS OK - 1 users currently logged in [02:44:36] then look for instance- [02:44:44] it'll be the same as i- [02:45:06] then destroy instance- [02:45:16] RECOVERY Disk Space is now: OK on essex-9 essex-9 output: DISK OK [02:45:30] ah, gotta prefix "instance-". instance-0000003a . i tried "I-0000003a" like in the Nova Resource page [02:45:42] yeah. nova expands that in the background [02:46:06] RECOVERY Free ram is now: OK on essex-9 essex-9 output: OK: 78% free memory [02:47:21] !log testlabs - "destroyed" I-0000003a (asher1) and I-00000134 (webserver-lcarr) again to prevent OOM [02:47:39] heh [02:47:47] the stupid bot isn't in the room [02:47:55] :p [02:47:59] gimme a sec [02:48:10] it needs to be fixed [02:48:16] RECOVERY dpkg-check is now: OK on essex-9 essex-9 output: All packages OK [02:48:51] it's going to die in a sec [02:48:53] !log test [02:48:53] Message missing. Nothing logged. [02:49:01] !log test test [02:49:27] ok. 
it'll be good now [02:49:35] re-try the log :) [02:49:46] PROBLEM host: hugglewa-w1 is DOWN address: hugglewa-w1 CRITICAL - Host Unreachable (hugglewa-w1) [02:49:46] PROBLEM host: puppet-lucid is DOWN address: puppet-lucid CRITICAL - Host Unreachable (puppet-lucid) [02:49:56] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [02:50:04] !log testlabs - rebooted asher1, webserver-lcarr [02:50:05] Logged the message, Master [02:50:14] !log testlabs - "destroyed" I-0000003a (asher1) and I-00000134 (webserver-lcarr) again to prevent OOM [02:50:15] Logged the message, Master [02:50:36] PROBLEM host: asher1 is DOWN address: asher1 CRITICAL - Host Unreachable (asher1) [02:50:46] PROBLEM host: pad2 is DOWN address: pad2 CRITICAL - Host Unreachable (pad2) [02:52:06] PROBLEM host: webserver-lcarr is DOWN address: webserver-lcarr CRITICAL - Host Unreachable (webserver-lcarr) [02:53:46] PROBLEM host: dumps-4 is DOWN address: dumps-4 CRITICAL - Host Unreachable (dumps-4) [02:53:46] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [03:13:14] I'd like to have permission to execute host commands (notifcations, scheduled downtime) on Nagios [03:13:33] and/or feature request to do those via IRC bot ;) [03:18:44] host command is just a Nagios permission issue, while downtime would be accepted but fails due to file permissions (Error: Could not open command file '/var/lib/nagios3/rw/nagios.cmd' for update! .. The permissions on the external command file and/or directory may be incorrect. " .. 
lemme check puppet files [03:19:46] PROBLEM host: hugglewa-w1 is DOWN address: hugglewa-w1 CRITICAL - Host Unreachable (hugglewa-w1) [03:19:46] PROBLEM host: puppet-lucid is DOWN address: puppet-lucid CRITICAL - Host Unreachable (puppet-lucid) [03:19:56] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [03:20:46] PROBLEM host: pad2 is DOWN address: pad2 CRITICAL - Host Unreachable (pad2) [03:21:36] PROBLEM host: asher1 is DOWN address: asher1 CRITICAL - Host Unreachable (asher1) [03:22:46] PROBLEM host: webserver-lcarr is DOWN address: webserver-lcarr CRITICAL - Host Unreachable (webserver-lcarr) [03:23:46] PROBLEM host: dumps-4 is DOWN address: dumps-4 CRITICAL - Host Unreachable (dumps-4) [03:23:46] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [03:26:49] hmm. not puppetized yet it seems [03:27:33] !log nagios puppet broken due to "Could not find class misc::apache2" [03:27:34] Logged the message, Master [03:36:30] !log nagios even though listed in all authorized_for_* commands in cgi.cfg i get denied to execute any by web ui. 
guess related to the Apache LDAP auth / auto-login [03:36:31] Logged the message, Master [03:37:37] :Q [03:49:46] PROBLEM host: hugglewa-w1 is DOWN address: hugglewa-w1 CRITICAL - Host Unreachable (hugglewa-w1) [03:49:46] PROBLEM host: puppet-lucid is DOWN address: puppet-lucid CRITICAL - Host Unreachable (puppet-lucid) [03:49:56] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [03:50:46] PROBLEM host: pad2 is DOWN address: pad2 CRITICAL - Host Unreachable (pad2) [03:51:46] PROBLEM host: asher1 is DOWN address: asher1 CRITICAL - Host Unreachable (asher1) [03:52:46] PROBLEM host: webserver-lcarr is DOWN address: webserver-lcarr CRITICAL - Host Unreachable (webserver-lcarr) [03:53:46] PROBLEM host: dumps-4 is DOWN address: dumps-4 CRITICAL - Host Unreachable (dumps-4) [03:53:46] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [03:57:46] mutante: yeah, not too sure how to handle that [03:57:52] authentication scares me a little [03:58:06] since non-ops can access it as root [03:58:12] which means they can steal our passwords [03:58:33] mutante: also, nagios config isn't puppetized in labs because it isn't puppetized in production [03:58:39] so, it's scripted in labs [03:58:42] using SMW queries [03:59:27] we really need an SSO server [03:59:52] then everything could send people off to the SSO server to authenticate, and would get a token, rather than a password [04:00:41] Ryan_Lane: while i still get the "Your account does not have permissions to execute commands." in the web ui, i did fix the "permissions on external command file" thing just now.. well manually [04:00:49] ah [04:00:54] I dunno how to set that up [04:00:56] following instructions from Nagios faq [04:01:08] isn't that really unsafe? 
[04:01:11] so now i could use the "Downtime" link, and schedule a downtime for host asher1 [04:01:51] I thought there were vulnerabilities with allowing execution of commands [04:01:54] it is like: create a group "nagiocmd" which was missing, add the user nagios and the webserver user (www-data) to that group [04:02:49] then give the group rwx and g+s on that named pipe called "rw" [04:03:26] hmm, yeah, basically just wanted to see if that works, when doing what it says in the faq [04:05:18] http://nagios.manubulon.com/traduction/docs14en/commandfile.html [04:06:54] you're right about this: "If you've installed Nagios on a public/multi-user machine, I would suggest setting more restrictive permissions on the external command file and using something like CGIWrap to run the CGIs as a specific user. Failing to do so may allow normal users to control Nagios through the external command file! " [04:11:05] Change on 12mediawiki a page Wikimedia Labs/Agreement to disclosure of personally identifiable information was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=513110 edit summary: [04:11:57] restricted it again for now.. we'll have to take a closer look [04:13:29] Change on 12mediawiki a page Wikimedia Labs was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=513112 edit summary: /* Documents */ [04:14:45] heh nice, i didnt see the bot doing that yet:) [04:17:32] Change on 12mediawiki a page Wikimedia Labs/Agreement to disclosure of personally identifiable information was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=513114 edit summary: [04:18:06] heh [04:18:13] !log nagios - temp. 
changed permissions on external command file per Nagios FAQ, added group "nagiocmd" to see if that allows me to schedule downtimes, it does (independently from the host command perms), but took permissions back due to security concerns [04:18:14] Logged the message, Master [04:18:22] it only works on mediawiki.org. We need to make it work for labsconsole too [04:19:01] Change on mediawiki a page Wikimedia Labs/Terms of use was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=513115 edit summary: [04:19:21] Change on mediawiki a page Wikimedia Labs/Terms of use/exception policy was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=513116 edit summary: [04:19:46] PROBLEM host: hugglewa-w1 is DOWN address: hugglewa-w1 CRITICAL - Host Unreachable (hugglewa-w1) [04:19:46] PROBLEM host: puppet-lucid is DOWN address: puppet-lucid CRITICAL - Host Unreachable (puppet-lucid) [04:19:56] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [04:20:06] Ryan_Lane: thinking along the line of "let humans execute host commands / downtimes" but ONLY via the ircbot, not using the web ui for it at all ... 
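A minimal sketch of what a bot-side "schedule downtime" helper might look like, assuming the bot only ever emits one hardcoded Nagios external command (`SCHEDULE_HOST_DOWNTIME`) and accepts just two user-supplied values. The function names, the hostname pattern, the `<number><s|m|h|d>` duration syntax, and the `ircbot` author string are illustrative assumptions, not anything deployed on the labs Nagios host. The command file path is the one quoted in the error message earlier in the log.

```python
import re
import time

# Path quoted in the Nagios error message earlier in this log.
COMMAND_FILE = "/var/lib/nagios3/rw/nagios.cmd"

# Assumed input rules: lowercase ASCII hostnames, and durations written
# as a number followed by one of s, m, h, d.
HOSTNAME_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?")
DURATION_RE = re.compile(r"([1-9][0-9]{0,3})([smhd])")
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text):
    """Turn a string such as '30m' or '3d' into a number of seconds."""
    match = DURATION_RE.fullmatch(text)
    if match is None:
        raise ValueError("duration must look like 30m, 2h or 3d")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def build_downtime_command(host, duration, known_hosts, now=None):
    """Build one SCHEDULE_HOST_DOWNTIME line for the external command file.

    The host must match the pattern AND be present in known_hosts, so the
    user-supplied string is only ever compared, never interpreted.
    """
    if HOSTNAME_RE.fullmatch(host) is None or host not in known_hosts:
        raise ValueError("unknown or malformed host name")
    seconds = parse_duration(duration)
    start = int(time.time()) if now is None else int(now)
    end = start + seconds
    # Fields: host;start;end;fixed;trigger_id;duration;author;comment
    return "[{0}] SCHEDULE_HOST_DOWNTIME;{1};{0};{2};1;0;{3};ircbot;scheduled via IRC\n".format(
        start, host, end, seconds
    )


if __name__ == "__main__":
    hosts = {"asher1", "pad2", "dumps-4"}
    print(build_downtime_command("asher1", "3d", hosts, now=1000), end="")
```

Writing the returned line to `COMMAND_FILE` would still require the bot user to have write access to the command pipe, which is exactly the permission question raised above; the sketch deliberately stops at building the string.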
[04:20:46] PROBLEM host: pad2 is DOWN address: pad2 CRITICAL - Host Unreachable (pad2) [04:21:02] Change on mediawiki a page Wikimedia Labs/Account creation text was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=513117 edit summary: [04:21:18] the ircbot is even more dangerous :) [04:21:32] irc has no authentication at all, really [04:21:47] we'd have to trust cloaks [04:21:50] accounts can be hijacked during netsplits fairly easily [04:21:58] true..sigh [04:22:02] it's also easy to steal credentials [04:22:13] if this was jabber I'd be ok with it :) [04:22:46] PROBLEM host: webserver-lcarr is DOWN address: webserver-lcarr CRITICAL - Host Unreachable (webserver-lcarr) [04:22:58] Change on mediawiki a page Wikimedia Labs/Agreement to disclosure of personally identifiable information was modified, changed by Ryan lane link https://www.mediawiki.org/w/index.php?diff=513118 edit summary: [04:23:46] if just the bot user was in the nagiocmd group, and no users could get on the host, at least it could just execute hardcoded commands like scheduled downtime, but not arbitrary commands [04:23:46] PROBLEM host: dumps-4 is DOWN address: dumps-4 CRITICAL - Host Unreachable (dumps-4) [04:23:46] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [04:23:55] ah. true [04:23:58] that would be better [04:24:31] I'd be ok with that [04:32:47] since we'd still have to pass args (hostname, duration of downtime) we'd really have to sanitize user input though to avoid injections .. [04:41:48] well [04:41:57] yeah. 
that's true [04:42:05] ascii only [04:42:08] for hostnames [04:42:20] well, same restrictions as on creation of hostnames [04:42:29] also, the bot should check to ensure the host actually exists [04:42:45] by querying for a list of hosts, then checking the string against it [04:42:59] +1 [04:43:09] then the duration should be [smhd] [04:43:37] that way we never use the strings to make queries or to run commands [04:43:40] several scripts to do it via shell or cron http://exchange.nagios.org/directory/Addons/Scheduled-Downtime [04:43:55] yep [04:49:46] PROBLEM host: hugglewa-w1 is DOWN address: hugglewa-w1 CRITICAL - Host Unreachable (hugglewa-w1) [04:49:46] PROBLEM host: puppet-lucid is DOWN address: puppet-lucid CRITICAL - Host Unreachable (puppet-lucid) [04:49:56] PROBLEM host: turnkey-1 is DOWN address: turnkey-1 CRITICAL - Host Unreachable (turnkey-1) [04:50:46] PROBLEM host: pad2 is DOWN address: pad2 CRITICAL - Host Unreachable (pad2) [04:52:46] PROBLEM host: webserver-lcarr is DOWN address: webserver-lcarr CRITICAL - Host Unreachable (webserver-lcarr) [04:53:46] PROBLEM host: dumpster01 is DOWN address: dumpster01 CRITICAL - Host Unreachable (dumpster01) [04:53:46] PROBLEM host: dumps-4 is DOWN address: dumps-4 CRITICAL - Host Unreachable (dumps-4) [05:18:32] !log nagios - put all the hosts currently down into scheduled downtimes for the next 3 days with manual bash commands [05:18:32] Logged the message, Master [05:19:01] ^ the bot notifications should stop flooding the channel now :) [05:19:46] PROBLEM host: hugglewa-w1 is DOWN address: hugglewa-w1 CRITICAL - Host Unreachable (hugglewa-w1) [05:21:27] well, besides this one because i misspelled the hostname and it has a different reason as well.. adding it anyways [05:21:59] http://wikitech.wikimedia.org/view/OpenStack#Mounting_an_instance.27s_disk [05:22:02] that's fun documentation :) [05:22:10] ah. cool. 
thanks [05:22:39] that's how to mount any of the kinds of disks we have in use :) [05:23:17] it's cool that qemu has commands for doing that [05:23:49] * werdna waves Ryan_Lane [05:23:52] that said, I still have no clue what's wrong with hugglewa-w1 [05:23:55] werdna: howdy [05:24:10] ahaa! @ mounting disks [05:24:30] it's doing a dhcp lookup and getting a return [05:24:32] hugglewa-w1 i added comment "networking config, FIXME?" or so [05:24:39] I have no clue why it isn't getting an IP [05:24:54] or setting one, that is [05:26:18] what are you doing working at this time, Ryan_Lane? [05:26:22] It's almost time for ME to go home :) [05:26:34] heh [05:26:44] I needed to write an email [05:26:50] then I wanted to troubleshoot an issue [05:27:08] then I realized I should write documentation on what I was doing, so other people would know how to do it [05:28:26] !log hugglewa I can't seem to get hugglewa-w1 to boot. It gets an IP via DHCP, but seems to fail its networking somehow. [05:28:27] Logged the message, Master [05:29:00] !log hugglewa it may be good to delete/recreate. Let me know if you need to rescue any data off of it, I can do so before deletion [05:29:01] Logged the message, Master [05:29:19] mutante: it's possible to save an instance's data by mounting its disks :) [05:29:35] or to reconfigure it, if someone massively fucked it up [05:31:47] gotcha. nice! [05:34:07] http://wikitech.wikimedia.org/view/Nagios#Scheduling_downtimes_with_a_shell_command [05:34:15] bbiaw, really need to get some food [12:46:02] petan|wk: the hugglewa-w1 instance sleeps... [14:11:34] IWorld|mobile: what [14:12:44] petan|wk: do you know if its possible to enable something like $wgDebugLogFile on commons.wikimedia.beta.wmflabs.org [14:12:51] j^: yes [14:13:00] we have an issue with video uploads and i want to find out whats causing the error [14:13:09] if ( $wgDBname == "commonswiki" ) { $wgDe... 
[14:13:23] j^: you can insert that to InitialiseSettings.php [14:14:16] j^: you need the transcoding instance running for that [14:14:18] because it's down [14:14:46] petan|wk: ah transcoding is the next step, right now api / upload throws a 500 error [14:15:01] some way to find out what web server the requests is going to? [14:15:07] ah it's up [14:15:20] InitialiseSettings.php has some udp logging is that also available on labs? [14:15:29] web server can be found when you open source of page you attempt to open [14:16:00] I don't know if udp logging is possible since there is nothing listening for them [14:16:24] but if you want I can enable some kind of logger for that [14:16:46] if thats possible would be great since otherwise i have to check all web instances for the error each time [14:16:57] ok [14:17:19] since afaik api requests might not end up at one instance but could be distributed over all of them [14:17:32] sure [14:24:38] try usr/local/apache/common-local/errors [14:24:40] errors.log [14:24:53] it's a shared log file [14:24:59] it's already getting filled up [14:25:00] :D [14:25:14] ah great, thanks [14:52:55] re [14:53:55] PROBLEM Current Load is now: CRITICAL on pediapress-ocg3 pediapress-ocg3 output: CHECK_NRPE: Error - Could not complete SSL handshake. [14:54:35] PROBLEM Current Users is now: CRITICAL on pediapress-ocg3 pediapress-ocg3 output: CHECK_NRPE: Error - Could not complete SSL handshake. [14:54:48] petan|wk: i get a lot of Unable to load Tor exit node list: cold load disabled on page-views. [14:55:15] PROBLEM Disk Space is now: CRITICAL on pediapress-ocg3 pediapress-ocg3 output: CHECK_NRPE: Error - Could not complete SSL handshake. [14:55:21] not sure thats a known problem [14:56:05] PROBLEM Free ram is now: CRITICAL on pediapress-ocg3 pediapress-ocg3 output: CHECK_NRPE: Error - Could not complete SSL handshake. 
[14:57:08] j^: no idea what is that [14:57:25] PROBLEM Total Processes is now: CRITICAL on pediapress-ocg3 pediapress-ocg3 output: CHECK_NRPE: Error - Could not complete SSL handshake. [14:57:30] probably some error from extension [14:58:15] PROBLEM dpkg-check is now: CRITICAL on pediapress-ocg3 pediapress-ocg3 output: CHECK_NRPE: Error - Could not complete SSL handshake. [14:58:55] RECOVERY Current Load is now: OK on pediapress-ocg3 pediapress-ocg3 output: OK - load average: 1.42, 1.78, 1.10 [14:59:35] RECOVERY Current Users is now: OK on pediapress-ocg3 pediapress-ocg3 output: USERS OK - 1 users currently logged in [15:00:15] RECOVERY Disk Space is now: OK on pediapress-ocg3 pediapress-ocg3 output: DISK OK [15:01:05] RECOVERY Free ram is now: OK on pediapress-ocg3 pediapress-ocg3 output: OK: 88% free memory [15:01:56] petan|wk: is it possible to log exceptions somewhere or show them on the web wgShowExceptionDetails=True [15:02:10] not sure all requests go in the log right now [15:02:25] RECOVERY Total Processes is now: OK on pediapress-ocg3 pediapress-ocg3 output: PROCS OK: 85 processes [15:03:15] RECOVERY dpkg-check is now: OK on pediapress-ocg3 pediapress-ocg3 output: All packages OK [15:10:49] j^: for all wikis or common wiki only? [15:11:15] petan|wk: just commons.wikimedia.beta.wmflabs.org [15:13:27] ok [15:14:45] hi petan|wk [15:14:49] hey [15:14:51] on http://labs.wikimedia.beta.wmflabs.org/wiki/Global_Requests 'd like to test rev:113591 for passing parameters to the UploadWizard (something that's needed mainly for Wiki Loves Monuments, probably more testing people will appear here soon ;-) I'd need to create UploadWizard campaigns, thank you! --Elya (talk) 21:48, 19 March 2012 (UTC) [15:14:59] https://www.mediawiki.org/wiki/Special:Code/MediaWiki/113591 [15:15:04] can you put that on labs? 
[15:15:36] ok [15:15:47] thanks [15:16:56] hmm petan, for https://bugzilla.wikimedia.org/show_bug.cgi?id=28633 werdna has created a patch at https://www.mediawiki.org/wiki/Special:Code/MediaWiki/111217, but there's been no-one able to test it yet, do you know anyone that has a bit of abuse filter experience to be able to test and possibly code review it? [15:17:20] code review is problem [15:17:28] process is broken [15:17:52] btw ticket is flagged as done [15:17:57] is it still open? [15:18:08] should I reopen? [15:18:47] hmm [15:18:54] well the bug patch has been created [15:19:04] but I don't know if it's been implemented [15:19:13] if u think https://www.mediawiki.org/wiki/Special:Code/MediaWiki/111217 [15:19:13] AFAIK it hasn't because the code review still says "new" [15:19:21] yeah [15:19:24] ah [15:19:30] I will check it [15:21:24] it's big diff [15:21:32] I flagged it as tested since it works on labs [15:21:39] it's definitely not on production [15:21:50] it works on labs? [15:21:52] it won't be likely until deployment on .20 [15:21:57] yes it does [15:22:04] did you try it or did it just not break the whole wiki? [15:22:10] hmm when is .20? [15:22:20] I wanted it to be out in 1.19 :( [15:22:21] hm... I don't know [15:22:55] ... and labs went back down again [15:23:46] it's pretty new patch [15:23:52] it couldn' [15:23:57] couldn't be in 19 [15:24:56] seems to work to me [15:24:59] what's down [15:25:49] http://labs.wikimedia.beta.wmflabs.org/wiki/Special:RecentChanges [15:25:52] HTTP Error 500 (Internal Server Error): An unexpected condition was encountered while the server was attempting to fulfill the request. [15:26:19] it opens to me :) [15:26:21] and then Error 139 (net::ERR_TEMPORARILY_THROTTLED): Requests to the server have been temporarily throttled. 
[15:26:30] :o [15:26:33] refresh [15:26:47] age User:Vmcherriekkrebae .(Spam: content was: "[http://kolejlegenda.edu.my/ooi-chong-seong/ Ooi Chong Seon [15:26:50] I am refreshing [15:26:56] this is what the bots want you to do [15:27:02] you shouldn't leave the url in comment ;)O [15:27:24] bah I usually don't [15:27:26] Thehelpfulone (Talk | contribs) deleted page User:Vmcherriekkrebae [15:27:29] ok [15:27:31] :D [15:28:28] oh that's a lie, http://labs.wikimedia.beta.wmflabs.org/wiki/Special:Log/delete :P [15:28:40] I won't* [15:28:43] hi petan|wk. Our instance is off. [15:28:46] whats lie [15:29:24] anyways, so can I test the abuse filter patch petan|wk, have you applied it? [15:29:47] yes [15:29:49] it's there [15:30:28] petan|wk: can you setup a new instance? [15:32:50] for? [15:33:13] Hugglewa [15:33:22] I am trying to fix current one [15:33:27] ok [15:33:55] PROBLEM Current Load is now: CRITICAL on diablo-lucid diablo-lucid output: Connection refused by host [15:34:35] PROBLEM Current Users is now: CRITICAL on diablo-lucid diablo-lucid output: Connection refused by host [15:35:15] PROBLEM Disk Space is now: CRITICAL on diablo-lucid diablo-lucid output: CHECK_NRPE: Error - Could not complete SSL handshake. [15:36:05] PROBLEM Free ram is now: CRITICAL on diablo-lucid diablo-lucid output: CHECK_NRPE: Error - Could not complete SSL handshake. [15:37:25] PROBLEM Total Processes is now: CRITICAL on diablo-lucid diablo-lucid output: CHECK_NRPE: Error - Could not complete SSL handshake. [15:38:15] PROBLEM dpkg-check is now: CRITICAL on diablo-lucid diablo-lucid output: CHECK_NRPE: Error - Could not complete SSL handshake. [15:39:28] Change on 12mediawiki a page Wikimedia Labs was modified, changed by Sumanah link https://www.mediawiki.org/w/index.php?diff=513338 edit summary: Project:Labsconsole_accounts to ask for an account [15:40:11] petan|wk: available for a PM? 
[15:41:00] petan|wk: see my query ;) [15:41:29] !log deployment-prep j: install php-pear on deployment-web3/4/5 required by TMH [15:41:30] Logged the message, Master [15:41:34] !log deployment-prep petrb: getting sql server down I found a bunch of corrupted db's, rollback is necessary [15:41:35] Logged the message, Master [15:41:58] petan|wk: ok, think i found the problem, you can disable the debug output and logs again [15:42:11] j^: ok [15:42:21] but there is another problem, I guess you need the test site right now? [15:42:29] I found that most of db's are broken [15:42:46] since we moved to gluster and there was outage on labs, databases got corrupted a lot [15:42:57] Change on 12mediawiki a page Wikimedia Labs was modified, changed by IWorld link https://www.mediawiki.org/w/index.php?diff=513344 edit summary: better [15:43:00] I wanted to fix it now, but if you want I can do it later [15:43:59] petan|wk: you can fix it now, am done with fixing and testing will happen a later [15:44:04] ok [16:09:37] --> https://www.mediawiki.org/wiki/Project:Labsconsole_accounts [16:12:00] !log deployment-prep petrb: mysql is back up [16:12:01] Logged the message, Master [16:16:05] !log deployment-prep petrb: it seems that corruption of db is worse than I expected, need to restore backup old few months [16:16:06] Logged the message, Master [16:16:23] Damianz: here? [16:16:40] Sortof [16:16:50] db is totaly broken, around ~400 db's are corrupted [16:17:08] :( [16:17:12] it's pretty weird, because it seems that it was corrupted even in backup [16:17:24] I just recovered all of them and it still throw same errors [16:17:30] I assume the backup is just sqldumps not a copy of the innodb files? 
[16:17:43] d error "1033: Incorrect information in file: './bswiktionary/#sql-66c8_68.frm' [16:17:57] it's both [16:18:02] both a dump and a copy [16:18:11] dump is older though [16:18:22] I am about to recover from dump now [16:19:07] I don't even know if data are corrupted or only the schemas [16:19:19] I could try to copy just the schema now [16:19:25] and restore the original data files [16:19:31] Copying just the frm files might work [16:19:38] damn I hate to do that :D [16:19:38] It's rather weird it's broken though [16:19:42] yes it is [16:19:50] I think it's gluster's fault [16:20:01] how does it deal with quotas? [16:20:15] I hope it doesn't redirect data to /dev/null [16:20:18]