[00:06:13] 10Wikimedia-Labs-Infrastructure: WMFLabs: Ganglia deamon is taking up a lot of memory - https://phabricator.wikimedia.org/T73761#1072154 (10Krinkle) 5Open>3declined [02:28:23] PROBLEM - Host tools-exec-09 is DOWN: PING CRITICAL - Packet loss = 100% [02:28:44] PROBLEM - Host tools-webproxy is DOWN: PING CRITICAL - Packet loss = 100% [02:29:09] PROBLEM - Host tools-webproxy-test is DOWN: PING CRITICAL - Packet loss = 100% [02:29:18] PROBLEM - Host tools-exec-03 is DOWN: PING CRITICAL - Packet loss = 100% [02:29:45] PROBLEM - Host tools-webproxy-01 is DOWN: PING CRITICAL - Packet loss = 100% [02:29:51] PROBLEM - Host tools-exec-cyberbot is DOWN: PING CRITICAL - Packet loss = 100% [02:30:19] PROBLEM - Host tools-webgrid-04 is DOWN: PING CRITICAL - Packet loss = 100% [02:30:53] PROBLEM - Host tools-webgrid-tomcat is DOWN: PING CRITICAL - Packet loss = 100% [02:30:59] PROBLEM - Host tools-webproxy-02 is DOWN: PING CRITICAL - Packet loss = 100% [02:31:26] umm :/ [02:31:30] PROBLEM - Host tools-submit is DOWN: PING CRITICAL - Packet loss = 100% [02:31:41] D'oh... Down again [02:31:53] PROBLEM - Host tools-exec-07 is DOWN: PING CRITICAL - Packet loss = 100% [02:32:18] PROBLEM - Host ToolLabs is DOWN: PING CRITICAL - Packet loss = 100% [02:34:03] not just tools, beta is also down [02:34:58] Coren: ^ [02:35:24] Oh, FFS. virt1002 *again*? [02:35:29] 1012* [02:35:33] Is it just me (or my service) or are services on the tools server hanging at the moment? [02:35:59] JohnMarkOckerblo: Looks like hardware fail. [02:37:00] Cursed server. [02:37:18] ah, that's the virt1002 you were just mentioning? (I'd look at the channel log, but the link given here isn't working at the moment either) [02:37:59] ha, that's because it's on labs :P [02:38:12] * Coren spends a minute or two trying to find the root cause. [02:38:18] Otherwise, I'll just reboot the box. [02:38:48] weird thing is I can shell in just fine, but the web services aren't responding. 
[02:39:30] Feel free to reboot as you see fit, though. [02:39:43] Yeah, I can SSH but can't ping other hosts from it. [02:39:51] JohnMarkOckerblo: One of the hosts is ill; that one includes some webproxies. [02:43:42] I'm attempting to suspend the individual instances before I reboot the host. With luck, they may come back without having to be rebooted. [02:43:57] now, why is the web server down? [02:44:27] ok, thanks. [02:44:32] Doesn't look like that will work. Sad. [02:47:20] Things should recover within ~10m while I reboot the host. [02:47:56] Coren: network on virt1012 locked up? [02:48:26] andrewbogott: Yes. I tried something different this time; I tried to suspend the instances before the reboot but they all errored out anyways. :-( [02:48:48] Yeah, I tried to suspend last time but nova-compute crashed during the suspend calls. Did that happen for you as well? [02:48:58] I’m sure that suspending and resuming an individual instance works... [02:49:01] andrewbogott: The logs were completely unhelpful; I have *no* idea why networking died. [02:49:12] Yeah, so, same as last time :( [02:49:18] andrewbogott: I tried them one at a time, but nova-compute SEGV'ed out [02:49:23] Except now it’s happened twice on one server. [02:50:11] andrewbogott: How do you look at the console on the HP ilo? [02:50:34] PROBLEM - Host tools-webgrid-tomcat is DOWN: CRITICAL - Host Unreachable (10.68.16.29) [02:51:08] PROBLEM - Host tools-submit is DOWN: CRITICAL - Host Unreachable (10.68.17.1) [02:51:21] Coren: I think that’s all here: https://wikitech.wikimedia.org/wiki/HP_DL3N0 [02:51:33] 'vsp' [02:52:24] Box is back up [02:53:32] 6Labs, 10Tool-Labs: create bigbrotherrc for drtrigonbot - https://phabricator.wikimedia.org/T90912#1072772 (10scfc) I created a `.bigbrotherrc` with `webservice` in it, but … just at this moment, `tools-submit` which runs `bigbrother` is down :-). [02:53:35] Now you probably have to ‘nova start’ each instance.
[02:53:51] Yeah, about to do so [02:53:55] Do you have a list already or shall I? [02:53:56] ok [02:54:28] andrewbogott: I have a list already. [02:54:58] Hm… seems unlikely that this is a software issue since there are three identical boxes [02:55:03] could be, though :( [02:56:04] andrewbogott: They are in error mode; do you remember offhand how to clear that state? [02:56:18] nova reset-state --active [02:56:27] oh right [02:56:38] And then nova reboot (since ‘start’ doesn’t work from ‘active’) [02:57:11] for h in $(cat instances); do nova reset-state --active $h && nova reboot $h;done [02:57:56] that should do it [02:58:07] although in my experience there are stragglers that take a second go [03:00:55] They are booting gradually. [03:01:16] :D [03:02:48] We did *not* need another outage. [03:03:40] Well, this one was only 10 minutes at least. [03:04:11] 'Don't say hi, until we are over the bridge.' - Swedish saying. [03:05:01] 6Labs, 10Tool-Labs: Various instances unresponsive in "ACTIVE" (previously: "ERROR") state - https://phabricator.wikimedia.org/T91043#1072786 (10scfc) [03:05:50] RECOVERY - Host tools-webproxy-test is UP: PING OK - Packet loss = 0%, RTA = 1.33 ms [03:05:53] andrewbogott: It's only starting 2-3/minute, I expect that's normal? [03:06:02] RECOVERY - Host tools-webproxy-02 is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [03:06:30] Yeah, that sounds about right. [03:07:02] 6Labs, 10Tool-Labs: Various instances unresponsive in "ACTIVE" (previously: "ERROR") state - https://phabricator.wikimedia.org/T91043#1072800 (10coren) 5Open>3Resolved a:3coren Known, immediately signaled on IRC, and being fixed. One of the hosts got ill.
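Andrew's one-liner plus a second pass for the stragglers he mentions could be wrapped like this. This is a minimal sketch, not the script actually used: the `instances` file (one hostname per line) is assumed, and the `NOVA` override exists only so the loop can be dry-run with `echo`.

```shell
# Reset errored instances to ACTIVE and reboot them. In practice a second
# pass catches the stragglers that "take a second go".
# Set NOVA="echo nova" to dry-run without touching the nova API.
boot_instances() {
    while read -r h; do
        ${NOVA:-nova} reset-state --active "$h" && ${NOVA:-nova} reboot "$h"
    done < instances
}
# Typical use: boot_instances; sleep 120; boot_instances
```

The two-pass pattern simply repeats the same idempotent loop; rebooting an already-booting instance is assumed to be harmless here.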
[03:07:04] RECOVERY - Host tools-exec-03 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [03:07:04] RECOVERY - Host tools-webproxy-01 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [03:07:42] RECOVERY - Host tools-exec-09 is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms [03:07:58] RECOVERY - Host ToolLabs is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms [03:08:40] RECOVERY - Host tools-webproxy is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [03:08:48] Wooo! [03:09:00] Okay, everything is back up :D [03:09:16] RECOVERY - Host tools-webgrid-04 is UP: PING OK - Packet loss = 0%, RTA = 1.18 ms [03:09:18] Woah. nova was on steroids; it started almost 30 in 90s [03:09:30] thanks: do i need to restart my webservice? [03:09:35] tools-login and other grid hosts are in separate virts, I guess? [03:10:14] JohnMarkOckerblo: You probably won't have to; as the grid recovers most things should restart on their own within the next 5 minutes. [03:11:00] Zhaofeng_Li: Yes, I tried to spread them around on as many hosts as possible - we've just been unlucky with the host that holds the webproxies in the past couple weeks. [03:11:06] RECOVERY - Host tools-webgrid-tomcat is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms [03:11:27] okay. I'm getting a blank page now rather than a hang, but that might just be things not fully restarted yet, (It's a Perl-based script, expecting to read some files, if that matters) [03:11:33] andrewbogott: Fun fact: one of the hosts seems to have been suspended correctly. 
[03:11:42] RECOVERY - Host tools-submit is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [03:11:55] Hm [03:11:58] So it’s possible, vaguely [03:12:02] PROBLEM - Puppet failure on tools-webgrid-tomcat is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0] [03:12:21] Coren: my toollab web went down several times randomly, with 502 Bad Gateway [03:12:27] JohnMarkOckerblo: It takes a few minutes for the grid to recover, and while it does the load is high enough that services tend to remain queued for a while until things settle. [03:12:30] and webservice restart makes it work again [03:13:12] liangent: 502? That normally means the process is up but not responding to requests from the proxy. Restarting will kill it and start a new one, so that'd be expected. Do you have logs? [03:13:32] Cyberpower678: around? [03:13:41] Coren: where can I see logs [03:14:05] nakon, yes? [03:14:09] kind of [03:14:12] liangent: Normally, in your home. access.log and error.log by default with lighttpd; but if you use another webserver that varies. [03:14:15] nakon, who are you? [03:14:28] another toolserver user :) [03:14:34] :p [03:14:58] thanks, i'll send PM [03:15:42] Coren: last access.log entry before restart: 10.68.16.4 tools.wmflabs.org - [24/Feb/2015:01:33:09 +0000] "GET /liangent-php/load.php/zhwiki?debug=false&lang=zh-cn&modules=jquery.checkboxShiftClick%2Ccookie%2ChighlightText%2CmakeCollapsible%2Cmw-jump%2Cplaceholder%2Csuggestions%7Cmediawiki.api%2CsearchSuggest%2Cuser%7Cmediawiki.page.ready&skin=vector&version=20141114T033456Z&* HTTP/1.1" 200 39429 "https://tools.wmflabs.org/liangent-php/index.p [03:15:57] and the last error.log: 2015-02-23 09:50:28: (mod_fastcgi.c.2701) FastCGI-stderr: PHP Fatal error: Maximum execution time of 30 seconds exceeded in /data/project/liangent-ph [03:15:58] p/mw/includes/utils/StringUtils.php on line 509 [03:16:54] andrewbogott: Actually, three are marked suspended. 
"Interesting" [03:17:25] Coren: use virt1005 if you want to tinker. It only has two instances on it, both of them there for my experiments. [03:17:49] I successfully suspended an instance, rebooted the host, and resumed the instance yesterday. So it’s something about numbers/rates I suspect [03:18:19] May be made worse by the network being ill. [03:18:55] But it's too late for me to want to tinker. Ima massage things into full recovery and flee. [03:19:03] could be, although I don’t think that nova talks to the instances via their network interfaces… commands should go straight to libvirt over eth0 which was clearly working fine [03:19:13] Coren: ok [03:19:39] I sent an email to Yuvi and Ops suggesting possible stopgaps. But nothing to labs-l [03:19:54] Yeah, that didn't work so well once it segfaulted. :-) [03:20:02] RECOVERY - Host tools-exec-cyberbot is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [03:20:12] afaict, all instances are up. [03:20:18] * Coren checks the tools grid now. [03:20:56] For what it’s worth, this happened at a slightly different time than the last one, so it’s presumably not cron related. [03:20:56] RECOVERY - Host tools-exec-07 is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [03:20:58] * andrewbogott grasps at straws [03:21:39] 10Tool-Labs: Test how bigbrother reacts to user names not resolving and, if necessary, fix it - https://phabricator.wikimedia.org/T90410#1072812 (10scfc) I had to restart `bigbrother` again just now after the reboot of `tools-submit` so this suggests that this wasn't a one-time fluke. [03:22:05] andrewbogott: can you migrate tools-webproxy-01 and 02 off virt1002? [03:22:15] ok, ftl finally up. Don't know whether it was the repeated webservice restart or just other dependencies waking up. [03:22:24] YuviPanda|zzz: do you mean 1012? [03:22:31] andrewbogott: gah yes [03:22:40] Is it OK if I shut them down to do so?
[03:22:45] They aren't active atm so it's ok [03:22:49] JohnMarkOckerblo: The last couple nodes just finished waking up; chances are lots of jobs were stuck in queue [03:22:52] Yeah [03:22:52] great, I’ll do that right now [03:23:12] Coren: Thanks for getting everything back up and running! [03:23:52] YuviPanda|zzz: sorry if my email woke you up or otherwise interrupted… [03:24:23] andrewbogott: nah, I had woken up otherwise... [03:25:23] 6Labs, 10Tool-Labs: Provide webservice bigbrotherrc for actively used tools - https://phabricator.wikimedia.org/T90569#1072818 (10scfc) [03:25:33] PROBLEM - Puppet failure on tools-exec-07 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [03:25:46] andrewbogott: Coren I'm thinking I should add you guys to toollabs alerting group on shinken [03:25:55] It did catch the hosts being down [03:26:05] YuviPanda|zzz: You should indeed. [03:26:10] yep [03:26:25] Transient puppet failures are still a thing tho [03:29:21] AFAICT, the grid is in full health, and only three jobs have errored out. [03:29:31] Coren: I haven't tried that, but since the process is still up, I doubt whether bigbrother will work for me [03:30:18] if bigbrother could actually monitor webservices by sending http requests it would be nice [03:32:08] liangent: It might, though I'm a bit worried by the idea of it killing running webservices. [03:34:42] in case it's serving someone else right now? [03:36:12] Coren: qmod -rq lightpd would restart all lighty jobs right? [03:36:12] Probably won't stagger them [03:36:22] Or that a burp of lag, or a db stall, and so on.
A better solution, imo, would be to have a standardized way for tool writers to have a watchdog that could check for real status beyond "does it just live" [03:36:31] 6Labs, 10Tool-Labs: Provide webservice bigbrotherrc for actively used tools - https://phabricator.wikimedia.org/T90569#1072829 (10scfc) List of users with `.bigbrotherrc`s with `webservice`: ``` sudo find /data/project /home -mindepth 2 -maxdepth 2 -type f -name .bigbrotherrc -exec grep -l '^webservice' \{\}... [03:37:24] YuviPanda|zzz: webgrid-lighttpd, but yeah. They'll stagger "naturally" because of load, but it's going to be a horde at the gates anyways. [03:37:46] PROBLEM - Host tools-webproxy-01 is DOWN: CRITICAL - Host Unreachable (10.68.17.139) [03:39:15] Coren: right. I can write a script to do it more gently I guess [03:39:42] Coren: so any idea to prevent further 502s? [03:42:06] RECOVERY - Puppet failure on tools-webgrid-tomcat is OK: OK: Less than 1.00% above the threshold [0.0] [03:45:34] RECOVERY - Puppet failure on tools-exec-07 is OK: OK: Less than 1.00% above the threshold [0.0] [03:46:25] YuviPanda|zzz: -02 is on virt1011 now. I may have accidentally killed 01, still looking into that [03:46:39] Cool [03:46:57] liangent: Not without a clearer idea of why your tool ends up stalled. That said, that last error seems to point at it crunching away too hard on some requests and hitting the php limit - that may be the issue. There are only so many connections available by default, so if it takes longer to answer a request than the interval between requests you'll eventually starve. [03:47:08] RECOVERY - Host tools-webproxy-01 is UP: PING OK - Packet loss = 0%, RTA = 6.84 ms [03:47:17] * Coren needs to go to sleep now. [03:48:51] YuviPanda|zzz: nevermind, -01 is fine, it just took a minute. It’s on virt1010 now. [03:49:12] andrewbogott: whee cool.
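The "more gently" script YuviPanda mentions might simply stagger the restarts with a delay between jobs instead of one mass `qmod -rj`. A hedged sketch, assuming job ids arrive one per line on stdin (extracting them from `qstat` output is left out); `QMOD` is overridable purely so the loop can be dry-run:

```shell
# Reschedule webgrid jobs one at a time with a pause between them,
# instead of unleashing the whole horde on the proxy at once.
restart_gently() {
    delay="${1:-10}"               # seconds to wait between restarts
    while read -r job; do
        ${QMOD:-qmod} -rj "$job"   # reschedule a single job
        sleep "$delay"
    done
}
# e.g. feed it job ids: printf '%s\n' 123 456 | restart_gently 5
```

The delay trades total restart time for a flatter load curve on the grid and the webproxy.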
[03:49:38] andrewbogott: I need to figure out how to set up a health check and fail over of floating ips tho [03:49:54] I guess that would need to hit the nova api [03:50:21] Yeah, to change the IP it would. [03:50:34] But at least having a backup is a good start. [03:51:05] Yeah [03:51:34] I could set up DNS round robin [03:51:51] andrewbogott: oh wikitech has a bug that doesn't let me associate hostnames with ips [03:52:07] …how so? [03:52:08] I filed it yesterday. I was trying to set up a test host for these things... [03:52:27] (On phone let me try to find bug) [03:52:42] Are you doing something unusual or do you mean that a perfectly ordinary thing broke? [03:52:52] Perfectly ordinary thing [03:53:02] add hostname to ip [03:53:10] Adds a rdns type record [03:53:17] ok [03:53:26] I need to eat dinner, will look at that in a bit [03:54:57] andrewbogott: https://phabricator.wikimedia.org/T90856 [03:54:59] Ok [03:55:06] I'm also on phone [03:55:18] Coren: andrewbogott thanks for taking care of the outage [03:58:10] andrewbogott: I'll handle the outage report tomorrow morning. [04:18:30] I don't know if this is of interest to anyone here, but here's a clonable version of toolserver SVN archive: http://sourceforge.net/projects/toolserver/?source=directory [04:18:44] (thanks to Nemo_bis and nosy) [05:18:48] 6Labs, 10Wikimedia-Labs-wikitech-interface: Wikitech doesn't allow to associate a hostname with a public ip address - https://phabricator.wikimedia.org/T90856#1072859 (10Andrew) Before: # 208.80.155.192, hosts, wikimedia.org dn: dc=208.80.155.192,ou=hosts,dc=wikimedia,dc=org objectClass: domainrelatedobject o... 
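The health-check-plus-failover idea for the hotspare proxies could look roughly like the sketch below. The `nova remove-floating-ip` / `add-floating-ip` subcommand names are assumptions about the 2015-era novaclient, and the instance names are placeholders; `CURL`/`NOVA` are overridable so the logic can be exercised without real infrastructure.

```shell
# Probe the proxy's floating IP over HTTP; on failure, detach the IP from
# the active instance and attach it to the hot spare via the nova API.
# Subcommand names are assumptions, not a tested failover procedure.
failover_if_down() {
    ip="$1" active="$2" spare="$3"
    if ${CURL:-curl} -fsm 5 "http://$ip/" >/dev/null 2>&1; then
        echo "proxy at $ip ($active) healthy"
    else
        ${NOVA:-nova} remove-floating-ip "$active" "$ip"
        ${NOVA:-nova} add-floating-ip "$spare" "$ip"
        echo "failed $ip over from $active to $spare"
    fi
}
```

As noted in the chat, a check like this has to hit the nova API from somewhere that survives the failure, so it would live outside the proxies themselves.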
[05:20:12] PROBLEM - Host tools-exec-cyberbot is DOWN: PING CRITICAL - Packet loss = 100% [05:20:16] PROBLEM - Host tools-webgrid-04 is DOWN: PING CRITICAL - Packet loss = 100% [05:20:22] PROBLEM - Host tools-exec-09 is DOWN: PING CRITICAL - Packet loss = 100% [05:21:09] PROBLEM - Host tools-webproxy-test is DOWN: PING CRITICAL - Packet loss = 100% [05:21:38] PROBLEM - Host tools-webgrid-tomcat is DOWN: PING CRITICAL - Packet loss = 100% [05:21:52] PROBLEM - Host tools-submit is DOWN: PING CRITICAL - Packet loss = 100% [05:22:56] PROBLEM - Host tools-exec-07 is DOWN: PING CRITICAL - Packet loss = 100% [05:23:20] PROBLEM - Host ToolLabs is DOWN: PING CRITICAL - Packet loss = 100% [05:23:56] Labs-issues again? [05:24:21] yep, seems to be the same thing as before ... that server is really cursed. [05:24:21] PROBLEM - Host tools-exec-03 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:34] andrewbogott, YuviPanda|zzz: ^^ [05:24:43] PROBLEM - Host tools-webproxy is DOWN: PING CRITICAL - Packet loss = 100% [05:24:54] OK, be back in a minute and will look [05:27:24] looking... [05:35:28] YuviPanda|zzz: So, it’s happening again — are those proxies ready? If so I can edit ldap by hand to point to the new IP [05:55:00] andrewbogott: gah no. I just got proper internet. [05:55:06] I'm getting them ready right now [05:55:11] thanks [05:59:48] andrewbogott: can you also migrate tools-webproxy *off* that host? also I can’t bring the new ones up without access to tools-webproxy, certificate is only on tools-webproxy and the dynamicproxy host afaik, and they’re both down... [05:59:50] sorry. [06:00:11] YuviPanda: yes, one moment... [06:04:19] andrewbogott: I’m going to email labs-l and let people know [06:04:27] ok [06:09:41] YuviPanda: it’s copying still, should be done soon [06:12:32] andrewbogott: sent a long email with explanations.
[06:13:14] thanks [06:13:19] I wish /I/ had an explanation :) [06:13:55] andrewbogott: :D these aren’t ‘explanations’ explanations, but just ‘why has it been a bad few weeks’ [06:17:01] andrewbogott: the email basically says ‘we have had two very unrelated hardware issues very close to each other in time, and that’s why this is an issue’. [06:17:08] hopefully there’ll be less gnashing of teeth [06:37:50] YuviPanda: tools-webproxy is back on virt1012 but you can probably reach it now to grab the files you need [06:39:01] ok [06:52:39] andrewbogott: btw, I still can’t reach tools-webproxy [06:53:41] YuviPanda: noted… I’m not sure what to do about that at this moment... [06:53:46] andrewbogott: right. ok [06:53:49] The backing image it depends on isn’t available elsewhere [06:54:01] ah, I see. [06:54:21] maybe I can copy it off virt1012, lemme look [06:55:10] well, I don’t know what the deal is, I can’t find the image on virt1012 either. Even though clearly it was working… [06:55:30] andrewbogott: well, don’t worry about it atm, I’d say. [06:55:45] andrewbogott: tools is down anyway, and I’m not sure how well it’ll recover if we just bring back tools-webproxy and not the other tools hosts that went down [06:56:09] I would hope that the tools hosts which are up would be able to function still... [06:58:14] andrewbogott: yeah, those are. tools-login is up and most bots themselves should be fine. [07:10:13] RECOVERY - Host ToolLabs is UP: PING OK - Packet loss = 0%, RTA = 2.03 ms [07:10:17] YuviPanda: tools-webproxy is back [07:10:25] booyeah [07:10:36] are there other instances I should move to make tools happier?
[07:10:44] looking [07:10:48] tools-submit [07:10:57] (I think it was on virt1012) [07:11:02] ok [07:12:29] RECOVERY - Host tools-webproxy is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [07:16:07] FYI ToolScript is happy again [07:16:36] so is Reasonator [07:16:51] GerardM-: yup, most tools should be fine now, if maybe a bit slow [07:18:51] I can't ping tiles.wmflabs.org it seems. [07:19:45] Nicolas: partial labs outage in progres... [07:19:50] *progress [07:19:57] andrewbogott: can you also move the dynamicproxy-gateway instance off virt1012? [07:21:09] YuviPanda: moving. [07:21:13] tools-submit should be back up [07:21:15] andrewbogott: whee, thanks. [07:21:39] RECOVERY - Host tools-submit is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms [07:26:16] YuviPanda WDQ is down [07:26:35] GerardM-: yup, the proxy is down. will be back up (and.rew is moving it atm) [07:27:16] ... again ... you are unlucky, it is not ops that can do a better job [07:28:13] !log tools.kmlexport restarted and moved to trusty [07:29:28] !log tools.kmlexport it doesn’t like trusty’s version of perl, moving back to precise [07:34:05] the problem seems to be a host not in the configuration of catscan [07:34:07] No route to host in /data/project/catscan2/public_html/omniscan.inc on line 132 [07:51:54] GerardM-: It's down wdq.wmflabs.org, and pinging it gets "destination host unreachable" [07:52:58] andrewbogott: any luck with dynamicproxy-gateway? [07:53:04] I think many Wikidata-related tools, hosted on Labs or not, rely on it. [07:53:18] It claims to be booting... 
[07:57:48] PROBLEM - Puppet failure on tools-dev is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [07:58:38] PROBLEM - Puppet failure on tools-exec-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [07:58:54] PROBLEM - Puppet failure on tools-webgrid-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [08:02:47] andrewbogott: tools-webproxy-02 is now available. let me make tools-webproxy-01 available as well. They’ll be hotspares - we can switch anytime by manually switching the floating IP [08:02:48] RECOVERY - Puppet failure on tools-dev is OK: OK: Less than 1.00% above the threshold [0.0] [08:02:59] great [08:03:35] andrewbogott: it’s a bandaid-y solution, though. our current proxy design wasn’t built with multiple ‘masters’ in mind. [08:03:41] anyway, better than nothing. I’ll finish up and document [08:04:05] RECOVERY - Puppet failure on tools-webgrid-01 is OK: OK: Less than 1.00% above the threshold [0.0] [08:04:14] dynamicproxy-gateway is back [08:05:26] GerardM-: wdq is back [08:05:28] Zhaofeng_Li: ^ [08:08:37] RECOVERY - Puppet failure on tools-exec-01 is OK: OK: Less than 1.00% above the threshold [0.0] [08:27:27] !log tools restart *all* webtools (with qmod -rj webgrid-lighttpd) to have tools-webproxy-01 and -02 pick them up as well [08:27:32] Logged the message, Master [08:34:47] 6Labs, 10Tool-Labs, 5Patch-For-Review: ToolLabs web proxy should be load-balanced and tolerate the failure of virt host - https://phabricator.wikimedia.org/T89995#1072960 (10yuvipanda) [08:35:00] 6Labs, 10Tool-Labs, 5Patch-For-Review: ToolLabs web proxy should be load-balanced and tolerate the failure of virt host - https://phabricator.wikimedia.org/T89995#1051019 (10yuvipanda) (Removing the Hackathon project since this needs to be fixed *now*) [08:40:26] 6Labs, 10Tool-Labs, 5Patch-For-Review: ToolLabs web proxy should be load-balanced and tolerate the failure of virt host - 
https://phabricator.wikimedia.org/T89995#1072965 (10yuvipanda) So we will eventually have two proxies - tools-webproxy-01 and tools-webproxy-02, and they'll be hotspares. Webservices will... [08:42:52] RECOVERY - Host tools-webproxy-test is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms [08:44:14] RECOVERY - Host tools-webgrid-04 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [08:45:02] RECOVERY - Host tools-exec-cyberbot is UP: PING OK - Packet loss = 0%, RTA = 2.40 ms [08:45:47] RECOVERY - Host tools-exec-07 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [08:46:20] RECOVERY - Host tools-exec-03 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [08:46:26] RECOVERY - Host tools-webgrid-tomcat is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [08:46:28] andrewbogott: yay! :) [08:46:40] Let’s see if it lasts more than 45 minutes this time [08:46:54] 👍 [08:47:06] andrewbogott: right. tools-webproxy-01 and -02 are hotspares now. [08:47:25] andrewbogott: and since we’ve moved the important bits off, even if virt1012 goes down now toollabs won’t be down. [08:47:30] true [08:47:38] But it will still make me very sad [08:47:41] RECOVERY - Host tools-exec-09 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [08:49:43] andrewbogott: yup :( [08:52:56] andrewbogott: beta is also back up fully now [08:53:04] great [08:53:08] “for now" [08:53:13] andrewbogott: heh. [08:53:49] andrewbogott: you should get some sleep. [08:54:01] I’m looking forward to it! [08:54:14] andrewbogott: :D <3 thank you! [08:54:20] andrewbogott: I’ll keep an eye out [08:54:43] I hope that the trusty upgrade was worth it… this would’ve been a 10-minute outage if I’d just rebooted [08:55:41] andrewbogott: yeah, but can’t keep rebooting... [08:55:55] Certainly not once per hour [08:57:38] yeah [08:58:01] andrewbogott: later today I’ll take stock of which hosts are on which machines, and later on we can maybe distribute them some more. 
[08:58:17] I definitely think we can get to a point where one virt machine going out won’t take out toollabs in the next few days [09:01:03] andrewbogott: I’m going to have some food, I’ll keep an eye on IRC / shinken. [09:01:24] sounds good, thanks. [09:43:04] Hey, is there a policy on what license graphical and text content from tools on Labs should have? E.g.: https://tools.wmflabs.org/wikihistory/wh.php?page_title=Dresden [09:44:50] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 3721 bytes in 0.058 second response time [09:45:57] Is it dead again now [09:46:18] Not fully [09:49:55] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 740329 bytes in 3.459 second response time [09:50:08] Better [10:00:51] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 3721 bytes in 0.634 second response time [10:02:22] I'm giving a talk on Sunday about licenses and want to have no errors in my presentation used for teaching ;).
[10:03:03] https://docs.google.com/presentation/d/1y57W8BNx4jpGMCEKD_TrTLVrbuVc05GVhNtRChKBsJo/edit#slide=id.g6cf20fe0e_037 [10:15:53] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 740278 bytes in 3.030 second response time [10:51:51] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 383 bytes in 0.002 second response time [10:54:16] thank you [11:12:09] !log deployment-prep start mysql on deployment-db1 [11:12:16] Logged the message, Master [11:31:19] PROBLEM - Puppet failure on tools-webgrid-06 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0] [11:32:36] 10Tool-Labs, 10Wikimedia-Hackathon-2015: Conduct a Tool Labs workshop in Leon Hackathon - https://phabricator.wikimedia.org/T91058#1073167 (10yuvipanda) 3NEW a:3yuvipanda [11:36:56] 10Tool-Labs, 10Wikimedia-Hackathon-2015: Conduct a Tool Labs workshop in Lyon Hackathon - https://phabricator.wikimedia.org/T91058#1073179 (10Qgil) [11:50:39] 10Tool-Labs, 10Wikimedia-Hackathon-2015: Create a set of 'template' tools in various languages with deploy scripts for toollabs - https://phabricator.wikimedia.org/T91059#1073191 (10yuvipanda) 3NEW a:3yuvipanda [12:31:05] 10Tool-Labs, 10Wikimedia-Hackathon-2015: Create a set of 'template' tools in various languages with deploy scripts for toollabs - https://phabricator.wikimedia.org/T91059#1073243 (10yuvipanda) With this going from nothing to a 'I am accessing LabsDB and making API calls to enwiki' time for a new tool should be... 
[12:36:07] 10Tool-Labs, 10Wikimania-Hackathon-2015: Conduct a Tool Labs Workshop in Wikimania hackathon - https://phabricator.wikimedia.org/T91061#1073255 (10yuvipanda) 3NEW a:3yuvipanda [12:36:53] 10Tool-Labs, 10Wikimania-Hackathon-2015: Conduct a research tools workshop at wikimania hackathon 2015 - https://phabricator.wikimedia.org/T91062#1073265 (10yuvipanda) 3NEW a:3yuvipanda [13:09:39] 6Labs, 10Tool-Labs, 5Patch-For-Review: ToolLabs web proxy should be load-balanced and tolerate the failure of virt host - https://phabricator.wikimedia.org/T89995#1073283 (10yuvipanda) Alright, so toollabs webproxy is now running on tools-webproxy-01, with a hotspare in tools-webproxy-02. To switch to the sp... [13:09:48] 6Labs, 10Tool-Labs, 5Patch-For-Review: ToolLabs web proxy should be load-balanced and tolerate the failure of virt host - https://phabricator.wikimedia.org/T89995#1073284 (10yuvipanda) a:3yuvipanda [13:10:23] 6Labs, 10Tool-Labs, 5Patch-For-Review: ToolLabs web proxy tolerate the failure of virt host - https://phabricator.wikimedia.org/T89995#1073288 (10yuvipanda) [13:20:48] 10Tool-Labs, 7Documentation: Wikimedia Labs system admin (sysadmin) documentation sucks - https://phabricator.wikimedia.org/T57946#1073311 (10yuvipanda) I think this is fairly important, and we should make better docs now. New tools admins would be joining us shortly, and this page should be much better. [13:21:17] RECOVERY - Puppet failure on tools-webgrid-06 is OK: OK: Less than 1.00% above the threshold [0.0] [13:22:33] 10Tool-Labs, 7Documentation: Wikimedia Labs system admin (sysadmin) documentation sucks - https://phabricator.wikimedia.org/T57946#1073312 (10yuvipanda) I've moved the old page to https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Documentation/Admin/Archive and am creating a new page. 
[13:26:31] 10Tool-Labs, 10Wikimedia-Hackathon-2015: Conduct a Tool Labs workshop at Lyon Hackathon - https://phabricator.wikimedia.org/T91058#1073315 (10Krinkle) [13:41:02] 6Labs, 10Tool-Labs, 5Patch-For-Review: ToolLabs web proxy tolerate the failure of virt host - https://phabricator.wikimedia.org/T89995#1073325 (10yuvipanda) https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Documentation/Admin has documentation now :D [13:52:42] 10Tool-Labs, 7Documentation: Wikimedia Labs system admin (sysadmin) documentation sucks - https://phabricator.wikimedia.org/T57946#1073333 (10yuvipanda) 5Open>3Resolved Tested the failover. Worked perfectly. Haven't tested instructions on bringing back a dead instance, though. [13:52:55] Coren: qstat tells me that job 7663764 is supposedly running on continuous@tools-exec-07.eqiad.wmflabs, but I don't see the process running on that host. Any idea? [13:53:02] 10Tool-Labs, 10Wikimania-Hackathon-2015: Conduct a research tools workshop at wikimania hackathon 2015 - https://phabricator.wikimedia.org/T91062#1073336 (10Halfak) [13:53:14] I imagine a qdel and resubmit would fix it, but I thought I'd let you have a chance to look at it first if you want. [13:53:34] 6Labs, 10Tool-Labs, 7Tracking: Make sure that toollabs can function fully even with one virt* host fully down - https://phabricator.wikimedia.org/T90542#1073345 (10yuvipanda) [13:53:35] 6Labs, 10Tool-Labs, 5Patch-For-Review: ToolLabs web proxy tolerate the failure of virt host - https://phabricator.wikimedia.org/T89995#1073343 (10yuvipanda) 5Open>3Resolved Tested the failover. Worked perfectly. Haven't tested instructions on bringing back a dead instance, though. [13:53:56] anomie: there was an outage earlier today, so it’s probably fallout from that. 
[13:54:13] 10Tool-Labs, 10Wikimania-Hackathon-2015: Conduct a research tools workshop at wikimania hackathon 2015 - https://phabricator.wikimedia.org/T91062#1073265 (10Halfak) [13:54:28] It's mostly the desync where qstat thinks it's still running I'm concerned about. [13:54:44] right [13:56:22] nice, it's so colorful here now :) [13:56:22] anomie: Probably a cadaver from the outage; you can qdel -f it [13:56:41] * anomie does so [14:00:02] ... now it says state "dRr", but it still exists [14:00:39] anomie: I -f’d it for you now [14:02:31] anomie: It may take a minute or two before it notices it's dead. [14:03:47] Well, let's try it again with job 7663714 [14:07:17] That one disappeared fine. [14:10:09] 6Labs, 10Tool-Labs: Have at least two uwsgi nodes so that grid engine can reschedule jobs when one goes down - https://phabricator.wikimedia.org/T91065#1073367 (10yuvipanda) 3NEW [14:12:11] 6Labs, 10Tool-Labs: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1073376 (10yuvipanda) 3NEW [14:17:57] 10Wikimedia-Labs-Infrastructure: Create -latest alias for dumps - https://phabricator.wikimedia.org/T47646#1073393 (10yuvipanda) p:5Lowest>3Normal [14:21:53] YuviPanda: You had a good idea to test that again; it would fail right now. There have been changes in the config that have not been properly reflected on shadow yet. [grumble] [14:22:11] Coren: this is why we should puppetize them all :D [14:22:33] Coren: can you make a note of everything being done in detail on https://phabricator.wikimedia.org/T90546 [14:22:33] ? [14:23:05] YuviPanda: I don't mean the same change on both - I mean a change in the layout of -master that would have needed another - different - change on -shadow to account for it. :-) [14:23:09] YuviPanda: I will. [14:23:17] right.
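The stuck-job cleanup above (jobs qstat still lists in a "d…" state after the outage, with no live process behind them) can be scripted. A sketch only: it assumes the default qstat layout with the job id in column 1 and the state (e.g. `dRr`) in column 5, reads qstat output on stdin, and takes an overridable `QDEL` so it can be dry-run:

```shell
# Force-delete (qdel -f) every job whose qstat state column starts with
# "d", i.e. jobs stuck mid-deletion. Column positions are an assumption
# about the default qstat output; the first two lines are its header.
reap_stuck_jobs() {
    awk 'NR > 2 && $5 ~ /^d/ { print $1 }' |
    while read -r job; do
        ${QDEL:-qdel} -f "$job"   # -f: remove without waiting on the exec host
    done
}
# Typical use: qstat | reap_stuck_jobs
```

As Coren notes, `qdel -f` may still take a minute or two before the job disappears from the scheduler's view.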
[14:33:25] 6Labs, 10Tool-Labs: Set up a schedule for doing failover exercises for toollabs - https://phabricator.wikimedia.org/T91068#1073400 (10yuvipanda) 3NEW [14:37:53] PROBLEM - Puppet failure on tools-shadow is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0] [14:47:58] RECOVERY - Puppet failure on tools-shadow is OK: OK: Less than 1.00% above the threshold [0.0] [14:50:02] !log tools testing gridengine master failover starting now [14:50:10] Logged the message, Master [14:51:46] YuviPanda: /var/lib/gridengine/default/common/act_qmaster is the thing to watch. As the shadow server notices the heartbeat no longer updating, it should start a master on itself and update that. [14:52:02] Coren: and that’s a symlink that points to NFS? [14:52:26] YuviPanda: sorta. /var/lib/gridengine is a bind mount to NFS [14:52:44] aaah [14:53:07] gridengine does need its config and spool shared between nodes. [14:53:52] right [14:54:06] I'm stracing the sge_shadowd right now, looking at it poll the heartbeat file. [14:54:17] right [14:55:22] 5m check interval is hella long when you're waiting for it. :-) [14:55:32] :) [14:57:41] Coren: can you add documentation on what the failover does and how it works to https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Documentation/Admin [14:59:41] * Coren nods. [15:00:17] * Coren "patiently" waits. [15:00:39] * Coren double checks the timeout. [15:05:12] Ah, bah, 600s default [15:05:23] That's too many seconds. [15:06:10] ah, 10mins [15:06:15] Coren: did it pick it up now? [15:06:39] It should soon - the poll interval is 60s. [15:07:07] * Coren watches it like a hawk. [15:08:11] But if it doesn't within the next minute or so I'm going to presume the man page is the one that lies and change the config to what it was before according to the prose documentation. [15:09:09] Coren: cool.
should also start the master back up, though :) [15:09:11] (About the contents of /var/lib/gridengine/default/common/shadow_masters - one says it should contain the name of the shadows and one says it should contain the name of the master /and/ the name of the shadows) [15:09:21] Yeah, I'm restarting master. [15:10:21] Coren: cool [15:10:27] !log tools Master restarted - test not successful. [15:10:32] Logged the message, Master [15:10:33] YuviPanda: I shall debug this now. [15:10:35] !log tools increase instance quota to 64 [15:10:38] Coren: cool. [15:10:39] Logged the message, Master [15:10:50] !log tools created tools-webgrid-generic-02 [15:10:54] Logged the message, Master [15:11:07] * Coren is annoyed as sge_shadowd clearly picked up on the returned master. [15:15:20] Ah, I can forcibly cause the shadow master to be verbose with an env var [15:15:24] That'll help. [15:17:39] 6Labs, 10Tool-Labs: Test and verify that OGE master/shadow failover works as expected - https://phabricator.wikimedia.org/T90546#1073472 (10yuvipanda) p:5Triage>3High @Coren tried it just now, didn't work. He's investigating. If virt1003 goes down then master goes down as well, and things are bad. [15:17:55] 6Labs, 10Tool-Labs: Test and verify that OGE master/shadow failover works as expected - https://phabricator.wikimedia.org/T90546#1073475 (10yuvipanda) a:3coren [15:18:01] Coren: ^ I’ve assigned that to you :) [15:18:07] * YuviPanda goes to make more bugs about re-jiggering instances [15:20:46] !log tools Gridengine master failover test part deux - now with verbose logs [15:20:52] Logged the message, Master [15:21:01] (Also, interval made much smaller for testing) [15:22:04] good afternoon [15:23:14] does anyone know who is in charge of GeoHack in Tool Labs? [15:23:28] I was until 6 months ago [15:23:53] YuviPanda: LOL [15:24:21] I would like to use it in my script to consult where some geolocations belong to.
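The failover mechanics described above (sge_shadowd polls the heartbeat file under the shared spool; once it goes stale past the timeout, the shadow starts a qmaster on itself and rewrites act_qmaster to advertise the new master) can be sketched as a toy shell loop. This is an illustration of the logic only, not gridengine's actual code; the file names mirror the real ones under /var/lib/gridengine/default/common, but the spool here is a scratch directory so the sketch is safe to run anywhere:

```shell
# Toy sketch of the sge_shadowd takeover decision, NOT gridengine's real code.
# SPOOL stands in for /var/lib/gridengine/default/common (shared via NFS).
SPOOL=$(mktemp -d)
MY_HOST=tools-shadow.eqiad.wmflabs
TIMEOUT=600   # seconds; the 600s default mentioned in the channel

touch "$SPOOL/heartbeat"   # the master "touches" this while alive

maybe_take_over() {
    now=$(date +%s)
    beat=$(stat -c %Y "$SPOOL/heartbeat")   # heartbeat mtime (GNU stat)
    if [ $((now - beat)) -gt "$TIMEOUT" ]; then
        # Heartbeat is stale: "start" a qmaster here and advertise ourselves.
        echo "$MY_HOST" > "$SPOOL/act_qmaster"
        echo "took over"
    else
        echo "master alive"
    fi
}

maybe_take_over                               # fresh heartbeat: nothing to do
touch -d '20 minutes ago' "$SPOOL/heartbeat"  # simulate a dead master
maybe_take_over                               # stale heartbeat: failover
cat "$SPOOL/act_qmaster"
```

In the real cluster /var/lib/gridengine is a bind mount onto NFS, so every node sees the rewritten act_qmaster. The sketch also ignores the detail the test itself uncovered: a cleanly shut-down master leaves a lockfile that (correctly) stops shadows from taking over, which is why the failover only triggers when the master actually dies.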
[15:24:45] at the same time i was wondering if the tables in mediawiki geo_tags belong to geohack. [15:24:50] marcmiquel: can you expand on what ‘consult where some geolocations belong to’? [15:24:55] YuviPanda: *that* way of testing would never have worked: when you shut down the master cleanly it puts down a lockfile to prevent shadows from taking over. [15:25:00] I assume he wants to change links on https://en.wikipedia.org/w/index.php?title=Template:GeoTemplate&action=edit [15:25:04] Coren: oh, I see. [15:25:11] marcmiquel: It's Magnus Manske and Kolossos [15:25:25] Coren: kill -9 the master then? :) [15:25:32] YuviPanda: Because, obviously, you don't WANT a new master popping up when you shut it down on purpose. [15:25:33] Good luck getting responses from those two [15:25:42] right [15:25:50] for instance, having my hometown coords: 41° 34′ 43″ N, 1° 37′ 4″ E. I would like to identify the territory where it belongs. [15:25:57] in this case, catalonia. [15:26:09] i think in geohack there is es-ca [15:26:15] spain-catalonia as a field [15:26:23] marcmiquel: I’d suggest looking at geohack’s source code and seeing how it does that. [15:26:32] I suspect it hits OpenStreetMaps? [15:26:33] where could i find it? i saw the website is broken [15:27:03] marcmiquel: it shouldn’t be broken... [15:27:27] https://tools.wmflabs.org/geohack/geohack.php?pagename=Chennai&params=13_5_2_N_80_16_12_E_type:city(4681087)_region:IN-TN works fine [15:27:29] https://wiki.toolserver.org/view/GeoHack [15:27:31] Drop me a link demonstrating where it is and I'll look into it today. But I've gotta go [15:27:45] oh [15:27:47] right [15:27:48] that died. [15:27:49] !log tools Gridengine master failover test part three; killing the master with -9 [15:27:54] Logged the message, Master [15:27:56] I don’t think the toolserver documentation is up anywhere. [15:28:07] YuviPanda: the code doesn't seem to be there either [15:28:16] yeah, looking around...
[15:28:19] hey i created a .bigbrotherrc for a tool 25min ago, no reaction so far. what did I miss? [15:28:38] https://tools.wmflabs.org/tree-of-life/ [15:28:48] tools.tree-of-life@tools-login:~$ cat .bigbrotherrc [15:28:49] jzerebecki: it doesn’t start things up if they aren’t already up. [15:28:49] webservice [15:28:59] jzerebecki: I'm doing tests with the gridengine master at the moment, which may interfere with job scheduling. I expect things will wake back up shortly. [15:29:08] it’s on my list of things to ‘fix' [15:29:28] 02/27/2015 15:28:45| main|tools-shadow|W|starting program: /usr/sbin/sge_qmaster [15:29:49] marcmiquel: http://bitbucket.org/magnusmanske/geohack [15:29:51] is the source [15:29:56] YuviPanda: uh what is the precondition for it to restart it? [15:30:01] awesome YuviPanda [15:30:08] jzerebecki: just do ‘webservice start' [15:30:19] # cat act_qmaster [15:30:19] tools-shadow.eqiad.wmflabs [15:30:28] wooo [15:30:29] YuviPanda: the webservice was running at some point [15:30:41] jzerebecki: right. it probably went down during today’s outage. [15:30:47] and didn’t come back up because of lack of .bigbrotherrc [15:30:52] my second question was... is geo_tags in mediawiki from geohack? [15:30:57] YuviPanda: So yeah, it worked fine all along, so long as the master actually *dies*. :-) [15:31:02] marcmiquel: nope. geohack is just on toollabs. [15:31:16] yup. trying to understand how bigbrother works, how does it know when to restart stuff and when not to? [15:31:33] then, where is that information from? [15:31:45] jzerebecki: It restarts unconditionally once it has seen it running one. [15:31:45] ahm. [15:31:47] once* [15:31:49] marcmiquel: geo_tags? I do not know. MaxSem might know.
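The bigbrother rule YuviPanda states above — it never starts a service it has not already observed running, but restarts unconditionally anything it has seen running once — can be modelled in a few lines of shell. A toy model for illustration only, not the actual bigbrother code:

```shell
# Toy model of bigbrother's rule, NOT the real bigbrother: restart anything
# previously SEEN running that is now down; ignore entries never yet seen.
SEEN=$(mktemp)    # services ever observed running, one per line

bigbrother_tick() {
    # $1 = space-separated list of services currently running
    for svc in $1; do
        grep -qx "$svc" "$SEEN" || echo "$svc" >> "$SEEN"
    done
    while read -r svc; do
        case " $1 " in
            *" $svc "*) ;;               # still up, nothing to do
            *) echo "restart: $svc" ;;   # seen before but down: restart
        esac
    done < "$SEEN"
}

bigbrother_tick "tree-of-life"   # first sighting: healthy, no output
bigbrother_tick ""               # now down: prints "restart: tree-of-life"
```

This is exactly why jzerebecki's freshly created .bigbrotherrc did nothing: a service listed there but never yet seen running is left alone, so a manual `webservice start` is needed once before the watcher will manage it.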
[15:31:54] ah ok thx [15:31:58] marcmiquel: you can also check mediawiki.org [15:32:51] just that the table seemed great [15:33:07] but when looking at gt_country it was very incomplete [15:33:14] NULLs were everywhere in every language [15:33:31] yeah, I don’t think it’s used in production atm. [15:33:36] !log tools Switched back to -master. I'm making a note here: great success. [15:33:41] Logged the message, Master [15:34:08] maybe in production there are the coords users introduce but then WP uses geohack to redirect users [15:34:17] to geohack page [15:34:28] marcmiquel: you might be able to use http://developer.mapquest.com/web/products/open/geocoding-service [15:34:38] Coren: \o/ sweet. can you update the bug + documentation page? [15:34:45] YuviPanda: So yeah, note for the future: if we want to test failover we have to make it fail - not shut it down. :-) [15:35:43] 6Labs, 10Tool-Labs: Test and verify that OGE master/shadow failover works as expected - https://phabricator.wikimedia.org/T90546#1073550 (10coren) 5Open>3Invalid It worked all along, so long as the failover is tested by making the master //fail//. If it's shut down cleanly then the shadow masters (correctly... [15:35:44] 6Labs, 10Tool-Labs, 7Tracking: Make sure that toollabs can function fully even with one virt* host fully down - https://phabricator.wikimedia.org/T90542#1073552 (10coren) [15:35:57] thanks YuviPanda. it might be useful. only that it is a pity that while having coords available in a table, the info is incomplete. [15:36:04] 6Labs, 10Tool-Labs: Test and verify that OGE master/shadow failover works as expected - https://phabricator.wikimedia.org/T90546#1073553 (10coren) 5Invalid>3Resolved [15:36:05] 6Labs, 10Tool-Labs, 7Tracking: Make sure that toollabs can function fully even with one virt* host fully down - https://phabricator.wikimedia.org/T90542#1061721 (10coren) [15:36:46] Coren: can you document the debug env variable, etc on that bug as well? [15:37:05] anyway, am off for food.
brb in a bit. [15:37:58] YuviPanda: good appetite [15:38:08] and thanks for helping! [15:38:32] marcmiquel: yw! [15:38:33] 6Labs, 10Tool-Labs: Move toollabs instances around to minimize damage from a single downed virt* host - https://phabricator.wikimedia.org/T91072#1073566 (10yuvipanda) 3NEW [15:38:58] Coren: I hope to do https://phabricator.wikimedia.org/T91066 soon as well. [15:39:08] that also means moving all the tomcat stuff to trusty, but since it’s the same JVM version it should be alright [15:39:17] alright, food for real [15:39:40] Yeah, the positive thing about Java (there had to be at least one) is that what counts is the JVM not the OS. :-) [15:42:59] Coren: yeah [15:43:21] Coren: can you look at the puppet failures being reported on the new web proxies? Seem to be related to hba [15:47:12] YuviPanda|food: The onlyif Tim added to quiesce the logs prevents the first creation of the file. [16:31:59] RECOVERY - Puppet failure on tools-webproxy-01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:36:00] RECOVERY - Puppet failure on tools-webproxy-02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:44:43] 6Labs, 10Beta-Cluster, 6operations: Backport new salt-syndic packages - https://phabricator.wikimedia.org/T85442#1073725 (10ArielGlenn) I've imported salt-syndic_2014.1.11 into our lucid/precise/rtrusty repos. All dependencies should be there already. Let me know if it wfy. [16:55:12] Coren, YuviPanda|food: Sorry for the ping, but could any of you restart copyvios's web? [16:56:03] Earwig: ^ [16:56:07] {{done}} [16:56:28] Thanks a bunch [16:57:09] :+1: [17:04:17] Coren: can you put a bigbrotherrc file in it as well? 
[17:04:27] I try to do so every time I restart [17:53:39] !log tools increased quota to 512G RAM and 256 cores [17:53:43] Logged the message, Master [18:06:20] PROBLEM - Puppet failure on tools-uwsgi-02 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [0.0] [18:07:39] @seen Technical-13 [18:07:39] Cyberpower678: I have never seen Technical-13 [18:07:47] @seen T13|detached [18:07:48] Cyberpower678: Last time I saw T13|detached they were changing the nickname to , but is no longer in channel ################################################## at 2/17/2015 4:13:20 PM (10d1h54m27s ago) [18:08:05] petan, umm... ^ [18:08:13] what [18:08:28] What's up with wm-bot's response to @seen [18:08:38] @seen T13|mobile [18:08:38] Cyberpower678: Last time I saw T13|mobile they were quitting the network with reason: Quit: http://enwp.org/User:Technical_13 is having connection troubles and should be back soon. N/A at 2/25/2015 3:52:12 AM (2d14h16m26s ago) [18:08:57] that is true name of channel he was in [18:09:26] :O [18:09:58] There's a channel ##################################################? [18:11:18] RECOVERY - Puppet failure on tools-uwsgi-02 is OK: OK: Less than 1.00% above the threshold [0.0] [18:12:16] PROBLEM - Puppet failure on tools-webgrid-06 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [0.0] [18:16:37] more mysterious errors on beta cluster: Database query error (internal_api_error_DBQueryError) (MediawikiApi::ApiError) [18:27:32] Cyberpower678: of course [18:37:17] PROBLEM - Puppet failure on tools-uwsgi-02 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [18:37:27] RECOVERY - Puppet failure on tools-webgrid-06 is OK: OK: Less than 1.00% above the threshold [0.0] [18:52:42] hi! could someone point me to an instruction on how one could create an instance in labs? 
[19:02:56] 10Wikibugs: MultiCol text overlines Templates in FF - https://phabricator.wikimedia.org/T91098#1074263 (10Nnvu) 3NEW [19:20:03] I get Puppet status: failed when creating instance on labs - does anybody know how to find out what's wrong? [19:39:15] I just set up a new labs instance http://drmf.wmflabs.org/w/index.php?title=Special:UserLogin&returnto=Main+Page and was trying to log in with admin and the default password [19:40:08] I get a strange error message Call to a member function setExpectation() on a non-object (NULL) [19:44:12] SMalyshev: The most likely issue is your quotas being reached. [19:44:42] SMalyshev: You can see what usage you have and what you have left from the manage projects page. [19:45:19] Coren: quotas seem to be fine [19:45:42] Coren: I've rerun puppet manually and it seems to be ok. but on initial deployment it fails for some reason [19:46:22] Oh, *puppet* status failed. Sorry, I had misunderstood you. [19:47:06] That's often expected; puppet often needs several passes before it settles - the first run often is unable to install some packages because the apt-repos are not yet in place, for instance. [19:47:26] also, for some reason, I can't create a proxy: Failed to create new proxy wdqwikidata.tools.wmflabs.org. [19:47:41] is there some permission I need to get? [19:48:12] No, but you normally can't create a proxy under tools.wmflabs.org that way; you want to do wdqwikidata.wmflabs.org instead [19:48:35] Coren: there's no such option in the domains list [19:49:37] there are a lot of 3-component ones but no wmflabs.org [19:50:01] Oh, right, I forgot webproxies don't use the same list as public IP management. [19:50:26] Give me a minute and I'll see what's up. [19:50:31] thanks [19:50:41] Yes, the wikitech error reporting is teh suxx0rz [19:52:44] Hm. There's clearly something wrong atm. [19:53:08] I'll need to do further debugging.
[19:55:07] In the meantime, if you want to do testing, I can give you quota to create a public IP [19:58:53] Coren: that'd be cool, thanks [19:59:23] or if there's an easier way to make the browser work, that'd be fine too [19:59:50] SMalyshev: What is your wikitech username? [20:00:19] SMalyshev: Also, what project name is this? [20:00:23] Coren: smalyshev :) [20:00:53] wikidata-query [20:00:58] 6Labs, 10Wikimedia-Labs-wikitech-interface: Proxy creation fails with opaque error message - https://phabricator.wikimedia.org/T91114#1074527 (10coren) 3NEW [20:01:11] I cc'ed you on the bug so you get updates ^^ [20:01:33] thanks [20:02:47] SMalyshev: I just gave the project quota for a public IP. You can allocate one, and point it at the appropriate instance. [20:03:17] Coren: cool, thanks [20:04:54] Coren: how do I do that? Create proxy on https://wikitech.wikimedia.org/w/index.php?title=Special:NovaProxy still produces failure [20:05:54] No, for this you go through 'manage addresses' [20:06:13] Allocate IP, then assign it and a name to the instance of your choice. [20:08:00] Coren: ok, did that - should I also do "add host name"? for which domain? [20:08:42] I tried domain wmflabs and got wdq-wikidata.153.80.208.in-addr.arpa [20:08:53] o_O? [20:08:53] not sure that's what is supposed to happen [20:09:01] Definitely not. [20:09:31] so what do I put in "add host name"? [20:09:33] Lemme try this. [20:10:17] looks like wdq-wikidata.testme.wmflabs.org worked now [20:10:58] don't see how to get it without testme but that should be fine for me too [20:11:15] There's something broken with the name assignment - I expect it's the same issue that also prevents the proxies from working [20:11:45] So long as you can continue to work; once we figure out the root issue you'll be able to switch it to something that works for you.
[20:12:09] now the IP resolves but I still can't access the URL [20:12:11] One last thing you'll have to do is to allow the appropriate port (80) to your security groups [20:12:15] e.g. http://wdq-wikidata.testme.wmflabs.org/wiki/ [20:12:22] It's firewalled by default [20:12:53] Coren: so what needs to be done to enable it? [20:13:06] Lemme do it for you, it'll take 30s. [20:13:18] It's under 'manage security groups' [20:13:19] ok, thanks [20:13:31] ah. [20:13:41] Done. [20:14:10] yeah seems to be working now, thanks! [20:14:58] Coren: one more question if you don't mind - so vagrant supports multiple wikis with different hostnames. Is there a way to make the same work in this setup? [20:15:40] Yes, just add the hostname [20:15:46] SMalyshev: yes, you can use multiple wikis behind the proxy [20:16:03] aha, how do I associate hostnames with wikis? [20:16:06] the trick is to make apache route them properly on the labs instance side [20:16:29] * bd808 looks to see if he has documented this [20:18:17] SMalyshev: apparently I haven't documented the how of this :( [20:19:15] bd808: any short version? :) [20:19:37] <^d> andrewbogott: Thanks for the e-mail, I'd missed the first one. [20:19:59] :) [20:20:10] <^d> most of those other instances can probably just be rebuilt (although I'd let others chime in). deployment-db1 should definitely be rescued though [20:20:19] <^d> It's a mysql master [20:21:08] SMalyshev: Make a /vagrant/puppet/hieradata/local.yaml file and add `mediawiki::multiwiki::base_domain: "-somebasename.wmflabs.org"` [20:21:30] ^d: I can move it to a different host now, but that will mean shutting it down for a few minutes [20:22:00] Then make proxies in labs that point vagranthostname-somebasename.wmflabs.org to your labs instance [20:22:11] ^d: what do you think? Risk it and leave it be, or have an intentional outage? [20:22:32] bd808: need to run anything to enable this? provision? [20:22:39] <^d> andrewbogott: How long will it take?
[20:22:42] <^d> Oh, few minutes [20:22:43] <^d> Hmm [20:22:49] the "vagranthostname" part would be what comes before .wiki.local.wmftest.net in a local install [20:22:59] It’s an scp of the volume. So, long if it’s big, short if it’s small [20:23:16] SMalyshev: yes, you need to run labs-vagrant provision after adding the hiera config [20:23:37] That will change the vhost names in your /etc/apache2 config files [20:23:49] and from there things should "just work" [20:24:19] you can look in the apache config to see what names are expected by the various vhosts [20:24:20] <^d> andrewbogott: Eh, we'll leave it for now [20:24:35] ok [20:24:43] Don’t walk under any ladders this weekend [20:24:49] <^d> It's already afternoon on friday, I don't want to spend the afternoon rebuilding shit if it goes bad :p [20:26:17] bd808: success! thanks a lot! [20:26:28] excellent [20:26:41] * bd808 is writing something up for next time [20:27:51] bd808: yeah that'd be helpful, e.g. for running something like wikidata [20:36:02] SMalyshev: https://wikitech.wikimedia.org/wiki/Labs-vagrant#Use_multiple_wikis_on_a_single_labs-vagrant_host [20:36:16] Please add and update as you find things wrong there [20:37:31] bd808: looks like exactly what I did (except maybe you'd need to sudo labs-vagrant? I did it with sudo but I'm not sure if required), so I think that works [20:38:39] cool. I think labs-vagrant provision does sudo itself at the right point so it should work even without being explicit [20:38:52] ok, great, thanks again! [20:53:16] PROBLEM - Puppet failure on tools-webgrid-06 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0] [20:56:07] Coren: ^ is because webgrid-06 is oom. Anything to be done about that? [20:56:27] oom? How in blazes did /that/ happen? [20:56:41] well, maybe I’m mistaken; take a look? [20:57:11] It's definitely not oom, it's about at 20% usage.
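bd808's multiwiki recipe above boils down to one hiera key plus a re-provision. A sketch of the steps, written here against a scratch directory so it is safe to run anywhere; on a real labs-vagrant host the file goes in /vagrant/puppet/hieradata, and `-somebasename` is whatever suffix you pick (both the key and the command are as given in the conversation):

```shell
# Normally HIERADIR is /vagrant/puppet/hieradata; scratch dir for illustration.
HIERADIR=$(mktemp -d)

# 1. Tell MediaWiki-Vagrant what base domain the per-wiki vhosts should use.
cat > "$HIERADIR/local.yaml" <<'EOF'
mediawiki::multiwiki::base_domain: "-somebasename.wmflabs.org"
EOF

# 2. Re-provision so the /etc/apache2 vhosts get rewritten (real host only):
#      sudo labs-vagrant provision
# 3. Then create one labs webproxy per wiki, pointing
#    wikiname-somebasename.wmflabs.org at the instance.
cat "$HIERADIR/local.yaml"
```

Step 2 is what actually renames the vhosts; afterwards you can read the apache config to see which hostnames each vhost expects, as bd808 notes.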
[20:57:50] I tried to run puppet and it said [20:57:51] Error: Could not run command from prerun_command: Cannot allocate memory - fork(2) [20:57:51] Error: Could not run command from postrun_command: Cannot allocate memory - fork(2) [20:57:59] Aha! It's out of /process slots/, however. [20:58:28] that’d do it [20:58:43] impressive. [20:59:06] Perfect storm, every single webserver on it has as many processes open as is possible. [20:59:23] * Coren ponders. [21:00:58] First time I've seen this in ages. Lemme see what we can do. [21:05:01] Aha. Nope. Nowhere near the thread limit but we /have/ hit the overcommit ratio. [21:05:59] Means that I have too many slots for that host [21:09:41] * Coren ponders. [21:15:38] andrewbogott: It's something trusty - precise hosts have way more running jobs. Looking into it. [21:16:21] andrewbogott: thanks for working so hard recently (+Yuvi +Coren) to keep labs in somewhat a usable state :) (prompted re. the en masse emails) [21:21:08] JohnFLewis: I kind of enjoy the excitement :) But if it keeps up I might not be able to take it. [21:21:53] yeah thanks guys [21:22:54] andrewbogott: let's hope the new hardware gets processed then for labs [21:28:20] RECOVERY - Puppet failure on tools-webgrid-06 is OK: OK: Less than 1.00% above the threshold [0.0] [21:29:12] Coren: ^ is a good sign! [21:30:52] andrewbogott: Yes, I rebalanced the load a bit between -05 and -06 (both trusty) [21:31:15] I still need to figure out why precise is much more forgiving. [22:02:12] HEY [22:02:21] Coren, Coren_: Do you have any idea what’s wrong with http://en.wikipedia.beta.wmflabs.org/wiki/Special:MobileOptions or how I could debug it? The page consistently gives a 503 error, but only that page. [22:04:59] 6Labs, 10Tool-Labs, 5Patch-For-Review: ToolLabs web proxy tolerate the failure of virt host - https://phabricator.wikimedia.org/T89995#1074784 (10scfc) My understanding is that dynamicproxy holds the proxying information essentially in memory, i. e.
it starts with an empty plate. In that case, tools that st... [22:05:32] how is everyone today? [22:05:47] not bad [22:05:50] how does this work im confused lol [22:06:23] scotttriplett: How does what work? The IRC channel? [22:06:43] 6Labs, 10Tool-Labs: Set up a schedule for doing failover exercises for toollabs - https://phabricator.wikimedia.org/T91068#1074796 (10scfc) Well, if it'd affect users, the exercise wasn't successful :-). [22:07:16] well yea lol [22:08:11] Well, you yell something into the void and wait to see if anyone answers [22:08:25] this is new for me and im trying to see if im correct u can use these to make portals to other pages but how do i find my portals [22:08:27] Your luck may vary depending on your time zone [22:09:12] scotttriplett: not sure what you mean about portals [22:09:30] what is the void and how am i suppose to yell into it lol [22:09:53] you’re already yelling into it :) [22:12:39] chrismcmahon: Trying to figure out what’s up with http://en.wikipedia.beta.wmflabs.org/wiki/Special:MobileOptions, but doesn’t seem like any labs folks are around. Any idea who I could ping? If not, I’ll just send an email to the usual folks. [22:12:52] kaldari: looking... [22:13:50] kaldari: ask in -labs maybe. beta has been having weird issues possibly related to the recent hardware failures [22:14:10] I thought I was in -labs :) [22:14:10] oh, just realized what channel this is [22:14:29] this is not my week [22:14:48] NP, it’s been a long week [22:14:53] I've been seeing intermittent db failures, connect failures, but this is the first flat out 503 I've run into [22:15:13] chrismcmahon: And it’s only that one page (as far as I can tell) [22:16:33] chrismcmahon: searched the logs? (A basic question but still :p) [22:17:16] JohnFLewis: Actually, I was wondering about that. Where are the logs stored for beta labs? Are they on deployment-bastion somewhere?
[22:17:47] kaldari: yeah under the /data/ iirc [22:18:17] I would look now but I don't have my labs ssh access on this device [22:19:13] I found some logs, but no idea which log I’m looking for [22:20:12] kaldari: it's tough actually, are there any apache/HHVM error logs? I'm not sure if the bug was fixed where they are local to mediawiki instances only [22:20:39] JohnFLewis: Yes, there are 2 hhvm logs. I’ll look at those... [22:21:53] kaldari: if that turns up nothing it might be worth searching mediawiki 1 and 2 in the usual logging place [22:21:55] JohnFLewis: OK, I think I found the culprit [22:22:02] JohnFLewis: Thanks! [22:22:08] Great [22:22:34] kaldari: thanks, that was fast :) [22:23:24] chrismcmahon, JohnFLewis: now I know how to debug it myself in the future. Yay! [22:23:36] kaldari: where did you look for the logs? [22:23:42] cd /data/project/logs [22:23:55] on deployment-bastion [22:24:02] kaldari: OK, I thought it was fancier than that :) [22:24:08] and specifically the hhvm log [22:24:08] That's it, I was unsure if we used the /project/ subfolder on beta :p [22:24:26] * chrismcmahon looks. I haven't been out there in some time. [22:37:14] 10Wikimedia-Labs-General, 10Wikidata: Need a way to test with data set reasonably close to production - https://phabricator.wikimedia.org/T91131#1074849 (10Smalyshev) 3NEW [22:38:24] 6Labs, 10Tool-Labs, 5Patch-For-Review: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1074863 (10scfc) How much memory are we saving by having separate nodes for lighttpd-based tasks and overprovisioning them? (If that is still true; `modules/tooll... [22:43:29] 10Tool-Labs: Document how to turn shadow into master - https://phabricator.wikimedia.org/T91133#1074873 (10scfc) 3NEW [23:18:21] petan: can you please fix paste [23:18:22] ? [23:40:52] Coren: in https://phabricator.wikimedia.org/P341, do you know where the first entry (153.80.208.in-addr.arpa) came from?
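Back on the earlier tools-webgrid-06 incident: `Cannot allocate memory - fork(2)` with only ~20% of RAM in use is exactly what hitting the kernel's overcommit limit (rather than physical memory) looks like, matching Coren's diagnosis. These are standard Linux /proc interfaces, nothing labs-specific, for checking the limit and the policy behind it:

```shell
# CommitLimit is the ceiling fork() ran into; Committed_AS is how much
# memory is currently promised to processes.
grep -E '^(CommitLimit|Committed_AS)' /proc/meminfo

# The policy behind that limit:
cat /proc/sys/vm/overcommit_memory   # 0 = heuristic, 1 = always, 2 = strict
cat /proc/sys/vm/overcommit_ratio    # % of RAM counted toward the limit
```

When Committed_AS approaches CommitLimit, fork() and other allocations start failing with ENOMEM even though free RAM remains, which is why rebalancing job slots between the trusty hosts cleared the puppet failures.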
Did you add it by hand, perchance? [23:42:53] 6Labs, 10Wikimedia-Labs-wikitech-interface: Wikitech doesn't allow to associate a hostname with a public ip address - https://phabricator.wikimedia.org/T90856#1075020 (10Andrew) This is because of the wmflabs.org domain as defined in ldap: # wmflabs, hosts, wikimedia.org dn: dc=wmflabs,ou=hosts,dc=wikimedi...