[00:06:13] 10Wikimedia-Labs-Infrastructure: WMFLabs: Ganglia deamon is taking up a lot of memory - https://phabricator.wikimedia.org/T73761#1072154 (10Krinkle) 5Open>3declined [02:28:23] PROBLEM - Host tools-exec-09 is DOWN: PING CRITICAL - Packet loss = 100% [02:28:44] PROBLEM - Host tools-webproxy is DOWN: PING CRITICAL - Packet loss = 100% [02:29:09] PROBLEM - Host tools-webproxy-test is DOWN: PING CRITICAL - Packet loss = 100% [02:29:18] PROBLEM - Host tools-exec-03 is DOWN: PING CRITICAL - Packet loss = 100% [02:29:45] PROBLEM - Host tools-webproxy-01 is DOWN: PING CRITICAL - Packet loss = 100% [02:29:51] PROBLEM - Host tools-exec-cyberbot is DOWN: PING CRITICAL - Packet loss = 100% [02:30:19] PROBLEM - Host tools-webgrid-04 is DOWN: PING CRITICAL - Packet loss = 100% [02:30:53] PROBLEM - Host tools-webgrid-tomcat is DOWN: PING CRITICAL - Packet loss = 100% [02:30:59] PROBLEM - Host tools-webproxy-02 is DOWN: PING CRITICAL - Packet loss = 100% [02:31:26] umm :/ [02:31:30] PROBLEM - Host tools-submit is DOWN: PING CRITICAL - Packet loss = 100% [02:31:41] D'oh... Down again [02:31:53] PROBLEM - Host tools-exec-07 is DOWN: PING CRITICAL - Packet loss = 100% [02:32:18] PROBLEM - Host ToolLabs is DOWN: PING CRITICAL - Packet loss = 100% [02:34:03] not just tools, beta is also down [02:34:58] Coren: ^ [02:35:24] Oh, FFS. virt1002 *again*? [02:35:29] 1012* [02:35:33] Is it just me (or my service) or are services on the tools server hanging at the moment? [02:35:59] JohnMarkOckerblo: Looks like hardware fail. [02:37:00] Cursed server. [02:37:18] ah, that's the virt1002 you were just mentioning? (I'd look at the channel log, but the link given here isn't working at the moment either) [02:37:59] ha, that's because it's on labs :P [02:38:12] * Coren spends a minute or two trying to find the root cause. [02:38:18] Otherwise, I'll just reboot the box. [02:38:48] weird thing is I can shell in just fine, but the web services aren't responding. 
[02:39:30] Feel free to reboot as you see fit, though. [02:39:43] Yeah, I can SSH but can't ping other hosts from it. [02:39:51] JohnMarkOckerblo: One of the hosts is ill; that one includes some webproxies. [02:43:42] I'm attempting to suspend the individual instances before I reboot the host. With luck, they may come back without having to be rebooted. [02:43:57] now, why is the web server down? [02:44:27] ok, thanks. [02:44:32] Doesn't look like that will work. Sad. [02:47:20] Things should recover within ~10m while I reboot the host. [02:47:56] Coren: network on virt1012 locked up? [02:48:26] andrewbogott: Yes. I tried something different this time; I tried to suspend the instances before the reboot but they all errored out anyways. :-( [02:48:48] Yeah, I tried to suspend last time but nova-compute crashed during the suspend calls. Did that happen for you as well? [02:48:58] I’m sure that suspending and resuming an individual instance works... [02:49:01] andrewbogott: The logs were completely unhelpful; I have *no* idea why networking died. [02:49:12] Yeah, so, same as last time :( [02:49:18] andrewbogott: I tried them one at a time, but nova-compute SEGV'ed out [02:49:23] Except now it’s happened twice on one server. [02:50:11] andrewbogott: How do you look at the console on the HP ilo? [02:50:34] PROBLEM - Host tools-webgrid-tomcat is DOWN: CRITICAL - Host Unreachable (10.68.16.29) [02:51:08] PROBLEM - Host tools-submit is DOWN: CRITICAL - Host Unreachable (10.68.17.1) [02:51:21] Coren: I think that’s all here: https://wikitech.wikimedia.org/wiki/HP_DL3N0 [02:51:33] 'vsp' [02:52:24] Box is back up [02:53:32] 6Labs, 10Tool-Labs: create bigbrotherrc for drtrigonbot - https://phabricator.wikimedia.org/T90912#1072772 (10scfc) I created a `.bigbrotherrc` with `webservice` in it, but … just at this moment, `tools-submit` which runs `bigbrother` is down :-). [02:53:35] Now you probably have to ‘nova start’ each instance.
[02:53:51] Yeah, about to do so [02:53:55] Do you have a list already or shall I? [02:53:56] ok [02:54:28] andrewbogott: I have a list already. [02:54:58] Hm… seems unlikely that this is a software issue since there are three identical boxes [02:55:03] could be, though :( [02:56:04] andrewbogott: They are in error mode; do you remember offhand how to clear that state? [02:56:18] nova reset-state --active [02:56:27] oh right [02:56:38] And then nova reboot (since ‘start’ doesn’t work from ‘active’) [02:57:11] for h in $(cat instances); do nova reset-state --active $h && nova reboot $h;done [02:57:56] that should do it [02:58:07] although in my experience there are stragglers that take a second go [03:00:55] They are booting gradually. [03:01:16] :D [03:02:48] We did *not* need another outage. [03:03:40] Well, this one was only 10 minutes at least. [03:04:11] 'Don't say hi, until we are over the bridge.' - Swedish saying. [03:05:01] 6Labs, 10Tool-Labs: Various instances unresponsive in "ACTIVE" (previously: "ERROR") state - https://phabricator.wikimedia.org/T91043#1072786 (10scfc) [03:05:50] RECOVERY - Host tools-webproxy-test is UP: PING OK - Packet loss = 0%, RTA = 1.33 ms [03:05:53] andrewbogott: It's only starting 2-3/minute, I expect that's normal? [03:06:02] RECOVERY - Host tools-webproxy-02 is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [03:06:30] Yeah, that sounds about right. [03:07:02] 6Labs, 10Tool-Labs: Various instances unresponsive in "ACTIVE" (previously: "ERROR") state - https://phabricator.wikimedia.org/T91043#1072800 (10coren) 5Open>3Resolved a:3coren Known, immediately signaled on IRC, and being fixed. One of the hosts got ill.
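Andrew's one-liner plus a second pass for the stragglers he mentions could be wrapped like this. This is a minimal sketch, not the script actually used: the `instances` file (one hostname per line) is assumed, and the `NOVA` override exists only so the loop can be dry-run with `echo`.

```shell
# Reset errored instances to ACTIVE and reboot them. In practice a second
# pass catches the stragglers that "take a second go".
# Set NOVA="echo nova" to dry-run without touching the nova API.
boot_instances() {
    while read -r h; do
        ${NOVA:-nova} reset-state --active "$h" && ${NOVA:-nova} reboot "$h"
    done < instances
}
# Typical use: boot_instances; sleep 120; boot_instances
```

The two-pass pattern simply repeats the same idempotent loop; rebooting an already-booting instance is assumed to be harmless here.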
[03:07:04] RECOVERY - Host tools-exec-03 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [03:07:04] RECOVERY - Host tools-webproxy-01 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [03:07:42] RECOVERY - Host tools-exec-09 is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms [03:07:58] RECOVERY - Host ToolLabs is UP: PING OK - Packet loss = 0%, RTA = 0.68 ms [03:08:40] RECOVERY - Host tools-webproxy is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [03:08:48] Wooo! [03:09:00] Okay, everything is back up :D [03:09:16] RECOVERY - Host tools-webgrid-04 is UP: PING OK - Packet loss = 0%, RTA = 1.18 ms [03:09:18] Woah. nova was on steroids; it started almost 30 in 90s [03:09:30] thanks: do i need to restart my webservice? [03:09:35] tools-login and other grid hosts are in separate virts, I guess? [03:10:14] JohnMarkOckerblo: You probably won't have to; as the grid recovers most things should restart on their own within the next 5 minutes. [03:11:00] Zhaofeng_Li: Yes, I tried to spread them around on as many hosts as possible - we've just been unlucky with the host that holds the webproxies in the past couple weeks. [03:11:06] RECOVERY - Host tools-webgrid-tomcat is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms [03:11:27] okay. I'm getting a blank page now rather than a hang, but that might just be things not fully restarted yet, (It's a Perl-based script, expecting to read some files, if that matters) [03:11:33] andrewbogott: Fun fact: one of the hosts seems to have been suspended correctly. 
[03:11:42] RECOVERY - Host tools-submit is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [03:11:55] Hm [03:11:58] So it’s possible, vaguely [03:12:02] PROBLEM - Puppet failure on tools-webgrid-tomcat is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0] [03:12:21] Coren: my toollab web went down several times randomly, with 502 Bad Gateway [03:12:27] JohnMarkOckerblo: It takes a few minutes for the grid to recover, and while it does the load is high enough that services tend to remain queued for a while until things settle. [03:12:30] and webservice restart makes it work again [03:13:12] liangent: 502? That normally means the process is up but not responding to requests from the proxy. Restarting will kill it and start a new one, so that'd be expected. Do you have logs? [03:13:32] Cyberpower678: around? [03:13:41] Coren: where can I see logs [03:14:05] nakon, yes? [03:14:09] kind of [03:14:12] liangent: Normally, in your home. access.log and error.log by default with lighttpd; but if you use another webserver that varies. [03:14:15] nakon, who are you? [03:14:28] another toolserver user :) [03:14:34] :p [03:14:58] thanks, i'll send PM [03:15:42] Coren: last access.log entry before restart: 10.68.16.4 tools.wmflabs.org - [24/Feb/2015:01:33:09 +0000] "GET /liangent-php/load.php/zhwiki?debug=false&lang=zh-cn&modules=jquery.checkboxShiftClick%2Ccookie%2ChighlightText%2CmakeCollapsible%2Cmw-jump%2Cplaceholder%2Csuggestions%7Cmediawiki.api%2CsearchSuggest%2Cuser%7Cmediawiki.page.ready&skin=vector&version=20141114T033456Z&* HTTP/1.1" 200 39429 "https://tools.wmflabs.org/liangent-php/index.p [03:15:57] and the last error.log: 2015-02-23 09:50:28: (mod_fastcgi.c.2701) FastCGI-stderr: PHP Fatal error: Maximum execution time of 30 seconds exceeded in /data/project/liangent-ph [03:15:58] p/mw/includes/utils/StringUtils.php on line 509 [03:16:54] andrewbogott: Actually, three are marked suspended. 
"Interesting" [03:17:25] Coren: use virt1005 if you want to tinker. It only has two instances on it, both of them there for my experiments. [03:17:49] I successfully suspended an instance, rebooted the host, and resumed the instance yesterday. So it’s something about numbers/rates I suspect [03:18:19] May be made worse by the network being ill. [03:18:55] But it's too late for me to want to tinker. Ima massage things into full recovery and flee. [03:19:03] could be, although I don’t think that nova talks to the instances via their network interfaces… commands should go straight to libvirt over eth0 which was clearly working fine [03:19:13] Coren: ok [03:19:39] I sent an email to Yuvi and Ops suggesting possible stopgaps. But nothing to labs-l [03:19:54] Yeah, that didn't work so well once it segfaulted. :-) [03:20:02] RECOVERY - Host tools-exec-cyberbot is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [03:20:12] afaict, all instances are up. [03:20:18] * Coren checks the tools grid now. [03:20:56] For what it’s worth, this happened at a slightly different time than the last one, so it’s presumably not cron related. [03:20:56] RECOVERY - Host tools-exec-07 is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [03:20:58] * andrewbogott grasps at straws [03:21:39] 10Tool-Labs: Test how bigbrother reacts to user names not resolving and, if necessary, fix it - https://phabricator.wikimedia.org/T90410#1072812 (10scfc) I had to restart `bigbrother` again just now after the reboot of `tools-submit` so this suggests that this wasn't a one-time fluke. [03:22:05] andrewbogott: can you migrate tools-webproxy-01 and 02 off virt1002? [03:22:15] ok, ftl finally up. Don't know whether it was the repeated webservice restart or just other dependencies waking up. [03:22:24] YuviPanda|zzz: do you mean 1012? [03:22:31] andrewbogott: gah yes [03:22:40] Is it OK if I shut them down to do so?
[03:22:45] They aren't active atm so it's ok [03:22:49] JohnMarkOckerblo: The last couple nodes just finished waking up; chances are lots of jobs were stuck in queue [03:22:52] Yeah [03:22:52] great, I’ll do that right now [03:23:12] Coren: Thanks for getting everything back up and running! [03:23:52] YuviPanda|zzz: sorry if my email woke you up or otherwise interrupted… [03:24:23] andrewbogott: nah, I had woken up otherwise... [03:25:23] 6Labs, 10Tool-Labs: Provide webservice bigbrotherrc for actively used tools - https://phabricator.wikimedia.org/T90569#1072818 (10scfc) [03:25:33] PROBLEM - Puppet failure on tools-exec-07 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [03:25:46] andrewbogott: Coren I'm thinking I should add you guys to toollabs alerting group on shinken [03:25:55] It did catch the hosts being down [03:26:05] YuviPanda|zzz: You should indeed. [03:26:10] yep [03:26:25] Transient puppet failures are still a thing tho [03:29:21] AFAICT, the grid is in full health, and only three jobs have errored out. [03:29:31] Coren: I haven't tried that, but since the process is still up, I doubt whether bigbrother will work for me [03:30:18] if bigbrother could actually monitor webservices by sending http requests it would be nice [03:32:08] liangent: It might, though I'm a bit worried by the idea of it killing running webservices. [03:34:42] in case it's serving someone else right now? [03:36:12] Coren: qmod -rq lightpd would restart all lighty jobs right? [03:36:12] Probably won't stagger them [03:36:22] Or that a burp of lag, or a db stall, and so on.
A better solution, imo, would be to have a standardized way for tool writers to have a watchdog that could check for real status beyond "does it just live" [03:36:31] 6Labs, 10Tool-Labs: Provide webservice bigbrotherrc for actively used tools - https://phabricator.wikimedia.org/T90569#1072829 (10scfc) List of users with `.bigbrotherrc`s with `webservice`: ``` sudo find /data/project /home -mindepth 2 -maxdepth 2 -type f -name .bigbrotherrc -exec grep -l '^webservice' \{\}... [03:37:24] YuviPanda|zzz: webgrid-lighttpd, but yeah. They'll stagger "naturally" because of load, but it's going to be a horde at the gates anyways. [03:37:46] PROBLEM - Host tools-webproxy-01 is DOWN: CRITICAL - Host Unreachable (10.68.17.139) [03:39:15] Coren: right. I can write a script to do it more gently I guess [03:39:42] Coren: so any idea to prevent further 502s? [03:42:06] RECOVERY - Puppet failure on tools-webgrid-tomcat is OK: OK: Less than 1.00% above the threshold [0.0] [03:45:34] RECOVERY - Puppet failure on tools-exec-07 is OK: OK: Less than 1.00% above the threshold [0.0] [03:46:25] YuviPanda|zzz: -02 is on virt1011 now. I may have accidentally killed 01, still looking into that [03:46:39] Cool [03:46:57] liangent: Not without a clearer idea of why your tool ends up stalled. That said, that last error seems to point at it crunching away too hard on some requests and hitting the php limit - that may be the issue. There are only so many connections available by default, so if it takes longer to answer a request than the interval between requests you'll eventually starve. [03:47:08] RECOVERY - Host tools-webproxy-01 is UP: PING OK - Packet loss = 0%, RTA = 6.84 ms [03:47:17] * Coren needs to go to sleep now. [03:48:51] YuviPanda|zzz: nevermind, -01 is fine, it just took a minute. It’s on virt1010 now. [03:49:12] andrewbogott: whee cool.
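The "more gently" script YuviPanda mentions might simply stagger the restarts with a delay between jobs instead of one mass `qmod -rj`. A hedged sketch, assuming job ids arrive one per line on stdin (extracting them from `qstat` output is left out); `QMOD` is overridable purely so the loop can be dry-run:

```shell
# Reschedule webgrid jobs one at a time with a pause between them,
# instead of unleashing the whole horde on the proxy at once.
restart_gently() {
    delay="${1:-10}"               # seconds to wait between restarts
    while read -r job; do
        ${QMOD:-qmod} -rj "$job"   # reschedule a single job
        sleep "$delay"
    done
}
# e.g. feed it job ids: printf '%s\n' 123 456 | restart_gently 5
```

The delay trades total restart time for a flatter load curve on the grid and the webproxy.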
[03:49:38] andrewbogott: I need to figure out how to set up a health check and fail over of floating ips tho [03:49:54] I guess that would need to hit the nova api [03:50:21] Yeah, to change the IP it would. [03:50:34] But at least having a backup is a good start. [03:51:05] Yeah [03:51:34] I could set up DNS round robin [03:51:51] andrewbogott: oh wikitech has a bug that doesn't let me associate hostnames with ips [03:52:07] …how so? [03:52:08] I filed it yesterday. I was trying to set up a test host for these things... [03:52:27] (On phone let me try to find bug) [03:52:42] Are you doing something unusual or do you mean that a perfectly ordinary thing broke? [03:52:52] Perfectly ordinary thing [03:53:02] add hostname to ip [03:53:10] Adds a rdns type record [03:53:17] ok [03:53:26] I need to eat dinner, will look at that in a bit [03:54:57] andrewbogott: https://phabricator.wikimedia.org/T90856 [03:54:59] Ok [03:55:06] I'm also on phone [03:55:18] Coren: andrewbogott thanks for taking care of the outage [03:58:10] andrewbogott: I'll handle the outage report tomorrow morning. [04:18:30] I don't know if this is of interest to anyone here, but here's a clonable version of toolserver SVN archive: http://sourceforge.net/projects/toolserver/?source=directory [04:18:44] (thanks to Nemo_bis and nosy) [05:18:48] 6Labs, 10Wikimedia-Labs-wikitech-interface: Wikitech doesn't allow to associate a hostname with a public ip address - https://phabricator.wikimedia.org/T90856#1072859 (10Andrew) Before: # 208.80.155.192, hosts, wikimedia.org dn: dc=208.80.155.192,ou=hosts,dc=wikimedia,dc=org objectClass: domainrelatedobject o... 
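The health-check-plus-failover idea for the hotspare proxies could look roughly like the sketch below. The `nova remove-floating-ip` / `add-floating-ip` subcommand names are assumptions about the 2015-era novaclient, and the instance names are placeholders; `CURL`/`NOVA` are overridable so the logic can be exercised without real infrastructure.

```shell
# Probe the proxy's floating IP over HTTP; on failure, detach the IP from
# the active instance and attach it to the hot spare via the nova API.
# Subcommand names are assumptions, not a tested failover procedure.
failover_if_down() {
    ip="$1" active="$2" spare="$3"
    if ${CURL:-curl} -fsm 5 "http://$ip/" >/dev/null 2>&1; then
        echo "proxy at $ip ($active) healthy"
    else
        ${NOVA:-nova} remove-floating-ip "$active" "$ip"
        ${NOVA:-nova} add-floating-ip "$spare" "$ip"
        echo "failed $ip over from $active to $spare"
    fi
}
```

As noted in the chat, a check like this has to hit the nova API from somewhere that survives the failure, so it would live outside the proxies themselves.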
[05:20:12] PROBLEM - Host tools-exec-cyberbot is DOWN: PING CRITICAL - Packet loss = 100% [05:20:16] PROBLEM - Host tools-webgrid-04 is DOWN: PING CRITICAL - Packet loss = 100% [05:20:22] PROBLEM - Host tools-exec-09 is DOWN: PING CRITICAL - Packet loss = 100% [05:21:09] PROBLEM - Host tools-webproxy-test is DOWN: PING CRITICAL - Packet loss = 100% [05:21:38] PROBLEM - Host tools-webgrid-tomcat is DOWN: PING CRITICAL - Packet loss = 100% [05:21:52] PROBLEM - Host tools-submit is DOWN: PING CRITICAL - Packet loss = 100% [05:22:56] PROBLEM - Host tools-exec-07 is DOWN: PING CRITICAL - Packet loss = 100% [05:23:20] PROBLEM - Host ToolLabs is DOWN: PING CRITICAL - Packet loss = 100% [05:23:56] Labs-issues again? [05:24:21] yep, seems to be the same thing as before ... that server is really cursed. [05:24:21] PROBLEM - Host tools-exec-03 is DOWN: PING CRITICAL - Packet loss = 100% [05:24:34] andrewbogott, YuviPanda|zzz: ^^ [05:24:43] PROBLEM - Host tools-webproxy is DOWN: PING CRITICAL - Packet loss = 100% [05:24:54] OK, be back in a minute and will look [05:27:24] looking... [05:35:28] YuviPanda|zzz: So, it’s happening again — are those proxies ready? If so I can edit ldap by hand to point to the new IP [05:55:00] andrewbogott: gah no. I just got proper internet. [05:55:06] I'm getting them ready right now [05:55:11] thanks [05:59:48] andrewbogott: can you also migrate tools-webproxy *off* that host? also I can’t bring the new ones up without access to tools-webproxy, certificate is only on tools-webproxy and the dynamicproxy host afaik, and they’re both down... [05:59:50] sorry. [06:00:11] YuviPanda: yes, one moment... [06:04:19] andrewbogott: I’m going to email labs-l and let people know [06:04:27] ok [06:09:41] YuviPanda: it’s copying still, should be done soon [06:12:32] andrewbogott: sent a long email with explanations.
[06:13:14] thanks [06:13:19] I wish /I/ had an explanation :) [06:13:55] andrewbogott: :D these aren’t ‘explanations’ explanations, but just ‘why has it been a bad few weeks’ [06:17:01] andrewbogott: the email basically says ‘we have had two very unrelated hardware issues very close to each other in time, and that’s why this is an issue’. [06:17:08] hopefully there’ll be less gnashing of teeth [06:37:50] YuviPanda: tools-webproxy is back on virt1012 but you can probably reach it now to grab the files you need [06:39:01] ok [06:52:39] andrewbogott: btw, I still can’t reach tools-webproxy [06:53:41] YuviPanda: noted… I’m not sure what to do about that at this moment... [06:53:46] andrewbogott: right. ok [06:53:49] The backing image it depends on isn’t available elsewhere [06:54:01] ah, I see. [06:54:21] maybe I can copy it off virt1012, lemme look [06:55:10] well, I don’t know what the deal is, I can’t find the image on virt1012 either. Even though clearly it was working… [06:55:30] andrewbogott: well, don’t worry about it atm, I’d say. [06:55:45] andrewbogott: tools is down anyway, and I’m not sure how well it’ll recover if we just bring back tools-webproxy and not the other tools hosts that went down [06:56:09] I would hope that the tools hosts which are up would be able to function still... [06:58:14] andrewbogott: yeah, those are. tools-login is up and most bots themselves should be fine. [07:10:13] RECOVERY - Host ToolLabs is UP: PING OK - Packet loss = 0%, RTA = 2.03 ms [07:10:17] YuviPanda: tools-webproxy is back [07:10:25] booyeah [07:10:36] are there other instances I should move to make tools happier?
[07:10:44] looking [07:10:48] tools-submit [07:10:57] (I think it was on virt1012) [07:11:02] ok [07:12:29] RECOVERY - Host tools-webproxy is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [07:16:07] FYI ToolScript is happy again [07:16:36] so is Reasonator [07:16:51] GerardM-: yup, most tools should be fine now, if maybe a bit slow [07:18:51] I can't ping tiles.wmflabs.org it seems. [07:19:45] Nicolas: partial labs outage in progres... [07:19:50] *progress [07:19:57] andrewbogott: can you also move the dynamicproxy-gateway instance off virt1012? [07:21:09] YuviPanda: moving. [07:21:13] tools-submit should be back up [07:21:15] andrewbogott: whee, thanks. [07:21:39] RECOVERY - Host tools-submit is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms [07:26:16] YuviPanda WDQ is down [07:26:35] GerardM-: yup, the proxy is down. will be back up (and.rew is moving it atm) [07:27:16] ... again ... you are unlucky, it is not ops that can do a better job [07:28:13] !log tools.kmlexport restarted and moved to trusty [07:29:28] !log tools.kmlexport it doesn’t like trusty’s version of perl, moving back to precise [07:34:05] the problem seems to be a host not in the configuration of catscan [07:34:07] No route to host in /data/project/catscan2/public_html/omniscan.inc on line 132 [07:51:54] GerardM-: It's down wdq.wmflabs.org, and pinging it gets "destination host unreachable" [07:52:58] andrewbogott: any luck with dynamicproxy-gateway? [07:53:04] I think many Wikidata-related tools, hosted on Labs or not, rely on it. [07:53:18] It claims to be booting... 
[07:57:48] PROBLEM - Puppet failure on tools-dev is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [07:58:38] PROBLEM - Puppet failure on tools-exec-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [07:58:54] PROBLEM - Puppet failure on tools-webgrid-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [08:02:47] andrewbogott: tools-webproxy-02 is now available. let me make tools-webproxy-01 available as well. They’ll be hotspares - we can switch anytime by manually switching the floating IP [08:02:48] RECOVERY - Puppet failure on tools-dev is OK: OK: Less than 1.00% above the threshold [0.0] [08:02:59] great [08:03:35] andrewbogott: it’s a bandaid-y solution, though. our current proxy design wasn’t built with multiple ‘masters’ in mind. [08:03:41] anyway, better than nothing. I’ll finish up and document [08:04:05] RECOVERY - Puppet failure on tools-webgrid-01 is OK: OK: Less than 1.00% above the threshold [0.0] [08:04:14] dynamicproxy-gateway is back [08:05:26] GerardM-: wdq is back [08:05:28] Zhaofeng_Li: ^ [08:08:37] RECOVERY - Puppet failure on tools-exec-01 is OK: OK: Less than 1.00% above the threshold [0.0] [08:27:27] !log tools restart *all* webtools (with qmod -rj webgrid-lighttpd) to have tools-webproxy-01 and -02 pick them up as well [08:27:32] Logged the message, Master [08:34:47] 6Labs, 10Tool-Labs, 5Patch-For-Review: ToolLabs web proxy should be load-balanced and tolerate the failure of virt host - https://phabricator.wikimedia.org/T89995#1072960 (10yuvipanda) [08:35:00] 6Labs, 10Tool-Labs, 5Patch-For-Review: ToolLabs web proxy should be load-balanced and tolerate the failure of virt host - https://phabricator.wikimedia.org/T89995#1051019 (10yuvipanda) (Removing the Hackathon project since this needs to be fixed *now*) [08:40:26] 6Labs, 10Tool-Labs, 5Patch-For-Review: ToolLabs web proxy should be load-balanced and tolerate the failure of virt host - 
https://phabricator.wikimedia.org/T89995#1072965 (10yuvipanda) So we will eventually have two proxies - tools-webproxy-01 and tools-webproxy-02, and they'll be hotspares. Webservices will... [08:42:52] RECOVERY - Host tools-webproxy-test is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms [08:44:14] RECOVERY - Host tools-webgrid-04 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [08:45:02] RECOVERY - Host tools-exec-cyberbot is UP: PING OK - Packet loss = 0%, RTA = 2.40 ms [08:45:47] RECOVERY - Host tools-exec-07 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [08:46:20] RECOVERY - Host tools-exec-03 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms [08:46:26] RECOVERY - Host tools-webgrid-tomcat is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [08:46:28] andrewbogott: yay! :) [08:46:40] Let’s see if it lasts more than 45 minutes this time [08:46:54] 👍 [08:47:06] andrewbogott: right. tools-webproxy-01 and -02 are hotspares now. [08:47:25] andrewbogott: and since we’ve moved the important bits off, even if virt1012 goes down now toollabs won’t be down. [08:47:30] true [08:47:38] But it will still make me very sad [08:47:41] RECOVERY - Host tools-exec-09 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [08:49:43] andrewbogott: yup :( [08:52:56] andrewbogott: beta is also back up fully now [08:53:04] great [08:53:08] “for now" [08:53:13] andrewbogott: heh. [08:53:49] andrewbogott: you should get some sleep. [08:54:01] I’m looking forward to it! [08:54:14] andrewbogott: :D <3 thank you! [08:54:20] andrewbogott: I’ll keep an eye out [08:54:43] I hope that the trusty upgrade was worth it… this would’ve been a 10-minute outage if I’d just rebooted [08:55:41] andrewbogott: yeah, but can’t keep rebooting... [08:55:55] Certainly not once per hour [08:57:38] yeah [08:58:01] andrewbogott: later today I’ll take stock of which hosts are on which machines, and later on we can maybe distribute them some more. 
[08:58:17] I definitely think we can get to a point where one virt machine going out won’t take out toollabs in the next few days [09:01:03] andrewbogott: I’m going to have some food, I’ll keep an eye on IRC / shinken. [09:01:24] sounds good, thanks. [09:43:04] Hey, is there a policy on what license graphical and text content from tools on Labs should have? E.g.: https://tools.wmflabs.org/wikihistory/wh.php?page_title=Dresden [09:44:50] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 3721 bytes in 0.058 second response time [09:45:57] Is it dead again now [09:46:18] Not fully [09:49:55] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 740329 bytes in 3.459 second response time [09:50:08] Better [10:00:51] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 3721 bytes in 0.634 second response time [10:02:22] I'm giving a talk on Sunday about licenses and want to have no errors in my presentation used for teaching ;).
[10:03:03] https://docs.google.com/presentation/d/1y57W8BNx4jpGMCEKD_TrTLVrbuVc05GVhNtRChKBsJo/edit#slide=id.g6cf20fe0e_037 [10:15:53] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 740278 bytes in 3.030 second response time [10:51:51] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 383 bytes in 0.002 second response time [10:54:16] thank you [11:12:09] !log deployment-prep start mysql on deployment-db1 [11:12:16] Logged the message, Master [11:31:19] PROBLEM - Puppet failure on tools-webgrid-06 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0] [11:32:36] 10Tool-Labs, 10Wikimedia-Hackathon-2015: Conduct a Tool Labs workshop in Leon Hackathon - https://phabricator.wikimedia.org/T91058#1073167 (10yuvipanda) 3NEW a:3yuvipanda [11:36:56] 10Tool-Labs, 10Wikimedia-Hackathon-2015: Conduct a Tool Labs workshop in Lyon Hackathon - https://phabricator.wikimedia.org/T91058#1073179 (10Qgil) [11:50:39] 10Tool-Labs, 10Wikimedia-Hackathon-2015: Create a set of 'template' tools in various languages with deploy scripts for toollabs - https://phabricator.wikimedia.org/T91059#1073191 (10yuvipanda) 3NEW a:3yuvipanda [12:31:05] 10Tool-Labs, 10Wikimedia-Hackathon-2015: Create a set of 'template' tools in various languages with deploy scripts for toollabs - https://phabricator.wikimedia.org/T91059#1073243 (10yuvipanda) With this going from nothing to a 'I am accessing LabsDB and making API calls to enwiki' time for a new tool should be... 
[12:36:07] 10Tool-Labs, 10Wikimania-Hackathon-2015: Conduct a Tool Labs Workshop in Wikimania hackathon - https://phabricator.wikimedia.org/T91061#1073255 (10yuvipanda) 3NEW a:3yuvipanda [12:36:53] 10Tool-Labs, 10Wikimania-Hackathon-2015: Conduct a research tools workshop at wikimania hackathon 2015 - https://phabricator.wikimedia.org/T91062#1073265 (10yuvipanda) 3NEW a:3yuvipanda [13:09:39] 6Labs, 10Tool-Labs, 5Patch-For-Review: ToolLabs web proxy should be load-balanced and tolerate the failure of virt host - https://phabricator.wikimedia.org/T89995#1073283 (10yuvipanda) Alright, so toollabs webproxy is now running on tools-webproxy-01, with a hotspare in tools-webproxy-02. To switch to the sp... [13:09:48] 6Labs, 10Tool-Labs, 5Patch-For-Review: ToolLabs web proxy should be load-balanced and tolerate the failure of virt host - https://phabricator.wikimedia.org/T89995#1073284 (10yuvipanda) a:3yuvipanda [13:10:23] 6Labs, 10Tool-Labs, 5Patch-For-Review: ToolLabs web proxy tolerate the failure of virt host - https://phabricator.wikimedia.org/T89995#1073288 (10yuvipanda) [13:20:48] 10Tool-Labs, 7Documentation: Wikimedia Labs system admin (sysadmin) documentation sucks - https://phabricator.wikimedia.org/T57946#1073311 (10yuvipanda) I think this is fairly important, and we should make better docs now. New tools admins would be joining us shortly, and this page should be much better. [13:21:17] RECOVERY - Puppet failure on tools-webgrid-06 is OK: OK: Less than 1.00% above the threshold [0.0] [13:22:33] 10Tool-Labs, 7Documentation: Wikimedia Labs system admin (sysadmin) documentation sucks - https://phabricator.wikimedia.org/T57946#1073312 (10yuvipanda) I've moved the old page to https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Documentation/Admin/Archive and am creating a new page. 
[13:26:31] 10Tool-Labs, 10Wikimedia-Hackathon-2015: Conduct a Tool Labs workshop at Lyon Hackathon - https://phabricator.wikimedia.org/T91058#1073315 (10Krinkle) [13:41:02] 6Labs, 10Tool-Labs, 5Patch-For-Review: ToolLabs web proxy tolerate the failure of virt host - https://phabricator.wikimedia.org/T89995#1073325 (10yuvipanda) https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Documentation/Admin has documentation now :D [13:52:42] 10Tool-Labs, 7Documentation: Wikimedia Labs system admin (sysadmin) documentation sucks - https://phabricator.wikimedia.org/T57946#1073333 (10yuvipanda) 5Open>3Resolved Tested the failover. Worked perfectly. Haven't tested instructions on bringing back a dead instance, though. [13:52:55] Coren: qstat tells me that job 7663764 is supposedly running on continuous@tools-exec-07.eqiad.wmflabs, but I don't see the process running on that host. Any idea? [13:53:02] 10Tool-Labs, 10Wikimania-Hackathon-2015: Conduct a research tools workshop at wikimania hackathon 2015 - https://phabricator.wikimedia.org/T91062#1073336 (10Halfak) [13:53:14] I imagine a qdel and resubmit would fix it, but I thought I'd let you have a chance to look at it first if you want. [13:53:34] 6Labs, 10Tool-Labs, 7Tracking: Make sure that toollabs can function fully even with one virt* host fully down - https://phabricator.wikimedia.org/T90542#1073345 (10yuvipanda) [13:53:35] 6Labs, 10Tool-Labs, 5Patch-For-Review: ToolLabs web proxy tolerate the failure of virt host - https://phabricator.wikimedia.org/T89995#1073343 (10yuvipanda) 5Open>3Resolved Tested the failover. Worked perfectly. Haven't tested instructions on bringing back a dead instance, though. [13:53:56] anomie: there was an outage earlier today, so it’s probably fallout from that. 
[13:54:13] 10Tool-Labs, 10Wikimania-Hackathon-2015: Conduct a research tools workshop at wikimania hackathon 2015 - https://phabricator.wikimedia.org/T91062#1073265 (10Halfak) [13:54:28] It's mostly the desync where qstat thinks it's still running I'm concerned about. [13:54:44] right [13:56:22] nice, it's so colorful here now :) [13:56:22] anomie: Probably a cadaver from the outage; you can qdel -f it [13:56:41] * anomie does so [14:00:02] ... now it says state "dRr", but it still exists [14:00:39] anomie: I -f’d it for you now [14:02:31] anomie: It may take a minute or two before it notices it's dead. [14:03:47] Well, let's try it again with job 7663714 [14:07:17] That one disappeared fine. [14:10:09] 6Labs, 10Tool-Labs: Have at least two uwsgi nodes so that grid engine can reschedule jobs when one goes down - https://phabricator.wikimedia.org/T91065#1073367 (10yuvipanda) 3NEW [14:12:11] 6Labs, 10Tool-Labs: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1073376 (10yuvipanda) 3NEW [14:17:57] 10Wikimedia-Labs-Infrastructure: Create -latest alias for dumps - https://phabricator.wikimedia.org/T47646#1073393 (10yuvipanda) p:5Lowest>3Normal [14:21:53] YuviPanda: You had a good idea to test that again; it would fail right now. There have been changes in the config that have not been properly reflected on shadow yet. [grumble] [14:22:11] Coren: this is why we should puppetize them all :D [14:22:33] Coren: can you make a note of everything being done in detail on https://phabricator.wikimedia.org/T90546 [14:22:33] ? [14:23:05] YuviPanda: I don't mean the same change on both - I mean a change in the layout of -master that would have needed another - different - change on -shadow to account for it. :-) [14:23:09] YuviPanda: I will. [14:23:17] right.
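The stuck-job cleanup above (jobs qstat still lists in a "d…" state after the outage, with no live process behind them) can be scripted. A sketch only: it assumes the default qstat layout with the job id in column 1 and the state (e.g. `dRr`) in column 5, reads qstat output on stdin, and takes an overridable `QDEL` so it can be dry-run:

```shell
# Force-delete (qdel -f) every job whose qstat state column starts with
# "d", i.e. jobs stuck mid-deletion. Column positions are an assumption
# about the default qstat output; the first two lines are its header.
reap_stuck_jobs() {
    awk 'NR > 2 && $5 ~ /^d/ { print $1 }' |
    while read -r job; do
        ${QDEL:-qdel} -f "$job"   # -f: remove without waiting on the exec host
    done
}
# Typical use: qstat | reap_stuck_jobs
```

As Coren notes, `qdel -f` may still take a minute or two before the job disappears from the scheduler's view.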
[14:33:25] 6Labs, 10Tool-Labs: Set up a schedule for doing failover exercises for toollabs - https://phabricator.wikimedia.org/T91068#1073400 (10yuvipanda) 3NEW [14:37:53] PROBLEM - Puppet failure on tools-shadow is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0] [14:47:58] RECOVERY - Puppet failure on tools-shadow is OK: OK: Less than 1.00% above the threshold [0.0] [14:50:02] !log tools testing gridengine master failover starting now [14:50:10] Logged the message, Master [14:51:46] YuviPanda: /var/lib/gridengine/default/common/act_qmaster is the thing to watch. As the shadow server notices the heartbeat no longer updating, it should start a master on itself and update that. [14:52:02] Coren: and that’s a symlink that points to NFS? [14:52:26] YuviPanda: sorta. /var/lib/gridengine is a bind mount to NFS [14:52:44] aaah [14:53:07] gridengine does need its config and spool shared between nodes. [14:53:52] right [14:54:06] I'm stracing the sge_shadowd right now, looking at it poll the heartbeat file. [14:54:17] right [14:55:22] 5m check interval is hella long when you're waiting for it. :-) [14:55:32] :) [14:57:41] Coren: can you add documentation on what the failover does and how it works to https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Documentation/Admin [14:59:41] * Coren nods. [15:00:17] * Coren "patiently" waits. [15:00:39] * Coren double checks the timeout. [15:05:12] Ah, bah, 600s default [15:05:23] That's too many seconds. [15:06:10] ah, 10mins [15:06:15] Coren: did it pick it up now? [15:06:39] It should soon - the poll interval is 60s. [15:07:07] * Coren watches it like a hawk. [15:08:11] But if it doesn't within the next minute or so I'm going to presume the man page is the one that lies and change the config to what it was before according to the prose documentation. [15:09:09] Coren: cool.
should also start the master back up, though :) [15:09:11] (About the contents of /var/lib/gridengine/default/common/shadow_masters - one says it should contain the name of the shadows and one says it should contain the name of the master /and/ the name of the shadows) [15:09:21] Yeah, I'm restarting master. [15:10:21] Coren: cool [15:10:27] !log tools Master restarted - test not successful. [15:10:32] Logged the message, Master [15:10:33] YuviPanda: I shall debug this now. [15:10:35] !log tools increase instance quota to 64 [15:10:38] Coren: cool. [15:10:39] Logged the message, Master [15:10:50] !log tools created tools-webgrid-generic-02 [15:10:54] Logged the message, Master [15:11:07] * Coren is annoyed as sge_shadowd clearly picked up on the returned master. [15:15:20] Ah, I can forcibly cause the shadow master to be verbose with an env var [15:15:24] That'll help. [15:17:39] 6Labs, 10Tool-Labs: Test and verify that OGE master/shadow failover works as expected - https://phabricator.wikimedia.org/T90546#1073472 (10yuvipanda) p:5Triage>3High @Coren tried it just now, didn't work. He's investigating. If virt1003 goes down then master goes down as well, and things are bad. [15:17:55] 6Labs, 10Tool-Labs: Test and verify that OGE master/shadow failover works as expected - https://phabricator.wikimedia.org/T90546#1073475 (10yuvipanda) a:3coren [15:18:01] Coren: ^ I’ve assigned that to you :) [15:18:07] * YuviPanda goes to make more bugs about re-jiggering instances [15:20:46] !log tools Gridengine master failover test part deux - now with verbose logs [15:20:52] Logged the message, Master [15:21:01] (Also, interval made much smaller for testing) [15:22:04] good afternoon [15:23:14] does anyone know who is in charge of GeoHack in Tool Labs? [15:23:28] I was until 6 months ago [15:23:53] YuviPanda: LOL [15:24:21] I would like to use it in my script to consult where some geolocations belong to.
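The failover mechanics described above (sge_shadowd polls the heartbeat file under the shared spool; once it goes stale past the timeout, the shadow starts a qmaster on itself and rewrites act_qmaster to advertise the new master) can be sketched as a toy shell loop. This is an illustration of the logic only, not gridengine's actual code; the file names mirror the real ones under /var/lib/gridengine/default/common, but the spool here is a scratch directory so the sketch is safe to run anywhere:

```shell
# Toy sketch of the sge_shadowd takeover decision, NOT gridengine's real code.
# SPOOL stands in for /var/lib/gridengine/default/common (shared via NFS).
SPOOL=$(mktemp -d)
MY_HOST=tools-shadow.eqiad.wmflabs
TIMEOUT=600   # seconds; the 600s default mentioned in the channel

touch "$SPOOL/heartbeat"   # the master "touches" this while alive

maybe_take_over() {
    now=$(date +%s)
    beat=$(stat -c %Y "$SPOOL/heartbeat")   # heartbeat mtime (GNU stat)
    if [ $((now - beat)) -gt "$TIMEOUT" ]; then
        # Heartbeat is stale: "start" a qmaster here and advertise ourselves.
        echo "$MY_HOST" > "$SPOOL/act_qmaster"
        echo "took over"
    else
        echo "master alive"
    fi
}

maybe_take_over                               # fresh heartbeat: nothing to do
touch -d '20 minutes ago' "$SPOOL/heartbeat"  # simulate a dead master
maybe_take_over                               # stale heartbeat: failover
cat "$SPOOL/act_qmaster"
```

In the real cluster /var/lib/gridengine is a bind mount onto NFS, so every node sees the rewritten act_qmaster. The sketch also ignores the detail the test itself uncovered: a cleanly shut-down master leaves a lockfile that (correctly) stops shadows from taking over, which is why the failover only triggers when the master actually dies.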
[15:24:45] at the same time i was wondering if the tables in mediawiki geo_tags belong to geohack. [15:24:50] marcmiquel: can you expand on what ‘consult where some geolocations belong to’? [15:24:55] YuviPanda: *that* way of testing would never have worked: when you shut down the master cleanly it puts down a lockfile to prevent shadows from taking over. [15:25:00] I assume he wants to change links on https://en.wikipedia.org/w/index.php?title=Template:GeoTemplate&action=edit [15:25:04] Coren: oh, I see. [15:25:11] marcmiquel: It's Magnus Manske and Kolossos [15:25:25] Coren: kill -9 the master then? :) [15:25:32] YuviPanda: Because, obviously, you don't WANT a new master popping up when you shut it down on purpose. [15:25:33] Good luck getting responses from those two [15:25:42] right [15:25:50] for instance, having my hometown coords: 41° 34′ 43″ N, 1° 37′ 4″ E. I would like to identify the territory where it belongs. [15:25:57] in this case, catalonia. [15:26:09] i think in geohack there is es-ca [15:26:15] spain-catalonia as a field [15:26:23] marcmiquel: I’d suggest looking at geohack’s source code and seeing how it does that. [15:26:32] I suspect it hits OpenStreetMaps? [15:26:33] where could i find it? i saw the website is broken [15:27:03] marcmiquel: it shouldn’t be broken... [15:27:27] https://tools.wmflabs.org/geohack/geohack.php?pagename=Chennai&params=13_5_2_N_80_16_12_E_type:city(4681087)_region:IN-TN works fine [15:27:29] https://wiki.toolserver.org/view/GeoHack [15:27:31] Drop me a link demonstrating where it is and I'll look into it today. But I've gotta go [15:27:45] oh [15:27:47] right [15:27:48] that died. [15:27:49] !log tools Gridengine master failover test part three; killing the master with -9 [15:27:54] Logged the message, Master [15:27:56] I don’t think the toolserver documentation is up anywhere. [15:28:07] YuviPanda: the code doesn't seem to be there either [15:28:16] yeah, looking around...
[15:28:19] hey i created a .bigbrotherrc for a tool 25min ago, no reaction so far. what did I miss? [15:28:38] https://tools.wmflabs.org/tree-of-life/ [15:28:48] tools.tree-of-life@tools-login:~$ cat .bigbrotherrc [15:28:49] jzerebecki: it doesn’t start things up if they aren’t already up. [15:28:49] webservice [15:28:59] jzerebecki: I'm doing tests with the gridengine master at the moment, which may interfere with job scheduling. I expect things will wake back up shortly. [15:29:08] it’s on my list of things to ‘fix' [15:29:28] 02/27/2015 15:28:45| main|tools-shadow|W|starting program: /usr/sbin/sge_qmaster [15:29:49] marcmiquel: http://bitbucket.org/magnusmanske/geohack [15:29:51] is the source [15:29:56] YuviPanda: uh what is the precondition for it to restart it? [15:30:01] awesome YuviPanda [15:30:08] jzerebecki: just do ‘webservice start' [15:30:19] # cat act_qmaster [15:30:19] tools-shadow.eqiad.wmflabs [15:30:28] wooo [15:30:29] YuviPanda: the webservice was running at some point [15:30:41] jzerebecki: right. it probably went down during today’s outage. [15:30:47] and didn’t come back up because of lack of .bigbrotherrc [15:30:52] my second question was... is geo_tags in mediawiki from geohack? [15:30:57] YuviPanda: So yeah, it worked fine all along, so long as the master actually *dies*. :-) [15:31:02] marcmiquel: nope. geohack is just on toollabs. [15:31:16] yup. trying to understand how bigbrother works, how does it know when to restart stuff and when not to? [15:31:33] then, where is that information from? [15:31:45] jzerebecki: It restarts unconditionally once it has seen it running one. [15:31:45] ahm. [15:31:47] once* [15:31:49] marcmiquel: geo_tags? I do not know. MaxSem might know.
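The bigbrother rule YuviPanda states above — it never starts a service it has not already observed running, but restarts unconditionally anything it has seen running once — can be modelled in a few lines of shell. A toy model for illustration only, not the actual bigbrother code:

```shell
# Toy model of bigbrother's rule, NOT the real bigbrother: restart anything
# previously SEEN running that is now down; ignore entries never yet seen.
SEEN=$(mktemp)    # services ever observed running, one per line

bigbrother_tick() {
    # $1 = space-separated list of services currently running
    for svc in $1; do
        grep -qx "$svc" "$SEEN" || echo "$svc" >> "$SEEN"
    done
    while read -r svc; do
        case " $1 " in
            *" $svc "*) ;;               # still up, nothing to do
            *) echo "restart: $svc" ;;   # seen before but down: restart
        esac
    done < "$SEEN"
}

bigbrother_tick "tree-of-life"   # first sighting: healthy, no output
bigbrother_tick ""               # now down: prints "restart: tree-of-life"
```

This is exactly why jzerebecki's freshly created .bigbrotherrc did nothing: a service listed there but never yet seen running is left alone, so a manual `webservice start` is needed once before the watcher will manage it.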
[15:31:54] ah ok thx [15:31:58] marcmiquel: you can also check mediawiki.org [15:32:51] just that the table seemed great [15:33:07] but when looking at gt_country it was very incomplete [15:33:14] NULLs were everywhere in every language [15:33:31] yeah, I don’t think it’s used in production atm. [15:33:36] !log tools Switched back to -master. I'm making a note here: great success. [15:33:41] Logged the message, Master [15:34:08] maybe in production there are the coords users introduce but then WP uses geohack to redirect users [15:34:17] to geohack page [15:34:28] marcmiquel: you might be able to use http://developer.mapquest.com/web/products/open/geocoding-service [15:34:38] Coren: \o/ sweet. can you update the bug + documentation page? [15:34:45] YuviPanda: So yeah, note for the future: if we want to test failover we have to make it fail - not shut it down. :-) [15:35:43] 6Labs, 10Tool-Labs: Test and verify that OGE master/shadow failover works as expected - https://phabricator.wikimedia.org/T90546#1073550 (10coren) 5Open>3Invalid It worked all along, so long as the failover is tested by making the master //fail//. If it's shut down cleanly then the shadow masters (correctly... [15:35:44] 6Labs, 10Tool-Labs, 7Tracking: Make sure that toollabs can function fully even with one virt* host fully down - https://phabricator.wikimedia.org/T90542#1073552 (10coren) [15:35:57] thanks YuviPanda. it might be useful. only that it is a pity that while having coords available in a table, the info is incomplete. [15:36:04] 6Labs, 10Tool-Labs: Test and verify that OGE master/shadow failover works as expected - https://phabricator.wikimedia.org/T90546#1073553 (10coren) 5Invalid>3Resolved [15:36:05] 6Labs, 10Tool-Labs, 7Tracking: Make sure that toollabs can function fully even with one virt* host fully down - https://phabricator.wikimedia.org/T90542#1061721 (10coren) [15:36:46] Coren: can you document the debug env variable, etc on that bug as well? [15:37:05] anyway, am off for food.
brb in a bit. [15:37:58] YuviPanda: good appetite [15:38:08] and thanks for helping! [15:38:32] marcmiquel: yw! [15:38:33] 6Labs, 10Tool-Labs: Move toollabs instances around to minimize damage from a single downed virt* host - https://phabricator.wikimedia.org/T91072#1073566 (10yuvipanda) 3NEW [15:38:58] Coren: I hope to do https://phabricator.wikimedia.org/T91066 soon as well. [15:39:08] that also means moving all the tomcat stuff to trusty, but since it’s the same JVM version it should be alright [15:39:17] alright, food for real [15:39:40] Yeah, the positive thing about Java (there had to be at least one) is that what counts is the JVM not the OS. :-) [15:42:59] Coren: yeah [15:43:21] Coren: can you look at the puppet failures being reported on the new web proxies? Seem to be related to hba [15:47:12] YuviPanda|food: The onlyif Tim added to quiesce the logs prevents the first creation of the file. [16:31:59] RECOVERY - Puppet failure on tools-webproxy-01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:36:00] RECOVERY - Puppet failure on tools-webproxy-02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:44:43] 6Labs, 10Beta-Cluster, 6operations: Backport new salt-syndic packages - https://phabricator.wikimedia.org/T85442#1073725 (10ArielGlenn) I've imported salt-syndic_2014.1.11 into our lucid/precise/rtrusty repos. All dependencies should be there already. Let me know if it wfy. [16:55:12] Coren, YuviPanda|food: Sorry for the ping, but could any of you restart copyvios's web? [16:56:03] Earwig: ^ [16:56:07] {{done}} [16:56:28] Thanks a bunch [16:57:09] :+1: [17:04:17] Coren: can you put a bigbrotherrc file in it as well? 
[17:04:27] I try to do so every time I restart [17:53:39] !log tools increased quota to 512G RAM and 256 cores [17:53:43] Logged the message, Master [18:06:20] PROBLEM - Puppet failure on tools-uwsgi-02 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [0.0] [18:07:39] @seen Technical-13 [18:07:39] Cyberpower678: I have never seen Technical-13 [18:07:47] @seen T13|detached [18:07:48] Cyberpower678: Last time I saw T13|detached they were changing the nickname to , but is no longer in channel ################################################## at 2/17/2015 4:13:20 PM (10d1h54m27s ago) [18:08:05] petan, umm... ^ [18:08:13] what [18:08:28] What's up with wm-bot's response to @seen [18:08:38] @seen T13|mobile [18:08:38] Cyberpower678: Last time I saw T13|mobile they were quitting the network with reason: Quit: http://enwp.org/User:Technical_13 is having connection troubles and should be back soon. N/A at 2/25/2015 3:52:12 AM (2d14h16m26s ago) [18:08:57] that is true name of channel he was in [18:09:26] :O [18:09:58] There's a channel ##################################################? [18:11:18] RECOVERY - Puppet failure on tools-uwsgi-02 is OK: OK: Less than 1.00% above the threshold [0.0] [18:12:16] PROBLEM - Puppet failure on tools-webgrid-06 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [0.0] [18:16:37] more mysterious errors on beta cluster: Database query error (internal_api_error_DBQueryError) (MediawikiApi::ApiError) [18:27:32] Cyberpower678: of course [18:37:17] PROBLEM - Puppet failure on tools-uwsgi-02 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [18:37:27] RECOVERY - Puppet failure on tools-webgrid-06 is OK: OK: Less than 1.00% above the threshold [0.0] [18:52:42] hi! could someone point me to an instruction on how one could create an instance in labs? 
[19:02:56] 10Wikibugs: MultiCol text overlines Templates in FF - https://phabricator.wikimedia.org/T91098#1074263 (10Nnvu) 3NEW [19:20:03] I get Puppet status: failed when creating instance on labs - does anybody know how to find out what's wrong? [19:39:15] I just set up a new labs instance http://drmf.wmflabs.org/w/index.php?title=Special:UserLogin&returnto=Main+Page and was trying to log in with admin and the default password [19:40:08] I get a strange error message Call to a member function setExpectation() on a non-object (NULL) [19:44:12] SMalyshev: The most likely issue is your quotas being reached. [19:44:42] SMalyshev: You can see what usage you have and what you have left from the manage projects page. [19:45:19] Coren: quotas seem to be fine [19:45:42] Coren: I've rerun puppet manually and it seems to be ok. but on initial deployment it fails for some reason [19:46:22] Oh, *puppet* status failed. Sorry, I had misunderstood you. [19:47:06] That's often expected; puppet often needs several passes before it settles - the first run often is unable to install some packages because the apt-repos are not yet in place, for instance. [19:47:26] also, for some reason, I can't create a proxy: Failed to create new proxy wdqwikidata.tools.wmflabs.org. [19:47:41] is there some permission I need to get? [19:48:12] No, but you normally can't create a proxy under tools.wmflabs.org that way; you want to do wdqwikidata.wmflabs.org instead [19:48:35] Coren: there's no such option in the domains list [19:49:37] there are a lot of 3-component ones but no wmflabs.org [19:50:01] Oh, right, I forgot webproxies don't use the same list as public IP management. [19:50:26] Give me a minute and I'll see what's up. [19:50:31] thanks [19:50:41] Yes, the wikitech error reporting is teh suxx0rz [19:52:44] Hm. There's clearly something wrong atm. [19:53:08] I'll need to do further debugging.
[19:55:07] In the meantime, if you want to do testing, I can give you quota to create a public IP [19:58:53] Coren: that'd be cool, thanks [19:59:23] or if there's an easier way to make the browser work, that'd be fine too [19:59:50] SMalyshev: What is your wikitech username? [20:00:19] SMalyshev: Also, what project name is this? [20:00:23] Coren: smalyshev :) [20:00:53] wikidata-query [20:00:58] 6Labs, 10Wikimedia-Labs-wikitech-interface: Proxy creation fails with opaque error message - https://phabricator.wikimedia.org/T91114#1074527 (10coren) 3NEW [20:01:11] I cc'ed you on the bug so you get updates ^^ [20:01:33] thanks [20:02:47] SMalyshev: I just gave the project quota for a public IP. You can allocate one, and point it at the appropriate instance. [20:03:17] Coren: cool, thanks [20:04:54] Coren: how do I do that? Create proxy on https://wikitech.wikimedia.org/w/index.php?title=Special:NovaProxy still produces failure [20:05:54] No, for this you go through 'manage addresses' [20:06:13] Allocate IP, then assign it and a name to the instance of your choice. [20:08:00] Coren: ok, did that - should I also do "add host name"? for which domain? [20:08:42] I tried domain wmflabs and got wdq-wikidata.153.80.208.in-addr.arpa [20:08:53] o_O? [20:08:53] not sure that's what is supposed to happen [20:09:01] Definitely not. [20:09:31] so what do I put in "add host name"? [20:09:33] Lemme try this. [20:10:17] looks like wdq-wikidata.testme.wmflabs.org worked now [20:10:58] don't see how to get it without testme but that should be fine for me too [20:11:15] There's something broken with the name assignment - I expect it's the same issue that also prevents the proxies from working [20:11:45] So long as you can continue to work; once we figure out the root issue you'll be able to switch it to something that works for you.
[20:12:09] now the IP resolves but I still can't access the URL [20:12:11] One last thing you'll have to do is to allow the appropriate port (80) to your security groups [20:12:15] e.g. http://wdq-wikidata.testme.wmflabs.org/wiki/ [20:12:22] It's firewalled by default [20:12:53] Coren: so what needs to be done to enable it? [20:13:06] Lemme do it for you, it'll take 30s. [20:13:18] It's under 'manage security groups' [20:13:19] ok, thanks [20:13:31] ah. [20:13:41] Done. [20:14:10] yeah seems to be working now, thanks! [20:14:58] Coren: one more question if you don't mind - so vagrant supports multiple wikis with different hostnames. Is there a way to make the same work in this setup? [20:15:40] Yes, just add the hostname [20:15:46] SMalyshev: yes, you can use multiple wikis behind the proxy [20:16:03] aha, how do I associate hostnames with wikis? [20:16:06] the trick is to make apache route them properly on the labs instance side [20:16:29] * bd808 looks to see if he has documented this [20:18:17] SMalyshev: apparently I haven't documented the how of this :( [20:19:15] bd808: any short version? :) [20:19:37] <^d> andrewbogott: Thanks for the e-mail, I'd missed the first one. [20:19:59] :) [20:20:10] <^d> most of those other instances can probably just be rebuilt (although I'd let others chime in). deployment-db1 should definitely be rescued though [20:20:19] <^d> It's a mysql master [20:21:08] SMalyshev: Make a /vagrant/puppet/hieradata/local.yaml file and add `mediawiki::multiwiki::base_domain: "-somebasename.wmflabs.org"` [20:21:30] ^d: I can move it to a different host now, but that will mean shutting it down for a few minutes [20:22:00] Then make proxies in labs that point vagranthostname-somebasename.wmflabs.org to your labs instance [20:22:11] ^d: what do you think? Risk it and leave it be, or have an intentional outage? [20:22:32] bd808: need to run anything to enable this? provision? [20:22:39] <^d> andrewbogott: How long will it take?
[20:22:42] <^d> Oh, few minutes [20:22:43] <^d> Hmm [20:22:49] the "vagranthostname" part would be what comes before .wiki.local.wmftest.net in a local install [20:22:59] It’s an scp of the volume. So, long if it’s big, short if it’s small [20:23:16] SMalyshev: yes, you need to run labs-vagrant provision after adding the hiera config [20:23:37] That will change the vhost names in your /etc/apache2 config files [20:23:49] and from there things should "just work" [20:24:19] you can look in the apache config to see what names are expected by the various vhosts [20:24:20] <^d> andrewbogott: Eh, we'll leave it for now [20:24:35] ok [20:24:43] Don’t walk under any ladders this weekend [20:24:49] <^d> It's already afternoon on friday, I don't want to spend the afternoon rebuilding shit if it goes bad :p [20:26:17] bd808: success! thanks a lot! [20:26:28] excellent [20:26:41] * bd808 is writing something up for next time [20:27:51] bd808: yeah that'd be helpful, e.g. for running something like wikidata [20:36:02] SMalyshev: https://wikitech.wikimedia.org/wiki/Labs-vagrant#Use_multiple_wikis_on_a_single_labs-vagrant_host [20:36:16] Please add and update as you find things wrong there [20:37:31] bd808: looks like exactly what I did (except maybe you'd need to sudo labs-vagrant? I did it with sudo but I'm not sure if required), so I think that works [20:38:39] cool. I think labs-vagrant provision does sudo itself at the right point so it should work even without being explicit [20:38:52] ok, great, thanks again! [20:53:16] PROBLEM - Puppet failure on tools-webgrid-06 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0] [20:56:07] Coren: ^ is because webgrid-06 is oom. Anything to be done about that? [20:56:27] oom? How in blazes did /that/ happen? [20:56:41] well, maybe I’m mistaken; take a look? [20:57:11] It's definitely not oom, it's about at 20% usage.
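bd808's multiwiki recipe above boils down to one hiera key plus a re-provision. A sketch of the steps, written here against a scratch directory so it is safe to run anywhere; on a real labs-vagrant host the file goes in /vagrant/puppet/hieradata, and `-somebasename` is whatever suffix you pick (both the key and the command are as given in the conversation):

```shell
# Normally HIERADIR is /vagrant/puppet/hieradata; scratch dir for illustration.
HIERADIR=$(mktemp -d)

# 1. Tell MediaWiki-Vagrant what base domain the per-wiki vhosts should use.
cat > "$HIERADIR/local.yaml" <<'EOF'
mediawiki::multiwiki::base_domain: "-somebasename.wmflabs.org"
EOF

# 2. Re-provision so the /etc/apache2 vhosts get rewritten (real host only):
#      sudo labs-vagrant provision
# 3. Then create one labs webproxy per wiki, pointing
#    wikiname-somebasename.wmflabs.org at the instance.
cat "$HIERADIR/local.yaml"
```

Step 2 is what actually renames the vhosts; afterwards you can read the apache config to see which hostnames each vhost expects, as bd808 notes.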
[20:57:50] I tried to run puppet and it said [20:57:51] Error: Could not run command from prerun_command: Cannot allocate memory - fork(2) [20:57:51] Error: Could not run command from postrun_command: Cannot allocate memory - fork(2) [20:57:59] Aha! It's out of /process slots/, however. [20:58:28] that’d do it [20:58:43] impressive. [20:59:06] Perfect storm, every single webserver on it has as many processes open as is possible. [20:59:23] * Coren ponders. [21:00:58] First time I've seen this in ages. Lemme see what we can do. [21:05:01] Aha. Nope. Nowhere near the thread limit but we /have/ hit the overcommit ratio. [21:05:59] Means that I have too many slots for that host [21:09:41] * Coren ponders. [21:15:38] andrewbogott: It's something trusty - precise hosts have way more running jobs. Looking into it. [21:16:21] andrewbogott: thanks for working so hard recently (+Yuvi +Coren) to keep labs in somewhat a usable state :) (prompted re. the en masse emails) [21:21:08] JohnFLewis: I kind of enjoy the excitement :) But if it keeps up I might not be able to take it. [21:21:53] yeah thanks guys [21:22:54] andrewbogott: let's hope the new hardware gets processed then for labs [21:28:20] RECOVERY - Puppet failure on tools-webgrid-06 is OK: OK: Less than 1.00% above the threshold [0.0] [21:29:12] Coren: ^ is a good sign! [21:30:52] andrewbogott: Yes, I rebalanced the load a bit between -05 and -06 (both trusty) [21:31:15] I still need to figure out why precise is much more forgiving. [22:02:12] HEY [22:02:21] Coren, Coren_: Do you have any idea what’s wrong with http://en.wikipedia.beta.wmflabs.org/wiki/Special:MobileOptions or how I could debug it? The page consistently gives a 503 error, but only that page. [22:04:59] 6Labs, 10Tool-Labs, 5Patch-For-Review: ToolLabs web proxy tolerate the failure of virt host - https://phabricator.wikimedia.org/T89995#1074784 (10scfc) My understanding is that dynamicproxy holds the proxying information essentially in memory, i. e.
it starts with an empty plate. In that case, tools that st... [22:05:32] how is everyone today? [22:05:47] not bad [22:05:50] how does this work im confused lol [22:06:23] scotttriplett: How does what work? The IRC channel? [22:06:43] 6Labs, 10Tool-Labs: Set up a schedule for doing failover exercises for toollabs - https://phabricator.wikimedia.org/T91068#1074796 (10scfc) Well, if it'd affect users, the exercise wasn't successful :-). [22:07:16] well yea lol [22:08:11] Well, you yell something into the void and wait to see if anyone answers [22:08:25] this is new for me and im trying to see if im correct u can use these to make portals to other pages but how do i find my portals [22:08:27] Your luck may vary depending on your time zone [22:09:12] scotttriplett: not sure what you mean about portals [22:09:30] what is the void and how am i suppose to yell into it lol [22:09:53] you’re already yelling into it :) [22:12:39] chrismcmahon: Trying to figure out what’s up with http://en.wikipedia.beta.wmflabs.org/wiki/Special:MobileOptions, but doesn’t seem like any labs folks are around. Any idea who I could ping? If not, I’ll just send an email to the usual folks. [22:12:52] kaldari: looking... [22:13:50] kaldari: ask in -labs maybe. beta has been having weird issues possibly related to the recent hardware failures [22:14:10] I thought I was in -labs :) [22:14:10] oh, just realized what channel this is [22:14:29] this is not my week [22:14:48] NP, it’s been a long week [22:14:53] I've been seeing intermittent db failures, connect failures, but this is the first flat out 503 I've run into [22:15:13] chrismcmahon: And it’s only that one page (as far as I can tell) [22:16:33] chrismcmahon: searched the logs? (A basic question but still :p) [22:17:16] JohnFLewis: Actually, I was wondering about that. Where are the logs stored for beta labs? Are they on deployment-bastion somewhere?
[22:17:47] kaldari: yeah under the /data/ iirc [22:18:17] I would look now but I don't have my labs ssh access on this device [22:19:13] I found some logs, but no idea which log I’m looking for [22:20:12] kaldari: it's tough actually, are there any apache/HHVM error logs? I'm not sure if the bug was fixed where they are local to mediawiki instances only [22:20:39] JohnFLewis: Yes, there are 2 hhvm logs. I’ll look at those... [22:21:53] kaldari: if that turns up nothing it might be worth searching mediawiki 1 and 2 in the usual logging place [22:21:55] JohnFLewis: OK, I think I found the culprit [22:22:02] JohnFLewis: Thanks! [22:22:08] Great [22:22:34] kaldari: thanks, that was fast :) [22:23:24] chrismcmahon, JohnFLewis: now I know how to debug it myself in the future. Yay! [22:23:36] kaldari: where did you look for the logs? [22:23:42] cd /data/project/logs [22:23:55] on deployment-bastion [22:24:02] kaldari: OK, I thought it was fancier than that :) [22:24:08] and specifically the hhvm log [22:24:08] That's it, I was unsure if we used the /project/ subfolder on beta :p [22:24:26] * chrismcmahon looks. I haven't been out there in some time. [22:37:14] 10Wikimedia-Labs-General, 10Wikidata: Need a way to test with data set reasonably close to production - https://phabricator.wikimedia.org/T91131#1074849 (10Smalyshev) 3NEW [22:38:24] 6Labs, 10Tool-Labs, 5Patch-For-Review: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1074863 (10scfc) How much memory are we saving by having separate nodes for lighttpd-based tasks and overprovisioning them? (If that is still true; `modules/tooll... [22:43:29] 10Tool-Labs: Document how to turn shadow into master - https://phabricator.wikimedia.org/T91133#1074873 (10scfc) 3NEW [23:18:21] petan: can you please fix paste [23:18:22] ? [23:40:52] Coren: in https://phabricator.wikimedia.org/P341, do you know where the first entry (153.80.208.in-addr.arpa) came from?
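Back on the earlier tools-webgrid-06 incident: `Cannot allocate memory - fork(2)` with only ~20% of RAM in use is exactly what hitting the kernel's overcommit limit (rather than physical memory) looks like, matching Coren's diagnosis. These are standard Linux /proc interfaces, nothing labs-specific, for checking the limit and the policy behind it:

```shell
# CommitLimit is the ceiling fork() ran into; Committed_AS is how much
# memory is currently promised to processes.
grep -E '^(CommitLimit|Committed_AS)' /proc/meminfo

# The policy behind that limit:
cat /proc/sys/vm/overcommit_memory   # 0 = heuristic, 1 = always, 2 = strict
cat /proc/sys/vm/overcommit_ratio    # % of RAM counted toward the limit
```

When Committed_AS approaches CommitLimit, fork() and other allocations start failing with ENOMEM even though free RAM remains, which is why rebalancing job slots between the trusty hosts cleared the puppet failures.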
Did you add it by hand, perchance? [23:42:53] 6Labs, 10Wikimedia-Labs-wikitech-interface: Wikitech doesn't allow to associate a hostname with a public ip address - https://phabricator.wikimedia.org/T90856#1075020 (10Andrew) This is because of the wmflabs.org domain as defined in ldap: # wmflabs, hosts, wikimedia.org dn: dc=wmflabs,ou=hosts,dc=wikimedi...