[00:13:49] PROBLEM - Puppet failure on tools-exec-wmt is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [00:16:04] PROBLEM - Puppet failure on tools-exec-08 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [00:17:46] PROBLEM - Puppet failure on tools-uwsgi-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [00:22:43] PROBLEM - Puppet failure on tools-submit is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [00:24:47] PROBLEM - Puppet failure on tools-redis is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [00:29:21] PROBLEM - Puppet failure on tools-exec-09 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [00:34:35] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [00:35:34] PROBLEM - Puppet failure on tools-exec-12 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [00:38:51] RECOVERY - Puppet failure on tools-exec-wmt is OK: OK: Less than 1.00% above the threshold [0.0] [00:40:14] PROBLEM - Puppet failure on tools-login is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [00:40:36] PROBLEM - Puppet failure on tools-exec-02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [00:46:05] RECOVERY - Puppet failure on tools-exec-08 is OK: OK: Less than 1.00% above the threshold [0.0] [00:52:40] RECOVERY - Puppet failure on tools-submit is OK: OK: Less than 1.00% above the threshold [0.0] [00:54:08] PROBLEM - Puppet failure on tools-dev is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [00:54:20] RECOVERY - Puppet failure on tools-exec-09 is OK: OK: Less than 1.00% above the threshold [0.0] [00:54:46] RECOVERY - Puppet failure on tools-redis is OK: OK: Less than 1.00% above the threshold [0.0] [00:55:53] PROBLEM - Puppet failure on tools-webgrid-tomcat is CRITICAL: CRITICAL: 
55.56% of data above the critical threshold [0.0] [00:56:13] PROBLEM - Puppet failure on tools-webgrid-01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [01:01:18] PROBLEM - Puppet failure on tools-exec-07 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [01:01:47] PROBLEM - Puppet failure on tools-exec-01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [01:01:57] PROBLEM - Puppet failure on tools-shadow is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [01:03:40] PROBLEM - Puppet failure on tools-submit is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [01:03:48] PROBLEM - Puppet failure on tools-exec-cyberbot is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [01:04:32] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0] [01:05:15] RECOVERY - Puppet failure on tools-login is OK: OK: Less than 1.00% above the threshold [0.0] [01:05:33] RECOVERY - Puppet failure on tools-exec-12 is OK: OK: Less than 1.00% above the threshold [0.0] [01:05:33] RECOVERY - Puppet failure on tools-exec-02 is OK: OK: Less than 1.00% above the threshold [0.0] [01:05:47] PROBLEM - Puppet failure on tools-redis is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [01:07:13] PROBLEM - Puppet failure on tools-exec-03 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [01:07:48] RECOVERY - Puppet failure on tools-uwsgi-01 is OK: OK: Less than 1.00% above the threshold [0.0] [01:07:58] PROBLEM - Puppet failure on tools-exec-11 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [01:08:53] PROBLEM - Puppet failure on tools-master is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [01:09:52] PROBLEM - Puppet failure on tools-exec-wmt is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [01:10:16] PROBLEM - 
Puppet failure on tools-exec-09 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [01:10:26] PROBLEM - Puppet failure on tools-exec-06 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [01:11:55] PROBLEM - Puppet failure on tools-webgrid-05 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [01:12:01] PROBLEM - Puppet failure on tools-mail is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [01:12:51] PROBLEM - Puppet failure on tools-exec-15 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [01:12:55] PROBLEM - Puppet failure on tools-static is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [01:12:57] PROBLEM - Puppet failure on tools-exec-13 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [01:14:59] PROBLEM - Puppet failure on tools-exec-gift is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [01:15:19] PROBLEM - Puppet failure on tools-webgrid-02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [01:15:31] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [01:16:15] PROBLEM - Puppet failure on tools-login is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [01:16:33] PROBLEM - Puppet failure on tools-exec-12 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [01:16:38] PROBLEM - Puppet failure on tools-exec-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [01:17:08] PROBLEM - Puppet failure on tools-exec-08 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [01:18:50] PROBLEM - Puppet failure on tools-uwsgi-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [01:20:20] PROBLEM - Puppet failure on tools-webgrid-03 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold 
[0.0] [01:20:44] PROBLEM - Puppet failure on tools-webgrid-04 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [01:20:48] PROBLEM - Puppet failure on tools-exec-catscan is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [01:21:52] PROBLEM - Puppet failure on tools-exec-04 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [01:22:00] PROBLEM - Puppet failure on tools-webproxy is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [01:24:43] PROBLEM - Puppet failure on tools-exec-05 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [01:27:12] PROBLEM - Puppet failure on tools-exec-10 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [01:35:24] RECOVERY - Puppet failure on tools-exec-06 is OK: OK: Less than 1.00% above the threshold [0.0] [01:37:56] RECOVERY - Puppet failure on tools-exec-11 is OK: OK: Less than 1.00% above the threshold [0.0] [01:39:50] RECOVERY - Puppet failure on tools-exec-wmt is OK: OK: Less than 1.00% above the threshold [0.0] [01:40:18] RECOVERY - Puppet failure on tools-webgrid-02 is OK: OK: Less than 1.00% above the threshold [0.0] [01:40:54] RECOVERY - Puppet failure on tools-webgrid-tomcat is OK: OK: Less than 1.00% above the threshold [0.0] [01:41:36] RECOVERY - Puppet failure on tools-exec-12 is OK: OK: Less than 1.00% above the threshold [0.0] [01:41:58] RECOVERY - Puppet failure on tools-webgrid-05 is OK: OK: Less than 1.00% above the threshold [0.0] [01:42:00] RECOVERY - Puppet failure on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0] [01:42:08] RECOVERY - Puppet failure on tools-exec-08 is OK: OK: Less than 1.00% above the threshold [0.0] [01:42:51] RECOVERY - Puppet failure on tools-exec-15 is OK: OK: Less than 1.00% above the threshold [0.0] [01:42:55] RECOVERY - Puppet failure on tools-static is OK: OK: Less than 1.00% above the threshold [0.0] [01:42:57] RECOVERY - Puppet failure on 
tools-exec-13 is OK: OK: Less than 1.00% above the threshold [0.0] [01:43:47] RECOVERY - Puppet failure on tools-uwsgi-01 is OK: OK: Less than 1.00% above the threshold [0.0] [01:44:05] RECOVERY - Puppet failure on tools-dev is OK: OK: Less than 1.00% above the threshold [0.0] [01:44:57] RECOVERY - Puppet failure on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0] [01:45:25] RECOVERY - Puppet failure on tools-webgrid-03 is OK: OK: Less than 1.00% above the threshold [0.0] [01:45:32] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0] [01:46:14] RECOVERY - Puppet failure on tools-exec-07 is OK: OK: Less than 1.00% above the threshold [0.0] [01:46:14] RECOVERY - Puppet failure on tools-webgrid-01 is OK: OK: Less than 1.00% above the threshold [0.0] [01:46:15] RECOVERY - Puppet failure on tools-login is OK: OK: Less than 1.00% above the threshold [0.0] [01:46:34] RECOVERY - Puppet failure on tools-exec-02 is OK: OK: Less than 1.00% above the threshold [0.0] [01:46:55] RECOVERY - Puppet failure on tools-exec-04 is OK: OK: Less than 1.00% above the threshold [0.0] [01:47:03] RECOVERY - Puppet failure on tools-shadow is OK: OK: Less than 1.00% above the threshold [0.0] [01:50:43] RECOVERY - Puppet failure on tools-webgrid-04 is OK: OK: Less than 1.00% above the threshold [0.0] [01:50:49] RECOVERY - Puppet failure on tools-exec-catscan is OK: OK: Less than 1.00% above the threshold [0.0] [01:51:51] RECOVERY - Puppet failure on tools-exec-01 is OK: OK: Less than 1.00% above the threshold [0.0] [01:51:57] RECOVERY - Puppet failure on tools-webproxy is OK: OK: Less than 1.00% above the threshold [0.0] [01:52:13] RECOVERY - Puppet failure on tools-exec-10 is OK: OK: Less than 1.00% above the threshold [0.0] [01:53:41] RECOVERY - Puppet failure on tools-submit is OK: OK: Less than 1.00% above the threshold [0.0] [01:53:47] RECOVERY - Puppet failure on tools-exec-cyberbot is OK: OK: Less than 1.00% above the 
threshold [0.0] [01:54:47] RECOVERY - Puppet failure on tools-exec-05 is OK: OK: Less than 1.00% above the threshold [0.0] [01:55:17] RECOVERY - Puppet failure on tools-exec-09 is OK: OK: Less than 1.00% above the threshold [0.0] [01:55:46] RECOVERY - Puppet failure on tools-redis is OK: OK: Less than 1.00% above the threshold [0.0] [01:57:14] RECOVERY - Puppet failure on tools-exec-03 is OK: OK: Less than 1.00% above the threshold [0.0] [01:58:56] RECOVERY - Puppet failure on tools-master is OK: OK: Less than 1.00% above the threshold [0.0] [02:06:06] hi all i'm trying to get a bot running under the grid engine and i'm having problems, it's a node bot, so maybe that's uncharted territory? [02:06:11] https://gist.github.com/edsu/e41e5e37ec63c09fa961 [02:07:53] if you have any ideas why it might be dumping core please let me know [05:26:41] Tool-Labs: program created by proprietary compiler allowed on labs? - https://phabricator.wikimedia.org/T74253#978687 (scfc) Not looking at the legal side: Using yet another™ programming language with yet another™ framework feels very wrong to me. The bots/tools community is not that small, but small enough... [05:34:23] edsu: try adding -l release=trusty to your jstart [05:34:29] and also use trusty.tools.wmflabs.org to login [05:34:41] edsu: our normal nodes have a somewhat outdated node, and trusty’s is much more usable [05:35:01] Eloquence: also, you might want to grant it more memory with -mem 4G [05:35:21] bah, I meant edsu [05:37:26] Tool-Labs: program created by proprietary compiler allowed on labs? - https://phabricator.wikimedia.org/T74253#978709 (Giftpflanze) The bot/tools community is in fact so small that forcing developers to use languages they don't know or don't feel comfortable using is a waste of these developers' valuable cont... [05:39:16] Tool-Labs: program created by proprietary compiler allowed on labs? 
- https://phabricator.wikimedia.org/T74253#978711 (yuvipanda) /me agrees with @Giftpflanze [05:41:41] edsu: I started writing documentation! https://wikitech.wikimedia.org/wiki/User:Yuvipanda/Node_on_tools [05:45:30] YuviPanda: thx ^^ [05:45:37] Labs-Team, Wikimedia-Labs-Infrastructure: Debian Jessie image for Labs - https://phabricator.wikimedia.org/T75592#978719 (yuvipanda) @Coren @andrew What's still blocking this? I heard there's some sort of upstream issue? Link? [05:46:11] annika_: :) I’d like to think of tools as gifts and not enforce too many restrictions on creative freedom when writing those. [05:46:54] hehe :) [05:47:29] annika_: unless it’s going to take us a huge amount of work to support, of course :) [05:47:49] in which case we should enable the tool authors to support it themselves, rather than arbitrarily restricting it [05:47:59] haha ^^ [06:43:11] PROBLEM - Puppet failure on tools-exec-10 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [06:56:54] PROBLEM - Puppet failure on tools-webgrid-tomcat is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [06:57:25] hmm [06:57:56] Tool-Labs: program created by proprietary compiler allowed on labs? - https://phabricator.wikimedia.org/T74253#978775 (scfc) I didn't say anything about //forcing// developers to use a limited set of programming languages. But for example if you develop a bot using Pywikibot, you can wallow in not having to... 
[07:01:34] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0] [07:13:17] RECOVERY - Puppet failure on tools-exec-10 is OK: OK: Less than 1.00% above the threshold [0.0] [07:21:34] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0] [07:21:54] RECOVERY - Puppet failure on tools-webgrid-tomcat is OK: OK: Less than 1.00% above the threshold [0.0] [07:28:46] Tool-Labs: program created by proprietary compiler allowed on labs? - https://phabricator.wikimedia.org/T74253#978821 (jayvdb) fwiw, the bot framework is here: http://inkowik.github.io/pbwb/ https://github.com/inkowik/pbwb 2896 sloc and the bot is here: https://github.com/inkowik/inkobot >>! In T74253#75771... [07:32:27] Labs-Team, Wikimedia-Labs-Infrastructure: Debian Jessie image for Labs - https://phabricator.wikimedia.org/T75592#978827 (faidon) The upstream bug that @Coren referred to before is https://bugs.debian.org/616689 which affects us because /var is on LVM on Labs images. This isn't really blocking this, though; i... [07:45:11] Labs-Team, Wikimedia-Labs-Infrastructure: Debian Jessie image for Labs - https://phabricator.wikimedia.org/T75592#978844 (yuvipanda) I *really* think we should just move / to LVM and get it over with. The reason /var is on LVM is because disks on labs are usually tiny compared to production, and the default /... [11:04:05] YuviPanda: thanks for the advice re: trusty ; unfortunately I'm still seeing coredumps when logged into trusty and using -l release=trusty https://gist.github.com/edsu/e41e5e37ec63c09fa961 [11:04:25] YuviPanda: should I open a ticket? 
[11:12:26] Wikimedia-Labs-Infrastructure, operations, Beta-Cluster: Change mwdeploy homeDirectory field in LDAP from /home/mwdeploy to /var/lib/mwdeploy - https://phabricator.wikimedia.org/T86903#979157 (hashar) NEW [11:14:14] edsu: if in doubt, please file a bug :] [11:43:29] Tool-Labs: Open Grid Engine Job dumps core (node) - https://phabricator.wikimedia.org/T86905#979212 (edsu) NEW [11:43:38] hashar: done :) [11:43:42] :-) [11:43:46] I am off [11:49:24] :) [13:42:49] Wikimedia-Labs-Infrastructure, operations, Beta-Cluster: Change mwdeploy homeDirectory field in LDAP from /home/mwdeploy to /var/lib/mwdeploy - https://phabricator.wikimedia.org/T86903#979419 (Reedy) I blame @yuvipanda ``` 07:19 YuviPanda: set home of mwdeploy to /home/mwdeploy in LDAP ``` [14:47:46] http://tools.wmflabs.org/wiwosm doesn't work, is it a known problem? [15:06:01] Hi! How can I request to rename my wikitech username? [15:06:57] avgas: Generally, you cannot. It's nearly impossible to do right, and requires a lot of manual work by the ops team. [15:08:28] Coren: Yeah, I thought it was... so where can I ask to delete my username (I want to create a new one)? [15:14:00] Just go ahead and create a new one and abandon the previous account. There is no "cost" to unused accounts. [15:19:05] Coren: ok, thank you very much! [15:44:14] PROBLEM - Puppet failure on tools-exec-10 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [15:45:13] Coren, if you are up can we talk about Debian? [15:45:30] andrewbogott: I'm up, and I got a bit of time ahead of me. [15:46:22] ok. So I'm a bit lost re: lvm. [15:46:40] I think faidon has suggested two different things, and I'm not clear on which (if either) actually fixes the problem we're having. 
[15:47:07] And as much as that upstream bug is marked as a blocker for Jessie, it really doesn't look like anyone is working on it [15:48:41] Coren, I'm /also/ confused because there keeps being discussion about /var being its own partion, vs /var/log being its own partition [15:48:51] I'm pretty sure that the latter is useful, and the former totally unimportant [15:49:56] Well, Faidon suggested a workaround (the delay), and really doesn't like /var or /var/log being separate. Personally, I told him I was okay with no separate /var/log so long as all of / was LVM at least (and he doesn't disagree) [15:50:25] But the way images are build makes / on lvm literally impossible. [15:50:34] built* [15:50:56] oh, faidon is opposed to a separate /var/log as well? Hm [15:51:04] The delay *should* make our current images work, all we need to do is pass a parameter to the kernel. [15:51:18] He doesn't see the point to it. [15:51:31] I also feel like everyone's glossing over the fact that existing instances will stay partitioned the way they are, regardless. So if we change the partitioning scheme them puppet will have to branch like mad [15:51:43] And, it is true, a full / is not quite as dangerous as it once was. [15:51:51] how so? [15:52:12] andrewbogott: One of Faidon's arguments is that separate /var/log is unlike prod - and that's true. [15:52:22] why would puppet have to branch? [15:52:33] Yeah, but the disk volume size will be unlike prod regardless. [15:52:35] puppet doesn't care about the filesystem layout [15:52:50] I've checked, didn't find any code that cares [15:52:54] even if there is, it's buggy [15:53:03] and even if there is, it's probably branched out already for prod/labs [15:53:13] paravoid: branch only in code that refers to partitioning. So, our puppet classes that partition the unused volume space will have to know what vintage of instance they're running on [15:53:31] we have puppet classes that partition unused volume space? 
[15:53:44] things like role::labs::lvm::srv [15:53:59] paravoid: Yeah, so that stuff that wants a big /srv can get it, etc. [15:54:37] that's still LVM though, no? [15:54:39] (Also, once upon a time, deployment-prep used it to replicate /a setup but that thankfully no longer exists) [15:54:53] yes, still lvm. so maybe the difference doesn't matter? I'm not sure [15:54:58] it won't [15:55:12] nothing in puppet should care about where /var is [15:55:20] andrewbogott: It shouldn't, not in practice, because we're talking about the /unallocated/ space [15:55:51] paravoid: There is one thing that cares right now: some instances have set aside HUGE /var/log because they have tons and tons of needed logging. [15:56:05] is this in puppet? [15:56:19] role::labs::lvm::biglogs iirc [15:56:21] Yeah, I think it's just role::labs::lvm::biglogs? I'm looking [15:57:06] Admitedly, none of this depend on *all* of /var being lvm'ed [15:57:08] so that's an LVM /var/log, not /var, no? [15:57:17] I don't mind that [15:57:21] paravoid: excellent. [15:57:35] 8G isn't "huge" [15:57:36] Yeah, /var is an artefact because some instances used to put local databases in /var/lib (ew!) [15:57:38] I think that means that everyone agrees: separate /var/log is useful, separate /var , meh [15:57:54] paravoid: It's huge on lab scales where instances have like 20G :-) [15:58:40] sure, but maybe it makes sense to just give instances a bit larger / [15:58:48] it's pretty small now, right? [15:59:29] paravoid: it is pretty small. But I'd rather that /var/log fill up than / [15:59:36] paravoid: 10G, which suffices for most things -- giving more prevents more flexible allocation (for /srv for instance) [15:59:42] (sorry if I'm misunderstanding) [16:00:16] Coren, do you know offhand how to add that delay to the kernel flags? [16:01:14] andrewbogott: Not offhand, but I recall having see how in openstack previously and it wasn't overly complicated. 
I'm about to do the filesystem thang, but I have about 2h ahead of me. I can dig it out if you want. [16:01:35] If you have an idea of where to start, then yes please [16:01:41] Or point me in the right direction [16:02:23] andrewbogott: It's with glance. IIRC, it's a property of the image. [16:02:48] oh! OK, I'll look. [16:02:52] glance? [16:03:04] paravoid: The image storage/management layer of openstack [16:03:13] isn't grub being used? [16:04:00] I don't see how glance would be involved but I may be missing something essential to how labs work [16:04:11] andrewbogott: https://wiki.openstack.org/wiki/LibvirtCustomKernelArgs [16:04:26] are we booting kernels directly? [16:04:58] It uses grub, as far as I know [16:05:09] right, so forget all that [16:05:16] so... [16:05:23] base has code to change grub's config [16:05:36] but so far we've stayed away from puppet-managing this in its entirety [16:05:49] instead of that, we just add (or remove) settings with puppet [16:06:01] and otherwise rely on what the installer set up for us [16:06:23] so prod had rootdelay=, with 3d3868aa17b8670e847332b37ce8bd44cb530248 [16:06:55] for labs... probably something in bootstrap-vz [16:07:11] you basically want to append something to GRUB_CMDLINE_LINUX_DEFAULT= under /etc/default/grub [16:07:39] * andrewbogott digs [16:07:53] I'd start with moving /var back into /, though [16:08:00] this might solve this for you, no action needed [16:08:20] or maybe not, who knows :) [16:13:32] Coren, can you explain this to me? /bin/tar cf - -C / var | /bin/tar xf - -C /tmp [16:13:42] Is it just making a backup, or is that restored onto the actual /var sometime later? [16:14:11] RECOVERY - Puppet failure on tools-exec-10 is OK: OK: Less than 1.00% above the threshold [0.0] [16:14:40] andrewbogott: Moar contekst? [16:14:54] Coren: that's in firstboot.sh in the partitioning code [16:14:59] that command uses tar to copy the contents of /var into /tmp/var/ [16:15:10] I know what it does, but… why? 
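For readers unfamiliar with the idiom, the pipeline quoted above (`tar cf - -C / var | tar xf - -C /tmp`) archives `var` relative to `/` and unpacks that stream under `/tmp`, which yields `/tmp/var/...`. The following is a minimal Python sketch of the same idea, using temporary directories in place of the real `/` and `/tmp`. The names `copy_via_tar` and `demo` and the sample file are illustrative assumptions and are not taken from firstboot.sh.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def copy_via_tar(root: str, member: str, dest: str) -> None:
    """Rough equivalent of: tar cf - -C ROOT MEMBER | tar xf - -C DEST."""
    # The shell pipe streams the archive; this sketch buffers it in memory.
    buffer = io.BytesIO()
    # "tar cf - -C ROOT MEMBER": archive MEMBER with paths relative to ROOT.
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        archive.add(str(Path(root) / member), arcname=member)
    buffer.seek(0)
    # "tar xf - -C DEST": unpack the archive underneath DEST.
    with tarfile.open(fileobj=buffer, mode="r") as archive:
        archive.extractall(dest)


def demo() -> str:
    """Copy a fake var/log/syslog and return the text found at the destination."""
    with tempfile.TemporaryDirectory() as root, tempfile.TemporaryDirectory() as dest:
        log = Path(root) / "var" / "log" / "syslog"
        log.parent.mkdir(parents=True)
        log.write_text("boot ok\n")
        copy_via_tar(root, "var", dest)
        return (Path(dest) / "var" / "log" / "syslog").read_text()
```

Because the archive stores paths relative to the `-C` directory, the copy lands at `DEST/var/log/syslog` rather than `DEST/log/syslog`.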
[16:15:26] Lemme look at the script, I'm sure I've a good reason. :-) [16:16:07] It could be a relic from an earlier attempt to preserve the contents of the old partition... [16:16:21] Ah, preceeded by /bin/mount /dev/mapper/vd-log /tmp/var/log [16:16:32] So it's copying into the future /var/log [16:16:46] So as to not loose any of the pre-firstboot.sh logs. [16:20:38] ok, makes sense [16:21:11] So to limit to /var/log it would be... [16:21:13] /bin/tar cf - -C / var/log | /bin/tar xf - -C /tmp/var [16:22:19] Coren, this look right to you? https://gerrit.wikimedia.org/r/185195 [16:22:28] Well, that overcomplicates things since it still presumes /var is mounted there. Lemme see [16:23:47] it… shouldn't [16:25:16] Commented [16:30:13] Coren: better? [16:31:12] Yep [17:04:16] Welcome to emergency mode! [17:05:29] what? [17:06:19] annika_: ah, nothing, just, a failure that Coren and I have been bashing up against for ages [17:11:01] andrewbogott: With the rootdelay even? [17:11:20] Coren, nope, haven't gotten to that yet [17:11:50] Ah. I was pretty sure that /var/log wouldn't make a change vs it and /var [17:37:26] Coren, paravoid, booting with rootdelay=N gets me nothing, it still drops into emergency mode. https://dpaste.de/zeWi [17:37:39] What N are you using? [17:39:21] um… I literally thought I was supposed to be using 'N' for 'no' [17:39:31] So that could be the problem :) [17:39:36] What value do you suggest? [17:41:32] It very well may. :-) It's seconds, and IIRC prod uses something fairly large (90, I think). A good value to try might be 30? [17:41:51] It's just giving time for the autodetection to settle. [18:15:54] Filesystem stalled while remounting readonly: ETC 4 min [18:16:05] how do I proxy a websocket from a process running on the execution grid, through to be public-facing? [18:16:30] notconfusing: It's not very hard, you just have to tell the proxy - but atm I'm in the middle of maintenance so I can't help. Will be able to soon. 
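On the rootdelay thread above: paravoid's suggestion was to append a setting to `GRUB_CMDLINE_LINUX_DEFAULT=` in `/etc/default/grub`, and Coren notes that the value is a number of seconds. The sketch below shows one way such an edit could be made on the text of that file. It is only an illustration under assumptions: the helper name `set_kernel_arg` is hypothetical, the value is assumed to be double-quoted on a single line, and this is not the code used in base or bootstrap-vz.

```python
import re

# Assumes the usual single-line, double-quoted form of the setting.
_CMDLINE = re.compile(r'^GRUB_CMDLINE_LINUX_DEFAULT="([^"]*)"[ \t]*$', re.MULTILINE)


def set_kernel_arg(grub_defaults: str, name: str, value: str) -> str:
    """Return the file text with name=value present exactly once on the kernel command line."""

    def rewrite(match: "re.Match[str]") -> str:
        # Drop any earlier setting of the same parameter, then append the new one.
        kept = [
            word
            for word in match.group(1).split()
            if word != name and not word.startswith(name + "=")
        ]
        kept.append(f"{name}={value}")
        return 'GRUB_CMDLINE_LINUX_DEFAULT="' + " ".join(kept) + '"'

    updated, count = _CMDLINE.subn(rewrite, grub_defaults)
    if count == 0:
        # No such line yet: add one at the end of the file.
        separator = "" if not updated or updated.endswith("\n") else "\n"
        updated += f'{separator}GRUB_CMDLINE_LINUX_DEFAULT="{name}={value}"\n'
    return updated
```

For example, `set_kernel_arg(text, "rootdelay", "30")` would replace a literal `rootdelay=N` with `rootdelay=30`. On Debian and Ubuntu the change only takes effect after `update-grub` regenerates the boot configuration.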
[18:17:00] Coren, no rush, I'll be online all day. [18:18:36] * greg-g waves to notconfusing [18:19:17] greg-g, hello there. what's the news on your side of the galaxy? [18:20:45] PROBLEM - Puppet failure on tools-exec-05 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [18:21:11] Oh, ew. That. [18:21:59] * YuviPanda waves at notconfusing [18:22:58] YuviPanda, hello to you there too. [18:24:29] notconfusing: my team's integration service (Beta Cluster) is also out right now :) other than that, doing some quarterly review slide creation [18:24:45] (beta cluster out due to labs) [18:25:12] PROBLEM - Puppet failure on tools-exec-10 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [0.0] [18:34:41] notconfusing: not really feasible to try out websocket now, since toollabs is down for scheduled maintenance [18:35:44] RECOVERY - Puppet failure on tools-exec-05 is OK: OK: Less than 1.00% above the threshold [0.0] [18:57:31] Coren: I’m going to go to sleep in an hour maybe. Is that ok? [18:57:57] Yeah, I'm nearing the point where there's nothing to do but wait for the copy to be done. [18:58:21] Coren: \o/ cool [18:58:28] Coren: do do a write up so I know what’s happening as well? [18:59:07] YuviPanda: Heh. Here's all of it: "New LVM volume created with thin provisioning; files being copied from the old (fixed-allocation) filesystem to the new one." :-) [18:59:18] heh :) [18:59:18] ok [18:59:19] It's simple, long and boring. [18:59:26] !log testing [18:59:26] Message missing. Nothing logged. [18:59:29] oooh [18:59:44] oooh? [19:00:49] Labs isn't *down* down. Just /data/project and /home are readonly. It's a partial outage and I expect many things can still work - anything that doesn't depend on writing there certainly should. 
[19:04:33] Coren: no, I mean, it’s dead on -operations [19:04:37] Coren: so I’m surprised it’s alive here [19:08:09] ok, thanks for the updates, i was wondering why all the files became read-only [19:11:23] 18:59‧ !log this works? [19:11:23] 18:59‧ well, apparently not [19:11:24] 18:59‧ Logged the message, Master [19:11:37] Coren: yup, apparently I’m impatient :) [19:11:39] or lagged [19:11:40] Nope. It works on -operations too, you're just too impatient. :-) [19:14:29] Ohh, that explains that :P [19:32:33] YuviPanda: Hm.. gerrit bot gone? Anything I can do to help [19:33:05] Krinkle: partial labs outage due to readonly FS, and I think the bot died because it can’t write to NFS for logs [19:33:07] wikibugs is dead too [19:33:20] Krinkle: so, we could either temporarily run them on our local machines, or just wait for labs to come back [19:33:26] wikibugs seems alive [19:33:40] is it relaying anything? [19:33:50] don’t think it is... [19:35:07] YuviPanda: it's working in #-dev [19:35:19] valhallasw`cloud: oh [19:35:24] valhallasw`cloud: because of the half joining thing? [19:35:28] yeah [19:35:32] valhallasw`cloud: hmm, right [19:36:02] Krinkle, I don't see the gerrit bot here either [19:36:27] Anshoe: Indeed. But the conversation here is about getting it back up. The bot is operated from a labs machine. [19:36:55] ohh, well, I'm only here cause of GCI. Don't know if I can be of much help :) [19:37:20] Krinkle: I’m tempted to just wait for labs to come back up. [19:37:25] k [19:37:26] ETA? [19:37:47] Krinkle: Coren knows [19:37:58] its' high noon on Dev O'clock in SF. The balmer peak of the day. 11-13:00 is quite productive I find. [19:38:16] YuviPanda: Ah, it runs outside labs? Interesting [19:38:24] I thought it depended on the event stream [19:38:31] Krinkle: it does, but it can run outside of labs, yeah. 
[19:38:36] Krinkle: it currently does run in toollabs [19:38:42] but I can knowck up a local instance [19:38:43] I know, but didn't think it'd run outside of [19:38:44] Ostensibly, as planned (some 20h from now) but there seems to be serious performance issues and I fear I'll have to abort and end the maintenance really early. [19:39:38] Krinkle: event stream is open to everyone who gets appropriate permissions on gerrit, from anywhere :) [19:39:45] I seem to be running into serious lvm fragmentation issues. [19:39:58] Krinkle: quite useful, just sad there isn’t a similar thing for phab [19:39:59] !fun [19:40:16] YuviPanda: Yeah, cool :) [19:40:30] I don't have enough room to cleanly do the copy; I'll have to wait until the new shelf gets here. :-( [19:40:51] I'll be able to decide one way or the other in 30m or so. [19:44:08] testing irc join-then-msg [19:44:13] ok, that seems to work [19:47:35] Yeah, no way this is going to work without a large, contiguous volume. :-( [19:48:35] petrb@huggle-win32:~/mxe$ touch bla [19:48:36] touch: cannot touch ‘bla’: Read-only file system [19:48:38] Coren ^ [19:48:51] you can reboo [19:48:52] t [19:49:15] petan: This has been announced weeks ago, with reminders since, and is in the channel topic. :-) [19:49:18] PROBLEM - Free space - all mounts on tools-exec-14 is CRITICAL: CRITICAL: tools.tools-exec-14.diskspace.root.byte_percentfree.value (<44.44%) [19:49:37] aha ok [19:49:44] But also, I'm going to have to abort and wait for moar hardware. :-( [19:50:03] So it'll become readwrite again shortly. [20:00:14] * Coren writes the email. [20:03:24] !log tools.wikibugs legoktm: Deployed c61edcfab64d62081edc3ccf89534764017f4a1c Make sure we're in the channel before messaging it wb2-irc [20:03:28] Logged the message, Master [20:04:57] Yay! [20:09:37] Grrr. Lots and lots of planning and effort down the drain because fail. 
:-( [20:28:06] * valhallasw`cloud hugs Coren [20:30:24] (PS1) Legoktm: Make sure we're in the channel before messaging it [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/185227 [20:30:28] (PS2) Merlijn van Deen: Make sure we're in the channel before messaging it [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/185227 (https://phabricator.wikimedia.org/T86758) (owner: Legoktm) [20:30:30] (CR) Merlijn van Deen: [C: 1] "+2 on your part of the change ;-)" [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/185227 (https://phabricator.wikimedia.org/T86758) (owner: Legoktm) [20:30:41] (CR) Legoktm: [C: 1] Make sure we're in the channel before messaging it [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/185227 (https://phabricator.wikimedia.org/T86758) (owner: Legoktm) [20:30:55] (CR) Merlijn van Deen: [C: 2] Make sure we're in the channel before messaging it [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/185227 (https://phabricator.wikimedia.org/T86758) (owner: Legoktm) [20:30:57] (Merged) jenkins-bot: Make sure we're in the channel before messaging it [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/185227 (https://phabricator.wikimedia.org/T86758) (owner: Legoktm) [20:32:50] thanks for the note/email Coren, sorry it didn't work out [20:44:03] Wikimedia-Labs-Infrastructure, Labs-Team: Debian Jessie image for Labs - https://phabricator.wikimedia.org/T75592#980230 (Andrew) >This isn't really blocking this, though; it can be easily worked around by passing `rootdelay=N` (which was the default > in prod for Ubuntu as well, until 3d3868aa17b8670e847332b... [20:45:47] Wikimedia-Labs-Infrastructure, Labs-Team: Debian Jessie image for Labs - https://phabricator.wikimedia.org/T75592#980244 (Andrew) {F28878} [20:55:04] Coren: do you see an obvious typo or misunderstanding here? 
https://phab.wmfusercontent.org/file/data/kuwrwq2dzqiv5ikjavdl/PHID-FILE-u5gtohld4se7xviiawwz/xxjnas3kbdzqjdek/rootdelay300.txt [20:55:13] I'm back to being stumped. [20:59:05] andrewbogott: The link you followed to view this file is invalid or expired. [20:59:42] Hm… linked from the bottom of https://phabricator.wikimedia.org/T75592 it works [20:59:48] Coren: I was hoping tools-anomiebot wouldn't be affected since I moved all the active log files to /data/scratch. But a bunch of stuff died anyway :( [21:00:27] Bleh. Do you know what and how it died? Knowing that might help make the "real" maintenance easier if it can be worked around. [21:01:22] andrewbogott: root_delay=300 makes the startup pause a whole 5 minutes; and that still doesn't work? [21:01:33] same behavior [21:01:50] Although I think I captured the wrong log -- the one you're looking at only has a 20 second delay. [21:02:13] I'm not looking at any log; "he link you followed to view this file is invalid or expired." [21:02:52] Nope. It should have written reasons to logs in /data/stash when the tasks errored, but no luck despite other stuff logging there. If you can get any info for jobs 350 or 352, that might be able to tell us something, if only that it supposedly exited normally. [21:03:15] Coren: you get that if you follow the log link in the bug? [21:03:22] err, /data/scratch (I always get that wrong) [21:07:17] Coren: Huh. I told some of the jobs to re-exec themselves, and they died instead. exit_status 116, according to qacct. [21:07:35] 116? [21:07:52] Jobs 105619 and 6059566 [21:09:51] I don't remember having ever seen 116 [21:10:31] 3operations, Beta-Cluster, Wikimedia-Labs-Infrastructure: Change mwdeploy homeDirectory field in LDAP from /home/mwdeploy to /var/lib/mwdeploy - https://phabricator.wikimedia.org/T86903#980386 (10hashar) >>! In T86903#980010, @yuvipanda wrote: > It's /home/mwdeploy in prod, should be /home/mwdeploy in beta. Yea... 
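Editor's note on the exit_status 116 reported by qacct: by the usual shell convention, a status of 128 or below is the value the process itself passed to exit(), while 128+N means the process was killed by signal N. The following is a minimal sketch of that decoding, not anything from the grid tooling; the function name is made up for illustration, and the grid engine may record signal deaths differently.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (hypothetical helper, not a grid API)."""
    if not 0 <= status <= 255:
        raise ValueError("exit status must be in 0..255")
    if status == 0:
        return "exited normally (success)"
    if status <= 128:
        # The process chose this value itself, e.g. via exit(116).
        return f"process called exit({status})"
    # Shell convention: 128 + N means "terminated by signal N".
    signum = status - 128
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Not a signal number known on this platform.
        name = "unknown signal"
    return f"killed by signal {signum} ({name})"


# The status discussed in the log:
print(describe_exit_status(116))
```

Under this convention 116 only says the job exited on its own with that code; it does not say why.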
[21:15:12] anomie: It's below 128 so, technically, this means the job ended with exit(116)
[21:18:14] Coren: I wonder if it's because the job itself still had its output and error streams pointing to /data/project, even though it pretty much never writes anything there?
[21:18:40] Well, /data/project is back to rw, so that shouldn't be it.
[21:19:09] Lemme try to start a dummy job, make sure whatever the issue is isn't the grid itself first.
[21:20:50] That worked fine. Hmmm.
[21:25:16] Coren can you restart the jobs of Magnus?
[21:25:32] GerardM-: Sure, if you point me at which need restarting.
[21:26:02] https://tools.wmflabs.org/toolscript/index.html?pastebin=x9TXBfZi
[21:26:09] this is the job I need running
[21:27:47] toolscript? Your link seems to work but I don't know how to decide whether it's broken or not - lemme restart it to be certain
[21:28:43] GerardM-: Better?
[21:28:45] have a look at the output
[21:29:04] No
[21:29:07] not better
[21:30:12] GerardM-: As far as I can tell, the webservice is running fine. Lemme try to see if there is anything in the logs.
[21:32:35] GerardM-: I see a lot of errors, but they're not from the tool itself. Apparently, something is trying to hit zh_yue.wikipedia.org and be_x_old.wikipedia.org neither of which seem to exist. Also, there seems to be an error in the code itself: "Invalid argument supplied for foreach() in /data/project/toolscript/public_html/misc.php on line 71" though that doesn't look like a fatal
[21:35:25] it's zh-yue and be-x-old with hyphens and not underscores, no?
[21:37:31] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[21:37:37] I can only tell what the tool is actually doing, not what it should be doing. :-)
[21:38:19] But yeah, for zh-yue clearly. I never knew about be-x-old
[21:38:41] Hm. Cyrillic.
[21:40:29] Coren: got.wikipedia.org
[21:40:30] WORST
[21:40:31] Lots of that line 71 error. GerardM-: Sorry I can't help more, but the tool appears to have a bug (rather than a problem in the infrastructure)
[21:41:05] Coren: spelling out the name of got in got requires code points outside of the Unicode BMP
[21:41:18] promptly causes any Android Java code that encounters any got string to fail
[21:41:47] I have never seen that script before, and I consider myself well-learned in world scripts and Unicode. Huh.
[21:53:42] am i supposed to be able to access tools-redis from instances?
[21:54:05] notconfusing: hey! yup
[21:54:23] notconfusing: if you’re trying to communicate between tools, using a redis queue is a pretty nice idea too
[21:54:24] cos i can do nslookup tools-redis
[21:54:31] but i have a connection error when i use that
[21:54:36] ‘use that’ as in?
[21:55:14] are you using a library? commandline tools?
[21:55:28] YuviPanda, well i tried to migrate from using tools-labs to using a dedicated instance so i could have a stable ip for serving out websocket
[21:56:01] notconfusing: or you could wait for a day or so and I can help you figure that out :)
[21:56:16] notconfusing: it’s been a very happening day, so I couldn’t get to your email
[21:56:34] but it shouldn’t be more than half a day of work to get toollabs to support nodejs webservices properly
[21:56:35] with websockets
[21:56:38] * notconfusing grins, Yuvi panda, i understand, and i would still like to do it on tools
[21:56:55] i was just trying to help myself while waiting
[21:56:55] oh
[21:56:57] it’s python?
[21:56:59] hmm
[21:57:06] yup
[21:57:15] that might already work, we have uwsgi..
[21:57:22] notconfusing: however, who is going to be consuming your rebroadcast?
[21:57:35] notconfusing: things on toollabs, things everywhere on the internet…?
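Editor's note on the tools-redis connection error above, where nslookup succeeds but the client fails: one way to separate name resolution from reachability is a plain TCP connect. This is a minimal sketch under assumptions; the helper name is hypothetical, and port 6379 is only the Redis default, not something stated in the log.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (hypothetical helper)."""
    try:
        # create_connection resolves the name and then attempts the connect;
        # DNS errors, refusals and timeouts are all subclasses of OSError.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call, assuming the hostname from the log and the default Redis port:
#   can_connect("tools-redis", 6379)
```

A False result with a resolvable name would point at filtering or a service that is not listening rather than at DNS.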
[21:58:05] at the moment a GUI application, and maybe this citation consortium in England Crossref
[21:58:28] aaah
[21:58:34] well then, websockets are necessary :)
[21:58:41] so one thing on toolslabs and eventually everywhere on the internet
[21:58:45] notconfusing: is the code in a public repository anywhere?
[21:58:49] yup
[21:58:52] link?
[21:59:48] https://github.com/notconfusing/cocytus/blob/demo/cocytus-output.py
[22:01:21] oooh, twisted
[22:01:25] yup
[22:01:32] from autobahn.twisted.wamp import ApplicationRunner
[22:01:37] that’s the websocket outputter thing, isn’t it?
[22:02:02] yes, you can see on line 37, how i would like to broadcast
[22:02:11] right
[22:02:30] notconfusing: this shouldn’t be *too* hard to do. I’ll take a stab tomorrow?
[22:02:31] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:02:41] YuviPanda, deal
[22:04:04] Coren: ping. I’m going to set up trusty versions of tools-webgrid-tomcat. Thinking of calling it by its true name, which should perhaps be tools-webgrid-generic?
[22:04:39] YuviPanda: I am full of agreement. Weren't you sleeping?
[22:04:56] Coren: I was, but then got caught up watching a fascinating argument in -operations
[22:05:22] Coren: tools-web-generic-01?
[22:05:27] or tools-genericweb-01?
[22:06:45] tools-webgrid-generic-01; let's be consistent even if ugly. :-)
[22:09:10] but we have tools-uwsgi!
[22:09:11] :)
[22:09:29] but yeah, consistency, etc
[22:09:50] !log tools created instance tools-webgrid-generic-01
[22:10:17] Logged the message, Master
[22:10:35] damn
[22:28:21] Tool-Labs-tools-Erwin's-tools: xwiki.php not working - https://phabricator.wikimedia.org/T86976#980699 (MarcoAurelio) NEW
[22:30:30] Tool-Labs-tools-Erwin's-tools: xwiki.php not working - https://phabricator.wikimedia.org/T86976#980707 (MarcoAurelio)
[22:30:31] Tool-Labs-tools-Erwin's-tools: Migrate https://toolserver.org/~erwin85/xwiki.php to Tool Labs - https://phabricator.wikimedia.org/T62878#980708 (MarcoAurelio)
[22:31:26] Tool-Labs-tools-Erwin's-tools: xwiki.php not working - https://phabricator.wikimedia.org/T86976#980699 (MarcoAurelio) I think that T62878 is maybe what is blocking this from working, although I am not sure. Linking it here for reference.
[22:32:55] Tool-Labs-tools-Erwin's-tools: xwiki.php not working - https://phabricator.wikimedia.org/T86976#980732 (MarcoAurelio)
[22:50:15] Wikimedia-Labs-wikistats: deploy a replacement for the old "wikistats admin" (WSA) script - https://phabricator.wikimedia.org/T38287#980763 (Dzahn) p:Normal>Low
[22:50:57] Wikimedia-Labs-wikistats: Fix all the Wikia stats - https://phabricator.wikimedia.org/T61943#980765 (Dzahn) p:Normal>Low
[22:54:15] PROBLEM - Free space - all mounts on tools-webproxy is CRITICAL: CRITICAL: tools.tools-webproxy.diskspace._var.byte_percentfree.value (<11.11%)
[23:04:16] RECOVERY - Free space - all mounts on tools-webproxy is OK: OK: All targets OK
[23:12:33] PROBLEM - Puppet failure on tools-webgrid-generic-01 is CRITICAL: CRITICAL: 71.43% of data above the critical threshold [0.0]
[23:22:37] RECOVERY - Puppet failure on tools-webgrid-generic-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:23:31] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0]