[00:00:01] valhallasw: Coren: Aha, I think I've found the issue causing the jobs to disappear. Around mid April, the job id reset from 9999999 or something, to 0/1. I think it overwrote my low number jobs that had been running since near the start of eqiad. [00:05:13] a930913: wow. [00:05:57] Take a look at this. https://tools.wmflabs.org/paste/view/267ff131 [00:06:14] yuvipanda: failed 19? [00:06:36] hmm [00:06:40] no idea what that means... [00:06:42] a930913: can you open a bug? [00:09:11] yuvipanda: Tool-Labs or Labs-Infrastructure? [00:09:18] a930913: tool-labs [00:15:44] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1205 is CRITICAL 60.00% of data above the critical threshold [0.0] [00:16:42] Tool-Labs: Jobs Disappearing from SGE - https://phabricator.wikimedia.org/T99027#1283959 (A930913) NEW [00:18:01] yuvipanda: Is that ok? [00:18:37] a930913: yup! thanks :) I've cc'd more people [00:27:56] Tool-Labs: Jobs Disappearing from SGE - https://phabricator.wikimedia.org/T99027#1283992 (A930913) IRC logs show death throes. ``` 22:13 BracketBot: Unloading script. 22:13 BracketBot: Loading script. 22:13 BracketBot: Loading script.... [00:40:47] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1205 is OK Less than 1.00% above the threshold [0.0] [00:46:01] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1204 is CRITICAL 60.00% of data above the critical threshold [0.0] [00:46:56] a930913: do you have a more recent job id number? [01:10:59] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1204 is OK Less than 1.00% above the threshold [0.0] [03:03:20] Tool-Labs: Get rid of toolwatcher, use skeleton homedirs instead - https://phabricator.wikimedia.org/T91235#1284113 (scfc) p:Normal>Lowest `pam_mkhomedir.so` is executed as part of `/etc/pam.d/common-session` which is part of `/etc/pam.d/sshd`. `/etc/pam.d/sudo` on the other hand includes `/etc/pam.d/... 
[03:03:34] Tool-Labs: Get rid of toolwatcher, use skeleton homedirs instead - https://phabricator.wikimedia.org/T91235#1284115 (scfc) a:coren>None [03:29:20] !log tools drained, depooled and deleted tools-exec-15 [03:29:29] Logged the message, Master [03:30:10] PROBLEM - Host tools-exec-15 is DOWN: CRITICAL - Host Unreachable (10.68.17.61) [03:35:49] Labs, Tool-Labs: Rebuild a bunch of tools instances - https://phabricator.wikimedia.org/T97437#1284158 (yuvipanda) Everything except -07 and -08 is now gone. [08:44:07] anomie: hi do you know who can approve oauth requests? [08:54:26] PROBLEM - Puppet staleness on tools-mailrelay-01 is CRITICAL 100.00% of data above the critical threshold [43200.0] [09:35:36] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1402 is CRITICAL 50.00% of data above the critical threshold [0.0] [09:45:24] valhallasw : hi, [09:46:21] I am trying to set up the language tool server. As of now I don't care if it is open all over the network. I want to see if the server can successfully be set up [09:46:45] so that once i send a text file i get a proper xml response. [09:47:19] ankita-ks: what do you want to use it for? [09:47:40] because depending on how it should be used, different setups might be more convenient [09:47:47] This is part of my gossip project where I want to enable language tool for visual editor. [09:48:05] *gsoc [09:48:08] damn . [09:49:10] So i wanted to test if I could successfully run the server. [09:49:11] hm, okay. I'm not sure if tool labs is the most convenient environment for that, as you'd also need to set up VE etc [09:49:22] okay. what would you suggest then? [09:49:29] basically, running generic services is not something that's well-supported in tool labs [09:49:32] your own labs project [09:49:44] that would give you full control over one or more servers [09:49:53] hmm...okay [09:50:10] Looking into that. 
[09:50:46] yuvipanda: I still don't see anything in shinken :P [09:50:59] valhallasw: anything as in? [09:51:07] valhallasw: oh, outside of accounts? yeah, that's going to be a PITA :| [09:51:16] valhallasw: and not a one day operation... [09:51:17] ? [09:51:26] no, I can't see any info when I log in with my account [09:51:35] gah, words. [09:51:39] outside of the guest account, I mean [09:51:44] or maybe I should juts log out and log in again, wikitech style? [09:51:49] no, that won't work [09:51:51] why is that so complicated? :/ [09:51:55] you can only see things that you're a contact in [09:52:09] okay [09:52:56] yuvipanda: so I just need to be added under https://github.com/wikimedia/operations-puppet/blob/production/modules/shinken/files/contactgroups.cfg#L7 ? [09:53:17] valhallasw: oh, hmm. that *could* work, yeah. [09:53:22] valhallasw: let's try? make a patch? [09:53:32] although I'm not sure if that gives me access to settings things like downtime [09:53:51] yeah, I'm not either [09:53:55] but it'll let you see them! [09:55:20] Labs: Remove old backups-of-backups from NFS - https://phabricator.wikimedia.org/T99061#1284484 (coren) NEW [10:00:36] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1402 is OK Less than 1.00% above the threshold [0.0] [10:03:04] MediaWiki-extensions-OpenStackManager, I18n: GENDER support in openstackmanager-addedto, openstackmanager-failedtoadd - https://phabricator.wikimedia.org/T99063#1284511 (Nemo_bis) NEW [10:14:04] Tool-Labs: Tool wp-world writes ungodly amounts of logs - https://phabricator.wikimedia.org/T99064#1284551 (coren) NEW a:Kolossos [10:14:18] Tool-Labs: Provide source/repository link on https://tools.wmflabs.org - https://phabricator.wikimedia.org/T86431#1284560 (Ricordisamoa) [10:15:24] yuvipanda: checking. if they are large, ew can just re-enable puppet? [10:15:37] it's -catscan, -cyberbot, -gift and -wmt... [10:16:10] valhallasw: if they are large, I'd suggest: 1. force a puppet run, 2. 
restart them to make sure that files are in /tmp properly, 3. bind mount root to access old /tmp and clean it out [10:16:28] large, medium, medium and medium [10:16:40] I know some of those words [10:16:45] hah :D [10:16:53] bind mount root wat [10:16:55] so you can do like [10:16:59] mkdir /oldroot [10:17:05] mount --bind / /oldroot [10:17:06] and then [10:17:10] /oldroot/tmp [10:17:13] will have the *old* tmp [10:17:44] but erm. how will they get thta new partition for /tmp? [10:18:58] valhallasw: from step (1) of 'force a puppet run'? [10:19:23] yuvipanda: but there is already a file system....? *confused [10:19:55] valhallasw: bind mounts :) google them [10:20:11] yuvipanda: *you* wrote the puppet change, I don't know what changed exactly [10:20:16] errr [10:21:00] I only know that it changed something with partitions and that it couldn't be applied on old hosts for some reason [10:22:35] valhallasw: aaaah, *that*. right, so right now there's a root partition and then /tmp is just a directory [10:22:45] valhallasw: the puppet change creates a new LVM partition and then mounts it on /tmp [10:22:54] so this means that the new partition shadows the old /tmp [10:23:02] and so all the files that are there ewill kind of 'disappear' [10:23:05] *will [10:23:13] open file handles will still work [10:23:16] but that's fine, because that happens on reboot anywya [10:23:20] right [10:23:30] so if you run puppet [10:23:39] that forces the /tmp volume to be created [10:23:46] but there are still open file handles [10:23:49] to files on the old /tmp [10:23:51] so restarting fixes that [10:24:02] but this can safely be done while the system is already partitioned? won't take ages, need unmounts, etc? [10:24:11] nope, because LVM [10:24:19] these aren't physical partitions, but logical ones [10:24:24] and so can be done quite safely [10:24:32] that doesn't change the fact the filesystem expects a certain size [10:24:39] it does! 
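The point made at 10:23:13 ("open file handles will still work") can be tried out without touching any mounts. The sketch below is only an analogy, assuming a POSIX system: it removes the path with `os.unlink` instead of shadowing it with a new /tmp volume, and the helper name `read_after_path_disappears` is made up for illustration. In both cases the path no longer leads to the file while an already-open handle keeps working, which is why restarting the processes is what finally releases the old files.

```python
import os
import tempfile


def read_after_path_disappears(payload: bytes) -> bytes:
    """Write payload to a temp file, remove its path, then read via the open handle.

    Analogy only (assumption): unlinking stands in for a new volume being
    mounted over the directory. POSIX semantics are assumed.
    """
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w+b") as handle:
        handle.write(payload)
        handle.flush()
        os.unlink(path)                  # the path is gone from the directory tree
        assert not os.path.exists(path)  # new opens by name would fail now
        handle.seek(0)
        return handle.read()             # the existing handle still sees the data


if __name__ == "__main__":
    print(read_after_path_disappears(b"job state"))
```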
[10:24:58] that's why LVM is quite awesome :D [10:25:12] i didn't quite believe it until I tried :) [10:25:29] eh. the first tutorial I can find says I need to resize2fs anyway [10:25:34] valhallasw: also, labs base images have 20G allocated for root partition, and rest is unallocated lvm. [10:25:55] aha. [10:26:51] so yeah, it does work. [10:27:04] valhallasw: now the problem with the specialized nodes is that I don't know how safe restarting them is [10:27:17] I have no idea what'll happen to the jobs... [10:27:57] Tool-Labs: document the need and usage patterns for special exec hosts - https://phabricator.wikimedia.org/T99067#1284594 (valhallasw) NEW [10:28:13] yuvipanda: no host to queue the job on -> SGE will wait until it's back online [10:28:20] alright [10:29:40] Tool-Labs: document the need and usage patterns for special exec hosts - https://phabricator.wikimedia.org/T99067#1284601 (valhallasw) [10:30:52] Tool-Labs: document the need and usage patterns for special exec hosts - https://phabricator.wikimedia.org/T99067#1284606 (yuvipanda) So the problems I see with special hosts in general is: 1. No redundancy. If that host goes down... jobs don't get automatically rescheduled elsewhere 2. Special cases make i... [10:31:48] valhallasw: alright, i shall attempt to sleep again. good night [10:31:52] yuvipanda: night [10:32:04] valhallasw: and do look at the patches :) [10:35:28] Tool-Labs: tools-shadow puppet down - https://phabricator.wikimedia.org/T99068#1284608 (valhallasw) NEW [10:40:54] Tool-Labs: tools-shadow puppet down - https://phabricator.wikimedia.org/T99068#1284617 (valhallasw) Open>Resolved a:valhallasw There is a Sources.gz, though, so I'm not sure why that isn't picked up. Steps taken: * Commented out the deb-src line in /etc/apt/sources.list.d/wikimedia.list * sudo ap... 
[10:42:07] Tool-Labs: Re-enable puppet on tools-exec-{cyberbot,catscan,gift,wmt} - https://phabricator.wikimedia.org/T99069#1284624 (valhallasw) NEW [10:46:30] Tool-Labs: Fix gridengine install on tools-precise-dev - https://phabricator.wikimedia.org/T99070#1284639 (valhallasw) NEW [10:52:15] RECOVERY - Puppet staleness on tools-shadow is OK Less than 1.00% above the threshold [3600.0] [11:02:11] \o/ [11:10:15] Tool-Labs: document the need and usage patterns for special exec hosts - https://phabricator.wikimedia.org/T99067#1284664 (valhallasw) For the -wmt case, using reservations might also be an option: http://manpages.ubuntu.com/manpages/saucy/man1/qrsub.1.html . First reserve the total amount of memory required... [11:15:44] Tool-Labs: Fix gridengine install on tools-precise-dev - https://phabricator.wikimedia.org/T99070#1284668 (valhallasw) ``` root@tools-precise-dev:/home/valhallasw# strace -f apt-get install -f 2>&1 | grep -e 'lib/gridengine' [pid 20577] execve("/bin/mkdir", ["mkdir", "-p", "/var/lib/gridengine"], [/* 31 vars... [11:18:37] Tool-Labs: Fix gridengine install on tools-precise-dev - https://phabricator.wikimedia.org/T99070#1284670 (valhallasw) Open>Resolved a:valhallasw Relevant: https://www.youtube.com/watch?v=nn2FB1P_Mn8 [11:19:20] legoktm: ^ ready for use! [11:19:51] or, well, let me check if I can actually submit anything... [11:21:00] yeah, working. [11:21:04] addshore: ^ one more host to add ;D [11:21:08] tools-precise-dev [11:21:12] bah! 
[11:21:27] sorry :-) [11:21:41] we should auto-generate that overview somehow [11:21:55] yeh, would be nice ;p [11:24:04] Tool-Labs: Fix 'unknown's in shinken - https://phabricator.wikimedia.org/T99072#1284673 (valhallasw) NEW [11:30:04] Tool-Labs: Fix 'unknown's in shinken - https://phabricator.wikimedia.org/T99072#1284681 (valhallasw) Of these, the most broken are ``` UNKNOWN for tools-webproxy-test/Puppet failure UNKNOWN for tools-webproxy-test/Puppet staleness ``` (no data for > 4 months) ``` UNKNOWN for tools-redis/Free space - all mo... [11:30:35] RECOVERY - Puppet failure on tools-precise-dev is OK Less than 1.00% above the threshold [0.0] [11:36:19] Tool-Labs: Fix shinken config to remove tools-webproxy-test - https://phabricator.wikimedia.org/T99073#1284685 (valhallasw) NEW [11:52:22] [intuition] ChameleonWiki opened pull request #43: Pull krinkle/master (master...html) https://github.com/Krinkle/intuition/pull/43 [11:58:27] Tool-Labs: Fix 'unknown's in shinken - https://phabricator.wikimedia.org/T99072#1284734 (valhallasw) I'm not sure what's happening for tools-redis. The only debug info diamond provides is: ``` DEBUG:diamond:Ignoring / since it is of type rootfs which is not in the list of filesystems. [2015-05-14 11:51:33,... [12:13:21] Tool-Labs: Fix 'unknown's in shinken - https://phabricator.wikimedia.org/T99072#1284750 (valhallasw) tools-webgrid-lighttpd-1406's diamond seemed to be in some kind of reload loop. Restarted diamond; this seems to fix the issue. [12:23:33] Labs, operations, wikitech.wikimedia.org: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1284755 (Krenair) Probably caused by https://gerrit.wikimedia.org/r/#/c/196961/ ? 
[12:33:27] Labs, Patch-For-Review: Reinstall db1009.eqiad from zero - https://phabricator.wikimedia.org/T98958#1284762 (jcrespo) [12:33:39] Labs, Patch-For-Review: Get Labs openstack service dbs on a proper db server - https://phabricator.wikimedia.org/T92693#1284764 (jcrespo) [12:33:41] Labs, Patch-For-Review: Reinstall db1009.eqiad from zero - https://phabricator.wikimedia.org/T98958#1284763 (jcrespo) Open>Resolved [12:35:38] Tool-Labs: Fix 'unknown's in shinken - https://phabricator.wikimedia.org/T99072#1284766 (valhallasw) tools-exec-1212: diamond does not seem to be working at all. Even running manually using ``` /usr/bin/python /usr/bin/diamond --foreground --skip-change-user --skip-fork --skip-pidfile ``` fails, or rather, d... [12:38:14] MediaWiki-extensions-OpenStackManager, I18n: GENDER support in openstackmanager-addedto, openstackmanager-failedtoadd - https://phabricator.wikimedia.org/T99063#1284769 (Krenair) "GENDER support" is just when we mark something with the gender function to prove to translators that you can actually use it o... [12:41:11] Tool-Labs: Fix 'unknown's in shinken - https://phabricator.wikimedia.org/T99072#1284770 (valhallasw) tools-mail seems to have the same issue as tools-exec-1212. Is this maybe some sort of precise v trusty issue? But why do all the other precise hosts work without issues then? In the case of tools-mail, ther... [12:49:42] Labs, Patch-For-Review: Get Labs openstack service dbs on a proper db server - https://phabricator.wikimedia.org/T92693#1284776 (jcrespo) @Andrew: ready for the data migration, waiting for a ping from you so that we can start (it requires a 2 brief moments on read-only mode). 
[12:54:35] Tool-Labs: Shinken: make sure 'Free space - all mounts' can handle no-longer-existing mounts - https://phabricator.wikimedia.org/T99077#1284779 (valhallasw) NEW [12:55:18] Tool-Labs: Fix 'unknown's in shinken - https://phabricator.wikimedia.org/T99072#1284786 (valhallasw) ...restarting diamond on these hosts does seem to have mostly solved the issue, though. There are only three left: {T99077}: * UNKNOWN for tools-redis/Free space - all mounts {T99073}: * UNKNOWN for tools-w... [12:59:03] Tool-Labs: document the need and usage patterns for special exec hosts - https://phabricator.wikimedia.org/T99067#1284790 (coren) The reason most of those dedicated queues exist is because their use of resources didn't match the general model of allocation; either because they have a lot (up to 120) of very... [13:00:54] Coren: fair enough, but it would be good to document what the reason is for each queue specifically [13:04:02] Hm. I'd need to ask the maintainers again - most of those are over a year old. :-) Will do. [13:04:24] I'm pretty sure I remember the reasons, I'm just no longer sure which reason applied to which queue. :-) [13:05:02] I think the most important info is whether we can reboot/rebuild instances or not [13:05:30] OK, most UNKNOWNs in shinken are now fixed [13:26:40] valhallasw: The short of it: I always try to give their maintainers advance notice and/or fit with their availability in the past for reboots but I've never hesitated to do so when needed. Having jobs restartable or scheduled is their responsibility. [13:26:56] valhallasw: Same goes with rebuild, obviously, though I've yet to have to. [13:26:57] Coren: ok, good to know. [13:27:48] Despite being dedicated exec nodes with special queues, they are "just" exec nodes. They don't have root on 'em and can't have installed stuff. [13:28:23] I expect an upgrade to Trusty might be problematic if unilateral though - we'd want to sync that up with them. 
[13:29:43] I'm more afraid of e.g. local files or pipes [13:34:09] but I'll send an e-mail out [13:34:53] valhallasw: Not sure where outside /tmp you'd find those - they have the same setup as every other node. [13:36:24] Coren: mmm, right, users only have write rights there and in /home [13:42:07] Tool-Labs: Re-enable puppet on tools-exec-{cyberbot,catscan,gift,wmt} - https://phabricator.wikimedia.org/T99069#1284815 (valhallasw) @Magnus (for catscan), @Revi/@JohnLewis (for wmt), @Cyberpower678 and @Giftpflanze: we need to reboot these -exec hosts to make sure they get updates again. Do you have a pref... [13:42:18] Tool-Labs: Re-enable puppet on tools-exec-{cyberbot,catscan,gift,wmt} - https://phabricator.wikimedia.org/T99069#1284817 (valhallasw) [13:51:49] Tool-Labs: document the need and usage patterns for special exec hosts - https://phabricator.wikimedia.org/T99067#1284826 (valhallasw) @Magnus/@Cyberpower678/@Giftpflanze: could you add some context for the dedicated exec hosts in the task description? Thanks! [14:04:03] hrm [14:35:45] Tool-Labs: Re-enable puppet on tools-exec-{cyberbot,catscan,gift,wmt} - https://phabricator.wikimedia.org/T99069#1284870 (Giftpflanze) tools-exec-gift can be rebooted when there aren't any jobs running (there is a cascade running for some days every 1st and 15th of the month). The best point of time for rebo... [14:37:06] Labs, Patch-For-Review: Get Labs openstack service dbs on a proper db server - https://phabricator.wikimedia.org/T92693#1284877 (jcrespo) Migrating data right now, but actual switchover may have to wait for a proper database cleanup (no innodb_file_per_table, ibdata:90GB, some schema may not be useful). [14:39:37] valhallasw: what context do you need exactly? [14:39:56] gifti: mainly why it's on a seperate instance [14:40:04] ok [14:40:19] gifti: e.g. 
'needs local interprocess communication', or 'stores temp files that are large and needed for subsequent runs in the same batch', or something like that [14:43:04] 10Tool-Labs: document the need and usage patterns for special exec hosts - https://phabricator.wikimedia.org/T99067#1284909 (10Giftpflanze) [14:44:52] Coren: are you aware of the VENOM bug? [14:45:50] Heh. Yes. Overhyped bug that only affects ridiculously unlikely configurations. [14:46:25] We can't use virtual floppy drives on labs? I'm shocked. [14:46:35] Heh. Nope. [14:47:21] Coren: didnt expect it to be an actual issue, just wanted to ensure that it wasnt a problem [14:48:35] It's not. Honestly, I'm not surprised there might be exploitable bugs in the floppy emulation layer seeing how nobody has had any reason to use it in a decade or so. Bitrot isn't all that surprising. [14:49:27] Our security guys remembers it being needed to install Windows 2003 in some configurations - and that's the most recent use of virtual floppies he can remember. :-) [14:53:19] Coren: I'm trying to figure out if we can have the 16GB /tmp on m1.medium hosts, but I'm having trouble getting a clear grasp of the LVM config. m1.medium has 40GB storage according to https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000000df.eqiad.wmflabs , and /dev/vda1 and /dev/vda2 together are only 10GB together. But pvs -a lists /dev/vda4 with [14:53:19] 6.5G/28.5G free? and lvdisplay only has /dev/vd/swap at 22G [14:54:01] oh, that actually adds up now that I re-read that. 28.5G for LVM + 8G / non-LVM + 2G /var non-LVM? [14:54:10] * Coren nods. [14:54:31] Because of the way images are build, we couldn't put / on lvm [14:54:35] built* [14:54:59] so in terms of LVM there's just 6.5G free, which means the 16G /tmp can't work there, unless we reduce swap by a significant amount [14:55:41] so... either we need to rebuild the m1.medium exec hosts as m1.large, or adapt the puppet manifest with a smaller /tmp or a smaller swap. 
I guess a 6G /tmp should also be fine, given that / is only 8G [14:57:22] or we can just make a hiera var seperate-tmp, and not have a seperate tmp for medium hosts [14:58:59] I like that option the most, as that would allow us to re-enable puppet on the special exec hosts without restarting [15:10:46] I think you are making sense there. [15:16:35] I try to, every now and then :-) The puppet patch is https://gerrit.wikimedia.org/r/210918 [15:16:48] I *think* that does the right thing, but I'm not 100% sure about hiera and false/true [15:21:50] 10Tool-Labs, 5Patch-For-Review: Re-enable puppet on tools-exec-{cyberbot,catscan,gift,wmt} - https://phabricator.wikimedia.org/T99069#1284973 (10JohnLewis) Rebooting the wmt node is fine. I don't foresee any issues with a rebuild either. [15:22:43] valhallasw: There's some whitespace issues in general.pp [15:22:54] Otherwise looks okay to me. [15:23:10] I should configure vim to show me that [15:27:38] valhallasw: (aside) You use vim for all your wikimedia dev work? [15:27:51] polybuildr: mostly, and a bit of notepad++ on the side [15:28:07] valhallasw: Your development machine is Windows? :o [15:28:23] polybuildr: windows with a linux vm, specifically :-) [15:29:09] valhallasw: That's an interesting setup. :P MW runs inside the VM, you work on the code from windows? And sometimes vim from inside the VM? [15:29:51] Most of my open source work is on linux, most of my science stuff is on windows [15:30:16] in the VM, I use vim, in Windows, I use np++ [15:30:30] or ipython notebooks, depending on what I'm editing :-p [15:31:47] but I'm not exactly a vim guru... I can copy/paste/indent/etc, but I haven't gotten much further than that [15:31:57] valhallasw: My questions are getting very irrelevant to the channel, maybe we could move to pm :P [15:32:03] then why vim? [15:32:22] because I can't do those other things in any other text-mode editor :D [15:32:33] valhallasw: haha :D well then, why text mode? 
:P [15:32:41] when someone hands me ed, I'm running around in circles, screaming [15:33:08] because virtualbox shared folders are inconvenient. I do use the np++ scp plugin sometimes, though [15:33:19] Now why would someone be cruel enough to had you ed? [15:33:28] especially when I edit files on tool labs, because of the latency [15:33:36] You could use a GUI inside the VM? [15:33:52] Oh, yes. ssh work inside labs is a bit of a pain. [15:34:14] Yeah, I have xming, but it's generally confusing because clipboards don't work the way you expect them to [15:34:53] I'm basically in some local optimum that's optimal enough for me :D [15:35:07] valhallasw: haha :D fair enough. [15:35:17] How come your science stuff is on Windows, though? [15:35:27] I thought things are easier on Linux when it comes to that sort of stuff. [15:35:30] * halfak watches on with great interest [15:37:40] polybuildr: partially historical (there was nothing like python(x,y) for linux a few years ago, apart from the system package manager which then gave you old versions of everything) [15:37:56] polybuildr: and partially because I quite like windows as gui [15:38:19] valhallasw: aha, okay. :) [15:38:40] but now you can just use anaconda on linux, which works like a charm [15:40:30] (on windows, that's preferrable to python(x,y) as well, I think, these days) [15:40:34] valhallasw: hmm, I'd never heard of Anaconda. That's interesting. [15:40:42] valhallasw, what about coreutils? I use shuf, sort, head, tail, etc. all of the time for working with dataset files. [15:41:20] halfak: it's all python for me, and my datasets are typically hdf5 anyway, so head wouldn't help that much :D [15:41:37] * halfak googles HDf5 [15:41:49] * polybuildr googles too [15:41:58] it's 'the standard' binary data interchange format [15:42:08] valhallasw, never heard of it though [15:42:09] it's also the XML of the binary data interchange formats [15:42:19] so many ways to do the same thing [15:42:21] XML of binary data. 
Now that's interesting. :P [15:42:53] I suppose I primarily work with texty binary data ;) [15:43:13] :P [15:43:15] No spike matrices or images really. [15:44:31] I should write some more human-readable text on what I work on, but https://merlijn.vandeen.nl/category/science.html has info, although maybe a bit cryptical :D [15:44:37] also needs more video and images [15:45:22] https://www.youtube.com/watch?v=UR2EcwM5e7I < this is what I do experimentally [15:45:42] Reminds me of http://www.eater.com/2015/2/24/8102677/how-to-prevent-beer-spillage-science [15:45:55] * halfak looks for a better link [15:46:48] ah, beautiful french science [15:47:58] 6Labs, 5Patch-For-Review: Get Labs openstack service dbs on a proper db server - https://phabricator.wikimedia.org/T92693#1285021 (10jcrespo) Data migrated successfully, but I think there is no 3306 port access from db1009 to virt1000, so no temporary replication can be set for switchover. [15:50:14] This one looks way better: http://www.dailymail.co.uk/sciencetech/article-2847545/So-s-coffee-spills-easily-beer-foam-drink-contains-likely-slosh-around.html [15:50:21] It includes videos too. [15:52:25] I'm always annoyed that journalists never bother to link to the original research article [15:52:42] but I haz google [15:52:42] http://arxiv.org/abs/1411.6542 [15:53:05] = http://scitation.aip.org/content/aip/journal/pof2/27/2/10.1063/1.4907048 [15:56:35] valhallasw, +1 It's ridiculous that we've gotten into that practice. [15:56:47] "we" --> "they" [15:57:12] I suppose it makes sense on paper, as there's limited space there... but on the internet? :/ [15:57:24] Indeed. It's just a link! [15:57:47] Or a DOI if you need to print it! 
[15:58:18] it's not just science though; I've never seen a news website link to, say, the state budget, or a link to the law they an article refers to [15:58:59] I think new scientist does something like that, but they link to their own site where you can use a code to find related material [15:59:09] valhallasw, I wish we had these links and a "what links here" service. I suppose major search engines could provide that. [15:59:22] 6Labs, 5Patch-For-Review: Get Labs openstack service dbs on a proper db server - https://phabricator.wikimedia.org/T92693#1285039 (10Andrew) Note that soon I will also want db access from labcontrol1001 -- it's going to be the spare for virt1000. [16:00:23] http://www.wholinks2me.com/details.php?url=https%3A%2F%2Fwikipedia.org&fr=details [16:01:17] Seems like that webservice is kinda broken. [16:01:54] I think google can do that, yes. Or they could, at least. [16:02:29] https://www.google.nl/search?q=link:nl.wikipedia.org/wiki/rijksbegroting [16:24:47] what was the trick to turn SHUTOFF instances back on again? (filippo-test-jessie and filippo-m3-c2 in this case) cc andrewbogott [16:25:06] godog: if you use horizon.wikimedia.org you should be able to just ‘start’ them. [16:25:14] Otherwise you can do it from the cmdline on virt1000 [16:26:08] godog: horizon is shellname/password rather than wikiname/password [16:27:21] andrewbogott: oohh nifty! haven't used horizon yet [16:41:34] 6Labs, 6operations, 10wikitech.wikimedia.org: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1285179 (10hoo) We have two problems here: First of all (as noted above) it's not possible to open a tcp connection to silver on that port (mysql -h silver is failing a... [16:43:26] Coren: could you take a look at https://gerrit.wikimedia.org/r/#/c/202363/3 ? I think scfc's regex is correct, but someone with more perl-fu and more +2 than myself would help ;) [16:43:47] Sure. 
[16:45:02] * valhallasw is cleaning out his review stack [16:49:45] I need to go grab dinner. o/ [16:50:38] Coren: oh, right, you're in europe! :D enjoy your dinner :) [16:56:43] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1205 is CRITICAL 60.00% of data above the critical threshold [0.0] [16:58:31] looking... [17:00:10] Error: /Stage[main]/Ldap::Client::Pam/File[/etc/pam.d/common-password]: Could not evaluate: Error 403 on SERVER: (...)

You don't have permission to access /production/file_metadata/modules/ldap/common-password on this server.

[17:00:26] andrewbogott: ^ any idea? [17:00:46] valhallasw: that was probably me messing with routing again. Should be back to normal now [17:01:01] andrewbogott: ok! [17:21:44] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1205 is OK Less than 1.00% above the threshold [0.0] [18:06:40] oh, interesting. I suddenly get shinken spam in my mailbox :D [18:07:34] Tool-Labs, Patch-For-Review: Bigbrother should ignore empty lines in .bigbrotherrc - https://phabricator.wikimedia.org/T94990#1285454 (valhallasw) Open>Resolved [18:16:16] (CR) Legoktm: [C: -1] Automatically generate toolinfo.json (1 comment) [labs/tools/extreg-wos] - https://gerrit.wikimedia.org/r/209697 (owner: Ricordisamoa) [18:16:21] (CR) Legoktm: [C: 2 V: 2] Split CSS into a separate stylesheet [labs/tools/extreg-wos] - https://gerrit.wikimedia.org/r/209692 (owner: Ricordisamoa) [18:20:08] is there a place where one can see puppet files/configs for labs roles? [18:20:42] SMalyshev: I think if you click by the little ? on the wikitech page there’s a doc link [18:20:53] but otherwise… lemme find you a link [18:21:19] andrewbogott: yeah I know that one but it's just description which is also often missing. I mean the real files :) [18:21:44] https://github.com/wikimedia/operations-puppet [18:22:00] (CR) Legoktm: "see T99101" [labs/tools/extreg-wos] - https://gerrit.wikimedia.org/r/209692 (owner: Ricordisamoa) [18:22:41] andrewbogott: ah, this has labs too? ok, thanks [18:22:51] yeah, all the same repo [18:55:03] any one here ? [18:55:58] hello [18:56:06] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Gorlingor was created, changed by Gorlingor link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Gorlingor edit summary: Created page with "{{Tools Access Request |Justification=I want to use it for building aids for script or gadget developers. The Mediawiki web service APIs just are not suited for some more com..." 
[18:56:39] !ask | LEEL [18:56:39] LEEL: Hi, how can we help you? Just ask your question. [18:58:22] I'm new to wikipedia.. i need to translate English article to Sinhala Language.. how can i do it ? [19:02:12] (PS2) Ricordisamoa: Automatically generate toolinfo.json [labs/tools/extreg-wos] - https://gerrit.wikimedia.org/r/209697 [19:02:44] (CR) Ricordisamoa: Automatically generate toolinfo.json (1 comment) [labs/tools/extreg-wos] - https://gerrit.wikimedia.org/r/209697 (owner: Ricordisamoa) [19:03:28] please some one answer my question [19:04:29] Tool-Labs: Bigbrother should ignore empty lines in .bigbrotherrc - https://phabricator.wikimedia.org/T94990#1285669 (Ricordisamoa) [19:06:23] LEEL: You should probably ask on #wikimedia-dev instead. :) [19:08:35] polybuildr : ok. thanks.. [19:55:55] Okay, I'm sure this is not the right place to ask, but I'm almost sure there is no right place to ask either. :P [19:56:03] I need a spam honeypot wiki to be spammed. [19:56:19] By both spammers and bots, preferably. [19:56:36] Does anybody of any shady places I can leave the link so that it gets spammed? [19:56:48] (PS3) Legoktm: Automatically generate toolinfo.json [labs/tools/extreg-wos] - https://gerrit.wikimedia.org/r/209697 (owner: Ricordisamoa) [19:56:54] (I only ask here because it's a Labs instance and so just maybe holds a teeny tiny bit of relevance on this channel.) 
[19:57:06] (CR) Legoktm: [C: 2 V: 2] "PS3: Rebase" [labs/tools/extreg-wos] - https://gerrit.wikimedia.org/r/209697 (owner: Ricordisamoa)
[20:09:34] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1209 is CRITICAL 40.00% of data above the critical threshold [0.0]
[20:11:14] (PS1) Legoktm: Don't die on different bug titles and other fixes [labs/tools/extreg-wos] - https://gerrit.wikimedia.org/r/211010
[20:11:23] (CR) Legoktm: [C: 2 V: 2] Don't die on different bug titles and other fixes [labs/tools/extreg-wos] - https://gerrit.wikimedia.org/r/211010 (owner: Legoktm)
[20:13:26] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1201 is CRITICAL 55.56% of data above the critical threshold [0.0]
[20:22:23] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1210 is CRITICAL 50.00% of data above the critical threshold [0.0]
[20:28:44] yuvipanda: hello
[20:28:51] what's puppet doing
[20:29:35] yuvipanda, can you take a look at https://phabricator.wikimedia.org/T99077 and https://phabricator.wikimedia.org/T99073 ?
[20:30:21] valhallasw: I think that's just my dpkg clashing with puppet
[20:30:26] Error: /Stage[main]/Ssh::Client/Package[openssh-client]/ensure: change from 1:5.9p1-5ubuntu1.4 to latest failed: Could not get (...)
[20:30:27] E: Problem renaming the file /var/cache/apt/pkgcache.bin.uiMCmq to /var/cache/apt/pkgcache.bin - rename (2: No such file or directory)
[20:30:30] wat.
[20:30:37] Tool-Labs: Fix shinken config to remove tools-webproxy-test - https://phabricator.wikimedia.org/T99073#1285876 (yuvipanda) This is probably stray LDAP entries messing around. I'll take a look.
[20:31:30] yuvipanda: hope it'll go away by itself? :P
[20:31:47] valhallasw: yes, it will - that’s a dpkg clash
[20:31:57] meaning what exactly?
[20:32:01] what's clashing with what?
[20:32:11] valhallasw: meaning puppet tried to run dpkg while my pssh process was also running dpkg
[20:32:37] Tool-Labs: Shinken: make sure 'Free space - all mounts' can handle no-longer-existing mounts - https://phabricator.wikimedia.org/T99077#1285879 (yuvipanda) T93861 is the underlying reason
[20:32:39] huh. no, that uses a lock, or should?
[20:33:12] valhallasw: apparently not. https://phabricator.wikimedia.org/T92491
[20:34:14] yuvipanda: ugh. more broken software :P
[20:34:30] :) all software is broken
[20:34:32] yuvipanda: can we fix that no-longer-existing-mount manually somehow?
[20:34:40] valhallasw: yes, I can!
[20:34:42] \o/
[20:34:50] valhallasw: let me do that now
[20:35:03] yuvipanda: and if you can then also review https://gerrit.wikimedia.org/r/#/c/210918/ I have shinken cleared out!
[20:35:16] oh no, the ldap thing is still there then
[20:35:25] Tool-Labs: Shinken: make sure 'Free space - all mounts' can handle no-longer-existing mounts - https://phabricator.wikimedia.org/T99077#1285883 (yuvipanda) (Fixing this manually now - I'm going to clean out the var mount from graphite)
[20:35:38] yuvipanda: also, why were you running dpkg manually? :P
[20:35:50] valhallasw: testing tools-webservice :)
[20:36:03] ah, right
[20:36:09] valhallasw: actually, I just realized that those nodes probably don’t have the ::general rol
[20:36:10] e
[20:36:16] valhallasw: I think they have the ::special role
[20:36:28] let me check
[20:36:53] base, role::labs::instance, sudo::labs_project, role::labs::tools::compute, toollabs::node::compute::dedicated
[20:36:59] ah yes
[20:37:02] ::dedicated
[20:37:09] so that doesn't even include the bigger /tmp?!
[20:37:46] they should just extend the same base class
[20:37:59] and no swap as well?
:/
[20:38:05] okay, well, then I'll just re-enable puppet :P
[20:38:26] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1201 is OK Less than 1.00% above the threshold [0.0]
[20:38:38] valhallasw: heh :) another reason to not have special exec nodes...
[20:38:47] yuvipanda: eh?
[20:38:52] anyway
[20:38:58] a discussion for another time
[20:39:03] yuvipanda: that's such a straw man I'm not even going to respond to it :P
[20:39:04] so that patch is no longer needed, I guess
[20:39:21] valhallasw: well, 'increased administrative work' is the reason
[20:39:25] well, no, I think we should have the patch, but the dedicated/general nodes should share more code
[20:39:30] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1209 is OK Less than 1.00% above the threshold [0.0]
[20:39:44] sure, that's just work I don't think we should be doing.
[20:39:51] anyway, let's not even go there now
[20:39:58] yuvipanda: see Coren's reply. These nodes are sensible.
[20:40:35] I dunno, I think the application code should change to not require these. '200 jobs that must run on the same instance' is a fairly uncommon use case
[20:41:13] and something that should run outside of labs (and JohnFLewis is already working on moving WMT to its own labs instance for other reasons)
[20:41:14] err
[20:41:16] outside of toollabs
[20:41:20] i tried, doesn't work
[20:41:39] gifti: which bit doesn't work?
[20:41:43] yuvipanda: slowly but surely :p
[20:41:45] * yuvipanda is interested in finding an alternative solution
[20:41:47] JohnFLewis: +1
[20:42:13] gifti: what exactly do those 200 jobs do?
[20:42:37] check urls
[20:42:49] gifti: so why not use: 1. threads, or 2. multiprocessing?
[20:43:07] is it just a case of not-enough-time?
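(Editorial sketch, not part of the channel log.) The 20:42:49 suggestion of replacing many grid jobs with threads in a single process might look roughly like the following. This is only an illustration under stated assumptions: gifti's actual code is Tcl and is not shown in this log, and the names `http_status`, `check_urls`, and the `workers` parameter are hypothetical, made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.error
import urllib.request


def http_status(url, timeout=10.0):
    """Return the HTTP status code for url, or 0 if the request fails outright."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404.
        return error.code
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return 0


def check_urls(urls, check=http_status, workers=20):
    """Run check on every URL using one pool of threads in one process."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(check, urls)))
```

Calling `check_urls(list_of_urls)` returns a dict from URL to status code. The `check` argument is injectable so the pool logic can be exercised without network access. Whether this fits the real workload is an open question; later in the log gifti reports that threading in the existing code was "catastrophic".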
[20:43:36] well, the bit i mean is the not using 200 jobs, not the multiple instances
[20:43:46] yes
[20:43:50] gifti: ah, hmm
[20:44:15] gifti: so let's say someone offers to spend time with you on your codebase to convert them into threads / multiprocessing / separate project, would you be ok with that?
[20:44:18] Tool-Labs, Patch-For-Review: Re-enable puppet on tools-exec-{cyberbot,catscan,gift,wmt} - https://phabricator.wikimedia.org/T99069#1285910 (valhallasw) @YuviPanda noted these nodes are actually all toollabs::node::compute::dedicated (as opposed to toollabs::node::compute::general, where a change was appli...
[20:44:44] gifti: is this code in TCL? :)
[20:45:32] i already tried threading, it's catastrophic
[20:45:36] yes
[20:45:41] yuvipanda: is it so hard to assume there might be a reason things are the way they are? :P
[20:45:55] there really isn't much of an issue with dedicated exec hosts...
[20:46:04] valhallasw: is it so hard to assume there might be a better way to do things? :)
[20:46:13] Tool-Labs, Hackathon-Lyon-2015: Tool-labs meeting agenda for Lyon Hackathon - https://phabricator.wikimedia.org/T98912#1285912 (valhallasw)
[20:46:36] i also tried the extra project, too much work/learning for me alone
[20:46:52] gifti: alright, so how about I volunteer my time to work with you to find a better way?
[20:47:14] gifti: yeah, that's totally fair. can you point me to the code for these jobs?
[20:47:19] i'm open to that
[20:47:21] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1210 is OK Less than 1.00% above the threshold [0.0]
[20:48:55] just a moment while i boot my laptop
[20:50:57] the project would need a grid engine and redis
[20:51:11] gifti: sweet, thanks. I'm going to open a bug so we can use that to coordinate
[20:52:20] yuvipanda: the code is at ~tools.giftbot/dwl*.{tcl,sh}
[20:52:23] RECOVERY - Puppet staleness on tools-exec-gift is OK Less than 1.00% above the threshold [3600.0]
[20:52:59] gifti: ok!
[20:58:17] Tool-Labs, Patch-For-Review: Re-enable puppet on tools-exec-{cyberbot,catscan,gift,wmt} - https://phabricator.wikimedia.org/T99069#1285932 (valhallasw) Open>Resolved a:valhallasw And puppet is enabled again on all hosts now!
[21:01:08] Tool-Labs: Investigate alternatives to dedicated exec node for gifti's tools - https://phabricator.wikimedia.org/T99130#1285955 (yuvipanda) NEW
[21:01:16] gifti: ^ just filed this bug
[21:01:46] Tool-Labs: deduplicate compute::general and compute::dedicated roles - https://phabricator.wikimedia.org/T99131#1285962 (valhallasw) NEW
[21:03:03] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Gorlingor was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=158844 edit summary:
[21:03:38] yuvipanda: ok, tools-exec-* staleness should be fixed now
[21:04:44] RECOVERY - Puppet staleness on tools-exec-wmt is OK Less than 1.00% above the threshold [3600.0]
[21:05:26] RECOVERY - Puppet staleness on tools-exec-cyberbot is OK Less than 1.00% above the threshold [3600.0]
[21:06:22] Labs, hardware-requests, operations: New server for labs dns recursor - https://phabricator.wikimedia.org/T99133#1285990 (Andrew) NEW a:RobH
[21:06:24] Tool-Labs: deduplicate compute::general and compute::dedicated roles - https://phabricator.wikimedia.org/T99131#1286002 (yuvipanda) ideally we'll just have a compute role, and then a dedicated_node role - and we'll compose dedicated nodes by applying both, rather than inheritance.
[21:06:32] valhallasw: \o/ thanks
[21:06:50] yuvipanda: ^ can't do that, duplicate system::role
[21:06:55] yuvipanda: because UGH PUPPET
[21:07:03] hahaha
[21:07:03] wow
[21:07:07] that's, man.
[21:07:10] wait, really?
[21:07:36] valhallasw: nope, it's a define. so should work
[21:07:46] ?
[21:08:03] system::role { 'A': } does not clash with system::role{ 'B': }
[21:08:08] RECOVERY - Puppet staleness on tools-exec-catscan is OK Less than 1.00% above the threshold [3600.0]
[21:08:18] eeeeeh
[21:08:21] oh, I see.
[21:08:29] yeah, that's classes vs defines.
[21:08:51] so in the system::role define definition (hah!) you'll find
[21:08:51] motd::script { "role-${title}":
[21:08:55] and other uses of ${title}
[21:09:00] so that nothing conflicts.
[21:09:08] and then what do we get in the motd? :P
[21:09:16] valhallasw: all of them
[21:09:28] yeah, that's not what we want :P
[21:09:29] example in labmon1001 is
[21:09:33] labmon1001 is a real-time metrics processor (role::graphite)
[21:09:33] labmon1001 is a statsite server (role::statsite)
[21:09:33] oh, it's just that note
[21:09:36] not the warning note
[21:09:39] yes
[21:09:48] yeah, that's fine
[21:10:05] Tool-Labs: Investigate alternatives to dedicated exec node for gifti's tools - https://phabricator.wikimedia.org/T99130#1286031 (yuvipanda) Code is at ~tools.giftbot/dwl*.{tcl,sh}
[21:10:11] valhallasw: yup! prod roles do that all the time
[21:11:11] !log tools forced rescheduling of (non-cont) welcome.py job (iluvatarbot, jobid 8869)
[21:11:20] Logged the message, Master
[21:12:00] The job 8869 is running in queue task@tools-exec-07.eqiad.wmflabs where jobs are not rerunable.
[21:12:01] ...
[21:12:29] valhallasw: yup, can't do that.
[21:12:36] valhallasw: that's why they're still running..
[21:12:47] valhallasw: so what I do is to check if it's got a .bigbrotherrc file, or if it's run via cron
[21:12:48] yuvipanda: can't I force that?
:P
[21:12:55] valhallasw: don't think so
[21:13:05] valhallasw: anyway, if they're from bigbrother or cron, I just kill them
[21:13:06] I know the script, I just don't want to figure out the right jsub invocation
[21:13:10] and that's how I got rid of 14, 15
[21:13:20] valhallasw: yeah, history | less might help
[21:13:46] valhallasw: I also straced some - most were just doing nothing, waiting for something forever
[21:13:50] yuvipanda: success!
[21:13:55] magic -f
[21:13:56] ;D
[21:14:01] valhallasw: oh, doh
[21:14:03] fine :P
[21:14:17] valhallasw: it can be dangerous tho
[21:14:38] imagine it's a bot that should be run only once and it got stuck and the user forgot and now bam it's making edits again, eating tiny babies and kittens... :P
[21:14:41] (or not)
[21:14:46] yuvipanda: it was welcome.py :P
[21:14:48] yuvipanda: I checked
[21:14:56] :P hence the (or not)
[21:15:10] yeah, I was too lazy to actually dive in and check and was hoping someone else would :)
[21:18:02] JohnFLewis: btw, are you going to be at the Lyon hackathon / Wikimania?
[21:18:14] yuvipanda: Nope
[21:18:21] aww shucks
[21:18:23] oh well :)
[21:18:25] Wanted to but work
[21:19:49] yuvipanda: http://shinken.wmflabs.org/dashboard so pretty <3
[21:19:57] valhallasw: :D for you!
[21:20:06] valhallasw: mine is littered with alerts from analytics
[21:20:10] that they never intend to fix
[21:20:14] :P
[21:20:14] * yuvipanda should fix his
[21:50:15] yuvipanda: amir will restart dexbot, and I mailed shubinator about runDYKNomStatsBot
[21:51:21] yuvipanda: shouldn't shinken-wm also announce downtimes?
[21:53:20] !log tools shut down & removed "tools-exec-08.eqiad.wmflabs" from execution host list
[21:53:25] Logged the message, Master
[22:00:19] Tool-Labs: Jobs Disappearing from SGE - https://phabricator.wikimedia.org/T99027#1286221 (valhallasw) So, to clarify, this happened 15 apr 2015 around 22:14 UTC? Could be related to {T95555}, then but @Coren knows better about the reboots for that.
Unfortunately, tools-exec-15 is no more, so it's also hard t...
[22:00:29] oh, no, those logs should be on NFS
[22:15:25] Labs, Shinken: shinken has many warnings (?) about "UNKNOWN: execution of the check script exited with exception list index out of range" - https://phabricator.wikimedia.org/T95161#1286267 (scfc) @valhallasw fixed the warnings for the #Tool-Labs project in T99072; it mostly involved restarting `diamond` so...
[22:22:25] (CR) Ricordisamoa: "Added to the misc list:" (1 comment) [labs/tools/extreg-wos] - https://gerrit.wikimedia.org/r/209697 (owner: Ricordisamoa)
[22:22:35] Time for bed
[22:27:49] (PS1) Ricordisamoa: Remove duplicate title and unused `s_text` argument [labs/tools/extreg-wos] - https://gerrit.wikimedia.org/r/211055
[22:29:33] (CR) Ricordisamoa: Don't die on different bug titles and other fixes (1 comment) [labs/tools/extreg-wos] - https://gerrit.wikimedia.org/r/211010 (owner: Legoktm)
[22:30:19] (CR) Ricordisamoa: Automatically generate toolinfo.json (1 comment) [labs/tools/extreg-wos] - https://gerrit.wikimedia.org/r/209697 (owner: Ricordisamoa)
[23:56:10] yuvipanda, https://github.com/halfak/Objective-Revision-Evaluation-Service