[00:54:25] <shinken-wm>	 RECOVERY - Puppet failure on tools-webgrid-08 is OK: OK: Less than 1.00% above the threshold [0.0]  
[01:31:03] <wikibugs>	 10Tool-Labs: Resetup tools-webgrid-04 due to /var being too small - https://phabricator.wikimedia.org/T95537#1196550 (10scfc)
[01:50:32] <wikibugs>	 10Tool-Labs: Resetup tools-webgrid-04 due to /var being too small - https://phabricator.wikimedia.org/T95537#1196580 (10scfc) Added the host with `qconf -mhgrp \@webgrid`, restarted `execd` with `service gridengine-exec restart`, checked with `diff -u <(qconf -sq webgrid-lighttpd@tools-webgrid-07.eqiad.wmflabs)...
[02:00:26] <wikibugs>	 10Tool-Labs: Resetup tools-webgrid-04 due to /var being too small - https://phabricator.wikimedia.org/T95537#1196594 (10scfc) `diff -u <(qconf -se tools-webgrid-01.eqiad.wmflabs) <(qconf -se tools-webgrid-08.eqiad.wmflabs)` shows the difference between the resources.
[02:09:19] <wikibugs>	 10Tool-Labs: Resetup tools-webgrid-04 due to /var being too small - https://phabricator.wikimedia.org/T95537#1196612 (10scfc) And that was the Gordian knot:  ``` tools.typoscan@tools-bastion-01:~$ qstat -xml <?xml version='1.0'?> <job_info  xmlns:xsd="http://gridengine.sunsource.net/source/browse/*checkout*/grid...
[02:15:31] <wikibugs>	 10Tool-Labs: Resetup tools-webgrid-04 due to /var being too small - https://phabricator.wikimedia.org/T95537#1196614 (10yuvipanda) Why is it called 8 and not 4? This mixes up trusty and precise progression and causes confusion...
[02:19:12] <wikibugs>	 10Tool-Labs: Resetup tools-webgrid-04 due to /var being too small - https://phabricator.wikimedia.org/T95537#1196615 (10scfc) Removed `tools-webgrid-04` as submit host by `qconf -ds tools-webgrid-04.eqiad.wmflabs`, from queue by `qconf -mhgrp \@webgrid` and as exec host by `qconf -de tools-webgrid-04.eqiad.wmfla...
[02:20:03] <wikibugs>	 10Tool-Labs: Resetup tools-webgrid-04 due to /var being too small - https://phabricator.wikimedia.org/T95537#1196617 (10scfc) 5Open>3Resolved
[02:25:03] <yurik>	 hi, how do i access betalabs db directly?
[02:25:19] <yurik>	 which server is actually used for enwiki?
[02:25:59] <wikibugs>	 10Tool-Labs: Resetup tools-webgrid-04 due to /var being too small - https://phabricator.wikimedia.org/T95537#1196627 (10scfc) >>! In T95537#1196614, @yuvipanda wrote: > Why is it called 8 and not 4? This mixes up trusty and precise progression > and causes confusion...  Because I didn't want to have to deal with...
[02:26:37] <yurik>	 MaxSem, do you know this ^
[02:27:04] <wikibugs>	 10Tool-Labs: Resetup tools-webgrid-04 due to /var being too small - https://phabricator.wikimedia.org/T95537#1196634 (10yuvipanda) Fair enough. We should start phasing precise out soon anyway.
[02:27:05] <yurik>	 MaxSem, i meant my q ^^^
[02:27:08] <MaxSem>	 see db-labs.php
[02:36:59] <yurik>	 MaxSem, what about access? mwscript doesn't seem to exist there, and "mysql" fails
[02:39:29] <MaxSem>	 sql?
[02:41:48] <Negative24>	 YuviPanda:  how hard would it be to get a public ip?
[02:42:23] <YuviPanda>	 Negative24: depends on what you need it for :)
[02:42:41] <Negative24>	 suppose I had a phab instance :)
[02:43:06] <Negative24>	 that needed outside cloning access which can't go through the project proxy
[02:43:11] <YuviPanda>	 Aaah
[02:43:29] <YuviPanda>	 Ssh or git protocol cloning I presume 
[02:43:37] <Negative24>	 yep
[02:43:47] <Negative24>	 It would only be temp
[02:43:54] <YuviPanda>	 Should be easy. File a bug and I'll do it when I'm at a computer?
[02:43:59] <YuviPanda>	 (In 30mins)
[02:45:30] <wikibugs>	 6Labs, 6Phabricator: Allocate one public IP to the Phabricator project - https://phabricator.wikimedia.org/T95643#1196643 (10Negative24) 3NEW a:3yuvipanda
[02:45:39] <Negative24>	 YuviPanda: ^
[02:46:07] <Negative24>	 YuviPanda: I probably won't be on in 30 mins
[02:46:53] <YuviPanda>	 Cool
[02:48:28] <yurik>	 MaxSem, sql is not working on either deployment-db1 or deployment-db2
[02:49:59] <MaxSem>	 no idea then
[03:39:17] <wikibugs>	 10Tool-Labs: Labs multilingual tile server lacks localized labels - https://phabricator.wikimedia.org/T95644#1196705 (10mxn) 3NEW
[03:40:43] <wikibugs>	 10Tool-Labs: Set up a tileserver for OSM in Labs - https://phabricator.wikimedia.org/T62819#1196715 (10mxn) >>! In T62819#1193389, @Nemo_bis wrote: >>>! In T62819#988323, @mxn wrote: >> The old Toolserver tile server worked for all the Wikimedia languages, not just English, German, and Russian. Are there plans t...
[03:44:35] <wikibugs>	 10Tool-Labs, 10OpenStreetMap: Labs multilingual tile server lacks localized labels - https://phabricator.wikimedia.org/T95644#1196723 (10yuvipanda)
[03:51:39] <wikibugs>	 6Labs, 3Labs-Q4-Sprint-1, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Identify possibly problematic file ownership on the NFS filesystems - https://phabricator.wikimedia.org/T95554#1196735 (10coren) project deployment-prep:     uid 48         /home/anomie/.mweval_history         /data/project/hhvm-cores/     ui...
[03:51:50] * Coren goes to bed.
[05:19:58] <YuviPanda>	 !log tools delete the tomcat node finally :D
[05:20:03] <labs-morebots>	 Logged the message, Master
[05:22:07] <wikibugs>	 6Labs, 3Labs-Q4-Sprint-1, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Identify possibly problematic file ownership on the NFS filesystems - https://phabricator.wikimedia.org/T95554#1196807 (10yuvipanda) Is that all? if so w00t :D
[05:49:22] <shinken-wm>	 PROBLEM - Host tools-webgrid-tomcat is DOWN: CRITICAL - Host Unreachable (10.68.16.29)  
[05:49:50] <YuviPanda>	 yay
[05:51:44] <shinken-wm>	 PROBLEM - Puppet failure on tools-services-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]  
[05:54:32] <wikibugs>	 6Labs, 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1196815 (10yuvipanda) The node is gone, and the queue is gone too :)
[06:01:47] <shinken-wm>	 RECOVERY - Puppet failure on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]  
[06:20:05] <wikibugs>	 6Labs, 10Tool-Labs: Planned labs maintenance on tools-db: Puppetization + log file change - https://phabricator.wikimedia.org/T94643#1196835 (10yuvipanda) I just spoke to @springle on IRC and he said coming Thursday (16th April) 0200 UTC is good for him.   @Coren can you mail out a notice?
[06:32:29] <wikibugs>	 6Labs, 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1196842 (10yuvipanda) Hmm, so ```yuvipanda@tools-bastion-01:~$ qconf -de tools-webgrid-tomcat.eqiad.wmflabs Host object "tools-webgrid-tomcat...
[06:43:09] <shinken-wm>	 PROBLEM - Puppet failure on tools-submit is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]  
[06:47:27] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-cyberbot is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]  
[06:51:12] <wikibugs>	 6Labs, 10Tool-Labs, 3ToolLabs-Goals-Q4, 7Tracking: Make sure that toollabs can function fully even with one virt* host fully down - https://phabricator.wikimedia.org/T90542#1196872 (10yuvipanda)
[06:51:14] <wikibugs>	 6Labs, 10Tool-Labs, 3ToolLabs-Goals-Q4: Have bigbrother run on multiple nodes to provide redundancy against tools-submit failure - https://phabricator.wikimedia.org/T91237#1196866 (10yuvipanda) 5Open>3declined a:3yuvipanda Replaced by T95521
[06:51:16] <wikibugs>	 10Tool-Labs, 7Tracking: make bigbrother or its replacement reliable - https://phabricator.wikimedia.org/T91414#1196871 (10yuvipanda)
[06:59:21] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-07 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]  
[07:01:18] <wikibugs>	 6Labs, 10Tool-Labs: Delete 'commonsarchivebot' from toollabs - https://phabricator.wikimedia.org/T89807#1196874 (10Fastily) >>! In T89807#1193619, @scfc wrote: > Could you please go through and archive/delete the remaining files of the tool?  Thanks!  Done!
[07:10:32] <PissedPanda>	 !log tools take out tools-services-01 to test switchover and also to recreate as small
[07:10:35] <labs-morebots>	 Logged the message, Master
[07:12:24] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-cyberbot is OK: OK: Less than 1.00% above the threshold [0.0]  
[07:13:11] <shinken-wm>	 RECOVERY - Puppet failure on tools-submit is OK: OK: Less than 1.00% above the threshold [0.0]  
[07:16:56] <wikibugs>	 6Labs, 10Tool-Labs, 3Labs-Q4-Sprint-2, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Make service manifest monitors redundant / hotswappable - https://phabricator.wikimedia.org/T95521#1196898 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Hotswap test was good since it was bitten by the fact that manual steps...
[07:16:58] <wikibugs>	 6Labs, 10Tool-Labs, 3Labs-Q4-Sprint-2, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Review and productionize webservice manifest monitor - https://phabricator.wikimedia.org/T95210#1196901 (10yuvipanda)
[07:21:26] <wikibugs>	 6Labs, 10Tool-Labs, 3Labs-Q4-Sprint-2, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Make service manifest monitors redundant / hotswappable - https://phabricator.wikimedia.org/T95521#1196915 (10yuvipanda) (I added https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin#Services)
[07:21:50] <wikibugs>	 6Labs, 10Tool-Labs, 3Labs-Q4-Sprint-2, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Review and productionize webservice manifest monitor - https://phabricator.wikimedia.org/T95210#1196916 (10yuvipanda) Ok, so this just needs an email to labs-l and then we're all good!
[07:24:28] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-07 is OK: OK: Less than 1.00% above the threshold [0.0]  
[07:50:55] <shinken-wm>	 PROBLEM - Puppet failure on tools-services-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]  
[08:07:06] <wikibugs>	 6Labs, 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1196954 (10scfc) `qhost -j` (NB: `qhost`, no hostname) shows:  ``` […] tools-webgrid-tomcat.eqiad.wmflabs lx26-amd64      8     -   15.7G...
[08:07:42] <wikibugs>	 6Labs, 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1196957 (10scfc) Rescheduled the job, the host job list is now empty.
[08:08:04] <wikibugs>	 6Labs, 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1196958 (10scfc) ``` scfc@tools-bastion-01:~$ qconf -de tools-webgrid-tomcat.eqiad.wmflabs scfc@tools-bastion-01.eqiad.wmflabs removed "tools...
[08:08:37] <wikibugs>	 6Labs, 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1196962 (10scfc) ``` scfc@tools-bastion-01:~$ qconf -ds tools-webgrid-tomcat.eqiad.wmflabs scfc@tools-bastion-01.eqiad.wmflabs removed "tools...
[08:10:56] <shinken-wm>	 RECOVERY - Puppet failure on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]  
[08:11:22] <wikibugs>	 6Labs, 10Tool-Labs, 3Labs-Q4-Sprint-2, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Review and productionize webservice manifest monitor - https://phabricator.wikimedia.org/T95210#1196965 (10scfc) labs-announce :-).
[08:39:25] <wikibugs>	 10Tool-Labs: Unattended upgrades are failing from time to time - https://phabricator.wikimedia.org/T92491#1197009 (10scfc) ``` From: root@tools.wmflabs.org (Cron Daemon) Subject: Cron <root@tools-webgrid-05> test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) To: root@tools.wmflabs.org Da...
[08:39:51] <wikibugs>	 10Tool-Labs: Unattended upgrades are failing from time to time - https://phabricator.wikimedia.org/T92491#1197011 (10scfc) ``` From: root@tools.wmflabs.org (Cron Daemon) Subject: Cron <root@tools-services-01> test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) To: root@tools.wmflabs.org D...
[10:00:39] <wikibugs>	 6Labs, 10Tool-Labs, 3ToolLabs-Goals-Q4, 7Tracking: Make sure that toollabs can function fully even with one virt* host fully down - https://phabricator.wikimedia.org/T90542#1197218 (10scfc)
[10:00:41] <wikibugs>	 6Labs, 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1197215 (10scfc) 5Open>3Resolved a:3scfc Did `rm -f /data/project/.system/store/*-tools-webgrid-tomcat.eqiad.wmflabs`.  Anything else t...
[10:01:29] <wikibugs>	 6Labs, 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1197219 (10scfc) a:5scfc>3yuvipanda
[10:04:48] <wikibugs>	 6Labs, 10Tool-Labs: Delete 'commonsarchivebot' from toollabs - https://phabricator.wikimedia.org/T89807#1197227 (10scfc) 5Open>3Resolved Thanks!  I deleted the tool.
[12:31:49] <wikibugs>	 6Labs, 3Labs-Q4-Sprint-1, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Identify possibly problematic file ownership on the NFS filesystems - https://phabricator.wikimedia.org/T95554#1197484 (10coren) It's all, execpt for the 9.4 million files owned by www-data which - while not puppetized or in ldap - is stable...
[12:55:55] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-21 is OK: OK: Less than 1.00% above the threshold [0.0]  
[13:00:01] <shinken-wm>	 RECOVERY - Puppet failure on tools-exec-wmt is OK: OK: Less than 1.00% above the threshold [0.0]  
[13:00:55] <shinken-wm>	 RECOVERY - Puppet failure on tools-webgrid-03 is OK: OK: Less than 1.00% above the threshold [0.0]  
[13:16:36] <wikibugs>	 6Labs, 10Tool-Labs: Planned labs maintenance on tools-db: Puppetization + log file change - https://phabricator.wikimedia.org/T94643#1197683 (10coren) Notice mail't
[13:36:18] <wikibugs>	 10Tool-Labs: Admin www depends on short_open_tag = On - https://phabricator.wikimedia.org/T95688#1197730 (10scfc) 3NEW
[14:04:06] <wikibugs>	 10Tool-Labs: Admin www depends on short_open_tag = On - https://phabricator.wikimedia.org/T95688#1197804 (10scfc) (And there's a generated file (`htmlpurifier/library/HTMLPurifier.standalone.php`) that is required, but `.gitignored`.  Ooops.  I could have deleted that accidentally.)
[14:35:31] <wikibugs>	 10Tool-Labs, 7Tracking: Toolserver migration to Tools (tracking) - https://phabricator.wikimedia.org/T60788#1197960 (10Dereckson)
[15:10:15] <wikibugs>	 6Labs, 6operations, 10ops-eqiad: labvirt1004 has a failed disk  1789-Slot 0 Drive Array Disk Drive(s) Not Responding      Check cables or replace the following drive(s):          Port 1I: Box 1: Bay 1 - https://phabricator.wikimedia.org/T95622#1198079 (10Cmjohnson) A case with HP has been opened because that...
[15:19:20] <wikibugs>	 6Labs, 10hardware-requests, 6operations: Replace virt1000 with a newer warrantied server - https://phabricator.wikimedia.org/T90626#1198102 (10Cmjohnson) a:5Cmjohnson>3RobH Rob,  These are 3.5" disk bays. I have 1TB disks on-site and will swap them (just give me the +1).  CJ
[15:30:17] <Coren>	 andrewbogott: I just sent you an yuvi a draft of the email I am going to send to project managers for https://phabricator.wikimedia.org/T95554
[15:30:26] <Coren>	 Can you read it and critique clarity, etc?
[15:30:30] <andrewbogott>	 sure
[15:36:48] <andrewbogott>	 Coren: it looks good to me.  You could add a concrete example in a footnote so make clearer the issue with different ids on different boxes… but that would make the email longer which is maybe not worth it.
[15:36:59] <andrewbogott>	 How many cases of that problem are there?  Lots?
[15:37:25] <Coren>	 Not tons, actually.
[15:37:34] <Coren>	 https://phabricator.wikimedia.org/T95554 has a list
[15:37:54] <Coren>	 I was pleasantly surprised.
[15:39:31] <Coren>	 wikidata-query is the only concerning one as they apparently pointed a mysql server's data dir at NFS (which has bigger problems than the uid)
[15:40:18] <andrewbogott>	 Ah, that’s not so painful
[15:42:06] <Coren>	 And it looks like some projects try to get central logging by pointing the syslogs at NFS which - while it works - is about the most painful way of doing it.  :-)
[15:46:11] <andrewbogott>	 I am… surprised that works at all
[15:46:35] <andrewbogott>	 I guess syslog must already be race-proof since lots of different services write to it
[17:32:46] <Coren>	 andrewbogott_afk: syslog opens files O_APPEND, so that tends to work out.
[17:33:26] <Coren>	 (NFS specifically preserve posix file semantics including the append behaviour)
[17:35:21] <Coren>	 In fact, much of the complexities of NFS are caused by its strict adherence to the really-didn't-consider-remote-filesystems-at-the-time posix file semantics.  :-)
[17:37:06] <wikibugs>	 6Labs, 10Tool-Labs: Planned labs maintenance on tools-db: Puppetization + log file change - https://phabricator.wikimedia.org/T94643#1198559 (10coren)
[17:55:57] <MaxSem>	 it used to be possible to create VMs with a large partition but small CPU/RAM requirements. now I'm forced to allocate 16 gigs of RAM I don't need :(
[17:59:11] <PissedPanda>	 are you sure? I don’t remember that being possible at all
[18:04:28] <MaxSem>	 yeah, we had more images some time ago
[18:04:37] * MaxSem blames trusty
[18:06:03] <YuviPanda>	 sizes are determined by ‘flavors’ not images, no?
[18:06:15] <Coren>	 MaxSem: That would have been quite some time ago, because I'm quite certain I remember being annoyed at the lack of big-disk-small-ram flavours.
[18:06:32] <Coren>	 [when I started working on labs]
[18:07:52] <MaxSem>	 shall I open a bug?
[18:08:33] <Coren>	 Might be worthwhile; I'm sure there is a need for at least some flavours with different balances.
[18:10:57] <YuviPanda>	 Coren: andrewbogott_afk I’m going to go to do some more visa work now (grumble, grumble). I’ll come back and start filling in https://etherpad.wikimedia.org/p/labs-report-q4-1 (last 2 week report) based on email, phabricator and gerrit
[18:11:18] <YuviPanda>	 and then there’s a calendar item have all of us just go over it once and then mail it out
[18:11:34] <Coren>	 Otay.  Have "fun" with the visa crap.
[18:12:00] <YuviPanda>	 Coren: yeah. I just got my SSN, and so will have to deal with another whole bunch of paperwork now...
[18:12:12] <YuviPanda>	 including opening a bank account, which seems to require that I have a stable home address
[18:12:30] <YuviPanda>	 anyway, ranting for later! can’t take phone to appointment, apparently.
[18:12:38] <Coren>	 o/
[18:14:38] <wikibugs>	 6Labs: Consider creating big storage/low RAM images - https://phabricator.wikimedia.org/T95731#1198896 (10MaxSem)
[18:15:44] <MaxSem>	 YuviPanda, I got myself a bank account while I was living at Dan's, so not really required
[18:15:54] <YuviPanda>	 MaxSem: did you need a utility bill?
[18:15:57] <MaxSem>	 you just need some mailing address
[18:15:59] <YuviPanda>	 let’s move this to another channel :)
[18:32:26] <wikibugs>	 6Labs, 3Labs-Q4-Sprint-1, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Labs NFSv4/idmapd mess - https://phabricator.wikimedia.org/T87870#1199056 (10coren)
[18:32:28] <wikibugs>	 6Labs, 3Labs-Q4-Sprint-1, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Identify possibly problematic file ownership on the NFS filesystems - https://phabricator.wikimedia.org/T95554#1199054 (10coren) 5Open>3Resolved The list is made, and labs-announce notified; since none of those will break for now, there i...
[18:33:15] <wikibugs>	 6Labs, 3Labs-Q4-Sprint-1, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Identify possibly problematic file ownership on the NFS filesystems - https://phabricator.wikimedia.org/T95554#1199057 (10coren)
[18:51:54] <wikibugs>	 10Tool-Labs: Investigate reducing scheduling interval for Grid Engine - https://phabricator.wikimedia.org/T95485#1199161 (10coren) a:5coren>3None
[19:09:26] <wikibugs>	 6Labs, 3Labs-Q4-Sprint-1, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Remove dependencies on LDAP from labstore100[12] - https://phabricator.wikimedia.org/T95558#1199239 (10coren) modules/openstack/files/replica-addusers.pl -> enumerates users via getent modules/ldap/files/scripts/manage-nfs-volumes-daemon -> u...
[19:40:00] <wikibugs>	 6Labs: Get Labs openstack service dbs on a proper db server - https://phabricator.wikimedia.org/T92693#1199361 (10chasemp) Ask for @springle: can we chat about nodepool in labs and how to allocate DB resources?  I'll ping you on irc
[19:45:10] <wikibugs>	 10Tool-Labs: Puppetize gridengine master configuration - https://phabricator.wikimedia.org/T95747#1199397 (10yuvipanda) 3NEW
[19:45:26] <wikibugs>	 10Tool-Labs: Investigate reducing scheduling interval for Grid Engine - https://phabricator.wikimedia.org/T95485#1199406 (10yuvipanda) 5Open>3Resolved a:3yuvipanda T95747 should track puppetizing it.
[19:45:48] <wikibugs>	 10Tool-Labs, 3Labs-Q4-Sprint-2: Investigate reducing scheduling interval for Grid Engine - https://phabricator.wikimedia.org/T95485#1199413 (10yuvipanda)
[19:53:51] <wikibugs>	 6Labs, 6operations, 10ops-eqiad: labvirt1004 has a failed disk  1789-Slot 0 Drive Array Disk Drive(s) Not Responding      Check cables or replace the following drive(s):          Port 1I: Box 1: Bay 1 - https://phabricator.wikimedia.org/T95622#1199472 (10Cmjohnson) hank you for contacting HP e-Solutions.  Wi...
[20:12:15] <Krenair>	 Coren, probably should have linked https://phabricator.wikimedia.org/T95554 explicitly?
[20:12:31] <wikibugs>	 10Tool-Labs, 7Tracking: Toolserver migration to Tools (tracking) - https://phabricator.wikimedia.org/T60788#1199565 (10valhallasw)
[20:19:21] <wikibugs>	 6Labs, 10Tool-Labs, 7Tracking: Make toollabs reliable enough (Tracking) - https://phabricator.wikimedia.org/T90534#1199584 (10valhallasw)
[20:19:23] <wikibugs>	 6Labs, 10Tool-Labs: Provide 'Support request' tool labs project - https://phabricator.wikimedia.org/T94359#1199581 (10valhallasw) 5Open>3Invalid a:3valhallasw I think that's a fair solution, and I'd like to add a big thank you for essentially clearing out the Triage column!
[20:35:34] <Coren>	 Krenair: Eugh.  Definitely should have.
[21:05:19] <Nikerabbit>	 Coren: definitely, wasted a good minutes looking for the list
[21:05:21] <shinken-wm>	 PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[21:05:23] <YuviPanda>	 Coren: ^ ugh
[21:05:24] <Coren>	 YuviPanda: Odd.  Not NFS, there's a bit of traffic but nothing that bad.
[21:05:24] <Coren>	 Hmmm...  wait.
[21:05:25] <Coren>	 labnet1001 network just flatlined.
[21:05:26] <Coren>	 Ooooh.  Same thing that happened a few days ago with someone doing a humongous write to NFS
[21:05:26] <Coren>	 There's a huge flush process going on atm.  Several gigs of stuff.
[21:05:26] <YuviPanda>	 ugh
[21:05:27] <Coren>	 Huge inbound spike and now the drives are trying to catch up.  Another huge gunzip/gzip?
[21:05:27] <Coren>	 iowait isn't out of bound; this should self-correct shortly.
[21:05:27] * Coren monitors closely.
[21:05:27] <Coren>	 Ah, here's a spike.
[21:05:28] * Coren makes sure shelf 5 isn't ill.
[21:05:29] <Coren>	 Oh blah; I think we might have briefly lost it - it's in resync.
[21:05:29] <Coren>	 iowait seems to be coming back down.
[21:05:50] <shinken-wm>	 RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 763110 bytes in 4.775 second response time  
[21:06:05] <YuviPanda>	 hmm
[21:06:14] <Nikerabbit>	 (ignore my complain above, stupid gmail did NOT show that there are replies to the thread even though I responded to it)
[21:06:58] <Coren>	 Nevermind about shelf 5 - that's apparently the previous resync still at work - it's using really low bandwidth.
[21:07:09] <Coren>	 But something is still writing crazy amounts to NFS now.
[21:07:13] <YuviPanda>	 hmm
[21:07:24] * YuviPanda sshs in
[21:08:08] <YuviPanda>	 Coren: iftop seems quite empty
[21:08:09] <Coren>	 Look at https://phabricator.wikimedia.org/T95554
[21:08:52] <Coren>	 10.68.16.196 is our culprit
[21:08:59] <YuviPanda>	 ? that phab ticket?
[21:09:02] * Coren uses iptraf
[21:09:16] <Coren>	 copypaste fail.  http://grafana.wikimedia.org/#/dashboard/db/labs-monitoring
[21:09:21] <YuviPanda>	 db01?!
[21:09:57] <YuviPanda>	 I wonder if it’s someone who put a db to write to NFS
[21:10:22] <Coren>	 wikidata-query
[21:10:32] <Coren>	 Yep.  Someone *did* put a mysql server on NFS there.
[21:10:36] * Coren stops it.
[21:10:45] <YuviPanda>	 yes
[21:11:12] <Coren>	 smalyshev is logged in and working hard on it
[21:11:23] <YuviPanda>	 :)
[21:11:38] <YuviPanda>	 there’s a big ‘mv'
[21:11:49] <Coren>	 Yeah.  "bigdata-old.jnl"
[21:12:29] * Coren ponders what to do.
[21:13:01] * YuviPanda pokes SMalyshev
[21:13:08] <YuviPanda>	 Coren: we could just kill it...
[21:13:09] <Coren>	 10:1 this really should have either been local or on /data/scratch
[21:14:39] <SMalyshev>	 YuviPanda: yes, what's the issue there?
[21:14:40] <YuviPanda>	 you’re hammering NFS, basically.
[21:14:40] <YuviPanda>	 yeah, /data/scratch is for hammering
[21:14:40] <SMalyshev>	 hmm... let me see
[21:14:41] <Coren>	 Very hard, too - you're moving so much data around you are forcing cache flushes impacting everyone.
[21:15:02] <SMalyshev>	 sorry... didn't know NFS is so sensitive. 
[21:15:09] <SMalyshev>	 let me stop couple of things
[21:15:34] <SMalyshev>	 YuviPanda: is it better now?
[21:15:42] <Coren>	 It's not /that/ sensitive, but whatever you are doing is an order of magnitude above all of the other labs projects combined.  :-)
[21:16:31] <Coren>	 Traffic is settling down, but I expect it will take a couple of minutes for everything to recover completely.
[21:16:38] <SMalyshev>	 it's just supposed to be reading & writing couple of files. Admittedly, big files :)
[21:16:46] <SMalyshev>	 but that's why they are on nfs...
[21:17:21] <Coren>	 SMalyshev: We'll have a talk about doing something to better match what you are doing.  Likely, /data/scratch is better suited to what you are doing.
[21:17:42] <SMalyshev>	 that's what I was using mostly - data/scratch
[21:17:43] <YuviPanda>	 we could probably just put the old ciscos to use for the wdq people :)
[21:18:25] <Coren>	 SMalyshev: That mv specifically had /data/project as target.  :-)
[21:18:40] <SMalyshev>	 yeah I've killed that one
[21:18:53] <SMalyshev>	 probably was a bad idea doing it in parallel with other stuff, sorry
[21:19:07] <Coren>	 Things are recovering, and my dinner is geting cold on the table.  :-)
[21:19:23] <SMalyshev>	 would not do that again
[21:19:34] * YuviPanda wonders if we should file an incident report or not.
[21:19:44] <YuviPanda>	 we definitely need some form of network limiting at some point, I guess
[21:20:13] <YuviPanda>	 greg-g: do you think an incident report on these kind of things would be useful? it affected beta / tools.
[21:20:15] <Coren>	 YuviPanda: Hard problem.  We'll look into it in Lyon though.
[21:20:20] <jzerebecki>	 but for users doing things like this on a local disk, they would not impact each other like that. so AFAIK our NFS setup looses any fairness between mutliple users. => it is uber sensitive to anything that can actually utilize it. 
[21:20:33] <YuviPanda>	 yeah
[21:20:43] <SMalyshev>	 I'd be happy to *not* put any of it on NFS but the problem it's 50G+ db and if I need a backup copy I have no option
[21:20:52] <jzerebecki>	 YuviPanda: is it a network bottleneck or a disk io/seek/whatever bottleneck?
[21:20:57] <SMalyshev>	 since biggest disk I get is 100G
[21:21:09] <YuviPanda>	 jzerebecki: can you see https://grafana.wikimedia.org/#/dashboard/db/labs-monitoring
[21:21:20] <YuviPanda>	 jzerebecki: you should be able to get upto a 140G /srv
[21:21:30] <jzerebecki>	 SMalyshev: ^^
[21:21:45] <YuviPanda>	 with an xlarge
[21:21:46] <YuviPanda>	 instance
[21:21:51] <greg-g>	 YuviPanda: are there things you learned/should do better about next time?
[21:21:58] <SMalyshev>	 hmm... ok I'll check into it. 
[21:22:06] <YuviPanda>	 greg-g: hmm, not really. 
[21:22:21] <greg-g>	 YuviPanda: just email the relevant lists what happened
[21:22:31] <YuviPanda>	 it was a fairly straightforward see notification, look at graphs, hunt issue, notify person, fixed.
[21:22:36] <YuviPanda>	 alright
[21:22:39] <SMalyshev>	 YuviPanda: is that ok to use /data/scratch for now at the level it is now? I can try to move stuff to local disk but that means moving some double-digits Gs again...
[21:23:29] <YuviPanda>	 SMalyshev: Coren would know better, but I thik the answer is ‘yes’. also, maybe use ionice?
[21:23:42] <jzerebecki>	 YuviPanda: inbound traffic shaping might help a bit
[21:23:55] <YuviPanda>	 hmm, I don’t know if ionice helps with NFS
[21:24:01] <SMalyshev>	 YuviPanda: would that help with nfs?
[21:24:10] <SMalyshev>	 yeah asked the same thing :)
[21:25:06] <SMalyshev>	 I wouldn't mind my processing huge dumps take back seat to more urgent things people doing... it's several hrs anyway, if it takes extra hr no big deal
[21:25:12] <wikibugs>	 6Labs, 6operations: One instance hammering on NFS should not make it unavailable to everyone else - https://phabricator.wikimedia.org/T95766#1199793 (10yuvipanda) 3NEW
[21:25:25] <YuviPanda>	 jzerebecki: Coren ^ should discuss options there
[21:25:52] <YuviPanda>	 SMalyshev: I think the solution for things like your use case is ‘labs on real hardware’, but I guess we’re several months away from being able to do that...
[21:26:35] <SMalyshev>	 YuviPanda: yeah if we are going to run blazegraph with real load we'd probably need something stronger
[21:26:41] <YuviPanda>	 yup
[21:26:49] <YuviPanda>	 SMalyshev: you can ask for a ‘test’ hardware machine...
[21:26:57] <YuviPanda>	 we’ve spares, you just need to get mark to approve.
[21:27:06] <SMalyshev>	 we will, sometime in the near future :)
[21:27:28] <YuviPanda>	 SMalyshev: :) That’s the solution, I think :)
[21:27:40] <YuviPanda>	 bring the future closer!
[21:28:19] <SMalyshev>	 ok, until then I'll try to keep it lower profile and please tell me if it causes any trouble I'll stop doing whatever causes it
[21:29:11] <Negative24>	 YuviPanda: You at a computer for https://phabricator.wikimedia.org/T95643
[21:29:22] <YuviPanda>	 oooh, yes let me doi it
[21:29:52] <wikibugs>	 6Labs, 6Phabricator: Allocate one public IP to the Phabricator project - https://phabricator.wikimedia.org/T95643#1199819 (10yuvipanda) 5Open>3Resolved Done
[21:29:53] <YuviPanda>	 done
[21:30:50] <Negative24>	 do I have to logout to see it in Special:NovaAddress
[21:31:12] <YuviPanda>	 I always answer yes if people ask me if they have to logout / in on wikitech
[21:34:31] <YuviPanda>	 alright, emailed
[21:36:24] <Negative24>	 YuviPanda: good to go. thanks
[21:37:29] <jzerebecki>	 SMalyshev: btw. i think there is still a server from testing one of the other databases that is not yet decomissioned/reclaimed, retracting the task to decom it is probably less work
[21:40:14] <SMalyshev>	 jzerebecki: I think we still have einsteinium... it's doesn't have a lot of diskspace but may be ok for a while
[21:46:53] <Coren>	 YuviPanda: How odd.  The 5-minute spike from tools is gone.
[21:47:37] <Coren>	 YuviPanda: And it was there until the slight overload.  Perhaps it's not a cron job after all?
[22:00:40] <Negative24>	 YuviPanda: If I remove the previous proxy and then add the exact same hostname with the public ip, it should work, right?
[22:00:50] <Negative24>	 but its not
[22:06:25] <Krenair>	 Coren, in meta_p.wiki what is is_sensitive?
[22:06:49] <Coren>	 Krenair: Whether the wiki has case sensitive titles
[22:06:57] <Krenair>	 ah.
[22:06:58] <Coren>	 Krenair: There are only a couple.
[22:07:14] <Krenair>	 MariaDB [centralauth_p]> select count(*) from meta_p.wiki where is_sensitive = 1;
[22:07:14] <Krenair>	 +----------+
[22:07:14] <Krenair>	 | count(*) |
[22:07:14] <Krenair>	 +----------+
[22:07:14] <Krenair>	 |      173 |
[22:07:15] <Krenair>	 +----------+
[22:07:16] <Krenair>	 1 row in set (0.00 sec)
[22:07:19] <Krenair>	 :)
[22:07:34] <Krenair>	 Do we not set anything for private wikis?
[22:07:58] <bd808>	 I think wiktionaries are case sensitive
[22:08:11] <Coren>	 Right, wiktionaries and one or two language.
[22:08:30] <bd808>	 private wikis don't have centralauth do they?
[22:08:42] <bd808>	 or is that just fishbowls?
[22:09:15] <Coren>	 You know, I'm not sure.  I *think* none of them do.
[22:10:04] <Krenair>	 private wikis and fishbowls don't run centralauth, no
[22:10:53] * bd808 sees that in InitialiseSettings
[22:12:36] <Krenair>	 I just made a list to check against out of private.dblist in the end
[22:15:20] <sDrewth>	 YuviPanda or Coren or anyone else with access, could you please restart webservices for "ia-upload"  (Tpt hasn't been around)
[22:16:12] <YuviPanda>	 sDrewth: heya
[22:16:12] <YuviPanda>	 sure
[22:16:13] <Coren>	 sDrewth: {{done}}
[22:16:18] <YuviPanda>	 ah ;)
[22:24:58] <sDrewth>	 thanks muchly
[22:28:58] <wikibugs>	 6Labs, 10Wikimedia-Labs-Infrastructure, 10Continuous-Integration, 3Continuous-Integration-Isolation: OpenStack API account to control `contintcloud` labs project - https://phabricator.wikimedia.org/T86170#1199958 (10hashar) Adding @chasemp . We talked about nodepool user/credentials today.  The task descri...
[23:23:58] <wikibugs>	 6Labs, 10Tool-Labs, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Make service manifest monitors redundant / hotswappable - https://phabricator.wikimedia.org/T95521#1200100 (10Ricordisamoa)