[00:54:25] RECOVERY - Puppet failure on tools-webgrid-08 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:31:03] 10Tool-Labs: Resetup tools-webgrid-04 due to /var being too small - https://phabricator.wikimedia.org/T95537#1196550 (10scfc)
[01:50:32] 10Tool-Labs: Resetup tools-webgrid-04 due to /var being too small - https://phabricator.wikimedia.org/T95537#1196580 (10scfc) Added the host with `qconf -mhgrp \@webgrid`, restarted `execd` with `service gridengine-exec restart`, checked with `diff -u <(qconf -sq webgrid-lighttpd@tools-webgrid-07.eqiad.wmflabs)...
[02:00:26] 10Tool-Labs: Resetup tools-webgrid-04 due to /var being too small - https://phabricator.wikimedia.org/T95537#1196594 (10scfc) `diff -u <(qconf -se tools-webgrid-01.eqiad.wmflabs) <(qconf -se tools-webgrid-08.eqiad.wmflabs)` shows the difference between the resources.
[02:09:19] 10Tool-Labs: Resetup tools-webgrid-04 due to /var being too small - https://phabricator.wikimedia.org/T95537#1196612 (10scfc) And that was the Gordian knot: ``` tools.typoscan@tools-bastion-01:~$ qstat -xml sql?
[02:41:48] YuviPanda: how hard would it be to get a public ip?
[02:42:23] Negative24: depends on what you need it for :)
[02:42:41] suppose I had a phab instance :)
[02:43:06] that needed outside cloning access which can't go through the project proxy
[02:43:11] Aaah
[02:43:29] Ssh or git protocol cloning I presume
[02:43:37] yep
[02:43:47] It would only be temp
[02:43:54] Should be easy. File a bug and I'll do it when I'm at a computer?
[02:43:59] (In 30mins)
[02:45:30] 6Labs, 6Phabricator: Allocate one public IP to the Phabricator project - https://phabricator.wikimedia.org/T95643#1196643 (10Negative24) 3NEW a:3yuvipanda
[02:45:39] YuviPanda: ^
[02:46:07] YuviPanda: I probably won't be on in 30 mins
[02:46:53] Cool
[02:48:28] MaxSem, sql is not working on either deployment-db1 or deployment-db2
[02:49:59] no idea then
[03:39:17] 10Tool-Labs: Labs multilingual tile server lacks localized labels - https://phabricator.wikimedia.org/T95644#1196705 (10mxn) 3NEW
[03:40:43] 10Tool-Labs: Set up a tileserver for OSM in Labs - https://phabricator.wikimedia.org/T62819#1196715 (10mxn) >>! In T62819#1193389, @Nemo_bis wrote: >>>! In T62819#988323, @mxn wrote: >> The old Toolserver tile server worked for all the Wikimedia languages, not just English, German, and Russian. Are there plans t...
[03:44:35] 10Tool-Labs, 10OpenStreetMap: Labs multilingual tile server lacks localized labels - https://phabricator.wikimedia.org/T95644#1196723 (10yuvipanda)
[03:51:39] 6Labs, 3Labs-Q4-Sprint-1, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Identify possibly problematic file ownership on the NFS filesystems - https://phabricator.wikimedia.org/T95554#1196735 (10coren) project deployment-prep: uid 48 /home/anomie/.mweval_history /data/project/hhvm-cores/ ui...
[03:51:50] * Coren goes to bed.
[05:19:58] !log tools delete the tomcat node finally :D
[05:20:03] Logged the message, Master
[05:22:07] 6Labs, 3Labs-Q4-Sprint-1, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Identify possibly problematic file ownership on the NFS filesystems - https://phabricator.wikimedia.org/T95554#1196807 (10yuvipanda) Is that all?
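A minimal sketch of the kind of scan T95554 is about, i.e. finding files on the NFS shares whose numeric owner no longer maps to any account; the path and the use of find's -nouser/-nogroup tests are assumptions for illustration, not the actual script used for the ticket:

```
# Hypothetical scan: list files whose numeric UID/GID has no matching
# account or group (find resolves these through passwd/group, i.e. LDAP here).
find /data/project -xdev \( -nouser -o -nogroup \) -printf '%U:%G %p\n' 2>/dev/null \
  | sort -u | head -n 50
```

On shares of this size a full walk is slow, so in practice it would likely be run per project directory or against a pre-built file list.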
if so w00t :D
[05:49:22] PROBLEM - Host tools-webgrid-tomcat is DOWN: CRITICAL - Host Unreachable (10.68.16.29)
[05:49:50] yay
[05:51:44] PROBLEM - Puppet failure on tools-services-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[05:54:32] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1196815 (10yuvipanda) The node is gone, and the queue is gone too :)
[06:01:47] RECOVERY - Puppet failure on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:20:05] 6Labs, 10Tool-Labs: Planned labs maintenance on tools-db: Puppetization + log file change - https://phabricator.wikimedia.org/T94643#1196835 (10yuvipanda) I just spoke to @springle on IRC and he said coming Thursday (16th April) 0200 UTC is good for him. @Coren can you mail out a notice?
[06:32:29] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1196842 (10yuvipanda) Hmm, so ```yuvipanda@tools-bastion-01:~$ qconf -de tools-webgrid-tomcat.eqiad.wmflabs Host object "tools-webgrid-tomcat...
[06:43:09] PROBLEM - Puppet failure on tools-submit is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[06:47:27] PROBLEM - Puppet failure on tools-exec-cyberbot is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[06:51:12] 6Labs, 10Tool-Labs, 3ToolLabs-Goals-Q4, 7Tracking: Make sure that toollabs can function fully even with one virt* host fully down - https://phabricator.wikimedia.org/T90542#1196872 (10yuvipanda)
[06:51:14] 6Labs, 10Tool-Labs, 3ToolLabs-Goals-Q4: Have bigbrother run on multiple nodes to provide redundancy against tools-submit failure - https://phabricator.wikimedia.org/T91237#1196866 (10yuvipanda) 5Open>3declined a:3yuvipanda Replaced by T95521
[06:51:16] 10Tool-Labs, 7Tracking: make bigbrother or its replacement reliable - https://phabricator.wikimedia.org/T91414#1196871 (10yuvipanda)
[06:59:21] PROBLEM - Puppet failure on tools-exec-07 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[07:01:18] 6Labs, 10Tool-Labs: Delete 'commonsarchivebot' from toollabs - https://phabricator.wikimedia.org/T89807#1196874 (10Fastily) >>! In T89807#1193619, @scfc wrote: > Could you please go through and archive/delete the remaining files of the tool? Thanks! Done!
[07:10:32] !log tools take out tools-services-01 to test switchover and also to recreate as small
[07:10:35] Logged the message, Master
[07:12:24] RECOVERY - Puppet failure on tools-exec-cyberbot is OK: OK: Less than 1.00% above the threshold [0.0]
[07:13:11] RECOVERY - Puppet failure on tools-submit is OK: OK: Less than 1.00% above the threshold [0.0]
[07:16:56] 6Labs, 10Tool-Labs, 3Labs-Q4-Sprint-2, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Make service manifest monitors redundant / hotswappable - https://phabricator.wikimedia.org/T95521#1196898 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Hotswap test was good since it was bitten by the fact that manual steps...
[07:16:58] 6Labs, 10Tool-Labs, 3Labs-Q4-Sprint-2, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Review and productionize webservice manifest monitor - https://phabricator.wikimedia.org/T95210#1196901 (10yuvipanda)
[07:21:26] 6Labs, 10Tool-Labs, 3Labs-Q4-Sprint-2, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Make service manifest monitors redundant / hotswappable - https://phabricator.wikimedia.org/T95521#1196915 (10yuvipanda) (I added https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin#Services)
[07:21:50] 6Labs, 10Tool-Labs, 3Labs-Q4-Sprint-2, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Review and productionize webservice manifest monitor - https://phabricator.wikimedia.org/T95210#1196916 (10yuvipanda) Ok, so this just needs an email to labs-l and then we’re all good!
[07:24:28] RECOVERY - Puppet failure on tools-exec-07 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:50:55] PROBLEM - Puppet failure on tools-services-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[08:07:06] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1196954 (10scfc) `qhost -j` (NB: `qhost`, no hostname) shows: ``` […] tools-webgrid-tomcat.eqiad.wmflabs lx26-amd64 8 - 15.7G...
[08:07:42] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1196957 (10scfc) Rescheduled the job, the host job list is now empty.
[08:08:04] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1196958 (10scfc) ``` scfc@tools-bastion-01:~$ qconf -de tools-webgrid-tomcat.eqiad.wmflabs scfc@tools-bastion-01.eqiad.wmflabs removed "tools...
[08:08:37] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1196962 (10scfc) ``` scfc@tools-bastion-01:~$ qconf -ds tools-webgrid-tomcat.eqiad.wmflabs scfc@tools-bastion-01.eqiad.wmflabs removed "tools...
[08:10:56] RECOVERY - Puppet failure on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:11:22] 6Labs, 10Tool-Labs, 3Labs-Q4-Sprint-2, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Review and productionize webservice manifest monitor - https://phabricator.wikimedia.org/T95210#1196965 (10scfc) labs-announce :-).
[08:39:25] 10Tool-Labs: Unattended upgrades are failing from time to time - https://phabricator.wikimedia.org/T92491#1197009 (10scfc) ``` From: root@tools.wmflabs.org (Cron Daemon) Subject: Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) To: root@tools.wmflabs.org Da...
[08:39:51] 10Tool-Labs: Unattended upgrades are failing from time to time - https://phabricator.wikimedia.org/T92491#1197011 (10scfc) ``` From: root@tools.wmflabs.org (Cron Daemon) Subject: Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) To: root@tools.wmflabs.org D...
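scfc's comments above drain and delete the retired tomcat node from gridengine. A condensed sketch of that sequence, using the host name from the ticket; `qhost -j`, `qconf -de` and `qconf -ds` are quoted above, while the reschedule step is assumed to be the usual `qmod -rj` with a placeholder job id:

```
host=tools-webgrid-tomcat.eqiad.wmflabs

qhost -j             # check whether any job is still bound to the host
qmod -rj 1234567     # reschedule a leftover job elsewhere (job id is a placeholder)
qconf -de "$host"    # delete the execution host object
qconf -ds "$host"    # delete the submit host entry
# If the host were still listed in a hostgroup, it would also need removing
# there, e.g. by editing @webgrid with `qconf -mhgrp @webgrid`.
```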
[10:00:39] 6Labs, 10Tool-Labs, 3ToolLabs-Goals-Q4, 7Tracking: Make sure that toollabs can function fully even with one virt* host fully down - https://phabricator.wikimedia.org/T90542#1197218 (10scfc)
[10:00:41] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1197215 (10scfc) 5Open>3Resolved a:3scfc Did `rm -f /data/project/.system/store/*-tools-webgrid-tomcat.eqiad.wmflabs`. Anything else t...
[10:01:29] 6Labs, 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Retire 'tomcat' node, make Java apps run on the generic webgrid - https://phabricator.wikimedia.org/T91066#1197219 (10scfc) a:5scfc>3yuvipanda
[10:04:48] 6Labs, 10Tool-Labs: Delete 'commonsarchivebot' from toollabs - https://phabricator.wikimedia.org/T89807#1197227 (10scfc) 5Open>3Resolved Thanks! I deleted the tool.
[12:31:49] 6Labs, 3Labs-Q4-Sprint-1, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Identify possibly problematic file ownership on the NFS filesystems - https://phabricator.wikimedia.org/T95554#1197484 (10coren) It's all, except for the 9.4 million files owned by www-data which - while not puppetized or in ldap - is stable...
[12:55:55] RECOVERY - Puppet failure on tools-exec-21 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:00:01] RECOVERY - Puppet failure on tools-exec-wmt is OK: OK: Less than 1.00% above the threshold [0.0]
[13:00:55] RECOVERY - Puppet failure on tools-webgrid-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:16:36] 6Labs, 10Tool-Labs: Planned labs maintenance on tools-db: Puppetization + log file change - https://phabricator.wikimedia.org/T94643#1197683 (10coren) Notice mail't
[13:36:18] 10Tool-Labs: Admin www depends on short_open_tag = On - https://phabricator.wikimedia.org/T95688#1197730 (10scfc) 3NEW
[14:04:06] 10Tool-Labs: Admin www depends on short_open_tag = On - https://phabricator.wikimedia.org/T95688#1197804 (10scfc) (And there's a generated file (`htmlpurifier/library/HTMLPurifier.standalone.php`) that is required, but `.gitignored`. Ooops. I could have deleted that accidentally.)
[14:35:31] 10Tool-Labs, 7Tracking: Toolserver migration to Tools (tracking) - https://phabricator.wikimedia.org/T60788#1197960 (10Dereckson)
[15:10:15] 6Labs, 6operations, 10ops-eqiad: labvirt1004 has a failed disk 1789-Slot 0 Drive Array Disk Drive(s) Not Responding Check cables or replace the following drive(s): Port 1I: Box 1: Bay 1 - https://phabricator.wikimedia.org/T95622#1198079 (10Cmjohnson) A case with HP has been opened because that...
[15:19:20] 6Labs, 10hardware-requests, 6operations: Replace virt1000 with a newer warrantied server - https://phabricator.wikimedia.org/T90626#1198102 (10Cmjohnson) a:5Cmjohnson>3RobH Rob, These are 3.5" disk bays. I have 1TB disks on-site and will swap them (just give me the +1). CJ
[15:30:17] andrewbogott: I just sent you and yuvi a draft of the email I am going to send to project managers for https://phabricator.wikimedia.org/T95554
[15:30:26] Can you read it and critique clarity, etc?
[15:30:30] sure
[15:36:48] Coren: it looks good to me. You could add a concrete example in a footnote to make the issue with different ids on different boxes clearer… but that would make the email longer which is maybe not worth it.
[15:36:59] How many cases of that problem are there? Lots?
[15:37:25] Not tons, actually.
[15:37:34] https://phabricator.wikimedia.org/T95554 has a list
[15:37:54] I was pleasantly surprised.
[15:39:31] wikidata-query is the only concerning one as they apparently pointed a mysql server's data dir at NFS (which has bigger problems than the uid)
[15:40:18] Ah, that’s not so painful
[15:42:06] And it looks like some projects try to get central logging by pointing the syslogs at NFS which - while it works - is about the most painful way of doing it. :-)
[15:46:11] I am… surprised that works at all
[15:46:35] I guess syslog must already be race-proof since lots of different services write to it
[17:32:46] andrewbogott_afk: syslog opens files O_APPEND, so that tends to work out.
[17:33:26] (NFS specifically preserves posix file semantics including the append behaviour)
[17:35:21] In fact, much of the complexities of NFS are caused by its strict adherence to the really-didn’t-consider-remote-filesystems-at-the-time posix file semantics. :-)
[17:37:06] 6Labs, 10Tool-Labs: Planned labs maintenance on tools-db: Puppetization + log file change - https://phabricator.wikimedia.org/T94643#1198559 (10coren)
[17:55:57] it used to be possible to create VMs with a large partition but small CPU/RAM requirements. now I'm forced to allocate 16 gigs of RAM I don’t need :(
[17:59:11] are you sure? I don’t remember that being possible at all
[18:04:28] yeah, we had more images some time ago
[18:04:37] * MaxSem blames trusty
[18:06:03] sizes are determined by ‘flavors’ not images, no?
[18:06:15] MaxSem: That would have been quite some time ago, because I'm quite certain I remember being annoyed at the lack of big-disk-small-ram flavours.
[18:06:32] [when I started working on labs]
[18:07:52] shall I open a bug?
[18:08:33] Might be worthwhile; I'm sure there is a need for at least some flavours with different balances.
[18:10:57] Coren: andrewbogott_afk I’m going to go do some more visa work now (grumble, grumble). I’ll come back and start filling in https://etherpad.wikimedia.org/p/labs-report-q4-1 (last 2 week report) based on email, phabricator and gerrit
[18:11:18] and then there’s a calendar item to have all of us just go over it once and then mail it out
[18:11:34] Otay. Have "fun" with the visa crap.
[18:12:00] Coren: yeah. I just got my SSN, and so will have to deal with another whole bunch of paperwork now...
[18:12:12] including opening a bank account, which seems to require that I have a stable home address
[18:12:30] anyway, ranting for later! can’t take phone to appointment, apparently.
[18:12:38] o/
[18:14:38] 6Labs: Consider creating big storage/low RAM images - https://phabricator.wikimedia.org/T95731#1198896 (10MaxSem)
[18:15:44] YuviPanda, I got myself a bank account while I was living at Dan's, so not really required
[18:15:54] MaxSem: did you need a utility bill?
[18:15:57] you just need some mailing address
[18:15:59] let’s move this to another channel :)
[18:32:26] 6Labs, 3Labs-Q4-Sprint-1, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Labs NFSv4/idmapd mess - https://phabricator.wikimedia.org/T87870#1199056 (10coren)
[18:32:28] 6Labs, 3Labs-Q4-Sprint-1, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Identify possibly problematic file ownership on the NFS filesystems - https://phabricator.wikimedia.org/T95554#1199054 (10coren) 5Open>3Resolved The list is made, and labs-announce notified; since none of those will break for now, there i...
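On the flavor discussion above (and T95731): instance sizes come from flavors rather than images, so a big-disk/low-RAM offering would mean defining a new flavor. A sketch with the nova CLI of that era; the name and sizes are invented for illustration:

```
# Hypothetical flavor: 2 vCPUs, 2 GB RAM, 160 GB disk (RAM is in MB, disk in GB).
nova flavor-create bigdisk.small auto 2048 160 2
```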
[18:33:15] 6Labs, 3Labs-Q4-Sprint-1, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Identify possibly problematic file ownership on the NFS filesystems - https://phabricator.wikimedia.org/T95554#1199057 (10coren)
[18:51:54] 10Tool-Labs: Investigate reducing scheduling interval for Grid Engine - https://phabricator.wikimedia.org/T95485#1199161 (10coren) a:5coren>3None
[19:09:26] 6Labs, 3Labs-Q4-Sprint-1, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Remove dependencies on LDAP from labstore100[12] - https://phabricator.wikimedia.org/T95558#1199239 (10coren) modules/openstack/files/replica-addusers.pl -> enumerates users via getent modules/ldap/files/scripts/manage-nfs-volumes-daemon -> u...
[19:40:00] 6Labs: Get Labs openstack service dbs on a proper db server - https://phabricator.wikimedia.org/T92693#1199361 (10chasemp) Ask for @springle: can we chat about nodepool in labs and how to allocate DB resources? I'll ping you on irc
[19:45:10] 10Tool-Labs: Puppetize gridengine master configuration - https://phabricator.wikimedia.org/T95747#1199397 (10yuvipanda) 3NEW
[19:45:26] 10Tool-Labs: Investigate reducing scheduling interval for Grid Engine - https://phabricator.wikimedia.org/T95485#1199406 (10yuvipanda) 5Open>3Resolved a:3yuvipanda T95747 should track puppetizing it.
[19:45:48] 10Tool-Labs, 3Labs-Q4-Sprint-2: Investigate reducing scheduling interval for Grid Engine - https://phabricator.wikimedia.org/T95485#1199413 (10yuvipanda)
[19:53:51] 6Labs, 6operations, 10ops-eqiad: labvirt1004 has a failed disk 1789-Slot 0 Drive Array Disk Drive(s) Not Responding Check cables or replace the following drive(s): Port 1I: Box 1: Bay 1 - https://phabricator.wikimedia.org/T95622#1199472 (10Cmjohnson) Thank you for contacting HP e-Solutions. Wi...
[20:12:15] Coren, probably should have linked https://phabricator.wikimedia.org/T95554 explicitly?
[20:12:31] 10Tool-Labs, 7Tracking: Toolserver migration to Tools (tracking) - https://phabricator.wikimedia.org/T60788#1199565 (10valhallasw)
[20:19:21] 6Labs, 10Tool-Labs, 7Tracking: Make toollabs reliable enough (Tracking) - https://phabricator.wikimedia.org/T90534#1199584 (10valhallasw)
[20:19:23] 6Labs, 10Tool-Labs: Provide 'Support request' tool labs project - https://phabricator.wikimedia.org/T94359#1199581 (10valhallasw) 5Open>3Invalid a:3valhallasw I think that’s a fair solution, and I’d like to add a big thank you for essentially clearing out the Triage column!
[20:35:34] Krenair: Eugh. Definitely should have.
[21:05:19] Coren: definitely, wasted a good few minutes looking for the list
[21:05:21] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:05:23] Coren: ^ ugh
[21:05:24] YuviPanda: Odd. Not NFS, there’s a bit of traffic but nothing that bad.
[21:05:24] Hmmm... wait.
[21:05:25] labnet1001 network just flatlined.
[21:05:26] Ooooh. Same thing that happened a few days ago with someone doing a humongous write to NFS
[21:05:26] There’s a huge flush process going on atm. Several gigs of stuff.
[21:05:26] ugh
[21:05:27] Huge inbound spike and now the drives are trying to catch up. Another huge gunzip/gzip?
[21:05:27] iowait isn’t out of bound; this should self-correct shortly.
[21:05:27] * Coren monitors closely.
[21:05:27] Ah, here’s a spike.
[21:05:28] * Coren makes sure shelf 5 isn’t ill.
[21:05:29] Oh blah; I think we might have briefly lost it - it’s in resync.
[21:05:29] iowait seems to be coming back down.
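The "huge flush" and iowait that Coren is narrating above can be watched directly on the file server. A rough sketch, assuming sysstat is installed; the sampling interval and sample count are arbitrary:

```
# Pending writeback: large Dirty/Writeback values correspond to the flush above.
grep -E '^(Dirty|Writeback):' /proc/meminfo

# Per-device utilisation, await and queue depth over a few samples.
iostat -x 5 3
```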
[21:05:50] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 763110 bytes in 4.775 second response time
[21:06:05] hmm
[21:06:14] (ignore my complaint above, stupid gmail did NOT show that there are replies to the thread even though I responded to it)
[21:06:58] Nevermind about shelf 5 - that's apparently the previous resync still at work - it's using really low bandwidth.
[21:07:09] But something is still writing crazy amounts to NFS now.
[21:07:13] hmm
[21:07:24] * YuviPanda sshs in
[21:08:08] Coren: iftop seems quite empty
[21:08:09] Look at https://phabricator.wikimedia.org/T95554
[21:08:52] 10.68.16.196 is our culprit
[21:08:59] ? that phab ticket?
[21:09:02] * Coren uses iptraf
[21:09:16] copypaste fail. http://grafana.wikimedia.org/#/dashboard/db/labs-monitoring
[21:09:21] db01?!
[21:09:57] I wonder if it’s someone who put a db to write to NFS
[21:10:22] wikidata-query
[21:10:32] Yep. Someone *did* put a mysql server on NFS there.
[21:10:36] * Coren stops it.
[21:10:45] yes
[21:11:12] smalyshev is logged in and working hard on it
[21:11:23] :)
[21:11:38] there’s a big ‘mv'
[21:11:49] Yeah. "bigdata-old.jnl"
[21:12:29] * Coren ponders what to do.
[21:13:01] * YuviPanda pokes SMalyshev
[21:13:08] Coren: we could just kill it...
[21:13:09] 10:1 this really should have either been local or on /data/scratch
[21:14:39] YuviPanda: yes, what's the issue there?
[21:14:40] you’re hammering NFS, basically.
[21:14:40] yeah, /data/scratch is for hammering
[21:14:40] hmm... let me see
[21:14:41] Very hard, too - you're moving so much data around you are forcing cache flushes impacting everyone.
[21:15:02] sorry... didn't know NFS is so sensitive.
[21:15:09] let me stop a couple of things
[21:15:34] YuviPanda: is it better now?
[21:15:42] It's not /that/ sensitive, but whatever you are doing is an order of magnitude above all of the other labs projects combined. :-)
[21:16:31] Traffic is settling down, but I expect it will take a couple of minutes for everything to recover completely.
[21:16:38] it's just supposed to be reading & writing a couple of files. Admittedly, big files :)
[21:16:46] but that's why they are on nfs...
[21:17:21] SMalyshev: We'll have a talk about doing something to better match what you are doing. Likely, /data/scratch is better suited to what you are doing.
[21:17:42] that's what I was using mostly - data/scratch
[21:17:43] we could probably just put the old ciscos to use for the wdq people :)
[21:18:25] SMalyshev: That mv specifically had /data/project as target. :-)
[21:18:40] yeah I've killed that one
[21:18:53] probably was a bad idea doing it in parallel with other stuff, sorry
[21:19:07] Things are recovering, and my dinner is getting cold on the table. :-)
[21:19:23] would not do that again
[21:19:34] * YuviPanda wonders if we should file an incident report or not.
[21:19:44] we definitely need some form of network limiting at some point, I guess
[21:20:13] greg-g: do you think an incident report on these kinds of things would be useful? it affected beta / tools.
[21:20:15] YuviPanda: Hard problem. We'll look into it in Lyon though.
[21:20:20] but for users doing things like this on a local disk, they would not impact each other like that. so AFAIK our NFS setup loses any fairness between multiple users. => it is uber sensitive to anything that can actually utilize it.
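The hunt above (iftop, iptraf, the grafana dashboard) comes down to finding which client is pushing the traffic. A minimal sketch of the same idea from the NFS server side; the interface name is an assumption and the IP is the one from the log:

```
# Per-client bandwidth on the client-facing interface, without DNS lookups.
iftop -i eth0 -n

# Map the noisy client IP back to an instance name via reverse DNS.
dig +short -x 10.68.16.196
```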
[21:20:33] yeah
[21:20:43] I'd be happy to *not* put any of it on NFS but the problem is it's a 50G+ db and if I need a backup copy I have no option
[21:20:52] YuviPanda: is it a network bottleneck or a disk io/seek/whatever bottleneck?
[21:20:57] since biggest disk I get is 100G
[21:21:09] jzerebecki: can you see https://grafana.wikimedia.org/#/dashboard/db/labs-monitoring
[21:21:20] jzerebecki: you should be able to get up to a 140G /srv
[21:21:30] SMalyshev: ^^
[21:21:45] with an xlarge
[21:21:46] instance
[21:21:51] YuviPanda: are there things you learned/should do better about next time?
[21:21:58] hmm... ok I'll check into it.
[21:22:06] greg-g: hmm, not really.
[21:22:21] YuviPanda: just email the relevant lists what happened
[21:22:31] it was a fairly straightforward see notification, look at graphs, hunt issue, notify person, fixed.
[21:22:36] alright
[21:22:39] YuviPanda: is it ok to use /data/scratch for now at the level it is now? I can try to move stuff to local disk but that means moving some double-digits Gs again...
[21:23:29] SMalyshev: Coren would know better, but I think the answer is ‘yes’. also, maybe use ionice?
[21:23:42] YuviPanda: inbound traffic shaping might help a bit
[21:23:55] hmm, I don’t know if ionice helps with NFS
[21:24:01] YuviPanda: would that help with nfs?
[21:24:10] yeah asked the same thing :)
[21:25:06] I wouldn't mind my processing of huge dumps taking a back seat to more urgent things people are doing... it's several hrs anyway, if it takes an extra hr no big deal
[21:25:12] 6Labs, 6operations: One instance hammering on NFS should not make it unavailable to everyone else - https://phabricator.wikimedia.org/T95766#1199793 (10yuvipanda) 3NEW
[21:25:25] jzerebecki: Coren ^ should discuss options there
[21:25:52] SMalyshev: I think the solution for things like your use case is ‘labs on real hardware’, but I guess we’re several months away from being able to do that...
[21:26:35] YuviPanda: yeah if we are going to run blazegraph with real load we'd probably need something stronger
[21:26:41] yup
[21:26:49] SMalyshev: you can ask for a ‘test’ hardware machine...
[21:26:57] we’ve spares, you just need to get mark to approve.
[21:27:06] we will, sometime in the near future :)
[21:27:28] SMalyshev: :) That’s the solution, I think :)
[21:27:40] bring the future closer!
[21:28:19] ok, until then I'll try to keep it lower profile and please tell me if it causes any trouble I'll stop doing whatever causes it
[21:29:11] YuviPanda: You at a computer for https://phabricator.wikimedia.org/T95643
[21:29:22] oooh, yes let me do it
[21:29:52] 6Labs, 6Phabricator: Allocate one public IP to the Phabricator project - https://phabricator.wikimedia.org/T95643#1199819 (10yuvipanda) 5Open>3Resolved Done
[21:29:53] done
[21:30:50] do I have to logout to see it in Special:NovaAddress
[21:31:12] I always answer yes if people ask me if they have to logout / in on wikitech
[21:34:31] alright, emailed
[21:36:24] YuviPanda: good to go. thanks
[21:37:29] SMalyshev: btw. i think there is still a server from testing one of the other databases that is not yet decommissioned/reclaimed, retracting the task to decom it is probably less work
[21:40:14] jzerebecki: I think we still have einsteinium... it doesn't have a lot of diskspace but may be ok for a while
[21:46:53] YuviPanda: How odd. The 5-minute spike from tools is gone.
[21:47:37] YuviPanda: And it was there until the slight overload. Perhaps it’s not a cron job after all?
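On the "maybe use ionice?" question above: ionice only influences the local block-I/O scheduler, so for a client writing over NFS its effect is limited, and capping the transfer rate itself is usually more effective. A hedged sketch of a gentler way to move a large file than a plain mv; the paths and rate are made up, only the file name comes from the log:

```
# Copy at a capped rate (KB/s), then remove the source only if the copy succeeded.
ionice -c3 nice -n19 rsync --bwlimit=20000 --progress \
    /data/scratch/wdq/bigdata-old.jnl /srv/wdq/ \
  && rm /data/scratch/wdq/bigdata-old.jnl
```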
[22:00:40] YuviPanda: If I remove the previous proxy and then add the exact same hostname with the public ip, it should work, right?
[22:00:50] but it's not
[22:06:25] Coren, in meta_p.wiki what is is_sensitive?
[22:06:49] Krenair: Whether the wiki has case sensitive titles
[22:06:57] ah.
[22:06:58] Krenair: There are only a couple.
[22:07:14] MariaDB [centralauth_p]> select count(*) from meta_p.wiki where is_sensitive = 1;
[22:07:14] +----------+
[22:07:14] | count(*) |
[22:07:14] +----------+
[22:07:14] | 173 |
[22:07:15] +----------+
[22:07:16] 1 row in set (0.00 sec)
[22:07:19] :)
[22:07:34] Do we not set anything for private wikis?
[22:07:58] I think wiktionaries are case sensitive
[22:08:11] Right, wiktionaries and one or two languages.
[22:08:30] private wikis don’t have centralauth do they?
[22:08:42] or is that just fishbowls?
[22:09:15] You know, I'm not sure. I *think* none of them do.
[22:10:04] private wikis and fishbowls don't run centralauth, no
[22:10:53] * bd808 sees that in InitialiseSettings
[22:12:36] I just made a list to check against out of private.dblist in the end
[22:15:20] YuviPanda or Coren or anyone else with access, could you please restart webservices for "ia-upload" (Tpt hasn’t been around)
[22:16:12] sDrewth: heya
[22:16:12] sure
[22:16:13] sDrewth: {{done}}
[22:16:18] ah ;)
[22:24:58] thanks muchly
[22:28:58] 6Labs, 10Wikimedia-Labs-Infrastructure, 10Continuous-Integration, 3Continuous-Integration-Isolation: OpenStack API account to control `contintcloud` labs project - https://phabricator.wikimedia.org/T86170#1199958 (10hashar) Adding @chasemp . We talked about nodepool user/credentials today. The task descri...
[23:23:58] 6Labs, 10Tool-Labs, 3Labs-Q4-Sprint-2, 3ToolLabs-Goals-Q4: Make service manifest monitors redundant / hotswappable - https://phabricator.wikimedia.org/T95521#1200100 (10Ricordisamoa)
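For the is_sensitive question above, the flag can also be inspected per wiki rather than just counted. A sketch, assuming replica credentials in ~/replica.my.cnf and one of the usual *.labsdb host aliases; the alias used here is only an example:

```
# List a few wikis whose titles are flagged as case sensitive.
mysql --defaults-file="$HOME/replica.my.cnf" -h enwiki.labsdb \
  -e 'SELECT dbname, lang, family FROM meta_p.wiki WHERE is_sensitive = 1 LIMIT 10;'
```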