[00:03:21] RECOVERY - Host tools-exec-1410 is UP: PING OK - Packet loss = 0%, RTA = 1.72 ms
[01:25:57] Labs: Request creation of jessie-stretch labs project - https://phabricator.wikimedia.org/T165633#3302714 (Dzahn) I would suggest you change the request to raising quota in the relevant project for that (maybe temp. until the old ones can be removed in exchange).
[02:40:58] Labs, Striker, wikitech.wikimedia.org, Documentation: Update Tool Labs account creation docs on wikitech to mention Striker - https://phabricator.wikimedia.org/T156340#3302722 (bd808) Open>Resolved a:chasemp @chasemp took care of most of this during the #wikimedia-hackathon-2017 by cl...
[02:41:00] Labs, Tool-Labs, Developer-Relations, Developer-Wishlist (2017), Documentation: Run a documentation sprint for Labs - https://phabricator.wikimedia.org/T101659#3302726 (bd808)
[02:42:27] Striker, Epic, Patch-For-Review: Manage shared tool accounts via Striker - https://phabricator.wikimedia.org/T149458#3302729 (bd808)
[02:42:29] Labs, Tool-Labs: Add publicly-editable tag system to http://tools.wmflabs.org/?list - https://phabricator.wikimedia.org/T139991#3302731 (bd808)
[02:46:07] Striker, Epic, Patch-For-Review: Manage shared tool accounts via Striker - https://phabricator.wikimedia.org/T149458#3302733 (bd808)
[02:46:09] Labs, Tool-Labs, Epic: Tools web interface for tool authors (Brainstorming ticket) - https://phabricator.wikimedia.org/T128158#3302734 (bd808)
[02:47:05] Labs, MediaWiki-extensions-OpenStackManager, Tool-Labs: The future of service groups and service users on Labs - https://phabricator.wikimedia.org/T162945#3302736 (bd808)
[02:47:06] Striker, Epic, Patch-For-Review: Manage shared tool accounts via Striker - https://phabricator.wikimedia.org/T149458#2753308 (bd808)
[02:48:23] ^ me, it will come back soon
[03:19:42] PROBLEM - Puppet errors on tools-exec-1437 is CRITICAL: CRITICAL: 20.00% of data above the
critical threshold [0.0]
[03:30:39] Labs, Labs-Infrastructure: labvirt1006 super busy right now - https://phabricator.wikimedia.org/T165753#3302745 (Andrew) I moved two tools instances off of 1006. No obvious change in cpu metrics so far.
[03:59:41] RECOVERY - Puppet errors on tools-exec-1437 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:32:04] PROBLEM - Puppet errors on tools-exec-1440 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[06:33:27] PROBLEM - Puppet errors on tools-exec-1441 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[06:42:16] PROBLEM - Puppet errors on tools-webgrid-lighttpd-1426 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[06:57:00] PROBLEM - Puppet errors on tools-mail is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[07:07:03] RECOVERY - Puppet errors on tools-exec-1440 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:13:24] RECOVERY - Puppet errors on tools-exec-1441 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:17:16] RECOVERY - Puppet errors on tools-webgrid-lighttpd-1426 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:37:03] RECOVERY - Puppet errors on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0]
[08:37:37] Labs, DBA, Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3303045 (Marostegui) So, given the issue with the compressed tables and if we still want to use db1070 - this is the procedure I have thought to get it o...
[09:04:06] Labs, Tool-Labs, Tool-Labs-tools-Other: A tool queries urwiki recentchanges 6 times per second - https://phabricator.wikimedia.org/T166531#3303089 (hashar) Sorry my code was wrong. `time.sleep()` does not accept string but requires either an int or a float. So we gotta drop the single quote for `thro...
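(Editor's sketch for the `time.sleep()` point in hashar's comment above. This is not the tool's actual code, which is not shown in the log; the helper name `sleep_accepts` and the 0.01-second delay are hypothetical and chosen only for illustration. It shows that a quoted number is rejected with a `TypeError`, while the same value as an int or float is accepted.)

```python
import time


def sleep_accepts(value):
    """Return True if time.sleep() accepts `value`, False if it raises TypeError.

    Hypothetical helper for illustration only; it is not part of the tool
    discussed in T166531.
    """
    try:
        time.sleep(value)
    except TypeError:
        # A str such as "0.01" ends up here: time.sleep() wants a number.
        return False
    return True


if __name__ == "__main__":
    print(sleep_accepts(0.01))    # float delay is accepted
    print(sleep_accepts("0.01"))  # quoted delay is rejected
```

(If the delay comes from a config file or command line as text, converting it once with `float(...)` before the loop is one way to keep the throttle working.)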
[09:49:36] Labs, Tool-Labs, Tool-Labs-tools-Other: A tool queries urwiki recentchanges 6 times per second - https://phabricator.wikimedia.org/T166531#3299145 (MuhammadShuaib) Please remove the file or quit the job. I do not want to maintain this job, Thanks.
[09:55:43] Labs, DBA, Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3303196 (jcrespo) Don't use mysqldump- mydumper will be faster, and works great. Just make sure you do not backup (or remove later) common dbs like mysql...
[10:25:52] Is there any timeout on Tools labs webservice proxy? I'm trying to debug the 502 error with https://tools.wmflabs.org/wsexport/tool/book.php?lang=fr&format=pdf-a5&page=Lacenaire
[12:13:11] andrewbogott: ping me when you're around! :)
[13:19:39] Labs: Request creation of openipmap labs project - https://phabricator.wikimedia.org/T166671#3303834 (Pintoch)
[13:19:41] Labs, Tracking: New Labs project requests (tracking) - https://phabricator.wikimedia.org/T76375#3303846 (Pintoch)
[14:02:26] paravoid: what's up?
[14:02:56] hi
[14:03:07] can I upgrade facter with salt all across labs?
[14:03:42] or were you planning to do something different?
[14:04:41] paravoid: please do!
[14:05:29] I didn't have much of a plan in mind… in theory I should assess the state of clush and learn to do it that way but I'm happy for you to do it instead :)
[14:07:54] !log tools migrating tools-exec-1409 to labvirt1009 to reduce CPU load on labvirt1006 (T165753)
[14:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[14:08:00] T165753: labvirt1006 super busy right now - https://phabricator.wikimedia.org/T165753
[14:08:30] how do you !log for everything?
:)
[14:09:24] puppet-mailman.puppet.eqiad.wmflabs and mathosphere.math.eqiad.wmflabs are out of disk space
[14:09:45] ok, I'll delete some logfiles on those two
[14:10:02] PROBLEM - Puppet errors on tools-webgrid-lighttpd-1401 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[14:10:30] you can't log everywhere in this channel… mostly I do things in -operations if it's labs-wide, and/or send an email to labs-announce and/or post in https://phabricator.wikimedia.org/phame/blog/view/5/ (depending on who you think it will affect)
[14:12:16] no one hopefully? :)
[14:12:24] PROBLEM - Puppet errors on tools-exec-1418 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[14:12:28] so the blog then :)
[14:13:04] and twitter :-P
[14:13:11] and of course salt didn't run the command in a few hosts
[14:13:38] PROBLEM - Puppet errors on tools-exec-1421 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[14:13:42] let's place bets, how many times do I have to run salt pkg.install until it covers 95%, 99% and 100%?
[14:14:38] PROBLEM - Host tools-exec-1409 is DOWN: CRITICAL - Host Unreachable (10.68.18.17)
[14:14:46] 2,4,infty
[14:15:12] amazing
[14:15:21] 14 hosts still pending
[14:15:33] in 3 runs
[14:19:30] paravoid: is this just 'apt-get install facter' or is there more to it?
[14:19:47] just apt-get install facter
[14:19:59] on the instances I'm manually SSH'ing, I see dpkg previously interrupted, so..
[14:20:02] root@multimedia-alpha:~# apt-get install facter
[14:20:03] E: dpkg was interrupted, you must manually run 'dpkg --configure -a' to correct the problem.
[14:20:06] etc.
[14:20:10] ok, I'll do these two that I'm working on
[14:20:14] while installing a kernel
[14:20:36] multimedia-alpha.multimedia.eqiad.wmflabs is also 100% on /
[14:20:39] you don't monitor this?
[14:20:52] not for volunteer-run projects
[14:21:03] gadgets.social-tools.eqiad.wmflabs as well
[14:21:25] well you guys say that a working puppet is a pre-condition to supporting VMs
[14:21:33] a working puppet depends on a non-empty filesystem :)
[14:21:59] labs-vmbuilder-trusty.openstack.eqiad.wmflabs 100% / as well
[14:21:59] true, project owners are getting email nags if puppet is broken. they must be ignoring them :(
[14:22:45] the last one sounds like one of yours? :)
[14:23:38] sca1.services.eqiad.wmflabs 100% / as well
[14:24:00] flow-tests.editor-engagement.eqiad.wmflabs: The last Puppet run was at Wed Nov 2 00:01:56 UTC 2016 (112111 minutes ago).
[14:24:55] /dev/vda2 1.9G 1.9G 0 100% /var
[14:25:23] multimedia-perf.multimedia.eqiad.wmflabs, 100% /
[14:25:42] ogvjs-testing.ogvjs-integration.eqiad.wmflabs 100% /
[14:26:23] cirrus-browser-bot.search.eqiad.wmflabs had a half-configured dpkg state that was easily fixable (and caught by check_dpkg)
[14:26:46] and quarry-main-01.quarry.eqiad.wmflabs has a 100% /
[14:26:51] and these are all
[14:26:56] so I guess we'll proceed without them
[14:27:04] I wonder if something about unattended upgrades has changed… I haven't seen this particular mess-of-1000-kernels before
[14:27:14] no, Ubuntu was always like that
[14:28:30] I don't really get the "managed VMs" thing where we force our puppet tree on them but then never monitor if the systems are in a working condition, but up to you guys :)
[14:29:24] so I guess we can proceed with structured facts in Labs then
[14:29:41] you were hired to change that, paravoid ;-)
[14:29:53] not that :)
[14:35:30] !log multimedia deleting multimedia-alpha instance; it's broken and unused.
[14:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Multimedia/SAL
[14:37:00] Is there any timeout on Tools labs webservice proxy?
I'm trying to debug the 502 error with https://tools.wmflabs.org/wsexport/tool/book.php?lang=fr&format=pdf-a5&page=Lacenaire
[14:39:19] Tpt[m]: I'm not sure I totally understand the question, but if it were me I would try logging on to the backend server itself and testing with wget
[14:39:25] unless you've done that and that works...
[14:40:31] andrewbogott: the tool seems to work fine (nothing in logs) but, when I test it, I get 502 error
[14:40:47] so, I was looking for something related to the proxy
[14:41:29] ok, I don't know for sure about timeouts. It's unlikely that the proxy is totally broken since it's working for 1000 other things
[14:41:45] (and, sorry, my previous advice is probably not helpful, I was confusing tools proxy with labs proxy)
[14:43:30] !log multimedia deleting instance multimedia-perf
[14:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Multimedia/SAL
[14:47:24] RECOVERY - Puppet errors on tools-exec-1418 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:48:36] RECOVERY - Puppet errors on tools-exec-1421 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:49:20] Tpt[m]: let me find the nginx config for the proxy and we can reason about it from there
[14:50:04] RECOVERY - Puppet errors on tools-webgrid-lighttpd-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:50:50] Tpt[m]: the config that is used is at https://github.com/wikimedia/puppet/blob/production/modules/dynamicproxy/templates/urlproxy.conf
[14:51:30] bd808: thank you!
[14:51:48] there is a "proxy_read_timeout 3600s;"
[14:52:05] which is a ridiculously long time
[14:53:03] ok, so it's not that
[14:53:04] there could be a timeout in the lighttpd -> fcgi inside the webservice too I suppose
[14:55:58] * bd808 -> meetings for a while
[14:57:58] Tpt[m]: the lighttpd config is in this file -- https://github.com/wikimedia/operations-software-tools-webservice/blob/master/toollabs/webservice/services/lighttpdwebservice.py
[15:02:30] RECOVERY - Host tools-exec-1409 is UP: PING OK - Packet loss = 0%, RTA = 2.80 ms
[15:10:10] PROBLEM - Puppet errors on tools-exec-1409 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[15:26:45] Labs, Labs-Infrastructure: labvirt1006 super busy right now - https://phabricator.wikimedia.org/T165753#3304335 (Andrew) Open>Resolved I moved one more away -- CPU usage is high now but not so high that I'm worried.
[15:32:06] Tool-Labs-tools-Pageviews: [Bug] Querying for articles without page view stats produces weird results - https://phabricator.wikimedia.org/T166692#3304364 (Niharika)
[15:33:12] Tool-Labs-tools-Pageviews: [Bug] Querying for articles without page view stats produces weird results - https://phabricator.wikimedia.org/T166692#3304379 (Niharika)
[15:40:10] RECOVERY - Puppet errors on tools-exec-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:03:39] PROBLEM - Puppet errors on tools-exec-1404 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[16:03:54] Tool-Labs-tools-Pageviews: [Bug] Querying for articles without page view stats produces weird results - https://phabricator.wikimedia.org/T166692#3304500 (Niharika)
[16:04:25] PROBLEM - Puppet errors on tools-exec-1441 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[16:25:08] !log tools rebooting tools-exec-1404 as part of a disk-space-saving test
[16:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:29:26]
RECOVERY - Puppet errors on tools-exec-1441 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:34:01] Labs, Continuous-Integration-Config, Continuous-Integration-Infrastructure, Patch-For-Review, Release-Engineering-Team (Next): Fix ci puppet role to support stretch - https://phabricator.wikimedia.org/T166611#3304625 (hashar) a:Paladox
[16:34:52] !log tools running 'apt-get -yq autoremove' env='{DEBIAN_FRONTEND: "noninteractive"}' on all instances with salt
[16:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:38:41] RECOVERY - Puppet errors on tools-exec-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:43:09] PROBLEM - Puppet errors on tools-exec-1410 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[16:48:20] PROBLEM - Puppet errors on tools-exec-1402 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[16:57:53] PROBLEM - Puppet errors on tools-exec-1407 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[16:59:32] PROBLEM - Puppet errors on tools-exec-1403 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[16:59:46] PROBLEM - Puppet errors on tools-exec-1406 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[17:00:08] PROBLEM - Puppet errors on tools-exec-1408 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[17:03:20] PROBLEM - Puppet errors on tools-webgrid-lighttpd-1412 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[17:03:21] Labs-project-Extdist, VisualEditor: extdist tarball generator is erroring on VisualEditor REL1_23 - https://phabricator.wikimedia.org/T121748#1886960 (hashar) Note that REL1_23 is end of life.
[17:05:44] the shinken puppet warnings are expected.
andrewbogott is doing some system maintenance that is likely to cause puppet to be mad intermittently on tools hosts
[17:09:29] PROBLEM - Puppet errors on tools-exec-1405 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[17:10:13] PROBLEM - Puppet errors on tools-exec-1401 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[17:18:08] RECOVERY - Puppet errors on tools-exec-1410 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:21:14] Tool-Labs-tools-Xtools, Community-Tech: Diff links and revision permalinks don't work in Top Edits interface - https://phabricator.wikimedia.org/T165551#3304951 (DannyH) Open>Resolved a:DannyH Resolved by Musikanimal
[17:28:20] RECOVERY - Puppet errors on tools-exec-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:32:52] RECOVERY - Puppet errors on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:33:14] Tool-Labs-tools-Xtools, Community-Tech: Convert xtools intuition to its own repository - https://phabricator.wikimedia.org/T165708#3305045 (DannyH)
[17:35:05] RECOVERY - Puppet errors on tools-exec-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:38:19] RECOVERY - Puppet errors on tools-webgrid-lighttpd-1412 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:39:31] RECOVERY - Puppet errors on tools-exec-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:39:45] RECOVERY - Puppet errors on tools-exec-1406 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:44:30] RECOVERY - Puppet errors on tools-exec-1405 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:45:14] RECOVERY - Puppet errors on tools-exec-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:58:14] PROBLEM - Free space - all mounts on tools-webgrid-lighttpd-1426 is CRITICAL: CRITICAL: tools.tools-webgrid-lighttpd-1426.diskspace._tmp.byte_percentfree (<44.44%)
[18:12:27] bd808, chasemp, madhuvishy, andrewbogott: one quick
question about this eventlogging schema that yuvi set up: https://meta.wikimedia.org/wiki/Schema:CommandInvocation
[18:12:37] do you guys use that data?
[18:21:39] nuria_: not actively at the moment, I think. IIRC it was created to track grid engine usage on Precise
[18:22:13] valhallasw`cloud: ok, then maybe we should remove logging that is still happening. Will file ticket
[18:22:54] https://phabricator.wikimedia.org/T123444
[18:23:39] https://phabricator.wikimedia.org/T166712
[18:23:41] ahahaha
[18:23:45] so... not entirely sure what the background was. Maybe the new Webservice command instead
[18:41:11] nuria_: Yuvi created it back when he was trying to understand what the active use cases for jsub and webservice were. We did a little data mining with it 9-12 months ago. We don't look at the data regularly for sure
[18:41:49] bd808: i filed a ticket for removal of logging but if you still use it by all means decline
[18:42:14] *nod* we should at least look at it and decide if it is providing any value
[19:15:19] I'm going to test a new build of the webservice package on tools-dev and then roll it out everywhere if it looks ok after a few quick tests
[19:16:15] !log tools Installed toollabs-webservice_0.37_all.deb from local file on tools-bastion-02 (T163355)
[19:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:16:19] T163355: webservice stop says service not running but service.manifest not cleared - https://phabricator.wikimedia.org/T163355
[19:23:43] PROBLEM - High iowait on tools-services-01 is CRITICAL: CRITICAL: tools.tools-services-01.cpu.total.iowait (>12.50%)
[19:24:22] !log tools Updating toollabs-webservice package via clush (T163355)
[19:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:24:27] T163355: webservice stop says service not running but service.manifest not cleared - https://phabricator.wikimedia.org/T163355
[19:29:36] !log tools Rebuilding all Docker images
to pick up toollabs-webservice v0.37 (T163355)
[19:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:29:40] T163355: webservice stop says service not running but service.manifest not cleared - https://phabricator.wikimedia.org/T163355
[19:30:42] PROBLEM - Puppet errors on tools-webgrid-lighttpd-1407 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[19:31:04] Labs, Tool-Labs: Instrument jsub/jstart/webservices usage - https://phabricator.wikimedia.org/T123444#1929978 (Framawiki) @yuvipanda is this task done ?
[19:33:42] RECOVERY - High iowait on tools-services-01 is OK: OK: All targets OK
[20:05:41] RECOVERY - Puppet errors on tools-webgrid-lighttpd-1407 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:28:19] PROBLEM - Puppet errors on tools-services-02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[20:34:28] PROBLEM - Puppet errors on tools-bastion-05 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[20:38:14] PROBLEM - Puppet errors on tools-bastion-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[20:45:29] PROBLEM - Puppet errors on tools-bastion-03 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[21:18:17] RECOVERY - Puppet errors on tools-bastion-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:25:30] RECOVERY - Puppet errors on tools-bastion-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:33:19] RECOVERY - Puppet errors on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:39:30] RECOVERY - Puppet errors on tools-bastion-05 is OK: OK: Less than 1.00% above the threshold [0.0]