[00:56:01] Labs, Tool-Labs, Operations, Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2303651 (Dzahn) @Nemo_bis Possible, but we'll need it in DNS first, then Apache config for wiki.toolserver to work.. LE...
[01:04:34] Labs, Tool-Labs: Make http (404, 302, 301 etc) statistics for toolserver.org - https://phabricator.wikimedia.org/T85167#2303689 (Dzahn) grep **fisheye.toolserver.org** /var/log/apache2/access.log | wc -l **8124** (2016-05-17T06:43 - 2016-05-18T01:04)
[01:09:26] normal 404 page for a tool that doesn't exist: https://tools.wmflabs.org/netaction
[01:09:47] totally different 404 page for a tool that doesn't exist, but only this one: https://tools.wmflabs.org/osm
[01:17:26] Labs, Tool-Labs: toollabs: tool "unblock" not working - https://phabricator.wikimedia.org/T135578#2303705 (Dzahn)
[01:18:45] Labs, Tool-Labs: toollabs: tool "wikifeeds" not working - https://phabricator.wikimedia.org/T135579#2303718 (Dzahn)
[01:23:33] Labs, Tool-Labs: toollabs: tool "wikifeeds" not working - https://phabricator.wikimedia.org/T135579#2303748 (Dzahn) I noticed this in relation to T85167: there are still hits (404s) for ~wikifeeds on old toolserver.org URLs, so I redirected them over here, then saw it's not working.
[01:23:41] Labs, Tool-Labs: toollabs: tool "unblock" not working - https://phabricator.wikimedia.org/T135578#2303751 (Dzahn) I noticed this in relation to T85167: there are still hits (404s) for ~wikifeeds on old toolserver.org URLs, so I redirected them over here, then saw it's not working.
[01:54:08] Labs: raise quota limit for project video - https://phabricator.wikimedia.org/T135560#2303109 (zhuyifei1999) m1.small or m1.medium should be sufficient.
[02:36:48] Labs, wikitech.wikimedia.org: Reset OAuth authentication for Wikitech account for Niharika29 - https://phabricator.wikimedia.org/T135518#2303792 (Niharika) >>! In T135518#2303356, @Krenair wrote: > In the mean time please could you create a file somewhere in labs (bastion or tools projects are best) that...
[03:13:37] could an admin please help with https://phabricator.wikimedia.org/T132988?
[03:47:12] Tool-Labs-tools-Other: toollabs: tool "unblock" not working - https://phabricator.wikimedia.org/T135578#2303828 (yuvipanda)
[03:47:26] Tool-Labs-tools-Other: toollabs: tool "wikifeeds" not working - https://phabricator.wikimedia.org/T135579#2303829 (yuvipanda)
[04:04:28] Labs, Tool-Labs-tools-Other: `fr-wikiversity` Tool should get deleted - https://phabricator.wikimedia.org/T133778#2303832 (TerraCodes)
[05:11:27] Labs, Tool-Labs, Patch-For-Review: Make http (404, 302, 301 etc) statistics for toolserver.org - https://phabricator.wikimedia.org/T85167#2303833 (Dzahn) >>! In T85167#2303689, @Dzahn wrote: > grep **fisheye.toolserver.org** /var/log/apache2/access.log | wc -l > **8124** > > (2016-05-17T06:43 - 201...
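The T85167 one-liners above count hits for a single vhost with grep | wc -l. For the per-status-code breakdown the task title asks about (404, 302, 301, ...), a minimal bash/awk sketch over the same log; it assumes the default Apache "combined" log format, where the status code is the ninth whitespace-separated field:

    #!/bin/bash
    # Count requests per HTTP status code in an Apache access log.
    # Assumes the "combined" log format: field 9 is the status code.
    LOG=/var/log/apache2/access.log
    awk '{ count[$9]++ } END { for (code in count) print code, count[code] }' "$LOG" | sort -k2,2 -rn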
[05:42:30] PROBLEM - Puppet run on tools-k8s-master-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[08:07:54] (PS1) Lokal Profil: Set empty lat lon to NULL in monuments_all (and wlpa_all) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/289359 (https://phabricator.wikimedia.org/T39422)
[08:28:09] Labs: Investigate labnet1002 kernel panic - https://phabricator.wikimedia.org/T135322#2304120 (MoritzMuehlenhoff) No idea, that happened somewhere deep in memory management internals. If it happens again let's run a memory check. On the plus side, with the reboot labnet1002 uses a much more recent kernel now.
[08:34:52] PROBLEM - Host tools-bastion-01 is DOWN: CRITICAL - Host Unreachable (10.68.17.228)
[09:28:40] Labs, Labs-Infrastructure, DBA, Operations, Blocked-on-Operations: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2304213 (akosiaris) > @jcrespo this is a bit shrouded in mystery with no documentation. It seems post replication someone would run [[ https://phabricator.wik...
[09:53:09] (CR) Jean-Frédéric: [C: 2] Standardise php whitespace to tab [labs/tools/heritage] - https://gerrit.wikimedia.org/r/287290 (owner: Lokal Profil)
[09:55:25] (Merged) jenkins-bot: Standardise php whitespace to tab [labs/tools/heritage] - https://gerrit.wikimedia.org/r/287290 (owner: Lokal Profil)
[10:17:53] (CR) Jean-Frédéric: "Looks okay... We would really need a Vagrant box with some database fixtures to kind of test these." [labs/tools/heritage] - https://gerrit.wikimedia.org/r/289314 (https://phabricator.wikimedia.org/T135502) (owner: Lokal Profil)
[10:25:04] (CR) Jean-Frédéric: [C: 2] Set empty lat lon to NULL in monuments_all (and wlpa_all) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/289359 (https://phabricator.wikimedia.org/T39422) (owner: Lokal Profil)
[10:26:04] (CR) Jean-Frédéric: [C: 2] Add lang and project to statistic reports [labs/tools/heritage] - https://gerrit.wikimedia.org/r/289314 (https://phabricator.wikimedia.org/T135502) (owner: Lokal Profil)
[10:29:47] (Merged) jenkins-bot: Set empty lat lon to NULL in monuments_all (and wlpa_all) [labs/tools/heritage] - https://gerrit.wikimedia.org/r/289359 (https://phabricator.wikimedia.org/T39422) (owner: Lokal Profil)
[10:29:50] (Merged) jenkins-bot: Add lang and project to statistic reports [labs/tools/heritage] - https://gerrit.wikimedia.org/r/289314 (https://phabricator.wikimedia.org/T135502) (owner: Lokal Profil)
[10:39:31] Labs: More local storage on a wmflabs vm? - https://phabricator.wikimedia.org/T134986#2304384 (Gehaxelt) Bump? @Physikerwelt Thanks for checking this. @Andrew It would be nice if you could increase the quota for the mlp instance on the math cluster. Thanks, gehaxelt
[10:45:48] (CR) Jean-Frédéric: [C: 2] Correcting field matchings for two fr.wiki templates [labs/tools/heritage] - https://gerrit.wikimedia.org/r/287209 (owner: Lokal Profil)
[10:47:30] (Merged) jenkins-bot: Correcting field matchings for two fr.wiki templates [labs/tools/heritage] - https://gerrit.wikimedia.org/r/287209 (owner: Lokal Profil)
[10:50:09] !log tools.heritage Deployed latest from Git: 39780e2, 977c07f, 5f4532c, b7b297b (T135502 & T55688), 476267f (T39422)
[10:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.heritage/SAL, Master
[10:50:20] T55688: Statistics module uses country field instead of lang field to link to Wikipedia - https://phabricator.wikimedia.org/T55688
[10:50:21] T135502: Undefined index: project in /data/project/heritage/heritage/api/includes/FormatHtml.php - https://phabricator.wikimedia.org/T135502
[10:50:22] T39422: Lat/lon should be NULL when empty - https://phabricator.wikimedia.org/T39422
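The "Set empty lat lon to NULL" patch merged above (T39422) normalises empty coordinates so they can't be mistaken for real values. A hypothetical sketch of that kind of cleanup, run from bash with the tool's replica.my.cnf credentials; the database, table, and column names are illustrative assumptions, not taken from the actual change (which also touches wlpa_all):

    #!/bin/bash
    # Hypothetical cleanup in the spirit of T39422: empty coordinate values
    # become NULL. Database/table/column names are assumptions for illustration.
    mysql --defaults-file="$HOME/replica.my.cnf" heritage \
      -e "UPDATE monuments_all SET lat = NULL WHERE lat = '';
          UPDATE monuments_all SET lon = NULL WHERE lon = '';"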
[13:31:04] chasemp: hey, when do you want to take dumps down?
[13:31:16] I'm doing some analysis right now
[13:31:21] it'll end really soon
[13:31:55] I don't see you accessing it?
[13:32:11] but in general I was trying to sneak in what I thought was a lull window here for usage
[13:33:01] chasemp: it's job 6452924
[13:33:03] th_bwds
[13:33:32] the "dexbot" service group is accessing
[13:34:57] hm, I haven't seen that accessing dumps at all
[13:35:06] still don't, possibly /public/statistics?
[13:35:29] no, let me show the command
[13:35:31] I'm watching now and nothing is using dumps at all (and hasn't been for a few hours really)
[13:35:32] kk
[13:36:11] /data/project/dexbot/pywikibot-core/pwb.py /data/project/dexbot/pywikibot-core/scripts/dump_based_detection_beta.py /public/dumps/public/thwiki/20160407/thwiki-20160407-pages-meta-history.xml.bz2
[13:37:22] it should be accessing this file
[13:37:26] chasemp: ^
[13:37:43] yeah def, my guess is it already read the file from disk and so is not actively accessing it for some time
[13:37:54] because there is no actual activity etc
[13:38:00] https://phabricator.wikimedia.org/T134629#2298649
[13:38:09] okay
[13:38:17] maybe it's in analysis mode now
[13:38:24] I can't tell for sure
[13:39:16] I'll look on the exec node to confirm, but looking on the NFS dumps server it must be
[13:39:54] thanks :)
[13:40:11] seems alright Amir1, sorry for the short notice; I have had this on my mind for a while and thought I had a good window to slip in
[13:40:57] thank you for your great work chasemp. NFS in labs really needs love
[13:41:10] I'll re-run it later
[13:41:53] chasemp: I killed the job
[13:42:02] tell me once you're done
[13:42:05] thanks :)
[13:43:35] I'll send out a notice to -announce and ping you here if you're about
[13:43:35] np
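When a job claims to be reading a dumps file but the server side shows no traffic (as in the exchange above), it can help to check on the client which processes actually hold files open under the mount. A minimal sketch, assuming root on the exec node and the /public/dumps mount point from the transcript:

    #!/bin/bash
    # Show processes with open files under the dumps NFS mount.
    MOUNT=/public/dumps
    # -m: treat the argument as a mount point; -v: show PID, user and command
    fuser -vm "$MOUNT"
    # Alternative view, listing open files on that filesystem:
    lsof +f -- "$MOUNT"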
[14:37:57] Labs, Labs-Infrastructure, DBA, Operations, Blocked-on-Operations: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2305057 (chasemp)
[14:47:40] Labs, Labs-Sprint-115, Tool-Labs, labs-sprint-116, and 2 others: Write admission controller disabling mounting of unauthorized volumes - https://phabricator.wikimedia.org/T112718#2305068 (yuvipanda) Open>Resolved Done and deployed!
[14:47:42] Labs, Tool-Labs, Tracking: Initial Deployment of Kubernetes to Tool Labs (Tracking) - https://phabricator.wikimedia.org/T111885#2305070 (yuvipanda)
[15:04:16] Labs, Labs-Infrastructure, DBA, Operations, Blocked-on-Operations: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2305116 (jcrespo)
[15:08:10] Labs, Labs-Kubernetes, Tool-Labs, Patch-For-Review: Setup NSS inside containers used in Tool Labs - https://phabricator.wikimedia.org/T134748#2305131 (yuvipanda) We have a fairly decent solution for this now. We've setup libnss-ldapd, and nslcd won't start by default because we've suppressed auto...
[15:12:32] Labs, Labs-Infrastructure, Continuous-Integration-Scaling: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305151 (hashar)
[15:13:20] andrewbogott: YuviPanda chasemp looks like OpenStack is misbehaving. Nodepool can't spawn instances anymore :(
[15:13:21] https://phabricator.wikimedia.org/T135631
[15:13:26] {u'message': u'No valid host was found. Exceeded max scheduling attempts 3 for instance 6f07110f-4f2f-4f46-bddc-1ea30192ab02. Last exception: [u\'Traceback (most recent call last):\\n\', u\' File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2248, in _do', u'code': 500, u'created': u'2016-05-18T15:08:23Z'} |
[15:13:37] seems nova-compute has a "no valid host was found"
[15:13:42] no clue what that one means
[15:13:58] hm andrewbogott ^
[15:14:01] hashar: I'll look. It might mean that labs is full :)
[15:14:05] I'm going to restart rabbit, we'll see how that works out
[15:14:07] oh no :(
[15:14:19] although it shouldn't be
[15:14:19] thanks andrewbogott
[15:14:26] chasemp: hang on a minute, I want to see if I can reproduce
[15:14:34] Labs, Labs-Infrastructure, Continuous-Integration-Scaling: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305194 (hashar) `openstack server delete 6f07110f-4f2f-4f46-bddc-1ea30192ab02` worked fine though :)
[15:14:36] (PS1) Lokal Profil: Re-add wikitext in statistics id [labs/tools/heritage] - https://gerrit.wikimedia.org/r/289427 (https://phabricator.wikimedia.org/T55688)
[15:14:49] I really hope it is not nodepool causing some weird scaling issue on labs infra :(
[15:14:51] nodepool is alerting in -ops as well
[15:15:16] yeah I have shut it down
[15:15:42] to prevent it from potentially overloading labs infra, since nodepool repeatedly attempts to delete and spawn instances
[15:15:49] does nodepool allow throttling?
[15:15:56] hashar: these are instances of size 'small' right?
[15:16:30] m1.medium iirc
[15:16:55] ok
[15:16:59] yeah m1.medium
[15:17:30] my lame dashboard at https://grafana.wikimedia.org/dashboard/db/labs-capacity-planning doesn't show much issue with mem/disk/cpu though
[15:18:57] andrewbogott: I'm sorry, I missed your note; I had already restarted rabbit on labnet at that time but I've done nothing further
[15:19:04] ok
[15:19:41] Labs, Labs-Infrastructure, Continuous-Integration-Scaling: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305257 (hashar) I have stopped nodepool on labnodepool1001.eqiad.wmnet in case it is adding load to the OpenStack labs. To restart it: $ ssh l...
[15:20:08] this is like 3 times in a week and a half or so that nodepool has wigged out or labs instance creation has, not sure on cause and effect there
[15:21:20] right now the scheduler seems to not be talking to anyone
[15:21:42] nodepool spawning a lot of instances might highlight some issue on labs or put too much strain on the nova scheduler
[15:22:04] (CR) Jean-Frédéric: [C: 2] Re-add wikitext in statistics id [labs/tools/heritage] - https://gerrit.wikimedia.org/r/289427 (https://phabricator.wikimedia.org/T55688) (owner: Lokal Profil)
[15:22:25] at least instance deletion works
[15:26:42] chasemp: andrewbogott: I have got to commute to get the kid back home. Should be back in roughly 40 minutes
[15:26:46] I have deleted the instances in the 'contintcloud' project
[15:26:56] and nodepool is stopped on labnodepool1001.eqiad.wmnet
[15:27:23] hashar: ok
[15:27:28] hashar: how do I restart once things are working?
[15:27:31] I have poked the releng team channel about it
[15:27:37] $ ssh labnodepool1001.eqiad.wmnet
[15:27:37] $ sudo /usr/sbin/service nodepool start
[15:27:44] tail -F /var/log/nodepool/nodepool.log
[15:28:35] ok thanks
[15:29:09] Labs, Labs-Infrastructure, Continuous-Integration-Scaling: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305341 (hashar) The first failure was apparently at 14:35 UTC ``` 2016-05-18 14:35:17,112 INFO nodepool.NodePool: Need to launch 1 ci-jessie-wik...
[15:29:21] maybe it is the image that is incorrect
[15:29:32] it was auto-regenerated around 14:30, which is when the first failure occurred
[15:31:11] the snapshots are not found apparently: 2016-05-18 15:30:26,399 WARNING nodepool.NodePool: Image server id b678c2ab-8b85-499b-bc06-5d90781ce5c3 not found
[15:31:16] Labs, Labs-Infrastructure, Continuous-Integration-Scaling: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305351 (hashar) Maybe it is the images that weren't correct; I have deleted them ``` hashar@labnodepool1001:/var/log/nodepool$ nodepool image-li...
[15:32:17] I restarted nodepool
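For reference, hashar's recovery steps from this incident collected in one place; these are the exact commands quoted above and in T135631, run on labnodepool1001.eqiad.wmnet:

    #!/bin/bash
    # Inspect and restart nodepool after a spawn failure (per the log above).
    nodepool image-list                      # which snapshot images nodepool knows about
    sudo /usr/sbin/service nodepool start    # start the daemon again
    tail -F /var/log/nodepool/nodepool.log   # watch for "No valid host was found" etc.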
[15:36:32] Labs, Labs-Infrastructure, Continuous-Integration-Scaling: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305405 (hashar) I have restarted Nodepool, it is supposed to spawn instances out of yesterday's snapshots: ``` $ nodepool image-list +-----+-------...
[15:37:00] stopped nodepool again, yesterday's snapshots can't spawn instances either
[15:37:01] :(
[15:37:24] I have left the instances around in contintcloud so one could look at them.
[15:37:30] rushing, be back in roughly ~30 mins
[15:39:56] (PS2) Lokal Profil: Re-add wikitext in statistics id [labs/tools/heritage] - https://gerrit.wikimedia.org/r/289427 (https://phabricator.wikimedia.org/T55688)
[15:41:03] (CR) Lokal Profil: "ahm. I'm not sure what happens to my second patch if you have already +2:ed" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/289427 (https://phabricator.wikimedia.org/T55688) (owner: Lokal Profil)
[15:41:17] Labs, wikitech.wikimedia.org: Reset OAuth authentication for Wikitech account for Niharika29 - https://phabricator.wikimedia.org/T135518#2305456 (Krenair) That's part of a shared account with two other people... I verified with @Niharika over hangouts though
[15:41:38] (CR) Lokal Profil: "recheck" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/289427 (https://phabricator.wikimedia.org/T55688) (owner: Lokal Profil)
[15:42:40] andrewbogott -- isp issues here fyi. Could this relate to dumps nfs down and showmount blocking on vm spin up?
[15:42:52] Random thought
[15:43:03] I don't think so — the instances aren't getting scheduled in the first place
[15:43:10] it's some kind of communication issue between services as best I can tell
[15:45:33] Labs, wikitech.wikimedia.org: Reset OAuth authentication for Wikitech account for Niharika29 - https://phabricator.wikimedia.org/T135518#2305465 (Niharika) I'm pretty sure Ryan or Frances aren't trying to hack into my account. All of the tools I am part of have shared ownership, anyway. Do we still wan...
[15:45:59] hm k
[15:50:47] Labs, Labs-Infrastructure, Continuous-Integration-Scaling, Patch-For-Review: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305493 (Luke081515) p:Triage>Unbreak! Blocks Zuul.
[15:59:29] Labs, Labs-Kubernetes, Tool-Labs, Patch-For-Review: Setup NSS inside containers used in Tool Labs - https://phabricator.wikimedia.org/T134748#2305514 (yuvipanda) We do need nscd, otherwise it is too slow :(
[16:02:14] Krenair: The irc.beta.wmflabs.org RC-IRC thing, which puppet role is that? Do you know?
[16:02:33] yep
[16:03:27] you can look these things up like this: ldapsearch -x dc:dn:=deployment-ircd.deployment-prep.eqiad.wmflabs | grep puppetClass
[16:03:37] it's role::mw_rc_irc
[16:04:31] or with http://tools.wmflabs.org/watroles/variable/instancename/deployment-ircd
[16:05:42] thanks :)
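Krenair's LDAP lookup above generalises to any instance. A small helper sketch, assuming the same directory layout as the query quoted at [16:03:27]:

    #!/bin/bash
    # Usage: puppet-classes <instance> <project>
    # e.g.:  puppet-classes deployment-ircd deployment-prep
    # Prints the puppet classes applied to a labs instance, looked up via LDAP.
    instance=$1
    project=$2
    ldapsearch -x "dc:dn:=${instance}.${project}.eqiad.wmflabs" | grep puppetClass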
[16:09:11] back
[16:12:49] mutante: late answer -- https://tools.wmflabs.org/osm exists and is running lighttpd. The 404 you are seeing from its root page is due to there being no default index. See https://tools.wmflabs.org/osm/libs/openlayers/OpenLayers-patch2-10.js for a file that tool actually serves
[16:16:39] Krenair: I haven't worked with puppet variables at labs yet. So for example if I want to set the "instancename" variable, do I have to enter the variable name and value directly at Special:NovaPuppetGroup?
[16:19:33] Luke081515, not the values
[16:19:52] andrewbogott: can you take a look at OpenStack? Every time I try to spawn an instance, it goes into the "error" state
[16:19:57] Luke081515, at Special:NovaPuppetGroup you add classes and variables so they can be used by your project
[16:20:18] Krenair: Ok, but how can I set the values of the variables? Later at the instance?
[16:20:23] Luke081515, then on Special:NovaInstance you can 'configure' an instance to use those classes and set the values
[16:20:32] Luke081515: I'm working on it
[16:20:53] Luke081515, note that the variables are not really needed now that we have hira
[16:20:53] andrewbogott: thx :)
[16:20:54] hiera*
[16:21:11] ok :)
[16:21:27] but first I have to wait until I can spawn an instance where I can try it ;)
[16:22:07] Labs, Labs-Infrastructure, Continuous-Integration-Scaling: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305657 (hashar) Nodepool eventually restarted due to puppet. Horizon interface shows instances are blocked on various tasks: in Spawning, Sc...
[16:22:28] andrewbogott: for what it is worth, I get the error message: 500 No valid host was found. There are not enough hosts available.
[16:23:20] there are multiple things happening
[16:24:13] Labs: Backup files request - https://phabricator.wikimedia.org/T135014#2305665 (Mjbmr) Please remove the tool named `xmlfeed` and regenerate `replica.my.cnf` for the tools named `mjbmr-tools` and `mjbmrbot` and my personal account. Thanks.
[16:25:20] chasemp: where all did you restart rabbitmq-server?
[16:25:46] labnet1002 only
[16:26:35] hm
[16:34:46] nodepool is restarted by puppet, so it is back spamming labs infra.
[16:35:02] I am away dealing with dinner / kids etc. Will check there from time to time
[16:39:09] hasharAway: are you sure that nodepool isn't still trying to schedule things?
[16:39:16] Quite a flood of scheduling requests over here
[16:39:17] it is
[16:39:19] restarted
[16:39:21] by puppet
[16:39:41] could you disable puppet on labnodepool1001.eqiad.wmnet?
[16:39:48] sure
[16:40:13] stopped it manually
[16:40:22] thanks, I disabled puppet
[16:40:36] Things might be working now but I want to give everyone a chance to catch up
[16:40:42] there was such a backlog of schedule requests...
[16:41:45] yeah, I can imagine nodepool has been quite spammy and overloaded whatever queue is used :-(
[16:44:40] bd808: oh! thank you for that. Would it make sense to Redirect those requests to somewhere else?
[16:46:51] Labs, Labs-Infrastructure, Continuous-Integration-Scaling, Patch-For-Review: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305151 (Andrew) 2016-05-18 16:45:29.980 5375 ERROR nova.compute.manager [req-86afc675-0c57-44f4-a164-e1a8320c845b novaadmin...
[16:48:17] mutante: *shrug* maybe. I have no idea what that tool is actually doing. It looks like a static file dump for something related to open street maps
[16:48:23] !log ores running puppet agent on ores-lb-02 manually
[16:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master
[16:51:14] !log tools.xtools Restarted the webservice for xtools-ec as it was returning 502s again.
[16:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.xtools/SAL, Master
[16:55:41] bd808: I think I'll just Redirect to https://meta.wikimedia.org/wiki/OpenStreetMap and move on :)
[16:57:41] Labs, wikitech.wikimedia.org: Reset OAuth authentication for Wikitech account for Niharika29 - https://phabricator.wikimedia.org/T135518#2305850 (Krenair) Open>Resolved a:Krenair We called just now and I reset 2FA for @Niharika
[17:35:30] eh, so, what's the status of nodepool things? Still seems very broken in a way I've not seen...
[17:35:44] thcipriani: should be working as of a minute or two ago...
[17:36:47] andrewbogott: okie doke, there do seem to be some new instances building now that I look; hopefully the zuul queue starts moving soon.
[17:37:19] oh it is moving, amazing :D
[17:46:41] Labs, Labs-Infrastructure, Continuous-Integration-Scaling, Patch-For-Review: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2306116 (Andrew) Open>Resolved a:Andrew This is resolved now, and I don't know what went wrong :(
[17:47:08] Is there a way to give write access on a tool to another tool without giving it to all tools? For example, could it be set that tool x can access tool y's directories?
[17:48:01] tom29739: I think you can make a tool a member of a tool, using the same method you'd use to add a user to a tool
[17:51:57] I would do that, but I get 'No results match "tom29739-testing"' for any input in the service user box when managing maintainers on a tool. I don't think that's intended.
[17:52:39] Labs, Tool-Labs, Operations, Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2306131 (Dzahn) - deleted toolserver.org.key in private repo - deleted certs and .key and .keyold on instance, /etc/ssl...
[17:55:08] tom29739: andrewbogott you can theoretically do that, but that feature of OSM has been broken for a while
[17:55:16] I filed a ticket
[17:55:26] damn
[17:55:33] https://phabricator.wikimedia.org/T128400
[17:55:59] So I can't do what I want to do?
[17:56:02] tom29739: andrewbogott admins can do that manually atm, so if you need it, create a ticket and we'll do it
[17:58:45] Labs, Tool-Labs, Operations, Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2306161 (Dzahn) >>! In T134798#2303651, @Dzahn wrote: > @Nemo_bis Possible, but we'll need it in DNS first, then Apache...
[18:43:36] doctaxon: hmm, thanks for bringing it to my notice, am investigating it now
[18:43:40] I see ~200 jobs in qw
[18:44:07] hmm, no queues or grids in error state
[18:44:52] but what could be the reason?
[18:45:11] I'm investigating :)
[18:45:12] !log toolserver-legacy restart Apache, adding wiki. alias
[18:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolserver-legacy/SAL, Master
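A quick way to reproduce the "~200 jobs in qw" observation above is to tally grid jobs by state. A minimal sketch, assuming the SGE client tools available on the tools bastions; qstat prints two header lines, then one row per job with the state in column 5:

    #!/bin/bash
    # Count grid engine jobs by state (r = running, qw = queued/waiting, ...).
    qstat -u '*' | awk 'NR > 2 { state[$5]++ } END { for (s in state) print s, state[s] }'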
[18:47:45] YuviPanda: I can submit a basic job and it runs fine...
[18:47:54] hmm
[18:48:33] chasemp: did you run it on precise or trusty?
[18:48:51] what's the default? trusty I think then
[18:48:56] yeah
[18:48:58] hmm
[18:49:16] hm, I see some trusty stuff there too
[18:50:46] hm
[18:50:46] so
[18:50:46] just doing jsub on something
[18:50:48] I still get precise
[18:50:54] and can run a job that dumps basic info
[18:50:56] runs on tools-webgrid-lighttpd-1205.tools.eqiad.wmflabs
[18:50:58] yeah, you need to pass -l release=trusty
[18:51:00] ah
[18:51:02] to jsub
[18:51:05] to get trusty
[18:51:55] did so, qw
[18:52:02] trusty nodes are overwhelmed?
[18:52:14] I use something like this: jsub -once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 4g -l release=trusty ld.tcl
[18:52:36] is this wrong?
[18:52:54] Hi, my tool is 503 and webservice start doesn't help
[18:53:09] my tool is giving*
[18:53:16] Any advice?
[18:53:20] queue instance "webgrid-lighttpd@tools-webgrid-lighttpd-1402.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=3.180000 (= 3.180000 + 0.50 * 0.000000 with nproc=4) >= 2.75
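For reference, the jsub flags from doctaxon's invocation above, since the default queue was still Precise at this point: -once refuses to start a second copy of the job, -j y merges stderr into stdout, -quiet suppresses informational output, -mem sets the memory limit, -v passes an environment variable through, and -l release=trusty is the part that requests a Trusty exec node. A sketch with a hypothetical payload in place of ld.tcl:

    #!/bin/bash
    # Submit a one-shot job to the Trusty half of the grid (flags as discussed above).
    # ./my-script.sh is a hypothetical payload; doctaxon's was ld.tcl.
    jsub -once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 4g -l release=trusty ./my-script.sh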
[18:55:55] I guess that line is related to my problem...
[18:56:03] chasemp: yeah, from http://tools.wmflabs.org/?status looks like it
[18:56:08] jem: yeah, probably. stand by
[18:56:24] anomie: ping?
[18:57:01] YuviPanda: pong?
[18:57:18] * jem stands by :)
[18:57:19] anomie: do a lot of the anomiebots read dumps?
[18:57:41] anomie: they're causing super high CPU usage now in a bunch of nodes, and I *think* that's because dumps is unavailable right now.
[18:57:53] YuviPanda: I don't have any tasks that read dumps. Most just query the API, some of the newer ones use queries against the DB replica.
[18:57:56] hmm
[18:58:12] ok that's good to know
[18:59:26] chasemp: I'm actually unsure where the load is coming from
[18:59:49] so I have a fair idea someone is doing something innocuous that is choking on dumps, yeah
[18:59:50] chasemp: lots of CPU use of kworker and rcu_sched and nothing much else :| (on tools-exec-1410)
[19:00:02] chasemp: but across all the nodes?
[19:00:16] YuviPanda: All AnomieBOT jobs, or just some?
[19:00:18] YuviPanda: yeah, odd
[19:00:23] give me a sec here to try one thing
[19:00:24] chasemp: some had anomiebot using a good chunk of CPU, but he says they don't use dumps
[19:00:26] chasemp: kk
[19:00:43] anomie: I was just looking at some, but think it's a red herring now.
[19:01:50] so all the precise hosts seem fine
[19:01:55] the trusty ones have gone bonkers
[19:02:21] so I ensured all is ro and basically enabled an empty dumps share
[19:02:22] and I did
[19:02:36] fuser -k /public/dumps
[19:02:36] umount -f /public/dumps
[19:02:45] mount -o remount /public/dumps
[19:02:51] and it seems to have dropped load a lot on
[19:03:02] yeah
[19:03:05] tools-webgrid-lighttpd-1402.tools.eqiad.wmflabs
[19:03:45] not all tho
[19:03:51] 1408
[19:03:55] -exec-1408
[19:03:59] 19:03:48 up 55 days, 22:53, 1 user, load average: 13.12, 13.06, 12.46
[19:04:03] same for tools-webgrid-lighttpd-1406
[19:04:06] but
[19:04:09] %Cpu(s): 0.2 us, 3.8 sy, 16.7 ni, 77.6 id, 0.0 wa, 0.0 hi, 1.7 si, 0.0 st
[19:04:10] I've only done two nodes
[19:04:10] oh
[19:04:14] aaah, right
[19:04:16] ok
[19:04:21] YuviPanda: release=precise does run the cronjobs
[19:04:27] both are dropping so we should hit them all I think?
[19:04:33] doctaxon: yes, but please don't do that. Just wait a few minutes and we'll get this back to working.
[19:04:38] and I'll have to remount again post-resize
[19:04:42] chasemp: yeah
[19:04:44] thank you
[19:04:54] chasemp: do you have the magic handy to do it?
[19:05:13] YuviPanda: umount -f /public/dumps && mount -o remount /public/dumps
[19:05:28] unless that says it's in use
[19:05:29] and maybe
[19:05:29] fuser -k /public/dumps
[19:05:37] kind of deal
[19:07:16] I mean to run it on all hosts :D
[19:07:26] chasemp: do you want me to run it on all the trusty execs?
[19:07:27] all tools hosts or all hosts?
[19:07:33] all tools ones I think
[19:07:34] it's affecting precise too
[19:07:38] oh ok
[19:07:52] chasemp: are you running it on all tools hosts or shall I?
[19:08:05] I don't have anything parallel set up, but I've a small helper script that xargs ssh
[19:08:09] I thought you were in the middle so I was holding off
[19:08:17] ah, we clashed there
[19:08:19] I don't think I have a current list of all tools
[19:08:19] I was asking
[19:08:21] ok
[19:08:21] heh
[19:08:23] I'll do it
[19:08:26] k
[19:09:18] chasemp: running it now
[19:09:21] k
[19:10:59] sorry, this resize is taking forever, so here we are
[19:11:40] PROBLEM - Puppet run on tools-webgrid-lighttpd-1406 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[19:13:54] Labs, Tool-Labs, Operations, Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2306476 (Dzahn) done. the cert has additional SANs now, "wiki" and "stable"
[19:14:02] chasemp: hmm, lots of umount.nfs: /public/dumps: device is busy
[19:14:10] chasemp: on tools-exec-1405.tools.eqiad.wmflabs for example
[19:14:20] -f?
[19:15:04] PROBLEM - Puppet run on tools-exec-1215 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[19:16:23] oh, it already had a -f
[19:17:37] PROBLEM - Puppet run on tools-exec-1403 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[19:18:15] PROBLEM - Puppet run on tools-exec-1402 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[19:19:19] YuviPanda: timeout 10s fuser -k /public/dumps; umount -f /public/dumps && mount -t nfs labstore1003.eqiad.wmnet:/srv/dumps /public/dumps
[19:19:33] maybe
[19:19:58] chasemp: btw, should we also take this opportunity to switch dumps from hard to soft mounted?
[19:20:11] that was ironically next up on my nfs client things :)
[19:20:16] should have done it first, now I realize
[19:21:04] heh
[19:21:07] PROBLEM - Puppet run on tools-webgrid-lighttpd-1210 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[19:21:37] I guess
[19:21:38] timeout 10s fuser -k /public/dumps; umount -f /public/dumps; mount -t nfs labstore1003.eqiad.wmnet:/srv/dumps /public/dumps
[19:21:38] Labs, Labs-Infrastructure, Continuous-Integration-Scaling, Patch-For-Review: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2306486 (hashar) ``` File "/usr/lib/python2.7/dist-packages/libvirt.py", line 896, in if ret == -1: raise libvirtError ('...
[19:21:39] is better
[19:21:41] PROBLEM - Puppet run on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[19:21:45] silly mount and its exit codes
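chasemp's recipe above, wrapped as a script for one client; the same thing was then pushed to every tools host with an xargs-ssh helper. A minimal sketch using only the commands and mount source quoted above; note that fuser -k kills whatever still holds the mount, so it is not gentle:

    #!/bin/bash
    # Force-remount a hung NFS dumps mount (per chasemp's one-liner above).
    MOUNT=/public/dumps
    SOURCE=labstore1003.eqiad.wmnet:/srv/dumps
    # Kill anything holding files open under the mount; give up after 10s.
    timeout 10s fuser -k "$MOUNT"
    # Force-unmount, then mount fresh. ';' rather than '&&' so the mount is
    # attempted even when umount grumbles ("silly mount and its exit codes").
    umount -f "$MOUNT"
    mount -t nfs "$SOURCE" "$MOUNT"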
[19:22:43] PROBLEM - Puppet run on tools-exec-1205 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0]
[19:22:49] chasemp: doesn't that mount it without any of the options we used?
[19:23:04] it's not a real mount
[19:23:09] or at least there is nothing there and it's ro
[19:23:14] just to get past the stupidity
[19:23:22] I'll have to remount it anyways
[19:23:27] as it will be on a diff fs server side
[19:23:33] the vnode stuff doesn't handle that
[19:23:42] ah right
[19:23:44] fair enough
[19:23:48] sorry, it's a real mount, but option-wise it's not an issue
[19:24:19] nah, makes sense
[19:24:27] I'm not sure what it'll do to puppet tho
[19:24:29] but running it now
[19:25:28] chasemp: that seems to work tho
[19:28:04] chasemp: almost all the qw jobs are gone tho
[19:28:09] jem: is your webservice back online?
[19:28:14] doctaxon: your services should also be running now
[19:28:34] okay, mom
[19:28:48] runs best
[19:29:06] * YuviPanda makes doctaxon eat their vegetables
[19:30:10] Krenair: Do you know what I have to set up after applying that IRC RC role?
[19:30:16] YuviPanda - what do you mean?
[19:30:27] vegetables?
[19:30:43] doctaxon: you mentioned mom, I think it is a stereotypical thing moms are supposed to do
[19:30:49] make people who call them moms eat vegetables
[19:31:56] YuviPanda: so puppet doesn't seem to care
[19:32:01] that the mount is "wrong"
[19:32:02] so that's fun
[19:32:14] :D
[19:34:21] chasemp: all good now I think
[19:34:38] all seems well until dumps finishes resizing
[19:34:38] thanks man
[19:34:51] afaik it's still chugging away fine, it's just a huge volume
[19:36:44] chasemp: kk. eta?
[19:36:54] like hours? will it finish today you think?
[19:36:58] is there even an eta screen?
[19:37:03] I'm not sure, nope
[19:37:09] ok
[19:37:40] I naively thought hour(s)
[19:37:46] we'll see
[19:38:02] :D ok
[19:42:45] RECOVERY - Puppet run on tools-exec-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:42:45] RECOVERY - Puppet run on tools-exec-1205 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:48:17] RECOVERY - Puppet run on tools-exec-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:49:04] Hey all. Is there any known issue with the wsexport tool at the moment? Appears unresponsive.
[19:49:14] YuviPanda: Yes, it's back, thanks :)
[19:50:43] Labs, Horizon: TGR unable to login on Horizon - https://phabricator.wikimedia.org/T131630#2306704 (Andrew)
[19:51:20] Labs, Horizon: TGR unable to login on Horizon - https://phabricator.wikimedia.org/T131630#2173147 (Andrew) a:jcrespo Chris suggests that something should be done to the 'users' table on silver. So... over to you, Jaime.
[19:52:05] sldr: maybe? it appears to be running but throws 2016-05-18 19:51:20: (server.c.1444) [note] sockets disabled, connection limit reached
[19:52:40] restarted it
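The wsexport symptom above is a fairly common lighttpd failure mode on the grid: the error log fills with "sockets disabled, connection limit reached" and the proxy answers 502/503 until the webservice is bounced. A minimal diagnostic sketch, run as the tool account; the error-log path is an assumption, not confirmed by this log:

    #!/bin/bash
    # Check a tool's lighttpd error log for the connection-limit note and
    # bounce the webservice if present. The log path is an assumption.
    ERRLOG="$HOME/error.log"
    if grep -q 'connection limit reached' "$ERRLOG"; then
        webservice restart    # the turn-it-off-and-on-again from the log above
    fi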
[19:53:06] chasemp: That would explain why I suddenly got Bad Gateway, I suppose. I'll try again in a second.
[19:54:30] chasemp: Tried again just now. Works one out of one times. Turning it off and on again saves the day again?
[19:54:59] maybe so :)
[19:55:42] Is it a fantastically bad idea to run a batch job against that tool? Wouldn't want to break it.
[19:56:06] RECOVERY - Puppet run on tools-webgrid-lighttpd-1210 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:58:31] sldr: I have no idea
[19:58:46] chasemp: So... Let's find out?
[20:49:34] How can I find out the lighttpd version on tools? I tried 'lighttpd' in the console, and it returned command not found. The HTTP header returns nginx, which I presume is the proxy
[20:50:30] tom29739: it's just the lighttpd version Ubuntu Trusty has
[20:50:46] tom29739: you can also ssh to tools-webgrid-lighttpd-1401 (or any such node) and run the same commands to find out
[20:51:49] who knows how to set up the role::mw_rc_irc?
[20:51:59] Luke081515: I do
[20:52:07] ok
[20:52:10] it's already there
[20:52:21] in project. eh.. "irc"
[20:53:15] or you can apply the role in another one
[20:53:37] basically go to "puppet groups" and add the role class, so you can select it when you "configure" an instance
[20:54:01] since we just recently fixed some issues for that and it has fake secrets in labs/private etc., it should simply work :)
[21:03:44] YuviPanda: not much I can do atm for dumps, it's resizing as we speak and that's kind of a volatile state
[21:04:11] chasemp: yeah, understood
[21:04:18] chasemp: do we have a plan B for it?
[21:04:43] I mean if it never comes back we can just wipe it out, recreate lv's and repopulate from dumps
[21:04:52] it's not the primary copy of any data, so
[21:04:57] but that's potentially longer
[21:05:03] right
[21:05:05] kk
[21:09:24] !log ircd added Luke08515 as user and admin
[21:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ircd/SAL, Master
[21:30:40] chasemp: So for the record, I ran a quick batch job which exported some 140 pages as quickly as it could. No obvious error.
[21:30:53] sldr: nice :)
[22:13:28] Labs, Tool-Labs, Living-Style-Guide, Reading-Web-Backlog: npm version on tools-login.wmflabs.org is incompatible with MobileFrontend package.json used by the KSS styleguide - https://phabricator.wikimedia.org/T89093#2307354 (Danny_B)
[23:32:00] PROBLEM - Puppet run on tools-bastion-10 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]