[00:56:01] <wikibugs>	 06Labs, 10Tool-Labs, 06Operations, 13Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2303651 (10Dzahn) @Nemo_bis Possible, but we'll need it in DNS first, then Apache config for wiki.toolserver to work.. LE...
[01:04:34] <wikibugs>	 06Labs, 10Tool-Labs: Make http (404, 302, 301 etc) statistics for toolserver.org - https://phabricator.wikimedia.org/T85167#2303689 (10Dzahn) grep **fisheye.toolserver.org** /var/log/apache2/access.log  | wc -l **8124**  (2016-05-17T06:43 - 2016-05-18T01:04)
[01:09:26] <mutante>	 normal 404 page for a tool that doesnt exist: https://tools.wmflabs.org/netaction
[01:09:47] <mutante>	 totally different 404 page for a tool that doesnt exist, but only this: https://tools.wmflabs.org/osm
[01:17:26] <wikibugs>	 06Labs, 10Tool-Labs: toollabs: tool "unblock" not working - https://phabricator.wikimedia.org/T135578#2303705 (10Dzahn)
[01:18:45] <wikibugs>	 06Labs, 10Tool-Labs: toollabs: tool "wikifeeds" not working - https://phabricator.wikimedia.org/T135579#2303718 (10Dzahn)
[01:23:33] <wikibugs>	 06Labs, 10Tool-Labs: toollabs: tool "wikifeeds" not working - https://phabricator.wikimedia.org/T135579#2303748 (10Dzahn) I noticed this in relation to T85167, there are still hits (404s) for ~wikifeeds on old toolserver.org URLs, so i redirected them to over here, then saw it's not working.
[01:23:41] <wikibugs>	 06Labs, 10Tool-Labs: toollabs: tool "unblock" not working - https://phabricator.wikimedia.org/T135578#2303751 (10Dzahn) I noticed this in relation to T85167, there are still hits (404s) for ~wikifeeds on old toolserver.org URLs, so i redirected them to over here, then saw it's not working.
[01:54:08] <wikibugs>	 06Labs: raise quota limit for project video - https://phabricator.wikimedia.org/T135560#2303109 (10zhuyifei1999) m1.small or m1.medium should be sufficient.
[02:36:48] <wikibugs>	 06Labs, 10wikitech.wikimedia.org: Reset OAuth authentication for Wikitech account for Niharika29 - https://phabricator.wikimedia.org/T135518#2303792 (10Niharika) >>! In T135518#2303356, @Krenair wrote: > In the mean time please could you create a file somewhere in labs (bastion or tools projects are best) that...
[03:13:37] <TerraCodes>	 could an admin please help with https://phabricator.wikimedia.org/T132988?
[03:47:12] <wikibugs>	 10Tool-Labs-tools-Other: toollabs: tool "unblock" not working - https://phabricator.wikimedia.org/T135578#2303828 (10yuvipanda)
[03:47:26] <wikibugs>	 10Tool-Labs-tools-Other: toollabs: tool "wikifeeds" not working - https://phabricator.wikimedia.org/T135579#2303829 (10yuvipanda)
[04:04:28] <wikibugs>	 06Labs, 10Tool-Labs-tools-Other: `fr-wikiversity` Tool should get deleted - https://phabricator.wikimedia.org/T133778#2303832 (10TerraCodes)
[05:11:27] <wikibugs>	 06Labs, 10Tool-Labs, 13Patch-For-Review: Make http (404, 302, 301 etc) statistics for toolserver.org - https://phabricator.wikimedia.org/T85167#2303833 (10Dzahn) >>! In T85167#2303689, @Dzahn wrote: > grep **fisheye.toolserver.org** /var/log/apache2/access.log  | wc -l > **8124** >  > (2016-05-17T06:43 - 201...
[05:42:30] <shinken-wm>	 PROBLEM - Puppet run on tools-k8s-master-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[08:07:54] <grrrit-wm>	 (03PS1) 10Lokal Profil: Set empty lat lon to NULL in monuments_all (and wlpa_all) [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/289359 (https://phabricator.wikimedia.org/T39422) 
[08:28:09] <wikibugs>	 06Labs: Investigate labnet1002 kernel panic - https://phabricator.wikimedia.org/T135322#2304120 (10MoritzMuehlenhoff) No idea, that happened somewhere deep in memory management internals. If it happens again let's run a memory check. On the plus side, with the reboot labnet1002 uses a much more recent kernel now.
[08:34:52] <shinken-wm>	 PROBLEM - Host tools-bastion-01 is DOWN: CRITICAL - Host Unreachable (10.68.17.228)
[09:28:40] <wikibugs>	 06Labs, 10Labs-Infrastructure, 10DBA, 06Operations, 07Blocked-on-Operations: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2304213 (10akosiaris) > @jcrespo this is a bit shrouded in mystery with no documentation.  It seems post replication someone would run [[ https://phabricator.wik...
[09:53:09] <grrrit-wm>	 (03CR) 10Jean-Frédéric: [C: 032] Standardise php whitespace to tab [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/287290 (owner: 10Lokal Profil)
[09:55:25] <grrrit-wm>	 (03Merged) 10jenkins-bot: Standardise php whitespace to tab [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/287290 (owner: 10Lokal Profil)
[10:17:53] <grrrit-wm>	 (03CR) 10Jean-Frédéric: "Looks okay... We would really need a Vagrant box with some database fixtures to kind of test these." [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/289314 (https://phabricator.wikimedia.org/T135502) (owner: 10Lokal Profil)
[10:25:04] <grrrit-wm>	 (03CR) 10Jean-Frédéric: [C: 032] Set empty lat lon to NULL in monuments_all (and wlpa_all) [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/289359 (https://phabricator.wikimedia.org/T39422) (owner: 10Lokal Profil)
[10:26:04] <grrrit-wm>	 (03CR) 10Jean-Frédéric: [C: 032] Add lang and project to statistic reports [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/289314 (https://phabricator.wikimedia.org/T135502) (owner: 10Lokal Profil)
[10:29:47] <grrrit-wm>	 (03Merged) 10jenkins-bot: Set empty lat lon to NULL in monuments_all (and wlpa_all) [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/289359 (https://phabricator.wikimedia.org/T39422) (owner: 10Lokal Profil)
[10:29:50] <grrrit-wm>	 (03Merged) 10jenkins-bot: Add lang and project to statistic reports [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/289314 (https://phabricator.wikimedia.org/T135502) (owner: 10Lokal Profil)
[10:39:31] <wikibugs>	 06Labs: More local storage on a wmflabs vm? - https://phabricator.wikimedia.org/T134986#2304384 (10Gehaxelt) Bump? @Physikerwelt Thanks for checking this.   @Andrew It would be nice if you could increase the quota for the mlp instance on the math cluster.   Thanks, gehaxelt
[10:45:48] <grrrit-wm>	 (03CR) 10Jean-Frédéric: [C: 032] Correcting field matchings for two fr.wiki templates [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/287209 (owner: 10Lokal Profil)
[10:47:30] <grrrit-wm>	 (03Merged) 10jenkins-bot: Correcting field matchings for two fr.wiki templates [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/287209 (owner: 10Lokal Profil)
[10:50:09] <JeanFred>	 !log tools.heritage Deployed latest from Git: 39780e2, 977c07f, 5f4532c, b7b297b (T135502 & T55688), 476267f (T39422)
[10:50:14] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.heritage/SAL, Master
[10:50:20] <stashbot>	 T55688: Statistics module uses country field instead of lang field to link to Wikipedia - https://phabricator.wikimedia.org/T55688
[10:50:21] <stashbot>	 T135502: Undefined index: project in /data/project/heritage/heritage/api/includes/FormatHtml.php - https://phabricator.wikimedia.org/T135502
[10:50:22] <stashbot>	 T39422: Lat/lon should be NULL when empty - https://phabricator.wikimedia.org/T39422
[13:31:04] <Amir1>	 chasemp: hey, When do you want to start making dumps down?
[13:31:16] <Amir1>	 I'm doing some analysis right now
[13:31:21] <Amir1>	 it'll end really soon
[13:31:55] <chasemp>	 I don't see you accessing it?
[13:32:11] <chasemp>	 but in general I was trying to sneak in what I thought was a lull window here for usage
[13:33:01] <Amir1>	 chasemp: it's 6452924 job
[13:33:03] <Amir1>	 th_bwds
[13:33:32] <Amir1>	 "dexbot" service group is accessing 
[13:34:57] <chasemp>	 hm I haven't seen that accessing dumps at all
[13:35:06] <chasemp>	 still don't, possibly /public/statistics?
[13:35:29] <Amir1>	 no, let me show the command
[13:35:31] <chasemp>	 I'm watching now and it's nothing is using dumps at all (and hasn't been for a few hours really)
[13:35:32] <chasemp>	 kk
[13:36:11] <Amir1>	  /data/project/dexbot/pywikibot-core/pwb.py /data/project/dexbot/pywikibot-core/scripts/dump_based_detection_beta.py /public/dumps/public/thwiki/20160407/thwiki-20160407-pages-meta-history.xml.bz2
[13:37:22] <Amir1>	 it should be accessing this file
[13:37:26] <Amir1>	 chasemp: ^
[13:37:43] <chasemp>	 yeah def, my guess is it already read the file from disk and so is not actively accesing it for some time
[13:37:54] <chasemp>	 because there is not actual activity etc
[13:38:00] <Amir1>	 https://phabricator.wikimedia.org/T134629#2298649
[13:38:09] <Amir1>	 okay
[13:38:17] <Amir1>	 maybe it's on the analysis mode now
[13:38:24] <Amir1>	 I can't tell for sure
[13:39:16] <chasemp>	 I'll look on the exec node to confirm but looking on teh nfs dumps server it must be
[13:39:54] <Amir1>	 thanks :)
[13:40:11] <chasemp>	 seems alright Amir1 sorry for the short notice, I have had this on my mind for awhile and thought I had a good window to slip in
[13:40:57] <Amir1>	 thank you for your great work chasemp. NFS in labs really needs love
[13:41:10] <Amir1>	 I re-run it do it later 
[13:41:53] <Amir1>	 chasemp: I killed the job
[13:42:02] <Amir1>	 tell me once you're done
[13:42:05] <Amir1>	 thanks :)
[13:43:35] <chasemp>	 I'll send out a notice to -announce and ping you here if you're about
[13:43:35] <chasemp>	 np
[14:37:57] <wikibugs>	 06Labs, 10Labs-Infrastructure, 10DBA, 06Operations, 07Blocked-on-Operations: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2305057 (10chasemp)
[14:47:40] <wikibugs>	 06Labs, 10Labs-Sprint-115, 10Tool-Labs, 10labs-sprint-116, and 2 others: Write admission controller disabling mounting of unauthorized volumes - https://phabricator.wikimedia.org/T112718#2305068 (10yuvipanda) 05Open>03Resolved Done and deployed!
[14:47:42] <wikibugs>	 06Labs, 10Tool-Labs, 07Tracking: Initial Deployment of Kubernetes to Tool Labs (Tracking) - https://phabricator.wikimedia.org/T111885#2305070 (10yuvipanda)
[15:04:16] <wikibugs>	 06Labs, 10Labs-Infrastructure, 10DBA, 06Operations, 07Blocked-on-Operations: No replica for adywiki - https://phabricator.wikimedia.org/T135029#2305116 (10jcrespo)
[15:08:10] <wikibugs>	 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: Setup NSS inside containers used in Tool Labs - https://phabricator.wikimedia.org/T134748#2305131 (10yuvipanda) We have a fairly decent solution for this now. We've setup libnss-ldapd, and nslcd won't start by default because we've suppressed auto...
[15:12:32] <wikibugs>	 06Labs, 10Labs-Infrastructure, 05Continuous-Integration-Scaling: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305151 (10hashar)
[15:13:20] <hashar>	 andrewbogott: YuviPanda chasemp looks like OpenStack misbehave.   Nodepool can't spawn instances anymore :(
[15:13:21] <hashar>	 https://phabricator.wikimedia.org/T135631
[15:13:26] <hashar>	 {u'message': u'No valid host was found. Exceeded max scheduling attempts 3 for instance 6f07110f-4f2f-4f46-bddc-1ea30192ab02. Last exception: [u\'Traceback (most recent call last):\\n\', u\'  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2248, in _do', u'code': 500, u'created': u'2016-05-18T15:08:23Z'} |
[15:13:37] <hashar>	 seems nova-compute has a "no valid host was found"
[15:13:42] <hashar>	 no clue what that one means
[15:13:58] <chasemp>	 hm andrewbogott^
[15:14:01] <andrewbogott>	 hashar: I'll look.  It might mean that labs is full :)
[15:14:05] <chasemp>	 I'm goign to restart rabbit we'll see how that works out
[15:14:07] <hashar>	 oh no :(
[15:14:19] <andrewbogott>	 although it shouldn't be
[15:14:19] <chasemp>	 thanks andrewbogott 
[15:14:26] <andrewbogott>	 chasemp: hang on a minute, I want to see if I can reproduce
[15:14:34] <wikibugs>	 06Labs, 10Labs-Infrastructure, 05Continuous-Integration-Scaling: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305194 (10hashar) `openstack server delete 6f07110f-4f2f-4f46-bddc-1ea30192ab02`  worked fine though :)
[15:14:36] <grrrit-wm>	 (03PS1) 10Lokal Profil: Re-add wikitext in statistics id [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/289427 (https://phabricator.wikimedia.org/T55688) 
[15:14:49] <hashar>	 I really hope it is not nodepool causing some weird scaling issue on labs infra :(
[15:14:51] <chasemp>	 nodepool is alerting in -ops as well
[15:15:16] <hashar>	 yeah I have shut it down
[15:15:42] <hashar>	 to prevent it from potentially overloading labs infra  since nodepool repeatedly attempt to delete and spawn instances
[15:15:49] <chasemp>	 does nodepool allow throttling?
[15:15:56] <andrewbogott>	 hashar: these are instances of size 'small' right?
[15:16:30] <hashar>	 m1.medium iirc
[15:16:55] <andrewbogott>	 ok
[15:16:59] <hashar>	 yeah m1.medium
[15:17:30] <hashar>	 my lame dashboard at https://grafana.wikimedia.org/dashboard/db/labs-capacity-planning  doesnt show much issue with mem/disk/cpu though
[15:18:57] <chasemp>	 andrewbogott: I'm sorry I missed your note I had already restarted rabbit on labnet at that time but I've done nothign further
[15:19:04] <andrewbogott>	 ok
[15:19:41] <wikibugs>	 06Labs, 10Labs-Infrastructure, 05Continuous-Integration-Scaling: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305257 (10hashar) I have stopped nodepool on labnodepool1001.eqiad.wmnet in case it is adding load to the OpenStack labs.  To restart it:   $ ssh l...
[15:20:08] <chasemp>	 this is like 3 times in a week and half or so nodepool has wigged out or labs creation, not sure on cause and effect there
[15:21:20] <andrewbogott>	 right now the scheduler seems to not be talking to anyone
[15:21:42] <hashar>	 nodepool spawning a lot of instances, it might highlights some issue on labs   or put too much strain on the nova scheduler
[15:22:04] <grrrit-wm>	 (03CR) 10Jean-Frédéric: [C: 032] Re-add wikitext in statistics id [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/289427 (https://phabricator.wikimedia.org/T55688) (owner: 10Lokal Profil)
[15:22:25] <hashar>	 at least instance deletion works
[15:26:42] <hashar>	 chasemp: andrewbogott: I have got to commute to get kid back home.  Should be back in roughly 40 minutes
[15:26:46] <hashar>	 I have deleted the instances in the 'contintcloud' project
[15:26:56] <hashar>	 and nodepool is stopped on labnodepool1001.eqiad.wmnet
[15:27:23] <andrewbogott>	 hashar: ok
[15:27:28] <andrewbogott>	 hashar: how do I restart once things are working?
[15:27:31] <hashar>	 I have poked the releng team channel about it
[15:27:37] <hashar>	 $ ssh labnodepool1001.eqiad.wmnet
[15:27:37] <hashar>	 $ sudo /usr/sbin/service nodepool start
[15:27:44] <hashar>	 tail -F /var/log/nodepool/nodepool.log
[15:28:35] <andrewbogott>	 ok thanks
[15:29:09] <wikibugs>	 06Labs, 10Labs-Infrastructure, 05Continuous-Integration-Scaling: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305341 (10hashar) The first failure was apparently at 14:35 UTC  ``` 2016-05-18 14:35:17,112 INFO nodepool.NodePool: Need to launch 1 ci-jessie-wik...
[15:29:21] <hashar>	 maybe it is the image that is incorrect
[15:29:32] <hashar>	 it is auto regenerated around 14:30 which is when the first failure occured
[15:31:11] <hashar>	 the snapshot are not found apparently 2016-05-18 15:30:26,399 WARNING nodepool.NodePool: Image server id b678c2ab-8b85-499b-bc06-5d90781ce5c3 not found
[15:31:16] <wikibugs>	 06Labs, 10Labs-Infrastructure, 05Continuous-Integration-Scaling: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305351 (10hashar) Maybe that is the images that weren't correct I have deleted them ``` hashar@labnodepool1001:/var/log/nodepool$ nodepool image-li...
[15:32:17] <hashar>	 I restarted nodepool
[15:36:32] <wikibugs>	 06Labs, 10Labs-Infrastructure, 05Continuous-Integration-Scaling: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305405 (10hashar) I have restarted Nodepool, it is supposed to spawn instances out of yesterday snapshots: ``` $ nodepool image-list +-----+-------...
[15:37:00] <hashar>	 stopped nodepool again,  yesterday snapshot can't spawn instance either
[15:37:01] <hashar>	  :(
[15:37:24] <hashar>	 I have left the instances around in contintcloud so one could look at them.  
[15:37:30] <hashar>	 rushing, be back in roughly ~ 30 mins
[15:39:56] <grrrit-wm>	 (03PS2) 10Lokal Profil: Re-add wikitext in statistics id [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/289427 (https://phabricator.wikimedia.org/T55688) 
[15:41:03] <grrrit-wm>	 (03CR) 10Lokal Profil: "ahm. Im' sure what happens to my second patch if you have already +2:ed" [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/289427 (https://phabricator.wikimedia.org/T55688) (owner: 10Lokal Profil)
[15:41:17] <wikibugs>	 06Labs, 10wikitech.wikimedia.org: Reset OAuth authentication for Wikitech account for Niharika29 - https://phabricator.wikimedia.org/T135518#2305456 (10Krenair) That's part of a shared account with two other people... I verified with @Niharika over hangouts though
[15:41:38] <grrrit-wm>	 (03CR) 10Lokal Profil: "recheck" [labs/tools/heritage] - 10https://gerrit.wikimedia.org/r/289427 (https://phabricator.wikimedia.org/T55688) (owner: 10Lokal Profil)
[15:42:40] <chasemp>	 andrewbogott -- isp issues here fyi.  Could this this relate to dumps nfs down and showmount blocking on vm spin up?
[15:42:52] <chasemp>	 Random thought
[15:43:03] <andrewbogott>	 I don't think so — the instances aren't getting scheduled in the first place
[15:43:10] <andrewbogott>	 it's some kind of communication issue between services as best I can tell
[15:45:33] <wikibugs>	 06Labs, 10wikitech.wikimedia.org: Reset OAuth authentication for Wikitech account for Niharika29 - https://phabricator.wikimedia.org/T135518#2305465 (10Niharika) I'm pretty sure Ryan or Frances aren't trying to hack into my account.  All of the tools I am part of have shared ownership, anyway.  Do we still wan...
[15:45:59] <chasemp>	 hm k
[15:50:47] <wikibugs>	 06Labs, 10Labs-Infrastructure, 05Continuous-Integration-Scaling: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305493 (10Luke081515) p:05Triage>03Unbreak! Blocks Zuul.
[15:59:29] <wikibugs>	 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 13Patch-For-Review: Setup NSS inside containers used in Tool Labs - https://phabricator.wikimedia.org/T134748#2305514 (10yuvipanda) We do need nscd, otherwise it is too slow :(
[16:02:14] <Luke081515>	 Krenair: The irc.beta.wmflabs.org RC-IRC thing, whoch puppet role is that? Do you know that?
[16:02:33] <Krenair>	 yep
[16:03:27] <Krenair>	 you can look these things up like this: ldapsearch -x dc:dn:=deployment-ircd.deployment-prep.eqiad.wmflabs | grep puppetClass
[16:03:37] <Krenair>	 it's role::mw_rc_irc
[16:04:31] <YuviPanda>	 or with http://tools.wmflabs.org/watroles/variable/instancename/deployment-ircd
[16:05:42] <Luke081515>	 thanks :)
[16:09:11] <hashar>	 back
[16:12:49] <bd808>	 mutante: late answer -- https://tools.wmflabs.org/osm exists and is running lighttpd. The 404 you are seeing from its root page is due to the being no default index. See https://tools.wmflabs.org/osm/libs/openlayers/OpenLayers-patch2-10.js for a file that tool actually serves
[16:16:39] <Luke081515>	 Krenair: I didn't worked with puppet variables at labs yet. So for example if I want to set the "instancename" variable, do I have to enter variablename and value directly at Special:NovaPuppetGroup?
[16:19:33] <Krenair>	 Luke081515, not the values
[16:19:52] <Luke081515>	 andrewbogott: can you take a look at open stack? every time, I try to spawn an instance, they get into the "error" state
[16:19:57] <Krenair>	 Luke081515, at Special:NovaPuppetGroup you add classes and variables so they can be used by your project
[16:20:18] <Luke081515>	 Krenair: Ok, but how can I set the values of the variables? LAter at the instance?
[16:20:23] <Krenair>	 Luke081515, then on Special:NovaInstance you can 'configure' an instance to use those classes and set the values
[16:20:32] <andrewbogott>	 Luke081515: I'm working on it
[16:20:53] <Krenair>	 Luke081515, note that the variables are not really needed now that we have hira
[16:20:53] <Luke081515>	 andrewbogott: thx :)
[16:20:54] <Krenair>	 hiera*
[16:21:11] <Luke081515>	 ok :)
[16:21:27] <Luke081515>	 but first i have to wait, that I can spawn an instance, where I can try it ;)
[16:22:07] <wikibugs>	 06Labs, 10Labs-Infrastructure, 05Continuous-Integration-Scaling: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305657 (10hashar) Nodepool eventually restarted due to puppet.   Horizon interface shows up instances are blocked on various tasks: in Spawning, Sc...
[16:22:28] <hashar>	 andrewbogott: for what it is worth I get the error message: 500 No valid host was found. There are not enough hosts available.
[16:23:20] <andrewbogott>	 there are mutliple things happening
[16:24:13] <wikibugs>	 06Labs: Backup files request - https://phabricator.wikimedia.org/T135014#2305665 (10Mjbmr) Please remove the tool named `xmlfeed` and regenerate `replica.my.cnf` for the tools named `mjbmr-tools` and `mjbmrbot` and my personal account. Thanks.
[16:25:20] <andrewbogott>	 chasemp: where all did you restart rabbitmq-server?
[16:25:46] <chasemp>	 Labnet1002 only
[16:26:35] <andrewbogott>	 hm
[16:34:46] <hasharAway>	 nodepool is restarted by puppet so it is back around spamming labs infra. 
[16:35:02] <hasharAway>	 I am away dealing with dinner / kids etc. will check there from time to time
[16:39:09] <andrewbogott>	 hasharAway: are you sure that nodepool isn't still trying to schedule things?
[16:39:16] <andrewbogott>	 Quite a flood of scheduling requests over here
[16:39:17] <hasharAway>	 it is
[16:39:19] <hasharAway>	 restarted
[16:39:21] <hasharAway>	 by puppet
[16:39:41] <hasharAway>	 could you disable puppet on labnodepool1001.eqiad.Wmnet ?
[16:39:48] <andrewbogott>	 sure
[16:40:13] <hasharAway>	 stopped it manually
[16:40:22] <andrewbogott>	 thanks, I disabled puppet
[16:40:36] <andrewbogott>	 Things might be working now but I want to give everyone a chance to catch up
[16:40:42] <andrewbogott>	 there was such a backlog of schedule requests...
[16:41:45] <hasharAway>	 yeah I can imagine nodepool has been quite spammy and overloaded whatever queue is used :-(
[16:44:40] <mutante>	 bd808: oh! thank you for that. would it make sense to Redirect those requests to somewhere else?
[16:46:51] <wikibugs>	 06Labs, 10Labs-Infrastructure, 05Continuous-Integration-Scaling, 13Patch-For-Review: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2305151 (10Andrew) 2016-05-18 16:45:29.980 5375 ERROR nova.compute.manager [req-86afc675-0c57-44f4-a164-e1a8320c845b novaadmin...
[16:48:17] <bd808>	 mutante: *shrug* maybe. I have no idea what that tool is actually doing. It looks like a static file dump for something related to open street maps
[16:48:23] <Amir1>	 !log ores running puppet agent on ores-lb-02 manually
[16:48:27] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master
[16:51:14] <Matthew_>	 !log tools.xtools Restarted the webservice for xtools-ec as it was returning 502s again.
[16:51:18] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.xtools/SAL, Master
[16:55:41] <mutante>	 bd808: i think i'll just Redirect to https://meta.wikimedia.org/wiki/OpenStreetMap and move on :) 
[16:57:41] <wikibugs>	 06Labs, 10wikitech.wikimedia.org: Reset OAuth authentication for Wikitech account for Niharika29 - https://phabricator.wikimedia.org/T135518#2305850 (10Krenair) 05Open>03Resolved a:03Krenair We called just now and I reset 2FA for @Niharika
[17:35:30] <thcipriani>	 eh, so, what's the status of nodepool things? Still seems very broken in a way I've not seen...
[17:35:44] <andrewbogott>	 thcipriani: should be working as of a minute or two ago...
[17:36:47] <thcipriani>	 andrewbogott: okie doke, there do seem to be some new instances building now that I look, hopefully the zuul queue starts moving soon.
[17:37:19] <thcipriani>	 oh it is moving, amazing :D 
[17:46:41] <wikibugs>	 06Labs, 10Labs-Infrastructure, 05Continuous-Integration-Scaling, 13Patch-For-Review: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2306116 (10Andrew) 05Open>03Resolved a:03Andrew This is resolved now, and I don't know what went wrong :(
[17:47:08] <tom29739>	 Is there a way to give write access on a tool to another tool without giving it to all tools? For example, could it be set that tool x can access tool y's directories?
[17:48:01] <andrewbogott>	 tom29739: I think you can make a tool a member of a tool, using the same method you'd use to add a user to a tool
[17:51:57] <tom29739>	 I would do that, but I get 'No results match "tom29739-testing"' for any input in the service user box when managing maintainers on a tool. I don't think that's intended.
[17:52:39] <wikibugs>	 06Labs, 10Tool-Labs, 06Operations, 13Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2306131 (10Dzahn) - deleted toolserver.org.key in private repo - deleted certs and .key and .keyold on instance, /etc/ssl...
[17:55:08] <YuviPanda>	 tom29739: andrewbogott you can theoretically do that, but that feature of OSM has been broken for a while
[17:55:16] <YuviPanda>	 I filed a ticket
[17:55:26] <andrewbogott>	 damn
[17:55:33] <YuviPanda>	 https://phabricator.wikimedia.org/T128400
[17:55:59] <tom29739>	 So I can't do what I want to do?
[17:56:02] <YuviPanda>	 tom29739: andrewbogott admins can do that manually atm, so if you need it to create a ticket and we'll do it
[17:58:45] <wikibugs>	 06Labs, 10Tool-Labs, 06Operations, 13Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2306161 (10Dzahn) >>! In T134798#2303651, @Dzahn wrote: > @Nemo_bis Possible, but we'll need it in DNS first, then Apache...
[18:43:36] <YuviPanda>	 doctaxon: hmm, thanks for bringing it to my notice, am investigating it now
[18:43:40] <YuviPanda>	 I see ~200 jobs in qw
[18:44:07] <YuviPanda>	 hmm no queues or grids in error state
[18:44:52] <doctaxon>	 but what can be the reason why?
[18:45:11] <YuviPanda>	 I'm investigating :)
[18:45:12] <mutante>	 !log toolserver-legacy restart Apache, adding wiki. alias
[18:45:18] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolserver-legacy/SAL, Master
[18:47:45] <chasemp>	 YuviPanda: I can submit a basic job and it runs fine...
[18:47:54] <YuviPanda>	 hmm
[18:48:33] <YuviPanda>	 chasemp: did you run it on precise or trusty?
[18:48:51] <chasemp>	 waht's teh default? trusty I think then
[18:48:56] <YuviPanda>	 yeah
[18:48:58] <YuviPanda>	 hmm
[18:49:16] <YuviPanda>	 hm I see some trusty stuff there too
[18:50:46] <chasemp>	 hm
[18:50:46] <chasemp>	 so
[18:50:46] <chasemp>	 just doing jsub on something
[18:50:48] <chasemp>	 I still get precise
[18:50:54] <chasemp>	 and can run a job that dumps basic info
[18:50:56] <chasemp>	 runs on tools-webgrid-lighttpd-1205.tools.eqiad.wmflabs
[18:50:58] <YuviPanda>	 yeah you need to pass -l release=trusty
[18:51:00] <chasemp>	 ah
[18:51:02] <YuviPanda>	 to jsub
[18:51:05] <YuviPanda>	 to get trusty
[18:51:55] <chasemp>	 did so, qw
[18:52:02] <chasemp>	 trusty nodes are overwhelmed?
[18:52:14] <doctaxon>	 i use something like this:  jsub -once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 4g -l release=trusty    ld.tcl
[18:52:36] <doctaxon>	 is this wrong?
[18:52:54] <jem>	 Hi, mi tool is 503 and webservice start doesn't help
[18:53:09] <jem>	 my tool is giving*
[18:53:16] <jem>	 Any advice?
[18:53:20] <chasemp>	                             queue instance "webgrid-lighttpd@tools-webgrid-lighttpd-1402.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=3.180000 (= 3.180000 + 0.50 * 0.000000 with nproc=4) >= 2.75
[18:55:55] <jem>	 I guess that line is related to my problem...
[18:56:03] <YuviPanda>	 chasemp: yeah, from http://tools.wmflabs.org/?status looks like it
[18:56:08] <YuviPanda>	 jem: yeah, probably. stand by
[18:56:24] <YuviPanda>	 anomie: ping?
[18:57:01] <anomie>	 YuviPanda: pong?
[18:57:18] * jem stands by :)
[18:57:19] <YuviPanda>	 anomie: do a lot of the anomiebots read dumps?
[18:57:41] <YuviPanda>	 anomie: they're causing super high CPU usage now in a bunch of nodes, and I *think* that's because dumps is unavailable right now.
[18:57:53] <anomie>	 YuviPanda: I don't have any tasks that read dumps. Most just query the API, some of the newer ones use queries against the DB replica.
[18:57:56] <YuviPanda>	 hmm
[18:58:12] <YuviPanda>	 ok that's good to know
[18:59:26] <YuviPanda>	 chasemp: I'm actually unsure where the load is coming from
[18:59:49] <chasemp>	 so I ahve a fair idea someone is doing something innocuos that is choking on dumps yeah
[18:59:50] <YuviPanda>	 chasemp: lots of CPU use of kworker and rcu_sched and nothing much else :| (on tools-exec-1410)
[19:00:02] <YuviPanda>	 chasemp: but across all the nodes?
[19:00:16] <anomie>	 YuviPanda: All AnomieBOT jobs, or just some?
[19:00:18] <chasemp>	 YuviPanda: yeah odd 
[19:00:23] <chasemp>	 give me a sec here to try one thing
[19:00:24] <YuviPanda>	 chasemp: some had anomiebot using good chunk of CPU but he says they don't use dumps
[19:00:26] <YuviPanda>	 chasemp: kk
[19:00:43] <YuviPanda>	 anomie: I was just looking at some, but think it's  a red-herring now.
[19:01:50] <YuviPanda>	 so all the precise hosts seem fine
[19:01:55] <YuviPanda>	 the trusty ones have gone bonkers
[19:02:21] <chasemp>	 so I ensured all is ro and basically enabled an empty dumps share
[19:02:22] <chasemp>	 and I did
[19:02:36] <chasemp>	 fuser -k /public/dumps
[19:02:36] <chasemp>	 umount -f /public/dumps
[19:02:45] <chasemp>	 mount -o remount /public/dumps
[19:02:51] <chasemp>	 and it seems to have dropped load a lot on 
[19:03:02] <YuviPanda>	 yeah
[19:03:05] <chasemp>	 tools-webgrid-lighttpd-1402.tools.eqiad.wmflabs
[19:03:45] <YuviPanda>	 not all tho
[19:03:51] <YuviPanda>	 1408
[19:03:55] <YuviPanda>	 -exec-1408
[19:03:59] <YuviPanda>	  19:03:48 up 55 days, 22:53,  1 user,  load average: 13.12, 13.06, 12.46
[19:04:03] <chasemp>	 same for tools-webgrid-lighttpd-1406
[19:04:06] <YuviPanda>	 but
[19:04:09] <YuviPanda>	 %Cpu(s):  0.2 us,  3.8 sy, 16.7 ni, 77.6 id,  0.0 wa,  0.0 hi,  1.7 si,  0.0 st
[19:04:10] <chasemp>	 I've only done two nodes
[19:04:10] <YuviPanda>	 oh
[19:04:14] <YuviPanda>	 aaah, right
[19:04:16] <YuviPanda>	 ok
[19:04:21] <doctaxon>	 YuviPanda: release=precise does run the cronjobs
[19:04:27] <chasemp>	 both are dropping so we should hit them all I think?
[19:04:33] <YuviPanda>	 doctaxon: yes, but please don't do that. just wait a few minutes and we'll get this back to working.
[19:04:38] <chasemp>	 and I'll ahve to remount again post resize
[19:04:42] <YuviPanda>	 chasemp: yeah
[19:04:44] <doctaxon>	 thank you
[19:04:54] <YuviPanda>	 chasemp: do you have the magic handy to do it?
[19:05:13] <chasemp>	 YuviPanda: umount -f /public/dumps && mount -o remount /public/dumps 
[19:05:28] <chasemp>	 unles that says it's in use
[19:05:29] <chasemp>	 and maybe
[19:05:29] <chasemp>	 fuser -k /public/dumps
[19:05:37] <chasemp>	 kind of deal
[19:07:16] <YuviPanda>	 I mean to run it on all hosts :D
[19:07:26] <YuviPanda>	 chasemp: do you want me to run it on all the trusty execs?
[19:07:27] <chasemp>	 all tools hosts or all hosts?
[19:07:33] <YuviPanda>	 all tools ones I think
[19:07:34] <chasemp>	 it's affecting precise too
[19:07:38] <YuviPanda>	 oh ok
[19:07:52] <YuviPanda>	 chasemp: are you running it on all tools hosts or shall I?
[19:08:05] <YuviPanda>	 I don't have anything parallel setup but I've a small helper script that xargs ssh
[19:08:09] <chasemp>	 I thought you were in the middle so I was holding off
[19:08:17] <YuviPanda>	 ah we clashed there
[19:08:19] <chasemp>	 I don't think I have a current lit of all tools
[19:08:19] <YuviPanda>	 I was asking
[19:08:21] <YuviPanda>	 ok
[19:08:21] <chasemp>	 heh
[19:08:23] <YuviPanda>	 I'll do it
[19:08:26] <chasemp>	 k
[19:09:18] <YuviPanda>	 chasemp: running it now
[19:09:21] <chasemp>	 k
[19:10:59] <chasemp>	 sorry this resize is taking forever so here we are
[19:11:40] <shinken-wm>	 PROBLEM - Puppet run on tools-webgrid-lighttpd-1406 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[19:13:54] <wikibugs>	 06Labs, 10Tool-Labs, 06Operations, 13Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2306476 (10Dzahn) done. the cert has additional SANs now, "wiki" and "stable"
[19:14:02] <YuviPanda>	 chasemp: hmm, lots of umount.nfs: /public/dumps: device is busy
[19:14:10] <YuviPanda>	 chasemp: on tools-exec-1405.tools.eqiad.wmflabs for example
[19:14:20] <YuviPanda>	 -f?
[19:15:04] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1215 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[19:16:23] <YuviPanda>	 oh it already had a -f
[19:17:37] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1403 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[19:18:15] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1402 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[19:19:19] <chasemp>	 YuviPanda: timeout 10s fuser -k /public/dumps; umount -f /public/dumps && mount -t nfs labstore1003.eqiad.wmnet:/srv/dumps /public/dumps
[19:19:33] <chasemp>	 maybe
[19:19:58] <YuviPanda>	 chasemp: btw, should we also take this opportunity to switch dumps from hard to soft mounted?
[19:20:11] <chasemp>	 that was ironically next up on my nfs client things :)
[19:20:16] <chasemp>	 shoudl ahve done it first now I realize
[19:21:04] <YuviPanda>	 heh
[19:21:07] <shinken-wm>	 PROBLEM - Puppet run on tools-webgrid-lighttpd-1210 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[19:21:37] <chasemp>	 I guess
[19:21:37] <chasemp>	 imeout 10s fuser -k /public/dumps; umount -f /public/dumps; mount -t nfs labstore1003.eqiad.wmnet:/srv/dumps /public/dumps
[19:21:38] <chasemp>	 timeout 10s fuser -k /public/dumps; umount -f /public/dumps; mount -t nfs labstore1003.eqiad.wmnet:/srv/dumps /public/dumps
[19:21:38] <wikibugs>	 06Labs, 10Labs-Infrastructure, 05Continuous-Integration-Scaling, 13Patch-For-Review: Nodepool can not spawn instances anymore on wmflabs - https://phabricator.wikimedia.org/T135631#2306486 (10hashar) ```  File "/usr/lib/python2.7/dist-packages/libvirt.py", line 896, in   if ret == -1: raise libvirtError ('...
[19:21:39] <chasemp>	 is better
[19:21:41] <shinken-wm>	 PROBLEM - Puppet run on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[19:21:45] <chasemp>	 silly mount and it's exit codes
[19:22:43] <shinken-wm>	 PROBLEM - Puppet run on tools-exec-1205 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0]
[19:22:49] <YuviPanda>	 chasemp: doesn't that mount it without any of the options we used?
[19:23:04] <chasemp>	 it's not a real mount
[19:23:09] <chasemp>	 or at least there is nothing there and it's ro
[19:23:14] <chasemp>	 just to get past the stupidity
[19:23:22] <chasemp>	 I'l ahve to remount it anyways
[19:23:27] <chasemp>	 as it will be on a diff fs server side
[19:23:33] <chasemp>	 teh vnode stuff doesn't handle that
[19:23:42] <YuviPanda>	 ah right
[19:23:44] <YuviPanda>	 fair enough
[19:23:48] <chasemp>	 sorry, it's a real mount but option wise it's not an issue
[19:24:19] <YuviPanda>	 nah makes sense
[19:24:27] <YuviPanda>	 I'm not sure what it'll do to puppet tho
[19:24:29] <YuviPanda>	 but running it now
[19:25:28] <YuviPanda>	 chasemp: that seems to work tho
[19:28:04] <YuviPanda>	 chasemp: almost all the qw jobs are gone tho
[19:28:09] <YuviPanda>	 jem: is your webservice back online?
[19:28:14] <YuviPanda>	 doctaxon: your services should also be running now
[19:28:34] <doctaxon>	 okay, mom
[19:28:48] <doctaxon>	 runs best
[19:29:06] * YuviPanda makes doctaxon eat their vegetables
[19:30:10] <Luke081515>	 Krenair: Do you know what I have to setup, after appyling that IRC RC role?
[19:30:16] <doctaxon>	 YuviPanda - what do you mean?
[19:30:27] <doctaxon>	 vegetables?
[19:30:43] <YuviPanda>	 doctaxon: you mentioned mom, I think it is a stereotypical thing moms are supposed to do
[19:30:49] <YuviPanda>	 make people who call them moms eat vegetables
[19:31:56] <chasemp>	 YuviPanda: so puppet doesn't seem to care
[19:32:01] <chasemp>	 taht the mount is "wrong"
[19:32:02] <chasemp>	 so that's fun
[19:32:14] <YuviPanda>	 :D
[19:34:21] <YuviPanda>	 chasemp: all good now I think
[19:34:38] <chasemp>	 all seems well until dumps finishes resizing
[19:34:38] <chasemp>	 thanks man
[19:34:51] <chasemp>	 afaik it's still chugging away fine it's just a huge volume
[19:36:44] <YuviPanda>	 chasemp: kk. eta?
[19:36:54] <YuviPanda>	 like hours? will it finish today you think?
[19:36:58] <YuviPanda>	 is there even an eta screen?
[19:37:03] <chasemp>	 I'm not sure, nope
[19:37:09] <YuviPanda>	 ok
[19:37:40] <chasemp>	 I naively thought hour(s)
[19:37:46] <chasemp>	 we'll see
[19:38:02] <YuviPanda>	 :D ok
[19:42:45] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:42:45] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1205 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:48:17] <shinken-wm>	 RECOVERY - Puppet run on tools-exec-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:49:04] <sldr>	 Hey all. Is there any known issue with the wsexport tool at the moment? Appears unresponsive.
[19:49:14] <jem>	 YuviPanda: Yes, it's back, thanks :)
[19:50:43] <wikibugs>	 06Labs, 10Horizon: TGR unable to login on Horizon - https://phabricator.wikimedia.org/T131630#2306704 (10Andrew)
[19:51:20] <wikibugs>	 06Labs, 10Horizon: TGR unable to login on Horizon - https://phabricator.wikimedia.org/T131630#2173147 (10Andrew) a:03jcrespo Chris suggests that something should be done to the 'users' table on silver.  So... over to you, Jaime.
[19:52:05] <chasemp>	 sldr: maybe? it appears to be running but throws 2016-05-18 19:51:20: (server.c.1444) [note] sockets disabled, connection limit reached
[19:52:40] <chasemp>	 restarted it
[19:53:06] <sldr>	 chasemp: That would explain why I suddenly got Bad Gateway, I suppose. I'll try again in a second.
[19:54:30] <sldr>	 chasemp: Tried again just now. Works one out of one times. Turning it off and on again saves the day again?
[19:54:59] <chasemp>	 maybe so :)
[19:55:42] <sldr>	 Is it a fantastically bad idea to run a batch job against that tool? Wouldn't want to break it.
[19:56:06] <shinken-wm>	 RECOVERY - Puppet run on tools-webgrid-lighttpd-1210 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:58:31] <chasemp>	 sldr: I have no idea
[19:58:46] <sldr>	 chasemp: So... Let's find out?
[20:49:34] <tom29739>	 How can I find out the lighttpd version on tools? I tried 'lighttpd' in the console, and it returned command not found. The http header returns nginx, which I presume is the proxy
[20:50:30] <YuviPanda>	 tom29739: it's just the lighttpd version ubuntu trusty has
[20:50:46] <YuviPanda>	 tom29739: you can also ssh to tools-webgrid-lighttpd-1401 (or any such node) and run same commands to find out
[20:51:49] <Luke081515>	 who knows how to setup the role::mw_rc_irc
[20:51:59] <mutante>	 Luke081515: i do
[20:52:07] <Luke081515>	 ok
[20:52:10] <mutante>	 it's already there
[20:52:21] <mutante>	 in project. eh.. "irc"
[20:53:15] <mutante>	 or you can apply the role in another one
[20:53:37] <mutante>	 basically go to "puppet groups" and add the role class, so you can select it when you "configure" an instance
[20:54:01] <mutante>	 since we just recently fixed some issues for that and it has fake secrets in labs/private etc.. it should simply work :)
[21:03:44] <chasemp>	 YuviPanda: not much on I can do atm for dumps, it's resizing as we speak and that's kind of a volatile state
[21:04:11] <YuviPanda>	 chasemp: yeah, understood
[21:04:18] <YuviPanda>	 chasemp: do we have a plan B for it?
[21:04:43] <chasemp>	 I mean if it never comes back we can just wipe it out, recreate lv's and repopulate from dumps
[21:04:52] <chasemp>	 it's not primary copy of any data so
[21:04:57] <chasemp>	 but that's potentially longer
[21:05:03] <YuviPanda>	 right
[21:05:05] <YuviPanda>	 kk
[21:09:24] <mutante>	 !log ircd added Luke08515 as user and admind
[21:09:31] <labs-morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ircd/SAL, Master
[21:30:40] <sldr>	 chasemp: So for the record, I ran a quick batch job which exported some 140 pages as quickly as it could. No obvious error.
[21:30:53] <chasemp>	 sldr: nice :)
[22:13:28] <wikibugs>	 06Labs, 10Tool-Labs, 10Living-Style-Guide, 06Reading-Web-Backlog: npm version on tools-login.wmflabs.org is incompatible with MobileFrontend package.json used by the KSS styleguide - https://phabricator.wikimedia.org/T89093#2307354 (10Danny_B)
[23:32:00] <shinken-wm>	 PROBLEM - Puppet run on tools-bastion-10 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]