[00:34:29] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1439 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[00:41:03] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1436 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[00:47:17] <wikibugs>	 10Labs, 10Tool-Labs: Homedir for user whym is very large (>60G) - https://phabricator.wikimedia.org/T169265#3393571 (10whym) Thanks for notifying me. It was a (probably incomplete) copy of dumps.wikimedia.org/other. The largest subdirectory was /other/diffdb/ which I have deleted. I was particularly interested...
[01:04:27] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1439 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:08:53] <wikibugs>	 10Labs, 10Tool-Labs: Homedir for user whym is very large (>60G) - https://phabricator.wikimedia.org/T169265#3393602 (10bd808) 05Open>03Resolved a:03whym Usage now is ~2G. Thanks a lot for the quick response @whym. It is very much appreciated.
[01:11:02] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1436 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:18:42] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1402 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[01:26:32] <bd808>	 grid and kubernetes are going to have intermittent issues as we get switched back to the other NFS server
[01:29:27] <shinken-wm>	 PROBLEM - High iowait on tools-grid-master is CRITICAL: CRITICAL: tools.tools-grid-master.cpu.total.iowait (>11.11%)
[01:29:29] <shinken-wm>	 PROBLEM - Puppet errors on tools-bastion-05 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[01:29:36] <andrewbogott>	 !log tools rebooting tools-cron-01
[01:29:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[01:29:40] <shinken-wm>	 PROBLEM - Puppet errors on tools-webgrid-lighttpd-1412 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[01:29:53] <shinken-wm>	 PROBLEM - Puppet errors on tools-webgrid-lighttpd-1406 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[01:30:37] <shinken-wm>	 PROBLEM - Puppet errors on tools-webgrid-generic-1404 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[01:31:37] <shinken-wm>	 PROBLEM - Puppet errors on tools-mail is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[01:32:03] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1436 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[01:32:05] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1440 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[01:33:17] <chasemp>	 !log tools time for i in `cat tools-hosts`; do ssh -i ~/.ssh/labs_root_id_rsa root@$i.eqiad.wmflabs 'hostname -f; uptime; tc-setup'; done
[01:33:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[02:01:12] <wikibugs>	 10Labs, 10DC-Ops, 10Operations: labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286#3393637 (10Andrew)
[02:01:36] <wikibugs>	 10Labs, 10DC-Ops, 10Operations: labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286#3393620 (10Andrew) I tagged dc-ops because... have y'all ever seen something like this?
[02:05:51] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:06:01] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1434 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:07:07] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1436 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:07:08] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1440 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:07:21] <wikibugs>	 10Labs, 10DC-Ops, 10Operations: labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286#3393620 (10bd808) http://www.dell.com/support/manuals/us/en/04/dell-opnmang-sw-v8.1/EEMI_13G_v1.2-v1/UEFI-Event-Messages?guid=GUID-823669E3-2D7B-41B5-85F1-AF7A6BC11ACC&lang=en-u...
[02:08:00] <shinken-wm>	 RECOVERY - Puppet errors on tools-webgrid-lighttpd-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:08:09] <shinken-wm>	 RECOVERY - Puppet errors on tools-webgrid-lighttpd-1407 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:08:59] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1016 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:09:01] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:09:39] <shinken-wm>	 RECOVERY - Puppet errors on tools-webgrid-lighttpd-1412 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:09:51] <shinken-wm>	 RECOVERY - Puppet errors on tools-webgrid-lighttpd-1406 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:10:39] <shinken-wm>	 RECOVERY - Puppet errors on tools-webgrid-generic-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:16:40] <wikibugs>	 10Labs, 10DC-Ops, 10Operations: labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286#3393620 (10madhuvishy) We did another reboot to downgrade the kernel back to 4.3 and the error happened again.
[02:16:54] <wikibugs>	 10Labs, 10Operations, 10Kubernetes: etcd config depends on puppet certs, but puppet doesn't know - https://phabricator.wikimedia.org/T169287#3393649 (10Andrew)
[02:17:52] <wikibugs>	 10Tool-Labs-tools-Xtools: Properly handle malformed curl results - https://phabricator.wikimedia.org/T169288#3393661 (10Matthewrbowker)
[02:22:36] <shinken-wm>	 PROBLEM - Puppet errors on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[02:28:16] <wikibugs>	 10Labs, 10Operations, 10Kubernetes: etcd config depends on puppet certs, but puppet doesn't know - https://phabricator.wikimedia.org/T169287#3393675 (10madhuvishy)
[02:28:41] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:33:19] <wikibugs>	 10Labs, 10Tool-Labs: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues - https://phabricator.wikimedia.org/T169289#3393676 (10bd808)
[02:33:55] <wikibugs>	 10Labs, 10Operations, 10Kubernetes: etcd config depends on puppet certs, but puppet doesn't know - https://phabricator.wikimedia.org/T169287#3393690 (10bd808)
[02:33:56] <wikibugs>	 10Labs, 10Tool-Labs: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues - https://phabricator.wikimedia.org/T169289#3393689 (10bd808)
[02:36:02] <wikibugs>	 10Labs, 10Tool-Labs: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues - https://phabricator.wikimedia.org/T169289#3393691 (10bd808)
[02:36:04] <wikibugs>	 10Labs, 10DC-Ops, 10Operations: labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286#3393692 (10bd808)
[02:37:25] <wikibugs>	 10cloud-services-team, 10Operations: New anti-stackclash (4.9.25-1~bpo8+3 ) kernal SUPER BAD for NFS - https://phabricator.wikimedia.org/T169290#3393693 (10Andrew)
[02:37:40] <wikibugs>	 10cloud-services-team, 10Operations: New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS - https://phabricator.wikimedia.org/T169290#3393706 (10Andrew)
[02:38:26] <wikibugs>	 10cloud-services-team, 10Operations: New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS - https://phabricator.wikimedia.org/T169290#3393693 (10Andrew)
[02:38:28] <wikibugs>	 10Labs, 10Tool-Labs: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues - https://phabricator.wikimedia.org/T169289#3393707 (10Andrew)
[02:39:01] <wikibugs>	 10Labs, 10Tool-Labs: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues - https://phabricator.wikimedia.org/T169289#3393710 (10bd808) Some data about the system load we saw: {P5652}
[02:48:35] <wikibugs>	 10Tool-Labs-tools-Xtools, 10Community-Tech-Sprint: Optimize edit count queries in XTools - https://phabricator.wikimedia.org/T163284#3393715 (10Samwilson) Good idea. It's probably a bit under half the load time for many users. It has been moved to load via JS.
[02:52:37] <shinken-wm>	 RECOVERY - Puppet errors on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:53:15] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1021 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[02:56:22] <wikibugs>	 10Labs, 10Tool-Labs: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues - https://phabricator.wikimedia.org/T169289#3393743 (10bd808) SAL entries: ``` == 2017-06-30 == 02:29 <chasemp> labstore1005 start drbd 02:14 <chasemp> reboot labstore1005 (5m ago) 01:33 <chasemp> time for i in `cat tools-hosts`; d...
[02:58:33] <wikibugs>	 10Labs: Labstore nfsd processes report "sent only x when sending y bytes - shutting down socket" - https://phabricator.wikimedia.org/T169281#3393758 (10bd808)
[02:58:35] <wikibugs>	 10Labs, 10Tool-Labs: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues - https://phabricator.wikimedia.org/T169289#3393757 (10bd808)
[03:02:00] <wikibugs>	 10Labs: Labstore nfsd processes report "sent only x when sending y bytes - shutting down socket" - https://phabricator.wikimedia.org/T169281#3393761 (10bd808) Quite likely related to {T169290}. @chasemp could not find other log events like this prior to the kernel upgrade.
[03:13:57] <wikibugs>	 10Labs, 10Tool-Labs: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues - https://phabricator.wikimedia.org/T169289#3393775 (10bd808)
[03:17:29] <wikibugs>	 10Labs, 10Tool-Labs: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues - https://phabricator.wikimedia.org/T169289#3393776 (10bd808)
[03:33:13] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1021 is OK: OK: Less than 1.00% above the threshold [0.0]
[03:43:28] <wikibugs>	 10Tool-Labs-tools-Xtools, 10Community-Tech-Sprint: Restrict access to users' edit stats unless opted-in - https://phabricator.wikimedia.org/T165401#3393780 (10Samwilson) Thanks for finding that. Should be fixed now. https://xtools-dev.wmflabs.org/ is updated.
[05:26:18] <wikibugs>	 10Tool-Labs-tools-Xtools, 10Community-Tech-Sprint: Restrict access to users' edit stats unless opted-in - https://phabricator.wikimedia.org/T165401#3393827 (10kaldari) @Samwilson: I'm getting the "fr.wikipedia.org is not a valid project" error at https://xtools-dev.wmflabs.org/.
[05:36:25] <wikibugs>	 10Tool-Labs-tools-Xtools, 10Community-Tech-Sprint: Restrict access to users' edit stats unless opted-in - https://phabricator.wikimedia.org/T165401#3393828 (10Samwilson) Ergh, again?! Bother.  Do you mean when you type in the project name? Does it happen when you type in `fr.wikipedia.org` and then submit quic...
[05:48:30] <wikibugs>	 10Tool-Labs-tools-Xtools, 10Community-Tech-Sprint: Restrict access to users' edit stats unless opted-in - https://phabricator.wikimedia.org/T165401#3393833 (10kaldari) @Samwilson: It happens as soon as I type in "fr.wikipedia.org" and then unfocus the field, so it's coming from the client-side API check.
[06:57:41] <shinken-wm>	 PROBLEM - Puppet errors on tools-webgrid-lighttpd-1425 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[07:12:27] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1421 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[07:32:42] <shinken-wm>	 RECOVERY - Puppet errors on tools-webgrid-lighttpd-1425 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:51:21] <wikibugs>	 10Labs, 10Labs-Infrastructure: Unable to SSH onto tools-login.wmflabs.org - https://phabricator.wikimedia.org/T130446#3394061 (10akosiaris)
[08:51:23] <wikibugs>	 10Labs, 10Labs-Infrastructure, 10Operations: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#3394058 (10akosiaris) 05Open>03Resolved a:03akosiaris I am guessing this works fine? Since then we've had 0 troubles if I am not mistaken. I am resolving, feel free to reopen
[08:52:04] <wikibugs>	 10Labs, 10Labs-Infrastructure, 10Operations: Investigate failover failure of LDAP servers - https://phabricator.wikimedia.org/T141277#2492736 (10akosiaris) Dependent task T130593 has had no update since Nov 2016, so this is probably solved. I am gonna resolve this, feel free to reopen
[08:52:08] <wikibugs>	 10Labs, 10Labs-Infrastructure, 10Operations: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2140296 (10akosiaris)
[08:52:11] <wikibugs>	 10Labs, 10Labs-Infrastructure, 10Operations: Investigate failover failure of LDAP servers - https://phabricator.wikimedia.org/T141277#3394066 (10akosiaris) 05Open>03Resolved a:03akosiaris
[08:52:25] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1421 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:13:24] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1421 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[09:48:24] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1421 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:14:24] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1421 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[10:43:25] <wikibugs>	 (03PS1) 10Addshore: Send all WMDE-.* phab projects to #wikimedia-de-tech [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/362373
[10:44:41] <wikibugs>	 10Labs, 10Graphite, 10Operations, 10Patch-For-Review, 10User-fgiunchedi: Move labs 'instances' data to graphite labs - https://phabricator.wikimedia.org/T143405#3394343 (10ArielGlenn) Poking @bd808 on this, since it's been an issue for us again in the past week.
[10:45:07] <wikibugs>	 (03PS1) 10Addshore: Add MediaWiki-extensions-Wikibase(.*) to #wikidata-feed [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/362374
[10:49:24] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1421 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:12:40] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1007 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[11:37:17] <wikibugs>	 (03CR) 10Thiemo Mättig (WMDE): [C: 031] Add MediaWiki-extensions-Wikibase(.*) to #wikidata-feed [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/362374 (owner: 10Addshore)
[11:37:46] <wikibugs>	 (03CR) 10Thiemo Mättig (WMDE): [C: 031] Send all WMDE-.* phab projects to #wikimedia-de-tech [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/362373 (owner: 10Addshore)
[11:52:39] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1007 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:38:44] <Guest12443>	 !help Can anyone help me setting up a project.
[12:38:44] <wm-bot>	 Guest12443: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team
[12:42:54] <acagastya>	 Anyone here?
[13:10:41] <andre__>	 acagastya: What would it change?
[13:11:01] <andre__>	 !ask
[13:14:04] <wikibugs>	 10Labs, 10wikitech.wikimedia.org: 2FA reset for Wikitech account - https://phabricator.wikimedia.org/T169332#3394984 (10Bawolff) I can confirm I was talking to this user on irc, and he was logged into his irc account, which had a wikipedia cloak on it.
[13:17:05] <wikibugs>	 10Labs, 10wikitech.wikimedia.org: 2FA reset for Samtar Wikitech account - https://phabricator.wikimedia.org/T169332#3395006 (10Bawolff)
[13:22:31] <wikibugs>	 10Labs, 10wikitech.wikimedia.org: 2FA reset for Samtar Wikitech account - https://phabricator.wikimedia.org/T169332#3395015 (10Samtar) I've also added a file to a tool I have access to - http://tools.wmflabs.org/communityguidelines/T169332.txt
[13:32:29] <wikibugs>	 10Tool-Labs-tools-Other: Third party resources loaded from "communityguidelines" tool - https://phabricator.wikimedia.org/T169334#3395025 (10Nemo_bis)
[13:34:04] <wikibugs>	 10Labs, 10wikitech.wikimedia.org: 2FA reset for Samtar Wikitech account - https://phabricator.wikimedia.org/T169332#3395040 (10Bawolff) 05Open>03Resolved a:03Bawolff
[14:03:19] <tarrow>	 What is the deal with https://graphite-labs.wikimedia.org/? Is it meant for general labs use? I've been pushing metrics there and they seemed to be persistent for quite a while but recently I've noticed they disappeared. Today I noticed that the metric name is no longer even listed in the graphite browser. Is there some reason they would have disappeared?
[14:05:51] <tarrow>	 I know they are gradually aggregated to reduce storage for the long term but I've only been pushing one value per day so I assumed that they would last quite a long time
[14:08:32] <jynus>	 tarrow: I don't know, but from https://wikitech.wikimedia.org/wiki/Graphite it seems that it is mostly meant for beta
[14:09:05] <jynus>	 I wonder if prometheus is better supported? I really don't know
[14:09:38] <jynus>	 I would ask, not because you need permission, but to avoid unintentional deletions
[14:10:51] <tarrow>	 Cool; I'm happy to. Any idea who I should ask?
[14:11:23] <jynus>	 I would start with the tool owner, let me search it
[14:11:29] <jynus>	 s/tool/VM/
[14:14:17] <jynus>	 I do not see anyone connected right now, I would create a phab ticket and add the people on this history: https://wikitech.wikimedia.org/w/index.php?title=Labmon1001&action=history
[14:14:51] <jynus>	 I do not have access, I think, to know the current owners
[14:16:15] <tarrow>	 thanks!
[14:16:25] <tarrow>	 I'll go ahead and find out
[14:19:08] <andre__>	 acagastya: even if I "looked" at the task that you just pointed to me in private, what would you expect specifically from *me*?
[14:19:20] * addshore wonders what timezone madhuvishy is in
[14:19:21] <acagastya>	 I am facing a problem with a nodeJS application.
[14:19:38] <paladox>	 !log git upgrading gerrit on gerrit-test3 to gerrit 2.14.2 (pre)
[14:19:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Git/SAL
[14:19:49] <acagastya>	 andre__ no, should I open a new task on phab, or ask in channel.
[14:19:54] <acagastya>	 That is the question.
[14:19:57] <andre__>	 addshore: by default UTC-7
[14:20:03] <acagastya>	 I need libicu-dev for the bot.
[14:20:06] <andre__>	 acagastya, depends on what your previous question is that nobody knows.
[14:20:09] <addshore>	 andre__: ack!
[14:20:21] <andre__>	 acagastya, if you want someone to answer questions, you need to explain the problem.
[14:20:27] <wikibugs>	 10Labs: Metrics from WikiFactMine labs project have disappeared from graphite - https://phabricator.wikimedia.org/T169118#3387598 (10Addshore) I agree that this is super odd and these things probably shouldn't just disappear.  What is odd is that many other metrics in graphite remain untouched, it just seems to...
[14:20:32] <acagastya>	 andre__ I need `libicu-dev`.
[14:20:39] <andre__>	 acagastya: why? for what?
[14:20:42] <wikibugs>	 10Labs, 10User-Addshore: Metrics from WikiFactMine labs project have disappeared from graphite - https://phabricator.wikimedia.org/T169118#3395166 (10Addshore)
[14:20:53] <andre__>	 acagastya: that's a potential solution. but not a description of a problem.
[14:20:55] <acagastya>	 For a teleirc bot.
[14:21:32] <acagastya>	 I can't apt-get it, andre__
[14:22:02] <andre__>	 acagastya: file a Phabricator task requesting installation of that package and provide reasons why?
[14:24:36] <acagastya>	 https://phabricator.wikimedia.org/T169338
[14:24:49] <wikibugs>	 10Labs, 10User-Addshore: Metrics from WikiFactMine labs project have disappeared from graphite - https://phabricator.wikimedia.org/T169118#3395181 (10Tarrow) Perhaps it's because it is under a path that is autofilled with metrics describing the status of labs instances.  I think perhaps other metrics may have a...
[14:25:09] <acagastya>	 All that I know is the application depends on `libicu-dev`
[14:25:39] <andre__>	 acagastya, "Please see this" and linking to a random codebase is not helpful. Also, the task lacks any information WHY you think you need that.
[14:25:56] <andre__>	 acagastya, also, the task lacks any context. Is this about some Tool Labs instance? Some Labs stuff? Somewhere else?
[14:26:16] <andre__>	 acagastya, please see https://mediawiki.org/wiki/How_to_report_a_bug how to help others to better understand reported tasks. Thanks!
[14:26:25] <acagastya>	 I know it does not help. But that application, as far as I know depends on that package.
[14:26:40] <andre__>	 acagastya, so why do you tell us here on irc but not in your Phab task?
[14:27:13] <andre__>	 "as far as I know" confuses me. either you know or you don't? What makes you think so, that led you to asking here and creating that task?
[14:27:33] <andre__>	 anyway, I cannot spend more time on this, sorry :) Good luck and I hope that will get sorted out! :)
[14:28:55] <wikibugs>	 10Labs, 10Labs-Infrastructure, 10Operations: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#3395192 (10Andrew) Yep, no problems in ages.  Thanks for the bug cleanup.
[14:29:11] <acagastya>	 Made some changes, andre__
[14:30:16] <andre__>	 acagastya, if it's a request about installing some package, you want to explain where in your task, e.g. by adding the #Labs tag to it. Phabricator is used for hundreds of projects.
[14:30:58] <andre__>	 without content in the task, nobody knows where you want this. I know that we are here in the #wikimedia-cloud channel, but your task does not say that it is related to Labs at all.
[14:31:20] <andre__>	 acagastya, again, please see the steps in https://mediawiki.org/wiki/How_to_report_a_bug - thanks
[14:31:51] <andre__>	 the better your task, the more likely someone will find it (hence: Project tags needed) and will look at it.
[14:36:55] <wikibugs>	 10Labs: Require 'libicu-dev' package for teleIRC bot for Wikitech project - https://phabricator.wikimedia.org/T169338#3395217 (10Acagastya)
[14:38:17] <acagastya>	 andre__ I also need to run `npm install -g teleirc` and it might need sudo.
[14:38:27] <acagastya>	 Should I include that in the same task?
[14:40:16] <wikibugs>	 10Labs, 10User-Addshore: Metrics from WikiFactMine labs project have disappeared from graphite - https://phabricator.wikimedia.org/T169118#3395219 (10Addshore) That would be this script https://github.com/wikimedia/puppet/blob/production/modules/graphite/files/archive-instances
[14:42:39] <wikibugs>	 10Labs: Require 'libicu-dev' package and `npm install -g teleirc` for a Wikitech project - https://phabricator.wikimedia.org/T169338#3395237 (10Acagastya)
[14:42:54] <wikibugs>	 10Labs, 10User-Addshore: Metrics from WikiFactMine labs project have disappeared from graphite - https://phabricator.wikimedia.org/T169118#3395239 (10Addshore) The first step there    > Gets list of hosts that have any metric defined  Simply looks at the metrics defined for each project assuming they are all h...
[14:42:58] <addshore>	 tarrow: ^^ I got it
[14:43:04] <wikibugs>	 10Labs, 10User-Addshore: Metrics from WikiFactMine labs project have disappeared from graphite - https://phabricator.wikimedia.org/T169118#3395240 (10Tarrow) Yep, that's exactly what's happening.  I can find all my old metrics in archived_metrics  This should probably be documented somewhere.
[14:43:19] <tarrow>	 addshore: yep. I found them!
[14:43:24] <addshore>	 Haha, tarrow the last line of both of our comments is exactly the same!
[14:43:35] <tarrow>	 :P
[14:44:13] <wikibugs>	 10Labs, 10User-Addshore: Metrics from WikiFactMine labs project have disappeared from graphite - https://phabricator.wikimedia.org/T169118#3395264 (10Addshore) >>! In T169118#3395239, @Addshore wrote: > This should probably be documented somewhere.  >>! In T169118#3395240, @Tarrow wrote: > This should probably...
[14:44:20] <Urbanecm>	 Hi, why is 'select * from logging where log_title="Grace_Quan_docked.jpg";' taking so long at commonswiki_p? How can I get my results in a reasonable time?
[14:44:51] <addshore>	 tarrow: I'll create a sub ticket
[14:45:05] <addshore>	 you can probably also get someone to move your metric to some other place (so you have the old data)
[14:45:10] <wikibugs>	 10Tool-Labs-tools-Pageviews: Add URL params for all chart options - https://phabricator.wikimedia.org/T169343#3395268 (10MusikAnimal)
[14:45:19] <addshore>	 Or, you can just read it from the api in the archived place and then send it all back to graphite under a new name
[14:48:14] <wikibugs>	 10Labs, 10User-Addshore: Metrics from WikiFactMine labs project have disappeared from graphite - https://phabricator.wikimedia.org/T169118#3395281 (10Tarrow) I've popped a note in https://wikitech.wikimedia.org/wiki/Graphite but it could be good to have a specific guide for labs users.
[14:48:27] <addshore>	 It might make sense to actually move all project metrics
[14:48:35] <shinken-wm>	 PROBLEM - Puppet errors on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[14:48:44] <addshore>	 into project.projectname.instancename.foo etc
[14:50:10] <wikibugs>	 10Labs, 10User-Addshore: Metrics from WikiFactMine labs project have disappeared from graphite - https://phabricator.wikimedia.org/T169118#3395295 (10Addshore) It might be an idea to move all of these instance metrics 1 level deeper to avoid things like this happening. There are a bunch of other top level metr...
[14:59:01] <madhuvishy>	 morning all! thanks for filing the bug and detailed comments tarrow and addshore, i'll look in a bit!
[14:59:15] <addshore>	 awesome :)
[14:59:18] <addshore>	 And morning! :D
[15:01:13] <tarrow>	 thanks and Morning!
[15:06:13] <wikibugs>	 10cloud-services-team, 10DBA, 10Operations: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3395348 (10madhuvishy) @jcrespo Apologies for the delay. Can we start with just labsdb1005 first, and attempt to do it Wednesday July 5, and labsdb1004 on Thursday July 6, provided the...
[15:12:32] <wikibugs>	 10cloud-services-team, 10DBA, 10Operations: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3395356 (10Cmjohnson) @madhuvishy I am out all next week and will be back July 11.
[15:14:39] <wikibugs>	 10cloud-services-team, 10DBA, 10Operations: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3395358 (10madhuvishy) @Cmjohnson Okay thanks for letting me know, I'll schedule the labsdb1001 and 1003 reboots (the ciscos), for after you are back then. When are you in the DC (from...
[15:16:21] <wikibugs>	 10cloud-services-team, 10DBA, 10Operations: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3395385 (10Cmjohnson) @madhuvishy I typically get to the DC around 1400UTC  (10am EST).
[15:19:19] <wikibugs>	 10Labs, 10Labs-Infrastructure, 10Operations, 10ops-eqiad: rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407#3395386 (10Cmjohnson)
[15:28:37] <shinken-wm>	 RECOVERY - Puppet errors on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:38:18] <wikibugs>	 10Labs, 10Operations, 10ops-eqiad: rack/setup/install labnet100[34] - https://phabricator.wikimedia.org/T165779#3395406 (10Cmjohnson)
[16:08:02] <wikibugs>	 10Labs, 10Tool-Labs: Build new tools puppetmaster - https://phabricator.wikimedia.org/T169350#3395448 (10Andrew)
[16:15:05] <Freddy2001>	 bd808, are you there?
[16:15:37] <bd808>	 Freddy2001: I've got to step away for about 30 minutes
[16:15:43] <bd808>	 something quick?
[16:15:48] <Freddy2001>	 yes
[16:16:03] <Freddy2001>	 same issue as yesterday but now on another instance
[16:16:08] <Freddy2001>	 AH00526: Syntax error on line 3 of /etc/apache2/sites-enabled/000-default.conf:
[16:16:08] <Freddy2001>	 DocumentRoot must be a directory
[16:16:26] <bd808>	 is it?
[16:16:43] <Freddy2001>	 it is
[16:17:28] <bd808>	 you double checked that you didn't leave off the initial / or something when you wrote the config file?
[16:18:40] <Freddy2001>	 yes i can c&p the path and access it with cd
[16:19:39] <Freddy2001>	 but it is an instance with a floating ip
[16:19:46] <Freddy2001>	 do i need some additional config there?
[16:19:53] <bd808>	 weird. I've really got to run, but I can look at the instance later if you don't figure it out. I'm pretty sure you have a typo somewhere
[16:20:13] <bd808>	 no, floating ip wouldn't affect that
[16:20:50] <Freddy2001>	 yes it would be nice if you could have a quick look at "webservices"
[17:24:51] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1433 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[17:48:26] <xaosflux>	 !help hi does anyone here support quarry.wmflabs.org ?
[17:48:26] <wm-bot>	 xaosflux: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team
[17:48:42] <xaosflux>	 lots of 502 Bad Gateway errors
[17:49:05] <bd808>	 xaosflux: that doesn't sound good. let me peek
[17:49:10] <xaosflux>	 thank you
[17:50:03] <madhuvishy>	 Thanks bd808: let me know if you need a hand
[17:51:09] <bd808>	 madhuvishy: the 502 seems to happen on the OAuth return link, but then if I hit reload I can get in and I'm authed...
[17:51:37] <madhuvishy>	 ah, so quarry itself is fine?
[17:51:48] <xaosflux>	 I was able to get in after authing, then it keeps throwing that as I navigate pages
[17:52:18] <bd808>	 well, the 502 is from quarry when the browser returns from approving the grant at meta
[17:52:21] <madhuvishy>	 aah
[17:52:23] <xaosflux>	 e.g. error, got past error, go to https://quarry.wmflabs.org/query/runs/all error again
[17:52:28] <madhuvishy>	 okay
[17:52:46] <bd808>	 is it load balanced?
[17:52:48] <xaosflux>	 actually reloading any page
[17:52:54] <xaosflux>	 every other load is failing
[17:52:57] <bd808>	 maybe one node is dead if so?
[17:53:05] <madhuvishy>	 i think there's web-01 and 02
[17:53:30] <xaosflux>	 not exactly every other, but about 50% fail
[17:53:51] <bd808>	 yeah, 502s are intermittent for me. I'd bet on one dead node in a pool of 2+
[17:57:50] <wikibugs_>	 10Labs, 10Graphite, 10Operations, 10Patch-For-Review, 10User-fgiunchedi: Move labs 'instances' data to graphite labs - https://phabricator.wikimedia.org/T143405#3395718 (10bd808) @fgiunchedi The patch from @chasemp should stop new ones from being created once puppet does its thing across all of the VMs....
[17:58:08] <harej>	 bd808: I've also been getting intermittent 502s but it hasn't really been bothering me because it's only sometimes
[18:00:57] <madhuvishy>	 bd808: It looks like quarry-main is the uwsgi server, and quarry-runner-01 and 02 have the celery workers
[18:01:02] <Freddy2001>	 bd808, could you already fix the problem with the apache config?
[18:01:04] <bd808>	 madhuvishy: looks like there are no python processes running on quarry-runner-01
[18:01:20] <bd808>	 what do I do to kick celery in the head?
[18:01:34] <madhuvishy>	 https://www.irccloud.com/pastebin/gOBUhlrb/
[18:01:39] <madhuvishy>	 i just saw
[18:02:16] <madhuvishy>	 systemctl restart celery-quarry-worker.service
[18:02:27] <bd808>	 of course :)
[18:02:55] <bd808>	 Is there a sysadmin page for Quarry somewhere?
[18:03:27] <bd808>	 Freddy2001: not yet, sorry. what was your project again? I know the instance name is webservices
[18:03:57] <Freddy2001>	 yes the instance is webservices on the project getstarted
[18:04:00] <madhuvishy>	 don't think so
[18:04:59] <bd808>	 madhuvishy: ok. if you get bored some afternoon it would be great to have a brain dump of what you know about it. I'd like WMCS to be able to support it.
[18:05:37] <bd808>	 hmmm... still getting some 502s
[18:05:46] <wikibugs>	 10Labs, 10Labs-Infrastructure: Unable to SSH onto tools-login.wmflabs.org - https://phabricator.wikimedia.org/T130446#3395788 (10chasemp)
[18:05:51] <madhuvishy>	 bd808: yeah alright. milimetric has been trying to know his way around and trying to support it too :)
[18:06:03] <bd808>	 the more the merrier!
[18:06:35] <madhuvishy>	 bd808: might have to restart uwsgi on quarry-main maybe?
[18:06:40] <madhuvishy>	 uwsgi-quarry-web
[18:06:50] <bd808>	 yeah I was thinking that would be the next step
[18:07:29] <wikibugs>	 10Labs, 10Labs-Infrastructure, 10Operations: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#3395786 (10chasemp) 05Resolved>03Open I don't think this should be closed as long as this stuff exists: > modules/role/manifests/openldap/labs.pp  ```    # restart slapd if it uses more...
[18:07:54] <milimetric>	 just casually observing this is not typical - for quarry to have problems again so soon... wonder if something else is up
[18:07:55] <madhuvishy>	 bd808: right. quarry-main-01 is the instance
[18:08:34] <madhuvishy>	 milimetric: hmmm, any logs you think we should poke at?
[18:09:12] <bd808>	 !log quarry Ran service uwsgi-quarry-web restart on quarry-main-01. People seeing intermittent 502s
[18:09:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Quarry/SAL
[18:10:17] <bd808>	 that looks like it may have healed it up, but would be nice to know what went sour
[18:10:54] <milimetric>	 yeah, no time today, sorry
[18:11:37] <bd808>	 *nod* I'll see if I can circle back to it, but no promises
[18:12:02] <bd808>	 xaosflux: I *think* we fixed it up. if you see more 502s give us another shout
[18:14:14] <bd808>	 Freddy2001: that 000-default.conf on webservices is in sites-available, but not linked into sites-enabled
[18:14:30] <xaosflux>	 ok thank you bd808 will go try
[18:15:08] <bd808>	 Freddy2001: you need to do something like `sudo a2ensite 000-default`  to enable it and then restart apache
[18:16:12] <Freddy2001>	 i did this, but it does not resolve the issue
[18:16:32] <wikibugs>	 10Labs, 10Tool-Labs: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues - https://phabricator.wikimedia.org/T169289#3395822 (10chasemp) p:05Triage>03High
[18:16:34] <Freddy2001>	 000-default is now linked in sites-enabled
[18:17:10] <bd808>	 did apache give you an error message when you restarted it?
[18:18:39] <bd808>	 Freddy2001: I don't see any sign of a restart since about an hour and half ago
[18:19:08] <xaosflux>	 bd808: quarry seems to be happy now, thanks again
[18:19:09] <Freddy2001>	 <bd808> did apache give you an error message when you restarted it? <- no, just Reloading web server apache2  
[18:19:37] <bd808>	 !log getstarted Ran sudo service apache2 restart on webservices
[18:19:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Getstarted/SAL
[18:20:04] <bd808>	 Freddy2001: I just meant restart the apache2 daemon process :)
[18:20:16] <bd808>	 a full reboot shouldn't hurt though I guess
[18:21:06] <bd808>	 apache doesn't re-read its config once it is running, so every time you make a change you need to do `sudo service apache2 restart`
[18:28:02] <bd808>	 Freddy2001: hmmm... after the reboot it's back to only having 00-dummy.conf in sites-enabled
[18:28:16] <bd808>	 Do you know what puppet roles are running there?
[18:33:54] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1417 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[18:40:37] <bd808>	 Freddy2001: "access to / denied (filesystem path '/home/steinsplitter/public_html') because search permissions are missing on a component of the path" -- apache is mad because steinsplitter's home directory is chmod 0700. It wants the www-data user to be able to read all of the components in the path. I would recommend moving that docroot somewhere else.
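A minimal sketch of how one might find the directory that triggers this "search permissions are missing" error. It walks from the docroot up to `/` and reports every directory that lacks the execute (search) bit for "others". Two assumptions: the web server user reaches the path only through the "others" permission bits (no group membership or ACLs), and the script runs as a user who can stat every component, such as the owner or root. The helper name is hypothetical.

```python
import os
import stat


def components_blocking_others(path):
    """Return directories on the way to `path` that lack the 'others' search bit."""
    blocked = []
    current = os.path.abspath(path)
    while True:
        mode = os.stat(current).st_mode
        # A directory needs the execute bit set for a user to traverse it.
        if stat.S_ISDIR(mode) and not mode & stat.S_IXOTH:
            blocked.append(current)
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return list(reversed(blocked))
```

Called on the docroot from the error message, this would be expected to list the home directory that is chmod 0700.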
[18:49:49] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1433 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:13:54] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1417 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:35:51] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1433 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[20:29:38] <apergos>	 hey cloud wmf folks, I have a gsoc student who will probably pop in here at some point but I wanna get your take on this anyways, or at least have it on your radar
[20:30:09] <apergos>	 his project is doing a little twitter bot that will accept emoji tweets and send back commons images, so we're looking at where the final project would run
[20:30:16] <apergos>	 it's node based
[20:30:30] <apergos>	 does he need a new project for this, is there a good project already he could get an instance on, etc
[20:30:43] <apergos>	 he's new to cloud but did already set up an account
[20:35:37] <bd808>	 apergos: I bet that could just run as a tool on Toolforge
[20:35:54] <bd808>	 we have pretty good support for nodejs on the kubernetes grid
[20:36:33] <bd808>	 and some idea of how to set up a bot process running there from things I've done with stashbot and a few other volunteer run tools
[20:38:28] <bd808>	 apergos: the basics of running a continuous process on kubernetes are documented at https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Kubernetes#Kubernetes_continuous_jobs
[20:38:50] <apergos>	 is toolforge the new toollabs then?
[20:39:00] <apergos>	 I guess it is, I see the help page has the old name
[20:39:01] <apergos>	 cool
[20:39:12] <apergos>	 all right, he might ask you about it or link you to a ticket
[20:39:14] <bd808>	 yeah. I'm going to start renaming things all over the place next week :)
[20:39:28] <bd808>	 sounds good. glad to help people get started
[20:39:36] <apergos>	 and of course there's a repo with the bot as it is now, so anyone can look at it and make sure there's nothing weird in here
[20:39:37] <apergos>	 there
[20:39:46] <apergos>	 that wouldn't work on toolforge or whatever
[20:39:52] <apergos>	 toolforge btw is such an awesome name
[20:40:11] <apergos>	 you're going to have to use codesmith in there someplace
[20:40:15] <apergos>	 find a spot :-P
[20:40:31] <bd808>	 heh. maybe the next gen PAWS
[20:40:35] <apergos>	 aww
[20:40:45] <apergos>	 I will hate to see paws go
[20:41:05] <apergos>	 btw thanks for looking at the instance metrics thing
[20:41:16] <bd808>	 I just poked chasemp :)
[20:41:39] <apergos>	 well thanks chasemp for looking at the instance metrics thing :-)
[20:43:27] <chasemp>	 welcome
[20:44:30] <Freddy2001>	 bd808, I tried to move all the stuff to another place but http://whatcanidoforwikimediacommons.org/ still points to /var/www
[20:45:05] <bd808>	 something keeps removing the vhost you set up I think
[20:45:37] <bd808>	 My guess would be a puppet role that is cleaning up things it doesn't expect in sites-enabled
[20:46:16] <bd808>	 let's look at what role::simplelamp does...
[20:47:54] <bd808>	 yeah :/
[20:48:52] <bd808>	 so using role::simplelamp is causing problems. That Puppet role takes ownership of sites-enabled and every time puppet runs it cleans out the symlinks there that puppet doesn't know about
[20:49:36] <bd808>	 to use role::simplelamp you really need to also make a custom puppet role that sets up the vhosts you want to have running
[20:49:56] <bd808>	 this seems like something we could try to find a general solution for...
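The cleanup behaviour described above can be pictured with a small dry-run sketch. This is not the Puppet code behind role::simplelamp, only an illustration of the idea: anything in the directory that the configuration management does not know about is a candidate for removal on the next run. The function name and the `managed` set are hypothetical, and the function deletes nothing.

```python
import os


def unmanaged_entries(directory, managed):
    """List entries in `directory` that a purge would remove (dry run, deletes nothing)."""
    return sorted(name for name in os.listdir(directory) if name not in managed)
```

With `managed={"00-dummy.conf"}`, a hand-made `000-default.conf` symlink in sites-enabled would show up in the result, which is consistent with the vhost disappearing after each Puppet run.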
[20:51:54] <wikibugs_>	 10Labs, 10Patch-For-Review, 10User-bd808, 10cloud-services-team (Kanban): Require 'libicu-dev' package and `npm install -g teleirc` for a Wikitech project - https://phabricator.wikimedia.org/T169338#3396104 (10Amire80) Yep, I am indeed developing a Telegram bot that makes edits on MediaWiki. The code is at...
[20:55:37] <bd808>	 Freddy2001: so the central problem here is that the puppet runs that happen twice an hour are undoing part of your work
[20:56:38] <bd808>	 it's not immediately obvious how to fix this without creating a project local puppetmaster in getstarted.
[20:57:16] <Freddy2001>	 okay this explains why my work does not change anything in the config
[20:57:49] <Freddy2001>	 I removed the puppet class in horizon, it should be fixed then, right?
[20:58:28] <bd808>	 yeah, that will keep it from happening again. the downside is that nothing will be making sure that your mysql and apache are up and running
[20:58:40] <bd808>	 but hopefully that will work out ok
[20:59:02] <bd808>	 I'm going to write up a bug too and think about general solutions
[21:00:17] <bd808>	 Freddy2001: so with that role removed from the instance, you should be able to do `sudo a2ensite 000-default` again, then `sudo service apache2 restart` and hopefully the vhost will come up as expected
[21:00:32] <Freddy2001>	 yes it works now! :D
[21:00:41] <Freddy2001>	 thank you very much for your help!
[21:01:09] <bd808>	 you are welcome
[21:14:41] <bd808>	 what's up with wikibugs today?
[21:16:21] <RainbowSprinkles>	 It's getting ready for the weekend?
[21:18:38] <bd808>	 RainbowSprinkles: good plan for a hard working bot
[21:18:58] <apergos>	 bots deserve time off too
[21:19:14] <apergos>	 it's been flooded out earlier today ... or was that yesterday? not sure
[21:19:22] <apergos>	 anyways, having issues here and there
[21:32:18] <wikibugs>	 10Labs, 10Labs-Infrastructure, 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#3396208 (10RobH)
[21:35:25] <wikibugs>	 10Labs, 10Labs-Infrastructure, 10Operations, 10ops-eqiad: rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#3396227 (10RobH) a:05RobH>03chasemp
[21:36:28] <wikibugs>	 10Labs, 10Labs-Infrastructure, 10Operations, 10ops-eqiad: rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#3276700 (10RobH) Assigned to @chasemp for service implementation.  This task can be resolved once you are aware!
[21:43:23] <apergos>	 buggin out, see folks later!
[22:21:22] <legoktm>	 bd808: it seems to be lagging for some reason. idk why
[22:21:52] <bd808>	 lots of excess flood kicks too it seems.
[22:21:56] <bd808>	 poor bot
[22:27:16] <ebernhardson>	 hmm, paws is returning 502 bad gateway
[22:27:42] <bd808>	 ebernhardson: blech. we had that earlier today too intermittently
[22:27:54] * bd808 goes to take a look again
[22:28:36] <bd808>	 oh actually it was quarry earlier, not paws
[22:29:38] <bd808>	 ebernhardson: did you get the 502 on the return from the OAuth grant or somewhere deeper in the app?
[22:29:52] <ebernhardson>	 bd808: immediately on accessing, after it redirects to my user url
[22:30:09] <ebernhardson>	 so i'm going to guess the OAuth token is still 'active'
[22:30:50] <bd808>	 ok. If you go back to https://paws.wmflabs.org/ again does it work?
[22:31:06] <bd808>	 This time I got in fine and am even able to use a terminal
[22:31:49] <bd808>	 I wonder if it is similar to what we saw in quarry where it acted like just 1 or 2 uwsgi workers in the pool were dead
[22:32:03] <ebernhardson>	 nope, redirect and then 502 again
[22:32:22] <TabbyCat>	 I cannot get any script running on PAWS fwiw (hides)
[22:33:38] <bd808>	 sadly my PAWS "fix" is to just do this: hey madhuvishy do you have a minute to look at PAWS. Some but not all people are getting 502s from it
[22:33:55] <madhuvishy>	 bd808: sure looking
[22:34:13] <bd808>	 !log quarry Added BryanDavis (self) as project admin
[22:34:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Quarry/SAL
[22:34:22] <ebernhardson>	 :)
[22:34:47] <bd808>	 someday I will grok PAWS but not today
[22:34:49] <madhuvishy>	 hmmm things are working for me, i'm poking at the tool
[22:36:48] <bd808>	 yeah. I was able to log in and use things too.
[22:37:09] <bd808>	 there were some session corruption complaints earlier in the week
[22:37:57] <madhuvishy>	 yeah i'm seeing your requests come in
[22:38:32] <bd808>	 everything is swell for me. ebernhardson can you make some errors for madhuvishy?
[22:38:53] <madhuvishy>	 ebernhardson: i kicked your pod
[22:38:58] <madhuvishy>	 try logging in now?
[22:41:03] <madhuvishy>	 TabbyCat: what is the behavior/errors you are seeing? what's your paws username
[22:44:08] <TabbyCat>	 madhuvishy: don't worry, I don't use PAWS very often -- but it was about some OAuth grant error or something like that
[22:44:28] <madhuvishy>	 your username?
[22:44:40] <HaeB>	 i just filed https://phabricator.wikimedia.org/T169382
[22:45:02] <madhuvishy>	 ah HaeB looking at yours too
[22:45:10] <bd808>	 HaeB: PAWS hates on you every Friday. :(
[22:45:10] <HaeB>	 ...in my case PAWS fails with a too many redirects error, but i have also seen 502 bad gateway recently
[22:45:17] <wikibugs>	 10PAWS: "Start My Server" fails with "too many redirects" error - https://phabricator.wikimedia.org/T169382#3396434 (10Tbayer)
[22:46:02] <ebernhardson>	 madhuvishy: that seems to have done the trick, thanks!
[22:46:13] <madhuvishy>	 ebernhardson: oh cool!
[22:46:40] <madhuvishy>	 HaeB: try now?
[22:48:21] <HaeB>	 madhuvishy: looks like it works now \o/
[22:48:43] <madhuvishy>	 hmmm, some user pods seem to be in some bad state
[22:48:47] <madhuvishy>	 not sure what exactly
[22:49:30] <HaeB>	 (btw meant 504 bad gateway above, not 502)
[22:50:00] <HaeB>	 (*cough* 504 gateway timeout, ofc)
[22:50:05] * madhuvishy can't wait for yuvi's overhaul
[22:50:13] <madhuvishy>	 HaeB: ah okay
[22:50:32] <HaeB>	 that was 2 days ago or so
[22:51:07] <madhuvishy>	 oh hmmm, yesterday everything was kinda possibly down. not sure what was up 2 days ago
[23:11:06] <HaeB>	 madhuvishy: i see... btw is there an equivalent of https://status.wikimedia.org/ for tools/labs/clouds/paws? (or should some of these services be on status.wikimedia?)
[23:14:20] <madhuvishy>	 HaeB: no these aren't on status.wikimedia - I know that that's all provided by an external service, and usually reserved for production things - not sure of what goes there and doesn't
[23:14:36] <madhuvishy>	 as for us, we have shinken/icinga based monitoring that alerts us
[23:14:47] <madhuvishy>	 and things like this - https://grafana-labs.wikimedia.org/dashboard/db/tools-basic-alerts
[23:15:40] <madhuvishy>	 but i see what you are asking for, and i don't think we have an exact equivalent 
[23:18:54] <madhuvishy>	 don't know if you can see this - but this has high level tools status checks - https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=checker.tools.wmflabs.org
[23:19:44] <wikibugs>	 10PAWS: "Start My Server" fails with "too many redirects" error - https://phabricator.wikimedia.org/T169382#3396536 (10madhuvishy) 05Open>03Resolved a:03madhuvishy Resolved by restarting the user's paws pod.
[23:21:04] <HaeB>	 yep, i can access that icinga page
[23:22:12] <HaeB>	 madhuvishy: it has 12 different rows ... which ones are most relevant if i want to find out if a PAWS issue is just about me / my notebook, or due to bad weather in general? ;)
[23:24:29] <madhuvishy>	 HaeB: ah that page is pretty specific to broad tools things
[23:27:09] <madhuvishy>	 https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=paws.wmflabs.org
[23:27:14] <madhuvishy>	 shows general paws health
[23:27:24] <madhuvishy>	 but it's rarely the case that all of paws is down
[23:27:46] <madhuvishy>	 i don't think we track the status of every user's notebook and their ability to login either
[23:37:51] <HaeB>	 madhuvishy: cool, the general paws status is useful enough
[23:37:59] <HaeB>	 added the link at https://wikitech.wikimedia.org/wiki/PAWS#Other_notes
[23:38:11] <madhuvishy>	 HaeB: :) thank you