[00:34:29] PROBLEM - Puppet errors on tools-exec-1439 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[00:41:03] PROBLEM - Puppet errors on tools-exec-1436 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[00:47:17] Labs, Tool-Labs: Homedir for user whym is very large (>60G) - https://phabricator.wikimedia.org/T169265#3393571 (whym) Thanks for notifying me. It was a (probably incomplete) copy of dumps.wikimedia.org/other. The largest subdirectory was /other/diffdb/ which I have deleted. I was particularly interested...
[01:04:27] RECOVERY - Puppet errors on tools-exec-1439 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:08:53] Labs, Tool-Labs: Homedir for user whym is very large (>60G) - https://phabricator.wikimedia.org/T169265#3393602 (bd808) Open>Resolved a:whym Usage now is ~2G. Thanks a lot for the quick response @whym. It is very much appreciated.
[01:11:02] RECOVERY - Puppet errors on tools-exec-1436 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:18:42] PROBLEM - Puppet errors on tools-exec-1402 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[01:26:32] grid and kubernetes are going to have intermittent issues as we get switched back to the other NFS server
[01:29:27] PROBLEM - High iowait on tools-grid-master is CRITICAL: CRITICAL: tools.tools-grid-master.cpu.total.iowait (>11.11%)
[01:29:29] PROBLEM - Puppet errors on tools-bastion-05 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[01:29:36] !log tools rebooting tools-cron-01
[01:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[01:29:40] PROBLEM - Puppet errors on tools-webgrid-lighttpd-1412 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[01:29:53] PROBLEM - Puppet errors on tools-webgrid-lighttpd-1406 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[01:30:37] PROBLEM - Puppet errors on tools-webgrid-generic-1404 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[01:31:37] PROBLEM - Puppet errors on tools-mail is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[01:32:03] PROBLEM - Puppet errors on tools-exec-1436 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[01:32:05] PROBLEM - Puppet errors on tools-exec-1440 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[01:33:17] !log tools time for i in `cat tools-hosts`; do ssh -i ~/.ssh/labs_root_id_rsa root@$i.eqiad.wmflabs 'hostname -f; uptime; tc-setup'; done
[01:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[02:01:12] Labs, DC-Ops, Operations: labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286#3393637 (Andrew)
[02:01:36] Labs, DC-Ops, Operations: labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286#3393620 (Andrew) I tagged dc-ops because... have y'all ever seen something like this?
[02:05:51] RECOVERY - Puppet errors on tools-exec-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:06:01] RECOVERY - Puppet errors on tools-exec-1434 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:07:07] RECOVERY - Puppet errors on tools-exec-1436 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:07:08] RECOVERY - Puppet errors on tools-exec-1440 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:07:21] Labs, DC-Ops, Operations: labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286#3393620 (bd808) http://www.dell.com/support/manuals/us/en/04/dell-opnmang-sw-v8.1/EEMI_13G_v1.2-v1/UEFI-Event-Messages?guid=GUID-823669E3-2D7B-41B5-85F1-AF7A6BC11ACC&lang=en-u...
[02:08:00] RECOVERY - Puppet errors on tools-webgrid-lighttpd-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:08:09] RECOVERY - Puppet errors on tools-webgrid-lighttpd-1407 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:08:59] RECOVERY - Puppet errors on tools-worker-1016 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:09:01] RECOVERY - Puppet errors on tools-exec-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:09:39] RECOVERY - Puppet errors on tools-webgrid-lighttpd-1412 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:09:51] RECOVERY - Puppet errors on tools-webgrid-lighttpd-1406 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:10:39] RECOVERY - Puppet errors on tools-webgrid-generic-1404 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:16:40] Labs, DC-Ops, Operations: labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286#3393620 (madhuvishy) We did another reboot to downgrade the kernel back to 4.3 and the error happened again.
[02:16:54] Labs, Operations, Kubernetes: etcd config depends on puppet certs, but puppet doesn't know - https://phabricator.wikimedia.org/T169287#3393649 (Andrew)
[02:17:52] Tool-Labs-tools-Xtools: Properly handle malformed curl results - https://phabricator.wikimedia.org/T169288#3393661 (Matthewrbowker)
[02:22:36] PROBLEM - Puppet errors on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[02:28:16] Labs, Operations, Kubernetes: etcd config depends on puppet certs, but puppet doesn't know - https://phabricator.wikimedia.org/T169287#3393675 (madhuvishy)
[02:28:41] RECOVERY - Puppet errors on tools-exec-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:33:19] Labs, Tool-Labs: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues - https://phabricator.wikimedia.org/T169289#3393676 (bd808)
[02:33:55] Labs, Operations, Kubernetes: etcd config depends on puppet certs, but puppet doesn't know - https://phabricator.wikimedia.org/T169287#3393690 (bd808)
[02:33:56] Labs, Tool-Labs: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues - https://phabricator.wikimedia.org/T169289#3393689 (bd808)
[02:36:02] Labs, Tool-Labs: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues - https://phabricator.wikimedia.org/T169289#3393691 (bd808)
[02:36:04] Labs, DC-Ops, Operations: labstore1005 A PCIe link training failure error on boot - https://phabricator.wikimedia.org/T169286#3393692 (bd808)
[02:37:25] cloud-services-team, Operations: New anti-stackclash (4.9.25-1~bpo8+3 ) kernal SUPER BAD for NFS - https://phabricator.wikimedia.org/T169290#3393693 (Andrew)
[02:37:40] cloud-services-team, Operations: New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS - https://phabricator.wikimedia.org/T169290#3393706 (Andrew)
[02:38:26] cloud-services-team, Operations: New anti-stackclash (4.9.25-1~bpo8+3 ) kernel super bad for NFS - https://phabricator.wikimedia.org/T169290#3393693 (Andrew)
[02:38:28] Labs, Tool-Labs: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues - https://phabricator.wikimedia.org/T169289#3393707 (Andrew)
[02:39:01] Labs, Tool-Labs: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues - https://phabricator.wikimedia.org/T169289#3393710 (bd808) Some data about the system load we saw: {P5652}
[02:48:35] Tool-Labs-tools-Xtools, Community-Tech-Sprint: Optimize edit count queries in XTools - https://phabricator.wikimedia.org/T163284#3393715 (Samwilson) Good idea. It's probably a bit under half the load time for many users. It has been moved to load via JS.
[02:52:37] RECOVERY - Puppet errors on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:53:15] PROBLEM - Puppet errors on tools-worker-1021 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[02:56:22] Labs, Tool-Labs: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues - https://phabricator.wikimedia.org/T169289#3393743 (bd808) SAL entries: ``` == 2017-06-30 == 02:29 labstore1005 start drbd 02:14 reboot labstore1005 (5m ago) 01:33 time for i in `cat tools-hosts`; d...
[02:58:33] Labs: Labstore nfsd processes report "sent only x when sending y bytes - shutting down socket" - https://phabricator.wikimedia.org/T169281#3393758 (bd808)
[02:58:35] Labs, Tool-Labs: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues - https://phabricator.wikimedia.org/T169289#3393757 (bd808)
[03:02:00] Labs: Labstore nfsd processes report "sent only x when sending y bytes - shutting down socket" - https://phabricator.wikimedia.org/T169281#3393761 (bd808) Quite likely related to {T169290}. @chasemp could not find other log events like this prior to the kernel upgrade.
[03:13:57] Labs, Tool-Labs: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues - https://phabricator.wikimedia.org/T169289#3393775 (bd808)
[03:17:29] Labs, Tool-Labs: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues - https://phabricator.wikimedia.org/T169289#3393776 (bd808)
[03:33:13] RECOVERY - Puppet errors on tools-worker-1021 is OK: OK: Less than 1.00% above the threshold [0.0]
[03:43:28] Tool-Labs-tools-Xtools, Community-Tech-Sprint: Restrict access to users' edit stats unless opted-in - https://phabricator.wikimedia.org/T165401#3393780 (Samwilson) Thanks for finding that. Should be fixed now. https://xtools-dev.wmflabs.org/ is updated.
[05:26:18] Tool-Labs-tools-Xtools, Community-Tech-Sprint: Restrict access to users' edit stats unless opted-in - https://phabricator.wikimedia.org/T165401#3393827 (kaldari) @Samwilson: I'm getting the "fr.wikipedia.org is not a valid project" error at https://xtools-dev.wmflabs.org/.
[05:36:25] Tool-Labs-tools-Xtools, Community-Tech-Sprint: Restrict access to users' edit stats unless opted-in - https://phabricator.wikimedia.org/T165401#3393828 (Samwilson) Ergh, again?! Bother. Do you mean when you type in the project name? Does it happen when you type in `fr.wikipedia.org` and then submit quic...
[05:48:30] Tool-Labs-tools-Xtools, Community-Tech-Sprint: Restrict access to users' edit stats unless opted-in - https://phabricator.wikimedia.org/T165401#3393833 (kaldari) @Samwilson: It happens as soon as I type in "fr.wikipedia.org" and then unfocus the field, so it's coming from the client-side API check.
[06:57:41] PROBLEM - Puppet errors on tools-webgrid-lighttpd-1425 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[07:12:27] PROBLEM - Puppet errors on tools-exec-1421 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[07:32:42] RECOVERY - Puppet errors on tools-webgrid-lighttpd-1425 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:51:21] Labs, Labs-Infrastructure: Unable to SSH onto tools-login.wmflabs.org - https://phabricator.wikimedia.org/T130446#3394061 (akosiaris)
[08:51:23] Labs, Labs-Infrastructure, Operations: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#3394058 (akosiaris) Open>Resolved a:akosiaris I am guessing this works fine? Since then we've had 0 troubles if I am not mistaken. I am resolving, feel free to reopen
[08:52:04] Labs, Labs-Infrastructure, Operations: Investigate failover failure of LDAP servers - https://phabricator.wikimedia.org/T141277#2492736 (akosiaris) Dependent task T130593 has had no update since Nov 2016, so this is probably solved. I am gonna resolve this, feel free to reopen
[08:52:08] Labs, Labs-Infrastructure, Operations: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2140296 (akosiaris)
[08:52:11] Labs, Labs-Infrastructure, Operations: Investigate failover failure of LDAP servers - https://phabricator.wikimedia.org/T141277#3394066 (akosiaris) Open>Resolved a:akosiaris
[08:52:25] RECOVERY - Puppet errors on tools-exec-1421 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:13:24] PROBLEM - Puppet errors on tools-exec-1421 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[09:48:24] RECOVERY - Puppet errors on tools-exec-1421 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:14:24] PROBLEM - Puppet errors on tools-exec-1421 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[10:43:25] (PS1) Addshore: Send all WMDE-.* phab projects to #wikimedia-de-tech [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/362373
[10:44:41] Labs, Graphite, Operations, Patch-For-Review, User-fgiunchedi: Move labs 'instances' data to graphite labs - https://phabricator.wikimedia.org/T143405#3394343 (ArielGlenn) Poking @bd808 on this, since it's been an issue for us again in the past week.
[10:45:07] (PS1) Addshore: Add MediaWiki-extensions-Wikibase(.*) to #wikidata-feed [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/362374
[10:49:24] RECOVERY - Puppet errors on tools-exec-1421 is OK: OK: Less than 1.00% above the threshold [0.0]
[11:12:40] PROBLEM - Puppet errors on tools-worker-1007 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[11:37:17] (CR) Thiemo Mättig (WMDE): [C: 1] Add MediaWiki-extensions-Wikibase(.*) to #wikidata-feed [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/362374 (owner: Addshore)
[11:37:46] (CR) Thiemo Mättig (WMDE): [C: 1] Send all WMDE-.* phab projects to #wikimedia-de-tech [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/362373 (owner: Addshore)
[11:52:39] RECOVERY - Puppet errors on tools-worker-1007 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:38:44] !help Can anyone help me setting up a project.
[12:38:44] Guest12443: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team
[12:42:54] Anyone here?
[13:10:41] acagastya: What would it change?
[13:11:01] !ask
[13:14:04] Labs, wikitech.wikimedia.org: 2FA reset for Wikitech account - https://phabricator.wikimedia.org/T169332#3394984 (Bawolff) I can confirm I was talking to this user on irc, and he was logged into his irc account, which had a wikipedia cloak on it.
[13:17:05] Labs, wikitech.wikimedia.org: 2FA reset for Samtar Wikitech account - https://phabricator.wikimedia.org/T169332#3395006 (Bawolff)
[13:22:31] Labs, wikitech.wikimedia.org: 2FA reset for Samtar Wikitech account - https://phabricator.wikimedia.org/T169332#3395015 (Samtar) I've also added a file to a tool I have access to - http://tools.wmflabs.org/communityguidelines/T169332.txt
[13:32:29] Tool-Labs-tools-Other: Third party resources loaded from "communityguidelines" tool - https://phabricator.wikimedia.org/T169334#3395025 (Nemo_bis)
[13:34:04] Labs, wikitech.wikimedia.org: 2FA reset for Samtar Wikitech account - https://phabricator.wikimedia.org/T169332#3395040 (Bawolff) Open>Resolved a:Bawolff
[14:03:19] What is the deal with https://graphite-labs.wikimedia.org/? Is it meant for general labs use? I've been pushing metrics there and they seemed to be persistent for quite a while but recently I've noticed they disappeared. Today I noticed that the metric name is no longer even listed in the graphite browser. Is there some reason they would have disappeared?
[14:05:51] I know they are gradually aggregated to reduce storage for the long term but I've only been pushing one value per day so I assumed that they would last quite a long time
[14:08:32] tarrow: I don't know, but from https://wikitech.wikimedia.org/wiki/Graphite it seems that it is mostly intended for beta
[14:09:05] I wonder if prometheus is better supported? I really don't know
[14:09:38] I would ask, not because you need permission, but to avoid unintentional deletions
[14:10:51] Cool; I'm happy to. Any idea who I should ask?
[14:11:23] I would start with the tool owner, let me search for it
[14:11:29] s/tool/VM/
[14:14:17] I do not see anyone connected right now, I would create a phab ticket and add the people on this history: https://wikitech.wikimedia.org/w/index.php?title=Labmon1001&action=history
[14:14:51] I do not have access, I think, to know the current owners
[14:16:15] thanks!
[14:16:25] I'll go ahead and find out
[14:19:08] acagastya: even if I "looked" at the task that you just pointed to me in private, what would you expect specifically from *me*?
[14:19:20] * addshore wonders what timezone madhuvishy is in
[14:19:21] I am facing an issue with a nodeJS application.
[14:19:38] !log git upgrading gerrit on gerrit-test3 to gerrit 2.14.2 (pre)
[14:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Git/SAL
[14:19:49] andre__ no, should I open a new task on phab, or ask in channel.
[14:19:54] That is the question.
[14:19:57] addshore: by default UTC-7
[14:20:03] I need libicu-dev for the bot.
[14:20:06] acagastya, depends on what your previous question is that nobody knows.
[14:20:09] andre__: ack!
[14:20:21] acagastya, if you want someone to answer questions, you need to explain the problem.
[14:20:27] Labs: Metrics from WikiFactMine labs project have disappeared from graphite - https://phabricator.wikimedia.org/T169118#3387598 (Addshore) I agree that this is super odd and these things probably shouldn't just disappear. What is odd is that many other metrics in graphite remain untouched, it just seems to...
[14:20:32] andre__ I need `libicu-dev`.
[14:20:39] acagastya: why? for what?
[14:20:42] Labs, User-Addshore: Metrics from WikiFactMine labs project have disappeared from graphite - https://phabricator.wikimedia.org/T169118#3395166 (Addshore)
[14:20:53] acagastya: that's a potential solution. but not a description of a problem.
[14:20:55] For a teleirc bot.
[14:21:32] I can't apt-get it, andre__
[14:22:02] acagastya: file a Phabricator task requesting installation of that package and provide reasons why?
[14:24:36] https://phabricator.wikimedia.org/T169338
[14:24:49] Labs, User-Addshore: Metrics from WikiFactMine labs project have disappeared from graphite - https://phabricator.wikimedia.org/T169118#3395181 (Tarrow) Perhaps it's because it is under a path that is autofilled with metrics describing the status of labs instances. I think perhaps other metrics may have a...
[14:25:09] All that I know is the application depends on `libicu-dev`
[14:25:39] acagastya, "Please see this" and linking to a random codebase is not helpful. Also, the task lacks any information WHY you think you need that.
[14:25:56] acagastya, also, the task lacks any context. Is this about some Tool Labs instance? Some Labs stuff? Somewhere else?
[14:26:16] acagastya, please see https://mediawiki.org/wiki/How_to_report_a_bug how to help others to better understand reported tasks. Thanks!
[14:26:25] I know it does not help. But that application, as far as I know, depends on that package.
[14:26:40] acagastya, so why do you tell us here on irc but not in your Phab task?
[14:27:13] "as far as I know" confuses me. either you know or you don't? What makes you think so, that led you to asking here and creating that task?
[14:27:33] anyway, I cannot spend more time on this, sorry :) Good luck and I hope that will get sorted out! :)
[14:28:55] Labs, Labs-Infrastructure, Operations: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#3395192 (Andrew) Yep, no problems in ages. Thanks for the bug cleanup.
[14:29:11] Made some changes, andre__
[14:30:16] acagastya, if it's a request about installing some package, you want to explain where in your task, e.g. by adding the #Labs tag to it. Phabricator is used for hundreds of projects.
[14:30:58] without content in the task, nobody knows where you want this. I know that we are here in the #wikimedia-cloud channel, but your task does not say that it is related to Labs at all.
[14:31:20] acagastya, again, please see the steps in https://mediawiki.org/wiki/How_to_report_a_bug - thanks
[14:31:51] the better your task, the more likely someone will find it (hence: Project tags needed) and will look at it.
[14:36:55] Labs: Require 'libicu-dev' package for teleIRC bot for Wikitech project - https://phabricator.wikimedia.org/T169338#3395217 (Acagastya)
[14:38:17] andre__ I also need to run `npm install -g teleirc` and it might need sudo.
[14:38:27] Should I include that in the same task?
[14:40:16] Labs, User-Addshore: Metrics from WikiFactMine labs project have disappeared from graphite - https://phabricator.wikimedia.org/T169118#3395219 (Addshore) That would be this script https://github.com/wikimedia/puppet/blob/production/modules/graphite/files/archive-instances
[14:42:39] Labs: Require 'libicu-dev' package and `npm install -g teleirc` for a Wikitech project - https://phabricator.wikimedia.org/T169338#3395237 (Acagastya)
[14:42:54] Labs, User-Addshore: Metrics from WikiFactMine labs project have disappeared from graphite - https://phabricator.wikimedia.org/T169118#3395239 (Addshore) The first step there > Gets list of hosts that have any metric defined Simply looks at the metrics defined for each project assuming they are all h...
[14:42:58] tarrow: ^^ I got it
[14:43:04] Labs, User-Addshore: Metrics from WikiFactMine labs project have disappeared from graphite - https://phabricator.wikimedia.org/T169118#3395240 (Tarrow) Yep, that's exactly what's happening. I can find all my old metrics in archived_metrics This should probably be documented somewhere.
[14:43:19] addshore: yep. I found them!
[14:43:24] Haha, tarrow the last line of both of our comments is exactly the same!
[14:43:35] :P
[14:44:13] Labs, User-Addshore: Metrics from WikiFactMine labs project have disappeared from graphite - https://phabricator.wikimedia.org/T169118#3395264 (Addshore) >>! In T169118#3395239, @Addshore wrote: > This should probably be documented somewhere. >>! In T169118#3395240, @Tarrow wrote: > This should probably...
[14:44:20] Hi, why is 'select * from logging where log_title="Grace_Quan_docked.jpg";' taking so long at commonswiki_p? How can I get my results in reasonable time?
[14:44:51] tarrow: I'll create a sub ticket
[14:45:05] you can probably also get someone to move your metric to some other place (so you have the old data)
[14:45:10] Tool-Labs-tools-Pageviews: Add URL params for all chart options - https://phabricator.wikimedia.org/T169343#3395268 (MusikAnimal)
[14:45:19] Or, you can just read it from the api in the archived place and then send it all back to graphite under a new name
[14:48:14] Labs, User-Addshore: Metrics from WikiFactMine labs project have disappeared from graphite - https://phabricator.wikimedia.org/T169118#3395281 (Tarrow) I've popped a note in https://wikitech.wikimedia.org/wiki/Graphite but it could be good to have a specific guide for labs users.
[14:48:27] It might make sense to actually move all project metrics
[14:48:35] PROBLEM - Puppet errors on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[14:48:44] into project.projectname.instancename.foo etc
[14:50:10] Labs, User-Addshore: Metrics from WikiFactMine labs project have disappeared from graphite - https://phabricator.wikimedia.org/T169118#3395295 (Addshore) It might be an idea to move all of these instance metrics 1 level deeper to avoid things like this happening. There are a bunch of other top level metr...
[14:59:01] morning all! thanks for filing the bug and detailed comments tarrow and addshore, i'll look in a bit!
[14:59:15] awesome :)
[14:59:18] And morning! :D
[15:01:13] thanks and Morning!
[15:06:13] cloud-services-team, DBA, Operations: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3395348 (madhuvishy) @jcrespo Apologies for the delay. Can we start with just labsdb1005 first, and attempt to do it Wednesday July 5, and labsdb1004 on Thursday July 6, provided the...
[15:12:32] cloud-services-team, DBA, Operations: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3395356 (Cmjohnson) @madhuvishy I am out all next week and will be back July 11.
[15:14:39] cloud-services-team, DBA, Operations: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3395358 (madhuvishy) @Cmjohnson Okay thanks for letting me know, I'll schedule the labsdb1001 and 1003 reboots (the ciscos), for after you are back then. When are you in the DC (from...
[15:16:21] cloud-services-team, DBA, Operations: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3395385 (Cmjohnson) @madhuvishy I typically get to the DC around 1400UTC (10am EST).
[15:19:19] Labs, Labs-Infrastructure, Operations, ops-eqiad: rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407#3395386 (Cmjohnson)
[15:28:37] RECOVERY - Puppet errors on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:38:18] Labs, Operations, ops-eqiad: rack/setup/install labnet100[34] - https://phabricator.wikimedia.org/T165779#3395406 (Cmjohnson)
[16:08:02] Labs, Tool-Labs: Build new tools puppetmaster - https://phabricator.wikimedia.org/T169350#3395448 (Andrew)
[16:15:05] bd808, are you there?
[16:15:37] Freddy2001: I've got to step away for about 30 minutes
[16:15:43] something quick?
[16:15:48] yes
[16:16:03] same issue as yesterday but now on another instance
[16:16:08] AH00526: Syntax error on line 3 of /etc/apache2/sites-enabled/000-default.conf:
[16:16:08] DocumentRoot must be a directory
[16:16:26] is it?
[16:16:43] it is
[16:17:28] you double checked that you didn't leave off the initial / or something when you wrote the config file?
[16:18:40] yes i can c&p the path and access it with cd
[16:19:39] but it is an instance with a floating ip
[16:19:46] do i need some additional config there?
[16:19:53] weird. I've really got to run, but I can look at the instance later if you don't figure it out. I'm pretty sure you have a typo somewhere
[16:20:13] no, floating ip wouldn't affect that
[16:20:50] yes it would be nice if you could have a quick look at "webservices"
[17:24:51] PROBLEM - Puppet errors on tools-exec-1433 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[17:48:26] !help hi does anyone here support quarry.wmflabs.org ?
[17:48:26] xaosflux: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team
[17:48:42] lots of 502 Bad Gateway errors
[17:49:05] xaosflux: that doesn't sound good. let me peek
[17:49:10] thank you
[17:50:03] Thanks bd808: let me know if you need a hand
[17:51:09] madhuvishy: the 502 seems to happen on the OAuth return link, but then if I hit reload I can get in and I'm authed...
[17:51:37] ah, so quarry itself is fine?
[17:51:48] I was able to get in after authing, then it keeps throwing that as I navigate pages
[17:52:18] well, the 502 is from quarry when the browser returns from approving the grant at meta
[17:52:21] aah
[17:52:23] e.g. error, got past error, go to https://quarry.wmflabs.org/query/runs/all error again
[17:52:28] okay
[17:52:46] is it load balanced?
[17:52:48] actually reloading any page
[17:52:54] every other load is failing
[17:52:57] maybe one node is dead if so?
[17:53:05] i think there's web-01 and 02
[17:53:30] not exactly every other, but about 50% fail
[17:53:51] yeah, 502s are intermittent for me. I'd bet on one dead node in a pool of 2+
[17:57:50] Labs, Graphite, Operations, Patch-For-Review, User-fgiunchedi: Move labs 'instances' data to graphite labs - https://phabricator.wikimedia.org/T143405#3395718 (bd808) @fgiunchedi The patch from @chasemp should stop new ones from being created once puppet does its thing across all of the VMs....
[17:58:08] bd808: I've also been getting intermittent 502s but it hasn't really been bothering me because it's only sometimes
[18:00:57] bd808: It looks like quarry-main is the uwsgi server, and quarry-runner-01 and 02 have the celery workers
[18:01:02] bd808, could you already fix the problem with the apache config?
[18:01:04] madhuvishy: looks like there are no python processes running on quarry-runner-01
[18:01:20] what do I do to kick celery in the head?
[18:01:34] https://www.irccloud.com/pastebin/gOBUhlrb/
[18:01:39] i just saw
[18:02:16] systemctl restart celery-quarry-worker.service
[18:02:27] of course :)
[18:02:55] Is there a sysadmin page for Quarry somewhere?
[18:03:27] Freddy2001: not yet, sorry. what was your project again? I know the instance name is webservices
[18:03:57] yes the instance is webservices on the project getstarted
[18:04:00] don't think so
[18:04:59] madhuvishy: ok. if you get bored some afternoon it would be great to have a brain dump of what you know about it. I'd like WMCS to be able to support it.
[18:05:37] hmmm... still getting some 502s
[18:05:46] Labs, Labs-Infrastructure: Unable to SSH onto tools-login.wmflabs.org - https://phabricator.wikimedia.org/T130446#3395788 (chasemp)
[18:05:51] bd808: yeah alright. milimetric has been trying to know his way around and trying to support it too :)
[18:06:03] the more the merrier!
[18:06:35] bd808: might have to restart uwsgi on quarry-main maybe?
[18:06:40] uwsgi-quarry-web
[18:06:50] yeah I was thinking that would be the next step
[18:07:29] Labs, Labs-Infrastructure, Operations: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#3395786 (chasemp) Resolved>Open I don't think this should be closed as long as this stuff exists: > modules/role/manifests/openldap/labs.pp ``` # restart slapd if it uses more...
[18:07:54] just casually observing this is not typical - for quarry to have problems again so soon... wonder if something else is up
[18:07:55] bd808: right. quarry-main-01 is the instance
[18:08:34] milimetric: hmmm, any logs you think we should poke at?
[18:09:12] !log quarry Ran service uwsgi-quarry-web restart on quarry-main-01. People seeing intermittent 502s
[18:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Quarry/SAL
[18:10:17] that looks like it may have healed it up, but would be nice to know what went sour
[18:10:54] yeah, no time today, sorry
[18:11:37] *nod* I'll see if I can circle back to it, but no promises
[18:12:02] xaosflux: I *think* we fixed it up. if you see more 502s give us another shout
[18:14:14] Freddy2001: that 000-default.conf on webservices is in sites-available, but not linked into sites-enabled
[18:14:30] ok thank you bd808 will go try
[18:15:08] Freddy2001: you need to do something like `sudo a2ensite 000-default` to enable it and then restart apache
[18:16:12] i did this, but it does not resolve the issue
[18:16:32] Labs, Tool-Labs: Tool Labs 2017-06-29 Labstore100[45] kernel upgrade issues - https://phabricator.wikimedia.org/T169289#3395822 (chasemp) p:Triage>High
[18:16:34] 000-default is now linked in sites-enabled
[18:17:10] did apache give you an error message when you restarted it?
[18:18:39] Freddy2001: I don't see any sign of a restart since about an hour and a half ago
[18:19:08] bd808: quarry seems to be happy now, thanks again
[18:19:09] did apache give you an error message when you restarted it? <- no, just Reloading web server apache2
[18:19:37] !log getstarted Ran sudo service apache2 restart on webservices
[18:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Getstarted/SAL
[18:20:04] Freddy2001: I just meant restart the apache2 daemon process :)
[18:20:16] a full reboot shouldn't hurt though I guess
[18:21:06] apache doesn't re-read its config once it is running, so every time you make a change you need to do `sudo service apache2 restart`
[18:28:02] Freddy2001: hmmm... after the reboot it's back to only having 00-dummy.conf in sites-enabled
[18:28:16] Do you know what puppet roles are running there?
[18:33:54] PROBLEM - Puppet errors on tools-exec-1417 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[18:40:37] Freddy2001: "access to / denied (filesystem path '/home/steinsplitter/public_html') because search permissions are missing on a component of the path" -- apache is mad because steinsplitter's home directory is chmod 0700. It wants the www-data user to be able to traverse all of the components in the path. I would recommend moving that docroot somewhere else.
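(Editor's sketch, not part of the channel log.) The "search permissions are missing on a component of the path" error above can be checked by walking from the docroot up to `/` and looking for directories without the execute (search) bit. A minimal Python sketch follows; the function name `components_missing_search` is hypothetical, and it assumes the web server user reaches the path only through the "other" permission bits, so group membership and ACLs are deliberately ignored.

```python
import os
import stat


def components_missing_search(path):
    """Return the directories on the way to `path` that lack o+x.

    Assumption: the web server user (www-data in the log) is neither the
    owner nor in the group of these directories, so only the "other"
    execute bit decides whether it may traverse them.
    """
    current = os.path.abspath(path)
    blocked = []
    while True:
        mode = os.stat(current).st_mode
        # On a directory, the execute bit means "may search/traverse".
        if stat.S_ISDIR(mode) and not mode & stat.S_IXOTH:
            blocked.append(current)
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return list(reversed(blocked))  # outermost component first
```

Run by a user who can stat the path (for example with sudo), `components_missing_search('/home/steinsplitter/public_html')` would be expected to list the home directory if it is mode 0700, which matches the diagnosis in the message above.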
[18:49:49] RECOVERY - Puppet errors on tools-exec-1433 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:13:54] RECOVERY - Puppet errors on tools-exec-1417 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:35:51] PROBLEM - Puppet errors on tools-exec-1433 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[20:29:38] hey cloud wmf folks, I have a gsoc student who will probably pop in here at some point but I wanna get your take on this anyways, or at least have it on your radar
[20:30:09] his project is doing a little twitter bot that will accept emoji tweets and send back commons images, so we're looking at where the final project would run
[20:30:16] it's node based
[20:30:30] does he need a new project for this, is there a good project already he could get an instance on, etc
[20:30:43] he's new to cloud but did already set up an account
[20:35:37] apergos: I bet that could just run as a tool on Toolforge
[20:35:54] we have pretty good support for nodejs on the kubernetes grid
[20:36:33] and some idea of how to setup a bot process running there from things I've done with stashbot and a few other volunteer run tools
[20:38:28] apergos: the basics of running a continuous process on kubernetes are documented at https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Kubernetes#Kubernetes_continuous_jobs
[20:38:50] is toolforge the new toollabs then?
[20:39:00] I guess it is, I see the help page has the old name
[20:39:01] cool
[20:39:12] all right, he might ask you about it or link you to a ticket
[20:39:14] yeah. I'm going to start renaming things all over the place next week :)
[20:39:28] sounds good.
glad to help people get started
[20:39:36] and of course there's a repo with the bot as it is now, so anyone can look at it and make sure there's nothing weird in here
[20:39:37] there
[20:39:46] that wouldn't work on toolforge or whatever
[20:39:52] toolforge btw is such an awesome name
[20:40:11] you're going to have to use codesmith in there someplace
[20:40:15] find a spot :-P
[20:40:31] heh. maybe the next gen PAWS
[20:40:35] aww
[20:40:45] I will hate to see paws go
[20:41:05] btw thanks for looking at the instance metrics thing
[20:41:16] I just poked chasemp :)
[20:41:39] well thanks chasemp for looking at the instance metrics thing :-)
[20:43:27] welcome
[20:44:30] bd808, I tried to move all the stuff to another place but http://whatcanidoforwikimediacommons.org/ points still to /var/www
[20:45:05] something keeps removing the vhost you setup I think
[20:45:37] My guess would be a puppet role that is cleaning up things it doesn't expect in sites-enabled
[20:46:16] let's look at what role::simplelamp does...
[20:47:54] yeah :/
[20:48:52] so using role::simplelamp is causing problems. That Puppet role takes ownership of sites-enabled and every time puppet runs it cleans out the symlinks there that puppet doesn't know about
[20:49:36] to use role::simplelamp you really need to also make a custom puppet role that sets up the vhosts you want to have running
[20:49:56] this seems like something we could try to find a general solution for...
[20:51:54] Labs, Patch-For-Review, User-bd808, cloud-services-team (Kanban): Require 'libicu-dev' package and `npm install -g teleirc` for a Wikitech project - https://phabricator.wikimedia.org/T169338#3396104 (Amire80) Yep, I am indeed developing a Telegram bot that makes edits on MediaWiki. The code is at...
[20:55:37] Freddy2001: so the central problem here is that the puppet runs that happen twice an hour are undoing part of your work
[20:56:38] it's not immediately obvious how to fix this without creating a project local puppetmaster in getstarted.
[20:57:16] okay this explains why my work does not change anything in the config
[20:57:49] I removed the puppet class in horizon, it should be fixed then, right?
[20:58:28] yeah, that will keep it from happening again. the downside is that nothing will be making sure that your mysql and apache are up and running
[20:58:40] but hopefully that will work out ok
[20:59:02] I'm going to write up a bug too and think about general solutions
[21:00:17] Freddy2001: so with that role removed from the instance, you should be able to do `sudo a2ensite 000-default` again, then `sudo service apache2 restart` and hopefully the vhost will come up as expected
[21:00:32] yes it works now! :D
[21:00:41] thank you very much for your help!
[21:01:09] you are welcome
[21:14:41] what's up with wikibugs today?
[21:16:21] It's getting ready for the weekend?
[21:18:38] RainbowSprinkles: good plan for a hard working bot
[21:18:58] bots deserve time off too
[21:19:14] it's been flooded out earlier today ... or was that yesterday? not sure
[21:19:22] anyways, having issues here and there
[21:32:18] Labs, Labs-Infrastructure, Operations, ops-eqiad, Patch-For-Review: rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#3396208 (RobH)
[21:35:25] Labs, Labs-Infrastructure, Operations, ops-eqiad: rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#3396227 (RobH) a:RobH>chasemp
[21:36:28] Labs, Labs-Infrastructure, Operations, ops-eqiad: rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#3276700 (RobH) Assigned to @chasemp for service implementation. This task can be resolved once you are aware!
[21:43:23] buggin out, see folks later!
[22:21:22] bd808: it seems to be lagging for some reason. idk why
[22:21:52] lots of excess flood kicks too it seems.
[22:21:56] poor bot
[22:27:16] hmm, paws is returning 502 bad gateway
[22:27:42] ebernhardson: blech. we had that earlier today too intermittently
[22:27:54] * bd808 goes to take a look again
[22:28:36] oh actually it was quarry earlier, not paws
[22:29:38] ebernhardson: did you get the 502 on the return from the OAuth grant or somewhere deeper in the app?
[22:29:52] bd808: immediately on accessing, after it redirects to my user url
[22:30:09] so i'm going to guess the OAuth token is still 'active'
[22:30:50] ok. If you go back to https://paws.wmflabs.org/ again does it work?
[22:31:06] This time I got in fine and am even able to use a terminal
[22:31:49] I wonder if it is similar to what we saw in quarry where it acted like just 1 or 2 uwsgi workers in the pool were dead
[22:32:03] nope, redirect and then 502 again
[22:32:22] I cannot get any script running on PAWS fwiw (hides)
[22:33:38] sadly my PAWS "fix" is to just do this: hey madhuvishy do you have a minute to look at PAWS. Some but not all people are getting 502s from it
[22:33:55] bd808: sure looking
[22:34:13] !log quarry Added BryanDavis (self) as project admin
[22:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Quarry/SAL
[22:34:22] :)
[22:34:47] someday I will grok PAWS but not today
[22:34:49] hmmm things are working for me, i'm poking at the tool
[22:36:48] yeah. I was able to log in and use things too.
[22:37:09] there were some session corruption complaints earlier in the week
[22:37:57] yeah i'm seeing your requests come in
[22:38:32] every thing is swell for me. ebernhardson can you make some errors for madhuvishy?
[22:38:53] ebernhardson: i kicked your pod
[22:38:58] try logging in now?
[22:41:03] TabbyCat: what is the behavior/errors you are seeing?
what's your paws username
[22:44:08] madhuvishy: don't worry, I don't use PAWS very often -- but it was about some OAuth grant error or something like that
[22:44:28] your username?
[22:44:40] i just filed https://phabricator.wikimedia.org/T169382
[22:45:02] ah HaeB looking at yours too
[22:45:10] HaeB: PAWS hates on you every Friday. :(
[22:45:10] ...in my case PAWS fails with a too many redirects error, but i have also seen 502 bad gateway recently
[22:45:17] PAWS: "Start My Server" fails with "too many redirects" error - https://phabricator.wikimedia.org/T169382#3396434 (Tbayer)
[22:46:02] madhuvishy: that seems to have done the trick, thanks!
[22:46:13] ebernhardson: oh cool!
[22:46:40] HaeB: try now?
[22:48:21] madhuvishy: looks like it works now \o/
[22:48:43] hmmm, some user pods seem to be in some bad state
[22:48:47] not sure what exactly
[22:49:30] (btw meant 504 bad gateway above, not 502)
[22:50:00] (*cough* 504 gateway timeout, ofc)
[22:50:05] * madhuvishy can't wait for yuvi's overhaul
[22:50:13] HaeB: ah okay
[22:50:32] that was 2 days ago or so
[22:51:07] oh hmmm, yesterday everything was kinda possibly down. not sure what was up 2 days ago
[23:11:06] madhuvishy: i see... btw is there an equivalent of https://status.wikimedia.org/ for tools/labs/clouds/paws? (or should some of these services be on status.wikimedia?)
[23:14:20] HaeB: no these aren't on status.wikimedia - I know that that's all provided by an external service, and usually reserved for production things - not sure of what goes there and doesn't
[23:14:36] as for us, we have shinken/icinga based monitoring that alerts us
[23:14:47] and things like this - https://grafana-labs.wikimedia.org/dashboard/db/tools-basic-alerts
[23:15:40] but i see what you are asking for, and i don't think we have an exact equivalent
[23:18:54] don't know if you can see this - but this has high level tools status checks - https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=checker.tools.wmflabs.org
[23:19:44] PAWS: "Start My Server" fails with "too many redirects" error - https://phabricator.wikimedia.org/T169382#3396536 (madhuvishy) Open>Resolved a:madhuvishy Resolved by restarting the user's paws pod.
[23:21:04] yep, i can access that icinga page
[23:22:12] madhuvishy: it has 12 different rows ... which ones are most relevant if i want to find out if a PAWS issue is just about me / my notebook, or due to bad weather in general? ;)
[23:24:29] HaeB: ah that page is pretty specific to broad tools things
[23:27:09] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=paws.wmflabs.org
[23:27:14] shows general paws health
[23:27:24] but rarely the case is all of paws is down
[23:27:46] i don't think we track the status of every user's notebook and their ability to login either
[23:37:51] madhuvishy: cool, the general paws status is useful enough
[23:37:59] added the link at https://wikitech.wikimedia.org/wiki/PAWS#Other_notes
[23:38:11] HaeB: :) thank you