[00:00:35] valhallasw`vecto, ping
[00:01:00] we try
[00:01:53] andrewbogott maybe? ^
[00:02:08] it's Christmas day, everyone is offline
[00:02:14] not me
[00:02:17] not you
[00:03:12] is there nobody from labs available?
[00:03:42] what if the entire mediawiki system is shutting down
[00:04:47] the entire mediawiki system?
[00:04:52] you mean the production sites?
[00:05:04] ya
[00:05:09] if wikipedia was down, people would be summoned
[00:05:25] if that broke, lots of things would be going off
[00:05:59] it's somewhat more difficult to get help for tools
[00:07:19] Hi
[00:07:33] I'm just on a train
[00:07:38] Catching up on back scroll
[00:08:10] Can look in 30-ish minutes
[00:08:25] any docs you know of that I should be reading?
[00:08:29] or any other pointers?
[00:09:28] Krenaur: precise is okay
[00:09:36] Krenair: ^
[00:11:30] Krenair: you can check status of trusty nodes from any bastion
[00:11:57] exec-manage status tools-exec-140*
[00:12:26] tools-bastion-03 for example
[00:12:41] The status will print running jobs and states
[00:13:00] /data/project/.system/gridengine/spool/qmaster/messages has a lot of 'unable to find job "" from the ticket order'
[00:13:02] ya, but new jobs stuck in qw state
[00:15:08] Count of jobs running on host tools-exec-140* : 0
[00:15:16] that can't be right
[00:15:25] Oh sorry
[00:15:30] (am looking into it now)
[00:16:05] I meant 140* as in you can put 1401, 1402 etc
[00:16:09] ah
[00:17:09] I asked for 1407 and it's just sitting there
[00:17:50] same for some others
[00:18:18] qstat is not working any more
[00:18:19] I think it might be triggered by an out-of-control process launched by avicbot
[00:18:35] hi yuvi
[00:19:27] error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
[00:19:54] error: commlib error: got read error (closing "tools-grid-master.tools.eqiad.wmflabs/qmaster/1")
[00:20:22] !log tools kill clean.sh process of avicbot
[00:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:21:40] the same errors are being reported in #wikimedia-operations
[00:24:19] does anyone know https://wikitech.wikimedia.org/wiki/User:Avicennasis
[00:24:48] not me
[00:25:44] !log tools delete all jobs of avicbot. This is 419 jobs
[00:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:27:06] PROBLEM - Puppet run on tools-grid-master is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[00:28:11] !log tools force delete all jobs of avicbot
[00:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:28:24] !log tools comment out cron running 'clean' script of avicbot every minute without -once
[00:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:28:35] doctaxon: any better now?
[00:28:58] hmm maybe not
[00:32:13] enwiki contribs show activity just hours ago, so they're likely to be receptive to mail over the next few days
[00:32:56] doctaxon: Krenair madhuvishy ok, I see things coming off qw slowly; the grid should be fully back operational in about 10 minutes
[00:33:04] qstat -u '*' | grep 'qw' | wc -l
[00:33:07] is reducing over time
[00:33:40] Great
[00:33:46] -u '*', okay
[00:33:48] Thank you yuvipanda :)
[00:34:10] thanks to both of you
[00:34:46] indeed, thanks all!
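(For context: the root cause above was a cron entry resubmitting avicbot's 'clean' script every minute without jsub's -once flag, so duplicate jobs piled up until the qmaster stopped answering. A minimal sketch of the difference on the Tool Labs grid follows; the script path and job name are hypothetical, not avicbot's actual crontab entry.)

    # Problematic: a fresh grid job is submitted every minute even while
    # earlier copies are still queued, flooding the queue with duplicates.
    * * * * * jsub /data/project/avicbot/clean.sh

    # Safer: -once makes jsub skip submission while a job with the same
    # name (-N) is still pending or running.
    * * * * * jsub -once -N clean /data/project/avicbot/clean.sh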
[00:34:55] going afk but otherwise available via pager
[00:38:12] yuvipanda: I have many tasks in qw state now, as before
[00:38:35] indeed, wait for ten or more minutes
[00:38:42] thank you
[00:38:43] the number is going down a lot
[00:38:46] 2297 tasks in qw a few mins ago, 1800 now
[00:39:10] krenair@tools-grid-master:~$ qstat -u '*' | grep 'qw' | wc -l; sleep 2; qstat -u '*' | grep 'qw' | wc -l
[00:39:11] 1748
[00:39:11] 1742
[00:42:22] 1430
[00:43:41] 1319
[00:43:44] yeah
[00:44:15] yuvipanda is our Christmas master
[00:50:47] 888
[00:57:06] RECOVERY - Puppet run on tools-grid-master is OK: OK: Less than 1.00% above the threshold [0.0]
[01:18:41] everything fine, thank you all
[03:15:20] 06Labs: Deactivate repository labs/invisible-unicorn - https://phabricator.wikimedia.org/T154099#2901197 (10scfc)
[05:06:41] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[06:11:41] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:52:56] 06Labs, 10Tool-Labs, 07Puppet: Make standalone puppetmasters optionally use PuppetDB - https://phabricator.wikimedia.org/T153577#2901265 (10scfc)
[06:53:28] 06Labs, 10Tool-Labs, 07Puppet: Make standalone puppetmasters optionally use PuppetDB - https://phabricator.wikimedia.org/T153577#2884609 (10scfc)
[07:55:04] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[08:24:29] 10Wikibugs: wikibugs showing jenkins postmerge events for user L10n-bot - https://phabricator.wikimedia.org/T154094#2901270 (10Peachey88)
[08:30:02] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:37:42] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[09:37:49] 10Labs-project-other, 06Developer-Relations, 10WikiApiary: move WikiApiary to Labs - https://phabricator.wikimedia.org/T149874#2767311 (10Peachey88) >>! In T149874#2898209, @Nemo_bis wrote: > By the way, if moving WikiApiary to WMFlabs means that it will be hostage of people that wish to destroy it Please d...
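(The queue drain above was tracked by rerunning the count by hand; assuming GNU watch is available on the bastion or grid master, as it typically is, a one-liner like this sketch does the same continuously.)

    # Re-count jobs in the 'qw' (queued, waiting) state every 10 seconds;
    # grep -c counts matching lines, like the grep | wc -l pipeline above.
    # Caveat: this would also match job names containing 'qw'.
    watch -n 10 "qstat -u '*' | grep -c qw"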
[09:42:43] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:45:37] 06Labs, 10Tool-Labs, 07Puppet: role::puppetmaster::puppetdb depends on Ganglia and cannot be used in Labs - https://phabricator.wikimedia.org/T154104#2901305 (10scfc)
[09:46:55] 06Labs, 10Tool-Labs, 07Puppet: role::puppetmaster::puppetdb uses nginx as reverse proxy and cannot be used together with Apache applications - https://phabricator.wikimedia.org/T154105#2901318 (10scfc)
[10:56:02] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[11:36:00] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:08:41] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[13:08:42] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:31:58] PROBLEM - Puppet run on tools-worker-1003 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[16:09:40] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[17:09:42] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:35:42] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[20:40:42] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:06:41] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[22:11:41] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:07:41] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]