[00:00:35] <Krenair>	 valhallasw`vecto, ping
[00:01:00] <doctaxon>	 we try
[00:01:53] <doctaxon>	 andrewbogott maybe? ^
[00:02:08] <Krenair>	 it's christmas day everyone is offline
[00:02:14] <doctaxon>	 not me
[00:02:17] <doctaxon>	 not you
[00:03:12] <doctaxon>	 is there nobody from labs in disposition
[00:03:42] <doctaxon>	 what if the entire mediawiki system is shutting down
[00:04:47] <Krenair>	 the entire mediawiki system?
[00:04:52] <Krenair>	 you mean the production sites?
[00:05:04] <doctaxon>	 ya
[00:05:09] <Reedy>	 if wikipedia was down, people would be summoned
[00:05:25] <Krenair>	 if that broke lots of things would be going off
[00:05:59] <Krenair>	 it's somewhat more difficult to get help for tools
[00:07:19] <madhuvishy>	 Hi
[00:07:33] <madhuvishy>	 I'm just on a train
[00:07:38] <madhuvishy>	 Catching up on back scroll
[00:08:10] <madhuvishy>	 Can look in 30 ish minutes
[00:08:25] <Krenair>	 any docs you know of that I should be reading?
[00:08:29] <Krenair>	 or any other pointers?
[00:09:28] <doctaxon>	 Krenaur: precise is okay
[00:09:36] <doctaxon>	 Krenair: ^
[00:11:30] <madhuvishy>	 Krenair: you can check status of trusty nodes from any bastion
[00:11:57] <madhuvishy>	 exec-manage status tools-exec-140*
[00:12:26] <madhuvishy>	 tools-bastion-03 for example
[00:12:41] <madhuvishy>	 The status will print running jobs and states
[00:13:00] <Krenair>	  /data/project/.system/gridengine/spool/qmaster/messages has a lot of 'unable to find job "<number>" from the ticket order'
[00:13:02] <doctaxon>	 ya, but new jobs stuck in qw state
[00:15:08] <Krenair>	 Count of jobs running on host tools-exec-140* : 0
[00:15:16] <Krenair>	 that can't be right
[00:15:25] <madhuvishy>	 Oh sorry
[00:15:30] <yuvipanda>	 (am looking into it now)
[00:16:05] <madhuvishy>	 I meant 140* as in you can put 1401, 1402 etc
[00:16:09] <Krenair>	 ah
[00:17:09] <Krenair>	 I asked for 1407 and it's just sitting there
[00:17:50] <Krenair>	 same for some others
[00:18:18] <doctaxon>	 qstat is not working any more
[00:18:19] <yuvipanda>	 I think it might be triggered by an out of control process launchied by avicbot
[00:18:35] <doctaxon>	 hi yuvi
[00:19:27] <doctaxon>	 error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
[00:19:54] <doctaxon>	 error: commlib error: got read error (closing "tools-grid-master.tools.eqiad.wmflabs/qmaster/1")
[00:20:22] <yuvipanda>	 !log tools kill clean.sh process of avicbot
[00:20:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:21:40] <doctaxon>	 there are errors stated: #wikimedia-operations
[00:24:19] <yuvipanda>	 does anyone know https://wikitech.wikimedia.org/wiki/User:Avicennasis
[00:24:48] <Krenair>	 not me
[00:25:44] <yuvipanda>	 !log tools delete all jobs of avicbot. This is 419 jobs
[00:25:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:27:06] <shinken-wm>	 PROBLEM - Puppet run on tools-grid-master is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[00:28:11] <yuvipanda>	 !log tools force delete all jobs of avicbot
[00:28:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:28:24] <yuvipanda>	 !log tools comment out cron running 'clean' script of avicbot every minute without -once
[00:28:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[00:28:35] <yuvipanda>	 doctaxon: any better now?
[00:28:58] <yuvipanda>	 hmm maybe not
[00:32:13] <Krenair>	 enwiki contribs show activity just hours ago so they're likely to be receptive to mail over the next few days
[00:32:56] <yuvipanda>	 doctaxon: Krenair madhuvishy ok, I see things going off qw slowly, grid should be fully back operational in about 10 minutes maybe
[00:33:04] <yuvipanda>	  qstat -u '*' | grep 'qw' | wc -l
[00:33:07] <yuvipanda>	 is reducing over time
[00:33:40] <madhuvishy>	 Great
[00:33:46] <Krenair>	 -u '*', okay
[00:33:48] <madhuvishy>	 Thank you yuvipanda :)
[00:34:10] <Krenair>	 thanks to both of you
[00:34:46] <godog>	 indeed, thanks all!
[00:34:55] <godog>	 going afk but otherwise available via pager
[00:38:12] <doctaxon>	 yuvipanda: I have much tasks in qw state now, as before
[00:38:35] <yuvipanda>	 indeed, wait for ten or more minutes
[00:38:42] <doctaxon>	 thank you
[00:38:43] <Krenair>	 the number is going down a lot
[00:38:46] <yuvipanda>	 2297 tasks in qw a few mins ago, 1800 now
[00:39:10] <Krenair>	 krenair@tools-grid-master:~$ qstat -u '*' | grep 'qw' | wc -l; sleep 2; qstat -u '*' | grep 'qw' | wc -l
[00:39:11] <Krenair>	 1748
[00:39:11] <Krenair>	 1742
[00:42:22] <doctaxon>	 1430
[00:43:41] <doctaxon>	 1319
[00:43:44] <Krenair>	 yeah
[00:44:15] <doctaxon>	 yuvipanda is our christmas master
[00:50:47] <doctaxon>	 888
[00:57:06] <shinken-wm>	 RECOVERY - Puppet run on tools-grid-master is OK: OK: Less than 1.00% above the threshold [0.0]
[01:18:41] <doctaxon>	 evrything fine , thank you all
[03:15:20] <wikibugs>	 06Labs: Deactivate repository labs/invisible-unicorn - https://phabricator.wikimedia.org/T154099#2901197 (10scfc)
[05:06:41] <shinken-wm>	 PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[06:11:41] <shinken-wm>	 RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:52:56] <wikibugs>	 06Labs, 10Tool-Labs, 07Puppet: Make standalone puppetmasters optionally use PuppetDB - https://phabricator.wikimedia.org/T153577#2901265 (10scfc)
[06:53:28] <wikibugs>	 06Labs, 10Tool-Labs, 07Puppet: Make standalone puppetmasters optionally use PuppetDB - https://phabricator.wikimedia.org/T153577#2884609 (10scfc)
[07:55:04] <shinken-wm>	 PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[08:24:29] <wikibugs>	 10Wikibugs: wikibugs showing jenkins postmerge events for user L10n-bot - https://phabricator.wikimedia.org/T154094#2901270 (10Peachey88)
[08:30:02] <shinken-wm>	 RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:37:42] <shinken-wm>	 PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[09:37:49] <wikibugs>	 10Labs-project-other, 06Developer-Relations, 10WikiApiary: move WikiApiary to Labs - https://phabricator.wikimedia.org/T149874#2767311 (10Peachey88) >>! In T149874#2898209, @Nemo_bis wrote: > By the way, if moving WikiApiary to WMFlabs means that it will be hostage of people that wish to destroy it  Please d...
[09:42:43] <shinken-wm>	 RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:45:37] <wikibugs>	 06Labs, 10Tool-Labs, 07Puppet: role::puppetmaster::puppetdb depends on Ganglia and cannot be used in Labs - https://phabricator.wikimedia.org/T154104#2901305 (10scfc)
[09:46:55] <wikibugs>	 06Labs, 10Tool-Labs, 07Puppet: role::puppetmaster::puppetdb uses nginx as reverse proxy and cannot be used together with Apache applications - https://phabricator.wikimedia.org/T154105#2901318 (10scfc)
[10:56:02] <shinken-wm>	 PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[11:36:00] <shinken-wm>	 RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:08:41] <shinken-wm>	 PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[13:08:42] <shinken-wm>	 RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:31:58] <shinken-wm>	 PROBLEM - Puppet run on tools-worker-1003 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[16:09:40] <shinken-wm>	 PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[17:09:42] <shinken-wm>	 RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:35:42] <shinken-wm>	 PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[20:40:42] <shinken-wm>	 RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:06:41] <shinken-wm>	 PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[22:11:41] <shinken-wm>	 RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:07:41] <shinken-wm>	 PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]