[01:44:52] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[02:24:51] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[05:47:38] PROBLEM - Host tools-exec-1211 is DOWN: PING CRITICAL - Packet loss = 86%, RTA = 4053.83 ms
[05:52:04] RECOVERY - Host tools-exec-1211 is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms
[06:33:19] PROBLEM - Puppet run on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[06:40:21] PROBLEM - Puppet run on tools-webgrid-lighttpd-1418 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[06:56:05] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Not Available - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 531 bytes in 0.020 second response time
[07:08:20] RECOVERY - Puppet run on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:15:18] RECOVERY - Puppet run on tools-webgrid-lighttpd-1418 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:21:05] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.030 second response time
[07:32:46] 10PAWS: Hidden __pycache__/ directory prevents folders to be deleted from UI - https://phabricator.wikimedia.org/T147775#2703089 (10Abbe98)
[08:02:29] 10PAWS: PAWS can't edit SQL files - https://phabricator.wikimedia.org/T146920#2674793 (10Abbe98) I have not been able to find this issue in the [[ https://github.com/jupyter/notebook/search?utf8=✓&q=sql&type=Issues | upstream bug tracker ]]. Do we know if this is also an issue in other Jupyter instances or if...
[11:54:51] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[12:29:50] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:11:24] 06Labs, 10Tool-Labs, 10Pywikibot-core: New pages are not being created by pagefromfile.py - https://phabricator.wikimedia.org/T147766#2703695 (10Xqt) p:05Triage>03Normal a:03Xqt
[15:44:58] !log deployment-prep deployment-elastic0[5-8]: reduce the number of replicas to 1 max for all indices
[15:44:59] Please !log in #wikimedia-releng for beta cluster SAL
[15:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master
[16:16:43] 06Labs, 10Beta-Cluster-Infrastructure: Remove Labs uses of etcd and confd classes - https://phabricator.wikimedia.org/T147800#2703948 (10Andrew)
[16:17:49] I have a list of red links (between 20,000 and 500,000). If I wanted to scan the ENWP dump for string matches (13 GB decompressed), what's the best way to do it? Decompress into memory?
[16:20:22] Hammer Special:Search?
[16:26:26] 06Labs, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: Replace all class imports on Labs with role imports - https://phabricator.wikimedia.org/T147233#2703967 (10Andrew)
[16:37:21] 06Labs, 06Operations, 13Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#1846901 (10Andrew) When I tried to use role::puppetmaster::standalone last week I was unable to start apache on the new pup...
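Editor's note on the red-link scan asked about at 16:17: one way to avoid decompressing 13 GB into memory is to stream the bz2 dump and test each wikilink target against an in-memory set of titles. The sketch below is only a possible approach under assumptions not in the discussion: the file names are placeholders, and it matches [[wikilink]] targets rather than arbitrary substrings.

```python
# Sketch only: stream the compressed dump instead of decompressing it into
# memory, and test each [[wikilink]] target against a set of red-link titles.
# File names below are placeholders, not paths from the discussion.
import bz2
import re

# 20,000-500,000 titles fit comfortably in a Python set.
with open("redlinks.txt", encoding="utf-8") as f:
    redlinks = {line.strip() for line in f if line.strip()}

wikilink = re.compile(r"\[\[([^|\]#]+)")  # capture the target of [[Target|label]]
found = set()

# bz2.open decompresses on the fly, so memory use stays bounded.
with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rt", encoding="utf-8") as dump:
    for line in dump:
        for target in wikilink.findall(line):
            title = target.strip()
            if title in redlinks:
                found.add(title)

print(f"{len(found)} of {len(redlinks)} red links appear as link targets")
```

For real MediaWiki titles, normalizing case of the first letter and converting underscores to spaces on both sides would probably also be needed.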
[16:44:31] 06Labs, 10Labs-Infrastructure, 10DBA, 06Operations: Prepare and check production and labs-side filtering for olowiki - https://phabricator.wikimedia.org/T147302#2704003 (10jcrespo) No it is not, it is not available on labs- and it should not be until this is resolved.
[16:46:36] 06Labs, 10Labs-Infrastructure, 10DBA, 06Operations: Prepare and check production and labs-side filtering for olowiki - https://phabricator.wikimedia.org/T147302#2704009 (10jcrespo) @Marostegui we should do this tomorrow, with special guest @chasemp , if he wants.
[17:01:01] PROBLEM - Host tools-secgroup-test-102 is DOWN: CRITICAL - Host Unreachable (10.68.21.170)
[17:04:41] 06Labs, 10Beta-Cluster-Infrastructure: Remove Labs uses of etcd and confd classes - https://phabricator.wikimedia.org/T147800#2703948 (10AlexMonk-WMF) deployment-conf03's includes match production's conf1001.eqiad.wmnet
[17:06:38] I got an "Address already in use"
[17:06:41] problem
[17:06:52] https://wikitech.wikimedia.org/wiki/Help_talk:Tool_Labs/Python_application_stub
[17:07:21] Any idea for a fix?
[17:17:14] fnielsen, what command do you run that returns "address already in use"?
[17:17:46] tail -f uwsgi.log
[17:18:04] "socket.error: [Errno 98] Address already in use"
[17:20:38] A bit more context: I am setting up a new tool on wmflabs. I attempted to restart it, but had some path and other problems. Eventually I got past those problems, but now it seems that old processes/apps are occupying the socket.
[17:22:13] valhallasw`vecto, know anything about this?
[17:22:24] is that command running things on the correct host?
[17:22:59] if so, how should it determine what port to listen on?
[17:25:15] I think via an env variable
[17:25:43] But what can happen is that the old process is not killed correctly
[17:28:00] I only see tools.scholia in the process list on tools-bastion-03
[17:28:27] is it trying to assign a port used by something else?
[17:28:46] with "ps -ef | grep scholia" I see 3 processes
[17:29:14] (venv)tools.scholia@tools-bastion-03:~$ ps -ef | grep scholia
[17:29:14] root 6958 6816 0 16:48 pts/24 00:00:00 /usr/bin/sudo -niu tools.scholia
[17:29:14] root 7394 7290 0 16:12 pts/44 00:00:00 /usr/bin/sudo -niu tools.scholia
[17:29:14] tools.s+ 8930 7400 0 17:29 pts/44 00:00:00 grep scholia
[17:29:15] root 19071 18991 0 16:24 pts/50 00:00:00 /usr/bin/sudo -niu tools.scholia
[17:29:28] What does qstat say?
[17:29:46] $ qstat
[17:29:47] job-ID prior name user state submit/start at queue slots ja-task-ID
[17:29:47] -----------------------------------------------------------------------------------------------------------------
[17:29:47] 1811346 0.30005 uwsgi-pyth tools.scholi dr 10/10/2016 16:35:12 webgrid-generic@tools-webgrid- 1
[17:29:48] 1811349 0.30005 uwsgi-pyth tools.scholi dr 10/10/2016 16:35:34 webgrid-generic@tools-webgrid- 1
[17:29:51] 1811350 0.30005 uwsgi-pyth tools.scholi dr 10/10/2016 16:35:38 webgrid-generic@tools-webgrid- 1
[17:30:45] Ok, try: qdel 1811346 1811349 1811350
[17:30:58] Then webservice uwsgi start
[17:31:10] Ok, I was going to ask that
[17:31:30] 3 running commands is definitely not good
[17:32:02] What is the difference between webservice and webservice2?
[17:32:46] 3 lines of "tools.scholia has registered the job 1811346 for deletion" or similar, but qstat says they are still there
[17:33:37] oh, maybe the same thing is happening to me on stewardbots
[17:33:42] I can't get stewardbot to stop
[17:33:58] I've tried qdel and kill -9 on the execution server
[17:34:02] it respawns
[17:34:12] bigbrother disabled before, just in case
[17:34:37] okay, it seems it's stopped now
[17:34:40] I'll try to restart
[17:34:54] do I still need to specify -l release=trusty?
[17:35:54] yes
[17:36:07] $ jstart -N stewardbot -mem 2G -l release=trusty python /data/project/stewardbots/StewardBot/StewardBot.py
[17:36:11] that's what I'm gonna use
[17:36:59] Your job 1812957 ("stewardbot") has been submitted
[17:37:06] but the bot doesn't join the channels
[17:37:09] :S
[17:38:17] !log tools.scholia Force deleted (`sudo qdel -f ...`) three webservice jobs that were stuck in "deleting" state
[17:40:23] mafk: check the log files?
[17:40:40] Yes, I saw they got deleted.
[17:41:53] I tried to start with "webservice2 uwsgi-python start" but still receive the same error message. "ps -ef | grep scholia" still shows the old processes.
[17:42:46] fnielsen_: I'm taking a look at it. The job restarted after the force deletes but was still crashing.
[17:43:07] valhallasw`vecto: which ones? the project is a mess... I'm still trying to get folks to clean old stuff from toolserver...
[17:43:32] * bd808 didn't notice that the irc chat was happening on this
[17:48:58] Mafk: stewardbot.*
[17:49:05] fnielsen_: It looks like the problem may be your ~/www/python/src/app.py file. It is trying to import the scholia.app module, but I *think* that the python path isn't finding that code
[17:49:08] I've deleted .bigbrotherrc and it's still trying to restart the jobs :S
[17:49:16] it's flooding my inbox
[17:49:20] with warn and info
[17:50:09] When I first tried, I got several path problems reported in the log file.
[17:51:38] The error message I got now is different
[17:52:25] bd808: Do you think that the path problem is still what is causing the socket problem?
[17:53:23] bd808: ping, can you please kill the bigbrother daemon for stewardbots? I've deleted the file containing it but it's still trying to restart the jobs over and over
[17:53:31] tools.stewardbots
[17:54:27] fnielsen_: the error "unable to load app 0 (mountpoint='') (callable not found or import error)" looks like it happens first. Try clearing that up and then we can tackle the socket error if it still exists.
[17:54:39] mafk: I'll give it a shot
[17:54:57] bd808: I have just switched over to the old simple app with no import and it works.
[17:55:38] fnielsen_: *nod* so it sounds like the import path problem is the next thing to figure out
[17:55:49] Thanks.
[17:55:51] Yes
[17:57:49] mafk: how long ago did you delete the bigbrotherrc file?
[17:58:19] bd808: ~30 minutes ago +/-
[17:58:41] hmmm... ok.
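Editor's note on the 17:49 diagnosis that ~/www/python/src/app.py cannot import scholia.app: a common fix is to put the checkout that contains the package on sys.path before the import. The sketch below is a guess at that fix, not the tool's actual file; the directory layout and the create_app factory name are assumptions.

```python
# Hypothetical ~/www/python/src/app.py, sketching the import-path fix hinted
# at above. uwsgi mounts a module-level "app" callable from this file; the
# "callable not found or import error" message means the import itself failed.
import os
import sys

here = os.path.dirname(os.path.abspath(__file__))
# Assumed layout: the scholia repository is checked out next to this file,
# so the importable "scholia" package lives at <here>/scholia/scholia/.
sys.path.insert(0, os.path.join(here, "scholia"))

from scholia.app import create_app  # assumed entry point; adjust to the real one

app = create_app()  # the callable uwsgi serves
```

An alternative, also consistent with the chat (a virtualenv prompt is visible in the ps paste), would be to pip-install the scholia checkout into the tool's virtualenv so no sys.path manipulation is needed.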
[17:58:56] tools.stewardbots@tools-bastion-03:~$ qstat
[17:58:58] job-ID prior name user state submit/start at queue slots ja-task-ID
[17:58:59] -----------------------------------------------------------------------------------------------------------------
[17:59:01] 358869 0.35393 lighttpd-s tools.stewar Rr 09/20/2016 17:46:31 webgrid-lighttpd@tools-webgrid 1
[17:59:02] 1147577 0.32371 sulwatcher tools.stewar r 09/23/2016 04:47:19 continuous@tools-exec-1409.eqi 1
[17:59:04] 1813475 0.30000 stewardbot tools.stewar r 10/10/2016 17:57:16 continuous@tools-exec-1410.eqi 1
[17:59:05] tools.stewardbots@tools-bastion-03:~$
[18:00:42] queue instance "cyberbot@tools-exec-cyberbot.eqiad.wmflabs" dropped because it is temporarily not available
[18:01:38] deleting bigbrotherrc might not work until bigbrother is restarted
[18:02:22] I think an empty file does work (not certain)
[18:03:03] 2016-10-10 18:02:42 warn: Too many attempts to restart job 'stewardbot'; throttling
[18:03:10] I think that'd stop it for a while
[18:03:22] blank file was tried
[18:03:25] same result
[18:03:31] prepending # to the commands too
[18:04:21] valhallasw`vecto: does that run on tools-services-01?
[18:04:42] currently on -02 it seems
[18:04:55] !log tools sudo service bigbrother restart @ tools-services-02
[18:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[18:05:42] bd808: I thought you had rewritten bigbrother at some point, but that hasn't been merged yet I guess?
[18:06:26] I think this is worth looking at http://pastebin.com/JUD0tgYi ?
[18:06:31] too many jobs dropped
[18:08:35] no, that's fine
[18:09:10] the job is running
[18:09:45] well, /registered/ to be running, on tools-exec-1410
[18:11:27] it's not running really
[18:11:30] it's an irc bot
[18:11:36] and it's not joining any channel
[18:13:35] well, qstat also doesn't report it anymore, so it's not currently running
[18:13:54] I've deleted it again
[18:15:07] okay account works
[18:16:18] PROBLEM - Host tools-exec-cyberbot is DOWN: CRITICAL - Host Unreachable (10.68.16.39)
[18:22:39] valhallasw`vecto: yeah I have a python port of it hanging out in gerrit
[18:32:18] PROBLEM - Host tools-secgroup-test-103 is DOWN: CRITICAL - Host Unreachable (10.68.21.22)
[18:39:10] 06Labs, 10wikitech.wikimedia.org: Install OAuth for wikitech - https://phabricator.wikimedia.org/T147804#2704154 (10bd808)
[18:41:08] PROBLEM - Host secgroup-lag-102 is DOWN: CRITICAL - Host Unreachable (10.68.17.218)
[18:48:44] 06Labs, 10Tool-Labs: tools-exec-cyberbot in SHUTOFF state - https://phabricator.wikimedia.org/T147805#2704173 (10valhallasw)
[18:56:43] 06Labs, 10Beta-Cluster-Infrastructure: Remove Labs uses of etcd and confd classes - https://phabricator.wikimedia.org/T147800#2704189 (10hashar) The `etcd` labs project is most probably @joe sandbox area.
[19:19:47] 06Labs, 10Labs-Infrastructure, 10DBA, 06Operations: Prepare and check production and labs-side filtering for olowiki - https://phabricator.wikimedia.org/T147302#2704247 (10Marostegui) Sounds good to me, let's do it tomorrow! On 10 Oct 2016 18:46, "jcrespo" wrote:...
[19:46:02] !log tools.stewardbots Restarted StewardBot and SULWatcher, set to run via jstart; IRC server changed.
[19:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL, Master
[19:50:43] !log tools.stewardbots jsub -> jstart on the restart_*.sh files
[19:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL, Master
[20:03:48] (03PS1) 10MarcoAurelio: Removed outdated stuff [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/315151
[20:05:31] (03CR) 10MarcoAurelio: [C: 032] Removed outdated stuff [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/315151 (owner: 10MarcoAurelio)
[20:05:59] (03Merged) 10jenkins-bot: Removed outdated stuff [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/315151 (owner: 10MarcoAurelio)
[20:06:50] !log tools.stewardbots [[gerrit:315151|Removed outdated files and stuff]]
[20:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL, Master
[20:27:12] 06Labs, 10wikitech.wikimedia.org: Install OAuth for wikitech - https://phabricator.wikimedia.org/T147804#2704154 (10Reedy) Any obvious blockers? Or just needs db tables creating and the extension enabling as usual?
[20:56:54] !log deployment-prep fixed puppet on -tin/-mira by restarting puppetmaster for base_path scap change
[20:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master
[21:09:56] 06Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Install OAuth for wikitech - https://phabricator.wikimedia.org/T147804#2704455 (10Reedy) 05Open>03Resolved a:03Reedy
[21:11:22] !log deployment-prep fixed puppet on -restbase01/-restbase02 by setting up deployment of cassandra/twcs on deployment-tin
[21:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master
[21:41:41] !log deployment-prep restarted keyholder-proxy on -tin to make check_keyholder happy with the extra key that was active but unconfigured
[21:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master