[01:44:52] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[02:24:51] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[05:47:38] PROBLEM - Host tools-exec-1211 is DOWN: PING CRITICAL - Packet loss = 86%, RTA = 4053.83 ms
[05:52:04] RECOVERY - Host tools-exec-1211 is UP: PING OK - Packet loss = 0%, RTA = 0.65 ms
[06:33:19] PROBLEM - Puppet run on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[06:40:21] PROBLEM - Puppet run on tools-webgrid-lighttpd-1418 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[06:56:05] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Not Available - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 531 bytes in 0.020 second response time
[07:08:20] RECOVERY - Puppet run on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:15:18] RECOVERY - Puppet run on tools-webgrid-lighttpd-1418 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:21:05] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.030 second response time
[07:32:46] 10PAWS: Hidden __pycache__/ directory prevents folders to be deleted from UI - https://phabricator.wikimedia.org/T147775#2703089 (10Abbe98)
[08:02:29] 10PAWS: PAWS can't edit SQL files - https://phabricator.wikimedia.org/T146920#2674793 (10Abbe98) I have not been able to find this issue in the [[ https://github.com/jupyter/notebook/search?utf8=✓&q=sql&type=Issues | upstream bug tracker ]]. Do we know if this is also an issue in other Jupyter instances or if...
[11:54:51] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[12:29:50] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:11:24] 06Labs, 10Tool-Labs, 10Pywikibot-core: New pages are not being created by pagefromfile.py - https://phabricator.wikimedia.org/T147766#2703695 (10Xqt) p:05Triage>03Normal a:03Xqt
[15:44:58] !log deployment-prep deployment-elastic0[5-8]: reduce the number of replicas to 1 max for all indices
[15:44:59] Please !log in #wikimedia-releng for beta cluster SAL
[15:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master
[16:16:43] 06Labs, 10Beta-Cluster-Infrastructure: Remove Labs uses of etcd and confd classes - https://phabricator.wikimedia.org/T147800#2703948 (10Andrew)
[16:17:49] I have a list of red links (between 20,000 and 500,000). If I wanted to scan the ENWP dump for string matches (13 GB decompressed), what's the best way to do it? Decompress into memory?
[16:20:22] Hammer Special:Search?
[16:26:26] 06Labs, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: Replace all class imports on Labs with role imports - https://phabricator.wikimedia.org/T147233#2703967 (10Andrew)
[16:37:21] 06Labs, 06Operations, 13Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#1846901 (10Andrew) When I tried to use role::puppetmaster::standalone last week I was unable to start apache on the new pup...
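Editor's note on the red-link scan asked about at 16:17: one way to avoid decompressing 13 GB into memory is to stream the bz2 dump and test each wikilink target against an in-memory set of titles. The sketch below is only a possible approach under assumptions not in the discussion: the file names are placeholders, and it matches [[wikilink]] targets rather than arbitrary substrings.

```python
# Sketch only: stream the compressed dump instead of decompressing it into
# memory, and test each [[wikilink]] target against a set of red-link titles.
# File names below are placeholders, not paths from the discussion.
import bz2
import re

# 20,000-500,000 titles fit comfortably in a Python set.
with open("redlinks.txt", encoding="utf-8") as f:
    redlinks = {line.strip() for line in f if line.strip()}

wikilink = re.compile(r"\[\[([^|\]#]+)")  # capture the target of [[Target|label]]
found = set()

# bz2.open decompresses on the fly, so memory use stays bounded.
with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rt", encoding="utf-8") as dump:
    for line in dump:
        for target in wikilink.findall(line):
            title = target.strip()
            if title in redlinks:
                found.add(title)

print(f"{len(found)} of {len(redlinks)} red links appear as link targets")
```

For real MediaWiki titles, normalizing case of the first letter and converting underscores to spaces on both sides would probably also be needed.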
[16:44:31] 06Labs, 10Labs-Infrastructure, 10DBA, 06Operations: Prepare and check production and labs-side filtering for olowiki - https://phabricator.wikimedia.org/T147302#2704003 (10jcrespo) No it is not, it is not available on labs- and it should not be until this is resolved.
[16:46:36] 06Labs, 10Labs-Infrastructure, 10DBA, 06Operations: Prepare and check production and labs-side filtering for olowiki - https://phabricator.wikimedia.org/T147302#2704009 (10jcrespo) @Marostegui we should do this tomorrow, with special guest @chasemp , if he wants.
[17:01:01] PROBLEM - Host tools-secgroup-test-102 is DOWN: CRITICAL - Host Unreachable (10.68.21.170)
[17:04:41] 06Labs, 10Beta-Cluster-Infrastructure: Remove Labs uses of etcd and confd classes - https://phabricator.wikimedia.org/T147800#2703948 (10AlexMonk-WMF) deployment-conf03's includes match production's conf1001.eqiad.wmnet
[17:06:38] I got an "Address already in use"
[17:06:41] problem
[17:06:52] https://wikitech.wikimedia.org/wiki/Help_talk:Tool_Labs/Python_application_stub
[17:07:21] Any idea for a fix?
[17:17:14] fnielsen, what command do you run that returns "address already in use"?
[17:17:46] tail -f uwsgi.log
[17:18:04] "socket.error: [Errno 98] Address already in use"
[17:20:38] A bit more context: I am setting up a new tool on wmflabs. I attempted to restart it, but had some path and other problems. Eventually I got past those problems, but now it seems that old processes/apps are occupying the socket.
[17:22:13] valhallasw`vecto, know anything about this?
[17:22:24] is that command running things on the correct host?
[17:22:59] if so, how should it determine what port to listen on?
[17:25:15] I think via an env variable
[17:25:43] But what can happen is that the old process is not killed correctly
[17:28:00] I only see tools.scholia in the process list on tools-bastion-03
[17:28:27] is it trying to assign a port used by something else?
[17:28:46] with "ps -ef | grep scholia" I see 3 processes
[17:29:14] (venv)tools.scholia@tools-bastion-03:~$ ps -ef | grep scholia
[17:29:14] root 6958 6816 0 16:48 pts/24 00:00:00 /usr/bin/sudo -niu tools.scholia
[17:29:14] root 7394 7290 0 16:12 pts/44 00:00:00 /usr/bin/sudo -niu tools.scholia
[17:29:14] tools.s+ 8930 7400 0 17:29 pts/44 00:00:00 grep scholia
[17:29:15] root 19071 18991 0 16:24 pts/50 00:00:00 /usr/bin/sudo -niu tools.scholia
[17:29:28] What does qstat say?
[17:29:46] $ qstat
[17:29:47] job-ID prior name user state submit/start at queue slots ja-task-ID
[17:29:47] -----------------------------------------------------------------------------------------------------------------
[17:29:47] 1811346 0.30005 uwsgi-pyth tools.scholi dr 10/10/2016 16:35:12 webgrid-generic@tools-webgrid- 1
[17:29:48] 1811349 0.30005 uwsgi-pyth tools.scholi dr 10/10/2016 16:35:34 webgrid-generic@tools-webgrid- 1
[17:29:51] 1811350 0.30005 uwsgi-pyth tools.scholi dr 10/10/2016 16:35:38 webgrid-generic@tools-webgrid- 1
[17:30:45] Ok, try: qdel 1811346 1811349 1811350
[17:30:58] Then webservice uwsgi start
[17:31:10] Ok, I was going to ask that
[17:31:30] 3 running commands is definitely not good
[17:32:02] What is the difference between webservice and webservice2?
[17:32:46] 3 lines of "tools.scholia has registered the job 1811346 for deletion" or similar, but qstat says they are still there
[17:33:37] oh, maybe the same thing is happening to me on stewardbots
[17:33:42] I can't get stewardbot to stop
[17:33:58] I've tried qdel and kill -9 on the execution server
[17:34:02] it respawns
[17:34:12] bigbrother disabled before, just in case
[17:34:37] okay, it seems it's stopped now
[17:34:40] I'll try to restart
[17:34:54] do I still need to specify -l release=trusty?
[17:35:54] yes
[17:36:07] $ jstart -N stewardbot -mem 2G -l release=trusty python /data/project/stewardbots/StewardBot/StewardBot.py
[17:36:11] that's what I'm gonna use
[17:36:59] Your job 1812957 ("stewardbot") has been submitted
[17:37:06] but the bot doesn't join the channels
[17:37:09] :S
[17:38:17] !log tools.scholia Force deleted (`sudo qdel -f ...`) three webservice jobs that were stuck in "deleting" state
[17:40:23] mafk: check the log files?
[17:40:40] Yes, I saw they got deleted.
[17:41:53] I tried to start with "webservice2 uwsgi-python start" but still receive the same error message. "ps -ef | grep scholia" still shows the old processes.
[17:42:46] fnielsen_: I'm taking a look at it. The job restarted after the force deletes but was still crashing.
[17:43:07] valhallasw`vecto: which ones? the project is a mess... I'm still trying to get folks to clean old stuff from toolserver...
[17:43:32] * bd808 didn't notice that the irc chat was happening on this
[17:48:58] Mafk: stewardbot.*
[17:49:05] fnielsen_: It looks like the problem may be your ~/www/python/src/app.py file. It is trying to import the scholia.app module, but I *think* that the python path isn't finding that code
[17:49:08] I've deleted .bigbrotherrc and it's still trying to restart the jobs :S
[17:49:16] it's flooding my inbox
[17:49:20] with warn and info
[17:50:09] When I first tried, I got several path problems reported in the log file.
[17:51:38] The error message I got now is different
[17:52:25] bd808: Do you think that the path problem is still what is causing the socket problem?
[17:53:23] bd808: ping, can you please kill the bigbrother daemon for stewardbots? I've deleted the file containing it but it's still trying to restart the jobs over and over
[17:53:31] tools.stewardbots
[17:54:27] fnielsen_: the error "unable to load app 0 (mountpoint='') (callable not found or import error)" looks like it happens first. Try clearing that up and then we can tackle the socket error if it still exists.
[17:54:39] mafk: I'll give it a shot
[17:54:57] bd808: I have just switched over to the old simple app with no import and it works.
[17:55:38] fnielsen_: *nod* so it sounds like the import path problem is the next thing to figure out
[17:55:49] Thanks.
[17:55:51] Yes
[17:57:49] mafk: how long ago did you delete the bigbrotherrc file?
[17:58:19] bd808: ~30 minutes ago +/-
[17:58:41] hmmm... ok.
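Editor's note on the 17:49 diagnosis that ~/www/python/src/app.py cannot import scholia.app: a common fix is to put the checkout that contains the package on sys.path before the import. The sketch below is a guess at that fix, not the tool's actual file; the directory layout and the create_app factory name are assumptions.

```python
# Hypothetical ~/www/python/src/app.py, sketching the import-path fix hinted
# at above. uwsgi mounts a module-level "app" callable from this file; the
# "callable not found or import error" message means the import itself failed.
import os
import sys

here = os.path.dirname(os.path.abspath(__file__))
# Assumed layout: the scholia repository is checked out next to this file,
# so the importable "scholia" package lives at <here>/scholia/scholia/.
sys.path.insert(0, os.path.join(here, "scholia"))

from scholia.app import create_app  # assumed entry point; adjust to the real one

app = create_app()  # the callable uwsgi serves
```

An alternative, also consistent with the chat (a virtualenv prompt is visible in the ps paste), would be to pip-install the scholia checkout into the tool's virtualenv so no sys.path manipulation is needed.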
[17:58:56] tools.stewardbots@tools-bastion-03:~$ qstat
[17:58:58] job-ID prior name user state submit/start at queue slots ja-task-ID
[17:58:59] -----------------------------------------------------------------------------------------------------------------
[17:59:01] 358869 0.35393 lighttpd-s tools.stewar Rr 09/20/2016 17:46:31 webgrid-lighttpd@tools-webgrid 1
[17:59:02] 1147577 0.32371 sulwatcher tools.stewar r 09/23/2016 04:47:19 continuous@tools-exec-1409.eqi 1
[17:59:04] 1813475 0.30000 stewardbot tools.stewar r 10/10/2016 17:57:16 continuous@tools-exec-1410.eqi 1
[17:59:05] tools.stewardbots@tools-bastion-03:~$
[18:00:42] queue instance "cyberbot@tools-exec-cyberbot.eqiad.wmflabs" dropped because it is temporarily not available
[18:01:38] deleting bigbrotherrc might not work until bigbrother is restarted
[18:02:22] I think an empty file does work (not certain)
[18:03:03] 2016-10-10 18:02:42 warn: Too many attempts to restart job 'stewardbot'; throttling
[18:03:10] I think that'd stop it for a while
[18:03:22] blank file was tried
[18:03:25] same result
[18:03:31] prepending # to the commands too
[18:04:21] valhallasw`vecto: does that run on tools-services-01?
[18:04:42] currently on -02 it seems
[18:04:55] !log tools sudo service bigbrother restart @ tools-services-02
[18:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[18:05:42] bd808: I thought you had rewritten bigbrother at some point, but that hasn't been merged yet I guess?
[18:06:26] I think this is worth looking at http://pastebin.com/JUD0tgYi ?
[18:06:31] too many jobs dropped
[18:08:35] no, that's fine
[18:09:10] the job is running
[18:09:45] well, /registered/ to be running, on tools-exec-1410
[18:11:27] it's not running really
[18:11:30] it's an irc bot
[18:11:36] and it's not joining any channel
[18:13:35] well, qstat also doesn't report it anymore, so it's not currently running
[18:13:54] I've deleted it again
[18:15:07] okay account works
[18:16:18] PROBLEM - Host tools-exec-cyberbot is DOWN: CRITICAL - Host Unreachable (10.68.16.39)
[18:22:39] valhallasw`vecto: yeah I have a python port of it hanging out in gerrit
[18:32:18] PROBLEM - Host tools-secgroup-test-103 is DOWN: CRITICAL - Host Unreachable (10.68.21.22)
[18:39:10] 06Labs, 10wikitech.wikimedia.org: Install OAuth for wikitech - https://phabricator.wikimedia.org/T147804#2704154 (10bd808)
[18:41:08] PROBLEM - Host secgroup-lag-102 is DOWN: CRITICAL - Host Unreachable (10.68.17.218)
[18:48:44] 06Labs, 10Tool-Labs: tools-exec-cyberbot in SHUTOFF state - https://phabricator.wikimedia.org/T147805#2704173 (10valhallasw)
[18:56:43] 06Labs, 10Beta-Cluster-Infrastructure: Remove Labs uses of etcd and confd classes - https://phabricator.wikimedia.org/T147800#2704189 (10hashar) The `etcd` labs project is most probably @joe sandbox area.
[19:19:47] 06Labs, 10Labs-Infrastructure, 10DBA, 06Operations: Prepare and check production and labs-side filtering for olowiki - https://phabricator.wikimedia.org/T147302#2704247 (10Marostegui) Sounds good to me, let's do it tomorrow! On 10 Oct 2016 18:46, "jcrespo" wrote:...
[19:46:02] !log tools.stewardbots Restarted StewardBot and SULWatcher, set to run via jstart; IRC server changed.
[19:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL, Master
[19:50:43] !log tools.stewardbots jsub -> jstart on the restart_*.sh files
[19:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL, Master
[20:03:48] (03PS1) 10MarcoAurelio: Removed outdated stuff [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/315151
[20:05:31] (03CR) 10MarcoAurelio: [C: 032] Removed outdated stuff [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/315151 (owner: 10MarcoAurelio)
[20:05:59] (03Merged) 10jenkins-bot: Removed outdated stuff [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/315151 (owner: 10MarcoAurelio)
[20:06:50] !log tools.stewardbots [[gerrit:315151|Removed outdated files and stuff]]
[20:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL, Master
[20:27:12] 06Labs, 10wikitech.wikimedia.org: Install OAuth for wikitech - https://phabricator.wikimedia.org/T147804#2704154 (10Reedy) Any obvious blockers? Or just needs db tables creating and the extension enabling as usual?
[20:56:54] !log deployment-prep fixed puppet on -tin/-mira by restarting puppetmaster for base_path scap change
[20:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master
[21:09:56] 06Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Install OAuth for wikitech - https://phabricator.wikimedia.org/T147804#2704455 (10Reedy) 05Open>03Resolved a:03Reedy
[21:11:22] !log deployment-prep fixed puppet on -restbase01/-restbase02 by setting up deployment of cassandra/twcs on deployment-tin
[21:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master
[21:41:41] !log deployment-prep restarted keyholder-proxy on -tin to make check_keyholder happy with the extra key that was active but unconfigured
[21:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master