[00:30:37] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1406 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[01:02:56] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1010 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[01:10:37] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1406 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:26:51] <shinken-wm>	 PROBLEM - Host tools-clushmaster-01 is DOWN: CRITICAL - Host Unreachable (10.68.18.81)
[01:28:09] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1407 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[01:37:56] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1010 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:32:16] <wikibugs_>	 (PS2) BryanDavis: Convert jsub to python3 [labs/toollabs] - https://gerrit.wikimedia.org/r/381386
[02:33:09] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:33:24] <wikibugs_>	 (CR) jerkins-bot: [V: -1] Convert jsub to python3 [labs/toollabs] - https://gerrit.wikimedia.org/r/381386 (owner: BryanDavis)
[03:21:25] <wikibugs_>	 (PS3) BryanDavis: Convert jsub to python3 [labs/toollabs] - https://gerrit.wikimedia.org/r/381386
[03:35:15] <wikibugs_>	 cloud-services-team (Kanban), Analytics, User-bd808: Remove logging from labs for schema https://meta.wikimedia.org/wiki/Schema:CommandInvocation - https://phabricator.wikimedia.org/T166712#3648968 (bd808) a:bd808
[03:35:44] <wikibugs_>	 Toolforge, cloud-services-team (Kanban), Patch-For-Review, User-bd808: Update `sql` command to use new wiki replica servers - https://phabricator.wikimedia.org/T176688#3648971 (bd808) Open>Resolved
[03:35:47] <wikibugs_>	 Data-Services, cloud-services-team (FY2017-18), DBA, Goal: Migrate all users to new Wiki Replica cluster and decommission old hardware - https://phabricator.wikimedia.org/T142807#3648972 (bd808)
[03:57:45] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1427 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[04:09:14] <bd808>	 !log tools.jadfa Stopped and started webservice; watchdog had gone crazy and created 152 running jobs
[04:09:15] <stashbot>	 bd808: Unknown project "tools.jadfa"
[04:09:32] <bd808>	 !log tools.yadfa Stopped and started webservice; watchdog had gone crazy and created 152 running jobs
[04:09:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.yadfa/SAL
[04:17:47] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1427 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:43:32] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1423 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[05:00:07] <wikibugs_>	 (PS11) BryanDavis: Add rewritten crontab in Python [labs/toollabs] - https://gerrit.wikimedia.org/r/336998 (https://phabricator.wikimedia.org/T156174) (owner: Zhuyifei1999)
[05:00:07] <wikibugs_>	 (PS2) BryanDavis: Convert list-user-databases to python3 [labs/toollabs] - https://gerrit.wikimedia.org/r/381385
[05:00:09] <wikibugs_>	 (PS4) BryanDavis: Convert jsub to python3 [labs/toollabs] - https://gerrit.wikimedia.org/r/381386
[05:00:11] <wikibugs_>	 (PS1) BryanDavis: Remove log-command-invocation calls [labs/toollabs] - https://gerrit.wikimedia.org/r/381619 (https://phabricator.wikimedia.org/T166712)
[05:00:13] <wikibugs_>	 (PS1) BryanDavis: Toolforge rebranding [labs/toollabs] - https://gerrit.wikimedia.org/r/381620 (https://phabricator.wikimedia.org/T168480)
[05:00:15] <wikibugs_>	 (PS1) BryanDavis: Port /usr/bin/job to Python3 [labs/toollabs] - https://gerrit.wikimedia.org/r/381621
[05:00:17] <wikibugs_>	 (PS1) BryanDavis: Changelog bump for v1.23 [labs/toollabs] - https://gerrit.wikimedia.org/r/381622
[05:03:34] <wikibugs_>	 (CR) jerkins-bot: [V: -1] Convert jsub to python3 [labs/toollabs] - https://gerrit.wikimedia.org/r/381386 (owner: BryanDavis)
[05:06:42] <wikibugs_>	 (PS5) BryanDavis: Convert jsub to python3 [labs/toollabs] - https://gerrit.wikimedia.org/r/381386
[05:06:44] <wikibugs_>	 (PS2) BryanDavis: Remove log-command-invocation calls [labs/toollabs] - https://gerrit.wikimedia.org/r/381619 (https://phabricator.wikimedia.org/T166712)
[05:06:46] <wikibugs_>	 (PS2) BryanDavis: Toolforge rebranding [labs/toollabs] - https://gerrit.wikimedia.org/r/381620 (https://phabricator.wikimedia.org/T168480)
[05:06:48] <wikibugs_>	 (PS2) BryanDavis: Port /usr/bin/job to Python3 [labs/toollabs] - https://gerrit.wikimedia.org/r/381621
[05:06:52] <wikibugs_>	 (PS2) BryanDavis: Changelog bump for v1.23 [labs/toollabs] - https://gerrit.wikimedia.org/r/381622
[05:06:57] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1020 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[05:23:34] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1423 is OK: OK: Less than 1.00% above the threshold [0.0]
[05:41:58] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1020 is OK: OK: Less than 1.00% above the threshold [0.0]
[05:44:34] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1423 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[06:08:28] <wikibugs_>	 Toolforge: node.js webservice not seeing PORT in env - https://phabricator.wikimedia.org/T176812#3649010 (Neta-kedem) Open>Resolved a:Neta-kedem This answered my problem. Thanks a lot for the quick response!
[07:08:00] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1020 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[07:19:33] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1423 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:50:34] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1423 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[08:06:23] <paladox>	 andrebogott hi, puppet-phabricator and gerrit-test3 have gone down.
[08:06:25] <paladox>	 andrewbogott ^^
[08:06:45] <paladox>	 this is since they were switched from different labs virts.
[08:09:22] <wikibugs_>	 Cloud-VPS: puppet-phabricator and gerrit-test3 have gone down - https://phabricator.wikimedia.org/T177164#3649072 (Paladox)
[08:12:19] <wikibugs_>	 Cloud-VPS: puppet-phabricator and gerrit-test3 have gone down - https://phabricator.wikimedia.org/T177164#3649084 (Paladox) p:Triage>High Since two of them went down at the same time, that is strange. Triaging as high as it was not one but two went down.
[08:12:57] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1020 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:23:48] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1416 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[08:48:51] <wikibugs_>	 Cloud-VPS: puppet-phabricator and gerrit-test3 have gone down - https://phabricator.wikimedia.org/T177164#3649101 (Paladox) It seems only a few are also down from this list https://phabricator.wikimedia.org/P6060 like search-jessie and wdqs-deploy
[08:53:47] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1416 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:25:34] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1423 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:29:08] <wikibugs_>	 Cloud-Services, Huggle: Huggle development environment - portable virtual box - https://phabricator.wikimedia.org/T177145#3649128 (Petrb) Well, do you realize that it's only going to be a few of people who would eventually download this? I am not distributing it to millions :)
[09:46:33] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1423 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[10:21:34] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1423 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:40:34] <abartov>	 hi.
[10:40:46] <abartov>	 is it possible to run node-based services on Toolforge?
[10:41:13] <abartov>	 node seems to be installed, but nodejs --version returns "v0.10.25", which seems... ancient.
[10:42:33] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1423 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[10:53:03] <shinken-wm>	 PROBLEM - Puppet errors on tools-webgrid-lighttpd-1413 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[12:19:30] <abartov>	 (it's the weekend, so not necessarily expecting a response, but let me tag bd808 for later.)
[12:27:19] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-gift-trusty-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[12:52:20] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-gift-trusty-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:18:01] <shinken-wm>	 RECOVERY - Puppet errors on tools-webgrid-lighttpd-1413 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:24:50] <Sagan>	 !log rcm Xenon: Updating Phabricator
[13:24:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Rcm/SAL
[13:25:16] <Sagan>	 !log rcm CAC: Running vagrant git-update
[13:25:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Rcm/SAL
[13:25:39] <Sagan>	 !log rcm Tin: Running update (Jenkins)
[13:25:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Rcm/SAL
[14:24:10] <shinken-wm>	 PROBLEM - Puppet errors on tools-exec-1407 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[14:59:10] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:47:45] <wikibugs_>	 Cloud-VPS: puppet-phabricator and gerrit-test3 have gone down - https://phabricator.wikimedia.org/T177164#3649292 (Paladox) It is still showing as rebooting through horizon and still inaccessible.
[17:04:12] <Zppix>	 bd808:  are you around?
[17:05:13] <Zppix>	 or madhuvishy 
[17:10:25] <Freddy2001>	 Hi
[17:10:31] <Freddy2001>	 I have a problem with jsub
[17:11:32] <madhuvishy>	 Hello, what's up 
[17:12:56] <Zppix>	 madhuvishy:  are you able to go in and force restart gerrit-test3.git.eqiad.wmflabs? Horizon is stuck on hard reboot, and we cannot ssh into the host
[17:13:37] <Zppix>	 ty in advance
[17:14:54] <madhuvishy>	 Zppix: ah sure can you add me to the project?
[17:15:03] <madhuvishy>	 Freddy2001: what's going on?
[17:15:17] <Zppix>	 madhuvishy: what perms would you need?
[17:15:32] <madhuvishy>	 Zppix: just add to project on wikitech?
[17:15:38] <Freddy2001>	 I cannot submit a new job using jsub
[17:15:39] <Zppix>	 madhuvishy:  as a project admin?
[17:15:42] <madhuvishy>	 yup
[17:15:46] <Zppix>	 ok one sec
[17:16:06] <madhuvishy>	 Freddy2001: can you explain more, what are you trying to submit, what is the error?
[17:16:24] <Freddy2001>	 and if i use qsub, the job waits more than 20 minutes in the queue
[17:16:50] <paladox>	 puppet-phabricator and gerrit-test3 suddenly went down UK time this morning and doing a hard reboot has not brought them back up. It's stuck saying rebooting in the Horizon UI.
[17:16:51] <Zppix>	 madhuvishy:  whats your wikitech user?
[17:17:00] <paladox>	  https://phabricator.wikimedia.org/T177164
[17:17:02] <madhuvishy>	 Madhuvishy
[17:17:17] <Freddy2001>	 I tried the following: jsub -once -mem 350m php -f path_to/my_script.php
[17:17:24] <Zppix>	 madhuvishy:  Failed to add madhuvishy to git.
[17:17:32] <shinken-wm>	 RECOVERY - Puppet errors on tools-exec-1423 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:17:32] <paladox>	 I've added her already :)
[17:17:46] <Zppix>	 oh
[17:17:54] <Zppix>	 madhuvishy:  paladox seemed to already add you :P
[17:18:24] <Freddy2001>	 There I do not get any message "your job was scheduled" and qstat does not have any information about it
[17:19:16] <madhuvishy>	 Freddy2001: I assume you are logged in to the tools bastion? Which tool is this, what command did you run?
[17:19:30] <madhuvishy>	 Zppix: it is still stuck hard rebooting
[17:19:38] <Freddy2001>	 Yes I am on tools-bastion-03
[17:20:19] <paladox>	 madhuvishy yep, it seems 2 instances went down this morning. puppet-phabricator and gerrit-test3.
[17:20:20] <paladox>	 they were moved from a labvirt on friday.
[17:20:23] <Zppix>	 madhuvishy:  it has been for awhile hence is why i was asking if you could do some special thing to get it to get unstuck
[17:20:27] <paladox>	 though they were working up until early this morning.
[17:21:15] <Freddy2001>	 madhuvishy, my tool is freddy2001
[17:22:31] <madhuvishy>	 Zppix: okay I'm looking
[17:22:41] <Zppix>	 ty madhuvishy 
[17:43:05] <madhuvishy>	 Zppix: this instance seems to be scheduled on labvirt1015, which i think has been having some issues. I think it's probably related.
[17:43:18] <Zppix>	 madhuvishy:  how can it be resolved?
[17:43:39] <madhuvishy>	 if you can't delete and recreate the instance, I think the next step is to make a ticket, and I can ask andrew tomorrow
[17:44:24] <Zppix>	 paladox ^
[17:44:42] <paladox>	 https://phabricator.wikimedia.org/T177164
[17:44:43] <Zppix>	 madhuvishy:  what about the other host that paladox brought up puppet-phabricator same issue?
[17:44:47] <paladox>	 Zppix madhuvishy ^^
[17:44:56] <paladox>	 filed task at https://phabricator.wikimedia.org/T177164
[17:45:41] <madhuvishy>	 same labvirt
[17:45:56] <paladox>	 there were also two others that were migrated too
[17:46:00] <paladox>	 and they appear down
[17:46:11] <paladox>	 though i haven't access to them, i just pinged them
[17:46:11] <madhuvishy>	 which ones?
[17:46:20] <paladox>	 search-jessie and wdqs-deploy
[17:46:29] <paladox>	 they are from this list https://phabricator.wikimedia.org/P6060 too
[17:46:30] <madhuvishy>	 which projects are these in?
[17:46:48] <Freddy2001>	 anyone who can help me?
[17:46:57] <paladox>	 wdqs-deploy is in wikidata-query  
[17:47:06] <paladox>	 search-jessie is in search
[17:48:26] <madhuvishy>	 yeah all of them are on labvirt1015
[17:48:37] <paladox>	 oh.
[17:49:22] <madhuvishy>	 Freddy2001: You mentioned the tool name, but I am not sure what exactly you tried to submit via jsub? Sorry if I missed that message
[17:50:07] <Freddy2001>	 I tried to submit a php script
[17:50:26] <Freddy2001>	 runs without any error in a screen
[17:50:53] <wikibugs_>	 Cloud-VPS: puppet-phabricator and gerrit-test3 have gone down - https://phabricator.wikimedia.org/T177164#3649315 (Paladox) They are on labvirt1015.  search-jessie and wdqs-deploy and puppet-phabricator and gerrit-test3.  I found that tools-clushmaster-01 was migrated there too and is also failing the ping t...
[17:52:43] <valhallasw`cloud>	 Freddy2001: there should be output/errors in php.out/php.err I think?
[17:53:30] <valhallasw`cloud>	 php5.err
[17:59:23] <abartov>	 madhuvishy: if you're still around, do you know the answer to my question above?
[17:59:42] <madhuvishy>	 abartov: ah yes, I do, one sec finding docs
[17:59:42] <abartov>	 is anyone running Node v6.x or later on Toolforge?
[18:00:41] <madhuvishy>	 abartov: yes, the answer is that newer version is available if you use Kubernetes
[18:00:42] <madhuvishy>	 https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web#node.js_web_services
[18:01:50] <madhuvishy>	 (we currently have 2 scheduling mechanisms, and the older mechanism has the older node version, we've been asking folks who want to run node to move to the new Kubernetes based setup)
[18:03:14] <Freddy2001>	 valhallasw`cloud, there is no php5.err
[18:03:40] <valhallasw`cloud>	 -rw-rw---- 1 tools.freddy2001 tools.freddy2001 674 Oct  1 16:05 /data/project/freddy2001/php5.err
[18:03:58] <shinken-wm>	 PROBLEM - Puppet errors on tools-worker-1020 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[18:04:58] <Freddy2001>	 ok, then it is new there. I've checked it twice after submitting
[18:43:58] <shinken-wm>	 RECOVERY - Puppet errors on tools-worker-1020 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:58:51] <bd808>	 abartov: try https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes for nodejs things. Webservices are easy and bots are possible with a custom deployment.yaml as documented on that wiki page.
[20:22:05] <mhashemirc>	 speaking of toolforge kubernetes!
[20:22:32] <mhashemirc>	 we're trying to get the production version of montage working for WLM2017 judging to start today but we're in a bit of a bind
[20:23:02] <mhashemirc>	 is there anyone here who can take a look?
[20:23:43] <mhashemirc>	 uwsgi log is empty and `kubectl get pods` tells me the pod status
[20:23:44] <mhashemirc>	 is "CrashLoopBackOff"
[20:30:49] <bd808>	 CrashLoopBackOff means that the pod is failing to start. For a python webapp this often means that there is a syntax error in the python code.
[20:30:59] <bd808>	 mhashemirc: what's the tool name?
[20:31:13] <mhashemirc>	 "montage"
[20:31:23] <mhashemirc>	 i think somehow it's trying to run python3 instead of python2
[20:32:29] <bd808>	 webservice --backend=kubernetes python2 start -- that will start with python2
[20:34:13] <mhashemirc>	 yes, we're doing exactly that
[20:34:43] <mhashemirc>	 https://gist.github.com/mahmoud/70b0b40122ec49c8abb78e1a9021f9d6
[20:35:11] <bd808>	 hmmm... the error in uwsgi.log.old is "ImportError: No module named 'encodings'" that shows it was running with python 3.4.2
[20:36:42] <mhashemirc>	 yeah, that's the old log, we've since stopped that pod and restarted it several times
[20:37:08] <mhashemirc>	 is the python3 state sticky, despite the service.manifest, command line options, etc.?
[20:37:15] <bd808>	 mhashemirc: is that /data/project/montage/www/python/uwsgi.ini from a time when you were running on grid engine instead of kubernetes?
[20:38:10] <mhashemirc>	 yes, but more importantly i see a syntax error in there that i'm correcting
[20:38:18] <bd808>	 the webservice watchdog does get confused sometimes which could make things stickier than they should be.
[20:39:56] <mhashemirc>	 cool man, i think we fixed it
[20:39:58] <bd808>	 I *think* the /usr/lib/uwsgi/plugins/python3_plugin.so error message is ok
[20:40:07] <mhashemirc>	 the python3 error was not breaking
[20:40:07] <mhashemirc>	 yeah
[20:40:11] <mhashemirc>	 super confusing but ok!
[20:40:41] <bd808>	 we build the same default uwsgi.ini for both python2 and python3. We should probably change that
[20:42:07] <mhashemirc>	 yeah i think there was literally a typo in the ini that came as a result of a bad copy-paste from montage-beta. the python3 error was just misleading us
[20:42:19] <mhashemirc>	 https://tools.wmflabs.org/montage/meta/
[20:42:22] <mhashemirc>	 we have liftoff :)
[20:42:27] <bd808>	 nice
[20:43:54] <Zppix>	 anyone mind telling me what sql code i would use to query page names on enwiki?
[20:44:28] <bd808>	 Zppix: https://www.mediawiki.org/wiki/Manual:Database_layout
[20:45:00] <Zppix>	 bd808:  i find that page a bit confusing
[20:51:53] <valhallasw`cloud>	 Zppix: which part is confusing to you?
[20:52:02] <bd808>	 Zppix: https://www.mediawiki.org/wiki/Manual:Page_table
[20:52:20] <Zppix>	 bd808: or valhallasw`cloud do i use enwiki_p, select from page, and then what do i do to list pages with a certain term?
[20:52:43] <Zppix>	 a certain term in their title*
[20:53:28] <valhallasw`cloud>	 Zppix: you're better off using the Search API rather than the database, as the database is not indexed to support such queries
[20:55:19] <valhallasw`cloud>	 Zppix: when it comes to learning how to write queries, I would suggest something like https://sqlbolt.com/ , although that does not go into the importance of table indexing
[20:55:37] <bd808>	 wildcard searches are definitely best done with search api. something like -- https://en.wikipedia.org/w/index.php?search=intitle%3Apizza&title=Special:Search&profile=default&fulltext=1&searchToken=cfjb5300ybn3pah8mjqvx7rwc
[20:57:38] <Zppix>	 ok
[20:57:43] <bd808>	 or https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&list=search&srsearch=intitle%3Apizza
[20:58:41] <bd808>	 the "intitle:..." is the magic for limiting CirrusSearch to scanning the title
[20:59:06] <bd808>	 https://www.mediawiki.org/wiki/Help:CirrusSearch#Intitle_and_incategory
[21:00:34] <mhashemirc>	 thanks bd808!
[21:03:04] <bd808>	 mhashemirc: you are welcome. I didn't do much except act as a rubber duck, but glad you got it working.
[21:05:57] <mhashemirc>	 well, at the very least you confirmed that the .so issue doesn't need to go on our TODO :)
[21:35:58] <wikibugs_>	 cloud-services-team (Kanban), DC-Ops, Operations, ops-eqiad: labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3649499 (bd808) Resolved>Open @andrew moved 9 VMs to this host on 2017-09-29. On 2017-10-01 we found it non-responsive to ssh and with this output on the management...
[21:36:53] <wikibugs_>	 cloud-services-team (Kanban), DC-Ops, Operations, ops-eqiad: labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3649502 (bd808) Console logging on boot: ``` labvirt1015 login: [   48.163451] kvm [3714]: vcpu0 unhandled rdmsr: 0x611 [   48.169003] kvm [3714]: vcpu0 unhandled rdmsr: 0x...
[21:46:43] <madhuvishy>	 !log tools Cold migrating tools-clushmaster-01 from labvirt1015 to labvirt1017
[21:46:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:48:09] <wikibugs_>	 cloud-services-team (Kanban), DC-Ops, Operations, ops-eqiad: labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3649518 (chasemp) Note to self: fix cold-migrate to handle already shut down instances
[21:52:22] <shinken-wm>	 RECOVERY - Host tools-clushmaster-01 is UP: PING OK - Packet loss = 0%, RTA = 1.59 ms
[21:53:41] <bd808>	 !log search Cold migrating search-jessie from labvirt1015 to labvirt1017
[21:53:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Search/SAL
[21:58:57] <shinken-wm>	 PROBLEM - Puppet staleness on tools-clushmaster-01 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [43200.0]
[21:59:02] <madhuvishy>	 !log git Cold migrating gerrit-test3 from labvirt1015 to labvirt1017
[21:59:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Git/SAL
[22:02:43] <wikibugs_>	 Cloud-VPS: puppet-phabricator and gerrit-test3 have gone down - https://phabricator.wikimedia.org/T177164#3649527 (chasemp) yes, this labvirt has crashed and we are attempting to recover these instances.  Apologies for the inconvenience, appreciate the patience :)
[22:03:55] <shinken-wm>	 RECOVERY - Puppet staleness on tools-clushmaster-01 is OK: OK: Less than 1.00% above the threshold [3600.0]
[22:05:12] <wikibugs_>	 cloud-services-team (Kanban), DC-Ops, Operations, ops-eqiad: labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3649543 (chasemp) Entirety of labvirt1015 console during crash https://usercontent.irccloud-cdn.com/file/mwxQTBO0/Screen%20Shot%202017-10-01%20at%202.26.59%20PM.png
[22:05:39] <wikibugs_>	 cloud-services-team (Kanban), DC-Ops, Operations, ops-eqiad: labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3649544 (Andrew) The last syslog before reboot was at Oct  1 01:21:01.  It was down for many hours and didn't page because I downtimed it during the hardware replacement an...
[22:09:20] <bd808>	 !log integration Cold migrating integration-slave-jessie-1003 from labvirt1015 to labvirt1017
[22:09:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/SAL
[22:10:23] <bd808>	 !log integration Cold migrating integration-slave-jessie-1004 from labvirt1015 to labvirt1017
[22:10:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/SAL
[22:13:58] <paladox>	 bd808 thank you :)
[22:14:14] <wikibugs_>	 Cloud-VPS: puppet-phabricator and gerrit-test3 have gone down - https://phabricator.wikimedia.org/T177164#3649580 (Paladox) thanks :).
[22:15:24] <bd808>	 paladox: thanks for pointing it out. our monitoring of the host missed that it died because we forgot to tell icinga that we were using it again after the kernel change on Friday :/
[22:15:40] <paladox>	 oh. you're welcome.
[22:19:08] <madhuvishy>	 !log phabricator Cold migrating phab-01 from labvirt1015 to labvirt1017
[22:19:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Phabricator/SAL
[22:19:22] <wikibugs_>	 cloud-services-team (Kanban), DC-Ops, Operations, ops-eqiad: labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3649602 (Andrew) Here's the latest mcelog.   Without timestamps it's hard to correlate this to the failures but still seems bad.  {F9946281}
[22:21:22] <wikibugs_>	 cloud-services-team (Kanban), DC-Ops, Operations, ops-eqiad: labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3466006 (Paladox) >>! In T171473#3649602, @Andrew wrote: > Here's the latest mcelog.   Without timestamps it's hard to correlate this to the failures but still seems bad. >...
[22:25:44] <madhuvishy>	 !log phabricator Cold migrating puppet-phabricator from labvirt1015 to labvirt1017
[22:25:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Phabricator/SAL
[22:40:32] <wikibugs_>	 Cloud-VPS: puppet-phabricator and gerrit-test3 have gone down - https://phabricator.wikimedia.org/T177164#3649611 (madhuvishy) Open>Resolved a:madhuvishy These instances should be up now.
[22:41:06] <bd808>	 !log wikidata-query Cold migrating wdqs-deploy from labvirt1015 to labvirt1017
[22:41:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikidata-query/SAL
[22:42:04] <wikibugs_>	 Cloud-VPS: puppet-phabricator and gerrit-test3 have gone down - https://phabricator.wikimedia.org/T177164#3649614 (Paladox) thank you so much :).  yep, got the recovery messages :).
[23:01:09] <wikibugs_>	 Cloud-VPS: puppet-phabricator and gerrit-test3 have gone down - https://phabricator.wikimedia.org/T177164#3649621 (Andrew) https://wikitech.wikimedia.org/wiki/Incident_documentation/20171001-labvirt1015
[23:03:15] <wikibugs_>	 cloud-services-team, wikitech.wikimedia.org: contentadmin has suddenly less permissions - https://phabricator.wikimedia.org/T171208#3649622 (EddieGP) User rights explicitly removed from 'contentadmin' in CommonSettings.php:  * Edit other users' CSS files (editusercss) * Edit other users' JavaScript file...
[23:30:43] <wikibugs_>	 cloud-services-team, wikitech.wikimedia.org: contentadmin has suddenly less permissions - https://phabricator.wikimedia.org/T171208#3649629 (Krenair) >>! In T171208#3649622, @EddieGP wrote: > @Krenair set these explicit removals back in c9f3ef6526c4 - maybe he knows what the reason was to keep some of th...