[06:30:26] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:56:03] Labs, Tool-Labs, Labs-Sprint-115, Patch-For-Review, and 2 others: Attribute cache issue with NFS on Trusty - https://phabricator.wikimedia.org/T106170#1772667 (MusikAnimal) @coren are we sure this has been resolved? I've been having a similar issue with my [[ https://tools.wmflabs.org/musikanimal |...
[06:56:20] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0]
[07:14:24] Tool-Labs-tools-Erwin's-tools, Patch-For-Review: Migrate https://toolserver.org/~erwin85/talkcatintersect.php to Tool Labs - https://phabricator.wikimedia.org/T62874#1772677 (Liuxinyu970226)
[08:27:48] Labs, MediaWiki-Vagrant: Wikimedia SMPT server does not work with Labs-Vagrant - https://phabricator.wikimedia.org/T117391#1772724 (Tgr) NEW
[09:04:31] Labs, Discovery, Maps: Update wiki page with OSM Postgres access info - https://phabricator.wikimedia.org/T116355#1772785 (akosiaris) That page has indeed outdated information. That box indeed does not provide anymore an OSM database, which is why it prompts for a password. I'll update the page.
[10:59:26] Hello, I got a question, concerning puppet: If I load resources from puppet, will puppet update them automatically? Or is there something I need to do?
[11:02:59] Luke081515: I'm not sure what you mean.
[11:03:56] Luke081515: once the manifest in ops/puppet has been updated, it will be applied automatically, assuming the puppet cronjob is running (and it should be, because otherwise you might miss out on important labs-wide changes)
[11:04:22] ok, thank you
[11:08:02] valhallasw`cloud: Do you know, how often role:phabricator:labs gets updated, or where I could find this information?
[11:10:32] 'how often role:phabricator:labs gets update'??
[11:11:24] what are you trying to do?
[11:12:35] I just want to know, when the next update is, because I saw that phabricator.wikimedia.org is more up to date, than role:phabricator:labs
[11:13:23] the next update of /what/
[11:13:33] is there something missing in the role?
[11:13:52] of the code of phabricator labs, the phabricator software
[11:14:53] as far as I can see from the manifest, both prod and labs use tag release/2015-07-08/1
[11:15:50] in general, labs and prod instances should have the same code checked out
[11:16:53] then my current version is not up to date, I guess. For example: at phabricator.wikimedia.org you can archive pastes, at my current instance, this is not possbile
[11:17:04] *possible
[11:17:15] PROBLEM - Puppet failure on tools-webgrid-generic-1402 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[11:17:50] I see no obvious reason why that would be the case
[11:19:31] Labs, operations, wikitech.wikimedia.org, Wikimedia-log-errors: RunJobs.php fails to be executed on labswiki - https://phabricator.wikimedia.org/T117394#1773011 (Krenair)
[11:20:25] the only oddity I see is that the phabricator repo has https://github.com/wikimedia/phabricator-phabricator/tree/release/2015-10-07.2
[11:20:37] while puppet refers to a much older tag
[11:20:48] twentyafterfour: ^ any idea why that is the case?
[11:57:17] RECOVERY - Puppet failure on tools-webgrid-generic-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[13:52:49] (PS1) Alexandros Kosiaris: lda-mirror: Add snakeoil SSL keys [labs/private] - https://gerrit.wikimedia.org/r/250422
[14:10:34] (CR) Alexandros Kosiaris: [C: 2 V: 2] lda-mirror: Add snakeoil SSL keys [labs/private] - https://gerrit.wikimedia.org/r/250422 (owner: Alexandros Kosiaris)
[14:10:44] (PS2) Alexandros Kosiaris: ldap-mirror: Add snakeoil SSL keys [labs/private] - https://gerrit.wikimedia.org/r/250422
[14:10:50] (CR) Andrew Bogott: "I've been creating these as needed, but this is way better. Thanks!"
[labs/private] - https://gerrit.wikimedia.org/r/250155 (owner: Dzahn)
[14:10:52] (CR) Alexandros Kosiaris: [V: 2] ldap-mirror: Add snakeoil SSL keys [labs/private] - https://gerrit.wikimedia.org/r/250422 (owner: Alexandros Kosiaris)
[14:17:44] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1409 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[14:31:29] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:37:39] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:43:17] valhallasw`cloud: because puppet isn't used for phabricator deploys anymore
[14:45:58] Oh. Okay. How can we make sure labs phabricator instances use the same mechanism to get up to date? Or, at least, how do we make sure labs phabricator instances will check out the correct tag?
[14:46:23] Updating the puppet manifest in parallel sounds like too much manual effort ;-)
[14:53:19] yeah, I think the right solution would be to have labs updated as part of the production scap deployment
[15:08:54] *nod*. we somehow have to make sure a new labs instance gets created correctly as well, though. Maybe it should just check out master from https://github.com/wikimedia/phabricator-deployment ?
[15:28:21] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[16:33:18] valhallasw`cloud: yeah that should be right
[17:03:22] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:20:37] twentyafterfour: Should I create a task for it?
[17:20:50] Luke081515: sure
[17:20:58] ok
[17:21:34] Luke081515: T114363 is related
[17:21:54] how would labs updates from prod scap work?
[17:22:00] or do we mean a pre-prod labs scap
[17:22:37] chasemp: I will be running phabricator deploys locally on my laptop most likely
[17:22:52] so I can deploy production and labs
[17:23:36] Labs, Phabricator, Puppet: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1774347 (Luke081515) NEW
[17:23:39] done
[17:23:40] but we do need labs to work out of the box with puppet, and cloning the deployment repo at master should work for that
[17:24:06] Labs, Phabricator, Puppet: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1774356 (mmodell) a:mmodell
[17:24:18] thanks
[17:24:22] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[17:24:22] change out the git:install for git::clone which is pretty standard throughout
[17:25:16] twentyafterfour: could you update that README in the phab module for this
[17:25:23] ppl do ask about that once every few weeks
[17:29:12] Labs, operations, wikitech.wikimedia.org, Wikimedia-log-errors: RunJobs.php fails to be executed on labswiki - https://phabricator.wikimedia.org/T117394#1774385 (Krenair) IIRC, labswiki jobs are supposed to be running locally on silver only...
[17:48:37] ok
[17:51:31] Labs, Labs-Infrastructure, labs-sprint-117, labs-sprint-118: Move project membership/assignment from ldap to keystone mysql - https://phabricator.wikimedia.org/T115029#1774497 (Andrew) https://wikitech.wikimedia.org/wiki/Labs_keystone_roles#Steps
[17:57:07] Labs, Tool-Labs, labs-sprint-118: Enforce that containers from a user run with the uid assigned to that user - https://phabricator.wikimedia.org/T116504#1774514 (yuvipanda) https://github.com/kubernetes/kubernetes/pull/16250 has discussions, not going very well atm unfortunately :(
[18:03:07] Labs, Tool-Labs, labs-sprint-118, labs-sprint-119: Enforce that containers from a user run with the uid assigned to that user - https://phabricator.wikimedia.org/T116504#1774539 (yuvipanda)
[18:03:24] Labs, Discovery, Elasticsearch, labs-sprint-116, and 3 others: Replicate production elasticsearch indices to labs - https://phabricator.wikimedia.org/T109715#1774540 (yuvipanda)
[18:03:34] Labs, Tool-Labs, Incident-20150617-LabsNFSOutage, labs-sprint-117, and 2 others: Re-enable cron for tools on tool labs - https://phabricator.wikimedia.org/T104614#1774541 (yuvipanda)
[18:04:16] Labs, Labs-Infrastructure, Patch-For-Review, labs-sprint-119: Automate generation of floating/private dns aliases in the labs recursor - https://phabricator.wikimedia.org/T100990#1774546 (yuvipanda)
[18:04:53] Labs, Tool-Labs, Diffusion, labs-sprint-119: Figure out a git hosting solution for tools/kubernetes - https://phabricator.wikimedia.org/T117071#1774552 (yuvipanda)
[18:07:24] Labs, Tool-Labs, Incident-20150617-LabsNFSOutage, labs-sprint-117, and 2 others: Re-enable cron for tools on tool labs - https://phabricator.wikimedia.org/T104614#1774564 (yuvipanda) a:yuvipanda>coren
[18:08:36] Labs, Labs-Team-Backlog, Labs-Q4-Sprint-1, Labs-Q4-Sprint-2, and 5 others: Labs NFSv4/idmapd mess - https://phabricator.wikimedia.org/T87870#1774572 (coren)
[18:10:08] Labs,
Tool-Labs, Labs-Sprint-115, Patch-For-Review, and 3 others: Attribute cache issue with NFS on Trusty - https://phabricator.wikimedia.org/T106170#1774591 (coren) I //think// right now that this is a distinct issue, but I'm reopening the ticket until that is determined for certain.
[18:10:16] Labs, Tool-Labs, Labs-Sprint-115, Patch-For-Review, and 3 others: Attribute cache issue with NFS on Trusty - https://phabricator.wikimedia.org/T106170#1774595 (coren) Resolved>Open
[18:11:52] Labs, Scrum-of-Scrums: Increase cpu and disk quota for the 'search' group - https://phabricator.wikimedia.org/T116292#1774603 (EBernhardson) @YuviPanda this is the ticket we talked about last thursday night
[18:12:59] Labs, labs-sprint-118, labs-sprint-119: Document support levels for tools and labs projects - https://phabricator.wikimedia.org/T116598#1774609 (chasemp)
[18:13:45] Labs, Labs-Infrastructure, netops, operations, and 3 others: Allocate subnet for labs test cluster instances - https://phabricator.wikimedia.org/T115492#1774624 (chasemp)
[18:13:56] Labs, Labs-Infrastructure, netops, operations, and 2 others: Allocate labs subnet in dallas - https://phabricator.wikimedia.org/T115491#1774634 (chasemp)
[18:16:17] Labs, Tool-Labs, labs-sprint-119: set jsub (jstart qcronsub) parameters by environment variable - https://phabricator.wikimedia.org/T64156#1774639 (coren)
[18:16:56] ebernhardson: I'm increasing your quota, moment
[18:17:07] YuviPanda: thanks!
[18:17:59] Labs, Patch-For-Review, labs-sprint-119: Labs replica DBs incorrectly classify sourceswiki as 'special' - https://phabricator.wikimedia.org/T91534#1774656 (coren) That fix was reverted in https://gerrit.wikimedia.org/r/#/c/227734/ because it didn't work properly in production - this needs investigation...
[18:18:23] Labs, Tool-Labs, labs-sprint-119: Make Flow database available / accessible on Labs/Tools - https://phabricator.wikimedia.org/T69397#1774661 (coren)
[18:19:34] Labs, Scrum-of-Scrums: Increase cpu and disk quota for the 'search' group - https://phabricator.wikimedia.org/T116292#1774663 (yuvipanda) Open>Resolved a:yuvipanda I've doubled your CPU and RAM quotas, but do be careful to not kill labs :)
[18:19:40] ebernhardson: ^
[18:22:25] YuviPanda: perfect, i'll do my best to not kill anything :)
[18:27:14] Labs, Labs-Infrastructure: Unable to connect both redundant labstores to the shelves in parallel - https://phabricator.wikimedia.org/T117453#1774707 (coren) NEW
[18:28:30] Labs, Labs-Infrastructure, operations: Unable to connect both redundant labstores to the shelves in parallel - https://phabricator.wikimedia.org/T117453#1774715 (yuvipanda)
[18:28:45] Labs, labs-sprint-118, labs-sprint-119: Document support levels for tools and labs projects - https://phabricator.wikimedia.org/T116598#1774719 (chasemp) https://wikitech.wikimedia.org/wiki/Labs_labs_labs/future
[18:47:32] any opinions on https://keybase.io/docs ?
[18:47:54] does wikimedia labs use outside key manager things ? or only internal ...
[18:49:32] darkblue_b: just internal (ldap)
[18:49:37] ok thx
[18:50:21] osgeo.org has an ldap too.. we havent integrated outside of it .. I am currently investigating new cert setup possibilities
[18:55:32] darkblue_b: We have ldap (managed by a mediawiki frontend) and gerrit has its own cert store
[18:56:23] what started this for me was the EFF/Mozilla announce of "Let's Encrypt" ..
rolling out "real soon now" after a year of work there
[18:56:38] they're all kind of unrelated though
[18:56:41] keybase is for GPG keys
[18:56:43] but I confirmed with an EFF tech that it is web site domain level only, not a general purpose cert
[18:56:45] ldap we use for ssh keys
[18:56:53] lets encrypt is ssl keys
[18:57:12] yes - I am slowly getting picture...
[18:57:21] I suppose you could use keybase to publish your ssh public key, but that's not generally very useful
[18:58:08] valhallasw`cloud: no they didn't support that when I last looked
[18:58:24] valhallasw`cloud: https://github.com/keybase/keybase-issues/issues/710
[18:58:38] I reviewed some PKI tutorials over the weekend.. the cert chain in x.509 requires a hierarchical chain, with a known Certificate Authority for interop.. signatures and keys are related products
[18:59:05] buy you can generate an ssl key without thinking about all of that..
[18:59:10] s/but/buy/
[18:59:17] YuviPanda: beh. Of course, one could just use github to get the keys
[18:59:48] err ssh key, typo
[18:59:53] valhallasw`cloud: yes :D https://github.com/keybase/keybase-issues/issues/710
[19:00:01] err
[19:00:04] https://github.com/yuvipanda/github-ssh-auth
[19:00:12] oh - looking
[19:04:24] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:04:28] * darkblue_b notes from me while reading YP's hack http://paste.debian.net/319332/
[19:14:00] ok - YP feedback on github-ssh-auth .. quite a long list of "drawbacks" ;-) I think a modern touch might be to checksum the source file and include that somewhere, since you are asking someone to add this to their security process
[19:14:16] also, in python the requests lib is nicer than urllib.request
[19:14:57] however security becomes its own endless pursuit, so I am not trying to get into that mode..
just learning myself though
[19:21:07] yeah but it's just me using this so it's ok :)
[19:21:26] also a checksum is not that useful in this case - you can just look at the code yourself instead for better security than actually trusting me
[19:21:38] and requests requires an external library install while this doesn't and is simple enough
[19:22:15] external - gotcha
[19:23:56] YuviPanda: so I should take the public key I generated for -labs, and add it to my Github account keys ?
[19:24:12] nononono
[19:24:18] all of this is completely unrelated to labs :)
[19:24:44] if you want to manipulate your labs keys, they're available in wikitech.wikimedia.org/wiki/Special:Preferences
[19:24:48] ah ok - please advise.. I have the key I login to bastion with .. on my main long-term private server now.. I havent done anything else yet
[19:25:11] what do you want to do?
[19:25:16] yes i believe I did that as part of the setup
[19:25:20] I assumed you were just randomly talking about keybase and what not :)
[19:25:25] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[19:26:13] hmm I would like to be a good -labs citizen because I want to make use of wikimedia search in an info system I am designing; b) I am investigating a new cert hierarchy for my .org, OSGeo
[19:26:44] .. so I tried bouncing this keybase idea off of this channel, since you are "active"
[19:27:22] ....huh?
[19:27:37] valhallasw`cloud: can I clarify something for you ?
[19:27:40] how is a geospatial project related to public key infrastructure?
[19:27:42] yeah, I understand just chatting about keybase :)
[19:28:00] and how does osgeo relate to wikimedia?
[19:29:01] there will be automated connections between VMs in the info system.. I would like to specify current best practices..
there is an initiative that is current regarding PKI and I want to support FOSS and best identity in general , and wikimedia infra specifically
[19:29:34] if that isnt clear I can write it out and send it .. happy to do it.. it is my current priority
[19:30:52] I'll admit to not being fully clear on what you want to do, darkblue_b
[19:31:06] ok - I will write it out .. thx
[19:35:15] !log bots cd /mnt/share/wikimedia-bot && sudo git pull (46b014c..a623eae)
[19:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Bots/SAL, Master
[19:36:51] !log bots addshore@wm-bot:/mnt/share/wm-bot$ sudo ./easydepl.sh
[19:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Bots/SAL, Master
[19:37:04] @ping
[19:37:13] I am running http://meta.wikimedia.org/wiki/WM-Bot version wikimedia bot v. 2.8.0.0 [libirc v. 1.0.3] my source code is licensed under GPL and located at https://github.com/benapetr/wikimedia-bot I will be very happy if you fix my bugs or implement new features
[19:37:13] @help
[19:37:26] I think that means I didnt break it...
[19:46:37] PROBLEM - Host tools-andrew-puppettest is DOWN: CRITICAL - Host Unreachable (10.68.21.109)
[19:46:42] YuviPanda: when you have a moment could you depool and drain these instances? https://etherpad.wikimedia.org/p/depool
[19:47:15] andrewbogott: ok if I do those after lunch? or do you want me to get a head start?
[19:47:28] YuviPanda: no rush
[19:47:31] andrewbogott: ok!
[19:50:48] !log tools drain webgrid-lighttpd-1408 of jobs
[19:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[19:50:59] andrewbogott: ^ that host is drained
[19:52:48] great
[19:53:24] !log drained continuous jobs and disabled queues on tools-exec-1203 and tools-exec-1402
[19:53:24] drained is not a valid project.
[19:53:29] !log tools drained continuous jobs and disabled queues on tools-exec-1203 and tools-exec-1402
[19:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[19:53:43] andrewbogott: these still have non-restartable tasks on them I'll see how they're in a couple of hours
[19:54:10] YuviPanda: sure, sounds good
[20:19:04] https://tools.wmflabs.org/widar/ is down
[20:19:38] YuviPanda: can you help^^
[20:19:52] ?
[20:20:11] multichill: ^^
[20:20:20] * multichill waves
[20:21:42] sjoerddebruin: ^^
[20:21:47] Yeah, I see.
[20:32:04] hey guys any idea why http://perf-testing.wmflabs.org/ might be throwing a 502?
[21:08:34] hey valhallasw`cloud, I have a plagiabot question if you're around
[21:09:01] fhocutt: hey! I'm around but working on a physics problem so might be a bit before I respond
[21:09:29] hi! that's all right
[21:09:57] I want to change the plagiabot API so it spits out more machine-readable data: https://github.com/valhallasw/plagiabot/blob/master/webservice/api.py
[21:10:20] I would really like to be able to test this on localhost, especially since there doesn't seem to be a dev branch on Labs or anything
[21:10:55] but it looks like I need to set up a fastcgi server on localhost, not just the normal Python app stuff
[21:11:03] is there a good way to do this?
[21:12:09] !log packaging moving instance ‘packager’ to labvirt1010 as a testcase for new hardware
[21:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Packaging/SAL, dummy
[21:12:25] and do you know if there are any plans to modify that module to not need fastcgi
[21:22:26] fhocutt: uuuh. not sure, eranroz wrote this
[21:22:44] I think flup might have a self-serving thing
[21:23:40] hm, no. Maybe gunicorn et al? http://flask.pocoo.org/docs/0.10/deploying/wsgi-standalone/
[21:24:01] basically, any wsgi server should be able to take the app() function and run it
[21:25:34] ok, thanks!
[21:33:16] anyone..?
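On the localhost-testing question above: the suggestion that any WSGI server can take the app() function and run it can be sketched with the standard library alone, no FastCGI involved. The `app` below is a hypothetical stand-in, not the real callable from webservice/api.py, and the host and port are arbitrary choices; treat this as a minimal sketch under those assumptions.

```python
import json
from wsgiref.simple_server import make_server
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """Hypothetical stand-in for the real WSGI callable (not the plagiabot code)."""
    body = json.dumps({"path": environ.get("PATH_INFO", "/")}).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_app(path="/"):
    """Invoke the app in-process, without opening a socket."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a minimal valid WSGI environ
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body


def serve(host="127.0.0.1", port=8000):
    """Serve the app on localhost until interrupted (assumed host and port)."""
    with make_server(host, port, app) as httpd:
        httpd.serve_forever()
```

Calling serve() and opening http://127.0.0.1:8000/ exercises the app in a browser; call_app() covers quick checks without a server. The wsgiref server is meant for development only.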
[21:34:21] jdlrobson: is it still registered in Special:NovaProxy?
[21:40:44] valhallasw`cloud: yeh http://perf-testing.mobile.eqiad.wmflabs:80
[21:40:54] (dns hostname perf-testing.wmflabs.org)
[21:41:35] valhallasw@tools-bastion-01:~$ telnet perf-testing.mobile.eqiad.wmflabs 80
[21:41:36] Trying 10.68.16.105...
[21:41:36] telnet: Unable to connect to remote host: Connection refused
[21:42:18] security group settings set too strictly maybe?
[21:42:19] I'm trying to restart the webservice on a tool, and I keep getting 'Timeout: could not start job'
[21:43:29] ragesoss: odd. The queues don't seem overloaded (YuviPanda did take some servers out of rotation). Which tool is this?
[21:44:48] valhallasw`cloud: it's set to default and web
[21:45:34] jdlrobson: uh, yes, of course, because I get connection refused rather than no response. Is the webserver running on port 80? ;-)
[21:45:58] (of course as in: I should have seen that)
[21:46:25] valhallasw`cloud: it's http://tools.wmflabs.org/wikiedudashboard/
[21:46:49] ragesoss: hmm. So the webservice seems up
[21:47:00] which suggests that it just took longer than expected for the webservice to get up
[21:47:03] eh
[21:47:04] no
[21:47:08] it's in qw state
[21:47:34] (-l h_vmem=4g,release=trusty) cannot run at host "tools-webgrid-lighttpd-1402.eqiad.wmflabs" because it offers only hc:h_vmem=3.142G) etc.
[21:47:36] YuviPanda!!!!
[21:47:52] we need lighttpd-1408 back
[21:48:15] and more lighttpd-14xx nodes
[21:48:46] Labs, Tool-Labs: Increase number of tools-webgrid-lighttpd-14xx nodes - https://phabricator.wikimedia.org/T117488#1775615 (valhallasw) NEW
[21:49:37] Labs, Tool-Labs: Re-enable tools-webgrid-lighttpd-1408 while there are no new nodes - https://phabricator.wikimedia.org/T117490#1775634 (valhallasw) NEW
[21:50:03] andrewbogott: when do the servers have to be drained?
[21:50:11] valhallasw`cloud: sorry i'm not sure i follow you.
I don't have much luck installing servers
[21:50:41] jdlrobson: I mean: is there actually a webserver running on that host? The 'connection refused' suggests it's maybe not started, or running on a different port (maybe 8080?)
[21:51:13] valhallasw`cloud: lighttpd-1408 is done moving, I think...
[21:51:16] valhallasw`cloud: 8080 gives me 301 Moved Permanently
[21:51:20] 80 connection refused
[21:51:32] so apache is probably messed up some how
[21:51:50] goddamn it
[21:51:53] this works so poorly
[21:52:01] jdlrobson: right. easiest fix is probably to change the proxy to use :8080 instead of :80
[21:52:49] valhallasw`cloud: sorry, not done yet, working on it still
[21:53:07] andrewbogott: hmkay
[21:53:11] sigh valhallasw`cloud you're right
[21:53:14] * jdlrobson feels stupid
[21:53:16] let me go hunt for abusive webservices then
[21:53:35] valhallasw`cloud: no real reason you can’t create a new node though if you want more.
[21:54:00] andrewbogott: except I'm going to bed soon (tm) where soon < time to create a new instance
[21:55:48] YuviPanda: HBA is broken :{
[21:58:51] was in lunch
[21:58:53] am back
[21:58:58] andrewbogott: should I repool 1408?
[21:59:12] YuviPanda: not yet
[21:59:39] YuviPanda: however please build two more 14xx hosts
[21:59:42] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin/new_exec_host
[21:59:47] valhallasw`cloud: ok!
[21:59:50] I shall
[21:59:52] do that now
[21:59:52] YuviPanda: ragesoss' webservice can't start
[22:00:11] and possibly others as well
[22:00:19] jdlrobson: https://wikitech.wikimedia.org/wiki/Help:MediaWiki-Vagrant_in_Labs says 'port 8080' :)
[22:00:22] valhallasw`cloud: eek ok
[22:03:29] can someone help me get access to mysql on deployment-eventlogging03.eqiad.wmflabs
[22:03:40] i suspect i might need sudo ..
[22:03:45] but right now i'm being denied access
[22:03:53] jdlrobson: try in wikimedia-releng?
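The port 80 versus 8080 diagnosis above rests on the difference between "connection refused" (the host answers, but nothing listens on that port) and no response at all (more typical of a firewall or security group dropping packets). A small probe can make that distinction explicit. This is a sketch only; the function name and return labels are invented for illustration.

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify a TCP connection attempt as open, refused, timeout, or other error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host replied with a reset: reachable, but no listener on this port.
        return "refused"
    except socket.timeout:
        # No reply within the timeout: often filtered traffic or a host that is down.
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar.
        return f"error: {exc}"
```

Under these assumptions, probing the instance from the log on ports 80 and 8080 would be expected to give "refused" and "open" respectively, matching what telnet showed.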
[22:04:01] typically they manage beta cluster
[22:04:10] chasemp: will do :)
[22:04:35] !log tools created tools-webgrid-lighttpd-1412 and 1413
[22:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[22:05:15] anyway, bedtime
[22:07:42] valhallasw`cloud: ugh, this new puppet run is going to take forever
[22:07:52] I know ;(
[22:07:58] fastapt!
[22:08:02] for next time I guess
[22:08:05] doesn't help here
[22:08:10] because it needs to download and install packages
[22:08:16] would still be faster I guess?
[22:08:18] no?
[22:08:18] caching packages in aptly might actually help
[22:08:35] it'll save you 50 seconds on half an hour or so :-p
[22:09:08] :P
[22:09:10] yeah
[22:09:21] valhallasw`cloud: I guess I should create some more
[22:09:24] and not just two
[22:09:32] andrewbogott: how're we looking on total instance capacity?
[22:09:37] I was thinking of adding a couple more larges
[22:09:49] YuviPanda: That should be fine
[22:09:54] andrewbogott: ok!
[22:10:04] I just added 1005 to the pool and will with luck add 1010 as well in a few days
[22:10:10] nice!
[22:10:56] !log tools created tools-webgrid-lighttpd-1414 and 1415
[22:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[22:24:43] YuviPanda: I take it you haven't finished the fix that will let my webservice start?
[22:24:58] unfortunately. still running puppet
[22:25:38] k. minutes? hours?
[22:27:28] ragesoss: 10-15mins maybe?
[22:27:45] thanks much YuviPanda.
[22:28:12] I'm breaking all kinds of things today.
[22:35:20] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:37:05] Coren: andrewbogott can you attempt to ssh into tools-webgrid-lighttpd-1412.eqiad.wmflabs
[22:37:10] I created that instance but seems stuck?
[22:37:28] * Coren tries.
[22:38:20] Doesn't look like it booted far enough to have its nic configured.
[22:38:39] Hm. No, it did.
[22:38:42] How odd.
[22:39:51] Ah; userland stuck - sshd accepts() but then stalls.
[22:40:52] in (possibly related) news, I’m trying to rsync stuff to labvirt1005 and getting "failed verification -- update discarded (will try again)"
[22:40:56] which is… worrying.
[22:41:01] So I’m going to depool it and do some more tests.
[22:41:07] kk
[22:41:17] * Coren hmms.
[22:41:24] YuviPanda, Coren, possibly related since 1412 is also on 1005
[22:41:39] andrewbogott: ah I see
[22:41:42] ok I'll back off that
[22:41:47] Although I would expect “hard drive sometimes replaces ones with zeros” to produce a more dramatic failure
[22:41:47] * YuviPanda continues pooling 1413
[22:41:52] ouch
[22:41:58] andrewbogott: I suppose it's possible, but clearly the vm is running.
[22:44:31] Is there some standard tool that just writes a bunch of files and then checks to see if the files are actually what it wrote?
[22:44:40] Because otherwise I can’t think why rsync would be failing like this
[22:48:28] YuviPanda: for aforementioned reasons I can’t migrate tools-webgrid-lighttpd-1408. So I guess you can re-pool it if it’s urgent.
[22:49:28] * andrewbogott is out for a bit
[22:49:36] andrewbogott: ok I'll repool it now
[22:49:52] andrewbogott: well, I'll give the other nodes a few mins and repool if urgent
[22:50:05] yeah, sounds good
[22:57:54] !log tools pooled tools-webgrid-lighttpd-1413
[22:57:57] ragesoss: try now?
[22:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[22:58:16] seems to have worked, YuviPanda
[22:58:33] ragesoss: \o/ cool
[22:59:02] and holy shit, the code I wrote in the meantime without being able to test it seems to be correct!
[23:00:03] yay!
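The question above about a tool that writes a bunch of files and then checks that they contain what was written can be approximated in a few lines. This is a rough sketch, not a replacement for a proper disk test: the function name, file naming, and sizes are made up for illustration, and because the read-back may be served from the page cache rather than the disk, a clean result does not prove the hardware is healthy.

```python
import hashlib
import os
import tempfile


def write_and_verify(directory, count=8, size=1 << 20):
    """Write `count` files of random bytes, then re-read them and compare SHA-256 digests.

    Returns the list of paths whose contents no longer match what was written.
    """
    expected = {}
    for i in range(count):
        data = os.urandom(size)
        path = os.path.join(directory, f"verify-{i:04d}.bin")
        with open(path, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())  # ask the OS to push the data to the device
        expected[path] = hashlib.sha256(data).hexdigest()

    mismatches = []
    for path, digest in expected.items():
        with open(path, "rb") as fh:
            actual = hashlib.sha256(fh.read()).hexdigest()
        if actual != digest:
            mismatches.append(path)
    return mismatches
```

Pointing `directory` at a scratch directory on the suspect volume and using a total size larger than RAM makes it more likely that the reads actually hit the disk.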
[23:00:04] well done :)
[23:38:50] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[23:58:49] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1408 is OK: OK: Less than 1.00% above the threshold [0.0]