[01:01:44] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds
[01:15:56] Labs is very slow right now
[01:15:57] sometimes unresponsive
[01:15:57] I tried to restart my web service but I got this
[01:15:57] tools.magog@tools-bastion-01:~$ webservice2 restart
[01:15:57] Restarting
[01:15:58] Traceback (most recent call last):
[01:15:58] File "/usr/local/bin/webservice2", line 274, in <module>
[01:15:58] main()
[01:15:58] File "/usr/local/bin/webservice2", line 258, in main
[01:15:58] stop_job(job)
[01:15:59] File "/usr/local/bin/webservice2", line 69, in stop_job
[01:15:59] subprocess.check_call(command, stdout=open(os.devnull, 'wb'))
[01:15:59] File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
[01:15:59] raise CalledProcessError(retcode, cmd)
[01:16:00] subprocess.CalledProcessError: Command '['qdel', '118566']' returned non-zero exit status 1
[01:16:24] Yup Bastion is hosed
[01:16:40] whoami took several seconds :)
[01:20:42] * Magog_the_Ogre hopes he didn't cause it to fail again
[01:27:27] legoktm, are you a sysadmin?
[01:29:07] yurik
[01:29:11] Labs, Maps, Scrum-of-Scrums, Blocked-on-Operations: Upgrade postgres on labsdb1004 / 1005 to 9.4, and PostGis 2.1 - https://phabricator.wikimedia.org/T101233#1343718 (Yurik)
[01:29:24] Magog_the_Ogre, hi
[01:29:30] PROBLEM - Puppet failure on tools-master is CRITICAL 55.56% of data above the critical threshold [0.0]
[01:29:40] hello. Labs seems to be unresponsive
[01:29:50] oh NOW the bot informs us!
[01:30:07] YuviPanda, ^ ?
[01:30:36] I'd love to sit through it, but I gtg
[01:35:01] seems like no admins are around, sorry
[01:35:31] Magog_the_Ogre: I'm not
[01:48:58] sigh...
[01:50:23] legoktm, should we text one of them or something?
[01:50:56] gj nfs: https://ganglia.wikimedia.org/latest/?c=Labs%20NFS%20cluster%20codfw&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2
[01:51:17] oops that's codfw, but still: https://ganglia.wikimedia.org/latest/?c=Labs%20NFS%20cluster%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2
[01:51:32] labstore1001, load 989%
[01:52:04] that's high, right? :)
[01:52:10] right :)
[01:52:30] PROBLEM - High load average on labstore1001 is CRITICAL 100.00% of data above the critical threshold [24.0]
[01:52:48] I don't actually know how much ops get notified of
[01:52:53] * yurik decides against editing a small text file to prevent server overload...
[01:53:22] the last percent that broke elephant's back...
[01:53:43] rip toollabs
[01:54:30] RECOVERY - Puppet failure on tools-master is OK Less than 1.00% above the threshold [0.0]
[02:00:19] pinging them on hangouts doesn't appear to have done much good
[02:05:30] PROBLEM - Puppet failure on tools-master is CRITICAL 30.00% of data above the critical threshold [0.0]
[02:07:20] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1401 is CRITICAL 55.56% of data above the critical threshold [0.0]
[02:07:20] PROBLEM - Puppet failure on tools-submit is CRITICAL 80.00% of data above the critical threshold [0.0]
[02:13:18] ops notified
[02:13:35] Magog_the_Ogre, phe, yurik: ^
[02:19:24] Krenair: so, what’s happening?
[02:19:54] people mentioned that labs is super slow, others thought it was dead
[02:20:09] couldn't kill jobs, connecting is very slow, etc.
[02:20:35] I checked ganglia and it looks like nfs on labstore1001 is at 979% load which can't be good: https://ganglia.wikimedia.org/latest/?c=Labs%20NFS%20cluster%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2
[02:20:55] wow!
[02:22:06] It’s times like these that I wish ganglia wasn’t broken for the labs cluster
[02:22:34] it's broken..?
[02:24:59] It doesn’t show any of the virt hosts, they vanished a couple of weeks ago
[02:24:59] ah
[02:24:59] I and everyone I’ve asked have failed to figure out why
[02:25:00] sorry but I have to ask - is there a task for it? :p
[02:25:00] there is
[02:25:00] https://phabricator.wikimedia.org/T101043
[02:25:02] did you get any response from Coren when you texted?
[02:25:09] nope
[02:26:51] Labs, Labs-Infrastructure, Monitoring: ganglia for virt cluster shows all the wrong things - https://phabricator.wikimedia.org/T101043#1343732 (Krenair)
[02:26:58] labstore1001 looks like it’s doing a lot of meta-work, mdXXX_raid5
[02:27:00] and such
[02:27:07] but I don’t know if that’s the cause
[02:27:11] it’s maxed for IO as well
[02:27:15] so could just be a misbehaving client
[02:29:48] Labs, Labs-Infrastructure, Monitoring: ganglia for virt cluster shows all the wrong things - https://phabricator.wikimedia.org/T101043#1343734 (Legoktm) p:Triage>High
[02:35:32] RECOVERY - Puppet failure on tools-master is OK Less than 1.00% above the threshold [0.0]
[02:36:13] of course all the IO is attributed to ‘kworker’
[02:36:16] becuase that’s helpful
[02:36:20] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1401 is OK Less than 1.00% above the threshold [0.0]
[02:46:40] Nice, all file systems broken
[02:46:40] Just the nfs ones :)
[02:46:41] Feel free to improve the topic
[02:46:41] PROBLEM - Puppet failure on tools-exec-cyberbot is CRITICAL 44.44% of data above the critical threshold [0.0]
[02:46:57] Wikimedia Labs | Status: NFS outage in progress | tools-dev has new OS and fingerprint: https://lists.wikimedia.org/pipermail/labs-announce/2015-April/000009.html | https://www.mediawiki.org/wiki/Wikimedia_Labs | Channel logs: https://bit.ly/11GZvbS | Admin log: http://bit.ly/ROfuY5.
[02:47:00] oops
[02:48:05] andrewbogott: Fuck NFS and good luck solving it
[02:48:13] thanks :(
[02:50:40] PROBLEM - Puppet failure on tools-exec-wmt is CRITICAL 22.22% of data above the critical threshold [0.0]
[02:52:25] Krenair: did you… do something? That ganglia issue that I was just complaining about is suddenly fixed
[02:52:39] nope
[02:52:45] huh
[02:53:28] I'm just a deployer in production, I can't touch anything like that - at least not as far as I know
[02:53:51] I know :) It just had suspicious timing. Ganglia is capricious.
[02:54:39] Labs, Labs-Infrastructure, Monitoring: ganglia for virt cluster shows all the wrong things - https://phabricator.wikimedia.org/T101043#1343748 (Andrew) Right at this moment it is working properly again. As far as I know no one did anything.
[02:54:40] I mean, unless ganglia stores it's data in mysql or something. but even then...
[02:54:59] I think it must be oscillating. Split brain of some sort
[02:57:29] Anyway, all that that has taught me is that, yeah, everything is fine except for NFS
[03:03:57] * Magog_the_Ogre gets back a few hours later. Still broken. :-/
[03:06:19] RECOVERY - Puppet failure on tools-exec-cyberbot is OK Less than 1.00% above the threshold [0.0]
[05:13:02] * Magog_the_Ogre wonders if Labs is experiencing thread death
[05:13:04] PROBLEM - SSH on tools-submit is CRITICAL - Socket timeout after 10 seconds
[05:13:07] PROBLEM - Puppet failure on tools-exec-1207 is CRITICAL 40.00% of data above the critical threshold [0.0]
[05:13:23] In ongoing evidence of the Curse of Labs, the network switch that our NFS server connects to is failing. Faidon is working on it now.
[05:16:30] I'm unable to ssh into my tool labs account, is tool labs down?
[05:16:43] yes
[05:16:46] "Unable to create and initialize directory '/home/cosmiclattes'."
[05:19:59] https://wikitech.wikimedia.org/wiki/Incident_documentation/20150607-LabsNFS-Outage
[05:19:59] RECOVERY - SSH on tools-submit is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[05:20:00] wow so many new bots here since 2013
[05:21:01] It seems to be working for me now
[05:21:03] thanks
[05:21:17] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 491201 bytes in 2.756 second response time
[05:21:56] Yes, should be coming back online now.
[05:22:13] Wikimedia Labs | Status: bouncing back from NFS outage | tools-dev has new OS and fingerprint: https://lists.wikimedia.org/pipermail/labs-announce/2015-April/000009.html | https://www.mediawiki.org/wiki/Wikimedia_Labs | Channel logs: https://bit.ly/11GZvbS | Admin log: http://bit.ly/ROfuY5.
[05:22:49] labstore1001 load = 265% though
[05:24:14] <100% now
[05:32:19] andrewbogott: fyi I don't think you have set the channel topic to "...Status: bouncing back from NFS outage..."
[05:32:35] oh, thanks
[05:32:39] that’s twice tonight i’ve made that mistake
[05:32:47] lol
[05:34:44] RECOVERY - Puppet failure on tools-exec-1207 is OK Less than 1.00% above the threshold [0.0]
[05:47:11] RECOVERY - Puppet failure on tools-submit is OK Less than 1.00% above the threshold [0.0]
[05:50:35] RECOVERY - Puppet failure on tools-exec-wmt is OK Less than 1.00% above the threshold [0.0]
[07:42:54] Labs, database: Rebuild s6 and s7 on labsdb1002 - https://phabricator.wikimedia.org/T101567#1343801 (Springle) I guess this was labsdb1002? I did indeed do a similar fix to get s6 going, and then sinned slightly by setting slave_exec_mode=idempotent to keep it alive for the weekend. The mysqld process res...
[09:46:47] YuviPanda: going to look at webservice now
[09:49:55] Tool-Labs-tools-Other: Fix tool kmlexport - https://phabricator.wikimedia.org/T92963#1343857 (valhallasw) >>! In T92963#1343636, @Teslaton wrote: > Yes, even in times when tool was operating, there seemed to be performance > problems, caused both by depth ("unlimited" cat scan was possible) and > width (some...
[09:50:37] YuviPanda: oh, I see I am too late
[09:53:44] crap, the outage killed all my jobs
[10:14:29] valhallasw: still look at it!
[10:14:36] YuviPanda: I did!
[10:14:48] but grrrit-wm seems dead
[10:18:19] !log tools.lolrrit-wm ran qmod -rj lolrrit-wm
[10:18:22] Logged the message, Master
[10:22:31] wow, looks like I slept through a labs outage for once
[10:23:46] YuviPanda: hm, shouldn't labs-l be subscribed to labs-announce?
[10:24:15] or are we hoping everyone always remembers to cc?
[10:24:44] valhallasw: it is
[10:24:50] it just got stuck in moderation
[10:24:52] i cleared it
[10:25:01] YuviPanda: LOL :D
[10:25:06] ok, that explains
[10:25:57] valhallasw: got a moment for PM?
[10:26:02] ya
[10:50:25] Tool-Labs: rev_len should be available also for deleted revisions in database replicas - https://phabricator.wikimedia.org/T101631#1343888 (Blahma) NEW
[10:55:18] YuviPanda: I thought extra_args would be a list, but maybe you're right
[10:55:49] valhallasw: it's used as a string now. naming it extra_args made me feel pretty bad but can't think of a better generic name
[10:55:54] ok
[10:56:01] string_param
[10:56:03] or something :p
[10:56:04] dunno
[10:56:44] as for the type=stop/start checking: yes, that happened in the super call, but the super call could add a third option at some point
[10:59:34] valhallasw: hmm, true
[10:59:40] explicit better than implicit, etc, I guess
[10:59:44] I'll make that change
[10:59:50] and also rename the thing to toollabs-webservice
[10:59:56] \o/
[11:51:38] Tool-Labs, Labs-Sprint-100, Patch-For-Review, ToolLabs-Goals-Q4: Support for truly 'generic' webservices via service manifests - https://phabricator.wikimedia.org/T97230#1343961 (yuvipanda) Open>Resolved a:yuvipanda
[12:09:39] Labs, Labs-Infrastructure, database: rev_len should be available also for deleted revisions in database replicas - https://phabricator.wikimedia.org/T101631#1343970 (Krenair)
[12:25:14] petan|lyon, ping
[12:27:29] YuviPanda, ping
[12:28:55] Nemo_bis, ping
[12:30:28] !ask
[12:30:28] Hi, how can we help you? Just ask your question.
[12:30:33] Cyberpower678: ^
[12:30:53] YuviPanda, I ping because I never know if the person is there or not.
[12:31:04] YuviPanda, can you do me a favor?
[12:31:07] you should !ask because you never know who may or may not help you
[12:31:25] *be able to
[12:31:33] !ask | Cyberpower678
[12:31:34] Cyberpower678: Hi, how can we help you? Just ask your question.
[12:31:42] If possible, can you restart all of my bot tasks for cyberbot?
[12:31:48] you can do that as well, no?
[12:32:02] My internet is too crappy.
[12:32:17] I don't have my ssh keys on this machine, so can't. sorry!
[12:33:00] Well thanks anyways, I guess I'll do it later. I just deployed a framework update to add the rawcontinue parameter to list queries.
[12:34:00] no deployments with crappy internet :)
[12:34:49] Cyberpower678: see, that's why you should just ask instead of pinging people :-p I /can/ do that ;-) should I just qmod -rj them? (i.e. restart with same parameters)
[12:35:08] alternatively, wait until you have internet, so you can debug issues that arise
[12:35:12] see ^? :)
[12:46:54] Labs, Tool-Labs, Labs-Sprint-100, ToolLabs-Goals-Q4: Move toollabs instances around to minimize damage from a single downed virt* host - https://phabricator.wikimedia.org/T91072#1343988 (yuvipanda) I've a WIP running script, it's identified that tools-master and tools-shadow are in the same host (la...
[12:48:24] Labs, Tool-Labs, Labs-Sprint-100, ToolLabs-Goals-Q4: Write an icinga check to ensure that toollabs instances are appropriately distributed across labvirt** hosts - https://phabricator.wikimedia.org/T101635#1343991 (yuvipanda) NEW a:yuvipanda
[12:48:47] Labs, Tool-Labs, Labs-Sprint-100, ToolLabs-Goals-Q4: Move toollabs instances around to minimize damage from a single downed virt* host - https://phabricator.wikimedia.org/T91072#1343999 (yuvipanda) Split out the icinga check to T101635
[12:49:31] Labs, Tool-Labs, ToolLabs-Goals-Q4, Tracking: Move tools-shadow away from labvirt1004 - https://phabricator.wikimedia.org/T101636#1344001 (yuvipanda) NEW
[12:50:05] Labs, Tool-Labs, ToolLabs-Goals-Q4: Move toollabs instances around to minimize damage from a single downed virt* host - https://phabricator.wikimedia.org/T91072#1344007 (yuvipanda)
[12:59:08] Labs, database: Rebuild s6 and s7 on labsdb1002 - https://phabricator.wikimedia.org/T101567#1344014 (jcrespo) p:Low>Normal Yes, I saw MySQL had restarted, but with a "normal shutdown", plus machine uptime not affected, so I assumed lab people were doing some regular maintenance and just hadn't restar...
[13:41:46] Labs, Datasets-General-or-Unknown, Labs-Infrastructure, Wikidata, and 2 others: Add Wikidata json dumps to labs in /public/dumps - https://phabricator.wikimedia.org/T100885#1344066 (Hydriz)
[14:08:24] Labs, Labs-Infrastructure, database: rev_len should be available also for deleted revisions in database replicas - https://phabricator.wikimedia.org/T101631#1344107 (jcrespo) I don't think this information is public: I cannot see the page size of a deleted revision as a regular user e.g. https://en.wik...
[14:09:11] Labs, Labs-Infrastructure, database: rev_len should be available also for deleted revisions in database replicas - https://phabricator.wikimedia.org/T101631#1344109 (Krenair) This is referring to deleted revisions, rather than archived revisions of deleted pages.
[14:13:00] Labs, Labs-Infrastructure, database: rev_len should be available also for deleted revisions in database replicas - https://phabricator.wikimedia.org/T101631#1344121 (Krenair) For example https://en.wikipedia.org/w/index.php?title=Warlingham_School&offset=20100527182211&action=history - those entries fr...
[14:40:37] YuviPanda: I'm getting sudo: ldap_start_tls_s(): Connect error on toolsbeta-puppetmaster3 when I try to sudo. Is this a remainder from the outage last week?
[14:40:52] seems to be ok on tools-bastion-01
[14:41:05] valhallasw: hmm, possibly. some instances that were self hosted puppetmasters and had puppet failing were broken. let me look
[14:42:30] !log toolsbeta run sudo sed -i 's/GlobalSign_CA.pem/ca-certificates.crt/' /etc/ldap/ldap.conf on toolsbeta-puppetmaster3 to fix broken LDAP TLS config
[14:42:32] valhallasw: ^ fixed
[14:42:33] Logged the message, Master
[14:42:54] thanks
[14:43:50] YuviPanda: I guess I should do a git pull on puppetmaster to make sure the rest of the hosts get the memo?
[14:44:09] valhallasw: yeah
[14:44:30] !log toolsbeta updated /var/lib/git/operations/puppet to make sure the other hosts get the memo
[14:44:32] Logged the message, Master
[14:47:09] Labs, Labs-Infrastructure, database: rev_len should be available also for deleted revisions in database replicas - https://phabricator.wikimedia.org/T101631#1344185 (jcrespo) Sorry for the misunderstanding, I can confirm that those are not filtered on source: https://git.wikimedia.org/blob/operations%2...
[14:56:34] Tool-Labs: Add python3 equivalent packages for existing python packages - https://phabricator.wikimedia.org/T101646#1344187 (valhallasw) NEW
[14:59:09] Tool-Labs: Add python3 equivalent packages for existing python packages - https://phabricator.wikimedia.org/T101646#1344196 (yuvipanda) For cases they exist on trusty - we shouldn't be building them unless strictly necessary. IMO we should move away from installing packages systemwide, and start encouraging...
[14:59:20] Tool-Labs: Add python3 equivalent packages for existing python packages - https://phabricator.wikimedia.org/T101646#1344197 (yuvipanda) (or user space pip)
[15:08:51] Labs, Labs-Infrastructure, ToolLabs-Goals-Q4: Limit NFS bandwith per-instance - https://phabricator.wikimedia.org/T98048#1344203 (yuvipanda) @Coren @bblack is anyone working on this now? should it still be UBN!?
[16:03:00] YuviPanda: I think we should install whatever is available over apt. having a rich development environment is a good idea (tm)
[16:03:27] I'm trying figure out how to do this nicely with defines
[16:04:06] the problem is that a lot of them don't exist in upstream repositories, even for trusty, so you're left manually looking for them.
[16:04:13] ?
[16:04:31] a lot of what?
[16:04:32] python3- packages
[16:04:32] right
[16:04:39] I'm not opposed to installing them
[16:04:49] as long as we aren't building them ourselves
[16:04:52] :)
[16:04:53] it's not that bad; over half of the python-* packages have a python3- equivalent
[16:05:27] but I want a cleaner way to list them in puppet
[16:05:35] same for the general precise vs trusty issue
[16:05:43] YuviPanda: any idae why my email to labs-l went to moderation?
[16:06:01] andrewbogott: uh, it did? labs-announce to labs-l now seems to go into moderation
[16:06:26] I thought you just said in the backscroll...
[16:06:27] * andrewbogott looks
[16:06:41] andrewbogott: no idea why, though :|
[16:07:05] Ah, I see, my email to labs-announce went through but the forward got stuck
[16:07:30] It’s a little late to ask this, but… do we really need two lists? Is labs-l too high-traffic for folks to read?
[16:07:50] also, our TOU say that labs-l is the official channel (which I guess is OK as long as there is forwarding)
[16:07:52] andrewbogott: apparently. enough people asked for labs-announce for us to create it.
[16:08:07] *shrug* ok
[16:08:10] we need a better solution anyway, at least people asked for that in the meeting
[16:08:15] andrewbogott: thanks for taking care of the NFS issues!
[16:08:29] andrewbogott: I set my phone to wake me only when called, so Krenair's ping didn't wake me up
[16:08:36] (not that it would've otherwise - I sleep fairly sound)
[16:08:42] Ah, ok. I think my main lesson from last night is “call, don’t text”
[16:09:06] andrewbogott: :)
[16:09:08] anyway, as long as things are working today I am going to go back to Sunday mode
[16:09:12] andrewbogott: you should!
[16:09:16] I should too, actually
[16:09:22] you in Scottland?
[16:09:34] Um… Scotland? sp?
[16:10:36] YuviPanda, yeah, I was trying to get the attention of people already awake :p
[16:10:57] andrewbogott: yes, scotland
[16:11:18] YuviPanda: I figured. No real tradition of central heat in the UK as I understand it.
[16:11:31] andrewbogott: possibly, yeah.
[16:12:42] I did think you guys should have had automated alerts (from icinga?) though
[16:13:17] Krenair: yup, do have. that just went to email tho
[16:14:17] nothing to texts? This is presumably all in the private ops repo so I have very little idea of what the actual details are
[16:15:26] Krenair: some things go to text, but nothing that went on yesterday night...
[16:15:27] YuviPanda: aaand I have decided to give up on it. There's no way I can write a puppet define /and/ actually check what it does in any reasonable way it seems
[16:15:40] valhallasw: a puppet define to install python and python3?
[16:16:04] YuviPanda: on for precise vs trusty packaging, one for python{2,3} {trusty,precise} packages, yeah
[16:16:07] YuviPanda, right, sounds like that should be an action item then :)
[16:16:28] NFS load above a few hundred percent = probably bad
[16:16:57] YuviPanda: basically to get an interface like python-package { 'ipaddr': py2 => true, py3 => ['trusty'] }
[16:17:04] Labs, operations: Make Labs NFS alerts paging - https://phabricator.wikimedia.org/T101650#1344262 (yuvipanda) NEW
[16:17:16] Krenair: ^
[16:17:21] :)
[16:17:32] YuviPanda: but I just realized I have no clue how to actually test whatever I write
[16:18:15] other than 'appy and see if anything dies', which doesn't sound like the best plan :-p
[16:18:35] toolsbeta!
[16:18:38] let me file a task for that
[16:19:29] Tool-Labs: Set up toolsbeta more fully to help make testing easier - https://phabricator.wikimedia.org/T101651#1344271 (yuvipanda) NEW
[16:19:31] valhallasw: ^
[16:20:00] YuviPanda: there should just be a simple way to say 'hey, tell me what the resulting manifest is if I give you X Y and Z' :(
[16:20:12] might be possible if I write a ruby module, I guess
[16:20:16] possibly.
[16:20:52] maybe I should try what rspec-puppet does
[16:48:26] Quarry: SQL String functions not working - https://phabricator.wikimedia.org/T100057#1344288 (yuvipanda) @halfak @milimetric any SQL help?
[16:48:42] Quarry: SQL String functions not working - https://phabricator.wikimedia.org/T100057#1344290 (yuvipanda) a:yuvipanda>None
[16:48:56] Labs, Shinken: shinken has many warnings (?) about "UNKNOWN: execution of the check script exited with exception list index out of range" - https://phabricator.wikimedia.org/T95161#1344291 (yuvipanda) a:yuvipanda>None
[16:49:07] Labs, Shinken: shinken has many warnings (?) about "UNKNOWN: execution of the check script exited with exception list index out of range" - https://phabricator.wikimedia.org/T95161#1181902 (yuvipanda) p:Triage>Normal
[16:49:43] Labs, Labs-Infrastructure: Labs webservice2 not working - https://phabricator.wikimedia.org/T101132#1344296 (yuvipanda) Open>Resolved Was resolved earlier.
[16:52:19] Tool-Labs: Create a set of 'template' tools in various languages with deploy scripts for toollabs - https://phabricator.wikimedia.org/T91059#1344307 (yuvipanda) a:yuvipanda>None
[16:52:36] Tool-Labs: Make http (404, 302, 301 etc) statistics for toolserver.org - https://phabricator.wikimedia.org/T85167#1344309 (yuvipanda) a:yuvipanda>None
[16:52:53] Tool-Labs: Monitor mail system in Graphite - https://phabricator.wikimedia.org/T71072#1344311 (yuvipanda) a:yuvipanda>None
[16:53:33] Tool-Labs: Provide page view metrics for individual tools on toollabs - https://phabricator.wikimedia.org/T87001#1344313 (yuvipanda) a:yuvipanda>None
[16:53:54] Labs, Patch-For-Review: Replace custom ec2id fact with facter's ec2 - https://phabricator.wikimedia.org/T86297#1344317 (yuvipanda) a:yuvipanda>None
[16:54:02] Tool-Labs: Useful graphite metrics to be tracked for Tool labs (tracking) - https://phabricator.wikimedia.org/T69879#1344320 (yuvipanda) a:yuvipanda>None
[16:55:28] Quarry: SQL String functions not working - https://phabricator.wikimedia.org/T100057#1344324 (Halfak) This is because the fields are stored as raw bytes in a VARBINARY field. You need to cast the field to an encoding before it MySQL knows that the byte for "t" corresponds to the byte for "T". See http://...
[16:56:06] Quarry: SQL String functions not working - https://phabricator.wikimedia.org/T100057#1344327 (Halfak) Open>Invalid
[16:56:43] LabsDB-Auditor: Generate simple HTML interface to view reports generated by labsdb-auditor - https://phabricator.wikimedia.org/T78723#1344328 (yuvipanda) a:yuvipanda>None
[16:56:49] Tool-Labs: Track labsdb stats on Labs Graphite - https://phabricator.wikimedia.org/T69884#1344329 (yuvipanda) a:yuvipanda>None
[16:56:54] Tool-Labs: Track gridengine stats on Graphite - https://phabricator.wikimedia.org/T69881#1344330 (yuvipanda) a:yuvipanda>None
[16:56:59] Tool-Labs: Track 5xx error stats on Graphite - https://phabricator.wikimedia.org/T69880#1344331 (yuvipanda) a:yuvipanda>None
[16:57:20] Tool-Labs, HHVM, Patch-For-Review: Make DynamicProxy be able to proxy back to non-http protocols - https://phabricator.wikimedia.org/T84983#1344334 (yuvipanda) a:yuvipanda>None
[16:59:24] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL - Socket timeout after 10 seconds
[17:02:13] valhallasw: ugh ^
[17:03:21] YuviPanda: {{worksforme}}?
[17:03:35] valhallasw: it was down for a couple of minutes...
[17:03:39] :/
[17:03:43] because someone is running a mongodb instance on NFS.
[17:03:54] and that went haywire and killed NFS
[17:04:02] wat
[17:04:17] * valhallasw hands YuviPanda a Picard
[17:04:17] not on toollabs
[17:04:19] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 792588 bytes in 3.644 second response time
[17:04:26] also someone is hammering NFS from tools-exec-1205
[17:05:11] !log tools killed sort /data/project/templatetiger/public_html/dumps/ruwiki-2015-03-24.txt -k4,4 -k2,2 -k3,3n -k5,5n -t? -o /data/project/templatetiger/public_html/dumps/sort/ruwiki-2015-03-24.txt -T /data/project/templatetiger to rescue NFS
[17:05:14] Logged the message, Master
[17:06:43] valhallasw: I'm suspending all of templatetiger's jobs
[17:06:52] YuviPanda: +1
[17:07:45] valhallasw: can you cluebat the maintainers?
[17:07:46] I have to go now...
[17:07:51] err, I think so
[17:08:20] valhallasw: no cronjobs tho
[17:08:21] thanks
[17:08:23] ok
[17:10:36] YuviPanda: the other jobs were also sorts?
[17:10:41] valhallasw: yes
[17:12:08] YuviPanda: sent
[17:12:25] valhallasw: thanks!
[17:13:13] Labs, Labs-Infrastructure, ToolLabs-Goals-Q4: Limit NFS bandwith per-instance - https://phabricator.wikimedia.org/T98048#1344364 (yuvipanda) Tool Labs had a mini outage (couple of minutes) again today because an instance was hammering on NFS enough to kill it completely.
[17:14:33] Labs: Horizon dashboard for managing instance puppet config - https://phabricator.wikimedia.org/T91990#1344369 (yuvipanda) a:yuvipanda>None
[17:14:53] Quarry: Quarry does not respect ORDER BY sort order in result set - https://phabricator.wikimedia.org/T87829#1344375 (yuvipanda) a:yuvipanda>None
[17:15:44] Labs, Tracking: New Labs project requests (Tracking) - https://phabricator.wikimedia.org/T76375#1344380 (yuvipanda)
[17:17:14] Labs, Tracking: Create labs project for Reading department - https://phabricator.wikimedia.org/T101325#1344381 (yuvipanda) I think projects should be more specific and not be team based - that's something we've been trying to move away from in the past so we don't end up with projects with vague / no owner...
[17:19:43] YuviPanda: Do you know of any known issues with local disk and/or NFS that would cause CI jobs to idle in the middle of an MySQL operation for 20 minutes doing nothing?
[17:28:16] Krinkle: 19:02 because someone is running a mongodb instance on NFS.
[17:34:28] I'm still curious why that effects Ci
[17:34:35] almost everything is in local disk or ram/tmpfs
[17:34:45] I guess ssh keys are not?
[17:34:55] but the connection was already established
[17:40:20] YuviPanda: Would it be possible for each project to have its own somewhat virtual NFS server?
[17:40:49] then only things from /public would affect non-tool labs projects
[17:42:59] anyway, looks like those tasks track it already. Good to see the limitsin place :)
[18:09:45] Krinkle: not really (NFS server per project), although the limit nfs bandwidth per-instance would make the problem a lot more manageable
[18:09:51] but solution is to not use NFS :)
[19:01:00] Labs, WMF-Legal: Discussion: can I park WikiSpy under a separate, simpler domain? - https://phabricator.wikimedia.org/T97846#1344426 (yuvipanda) a:yuvipanda>None
[19:01:07] Labs, WMF-Legal: Request to review privacy policy and rules - https://phabricator.wikimedia.org/T97844#1344427 (yuvipanda) a:yuvipanda>None
[19:01:30] Labs, Tool-Labs: Document labsdb replication set up - https://phabricator.wikimedia.org/T85868#1344428 (yuvipanda) a:yuvipanda>None
[19:04:03] Tool-Labs: Get rid of tools-trusty bastion - https://phabricator.wikimedia.org/T101094#1344435 (yuvipanda) Scheduled for June 10th 2015, Wednesday.
[19:05:17] Tool-Labs, Labs-Sprint-101: Get rid of tools-trusty bastion - https://phabricator.wikimedia.org/T101094#1344436 (yuvipanda)
[19:05:41] YuviPanda: the rspec-puppet thing is quite nice once it works
[19:05:48] I should write a blog post on it ;p
[19:05:57] valhallasw: niice!
[19:05:59] you should!
[19:07:39] Tool-Labs, Labs-Sprint-101: Stop bigbrother from being used for restarting webservices - https://phabricator.wikimedia.org/T101654#1344437 (yuvipanda) NEW
[19:11:28] YuviPanda: "up for grabs" means that basically noone is assigned for the task right now?
[19:11:37] d33tah: yes
[19:11:46] or rather, I don't know who is, but it isn't me.
[19:11:53] i see
[19:12:22] d33tah: for the privacy policy stuff, I suspect it should be someone from legal
[19:13:12] i e-mailed zhao, still waiting for a reply
[19:13:40] ZhouZ*
[19:14:56] ok
[19:15:30] as for 'is it ok if it gets slashdotted' - from an ops perspective, sure! be aware that it might go down, and if it does there's nothing we (labs admins) can do to help - it's all on you
[19:27:19] YuviPanda: I wrote a define both for the python stuff and for the trusty vs precise stuff
[19:27:24] with tests \o/
[19:53:37] valhallasw: <3 cool
[19:53:48] valhallasw: kolossos' emaial is bouncing. apparently our toolserver.org redirect stuff is shit.
[19:53:59] *email
[19:54:11] it's not like we took over the domain with promise of working redirects or anything.
[19:54:23] I guess snark isn't useful, but what is?
[19:59:09] Tool-Labs: kolossos@toolserver.org bouncing - https://phabricator.wikimedia.org/T101656#1344464 (yuvipanda) NEW
[19:59:13] YuviPanda: Setting up MX records for toolserver.org would probably help
[19:59:43] multichill: > toolserver.org. 3600 IN MX 10 toolserver.org.
[20:00:49] YuviPanda: 208.80.155.197 ? Doesn't listen on port smtp
[20:02:26] multichill: did you have a toolserver email that I can test with?
[20:02:47] username@toolserver.org so multichill@toolserver.org used to work
[20:03:36] YuviPanda: No smtp server seems to be listining the ipaddress so that's probably the source of the bounces
[20:07:23] multichill: I sent one to multichill@toolserver.org
[20:07:26] multichill: no bounces yet
[20:07:52] did it actually go anywhere?
[20:08:03] I dunno, from where did you try to send it?
[20:08:08] gmail? :)
[20:08:21] * YuviPanda has no idea what he's doing wrt email
[20:09:34] So gmail will find 208.80.155.197 as the MX server. Tries to deliver the email. Will hit the fact that there is no mail server on that ip and might or might not generate a soft bounce
[20:09:45] Will retry a couple of times and will generate a hard bounce
[20:09:57] That can take maybe a week
[20:10:44] YuviPanda: Where is the forwarding configuration?
[20:12:18] Tool-Labs: kolossos@toolserver.org bouncing - https://phabricator.wikimedia.org/T101656#1344482 (Multichill) dig MX toolserver.org gives: ;; ANSWER SECTION: toolserver.org. 3600 IN MX 10 toolserver.org. ;; ADDITIONAL SECTION: toolserver.org. 3600 IN A 208.80.155.1...
[20:12:29] Tool-Labs, Mail: kolossos@toolserver.org bouncing - https://phabricator.wikimedia.org/T101656#1344483 (Multichill)
[20:12:54] Tool-Labs, ToolLabs-Goals-Q4: Put toolserver.org redirect configuration in git - https://phabricator.wikimedia.org/T85165#1344485 (yuvipanda) *bump*?
[20:22:17] Labs, Continuous-Integration-Infrastructure: Continuous integration should not depend on labs NFS - https://phabricator.wikimedia.org/T90610#1344487 (Krinkle) As a first step, I disabled "Shared project storage" (/data/project NFS mount) in the [Nova Project management](https://wikitech.wikimedia.org/w/in...
[20:22:21] Tool-Labs, Mail: kolossos@toolserver.org bouncing - https://phabricator.wikimedia.org/T101656#1344489 (scfc) Were you trying to connect to that host from within Labs? Then that's the intra-Labs public IP network issue. But from the InterNet it doesn't look good either: ``` [tim@passepartout ~]$ nc -C 2...
[20:24:01] Tool-Labs, Mail: kolossos@toolserver.org bouncing - https://phabricator.wikimedia.org/T101656#1344490 (yuvipanda) p:Triage>High
[20:24:30] Tool-Labs, Mail: kolossos@toolserver.org bouncing - https://phabricator.wikimedia.org/T101656#1344491 (Multichill) >>! In T101656#1344489, @scfc wrote: > Were you trying to connect to that host from within Labs? Then that's the intra-Labs public IP network issue. Yup, from somewhere on the internet. In...
[21:32:17] Labs, operations: Make Labs NFS alerts paging - https://phabricator.wikimedia.org/T101650#1344507 (yuvipanda) Need to figure out: 1. Who all should be paged? 2. What's the paging condition?
[21:33:16] Tool-Labs: Run a documentation sprint for Tool Labs - https://phabricator.wikimedia.org/T101659#1344508 (yuvipanda) NEW
[21:37:35] * quiddity gives YuviPanda a cookie. https://i.imgur.com/coQ3buR.jpg
[21:42:30] quiddity: :)
[22:05:43] Labs: Disabling NFS in wikitech still has puppet trying to mount the folders - https://phabricator.wikimedia.org/T101660#1344522 (yuvipanda) NEW
[22:05:56] Labs, Labs-Sprint-101: Disabling NFS in wikitech still has puppet trying to mount the folders - https://phabricator.wikimedia.org/T101660#1344529 (yuvipanda)
[22:07:33] Labs, Labs-Sprint-101: Provide all labs users with username / passwords for the Postgres database - https://phabricator.wikimedia.org/T101661#1344532 (yuvipanda) NEW
[22:34:59] YuviPanda: all my jobs (for 'legobot') are in the Eqw state, I'm guessing due to the NFS outage?
[22:39:53] Tool-Labs, Labs-Sprint-101: Get rid of tools-trusty bastion - https://phabricator.wikimedia.org/T101094#1344557 (Legoktm) IIRC tools-trusty wasn't using the same cron as the other bastions. Were all of those moved over?