[00:25:38] PROBLEM - Puppet run on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [00:46:24] PROBLEM - Puppet run on tools-webgrid-generic-1405 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [00:46:26] PROBLEM - Puppet run on tools-webgrid-lighttpd-1205 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [00:47:04] yes that's all me ^ [00:50:32] PROBLEM - Puppet run on tools-webgrid-lighttpd-1202 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [01:03:21] PROBLEM - Puppet run on tools-worker-1003 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [01:26:26] RECOVERY - Puppet run on tools-webgrid-lighttpd-1205 is OK: OK: Less than 1.00% above the threshold [0.0] [01:30:26] RECOVERY - Puppet run on tools-webgrid-lighttpd-1202 is OK: OK: Less than 1.00% above the threshold [0.0] [01:30:46] RECOVERY - Puppet run on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [0.0] [02:00:39] PROBLEM - Puppet run on tools-webgrid-lighttpd-1406 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [02:56:49] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion: create conduit method for the creation of phabricator policy objects - https://phabricator.wikimedia.org/T135249#2325277 (10bd808) [05:04:04] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 383 bytes in 0.003 second response time [05:05:12] ^ tool labs is down [05:06:30] 06Labs, 10Tool-Labs: Tool Labs appears to be down - https://phabricator.wikimedia.org/T136162#2325315 (10Harej) [05:06:39] 06Labs, 10Tool-Labs: Tool Labs appears to be down - https://phabricator.wikimedia.org/T136162#2325328 (10Harej) p:05Triage>03Unbreak! [05:20:12] harej: could you be more specific please? [05:20:37] Loading a page on tools.wmflabs.org, including https://tools.wmflabs.org itself, results in a 503 error. [05:20:48] ok, so /a/ tool is down [05:20:53] are there any other symptoms? [05:20:54] No, I think they all are. [05:21:08] e.g.? [05:21:45] labs-morebots: how's it going? [05:21:46] I am a logbot running on tools-exec-1221. [05:21:46] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [05:21:46] To log a message, type !log . [05:22:37] https://tools.wmflabs.org/ https://tools.wmflabs.org/mix-n-match/ https://tools.wmflabs.org/wikidata-todo/ https://tools.wmflabs.org/xtools-ec/ all return "503 Service Temporarily Unavailable" [05:22:48] (yes xtools is not a good example I know but this is happening to every page on Tool Labs I try to load) [05:23:19] https://tools.wmflabs.org/wdq2sparql/w2s.php also returns that error [05:25:38] Yet, https://wpx.wmflabs.org/requests/en (Wikimedia Labs, but not Tool Labs) works fine. [05:26:14] harej: The xtools one often happens, sadly. Let me see if I can get it working. [05:26:24] But it's not just xtools! [05:26:31] Nothing on Tool Labs is loading. [05:26:45] Oh... OK. [05:30:31] the wikidata-game is also 503, ack [05:30:31] 06Labs, 10Tool-Labs: Tool Labs appears to be down - https://phabricator.wikimedia.org/T136162#2325363 (10Harej) To clarify, this is not just happening with one tool. Seemingly each tool I try to load results in the same error. URLs tested: * https://tools.wmflabs.org/ * https://tools.wmflabs.org/mix-n-match/... 
[05:33:51] !log tools rebooting tools-proxy-02 [05:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, dummy [05:34:10] did the logging bot just call you a dummy [05:34:17] always [05:35:04] ah, tools-proxy. makes sense [05:35:14] No change so far. [05:43:17] the error message that nginx shows seems to indicate redis connection issues [05:43:26] but it is running [05:44:55] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion: create conduit method for the creation of phabricator policy objects - https://phabricator.wikimedia.org/T135249#2325382 (10bd808) [05:45:12] andrewbogott: do we have to start redis on tools-proxy-02 ? [05:45:40] i see it's not running [05:45:44] mutante: I just failed over to -01 [05:45:56] ok! [05:46:30] …which doesn't seem to have helped, despite redis working properly there [05:47:15] it's running as a process, but : [05:47:22] Active: inactive (dead) [05:47:31] <_joe_> do you need me to take a look guys? [05:47:49] <_joe_> I am still sleepy though [05:48:16] <_joe_> andrewbogott: did you check that the backend for the homepage works? [05:48:28] i did this: [05:48:34] systemctl status redis-server [05:48:42] now Active: active (running) [05:48:58] but that didnt change it like you said [05:49:34] _joe_: I didn't check backends, although multiple tools are getting 503s and the redis on -02 seemed empty [05:49:37] so I'm failing over to -01 [05:49:40] which is not super fast [05:49:45] <_joe_> mutante: redis works on tools-proxy-01 and data is there [05:50:34] <_joe_> the site doesn't work on 01 either [05:50:36] _joe_: can you check -02? It seemed like it wasn't working there [05:50:43] _joe_: the fail-over isn't complete yet though... [05:50:57] <_joe_> andrewbogott: curl -H 'Host: tools.wmflabs.org' localhost/ [05:51:10] the nginx error log had "nginx attempt to send data on a closed socket" [05:52:04] <_joe_> it seems the proxy can't connect to the tools [05:52:06] I'm forcing some puppet runs now [05:52:16] The tools need to be updated as to which is the active proxy [05:52:18] puppet should do that [05:52:29] (this is via a hiera setting which it took me a few minutes to find) [05:52:36] <_joe_> andrewbogott: that won't change anything [05:52:56] _joe_: ok [05:53:11] active_proxy_host isn't tied to firewall rules or anything like that? [05:54:52] <_joe_> andrewbogott: so, the data in redis on -01 is _incorrect_ [05:55:31] 06Labs, 10Tool-Labs: Tool Labs appears to be down - https://phabricator.wikimedia.org/T136162#2325384 (10Urbanecm) https://tools.wmflabs.org/urbanecmbot/reliktyCswiki/ wasn't working a few seconds ago, after restarting with webservice restart (SSH is working) it works. Also my second tool (https://tools.wmflab... [05:55:32] proxy-01 says it cant connect to upstream at tools-webgrid-lidhhtpd-1415 [05:55:57] _joe_: ok, that's interesting... [05:56:00] <_joe_> mutante: because data in redis is outdated [05:56:04] is it incorrect in a consistent way? [05:56:14] <_joe_> andrewbogott: can we re-populate redis? [05:56:24] <_joe_> andrewbogott: I have no idea, I'm just looking at -01 [05:56:32] <_joe_> I suggest you page yuvi and chase [05:56:33] <_joe_> no2 [05:56:35] <_joe_> *now [05:56:39] YuviPanda is on his way [05:57:01] Instructions (such as they are) at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin#WebProxy [05:57:14] as far as I know, redis is populated dynamically by the webservices. 
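A minimal sketch of the triage being done here, assembled from the commands used above (the systemctl check and _joe_'s Host-header curl). It assumes it is run on the suspected proxy host itself (tools-proxy-01/-02 are the hosts in question); nothing else is specific:

    # Is redis up at all? nginx on the proxy reads its routing table from it.
    systemctl status redis-server --no-pager
    redis-cli ping

    # Ask nginx locally for the tools vhost, bypassing DNS / floating-IP
    # failover state, so the 503 can be pinned to this host or ruled out.
    curl -sv -o /dev/null -H 'Host: tools.wmflabs.org' http://localhost/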
I don't know how that works though [05:57:32] bd808: yeah, I'm doing the 'to switch over' now [05:57:36] <_joe_> bd808: they don't tell you how to recover from bad data in redis [05:57:47] <_joe_> and that's the issue; you have bad data in redis [05:57:50] although _joe_ has convinced me it won't help, I at least need to get it fully on one side or the other [05:57:54] <_joe_> someone fucked up the sync script [05:58:52] Is redis even running on 01? [05:59:01] <_joe_> bd808: it is [05:59:10] /usr/bin/redis-server *:6379 [05:59:19] <_joe_> so, redis-cli HGETALL prefix:rangecontri [05:59:26] <_joe_> *b [05:59:32] <_joe_> gets you a ip:port pair [05:59:47] <_joe_> which is definitely not being used [06:00:39] * andrewbogott wonders why https://tools.wmflabs.org/nagf/ works [06:00:41] <_joe_> it is consistently wrong on both proxies, too [06:00:48] <_joe_> andrewbogott: let me see [06:00:53] restarting the tools.admin webservice got that back up [06:00:56] > Created by @Krinkle. [06:00:59] that's why ;) [06:01:09] <_joe_> andrewbogott: it's on k8s [06:01:13] nagf was the last thing YuviPanda was messing with [06:01:16] <_joe_> and k8s uses a sync script I wrote [06:01:24] <_joe_> that's why :P [06:01:27] <_joe_> ori: ^^ [06:01:31] _joe_: k8s still uses redis/proxy doesn't it? [06:01:32] ah, ok [06:01:48] <_joe_> so the problem is specifically with gridengine => redis [06:01:56] <_joe_> did anyone touch any script? [06:02:11] I was looking in puppet history, didn't see anything [06:02:13] but I'll look again [06:02:34] restarting webservice processes seems to get them back up and running now [06:02:34] <_joe_> andrewbogott: I know jackshit about the GE => redis script [06:02:45] <_joe_> bd808: yes, that is probably working [06:02:48] bd808: can we just restart everything? [06:02:52] or are some things stateful? [06:02:56] <_joe_> something must have gone horribly wrong somewhere [06:03:09] <_joe_> ori: I assume we can, given the failure rate of toollabs [06:03:21] yeah [06:03:24] <_joe_> anything that cannot be randomly restarted would already be unusable [06:03:27] <_joe_> since forever [06:03:29] yeah, I don't know how to do it, but it has been done often [06:03:45] <_joe_> bd808: I can create a list of webservices to restart [06:03:52] <_joe_> if you don't have it [06:04:03] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 3669 bytes in 0.025 second response time [06:04:06] well, what do they do on restart that fixes things? register themselves with redis? [06:04:11] <_joe_> ori: yes [06:04:25] <_joe_> ori: I don't remember how that works exactly, though [06:04:31] I found the magic command [06:04:32] bd808: did you restart something to make the home page start working? [06:04:37] What tool was it? [06:04:38] <_joe_> or if there is a way to tell gridengine "dump me all" [06:04:51] nmap ;) [06:04:56] <_joe_> or a way to resync everything with our tools [06:04:56] andrewbogott: yes. tools.admin [06:05:00] qstat -q webgrid-generic -q webgrid-lighttpd -u '*' | awk '{ print $1;}' | xargs -L1 qmod -rj [06:05:11] of course [06:05:12] <_joe_> lol [06:05:17] I'm always surprised that there is a tool called 'tools' which is nonetheless not that [06:05:52] Shold I run that restart them all script [06:06:28] bd808: yes, go ahead. 
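For readers skimming the log, bd808's "magic command" above is reproduced here unchanged, with comments added:

    # Reschedule every webgrid job ("restart all webservices"):
    #  - qstat lists the jobs of every user in the two webgrid queues,
    #  - awk keeps the job id from column 1 (the qstat header rows come
    #    through as well; qmod just rejects them, which is harmless),
    #  - qmod -rj asks gridengine to reschedule each job.
    qstat -q webgrid-generic -q webgrid-lighttpd -u '*' \
      | awk '{ print $1;}' \
      | xargs -L1 qmod -rj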
[06:06:40] hello [06:06:53] !log tools Restarting all webservice jobs [06:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [06:07:01] bd808: hang on, let's see if YuviPanda needs to do a postmortem first [06:07:04] * YuviPanda reads backscroll [06:07:11] YuviPanda: so in short — redis is wrong [06:07:12] too late [06:07:18] bd808: ok :) [06:07:23] YuviPanda: redis /was/ wrong [06:07:24] get things back up first [06:07:28] postmortems later [06:07:30] about basically every tool [06:07:47] I found some empty entries on -02 so failed over to -01 [06:07:55] but that didn't help since -01 was wrong about everything as well [06:08:05] restarting things seems to help, so bd808 is restarting every webservice [06:09:00] it's done... [06:09:35] https://tools.wmflabs.org/xtools/ is still a 502, how long does it typically take? [06:09:36] * andrewbogott is watching https://tools.wmflabs.org/mix-n-match/ but honestly doesn't know if it ever worked [06:09:39] hmmm... sal is still down [06:09:51] you fixed the home page, but other tools like wikidata-game, no [06:10:13] https://tools.wmflabs.org/wikidata-game/ [06:10:15] my console is full of lines like "Pushed rescheduling of job 6725187 on host tools-webgrid-lighttpd-1415.tools.eqiad.wmflabs" from the one-liner [06:10:15] run MONITOR on the redis instance to see how tools are fixing themselves [06:11:09] so the restart is still being processed [06:11:12] A large number of jobs are in "Rr" state [06:11:15] that's good [06:11:29] 2) "http://10.68.23.222:57450" [06:11:31] it means that there is still a chance that you have fixed everything [06:11:34] is the entry for wikidata game [06:11:40] lighttpd 11009 tools.wikidata-game 4u IPv4 117036768 0t0 TCP *:34336 (LISTEN) [06:11:43] is where it's listening in [06:12:49] YuviPanda: Is it the case that tools register with the proxy redis when they come up, and then there's a different codepath that actively syncs them after the fact? [06:12:59] Or is it only when they first launch that they're registered? [06:13:33] andrewbogott: they register when they come up [06:13:40] I think I know the problem, stand by [06:13:49] bd808: you might have to run that script again in a minute... [06:13:58] I think the problem is that puppet is stuck on tools-services-01 [06:14:17] because of /public/dumps [06:14:19] <_joe_> and so the scripts keeps writing to the wrong redis? [06:14:48] bd808: am running that one liner now [06:17:03] nope, still not fixed [06:17:05] lighttpd 31977 tools.geohack 4u IPv4 5979985 0t0 TCP *:34543 (LISTEN) [06:17:09] 2) "http://10.68.18.54:37669" [06:17:12] hmm [06:17:12] you got puppet unstuck? [06:17:28] so I thought the problem was that the toollabs-webservice package on services was out of date [06:17:31] because puppet was stuck [06:17:34] but apparently not [06:17:35] (is services-01 really involved in the mix at all?) [06:17:36] ah [06:17:40] * bd808 invents -- qstat -q webgrid-generic -q webgrid-lighttpd -u '*' | tail -n+3 | awk '{print $5}' | sort | uniq -c [06:18:07] I actually did something else just now [06:18:12] changed the qmod -rj to qdel [06:18:16] So — the other variable is that I updated python-mwclient site-wide earlier today. There's NO WAY that could be connected to this, but I'm just throwing it out there. 
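bd808's freshly invented state-count one-liner and the switch from qmod -rj to qdel are easier to follow with labels. Both are sketched below; the second is a hedged reconstruction of the variant described, not a literal transcript of what was run:

    # Count webgrid jobs per gridengine state (r = running, Rr = restarted and
    # running again, qw = queued/waiting, ...). Column 5 of qstat is the state
    # field; tail drops the two header lines.
    qstat -q webgrid-generic -q webgrid-lighttpd -u '*' \
      | tail -n+3 \
      | awk '{print $5}' \
      | sort | uniq -c

    # The harder restart described below: delete the jobs outright instead of
    # rescheduling them, and let webservicemonitor on tools-services-01 notice
    # the missing webservices and bring them back up.
    # qstat -q webgrid-generic -q webgrid-lighttpd -u '*' | awk '{ print $1;}' | xargs -L1 qdel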
[06:18:20] and let webservicemonitor bring it up [06:18:28] andrewbogott: no, I was tweaking the toollabs-webservice package earlier [06:18:45] mine is just a count of jobs in each state in that queue [06:18:47] I tested it before I left, but clearly something I did there had caused this. I merged 3 patches [06:19:37] the count of 'r' state is slowly rising [06:20:02] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 383 bytes in 0.002 second response time [06:20:10] 2) "http://10.68.20.250:40415" [06:20:13] Briefly worked and now down again. [06:20:15] lighttpd 12647 tools.multidesc 4u IPv4 117049623 0t0 TCP *:40415 (LISTEN) [06:20:22] ok so that seems to be in sync now [06:20:55] Matthew_: which tool? [06:20:58] Matthew_: yeah, we're working on it. I restarted it manually before. It got killed in the last restart YuviPanda set off [06:21:08] YuviPanda: I've been testing on the main page. [06:21:24] right, it'll probably come back [06:21:25] 88 running now [06:21:27] I've got xtools working with a simple webservice restart. [06:21:30] webservicemonitor throttles things too I think [06:22:12] yeah the seem to be restarting in batches with a bit of a pause in between [06:22:51] I got xtools working with a "webservice restart" (I hope I didn't jump the gun) but it seems to be looking good. [06:23:07] Matthew_: yup, that should work too [06:23:14] I did a bunch of spot checks and they seem fine [06:23:39] OKay. [06:23:56] I just brought the landing page up manually [06:24:33] we didn't parallelize the webservice restarts because we figured it'd be rare and also that doing that might overload gridengine, which I think is an ok state to be in [06:25:04] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 3669 bytes in 1.458 second response time [06:25:12] thanks bd808 [06:25:26] wikidata-game works again, thanks [06:25:39] 196 running now [06:25:58] so… I think I missed a step. Was the solution restart, , restart again? [06:26:00] wait, no, kind of. the error page is now the tool labs page [06:26:04] Or did the original restart work and it just took a while? [06:26:34] andrewbogott: no, it was 1. make sure the version of toollabs-webservice on tools-services-01 was the same as rest of cluster, 2. delete all the webservices and have tools-services-01 bring them back up [06:26:40] the qmod -rj did not work but the qdel did [06:26:56] ah, ok [06:27:14] so is the toollabs-webservice bit a red herring? [06:27:20] YuviPanda: you should update -- https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin#Restarting_all_webservices [06:27:26] (I still don't understand why tools-services-01 has anything to do with proxying) [06:27:37] andrewbogott: bd808 https://gerrit.wikimedia.org/r/#/c/290612/ is the culprit. new version knew to pass '--register-proxy' to the webservice-runner, but old version did *not*. tools-services-01 was the old one... [06:28:08] 300 running now [06:28:09] bd808: I think the -rj works fine in most cases, this was the problem because of the version mismatches in various places... [06:28:15] Well, as long as we all agree that this can be blamed on NFS [06:28:35] andrewbogott: it runs webservicemonitor, which is responsible for checking if webservices are up and restarting them if not [06:28:40] mutante: which url is back down? [06:29:03] YuviPanda: why did that matter, though? 
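The spot checks being pasted here (a redis entry from the proxy next to a lighttpd LISTEN line from an exec node) follow a pattern worth writing down. A sketch, with the tool name as an example:

    # Cross-check one tool: does the backend the proxy's redis advertises match
    # what the tool's lighttpd is actually listening on?
    TOOL=wikidata-game

    # On the active proxy: the registered backend, e.g. "http://10.68.23.222:57450"
    redis-cli HGETALL "prefix:${TOOL}"

    # On the exec node running the job (find it with qstat): the real listener
    sudo lsof -nP -iTCP -sTCP:LISTEN | grep "tools.${TOOL}"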
[06:29:25] Did the issue only affect services which died and were restarted by webservicemonitor this evening? [06:29:33] YuviPanda: https://tools.wmflabs.org/wikidata-game/ but it shows "No webservice" now, not 503 anymore [06:29:55] well, 503 but the fancier error page [06:31:18] mutante: I manually restarted it, is back up [06:31:21] gridengine is struggling a bit now [06:31:29] 415 up [06:31:49] YuviPanda: thanks, works:) [06:33:12] feels like it's faster than before [06:33:45] 504 running [06:33:59] bd808: there was a total of ~600 something right? [06:34:21] something near that I think [06:37:01] looks like there should be ~680 when they are all back up. 592 now [06:37:59] the count or 'r' + 'Rr' before the mass qdel was about 680'ish [06:39:00] bd808: webservicemonitor is still restarting things [06:39:17] bd808: and by the time it's finished this round it'll probably pick up another set in a second round [06:41:23] looks like it has capped out at 669 [06:42:03] it has bumped to 670 and then back down to 669 on the last few counts I've run [06:42:17] yeah, the 'bub' tool is flapping [06:42:25] * bd808 is amazed there is only one flapping [06:42:37] bd808: me or valhallasw`cloud kill them now and then when we look at logs [06:42:58] hasn't bub been flapping for forever? [06:43:07] possibly [06:43:57] hm, I thought there was a Task on that, but I can't find it [06:44:10] I just stopped it now [06:44:56] no, it was definitely online last night (see access.log) [06:45:30] imma head to bed. night folks (and morning joe) [06:46:02] * andrewbogott waves [06:46:27] 'night bd808 [06:47:03] YuviPanda: service.manifest is still there, and WSW tries to restart it? [06:47:24] valhallasw`cloud: yeah, I just rm'd it [06:47:26] thanks bd808 [06:48:42] andrewbogott: _joe_ ori https://etherpad.wikimedia.org/p/tools-web-outage-2016-05-25 I wrote up sequence of events as far as I can tell (includes causes I believe) [06:49:57] Heh, I was just about to ask and then noticed that you already wrote ???? in the spot where my question would go [06:50:42] andrewbogott: yeah, I'm not sure what happened there. [06:50:50] andrewbogott: something must've triggered a restart of those things [06:51:44] andrewbogott: heh, puppet's been stuck for '6660' hours now [06:52:31] andrewbogott: other than the ??? does the rest make sense? [06:53:45] Yeah, it all makes sense [06:54:00] and the ???? bit isn't that crazy, since in general webservices can restart just fine [06:54:27] so maybe they do it all the time and we don't notice [06:54:38] (which, I guess, would be a good thing to know if true) [06:54:39] andrewbogott: this data is collected in graphite actually [06:54:43] let me find [06:55:30] 06Labs, 10Tool-Labs: Tool Labs appears to be down - https://phabricator.wikimedia.org/T136162#2325473 (10Urbanecm) p:05Unbreak!>03High Lowering the priority because Tool Labs is working, so this task is about finding why Tool Labs wasn't accessable. [06:56:11] labs graphite has gotten unbearably slow now [06:57:17] 06Labs, 10Tool-Labs: Turn on puppet nag emails for tools too - https://phabricator.wikimedia.org/T136167#2325475 (10yuvipanda) [06:57:41] now that I think about it, I bet that the nag emails aren't triggered for hangs. 
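The root cause identified above, tools-services-01 restarting webservices with an older toollabs-webservice that did not pass '--register-proxy', implies a simple cross-host check: compare the installed package version on the services host against the rest of the cluster. A sketch, with an illustrative host list:

    # Compare the toollabs-webservice version across hosts; any mismatch with
    # tools-services-01 is suspicious. Host names here are examples only.
    for host in tools-services-01 tools-cron-01 tools-webgrid-lighttpd-1415; do
      printf '%-32s ' "$host"
      ssh "$host" "dpkg-query -W -f='\${Version}\n' toollabs-webservice" 2>/dev/null \
        || echo 'unreachable'
    done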
[06:58:10] 06Labs, 10Tool-Labs: Switch toollabs-webservice to be deployed with an actual deployment mechanism - https://phabricator.wikimedia.org/T136168#2325488 (10yuvipanda) [06:58:16] andrewbogott: am filing bugs for action items [06:59:35] andrewbogott: I can't get the metrics out of graphite because it has slowed completely to a crawl [06:59:42] that's ok [06:59:56] I presume they'll still be there in the morning [07:00:25] andrewbogott: one hopes [07:00:40] 06Labs, 10Tool-Labs: Tool Labs appears to be down - https://phabricator.wikimedia.org/T136162#2325315 (10yuvipanda) https://etherpad.wikimedia.org/p/tools-web-outage-2016-05-25 for some ad-hoc notes on what happened. [07:00:55] 06Labs, 10Tool-Labs: Investigate Tool Labs webservice outage on 2016-05-25 - https://phabricator.wikimedia.org/T136162#2325503 (10yuvipanda) [07:01:51] YuviPanda: I'm going to go back to bed — I emailed the list earlier and it seems like things are mostly working for the moment. [07:02:08] andrewbogott: yeah. I'll do something like that too. [07:02:14] andrewbogott: thanks for paging and taking a look! [07:02:24] I hope you're enjoying Chicago! [07:02:36] andrewbogott: it's been amazing and I've a few more days left [07:19:36] !log tools hard reboot tools-services-01, was completely stuck on /public/dumps [07:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [07:25:32] 06Labs, 10Tool-Labs, 10DBA: enwiki_p replica on s1 is corrupted - https://phabricator.wikimedia.org/T134203#2257889 (10jcrespo) The replica is not corrupt, it just has drifted from production, failing to delete and insert some records, for several reasons: the main ones is crashing while using non-transactio... [07:25:56] 06Labs, 10Tool-Labs, 10DBA: enwiki_p replica on s1 is corrupted - https://phabricator.wikimedia.org/T134203#2325581 (10jcrespo) a:03jcrespo [08:10:24] 10PAWS: There should be a way, how to copy/paste a text from/to PAWS - https://phabricator.wikimedia.org/T136119#2323829 (10Framawiki) i use a b.sh bash file that can be call with ``` bash b.sh ``` ;) [08:11:34] RECOVERY - Puppet run on tools-exec-1403 is OK: OK: Less than 1.00% above the threshold [0.0] [08:16:37] 10PAWS: I can not write some special characters in PAWS - https://phabricator.wikimedia.org/T136118#2323791 (10Framawiki) you can use special chars with a bash file : T136119 but not directly in terminal [09:19:51] RECOVERY - Puppet staleness on tools-elastic-02 is OK: OK: Less than 1.00% above the threshold [3600.0] [09:24:39] 06Labs, 10Labs-Infrastructure, 06Operations, 13Patch-For-Review: rename holmium to labservices1002 - https://phabricator.wikimedia.org/T106303#1464173 (10Volans) I've applied the tendril grant from `/etc/mysql/production-grants.sql` (and only that one) required to have tendril monitor this host and added t... 
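The tools-services-01 reboot logged above was for a hard-hung /public/dumps NFS mount, the same hang that left puppet "stuck for 6660 hours". A cheap way to spot that condition, sketched with the usual tools mounts as examples:

    # stat on a dead NFS mount blocks indefinitely, so wrap it in a timeout
    # (a bad enough hang can defeat even this, but it catches the common case).
    for mount in /public/dumps /data/project /home; do
      if timeout -k 5 10 stat -t "$mount" >/dev/null 2>&1; then
        echo "OK    $mount"
      else
        echo "STUCK $mount (or not mounted)"
      fi
    done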
[10:02:22] 06Labs: Make user_email_authenticated status visible on labs - https://phabricator.wikimedia.org/T70876#2325775 (10Danny_B) [10:14:05] RECOVERY - Puppet run on tools-webgrid-lighttpd-1209 is OK: OK: Less than 1.00% above the threshold [0.0] [10:22:33] 10PAWS: I can not write some special characters in PAWS - https://phabricator.wikimedia.org/T136118#2325875 (10Dvorapa) @Framawiki thank you [10:23:18] 10PAWS: There should be a way, how to copy/paste a text from/to PAWS - https://phabricator.wikimedia.org/T136119#2325880 (10Dvorapa) @Framawiki thank you, good idea [11:02:26] Hi [11:02:30] There seems to be a lag [11:02:49] between when Wikipedia updates and when labs sees the newly updated versions in the API [11:02:54] How long is this lag? [11:03:25] Noted the lag here - https://quarry.wmflabs.org/query/6052 [11:03:50] There's stuff in the query which is appearing which, based on the query and what's ACTUALLY in English Wikipedia, shouldn't be [11:05:09] ShakespeareFan00, there are imports ongoing [11:05:29] Okay [11:05:40] there is https://tools.wmflabs.org/replag/ [11:06:01] but aside from that, there are small periods of revisions missing and then coming back [11:06:22] it is either that (small periods of glitches) or bringing labs 100% down [11:06:39] if you want to programmatically detect that [11:06:49] Hmm, 2 hour replication lag isn't bad compared with the toolserver's 48 hour lags :) [11:07:17] see https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database#Identifying_lag [11:07:31] and you can, for example, avoid doing queries while lag is > max [11:08:02] the good news is that those imports will fix a huge amount of differences against production [11:08:12] so it is worth it [11:12:37] Oh and I had a need for a rather intensive tool [11:12:52] Essentially it's a blame tool [11:13:25] but I am trying to find a list of files where a GFDL tag or similar was added by someone other than the uploader of the file [11:13:53] This basically needs to do a revisions scan AND a grep of page text [11:13:55] :( [11:15:04] Why is it only enwiki that gets big lags?
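jynus's advice above, check the replication lag and skip work while it is above a maximum, can be written as a small guard. This is a sketch only: it assumes the heartbeat_p.heartbeat view described in the linked "Identifying lag" section (with shard and lag columns), uses enwiki's s1 shard and the enwiki.labsdb alias as examples, and picks an arbitrary threshold:

    MAX_LAG=300   # seconds; arbitrary example threshold

    LAG=$(mysql --defaults-file="$HOME/replica.my.cnf" -h enwiki.labsdb -BN \
          -e "SELECT ROUND(lag) FROM heartbeat_p.heartbeat WHERE shard = 's1'" 2>/dev/null)

    # Treat "could not read the lag" the same as "too much lag".
    if [ "${LAG:-999999}" -gt "$MAX_LAG" ]; then
      echo "replication lag too high (${LAG:-unknown} s), skipping this run"
      exit 0
    fi
    # ... run the real queries here ...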
[11:15:06] XD [11:16:27] it is me [11:17:10] for the import, I need to sync production and labs, and for that I need to stop them in the same position while the import is ongoing [11:17:27] Fair enough [11:17:35] I can do something else for a bit [11:17:37] :) [11:22:20] ShakespeareFan00: for such a tool, you're probably better off parsing the dumps [11:22:51] Hmm [11:22:54] I might ask aabout that [11:33:10] 10PAWS: Paste does not work in PAWS terminal - https://phabricator.wikimedia.org/T120633#2326040 (10Framawiki) [11:33:13] 10PAWS: There should be a way, how to copy/paste a text from/to PAWS - https://phabricator.wikimedia.org/T136119#2326042 (10Framawiki) [11:37:12] 10Labs-project-wikistats: Update lietuvai.lt statistics URLs - https://phabricator.wikimedia.org/T136183#2326049 (10Nemo_bis) [11:37:21] 10Labs-project-wikistats: Update lietuvai.lt statistics URLs - https://phabricator.wikimedia.org/T136183#2326061 (10Nemo_bis) p:05Triage>03Low [11:38:19] 10PAWS: I can not write some special characters in PAWS - https://phabricator.wikimedia.org/T136118#2326062 (10Framawiki) 05Open>03Resolved a:03Framawiki [11:40:54] 10PAWS: There should be a way, how to copy/paste a text from/to PAWS - https://phabricator.wikimedia.org/T136119#2326065 (10jayvdb) [11:40:56] 10PAWS: Paste does not work in PAWS terminal - https://phabricator.wikimedia.org/T120633#2326067 (10jayvdb) [11:41:27] 10PAWS: Paste does not work in PAWS terminal - https://phabricator.wikimedia.org/T120633#1857587 (10jayvdb) 05duplicate>03Open [11:43:35] 10PAWS: There should be a way, how to copy/paste a text from/to PAWS - https://phabricator.wikimedia.org/T136119#2326069 (10jayvdb) @Framawiki , thank you for finding and fixing the duplicate ;-) . In future, please close duplicates which are latter (higher number), leaving open the earlier task open (lower num... [11:53:22] 10Labs-project-wikistats, 10Internet-Archive: Remove some big former MediaWiki sites - https://phabricator.wikimedia.org/T136184#2326077 (10Nemo_bis) [11:54:04] 10Labs-project-wikistats, 10Internet-Archive: Remove some big former MediaWiki sites - https://phabricator.wikimedia.org/T136184#2326092 (10Nemo_bis) p:05Triage>03Normal [13:20:55] 06Labs, 10Tool-Labs, 13Patch-For-Review: Investigate Tool Labs webservice outage on 2016-05-25 - https://phabricator.wikimedia.org/T136162#2325315 (10chasemp) >>! In T136162#2325501, @yuvipanda wrote: > https://etherpad.wikimedia.org/p/tools-web-outage-2016-05-25 for some ad-hoc notes on what happened. Than... [14:05:46] 06Labs, 10Tool-Labs: zoomviewer is using 837G out of 8T for Tools - https://phabricator.wikimedia.org/T136190#2326298 (10chasemp) [14:08:42] 06Labs, 10Tool-Labs: zoomviewer is using 837G out of 8T for Tools - https://phabricator.wikimedia.org/T136190#2326312 (10chasemp) p:05Triage>03High [14:11:21] 06Labs, 10Tool-Labs: templatetiger is using 613G in Tools out of 8T - https://phabricator.wikimedia.org/T136192#2326337 (10chasemp) [14:14:31] 06Labs, 10Tool-Labs: templatetiger is using 613G in Tools out of 8T - https://phabricator.wikimedia.org/T136192#2326337 (10jcrespo) It holds also 165G on toolsdb. [14:16:53] hey there is till 7TB+ free!! [14:24:47] 06Labs, 10Tool-Labs: wikidata-exports is using 256G in Tools - https://phabricator.wikimedia.org/T136194#2326413 (10chasemp) p:05Triage>03Normal [14:27:32] 06Labs, 10Tool-Labs: wikidata-exports is using 256G in Tools - https://phabricator.wikimedia.org/T136194#2326415 (10chasemp) I wasn't able to find any of the maintainers on Phabricator. 
I emailed Markus via contact information found through this user page. [14:27:48] PROBLEM - Free space - all mounts on tools-worker-1004 is CRITICAL: CRITICAL: tools.tools-worker-1004.diskspace.root.byte_percentfree (<10.00%) [14:29:03] hashar: what are you talking about? [14:29:22] randomly mumbling about templatetiger is using 613G in Tools out of 8T :D [14:29:26] ignore me! [14:30:07] :) [14:34:32] 06Labs, 10Tool-Labs: currentevents is using 248G in Tools - https://phabricator.wikimedia.org/T136195#2326421 (10chasemp) [14:36:13] PROBLEM - SSH on tools-webgrid-lighttpd-1408 is CRITICAL: Server answer [14:41:43] 06Labs, 10Tool-Labs: icelab is using 245G in Tools - https://phabricator.wikimedia.org/T136197#2326481 (10chasemp) p:05Triage>03High [14:47:12] 06Labs, 10Tool-Labs: wikiviewstats is using 232G on Tools - https://phabricator.wikimedia.org/T136198#2326489 (10chasemp) [14:56:54] 06Labs, 10Tool-Labs: oar is using 207G on Tools - https://phabricator.wikimedia.org/T136201#2326542 (10chasemp) [15:05:23] 06Labs, 10Tool-Labs: wikidata-exports is using 256G in Tools - https://phabricator.wikimedia.org/T136194#2326563 (10mkroetzsch) We have two kinds of large data files: biweekly Wikidata json entity dumps and RDF exports that we generate from them. The RDF exports are what we offer through our website http://too... [15:07:02] 06Labs, 10Labs-Infrastructure, 06Operations, 13Patch-For-Review: rename holmium to labservices1002 - https://phabricator.wikimedia.org/T106303#2326565 (10Andrew) @Volans -- thanks! [15:10:14] 06Labs, 10Tool-Labs: toolserver-home-archive is using 52G on Tools - https://phabricator.wikimedia.org/T136202#2326568 (10chasemp) [15:23:56] 06Labs, 10Tool-Labs: zoomviewer is using 837G out of 8T for Tools - https://phabricator.wikimedia.org/T136190#2326630 (10dschwen) Hey Chase, I went ahead and deleted cache entries older than 90 days. ``` find /data/project/zoomviewer/public_html/cache -mtime +90 -delete ``` I can put this in the tool's cron... [15:47:46] 06Labs, 10Tool-Labs: liangent-php is using 348G on Tools - https://phabricator.wikimedia.org/T136208#2326730 (10chasemp) [15:52:30] 06Labs, 10Tool-Labs: zoomviewer is using 837G out of 8T for Tools - https://phabricator.wikimedia.org/T136190#2326761 (10chasemp) @dschwen, thank you. This tool just went from the #1 space user to somewhere around #20. Much appreciated. I believe our mounts are all done w/ `noatime`. Let me know what you d... 
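dschwen's offer to "put this in the tool's cron" could look roughly like the entry below; the find line is copied from the comment above, the schedule is arbitrary, and on tool labs the command would conventionally be wrapped in jsub so it runs on the grid rather than on the cron host:

    # crontab -e as the zoomviewer tool account
    # m  h  dom mon dow  command
      30 2  *   *   *    find /data/project/zoomviewer/public_html/cache -mtime +90 -delete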
[15:58:26] 06Labs, 10Tool-Labs: wikidata-analysis is using 153G on Tools - https://phabricator.wikimedia.org/T136211#2326813 (10chasemp) [16:05:40] 06Labs, 10Tool-Labs, 07Tracking: Contact tool maintainters using large amounts of disk space - https://phabricator.wikimedia.org/T136212#2326835 (10bd808) [16:05:53] 06Labs, 10Tool-Labs, 07Tracking: Contact tool maintainters using large amounts of disk space - https://phabricator.wikimedia.org/T136212#2326851 (10bd808) p:05Triage>03Normal [16:07:56] 06Labs, 10Tool-Labs: wikidata-analysis is using 153G on Tools - https://phabricator.wikimedia.org/T136211#2326886 (10bd808) [16:07:58] 06Labs, 10Tool-Labs: toolserver-home-archive is using 52G on Tools - https://phabricator.wikimedia.org/T136202#2326888 (10bd808) [16:08:00] 06Labs, 10Tool-Labs: oar is using 207G on Tools - https://phabricator.wikimedia.org/T136201#2326889 (10bd808) [16:08:02] 06Labs, 10Tool-Labs: wikiviewstats is using 232G on Tools - https://phabricator.wikimedia.org/T136198#2326890 (10bd808) [16:08:04] 06Labs, 10Tool-Labs: icelab is using 245G in Tools - https://phabricator.wikimedia.org/T136197#2326891 (10bd808) [16:08:06] 06Labs, 10Tool-Labs: wikidata-exports is using 256G in Tools - https://phabricator.wikimedia.org/T136194#2326893 (10bd808) [16:08:08] 06Labs, 10Tool-Labs: currentevents is using 248G in Tools - https://phabricator.wikimedia.org/T136195#2326892 (10bd808) [16:08:10] 06Labs, 10Tool-Labs: zoomviewer is using 837G out of 8T for Tools - https://phabricator.wikimedia.org/T136190#2326895 (10bd808) [16:08:23] 06Labs, 10Tool-Labs, 07Tracking: Contact tool maintainters using large amounts of disk space - https://phabricator.wikimedia.org/T136212#2326885 (10bd808) [16:08:25] 06Labs, 10Tool-Labs: templatetiger is using 613G in Tools out of 8T - https://phabricator.wikimedia.org/T136192#2326894 (10bd808) [16:09:29] bd808: hi, can you delete an unused tool from toollabs for me? (I'm maintainer) [16:09:34] the directory is already empty [16:10:32] Luke081515: open a phab task and link it to T133777 [16:10:32] T133777: [Tracking] Tools that should get deleted - https://phabricator.wikimedia.org/T133777 [16:10:43] ok [16:11:04] I think there is some complicated dance that has to be done [16:12:11] 06Labs, 10Tool-Labs, 07Tracking: Delete tool 'rcm' - https://phabricator.wikimedia.org/T136216#2326965 (10Luke081515) [16:12:21] bd808: {{Done}} :) [16:12:38] 06Labs, 10Tool-Labs: Delete tool 'rcm' - https://phabricator.wikimedia.org/T136216#2326980 (10bd808) [16:12:43] ah, you were faster :D [16:13:23] It would be really neat if we could somehow flag tags that shouldn't be inherited like tracking, patch-for-review, etc [16:13:51] probably overcomplicating the world though [16:15:34] never occurred to me actually, are there others than 'tracking' [16:15:41] we would want to flag as uninheritable? [16:15:55] patch-for-review and upstream I think [16:16:01] probably a few more [16:16:18] maybe anything that is a "tag" type? [16:16:57] 10PAWS: I can not write some special characters in PAWS - https://phabricator.wikimedia.org/T136118#2327000 (10Dvorapa) @Framawiki Does it mean it is fixed? Or not? Or could I reopen it as a proposal? 
[16:18:25] my internet is shitty today :/ [16:18:49] I think my 6-in-4 tunnel is being flaky [16:19:24] 06Labs, 10Tool-Labs: tools.suggestbot web requests fail after a period of time - https://phabricator.wikimedia.org/T133090#2327015 (10bd808) [16:19:26] 10PAWS: I can not write some special characters in PAWS - https://phabricator.wikimedia.org/T136118#2327017 (10yuvipanda) 05Resolved>03Open I think we should re-open this, since writing it in a file is cumbersome. I'll figure out the upstream task for this.. [16:20:05] 10PAWS: I can not write some special characters in PAWS - https://phabricator.wikimedia.org/T136118#2327024 (10Dvorapa) @yuvipanda ok, thank you [16:20:07] RECOVERY - Puppet run on tools-webgrid-generic-1403 is OK: OK: Less than 1.00% above the threshold [0.0] [16:20:23] andrewbogott: where does the puppet nagging code live, btw? [16:21:38] modules/base/files/labs/puppetalert.py [16:22:38] YuviPanda: the good bit is [16:22:39] if hiera('send_puppet_failure_emails', false) [16:22:52] in modules/base/manifests/labs.pp [16:23:05] yeah [16:23:13] I'm going to file a task first and then turn it back on [16:25:47] 06Labs, 10Tool-Labs: liangent-php is using 348G on Tools - https://phabricator.wikimedia.org/T136208#2327065 (10liangent) mw-log is really activity log, for all activities in my php bot, which is actually $wgDebugLogFile. There was a time people asked me "why my bot is doing (something) in (some way)" and the... [16:27:12] 06Labs, 10Tool-Labs: currentevents is using 248G in Tools - https://phabricator.wikimedia.org/T136195#2327075 (10chasemp) In particular is it possible that files such as: ```./currentevents/dumps/enwiki/20151201/enwiki-20151201-pages-meta-history2.xml-p000018040p000019712.7z: 177M ./currentevents/dumps/enwiki... [16:29:30] 06Labs, 10Tool-Labs: oar is using 207G on Tools - https://phabricator.wikimedia.org/T136201#2327081 (10A930913) EdSaperia: "I'm trying to raise more funds for it So keep would be nice" [16:33:50] 06Labs, 10Tool-Labs, 07Tracking: Contact tool maintainters using large amounts of disk space (tracking) - https://phabricator.wikimedia.org/T136212#2327094 (10Danny_B) [16:33:57] 06Labs, 10Tool-Labs: [Tool Labs] Database credential file replica.my.cnf missing in my home directory on Tool Labs (/home/wiki13). - https://phabricator.wikimedia.org/T122657#2327096 (10bd808) [16:33:58] 06Labs, 10Tool-Labs: Restore replica.my.cnf for toolsbeta.admin - https://phabricator.wikimedia.org/T109807#2327098 (10bd808) [16:34:00] 06Labs, 10Tool-Labs, 07Tracking: Tool Labs users missing replica.my.cnf (tracking) - https://phabricator.wikimedia.org/T135931#2327095 (10bd808) [16:52:13] How do I set a php.ini config value for my tool? Symfony is moaning because date.timezone isn't set, and it won't run without it: ' * date.timezone setting must be set [16:52:13] > Set the "date.timezone" setting in php.ini* (like Europe/Paris). [16:52:14] ' [16:53:10] tom29739: php.my.ini is available, iirc, and you can probably use ini_set at the top of your index.php? [16:53:17] before you import any symfony stuff [16:53:50] (I can't say I understand why symfony doesn't do that for you) [16:54:28] It can't choose a default timezone. [16:55:04] If they included something like that, then people might be annoyed because it overrides the php.ini default. [16:55:05] it can, based on configuration parameters the programmer provides [16:55:26] That's the only reason I can think of. 
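valhallasw's suggestions above (a per-tool INI override, or ini_set at the top of index.php before Symfony loads) can be made concrete. This sketch uses PHP's standard per-directory .user.ini mechanism and assumes a typical tool web root, with UTC purely as an example value; calling date_default_timezone_set('UTC') in index.php would achieve the same from inside PHP:

    # Drop a per-directory INI override next to the tool's PHP entry point so
    # FastCGI PHP picks up the timezone for everything served from there.
    printf 'date.timezone = "UTC"\n' > "$HOME/public_html/.user.ini"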
[16:55:32] having a default in php.ini makes no sense, because there is no sane default for all applications on the same webserver [16:55:42] (other than, maybe, 'utc') [16:55:59] That's what I thought the default would be. [17:02:11] I am now confused: 'Default timezone => UTC' and 'date.timezone => no value => no value'. [17:02:25] Shouldn't they be the same? [17:02:29] there is a rant from Tim somewhere in the code comments about the decision by PHP/Zend to leave it unset by default [17:02:53] I tried to set it in php.my.ini. [17:04:12] (03PS1) 10Alexandros Kosiaris: keyholder: service_deploy to deploy_service [labs/private] - 10https://gerrit.wikimedia.org/r/290718 [17:06:51] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] keyholder: service_deploy to deploy_service [labs/private] - 10https://gerrit.wikimedia.org/r/290718 (owner: 10Alexandros Kosiaris) [17:18:55] !log tools fixed hhvm upgrade on tools-cron-01 [17:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [17:39:09] 06Labs, 10Tool-Labs: Unmount unneeded NFS mounts from tool labs hosts - https://phabricator.wikimedia.org/T136222#2327313 (10yuvipanda) [17:48:09] 06Labs, 10Tool-Labs: Unmount unneeded NFS mounts from tool labs hosts - https://phabricator.wikimedia.org/T136222#2327340 (10yuvipanda) Host types that should have no NFS: 1. k8s master 2. k8s etcd hosts 3. proxies 4. redises Host types that should *only* have /data/project: 1. services hosts (for manifest co... [18:02:16] 06Labs, 10Tool-Labs, 07Tracking: Contact tool maintainters using large amounts of disk space (tracking) - https://phabricator.wikimedia.org/T136212#2326835 (10Kolossos) templatetiger should be after cleanup now at under 140GB. [18:03:05] 06Labs, 10Tool-Labs: templatetiger is using 613G in Tools out of 8T - https://phabricator.wikimedia.org/T136192#2327360 (10Kolossos) templatetiger should be after cleanup now at under 140GB on file system. [18:14:05] (03CR) 10Merlijn van Deen: [C: 032] add hostname to userinfo [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/289899 (owner: 10Merlijn van Deen) [18:14:38] (03Merged) 10jenkins-bot: add hostname to userinfo [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/289899 (owner: 10Merlijn van Deen) [18:15:52] * valhallasw`cloud prods wikibugs [18:17:45] !log tools.wikibugs valhallasw: Deployed 6b863811ff4a2ce9230eabce141f802854cd33f7 Merge "add hostname to userinfo" wb2-irc [18:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL, Master [18:18:48] 'VERSION Wikibugs v2.1, http://tools.wmflabs.org/wikibugs/ ,running on tools-exec-1403.tools.eqiad.wmflabs' [18:18:48] wheee [18:18:56] the spacing isn't quite right, but oh well [18:24:10] Is there any easy way to set the php.ini used by lighttpd? I've tried loads... [18:24:52] tom29739: https://secure.php.net/manual/en/configuration.file.per-user.php [18:25:08] tom29739: and, as suggested before, ini_set [18:25:30] in the specific case of timezones, date_default_timezone_set should also work [18:28:32] YuviPanda: you used to have a link to a presentation on your site or blog or something called "the funniest presentation" or funniest conference presentation ever [18:28:34] can't find it [18:28:42] do you recall this at all? :) [18:29:08] 06Labs, 10Labs-Infrastructure: I/O on labmon1001 is very slow - https://phabricator.wikimedia.org/T127957#2059611 (10RobH) So labmon1001 went out of warranty this last March. 
I'd suggest the re-partitioning to use all the disks should be attempted before we purchase new hardware to go into an outdated system.... [18:29:29] chasemp: do you recall what it was about? I vaguely remember it [18:29:47] nope :) just that the dude made a joke like every 30s and it was entertaining [18:29:59] chasemp: oooh, yes I remember. it was the microsoft nightwatch guy [18:30:12] yes [18:30:15] chasemp: james mickens [18:31:34] the Mossad or not Mossad talk ;) [18:31:38] I think? [18:32:31] chasemp: https://vimeo.com/111122950 maybe? [18:32:39] chasemp: or https://vimeo.com/95066828 perhaps [18:32:46] https://vimeo.com/95066828 [18:32:49] yes [18:37:29] I need to reboot bastion-02 [18:37:35] I broadcast to all users on the host [18:40:15] RECOVERY - Puppet run on tools-exec-1210 is OK: OK: Less than 1.00% above the threshold [0.0] [18:40:24] 06Labs, 10Tool-Labs: Backup and/or puppetize @toolserver.org mail forwards - https://phabricator.wikimedia.org/T136225#2327428 (10valhallasw) [18:44:56] 06Labs, 10Labs-Infrastructure: I/O on labmon1001 is very slow - https://phabricator.wikimedia.org/T127957#2327456 (10yuvipanda) Plan is to try to reinstall this with './modules/install_server/files/autoinstall/partman/raid10-gpt-srv-lvm-ext4.cfg' recipe and see how that goes. [18:49:59] 10Tool-Labs-tools-Other, 07Tracking: merl tools (tracking) - https://phabricator.wikimedia.org/T69556#2327474 (10valhallasw) [18:50:01] 06Labs, 10Tool-Labs: Provide resource for db access in grid - https://phabricator.wikimedia.org/T70881#2327472 (10valhallasw) 05Open>03declined Unfortunately, we don't have the in-house knowledge to implement and maintain such a custom resource. I think 'check-and-reschedule' is a sane workaround, which ha... [18:52:22] 06Labs, 10Labs-Infrastructure: Copy graphite data from labmon1001 to an external HDD - https://phabricator.wikimedia.org/T136226#2327508 (10yuvipanda) [18:53:43] 06Labs, 10Labs-Infrastructure: Reinstall labmon1001 with new disk configuration (and jessie) - https://phabricator.wikimedia.org/T136227#2327529 (10yuvipanda) [18:54:22] valhallasw`cloud: I'm figuring out how to reinstall labmon1001, I'll probably try to do it next week [18:54:36] YuviPanda: can I suggest sharding? [18:54:57] in other words a separate monitoring instance for tool labs [18:56:09] valhallasw`cloud: probably, but this is a 1d procedure involving no procurement while that will require new hardware... [18:56:17] or a vm [18:56:29] valhallasw`cloud: vm + graphite isn't going to go well... [18:56:45] even if it's only for tools? [18:56:48] valhallasw`cloud: long term I think we should just run a prometheus agent on all nodes, that way we get alerting too and bypass all the problems [18:57:05] valhallasw`cloud: a VM for just tools is still going to perform at best on par with an overloaded HDD real hardware box I think. [18:57:33] valhallasw`cloud: and the current setup is *incredibly* inefficient, I think we'll get at least a 3x boost... [18:57:39] I'm confused. [18:57:51] Diamond writes data for each host once every what, five minutes? [18:57:58] I'm not entirely sure on that, only because of the relatively small write [18:57:59] yeah [18:58:18] hmm, I must retract my 3x boost claim now since I realized I'm just parroting out of context heh [18:58:20] so maybe I should first ask the question 'have you profiled where the load comes from' [18:58:25] I've been thinking about it and am inclined towards a branch model for monitoring as well [18:58:51] i.e.
treat Tools as a "site" agnostic to where it is and have it hang off of a higher up monitoring node for Labs [18:58:57] 10Quarry: Excel does not recognize Quarry CSV output as UTF-8 - https://phabricator.wikimedia.org/T76126#2327562 (10valhallasw) No, the 'UTF-16' seems to actually be UTF-8... [18:59:03] but whatever we can do now I understand [18:59:35] I'm all for rethinking our models, but I highly doubt any of them are going to take 1d to implement. And when we do rethink our models I really don't want graphite to be a part of it [18:59:49] http://tools-prometheus.wmflabs.org/ should just collect metrics from all nodes, not just k8s [19:00:06] that actually has a distributed data store from what I can tell, so we won't run into the SPOF issues with graphite [19:00:23] nope, no objections to this, it will take some doing to rethink [19:00:52] if this buys us another 3 months I'll be happy with it :D [19:01:04] yep [19:03:01] valhallasw`cloud: chasemp if we merge https://gerrit.wikimedia.org/r/#/c/276243/ we can get a start on it (prometheus metrics from all nodes). Rest requires some amount of debian package fuckery tho, to get it to run on other nodes. I'm seriously considering the fact that we shouldn't be using debs for go packages at all. Requires a lot of work for shitall returns [19:03:29] fpm! [19:03:48] wget! :D [19:03:51] so for similar iirc I used dpkg-deb as well which I think is ...frowned as well but it's so simple [19:04:00] probbly w/ wget is for instance on k8s nodes now [19:04:01] and sane upgrades if you make sure to actually force an apt-get upgrade && apt-get install [19:04:07] how do i query version etc [19:04:14] you run into all the problems apt solves [19:04:15] idk [19:04:43] so this is vaguely related to yesterday's outage [19:05:02] I'm not sure how to deploy the toollabs-webservice deb [19:05:50] yeah, first thought htere is ensure latest is always bad, we have had the conversation in the past as well [19:05:55] 10Quarry: Excel does not recognize Quarry CSV output as UTF-8 - https://phabricator.wikimedia.org/T76126#790571 (10Dzahn) Btw, separate from the encoding... what Excel considers to be a "CSV" actually depends on language settings in Windows. For example if you use a German Windows, the delimiter character by def... [19:05:55] but I hate that setting [19:06:21] but in general for similar I have had ensure => present and then do apt-get install foo -y [19:06:30] which will upgrade via some salt or whatever orchestration [19:06:46] My preference would be ensure => present plus an upgrade cronjob [19:06:58] why a cron job? [19:07:05] 'whatever orchestration' -> we don't really have anything we can trust and use [19:07:18] ...or in scheduled maintenance [19:07:20] atm something ssh based is useable afaik [19:07:26] if there is a node sans SSH in Tools at least [19:07:29] that's another isssue [19:07:33] I've my hand hacked shell scripts but I run those by hand and I missed running them on the services node [19:07:34] I'm talking purely in Tools [19:07:35] before you use a manual cronjob, you could use https://wiki.debian.org/UnattendedUpgrades [19:07:54] mutante: those break down once you suggest to upgrade everything :( [19:08:29] chasemp: not documented anywhere, and I lost all the clush stuff from my history. It also didn't really do any node discovery when I had it because we haven't set it up... [19:08:34] valhallasw`cloud: ..with a config that does not upgrad everything? 
[19:08:47] there is also debdeploy used in prod for this [19:08:48] mutante: no, if you use a config that upgrades everything [19:08:48] so not very different from xargs bash-script-that-sshes which I'm using [19:09:01] YuviPanda: I'm open to consolidating on a "one true way" which we don't have now [19:09:20] chasemp: we don't have *any* supported way, IMO. we all have our little helper scripts. [19:09:25] I think 'whatever prod does' sounds like the best way to go? [19:09:28] if valhallasw`cloud wants to run his own he has to figure out his own [19:09:36] valhallasw`cloud: 'whatever prod does' is 'salt' [19:09:41] it's debdeploy [19:09:50] it happens automatically based on hiera grains [19:09:51] well, debdeploy + salt + scap + whatever [19:10:00] debdeploy is also based on salt [19:10:00] and which packages are listed in it to be upgraded [19:10:09] sure I just mean, the stack is not simple [19:10:11] yes, but it's much more than a person running salt [19:10:12] totally [19:10:38] it's true we each have our own thing, I would like to settle on clush and say ssh has to be working or $alert [19:10:41] salt on labs is also unreliable to a point that I'd rather open 50 tabs [19:10:46] chasemp: I totally agree [19:10:48] and I don't think the scaffolding around that is too difficult [19:10:56] I agree too :D just needs to be built [19:11:03] tom29739: I added a note about .user.ini at https://wikitech.wikimedia.org/w/index.php?title=Help:Tool_Labs/Web&diff=569479&oldid=549251 -- add more things there if you think of them [19:11:03] for now if you want to do node discovery, or at least compile time node list generation [19:11:15] you can pull nodes from puppet master certs signing list [19:11:21] which gives you every tools host [19:11:27] which isn't savvy targeting [19:11:31] but is useable [19:11:45] but yeah it's not a solved problem [19:12:10] yeah [19:12:19] this is every tools host [19:12:20] puppet cert -l --all | grep '\.tools\.' | cut -d '"' -f 2 [19:12:21] etc [19:12:33] bd808, that seems good. [19:12:44] you can probably ask puppetmaster for 'what classes are being applied to this host?' [19:13:07] and if https://gerrit.wikimedia.org/r/#/c/285014/ gets merged you can maybe ask it too [19:13:31] until then, maybe I should just write a fabfile [19:13:47] that's what debdeploy does, depending on the role class the host uses, different rules apply what gets auto-upgraded and what doesnt [19:14:14] mutante: it assumes a reliable working salt setup. [19:14:36] YuviPanda: sounds like a good problem post storage things for $new person [19:14:41] chasemp: yeah, I agree. [19:14:42] figure us out a thing to do stuff thanks [19:15:15] chasemp: ya. [19:15:46] why is labs salt more unreliable than prod salt after the fixes? not the same version yet? [19:15:48] chasemp: in the meantime tho, I'm going to write a fabfile for stuff like this rather than xargs things. [19:15:54] sure [19:16:28] (03CR) 10Merlijn van Deen: [C: 032] Task parsing code: always split by the /last/ : [labs/tools/forrestbot] - 10https://gerrit.wikimedia.org/r/290424 (https://phabricator.wikimedia.org/T136041) (owner: 10Gerrit Patch Uploader) [19:16:49] mutante: I do not know enough about salt to have an informed opinion on why our installaton of it is so unreliable. I know that I've been burnt by it way more than enough times in the past to never trust it again unless someone can explicitly prove to me that it works right 100% of the time. 
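chasemp's cert-list discovery and the "xargs bash-script-that-sshes" pattern discussed above fit together in a few lines. A sketch, to be run wherever the project puppetmaster/CA lives, with 'uptime' standing in for whatever actually needs to run everywhere:

    # Build a host list from the puppet CA (chasemp's one-liner), then loop.
    sudo puppet cert -l --all | grep '\.tools\.' | cut -d '"' -f 2 > /tmp/tools-hosts

    while read -r host; do
      echo "== ${host} =="
      ssh -o ConnectTimeout=10 -o BatchMode=yes "$host" 'uptime' </dev/null \
        || echo "FAILED: ${host}"
    done < /tmp/tools-hosts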
[19:17:06] (03Merged) 10jenkins-bot: Task parsing code: always split by the /last/ : [labs/tools/forrestbot] - 10https://gerrit.wikimedia.org/r/290424 (https://phabricator.wikimedia.org/T136041) (owner: 10Gerrit Patch Uploader) [19:17:38] YuviPanda: i assume some kind of fixes have been applied that are not in labs yet, "stable enough for prod but not stable enough for labs" would be odd [19:17:58] no idea :) [19:18:05] but that's always historically been the case I guess [19:18:27] mutante: maybe I'm wrong on this, but my feeling is that for prod there's always more people around to step in when things go wrong [19:18:29] I gave up on it right around the big labstore explosion last June, when it proved absolutely useless [19:18:32] ^ [19:18:36] I think that's ultimately it [19:19:01] I think that apergos would be the person to talk to about checking on the salt masters and such [19:19:03] valhallasw`cloud: how is that related to salt returning the right hosts or not though [19:19:30] !log tools deleted tools-docker-builder-01 and -02, hosed hosts that are unused [19:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [19:19:52] mutante: not, it's related to reliability. If salt doesn't do the right thing, having people around to step in and fix it helps. [19:20:18] (where 'doesn't do the right thing' is probably 'doesn't run' rather than 'runs the wrong thing') [19:20:29] valhallasw`cloud: i have not seen an incident where people had to step in because debdeploy did something wrong [19:20:30] which is painful if it's a rollback [19:20:51] mutante: you don't remember the many months of 'salt does not work' complaints in production? [19:21:11] yea, and then it was fixed [19:21:18] oh, debdeploy doesn't use salt to run commands, it only uses it to get a host list [19:21:20] what bd808 said then [19:21:37] yea, and that too [19:21:41] PROBLEM - Host tools-docker-builder-01 is DOWN: CRITICAL - Host Unreachable (10.68.16.104) [19:21:55] I think the bad experiences were with using salt to run stuff [19:22:23] mutante: it's an issue of trust. I don't trust it to work, and hence there's no point in me using it. [19:22:27] we have gone down this road, and many issues are largely historical, but also a big part of salt reliability in prod is dedicated masters which we don't have, and the other element is supporting the salt service since salt is on demand discovery very fixed [19:22:29] well.. is it really easier to keep doing it different.. 
you just said how everybody uses their own tools [19:22:29] they we for hoping that salt would actually communicate with all hosts having a given grain [19:23:01] what chasemp said [19:23:26] what I was left with after talking to apergos last time around is, w/o more resources for salt here we are [19:23:36] ^ [19:23:38] and it also doesn't solve the matter of instances in labs for which the project has it's own salt master [19:23:45] for which there is no solution afaik [19:23:52] becuase our problems extend outside of tools [19:24:05] it's not that salt is bad, it's that the problem overlap for labs and prod isn't that large [19:24:14] re: instances and what they are doing what they need [19:24:24] so yeah, ssh is the best option as far as I know [19:26:04] gotcha [19:29:32] !log ores deploying 7992fd1 into web and worker nodes [19:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [19:33:37] !log ores running puppet agent manually in ores-web-03 [19:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [19:39:20] !log tools run sudo dpkg --configure -a on tools-worker-1007 to get it unstuck [19:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [19:43:44] !log tools delete devpi instance, not currently in use [19:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [19:44:43] PROBLEM - Host tools-devpi-01 is DOWN: CRITICAL - Host Unreachable (10.68.16.227) [19:58:56] 06Labs, 10Tool-Labs: currentevents is using 248G in Tools - https://phabricator.wikimedia.org/T136195#2327850 (10chasemp) p:05Triage>03High [19:59:48] RECOVERY - Host tools-devpi-01 is UP: PING OK - Packet loss = 0%, RTA = 1.17 ms [20:00:17] 06Labs, 10Tool-Labs: oar is using 207G on Tools - https://phabricator.wikimedia.org/T136201#2327852 (10chasemp) >>! In T136201#2327081, @A930913 wrote: > EdSaperia: "I'm trying to raise more funds for it > So keep would be nice" Can you elaborate on what this means and who this person is in relation to the tool? [20:00:59] 06Labs, 10Tool-Labs: templatetiger is using 613G in Tools out of 8T - https://phabricator.wikimedia.org/T136192#2327856 (10chasemp) >>! In T136192#2327360, @Kolossos wrote: > templatetiger should be after cleanup now at under 140GB on file system. What is your cleanup strategy? Files older than n days? [20:04:17] RECOVERY - Puppet staleness on tools-grid-shadow is OK: OK: Less than 1.00% above the threshold [3600.0] [20:06:30] 06Labs, 10Labs-Infrastructure: Copy graphite data from labmon1001 to an external HDD - https://phabricator.wikimedia.org/T136226#2327892 (10Krenair) [20:07:21] 06Labs, 10Labs-Infrastructure: Copy graphite data from labmon1001 to an external HDD - https://phabricator.wikimedia.org/T136226#2327899 (10yuvipanda) After more discussion, we decided to just put the service in downtime and copy the data, since the disks are already IO saturated anyway... [20:10:42] 06Labs, 10Labs-Infrastructure, 06Operations, 10ops-eqiad: connect usb external disk to labmon1001 - https://phabricator.wikimedia.org/T136242#2327910 (10RobH) [20:14:02] 06Labs, 10Tool-Labs, 07Tracking: Contact tool maintainters using large amounts of disk space (tracking) - https://phabricator.wikimedia.org/T136212#2327929 (10Emijrp) [20:14:04] 06Labs, 10Tool-Labs: currentevents is using 248G in Tools - https://phabricator.wikimedia.org/T136195#2327927 (10Emijrp) 05Open>03Resolved Hello! I just deleted all the files. 
[20:15:31] !log tools deleted tools-bastion-mtemp per chasemp
[20:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[20:15:44] tx
[20:17:26] PROBLEM - Host tools-bastion-mtemp is DOWN: CRITICAL - Host Unreachable (10.68.19.117)
[20:21:23] 06Labs, 10Tool-Labs, 07Tracking: Make toollabs reliable enough (tracking) - https://phabricator.wikimedia.org/T90534#2327943 (10yuvipanda)
[20:21:25] 06Labs, 10Tool-Labs: Set up sufficient monitoring for toollabs - https://phabricator.wikimedia.org/T90845#2327941 (10yuvipanda) 05Open>03Invalid Is too vague to be useful anymore, I think.
[20:33:26] 06Labs, 10Tool-Labs: wikidata-exports is using 256G in Tools - https://phabricator.wikimedia.org/T136194#2327970 (10chasemp) >>! In T136194#2326563, @mkroetzsch wrote: > We can easily delete old Wikidata dumps. However, history might be of interest. Is there any other record of Wikidata dumps anywhere? It woul...
[20:36:09] 06Labs, 10Tool-Labs: templatetiger is using 613G in Tools out of 8T - https://phabricator.wikimedia.org/T136192#2327979 (10chasemp) >>! In T136192#2327360, @Kolossos wrote: > templatetiger should be after cleanup now at under 140GB on file system. Is there something we can do to prevent this from happening?...
[20:38:48] 06Labs, 10DBA, 10Horizon: TGR unable to login on Horizon - https://phabricator.wikimedia.org/T131630#2327984 (10csteipp) On labswiki, the user table was created at a time when the collation wasn't explicitly set, so it's ``` CREATE TABLE `user` ( `user_id` int(10) unsigned NOT NULL AUTO_INCREMENT, `user_...
[20:40:26] I just deleted 2 GBs,
[20:40:34] I'm not sure how helpful it would be
[20:41:07] Amir1: it all helps :) thanks
[20:41:37] :)
[20:41:59] I think I can delete 9 GB of redundant data
[20:42:04] let me check
[20:47:08] chasemp: I just deleted one of the instances of Kian, freed 9 GB :)
[20:47:17] now, that's something
[20:51:46] I'm having a problem creating databases on tools-db.
[20:52:32] This: 'MariaDB [(none)]> create database s52590__api;
[20:52:32] ERROR 1044 (42000): Access denied for user 's52590'@'%' to database 's52590__api''
[20:55:50] tom29739: please file a bug
[20:56:19] https://graphite.wmflabs.org/render/?title=tools+cluster+Disk+space+last+day&width=800&height=250&from=-1day&hideLegend=false&uniqueLegend=true&target=aliasByNode%28sum%28tools.%2A.diskspace.%2A.byte_avail%29%2C-3%2C-2%29
[20:56:22] nice :D
[20:58:01] 06Labs, 10Labs-Infrastructure, 06Operations: rcstream not working for wikitech wiki - https://phabricator.wikimedia.org/T136245#2328038 (10Dzahn)
[21:01:07] it seems something took lots of space on 5/19 https://graphite.wmflabs.org/render/?title=tools+cluster+Disk+space+last+week&width=800&height=250&from=-1week&hideLegend=false&uniqueLegend=true&target=aliasByNode%28sum%28tools.%2A.diskspace.%2A.byte_avail%29%2C-3%2C-2%29
[21:03:01] 06Labs, 10Tool-Labs: Cannot create database with s52590 - https://phabricator.wikimedia.org/T136247#2328079 (10tom29739)
[21:03:07] 06Labs, 10Labs-Infrastructure, 06Operations: rcstream not working for wikitech wiki - https://phabricator.wikimedia.org/T136245#2328038 (10Krinkle) RCStream doesn't use channels (unlike the RC messages we send over IRCD, though even there IRCD auto-creates any channels we address messages at). It's one large "...
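On the tools-db problem reported at 20:51–20:52: user databases on tools-db are created by the tool's credential user and named with a <credentialuser>__<name> prefix, which is exactly what was attempted, so the Access denied error was filed as T136247 rather than being a naming mistake. A hedged sketch of the normally working flow, assuming the standard Tool Labs setup where the credential user and password live in the tool's replica.my.cnf (the database name is only an example):

```
# Run from the tool account on a Tool Labs bastion.
# replica.my.cnf holds the s52590-style credential user and its password.
mysql --defaults-file="$HOME/replica.my.cnf" -h tools-db \
      -e "CREATE DATABASE IF NOT EXISTS s52590__api;"

# Confirm the new database is visible to the credential user.
mysql --defaults-file="$HOME/replica.my.cnf" -h tools-db \
      -e "SHOW DATABASES LIKE 's52590%';"
```

In the exchange above this same kind of statement failed with ERROR 1044, which suggests the grants for that credential user were missing or wrong on the server side rather than a client-side mistake; hence the bug report.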
[21:08:16] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion: create conduit method for the creation of phabricator policy objects - https://phabricator.wikimedia.org/T135249#2328097 (10mmodell) @bd808: is the current implementation satisfactory?
[21:09:25] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion, 15User-bd808: Create application to manage Diffusion repositories for a Tool Labs project - https://phabricator.wikimedia.org/T133252#2328101 (10bd808)
[21:09:29] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion: create conduit method for the creation of phabricator policy objects - https://phabricator.wikimedia.org/T135249#2328099 (10bd808) 05Open>03Resolved >>! In T135249#2328097, @mmodell wrote: > @bd808: is the current implementation satisfactory...
[21:23:26] 06Labs, 10Tool-Labs, 13Patch-For-Review: Investigate Tool Labs webservice outage on 2016-05-25 - https://phabricator.wikimedia.org/T136162#2328226 (10yuvipanda)
[21:23:28] 06Labs, 10Tool-Labs, 13Patch-For-Review: Turn on puppet nag emails for tools too - https://phabricator.wikimedia.org/T136167#2328224 (10yuvipanda) 05Open>03Resolved a:03yuvipanda
[21:25:37] 06Labs, 10Tool-Labs: oar is using 207G on Tools - https://phabricator.wikimedia.org/T136201#2328244 (10A930913) This tool comes from the grant at https://meta.wikimedia.org/wiki/Grants:IEG/Open_Access_Reader applied for by Ed Saperia. He is saying that he is seeking more funds for further research and so keepi...
[21:27:26] feels like a silly question .. but how do i add a user to a group on a labs vm? I added myself to the docker group, and i can see `docker:x:117:ebernhardson` in /etc/group along with `group: files ldap` in /etc/nsswitch.conf. But after logging out and logging back in i'm not in the docker group
[21:28:46] so you want to add your ldap user to a local group I think
[21:28:52] which...yeah it's not a silly q :)
[21:29:00] err, sigh silly me. the problem is ssh ControlMaster maintaining the connection so it only faked logging out
[21:29:04] ha
[21:30:26] (03PS1) 10Dzahn: add fake dumpsdeploy key pair [labs/private] - 10https://gerrit.wikimedia.org/r/290804
[21:30:59] (03CR) 10Dzahn: [C: 032] add fake dumpsdeploy key pair [labs/private] - 10https://gerrit.wikimedia.org/r/290804 (owner: 10Dzahn)
[21:31:08] (03CR) 10Dzahn: [V: 032] add fake dumpsdeploy key pair [labs/private] - 10https://gerrit.wikimedia.org/r/290804 (owner: 10Dzahn)
[21:33:26] 06Labs, 10Labs-Infrastructure, 06Operations: rcstream not working for wikitech wiki - https://phabricator.wikimedia.org/T136245#2328279 (10Krenair) I don't have time to dig into this today but when I looked at `telnet rcs1001.eqiad.wmnet 6379` from silver earlier it would try IPv6 for a minute and fail, t...
[21:46:32] 06Labs, 10Tool-Labs: wikidata-exports is using 256G in Tools - https://phabricator.wikimedia.org/T136194#2328349 (10Lydia_Pintscher) >>! In T136194#2327970, @chasemp wrote: >>>! In T136194#2326563, @mkroetzsch wrote: >> We can easily delete old Wikidata dumps. However, history might be of interest. Is there an...
[21:50:27] I can't manage DNS proxies in Horizon. Is it a known bug?
[21:50:42] "Something went wrong! An unexpected error has occurred. Try refreshing the page. If that doesn't help, contact your local administrator."
[21:51:30] I'm not sure what the status of that is Krenair^?
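The 21:27–21:29 exchange above is a common trap: after adding an LDAP user to a local group, the new membership only appears in a genuinely fresh login session, and an ssh ControlMaster keeps the old master connection alive, so "logging out and back in" silently reuses the stale session. A minimal sketch of the fix, assuming ControlMaster is configured on the client side; the instance name below is a placeholder:

```
# On the instance: add the (LDAP) user to the local docker group.
# usermod -aG edits /etc/group, which is consulted before LDAP per the
# "group: files ldap" line in /etc/nsswitch.conf.
sudo usermod -aG docker ebernhardson

# On your workstation: tear down the cached ControlMaster connection so
# the next ssh really starts a new session that picks up the new group.
ssh -O exit instance-name.eqiad.wmflabs   # placeholder host name
ssh instance-name.eqiad.wmflabs id        # 'docker' should now be listed
```

Where it helps, `ssh -O check <host>` first reports whether a master connection is still alive, which is a quick way to spot the "only faked logging out" situation before chasing NSS or LDAP configuration.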
[21:52:17] https://horizon.wikimedia.org/project/proxy/
[22:58:56] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Security-Reviews: Security review of Tool Labs console application - https://phabricator.wikimedia.org/T135784#2328620 (10bd808)
[23:46:50] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion: Deploy "Striker" Tool Labs console to WMF production - https://phabricator.wikimedia.org/T136256#2328692 (10bd808)
[23:46:53] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Security-Reviews: Security review of Tool Labs console application - https://phabricator.wikimedia.org/T135784#2311122 (10bd808)
[23:46:57] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion: Deploy "Striker" Tool Labs console to WMF production - https://phabricator.wikimedia.org/T136256#2328710 (10bd808)
[23:49:12] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion, 15User-bd808: Create application to manage Diffusion repositories for a Tool Labs project - https://phabricator.wikimedia.org/T133252#2328723 (10bd808) Basic testing environment deployed at http://striker.wmflabs.org/. See http://devwiki-strik...