[00:25:38] PROBLEM - Puppet run on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [00:46:24] PROBLEM - Puppet run on tools-webgrid-generic-1405 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [00:46:26] PROBLEM - Puppet run on tools-webgrid-lighttpd-1205 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [00:47:04] yes that's all me ^ [00:50:32] PROBLEM - Puppet run on tools-webgrid-lighttpd-1202 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [01:03:21] PROBLEM - Puppet run on tools-worker-1003 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [01:26:26] RECOVERY - Puppet run on tools-webgrid-lighttpd-1205 is OK: OK: Less than 1.00% above the threshold [0.0] [01:30:26] RECOVERY - Puppet run on tools-webgrid-lighttpd-1202 is OK: OK: Less than 1.00% above the threshold [0.0] [01:30:46] RECOVERY - Puppet run on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [0.0] [02:00:39] PROBLEM - Puppet run on tools-webgrid-lighttpd-1406 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [02:56:49] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion: create conduit method for the creation of phabricator policy objects - https://phabricator.wikimedia.org/T135249#2325277 (10bd808) [05:04:04] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 383 bytes in 0.003 second response time [05:05:12] ^ tool labs is down [05:06:30] 06Labs, 10Tool-Labs: Tool Labs appears to be down - https://phabricator.wikimedia.org/T136162#2325315 (10Harej) [05:06:39] 06Labs, 10Tool-Labs: Tool Labs appears to be down - https://phabricator.wikimedia.org/T136162#2325328 (10Harej) p:05Triage>03Unbreak! [05:20:12] harej: could you be more specific please? [05:20:37] Loading a page on tools.wmflabs.org, including https://tools.wmflabs.org itself, results in a 503 error. [05:20:48] ok, so /a/ tool is down [05:20:53] are there any other symptoms? [05:20:54] No, I think they all are. [05:21:08] e.g.? [05:21:45] labs-morebots: how's it going? [05:21:46] I am a logbot running on tools-exec-1221. [05:21:46] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [05:21:46] To log a message, type !log . [05:22:37] https://tools.wmflabs.org/ https://tools.wmflabs.org/mix-n-match/ https://tools.wmflabs.org/wikidata-todo/ https://tools.wmflabs.org/xtools-ec/ all return "503 Service Temporarily Unavailable" [05:22:48] (yes xtools is not a good example I know but this is happening to every page on Tool Labs I try to load) [05:23:19] https://tools.wmflabs.org/wdq2sparql/w2s.php also returns that error [05:25:38] Yet, https://wpx.wmflabs.org/requests/en (Wikimedia Labs, but not Tool Labs) works fine. [05:26:14] harej: The xtools one often happens, sadly. Let me see if I can get it working. [05:26:24] But it's not just xtools! [05:26:31] Nothing on Tool Labs is loading. [05:26:45] Oh... OK. [05:30:31] the wikidata-game is also 503, ack [05:30:31] 06Labs, 10Tool-Labs: Tool Labs appears to be down - https://phabricator.wikimedia.org/T136162#2325363 (10Harej) To clarify, this is not just happening with one tool. Seemingly each tool I try to load results in the same error. URLs tested: * https://tools.wmflabs.org/ * https://tools.wmflabs.org/mix-n-match/... 
[05:33:51] !log tools rebooting tools-proxy-02 [05:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, dummy [05:34:10] did the logging bot just call you a dummy [05:34:17] always [05:35:04] ah, tools-proxy. makes sense [05:35:14] No change so far. [05:43:17] the error message that nginx shows seems to indicate redis connection issues [05:43:26] but it is running [05:44:55] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion: create conduit method for the creation of phabricator policy objects - https://phabricator.wikimedia.org/T135249#2325382 (10bd808) [05:45:12] andrewbogott: do we have to start redis on tools-proxy-02 ? [05:45:40] i see it's not running [05:45:44] mutante: I just failed over to -01 [05:45:56] ok! [05:46:30] …which doesn't seem to have helped, despite redis working properly there [05:47:15] it's running as a process, but : [05:47:22] Active: inactive (dead) [05:47:31] <_joe_> do you need me to take a look guys? [05:47:49] <_joe_> I am still sleepy though [05:48:16] <_joe_> andrewbogott: did you check that the backend for the homepage works? [05:48:28] i did this: [05:48:34] systemctl status redis-server [05:48:42] now Active: active (running) [05:48:58] but that didnt change it like you said [05:49:34] _joe_: I didn't check backends, although multiple tools are getting 503s and the redis on -02 seemed empty [05:49:37] so I'm failing over to -01 [05:49:40] which is not super fast [05:49:45] <_joe_> mutante: redis works on tools-proxy-01 and data is there [05:50:34] <_joe_> the site doesn't work on 01 either [05:50:36] _joe_: can you check -02? It seemed like it wasn't working there [05:50:43] _joe_: the fail-over isn't complete yet though... [05:50:57] <_joe_> andrewbogott: curl -H 'Host: tools.wmflabs.org' localhost/ [05:51:10] the nginx error log had "nginx attempt to send data on a closed socket" [05:52:04] <_joe_> it seems the proxy can't connect to the tools [05:52:06] I'm forcing some puppet runs now [05:52:16] The tools need to be updated as to which is the active proxy [05:52:18] puppet should do that [05:52:29] (this is via a hiera setting which it took me a few minutes to find) [05:52:36] <_joe_> andrewbogott: that won't change anything [05:52:56] _joe_: ok [05:53:11] active_proxy_host isn't tied to firewall rules or anything like that? [05:54:52] <_joe_> andrewbogott: so, the data in redis on -01 is _incorrect_ [05:55:31] 06Labs, 10Tool-Labs: Tool Labs appears to be down - https://phabricator.wikimedia.org/T136162#2325384 (10Urbanecm) https://tools.wmflabs.org/urbanecmbot/reliktyCswiki/ wasn't working a few seconds ago, after restarting with webservice restart (SSH is working) it works. Also my second tool (https://tools.wmflab... [05:55:32] proxy-01 says it cant connect to upstream at tools-webgrid-lidhhtpd-1415 [05:55:57] _joe_: ok, that's interesting... [05:56:00] <_joe_> mutante: because data in redis is outdated [05:56:04] is it incorrect in a consistent way? [05:56:14] <_joe_> andrewbogott: can we re-populate redis? [05:56:24] <_joe_> andrewbogott: I have no idea, I'm just looking at -01 [05:56:32] <_joe_> I suggest you page yuvi and chase [05:56:33] <_joe_> no2 [05:56:35] <_joe_> *now [05:56:39] YuviPanda is on his way [05:57:01] Instructions (such as they are) at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin#WebProxy [05:57:14] as far as I know, redis is populated dynamically by the webservices. 
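A minimal sketch of the triage being done here, assembled from the commands used above (the systemctl check and _joe_'s Host-header curl). It assumes it is run on the suspected proxy host itself (tools-proxy-01/-02 are the hosts in question); nothing else is specific:

    # Is redis up at all? nginx on the proxy reads its routing table from it.
    systemctl status redis-server --no-pager
    redis-cli ping

    # Ask nginx locally for the tools vhost, bypassing DNS / floating-IP
    # failover state, so the 503 can be pinned to this host or ruled out.
    curl -sv -o /dev/null -H 'Host: tools.wmflabs.org' http://localhost/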
I don't know how that works though [05:57:32] bd808: yeah, I'm doing the 'to switch over' now [05:57:36] <_joe_> bd808: they don't tell you how to recover from bad data in redis [05:57:47] <_joe_> and that's the issue; you have bad data in redis [05:57:50] although _joe_ has convinced me it won't help, I at least need to get it fully on one side or the other [05:57:54] <_joe_> someone fucked up the sync script [05:58:52] Is redis even running on 01? [05:59:01] <_joe_> bd808: it is [05:59:10] /usr/bin/redis-server *:6379 [05:59:19] <_joe_> so, redis-cli HGETALL prefix:rangecontri [05:59:26] <_joe_> *b [05:59:32] <_joe_> gets you a ip:port pair [05:59:47] <_joe_> which is definitely not being used [06:00:39] * andrewbogott wonders why https://tools.wmflabs.org/nagf/ works [06:00:41] <_joe_> it is consistently wrong on both proxies, too [06:00:48] <_joe_> andrewbogott: let me see [06:00:53] restarting the tools.admin webservice got that back up [06:00:56] > Created by @Krinkle. [06:00:59] that's why ;) [06:01:09] <_joe_> andrewbogott: it's on k8s [06:01:13] nagf was the last thing YuviPanda was messing with [06:01:16] <_joe_> and k8s uses a sync script I wrote [06:01:24] <_joe_> that's why :P [06:01:27] <_joe_> ori: ^^ [06:01:31] _joe_: k8s still uses redis/proxy doesn't it? [06:01:32] ah, ok [06:01:48] <_joe_> so the problem is specifically with gridengine => redis [06:01:56] <_joe_> did anyone touch any script? [06:02:11] I was looking in puppet history, didn't see anything [06:02:13] but I'll look again [06:02:34] restarting webservice processes seems to get them back up and running now [06:02:34] <_joe_> andrewbogott: I know jackshit about the GE => redis script [06:02:45] <_joe_> bd808: yes, that is probably working [06:02:48] bd808: can we just restart everything? [06:02:52] or are some things stateful? [06:02:56] <_joe_> something must have gone horribly wrong somewhere [06:03:09] <_joe_> ori: I assume we can, given the failure rate of toollabs [06:03:21] yeah [06:03:24] <_joe_> anything that cannot be randomly restarted would already be unusable [06:03:27] <_joe_> since forever [06:03:29] yeah, I don't know how to do it, but it has been done often [06:03:45] <_joe_> bd808: I can create a list of webservices to restart [06:03:52] <_joe_> if you don't have it [06:04:03] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 3669 bytes in 0.025 second response time [06:04:06] well, what do they do on restart that fixes things? register themselves with redis? [06:04:11] <_joe_> ori: yes [06:04:25] <_joe_> ori: I don't remember how that works exactly, though [06:04:31] I found the magic command [06:04:32] bd808: did you restart something to make the home page start working? [06:04:37] What tool was it? [06:04:38] <_joe_> or if there is a way to tell gridengine "dump me all" [06:04:51] nmap ;) [06:04:56] <_joe_> or a way to resync everything with our tools [06:04:56] andrewbogott: yes. tools.admin [06:05:00] qstat -q webgrid-generic -q webgrid-lighttpd -u '*' | awk '{ print $1;}' | xargs -L1 qmod -rj [06:05:11] of course [06:05:12] <_joe_> lol [06:05:17] I'm always surprised that there is a tool called 'tools' which is nonetheless not that [06:05:52] Shold I run that restart them all script [06:06:28] bd808: yes, go ahead. 
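For readers skimming the log, bd808's "magic command" above is reproduced here unchanged, with comments added:

    # Reschedule every webgrid job ("restart all webservices"):
    #  - qstat lists the jobs of every user in the two webgrid queues,
    #  - awk keeps the job id from column 1 (the qstat header rows come
    #    through as well; qmod just rejects them, which is harmless),
    #  - qmod -rj asks gridengine to reschedule each job.
    qstat -q webgrid-generic -q webgrid-lighttpd -u '*' \
      | awk '{ print $1;}' \
      | xargs -L1 qmod -rj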
[06:06:40] hello [06:06:53] !log tools Restarting all webservice jobs [06:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [06:07:01] bd808: hang on, let's see if YuviPanda needs to do a postmortem first [06:07:04] * YuviPanda reads backscroll [06:07:11] YuviPanda: so in short — redis is wrong [06:07:12] too late [06:07:18] bd808: ok :) [06:07:23] YuviPanda: redis /was/ wrong [06:07:24] get things back up first [06:07:28] postmortems later [06:07:30] about basically every tool [06:07:47] I found some empty entries on -02 so failed over to -01 [06:07:55] but that didn't help since -01 was wrong about everything as well [06:08:05] restarting things seems to help, so bd808 is restarting every webservice [06:09:00] it's done... [06:09:35] https://tools.wmflabs.org/xtools/ is still a 502, how long does it typically take? [06:09:36] * andrewbogott is watching https://tools.wmflabs.org/mix-n-match/ but honestly doesn't know if it ever worked [06:09:39] hmmm... sal is still down [06:09:51] you fixed the home page, but other tools like wikidata-game, no [06:10:13] https://tools.wmflabs.org/wikidata-game/ [06:10:15] my console is full of lines like "Pushed rescheduling of job 6725187 on host tools-webgrid-lighttpd-1415.tools.eqiad.wmflabs" from the one-liner [06:10:15] run MONITOR on the redis instance to see how tools are fixing themselves [06:11:09] so the restart is still being processed [06:11:12] A large number of jobs are in "Rr" state [06:11:15] that's good [06:11:29] 2) "http://10.68.23.222:57450" [06:11:31] it means that there is still a chance that you have fixed everything [06:11:34] is the entry for wikidata game [06:11:40] lighttpd 11009 tools.wikidata-game 4u IPv4 117036768 0t0 TCP *:34336 (LISTEN) [06:11:43] is where it's listening in [06:12:49] YuviPanda: Is it the case that tools register with the proxy redis when they come up, and then there's a different codepath that actively syncs them after the fact? [06:12:59] Or is it only when they first launch that they're registered? [06:13:33] andrewbogott: they register when they come up [06:13:40] I think I know the problem, stand by [06:13:49] bd808: you might have to run that script again in a minute... [06:13:58] I think the problem is that puppet is stuck on tools-services-01 [06:14:17] because of /public/dumps [06:14:19] <_joe_> and so the scripts keeps writing to the wrong redis? [06:14:48] bd808: am running that one liner now [06:17:03] nope, still not fixed [06:17:05] lighttpd 31977 tools.geohack 4u IPv4 5979985 0t0 TCP *:34543 (LISTEN) [06:17:09] 2) "http://10.68.18.54:37669" [06:17:12] hmm [06:17:12] you got puppet unstuck? [06:17:28] so I thought the problem was that the toollabs-webservice package on services was out of date [06:17:31] because puppet was stuck [06:17:34] but apparently not [06:17:35] (is services-01 really involved in the mix at all?) [06:17:36] ah [06:17:40] * bd808 invents -- qstat -q webgrid-generic -q webgrid-lighttpd -u '*' | tail -n+3 | awk '{print $5}' | sort | uniq -c [06:18:07] I actually did something else just now [06:18:12] changed the qmod -rj to qdel [06:18:16] So — the other variable is that I updated python-mwclient site-wide earlier today. There's NO WAY that could be connected to this, but I'm just throwing it out there. 
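bd808's freshly invented state-count one-liner and the switch from qmod -rj to qdel are easier to follow with labels. Both are sketched below; the second is a hedged reconstruction of the variant described, not a literal transcript of what was run:

    # Count webgrid jobs per gridengine state (r = running, Rr = restarted and
    # running again, qw = queued/waiting, ...). Column 5 of qstat is the state
    # field; tail drops the two header lines.
    qstat -q webgrid-generic -q webgrid-lighttpd -u '*' \
      | tail -n+3 \
      | awk '{print $5}' \
      | sort | uniq -c

    # The harder restart described below: delete the jobs outright instead of
    # rescheduling them, and let webservicemonitor on tools-services-01 notice
    # the missing webservices and bring them back up.
    # qstat -q webgrid-generic -q webgrid-lighttpd -u '*' | awk '{ print $1;}' | xargs -L1 qdel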
[06:18:20] and let webservicemonitor bring it up [06:18:28] andrewbogott: no, I was tweaking the toollabs-webservice package earlier [06:18:45] mine is just a count of jobs in each state in that queue [06:18:47] I tested it before I left, but clearly something I did there had caused this. I merged 3 patches [06:19:37] the count of 'r' state is slowly rising [06:20:02] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 383 bytes in 0.002 second response time [06:20:10] 2) "http://10.68.20.250:40415" [06:20:13] Briefly worked and now down again. [06:20:15] lighttpd 12647 tools.multidesc 4u IPv4 117049623 0t0 TCP *:40415 (LISTEN) [06:20:22] ok so that seems to be in sync now [06:20:55] Matthew_: which tool? [06:20:58] Matthew_: yeah, we're working on it. I restarted it manually before. It got killed in the last restart YuviPanda set off [06:21:08] YuviPanda: I've been testing on the main page. [06:21:24] right, it'll probably come back [06:21:25] 88 running now [06:21:27] I've got xtools working with a simple webservice restart. [06:21:30] webservicemonitor throttles things too I think [06:22:12] yeah the seem to be restarting in batches with a bit of a pause in between [06:22:51] I got xtools working with a "webservice restart" (I hope I didn't jump the gun) but it seems to be looking good. [06:23:07] Matthew_: yup, that should work too [06:23:14] I did a bunch of spot checks and they seem fine [06:23:39] OKay. [06:23:56] I just brought the landing page up manually [06:24:33] we didn't parallelize the webservice restarts because we figured it'd be rare and also that doing that might overload gridengine, which I think is an ok state to be in [06:25:04] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 3669 bytes in 1.458 second response time [06:25:12] thanks bd808 [06:25:26] wikidata-game works again, thanks [06:25:39] 196 running now [06:25:58] so… I think I missed a step. Was the solution restart, , restart again? [06:26:00] wait, no, kind of. the error page is now the tool labs page [06:26:04] Or did the original restart work and it just took a while? [06:26:34] andrewbogott: no, it was 1. make sure the version of toollabs-webservice on tools-services-01 was the same as rest of cluster, 2. delete all the webservices and have tools-services-01 bring them back up [06:26:40] the qmod -rj did not work but the qdel did [06:26:56] ah, ok [06:27:14] so is the toollabs-webservice bit a red herring? [06:27:20] YuviPanda: you should update -- https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin#Restarting_all_webservices [06:27:26] (I still don't understand why tools-services-01 has anything to do with proxying) [06:27:37] andrewbogott: bd808 https://gerrit.wikimedia.org/r/#/c/290612/ is the culprit. new version knew to pass '--register-proxy' to the webservice-runner, but old version did *not*. tools-services-01 was the old one... [06:28:08] 300 running now [06:28:09] bd808: I think the -rj works fine in most cases, this was the problem because of the version mismatches in various places... [06:28:15] Well, as long as we all agree that this can be blamed on NFS [06:28:35] andrewbogott: it runs webservicemonitor, which is responsible for checking if webservices are up and restarting them if not [06:28:40] mutante: which url is back down? [06:29:03] YuviPanda: why did that matter, though? 
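The spot checks being pasted here (a redis entry from the proxy next to a lighttpd LISTEN line from an exec node) follow a pattern worth writing down. A sketch, with the tool name as an example:

    # Cross-check one tool: does the backend the proxy's redis advertises match
    # what the tool's lighttpd is actually listening on?
    TOOL=wikidata-game

    # On the active proxy: the registered backend, e.g. "http://10.68.23.222:57450"
    redis-cli HGETALL "prefix:${TOOL}"

    # On the exec node running the job (find it with qstat): the real listener
    sudo lsof -nP -iTCP -sTCP:LISTEN | grep "tools.${TOOL}"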
[06:29:25] Did the issue only affect services which died and were restarted by webservicemonitor this evening? [06:29:33] YuviPanda: https://tools.wmflabs.org/wikidata-game/ but it shows "No webservice" now, not 503 anymore [06:29:55] well, 503 but the fancier error page [06:31:18] mutante: I manually restarted it, is back up [06:31:21] gridengine is struggling a bit now [06:31:29] 415 up [06:31:49] YuviPanda: thanks, works:) [06:33:12] feels like it's faster than before [06:33:45] 504 running [06:33:59] bd808: there was a total of ~600 something right? [06:34:21] something near that I think [06:37:01] looks like there should be ~680 when they are all back up. 592 now [06:37:59] the count or 'r' + 'Rr' before the mass qdel was about 680'ish [06:39:00] bd808: webservicemonitor is still restarting things [06:39:17] bd808: and by the time it's finished this round it'll probably pick up another set in a second round [06:41:23] looks like it has capped out at 669 [06:42:03] it has bumped to 670 and then back down to 669 on the last few counts I've run [06:42:17] yeah, the 'bub' tool is flapping [06:42:25] * bd808 is amazed there is only one flapping [06:42:37] bd808: me or valhallasw`cloud kill them now and then when we look at logs [06:42:58] hasn't bub been flapping for forever? [06:43:07] possibly [06:43:57] hm, I thought there was a Task on that, but I can't find it [06:44:10] I just stopped it now [06:44:56] no, it was definitely online last night (see access.log) [06:45:30] imma head to bed. night folks (and morning joe) [06:46:02] * andrewbogott waves [06:46:27] 'night bd808 [06:47:03] YuviPanda: service.manifest is still there, and WSW tries to restart it? [06:47:24] valhallasw`cloud: yeah, I just rm'd it [06:47:26] thanks bd808 [06:48:42] andrewbogott: _joe_ ori https://etherpad.wikimedia.org/p/tools-web-outage-2016-05-25 I wrote up sequence of events as far as I can tell (includes causes I believe) [06:49:57] Heh, I was just about to ask and then noticed that you already wrote ???? in the spot where my question would go [06:50:42] andrewbogott: yeah, I'm not sure what happened there. [06:50:50] andrewbogott: something must've triggered a restart of those things [06:51:44] andrewbogott: heh, puppet's been stuck for '6660' hours now [06:52:31] andrewbogott: other than the ??? does the rest make sense? [06:53:45] Yeah, it all makes sense [06:54:00] and the ???? bit isn't that crazy, since in general webservices can restart just fine [06:54:27] so maybe they do it all the time and we don't notice [06:54:38] (which, I guess, would be a good thing to know if true) [06:54:39] andrewbogott: this data is collected in graphite actually [06:54:43] let me find [06:55:30] 06Labs, 10Tool-Labs: Tool Labs appears to be down - https://phabricator.wikimedia.org/T136162#2325473 (10Urbanecm) p:05Unbreak!>03High Lowering the priority because Tool Labs is working, so this task is about finding why Tool Labs wasn't accessable. [06:56:11] labs graphite has gotten unbearably slow now [06:57:17] 06Labs, 10Tool-Labs: Turn on puppet nag emails for tools too - https://phabricator.wikimedia.org/T136167#2325475 (10yuvipanda) [06:57:41] now that I think about it, I bet that the nag emails aren't triggered for hangs. 
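The root cause identified above, tools-services-01 restarting webservices with an older toollabs-webservice that did not pass '--register-proxy', implies a simple cross-host check: compare the installed package version on the services host against the rest of the cluster. A sketch, with an illustrative host list:

    # Compare the toollabs-webservice version across hosts; any mismatch with
    # tools-services-01 is suspicious. Host names here are examples only.
    for host in tools-services-01 tools-cron-01 tools-webgrid-lighttpd-1415; do
      printf '%-32s ' "$host"
      ssh "$host" "dpkg-query -W -f='\${Version}\n' toollabs-webservice" 2>/dev/null \
        || echo 'unreachable'
    done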
[06:58:10] 06Labs, 10Tool-Labs: Switch toollabs-webservice to be deployed with an actual deployment mechanism - https://phabricator.wikimedia.org/T136168#2325488 (10yuvipanda) [06:58:16] andrewbogott: am filing bugs for action items [06:59:35] andrewbogott: I can't get the metrics out of graphite because it has slowed completely to a crawl [06:59:42] that's ok [06:59:56] I presume they'll still be there in the morning [07:00:25] andrewbogott: one hopes [07:00:40] 06Labs, 10Tool-Labs: Tool Labs appears to be down - https://phabricator.wikimedia.org/T136162#2325315 (10yuvipanda) https://etherpad.wikimedia.org/p/tools-web-outage-2016-05-25 for some ad-hoc notes on what happened. [07:00:55] 06Labs, 10Tool-Labs: Investigate Tool Labs webservice outage on 2016-05-25 - https://phabricator.wikimedia.org/T136162#2325503 (10yuvipanda) [07:01:51] YuviPanda: I'm going to go back to bed — I emailed the list earlier and it seems like things are mostly working for the moment. [07:02:08] andrewbogott: yeah. I'll do something like that too. [07:02:14] andrewbogott: thanks for paging and taking a look! [07:02:24] I hope you're enjoying Chicago! [07:02:36] andrewbogott: it's been amazing and I've a few more days left [07:19:36] !log tools hard reboot tools-services-01, was completely stuck on /public/dumps [07:19:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [07:25:32] 06Labs, 10Tool-Labs, 10DBA: enwiki_p replica on s1 is corrupted - https://phabricator.wikimedia.org/T134203#2257889 (10jcrespo) The replica is not corrupt, it just has drifted from production, failing to delete and insert some records, for several reasons: the main ones is crashing while using non-transactio... [07:25:56] 06Labs, 10Tool-Labs, 10DBA: enwiki_p replica on s1 is corrupted - https://phabricator.wikimedia.org/T134203#2325581 (10jcrespo) a:03jcrespo [08:10:24] 10PAWS: There should be a way, how to copy/paste a text from/to PAWS - https://phabricator.wikimedia.org/T136119#2323829 (10Framawiki) i use a b.sh bash file that can be call with ``` bash b.sh ``` ;) [08:11:34] RECOVERY - Puppet run on tools-exec-1403 is OK: OK: Less than 1.00% above the threshold [0.0] [08:16:37] 10PAWS: I can not write some special characters in PAWS - https://phabricator.wikimedia.org/T136118#2323791 (10Framawiki) you can use special chars with a bash file : T136119 but not directly in terminal [09:19:51] RECOVERY - Puppet staleness on tools-elastic-02 is OK: OK: Less than 1.00% above the threshold [3600.0] [09:24:39] 06Labs, 10Labs-Infrastructure, 06Operations, 13Patch-For-Review: rename holmium to labservices1002 - https://phabricator.wikimedia.org/T106303#1464173 (10Volans) I've applied the tendril grant from `/etc/mysql/production-grants.sql` (and only that one) required to have tendril monitor this host and added t... 
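The tools-services-01 reboot logged above was for a hard-hung /public/dumps NFS mount, the same hang that left puppet "stuck for 6660 hours". A cheap way to spot that condition, sketched with the usual tools mounts as examples:

    # stat on a dead NFS mount blocks indefinitely, so wrap it in a timeout
    # (a bad enough hang can defeat even this, but it catches the common case).
    for mount in /public/dumps /data/project /home; do
      if timeout -k 5 10 stat -t "$mount" >/dev/null 2>&1; then
        echo "OK    $mount"
      else
        echo "STUCK $mount (or not mounted)"
      fi
    done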
[10:02:22] 06Labs: Make user_email_authenticated status visible on labs - https://phabricator.wikimedia.org/T70876#2325775 (10Danny_B) [10:14:05] RECOVERY - Puppet run on tools-webgrid-lighttpd-1209 is OK: OK: Less than 1.00% above the threshold [0.0] [10:22:33] 10PAWS: I can not write some special characters in PAWS - https://phabricator.wikimedia.org/T136118#2325875 (10Dvorapa) @Framawiki thank you [10:23:18] 10PAWS: There should be a way, how to copy/paste a text from/to PAWS - https://phabricator.wikimedia.org/T136119#2325880 (10Dvorapa) @Framawiki thank you, good idea [11:02:26] Hi [11:02:30] There seems to be a lag [11:02:49] between when Wikipedia updates and when labs sees the newly updated versions in the API [11:02:54] How long is this lag? [11:03:25] Noted the lag here - https://quarry.wmflabs.org/query/6052 [11:03:50] There's stuff in the query which is appearing which, based on the query and what's ACTUALLY in English Wikipedia, shouldn't be [11:05:09] ShakespeareFan00, there are imports ongoing [11:05:29] Okay [11:05:40] there is https://tools.wmflabs.org/replag/ [11:06:01] but aside from that, there are small periods of revisions missing and then coming back [11:06:22] it is either that (small periods of glitches) or bringing labs 100% down [11:06:39] if you want to programmatically detect that [11:06:49] Hmm, 2 hour replication lag isn't bad compared with the toolserver's 48 hour lags :) [11:07:17] see https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database#Identifying_lag [11:07:31] and you can, for example, avoid doing queries while lag is > max [11:08:02] the good news is that those imports will fix a huge amount of differences against production [11:08:12] so it is worth it [11:12:37] Oh and I had a need for a rather intensive tool [11:12:52] Essentially it's a blame tool [11:13:25] but I am trying to find a list of files where a GFDL tag or similar was added by someone other than the uploader of the file [11:13:53] This basically needs to do a revisions scan AND a grep of page text [11:13:55] :( [11:15:04] Why is it only enwiki that gets big lags?
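jynus's advice above, check the replication lag and skip work while it is above a maximum, can be written as a small guard. This is a sketch only: it assumes the heartbeat_p.heartbeat view described in the linked "Identifying lag" section (with shard and lag columns), uses enwiki's s1 shard and the enwiki.labsdb alias as examples, and picks an arbitrary threshold:

    MAX_LAG=300   # seconds; arbitrary example threshold

    LAG=$(mysql --defaults-file="$HOME/replica.my.cnf" -h enwiki.labsdb -BN \
          -e "SELECT ROUND(lag) FROM heartbeat_p.heartbeat WHERE shard = 's1'" 2>/dev/null)

    # Treat "could not read the lag" the same as "too much lag".
    if [ "${LAG:-999999}" -gt "$MAX_LAG" ]; then
      echo "replication lag too high (${LAG:-unknown} s), skipping this run"
      exit 0
    fi
    # ... run the real queries here ...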
[11:15:06] XD [11:16:27] it is me [11:17:10] for the import, I need to sync production and labs, and for that I need to stop them in the same position while the import is ongoing [11:17:27] Fair enough [11:17:35] I can do something else for a bit [11:17:37] :) [11:22:20] ShakespeareFan00: for such a tool, you're probably better off parsing the dumps [11:22:51] Hmm [11:22:54] I might ask aabout that [11:33:10] 10PAWS: Paste does not work in PAWS terminal - https://phabricator.wikimedia.org/T120633#2326040 (10Framawiki) [11:33:13] 10PAWS: There should be a way, how to copy/paste a text from/to PAWS - https://phabricator.wikimedia.org/T136119#2326042 (10Framawiki) [11:37:12] 10Labs-project-wikistats: Update lietuvai.lt statistics URLs - https://phabricator.wikimedia.org/T136183#2326049 (10Nemo_bis) [11:37:21] 10Labs-project-wikistats: Update lietuvai.lt statistics URLs - https://phabricator.wikimedia.org/T136183#2326061 (10Nemo_bis) p:05Triage>03Low [11:38:19] 10PAWS: I can not write some special characters in PAWS - https://phabricator.wikimedia.org/T136118#2326062 (10Framawiki) 05Open>03Resolved a:03Framawiki [11:40:54] 10PAWS: There should be a way, how to copy/paste a text from/to PAWS - https://phabricator.wikimedia.org/T136119#2326065 (10jayvdb) [11:40:56] 10PAWS: Paste does not work in PAWS terminal - https://phabricator.wikimedia.org/T120633#2326067 (10jayvdb) [11:41:27] 10PAWS: Paste does not work in PAWS terminal - https://phabricator.wikimedia.org/T120633#1857587 (10jayvdb) 05duplicate>03Open [11:43:35] 10PAWS: There should be a way, how to copy/paste a text from/to PAWS - https://phabricator.wikimedia.org/T136119#2326069 (10jayvdb) @Framawiki , thank you for finding and fixing the duplicate ;-) . In future, please close duplicates which are latter (higher number), leaving open the earlier task open (lower num... [11:53:22] 10Labs-project-wikistats, 10Internet-Archive: Remove some big former MediaWiki sites - https://phabricator.wikimedia.org/T136184#2326077 (10Nemo_bis) [11:54:04] 10Labs-project-wikistats, 10Internet-Archive: Remove some big former MediaWiki sites - https://phabricator.wikimedia.org/T136184#2326092 (10Nemo_bis) p:05Triage>03Normal [13:20:55] 06Labs, 10Tool-Labs, 13Patch-For-Review: Investigate Tool Labs webservice outage on 2016-05-25 - https://phabricator.wikimedia.org/T136162#2325315 (10chasemp) >>! In T136162#2325501, @yuvipanda wrote: > https://etherpad.wikimedia.org/p/tools-web-outage-2016-05-25 for some ad-hoc notes on what happened. Than... [14:05:46] 06Labs, 10Tool-Labs: zoomviewer is using 837G out of 8T for Tools - https://phabricator.wikimedia.org/T136190#2326298 (10chasemp) [14:08:42] 06Labs, 10Tool-Labs: zoomviewer is using 837G out of 8T for Tools - https://phabricator.wikimedia.org/T136190#2326312 (10chasemp) p:05Triage>03High [14:11:21] 06Labs, 10Tool-Labs: templatetiger is using 613G in Tools out of 8T - https://phabricator.wikimedia.org/T136192#2326337 (10chasemp) [14:14:31] 06Labs, 10Tool-Labs: templatetiger is using 613G in Tools out of 8T - https://phabricator.wikimedia.org/T136192#2326337 (10jcrespo) It holds also 165G on toolsdb. [14:16:53] hey there is till 7TB+ free!! [14:24:47] 06Labs, 10Tool-Labs: wikidata-exports is using 256G in Tools - https://phabricator.wikimedia.org/T136194#2326413 (10chasemp) p:05Triage>03Normal [14:27:32] 06Labs, 10Tool-Labs: wikidata-exports is using 256G in Tools - https://phabricator.wikimedia.org/T136194#2326415 (10chasemp) I wasn't able to find any of the maintainers on Phabricator. 
I emailed Markus via contact information found through this user page. [14:27:48] PROBLEM - Free space - all mounts on tools-worker-1004 is CRITICAL: CRITICAL: tools.tools-worker-1004.diskspace.root.byte_percentfree (<10.00%) [14:29:03] hashar: what are you talking about? [14:29:22] randomly mumbling about templatetiger is using 613G in Tools out of 8T :D [14:29:26] ignore me! [14:30:07] :) [14:34:32] 06Labs, 10Tool-Labs: currentevents is using 248G in Tools - https://phabricator.wikimedia.org/T136195#2326421 (10chasemp) [14:36:13] PROBLEM - SSH on tools-webgrid-lighttpd-1408 is CRITICAL: Server answer [14:41:43] 06Labs, 10Tool-Labs: icelab is using 245G in Tools - https://phabricator.wikimedia.org/T136197#2326481 (10chasemp) p:05Triage>03High [14:47:12] 06Labs, 10Tool-Labs: wikiviewstats is using 232G on Tools - https://phabricator.wikimedia.org/T136198#2326489 (10chasemp) [14:56:54] 06Labs, 10Tool-Labs: oar is using 207G on Tools - https://phabricator.wikimedia.org/T136201#2326542 (10chasemp) [15:05:23] 06Labs, 10Tool-Labs: wikidata-exports is using 256G in Tools - https://phabricator.wikimedia.org/T136194#2326563 (10mkroetzsch) We have two kinds of large data files: biweekly Wikidata json entity dumps and RDF exports that we generate from them. The RDF exports are what we offer through our website http://too... [15:07:02] 06Labs, 10Labs-Infrastructure, 06Operations, 13Patch-For-Review: rename holmium to labservices1002 - https://phabricator.wikimedia.org/T106303#2326565 (10Andrew) @Volans -- thanks! [15:10:14] 06Labs, 10Tool-Labs: toolserver-home-archive is using 52G on Tools - https://phabricator.wikimedia.org/T136202#2326568 (10chasemp) [15:23:56] 06Labs, 10Tool-Labs: zoomviewer is using 837G out of 8T for Tools - https://phabricator.wikimedia.org/T136190#2326630 (10dschwen) Hey Chase, I went ahead and deleted cache entries older than 90 days. ``` find /data/project/zoomviewer/public_html/cache -mtime +90 -delete ``` I can put this in the tool's cron... [15:47:46] 06Labs, 10Tool-Labs: liangent-php is using 348G on Tools - https://phabricator.wikimedia.org/T136208#2326730 (10chasemp) [15:52:30] 06Labs, 10Tool-Labs: zoomviewer is using 837G out of 8T for Tools - https://phabricator.wikimedia.org/T136190#2326761 (10chasemp) @dschwen, thank you. This tool just went from the #1 space user to somewhere around #20. Much appreciated. I believe our mounts are all done w/ `noatime`. Let me know what you d... 
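dschwen's offer to "put this in the tool's cron" could look roughly like the entry below; the find line is copied from the comment above, the schedule is arbitrary, and on tool labs the command would conventionally be wrapped in jsub so it runs on the grid rather than on the cron host:

    # crontab -e as the zoomviewer tool account
    # m  h  dom mon dow  command
      30 2  *   *   *    find /data/project/zoomviewer/public_html/cache -mtime +90 -delete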
[15:58:26] 06Labs, 10Tool-Labs: wikidata-analysis is using 153G on Tools - https://phabricator.wikimedia.org/T136211#2326813 (10chasemp) [16:05:40] 06Labs, 10Tool-Labs, 07Tracking: Contact tool maintainters using large amounts of disk space - https://phabricator.wikimedia.org/T136212#2326835 (10bd808) [16:05:53] 06Labs, 10Tool-Labs, 07Tracking: Contact tool maintainters using large amounts of disk space - https://phabricator.wikimedia.org/T136212#2326851 (10bd808) p:05Triage>03Normal [16:07:56] 06Labs, 10Tool-Labs: wikidata-analysis is using 153G on Tools - https://phabricator.wikimedia.org/T136211#2326886 (10bd808) [16:07:58] 06Labs, 10Tool-Labs: toolserver-home-archive is using 52G on Tools - https://phabricator.wikimedia.org/T136202#2326888 (10bd808) [16:08:00] 06Labs, 10Tool-Labs: oar is using 207G on Tools - https://phabricator.wikimedia.org/T136201#2326889 (10bd808) [16:08:02] 06Labs, 10Tool-Labs: wikiviewstats is using 232G on Tools - https://phabricator.wikimedia.org/T136198#2326890 (10bd808) [16:08:04] 06Labs, 10Tool-Labs: icelab is using 245G in Tools - https://phabricator.wikimedia.org/T136197#2326891 (10bd808) [16:08:06] 06Labs, 10Tool-Labs: wikidata-exports is using 256G in Tools - https://phabricator.wikimedia.org/T136194#2326893 (10bd808) [16:08:08] 06Labs, 10Tool-Labs: currentevents is using 248G in Tools - https://phabricator.wikimedia.org/T136195#2326892 (10bd808) [16:08:10] 06Labs, 10Tool-Labs: zoomviewer is using 837G out of 8T for Tools - https://phabricator.wikimedia.org/T136190#2326895 (10bd808) [16:08:23] 06Labs, 10Tool-Labs, 07Tracking: Contact tool maintainters using large amounts of disk space - https://phabricator.wikimedia.org/T136212#2326885 (10bd808) [16:08:25] 06Labs, 10Tool-Labs: templatetiger is using 613G in Tools out of 8T - https://phabricator.wikimedia.org/T136192#2326894 (10bd808) [16:09:29] bd808: hi, can you delete an unused tool from toollabs for me? (I'm maintainer) [16:09:34] the directory is already empty [16:10:32] Luke081515: open a phab task and link it to T133777 [16:10:32] T133777: [Tracking] Tools that should get deleted - https://phabricator.wikimedia.org/T133777 [16:10:43] ok [16:11:04] I think there is some complicated dance that has to be done [16:12:11] 06Labs, 10Tool-Labs, 07Tracking: Delete tool 'rcm' - https://phabricator.wikimedia.org/T136216#2326965 (10Luke081515) [16:12:21] bd808: {{Done}} :) [16:12:38] 06Labs, 10Tool-Labs: Delete tool 'rcm' - https://phabricator.wikimedia.org/T136216#2326980 (10bd808) [16:12:43] ah, you were faster :D [16:13:23] It would be really neat if we could somehow flag tags that shouldn't be inherited like tracking, patch-for-review, etc [16:13:51] probably overcomplicating the world though [16:15:34] never occurred to me actually, are there others than 'tracking' [16:15:41] we would want to flag as uninheritable? [16:15:55] patch-for-review and upstream I think [16:16:01] probably a few more [16:16:18] maybe anything that is a "tag" type? [16:16:57] 10PAWS: I can not write some special characters in PAWS - https://phabricator.wikimedia.org/T136118#2327000 (10Dvorapa) @Framawiki Does it mean it is fixed? Or not? Or could I reopen it as a proposal? 
[16:18:25] my internet is shitty today :/ [16:18:49] I think my 6-in-4 tunnel is being flaky [16:19:24] 06Labs, 10Tool-Labs: tools.suggestbot web requests fail after a period of time - https://phabricator.wikimedia.org/T133090#2327015 (10bd808) [16:19:26] 10PAWS: I can not write some special characters in PAWS - https://phabricator.wikimedia.org/T136118#2327017 (10yuvipanda) 05Resolved>03Open I think we should re-open this, since writing it in a file is cumbersome. I'll figure out the upstream task for this.. [16:20:05] 10PAWS: I can not write some special characters in PAWS - https://phabricator.wikimedia.org/T136118#2327024 (10Dvorapa) @yuvipanda ok, thank you [16:20:07] RECOVERY - Puppet run on tools-webgrid-generic-1403 is OK: OK: Less than 1.00% above the threshold [0.0] [16:20:23] andrewbogott: where does the puppet nagging code live, btw? [16:21:38] modules/base/files/labs/puppetalert.py [16:22:38] YuviPanda: the good bit is [16:22:39] if hiera('send_puppet_failure_emails', false) [16:22:52] in modules/base/manifests/labs.pp [16:23:05] yeah [16:23:13] I'm going to file a task first and then turn it back on [16:25:47] 06Labs, 10Tool-Labs: liangent-php is using 348G on Tools - https://phabricator.wikimedia.org/T136208#2327065 (10liangent) mw-log is really activity log, for all activities in my php bot, which is actually $wgDebugLogFile. There was a time people asked me "why my bot is doing (something) in (some way)" and the... [16:27:12] 06Labs, 10Tool-Labs: currentevents is using 248G in Tools - https://phabricator.wikimedia.org/T136195#2327075 (10chasemp) In particular is it possible that files such as: ```./currentevents/dumps/enwiki/20151201/enwiki-20151201-pages-meta-history2.xml-p000018040p000019712.7z: 177M ./currentevents/dumps/enwiki... [16:29:30] 06Labs, 10Tool-Labs: oar is using 207G on Tools - https://phabricator.wikimedia.org/T136201#2327081 (10A930913) EdSaperia: "I'm trying to raise more funds for it So keep would be nice" [16:33:50] 06Labs, 10Tool-Labs, 07Tracking: Contact tool maintainters using large amounts of disk space (tracking) - https://phabricator.wikimedia.org/T136212#2327094 (10Danny_B) [16:33:57] 06Labs, 10Tool-Labs: [Tool Labs] Database credential file replica.my.cnf missing in my home directory on Tool Labs (/home/wiki13). - https://phabricator.wikimedia.org/T122657#2327096 (10bd808) [16:33:58] 06Labs, 10Tool-Labs: Restore replica.my.cnf for toolsbeta.admin - https://phabricator.wikimedia.org/T109807#2327098 (10bd808) [16:34:00] 06Labs, 10Tool-Labs, 07Tracking: Tool Labs users missing replica.my.cnf (tracking) - https://phabricator.wikimedia.org/T135931#2327095 (10bd808) [16:52:13] How do I set a php.ini config value for my tool? Symfony is moaning because date.timezone isn't set, and it won't run without it: ' * date.timezone setting must be set [16:52:13] > Set the "date.timezone" setting in php.ini* (like Europe/Paris). [16:52:14] ' [16:53:10] tom29739: php.my.ini is available, iirc, and you can probably use ini_set at the top of your index.php? [16:53:17] before you import any symfony stuff [16:53:50] (I can't say I understand why symfony doesn't do that for you) [16:54:28] It can't choose a default timezone. [16:55:04] If they included something like that, then people might be annoyed because it overrides the php.ini default. [16:55:05] it can, based on configuration parameters the programmer provides [16:55:26] That's the only reason I can think of. 
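valhallasw's suggestions above (a per-tool INI override, or ini_set at the top of index.php before Symfony loads) can be made concrete. This sketch uses PHP's standard per-directory .user.ini mechanism and assumes a typical tool web root, with UTC purely as an example value; calling date_default_timezone_set('UTC') in index.php would achieve the same from inside PHP:

    # Drop a per-directory INI override next to the tool's PHP entry point so
    # FastCGI PHP picks up the timezone for everything served from there.
    printf 'date.timezone = "UTC"\n' > "$HOME/public_html/.user.ini"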
[16:55:32] having a default in php.ini makes no sense, because there is no sane default for all applications on the same webserver [16:55:42] (other than, maybe, 'utc') [16:55:59] That's what I thought the default would be. [17:02:11] I am now confused: 'Default timezone => UTC' and 'date.timezone => no value => no value'. [17:02:25] Shouldn't they be the same? [17:02:29] there is a rant from Tim somewhere in the code comments about the decision by PHP/Zend to leave it unset by default [17:02:53] I tried to set it in php.my.ini. [17:04:12] (03PS1) 10Alexandros Kosiaris: keyholder: service_deploy to deploy_service [labs/private] - 10https://gerrit.wikimedia.org/r/290718 [17:06:51] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] keyholder: service_deploy to deploy_service [labs/private] - 10https://gerrit.wikimedia.org/r/290718 (owner: 10Alexandros Kosiaris) [17:18:55] !log tools fixed hhvm upgrade on tools-cron-01 [17:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [17:39:09] 06Labs, 10Tool-Labs: Unmount unneeded NFS mounts from tool labs hosts - https://phabricator.wikimedia.org/T136222#2327313 (10yuvipanda) [17:48:09] 06Labs, 10Tool-Labs: Unmount unneeded NFS mounts from tool labs hosts - https://phabricator.wikimedia.org/T136222#2327340 (10yuvipanda) Host types that should have no NFS: 1. k8s master 2. k8s etcd hosts 3. proxies 4. redises Host types that should *only* have /data/project: 1. services hosts (for manifest co... [18:02:16] 06Labs, 10Tool-Labs, 07Tracking: Contact tool maintainters using large amounts of disk space (tracking) - https://phabricator.wikimedia.org/T136212#2326835 (10Kolossos) templatetiger should be after cleanup now at under 140GB. [18:03:05] 06Labs, 10Tool-Labs: templatetiger is using 613G in Tools out of 8T - https://phabricator.wikimedia.org/T136192#2327360 (10Kolossos) templatetiger should be after cleanup now at under 140GB on file system. [18:14:05] (03CR) 10Merlijn van Deen: [C: 032] add hostname to userinfo [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/289899 (owner: 10Merlijn van Deen) [18:14:38] (03Merged) 10jenkins-bot: add hostname to userinfo [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/289899 (owner: 10Merlijn van Deen) [18:15:52] * valhallasw`cloud prods wikibugs [18:17:45] !log tools.wikibugs valhallasw: Deployed 6b863811ff4a2ce9230eabce141f802854cd33f7 Merge "add hostname to userinfo" wb2-irc [18:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL, Master [18:18:48] 'VERSION Wikibugs v2.1, http://tools.wmflabs.org/wikibugs/ ,running on tools-exec-1403.tools.eqiad.wmflabs' [18:18:48] wheee [18:18:56] the spacing isn't quite right, but oh well [18:24:10] Is there any easy way to set the php.ini used by lighttpd? I've tried loads... [18:24:52] tom29739: https://secure.php.net/manual/en/configuration.file.per-user.php [18:25:08] tom29739: and, as suggested before, ini_set [18:25:30] in the specific case of timezones, date_default_timezone_set should also work [18:28:32] YuviPanda: you used to have a link to a presentation on your site or blog or something called "the funniest presentation" or funniest conference presentation ever [18:28:34] can't find it [18:28:42] do you recall this at all? :) [18:29:08] 06Labs, 10Labs-Infrastructure: I/O on labmon1001 is very slow - https://phabricator.wikimedia.org/T127957#2059611 (10RobH) So labmon1001 went out of warranty this last March. 
I'd suggest the re-partitioning to use all the disks should be attempted before we purchase new hardware to go into an outdated system.... [18:29:29] chasemp: do you recall what it was about? I vaguely remember it [18:29:47] nope :) just that the dude made a joke like every 30s and it was entertaining [18:29:59] chasemp: oooh, yes I remember. it was the microsoft nightwatch guy [18:30:12] yes [18:30:15] chasemp: james mickens [18:31:34] the Mossad or not Mossad talk ;) [18:31:38] I think? [18:32:31] chasemp: https://vimeo.com/111122950 maybe? [18:32:39] chasemp: or https://vimeo.com/95066828 perhaps [18:32:46] https://vimeo.com/95066828 [18:32:49] yes [18:37:29] I need to reboot bastion-02 [18:37:35] I broadcast to all users on the host [18:40:15] RECOVERY - Puppet run on tools-exec-1210 is OK: OK: Less than 1.00% above the threshold [0.0] [18:40:24] 06Labs, 10Tool-Labs: Backup and/or puppetize @toolserver.org mail forwards - https://phabricator.wikimedia.org/T136225#2327428 (10valhallasw) [18:44:56] 06Labs, 10Labs-Infrastructure: I/O on labmon1001 is very slow - https://phabricator.wikimedia.org/T127957#2327456 (10yuvipanda) Plan is to try to reinstall this with './modules/install_server/files/autoinstall/partman/raid10-gpt-srv-lvm-ext4.cfg' recipe and see how that goes. [18:49:59] 10Tool-Labs-tools-Other, 07Tracking: merl tools (tracking) - https://phabricator.wikimedia.org/T69556#2327474 (10valhallasw) [18:50:01] 06Labs, 10Tool-Labs: Provide resource for db access in grid - https://phabricator.wikimedia.org/T70881#2327472 (10valhallasw) 05Open>03declined Unfortunately, we don't have the in-house knowledge to implement and maintain such a custom resource. I think 'check-and-reschedule' is a sane workaround, which ha... [18:52:22] 06Labs, 10Labs-Infrastructure: Copy graphite data from labmon1001 to an external HDD - https://phabricator.wikimedia.org/T136226#2327508 (10yuvipanda) [18:53:43] 06Labs, 10Labs-Infrastructure: Reinstall labmon1001 with new disk configuration (and jessie) - https://phabricator.wikimedia.org/T136227#2327529 (10yuvipanda) [18:54:22] valhallasw`cloud: I'm figuring out how to reinstall labmon1001, I'll probably try to do it next week [18:54:36] YuviPanda: can I suggest sharding? [18:54:57] in other words a separate monitoring instance for tool labs [18:56:09] valhallasw`cloud: probably, but this is a 1d procedure involving no procurement while that will require new hardware... [18:56:17] or a vm [18:56:29] valhallasw`cloud: vm + graphite isn't going to go well... [18:56:45] even if it's only for tools? [18:56:48] valhallasw`cloud: long term I think we should just run a prometheus agent on all nodes, that way we get alerting too and bypass all the problems [18:57:05] valhallasw`cloud: a VM for just tools is still going to perform at best on par with an overloaded HDD real hardware box I think. [18:57:33] valhallasw`cloud: and the current setup is *incredibly* inefficient, I think we'll get at least a 3x boost... [18:57:39] I'm confused. [18:57:51] Diamond writes data for each host once every what, five minutes? [18:57:58] I'm not entirely sure on that, only because of the relatively small write [18:57:59] yeah [18:58:18] hmm, I must retract my 3x boost claim now since I realized I'm just parroting out of context heh [18:58:20] so maybe I should first ask the question 'have you profiled where the load comes from' [18:58:25] I've been thinking about it and am inclined towards a branch model for monitoring as well [18:58:51] i.e.
treat Tools as a "site" agnostic to where it is and have it hang off of a higher up monitoring node for Labs [18:58:57] 10Quarry: Excel does not recognize Quarry CSV output as UTF-8 - https://phabricator.wikimedia.org/T76126#2327562 (10valhallasw) No, the 'UTF-16' seems to actually be UTF-8... [18:59:03] but whatever we can do now I understand [18:59:35] I'm all for rethinking our models, but I highly doubt any of them are going to take 1d to implement. And when we do rethink our models I really don't want graphite to be a part of it [18:59:49] http://tools-prometheus.wmflabs.org/ should just collect metrics from all nodes, not just k8s [19:00:06] that actually has a distributed data store from what I can tell, so we won't run into the SPOF issues with graphite [19:00:23] nope, no objections to this, it will take some doing to rethink [19:00:52] if this buys us another 3 months I'll be happy with it :D [19:01:04] yep [19:03:01] valhallasw`cloud: chasemp if we merge https://gerrit.wikimedia.org/r/#/c/276243/ we can get a start on it (prometheus metrics from all nodes). Rest requires some amount of debian package fuckery tho, to get it to run on other nodes. I'm seriously considering the fact that we shouldn't be using debs for go packages at all. Requires a lot of work for shitall returns [19:03:29] fpm! [19:03:48] wget! :D [19:03:51] so for similar iirc I used dpkg-deb as well which I think is ...frowned as well but it's so simple [19:04:00] probbly w/ wget is for instance on k8s nodes now [19:04:01] and sane upgrades if you make sure to actually force an apt-get upgrade && apt-get install [19:04:07] how do i query version etc [19:04:14] you run into all the problems apt solves [19:04:15] idk [19:04:43] so this is vaguely related to yesterday's outage [19:05:02] I'm not sure how to deploy the toollabs-webservice deb [19:05:50] yeah, first thought htere is ensure latest is always bad, we have had the conversation in the past as well [19:05:55] 10Quarry: Excel does not recognize Quarry CSV output as UTF-8 - https://phabricator.wikimedia.org/T76126#790571 (10Dzahn) Btw, separate from the encoding... what Excel considers to be a "CSV" actually depends on language settings in Windows. For example if you use a German Windows, the delimiter character by def... [19:05:55] but I hate that setting [19:06:21] but in general for similar I have had ensure => present and then do apt-get install foo -y [19:06:30] which will upgrade via some salt or whatever orchestration [19:06:46] My preference would be ensure => present plus an upgrade cronjob [19:06:58] why a cron job? [19:07:05] 'whatever orchestration' -> we don't really have anything we can trust and use [19:07:18] ...or in scheduled maintenance [19:07:20] atm something ssh based is useable afaik [19:07:26] if there is a node sans SSH in Tools at least [19:07:29] that's another isssue [19:07:33] I've my hand hacked shell scripts but I run those by hand and I missed running them on the services node [19:07:34] I'm talking purely in Tools [19:07:35] before you use a manual cronjob, you could use https://wiki.debian.org/UnattendedUpgrades [19:07:54] mutante: those break down once you suggest to upgrade everything :( [19:08:29] chasemp: not documented anywhere, and I lost all the clush stuff from my history. It also didn't really do any node discovery when I had it because we haven't set it up... [19:08:34] valhallasw`cloud: ..with a config that does not upgrad everything? 
[19:08:47] there is also debdeploy used in prod for this [19:08:48] mutante: no, if you use a config that upgrades everything [19:08:48] so not very different from xargs bash-script-that-sshes which I'm using [19:09:01] YuviPanda: I'm open to consolidating on a "one true way" which we don't have now [19:09:20] chasemp: we don't have *any* supported way, IMO. we all have our little helper scripts. [19:09:25] I think 'whatever prod does' sounds like the best way to go? [19:09:28] if valhallasw`cloud wants to run his own he has to figure out his own [19:09:36] valhallasw`cloud: 'whatever prod does' is 'salt' [19:09:41] it's debdeploy [19:09:50] it happens automatically based on hiera grains [19:09:51] well, debdeploy + salt + scap + whatever [19:10:00] debdeploy is also based on salt [19:10:00] and which packages are listed in it to be upgraded [19:10:09] sure I just mean, the stack is not simple [19:10:11] yes, but it's much more than a person running salt [19:10:12] totally [19:10:38] it's true we each have our own thing, I would like to settle on clush and say ssh has to be working or $alert [19:10:41] salt on labs is also unreliable to a point that I'd rather open 50 tabs [19:10:46] chasemp: I totally agree [19:10:48] and I don't think the scaffolding around that is too difficult [19:10:56] I agree too :D just needs to be built [19:11:03] tom29739: I added a note about .user.ini at https://wikitech.wikimedia.org/w/index.php?title=Help:Tool_Labs/Web&diff=569479&oldid=549251 -- add more things there if you think of them [19:11:03] for now if you want to do node discovery, or at least compile time node list generation [19:11:15] you can pull nodes from puppet master certs signing list [19:11:21] which gives you every tools host [19:11:27] which isn't savvy targeting [19:11:31] but is useable [19:11:45] but yeah it's not a solved problem [19:12:10] yeah [19:12:19] this is every tools host [19:12:20] puppet cert -l --all | grep '\.tools\.' | cut -d '"' -f 2 [19:12:21] etc [19:12:33] bd808, that seems good. [19:12:44] you can probably ask puppetmaster for 'what classes are being applied to this host?' [19:13:07] and if https://gerrit.wikimedia.org/r/#/c/285014/ gets merged you can maybe ask it too [19:13:31] until then, maybe I should just write a fabfile [19:13:47] that's what debdeploy does, depending on the role class the host uses, different rules apply what gets auto-upgraded and what doesnt [19:14:14] mutante: it assumes a reliable working salt setup. [19:14:36] YuviPanda: sounds like a good problem post storage things for $new person [19:14:41] chasemp: yeah, I agree. [19:14:42] figure us out a thing to do stuff thanks [19:15:15] chasemp: ya. [19:15:46] why is labs salt more unreliable than prod salt after the fixes? not the same version yet? [19:15:48] chasemp: in the meantime tho, I'm going to write a fabfile for stuff like this rather than xargs things. [19:15:54] sure [19:16:28] (03CR) 10Merlijn van Deen: [C: 032] Task parsing code: always split by the /last/ : [labs/tools/forrestbot] - 10https://gerrit.wikimedia.org/r/290424 (https://phabricator.wikimedia.org/T136041) (owner: 10Gerrit Patch Uploader) [19:16:49] mutante: I do not know enough about salt to have an informed opinion on why our installaton of it is so unreliable. I know that I've been burnt by it way more than enough times in the past to never trust it again unless someone can explicitly prove to me that it works right 100% of the time. 
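chasemp's cert-list discovery and the "xargs bash-script-that-sshes" pattern discussed above fit together in a few lines. A sketch, to be run wherever the project puppetmaster/CA lives, with 'uptime' standing in for whatever actually needs to run everywhere:

    # Build a host list from the puppet CA (chasemp's one-liner), then loop.
    sudo puppet cert -l --all | grep '\.tools\.' | cut -d '"' -f 2 > /tmp/tools-hosts

    while read -r host; do
      echo "== ${host} =="
      ssh -o ConnectTimeout=10 -o BatchMode=yes "$host" 'uptime' </dev/null \
        || echo "FAILED: ${host}"
    done < /tmp/tools-hosts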
[19:17:06] (03Merged) 10jenkins-bot: Task parsing code: always split by the /last/ : [labs/tools/forrestbot] - 10https://gerrit.wikimedia.org/r/290424 (https://phabricator.wikimedia.org/T136041) (owner: 10Gerrit Patch Uploader) [19:17:38] YuviPanda: i assume some kind of fixes have been applied that are not in labs yet, "stable enough for prod but not stable enough for labs" would be odd [19:17:58] no idea :) [19:18:05] but that's always historically been the case I guess [19:18:27] mutante: maybe I'm wrong on this, but my feeling is that for prod there's always more people around to step in when things go wrong [19:18:29] I gave up on it right around the big labstore explosion last June, when it proved absolutely useless [19:18:32] ^ [19:18:36] I think that's ultimately it [19:19:01] I think that apergos would be the person to talk to about checking on the salt masters and such [19:19:03] valhallasw`cloud: how is that related to salt returning the right hosts or not though [19:19:30] !log tools deleted tools-docker-builder-01 and -02, hosed hosts that are unused [19:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [19:19:52] mutante: not, it's related to reliability. If salt doesn't do the right thing, having people around to step in and fix it helps. [19:20:18] (where 'doesn't do the right thing' is probably 'doesn't run' rather than 'runs the wrong thing') [19:20:29] valhallasw`cloud: i have not seen an incident where people had to step in because debdeploy did something wrong [19:20:30] which is painful if it's a rollback [19:20:51] mutante: you don't remember the many months of 'salt does not work' complaints in production? [19:21:11] yea, and then it was fixed [19:21:18] oh, debdeploy doesn't use salt to run commands, it only uses it to get a host list [19:21:20] what bd808 said then [19:21:37] yea, and that too [19:21:41] PROBLEM - Host tools-docker-builder-01 is DOWN: CRITICAL - Host Unreachable (10.68.16.104) [19:21:55] I think the bad experiences were with using salt to run stuff [19:22:23] mutante: it's an issue of trust. I don't trust it to work, and hence there's no point in me using it. [19:22:27] we have gone down this road, and many issues are largely historical, but also a big part of salt reliability in prod is dedicated masters which we don't have, and the other element is supporting the salt service since salt is on demand discovery very fixed [19:22:29] well.. is it really easier to keep doing it different.. 
you just said how everybody uses their own tools [19:22:29] they we for hoping that salt would actually communicate with all hosts having a given grain [19:23:01] what chasemp said [19:23:26] what I was left with after talking to apergos last time around is, w/o more resources for salt here we are [19:23:36] ^ [19:23:38] and it also doesn't solve the matter of instances in labs for which the project has it's own salt master [19:23:45] for which there is no solution afaik [19:23:52] becuase our problems extend outside of tools [19:24:05] it's not that salt is bad, it's that the problem overlap for labs and prod isn't that large [19:24:14] re: instances and what they are doing what they need [19:24:24] so yeah, ssh is the best option as far as I know [19:26:04] gotcha [19:29:32] !log ores deploying 7992fd1 into web and worker nodes [19:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [19:33:37] !log ores running puppet agent manually in ores-web-03 [19:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL, Master [19:39:20] !log tools run sudo dpkg --configure -a on tools-worker-1007 to get it unstuck [19:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [19:43:44] !log tools delete devpi instance, not currently in use [19:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [19:44:43] PROBLEM - Host tools-devpi-01 is DOWN: CRITICAL - Host Unreachable (10.68.16.227) [19:58:56] 06Labs, 10Tool-Labs: currentevents is using 248G in Tools - https://phabricator.wikimedia.org/T136195#2327850 (10chasemp) p:05Triage>03High [19:59:48] RECOVERY - Host tools-devpi-01 is UP: PING OK - Packet loss = 0%, RTA = 1.17 ms [20:00:17] 06Labs, 10Tool-Labs: oar is using 207G on Tools - https://phabricator.wikimedia.org/T136201#2327852 (10chasemp) >>! In T136201#2327081, @A930913 wrote: > EdSaperia: "I'm trying to raise more funds for it > So keep would be nice" Can you elaborate on what this means and who this person is in relation to the tool? [20:00:59] 06Labs, 10Tool-Labs: templatetiger is using 613G in Tools out of 8T - https://phabricator.wikimedia.org/T136192#2327856 (10chasemp) >>! In T136192#2327360, @Kolossos wrote: > templatetiger should be after cleanup now at under 140GB on file system. What is your cleanup strategy? Files older than n days? [20:04:17] RECOVERY - Puppet staleness on tools-grid-shadow is OK: OK: Less than 1.00% above the threshold [3600.0] [20:06:30] 06Labs, 10Labs-Infrastructure: Copy graphite data from labmon1001 to an external HDD - https://phabricator.wikimedia.org/T136226#2327892 (10Krenair) [20:07:21] 06Labs, 10Labs-Infrastructure: Copy graphite data from labmon1001 to an external HDD - https://phabricator.wikimedia.org/T136226#2327899 (10yuvipanda) After more discussion, we decided to just put the service in downtime and copy the data, since the disks are already IO saturated anyway... [20:10:42] 06Labs, 10Labs-Infrastructure, 06Operations, 10ops-eqiad: connect usb external disk to labmon1001 - https://phabricator.wikimedia.org/T136242#2327910 (10RobH) [20:14:02] 06Labs, 10Tool-Labs, 07Tracking: Contact tool maintainters using large amounts of disk space (tracking) - https://phabricator.wikimedia.org/T136212#2327929 (10Emijrp) [20:14:04] 06Labs, 10Tool-Labs: currentevents is using 248G in Tools - https://phabricator.wikimedia.org/T136195#2327927 (10Emijrp) 05Open>03Resolved Hello! I just deleted all the files. 
[20:15:31] !log tools deleted tools-bastion-mtemp per chasemp
[20:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[20:15:44] tx
[20:17:26] PROBLEM - Host tools-bastion-mtemp is DOWN: CRITICAL - Host Unreachable (10.68.19.117)
[20:21:23] 06Labs, 10Tool-Labs, 07Tracking: Make toollabs reliable enough (tracking) - https://phabricator.wikimedia.org/T90534#2327943 (10yuvipanda)
[20:21:25] 06Labs, 10Tool-Labs: Set up sufficient monitoring for toollabs - https://phabricator.wikimedia.org/T90845#2327941 (10yuvipanda) 05Open>03Invalid Is too vague to be useful anymore, I think.
[20:33:26] 06Labs, 10Tool-Labs: wikidata-exports is using 256G in Tools - https://phabricator.wikimedia.org/T136194#2327970 (10chasemp) >>! In T136194#2326563, @mkroetzsch wrote: > We can easily delete old Wikidata dumps. However, history might be of interest. Is there any other record of Wikidata dumps anywhere? It woul...
[20:36:09] 06Labs, 10Tool-Labs: templatetiger is using 613G in Tools out of 8T - https://phabricator.wikimedia.org/T136192#2327979 (10chasemp) >>! In T136192#2327360, @Kolossos wrote: > templatetiger should be after cleanup now at under 140GB on file system. Is there something we can do to prevent this from happening?...
[20:38:48] 06Labs, 10DBA, 10Horizon: TGR unable to login on Horizon - https://phabricator.wikimedia.org/T131630#2327984 (10csteipp) On labswiki, the user table was created at a time when the collation wasn't explicitly set, so it's ``` CREATE TABLE `user` ( `user_id` int(10) unsigned NOT NULL AUTO_INCREMENT, `user_...
[20:40:26] I just deleted 2 GBs,
[20:40:34] I'm not sure how helpful it would be
[20:41:07] Amir1: it all helps :) thanks
[20:41:37] :)
[20:41:59] I think I can delete 9 GB of redundant data
[20:42:04] let me check
[20:47:08] chasemp: I just deleted one of the instances of Kian, freed 9 GB :)
[20:47:17] now, that's something
[20:51:46] I'm having a problem creating databases on tools-db.
[20:52:32] This: 'MariaDB [(none)]> create database s52590__api;
[20:52:32] ERROR 1044 (42000): Access denied for user 's52590'@'%' to database 's52590__api''
[20:55:50] tom29739: please file a bug
[20:56:19] https://graphite.wmflabs.org/render/?title=tools+cluster+Disk+space+last+day&width=800&height=250&from=-1day&hideLegend=false&uniqueLegend=true&target=aliasByNode%28sum%28tools.%2A.diskspace.%2A.byte_avail%29%2C-3%2C-2%29
[20:56:22] nice :D
[20:58:01] 06Labs, 10Labs-Infrastructure, 06Operations: rcstream not working for wikitech wiki - https://phabricator.wikimedia.org/T136245#2328038 (10Dzahn)
[21:01:07] it seems something took lots of space on 5/19 https://graphite.wmflabs.org/render/?title=tools+cluster+Disk+space+last+week&width=800&height=250&from=-1week&hideLegend=false&uniqueLegend=true&target=aliasByNode%28sum%28tools.%2A.diskspace.%2A.byte_avail%29%2C-3%2C-2%29
[21:03:01] 06Labs, 10Tool-Labs: Cannot create database with s52590 - https://phabricator.wikimedia.org/T136247#2328079 (10tom29739)
[21:03:07] 06Labs, 10Labs-Infrastructure, 06Operations: rcstream not working for wikitech wiki - https://phabricator.wikimedia.org/T136245#2328038 (10Krinkle) RCStream doesn't use channels (unlike the RC messages we send over IRCD, though even there IRCD auto-creates any channels we address messages at). It's one large "...
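On the tools-db problem reported at 20:51–20:52: user databases on tools-db are created by the tool's credential user and named with a <credentialuser>__<name> prefix, which is exactly what was attempted, so the Access denied error was filed as T136247 rather than being a naming mistake. A hedged sketch of the normally working flow, assuming the standard Tool Labs setup where the credential user and password live in the tool's replica.my.cnf (the database name is only an example):

```
# Run from the tool account on a Tool Labs bastion.
# replica.my.cnf holds the s52590-style credential user and its password.
mysql --defaults-file="$HOME/replica.my.cnf" -h tools-db \
      -e "CREATE DATABASE IF NOT EXISTS s52590__api;"

# Confirm the new database is visible to the credential user.
mysql --defaults-file="$HOME/replica.my.cnf" -h tools-db \
      -e "SHOW DATABASES LIKE 's52590%';"
```

In the exchange above this same kind of statement failed with ERROR 1044, which suggests the grants for that credential user were missing or wrong on the server side rather than a client-side mistake; hence the bug report.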
[21:08:16] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion: create conduit method for the creation of phabricator policy objects - https://phabricator.wikimedia.org/T135249#2328097 (10mmodell) @bd808: is the current implementation satisfactory?
[21:09:25] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion, 15User-bd808: Create application to manage Diffusion repositories for a Tool Labs project - https://phabricator.wikimedia.org/T133252#2328101 (10bd808)
[21:09:29] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion: create conduit method for the creation of phabricator policy objects - https://phabricator.wikimedia.org/T135249#2328099 (10bd808) 05Open>03Resolved >>! In T135249#2328097, @mmodell wrote: > @bd808: is the current implementation satisfactory...
[21:23:26] 06Labs, 10Tool-Labs, 13Patch-For-Review: Investigate Tool Labs webservice outage on 2016-05-25 - https://phabricator.wikimedia.org/T136162#2328226 (10yuvipanda)
[21:23:28] 06Labs, 10Tool-Labs, 13Patch-For-Review: Turn on puppet nag emails for tools too - https://phabricator.wikimedia.org/T136167#2328224 (10yuvipanda) 05Open>03Resolved a:03yuvipanda
[21:25:37] 06Labs, 10Tool-Labs: oar is using 207G on Tools - https://phabricator.wikimedia.org/T136201#2328244 (10A930913) This tool comes from the grant at https://meta.wikimedia.org/wiki/Grants:IEG/Open_Access_Reader applied for by Ed Saperia. He is saying that he is seeking more funds for further research and so keepi...
[21:27:26] feels like a silly question .. but how do i add a user to a group on a labs vm? I added myself to the docker group, and i can see `docker:x:117:ebernhardson` in /etc/group along with `group: files ldap` in /etc/nsswitch.conf. But after logging out and logging back in i'm not in the docker group
[21:28:46] so you want to add your ldap user to a local group I think
[21:28:52] which...yeah it's not a silly q :)
[21:29:00] err, sigh silly me. the problem is ssh ControlMaster maintaining the connection so it only faked logging out
[21:29:04] ha
[21:30:26] (03PS1) 10Dzahn: add fake dumpsdeploy key pair [labs/private] - 10https://gerrit.wikimedia.org/r/290804
[21:30:59] (03CR) 10Dzahn: [C: 032] add fake dumpsdeploy key pair [labs/private] - 10https://gerrit.wikimedia.org/r/290804 (owner: 10Dzahn)
[21:31:08] (03CR) 10Dzahn: [V: 032] add fake dumpsdeploy key pair [labs/private] - 10https://gerrit.wikimedia.org/r/290804 (owner: 10Dzahn)
[21:33:26] 06Labs, 10Labs-Infrastructure, 06Operations: rcstream not working for wikitech wiki - https://phabricator.wikimedia.org/T136245#2328279 (10Krenair) I don't have time to dig into this today but when I looked at `telnet rcs1001.eqiad.wmnet 6379` from silver earlier it would try IPv6 for a minute and fail, t...
[21:46:32] 06Labs, 10Tool-Labs: wikidata-exports is using 256G in Tools - https://phabricator.wikimedia.org/T136194#2328349 (10Lydia_Pintscher) >>! In T136194#2327970, @chasemp wrote: >>>! In T136194#2326563, @mkroetzsch wrote: >> We can easily delete old Wikidata dumps. However, history might be of interest. Is there an...
[21:50:27] I can't manage DNS proxies in Horizon. Is it a known bug?
[21:50:42] "Something went wrong! An unexpected error has occurred. Try refreshing the page. If that doesn't help, contact your local administrator."
[21:51:30] I'm not sure what the status of that is Krenair^?
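The 21:27–21:29 exchange above is a common trap: after adding an LDAP user to a local group, the new membership only appears in a genuinely fresh login session, and an ssh ControlMaster keeps the old master connection alive, so "logging out and back in" silently reuses the stale session. A minimal sketch of the fix, assuming ControlMaster is configured on the client side; the instance name below is a placeholder:

```
# On the instance: add the (LDAP) user to the local docker group.
# usermod -aG edits /etc/group, which is consulted before LDAP per the
# "group: files ldap" line in /etc/nsswitch.conf.
sudo usermod -aG docker ebernhardson

# On your workstation: tear down the cached ControlMaster connection so
# the next ssh really starts a new session that picks up the new group.
ssh -O exit instance-name.eqiad.wmflabs   # placeholder host name
ssh instance-name.eqiad.wmflabs id        # 'docker' should now be listed
```

Where it helps, `ssh -O check <host>` first reports whether a master connection is still alive, which is a quick way to spot the "only faked logging out" situation before chasing NSS or LDAP configuration.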
[21:52:17] https://horizon.wikimedia.org/project/proxy/
[22:58:56] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Security-Reviews: Security review of Tool Labs console application - https://phabricator.wikimedia.org/T135784#2328620 (10bd808)
[23:46:50] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion: Deploy "Striker" Tool Labs console to WMF production - https://phabricator.wikimedia.org/T136256#2328692 (10bd808)
[23:46:53] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Security-Reviews: Security review of Tool Labs console application - https://phabricator.wikimedia.org/T135784#2311122 (10bd808)
[23:46:57] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion: Deploy "Striker" Tool Labs console to WMF production - https://phabricator.wikimedia.org/T136256#2328710 (10bd808)
[23:49:12] 06Labs, 10Tool-Labs, 06Community-Tech-Tool-Labs, 10Diffusion, 15User-bd808: Create application to manage Diffusion repositories for a Tool Labs project - https://phabricator.wikimedia.org/T133252#2328723 (10bd808) Basic testing environment deployed at http://striker.wmflabs.org/. See http://devwiki-strik...