[00:01:32] andrewbogott: encoding02 & 03 are cattle. if it helps with recovery feel free to delete them and re-create, but please ping me so I can re-apply the configs
[00:01:44] (of the video project)
[00:02:32] Good to know. I think they're already copied by now but I will check.
[00:06:31] k
[01:18:25] Do we at least have backups of the file systems?
[01:34:07] Betacommand: no. we do not back up the local disks of Cloud VPS instances. We have replication for most NFS shares, but few true backups.
[01:34:59] ouch
[01:37:09] We have a rough plan to build a self-managed backup solution for people to use, but have not got to the point of actually implementing it yet. It won't be a full disk image backup solution though. It will be more of a place to back up database dumps and critical data that can't be easily recreated.
[11:10:03] getting a very scary warning when trying to SSH into toolforge:
[11:10:08] https://paste.gnome.org/pb9t4ndpm
[11:10:23] did the host key of the Stretch bastion change?
[11:12:44] !help did the host key of the Stretch bastion change?
[11:12:44] Lucas_WMDE: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team
[11:13:07] hey Lucas_WMDE
[11:13:15] hi
[11:13:41] yes, the `login-stretch.tools.wmflabs.org` DNS name now points to another server
[11:13:54] due to an outage we had yesterday
[11:13:57] ah, okay
[11:14:12] it should now be `tools-sgebastion-07.eqiad.wmflabs` (rather than 06)
[11:14:42] thanks for double-checking though
[11:14:44] :-)
[11:14:49] alright, thanks for the info :)
[11:14:58] is it okay if I send an email to the cloud mailing list? others might run into the same issue
[11:15:11] sure
[11:15:42] you can include the previous/current public key as well
[11:16:14] and sgebastion-07 now has -06’s IP address too?
[11:16:32] I thought I could just remove the hostname from known_hosts and keep the IP, but that still produces a (less serious) warning
[11:16:44] public IP yes, private IP is different
[11:16:52] ok thanks
[11:23:49] email sent, thanks again for your help :)
[11:32:43] !fingerprints
[11:32:43] ssh keys for bastion hosts: https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints
[11:33:08] arturo: Lucas_WMDE: would be good if someone could update https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/login-stretch.tools.wmflabs.org
[11:33:47] ah, I didn’t know about that page!
[11:34:15] not sure how to get all the different formats though
[11:34:26] it's explained on https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints
[11:34:44] thanks legoktm. Lucas_WMDE I will let you handle that as well
[11:34:52] okay, I will try ^^
[11:35:01] I'm trying to finish https://wikitech.wikimedia.org/wiki/Incident_documentation/20190213-cloudvps
[11:35:27] I’m not allowed to edit that page :(
[11:36:10] anyone want to give me some extra rights? https://wikitech.wikimedia.org/wiki/Special:UserRights/Lucas_Werkmeister_(WMDE)
[11:36:39] I can modify the fingerprints page
[11:36:59] Lucas_WMDE: if you can prepare a diff for me to copy/paste, I would appreciate that :-P
[11:37:06] ok
[11:37:20] regarding extra user rights, I'm not sure how that process goes
[11:38:23] new content:
[11:38:27] https://paste.gnome.org/pm1pifkia
[11:38:38] (I do have production shell access already)
[11:39:23] done Lucas_WMDE !!
[11:40:23] thanks!
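For anyone hitting the same warning: clearing the stale known_hosts entry and double-checking the new fingerprint can be done with standard OpenSSH tooling, which is also how the different hash formats for the wiki page are produced. A minimal sketch (assuming a reasonably recent ssh-keygen that supports -E; hostnames as in the discussion above):

    # drop the old known_hosts entry for the bastion name
    ssh-keygen -R login-stretch.tools.wmflabs.org

    # fetch the new host key and print its SHA256 fingerprint before trusting it
    ssh-keyscan -t ed25519 login-stretch.tools.wmflabs.org 2>/dev/null | ssh-keygen -lf -

    # on the bastion itself: print every host key in both hash formats
    # that the Help:SSH_Fingerprints pages list
    for f in /etc/ssh/ssh_host_*_key.pub; do
        ssh-keygen -lf "$f" -E sha256
        ssh-keygen -lf "$f" -E md5
    done

The fingerprint printed by ssh-keyscan should match the one published on wikitech before the new key is accepted.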
[11:40:48] thanks Lucas_WMDE !!! really appreciated :-) regarding the admin rights, may I suggest opening a phab task so we can clarify how that goes? I don't know if we have any established process for wikitech admin rights management
[11:41:21] will do
[11:50:27] https://phabricator.wikimedia.org/T216126
[11:50:56] iirc sysop on wikitech effectively gives out (or used to) cloud-wide root, which is why we have a contentadmin group
[11:52:34] oh, is sysop tied to some uber-powerful LDAP group?
[11:53:18] PHP Fatal error: Uncaught Error: Class 'mysqli' not found i
[11:53:27] sigh... it was there two days ago. a mess.
[11:53:33] !help Hi, does anyone here know if the host key for login-stretch.tools.wmflabs.org has changed?
[11:53:33] MarioFinale: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team
[11:53:46] hey MarioFinale yes
[11:54:11] MarioFinale: see https://lists.wikimedia.org/pipermail/cloud/2019-February/000546.html and https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/login-stretch.tools.wmflabs.org
[11:54:38] Oh, ok. so it isn't just me.
[11:54:42] I was scared for a sec
[11:55:11] can we have mysqli installed by default? it is not deprecated...
[11:56:15] Thanks arturo and Lucas_WMDE
[11:57:07] legoktm: can we have php7.2-mysql back? it was there two days ago.
[12:00:46] Steinsplitter: you mean in Toolforge, right? which backend do you use, grid or kubernetes?
[12:00:53] also, which grid do you use, trusty or stretch?
[12:00:59] (if you are using the grid)
[12:01:33] ah well, I found the bug report: https://phabricator.wikimedia.org/T216076
[12:01:40] so nvm and thanks anyway :)
[12:03:46] Steinsplitter: I haven't touched anything :| if no one else looks into it, I will try to over the weekend
[12:04:27] legoktm: you're the best :)
[12:07:47] lol now I got distracted looking into it
[12:14:15] Steinsplitter: pinged around a bit and Moritz was already on it :) https://phabricator.wikimedia.org/T216076#4953944
[12:23:14] :)
[13:04:01] oh, is sysop tied to some uber-powerful LDAP group?
[13:04:03] no
[13:04:11] there is no connection between MediaWiki groups on wikitech and LDAP groups
[13:04:21] the problem is sysop on there lets you have full editing access to the whole wiki
[13:04:29] which does not exclude Hiera pages
[13:04:33] so you have cloud-wide root
[13:10:32] Syntax coloring in vi (the best editor ever!) is not turned on by default on tools-sgebastion-07. Add the line "syntax on" to ~/.vimrc
[13:11:16] Is this worth a mail to the mailing list, or shouldn't it be turned on by default in some /etc/... file?
[13:12:23] /etc/vim/vimrc.local or /etc/vim/vimrc
[13:21:48] Krenair: okay, thanks
[13:22:43] also you could do things like edit the code that runs on the page when cloudadmins use Special:NovaInstance or whatever
[13:22:50] not sure how relevant that bit is anymore :)
[13:28:06] Wurgl: probably fine to live in your ~/.vimrc
[13:29:52] arturo: As long as I am the only vi user: Yes ;^)
[13:30:54] cool
[13:31:55] please do so. I believe it isn't sane to maintain Toolforge config for every text editor on Linux in our puppet repo
[13:37:03] May I ask if tools was affected by the outage or just xxx.wmflabs.org stuff?
[13:38:59] tools is xxx.wmflabs.org stuff
[13:40:49] VPS vs. tools then? I can't be expected to know all the tech terminology, I'm not a tech guy
[13:41:04] tools is a VPS project
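To apply the vim suggestion above for a single account (rather than carrying editor config in puppet), a one-off sketch:

    # enable syntax highlighting for your own account only, as suggested
    # above; a system-wide default would go in /etc/vim/vimrc.local instead
    cat >> ~/.vimrc <<'EOF'
    syntax on
    EOF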
[13:41:08] hauskatze: several services were affected. Everything should be back to a normal state now; if not, that's a bug worth a phab task
[13:41:09] :-)
[13:41:27] thanks arturo
[13:49:37] arturo: I have a .puppet file in tools.mabot that wasn't there before, it probably doesn't do any harm, right?
[13:50:00] hauskatze: let me check
[13:50:04] s/file/folder
[13:50:20] arturo: feel free to sudo if you need to
[13:51:20] hauskatze: I confirm it's safe to delete that dir. A leftover of who knows what :-)
[13:51:31] oki
[13:52:04] did tools.mabot@tools-sgebastion-07:~$ rm -Rf .puppet/
[15:33:49] !log tools moving tools-worker-1002, 1003, 1005, 1006, 1007, 1010, 1013, 1014 to different labvirts in order to move labvirt1012 to eqiad1-r
[15:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[16:13:32] q
[16:49:09] chicocvenancio: PAWS dead?
[16:51:16] * bd808 looks at paws
[16:51:31] it is at least very slow at the moment...
[16:55:08] bd808: i get 504 Gateway Time-out
[16:55:09] nginx/1.13.6
[16:55:50] fuzheado: yes. we are looking into it. We had problems with PAWS yesterday too as a result of toolsdb being overloaded
[16:56:05] bd808: thanks!
[17:00:28] fuzheado: I'm at the hospital with my son, can't be of much help at the moment. The cloud team is looking into it, I'll peek at the channel and see if they need any paws-specific information from time to time
[17:00:59] chicocvenancio: no worries! best to you and your son
[17:35:24] !log tools T215154 tools-sgebastion-07 now running systemd 239 and enforcing user limits
[17:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:35:27] T215154: Toolforge: sgebastion: systemd resource control not working - https://phabricator.wikimedia.org/T215154
[17:54:36] is the VPS outage only affecting cloudvirt1024? I have two VMs (eventmetrics-prod01 and eventmetrics-dev02) that are not on cloudvirt1024 and are currently unresponsive. No requests are getting through, it just hangs. Other VPS instances seem to be working fine
[17:55:29] musikanimal: we should have everything off of the cloudvirt hosts with failed disks. Which instances are giving you trouble?
[17:55:46] eventmetrics-prod01 and eventmetrics-dev02
[17:56:10] the request isn't even getting to Apache (I see no activity in the logs), so maybe it's an issue with the proxies?
[17:58:53] musikanimal: https://eventmetrics.wmflabs.org/ works for me
[17:59:30] does it? what about https://eventmetrics-dev.wmflabs.org ? a member of our team who's in London can't access that either
[18:00:01] that one is working for me too
[18:00:20] ssh'ed to both instances with no issues as well
[18:00:27] yeah I can SSH in just fine
[18:00:50] Additional data point: I can access dev but not prod from the WMF office right now.
[18:01:08] https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue may have some tips on tracking down where things are going wrong
[18:01:33] Niharika: that's super weird. The IP address of both would be the same
[18:02:45] Anything I can do to help troubleshoot?
[18:04:20] Niharika: what error are you seeing in your browser (I assume) from https://eventmetrics.wmflabs.org/
[18:05:05] bd808: No error, it just loads forever.
[18:05:24] PHP Warning: mysqli::mysqli(): (HY000/1040): Too many connections
[18:06:07] yeah, `ping`ing it gives a quick reply, and traceroute looks okay, mtr shows 0 for Loss%
[18:06:09] Wurgl: is that a connection to toolsdb?
[18:06:20] user-db
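A quick way to see how close ToolsDB is to its connection ceiling from a Toolforge shell; a sketch, assuming the usual $HOME/replica.my.cnf credentials and the tools.db.svc.eqiad.wmflabs service name for ToolsDB:

    # current client threads versus the configured ceiling; error 1040
    # ("Too many connections") means the first number reached the second
    mysql --defaults-file="$HOME/replica.my.cnf" -h tools.db.svc.eqiad.wmflabs \
        -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';
            SHOW GLOBAL VARIABLES LIKE 'max_connections';"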
[18:07:11] Wurgl: ok, that's the same thing. We are currently investigating that service. PAWS is having trouble connecting to it as well.
[18:07:25] Niharika: woah I can `curl` and it works too! just not in the browser...
[18:07:25] Thanks
[18:07:41] bstorm_: ^ another "too many connections" for toolsdb reported by Wurgl
[18:09:18] I just got "too many connections" errors from XTools too
[18:14:00] musikanimal: me too, looks like it's ok now tho
[18:15:20] mysql tools-db seems to work. Thanks
[18:17:12] so I can get to eventmetrics in incognito mode, or a different browser; but clearing my cache, including the host resolver cache (DNS), doesn't fix it in my main browsing window
[18:19:58] Niharika: woah, it's the cookies!
[18:20:39] why on earth that would affect connectivity, I don't know...
[18:20:52] I am very sad to tell you again: ERROR 1040 (08004): Too many connections
[18:22:12] Wurgl: Yep, getting that as well
[18:23:24] I also got "Connection refused" a bunch of times yesterday, and my usual "Lock wait timeout exceeded"
[18:23:29] Wurgl, SQL: we are trying to track down the cause, but basically some tool is opening too many concurrent connections
[18:23:38] ah, fair
[18:23:42] so it's a type of denial of service
[18:24:19] !log tools.checkwiki stopping service to see if that fixes the DB
[18:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.checkwiki/SAL
[18:26:44] !log tools.checkwiki stopped update_dumps job in case that was the cause of the DB issue
[18:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.checkwiki/SAL
[18:30:56] !log toolsbeta moving toolsbeta-puppetdb-01 to labvirt1002
[18:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL
[18:34:48] !log tools.checkwiki bd808 disabled all cron jobs by commenting them out in the Stretch grid crontab while debugging ToolsDB connection overload
[18:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.checkwiki/SAL
[18:34:56] !log tools moving tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1406, tools-webgrid-lighttpd-1410 to labvirt1002
[18:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:37:19] !log tools moving tools-exec-1418, tools-exec-1424 to labvirt1003
[18:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:56:30] Thank you all for looking into the PAWS issue. I was just going to ask about that error here, but I see it's already a known issue.
[18:57:14] Yep, it's a toolsdb issue that affects paws
[18:58:20] !log tools.fountain Stopped webservice. Implicated in ToolsDB connection overload outage.
[18:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.fountain/SAL
[19:11:03] !log tools moving tools-k8s-etcd-01 to labvirt1002
[19:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:11:32] There is something very strange with tools-db
[19:11:47] Bad disk?
[19:12:20] Wurgl: no, just concurrent usage overload
[19:12:42] we keep killing things to make space and then something new pops up to take it all back :/
[19:13:40] https://paste.gnome.org/p201maddl <-- 9028 seconds for some insert statement?
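When "some tool is opening too many concurrent connections", information_schema makes the culprit visible. A sketch (same connection assumptions as above; an account needs the PROCESS privilege to see other users' threads, ordinary tool accounts only see their own):

    MYSQL="mysql --defaults-file=$HOME/replica.my.cnf -h tools.db.svc.eqiad.wmflabs"

    # connections per user account, busiest first
    $MYSQL -e "SELECT user, COUNT(*) AS conns
               FROM information_schema.processlist
               GROUP BY user ORDER BY conns DESC LIMIT 10;"

    # statements that have been running for more than ten minutes,
    # like the 9028-second INSERT mentioned above
    $MYSQL -e "SELECT id, user, time, state, LEFT(info, 60) AS stmt
               FROM information_schema.processlist
               WHERE command <> 'Sleep' AND time > 600
               ORDER BY time DESC;"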
[19:14:27] system is overloaded, RAM is full, things are slow
[19:14:27] That kind of behaviour is typical for bad blocks (IMHO)
[19:25:18] !log tools moving tools-elastic-02 to labvirt1003
[19:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:33:51] !log tools moving tools-checker-01 to labvirt1003
[19:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:43:22] notconfusing, fuzheado: paws is up
[19:43:35] Work of the cloud team
[19:46:38] Thank you cloud team for bringing PAWS back up!
[19:55:17] !log tools moving tools-webgrid-generic-1401 and tools-webgrid-lighttpd-1419
[19:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:58:02] tools-db will hang soon :-(
[19:58:31] As before, very simple statements take a long time
[20:15:53] !help Hi, I don't know if it's only me but toolforge is very unstable. The index page of my tool is responding with 404 or 500 very frequently, and it's just an HTML page...
[20:15:54] MarioFinale: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team
[20:16:24] https://tools.wmflabs.org/periodibot/
[20:17:12] Toolsdb had a minor disaster. We are still cleaning up. There is some fallout.
[20:17:54] has something to do with T216170?
[20:17:54] T216170: toolsdb - Per-user connection limits - https://phabricator.wikimedia.org/T216170
[20:26:32] MarioFinale: no, we haven't changed anything in that regard
[20:26:36] I just opened the task :)
[20:27:07] It's related. However, it appears to be a combination of some tools leaking connections and possibly some server-level issues that encouraged that happening
[20:28:02] So we are investigating the latter... and killed those tools' connections. There might be more to do, though. Some issues with some tables and such.
[20:28:21] Oh, well. Guess I'll just have to wait and see.
[20:28:26] thanks
[20:29:28] Is there any task about that "disaster"?
[20:29:35] in phab
[20:30:53] So I can track the progress. I was just making some minor changes to my tool :/
[20:34:08] !log tools moving tools-exec-1409, tools-exec-1410, tools-exec-1414, tools-exec-1419
[20:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:35:53] Possibly. We have some incident docs we're working on
[20:36:39] We are here so far https://wikitech.wikimedia.org/wiki/Incident_documentation/20190214-labsdb1005
[20:40:27] thanks
[20:57:21] bstorm_: https://paste.gnome.org/pgwh5wgbr
[20:57:44] !log tools rebooting tools-worker-1005
[20:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:57:59] Connections are climbing (14:08) because the queries are not finishing
[20:58:21] ?
[20:59:32] Wurgl: what are you speaking of, specifically?
[20:59:55] https://wikitech.wikimedia.org/wiki/Incident_documentation/20190214-labsdb1005 <-- the outage
[21:00:25] Well, yes... and the server can only stand so many.
[21:00:30] The queries are all dead now
[21:00:54] But 9000 seconds for some insert statement which usually takes 1/10 second?
[21:01:06] Was there a load of 50,000?
[21:01:16] Are you saying that there are slow queries on toolsdb at this time?
[21:01:28] Or a few hours ago?
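Once a runaway statement is identified, MariaDB can terminate it by thread id (users may kill their own threads; killing someone else's needs admin privileges). A sketch, reusing the connection line from the previous example and, purely as an illustration, the thread id 481825 that appears in the processlist excerpt further down this log:

    # end just the statement but keep the client connection alive...
    $MYSQL -e "KILL QUERY 481825;"

    # ...or drop the whole connection; note that a thread blocked in a bad
    # state can sit in "Killed" for a long time, exactly as seen below
    $MYSQL -e "KILL CONNECTION 481825;"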
[21:01:47] more than slow, I do not have a word for that slowness
[21:01:50] at this time, we are repairing some tables, so the toolsdb server is under load... it is not currently over connection capacity and not in an outage
[21:02:16] please be patient
[21:02:21] This was about 19:30 UTC
[21:02:27] Ah ok
[21:02:30] repairs are still going on
[21:02:43] 19:30 UTC?
[21:02:45] I've just checked the repair process and it's still working, we can't do much right now
[21:04:13] about that time +/- 10 minutes
[21:04:50] Shortly after allowing 1280 connections
[21:04:57] ah ok.
[21:05:33] There was a large number of connections still placing heavy demand on the database, even though they were theoretically being killed
[21:05:54] But that is good information to have, thank you.
[21:06:51] The problem started about 9800/9900 seconds before ~19:30
[21:08:11] ok
[21:09:06] If the db server is under heavy enough load that could happen. The connection limit is there to prevent overburdening the server.
[21:09:24] Hello - is CVNBot1 down?
[21:09:28] Something was thrashing it. We have some likely candidates, but there are other things.
[21:09:48] Any reason I should be getting a 502 bad gateway here? https://tools.wmflabs.org/ipcheck I've checked and made sure that the webservice was running, and also restarted it to make sure.
[21:09:49] flippinbizkits: There are a lot of things that have been going down today.
[21:09:59] Understood... thx...
[21:10:41] We are just trying to keep tabs on them all. Toolsdb had an outage. Yesterday two cloudvirts did as well. Now there's various cleanup and problems.
[21:11:19] SQL: possibly? That I can perhaps check. Grid or Kubernetes?
[21:11:33] bstorm_: kube, php7.2
[21:11:39] ok
[21:12:15] that ipcheck was having issues a few days ago I think... /me looks at SAL
[21:12:34] Oh, I wasn't aware o_O
[21:12:44] and SAL is busted :/ probably elasticsearch cluster health
[21:13:08] I'm also having problems with my webservice. kubernetes php5.6
[21:13:13] ipcheck is running on a node that is being moved
[21:13:31] bstorm_: oh, thank you very much - no worries then!
[21:13:32] MarioFinale: that's unrelated, but similar. We found that and are getting the pod rescheduled
[21:14:08] oh, i see
[21:14:10] SQL: It should get rescheduled now on a different node
[21:14:20] ty
[21:15:01] MarioFinale: totally outside the toolsdb mess, we've had a mess on our physical hosting servers. So we've been trying to move things around to cope with it. It's been rough. I'm hoping that yours will be a little happier now?
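For tool owners hitting 502s like the ones above, a quick first-pass triage from a Toolforge bastion. A sketch only: ipcheck stands in for your own tool, and the webservice/kubectl invocations reflect the tooling of the era, so treat the exact flags as assumptions:

    become ipcheck        # switch to the tool account
    webservice status     # does Toolforge believe the webservice is running?
    kubectl get pods      # Kubernetes backend: is the pod actually Running,
                          # or stuck on a node that is being drained or moved?

    # if the pod is wedged, a restart asks Kubernetes to reschedule it,
    # usually onto a healthier node
    webservice --backend=kubernetes php7.2 restart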
[21:15:59] it is working now
[21:16:26] :)
[21:28:29] !log tools.sal Updated config to point to tools-elastic-01 and restarted
[21:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.sal/SAL
[21:34:57] ooh reminds me i should check my vm and make sure it didn't explode
[21:35:31] looks like it survived \o/
[21:41:44] :)
[21:49:05] https://www.irccloud.com/pastebin/RqkIe4nj/
[21:49:29] these instances are shut down
[21:51:56] sorry, wrong window
[21:56:43] !log tools Deleted old tools-package-builder-01 instance
[21:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:57:14] !log tools Deleted old tools-proxy-01 instance
[21:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[21:57:31] !log tools Deleted old tools-proxy-02 instance
[21:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[22:16:09] !log paws Activated maintenance page on paws-proxy-02 nginx config
[22:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[22:22:12] bstorm_: We have that problem again!
[22:22:15] | 481825 | s51512 | 10.68.17.201:43273 | s51512__data | Query | 2386 | Opening tables | SELECT * FROM dewiki_data WHERE wiki='dewiki' AND page_id='6958701' AND page_latest IS NOT NULL | 0.000 |
[22:22:32] Yeah, it's taken down some services
[22:22:59] 2386 seconds … for an access via a unique index
[22:25:30] !log paws downtimed PAWS in Icinga
[22:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Paws/SAL
[22:34:12] Wurgl: this one may have entirely different reasons, because we are repairing some tables, which may be putting mariadb in a weird state in places. That could be the cause here. Not sure.
[22:36:31] Seems to depend on the tool, I've got the same issue with ipcheck, but ipcheck-dev is fine
[22:38:21] bstorm_: I am not an administrator of some tool or server, I never was. Just writing programs since *hmm* 1973? This one was the first I had access to (sorry, only in Italian): https://it.wikipedia.org/wiki/Olivetti_P602
[22:40:35] I had such behaviour when the disk had bad blocks …
[23:36:15] Killed does not really help … | 481825 | s51512 | 10.68.17.201:43273 | s51512__data | Killed | 6832 | Opening tables | SELECT * FROM dewiki_data WHERE wiki='dewiki' AND page_id='6958701' AND page_latest IS NOT NULL | 0.000 |
[23:38:43] DBAs are helping at this point. Some long-running queries were killed in the process of that. The repair was locking up metadata and basically making it so tables weren't opening.
[23:39:04] So the repair is now stopped
[23:40:09] Strange status, isn't it?
[23:53:33] The database is now in read-only mode by action of the DBA team
[23:59:53] tools db is dead?
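To answer that last question from the client side: a read-only server still accepts connections and SELECTs but rejects writes. A sketch for checking both the flag and the lock pile-up described above (same connection assumptions as the earlier examples; the full processlist again needs the PROCESS privilege):

    MYSQL="mysql --defaults-file=$HOME/replica.my.cnf -h tools.db.svc.eqiad.wmflabs"

    # ON here means writes fail while reads keep working
    $MYSQL -e "SHOW GLOBAL VARIABLES LIKE 'read_only';"

    # count threads blocked behind the repair; they show up as
    # "Opening tables" or "Waiting for table metadata lock", as in the
    # processlist excerpts pasted above
    $MYSQL -e "SHOW FULL PROCESSLIST;" | grep -Eic 'Opening tables|metadata lock'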