[07:49:28] morning
[07:57:42] <_joe_> hi!
[07:57:47] <_joe_> (on a train right now)
[08:05:24] nice
[08:05:47] power at the seats?
[08:12:02] 10serviceops, 10Gerrit, 10Icinga, 10Operations, and 3 others: gerrit: Add a icinga check that uses the healthcheck endpoint - https://phabricator.wikimedia.org/T215457 (10Dzahn) Amended the patch to use the regular check_https_url check command and to link to the full output at https://gerrit.wikimedia.org...
[08:35:09] 10serviceops, 10Gerrit, 10Icinga, 10Operations, and 3 others: gerrit: Add a icinga check that uses the healthcheck endpoint - https://phabricator.wikimedia.org/T215457 (10Dzahn) 05Open→03Resolved works now: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=gerrit.wikimedia.org&servic...
[08:39:26] phab1002 also needs to be switched from thirdparty/php72 to component/php72 to pick up the new PHP 7.2.15 packages
[08:39:38] ^ mutante could you take care of this?
[08:40:40] moritzm: yes
[08:43:36] also doc1001 seems to use PHP 7.2 and would need to be switched (or rather just use the stock PHP 7 from stretch, I doubt this really needs to run the custom builds at all?)
[08:45:30] i can also take that one, i did that for hashar
[08:45:48] 10serviceops, 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10WMDE-leszek) @akosiaris thanks for listing up information needed by SRE. This is very helpful. Before I add those to the task description,...
[08:50:35] mw2151 has a failed nutcracker service - is that from testing?
[08:50:46] (just restarting it won't do it)
[08:52:37] mutante: ack, let's have this use the standard PHP 7 from stretch unless there's a compelling reason to use 7.2
[08:52:56] ok, yep
[09:06:13] mw2151 fixed nutcracker
[09:11:08] what did you do with it (as you wrote before that a restart didn't suffice)?
[09:12:15] mkdir /var/run/nutcracker ; chown nutcracker:nutcracker /var/run/nutcracker ; systemctl start nutcracker ; pool
[09:12:24] but then i remembered that i could as well have rebooted
[09:12:32] either way works
[09:15:34] ack, or simply "systemd-tmpfiles --create"
[09:18:28] oh, aha
[09:49:07] <_joe_> moritzm: I would strongly prefer we run one single php version in production as much as possible
[09:49:08] <_joe_> mutante: we need to fix the way we create systemd tmpfiles, adding that exec ^^
[09:50:00] <_joe_> ok I'm going afk for a bit - I gotta hop off the train
[09:50:02] <_joe_> ttyl
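The nutcracker failure above comes down to /var/run/nutcracker living on tmpfs: the directory disappears after a reboot or cleanup and nothing recreates it before the daemon starts. A minimal sketch of the tmpfiles.d approach behind the "systemd-tmpfiles --create" suggestion; the file name and mode are assumptions, the path and ownership are taken from the chat:

    # /etc/tmpfiles.d/nutcracker.conf (hypothetical file name)
    # type  path                 mode  user        group       age
    d       /var/run/nutcracker  0755  nutcracker  nutcracker  -

With a fragment like that in place, running "systemd-tmpfiles --create" recreates the directory, and wiring an ExecStartPre that runs systemd-tmpfiles --create (or a RuntimeDirectory=nutcracker line) into the service unit, which is presumably what the 09:49:08 remark about "adding that exec" is getting at, would make the service recover on its own instead of needing a manual mkdir/chown.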
[09:51:07] i uploaded patches and confirmed with hashar we are rolling back doc to 7.0
[09:51:13] will do after meeting
[09:51:29] same for phab, harmless because it doesn't serve prod
[09:52:01] eh, not same as in version, switching that one to component/7.2 since Phab does not support 7.0
[09:53:21] for phab that's okay as it needed some feature only re-introduced in 7.1
[09:53:59] ack
[09:54:34] but we should not use the custom packages for random other one-off PHP services, it means additional work (e.g. building additional PHP modules) for no practical gain, and we won't reach the level of testing those packages have had. and more importantly:
[09:55:38] if we use these packages for tasks aside from mediawiki and Phab we tie all these services to the next ICU transition (for which we'll need to build custom transition packages using libicu63)
[09:55:59] this will be painful enough with the app server packages on its own
[09:59:14] for doc1001.eqiad.wmnet it is probably fine to use php7.0
[09:59:33] and if some doc really requires 7.2, we can look at upgrading. But it is probably unlikely
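For context on the thirdparty/php72 to component/php72 switch discussed above: on the host it is just a different component of the same apt.wikimedia.org distribution. A hedged sketch of what the resulting APT source line looks like (file name and exact distribution string are assumptions; in practice this is managed through puppet rather than edited by hand):

    # /etc/apt/sources.list.d/php72.list (hypothetical file name)
    # previously pointed at the thirdparty/php72 component
    deb http://apt.wikimedia.org/wikimedia stretch-wikimedia component/php72

An "apt-get update" afterwards makes the rebuilt 7.2.15 packages visible for upgrade.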
[10:09:05] mwdebug1001 is currently depooled, can it be repooled or does anything speak against that?
[10:16:44] mutante: about reprepro and jenkins. reprepro does not support following two redirects. We use pkg.jenkins.io which does a single redirect, I think they have changed their infrastructure to no longer have to redirect pkg -> mirrors -> actual site
[10:17:21] mutante: so tentatively reprepro should work again. And http://pkg.jenkins.io/debian-stable/ has a new version available if you want to try reprepro
[11:02:06] hashar: yes, i saw your comment on the abandoned gerrit change and had looked at those redirects. they did change something, yeah
[11:15:02] service ops people, please have a look at the JD today :)
[11:28:45] doc.wikimedia.org done - downgraded from PHP 7.2 to 7.0 and puppet running again. remove --purge'd the 7.2 packages, needed follow-up to adjust socket path to 7.0 so there was like a 2 minute downtime. all done now
[11:29:35] good
[11:30:02] lunch &
[11:30:51] good idea..lunch
[11:49:10] hey service folks, am I right that the only service moved to kube atm is mathoid 50% traffic, or is that out of date? I know there's a bunch of ongoing work but I'm looking for one completed migration for 'study purpose'
[11:49:14] *purposes
[12:25:03] re: "which servers are under our care" and netbox. others already had the same idea of course https://phabricator.wikimedia.org/T217686 https://phabricator.wikimedia.org/T216088
[12:32:55] is citoid moved or not completely yet? (I knew I was forgetting something obvious)
[12:47:54] oh, and eventgate...
[12:48:01] * apergos slinks off cursing their memory
[13:07:51] https://wikitech.wikimedia.org/wiki/Mathoid <-- this was pretty out of date. it's now less out of date but with a lot of 'umm' and 'not sure' in it, in case $someone wants to fix it up or tell me how to fix it up
[14:23:10] <_joe_> apergos: mathoid is 100% on kube, as well as eventgate and now (I think) citoid
[14:23:40] <_joe_> there is also zotero there (for citoid)
[14:23:56] <_joe_> and some other service which jijiki loves
[14:24:04] <_joe_> let's see if you can guess the name
[14:24:27] yeah I think I have them mathoid citoid and eventgate, I went and looked at one of the nodes eventually
[14:24:50] <_joe_> so you already know what's the other service
[14:25:17] starts with blub and ends with beroid
[14:25:47] so my question now is what should that wikitech page have where I handwaved a lot?
[14:26:09] because it definitely should not have had 'ssh to scaxxx and scap deploy'
[14:26:10] :-P
[14:26:21] https://wikitech.wikimedia.org/wiki/Mathoid I mean
[16:35:35] <_joe_> apergos: ahaha I love the new wording
[16:36:04] that's what happens when some uninformed person writes it :-P
[16:56:33] akosiaris: oh, https://phabricator.wikimedia.org/T217747#5007931, yeah I am familiar
[16:56:37] akosiaris: lolz
[16:56:41] :)
[16:57:11] :D
[16:57:20] akosiaris: https://xkcd.com/927
[16:57:43] apergos: what?
[16:58:21] uh
[16:58:34] ah, I just got what you wanted
[16:58:36] ok
[16:58:52] apergos: ok, I'll update that page
[16:58:57] yeah basically I just want 'go look here' to figure it out so I can finish the mathoid page, or
[16:59:04] for someone else to do it and I'll learn from that :-D
[16:59:06] ty!
[16:59:11] but to answer the question, citoid is 100% moved to kubernetes as well as of today
[16:59:21] very nice
[16:59:26] and there is zotero and blubberoid and eventgate
[16:59:35] yeah I was basically just looking for a few examples
[16:59:48] and in the end I sshed in to k1001,2 and looked at processes :-D
[17:00:02] and I guess that did not particularly help
[17:00:07] staff something something not breakfast is starting now btw
[17:00:17] it did, I found citoid mathoid and eventgate
[17:00:24] and of course we all know about blubberoid
[17:00:52] but like 'oh where does mathoid log now'... pretty sure I documented the worst way to find those
[17:01:14] or 'how do you deploy'. well that's a series of 'and then some magic happens' so it would be awesome if $someone filled in those blanks, etc...
[17:03:13] yeah part of this q's goal is to fix that last part
[17:03:34] heh, I figured as much
[17:03:44] logging is logstash plus locally, but using the standard tooling won't help
[17:03:48] but there is probably something that can still be written down for 'at the moment we do...'
[17:03:50] but I can amend docs
[17:03:56] sweeeet
[17:22:46] apergos: there: https://wikitech.wikimedia.org/wiki/Mathoid
[17:28:13] oh this is good!
[17:28:18] thank you very much
[17:29:06] so if we want to restart just one service on some/all of the pods...
[17:29:11] what's the right way to do that?
[17:29:36] or is that not an approach we would use any more with kube?
[17:36:01] a pod is a service
[17:36:30] so to restart it, just delete the pod
[17:36:52] the pod is actually the "nodejs server.js blah blah" thing
[17:37:26] deleting it in the API instructs kubernetes to stop it and start a new one (in order to have the same capacity)
[17:40:21] ah, for some reason (due to my cluelessness I guess) I had the notion that some of these pods had more than one service on them
[17:40:22] my bad
[17:41:28] well, you are right that they can have >1 things in them, but not services. Extra helping software (called sidecars) like the statsd-prometheus-exporter
[17:41:34] so now eventually there will be tools i guess that will be 'find and destroy with extreme prejudice all of the pods for a given service, with a delay between each' and stuff, I guess?
[17:41:51] one too many 'I guess' there but eh
[17:41:54] kubectl delete pods --all
[17:41:58] that's it
[17:42:08] but you want to specify the service name
[17:42:15] not kill all of all pods!
[17:42:21] kubectl -n mathoid delete pods --all
[17:42:24] ah!!
[17:42:27] the default is the "default" namespace
[17:42:30] which has no pods :-)
[17:42:35] GOOD
[17:42:36] :-D
[17:42:44] yeah we did not make that mistake ;-)
[17:42:48] I like it when the deault saves me from my stupidity
[17:42:52] *default
[17:46:15] I have put something about that on the page (plus a question).
[17:46:40] I know eventually some of this can get moved to a generic 'how to do stuff for services' page, but for now it seems it's nice to have the details spelled out here
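To make the pod-restart recipe above concrete, a short sketch using mathoid as the namespace (as in the chat); other services substitute their own namespace:

    # see what is currently running for the service
    kubectl -n mathoid get pods

    # delete the pods; the controller immediately schedules replacements to keep capacity
    kubectl -n mathoid delete pods --all

    # watch the replacements come up
    kubectl -n mathoid get pods -w

For something closer to "with a delay between each", deleting pods one at a time (kubectl -n mathoid delete pod <podname>) and waiting for each replacement to become Ready before moving on does the job. Without -n the commands act on the "default" namespace, which, as noted above, has no pods here.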
[19:54:10] 10serviceops, 10Core Platform Team (Multi-DC (TEC1)), 10Core Platform Team Backlog (Next), 10Kubernetes, and 2 others: Deployment strategy for the session storage application - https://phabricator.wikimedia.org/T217650 (10Eevans) From an IRC conversation: `lang=irc [2019-03-05 09:29:06] _joe_:...