[06:56:59] serviceops, Operations, observability: Reliable metrics for idle/busy PHP-FPM workers - https://phabricator.wikimedia.org/T252605 (Joe) FWIW, I think I remember systemctl status php7.2-fpm stalling on a busy server, but I might remember incorrectly.
[07:48:44] serviceops, Kubernetes, Patch-For-Review: Make helm upgrades atomic - https://phabricator.wikimedia.org/T252428 (JMeybohm)
[09:05:08] serviceops, Operations, Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (elukey) Some notes: * Added https://grafana.wikimedia.org/d/000000174/redis?panelId=14&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-job=redis...
[09:56:47] serviceops, Operations, Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (elukey) Open→Stalled Precisely, let's hold this task until T243106 is completed.
[09:56:53] serviceops, Operations, Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (elukey)
[12:36:37] serviceops, Arc-Lamp, Performance-Team, Patch-For-Review: Decom the ArcLamp pipeline for HHVM/Xenon - https://phabricator.wikimedia.org/T233884 (Dzahn) Merged the change above, it was a complete noop on webperf1002/2002. Should we kill `/usr/bin/python /usr/local/bin/arclamp-log /etc/arclamp-log-x...
[12:54:59] _joe_: do you have opinions either way about T252605 ?
[13:46:02] serviceops, Release-Engineering-Team-TODO, Scap: Deploy Scap version 3.14.0-1 - https://phabricator.wikimedia.org/T249250 (LarsWirzenius)
[13:49:20] serviceops, Release-Engineering-Team-TODO, Scap: Deploy Scap version 3.14.0-1 - https://phabricator.wikimedia.org/T249250 (LarsWirzenius) Is this blocked on something for serviceops?
[14:07:27] <_joe_> oh sigh
[14:07:41] <_joe_> rzl or jayme: wanna pick that up?
[14:07:52] <_joe_> (deploy the new scap version)
[14:08:27] sure
[14:08:34] I've already done that once, happy to leave it for jayme to get the exposure-- perfect :)
[14:08:43] also, if there's something I can do to make it easier for serviceops to handle new Scap releases, please tell me
[14:09:08] <_joe_> liw: we just suck at triaging, sorry about that
[14:09:14] <_joe_> it's completely our fault
[14:09:27] serviceops, Release-Engineering-Team-TODO, Scap: Deploy Scap version 3.14.0-1 - https://phabricator.wikimedia.org/T249250 (JMeybohm) a:JMeybohm
[14:10:05] ok, thanks
[14:10:08] jayme: feel free to ping if you need a hand
[14:10:15] I'll get to it in an hour I guess - if that's okay
[14:10:19] rzl: cool, thanks
[14:11:07] <_joe_> jayme: now I read all your IRC messages with a *squeak* for every whitespace
[14:11:21] haha
[14:11:43] that was really annoying! But I already "unsqueaked" it
[14:12:24] <_joe_> eheh
[14:22:46] I did not get the squeak reference; is this something I'm better off having missed? :-)
[14:25:16] <_joe_> apergos: I was chatting with jayme earlier and apparently his laptop's spacebar wask squeaking
[14:25:18] <_joe_> *was
[14:25:24] <_joe_> and he was *very* annoyed by it
[14:25:32] ah ha :-D :-D
[14:25:51] and it must have been that loud that _joe_ could hear it :P
[14:25:53] <_joe_> so imagine reading I'llSQUEAKgetSQUEAKtoSQUEAKthat
[14:26:25] I'm gonna see if there's a driver shim :-D
[14:26:30] it would be hilarious!
[14:31:32] found an app actually but it's old and it's python 2.7, meh
[15:00:45] serviceops, Graphoid, Operations, Core Platform Team (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Jseddon)
[15:07:41] serviceops, Operations: Services and the deployment pipeline are hosted on production-level infrastructure - https://phabricator.wikimedia.org/T220405 (akosiaris)
[15:14:37] serviceops, Kubernetes, Patch-For-Review: Make helm upgrades atomic - https://phabricator.wikimedia.org/T252428 (JMeybohm) a:JMeybohm
[15:30:59] serviceops, Operations, observability: Reliable metrics for idle/busy PHP-FPM workers - https://phabricator.wikimedia.org/T252605 (Joe) I took a brief peek at what flows to systemd from php-fpm on dbus: ` $ sudo dbus-monitor --system path='/org/freedesktop/systemd1/unit/php7_2e2_2dfpm_2eservice ......
[15:31:02] <_joe_> cdanis: ^^
[15:31:37] _joe_: yeah indeed
[15:32:03] I'm in favor of just parsing it from systemctl show, should be quite simple, don't think that status formatting has ever changed
[15:33:11] <_joe_> we lose some info with respect to the current php-fpm exporter
[15:33:24] well I think this would supplement, not replace
[15:34:21] <_joe_> so you want to just change the current exporter to extract that info that way instead of with a fcgi request, correct?
[15:34:55] no, I was thinking of a new metric, and then updating dashboarding
[15:35:04] (and then adding alerting on the new metric)
[15:35:12] <_joe_> a new metric, ok, how to collect this new metric?
[15:35:34] either a trivial python script, or a simple textfile exporter
[15:35:57] <_joe_> another exporter? we already have too many on those servers :)
[15:36:09] could also modify the php-fpm-exporter, that's a fine idea, maybe have it fall back to the systemd show approach if its scrape fails
[15:36:11] <_joe_> but yes, a textfile exporter should be fine
[15:36:27] <_joe_> yeah that can be done later(TM)
[15:36:52] Later™️ is my favorite
[15:37:22] <_joe_> please, this needs to be written in pure ASCII format for compatibility with our bugs from 2005
[15:39:17] Later™
[16:09:06] _joe_: how would you select the hosts to roll out scap to? Should I choose one as canary (which?)? rzl pointed out "R:Package = scap" as host selector which seems like a good way to start...
[16:10:02] <_joe_> jayme: so I would start with the mwdebug* servers
[16:10:14] <_joe_> and then go there and run (as yourself) "scap pull"
[16:10:44] <_joe_> if that gives no errors, coordinate with releng but I guess they want it installed everywhere
[16:10:55] <_joe_> rzl: did you introduce debdeploy to jayme?
[16:11:06] he did
[16:11:25] <_joe_> so ok, I think their expectation is that it gets installed everywhere
[16:12:38] Ah, okay. I wasn't even sure it's installed on every machine
[16:13:34] because "There is always an exception™️"
[16:14:29] it's not installed everywhere, not even half our servers have it: https://debmonitor.wikimedia.org/packages/scap
[16:16:14] that brings the question up again how to properly select those where it is installed
[16:20:15] debdeploy only updates packages, it never installs new packages, as such you can either address selected clusters (e.g. A:mw-canary), entire sites (e.g. A:codfw) or all hosts: (A:all)
[16:22:52] Ah, I see. Thanks!
[16:44:05] There is still one host left that is not even caught by A:all (https://debmonitor.wikimedia.org/hosts/an-tool1006.eqiad.wmnet)
[16:45:07] And I had expected debdeploy to log to SAL... moritzm?
[16:47:45] It is caught by A:all but debdeploy reports it as "up to date", which it isn't
[16:50:46] jayme: need a hand?
[16:51:47] could use one :) - trying to figure out why debdeploy thinks an-tool1006 has an up to date scap
[16:54:00] might be because "apt-get update" hangs there
[16:54:20] apt-cache policy shows only the 3.13.0-1 version
[16:54:22] so yes
[16:54:40] would make sense as it is not listed as to update in debmonitor either
[16:55:59] interesting
[16:56:03] an-tool1006 has a "deb http://deb.debian.org/debian-debug/ stretch-debug main" source
[16:56:06] the apt sources list has some spurious stuff
[16:56:08] and no internet access i guess
[16:56:11] yep that one
[16:56:13] indeed
[16:56:15] without a proxy
[16:56:19] no internet access
[16:57:56] IIRC we manage with puppet only stuff in sources.list.d/
[16:58:45] cc ottomata elukey (mmmh not here though)
[17:00:40] ah, because an == analytics - had to look for that naming page :-)
[17:01:07] https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions#Servers
[17:01:10] yeah, sorry
[17:01:23] my guess is that was modified manually for some testing
[17:02:01] context modules/apt/manifests/repository.pp
[17:02:56] thanks
[17:03:15] TL;DR you found a snowflake :D
[17:04:02] np for the naming lookup. I like to figure out myself. It's just not cached in memory yet :)
[17:04:02] that needs fixing as it has apt broken atm
[17:08:43] I had thought puppet would complain about it at some point as it runs apt-get update, no?
[17:09:05] not directly, we run apt-get update *before* puppet in the same cron script
[17:09:18] check /usr/local/sbin/puppet-run
[17:09:24] that is run by /etc/cron.d/puppet
[17:09:27] on any host
[17:10:00] we run it with a timeout -k 60 300
[17:10:19] and yes, we should probably alert if it's failing consistently
[17:11:27] I guess the fault is in the '|&' jayme
[17:11:48] false |& true exits with 0
[17:12:16] hence the set -e is not triggered
[17:12:17] it's "set +e" anyways
[17:12:43] only the first command is set -x
[17:12:47] *set -e
[17:13:03] yeah we set +e after that
[17:13:15] # From here out, make a best effort to continue in the face of failure
[17:13:18] lol
[17:14:27] jayme: and there is a trick to see since when it's happening
[17:14:28] downside of failing hard at update would be that we can't fix broken apt stuff with puppet I guess
[17:14:43] go to https://debmonitor.wikimedia.org/hosts/, sort by last update
[17:15:12] that's the last successful run of any apt/dpkg command
[17:15:20] that has a hook to speak to debmonitor
[17:15:50] so apt-get update/(un)install/upgrade; dpkg -i ...
[17:15:55] Read about that
[17:16:38] So we could alert on hosts that have not reached out to debmonitor for X?
[17:17:30] potentially yes
[17:17:37] only problem is to avoid false positive, like down hsots
[17:17:38] *hosts
[17:19:33] Wouldn't icinga "silence" service checks for hosts that are down?
[17:20:11] hm..might not be tied to a host .. :/
[17:22:17] Is it suitable to temporarily fix the situation on an-tool1006 to get the task of updating scap done? Or is that a bad behaviour because it might break whatever experiment is ongoing there
[17:22:39] I'd like to know more what happened there
[17:22:41] it shouldn't have
[17:23:22] is the scap new version incompatible with the old one?
[17:23:38] can a host stay with the old one for a day while we get to the bottom of this?
[17:24:05] don't know.. liw?
[17:32:30] jayme, sorry, in meeting, what's up?
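The `false |& true` remark at 17:11 can be checked directly. What follows is a minimal sketch, assuming `bash` is on the PATH. It is not taken from /usr/local/sbin/puppet-run, and the `pipefail` variant is only an illustrative alternative that nobody in the channel proposed. `2>&1 |` is used as the long form of bash's `|&`.

```python
import subprocess


def run_bash(script):
    """Run a snippet under bash and capture its exit status and stdout."""
    return subprocess.run(["bash", "-c", script], capture_output=True, text=True)


# A pipeline's exit status is that of its last command, so the failure of
# `false` is hidden behind `true` and `set -e` never fires.
masked = run_bash("set -e; false 2>&1 | true; echo reached")

# With pipefail the pipeline reports the failing command, so `set -e`
# aborts the script before the echo runs.
strict = run_bash("set -eo pipefail; false 2>&1 | true; echo reached")

print("default :", masked.returncode, repr(masked.stdout))
print("pipefail:", strict.returncode, repr(strict.stdout))
```

As noted at 17:14:28, failing hard on a broken `apt-get update` has a cost: puppet would then never run to repair the broken apt configuration.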
[17:32:57] new scap version should be entirely compatible with old one
[17:33:06] One host is missing the scap update and we would like to .. :) okay, cool
[17:34:51] serviceops, Release-Engineering-Team-TODO, Scap: Deploy Scap version 3.14.0-1 - https://phabricator.wikimedia.org/T249250 (JMeybohm) All hosts except an-tool1006.eqiad.wmnet have up to date scap now. an-tool1006.eqiad.wmnet currently has a modified sources.list which makes "apt-get update" fail. @elu...
[17:43:12] jayme, okay, meeting ended -- can I still help?
[17:46:02] it doesn't log to SAL, typical deployments are often split into many clusters, that would be super-spammy, simply summarise manually what you did with !log
[17:47:31] for inactive hosts you can sort by "Last updated" in https://debmonitor.wikimedia.org/hosts/, typically hosts in hw maintenance
[17:49:28] yep, mentioned above :)
[18:07:11] serviceops, Arc-Lamp, Performance-Team: Decom the ArcLamp pipeline for HHVM/Xenon - https://phabricator.wikimedia.org/T233884 (dpifke) Open→Resolved Yeah, there isn't anything in Puppet to delete previously-generated configuration files which are no longer referenced. I manually stopped and...
[19:00:52] serviceops, Kubernetes, Patch-For-Review: Make helm upgrades atomic - https://phabricator.wikimedia.org/T252428 (hashar) I have updated https://integration.wikimedia.org/ci/job/helm-lint/ to use the new container and hence helm 2.1.6.7. Production hosts are still on 2.12.2* https://debmonitor.wikime...
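The alerting idea floated at 17:16 (flag hosts that have not reported to debmonitor for some time, without false positives from hosts that are down) could look roughly like the sketch below. Everything in it is an assumption for illustration: the `stale_hosts` helper, the shape of the input data, the hostnames, and the one-day threshold are all hypothetical, and how the timestamps would be fetched from debmonitor or the down-host list from monitoring is deliberately left out.

```python
from datetime import datetime, timedelta, timezone


def stale_hosts(last_update, down_hosts, now, max_age):
    """Return hosts whose last package report is older than max_age.

    last_update: hypothetical mapping of hostname -> datetime of last report.
    down_hosts:  hostnames already known to be down, skipped to avoid
                 alerting twice on the same outage.
    """
    return sorted(
        host
        for host, seen in last_update.items()
        if host not in down_hosts and now - seen > max_age
    )


# Made-up example data; the date is arbitrary.
now = datetime(2020, 1, 10, 12, 0, tzinfo=timezone.utc)
last_update = {
    "host-a.example": now - timedelta(hours=2),   # fresh
    "host-b.example": now - timedelta(days=3),    # stale, should alert
    "host-c.example": now - timedelta(days=10),   # stale, but known down
}
down = {"host-c.example"}

print(stale_hosts(last_update, down, now, timedelta(days=1)))
```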
[19:31:45] hmm, jayme volans dunno what that is about
[19:31:50] luca probably knows
[19:34:28] ottomata, jayme: yes he pinged me back in pvt telling me that is a test host where there is work in progress
[19:34:31] so can wait
[19:43:13] ok coo
[20:24:04] serviceops, Operations, observability: Reliable metrics for idle/busy PHP-FPM workers - https://phabricator.wikimedia.org/T252605 (CDanis) Discussed some with Joe on IRC and the consensus approach for now is to write a textfile exporter that parses the systemd status line, and perhaps later...
[20:24:10] serviceops, Operations, observability: Reliable metrics for idle/busy PHP-FPM workers - https://phabricator.wikimedia.org/T252605 (CDanis) a:CDanis
[20:30:16] serviceops, Operations, Parsing-Team, TechCom, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (daniel) From the TechCom meeting: we could just put the parser output into memcached with a short expiry t...
[20:31:57] serviceops, Operations, Parsing-Team, TechCom, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (tstarling) a:tstarling
[20:34:53] serviceops, Operations, Parsing-Team, TechCom, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (tstarling)
[21:26:19] serviceops, Operations, Thumbor, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (AntiCompositeNumber)
[21:26:56] serviceops, Operations, Thumbor, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (AntiCompositeNumber)
[21:27:21] serviceops, Operations, Thumbor, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (AntiCompositeNumber)
[21:57:48] mutante, when you get a chance can you review https://gerrit.wikimedia.org/r/c/operations/puppet/+/596293 ? ty.
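The approach agreed on T252605 (a textfile exporter that parses the systemd status line for php-fpm) might be sketched as follows. This is not the exporter that was eventually written. It assumes the status line has the form "Processes active: N, idle: M, ...", which should be verified against real `systemctl show` output on a host. The metric name `phpfpm_systemd_processes` and the helper names are made up for illustration.

```python
import os
import re
import subprocess
import tempfile

# Assumed status line format; only the two process counters are extracted.
STATUS_RE = re.compile(r"Processes active: (?P<active>\d+), idle: (?P<idle>\d+)")


def read_status_text(unit="php7.2-fpm.service"):
    """Ask systemd for the unit's StatusText property (not run in the demo)."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=StatusText", "--value"],
        capture_output=True, text=True, check=True, timeout=10,
    )
    return result.stdout.strip()


def parse_status(status_text):
    """Extract active/idle worker counts from the status line."""
    match = STATUS_RE.search(status_text)
    if match is None:
        raise ValueError("unexpected status line: %r" % status_text)
    return {name: int(value) for name, value in match.groupdict().items()}


def render_metrics(counts):
    """Render the counts in the Prometheus text exposition format."""
    lines = [
        "# HELP phpfpm_systemd_processes PHP-FPM workers as reported to systemd.",
        "# TYPE phpfpm_systemd_processes gauge",
    ]
    for state in ("active", "idle"):
        lines.append('phpfpm_systemd_processes{state="%s"} %d' % (state, counts[state]))
    return "\n".join(lines) + "\n"


def write_atomically(path, text):
    """Write via a temp file and rename so a scrape never sees a partial file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.chmod(tmp, 0o644)
    os.replace(tmp, path)


# Demo on a made-up status line instead of a live systemd query.
SAMPLE = "Processes active: 3, idle: 9, Requests: 1200, slow: 0, Traffic: 1.5req/sec"
print(render_metrics(parse_status(SAMPLE)), end="")
```

On a real host the pieces would be chained as `write_atomically(path, render_metrics(parse_status(read_status_text())))` from a timer or cron job, with `path` pointing at a `.prom` file in whatever directory the node exporter's textfile collector is configured to read. Because it talks to systemd rather than to php-fpm itself, this keeps working when all workers are busy, which is the case the fcgi-based scrape discussed at 15:34 cannot cover.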