[00:08:34] (03PS1) 10Dzahn: exempt mwlog hosts from screen monitoring [puppet] - 10https://gerrit.wikimedia.org/r/407179 [00:08:47] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3935863 (10Paladox) @dzahn did you get a strace? :) [00:09:28] paladox: no :/ [00:09:46] but the exact same pattern as in the existing screenshot [00:10:21] i'll keep that grafana tab open and try to catch it next time, should have [00:12:28] Ok thanks :) [00:13:34] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, 10Patch-For-Review: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3935879 (10mmodell) [00:15:37] (03CR) 10BryanDavis: [C: 031] cloud: overlay whitelist as default [puppet] - 10https://gerrit.wikimedia.org/r/406851 (owner: 10Rush) [00:16:57] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3935889 (10Dzahn) Unfortunately not, i'll try to catch it next time. I have the Grafana link f... [00:20:43] mutante: does phab1001 use apache ldap module? [00:21:42] 10Operations, 10Ops-Access-Requests: Requesting access to analytics-users / webrequest for Esteban - https://phabricator.wikimedia.org/T185988#3935917 (10Dzahn) 05Open>03stalled [00:22:36] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3935923 (10ayounsi) Reviving this thread as T165519 is now resolved. Aiming to do the upgrade on Wednesday Feb 14. All the details from the description should still be accurate. Le... 
[00:23:46] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3935931 (10ayounsi) [00:24:20] It remembers me the symptoms of [00:24:24] https://bz.apache.org/bugzilla/show_bug.cgi?id=60296 / https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=814980 [00:25:37] Platonides: no, no Apache LDAP module on phab server (it's authnz_ldap on others) [00:28:36] what do you mean by "on others"? [00:28:45] by Apache LDAP I meant authnz_ldap_module [00:28:48] 10Operations, 10media-storage: Requesting access to swift for Phabricator's git-lfs storage - https://phabricator.wikimedia.org/T182085#3812316 (10Halfak) Ping. Checking in on this since it's blocking our transition to git-lfs in #ORES. Is someone working on this? [00:28:50] servers actually using Apache LDAP module [00:29:01] that arent phab [00:30:07] yea, we were talking about the same module name, used by several services but not this one [00:30:23] ok [00:30:29] I didn't remember the name by heart [00:30:42] the bugs turned out to be easy to find, though [00:31:14] (03PS2) 10Chad: Expose a simple Swagger spec for checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407174 (https://phabricator.wikimedia.org/T136839) [00:46:15] jouncebot: now [00:46:16] No deployments scheduled for the next 16 hour(s) and 13 minute(s) [00:53:27] (03Draft2) 10Zppix: Add throttle rule for an event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407183 (https://phabricator.wikimedia.org/T185930) [00:54:07] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#3935982 (10BBlack) All of the other lvs10xx will be decommed or spared out to other usage, once these new 4 machines are fully in service and tested. 1x per row makes perfec... [00:57:41] (03CR) 10Krinkle: "This currently will be exposed on all wikis, not just enwiki. But the tests are specific to enwiki. 
This is fine, but we could also make t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407174 (https://phabricator.wikimedia.org/T136839) (owner: 10Chad) [00:57:46] (03CR) 10Krinkle: [C: 031] Expose a simple Swagger spec for checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407174 (https://phabricator.wikimedia.org/T136839) (owner: 10Chad) [00:59:57] (03CR) 10Krinkle: [C: 031] Expose a simple Swagger spec for checks (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407174 (https://phabricator.wikimedia.org/T136839) (owner: 10Chad) [01:15:55] Krinkle: Did you want us to amend that before landing, or is that just a nice-to-have? [01:20:18] Ah just saw your comments. I'll follow up there [01:45:19] no_justification: nice to have, no rush [01:46:23] What's in Gerrit I tested with thcipriani in beta [02:23:02] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.17) (duration: 05m 53s) [02:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:05:03] (03CR) 10BBlack: [C: 031] "Compiler got through ~350 hosts as a sanity check. 
The diffs seen are as expected (mostly global lvs config impacts that are functionally" [puppet] - 10https://gerrit.wikimedia.org/r/392639 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [03:24:58] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 857.00 seconds [04:03:09] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 263.67 seconds [04:56:58] (03PS1) 10KartikMistry: WIP: Add Matxin MT config [puppet] - 10https://gerrit.wikimedia.org/r/407197 [06:14:14] (03PS1) 10KartikMistry: apertium-rus: New upstream release [debs/contenttranslation/apertium-rus] - 10https://gerrit.wikimedia.org/r/407202 (https://phabricator.wikimedia.org/T184901) [07:00:56] 10Operations, 10ops-eqiad, 10DBA: db1051 database host BBU issues - https://phabricator.wikimedia.org/T186049#3936240 (10Marostegui) This server definitely needs a BBU replacement. @Cmjohnson can you let us know a day that works for you to get it replaced? Thanks! [07:04:35] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3936304 (10Marostegui) >>! In T172459#3935923, @ayounsi wrote: > Reviving this thread as T165519 is now resolved. > > Aiming to do the upgrade on Wednesday Feb 14. All the details... 
[07:08:59] (03PS1) 10Marostegui: db-eqiad.php: Clarify db1051 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407321 [07:16:39] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [08:20:40] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [08:47:39] 10Operations, 10Ops-Access-Requests: Requesting access to wmf.webrequest for Simonjoylet - https://phabricator.wikimedia.org/T186190#3936572 (10Simonjoylet) [08:56:02] 10Operations, 10Ops-Access-Requests: Requesting access to wmf.webrequest for Simonjoylet - https://phabricator.wikimedia.org/T186190#3936592 (10Simonjoylet) Hey, there. I'm Simon from Southeast University. Now I'm on a research about scheduling web request and experiments for the research need web request data... [09:04:17] 10Operations, 10Icinga: Icinga contact group team-performance pointing to our old mailing list - https://phabricator.wikimedia.org/T186192#3936610 (10Gilles) [09:05:11] 10Operations, 10Icinga, 10Performance-Team: Icinga contact group team-performance pointing to our old mailing list - https://phabricator.wikimedia.org/T186192#3936621 (10Gilles) [09:09:21] 10Operations, 10Icinga, 10Performance-Team: Icinga contact group team-performance pointing to our old mailing list - https://phabricator.wikimedia.org/T186192#3936624 (10Gilles) [09:11:12] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "LGTM in general (I didn't test it tho), but see a few comments about the code structure." 
(0312 comments) [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [09:11:26] (03PS4) 10Lokal Profil: Drop the medlem user group and editallpages user right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404942 (https://phabricator.wikimedia.org/T184981) [09:11:36] (03CR) 10jerkins-bot: [V: 04-1] Drop the medlem user group and editallpages user right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404942 (https://phabricator.wikimedia.org/T184981) (owner: 10Lokal Profil) [09:12:32] (03PS5) 10Lokal Profil: Drop the medlem user group and editallpages user right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404942 (https://phabricator.wikimedia.org/T184981) [09:12:35] (03CR) 10Giuseppe Lavagetto: [C: 04-1] [WIP] php7 manifests for mediawiki on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [09:12:43] (03CR) 10jerkins-bot: [V: 04-1] Drop the medlem user group and editallpages user right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404942 (https://phabricator.wikimedia.org/T184981) (owner: 10Lokal Profil) [09:14:45] (03PS6) 10Lokal Profil: Drop the medlem user group and editallpages user right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404942 (https://phabricator.wikimedia.org/T184981) [09:15:05] (03CR) 10Lokal Profil: Drop the medlem user group and editallpages user right (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/404942 (https://phabricator.wikimedia.org/T184981) (owner: 10Lokal Profil) [09:40:04] is it just me, or does phab [09:40:17] *phab's favicon not show up now? [09:41:24] and i just said yesterday that almost nobody ever looks at favicons [09:41:30] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Generally correct, see a few comments inline." 
(035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/407039 (owner: 10Andrew Bogott) [09:42:53] well I guess I'm one of the few that do look at them :p [09:43:14] <_joe_> Zackary: I see it correctly [09:43:24] me too [09:43:41] <_joe_> Zackary: can you try to request https://phabricator.wikimedia.org/favicon.ico from an incognito window? [09:44:33] (03CR) 10Gehel: [C: 032] Metrics are exposed by Blazegraph directly [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/405878 (https://phabricator.wikimedia.org/T182857) (owner: 10Gehel) [09:44:35] <_joe_> (that's how it's called in firefox/chrome IIRC) [09:46:02] I can see the page if I go to it in a incognito window, but if I open a new tab in the same window and go to phab, it shows a blank favicon [09:46:26] (ye, thats what its called in chrome, firefox just calls it a private window) [09:50:10] huh, I can see the favicon on firefox and opera, but not in edge or chrome [09:53:13] <_joe_> Zackary: I can see it in chrome, I guess you're seeing some client-side caching issue [09:53:29] <_joe_> I also checked the headers and they seem ok [09:56:20] (03PS1) 10Gehel: update changelog [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/407405 (https://phabricator.wikimedia.org/T182857) [09:57:22] huh [09:57:48] I'll try waiting a bit to see if cache will fix itself [09:59:40] (03CR) 10Filippo Giunchedi: [C: 031] update changelog [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/407405 (https://phabricator.wikimedia.org/T182857) (owner: 10Gehel) [09:59:51] (03CR) 10Gehel: [C: 032] update changelog [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/407405 (https://phabricator.wikimedia.org/T182857) (owner: 10Gehel) [10:03:09] (03PS3) 10Filippo Giunchedi: Lower Thumbor subprocess timeout to 59 seconds [puppet] - 10https://gerrit.wikimedia.org/r/405698 (https://phabricator.wikimedia.org/T185479) (owner: 10Gilles) [10:04:27] (03PS2) 
10Muehlenhoff: Ensure all packages are updated when d-i installs security updates [puppet] - 10https://gerrit.wikimedia.org/r/405026 [10:05:09] (03CR) 10Filippo Giunchedi: [C: 032] "LGTM for now, we can revisit the strategy when/if more tweaking is needed" [puppet] - 10https://gerrit.wikimedia.org/r/405698 (https://phabricator.wikimedia.org/T185479) (owner: 10Gilles) [10:08:27] (03CR) 10Muehlenhoff: [C: 032] Ensure all packages are updated when d-i installs security updates [puppet] - 10https://gerrit.wikimedia.org/r/405026 (owner: 10Muehlenhoff) [10:08:31] (03PS3) 10Muehlenhoff: Ensure all packages are updated when d-i installs security updates [puppet] - 10https://gerrit.wikimedia.org/r/405026 [10:15:06] (03PS4) 10Gehel: wdqs: replace prometheus-wdqs-updater-exporter with prometheus-jmx-exporter [puppet] - 10https://gerrit.wikimedia.org/r/405887 (https://phabricator.wikimedia.org/T182773) [10:15:21] (03PS5) 10Gehel: wdqs: replace prometheus-wdqs-updater-exporter with prometheus-jmx-exporter [puppet] - 10https://gerrit.wikimedia.org/r/405887 (https://phabricator.wikimedia.org/T182773) [10:16:58] (03PS1) 10Jcrespo: mariadb: Repool es2018 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407407 (https://phabricator.wikimedia.org/T181293) [10:18:24] (03PS2) 10Jcrespo: mariadb: Repool es2018 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407407 (https://phabricator.wikimedia.org/T181293) [10:19:25] (03PS1) 10Filippo Giunchedi: thumbor: add conftool client [puppet] - 10https://gerrit.wikimedia.org/r/407409 [10:22:13] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/9835/thumbor1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/407409 (owner: 10Filippo Giunchedi) [10:23:52] _joe_: around? 
[10:24:03] (03CR) 10Gehel: "puppet compiler seems happy: https://puppet-compiler.wmflabs.org/compiler02/9834/" [puppet] - 10https://gerrit.wikimedia.org/r/405887 (https://phabricator.wikimedia.org/T182773) (owner: 10Gehel) [10:24:22] (03PS3) 10Gehel: wdqs: remove cleanup code after migrating to prometheus jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/405888 (https://phabricator.wikimedia.org/T182773) [10:25:31] (03CR) 10Filippo Giunchedi: wdqs: replace prometheus-wdqs-updater-exporter with prometheus-jmx-exporter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/405887 (https://phabricator.wikimedia.org/T182773) (owner: 10Gehel) [10:27:42] <_joe_> Amir1: more or less, yes [10:28:15] (03PS6) 10Gehel: wdqs: replace prometheus-wdqs-updater-exporter with prometheus-jmx-exporter [puppet] - 10https://gerrit.wikimedia.org/r/405887 (https://phabricator.wikimedia.org/T182773) [10:28:26] _joe_: just wanted to say job queue is in a record-breaking low 4M [10:28:35] and it's going down for real [10:28:38] <_joe_> Amir1: indeed :) [10:28:53] (03CR) 10Gehel: wdqs: replace prometheus-wdqs-updater-exporter with prometheus-jmx-exporter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/405887 (https://phabricator.wikimedia.org/T182773) (owner: 10Gehel) [10:29:00] my patch to fix ruwiki refresh link was deployed last couple of days [10:29:17] <_joe_> Amir1: which one, out of curiosity? 
[10:29:28] let me get it to you
[10:29:51] <_joe_> I was probably either sleeping or flying over greenland when it was deployed
[10:30:03] https://gerrit.wikimedia.org/r/#/c/406578/3/wmf-config/InitialiseSettings.php
[10:30:32] _joe_: what it does is that parses bad lua and turn it into a better usage tracking
[10:30:45] <_joe_> oh the fine-grained tracking, yes
[10:30:49] <_joe_> that's pretty nice
[10:31:05] <_joe_> and indeed it solves the major logical underoptimisation we had
[10:31:32] _joe_: there are two fine-grained usage tracking, we call this, lua fine-grained usage tracking xkill, we have statement fine-grained usage tracking (it's confusing, it's my fault)
[10:31:40] <_joe_> if this was not going to be enough, we'd be presented with architectural optimization questions no one wants to have to answer to :)
[10:31:55] <_joe_> xkill?
[10:32:06] <_joe_> as the unix program?
[10:32:08] because it kills x (all) aspects
[10:32:12] nah
[10:32:28] https://grafana.wikimedia.org/dashboard/db/wikidata-entity-usage-project?orgId=1&var-project=arwiki&var-project=cawiki&var-project=cswiki&var-project=elwiki&var-project=fawiki&var-project=hewiki&var-project=iawiki&var-project=jawiki&var-project=kowiki&var-project=ptwiki&var-project=ruwiki&var-project=trwiki&var-project=viwiki&var-project=wikidatawiki
[10:32:35] This is list of wikis it's enabled
[10:32:35] <_joe_> yeah, I'm just remembering the good ole days when 'alias ls=xkill' was a thing
[10:32:54] :))) That was evil
[10:32:55] but the problem is that it might blow up the database
[10:33:13] !log roll restart thumbor to lower subprocess timeout - T185479
[10:33:20] <_joe_> why should it?
[10:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:27] T185479: Lower Thumbor SUBPROCESS_TIMEOUT to 59 seconds to avoid tripping Varnish slow log - https://phabricator.wikimedia.org/T185479
[10:34:01] because someone might go through all languages or statements and we start to store all of them in the database
[10:34:10] it's already happening a little for catalan
[10:35:01] around several k articles are using 100 statements from wikidata
[10:35:29] now wbc_entity_usage table in cawiki has 50M rows, next week, it should drop to 20M
[10:35:46] <_joe_> so won't the solution be convert all the entries there somehow to proper calls?
[10:36:03] after this patch gets deployed
[10:36:17] <_joe_> heh ok :)
[10:36:22] <_joe_> good job anyways
[10:37:31] what we do is that it's going through more than 73 statement from an item, it turns it to a general statement tracking row (which might increase the job queue, it's a trade off but I guarantee it's so fine-grained that it make issues)
[10:37:38] I keep an eye on everything
[10:37:52] (except db storage stuff, which I can't :D)
[10:37:59] thank you
[10:38:57] (CR) Giuseppe Lavagetto: [C: -1] "Small change suggested." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/407409 (owner: Filippo Giunchedi)
[10:41:10] Operations, Analytics: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3936830 (faidon) I had a look at both `modules/eventlogging/files/eventloggingctl` and `modules/eventlogging/templates/upstart/*`. They all seemed fairly easy to reimplement with systemd (with or wit...
[10:44:42] (CR) Filippo Giunchedi: [C: 1] "LGTM, thanks Gehel for taking care of this!" [puppet] - https://gerrit.wikimedia.org/r/405887 (https://phabricator.wikimedia.org/T182773) (owner: Gehel)
[10:45:21] godog: ^ and thanks for holding my hand through this :)
[10:45:49] gehel: heheh no worries!
[10:46:31] the deployment on the prometheus side is somewhat convoluted... but I'll understand it one of these days... [10:47:03] (03PS1) 10Jcrespo: mariadb: Depool es2019 for upgrade/reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407410 [10:48:32] gehel: part of the reason might be that the prometheus/puppet "service discovery" isn't effectively documented anywhere :( [10:49:11] godog: who needs documentation when there is code ? he, he, he :) [10:49:13] (03CR) 10Alexandros Kosiaris: [C: 031] "LGTM as well. Also echoing filippo. Some PCC would be nice" [puppet] - 10https://gerrit.wikimedia.org/r/406794 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [10:50:03] gehel: heheh indeed [10:56:21] (03PS2) 10Filippo Giunchedi: thumbor: add conftool client [puppet] - 10https://gerrit.wikimedia.org/r/407409 [10:59:06] (03CR) 10Giuseppe Lavagetto: [C: 031] thumbor: add conftool client [puppet] - 10https://gerrit.wikimedia.org/r/407409 (owner: 10Filippo Giunchedi) [11:00:07] (03PS3) 10Filippo Giunchedi: thumbor: add conftool client [puppet] - 10https://gerrit.wikimedia.org/r/407409 [11:02:28] (03CR) 10Filippo Giunchedi: [C: 032] thumbor: add conftool client [puppet] - 10https://gerrit.wikimedia.org/r/407409 (owner: 10Filippo Giunchedi) [11:03:09] (03PS1) 10Gilles: Add Thumbor-Request-Id generated by nginx [puppet] - 10https://gerrit.wikimedia.org/r/407411 (https://phabricator.wikimedia.org/T179954) [11:12:44] (03CR) 10Marostegui: [C: 031] mariadb: Repool es2018 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407407 (https://phabricator.wikimedia.org/T181293) (owner: 10Jcrespo) [11:19:43] (03CR) 10Marostegui: [C: 031] mariadb: Depool es2019 for upgrade/reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407410 (owner: 10Jcrespo) [11:20:24] (03CR) 10Jcrespo: [C: 032] mariadb: Repool es2018 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407407 (https://phabricator.wikimedia.org/T181293) 
(owner: 10Jcrespo) [11:32:46] 10Operations, 10MediaWiki-Configuration, 10Availability (Multiple-active-datacenters), 10MediaWiki-Platform-Team (MWPT-Q3-Jan-Mar-2018), and 4 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3937082 (10Volans) @tstarling Thanks for th... [11:41:09] 10Operations, 10Traffic, 10Wikimedia-Site-requests: outdated DjVu file page thumbnail in cache - https://phabricator.wikimedia.org/T186153#3937114 (10Aklapper) [11:44:33] operations-mw-config-composer-hhvm-jessie is stuck [11:45:34] not stuck, just 1 hour wait [11:53:37] !log installing libxml2 security updates [11:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:27] (03PS1) 10Alexandros Kosiaris: postgresql::slave: Granular relationships [puppet] - 10https://gerrit.wikimedia.org/r/407415 (https://phabricator.wikimedia.org/T184634) [12:01:13] (03Merged) 10jenkins-bot: mariadb: Repool es2018 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407407 (https://phabricator.wikimedia.org/T181293) (owner: 10Jcrespo) [12:01:25] (03CR) 10jenkins-bot: mariadb: Repool es2018 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407407 (https://phabricator.wikimedia.org/T181293) (owner: 10Jcrespo) [12:01:32] (03CR) 10Jcrespo: [C: 032] mariadb: Depool es2019 for upgrade/reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407410 (owner: 10Jcrespo) [12:03:27] !log restarting squid on URL downloaders to pick up libxml2 security update [12:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:13] (03Merged) 10jenkins-bot: mariadb: Depool es2019 for upgrade/reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407410 (owner: 10Jcrespo) [12:04:51] (03CR) 10jenkins-bot: mariadb: Depool es2019 for upgrade/reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407410 (owner: 10Jcrespo) [12:05:00] (03PS2) 
10KartikMistry: WIP: Add Matxin MT config [puppet] - 10https://gerrit.wikimedia.org/r/407197 (https://phabricator.wikimedia.org/T119384) [12:06:25] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool es2018, depool es2019 (duration: 00m 57s) [12:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:57] (03PS3) 10KartikMistry: WIP: Add Matxin MT config [puppet] - 10https://gerrit.wikimedia.org/r/407197 (https://phabricator.wikimedia.org/T186204) [12:08:19] (03CR) 10Alexandros Kosiaris: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9836/netmon2001.wikimedia.org/ says ok, let's see if this works" [puppet] - 10https://gerrit.wikimedia.org/r/407415 (https://phabricator.wikimedia.org/T184634) (owner: 10Alexandros Kosiaris) [12:15:27] 10Operations, 10Traffic, 10Wikimedia-Site-requests: outdated DjVu file page thumbnail in cache - https://phabricator.wikimedia.org/T186153#3935257 (10BBlack) I get the same identifying headers when I query that URL directly from Swift storage internally, indicating this is not an edge caching issue: ``` curl... [12:18:55] PROBLEM - puppet last run on maps1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:20:05] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:20:35] (03PS7) 10BBlack: eqsin: deeper configuration details [puppet] - 10https://gerrit.wikimedia.org/r/392639 (https://phabricator.wikimedia.org/T156027) [12:21:13] (03CR) 10jerkins-bot: [V: 04-1] eqsin: deeper configuration details [puppet] - 10https://gerrit.wikimedia.org/r/392639 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [12:22:25] PROBLEM - puppet last run on maps-test2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:22:36] PROBLEM - puppet last run on maps1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:23:54] maps puppetfail seems to be the postgresql patch [12:25:41] (03CR) 10BBlack: [V: 031 C: 032] eqsin: deeper configuration details [puppet] - 10https://gerrit.wikimedia.org/r/392639 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [12:25:45] (03CR) 10BBlack: [V: 032 C: 032] eqsin: deeper configuration details [puppet] - 10https://gerrit.wikimedia.org/r/392639 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [12:26:44] note: there's a small chance the above will cause some unexpected fallout, e.g. I missed that it will auto-generate an icinga check that fails, because eqsin isn't actually online yet. [12:27:03] I think I caught all of those, but if you see an alert in the next 30 minutes and it references "eqsin", it's not real :) [12:29:05] PROBLEM - puppet last run on maps2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:29:50] (03PS1) 10Alexandros Kosiaris: netbox: Switch to 4 space indents [puppet] - 10https://gerrit.wikimedia.org/r/407417 [12:29:52] (03PS1) 10Alexandros Kosiaris: netbox: Remove hiera default value [puppet] - 10https://gerrit.wikimedia.org/r/407418 [12:30:51] (03PS2) 10Alexandros Kosiaris: netbox: Remove hiera default value [puppet] - 10https://gerrit.wikimedia.org/r/407418 [12:31:10] 10Operations, 10Mail: set up DMARC aggregate report collection into a database for research and reporting - https://phabricator.wikimedia.org/T86209#963609 (10MoritzMuehlenhoff) >>! In T86209#3809083, @herron wrote: > I'll tear down the systems from T169566 as well What's the status of the VM removal? I notic... [12:31:55] PROBLEM - puppet last run on maps-test2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:32:15] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues
[12:34:05] PROBLEM - puppet last run on maps2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:34:15] PROBLEM - puppet last run on maps1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:36:55] PROBLEM - puppet last run on nihal is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:37:04] (PS1) Muehlenhoff: update cumin alias for dumps [puppet] - https://gerrit.wikimedia.org/r/407419
[12:37:06] PROBLEM - puppet last run on maps2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:37:45] akosiaris: Invalid relationship: File[/srv/postgres/9.4/main/recovery.conf] { before => Service[postgresql] }, because Service[postgresql] doesn't seem to be in the catalog
[12:38:11] (PS2) Muehlenhoff: update cumin alias for dumps [puppet] - https://gerrit.wikimedia.org/r/407419
[12:38:39] $service_name = $::lsbdistcodename ? {
[12:38:39] 'jessie' => "postgresql@${pgversion}-main",
[12:38:40] default => 'postgresql',
[12:40:33] (CR) Muehlenhoff: [C: 2] update cumin alias for dumps [puppet] - https://gerrit.wikimedia.org/r/407419 (owner: Muehlenhoff)
[12:52:06] (PS2) Alexandros Kosiaris: netbox: Switch to 4 space indents [puppet] - https://gerrit.wikimedia.org/r/407417
[12:52:08] (PS3) Alexandros Kosiaris: netbox: Remove hiera default value [puppet] - https://gerrit.wikimedia.org/r/407418
[12:52:10] (PS1) Alexandros Kosiaris: netbox: Only create users/db on master [puppet] - https://gerrit.wikimedia.org/r/407420
[12:53:19] !log restarting apache on hafnium to pick up libxml2 security update
[12:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:38] damn I hate postgres
[12:53:57] and the fact we are trying to puppetize it even more
[12:54:25] what's that service_name thing even doing there... postgresql is a valid service name even in jessie
[12:54:35] !log restarting nginx on debug proxies to pick up libxml2 security update
[12:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:59:11] (PS1) Jcrespo: mariadb: Disable notifications for es2019 before reimage [puppet] - https://gerrit.wikimedia.org/r/407421
[12:59:13] (PS1) Jcrespo: mariadb: Move es2019 socket to the default location [puppet] - https://gerrit.wikimedia.org/r/407422 (https://phabricator.wikimedia.org/T148507)
[13:00:32] (CR) Jcrespo: [C: 2] mariadb: Disable notifications for es2019 before reimage [puppet] - https://gerrit.wikimedia.org/r/407421 (owner: Jcrespo)
[13:02:15] akosiaris: looks like your recent postgresql change is breaking on maps. I'm having a look...
[13:04:58] gehel: see backlog, I'm out for lunch, but I guess it could be reverted if a fix is not ready or complex
[13:05:11] fix coming up
[13:05:44] gehel: I am looking at it already
[13:05:48] did I mention I hate postgres ?
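Editor's note: the selector pasted at 12:38 and the `Invalid relationship` error at 12:37:45 appear to fit together roughly as in the sketch below. This is a simplified illustration, not the actual contents of the postgresql module. The class names `example::pg_server` and `example::pg_replica`, the default value of `$pgversion`, and the assumption that the service resource is titled with `$service_name` are all hypothetical.

```puppet
# Hypothetical, simplified classes for illustration only.
class example::pg_server (
    $pgversion = '9.4',
) {
    # Selector as pasted in the log, with the closing brace restored.
    $service_name = $::lsbdistcodename ? {
        'jessie' => "postgresql@${pgversion}-main",
        default  => 'postgresql',
    }

    # Assumption: the service is declared under $service_name, so on
    # jessie the resource is Service['postgresql@9.4-main'] and no
    # resource titled Service['postgresql'] exists.
    service { $service_name:
        ensure => running,
    }
}

class example::pg_replica {
    file { '/srv/postgres/9.4/main/recovery.conf':
        ensure => file,
        # A hard-coded title only matches where the selector falls
        # through to 'default'. On jessie the relationship points at a
        # resource that was never declared, which matches the error
        # quoted in the log.
        before => Service['postgresql'],
    }
}
```

Under these assumptions, any host that resolves the selector to the `jessie` branch would fail the same way, which would be consistent with the cluster of "Catalog fetch fail" alerts on the maps hosts.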
[13:05:53] :)
[13:06:17] service name isn't consistent on all distro, so a case statement is needed...
[13:06:45] I'm afraid it needs to be duplicated from postgresql::server since you are trying to break that dependency
[13:06:54] I am afraid the exact same thing
[13:07:02] and I hate how ugly it is
[13:07:07] but having 2 case statements on different files foe the same thing is a call for later failures ;)
[13:07:13] ok, I'll let you push that fix
[13:07:22] s/foe/for/
[13:07:27] volans|off: yup, which is why it is ugly
[13:07:42] having an inconsistent service name is a call for later failures :)
[13:07:54] indeed
[13:07:56] maybe I should reuse service_name from postgresql::server
[13:08:07] let's see if it would work
[13:08:31] did anyone already upgraded apache from jessie's version to stretch? Something to be aware of?
[13:08:32] yeah, but in that case, you need to ensure ordering again, which goes again what you were trying to achieve (if I understand it correctly)
[13:09:02] possibly...
[13:09:13] !log upgrade nging on elastic*
[13:09:15] the number of things I hate right now is increasing
[13:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:39] could we pass the service name from above to both classes?
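Editor's note: the option floated above, reusing `service_name` from `postgresql::server` rather than duplicating the case statement, could look roughly like the following sketch. The class name `example::pg_replica` is hypothetical. The sketch also assumes that `postgresql::server` exposes a class-scope variable named `$service_name` and declares the service under that title; the log confirms only the variable name.

```puppet
# Hypothetical class for illustration only.
class example::pg_replica {
    # Qualified lookup of a variable owned by another class. Without an
    # include/require of postgresql::server in this class, the lookup
    # only resolves if that class happens to be evaluated earlier in the
    # same compilation. That is the order dependence raised in the
    # discussion.
    $service_name = $::postgresql::server::service_name

    file { '/srv/postgres/9.4/main/recovery.conf':
        ensure => file,
        before => Service[$service_name],
    }
}
```

Adding `include postgresql::server` at the top of the class would make the lookup reliable, but, as noted in the discussion, it would reintroduce the dependency the original change was trying to break.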
[13:10:05] sorry if doesn't make sense, talking without the code in front
[13:10:07] FWIW I think we should stop trying to puppetize the damn thing
[13:10:15] https://tenor.com/view/yoda-anger-hate-suffering-starwars-gif-4977234
[13:10:15] we don't puppetize those things for mysql
[13:10:22] we shouldn't do it for postgres
[13:10:26] volans|off: ugly as well, the service name is a postgresql concern, which should be in the postgres module, not in a profile / role
[13:10:30] the only reason we do is we don't got postgres dbas
[13:10:56] that is correct- account management can be automated, but it must be done outside for configuration management
[13:11:03] *of
[13:11:09] that is not account management
[13:11:15] it is slave configuration
[13:11:21] but your statement still holds true
[13:11:23] well, topology applies
[13:11:27] exactly :-)
[13:13:01] !log restarting postgresql and nodejs services on maps*
[13:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:39] !log restarting apache on krypton to pick up libxml2 security update
[13:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:50] (PS1) Alexandros Kosiaris: postgres::slave: Reuse $service_name [puppet] - https://gerrit.wikimedia.org/r/407424
[13:17:18] (CR) Filippo Giunchedi: [C: 2] Upgrade to 1.11 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/407412 (https://phabricator.wikimedia.org/T178072) (owner: Gilles)
[13:17:47] (CR) Alexandros Kosiaris: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/9841/maps1004.eqiad.wmnet/ says this works although it's a hack."
[puppet] - 10https://gerrit.wikimedia.org/r/407424 (owner: 10Alexandros Kosiaris) [13:18:00] gehel: ^ [13:18:12] looking [13:18:16] I am merging and pray we don't hit some other damn thing [13:18:30] and yes I know it's order dependent [13:18:47] but there is no damn nice way to solve this [13:18:59] (03CR) 10Gehel: [C: 031] "that's the least ugly solution..." [puppet] - 10https://gerrit.wikimedia.org/r/407424 (owner: 10Alexandros Kosiaris) [13:19:10] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407#3363615 (10MoritzMuehlenhoff) @RobH : Given Antoine's comment, let's reclaim, then? This host has almost 2.5 years remaining warranty [13:20:03] !log stop and reimage es2019 [13:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:35] (03CR) 10Alexandros Kosiaris: [C: 032] "For posterity's sake, this is order dependent. If we ever set order= in puppet.conf or puppet decides to change default resource ordering," [puppet] - 10https://gerrit.wikimedia.org/r/407424 (owner: 10Alexandros Kosiaris) [13:22:22] (03PS3) 10Alexandros Kosiaris: netbox: Switch to 4 space indents [puppet] - 10https://gerrit.wikimedia.org/r/407417 [13:22:24] (03PS2) 10Alexandros Kosiaris: netbox: Only create users/db on master [puppet] - 10https://gerrit.wikimedia.org/r/407420 [13:22:29] (03PS4) 10Alexandros Kosiaris: netbox: Remove hiera default value [puppet] - 10https://gerrit.wikimedia.org/r/407418 [13:23:15] !log force puppet run on all postgres servers for https://gerrit.wikimedia.org/r/407424 [13:23:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:44] (03PS2) 10Marostegui: db-eqiad.php: Clarify db1051 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407321 [13:23:52] RECOVERY - puppet last run on maps1002 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [13:24:01] RECOVERY - puppet 
last run on maps2002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [13:24:02] RECOVERY - puppet last run on maps2003 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [13:24:11] RECOVERY - puppet last run on maps1004 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [13:24:16] (03PS1) 10Muehlenhoff: Update Cumin alias for labservices [puppet] - 10https://gerrit.wikimedia.org/r/407425 [13:25:28] (03CR) 10Muehlenhoff: [C: 032] Update Cumin alias for labservices [puppet] - 10https://gerrit.wikimedia.org/r/407425 (owner: 10Muehlenhoff) [13:26:52] RECOVERY - puppet last run on maps-test2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:26:52] RECOVERY - puppet last run on nihal is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:27:11] RECOVERY - puppet last run on maps2004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:27:14] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Clarify db1051 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407321 (owner: 10Marostegui) [13:27:21] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:27:22] RECOVERY - puppet last run on maps-test2003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:27:39] (03PS4) 10Alexandros Kosiaris: netbox: Switch to 4 space indents [puppet] - 10https://gerrit.wikimedia.org/r/407417 [13:27:41] RECOVERY - puppet last run on maps1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:27:41] (03PS3) 10Alexandros Kosiaris: netbox: Only create users/db on master [puppet] - 10https://gerrit.wikimedia.org/r/407420 [13:27:43] (03PS5) 10Alexandros Kosiaris: netbox: Remove hiera default value [puppet] - 10https://gerrit.wikimedia.org/r/407418 [13:28:03] (03CR) 10Alexandros 
Kosiaris: [C: 032] netbox: Switch to 4 space indents [puppet] - 10https://gerrit.wikimedia.org/r/407417 (owner: 10Alexandros Kosiaris) [13:28:09] (03CR) 10Alexandros Kosiaris: [C: 032] netbox: Only create users/db on master [puppet] - 10https://gerrit.wikimedia.org/r/407420 (owner: 10Alexandros Kosiaris) [13:28:26] (03CR) 10Alexandros Kosiaris: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9842/ says the OK, so merging" [puppet] - 10https://gerrit.wikimedia.org/r/407420 (owner: 10Alexandros Kosiaris) [13:28:49] (03Merged) 10jenkins-bot: db-eqiad.php: Clarify db1051 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407321 (owner: 10Marostegui) [13:29:02] (03CR) 10jenkins-bot: db-eqiad.php: Clarify db1051 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407321 (owner: 10Marostegui) [13:30:22] 10Operations, 10Traffic, 10Wikimedia-Site-requests: outdated DjVu file page thumbnail in cache - https://phabricator.wikimedia.org/T186153#3937383 (10Ankry) > I suspect this has more to do with (possibly old) thumbnailing or djvu-handling issues than caching? Newly generated thumbnails of other sizes are ge... 
[13:31:12] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Update db1051 reason for depooling (duration: 00m 56s) [13:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:31] (03CR) 10Jcrespo: [C: 032] mariadb: Move es2019 socket to the default location [puppet] - 10https://gerrit.wikimedia.org/r/407422 (https://phabricator.wikimedia.org/T148507) (owner: 10Jcrespo) [13:39:36] (03PS2) 10Jcrespo: mariadb: Move es2019 socket to the default location [puppet] - 10https://gerrit.wikimedia.org/r/407422 (https://phabricator.wikimedia.org/T148507) [13:40:11] (03PS1) 10Muehlenhoff: Fix alias [puppet] - 10https://gerrit.wikimedia.org/r/407426 [13:41:01] (03PS7) 10Gehel: wdqs: replace prometheus-wdqs-updater-exporter with prometheus-jmx-exporter [puppet] - 10https://gerrit.wikimedia.org/r/405887 (https://phabricator.wikimedia.org/T182773) [13:41:49] (03CR) 10Muehlenhoff: [C: 032] Fix alias [puppet] - 10https://gerrit.wikimedia.org/r/407426 (owner: 10Muehlenhoff) [13:42:09] (03PS3) 10Jcrespo: mariadb: Move es2019 socket to the default location [puppet] - 10https://gerrit.wikimedia.org/r/407422 (https://phabricator.wikimedia.org/T148507) [13:42:55] !log restarting nginx on wdqs* for upgrade [13:43:07] (03CR) 10Gehel: [C: 032] wdqs: replace prometheus-wdqs-updater-exporter with prometheus-jmx-exporter [puppet] - 10https://gerrit.wikimedia.org/r/405887 (https://phabricator.wikimedia.org/T182773) (owner: 10Gehel) [13:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:16] (03PS8) 10Gehel: wdqs: replace prometheus-wdqs-updater-exporter with prometheus-jmx-exporter [puppet] - 10https://gerrit.wikimedia.org/r/405887 (https://phabricator.wikimedia.org/T182773) [13:46:32] 10Operations, 10Wikimedia-Site-requests, 10media-storage: outdated DjVu file page thumbnail in cache - https://phabricator.wikimedia.org/T186153#3937436 (10BBlack) Well, yes, semantics :) It is a "caching" problem in some general sense 
of the word, but in terms of pointing fingers at different parts of our... [13:47:25] (03PS1) 10Gehel: wdqs: fixed typo after introducing prometheus-jmx-exporter [puppet] - 10https://gerrit.wikimedia.org/r/407427 (https://phabricator.wikimedia.org/T182773) [13:47:48] (03PS1) 10Jcrespo: Revert "mariadb: Disable notifications for es2019 before reimage" [puppet] - 10https://gerrit.wikimedia.org/r/407428 [13:48:02] (03CR) 10jerkins-bot: [V: 04-1] Revert "mariadb: Disable notifications for es2019 before reimage" [puppet] - 10https://gerrit.wikimedia.org/r/407428 (owner: 10Jcrespo) [13:48:09] (03Abandoned) 10Jcrespo: Revert "mariadb: Disable notifications for es2019 before reimage" [puppet] - 10https://gerrit.wikimedia.org/r/407428 (owner: 10Jcrespo) [13:49:51] (03CR) 10Gehel: [C: 032] wdqs: fixed typo after introducing prometheus-jmx-exporter [puppet] - 10https://gerrit.wikimedia.org/r/407427 (https://phabricator.wikimedia.org/T182773) (owner: 10Gehel) [13:50:03] (03PS1) 10Jcrespo: mariadb: Reenable notifications on es2019 before repool [puppet] - 10https://gerrit.wikimedia.org/r/407429 [13:50:39] (03PS2) 10Jcrespo: mariadb: Reenable notifications on es2019 before repool [puppet] - 10https://gerrit.wikimedia.org/r/407429 [13:54:19] (03PS4) 10Gehel: wdqs: remove cleanup code after migrating to prometheus jmx exporter [puppet] - 10https://gerrit.wikimedia.org/r/405888 (https://phabricator.wikimedia.org/T182773) [13:54:23] PROBLEM - puppet last run on wdqs1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:55:04] (03CR) 10Alexandros Kosiaris: [C: 032] "This is weird. 
PCC fails at https://puppet-compiler.wmflabs.org/compiler02/9843/netmon1002.wikimedia.org/change.netmon1002.wikimedia.org.e" [puppet] - 10https://gerrit.wikimedia.org/r/407418 (owner: 10Alexandros Kosiaris) [13:55:09] (03PS6) 10Alexandros Kosiaris: netbox: Remove hiera default value [puppet] - 10https://gerrit.wikimedia.org/r/407418 [13:55:13] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] netbox: Remove hiera default value [puppet] - 10https://gerrit.wikimedia.org/r/407418 (owner: 10Alexandros Kosiaris) [13:55:42] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching), 10User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3937475 (10Mvolz) >>! In T165105#3919661, @The_RedBurn wrote: > From Twitter above: > * 10.1080/03014223.19... [13:56:30] !log restarting nginx on meitnerium/archiva to pick up libxml2 security update [13:56:38] (03CR) 10Alexandros Kosiaris: "PCC worked correctly, this did actually fail, PEBKAC on my side" [puppet] - 10https://gerrit.wikimedia.org/r/407418 (owner: 10Alexandros Kosiaris) [13:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:46] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching), 10User-mobrovac: Some requests for DOIs are failing or very slow; if we have a DOI and the request is taking too long, just use CrossRef data instead. - https://phabricator.wikimedia.org/T165105#3937476 (10Mvolz) a:03Mvolz [13:58:06] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching), 10User-mobrovac: Some requests for DOIs are failing or very slow; if we have a DOI and the request is taking too long, just use CrossRef data instead. - https://phabricator.wikimedia.org/T165105#3257087 (10Mvolz) [13:58:40] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:59:20] RECOVERY - puppet last run on wdqs1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:00:44] (03PS1) 10Alexandros Kosiaris: netbox: Add hiera key into correct role [puppet] - 10https://gerrit.wikimedia.org/r/407430 [14:00:48] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching), 10User-mobrovac: Some requests for DOIs are failing or very slow; if we have a DOI and the request is taking too long, just use CrossRef data instead. - https://phabricator.wikimedia.org/T165105#3937502 (10Mvolz) [14:01:08] !log restarting nginx on puppetdb hosts to pick up libxml2 security update [14:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:29] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407#3937503 (10chasemp) The thinking has been that even with nodepool decomissioned there will be another piece of software responsible for managing a resource poo... [14:01:30] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:01:46] (03CR) 10Alexandros Kosiaris: [C: 032] netbox: Add hiera key into correct role [puppet] - 10https://gerrit.wikimedia.org/r/407430 (owner: 10Alexandros Kosiaris) [14:02:21] (03PS1) 10Gehel: wdqs: corrected source of jmx exporter configuration [puppet] - 10https://gerrit.wikimedia.org/r/407431 (https://phabricator.wikimedia.org/T182773) [14:02:47] (03PS2) 10Gehel: wdqs: corrected source of jmx exporter configuration [puppet] - 10https://gerrit.wikimedia.org/r/407431 (https://phabricator.wikimedia.org/T182773) [14:03:16] (03CR) 10Gehel: [C: 032] wdqs: corrected source of jmx exporter configuration [puppet] - 10https://gerrit.wikimedia.org/r/407431 (https://phabricator.wikimedia.org/T182773) (owner: 10Gehel) [14:04:30] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational [14:06:33] (03PS1) 10Alexandros Kosiaris: netbox: Allow slaves to to connect to master [puppet] - 10https://gerrit.wikimedia.org/r/407432 (https://phabricator.wikimedia.org/T184634) [14:08:40] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:09:40] PROBLEM - Check systemd state on wdqs2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:09:50] PROBLEM - Check systemd state on wdqs2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:09:51] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:10:30] PROBLEM - Check systemd state on wdqs2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:12:16] (03PS1) 10Filippo Giunchedi: raid: fix check-hpssacli for controllers in HBA mode [puppet] - 10https://gerrit.wikimedia.org/r/407433 (https://phabricator.wikimedia.org/T185216) [14:12:30] RECOVERY - Check systemd state on wdqs2002 is OK: OK - running: The system is fully operational [14:14:55] (03CR) 10Filippo Giunchedi: "Tested on restbase1011 (HBA) and restbase1012 (no HBA)" [puppet] - 10https://gerrit.wikimedia.org/r/407433 (https://phabricator.wikimedia.org/T185216) (owner: 10Filippo Giunchedi) [14:15:42] (03CR) 10Jcrespo: [C: 032] mariadb: Reenable notifications on es2019 before repool [puppet] - 10https://gerrit.wikimedia.org/r/407429 (owner: 10Jcrespo) [14:15:47] (03PS3) 10Jcrespo: mariadb: Reenable notifications on es2019 before repool [puppet] - 10https://gerrit.wikimedia.org/r/407429 [14:17:40] RECOVERY - Check systemd state on wdqs2001 is OK: OK - running: The system is fully operational [14:18:20] (03PS1) 10Jcrespo: Revert "mariadb: Depool es2019 for upgrade/reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407435 [14:20:50] RECOVERY - Check systemd state on wdqs2003 is OK: OK - running: The system is fully operational [14:21:50] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational [14:25:32] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching), 10User-mobrovac: Some requests for DOIs are failing or very slow; if we have a DOI and the request is taking too long, just use CrossRef data instead. - https://phabricator.wikimedia.org/T165105#3937578 (10akosiaris) >> I'd rather wait a few... 
[14:27:37] (03PS2) 10Alexandros Kosiaris: netbox: Allow slaves to to connect to master [puppet] - 10https://gerrit.wikimedia.org/r/407432 (https://phabricator.wikimedia.org/T184634) [14:27:58] (03CR) 10Alexandros Kosiaris: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/9845/ is happy, merging" [puppet] - 10https://gerrit.wikimedia.org/r/407432 (https://phabricator.wikimedia.org/T184634) (owner: 10Alexandros Kosiaris) [14:34:42] !log restarting apache on rutherfordium to pick up libxml2 security update [14:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:28] 10Operations, 10Icinga, 10Performance-Team: Icinga contact group team-performance pointing to our old mailing list - https://phabricator.wikimedia.org/T186192#3937621 (10Dzahn) 05Open>03Resolved a:03Dzahn Yea, the contacts are in our private repo because the phone numbers in that file. There is T16423... [14:39:07] (03PS1) 10Awight: Set changeprop URI to the new beta ORES node [puppet] - 10https://gerrit.wikimedia.org/r/407437 (https://phabricator.wikimedia.org/T184501) [14:39:39] (03PS1) 10Alexandros Kosiaris: netbox: Add IPv6 ferm rules as well [puppet] - 10https://gerrit.wikimedia.org/r/407438 (https://phabricator.wikimedia.org/T184634) [14:39:45] (03PS1) 10Muehlenhoff: Extend misc-ops Cumin alias with role::mirrors [puppet] - 10https://gerrit.wikimedia.org/r/407439 [14:40:27] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool es2019 for upgrade/reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407435 (owner: 10Jcrespo) [14:40:41] !log restarting nginx on sodium to pick up libxml2 security update [14:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:55] (03CR) 10Alexandros Kosiaris: [C: 032] netbox: Add IPv6 ferm rules as well [puppet] - 10https://gerrit.wikimedia.org/r/407438 (https://phabricator.wikimedia.org/T184634) (owner: 10Alexandros Kosiaris) [14:42:42] (03Merged) 10jenkins-bot: Revert "mariadb: 
Depool es2019 for upgrade/reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407435 (owner: 10Jcrespo) [14:43:00] (03CR) 10jenkins-bot: Revert "mariadb: Depool es2019 for upgrade/reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407435 (owner: 10Jcrespo) [14:44:22] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool es2019 (duration: 00m 57s) [14:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:01] !log installing tiff security updates [14:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:58] (03PS2) 10Muehlenhoff: Extend misc-ops Cumin alias with role::mirrors [puppet] - 10https://gerrit.wikimedia.org/r/407439 [14:51:04] (03CR) 10Muehlenhoff: [C: 032] Extend misc-ops Cumin alias with role::mirrors [puppet] - 10https://gerrit.wikimedia.org/r/407439 (owner: 10Muehlenhoff) [14:51:21] (03PS1) 10Alexandros Kosiaris: Specify correct profile::netbox::slaves value [puppet] - 10https://gerrit.wikimedia.org/r/407442 [14:51:52] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Specify correct profile::netbox::slaves value [puppet] - 10https://gerrit.wikimedia.org/r/407442 (owner: 10Alexandros Kosiaris) [14:52:18] moritzm: merging yours as well [14:52:52] thx [14:54:03] (03PS1) 10Muehlenhoff: Add library hint for tiff [puppet] - 10https://gerrit.wikimedia.org/r/407443 [14:56:11] (03Abandoned) 10Zoranzoki21: Add throttle rule for 1Lib1Ref event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406603 (https://phabricator.wikimedia.org/T185857) (owner: 10Zoranzoki21) [14:57:20] (03CR) 10Muehlenhoff: [C: 032] Add library hint for tiff [puppet] - 10https://gerrit.wikimedia.org/r/407443 (owner: 10Muehlenhoff) [15:07:19] PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:14:29] !log installing curl security updates [15:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:39] (03PS1) 10Filippo Giunchedi: prometheus: aggregate nginx requests and availability [puppet] - 10https://gerrit.wikimedia.org/r/407445 (https://phabricator.wikimedia.org/T177195) [15:16:40] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational [15:20:36] (03PS1) 10Alexandros Kosiaris: netbox: Add IPv6 postgresql::user resource [puppet] - 10https://gerrit.wikimedia.org/r/407446 (https://phabricator.wikimedia.org/T184634) [15:21:00] RECOVERY - puppet last run on netmon2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:23:42] (03PS2) 10Filippo Giunchedi: raid: fix check-hpssacli for controllers in HBA mode [puppet] - 10https://gerrit.wikimedia.org/r/407433 (https://phabricator.wikimedia.org/T185216) [15:23:44] (03PS1) 10Filippo Giunchedi: raid: report PDs from get-raid-status-hpssacli [puppet] - 10https://gerrit.wikimedia.org/r/407447 (https://phabricator.wikimedia.org/T185216) [15:24:35] (03CR) 10Alexandros Kosiaris: [C: 032] netbox: Add IPv6 postgresql::user resource [puppet] - 10https://gerrit.wikimedia.org/r/407446 (https://phabricator.wikimedia.org/T184634) (owner: 10Alexandros Kosiaris) [15:27:08] !log restarting apache/HHVM on deployment servers to pick up libxml2/curl security updates [15:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:30] PROBLEM - puppet last run on wtp2016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[tzdata] [15:29:24] (03PS2) 10Dzahn: DNS: Add mgmt dns entries for tendril2001 [dns] - 10https://gerrit.wikimedia.org/r/407171 (owner: 10Papaul) [15:31:48] (03CR) 10Dzahn: [C: 032] "confirmed" [dns] - 10https://gerrit.wikimedia.org/r/407171 (owner: 10Papaul) [15:32:19] RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:32:53] 10Operations, 10Wikimedia-Site-requests, 10media-storage: outdated DjVu file page thumbnail in cache - https://phabricator.wikimedia.org/T186153#3935257 (10fgiunchedi) Indeed looks like not all thumbs for all pages have been purged and some are still in swift for the older version. @Ankry does purging the fi... [15:33:43] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db2016, db2017, db2018, db2019, db2023, db2028, db2029 - https://phabricator.wikimedia.org/T184090#3937784 (10Dzahn) a:05Papaul>03RobH [15:35:47] 10Operations, 10Patch-For-Review: Netbox: postgres cannot be restarted w/ current config - https://phabricator.wikimedia.org/T184634#3937785 (10akosiaris) The above series of patches fixed netmon2001 not running puppet. 
[15:35:59] 10Operations, 10Patch-For-Review: Netbox: postgres cannot be restarted w/ current config - https://phabricator.wikimedia.org/T184634#3937786 (10akosiaris) [15:39:46] !log upgrading nginx on mw1266-mw1299 (for T164456) [15:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:58] T164456: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456 [15:40:51] (03PS2) 10Filippo Giunchedi: prometheus: aggregate nginx requests and availability [puppet] - 10https://gerrit.wikimedia.org/r/407445 (https://phabricator.wikimedia.org/T177195) [15:43:40] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: aggregate nginx requests and availability [puppet] - 10https://gerrit.wikimedia.org/r/407445 (https://phabricator.wikimedia.org/T177195) (owner: 10Filippo Giunchedi) [15:44:50] 10Operations, 10ops-eqiad: Hardware check on mw1271 - https://phabricator.wikimedia.org/T184722#3937817 (10MoritzMuehlenhoff) @Cmjohnson Is this ready to be re-pooled with the new DIMM or are you planning further tests which require the server to be out of service? [15:47:20] PROBLEM - puppet last run on mw1268 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[nginx-full] [15:51:32] (03CR) 10Ottomata: [C: 031] "1 nit but +1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/406951 (https://phabricator.wikimedia.org/T181036) (owner: 10Elukey) [15:52:20] RECOVERY - puppet last run on mw1268 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:53:19] PROBLEM - Apache HTTP on labweb1001 is CRITICAL: connect to address 10.64.16.200 and port 80: Connection refused [15:53:20] PROBLEM - HHVM rendering on labweb1001 is CRITICAL: connect to address 10.64.16.200 and port 80: Connection refused [15:55:26] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Onboard bstorm to WMF - https://phabricator.wikimedia.org/T185493#3937861 (10chasemp) >>! In T185493#3935805, @Dzahn wrote: > racktables not worth it anymore? almost replaced by netbox. Netbox access should automatically come with the LDAP gr... [15:56:25] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Onboard bstorm to WMF - https://phabricator.wikimedia.org/T185493#3937864 (10Dzahn) @Robh could you do the racktables part? ^ [15:57:29] RECOVERY - puppet last run on wtp2016 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:59:42] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Package/upload service-checker for Debian stretch - https://phabricator.wikimedia.org/T184224#3937871 (10akosiaris) 05Open>03Resolved a:03akosiaris Package uploaded! Since it's a native package I 've had to bump the version number t... [16:00:19] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Onboard bstorm to WMF - https://phabricator.wikimedia.org/T185493#3937876 (10RobH) [16:00:34] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Onboard bstorm to WMF - https://phabricator.wikimedia.org/T185493#3917465 (10RobH) Emailed it to her just now to update and change once she logs in. 
[16:01:42] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add service-checker image used to test service images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/405205 (https://phabricator.wikimedia.org/T184220) (owner: 10Dduvall) [16:04:29] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install tendril2001 - https://phabricator.wikimedia.org/T186123#3937880 (10Papaul) @Marostegui is okay for me to use raid1-gpt.cfg for the partman on this system ? [16:08:57] 10Operations, 10Analytics: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3937886 (10Ottomata) I actually tried to move to systemd a couple of years ago. I don’t remember the exact details, but there were some serious difficulties in automatically registering groups of proc... [16:15:48] (03PS1) 10Papaul: DNS: Add production DNS entry for tendril2001 [dns] - 10https://gerrit.wikimedia.org/r/407454 [16:15:50] Oops, I made a mistake, can you merge this? https://gerrit.wikimedia.org/r/#/c/407437/ [16:16:05] akosiaris: ^ [16:17:08] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install tendril2001 - https://phabricator.wikimedia.org/T186123#3937914 (10Papaul) [16:17:32] Amir1: sure [16:17:42] (03PS2) 10Alexandros Kosiaris: Set changeprop URI to the new beta ORES node [puppet] - 10https://gerrit.wikimedia.org/r/407437 (https://phabricator.wikimedia.org/T184501) (owner: 10Awight) [16:17:50] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Set changeprop URI to the new beta ORES node [puppet] - 10https://gerrit.wikimedia.org/r/407437 (https://phabricator.wikimedia.org/T184501) (owner: 10Awight) [16:17:56] Amir1: done [16:18:05] thank you very much [16:18:11] yw [16:22:09] (03PS1) 10Arturo Borrero Gonzalez: apt: disable daily cron job from apt-show-versions [puppet] - 10https://gerrit.wikimedia.org/r/407456 (https://phabricator.wikimedia.org/T186230) [16:22:45] 10Operations, 10Analytics: setup/install 
eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3937952 (10Ottomata) Ah, found previous ticket: T114199 [16:26:10] !log apt-get install 'designate' on labservices1001 and 1002 — routine upgrade [16:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:02] (03PS1) 10Papaul: DHCP: Add MAC address entry for tendril2001 [puppet] - 10https://gerrit.wikimedia.org/r/407457 [16:28:31] (03CR) 10jerkins-bot: [V: 04-1] DHCP: Add MAC address entry for tendril2001 [puppet] - 10https://gerrit.wikimedia.org/r/407457 (owner: 10Papaul) [16:32:36] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install tendril2001 - https://phabricator.wikimedia.org/T186123#3937990 (10jcrespo) raid1-gpt.cfg seems about right [16:37:31] (03CR) 10Rush: [C: 031] "small comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/407456 (https://phabricator.wikimedia.org/T186230) (owner: 10Arturo Borrero Gonzalez) [16:41:31] 10Operations, 10Wikimedia-Site-requests, 10media-storage: outdated DjVu file page thumbnail in cache - https://phabricator.wikimedia.org/T186153#3938021 (10Ankry) >>! In T186153#3937781, @fgiunchedi wrote: > Indeed looks like not all thumbs for all pages have been purged and some are still in swift for the o... [16:43:42] (03PS2) 10Arturo Borrero Gonzalez: apt: disable daily cron job from apt-show-versions [puppet] - 10https://gerrit.wikimedia.org/r/407456 (https://phabricator.wikimedia.org/T186230) [17:00:04] godog, moritzm, and _joe_: Time to snap out of that daydream and deploy Puppet SWAT(Max 8 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180201T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. 
[17:00:22] (03PS1) 10Rush: designate: ensure consistent state for logrotate mess [puppet] - 10https://gerrit.wikimedia.org/r/407459 (https://phabricator.wikimedia.org/T186142) [17:01:31] (03PS2) 10Rush: cloud: overlay whitelist as default [puppet] - 10https://gerrit.wikimedia.org/r/406851 [17:01:49] (03PS1) 10Madhuvishy: block_sync: Fix lock file open with w+ mode [puppet] - 10https://gerrit.wikimedia.org/r/407460 (https://phabricator.wikimedia.org/T186235) [17:02:08] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Onboard bstorm to WMF - https://phabricator.wikimedia.org/T185493#3938114 (10Bstorm) I'm in racktables! :) [17:02:13] (03CR) 10jerkins-bot: [V: 04-1] block_sync: Fix lock file open with w+ mode [puppet] - 10https://gerrit.wikimedia.org/r/407460 (https://phabricator.wikimedia.org/T186235) (owner: 10Madhuvishy) [17:02:44] (03CR) 10Rush: [C: 032] cloud: overlay whitelist as default [puppet] - 10https://gerrit.wikimedia.org/r/406851 (owner: 10Rush) [17:03:45] (03PS2) 10Madhuvishy: block_sync: Fix lock file open with w+ mode [puppet] - 10https://gerrit.wikimedia.org/r/407460 (https://phabricator.wikimedia.org/T186235) [17:06:09] (03PS3) 10Madhuvishy: block_sync: Fix lock file open with w+ mode [puppet] - 10https://gerrit.wikimedia.org/r/407460 (https://phabricator.wikimedia.org/T186235) [17:07:24] (03CR) 10Madhuvishy: [C: 032] block_sync: Fix lock file open with w+ mode [puppet] - 10https://gerrit.wikimedia.org/r/407460 (https://phabricator.wikimedia.org/T186235) (owner: 10Madhuvishy) [17:08:39] mutante twentyafterfour hi, restarting apache every sunday won't work either as mutante found yesturday. He only restarted it at the weekend and it was doing it again yesturday. [17:13:15] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3938158 (10ayounsi) @Marostegui do you have an approximate timeline I can base this task on? 
[17:15:41] (03CR) 10Andrew Bogott: [C: 031] designate: ensure consistent state for logrotate mess [puppet] - 10https://gerrit.wikimedia.org/r/407459 (https://phabricator.wikimedia.org/T186142) (owner: 10Rush) [17:15:52] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3938170 (10Marostegui) >>! In T172459#3938158, @ayounsi wrote: > @Marostegui do you have an approximate timeline I can base this task on? Not really. We'd probably won't be doing... [17:22:32] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3938179 (10jcrespo) Maybe your manager should talk to our manager to help us prioritize it? Right now, goals are on our top priority unless said the opposite. Failing over those ser... [17:25:14] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3835986 (10elukey) a gdb `thread apply all bt` would probably be more useful to get where http... [17:32:40] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3938199 (10elukey) The other useful thing to do, without waiting for a complete leak, is to ch... [17:32:54] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3938200 (10Paladox) Also it seems that restarting it every sunday would not r... 
[17:32:58] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3938201 (10elukey) [17:38:05] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3938224 (10BBlack) The question (after the initially proposed 2 week timeline was rejected) was merely "do you have an approximate timeline I can base this task on?". The answer se... [17:42:05] 10Operations, 10Ops-Access-Requests: Requesting access to wmf.webrequest for Simonjoylet - https://phabricator.wikimedia.org/T186190#3938234 (10RobH) [17:43:12] 10Operations, 10Ops-Access-Requests: Requesting access to wmf.webrequest for Simonjoylet - https://phabricator.wikimedia.org/T186190#3936572 (10RobH) [17:45:15] (03PS1) 10Aaron Schulz: [DNM] Switch labs to using mcrouter instead of nutcracker [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407464 [17:45:46] (03PS1) 10Arturo Borrero Gonzalez: WIP: apt: merge script report-pending-upgrades to apt-upgrade [puppet] - 10https://gerrit.wikimedia.org/r/407465 (https://phabricator.wikimedia.org/T181647) [17:46:01] <_joe_> AaronSchulz: your idea would be to run mcrouter locally on the appservers? [17:46:25] (03CR) 10jerkins-bot: [V: 04-1] WIP: apt: merge script report-pending-upgrades to apt-upgrade [puppet] - 10https://gerrit.wikimedia.org/r/407465 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [17:46:43] <_joe_> I'll take a look at your config [17:47:21] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3938249 (10Marostegui) I would definitely vote for waiting to the DC switchover if that is possible and a reasonable timeframe for NetOps. Otherwise, we'd need to squeeze this into... 
[17:51:02] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3938253 (10Marostegui) [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180201T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:06:57] (03PS1) 10Muehlenhoff: Extend Cumin alias for labs-nfs [puppet] - 10https://gerrit.wikimedia.org/r/407468 [18:11:12] _joe_: yeah, like nutcracker [18:11:24] Nothing for ORES. [18:11:27] 10Operations, 10hardware-requests: hardware request for bast1001 replacement - https://phabricator.wikimedia.org/T184480#3938314 (10RobH) p:05Triage>03Normal a:03faidon So we would normally quote out a standard misc system, but we happen to have a couple of spare servers ready for allocation in eqiad alr... [18:11:31] 10Operations, 10hardware-requests: hardware request for tin replacement - https://phabricator.wikimedia.org/T184481#3938319 (10RobH) p:05Triage>03Normal a:03faidon So we would normally quote out a standard misc system, but we happen to have a couple of spare servers ready for allocation in eqiad already.... [18:12:07] 10Operations, 10hardware-requests: decommission mw1163 - https://phabricator.wikimedia.org/T175089#3938327 (10RobH) a:05mark>03faidon [18:24:39] (03CR) 10Muehlenhoff: [C: 032] Extend Cumin alias for labs-nfs [puppet] - 10https://gerrit.wikimedia.org/r/407468 (owner: 10Muehlenhoff) [18:27:16] 10Operations, 10Analytics: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3938382 (10MoritzMuehlenhoff) >>! In T185667#3937886, @Ottomata wrote: > I actually tried to move to systemd a couple of years ago. But T114199 was for jessie, the first Debian release with systemd... 
[18:28:18] (03PS1) 10Muehlenhoff: Extend labtest Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/407472 [18:29:07] (03PS2) 10Muehlenhoff: Extend labtest Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/407472 [18:29:49] (03PS3) 10Muehlenhoff: Extend labtest Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/407472 [18:30:42] (03PS3) 10Zoranzoki21: Enable ArticlePlaceholder for Estonian Wikipedia (etwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407017 (https://phabricator.wikimedia.org/T186107) [18:42:05] (03CR) 10Muehlenhoff: [C: 032] Extend labtest Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/407472 (owner: 10Muehlenhoff) [18:53:31] 10Operations, 10Analytics: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3938562 (10RobH) [19:02:31] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.15 (duration: 14m 55s) [19:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:24] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.16 [keeping static files] (duration: 01m 16s) [19:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:45] (03PS2) 10Rush: designate: ensure consistent state for logrotate mess [puppet] - 10https://gerrit.wikimedia.org/r/407459 (https://phabricator.wikimedia.org/T186142) [19:10:16] (03CR) 10Rush: [C: 032] designate: ensure consistent state for logrotate mess [puppet] - 10https://gerrit.wikimedia.org/r/407459 (https://phabricator.wikimedia.org/T186142) (owner: 10Rush) [19:13:33] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 10Patch-For-Review: Lower varnish caching length on doc.wikimedia.org - https://phabricator.wikimedia.org/T184255#3877424 (10hashar) The patch got merged and deployed. So tentatively that is fixed? 
[19:13:42] 10Operations, 10Continuous-Integration-Infrastructure, 10Traffic: Lower varnish caching length on doc.wikimedia.org - https://phabricator.wikimedia.org/T184255#3938632 (10hashar) [19:14:58] 10Operations, 10ops-esams, 10DC-Ops, 10netops: cr2-esams temperature warning - https://phabricator.wikimedia.org/T176816#3938634 (10ayounsi) 05Open>03Resolved a:03mark I haven't seen any alerts since. Thanks. [19:17:36] !log labservices1002:~# logrotate --force /etc/logrotate.conf [19:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:23] !log labservices1001:~# logrotate --force /etc/logrotate.conf [19:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:03] 10Operations, 10ops-eqiad: check eventlog1002 production network cable - https://phabricator.wikimedia.org/T186252#3938674 (10RobH) p:05Triage>03Normal [19:20:16] 10Operations, 10Analytics: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3938690 (10RobH) Install blocked by network issue detailed on T186252 for onsite work. [19:21:25] (03PS12) 10Rush: rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202 [19:36:28] (03PS1) 10BBlack: URL Path Normalization: refactor, add to cache_text [puppet] - 10https://gerrit.wikimedia.org/r/407488 (https://phabricator.wikimedia.org/T127387) [19:36:30] (03PS1) 10BBlack: URL Path Normalization: add to cache_misc [puppet] - 10https://gerrit.wikimedia.org/r/407489 (https://phabricator.wikimedia.org/T127387) [19:37:07] (03CR) 10Rush: [C: 032] rabbitmq: handling users and initial setup [puppet] - 10https://gerrit.wikimedia.org/r/403202 (owner: 10Rush) [19:41:22] 10Operations, 10Analytics: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3938830 (10RobH) [19:46:26] PROBLEM - puppet last run on labtestcontrol2003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:56:45] 10Operations, 10Ops-Access-Requests: Requesting access to wmf.webrequest for Simonjoylet - https://phabricator.wikimedia.org/T186190#3938929 (10Nuria) Sorry but you have to request a formal collaboration with research team for us to grant access to data. In your case looking at the very brief description of y... [19:56:51] 10Operations, 10Ops-Access-Requests: Requesting access to wmf.webrequest for Simonjoylet - https://phabricator.wikimedia.org/T186190#3938931 (10Nuria) 05Open>03declined [20:01:16] (03PS1) 10Rush: openstack: pull appropriate hiera password for cleanup [puppet] - 10https://gerrit.wikimedia.org/r/407490 [20:03:48] (03CR) 10Rush: [C: 032] openstack: pull appropriate hiera password for cleanup [puppet] - 10https://gerrit.wikimedia.org/r/407490 (owner: 10Rush) [20:04:41] (03PS1) 10Herron: puppetdb: add support for puppetlabs puppetdb 4.4 package [puppet] - 10https://gerrit.wikimedia.org/r/407492 [20:05:12] (03CR) 10jerkins-bot: [V: 04-1] puppetdb: add support for puppetlabs puppetdb 4.4 package [puppet] - 10https://gerrit.wikimedia.org/r/407492 (owner: 10Herron) [20:06:26] RECOVERY - puppet last run on labtestcontrol2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:15:47] (03CR) 10Herron: "> it would be easier if that part is also available for review in a" [puppet] - 10https://gerrit.wikimedia.org/r/405808 (https://phabricator.wikimedia.org/T185501) (owner: 10Herron) [20:16:54] (03CR) 10Herron: "tests should pass after 405808 is merged" [puppet] - 10https://gerrit.wikimedia.org/r/407492 (owner: 10Herron) [20:18:04] (03PS1) 10Rush: drainqueue: pass user/password when not using silent mode [puppet] - 10https://gerrit.wikimedia.org/r/407495 [20:19:05] (03PS2) 10Rush: drainqueue: pass user/password when not using silent mode [puppet] - 10https://gerrit.wikimedia.org/r/407495 [20:20:23] (03PS3) 10Rush: drainqueue: pass user/password when not using silent 
mode [puppet] - 10https://gerrit.wikimedia.org/r/407495 [20:21:22] (03CR) 10Rush: [C: 032] drainqueue: pass user/password when not using silent mode [puppet] - 10https://gerrit.wikimedia.org/r/407495 (owner: 10Rush) [20:30:21] (03PS1) 10Jcrespo: mariadb: emergency depool of db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407501 [20:31:51] if something is being deployed, revert [20:33:38] (03CR) 10Jcrespo: [C: 032] mariadb: emergency depool of db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407501 (owner: 10Jcrespo) [20:33:48] (03CR) 10Jcrespo: [V: 032 C: 032] mariadb: emergency depool of db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407501 (owner: 10Jcrespo) [20:34:04] (03PS1) 10Herron: remove references to dysprosium and diadem [puppet] - 10https://gerrit.wikimedia.org/r/407502 [20:34:41] there is an outage going on [20:34:54] (03CR) 10Herron: [C: 032] remove references to dysprosium and diadem [puppet] - 10https://gerrit.wikimedia.org/r/407502 (owner: 10Herron) [20:35:00] in case people are not aware [20:35:18] !log jynus@tin Synchronized wmf-config/db-eqiad.php: emergency depool of db1083 (duration: 00m 55s) [20:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:34] https://logstash.wikimedia.org/goto/49bfc164a2acaeacec0b34fbb7b9b3e7 [20:36:16] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [20:36:50] not sure if that was enough or I just moved the problem to another server [20:37:05] (03CR) 10jenkins-bot: mariadb: emergency depool of db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407501 (owner: 10Jcrespo) [20:37:18] *reads up* [20:39:11] I am going to block wikiexporter queries [20:39:23] (03PS1) 10Rush: openstack: move log permissions handling out of nova::common [puppet] - 
10https://gerrit.wikimedia.org/r/407503 (https://phabricator.wikimedia.org/T171494) [20:39:38] (03PS2) 10Rush: openstack: move log permissions handling out of nova::common [puppet] - 10https://gerrit.wikimedia.org/r/407503 (https://phabricator.wikimedia.org/T171494) [20:41:16] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [20:41:28] !log deployed modified query killer to enwiki replicas [20:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:57] that will likely break functionality, but it will avoid errors [20:42:32] devels can fix the pieces later [20:42:39] what were the queries? [20:42:41] (03PS1) 10Jcrespo: Revert "mariadb: emergency depool of db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407505 [20:43:08] addshore: will comment in private, I cannot discard, even if unlikely, it is not deliberate [20:43:29] * addshore can wait for the ticket [20:43:59] let me say it should not worry too much yourself, unless you were the user of the api :-) [20:44:50] the query killer was not useless in this case, it was preventing from having a full outage [20:44:55] (03PS1) 10Herron: remove diadem/dysprosium from dns [dns] - 10https://gerrit.wikimedia.org/r/407506 [20:45:22] (03CR) 10Herron: [C: 032] remove diadem/dysprosium from dns [dns] - 10https://gerrit.wikimedia.org/r/407506 (owner: 10Herron) [20:46:01] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: emergency depool of db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407505 (owner: 10Jcrespo) [20:46:35] was there some recent deploy? 
[20:46:45] this seems to have been building up for 1h+ [20:47:30] (03Merged) 10jenkins-bot: Revert "mariadb: emergency depool of db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407505 (owner: 10Jcrespo) [20:47:40] (03CR) 10jenkins-bot: Revert "mariadb: emergency depool of db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407505 (owner: 10Jcrespo) [20:50:02] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool of db1083 (duration: 00m 55s) [20:50:08] 10Operations, 10Mail: set up DMARC aggregate report collection into a database for research and reporting - https://phabricator.wikimedia.org/T86209#3939138 (10herron) @MoritzMuehlenhoff VMs have been removed [20:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:16] (03PS4) 10Andrew Bogott: openstack horizon: rough in manifests for source deploy of Horizon 'ocata' [puppet] - 10https://gerrit.wikimedia.org/r/406853 (https://phabricator.wikimedia.org/T168470) [20:53:32] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3939141 (10jcrespo) @BBlack The thing is, we physically could do this in 2 weeks- if we put it on our top priority and do nothing else- I don't know hour urgent is this- if it is lo... [20:53:44] (03CR) 10jerkins-bot: [V: 04-1] openstack horizon: rough in manifests for source deploy of Horizon 'ocata' [puppet] - 10https://gerrit.wikimedia.org/r/406853 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [20:57:25] things are stable now [20:58:44] just got back home. that's 22 hours on the road. 
ugh [20:59:02] apergos: sorry to disturb you, just a quick question [20:59:16] ask away [20:59:25] jynus: [21:00:31] (03PS3) 10Rush: openstack: move log permissions handling out of nova::common [puppet] - 10https://gerrit.wikimedia.org/r/407503 (https://phabricator.wikimedia.org/T171494) [21:01:59] (I asked in private) [21:03:30] (03PS5) 10Andrew Bogott: openstack horizon: rough in manifests for source deploy of Horizon 'ocata' [puppet] - 10https://gerrit.wikimedia.org/r/406853 (https://phabricator.wikimedia.org/T168470) [21:04:04] (03CR) 10jerkins-bot: [V: 04-1] openstack horizon: rough in manifests for source deploy of Horizon 'ocata' [puppet] - 10https://gerrit.wikimedia.org/r/406853 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [21:05:53] (and I answered there too, heh) [21:13:52] 10Operations, 10Packaging: rebuild php-wikidiff2 and php-luasandbox for php7 and stretch - https://phabricator.wikimedia.org/T184270#3939182 (10Legoktm) a:03Legoktm Moritz uploaded wikidiff2 to stretch-backports, it's currently waiting in the backports NEW queue: https://ftp-master.debian.org/backports-new.html [21:17:09] (03PS4) 10Rush: openstack: move log permissions handling out of nova::common [puppet] - 10https://gerrit.wikimedia.org/r/407503 (https://phabricator.wikimedia.org/T171494) [21:20:24] 10Operations, 10Puppet: Puppet: Empty modules/nginx directory in operations/puppet - https://phabricator.wikimedia.org/T186268#3939224 (10herron) p:05Triage>03Low [21:20:56] (03PS1) 10Herron: remove empty directory modules/nginx [puppet] - 10https://gerrit.wikimedia.org/r/407518 (https://phabricator.wikimedia.org/T186268) [21:22:01] (03PS1) 10Jcrespo: query-killer: Ignore Wikiexport exception [software] - 10https://gerrit.wikimedia.org/r/407519 (https://phabricator.wikimedia.org/T186266) [21:22:33] (03PS4) 10Jcrespo: Add Proxysql creation debian package script [software] - 10https://gerrit.wikimedia.org/r/404153 [21:22:35] (03PS2) 10Jcrespo: query-killer: Ignore 
Wikiexport exception [software] - 10https://gerrit.wikimedia.org/r/407519 (https://phabricator.wikimedia.org/T186266) [21:24:31] (03CR) 10Rush: [C: 032] openstack: move log permissions handling out of nova::common [puppet] - 10https://gerrit.wikimedia.org/r/407503 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [21:27:22] (03CR) 10Jcrespo: "This is likely to break things, but at least it matches production (see ticket)." [software] - 10https://gerrit.wikimedia.org/r/407519 (https://phabricator.wikimedia.org/T186266) (owner: 10Jcrespo) [21:27:27] (03CR) 10Jcrespo: [C: 032] query-killer: Ignore Wikiexport exception [software] - 10https://gerrit.wikimedia.org/r/407519 (https://phabricator.wikimedia.org/T186266) (owner: 10Jcrespo) [21:27:32] (03PS3) 10Jcrespo: query-killer: Ignore Wikiexport exception [software] - 10https://gerrit.wikimedia.org/r/407519 (https://phabricator.wikimedia.org/T186266) [21:30:18] I hope wikidata weeklies are run as the wikiadmin user; if not I will nudge the maintainer to fix it, this is a perfect time to sort out things like that [21:32:13] (03PS2) 10Rush: openstack: nova-network and neutron nova::common split [puppet] - 10https://gerrit.wikimedia.org/r/405366 (https://phabricator.wikimedia.org/T171494) [21:38:25] (03PS3) 10Rush: openstack: nova-network and neutron nova::common split [puppet] - 10https://gerrit.wikimedia.org/r/405366 (https://phabricator.wikimedia.org/T171494) [21:38:54] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova-network and neutron nova::common split [puppet] - 10https://gerrit.wikimedia.org/r/405366 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [21:41:44] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3939329 (10Dzahn) p:05Triage>03Normal [21:42:56] (03PS4) 10Rush: openstack: nova-network and neutron nova::common split [puppet] - 
10https://gerrit.wikimedia.org/r/405366 (https://phabricator.wikimedia.org/T171494) [21:43:01] 10Operations, 10media-storage: ms-be2023 unresponsive while rebuilding one disk - https://phabricator.wikimedia.org/T185306#3939333 (10Dzahn) p:05Triage>03Normal [21:43:27] 10Operations, 10ops-eqiad: check americium eth1 cabling and link - https://phabricator.wikimedia.org/T185219#3939339 (10Dzahn) p:05Triage>03Normal [21:45:06] 10Operations, 10ops-eqiad, 10Analytics-Kanban: BBU alarms flapping for analytics1038 - https://phabricator.wikimedia.org/T185409#3939349 (10Dzahn) p:05Triage>03Normal [21:45:54] 10Operations, 10Patch-For-Review: Netbox: postgres cannot be restarted w/ current config - https://phabricator.wikimedia.org/T184634#3939350 (10Dzahn) p:05Triage>03High [21:46:26] 10Operations, 10Ops-Access-Requests: Requesting access to wmf.webrequest for Simonjoylet - https://phabricator.wikimedia.org/T186190#3939365 (10Nuria) @Simonjoylet Please be so kind to communicate via this ticket rather than on prior requests for access that are not closed, it is confusing cause it makes it a... [21:46:33] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3939366 (10RobH) [21:47:16] 10Operations, 10monitoring: Netbox: add Icinga check for the website - https://phabricator.wikimedia.org/T185505#3939367 (10Dzahn) This would be resolved unless we say "content behind login" is a mandatory check. (We don't really do that for other services either yet) [21:51:07] I forgot to log something on Monday ... 
[21:52:04] !log Removed 2FA from User:Superzerocool (on Mon, Jan 29): https://phabricator.wikimedia.org/T185731 [21:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:58] !log Removed 2FA from User:Jehochman [21:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:08] (03PS5) 10Rush: openstack: nova-network and neutron nova::common split [puppet] - 10https://gerrit.wikimedia.org/r/405366 (https://phabricator.wikimedia.org/T171494) [22:26:33] 10Operations, 10DNS, 10Domains, 10Traffic, and 2 others: en.wiki domain owned by us, but isn't hosted by us?? - https://phabricator.wikimedia.org/T167060#3939435 (10CRoslof) 05Open>03Resolved a:03CRoslof I've changed the domains we have with Porkbun to use our nameservers. (I didn't change anything a... [22:27:56] mutante: what's the scheduled maintenance window for Gerrit then? I will likely do some fooling around on the weekend but can certainly plan around it [22:27:58] joucebot: reload [22:28:09] jouncebot: reload [22:28:18] jouncebot: refresh [22:28:23] I refreshed my knowledge about deployments. [22:35:18] (03PS3) 10Madhuvishy: NFS: add custom script to generate target hosts [puppet] - 10https://gerrit.wikimedia.org/r/406779 (https://phabricator.wikimedia.org/T185967) (owner: 10Volans) [23:23:36] apergos: it's currently 1pm to 3pm PST, maybe it will start a tad later [23:26:12] apergos: so late Friday night for you but Saturday/Sunday you are good to go [23:29:31] 10Operations, 10Icinga, 10monitoring, 10Wikimedia-Incident: Icinga: page in case all MediaWiki are throwing 5xx - https://phabricator.wikimedia.org/T186069#3932784 (10Dzahn) So there are the existing checks for this, one for each data center. It would be easy to just flip those to "critical => true". Ther... 
[23:29:45] (03PS4) 10Madhuvishy: NFS: add custom script to generate target hosts [puppet] - 10https://gerrit.wikimedia.org/r/406779 (https://phabricator.wikimedia.org/T185967) (owner: 10Volans) [23:30:13] 10Operations, 10Icinga, 10monitoring, 10Wikimedia-Incident: Icinga: page in case all MediaWiki are throwing 5xx - https://phabricator.wikimedia.org/T186069#3939616 (10Dzahn) p:05Triage>03High [23:30:15] (03CR) 10jerkins-bot: [V: 04-1] NFS: add custom script to generate target hosts [puppet] - 10https://gerrit.wikimedia.org/r/406779 (https://phabricator.wikimedia.org/T185967) (owner: 10Volans) [23:31:29] (03PS5) 10Madhuvishy: NFS: add custom script to generate target hosts [puppet] - 10https://gerrit.wikimedia.org/r/406779 (https://phabricator.wikimedia.org/T185967) (owner: 10Volans) [23:33:00] 10Operations, 10Wikimedia-Site-requests, 10media-storage: outdated DjVu file page thumbnail in cache - https://phabricator.wikimedia.org/T186153#3939619 (10Dzahn) p:05Triage>03Normal [23:33:26] 10Operations, 10ops-eqiad, 10DBA: db1051 database host BBU issues - https://phabricator.wikimedia.org/T186049#3939620 (10Dzahn) p:05Triage>03Normal [23:33:41] 10Operations, 10ops-eqiad, 10DBA: db1051 database host BBU issues - https://phabricator.wikimedia.org/T186049#3932086 (10Dzahn) a:03Cmjohnson [23:34:47] 10Operations, 10Analytics-Kanban, 10User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3931188 (10Dzahn) Yea, that makes sense. I also think it's the easiest way to create a new disk in ganeti and then mount it. 
[23:35:32] 10Operations, 10Analytics-Kanban, 10User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3939629 (10Dzahn) p:05Triage>03High [23:37:15] !log creating new 100GB virtual disk for ganeti VM meitnerium (T186020) [23:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:28] T186020: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020 [23:44:33] 10Operations, 10Wikimedia-General-or-Unknown: Beta English Wikipedia: History of the page 'Bird' generates a 500 or 503 error - https://phabricator.wikimedia.org/T185969#3939655 (10Dzahn) p:05Triage>03Normal Can't reproduce it right now. It works for me. [23:44:54] 10Operations, 10Beta-Cluster-Infrastructure, 10Wikimedia-General-or-Unknown: Beta English Wikipedia: History of the page 'Bird' generates a 500 or 503 error - https://phabricator.wikimedia.org/T185969#3939659 (10Dzahn) [23:45:50] 10Operations, 10Puppet: /etc/puppet/hiera.yaml: Use of 'hiera.yaml' version 3 is deprecated. It should be converted to version 5 - https://phabricator.wikimedia.org/T185814#3925904 (10Dzahn) @Paladox can you add a specific host name where this happens? [23:46:40] 10Operations, 10Puppet: /etc/puppet/hiera.yaml: Use of 'hiera.yaml' version 3 is deprecated. It should be converted to version 5 - https://phabricator.wikimedia.org/T185814#3939662 (10Dzahn) p:05Triage>03Low [23:47:17] 10Operations, 10Beta-Cluster-Infrastructure, 10Wikimedia-General-or-Unknown: Beta English Wikipedia: History of the page 'Bird' generates a 500 or 503 error - https://phabricator.wikimedia.org/T185969#3939673 (10matmarex) It is still broken for me in the same way. Looks like the error only happens when I'm l... 
[23:49:16] 10Operations, 10Traffic, 10Accessibility, 10Browser-Support-Internet-Explorer: Wikipedia no longer accessible to those using some braille devices - https://phabricator.wikimedia.org/T185582#3920199 (10Dzahn) @Cameron11598 Could you forward Brandon's answer above to the original reporter? Let us know if t... [23:49:21] 10Operations, 10Puppet: /etc/puppet/hiera.yaml: Use of 'hiera.yaml' version 3 is deprecated. It should be converted to version 5 - https://phabricator.wikimedia.org/T185814#3939679 (10Paladox) [23:49:31] 10Operations, 10Traffic, 10Accessibility, 10Browser-Support-Internet-Explorer: Wikipedia no longer accessible to those using some braille devices - https://phabricator.wikimedia.org/T185582#3939680 (10Dzahn) p:05Triage>03Normal [23:49:50] 10Operations, 10ops-eqiad, 10User-Eevans: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T185494#3939681 (10Dzahn) p:05Triage>03Normal [23:52:42] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: rack/setup/install notebook100[34] - https://phabricator.wikimedia.org/T183935#3939682 (10RobH) [23:53:49] (03PS1) 10RobH: setting remainder of notebook100[34] install params [puppet] - 10https://gerrit.wikimedia.org/r/407571 (https://phabricator.wikimedia.org/T183935) [23:54:25] (03CR) 10RobH: [C: 032] setting remainder of notebook100[34] install params [puppet] - 10https://gerrit.wikimedia.org/r/407571 (https://phabricator.wikimedia.org/T183935) (owner: 10RobH) [23:54:56] 10Operations, 10Puppet: /etc/puppet/hiera.yaml: Use of 'hiera.yaml' version 3 is deprecated. It should be converted to version 5 - https://phabricator.wikimedia.org/T185814#3939687 (10Paladox) Here's the online docs for the conversion https://puppet.com/docs/puppet/4.10/hiera_migrate_v3_yaml.html