[00:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Evening backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201209T0000). [00:00:04] cscott: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:04:37] I can deploy today...if no one else is around. [00:04:57] cscott: here? :) [00:05:43] Urbanecm: yep [00:05:55] i think subbu|away is around on #-parsoid as well (despite his nick) [00:07:07] (03CR) 10Urbanecm: [C: 03+2] Bump wikimedia/parsoid to v0.13.0-a19 [vendor] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/647046 (https://phabricator.wikimedia.org/T269685) (owner: 10C. Scott Ananian) [00:07:28] cscott, i am not |away .. your matrix client is lying to you. [00:07:55] i am using quaternion, so it must just be orthogonal to the truth :) [00:08:26] :) [00:08:27] So, let's see once it merged :) [00:17:04] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [00:27:20] (03PS2) 10Jforrester: Explicitly set wgAbuseFilterAflFilterMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647091 (https://phabricator.wikimedia.org/T269712) (owner: 10Daimona Eaytoy) [00:28:25] (03CR) 10jerkins-bot: [V: 04-1] Explicitly set wgAbuseFilterAflFilterMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647091 (https://phabricator.wikimedia.org/T269712) (owner: 10Daimona Eaytoy) [00:28:48] (03PS3) 10Jforrester: Explicitly set wgAbuseFilterAflFilterMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647091 (https://phabricator.wikimedia.org/T269712) (owner: 10Daimona Eaytoy) [00:28:50] (03PS1) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: 
Make WRITE_BOTH everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647116 [00:30:11] (03CR) 10Daimona Eaytoy: wgAbuseFilterAflFilterMigrationStage: Make WRITE_BOTH everywhere (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647116 (owner: 10Jforrester) [00:31:09] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to v0.13.0-a19 [vendor] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/647046 (https://phabricator.wikimedia.org/T269685) (owner: 10C. Scott Ananian) [00:31:57] (03PS2) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Make WRITE_BOTH everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647116 [00:31:59] (03CR) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Make WRITE_BOTH everywhere (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647116 (owner: 10Jforrester) [00:32:14] cscott: still here? [00:33:21] i am around here as well. [00:33:40] (03PS3) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Make WRITE_BOTH everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647116 (https://phabricator.wikimedia.org/T269712) [00:33:42] (03PS1) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Make READ_NEW in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647117 (https://phabricator.wikimedia.org/T269712) [00:33:44] subbu: great [00:33:44] (03PS1) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Make COMPAT_NEW in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647118 (https://phabricator.wikimedia.org/T269712) [00:34:26] subbu: cscott: please test at mwdebug1001 [00:34:36] ok. 
one moment [00:35:35] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1147399384 and 342 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:36:11] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2426127320 and 379 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:36:19] how do i purge a page and have the reparse req. go to mwdebug1001? [00:36:29] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 679446064 and 395 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:36:35] subbu: what about a preview at mwdebug1001? [00:37:16] ah .. ok, let me try that. [00:37:43] sure [00:38:34] but, i need parsoid to reparse the page. that won't work since i cannot get parsoid reqs. to go to a specific server. [00:38:45] let me think if there is any other way of verifying without actually pushing it out everywhere. [00:39:03] hmm, good point [00:39:37] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2397421400 and 583 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:40:25] subbu: ssh to scandium and query the host directly? [00:41:05] cscott, oh, now that the new rest endpoint has been pushed everywhere .. we could probably use that ... [00:41:25] do you know what that url schema is? [00:41:57] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 960951376 and 65 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:41:58] Pchelolo would know but i see he is away.
[00:42:08] you should be able to directly fetch https://zh.wikipedia.org/api/rest_v1/page/html/%E4%BA%94%E7%9C%BC%E8%81%AF%E7%9B%9F [00:42:13] just set the host header [00:42:20] (working on that myself) [00:42:38] cscott, but the new code is only on mwdebug1001. [00:42:45] ah right yeah [00:42:47] and those parsoid endpoints are not available on that server. [00:43:01] so, just push everywhere and hope? [00:43:03] so, that is why i was looking at the new rest endpoints that Pchelolo recently enabled everywhere. [00:43:07] hrm. [00:43:18] Urbanecm, hold on :) we just need to figure out that url schema. [00:43:23] okay [00:43:25] waiting [00:43:30] this has been an issue w/ testing parsoid for a while. [00:43:46] subbu: i don't think the new url schema exposes the langconv stuff yet [00:43:57] ah, ok. right. [00:44:27] but, just hitting that endpoint should not emit the langvariant markup .. [00:44:35] so, that is the test. [00:44:57] let me search phab for pchelolo's ticket. [00:45:02] it's a test, at least [00:45:14] aha .. https://en.wikipedia.org/w/rest.php/v1/page/Earth/html is the new rest api schema [00:45:22] so, just need to fetch the zhwiki equivalent on mwdebug1001 [00:46:44] https://zh.wikipedia.org/w/rest.php/v1/page/%E4%BA%94%E7%9C%BC%E8%81%AF%E7%9B%9F/html says not implemented. :) [00:46:51] alright, i guess we just push everywhere then. [00:48:26] subbu: if you can use a particular parsoid host, you can run scap pull there to fetch the code [00:48:27] PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3161171608 and 171 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:48:38] and i can also just push everywhere [00:49:49] subbu: so, `scap pull` at scandium should fetch the code to scandium [00:51:00] actually scandium has that new code since we rt-tested it. [00:51:07] and i verified that the page is fixed there.
[00:51:11] aha [00:51:15] so I'll sync then [00:51:17] so, yes, let us push everywhere. [00:51:19] ya. [00:52:00] we already rt-tested all these pages prior to release, so, that works as expected. just being able to test on a non-scandium host would have been an added bonus, but we don't have a mechanism right now, so yes, let us push. [00:52:17] sync in progress :) [00:52:56] go go go [00:53:18] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.20/vendor/: 3278ffd107888757c4620383160a6d5fa67d05b5: Bump wikimedia/parsoid to v0.13.0-a19 (T269685) (duration: 01m 16s) [00:53:24] RECOVERY - Postgres Replication Lag on maps1009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 70136 and 247 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:28] T269685: /page/html endpoint broken when requesting language variants affecting /page/summary - https://phabricator.wikimedia.org/T269685 [00:53:34] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 229712 and 256 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:53:35] cscott: subbu: should be everywhere now. Can you make sure it works now? 
:) [00:53:56] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 138808 and 277 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:55:15] (03PS4) 10Jforrester: Explicitly set wgAbuseFilterAflFilterMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647091 (https://phabricator.wikimedia.org/T269712) (owner: 10Daimona Eaytoy) [00:55:17] (03PS4) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Make WRITE_BOTH everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647116 (https://phabricator.wikimedia.org/T269712) [00:55:19] (03PS2) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Make READ_NEW in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647117 (https://phabricator.wikimedia.org/T269712) [00:55:22] (03PS2) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Make COMPAT_NEW in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647118 (https://phabricator.wikimedia.org/T269712) [00:55:49] there were a slew of transient errors .. i'm going to assume that is because the php7.2-fpm was restarting in that period. they have all gone away now. but, will monitor logstash for a bit longer first. [00:55:58] Urbanecm: I've an additional patch to go out now, when you're done (or you can do it if you want :-)). [00:55:58] okay [00:56:13] yup, those are gone. [00:56:18] now to actually verify the other things. [00:56:24] in that case, it's yours James_F :) [00:56:24] subbu: confirmed your
fix is live on enwiki [00:56:39] did you verify the zhwiki fix? [00:56:47] * James_F waits. [00:56:51] https://zh.wikipedia.org/api/rest_v1/page/html/%E4%BA%94%E7%9C%BC%E8%81%AF%E7%9B%9F is fixed as well. [00:57:13] can someone from the mcs/apps side verify their endpoints are fixed? [00:57:39] subbu: yup, confirmed, that looks good to me [00:58:22] cscott, the mcs endpoints you mean? [00:58:55] No, I mean the zhwiki url you wrote [00:59:08] Haven't tried to check mcs endpoints [00:59:20] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 290280 and 602 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:59:33] https://phabricator.wikimedia.org/T269685 has the urls to test. so, let us verify that. [01:00:06] verified working now. returns http 200, not http 400. [01:00:09] lgtm. [01:00:39] curl -i -H "Accept-Language: zh-hant" https://zh.wikipedia.org/api/rest_v1/page/html/%E8%B4%9D%E6%8B%89%E5%85%8B%C2%B7%E5%A5%A5%E5%B7%B4%E9%A9%AC returns HTTP 200 to me [01:01:22] (03CR) 10Jforrester: [C: 03+2] Explicitly set wgAbuseFilterAflFilterMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647091 (https://phabricator.wikimedia.org/T269712) (owner: 10Daimona Eaytoy) [01:01:24] great. i think we are done then. [01:01:44] (03CR) 10Jforrester: [C: 04-1] "Not until the train rolls out, so we can spot if this breaks things." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647116 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester) [01:01:54] i am just curious why we got those transient fatals on sync. we normally don't get those on normal deploy. [01:02:02] but, we can look into that another time. [01:02:15] (03Merged) 10jenkins-bot: Explicitly set wgAbuseFilterAflFilterMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647091 (https://phabricator.wikimedia.org/T269712) (owner: 10Daimona Eaytoy) [01:02:17] cscott, look good to you? 
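The manual check used above (a curl with an `Accept-Language: zh-hant` header, expecting HTTP 200 where the broken endpoint returned HTTP 400) could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper names `build_variant_request`, `is_fixed` and `fetch_status` are hypothetical, the User-Agent string is a placeholder, and only the URL, the header and the 200-versus-400 criterion come from the log.

```python
import urllib.error
import urllib.request

# URL copied from the curl command quoted in the log.
PAGE_URL = (
    "https://zh.wikipedia.org/api/rest_v1/page/html/"
    "%E8%B4%9D%E6%8B%89%E5%85%8B%C2%B7%E5%A5%A5%E5%B7%B4%E9%A9%AC"
)

# Placeholder User-Agent (an assumption, not from the log).
USER_AGENT = "variant-endpoint-check/0.1 (example script)"


def build_variant_request(url, variant="zh-hant"):
    """Build a GET request asking for a specific language variant."""
    headers = {"Accept-Language": variant, "User-Agent": USER_AGENT}
    return urllib.request.Request(url, headers=headers)


def is_fixed(status):
    """Per the log: HTTP 200 means fixed, HTTP 400 was the broken behaviour."""
    return status == 200


def fetch_status(request, timeout=10):
    """Send the request and return the HTTP status code (needs network access)."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is still what we want.
        return err.code
```

Calling `is_fixed(fetch_status(build_variant_request(PAGE_URL)))` would repeat the check from the log; the same helpers could be pointed at the other URLs listed in T269685.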
[01:02:25] if yes, we can call this done. [01:03:35] and https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2021/Citations?dtenable=1 now has reply links enabled as well. [01:03:46] so, they got their links 1 day early as well. :) [01:04:47] thanks Urbanecm. [01:04:54] happy to help! [01:05:00] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 114581872 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:05:26] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 89893088 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:07:57] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Explicitly set wgAbuseFilterAflFilterMigrationStage ahead of train roll-out T269712 (duration: 01m 03s) [01:08:02] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2058272 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:08:05] T269712: Migrate afl_filter to afl_filter_id and afl_global - https://phabricator.wikimedia.org/T269712 [01:08:08] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 28616 and 1130 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:09:56] OK, all done on my part. 
[01:10:16] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 151717600 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:11:40] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2010112 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:16:52] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 296552 and 1654 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:19:32] subbu: yes, looks good to me. (sorry for the lag) [01:24:14] great! [01:24:28] logs continue to be clean. [01:27:57] (03PS12) 10Mstyles: Add new helm chart for rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) [01:28:02] (03CR) 10Mstyles: "> Patch Set 11:" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: 10Mstyles) [01:36:35] 10Operations, 10Mail: SREs mail servers - https://phabricator.wikimedia.org/T269725 (10Reedy) [04:00:14] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:31:24] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:47:14] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:29:22] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:45:58] (03PS1) 10Ammarpad: Add extended-confirmed group and restriction level for bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647129 (https://phabricator.wikimedia.org/T269709) [05:46:43] (03CR) 10jerkins-bot: [V: 04-1] Add extended-confirmed group and restriction level for bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647129 (https://phabricator.wikimedia.org/T269709) (owner: 10Ammarpad) [05:47:14] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:49:24] (03PS2) 10Ammarpad: Add extended-confirmed group and restriction level for bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647129 (https://phabricator.wikimedia.org/T269709) [06:57:16] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-tool1010.eqiad.wmnet - https://phabricator.wikimedia.org/T268146 (10elukey) >>! In T268146#6677434, @Ottomata wrote: > Hmm, uh oh, I think this host needed to be placed in the Analytics VLAN. Ping @elukey @razzi @robh Ah snap I didn'... [06:58:05] (03CR) 10Elukey: [V: 03+1 C: 03+2] zookeeper: Support a standalone server's mbeans in the JMX exporter's conf [puppet] - 10https://gerrit.wikimedia.org/r/645371 (https://phabricator.wikimedia.org/T268202) (owner: 10Elukey) [07:59:32] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:04:47] (03PS1) 10Elukey: Import prometheus and constants module from spicerack [software/pywmflib] - 10https://gerrit.wikimedia.org/r/647189 (https://phabricator.wikimedia.org/T257905) [08:15:29] (03CR) 10Elukey: "Looks very good! One thing - is profile::memcached already included somewhere else or does it need to get added to the role?" 
[puppet] - 10https://gerrit.wikimedia.org/r/647106 (https://phabricator.wikimedia.org/T268784) (owner: 10Razzi) [08:22:57] (03PS1) 10Elukey: profile::memcached::instance: simplify handling of extendend_options [puppet] - 10https://gerrit.wikimedia.org/r/647190 [08:24:24] (03CR) 10jerkins-bot: [V: 04-1] profile::memcached::instance: simplify handling of extendend_options [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [08:24:37] uff [08:25:12] (03PS1) 10Hashar: gerrit: disable autogc when receiving packs [puppet] - 10https://gerrit.wikimedia.org/r/647191 [08:25:37] (03PS2) 10Elukey: profile::memcached::instance: simplify handling of extendend_options [puppet] - 10https://gerrit.wikimedia.org/r/647190 [08:26:07] (03CR) 10Hashar: "That is made the default with Gerrit 3.3 per https://gerrit-review.googlesource.com/c/gerrit/+/289470" [puppet] - 10https://gerrit.wikimedia.org/r/647191 (owner: 10Hashar) [08:27:43] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27025/console" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [08:33:26] (03CR) 10Jcrespo: "Looks sane, but please let me double check database backups are being created correctly before deploying. I will get back to you soon." [puppet] - 10https://gerrit.wikimedia.org/r/646966 (owner: 10Muehlenhoff) [08:33:55] (03CR) 10Muehlenhoff: "Ack, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/646966 (owner: 10Muehlenhoff) [08:39:41] (03CR) 10Jcrespo: "@Moritz:" [puppet] - 10https://gerrit.wikimedia.org/r/646966 (owner: 10Muehlenhoff) [08:47:48] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:48:36] (03CR) 10Muehlenhoff: [C: 03+2] Add Tyler as approval contact for Gerrit/contint [puppet] - 10https://gerrit.wikimedia.org/r/644856 (owner: 10Muehlenhoff) [08:51:37] (03CR) 10Volans: [C: 03+1] "Nice! Couple of nits inline, LGTM" (032 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/647189 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [08:52:21] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/643717 (owner: 10Hnowlan) [08:53:24] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad total VRPs alert, total VRPs alert, valid ROAs alert, valid ROAs alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [08:57:30] (03PS2) 10Elukey: Import prometheus and constants module from spicerack [software/pywmflib] - 10https://gerrit.wikimedia.org/r/647189 (https://phabricator.wikimedia.org/T257905) [08:58:05] (03CR) 10Elukey: Import prometheus and constants module from spicerack (032 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/647189 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [08:59:44] (03CR) 10Kormat: "I'm inclined to say let's not do this, for now." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644857 (owner: 10Jcrespo) [09:00:12] (03CR) 10Jcrespo: "This is your own code, so we have no business here, but if that helps, we generally make explicit (e.g. 
through a comment) when we use "fa" [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan) [09:00:39] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (apart from the silly CI check), you can pass a regex to the PCC host selection to bypass that:" [puppet] - 10https://gerrit.wikimedia.org/r/647031 (https://phabricator.wikimedia.org/T266479) (owner: 10Dave Pifke) [09:03:33] !log swift codfw-prod: add ms-be20[58-61] - T269337 [09:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:43] T269337: Add ms-be20[58-61] to swift - https://phabricator.wikimedia.org/T269337 [09:04:22] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [09:04:52] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Caught a small typo, otherwise +1" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/645317 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [09:05:47] moritzm: o/ are you puppet-merging? [09:06:55] (03CR) 10Kosta Harlan: "> Patch Set 3:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan) [09:07:51] sorry, yes [09:08:01] done [09:09:18] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [09:13:39] (03PS5) 10JMeybohm: Split out RBAC rules and service accounts for typha and CNI [deployment-charts] - 10https://gerrit.wikimedia.org/r/645317 (https://phabricator.wikimedia.org/T267653) [09:13:41] (03PS2) 10JMeybohm: calico: Bind the calico-cni Role to the calico-cni user [deployment-charts] - 10https://gerrit.wikimedia.org/r/645408 (https://phabricator.wikimedia.org/T267653) [09:13:55] (03CR) 10JMeybohm: Split out RBAC rules and service accounts for typha and CNI (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/645317 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [09:15:08] (03CR) 10Volans: [C: 03+1] "LGTM!" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/647189 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [09:16:06] (03CR) 10Jcrespo: "Fair, I wasn't aware of those other scripts and I personally don't need this." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644857 (owner: 10Jcrespo) [09:16:16] (03Abandoned) 10Jcrespo: Move section script from software/dbtools to wmfmariapy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644857 (owner: 10Jcrespo) [09:16:18] (03CR) 10Alexandros Kosiaris: [V: 03+1] prometheus::k8s: Support arbitrary clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris) [09:16:28] (03PS11) 10Alexandros Kosiaris: prometheus::k8s: Support arbitrary clusters [puppet] - 10https://gerrit.wikimedia.org/r/644262 [09:18:24] akosiaris: did you see my comment re: ^ ? 
[09:19:19] specifically the monitoring one in k8s.pp [09:19:24] (03PS1) 10Effie Mouzeli: redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) [09:19:40] (03Abandoned) 10Effie Mouzeli: redis: define redis version on buster [puppet] - 10https://gerrit.wikimedia.org/r/646638 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli) [09:21:38] (03CR) 10Jcrespo: "> Thanks for the comment Jaime (please, you are always welcome to review & comment!). "snakeoil" is used throughout config files in this r" [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan) [09:22:38] (03CR) 10Alexandros Kosiaris: [C: 03+1] calico: Bind the calico-cni Role to the calico-cni user [deployment-charts] - 10https://gerrit.wikimedia.org/r/645408 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [09:23:01] (03CR) 10Alexandros Kosiaris: [C: 03+1] Split out RBAC rules and service accounts for typha and CNI [deployment-charts] - 10https://gerrit.wikimedia.org/r/645317 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [09:24:40] !log make message mandatory for disable-puppet [09:24:44] (03CR) 10Jbond: [C: 03+2] disable-puppet: make message mandatory [puppet] - 10https://gerrit.wikimedia.org/r/645104 (owner: 10Jbond) [09:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:47] godog: No, I missed it. Sorry about that. Looking now. 
Thanks for the ping [09:25:32] (03CR) 10JMeybohm: [C: 03+1] prometheus::k8s: Support arbitrary clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris) [09:25:37] (03PS3) 10Muehlenhoff: Remove "idp" backup file set and drop backup host profile from IDPs [puppet] - 10https://gerrit.wikimedia.org/r/646966 [09:26:10] (03CR) 10Muehlenhoff: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/646966 (owner: 10Muehlenhoff) [09:26:18] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27027/console" [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris) [09:27:23] akosiaris: np! [09:27:48] (03PS6) 10JMeybohm: calico: Add support for calico 3.x with kubernetes datastore [puppet] - 10https://gerrit.wikimedia.org/r/645417 (https://phabricator.wikimedia.org/T267653) [09:29:55] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal, 10Patch-For-Review: Prepare a proof of concept of the minimum setup capable of backup and recover testwiki media files - https://phabricator.wikimedia.org/T264189 (10jcrespo) This is an example of metadata extracted, after normalizatio... [09:29:59] (03CR) 10Muehlenhoff: redis: define redis version on buster for multidc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli) [09:31:02] (03CR) 10Jbond: "LGTM wonder if we also need to add it to the absent_packages variable?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/646654 (owner: 10Muehlenhoff) [09:31:33] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/645203 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [09:35:01] 10Operations, 10serviceops, 10cloud-services-team (Kanban): Upgrade labweb servers to buster - https://phabricator.wikimedia.org/T269004 (10MoritzMuehlenhoff) >>! 
In T269004#6677107, @Andrew wrote: > It's most useful if effort is directed towards completing T237773, which will render this issue moot. In theo... [09:35:04] 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal, 10Patch-For-Review: Prepare a proof of concept of the minimum setup capable of backup and recover testwiki media files - https://phabricator.wikimedia.org/T264189 (10jcrespo) And this is after downloading the entire commonswiki metadat... [09:35:12] (03PS5) 10JMeybohm: admin_ng: Generalization, prod values anf fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/644787 (https://phabricator.wikimedia.org/T268434) [09:35:14] (03PS4) 10Kosta Harlan: linkrecommendation: Add private config for DB admin user [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) [09:35:39] (03PS2) 10Effie Mouzeli: redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) [09:35:55] (03CR) 10Kosta Harlan: "> Patch Set 3:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan) [09:37:20] (03CR) 10Effie Mouzeli: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27030/console" [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli) [09:37:24] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/646879 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [09:40:06] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:44:27] (03CR) 10Muehlenhoff: Stop installing apt-transport-https on Buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/646654 (owner: 10Muehlenhoff) [09:45:00] 
PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:20] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10jcrespo) @Krinkle That looks very similar to the problems I found initially on DB* dashboards, and then they did something to fix it- people here will know more. This was my i... [09:45:30] (03PS1) 10Jbond: profile::ntp: remove use_chrony parameter as its never used [puppet] - 10https://gerrit.wikimedia.org/r/647203 [09:46:04] (03CR) 10Jbond: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [09:47:20] (03PS1) 10Effie Mouzeli: hiera: install redis on mc1036 [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643) [09:47:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/602286 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [09:47:53] (03CR) 10jerkins-bot: [V: 04-1] hiera: install redis on mc1036 [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli) [09:48:10] (03PS2) 10Effie Mouzeli: hiera: install redis on mc2036 [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643) [09:48:33] 10Operations, 10Puppet, 10DBA, 10User-jbond: Request new database for pki.discovery.wmnet - https://phabricator.wikimedia.org/T268329 (10jcrespo) 2 tables and its schema were backed up yesterday, with around 4K in size after gzip compression. If that seems right I would call out the backups "working". Ple... 
[09:48:39] (03CR) 10jerkins-bot: [V: 04-1] hiera: install redis on mc2036 [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli) [09:49:35] 10Operations, 10Puppet, 10DBA, 10User-jbond: Request new database for pki.discovery.wmnet - https://phabricator.wikimedia.org/T268329 (10jbond) >>! In T268329#6678525, @jcrespo wrote: > 2 tables and its schema were backed up yesterday, with around 4K in size after gzip compression. If that seems right I w... [09:51:43] (03CR) 10Jcrespo: [C: 03+1] "Only 1 table was backed up from the cas database with around 10K after compression. No backup from case_staging were generated. If this se" [puppet] - 10https://gerrit.wikimedia.org/r/646966 (owner: 10Muehlenhoff) [09:51:58] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [09:52:40] (03CR) 10Muehlenhoff: "The plan is still to replace ISC NTP with Chrony in production, but I haven't found the time to pursue this further. Let's keep it in for " [puppet] - 10https://gerrit.wikimedia.org/r/647203 (owner: 10Jbond) [09:52:47] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/646890 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [09:53:42] 10Operations, 10Puppet, 10DBA, 10User-jbond: Request new database for pki.discovery.wmnet - https://phabricator.wikimedia.org/T268329 (10jcrespo) >>! In T268329#6678526, @jbond wrote: >>>! In T268329#6678525, @jcrespo wrote: >> 2 tables and its schema were backed up yesterday, with around 4K in size after...
[09:53:58] (03CR) 10Alexandros Kosiaris: [V: 03+1] "> Patch Set 6:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris) [09:54:10] (03CR) 10JMeybohm: [C: 04-1] "> Patch Set 3:" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan) [09:55:33] (03CR) 10Jbond: "minor nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/646971 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:56:29] (03PS5) 10Kosta Harlan: linkrecommendation: Add private config for DB admin user [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) [09:56:46] (03CR) 10Kosta Harlan: linkrecommendation: Add private config for DB admin user (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan) [09:59:01] (03CR) 10Alexandros Kosiaris: [V: 03+1] prometheus::k8s: Support arbitrary clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris) [10:00:02] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey) [10:00:23] (03PS8) 10Volans: Run Homer during the decom cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/643979 (owner: 10Ayounsi) [10:00:30] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active - Telia, AS1299/IPv6: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:01:49] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/645203 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [10:02:17] 10Operations, 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T269731 (10kostajh) [10:02:24] (03CR) 10Muehlenhoff: ""class redis" should also switch from require_package 
to ensure_packages, otherwise we might run into issues with the order of the setup o" [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli) [10:03:00] (03CR) 10Elukey: "Thanks to all for the feedback, I'd be inclined to proceed with after reading all comments." [puppet] - 10https://gerrit.wikimedia.org/r/645120 (owner: 10Elukey) [10:03:20] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for Kosta Harlan - https://phabricator.wikimedia.org/T269731 (10kostajh) [10:03:54] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/646966 (owner: 10Muehlenhoff) [10:04:17] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for Kosta Harlan - https://phabricator.wikimedia.org/T269731 (10kostajh) > Requestor -- Please coordinate obtaining a comment of approval on this task from the approving party. cc @akosiaris @marcella Please let me know if you have any quest... [10:04:25] (03PS3) 10Effie Mouzeli: hiera: install redis on shard16 [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643) [10:04:54] (03CR) 10jerkins-bot: [V: 04-1] hiera: install redis on shard16 [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli) [10:06:48] (03CR) 10Effie Mouzeli: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27033/console" [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli) [10:07:15] (03CR) 10Volans: [C: 03+2] Run Homer during the decom cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/643979 (owner: 10Ayounsi) [10:07:34] (03CR) 10Elukey: [C: 03+2] Import prometheus and constants module from spicerack [software/pywmflib] - 10https://gerrit.wikimedia.org/r/647189 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey) [10:08:22] 
(03CR) 10JMeybohm: [C: 03+1] linkrecommendation: Add private config for DB admin user [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan) [10:08:56] (03Merged) 10jenkins-bot: Run Homer during the decom cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/643979 (owner: 10Ayounsi) [10:11:18] (03PS5) 10Jbond: ntp: replace hiera() with lookup(), move use_chrony to parameters [puppet] - 10https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [10:12:00] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 52, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:12:26] (03CR) 10Jbond: [C: 03+1] "Made an update to the key name" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [10:12:41] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/647203 (owner: 10Jbond) [10:12:44] (03CR) 10Jbond: [C: 03+2] profile::ntp: remove use_chrony parameter as its never used [puppet] - 10https://gerrit.wikimedia.org/r/647203 (owner: 10Jbond) [10:13:01] (03Abandoned) 10Jbond: profile::ntp: remove use_chrony parameter as its never used [puppet] - 10https://gerrit.wikimedia.org/r/647203 (owner: 10Jbond) [10:13:51] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM. Couple of answers inline, thanks for the answers to my own questions. It helped clear up a few things." (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/644462 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [10:15:32] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for Kosta Harlan - https://phabricator.wikimedia.org/T269731 (10akosiaris) >>! In T269731#6678565, @kostajh wrote: >> Requestor -- Please coordinate obtaining a comment of approval on this task from the approving party. > > cc @akosiaris @ma... 
[10:20:08] (03PS1) 10Jbond: icinga_status: fix type downtimed != downtime [puppet] - 10https://gerrit.wikimedia.org/r/647208 (https://phabricator.wikimedia.org/T269672) [10:20:58] (03CR) 10Jbond: [C: 03+2] icinga_status: fix type downtimed != downtime [puppet] - 10https://gerrit.wikimedia.org/r/647208 (https://phabricator.wikimedia.org/T269672) (owner: 10Jbond) [10:24:03] (03CR) 10Muehlenhoff: "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/646966 (owner: 10Muehlenhoff) [10:24:30] 10Operations, 10Patch-For-Review: Traceback in icinga-status 'Host' object has no attribute 'downtime' - https://phabricator.wikimedia.org/T269672 (10jbond) This should be fixed now please re open if yu still see issues ` $ /usr/local/bin/icinga-status -j auth1002.eqiad.wmnet {"auth1002": {"name": "a... [10:24:58] 10Operations, 10Patch-For-Review: Traceback in icinga-status 'Host' object has no attribute 'downtime' - https://phabricator.wikimedia.org/T269672 (10jbond) 05Open→03Resolved a:03jbond [10:26:29] (03PS1) 10Alexandros Kosiaris: deployment: Set global statsd exporter version [puppet] - 10https://gerrit.wikimedia.org/r/647210 [10:29:44] (03CR) 10Jbond: [C: 03+2] spec: add and use parallel sec which seems to give a boost to run time [puppet] - 10https://gerrit.wikimedia.org/r/645113 (owner: 10Jbond) [10:40:12] (03PS1) 10JMeybohm: _tls_helpers: Add a default tls.image_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/647211 [10:42:30] (03CR) 10Alexandros Kosiaris: [C: 03+2] _tls_helpers: Add a default tls.image_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/647211 (owner: 10JMeybohm) [10:43:54] (03Merged) 10jenkins-bot: _tls_helpers: Add a default tls.image_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/647211 (owner: 10JMeybohm) [10:45:23] (03CR) 10JMeybohm: [C: 04-1] "You'll need to bump the chart version in Chart.yaml for this to take effect." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan) [10:45:58] !log installing openssl updates on Buster [10:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:02] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:09] (03PS7) 10Kosta Harlan: linkrecommendation: Add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) [10:48:23] (03CR) 10jerkins-bot: [V: 04-1] linkrecommendation: Add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [10:48:45] (03PS6) 10Kosta Harlan: linkrecommendation: Add private config for DB admin user [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) [10:49:22] (03PS8) 10Kosta Harlan: linkrecommendation: Add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) [10:50:56] (03CR) 10jerkins-bot: [V: 04-1] linkrecommendation: Add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [10:52:37] (03PS5) 10Jbond: profile: migrate to shared spec_test [puppet] - 10https://gerrit.wikimedia.org/r/645147 [10:53:52] (03CR) 10Kosta Harlan: "> Patch Set 5: Code-Review-1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan) [10:55:05] (03PS7) 10JMeybohm: calico: Add support for calico 3.x with kubernetes datastore [puppet] - 10https://gerrit.wikimedia.org/r/645417 (https://phabricator.wikimedia.org/T267653) [10:55:37] (03CR) 10Jbond: [V: 
03+1 C: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27034/console" [puppet] - 10https://gerrit.wikimedia.org/r/644808 (https://phabricator.wikimedia.org/T235162) (owner: 10Jbond) [10:55:51] (03CR) 10JMeybohm: [C: 03+2] Add calico helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/644462 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [10:56:05] (03CR) 10JMeybohm: [C: 03+2] Split out RBAC rules and service accounts for typha and CNI [deployment-charts] - 10https://gerrit.wikimedia.org/r/645317 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [10:56:12] (03CR) 10JMeybohm: [C: 03+2] calico: Bind the calico-cni Role to the calico-cni user [deployment-charts] - 10https://gerrit.wikimedia.org/r/645408 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [10:56:17] !log change librenms alerts and transport groups to use alertmanager - T267018 [10:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:25] T267018: LibreNMS sends its alerts to Alertmanager, resulting in email notifications to network operations - https://phabricator.wikimedia.org/T267018 [10:57:17] (03Merged) 10jenkins-bot: Add calico helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644462 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [10:57:52] (03Merged) 10jenkins-bot: Split out RBAC rules and service accounts for typha and CNI [deployment-charts] - 10https://gerrit.wikimedia.org/r/645317 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [10:57:54] (03Merged) 10jenkins-bot: calico: Bind the calico-cni Role to the calico-cni user [deployment-charts] - 10https://gerrit.wikimedia.org/r/645408 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [11:01:34] (03CR) 10JMeybohm: [C: 03+1] "> Are you saying the linting of Ide101c55e5a0fd9a390f22de7c33d303e9f3da50 will be unbroken after this patch is 
merged which includes the b" [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan) [11:02:43] (03CR) 10JMeybohm: [C: 03+1] linkrecommendation: Add helmfile.d config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [11:04:10] 10Operations, 10netops, 10observability, 10User-fgiunchedi: LibreNMS sends its alerts to Alertmanager, resulting in email notifications to network operations - https://phabricator.wikimedia.org/T267018 (10fgiunchedi) +netops for visibility, cc @ayounsi [11:06:39] !log reboot ms-be1019 / ms-be1020 - T268435 [11:06:43] (03CR) 10JMeybohm: [C: 03+2] linkrecommendation: Add private config for DB admin user [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan) [11:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:47] T268435: Add ms-be106[0-3] to swift - https://phabricator.wikimedia.org/T268435 [11:07:48] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-tool1010.eqiad.wmnet - https://phabricator.wikimedia.org/T268146 (10elukey) After a chat with Riccardo and Arzhel, the idea is to: 1) decom an-tool1010 (testing a new feature of the decom cookbook to auto-cleanup switch configs). 2) r... 
[11:07:52] PROBLEM - Host ms-be1019 is DOWN: PING CRITICAL - Packet loss = 100% [11:07:58] (03Merged) 10jenkins-bot: linkrecommendation: Add private config for DB admin user [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan) [11:08:28] (03PS9) 10JMeybohm: linkrecommendation: Add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [11:08:44] 10Operations, 10Mail: Bounces when sending mail to aliases of a specific WMF email address - https://phabricator.wikimedia.org/T269725 (10Aklapper) [11:11:15] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) The list of VSM-related issues affecting 5.2.1 according to [[https://github.com/varnishcache/varnish-cache/blob/6.0/doc/changes.rst#fix... [11:11:52] RECOVERY - Host ms-be1019 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [11:14:15] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Refactor calico deploy strategy - https://phabricator.wikimedia.org/T267653 (10JMeybohm) The new calico chart is merged, thanks @akosiaris What is missing currently is a proper RoleBinding for the calicoctl user as I was n... 
[11:14:42] PROBLEM - Host ms-be2059 is DOWN: PING CRITICAL - Packet loss = 100% [11:15:51] that's me ^ [11:16:30] PROBLEM - Host ms-be1020 is DOWN: PING CRITICAL - Packet loss = 100% [11:17:18] ditto [11:17:24] RECOVERY - Host ms-be2059 is UP: PING OK - Packet loss = 0%, RTA = 33.42 ms [11:17:49] godog, Scap released and deployed, btw [11:18:36] RECOVERY - Host ms-be1020 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [11:20:54] (03PS1) 10Elukey: hive: force TLS from the Metastore to the db-host when needed [puppet] - 10https://gerrit.wikimedia.org/r/647215 (https://phabricator.wikimedia.org/T257412) [11:20:56] liw: thanks! [11:21:16] (03CR) 10jerkins-bot: [V: 04-1] hive: force TLS from the Metastore to the db-host when needed [puppet] - 10https://gerrit.wikimedia.org/r/647215 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [11:21:38] (03CR) 10JMeybohm: [C: 03+2] linkrecommendation: Add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [11:23:02] (03Merged) 10jenkins-bot: linkrecommendation: Add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [11:32:07] (03PS1) 10JMeybohm: linkrecommendation: Allow MySQL egress and set public_port [deployment-charts] - 10https://gerrit.wikimedia.org/r/647216 (https://phabricator.wikimedia.org/T265893) [11:38:10] (03PS2) 10Elukey: hive: force TLS from the Metastore to the db-host when needed [puppet] - 10https://gerrit.wikimedia.org/r/647215 (https://phabricator.wikimedia.org/T257412) [11:40:18] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27036/console" [puppet] - 10https://gerrit.wikimedia.org/r/647215 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [11:44:03] (03PS3) 10Elukey: hive: force TLS from the Metastore to the db-host when 
needed [puppet] - 10https://gerrit.wikimedia.org/r/647215 (https://phabricator.wikimedia.org/T257412) [11:45:33] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27037/console" [puppet] - 10https://gerrit.wikimedia.org/r/647215 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [11:56:53] 10Operations, 10netops, 10observability, 10User-fgiunchedi: LibreNMS sends its alerts to Alertmanager, resulting in email notifications to network operations - https://phabricator.wikimedia.org/T267018 (10jbond) 05Open→03Resolved a:03jbond Looks like this is complete, resolving please reopen if i mis... [11:57:21] 10Operations, 10fundraising-tech-ops, 10netops: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802 (10jbond) p:05Triage→03Medium [11:58:10] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Publish Wikibase tarball releases on releases.wikimedia.org - https://phabricator.wikimedia.org/T268818 (10jbond) p:05Triage→03Medium [11:58:28] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Various comments inline, but the premise is sane" (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/645417 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201209T1200). [12:00:05] ammarpad: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:29] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27038/console" [puppet] - 10https://gerrit.wikimedia.org/r/647210 (owner: 10Alexandros Kosiaris) [12:00:59] I can deploy today! 
[12:01:02] 10Operations, 10Maps (Kartographer): Some PostgreSQL replicas are not fully updated - https://phabricator.wikimedia.org/T268927 (10jbond) p:05Triage→03Medium [12:01:12] Ammarpad: hi, are you here? [12:01:16] 10Operations, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Offboard Pablo-WMDE from WMF systems - https://phabricator.wikimedia.org/T268946 (10jbond) p:05Triage→03Medium [12:01:55] 10Operations, 10ops-eqiad, 10Analytics-Clusters: an-presto1004 shows only the NIC in the boot list - https://phabricator.wikimedia.org/T268951 (10jbond) p:05Triage→03Medium [12:02:22] 10Operations, 10netops: Upgrade Routinator 3000 to 0.8.2 - https://phabricator.wikimedia.org/T269738 (10ayounsi) p:05Triage→03Medium [12:02:30] Ammarpad: ping? [12:02:50] 10Operations, 10serviceops, 10Datacenter-Switchover: Updates to warmup script - https://phabricator.wikimedia.org/T269179 (10jbond) p:05Triage→03Medium [12:03:21] 10Operations, 10observability, 10CAS-SSO: Sign-in links from Grafana dashboards don't work when not signed into SSO - https://phabricator.wikimedia.org/T269272 (10jbond) p:05Triage→03Medium [12:04:01] 10Operations, 10Wikimedia-Mailing-lists: https://lists.wikimedia.org/pipermail/wikija-l/ has broken encoding - https://phabricator.wikimedia.org/T269301 (10jbond) p:05Triage→03Medium [12:04:07] @Urbanecm, yes I am [12:04:57] (03CR) 10Urbanecm: [C: 03+2] Add extended-confirmed group and restriction level for bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647129 (https://phabricator.wikimedia.org/T269709) (owner: 10Ammarpad) [12:05:03] (03PS1) 10KartikMistry: Update apertium to 2020-12-09-115733-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/647220 [12:05:04] 10Operations, 10netops, 10observability, 10User-fgiunchedi: LibreNMS sends its alerts to Alertmanager, resulting in email notifications to network operations - https://phabricator.wikimedia.org/T267018 (10ayounsi) 05Resolved→03Open Not everything has been migrated yet, see 
the full list on https://libr... [12:06:02] (03Merged) 10jenkins-bot: Add extended-confirmed group and restriction level for bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647129 (https://phabricator.wikimedia.org/T269709) (owner: 10Ammarpad) [12:06:30] 10Operations, 10netops, 10observability, 10User-fgiunchedi: LibreNMS sends its alerts to Alertmanager, resulting in email notifications to network operations - https://phabricator.wikimedia.org/T267018 (10jbond) p:05Triage→03Medium >>! In T267018#6678835, @ayounsi wrote: > Not everything has been migra... [12:07:17] Ammarpad: pulled onto mwdebug1001, can you test please? [12:08:33] 10Operations, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10jbond) p:05Triage→03Medium [12:08:41] Urbanecm OK [12:08:49] 10Operations, 10Platform Engineering, 10serviceops: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (10jbond) p:05Triage→03Medium [12:09:05] I mean I am testing... [12:10:45] 10Operations: slapd fails to restart sometimes - https://phabricator.wikimedia.org/T269394 (10jbond) p:05Triage→03Medium [12:11:45] 10Operations, 10SRE-tools, 10observability: HP RAID failed on ms-be1054 didn't open a task - https://phabricator.wikimedia.org/T269563 (10jbond) p:05Triage→03Medium [12:12:09] Ammarpad: yes, Im waiting [12:12:12] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Data for toan - https://phabricator.wikimedia.org/T269678 (10jbond) p:05Triage→03Medium [12:13:10] (03CR) 10Kosta Harlan: [C: 03+1] linkrecommendation: Allow MySQL egress and set public_port [deployment-charts] - 10https://gerrit.wikimedia.org/r/647216 (https://phabricator.wikimedia.org/T265893) (owner: 10JMeybohm) [12:13:11] yes.. all is OK... 
[12:13:20] You can proceed [12:13:40] 10Operations, 10Domains, 10Okapi, 10Traffic: Okapi Domains - https://phabricator.wikimedia.org/T269686 (10jbond) p:05Triage→03Medium [12:14:08] (03CR) 10Kosta Harlan: "Actually while I think this looks reasonable and makes sense, I'll remove my vote so people more qualified in this domain can judge :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/647216 (https://phabricator.wikimedia.org/T265893) (owner: 10JMeybohm) [12:14:12] thanks Ammarpad [12:15:45] (03CR) 10Urbanecm: [C: 03+2] Remove unsupported arg in MediaWiki::doPostOutputShutdown() call [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645222 (owner: 10Ammarpad) [12:15:56] Ammarpad: I assume the other one can't really be tested, is that irght? [12:16:02] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 3414289c8c7272185e30cacc3df5d5dbc719219d: Add extended-confirmed group and restriction level for bgwiki (T269709) (duration: 01m 19s) [12:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:11] T269709: Add extended-confirmed group and restriction level for bgwiki - https://phabricator.wikimedia.org/T269709 [12:16:34] (03Merged) 10jenkins-bot: Remove unsupported arg in MediaWiki::doPostOutputShutdown() call [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645222 (owner: 10Ammarpad) [12:17:14] 10Operations, 10Wikidata, 10Wikidata-Query-Service: mediawiki_job_wikidata-updateQueryServiceLag failing - https://phabricator.wikimedia.org/T269693 (10jbond) p:05Triage→03Medium [12:17:33] Ammarpad: could you answer the q above? [12:17:50] it's at mwdebug1001 anyway [12:18:24] Yes indeed. But I am sure it will not cause problem. Lucas aso gave it +1. The method does not take parameter. [12:19:05] it doesn't, so hope it's all right [12:19:07] OK, it's not testable though, so there's nothing I can do. 
PHP should be throwing error if you call method that takes no arg with arg [12:19:50] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/645147 (owner: 10Jbond) [12:20:10] that would indicate the file is not in use anymore [12:20:22] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=restbase,service=restbase,name=restbase2009.codfw.wmnet [12:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:33] syncing [12:23:25] !log urbanecm@deploy1001 Synchronized w/static.php: cfb36023ac873c00e680032999b7c21c2a105132: Remove unsupported arg in MediaWiki::doPostOutputShutdown() call (duration: 01m 02s) [12:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:31] Ammarpad: done [12:23:58] Thank you [12:24:28] and stuff like https://cs.wikipedia.org/w/extensions/GrowthExperiments/images/mentor-ltr.svg still works, and goes through static.php [12:24:32] !log Eu B&C window done [12:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:54] 10Operations, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10hnowlan) >>! In T269328#6667383, @Eevans wrote: > At the very least, getting rid of these names would create inconvenience. There are lots of examples of maintenance and admin commands t... 
[12:29:34] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [12:29:35] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:00] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=maps,service=kartotherian,name=maps1001.eqiad.wmnet [12:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:08] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 21561504 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:34:34] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 312677880 and 18 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:35:14] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 402721320 and 26 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:35:27] (03CR) 10Jbond: [C: 03+2] profile: migrate to shared spec_test [puppet] - 10https://gerrit.wikimedia.org/r/645147 (owner: 10Jbond) [12:37:00] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 53383376 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:37:10] PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 61605872 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:37:50] ^ expected [12:40:10] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 52 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:40:20] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 62 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:40:30] RECOVERY - Postgres Replication Lag on maps1009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 24824 and 71 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:41:10] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1008 and 111 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:41:52] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 104580512 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:42:14] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 541206160 and 32 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:42:24] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 35275104 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:43:08] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 32276528 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:45:08] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 70696672 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:46:08] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:38] PROBLEM - Postgres Replication Lag on 
maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 548657160 and 35 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:48:09] (03PS1) 10Ayounsi: Standardize Private-Peer BGP group [homer/public] - 10https://gerrit.wikimedia.org/r/647226 [12:48:18] PROBLEM - Device not healthy -SMART- on ms-be1030 is CRITICAL: cluster=swift device=1I:1:5 instance=ms-be1030 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1030&var-datasource=eqiad+prometheus/ops [12:48:44] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 9792 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:48:54] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3032 and 13 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:49:16] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 14520 and 35 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:49:40] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 24592 and 59 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:50:02] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 36224 and 80 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:51:00] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:52:38] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Publish Wikibase tarball releases on releases.wikimedia.org - https://phabricator.wikimedia.org/T268818 (10jbond) > As for myself(toan) I'm currently not defined as an admin but would also like to be a part of this list. Should I add this in a follow-u... [12:56:06] (03CR) 10Ayounsi: [C: 03+2] Standardize Private-Peer BGP group [homer/public] - 10https://gerrit.wikimedia.org/r/647226 (owner: 10Ayounsi) [12:56:36] (03Merged) 10jenkins-bot: Standardize Private-Peer BGP group [homer/public] - 10https://gerrit.wikimedia.org/r/647226 (owner: 10Ayounsi) [13:03:12] (03PS1) 10Ppchelko: Article::view - remove the old subtitle from doOutputFromParserCache. [core] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647081 (https://phabricator.wikimedia.org/T269727) [13:03:41] (03CR) 10Ema: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/646971 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:03:42] !log standardize Private-Peer BGP group on all cr* [13:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:06] (03PS2) 10Jbond: Enable base::service_auto_restart for purged [puppet] - 10https://gerrit.wikimedia.org/r/646971 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:09:33] (03CR) 10Muehlenhoff: Enable base::service_auto_restart for purged (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/646971 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:09:54] 10Operations, 10Maps (Kartographer): Some PostgreSQL replicas are not fully updated - https://phabricator.wikimedia.org/T268927 (10hnowlan) a:03hnowlan [13:10:48] (03PS1) 10Hnowlan: maps: increase replication lag tolerance further [puppet] - 10https://gerrit.wikimedia.org/r/647230 [13:11:07] 10Operations, 10Maps (Kartographer): Some PostgreSQL replicas are not 
fully updated - https://phabricator.wikimedia.org/T268927 (10hnowlan) maps1001 is depooled and resyncing. [13:15:08] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:40] (03CR) 10Jbond: [C: 03+1] "lgtm, thx" [puppet] - 10https://gerrit.wikimedia.org/r/647230 (owner: 10Hnowlan) [13:18:24] (03PS4) 10Muehlenhoff: Remove "idp" backup file set and drop backup host profile from IDPs [puppet] - 10https://gerrit.wikimedia.org/r/646966 [13:25:49] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:29] (03CR) 10Jbond: [C: 03+2] raktables: hand off authentication to httpd [puppet] - 10https://gerrit.wikimedia.org/r/644543 (owner: 10Jbond) [13:26:35] (03CR) 10Jbond: [C: 03+2] racktables: Make everyone admin [puppet] - 10https://gerrit.wikimedia.org/r/644544 (owner: 10Jbond) [13:26:41] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Publish Wikibase tarball releases on releases.wikimedia.org - https://phabricator.wikimedia.org/T268818 (10WMDE-leszek) I approve this request. I will also approve @toan's production shell access request when it is open. 
[13:27:50] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) [13:27:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:23] (03PS1) 10Jbond: racktables: update correct file [puppet] - 10https://gerrit.wikimedia.org/r/647238 [13:30:36] (03CR) 10Jbond: [V: 03+2 C: 03+2] racktables: update correct file [puppet] - 10https://gerrit.wikimedia.org/r/647238 (owner: 10Jbond) [13:32:06] (03CR) 10Alexandros Kosiaris: [C: 03+1] linkrecommendation: Allow MySQL egress and set public_port [deployment-charts] - 10https://gerrit.wikimedia.org/r/647216 (https://phabricator.wikimedia.org/T265893) (owner: 10JMeybohm) [13:33:55] (03CR) 10Filippo Giunchedi: "> Patch Set 11:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris) [13:35:16] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27039/console" [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris) [13:37:07] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:18] (03PS1) 10Jbond: racktables: add trailing semicolon [puppet] - 10https://gerrit.wikimedia.org/r/647242 [13:39:38] (03CR) 10Muehlenhoff: [C: 03+1] racktables: add trailing semicolon [puppet] - 10https://gerrit.wikimedia.org/r/647242 (owner: 10Jbond) [13:39:42] (03CR) 10Jbond: [C: 03+2] racktables: add trailing semicolon [puppet] - 10https://gerrit.wikimedia.org/r/647242 (owner: 10Jbond) [13:40:52] (03PS6) 10Kormat: int_cont [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634 [13:41:05] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:43:12] (03CR) 10Zfilipin: "Thanks! 
🎉" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/639293 (https://phabricator.wikimedia.org/T265463) (owner: 10Harriet Ayugi) [13:44:06] 10Operations, 10Analytics-Clusters: Backport kafkacat 1.6.0 from bullseye to buster-backports or buster-wikimedia - https://phabricator.wikimedia.org/T268936 (10klausman) This is definitely doable, but needs at least one change: The Bullseye version of the package depends on librdkafka1 >= 1.4.2, which Buster... [13:54:54] !log experiment with rsync.service increased niceness on ms-be2057 - T269337 [13:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:02] T269337: Add ms-be20[58-61] to swift - https://phabricator.wikimedia.org/T269337 [13:55:10] 10Operations: Traceback in icinga-status 'Host' object has no attribute 'downtime' - https://phabricator.wikimedia.org/T269672 (10jbond) >>! In T269672#6678613, @jbond wrote: > This should be fixed now please reopen if you still see issues > Spoke too soon we now see this issue ` Exception raised while executi... 
[13:56:53] (03PS1) 10Jbond: icinga_status: remove values from json until support added to spicerack [puppet] - 10https://gerrit.wikimedia.org/r/647243 (https://phabricator.wikimedia.org/T269672) [13:57:31] (03CR) 10jerkins-bot: [V: 04-1] icinga_status: remove values from json until support added to spicerack [puppet] - 10https://gerrit.wikimedia.org/r/647243 (https://phabricator.wikimedia.org/T269672) (owner: 10Jbond) [13:58:11] (03PS1) 10Jbond: icinga: add support for downtimed and notifications_enabled parameters [software/spicerack] - 10https://gerrit.wikimedia.org/r/647245 (https://phabricator.wikimedia.org/T269672) [14:01:26] (03CR) 10Muehlenhoff: [C: 03+2] Remove "idp" backup file set and drop backup host profile from IDPs [puppet] - 10https://gerrit.wikimedia.org/r/646966 (owner: 10Muehlenhoff) [14:02:17] (03CR) 10jerkins-bot: [V: 04-1] icinga: add support for downtimed and notifications_enabled parameters [software/spicerack] - 10https://gerrit.wikimedia.org/r/647245 (https://phabricator.wikimedia.org/T269672) (owner: 10Jbond) [14:06:42] (03PS7) 10Kormat: integration: Complete framework for running basic tests [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634 (https://phabricator.wikimedia.org/T265266) [14:07:06] 10Operations, 10Analytics-Clusters: Backport kafkacat 1.6.0 from bullseye to buster-backports or buster-wikimedia - https://phabricator.wikimedia.org/T268936 (10klausman) I've also poked Faidon on whether an official backport might be done. 
[14:07:14] (03CR) 10Kormat: "Ready for review now" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat) [14:11:21] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM overall, see inline" (031 comment) [software/ecs] - 10https://gerrit.wikimedia.org/r/644638 (owner: 10Cwhite) [14:12:14] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: add ecs pre and post filters to pipeline [puppet] - 10https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [14:15:38] (03CR) 10Alexandros Kosiaris: [V: 03+1] "> Patch Set 11:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris) [14:16:23] (03PS12) 10Alexandros Kosiaris: prometheus::k8s: Support arbitrary clusters [puppet] - 10https://gerrit.wikimedia.org/r/644262 [14:19:59] RECOVERY - Device not healthy -SMART- on ms-be1030 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1030&var-datasource=eqiad+prometheus/ops [14:23:51] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27040/console" [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris) [14:28:02] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris) [14:31:38] 10Operations, 10User-DannyS712: Access to #mediawiki_security IRC channel for DannyS712 - https://phabricator.wikimedia.org/T267800 (10CDanis) 05Open→03Resolved a:03CDanis [14:35:10] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for purged [puppet] - 10https://gerrit.wikimedia.org/r/646971 (https://phabricator.wikimedia.org/T135991) [14:38:07] (03Abandoned) 10TK-999: GeoDNS: Update entry for Wikia [dns] - 10https://gerrit.wikimedia.org/r/643983 (owner: 10TK-999) [14:38:32] (03CR) 10Hnowlan: [C: 03+2] maps: increase replication lag tolerance further [puppet] - 10https://gerrit.wikimedia.org/r/647230 (owner: 10Hnowlan) [14:39:46] (03PS1) 10TK-999: GeoDNS: Remove old hack for Wikia RES datacenter [dns] - 10https://gerrit.wikimedia.org/r/647253 [14:43:01] (03CR) 10JMeybohm: [C: 03+2] linkrecommendation: Allow MySQL egress and set public_port [deployment-charts] - 10https://gerrit.wikimedia.org/r/647216 (https://phabricator.wikimedia.org/T265893) (owner: 10JMeybohm) [14:44:25] (03Merged) 10jenkins-bot: linkrecommendation: Allow MySQL egress and set public_port [deployment-charts] - 10https://gerrit.wikimedia.org/r/647216 (https://phabricator.wikimedia.org/T265893) (owner: 10JMeybohm) [14:46:52] (03PS2) 10Muehlenhoff: Stop installing apt-transport-https on Buster and prune it from Stretch installs [puppet] - 10https://gerrit.wikimedia.org/r/646654 [14:47:45] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:48:24] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] "Thanks everyone! 
PCC is ok at https://puppet-compiler.wmflabs.org/compiler1001/27040/prometheus1003.eqiad.wmnet/fulldiff.html, merging" [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris) [14:48:59] (03CR) 10jerkins-bot: [V: 04-1] Stop installing apt-transport-https on Buster and prune it from Stretch installs [puppet] - 10https://gerrit.wikimedia.org/r/646654 (owner: 10Muehlenhoff) [14:49:45] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10klausman) Networking will be 1G. No hw RAID. As for partitioning, there currently is no parman recipe available that does exactly what we want (2xSSD RAID-1 for OS, 2x (or m... [14:51:04] (03CR) 10Volans: "Did a second pass, and I sent you an offline question as I forgot some bits of the context." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249) (owner: 10Jbond) [14:57:44] (03PS3) 10Muehlenhoff: Stop installing apt-transport-https on Buster and prune it from Stretch installs [puppet] - 10https://gerrit.wikimedia.org/r/646654 [14:59:48] (03CR) 10jerkins-bot: [V: 04-1] Stop installing apt-transport-https on Buster and prune it from Stretch installs [puppet] - 10https://gerrit.wikimedia.org/r/646654 (owner: 10Muehlenhoff) [15:00:40] (03CR) 10Elukey: [V: 03+1 C: 03+2] hive: force TLS from the Metastore to the db-host when needed [puppet] - 10https://gerrit.wikimedia.org/r/647215 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey) [15:06:54] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Refactor calico deploy strategy - https://phabricator.wikimedia.org/T267653 (10akosiaris) >>! In T267653#6678721, @JMeybohm wrote: > The new calico chart is merged, thanks @akosiaris > > What is missing currently is a prop... [15:12:00] elukey: I think I 've never followed up on the conf1006 stuff [15:12:39] do we have a timeline for that migration? 
I can concoct a potion for puppet and make conf1006 drink it so we can take it offline [15:13:17] (03PS4) 10Jbond: Stop installing apt-transport-https on Buster and prune it from Stretch installs [puppet] - 10https://gerrit.wikimedia.org/r/646654 (owner: 10Muehlenhoff) [15:13:19] (03PS1) 10Jbond: base: update rakefile [puppet] - 10https://gerrit.wikimedia.org/r/647260 [15:18:32] akosiaris: nono nothing urgent, it was just in the list so I asked, we can do it anytime [15:18:48] (03CR) 10Jbond: [C: 03+2] base: update rakefile [puppet] - 10https://gerrit.wikimedia.org/r/647260 (owner: 10Jbond) [15:19:01] if you have to roll restart pybal etc. i can sync with John to make the change [15:19:11] (it is not blocking anything I mean) [15:19:33] (03CR) 10Jbond: [C: 03+1] Stop installing apt-transport-https on Buster and prune it from Stretch installs [puppet] - 10https://gerrit.wikimedia.org/r/646654 (owner: 10Muehlenhoff) [15:26:24] !og restarting nginx on htmldump1001 to pick up OpenSSL security updates [15:27:29] moritzm: missed the "l" of "log" :) [15:29:09] (03PS1) 10JMeybohm: _scaffold: Default to latest for monitoring.image_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/647262 [15:29:21] good point, thanks :-) [15:29:25] !log restarting nginx on htmldump1001 to pick up OpenSSL security updates [15:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:52] (03CR) 10JMeybohm: [C: 03+1] "Cool! 
I've added the _scaffold counterpart in I57dce79777bf1e9aa7f6ae88fc8e10969ed1518a" [puppet] - 10https://gerrit.wikimedia.org/r/647210 (owner: 10Alexandros Kosiaris) [15:30:54] (03PS2) 10Jbond: icinga_status: remove values from json until support added to spicerack [puppet] - 10https://gerrit.wikimedia.org/r/647243 (https://phabricator.wikimedia.org/T269672) [15:33:28] (03PS1) 10Filippo Giunchedi: WIP logstash: add ulogd ecs filter + tests [puppet] - 10https://gerrit.wikimedia.org/r/647265 (https://phabricator.wikimedia.org/T234565) [15:33:31] 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10Epic, and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10Jdforrester-WMF) [15:35:48] 10Operations, 10Dumps-Generation, 10Platform Engineering, 10serviceops: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (10ArielGlenn) [15:35:53] (03CR) 10Filippo Giunchedi: [C: 03+1] Stop installing apt-transport-https on Buster and prune it from Stretch installs [puppet] - 10https://gerrit.wikimedia.org/r/646654 (owner: 10Muehlenhoff) [15:36:17] (03PS3) 10Effie Mouzeli: redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) [15:37:07] (03PS5) 10Ssingh: Initial commit of the knead-wikidough test suite [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/639838 (https://phabricator.wikimedia.org/T267424) [15:37:19] 10Operations, 10Dumps-Generation, 10Platform Engineering, 10serviceops: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (10ArielGlenn) I can do the testbed host first, and then the rest. Do we have a mediawiki server on buster anywhere in the cluster yet? 
[15:37:38] (03CR) 10jerkins-bot: [V: 04-1] Initial commit of the knead-wikidough test suite [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/639838 (https://phabricator.wikimedia.org/T267424) (owner: 10Ssingh) [15:37:43] (03CR) 10jerkins-bot: [V: 04-1] redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli) [15:38:42] (03PS4) 10Effie Mouzeli: redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) [15:39:04] 10Operations, 10Dumps-Generation, 10Platform Engineering, 10serviceops: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (10MoritzMuehlenhoff) Yes, mwdebug1003 is running Buster, you can select it with the latest version of the WikimediaDebug browser extension. [15:42:30] (03CR) 10Effie Mouzeli: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27041/console" [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli) [15:47:05] !log reimaging restbase2009 after disk replacement [15:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:16] (03CR) 10Jbond: [C: 03+2] icinga_status: remove values from json until support added to spicerack [puppet] - 10https://gerrit.wikimedia.org/r/647243 (https://phabricator.wikimedia.org/T269672) (owner: 10Jbond) [15:47:49] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10Bstorm) T267366#6667864 suggests the cables should have arrived on-site. 
[15:49:03] (03PS1) 10Jbond: icinga_status: add downtimed and notifications_enabled to json [puppet] - 10https://gerrit.wikimedia.org/r/647084 [15:49:05] (03CR) 10Cwhite: "Looking good!" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/647265 (https://phabricator.wikimedia.org/T234565) (owner: 10Filippo Giunchedi) [15:49:41] (03CR) 10jerkins-bot: [V: 04-1] icinga_status: add downtimed and notifications_enabled to json [puppet] - 10https://gerrit.wikimedia.org/r/647084 (owner: 10Jbond) [15:50:06] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:50:16] 10Operations, 10ops-eqiad, 10Analytics-Clusters: an-presto1004 shows only the NIC in the boot list - https://phabricator.wikimedia.org/T268951 (10Cmjohnson) Still working with Dell on this, tried reseating the raid controller and the cables, the raid card is still not recognized by the bios. 
[15:50:53] (03PS2) 10Jbond: icinga_status: add downtimed and notifications_enabled to json [puppet] - 10https://gerrit.wikimedia.org/r/647084 [15:51:40] (03CR) 10jerkins-bot: [V: 04-1] icinga_status: add downtimed and notifications_enabled to json [puppet] - 10https://gerrit.wikimedia.org/r/647084 (owner: 10Jbond) [15:52:23] (03PS3) 10Jbond: icinga_status: add downtimed and notifications_enabled to json [puppet] - 10https://gerrit.wikimedia.org/r/647084 [15:56:43] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=maps,service=kartotherian,name=maps1001.eqiad.wmnet [15:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:32] (03PS6) 10Cwhite: Generate Logstash ECS cleanup filter as part of regular build process [software/ecs] - 10https://gerrit.wikimedia.org/r/644638 [16:03:14] PROBLEM - Host ms-be2050 is DOWN: PING CRITICAL - Packet loss = 100% [16:03:17] (03CR) 10Cwhite: Generate Logstash ECS cleanup filter as part of regular build process (031 comment) [software/ecs] - 10https://gerrit.wikimedia.org/r/644638 (owner: 10Cwhite) [16:05:00] !log importing wikidiff2 1.10.0-1~wmf1+buster1 to component/php72 T250515 [16:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:07] T250515: Please provide our special component/php72 in buster-wikimedia - https://phabricator.wikimedia.org/T250515 [16:05:24] RECOVERY - Host ms-be2050 is UP: PING OK - Packet loss = 0%, RTA = 36.39 ms [16:05:41] (03PS5) 10Effie Mouzeli: redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) [16:05:55] (03PS6) 10Cwhite: profile: add ecs pre and post filters to pipeline [puppet] - 10https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565) [16:06:12] !log updating mwdebug1003, parse2001, deploy1002, deploy2002 to wikidiff 1.10.0-1~wmf1+buster1 [16:06:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:00] (03CR) 10Effie Mouzeli: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27042/console" [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli) [16:09:17] (03CR) 10Herron: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [16:10:42] !log deployment-cache-text06: deploy varnish 6.0.0-1wm1 T264398 [16:10:45] (03PS4) 10Effie Mouzeli: hiera: install redis on shard16 [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643) [16:10:48] (03PS1) 10Ayounsi: Revert "Run Homer during the decom cookbook" [cookbooks] - 10https://gerrit.wikimedia.org/r/647288 [16:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:50] T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 [16:10:56] (03CR) 10Alexandros Kosiaris: [C: 03+1] _scaffold: Default to latest for monitoring.image_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/647262 (owner: 10JMeybohm) [16:11:09] (03PS1) 10Elukey: Add a second Hive Metastore on an-coord1002 [puppet] - 10https://gerrit.wikimedia.org/r/647273 (https://phabricator.wikimedia.org/T268028) [16:11:18] (03CR) 10jerkins-bot: [V: 04-1] hiera: install redis on shard16 [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli) [16:13:06] (03CR) 10Ayounsi: [C: 03+2] Revert "Run Homer during the decom cookbook" [cookbooks] - 10https://gerrit.wikimedia.org/r/647288 (owner: 10Ayounsi) [16:14:17] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27045/console" [puppet] - 10https://gerrit.wikimedia.org/r/647273 
(https://phabricator.wikimedia.org/T268028) (owner: 10Elukey) [16:15:18] (03Merged) 10jenkins-bot: Revert "Run Homer during the decom cookbook" [cookbooks] - 10https://gerrit.wikimedia.org/r/647288 (owner: 10Ayounsi) [16:17:02] (03CR) 10Jbond: hiera: install redis on shard16 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli) [16:17:13] (03CR) 10Jbond: redis: define redis version on buster for multidc (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli) [16:17:35] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) OK the amount of work needed to get 5.2.1 in a usable state really seems excessive. Let's give a try to 6.0.0, which is the version imme... [16:21:38] (03PS2) 10Elukey: Add a second Hive Metastore on an-coord1002 [puppet] - 10https://gerrit.wikimedia.org/r/647273 (https://phabricator.wikimedia.org/T268028) [16:24:25] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27046/console" [puppet] - 10https://gerrit.wikimedia.org/r/647273 (https://phabricator.wikimedia.org/T268028) (owner: 10Elukey) [16:27:35] (03PS1) 10MSantos: mobileapps: bump to 2020-12-09-093703-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/647275 [16:30:05] (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2020-12-09-093703-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/647275 (owner: 10MSantos) [16:31:24] (03Merged) 10jenkins-bot: mobileapps: bump to 2020-12-09-093703-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/647275 (owner: 10MSantos) [16:33:43] (03PS2) 10Filippo Giunchedi: WIP logstash: add ulogd ecs filter + tests [puppet] - 10https://gerrit.wikimedia.org/r/647265 
(https://phabricator.wikimedia.org/T234565) [16:33:45] (03CR) 10Filippo Giunchedi: WIP logstash: add ulogd ecs filter + tests (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/647265 (https://phabricator.wikimedia.org/T234565) (owner: 10Filippo Giunchedi) [16:34:16] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10Cmjohnson) @Dzahn Is it possible to move mw1281,82 and 83? I need this space for the an-workers on 10G. I can move them to A8. [16:35:42] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] kubeadm-k8s: use cached calico container images [puppet] - 10https://gerrit.wikimedia.org/r/647094 (https://phabricator.wikimedia.org/T269016) (owner: 10Bstorm) [16:35:59] (03PS2) 10Jbond: icinga: add support for downtimed and notifications_enabled parameters [software/spicerack] - 10https://gerrit.wikimedia.org/r/647245 (https://phabricator.wikimedia.org/T269672) [16:36:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Fix Cumin alias for cloudvirt-codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/646994 (owner: 10Muehlenhoff) [16:37:25] 10Operations, 10Patch-For-Review: Traceback in icinga-status 'Host' object has no attribute 'downtime' - https://phabricator.wikimedia.org/T269672 (10jbond) I have pushed a patch which should remove the invalid parameters from the json output until spicerack is patched. 
This should hopefully fix the cookbook [16:37:43] (03CR) 10CRusnov: [C: 03+2] modules/icinga/files/raid_handler.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/646890 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [16:40:47] (03PS1) 10CRusnov: Revert "modules/icinga/files/raid_handler.py: Port to Python 3" [puppet] - 10https://gerrit.wikimedia.org/r/647292 [16:41:28] (03CR) 10jerkins-bot: [V: 04-1] Revert "modules/icinga/files/raid_handler.py: Port to Python 3" [puppet] - 10https://gerrit.wikimedia.org/r/647292 (owner: 10CRusnov) [16:42:16] (03PS2) 10CRusnov: Revert "modules/icinga/files/raid_handler.py: Port to Python 3" [puppet] - 10https://gerrit.wikimedia.org/r/647292 [16:42:32] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:43:13] 10Operations, 10SRE-Access-Requests: Requesting access to releasers-wikibase for toan - https://phabricator.wikimedia.org/T269777 (10toan) [16:43:43] (03CR) 10CRusnov: [C: 03+2] Revert "modules/icinga/files/raid_handler.py: Port to Python 3" [puppet] - 10https://gerrit.wikimedia.org/r/647292 (owner: 10CRusnov) [16:44:29] (03PS1) 10Jbond: icinga::raid_handler: add support for ssacli [puppet] - 10https://gerrit.wikimedia.org/r/647281 (https://phabricator.wikimedia.org/T269563) [16:45:03] (03CR) 10CRusnov: ganeti-netbox-sync: Add post-sync PuppetDB import where necessary (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/645212 (https://phabricator.wikimedia.org/T263768) (owner: 10CRusnov) [16:45:27] (03CR) 10CRusnov: [C: 04-1] "Will make this change and port separately to 2.9 instead of making this change and having to change it completely when it ports to 2.9." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/645212 (https://phabricator.wikimedia.org/T263768) (owner: 10CRusnov) [16:45:49] 10Operations, 10SRE-Access-Requests: Requesting access to releasers-wikibase for toan - https://phabricator.wikimedia.org/T269777 (10toan) [16:48:14] (03PS1) 10Bstorm: wikireplicas and toolsdb: close connections when done with them [puppet] - 10https://gerrit.wikimedia.org/r/647285 (https://phabricator.wikimedia.org/T269620) [16:48:14] !log mbsantos@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [16:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:46] !log mbsantos@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [16:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:31] !log mbsantos@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[16:52:32] (03PS7) 10Cwhite: profile: add ecs pre and post filters to pipeline [puppet] - 10https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565) [16:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:53] (03PS1) 10Clarakosi: JobQueue: Move translation jobs to its own queue [deployment-charts] - 10https://gerrit.wikimedia.org/r/647307 (https://phabricator.wikimedia.org/T267520) [16:58:45] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for Kosta Harlan - https://phabricator.wikimedia.org/T269731 (10jbond) @kaldari you are listed as kostajh manager as such can you approve this access request @thcipriani are you able to approve adding kostajh to the `deployment:` group [16:59:00] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for Kosta Harlan - https://phabricator.wikimedia.org/T269731 (10jbond) [16:59:21] (03CR) 10Bstorm: [C: 03+2] kubeadm-k8s: use cached calico container images [puppet] - 10https://gerrit.wikimedia.org/r/647094 (https://phabricator.wikimedia.org/T269016) (owner: 10Bstorm) [16:59:33] (03CR) 10Ppchelko: [C: 03+1] "Ok, this seems reasonable." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/647307 (https://phabricator.wikimedia.org/T267520) (owner: 10Clarakosi) [16:59:43] (03PS7) 10Cwhite: Generate Logstash ECS cleanup filter as part of regular build process [software/ecs] - 10https://gerrit.wikimedia.org/r/644638 [17:00:07] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10elukey) I can definitely help on this @Dzahn, lemme know if you need a pair of extra hands :) [17:00:21] (03PS8) 10Cwhite: profile: add ecs pre and post filters to pipeline [puppet] - 10https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565) [17:01:20] (03PS8) 10Cwhite: Generate Logstash ECS cleanup filter as part of regular build process [software/ecs] - 10https://gerrit.wikimedia.org/r/644638 [17:01:39] (03CR) 10CDanis: [C: 03+1] "looks good, thanks!" (032 comments) [software/ecs] - 10https://gerrit.wikimedia.org/r/637719 (owner: 10Cwhite) [17:01:42] (03PS9) 10Cwhite: profile: add ecs pre and post filters to pipeline [puppet] - 10https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565) [17:02:34] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/646971 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:03:28] (03PS13) 10Mstyles: Add new helm chart for rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) [17:04:12] (03CR) 10Bstorm: "Next patch for this should probably be black formatting. The format is all over the place in this script." 
[puppet] - 10https://gerrit.wikimedia.org/r/647285 (https://phabricator.wikimedia.org/T269620) (owner: 10Bstorm) [17:09:35] 10Operations, 10Maps (Kartographer): Some PostgreSQL replicas are not fully updated - https://phabricator.wikimedia.org/T268927 (10hnowlan) maps1001 is now in sync and serving data consistent with the other nodes. [17:11:14] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:11:32] (03PS6) 10Effie Mouzeli: redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) [17:12:09] Is lists1001 lagging or something? [17:12:50] (03CR) 10Jcrespo: [C: 03+1] "I have no thoughts on this, as long as hp raid checks work/keep working I am ok with any change." [puppet] - 10https://gerrit.wikimedia.org/r/647281 (https://phabricator.wikimedia.org/T269563) (owner: 10Jbond) [17:13:01] (03CR) 10jerkins-bot: [V: 04-1] redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli) [17:14:51] (03PS7) 10Effie Mouzeli: redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) [17:15:20] (03CR) 10Effie Mouzeli: redis: define redis version on buster for multidc (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli) [17:15:47] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10Cmjohnson) @elukey Here is what is in racks now (not setup) 2 servers in A2 2 servers in A4 I requested @dzahn to move 3 mw servers to make room... 
[17:16:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10Dzahn) [17:17:42] (03PS5) 10Effie Mouzeli: hiera: install redis on shard16 [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643) [17:18:01] (03PS6) 10Effie Mouzeli: hiera: install redis on shard16 [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643) [17:21:58] 10Operations, 10Maps (Kartographer): Some PostgreSQL replicas are not fully updated - https://phabricator.wikimedia.org/T268927 (10MSantos) 05Open→03Resolved Thanks, @hnowlan! [17:22:04] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:23:02] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw128[1-3].eqiad.wmnet [17:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:16] (03PS8) 10Effie Mouzeli: redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) [17:24:26] !log depooling 3 API appservers in eqiad to physically move to another rack [17:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:56] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:28:11] (03PS9) 10Effie Mouzeli: redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) [17:30:34] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for Kosta Harlan - https://phabricator.wikimedia.org/T269731 (10marcella) @jbond I am Kosta's manager and I approve this request. 
Thank you! [17:36:12] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on es1023 - https://phabricator.wikimedia.org/T268796 (10Cmjohnson) 05Open→03Resolved The disk has been replaced and is rebuilding, please re-open if the problem persists [17:39:45] (03PS10) 10Effie Mouzeli: redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) [17:40:04] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): Degraded RAID on labstore1006 - https://phabricator.wikimedia.org/T268281 (10Cmjohnson) 05Open→03Resolved The disk has been replaced, I am not sure if you have it for auto rebuild. Please check and if the problem persists, re-open this task. [17:41:51] (03PS1) 10Jcrespo: alerting: Disable screen/tmux monitoring on orchestrator hosts [puppet] - 10https://gerrit.wikimedia.org/r/647319 (https://phabricator.wikimedia.org/T265990) [17:41:54] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268036 (10Cmjohnson) @fgiunchedi The bbu is on-site, please let me know when I can take this offline? I can do tomorrow 1500UTC [17:43:06] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (10Cmjohnson) @dcausse I am sorry no, I forgot to put a ticket in with them. I will do that today. Thanks [17:43:46] (03CR) 10Jcrespo: "I know this is not 100% productionized, but proposing a small addition, similar to the other roles to avoid alerts on hosts starting with " [puppet] - 10https://gerrit.wikimedia.org/r/647319 (https://phabricator.wikimedia.org/T265990) (owner: 10Jcrespo) [17:45:04] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on es1023 - https://phabricator.wikimedia.org/T268796 (10jcrespo) As usual, Chris, thank you for the quick response! 
[17:45:07] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Move labstore1005 to 10Gbps rack and ethernet - https://phabricator.wikimedia.org/T266199 (10Cmjohnson) @Bstorm Do we need to update both 1004 and 1005 to 10G at the same time? I can convert 1005 to 10G anytime. [17:46:31] (03PS11) 10Effie Mouzeli: redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) [17:49:59] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw128[1-3].eqiad.wmnet [17:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:41] (03PS1) 10Andrew Bogott: Fake keydata for cinder ceph client [labs/private] - 10https://gerrit.wikimedia.org/r/647321 (https://phabricator.wikimedia.org/T265965) [17:52:18] (03CR) 10Clarakosi: [C: 03+2] JobQueue: Move translation jobs to its own queue [deployment-charts] - 10https://gerrit.wikimedia.org/r/647307 (https://phabricator.wikimedia.org/T267520) (owner: 10Clarakosi) [17:54:01] (03Merged) 10jenkins-bot: JobQueue: Move translation jobs to its own queue [deployment-charts] - 10https://gerrit.wikimedia.org/r/647307 (https://phabricator.wikimedia.org/T267520) (owner: 10Clarakosi) [17:54:32] RECOVERY - Device not healthy -SMART- on labstore1006 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1006&var-datasource=eqiad+prometheus/ops [17:57:04] !log clarakosi@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [17:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:36] !log clarakosi@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . 
[17:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:58] !log clarakosi@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [18:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:46] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:01:48] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:53] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:01:54] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:01:58] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:02:00] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10ops-monitoring-bot) Icinga downtime for 4:00:00 set by dzahn@cumin1001 on 1 host(s) and their services with reason: move_to_other_rack ` mw1281.eqiad.wmnet ` [18:02:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10ops-monitoring-bot) Icinga downtime for 4:00:00 set by dzahn@cumin1001 on 1 host(s) and their services with reason: move_to_other_rack ` mw1282.eqiad.wmnet ` [18:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:08] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10ops-monitoring-bot) Icinga downtime for 4:00:00 set by dzahn@cumin1001 on 1 host(s) and their services with reason: move_to_other_rack ` mw1283.eqiad.wmnet ` [18:02:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:58] (03PS1) 10Andrew Bogott: Cinder: install ceph client keyring [puppet] - 10https://gerrit.wikimedia.org/r/647323 (https://phabricator.wikimedia.org/T269511) [18:03:02] (03PS3) 10Razzi: superset: add cache to staging superset [puppet] - 10https://gerrit.wikimedia.org/r/647106 (https://phabricator.wikimedia.org/T268784) [18:03:23] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Fake keydata for cinder ceph client [labs/private] - 10https://gerrit.wikimedia.org/r/647321 (https://phabricator.wikimedia.org/T265965) (owner: 10Andrew Bogott) [18:04:28] (03CR) 10jerkins-bot: [V: 04-1] Cinder: install ceph client keyring [puppet] - 10https://gerrit.wikimedia.org/r/647323 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [18:05:49] !log mw1281,mw1282,mw1283 shut down for T266164 [18:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:56] T266164: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 [18:06:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10Dzahn) @Cmjohnson Yes. I just depooled mw1281-1283, downtimed them and then shut them down physically. You can move them. [18:06:20] (03PS2) 10Andrew Bogott: Cinder: install ceph client keyring [puppet] - 10https://gerrit.wikimedia.org/r/647323 (https://phabricator.wikimedia.org/T269511) [18:08:20] (03CR) 1020after4: [C: 03+2] Article::view - remove the old subtitle from doOutputFromParserCache. 
[core] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647081 (https://phabricator.wikimedia.org/T269727) (owner: 10Ppchelko) [18:08:51] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Refactor calico deploy strategy - https://phabricator.wikimedia.org/T267653 (10JMeybohm) >>! In T267653#6679363, @akosiaris wrote: >>>! In T267653#6678721, @JMeybohm wrote: >> The new calico chart is merged, thanks @akosiari... [18:11:21] (03PS1) 10Dzahn: site: update comment about location of mw1281-mw1283 [puppet] - 10https://gerrit.wikimedia.org/r/647326 [18:11:37] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Move labstore1005 to 10Gbps rack and ethernet - https://phabricator.wikimedia.org/T266199 (10Bstorm) @Cmjohnson not really at the same time, no. If the 1Gb crossover cable works after converting the primary interface to 10Gb, the... [18:11:40] (03CR) 10jerkins-bot: [V: 04-1] site: update comment about location of mw1281-mw1283 [puppet] - 10https://gerrit.wikimedia.org/r/647326 (owner: 10Dzahn) [18:11:55] 10Operations, 10SRE-Access-Requests: Requesting access to releasers-wikibase for toan - https://phabricator.wikimedia.org/T269777 (10WMDE-leszek) [18:11:57] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:11:57] 10Operations, 10SRE-Access-Requests: Requesting access to releasers-wikibase for toan - https://phabricator.wikimedia.org/T269777 (10WMDE-leszek) I believe what @toan needs is "shell access" to production. For the time being would indeed be access to a subspace of releases.wikimedia.org, which is managed via `... 
[18:11:58] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:12:05] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Icinga downtime for 4:00:00 set by dzahn@cumin1001 on 1 host(s) and their services with... [18:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:48] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:13:59] 10Operations, 10SRE-Access-Requests: Requesting access to releasers-wikibase for toan - https://phabricator.wikimedia.org/T269777 (10Dzahn) This request depends on T268818 being resolved first. That is a request to add the group mentioned here. So far that doesn't exist. 
[18:15:13] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27054/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli) [18:15:41] (03CR) 10Jbond: [V: 03+1 C: 03+1] "LGTM a minor nit" [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli) [18:15:51] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2243.codfw.wmnet [18:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:13] !log depooling mw2243 (jobrunner) for reimaging (T245757) [18:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:20] T245757: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 [18:17:07] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2243.codfw.wmnet [18:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:18] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2... [18:19:26] PROBLEM - Host mw1282.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:21:16] PROBLEM - Host mw1281.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:21:16] PROBLEM - Host mw1283.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:21:19] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): update RAID controller firmware on labstore1006, 1007 - https://phabricator.wikimedia.org/T268285 (10Bstorm) @Jclark-ctr labstore1006 is currently out of the pool. 
Any time you want to update firmware, let me know and I can silence its alarms and shu... [18:22:55] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): update RAID controller firmware on labstore1006, 1007 - https://phabricator.wikimedia.org/T268285 (10Jclark-ctr) @bstorm thanks I can take care of this today around 4:30pm est [18:23:38] (03CR) 10JMeybohm: [C: 03+2] _scaffold: Default to latest for monitoring.image_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/647262 (owner: 10JMeybohm) [18:24:37] I am mildly surprised to get Icinga alerts for mgmt interfaces of hosts that I downtimed by cookbook. [18:24:43] (03PS3) 10Andrew Bogott: Cinder: install ceph client keyring [puppet] - 10https://gerrit.wikimedia.org/r/647323 (https://phabricator.wikimedia.org/T269511) [18:24:45] (03PS1) 10Andrew Bogott: Cinder: use the deployment-wide libvirt_rbd_uuid for cinder [puppet] - 10https://gerrit.wikimedia.org/r/647329 (https://phabricator.wikimedia.org/T269511) [18:24:57] Didn't it automatically include the mgmt hosts? maybe not [18:24:58] (03Merged) 10jenkins-bot: _scaffold: Default to latest for monitoring.image_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/647262 (owner: 10JMeybohm) [18:25:31] but fine with me.. then I see when they come back in the new rack [18:26:28] (03CR) 10jerkins-bot: [V: 04-1] Cinder: use the deployment-wide libvirt_rbd_uuid for cinder [puppet] - 10https://gerrit.wikimedia.org/r/647329 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [18:27:09] wikitech is being really slow for me - is it just me? 
[18:28:02] (03CR) 10Dzahn: [C: 03+2] add 20.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/643386 (https://phabricator.wikimedia.org/T264367) (owner: 10Dzahn) [18:28:05] (03PS3) 10Dzahn: add 20.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/643386 (https://phabricator.wikimedia.org/T264367) [18:28:27] (03PS2) 10Andrew Bogott: Cinder: use the deployment-wide libvirt_rbd_uuid for cinder [puppet] - 10https://gerrit.wikimedia.org/r/647329 (https://phabricator.wikimedia.org/T269511) [18:29:14] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): update RAID controller firmware on labstore1006, 1007 - https://phabricator.wikimedia.org/T268285 (10Bstorm) That works for me. I'll silence things. Just ping me on IRC when you need a shutdown. [18:31:46] (03PS4) 10Razzi: superset: add cache to staging superset [puppet] - 10https://gerrit.wikimedia.org/r/647106 (https://phabricator.wikimedia.org/T268784) [18:32:03] (03PS3) 10Andrew Bogott: Cinder: use the deployment-wide libvirt_rbd_uuid for cinder [puppet] - 10https://gerrit.wikimedia.org/r/647329 (https://phabricator.wikimedia.org/T269511) [18:34:19] (03CR) 10Andrew Bogott: [C: 03+2] Cinder: install ceph client keyring [puppet] - 10https://gerrit.wikimedia.org/r/647323 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [18:34:33] (03CR) 10Andrew Bogott: [C: 03+2] Cinder: use the deployment-wide libvirt_rbd_uuid for cinder [puppet] - 10https://gerrit.wikimedia.org/r/647329 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [18:36:45] (03CR) 10Dzahn: "https://20.wikipedia.org" [dns] - 10https://gerrit.wikimedia.org/r/643386 (https://phabricator.wikimedia.org/T264367) (owner: 10Dzahn) [18:36:47] (03CR) 10Razzi: [C: 03+2] superset: add cache to staging superset [puppet] - 10https://gerrit.wikimedia.org/r/647106 (https://phabricator.wikimedia.org/T268784) (owner: 10Razzi) [18:36:48] RECOVERY - Host mw1282.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [18:37:35] 
(03PS1) 10Andrew Bogott: Cinder: no need to restart apache2 when config changes [puppet] - 10https://gerrit.wikimedia.org/r/647330 (https://phabricator.wikimedia.org/T269511) [18:38:00] 10Operations, 10Domains, 10Traffic: URL to redirect to upcoming Wikipedia Birthday page on wikimediafoundation.org - https://phabricator.wikimedia.org/T264367 (10Dzahn) @hdothiduc @Varnent Done! Added to DNS and https://20.wikipedia.org works now for me. There could be a little delay depending on caches an... [18:38:24] RECOVERY - Host mw1283.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.44 ms [18:38:36] RECOVERY - Host mw1281.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.74 ms [18:39:00] (03Merged) 10jenkins-bot: Article::view - remove the old subtitle from doOutputFromParserCache. [core] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647081 (https://phabricator.wikimedia.org/T269727) (owner: 10Ppchelko) [18:40:46] (03CR) 10Andrew Bogott: [C: 03+2] Cinder: no need to restart apache2 when config changes [puppet] - 10https://gerrit.wikimedia.org/r/647330 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [18:40:58] PROBLEM - MariaDB Replica SQL: analytics-meta-replica on an-coord1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error executing row event: Table superset_staging.ab_user doesnt exist https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:41:34] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [18:41:36] 10Operations, 10Domains, 10Traffic: URL to redirect to upcoming Wikipedia Birthday page on wikimediafoundation.org - https://phabricator.wikimedia.org/T264367 (10Dzahn) If you can confirm things are working for you then it's up to you if we close this ticket now or after the actual birthday page has been cre... 
[18:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:24] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:39] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:56] mutante: running the scripts now for the mw move...thanks for doing that so quickly [18:46:24] cmjohnson1: alright! yep, np. these weren't proxies so that means less work to remove them [18:48:21] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:29] mutante: the mgmt checks are another host for Icinga, and given they are in the OOB network should not be affected by normal operations hence are not covered by the cookbook [18:48:42] but we can surely consider adding an option to downtime the mgmt too [18:49:01] I'm not sure the script on icinga that the cookbook runs supports it, but can surely be added there too [18:49:13] if you think that's a valid use case feel free to open a task [18:52:34] volans: ACK, they are different hosts. Yea, I am on the fence about it. I don't want to cause alerts during downtime but for things like these physical moves it's also a feature to see when mgmt comes back online. [18:53:29] agree [18:54:15] (03PS1) 10Andrew Bogott: Cinder: fix keystone auth for the cinder service user [puppet] - 10https://gerrit.wikimedia.org/r/647335 (https://phabricator.wikimedia.org/T269511) [18:54:32] hmm, leave it as it is. 
If I change my mind I will make that ticket :) [18:54:42] works for me, thanks :) [18:54:49] thanks as well [18:56:12] (03CR) 10Andrew Bogott: [C: 03+2] Cinder: fix keystone auth for the cinder service user [puppet] - 10https://gerrit.wikimedia.org/r/647335 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [18:59:34] 10Operations, 10Mail: Bounces when sending mail to aliases of a specific WMF email address - https://phabricator.wikimedia.org/T269725 (10JGulingan) Hi all, I sent a test email to legoktm@ and kmehta-ctr@ today and did not receive a bounce back email. Thanks for your help! Best, Jo [18:59:44] !log testreduce1001 - installed make [18:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:05] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201209T1900). [19:00:05] No GERRIT patches in the queue for this window AFAICS. [19:00:05] twentyafterfour and marxarelli: Dear deployers, time to do the Train log triage with CPT deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201209T1900). [19:01:06] 10Operations, 10Parsoid, 10Parsoid-Tests, 10serviceops, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) meanwhile there is no more /srv/parsoid on testreduce1001 but /srv/parsoid-testing instead. I tried an "npm...
[19:01:35] PROBLEM - SSH on an-presto1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:10:31] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:10:34] (03CR) 10JMeybohm: calico: Add support for calico 3.x with kubernetes datastore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645417 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [19:12:24] (03PS2) 10TK-999: GeoDNS: Remove old hack for Wikia RES datacenter [dns] - 10https://gerrit.wikimedia.org/r/647253 [19:13:31] (03PS2) 10Urbanecm: Enable ArticlePlaceholder at papwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646831 (https://phabricator.wikimedia.org/T223693) [19:13:35] (03CR) 10Urbanecm: [C: 03+2] Enable ArticlePlaceholder at papwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646831 (https://phabricator.wikimedia.org/T223693) (owner: 10Urbanecm) [19:14:26] (03Merged) 10jenkins-bot: Enable ArticlePlaceholder at papwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646831 (https://phabricator.wikimedia.org/T223693) (owner: 10Urbanecm) [19:16:44] 10Operations, 10Mail: Bounces when sending mail to aliases of a specific WMF email address: 550 Previous (cached) callout verification failure - https://phabricator.wikimedia.org/T269725 (10Aklapper) [19:17:23] !log twentyafterfour@deploy1001 Synchronized php-1.36.0-wmf.21/includes/page/Article.php: deploy 0d99fe6d54 Article::view - remove the old subtitle from doOutputFromParserCache. 
Bug: T269727 (duration: 01m 04s) [19:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:27] T269727: Old revision warning box is added twice on page view if old rev served from cache - https://phabricator.wikimedia.org/T269727 [19:25:17] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2243.codfw.wmnet'] ` and were **ALL** successful. [19:26:51] PROBLEM - Check systemd state on an-presto1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:28:01] RECOVERY - SSH on an-presto1001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:33:07] RECOVERY - Check systemd state on an-presto1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:38:42] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: ce01bbe7b05eda8065fc57c865a69370e8aae797: Enable ArticlePlaceholder at papwiki (T223693) (duration: 01m 02s) [19:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:48] T223693: Deploy article placeholder on the pap.wikipedia (papiamentu) - https://phabricator.wikimedia.org/T223693 [19:38:54] (03PS1) 10Andrew Bogott: Cinder: more config fixes [puppet] - 10https://gerrit.wikimedia.org/r/647344 (https://phabricator.wikimedia.org/T269511) [19:38:56] (03PS1) 10Andrew Bogott: Cinder: include cinder-volume service on control nodes [puppet] - 10https://gerrit.wikimedia.org/r/647345 (https://phabricator.wikimedia.org/T269511) [19:42:22] (03PS2) 10Andrew Bogott: Cinder: include cinder-volume service on control nodes [puppet] - 
10https://gerrit.wikimedia.org/r/647345 (https://phabricator.wikimedia.org/T269511) [19:42:26] mutante: mw1281-1283 are back [19:43:12] (03CR) 10Andrew Bogott: [C: 03+2] Cinder: more config fixes [puppet] - 10https://gerrit.wikimedia.org/r/647344 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [19:43:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10Cmjohnson) [19:43:59] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10Cmjohnson) @dzahn completed the move and mw1281-83 are up [19:44:20] (03CR) 10Andrew Bogott: [C: 03+2] Cinder: include cinder-volume service on control nodes [puppet] - 10https://gerrit.wikimedia.org/r/647345 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott) [19:44:54] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10Cmjohnson) @dzahn and I were able to move mw1281-1283 and we now have 6 servers total in row A. [19:46:06] cmjohnson1: great! thanks for doing it swiftly. will get them back in prod [20:00:04] twentyafterfour and marxarelli: How many deployers does it take to do Mediawiki train - American Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201209T2000). [20:02:38] twentyafterfour: heyo o/ [20:04:34] (03PS6) 10Cwhite: First attempt at a JSONSchema template generator utility. 
[software/ecs] - 10https://gerrit.wikimedia.org/r/637719 [20:05:11] marxarelli: hello [20:06:27] I deployed the patch for T269727 [20:06:31] T269727: Old revision warning box is added twice on page view if old rev served from cache - https://phabricator.wikimedia.org/T269727 [20:07:35] everything looks good to go ahead with group1 [20:10:19] (03PS1) 1020after4: group1 wikis to 1.36.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647350 [20:10:21] (03CR) 1020after4: [C: 03+2] group1 wikis to 1.36.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647350 (owner: 1020after4) [20:11:35] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647350 (owner: 1020after4) [20:12:51] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.21 [20:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:54] !log twentyafterfour@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.21 (duration: 01m 02s) [20:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:31] (03PS1) 10Milimetric: refine: blacklist WikibasePingback [puppet] - 10https://gerrit.wikimedia.org/r/647351 [20:16:30] (03CR) 10Milimetric: "Thanks in advance for merging this, it can be reverted once the schema is fixed." [puppet] - 10https://gerrit.wikimedia.org/r/647351 (owner: 10Milimetric) [20:17:09] 10Operations, 10Parsoid, 10Parsoid-Tests, 10serviceops, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) 05Open→03Resolved [20:18:09] !log wmf.21 looks good on group1 wikis. Still seeing T269603 but not at an increased rate.
(refs T264801) [20:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:14] T264801: 1.36.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T264801 [20:18:14] T269603: InvalidArgumentException when requesting Special:EntityData with Sense or Form ID (The given ID does not refer to an entity of type lexeme) - https://phabricator.wikimedia.org/T269603 [20:19:43] (03PS1) 10Dzahn: parsoid/testreduce: ensure make is installed on testreduce host [puppet] - 10https://gerrit.wikimedia.org/r/647352 (https://phabricator.wikimedia.org/T257906) [20:20:17] (03CR) 10jerkins-bot: [V: 04-1] parsoid/testreduce: ensure make is installed on testreduce host [puppet] - 10https://gerrit.wikimedia.org/r/647352 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [20:21:56] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1281.eqiad.wmnet [20:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:06] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1282.eqiad.wmnet [20:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:12] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1283.eqiad.wmnet [20:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:23] (03CR) 10Razzi: [C: 03+2] refine: blacklist WikibasePingback [puppet] - 10https://gerrit.wikimedia.org/r/647351 (owner: 10Milimetric) [20:26:07] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw128[1-3].eqiad.wmnet [20:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:17] (03PS1) 10Ahmon Dancy: Enable $wgShowHostnames [deployment-charts] - 10https://gerrit.wikimedia.org/r/647353 [20:26:19] (03PS1) 10Ahmon Dancy: Update Chart.yaml source references [deployment-charts] - 10https://gerrit.wikimedia.org/r/647354 [20:26:21] (03PS1) 10Ahmon Dancy: Add 
ENABLE_DEBUG_LOGGING setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/647355 [20:26:25] !log repooling mw1281,mw1282,mw1283 - now in rack A8 [20:26:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:16] !log mw1281,mw1282,mw1283 - scap pull [20:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:20] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10Dzahn) @Cmjohnson Thank you. Repooled and receiving traffic again. Monitoring looks good. [20:31:07] (03PS2) 10Dzahn: site: update comment about location of mw1281-mw1283 [puppet] - 10https://gerrit.wikimedia.org/r/647326 [20:31:31] (03PS3) 10Dzahn: site: update comment about location of mw1281-mw1283 [puppet] - 10https://gerrit.wikimedia.org/r/647326 (https://phabricator.wikimedia.org/T266164) [20:31:42] (03CR) 10Dzahn: [C: 03+2] site: update comment about location of mw1281-mw1283 [puppet] - 10https://gerrit.wikimedia.org/r/647326 (https://phabricator.wikimedia.org/T266164) (owner: 10Dzahn) [20:31:48] (03CR) 10Cwhite: [C: 03+2] First attempt at a JSONSchema template generator utility. [software/ecs] - 10https://gerrit.wikimedia.org/r/637719 (owner: 10Cwhite) [20:32:08] (03CR) 10Cwhite: [V: 03+2 C: 03+2] First attempt at a JSONSchema template generator utility. 
(031 comment) [software/ecs] - 10https://gerrit.wikimedia.org/r/637719 (owner: 10Cwhite) [20:33:09] (03PS2) 10Dzahn: parsoid/testreduce: ensure make is installed on testreduce host [puppet] - 10https://gerrit.wikimedia.org/r/647352 (https://phabricator.wikimedia.org/T257906) [20:34:46] (03CR) 10jerkins-bot: [V: 04-1] parsoid/testreduce: ensure make is installed on testreduce host [puppet] - 10https://gerrit.wikimedia.org/r/647352 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [20:34:55] RECOVERY - MariaDB Replica SQL: analytics-meta-replica on an-coord1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:35:11] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/647265 (https://phabricator.wikimedia.org/T234565) (owner: 10Filippo Giunchedi) [20:35:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10Dzahn) @Cmjohnson Should this stay open for mw1313-mw1316 or did we solve the issue by moving other servers now? 
[20:36:44] (03CR) 10Dzahn: [C: 03+2] "was missing during npm test-install https://puppet-compiler.wmflabs.org/compiler1001/27061/" [puppet] - 10https://gerrit.wikimedia.org/r/647352 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [20:38:22] (03PS3) 10Dzahn: parsoid/testreduce: ensure make is installed on testreduce host [puppet] - 10https://gerrit.wikimedia.org/r/647352 (https://phabricator.wikimedia.org/T257906) [20:38:39] (03PS1) 10Elukey: analytics-meta: Avoid replication of superset_staging db when running as replica [puppet] - 10https://gerrit.wikimedia.org/r/647358 [20:40:19] (03CR) 10Dzahn: [C: 03+2] parsoid/testreduce: ensure make is installed on testreduce host [puppet] - 10https://gerrit.wikimedia.org/r/647352 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [20:43:01] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:44:06] (03CR) 10Ryan Kemper: "I think gehel is right that a timestamp should be a gauge and not a counter. Reading https://www.robustperception.io/are-increasing-timest" [puppet] - 10https://gerrit.wikimedia.org/r/646888 (https://phabricator.wikimedia.org/T269204) (owner: 10Ryan Kemper) [20:44:52] (03CR) 10Dzahn: ntp: replace hiera() with lookup(), move use_chrony to parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [20:47:51] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:54:27] ^ that's https://phabricator.wikimedia.org/T269693 [20:55:55] (03CR) 10Volans: [C: 03+1] "LGTM, couple of nits inline" (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/647245 (https://phabricator.wikimedia.org/T269672) (owner: 10Jbond) [20:56:37] rzl: thanks, yea, with every cron we convert to timer we now get more "systemd state" alerts now due to "former cron that we never noticed had issues" on other non mwmaint* hosts as well now. I am not sure yet if I think that's fine as it is or if there should be email for failed timers and not generic "systemd alerts" or both [20:57:17] yeah agreed -- it's good that we have monitoring for these now, but it also means we have to actually track it down when we fail :P [20:57:20] *when they fail [20:57:23] when we do too, I suppose [20:58:13] in particular I think it means telling the difference between "transient failure, next run succeeded, alert resolves on its own" and cases like this where the job is consistently broken [21:00:05] chrisalbon and accraze: Your horoscope predicts another unfortunate Services – Graphoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201209T2100). [21:00:51] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:01:07] (03CR) 10Volans: "A couple of questions/comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/647281 (https://phabricator.wikimedia.org/T269563) (owner: 10Jbond) [21:09:02] rzl: Yes, one strategy can be just "systemctl reset-failed" it once and then see if it comes back [21:09:46] a lot of them just clear up from transient failure in the past, some do not [21:10:04] sure, although in this case journalctl shows a ton of failures in a row -- no need to even try reset-failed, it definitely won't help [21:11:40] true [21:11:54] partly I'm trying to figure out if there's some way to automate some of this -- having to ssh into the machine and investigate every time that alert fires is a drag [21:12:12] if the alert could even just include the unit name, that would be a big help [21:12:22] but unit name plus the number of consecutive failures, that would really get us somewhere [21:12:25] that would be a reason to want a notification that already includes the job name [21:12:31] (consecutive failures when it's a timer, I mean) [21:12:31] i just don't like email as a protocol :p [21:12:46] email is definitely the wrong way to go, yes [21:13:10] or we make an icinga check specific to timers [21:13:16] that runs list-timers and parses it [21:13:32] and that tells us right away which one it is ..here on IRC [21:13:42] mm, sure -- if we can also exclude timers from the "check systemd state" that would be a big step [21:13:46] or list-units --state=failed [21:13:51] and the top one [21:14:12] which would be more than timers [21:14:18] but to include the name of the failed service [21:14:26] yeah [21:14:41] in that case I think we'd want to include all of them, not just the top -- otherwise if a second unit fails, we'd never notice [21:14:51] but either way that should be doable [21:15:32] the multi-line output on IRC would be spammy and if we just alert for the first and you fix 
it.. you notice you get a new one :p [21:15:39] yea [21:17:40] (03CR) 10Herron: [C: 03+1] WIP logstash: add ulogd ecs filter + tests [puppet] - 10https://gerrit.wikimedia.org/r/647265 (https://phabricator.wikimedia.org/T234565) (owner: 10Filippo Giunchedi) [21:17:42] (03CR) 10Dzahn: [V: 03+1] "This IS actually used in production. Here it is compiled on dns3001:" [puppet] - 10https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [21:19:03] (03CR) 10Dzahn: [V: 03+1] "I don't know why I thought role::dnsbox wasn't used in prod, I think there was a compiler failure or pebkac where it wouldn't find matchin" [puppet] - 10https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [21:22:30] rzl: regarding the actual content of the ticket, that failed WDQS job, I saw that same thing but since it started during the reimaging work and it was about failing to fetch some lag data from them.. I definitely expected that would be one-time and go away by itself once the reimaging is over. seeing that is not the case..is surprising [21:23:33] 10Operations, 10MW-on-K8s, 10TechCom-RFC, 10serviceops, and 2 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10Krinkle) The service is mainly for executing shell commands. Right now that happens through `wfShellExec`. Typically to invoke program... 
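[editor's note] The idea floated above, a check that runs `list-units --state=failed` and names every failed unit on one line (so IRC output stays short but a second failure is not hidden), could look roughly like the sketch below. It is only an illustration, not the actual Icinga check: the function names and the message format are made up here, and it assumes the plain `systemctl list-units --state=failed --no-legend --plain` column layout where the unit name is the first field.

```python
import subprocess
from typing import List, Tuple

# Conventional Nagios/Icinga plugin exit codes.
OK, CRITICAL = 0, 2


def parse_failed_units(output: str) -> List[str]:
    """Return the unit name (first column) of every non-empty line."""
    units = []
    for line in output.splitlines():
        fields = line.split()
        if fields:
            units.append(fields[0])
    return units


def build_status(units: List[str]) -> Tuple[int, str]:
    """Build a single-line status naming all failed units, not just the first."""
    if not units:
        return OK, "OK - no failed units"
    names = ", ".join(sorted(units))
    return CRITICAL, "CRITICAL - {} failed unit(s): {}".format(len(units), names)


def check() -> Tuple[int, str]:
    """Hypothetical entry point: ask systemd for failed units and summarize."""
    result = subprocess.run(
        ["systemctl", "list-units", "--state=failed", "--no-legend", "--plain"],
        capture_output=True,
        text=True,
        check=False,
    )
    return build_status(parse_failed_units(result.stdout))
```

A wrapper would print the message from `check()` and exit with the returned code. Counting consecutive failures per timer, as also suggested above, would need extra state (for example from `journalctl`) and is left out of this sketch.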
[21:23:35] yeah agree -- I haven't starting digging into the job itself, I was hoping I could get someone more familiar to take it over [21:24:39] 10Operations, 10MW-on-K8s, 10serviceops: Sandbox/limit child processes within a container runtime - https://phabricator.wikimedia.org/T252745 (10Krinkle) [21:24:42] 10Operations, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, and 2 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10Krinkle) 05Open→03Resolved [21:24:56] 10Operations, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, and 2 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10Krinkle) >>! In T260330#6632099, @Krinkle wrote: > Put on Last Call until 2 December. This RFC has been approved and is now closed. [21:25:14] I did debug a little bit but I lost it in IRC backlog.. hmm [21:25:34] oh, of course we have public logs.. will paste it on ticket [21:26:03] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10Jclark-ctr) Replaced more Dimms per HP no errors at this [21:26:32] it tries to get "lag data" from prometheus and that is what fails [21:26:51] and it runs every minute [21:28:20] 10Operations, 10Wikidata, 10Wikidata-Query-Service: mediawiki_job_wikidata-updateQueryServiceLag failing - https://phabricator.wikimedia.org/T269693 (10Dzahn) ` [20:17:19] ryankemper: I guess it makes sense that "job_wikidata-updateQueryServiceLag" could not run during current work [20:19:46] ^ I'll take a look at the `job_wikidata-updateQueryServiceLag` [21:31:42] There's some context in the description of https://phabricator.wikimedia.org/T269204 that mentions that `blazegraph_lastupdated` is now `blazegraph_lastupdated_total`, so if the job has to do with that metric then the re-image is likely the source of the problem [21:33:24] 10Operations, 10Wikidata, 
10Wikidata-Query-Service: mediawiki_job_wikidata-updateQueryServiceLag failing - https://phabricator.wikimedia.org/T269693 (10RKemper) There's some context in the description of https://phabricator.wikimedia.org/T269204 that mentions that the counter metric `blazegraph_lastupdated`... [21:35:01] 10Operations, 10Wikidata, 10Wikidata-Query-Service: mediawiki_job_wikidata-updateQueryServiceLag failing - https://phabricator.wikimedia.org/T269693 (10Dzahn) ` ExecStart=/usr/local/bin/mw-cli-wrapper /usr/local/bin/mwscript extensions/Wikidata.org/maintenance/updateQueryServiceLag.php --wiki wikidatawiki --... [21:35:23] ryankemper: thank you! [21:37:20] 10Operations, 10Wikidata, 10Wikidata-Query-Service: mediawiki_job_wikidata-updateQueryServiceLag failing - https://phabricator.wikimedia.org/T269693 (10Dzahn) I ran the command manually as the same user, www-data. The error is simply "Failed to get lag from prometheus". ` @mwmaint1002:~# sudo -u www-data /u... [21:37:23] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10Jclark-ctr) 05Open→03Resolved [21:39:08] ryankemper: cool! so I ran that command manually but the error is simply "failed to get from prometheus" [21:39:20] but sounds like that can match what you said, ack [21:39:22] 10Operations, 10Wikidata, 10Wikidata-Query-Service: mediawiki_job_wikidata-updateQueryServiceLag failing - https://phabricator.wikimedia.org/T269693 (10RKemper) Job lives here: https://github.com/wikimedia/mediawiki-extensions-Wikidata.org/blob/60c5f96ebf424b792077bb7c6b533a68702e7aea/maintenance/updateQuery... 
[21:40:42] !log shutting down labstore1006 for maintenance T268285 [21:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:46] T268285: update RAID controller firmware on labstore1006, 1007 - https://phabricator.wikimedia.org/T268285 [21:46:15] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/647364 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [21:46:51] (03CR) 10Ryan Kemper: "So I think the main "downstream" consumer of this metric we need to worry about is `mediawiki_job_wikidata-updateQueryServiceLag`: https:/" [puppet] - 10https://gerrit.wikimedia.org/r/646888 (https://phabricator.wikimedia.org/T269204) (owner: 10Ryan Kemper) [21:47:33] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:47:47] mutante: makes sense, since that error means that you hit this codepath: https://github.com/wikimedia/mediawiki-extensions-Wikidata.org/blob/60c5f96ebf424b792077bb7c6b533a68702e7aea/maintenance/updateQueryServiceLag.php#L84-L90 [21:48:28] ryankemper: hah! 
yea, that does it:) [21:48:42] (just thinking out loud here) so this metric should probably be a gauge and not a count anyway, since counters are for metrics where you don't care about the absolute value but rather just the incrementing over time [21:49:27] the reason it's breaking now is because https://github.com/prometheus/client_python/commit/a4dd93bcc6a0422e10cfa585048d1813909c6786 forces counters to now be suffixed with `_total`, so by switching to a gauge we shouldn't need to rename the metric, but I need to make sure that it being a gauge will play nicely with the job [21:50:43] I don't fully understand the logic of this `getLag` here: https://github.com/wikimedia/mediawiki-extensions-Wikidata.org/blob/60c5f96ebf424b792077bb7c6b533a68702e7aea/src/QueryServiceLag/WikimediaPrometheusQueryServiceLagProvider.php#L65 specifically I don't quite get why it's doing a `count` on the response from `getLags` (the other logic of `floor` etc makes sense to me) [21:50:44] *nod* this part is maybe good for -observability channel as well [21:51:14] that's a good idea! I'll transport the above context over there [21:54:09] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:54:58] And I reimaged this server and all seemed to go great but it is still stretch because I forgot the DHCP change.. starting over. [21:56:32] (03PS1) 10Dzahn: DHCP: switch mw2243 to buster installer [puppet] - 10https://gerrit.wikimedia.org/r/647367 (https://phabricator.wikimedia.org/T245757) [21:56:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:57:14] (03CR) 10Dzahn: [C: 03+2] DHCP: switch mw2243 to buster installer [puppet] - 10https://gerrit.wikimedia.org/r/647367 (https://phabricator.wikimedia.org/T245757) (owner: 10Dzahn) [21:59:37] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:00:05] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/647364 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [22:00:37] (03CR) 10CRusnov: [C: 03+2] icinga/raid.pp: Add Python3 requirements for raid_handler.py [puppet] - 10https://gerrit.wikimedia.org/r/647364 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [22:01:55] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2... 
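[editor's note] The counter-versus-gauge point above can be illustrated with a toy model. This is not prometheus client code and not the updateQueryServiceLag job; `exposed_name`, `lookup`, and `lag_seconds` are hypothetical helpers. It assumes, as the discussion suggests but does not confirm, that the consumer asks for the old name `blazegraph_lastupdated`, and it uses a made-up timestamp.

```python
from typing import Dict, Optional


def exposed_name(name: str, metric_type: str) -> str:
    """Toy model of the naming rule: counters get a `_total` suffix, gauges do not."""
    if metric_type == "counter" and not name.endswith("_total"):
        return name + "_total"
    return name


def lookup(samples: Dict[str, float], wanted: str) -> Optional[float]:
    """Return the sample value for `wanted`, or None if no such series exists."""
    return samples.get(wanted)


def lag_seconds(now: float, last_updated: float) -> float:
    """Lag is the age of the last-updated timestamp, never negative."""
    return max(0.0, now - last_updated)


# Exposed as a counter, the series is renamed and the old name finds nothing.
as_counter = {exposed_name("blazegraph_lastupdated", "counter"): 1607547600.0}
missing = lookup(as_counter, "blazegraph_lastupdated")

# Exposed as a gauge, the name is unchanged and the lag can be computed.
as_gauge = {exposed_name("blazegraph_lastupdated", "gauge"): 1607547600.0}
found = lookup(as_gauge, "blazegraph_lastupdated")
```

In this model `missing` is None, which corresponds to the "Failed to get lag from prometheus" case, while `found` can be passed to `lag_seconds`. A gauge also fits the semantics better, since a timestamp is an absolute value rather than something that only increments.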
[22:02:48] (03PS3) 10Dzahn: query_service/updater: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/645203 (https://phabricator.wikimedia.org/T209953) [22:03:57] ryankemper: If you don't mind I will now merge a change to WDQS puppet code - that will be noop and I compiled and has +1s [22:04:07] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:04:14] no change to the actual service, just puppet code [22:04:14] mutante: go ahead [22:04:19] ack, thx [22:04:25] PROBLEM - MariaDB Replica IO: s1 on db1139 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:04:25] PROBLEM - mysqld processes on db1139 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [22:04:35] (03CR) 10Dzahn: [C: 03+2] query_service/updater: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/645203 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [22:05:03] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:05:38] still confirming manually on one of the hosts each time [22:06:03] yea, nothing changed on wdqs1003 [22:06:21] (03CR) 10Dzahn: "noop on wdqs1003 as expected" [puppet] - 10https://gerrit.wikimedia.org/r/645203 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [22:06:47] PROBLEM - MariaDB Replica IO: s6 on db1139 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:09:21] PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:10:43] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:12:37] (03CR) 10Dzahn: "hmm.. it's kind of convincing to just drop it if it's truly optional. Just based on the comments it was shown in error messages. It could " [puppet] - 10https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) (owner: 10Reedy) [22:12:39] PROBLEM - MariaDB Replica SQL: s1 on db1139 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:12:47] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:13:48] (03CR) 10Dzahn: [C: 03+2] gerrit: open link in new window [puppet] - 10https://gerrit.wikimedia.org/r/631237 (owner: 10Paladox) [22:15:03] PROBLEM - MariaDB Replica SQL: s6 on db1139 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:17:56] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): update RAID controller firmware on labstore1006, 1007 - https://phabricator.wikimedia.org/T268285 (10Bstorm) The box has very recent firmware already, apparently. 
😦 [22:18:05] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:18:27] PROBLEM - MariaDB Replica Lag: s6 on db1139 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:21:41] PROBLEM - MariaDB Replica Lag: s1 on db1139 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:22:09] PROBLEM - MariaDB read only s1 on db1139 is CRITICAL: Could not connect to localhost:3311 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [22:24:31] PROBLEM - MariaDB read only s6 on db1139 is CRITICAL: Could not connect to localhost:3316 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [22:25:00] (03CR) 10CRusnov: "I have tested the critical path where the test is expected to get data from zlib (and subprocess), where encoding issues would occur and i" [puppet] - 10https://gerrit.wikimedia.org/r/647369 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [22:25:55] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:28:15] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:30:07] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [22:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:09] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:07] (03CR) 10Jeena Huneidi: "I left a comment on the service which I think is 
more of a question for SRE on whether we should be exposing the non-webui ports as a Node" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: 10Mstyles) [22:41:20] (03CR) 10Paladox: [C: 03+1] gerrit: disable autogc when receiving packs [puppet] - 10https://gerrit.wikimedia.org/r/647191 (owner: 10Hashar) [22:44:16] (03CR) 10Dzahn: [C: 03+2] gerrit: disable autogc when receiving packs [puppet] - 10https://gerrit.wikimedia.org/r/647191 (owner: 10Hashar) [22:45:46] (03PS3) 10Razzi: Add kafka-test1007 virtual machine [puppet] - 10https://gerrit.wikimedia.org/r/647109 (https://phabricator.wikimedia.org/T268202) [22:45:57] (03CR) 10Jeena Huneidi: "I think this deserves an increment to the chart minor version in Chart.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/647355 (owner: 10Ahmon Dancy) [22:46:59] (03CR) 10Jeena Huneidi: "I think this deserves an increment to the chart patch version in Chart.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/647353 (owner: 10Ahmon Dancy) [22:47:14] (03CR) 10Dzahn: [C: 03+2] Add .webm in files.viewable-mime-types of Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/569627 (https://phabricator.wikimedia.org/T215360) (owner: 10Zoranzoki21) [22:48:59] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:50:01] PROBLEM - very high load average likely xfs on ms-be2018 is CRITICAL: CRITICAL - load average: 107.08, 101.42, 98.10 https://wikitech.wikimedia.org/wiki/Swift [22:50:14] (03PS2) 10Ahmon Dancy: Enable $wgShowHostnames [deployment-charts] - 10https://gerrit.wikimedia.org/r/647353 [22:50:17] (03PS2) 10Ahmon Dancy: Update Chart.yaml source references [deployment-charts] - 10https://gerrit.wikimedia.org/r/647354 [22:50:19] (03PS2) 10Ahmon Dancy: 0.0.8: Add ENABLE_DEBUG_LOGGING setting [deployment-charts] - 
10https://gerrit.wikimedia.org/r/647355 [22:56:40] (03PS3) 10Ahmon Dancy: 0.1.0: Add ENABLE_DEBUG_LOGGING setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/647355 [22:58:01] (03PS5) 10Dzahn: Remove apache config for zero.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716) (owner: 10MaxSem) [22:59:20] (03PS6) 10Dzahn: Remove apache config for zero.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716) (owner: 10MaxSem) [23:00:58] (03CR) 10Jeena Huneidi: [C: 03+2] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/647354 (owner: 10Ahmon Dancy) [23:01:01] (03PS1) 10Dzahn: remove zero.wikimedia.org from beta sites [puppet] - 10https://gerrit.wikimedia.org/r/647379 (https://phabricator.wikimedia.org/T187716) [23:01:54] (03PS2) 10Dzahn: remove zero.wikimedia.beta.wmflabs.org from beta sites [puppet] - 10https://gerrit.wikimedia.org/r/647379 (https://phabricator.wikimedia.org/T187716) [23:02:27] (03CR) 10Dzahn: [V: 03+2] remove zero.wikimedia.beta.wmflabs.org from beta sites [puppet] - 10https://gerrit.wikimedia.org/r/647379 (https://phabricator.wikimedia.org/T187716) (owner: 10Dzahn) [23:02:34] (03CR) 10Dzahn: [V: 03+2 C: 03+2] remove zero.wikimedia.beta.wmflabs.org from beta sites [puppet] - 10https://gerrit.wikimedia.org/r/647379 (https://phabricator.wikimedia.org/T187716) (owner: 10Dzahn) [23:04:06] !log zero.wikimedia.beta.wmflabs.org removed from beta_sites (deployment-prep) T187716 [23:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:11] T187716: Sunset Wikipedia Zero - https://phabricator.wikimedia.org/T187716 [23:04:53] (03PS7) 10Dzahn: Remove apache config for zero.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716) (owner: 10MaxSem) [23:07:25] (03PS3) 10Dzahn: wikistats: use script and separate config to dump mysql from timer 
[puppet] - 10https://gerrit.wikimedia.org/r/646876 [23:07:32] (03PS1) 10Mforns: Migrate HelpPanel schema from EventLogging to EventGate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647382 (https://phabricator.wikimedia.org/T267333) [23:07:39] (03CR) 10jerkins-bot: [V: 04-1] wikistats: use script and separate config to dump mysql from timer [puppet] - 10https://gerrit.wikimedia.org/r/646876 (owner: 10Dzahn) [23:09:18] (03PS4) 10Dzahn: wikistats: use script and separate config to dump mysql from timer [puppet] - 10https://gerrit.wikimedia.org/r/646876 [23:09:53] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:13:00] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2243.codfw.wmnet'] ` and were **ALL** successful. 
[23:13:39] (03PS1) 10Mforns: Migrate HomepageModule schema from EventLogging to EventGate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647383 (https://phabricator.wikimedia.org/T267333) [23:15:21] (03CR) 10Bstorm: [C: 03+2] wikireplicas and toolsdb: close connections when done with them [puppet] - 10https://gerrit.wikimedia.org/r/647285 (https://phabricator.wikimedia.org/T269620) (owner: 10Bstorm) [23:15:57] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=parse2001.codfw.wmnet [23:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:41] !log repooling parse2001 after buster reimage - T245757 [23:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:44] T245757: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 [23:17:25] (03PS1) 10Mforns: Migrate HomepageVisit schema from EventLogging to EventGate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647384 (https://phabricator.wikimedia.org/T267333) [23:19:45] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2243.codfw.wmnet [23:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:45] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) [23:21:54] !log repooling parse2001 after buster reimage - T268524 [23:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:57] T268524: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 [23:24:57] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) >>! 
In T245757#6645352, @jijiki wrote: > @Dzahn @hnowlan After discussing with @Muehlenhoff, since w... [23:26:53] (03PS1) 10Mforns: Migrate ServerSideAccountCreation schema from EventLogging to EventGate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647386 (https://phabricator.wikimedia.org/T267333) [23:32:00] (03PS1) 10Bstorm: maintain-dbusers: apply black formatting [puppet] - 10https://gerrit.wikimedia.org/r/647388 [23:33:54] (03CR) 10Bstorm: [C: 03+2] maintain-dbusers: apply black formatting [puppet] - 10https://gerrit.wikimedia.org/r/647388 (owner: 10Bstorm) [23:42:35] PROBLEM - very high load average likely xfs on ms-be2018 is CRITICAL: CRITICAL - load average: 101.10, 103.59, 101.76 https://wikitech.wikimedia.org/wiki/Swift [23:48:34] 10Operations, 10SRE-Access-Requests: Requesting access to deployment for Kosta Harlan - https://phabricator.wikimedia.org/T269731 (10kaldari) I approve as well in case it matters ;) [23:49:09] (03PS4) 10Ahmon Dancy: 0.1.0: Add ENABLE_DEBUG_LOGGING setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/647355 [23:51:02] (03PS3) 10Ahmon Dancy: Update Chart.yaml source references [deployment-charts] - 10https://gerrit.wikimedia.org/r/647354 [23:54:54] (03CR) 10Jeena Huneidi: [C: 03+2] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/647353 (owner: 10Ahmon Dancy) [23:56:15] (03Merged) 10jenkins-bot: Enable $wgShowHostnames [deployment-charts] - 10https://gerrit.wikimedia.org/r/647353 (owner: 10Ahmon Dancy)