[00:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Evening backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201209T0000).
[00:00:04] <jouncebot>	 cscott: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:04:37] <Urbanecm>	 I can deploy today...if no one else is around. 
[00:04:57] <Urbanecm>	 cscott: here? :)
[00:05:43] <cscott>	 Urbanecm: yep
[00:05:55] <cscott>	 i think subbu|away is around on #-parsoid as well (despite his nick)
[00:07:07] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Bump wikimedia/parsoid to v0.13.0-a19 [vendor] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/647046 (https://phabricator.wikimedia.org/T269685) (owner: 10C. Scott Ananian)
[00:07:28] <subbu>	 cscott, i am not |away .. your matrix client is lying to you.
[00:07:55] <cscott>	 i am using quaternion, so it must just be orthogonal to the truth :)
[00:08:26] <subbu>	 :)
[00:08:27] <Urbanecm>	 So, let's see once it's merged :)
[00:17:04] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[00:27:20] <wikibugs>	 (03PS2) 10Jforrester: Explicitly set wgAbuseFilterAflFilterMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647091 (https://phabricator.wikimedia.org/T269712) (owner: 10Daimona Eaytoy)
[00:28:25] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Explicitly set wgAbuseFilterAflFilterMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647091 (https://phabricator.wikimedia.org/T269712) (owner: 10Daimona Eaytoy)
[00:28:48] <wikibugs>	 (03PS3) 10Jforrester: Explicitly set wgAbuseFilterAflFilterMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647091 (https://phabricator.wikimedia.org/T269712) (owner: 10Daimona Eaytoy)
[00:28:50] <wikibugs>	 (03PS1) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Make WRITE_BOTH everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647116
[00:30:11] <wikibugs>	 (03CR) 10Daimona Eaytoy: wgAbuseFilterAflFilterMigrationStage: Make WRITE_BOTH everywhere (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647116 (owner: 10Jforrester)
[00:31:09] <wikibugs>	 (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to v0.13.0-a19 [vendor] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/647046 (https://phabricator.wikimedia.org/T269685) (owner: 10C. Scott Ananian)
[00:31:57] <wikibugs>	 (03PS2) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Make WRITE_BOTH everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647116
[00:31:59] <wikibugs>	 (03CR) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Make WRITE_BOTH everywhere (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647116 (owner: 10Jforrester)
[00:32:14] <Urbanecm>	 cscott: still here?
[00:33:21] <subbu>	 i am around here as well.
[00:33:40] <wikibugs>	 (03PS3) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Make WRITE_BOTH everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647116 (https://phabricator.wikimedia.org/T269712)
[00:33:42] <wikibugs>	 (03PS1) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Make READ_NEW in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647117 (https://phabricator.wikimedia.org/T269712)
[00:33:44] <Urbanecm>	 subbu: great
[00:33:44] <wikibugs>	 (03PS1) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Make COMPAT_NEW in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647118 (https://phabricator.wikimedia.org/T269712)
[00:34:26] <Urbanecm>	 subbu: cscott: please test at mwdebug1001
[00:34:36] <subbu>	 ok. one moment
[00:35:35] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1147399384 and 342 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:36:11] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2426127320 and 379 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:36:19] <subbu>	 how do i purge a page and have the reparse req. go to mwdebug1001?
[00:36:29] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 679446064 and 395 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:36:35] <Urbanecm>	 subbu: what about a preview at mwdebug1001?
[00:37:16] <subbu>	 ah .. ok, let me try that.
[00:37:43] <Urbanecm>	 sure
[00:38:34] <subbu>	 but, i need parsoid to reparse the page. that won't work since i cannot get parsoid reqs. to go to a specific server.
[00:38:45] <subbu>	 let me think if there is any other way of verifying without actually pushing it out everywhere.
[00:39:03] <Urbanecm>	 hmm, good point
[00:39:37] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2397421400 and 583 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:40:25] <cscott>	 subbu: ssh to scandium and query the host directly?
[00:41:05] <subbu>	 cscott, oh, now that the new rest endpoint has been pushed everywhere .. we could probably use that ... 
[00:41:25] <subbu>	 do you know what that url schema is? 
[00:41:57] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 960951376 and 65 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:41:58] <subbu>	 Pchelolo would know but i see he is away.
[00:42:08] <cscott>	 you should be able to directly fetch https://zh.wikipedia.org/api/rest_v1/page/html/%E4%BA%94%E7%9C%BC%E8%81%AF%E7%9B%9F
[00:42:13] <cscott>	 just set the host header
[00:42:20] <cscott>	 (working on that myself)
[00:42:38] <subbu>	 cscott, but the new code is only on mwdebug1001.
[00:42:45] <cscott>	 ah right yeah
[00:42:47] <subbu>	 and those parsoid endpoints are not available on that server.
[00:43:01] <Urbanecm>	 so, just push everywhere and hope?
[00:43:03] <subbu>	 so, that is why i was looking at the new rest endpoints that Pchelolo recently enabled everywhere. 
[00:43:07] <cscott>	 hrm.
[00:43:18] <subbu>	 Urbanecm, hold on :) we just need to figure out that url schema.
[00:43:23] <Urbanecm>	 okay
[00:43:25] <Urbanecm>	 waiting
[00:43:30] <cscott>	 this has been an issue w/ testing parsoid for a while.
[00:43:46] <cscott>	 subbu: i don't think the new url schema exposes the langconv stuff yet
[00:43:57] <subbu>	 ah, ok. right.
[00:44:27] <subbu>	 but, just hitting that endpoint should not emit the langvariant markup ..
[00:44:35] <subbu>	 so, that is the test.
[00:44:57] <subbu>	 let me search phab for pchelolo's ticket.
[00:45:02] <cscott>	 it's a test, at least
[00:45:14] <subbu>	 aha .. https://en.wikipedia.org/w/rest.php/v1/page/Earth/html is the new rest api schema
[00:45:22] <subbu>	 so, just need to fetch the zhwiki equivalent on mwdebug1001
[00:46:44] <subbu>	 https://zh.wikipedia.org/w/rest.php/v1/page/%E4%BA%94%E7%9C%BC%E8%81%AF%E7%9B%9F/html says not implemented. :)
[00:46:51] <subbu>	 alright, i guess we just push everywhere then.
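The two URL layouts pasted above differ only in where the page title sits. A minimal Python sketch, assuming titles are percent-encoded as UTF-8 the way the pasted URLs are; `rest_v1_html_url` and `rest_php_html_url` are hypothetical helper names for illustration, not part of any Wikimedia library:

```python
from urllib.parse import quote


def rest_v1_html_url(domain: str, title: str) -> str:
    """Layout seen in the log: /api/rest_v1/page/html/{title}."""
    # safe="" so that every reserved character in the title is escaped
    return f"https://{domain}/api/rest_v1/page/html/{quote(title, safe='')}"


def rest_php_html_url(domain: str, title: str) -> str:
    """Layout seen in the log: /w/rest.php/v1/page/{title}/html."""
    return f"https://{domain}/w/rest.php/v1/page/{quote(title, safe='')}/html"


if __name__ == "__main__":
    # Title taken from the zhwiki URLs quoted in the log
    print(rest_v1_html_url("zh.wikipedia.org", "五眼聯盟"))
    print(rest_php_html_url("zh.wikipedia.org", "五眼聯盟"))
```

This only builds the URLs; it does not address routing the request to a specific host such as mwdebug1001, which is the part the channel could not solve.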
[00:48:26] <Urbanecm>	 subbu: if you can use a particular parsoid host, you can run scap pull there to fetch the code
[00:48:27] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3161171608 and 171 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:48:38] <Urbanecm>	 and i can also just push everywhere
[00:49:49] <Urbanecm>	 subbu: so, `scap pull` at scandium should fetch the code to scandium
[00:51:00] <subbu>	 actually scandium has that new code since we rt-tested it.
[00:51:07] <subbu>	 and i verified that the page is fixed there.
[00:51:11] <Urbanecm>	 aha
[00:51:15] <Urbanecm>	 so I'll sync then
[00:51:17] <subbu>	 so, yes, let us push everywhere.
[00:51:19] <subbu>	 ya.
[00:52:00] <subbu>	 we already rt-tested all these pages prior to release, so, that works as expected. just being able to test on a non-scandium host would have been an added bonus, but we don't have a mechanism right now, so yes, let us push.
[00:52:17] <Urbanecm>	 sync in progress :)
[00:52:56] <cscott>	 go go go
[00:53:18] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.20/vendor/: 3278ffd107888757c4620383160a6d5fa67d05b5: Bump wikimedia/parsoid to v0.13.0-a19 (T269685) (duration: 01m 16s)
[00:53:24] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 70136 and 247 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:28] <stashbot>	 T269685: /page/html endpoint broken when requesting language variants affecting /page/summary - https://phabricator.wikimedia.org/T269685
[00:53:34] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 229712 and 256 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:35] <Urbanecm>	 cscott: subbu: should be everywhere now. Can you make sure it works now? :)
[00:53:56] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 138808 and 277 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:15] <wikibugs>	 (03PS4) 10Jforrester: Explicitly set wgAbuseFilterAflFilterMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647091 (https://phabricator.wikimedia.org/T269712) (owner: 10Daimona Eaytoy)
[00:55:17] <wikibugs>	 (03PS4) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Make WRITE_BOTH everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647116 (https://phabricator.wikimedia.org/T269712)
[00:55:19] <wikibugs>	 (03PS2) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Make READ_NEW in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647117 (https://phabricator.wikimedia.org/T269712)
[00:55:22] <wikibugs>	 (03PS2) 10Jforrester: wgAbuseFilterAflFilterMigrationStage: Make COMPAT_NEW in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647118 (https://phabricator.wikimedia.org/T269712)
[00:55:49] <subbu>	 there were a slew of transient errors .. i'm going to assume that is because the php7.2-fpm was restarting in that period. they have all gone away now. but, will monitor logstash for a bit longer first.
[00:55:58] <James_F>	 Urbanecm: I've an additional patch to go out now, when you're done (or you can do it if you want :-)).
[00:55:58] <Urbanecm>	 okay
[00:56:13] <subbu>	 yup, those are gone.
[00:56:18] <subbu>	 now to actually verify the other things.
[00:56:24] <Urbanecm>	 in that case, it's yours James_F :)
[00:56:24] <cscott>	 subbu: confirmed your <section> fix is live on enwiki
[00:56:39] <cscott>	 did you verify the zhwiki fix?
[00:56:47] * James_F waits.
[00:56:51] <subbu>	 https://zh.wikipedia.org/api/rest_v1/page/html/%E4%BA%94%E7%9C%BC%E8%81%AF%E7%9B%9F is fixed as well.
[00:57:13] <subbu>	 can someone from the mcs/apps side verify their endpoints are fixed?
[00:57:39] <cscott>	 subbu: yup, confirmed, that looks good to me
[00:58:22] <subbu>	 cscott, the mcs endpoints you mean?
[00:58:55] <cscott>	 No, I mean the zhwiki url you wrote
[00:59:08] <cscott>	 Haven't tried to check mcs endpoints
[00:59:20] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 290280 and 602 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:59:33] <subbu>	 https://phabricator.wikimedia.org/T269685 has the urls to test. so, let us verify that.
[01:00:06] <subbu>	 verified working now. returns http 200, not http 400.
[01:00:09] <subbu>	 lgtm.
[01:00:39] <Urbanecm>	 curl -i -H "Accept-Language: zh-hant" https://zh.wikipedia.org/api/rest_v1/page/html/%E8%B4%9D%E6%8B%89%E5%85%8B%C2%B7%E5%A5%A5%E5%B7%B4%E9%A9%AC returns HTTP 200 to me
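The curl check above could be scripted roughly as follows. This is a sketch using only the Python standard library, assuming the same endpoint and `Accept-Language` header as the curl command; `build_variant_request` and `fetch_status` are hypothetical names introduced here, and the 200-versus-400 expectation comes from the messages above rather than from any API guarantee:

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import Request, urlopen


def build_variant_request(domain: str, title: str, variant: str) -> Request:
    """Build a GET request for /api/rest_v1/page/html/{title} with a language variant."""
    url = f"https://{domain}/api/rest_v1/page/html/{quote(title, safe='')}"
    return Request(url, headers={"Accept-Language": variant})


def fetch_status(req: Request, timeout: float = 10.0) -> int:
    """Return the HTTP status code, including error codes such as 400."""
    try:
        with urlopen(req, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the code is still what we want
        return err.code


if __name__ == "__main__":
    req = build_variant_request("zh.wikipedia.org", "贝拉克·奥巴马", "zh-hant")
    print(req.full_url, fetch_status(req))
```

Catching `HTTPError` keeps a failing response visible as a number instead of a traceback, which makes it easier to compare before and after a deploy.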
[01:01:22] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] Explicitly set wgAbuseFilterAflFilterMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647091 (https://phabricator.wikimedia.org/T269712) (owner: 10Daimona Eaytoy)
[01:01:24] <subbu>	 great. i think we are done then.
[01:01:44] <wikibugs>	 (03CR) 10Jforrester: [C: 04-1] "Not until the train rolls out, so we can spot if this breaks things." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647116 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester)
[01:01:54] <subbu>	 i am just curious why we got those transient fatals on sync. we normally don't get those on normal deploy.
[01:02:02] <subbu>	 but, we can look into that another time.
[01:02:15] <wikibugs>	 (03Merged) 10jenkins-bot: Explicitly set wgAbuseFilterAflFilterMigrationStage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647091 (https://phabricator.wikimedia.org/T269712) (owner: 10Daimona Eaytoy)
[01:02:17] <subbu>	 cscott, look good to you?
[01:02:25] <subbu>	 if yes, we can call this done.
[01:03:35] <subbu>	 and https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2021/Citations?dtenable=1 now has reply links enabled as well.
[01:03:46] <subbu>	 so, they got their links 1 day early as well. :)
[01:04:47] <subbu>	 thanks Urbanecm.
[01:04:54] <Urbanecm>	 happy to help!
[01:05:00] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 114581872 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:05:26] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 89893088 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:07:57] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Explicitly set wgAbuseFilterAflFilterMigrationStage ahead of train roll-out T269712 (duration: 01m 03s)
[01:08:02] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2058272 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:08:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:08:05] <stashbot>	 T269712: Migrate afl_filter to afl_filter_id and afl_global - https://phabricator.wikimedia.org/T269712
[01:08:08] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 28616 and 1130 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:09:56] <James_F>	 OK, all done on my part.
[01:10:16] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 151717600 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:11:40] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2010112 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:16:52] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 296552 and 1654 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:19:32] <cscott>	 subbu: yes, looks good to me. (sorry for the lag)
[01:24:14] <subbu>	 great!
[01:24:28] <subbu>	 logs continue to be clean.
[01:27:57] <wikibugs>	 (03PS12) 10Mstyles: Add new helm chart for rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526)
[01:28:02] <wikibugs>	 (03CR) 10Mstyles: "> Patch Set 11:" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: 10Mstyles)
[01:36:35] <wikibugs>	 10Operations, 10Mail: SREs mail servers - https://phabricator.wikimedia.org/T269725 (10Reedy)
[04:00:14] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:31:24] <icinga-wm>	 RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:47:14] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:29:22] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:45:58] <wikibugs>	 (03PS1) 10Ammarpad: Add extended-confirmed group and restriction level for bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647129 (https://phabricator.wikimedia.org/T269709)
[05:46:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add extended-confirmed group and restriction level for bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647129 (https://phabricator.wikimedia.org/T269709) (owner: 10Ammarpad)
[05:47:14] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:49:24] <wikibugs>	 (03PS2) 10Ammarpad: Add extended-confirmed group and restriction level for bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647129 (https://phabricator.wikimedia.org/T269709)
[06:57:16] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-tool1010.eqiad.wmnet - https://phabricator.wikimedia.org/T268146 (10elukey) >>! In T268146#6677434, @Ottomata wrote: > Hmm, uh oh, I think this host needed to be placed in the Analytics VLAN.  Ping @elukey @razzi @robh  Ah snap I didn'...
[06:58:05] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] zookeeper: Support a standalone server's mbeans in the JMX exporter's conf [puppet] - 10https://gerrit.wikimedia.org/r/645371 (https://phabricator.wikimedia.org/T268202) (owner: 10Elukey)
[07:59:32] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:04:47] <wikibugs>	 (03PS1) 10Elukey: Import prometheus and constants module from spicerack [software/pywmflib] - 10https://gerrit.wikimedia.org/r/647189 (https://phabricator.wikimedia.org/T257905)
[08:15:29] <wikibugs>	 (03CR) 10Elukey: "Looks very good! One thing - is profile::memcached already included somewhere else or does it need to get added to the role?" [puppet] - 10https://gerrit.wikimedia.org/r/647106 (https://phabricator.wikimedia.org/T268784) (owner: 10Razzi)
[08:22:57] <wikibugs>	 (03PS1) 10Elukey: profile::memcached::instance: simplify handling of extendend_options [puppet] - 10https://gerrit.wikimedia.org/r/647190
[08:24:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] profile::memcached::instance: simplify handling of extendend_options [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey)
[08:24:37] <elukey>	 uff
[08:25:12] <wikibugs>	 (03PS1) 10Hashar: gerrit: disable autogc when receiving packs [puppet] - 10https://gerrit.wikimedia.org/r/647191
[08:25:37] <wikibugs>	 (03PS2) 10Elukey: profile::memcached::instance: simplify handling of extendend_options [puppet] - 10https://gerrit.wikimedia.org/r/647190
[08:26:07] <wikibugs>	 (03CR) 10Hashar: "That is made the default with Gerrit 3.3 per https://gerrit-review.googlesource.com/c/gerrit/+/289470" [puppet] - 10https://gerrit.wikimedia.org/r/647191 (owner: 10Hashar)
[08:27:43] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27025/console" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey)
[08:33:26] <wikibugs>	 (03CR) 10Jcrespo: "Looks sane, but please let me double check database backups are being created correctly before deploying. I will get back to you soon." [puppet] - 10https://gerrit.wikimedia.org/r/646966 (owner: 10Muehlenhoff)
[08:33:55] <wikibugs>	 (03CR) 10Muehlenhoff: "Ack, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/646966 (owner: 10Muehlenhoff)
[08:39:41] <wikibugs>	 (03CR) 10Jcrespo: "@Moritz:" [puppet] - 10https://gerrit.wikimedia.org/r/646966 (owner: 10Muehlenhoff)
[08:47:48] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:48:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add Tyler as approval contact for Gerrit/contint [puppet] - 10https://gerrit.wikimedia.org/r/644856 (owner: 10Muehlenhoff)
[08:51:37] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "Nice! Couple of nits inline, LGTM" (032 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/647189 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey)
[08:52:21] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/643717 (owner: 10Hnowlan)
[08:53:24] <icinga-wm>	 PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad total VRPs alert, total VRPs alert, valid ROAs alert, valid ROAs alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[08:57:30] <wikibugs>	 (03PS2) 10Elukey: Import prometheus and constants module from spicerack [software/pywmflib] - 10https://gerrit.wikimedia.org/r/647189 (https://phabricator.wikimedia.org/T257905)
[08:58:05] <wikibugs>	 (03CR) 10Elukey: Import prometheus and constants module from spicerack (032 comments) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/647189 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey)
[08:59:44] <wikibugs>	 (03CR) 10Kormat: "I'm inclined to say let's not do this, for now." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644857 (owner: 10Jcrespo)
[09:00:12] <wikibugs>	 (03CR) 10Jcrespo: "This is your own code, so we have no business here, but if that helps, we generally make explicit (e.g. through a comment) when we use "fa" [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan)
[09:00:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (apart from the silly CI check), you can pass a regex to the PCC host selection to bypass that:" [puppet] - 10https://gerrit.wikimedia.org/r/647031 (https://phabricator.wikimedia.org/T266479) (owner: 10Dave Pifke)
[09:03:33] <godog>	 !log swift codfw-prod: add ms-be20[58-61] - T269337
[09:03:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:43] <stashbot>	 T269337: Add ms-be20[58-61] to swift - https://phabricator.wikimedia.org/T269337
[09:04:22] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[09:04:52] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "Caught a small typo, otherwise +1" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/645317 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm)
[09:05:47] <elukey>	 moritzm: o/ are you puppet-merging?
[09:06:55] <wikibugs>	 (03CR) 10Kosta Harlan: "> Patch Set 3:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan)
[09:07:51] <moritzm>	 sorry, yes
[09:08:01] <moritzm>	 done
[09:09:18] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[09:13:39] <wikibugs>	 (03PS5) 10JMeybohm: Split out RBAC rules and service accounts for typha and CNI [deployment-charts] - 10https://gerrit.wikimedia.org/r/645317 (https://phabricator.wikimedia.org/T267653)
[09:13:41] <wikibugs>	 (03PS2) 10JMeybohm: calico: Bind the calico-cni Role to the calico-cni user [deployment-charts] - 10https://gerrit.wikimedia.org/r/645408 (https://phabricator.wikimedia.org/T267653)
[09:13:55] <wikibugs>	 (03CR) 10JMeybohm: Split out RBAC rules and service accounts for typha and CNI (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/645317 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm)
[09:15:08] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM!" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/647189 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey)
[09:16:06] <wikibugs>	 (03CR) 10Jcrespo: "Fair, I wasn't aware of those other scripts and I personally don't need this." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644857 (owner: 10Jcrespo)
[09:16:16] <wikibugs>	 (03Abandoned) 10Jcrespo: Move section script from software/dbtools to wmfmariapy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/644857 (owner: 10Jcrespo)
[09:16:18] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1] prometheus::k8s: Support arbitrary clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris)
[09:16:28] <wikibugs>	 (03PS11) 10Alexandros Kosiaris: prometheus::k8s: Support arbitrary clusters [puppet] - 10https://gerrit.wikimedia.org/r/644262
[09:18:24] <godog>	 akosiaris: did you see my comment re: ^ ?
[09:19:19] <godog>	 specifically the monitoring one in k8s.pp
[09:19:24] <wikibugs>	 (03PS1) 10Effie Mouzeli: redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643)
[09:19:40] <wikibugs>	 (03Abandoned) 10Effie Mouzeli: redis: define redis version on buster [puppet] - 10https://gerrit.wikimedia.org/r/646638 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli)
[09:21:38] <wikibugs>	 (03CR) 10Jcrespo: "> Thanks for the comment Jaime (please, you are always welcome to review & comment!). "snakeoil" is used throughout config files in this r" [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan)
[09:22:38] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] calico: Bind the calico-cni Role to the calico-cni user [deployment-charts] - 10https://gerrit.wikimedia.org/r/645408 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm)
[09:23:01] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Split out RBAC rules and service accounts for typha and CNI [deployment-charts] - 10https://gerrit.wikimedia.org/r/645317 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm)
[09:24:40] <jbond42>	 !log make message mandatory for disable-puppet
[09:24:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] disable-puppet: make message mandatory [puppet] - 10https://gerrit.wikimedia.org/r/645104 (owner: 10Jbond)
[09:24:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:47] <akosiaris>	 godog: No, I missed it. Sorry about that. Looking now. Thanks for the ping
[09:25:32] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] prometheus::k8s: Support arbitrary clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris)
[09:25:37] <wikibugs>	 (03PS3) 10Muehlenhoff: Remove "idp" backup file set and drop backup host profile from IDPs [puppet] - 10https://gerrit.wikimedia.org/r/646966
[09:26:10] <wikibugs>	 (03CR) 10Muehlenhoff: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/646966 (owner: 10Muehlenhoff)
[09:26:18] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27027/console" [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris)
[09:27:23] <godog>	 akosiaris: np!
[09:27:48] <wikibugs>	 (03PS6) 10JMeybohm: calico: Add support for calico 3.x with kubernetes datastore [puppet] - 10https://gerrit.wikimedia.org/r/645417 (https://phabricator.wikimedia.org/T267653)
[09:29:55] <wikibugs>	 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal, 10Patch-For-Review: Prepare a proof of concept of the minimum setup capable of backup and recover testwiki media files - https://phabricator.wikimedia.org/T264189 (10jcrespo) This is an example of metadata extracted, after normalizatio...
[09:29:59] <wikibugs>	 (03CR) 10Muehlenhoff: redis: define redis version on buster for multidc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli)
[09:31:02] <wikibugs>	 (03CR) 10Jbond: "LGTM wonder if we also need to add it to the absent_packages variable?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/646654 (owner: 10Muehlenhoff)
[09:31:33] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/645203 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[09:35:01] <wikibugs>	 10Operations, 10serviceops, 10cloud-services-team (Kanban): Upgrade labweb servers to buster - https://phabricator.wikimedia.org/T269004 (10MoritzMuehlenhoff) >>! In T269004#6677107, @Andrew wrote: > It's most useful if effort is directed towards completing T237773, which will render this issue moot. In theo...
[09:35:04] <wikibugs>	 10Operations, 10Data-Persistence-Backup, 10SRE-swift-storage, 10Goal, 10Patch-For-Review: Prepare a proof of concept of the minimum setup capable of backup and recover testwiki media files - https://phabricator.wikimedia.org/T264189 (10jcrespo) And this is after downloading the entire commonswiki metadat...
[09:35:12] <wikibugs>	 (03PS5) 10JMeybohm: admin_ng: Generalization, prod values anf fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/644787 (https://phabricator.wikimedia.org/T268434)
[09:35:14] <wikibugs>	 (03PS4) 10Kosta Harlan: linkrecommendation: Add private config for DB admin user [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573)
[09:35:39] <wikibugs>	 (03PS2) 10Effie Mouzeli: redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643)
[09:35:55] <wikibugs>	 (03CR) 10Kosta Harlan: "> Patch Set 3:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan)
[09:37:20] <wikibugs>	 (03CR) 10Effie Mouzeli: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27030/console" [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli)
[09:37:24] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/646879 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[09:40:06] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:44:27] <wikibugs>	 (03CR) 10Muehlenhoff: Stop installing apt-transport-https on Buster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/646654 (owner: 10Muehlenhoff)
[09:45:00] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:45:20] <wikibugs>	 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10jcrespo) @Krinkle That looks very similar to the problems I found initially on DB* dashboards, and then they did something to fix it- people here will know more. This was my i...
[09:45:30] <wikibugs>	 (03PS1) 10Jbond: profile::ntp: remove use_chrony parameter as its never used [puppet] - 10https://gerrit.wikimedia.org/r/647203
[09:46:04] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[09:47:20] <wikibugs>	 (03PS1) 10Effie Mouzeli: hiera: install redis on mc1036 [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643)
[09:47:41] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/602286 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff)
[09:47:53] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] hiera: install redis on mc1036 [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli)
[09:48:10] <wikibugs>	 (03PS2) 10Effie Mouzeli: hiera: install redis on mc2036 [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643)
[09:48:33] <wikibugs>	 10Operations, 10Puppet, 10DBA, 10User-jbond: Request new database for  pki.discovery.wmnet - https://phabricator.wikimedia.org/T268329 (10jcrespo) 2 tables and its schema were backed up yesterday, with around 4K in size after gzip compression. If that seems right I would call out the backups "working". Ple...
[09:48:39] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] hiera: install redis on mc2036 [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli)
[09:49:35] <wikibugs>	 10Operations, 10Puppet, 10DBA, 10User-jbond: Request new database for  pki.discovery.wmnet - https://phabricator.wikimedia.org/T268329 (10jbond) >>! In T268329#6678525, @jcrespo wrote: > 2 tables and its schema were backed up yesterday, with around 4K in size after gzip compression. If that seems right I w...
[09:51:43] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "Only 1 table was backed up from the cas database with around 10K after comrpession. No backup from case_staging were generated. If this se" [puppet] - 10https://gerrit.wikimedia.org/r/646966 (owner: 10Muehlenhoff)
[09:51:58] <icinga-wm>	 RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[09:52:40] <wikibugs>	 (03CR) 10Muehlenhoff: "The plan is still to replace ISC NTP with Chrony in production, but I haven't found the time to pursue this further. Let's keep it in for " [puppet] - 10https://gerrit.wikimedia.org/r/647203 (owner: 10Jbond)
[09:52:47] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/646890 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[09:53:42] <wikibugs>	 10Operations, 10Puppet, 10DBA, 10User-jbond: Request new database for  pki.discovery.wmnet - https://phabricator.wikimedia.org/T268329 (10jcrespo) >>! In T268329#6678526, @jbond wrote: >>>! In T268329#6678525, @jcrespo wrote: >> 2 tables and its schema were backed up yesterday, with around 4K in size after...
[09:53:58] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1] "> Patch Set 6:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris)
[09:54:10] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] "> Patch Set 3:" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan)
[09:55:33] <wikibugs>	 (03CR) 10Jbond: "minor nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/646971 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[09:56:29] <wikibugs>	 (03PS5) 10Kosta Harlan: linkrecommendation: Add private config for DB admin user [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573)
[09:56:46] <wikibugs>	 (03CR) 10Kosta Harlan: linkrecommendation: Add private config for DB admin user (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan)
[09:59:01] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1] prometheus::k8s: Support arbitrary clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris)
[10:00:02] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/647190 (owner: 10Elukey)
[10:00:23] <wikibugs>	 (03PS8) 10Volans: Run Homer during the decom cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/643979 (owner: 10Ayounsi)
[10:00:30] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active - Telia, AS1299/IPv6: Active - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:01:49] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/645203 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[10:02:17] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T269731 (10kostajh)
[10:02:24] <wikibugs>	 (03CR) 10Muehlenhoff: ""class redis" should also switch from require_package to ensure_packages, otherwise we might run into issues with the order of the setup o" [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli)
[10:03:00] <wikibugs>	 (03CR) 10Elukey: "Thanks to all for the feedback, I'd be inclined to proceed with after reading all comments." [puppet] - 10https://gerrit.wikimedia.org/r/645120 (owner: 10Elukey)
[10:03:20] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to deployment for Kosta Harlan - https://phabricator.wikimedia.org/T269731 (10kostajh)
[10:03:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/646966 (owner: 10Muehlenhoff)
[10:04:17] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to deployment for Kosta Harlan - https://phabricator.wikimedia.org/T269731 (10kostajh) > Requestor -- Please coordinate obtaining a comment of approval on this task from the approving party.  cc @akosiaris @marcella Please let me know if you have any quest...
[10:04:25] <wikibugs>	 (03PS3) 10Effie Mouzeli: hiera: install redis on shard16 [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643)
[10:04:54] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] hiera: install redis on shard16 [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli)
[10:06:48] <wikibugs>	 (03CR) 10Effie Mouzeli: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27033/console" [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli)
[10:07:15] <wikibugs>	 (03CR) 10Volans: [C: 03+2] Run Homer during the decom cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/643979 (owner: 10Ayounsi)
[10:07:34] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Import prometheus and constants module from spicerack [software/pywmflib] - 10https://gerrit.wikimedia.org/r/647189 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey)
[10:08:22] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] linkrecommendation: Add private config for DB admin user [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan)
[10:08:56] <wikibugs>	 (03Merged) 10jenkins-bot: Run Homer during the decom cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/643979 (owner: 10Ayounsi)
[10:11:18] <wikibugs>	 (03PS5) 10Jbond: ntp: replace hiera() with lookup(), move use_chrony to parameters [puppet] - 10https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[10:12:00] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 52, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:12:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "Made an update to the key name" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[10:12:41] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/647203 (owner: 10Jbond)
[10:12:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] profile::ntp: remove use_chrony parameter as its never used [puppet] - 10https://gerrit.wikimedia.org/r/647203 (owner: 10Jbond)
[10:13:01] <wikibugs>	 (03Abandoned) 10Jbond: profile::ntp: remove use_chrony parameter as its never used [puppet] - 10https://gerrit.wikimedia.org/r/647203 (owner: 10Jbond)
[10:13:51] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM. Couple of answers inline, thanks for the answers to my own questions. It helped clear up a few things." (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/644462 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm)
[10:15:32] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to deployment for Kosta Harlan - https://phabricator.wikimedia.org/T269731 (10akosiaris) >>! In T269731#6678565, @kostajh wrote: >> Requestor -- Please coordinate obtaining a comment of approval on this task from the approving party. >  > cc @akosiaris @ma...
[10:20:08] <wikibugs>	 (03PS1) 10Jbond: icinga_status: fix type downtimed != downtime [puppet] - 10https://gerrit.wikimedia.org/r/647208 (https://phabricator.wikimedia.org/T269672)
[10:20:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] icinga_status: fix type downtimed != downtime [puppet] - 10https://gerrit.wikimedia.org/r/647208 (https://phabricator.wikimedia.org/T269672) (owner: 10Jbond)
[10:24:03] <wikibugs>	 (03CR) 10Muehlenhoff: "> Patch Set 3: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/646966 (owner: 10Muehlenhoff)
[10:24:30] <wikibugs>	 10Operations, 10Patch-For-Review: Traceback in icinga-status  'Host' object has no attribute 'downtime' - https://phabricator.wikimedia.org/T269672 (10jbond) This should be fixed now please re open if yu still see issues  ` $ /usr/local/bin/icinga-status -j auth1002.eqiad.wmnet         {"auth1002": {"name": "a...
[10:24:58] <wikibugs>	 10Operations, 10Patch-For-Review: Traceback in icinga-status  'Host' object has no attribute 'downtime' - https://phabricator.wikimedia.org/T269672 (10jbond) 05Open→03Resolved a:03jbond
[10:26:29] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: deployment: Set global statsd exporter version [puppet] - 10https://gerrit.wikimedia.org/r/647210
[10:29:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] spec: add and use parallel sec which seems to give a boost to run time [puppet] - 10https://gerrit.wikimedia.org/r/645113 (owner: 10Jbond)
[10:40:12] <wikibugs>	 (03PS1) 10JMeybohm: _tls_helpers: Add a default tls.image_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/647211
[10:42:30] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] _tls_helpers: Add a default tls.image_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/647211 (owner: 10JMeybohm)
[10:43:54] <wikibugs>	 (03Merged) 10jenkins-bot: _tls_helpers: Add a default tls.image_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/647211 (owner: 10JMeybohm)
[10:45:23] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] "You'll need to bump the chart version in Chart.yaml for this to take effect." [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan)
[10:45:58] <moritzm>	 !log installing openssl updates on Buster
[10:46:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:02] <icinga-wm>	 PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:47:09] <wikibugs>	 (03PS7) 10Kosta Harlan: linkrecommendation: Add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893)
[10:48:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] linkrecommendation: Add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan)
[10:48:45] <wikibugs>	 (03PS6) 10Kosta Harlan: linkrecommendation: Add private config for DB admin user [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573)
[10:49:22] <wikibugs>	 (03PS8) 10Kosta Harlan: linkrecommendation: Add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893)
[10:50:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] linkrecommendation: Add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan)
[10:52:37] <wikibugs>	 (03PS5) 10Jbond: profile: migrate to shared spec_test [puppet] - 10https://gerrit.wikimedia.org/r/645147
[10:53:52] <wikibugs>	 (03CR) 10Kosta Harlan: "> Patch Set 5: Code-Review-1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan)
[10:55:05] <wikibugs>	 (03PS7) 10JMeybohm: calico: Add support for calico 3.x with kubernetes datastore [puppet] - 10https://gerrit.wikimedia.org/r/645417 (https://phabricator.wikimedia.org/T267653)
[10:55:37] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27034/console" [puppet] - 10https://gerrit.wikimedia.org/r/644808 (https://phabricator.wikimedia.org/T235162) (owner: 10Jbond)
[10:55:51] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Add calico helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/644462 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm)
[10:56:05] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Split out RBAC rules and service accounts for typha and CNI [deployment-charts] - 10https://gerrit.wikimedia.org/r/645317 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm)
[10:56:12] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] calico: Bind the calico-cni Role to the calico-cni user [deployment-charts] - 10https://gerrit.wikimedia.org/r/645408 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm)
[10:56:17] <godog>	 !log change librenms alerts and transport groups to use alertmanager - T267018
[10:56:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:25] <stashbot>	 T267018: LibreNMS sends its alerts to Alertmanager, resulting in email notifications to network operations - https://phabricator.wikimedia.org/T267018
[10:57:17] <wikibugs>	 (03Merged) 10jenkins-bot: Add calico helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644462 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm)
[10:57:52] <wikibugs>	 (03Merged) 10jenkins-bot: Split out RBAC rules and service accounts for typha and CNI [deployment-charts] - 10https://gerrit.wikimedia.org/r/645317 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm)
[10:57:54] <wikibugs>	 (03Merged) 10jenkins-bot: calico: Bind the calico-cni Role to the calico-cni user [deployment-charts] - 10https://gerrit.wikimedia.org/r/645408 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm)
[11:01:34] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "> Are you saying the linting of Ide101c55e5a0fd9a390f22de7c33d303e9f3da50 will be unbroken after this patch is merged which includes the b" [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan)
[11:02:43] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] linkrecommendation: Add helmfile.d config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan)
[11:04:10] <wikibugs>	 10Operations, 10netops, 10observability, 10User-fgiunchedi: LibreNMS sends its alerts to Alertmanager, resulting in email notifications to network operations - https://phabricator.wikimedia.org/T267018 (10fgiunchedi) +netops for visibility, cc @ayounsi
[11:06:39] <godog>	 !log reboot ms-be1019 / ms-be1020 - T268435
[11:06:43] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] linkrecommendation: Add private config for DB admin user [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan)
[11:06:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:47] <stashbot>	 T268435: Add ms-be106[0-3] to swift - https://phabricator.wikimedia.org/T268435
[11:07:48] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install an-tool1010.eqiad.wmnet - https://phabricator.wikimedia.org/T268146 (10elukey) After a chat with Riccardo and Arzhel, the idea is to:  1) decom an-tool1010 (testing a new feature of the decom cookbook to auto-cleanup switch configs). 2) r...
[11:07:52] <icinga-wm>	 PROBLEM - Host ms-be1019 is DOWN: PING CRITICAL - Packet loss = 100%
[11:07:58] <wikibugs>	 (03Merged) 10jenkins-bot: linkrecommendation: Add private config for DB admin user [deployment-charts] - 10https://gerrit.wikimedia.org/r/646658 (https://phabricator.wikimedia.org/T269573) (owner: 10Kosta Harlan)
[11:08:28] <wikibugs>	 (03PS9) 10JMeybohm: linkrecommendation: Add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan)
[11:08:44] <wikibugs>	 10Operations, 10Mail: Bounces when sending mail to aliases of a specific WMF email address - https://phabricator.wikimedia.org/T269725 (10Aklapper)
[11:11:15] <wikibugs>	 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) The list of VSM-related issues affecting 5.2.1 according to [[https://github.com/varnishcache/varnish-cache/blob/6.0/doc/changes.rst#fix...
[11:11:52] <icinga-wm>	 RECOVERY - Host ms-be1019 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[11:14:15] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Refactor calico deploy strategy - https://phabricator.wikimedia.org/T267653 (10JMeybohm) The new calico chart is merged, thanks @akosiaris   What is missing currently is a proper RoleBinding for the calicoctl user as I was n...
[11:14:42] <icinga-wm>	 PROBLEM - Host ms-be2059 is DOWN: PING CRITICAL - Packet loss = 100%
[11:15:51] <godog>	 that's me ^
[11:16:30] <icinga-wm>	 PROBLEM - Host ms-be1020 is DOWN: PING CRITICAL - Packet loss = 100%
[11:17:18] <godog>	 ditto
[11:17:24] <icinga-wm>	 RECOVERY - Host ms-be2059 is UP: PING OK - Packet loss = 0%, RTA = 33.42 ms
[11:17:49] <liw>	 godog, Scap released and deployed, btw
[11:18:36] <icinga-wm>	 RECOVERY - Host ms-be1020 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms
[11:20:54] <wikibugs>	 (03PS1) 10Elukey: hive: force TLS from the Metastore to the db-host when needed [puppet] - 10https://gerrit.wikimedia.org/r/647215 (https://phabricator.wikimedia.org/T257412)
[11:20:56] <godog>	 liw: thanks! 
[11:21:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] hive: force TLS from the Metastore to the db-host when needed [puppet] - 10https://gerrit.wikimedia.org/r/647215 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey)
[11:21:38] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] linkrecommendation: Add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan)
[11:23:02] <wikibugs>	 (03Merged) 10jenkins-bot: linkrecommendation: Add helmfile.d config [deployment-charts] - 10https://gerrit.wikimedia.org/r/646649 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan)
[11:32:07] <wikibugs>	 (03PS1) 10JMeybohm: linkrecommendation: Allow MySQL egress and set public_port [deployment-charts] - 10https://gerrit.wikimedia.org/r/647216 (https://phabricator.wikimedia.org/T265893)
[11:38:10] <wikibugs>	 (03PS2) 10Elukey: hive: force TLS from the Metastore to the db-host when needed [puppet] - 10https://gerrit.wikimedia.org/r/647215 (https://phabricator.wikimedia.org/T257412)
[11:40:18] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27036/console" [puppet] - 10https://gerrit.wikimedia.org/r/647215 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey)
[11:44:03] <wikibugs>	 (03PS3) 10Elukey: hive: force TLS from the Metastore to the db-host when needed [puppet] - 10https://gerrit.wikimedia.org/r/647215 (https://phabricator.wikimedia.org/T257412)
[11:45:33] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27037/console" [puppet] - 10https://gerrit.wikimedia.org/r/647215 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey)
[11:56:53] <wikibugs>	 10Operations, 10netops, 10observability, 10User-fgiunchedi: LibreNMS sends its alerts to Alertmanager, resulting in email notifications to network operations - https://phabricator.wikimedia.org/T267018 (10jbond) 05Open→03Resolved a:03jbond Looks like this is complete, resolving please reopen if i mis...
[11:57:21] <wikibugs>	 10Operations, 10fundraising-tech-ops, 10netops: Manage frack switches with Netbox - https://phabricator.wikimedia.org/T268802 (10jbond) p:05Triage→03Medium
[11:58:10] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Publish Wikibase tarball releases on releases.wikimedia.org - https://phabricator.wikimedia.org/T268818 (10jbond) p:05Triage→03Medium
[11:58:28] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "Various comments inline, but the premise is sane" (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/645417 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm)
[12:00:05] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201209T1200).
[12:00:05] <jouncebot>	 ammarpad: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:29] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27038/console" [puppet] - 10https://gerrit.wikimedia.org/r/647210 (owner: 10Alexandros Kosiaris)
[12:00:59] <Urbanecm>	 I can deploy today!
[12:01:02] <wikibugs>	 10Operations, 10Maps (Kartographer): Some PostgreSQL replicas are not fully updated - https://phabricator.wikimedia.org/T268927 (10jbond) p:05Triage→03Medium
[12:01:12] <Urbanecm>	 Ammarpad: hi, are you here?
[12:01:16] <wikibugs>	 10Operations, 10Gerrit-Privilege-Requests, 10LDAP-Access-Requests: Offboard Pablo-WMDE from WMF systems - https://phabricator.wikimedia.org/T268946 (10jbond) p:05Triage→03Medium
[12:01:55] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Clusters: an-presto1004 shows only the NIC in the boot list - https://phabricator.wikimedia.org/T268951 (10jbond) p:05Triage→03Medium
[12:02:22] <wikibugs>	 10Operations, 10netops: Upgrade Routinator 3000 to 0.8.2 - https://phabricator.wikimedia.org/T269738 (10ayounsi) p:05Triage→03Medium
[12:02:30] <Urbanecm>	 Ammarpad: ping?
[12:02:50] <wikibugs>	 10Operations, 10serviceops, 10Datacenter-Switchover: Updates to warmup script - https://phabricator.wikimedia.org/T269179 (10jbond) p:05Triage→03Medium
[12:03:21] <wikibugs>	 10Operations, 10observability, 10CAS-SSO: Sign-in links from Grafana dashboards don't work when not signed into SSO - https://phabricator.wikimedia.org/T269272 (10jbond) p:05Triage→03Medium
[12:04:01] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: https://lists.wikimedia.org/pipermail/wikija-l/ has broken encoding - https://phabricator.wikimedia.org/T269301 (10jbond) p:05Triage→03Medium
[12:04:07] <Ammarpad>	 @Urbanecm, yes I am
[12:04:57] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add extended-confirmed group and restriction level for bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647129 (https://phabricator.wikimedia.org/T269709) (owner: 10Ammarpad)
[12:05:03] <wikibugs>	 (03PS1) 10KartikMistry: Update apertium to 2020-12-09-115733-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/647220
[12:05:04] <wikibugs>	 10Operations, 10netops, 10observability, 10User-fgiunchedi: LibreNMS sends its alerts to Alertmanager, resulting in email notifications to network operations - https://phabricator.wikimedia.org/T267018 (10ayounsi) 05Resolved→03Open Not everything has been migrated yet, see the full list on https://libr...
[12:06:02] <wikibugs>	 (03Merged) 10jenkins-bot: Add extended-confirmed group and restriction level for bgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647129 (https://phabricator.wikimedia.org/T269709) (owner: 10Ammarpad)
[12:06:30] <wikibugs>	 10Operations, 10netops, 10observability, 10User-fgiunchedi: LibreNMS sends its alerts to Alertmanager, resulting in email notifications to network operations - https://phabricator.wikimedia.org/T267018 (10jbond) p:05Triage→03Medium >>! In T267018#6678835, @ayounsi wrote: > Not everything has been migra...
[12:07:17] <Urbanecm>	 Ammarpad: pulled onto mwdebug1001, can you test please?
[12:08:33] <wikibugs>	 10Operations, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10jbond) p:05Triage→03Medium
[12:08:41] <Ammarpad>	 Urbanecm OK
[12:08:49] <wikibugs>	 10Operations, 10Platform Engineering, 10serviceops: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (10jbond) p:05Triage→03Medium
[12:09:05] <Ammarpad>	 I mean I am testing...
[12:10:45] <wikibugs>	 10Operations: slapd fails to restart sometimes - https://phabricator.wikimedia.org/T269394 (10jbond) p:05Triage→03Medium
[12:11:45] <wikibugs>	 10Operations, 10SRE-tools, 10observability: HP RAID failed on ms-be1054 didn't open a task - https://phabricator.wikimedia.org/T269563 (10jbond) p:05Triage→03Medium
[12:12:09] <Urbanecm>	 Ammarpad: yes, I'm waiting
[12:12:12] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Data for toan - https://phabricator.wikimedia.org/T269678 (10jbond) p:05Triage→03Medium
[12:13:10] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+1] linkrecommendation: Allow MySQL egress and set public_port [deployment-charts] - 10https://gerrit.wikimedia.org/r/647216 (https://phabricator.wikimedia.org/T265893) (owner: 10JMeybohm)
[12:13:11] <Ammarpad>	 yes.. all is OK...
[12:13:20] <Ammarpad>	 You can proceed
[12:13:40] <wikibugs>	 10Operations, 10Domains, 10Okapi, 10Traffic: Okapi Domains - https://phabricator.wikimedia.org/T269686 (10jbond) p:05Triage→03Medium
[12:14:08] <wikibugs>	 (03CR) 10Kosta Harlan: "Actually while I think this looks reasonable and makes sense, I'll remove my vote so people more qualified in this domain can judge :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/647216 (https://phabricator.wikimedia.org/T265893) (owner: 10JMeybohm)
[12:14:12] <Urbanecm>	 thanks Ammarpad 
[12:15:45] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Remove unsupported arg in MediaWiki::doPostOutputShutdown() call [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645222 (owner: 10Ammarpad)
[12:15:56] <Urbanecm>	 Ammarpad: I assume the other one can't really be tested, is that right?
[12:16:02] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 3414289c8c7272185e30cacc3df5d5dbc719219d: Add extended-confirmed group and restriction level for bgwiki (T269709) (duration: 01m 19s)
[12:16:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:11] <stashbot>	 T269709: Add extended-confirmed group and restriction level for bgwiki - https://phabricator.wikimedia.org/T269709
[12:16:34] <wikibugs>	 (03Merged) 10jenkins-bot: Remove unsupported arg in MediaWiki::doPostOutputShutdown() call [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645222 (owner: 10Ammarpad)
[12:17:14] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service: mediawiki_job_wikidata-updateQueryServiceLag failing - https://phabricator.wikimedia.org/T269693 (10jbond) p:05Triage→03Medium
[12:17:33] <Urbanecm>	 Ammarpad: could you answer the q above?
[12:17:50] <Urbanecm>	 it's at mwdebug1001 anyway
[12:18:24] <Ammarpad>	 Yes indeed. But I am sure it will not cause problem. Lucas also gave it +1. The method does not take parameter.
[12:19:05] <Urbanecm>	 it doesn't, so hope it's all right
[12:19:07] <Ammarpad>	 OK, it's not testable though, so there's nothing I can do. PHP should be throwing error if you call method that takes no arg with arg
[12:19:50] <wikibugs>	 (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/645147 (owner: 10Jbond)
[12:20:10] <Urbanecm>	 that would indicate the file is not in use anymore
[12:20:22] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=restbase,service=restbase,name=restbase2009.codfw.wmnet
[12:20:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:33] <Urbanecm>	 syncing
[12:23:25] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized w/static.php: cfb36023ac873c00e680032999b7c21c2a105132: Remove unsupported arg in MediaWiki::doPostOutputShutdown() call (duration: 01m 02s)
[12:23:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:31] <Urbanecm>	 Ammarpad: done
[12:23:58] <Ammarpad>	 Thank you
[12:24:28] <Urbanecm>	 and stuff like https://cs.wikipedia.org/w/extensions/GrowthExperiments/images/mentor-ltr.svg still works, and goes through static.php
[12:24:32] <Urbanecm>	 !log Eu B&C window done
[12:24:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:54] <wikibugs>	 10Operations, 10Cassandra: Cassandra instance DNS records - are they needed? - https://phabricator.wikimedia.org/T269328 (10hnowlan) >>! In T269328#6667383, @Eevans wrote: > At the very least, getting rid of these names would create inconvenience.  There are lots of examples of maintenance and admin commands t...
[12:29:34] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime
[12:29:35] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:29:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:00] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=maps,service=kartotherian,name=maps1001.eqiad.wmnet
[12:30:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:08] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 21561504 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:34:34] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 312677880 and 18 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:35:14] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 402721320 and 26 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:35:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] profile: migrate to shared spec_test [puppet] - 10https://gerrit.wikimedia.org/r/645147 (owner: 10Jbond)
[12:37:00] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 53383376 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:37:10] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 61605872 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:37:50] <hnowlan>	 ^ expected 
[12:40:10] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 52 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:40:20] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 62 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:40:30] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 24824 and 71 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:41:10] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1008 and 111 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:41:52] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 104580512 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:42:14] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 541206160 and 32 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:42:24] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 35275104 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:43:08] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 32276528 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:45:08] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 70696672 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:46:08] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:47:38] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 548657160 and 35 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:48:09] <wikibugs>	 (03PS1) 10Ayounsi: Standardize Private-Peer BGP group [homer/public] - 10https://gerrit.wikimedia.org/r/647226
[12:48:18] <icinga-wm>	 PROBLEM - Device not healthy -SMART- on ms-be1030 is CRITICAL: cluster=swift device=1I:1:5 instance=ms-be1030 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1030&var-datasource=eqiad+prometheus/ops
[12:48:44] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 9792 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:48:54] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3032 and 13 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:49:16] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 14520 and 35 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:49:40] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 24592 and 59 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:50:02] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 36224 and 80 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:51:00] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:52:38] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Publish Wikibase tarball releases on releases.wikimedia.org - https://phabricator.wikimedia.org/T268818 (10jbond) > As for myself(toan) I'm currently not defined as an admin but would also like to be a part of this list. Should I add this in a follow-u...
[12:56:06] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Standardize Private-Peer BGP group [homer/public] - 10https://gerrit.wikimedia.org/r/647226 (owner: 10Ayounsi)
[12:56:36] <wikibugs>	 (03Merged) 10jenkins-bot: Standardize Private-Peer BGP group [homer/public] - 10https://gerrit.wikimedia.org/r/647226 (owner: 10Ayounsi)
[13:03:12] <wikibugs>	 (03PS1) 10Ppchelko: Article::view - remove the old subtitle from doOutputFromParserCache. [core] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647081 (https://phabricator.wikimedia.org/T269727)
[13:03:41] <wikibugs>	 (03CR) 10Ema: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/646971 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[13:03:42] <XioNoX>	 !log standardize Private-Peer BGP group on all cr* 
[13:03:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:06:06] <wikibugs>	 (03PS2) 10Jbond: Enable base::service_auto_restart for purged [puppet] - 10https://gerrit.wikimedia.org/r/646971 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[13:09:33] <wikibugs>	 (03CR) 10Muehlenhoff: Enable base::service_auto_restart for purged (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/646971 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[13:09:54] <wikibugs>	 10Operations, 10Maps (Kartographer): Some PostgreSQL replicas are not fully updated - https://phabricator.wikimedia.org/T268927 (10hnowlan) a:03hnowlan
[13:10:48] <wikibugs>	 (03PS1) 10Hnowlan: maps: increase replication lag tolerance further [puppet] - 10https://gerrit.wikimedia.org/r/647230
[13:11:07] <wikibugs>	 10Operations, 10Maps (Kartographer): Some PostgreSQL replicas are not fully updated - https://phabricator.wikimedia.org/T268927 (10hnowlan) maps1001 is depooled and resyncing.
[13:15:08] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:16:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm, thx" [puppet] - 10https://gerrit.wikimedia.org/r/647230 (owner: 10Hnowlan)
[13:18:24] <wikibugs>	 (03PS4) 10Muehlenhoff: Remove "idp" backup file set and drop backup host profile from IDPs [puppet] - 10https://gerrit.wikimedia.org/r/646966
[13:25:49] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[13:25:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:29] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] raktables: hand off authentication to httpd [puppet] - 10https://gerrit.wikimedia.org/r/644543 (owner: 10Jbond)
[13:26:35] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] racktables: Make everyone admin [puppet] - 10https://gerrit.wikimedia.org/r/644544 (owner: 10Jbond)
[13:26:41] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Publish Wikibase tarball releases on releases.wikimedia.org - https://phabricator.wikimedia.org/T268818 (10WMDE-leszek) I approve this request. I will also approve @toan's production shell access request when it is open.
[13:27:50] <logmsgbot>	 !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99)
[13:27:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:23] <wikibugs>	 (03PS1) 10Jbond: racktables: update correct file [puppet] - 10https://gerrit.wikimedia.org/r/647238
[13:30:36] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] racktables: update correct file [puppet] - 10https://gerrit.wikimedia.org/r/647238 (owner: 10Jbond)
[13:32:06] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] linkrecommendation: Allow MySQL egress and set public_port [deployment-charts] - 10https://gerrit.wikimedia.org/r/647216 (https://phabricator.wikimedia.org/T265893) (owner: 10JMeybohm)
[13:33:55] <wikibugs>	 (03CR) 10Filippo Giunchedi: "> Patch Set 11:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris)
[13:35:16] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27039/console" [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris)
[13:37:07] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:38:18] <wikibugs>	 (03PS1) 10Jbond: racktables: add trailing semicolon [puppet] - 10https://gerrit.wikimedia.org/r/647242
[13:39:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] racktables: add trailing semicolon [puppet] - 10https://gerrit.wikimedia.org/r/647242 (owner: 10Jbond)
[13:39:42] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] racktables: add trailing semicolon [puppet] - 10https://gerrit.wikimedia.org/r/647242 (owner: 10Jbond)
[13:40:52] <wikibugs>	 (03PS6) 10Kormat: int_cont [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634
[13:41:05] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:43:12] <wikibugs>	 (03CR) 10Zfilipin: "Thanks! 🎉" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/639293 (https://phabricator.wikimedia.org/T265463) (owner: 10Harriet Ayugi)
[13:44:06] <wikibugs>	 10Operations, 10Analytics-Clusters: Backport kafkacat 1.6.0 from bullseye to buster-backports or buster-wikimedia - https://phabricator.wikimedia.org/T268936 (10klausman) This is definitely doable, but needs at least one change: The Bullseye version of the package depends on librdkafka1 >= 1.4.2, which Buster...
[13:54:54] <godog>	 !log experiment with rsync.service increased niceness on ms-be2057 - T269337
[13:55:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:02] <stashbot>	 T269337: Add ms-be20[58-61] to swift - https://phabricator.wikimedia.org/T269337
[13:55:10] <wikibugs>	 10Operations: Traceback in icinga-status  'Host' object has no attribute 'downtime' - https://phabricator.wikimedia.org/T269672 (10jbond) >>! In T269672#6678613, @jbond wrote: > This should be fixed now please re open if yu still see issues >   Spoke to soon we now see this issue ` Exception raised while executi...
[13:56:53] <wikibugs>	 (03PS1) 10Jbond: icinga_status: remove values from json until support added to spicerack [puppet] - 10https://gerrit.wikimedia.org/r/647243 (https://phabricator.wikimedia.org/T269672)
[13:57:31] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] icinga_status: remove values from json until support added to spicerack [puppet] - 10https://gerrit.wikimedia.org/r/647243 (https://phabricator.wikimedia.org/T269672) (owner: 10Jbond)
[13:58:11] <wikibugs>	 (03PS1) 10Jbond: icinga: add support for downtimed and notifications_enabled parameters [software/spicerack] - 10https://gerrit.wikimedia.org/r/647245 (https://phabricator.wikimedia.org/T269672)
[14:01:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove "idp" backup file set and drop backup host profile from IDPs [puppet] - 10https://gerrit.wikimedia.org/r/646966 (owner: 10Muehlenhoff)
[14:02:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] icinga: add support for downtimed and notifications_enabled parameters [software/spicerack] - 10https://gerrit.wikimedia.org/r/647245 (https://phabricator.wikimedia.org/T269672) (owner: 10Jbond)
[14:06:42] <wikibugs>	 (03PS7) 10Kormat: integration: Complete framework for running basic tests [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634 (https://phabricator.wikimedia.org/T265266)
[14:07:06] <wikibugs>	 10Operations, 10Analytics-Clusters: Backport kafkacat 1.6.0 from bullseye to buster-backports or buster-wikimedia - https://phabricator.wikimedia.org/T268936 (10klausman) I've also poked Faidon on whether an official backport might be done.
[14:07:14] <wikibugs>	 (03CR) 10Kormat: "Ready for review now" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/646634 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat)
[14:11:21] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM overall, see inline" (031 comment) [software/ecs] - 10https://gerrit.wikimedia.org/r/644638 (owner: 10Cwhite)
[14:12:14] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] profile: add ecs pre and post filters to pipeline [puppet] - 10https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[14:15:38] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1] "> Patch Set 11:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris)
[14:16:23] <wikibugs>	 (03PS12) 10Alexandros Kosiaris: prometheus::k8s: Support arbitrary clusters [puppet] - 10https://gerrit.wikimedia.org/r/644262
[14:19:59] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on ms-be1030 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1030&var-datasource=eqiad+prometheus/ops
[14:23:51] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27040/console" [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris)
[14:28:02] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris)
[14:31:38] <wikibugs>	 10Operations, 10User-DannyS712: Access to #mediawiki_security IRC channel for DannyS712 - https://phabricator.wikimedia.org/T267800 (10CDanis) 05Open→03Resolved a:03CDanis
[14:35:10] <wikibugs>	 (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for purged [puppet] - 10https://gerrit.wikimedia.org/r/646971 (https://phabricator.wikimedia.org/T135991)
[14:38:07] <wikibugs>	 (03Abandoned) 10TK-999: GeoDNS: Update entry for Wikia [dns] - 10https://gerrit.wikimedia.org/r/643983 (owner: 10TK-999)
[14:38:32] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] maps: increase replication lag tolerance further [puppet] - 10https://gerrit.wikimedia.org/r/647230 (owner: 10Hnowlan)
[14:39:46] <wikibugs>	 (03PS1) 10TK-999: GeoDNS: Remove old hack for Wikia RES datacenter [dns] - 10https://gerrit.wikimedia.org/r/647253
[14:43:01] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] linkrecommendation: Allow MySQL egress and set public_port [deployment-charts] - 10https://gerrit.wikimedia.org/r/647216 (https://phabricator.wikimedia.org/T265893) (owner: 10JMeybohm)
[14:44:25] <wikibugs>	 (03Merged) 10jenkins-bot: linkrecommendation: Allow MySQL egress and set public_port [deployment-charts] - 10https://gerrit.wikimedia.org/r/647216 (https://phabricator.wikimedia.org/T265893) (owner: 10JMeybohm)
[14:46:52] <wikibugs>	 (03PS2) 10Muehlenhoff: Stop installing apt-transport-https on Buster and prune it from Stretch installs [puppet] - 10https://gerrit.wikimedia.org/r/646654
[14:47:45] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:48:24] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] "Thanks everyone! PCC is ok at https://puppet-compiler.wmflabs.org/compiler1001/27040/prometheus1003.eqiad.wmnet/fulldiff.html, merging" [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris)
[14:48:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Stop installing apt-transport-https on Buster and prune it from Stretch installs [puppet] - 10https://gerrit.wikimedia.org/r/646654 (owner: 10Muehlenhoff)
[14:49:45] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10klausman) Networking will be 1G. No hw RAID.  As for partitioning, there currently is no parman recipe available that does exactly what we want (2xSSD RAID-1 for OS, 2x (or m...
[14:51:04] <wikibugs>	 (03CR) 10Volans: "Did a second pass, and I sent you an offline question as I forgot some bits of the context." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/544943 (https://phabricator.wikimedia.org/T254249) (owner: 10Jbond)
[14:57:44] <wikibugs>	 (03PS3) 10Muehlenhoff: Stop installing apt-transport-https on Buster and prune it from Stretch installs [puppet] - 10https://gerrit.wikimedia.org/r/646654
[14:59:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Stop installing apt-transport-https on Buster and prune it from Stretch installs [puppet] - 10https://gerrit.wikimedia.org/r/646654 (owner: 10Muehlenhoff)
[15:00:40] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] hive: force TLS from the Metastore to the db-host when needed [puppet] - 10https://gerrit.wikimedia.org/r/647215 (https://phabricator.wikimedia.org/T257412) (owner: 10Elukey)
[15:06:54] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Refactor calico deploy strategy - https://phabricator.wikimedia.org/T267653 (10akosiaris) >>! In T267653#6678721, @JMeybohm wrote: > The new calico chart is merged, thanks @akosiaris  >  > What is missing currently is a prop...
[15:12:00] <akosiaris>	 elukey: I think I've never followed up on the conf1006 stuff
[15:12:39] <akosiaris>	 do we have a timeline for that migration? I can concoct a potion for puppet and make conf1006 drink it so we can take it offline
[15:13:17] <wikibugs>	 (03PS4) 10Jbond: Stop installing apt-transport-https on Buster and prune it from Stretch installs [puppet] - 10https://gerrit.wikimedia.org/r/646654 (owner: 10Muehlenhoff)
[15:13:19] <wikibugs>	 (03PS1) 10Jbond: base: update rakefile [puppet] - 10https://gerrit.wikimedia.org/r/647260
[15:18:32] <elukey>	 akosiaris: nono nothing urgent, it was just in the list so I asked, we can do it anytime
[15:18:48] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] base: update rakefile [puppet] - 10https://gerrit.wikimedia.org/r/647260 (owner: 10Jbond)
[15:19:01] <elukey>	 if you have to roll restart pybal etc., I can sync with John to make the change
[15:19:11] <elukey>	 (it is not blocking anything I mean)
[15:19:33] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Stop installing apt-transport-https on Buster and prune it from Stretch installs [puppet] - 10https://gerrit.wikimedia.org/r/646654 (owner: 10Muehlenhoff)
[15:26:24] <moritzm>	 !og restarting nginx on htmldump1001 to pick up OpenSSL security updates
[15:27:29] <hauskatze>	 moritzm: missed the "l" of "log" :)
[15:29:09] <wikibugs>	 (03PS1) 10JMeybohm: _scaffold: Default to latest for monitoring.image_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/647262
[15:29:21] <moritzm>	 good point, thanks :-)
[15:29:25] <moritzm>	 !log restarting nginx on htmldump1001 to pick up OpenSSL security updates
[15:29:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:52] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "Cool! I've added the _scaffold counterpart in I57dce79777bf1e9aa7f6ae88fc8e10969ed1518a" [puppet] - 10https://gerrit.wikimedia.org/r/647210 (owner: 10Alexandros Kosiaris)
[15:30:54] <wikibugs>	 (03PS2) 10Jbond: icinga_status: remove values from json until support added to spicerack [puppet] - 10https://gerrit.wikimedia.org/r/647243 (https://phabricator.wikimedia.org/T269672)
[15:33:28] <wikibugs>	 (03PS1) 10Filippo Giunchedi: WIP logstash: add ulogd ecs filter + tests [puppet] - 10https://gerrit.wikimedia.org/r/647265 (https://phabricator.wikimedia.org/T234565)
[15:33:31] <wikibugs>	 10Operations, 10Release Pipeline, 10Release-Engineering-Team-TODO, 10Epic, and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10Jdforrester-WMF)
[15:35:48] <wikibugs>	 10Operations, 10Dumps-Generation, 10Platform Engineering, 10serviceops: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (10ArielGlenn)
[15:35:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Stop installing apt-transport-https on Buster and prune it from Stretch installs [puppet] - 10https://gerrit.wikimedia.org/r/646654 (owner: 10Muehlenhoff)
[15:36:17] <wikibugs>	 (03PS3) 10Effie Mouzeli: redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643)
[15:37:07] <wikibugs>	 (03PS5) 10Ssingh: Initial commit of the knead-wikidough test suite [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/639838 (https://phabricator.wikimedia.org/T267424)
[15:37:19] <wikibugs>	 10Operations, 10Dumps-Generation, 10Platform Engineering, 10serviceops: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (10ArielGlenn) I can do the testbed host first, and then the rest. Do we have a mediawiki server on buster anywhere in the cluster yet?
[15:37:38] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Initial commit of the knead-wikidough test suite [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/639838 (https://phabricator.wikimedia.org/T267424) (owner: 10Ssingh)
[15:37:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli)
[15:38:42] <wikibugs>	 (03PS4) 10Effie Mouzeli: redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643)
[15:39:04] <wikibugs>	 10Operations, 10Dumps-Generation, 10Platform Engineering, 10serviceops: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (10MoritzMuehlenhoff) Yes, mwdebug1003 is running Buster, you can select it with the latest version of the WikimediaDebug browser extension.
[15:42:30] <wikibugs>	 (03CR) 10Effie Mouzeli: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27041/console" [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli)
[15:47:05] <hnowlan>	 !log reimaging restbase2009 after disk replacement
[15:47:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:16] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] icinga_status: remove values from json until support added to spicerack [puppet] - 10https://gerrit.wikimedia.org/r/647243 (https://phabricator.wikimedia.org/T269672) (owner: 10Jbond)
[15:47:49] <wikibugs>	 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10Bstorm) T267366#6667864 suggests the cables should have arrived on-site.
[15:49:03] <wikibugs>	 (03PS1) 10Jbond: icinga_status: add downtimed and notifications_enabled to json [puppet] - 10https://gerrit.wikimedia.org/r/647084
[15:49:05] <wikibugs>	 (03CR) 10Cwhite: "Looking good!" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/647265 (https://phabricator.wikimedia.org/T234565) (owner: 10Filippo Giunchedi)
[15:49:41] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] icinga_status: add downtimed and notifications_enabled to json [puppet] - 10https://gerrit.wikimedia.org/r/647084 (owner: 10Jbond)
[15:50:06] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[15:50:16] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Clusters: an-presto1004 shows only the NIC in the boot list - https://phabricator.wikimedia.org/T268951 (10Cmjohnson) Still working with Dell on this, tried reseating the raid controller and the cables, the raid card is still not recognized by the bios.
[15:50:53] <wikibugs>	 (03PS2) 10Jbond: icinga_status: add downtimed and notifications_enabled to json [puppet] - 10https://gerrit.wikimedia.org/r/647084
[15:51:40] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] icinga_status: add downtimed and notifications_enabled to json [puppet] - 10https://gerrit.wikimedia.org/r/647084 (owner: 10Jbond)
[15:52:23] <wikibugs>	 (03PS3) 10Jbond: icinga_status: add downtimed and notifications_enabled to json [puppet] - 10https://gerrit.wikimedia.org/r/647084
[15:56:43] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=maps,service=kartotherian,name=maps1001.eqiad.wmnet
[15:56:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:32] <wikibugs>	 (03PS6) 10Cwhite: Generate Logstash ECS cleanup filter as part of regular build process [software/ecs] - 10https://gerrit.wikimedia.org/r/644638
[16:03:14] <icinga-wm>	 PROBLEM - Host ms-be2050 is DOWN: PING CRITICAL - Packet loss = 100%
[16:03:17] <wikibugs>	 (03CR) 10Cwhite: Generate Logstash ECS cleanup filter as part of regular build process (031 comment) [software/ecs] - 10https://gerrit.wikimedia.org/r/644638 (owner: 10Cwhite)
[16:05:00] <moritzm>	 !log importing wikidiff2 1.10.0-1~wmf1+buster1 to component/php72 T250515
[16:05:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:07] <stashbot>	 T250515: Please provide our special component/php72 in buster-wikimedia - https://phabricator.wikimedia.org/T250515
[16:05:24] <icinga-wm>	 RECOVERY - Host ms-be2050 is UP: PING OK - Packet loss = 0%, RTA = 36.39 ms
[16:05:41] <wikibugs>	 (03PS5) 10Effie Mouzeli: redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643)
[16:05:55] <wikibugs>	 (03PS6) 10Cwhite: profile: add ecs pre and post filters to pipeline [puppet] - 10https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565)
[16:06:12] <moritzm>	 !log updating mwdebug1003, parse2001, deploy1002, deploy2002 to wikidiff 1.10.0-1~wmf1+buster1
[16:06:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:00] <wikibugs>	 (03CR) 10Effie Mouzeli: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27042/console" [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli)
[16:09:17] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[16:10:42] <ema>	 !log deployment-cache-text06: deploy varnish 6.0.0-1wm1 T264398
[16:10:45] <wikibugs>	 (03PS4) 10Effie Mouzeli: hiera: install redis on shard16 [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643)
[16:10:48] <wikibugs>	 (03PS1) 10Ayounsi: Revert "Run Homer during the decom cookbook" [cookbooks] - 10https://gerrit.wikimedia.org/r/647288
[16:10:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:50] <stashbot>	 T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398
[16:10:56] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] _scaffold: Default to latest for monitoring.image_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/647262 (owner: 10JMeybohm)
[16:11:09] <wikibugs>	 (03PS1) 10Elukey: Add a second Hive Metastore on an-coord1002 [puppet] - 10https://gerrit.wikimedia.org/r/647273 (https://phabricator.wikimedia.org/T268028)
[16:11:18] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] hiera: install redis on shard16 [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli)
[16:13:06] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Revert "Run Homer during the decom cookbook" [cookbooks] - 10https://gerrit.wikimedia.org/r/647288 (owner: 10Ayounsi)
[16:14:17] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27045/console" [puppet] - 10https://gerrit.wikimedia.org/r/647273 (https://phabricator.wikimedia.org/T268028) (owner: 10Elukey)
[16:15:18] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Run Homer during the decom cookbook" [cookbooks] - 10https://gerrit.wikimedia.org/r/647288 (owner: 10Ayounsi)
[16:17:02] <wikibugs>	 (03CR) 10Jbond: hiera: install redis on shard16 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli)
[16:17:13] <wikibugs>	 (03CR) 10Jbond: redis: define redis version on buster for multidc (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli)
[16:17:35] <wikibugs>	 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) OK the amount of work needed to get 5.2.1 in a usable state really seems excessive. Let's give a try to 6.0.0, which is the version imme...
[16:21:38] <wikibugs>	 (03PS2) 10Elukey: Add a second Hive Metastore on an-coord1002 [puppet] - 10https://gerrit.wikimedia.org/r/647273 (https://phabricator.wikimedia.org/T268028)
[16:24:25] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27046/console" [puppet] - 10https://gerrit.wikimedia.org/r/647273 (https://phabricator.wikimedia.org/T268028) (owner: 10Elukey)
[16:27:35] <wikibugs>	 (03PS1) 10MSantos: mobileapps: bump to 2020-12-09-093703-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/647275
[16:30:05] <wikibugs>	 (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2020-12-09-093703-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/647275 (owner: 10MSantos)
[16:31:24] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: bump to 2020-12-09-093703-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/647275 (owner: 10MSantos)
[16:33:43] <wikibugs>	 (03PS2) 10Filippo Giunchedi: WIP logstash: add ulogd ecs filter + tests [puppet] - 10https://gerrit.wikimedia.org/r/647265 (https://phabricator.wikimedia.org/T234565)
[16:33:45] <wikibugs>	 (03CR) 10Filippo Giunchedi: WIP logstash: add ulogd ecs filter + tests (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/647265 (https://phabricator.wikimedia.org/T234565) (owner: 10Filippo Giunchedi)
[16:34:16] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10Cmjohnson) @Dzahn   Is it possible to move   mw1281,82 and 83?  I need this space for the an-workers on 10G.  I can move them to A8.
[16:35:42] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] kubeadm-k8s: use cached calico container images [puppet] - 10https://gerrit.wikimedia.org/r/647094 (https://phabricator.wikimedia.org/T269016) (owner: 10Bstorm)
[16:35:59] <wikibugs>	 (03PS2) 10Jbond: icinga: add support for downtimed and notifications_enabled parameters [software/spicerack] - 10https://gerrit.wikimedia.org/r/647245 (https://phabricator.wikimedia.org/T269672)
[16:36:17] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Fix Cumin alias for cloudvirt-codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/646994 (owner: 10Muehlenhoff)
[16:37:25] <wikibugs>	 10Operations, 10Patch-For-Review: Traceback in icinga-status  'Host' object has no attribute 'downtime' - https://phabricator.wikimedia.org/T269672 (10jbond) I have pushed a patch which should remove the invalid parameters from the JSON output until spicerack is patched.  This should hopefully fix the cookbook
[16:37:43] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] modules/icinga/files/raid_handler.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/646890 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[16:40:47] <wikibugs>	 (03PS1) 10CRusnov: Revert "modules/icinga/files/raid_handler.py: Port to Python 3" [puppet] - 10https://gerrit.wikimedia.org/r/647292
[16:41:28] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Revert "modules/icinga/files/raid_handler.py: Port to Python 3" [puppet] - 10https://gerrit.wikimedia.org/r/647292 (owner: 10CRusnov)
[16:42:16] <wikibugs>	 (03PS2) 10CRusnov: Revert "modules/icinga/files/raid_handler.py: Port to Python 3" [puppet] - 10https://gerrit.wikimedia.org/r/647292
[16:42:32] <icinga-wm>	 RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:43:13] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to releasers-wikibase for toan - https://phabricator.wikimedia.org/T269777 (10toan)
[16:43:43] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] Revert "modules/icinga/files/raid_handler.py: Port to Python 3" [puppet] - 10https://gerrit.wikimedia.org/r/647292 (owner: 10CRusnov)
[16:44:29] <wikibugs>	 (03PS1) 10Jbond: icinga::raid_handler: add support for ssacli [puppet] - 10https://gerrit.wikimedia.org/r/647281 (https://phabricator.wikimedia.org/T269563)
[16:45:03] <wikibugs>	 (03CR) 10CRusnov: ganeti-netbox-sync: Add post-sync PuppetDB import where necessary (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/645212 (https://phabricator.wikimedia.org/T263768) (owner: 10CRusnov)
[16:45:27] <wikibugs>	 (03CR) 10CRusnov: [C: 04-1] "Will make this change and port separately to 2.9 instead of making this change and having to change it completely when it ports to 2.9." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/645212 (https://phabricator.wikimedia.org/T263768) (owner: 10CRusnov)
[16:45:49] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to releasers-wikibase for toan - https://phabricator.wikimedia.org/T269777 (10toan)
[16:48:14] <wikibugs>	 (03PS1) 10Bstorm: wikireplicas and toolsdb: close connections when done with them [puppet] - 10https://gerrit.wikimedia.org/r/647285 (https://phabricator.wikimedia.org/T269620)
[16:48:14] <logmsgbot>	 !log mbsantos@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[16:48:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:46] <logmsgbot>	 !log mbsantos@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[16:49:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:31] <logmsgbot>	 !log mbsantos@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[16:52:32] <wikibugs>	 (03PS7) 10Cwhite: profile: add ecs pre and post filters to pipeline [puppet] - 10https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565)
[16:52:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:53] <wikibugs>	 (03PS1) 10Clarakosi: JobQueue: Move translation jobs to its own queue [deployment-charts] - 10https://gerrit.wikimedia.org/r/647307 (https://phabricator.wikimedia.org/T267520)
[16:58:45] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to deployment for Kosta Harlan - https://phabricator.wikimedia.org/T269731 (10jbond) @kaldari you are listed as kostajh's manager; as such, can you approve this access request? @thcipriani are you able to approve adding kostajh to the `deployment:` group?
[16:59:00] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to deployment for Kosta Harlan - https://phabricator.wikimedia.org/T269731 (10jbond)
[16:59:21] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] kubeadm-k8s: use cached calico container images [puppet] - 10https://gerrit.wikimedia.org/r/647094 (https://phabricator.wikimedia.org/T269016) (owner: 10Bstorm)
[16:59:33] <wikibugs>	 (03CR) 10Ppchelko: [C: 03+1] "Ok, this seems reasonable." [deployment-charts] - 10https://gerrit.wikimedia.org/r/647307 (https://phabricator.wikimedia.org/T267520) (owner: 10Clarakosi)
[16:59:43] <wikibugs>	 (03PS7) 10Cwhite: Generate Logstash ECS cleanup filter as part of regular build process [software/ecs] - 10https://gerrit.wikimedia.org/r/644638
[17:00:07] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10elukey) I can definitely help on this @Dzahn, lemme know if you need a pair of extra hands :)
[17:00:21] <wikibugs>	 (03PS8) 10Cwhite: profile: add ecs pre and post filters to pipeline [puppet] - 10https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565)
[17:01:20] <wikibugs>	 (03PS8) 10Cwhite: Generate Logstash ECS cleanup filter as part of regular build process [software/ecs] - 10https://gerrit.wikimedia.org/r/644638
[17:01:39] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] "looks good, thanks!" (032 comments) [software/ecs] - 10https://gerrit.wikimedia.org/r/637719 (owner: 10Cwhite)
[17:01:42] <wikibugs>	 (03PS9) 10Cwhite: profile: add ecs pre and post filters to pipeline [puppet] - 10https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565)
[17:02:34] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/646971 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[17:03:28] <wikibugs>	 (03PS13) 10Mstyles: Add new helm chart for rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526)
[17:04:12] <wikibugs>	 (03CR) 10Bstorm: "Next patch for this should probably be black formatting. The format is all over the place in this script." [puppet] - 10https://gerrit.wikimedia.org/r/647285 (https://phabricator.wikimedia.org/T269620) (owner: 10Bstorm)
[17:09:35] <wikibugs>	 10Operations, 10Maps (Kartographer): Some PostgreSQL replicas are not fully updated - https://phabricator.wikimedia.org/T268927 (10hnowlan) maps1001 is now in sync and serving data consistent with the other nodes.
[17:11:14] <icinga-wm>	 PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:11:32] <wikibugs>	 (03PS6) 10Effie Mouzeli: redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643)
[17:12:09] <hauskatze>	 Is lists1001 lagging or something?
[17:12:50] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "I have no thoughts on this, as long as hp raid checks work/keep working I am ok with any change." [puppet] - 10https://gerrit.wikimedia.org/r/647281 (https://phabricator.wikimedia.org/T269563) (owner: 10Jbond)
[17:13:01] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli)
[17:14:51] <wikibugs>	 (03PS7) 10Effie Mouzeli: redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643)
[17:15:20] <wikibugs>	 (03CR) 10Effie Mouzeli: redis: define redis version on buster for multidc (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli)
[17:15:47] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10Cmjohnson) @elukey   Here is what is in racks now (not setup) 2 servers in A2 2 servers in A4  I requested @dzahn to move 3 mw servers to make room...
[17:16:41] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10Dzahn)
[17:17:42] <wikibugs>	 (03PS5) 10Effie Mouzeli: hiera: install redis on shard16 [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643)
[17:18:01] <wikibugs>	 (03PS6) 10Effie Mouzeli: hiera: install redis on shard16 [puppet] - 10https://gerrit.wikimedia.org/r/647204 (https://phabricator.wikimedia.org/T265643)
[17:21:58] <wikibugs>	 10Operations, 10Maps (Kartographer): Some PostgreSQL replicas are not fully updated - https://phabricator.wikimedia.org/T268927 (10MSantos) 05Open→03Resolved Thanks, @hnowlan!
[17:22:04] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:23:02] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw128[1-3].eqiad.wmnet
[17:23:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:23:16] <wikibugs>	 (03PS8) 10Effie Mouzeli: redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643)
[17:24:26] <mutante>	 !log depooling 3 API appservers in eqiad to physically move to another rack
[17:24:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:56] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:28:11] <wikibugs>	 (03PS9) 10Effie Mouzeli: redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643)
[17:30:34] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to deployment for Kosta Harlan - https://phabricator.wikimedia.org/T269731 (10marcella) @jbond I am Kosta's manager and I approve this request.  Thank you!
[17:36:12] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on es1023 - https://phabricator.wikimedia.org/T268796 (10Cmjohnson) 05Open→03Resolved The disk has been replaced and is rebuilding, please re-open if the problem persists
[17:39:45] <wikibugs>	 (03PS10) 10Effie Mouzeli: redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643)
[17:40:04] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): Degraded RAID on labstore1006 - https://phabricator.wikimedia.org/T268281 (10Cmjohnson) 05Open→03Resolved The disk has been replaced; I am not sure whether you have it set to auto-rebuild.  Please check, and if the problem persists, re-open this task.
[17:41:51] <wikibugs>	 (03PS1) 10Jcrespo: alerting: Disable screen/tmux monitoring on orchestrator hosts [puppet] - 10https://gerrit.wikimedia.org/r/647319 (https://phabricator.wikimedia.org/T265990)
[17:41:54] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268036 (10Cmjohnson) @fgiunchedi The bbu is on-site, please let me know when I can take this offline?  I can do tomorrow 1500UTC
[17:43:06] <wikibugs>	 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (10Cmjohnson) @dcausse I am sorry no, I forgot to put a ticket in with them.  I will do that today.  Thanks
[17:43:46] <wikibugs>	 (03CR) 10Jcrespo: "I know this is not 100% productionized, but proposing a small addition, similar to the other roles to avoid alerts on hosts starting with " [puppet] - 10https://gerrit.wikimedia.org/r/647319 (https://phabricator.wikimedia.org/T265990) (owner: 10Jcrespo)
[17:45:04] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on es1023 - https://phabricator.wikimedia.org/T268796 (10jcrespo) As usual, Chris, thank you for the quick response!
[17:45:07] <wikibugs>	 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Move labstore1005 to 10Gbps rack and ethernet - https://phabricator.wikimedia.org/T266199 (10Cmjohnson) @Bstorm Do we need to update both 1004 and 1005 to 10G at the same time?  I can convert 1005 to 10G anytime.
[17:46:31] <wikibugs>	 (03PS11) 10Effie Mouzeli: redis: define redis version on buster for multidc [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643)
[17:49:59] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw128[1-3].eqiad.wmnet
[17:50:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:41] <wikibugs>	 (03PS1) 10Andrew Bogott: Fake keydata for cinder ceph client [labs/private] - 10https://gerrit.wikimedia.org/r/647321 (https://phabricator.wikimedia.org/T265965)
[17:52:18] <wikibugs>	 (03CR) 10Clarakosi: [C: 03+2] JobQueue: Move translation jobs to its own queue [deployment-charts] - 10https://gerrit.wikimedia.org/r/647307 (https://phabricator.wikimedia.org/T267520) (owner: 10Clarakosi)
[17:54:01] <wikibugs>	 (03Merged) 10jenkins-bot: JobQueue: Move translation jobs to its own queue [deployment-charts] - 10https://gerrit.wikimedia.org/r/647307 (https://phabricator.wikimedia.org/T267520) (owner: 10Clarakosi)
[17:54:32] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on labstore1006 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1006&var-datasource=eqiad+prometheus/ops
[17:57:04] <logmsgbot>	 !log clarakosi@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
[17:57:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:36] <logmsgbot>	 !log clarakosi@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
[17:58:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:58] <logmsgbot>	 !log clarakosi@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
[18:00:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:01:46] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[18:01:48] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[18:01:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:01:53] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[18:01:54] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[18:01:58] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[18:02:00] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[18:02:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:03] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10ops-monitoring-bot) Icinga downtime for 4:00:00 set by dzahn@cumin1001 on 1 host(s) and their services with reason: move_to_other_rack ` mw1281.eqiad.wmnet `
[18:02:05] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10ops-monitoring-bot) Icinga downtime for 4:00:00 set by dzahn@cumin1001 on 1 host(s) and their services with reason: move_to_other_rack ` mw1282.eqiad.wmnet `
[18:02:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:08] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10ops-monitoring-bot) Icinga downtime for 4:00:00 set by dzahn@cumin1001 on 1 host(s) and their services with reason: move_to_other_rack ` mw1283.eqiad.wmnet `
[18:02:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:58] <wikibugs>	 (03PS1) 10Andrew Bogott: Cinder: install ceph client keyring [puppet] - 10https://gerrit.wikimedia.org/r/647323 (https://phabricator.wikimedia.org/T269511)
[18:03:02] <wikibugs>	 (03PS3) 10Razzi: superset: add cache to staging superset [puppet] - 10https://gerrit.wikimedia.org/r/647106 (https://phabricator.wikimedia.org/T268784)
[18:03:23] <wikibugs>	 (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Fake keydata for cinder ceph client [labs/private] - 10https://gerrit.wikimedia.org/r/647321 (https://phabricator.wikimedia.org/T265965) (owner: 10Andrew Bogott)
[18:04:28] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Cinder: install ceph client keyring [puppet] - 10https://gerrit.wikimedia.org/r/647323 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott)
[18:05:49] <mutante>	 !log mw1281,mw1282,mw1283 shut down for T266164
[18:05:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:56] <stashbot>	 T266164: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164
[18:06:17] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10Dzahn) @Cmjohnson Yes. I just depooled mw1281-1283, downtimed them and then shut them down physically. You can move them.
[18:06:20] <wikibugs>	 (03PS2) 10Andrew Bogott: Cinder: install ceph client keyring [puppet] - 10https://gerrit.wikimedia.org/r/647323 (https://phabricator.wikimedia.org/T269511)
[18:08:20] <wikibugs>	 (03CR) 1020after4: [C: 03+2] Article::view - remove the old subtitle from doOutputFromParserCache. [core] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647081 (https://phabricator.wikimedia.org/T269727) (owner: 10Ppchelko)
[18:08:51] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Refactor calico deploy strategy - https://phabricator.wikimedia.org/T267653 (10JMeybohm) >>! In T267653#6679363, @akosiaris wrote: >>>! In T267653#6678721, @JMeybohm wrote: >> The new calico chart is merged, thanks @akosiari...
[18:11:21] <wikibugs>	 (03PS1) 10Dzahn: site: update comment about location of mw1281-mw1283 [puppet] - 10https://gerrit.wikimedia.org/r/647326
[18:11:37] <wikibugs>	 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Move labstore1005 to 10Gbps rack and ethernet - https://phabricator.wikimedia.org/T266199 (10Bstorm) @Cmjohnson not really at the same time, no. If the 1Gb crossover cable works after converting the primary interface to 10Gb, the...
[18:11:40] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] site: update comment about location of mw1281-mw1283 [puppet] - 10https://gerrit.wikimedia.org/r/647326 (owner: 10Dzahn)
[18:11:55] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to releasers-wikibase for toan - https://phabricator.wikimedia.org/T269777 (10WMDE-leszek)
[18:11:57] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[18:11:57] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to releasers-wikibase for toan - https://phabricator.wikimedia.org/T269777 (10WMDE-leszek) I believe what @toan needs is "shell access" to production. For the time being would indeed be access to a subspace of releases.wikimedia.org, which is managed via `...
[18:11:58] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[18:12:05] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Icinga downtime for 4:00:00 set by dzahn@cumin1001 on 1 host(s) and their services with...
[18:12:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:48] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:13:59] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to releasers-wikibase for toan - https://phabricator.wikimedia.org/T269777 (10Dzahn) This request depends on T268818 being resolved first. That is a request to add the group mentioned here. So far that doesn't exist.
[18:15:13] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27054/console" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli)
[18:15:41] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+1] "LGTM a minor nit" [puppet] - 10https://gerrit.wikimedia.org/r/647197 (https://phabricator.wikimedia.org/T265643) (owner: 10Effie Mouzeli)
[18:15:51] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2243.codfw.wmnet
[18:15:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:13] <mutante>	 !log depooling mw2243 (jobrunner) for reimaging (T245757)
[18:16:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:20] <stashbot>	 T245757: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757
[18:17:07] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw2243.codfw.wmnet
[18:17:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:19:18] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2...
[18:19:26] <icinga-wm>	 PROBLEM - Host mw1282.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:21:16] <icinga-wm>	 PROBLEM - Host mw1281.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:21:16] <icinga-wm>	 PROBLEM - Host mw1283.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:21:19] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): update RAID controller firmware on labstore1006, 1007 - https://phabricator.wikimedia.org/T268285 (10Bstorm) @Jclark-ctr labstore1006 is currently out of the pool. Any time you want to update firmware, let me know and I can silence its alarms and shu...
[18:22:55] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): update RAID controller firmware on labstore1006, 1007 - https://phabricator.wikimedia.org/T268285 (10Jclark-ctr) @bstorm thanks I can take care of this today around 4:30pm est
[18:23:38] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] _scaffold: Default to latest for monitoring.image_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/647262 (owner: 10JMeybohm)
[18:24:37] <mutante>	 I am mildly surprised to get Icinga alerts for mgmt interfaces of hosts that I downtimed by cookbook.
[18:24:43] <wikibugs>	 (03PS3) 10Andrew Bogott: Cinder: install ceph client keyring [puppet] - 10https://gerrit.wikimedia.org/r/647323 (https://phabricator.wikimedia.org/T269511)
[18:24:45] <wikibugs>	 (03PS1) 10Andrew Bogott: Cinder: use the deployment-wide libvirt_rbd_uuid for cinder [puppet] - 10https://gerrit.wikimedia.org/r/647329 (https://phabricator.wikimedia.org/T269511)
[18:24:57] <mutante>	 Didn't it automatically include the mgmt hosts? maybe not
[18:24:58] <wikibugs>	 (03Merged) 10jenkins-bot: _scaffold: Default to latest for monitoring.image_version [deployment-charts] - 10https://gerrit.wikimedia.org/r/647262 (owner: 10JMeybohm)
[18:25:31] <mutante>	 but fine with me.. then I see when they come back in the new rack
[18:26:28] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Cinder: use the deployment-wide libvirt_rbd_uuid for cinder [puppet] - 10https://gerrit.wikimedia.org/r/647329 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott)
[18:27:09] <DannyS712>	 wikitech is being really slow for me - is it just me?
[18:28:02] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] add 20.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/643386 (https://phabricator.wikimedia.org/T264367) (owner: 10Dzahn)
[18:28:05] <wikibugs>	 (03PS3) 10Dzahn: add 20.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/643386 (https://phabricator.wikimedia.org/T264367)
[18:28:27] <wikibugs>	 (03PS2) 10Andrew Bogott: Cinder: use the deployment-wide libvirt_rbd_uuid for cinder [puppet] - 10https://gerrit.wikimedia.org/r/647329 (https://phabricator.wikimedia.org/T269511)
[18:29:14] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): update RAID controller firmware on labstore1006, 1007 - https://phabricator.wikimedia.org/T268285 (10Bstorm) That works for me. I'll silence things. Just ping me on IRC when you need a shutdown.
[18:31:46] <wikibugs>	 (03PS4) 10Razzi: superset: add cache to staging superset [puppet] - 10https://gerrit.wikimedia.org/r/647106 (https://phabricator.wikimedia.org/T268784)
[18:32:03] <wikibugs>	 (03PS3) 10Andrew Bogott: Cinder: use the deployment-wide libvirt_rbd_uuid for cinder [puppet] - 10https://gerrit.wikimedia.org/r/647329 (https://phabricator.wikimedia.org/T269511)
[18:34:19] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Cinder: install ceph client keyring [puppet] - 10https://gerrit.wikimedia.org/r/647323 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott)
[18:34:33] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Cinder: use the deployment-wide libvirt_rbd_uuid for cinder [puppet] - 10https://gerrit.wikimedia.org/r/647329 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott)
[18:36:45] <wikibugs>	 (03CR) 10Dzahn: "https://20.wikipedia.org" [dns] - 10https://gerrit.wikimedia.org/r/643386 (https://phabricator.wikimedia.org/T264367) (owner: 10Dzahn)
[18:36:47] <wikibugs>	 (03CR) 10Razzi: [C: 03+2] superset: add cache to staging superset [puppet] - 10https://gerrit.wikimedia.org/r/647106 (https://phabricator.wikimedia.org/T268784) (owner: 10Razzi)
[18:36:48] <icinga-wm>	 RECOVERY - Host mw1282.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms
[18:37:35] <wikibugs>	 (03PS1) 10Andrew Bogott: Cinder: no need to restart apache2 when config changes [puppet] - 10https://gerrit.wikimedia.org/r/647330 (https://phabricator.wikimedia.org/T269511)
[18:38:00] <wikibugs>	 10Operations, 10Domains, 10Traffic: URL to redirect to upcoming Wikipedia Birthday page on wikimediafoundation.org - https://phabricator.wikimedia.org/T264367 (10Dzahn) @hdothiduc @Varnent Done!  Added to DNS and https://20.wikipedia.org  works now for me. There could be a little delay depending on caches an...
[18:38:24] <icinga-wm>	 RECOVERY - Host mw1283.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.44 ms
[18:38:36] <icinga-wm>	 RECOVERY - Host mw1281.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.74 ms
[18:39:00] <wikibugs>	 (03Merged) 10jenkins-bot: Article::view - remove the old subtitle from doOutputFromParserCache. [core] (wmf/1.36.0-wmf.21) - 10https://gerrit.wikimedia.org/r/647081 (https://phabricator.wikimedia.org/T269727) (owner: 10Ppchelko)
[18:40:46] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Cinder: no need to restart apache2 when config changes [puppet] - 10https://gerrit.wikimedia.org/r/647330 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott)
[18:40:58] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: analytics-meta-replica on an-coord1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error executing row event: Table superset_staging.ab_user doesnt exist https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:41:34] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[18:41:36] <wikibugs>	 10Operations, 10Domains, 10Traffic: URL to redirect to upcoming Wikipedia Birthday page on wikimediafoundation.org - https://phabricator.wikimedia.org/T264367 (10Dzahn) If you can confirm things are working for you then it's up to you if we close this ticket now or after the actual birthday page has been cre...
[18:41:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:24] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[18:42:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:39] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[18:43:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:56] <cmjohnson1>	 mutante: running the scripts now for the mw move...thanks for doing that so quickly
[18:46:24] <mutante>	 cmjohnson1: alright! yep, np. these weren't proxies so that means less work to remove them
[18:48:21] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:48:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:48:29] <volans>	 mutante: the mgmt checks are another host for Icinga, and given they are in the OOB network should not be affected by normal operations hence are not covered by the cookbook
[18:48:42] <volans>	 but we can surely consider adding an option to downtime the mgmt too
[18:49:01] <volans>	 I'm not sure the script on icinga that the cookbook runs supports it, but can surely be added there too
[18:49:13] <volans>	 if you think that's a valid use case feel free to open a task
[18:52:34] <mutante>	 volans: ACK, they are different hosts. Yea, I am on the fence about it. I don't want to cause alerts during downtime but for things like these physical moves it's also a feature to see when mgmt comes back online. 
[18:53:29] <volans>	 agree
[18:54:15] <wikibugs>	 (03PS1) 10Andrew Bogott: Cinder: fix keystone auth for the cinder service user [puppet] - 10https://gerrit.wikimedia.org/r/647335 (https://phabricator.wikimedia.org/T269511)
[18:54:32] <mutante>	 hmm, leave it as it is. If I change my mind I will make that ticket :)
[18:54:42] <volans>	 works for me, thanks :)
[18:54:49] <mutante>	 thanks as well
[18:56:12] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Cinder: fix keystone auth for the cinder service user [puppet] - 10https://gerrit.wikimedia.org/r/647335 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott)
[18:59:34] <wikibugs>	 10Operations, 10Mail: Bounces when sending mail to aliases of a specific WMF email address - https://phabricator.wikimedia.org/T269725 (10JGulingan) Hi all,  I sent a test email to legoktm@ and kmehta-ctr@ today and did not receive a bounce back email.  Thanks for your help!  Best, Jo
[18:59:44] <mutante>	 !log testreduce1001 - installed make
[18:59:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:05] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201209T1900).
[19:00:05] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[19:00:05] <jouncebot>	 twentyafterfour and marxarelli: Dear deployers, time to do the Train log triage with CPT deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201209T1900).
[19:01:06] <wikibugs>	 10Operations, 10Parsoid, 10Parsoid-Tests, 10serviceops, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) meanwhile there is no more /srv/parsoid on testreduce1001 but /srv/parsoid-testing instead.  I tried an "npm...
[19:01:35] <icinga-wm>	 PROBLEM - SSH on an-presto1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:10:31] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:10:34] <wikibugs>	 (03CR) 10JMeybohm: calico: Add support for calico 3.x with kubernetes datastore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645417 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm)
[19:12:24] <wikibugs>	 (03PS2) 10TK-999: GeoDNS: Remove old hack for Wikia RES datacenter [dns] - 10https://gerrit.wikimedia.org/r/647253
[19:13:31] <wikibugs>	 (03PS2) 10Urbanecm: Enable ArticlePlaceholder at papwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646831 (https://phabricator.wikimedia.org/T223693)
[19:13:35] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Enable ArticlePlaceholder at papwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646831 (https://phabricator.wikimedia.org/T223693) (owner: 10Urbanecm)
[19:14:26] <wikibugs>	 (03Merged) 10jenkins-bot: Enable ArticlePlaceholder at papwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/646831 (https://phabricator.wikimedia.org/T223693) (owner: 10Urbanecm)
[19:16:44] <wikibugs>	 10Operations, 10Mail: Bounces when sending mail to aliases of a specific WMF email address: 550 Previous (cached) callout verification failure - https://phabricator.wikimedia.org/T269725 (10Aklapper)
[19:17:23] <logmsgbot>	 !log twentyafterfour@deploy1001 Synchronized php-1.36.0-wmf.21/includes/page/Article.php: deploy 0d99fe6d54 Article::view - remove the old subtitle from doOutputFromParserCache. Bug: T269727 (duration: 01m 04s)
[19:17:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:27] <stashbot>	 T269727: Old revision warning box is added twice on page view if old rev served from cache - https://phabricator.wikimedia.org/T269727
[19:25:17] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2243.codfw.wmnet'] `  and were **ALL** successful.
[19:26:51] <icinga-wm>	 PROBLEM - Check systemd state on an-presto1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:28:01] <icinga-wm>	 RECOVERY - SSH on an-presto1001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:33:07] <icinga-wm>	 RECOVERY - Check systemd state on an-presto1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:38:42] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: ce01bbe7b05eda8065fc57c865a69370e8aae797: Enable ArticlePlaceholder at papwiki (T223693) (duration: 01m 02s)
[19:38:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:48] <stashbot>	 T223693: Deploy article placeholder on the pap.wikipedia (papiamentu) - https://phabricator.wikimedia.org/T223693
[19:38:54] <wikibugs>	 (03PS1) 10Andrew Bogott: Cinder: more config fixes [puppet] - 10https://gerrit.wikimedia.org/r/647344 (https://phabricator.wikimedia.org/T269511)
[19:38:56] <wikibugs>	 (03PS1) 10Andrew Bogott: Cinder: include cinder-volume service on control nodes [puppet] - 10https://gerrit.wikimedia.org/r/647345 (https://phabricator.wikimedia.org/T269511)
[19:42:22] <wikibugs>	 (03PS2) 10Andrew Bogott: Cinder: include cinder-volume service on control nodes [puppet] - 10https://gerrit.wikimedia.org/r/647345 (https://phabricator.wikimedia.org/T269511)
[19:42:26] <cmjohnson1>	 mutante: mw1281-1283 are back 
[19:43:12] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Cinder: more config fixes [puppet] - 10https://gerrit.wikimedia.org/r/647344 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott)
[19:43:30] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10Cmjohnson)
[19:43:59] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10Cmjohnson) @dzahn completed the move and mw1281-83 are up
[19:44:20] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Cinder: include cinder-volume service on control nodes [puppet] - 10https://gerrit.wikimedia.org/r/647345 (https://phabricator.wikimedia.org/T269511) (owner: 10Andrew Bogott)
[19:44:54] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10Cmjohnson) @dzahn and I were able to move mw1281-1283 and we now have 6 servers total in row A.
[19:46:06] <mutante>	 cmjohnson1: great! thanks for doing it swiftly. will get them back in prod
[20:00:04] <jouncebot>	 twentyafterfour and marxarelli: How many deployers does it take to do Mediawiki train - American Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201209T2000).
[20:02:38] <marxarelli>	 twentyafterfour: heyo o/
[20:04:34] <wikibugs>	 (03PS6) 10Cwhite: First attempt at a JSONSchema template generator utility. [software/ecs] - 10https://gerrit.wikimedia.org/r/637719
[20:05:11] <twentyafterfour>	 marxarelli: hello
[20:06:27] <twentyafterfour>	 I deployed the patch for T269727
[20:06:31] <stashbot>	 T269727: Old revision warning box is added twice on page view if old rev served from cache - https://phabricator.wikimedia.org/T269727
[20:07:35] <twentyafterfour>	 everything looks good to go ahead with group1
[20:10:19] <wikibugs>	 (03PS1) 1020after4: group1 wikis to 1.36.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647350
[20:10:21] <wikibugs>	 (03CR) 1020after4: [C: 03+2] group1 wikis to 1.36.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647350 (owner: 1020after4)
[20:11:35] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.21 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647350 (owner: 1020after4)
[20:12:51] <logmsgbot>	 !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.21
[20:12:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:54] <logmsgbot>	 !log twentyafterfour@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.21 (duration: 01m 02s)
[20:13:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:31] <wikibugs>	 (03PS1) 10Milimetric: refine: blacklist WikibasePingback [puppet] - 10https://gerrit.wikimedia.org/r/647351
[20:16:30] <wikibugs>	 (03CR) 10Milimetric: "Thanks in advance for merging this, it can be reverted once the schema is fixed." [puppet] - 10https://gerrit.wikimedia.org/r/647351 (owner: 10Milimetric)
[20:17:09] <wikibugs>	 10Operations, 10Parsoid, 10Parsoid-Tests, 10serviceops, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) 05Open→03Resolved
[20:18:09] <twentyafterfour>	 !log wmf.21 looks good on group1 wikis. Still seeing T269603 but not at an increased rate. (refs T264801)
[20:18:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:14] <stashbot>	 T264801: 1.36.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T264801
[20:18:14] <stashbot>	 T269603: InvalidArgumentException when requesting Special:EntityData with Sense or Form ID (The given ID does not refer to an entity of type lexeme) - https://phabricator.wikimedia.org/T269603
[20:19:43] <wikibugs>	 (03PS1) 10Dzahn: parsoid/testreduce: ensure make is installed on testreduce host [puppet] - 10https://gerrit.wikimedia.org/r/647352 (https://phabricator.wikimedia.org/T257906)
[20:20:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] parsoid/testreduce: ensure make is installed on testreduce host [puppet] - 10https://gerrit.wikimedia.org/r/647352 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn)
[20:21:56] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1281.eqiad.wmnet
[20:21:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:06] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1282.eqiad.wmnet
[20:22:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:12] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1283.eqiad.wmnet
[20:22:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:23] <wikibugs>	 (03CR) 10Razzi: [C: 03+2] refine: blacklist WikibasePingback [puppet] - 10https://gerrit.wikimedia.org/r/647351 (owner: 10Milimetric)
[20:26:07] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw128[1-3].eqiad.wmnet
[20:26:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:17] <wikibugs>	 (03PS1) 10Ahmon Dancy: Enable $wgShowHostnames [deployment-charts] - 10https://gerrit.wikimedia.org/r/647353
[20:26:19] <wikibugs>	 (03PS1) 10Ahmon Dancy: Update Chart.yaml source references [deployment-charts] - 10https://gerrit.wikimedia.org/r/647354
[20:26:21] <wikibugs>	 (03PS1) 10Ahmon Dancy: Add ENABLE_DEBUG_LOGGING setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/647355
[20:26:25] <mutante>	 !log repooling mw1281,mw1282,mw1283 - now in rack A8
[20:26:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:16] <mutante>	 !log  mw1281,mw1282,mw1283 - scap pull 
[20:27:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:20] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10Dzahn) @Cmjohnson Thank you. Repooled and receiving traffic again. Monitoring looks good.
[20:31:07] <wikibugs>	 (03PS2) 10Dzahn: site: update comment about location of mw1281-mw1283 [puppet] - 10https://gerrit.wikimedia.org/r/647326
[20:31:31] <wikibugs>	 (03PS3) 10Dzahn: site: update comment about location of mw1281-mw1283 [puppet] - 10https://gerrit.wikimedia.org/r/647326 (https://phabricator.wikimedia.org/T266164)
[20:31:42] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] site: update comment about location of mw1281-mw1283 [puppet] - 10https://gerrit.wikimedia.org/r/647326 (https://phabricator.wikimedia.org/T266164) (owner: 10Dzahn)
[20:31:48] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] First attempt at a JSONSchema template generator utility. [software/ecs] - 10https://gerrit.wikimedia.org/r/637719 (owner: 10Cwhite)
[20:32:08] <wikibugs>	 (03CR) 10Cwhite: [V: 03+2 C: 03+2] First attempt at a JSONSchema template generator utility. (031 comment) [software/ecs] - 10https://gerrit.wikimedia.org/r/637719 (owner: 10Cwhite)
[20:33:09] <wikibugs>	 (03PS2) 10Dzahn: parsoid/testreduce: ensure make is installed on testreduce host [puppet] - 10https://gerrit.wikimedia.org/r/647352 (https://phabricator.wikimedia.org/T257906)
[20:34:46] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] parsoid/testreduce: ensure make is installed on testreduce host [puppet] - 10https://gerrit.wikimedia.org/r/647352 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn)
[20:34:55] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: analytics-meta-replica on an-coord1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:35:11] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/647265 (https://phabricator.wikimedia.org/T234565) (owner: 10Filippo Giunchedi)
[20:35:46] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: eqiad: Physical moves for MediaWiki servers - https://phabricator.wikimedia.org/T266164 (10Dzahn) @Cmjohnson Should this stay open for mw1313-mw1316 or did we solve the issue by moving other servers now?
[20:36:44] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "was missing during npm test-install   https://puppet-compiler.wmflabs.org/compiler1001/27061/" [puppet] - 10https://gerrit.wikimedia.org/r/647352 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn)
[20:38:22] <wikibugs>	 (03PS3) 10Dzahn: parsoid/testreduce: ensure make is installed on testreduce host [puppet] - 10https://gerrit.wikimedia.org/r/647352 (https://phabricator.wikimedia.org/T257906)
[20:38:39] <wikibugs>	 (03PS1) 10Elukey: analytics-meta: Avoid replication of superset_staging db when running as replica [puppet] - 10https://gerrit.wikimedia.org/r/647358
[20:40:19] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] parsoid/testreduce: ensure make is installed on testreduce host [puppet] - 10https://gerrit.wikimedia.org/r/647352 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn)
[20:43:01] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:44:06] <wikibugs>	 (03CR) 10Ryan Kemper: "I think gehel is right that a timestamp should be a gauge and not a counter. Reading https://www.robustperception.io/are-increasing-timest" [puppet] - 10https://gerrit.wikimedia.org/r/646888 (https://phabricator.wikimedia.org/T269204) (owner: 10Ryan Kemper)
[20:44:52] <wikibugs>	 (03CR) 10Dzahn: ntp: replace hiera() with lookup(), move use_chrony to parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[20:47:51] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:54:27] <rzl>	 ^ that's https://phabricator.wikimedia.org/T269693
[20:55:55] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, couple of nits inline" (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/647245 (https://phabricator.wikimedia.org/T269672) (owner: 10Jbond)
[20:56:37] <mutante>	 rzl: thanks, yea, with every cron we convert to a timer we get more "systemd state" alerts due to "former cron that we never noticed had issues", on other non-mwmaint* hosts as well. I am not sure yet whether that's fine as it is, or whether there should be email for failed timers instead of generic "systemd alerts", or both
[20:57:17] <rzl>	 yeah agreed -- it's good that we have monitoring for these now, but it also means we have to actually track it down when we fail :P
[20:57:20] <rzl>	 *when they fail
[20:57:23] <rzl>	 when we do too, I suppose
[20:58:13] <rzl>	 in particular I think it means telling the difference between "transient failure, next run succeeded, alert resolves on its own" and cases like this where the job is consistently broken
[21:00:05] <jouncebot>	 chrisalbon and accraze: Your horoscope predicts another unfortunate Services – Graphoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201209T2100).
[21:00:51] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:01:07] <wikibugs>	 (03CR) 10Volans: "A couple of questions/comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/647281 (https://phabricator.wikimedia.org/T269563) (owner: 10Jbond)
[21:09:02] <mutante>	 rzl: Yes, one strategy can be just "systemctl reset-failed" it once and then see if it comes back 
[21:09:46] <mutante>	 a lot of them just clear up from transient failure in the past, some do not
[21:10:04] <rzl>	 sure, although in this case journalctl shows a ton of failures in a row -- no need to even try reset-failed, it definitely won't help
[21:11:40] <mutante>	 true
[21:11:54] <rzl>	 partly I'm trying to figure out if there's some way to automate some of this -- having to ssh into the machine and investigate every time that alert fires is a drag
[21:12:12] <rzl>	 if the alert could even just include the unit name, that would be a big help
[21:12:22] <rzl>	 but unit name plus the number of consecutive failures, that would really get us somewhere
[21:12:25] <mutante>	 that would be a reason to want a notification that already includes the job name
[21:12:31] <rzl>	 (consecutive failures when it's a timer, I mean)
[21:12:31] <mutante>	 i just don't like email as a protocol :p
[21:12:46] <rzl>	 email is definitely the wrong way to go, yes
[21:13:10] <mutante>	 or we make an icinga check specific to timers
[21:13:16] <mutante>	 that runs list-timers and parses it
[21:13:32] <mutante>	 and that tells us right away which one it is ..here on IRC
[21:13:42] <rzl>	 mm, sure -- if we can also exclude timers from the "check systemd state" that would be a big step
[21:13:46] <mutante>	 or  list-units --state=failed 
[21:13:51] <mutante>	 and the top one
[21:14:12] <mutante>	 which would be more than timers
[21:14:18] <mutante>	 but to include the name of the failed service
[21:14:26] <rzl>	 yeah
[21:14:41] <rzl>	 in that case I think we'd want to include all of them, not just the top -- otherwise if a second unit fails, we'd never notice
[21:14:51] <rzl>	 but either way that should be doable
[21:15:32] <mutante>	 the multi-line output on IRC would be spammy and if we just alert for the first and you fix it.. you notice you get a new one :p
[21:15:39] <mutante>	 yea
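
The check discussed above (report every failed unit by name on a single line, plus a count of consecutive failures for timers) could be sketched roughly as follows. This is only an illustration under assumptions, not an existing Icinga plugin: the function names `parse_failed_units`, `trailing_failures` and `check_line` are hypothetical, the exit codes follow the usual Nagios-style convention (0 = OK, 2 = CRITICAL), and the run history passed to `trailing_failures` is assumed to be collected elsewhere (for example from the journal).

```python
from __future__ import annotations

import subprocess


def parse_failed_units(listing: str) -> list[str]:
    """Extract unit names from the output of
    `systemctl list-units --state=failed --plain --no-legend`.
    With --plain, the unit name is the first whitespace-separated field."""
    units = []
    for line in listing.splitlines():
        fields = line.split()
        if fields:
            units.append(fields[0])
    return units


def trailing_failures(results: list[bool]) -> int:
    """Count consecutive failures at the end of a run history
    (oldest first, True = success). Hypothetical helper: how the
    history is gathered is left to the caller."""
    count = 0
    for ok in reversed(results):
        if ok:
            break
        count += 1
    return count


def check_line(units: list[str]) -> tuple[int, str]:
    """Build a single-line status (to keep IRC output short) that still
    names every failed unit, so a second failure is not hidden."""
    if not units:
        return 0, "OK - no failed units"
    return 2, "CRITICAL - failed units: " + ", ".join(sorted(units))


def main() -> int:
    # Ask systemd for failed units; --plain and --no-legend keep the
    # output easy to parse.
    out = subprocess.run(
        ["systemctl", "list-units", "--state=failed", "--plain", "--no-legend"],
        capture_output=True,
        text=True,
        check=False,
    ).stdout
    code, message = check_line(parse_failed_units(out))
    print(message)
    return code
```

Calling `main()` on a host and exiting with its return value would give one alert line that names all failed units. A history such as `[True, False, False]` yields `trailing_failures(...) == 2`, which is one way to tell a consistently broken job from a transient failure.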
[21:17:40] <wikibugs>	 (03CR) 10Herron: [C: 03+1] WIP logstash: add ulogd ecs filter + tests [puppet] - 10https://gerrit.wikimedia.org/r/647265 (https://phabricator.wikimedia.org/T234565) (owner: 10Filippo Giunchedi)
[21:17:42] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "This IS actually used in production. Here it is compiled on dns3001:" [puppet] - 10https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[21:19:03] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "I don't know why I thought role::dnsbox wasn't used in prod, I think there was a compiler failure or pebkac where it wouldn't find matchin" [puppet] - 10https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[21:22:30] <mutante>	 rzl: regarding the actual content of the ticket, that failed WDQS job, I saw that same thing but since it started during the reimaging work and it was about failing to fetch some lag data from them.. I definitely expected that would be one-time and go away by itself once the reimaging is over. seeing that is not the case..is surprising
[21:23:33] <wikibugs>	 10Operations, 10MW-on-K8s, 10TechCom-RFC, 10serviceops, and 2 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10Krinkle) The service is mainly for executing shell commands. Right now that happens through `wfShellExec`. Typically to invoke program...
[21:23:35] <rzl>	 yeah agree -- I haven't started digging into the job itself, I was hoping I could get someone more familiar to take it over
[21:24:39] <wikibugs>	 10Operations, 10MW-on-K8s, 10serviceops: Sandbox/limit child processes within a container runtime - https://phabricator.wikimedia.org/T252745 (10Krinkle)
[21:24:42] <wikibugs>	 10Operations, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, and 2 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10Krinkle) 05Open→03Resolved
[21:24:56] <wikibugs>	 10Operations, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, and 2 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10Krinkle) >>! In T260330#6632099, @Krinkle wrote: > Put on Last Call until 2 December.  This RFC has been approved and is now closed.
[21:25:14] <mutante>	 I did debug a little bit but I lost it in IRC backlog.. hmm
[21:25:34] <mutante>	 oh, of course we have public logs.. will paste it on ticket
[21:26:03] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10Jclark-ctr) Replaced more Dimms per HP no errors at this
[21:26:32] <mutante>	 it tries to get "lag data" from prometheus and that is what fails
[21:26:51] <mutante>	 and it runs every minute
[21:28:20] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service: mediawiki_job_wikidata-updateQueryServiceLag failing - https://phabricator.wikimedia.org/T269693 (10Dzahn) ` [20:17:19] <mutante>  ryankemper: I guess it makes sense that "job_wikidata-updateQueryServiceLag" could not run during current work [20:19:46] <rya...
[21:31:01] <ryankemper>	 ^ I'll take a look at the `job_wikidata-updateQueryServiceLag`
[21:31:42] <ryankemper>	 There's some context in the description of https://phabricator.wikimedia.org/T269204 that mentions that `blazegraph_lastupdated` is now `blazegraph_lastupdated_total`, so if the job has to do with that metric then the re-image is likely the source of the problem
[21:33:24] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service: mediawiki_job_wikidata-updateQueryServiceLag failing - https://phabricator.wikimedia.org/T269693 (10RKemper) There's some context in the description of https://phabricator.wikimedia.org/T269204 that mentions that the counter metric `blazegraph_lastupdated`...
[21:35:01] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service: mediawiki_job_wikidata-updateQueryServiceLag failing - https://phabricator.wikimedia.org/T269693 (10Dzahn) ` ExecStart=/usr/local/bin/mw-cli-wrapper /usr/local/bin/mwscript extensions/Wikidata.org/maintenance/updateQueryServiceLag.php --wiki wikidatawiki --...
[21:35:23] <rzl>	 ryankemper: thank you!
[21:37:20] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service: mediawiki_job_wikidata-updateQueryServiceLag failing - https://phabricator.wikimedia.org/T269693 (10Dzahn) I ran the command manually as the same user, www-data. The error is simply "Failed to get lag from prometheus".  ` @mwmaint1002:~# sudo -u www-data /u...
[21:37:23] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot (issue continues after board change) 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10Jclark-ctr) 05Open→03Resolved
[21:39:08] <mutante>	 ryankemper: cool! so I ran that command manually but the error is simply "failed to get from prometheus"
[21:39:20] <mutante>	 but sounds like that can match what you said, ack
[21:39:22] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service: mediawiki_job_wikidata-updateQueryServiceLag failing - https://phabricator.wikimedia.org/T269693 (10RKemper) Job lives here: https://github.com/wikimedia/mediawiki-extensions-Wikidata.org/blob/60c5f96ebf424b792077bb7c6b533a68702e7aea/maintenance/updateQuery...
[21:40:42] <bstorm>	 !log shutting down labstore1006 for maintenance T268285
[21:40:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:46] <stashbot>	 T268285: update RAID controller firmware on labstore1006, 1007 - https://phabricator.wikimedia.org/T268285
[21:46:15] <wikibugs>	 (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/647364 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[21:46:51] <wikibugs>	 (03CR) 10Ryan Kemper: "So I think the main "downstream" consumer of this metric we need to worry about is `mediawiki_job_wikidata-updateQueryServiceLag`: https:/" [puppet] - 10https://gerrit.wikimedia.org/r/646888 (https://phabricator.wikimedia.org/T269204) (owner: 10Ryan Kemper)
[21:47:33] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:47:47] <ryankemper>	 mutante: makes sense, since that error means that you hit this codepath: https://github.com/wikimedia/mediawiki-extensions-Wikidata.org/blob/60c5f96ebf424b792077bb7c6b533a68702e7aea/maintenance/updateQueryServiceLag.php#L84-L90
[21:48:28] <mutante>	 ryankemper: hah! yea, that does it:)
[21:48:42] <ryankemper>	 (just thinking out loud here) so this metric should probably be a gauge and not a counter anyway, since counters are for metrics where you don't care about the absolute value but rather just the incrementing over time
[21:49:27] <ryankemper>	 the reason it's breaking now is that https://github.com/prometheus/client_python/commit/a4dd93bcc6a0422e10cfa585048d1813909c6786 now forces counters to be suffixed with `_total`. By switching to a gauge we shouldn't need to rename the metric, but I need to make sure that it being a gauge will play nicely with the job
[21:50:43] <ryankemper>	 I don't fully understand the logic of this `getLag` here: https://github.com/wikimedia/mediawiki-extensions-Wikidata.org/blob/60c5f96ebf424b792077bb7c6b533a68702e7aea/src/QueryServiceLag/WikimediaPrometheusQueryServiceLagProvider.php#L65 specifically I don't quite get why it's doing a `count` on the response from `getLags` (the other logic of `floor` etc makes sense to me)
[21:50:44] <mutante>	 *nod* this part is maybe good for -observability channel as well
[21:51:14] <ryankemper>	 that's a good idea! I'll transport the above context over there
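
As a consumer-side illustration of the rename described above, a reader of the metric could accept either name while the exporter is being fixed. This is a minimal sketch under assumptions and not the logic of `updateQueryServiceLag.php`: the `samples` mapping (metric name to last-updated Unix timestamp) and the functions `last_updated` and `lag_seconds` are hypothetical. Only the two metric names come from the discussion.

```python
from __future__ import annotations

import math

# Names mentioned in the discussion: the original metric and the
# `_total`-suffixed name produced after the client library change.
CANDIDATE_NAMES = ("blazegraph_lastupdated", "blazegraph_lastupdated_total")


def last_updated(samples: dict[str, float]) -> float | None:
    """Return the last-updated timestamp under whichever candidate name
    is present, preferring the unsuffixed one; None if neither exists."""
    for name in CANDIDATE_NAMES:
        if name in samples:
            return samples[name]
    return None


def lag_seconds(samples: dict[str, float], now: float) -> int | None:
    """Lag in whole seconds between `now` and the last update, never
    negative. Returns None when the metric is missing, which is where
    a "failed to get lag" style error would be raised."""
    timestamp = last_updated(samples)
    if timestamp is None:
        return None
    return max(0, math.floor(now - timestamp))
```

With `samples = {"blazegraph_lastupdated_total": 100.0}` and `now = 130.5`, `lag_seconds` returns 30. With an empty mapping it returns `None` rather than a misleading number.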
[21:54:09] <icinga-wm>	 PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:54:58] <mutante>	 And I reimaged this server and all seemed to go great, but it is still on stretch because I forgot the DHCP change.. starting over.
[21:56:32] <wikibugs>	 (03PS1) 10Dzahn: DHCP: switch mw2243 to buster installer [puppet] - 10https://gerrit.wikimedia.org/r/647367 (https://phabricator.wikimedia.org/T245757)
[21:56:49] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:57:14] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] DHCP: switch mw2243 to buster installer [puppet] - 10https://gerrit.wikimedia.org/r/647367 (https://phabricator.wikimedia.org/T245757) (owner: 10Dzahn)
[21:59:37] <icinga-wm>	 PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:00:05] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/647364 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[22:00:37] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] icinga/raid.pp: Add Python3 requirements for raid_handler.py [puppet] - 10https://gerrit.wikimedia.org/r/647364 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[22:01:55] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw2...
[22:02:48] <wikibugs>	 (03PS3) 10Dzahn: query_service/updater: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/645203 (https://phabricator.wikimedia.org/T209953)
[22:03:57] <mutante>	 ryankemper: If you don't mind, I will now merge a change to the WDQS puppet code - it will be a noop, I compiled it, and it has +1s
[22:04:07] <icinga-wm>	 RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:04:14] <mutante>	 no change to the actual service, just puppet code
[22:04:14] <ryankemper>	 mutante: go ahead
[22:04:19] <mutante>	 ack, thx
[22:04:25] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s1 on db1139 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:04:25] <icinga-wm>	 PROBLEM - mysqld processes on db1139 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[22:04:35] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] query_service/updater: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/645203 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[22:05:03] <icinga-wm>	 RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:05:38] <mutante>	 still confirming manually on one of the hosts each time
[22:06:03] <mutante>	 yea, nothing changed on wdqs1003 
[22:06:21] <wikibugs>	 (03CR) 10Dzahn: "noop on wdqs1003 as expected" [puppet] - 10https://gerrit.wikimedia.org/r/645203 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[22:06:47] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s6 on db1139 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:09:21] <icinga-wm>	 PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:10:43] <icinga-wm>	 PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:12:37] <wikibugs>	 (03CR) 10Dzahn: "hmm.. it's kind of convincing to just drop it if it's truly optional. Just based on the comments it was shown in error messages. It could " [puppet] - 10https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) (owner: 10Reedy)
[22:12:39] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s1 on db1139 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:12:47] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:13:48] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gerrit: open link in new window [puppet] - 10https://gerrit.wikimedia.org/r/631237 (owner: 10Paladox)
[22:15:03] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s6 on db1139 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:17:56] <wikibugs>	 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): update RAID controller firmware on labstore1006, 1007 - https://phabricator.wikimedia.org/T268285 (10Bstorm) The box has very recent firmware already, apparently. 😦
[22:18:05] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:18:27] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s6 on db1139 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:21:41] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db1139 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[22:22:09] <icinga-wm>	 PROBLEM - MariaDB read only s1 on db1139 is CRITICAL: Could not connect to localhost:3311 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[22:24:31] <icinga-wm>	 PROBLEM - MariaDB read only s6 on db1139 is CRITICAL: Could not connect to localhost:3316 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[22:25:00] <wikibugs>	 (03CR) 10CRusnov: "I have tested the critical path where the test is expected to get data from zlib (and subprocess), where encoding issues would occur and i" [puppet] - 10https://gerrit.wikimedia.org/r/647369 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov)
[22:25:55] <icinga-wm>	 RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:28:15] <icinga-wm>	 RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:30:07] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime
[22:30:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:32:09] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[22:32:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:07] <wikibugs>	 (03CR) 10Jeena Huneidi: "I left a comment on the service which I think is more of a question for SRE on whether we should be exposing the non-webui ports as a Node" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/640571 (https://phabricator.wikimedia.org/T265526) (owner: 10Mstyles)
[22:41:20] <wikibugs>	 (03CR) 10Paladox: [C: 03+1] gerrit: disable autogc when receiving packs [puppet] - 10https://gerrit.wikimedia.org/r/647191 (owner: 10Hashar)
[22:44:16] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gerrit: disable autogc when receiving packs [puppet] - 10https://gerrit.wikimedia.org/r/647191 (owner: 10Hashar)
[22:45:46] <wikibugs>	 (03PS3) 10Razzi: Add kafka-test1007 virtual machine [puppet] - 10https://gerrit.wikimedia.org/r/647109 (https://phabricator.wikimedia.org/T268202)
[22:45:57] <wikibugs>	 (03CR) 10Jeena Huneidi: "I think this deserves an increment to the chart minor version in Chart.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/647355 (owner: 10Ahmon Dancy)
[22:46:59] <wikibugs>	 (03CR) 10Jeena Huneidi: "I think this deserves an increment to the chart patch version in Chart.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/647353 (owner: 10Ahmon Dancy)
[22:47:14] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Add .webm in files.viewable-mime-types of Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/569627 (https://phabricator.wikimedia.org/T215360) (owner: 10Zoranzoki21)
[22:48:59] <icinga-wm>	 RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:50:01] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2018 is CRITICAL: CRITICAL - load average: 107.08, 101.42, 98.10 https://wikitech.wikimedia.org/wiki/Swift
[22:50:14] <wikibugs>	 (03PS2) 10Ahmon Dancy: Enable $wgShowHostnames [deployment-charts] - 10https://gerrit.wikimedia.org/r/647353
[22:50:17] <wikibugs>	 (03PS2) 10Ahmon Dancy: Update Chart.yaml source references [deployment-charts] - 10https://gerrit.wikimedia.org/r/647354
[22:50:19] <wikibugs>	 (03PS2) 10Ahmon Dancy: 0.0.8: Add ENABLE_DEBUG_LOGGING setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/647355
[22:56:40] <wikibugs>	 (03PS3) 10Ahmon Dancy: 0.1.0: Add ENABLE_DEBUG_LOGGING setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/647355
[22:58:01] <wikibugs>	 (03PS5) 10Dzahn: Remove apache config for zero.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716) (owner: 10MaxSem)
[22:59:20] <wikibugs>	 (03PS6) 10Dzahn: Remove apache config for zero.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716) (owner: 10MaxSem)
[23:00:58] <wikibugs>	 (03CR) 10Jeena Huneidi: [C: 03+2] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/647354 (owner: 10Ahmon Dancy)
[23:01:01] <wikibugs>	 (03PS1) 10Dzahn: remove zero.wikimedia.org from beta sites [puppet] - 10https://gerrit.wikimedia.org/r/647379 (https://phabricator.wikimedia.org/T187716)
[23:01:54] <wikibugs>	 (03PS2) 10Dzahn: remove zero.wikimedia.beta.wmflabs.org from beta sites [puppet] - 10https://gerrit.wikimedia.org/r/647379 (https://phabricator.wikimedia.org/T187716)
[23:02:27] <wikibugs>	 (03CR) 10Dzahn: [V: 03+2] remove zero.wikimedia.beta.wmflabs.org from beta sites [puppet] - 10https://gerrit.wikimedia.org/r/647379 (https://phabricator.wikimedia.org/T187716) (owner: 10Dzahn)
[23:02:34] <wikibugs>	 (03CR) 10Dzahn: [V: 03+2 C: 03+2] remove zero.wikimedia.beta.wmflabs.org from beta sites [puppet] - 10https://gerrit.wikimedia.org/r/647379 (https://phabricator.wikimedia.org/T187716) (owner: 10Dzahn)
[23:04:06] <mutante>	 !log zero.wikimedia.beta.wmflabs.org removed from beta_sites (deployment-prep) T187716
[23:04:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:04:11] <stashbot>	 T187716: Sunset Wikipedia Zero - https://phabricator.wikimedia.org/T187716
[23:04:53] <wikibugs>	 (03PS7) 10Dzahn: Remove apache config for zero.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/524088 (https://phabricator.wikimedia.org/T187716) (owner: 10MaxSem)
[23:07:25] <wikibugs>	 (03PS3) 10Dzahn: wikistats: use script and separate config to dump mysql from timer [puppet] - 10https://gerrit.wikimedia.org/r/646876
[23:07:32] <wikibugs>	 (03PS1) 10Mforns: Migrate HelpPanel schema from EventLogging to EventGate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647382 (https://phabricator.wikimedia.org/T267333)
[23:07:39] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wikistats: use script and separate config to dump mysql from timer [puppet] - 10https://gerrit.wikimedia.org/r/646876 (owner: 10Dzahn)
[23:09:18] <wikibugs>	 (03PS4) 10Dzahn: wikistats: use script and separate config to dump mysql from timer [puppet] - 10https://gerrit.wikimedia.org/r/646876
[23:09:53] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:13:00] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2243.codfw.wmnet'] `  and were **ALL** successful.
[23:13:39] <wikibugs>	 (03PS1) 10Mforns: Migrate HomepageModule schema from EventLogging to EventGate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647383 (https://phabricator.wikimedia.org/T267333)
[23:15:21] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] wikireplicas and toolsdb: close connections when done with them [puppet] - 10https://gerrit.wikimedia.org/r/647285 (https://phabricator.wikimedia.org/T269620) (owner: 10Bstorm)
[23:15:57] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=parse2001.codfw.wmnet
[23:15:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:16:41] <mutante>	 !log repooling parse2001 after buster reimage - T245757
[23:16:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:16:44] <stashbot>	 T245757: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757
[23:17:25] <wikibugs>	 (03PS1) 10Mforns: Migrate HomepageVisit schema from EventLogging to EventGate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647384 (https://phabricator.wikimedia.org/T267333)
[23:19:45] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2243.codfw.wmnet
[23:19:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:20:45] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn)
[23:21:54] <mutante>	 !log repooling parse2001 after buster reimage - T268524
[23:21:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:21:57] <stashbot>	 T268524: Upgrade Parsoid servers to buster  - https://phabricator.wikimedia.org/T268524
[23:24:57] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) >>! In T245757#6645352, @jijiki wrote: > @Dzahn @hnowlan After discussing with @Muehlenhoff, since w...
[23:26:53] <wikibugs>	 (03PS1) 10Mforns: Migrate ServerSideAccountCreation schema from EventLogging to EventGate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647386 (https://phabricator.wikimedia.org/T267333)
[23:32:00] <wikibugs>	 (03PS1) 10Bstorm: maintain-dbusers: apply black formatting [puppet] - 10https://gerrit.wikimedia.org/r/647388
[23:33:54] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] maintain-dbusers: apply black formatting [puppet] - 10https://gerrit.wikimedia.org/r/647388 (owner: 10Bstorm)
[23:42:35] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be2018 is CRITICAL: CRITICAL - load average: 101.10, 103.59, 101.76 https://wikitech.wikimedia.org/wiki/Swift
[23:48:34] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to deployment for Kosta Harlan - https://phabricator.wikimedia.org/T269731 (10kaldari) I approve as well in case it matters ;)
[23:49:09] <wikibugs>	 (03PS4) 10Ahmon Dancy: 0.1.0: Add ENABLE_DEBUG_LOGGING setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/647355
[23:51:02] <wikibugs>	 (03PS3) 10Ahmon Dancy: Update Chart.yaml source references [deployment-charts] - 10https://gerrit.wikimedia.org/r/647354
[23:54:54] <wikibugs>	 (03CR) 10Jeena Huneidi: [C: 03+2] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/647353 (owner: 10Ahmon Dancy)
[23:56:15] <wikibugs>	 (03Merged) 10jenkins-bot: Enable $wgShowHostnames [deployment-charts] - 10https://gerrit.wikimedia.org/r/647353 (owner: 10Ahmon Dancy)