[00:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210126T0000).
[00:00:04] <jouncebot>	 legoktm: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:13] <wikibugs>	 (CR) Dzahn: [C: +2] parsoid::testing: switch db_host from m5-master to localhost [puppet] - https://gerrit.wikimedia.org/r/654565 (https://phabricator.wikimedia.org/T266509) (owner: Dzahn)
[00:00:23] <wikibugs>	 (PS2) Dzahn: parsoid::testing: switch db_host from m5-master to localhost [puppet] - https://gerrit.wikimedia.org/r/654565 (https://phabricator.wikimedia.org/T266509)
[00:00:37] <dont|panic>	 I'm here for a last minute patch too :P
[00:01:02] <legoktm>	 hi
[00:01:15] <legoktm>	 I can deploy stuff today
[00:01:26] <wikibugs>	 (CR) Subramanya Sastry: [C: +1] ATS: re-add config for parsoid-rt-tests.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/654351 (https://phabricator.wikimedia.org/T266509) (owner: Dzahn)
[00:01:56] <wikibugs>	 (CR) Subramanya Sastry: [C: +1] Revert "remove parsoid-rt-tests.wikimedia.org" [dns] - https://gerrit.wikimedia.org/r/653998 (https://phabricator.wikimedia.org/T266509) (owner: Dzahn)
[00:02:08] <legoktm>	 oh cool, more logo stuffs
[00:03:06] <dont|panic>	 this one should work, as I literally copied and renamed arbcom-ru's one :P
[00:03:21] <wikibugs>	 (CR) Legoktm: [C: +2] arbcom_enwiki: Change favicon to a renamed copy of arbcom_ruwiki.ico [mediawiki-config] - https://gerrit.wikimedia.org/r/658463 (https://phabricator.wikimedia.org/T272920) (owner: Tks4Fish)
[00:03:52] <wikibugs>	 SRE, Parsoid-Tests, serviceops, Parsoid (Tracking), Patch-For-Review: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (ssastry)
[00:04:07] <wikibugs>	 (Merged) jenkins-bot: arbcom_enwiki: Change favicon to a renamed copy of arbcom_ruwiki.ico [mediawiki-config] - https://gerrit.wikimedia.org/r/658463 (https://phabricator.wikimedia.org/T272920) (owner: Tks4Fish)
[00:04:40] <legoktm>	 dont|panic: on mwdebug1002
[00:05:07] <dont|panic>	 looks good :)
[00:05:11] <legoktm>	 same
[00:06:51] <wikibugs>	 (CR) Dzahn: "config changed on scandium - noop on testreduce1001" [puppet] - https://gerrit.wikimedia.org/r/654565 (https://phabricator.wikimedia.org/T266509) (owner: Dzahn)
[00:07:22] <logmsgbot>	 !log legoktm@deploy1001 Synchronized static/favicon/arbcom_enwiki.ico: T272920: arbcom_enwiki: Change favicon to a renamed copy of arbcom_ruwiki.ico (1/2) (duration: 01m 00s)
[00:07:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:27] <stashbot>	 T272920: arbcom-en.wikipedia.org change favicon - https://phabricator.wikimedia.org/T272920
[00:08:17] <dont|panic>	 thanks a bunch, legoktm :)
[00:08:27] <legoktm>	 hold on one more sync
[00:08:40] <dont|panic>	 oh, right
[00:08:42] <logmsgbot>	 !log legoktm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T272920: arbcom_enwiki: Change favicon to a renamed copy of arbcom_ruwiki.ico (2/2) (duration: 00m 58s)
[00:08:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:01] <dont|panic>	 yep, now it's okay, thanks :D
[00:09:05] <legoktm>	 technically you're supposed to split this into two Gerrit patches but it's okay because I'm also breaking that rule today
[00:09:27] <wikibugs>	 (PS3) Legoktm: Drop obsolete requirements.txt and setup.py [mediawiki-config] - https://gerrit.wikimedia.org/r/657954
[00:09:29] <wikibugs>	 (PS3) Legoktm: Split $wmgSiteLogo{1,1_5,2}x to a separate logos.php [mediawiki-config] - https://gerrit.wikimedia.org/r/657955
[00:09:31] <wikibugs>	 (PS7) Legoktm: Add script to mostly automate logo management [mediawiki-config] - https://gerrit.wikimedia.org/r/657956 (https://phabricator.wikimedia.org/T98640)
[00:09:48] <wikibugs>	 (CR) Legoktm: [C: +2] "no-op" [mediawiki-config] - https://gerrit.wikimedia.org/r/657954 (owner: Legoktm)
[00:09:58] <wikibugs>	 (CR) Legoktm: [C: +2] Split $wmgSiteLogo{1,1_5,2}x to a separate logos.php [mediawiki-config] - https://gerrit.wikimedia.org/r/657955 (owner: Legoktm)
[00:10:06] <dont|panic>	 oh, I saw the previous patch with it and thought one could do it in one patch
[00:10:13] <dont|panic>	 I'll keep that in mind for the next one :)
[00:10:38] <wikibugs>	 (Merged) jenkins-bot: Drop obsolete requirements.txt and setup.py [mediawiki-config] - https://gerrit.wikimedia.org/r/657954 (owner: Legoktm)
[00:10:49] <wikibugs>	 (Merged) jenkins-bot: Split $wmgSiteLogo{1,1_5,2}x to a separate logos.php [mediawiki-config] - https://gerrit.wikimedia.org/r/657955 (owner: Legoktm)
[00:10:51] <legoktm>	 https://wikitech.wikimedia.org/wiki/Backport_windows#Guidelines 
[00:10:52] <legoktm>	 Single patches that require more than one sync - in other words, changes to multiple files which depend on each other.
[00:10:52] <legoktm>	     Instead, please break up the patches into multiple safe patches that can be deployed by themselves. See: task T187761
[00:10:52] <stashbot>	 T187761: Proposal: Effective immediately, disallow multi-sync patch deployment - https://phabricator.wikimedia.org/T187761
[00:11:18] <legoktm>	 yeah, not a big deal, it's just that I literally can't do it in one sync command because they're in different directories
[00:11:29] <legoktm>	 so typically you have one commit that adds the logo and the next that changes the config
[00:12:18] <dont|panic>	 ohhh okay
[00:12:31] <dont|panic>	 sorry, won't happen again
[00:12:53] <dont|panic>	 I even abandoned a patch as it hadn't uploaded the logo lol
[00:13:18] <legoktm>	 :))
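
[Editor's sketch] The remark at 00:11:18, that files in different directories cannot go out in one sync command, can be illustrated with a small helper. This is a minimal sketch and not part of scap or any Wikimedia tooling: `syncs_needed` is a hypothetical name, and grouping by top-level directory is an assumption drawn only from the two paths that appear in this log.

```python
from collections import defaultdict


def syncs_needed(changed_paths):
    """Group a patch's changed files by top-level directory (hypothetical helper).

    Assumption: one sync covers files under a single top-level directory,
    which is how the 00:11:18 remark reads; real tooling may differ.
    """
    groups = defaultdict(list)
    for path in changed_paths:
        top_level = path.split("/", 1)[0]
        groups[top_level].append(path)
    return dict(groups)


# The favicon patch from this log touched two directories.
favicon_patch = [
    "static/favicon/arbcom_enwiki.ico",
    "wmf-config/InitialiseSettings.php",
]
groups = syncs_needed(favicon_patch)
if len(groups) > 1:
    print(f"{len(groups)} syncs needed; consider one commit per directory")
```

A patch that yields more than one group is a candidate for splitting into one commit that adds the asset and a second that changes the config, as described at 00:11:29.
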
[00:14:50] <logmsgbot>	 !log legoktm@deploy1001 Synchronized wmf-config/logos.php: Split $wmgSiteLogo{1,1_5,2}x to a separate logos.php (1/2) (duration: 00m 56s)
[00:14:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:24] <logmsgbot>	 !log legoktm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Split $wmgSiteLogo{1,1_5,2}x to a separate logos.php (2/2) (duration: 01m 00s)
[00:16:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:00] <wikibugs>	 (CR) Legoktm: [C: +2] Add script to mostly automate logo management [mediawiki-config] - https://gerrit.wikimedia.org/r/657956 (https://phabricator.wikimedia.org/T98640) (owner: Legoktm)
[00:19:06] <wikibugs>	 (Merged) jenkins-bot: Add script to mostly automate logo management [mediawiki-config] - https://gerrit.wikimedia.org/r/657956 (https://phabricator.wikimedia.org/T98640) (owner: Legoktm)
[00:20:20] <wikibugs>	 SRE, Parsoid, Parsoid-Tests, serviceops, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (Dzahn) Resolved→Open
[00:23:39] <wikibugs>	 (PS1) Legoktm: Invalidate configuration cache when logos.php is touched too [mediawiki-config] - https://gerrit.wikimedia.org/r/658466
[00:25:16] <wikibugs>	 (CR) jerkins-bot: [V: -1] Invalidate configuration cache when logos.php is touched too [mediawiki-config] - https://gerrit.wikimedia.org/r/658466 (owner: Legoktm)
[00:25:53] <wikibugs>	 (PS2) Legoktm: Invalidate configuration cache when logos.php is touched too [mediawiki-config] - https://gerrit.wikimedia.org/r/658466
[00:27:58] <wikibugs>	 SRE, Parsoid, Parsoid-Tests, serviceops, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (ssastry) As part of this work, scandium puppet code was split into two pieces: (a) retain app-server config on scandium (b)...
[00:28:49] <wikibugs>	 (CR) Legoktm: [C: +2] Invalidate configuration cache when logos.php is touched too [mediawiki-config] - https://gerrit.wikimedia.org/r/658466 (owner: Legoktm)
[00:29:38] <wikibugs>	 (Merged) jenkins-bot: Invalidate configuration cache when logos.php is touched too [mediawiki-config] - https://gerrit.wikimedia.org/r/658466 (owner: Legoktm)
[00:30:34] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:02] <legoktm>	 ok, now it's working
[00:32:22] <logmsgbot>	 !log legoktm@deploy1001 Synchronized wmf-config/logos.php: Add script to mostly automate logo management (duration: 00m 55s)
[00:32:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:06] <logmsgbot>	 !log legoktm@deploy1001 Synchronized wmf-config/CommonSettings.php: Invalidate configuration cache when logos.php is touched too (duration: 00m 56s)
[00:34:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:21] <legoktm>	 I believe that's everything
[00:37:06] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:37:11] <wikibugs>	 SRE, Parsoid, Parsoid-Tests, serviceops, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (Dzahn) /etc/testreduce does not exist at all on scandium, so that doesn't seem to be a puppetization issue.  The mysql conf...
[00:40:46] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:42:56] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:42:58] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2320.codfw.wmnet
[00:43:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:15] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2319.codfw.wmnet
[00:43:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:39] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2318.codfw.wmnet
[00:43:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:04] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2331.codfw.wmnet
[00:44:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:20] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for MewOphaswongse - https://phabricator.wikimedia.org/T272912 (Legoktm) Stalled→Open a:Legoktm Verified the account was created by ITS: https://meta.wikimedia.org/w/index.php?title=Special:Log&logid=39709548
[00:46:26] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2320.codfw.wmnet
[00:46:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:16] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2319.codfw.wmnet
[00:47:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:20] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:47:37] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2318.codfw.wmnet
[00:47:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:06] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2331.codfw.wmnet
[00:48:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:28] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:50:52] <wikibugs>	 (PS1) Legoktm: admin: Add mewoph to list of privledged LDAP users [puppet] - https://gerrit.wikimedia.org/r/658469 (https://phabricator.wikimedia.org/T272912)
[00:56:08] <wikibugs>	 SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to ldap/wmf for MewOphaswongse - https://phabricator.wikimedia.org/T272912 (Legoktm) p:Triage→Medium
[01:07:59] <wikibugs>	 (PS2) Legoktm: superset: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/657917 (https://phabricator.wikimedia.org/T266479)
[01:09:17] <wikibugs>	 (CR) Legoktm: [C: +2] superset: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/657917 (https://phabricator.wikimedia.org/T266479) (owner: Legoktm)
[01:11:31] <wikibugs>	 (PS2) Legoktm: threedtopng: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/657903 (https://phabricator.wikimedia.org/T266479)
[01:12:04] <wikibugs>	 (CR) Legoktm: [C: +2] threedtopng: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/657903 (https://phabricator.wikimedia.org/T266479) (owner: Legoktm)
[01:12:46] <wikibugs>	 (PS2) Legoktm: udp2log: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/657904 (https://phabricator.wikimedia.org/T266479)
[01:14:21] <wikibugs>	 (CR) Legoktm: [C: +2] udp2log: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/657904 (https://phabricator.wikimedia.org/T266479) (owner: Legoktm)
[01:27:42] <wikibugs>	 (PS1) Aaron Schulz: Reword wmfEtcdApplyDBConfig() comments to better match those in LBFactoryMulti [mediawiki-config] - https://gerrit.wikimedia.org/r/658473
[01:32:06] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:32:12] <wikibugs>	 SRE, Parsoid, Parsoid-Tests, serviceops, Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (ssastry) Looks like the config files in /etc/testreduce/ are puppetized already!  I manually edited the config file /etc/te...
[01:34:23] <wikibugs>	 SRE, LDAP-Access-Requests: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (Legoktm) Hi @JTannerWMF I looked and it seems like you have two Developer accounts (aka wikitech/LDAP accounts): * https://ldap.toolforge.org/user/jtanner * https://ldap.toolforge.org/us...
[01:38:44] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:44:45] <wikibugs>	 (PS2) Legoktm: mailman3: Fix python package for mysql [puppet] - https://gerrit.wikimedia.org/r/657952 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup)
[01:51:42] <wikibugs>	 (CR) Legoktm: [C: +2] mailman3: Fix python package for mysql [puppet] - https://gerrit.wikimedia.org/r/657952 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup)
[01:53:57] <wikibugs>	 (PS1) Legoktm: codesearch: Configure port for puppet [puppet] - https://gerrit.wikimedia.org/r/658477 (https://phabricator.wikimedia.org/T272947)
[01:55:39] <tgr_>	 will there be a branch cut today, or is that delayed/skipped because of last week's rollback?
[01:57:00] <wikibugs>	 (CR) Legoktm: [C: +2] codesearch: Configure port for puppet [puppet] - https://gerrit.wikimedia.org/r/658477 (https://phabricator.wikimedia.org/T272947) (owner: Legoktm)
[02:01:19] <tgr_>	 (hm, for some reason I thought last week ended with the train blocked, but apparently not.)
[02:07:15] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.28 [core] (wmf/1.36.0-wmf.28) - https://gerrit.wikimedia.org/r/658479
[02:09:56] <wikibugs>	 SRE, Graphoid, serviceops, Platform Engineering (Icebox): Undeploy graphoid for phase 4 wiki's - https://phabricator.wikimedia.org/T270443 (Jseddon) Open→Resolved
[02:10:04] <wikibugs>	 SRE, Graphoid, serviceops, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), and 2 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Jseddon)
[02:13:35] <legoktm>	 tgr_: clearly the TrainBranchBot was listening to you ^^
[02:14:31] <tgr_>	 good bot.
[02:21:41] <wikibugs>	 (CR) Cwhite: [C: +1] rsyslog: send AM notifications logs to logstash [puppet] - https://gerrit.wikimedia.org/r/658308 (https://phabricator.wikimedia.org/T272474) (owner: Filippo Giunchedi)
[02:22:12] <wikibugs>	 (CR) Cwhite: [C: +1] alertmanager: add JSON logging of all notifications [puppet] - https://gerrit.wikimedia.org/r/658307 (https://phabricator.wikimedia.org/T272474) (owner: Filippo Giunchedi)
[02:31:24] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:55] <wikibugs>	 (PS2) DannyS712: Branch commit for wmf/1.36.0-wmf.28 [core] (wmf/1.36.0-wmf.28) - https://gerrit.wikimedia.org/r/658479 (https://phabricator.wikimedia.org/T271342) (owner: TrainBranchBot)
[02:38:28] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:20:44] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:22:56] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.056 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:30:06] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:31:52] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:32:10] <icinga-wm>	 PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert, rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[03:38:56] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:26] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 8.478 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:53:22] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:04:48] <wikibugs>	 SRE, vm-requests: <site>: <number> of VMs requested for <service> - https://phabricator.wikimedia.org/T272949 (Iupparand)
[04:31:36] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:32:34] <wikibugs>	 (PS1) Legoktm: zuul: Port zuul-test-repo to Python 3 [puppet] - https://gerrit.wikimedia.org/r/658485
[04:38:26] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:31:06] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:37:56] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:40:44] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 18 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[05:43:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set candidate master to weight 0 before the failover T271427', diff saved to https://phabricator.wikimedia.org/P13952 and previous config saved to /var/cache/conftool/dbconfig/20210126-054337-marostegui.json
[05:43:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:42] <stashbot>	 T271427: Switchover s4 (commonswiki) from db1081 to db1138 - https://phabricator.wikimedia.org/T271427
[06:00:15] <wikibugs>	 (CR) Marostegui: mariadb: Promote db1138 to s4 master [puppet] - https://gerrit.wikimedia.org/r/658211 (https://phabricator.wikimedia.org/T271427) (owner: Marostegui)
[06:00:20] <wikibugs>	 (PS2) Marostegui: mariadb: Promote db1138 to s4 master [puppet] - https://gerrit.wikimedia.org/r/658211 (https://phabricator.wikimedia.org/T271427)
[06:01:04] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db1138 to s4 master [puppet] - https://gerrit.wikimedia.org/r/658211 (https://phabricator.wikimedia.org/T271427) (owner: Marostegui)
[06:01:25] <wikibugs>	 (CR) Marostegui: wmnet: Update s4-master CNAME [dns] - https://gerrit.wikimedia.org/r/658213 (https://phabricator.wikimedia.org/T271427) (owner: Marostegui)
[06:01:29] <wikibugs>	 (PS2) Marostegui: wmnet: Update s4-master CNAME [dns] - https://gerrit.wikimedia.org/r/658213 (https://phabricator.wikimedia.org/T271427)
[06:31:16] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:33:05] <marostegui>	 In 30 minutes we'll failover s4 (commons) master
[06:38:14] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:16:04] <marostegui>	 jynus: what's the query at the moment? just a similar update to the one I built?
[07:16:21] <jynus>	 grep "# update section with section name from the former slave"
[07:16:34] <jynus>	 at /usr/lib/python3/dist-packages/wmfmariadbpy/cli_admin/switchover.py
[07:16:37] <jynus>	 or on repo
[07:17:04] <marostegui>	 ah I see, yeah, pretty much the same as the one I tried
[07:18:24] <jynus>	 https://phabricator.wikimedia.org/diffusion/OSMD/browse/master/wmfmariadbpy/cli_admin/switchover.py;38660e943a9167fe7d174f79806acf3a6e4d4f23$722?as=source&blame=off
[07:18:37] <jynus>	 we can just add an order by
[07:18:40] <jynus>	 or remove the limit
[07:18:46] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "This goes in the right direction, but if we start collecting data about specific endpoints, I'd rather generalize the idea a bit - see the" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/634207 (https://phabricator.wikimedia.org/T263727) (owner: Hnowlan)
[07:19:03] <jynus>	 it is one of those: the query is legit but the parser cannot tell the difference
[07:19:32] <jynus>	 as there should only be 1 row so it is a deterministic statement
[07:19:58] <jynus>	 but it is exactly why we cannot go to row on the primary dbs for mw
[07:25:39] <marostegui>	 I am trying without the limit
[07:26:28] <marostegui>	 Although I do like the limit :)
[07:27:42] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "LGTM overall, but I'd really get away from calling bash scripts altogether, and use python requests instead to fetch opcache data." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/657677 (owner: Legoktm)
[07:28:18] <wikibugs>	 (PS3) Effie Mouzeli: service_proxy: enable ipv6 on envoy config [puppet] - https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568)
[07:28:59] <jynus>	 we can do it without the limit and WARN if rows affected > 1
[07:30:07] <marostegui>	 I am not sure if without the limit the query would succeed though
[07:30:15] <jynus>	 ?
[07:31:25] <jynus>	 you think it is a db issue, not a query issue?
[07:32:37] <marostegui>	 yeah, let me confirm something
[07:33:08] <jynus>	 strange, binlog format says STATEMENT
[07:34:07] <jynus>	 but it should be "safe for binlog"
[07:34:23] <jynus>	 we can also enable unsafe statements- it should be ok for tendril db
[07:35:25] <marostegui>	 or just do a set session binlog_format=row for that query: https://phabricator.wikimedia.org/P13956
[07:36:25] <jynus>	 or go REPEATABLE-READ ?
[07:36:43] <jynus>	 can you try that too ^ just for curiosity
[07:36:47] <marostegui>	 sure thing
[07:37:08] <jynus>	 we may get a warning
[07:37:31] <marostegui>	 yeah, that works with the warning of: this might be unsafe
[07:37:39] <wikibugs>	 (PS1) Ladsgroup: Migrate hiera() to lookup() and set datatypes in purge.pp [puppet] - https://gerrit.wikimedia.org/r/658503 (https://phabricator.wikimedia.org/T209953)
[07:37:41] <jynus>	 so either of the 2
[07:37:54] <jynus>	 kormat to decide :-)
[07:37:55] <marostegui>	 So either the set session binlog or set session to repeatable-read can work to get this fixed without changing other things
[07:37:58] <marostegui>	 yeah
[07:38:22] <marostegui>	 I would go for set session binlog_format="ROW"; just to avoid the warning :)
[07:38:32] <jynus>	 OR we move zarcillo and configure the db properly
[07:38:52] <jynus>	 in a non-tokudb way, which I think is why it is in that level
[07:39:07] <jynus>	 one of the 3
[07:39:23] <marostegui>	 jynus: I would leave that for a medium-term approach and get db-switchover fixed with that small hack for now
[07:39:26] <wikibugs>	 (CR) Ladsgroup: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/658503 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[07:39:41] <jynus>	 i am just giving options
[07:40:04] <jynus>	 as in, I think there is an orchestrator db
[07:40:19] <jynus>	 it wouldn't be unthinkable to move it there
[07:40:28] <jynus>	 but it would require editing the file anyway
[07:40:32] <jynus>	 as config is hardcoded
[07:40:35] <kormat>	 jynus: orchestrator db is on the zarcillo 'section'
[07:40:36] <jynus>	 so no reason really
[07:41:01] <jynus>	 kormat, you mean it lives on db1115? 
[07:41:12] <jynus>	 or the one on codfw maybe?
[07:41:18] <kormat>	 on the codfw node, yeah
[07:41:36] <wikibugs>	 (PS1) Marostegui: db1160: Install stretch [puppet] - https://gerrit.wikimedia.org/r/658504 (https://phabricator.wikimedia.org/T258361)
[07:41:58] <jynus>	 so there you have 2 1/2 solutions, whatever you think is best (or manuel convicens you is best :-D)
[07:42:05] <jynus>	 *convinces
[07:42:14] <jynus>	 *tricks you into
[07:42:21] <jynus>	 :-)))
[07:42:50] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:43:11] <marostegui>	 But obviously, kormat to decide :)
[07:43:11] <marostegui>	 jynus: I am going to close the switchover ticket and create a follow up one for db-switchover fix, does that sound good?
[07:43:14] <jynus>	 new host has report_host, right?
[07:43:21] <jynus>	 new as in new primary
[07:43:42] <jynus>	 marostegui, ok if answer to question is yes :-D
[07:43:56] <marostegui>	 XD
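
[Editor's sketch] Two of the options raised between 07:28 and 07:38 can be combined in a few lines: switch the session to row-based binary logging before the update, drop the LIMIT, and warn if more than one row is affected. This is a minimal sketch under stated assumptions, not the code in switchover.py. The function names, the table name, and the column names are all hypothetical, because the real query is not quoted in this log.

```python
import warnings


def switchover_statements(update_sql, use_row_binlog=True):
    """Return the SQL statements to run, in order, on one connection.

    update_sql is the caller's UPDATE (without LIMIT); it is a placeholder
    here because the real query is not quoted in this log.
    """
    statements = []
    if use_row_binlog:
        # Session-scoped, so other connections keep the server default.
        statements.append("SET SESSION binlog_format = 'ROW'")
    statements.append(update_sql)
    return statements


def check_rows_affected(rowcount):
    """Warn, rather than fail, when more than one row was updated."""
    if rowcount > 1:
        warnings.warn(f"expected 1 row, but {rowcount} rows were affected")
        return False
    return True


# Hypothetical table and column names, for illustration only.
example_update = "UPDATE example_table SET section = 's4' WHERE name = 'db1138'"
for statement in switchover_statements(example_update):
    print(statement)
```

The warning stands in for the safety the LIMIT used to give: the update is still expected to touch a single row, and a larger count is surfaced instead of silently accepted.
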
[07:44:57] <wikibugs>	 SRE, DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Marostegui)
[07:47:24] <wikibugs>	 (CR) Ladsgroup: "An extra PCC: https://puppet-compiler.wmflabs.org/compiler1001/27655/" [puppet] - https://gerrit.wikimedia.org/r/658503 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[07:50:48] <wikibugs>	 (PS2) Marostegui: db1160: Install stretch [puppet] - https://gerrit.wikimedia.org/r/658504 (https://phabricator.wikimedia.org/T258361)
[07:57:22] <wikibugs>	 (CR) Muehlenhoff: "Thanks! There's no reason we need Py2 compat here, though. We can drop the __future__ import and simply change the shebang to python3." [puppet] - https://gerrit.wikimedia.org/r/658455 (owner: Legoktm)
[07:58:33] <wikibugs>	 (PS1) Muehlenhoff: Extend access for mraish [puppet] - https://gerrit.wikimedia.org/r/658547
[08:02:23] <wikibugs>	 (CR) Marostegui: [C: +2] db1160: Install stretch [puppet] - https://gerrit.wikimedia.org/r/658504 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[08:03:49] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/658469 (https://phabricator.wikimedia.org/T272912) (owner: Legoktm)
[08:03:51] <wikibugs>	 (PS1) Ryan Kemper: wdqs: add cert for wdqs-internal [puppet] - https://gerrit.wikimedia.org/r/658548 (https://phabricator.wikimedia.org/T272713)
[08:05:50] <wikibugs>	 (PS2) Ryan Kemper: wdqs: add cert for wdqs-internal [puppet] - https://gerrit.wikimedia.org/r/658548 (https://phabricator.wikimedia.org/T272713)
[08:05:55] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1160.eqiad.wmnet'] ` The log ca...
[08:06:50] <wikibugs>	 (CR) Hashar: "The $wgSoftBlockRange fine to me." [mediawiki-config] - https://gerrit.wikimedia.org/r/657067 (https://phabricator.wikimedia.org/T209011) (owner: Arturo Borrero Gonzalez)
[08:08:04] <wikibugs>	 (PS3) Ryan Kemper: wdqs: add cert for wdqs-internal [puppet] - https://gerrit.wikimedia.org/r/658548 (https://phabricator.wikimedia.org/T272713)
[08:12:07] <wikibugs>	 (PS1) Ryan Kemper: wdqs: add dummy key for new wdqs-internal cert [labs/private] - https://gerrit.wikimedia.org/r/658550 (https://phabricator.wikimedia.org/T272713)
[08:13:08] <wikibugs>	 SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1119.eqiad.wmnet', 'an-worker1131.eqiad...
[08:13:54] <wikibugs>	 (CR) Ryan Kemper: "After generating a new cert, three things need to be done:" [labs/private] - https://gerrit.wikimedia.org/r/658550 (https://phabricator.wikimedia.org/T272713) (owner: Ryan Kemper)
[08:14:52] <wikibugs>	 (CR) Giuseppe Lavagetto: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/657543 (owner: Kormat)
[08:16:14] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] udp2log: Install bsection [puppet] - https://gerrit.wikimedia.org/r/657543 (owner: Kormat)
[08:17:46] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1160.eqiad.wmnet with reason: REIMAGE
[08:17:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:08] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] modules/scap/templates/scap.cfg.erb: Define php_fpm_unsafe_restart_script [puppet] - 10https://gerrit.wikimedia.org/r/636074 (https://phabricator.wikimedia.org/T243009) (owner: 10Ahmon Dancy)
[08:18:49] <moritzm>	 !log upgrading OpenJDK on aqs and Hadoop systems
[08:18:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:49] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1160.eqiad.wmnet with reason: REIMAGE
[08:19:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:16] <wikibugs>	 (03Abandoned) 10Effie Mouzeli: mediawiki::php bump opcache.max_accelerated_files [puppet] - 10https://gerrit.wikimedia.org/r/636047 (https://phabricator.wikimedia.org/T253673) (owner: 10Effie Mouzeli)
[08:21:08] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] mediawiki::web::prod_sites: remove unused code from main.conf [puppet] - 10https://gerrit.wikimedia.org/r/657138 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto)
[08:25:16] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The change seems ok in theory, but I'd like to see some risk management in terms of adding a feature flag to control the deployment." [puppet] - 10https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568) (owner: 10Effie Mouzeli)
[08:26:01] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1119.eqiad.wmnet with reason: REIMAGE
[08:26:01] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1131.eqiad.wmnet with reason: REIMAGE
[08:26:02] <wikibugs>	 10SRE, 10DBA: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1160.eqiad.wmnet'] `  and were **ALL** successful.
[08:26:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:03] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on an-worker1131.eqiad.wmnet with reason: REIMAGE
[08:26:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:15] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "I agree with this change, but let's see what alex thinks as well - he's usually opposed to using a native version number." [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/657218 (owner: 10Legoktm)
[08:28:06] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1119.eqiad.wmnet with reason: REIMAGE
[08:28:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:52] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:30:59] <godog>	 !log swift start decom for ms-be20[17,19,21,23,24,25,26,27] - T272837
[08:31:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:03] <stashbot>	 T272837:  Decom ms-be[2016-2027] from swift - https://phabricator.wikimedia.org/T272837
[08:32:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: add JSON logging of all notifications [puppet] - 10https://gerrit.wikimedia.org/r/658307 (https://phabricator.wikimedia.org/T272474) (owner: 10Filippo Giunchedi)
[08:32:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] rsyslog: send AM notifications logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/658308 (https://phabricator.wikimedia.org/T272474) (owner: 10Filippo Giunchedi)
[08:33:16] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host furud.codfw.wmnet
[08:33:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:42] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1131.eqiad.wmnet', 'an-worker1119.eqiad.wmnet'] `  and were **ALL** successful.
[08:36:06] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:36:13] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1119,1131].eqiad.wmnet
[08:36:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:57] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host furud.codfw.wmnet
[08:37:58] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "One minor comment on the dockerfile, and one field missing in the control file; otherwise LGTM" (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/654662 (https://phabricator.wikimedia.org/T270170) (owner: 10Hnowlan)
[08:37:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:04] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1119,1131].eqiad.wmnet
[08:38:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:10] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host flerovium.eqiad.wmnet
[08:39:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:28] <wikibugs>	 (03PS1) 10ArielGlenn: use the platform-engineering group to add people to deployers [puppet] - 10https://gerrit.wikimedia.org/r/658552
[08:41:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] use the platform-engineering group to add people to deployers [puppet] - 10https://gerrit.wikimedia.org/r/658552 (owner: 10ArielGlenn)
[08:42:16] <wikibugs>	 (03PS1) 10Elukey: Add an-worker1119 and 1131 to the Hadoop backup cluster [puppet] - 10https://gerrit.wikimedia.org/r/658553 (https://phabricator.wikimedia.org/T260411)
[08:42:54] <icinga-wm>	 PROBLEM - Check systemd state on mw2259 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:42:58] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add an-worker1119 and 1131 to the Hadoop backup cluster [puppet] - 10https://gerrit.wikimedia.org/r/658553 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey)
[08:44:08] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flerovium.eqiad.wmnet
[08:44:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:44] <icinga-wm>	 RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 27 probes of 675 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:47:38] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 46 probes of 592 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:50:40] <wikibugs>	 (03PS2) 10ArielGlenn: use the platform-engineering group to add people to deployment [puppet] - 10https://gerrit.wikimedia.org/r/658552
[08:51:06] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] use the platform-engineering group to add people to deployment [puppet] - 10https://gerrit.wikimedia.org/r/658552 (owner: 10ArielGlenn)
[08:51:48] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1131 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:52:24] <wikibugs>	 (03PS1) 10Marostegui: db1081: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/658554 (https://phabricator.wikimedia.org/T258361)
[08:53:34] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1081: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/658554 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui)
[08:53:43] <marostegui>	 !log Stop mysql on db1081 to clone db1160
[08:53:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:00] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:55:07] <wikibugs>	 (03PS3) 10ArielGlenn: use the platform-engineering group to add people to deployment [puppet] - 10https://gerrit.wikimedia.org/r/658552
[09:01:52] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Productionize db1160 [puppet] - 10https://gerrit.wikimedia.org/r/658556 (https://phabricator.wikimedia.org/T258361)
[09:02:35] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1160 [puppet] - 10https://gerrit.wikimedia.org/r/658556 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui)
[09:02:49] <Urbanecm>	 jouncebot: now
[09:02:49] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 57 minute(s)
[09:02:51] <Urbanecm>	 jouncebot: next
[09:02:51] <jouncebot>	 In 2 hour(s) and 57 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210126T1200)
[09:04:37] <wikibugs>	 (03PS1) 10Urbanecm: frwiki: Fix tagline height and width [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658557 (https://phabricator.wikimedia.org/T272907)
[09:06:39] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[09:06:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:49] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] frwiki: Fix tagline height and width [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658557 (https://phabricator.wikimedia.org/T272907) (owner: 10Urbanecm)
[09:07:36] <wikibugs>	 (03Merged) 10jenkins-bot: frwiki: Fix tagline height and width [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658557 (https://phabricator.wikimedia.org/T272907) (owner: 10Urbanecm)
[09:09:05] <wikibugs>	 (03CR) 10ArielGlenn: "Note that this adds gmodena and nikkin to the deployment group since they are team group members. Holding until we have a thumbs up from m" [puppet] - 10https://gerrit.wikimedia.org/r/658552 (owner: 10ArielGlenn)
[09:09:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Extend access for mraish [puppet] - 10https://gerrit.wikimedia.org/r/658547 (owner: 10Muehlenhoff)
[09:11:22] <wikibugs>	 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[09:11:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1078 to clone db1175 T258361', diff saved to https://phabricator.wikimedia.org/P13958 and previous config saved to /var/cache/conftool/dbconfig/20210126-091149-marostegui.json
[09:11:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:53] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[09:12:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1078 (db1175 isn't ready yet)', diff saved to https://phabricator.wikimedia.org/P13959 and previous config saved to /var/cache/conftool/dbconfig/20210126-091236-marostegui.json
[09:12:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:18] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: eab87780: frwiki: Fix tagline height and width (T272907) (duration: 00m 58s)
[09:13:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:21] <stashbot>	 T272907: French Wikipedia logo: Tagline too distant from wordmark; Tagline has 24px height instead of 13px and wordmark is too small - https://phabricator.wikimedia.org/T272907
[09:14:44] <elukey>	 !log reboot dbstore1004 for kernel upgrades
[09:14:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:58] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99)
[09:14:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:28] <icinga-wm>	 PROBLEM - Apache HTTP on mw2319 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:16:42] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2319 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:16:42] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2318 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2501 bytes in 0.078 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:17:14] <icinga-wm>	 PROBLEM - Apache HTTP on mw2331 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:17:34] <icinga-wm>	 PROBLEM - PHP7 rendering on mw2331 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:17:42] <icinga-wm>	 PROBLEM - Apache HTTP on mw2318 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:18:20] <icinga-wm>	 PROBLEM - Apache HTTP on mw2320 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2500 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:19:11] <wikibugs>	 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10akosiaris) >>! In T179696#6776337, @Joe wrote: >>>! In T179696#6775314, @Legoktm wrote: >> In my testing of repeatedly issuing the same curl com...
[09:19:50] <elukey>	 I see something like
[09:19:52] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:19:52] <elukey>	 Jan 26 09:19:20 mw2320 php7.2-fpm: PHP Fatal error:  require(): Failed opening required '/srv/mediawiki/wmf-config/logos.php' 
[09:19:58] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 115 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:20:01] <wikibugs>	 (03PS1) 10Kormat: realm: Add translate_cache to $private_tables. [puppet] - 10https://gerrit.wikimedia.org/r/658558 (https://phabricator.wikimedia.org/T272957)
[09:20:34] <Urbanecm>	 elukey: oops
[09:20:36] <Urbanecm>	 I know what that is
[09:20:42] <Urbanecm>	 I'll fix it
[09:20:48] <elukey>	 super I was about to ping you :)
[09:20:49] <elukey>	 thanks
[09:21:09] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:21:51] <wikibugs>	 (03CR) 10Kormat: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27658/console" [puppet] - 10https://gerrit.wikimedia.org/r/658558 (https://phabricator.wikimedia.org/T272957) (owner: 10Kormat)
[09:22:23] <wikibugs>	 (03CR) 10Kormat: realm: Add translate_cache to $private_tables. [puppet] - 10https://gerrit.wikimedia.org/r/658558 (https://phabricator.wikimedia.org/T272957) (owner: 10Kormat)
[09:24:26] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/logos.php: Resyncing to fix mw2xxx apache loading (duration: 00m 57s)
[09:24:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:37] <wikibugs>	 (03CR) 10Kormat: [V: 03+1 C: 03+2] udp2log: Install bsection [puppet] - 10https://gerrit.wikimedia.org/r/657543 (owner: 10Kormat)
[09:24:56] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] "requires restarting of all sanitarium hosts: db1124, db1125, db1154, db1155, db2094, db2095" [puppet] - 10https://gerrit.wikimedia.org/r/658558 (https://phabricator.wikimedia.org/T272957) (owner: 10Kormat)
[09:25:01] <icinga-wm>	 RECOVERY - Apache HTTP on mw2320 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.112 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:25:03] <Urbanecm>	 elukey: ^^that should do the trick
[09:25:07] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2331 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.106 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:25:17] <icinga-wm>	 RECOVERY - Apache HTTP on mw2318 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:25:27] <elukey>	 perfect :)
[09:25:33] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2318 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:26:29] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 36 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:27:06] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] realm: Add translate_cache to $private_tables. [puppet] - 10https://gerrit.wikimedia.org/r/658558 (https://phabricator.wikimedia.org/T272957) (owner: 10Kormat)
[09:27:45] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] admin: Add mewoph to list of privledged LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/658469 (https://phabricator.wikimedia.org/T272912) (owner: 10Legoktm)
[09:28:35] <Urbanecm>	 elukey: do you happen to know if scap pull is a standard part of reimaging MW servers? According to SAL, lego.ktm did https://sal.toolforge.org/production?p=0&q=logos.php&d= last night, and at around the same time, mu.tante reimaged mw2319 (among other affected servers). So, until I did the IS.php sync to fix an (unrelated) bug, the error was there, but not noticed. Once I changed IS.php, the mw2319 copy started to require 
[09:28:35] <Urbanecm>	 logos.php. For some reason, it seems mw2319 was using an old copy of /srv/mediawiki
[09:28:51] <elukey>	 !log reboot dbstore1003 for kernel upgrades
[09:28:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:39] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on 6 hosts with reason: Restart mariadb to pick up config changes T272957
[09:29:41] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 6 hosts with reason: Restart mariadb to pick up config changes T272957
[09:29:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:42] <stashbot>	 T272957: Mark mediawikiwiki.translate_cache as private so it doesn't replicate to wiki replicas - https://phabricator.wikimedia.org/T272957
[09:29:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:57] <icinga-wm>	 RECOVERY - Apache HTTP on mw2319 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:31:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Adapt proxy setting in debmonitor nginx site for CAS [puppet] - 10https://gerrit.wikimedia.org/r/657782 (owner: 10Muehlenhoff)
[09:32:26] <godog>	 !log disable mdadm check emails on ms-be1022 / known, and host is going to be decom'd - T267870
[09:32:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:29] <stashbot>	 T267870: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870
[09:32:47] <wikibugs>	 (03PS2) 10Muehlenhoff: debmonitor: Don't include debmonitor_static for the internal listener [puppet] - 10https://gerrit.wikimedia.org/r/657795
[09:33:11] <Urbanecm>	 apparently, mw2319's copy is still not in its expected state
[09:33:23] <Urbanecm>	 resyncing CommonSettings.php to fix this
[09:33:43] <wikibugs>	 (03PS4) 10Jbond: sre.misc-clusters.thumbor: create batch action cook book for thumbor [cookbooks] - 10https://gerrit.wikimedia.org/r/657802
[09:34:32] <Urbanecm>	 we should have alerts on drifts of important files in /srv/mediawiki
[09:34:54] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: Resync: Some mw2xxx hosts have old version (duration: 00m 55s)
[09:34:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:58] <elukey>	 Urbanecm: ah yes, this might explain it. In theory, right after the reimage we should do a scap pull
[09:35:12] <elukey>	 it is not automatic IIRC, but it has been a while since I checked
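
The drift alert suggested above could be sketched roughly as follows. This is a minimal illustration, not how scap or the existing monitoring works: it assumes a reference copy of /srv/mediawiki is available to hash, and the function names (`file_digest`, `build_manifest`, `find_drift`) are hypothetical. It only shows the idea of comparing per-file checksums of a few important files against a known-good manifest.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(root: Path, relative_paths: list[str]) -> dict[str, str]:
    """Map each watched relative path to its digest in the reference copy."""
    return {rel: file_digest(root / rel) for rel in relative_paths}


def find_drift(root: Path, manifest: dict[str, str]) -> dict[str, str]:
    """Compare one host's copy against the reference manifest.

    Returns {relative_path: "missing" | "changed"} for every drifted file;
    an empty dict means the watched files match the reference.
    """
    drift = {}
    for rel, expected in manifest.items():
        candidate = root / rel
        if not candidate.is_file():
            drift[rel] = "missing"
        elif file_digest(candidate) != expected:
            drift[rel] = "changed"
    return drift
```

Under these assumptions, the manifest would be built once from the reference copy for files such as wmf-config/logos.php, wmf-config/CommonSettings.php and wmf-config/InitialiseSettings.php, and a non-empty result from `find_drift` on an application server would be what raises the alert.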
[09:35:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] debmonitor: Don't include debmonitor_static for the internal listener [puppet] - 10https://gerrit.wikimedia.org/r/657795 (owner: 10Muehlenhoff)
[09:35:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sre.misc-clusters.thumbor: create batch action cook book for thumbor [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 (owner: 10Jbond)
[09:37:45] <elukey>	 !log reboot dbstore1005 for kernel upgrades
[09:37:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:32] <wikibugs>	 (03PS16) 10Jbond: cookbook sre.misc-clusters.apt: [cookbooks] - 10https://gerrit.wikimedia.org/r/656139
[09:39:03] <icinga-wm>	 RECOVERY - PHP7 rendering on mw2319 is OK: HTTP OK: HTTP/1.1 302 Found - 654 bytes in 0.111 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[09:39:10] <wikibugs>	 (03PS15) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102
[09:39:17] <wikibugs>	 (03PS17) 10Jbond: cookbook sre.misc-clusters.apt: [cookbooks] - 10https://gerrit.wikimedia.org/r/656139
[09:40:00] <wikibugs>	 (03PS5) 10Jbond: sre.misc-clusters.thumbor: create batch action cook book for thumbor [cookbooks] - 10https://gerrit.wikimedia.org/r/657802
[09:41:01] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host debmonitor1002.eqiad.wmnet
[09:41:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:12] <wikibugs>	 (03PS1) 10Jbond: gitignore: ignore vi swap/tmp files [cookbooks] - 10https://gerrit.wikimedia.org/r/658560
[09:47:51] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:49:11] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:52:01] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:52:23] <icinga-wm>	 RECOVERY - Apache HTTP on mw2331 is OK: HTTP OK: HTTP/1.1 302 Found - 640 bytes in 0.126 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:53:43] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1002.eqiad.wmnet
[09:53:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:25] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 213859768 and 24 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[09:56:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan)
[09:56:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete tmpreaper Puppet classes [puppet] - 10https://gerrit.wikimedia.org/r/658271 (https://phabricator.wikimedia.org/T272559) (owner: 10Muehlenhoff)
[09:57:33] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10MoritzMuehlenhoff)
[09:57:47] <wikibugs>	 (03PS4) 10David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - 10https://gerrit.wikimedia.org/r/658357
[09:58:51] <wikibugs>	 (03PS5) 10David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - 10https://gerrit.wikimedia.org/r/658357
[10:00:19] <wikibugs>	 (03CR) 10Joal: [C: 03+1] "LGTM! elukey should be the one to merge :)" [puppet] - 10https://gerrit.wikimedia.org/r/642411 (owner: 10DCausse)
[10:01:01] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 189657152 and 87 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:01:40] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wmcs: first try on creating a new etcd for toolforge [cookbooks] - 10https://gerrit.wikimedia.org/r/658357 (owner: 10David Caro)
[10:01:49] <wikibugs>	 (03PS6) 10David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - 10https://gerrit.wikimedia.org/r/658357
[10:04:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wmcs: first try on creating a new etcd for toolforge [cookbooks] - 10https://gerrit.wikimedia.org/r/658357 (owner: 10David Caro)
[10:05:14] <wikibugs>	 (03PS7) 10Jbond: dns:  update DNS to support multiple namservers [software/pywmflib] - 10https://gerrit.wikimedia.org/r/656141
[10:05:22] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Kormat)
[10:07:31] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:09:41] <wikibugs>	 (03PS1) 10Jbond: add gitignore: add vi swap/tmp files [software/pywmflib] - 10https://gerrit.wikimedia.org/r/658562
[10:15:25] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:16:03] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:19:02] <wikibugs>	 (03PS3) 10Hnowlan: maps: reimage maps1009 with buster. [puppet] - 10https://gerrit.wikimedia.org/r/656404 (https://phabricator.wikimedia.org/T238753)
[10:20:18] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] maps: reimage maps1009 with buster. [puppet] - 10https://gerrit.wikimedia.org/r/656404 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan)
[10:20:29] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:20:47] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 206064424 and 98 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:22:06] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Enable managed adduser.conf unconditionally [puppet] - 10https://gerrit.wikimedia.org/r/657770 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff)
[10:23:07] <wikibugs>	 (03PS5) 10Hnowlan: maps: make maps1009 a new, independent buster master. [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753)
[10:23:51] <wikibugs>	 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Joe) we're still experiencing timeouts when trying to gather the catalog list with the url `/v2/_catalog?last=releng%2Fquibble-jessie-php55&n=10...
[10:24:53] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:25:19] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:27:03] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:31:32] <wikibugs>	 (03PS6) 10Hnowlan: maps: make maps1009 a new, independent buster master. [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753)
[10:35:01] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): (Need By: TBD) rack/setup/install cloudgw2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271590 (10aborrero) thanks, I will put it into service soon!
[10:35:59] <wikibugs>	 (03CR) 10Muehlenhoff: "Looks good to me, a couple of nits inline" (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond)
[10:38:27] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:47:11] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 994656 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[10:52:31] <wikibugs>	 (03CR) 10David Caro: "I agree with @Bstorm, though I would instead add a '--interactive' flag or similar, as I think it's useful to be able to just copy-paste a" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/658452 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott)
[10:54:53] <wikibugs>	 (03CR) 10David Caro: "Looks good, would be interesting to know where did you get the info on what to change for the upgrade too ;) (aside from the test plan/res" [puppet] - 10https://gerrit.wikimedia.org/r/658416 (owner: 10Bstorm)
[11:03:57] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] data-services: apply user variances to future creations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657890 (https://phabricator.wikimedia.org/T269399) (owner: 10Bstorm)
[11:07:11] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] "I'll bundle with a couple of other changes (to avoid too many pybal restarts) and deploy today" [puppet] - 10https://gerrit.wikimedia.org/r/657101 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[11:08:10] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Add a linkrecommendation-external release [deployment-charts] - 10https://gerrit.wikimedia.org/r/657855 (https://phabricator.wikimedia.org/T265603) (owner: 10Alexandros Kosiaris)
[11:08:35] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks for the review. I 'll merge and let's see how it goes" [deployment-charts] - 10https://gerrit.wikimedia.org/r/657855 (https://phabricator.wikimedia.org/T265603) (owner: 10Alexandros Kosiaris)
[11:09:51] <wikibugs>	 (03Merged) 10jenkins-bot: Add a linkrecommendation-external release [deployment-charts] - 10https://gerrit.wikimedia.org/r/657855 (https://phabricator.wikimedia.org/T265603) (owner: 10Alexandros Kosiaris)
[11:11:58] <wikibugs>	 (03CR) 10Elukey: sre: convert the generic reboot functions to the cookbook class API (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond)
[11:13:00] <wikibugs>	 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Joe) Looking at the logs from a failed run, it looks like no retry is attempted when a 504 is received, at least on `registry2002`.  Every 504 f...
[11:13:28] <wikibugs>	 (03PS10) 10Effie Mouzeli: varnish: Set debug=1 in X-Analytics header [puppet] - 10https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683)
[11:16:49] <wikibugs>	 (03PS2) 10ArielGlenn: handle backwards searches for bz2 blocks in tiny files [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/658305
[11:16:51] <wikibugs>	 (03PS2) 10ArielGlenn: update tests for different distros and for split-bz2 using local binaries [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/658306
[11:20:57] <wikibugs>	 (03PS1) 10Elukey: sre.hosts.decommission: fix homer subprocess execution code [cookbooks] - 10https://gerrit.wikimedia.org/r/658565
[11:21:20] <wikibugs>	 10SRE, 10Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (10Urbanecm)
[11:21:47] <wikibugs>	 (03PS3) 10ArielGlenn: handle backwards searches for bz2 blocks in tiny files [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/658305
[11:21:49] <wikibugs>	 (03PS3) 10ArielGlenn: update tests for different distros and for split-bz2 using local binaries [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/658306
[11:23:46] <wikibugs>	 (03Abandoned) 10Effie Mouzeli: varnish: include X-Client-Port in X-Analytics [puppet] - 10https://gerrit.wikimedia.org/r/657416 (https://phabricator.wikimedia.org/T181368) (owner: 10Effie Mouzeli)
[11:25:34] <wikibugs>	 (03PS11) 10Effie Mouzeli: varnish: Set debug=1 in X-Analytics header [puppet] - 10https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683)
[11:26:19] <wikibugs>	 (03PS4) 10ArielGlenn: update tests for different distros and for split-bz2 using local binaries [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/658306
[11:27:38] <wikibugs>	 (03PS1) 10Effie Mouzeli: varnish: include X-Client-Port in X-Analytics [puppet] - 10https://gerrit.wikimedia.org/r/658567 (https://phabricator.wikimedia.org/T181368)
[11:29:32] <moritzm>	 !log imported jenkins 2.263.3 to apt.wikimedia.org (thirdparty/ci)
[11:29:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:09] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "Thanks for starting to explore implementing things with spicerack and cookbooks." (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/658357 (owner: 10David Caro)
[11:30:30] <wikibugs>	 (03PS1) 10Marostegui: instances.yaml: Add db1160 [puppet] - 10https://gerrit.wikimedia.org/r/658569 (https://phabricator.wikimedia.org/T258361)
[11:30:42] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "LGTM" [software/pywmflib] - 10https://gerrit.wikimedia.org/r/658562 (owner: 10Jbond)
[11:31:00] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1160 [puppet] - 10https://gerrit.wikimedia.org/r/658569 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui)
[11:31:07] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/658560 (owner: 10Jbond)
[11:31:41] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:33:10] <wikibugs>	 (03PS1) 10Hnowlan: cassandra::single_instance: use dedicated hiera key, don't use 'cluster' [puppet] - 10https://gerrit.wikimedia.org/r/658572
[11:34:18] <wikibugs>	 (03Merged) 10jenkins-bot: add gitignore: add vi swap/tmp files [software/pywmflib] - 10https://gerrit.wikimedia.org/r/658562 (owner: 10Jbond)
[11:34:36] <wikibugs>	 (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] handle backwards searches for bz2 blocks in tiny files [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/658305 (owner: 10ArielGlenn)
[11:34:52] <wikibugs>	 (03Merged) 10jenkins-bot: gitignore: ignore vi swap/tmp files [cookbooks] - 10https://gerrit.wikimedia.org/r/658560 (owner: 10Jbond)
[11:34:55] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "Thanks for the fix, minor improvement inline." (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/658565 (owner: 10Elukey)
[11:35:10] <wikibugs>	 (03CR) 10ArielGlenn: [V: 03+2 C: 03+2] update tests for different distros and for split-bz2 using local binaries [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/658306 (owner: 10ArielGlenn)
[11:35:57] <wikibugs>	 (03PS2) 10Elukey: sre.hosts.decommission: fix homer subprocess execution code [cookbooks] - 10https://gerrit.wikimedia.org/r/658565
[11:36:09] <wikibugs>	 (03CR) 10Elukey: sre.hosts.decommission: fix homer subprocess execution code (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/658565 (owner: 10Elukey)
[11:38:04] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27660/console" [puppet] - 10https://gerrit.wikimedia.org/r/658572 (owner: 10Hnowlan)
[11:38:15] <wikibugs>	 (03PS3) 10Volans: sre.hosts.decommission: fix homer subprocess execution code [cookbooks] - 10https://gerrit.wikimedia.org/r/658565 (owner: 10Elukey)
[11:38:21] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:38:23] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/658565 (owner: 10Elukey)
[11:41:34] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] sre.hosts.decommission: fix homer subprocess execution code [cookbooks] - 10https://gerrit.wikimedia.org/r/658565 (owner: 10Elukey)
[11:43:34] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.decommission: fix homer subprocess execution code [cookbooks] - 10https://gerrit.wikimedia.org/r/658565 (owner: 10Elukey)
[11:44:33] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.decommission
[11:44:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:30] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[11:46:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:42] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[11:47:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:20] <wikibugs>	 (03CR) 10David Caro: ""I think that would be easier to start with cookbooks to manage the production side of the WMCS infrastructure than toolforge"" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/658357 (owner: 10David Caro)
[11:49:51] <wikibugs>	 (03PS1) 10ArielGlenn: version 0.1.3 [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/658574
[11:51:42] <wikibugs>	 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) db1160 is ready to replace db1081. Leaving it to replicate for 24h before pooling it.
[11:53:41] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:55:55] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: How many deployers does it take to do European mid-day backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210126T1200).
[12:00:04] <jouncebot>	 Evrifaessa: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:06] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: httpbb: Add test for gzipping of static css files. [puppet] - 10https://gerrit.wikimedia.org/r/658317 (https://phabricator.wikimedia.org/T272305)
[12:00:08] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: remove unused code from main.conf [puppet] - 10https://gerrit.wikimedia.org/r/657138 (https://phabricator.wikimedia.org/T272305)
[12:00:10] <wikibugs>	 (03PS9) 10Giuseppe Lavagetto: [WiP] mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305)
[12:00:20] <Urbanecm>	 I hope only one, jouncebot :)
[12:00:46] <Urbanecm>	 requesting deployer not present, so there's nothing to do anyway
[12:00:46] <wikibugs>	 (03PS7) 10Hnowlan: maps: make maps1009 a new, independent buster master. [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753)
[12:01:24] <wikibugs>	 (03CR) 10Volans: "Couple of nits inline, CI is not happy and you might need a rebase." (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/657385 (owner: 10Jbond)
[12:01:47] <Evrifaessa>	 hey
[12:01:57] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WiP] mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto)
[12:02:14] <Evrifaessa>	 Urbanecm, Amir1, Lucas_WMDE awight: anyone here?
[12:02:15] <Urbanecm>	 hi, Evrifaessa 
[12:02:19] <Urbanecm>	 I can deploy today
[12:02:21] <Lucas_WMDE>	 hi!
[12:02:22] <Evrifaessa>	 o/
[12:02:36] <Urbanecm>	 (unless Lucas_WMDE really wants to? 🙂 )
[12:02:46] <Lucas_WMDE>	 you can do it if you want to :)
[12:02:54] <Evrifaessa>	 lol
[12:02:59] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] version 0.1.3 [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/658574 (owner: 10ArielGlenn)
[12:03:06] <wikibugs>	 (03PS2) 10Urbanecm: Add namespace aliases to Turkish Wikivoyage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657995 (https://phabricator.wikimedia.org/T272782) (owner: 10Evrifaessa)
[12:03:28] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add namespace aliases to Turkish Wikivoyage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657995 (https://phabricator.wikimedia.org/T272782) (owner: 10Evrifaessa)
[12:04:21] <wikibugs>	 (03Merged) 10jenkins-bot: Add namespace aliases to Turkish Wikivoyage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657995 (https://phabricator.wikimedia.org/T272782) (owner: 10Evrifaessa)
[12:04:49] <Urbanecm>	 Evrifaessa: available at mwdebug1001 for testing, can you have a look?
[12:04:59] <wikibugs>	 (03PS4) 10Urbanecm: Add Turkish 'Powered by MediaWiki' and 'A Wikimedia project' icons for Turkish Wikivoyage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657994 (https://phabricator.wikimedia.org/T272781) (owner: 10Evrifaessa)
[12:05:03] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add Turkish 'Powered by MediaWiki' and 'A Wikimedia project' icons for Turkish Wikivoyage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657994 (https://phabricator.wikimedia.org/T272781) (owner: 10Evrifaessa)
[12:05:04] <Evrifaessa>	 works
[12:05:10] <Urbanecm>	 thanks, syncing
[12:06:04] <wikibugs>	 (03Merged) 10jenkins-bot: Add Turkish 'Powered by MediaWiki' and 'A Wikimedia project' icons for Turkish Wikivoyage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657994 (https://phabricator.wikimedia.org/T272781) (owner: 10Evrifaessa)
[12:06:08] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: services: Create LVS services for linkrecommendation [puppet] - 10https://gerrit.wikimedia.org/r/658576 (https://phabricator.wikimedia.org/T265603)
[12:06:12] <Evrifaessa>	 I'd love to have T272776 deployed today too, but I couldn't optimize the SVG
[12:06:13] <stashbot>	 T272776: Add localized Wikivoyage wordmark for the mobile view of Turkish Wikivoyage - https://phabricator.wikimedia.org/T272776
[12:06:18] <wikibugs>	 (03PS10) 10Giuseppe Lavagetto: [WiP] mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305)
[12:06:48] <Evrifaessa>	 can't pull the change. for some reason my git freezes out. can you optimize the SVG? Urbanecm  
[12:06:58] <Evrifaessa>	 this one: https://gerrit.wikimedia.org/r/c/657971
[12:07:27] <Urbanecm>	 Evrifaessa: try going to master, doing git pull, and then git review -d 657971 again :)
[12:07:28] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: eab535fcc983d57dd36c41309162ace8aadcae1a: Add namespace aliases to Turkish Wikivoyage (T272782) (duration: 01m 00s)
[12:07:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:33] <stashbot>	 T272782: Add namespace aliases to Turkish Wikivoyage - https://phabricator.wikimedia.org/T272782
[12:08:42] <Urbanecm>	 Evrifaessa: please test the other change at mwdebug1001
[12:08:46] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WiP] mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto)
[12:08:59] <Evrifaessa>	 works.
[12:09:06] <Evrifaessa>	 thank you
[12:10:59] <wikibugs>	 (03PS2) 10Evrifaessa: Add localized Wikivoyage wordmark for the mobile view of Turkish Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657971 (https://phabricator.wikimedia.org/T272776)
[12:12:04] <Evrifaessa>	 Urbanecm: mind checking and deploying https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/657971 now?
[12:12:15] <Urbanecm>	 !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=trwikivoyage --cluster=all
[12:12:16] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 4dfc28a4a759050726561da861a9e1030b529d3e: Add Turkish Powered by MediaWiki and A Wikimedia project icons for Turkish Wikivoyage (T272781) (duration: 01m 00s)
[12:12:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:18] <Evrifaessa>	 I uploaded a new patchset
[12:12:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:21] <stashbot>	 T272781: Localize the footer images in Turkish Wikivoyage - https://phabricator.wikimedia.org/T272781
[12:12:23] <Urbanecm>	 Evrifaessa: saw, will look :)
[12:13:55] <Urbanecm>	 Evrifaessa: what was your optimization command?
[12:14:26] <Evrifaessa>	 svgo wikivoyage-wordmark-tr.svg --disable={cleanupIDs,convertPathData,removeDesc,removeTitle,removeViewBox,removeXMLProcInst} --enable='sortAttrs' --pretty
[12:15:45] <Evrifaessa>	 Urbanecm: ?
[12:15:53] <Urbanecm>	 Evrifaessa: I'm writing ;)
[12:15:58] <Evrifaessa>	 oh
[12:15:59] <Evrifaessa>	 lol
[12:16:18] <wikibugs>	 (03PS8) 10Hnowlan: maps: make maps1009 a new, independent buster master. [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753)
[12:17:07] <Urbanecm>	 Evrifaessa: and also looking at the patch itself
[12:17:28] <wikibugs>	 (03PS3) 10Urbanecm: Add localized Wikivoyage wordmark for the mobile view of Turkish Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657971 (https://phabricator.wikimedia.org/T272776) (owner: 10Evrifaessa)
[12:19:12] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add localized Wikivoyage wordmark for the mobile view of Turkish Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657971 (https://phabricator.wikimedia.org/T272776) (owner: 10Evrifaessa)
[12:20:05] <wikibugs>	 (03Merged) 10jenkins-bot: Add localized Wikivoyage wordmark for the mobile view of Turkish Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657971 (https://phabricator.wikimedia.org/T272776) (owner: 10Evrifaessa)
[12:20:28] <wikibugs>	 (03PS11) 10Giuseppe Lavagetto: mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305)
[12:20:51] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: apertium: Add the TLS-enabled LVS service [puppet] - 10https://gerrit.wikimedia.org/r/658577
[12:21:12] <Urbanecm>	 Evrifaessa: can you check?
[12:21:19] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] maps: make maps1009 a new, independent buster master. [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan)
[12:21:27] <Evrifaessa>	 Urbanecm: mwdebug1001?
[12:21:31] <Urbanecm>	 yup
[12:21:47] <Evrifaessa>	 hmm
[12:21:47] <Evrifaessa>	 nope
[12:21:53] <Evrifaessa>	 still the default logo
[12:21:59] <Evrifaessa>	 https://tr.m.wikivoyage.org/wiki/Anasayfa
[12:22:17] <Urbanecm>	 Evrifaessa: sorry, can you try now? 
[12:22:34] <Evrifaessa>	 looks awesome
[12:22:35] <Evrifaessa>	 thanks
[12:22:45] <Urbanecm>	 great, syncing
[12:23:23] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27661/console" [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto)
[12:23:33] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: services: similar-users discovery and LVS component [puppet] - 10https://gerrit.wikimedia.org/r/657101 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[12:23:35] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: services: Create LVS services for linkrecommendation [puppet] - 10https://gerrit.wikimedia.org/r/658576 (https://phabricator.wikimedia.org/T265603)
[12:23:37] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: apertium: Add the TLS-enabled LVS service [puppet] - 10https://gerrit.wikimedia.org/r/658577
[12:24:26] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized static/images/mobile/copyright/wikivoyage-wordmark-tr.svg: 080389dbac5bb2cddab7640071e43674a868e945: Add localized Wikivoyage wordmark for the mobile view of Turkish Wikivoyage (T272776; 1/2) (duration: 01m 01s)
[12:24:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:32] <stashbot>	 T272776: Add localized Wikivoyage wordmark for the mobile view of Turkish Wikivoyage - https://phabricator.wikimedia.org/T272776
[12:24:40] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27662/console" [puppet] - 10https://gerrit.wikimedia.org/r/658577 (owner: 10Alexandros Kosiaris)
[12:26:18] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 080389dbac5bb2cddab7640071e43674a868e945: Add localized Wikivoyage wordmark for the mobile view of Turkish Wikivoyage (T272776; 2/2) (duration: 01m 02s)
[12:26:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:30] <Urbanecm>	 Evrifaessa: should be live :)
[12:26:40] <Evrifaessa>	 yeah, it works. thanks :))
[12:27:16] <Urbanecm>	 excellent :)
[12:27:57] <Evrifaessa>	 o/
[12:28:04] <wikibugs>	 (03PS6) 10Urbanecm: Add WikiProject and WikiProject_talk namespace and its aliases for zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657572 (https://phabricator.wikimedia.org/T271612) (owner: 10A2569875)
[12:28:10] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add WikiProject and WikiProject_talk namespace and its aliases for zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657572 (https://phabricator.wikimedia.org/T271612) (owner: 10A2569875)
[12:29:00] <wikibugs>	 (03Merged) 10jenkins-bot: Add WikiProject and WikiProject_talk namespace and its aliases for zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657572 (https://phabricator.wikimedia.org/T271612) (owner: 10A2569875)
[12:29:45] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "I 've added the LVS IP to the kubernetes nodes on this patch and run PCC on this and descendants. https://puppet-compiler.wmflabs.org/comp" [puppet] - 10https://gerrit.wikimedia.org/r/657101 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[12:29:52] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] services: similar-users discovery and LVS component [puppet] - 10https://gerrit.wikimedia.org/r/657101 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[12:30:01] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] services: Create LVS services for linkrecommendation [puppet] - 10https://gerrit.wikimedia.org/r/658576 (https://phabricator.wikimedia.org/T265603) (owner: 10Alexandros Kosiaris)
[12:30:08] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] apertium: Add the TLS-enabled LVS service [puppet] - 10https://gerrit.wikimedia.org/r/658577 (owner: 10Alexandros Kosiaris)
[12:30:15] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:32:02] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 11cfef4f05612771d6a7cbe27f9bb1fbb41e0e5d: Add WikiProject and WikiProject_talk namespace and its aliases for zh.wikipedia (T271612) (duration: 01m 01s)
[12:32:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:07] <stashbot>	 T271612: New namespace for WikiProject on zh.wikipedia - https://phabricator.wikimedia.org/T271612
[12:32:11] <wikibugs>	 (03Abandoned) 10JMeybohm: Demo - don't merge: Add a new listener to services proxy [puppet] - 10https://gerrit.wikimedia.org/r/657591 (owner: 10JMeybohm)
[12:32:16] <wikibugs>	 (03Abandoned) 10JMeybohm: Demo - don't merge: Enable the service-proxy-demo listener for MW hosts [puppet] - 10https://gerrit.wikimedia.org/r/657592 (owner: 10JMeybohm)
[12:33:00] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: similar-users, linkrecommendation: Switch to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/658579 (https://phabricator.wikimedia.org/T265603)
[12:33:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] similar-users, linkrecommendation: Switch to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/658579 (https://phabricator.wikimedia.org/T265603) (owner: 10Alexandros Kosiaris)
[12:33:31] <wikibugs>	 10SRE, 10Inuka-Team, 10Privacy, 10Product-Analytics (Kanban), 10Security: Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10nshahquinn-wmf) Thanks, @sbassett, @JFishback_WMF, and @jcrespo, for the further input! Yes, it sounds like I would need...
[12:34:01] <Urbanecm>	 !log [urbanecm@mwmaint1002 ~]$ mwscript namespaceDupes.php --wiki=zhwiki --fix # T271612 # P13960
[12:34:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:05] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:40:03] <volans>	 legoktm, _joe_: the above is because requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://docker-registry.discovery.wmnet/v2/_catalog?last=releng%2Fquibble-jessie-php55&n=100
[12:42:29] <wikibugs>	 10SRE, 10Inuka-Team, 10Privacy, 10Product-Analytics (Kanban), 10Security: Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10jcrespo) Assuming legal approves and IT helps you on client side, we (SREs) will be able to help the person with any tra...
[12:43:32] <wikibugs>	 10SRE, 10Inuka-Team, 10Privacy, 10Product-Analytics (Kanban), 10Security: Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10nshahquinn-wmf) >>! In T271202#6737057, @Platonides wrote: > On the topic of ssh accesses, there shouldn't be a "big hea...
[12:44:21] <wikibugs>	 (03PS1) 10Kormat: switchover: Work-around isolation level issue [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/658580 (https://phabricator.wikimedia.org/T272954)
[12:44:53] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.11:4737]) https://wikitech.wikimedia.org/wiki/PyBal
[12:46:25] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] switchover: Work-around isolation level issue [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/658580 (https://phabricator.wikimedia.org/T272954) (owner: 10Kormat)
[12:47:24] <wikibugs>	 (03PS17) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos)
[12:47:57] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 55 connections established with conf2001.codfw.wmnet:2379 (min=56) https://wikitech.wikimedia.org/wiki/PyBal
[12:48:10] <wikibugs>	 10SRE, 10Inuka-Team, 10Privacy, 10Product-Analytics (Kanban), 10Security: Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10nshahquinn-wmf) >>! In T271202#6777042, @jcrespo wrote: > Assuming legal approves and IT helps you on client side, we (S...
[12:48:39] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27663/console" [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos)
[12:49:41] <_joe_>	 volans: we know, and it's tracked in the task already
[12:49:42] <wikibugs>	 (03CR) 10Volans: "Couple of minor things inline." (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/656139 (owner: 10Jbond)
[12:49:56] <volans>	 ack, thx, sorry for the noise
[12:49:58] <wikibugs>	 10SRE, 10Inuka-Team, 10Privacy, 10Product-Analytics (Kanban), 10Security: Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10jcrespo) Analytics tools are improving every day, The data engineering team are doing a great job of offering web-based...
[12:51:54] <_joe_>	 volans: no, sorry for the noise from that script, somehow retries seem not to be working, and I didn't look further into what's going wrong
[12:52:02] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 97 connections established with conf1004.eqiad.wmnet:4001 (min=98) https://wikitech.wikimedia.org/wiki/PyBal
[12:52:26] <wikibugs>	 10SRE, 10Inuka-Team, 10Privacy, 10Product-Analytics (Kanban), 10Security: Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10jcrespo) > do y'all still need to approve it?  We ask you to loop us in. For production access is always better to ask f...
[12:55:24] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 65 connections established with conf1004.eqiad.wmnet:4001 (min=66) https://wikitech.wikimedia.org/wiki/PyBal
[12:55:56] <wikibugs>	 (03PS1) 10Jbond: (WIP) debdeploy: Add debdeploy functionality [software/pywmflib] - 10https://gerrit.wikimedia.org/r/658582
[12:56:20] <wikibugs>	 (03PS18) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos)
[12:57:12] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "I haven't tested, +1, but see comments below." (032 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/658580 (https://phabricator.wikimedia.org/T272954) (owner: 10Kormat)
[12:57:51] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] (WIP) debdeploy: Add debdeploy functionality [software/pywmflib] - 10https://gerrit.wikimedia.org/r/658582 (owner: 10Jbond)
[13:03:58] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] "we can test with pc1 codfw at some point if you like" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/658580 (https://phabricator.wikimedia.org/T272954) (owner: 10Kormat)
[13:05:32] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.11:4737]) https://wikitech.wikimedia.org/wiki/PyBal
[13:06:30] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 75 connections established with conf2001.codfw.wmnet:2379 (min=76) https://wikitech.wikimedia.org/wiki/PyBal
[13:10:20] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.11:4737]) https://wikitech.wikimedia.org/wiki/PyBal
[13:21:43] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.11:4737]) https://wikitech.wikimedia.org/wiki/PyBal
[13:21:49] <wikibugs>	 (CR) Jcrespo: [C: +1] "More options (if we are afraid a connection loss can happen between one statement and the other:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/658580 (https://phabricator.wikimedia.org/T272954) (owner: Kormat)
[13:29:49] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:30:54] <hashar>	 !log Upgraded and restarting Jenkins on release1002 / releases2002 / contint1001 and contint2001
[13:30:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:04] <hashar>	 CI jobs halting for a couple minutes
[13:31:15] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:31:27] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:34:21] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:35:04] <wikibugs>	 (PS3) Patsagorn Y.: Create patroller user group for thwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T2721499)
[13:35:28] <wikibugs>	 (CR) David Caro: "Fixed issues introduced when linting, now playing with classes." (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/658357 (owner: David Caro)
[13:35:55] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:36:07] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:37:40] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] toolforge k8s: upgrade docker and containerd [puppet] - https://gerrit.wikimedia.org/r/639881 (https://phabricator.wikimedia.org/T263284) (owner: Bstorm)
[13:40:31] <wikibugs>	 (PS1) Matthias Mullie: [WikibaseMediaInfo] MediaSearch: new set of heuristics for alternative implementation [mediawiki-config] - https://gerrit.wikimedia.org/r/658589 (https://phabricator.wikimedia.org/T271532)
[13:40:33] <wikibugs>	 (PS1) Matthias Mullie: [WikibaseMediaInfo] MediaSearch: remove old, unused set of heuristics [mediawiki-config] - https://gerrit.wikimedia.org/r/658590 (https://phabricator.wikimedia.org/T271532)
[13:41:16] <arturo>	 !log admin update some kubernetes-related packages in buster-wikimedia/thirdparty/kubeadm-k8s-1-17 (T263284)
[13:41:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:23] <stashbot>	 T263284: Upgrade Toolforge K8s to 1.17 - https://phabricator.wikimedia.org/T263284
[13:44:11] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "> Patch Set 5:" [puppet] - https://gerrit.wikimedia.org/r/639881 (https://phabricator.wikimedia.org/T263284) (owner: Bstorm)
[13:45:45] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 156823768 and 85 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:46:14] <wikibugs>	 (CR) Patsagorn Y.: Create patroller user group for thwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T2721499) (owner: Patsagorn Y.)
[13:48:48] <wikibugs>	 (CR) Patsagorn Y.: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T2721499) (owner: Patsagorn Y.)
[13:49:11] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 9.4 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[13:51:49] <icinga-wm>	 RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[13:54:23] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 789000 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:54:53] <wikibugs>	 (CR) Patsagorn Y.: "> Patch Set 3:" [mediawiki-config] - https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T2721499) (owner: Patsagorn Y.)
[13:56:00] <wikibugs>	 (CR) Majavah: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T2721499) (owner: Patsagorn Y.)
[13:58:56] <akosiaris>	 the lvs hosts are me btw. Proceeding with setting up new LVS services
[13:59:31] <wikibugs>	 (PS2) Alexandros Kosiaris: similar-users, linkrecommendation: Switch to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/658579 (https://phabricator.wikimedia.org/T265603)
[14:00:26] <wikibugs>	 (PS1) Kormat: setup.cfg: Don't specify a python_version for mypy [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/658592
[14:00:55] <wikibugs>	 (PS2) Kormat: setup.cfg: Don't specify a python_version for mypy [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/658592
[14:03:25] <godog>	 !log swift codfw-prod decrease SSD weight for ms-be20[16-27] - T272837
[14:03:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:30] <stashbot>	 T272837:  Decom ms-be[2016-2027] from swift - https://phabricator.wikimedia.org/T272837
[14:03:43] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 86118936 and 39 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:04:53] <wikibugs>	 (PS16) Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102
[14:05:17] <marostegui>	 !log Restart db1077
[14:05:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:03] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:06:39] <wikibugs>	 (CR) HitomiAkane: "Note: the topic (bug) ID in the commit message is incorrect, also need rebase to solve the merge conflict issue" [mediawiki-config] - https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T2721499) (owner: Patsagorn Y.)
[14:07:02] <wikibugs>	 (CR) jerkins-bot: [V: -1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102 (owner: Jbond)
[14:07:14] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[14:07:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:42] <wikibugs>	 (CR) Jbond: sre: convert the generic reboot functions to the cookbook class API (10 comments) [cookbooks] - https://gerrit.wikimedia.org/r/657102 (owner: Jbond)
[14:09:09] <wikibugs>	 (PS1) Filippo Giunchedi: alertmanager: hide 'logger' receiver on alerts dashboard [puppet] - https://gerrit.wikimedia.org/r/658593 (https://phabricator.wikimedia.org/T272474)
[14:10:22] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, but you might want to open a bug upstream, this is quite an unwanted behaviour." [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/658592 (owner: Kormat)
[14:10:48] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] alertmanager: hide 'logger' receiver on alerts dashboard [puppet] - https://gerrit.wikimedia.org/r/658593 (https://phabricator.wikimedia.org/T272474) (owner: Filippo Giunchedi)
[14:13:34] <logmsgbot>	 !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[14:13:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:46] <wikibugs>	 (CR) Kormat: [C: +2] setup.cfg: Don't specify a python_version for mypy [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/658592 (owner: Kormat)
[14:15:55] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] similar-users, linkrecommendation: Switch to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/658579 (https://phabricator.wikimedia.org/T265603) (owner: Alexandros Kosiaris)
[14:16:48] <wikibugs>	 (PS17) Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102
[14:17:26] <wikibugs>	 (Merged) jenkins-bot: setup.cfg: Don't specify a python_version for mypy [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/658592 (owner: Kormat)
[14:18:41] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={netbox_device_statistics,pdu_sentry4} site={eqiad,eqsin} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:18:48] <wikibugs>	 (PS2) Kormat: switchover: Work-around isolation level issue [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/658580 (https://phabricator.wikimedia.org/T272954)
[14:19:24] <wikibugs>	 (CR) jerkins-bot: [V: -1] sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102 (owner: Jbond)
[14:20:57] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:21:21] <wikibugs>	 (PS1) WMDE-Fisch: Enable bracket matching on the first wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/658594 (https://phabricator.wikimedia.org/T270238)
[14:21:24] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, thanks for the patch!" (1 comment) [software/pywmflib] - https://gerrit.wikimedia.org/r/656141 (owner: Jbond)
[14:21:25] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[14:21:48] <marostegui>	 !log Install mariadb 10.4.18 on pc2010 - T268457
[14:21:51] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 59 connections established with conf2001.codfw.wmnet:2379 (min=59) https://wikitech.wikimedia.org/wiki/PyBal
[14:21:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:53] <stashbot>	 T268457: Investigate possible optimizer regression on 10.4.17 with DELETE statements - https://phabricator.wikimedia.org/T268457
[14:22:01] <akosiaris>	 !log restart pybal on lvs1015, lvs1016, lvs2009, lvs2010 for picking up linkrecommendation, similar-users, apertium-tls LVS services.
[14:22:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:11] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 234392848 and 16 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:23:07] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[14:23:07] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[14:23:07] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[14:23:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:41] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 69 connections established with conf1004.eqiad.wmnet:4001 (min=69) https://wikitech.wikimedia.org/wiki/PyBal
[14:23:43] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[14:24:27] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 342200 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:25:44] <wikibugs>	 (CR) Jbond: [C: +2] dns:  update DNS to support multiple namservers [software/pywmflib] - https://gerrit.wikimedia.org/r/656141 (owner: Jbond)
[14:28:28] <wikibugs>	 (Merged) jenkins-bot: dns:  update DNS to support multiple namservers [software/pywmflib] - https://gerrit.wikimedia.org/r/656141 (owner: Jbond)
[14:30:23] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:31:19] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 60634264 and 34 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:32:08] <wikibugs>	 (PS1) Elukey: Refactor Discovery's analytics airflow to be more generic [puppet] - https://gerrit.wikimedia.org/r/658596 (https://phabricator.wikimedia.org/T272973)
[14:33:35] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 1079136 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:34:31] <wikibugs>	 (PS2) Elukey: Refactor Discovery's analytics airflow to be more generic [puppet] - https://gerrit.wikimedia.org/r/658596 (https://phabricator.wikimedia.org/T272973)
[14:37:21] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:42:55] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 79 connections established with conf2001.codfw.wmnet:2379 (min=79) https://wikitech.wikimedia.org/wiki/PyBal
[14:44:06] <hnowlan>	 !log reimaging maps1009 as new buster master
[14:44:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:35] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 101 connections established with conf1004.eqiad.wmnet:4001 (min=101) https://wikitech.wikimedia.org/wiki/PyBal
[14:46:21] <wikibugs>	 (PS1) Filippo Giunchedi: alertmanager: force 'default' receiver as last child route [puppet] - https://gerrit.wikimedia.org/r/658598 (https://phabricator.wikimedia.org/T272474)
[14:46:27] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[14:46:30] <wikibugs>	 (CR) Kormat: "> SET STATEMENT binlog_format=STATEMENT FOR <original query>" (2 comments) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/658580 (https://phabricator.wikimedia.org/T272954) (owner: Kormat)
[14:47:01] <wikibugs>	 (PS1) Alexandros Kosiaris: similar-users, linkrecommendation: Switch to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/658599 (https://phabricator.wikimedia.org/T265603)
[14:47:43] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[14:48:19] <wikibugs>	 (PS18) Jbond: cookbook sre.misc-clusters.apt: [cookbooks] - https://gerrit.wikimedia.org/r/656139
[14:48:33] <wikibugs>	 (CR) Jbond: cookbook sre.misc-clusters.apt: (5 comments) [cookbooks] - https://gerrit.wikimedia.org/r/656139 (owner: Jbond)
[14:48:40] <wikibugs>	 (PS1) Elukey: Add fake user/pass for Search's airflow [labs/private] - https://gerrit.wikimedia.org/r/658600
[14:49:08] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] alertmanager: force 'default' receiver as last child route [puppet] - https://gerrit.wikimedia.org/r/658598 (https://phabricator.wikimedia.org/T272474) (owner: Filippo Giunchedi)
[14:49:19] <wikibugs>	 (PS2) Alexandros Kosiaris: similar-users, linkrecommendation: Switch to monitoring_setup [puppet] - https://gerrit.wikimedia.org/r/658599 (https://phabricator.wikimedia.org/T265603)
[14:49:43] <wikibugs>	 (PS3) Elukey: Refactor Discovery's analytics airflow to be more generic [puppet] - https://gerrit.wikimedia.org/r/658596 (https://phabricator.wikimedia.org/T272973)
[14:50:06] <wikibugs>	 (PS3) Jbond: icinga: add wait_for_optimal function [software/spicerack] - https://gerrit.wikimedia.org/r/657385
[14:50:25] <wikibugs>	 (PS2) Elukey: Add fake user/pass for Search's airflow [labs/private] - https://gerrit.wikimedia.org/r/658600
[14:50:45] <wikibugs>	 (CR) Elukey: [V: +2 C: +2] Add fake user/pass for Search's airflow [labs/private] - https://gerrit.wikimedia.org/r/658600 (owner: Elukey)
[14:50:47] <wikibugs>	 (CR) jerkins-bot: [V: -1] cookbook sre.misc-clusters.apt: [cookbooks] - https://gerrit.wikimedia.org/r/656139 (owner: Jbond)
[14:51:16] <wikibugs>	 (CR) jerkins-bot: [V: -1] Refactor Discovery's analytics airflow to be more generic [puppet] - https://gerrit.wikimedia.org/r/658596 (https://phabricator.wikimedia.org/T272973) (owner: Elukey)
[14:53:38] <wikibugs>	 (PS4) Elukey: Refactor Discovery's analytics airflow to be more generic [puppet] - https://gerrit.wikimedia.org/r/658596 (https://phabricator.wikimedia.org/T272973)
[14:53:48] <wikibugs>	 (CR) Kormat: "I've filed https://github.com/python/mypy/issues/9972 with upstream." [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/658592 (owner: Kormat)
[14:54:42] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:55:10] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27667/console" [puppet] - https://gerrit.wikimedia.org/r/658596 (https://phabricator.wikimedia.org/T272973) (owner: Elukey)
[14:56:44] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on maps1009.eqiad.wmnet with reason: REIMAGE
[14:56:44] <wikibugs>	 (CR) Elukey: [V: +1] "To be effective this change needs something like https://gerrit.wikimedia.org/r/c/labs/private/+/658600 in the private repo before merging" [puppet] - https://gerrit.wikimedia.org/r/658596 (https://phabricator.wikimedia.org/T272973) (owner: Elukey)
[14:56:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:52] <wikibugs>	 (CR) jerkins-bot: [V: -1] icinga: add wait_for_optimal function [software/spicerack] - https://gerrit.wikimedia.org/r/657385 (owner: Jbond)
[14:57:10] <wikibugs>	 (CR) Volans: [C: -2] "This is great but should go in spicerack. wmflib is a multi-purpose library that doesn't require any special permissions and can be import" [software/pywmflib] - https://gerrit.wikimedia.org/r/658582 (owner: Jbond)
[14:58:10] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:58:43] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps1009.eqiad.wmnet with reason: REIMAGE
[14:58:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:54] <wikibugs>	 (PS4) Jbond: icinga: add wait_for_optimal function [software/spicerack] - https://gerrit.wikimedia.org/r/657385
[15:00:00] <wikibugs>	 (CR) Jbond: "updated thanks" (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/657385 (owner: Jbond)
[15:02:36] <wikibugs>	 (CR) Bstorm: "> Patch Set 6:" [puppet] - https://gerrit.wikimedia.org/r/639881 (https://phabricator.wikimedia.org/T263284) (owner: Bstorm)
[15:06:46] <wikibugs>	 (PS1) Jbond: (WIP) debdeploy: Add debdeploy functionality [software/spicerack] - https://gerrit.wikimedia.org/r/658626
[15:06:48] <wikibugs>	 (PS1) Hnowlan: maps::apps: only use nodejs10 repo on stretch [puppet] - https://gerrit.wikimedia.org/r/658627 (https://phabricator.wikimedia.org/T238753)
[15:06:59] <wikibugs>	 (Abandoned) Jbond: (WIP) debdeploy: Add debdeploy functionality [software/pywmflib] - https://gerrit.wikimedia.org/r/658582 (owner: Jbond)
[15:07:35] <wikibugs>	 (CR) jerkins-bot: [V: -1] icinga: add wait_for_optimal function [software/spicerack] - https://gerrit.wikimedia.org/r/657385 (owner: Jbond)
[15:08:04] <wikibugs>	 (PS1) Alexandros Kosiaris: Absent /etc/helmfile-defaults/service-proxy.yaml [puppet] - https://gerrit.wikimedia.org/r/658628
[15:08:06] <wikibugs>	 (PS1) Alexandros Kosiaris: service proxy: Add apertium [puppet] - https://gerrit.wikimedia.org/r/658629
[15:08:08] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM" [labs/private] - https://gerrit.wikimedia.org/r/658550 (https://phabricator.wikimedia.org/T272713) (owner: Ryan Kemper)
[15:08:35] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] similar-users, linkrecommendation: Switch to monitoring_setup [puppet] - https://gerrit.wikimedia.org/r/658599 (https://phabricator.wikimedia.org/T265603) (owner: Alexandros Kosiaris)
[15:09:32] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM as far as I understand how we manage SSL certs. A +1 from someone who has a more complete understanding would be nice." [puppet] - https://gerrit.wikimedia.org/r/658548 (https://phabricator.wikimedia.org/T272713) (owner: Ryan Kemper)
[15:09:37] <wikibugs>	 (CR) Gehel: "LGTM as far as I understand how we manage SSL certs. A +1 from someone who has a more complete understanding would be nice." [puppet] - https://gerrit.wikimedia.org/r/657913 (owner: Ryan Kemper)
[15:12:42] <wikibugs>	 (CR) jerkins-bot: [V: -1] (WIP) debdeploy: Add debdeploy functionality [software/spicerack] - https://gerrit.wikimedia.org/r/658626 (owner: Jbond)
[15:12:50] <wikibugs>	 (PS2) Alexandros Kosiaris: Absent /etc/helmfile-defaults/service-proxy.yaml [puppet] - https://gerrit.wikimedia.org/r/658628
[15:12:52] <wikibugs>	 (PS2) Alexandros Kosiaris: service proxy: Add apertium [puppet] - https://gerrit.wikimedia.org/r/658629
[15:12:54] <wikibugs>	 (PS1) Alexandros Kosiaris: similar-users, linkrecommendation: Switch to production [puppet] - https://gerrit.wikimedia.org/r/658630 (https://phabricator.wikimedia.org/T265603)
[15:14:38] <wikibugs>	 SRE, DBA, Platform Engineering Roadmap Decision Making, Performance-Team (Radar), User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (Krinkle) @Kormat  @Marostegui I believe this is unblocked now for you to remove groups from the db configuration.  At this...
[15:16:48] <wikibugs>	 (PS5) Jbond: icinga: add wait_for_optimal function [software/spicerack] - https://gerrit.wikimedia.org/r/657385
[15:26:15] <wikibugs>	 (CR) jerkins-bot: [V: -1] icinga: add wait_for_optimal function [software/spicerack] - https://gerrit.wikimedia.org/r/657385 (owner: Jbond)
[15:28:04] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 174964472 and 34 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[15:29:32] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 580528 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[15:29:36] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/658552 (owner: ArielGlenn)
[15:30:02] <wikibugs>	 SRE, Analytics, Analytics-Kanban, Event-Platform, and 5 others: Set up internal eventstreams instance exposing all streams declared in stream config (and in kafka jumbo) - https://phabricator.wikimedia.org/T269160 (elukey) Created all the TLS certs and configs as described in https://wikitech.wik...
[15:30:12] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:30:19] <wikibugs>	 (PS19) Hnowlan: start using imposm as OSM sync tool [puppet] - https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: MSantos)
[15:30:47] <wikibugs>	 (PS7) Jbond: nodegen: add cumin support [software/puppet-compiler] - https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288)
[15:33:15] <wikibugs>	 (PS8) Jbond: nodegen: add cumin support [software/puppet-compiler] - https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288)
[15:33:58] <wikibugs>	 (PS7) David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - https://gerrit.wikimedia.org/r/658357
[15:33:59] <wikibugs>	 (PS1) David Caro: wmcs: Move to class-based cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/658631
[15:35:30] <wikibugs>	 (CR) David Caro: "The diff looks really bad xd, there aren't really  many changes." [cookbooks] - https://gerrit.wikimedia.org/r/658631 (owner: David Caro)
[15:36:44] <wikibugs>	 (PS8) David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - https://gerrit.wikimedia.org/r/658357
[15:36:46] <wikibugs>	 (PS2) David Caro: wmcs: Move to class-based cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/658631
[15:36:48] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:38:44] <wikibugs>	 (CR) jerkins-bot: [V: -1] nodegen: add cumin support [software/puppet-compiler] - https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288) (owner: Jbond)
[15:43:34] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 47542712 and 33 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[15:46:13] <wikibugs>	 (CR) Elukey: "Hi David, I saw that code change passing by and I added a couple of comments, nothing related to the logic of the cookbooks but only to so" (7 comments) [cookbooks] - https://gerrit.wikimedia.org/r/658631 (owner: David Caro)
[15:51:42] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 20528 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[15:53:37] <wikibugs>	 (CR) David Caro: "Thanks a lot Elukey! Will fix/added reply 😊" (7 comments) [cookbooks] - https://gerrit.wikimedia.org/r/658631 (owner: David Caro)
[15:54:52] <icinga-wm>	 RECOVERY - Check systemd state on mw2259 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:55:39] <wikibugs>	 (CR) Muehlenhoff: "Looks good, two nits/comments inline (but feel free to ignore)" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/658407 (owner: Jbond)
[15:55:43] <wikibugs>	 (PS9) David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - https://gerrit.wikimedia.org/r/658357
[15:55:45] <wikibugs>	 (PS3) David Caro: wmcs: Move to class-based cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/658631
[15:58:20] <wikibugs>	 (CR) jerkins-bot: [V: -1] wmcs: first try on creating a new etcd for toolforge [cookbooks] - https://gerrit.wikimedia.org/r/658357 (owner: David Caro)
[16:02:02] <moritzm>	 !log installing mutt security updates on buster
[16:02:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:01] <wikibugs>	 (CR) Andrew Bogott: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/658452 (https://phabricator.wikimedia.org/T272114) (owner: Andrew Bogott)
[16:12:22] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:14:14] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:14:46] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 178367440 and 73 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:15:20] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/658627 (https://phabricator.wikimedia.org/T238753) (owner: Hnowlan)
[16:15:22] <wikibugs>	 (PS6) Jbond: O:idp: update apero_cas::service so its a bit more intuitive [puppet] - https://gerrit.wikimedia.org/r/658407
[16:15:55] <wikibugs>	 (CR) Bstorm: data-services: apply user variances to future creations (1 comment) [puppet] - https://gerrit.wikimedia.org/r/657890 (https://phabricator.wikimedia.org/T269399) (owner: Bstorm)
[16:16:34] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27670/console" [puppet] - https://gerrit.wikimedia.org/r/658407 (owner: Jbond)
[16:18:04] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:18:47] <wikibugs>	 (03PS7) 10Jbond: O:idp: update apero_cas::service so its a bit more intuitive [puppet] - 10https://gerrit.wikimedia.org/r/658407
[16:19:59] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27671/console" [puppet] - 10https://gerrit.wikimedia.org/r/658407 (owner: 10Jbond)
[16:20:12] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:21:49] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] httpbb: Add test for gzipping of static css files. [puppet] - 10https://gerrit.wikimedia.org/r/658317 (https://phabricator.wikimedia.org/T272305) (owner: 10Giuseppe Lavagetto)
[16:22:04] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+2] Branch commit for wmf/1.36.0-wmf.28 [core] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/658479 (https://phabricator.wikimedia.org/T271342) (owner: 10TrainBranchBot)
[16:22:06] <wikibugs>	 (03PS8) 10Jbond: O:idp: update apero_cas::service so its a bit more intuitive [puppet] - 10https://gerrit.wikimedia.org/r/658407
[16:23:14] <wikibugs>	 (03PS4) 10David Caro: wmcs: Move to class-based cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/658631
[16:23:16] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27672/console" [puppet] - 10https://gerrit.wikimedia.org/r/658407 (owner: 10Jbond)
[16:24:25] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] O:idp: update apero_cas::service so its a bit more intuitive (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/658407 (owner: 10Jbond)
[16:25:31] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: linkrecommendation: Enable monitoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/658636 (https://phabricator.wikimedia.org/T258978)
[16:25:54] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] maps::apps: only use nodejs10 repo on stretch [puppet] - 10https://gerrit.wikimedia.org/r/658627 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan)
[16:26:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10Jgreen) >>! In T266481#6775156, @wiki_willy wrote: > Hi @Jgreen - it looks like we're running a bit tight on space in the Fundraising rack.  In order for us to rack the servers for...
[16:26:53] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] add deploy1002 and deploy2002 to deployment_hosts for firewalls [puppet] - 10https://gerrit.wikimedia.org/r/635079 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn)
[16:27:26] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 26536 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[16:28:36] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10wiki_willy) Thanks @Jgreen (cc'ing @Jclark-ctr as a fyi)  >>! In T266481#6777535, @Jgreen wrote: >>>! In T266481#6775156, @wiki_willy wrote: >> Hi @Jgreen - it looks like we're runn...
[16:28:45] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey)
[16:28:50] <wikibugs>	 (03PS1) 10David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - 10https://gerrit.wikimedia.org/r/658637
[16:29:10] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for MewOphaswongse - https://phabricator.wikimedia.org/T272912 (10Dzahn) Hi @mewoph could you let us know what specific thing you actually want to access?  Thanks!
[16:29:21] <wikibugs>	 (03CR) 10David Caro: "Hmm... something got borked here... I seem to be unable to update, patch 9 was not supposed to be there (removing the linting comments)." [cookbooks] - 10https://gerrit.wikimedia.org/r/658357 (owner: 10David Caro)
[16:29:42] <wikibugs>	 (03CR) 10David Caro: "Comes from https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/658357" [cookbooks] - 10https://gerrit.wikimedia.org/r/658637 (owner: 10David Caro)
[16:30:27] <wikibugs>	 (03PS5) 10David Caro: wmcs: Move to class-based cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/658631
[16:31:02] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:31:21] <wikibugs>	 (03Abandoned) 10David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - 10https://gerrit.wikimedia.org/r/658357 (owner: 10David Caro)
[16:31:22] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:32:22] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "lgtm, checked on mwmaint and ldap-corp. full time employee" [puppet] - 10https://gerrit.wikimedia.org/r/658469 (https://phabricator.wikimedia.org/T272912) (owner: 10Legoktm)
[16:32:51] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] add deploy1002 and deploy2002 to deployment_hosts for firewalls [puppet] - 10https://gerrit.wikimedia.org/r/635079 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn)
[16:32:57] <wikibugs>	 (03PS6) 10Dzahn: add deploy1002 and deploy2002 to deployment_hosts for firewalls [puppet] - 10https://gerrit.wikimedia.org/r/635079 (https://phabricator.wikimedia.org/T265963)
[16:33:40] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:33:40] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) Ok all nodes racked are now working! We have 6 missing node still to rack, ideally in rows not already too used. For example, this is our current d...
[16:34:44] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] linkrecommendation: Enable monitoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/658636 (https://phabricator.wikimedia.org/T258978) (owner: 10Alexandros Kosiaris)
[16:36:05] <mutante>	 adding new deployment hosts to firewalls, this will be a ferm change/reload on a LOT of hosts
[16:36:20] <wikibugs>	 (03Merged) 10jenkins-bot: linkrecommendation: Enable monitoring [deployment-charts] - 10https://gerrit.wikimedia.org/r/658636 (https://phabricator.wikimedia.org/T258978) (owner: 10Alexandros Kosiaris)
[16:37:48] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:38:26] <icinga-wm>	 PROBLEM - puppet last run on kafka-test1007 is CRITICAL: CRITICAL: Puppet last ran 6 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:38:41] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[16:38:41] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[16:38:41] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[16:38:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:11] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[16:40:34] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on maps1009.eqiad.wmnet with reason: REIMAGE
[16:40:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:43] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[16:41:50] <wikibugs>	 (03CR) 10Elukey: sre.kafka.reboot-workers: Add cookbook to restart nodes in kafka cluster (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi)
[16:42:06] <mutante>	 !log reimaging l33t jobrunner mw1337
[16:42:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:20] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[16:42:20] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[16:42:20] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[16:42:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:41] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps1009.eqiad.wmnet with reason: REIMAGE
[16:42:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:22] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[16:43:45] <icinga-wm>	 RECOVERY - puppet last run on kafka-test1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:43:55] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[16:43:56] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[16:43:56] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[16:43:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:07] <icinga-wm>	 PROBLEM - Check systemd state on kafka-test1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:45:19] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[16:45:24] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2321.codfw.wmnet'] `  Of...
[16:45:53] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[16:46:04] <wikibugs>	 (03CR) 10Elukey: "One last little thing that I just realized, and I think we are ready to go :)" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi)
[16:50:09] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[16:50:15] <marostegui>	 !log Deploy schema change on testwiki - T272953
[16:50:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:20] <stashbot>	 T272953: CentralNotice: Update DB schema on Meta for campign types feature - https://phabricator.wikimedia.org/T272953
[16:50:53] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/1.36.0-wmf.28 [core] (wmf/1.36.0-wmf.28) - 10https://gerrit.wikimedia.org/r/658479 (https://phabricator.wikimedia.org/T271342) (owner: 10TrainBranchBot)
[16:51:38] <RhinosF1|NotHere>	 Am I being stupid or has the diagnosing connection problems page been moved on wikitech?
[16:55:02] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1409.eqiad.wmnet with reason: REIMAGE
[16:55:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:27] <RhinosF1|NotHere>	 I found it
[16:56:30] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+1] "seems sane to me" [puppet] - 10https://gerrit.wikimedia.org/r/658596 (https://phabricator.wikimedia.org/T272973) (owner: 10Elukey)
[16:56:36] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1408.eqiad.wmnet with reason: REIMAGE
[16:56:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:05] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1409.eqiad.wmnet with reason: REIMAGE
[16:57:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:54] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1408.eqiad.wmnet with reason: REIMAGE
[16:58:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:05] <jouncebot>	 jbond42 and cdanis: Dear deployers, time to do the Puppet request window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210126T1700).
[17:01:11] <wikibugs>	 (03PS1) 10Dzahn: DHCP: switch all eqiad appservers to use buster installer [puppet] - 10https://gerrit.wikimedia.org/r/658642 (https://phabricator.wikimedia.org/T245757)
[17:02:20] <wikibugs>	 10SRE, 10Analytics-Radar, 10Domains, 10Traffic, and 2 others: Don't set cookies in traffic layer for non-user facing domains (avoid false third-party cookie warning) - https://phabricator.wikimedia.org/T262996 (10Krinkle)
[17:02:35] <wikibugs>	 10SRE, 10Analytics-Radar, 10Domains, 10Traffic, 10Wikimedia-General-or-Unknown: Don't set cookies in traffic layer for non-user facing domains (avoid false third-party cookie warning) - https://phabricator.wikimedia.org/T262996 (10Krinkle)
[17:02:38] <wikibugs>	 10SRE, 10Analytics-Radar, 10Domains, 10Traffic, 10Wikimedia-General-or-Unknown: Don't set cookies in traffic layer for non-user facing domains (avoid false third-party cookie warning) - https://phabricator.wikimedia.org/T262996 (10Krinkle)
[17:02:55] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:03:59] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] DHCP: switch all eqiad appservers to use buster installer [puppet] - 10https://gerrit.wikimedia.org/r/658642 (https://phabricator.wikimedia.org/T245757) (owner: 10Dzahn)
[17:04:59] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2321.codfw.wmnet with reason: REIMAGE
[17:05:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:25] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1337.eqiad.wmnet with reason: REIMAGE
[17:05:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:06:42] <wikibugs>	 (03PS1) 10Dzahn: scap: add deploy2002 to mediawiki installation hosts [puppet] - 10https://gerrit.wikimedia.org/r/658643 (https://phabricator.wikimedia.org/T265963)
[17:06:58] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2321.codfw.wmnet with reason: REIMAGE
[17:07:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:07] <wikibugs>	 (03PS2) 10Dzahn: scap: add deploy2002 to mediawiki installation hosts [puppet] - 10https://gerrit.wikimedia.org/r/658643 (https://phabricator.wikimedia.org/T265963)
[17:07:37] <wikibugs>	 10SRE, 10ops-eqiad: ms-be1046 stuck on reboot - https://phabricator.wikimedia.org/T272396 (10Cmjohnson) a new motherboard has been dispatched, I will coordinate with Dell Tech to get this completed, Hoping for Wednesday.
[17:08:24] <wikibugs>	 10SRE, 10puppet-compiler, 10User-jbond: puppet documentation generation is missing some compnets - https://phabricator.wikimedia.org/T271909 (10thcipriani)
[17:08:55] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1337.eqiad.wmnet with reason: REIMAGE
[17:08:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:59] <wikibugs>	 (03PS3) 10Dzahn: scap: add deploy1002 and deploy2002 to mediawiki hosts [puppet] - 10https://gerrit.wikimedia.org/r/658643 (https://phabricator.wikimedia.org/T265963)
[17:11:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Cmjohnson) New DIMM has been dispatched for the server I will coordinate a time with you to power down to restore the original configuration.
[17:12:13] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Marostegui) Sounds good @Cmjohnson let me know when it arrives and you plan to change it so I can stop mysql Thank you
[17:12:36] <wikibugs>	 10SRE, 10ops-eqiad, 10Discovery-Search (Current work): Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (10Cmjohnson) Good news bad news,  Dell dispatched a new DIMM.  The bad news, is we do not know which one and it could take some time to figure that...
[17:15:03] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] service proxy: Add apertium (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/658629 (owner: 10Alexandros Kosiaris)
[17:17:04] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] "At least termbox still references this file (helmfile.d/services/termbox/helmfile.yaml). I would suggest to first remove it from there, ju" [puppet] - 10https://gerrit.wikimedia.org/r/658628 (owner: 10Alexandros Kosiaris)
[17:17:09] <wikibugs>	 (03PS6) 10Dzahn: Revert "remove parsoid-rt-tests.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/653998 (https://phabricator.wikimedia.org/T266509)
[17:17:50] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Revert "remove parsoid-rt-tests.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/653998 (https://phabricator.wikimedia.org/T266509) (owner: 10Dzahn)
[17:18:04] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1028 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[17:18:14] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1409.eqiad.wmnet'] `  an...
[17:19:13] <mutante>	 !log ms-be1028 - running puppet to clear ferm icinga alert
[17:19:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:40] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[17:19:41] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1408.eqiad.wmnet'] `  an...
[17:20:38] <mutante>	 so that kind of ferm alert is a race condition that can happen in a very few cases if you make a change to ferm rules in base. it's just the scale of it
[17:21:47] <mutante>	 that's why I announced it earlier. the fix is running puppet another time, and rescheduling the icinga check if it alerts, like in this one case here
[17:21:49] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1028 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[17:21:49] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:21:56] <mutante>	 and there it is fixed again
[17:27:01] <wikibugs>	 (03PS2) 10Cwhite: profile: ecs indices to use a weekly rotation [puppet] - 10https://gerrit.wikimedia.org/r/657371 (https://phabricator.wikimedia.org/T234565)
[17:27:16] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[17:27:29] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[17:28:26] <wikibugs>	 (03PS20) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos)
[17:30:14] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2321.codfw.wmnet'] `  an...
[17:32:32] <wikibugs>	 (03PS21) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos)
[17:33:27] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on deploy2002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[17:34:03] <wikibugs>	 (03PS4) 10Effie Mouzeli: service_proxy: add ipv6 config option on services_proxy config [puppet] - 10https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568)
[17:34:19] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] profile: ecs indices to use a weekly rotation [puppet] - 10https://gerrit.wikimedia.org/r/657371 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[17:34:51] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[17:35:41] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on deploy1002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[17:36:25] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[17:37:07] <wikibugs>	 10SRE, 10DBA, 10Continuous-Integration-Config, 10Release-Engineering-Team (CI & Testing services), and 2 others: Create integration test env for wmfmariadbpy - https://phabricator.wikimedia.org/T265266 (10thcipriani)
[17:38:36] <wikibugs>	 (03PS5) 10Effie Mouzeli: service_proxy: add ipv6 config option on services_proxy config [puppet] - 10https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568)
[17:39:06] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10wiki_willy) Hi @elukey - in looking through Netbox and talking to Chris, this is what I'm thinking, but @Cmjohnson/@Jclark-ctr/@elukey - please call me out...
[17:39:34] <wikibugs>	 10SRE, 10ops-eqiad: sdg1 failed on ms-be1054 - https://phabricator.wikimedia.org/T269556 (10Cmjohnson) A ticket has been opened with HPE 5353126225
[17:40:34] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Services, 10Epic, 10cloud-services-team (Hardware): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Cmjohnson)
[17:40:37] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Hardware): Move or recable labstore1004 to 10Gbps rack (if needed) and ethernet - https://phabricator.wikimedia.org/T266202 (10Cmjohnson) 05Open→03Resolved
[17:40:59] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Services, 10Epic, 10cloud-services-team (Hardware): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Cmjohnson)
[17:41:44] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Services, 10Patch-For-Review, 10cloud-services-team (Hardware): Move labstore1005 to 10Gbps rack and ethernet - https://phabricator.wikimedia.org/T266199 (10Cmjohnson) 05Open→03Resolved This has been completed
[17:42:00] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Services, 10Epic, 10cloud-services-team (Hardware): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Cmjohnson) 05Open→03Resolved The move has taken place, if you have work to do outside of data center, please re-open and rem...
[17:42:51] <wikibugs>	 (03PS1) 10Cwhite: profile: define required curator config variables [puppet] - 10https://gerrit.wikimedia.org/r/658646
[17:43:06] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10Cmjohnson) @Bstorm I do have the cross-over cable on-site. Is it okay to just connect o...
[17:43:13] <wikibugs>	 (03PS6) 10Effie Mouzeli: service_proxy: add ipv6 config option on services_proxy config [puppet] - 10https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568)
[17:44:14] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: Add support for php deployments (0310 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 (owner: 10Giuseppe Lavagetto)
[17:44:16] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1409.eqiad.wmnet with reason: REIMAGE
[17:44:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:44:30] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1408.eqiad.wmnet with reason: REIMAGE
[17:44:30] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: Add support for php deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757
[17:44:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:45:09] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] profile: define required curator config variables [puppet] - 10https://gerrit.wikimedia.org/r/658646 (owner: 10Cwhite)
[17:46:03] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] service_proxy: add ipv6 config option on services_proxy config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629343 (https://phabricator.wikimedia.org/T255568) (owner: 10Effie Mouzeli)
[17:46:19] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1409.eqiad.wmnet with reason: REIMAGE
[17:46:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:17] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1408.eqiad.wmnet with reason: REIMAGE
[17:48:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:50] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1337.eqiad.wmnet'] `  an...
[17:55:53] <wikibugs>	 10SRE, 10LDAP-Access-Requests: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (10JTannerWMF) Hi there sorry for the delayed response, for the sake of consistency JTanner works.
[18:00:05] <jouncebot>	 chrisalbon and accraze: Your horoscope predicts another unfortunate Services – Graphoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210126T1800).
[18:05:05] <wikibugs>	 (03PS1) 10Ahmon Dancy: testwikis wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658647
[18:05:07] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+2] testwikis wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658647 (owner: 10Ahmon Dancy)
[18:06:09] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1409.eqiad.wmnet'] `  an...
[18:06:49] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658647 (owner: 10Ahmon Dancy)
[18:07:02] <logmsgbot>	 !log dancy@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.28
[18:07:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:07:42] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1408.eqiad.wmnet'] `  an...
[18:11:20] <wikibugs>	 (03PS22) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos)
[18:15:08] <moritzm>	 !log installing sudo security updates on Buster
[18:15:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:03] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps1009.eqiad.wmnet with reason: Downtiming after rebuild
[18:22:04] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps1009.eqiad.wmnet with reason: Downtiming after rebuild
[18:22:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:02] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] Refactor Discovery's analytics airflow to be more generic [puppet] - 10https://gerrit.wikimedia.org/r/658596 (https://phabricator.wikimedia.org/T272973) (owner: 10Elukey)
[18:26:00] <Cyberpower678>	 I've gotten a report about Cyberbot malfunction globally in the last few weeks.  I suspected a transient issue, but it has not gone away.  I looked at the logs today and I see an enormous amount of Error: 502, Server Hangup at 2021-01-26 14:21:35 GMT
[18:28:37] <Cyberpower678>	 A lot of other requests are simply 0-byte responses.
[18:29:39] <Cyberpower678>	 !help
[18:29:39] <wm-bot>	 want docs? ask for "!wm-bot". all keywords? try "@regsearch .*"
[18:30:30] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:31:15] <Cyberpower678>	 legoktm do I ping you for this?
[18:32:13] <legoktm>	 hi
[18:32:26] <Cyberpower678>	 Hi.  It's been a while. :-)
[18:32:30] <legoktm>	 Cyberpower678: can you file a bug please? and tag #SRE
[18:32:36] <Cyberpower678>	 Sure
[18:32:54] <legoktm>	 if you could include what API requests you're making that would help too
[18:33:56] <Cyberpower678>	 I have a lot of juicy details for you. :-)
[18:34:56] <wikibugs>	 (03PS16) 10Cwhite: profile: add ecs pre and post filters to pipeline [puppet] - 10https://gerrit.wikimedia.org/r/647028 (https://phabricator.wikimedia.org/T234565)
[18:36:08] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:36:21] <cdanis>	 Cyberpower678: yeah, please include particular API calls, and also originating IP address, if you can
[18:37:34] <legoktm>	 Cyberpower678: thanks, I just got into a meeting so I'll look as soon as I'm done
[18:37:50] <moritzm>	 !log installing sudo security updates on Stretch
[18:37:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:56] <wikibugs>	 10SRE, 10Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Cyberpower678)
[18:39:11] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10Bstorm) It should be ok to just connect any time as long as the primary link is ok.
[18:39:59] <cdanis>	 Cyberpower678: dumb question, does Cyberbot run on WMCS?
[18:41:24] <cdanis>	 what User-Agent does it set?
[18:43:53] <Cyberpower678>	 cdanis yes it does, and I don't remember what the UA is.
[18:44:02] <Cyberpower678>	 Cyberbot is quite old
[18:44:05] <wikibugs>	 10SRE, 10Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10CDanis) What is the originating IP address of these requests?  What User-Agent is sent by Cyberbot?
[18:44:13] <cdanis>	 ah
[18:44:38] <Cyberpower678>	 But the originating IP is contained in some of the 502 HTML responses.
[18:46:59] <logmsgbot>	 !log dancy@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.28 (duration: 40m 09s)
[18:47:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:51:37] <cdanis>	 it seems that the User-Agent being sent is "Peachy MediaWiki Bot API Version 2.0 (alpha 8)".  It would be nice to have that updated to be compliant with https://meta.wikimedia.org/wiki/User-Agent_policy
[18:52:26] <wikibugs>	 10SRE, 10Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Cyberpower678) The IP 172.16.2.21, I don't know what the UA is.  The bot is pretty old.
[18:53:40] <Cyberpower678>	 cdanis: it originates from a framework I no longer maintain.  The bot is old, and whenever I get free time, I am slowly coding up replacements for the existing bot tasks.
[18:55:57] <cdanis>	 I see
[18:57:09] <moritzm>	 !log uploaded sudo 1.8.10p3-1+deb8u7+wmf1 to apt.wikimedia.org
[18:57:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:15] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by legoktm on cumin1001.eq...
[18:58:03] <moritzm>	 !log installing sudo security updates on Jessie
[18:58:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210126T1900)
[19:05:55] <wikibugs>	 10SRE, 10Traffic, 10serviceops, 10Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10CDanis) It seems the User-Agent being used is `Peachy MediaWiki Bot API Version 2.0 (alpha 8)` (which ideally should...
[19:06:44] <wikibugs>	 (03PS1) 10Andrew Bogott: OpenStack: add config files for openstack Train [puppet] - 10https://gerrit.wikimedia.org/r/658652 (https://phabricator.wikimedia.org/T261135)
[19:06:46] <wikibugs>	 (03PS1) 10Andrew Bogott: Neutron: forward our dmz hacks from version Stein to Train [puppet] - 10https://gerrit.wikimedia.org/r/658653 (https://phabricator.wikimedia.org/T261135)
[19:06:48] <wikibugs>	 (03PS1) 10Andrew Bogott: Nova: forward our server name regex hack to version Train [puppet] - 10https://gerrit.wikimedia.org/r/658654 (https://phabricator.wikimedia.org/T261135)
[19:06:50] <wikibugs>	 (03PS1) 10Andrew Bogott: Nova: Very minor config update for version Train [puppet] - 10https://gerrit.wikimedia.org/r/658655 (https://phabricator.wikimedia.org/T261135)
[19:06:52] <wikibugs>	 10SRE, 10Traffic, 10serviceops, 10Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Legoktm)
[19:06:55] <wikibugs>	 (03PS1) 10Andrew Bogott: Add manifests for Neutron version 'Train' [puppet] - 10https://gerrit.wikimedia.org/r/658656 (https://phabricator.wikimedia.org/T261135)
[19:06:57] <wikibugs>	 (03PS1) 10Andrew Bogott: Add manifests for Cinder version 'Train' [puppet] - 10https://gerrit.wikimedia.org/r/658657 (https://phabricator.wikimedia.org/T261135)
[19:06:59] <wikibugs>	 (03PS1) 10Andrew Bogott: Add manifest for Glance version Train [puppet] - 10https://gerrit.wikimedia.org/r/658658 (https://phabricator.wikimedia.org/T261135)
[19:07:01] <wikibugs>	 (03PS1) 10Andrew Bogott: Add manifests for Keystone version Train [puppet] - 10https://gerrit.wikimedia.org/r/658659 (https://phabricator.wikimedia.org/T261135)
[19:07:03] <wikibugs>	 (03PS1) 10Andrew Bogott: Add manifests for Nova version Train [puppet] - 10https://gerrit.wikimedia.org/r/658660 (https://phabricator.wikimedia.org/T261135)
[19:07:05] <wikibugs>	 (03PS1) 10Andrew Bogott: Add manifest for Barbican version Train [puppet] - 10https://gerrit.wikimedia.org/r/658661 (https://phabricator.wikimedia.org/T261135)
[19:07:07] <wikibugs>	 (03PS1) 10Andrew Bogott: Add OpenStack client package manifests for version Train [puppet] - 10https://gerrit.wikimedia.org/r/658662 (https://phabricator.wikimedia.org/T261135)
[19:09:37] <legoktm>	 Cyberpower678: does Cyberbot have rate limiting built in?
[19:10:05] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add manifests for Keystone version Train [puppet] - 10https://gerrit.wikimedia.org/r/658659 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott)
[19:12:32] <Cyberpower678>	 It follows maxlag
[19:13:32] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2321.codfw.wmnet
[19:13:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:18] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2317.codfw.wmnet with reason: REIMAGE
[19:16:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:28] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2321 is CRITICAL: Host mw2321 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[19:18:20] <logmsgbot>	 !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2317.codfw.wmnet with reason: REIMAGE
[19:18:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:18:42] <icinga-wm>	 ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw2321 is CRITICAL: Host mw2321 is not in mediawiki-installation dsh group daniel_zahn reimaged https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[19:19:52] <wikibugs>	 (03CR) 10Andrew Bogott: "removing jenkin's down-vote.  It's because the linter doesn't like us including network::constants and I'm not going to refactor that toda" [puppet] - 10https://gerrit.wikimedia.org/r/658659 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott)
[19:20:38] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw1337 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[19:21:32] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on mw2321 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[19:22:22] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on deploy2002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[19:22:56] <mutante>	 ^ they need a scap pull.. doing that
[19:25:52] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw1337 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[19:25:58] <wikibugs>	 10SRE, 10Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (10Dzahn) a:03Dzahn
[19:26:28] <wikibugs>	 (03Abandoned) 10Jforrester: Restore hide link when viewing single AbuseLog entries [extensions/AbuseFilter] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655943 (https://phabricator.wikimedia.org/T271667) (owner: 10DannyS712)
[19:26:44] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on mw2321 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[19:27:21] <wikibugs>	 10SRE, 10Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (10Dzahn) It already does the things suggested by Urbanecm, just that the Icinga check isn't running every 5 minutes but just every couple hours, i think.
[19:27:38] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on deploy2002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[19:31:09] <wikibugs>	 (03PS1) 10Andrew Bogott: toolforge: pin sudo_ldap to a newer package version [puppet] - 10https://gerrit.wikimedia.org/r/658666
[19:31:32] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:33:25] <wikibugs>	 (03PS2) 10Legoktm: admin: Add mewoph to list of privledged LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/658469 (https://phabricator.wikimedia.org/T272912)
[19:34:05] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] admin: Add mewoph to list of privledged LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/658469 (https://phabricator.wikimedia.org/T272912) (owner: 10Legoktm)
[19:34:12] <wikibugs>	 10SRE, 10Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (10Dzahn) 05Open→03Resolved `  19:25 <+icinga-wm> RECOVERY - Ensure local MW versions match expected deployment on mw1337 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Applicatio...
[19:34:35] <wikibugs>	 (03PS2) 10Andrew Bogott: toolforge: pin sudo_ldap to a newer package version [puppet] - 10https://gerrit.wikimedia.org/r/658666
[19:36:08] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:37:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/658666 (owner: 10Andrew Bogott)
[19:37:45] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] toolforge: pin sudo_ldap to a newer package version [puppet] - 10https://gerrit.wikimedia.org/r/658666 (owner: 10Andrew Bogott)
[19:37:58] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1337 is CRITICAL: Host mw1337 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[19:40:08] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to ldap/wmf for MewOphaswongse - https://phabricator.wikimedia.org/T272912 (10Legoktm) 05Open→03Resolved You're now in the `wmf` group: https://ldap.toolforge.org/user/mewoph
[19:40:27] <icinga-wm>	 ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1337 is CRITICAL: Host mw1337 is not in mediawiki-installation dsh group daniel_zahn reimage https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[19:40:32] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:41:34] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2317.codfw.wmnet'] `  an...
[19:42:26] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:42:40] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1337.eqiad.wmnet
[19:42:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:31] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1409.eqiad.wmnet
[19:43:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:14] <wikibugs>	 10SRE, 10ops-codfw: codfw: add VC-links IDs to Netbox - https://phabricator.wikimedia.org/T268749 (10Papaul) Row B complete
[19:44:40] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on deploy1002 is CRITICAL: CRITICAL: 3 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[19:49:23] <logmsgbot>	 !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw2317.codfw.wmnet
[19:49:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:13] <wikibugs>	 (03PS2) 10Andrew Bogott: Nova: forward our server name regex hack to version Train [puppet] - 10https://gerrit.wikimedia.org/r/658654 (https://phabricator.wikimedia.org/T261135)
[19:51:15] <wikibugs>	 (03PS2) 10Andrew Bogott: Nova: Very minor config update for version Train [puppet] - 10https://gerrit.wikimedia.org/r/658655 (https://phabricator.wikimedia.org/T261135)
[19:51:17] <wikibugs>	 (03PS2) 10Andrew Bogott: Add manifests for Neutron version 'Train' [puppet] - 10https://gerrit.wikimedia.org/r/658656 (https://phabricator.wikimedia.org/T261135)
[19:51:19] <wikibugs>	 (03PS2) 10Andrew Bogott: Add manifests for Cinder version 'Train' [puppet] - 10https://gerrit.wikimedia.org/r/658657 (https://phabricator.wikimedia.org/T261135)
[19:51:21] <wikibugs>	 (03PS2) 10Andrew Bogott: Add manifest for Glance version Train [puppet] - 10https://gerrit.wikimedia.org/r/658658 (https://phabricator.wikimedia.org/T261135)
[19:51:23] <wikibugs>	 (03PS2) 10Andrew Bogott: Add manifests for Keystone version Train [puppet] - 10https://gerrit.wikimedia.org/r/658659 (https://phabricator.wikimedia.org/T261135)
[19:51:25] <wikibugs>	 (03PS2) 10Andrew Bogott: Add manifests for Nova version Train [puppet] - 10https://gerrit.wikimedia.org/r/658660 (https://phabricator.wikimedia.org/T261135)
[19:51:27] <wikibugs>	 (03PS2) 10Andrew Bogott: Add manifest for Barbican version Train [puppet] - 10https://gerrit.wikimedia.org/r/658661 (https://phabricator.wikimedia.org/T261135)
[19:51:29] <wikibugs>	 (03PS2) 10Andrew Bogott: Add OpenStack client package manifests for version Train [puppet] - 10https://gerrit.wikimedia.org/r/658662 (https://phabricator.wikimedia.org/T261135)
[19:51:31] <wikibugs>	 (03PS2) 10Andrew Bogott: Neutron: forward our dmz hacks from version Stein to Train [puppet] - 10https://gerrit.wikimedia.org/r/658653 (https://phabricator.wikimedia.org/T261135)
[19:53:05] <logmsgbot>	 !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw2317.codfw.wmnet
[19:53:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:53:22] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1408.eqiad.wmnet
[19:53:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:53:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add manifests for Keystone version Train [puppet] - 10https://gerrit.wikimedia.org/r/658659 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott)
[20:00:04] <jouncebot>	 dancy and brennen: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Mediawiki train - American Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210126T2000).
[20:00:49] <brennen>	 (here, but also in a pairing session shortly.  please ping if i can be of use.)
[20:01:35] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:01:56] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1408.eqiad.wmnet
[20:01:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:03] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:03:31] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1409.eqiad.wmnet
[20:03:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:48] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:09:26] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:11:50] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] OpenStack: add config files for openstack Train [puppet] - 10https://gerrit.wikimedia.org/r/658652 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott)
[20:12:07] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Nova: forward our server name regex hack to version Train [puppet] - 10https://gerrit.wikimedia.org/r/658654 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott)
[20:12:16] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2321.codfw.wmnet
[20:12:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:29] <wikibugs>	 (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Add manifests for Keystone version Train [puppet] - 10https://gerrit.wikimedia.org/r/658659 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott)
[20:13:01] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Add OpenStack client package manifests for version Train [puppet] - 10https://gerrit.wikimedia.org/r/658662 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott)
[20:13:02] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2321.codfw.wmnet
[20:13:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:07] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Add manifest for Barbican version Train [puppet] - 10https://gerrit.wikimedia.org/r/658661 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott)
[20:13:15] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Add manifest for Glance version Train [puppet] - 10https://gerrit.wikimedia.org/r/658658 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott)
[20:13:19] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Add manifests for Nova version Train [puppet] - 10https://gerrit.wikimedia.org/r/658660 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott)
[20:13:24] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Add manifests for Neutron version 'Train' [puppet] - 10https://gerrit.wikimedia.org/r/658656 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott)
[20:13:29] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Add manifests for Cinder version 'Train' [puppet] - 10https://gerrit.wikimedia.org/r/658657 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott)
[20:13:36] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Nova: Very minor config update for version Train [puppet] - 10https://gerrit.wikimedia.org/r/658655 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott)
[20:14:30] <wikibugs>	 (03PS3) 10Andrew Bogott: Nova: Very minor config update for version Train [puppet] - 10https://gerrit.wikimedia.org/r/658655 (https://phabricator.wikimedia.org/T261135)
[20:14:49] <wikibugs>	 (03PS3) 10Andrew Bogott: Add manifests for Cinder version 'Train' [puppet] - 10https://gerrit.wikimedia.org/r/658657 (https://phabricator.wikimedia.org/T261135)
[20:14:57] <wikibugs>	 (03PS3) 10Andrew Bogott: Add manifests for Nova version Train [puppet] - 10https://gerrit.wikimedia.org/r/658660 (https://phabricator.wikimedia.org/T261135)
[20:15:09] <wikibugs>	 (03PS3) 10Andrew Bogott: Add manifests for Neutron version 'Train' [puppet] - 10https://gerrit.wikimedia.org/r/658656 (https://phabricator.wikimedia.org/T261135)
[20:15:18] <wikibugs>	 (03PS3) 10Andrew Bogott: Add manifest for Glance version Train [puppet] - 10https://gerrit.wikimedia.org/r/658658 (https://phabricator.wikimedia.org/T261135)
[20:15:29] <wikibugs>	 (03PS3) 10Andrew Bogott: Add manifest for Barbican version Train [puppet] - 10https://gerrit.wikimedia.org/r/658661 (https://phabricator.wikimedia.org/T261135)
[20:15:40] <wikibugs>	 (03PS3) 10Andrew Bogott: Add manifests for Keystone version Train [puppet] - 10https://gerrit.wikimedia.org/r/658659 (https://phabricator.wikimedia.org/T261135)
[20:15:51] <wikibugs>	 (03PS3) 10Andrew Bogott: Nova: forward our server name regex hack to version Train [puppet] - 10https://gerrit.wikimedia.org/r/658654 (https://phabricator.wikimedia.org/T261135)
[20:15:57] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:16:01] <wikibugs>	 (03PS3) 10Andrew Bogott: Add OpenStack client package manifests for version Train [puppet] - 10https://gerrit.wikimedia.org/r/658662 (https://phabricator.wikimedia.org/T261135)
[20:16:12] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:16:36] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add manifests for Keystone version Train [puppet] - 10https://gerrit.wikimedia.org/r/658659 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott)
[20:17:11] <wikibugs>	 (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Add manifests for Keystone version Train [puppet] - 10https://gerrit.wikimedia.org/r/658659 (https://phabricator.wikimedia.org/T261135) (owner: 10Andrew Bogott)
[20:17:13] <wikibugs>	 (03CR) 10Ryan Kemper: "(reaching out to service-ops to try to find a reviewer)" [puppet] - 10https://gerrit.wikimedia.org/r/657913 (owner: 10Ryan Kemper)
[20:17:23] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2321 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[20:17:53] <wikibugs>	 (03Abandoned) 10Ryan Kemper: udev_reload missing trailing sudo [puppet] - 10https://gerrit.wikimedia.org/r/634390 (owner: 10Ryan Kemper)
[20:19:35] <icinga-wm>	 RECOVERY - Ensure local MW versions match expected deployment on deploy1002 is OK: OKAY: wikiversions in sync https://wikitech.wikimedia.org/wiki/Application_servers
[20:20:25] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:20:32] <wikibugs>	 10SRE, 10Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (10Urbanecm) 05Resolved→03Open Boldly reopening this.   >>! In T272967#6778204, @Dzahn wrote: > It already does the things suggested by Urbanecm, just that the Icinga check isn't running every 5 minutes...
[20:20:38] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "This looks good. But before you merge it you need to create the certificate for it or puppet/envoy will fail." [puppet] - 10https://gerrit.wikimedia.org/r/657913 (owner: 10Ryan Kemper)
[20:22:13] <wikibugs>	 10SRE, 10Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (10Dzahn) a:05Dzahn→03None
[20:22:45] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:22:56] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1411.eqiad.wmnet with reason: REIMAGE
[20:22:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:34] <wikibugs>	 10SRE, 10Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (10Dzahn) also see T218412
[20:24:56] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1411.eqiad.wmnet with reason: REIMAGE
[20:24:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:01] <wikibugs>	 (03PS4) 10Ryan Kemper: Decommission relforge100[1,2] [puppet] - 10https://gerrit.wikimedia.org/r/657453 (https://phabricator.wikimedia.org/T272444)
[20:26:27] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1413.eqiad.wmnet with reason: REIMAGE
[20:26:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:11] <wikibugs>	 10SRE, 10Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (10Dzahn) Probably just that downtimes expired 8 hours later.  I think there is not much to fix here besides "remember to scap pull" and the other part about MW versions would be T272967
[20:28:32] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1413.eqiad.wmnet with reason: REIMAGE
[20:28:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:14] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:31:15] <wikibugs>	 10SRE, 10Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (10Urbanecm) >>! In T272967#6778314, @Dzahn wrote: > Probably just that downtimes expired 8 hours later. >  > I think there is not much to fix here besides "remember to scap pull" and the other part about M...
[20:31:54] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:33:47] <wikibugs>	 (03PS1) 10Legoktm: admin: Add jaz to list of privledged LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/658672 (https://phabricator.wikimedia.org/T272522)
[20:34:40] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:35:39] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] admin: Add jaz to list of privledged LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/658672 (https://phabricator.wikimedia.org/T272522) (owner: 10Legoktm)
[20:36:00] <ryankemper>	 !log T272444 (Decommission relforge100[1,2]) Downtimed `relforge100[1,2]` in Icinga cookbook for the next 26 hours
[20:36:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:36:04] <stashbot>	 T272444: decommission relforge1001.eqiad.wmnet and relforge1002.eqiad.wmnet - https://phabricator.wikimedia.org/T272444
[20:36:30] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] Decommission relforge100[1,2] [puppet] - 10https://gerrit.wikimedia.org/r/657453 (https://phabricator.wikimedia.org/T272444) (owner: 10Ryan Kemper)
[20:37:47] <ryankemper>	 !log T272444 (Decommission relforge100[1,2]) Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/657453 prior to running decom cookbook
[20:37:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:13] <wikibugs>	 (03PS1) 10Ottomata: Declare 5 NavigationTiming eventlogging streams and migrate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658673 (https://phabricator.wikimedia.org/T271208)
[20:38:32] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] data-services: apply user variances to future creations [puppet] - 10https://gerrit.wikimedia.org/r/657890 (https://phabricator.wikimedia.org/T269399) (owner: 10Bstorm)
[20:38:43] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (10Legoktm) 05Open→03Resolved a:05JTannerWMF→03Legoktm Done, you're now a member of the `wmf` group: https://ldap.toolforge.org/user/jaz. A tip that if you're...
[20:39:01] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission
[20:39:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:39] <wikibugs>	 (03PS1) 10Ahmon Dancy: group0 wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658674
[20:39:41] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+2] group0 wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658674 (owner: 10Ahmon Dancy)
[20:39:48] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99)
[20:39:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:17] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission
[20:40:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:20] <wikibugs>	 (03PS2) 10Ottomata: Declare 5 NavigationTiming eventlogging streams and migrate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658673 (https://phabricator.wikimedia.org/T271208)
[20:40:26] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.28 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658674 (owner: 10Ahmon Dancy)
[20:40:27] <wikibugs>	 10SRE, 10Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (10Dzahn) I don't know. I think the answer to that question would best be found on T218412.
[20:40:29] <ryankemper>	 !log T272444 (Decommission relforge100[1,2]) Beginning decommission of `relforge1001`: `sudo -i cookbook sre.hosts.decommission relforge1001.eqiad.wmnet -t T272444`
[20:40:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:21] <logmsgbot>	 !log dancy@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.28
[20:42:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:52] <ottomata>	 dancy: brennen  i have a config change to sync, only affects testwiki.  ok to sync or is train in progress?
[20:43:10] <brennen>	 ottomata: train is in progress, i believe.
[20:43:12] <dancy>	 group0 rollout just finished.
[20:43:26] <ottomata>	 ok, dancy  let me know when it is clear to sync
[20:43:42] <brennen>	 dancy: i see a maybe blocker
[20:43:48] <dancy>	 ottomata: Go for it.
[20:43:53] <ottomata>	 oh k
[20:43:53] <brennen>	 MediumSpecificBagOStuff:1094  PHP Notice: Undefined offset: 1
[20:43:58] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1337.eqiad.wmnet with reason: REIMAGE
[20:44:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:04] <dancy>	 nod. seeing lots of those now.. 
[20:44:09] <ottomata>	 shall I wait?
[20:44:12] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1338.eqiad.wmnet with reason: REIMAGE
[20:44:14] <dancy>	 yes please.
[20:44:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:16] <ottomata>	 k
[20:44:25] <dancy>	 I'm going to roll back now.
[20:45:43] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1411.eqiad.wmnet'] `  an...
[20:46:02] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1337.eqiad.wmnet with reason: REIMAGE
[20:46:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:08] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1411.eqiad.wmnet
[20:47:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:13] <wikibugs>	 (03PS1) 10Ahmon Dancy: Rollback group0 to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658676
[20:47:31] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+2] Rollback group0 to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658676 (owner: 10Ahmon Dancy)
[20:48:04] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1338.eqiad.wmnet with reason: REIMAGE
[20:48:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:48:19] <wikibugs>	 (03Merged) 10jenkins-bot: Rollback group0 to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658676 (owner: 10Ahmon Dancy)
[20:49:19] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1413.eqiad.wmnet'] `  an...
[20:49:20] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:50:06] <logmsgbot>	 !log dancy@deploy1001 rebuilt and synchronized wikiversions files: (no justification provided)
[20:50:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:50:20] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1411.eqiad.wmnet
[20:50:22] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:50:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:50:37] <dancy>	 !log group0 rolled back to 1.36.0-wmf.27
[20:50:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:50:54] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2308.codfw.wmnet with reason: REIMAGE
[20:50:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:51:15] <dancy>	 ottomata: Your turn.  Lemme know when you're done.  I'll file a complaint about MW errors in the meantime
[20:51:21] <ottomata>	 k
[20:51:23] <wikibugs>	 10SRE, 10Wikimedia-Mailing-lists: https://lists.wikimedia.org/mailman/listinfo/wikics-l has no CSS styling due to 404 URLs; Cannot subscribe due to token error - https://phabricator.wikimedia.org/T272969 (10Legoktm) 05Open→03Resolved p:05Triage→03Medium a:03Legoktm I reset the HTML theme to the Wikim...
[20:51:30] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Declare 5 NavigationTiming eventlogging streams and migrate on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658673 (https://phabricator.wikimedia.org/T271208) (owner: 10Ottomata)
[20:51:58] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[20:52:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:12] <ryankemper>	 !log T272444 (Decommission relforge100[1,2]) Beginning decommission of `relforge1002`: `sudo -i cookbook sre.hosts.decommission relforge1002.eqiad.wmnet -t T272444`
[20:52:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:15] <stashbot>	 T272444: decommission relforge1001.eqiad.wmnet and relforge1002.eqiad.wmnet - https://phabricator.wikimedia.org/T272444
[20:52:17] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:52:20] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission
[20:52:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:52:55] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1413.eqiad.wmnet
[20:52:58] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2308.codfw.wmnet with reason: REIMAGE
[20:52:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:33] <logmsgbot>	 !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Migrate 5 NavigationTiming schemas to Event Platform on testwiki - T271208 (duration: 01m 17s)
[20:53:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:36] <stashbot>	 T271208: NavigationTiming Extension schemas Event Platform Migration - https://phabricator.wikimedia.org/T271208
[20:55:48] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1413.eqiad.wmnet
[20:55:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:57] <wikibugs>	 (03CR) 10Krinkle: Declare 5 NavigationTiming eventlogging streams and migrate on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658673 (https://phabricator.wikimedia.org/T271208) (owner: 10Ottomata)
[20:56:08] <ottomata>	 dancy:  done
[20:56:15] <dancy>	 thx!
[20:56:19] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:57:20] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@2662ca2]: ship hourly link recommendations
[20:57:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:03:30] <wikibugs>	 10SRE, 10SRE-Access-Requests: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (10Ottomata) @ppelberg, this ticket needs to be done before you can access data via Presto.   You don't need ssh access, but you do need to be in the `analytics-privatedata-users` group, which requires the sa...
[21:04:01] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:05:41] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:05:50] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@2662ca2]: ship hourly link recommendations (duration: 08m 30s)
[21:05:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:58] <ebernhardson>	 !log restart airflow-scheduler and airflow-webserver on an-airflow1001 post-deploy
[21:07:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:09] <wikibugs>	 (03PS1) 10Subramanya Sastry: Parsoid Testing: Switch rt/vd server db hosts to localhost [puppet] - 10https://gerrit.wikimedia.org/r/658679 (https://phabricator.wikimedia.org/T266509)
[21:11:18] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2306.codfw.wmnet with reason: REIMAGE
[21:11:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:17] <wikibugs>	 10SRE, 10Traffic, 10serviceops, 10Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Legoktm) Have you made any changes to the bot recently?  `lang=irc  11:09:36 <legoktm> Cyberpower678: does Cyberbot...
[21:13:27] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2306.codfw.wmnet with reason: REIMAGE
[21:13:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:13:51] <legoktm>	 maybe I'm blind, but I don't see any option to change the priority on https://phabricator.wikimedia.org/T273003
[21:14:29] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2308.codfw.wmnet'] `  an...
[21:15:08] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[21:15:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:15:20] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2304.codfw.wmnet with reason: REIMAGE
[21:15:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:15:28] <wikibugs>	 10SRE, 10Traffic, 10serviceops, 10Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Cyberpower678) No changes have been made to the bot whatsoever.  I believe it only does maxlag on write requests, li...
[21:15:40] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/27680/" [puppet] - 10https://gerrit.wikimedia.org/r/658679 (https://phabricator.wikimedia.org/T266509) (owner: 10Subramanya Sastry)
[21:17:28] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2304.codfw.wmnet with reason: REIMAGE
[21:17:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:17:41] <wikibugs>	 (03CR) 10Dzahn: "deployed on testreduce1001. puppet also triggered a refresh of the parsoid-rt service" [puppet] - 10https://gerrit.wikimedia.org/r/658679 (https://phabricator.wikimedia.org/T266509) (owner: 10Subramanya Sastry)
[21:19:02] <bd808>	 legoktm: I wonder if that missing phab priority setting is an artifact of it being marked "production error" somehow?
[21:19:21] <RhinosF1|NotHere>	 legoktm: you're not blind
[21:19:25] <bd808>	 also, this is not a "production error" report but meh
[21:19:30] <RhinosF1|NotHere>	 bd808: that form was changed today
[21:19:35] <mutante>	 for some reason it uses the "agile" story points thing instead
[21:19:51] <RhinosF1|NotHere>	 https://phabricator.wikimedia.org/T240343
[21:20:09] <RhinosF1|NotHere>	 twentyafterfour: ^
[21:20:16] <RhinosF1|NotHere>	 That seems to have broken something
[21:20:57] <legoktm>	 okay
[21:21:04] <legoktm>	 I edited the form to make Priority visible
[21:21:24] <wikibugs>	 10SRE, 10Traffic, 10serviceops, 10Wikimedia-production-error: Cyberbot is getting a lot of 502 errors, or blank responses when querying the API - https://phabricator.wikimedia.org/T273003 (10Legoktm) p:05Triage→03Medium
[21:21:27] <legoktm>	 thanks for the pointer RhinosF1|NotHere
[21:23:00] <RhinosF1|NotHere>	 legoktm: np
[21:27:52] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@a87a69a]: correct alter table syntax to create wbitem table
[21:27:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:28:08] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2308.codfw.wmnet
[21:28:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:28:45] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1338.eqiad.wmnet'] `  an...
[21:29:09] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2308.codfw.wmnet
[21:29:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:49] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1337.eqiad.wmnet'] `  an...
[21:29:56] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[21:31:02] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@a87a69a]: correct alter table syntax to create wbitem table (duration: 03m 09s)
[21:31:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:09] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:32:04] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1337.eqiad.wmnet
[21:32:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:13] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw13388.eqiad.wmnet
[21:32:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:22] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1338.eqiad.wmnet
[21:32:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:08] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1337.eqiad.wmnet
[21:33:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:58] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2306.codfw.wmnet'] `  an...
[21:35:04] <wikibugs>	 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Legoktm) >>! In T179696#6776832, @Joe wrote: > Looking at the logs from a failed run, it looks like no retry is attempted when a 504 is received...
[21:36:37] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:37:35] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by legoktm on cumin1001.eq...
[21:37:39] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T179696 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:38:58] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2304.codfw.wmnet'] `  an...
[21:40:34] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1338.eqiad.wmnet
[21:40:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:49] <wikibugs>	 (03PS1) 10PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/658682
[21:45:11] <wikibugs>	 (03PS1) 10Ryan Kemper: relforge: remove decommed relforge100[1,2] [puppet] - 10https://gerrit.wikimedia.org/r/658683 (https://phabricator.wikimedia.org/T272444)
[21:48:57] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2302.codfw.wmnet with reason: REIMAGE
[21:48:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:50:12] <wikibugs>	 (03PS1) 10Legoktm: Allow talking to the registry over HTTP [docker-images/docker-report] - 10https://gerrit.wikimedia.org/r/658684 (https://phabricator.wikimedia.org/T179696)
[21:50:21] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2306.codfw.wmnet
[21:50:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:50:31] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2304.codfw.wmnet
[21:50:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:50:59] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2302.codfw.wmnet with reason: REIMAGE
[21:51:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:52:59] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2304.codfw.wmnet
[21:53:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:54:19] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2306.codfw.wmnet
[21:54:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:38] <logmsgbot>	 !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2300.codfw.wmnet with reason: REIMAGE
[21:56:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:43] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[21:57:23] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[21:58:40] <logmsgbot>	 !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2300.codfw.wmnet with reason: REIMAGE
[21:58:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:01:16] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[22:02:08] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] relforge: remove decommed relforge100[1,2] [puppet] - 10https://gerrit.wikimedia.org/r/658683 (https://phabricator.wikimedia.org/T272444) (owner: 10Ryan Kemper)
[22:02:47] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[22:07:00] <wikibugs>	 10SRE, 10SRE-Access-Requests: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (10ppelberg) @Joe, @elukey, @Ottomata and @jcrespo: thank you for responding as helpfully and responsively as y'all did...I'm sorry I left you wondering for as long as I have.   Responses to, what I think are, th...
[22:08:39] <wikibugs>	 (03PS5) 10Legoktm: mediawiki: Port nrpe_check_opcache to Python [puppet] - 10https://gerrit.wikimedia.org/r/657677
[22:09:31] <wikibugs>	 (03CR) 10Legoktm: mediawiki: Port nrpe_check_opcache to Python (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657677 (owner: 10Legoktm)
[22:10:19] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "openssl x509 -text -noout -in wdqs-internal.discovery.wmnet.crt | grep DNS" [puppet] - 10https://gerrit.wikimedia.org/r/658548 (https://phabricator.wikimedia.org/T272713) (owner: 10Ryan Kemper)
[22:10:56] <wikibugs>	 (03CR) 10Dzahn: [V: 03+2 C: 03+2] wdqs: add dummy key for new wdqs-internal cert [labs/private] - 10https://gerrit.wikimedia.org/r/658550 (https://phabricator.wikimedia.org/T272713) (owner: 10Ryan Kemper)
[22:12:28] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2302.codfw.wmnet'] `  an...
[22:15:45] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2299.codfw.wmnet with reason: REIMAGE
[22:15:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:16:24] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2298.codfw.wmnet with reason: REIMAGE
[22:16:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:50] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2299.codfw.wmnet with reason: REIMAGE
[22:17:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:29] <wikibugs>	 10ops-eqiad, 10decommission-hardware, 10Discovery-Search (Current work), 10Patch-For-Review: decommission relforge1001.eqiad.wmnet and relforge1002.eqiad.wmnet - https://phabricator.wikimedia.org/T272444 (10RKemper)
[22:18:54] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/27682/wdqs1008.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/657913 (owner: 10Ryan Kemper)
[22:19:44] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2298.codfw.wmnet with reason: REIMAGE
[22:19:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:18] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2300.codfw.wmnet'] `  an...
[22:21:51] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2297.codfw.wmnet with reason: REIMAGE
[22:21:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:22:04] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/658548   and https://gerrit.wikimedia.org/r/c/labs/private/+/658550  are deployed.. a" [puppet] - 10https://gerrit.wikimedia.org/r/657913 (owner: 10Ryan Kemper)
[22:22:06] <wikibugs>	 (03PS3) 10Legoktm: openldap: Convert cross-validate-accounts to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/658455
[22:23:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] openldap: Convert cross-validate-accounts to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/658455 (owner: 10Legoktm)
[22:23:55] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2297.codfw.wmnet with reason: REIMAGE
[22:23:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:27] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1264.eqiad.wmnet with reason: REIMAGE
[22:24:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:32] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1264.eqiad.wmnet with reason: REIMAGE
[22:26:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:47] <wikibugs>	 (03PS4) 10Legoktm: openldap: Convert cross-validate-accounts to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/658455
[22:27:07] <logmsgbot>	 !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw2300.codfw.wmnet
[22:27:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:27:40] <wikibugs>	 (03PS1) 10Ppchelko: CacheTime: Extra protection for rollback unserialization [core] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658688 (https://phabricator.wikimedia.org/T273007)
[22:28:47] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:30:01] <logmsgbot>	 !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw2300.codfw.wmnet
[22:30:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:30:57] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:34:03] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@a276626]: correct execution_date_fn in ores_predictions_hourly
[22:34:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:35:10] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@a276626]: correct execution_date_fn in ores_predictions_hourly (duration: 01m 07s)
[22:35:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:40:19] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2298.codfw.wmnet'] `  an...
[22:40:56] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2299.codfw.wmnet'] `  an...
[22:45:51] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2297.codfw.wmnet'] `  an...
[22:47:41] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:49:41] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:11:46] <wikibugs>	 (PS1) Dzahn: add certificate for testreduce.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/658695 (https://phabricator.wikimedia.org/T266509)
[23:12:02] <wikibugs>	 (PS1) Dzahn: add fake cert for testreduce.discovery.wmnet [labs/private] - https://gerrit.wikimedia.org/r/658696 (https://phabricator.wikimedia.org/T266509)
[23:12:15] <wikibugs>	 (CR) Dzahn: [C: +2] add certificate for testreduce.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/658695 (https://phabricator.wikimedia.org/T266509) (owner: Dzahn)
[23:15:50] <wikibugs>	 (CR) Dzahn: [V: +2 C: +2] add fake cert for testreduce.discovery.wmnet [labs/private] - https://gerrit.wikimedia.org/r/658696 (https://phabricator.wikimedia.org/T266509) (owner: Dzahn)
[23:17:19] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1264.eqiad.wmnet'] `  an...
[23:20:38] <wikibugs>	 SRE, Sustainability: Alerts for drifts in /srv/mediawiki - https://phabricator.wikimedia.org/T272967 (Legoktm) Running `scap pull` after reimaging will definitely take care of the immediate problem.  I read through {T218412} and it looks like that's mostly what's being asked for in this ticket, figuring...
[23:23:38] <wikibugs>	 (PS5) Legoktm: openldap: Convert cross-validate-accounts to Python 3 [puppet] - https://gerrit.wikimedia.org/r/658455
[23:24:37] <wikibugs>	 (CR) Legoktm: [V: +1] "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/658455 (owner: Legoktm)
[23:25:22] <wikibugs>	 (CR) Volans: "quick reply inline" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/657451 (https://phabricator.wikimedia.org/T269596) (owner: Razzi)
[23:25:24] <wikibugs>	 (PS2) Dzahn: ATS: re-add config for parsoid-rt-tests.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/654351 (https://phabricator.wikimedia.org/T266509)
[23:26:07] <wikibugs>	 (CR) Dzahn: [C: +2] "certificate created" [puppet] - https://gerrit.wikimedia.org/r/654351 (https://phabricator.wikimedia.org/T266509) (owner: Dzahn)
[23:30:24] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2302.codfw.wmnet
[23:30:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:39] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2299.codfw.wmnet
[23:30:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:31:26] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2298.codfw.wmnet
[23:31:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:31:47] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1264.eqiad.wmnet
[23:31:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:03] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2297.codfw.wmnet
[23:32:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:35:43] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] "Thanks Ppchelko!" [core] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/658688 (https://phabricator.wikimedia.org/T273007) (owner: Ppchelko)
[23:37:24] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1264.eqiad.wmnet
[23:37:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:37:39] <mutante>	 one more canary server in eqiad is buster now ^
[23:38:06] <mutante>	 trying to match canaries with reality
[23:38:21] <dancy>	 Nice
[23:40:14] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2302.codfw.wmnet
[23:40:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:41:30] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2298.codfw.wmnet
[23:41:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:43:00] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2297.codfw.wmnet
[23:43:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:45:16] <wikibugs>	 SRE, SRE-Access-Requests: Hue access for Peter Pelberg - https://phabricator.wikimedia.org/T271602 (DannyH) I approve Peter for this, thanks.
[23:54:16] <wikibugs>	 (PS1) Dzahn: add testreduce.discovery.wmnet, point to testreduce1001 [dns] - https://gerrit.wikimedia.org/r/658701 (https://phabricator.wikimedia.org/T266509)
[23:55:46] <wikibugs>	 SRE, Parsoid-Tests, serviceops, Parsoid (Tracking), Patch-For-Review: Make testreduce web UI publicly accessible on the internet - https://phabricator.wikimedia.org/T266509 (Dzahn)
[23:56:17] <wikibugs>	 (CR) Dzahn: [C: +2] add testreduce.discovery.wmnet, point to testreduce1001 [dns] - https://gerrit.wikimedia.org/r/658701 (https://phabricator.wikimedia.org/T266509) (owner: Dzahn)
[23:57:25] <wikibugs>	 (PS4) Dzahn: scap: add deploy1002 and deploy2002 to mediawiki hosts [puppet] - https://gerrit.wikimedia.org/r/658643 (https://phabricator.wikimedia.org/T265963)