[00:00:12] <James_F>	 jouncebot: next
[00:00:13] <jouncebot>	 In 7 hour(s) and 59 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210108T0800)
[00:00:23] <James_F>	 Oh, ha.
[00:00:24] <James_F>	 jouncebot: now
[00:00:25] <jouncebot>	 For the next 0 hour(s) and 59 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210108T0000)
[00:00:31] <James_F>	 Silly bot.
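The two jouncebot replies above are plain time arithmetic, sketched below. This is not jouncebot's code: `hours_and_minutes` is a hypothetical helper, the date is read off the deploycal item IDs in the URLs, and the 01:00 end of the evening backport window is an assumption inferred from the "0 hour(s) and 59 minute(s)" reply.

```python
from datetime import datetime
from typing import Tuple


def hours_and_minutes(now: datetime, target: datetime) -> Tuple[int, int]:
    """Whole hours and leftover whole minutes from `now` until `target`."""
    seconds = int((target - now).total_seconds())
    return seconds // 3600, (seconds % 3600) // 60


# Reply timestamps taken from the log lines above.
asked_next = datetime(2021, 1, 8, 0, 0, 13)
next_window_start = datetime(2021, 1, 8, 8, 0, 0)   # deploycal-item-20210108T0800
asked_now = datetime(2021, 1, 8, 0, 0, 25)
current_window_end = datetime(2021, 1, 8, 1, 0, 0)  # assumed one-hour window

print("In %d hour(s) and %d minute(s)"
      % hours_and_minutes(asked_next, next_window_start))
print("For the next %d hour(s) and %d minute(s)"
      % hours_and_minutes(asked_now, current_window_end))
```

Under those assumptions the sketch reproduces both replies: 7 hours 59 minutes until the no-deploy window, and 0 hours 59 minutes left in the current one.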
[00:00:36] <Seddon>	 Me!
[00:00:38] <Seddon>	 It's me!
[00:00:43] <James_F>	 Seddon: I can deploy if you want?
[00:00:56] <Seddon>	 @James_F sure!
[00:01:08] <James_F>	 Seddon: (Also, do you want to undeploy from test2?)
[00:01:20] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] Undeploy graphoid on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654949 (https://phabricator.wikimedia.org/T271495) (owner: 10Seddon)
[00:01:49] <Seddon>	 @James_F Not yet. In case there are issues I still need the two test environments to compare setups
[00:02:02] <James_F>	 Ack.
[00:02:07] <wikibugs>	 (03Merged) 10jenkins-bot: Undeploy graphoid on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654949 (https://phabricator.wikimedia.org/T271495) (owner: 10Seddon)
[00:02:08] <James_F>	 Soon, soon.
[00:02:15] <Seddon>	 Had issues on a handful of wikis and it proved useful 
[00:02:49] <James_F>	 Seddon: Want me to deploy to a debug box?
[00:03:10] <James_F>	 (Live on mwdebug1002.)
[00:03:11] <Seddon>	 James_F: please yeah, just to make sure I've done it right
[00:03:16] <Seddon>	 thanks will confirm
[00:04:10] <mutante>	 reimaging 4 servers to buster but they are out of the scap dsh groups
[00:04:16] <mutante>	 and will scap pull once done
[00:04:35] <Seddon>	 @James_F confirmed that worked
[00:04:55] <James_F>	 Seddon: OK, will sync
[00:06:15] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Undeploy graphoid on enwiki T271495 (duration: 00m 57s)
[00:06:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:19] <stashbot>	 T271495: Final undeploy for graphoid - en.wiki - https://phabricator.wikimedia.org/T271495
[00:06:21] <James_F>	 Seddon: Done!
[00:06:38] <Seddon>	 James_F: Thanks. I'm monitoring Kibana and will work out if this has been smooth.
[00:07:24] <James_F>	 Cool
[00:15:11] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1266.eqiad.wmnet with reason: REIMAGE
[00:15:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:04] <wikibugs>	 SRE, Graphoid, Platform Engineering, serviceops: Final undeploy for graphoid - en.wiki - https://phabricator.wikimedia.org/T271495 (Jdforrester-WMF) Open→Resolved
[00:16:09] <wikibugs>	 10SRE, 10Graphoid, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), 10Platform Engineering (Icebox): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10Jdforrester-WMF)
[00:17:14] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1266.eqiad.wmnet with reason: REIMAGE
[00:17:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:40] <James_F>	 Seddon: This is not me being premature, it's me being keen: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Graph/+/654950 ;-)
[00:21:14] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1276.eqiad.wmnet with reason: REIMAGE
[00:21:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:54] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1267.eqiad.wmnet with reason: REIMAGE
[00:22:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:13] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1276.eqiad.wmnet with reason: REIMAGE
[00:23:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:33] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1277.eqiad.wmnet with reason: REIMAGE
[00:23:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:52] <wikibugs>	 (03PS1) 10Jforrester: Drop ability to use graphoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654954 (https://phabricator.wikimedia.org/T242855)
[00:25:15] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1267.eqiad.wmnet with reason: REIMAGE
[00:25:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:38] <wikibugs>	 10SRE, 10Graphoid, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), and 2 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10Jdforrester-WMF)
[00:26:35] <Seddon>	 @James_F: It's quite possible, given the wishlist, that server-side rendering makes a return with a Vega upgrade
[00:27:10] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1277.eqiad.wmnet with reason: REIMAGE
[00:27:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:16] <James_F>	 Seddon: Ha ha ha ha ha ha ha ha no.
[00:31:11] <Seddon>	 James_F: It probably should, this current situation is less than ideal
[00:31:28] <James_F>	 Seddon: Oh, sure, but that request is massively outside the scope of the tech wishlist.
[00:31:58] <James_F>	 Seddon: Writing the code is trivial. Supporting the code for years requires a team. That's a team that doesn't exist, and leadership have repeatedly decided not to fund.
[00:32:19] * James_F lobbied for years, but the decision was always to work on other things. So be it.
[00:42:16] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 5391325808 and 342 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:43:48] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1724728624 and 88 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:44:20] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4122624160 and 232 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:48:12] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 6400 and 190 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:48:24] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1008 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f1a80e12518: Failed to establish a new connection: [Errno 111] Connection
[00:48:24] <icinga-wm>	 ://wikitech.wikimedia.org/wiki/Search%23Administration
[00:48:46] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 293928 and 224 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:49:06] <icinga-wm>	 PROBLEM - Check systemd state on logstash1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:49:34] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 24640 and 272 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:02:32] <wikibugs>	 10SRE, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1266.eqiad.wmnet'] `  and were **ALL** successful.
[01:07:57] <wikibugs>	 10SRE, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1276.eqiad.wmnet'] `  and were **ALL** successful.
[01:09:48] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:11:26] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:11:40] <wikibugs>	 10SRE, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1267.eqiad.wmnet'] `  and were **ALL** successful.
[01:12:08] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1266.eqiad.wmnet
[01:12:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:12:13] <wikibugs>	 10SRE, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1277.eqiad.wmnet'] `  and were **ALL** successful.
[01:15:58] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1008 is OK: OK - elasticsearch status production-logstash-eqiad: number_of_data_nodes: 3, initializing_shards: 0, cluster_name: production-logstash-eqiad, unassigned_shards: 0, timed_out: False, number_of_pending_tasks: 1, active_shards_percent_as_number: 100.0, delayed_unassigned_shards: 0, status: green, relocating_shards: 0, task_max_waiting_in_queue_millis: 0
[01:15:58] <icinga-wm>	 916, number_of_nodes: 6, active_primary_shards: 483, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[01:16:40] <icinga-wm>	 RECOVERY - Check systemd state on logstash1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:16:53] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1267.eqiad.wmnet
[01:16:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:17:25] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1276.eqiad.wmnet
[01:17:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:17:34] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1277.eqiad.wmnet
[01:17:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:39] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1266.eqiad.wmnet
[01:18:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:23:56] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/weight=25; selector: name=mw1265.eqiad.wmnet
[01:23:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:24:23] <mutante>	 !log mw1265 - raised weight to 25 like regular appservers (buster)
[01:24:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:24:35] <mutante>	 !log mw1266 - another buster appserver now serving traffic 
[01:24:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:28:03] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1276.eqiad.wmnet
[01:28:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:28:19] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1277.eqiad.wmnet
[01:28:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:28:54] <mutante>	 !log mw1276, mw1277 - first API appservers on buster, now serving traffic, free to depool if any issues
[01:28:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:29:39] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1267.eqiad.wmnet
[01:29:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:41:42] <wikibugs>	 10SRE, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) 4 more servers have been upgraded to buster:  mw1266, mw1267 (appserver) mw1276, mw1277 (API server)  are now...
[01:46:44] <wikibugs>	 (03PS1) 10Dzahn: lvs: stop monitoring graphoid [puppet] - 10https://gerrit.wikimedia.org/r/654959 (https://phabricator.wikimedia.org/T242855)
[01:48:54] <wikibugs>	 (03PS1) 10Dzahn: admin: delete the graphoid admin group, remove from scb [puppet] - 10https://gerrit.wikimedia.org/r/654960 (https://phabricator.wikimedia.org/T242855)
[01:50:09] <wikibugs>	 (03PS2) 10Dzahn: lvs: stop monitoring graphoid [puppet] - 10https://gerrit.wikimedia.org/r/654959 (https://phabricator.wikimedia.org/T242855)
[01:50:36] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:56:07] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] cirrus: bump es shard size alert thresholds [puppet] - 10https://gerrit.wikimedia.org/r/654917 (https://phabricator.wikimedia.org/T265908) (owner: 10Ryan Kemper)
[02:04:34] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2027 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[02:06:52] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:09:12] <ryankemper>	 !log [wdqs deploy] Tests passing on canary before beginning wdqs deploy, proceeding
[02:09:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:09:25] <logmsgbot>	 !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@b15fc5c]: 0.3.58
[02:09:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:12:51] <ryankemper>	 !log [wdqs deploy] While queries run fine, it looks like there might be a UI glitch in this version. Digging in to see if it's transient, but I'll likely be aborting this deploy
[02:12:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:15:28] <ryankemper>	 !log [wdqs deploy] Nevermind - the UI failure I mentioned above is transient. Restarting my ssh tunnel seemed to make the problem go away. Proceeding with deploy
[02:15:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:15:34] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[02:18:28] <icinga-wm>	 PROBLEM - PHP opcache health on mw1266 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[02:23:31] <icinga-wm>	 PROBLEM - PHP opcache health on mw1276 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[02:27:23] <icinga-wm>	 PROBLEM - PHP opcache health on mw1277 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[02:27:30] <logmsgbot>	 !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@b15fc5c]: 0.3.58 (duration: 18m 04s)
[02:27:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:34:29] <ryankemper>	 !log [wdqs deploy] Restarted `wdqs-updater` across all instances: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
[02:34:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:34:47] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2027 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[02:35:00] <ryankemper>	 !log [wdqs deploy] Restarted `wdqs-categories` across test instances: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
[02:35:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:35:39] <ryankemper>	 !log [wdqs deploy] Restarting `wdqs-categories` across load-balanced instances, one host at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
[02:35:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
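The one-host-at-a-time restart logged above (depool, wait, restart, wait, pool) can be sketched as follows. This is a minimal illustration and not how cumin is implemented: `rolling_restart`, `fake_run`, and the host names are hypothetical, commands are recorded rather than executed, and halting the whole rollout on the first failure is an assumption of this sketch.

```python
from typing import Callable, List, Sequence

# The chained steps from the logged command, in order.
STEPS = [
    "depool",
    "sleep 45",
    "systemctl restart wdqs-categories",
    "sleep 45",
    "pool",
]


def rolling_restart(hosts: Sequence[str],
                    run: Callable[[str, str], bool]) -> List[str]:
    """Run STEPS on each host in turn; return the hosts that finished.

    A failing step skips the remaining steps on that host, like `&&`
    chaining. Stopping the rollout entirely is this sketch's choice.
    """
    finished: List[str] = []
    for host in hosts:
        for step in STEPS:
            if not run(host, step):
                return finished
        finished.append(host)
    return finished


# Dry run: record what would be executed instead of touching any host.
calls = []


def fake_run(host: str, step: str) -> bool:
    calls.append((host, step))
    return True


completed = rolling_restart(["wdqs-a", "wdqs-b"], fake_run)
```

Because each host is pooled again before the next one is depooled, at most one host is out of rotation at any time, which is the point of running with a batch size of 1.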
[02:44:51] <icinga-wm>	 PROBLEM - PHP opcache health on mw1267 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[02:55:11] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37
[03:04:22] <ryankemper>	 !log [wdqs deploy] Deploy complete, service is healthy. This is done.
[03:04:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:33:13] <wikibugs>	 (03PS1) 10Reedy: Fix a bunch of fatal errors seen in production [extensions/AbuseFilter] (wmf/1.36.0-wmf.25) - 10https://gerrit.wikimedia.org/r/654860 (https://phabricator.wikimedia.org/T271430)
[03:34:19] <wikibugs>	 (03CR) 10Reedy: "I'm not deploying this at 3am my time... But I might later on today (after some sleep) due to various errors popping up in prod (and dupli" [extensions/AbuseFilter] (wmf/1.36.0-wmf.25) - 10https://gerrit.wikimedia.org/r/654860 (https://phabricator.wikimedia.org/T271430) (owner: 10Reedy)
[04:59:39] <mutante>	 !log mw1266 - restart-php7.2-fpm
[04:59:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:24] <icinga-wm>	 ACKNOWLEDGEMENT - PHP opcache health on mw1266 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn freshly reimaged and somehow normal until they run for a while. though php7adm /opcache-info jq . shows 99% hit rate https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:06:24] <icinga-wm>	 ACKNOWLEDGEMENT - PHP opcache health on mw1267 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn freshly reimaged and somehow normal until they run for a while. though php7adm /opcache-info jq . shows 99% hit rate https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:06:24] <icinga-wm>	 ACKNOWLEDGEMENT - PHP opcache health on mw1276 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn freshly reimaged and somehow normal until they run for a while. though php7adm /opcache-info jq . shows 99% hit rate https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[05:06:24] <icinga-wm>	 ACKNOWLEDGEMENT - PHP opcache health on mw1277 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% daniel_zahn freshly reimaged and somehow normal until they run for a while. though php7adm /opcache-info jq . shows 99% hit rate https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
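The opcache alerts above compare a cache-hit ratio against 99.85%. A minimal sketch of that comparison follows, assuming the ratio is hits / (hits + misses). The function names and sample counts are hypothetical, and the real check may compute or source its numbers differently.

```python
CRITICAL_BELOW = 99.85  # threshold quoted in the alert text


def hit_ratio(hits: int, misses: int) -> float:
    """Cache-hit ratio as a percentage; an idle cache is treated as 100%."""
    total = hits + misses
    if total == 0:
        return 100.0  # assumption: no lookups yet means nothing to alert on
    return 100.0 * hits / total


def opcache_status(hits: int, misses: int,
                   threshold: float = CRITICAL_BELOW) -> str:
    """Return CRITICAL when the hit ratio is below the threshold, else OK."""
    return "CRITICAL" if hit_ratio(hits, misses) < threshold else "OK"


print(opcache_status(hits=9990, misses=10))  # 99.9% hit ratio
print(opcache_status(hits=990, misses=10))   # 99.0% hit ratio
```

With a threshold this tight, a 99% hit rate still counts as CRITICAL. A server whose cache has only just started filling can therefore alert until hits accumulate.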
[05:36:49] <wikibugs>	 SRE, LDAP-Access-Requests, Abstract Wikipedia (Phase β): Grant Access to ldap/wmf for Cory Massaro - https://phabricator.wikimedia.org/T271245 (Joe) p:High→Medium a:dr0ptp4kt→Joe
[05:57:18] <wikibugs>	 10SRE, 10ops-codfw, 10DBA: db2140 crashed due to HW memory errors - https://phabricator.wikimedia.org/T271084 (10Marostegui) Thank you Papaul - once it arrives, feel free to replace the DIMM (the host is off) and power it back on.
[06:18:43] <marostegui>	 !log Deploy schema change on s2 codfw master - T270187
[06:18:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:48] <stashbot>	 T270187: Schema change for renaming user_properties_property index - https://phabricator.wikimedia.org/T270187
[06:22:15] <wikibugs>	 (03CR) 10Marostegui: "Thanks Brooke for starting to work on this!" [puppet] - 10https://gerrit.wikimedia.org/r/654890 (https://phabricator.wikimedia.org/T260511) (owner: 10Bstorm)
[06:30:54] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: admin: add apine to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/654973 (https://phabricator.wikimedia.org/T271245)
[06:31:40] <wikibugs>	 (03PS1) 10Marostegui: db1074: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/654974 (https://phabricator.wikimedia.org/T268742)
[06:32:29] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: admin: add apine to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/654973 (https://phabricator.wikimedia.org/T271245)
[06:32:47] <wikibugs>	 (03PS2) 10Marostegui: db1085: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/654974 (https://phabricator.wikimedia.org/T268742)
[06:33:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1085 to clone db1155:3316 T268742 ', diff saved to https://phabricator.wikimedia.org/P13666 and previous config saved to /var/cache/conftool/dbconfig/20210108-063301-marostegui.json
[06:33:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:05] <stashbot>	 T268742: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742
[06:33:47] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1085: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/654974 (https://phabricator.wikimedia.org/T268742) (owner: 10Marostegui)
[06:34:19] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] admin: add apine to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/654973 (https://phabricator.wikimedia.org/T271245) (owner: 10Giuseppe Lavagetto)
[06:42:11] <wikibugs>	 SRE, LDAP-Access-Requests, Abstract Wikipedia (Phase β), Patch-For-Review: Grant Access to ldap/wmf for Cory Massaro - https://phabricator.wikimedia.org/T271245 (Joe) Open→Resolved Hi @cmassaro you've been added to the "wmf" group on ldap, which should give you access to most restricted r...
[06:48:31] <wikibugs>	 (03CR) 10Marostegui: "> Patch Set 35:" [puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm)
[07:22:54] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for libmaxminddb [puppet] - 10https://gerrit.wikimedia.org/r/654876 (owner: 10Muehlenhoff)
[07:23:13] <marostegui>	 !log Deploy schema change on s5 codfw master - T270187
[07:23:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:18] <stashbot>	 T270187: Schema change for renaming user_properties_property index - https://phabricator.wikimedia.org/T270187
[07:23:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for Apache/Hue on testcluster [puppet] - 10https://gerrit.wikimedia.org/r/654802 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[07:23:31] <wikibugs>	 (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for Apache/Hue on testcluster [puppet] - 10https://gerrit.wikimedia.org/r/654802 (https://phabricator.wikimedia.org/T135991)
[07:28:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "This can safely be removed. We can't use recent mongodb versions anyway since the developers switched to a non-free license and it was con" [puppet] - 10https://gerrit.wikimedia.org/r/654923 (owner: 10Dzahn)
[07:30:13] <icinga-wm>	 RECOVERY - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37
[07:37:44] <wikibugs>	 (03CR) 10Muehlenhoff: "Just go ahead and switch all of mw* to Buster, doing this in parts will only cause accidental stretch images (different runtimes of puppet" [puppet] - 10https://gerrit.wikimedia.org/r/654947 (https://phabricator.wikimedia.org/T245757) (owner: 10Dzahn)
[07:51:33] <wikibugs>	 10SRE, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, 10User-brennen: Remove obsoleted docker images - https://phabricator.wikimedia.org/T242604 (10JMeybohm)
[07:57:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1082', diff saved to https://phabricator.wikimedia.org/P13669 and previous config saved to /var/cache/conftool/dbconfig/20210108-075714-marostegui.json
[07:57:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210108T0800)
[08:12:56] <marostegui>	 !log Deploy schema change on s4 codfw master - T270187
[08:12:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:00] <stashbot>	 T270187: Schema change for renaming user_properties_property index - https://phabricator.wikimedia.org/T270187
[08:14:35] <wikibugs>	 (03PS1) 10Elukey: Increase default executor memory size to 4G in Spark Refine settings [puppet] - 10https://gerrit.wikimedia.org/r/655016
[08:16:32] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27384/console" [puppet] - 10https://gerrit.wikimedia.org/r/655016 (owner: 10Elukey)
[08:17:13] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] Increase default executor memory size to 4G in Spark Refine settings [puppet] - 10https://gerrit.wikimedia.org/r/655016 (owner: 10Elukey)
[08:24:21] <wikibugs>	 (03PS1) 10Elukey: role::analytics_cluster::launcher: temporary disable hdfs cleaner [puppet] - 10https://gerrit.wikimedia.org/r/655017 (https://phabricator.wikimedia.org/T270629)
[08:25:55] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27385/console" [puppet] - 10https://gerrit.wikimedia.org/r/655017 (https://phabricator.wikimedia.org/T270629) (owner: 10Elukey)
[08:28:08] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] role::analytics_cluster::launcher: temporary disable hdfs cleaner [puppet] - 10https://gerrit.wikimedia.org/r/655017 (https://phabricator.wikimedia.org/T270629) (owner: 10Elukey)
[08:31:53] <wikibugs>	 10SRE, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 2 others: Remove obsoleted docker images - https://phabricator.wikimedia.org/T242604 (10JMeybohm)
[08:34:08] <wikibugs>	 (03PS1) 10David Caro: [wmcs][wikireplicas] Add haproxy to the proxy servers [puppet] - 10https://gerrit.wikimedia.org/r/655019 (https://phabricator.wikimedia.org/T271509)
[08:37:41] <wikibugs>	 (03PS2) 10David Caro: [wmcs][wikireplicas] Add haproxy to the proxy servers [puppet] - 10https://gerrit.wikimedia.org/r/655019 (https://phabricator.wikimedia.org/T271509)
[08:43:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for Apache/Hue [puppet] - 10https://gerrit.wikimedia.org/r/654801 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[08:45:41] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:46:50] <wikibugs>	 SRE, ops-eqiad, SRE-swift-storage: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T270806 (fgiunchedi) Resolved→Open >>! In T270806#6729470, @Cmjohnson wrote: > @fgiunchedi We do not, loads of 3TB.  I will close the task.   Thanks @Cmjohnson ! I've disabled the `2I:2:3` disk a...
[08:48:30] <wikibugs>	 (03CR) 10David Caro: "Maybe this is why it was not checked with ppc xd" [puppet] - 10https://gerrit.wikimedia.org/r/655019 (https://phabricator.wikimedia.org/T271509) (owner: 10David Caro)
[08:48:40] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "> Patch Set 3: Code-Review-1" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm)
[08:51:39] <wikibugs>	 10SRE, 10ops-eqiad: Please remove sdb from ms-be1022 - https://phabricator.wikimedia.org/T271512 (10fgiunchedi)
[08:52:49] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "> Patch Set 4: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/650215 (https://phabricator.wikimedia.org/T179696) (owner: 10Ahmon Dancy)
[08:54:48] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] [wmcs][wikireplicas] Add haproxy to the proxy servers [puppet] - 10https://gerrit.wikimedia.org/r/655019 (https://phabricator.wikimedia.org/T271509) (owner: 10David Caro)
[08:57:46] <icinga-wm>	 RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 75.25 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
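The JVM GC alerts in this log print their comparison inline: `101.7 gt 100` when the problem fires and `(C)100 gt (W)80 gt 78.31` on recovery. A small sketch of that two-threshold classification follows. `classify` is a hypothetical name, and the strict greater-than reading of "gt" at exactly the threshold value is an assumption.

```python
def classify(value: float, warning: float = 80.0,
             critical: float = 100.0) -> str:
    """Map a metric to OK / WARNING / CRITICAL using strict 'gt' comparisons."""
    if value > critical:
        return "CRITICAL"
    if value > warning:
        return "WARNING"
    return "OK"


# Values reported for logstash1011 and logstash1012 in the log above.
for sample in (101.7, 102.7, 78.31, 75.25):
    print(sample, classify(sample))
```

Both problem values exceed the critical threshold of 100, and both recovery values sit below the warning threshold of 80, which matches the RECOVERY lines reporting OK.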
[08:58:46] <wikibugs>	 SRE, ops-eqiad: Please remove sdb from ms-be1022 - https://phabricator.wikimedia.org/T271512 (Joe) p:Triage→High
[09:01:42] <godog>	 !log swift codfw-prod: more weight to ms-be20[58-61] - T269337
[09:01:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:46] <stashbot>	 T269337: Add ms-be20[58-61] to swift - https://phabricator.wikimedia.org/T269337
[09:08:58] <moritzm>	 !log installing libxstream-java security updates on Buster
[09:09:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:18] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] "Tested manually:" [puppet] - 10https://gerrit.wikimedia.org/r/655019 (https://phabricator.wikimedia.org/T271509) (owner: 10David Caro)
[09:12:25] <wikibugs>	 (03CR) 10Muehlenhoff: "I'd simply wait a few more weeks until graphoid is undeployed and then nuke the entire graphoid Puppet class from orbit, it's the only way" [puppet] - 10https://gerrit.wikimedia.org/r/654960 (https://phabricator.wikimedia.org/T242855) (owner: 10Dzahn)
[09:12:50] <icinga-wm>	 PROBLEM - HP RAID on ms-be1019 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:4, 2I:4:1, 2I:4:2 - Failed: 2I:2:3 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:12:52] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on ms-be1019 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:4, 2I:4:1, 2I:4:2 - Failed: 2I:2:3 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T271514 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:12:56] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T271514 (10ops-monitoring-bot)
[09:14:06] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T271514 (10fgiunchedi)
[09:14:07] <godog>	 that was expected ^
[09:14:10] <wikibugs>	 10SRE, 10ops-eqiad, 10SRE-swift-storage: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T270806 (10fgiunchedi)
[09:14:45] <kormat>	 godog: if you're pessimistic enough, all errors are expected
[09:15:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 25%: After cloning db1155:3316', diff saved to https://phabricator.wikimedia.org/P13670 and previous config saved to /var/cache/conftool/dbconfig/20210108-091528-root.json
[09:15:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:57] <godog>	 kormat: haha that escalated quickly!
[09:16:26] <godog>	 "no unexpected errors" from the shuttle launch days IIRC
[09:16:42] <wikibugs>	 (03CR) 10Muehlenhoff: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/654798 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[09:16:58] <kormat>	 godog: :)
[09:17:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for Apache/Yarn [puppet] - 10https://gerrit.wikimedia.org/r/654800 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[09:23:28] <icinga-wm>	 PROBLEM - Device not healthy -SMART- on ms-be1019 is CRITICAL: cluster=swift device=None instance=ms-be1019 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be1019&var-datasource=eqiad+prometheus/ops
[09:30:27] <marostegui>	 !log Restart mysql on db1115 (tendril/dbtree)
[09:30:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 50%: After cloning db1155:3316', diff saved to https://phabricator.wikimedia.org/P13671 and previous config saved to /var/cache/conftool/dbconfig/20210108-093032-root.json
[09:30:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:10] <icinga-wm>	 PROBLEM - HTTPS-dbtree on dbmonitor1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 354 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[09:41:21] <marostegui>	 ^expected
[09:41:27] <marostegui>	 should recover soon, as I just started it back
[09:43:10] <icinga-wm>	 RECOVERY - HTTPS-dbtree on dbmonitor1001 is OK: HTTP OK: HTTP/1.1 200 OK - 103214 bytes in 6.978 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[09:45:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 75%: After cloning db1155:3316', diff saved to https://phabricator.wikimedia.org/P13672 and previous config saved to /var/cache/conftool/dbconfig/20210108-094535-root.json
[09:45:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:14] <wikibugs>	 (03PS1) 10Kormat: install_server: Use a dummy partman config for d-i-test [puppet] - 10https://gerrit.wikimedia.org/r/655024 (https://phabricator.wikimedia.org/T267670)
[10:00:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 100%: After cloning db1155:3316', diff saved to https://phabricator.wikimedia.org/P13673 and previous config saved to /var/cache/conftool/dbconfig/20210108-100040-root.json
[10:00:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:05] <elukey>	 !log restart varnishkafka-webrequest on cp5001 - timeouts to kafka-jumbo1001, librdkafka seems not recovering very well
[10:01:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:36] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] install_server: Use a dummy partman config for d-i-test [puppet] - 10https://gerrit.wikimedia.org/r/655024 (https://phabricator.wikimedia.org/T267670) (owner: 10Kormat)
[10:04:23] <wikibugs>	 (03PS1) 10Kormat: install_server: Drop virtual.cfg from d-i-test partman cfg [puppet] - 10https://gerrit.wikimedia.org/r/655025 (https://phabricator.wikimedia.org/T267670)
[10:06:46] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] install_server: Drop virtual.cfg from d-i-test partman cfg [puppet] - 10https://gerrit.wikimedia.org/r/655025 (https://phabricator.wikimedia.org/T267670) (owner: 10Kormat)
[10:06:57] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] install_server: Drop virtual.cfg from d-i-test partman cfg [puppet] - 10https://gerrit.wikimedia.org/r/655025 (https://phabricator.wikimedia.org/T267670) (owner: 10Kormat)
[10:07:56] <wikibugs>	 (03PS4) 10WMDE-Fisch: Add a job for TemplateWizard metrics aggregation [puppet] - 10https://gerrit.wikimedia.org/r/649662 (https://phabricator.wikimedia.org/T262209) (owner: 10Awight)
[10:08:30] <wikibugs>	 (03CR) 10WMDE-Fisch: [C: 03+1] "PS4: Manual rebase" [puppet] - 10https://gerrit.wikimedia.org/r/649662 (https://phabricator.wikimedia.org/T262209) (owner: 10Awight)
[10:09:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/654697 (owner: 10Dzahn)
[10:12:28] <wikibugs>	 (03PS1) 10Kormat: install_server: Fix typo in d-i-test config [puppet] - 10https://gerrit.wikimedia.org/r/655026 (https://phabricator.wikimedia.org/T267670)
[10:14:14] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] install_server: Fix typo in d-i-test config [puppet] - 10https://gerrit.wikimedia.org/r/655026 (https://phabricator.wikimedia.org/T267670) (owner: 10Kormat)
[10:14:16] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] install_server: Fix typo in d-i-test config [puppet] - 10https://gerrit.wikimedia.org/r/655026 (https://phabricator.wikimedia.org/T267670) (owner: 10Kormat)
[10:14:35] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Clarify comment to the various image data containers in DockerBuilder [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655027
[10:14:37] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Remove useless return [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655028
[10:14:39] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Reformat the whole project using black [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655029
[10:14:41] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Drop support for python < 3.7 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655030
[10:14:43] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Add typing support [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655031
[10:15:46] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Clarify comment to the various image data containers in DockerBuilder [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655027 (owner: 10Giuseppe Lavagetto)
[10:16:19] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Remove useless return [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655028 (owner: 10Giuseppe Lavagetto)
[10:17:03] <wikibugs>	 10SRE, 10cloud-services-team (Kanban): cron spam from cloudcontrol2004-dev.wikimedia.org - https://phabricator.wikimedia.org/T271518 (10Volans)
[10:18:10] <wikibugs>	 10SRE, 10Patch-For-Review, 10Tracking-Neverending: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10Majavah)
[10:20:29] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "See comments inline." (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/654723 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles)
[10:23:12] <wikibugs>	 (03PS2) 10Jbond: cfssl_ocsprefresh: blank CR soliciting general python post-review [puppet] - 10https://gerrit.wikimedia.org/r/650120
[10:24:27] <wikibugs>	 10SRE, 10cloud-services-team (Kanban): cron spam from cloudcontrol2004-dev.wikimedia.org - https://phabricator.wikimedia.org/T271518 (10dcaro) This seems to be caused by the new version of keystone not supporting the command the cron is running (https://bugs.launchpad.net/keystone/+bug/1759289).
[10:24:50] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid
[10:25:52] <wikibugs>	 10SRE, 10cloud-services-team (Kanban): cron spam from cloudcontrol2004-dev.wikimedia.org - https://phabricator.wikimedia.org/T271518 (10aborrero) This is related to {T261134}
[10:26:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1138', diff saved to https://phabricator.wikimedia.org/P13674 and previous config saved to /var/cache/conftool/dbconfig/20210108-102606-marostegui.json
[10:26:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:28] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[10:28:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P13675 and previous config saved to /var/cache/conftool/dbconfig/20210108-102835-root.json
[10:28:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:20] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1085: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/654861
[10:29:32] <wikibugs>	 (03PS2) 10Marostegui: Revert "db1085: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/654861
[10:30:39] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db1085: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/654861 (owner: 10Marostegui)
[10:37:30] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Add typing support [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655031
[10:38:22] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized private/PrivateSettings.php: Update T250887 mitigations (duration: 01m 10s)
[10:38:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:09] <hashar>	 _joe_: I can help review docker-pkg related patches if that can help. Probably not today, but surely can look at them early next week
[10:39:16] <hashar>	 just add me as a reviewer if need be :]
[10:39:21] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] sockpuppet-api: Create basic chart and service config (0312 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[10:39:39] <_joe_>	 hashar: oh I was asking jayme already, but be my guest
[10:39:43] <_joe_>	 those are mostly noops
[10:40:05] <_joe_>	 I'm adding typing support and dropping 3.5 support
[10:40:12] <hashar>	 OH NO
[10:40:15] <hashar>	 poor Python 3.5 :\
[10:40:33] <wikibugs>	 (03PS1) 10Elukey: role::analytics_test_cluster::client: drop custom kerberos ccache settings [puppet] - 10https://gerrit.wikimedia.org/r/655033 (https://phabricator.wikimedia.org/T255262)
[10:40:41] <_joe_>	 but I decided to add typing after I had to re-follow some logic
[10:40:59] <_joe_>	 and now the code is clearer and marginally more correct :P
[10:41:06] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::client: drop custom kerberos ccache settings [puppet] - 10https://gerrit.wikimedia.org/r/655033 (https://phabricator.wikimedia.org/T255262) (owner: 10Elukey)
[10:43:15] <hashar>	 _joe_: cool. Just add me and will look into them on monday I guess
[10:43:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P13676 and previous config saved to /var/cache/conftool/dbconfig/20210108-104338-root.json
[10:43:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:51] <hashar>	 !lunch prep
[10:47:06] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] "I don't want to open pandoras box, but I already see the app being called similarusers while stuff in k8s is called suckpuppet. Frankly, I" [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[10:47:30] <jayme>	 ups...interesting typo 
[10:51:12] <wikibugs>	 (03PS3) 10Jbond: cfssl_ocsprefresh: blank CR soliciting general python post-review [puppet] - 10https://gerrit.wikimedia.org/r/650120
[10:51:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cfssl_ocsprefresh: blank CR soliciting general python post-review [puppet] - 10https://gerrit.wikimedia.org/r/650120 (owner: 10Jbond)
[10:53:53] <wikibugs>	 (03PS4) 10Jbond: cfssl_ocsprefresh: blank CR soliciting general python post-review [puppet] - 10https://gerrit.wikimedia.org/r/650120
[10:54:56] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM overall, although I haven't tested the change" [puppet] - 10https://gerrit.wikimedia.org/r/650629 (https://phabricator.wikimedia.org/T254533) (owner: 10Cwhite)
[10:55:14] <wikibugs>	 (03CR) 10Jbond: "Thanks a lot for the input see inline for responses" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/650120 (owner: 10Jbond)
[10:55:54] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Clarify comment to the various image data containers in DockerBuilder [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655027 (owner: 10Giuseppe Lavagetto)
[10:56:03] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] etcd::v3: hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/651834 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[10:56:05] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Remove useless return [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655028 (owner: 10Giuseppe Lavagetto)
[10:58:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P13688 and previous config saved to /var/cache/conftool/dbconfig/20210108-105842-root.json
[10:58:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:49] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloud: hiera: enable back neutron hacks in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/655036 (https://phabricator.wikimedia.org/T271517)
[11:00:05] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: hiera: enable back neutron hacks in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/655036 (https://phabricator.wikimedia.org/T271517) (owner: 10Arturo Borrero Gonzalez)
[11:03:45] <wikibugs>	 10SRE, 10Phabricator: Excessive queries ffrom vscode-phabricator - https://phabricator.wikimedia.org/T271528 (10jbond)
[11:05:29] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "A mess to review, but I love it 😊" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655029 (owner: 10Giuseppe Lavagetto)
[11:06:07] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Drop support for python < 3.7 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655030 (owner: 10Giuseppe Lavagetto)
[11:06:12] <wikibugs>	 (03PS6) 10Jbond: varnish: ratelimit vscode-phabricator plugin [puppet] - 10https://gerrit.wikimedia.org/r/650494 (https://phabricator.wikimedia.org/T270482)
[11:06:41] <wikibugs>	 (03CR) 10Jbond: varnish: ratelimit vscode-phabricator plugin (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/650494 (https://phabricator.wikimedia.org/T270482) (owner: 10Jbond)
[11:13:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1138 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P13689 and previous config saved to /var/cache/conftool/dbconfig/20210108-111345-root.json
[11:13:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:59] <wikibugs>	 (03CR) 10Jgiannelos: tegola: Add docker image. (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/654662 (https://phabricator.wikimedia.org/T270170) (owner: 10Hnowlan)
[11:15:29] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloud: hiera: connect cloudnet servers back to vlan 2120 [puppet] - 10https://gerrit.wikimedia.org/r/655038 (https://phabricator.wikimedia.org/T271517)
[11:16:35] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: hiera: connect cloudnet servers back to vlan 2120 [puppet] - 10https://gerrit.wikimedia.org/r/655038 (https://phabricator.wikimedia.org/T271517) (owner: 10Arturo Borrero Gonzalez)
[11:17:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1141', diff saved to https://phabricator.wikimedia.org/P13690 and previous config saved to /var/cache/conftool/dbconfig/20210108-111733-marostegui.json
[11:17:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:54] <wikibugs>	 (03PS2) 10Jbond: (WIP) ccreat ocsp helper script [puppet] - 10https://gerrit.wikimedia.org/r/654418
[11:18:06] <wikibugs>	 (03CR) 10Ema: [C: 03+1] varnish: ratelimit vscode-phabricator plugin [puppet] - 10https://gerrit.wikimedia.org/r/650494 (https://phabricator.wikimedia.org/T270482) (owner: 10Jbond)
[11:18:26] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] (WIP) ccreat ocsp helper script [puppet] - 10https://gerrit.wikimedia.org/r/654418 (owner: 10Jbond)
[11:19:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P13691 and previous config saved to /var/cache/conftool/dbconfig/20210108-111905-root.json
[11:19:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:45] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloud: hiera: add vlan 2120 back into the neutron bridge [puppet] - 10https://gerrit.wikimedia.org/r/655039 (https://phabricator.wikimedia.org/T271517)
[11:22:55] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: hiera: add vlan 2120 back into the neutron bridge [puppet] - 10https://gerrit.wikimedia.org/r/655039 (https://phabricator.wikimedia.org/T271517) (owner: 10Arturo Borrero Gonzalez)
[11:27:38] <wikibugs>	 10SRE, 10Wikidata, 10Wikidata Query UI, 10User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Addshore)
[11:28:09] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] mail::smarthost: hiera -> lookup [puppet] - 10https://gerrit.wikimedia.org/r/654697 (owner: 10Dzahn)
[11:29:44] <wikibugs>	 10SRE, 10Wikidata, 10Wikidata Query UI, 10User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Addshore) So now, per T266702#6662363 we need to change the traffic layer to point most of query.wikidata.org to microsites, with the two paths below pointing to the existing w...
[11:31:09] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] push-notifications: pass x-fowarded-proto https in header [deployment-charts] - 10https://gerrit.wikimedia.org/r/654878 (owner: 10MSantos)
[11:32:51] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "I see you also renamed the cluster from "kibana" to "kibana7". That's ok, of course, but not strictly necessary." [puppet] - 10https://gerrit.wikimedia.org/r/654436 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron)
[11:34:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P13692 and previous config saved to /var/cache/conftool/dbconfig/20210108-113408-root.json
[11:34:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:25] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:38:02] <wikibugs>	 (03CR) 10Jgiannelos: [C: 04-1] "Other than my build flags comment, I tested the build locally using the upstream repo and it worked fine in my dev setup (server initializ" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/654662 (https://phabricator.wikimedia.org/T270170) (owner: 10Hnowlan)
[11:39:37] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) Over the past 2 days,  6.0.1 has been performing worse than 5.1.3, and similarly to 6.0.7. This would seem to indicate th...
[11:42:07] <wikibugs>	 (03PS1) 10Volans: dns: migrate script to Netbox 2.9+ [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/655040 (https://phabricator.wikimedia.org/T266488)
[11:43:25] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27387/console" [puppet] - 10https://gerrit.wikimedia.org/r/654437 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron)
[11:47:46] <wikibugs>	 (03PS1) 10Kormat: install_server: Fix ml-serve/raid1-2x2dev partman configs [puppet] - 10https://gerrit.wikimedia.org/r/655041 (https://phabricator.wikimedia.org/T267670)
[11:49:03] <wikibugs>	 (03CR) 10Jbond: "Resutls from vcl test:" [puppet] - 10https://gerrit.wikimedia.org/r/650494 (https://phabricator.wikimedia.org/T270482) (owner: 10Jbond)
[11:49:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P13693 and previous config saved to /var/cache/conftool/dbconfig/20210108-114912-root.json
[11:49:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:57] <wikibugs>	 (03PS2) 10Kormat: install_server: Fix ml-serve/raid1-2x2dev partman configs [puppet] - 10https://gerrit.wikimedia.org/r/655041 (https://phabricator.wikimedia.org/T267670)
[11:52:34] <wikibugs>	 (03PS1) 10Ema: ATS: make number of allowed Lua states configurable [puppet] - 10https://gerrit.wikimedia.org/r/655043 (https://phabricator.wikimedia.org/T265625)
[11:52:36] <wikibugs>	 (03PS1) 10Ema: ATS: lower number of allowed Lua states on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/655044 (https://phabricator.wikimedia.org/T265625)
[11:52:57] <wikibugs>	 (03PS3) 10Kormat: install_server: Fix ml-serve/raid1-2x2dev partman configs [puppet] - 10https://gerrit.wikimedia.org/r/655041 (https://phabricator.wikimedia.org/T267670)
[11:53:38] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "The change looks good to me, given the service is in lvs_setup status, it should not cause paging." [puppet] - 10https://gerrit.wikimedia.org/r/654437 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron)
[11:53:52] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] install_server: Fix ml-serve/raid1-2x2dev partman configs [puppet] - 10https://gerrit.wikimedia.org/r/655041 (https://phabricator.wikimedia.org/T267670) (owner: 10Kormat)
[11:53:54] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] kibana7: remove kibana-next conftool entries [puppet] - 10https://gerrit.wikimedia.org/r/654438 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron)
[11:54:24] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] drop the ServerAdmin line [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/654482 (https://phabricator.wikimedia.org/T251005) (owner: 10Dzahn)
[11:54:29] <wikibugs>	 (03PS1) 10Jbond: varnish tests: this script failes on the pcc run [puppet] - 10https://gerrit.wikimedia.org/r/655045
[11:56:01] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] install_server: Fix ml-serve/raid1-2x2dev partman configs [puppet] - 10https://gerrit.wikimedia.org/r/655041 (https://phabricator.wikimedia.org/T267670) (owner: 10Kormat)
[11:57:12] <wikibugs>	 (03PS2) 10Jbond: varnish tests: this script fails on the pcc run [puppet] - 10https://gerrit.wikimedia.org/r/655045
[11:58:21] <wikibugs>	 (03CR) 10Ema: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/655045 (owner: 10Jbond)
[11:58:40] <wikibugs>	 (03PS3) 10Jbond: varnish tests: this script fails on the pcc run [puppet] - 10https://gerrit.wikimedia.org/r/655045
[11:59:14] <wikibugs>	 (03PS4) 10Jbond: varnish tests: this script fails on the pcc run [puppet] - 10https://gerrit.wikimedia.org/r/655045
[12:00:13] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] varnish tests: this script fails on the pcc run [puppet] - 10https://gerrit.wikimedia.org/r/655045 (owner: 10Jbond)
[12:00:56] <wikibugs>	 (03PS2) 10Ema: ATS: lower number of allowed Lua states on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/655044 (https://phabricator.wikimedia.org/T265625)
[12:04:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1141 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P13694 and previous config saved to /var/cache/conftool/dbconfig/20210108-120415-root.json
[12:04:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:43] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] ATS: make number of allowed Lua states configurable [puppet] - 10https://gerrit.wikimedia.org/r/655043 (https://phabricator.wikimedia.org/T265625) (owner: 10Ema)
[12:08:12] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "LGTM, remember that the max lua states isn't a reloadable config parameter, so ats-be needs to be restarted" [puppet] - 10https://gerrit.wikimedia.org/r/655044 (https://phabricator.wikimedia.org/T265625) (owner: 10Ema)
[12:11:27] <wikibugs>	 10SRE, 10Phabricator, 10Traffic, 10Patch-For-Review: Excessive queries from vscode-phabricator - https://phabricator.wikimedia.org/T271528 (10jbond) p:05Triage→03Medium
[12:18:52] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by klausman on cumin2001.codfw.wmnet for hosts: ` rdb2004.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2...
[12:20:41] <icinga-wm>	 PROBLEM - Host rdb2004 is DOWN: PING CRITICAL - Packet loss = 100%
[12:21:29] <icinga-wm>	 RECOVERY - Host rdb2004 is UP: PING OK - Packet loss = 0%, RTA = 31.79 ms
[12:22:05] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] dns: migrate script to Netbox 2.9+ [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/655040 (https://phabricator.wikimedia.org/T266488) (owner: 10Volans)
[12:23:07] <klausman>	 fuck
[12:23:08] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['rdb2004.codfw.wmnet'] `  Of which those **FAILED**: ` ['rdb2004.codfw.wmnet'] `
[12:23:16] <klausman>	 That was me, almost reinstalling the machine
[12:23:56] <klausman>	 aaagh, it may still boot into the installer
[12:23:58] <klausman>	 help!
[12:26:11] <kormat>	 klausman: can you connect to the console?
[12:26:23] <klausman>	 yes, but I dunno the poweroff command by heart
[12:27:56] <kormat>	 the log seems to imply it booted into the installer
[12:28:06] <klausman>	 yes, and it's likely busy wiping the disks
[12:28:15] <klausman>	 The console help is useless
[12:28:35] <klausman>	 aaand It's already unpacking packages
[12:29:13] <kormat>	 looks like rdb* are redis servers; given that eqiad is primary currently it might mean that nothing important is broken
[12:29:41] <klausman>	 editing the place I c&p'd from as we speak
[12:29:47] <kormat>	 👍
[12:31:08] <kormat>	 jayme: if you're around, rbd2004 got accidentally reimaged
[12:31:19] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by klausman on cumin2001.codfw.wmnet for hosts: ` ml-serve2001.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reim...
[12:31:24] <kormat>	 *rdb2004
[12:31:33] <klausman>	 wmf-auto-reimage-host could really use a molly-guard-like thing
[12:32:02] <kormat>	 aye
[12:33:37] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:37:25] <jayme>	 ouch ... I'm not 100% sure, but I would also guess that nothing important is broken, since it is a codfw server
[12:37:37] <klausman>	 FWIW, sorry
[12:38:39] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] sentry: delete module and hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/654912 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn)
[12:38:43] <wikibugs>	 (03PS1) 10Hnowlan: deployment: rename sockpuppet-api to similar-users [puppet] - 10https://gerrit.wikimedia.org/r/655047 (https://phabricator.wikimedia.org/T268837)
[12:38:47] <jayme>	 don't worry, shit happens
[12:40:19] <wikibugs>	 10SRE, 10Goal, 10Patch-For-Review: FY2020-2021 Q1 DC switchover and switchback - https://phabricator.wikimedia.org/T243314 (10Marostegui) @RLazarus can this be closed?
[12:40:43] <jayme>	 https://wikitech.wikimedia.org/wiki/Redis says it also was a slave server, so it might just join and sync. Do you know which state it is in now, klausman?
[12:42:09] <klausman>	 Last I saw it was still in the installer
[12:42:24] <klausman>	 But I closed the console since you might have wanted it
[12:42:50] <klausman>	 Oooh
[12:43:18] <klausman>	 https://phabricator.wikimedia.org/F33990715
[12:43:37] <klausman>	 I am not sure *what* it unpacked before.
[12:44:37] <klausman>	 jayme: what do you want me to do at that prompt? Or do you want to take over the console?
[12:45:06] <jayme>	 hmm...leave it like that I would say until we figure it out. I wonder what that check actually is, as I would have assumed reimage to just...well reimage
[12:45:23] <klausman>	 Yeah.
[12:45:50] <klausman>	 My suspicion is this might be about files in /etc only, i.e. dpkg wanting to ask keep/install type questions
[12:46:32] <klausman>	 /home on the target seems empty
[12:46:54] <jayme>	 Maybe. But I thought reimage would simply create new filesystems etc.
[12:47:01] <klausman>	 Same.
[12:48:22] <klausman>	 /srv is definitely empty
[12:48:43] <jayme>	 pinging _joe_ for when back from lunch/break
[12:49:12] <klausman>	 Alright, closing the console 
[12:49:24] <logmsgbot>	 !log klausman@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2001.codfw.wmnet with reason: REIMAGE
[12:49:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:42] <klausman>	 And *that* is the host I meant to install
[12:49:58] <jayme>	 yeah, fine. I'll take it from here and try to figure out what's the best thing to do
[12:50:38] <klausman>	 If I can help in any way, lmk
[12:52:14] <logmsgbot>	 !log klausman@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ml-serve2001.codfw.wmnet with reason: REIMAGE
[12:52:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:49] <wikibugs>	 (03CR) 10MSantos: [C: 03+2] push-notifications: pass x-fowarded-proto https in header [deployment-charts] - 10https://gerrit.wikimedia.org/r/654878 (owner: 10MSantos)
[12:57:14] <wikibugs>	 (03Merged) 10jenkins-bot: push-notifications: pass x-fowarded-proto https in header [deployment-charts] - 10https://gerrit.wikimedia.org/r/654878 (owner: 10MSantos)
[13:00:05] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ml-serve2001.codfw.wmnet'] `  and were **ALL** successful.
[13:05:00] <jayme>	 klausman: from what it looks like in puppet, the rdb2003,rdb2004 cluster is not used anywhere
[13:05:09] <klausman>	 Phew.
[13:05:19] <klausman>	 At least I didn't break any serving infra
[13:09:09] <jayme>	 ah, well...maybe it is but only via nutcracker. So that should be fine as well
[13:09:43] <wikibugs>	 (03PS1) 10Kormat: install_server: Revert changes to d-i-test's config [puppet] - 10https://gerrit.wikimedia.org/r/655050 (https://phabricator.wikimedia.org/T267670)
[13:10:38] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] install_server: Revert changes to d-i-test's config [puppet] - 10https://gerrit.wikimedia.org/r/655050 (https://phabricator.wikimedia.org/T267670) (owner: 10Kormat)
[13:10:48] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:10:53] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] install_server: Revert changes to d-i-test's config [puppet] - 10https://gerrit.wikimedia.org/r/655050 (https://phabricator.wikimedia.org/T267670) (owner: 10Kormat)
[13:12:20] <wikibugs>	 (03PS1) 10Ladsgroup: Make query.wikidata.org point to microsite backend instead (for GUI) [puppet] - 10https://gerrit.wikimedia.org/r/655051 (https://phabricator.wikimedia.org/T266702)
[13:13:14] <wikibugs>	 10SRE, 10Wikidata, 10Wikidata Query UI, 10Patch-For-Review, 10User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Ladsgroup) My knowledge in here is not that great but this should work ^
[13:14:04] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by klausman on cumin2001.codfw.wmnet for hosts: ` ml-serve2004.codfw.wmnet ` The log can be found in `...
[13:14:07] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by klausman on cumin2001.codfw.wmnet for hosts: ` ml-serve2003.codfw.wmnet ` The log can be found in `...
[13:14:10] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by klausman on cumin2001.codfw.wmnet for hosts: ` ml-serve2002.codfw.wmnet ` The log can be found in `...
[13:22:54] <jayme>	 as I'm pretty sure nothing really bad happened and we can probably simply complete the reimage of rdb2004, I'll get some lunch now and wait for a sanity check/confirmation from _joe_ || rzl just to be sure
[13:23:41] <kormat>	 jayme: is it too late to delete the host from puppet/netbox and just pretend it never existed?
[13:24:03] <_joe_>	 rdb2004 is a replica host, so it's not used actively
[13:24:34] <jayme>	 kormat: this is a logged channel as well, so I think that's not an option :)
[13:25:30] <jayme>	 _joe_: thanks for confirmation. So completing the reimage should fix things, right?
[13:26:05] <_joe_>	 to be clear, 2003 is heavily used right now
[13:26:09] <_joe_>	 yes
[13:26:19] <jayme>	 ack to both
[13:27:17] <jayme>	 klausman: would you mind completing/redoing the reimage of rdb2004?
[13:29:10] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:34:28] <klausman>	 I can run it again, sure.
[13:35:40] <kormat>	 klausman: you're a pro at reinstalling rdb2004 by now
[13:36:44] <jayme>	 klausman: cool. Let me know when it's done. I can double check then
[13:36:56] <jayme>	 (that replication worked)
[13:37:11] <logmsgbot>	 !log klausman@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2003.codfw.wmnet with reason: REIMAGE
[13:37:14] <logmsgbot>	 !log klausman@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2002.codfw.wmnet with reason: REIMAGE
[13:37:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:43] <logmsgbot>	 !log klausman@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2004.codfw.wmnet with reason: REIMAGE
[13:37:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:06] <logmsgbot>	 !log klausman@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ml-serve2003.codfw.wmnet with reason: REIMAGE
[13:39:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:00] <logmsgbot>	 !log klausman@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ml-serve2002.codfw.wmnet with reason: REIMAGE
[13:41:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:43] <logmsgbot>	 !log klausman@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ml-serve2004.codfw.wmnet with reason: REIMAGE
[13:42:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:26] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:46:26] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:57:48] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ml-serve2003.codfw.wmnet'] `  and were **ALL** successful.
[14:12:03] <wikibugs>	 (03PS1) 10Ladsgroup: hive: Migrate hiera() to lookup() and setting datatype in serve [puppet] - 10https://gerrit.wikimedia.org/r/655065 (https://phabricator.wikimedia.org/T209953)
[14:13:24] <klausman>	 jayme: reinstall done
[14:13:37] <klausman>	 Waiting for puppet run
[14:15:51] <wikibugs>	 (03CR) 10Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/27390/" [puppet] - 10https://gerrit.wikimedia.org/r/655065 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[14:22:42] <wikibugs>	 (03PS1) 10QChris: Allow “Gerrit Managers” to import history [software/tegola] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/655070
[14:22:44] <wikibugs>	 (03CR) 10QChris: [V: 03+2 C: 03+2] Allow “Gerrit Managers” to import history [software/tegola] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/655070 (owner: 10QChris)
[14:24:44] <wikibugs>	 (03PS1) 10QChris: Import done. Revoke import grants [software/tegola] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/655072
[14:24:46] <wikibugs>	 (03CR) 10QChris: [V: 03+2 C: 03+2] Import done. Revoke import grants [software/tegola] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/655072 (owner: 10QChris)
[14:30:22] <wikibugs>	 (03PS1) 10Ladsgroup: hadoop: Migrate hiera() to lookup() and setting datatype [puppet] - 10https://gerrit.wikimedia.org/r/655073 (https://phabricator.wikimedia.org/T209953)
[14:32:12] <icinga-wm>	 PROBLEM - Check health of redis instance on 6378 on rdb2004 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.123: Connection reset by peer https://wikitech.wikimedia.org/wiki/Redis
[14:33:52] <icinga-wm>	 PROBLEM - Disk space on rdb2004 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.123: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=rdb2004&var-datasource=codfw+prometheus/ops
[14:36:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 33%: After schema change', diff saved to https://phabricator.wikimedia.org/P13695 and previous config saved to /var/cache/conftool/dbconfig/20210108-143610-root.json
[14:36:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:04] <wikibugs>	 (03CR) 10Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/27391/" [puppet] - 10https://gerrit.wikimedia.org/r/655073 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[14:38:12] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Clarify comment to the various image data containers in DockerBuilder [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655027 (owner: 10Giuseppe Lavagetto)
[14:38:29] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Remove useless return [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655028 (owner: 10Giuseppe Lavagetto)
[14:38:56] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Reformat the whole project using black [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655029 (owner: 10Giuseppe Lavagetto)
[14:39:32] <icinga-wm>	 PROBLEM - Check health of redis instance on 6380 on rdb2004 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.123: Connection reset by peer https://wikitech.wikimedia.org/wiki/Redis
[14:39:36] <icinga-wm>	 PROBLEM - Check health of redis instance on 6379 on rdb2004 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.123: Connection reset by peer https://wikitech.wikimedia.org/wiki/Redis
[14:39:46] <icinga-wm>	 PROBLEM - Check health of redis instance on 6378 on rdb2004 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.123: Connection reset by peer https://wikitech.wikimedia.org/wiki/Redis
[14:40:24] <icinga-wm>	 PROBLEM - Check health of redis instance on 6381 on rdb2004 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.123: Connection reset by peer https://wikitech.wikimedia.org/wiki/Redis
[14:40:57] <wikibugs>	 (03Merged) 10jenkins-bot: Reformat the whole project using black [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655029 (owner: 10Giuseppe Lavagetto)
[14:41:10] <wikibugs>	 (03PS1) 10Muehlenhoff: Add a wrapper to optionally mail the output of a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/655077
[14:41:22] <icinga-wm>	 PROBLEM - Check health of redis instance on 6382 on rdb2004 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.123: Connection reset by peer https://wikitech.wikimedia.org/wiki/Redis
[14:41:36] <icinga-wm>	 PROBLEM - Check size of conntrack table on rdb2004 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.123: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[14:41:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add a wrapper to optionally mail the output of a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/655077 (owner: 10Muehlenhoff)
[14:42:30] <icinga-wm>	 RECOVERY - Check health of redis instance on 6378 on rdb2004 is OK: OK: REDIS 3.2.6 on 127.0.0.1:6378 has 0 databases (), up 3 seconds https://wikitech.wikimedia.org/wiki/Redis
[14:42:46] <icinga-wm>	 RECOVERY - Check health of redis instance on 6382 on rdb2004 is OK: OK: REDIS 3.2.6 on 127.0.0.1:6382 has 0 databases (), up 18 seconds https://wikitech.wikimedia.org/wiki/Redis
[14:43:00] <icinga-wm>	 RECOVERY - Check size of conntrack table on rdb2004 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[14:43:04] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ml-serve2002.codfw.wmnet'] `  and were **ALL** successful.
[14:43:12] <icinga-wm>	 RECOVERY - Check health of redis instance on 6381 on rdb2004 is OK: OK: REDIS 3.2.6 on 127.0.0.1:6381 has 0 databases (), up 45 seconds https://wikitech.wikimedia.org/wiki/Redis
[14:43:34] <icinga-wm>	 RECOVERY - Check health of redis instance on 6380 on rdb2004 is OK: OK: REDIS 3.2.6 on 127.0.0.1:6380 has 1 databases (db0) with 1620097 keys, up 1 minutes 9 seconds https://wikitech.wikimedia.org/wiki/Redis
[14:43:46] <icinga-wm>	 RECOVERY - Check health of redis instance on 6379 on rdb2004 is OK: OK: REDIS 3.2.6 on 127.0.0.1:6379 has 1 databases (db0) with 2901653 keys, up 1 minutes 21 seconds https://wikitech.wikimedia.org/wiki/Redis
[14:44:46] <icinga-wm>	 RECOVERY - Disk space on rdb2004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=rdb2004&var-datasource=codfw+prometheus/ops
[14:45:27] <jayme>	 klausman: thanks. Forced a recheck on the remaining checks. Looks fine
[14:45:37] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ml-serve2004.codfw.wmnet'] `  and were **ALL** successful.
[14:45:50] <_joe_>	 klausman: have you rebooted the server?
[14:45:57] <klausman>	 jayme: ack, it's doing its final reboot post-puppet, so there may be more stuff
[14:46:02] <klausman>	 14:45:12 | cumin2001.codfw.wmnet | Puppet run completed
[14:46:02] <icinga-wm>	 PROBLEM - Host rdb2004 is DOWN: PING CRITICAL - Packet loss = 100%
[14:46:04] <klausman>	 14:45:13 | rdb2004.codfw.wmnet | Rebooted host
[14:46:48] <_joe_>	 I had the time to verify redis was replicating fine before the reboot
[14:47:05] <klausman>	 Yeah, I would have delayed the reboot if I'd had a chance to do so
[14:47:42] <_joe_>	 that's ok, there is no reason why the server should come up non-functional
[14:48:03] <wikibugs>	 (03PS2) 10Muehlenhoff: Add a wrapper to optionally mail the output of a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/655077
[14:48:32] <icinga-wm>	 RECOVERY - Host rdb2004 is UP: PING OK - Packet loss = 0%, RTA = 31.80 ms
[14:49:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add a wrapper to optionally mail the output of a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/655077 (owner: 10Muehlenhoff)
[14:51:08] <klausman>	 And it's back
[14:51:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 66%: After schema change', diff saved to https://phabricator.wikimedia.org/P13696 and previous config saved to /var/cache/conftool/dbconfig/20210108-145113-root.json
[14:51:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:33] <klausman>	 _joe_, jayme: once more, sorry for that mess. 
[14:51:56] <_joe_>	 klausman: it was just a standby replica, no biggie :)
[14:52:06] <klausman>	 Yeah, I got *real* lucky that way
[14:52:49] <_joe_>	 master_link_status:down uhm
[14:53:08] <_joe_>	 probably still importing data, lemme check the logs
[14:54:35] <_joe_>	 one instance out of 5
[14:55:26] <_joe_>	 sync: receiving 8671577021 bytes from master
[14:55:37] <_joe_>	 how's this instance so huge?
[14:56:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (10Jclark-ctr)
[14:56:05] <_joe_>	 ORES cache
[14:56:13] <jayme>	 yeah, one is pretty big
[14:56:16] <_joe_>	 says wikitech (at https://wikitech.wikimedia.org/wiki/Redis)
[14:56:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (10Jclark-ctr) host configured and racked
[14:56:33] <_joe_>	 ok so we might need to make the replication timeout longer on that instance
[14:57:13] <_joe_>	 it's succeeding now!
[14:58:49] <wikibugs>	 (03PS1) 10David Caro: wmcs.backup: add command output on debug [puppet] - 10https://gerrit.wikimedia.org/r/655080
[14:58:50] <jayme>	 maybe it just saturated the link when it synced all instances at once
[14:59:20] <_joe_>	 yes
[14:59:58] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "seems better!" [puppet] - 10https://gerrit.wikimedia.org/r/655080 (owner: 10David Caro)
[15:01:58] <_joe_>	 https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=11&orgId=1&refresh=5m&var-server=rdb2004&var-datasource=thanos&var-cluster=redis&from=now-30m&to=now
[15:02:20] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@f6c50db]: minor django package upgrades
[15:02:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:08] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Drop support for python < 3.7 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655030 (owner: 10Giuseppe Lavagetto)
[15:03:35] <wikibugs>	 (03CR) 10Reedy: [C: 03+2] Fix a bunch of fatal errors seen in production [extensions/AbuseFilter] (wmf/1.36.0-wmf.25) - 10https://gerrit.wikimedia.org/r/654860 (https://phabricator.wikimedia.org/T271430) (owner: 10Reedy)
[15:03:55] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@f6c50db]: minor django package upgrades (duration: 01m 35s)
[15:03:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:21] <wikibugs>	 (03Merged) 10jenkins-bot: Drop support for python < 3.7 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655030 (owner: 10Giuseppe Lavagetto)
[15:05:53] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "lgtm, puppet compiler agrees:" [puppet] - 10https://gerrit.wikimedia.org/r/654898 (https://phabricator.wikimedia.org/T270478) (owner: 10David Caro)
[15:06:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P13697 and previous config saved to /var/cache/conftool/dbconfig/20210108-150617-root.json
[15:06:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:50] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@f6c50db]: minor django package upgrades
[15:06:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:07:59] <wikibugs>	 (03PS3) 10Muehlenhoff: Add a wrapper to optionally mail the output of a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/655077
[15:08:39] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@f6c50db]: minor django package upgrades (duration: 01m 49s)
[15:08:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:36] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@f6c50db]: minor django package upgrades -> codfw1dev
[15:09:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:06] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@f6c50db]: minor django package upgrades -> codfw1dev (duration: 01m 30s)
[15:11:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:35] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@f6c50db]: minor django package upgrades -> codfw1dev
[15:12:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:40] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@f6c50db]: minor django package upgrades -> codfw1dev (duration: 00m 05s)
[15:12:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:17] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@f6c50db]: minor django package upgrades -> codfw1dev
[15:13:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:17] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@f6c50db]: minor django package upgrades -> codfw1dev (duration: 01m 00s)
[15:14:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:23] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (10klausman) The machines now have a base install (i.e. there is nothing special for them in puppet).  The machines in eqiad should install correctly out of the box, though I have noti...
[15:17:12] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@f6c50db]: minor django package upgrades + compression
[15:17:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:58] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@f6c50db]: minor django package upgrades + compression (duration: 01m 47s)
[15:19:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:22] <wikibugs>	 (03PS1) 10Elukey: jupyterhub: avoid PrivateTmp for ephemeral systemd units for notebooks [puppet] - 10https://gerrit.wikimedia.org/r/655085 (https://phabricator.wikimedia.org/T255262)
[15:20:29] <wikibugs>	 (03PS2) 10Elukey: jupyterhub: avoid PrivateTmp for ephemeral systemd units for notebooks [puppet] - 10https://gerrit.wikimedia.org/r/655085 (https://phabricator.wikimedia.org/T255262)
[15:22:12] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] "SO SIMPLE!" [puppet] - 10https://gerrit.wikimedia.org/r/655085 (https://phabricator.wikimedia.org/T255262) (owner: 10Elukey)
[15:23:15] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@f6c50db]: minor django package upgrades -> codfw1dev
[15:23:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:24:21] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@f6c50db]: minor django package upgrades -> codfw1dev (duration: 01m 06s)
[15:24:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:54] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] jupyterhub: avoid PrivateTmp for ephemeral systemd units for notebooks [puppet] - 10https://gerrit.wikimedia.org/r/655085 (https://phabricator.wikimedia.org/T255262) (owner: 10Elukey)
[15:26:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/655085 (https://phabricator.wikimedia.org/T255262) (owner: 10Elukey)
[15:26:37] <icinga-wm>	 PROBLEM - Host kafka-jumbo1009.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:27:07] <elukey>	 ah ok mgmt :D
[15:27:27] <elukey>	 not great though
[15:28:23] <wikibugs>	 (03PS1) 10Andrew Bogott: Revert "Horizon: Disable offline compression in Train" [puppet] - 10https://gerrit.wikimedia.org/r/655089
[15:28:57] <moritzm>	 loose cable maybe
[15:29:18] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Revert "Horizon: Disable offline compression in Train" [puppet] - 10https://gerrit.wikimedia.org/r/655089 (owner: 10Andrew Bogott)
[15:31:51] <icinga-wm>	 RECOVERY - Host kafka-jumbo1009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms
[15:32:44] <elukey>	 good
[15:34:55] <wikibugs>	 (03PS11) 10Hnowlan: sockpuppet-api: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837)
[15:35:13] <wikibugs>	 (03Merged) 10jenkins-bot: Fix a bunch of fatal errors seen in production [extensions/AbuseFilter] (wmf/1.36.0-wmf.25) - 10https://gerrit.wikimedia.org/r/654860 (https://phabricator.wikimedia.org/T271430) (owner: 10Reedy)
[15:35:20] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sockpuppet-api: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[15:35:35] <Reedy>	 32 minutes to merge a patch to a deployment branch? nice...
[15:35:40] <wikibugs>	 (03PS12) 10Hnowlan: sockpuppet-api: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837)
[15:36:05] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sockpuppet-api: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[15:36:07] <wikibugs>	 (03CR) 10Hnowlan: sockpuppet-api: Create basic chart and service config (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[15:37:16] <wikibugs>	 (03PS13) 10Hnowlan: sockpuppet-api: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837)
[15:38:10] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@ecaad83]: minor django package upgrades -> codfw1dev
[15:38:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:27] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sockpuppet-api: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[15:38:54] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[15:38:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:49] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@ecaad83]: minor django package upgrades -> codfw1dev (duration: 01m 39s)
[15:39:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:52] <logmsgbot>	 !log reedy@deploy1001 Synchronized php-1.36.0-wmf.25/extensions/AbuseFilter/: T271430 T271431 T271432 T271433 (duration: 01m 00s)
[15:39:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:58] <stashbot>	 T271432: Uncaught CentralDBNotAvailableException on Special:AbuseLog - https://phabricator.wikimedia.org/T271432
[15:39:59] <stashbot>	 T271430: Uncaught FilterNotFoundException for non-existing filters - https://phabricator.wikimedia.org/T271430
[15:39:59] <stashbot>	 T271433: Fatal error due to null being passed to SpecsFormatter::nameGroup - https://phabricator.wikimedia.org/T271433
[15:39:59] <stashbot>	 T271431: Uncaught FilterNotFoundException for non-existing filters on ViewHistory - https://phabricator.wikimedia.org/T271431
[15:40:09] <wikibugs>	 (03PS1) 10Ottomata: Allow jupyter notebooks to write to tmp [puppet] - 10https://gerrit.wikimedia.org/r/655090 (https://phabricator.wikimedia.org/T255262)
[15:41:09] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Allow jupyter notebooks to write to tmp [puppet] - 10https://gerrit.wikimedia.org/r/655090 (https://phabricator.wikimedia.org/T255262) (owner: 10Ottomata)
[15:41:39] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Allow jupyter notebooks to write to tmp [puppet] - 10https://gerrit.wikimedia.org/r/655090 (https://phabricator.wikimedia.org/T255262) (owner: 10Ottomata)
[15:42:36] <wikibugs>	 (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/654890 (https://phabricator.wikimedia.org/T260511) (owner: 10Bstorm)
[15:43:00] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@ecaad83]: minor django package upgrades -> codfw1dev
[15:43:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:29] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@ecaad83]: minor django package upgrades -> codfw1dev (duration: 00m 29s)
[15:43:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:26] <wikibugs>	 (03CR) 10Marostegui: "Thanks for the explanation. I am happy to assist you with deploying this next week if you like." [puppet] - 10https://gerrit.wikimedia.org/r/654890 (https://phabricator.wikimedia.org/T260511) (owner: 10Bstorm)
[15:47:00] <wikibugs>	 (03PS1) 10Ottomata: Allow jupyter notebooks to write to tmp for newpyter too [puppet] - 10https://gerrit.wikimedia.org/r/655092 (https://phabricator.wikimedia.org/T255262)
[15:47:34] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Allow jupyter notebooks to write to tmp for newpyter too [puppet] - 10https://gerrit.wikimedia.org/r/655092 (https://phabricator.wikimedia.org/T255262) (owner: 10Ottomata)
[15:47:36] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] wmcs.backups: replaces the image script with new one [puppet] - 10https://gerrit.wikimedia.org/r/654898 (https://phabricator.wikimedia.org/T270478) (owner: 10David Caro)
[15:48:17] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] wmcs.backup: add command output on debug [puppet] - 10https://gerrit.wikimedia.org/r/655080 (owner: 10David Caro)
[15:50:48] <wikibugs>	 (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/654890 (https://phabricator.wikimedia.org/T260511) (owner: 10Bstorm)
[15:51:37] <logmsgbot>	 !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1)
[15:51:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:51:53] <wikibugs>	 (03CR) 10Marostegui: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/654890 (https://phabricator.wikimedia.org/T260511) (owner: 10Bstorm)
[15:54:28] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@ecaad83]: minor django package upgrades -> labweb1002
[15:54:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:14] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[15:57:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:58:53] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@ecaad83]: minor django package upgrades -> labweb1002 (duration: 04m 25s)
[15:58:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:38] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[15:59:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:17] <wikibugs>	 (03CR) 10Bstorm: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27393/console" [puppet] - 10https://gerrit.wikimedia.org/r/654890 (https://phabricator.wikimedia.org/T260511) (owner: 10Bstorm)
[16:03:29] <wikibugs>	 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2021 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10hashar) @jijiki thanks for the investigation!  We were kind of wondering whether the Apache reload might have triggered the opcache issue which i...
[16:03:35] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] Add typing support (033 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655031 (owner: 10Giuseppe Lavagetto)
[16:04:22] <wikibugs>	 10SRE, 10Wikidata, 10Wikidata Query UI, 10Patch-For-Review, 10User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Addshore) @Ladsgroup this may interest you :)    > 1:18 PM <addshore> Hey yall! Its friday so I guess we don't want to deploy it.... but is there any way...
[16:06:39] <wikibugs>	 (03Abandoned) 10Elukey: profile::kerberos::client: change alternative ccache location [puppet] - 10https://gerrit.wikimedia.org/r/650480 (https://phabricator.wikimedia.org/T255262) (owner: 10Elukey)
[16:07:10] <wikibugs>	 (03PS1) 10Bstorm: wikireplicas: remove query killer from dedicated replica server [puppet] - 10https://gerrit.wikimedia.org/r/655093 (https://phabricator.wikimedia.org/T269211)
[16:09:09] <wikibugs>	 (03CR) 10Bstorm: [C: 03+1] "Sorry for the delay. Thanks for this!" [puppet] - 10https://gerrit.wikimedia.org/r/639826 (https://phabricator.wikimedia.org/T266479) (owner: 10Jbond)
[16:10:40] <wikibugs>	 (03CR) 10Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/27394/cp3050.esams.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/655051 (https://phabricator.wikimedia.org/T266702) (owner: 10Ladsgroup)
[16:11:50] <wikibugs>	 10SRE, 10Wikidata, 10Wikidata Query UI, 10Patch-For-Review, 10User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Ladsgroup) Thanks. You can also simply use this form too: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/build?delay=0sec  I...
[16:13:51] <wikibugs>	 (03CR) 10Bstorm: "> Patch Set 35:" [puppet] - 10https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) (owner: 10Bstorm)
[16:16:15] <wikibugs>	 (03PS1) 10David Caro: wmcs.backup_glance_images: disable the backups on 1003 and 1004 [puppet] - 10https://gerrit.wikimedia.org/r/655095 (https://phabricator.wikimedia.org/T270478)
[16:19:41] <wikibugs>	 (03PS14) 10Hnowlan: sockpuppet-api: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837)
[16:20:47] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (10RobH) fixed the netbox issue, hosts will be imaged later today
[16:21:30] <wikibugs>	 (03PS3) 10Jbond: cfssl: helper script [puppet] - 10https://gerrit.wikimedia.org/r/654418
[16:22:02] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cfssl: helper script [puppet] - 10https://gerrit.wikimedia.org/r/654418 (owner: 10Jbond)
[16:24:06] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.hosts.decommission
[16:24:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:39] <wikibugs>	 10SRE, 10netops: Upgrade Routinator 3000 to 0.8.2 - https://phabricator.wikimedia.org/T269738 (10ayounsi) https://mailarchive.ietf.org/arch/msg/sidrops/mlFkEcI0DCLv0ZXLY3uZmM1x2do/
[16:25:01] <wikibugs>	 (03PS4) 10Jbond: cfssl: helper script [puppet] - 10https://gerrit.wikimedia.org/r/654418
[16:30:08] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@7466703]: selective disable of problematic compression block
[16:30:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:45] <logmsgbot>	 !log razzi@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[16:30:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:32:00] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@7466703]: selective disable of problematic compression block (duration: 01m 52s)
[16:32:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:03] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 04-1] "compiler says:" [puppet] - 10https://gerrit.wikimedia.org/r/655095 (https://phabricator.wikimedia.org/T270478) (owner: 10David Caro)
[16:33:19] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@7466703]: selective disable of problematic compression block
[16:33:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:01] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@7466703]: selective disable of problematic compression block (duration: 01m 42s)
[16:35:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:42:15] <andrewbogott>	 !log shutting down labweb1001 so I can really believe that all traffic is being served by 1002
[16:42:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:07] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.dns.netbox
[16:43:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:59] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10cloud-services-team (Kanban): Upgrade labweb servers to buster - https://phabricator.wikimedia.org/T269004 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for hosts: ` ['labweb1001.wikimedia.org'] ` The log can be fou...
[16:49:27] <wikibugs>	 (03CR) 10Hnowlan: sockpuppet-api: Create basic chart and service config (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[16:50:58] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:51:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:51:11] <wikibugs>	 (03CR) 10Jbond: "see inline for comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/655077 (owner: 10Muehlenhoff)
[16:53:06] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Abstract Wikipedia (Phase β): Grant Access to ldap/wmf for Cory Massaro - https://phabricator.wikimedia.org/T271245 (10cmassaro) Thank you!
[16:54:33] <wikibugs>	 (03PS1) 10CDanis: add debug handlers [software/klaxon] - 10https://gerrit.wikimedia.org/r/655100
[16:55:25] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] add debug handlers [software/klaxon] - 10https://gerrit.wikimedia.org/r/655100 (owner: 10CDanis)
[16:56:16] <wikibugs>	 (03PS2) 10CDanis: add debug handlers [software/klaxon] - 10https://gerrit.wikimedia.org/r/655100
[17:06:44] <icinga-wm>	 PROBLEM - tileratorui on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[17:08:30] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on labweb1001.wikimedia.org with reason: REIMAGE
[17:08:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:34] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "LGTM, but should be merged together with renaming the stanza in private repo "k8s_infrastructure_users:" and "profile::kubernetes::deploym" [puppet] - 10https://gerrit.wikimedia.org/r/655047 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan)
[17:10:33] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on labweb1001.wikimedia.org with reason: REIMAGE
[17:10:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:36] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on maps1009.eqiad.wmnet with reason: Downtiming while not pooled
[17:15:37] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on maps1009.eqiad.wmnet with reason: Downtiming while not pooled
[17:15:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:46] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on maps2007.codfw.wmnet with reason: Downtiming while not pooled
[17:15:46] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2 days, 12:00:00 on maps2007.codfw.wmnet with reason: Downtiming while not pooled
[17:15:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:10] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): update RAID controller firmware on labstore1006, 1007 - https://phabricator.wikimedia.org/T268285 (10Bstorm) 05Open→03Resolved
[17:27:29] <icinga-wm>	 PROBLEM - cassandra CQL 10.192.32.46:9042 on maps2007 is CRITICAL: connect to address 10.192.32.46 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[17:29:53] <icinga-wm>	 PROBLEM - cassandra service on maps2007 is CRITICAL: CRITICAL - Unit cassandra is active but reported SubState exited, wanted running https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:31:51] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra CQL 10.192.32.46:9042 on maps2007 is CRITICAL: connect to address 10.192.32.46 and port 9042: Connection refused Hnowlan Host out of cluster intentionally for testing. https://phabricator.wikimedia.org/T93886
[17:31:52] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra service on maps2007 is CRITICAL: CRITICAL - Unit cassandra is active but reported SubState exited, wanted running Hnowlan Host out of cluster intentionally for testing. https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[17:33:34] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on maps2007.codfw.wmnet with reason: Downtiming while not pooled
[17:33:34] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on maps2007.codfw.wmnet with reason: Downtiming while not pooled
[17:33:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:29] <wikibugs>	 (03PS1) 10CDanis: Wordsmith and re-order the "I need help!" options [software/klaxon] - 10https://gerrit.wikimedia.org/r/655109
[17:43:56] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] sentry: delete module and hiera keys [puppet] - 10https://gerrit.wikimedia.org/r/654912 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn)
[17:47:35] <wikibugs>	 (03Abandoned) 10Dzahn: admin: delete the graphoid admin group, remove from scb [puppet] - 10https://gerrit.wikimedia.org/r/654960 (https://phabricator.wikimedia.org/T242855) (owner: 10Dzahn)
[17:47:59] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] "As discussed -- LGTM as long as you've checked that nothing in the environment is sensitive." [software/klaxon] - 10https://gerrit.wikimedia.org/r/655100 (owner: 10CDanis)
[17:48:20] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] mediawiki::php: remove code to absent mongodb module [puppet] - 10https://gerrit.wikimedia.org/r/654922 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn)
[17:48:50] <wikibugs>	 (03CR) 10Dzahn: "was removed on mwdebug* and is gone globally" [puppet] - 10https://gerrit.wikimedia.org/r/654922 (https://phabricator.wikimedia.org/T180761) (owner: 10Dzahn)
[17:50:06] <logmsgbot>	 !log joal@deploy1001 Started deploy [analytics/refinery@db9da3c]: Hotfix analytics deployment [analytics/refinery@db9da3c]
[17:50:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:55] <wikibugs>	 (03CR) 10Dzahn: "No hosts found matching `C:profile::mail::smarthost` unable to do anything" [puppet] - 10https://gerrit.wikimedia.org/r/654697 (owner: 10Dzahn)
[17:52:41] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] "> Patch Set 2: Code-Review+1" [software/klaxon] - 10https://gerrit.wikimedia.org/r/655100 (owner: 10CDanis)
[17:53:04] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] Wordsmith and re-order the "I need help!" options (031 comment) [software/klaxon] - 10https://gerrit.wikimedia.org/r/655109 (owner: 10CDanis)
[17:54:36] <wikibugs>	 (03Merged) 10jenkins-bot: add debug handlers [software/klaxon] - 10https://gerrit.wikimedia.org/r/655100 (owner: 10CDanis)
[17:54:42] <wikibugs>	 (03CR) 10Dzahn: "so .. apparently this is really not even used. There is a role that includes " ::profile::mail::smarthost::wmcs" but that is with the ::wm" [puppet] - 10https://gerrit.wikimedia.org/r/654697 (owner: 10Dzahn)
[17:55:24] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] Wordsmith and re-order the "I need help!" options (031 comment) [software/klaxon] - 10https://gerrit.wikimedia.org/r/655109 (owner: 10CDanis)
[17:56:44] <wikibugs>	 (03Merged) 10jenkins-bot: Wordsmith and re-order the "I need help!" options [software/klaxon] - 10https://gerrit.wikimedia.org/r/655109 (owner: 10CDanis)
[17:56:48] <wikibugs>	 (03CR) 10Dzahn: "oh wait.. it's more than that. profile::mail::smarthost::wmcs instantiates ::profile::mail::smarthost. but it should be roles using profil" [puppet] - 10https://gerrit.wikimedia.org/r/654697 (owner: 10Dzahn)
[17:59:38] <wikibugs>	 (03PS1) 10CDanis: actually add link to meta:IRC [software/klaxon] - 10https://gerrit.wikimedia.org/r/655111
[17:59:55] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] actually add link to meta:IRC [software/klaxon] - 10https://gerrit.wikimedia.org/r/655111 (owner: 10CDanis)
[18:00:19] <wikibugs>	 (03CR) 10Dzahn: "anyways.. cleaning that stuff up is not for this patch.  and the actual hosts affected, that the puppet compiler can't find due to the non" [puppet] - 10https://gerrit.wikimedia.org/r/654697 (owner: 10Dzahn)
[18:01:13] <wikibugs>	 (03Merged) 10jenkins-bot: actually add link to meta:IRC [software/klaxon] - 10https://gerrit.wikimedia.org/r/655111 (owner: 10CDanis)
[18:01:33] <logmsgbot>	 !log joal@deploy1001 Finished deploy [analytics/refinery@db9da3c]: Hotfix analytics deployment [analytics/refinery@db9da3c] (duration: 11m 27s)
[18:01:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:01:49] <wikibugs>	 (03CR) 10CRusnov: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/655040 (https://phabricator.wikimedia.org/T266488) (owner: 10Volans)
[18:02:10] <logmsgbot>	 !log joal@deploy1001 Started deploy [analytics/refinery@db9da3c] (thin): Hotfix analytics deployment - THIN [analytics/refinery@db9da3c]
[18:02:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:17] <logmsgbot>	 !log joal@deploy1001 Finished deploy [analytics/refinery@db9da3c] (thin): Hotfix analytics deployment - THIN [analytics/refinery@db9da3c] (duration: 00m 07s)
[18:02:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:04:56] <wikibugs>	 (03CR) 10Dzahn: "@Bstorm I am pretty confident this patch is another noop and I would normally compile it and then also confirm on a host after merge.. but" [puppet] - 10https://gerrit.wikimedia.org/r/654697 (owner: 10Dzahn)
[18:06:40] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/27382/" [puppet] - 10https://gerrit.wikimedia.org/r/651834 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[18:06:50] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10cloud-services-team (Kanban): Upgrade labweb servers to buster - https://phabricator.wikimedia.org/T269004 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['labweb1001.wikimedia.org'] `  and were **ALL** successful.
[18:07:21] <wikibugs>	 (03PS1) 10CDanis: tweak debug table layout [software/klaxon] - 10https://gerrit.wikimedia.org/r/655112
[18:07:58] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] tweak debug table layout [software/klaxon] - 10https://gerrit.wikimedia.org/r/655112 (owner: 10CDanis)
[18:08:53] <wikibugs>	 (03Merged) 10jenkins-bot: tweak debug table layout [software/klaxon] - 10https://gerrit.wikimedia.org/r/655112 (owner: 10CDanis)
[18:10:50] <wikibugs>	 (03CR) 10Dzahn: "noop confirmed on conf1005, conf2001" [puppet] - 10https://gerrit.wikimedia.org/r/651834 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn)
[18:14:45] <wikibugs>	 (03CR) 10Bstorm: [C: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/654697 (owner: 10Dzahn)
[18:15:54] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/654697 (owner: 10Dzahn)
[18:23:09] <wikibugs>	 (03CR) 10Dzahn: "Using cloud root access I can ssh to the instance without having to add myself to the project and looking up stuff in Horizon.. now that I" [puppet] - 10https://gerrit.wikimedia.org/r/654697 (owner: 10Dzahn)
[18:24:28] <wikibugs>	 10SRE, 10ops-eqiad: Please remove sdb from ms-be1022 - https://phabricator.wikimedia.org/T271512 (10wiki_willy) a:03Cmjohnson Hi @fgiunchedi, just wanted to confirm - since this server was recently refreshed last quarter, no need to replace the disk, right?  Thanks, Willy
[18:28:38] <wikibugs>	 10SRE, 10serviceops: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10Dzahn)
[18:37:52] <wikibugs>	 10SRE, 10serviceops: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10Dzahn)
[18:40:25] <wikibugs>	 10SRE, 10serviceops: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10Dzahn) The differences between the jessie and the stretch role are that the latter uses `etcdv3` vs `etcd` and additionally includes `zookeeper` profiles.  Additionally the old role has this code:   `   5...
[18:55:52] <wikibugs>	 10SRE, 10serviceops: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10elukey) @Dzahn I can help with the zookeeper migration, it should be doable one host at the time without too many issues, but it needs to be done with care. The work for etcd might be more complicated, but it...
[19:00:15] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:05:48] <wikibugs>	 (03PS1) 10CDanis: Send pages with user's email address, if available [software/klaxon] - 10https://gerrit.wikimedia.org/r/655117
[19:06:42] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Send pages with user's email address, if available [software/klaxon] - 10https://gerrit.wikimedia.org/r/655117 (owner: 10CDanis)
[19:07:13] <wikibugs>	 (03PS2) 10CDanis: Send pages with user's email address, if available [software/klaxon] - 10https://gerrit.wikimedia.org/r/655117
[19:26:20] <logmsgbot>	 !log andrew@deploy1001 Started deploy [horizon/deploy@7466703]: Horizon with a bunch of Buster patches
[19:26:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:55] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [horizon/deploy@7466703]: Horizon with a bunch of Buster patches (duration: 02m 35s)
[19:28:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:32:18] <wikibugs>	 (03PS1) 10Mforns: analytics:refinery:job:data_purge Activate netflow auto-purging [puppet] - 10https://gerrit.wikimedia.org/r/655120 (https://phabricator.wikimedia.org/T231339)
[19:45:54] <logmsgbot>	 !log andrew@deploy1001 Started deploy [striker/deploy@e4db843]: Striker deploy for T269004
[19:45:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:45:59] <stashbot>	 T269004: Upgrade labweb servers to buster - https://phabricator.wikimedia.org/T269004
[19:48:05] <logmsgbot>	 !log andrew@deploy1001 Finished deploy [striker/deploy@e4db843]: Striker deploy for T269004 (duration: 02m 11s)
[19:48:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:31] <wikibugs>	 10SRE, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Andrew)
[19:52:23] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review, 10cloud-services-team (Kanban): Upgrade labweb servers to buster - https://phabricator.wikimedia.org/T269004 (10Andrew) 05Open→03Resolved Getting Horizon on buster was a lot more trouble than I expected but this is done now.
[19:53:37] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[20:05:09] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[20:13:09] <wikibugs>	 10SRE, 10Traffic, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Stop advertising webmaster@wikimedia.org in apache configs - https://phabricator.wikimedia.org/T251005 (10Dzahn) 05Open→03Resolved a:03Dzahn This is done. At least I can confirm it's gone from the production puppet repo and also...
[20:16:19] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] delete the mongodb module [puppet] - 10https://gerrit.wikimedia.org/r/654923 (owner: 10Dzahn)
[20:16:30] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "thanks all!" [puppet] - 10https://gerrit.wikimedia.org/r/654923 (owner: 10Dzahn)
[20:17:36] <wikibugs>	 10SRE, 10observability, 10Performance-Team (Radar): Set up a statsv-like endpoint for Prometheus - https://phabricator.wikimedia.org/T180105 (10colewhite)
[20:18:43] <wikibugs>	 (03Abandoned) 10Dzahn: lvs: stop monitoring graphoid [puppet] - 10https://gerrit.wikimedia.org/r/654959 (https://phabricator.wikimedia.org/T242855) (owner: 10Dzahn)
[20:25:57] <wikibugs>	 (03PS1) 10Razzi: kafka-test: Remove rack B from kafka-test cluster [puppet] - 10https://gerrit.wikimedia.org/r/655126 (https://phabricator.wikimedia.org/T268074)
[20:27:11] <wikibugs>	 (03PS2) 10Razzi: kafka-test: Remove rack B from kafka-test cluster [puppet] - 10https://gerrit.wikimedia.org/r/655126 (https://phabricator.wikimedia.org/T268074)
[20:29:47] <wikibugs>	 (03CR) 10Razzi: [C: 03+2] kafka-test: Remove rack B from kafka-test cluster [puppet] - 10https://gerrit.wikimedia.org/r/655126 (https://phabricator.wikimedia.org/T268074) (owner: 10Razzi)
[20:34:37] <wikibugs>	 (03PS1) 10Andrew Bogott: toolserver-legacy: use an acme-chief cert [puppet] - 10https://gerrit.wikimedia.org/r/655127 (https://phabricator.wikimedia.org/T260835)
[20:35:25] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] toolserver-legacy: use an acme-chief cert [puppet] - 10https://gerrit.wikimedia.org/r/655127 (https://phabricator.wikimedia.org/T260835) (owner: 10Andrew Bogott)
[20:42:06] <wikibugs>	 10SRE: Create auto-populated LDAP group of those who have production shell access - https://phabricator.wikimedia.org/T271587 (10CDanis)
[20:53:33] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] dumps: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/654911 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn)
[20:57:04] <wikibugs>	 (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1003/27398/" [puppet] - 10https://gerrit.wikimedia.org/r/654911 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn)
[20:58:18] <wikibugs>	 (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/27399/" [puppet] - 10https://gerrit.wikimedia.org/r/654911 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn)
[21:02:38] <wikibugs>	 (03PS1) 10Andrew Bogott: Move toolserver.org ip to 185.15.56.245 [dns] - 10https://gerrit.wikimedia.org/r/655130 (https://phabricator.wikimedia.org/T260835)
[21:03:54] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Move toolserver.org ip to 185.15.56.245 [dns] - 10https://gerrit.wikimedia.org/r/655130 (https://phabricator.wikimedia.org/T260835) (owner: 10Andrew Bogott)
[21:08:16] <wikibugs>	 10SRE, 10observability, 10Performance-Team (Radar): Set up a statsv-like endpoint for Prometheus - https://phabricator.wikimedia.org/T180105 (10colewhite)
[21:14:37] <wikibugs>	 10SRE, 10serviceops: improve mw maintenance server switch over and discovery names - https://phabricator.wikimedia.org/T265936 (10Dzahn) The second part, having the inactive warning in MOTD is already done .. I see now that I am looking at it again:   ` 115     # T199124 116     $motd_ensure = $ensure ? { 117...
[21:17:39] <wikibugs>	 10SRE, 10serviceops: improve mw maintenance server switch over and discovery names - https://phabricator.wikimedia.org/T265936 (10Dzahn)
[21:31:07] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cloudgw2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271590 (10RobH)
[21:31:23] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cloudgw2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271590 (10RobH)
[21:33:14] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install cloudgw2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271590 (10RobH)
[21:34:01] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): (Need By: TBD) rack/setup/install cloudgw2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271590 (10RobH)
[21:38:09] <wikibugs>	 10SRE, 10serviceops: improve mw maintenance server switch over and discovery names - https://phabricator.wikimedia.org/T265936 (10Dzahn) After revisiting this today I think it can be split into 3 separate parts: (cc: @rlazarus @Joe   a) allow multiple maintenance servers per DC without enabling jobs on more than...
[21:46:53] <wikibugs>	 (03PS1) 10CDanis: tweak front page to make it clearer "Wake up" goes to a form [software/klaxon] - 10https://gerrit.wikimedia.org/r/655132
[21:47:17] <wikibugs>	 10SRE, 10conftool, 10serviceops, 10Datacenter-Switchover: Disable maintenance scripts via conftool - https://phabricator.wikimedia.org/T266717 (10Dzahn) T265936 is partially a duplicate of this ticket but also adds the part that maintenance servers are web hosts for https://noc.wikimedia.org.  Last switch-...
[21:47:23] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] tweak front page to make it clearer "Wake up" goes to a form [software/klaxon] - 10https://gerrit.wikimedia.org/r/655132 (owner: 10CDanis)
[21:47:46] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] tweak front page to make it clearer "Wake up" goes to a form [software/klaxon] - 10https://gerrit.wikimedia.org/r/655132 (owner: 10CDanis)
[21:48:52] <wikibugs>	 10SRE, 10serviceops: make noc.wikimedia.org active/active (was: improve mw maintenance server switch over and discovery names) - https://phabricator.wikimedia.org/T265936 (10Dzahn)
[21:49:09] <wikibugs>	 (03Merged) 10jenkins-bot: tweak front page to make it clearer "Wake up" goes to a form [software/klaxon] - 10https://gerrit.wikimedia.org/r/655132 (owner: 10CDanis)
[21:57:18] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] "Proposed a structural change below, but LGTM if you choose not to make it." (031 comment) [software/klaxon] - 10https://gerrit.wikimedia.org/r/655117 (owner: 10CDanis)
[22:03:15] <icinga-wm>	 PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - failed to verify www.toolserver.org against Snakeoil cert:Certificate Snakeoil cert valid until 2021-01-11 20:13:13 +0000 (expires in 2 days) https://phabricator.wikimedia.org/tag/toolforge/
[22:03:46] <mutante>	 ^ well.. there was an IP change earlier
[22:06:39] <wikibugs>	 (03CR) 10Dzahn: "22:03 <+icinga-wm> PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - failed to verify www.toolserver.org agains" [dns] - 10https://gerrit.wikimedia.org/r/655130 (https://phabricator.wikimedia.org/T260835) (owner: 10Andrew Bogott)
[22:08:17] <wikibugs>	 (03CR) 10Andrew Bogott: "> Patch Set 1:" [dns] - 10https://gerrit.wikimedia.org/r/655130 (https://phabricator.wikimedia.org/T260835) (owner: 10Andrew Bogott)
[22:09:02] <Majavah>	 "www.toolserver.org uses an invalid security certificate."
[22:09:16] <wikibugs>	 (03CR) 10Dzahn: "> acme-chief is giving me a throttle alert so failing to generate valid certs. I'm hopeful that it will refresh when the hourly throttle r" [dns] - 10https://gerrit.wikimedia.org/r/655130 (https://phabricator.wikimedia.org/T260835) (owner: 10Andrew Bogott)
[22:09:18] <icinga-wm>	 ACKNOWLEDGEMENT - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - failed to verify www.toolserver.org against Snakeoil cert:Certificate Snakeoil cert valid until 2021-01-11 20:13:13 +0000 (expires in 2 days) andrew bogott Waiting for the LE throttle to renew so I can get a valid cert https://phabricator.wikimedia.org/tag/toolforge/
[22:09:58] <andrewbogott>	 Majavah or mutante if you're interested in helping me troubleshoot acme-chief that'd be great; I'm fairly sure that the issue is just that the throttle interval needs to reset
[22:10:08] <tabbycat>	 " I'm hopeful that" - No, forget about it :P
[22:10:26] <Majavah>	 I have no idea how acme-chief works but let me know if I can be helpful in any way
[22:10:34] <tabbycat>	 software always finds a way to behave contrary to your wishes
[22:10:59] <mutante>	 andrewbogott: there are places in puppet where you tell acme-chief which host is allowed to get which cert
[22:11:10] <mutante>	 did you make a change where it allows that new one?
[22:11:25] <andrewbogott>	 mutante: that doesn't affect LE though does it?  Just puppet?
[22:11:36] <tabbycat>	 Majavah: I took a look at acme-chief once for some beta-cluster issue and found it extremely complex
[22:11:42] <andrewbogott>	 https://www.irccloud.com/pastebin/u42DngBf/
[22:11:43] <mutante>	 andrewbogott: it affects if the acme-chief gives the host a cert 
[22:11:52] <Majavah>	 tabbycat: I have no shell access anywhere
[22:12:08] <andrewbogott>	 mutante: ok, that's possible then...
[22:12:58] <andrewbogott>	 mutante: are you talking about something other than the 'authorized regexes' section here?
[22:12:59] <mutante>	 andrewbogott: finding the example...
[22:13:01] <andrewbogott>	 https://www.irccloud.com/pastebin/3ksIsdyL/
[22:13:57] <mutante>	 well, it's usually authorized_hosts 
[22:14:09] <mutante>	 hieradata/role/common/acme_chief.yaml
[22:14:30] <mutante>	 but yea, looks like the same thing I meant
[22:14:50] <mutante>	 andrewbogott: yes, that
[22:15:20] <andrewbogott>	 so the issue I have isn't the host getting the cert from acme-chief (that's working, hence the snakeoil)
[22:15:24] <mutante>	 is our new host not covered by the regex?
[22:15:33] <andrewbogott>	 the issue is that the acme-chief host itself is not getting a 'real' cert from LE
[22:15:36] <andrewbogott>	 (I think)
[22:15:41] <mutante>	 oh, ok
[22:16:06] <andrewbogott>	 And LE is returning with the throttle warning I pasted above:   There were too many requests of a given type :: Error creating new order :: too many failed authorizations recently: see https://letsencrypt.org/docs/rate-limits/
[22:16:27] <mutante>	 I have no idea if he is on and in what timezone.. but if you can find Valentin he knows best
[22:16:27] <andrewbogott>	 I assume because I overwhelmed it at some incremental step
[22:16:55] <andrewbogott>	 In theory that's an hourly throttle but I don't know what they mean by 'hourly'.  I was expecting it to work when we rolled over to :00 but it didn't help
[22:18:14] <mutante>	 then probably "within the last 60 minutes" hmmm
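The difference between "resets at :00" and "within the last 60 minutes" can be sketched with a rolling-window counter. This is only an illustration of the second reading. The class name, the limit of 5, and the timestamps are assumptions made for the example, and nothing here claims to be how Let's Encrypt implements its limits.

```python
from collections import deque


class RollingWindowLimit:
    """Allow at most max_events within any window_seconds-long span."""

    def __init__(self, max_events, window_seconds):
        self.max_events = max_events
        self.window = window_seconds
        self.events = deque()

    def allow(self, now):
        # Forget events that have aged out of the rolling window.
        while self.events and now - self.events[0] >= self.window:
            self.events.popleft()
        if len(self.events) >= self.max_events:
            return False
        self.events.append(now)
        return True


def hhmm(hours, minutes):
    """Seconds since midnight, to keep the example readable."""
    return hours * 3600 + minutes * 60


# Hypothetical numbers: 5 attempts per rolling hour.
limit = RollingWindowLimit(max_events=5, window_seconds=3600)

# Five attempts between 21:50 and 21:58 use up the budget.
burst = [limit.allow(hhmm(21, m)) for m in (50, 52, 54, 56, 58)]

# Rolling over to 22:00 does not help: the burst is only minutes old.
at_top_of_hour = limit.allow(hhmm(22, 0))

# One full hour after the first attempt, a slot frees up again.
an_hour_later = limit.allow(hhmm(22, 50))
```

Under this reading, a retry right at the top of the hour is still refused, which would be consistent with what andrewbogott observed.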
[22:18:53] <icinga-wm>	 PROBLEM - Exim SMTP on mx2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[22:19:01] <mutante>	 is the old instance still running?
[22:19:22] <andrewbogott>	 yes but it was using ::integrated which I assume is ignored by LE
[22:19:30] <mutante>	 it might be trying to do the auth and still end up at the old IP ..due to DNS change
[22:19:37] <andrewbogott>	 could be
[22:19:56] <mutante>	 it would be possible to manually copy the cert/key 
[22:19:56] <andrewbogott>	 puppet is stopped on the old host...
[22:20:25] <icinga-wm>	 RECOVERY - Exim SMTP on mx2001 is OK: OK - Certificate mx1001.wikimedia.org will expire on Tue 02 Mar 2021 09:00:58 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting
[22:20:37] <mutante>	 eh..good to see that recovery there
[22:21:12] <mutante>	 andrewbogott: how long has it been? just over an hour since DNS change, right
[22:21:26] <andrewbogott>	 something like that
[22:21:39] <mutante>	 I see now why you said you are hopeful it will just fix itself. could be . yes
[22:22:08] <mutante>	 until then I'd either just wait or copy the cert/key from the old host over
[22:22:22] <mutante>	 and/or ping traffic
[22:22:36] <andrewbogott>	 Yeah, I think I'm going to wait and see if it recovers.  If still broken in an hour I'll copy over the cert and freeze everything until Monday
[22:22:48] <mutante>	 toolserver.org won't have that many hits 
[22:23:07] <mutante>	 I just assume that though
[22:23:09] <mutante>	 ack
[22:23:19] <mutante>	 sounds good
[22:23:41] <Platonides>	 how is it authorizing the host?
[22:23:55] <Platonides>	 dns, http: request to .well-known...  ?
[22:24:24] <Platonides>	 it complains about *failed* authorizations
[22:24:39] <Platonides>	 getting to the wrong ip would be an explanation, yes
[22:25:21] <mutante>	 https://gerrit.wikimedia.org/g/operations/software/acme-chief
[22:25:35] <mutante>	 Platonides: i think that. yes. that's what I assume as well
[22:26:01] <andrewbogott>	 ip shouldn't matter to LE though
[22:26:09] <andrewbogott>	 all of the interaction there is via dns records
[22:26:29] <Platonides>	 so it uses dns, not an http request
[22:26:40] <mutante>	 true, DNS challenges
[22:26:42] <Platonides>	 (sorry, I don't remember the actual names of the methods LE can use)
[22:26:47] <wikibugs>	 (03CR) 10Daimona Eaytoy: "Should be ready now, I think?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647116 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester)
[22:27:19] <andrewbogott>	 Platonides: sort of — the client requests the cert from LE via http (I think) and then LE responds with a challenge, client writes the challenge to DNS, LE confirms the challenge is present
[22:27:21] <mutante>	 both dns and http challenge appear in the config example
[22:27:48] <andrewbogott>	 yeah — I can see a dns challenge response in designate, so it's getting that far...
[22:27:55] <andrewbogott>	 suspiciously far actually
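For reference, the "client writes the challenge to DNS" step described above corresponds to the dns-01 challenge in RFC 8555: the client publishes a TXT record under `_acme-challenge.<domain>` whose value is the base64url-encoded SHA-256 digest of the key authorization. A minimal sketch of just that computation follows. The token and thumbprint are placeholder strings, and this is not taken from acme-chief's code.

```python
import base64
import hashlib


def b64url(data: bytes) -> str:
    """Base64url without padding, as ACME uses throughout."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def dns01_txt_record(domain: str, token: str, account_thumbprint: str):
    """Return (record name, TXT value) for a dns-01 challenge.

    key authorization = token + "." + base64url(JWK thumbprint);
    the thumbprint is passed in already encoded here.
    """
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return f"_acme-challenge.{domain}", b64url(digest)


# Placeholder inputs: a real token comes from the CA, and a real
# thumbprint is derived from the ACME account key.
name, value = dns01_txt_record(
    "www.toolserver.org", "example-token", "example-thumbprint"
)
```

The CA then looks up that TXT record itself, which fits the point that the web server's IP does not enter into a DNS challenge.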
[22:28:26] <Platonides>	 yes, I was meaning the type of challenge used
[22:28:56] <Platonides>	 so, it looks like the replies weren't set in the dns? :/
[22:29:37] <mutante>	 Wmflib::Ensure $http_challenge_support = absent,
[22:30:07] <mutante>	 looks like the default is used..which is DNS 
[22:30:23] <mutante>	 acme_chief::server has that ^
[22:30:38] <andrewbogott>	 hm, no dice.  I got a whole lot of "The request message was malformed :: Unable to update challenge :: authorization must be pending" and then finally ran out the throttle again
[22:30:43] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:32:23] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:35:16] <mutante>	 andrewbogott: I am wondering though if we should not go back to the authorized regex..because:
[22:35:23] <mutante>	 host 185.15.56.245
[22:35:23] <mutante>	 245.56.15.185.in-addr.arpa domain name pointer relay.toolserver.org.
[22:35:26] <mutante>	 245.56.15.185.in-addr.arpa domain name pointer instance-toolserver-proxy-01.tools.wmflabs.org.
[22:35:39] <mutante>	 vs   ^toolserver-proxy-[0-9]+\.tools\.eqiad1\.wikimedia\.cloud$
[22:35:45] <mutante>	 that doesn't match
[22:36:02] <mutante>	 so how would it work, even if maybe there is another issue
[22:36:06] <mutante>	 and it's auth-related
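The mismatch mutante points out can be checked directly. The sketch below tests the quoted regex against the two PTR names from the `host` output. The third name is a hypothetical internal hostname, added only to show what the pattern would accept.

```python
import re

# Pattern quoted in the channel.
AUTHORIZED = re.compile(
    r"^toolserver-proxy-[0-9]+\.tools\.eqiad1\.wikimedia\.cloud$"
)

# PTR names from the `host 185.15.56.245` output (trailing dot removed).
ptr_names = [
    "relay.toolserver.org",
    "instance-toolserver-proxy-01.tools.wmflabs.org",
]

# Hypothetical internal hostname, not taken from the log.
internal_name = "toolserver-proxy-01.tools.eqiad1.wikimedia.cloud"


def is_authorized(name: str) -> bool:
    """True if the name matches the authorized-hosts pattern."""
    return AUTHORIZED.match(name) is not None


results = {name: is_authorized(name) for name in ptr_names + [internal_name]}
for name, ok in results.items():
    print(f"{name}: {'match' if ok else 'no match'}")
```

Neither public PTR name matches, so the pattern can only apply if the check is made against an internal name of that shape rather than against public DNS.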
[22:36:17] <icinga-wm>	 RECOVERY - HTTPS-toolserver on www.toolserver.org is OK: SSL OK - Certificate stable.toolserver.org valid until 2021-04-08 20:04:59 +0000 (expires in 89 days) https://phabricator.wikimedia.org/tag/toolforge/
[22:36:21] <mutante>	 ooh :)
[22:36:37] <andrewbogott>	 That's me swapping in the old certs
[22:36:45] <mutante>	 I was about to say it's wmflabs.org vs wikimedia.cloud
[22:36:53] <mutante>	 maybe it is
[22:36:57] <andrewbogott>	 mutante: I'm pretty sure that that regex is about what instances are allowed to download the certs from the acme-cheif server
[22:37:03] <andrewbogott>	 *chief
[22:37:13] <andrewbogott>	 and that should use the internal hostname/puppet cert
[22:37:19] <andrewbogott>	 not the public DNS name
[22:38:00] <mutante>	 hm, ok
[22:38:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install mwlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T267271 (10wiki_willy) Netbox error associated with this install:  https://netbox.wikimedia.org/extras/reports/accounting.Accounting/  Device with s/n GYBR673 (N/A) not present in Netbox
[22:38:42] <mutante>	 since it's fixed.. and 89 days left.. can be tested again after weekend, right
[22:39:29] <mutante>	 once the throttling part is gone there should be more clarity
[22:41:19] <Majavah>	 it's still an invalid cert for me
[22:42:38] <andrewbogott>	 Majavah: what about now?
[22:42:48] <Majavah>	 working
[22:42:54] <andrewbogott>	 ok
[22:43:00] <andrewbogott>	 going to leave this alone for now then
[22:43:02] <andrewbogott>	 thanks all
[22:48:25] <mutante>	 alright, good weekend
[22:49:49] * andrewbogott waves
[22:51:52] <wikibugs>	 (03PS1) 10Dzahn: add discovery-geo-resources for noc [dns] - 10https://gerrit.wikimedia.org/r/655168 (https://phabricator.wikimedia.org/T265936)
[22:52:36] <wikibugs>	 (03CR) 10Legoktm: mailman3: Start mailman3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608163 (https://phabricator.wikimedia.org/T256536) (owner: 10Ladsgroup)
[23:14:43] <wikibugs>	 10SRE, 10ops-eqiad: Interface errors between pfw3a-eqiad and fasw-c1a-eqiad - https://phabricator.wikimedia.org/T271295 (10Jgreen) >>! In T271295#6729467, @Cmjohnson wrote: > @Jgreen We need to schedule this,  How does Monday at 10am local work for you?   @Cmjohnson I'd like to run this by fr-tech in our Monda...
[23:18:12] <wikibugs>	 (03PS1) 10Dzahn: deployment::rsync: replace cron with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/655172 (https://phabricator.wikimedia.org/T265138)
[23:20:18] <wikibugs>	 (03CR) 10Dzahn: "This is another thing that we have plenty of and is nice to try: convert a cron to a timer" [puppet] - 10https://gerrit.wikimedia.org/r/655172 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn)
[23:21:52] <wikibugs>	 (03CR) 10Dzahn: "@legoktm also see topic branches like https://gerrit.wikimedia.org/r/q/topic:%22cron-timer%22+(status:open%20OR%20status:merged)" [puppet] - 10https://gerrit.wikimedia.org/r/655172 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn)
[23:22:14] <wikibugs>	 (03CR) 10Legoktm: [C: 04-1] [WIP] docker_registry_ha: Add a script to generate a static HTML homepage (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm)
[23:22:22] <wikibugs>	 (03PS4) 10Legoktm: [WIP] docker_registry_ha: Add a script to generate a static HTML homepage [puppet] - 10https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696)
[23:27:49] <wikibugs>	 (03CR) 10Dzahn: "before merging this I have to double check the certificate situation. ensure envoy cert exists with the right name" [puppet] - 10https://gerrit.wikimedia.org/r/654351 (https://phabricator.wikimedia.org/T266509) (owner: 10Dzahn)
[23:29:27] <wikibugs>	 (03PS7) 10Legoktm: mailman3: Start mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/608163 (https://phabricator.wikimedia.org/T256536) (owner: 10Ladsgroup)
[23:30:08] <wikibugs>	 (03PS1) 10Bstorm: wikireplicas: enable standby behavior on multiinstance proxys [puppet] - 10https://gerrit.wikimedia.org/r/655174 (https://phabricator.wikimedia.org/T271476)
[23:31:35] <wikibugs>	 (03PS2) 10Bstorm: wikireplicas: enable standby behavior on multiinstance proxies [puppet] - 10https://gerrit.wikimedia.org/r/655174 (https://phabricator.wikimedia.org/T271476)
[23:36:23] <wikibugs>	 (03PS3) 10Bstorm: wikireplicas: enable standby behavior on multiinstance proxies [puppet] - 10https://gerrit.wikimedia.org/r/655174 (https://phabricator.wikimedia.org/T271476)
[23:39:12] <wikibugs>	 (03CR) 10Bstorm: "I am unfortunately not surprised that it is a noop without puppetdb responding in CI. https://integration.wikimedia.org/ci/job/operations-" [puppet] - 10https://gerrit.wikimedia.org/r/655174 (https://phabricator.wikimedia.org/T271476) (owner: 10Bstorm)