[00:14:44] <icinga-wm>	 PROBLEM - Host restbase2009 is DOWN: PING CRITICAL - Packet loss = 100%
[00:16:18] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:16:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:16:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:16:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:16:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:16:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:16:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:17:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:17:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:17:28] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[00:17:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:17:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:17:34] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:17:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:18:02] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:18:18] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:18:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:18:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:18:42] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:18:44] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:19:02] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:19:08] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:19:16] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:19:22] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:19:24] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:19:32] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:20:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:21:00] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[00:21:04] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:21:22] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:28:27] <wikibugs>	 Operations, LDAP-Access-Requests, Patch-For-Review: LDAP requests for Ahmon Dancy: wmf, releng, ciadmin - https://phabricator.wikimedia.org/T256658 (thcipriani) Open→Resolved @dancy confirmed he was able to get into logstash. Thanks all!
[00:39:50] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 34439384 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:41:42] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 31368 and 26 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:03:26] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:05:16] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:22:06] <icinga-wm>	 PROBLEM - puppet last run on labtestpuppetmaster2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[02:05:55] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.35.0-wmf.39 [core] (wmf/1.35.0-wmf.39) - https://gerrit.wikimedia.org/r/608499
[02:06:57] <wikibugs>	 (PS2) DannyS712: Branch commit for wmf/1.35.0-wmf.39 [core] (wmf/1.35.0-wmf.39) - https://gerrit.wikimedia.org/r/608499 (https://phabricator.wikimedia.org/T254176) (owner: TrainBranchBot)
[03:57:00] <wikibugs>	 (CR) Thcipriani: [C: +1] Branch commit for wmf/1.35.0-wmf.39 [core] (wmf/1.35.0-wmf.39) - https://gerrit.wikimedia.org/r/608499 (https://phabricator.wikimedia.org/T254176) (owner: TrainBranchBot)
[04:44:22] <wikibugs>	 (PS1) Marostegui: mariadb: Move db1080 from s1 to m2 [puppet] - https://gerrit.wikimedia.org/r/608508 (https://phabricator.wikimedia.org/T253217)
[04:44:53] <wikibugs>	 (CR) jerkins-bot: [V: -1] mariadb: Move db1080 from s1 to m2 [puppet] - https://gerrit.wikimedia.org/r/608508 (https://phabricator.wikimedia.org/T253217) (owner: Marostegui)
[04:46:57] <wikibugs>	 Operations, DBA, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Marostegui) >>! In T256538#6265626, @herron wrote: >>>! In T256538#6262958, @Marostegui wrote: >> @herron any idea how big these DBs can be and how many writes we'd be expecting? >> Wh...
[04:47:57] <wikibugs>	 (CR) Marostegui: [V: +2 C: +2] "The -1 is a known issue that requires an entire refactoring for misc hosts." [puppet] - https://gerrit.wikimedia.org/r/608508 (https://phabricator.wikimedia.org/T253217) (owner: Marostegui)
[04:50:34] <wikibugs>	 (CR) Ayounsi: "TIL, you can replace /edit with /preview at the end of the Gdoc URL to get a read only version." [puppet] - https://gerrit.wikimedia.org/r/608490 (owner: CDanis)
[04:56:29] <marostegui>	 !log Remove pl_from from db1096:3316 and db1098:3316 - T256684
[04:56:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:56:34] <stashbot>	 T256684: pl_from index still lingers in random hosts - https://phabricator.wikimedia.org/T256684
[04:57:18] <logmsgbot>	 !log jhuneidi@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
[04:57:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:57:59] <marostegui>	 !log remove pl_from index from db1141, db1121, db1148 - T256684
[04:58:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:12:15] <longma>	 I'm getting this error when doing an image pull after trying to deploy to staging in kubernetes `x509: certificate has expired or is not yet valid`.  What should I do?
[05:13:45] <marostegui>	 !log Deploy schema change on s8 codfw - T256680
[05:13:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:50] <stashbot>	 T256680: imagelinks has index mismatch on s8 - https://phabricator.wikimedia.org/T256680
[05:45:06] <wikibugs>	 (PS1) Jeena Huneidi: Revert "blubberoid: Update to latest image" [deployment-charts] - https://gerrit.wikimedia.org/r/608468
[05:47:05] <wikibugs>	 (CR) Jeena Huneidi: [C: +2] Revert "blubberoid: Update to latest image" [deployment-charts] - https://gerrit.wikimedia.org/r/608468 (owner: Jeena Huneidi)
[05:48:03] <wikibugs>	 (Merged) jenkins-bot: Revert "blubberoid: Update to latest image" [deployment-charts] - https://gerrit.wikimedia.org/r/608468 (owner: Jeena Huneidi)
[05:51:42] <logmsgbot>	 !log jhuneidi@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
[05:51:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:29] <icinga-wm>	 PROBLEM - Host db1097 is DOWN: PING CRITICAL - Packet loss = 100%
[06:18:19] <icinga-wm>	 PROBLEM - MariaDB Replica IO: m1 on db1117 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1097.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1097.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[06:18:27] <icinga-wm>	 PROBLEM - MariaDB Replica IO: m1 on db2132 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1097.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1097.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[06:18:51] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:19:10] <marostegui>	 shit
[06:19:15] <marostegui>	 that's m1 master
[06:19:17] <marostegui>	 checking
[06:19:41] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[06:19:41] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[06:20:43] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:23:53] <icinga-wm>	 RECOVERY - Host db1097 is UP: PING WARNING - Packet loss = 50%, RTA = 0.25 ms
[06:25:23] <marostegui>	 host back and proxies reloaded
[06:25:27] <marostegui>	 looks like HW issue
[06:25:30] <marostegui>	 task being created
[06:25:37] <icinga-wm>	 RECOVERY - MariaDB Replica IO: m1 on db1117 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[06:25:47] <icinga-wm>	 RECOVERY - MariaDB Replica IO: m1 on db2132 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[06:26:06] <marostegui>	 we need to restart etherpad from what I can see
[06:26:58] <wikibugs>	 (CR) Jcrespo: "Thanks for working on this. I have a few questions, as seen below." (8 comments) [software/transferpy] - https://gerrit.wikimedia.org/r/607028 (https://phabricator.wikimedia.org/T255999) (owner: Privacybatm)
[06:26:59] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[06:27:00] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[06:27:21] <marostegui>	 etherpad is back
[06:27:44] <marostegui>	 It is back but super slow 
[06:28:05] <marostegui>	 Looks better now
[06:28:13] <jynus>	 how's the db?
[06:28:43] <marostegui>	 ?
[06:29:05] <jynus>	 is the db up or down?
[06:29:12] <marostegui>	 it is up
[06:29:27] <marostegui>	 otherwise etherpad would still be down
[06:29:28] <jynus>	 should we reload etherpad?
[06:29:35] <marostegui>	 I did already, check above
[06:30:07] <icinga-wm>	 PROBLEM - ores on ores2009 is CRITICAL: connect to address 10.192.48.90 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:31:19] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:31:45] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:31:47] <jynus>	 sorry, didn't see it on log
[06:31:54] <wikibugs>	 Operations, DBA: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717 (Marostegui)
[06:33:22] <wikibugs>	 Operations, DBA: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717 (Marostegui)
[06:33:48] <wikibugs>	 Operations, DBA: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717 (Marostegui) p:Triage→High I was in process of moving db1080 to m2, but I will move it to m1 instead so we can replace and decommission this host.
[06:35:33] <icinga-wm>	 RECOVERY - ores on ores2009 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.088 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:39:05] <wikibugs>	 (PS1) Marostegui: site.pp: Move db1080 to m1 instead of m2 [puppet] - https://gerrit.wikimedia.org/r/608543 (https://phabricator.wikimedia.org/T256717)
[06:40:17] <wikibugs>	 (PS2) Marostegui: site.pp: Move db1080 to m1 instead of m2 [puppet] - https://gerrit.wikimedia.org/r/608543 (https://phabricator.wikimedia.org/T256717)
[06:42:33] <wikibugs>	 (CR) Marostegui: [C: +2] site.pp: Move db1080 to m1 instead of m2 [puppet] - https://gerrit.wikimedia.org/r/608543 (https://phabricator.wikimedia.org/T256717) (owner: Marostegui)
[07:09:57] <wikibugs>	 (PS1) Muehlenhoff: Extend access for mayakpwiki [puppet] - https://gerrit.wikimedia.org/r/608546
[07:10:11] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 53 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:14:23] <wikibugs>	 (CR) Muehlenhoff: [V: +2 C: +2] Extend access for mayakpwiki [puppet] - https://gerrit.wikimedia.org/r/608546 (owner: Muehlenhoff)
[07:15:57] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:26:08] <wikibugs>	 (CR) Hashar: "Thank you, I have purged the php related packages from the releases* hosts:" [puppet] - https://gerrit.wikimedia.org/r/608452 (https://phabricator.wikimedia.org/T256164) (owner: Hashar)
[07:33:44] <wikibugs>	 Operations: cas-icinga intermittant failures - https://phabricator.wikimedia.org/T256720 (Kormat)
[07:38:09] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[07:42:52] <vgutierrez>	 !log reboot cp3053 - T256632
[07:42:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:57] <stashbot>	 T256632: cp3053 nvme0 issues - https://phabricator.wikimedia.org/T256632
[07:46:56] <DannyS712>	 Can someone please check in logstash if there have been any deprecation alerts for the use of `PageArchive::getRevision` since wmf.37 ?
[07:47:04] <hashar>	 ^^ a spike of mediawiki/core changes
[07:48:09] <icinga-wm>	 ACKNOWLEDGEMENT - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] amusso spike of mediawiki/core changes https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[07:48:11] <hashar>	 DannyS712: yes i can
[07:48:29] <DannyS712>	 Thanks - I hard deprecated it, but just found another use in non-deprecated code
[07:48:40] <hashar>	 if I find out how to search for them hehe
[07:49:49] <hashar>	 DannyS712: is there any task I can copy paste my findings?
[07:50:01] <DannyS712>	 https://phabricator.wikimedia.org/T249982
[07:50:02] <hashar>	 I can share the rate over the last 7 days
[07:52:38] <hashar>	 DannyS712: seems there is only Special:Undelete
[07:53:43] <DannyS712>	 Yes, thats the use I found; for some reason I couldn't find the deprecation alerts on logstash-beta, so I wasn't sure if I was reading it right
[07:54:23] <hashar>	 DannyS712: https://phabricator.wikimedia.org/T249982#6266618
[07:54:40] <hashar>	 maybe logstash-beta misses the proper log configuration :-\
[07:55:00] <DannyS712>	 Yeah, https://logstash-beta.wmflabs.org/app/kibana#/dashboard/mediawiki-deprecated says no results found
[07:55:45] <DannyS712>	 Would you be willing to +2 https://gerrit.wikimedia.org/r/c/mediawiki/core/+/608470 so I can backport it?
[07:56:35] <hashar>	 I know absolutely nothing about mediawiki nowadays :-\
[07:56:53] <wikibugs>	 (PS3) Kormat: install_server: Remove no-srv-format.cfg [puppet] - https://gerrit.wikimedia.org/r/608306 (https://phabricator.wikimedia.org/T251768)
[07:57:11] <DannyS712>	 You can compare with https://gerrit.wikimedia.org/r/c/605689 to see that the change should be correct, but I understand. Is there a task for logstash-beta not including deprecation alerts?
[07:57:19] <hashar>	 DannyS712: then I am handling the train this week, and will be more than happy to deploy the hotfix at any time
[07:57:40] <wikibugs>	 (CR) Kormat: "> We could add a stub partman config like "manual-setup.cfg" which only has a comment that a server with this kind of recipe gets installe" [puppet] - https://gerrit.wikimedia.org/r/608306 (https://phabricator.wikimedia.org/T251768) (owner: Kormat)
[07:57:40] <hashar>	 but I don't feel any confident  in approving a change to mediawiki ;D
[07:57:45] <DannyS712>	 okay
[07:57:51] <hashar>	 for logstash-beta I don't know, you might have been the first one to notice it
[07:58:07] <hashar>	 I guess the devil is probably inside the mediawiki-config,  or logstash on beta is broken somehow
[07:59:05] <wikibugs>	 Operations, Puppet, User-jbond: puppetise pupet server copy of the public ca.pem - https://phabricator.wikimedia.org/T256721 (jbond)
[07:59:18] <wikibugs>	 Operations, Puppet, User-jbond: puppetise pupet server copy of the public ca.pem - https://phabricator.wikimedia.org/T256721 (jbond) p:Triage→High
[07:59:32] <DannyS712>	 Filed T256722
[07:59:33] <stashbot>	 T256722: Logstash-beta doesn't include any deprecation notices - https://phabricator.wikimedia.org/T256722
[07:59:46] <wikibugs>	 (CR) Privacybatm: "Thank you for your review, working on a new patch set." (7 comments) [software/transferpy] - https://gerrit.wikimedia.org/r/607028 (https://phabricator.wikimedia.org/T255999) (owner: Privacybatm)
[08:01:59] <jbond42>	 !log disable puppet to restart puppetmasters front ends
[08:02:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:46] <DannyS712>	 Testing via eval.php to see what the wgMWLoggerDefaultSpi is set to
[08:05:15] <vgutierrez>	 !log powercycle cp3053 (unresponsive after reboot) - T256632
[08:05:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:19] <stashbot>	 T256632: cp3053 nvme0 issues - https://phabricator.wikimedia.org/T256632
[08:05:52] <marostegui>	 !log Stop MySQL on db1117:3322 to clone db1080 (this will trigger haproxy alerts) - T256717
[08:05:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:56] <stashbot>	 T256717: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717
[08:07:11] <icinga-wm>	 RECOVERY - puppet last run on labtestpuppetmaster2001 is OK: OK: Puppet is currently disabled (restart puppet master frontends - jbond), not alerting. Last run 12 hours ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:10:38] <hashar>	 !log 1.35.0-wmf.39 was branched at e169e3dabcb2217809fc41ba44b43a39ae1a678e T254176
[08:10:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:43] <stashbot>	 T254176: 1.35.0-wmf.39 deployment blockers - https://phabricator.wikimedia.org/T254176
[08:10:43] <wikibugs>	 (CR) Hashar: [C: +2] Branch commit for wmf/1.35.0-wmf.39 [core] (wmf/1.35.0-wmf.39) - https://gerrit.wikimedia.org/r/608499 (https://phabricator.wikimedia.org/T254176) (owner: TrainBranchBot)
[08:11:01] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:11:25] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:11:57] <marostegui>	 ^ expected
[08:15:40] <wikibugs>	 (CR) Jbond: [C: +2] run_ci_locally: use latests docker image [puppet] - https://gerrit.wikimedia.org/r/608269 (owner: Jbond)
[08:16:39] <icinga-wm>	 PROBLEM - puppet last run on labtestpuppetmaster2001 is CRITICAL: CRITICAL: Puppet last ran 12 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:21:40] <icinga-wm>	 RECOVERY - puppet last run on labtestpuppetmaster2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:22:11] <wikibugs>	 Operations, Traffic, Patch-For-Review, User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (ema) >>! In T256444#6264956, @elukey wrote: > There may be another solution, namely creating a new apt component to hold 1.4.x and deploy it selectively wh...
[08:23:53] <vgutierrez>	 !log repool cp3053 - T256632
[08:23:54] <wikibugs>	 (PS1) Kormat: cumin: Add db-role and db-section aliases [puppet] - https://gerrit.wikimedia.org/r/608558
[08:23:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:57] <stashbot>	 T256632: cp3053 nvme0 issues - https://phabricator.wikimedia.org/T256632
[08:25:31] <wikibugs>	 (PS1) Jcrespo: Revert "mariadb-backups: Move transferpy deployment to debian package" [puppet] - https://gerrit.wikimedia.org/r/608471
[08:26:07] <wikibugs>	 (CR) Marostegui: "For context: https://phabricator.wikimedia.org/P11705" [puppet] - https://gerrit.wikimedia.org/r/608471 (owner: Jcrespo)
[08:27:48] <wikibugs>	 (CR) Jcrespo: [C: +2] Revert "mariadb-backups: Move transferpy deployment to debian package" [puppet] - https://gerrit.wikimedia.org/r/608471 (owner: Jcrespo)
[08:27:54] <wikibugs>	 (CR) jerkins-bot: [V: -1] cumin: Add db-role and db-section aliases [puppet] - https://gerrit.wikimedia.org/r/608558 (owner: Kormat)
[08:29:41] <wikibugs>	 Operations, ops-esams, DC-Ops, Traffic: cp3053 nvme0 issues - https://phabricator.wikimedia.org/T256632 (Vgutierrez) Open→Stalled repooled after powercycling & issuing the following commands: ` /usr/sbin/nvme format /dev/nvme0n1 -l 2  echo ';' | /usr/sbin/sfdisk /dev/nvme0n1 `  I'll keep...
[08:31:12] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.35.0-wmf.39 [core] (wmf/1.35.0-wmf.39) - https://gerrit.wikimedia.org/r/608499 (https://phabricator.wikimedia.org/T254176) (owner: TrainBranchBot)
[08:31:30] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:38:16] <hashar>	 I am doing some basic train related tasks this morning
[08:38:46] <hashar>	 !log scap prep 1.35.0-wmf.39 # T254176
[08:38:56] <wikibugs>	 (PS2) Kormat: cumin: Add db-role and db-section aliases [puppet] - https://gerrit.wikimedia.org/r/608558
[08:39:45] <wikibugs>	 (CR) jerkins-bot: [V: -1] cumin: Add db-role and db-section aliases [puppet] - https://gerrit.wikimedia.org/r/608558 (owner: Kormat)
[08:40:49] <wikibugs>	 (PS3) Kormat: cumin: Add db-role and db-section aliases [puppet] - https://gerrit.wikimedia.org/r/608558
[08:42:14] <wikibugs>	 (CR) jerkins-bot: [V: -1] cumin: Add db-role and db-section aliases [puppet] - https://gerrit.wikimedia.org/r/608558 (owner: Kormat)
[08:44:50] <wikibugs>	 Operations, Traffic, Patch-For-Review, Sustainability (Incident Prevention): monitoring & alerting for purged - https://phabricator.wikimedia.org/T256446 (ema)
[08:45:14] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on cp3053 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp3053&var-datasource=esams+prometheus/ops
[08:45:20] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:45:25] <marostegui>	 ^ me
[08:48:06] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 1 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:50:02] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[08:50:08] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:50:22] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:51:27] <vgutierrez>	 !log rolling restart of codfw cp nodes after "re-formatting" nvme devices - T256655
[08:51:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:33] <stashbot>	 T256655: Current codfw caches have wrong NVME format - https://phabricator.wikimedia.org/T256655
[08:51:55] <hashar>	 !log Applied security patches to wmf/1.35.0-wmf.39 # T254176
[08:51:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:59] <stashbot>	 T254176: 1.35.0-wmf.39 deployment blockers - https://phabricator.wikimedia.org/T254176
[08:53:27] <logmsgbot>	 !log hashar@deploy1001 clean aborted: Pruned MediaWiki: 1.35.0-wmf.36 (duration: 00m 00s)
[08:53:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:59] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[08:54:00] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:54:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:05] <wikibugs>	 Operations, Traffic, Patch-For-Review: Current codfw caches have wrong NVME format - https://phabricator.wikimedia.org/T256655 (ops-monitoring-bot) Icinga downtime for 0:30:00 set by vgutierrez@cumin1001 on 2 host(s) and their services with reason: kernel upgrade ` cp[2027-2028].codfw.wmnet `
[08:56:05] <wikibugs>	 (PS1) Ema: purged: alert if local backlog grows past the given limits [puppet] - https://gerrit.wikimedia.org/r/608564 (https://phabricator.wikimedia.org/T256446)
[08:56:38] <wikibugs>	 (CR) Ema: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/608564 (https://phabricator.wikimedia.org/T256446) (owner: Ema)
[08:58:00] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 1 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:59:02] <wikibugs>	 (PS4) Kormat: cumin: Add db-role and db-section aliases [puppet] - https://gerrit.wikimedia.org/r/608558
[08:59:36] <wikibugs>	 (PS1) DannyS712: Remove another use of PageArchive::getRevision [core] (wmf/1.35.0-wmf.38) - https://gerrit.wikimedia.org/r/608472 (https://phabricator.wikimedia.org/T249982)
[09:00:02] <wikibugs>	 (PS1) DannyS712: Remove another use of PageArchive::getRevision [core] (wmf/1.35.0-wmf.39) - https://gerrit.wikimedia.org/r/608473 (https://phabricator.wikimedia.org/T249982)
[09:00:25] <wikibugs>	 (CR) Kormat: "Marostegui for the feedback on the idea, Jbond for feedback on the implementation." [puppet] - https://gerrit.wikimedia.org/r/608558 (owner: Kormat)
[09:00:56] <DannyS712>	 hashar the patch was merged; cherry picked to 38 (current deployed) and 39 (will be deployed later today) - if the 39 patch is merged soon it can just go out with the train, but the 38 would need to be deployed
[09:00:57] <wikibugs>	 (Abandoned) Addshore: AdHocLogging for ReplicaMasterAwareRecordIdsAcquirer [extensions/Wikibase] (wmf/1.35.0-wmf.37) - https://gerrit.wikimedia.org/r/606692 (https://phabricator.wikimedia.org/T255855) (owner: Addshore)
[09:02:42] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[09:04:30] <wikibugs>	 Operations, Prod-Kubernetes, serviceops, Kubernetes: redis for docker-registry should have maxmemory-policy set to allkeys-lru - https://phabricator.wikimedia.org/T256726 (JMeybohm)
[09:05:57] <wikibugs>	 (PS1) Jbond: puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - https://gerrit.wikimedia.org/r/608565
[09:07:10] <wikibugs>	 (CR) jerkins-bot: [V: -1] puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - https://gerrit.wikimedia.org/r/608565 (owner: Jbond)
[09:09:29] <hashar>	 DannyS712: we cut the .39 last night
[09:09:42] <hashar>	 but it is trivial to just approve it and refresh it on the deploy machine ;)
[09:10:22] <hashar>	 I guess we can include it in wmf.39 ?
[09:10:34] <hashar>	 and skip it from wmf.38 to avoid potential breakage
[09:10:56] <DannyS712>	 Yeah, that's why I wanted to make sure you saw it before deploying 39 began; since it hasn't caused any problems in the last couple weeks, backport might not be needed
[09:11:16] <hashar>	 k
[09:11:22] <hashar>	 I will +2 the wmf.39 one 
[09:12:11] <wikibugs>	 (CR) Hashar: "I will fetch it later today in order to have the patch included in the deployment today." [core] (wmf/1.35.0-wmf.39) - https://gerrit.wikimedia.org/r/608473 (https://phabricator.wikimedia.org/T249982) (owner: DannyS712)
[09:16:42] <wikibugs>	 Operations, CAS-SSO: cas-icinga intermittant failures - https://phabricator.wikimedia.org/T256720 (Peachey88)
[09:18:25] <wikibugs>	 Operations, CAS-SSO, User-jbond: cas-icinga intermittant failures - https://phabricator.wikimedia.org/T256720 (Peachey88)
[09:21:57] <logmsgbot>	 !log hashar@deploy1001 Pruned MediaWiki: 1.35.0-wmf.36 (duration: 28m 11s)
[09:21:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:57] <wikibugs>	 (PS1) Jcrespo: Revert "Revert "mariadb-backups: Move transferpy deployment to debian package"" [puppet] - https://gerrit.wikimedia.org/r/608475 (https://phabricator.wikimedia.org/T256725)
[09:23:08] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:23:26] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:27:38] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "Do you have an example of client using this file, so we can have it in the commit message for future reference?" [puppet] - https://gerrit.wikimedia.org/r/608068 (owner: Andrew Bogott)
[09:30:20] <wikibugs>	 (CR) Marostegui: "was this test? as in: the installer will stop at the partitioner but we can still run the partitioner manually and carry on with the insta" [puppet] - https://gerrit.wikimedia.org/r/608306 (https://phabricator.wikimedia.org/T251768) (owner: Kormat)
[09:30:31] <wikibugs>	 (CR) Marostegui: "*tested" [puppet] - https://gerrit.wikimedia.org/r/608306 (https://phabricator.wikimedia.org/T251768) (owner: Kormat)
[09:31:13] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [software/purged] - https://gerrit.wikimedia.org/r/608275 (owner: Ema)
[09:35:42] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[09:37:05] <wikibugs>	 (CR) Jcrespo: [C: +2] Revert "Revert "mariadb-backups: Move transferpy deployment to debian package"" [puppet] - https://gerrit.wikimedia.org/r/608475 (https://phabricator.wikimedia.org/T256725) (owner: Jcrespo)
[09:38:20] <wikibugs>	 (PS1) Joal: Add analytics data purge for pageview_actor_hourly [puppet] - https://gerrit.wikimedia.org/r/608568 (https://phabricator.wikimedia.org/T256417)
[09:40:17] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[09:40:19] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:40:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:23] <wikibugs>	 Operations, Traffic, Patch-For-Review: Current codfw caches have wrong NVME format - https://phabricator.wikimedia.org/T256655 (ops-monitoring-bot) Icinga downtime for 0:30:00 set by vgutierrez@cumin1001 on 2 host(s) and their services with reason: kernel upgrade ` cp[2029-2030].codfw.wmnet `
[09:42:46] <wikibugs>	 Operations, Prod-Kubernetes, serviceops, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[09:43:03] <wikibugs>	 (CR) JMeybohm: [V: +2 C: +2] Remove deprecated and unmaintained image: envoy-tls-local-proxy [docker-images/production-images] - https://gerrit.wikimedia.org/r/608277 (https://phabricator.wikimedia.org/T253396) (owner: JMeybohm)
[09:43:13] <wikibugs>	 (PS1) Jbond: cumin: testing [puppet] - https://gerrit.wikimedia.org/r/608571
[09:44:19] <wikibugs>	 (PS1) Awight: Set Status error if permission check returns false. [extensions/FileImporter] (wmf/1.35.0-wmf.38) - https://gerrit.wikimedia.org/r/608476 (https://phabricator.wikimedia.org/T256428)
[09:44:48] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good to me, but maybe let's test this in addition with an sretest* host before attempting to use it on a db* or backup host?" [puppet] - https://gerrit.wikimedia.org/r/608306 (https://phabricator.wikimedia.org/T251768) (owner: Kormat)
[09:45:46] <wikibugs>	 (PS4) Privacybatm: transferpy: Use logging package instead of print statements [software/transferpy] - https://gerrit.wikimedia.org/r/607028 (https://phabricator.wikimedia.org/T255999)
[09:47:11] <logmsgbot>	 !log hashar@deploy1001 Pruned MediaWiki: 1.35.0-wmf.37 (duration: 02m 20s)
[09:47:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:17] <wikibugs>	 (PS1) Awight: Embedded surveys are hidden when no element is available [extensions/QuickSurveys] (wmf/1.35.0-wmf.38) - https://gerrit.wikimedia.org/r/608477 (https://phabricator.wikimedia.org/T256627)
[09:48:45] <wikibugs>	 (PS2) Jbond: cumin: testing [puppet] - https://gerrit.wikimedia.org/r/608571
[09:50:41] <wikibugs>	 (PS3) Jbond: cumin: testing [puppet] - https://gerrit.wikimedia.org/r/608571
[09:53:00] <wikibugs>	 (PS1) Awight: Configure TeWü survey on dewiki (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/608478 (https://phabricator.wikimedia.org/T253112)
[09:55:43] <wikibugs>	 (PS4) Jbond: cumin: testing [puppet] - https://gerrit.wikimedia.org/r/608571
[09:56:38] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[10:01:02] <wikibugs>	 Operations, Prod-Kubernetes, serviceops, Kubernetes: redis for docker-registry should have maxmemory-policy set to allkeys-lru - https://phabricator.wikimedia.org/T256726 (JMeybohm) p:Triage→Low
[10:01:40] <wikibugs>	 (PS5) Jbond: cumin: testing [puppet] - https://gerrit.wikimedia.org/r/608571
[10:01:46] <wikibugs>	 (CR) Hashar: [C: +2] "Reviewed by Thiemo in master.  It is trivial enough I don't see a reason for waiting next week. Will update the code on the deployment ser" [core] (wmf/1.35.0-wmf.39) - https://gerrit.wikimedia.org/r/608473 (https://phabricator.wikimedia.org/T249982) (owner: DannyS712)
[10:02:06] <logmsgbot>	 !log volker-e@deploy1001 Started deploy [design/style-guide@e3fda83]: Deploy design/style-guide:
[10:02:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:13] <logmsgbot>	 !log volker-e@deploy1001 Finished deploy [design/style-guide@e3fda83]: Deploy design/style-guide:  (duration: 00m 07s)
[10:02:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:03] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[10:03:04] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:03:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:07] <wikibugs>	 (PS1) Marostegui: db1080: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/608580 (https://phabricator.wikimedia.org/T256717)
[10:03:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:12] <wikibugs>	 (PS6) Jbond: cumin: testing [puppet] - https://gerrit.wikimedia.org/r/608571
[10:03:20] <wikibugs>	 Operations, DBA, Patch-For-Review: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717 (Marostegui) db1080 is ready. Now we just need to schedule another m1 failover to promote db1080 to master.
[10:03:48] <wikibugs>	 (CR) Marostegui: [C: +2] db1080: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/608580 (https://phabricator.wikimedia.org/T256717) (owner: Marostegui)
[10:04:39] <vgutierrez>	 !log rolling restart of eqiad cache nodes to catch up on kernel upgrades
[10:04:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:18] <wikibugs>	 (PS7) Jbond: cumin: testing [puppet] - https://gerrit.wikimedia.org/r/608571
[10:06:20] <wikibugs>	 (CR) Kormat: "Marostegui wrote:" [puppet] - https://gerrit.wikimedia.org/r/608306 (https://phabricator.wikimedia.org/T251768) (owner: Kormat)
[10:06:23] <wikibugs>	 Operations, Traffic, Patch-For-Review, User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (elukey) >>! In T256444#6266680, @ema wrote: > Well, [[ https://github.com/edenhill/librdkafka/issues/2020 | upstream claims ]] that the new versions are AP...
[10:07:06] <wikibugs>	 (CR) Marostegui: [C: +1] "I didn't recall if this was tested, if it has been tested then +1 :)" [puppet] - https://gerrit.wikimedia.org/r/608306 (https://phabricator.wikimedia.org/T251768) (owner: Kormat)
[10:09:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1076', diff saved to https://phabricator.wikimedia.org/P11708 and previous config saved to /var/cache/conftool/dbconfig/20200630-100912-marostegui.json
[10:09:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:49] <marostegui>	 !log Deploy schema change on db1076
[10:09:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:25] <wikibugs>	 (PS8) Jbond: cumin: testing [puppet] - https://gerrit.wikimedia.org/r/608571
[10:11:27] <wikibugs>	 Operations: consider hybrid caching options for ssd+disk - https://phabricator.wikimedia.org/T88992 (Aklapper)
[10:12:08] <wikibugs>	 Operations, Patch-Needs-Improvement: puppet should try to mount all mountable swift filesystems - https://phabricator.wikimedia.org/T126574 (Aklapper)
[10:12:16] <wikibugs>	 (PS9) Jbond: cumin: testing [puppet] - https://gerrit.wikimedia.org/r/608571
[10:15:10] <wikibugs>	 (CR) Kormat: [C: +2] install_server: Remove no-srv-format.cfg [puppet] - https://gerrit.wikimedia.org/r/608306 (https://phabricator.wikimedia.org/T251768) (owner: Kormat)
[10:23:37] <wikibugs>	 Operations, DBA, User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (Kormat)
[10:24:38] <wikibugs>	 Operations, DBA, User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (Marostegui)
[10:25:12] <wikibugs>	 Operations, DBA, User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (Kormat) p:Triage→Medium
[10:25:29] <wikibugs>	 (Merged) jenkins-bot: Remove another use of PageArchive::getRevision [core] (wmf/1.35.0-wmf.39) - https://gerrit.wikimedia.org/r/608473 (https://phabricator.wikimedia.org/T249982) (owner: DannyS712)
[10:26:32] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:29:07] <wikibugs>	 Operations, LDAP-Access-Requests, WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde group - https://phabricator.wikimedia.org/T256201 (Tobi_WMDE_SW) >>! In T256201#6265822, @ssingh wrote: >  > @guergana.tzatchkova: Once the NDA is confirmed, the only other thing we will need is a confirmation from...
[10:30:12] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:30:39] <hashar>	 DannyS712: syncing your change for wmf.39 . It will be deployed on testwikis this afternoon
[10:30:54] <logmsgbot>	 !log hashar@deploy1001 Synchronized php-1.35.0-wmf.39/includes/specials/SpecialUndelete.php: Remove another use of PageArchive::getRevision - T249982 T254176 (duration: 00m 56s)
[10:30:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:00] <stashbot>	 T249982: PageArchive::getRevision uses timestamp, suggested replacement uses rev id - https://phabricator.wikimedia.org/T249982
[10:31:00] <stashbot>	 T254176: 1.35.0-wmf.39 deployment blockers - https://phabricator.wikimedia.org/T254176
[10:31:26] <hashar>	 and the mediawiki-error alarm is due to a spike of " Bad UTF-8". There is a task for it
[10:34:36] <wikibugs>	 (CR) Marostegui: cumin: Add db-role and db-section aliases (1 comment) [puppet] - https://gerrit.wikimedia.org/r/608558 (owner: Kormat)
[10:34:48] <wikibugs>	 (PS5) Kormat: cumin: Add db-role and db-section aliases [puppet] - https://gerrit.wikimedia.org/r/608558
[10:37:42] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[10:37:43] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:37:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:03] <wikibugs>	 (PS1) Jbond: DO NOT MERGE: example refactor [puppet] - https://gerrit.wikimedia.org/r/608586
[10:38:21] <wikibugs>	 (CR) Jbond: [C: -1] DO NOT MERGE: example refactor [puppet] - https://gerrit.wikimedia.org/r/608586 (owner: Jbond)
[10:38:35] <ema>	 !log cp2040: upgrade librdkafka1 to 0.11.6-1.1wmf1 https://phabricator.wikimedia.org/P11703 T256444
[10:38:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:39] <stashbot>	 T256444: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444
[10:41:19] <ema>	 !log cp2040: restart purged and varnishkafka to use updated librdkafka1 T256444
[10:41:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:25] <DannyS712>	 hashar should I abandon the .38 backport?
[10:43:57] <hashar>	 DannyS712: yeah
[10:44:10] <hashar>	 well theoretically we could deploy it, but I would rather not take the risk :-]
[10:44:20] <hashar>	 given the fix will make it in this week's train
[10:45:33] <wikibugs>	 (Abandoned) DannyS712: Remove another use of PageArchive::getRevision [core] (wmf/1.35.0-wmf.38) - https://gerrit.wikimedia.org/r/608472 (https://phabricator.wikimedia.org/T249982) (owner: DannyS712)
[10:45:53] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[10:45:54] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:45:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:59] <wikibugs>	 Operations, Traffic, Patch-For-Review: Current codfw caches have wrong NVME format - https://phabricator.wikimedia.org/T256655 (ops-monitoring-bot) Icinga downtime for 0:30:00 set by vgutierrez@cumin1001 on 2 host(s) and their services with reason: kernel upgrade ` cp[2031-2032].codfw.wmnet `
[10:46:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:39] <hashar>	 lunch break
[10:52:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1076', diff saved to https://phabricator.wikimedia.org/P11710 and previous config saved to /var/cache/conftool/dbconfig/20200630-105254-marostegui.json
[10:52:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:13] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[10:59:14] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:59:15] <ema>	 !log upload librdkafka 0.11.6-1.1wmf1 to buster-wikimedia https://phabricator.wikimedia.org/P11703 T256444
[10:59:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:23] <stashbot>	 T256444: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444
[11:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: How many deployers does it take to do European mid-day backport window(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200630T1100).
[11:00:04] <jouncebot>	 awight: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:07] <Lucas_WMDE>	 👋
[11:00:12] <awight>	 :-)
[11:00:16] <duesen>	 hey Lucas_WMDE 
[11:00:29] <awight>	 I'll deploy my patches now.
[11:00:32] <Lucas_WMDE>	 awight: do you want to deploy the changes yourself?
[11:00:35] <Lucas_WMDE>	 ah ok
[11:00:39] <awight>	 Thanks!
[11:01:22] <wikibugs>	 (PS8) JMeybohm: chartmuseum: Add initial module, profile and role [puppet] - https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843)
[11:01:36] <wikibugs>	 (CR) Awight: [C: +2] "BACON" [extensions/FileImporter] (wmf/1.35.0-wmf.38) - https://gerrit.wikimedia.org/r/608476 (https://phabricator.wikimedia.org/T256428) (owner: Awight)
[11:02:12] <wikibugs>	 (CR) Awight: [C: +2] "BACON" [extensions/QuickSurveys] (wmf/1.35.0-wmf.38) - https://gerrit.wikimedia.org/r/608477 (https://phabricator.wikimedia.org/T256627) (owner: Awight)
[11:02:31] <wikibugs>	 (CR) JMeybohm: "This change is ready for review." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[11:03:14] <duesen>	 All sites were unreachable for me for a few minutes, 100% packet loss. From the lack of panic on this channel I suppose that was just me?...
[11:03:36] <duesen>	 Maybe a routing problem between my isp (1&1) and esams
[11:03:42] <wikibugs>	 (PS10) Jbond: cumin: testing [puppet] - https://gerrit.wikimedia.org/r/608571
[11:04:37] <marostegui>	 duesen: might have been you yeah, I don't see a drop here https://grafana.wikimedia.org/d/000000501/prometheus-varnish-http-requests?orgId=1
[11:04:51] <wikibugs>	 (CR) jerkins-bot: [V: -1] cumin: testing [puppet] - https://gerrit.wikimedia.org/r/608571 (owner: Jbond)
[11:05:24] <duesen>	 ok then! everything is back to normal for me as well.
[11:06:33] <wikibugs>	 (PS11) Jbond: cumin: testing [puppet] - https://gerrit.wikimedia.org/r/608571
[11:07:48] <wikibugs>	 (CR) jerkins-bot: [V: -1] cumin: testing [puppet] - https://gerrit.wikimedia.org/r/608571 (owner: Jbond)
[11:08:47] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 51 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:11:21] <wikibugs>	 (CR) Jforrester: [C: +1] "Note that all the node6 images used in CI are based on wikimedia-jessie; we're blocked on removing this by the migration of production ser" [puppet] - https://gerrit.wikimedia.org/r/587529 (https://phabricator.wikimedia.org/T249724) (owner: Alexandros Kosiaris)
[11:13:20] <ema>	 !log deneb: systemctl restart docker-reporter-base-images.service
[11:13:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:38] <wikibugs>	 (PS1) Jbond: cumin: Add db-role and db-section aliases [puppet] - https://gerrit.wikimedia.org/r/608596
[11:14:27] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 48 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:14:37] <wikibugs>	 (PS2) Jbond: cumin: Add db-role and db-section aliases [puppet] - https://gerrit.wikimedia.org/r/608596
[11:14:55] <wikibugs>	 Operations, DBA, CAS-SSO, Patch-For-Review, User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (Marostegui) @jbond is the scope of this task done or is there anything else left?
[11:15:27] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:17:11] <wikibugs>	 (PS3) Jbond: cumin: Add db-role and db-section aliases [puppet] - https://gerrit.wikimedia.org/r/608596
[11:17:13] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:18:59] <wikibugs>	 (PS4) Jbond: cumin: Add db-role and db-section aliases [puppet] - https://gerrit.wikimedia.org/r/608596
[11:20:42] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:22:59] <wikibugs>	 (Merged) jenkins-bot: Set Status error if permission check returns false. [extensions/FileImporter] (wmf/1.35.0-wmf.38) - https://gerrit.wikimedia.org/r/608476 (https://phabricator.wikimedia.org/T256428) (owner: Awight)
[11:23:02] <wikibugs>	 (Merged) jenkins-bot: Embedded surveys are hidden when no element is available [extensions/QuickSurveys] (wmf/1.35.0-wmf.38) - https://gerrit.wikimedia.org/r/608477 (https://phabricator.wikimedia.org/T256627) (owner: Awight)
[11:24:51] <wikibugs>	 (PS5) Jbond: umin: Add db-role and db-section aliases [puppet] - https://gerrit.wikimedia.org/r/608596
[11:25:00] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:26:39] <logmsgbot>	 !log awight@deploy1001 Synchronized php-1.35.0-wmf.38/extensions/FileImporter: BACON: [[gerrit:608476|Set Status error if permission check returns false. (T256428)]] (duration: 00m 58s)
[11:26:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:46] <stashbot>	 T256428: FileImporter thinks I’m an admin even though I’m not - https://phabricator.wikimedia.org/T256428
[11:26:48] <wikibugs>	 (PS2) Awight: Configure TeWü survey on dewiki (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/608478 (https://phabricator.wikimedia.org/T253112)
[11:26:56] <wikibugs>	 (CR) Awight: [C: +2] "BACON" [mediawiki-config] - https://gerrit.wikimedia.org/r/608478 (https://phabricator.wikimedia.org/T253112) (owner: Awight)
[11:28:23] <wikibugs>	 (Merged) jenkins-bot: Configure TeWü survey on dewiki (take 2) [mediawiki-config] - https://gerrit.wikimedia.org/r/608478 (https://phabricator.wikimedia.org/T253112) (owner: Awight)
[11:28:31] <logmsgbot>	 !log awight@deploy1001 Synchronized php-1.35.0-wmf.38/extensions/QuickSurveys: BACON: [[gerrit:608477|Embedded surveys are hidden when no element is available (T256627)]] (duration: 00m 56s)
[11:28:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:35] <stashbot>	 T256627: Embedded surveys are incorrectly shown even when embed element is missing - https://phabricator.wikimedia.org/T256627
[11:30:30] <wikibugs>	 (PS6) Jbond: umin: Add db-role and db-section aliases [puppet] - https://gerrit.wikimedia.org/r/608596
[11:31:58] <jayme>	 !log pushed a scratch docker image as docker-registry.discovery.wmnet/envoy-tls-local-proxy:dontuseme - T253396
[11:31:59] <jayme>	 !log restarted docker-reporter-base-images and docker-reporter-releng-images on deneb - T253396
[11:32:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:02] <stashbot>	 T253396: Upgrade all TLS enabled charts to v0.2 tls_helper - https://phabricator.wikimedia.org/T253396
[11:32:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:41] <logmsgbot>	 !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings.php: BACON: [[gerrit:608478|Configure TeWü survey on dewiki (take 2) (T253112)]] (duration: 00m 58s)
[11:32:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:44] <stashbot>	 T253112: Create survey for TechWish prototype announcements on dewiki and metawiki - https://phabricator.wikimedia.org/T253112
[11:33:05] <wikibugs>	 Operations, Prod-Kubernetes, serviceops, Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (JMeybohm)
[11:33:13] <awight>	 !log EU BACON cooked
[11:33:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:38] <wikibugs>	 (CR) Ema: [C: +2] Build-depend on go 1.14 [software/purged] - https://gerrit.wikimedia.org/r/608275 (owner: Ema)
[11:35:48] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[11:35:48] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:35:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:50] <wikibugs>	 (PS4) Alexandros Kosiaris: lvs: Switch proton to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/607532 (https://phabricator.wikimedia.org/T225680)
[11:38:07] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] lvs: Switch proton to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/607532 (https://phabricator.wikimedia.org/T225680) (owner: Alexandros Kosiaris)
[11:39:33] <wikibugs>	 (PS7) Jbond: umin: Add db-role and db-section aliases [puppet] - https://gerrit.wikimedia.org/r/608596
[11:39:39] <wikibugs>	 (PS1) Alexandros Kosiaris: redis::misc: Set docker-registry maxmemory-policy [puppet] - https://gerrit.wikimedia.org/r/608600 (https://phabricator.wikimedia.org/T256726)
[11:42:00] <wikibugs>	 (PS8) Jbond: cumin: Add db-role and db-section aliases [puppet] - https://gerrit.wikimedia.org/r/608596
[11:45:18] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 101 connections established with conf1004.eqiad.wmnet:4001 (min=102) https://wikitech.wikimedia.org/wiki/PyBal
[11:45:46] <wikibugs>	 (PS12) Jbond: cumin: testing [puppet] - https://gerrit.wikimedia.org/r/608571
[11:45:48] <wikibugs>	 (PS2) Jbond: DO NOT MERGE: example refactor [puppet] - https://gerrit.wikimedia.org/r/608586
[11:46:41] <wikibugs>	 (CR) jerkins-bot: [V: -1] cumin: testing [puppet] - https://gerrit.wikimedia.org/r/608571 (owner: Jbond)
[11:47:50] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 58 connections established with conf2001.codfw.wmnet:2379 (min=59) https://wikitech.wikimedia.org/wiki/PyBal
[11:48:32] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.21:4030]) https://wikitech.wikimedia.org/wiki/PyBal
[11:48:56] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.21:4030]) https://wikitech.wikimedia.org/wiki/PyBal
[11:49:00] <ema>	 akosiaris: the pybal alerts are due to your LVS changes I suppose, right? ^
[11:49:02] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.21:4030]) https://wikitech.wikimedia.org/wiki/PyBal
[11:49:24] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 69 connections established with conf1004.eqiad.wmnet:4001 (min=70) https://wikitech.wikimedia.org/wiki/PyBal
[11:49:46] <wikibugs>	 (03PS3) 10Jbond: DO NOT MERGE: example refactor [puppet] - 10https://gerrit.wikimedia.org/r/608586
[11:50:26] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 78 connections established with conf2001.codfw.wmnet:2379 (min=79) https://wikitech.wikimedia.org/wiki/PyBal
[11:51:01] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] DO NOT MERGE: example refactor [puppet] - 10https://gerrit.wikimedia.org/r/608586 (owner: 10Jbond)
[11:51:03] <wikibugs>	 (03PS13) 10Jbond: cumin: testing [puppet] - 10https://gerrit.wikimedia.org/r/608571
[11:52:20] <wikibugs>	 (03PS2) 10Jbond: puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721)
[11:54:12] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.21:4030]) https://wikitech.wikimedia.org/wiki/PyBal
[11:55:06] <akosiaris>	 that's me ^, ignore please
[11:55:07] <wikibugs>	 (03PS3) 10Jbond: puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - 10https://gerrit.wikimedia.org/r/608565
[11:56:19] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - 10https://gerrit.wikimedia.org/r/608565 (owner: 10Jbond)
[11:58:58] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[11:59:30] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[12:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200630T1200)
[12:00:50] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 102 connections established with conf1004.eqiad.wmnet:4001 (min=102) https://wikitech.wikimedia.org/wiki/PyBal
[12:00:57] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 79 connections established with conf2001.codfw.wmnet:2379 (min=79) https://wikitech.wikimedia.org/wiki/PyBal
[12:01:32] <wikibugs>	 (03PS1) 10Elukey: Fix tests for multi-threading code [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/608604
[12:02:44] <wikibugs>	 (03PS4) 10Jbond: puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - 10https://gerrit.wikimedia.org/r/608565
[12:03:22] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[12:03:24] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:03:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:36] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Current codfw caches have wrong NVME format - https://phabricator.wikimedia.org/T256655 (10ops-monitoring-bot) Icinga downtime for 0:30:00 set by vgutierrez@cumin1001 on 2 host(s) and their services with reason: kernel upgrade ` cp[2033-2034].codfw.wmnet `
[12:04:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - 10https://gerrit.wikimedia.org/r/608565 (owner: 10Jbond)
[12:07:05] <wikibugs>	 (03PS2) 10Elukey: Fix tests for multi-threading code [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/608604
[12:07:11] <wikibugs>	 10Operations, 10DBA, 10Patch-For-Review: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717 (10Marostegui) Actually I just realised that this host won't be replaced next FY, as we are replacing up to db1095.
[12:07:36] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Fix tests for multi-threading code [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/608604 (owner: 10Elukey)
[12:08:06] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:08:33] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: lvs: Switch proton to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/607533 (https://phabricator.wikimedia.org/T225680)
[12:08:56] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 59 connections established with conf2001.codfw.wmnet:2379 (min=59) https://wikitech.wikimedia.org/wiki/PyBal
[12:10:16] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] lvs: Switch proton to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/607533 (https://phabricator.wikimedia.org/T225680) (owner: 10Alexandros Kosiaris)
[12:10:38] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[12:10:42] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[12:10:43] <wikibugs>	 (03PS5) 10Jbond: puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - 10https://gerrit.wikimedia.org/r/608565
[12:11:04] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 70 connections established with conf1004.eqiad.wmnet:4001 (min=70) https://wikitech.wikimedia.org/wiki/PyBal
[12:11:16] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:11:29] <wikibugs>	 (03PS6) 10Kormat: cumin: Add db-role and db-section aliases [puppet] - 10https://gerrit.wikimedia.org/r/608558
[12:11:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - 10https://gerrit.wikimedia.org/r/608565 (owner: 10Jbond)
[12:12:39] <wikibugs>	 (03CR) 10Kormat: [C: 04-1] "Based on input from Jbond, i'm going to take a different approach for the enumeration of valid states/sections. I'll rebase this CR once t" [puppet] - 10https://gerrit.wikimedia.org/r/608558 (owner: 10Kormat)
[12:15:59] <wikibugs>	 (03PS6) 10Jbond: puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721)
[12:21:59] <wikibugs>	 (03CR) 10Jcrespo: "> Patch Set 6: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/608558 (owner: 10Kormat)
[12:23:14] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add analytics data purge for pageview_actor_hourly [puppet] - 10https://gerrit.wikimedia.org/r/608568 (https://phabricator.wikimedia.org/T256417) (owner: 10Joal)
[12:24:24] <wikibugs>	 (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) (owner: 10Jbond)
[12:27:33] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:28:52] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:29:49] <icinga-wm>	 PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:30:52] <wikibugs>	 (03PS1) 10Ema: 5.1.3-1wm16: add 0039-probe-cold-state-race.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/608606 (https://phabricator.wikimedia.org/T236754)
[12:31:04] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] 5.1.3-1wm16: add 0039-probe-cold-state-race.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/608606 (https://phabricator.wikimedia.org/T236754) (owner: 10Ema)
[12:42:27] <wikibugs>	 (03PS1) 10Ahuret: propagate logger to WSGIServer [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/608607
[12:42:29] <wikibugs>	 (03CR) 10Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/608607 (owner: 10Ahuret)
[12:49:00] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: lvs: Switch proton to production [puppet] - 10https://gerrit.wikimedia.org/r/607534 (https://phabricator.wikimedia.org/T225680)
[12:49:50] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] lvs: Switch proton to production [puppet] - 10https://gerrit.wikimedia.org/r/607534 (https://phabricator.wikimedia.org/T225680) (owner: 10Alexandros Kosiaris)
[12:55:29] <wikibugs>	 (03PS2) 10CDanis: add playbook links for important alerts [puppet] - 10https://gerrit.wikimedia.org/r/608490
[12:55:56] <wikibugs>	 (03CR) 10CDanis: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/608490 (owner: 10CDanis)
[12:57:55] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608600 (https://phabricator.wikimedia.org/T256726) (owner: 10Alexandros Kosiaris)
[12:58:35] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] proton: Switch dev restbase to talk to TLS proton [puppet] - 10https://gerrit.wikimedia.org/r/607535 (https://phabricator.wikimedia.org/T225680) (owner: 10Alexandros Kosiaris)
[12:59:20] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] role::alerting_host: update the cas-icinga vhost to use the icinga cert [puppet] - 10https://gerrit.wikimedia.org/r/608312 (owner: 10Jbond)
[13:00:04] <jouncebot>	 hashar and twentyafterfour: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Mediawiki train - European+American Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200630T1300).
[13:00:56] <hashar>	 train train
[13:01:06] <hashar>	 those jouncebot messages are annoying
[13:01:40] <wikibugs>	 (03PS1) 10Hashar: testwikis wikis to 1.35.0-wmf.39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608613
[13:02:31] <icinga-wm>	 RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:03:27] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1068 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[13:04:19] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "Thanks!" [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/608607 (owner: 10Ahuret)
[13:04:21] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] propagate logger to WSGIServer [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/608607 (owner: 10Ahuret)
[13:04:57] <icinga-wm>	 PROBLEM - puppet last run on idp-test2001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:06:35] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] testwikis wikis to 1.35.0-wmf.39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608613 (owner: 10Hashar)
[13:06:49] <icinga-wm>	 PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:07:20] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.35.0-wmf.39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608613 (owner: 10Hashar)
[13:07:25] <logmsgbot>	 !log hashar@deploy1001 Started scap: testwikis wikis to 1.35.0-wmf.39
[13:07:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:29] <icinga-wm>	 RECOVERY - Hadoop NodeManager on analytics1068 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[13:12:46] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. JMeybohm I broke this while deleting the envoy-tls-local-proxy, looking into it, https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:16:13] <icinga-wm>	 RECOVERY - puppet last run on idp-test2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:16:20] <wikibugs>	 (03PS1) 10Muehlenhoff: Also enable ssoSessions actuator in prod [puppet] - 10https://gerrit.wikimedia.org/r/608616
[13:18:35] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:20:21] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:24:21] <icinga-wm>	 RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:29:10] <wikibugs>	 (03PS1) 10Kormat: mariadb: Use custom types to ensure role/section have valid values. [puppet] - 10https://gerrit.wikimedia.org/r/608618
[13:30:46] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[13:30:47] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[13:30:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:37] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Current codfw caches have wrong NVME format - https://phabricator.wikimedia.org/T256655 (10ops-monitoring-bot) Icinga downtime for 0:30:00 set by vgutierrez@cumin1001 on 2 host(s) and their services with reason: kernel upgrade ` cp[2035-2036].codfw.wmnet `
[13:31:52] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Enable validation of new signatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608619 (https://phabricator.wikimedia.org/T248632)
[13:32:41] <icinga-wm>	 RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:32:46] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[13:32:47] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[13:32:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:08] <wikibugs>	 (03CR) 10Bartosz Dziewoński: "(For deployment next week)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608619 (https://phabricator.wikimedia.org/T248632) (owner: 10Bartosz Dziewoński)
[13:35:38] <wikibugs>	 (03CR) 10Jcrespo: "Not sure if these should be here, but commenting just in case." (3 comments) [puppet] - 10https://gerrit.wikimedia.org/r/608618 (owner: 10Kormat)
[13:36:38] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Enable validation of new signatures on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608621 (https://phabricator.wikimedia.org/T248632)
[13:37:20] <moritzm>	 !log rebooting LDAP replicas for kernel security update
[13:37:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:09] <icinga-wm>	 PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:38:40] <jynus>	 growth on 503 baseline
[13:38:52] <jynus>	 since :34
[13:39:11] <jynus>	 https://grafana.wikimedia.org/d/000000503/varnish-http-errors?panelId=7&fullscreen&orgId=1&refresh=1m&from=1593520746979&to=1593524346979
[13:39:46] <jynus>	 it went down again
[13:40:10] <jynus>	 strange, doesn't look like a normal error spike
[13:41:20] <wikibugs>	 (03PS1) 10Ottomata: Camus eventlogging - consider meta.dt and dt for event partition time [puppet] - 10https://gerrit.wikimedia.org/r/608622 (https://phabricator.wikimedia.org/T256370)
[13:41:25] <wikibugs>	 (03PS1) 10Jbond: mariadb::ferm: move firewall rules to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120)
[13:41:27] <wikibugs>	 (03PS1) 10Jbond: mariadb::core_test: open mysql port for ipd_test on db1077 server [puppet] - 10https://gerrit.wikimedia.org/r/608624 (https://phabricator.wikimedia.org/T256120)
[13:42:05] <wikibugs>	 (03CR) 10Ottomata: "To be merged after https://gerrit.wikimedia.org/r/c/analytics/refinery/+/608460 is deployed." [puppet] - 10https://gerrit.wikimedia.org/r/608622 (https://phabricator.wikimedia.org/T256370) (owner: 10Ottomata)
[13:42:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb::ferm: move firewall rules to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond)
[13:46:27] <wikibugs>	 (03PS2) 10Jbond: mariadb::ferm: move firewall rules to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120)
[13:48:55] <wikibugs>	 (03CR) 10Jcrespo: "Not against what I think is the idea here, but modifying core db's firewall logic requires its own dedicated ticket, as it requires quite " [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond)
[13:49:04] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[13:49:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:03] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[13:51:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:07] <wikibugs>	 (03PS2) 10Kormat: mariadb: Use custom types to ensure role/section have valid values. [puppet] - 10https://gerrit.wikimedia.org/r/608618
[13:53:02] <wikibugs>	 (03CR) 10Jcrespo: "Naming also be confusing, given modules/profile/manifests/mariadb/ferm.pp resource exists, too." [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond)
[13:54:15] <wikibugs>	 (03CR) 10Jcrespo: "maybe we can move profile::mariadb::ferm to the mariadb module so a profile doesn't import another profile?" [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond)
[13:55:24] <wikibugs>	 (03PS3) 10Jbond: mariadb::ferm: move firewall rules to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120)
[13:56:22] <icinga-wm>	 PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:57:01] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[13:57:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:36] <wikibugs>	 (03CR) 10Muehlenhoff: puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) (owner: 10Jbond)
[13:58:46] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me (my earlier comment is more directed towards a followup change after this conversion is complete)" [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) (owner: 10Jbond)
[13:59:00] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[13:59:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:26] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[13:59:27] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[13:59:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:59] <wikibugs>	 10Operations, 10CAS-SSO, 10User-jbond: cas-icinga intermittant failures - https://phabricator.wikimedia.org/T256720 (10Kormat) 05Open→03Invalid Correction: this doesn't seem to be cas-specific. I've had both cas-icinga and plain icinga pages open in both firefox and chrome for a few hours, and so far i'v...
[14:00:09] <wikibugs>	 (03PS4) 10Jbond: mariadb::ferm: move firewall rules to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120)
[14:00:49] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] add playbook links for important alerts [puppet] - 10https://gerrit.wikimedia.org/r/608490 (owner: 10CDanis)
[14:01:05] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond)
[14:02:12] <icinga-wm>	 RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:04:07] <wikibugs>	 (03CR) 10Jcrespo: "Could I ask what ips/dns are needed to connect to what hosts? Just to be clear, I am not claming a refactor is not needed, but I wonder if" [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond)
[14:06:24] <icinga-wm>	 PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:08:24] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[14:08:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:57] <logmsgbot>	 !log otto@deploy1001 Started deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370
[14:09:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:01] <stashbot>	 T256370: Camus should look for multiple possible timestamp fields to use for hourly partitioining - https://phabricator.wikimedia.org/T256370
[14:09:41] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] purged: alert if local backlog grows past the given limits [puppet] - 10https://gerrit.wikimedia.org/r/608564 (https://phabricator.wikimedia.org/T256446) (owner: 10Ema)
[14:09:46] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:09:56] <logmsgbot>	 !log hashar@deploy1001 Finished scap: testwikis wikis to 1.35.0-wmf.39 (duration: 62m 30s)
[14:09:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:26] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[14:10:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:44] <hashar>	 eventually
[14:10:54] <logmsgbot>	 !log otto@deploy1001 Finished deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370 (duration: 01m 56s)
[14:10:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:40] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[14:11:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:08] <wikibugs>	 (03CR) 10Andrew Bogott: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/608068 (owner: 10Andrew Bogott)
[14:12:23] <wikibugs>	 (03PS5) 10Jbond: mariadb::ferm: move firewall rules to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120)
[14:13:00] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:13:05] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond)
[14:13:11] <hashar>	 time for group0
[14:13:42] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[14:13:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:14:50] <wikibugs>	 (03PS1) 10Hashar: group0 wikis to 1.35.0-wmf.39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608629
[14:15:02] <wikibugs>	 (03PS3) 10Kormat: mariadb: Use custom types to ensure role/section have valid values. [puppet] - 10https://gerrit.wikimedia.org/r/608618
[14:15:46] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[14:15:46] <wikibugs>	 (03CR) 10Kormat: mariadb: Use custom types to ensure role/section have valid values. (3 comments) [puppet] - 10https://gerrit.wikimedia.org/r/608618 (owner: 10Kormat)
[14:15:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:50] <moritzm>	 !log rebooting miscweb servers for kernel security update
[14:15:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:37] <wikibugs>	 (03CR) 10Jbond: "> Naming also be confusing, given modules/profile/manifests/mariadb/ferm.pp resource exists, too." [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond)
[14:17:30] <wikibugs>	 (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/23569/" [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond)
[14:17:37] <wikibugs>	 (03PS2) 10Jbond: mariadb::core_test: open mysql port for ipd_test on db1077 server [puppet] - 10https://gerrit.wikimedia.org/r/608624 (https://phabricator.wikimedia.org/T256120)
[14:18:10] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[14:18:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:45] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[14:18:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:48] <wikibugs>	 (03PS1) 10Jbond: mariadb::profile::firewall: use the profile::mariadb::ferm type [puppet] - 10https://gerrit.wikimedia.org/r/608631
[14:21:10] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[14:21:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:30] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime
[14:21:31] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:21:31] <wikibugs>	 (03PS2) 10Jbond: mariadb::profile::firewall: use the profile::mariadb::ferm type [puppet] - 10https://gerrit.wikimedia.org/r/608631
[14:21:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:33] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.35.0-wmf.39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608629 (owner: 10Hashar)
[14:21:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:36] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.35.0-wmf.39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608629 (owner: 10Hashar)
[14:24:22] <icinga-wm>	 RECOVERY - IPMI Sensor Status on logstash2001 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:24:36] <wikibugs>	 (03CR) 10Jbond: "pcc: https://puppet-compiler.wmflabs.org/compiler1001/23571/" [puppet] - 10https://gerrit.wikimedia.org/r/608624 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond)
[14:24:47] <wikibugs>	 (03PS1) 10ZPapierski: Configuration code for oauth proxy [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498)
[14:25:20] <logmsgbot>	 !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.39
[14:25:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Configuration code for oauth proxy [puppet] - 10https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) (owner: 10ZPapierski)
[14:26:08] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[14:26:09] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGtM" [puppet] - 10https://gerrit.wikimedia.org/r/608616 (owner: 10Muehlenhoff)
[14:26:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:10] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[14:28:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:33] <MatmaRex>	 anyone wants to merge a Beta-only config change? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/608621
[14:30:00] <wikibugs>	 (03CR) 10Jbond: puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) (owner: 10Jbond)
[14:30:08] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:30:53] <hashar>	 for god sake
[14:30:57] <hashar>	 rolling back
[14:31:46] <wikibugs>	 (03CR) 10Jbond: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608618 (owner: 10Kormat)
[14:32:05] <wikibugs>	 (03Abandoned) 10Jbond: cumin: testing [puppet] - 10https://gerrit.wikimedia.org/r/608571 (owner: 10Jbond)
[14:32:18] <wikibugs>	 (03Abandoned) 10Jbond: DO NOT MERGE: example refactor [puppet] - 10https://gerrit.wikimedia.org/r/608586 (owner: 10Jbond)
[14:32:44] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[14:32:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:32] <wikibugs>	 (03PS1) 10Hashar: Revert "group0 wikis to 1.35.0-wmf.39" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608636 (https://phabricator.wikimedia.org/T256759)
[14:33:46] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] Revert "group0 wikis to 1.35.0-wmf.39" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608636 (https://phabricator.wikimedia.org/T256759) (owner: 10Hashar)
[14:34:01] <wikibugs>	 10Operations, 10serviceops, 10Epic, 10Patch-For-Review: Track and remove jessie based container images from production - https://phabricator.wikimedia.org/T249724 (10dancy)
[14:34:03] <hashar>	 will file it as unbreak now as soon as I am done with the rollback
[14:34:28] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.35.0-wmf.39" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608636 (https://phabricator.wikimedia.org/T256759) (owner: 10Hashar)
[14:34:45] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[14:34:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:15] <wikibugs>	 10Operations, 10serviceops, 10Epic, 10Patch-For-Review: Track and remove jessie based container images from production - https://phabricator.wikimedia.org/T249724 (10hashar)
[14:36:03] <wikibugs>	 (03PS4) 10Kormat: mariadb: Use custom types to ensure role/section have valid values. [puppet] - 10https://gerrit.wikimedia.org/r/608618
[14:36:13] <Amir1>	 Flow is fataling https://www.mediawiki.org/wiki/Topic:Vp2ezhpldxustuf4
[14:36:24] <wikibugs>	 10Operations, 10serviceops, 10Epic, 10Patch-For-Review: Track and remove jessie based container images from production - https://phabricator.wikimedia.org/T249724 (10hashar) CI still uses a Jessie based container from docker-registry.wikimedia.org/wikimedia-jessie . The last remaining task is to have some se...
[14:36:43] <RhinosF1>	 Amir1: hasha.r is already rolling back
[14:36:53] <Amir1>	 oh okay
[14:37:35] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] "noop for production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608621 (https://phabricator.wikimedia.org/T248632) (owner: 10Bartosz Dziewoński)
[14:37:36] <hashar>	 pfff
[14:37:38] <hashar>	 wrong task
[14:38:09] <Amir1>	 Can I quickly rebase that patch in wdeploy1001?
[14:38:13] <Amir1>	 *deploy1001
[14:38:18] <wikibugs>	 (03Merged) 10jenkins-bot: Enable validation of new signatures on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608621 (https://phabricator.wikimedia.org/T248632) (owner: 10Bartosz Dziewoński)
[14:38:28] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:38:31] <logmsgbot>	 !log hashar@deploy1001 rebuilt and synchronized wikiversions files: Revert "group0 wikis to 1.35.0-wmf.39" - T256759
[14:38:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:35] <stashbot>	 T256759: selenium-daily-* Jenkins jobs fail with `fatal: Refusing to fetch into current branch refs/heads/master of non-bare repository` - https://phabricator.wikimedia.org/T256759
[14:39:02] <RhinosF1>	 Amir1: flow is alive now
[14:39:45] <MatmaRex>	 Amir1: thanks!
[14:39:45] <Amir1>	 RhinosF1: thanks
[14:40:08] <Amir1>	 MatmaRex: thank you for doing it, I just clicked on a shiny button (and I like clicking on shiny buttons)
[14:40:13] <RhinosF1>	 Amir1: I think hashar's to thank. I'm just talking
[14:40:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608489 (owner: 10Dzahn)
[14:41:13] <Majavah>	 anyone has a phabricator task for flow errors?
[14:42:26] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Also enable ssoSessions actuator in prod [puppet] - 10https://gerrit.wikimedia.org/r/608616 (owner: 10Muehlenhoff)
[14:42:37] <wikibugs>	 (03CR) 10Jcrespo: "So I really don't see the use of changing the profiles for a test mediawiki database. Mediawiki databases (even test ones) shouldn't have " [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond)
[14:42:55] <RhinosF1>	 Majavah: I think hashar was going to file it after he rolled back
[14:42:58] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp2030 is CRITICAL: cluster=cache_upload instance=cp2030 job=purged site=codfw topic={codfw.resource-purge,eqiad.resource-purge} https://wikitech.wikimedia.org/wiki/Purged https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2030
[14:43:14] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:43:28] <Majavah>	 T256761
[14:43:28] <stashbot>	 T256761: 1.35.0-wmf.39 breaks Flow - https://phabricator.wikimedia.org/T256761
[14:43:42] <wikibugs>	 (03CR) 10Kormat: "Fixed issue with using wrong type, pcc is happy now: https://puppet-compiler.wmflabs.org/compiler1003/23574/" [puppet] - 10https://gerrit.wikimedia.org/r/608618 (owner: 10Kormat)
[14:43:45] <hashar>	 !log Train blocked on Flow being broken: T256761   # T254176
[14:44:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:05] <stashbot>	 T254176: 1.35.0-wmf.39 deployment blockers - https://phabricator.wikimedia.org/T254176
[14:44:09] <hashar>	 Amir1: yeah the traces are in https://phabricator.wikimedia.org/T256761
[14:44:12] <hashar>	 no clue what is happening
[14:44:32] <Majavah>	 also happens on beta cluster https://en.wikipedia.beta.wmflabs.org/wiki/Topic:Vp9cnfpdti9bnda2
[14:44:49] <Amir1>	 thanks 
[14:45:01] <Amir1>	 I don't know if I can debug and fix it but I'll give it a try
[14:45:19] <Majavah>	 I'm trying the same
[14:47:26] <wikibugs>	 10Operations, 10serviceops, 10Kubernetes: Fix nginx config and caching for docker registry - https://phabricator.wikimedia.org/T256762 (10JMeybohm)
[14:47:36] <logmsgbot>	 !log otto@deploy1001 Started deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370 - take 2
[14:47:39] <logmsgbot>	 !log otto@deploy1001 Finished deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370 - take 2 (duration: 00m 03s)
[14:47:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:40] <stashbot>	 T256370: Camus should look for multiple possible timestamp fields to use for hourly partitioining - https://phabricator.wikimedia.org/T256370
[14:47:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:24] <wikibugs>	 (03PS7) 10Kormat: cumin: Add db-role and db-section aliases [puppet] - 10https://gerrit.wikimedia.org/r/608558
[14:48:32] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Open port to misc dbs for idp-test servers [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120)
[14:48:59] <wikibugs>	 (03CR) 10Kormat: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/608558 (owner: 10Kormat)
[14:49:33] <logmsgbot>	 !log otto@deploy1001 Started deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370 - take 3
[14:49:36] <logmsgbot>	 !log otto@deploy1001 Finished deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370 - take 3 (duration: 00m 03s)
[14:49:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:33] <wikibugs>	 (03CR) 10Jcrespo: "I think this, while not ideal, is a safer change (and a method that will be closer to the way the definitive hosts will be setup)." [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120) (owner: 10Jcrespo)
[14:52:11] <wikibugs>	 (03CR) 10Jbond: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond)
[14:54:04] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:54:12] <wikibugs>	 (03CR) 10Jbond: "also please note the noop PCC: https://puppet-compiler.wmflabs.org/compiler1001/23569/" [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond)
[14:55:40] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[14:56:08] <wikibugs>	 (03CR) 10Jcrespo: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond)
[14:56:14] <wikibugs>	 (03PS8) 10Kormat: cumin: Add db-role and db-section aliases [puppet] - 10https://gerrit.wikimedia.org/r/608558
[14:58:56] <wikibugs>	 (03PS1) 10Privacybatm: Transferer.py: Calculate source checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/608640 (https://phabricator.wikimedia.org/T254979)
[14:59:00] <moritzm>	 !log rebooting failoid hosts for kernel update
[14:59:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:09] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[14:59:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:16] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:59:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Transferer.py: Calculate source checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/608640 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm)
[15:00:02] <wikibugs>	 (03PS9) 10Kormat: cumin: Add db-role and db-section aliases [puppet] - 10https://gerrit.wikimedia.org/r/608558
[15:01:12] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[15:01:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:01:47] <wikibugs>	 (03CR) 10Kormat: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/608558 (owner: 10Kormat)
[15:01:50] <icinga-wm>	 RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:03:39] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[15:03:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:44] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:04:50] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:05:43] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[15:05:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:20] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:07:14] <wikibugs>	 (03CR) 10Jbond: "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond)
[15:07:30] <icinga-wm>	 PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:10:38] <moritzm>	 !log rebooting mwdebug* hosts for kernel security update
[15:10:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:57] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[15:10:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:08] <wikibugs>	 (03CR) 10Jbond: mariadb::ferm: move firewall rules to there own profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond)
[15:12:26] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[15:13:09] <wikibugs>	 (03PS2) 10Privacybatm: Transferer.py: Calculate source checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/608640 (https://phabricator.wikimedia.org/T254979)
[15:14:27] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[15:14:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:44] <logmsgbot>	 !log otto@deploy1001 Started deploy [analytics/refinery@1112749]: roll back to 1112749 on an-launcher1002, git-fat not pulling artifacts
[15:14:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:05] <logmsgbot>	 !log otto@deploy1001 Finished deploy [analytics/refinery@1112749]: roll back to 1112749 on an-launcher1002, git-fat not pulling artifacts (duration: 01m 21s)
[15:16:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:41] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10observability, 10serviceops, 10Patch-For-Review: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (10AMooney) @KFrancis, it looks like the form for this work has been Approved. Can this task move forwa...
[15:20:33] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[15:20:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:50] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:22:41] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:24:26] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[15:24:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:36] <wikibugs>	 (03PS6) 10Jbond: mariadb::ferm: move firewall rules to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120)
[15:26:05] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120)
[15:26:07] <wikibugs>	 (03Abandoned) 10Jbond: mariadb::core_test: open mysql port for ipd_test on db1077 server [puppet] - 10https://gerrit.wikimedia.org/r/608624 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond)
[15:27:16] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120) (owner: 10Jcrespo)
[15:27:33] <wikibugs>	 (03PS3) 10Jcrespo: mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120)
[15:28:10] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:28:25] <moritzm>	 ^ should recover in a bit
[15:28:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120) (owner: 10Jcrespo)
[15:30:01] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:30:28] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[15:30:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:17] <RhinosF1>	 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ElectronPdfService/+/608649 should fix the train
[15:32:26] <RhinosF1>	 Amir1, Majavah: ^
[15:32:29] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[15:32:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:39] <wikibugs>	 (03PS1) 10RLazarus: openldap: Clarify output text on expiration warnings [puppet] - 10https://gerrit.wikimedia.org/r/608650
[15:32:56] <Majavah>	 having issues with my internet atm, can't look
[15:33:08] <wikibugs>	 (03PS4) 10Jcrespo: mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120)
[15:33:14] <RhinosF1>	 Ack, I asked Demian to look
[15:33:36] <icinga-wm>	 RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:34:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120) (owner: 10Jcrespo)
[15:34:54] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608650 (owner: 10RLazarus)
[15:35:14] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[15:35:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:24] <wikibugs>	 (03PS5) 10Jcrespo: mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120)
[15:35:41] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/608650 (owner: 10RLazarus)
[15:39:04] <icinga-wm>	 PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:39:07] <wikibugs>	 (03CR) 10Jbond: mariadb: Setup db1077 as a misc::idp_test database server (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120) (owner: 10Jcrespo)
[15:40:18] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[15:40:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608558 (owner: 10Kormat)
[15:44:24] <icinga-wm>	 PROBLEM - ps1-c3-codfw-infeed-load-tower-B-phase-Y on ps1-c3-codfw is CRITICAL: SNMP CRITICAL - ps1-c3-codfw-infeed-load-tower-B-phase-Y *-1* https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:44:31] <wikibugs>	 (03PS6) 10Jcrespo: mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120)
[15:44:34] <icinga-wm>	 PROBLEM - ps1-c3-codfw-infeed-load-tower-B-phase-X on ps1-c3-codfw is CRITICAL: SNMP CRITICAL - ps1-c3-codfw-infeed-load-tower-B-phase-X *-1* https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:44:48] <icinga-wm>	 PROBLEM - Juniper alarms on asw-c-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[15:45:37] <wikibugs>	 (03PS7) 10Jcrespo: mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120)
[15:46:16] <icinga-wm>	 PROBLEM - Host mw2335 is DOWN: PING CRITICAL - Packet loss = 100%
[15:46:33] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120) (owner: 10Jcrespo)
[15:46:34] <icinga-wm>	 PROBLEM - ps1-c3-codfw-infeed-load-tower-B-phase-Z on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:46:38] <icinga-wm>	 RECOVERY - Juniper alarms on asw-c-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[15:46:50] <icinga-wm>	 PROBLEM - Host mw2339 is DOWN: PING CRITICAL - Packet loss = 100%
[15:47:04] <icinga-wm>	 PROBLEM - Host db2113 is DOWN: PING CRITICAL - Packet loss = 100%
[15:47:04] <icinga-wm>	 PROBLEM - Host mw2337 is DOWN: PING CRITICAL - Packet loss = 100%
[15:47:06] <icinga-wm>	 PROBLEM - Host mw2338 is DOWN: PING CRITICAL - Packet loss = 100%
[15:47:24] <icinga-wm>	 PROBLEM - Host mw2336 is DOWN: PING CRITICAL - Packet loss = 100%
[15:47:24] <icinga-wm>	 PROBLEM - Host thumbor2001 is DOWN: PING CRITICAL - Packet loss = 100%
[15:47:34] <icinga-wm>	 PROBLEM - Host thumbor2002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:47:52] <icinga-wm>	 PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[15:48:00] <icinga-wm>	 PROBLEM - Host mw2336.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:48:00] <icinga-wm>	 PROBLEM - Host mw2337.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:48:08] <rzl>	 ^ that's rack c3 but I thought no impact was expected
[15:48:28] <rzl>	 planned PDU work today, but they were going to be one at a time per papaul
[15:48:32] <icinga-wm>	 RECOVERY - Host db2113 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms
[15:48:32] <icinga-wm>	 RECOVERY - Host mw2338 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms
[15:48:32] <icinga-wm>	 RECOVERY - Host mw2339 is UP: PING OK - Packet loss = 0%, RTA = 36.22 ms
[15:48:32] <icinga-wm>	 RECOVERY - Host thumbor2002 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms
[15:48:34] <icinga-wm>	 RECOVERY - Host mw2335 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms
[15:48:34] <icinga-wm>	 RECOVERY - Host mw2337 is UP: PING OK - Packet loss = 0%, RTA = 36.20 ms
[15:48:42] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Creation of mailinglist for Board of WUG Esperanto and Free Knowledge - https://phabricator.wikimedia.org/T255951 (10ssingh) 05Open→03Resolved a:03ssingh Marking this as resolved as the list has been created; please reopen if there are any other issues, questions...
[15:48:46] <icinga-wm>	 RECOVERY - Host thumbor2001 is UP: PING OK - Packet loss = 0%, RTA = 36.22 ms
[15:48:46] <icinga-wm>	 RECOVERY - Host mw2336 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms
[15:48:56] <icinga-wm>	 PROBLEM - Host alert2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:50:00] <icinga-wm>	 PROBLEM - Host mw2338.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:50:00] <icinga-wm>	 PROBLEM - Host mw2339.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:50:40] <icinga-wm>	 PROBLEM - Host thumbor2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:50:40] <icinga-wm>	 PROBLEM - Host thumbor2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:51:36] <icinga-wm>	 PROBLEM - Host mw2335.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:51:36] <icinga-wm>	 PROBLEM - Host db2113.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:51:56] <wikibugs>	 (03PS8) 10Jcrespo: mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120)
[15:52:02] <icinga-wm>	 RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.29 ms
[15:52:16] <wikibugs>	 (03PS1) 10Aron Manning: Hotfix: "Undefined index: print" [extensions/ElectronPdfService] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608485 (https://phabricator.wikimedia.org/T256761)
[15:53:02] <icinga-wm>	 RECOVERY - Host mw2336.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.60 ms
[15:53:06] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/608068 (owner: 10Andrew Bogott)
[15:53:49] <wikibugs>	 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` relforge1004.eqiad.wmnet ` The log...
[15:53:54] <icinga-wm>	 RECOVERY - Host mw2337.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.67 ms
[15:54:01] <logmsgbot>	 !log otto@deploy1001 Started deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370 - take 3
[15:54:04] <logmsgbot>	 !log otto@deploy1001 Finished deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370 - take 3 (duration: 00m 03s)
[15:54:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:05] <stashbot>	 T256370: Camus should look for multiple possible timestamp fields to use for hourly partitioining - https://phabricator.wikimedia.org/T256370
[15:54:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:21] <papaul>	 the cY1 tech disconnected the wrong PDU that's why
[15:54:31] <chaomodus>	 d'oh
[15:54:48] <icinga-wm>	 RECOVERY - Host alert2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.75 ms
[15:55:52] <icinga-wm>	 RECOVERY - Host mw2338.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.78 ms
[15:55:52] <icinga-wm>	 RECOVERY - Host mw2339.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.66 ms
[15:56:06] <wkandek>	 papaul: wrong PDU in the right rack? 
[15:56:31] <icinga-wm>	 RECOVERY - Host thumbor2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.52 ms
[15:56:31] <icinga-wm>	 RECOVERY - Host thumbor2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.42 ms
[15:57:13] <rzl>	 papaul: ahh that'll do it :) thanks
[15:57:14] <wikibugs>	 (03PS1) 10Rush: peek: privacy review project renamed in Asana [puppet] - 10https://gerrit.wikimedia.org/r/608657
[15:57:16] <wikibugs>	 (03CR) 10Andrew Bogott: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/608068 (owner: 10Andrew Bogott)
[15:57:26] <icinga-wm>	 RECOVERY - Host mw2335.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.72 ms
[15:57:26] <icinga-wm>	 RECOVERY - Host db2113.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.64 ms
[15:58:14] <papaul>	 wkandek: it needs to be unplugged below the floor
[15:58:20] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/608068 (owner: 10Andrew Bogott)
[15:58:21] <icinga-wm>	 PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[15:58:41] <wkandek>	 papaul: ah, we need blue/red cables there as well :)
[15:59:13] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120) (owner: 10Jcrespo)
[15:59:52] <wikibugs>	 (03PS9) 10Jcrespo: mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120)
[16:00:05] <jouncebot>	 godog and _joe_: (Dis)respected human, time to deploy Puppet request window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200630T1600). Please do the needful.
[16:00:26] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to production servers for Ahmon Dancy - RelEng - https://phabricator.wikimedia.org/T256770 (10dancy)
[16:01:06] <wikibugs>	 (03CR) 10Rush: [C: 03+2] peek: privacy review project renamed in Asana [puppet] - 10https://gerrit.wikimedia.org/r/608657 (owner: 10Rush)
[16:01:20] <icinga-wm>	 PROBLEM - IPMI Sensor Status on mw2339 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[16:01:38] <icinga-wm>	 PROBLEM - IPMI Sensor Status on mw2336 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[16:01:44] <icinga-wm>	 PROBLEM - IPMI Sensor Status on mw2337 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[16:02:05] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to production servers for Ahmon Dancy - RelEng - https://phabricator.wikimedia.org/T256770 (10dancy)
[16:04:09] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to production servers for Ahmon Dancy - RelEng - https://phabricator.wikimedia.org/T256770 (10ssingh) a:03ssingh
[16:05:13] <wikibugs>	 10Operations, 10DBA, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10jcrespo) The proposed refactoring could broke the core/misc separation. We have deployed a far-from-ideal misc::idp_test (which I still have to...
[16:06:42] <wikibugs>	 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['relforge1004.eqiad.wmnet'] `  Of which those **FAILED**: ` ['relforge1004.eqiad.wmnet'...
[16:07:32] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to production servers for Ahmon Dancy - RelEng - https://phabricator.wikimedia.org/T256770 (10ssingh)
[16:08:11] <wikibugs>	 (03PS2) 10Ottomata: Camus eventlogging - consider meta.dt and dt for event partition time [puppet] - 10https://gerrit.wikimedia.org/r/608622 (https://phabricator.wikimedia.org/T256370)
[16:09:52] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Camus eventlogging - consider meta.dt and dt for event partition time [puppet] - 10https://gerrit.wikimedia.org/r/608622 (https://phabricator.wikimedia.org/T256370) (owner: 10Ottomata)
[16:09:54] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "> > Patch Set 11:" [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond)
[16:12:55] <wikibugs>	 (03CR) 10Jcrespo: "I just added a temporary misc section: https://gerrit.wikimedia.org/r/c/operations/puppet/+/608639" [puppet] - 10https://gerrit.wikimedia.org/r/608618 (owner: 10Kormat)
[16:17:11] <wikibugs>	 (03PS2) 10CDanis: MW PHP-FPM worker saturation: make it page [puppet] - 10https://gerrit.wikimedia.org/r/607163 (https://phabricator.wikimedia.org/T252605)
[16:18:04] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] MW PHP-FPM worker saturation: make it page [puppet] - 10https://gerrit.wikimedia.org/r/607163 (https://phabricator.wikimedia.org/T252605) (owner: 10CDanis)
[16:19:55] <wikibugs>	 (03PS2) 10BBlack: nvme formatting was missing for new codfw caches [puppet] - 10https://gerrit.wikimedia.org/r/608425 (https://phabricator.wikimedia.org/T256655)
[16:21:52] <wikibugs>	 (03PS1) 10Cmjohnson: Updating relforge1003-4 netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/608664 (https://phabricator.wikimedia.org/T241791)
[16:22:48] <wikibugs>	 (03CR) 10Cmjohnson: [C: 03+2] Updating relforge1003-4 netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/608664 (https://phabricator.wikimedia.org/T241791) (owner: 10Cmjohnson)
[16:25:04] <wikibugs>	 (03CR) 10Jcrespo: "Because idp database will eventually go to m1, we should change this patch to work for misc services and substitute profile::mariadb::ferm" [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond)
[16:25:53] <wikibugs>	 (03CR) 10Herron: [C: 03+1] puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) (owner: 10Jbond)
[16:26:58] <wikibugs>	 (03PS3) 10CDanis: MW PHP-FPM worker saturation: make it page [puppet] - 10https://gerrit.wikimedia.org/r/607163 (https://phabricator.wikimedia.org/T252605)
[16:27:10] <Majavah>	 tgr: I made an edit to MediaWiki:Sidebar to hopefully clear its caches, didn't affect anything
[16:28:53] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] MW PHP-FPM worker saturation: make it page [puppet] - 10https://gerrit.wikimedia.org/r/607163 (https://phabricator.wikimedia.org/T252605) (owner: 10CDanis)
[16:28:57] <Majavah>	 tgr: another thread appears to be working, https://en.wikipedia.beta.wmflabs.org/wiki/Topic:Vp9iok8ssazd6a62
[16:29:02] <Majavah>	 so it is cache?
[16:29:16] <tgr>	 that clears the message cache but not the sidebar cache, I think?
[16:29:34] <tgr>	 OTOH if it is the sidebar cache, that should affect all pages
[16:29:45] <wikibugs>	 (03PS4) 10Herron: site: add Logstash7 capacity [puppet] - 10https://gerrit.wikimedia.org/r/608279 (https://phabricator.wikimedia.org/T256443) (owner: 10Filippo Giunchedi)
[16:29:51] <Majavah>	 hmh, the first thread is working again
[16:30:50] <tgr>	 I guess the edit does clear both caches somehow, then
[16:31:17] <Majavah>	 no idea, but it's working again
[16:31:23] <tgr>	 are there affected wikis beyond mw.org?
[16:31:53] <Majavah>	 group0 flow wikis?
[16:31:53] <icinga-wm>	 RECOVERY - IPMI Sensor Status on mw2339 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[16:32:15] <icinga-wm>	 RECOVERY - IPMI Sensor Status on mw2336 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[16:32:21] <icinga-wm>	 RECOVERY - IPMI Sensor Status on mw2337 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[16:33:31] <wikibugs>	 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` relforge1004....
[16:33:35] <wikibugs>	 (03CR) 10Herron: [C: 03+2] site: add Logstash7 capacity [puppet] - 10https://gerrit.wikimedia.org/r/608279 (https://phabricator.wikimedia.org/T256443) (owner: 10Filippo Giunchedi)
[16:33:47] <Majavah>	 testwiki broken https://test.wikipedia.org/wiki/Topic:Vp9iyl4amw7ofe4n
[16:35:13] <tgr>	 that's mediawikiwiki, testwiki and officewiki
[16:35:25] <tgr>	 can be fixed by hand
[16:37:33] <icinga-wm>	 RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.53 ms
[16:37:51] <icinga-wm>	 PROBLEM - ps1-c3-codfw-infeed-load-tower-A-phase-Z on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:37:52] <tgr>	 I wonder if a purge is enough
[16:40:34] <wikibugs>	 (03CR) 10Jdlrobson: "Mayakpwiki is an analyst. I'm not sure why she has been added as a reviewer." [extensions/ElectronPdfService] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608485 (https://phabricator.wikimedia.org/T256761) (owner: 10Aron Manning)
[16:42:21] <icinga-wm>	 PROBLEM - ps1-c3-codfw-infeed-load-tower-A-phase-X on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:42:39] <icinga-wm>	 PROBLEM - ps1-c3-codfw-infeed-load-tower-A-phase-Y on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:43:01] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:43:10] <Majavah>	 the backport should probably get merged and deployed?
[16:44:40] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] nvme formatting was missing for new codfw caches [puppet] - 10https://gerrit.wikimedia.org/r/608425 (https://phabricator.wikimedia.org/T256655) (owner: 10BBlack)
[16:44:51] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:45:11] <wikibugs>	 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['relforge1004.eqiad.wmnet'] `  Of which those **FAILED**: ` ['rel...
[16:46:56] <tgr>	 no, group0 has been rolled back
[16:47:22] <tgr>	 I mean, it should be, but that doesn't affect things on group0 right now
[16:48:06] <cdanis>	 !log T256444 ✔️ cdanis@cp2030.codfw.wmnet ~ 🕐☕ sudo depool
[16:48:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:13] <stashbot>	 T256444: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444
[16:49:08] <Majavah>	 yeah, but the backport needs to be merged before the train can roll forward again?
[16:49:29] <tgr>	 yeah, we can do that in the backport window
[16:50:26] <Majavah>	 oh true
[16:52:50] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp2030 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2030
[16:57:30] <cdanis>	 !log T256444 restarted purged on cp2030 and repooling
[16:57:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:39] <stashbot>	 T256444: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444
[17:00:04] <jouncebot>	 halfak and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200630T1700).
[17:00:47] <RhinosF1>	 tgr: I doubt Demain will do the window. I'd just do it.
[17:01:22] <tgr>	 I have a meeting right now, so I'd wait for the window anyway
[17:01:31] <wikibugs>	 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Cmjohnson) @Gehel I am getting a partman error.  Is the partman recipe given correct?   The raid10-4dev is not working....
[17:01:34] <tgr>	 it's a good learning experience though
[17:03:01] <RhinosF1>	 tgr: he's said he can do it but he's not on irc, which is surprising
[17:03:55] <tgr>	 not everyone is online all the time, it's not so strange
[17:04:22] <RhinosF1>	 tgr: see phab
[17:04:33] <Majavah>	 I'll go get a package from my local store, I'll be back in ~20mins
[17:05:57] <logmsgbot>	 !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@f9df1af]: Update mobileapps to 5c7611b9
[17:06:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:09:31] <logmsgbot>	 !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@f9df1af]: Update mobileapps to 5c7611b9 (duration: 03m 33s)
[17:09:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:09] <wikibugs>	 (03PS1) 10Mholloway: Mobileapps: Update to 2020-06-29-163540-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/608693
[17:13:00] <wikibugs>	 (03CR) 10Mholloway: [C: 03+2] Mobileapps: Update to 2020-06-29-163540-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/608693 (owner: 10Mholloway)
[17:14:00] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:14:07] <wikibugs>	 (03Merged) 10jenkins-bot: Mobileapps: Update to 2020-06-29-163540-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/608693 (owner: 10Mholloway)
[17:15:40] <wikibugs>	 (03CR) 10Krinkle: mcrouter_wancache: Add mcrouter support for a machine-local memcached instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) (owner: 10RLazarus)
[17:17:19] <papaul>	 !log unplugging msw-c3 power to relocate port on PDU
[17:17:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:00] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[17:18:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:19:07] <icinga-wm>	 PROBLEM - ps1-c3-codfw-infeed-load-tower-A-phase-X on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:19:09] <icinga-wm>	 PROBLEM - ps1-c3-codfw-infeed-load-tower-B-phase-X on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:19:17] <icinga-wm>	 PROBLEM - Host alert2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[17:24:27] <icinga-wm>	 RECOVERY - Host alert2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.77 ms
[17:26:20] * Majavah back
[17:30:27] <wikibugs>	 (03CR) 10RLazarus: mcrouter_wancache: Add mcrouter support for a machine-local memcached instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) (owner: 10RLazarus)
[17:30:51] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] hiera: install mtail from component in codfw and eqsin [puppet] - 10https://gerrit.wikimedia.org/r/608450 (https://phabricator.wikimedia.org/T255776) (owner: 10Cwhite)
[17:32:07] <icinga-wm>	 RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:33:13] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[17:33:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:23] <icinga-wm>	 PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:37:52] <cdanis>	 !log ✔️ cdanis@netflow2001.codfw.wmnet ~ 🕜☕ sudo systemctl restart nfacctd
[17:37:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:55] <icinga-wm>	 RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:40:41] <mdholloway>	 !log mobileapps deployments on k8s failing with timeouts; filed T256786
[17:40:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:46] <stashbot>	 T256786: mobileapps kubernetes deployment is timing out - https://phabricator.wikimedia.org/T256786
[17:42:35] <icinga-wm>	 PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:44:22] <wikibugs>	 10Operations, 10Core Platform Team, 10Traffic, 10serviceops, and 3 others: Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (10aaron) >>! In T250205#6158994, @Krinkle wrote: >>>! In T250205#6154883, @aaron wrote: >> I'm not fond of the idea of not sending purges for in...
[17:45:57] <icinga-wm>	 RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:46:46] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] libraryupgrader: Add systemd units [puppet] - 10https://gerrit.wikimedia.org/r/607919 (https://phabricator.wikimedia.org/T173478) (owner: 10Legoktm)
[17:51:10] <dwisehaupt>	  /ac
[17:51:13] <icinga-wm>	 PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:51:30] <wikibugs>	 (03CR) 10Aaron Schulz: [C: 03+1] "+1 to what Tim said" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui)
[17:53:03] <icinga-wm>	 RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:56:44] <wikibugs>	 10Operations, 10netops: nfacctd segfaulting on netflow2001 - https://phabricator.wikimedia.org/T256790 (10CDanis)
[18:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Morning backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200630T1800).
[18:00:04] <jouncebot>	 tgr: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:23] <icinga-wm>	 PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:00:45] <tgr>	 it's just me, so I'll deploy it
[18:01:05] <wikibugs>	 (03CR) 10Muehlenhoff: "I think this patch can be abandoned now that we have the decom script and the insetup role." [puppet] - 10https://gerrit.wikimedia.org/r/508671 (https://phabricator.wikimedia.org/T222352) (owner: 10RobH)
[18:02:03] <wikibugs>	 (03Abandoned) 10RobH: splitting role::spare into staged and decomisssioning [puppet] - 10https://gerrit.wikimedia.org/r/508671 (https://phabricator.wikimedia.org/T222352) (owner: 10RobH)
[18:05:30] <cdanis>	 !log installing libc6-dbg on netflow2001 T256790
[18:05:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:05:34] <stashbot>	 T256790: nfacctd segfaulting on netflow2001 - https://phabricator.wikimedia.org/T256790
[18:08:54] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] Hotfix: "Undefined index: print" [extensions/ElectronPdfService] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608485 (https://phabricator.wikimedia.org/T256761) (owner: 10Aron Manning)
[18:11:51] <shdubsh>	 !log restart varnishmtail,atsmtail,ncredirmtail on ncredir,cp hosts in codfw and eqsin
[18:11:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:12:52] <wikibugs>	 (03Merged) 10jenkins-bot: Hotfix: "Undefined index: print" [extensions/ElectronPdfService] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608485 (https://phabricator.wikimedia.org/T256761) (owner: 10Aron Manning)
[18:12:55] <wikibugs>	 10Operations, 10ops-eqiad, 10netops: (Need by: 2019-09-30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10Cmjohnson)
[18:14:11] <wikibugs>	 10Operations, 10ops-eqiad, 10netops: (Need by: 2019-09-30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10Cmjohnson) new-msw1-eqiad has the correct JUNOS 18.1.3 and the configuration has been copied.   Currently connected to port 2 on the a8-scs and can be moved to th...
[18:15:05] <icinga-wm>	 RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:16:03] <wikibugs>	 (03PS1) 10Krinkle: findBadBlobs: better separate scan and mark modes. [core] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608667 (https://phabricator.wikimedia.org/T251778)
[18:16:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/608301 (https://phabricator.wikimedia.org/T256628) (owner: 10Jbond)
[18:18:00] <wikibugs>	 (03CR) 10Dwisehaupt: [C: 03+1] "Looks ok to me. Without knowing what's in the netbox included files it's tough to verify consistency.  I'm fine with the idea and process." [dns] - 10https://gerrit.wikimedia.org/r/607741 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans)
[18:20:37] <icinga-wm>	 PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:22:23] <icinga-wm>	 RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:23:17] <logmsgbot>	 !log tgr@deploy1001 Synchronized php-1.35.0-wmf.39/extensions/ElectronPdfService/src/ElectronPdfServiceHooks.php: Backport: [[gerrit:608485|Hotfix: "Undefined index: print" (T256761)]] (duration: 01m 05s)
[18:23:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:22] <stashbot>	 T256761: 1.35.0-wmf.39 breaks Flow - https://phabricator.wikimedia.org/T256761
[18:25:37] <Majavah>	 tgr: tested, working on testwiki
[18:25:55] <tgr>	 thx
[18:26:27] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:27:37] <tgr>	 !log Morning deploys done
[18:27:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:57] <icinga-wm>	 PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:28:17] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:28:21] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps: puppetize /etc/ldap.conf on sssd clients [puppet] - 10https://gerrit.wikimedia.org/r/608068 (owner: 10Andrew Bogott)
[18:29:22] <wikibugs>	 (03PS1) 10Andrew Bogott: role::puppetmaster::standalone: allow for overriding the hiera config [puppet] - 10https://gerrit.wikimedia.org/r/608706
[18:31:03] <wikibugs>	 10Operations, 10netops: nfacctd segfaulting on netflow2001 - https://phabricator.wikimedia.org/T256790 (10CDanis) Okay, here are some backtraces:  {P11715}  When I saw crashes in malloc and then installed libc6-dbg to get arguments, I was hoping that the issue was malloc being invoked with a ridiculous paramet...
[18:31:25] <cdanis>	 !log T256790 ✔️ cdanis@netflow2001.codfw.wmnet ~ 🕝☕ sudo apt install valgrind
[18:31:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:31:30] <stashbot>	 T256790: nfacctd segfaulting on netflow2001 - https://phabricator.wikimedia.org/T256790
[18:33:27] <icinga-wm>	 RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:34:21] <wikibugs>	 10Operations, 10Traffic: Switch blog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T254367 (10CKoerner_WMF)
[18:34:44] <wikibugs>	 (03PS1) 10Dzahn: gerrit::server: if acme_chief is not used, install certbot [puppet] - 10https://gerrit.wikimedia.org/r/608707
[18:34:50] <wikibugs>	 10Operations, 10Traffic: Switch blog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T254367 (10CKoerner_WMF) Update, as probably obvious, we have pushed the launch date back. Our target date is now July 14th.
[18:35:13] <wikibugs>	 10Operations, 10Traffic: Switch blog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T254367 (10Dzahn)
[18:36:38] <wikibugs>	 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Dzahn) a:05Dzahn→03Gehel
[18:39:00] <wikibugs>	 10Operations, 10netops: nfacctd segfaulting on netflow2001 - https://phabricator.wikimedia.org/T256790 (10CDanis) {P11716}
[18:39:49] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/23578/" [puppet] - 10https://gerrit.wikimedia.org/r/608707 (owner: 10Dzahn)
[18:40:00] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "noop in prod, cloud-only" [puppet] - 10https://gerrit.wikimedia.org/r/608707 (owner: 10Dzahn)
[18:42:23] <wikibugs>	 (03CR) 10Dzahn: "In case you are wondering how letsencrypt::cert::integrated fails:" [puppet] - 10https://gerrit.wikimedia.org/r/608707 (owner: 10Dzahn)
[18:43:21] <wikibugs>	 (03Abandoned) 10Dzahn: gerrit: allow for 3 different methods to get TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/607116 (owner: 10Dzahn)
[18:44:19] <wikibugs>	 (03PS1) 10Elukey: profile::mediawiki::alerts: tune mediawiki-errors to be more lenient [puppet] - 10https://gerrit.wikimedia.org/r/608708 (https://phabricator.wikimedia.org/T256459)
[18:49:15] <wikibugs>	 (03PS1) 10Gergő Tisza: Varnish: Include request ID in Set-Cookie warning [puppet] - 10https://gerrit.wikimedia.org/r/608709 (https://phabricator.wikimedia.org/T256395)
[18:52:01] <wikibugs>	 10Operations, 10netops: nfacctd segfaulting on netflow2001 - https://phabricator.wikimedia.org/T256790 (10CDanis) Filed upstream https://github.com/pmacct/pmacct/issues/414
[18:53:29] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] "LGTM once I9bca677 has been backported and deployed" [puppet] - 10https://gerrit.wikimedia.org/r/608709 (https://phabricator.wikimedia.org/T256395) (owner: 10Gergő Tisza)
[19:00:05] <jouncebot>	 hashar and twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Mediawiki train - European+American Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200630T1900).
[19:00:42] <wikibugs>	 (03PS3) 10CDanis: mediawiki: Change mw alerts to use a moving average [puppet] - 10https://gerrit.wikimedia.org/r/608188 (owner: 10Krinkle)
[19:01:32] <hasharAway>	 let's try once again
[19:01:51] <hashar>	 tgr: thanks for the assistance with the train blocker :]
[19:02:30] <wikibugs>	 (03PS1) 10Hashar: group0 wikis to 1.35.0-wmf.39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608711
[19:02:41] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] mediawiki: Change mw alerts to use a moving average [puppet] - 10https://gerrit.wikimedia.org/r/608188 (owner: 10Krinkle)
[19:02:45] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] mediawiki: Raise fatal alert treshold from 50 to 100 [puppet] - 10https://gerrit.wikimedia.org/r/608189 (owner: 10Krinkle)
[19:02:52] <wikibugs>	 (03PS3) 10CDanis: mediawiki: Raise fatal alert treshold from 50 to 100 [puppet] - 10https://gerrit.wikimedia.org/r/608189 (owner: 10Krinkle)
[19:03:55] <hashar>	 cdanis: Krinkle: I guess I can hold the train until those monitoring changes get deployed? ;)
[19:04:30] <wikibugs>	 (03PS4) 10CDanis: mediawiki: Raise fatal alert threshold from 50 to 100 [puppet] - 10https://gerrit.wikimedia.org/r/608189 (owner: 10Krinkle)
[19:04:48] <cdanis>	 it should just be a few minutes
[19:05:00] <Krinkle>	 thx cdanis 
[19:05:10] <cdanis>	 yeah thanks for writing them Krinkle 
[19:05:24] <cdanis>	 it's funny, we use irate() in some alert rules, and we use rate() on some traffic graphs
[19:05:27] <cdanis>	 exactly backwards :D
[19:06:29] <hashar>	 I will hold
[19:06:33] <hashar>	 err
[19:06:50] <hashar>	 I am holding the group0 promotion. Just let me know when icinga has been refreshed
[19:08:10] <wikibugs>	 (03PS1) 10Gergő Tisza: Fully set MW_NO_SESSION for browser metadata endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608713
[19:08:52] <cdanis>	 hashar: proceed
[19:10:25] * hashar presses [ENTER]
[19:10:40] <hashar>	 [EXITING] Pulled Revision group0 wikis to 1.35.0-wmf.39 did not match Enable validation of new signatures on Beta Cluster
[19:10:41] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] Fully set MW_NO_SESSION for browser metadata endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608713 (owner: 10Gergő Tisza)
[19:10:42] <hashar>	 pff
[19:10:57] <hashar>	 that is new to me
[19:11:19] <wikibugs>	 (03PS1) 10Hashar: group0 wikis to 1.35.0-wmf.39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608714
[19:11:41] <wikibugs>	 (03Abandoned) 10Hashar: group0 wikis to 1.35.0-wmf.39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608714 (owner: 10Hashar)
[19:11:56] <icinga-wm>	 PROBLEM - Check systemd state on mw2143 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:12:00] <icinga-wm>	 PROBLEM - Check systemd state on mw2141 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:12:04] <icinga-wm>	 PROBLEM - Check systemd state on mw2140 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:12:06] <icinga-wm>	 PROBLEM - Check systemd state on mw2194 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:12:08] <rzl>	 ^ looking
[19:12:20] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.35.0-wmf.39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608711 (owner: 10Hashar)
[19:12:26] <icinga-wm>	 PROBLEM - Check systemd state on mw2196 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:12:28] <icinga-wm>	 PROBLEM - Check systemd state on mw2208 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:12:30] <icinga-wm>	 PROBLEM - Check systemd state on mw2195 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:12:30] <icinga-wm>	 PROBLEM - Check systemd state on mw2142 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:12:31] <rzl>	 I thought that might be leftover noise from C3 but it's not that rack
[19:12:32] <shdubsh>	 ^ systemd state on mw*.codfw is mtail restarting
[19:12:38] <rzl>	 ahh thanks shdubsh 
[19:12:38] <icinga-wm>	 PROBLEM - Check systemd state on mw2189 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:12:38] <icinga-wm>	 PROBLEM - Check systemd state on mw2192 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:12:38] <icinga-wm>	 PROBLEM - Check systemd state on mw2199 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:12:50] <icinga-wm>	 PROBLEM - Check systemd state on mw2218 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:12:54] <icinga-wm>	 PROBLEM - Check systemd state on mw2217 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:12:58] <icinga-wm>	 RECOVERY - Check systemd state on mw2143 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:13:00] <icinga-wm>	 PROBLEM - Check systemd state on mw2201 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:13:00] <icinga-wm>	 RECOVERY - Check systemd state on mw2141 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:13:00] <icinga-wm>	 PROBLEM - Check systemd state on mw2210 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:13:00] <icinga-wm>	 PROBLEM - Check systemd state on mw2212 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:13:04] <icinga-wm>	 PROBLEM - Check systemd state on mw2214 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:13:04] <icinga-wm>	 RECOVERY - Check systemd state on mw2140 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:13:08] <rzl>	 anything we need to do about it? should we downtime that alert?
[19:13:08] <icinga-wm>	 PROBLEM - Check systemd state on mw2202 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:13:08] <icinga-wm>	 PROBLEM - Check systemd state on mw2207 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:13:10] <icinga-wm>	 PROBLEM - Check systemd state on mw2200 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:13:10] <icinga-wm>	 PROBLEM - Check systemd state on mw2204 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:13:17] <shdubsh>	 puppet has to correct the unit file :(
[19:13:30] <icinga-wm>	 PROBLEM - Check systemd state on mw2221 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:13:32] <icinga-wm>	 RECOVERY - Check systemd state on mw2142 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:13:34] <icinga-wm>	 PROBLEM - Check systemd state on mw2187 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:13:34] <icinga-wm>	 PROBLEM - Check systemd state on mw2220 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:13:40] <icinga-wm>	 RECOVERY - Check systemd state on mw2192 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:13:47] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.35.0-wmf.39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608711 (owner: 10Hashar)
[19:13:50] <shdubsh>	 sorry for the noise
[19:14:12] <icinga-wm>	 RECOVERY - Check systemd state on mw2200 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:14:34] <icinga-wm>	 RECOVERY - Check systemd state on mw2187 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:15:12] <icinga-wm>	 RECOVERY - Check systemd state on mw2194 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:15:48] <icinga-wm>	 RECOVERY - Check systemd state on mw2189 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:15:48] <icinga-wm>	 RECOVERY - Check systemd state on mw2199 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:15:50] <icinga-wm>	 PROBLEM - Check systemd state on mw2222 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:16:38] <icinga-wm>	 RECOVERY - Check systemd state on mw2196 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:16:44] <icinga-wm>	 RECOVERY - Check systemd state on mw2195 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:17:20] <icinga-wm>	 RECOVERY - Check systemd state on mw2201 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:17:22] <icinga-wm>	 RECOVERY - Check systemd state on mw2210 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:17:28] <icinga-wm>	 RECOVERY - Check systemd state on mw2202 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:18:38] <icinga-wm>	 RECOVERY - Check systemd state on mw2207 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:18:40] <icinga-wm>	 RECOVERY - Check systemd state on mw2204 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:19:23] <logmsgbot>	 !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group 0 wikis to 1.35.0-wmf.39 # T254176
[19:19:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:19:28] <stashbot>	 T254176: 1.35.0-wmf.39 deployment blockers - https://phabricator.wikimedia.org/T254176
[19:19:40] <icinga-wm>	 RECOVERY - Check systemd state on mw2212 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:20:08] <icinga-wm>	 RECOVERY - Check systemd state on mw2208 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:20:44] <icinga-wm>	 RECOVERY - Check systemd state on mw2218 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:20:48] <icinga-wm>	 RECOVERY - Check systemd state on mw2217 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:20:58] <icinga-wm>	 RECOVERY - Check systemd state on mw2214 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:21:24] <icinga-wm>	 RECOVERY - Check systemd state on mw2221 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:21:34] <icinga-wm>	 RECOVERY - Check systemd state on mw2220 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:21:46] <icinga-wm>	 RECOVERY - Check systemd state on mw2222 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:22:23] <hashar>	 train is quieter this time :]
[19:22:56] <DannyS712|away>	 Does the fix for Special:Undelete work?
[19:24:12] <hashar>	 DannyS712|away: at least there are no errors ;)
[19:24:41] <hashar>	 and I guess we need someone to undelete on one of the wikis to figure it out?
[19:24:47] <DannyS712|away>	 {{doing}}
[19:24:53] <hashar>	 at least there is no deprecation warning showing up yet
[19:25:00] <DannyS712|away>	 I nominate https://www.mediawiki.org/wiki/MediaWiki for deletion
[19:25:19] <hashar>	 as a deletionist kabal member, I approve
[19:25:35] <hashar>	 DannyS712|away: pick a more obscure page maybe? ;)
[19:25:53] <Majavah>	 but we decided on the last cabal meeting that there is no cabal!
[19:25:57] <hashar>	 testwiki should work
[19:26:10] <hashar>	 Majavah: that part has been deleted from the minutes. Due to rule #1
[19:26:17] <hashar>	 rule #1: if in doubt delete.
[19:26:31] <DannyS712|away>	 I was honestly about to - `Hashar said I could :) will restore in half a second` but I realized that fuzzy bot would then delete all of the translations
[19:26:42] <hashar>	 oh my
[19:26:54] <hashar>	 which means more Undeletes?
[19:27:08] <DannyS712|away>	 No, they would have to be undeleted manually
[19:27:13] <DannyS712|away>	 So I'm looking for a better page
[19:27:21] <hashar>	 my fault really, I should never have misled you into deleting that page by "approving" it. I apologize
[19:27:29] <Majavah>	 Project:Support desk?
[19:27:45] <hashar>	 well testwiki has .39 so we can try there
[19:27:57] <DannyS712|away>	 https://www.mediawiki.org/wiki/Special:Redirect/page/4473
[19:28:03] <DannyS712|away>	 Any objections?
[19:28:08] <Majavah>	 No
[19:28:54] <DannyS712|away>	 Okay, deleted and restored (@hashar that's your user page) - can you check the logs?
[19:28:59] <hashar>	 sure thing
[19:30:00] <hashar>	 DannyS712|away: deal, no deprecation showing.
[19:30:15] <DannyS712|away>	 thanks
[19:30:36] <hashar>	 well done :-]
[19:31:24] <DannyS712|away>	 Going to eat some ice cream - tried to set my nick to away, but apparently I already did :)
[19:31:53] <hashar>	 enjoy the ice cream!
[19:32:13] <Majavah>	 And I'm going to bed, given that the UBN was resolved and it's getting late
[19:32:28] <hashar>	 Majavah: thank you !!!
[19:42:16] <wikibugs>	 (03CR) 10Volans: "> Patch Set 1: Code-Review+1" [dns] - 10https://gerrit.wikimedia.org/r/607741 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans)
[19:42:21] <wikibugs>	 (03CR) 10Papaul: [C: 03+1] mgmt: netbox-generated data for frack mgmt codfw [dns] - 10https://gerrit.wikimedia.org/r/607741 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans)
[19:54:49] <hashar>	 trains look fine so I am out :) see you tomorrow
[19:57:14] <wikibugs>	 10Operations, 10ops-codfw, 10procurement: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 (10Papaul) 05Open→03Resolved Both new PDUs in rack C3 are installed and configured.   Problem: moved all network devices power to PS1 before disconnecting PS2. When Tech was ready to discon...
[19:57:29] <wikibugs>	 10Operations, 10ops-codfw, 10procurement: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 (10Papaul)
[20:12:43] <wikibugs>	 (03CR) 10Dwisehaupt: [C: 03+1] "@Volans Thanks! I pulled that down and it all looks good to me." [dns] - 10https://gerrit.wikimedia.org/r/607741 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans)
[20:13:47] <wikibugs>	 (03PS2) 10Andrew Bogott: role::puppetmaster::standalone: allow for overriding the hiera config [puppet] - 10https://gerrit.wikimedia.org/r/608706
[20:13:49] <wikibugs>	 (03PS1) 10Andrew Bogott: openstack db grants: don't do AAAA resolution on VMs [puppet] - 10https://gerrit.wikimedia.org/r/608719
[20:14:02] <wikibugs>	 (03PS2) 10Andrew Bogott: openstack db grants: don't do AAAA resolution on VMs [puppet] - 10https://gerrit.wikimedia.org/r/608719
[20:14:04] <wikibugs>	 (03PS3) 10Andrew Bogott: role::puppetmaster::standalone: allow for overriding the hiera config [puppet] - 10https://gerrit.wikimedia.org/r/608706
[20:16:16] <wikibugs>	 (03PS1) 10BryanDavis: toolforge: Use chained cert for mail relay TLS [puppet] - 10https://gerrit.wikimedia.org/r/608720 (https://phabricator.wikimedia.org/T256806)
[20:17:19] <wikibugs>	 (03PS1) 10Cwhite: mtail: remove component and upgrade mtail to 3.0.0-rc35-3~wmf2 across the fleet [puppet] - 10https://gerrit.wikimedia.org/r/608721 (https://phabricator.wikimedia.org/T255776)
[20:18:24] <wikibugs>	 (03CR) 10Cwhite: "mtail package to be deployed to main and this merged early next week." [puppet] - 10https://gerrit.wikimedia.org/r/608721 (https://phabricator.wikimedia.org/T255776) (owner: 10Cwhite)
[20:21:13] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] openstack db grants: don't do AAAA resolution on VMs [puppet] - 10https://gerrit.wikimedia.org/r/608719 (owner: 10Andrew Bogott)
[20:27:44] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review: log spam from mtail 3.0.0~rc19 on wezen - https://phabricator.wikimedia.org/T225604 (10colewhite) 05Open→03Resolved a:03colewhite wezen is no longer around and mtail has been upgraded to rc35 across the fleet.  this message does not appear to be spammi...
[20:27:46] <wikibugs>	 10Operations, 10Traffic: varnishmtail panics on buster - https://phabricator.wikimedia.org/T243591 (10colewhite)
[20:30:33] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] role::puppetmaster::standalone: allow for overriding the hiera config [puppet] - 10https://gerrit.wikimedia.org/r/608706 (owner: 10Andrew Bogott)
[20:32:21] <wikibugs>	 (03PS1) 10Dzahn: gerrit: fix apache cert pathes when acme_chief is not used in cloud [puppet] - 10https://gerrit.wikimedia.org/r/608723
[20:34:52] <wikibugs>	 (03PS2) 10Dzahn: gerrit: fix apache cert pathes when acme_chief is not used in cloud [puppet] - 10https://gerrit.wikimedia.org/r/608723
[20:36:19] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "noop in prod but fixes cloud https://puppet-compiler.wmflabs.org/compiler1002/23580/" [puppet] - 10https://gerrit.wikimedia.org/r/608723 (owner: 10Dzahn)
[20:39:32] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608459 (https://phabricator.wikimedia.org/T254014) (owner: 10Ryan Kemper)
[20:43:23] <wikibugs>	 (03PS1) 10Ryan Kemper: Enable replication in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/608726 (https://phabricator.wikimedia.org/T254014)
[20:46:21] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (10Nuria) Approved on my end. Please be so kind to familiarize yourself with the data access guidelines: https://wikitech.wikimedia.org/wiki/Analytics/Da...
[20:46:24] <wikibugs>	 (03CR) 10Ryan Kemper: "pcc: https://puppet-compiler.wmflabs.org/compiler1003/23581/" [puppet] - 10https://gerrit.wikimedia.org/r/608726 (https://phabricator.wikimedia.org/T254014) (owner: 10Ryan Kemper)
[20:49:44] <wikibugs>	 (03PS1) 10Dzahn: gerrit: do not define any replica hosts when in cloud [puppet] - 10https://gerrit.wikimedia.org/r/608727
[20:51:43] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gerrit: do not define any replica hosts when in cloud [puppet] - 10https://gerrit.wikimedia.org/r/608727 (owner: 10Dzahn)
[20:56:29] <wikibugs>	 (03PS1) 10Dzahn: gerrit: make $replica_hosts an optional parameter [puppet] - 10https://gerrit.wikimedia.org/r/608728
[20:57:40] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] gerrit: make $replica_hosts an optional parameter [puppet] - 10https://gerrit.wikimedia.org/r/608728 (owner: 10Dzahn)
[20:59:14] <wikibugs>	 (03PS1) 10Ryan Kemper: Isolate eqiad master maps1004 from cluster [puppet] - 10https://gerrit.wikimedia.org/r/608729 (https://phabricator.wikimedia.org/T254014)
[20:59:36] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Isolate eqiad master maps1004 from cluster [puppet] - 10https://gerrit.wikimedia.org/r/608729 (https://phabricator.wikimedia.org/T254014) (owner: 10Ryan Kemper)
[20:59:51] <wikibugs>	 (03PS1) 10Paladox: gerrit: Prevent new line being added under javaOptions [puppet] - 10https://gerrit.wikimedia.org/r/608672
[21:00:54] <wikibugs>	 (03PS2) 10Paladox: gerrit: Prevent new line being added under javaOptions [puppet] - 10https://gerrit.wikimedia.org/r/608672
[21:01:09] <wikibugs>	 (03PS2) 10Dzahn: gerrit: make $replica_hosts an optional parameter [puppet] - 10https://gerrit.wikimedia.org/r/608728
[21:01:30] <wikibugs>	 (03PS2) 10Ryan Kemper: Isolate eqiad master maps1004 from cluster [puppet] - 10https://gerrit.wikimedia.org/r/608729 (https://phabricator.wikimedia.org/T254014)
[21:01:33] <wikibugs>	 (03CR) 10Paladox: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/608672 (owner: 10Paladox)
[21:03:20] <wikibugs>	 (03CR) 10Dzahn: "noop in prod: https://puppet-compiler.wmflabs.org/compiler1002/23584/" [puppet] - 10https://gerrit.wikimedia.org/r/608728 (owner: 10Dzahn)
[21:03:22] <wikibugs>	 (03CR) 10Reedy: Add api.wikimedia.org and api.m.wikimedia.org DNS entries (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/599273 (https://phabricator.wikimedia.org/T246945) (owner: 10Ladsgroup)
[21:03:33] <wikibugs>	 (03PS3) 10Paladox: gerrit: Prevent new line being added under javaOptions [puppet] - 10https://gerrit.wikimedia.org/r/608672
[21:05:06] <wikibugs>	 (03CR) 10Paladox: gerrit: make $replica_hosts an optional parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608728 (owner: 10Dzahn)
[21:07:54] <wikibugs>	 10Operations, 10Traffic: Certain links being rejected by caching if opened in Internet Explorer - https://phabricator.wikimedia.org/T256302 (10Aklapper) @Urbanecm: What is the exact HTTP error type? Asking as that screenshot does not include any error message (probably not to expose the IP).  At least on the "...
[21:09:20] <wikibugs>	 (03CR) 10Dzahn: gerrit: make $replica_hosts an optional parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608728 (owner: 10Dzahn)
[21:10:11] <wikibugs>	 (03CR) 10Dzahn: "noop in prod: https://puppet-compiler.wmflabs.org/compiler1002/23590/" [puppet] - 10https://gerrit.wikimedia.org/r/608728 (owner: 10Dzahn)
[21:10:17] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gerrit: make $replica_hosts an optional parameter [puppet] - 10https://gerrit.wikimedia.org/r/608728 (owner: 10Dzahn)
[21:10:30] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] thanos: set consistency-delay on store [puppet] - 10https://gerrit.wikimedia.org/r/607783 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi)
[21:10:32] <wikibugs>	 (03CR) 10Ladsgroup: Add api.wikimedia.org and api.m.wikimedia.org DNS entries (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/599273 (https://phabricator.wikimedia.org/T246945) (owner: 10Ladsgroup)
[21:11:42] <wikibugs>	 (03PS1) 10Paladox: gerrit: Set heap for gerrit-prod-1001 [puppet] - 10https://gerrit.wikimedia.org/r/608673
[21:11:48] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608061 (https://phabricator.wikimedia.org/T251869) (owner: 10Filippo Giunchedi)
[21:11:55] <wikibugs>	 (03PS2) 10Paladox: gerrit: Set heap for gerrit-prod-1001 [puppet] - 10https://gerrit.wikimedia.org/r/608673
[21:11:59] <wikibugs>	 (03CR) 10Paladox: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/608673 (owner: 10Paladox)
[21:13:36] <wikibugs>	 (03PS3) 10Paladox: gerrit: Set heap for gerrit-prod-1001 [puppet] - 10https://gerrit.wikimedia.org/r/608673
[21:14:03] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608319 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi)
[21:14:25] <wikibugs>	 (03CR) 10Ryan Kemper: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/608729 Isolate eqiad master maps1004 from cluster" [puppet] - 10https://gerrit.wikimedia.org/r/608729 (https://phabricator.wikimedia.org/T254014) (owner: 10Ryan Kemper)
[21:17:11] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gerrit: Set heap for gerrit-prod-1001 [puppet] - 10https://gerrit.wikimedia.org/r/608673 (owner: 10Paladox)
[21:20:00] <wikibugs>	 (03PS2) 10Ladsgroup: Add api.wikimedia.org and api.m.wikimedia.org DNS entries [dns] - 10https://gerrit.wikimedia.org/r/599273 (https://phabricator.wikimedia.org/T246945)
[21:20:21] <wikibugs>	 (03PS1) 10Dzahn: gerrit: set replica_hosts to undefined in Hiera when in cloud [puppet] - 10https://gerrit.wikimedia.org/r/608732
[21:20:39] <wikibugs>	 (03CR) 10Paladox: [C: 03+1] gerrit: set replica_hosts to undefined in Hiera when in cloud [puppet] - 10https://gerrit.wikimedia.org/r/608732 (owner: 10Dzahn)
[21:21:08] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gerrit: set replica_hosts to undefined in Hiera when in cloud [puppet] - 10https://gerrit.wikimedia.org/r/608732 (owner: 10Dzahn)
[21:25:56] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "see comment inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608729 (https://phabricator.wikimedia.org/T254014) (owner: 10Ryan Kemper)
[21:32:33] <wikibugs>	 (03PS3) 10Ryan Kemper: Isolate eqiad master maps1004 from cluster [puppet] - 10https://gerrit.wikimedia.org/r/608729 (https://phabricator.wikimedia.org/T254014)
[21:33:08] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608729 (https://phabricator.wikimedia.org/T254014) (owner: 10Ryan Kemper)
[21:36:05] <wikibugs>	 (03CR) 10Platonides: toolforge: Use chained cert for mail relay TLS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608720 (https://phabricator.wikimedia.org/T256806) (owner: 10BryanDavis)
[21:37:14] <wikibugs>	 (03PS1) 10Dzahn: gerrit: add cron to auto-renew cert using certbot in cloud [puppet] - 10https://gerrit.wikimedia.org/r/608734
[21:38:12] <logmsgbot>	 !log crusnov@cumin1001 START - Cookbook sre.hosts.reboot-single
[21:38:12] <logmsgbot>	 !log crusnov@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99)
[21:38:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:51] <logmsgbot>	 !log crusnov@cumin1001 START - Cookbook sre.hosts.reboot-single
[21:38:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:12] <logmsgbot>	 !log crusnov@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[21:40:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:28] <logmsgbot>	 !log crusnov@cumin1001 START - Cookbook sre.hosts.reboot-single
[21:40:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:51] <logmsgbot>	 !log crusnov@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[21:42:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:43:39] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] gerrit: add cron to auto-renew cert using certbot in cloud [puppet] - 10https://gerrit.wikimedia.org/r/608734 (owner: 10Dzahn)
[21:43:56] <logmsgbot>	 !log crusnov@cumin1001 START - Cookbook sre.hosts.reboot-single
[21:44:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:57] <logmsgbot>	 !log crusnov@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[21:45:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:46:24] <logmsgbot>	 !log crusnov@cumin1001 START - Cookbook sre.hosts.reboot-single
[21:46:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:45] <logmsgbot>	 !log crusnov@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[21:48:46] <wikibugs>	 (03PS1) 10Paladox: gerrit: Set gerrit::server::replica_hosts to an empty array in devtools [puppet] - 10https://gerrit.wikimedia.org/r/608674
[21:48:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:50:29] <wikibugs>	 (03PS2) 10Paladox: gerrit: Set gerrit::server::replica_hosts to an empty array in devtools [puppet] - 10https://gerrit.wikimedia.org/r/608674
[21:51:31] <wikibugs>	 (03PS3) 10Paladox: gerrit: Set gerrit::server::replica_hosts to an empty array in devtools [puppet] - 10https://gerrit.wikimedia.org/r/608674
[21:52:04] <wikibugs>	 (03CR) 10Paladox: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/608674 (owner: 10Paladox)
[21:54:46] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "noop in prod https://puppet-compiler.wmflabs.org/compiler1002/23592/" [puppet] - 10https://gerrit.wikimedia.org/r/608674 (owner: 10Paladox)
[22:03:33] <wikibugs>	 (03PS4) 10Dzahn: gerrit: remove all database parameters / support [puppet] - 10https://gerrit.wikimedia.org/r/606549 (https://phabricator.wikimedia.org/T254158)
[22:08:06] <wikibugs>	 (03CR) 10Paladox: [C: 03+1] gerrit: remove all database parameters / support [puppet] - 10https://gerrit.wikimedia.org/r/606549 (https://phabricator.wikimedia.org/T254158) (owner: 10Dzahn)
[22:26:58] <icinga-wm>	 PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1001 job=burrow partition={4,5} site=eqiad topic={rsyslog-notice,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasour
[22:26:58] <icinga-wm>	 us/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[22:33:21] <wikibugs>	 (03PS1) 10CRusnov: offline_device: Clear primary IP addresses from device before deleting them. [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/608741
[22:35:55] <wikibugs>	 (03CR) 10CRusnov: [C: 03+2] "This fixes a bug that Papaul discovered. i have verified it works in netbox-dev." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/608741 (owner: 10CRusnov)
[22:36:06] <icinga-wm>	 PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1001 job=burrow partition={0,1,4,5} site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+pr
[22:36:06] <icinga-wm>	 cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[22:43:22] <icinga-wm>	 PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1001 job=burrow partition=5 site=eqiad topic=rsyslog-notice https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqi
[22:43:22] <icinga-wm>	 var-consumer_group=All
[22:48:54] <icinga-wm>	 RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[23:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening backport window(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200630T2300).
[23:03:09] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] Revert "Revert "Role for SDoC WDQS"" [puppet] - 10https://gerrit.wikimedia.org/r/602171 (owner: 10EBernhardson)
[23:09:38] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "beta cherry-picked" [puppet] - 10https://gerrit.wikimedia.org/r/608251 (https://phabricator.wikimedia.org/T99156) (owner: 10Krinkle)
[23:28:05] <mutante>	 Dereckson: hi, i noticed an issue with the database on your site
[23:42:49] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (10Krinkle)
[23:49:02] <Krinkle>	 mutante: https://gerrit.wikimedia.org/r/c/operations/puppet/+/603550
[23:49:22] <Krinkle>	 is that something you're fine with doing directly or would you like me to do that in mw config?
[23:50:08] <Krinkle>	 looking at the past 3 year history of private/.git on deploy1001 it seems SRE generally don't edit it.
[23:50:12] <Krinkle>	 I don't mind either way though
[23:51:45] <wikibugs>	 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar), 10User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (10CDanis) There's no alert yet for memcache NIC saturation, and I don't believe there's one for TKOs either (@elukey  is that ri...
[23:53:55] <wikibugs>	 (03CR) 10Krinkle: [C: 04-1] "Set this via Horizon instead where most other per-host config lives. That way it's not requiring SRE +2 and cherry picks etc." [puppet] - 10https://gerrit.wikimedia.org/r/598753 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan)
[23:54:34] <wikibugs>	 (03CR) 10Ppchelko: "We went completely another way, this should be abandoned." [puppet] - 10https://gerrit.wikimedia.org/r/598753 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan)
[23:54:40] <wikibugs>	 (03CR) 10Krinkle: [C: 04-1] "removing hashtag as it does not currently appear to be live on beta's puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/598753 (https://phabricator.wikimedia.org/T251176) (owner: 10Hnowlan)