[00:14:44] PROBLEM - Host restbase2009 is DOWN: PING CRITICAL - Packet loss = 100% [00:16:18] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:16:32] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:16:46] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:16:46] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:16:46] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:16:54] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:16:56] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed 
out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:17:14] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:17:26] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:17:28] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [00:17:28] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:17:28] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:17:34] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:17:40] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:18:02] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:18:18] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:18:32] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:18:32] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:18:42] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:18:44] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:19:02] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:19:08] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:19:16] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:19:22] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:19:24] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:19:32] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash 
job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:20:20] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:21:00] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [00:21:04] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:21:22] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:28:27] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP requests for Ahmon Dancy: wmf, releng, ciadmin - https://phabricator.wikimedia.org/T256658 (10thcipriani) 05Open→03Resolved @dancy confirmed he was able to get into logstash. Thanks all! 
[00:39:50] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 34439384 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:41:42] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 31368 and 26 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:03:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [01:05:16] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [01:22:06] PROBLEM - puppet last run on labtestpuppetmaster2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [02:05:55] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.35.0-wmf.39 [core] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608499 [02:06:57] (03PS2) 10DannyS712: Branch commit for wmf/1.35.0-wmf.39 [core] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608499 (https://phabricator.wikimedia.org/T254176) (owner: 10TrainBranchBot) [03:57:00] (03CR) 10Thcipriani: [C: 03+1] Branch commit for wmf/1.35.0-wmf.39 [core] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608499 (https://phabricator.wikimedia.org/T254176) (owner: 10TrainBranchBot) [04:44:22] (03PS1) 10Marostegui: mariadb: Move db1080 from s1 to m2 [puppet] - 10https://gerrit.wikimedia.org/r/608508 (https://phabricator.wikimedia.org/T253217) [04:44:53] (03CR) 10jerkins-bot: 
[V: 04-1] mariadb: Move db1080 from s1 to m2 [puppet] - 10https://gerrit.wikimedia.org/r/608508 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui) [04:46:57] 10Operations, 10DBA, 10Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (10Marostegui) >>! In T256538#6265626, @herron wrote: >>>! In T256538#6262958, @Marostegui wrote: >> @herron any idea how big these DBs can be and how many writes we'd be expecting? >> Wh... [04:47:57] (03CR) 10Marostegui: [V: 03+2 C: 03+2] "The -1 is a known issue that requires an entire refactoring for misc hosts." [puppet] - 10https://gerrit.wikimedia.org/r/608508 (https://phabricator.wikimedia.org/T253217) (owner: 10Marostegui) [04:50:34] (03CR) 10Ayounsi: "TIL, you can replace /edit with /preview at the end of the Gdoc URL to get a read only version." [puppet] - 10https://gerrit.wikimedia.org/r/608490 (owner: 10CDanis) [04:56:29] !log Remove pl_from index from db1096:3316 and db1098:3316 - T256684 [04:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:56:34] T256684: pl_from index still lingers in random hosts - https://phabricator.wikimedia.org/T256684 [04:57:18] !log jhuneidi@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [04:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:57:59] !log remove pl_from index from db1141, db1121, db1148 - T256684 [04:58:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:15] I'm getting this error when doing an image pull after trying to deploy to staging in kubernetes `x509: certificate has expired or is not yet valid`. What should I do?
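For context on the `x509: certificate has expired or is not yet valid` error quoted above: the message covers two cases, a current time before the certificate's notBefore or after its notAfter. The following is only a minimal sketch of that validity-window check, not the registry's or Kubernetes' actual code. The function name `classify_validity` and all dates are hypothetical; the only library call used is Python's standard `ssl.cert_time_to_seconds`.

```python
import ssl


def classify_validity(not_before: str, not_after: str, now: float) -> str:
    """Classify a certificate's validity window at the time `now`.

    `not_before` and `not_after` use the text form found in certificate
    dumps, e.g. "Jun 29 00:00:00 2020 GMT". `now` is seconds since the epoch.
    """
    start = ssl.cert_time_to_seconds(not_before)
    end = ssl.cert_time_to_seconds(not_after)
    if now < start:
        return "not yet valid"
    if now > end:
        return "expired"
    return "valid"


# Hypothetical values for illustration only; they are not taken from the log.
EXAMPLE_NOT_BEFORE = "Jan 01 00:00:00 2019 GMT"
EXAMPLE_NOT_AFTER = "Jan 01 00:00:00 2020 GMT"
EXAMPLE_NOW = ssl.cert_time_to_seconds("Mar 15 12:00:00 2020 GMT")

status = classify_validity(EXAMPLE_NOT_BEFORE, EXAMPLE_NOT_AFTER, EXAMPLE_NOW)
```

Because the comparison depends on the local clock, a badly skewed clock on the pulling host can produce the same error even when the certificate itself is fine, so it may be worth checking both the certificate dates and the host time.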
[05:13:45] !log Deploy schema change on s8 codfw - T256680 [05:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:13:50] T256680: imagelinks has index mismatch on s8 - https://phabricator.wikimedia.org/T256680 [05:45:06] (03PS1) 10Jeena Huneidi: Revert "blubberoid: Update to latest image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/608468 [05:47:05] (03CR) 10Jeena Huneidi: [C: 03+2] Revert "blubberoid: Update to latest image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/608468 (owner: 10Jeena Huneidi) [05:48:03] (03Merged) 10jenkins-bot: Revert "blubberoid: Update to latest image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/608468 (owner: 10Jeena Huneidi) [05:51:42] !log jhuneidi@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [05:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:29] PROBLEM - Host db1097 is DOWN: PING CRITICAL - Packet loss = 100% [06:18:19] PROBLEM - MariaDB Replica IO: m1 on db1117 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1097.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1097.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [06:18:27] PROBLEM - MariaDB Replica IO: m1 on db2132 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1097.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1097.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [06:18:51] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:19:10] shit [06:19:15] that's m1 master [06:19:17] checking [06:19:41] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [06:19:41] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [06:20:43] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:23:53] RECOVERY - Host db1097 is UP: PING WARNING - Packet loss = 50%, RTA = 0.25 ms [06:25:23] host back and proxies reloaded [06:25:27] looks like HW issue [06:25:30] task being created [06:25:37] RECOVERY - MariaDB Replica IO: m1 on db1117 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [06:25:47] RECOVERY - MariaDB Replica IO: m1 on db2132 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [06:26:06] we need to restart etherpad from what I can see [06:26:58] (03CR) 10Jcrespo: "Thanks for working on this. I have a few questions, as seen below." (038 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/607028 (https://phabricator.wikimedia.org/T255999) (owner: 10Privacybatm) [06:26:59] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [06:27:00] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [06:27:21] etherpad is back [06:27:44] It is back but super slow [06:28:05] Looks better now [06:28:13] how's the db? [06:28:43] ? [06:29:05] is the db up or down? 
[06:29:12] it is up [06:29:27] otherwise etherpad would still be down [06:29:28] should we reload etherpad? [06:29:35] I did already, check above [06:30:07] PROBLEM - ores on ores2009 is CRITICAL: connect to address 10.192.48.90 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [06:31:19] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:31:45] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:31:47] sorry, didn't see it on log [06:31:54] 10Operations, 10DBA: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717 (10Marostegui) [06:33:22] 10Operations, 10DBA: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717 (10Marostegui) [06:33:48] 10Operations, 10DBA: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717 (10Marostegui) p:05Triage→03High I was in process of moving db1080 to m2, but I will move it to m1 instead so we can replace and decommission this host. 
[06:35:33] RECOVERY - ores on ores2009 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.088 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [06:39:05] (03PS1) 10Marostegui: site.pp: Move db1080 to m1 instead of m2 [puppet] - 10https://gerrit.wikimedia.org/r/608543 (https://phabricator.wikimedia.org/T256717) [06:40:17] (03PS2) 10Marostegui: site.pp: Move db1080 to m1 instead of m2 [puppet] - 10https://gerrit.wikimedia.org/r/608543 (https://phabricator.wikimedia.org/T256717) [06:42:33] (03CR) 10Marostegui: [C: 03+2] site.pp: Move db1080 to m1 instead of m2 [puppet] - 10https://gerrit.wikimedia.org/r/608543 (https://phabricator.wikimedia.org/T256717) (owner: 10Marostegui) [07:09:57] (03PS1) 10Muehlenhoff: Extend access for mayakpwiki [puppet] - 10https://gerrit.wikimedia.org/r/608546 [07:10:11] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 53 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:14:23] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Extend access for mayakpwiki [puppet] - 10https://gerrit.wikimedia.org/r/608546 (owner: 10Muehlenhoff) [07:15:57] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:26:08] (03CR) 10Hashar: "Thank you, I have purged the php related packages from the releases* hosts:" [puppet] - 10https://gerrit.wikimedia.org/r/608452 (https://phabricator.wikimedia.org/T256164) (owner: 10Hashar) [07:33:44] 10Operations: cas-icinga intermittant failures - https://phabricator.wikimedia.org/T256720 (10Kormat) [07:38:09] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is 
CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [07:42:52] !log reboot cp3053 - T256632 [07:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:57] T256632: cp3053 nvme0 issues - https://phabricator.wikimedia.org/T256632 [07:46:56] Can someone please check in logstash if there have been any deprecation alerts for the use of `PageArchive::getRevision` since wmf.37 ? [07:47:04] ^^ a spike of mediawiki/core changes [07:48:09] ACKNOWLEDGEMENT - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] amusso spike of mediawiki/core changes https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [07:48:11] DannyS712: yes i can [07:48:29] Thanks - I hard deprecated it, but just found another use in non-deprecated code [07:48:40] if I find out how to search for them hehe [07:49:49] DannyS712: is there any task I can copy paste my findings? [07:50:01] https://phabricator.wikimedia.org/T249982 [07:50:02] I can share the rate over the last 7 days [07:52:38] DannyS712: seems there is only Special:Undelete [07:53:43] Yes, thats the use I found; for some reason I couldn't find the deprecation alerts on logstash-beta, so I wasn't sure if I was reading it right [07:54:23] DannyS712: https://phabricator.wikimedia.org/T249982#6266618 [07:54:40] maybe logstash-beta misses the proper log configuration :-\ [07:55:00] Yeah, https://logstash-beta.wmflabs.org/app/kibana#/dashboard/mediawiki-deprecated says no results found [07:55:45] Would you be willing to +2 https://gerrit.wikimedia.org/r/c/mediawiki/core/+/608470 so I can backport it? 
[07:56:35] I know absolutely nothing about mediawiki nowadays :-\ [07:56:53] (03PS3) 10Kormat: install_server: Remove no-srv-format.cfg [puppet] - 10https://gerrit.wikimedia.org/r/608306 (https://phabricator.wikimedia.org/T251768) [07:57:11] You can compare with https://gerrit.wikimedia.org/r/c/605689 to see that the change should be correct, but I understand. Is there a task for logstash-beta not including deprecation alerts? [07:57:19] DannyS712: then I am handling the train this week, and will be more than happy to deploy the hotfix at any time [07:57:40] (03CR) 10Kormat: "> We could add a stub partman config like "manual-setup.cfg" which only has a comment that a server with this kind of recipe gets installe" [puppet] - 10https://gerrit.wikimedia.org/r/608306 (https://phabricator.wikimedia.org/T251768) (owner: 10Kormat) [07:57:40] but I don't feel at all confident in approving a change to mediawiki ;D [07:57:45] okay [07:57:51] for logstash-beta I don't know, you might have been the first one to notice it [07:58:07] I guess the devil is probably inside the mediawiki-config, or logstash on beta is broken somehow [07:59:05] 10Operations, 10Puppet, 10User-jbond: puppetise puppet server copy of the public ca.pem - https://phabricator.wikimedia.org/T256721 (10jbond) [07:59:18] 10Operations, 10Puppet, 10User-jbond: puppetise puppet server copy of the public ca.pem - https://phabricator.wikimedia.org/T256721 (10jbond) p:05Triage→03High [07:59:32] Filed T256722 [07:59:33] T256722: Logstash-beta doesn't include any deprecation notices - https://phabricator.wikimedia.org/T256722 [07:59:46] (03CR) 10Privacybatm: "Thank you for your review, working on a new patch set."
(037 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/607028 (https://phabricator.wikimedia.org/T255999) (owner: 10Privacybatm) [08:01:59] !log disable puppet to restart puppetmasters front ends [08:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:46] Testing via eval.php to see what the wgMWLoggerDefaultSpi is set to [08:05:15] !log powercycle cp3053 (unresponsive after reboot) - T256632 [08:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:19] T256632: cp3053 nvme0 issues - https://phabricator.wikimedia.org/T256632 [08:05:52] !log Stop MySQL on db1117:3322 to clone db1080 (this will trigger haproxy alerts) - T256717 [08:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:56] T256717: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717 [08:07:11] RECOVERY - puppet last run on labtestpuppetmaster2001 is OK: OK: Puppet is currently disabled (restart puppet master frontends - jbond), not alerting. 
Last run 12 hours ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:10:38] !log 1.35.0-wmf.39 was branched at e169e3dabcb2217809fc41ba44b43a39ae1a678e T254176 [08:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:43] T254176: 1.35.0-wmf.39 deployment blockers - https://phabricator.wikimedia.org/T254176 [08:10:43] (03CR) 10Hashar: [C: 03+2] Branch commit for wmf/1.35.0-wmf.39 [core] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608499 (https://phabricator.wikimedia.org/T254176) (owner: 10TrainBranchBot) [08:11:01] PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [08:11:25] PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [08:11:57] ^ expected [08:15:40] (03CR) 10Jbond: [C: 03+2] run_ci_locally: use latests docker image [puppet] - 10https://gerrit.wikimedia.org/r/608269 (owner: 10Jbond) [08:16:39] PROBLEM - puppet last run on labtestpuppetmaster2001 is CRITICAL: CRITICAL: Puppet last ran 12 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:21:40] RECOVERY - puppet last run on labtestpuppetmaster2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:22:11] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10ema) >>! In T256444#6264956, @elukey wrote: > There may be another solution, namely creating a new apt component to hold 1.4.x and deploy it selectively wh... 
[08:23:53] !log repool cp3053 - T256632 [08:23:54] (03PS1) 10Kormat: cumin: Add db-role and db-section aliases [puppet] - 10https://gerrit.wikimedia.org/r/608558 [08:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:57] T256632: cp3053 nvme0 issues - https://phabricator.wikimedia.org/T256632 [08:25:31] (03PS1) 10Jcrespo: Revert "mariadb-backups: Move transferpy deployment to debian package" [puppet] - 10https://gerrit.wikimedia.org/r/608471 [08:26:07] (03CR) 10Marostegui: "For context: https://phabricator.wikimedia.org/P11705" [puppet] - 10https://gerrit.wikimedia.org/r/608471 (owner: 10Jcrespo) [08:27:48] (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb-backups: Move transferpy deployment to debian package" [puppet] - 10https://gerrit.wikimedia.org/r/608471 (owner: 10Jcrespo) [08:27:54] (03CR) 10jerkins-bot: [V: 04-1] cumin: Add db-role and db-section aliases [puppet] - 10https://gerrit.wikimedia.org/r/608558 (owner: 10Kormat) [08:29:41] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3053 nvme0 issues - https://phabricator.wikimedia.org/T256632 (10Vgutierrez) 05Open→03Stalled repooled after powercycling & issuing the following commands: ` /usr/sbin/nvme format /dev/nvme0n1 -l 2 echo ';' | /usr/sbin/sfdisk /dev/nvme0n1 ` I'll keep... 
[08:31:12] (03Merged) 10jenkins-bot: Branch commit for wmf/1.35.0-wmf.39 [core] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608499 (https://phabricator.wikimedia.org/T254176) (owner: 10TrainBranchBot) [08:31:30] RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [08:38:16] I am doing some basic train-related tasks this morning [08:38:46] !log scap prep 1.35.0-wmf.39 # T254176 [08:38:56] (03PS2) 10Kormat: cumin: Add db-role and db-section aliases [puppet] - 10https://gerrit.wikimedia.org/r/608558 [08:39:45] (03CR) 10jerkins-bot: [V: 04-1] cumin: Add db-role and db-section aliases [puppet] - 10https://gerrit.wikimedia.org/r/608558 (owner: 10Kormat) [08:40:49] (03PS3) 10Kormat: cumin: Add db-role and db-section aliases [puppet] - 10https://gerrit.wikimedia.org/r/608558 [08:42:14] (03CR) 10jerkins-bot: [V: 04-1] cumin: Add db-role and db-section aliases [puppet] - 10https://gerrit.wikimedia.org/r/608558 (owner: 10Kormat) [08:44:50] 10Operations, 10Traffic, 10Patch-For-Review, 10Sustainability (Incident Prevention): monitoring & alerting for purged - https://phabricator.wikimedia.org/T256446 (10ema) [08:45:14] RECOVERY - Device not healthy -SMART- on cp3053 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cp3053&var-datasource=esams+prometheus/ops [08:45:20] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [08:45:25] ^ me [08:48:06] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 1 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [08:50:02] PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [08:50:08] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:50:22] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:51:27] !log rolling restart of codfw cp nodes after "re-formatting" nvme devices - T256655 [08:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:33] T256655: Current codfw caches have wrong NVME format - https://phabricator.wikimedia.org/T256655 [08:51:55] !log Applied security patches to wmf/1.35.0-wmf.39 # T254176 [08:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:59] T254176: 1.35.0-wmf.39 deployment blockers - https://phabricator.wikimedia.org/T254176 [08:53:27] !log hashar@deploy1001 clean aborted: Pruned MediaWiki: 1.35.0-wmf.36 (duration: 00m 00s) [08:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:59] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [08:54:00] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) [08:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:05] 10Operations, 10Traffic, 10Patch-For-Review: Current codfw caches have wrong NVME format - https://phabricator.wikimedia.org/T256655 (10ops-monitoring-bot) Icinga downtime for 0:30:00 set by vgutierrez@cumin1001 on 2 host(s) and their services with reason: kernel upgrade ` cp[2027-2028].codfw.wmnet ` [08:56:05] (03PS1) 10Ema: purged: alert if local backlog grows past the given limits [puppet] - 10https://gerrit.wikimedia.org/r/608564 (https://phabricator.wikimedia.org/T256446) [08:56:38] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/608564 (https://phabricator.wikimedia.org/T256446) (owner: 10Ema) [08:58:00] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 1 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [08:59:02] (03PS4) 10Kormat: cumin: Add db-role and db-section aliases [puppet] - 10https://gerrit.wikimedia.org/r/608558 [08:59:36] (03PS1) 10DannyS712: Remove another use of PageArchive::getRevision [core] (wmf/1.35.0-wmf.38) - 10https://gerrit.wikimedia.org/r/608472 (https://phabricator.wikimedia.org/T249982) [09:00:02] (03PS1) 10DannyS712: Remove another use of PageArchive::getRevision [core] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608473 (https://phabricator.wikimedia.org/T249982) [09:00:25] (03CR) 10Kormat: "Marostegui for the feedback on the idea, Jbond for feedback on the implementation." 
[puppet] - 10https://gerrit.wikimedia.org/r/608558 (owner: 10Kormat) [09:00:56] hashar the patch was merged; cherry picked to 38 (currently deployed) and 39 (will be deployed later today) - if the 39 patch is merged soon it can just go out with the train, but the 38 would need to be deployed [09:00:57] (03Abandoned) 10Addshore: AdHocLogging for ReplicaMasterAwareRecordIdsAcquirer [extensions/Wikibase] (wmf/1.35.0-wmf.37) - 10https://gerrit.wikimedia.org/r/606692 (https://phabricator.wikimedia.org/T255855) (owner: 10Addshore) [09:02:42] RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [09:04:30] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: redis for docker-registry should have maxmemory-policy set to allkeys-lru - https://phabricator.wikimedia.org/T256726 (10JMeybohm) [09:05:57] (03PS1) 10Jbond: puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - 10https://gerrit.wikimedia.org/r/608565 [09:07:10] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - 10https://gerrit.wikimedia.org/r/608565 (owner: 10Jbond) [09:09:29] DannyS712: we have cut the .39 this night [09:09:42] but it is trivial to just approve it and refresh it on the deploy machine ;) [09:10:22] I guess we can include it in wmf.39? [09:10:34] and skip it from wmf.38 to avoid potential breakage [09:10:56] Yeah, that's why I wanted to make sure you saw it before deploying 39 began; since it hasn't caused any problems in the last couple weeks, backport might not be needed [09:11:16] k [09:11:22] I will +2 the wmf.39 one [09:12:11] (03CR) 10Hashar: "I will fetch it later today in order to have the patch included in the deployment today."
[core] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608473 (https://phabricator.wikimedia.org/T249982) (owner: 10DannyS712) [09:16:42] 10Operations, 10CAS-SSO: cas-icinga intermittant failures - https://phabricator.wikimedia.org/T256720 (10Peachey88) [09:18:25] 10Operations, 10CAS-SSO, 10User-jbond: cas-icinga intermittant failures - https://phabricator.wikimedia.org/T256720 (10Peachey88) [09:21:57] !log hashar@deploy1001 Pruned MediaWiki: 1.35.0-wmf.36 (duration: 28m 11s) [09:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:57] (03PS1) 10Jcrespo: Revert "Revert "mariadb-backups: Move transferpy deployment to debian package"" [puppet] - 10https://gerrit.wikimedia.org/r/608475 (https://phabricator.wikimedia.org/T256725) [09:23:08] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 64, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:23:26] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:27:38] (03CR) 10Arturo Borrero Gonzalez: "Do you have an example of client using this file, so we can have it in the commit message for future reference?" [puppet] - 10https://gerrit.wikimedia.org/r/608068 (owner: 10Andrew Bogott) [09:30:20] (03CR) 10Marostegui: "was this test? 
as in: the installer will stop at the partitioner but we can still run the partitioner manually and carry on with the insta" [puppet] - 10https://gerrit.wikimedia.org/r/608306 (https://phabricator.wikimedia.org/T251768) (owner: 10Kormat) [09:30:31] (03CR) 10Marostegui: "*tested" [puppet] - 10https://gerrit.wikimedia.org/r/608306 (https://phabricator.wikimedia.org/T251768) (owner: 10Kormat) [09:31:13] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/purged] - 10https://gerrit.wikimedia.org/r/608275 (owner: 10Ema) [09:35:42] RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [09:37:05] (03CR) 10Jcrespo: [C: 03+2] Revert "Revert "mariadb-backups: Move transferpy deployment to debian package"" [puppet] - 10https://gerrit.wikimedia.org/r/608475 (https://phabricator.wikimedia.org/T256725) (owner: 10Jcrespo) [09:38:20] (03PS1) 10Joal: Add analytics data purge for pageview_actor_hourly [puppet] - 10https://gerrit.wikimedia.org/r/608568 (https://phabricator.wikimedia.org/T256417) [09:40:17] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [09:40:19] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:23] 10Operations, 10Traffic, 10Patch-For-Review: Current codfw caches have wrong NVME format - https://phabricator.wikimedia.org/T256655 (10ops-monitoring-bot) Icinga downtime for 0:30:00 set by vgutierrez@cumin1001 on 2 host(s) and their services with reason: kernel upgrade ` cp[2029-2030].codfw.wmnet ` [09:42:46] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10JMeybohm) [09:43:03] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Remove deprecated and unmaintained image: 
envoy-tls-local-proxy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/608277 (https://phabricator.wikimedia.org/T253396) (owner: 10JMeybohm) [09:43:13] (03PS1) 10Jbond: cumin: testing [puppet] - 10https://gerrit.wikimedia.org/r/608571 [09:44:19] (03PS1) 10Awight: Set Status error if permission check returns false. [extensions/FileImporter] (wmf/1.35.0-wmf.38) - 10https://gerrit.wikimedia.org/r/608476 (https://phabricator.wikimedia.org/T256428) [09:44:48] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me, but maybe let's test this in addition with an sretest* host before attempting to use it on a db* or backup host?" [puppet] - 10https://gerrit.wikimedia.org/r/608306 (https://phabricator.wikimedia.org/T251768) (owner: 10Kormat) [09:45:46] (03PS4) 10Privacybatm: transferpy: Use logging package instead of print statements [software/transferpy] - 10https://gerrit.wikimedia.org/r/607028 (https://phabricator.wikimedia.org/T255999) [09:47:11] !log hashar@deploy1001 Pruned MediaWiki: 1.35.0-wmf.37 (duration: 02m 20s) [09:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:17] (03PS1) 10Awight: Embedded surveys are hidden when no element is available [extensions/QuickSurveys] (wmf/1.35.0-wmf.38) - 10https://gerrit.wikimedia.org/r/608477 (https://phabricator.wikimedia.org/T256627) [09:48:45] (03PS2) 10Jbond: cumin: testing [puppet] - 10https://gerrit.wikimedia.org/r/608571 [09:50:41] (03PS3) 10Jbond: cumin: testing [puppet] - 10https://gerrit.wikimedia.org/r/608571 [09:53:00] (03PS1) 10Awight: Configure TeWü survey on dewiki (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608478 (https://phabricator.wikimedia.org/T253112) [09:55:43] (03PS4) 10Jbond: cumin: testing [puppet] - 10https://gerrit.wikimedia.org/r/608571 [09:56:38] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] 
https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [10:01:02] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: redis for docker-registry should have maxmemory-policy set to allkeys-lru - https://phabricator.wikimedia.org/T256726 (10JMeybohm) p:05Triage→03Low [10:01:40] (03PS5) 10Jbond: cumin: testing [puppet] - 10https://gerrit.wikimedia.org/r/608571 [10:01:46] (03CR) 10Hashar: [C: 03+2] "Reviewed by Thiemo in master. It is trivial enough I don't see a reason for waiting next week. Will update the code on the deployment ser" [core] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608473 (https://phabricator.wikimedia.org/T249982) (owner: 10DannyS712) [10:02:06] !log volker-e@deploy1001 Started deploy [design/style-guide@e3fda83]: Deploy design/style-guide: [10:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:13] !log volker-e@deploy1001 Finished deploy [design/style-guide@e3fda83]: Deploy design/style-guide: (duration: 00m 07s) [10:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:03] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [10:03:04] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:07] (03PS1) 10Marostegui: db1080: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/608580 (https://phabricator.wikimedia.org/T256717) [10:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:12] (03PS6) 10Jbond: cumin: testing [puppet] - 10https://gerrit.wikimedia.org/r/608571 [10:03:20] 10Operations, 10DBA, 10Patch-For-Review: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717 (10Marostegui) db1080 is ready. 
Now we just need to schedule another m1 failover to promote db1080 to master. [10:03:48] (03CR) 10Marostegui: [C: 03+2] db1080: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/608580 (https://phabricator.wikimedia.org/T256717) (owner: 10Marostegui) [10:04:39] !log rolling restart of eqiad cache nodes to catch up on kernel upgrades [10:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:18] (03PS7) 10Jbond: cumin: testing [puppet] - 10https://gerrit.wikimedia.org/r/608571 [10:06:20] (03CR) 10Kormat: "Marostegui wrote:" [puppet] - 10https://gerrit.wikimedia.org/r/608306 (https://phabricator.wikimedia.org/T251768) (owner: 10Kormat) [10:06:23] 10Operations, 10Traffic, 10Patch-For-Review, 10User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10elukey) >>! In T256444#6266680, @ema wrote: > Well, [[ https://github.com/edenhill/librdkafka/issues/2020 | upstream claims ]] that the new versions are AP... 
[10:07:06] (03CR) 10Marostegui: [C: 03+1] "I didn't recall if this was tested, if it has been tested then +1 :)" [puppet] - 10https://gerrit.wikimedia.org/r/608306 (https://phabricator.wikimedia.org/T251768) (owner: 10Kormat) [10:09:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1076', diff saved to https://phabricator.wikimedia.org/P11708 and previous config saved to /var/cache/conftool/dbconfig/20200630-100912-marostegui.json [10:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:49] !log Deploy schema change on db1076 [10:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:25] (03PS8) 10Jbond: cumin: testing [puppet] - 10https://gerrit.wikimedia.org/r/608571 [10:11:27] 10Operations: consider hybrid caching options for ssd+disk - https://phabricator.wikimedia.org/T88992 (10Aklapper) [10:12:08] 10Operations, 10Patch-Needs-Improvement: puppet should try to mount all mountable swift filesystems - https://phabricator.wikimedia.org/T126574 (10Aklapper) [10:12:16] (03PS9) 10Jbond: cumin: testing [puppet] - 10https://gerrit.wikimedia.org/r/608571 [10:15:10] (03CR) 10Kormat: [C: 03+2] install_server: Remove no-srv-format.cfg [puppet] - 10https://gerrit.wikimedia.org/r/608306 (https://phabricator.wikimedia.org/T251768) (owner: 10Kormat) [10:23:37] 10Operations, 10DBA, 10User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (10Kormat) [10:24:38] 10Operations, 10DBA, 10User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (10Marostegui) [10:25:12] 10Operations, 10DBA, 10User-Kormat: Port DBA dashboards to thanos - https://phabricator.wikimedia.org/T256730 (10Kormat) p:05Triage→03Medium [10:25:29] (03Merged) 10jenkins-bot: Remove another use of PageArchive::getRevision [core] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608473 (https://phabricator.wikimedia.org/T249982) (owner: 10DannyS712) [10:26:32] PROBLEM - 
MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:29:07] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal: Add Guergana Tzatchkova to the ldap/wmde group - https://phabricator.wikimedia.org/T256201 (10Tobi_WMDE_SW) >>! In T256201#6265822, @ssingh wrote: > > @guergana.tzatchkova: Once the NDA is confirmed, the only other thing we will need is a confirmation from... [10:30:12] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:30:39] DannyS712: syncing your change for wmf.39 . It will be deployed on testwikis this afternoon [10:30:54] !log hashar@deploy1001 Synchronized php-1.35.0-wmf.39/includes/specials/SpecialUndelete.php: Remove another use of PageArchive::getRevision - T249982 T254176 (duration: 00m 56s) [10:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:00] T249982: PageArchive::getRevision uses timestamp, suggested replacement uses rev id - https://phabricator.wikimedia.org/T249982 [10:31:00] T254176: 1.35.0-wmf.39 deployment blockers - https://phabricator.wikimedia.org/T254176 [10:31:26] and the mediawiki-error alarm is due to a spike of " Bad UTF-8". 
There is a task for it [10:34:36] (03CR) 10Marostegui: cumin: Add db-role and db-section aliases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608558 (owner: 10Kormat) [10:34:48] (03PS5) 10Kormat: cumin: Add db-role and db-section aliases [puppet] - 10https://gerrit.wikimedia.org/r/608558 [10:37:42] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [10:37:43] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:03] (03PS1) 10Jbond: DO NOT MERGE: example refactor [puppet] - 10https://gerrit.wikimedia.org/r/608586 [10:38:21] (03CR) 10Jbond: [C: 04-1] DO NOT MERGE: example refactor [puppet] - 10https://gerrit.wikimedia.org/r/608586 (owner: 10Jbond) [10:38:35] !log cp2040: upgrade librdkafka1 to 0.11.6-1.1wmf1 https://phabricator.wikimedia.org/P11703 T256444 [10:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:39] T256444: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 [10:41:19] !log cp2040: restart purged and varnishkafka to use updated librdkafka1 T256444 [10:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:25] hashar should I abandon the .38 backport? 
[10:43:57] DannyS712: yeah [10:44:10] well theoretically we could deploy it, but I would rather not take the risk :-] [10:44:20] given the fix will make it into this week's train [10:45:33] (03Abandoned) 10DannyS712: Remove another use of PageArchive::getRevision [core] (wmf/1.35.0-wmf.38) - 10https://gerrit.wikimedia.org/r/608472 (https://phabricator.wikimedia.org/T249982) (owner: 10DannyS712) [10:45:53] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [10:45:54] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:59] 10Operations, 10Traffic, 10Patch-For-Review: Current codfw caches have wrong NVME format - https://phabricator.wikimedia.org/T256655 (10ops-monitoring-bot) Icinga downtime for 0:30:00 set by vgutierrez@cumin1001 on 2 host(s) and their services with reason: kernel upgrade ` cp[2031-2032].codfw.wmnet ` [10:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:39] lunch break [10:52:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1076', diff saved to https://phabricator.wikimedia.org/P11710 and previous config saved to /var/cache/conftool/dbconfig/20200630-105254-marostegui.json [10:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:13] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [10:59:14] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:59:15] !log upload librdkafka 0.11.6-1.1wmf1 to buster-wikimedia https://phabricator.wikimedia.org/P11703 T256444 [10:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:23] T256444: several purgeds badly backlogged (> 10 days) - 
https://phabricator.wikimedia.org/T256444 [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: How many deployers does it take to do European mid-day backport window(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200630T1100). [11:00:04] awight: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:07] 👋 [11:00:12] :-) [11:00:16] hey Lucas_WMDE [11:00:29] I'll deploy my patches now. [11:00:32] awight: do you want to deploy the changes yourself? [11:00:35] ah ok [11:00:39] Thanks! [11:01:22] (03PS8) 10JMeybohm: chartmuseum: Add initial module, profile and role [puppet] - 10https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843) [11:01:36] (03CR) 10Awight: [C: 03+2] "BACON" [extensions/FileImporter] (wmf/1.35.0-wmf.38) - 10https://gerrit.wikimedia.org/r/608476 (https://phabricator.wikimedia.org/T256428) (owner: 10Awight) [11:02:12] (03CR) 10Awight: [C: 03+2] "BACON" [extensions/QuickSurveys] (wmf/1.35.0-wmf.38) - 10https://gerrit.wikimedia.org/r/608477 (https://phabricator.wikimedia.org/T256627) (owner: 10Awight) [11:02:31] (03CR) 10JMeybohm: "This change is ready for review." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606956 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [11:03:14] All sites were unreachable for me for a few minutes, 100% packet loss. From the lack of panic on this channel I suppose that was just me?... 
[11:03:36] Maybe a routing problem between my ISP (1&1) and esams [11:03:42] (03PS10) 10Jbond: cumin: testing [puppet] - 10https://gerrit.wikimedia.org/r/608571 [11:04:37] duesen: might have been you yeah, I don't see a drop here https://grafana.wikimedia.org/d/000000501/prometheus-varnish-http-requests?orgId=1 [11:04:51] (03CR) 10jerkins-bot: [V: 04-1] cumin: testing [puppet] - 10https://gerrit.wikimedia.org/r/608571 (owner: 10Jbond) [11:05:24] ok then! everything is back to normal for me as well. [11:06:33] (03PS11) 10Jbond: cumin: testing [puppet] - 10https://gerrit.wikimedia.org/r/608571 [11:07:48] (03CR) 10jerkins-bot: [V: 04-1] cumin: testing [puppet] - 10https://gerrit.wikimedia.org/r/608571 (owner: 10Jbond) [11:08:47] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 51 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:11:21] (03CR) 10Jforrester: [C: 03+1] "Note that all the node6 images used in CI are based on wikimedia-jessie; we're blocked on removing this by the migration of production ser" [puppet] - 10https://gerrit.wikimedia.org/r/587529 (https://phabricator.wikimedia.org/T249724) (owner: 10Alexandros Kosiaris) [11:13:20] !log deneb: systemctl restart docker-reporter-base-images.service [11:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:38] (03PS1) 10Jbond: cumin: Add db-role and db-section aliases [puppet] - 10https://gerrit.wikimedia.org/r/608596 [11:14:27] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 48 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:14:37] (03PS2) 10Jbond: cumin: Add db-role and db-section aliases [puppet] - 
10https://gerrit.wikimedia.org/r/608596 [11:14:55] 10Operations, 10DBA, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10Marostegui) @jbond is the scope of this task done or is there anything else left? [11:15:27] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:17:11] (03PS3) 10Jbond: cumin: Add db-role and db-section aliases [puppet] - 10https://gerrit.wikimedia.org/r/608596 [11:17:13] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:18:59] (03PS4) 10Jbond: cumin: Add db-role and db-section aliases [puppet] - 10https://gerrit.wikimedia.org/r/608596 [11:20:42] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:22:59] (03Merged) 10jenkins-bot: Set Status error if permission check returns false. 
[extensions/FileImporter] (wmf/1.35.0-wmf.38) - 10https://gerrit.wikimedia.org/r/608476 (https://phabricator.wikimedia.org/T256428) (owner: 10Awight) [11:23:02] (03Merged) 10jenkins-bot: Embedded surveys are hidden when no element is available [extensions/QuickSurveys] (wmf/1.35.0-wmf.38) - 10https://gerrit.wikimedia.org/r/608477 (https://phabricator.wikimedia.org/T256627) (owner: 10Awight) [11:24:51] (03PS5) 10Jbond: umin: Add db-role and db-section aliases [puppet] - 10https://gerrit.wikimedia.org/r/608596 [11:25:00] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:26:39] !log awight@deploy1001 Synchronized php-1.35.0-wmf.38/extensions/FileImporter: BACON: [[gerrit:608476|Set Status error if permission check returns false. (T256428)]] (duration: 00m 58s) [11:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:46] T256428: FileImporter thinks I’m an admin even though I’m not - https://phabricator.wikimedia.org/T256428 [11:26:48] (03PS2) 10Awight: Configure TeWü survey on dewiki (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608478 (https://phabricator.wikimedia.org/T253112) [11:26:56] (03CR) 10Awight: [C: 03+2] "BACON" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608478 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight) [11:28:23] (03Merged) 10jenkins-bot: Configure TeWü survey on dewiki (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608478 (https://phabricator.wikimedia.org/T253112) (owner: 10Awight) [11:28:31] !log awight@deploy1001 Synchronized php-1.35.0-wmf.38/extensions/QuickSurveys: BACON: [[gerrit:608477|Embedded surveys are hidden when no element is available (T256627)]] (duration: 00m 56s) [11:28:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:35] T256627: Embedded surveys are incorrectly shown even when embed element is missing - https://phabricator.wikimedia.org/T256627 [11:30:30] (03PS6) 10Jbond: umin: Add db-role and db-section aliases [puppet] - 10https://gerrit.wikimedia.org/r/608596 [11:31:58] !log pushed a scratch docker image as docker-registry.discovery.wmnet/envoy-tls-local-proxy:dontuseme - T253396 [11:31:59] !log restarted docker-reporter-base-images and docker-reporter-releng-images on deneb - T253396 [11:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:02] T253396: Upgrade all TLS enabled charts to v0.2 tls_helper - https://phabricator.wikimedia.org/T253396 [11:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:41] !log awight@deploy1001 Synchronized wmf-config/InitialiseSettings.php: BACON: [[gerrit:608478|Configure TeWü survey on dewiki (take 2) (T253112)]] (duration: 00m 58s) [11:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:44] T253112: Create survey for TechWish prototype announcements on dewiki and metawiki - https://phabricator.wikimedia.org/T253112 [11:33:05] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Add TLS termination to services running on kubernetes - https://phabricator.wikimedia.org/T235411 (10JMeybohm) [11:33:13] !log EU BACON cooked [11:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:38] (03CR) 10Ema: [C: 03+2] Build-depend on go 1.14 [software/purged] - 10https://gerrit.wikimedia.org/r/608275 (owner: 10Ema) [11:35:48] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [11:35:48] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:53] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:50] (03PS4) 10Alexandros Kosiaris: lvs: Switch proton to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/607532 (https://phabricator.wikimedia.org/T225680) [11:38:07] (03CR) 10Alexandros Kosiaris: [C: 03+2] lvs: Switch proton to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/607532 (https://phabricator.wikimedia.org/T225680) (owner: 10Alexandros Kosiaris) [11:39:33] (03PS7) 10Jbond: umin: Add db-role and db-section aliases [puppet] - 10https://gerrit.wikimedia.org/r/608596 [11:39:39] (03PS1) 10Alexandros Kosiaris: redis::misc: Set docker-registry maxmemory-policy [puppet] - 10https://gerrit.wikimedia.org/r/608600 (https://phabricator.wikimedia.org/T256726) [11:42:00] (03PS8) 10Jbond: cumin: Add db-role and db-section aliases [puppet] - 10https://gerrit.wikimedia.org/r/608596 [11:45:18] PROBLEM - PyBal connections to etcd on lvs1016 is CRITICAL: CRITICAL: 101 connections established with conf1004.eqiad.wmnet:4001 (min=102) https://wikitech.wikimedia.org/wiki/PyBal [11:45:46] (03PS12) 10Jbond: cumin: testing [puppet] - 10https://gerrit.wikimedia.org/r/608571 [11:45:48] (03PS2) 10Jbond: DO NOT MERGE: example refactor [puppet] - 10https://gerrit.wikimedia.org/r/608586 [11:46:41] (03CR) 10jerkins-bot: [V: 04-1] cumin: testing [puppet] - 10https://gerrit.wikimedia.org/r/608571 (owner: 10Jbond) [11:47:50] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 58 connections established with conf2001.codfw.wmnet:2379 (min=59) https://wikitech.wikimedia.org/wiki/PyBal [11:48:32] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.21:4030]) https://wikitech.wikimedia.org/wiki/PyBal [11:48:56] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.21:4030]) https://wikitech.wikimedia.org/wiki/PyBal [11:49:00] akosiaris: the pybal alerts are due to your LVS 
changes I suppose, right? ^ [11:49:02] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.21:4030]) https://wikitech.wikimedia.org/wiki/PyBal [11:49:24] PROBLEM - PyBal connections to etcd on lvs1015 is CRITICAL: CRITICAL: 69 connections established with conf1004.eqiad.wmnet:4001 (min=70) https://wikitech.wikimedia.org/wiki/PyBal [11:49:46] (03PS3) 10Jbond: DO NOT MERGE: example refactor [puppet] - 10https://gerrit.wikimedia.org/r/608586 [11:50:26] PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 78 connections established with conf2001.codfw.wmnet:2379 (min=79) https://wikitech.wikimedia.org/wiki/PyBal [11:51:01] (03CR) 10jerkins-bot: [V: 04-1] DO NOT MERGE: example refactor [puppet] - 10https://gerrit.wikimedia.org/r/608586 (owner: 10Jbond) [11:51:03] (03PS13) 10Jbond: cumin: testing [puppet] - 10https://gerrit.wikimedia.org/r/608571 [11:52:20] (03PS2) 10Jbond: puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) [11:54:12] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.21:4030]) https://wikitech.wikimedia.org/wiki/PyBal [11:55:06] that's me ^, ignore please [11:55:07] (03PS3) 10Jbond: puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - 10https://gerrit.wikimedia.org/r/608565 [11:56:19] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - 10https://gerrit.wikimedia.org/r/608565 (owner: 10Jbond) [11:58:58] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:59:30] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:00:04] Deploy 
window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200630T1200) [12:00:50] RECOVERY - PyBal connections to etcd on lvs1016 is OK: OK: 102 connections established with conf1004.eqiad.wmnet:4001 (min=102) https://wikitech.wikimedia.org/wiki/PyBal [12:00:57] RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 79 connections established with conf2001.codfw.wmnet:2379 (min=79) https://wikitech.wikimedia.org/wiki/PyBal [12:01:32] (03PS1) 10Elukey: Fix tests for multi-threading code [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/608604 [12:02:44] (03PS4) 10Jbond: puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - 10https://gerrit.wikimedia.org/r/608565 [12:03:22] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [12:03:24] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:36] 10Operations, 10Traffic, 10Patch-For-Review: Current codfw caches have wrong NVME format - https://phabricator.wikimedia.org/T256655 (10ops-monitoring-bot) Icinga downtime for 0:30:00 set by vgutierrez@cumin1001 on 2 host(s) and their services with reason: kernel upgrade ` cp[2033-2034].codfw.wmnet ` [12:04:16] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - 10https://gerrit.wikimedia.org/r/608565 (owner: 10Jbond) [12:07:05] (03PS2) 10Elukey: Fix tests for multi-threading code [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/608604 [12:07:11] 10Operations, 10DBA, 10Patch-For-Review: db1097 (m1 master) crashed due to memory issues. - https://phabricator.wikimedia.org/T256717 (10Marostegui) Actually I just realised that this host won't be replaced next FY, as we are replacing up to db1095. 
[12:07:36] (03CR) 10Elukey: [V: 03+2 C: 03+2] Fix tests for multi-threading code [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/608604 (owner: 10Elukey) [12:08:06] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:08:33] (03PS4) 10Alexandros Kosiaris: lvs: Switch proton to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/607533 (https://phabricator.wikimedia.org/T225680) [12:08:56] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 59 connections established with conf2001.codfw.wmnet:2379 (min=59) https://wikitech.wikimedia.org/wiki/PyBal [12:10:16] (03CR) 10Alexandros Kosiaris: [C: 03+2] lvs: Switch proton to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/607533 (https://phabricator.wikimedia.org/T225680) (owner: 10Alexandros Kosiaris) [12:10:38] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:10:42] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:10:43] (03PS5) 10Jbond: puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - 10https://gerrit.wikimedia.org/r/608565 [12:11:04] RECOVERY - PyBal connections to etcd on lvs1015 is OK: OK: 70 connections established with conf1004.eqiad.wmnet:4001 (min=70) https://wikitech.wikimedia.org/wiki/PyBal [12:11:16] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:11:29] (PS6) Kormat: cumin: Add db-role and db-section aliases [puppet] - https://gerrit.wikimedia.org/r/608558 [12:11:59] (CR) jerkins-bot: [V: -1] puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - https://gerrit.wikimedia.org/r/608565 (owner: Jbond) [12:12:39] (CR) Kormat: [C: -1] "Based on input from Jbond, i'm going to take a different approach for the enumeration of valid states/sections. I'll rebase this CR once t" [puppet] - https://gerrit.wikimedia.org/r/608558 (owner: Kormat) [12:15:59] (PS6) Jbond: puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) [12:21:59] (CR) Jcrespo: "> Patch Set 6: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/608558 (owner: Kormat) [12:23:14] (CR) Elukey: [C: +2] Add analytics data purge for pageview_actor_hourly [puppet] - https://gerrit.wikimedia.org/r/608568 (https://phabricator.wikimedia.org/T256417) (owner: Joal) [12:24:24] (CR) Jbond: "Ready for review" [puppet] - https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) (owner: Jbond) [12:27:33] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:28:52] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:29:49] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:52] (PS1) Ema: 5.1.3-1wm16: add 0039-probe-cold-state-race.patch [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/608606 (https://phabricator.wikimedia.org/T236754) [12:31:04] (CR) jerkins-bot: [V: -1] 5.1.3-1wm16: add 0039-probe-cold-state-race.patch [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/608606 (https://phabricator.wikimedia.org/T236754) (owner: Ema) [12:42:27] (PS1) Ahuret: propagate logger to WSGIServer [software/druid_exporter] - https://gerrit.wikimedia.org/r/608607 [12:42:29] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia!
:) To learn how to get your code changes reviewed faster and more likely to get" [software/druid_exporter] - https://gerrit.wikimedia.org/r/608607 (owner: Ahuret) [12:49:00] (PS4) Alexandros Kosiaris: lvs: Switch proton to production [puppet] - https://gerrit.wikimedia.org/r/607534 (https://phabricator.wikimedia.org/T225680) [12:49:50] (CR) Alexandros Kosiaris: [C: +2] lvs: Switch proton to production [puppet] - https://gerrit.wikimedia.org/r/607534 (https://phabricator.wikimedia.org/T225680) (owner: Alexandros Kosiaris) [12:55:29] (PS2) CDanis: add playbook links for important alerts [puppet] - https://gerrit.wikimedia.org/r/608490 [12:55:56] (CR) CDanis: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/608490 (owner: CDanis) [12:57:55] (CR) JMeybohm: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/608600 (https://phabricator.wikimedia.org/T256726) (owner: Alexandros Kosiaris) [12:58:35] (CR) Hnowlan: [C: +1] proton: Switch dev restbase to talk to TLS proton [puppet] - https://gerrit.wikimedia.org/r/607535 (https://phabricator.wikimedia.org/T225680) (owner: Alexandros Kosiaris) [12:59:20] (CR) Jbond: [C: +2] role::alerting_host: update the cas-icinga vhost to use the icinga cert [puppet] - https://gerrit.wikimedia.org/r/608312 (owner: Jbond) [13:00:04] hashar and twentyafterfour: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Mediawiki train - European+American Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200630T1300).
[13:00:56] train train [13:01:06] those jouncebot messages are annoying [13:01:40] (PS1) Hashar: testwikis wikis to 1.35.0-wmf.39 [mediawiki-config] - https://gerrit.wikimedia.org/r/608613 [13:02:31] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:03:27] PROBLEM - Hadoop NodeManager on analytics1068 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:04:19] (CR) Elukey: [C: +2] "Thanks!" [software/druid_exporter] - https://gerrit.wikimedia.org/r/608607 (owner: Ahuret) [13:04:21] (CR) Elukey: [V: +2 C: +2] propagate logger to WSGIServer [software/druid_exporter] - https://gerrit.wikimedia.org/r/608607 (owner: Ahuret) [13:04:57] PROBLEM - puppet last run on idp-test2001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:06:35] (CR) Hashar: [C: +2] testwikis wikis to 1.35.0-wmf.39 [mediawiki-config] - https://gerrit.wikimedia.org/r/608613 (owner: Hashar) [13:06:49] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:07:20] (Merged) jenkins-bot: testwikis wikis to 1.35.0-wmf.39 [mediawiki-config] - https://gerrit.wikimedia.org/r/608613 (owner: Hashar) [13:07:25] !log hashar@deploy1001 Started scap: testwikis wikis to 1.35.0-wmf.39 [13:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:29] RECOVERY - Hadoop NodeManager on analytics1068 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [13:12:46] ACKNOWLEDGEMENT - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. JMeybohm I broke this while deleting the envoy-tls-local-proxy, looking into it, https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:13] RECOVERY - puppet last run on idp-test2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:16:20] (PS1) Muehlenhoff: Also enable ssoSessions actuator in prod [puppet] - https://gerrit.wikimedia.org/r/608616 [13:18:35] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:20:21] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:24:21] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:29:10] (PS1) Kormat: mariadb: Use custom types to ensure role/section have valid values. [puppet] - https://gerrit.wikimedia.org/r/608618 [13:30:46] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [13:30:47] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:37] Operations, Traffic, Patch-For-Review: Current codfw caches have wrong NVME format - https://phabricator.wikimedia.org/T256655 (ops-monitoring-bot) Icinga downtime for 0:30:00 set by vgutierrez@cumin1001 on 2 host(s) and their services with reason: kernel upgrade ` cp[2035-2036].codfw.wmnet ` [13:31:52] (PS1) Bartosz Dziewoński: Enable validation of new signatures [mediawiki-config] - https://gerrit.wikimedia.org/r/608619 (https://phabricator.wikimedia.org/T248632) [13:32:41] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:32:46] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [13:32:47] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:08] (CR) Bartosz Dziewoński: "(For deployment next week)" [mediawiki-config] - https://gerrit.wikimedia.org/r/608619 (https://phabricator.wikimedia.org/T248632) (owner: Bartosz
Dziewoński) [13:35:38] (CR) Jcrespo: "Not sure if these should be here, but commenting just in case." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/608618 (owner: Kormat) [13:36:38] (PS1) Bartosz Dziewoński: Enable validation of new signatures on Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/608621 (https://phabricator.wikimedia.org/T248632) [13:37:20] !log rebooting LDAP replicas for kernel security update [13:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:09] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:40] growth on 503 baseline [13:38:52] since :34 [13:39:11] https://grafana.wikimedia.org/d/000000503/varnish-http-errors?panelId=7&fullscreen&orgId=1&refresh=1m&from=1593520746979&to=1593524346979 [13:39:46] it went down again [13:40:10] strange, doesn't look like a normal error spike [13:41:20] (PS1) Ottomata: Camus eventlogging - consider meta.dt and dt for event partition time [puppet] - https://gerrit.wikimedia.org/r/608622 (https://phabricator.wikimedia.org/T256370) [13:41:25] (PS1) Jbond: mariadb::ferm: move firewall rules to there own profile [puppet] - https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) [13:41:27] (PS1) Jbond: mariadb::core_test: open mysql port for ipd_test on db1077 server [puppet] - https://gerrit.wikimedia.org/r/608624 (https://phabricator.wikimedia.org/T256120) [13:42:05] (CR) Ottomata: "To be merged after https://gerrit.wikimedia.org/r/c/analytics/refinery/+/608460 is deployed."
[puppet] - https://gerrit.wikimedia.org/r/608622 (https://phabricator.wikimedia.org/T256370) (owner: Ottomata) [13:42:16] (CR) jerkins-bot: [V: -1] mariadb::ferm: move firewall rules to there own profile [puppet] - https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: Jbond) [13:46:27] (PS2) Jbond: mariadb::ferm: move firewall rules to there own profile [puppet] - https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) [13:48:55] (CR) Jcrespo: "Not against what I think is the idea here, but modifying core db's firewall logic requires its own dedicated ticket, as it requires quite " [puppet] - https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: Jbond) [13:49:04] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:03] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:07] (PS2) Kormat: mariadb: Use custom types to ensure role/section have valid values. [puppet] - https://gerrit.wikimedia.org/r/608618 [13:53:02] (CR) Jcrespo: "Naming also be confusing, given modules/profile/manifests/mariadb/ferm.pp resource exists, too." [puppet] - https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: Jbond) [13:54:15] (CR) Jcrespo: "maybe we can move profile::mariadb::ferm to the mariadb module so a profile doesn't import another profile?"
[puppet] - https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: Jbond) [13:55:24] (PS3) Jbond: mariadb::ferm: move firewall rules to there own profile [puppet] - https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) [13:56:22] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:57:01] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:36] (CR) Muehlenhoff: puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls (1 comment) [puppet] - https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) (owner: Jbond) [13:58:46] (CR) Muehlenhoff: [C: +1] "Looks good to me (my earlier comment is more directed towards a followup change after this conversion is complete)" [puppet] - https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) (owner: Jbond) [13:59:00] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:26] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [13:59:27] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:59] Operations, CAS-SSO, User-jbond: cas-icinga intermittant failures - https://phabricator.wikimedia.org/T256720 (Kormat) Open→Invalid Correction: this doesn't seem to be cas-specific. I've had both cas-icinga and plain icinga pages open in both firefox and chrome for a few hours, and so far i'v...
[14:00:09] (PS4) Jbond: mariadb::ferm: move firewall rules to there own profile [puppet] - https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) [14:00:49] (CR) CDanis: [C: +2] add playbook links for important alerts [puppet] - https://gerrit.wikimedia.org/r/608490 (owner: CDanis) [14:01:05] (CR) Jbond: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: Jbond) [14:02:12] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:07] (CR) Jcrespo: "Could I ask what ips/dns are needed to connect to what hosts? Just to be clear, I am not claming a refactor is not needed, but I wonder if" [puppet] - https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: Jbond) [14:06:24] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:24] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:57] !log otto@deploy1001 Started deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370 [14:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:01] T256370: Camus should look for multiple possible timestamp fields to use for hourly partitioining - https://phabricator.wikimedia.org/T256370 [14:09:41] (CR) CDanis: [C: +1] purged: alert if local backlog grows past the given limits [puppet] - https://gerrit.wikimedia.org/r/608564 (https://phabricator.wikimedia.org/T256446) (owner: Ema) [14:09:46] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:09:56] !log hashar@deploy1001 Finished scap: testwikis wikis to 1.35.0-wmf.39 (duration: 62m 30s) [14:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:26] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:44] eventually [14:10:54] !log otto@deploy1001 Finished deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370 (duration: 01m 56s) [14:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:40] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:08] (CR) Andrew Bogott: "> Patch Set 2:" [puppet] -
https://gerrit.wikimedia.org/r/608068 (owner: Andrew Bogott) [14:12:23] (PS5) Jbond: mariadb::ferm: move firewall rules to there own profile [puppet] - https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) [14:13:00] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:13:05] (CR) Jbond: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: Jbond) [14:13:11] time for group0 [14:13:42] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:50] (PS1) Hashar: group0 wikis to 1.35.0-wmf.39 [mediawiki-config] - https://gerrit.wikimedia.org/r/608629 [14:15:02] (PS3) Kormat: mariadb: Use custom types to ensure role/section have valid values. [puppet] - https://gerrit.wikimedia.org/r/608618 [14:15:46] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:15:46] (CR) Kormat: mariadb: Use custom types to ensure role/section have valid values. (3 comments) [puppet] - https://gerrit.wikimedia.org/r/608618 (owner: Kormat) [14:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:50] !log rebooting miscweb servers for kernel security update [14:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:37] (CR) Jbond: "> Naming also be confusing, given modules/profile/manifests/mariadb/ferm.pp resource exists, too."
[puppet] - https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: Jbond) [14:17:30] (CR) Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/23569/" [puppet] - https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: Jbond) [14:17:37] (PS2) Jbond: mariadb::core_test: open mysql port for ipd_test on db1077 server [puppet] - https://gerrit.wikimedia.org/r/608624 (https://phabricator.wikimedia.org/T256120) [14:18:10] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:45] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:48] (PS1) Jbond: mariadb::profile::firewall: use the profile::mariadb::ferm type [puppet] - https://gerrit.wikimedia.org/r/608631 [14:21:10] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:30] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [14:21:31] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:21:31] (PS2) Jbond: mariadb::profile::firewall: use the profile::mariadb::ferm type [puppet] - https://gerrit.wikimedia.org/r/608631 [14:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:33] (CR) Hashar: [C: +2] group0 wikis to 1.35.0-wmf.39 [mediawiki-config] - https://gerrit.wikimedia.org/r/608629 (owner: Hashar) [14:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:36] (Merged) jenkins-bot: group0 wikis to 1.35.0-wmf.39 [mediawiki-config] - https://gerrit.wikimedia.org/r/608629 (owner: Hashar) [14:24:22] RECOVERY - IPMI Sensor Status on logstash2001 is OK: Sensor
Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:24:36] (CR) Jbond: "pcc: https://puppet-compiler.wmflabs.org/compiler1001/23571/" [puppet] - https://gerrit.wikimedia.org/r/608624 (https://phabricator.wikimedia.org/T256120) (owner: Jbond) [14:24:47] (PS1) ZPapierski: Configuration code for oauth proxy [puppet] - https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) [14:25:20] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.35.0-wmf.39 [14:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:59] (CR) jerkins-bot: [V: -1] Configuration code for oauth proxy [puppet] - https://gerrit.wikimedia.org/r/608633 (https://phabricator.wikimedia.org/T251498) (owner: ZPapierski) [14:26:08] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:26:09] (CR) Jbond: [C: +1] "LGtM" [puppet] - https://gerrit.wikimedia.org/r/608616 (owner: Muehlenhoff) [14:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:10] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:33] anyone wants to merge a Beta-only config change?
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/608621 [14:30:00] (CR) Jbond: puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls (1 comment) [puppet] - https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) (owner: Jbond) [14:30:08] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:30:53] for god sake [14:30:57] rolling back [14:31:46] (CR) Jbond: "LGTM" [puppet] - https://gerrit.wikimedia.org/r/608618 (owner: Kormat) [14:32:05] (Abandoned) Jbond: cumin: testing [puppet] - https://gerrit.wikimedia.org/r/608571 (owner: Jbond) [14:32:18] (Abandoned) Jbond: DO NOT MERGE: example refactor [puppet] - https://gerrit.wikimedia.org/r/608586 (owner: Jbond) [14:32:44] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:32] (PS1) Hashar: Revert "group0 wikis to 1.35.0-wmf.39" [mediawiki-config] - https://gerrit.wikimedia.org/r/608636 (https://phabricator.wikimedia.org/T256759) [14:33:46] (CR) Hashar: [C: +2] Revert "group0 wikis to 1.35.0-wmf.39" [mediawiki-config] - https://gerrit.wikimedia.org/r/608636 (https://phabricator.wikimedia.org/T256759) (owner: Hashar) [14:34:01] Operations, serviceops, Epic, Patch-For-Review: Track and remove jessie based container images from production - https://phabricator.wikimedia.org/T249724 (dancy) [14:34:03] will file in unbreak now as soon as I am done with the rollback [14:34:28] (Merged) jenkins-bot: Revert "group0 wikis to 1.35.0-wmf.39" [mediawiki-config] - https://gerrit.wikimedia.org/r/608636 (https://phabricator.wikimedia.org/T256759)
(owner: Hashar) [14:34:45] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:15] Operations, serviceops, Epic, Patch-For-Review: Track and remove jessie based container images from production - https://phabricator.wikimedia.org/T249724 (hashar) [14:36:03] (PS4) Kormat: mariadb: Use custom types to ensure role/section have valid values. [puppet] - https://gerrit.wikimedia.org/r/608618 [14:36:13] Flow is fataling https://www.mediawiki.org/wiki/Topic:Vp2ezhpldxustuf4 [14:36:24] Operations, serviceops, Epic, Patch-For-Review: Track and remove jessie based container images from production - https://phabricator.wikimedia.org/T249724 (hashar) CI stills use Jessie based container from docker-registry.wikimedia.org/wikimedia-jessie . The last remaining task is to have some se... [14:36:43] Amir1: hasha.r is already rolling back [14:36:53] oh okay [14:37:35] (CR) Ladsgroup: [C: +2] "noop for production" [mediawiki-config] - https://gerrit.wikimedia.org/r/608621 (https://phabricator.wikimedia.org/T248632) (owner: Bartosz Dziewoński) [14:37:36] pfff [14:37:38] wrong task [14:38:09] Can I quickly rebase that patch in wdeploy1001?
[14:38:13] *deploy1001 [14:38:18] (Merged) jenkins-bot: Enable validation of new signatures on Beta Cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/608621 (https://phabricator.wikimedia.org/T248632) (owner: Bartosz Dziewoński) [14:38:28] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:38:31] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: Revert "group0 wikis to 1.35.0-wmf.39" - T256759 [14:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:35] T256759: selenium-daily-* Jenkins jobs fail with `fatal: Refusing to fetch into current branch refs/heads/master of non-bare repository` - https://phabricator.wikimedia.org/T256759 [14:39:02] Amir1: flow is alive now [14:39:45] Amir1: thanks! [14:39:45] RhinosF1: thanks [14:40:08] MatmaRex: thank you for doing it, I just clicked on a shiny button (and I like clicking on shiny buttons) [14:40:13] Amir1: I think hashar's to thank. I'm just talking [14:40:27] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/608489 (owner: Dzahn) [14:41:13] anyone has a phabricator task for flow errors? [14:42:26] (CR) Muehlenhoff: [V: +2 C: +2] Also enable ssoSessions actuator in prod [puppet] - https://gerrit.wikimedia.org/r/608616 (owner: Muehlenhoff) [14:42:37] (CR) Jcrespo: "So I really don't see the use of changing the profiles for a test mediawiki database.
Mediawiki databases (even test ones) shouldn't have " [puppet] - https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: Jbond) [14:42:55] Majavah: I think hashar was going to file it after he rolled back [14:42:58] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2030 is CRITICAL: cluster=cache_upload instance=cp2030 job=purged site=codfw topic={codfw.resource-purge,eqiad.resource-purge} https://wikitech.wikimedia.org/wiki/Purged https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2030 [14:43:14] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:43:28] T256761 [14:43:28] T256761: 1.35.0-wmf.39 breaks Flow - https://phabricator.wikimedia.org/T256761 [14:43:42] (CR) Kormat: "Fixed issue with using wrong type, pcc is happy now: https://puppet-compiler.wmflabs.org/compiler1003/23574/" [puppet] - https://gerrit.wikimedia.org/r/608618 (owner: Kormat) [14:43:45] !log Train blocked on Flow being broken: T256761 # T254176 [14:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:05] T254176: 1.35.0-wmf.39 deployment blockers - https://phabricator.wikimedia.org/T254176 [14:44:09] Amir1: yeah the traces are in https://phabricator.wikimedia.org/T256761 [14:44:12] no clue what is happening [14:44:32] also happens on beta cluster https://en.wikipedia.beta.wmflabs.org/wiki/Topic:Vp9cnfpdti9bnda2 [14:44:49] thanks [14:45:01] I don't know if I can debug and fix it but I give it a try [14:45:19] I'm trying the same [14:47:26] Operations, serviceops, Kubernetes: Fix nginx config and caching for docker registry - https://phabricator.wikimedia.org/T256762 (JMeybohm) [14:47:36] !log
otto@deploy1001 Started deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370 - take 2 [14:47:39] !log otto@deploy1001 Finished deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370 - take 2 (duration: 00m 03s) [14:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:40] T256370: Camus should look for multiple possible timestamp fields to use for hourly partitioining - https://phabricator.wikimedia.org/T256370 [14:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:24] (PS7) Kormat: cumin: Add db-role and db-section aliases [puppet] - https://gerrit.wikimedia.org/r/608558 [14:48:32] (PS1) Jcrespo: mariadb: Open port to misc dbs for idp-test servers [puppet] - https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120) [14:48:59] (CR) Kormat: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/608558 (owner: Kormat) [14:49:33] !log otto@deploy1001 Started deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370 - take 3 [14:49:36] !log otto@deploy1001 Finished deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370 - take 3 (duration: 00m 03s) [14:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:33] (CR) Jcrespo: "I think this, while not ideal, is a safer change (and a method that will be closer to the way the definitive hosts will be setup)."
[puppet] - https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120) (owner: Jcrespo) [14:52:11] (CR) Jbond: "> Patch Set 5:" [puppet] - https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: Jbond) [14:54:04] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:54:12] (CR) Jbond: "also please not the noop PCC: https://puppet-compiler.wmflabs.org/compiler1001/23569/" [puppet] - https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: Jbond) [14:55:40] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [14:56:08] (CR) Jcrespo: "> Patch Set 5:" [puppet] - https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: Jbond) [14:56:14] (PS8) Kormat: cumin: Add db-role and db-section aliases [puppet] - https://gerrit.wikimedia.org/r/608558 [14:58:56] (PS1) Privacybatm: Transferer.py: Calculate source checksum parallel to the data transfer [software/transferpy] - https://gerrit.wikimedia.org/r/608640 (https://phabricator.wikimedia.org/T254979) [14:59:00] !log rebooting failoid hosts for kernel update [14:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:09] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:16] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 568 (alerts on 50) -
https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:59:24] (03CR) 10jerkins-bot: [V: 04-1] Transferer.py: Calculate source checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/608640 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [15:00:02] (03PS9) 10Kormat: cumin: Add db-role and db-section aliases [puppet] - 10https://gerrit.wikimedia.org/r/608558 [15:01:12] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:47] (03CR) 10Kormat: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/608558 (owner: 10Kormat) [15:01:50] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:03:39] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:50] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:05:43] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:20] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:07:14] (03CR) 10Jbond: "lgtm thanks" [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond) [15:07:30] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:10:38] !log rebooting mwdebug* hosts for kernel security update [15:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:57] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:08] (03CR) 10Jbond: mariadb::ferm: move firewall rules to there own profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond) [15:12:26] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:13:09] (03PS2) 10Privacybatm: Transferer.py: Calculate source checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/608640 (https://phabricator.wikimedia.org/T254979) [15:14:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:44] !log otto@deploy1001 Started deploy [analytics/refinery@1112749]: roll back to 1112749 on an-launcher1002, git-fat not pulling artifacts [15:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:05] !log otto@deploy1001 Finished deploy [analytics/refinery@1112749]: roll back to 1112749 on an-launcher1002, git-fat not pulling artifacts (duration: 01m 21s) [15:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:41] 10Operations, 10LDAP-Access-Requests, 10observability, 10serviceops, 10Patch-For-Review: Grant Access to Logstash to Peter(peter.ovchyn@speedandfunction.com) - https://phabricator.wikimedia.org/T249037 (10AMooney) @KFrancis, it looks like the form for this work has been Approved. Can this task move forwa... [15:20:33] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:50] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:22:41] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:24:26] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:36] (03PS6) 10Jbond: mariadb::ferm: move firewall rules to there own profile [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) [15:26:05] (03PS2) 10Jcrespo: mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120) [15:26:07] (03Abandoned) 10Jbond: mariadb::core_test: open mysql port for ipd_test on db1077 server [puppet] - 10https://gerrit.wikimedia.org/r/608624 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond) [15:27:16] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120) (owner: 10Jcrespo) [15:27:33] (03PS3) 10Jcrespo: mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120) [15:28:10] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:28:25] ^ should recover in a bit [15:28:43] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120) (owner: 10Jcrespo) [15:30:01] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:30:28] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:17] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ElectronPdfService/+/608649 should fix the train [15:32:26] Amir1, Majavah: ^ [15:32:29] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:39] (03PS1) 10RLazarus: openldap: Clarify output text on expiration warnings [puppet] - 10https://gerrit.wikimedia.org/r/608650 [15:32:56] having issues with my internet atm, can't look [15:33:08] (03PS4) 10Jcrespo: mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120) [15:33:14] Ack, I asked Demian to look [15:33:36] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:34:24] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120) (owner: 10Jcrespo) [15:34:54] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608650 (owner: 10RLazarus) [15:35:14] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:24] (03PS5) 10Jcrespo: mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120) [15:35:41] (03CR) 10RLazarus: [C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/608650 (owner: 10RLazarus) [15:39:04] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:39:07] (03CR) 10Jbond: mariadb: Setup db1077 as a misc::idp_test database server (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120) (owner: 10Jcrespo) [15:40:18] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:58] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608558 (owner: 10Kormat) [15:44:24] PROBLEM - ps1-c3-codfw-infeed-load-tower-B-phase-Y on ps1-c3-codfw is CRITICAL: SNMP CRITICAL - ps1-c3-codfw-infeed-load-tower-B-phase-Y *-1* https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:44:31] (03PS6) 10Jcrespo: mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120) [15:44:34] PROBLEM - ps1-c3-codfw-infeed-load-tower-B-phase-X on ps1-c3-codfw is CRITICAL: SNMP CRITICAL - ps1-c3-codfw-infeed-load-tower-B-phase-X *-1* https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:44:48] PROBLEM - Juniper alarms on asw-c-codfw is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:45:37] (03PS7) 10Jcrespo: mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120) [15:46:16] PROBLEM - Host mw2335 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:33] (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/608639 
(https://phabricator.wikimedia.org/T256120) (owner: 10Jcrespo) [15:46:34] PROBLEM - ps1-c3-codfw-infeed-load-tower-B-phase-Z on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:46:38] RECOVERY - Juniper alarms on asw-c-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:46:50] PROBLEM - Host mw2339 is DOWN: PING CRITICAL - Packet loss = 100% [15:47:04] PROBLEM - Host db2113 is DOWN: PING CRITICAL - Packet loss = 100% [15:47:04] PROBLEM - Host mw2337 is DOWN: PING CRITICAL - Packet loss = 100% [15:47:06] PROBLEM - Host mw2338 is DOWN: PING CRITICAL - Packet loss = 100% [15:47:24] PROBLEM - Host mw2336 is DOWN: PING CRITICAL - Packet loss = 100% [15:47:24] PROBLEM - Host thumbor2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:47:34] PROBLEM - Host thumbor2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:47:52] PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:48:00] PROBLEM - Host mw2336.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:48:00] PROBLEM - Host mw2337.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:48:08] ^ that's rack c3 but I thought no impact was expected [15:48:28] planned PDU work today, but they were going to be one at a time per papaul [15:48:32] RECOVERY - Host db2113 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms [15:48:32] RECOVERY - Host mw2338 is UP: PING OK - Packet loss = 0%, RTA = 36.18 ms [15:48:32] RECOVERY - Host mw2339 is UP: PING OK - Packet loss = 0%, RTA = 36.22 ms [15:48:32] RECOVERY - Host thumbor2002 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms [15:48:34] RECOVERY - Host mw2335 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms [15:48:34] RECOVERY - Host mw2337 is UP: PING OK - Packet loss = 0%, RTA = 36.20 ms [15:48:42] 10Operations, 10Wikimedia-Mailing-lists: Creation of 
mailinglist for Board of WUG Esperanto and Free Knowledge - https://phabricator.wikimedia.org/T255951 (10ssingh) 05Open→03Resolved a:03ssingh Marking this as resolved as the list has been created; please reopen if there are any other issues, questions... [15:48:46] RECOVERY - Host thumbor2001 is UP: PING OK - Packet loss = 0%, RTA = 36.22 ms [15:48:46] RECOVERY - Host mw2336 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms [15:48:56] PROBLEM - Host alert2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:50:00] PROBLEM - Host mw2338.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:50:00] PROBLEM - Host mw2339.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:50:40] PROBLEM - Host thumbor2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:50:40] PROBLEM - Host thumbor2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:51:36] PROBLEM - Host mw2335.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:51:36] PROBLEM - Host db2113.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:51:56] (03PS8) 10Jcrespo: mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120) [15:52:02] RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.29 ms [15:52:16] (03PS1) 10Aron Manning: Hotfix: "Undefined index: print" [extensions/ElectronPdfService] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608485 (https://phabricator.wikimedia.org/T256761) [15:53:02] RECOVERY - Host mw2336.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.60 ms [15:53:06] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/608068 (owner: 10Andrew Bogott) [15:53:49] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for 
hosts: ` relforge1004.eqiad.wmnet ` The log... [15:53:54] RECOVERY - Host mw2337.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.67 ms [15:54:01] !log otto@deploy1001 Started deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370 - take 3 [15:54:04] !log otto@deploy1001 Finished deploy [analytics/refinery@d63944e]: Deploying new camus wmf10 jar to an-launcher1002 for T256370 - take 3 (duration: 00m 03s) [15:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:05] T256370: Camus should look for multiple possible timestamp fields to use for hourly partitioining - https://phabricator.wikimedia.org/T256370 [15:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:21] the cY1 tech disconnected the wrong PDU that's why [15:54:31] d'oh [15:54:48] RECOVERY - Host alert2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.75 ms [15:55:52] RECOVERY - Host mw2338.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.78 ms [15:55:52] RECOVERY - Host mw2339.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.66 ms [15:56:06] papaul: wrong PDU in the right rack? 
[15:56:31] RECOVERY - Host thumbor2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.52 ms [15:56:31] RECOVERY - Host thumbor2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.42 ms [15:57:13] papaul: ahh that'll do it :) thanks [15:57:14] (03PS1) 10Rush: peek: privacy review project renamed in Asana [puppet] - 10https://gerrit.wikimedia.org/r/608657 [15:57:16] (03CR) 10Andrew Bogott: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/608068 (owner: 10Andrew Bogott) [15:57:26] RECOVERY - Host mw2335.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.72 ms [15:57:26] RECOVERY - Host db2113.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.64 ms [15:58:14] wkandek: it needs to be unplugged below the floor [15:58:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/608068 (owner: 10Andrew Bogott) [15:58:21] PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:58:41] papaul: ah, we need blue/red cables there as well :) [15:59:13] (03CR) 10Jcrespo: [C: 03+2] mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120) (owner: 10Jcrespo) [15:59:52] (03PS9) 10Jcrespo: mariadb: Setup db1077 as a misc::idp_test database server [puppet] - 10https://gerrit.wikimedia.org/r/608639 (https://phabricator.wikimedia.org/T256120) [16:00:05] godog and _joe_: (Dis)respected human, time to deploy Puppet request window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200630T1600). Please do the needful.
[16:00:26] 10Operations, 10SRE-Access-Requests: Requesting access to production servers for Ahmon Dancy - RelEng - https://phabricator.wikimedia.org/T256770 (10dancy) [16:01:06] (03CR) 10Rush: [C: 03+2] peek: privacy review project renamed in Asana [puppet] - 10https://gerrit.wikimedia.org/r/608657 (owner: 10Rush) [16:01:20] PROBLEM - IPMI Sensor Status on mw2339 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:01:38] PROBLEM - IPMI Sensor Status on mw2336 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:01:44] PROBLEM - IPMI Sensor Status on mw2337 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:02:05] 10Operations, 10SRE-Access-Requests: Requesting access to production servers for Ahmon Dancy - RelEng - https://phabricator.wikimedia.org/T256770 (10dancy) [16:04:09] 10Operations, 10SRE-Access-Requests: Requesting access to production servers for Ahmon Dancy - RelEng - https://phabricator.wikimedia.org/T256770 (10ssingh) a:03ssingh [16:05:13] 10Operations, 10DBA, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (10jcrespo) The proposed refactoring could broke the core/misc separation. We have deployed a far-from-ideal misc::idp_test (which I still have to... 
[16:06:42] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['relforge1004.eqiad.wmnet'] ` Of which those **FAILED**: ` ['relforge1004.eqiad.wmnet'... [16:07:32] 10Operations, 10SRE-Access-Requests: Requesting access to production servers for Ahmon Dancy - RelEng - https://phabricator.wikimedia.org/T256770 (10ssingh) [16:08:11] (03PS2) 10Ottomata: Camus eventlogging - consider meta.dt and dt for event partition time [puppet] - 10https://gerrit.wikimedia.org/r/608622 (https://phabricator.wikimedia.org/T256370) [16:09:52] (03CR) 10Ottomata: [C: 03+2] Camus eventlogging - consider meta.dt and dt for event partition time [puppet] - 10https://gerrit.wikimedia.org/r/608622 (https://phabricator.wikimedia.org/T256370) (owner: 10Ottomata) [16:09:54] (03CR) 10Herron: [C: 03+1] "> > Patch Set 11:" [puppet] - 10https://gerrit.wikimedia.org/r/608305 (https://phabricator.wikimedia.org/T251513) (owner: 10Jbond) [16:12:55] (03CR) 10Jcrespo: "I just added a temporary misc section: https://gerrit.wikimedia.org/r/c/operations/puppet/+/608639" [puppet] - 10https://gerrit.wikimedia.org/r/608618 (owner: 10Kormat) [16:17:11] (03PS2) 10CDanis: MW PHP-FPM worker saturation: make it page [puppet] - 10https://gerrit.wikimedia.org/r/607163 (https://phabricator.wikimedia.org/T252605) [16:18:04] (03CR) 10RLazarus: [C: 03+1] MW PHP-FPM worker saturation: make it page [puppet] - 10https://gerrit.wikimedia.org/r/607163 (https://phabricator.wikimedia.org/T252605) (owner: 10CDanis) [16:19:55] (03PS2) 10BBlack: nvme formatting was missing for new codfw caches [puppet] - 10https://gerrit.wikimedia.org/r/608425 (https://phabricator.wikimedia.org/T256655) [16:21:52] (03PS1) 10Cmjohnson: Updating relforge1003-4 netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/608664 (https://phabricator.wikimedia.org/T241791) [16:22:48] (03CR) 
10Cmjohnson: [C: 03+2] Updating relforge1003-4 netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/608664 (https://phabricator.wikimedia.org/T241791) (owner: 10Cmjohnson) [16:25:04] (03CR) 10Jcrespo: "Because idp database will eventually go to m1, we should change this patch to work for misc services and substitute profile::mariadb::ferm" [puppet] - 10https://gerrit.wikimedia.org/r/608623 (https://phabricator.wikimedia.org/T256120) (owner: 10Jbond) [16:25:53] (03CR) 10Herron: [C: 03+1] puppetmaster::frontend: manage ca_cert.pem and fix types lookup calls [puppet] - 10https://gerrit.wikimedia.org/r/608565 (https://phabricator.wikimedia.org/T256721) (owner: 10Jbond) [16:26:58] (03PS3) 10CDanis: MW PHP-FPM worker saturation: make it page [puppet] - 10https://gerrit.wikimedia.org/r/607163 (https://phabricator.wikimedia.org/T252605) [16:27:10] tgr: I made an edit to MediaWiki:Sidebar to hopefully clear its caches, didn't affect anything [16:28:53] (03CR) 10CDanis: [C: 03+2] MW PHP-FPM worker saturation: make it page [puppet] - 10https://gerrit.wikimedia.org/r/607163 (https://phabricator.wikimedia.org/T252605) (owner: 10CDanis) [16:28:57] tgr: another thread appears to be working, https://en.wikipedia.beta.wmflabs.org/wiki/Topic:Vp9iok8ssazd6a62 [16:29:02] so it is cache? [16:29:16] that clears the message cache but not the sidebar cache, I think? [16:29:34] OTOH if it is the sidebar cache, that should affect all pages [16:29:45] (03PS4) 10Herron: site: add Logstash7 capacity [puppet] - 10https://gerrit.wikimedia.org/r/608279 (https://phabricator.wikimedia.org/T256443) (owner: 10Filippo Giunchedi) [16:29:51] hmh, the first thread is working again [16:30:50] I guess the edit does clear both caches somehow, then [16:31:17] no idea, but it's working again [16:31:23] are there affected wikis beyond mw.org? [16:31:53] group0 flow wikis? 
[16:31:53] RECOVERY - IPMI Sensor Status on mw2339 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:32:15] RECOVERY - IPMI Sensor Status on mw2336 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:32:21] RECOVERY - IPMI Sensor Status on mw2337 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:33:31] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` relforge1004.... [16:33:35] (03CR) 10Herron: [C: 03+2] site: add Logstash7 capacity [puppet] - 10https://gerrit.wikimedia.org/r/608279 (https://phabricator.wikimedia.org/T256443) (owner: 10Filippo Giunchedi) [16:33:47] testwiki broken https://test.wikipedia.org/wiki/Topic:Vp9iyl4amw7ofe4n [16:35:13] that's mediawikiwiki, testwiki and officewiki [16:35:25] can be fixed by hand [16:37:33] RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.53 ms [16:37:51] PROBLEM - ps1-c3-codfw-infeed-load-tower-A-phase-Z on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:37:52] I wonder if a purge is enough [16:40:34] (03CR) 10Jdlrobson: "Mayakpwiki is an analyst. I'm not sure why she has been added as a reviewer." 
[extensions/ElectronPdfService] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608485 (https://phabricator.wikimedia.org/T256761) (owner: 10Aron Manning) [16:42:21] PROBLEM - ps1-c3-codfw-infeed-load-tower-A-phase-X on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:42:39] PROBLEM - ps1-c3-codfw-infeed-load-tower-A-phase-Y on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:43:01] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:43:10] the backport should probably get merged and deployed? [16:44:40] (03CR) 10BBlack: [C: 03+2] nvme formatting was missing for new codfw caches [puppet] - 10https://gerrit.wikimedia.org/r/608425 (https://phabricator.wikimedia.org/T256655) (owner: 10BBlack) [16:44:51] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:45:11] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['relforge1004.eqiad.wmnet'] ` Of which those **FAILED**: ` ['rel... 
[16:46:56] no, group0 has been rolled back [16:47:22] I mean, it should be, but that doesn't affect things on group0 right now [16:48:06] !log T256444 ✔️ cdanis@cp2030.codfw.wmnet ~ 🕐☕ sudo depool [16:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:13] T256444: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 [16:49:08] yeah, but the backport needs to be merged before the train can roll forward again? [16:49:29] yeah, we can do that in the backport window [16:50:26] oh true [16:52:50] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2030 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2030 [16:57:30] !log T256444 restarted purged on cp2030 and repooling [16:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:39] T256444: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 [17:00:04] halfak and accraze: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200630T1700). [17:00:47] tgr: I doubt Demian will do the window. I'd just do it. [17:01:22] I have a meeting right now, so I'd wait for the window anyway [17:01:31] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Cmjohnson) @Gehel I am getting a partman error. Is the partman recipe given correct? The raid10-4dev is not working....
[17:01:34] it's a good learning experience though [17:03:01] tgr: he's said he can do it but he's not on irc, which is surprising [17:03:55] not everyone is online all the time, it's not so strange [17:04:22] tgr: see phab [17:04:33] I'll go get a package from my local store, I'll be back in ~20mins [17:05:57] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@f9df1af]: Update mobileapps to 5c7611b9 [17:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:31] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@f9df1af]: Update mobileapps to 5c7611b9 (duration: 03m 33s) [17:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:09] (03PS1) 10Mholloway: Mobileapps: Update to 2020-06-29-163540-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/608693 [17:13:00] (03CR) 10Mholloway: [C: 03+2] Mobileapps: Update to 2020-06-29-163540-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/608693 (owner: 10Mholloway) [17:14:00] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:07] (03Merged) 10jenkins-bot: Mobileapps: Update to 2020-06-29-163540-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/608693 (owner: 10Mholloway) [17:15:40] (03CR) 10Krinkle: mcrouter_wancache: Add mcrouter support for a machine-local memcached instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) (owner: 10RLazarus) [17:17:19] !log unplugging msw-c3 power to relocate port on PDU [17:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:00] !log mholloway-shell@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[17:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:07] PROBLEM - ps1-c3-codfw-infeed-load-tower-A-phase-X on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:19:09] PROBLEM - ps1-c3-codfw-infeed-load-tower-B-phase-X on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:19:17] PROBLEM - Host alert2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:24:27] RECOVERY - Host alert2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.77 ms [17:26:20] * Majavah back [17:30:27] (03CR) 10RLazarus: mcrouter_wancache: Add mcrouter support for a machine-local memcached instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/594760 (https://phabricator.wikimedia.org/T244340) (owner: 10RLazarus) [17:30:51] (03CR) 10Cwhite: [C: 03+2] hiera: install mtail from component in codfw and eqsin [puppet] - 10https://gerrit.wikimedia.org/r/608450 (https://phabricator.wikimedia.org/T255776) (owner: 10Cwhite) [17:32:07] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:33:13] !log mholloway-shell@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [17:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:23] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:37:52] !log ✔️ cdanis@netflow2001.codfw.wmnet ~ 🕜☕ sudo systemctl restart nfacctd [17:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:55] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:40:41] !log mobileapps deployments on k8s failing with timeouts; filed T256786 [17:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:46] T256786: mobileapps kubernetes deployment is timing out - https://phabricator.wikimedia.org/T256786 [17:42:35] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:44:22] 10Operations, 10Core Platform Team, 10Traffic, 10serviceops, and 3 others: Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (10aaron) >>! In T250205#6158994, @Krinkle wrote: >>>! In T250205#6154883, @aaron wrote: >> I'm not fond of the idea of not sending purges for in... [17:45:57] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:46:46] (03CR) 10Dzahn: [C: 03+2] libraryupgrader: Add systemd units [puppet] - 10https://gerrit.wikimedia.org/r/607919 (https://phabricator.wikimedia.org/T173478) (owner: 10Legoktm) [17:51:10] /ac [17:51:13] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:51:30] (03CR) 10Aaron Schulz: [C: 03+1] "+1 to what Tim said" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui) [17:53:03] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:56:44] 10Operations, 10netops: nfacctd segfaulting on netflow2001 - https://phabricator.wikimedia.org/T256790 (10CDanis) [18:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Morning backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200630T1800). [18:00:04] tgr: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:23] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:00:45] it's just me, so I'll deploy it [18:01:05] (03CR) 10Muehlenhoff: "I think this patch can be abandoned now that we have the decom script and the insetup role." 
[puppet] - 10https://gerrit.wikimedia.org/r/508671 (https://phabricator.wikimedia.org/T222352) (owner: 10RobH) [18:02:03] (03Abandoned) 10RobH: splitting role::spare into staged and decomisssioning [puppet] - 10https://gerrit.wikimedia.org/r/508671 (https://phabricator.wikimedia.org/T222352) (owner: 10RobH) [18:05:30] !log installing libc6-dbg on netflow2001 T256790 [18:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:34] T256790: nfacctd segfaulting on netflow2001 - https://phabricator.wikimedia.org/T256790 [18:08:54] (03CR) 10Gergő Tisza: [C: 03+2] Hotfix: "Undefined index: print" [extensions/ElectronPdfService] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608485 (https://phabricator.wikimedia.org/T256761) (owner: 10Aron Manning) [18:11:51] !log restart varnishmtail,atsmtail,ncredirmtail on ncredir,cp hosts in codfw and eqsin [18:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:52] (03Merged) 10jenkins-bot: Hotfix: "Undefined index: print" [extensions/ElectronPdfService] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608485 (https://phabricator.wikimedia.org/T256761) (owner: 10Aron Manning) [18:12:55] 10Operations, 10ops-eqiad, 10netops: (Need by: 2019-09-30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10Cmjohnson) [18:14:11] 10Operations, 10ops-eqiad, 10netops: (Need by: 2019-09-30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10Cmjohnson) new-msw1-eqiad has the correct JUNOS 18.1.3 and the configuration has been copied. Currently connected to port 2 on the a8-scs and can be moved to th... [18:15:05] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:16:03] (03PS1) 10Krinkle: findBadBlobs: better separate scan and mark modes. 
[core] (wmf/1.35.0-wmf.39) - 10https://gerrit.wikimedia.org/r/608667 (https://phabricator.wikimedia.org/T251778) [18:16:21] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/608301 (https://phabricator.wikimedia.org/T256628) (owner: 10Jbond) [18:18:00] (03CR) 10Dwisehaupt: [C: 03+1] "Looks ok to me. Without knowing what's in the netbox included files it's tough to verify consistency. I'm fine with the idea and process." [dns] - 10https://gerrit.wikimedia.org/r/607741 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [18:20:37] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:22:23] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:23:17] !log tgr@deploy1001 Synchronized php-1.35.0-wmf.39/extensions/ElectronPdfService/src/ElectronPdfServiceHooks.php: Backport: [[gerrit:608485|Hotfix: "Undefined index: print" (T256761)]] (duration: 01m 05s) [18:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:22] T256761: 1.35.0-wmf.39 breaks Flow - https://phabricator.wikimedia.org/T256761 [18:25:37] tgr: tested, working on testwiki [18:25:55] thx [18:26:27] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:27:37] !log Morning deploys done [18:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:57] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units 
failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:28:17] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:28:21] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps: puppetize /etc/ldap.conf on sssd clients [puppet] - 10https://gerrit.wikimedia.org/r/608068 (owner: 10Andrew Bogott) [18:29:22] (03PS1) 10Andrew Bogott: role::puppetmaster::standalone: allow for overriding the hiera config [puppet] - 10https://gerrit.wikimedia.org/r/608706 [18:31:03] 10Operations, 10netops: nfacctd segfaulting on netflow2001 - https://phabricator.wikimedia.org/T256790 (10CDanis) Okay, here are some backtraces: {P11715} When I saw crashes in malloc and then installed libc6-dbg to get arguments, I was hoping that the issue was malloc being invoked with a ridiculous paramet... [18:31:25] !log T256790 ✔️ cdanis@netflow2001.codfw.wmnet ~ 🕝☕ sudo apt install valgrind [18:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:30] T256790: nfacctd segfaulting on netflow2001 - https://phabricator.wikimedia.org/T256790 [18:33:27] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:34:21] 10Operations, 10Traffic: Switch blog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T254367 (10CKoerner_WMF) [18:34:44] (03PS1) 10Dzahn: gerrit::server: if acme_chief is not used, install certbot [puppet] - 10https://gerrit.wikimedia.org/r/608707 [18:34:50] 10Operations, 10Traffic: Switch blog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T254367 (10CKoerner_WMF) Update, as probably obvious, we have pushed the launch date back. 
Our target date is now July 14th. [18:35:13] 10Operations, 10Traffic: Switch blog.wikimedia.org to diff.wikimedia.org - https://phabricator.wikimedia.org/T254367 (10Dzahn) [18:36:38] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work), 10Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10Dzahn) a:05Dzahn→03Gehel [18:39:00] 10Operations, 10netops: nfacctd segfaulting on netflow2001 - https://phabricator.wikimedia.org/T256790 (10CDanis) {P11716} [18:39:49] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/23578/" [puppet] - 10https://gerrit.wikimedia.org/r/608707 (owner: 10Dzahn) [18:40:00] (03CR) 10Dzahn: [C: 03+2] "noop in prod, cloud-only" [puppet] - 10https://gerrit.wikimedia.org/r/608707 (owner: 10Dzahn) [18:42:23] (03CR) 10Dzahn: "In case you are wondering how letsencrypt::cert::integrated fails:" [puppet] - 10https://gerrit.wikimedia.org/r/608707 (owner: 10Dzahn) [18:43:21] (03Abandoned) 10Dzahn: gerrit: allow for 3 different methods to get TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/607116 (owner: 10Dzahn) [18:44:19] (03PS1) 10Elukey: profile::mediawiki::alerts: tune mediawiki-errors to be more lenient [puppet] - 10https://gerrit.wikimedia.org/r/608708 (https://phabricator.wikimedia.org/T256459) [18:49:15] (03PS1) 10Gergő Tisza: Varnish: Include request ID in Set-Cookie warning [puppet] - 10https://gerrit.wikimedia.org/r/608709 (https://phabricator.wikimedia.org/T256395) [18:52:01] 10Operations, 10netops: nfacctd segfaulting on netflow2001 - https://phabricator.wikimedia.org/T256790 (10CDanis) Filed upstream https://github.com/pmacct/pmacct/issues/414 [18:53:29] (03CR) 10CDanis: [C: 03+1] "LGTM once I9bca677 has been backported and deployed" [puppet] - 10https://gerrit.wikimedia.org/r/608709 (https://phabricator.wikimedia.org/T256395) (owner: 10Gergő Tisza) [19:00:05] hashar and twentyafterfour: #bothumor When your hammer is PHP, everything 
starts looking like a thumb. Rise for Mediawiki train - European+American Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200630T1900). [19:00:42] (03PS3) 10CDanis: mediawiki: Change mw alerts to use a moving average [puppet] - 10https://gerrit.wikimedia.org/r/608188 (owner: 10Krinkle) [19:01:32] lets try once again [19:01:51] tgr: thanks for the assistance with the train blocker :] [19:02:30] (03PS1) 10Hashar: group0 wikis to 1.35.0-wmf.39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608711 [19:02:41] (03CR) 10CDanis: [C: 03+2] mediawiki: Change mw alerts to use a moving average [puppet] - 10https://gerrit.wikimedia.org/r/608188 (owner: 10Krinkle) [19:02:45] (03CR) 10CDanis: [C: 03+2] mediawiki: Raise fatal alert treshold from 50 to 100 [puppet] - 10https://gerrit.wikimedia.org/r/608189 (owner: 10Krinkle) [19:02:52] (03PS3) 10CDanis: mediawiki: Raise fatal alert treshold from 50 to 100 [puppet] - 10https://gerrit.wikimedia.org/r/608189 (owner: 10Krinkle) [19:03:55] cdanis: Krinkle: I guess I can hold the train until those monitoring changes get deployed? ;) [19:04:30] (03PS4) 10CDanis: mediawiki: Raise fatal alert threshold from 50 to 100 [puppet] - 10https://gerrit.wikimedia.org/r/608189 (owner: 10Krinkle) [19:04:48] it should just be a few minutes [19:05:00] thx cdanis [19:05:10] yeah thanks for writing them Krinkle [19:05:24] it's funny, we use irate() in some alert rules, and we use rate() on some traffic graphs [19:05:27] exactly backwards :D [19:06:29] I will hold [19:06:33] err [19:06:50] I am holding the group0 promotion. 
Just let me know when icinga has been refreshed [19:08:10] (03PS1) 10Gergő Tisza: Fully set MW_NO_SESSION for browser metadata endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608713 [19:08:52] hashar: proceed [19:10:25] * hashar presses [ENTER] [19:10:40] [EXITING] Pulled Revision group0 wikis to 1.35.0-wmf.39 did not match Enable validation of new signatures on Beta Cluster [19:10:41] (03CR) 10Krinkle: [C: 03+1] Fully set MW_NO_SESSION for browser metadata endpoints [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608713 (owner: 10Gergő Tisza) [19:10:42] pff [19:10:57] that is new to me [19:11:19] (03PS1) 10Hashar: group0 wikis to 1.35.0-wmf.39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608714 [19:11:41] (03Abandoned) 10Hashar: group0 wikis to 1.35.0-wmf.39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608714 (owner: 10Hashar) [19:11:56] PROBLEM - Check systemd state on mw2143 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:12:00] PROBLEM - Check systemd state on mw2141 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:12:04] PROBLEM - Check systemd state on mw2140 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:12:06] PROBLEM - Check systemd state on mw2194 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:12:08] ^ looking [19:12:20] (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.35.0-wmf.39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608711 (owner: 10Hashar) [19:12:26] PROBLEM - Check systemd state on mw2196 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:12:28] PROBLEM - Check systemd state on mw2208 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:12:30] PROBLEM - Check systemd state on mw2195 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:12:30] PROBLEM - Check systemd state on mw2142 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:12:31] I thought that might be leftover noise from C3 but it's not that rack [19:12:32] ^ systemd state on mw*.codfw is mtail restarting [19:12:38] ahh thanks shdubsh [19:12:38] PROBLEM - Check systemd state on mw2189 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:12:38] PROBLEM - Check systemd state on mw2192 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:12:38] PROBLEM - Check systemd state on mw2199 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:12:50] PROBLEM - Check systemd state on mw2218 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:12:54] PROBLEM - Check systemd state on mw2217 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:12:58] RECOVERY - Check systemd state on mw2143 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:00] PROBLEM - Check systemd state on mw2201 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:00] RECOVERY - Check systemd state on mw2141 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:00] PROBLEM - Check systemd state on mw2210 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:00] PROBLEM - Check systemd state on mw2212 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:04] PROBLEM - Check systemd state on mw2214 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:04] RECOVERY - Check systemd state on mw2140 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:08] anything we need to do about it? should we downtime that alert? 
[19:13:08] PROBLEM - Check systemd state on mw2202 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:08] PROBLEM - Check systemd state on mw2207 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:10] PROBLEM - Check systemd state on mw2200 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:10] PROBLEM - Check systemd state on mw2204 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:17] puppet has to correct the unit file :( [19:13:30] PROBLEM - Check systemd state on mw2221 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:32] RECOVERY - Check systemd state on mw2142 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:34] PROBLEM - Check systemd state on mw2187 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:34] PROBLEM - Check systemd state on mw2220 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:40] RECOVERY - Check systemd state on mw2192 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:13:47] (03Merged) 10jenkins-bot: group0 wikis to 1.35.0-wmf.39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608711 (owner: 10Hashar) [19:13:50] sorry for the noise [19:14:12] RECOVERY - Check systemd state on mw2200 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:14:34] RECOVERY - Check systemd state on mw2187 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:12] RECOVERY - Check systemd state on mw2194 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:48] RECOVERY - Check systemd state on mw2189 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:48] RECOVERY - Check systemd state on mw2199 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:50] PROBLEM - Check systemd state on mw2222 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:16:38] RECOVERY - Check systemd state on mw2196 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:16:44] RECOVERY - Check systemd state on mw2195 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:17:20] RECOVERY - Check systemd state on mw2201 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:17:22] RECOVERY - Check systemd state on mw2210 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:17:28] RECOVERY - Check systemd state on mw2202 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:18:38] RECOVERY - Check systemd state on mw2207 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:18:40] RECOVERY - Check systemd state on mw2204 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:19:23] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group 0 wikis to 1.35.0-wmf.39 # T254176 [19:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:28] T254176: 1.35.0-wmf.39 deployment blockers - https://phabricator.wikimedia.org/T254176 [19:19:40] RECOVERY - Check systemd state on mw2212 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:08] RECOVERY - Check systemd state on mw2208 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:44] RECOVERY - Check systemd state on mw2218 is OK: OK - running: The 
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:48] RECOVERY - Check systemd state on mw2217 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:58] RECOVERY - Check systemd state on mw2214 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:21:24] RECOVERY - Check systemd state on mw2221 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:21:34] RECOVERY - Check systemd state on mw2220 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:21:46] RECOVERY - Check systemd state on mw2222 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:22:23] train is quieter this time :] [19:22:56] Does the fix for Special:Undelete work? [19:24:12] DannyS712|away: at least there are no errors ;) [19:24:41] and I guess we need someone to undelete on one of the wikis to figure it out? [19:24:47] {{doing}} [19:24:53] at least there is no deprecation warning showing up yet [19:25:00] I nominate https://www.mediawiki.org/wiki/MediaWiki for deletion [19:25:19] as a deletionist cabal member, I approve [19:25:35] DannyS712|away: pick a more obscure page maybe? ;) [19:25:53] but we decided at the last cabal meeting that there is no cabal! [19:25:57] testwiki should work [19:26:10] Majavah: that part has been deleted from the minutes. Due to rule #1 [19:26:17] rule #1: if in doubt delete. [19:26:31] I was honestly about to - `Hashar said I could :) will restore in half a second` but I realized that fuzzy bot would then delete all of the translations [19:26:42] oh my [19:26:54] which means more Undeletes?
[19:27:08] No, they would have to be undeleted manually [19:27:13] So I'm looking for a better page [19:27:21] my fault really, I should never have misled you into deleting that page by "approving" it. I apologize [19:27:29] Project:Support desk? [19:27:45] well testwiki has .39 so we can try there [19:27:57] https://www.mediawiki.org/wiki/Special:Redirect/page/4473 [19:28:03] Any objections? [19:28:08] No [19:28:54] Okay, deleted and restored (@hashar that's your user page) - can you check the logs? [19:28:59] sure thing [19:30:00] DannyS712|away: deal, no deprecation showing. [19:30:15] thanks [19:30:36] well done :-] [19:31:24] Going to eat some ice cream - tried to set my nick to away, but apparently I already did :) [19:31:53] enjoy the ice cream! [19:32:13] And I'm going to bed, given that the UBN was resolved and it's getting late [19:32:28] Majavah: thank you !!! [19:42:16] (03CR) 10Volans: "> Patch Set 1: Code-Review+1" [dns] - 10https://gerrit.wikimedia.org/r/607741 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [19:42:21] (03CR) 10Papaul: [C: 03+1] mgmt: netbox-generated data for frack mgmt codfw [dns] - 10https://gerrit.wikimedia.org/r/607741 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [19:54:49] trains look fine so I am out :) see you tomorrow [19:57:14] 10Operations, 10ops-codfw, 10procurement: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 (10Papaul) 05Open→03Resolved Both new PDU's in rack C3 are in installed and configured. Problem: moved all network devices power to PS1 before disconnecting PS2. when Tech was ready to discon... [19:57:29] 10Operations, 10ops-codfw, 10procurement: codfw: Next Gen test rack - https://phabricator.wikimedia.org/T251570 (10Papaul) [20:12:43] (03CR) 10Dwisehaupt: [C: 03+1] "@Volans Thanks! I pulled that down and it all looks good to me."
[dns] - 10https://gerrit.wikimedia.org/r/607741 (https://phabricator.wikimedia.org/T233183) (owner: 10Volans) [20:13:47] (03PS2) 10Andrew Bogott: role::puppetmaster::standalone: allow for overriding the hiera config [puppet] - 10https://gerrit.wikimedia.org/r/608706 [20:13:49] (03PS1) 10Andrew Bogott: openstack db grants: don't do AAAA resolution on VMs [puppet] - 10https://gerrit.wikimedia.org/r/608719 [20:14:02] (03PS2) 10Andrew Bogott: openstack db grants: don't do AAAA resolution on VMs [puppet] - 10https://gerrit.wikimedia.org/r/608719 [20:14:04] (03PS3) 10Andrew Bogott: role::puppetmaster::standalone: allow for overriding the hiera config [puppet] - 10https://gerrit.wikimedia.org/r/608706 [20:16:16] (03PS1) 10BryanDavis: toolforge: Use chained cert for mail relay TLS [puppet] - 10https://gerrit.wikimedia.org/r/608720 (https://phabricator.wikimedia.org/T256806) [20:17:19] (03PS1) 10Cwhite: mtail: remove component and upgrade mtail to 3.0.0-rc35-3~wmf2 across the fleet [puppet] - 10https://gerrit.wikimedia.org/r/608721 (https://phabricator.wikimedia.org/T255776) [20:18:24] (03CR) 10Cwhite: "mtail package to be deployed to main and this merged early next week." [puppet] - 10https://gerrit.wikimedia.org/r/608721 (https://phabricator.wikimedia.org/T255776) (owner: 10Cwhite) [20:21:13] (03CR) 10Andrew Bogott: [C: 03+2] openstack db grants: don't do AAAA resolution on VMs [puppet] - 10https://gerrit.wikimedia.org/r/608719 (owner: 10Andrew Bogott) [20:27:44] 10Operations, 10observability, 10Patch-For-Review: log spam from mtail 3.0.0~rc19 on wezen - https://phabricator.wikimedia.org/T225604 (10colewhite) 05Open→03Resolved a:03colewhite wezen is no longer around and mtail has been upgraded to rc35 across the fleet. this message does not appear to be spammi... 
[20:27:46] 10Operations, 10Traffic: varnishmtail panics on buster - https://phabricator.wikimedia.org/T243591 (10colewhite) [20:30:33] (03CR) 10Andrew Bogott: [C: 03+2] role::puppetmaster::standalone: allow for overriding the hiera config [puppet] - 10https://gerrit.wikimedia.org/r/608706 (owner: 10Andrew Bogott) [20:32:21] (03PS1) 10Dzahn: gerrit: fix apache cert pathes when acme_chief is not used in cloud [puppet] - 10https://gerrit.wikimedia.org/r/608723 [20:34:52] (03PS2) 10Dzahn: gerrit: fix apache cert pathes when acme_chief is not used in cloud [puppet] - 10https://gerrit.wikimedia.org/r/608723 [20:36:19] (03CR) 10Dzahn: [C: 03+2] "noop in prod but fixes cloud https://puppet-compiler.wmflabs.org/compiler1002/23580/" [puppet] - 10https://gerrit.wikimedia.org/r/608723 (owner: 10Dzahn) [20:39:32] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/608459 (https://phabricator.wikimedia.org/T254014) (owner: 10Ryan Kemper) [20:43:23] (03PS1) 10Ryan Kemper: Enable replication in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/608726 (https://phabricator.wikimedia.org/T254014) [20:46:21] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (10Nuria) Approved on my end. Please be so kind to familiarize yourself with the data access guidelines: https://wikitech.wikimedia.org/wiki/Analytics/Da... 
[20:46:24] (03CR) 10Ryan Kemper: "pcc: https://puppet-compiler.wmflabs.org/compiler1003/23581/" [puppet] - 10https://gerrit.wikimedia.org/r/608726 (https://phabricator.wikimedia.org/T254014) (owner: 10Ryan Kemper) [20:49:44] (03PS1) 10Dzahn: gerrit: do not define any replica hosts when in cloud [puppet] - 10https://gerrit.wikimedia.org/r/608727 [20:51:43] (03CR) 10Dzahn: [C: 03+2] gerrit: do not define any replica hosts when in cloud [puppet] - 10https://gerrit.wikimedia.org/r/608727 (owner: 10Dzahn) [20:56:29] (03PS1) 10Dzahn: gerrit: make $replica_hosts an optional parameter [puppet] - 10https://gerrit.wikimedia.org/r/608728 [20:57:40] (03CR) 10jerkins-bot: [V: 04-1] gerrit: make $replica_hosts an optional parameter [puppet] - 10https://gerrit.wikimedia.org/r/608728 (owner: 10Dzahn) [20:59:14] (03PS1) 10Ryan Kemper: Isolate eqiad master maps1004 from cluster [puppet] - 10https://gerrit.wikimedia.org/r/608729 (https://phabricator.wikimedia.org/T254014) [20:59:36] (03CR) 10jerkins-bot: [V: 04-1] Isolate eqiad master maps1004 from cluster [puppet] - 10https://gerrit.wikimedia.org/r/608729 (https://phabricator.wikimedia.org/T254014) (owner: 10Ryan Kemper) [20:59:51] (03PS1) 10Paladox: gerrit: Prevent new line being added under javaOptions [puppet] - 10https://gerrit.wikimedia.org/r/608672 [21:00:54] (03PS2) 10Paladox: gerrit: Prevent new line being added under javaOptions [puppet] - 10https://gerrit.wikimedia.org/r/608672 [21:01:09] (03PS2) 10Dzahn: gerrit: make $replica_hosts an optional parameter [puppet] - 10https://gerrit.wikimedia.org/r/608728 [21:01:30] (03PS2) 10Ryan Kemper: Isolate eqiad master maps1004 from cluster [puppet] - 10https://gerrit.wikimedia.org/r/608729 (https://phabricator.wikimedia.org/T254014) [21:01:33] (03CR) 10Paladox: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/608672 (owner: 10Paladox) [21:03:20] (03CR) 10Dzahn: "noop in prod: https://puppet-compiler.wmflabs.org/compiler1002/23584/" [puppet] - 10https://gerrit.wikimedia.org/r/608728 (owner: 10Dzahn) [21:03:22] (03CR) 10Reedy: Add api.wikimedia.org and api.m.wikimedia.org DNS entries (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/599273 (https://phabricator.wikimedia.org/T246945) (owner: 10Ladsgroup) [21:03:33] (03PS3) 10Paladox: gerrit: Prevent new line being added under javaOptions [puppet] - 10https://gerrit.wikimedia.org/r/608672 [21:05:06] (03CR) 10Paladox: gerrit: make $replica_hosts an optional parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608728 (owner: 10Dzahn) [21:07:54] 10Operations, 10Traffic: Certain links being rejected by caching if opened in Internet Explorer - https://phabricator.wikimedia.org/T256302 (10Aklapper) @Urbanecm: What is the exact HTTP error type? Asking as that screenshot does not include any error message (probably not to expose the IP). At least on the "... 
[21:09:20] (CR) Dzahn: gerrit: make $replica_hosts an optional parameter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/608728 (owner: Dzahn) [21:10:11] (CR) Dzahn: "noop in prod: https://puppet-compiler.wmflabs.org/compiler1002/23590/" [puppet] - https://gerrit.wikimedia.org/r/608728 (owner: Dzahn) [21:10:17] (CR) Dzahn: [C: +2] gerrit: make $replica_hosts an optional parameter [puppet] - https://gerrit.wikimedia.org/r/608728 (owner: Dzahn) [21:10:30] (CR) Cwhite: [C: +1] thanos: set consistency-delay on store [puppet] - https://gerrit.wikimedia.org/r/607783 (https://phabricator.wikimedia.org/T233956) (owner: Filippo Giunchedi) [21:10:32] (CR) Ladsgroup: Add api.wikimedia.org and api.m.wikimedia.org DNS entries (1 comment) [dns] - https://gerrit.wikimedia.org/r/599273 (https://phabricator.wikimedia.org/T246945) (owner: Ladsgroup) [21:11:42] (PS1) Paladox: gerrit: Set heap for gerrit-prod-1001 [puppet] - https://gerrit.wikimedia.org/r/608673 [21:11:48] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/608061 (https://phabricator.wikimedia.org/T251869) (owner: Filippo Giunchedi) [21:11:55] (PS2) Paladox: gerrit: Set heap for gerrit-prod-1001 [puppet] - https://gerrit.wikimedia.org/r/608673 [21:11:59] (CR) Paladox: "This change is ready for review."
[puppet] - https://gerrit.wikimedia.org/r/608673 (owner: Paladox) [21:13:36] (PS3) Paladox: gerrit: Set heap for gerrit-prod-1001 [puppet] - https://gerrit.wikimedia.org/r/608673 [21:14:03] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/608319 (https://phabricator.wikimedia.org/T233956) (owner: Filippo Giunchedi) [21:14:25] (CR) Ryan Kemper: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/608729 Isolate eqiad master maps1004 from cluster" [puppet] - https://gerrit.wikimedia.org/r/608729 (https://phabricator.wikimedia.org/T254014) (owner: Ryan Kemper) [21:17:11] (CR) Dzahn: [C: +2] gerrit: Set heap for gerrit-prod-1001 [puppet] - https://gerrit.wikimedia.org/r/608673 (owner: Paladox) [21:20:00] (PS2) Ladsgroup: Add api.wikimedia.org and api.m.wikimedia.org DNS entries [dns] - https://gerrit.wikimedia.org/r/599273 (https://phabricator.wikimedia.org/T246945) [21:20:21] (PS1) Dzahn: gerrit: set replica_hosts to undefined in Hiera when in cloud [puppet] - https://gerrit.wikimedia.org/r/608732 [21:20:39] (CR) Paladox: [C: +1] gerrit: set replica_hosts to undefined in Hiera when in cloud [puppet] - https://gerrit.wikimedia.org/r/608732 (owner: Dzahn) [21:21:08] (CR) Dzahn: [C: +2] gerrit: set replica_hosts to undefined in Hiera when in cloud [puppet] - https://gerrit.wikimedia.org/r/608732 (owner: Dzahn) [21:25:56] (CR) Gehel: [C: -1] "see comment inline."
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/608729 (https://phabricator.wikimedia.org/T254014) (owner: Ryan Kemper) [21:32:33] (PS3) Ryan Kemper: Isolate eqiad master maps1004 from cluster [puppet] - https://gerrit.wikimedia.org/r/608729 (https://phabricator.wikimedia.org/T254014) [21:33:08] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/608729 (https://phabricator.wikimedia.org/T254014) (owner: Ryan Kemper) [21:36:05] (CR) Platonides: toolforge: Use chained cert for mail relay TLS (1 comment) [puppet] - https://gerrit.wikimedia.org/r/608720 (https://phabricator.wikimedia.org/T256806) (owner: BryanDavis) [21:37:14] (PS1) Dzahn: gerrit: add cron to auto-renew cert using certbot in cloud [puppet] - https://gerrit.wikimedia.org/r/608734 [21:38:12] !log crusnov@cumin1001 START - Cookbook sre.hosts.reboot-single [21:38:12] !log crusnov@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) [21:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:51] !log crusnov@cumin1001 START - Cookbook sre.hosts.reboot-single [21:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:12] !log crusnov@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [21:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:28] !log crusnov@cumin1001 START - Cookbook sre.hosts.reboot-single [21:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:51] !log crusnov@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [21:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:39] (CR) Dzahn: [C: +2] gerrit: add cron to auto-renew cert using certbot in cloud [puppet] - https://gerrit.wikimedia.org/r/608734 (owner:
Dzahn) [21:43:56] !log crusnov@cumin1001 START - Cookbook sre.hosts.reboot-single [21:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:57] !log crusnov@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [21:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:24] !log crusnov@cumin1001 START - Cookbook sre.hosts.reboot-single [21:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:45] !log crusnov@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [21:48:46] (PS1) Paladox: gerrit: Set gerrit::server::replica_hosts to an empty array in devtools [puppet] - https://gerrit.wikimedia.org/r/608674 [21:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:29] (PS2) Paladox: gerrit: Set gerrit::server::replica_hosts to an empty array in devtools [puppet] - https://gerrit.wikimedia.org/r/608674 [21:51:31] (PS3) Paladox: gerrit: Set gerrit::server::replica_hosts to an empty array in devtools [puppet] - https://gerrit.wikimedia.org/r/608674 [21:52:04] (CR) Paladox: "This change is ready for review."
[puppet] - https://gerrit.wikimedia.org/r/608674 (owner: Paladox) [21:54:46] (CR) Dzahn: [C: +2] "noop in prod https://puppet-compiler.wmflabs.org/compiler1002/23592/" [puppet] - https://gerrit.wikimedia.org/r/608674 (owner: Paladox) [22:03:33] (PS4) Dzahn: gerrit: remove all database parameters / support [puppet] - https://gerrit.wikimedia.org/r/606549 (https://phabricator.wikimedia.org/T254158) [22:08:06] (CR) Paladox: [C: +1] gerrit: remove all database parameters / support [puppet] - https://gerrit.wikimedia.org/r/606549 (https://phabricator.wikimedia.org/T254158) (owner: Dzahn) [22:26:58] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1001 job=burrow partition={4,5} site=eqiad topic={rsyslog-notice,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasour [22:26:58] us/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [22:33:21] (PS1) CRusnov: offline_device: Clear primary IP addresses from device before deleting them. [software/netbox-extras] - https://gerrit.wikimedia.org/r/608741 [22:35:55] (CR) CRusnov: [C: +2] "This fixes a bug that Papaul discovered. i have verified it works in netbox-dev."
[software/netbox-extras] - https://gerrit.wikimedia.org/r/608741 (owner: CRusnov) [22:36:06] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1001 job=burrow partition={0,1,4,5} site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+pr [22:36:06] cluster=logging-eqiad&var-topic=All&var-consumer_group=All [22:43:22] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1001 job=burrow partition=5 site=eqiad topic=rsyslog-notice https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqi [22:43:22] var-consumer_group=All [22:48:54] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening backport window(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200630T2300).
[23:03:09] (CR) Ryan Kemper: [C: +2] Revert "Revert "Role for SDoC WDQS"" [puppet] - https://gerrit.wikimedia.org/r/602171 (owner: EBernhardson) [23:09:38] (CR) Dzahn: [C: +2] "beta cherry-picked" [puppet] - https://gerrit.wikimedia.org/r/608251 (https://phabricator.wikimedia.org/T99156) (owner: Krinkle) [23:28:05] Dereckson: hi, i noticed an issue with the database on your site [23:42:49] Operations, serviceops, Patch-For-Review: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (Krinkle) [23:49:02] mutante: https://gerrit.wikimedia.org/r/c/operations/puppet/+/603550 [23:49:22] is that something you're fine with doing directly or would you like me to do that in mw config? [23:50:08] looking at the past 3 year history of private/.git on deploy1001 it seems SRE generally don't edit it. [23:50:12] I don't mind either way though [23:51:45] Operations, observability, serviceops, Performance-Team (Radar), User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (CDanis) There's no alert yet for memcache NIC saturation, and I don't believe there's one for TKOs either (@elukey is that ri... [23:53:55] (CR) Krinkle: [C: -1] "Set this via Horizon instead where most other per-host config lives. That way it's not requiring SRE +2 and cherry picks etc." [puppet] - https://gerrit.wikimedia.org/r/598753 (https://phabricator.wikimedia.org/T251176) (owner: Hnowlan) [23:54:34] (CR) Ppchelko: "We went completely another way, this should be abandoned." [puppet] - https://gerrit.wikimedia.org/r/598753 (https://phabricator.wikimedia.org/T251176) (owner: Hnowlan) [23:54:40] (CR) Krinkle: [C: -1] "removing hashtag as it does not currently appear to be live on beta's puppetmaster" [puppet] - https://gerrit.wikimedia.org/r/598753 (https://phabricator.wikimedia.org/T251176) (owner: Hnowlan)