[00:10:11] RECOVERY - Memory correctable errors -EDAC- on wtp2020 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw+prometheus/ops
[00:15:19] RECOVERY - EDAC syslog messages on wtp2020 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2020&var-datasource=codfw+prometheus/ops
[00:34:14] (PS1) DannyS712: Remove project namespace from flaggedrevs on ruwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/514413 (https://phabricator.wikimedia.org/T225037)
[00:35:28] (PS2) DannyS712: Remove project namespace from flaggedrevs on ruwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/514413 (https://phabricator.wikimedia.org/T225037)
[00:38:11] Operations, Cloud-VPS, Traffic, cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (bd808) >>! In T223902#5211701, @bd808 wrote: > Reading the discussion here and in irc earlier today, I think the more general topic of whic...
[02:37:55] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 23609496 and 0 seconds
[02:39:21] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 736 and 3 seconds
[03:26:53] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[03:59:21] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:09:43] !log starting postgres slave init on maps2001 - T224395
[04:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:09:50] T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395
[05:07:19] !log Upgrade Mysql on labsdb1012 - T224852
[05:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:07:24] T224852: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852
[05:17:58] Operations, serviceops, PHP 7.2 support, Performance-Team (Radar), and 2 others: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) - https://phabricator.wikimedia.org/T224491 (Joe) >>! In T224491#5234675, @pmiazga wrote: > @Niedzielski noticed another, pretty similar issue: {T22501...
[05:19:05] (PS1) Marostegui: db2058: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/514418 (https://phabricator.wikimedia.org/T220170)
[05:19:44] (CR) jerkins-bot: [V: -1] db2058: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/514418 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[05:20:24] (PS2) Marostegui: db2058db2065: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/514418 (https://phabricator.wikimedia.org/T220170)
[05:21:07] (CR) jerkins-bot: [V: -1] db2058db2065: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/514418 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[05:21:21] (PS3) Marostegui: db2058db2065: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/514418 (https://phabricator.wikimedia.org/T220170)
[05:22:17] (PS4) Marostegui: db2058,db2065: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/514418 (https://phabricator.wikimedia.org/T220170)
[05:22:49] !log Change replication topology on m3 codfw to promote db2065 as codfw master instead of db2042 - T221533
[05:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:55] T221533: Decommission old coredb machines (<=db2042) - https://phabricator.wikimedia.org/T221533
[05:24:26] Operations, Page-Previews, Readers-Web-Backlog, Wikimedia-production-error: [Bug] TypeError in PopupsContext - https://phabricator.wikimedia.org/T225018 (Joe) What I see on the linked kibana error dashboard are 2 errors, in the same second, from the same server, which was just rebooted and just s...
[05:25:27] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:26:53] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:28:58] !log Keep compressing tables on labsdb1012 - T222978
[05:29:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:04] T222978: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978
[05:30:13] (CR) Marostegui: [C: +2] db2058,db2065: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/514418 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[05:31:42] !log Stop MySQL on db1125 (sanitarium) s2,s4,s6,s7 to upgrade mysql - T224852
[05:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:47] T224852: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852
[05:40:46] (PS1) Marostegui: db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/514419 (https://phabricator.wikimedia.org/T224852)
[05:46:10] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/514419 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[05:47:02] (Merged) jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/514419 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[05:47:17] (CR) jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/514419 (https://phabricator.wikimedia.org/T224852) (owner: Marostegui)
[05:49:05] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1084 for upgrade T224852 (duration: 01m 06s)
[05:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:10] T224852: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852
[05:49:22] !log Upgrade MySQL on db1084 T224852
[05:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:52:53] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - https://gerrit.wikimedia.org/r/514420
[05:55:29] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - https://gerrit.wikimedia.org/r/514420 (owner: Marostegui)
[05:56:20] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - https://gerrit.wikimedia.org/r/514420 (owner: Marostegui)
[05:56:35] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - https://gerrit.wikimedia.org/r/514420 (owner: Marostegui)
[05:57:26] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1084 after upgrade T224852 (duration: 00m 55s)
[05:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:31] T224852: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852
[05:59:49] (PS1) Marostegui: db-eqiad.php: Give more weight to db1084 on API [mediawiki-config] - https://gerrit.wikimedia.org/r/514421
[06:11:44] (CR) Marostegui: [C: +2] db-eqiad.php: Give more weight to db1084 on API [mediawiki-config] - https://gerrit.wikimedia.org/r/514421 (owner: Marostegui)
[06:12:36] (Merged) jenkins-bot: db-eqiad.php: Give more weight to db1084 on API [mediawiki-config] - https://gerrit.wikimedia.org/r/514421 (owner: Marostegui)
[06:12:53] (CR) jenkins-bot: db-eqiad.php: Give more weight to db1084 on API [mediawiki-config] - https://gerrit.wikimedia.org/r/514421 (owner: Marostegui)
[06:13:45] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1084 into API (duration: 00m 54s)
[06:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:58] !log Start topology changes on s4 codfw to replace current master db2051 with db2090 - T220170
[06:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:03] T220170: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170
[06:18:08] Operations, MediaWiki-Cache, Patch-For-Review, Performance-Team (Radar), User-Elukey: Deprecate the usage of nutcracker for memcached - https://phabricator.wikimedia.org/T214275 (elukey) Connections established to the eqiad memcached shards after the reboot of the mw1* hosts: {F29345943}
[06:21:11] Operations, MediaWiki-Cache, Patch-For-Review, Performance-Team (Radar), User-Elukey: Deprecate the usage of nutcracker for memcached - https://phabricator.wikimedia.org/T214275 (elukey)
[06:21:17] (PS1) Marostegui: db-codfw.php: Mimic codfw weights to eqiad [mediawiki-config] - https://gerrit.wikimedia.org/r/514423 (https://phabricator.wikimedia.org/T220170)
[06:23:09] (CR) Marostegui: [C: +2] db-codfw.php: Mimic codfw weights to eqiad [mediawiki-config] - https://gerrit.wikimedia.org/r/514423 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[06:24:00] (Merged) jenkins-bot: db-codfw.php: Mimic codfw weights to eqiad [mediawiki-config] - https://gerrit.wikimedia.org/r/514423 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[06:24:14] (CR) jenkins-bot: db-codfw.php: Mimic codfw weights to eqiad [mediawiki-config] - https://gerrit.wikimedia.org/r/514423 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[06:25:29] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Mimic s4 codfw weights to eqiad T220170 (duration: 00m 55s)
[06:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:34] T220170: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170
[06:31:18] (PS1) Marostegui: mariadb: Promote db2090 to codfw s4 master [puppet] - https://gerrit.wikimedia.org/r/514424 (https://phabricator.wikimedia.org/T220170)
[06:33:52] (CR) Marostegui: [C: +2] mariadb: Promote db2090 to codfw s4 master [puppet] - https://gerrit.wikimedia.org/r/514424 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[06:41:05] (Abandoned) Elukey: Replace the 'hdfs' user with 'analytics' in Hadoop's job launchers [puppet] - https://gerrit.wikimedia.org/r/504861 (https://phabricator.wikimedia.org/T220971) (owner: Elukey)
[06:41:10] (PS1) Marostegui: db-codfw.php: Promote db2090 to s4 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/514425 (https://phabricator.wikimedia.org/T220170)
[06:43:09] (CR) Marostegui: [C: +2] db-codfw.php: Promote db2090 to s4 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/514425 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[06:43:30] (PS1) Marostegui: mariadb: db2051 disable notifications [puppet] - https://gerrit.wikimedia.org/r/514426 (https://phabricator.wikimedia.org/T220170)
[06:44:09] (Merged) jenkins-bot: db-codfw.php: Promote db2090 to s4 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/514425 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[06:44:31] (CR) jenkins-bot: db-codfw.php: Promote db2090 to s4 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/514425 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[06:44:37] (CR) Marostegui: [C: +2] mariadb: db2051 disable notifications [puppet] - https://gerrit.wikimedia.org/r/514426 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[06:45:23] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Promote db2090 to s4 codfw master T220170 (duration: 00m 54s)
[06:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:29] !log Restart MySQL on db2110 to get the binlog format changed to STATEMENT - T220170
[06:45:29] T220170: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170
[06:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:29] (PS13) Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786)
[06:52:56] (CR) Elukey: "Replaced the hiera call with lookup as requested, pcc still ok https://puppet-compiler.wmflabs.org/compiler1001/16869/" [puppet] - https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) (owner: Elukey)
[07:03:27] (PS5) Muehlenhoff: Stop using transitional package names for Icinga plugins [puppet] - https://gerrit.wikimedia.org/r/494681 (https://phabricator.wikimedia.org/T213527)
[07:06:36] Operations, Performance-Team, User-Elukey: Consider adding per shard metrics to the prometheus mcrouter exporter - https://phabricator.wikimedia.org/T225059 (elukey)
[07:06:57] Operations, Performance-Team, User-Elukey: Consider adding per shard metrics to the prometheus mcrouter exporter - https://phabricator.wikimedia.org/T225059 (elukey)
[07:07:16] Operations, LDAP-Access-Requests: Grant WMDE engineers access to logstash / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (WMDE-leszek)
[07:07:18] (CR) Muehlenhoff: [C: +2] Stop using transitional package names for Icinga plugins [puppet] - https://gerrit.wikimedia.org/r/494681 (https://phabricator.wikimedia.org/T213527) (owner: Muehlenhoff)
[07:07:40] Operations, Performance-Team, observability, User-Elukey: Consider adding per shard metrics to the prometheus mcrouter exporter - https://phabricator.wikimedia.org/T225059 (elukey)
[07:07:49] Operations, Performance-Team, observability, User-Elukey: Consider adding per shard metrics to the prometheus mcrouter exporter - https://phabricator.wikimedia.org/T225059 (elukey) p:Triage→Normal
[07:08:59] PROBLEM - Host db1091 is DOWN: PING CRITICAL - Packet loss = 100%
[07:09:04] what?
[07:09:05] checking that
[07:09:13] indeed down
[07:09:20] depooling
[07:09:55] going to silence it before it pages everyone
[07:10:11] (PS1) Marostegui: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/514427
[07:10:29] (CR) Marostegui: [V: +2 C: +2] db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/514427 (owner: Marostegui)
[07:10:50] (CR) jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/514427 (owner: Marostegui)
[07:11:13] depooled, going to investigate what happened
[07:11:34] weird, I wonder what happened
[07:11:52] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1091 - host went down (duration: 00m 55s)
[07:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:21] (PS1) Ema: cache: reimage cp3035 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/514428 (https://phabricator.wikimedia.org/T222937)
[07:16:25] (PS1) Muehlenhoff: ntp: Remove support for Ubuntu [puppet] - https://gerrit.wikimedia.org/r/514429
[07:16:27] Operations, ops-eqiad, DBA: db1091 crashed - https://phabricator.wikimedia.org/T225060 (Marostegui)
[07:19:54] (PS3) Muehlenhoff: base: Remove support for trusty/Ubuntu in multiple places [puppet] - https://gerrit.wikimedia.org/r/500400
[07:19:56] RECOVERY - Host db1091 is UP: PING WARNING - Packet loss = 93%, RTA = 0.28 ms
[07:20:36] ACKNOWLEDGEMENT - HP RAID on db1091 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T225061 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[07:20:39] Operations, ops-eqiad: Degraded RAID on db1091 - https://phabricator.wikimedia.org/T225061 (ops-monitoring-bot)
[07:20:58] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[07:21:34] (CR) Vgutierrez: [C: +1] cache: reimage cp3035 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/514428 (https://phabricator.wikimedia.org/T222937) (owner: Ema)
[07:22:30] (CR) Muehlenhoff: [C: +2] base: Remove support for trusty/Ubuntu in multiple places [puppet] - https://gerrit.wikimedia.org/r/500400 (owner: Muehlenhoff)
[07:22:51] !log depool cp3035 and reimage as upload_ats T222937
[07:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:56] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937
[07:23:00] Operations, ops-eqiad, DBA: db1091 crashed - https://phabricator.wikimedia.org/T225060 (Marostegui) p:Triage→High BBU broke ` Battery/Capacitor Count: 0 ` @Cmjohnson Can we give this host some priority? I wouldn't want to have it down for the whole offsite week. I believe its support just...
[07:23:42] (CR) Ema: [C: +2] cache: reimage cp3035 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/514428 (https://phabricator.wikimedia.org/T222937) (owner: Ema)
[07:23:50] (PS2) Ema: cache: reimage cp3035 as upload_ats [puppet] - https://gerrit.wikimedia.org/r/514428 (https://phabricator.wikimedia.org/T222937)
[07:24:25] Operations, ops-eqiad, DBA: db1091 crashed - https://phabricator.wikimedia.org/T225060 (Marostegui) @jcrespo I am going to place db1135 temporarily (T222682) to replace this host until we have found a solution
[07:27:18] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3035.esams.wmnet'] ` The log can be found in `...
[07:27:29] (PS1) Marostegui: db-eqiad.php: Clarify db1091 status [mediawiki-config] - https://gerrit.wikimedia.org/r/514431
[07:28:14] Operations, MediaWiki-Cache, Patch-For-Review, Performance-Team (Radar), and 2 others: Mcrouter periodically reports soft TKOs for mc1029 (was mc1035, mc1022) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (elukey) The translate-groups key has been successfully red...
[07:29:33] (PS1) Muehlenhoff: sslcert: Remove support for Ubuntu [puppet] - https://gerrit.wikimedia.org/r/514432
[07:31:15] (CR) Marostegui: [C: +2] db-eqiad.php: Clarify db1091 status [mediawiki-config] - https://gerrit.wikimedia.org/r/514431 (owner: Marostegui)
[07:31:17] (PS1) Marostegui: mariadb: Temporary: place db1135 into s4 [puppet] - https://gerrit.wikimedia.org/r/514433 (https://phabricator.wikimedia.org/T225060)
[07:32:08] (Merged) jenkins-bot: db-eqiad.php: Clarify db1091 status [mediawiki-config] - https://gerrit.wikimedia.org/r/514431 (owner: Marostegui)
[07:32:24] (CR) jenkins-bot: db-eqiad.php: Clarify db1091 status [mediawiki-config] - https://gerrit.wikimedia.org/r/514431 (owner: Marostegui)
[07:32:25] Operations, ops-eqiad: Degraded RAID on db1091 - https://phabricator.wikimedia.org/T225061 (Marostegui) Open→Declined This is a broken BBU, being handled at T225060
[07:33:17] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/514413 (https://phabricator.wikimedia.org/T225037) (owner: DannyS712)
[07:33:42] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Clarify db1091 status (duration: 00m 56s)
[07:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:24] (CR) Marostegui: [C: +2] mariadb: Temporary: place db1135 into s4 [puppet] - https://gerrit.wikimedia.org/r/514433 (https://phabricator.wikimedia.org/T225060) (owner: Marostegui)
[07:34:48] PROBLEM - IPsec on cp1088 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:34:54] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:34:58] PROBLEM - IPsec on cp1078 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:35:02] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:35:38] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:35:44] PROBLEM - IPsec on cp1090 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:35:44] PROBLEM - IPsec on cp1080 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:35:46] PROBLEM - IPsec on cp1086 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:35:48] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:35:50] PROBLEM - IPsec on cp1082 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:36:00] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:36:00] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:36:02] PROBLEM - IPsec on cp1084 is CRITICAL: Strongswan CRITICAL - ok: 40 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:36:06] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:36:13] what's going on?
[07:36:18] ema: ^
[07:36:45] cp35 reimage
[07:36:49] marostegui: it's the reimage of cp3035, nothing to worry about. Thanks!
[07:36:51] *cp3035
[07:36:53] and sorry for the noise
[07:36:53] ah right
[07:36:54] cheers
[07:37:00] Operations, SRE-Access-Requests: Requesting access to deployment cluster for awight - https://phabricator.wikimedia.org/T225062 (awight)
[07:37:10] (PS1) Muehlenhoff: cpufrequtils: Remove support for Ubuntu [puppet] - https://gerrit.wikimedia.org/r/514434
[07:37:45] reimaging to ATS actually does remove the ipsec cruft on all other nodes (like those that are alerting now), but puppet hasn't run on those yet
[07:38:16] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 32 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:38:53] (PS4) Vgutierrez: ATS: Set log mode independently of log filters [puppet] - https://gerrit.wikimedia.org/r/512636 (https://phabricator.wikimedia.org/T224397)
[07:40:40] (CR) Hashar: [C: +1] nagios_common: update members of the gerrit contact group [puppet] - https://gerrit.wikimedia.org/r/512292 (owner: Dzahn)
[07:40:47] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[07:41:39] RECOVERY - IPsec on cp1078 is OK: Strongswan OK - 40 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:42:43] hashar: adt-run is failing on debian-glue jobs for reasons that seem Ruby-related? Do you have any clue about what's going on? https://integration.wikimedia.org/ci/job/debian-glue/1478/console
[07:43:20] is that the reason why the job fails BTW?
[07:44:39] ema: I blame debian! :D
[07:44:53] is that a new failure?
[07:44:55] (PS3) Muehlenhoff: Include grub::defaults unconditionally [puppet] - https://gerrit.wikimedia.org/r/505886 (https://phabricator.wikimedia.org/T140100)
[07:45:29] hashar: possibly, but I haven't had to do with debian-glue jobs in a bit
[07:45:36] !log Transfer dbprov1001.eqiad.wmnet:snapshot.s4.2019-06-04--21-37-03.tar.gz to db1135 to provision it on s4 T225060
[07:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:42] T225060: db1091 crashed - https://phabricator.wikimedia.org/T225060
[07:45:49] hashar: has there been any changes in the CI images lately?
[07:45:55] none to my knowledge
[07:45:58] I see
[07:45:58] 00:02:21.528 Err http://security.debian.org jessie/updates Release.gpg
[07:45:58] 00:02:21.528 Cannot initiate the connection to webproxy.eqiad.wmnet:8080 (2620:0:861:1:208:80:154:22). - connect (101: Network is unreachable) [IP: 2620:0:861:1:208:80:154:22 8080]
[07:45:58] 00:02:21.544 Reading package lists...
[07:46:04] (CR) Muehlenhoff: [C: +2] Include grub::defaults unconditionally [puppet] - https://gerrit.wikimedia.org/r/505886 (https://phabricator.wikimedia.org/T140100) (owner: Muehlenhoff)
[07:46:13] but that might just be noise
[07:46:14] ah yes, that also does not look good
[07:46:51] OH
[07:47:02] well the build works just fine
[07:47:16] https://integration.wikimedia.org/ci/job/debian-glue/1478/ has a couple .deb
[07:47:27] but it is marked as a failure due to the test results
[07:47:38] https://integration.wikimedia.org/ci/job/debian-glue/1478/testReport/
[07:47:39] E: libvarnishapi1: postinst-must-call-ldconfig usr/lib/x86_64-linux-gnu/libvarnishapi.so.1.0.6
[07:47:40] ;]
[07:47:46] ema: tldr: lintian complains
[07:47:47] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 32 not-conn: cp3035_v4, cp3035_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:48:46] hashar: ah! How do you find out which of the build commands causes the build to be marked as failed?
[07:49:19] in other words, how do you know that it's lintian and not (for example) the network unreachable error to cause the -1?
[07:49:47] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 32 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:49:55] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 32 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:50:23] ema: hmm years of experience and a wild guess ? :]
[07:50:25] it is not DNS!
[07:50:35] RECOVERY - IPsec on cp1082 is OK: Strongswan OK - 40 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:50:58] so that got lintian 2.5.30+deb8u4 , and we run it as lintian --profile wikimedia
[07:51:16] I think the profile just add our custom distro names (eg: jessie-wikimedia )
[07:51:31] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 32 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:52:01] Operations, Traffic, Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3035.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3035.esams.wmnet'] `
[07:52:33] RECOVERY - IPsec on cp1086 is OK: Strongswan OK - 40 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:52:52] (PS1) Muehlenhoff: openstack::common Remove support for Ubuntu [puppet] - https://gerrit.wikimedia.org/r/514435
[07:54:19] RECOVERY - IPsec on cp1084 is OK: Strongswan OK - 40 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:54:43] RECOVERY - IPsec on cp1080 is OK: Strongswan OK - 40 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:55:37] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 32 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:55:37] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 32 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:55:51] RECOVERY - IPsec on cp1090 is OK: Strongswan OK - 40 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:55:59] RECOVERY - IPsec on cp1088 is OK: Strongswan OK - 40 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:57:29] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 32 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[07:59:01] ema: that most probably existed previously. Then I cant tell why on the previous build it passed just fine, its from July 2018 and we no more have the logs
[08:03:11] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 32 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan
[08:03:18] (PS1) Marostegui: db1091: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/514436 (https://phabricator.wikimedia.org/T225060)
[08:05:37] !log installing qemu security updates on Ganeti hosts
[08:05:37] (CR) Marostegui: [C: +2] db1091: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/514436 (https://phabricator.wikimedia.org/T225060) (owner: Marostegui)
[08:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:47] PROBLEM - puppet last run on mc2022 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[08:06:51] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[08:06:57] (PS6) Elukey: mcrouter: allow async foreign set/delete WAN cache operations [puppet] - https://gerrit.wikimedia.org/r/492948 (owner: Aaron Schulz)
[08:07:25] PROBLEM - puppet last run on db2071 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[08:08:35] PROBLEM - puppet last run on ping2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[08:09:50] ACKNOWLEDGEMENT - IPMI Sensor Status on cp3035 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] Ema Sadly known
[08:11:23] PROBLEM - puppet last run on mw2218 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[08:11:27] PROBLEM - puppet last run on kubetcd2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[08:11:35] Operations, ops-eqiad, DBA: db1091 crashed - https://phabricator.wikimedia.org/T225060 (Marostegui)
[08:12:05] PROBLEM - puppet last run on ms-be2030 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[08:12:11] !log Reboot db1091 T225060
[08:12:13] PROBLEM - puppet last run on db2089 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[08:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:17] T225060: db1091 crashed - https://phabricator.wikimedia.org/T225060
[08:12:23] !log pool cp3035 w/ ATS backend T222937
[08:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:28] T222937: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937
[08:12:46] !log rebooting pybal-test2001 for tests with new qemu
[08:12:49] PROBLEM - puppet last run on thumbor2003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[08:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:59] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [08:13:00] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:05] PROBLEM - puppet last run on ms-be2050 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [08:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:09] (03CR) 10Elukey: "Getting back to this, there is a big backlog of mcrouter/memcached tasks, sorry :)" [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz) [08:13:24] (03CR) 10Elukey: "Pcc shows no op: https://puppet-compiler.wmflabs.org/compiler1001/16874/" [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz) [08:16:18] (03CR) 10Vgutierrez: [C: 03+2] ATS: Set log mode independently of log filters [puppet] - 10https://gerrit.wikimedia.org/r/512636 (https://phabricator.wikimedia.org/T224397) (owner: 10Vgutierrez) [08:16:27] (03PS5) 10Vgutierrez: ATS: Set log mode independently of log filters [puppet] - 10https://gerrit.wikimedia.org/r/512636 (https://phabricator.wikimedia.org/T224397) [08:21:02] <_joe_> !log performing a rolling restart of the php appservers via cumin to test speed and safety of the operations proposed in T224857 [08:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:07] PROBLEM - puppet last run on pybal-test2002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [08:21:08] T224857: Enhance MediaWiki deployments for support of php7.x - https://phabricator.wikimedia.org/T224857 [08:21:41] PROBLEM - puppet last run on kafka-main2002 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml] [08:22:10] 10Operations, 10CX-cxserver, 10Citoid, 10Graphoid, and 10 others: Make services swagger specs standard compliant - https://phabricator.wikimedia.org/T218217 (10Pchelolo) [08:22:27] PROBLEM - puppet last run on lvs2006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/bash/puppet-common.sh] [08:22:56] (03PS3) 10Vgutierrez: ATS: Ensure proper permissions for ATS layouts [puppet] - 10https://gerrit.wikimedia.org/r/512643 (https://phabricator.wikimedia.org/T221217) [08:25:09] PROBLEM - puppet last run on mw2245 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/install-pkg-src] [08:25:46] (03CR) 10Vgutierrez: [C: 03+2] ATS: Avoid using traffic_layout [puppet] - 10https://gerrit.wikimedia.org/r/512855 (https://phabricator.wikimedia.org/T224428) (owner: 10Vgutierrez) [08:26:03] (03PS3) 10Vgutierrez: ATS: Avoid using traffic_layout [puppet] - 10https://gerrit.wikimedia.org/r/512855 (https://phabricator.wikimedia.org/T224428) [08:26:05] PROBLEM - puppet last run on mw2246 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] [08:30:27] 10Operations, 10Traffic: ATS: traffic_layout currently forces to use its own copy of shared libraries - https://phabricator.wikimedia.org/T224428 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [08:34:29] RECOVERY - puppet last run on ping2001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [08:35:33] RECOVERY - puppet last run on mw2245 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:36:06] jouncebot: next [08:36:06] In 2 hour(s) and 23 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190605T1100) [08:36:16] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Provision db1135 into s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514439 (https://phabricator.wikimedia.org/T225060) [08:36:29] RECOVERY - puppet last run on mw2246 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [08:36:37] RECOVERY - puppet last run on pybal-test2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:37:15] RECOVERY - puppet last run on kafka-main2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [08:37:55] RECOVERY - puppet last run on ms-be2030 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:38:07] RECOVERY - puppet last run on lvs2006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [08:38:09] RECOVERY - puppet last run on mc2022 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [08:38:09] RECOVERY - puppet last run on db2089 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [08:38:45] RECOVERY - puppet last run on thumbor2003 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [08:38:45] RECOVERY - puppet last run on db2071 is OK: OK: 
Puppet is currently enabled, last run 5 minutes ago with 0 failures [08:38:49] RECOVERY - puppet last run on ms-be2050 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [08:39:39] RECOVERY - puppet last run on kubetcd2001 is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures [08:39:49] RECOVERY - puppet last run on mw2218 is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures [08:40:52] <_joe_> !log rolling restart of php7 on the api servers, to test a different strategy of restarting compared to the appservers. [08:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:25] (03CR) 10Gehel: [C: 04-1] wdqs: add WDQS restart cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: 10Mathew.onipe) [08:42:06] !log removing maps2001 from cassandra cluster. It is going to be reimaged - T224395 [08:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:12] T224395: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 [08:45:44] 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts: ` ['maps2001.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201906050845_gehel_3... 
[08:47:16] (03PS1) 10Alex Monk: openstack designate wmf_sink: rm unreachable code [puppet] - 10https://gerrit.wikimedia.org/r/514440 [08:47:45] (03PS1) 10Ema: Drop 0001-gethdr_extrachance.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514441 [08:47:47] (03PS1) 10Ema: Add 0024-vbt-get-force-fresh.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514442 [08:47:49] (03PS1) 10Ema: Add 0025-extrachance-one-retry.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514443 [08:52:41] (03CR) 10jerkins-bot: [V: 04-1] Add 0024-vbt-get-force-fresh.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514442 (owner: 10Ema) [08:54:13] (03PS66) 10Vgutierrez: ATS: Provide a TLS terminator profile [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [08:54:48] (03CR) 10Gehel: [C: 04-1] add WDQS reboot cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/512915 (https://phabricator.wikimedia.org/T224385) (owner: 10Mathew.onipe) [08:56:07] (03CR) 10jerkins-bot: [V: 04-1] Drop 0001-gethdr_extrachance.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514441 (owner: 10Ema) [08:57:03] (03CR) 10jerkins-bot: [V: 04-1] Add 0025-extrachance-one-retry.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514443 (owner: 10Ema) [09:03:00] (03PS1) 10Volans: icinga: manage metamonitor known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/514444 (https://phabricator.wikimedia.org/T222074) [09:03:59] (03CR) 10jerkins-bot: [V: 04-1] icinga: manage metamonitor known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/514444 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans) [09:04:10] (03PS2) 10Marostegui: db-eqiad,db-codfw.php: Provision db1135 into s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514439 (https://phabricator.wikimedia.org/T225060) [09:05:06] (03PS2) 10Volans: icinga: manage metamonitor 
known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/514444 (https://phabricator.wikimedia.org/T222074) [09:08:25] (03PS1) 10Arturo Borrero Gonzalez: openstack: remove unused common class [puppet] - 10https://gerrit.wikimedia.org/r/514445 (https://phabricator.wikimedia.org/T220051) [09:09:13] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "Thanks for the patch!" [puppet] - 10https://gerrit.wikimedia.org/r/514435 (owner: 10Muehlenhoff) [09:09:15] (03CR) 10Volans: "I'll take care of manually creating the $HOME directory on the existing hosts given that Puppet would take care of it only on user creatio" [puppet] - 10https://gerrit.wikimedia.org/r/514444 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans) [09:09:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: remove unused common class [puppet] - 10https://gerrit.wikimedia.org/r/514445 (https://phabricator.wikimedia.org/T220051) (owner: 10Arturo Borrero Gonzalez) [09:15:39] (03PS1) 10Elukey: profile::kerberos::kadminserver: allow puppetmaster to rsync keytabs [puppet] - 10https://gerrit.wikimedia.org/r/514447 (https://phabricator.wikimedia.org/T212257) [09:16:18] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) https://wikitech.wikimedia.org/wiki/Mailman [09:18:20] (03PS2) 10Elukey: profile::kerberos::kadminserver: allow puppetmaster to rsync keytabs [puppet] - 10https://gerrit.wikimedia.org/r/514447 (https://phabricator.wikimedia.org/T212257) [09:21:59] (03CR) 10Jcrespo: [C: 03+1] "Ok, still lagging behind for a full repool." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/514439 (https://phabricator.wikimedia.org/T225060) (owner: 10Marostegui) [09:22:37] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Provision db1135 into s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514439 (https://phabricator.wikimedia.org/T225060) (owner: 10Marostegui) [09:22:43] 10Operations, 10Diffusion, 10Release-Engineering-Team (Kanban), 10User-zeljkofilipin: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10zeljkofilipin) [09:23:44] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Provision db1135 into s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514439 (https://phabricator.wikimedia.org/T225060) (owner: 10Marostegui) [09:24:00] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Provision db1135 into s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514439 (https://phabricator.wikimedia.org/T225060) (owner: 10Marostegui) [09:25:16] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Pool without traffic db1135 into s4 T225060 (duration: 00m 56s) [09:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:22] T225060: db1091 crashed - https://phabricator.wikimedia.org/T225060 [09:26:16] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Pool without traffic db1135 into s4 T225060 (duration: 00m 55s) [09:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:26] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. 
https://wikitech.wikimedia.org/wiki/Mailman [09:32:34] (03PS1) 10Marostegui: db-eqiad.php: Slowly pool db1135 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514448 [09:34:48] (03CR) 10Marostegui: [C: 04-2] "not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514448 (owner: 10Marostegui) [09:38:40] zeljkof: Hey, I have a train blocker https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/514446 this needs to be backported to wmf.8 before deployment to group1 [09:39:39] Amir1: it's not mentioned in the task? T220733 [09:39:40] T220733: 1.34.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T220733 [09:39:55] are you deploying it during eu swat? [09:40:02] 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps2001.codfw.wmnet'] ` and were **ALL** successful. [09:40:20] (03PS1) 10Pmiazga: Enable the new history page in the advanced mobile contributions mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514449 (https://phabricator.wikimedia.org/T219895) [09:41:05] zeljkof: I can deploy it ASAP [09:41:40] looks like there's no other deployments until SWAT, if nobody complains, go ahead :) [09:42:53] Sure, let's wait until the patch gets merged on master [09:44:20] Amir1: can I deploy mw-config then? [09:44:33] marostegui: go ahead please [09:44:37] excellent thanks! 
[09:44:41] this takes a little bit of time [09:46:00] (03CR) 10Marostegui: db-eqiad.php: Slowly pool db1135 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514448 (owner: 10Marostegui) [09:46:27] (03PS1) 10Marostegui: db1135: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/514450 (https://phabricator.wikimedia.org/T225060) [09:47:06] (03PS2) 10Pmiazga: Enable the new history page in the advanced mobile contributions mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514449 (https://phabricator.wikimedia.org/T219895) [09:47:11] (03CR) 10Marostegui: [C: 03+2] db1135: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/514450 (https://phabricator.wikimedia.org/T225060) (owner: 10Marostegui) [09:47:42] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly pool db1135 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514448 (owner: 10Marostegui) [09:48:43] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly pool db1135 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514448 (owner: 10Marostegui) [09:48:57] (03CR) 10jenkins-bot: db-eqiad.php: Slowly pool db1135 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514448 (owner: 10Marostegui) [09:50:03] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Pool db1135 with very low weight on s4 (duration: 00m 55s) [09:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:20] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/514429 (owner: 10Muehlenhoff) [09:54:44] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/514432 (owner: 10Muehlenhoff) [10:01:18] 10Operations, 10DBA, 10observability: Generate instance list of database hosts to be monitored automatically from exported resources - https://phabricator.wikimedia.org/T177779 (10fgiunchedi) I was reviewing #observability backlog, to me it looks like this is a duplicate of {T145072} ? 
[10:02:35] (03PS3) 10Elukey: profile::kerberos::kadminserver: allow puppetmaster to rsync keytabs [puppet] - 10https://gerrit.wikimedia.org/r/514447 (https://phabricator.wikimedia.org/T212257) [10:03:04] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:03:36] 10Operations, 10DBA, 10observability: Create a script to regenerate prometheus mysqld exporter listing that works with puppetdb - https://phabricator.wikimedia.org/T145072 (10jcrespo) [10:03:39] 10Operations, 10DBA, 10observability: Generate instance list of database hosts to be monitored automatically from exported resources - https://phabricator.wikimedia.org/T177779 (10jcrespo) [10:05:14] 10Operations, 10DBA, 10observability: Generate instance list of active database hosts to be monitored from prometheus - https://phabricator.wikimedia.org/T145072 (10jcrespo) [10:07:25] (03PS1) 10Alex Monk: certmanager: Set up config for running inside labs realm [puppet] - 10https://gerrit.wikimedia.org/r/514454 (https://phabricator.wikimedia.org/T171188) [10:09:55] !log mount sdb3 on ms-be1022 - T225079 [10:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:00] T225079: Alert on unmounted swift partitions - https://phabricator.wikimedia.org/T225079 [10:11:15] (03PS1) 10Marostegui: db-eqiad.php: Give db1135 more traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514455 [10:11:45] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/514434 (owner: 10Muehlenhoff) [10:12:31] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Give db1135 more traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514455 (owner: 10Marostegui) [10:13:27] (03Merged) 10jenkins-bot: db-eqiad.php: Give db1135 more traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514455 (owner: 10Marostegui) [10:13:42] (03CR) 10jenkins-bot: db-eqiad.php: Give db1135 more traffic 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/514455 (owner: 10Marostegui) [10:14:38] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1135 (duration: 00m 55s) [10:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:36] (03PS19) 10Mathew.onipe: Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) [10:19:16] (03CR) 10jerkins-bot: [V: 04-1] Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [10:21:21] I'm deploying a change for flagged revs [10:22:27] (03Abandoned) 10Muehlenhoff: openstack::common Remove support for Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/514435 (owner: 10Muehlenhoff) [10:24:20] (03PS20) 10Mathew.onipe: Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) [10:24:23] (03PS1) 10Muehlenhoff: swift: Remove support for Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/514457 [10:25:56] (03CR) 10jerkins-bot: [V: 04-1] Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [10:28:54] 10Operations, 10Wikimedia-Mailing-lists, 10Upstream: https://lists.wikimedia.org/mailman/listinfo/wikimediapl-l has mixed encoding - https://phabricator.wikimedia.org/T111457 (10Aklapper) 05Declined→03Resolved Well, it seems fixed nowadays. 
[10:29:17] 10Operations, 10LDAP-Access-Requests: Request to add Rmaung to the ldap/wmf group - https://phabricator.wikimedia.org/T224744 (10Aklapper) [10:31:57] 10Operations, 10Performance-Team: Test usage of igbinary with apcu with MediaWiki - https://phabricator.wikimedia.org/T225074 (10Krinkle) [10:36:47] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/514332 (https://phabricator.wikimedia.org/T224535) (owner: 10Ayounsi) [10:37:18] (03CR) 10Gehel: [C: 04-1] Add postgres slave init cookbook (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [10:37:30] 10Operations, 10Analytics, 10EventBus, 10Core Platform Team Backlog (Watching / External), and 3 others: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005 (10mobrovac) [10:37:39] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to Logstash for Cstone - https://phabricator.wikimedia.org/T225010 (10fsero) [10:39:29] (03CR) 10Jbond: [C: 03+2] firewall logging: enable firewall logging on external servers [puppet] - 10https://gerrit.wikimedia.org/r/511703 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [10:41:58] (03PS2) 10Jbond: firewall logging: enable firewall logging on external servers [puppet] - 10https://gerrit.wikimedia.org/r/511703 (https://phabricator.wikimedia.org/T116011) [10:42:47] 10Operations, 10SRE-Access-Requests: Requesting access to deployment cluster for awight - https://phabricator.wikimedia.org/T225062 (10fsero) waiting for manager's approval [10:42:52] (03CR) 10jerkins-bot: [V: 04-1] firewall logging: enable firewall logging on external servers [puppet] - 10https://gerrit.wikimedia.org/r/511703 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [10:44:07] (03PS3) 10Jbond: firewall logging: enable firewall logging on external servers [puppet] - 
10https://gerrit.wikimedia.org/r/511703 (https://phabricator.wikimedia.org/T116011) [10:44:55] zeljkof: if anything related to FlaggedRevs has issues, please let me know. I'm pushing a big change [10:45:35] Amir1: will do, thanks for the heads up [10:45:53] (03CR) 10Jbond: [C: 03+2] firewall logging: enable firewall logging on external servers [puppet] - 10https://gerrit.wikimedia.org/r/511703 (https://phabricator.wikimedia.org/T116011) (owner: 10Jbond) [10:47:14] Jenkins is soooooooo slow, a backport to flaggedrevs takes half an hour [10:57:21] !log ladsgroup@deploy1001 Synchronized php-1.34.0-wmf.8/extensions/FlaggedRevs: [[gerrit:514456|Add ext.flaggedRevs.icons to modules registeration]] (duration: 00m 57s) [10:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:01] The backport for flagged revs is done, there's one small backport left for WikimediaMessages (legal things) [10:58:58] 10Operations, 10LDAP-Access-Requests: Grant WMDE engineers access to logstash / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (10fsero) Hi @WMDE-leszek, your request looks reasonable to me, i'd like for @RStallman-legalteam to verify NDAs for a double check. But this chang... [11:00:04] Amir1, Lucas_WMDE, and Urbanecm: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190605T1100). [11:00:04] Urbanecm and raynor: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:20] o/ [11:00:24] o/ [11:00:25] Ready to SWAT [11:00:40] o/ [11:00:59] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514413 (https://phabricator.wikimedia.org/T225037) (owner: 10DannyS712) [11:01:06] I have one small thing to deploy but I can go last [11:01:51] (03Merged) 10jenkins-bot: Remove project namespace from flaggedrevs on ruwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514413 (https://phabricator.wikimedia.org/T225037) (owner: 10DannyS712) [11:02:06] (03CR) 10jenkins-bot: Remove project namespace from flaggedrevs on ruwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514413 (https://phabricator.wikimedia.org/T225037) (owner: 10DannyS712) [11:02:30] (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: services: unify primary/secondary roles [puppet] - 10https://gerrit.wikimedia.org/r/514279 (https://phabricator.wikimedia.org/T224743) [11:03:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "The PCC diff for labmon1001/1002 doesn't make a lot of sense to me. But it seems that the role name change is being accounted for. I will " [puppet] - 10https://gerrit.wikimedia.org/r/514279 (https://phabricator.wikimedia.org/T224743) (owner: 10Arturo Borrero Gonzalez) [11:07:04] ping me when you all are done, thanks [11:12:30] (03PS21) 10Mathew.onipe: Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) [11:15:00] (03CR) 10Mathew.onipe: Add postgres slave init cookbook (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [11:20:46] (03PS1) 10Jbond: icinga: Add a script to parse and query the stus.dat file [puppet] - 10https://gerrit.wikimedia.org/r/514459 [11:25:41] Urbanecm - any updates? 
[11:25:50] raynor, sorry, got distracted [11:26:08] np [11:27:42] !log urbanecm@deploy1001 Synchronized wmf-config/flaggedrevs.php: [[:gerrit:514413|Remove project namespace from flaggedrevs on ruwikisource]] (T225037) (duration: 00m 54s) [11:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:48] T225037: Disable FlaggedRevs for NS "Wikisourse" of ruwikisource - https://phabricator.wikimedia.org/T225037 [11:28:20] raynor, I'm done, you can continue [11:28:30] thx [11:29:02] (03CR) 10Pmiazga: [C: 03+2] Enable the new history page in the advanced mobile contributions mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514449 (https://phabricator.wikimedia.org/T219895) (owner: 10Pmiazga) [11:29:58] (03Merged) 10jenkins-bot: Enable the new history page in the advanced mobile contributions mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514449 (https://phabricator.wikimedia.org/T219895) (owner: 10Pmiazga) [11:30:17] (03CR) 10jenkins-bot: Enable the new history page in the advanced mobile contributions mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514449 (https://phabricator.wikimedia.org/T219895) (owner: 10Pmiazga) [11:31:05] (03CR) 10Jbond: "By default exit 0 if all hosts are optimal and one if any services are in a failed state. This allows a script to just keep polling until" [puppet] - 10https://gerrit.wikimedia.org/r/514459 (owner: 10Jbond) [11:32:13] (03PS2) 10Jbond: icinga: Add a script to parse and query the stus.dat file [puppet] - 10https://gerrit.wikimedia.org/r/514459 [11:32:48] (03CR) 10Hoo man: [C: 03+1] Add new terms normalized schema tables as public 1:1 views in labs. 
[puppet] - 10https://gerrit.wikimedia.org/r/514411 (https://phabricator.wikimedia.org/T225038) (owner: 10Alaa Sarhan) [11:34:01] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/514449/ is on mwdebug1002 - testing [11:34:05] (03PS1) 10Mvolz: Enable reftabs on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197) [11:41:12] raynor: let me know when you're done please [11:41:47] Amir1 - just done testing, will sync prod in a minute [11:41:52] Thanks! [11:43:41] !log pmiazga@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:514449|Enable the new history page in the advanced mobile contributions mode (T219895)]] (duration: 00m 56s) [11:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:47] T219895: Style the history page for AMC users - https://phabricator.wikimedia.org/T219895 [11:43:54] Amir1, done, SWAT window is yours [11:44:03] Thank you! [11:44:08] please close the window once you're done. [11:44:17] (03PS1) 10Arturo Borrero Gonzalez: toolforge: migrate k8s code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514464 (https://phabricator.wikimedia.org/T215531) [11:44:54] (03CR) 10jerkins-bot: [V: 04-1] toolforge: migrate k8s code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514464 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [11:45:43] According to jenkins it takes 28 minutes to merge a simple backport on WikimediaMessages extension... [11:53:21] (03CR) 10Volans: "Thanks for continuing this shared effort. 
My review is biased as this was a 4-hands effort with John :)" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/514459 (owner: 10Jbond) [11:59:54] jouncebot: next [11:59:54] In 0 hour(s) and 0 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190605T1200) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190605T1200) [12:00:28] (SWAT is still ongoing FYI, CI is taking a while on the last backport) [12:07:00] jouncebot: now [12:07:00] For the next 0 hour(s) and 52 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190605T1200) [12:13:59] 10Operations, 10SRE-Access-Requests: Requesting access to deployment cluster for awight - https://phabricator.wikimedia.org/T225062 (10Tobi_WMDE_SW) [12:15:18] the backport was finally merged, will be deployed soon [12:17:05] 10Operations, 10SRE-Access-Requests: Requesting access to deployment cluster for awight - https://phabricator.wikimedia.org/T225062 (10Tobi_WMDE_SW) Endorsing @awight 's request. Especially in the light of the potentially planned changes to EU SWAT routine (@greg should have gotten an email already). 
[12:18:28] (03PS2) 10Hashar: beta: lower swift server workers [puppet] - 10https://gerrit.wikimedia.org/r/513059 (https://phabricator.wikimedia.org/T160990) [12:18:37] (03CR) 10Hashar: "Typo fix in the commit message :]" [puppet] - 10https://gerrit.wikimedia.org/r/513059 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [12:32:21] !log ladsgroup@deploy1001 Synchronized php-1.34.0-wmf.8/extensions/WikimediaMessages/: SWAT: [[gerrit:514460|Fix wikidata copyright message (T224536)]] (duration: 00m 56s) [12:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:29] T224536: Add EntitySchema in the license footer - https://phabricator.wikimedia.org/T224536 [12:38:51] (03PS3) 10Jbond: icinga: Add a script to parse and query the stus.dat file [puppet] - 10https://gerrit.wikimedia.org/r/514459 [12:40:01] (03CR) 10Jbond: "thanks code updated" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/514459 (owner: 10Jbond) [12:46:52] !log EU SWAT finished [12:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:27] (03PS1) 10Marostegui: db-eqiad.php: More traffic to db1135 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514467 [12:49:36] (03PS16) 10Ema: varnish: ratelimit thumbor - cache_upload frontend [puppet] - 10https://gerrit.wikimedia.org/r/512495 (https://phabricator.wikimedia.org/T224434) (owner: 10Jbond) [12:50:23] (03CR) 10Ema: [C: 03+2] varnish: ratelimit thumbor - cache_upload frontend [puppet] - 10https://gerrit.wikimedia.org/r/512495 (https://phabricator.wikimedia.org/T224434) (owner: 10Jbond) [12:50:29] I am going to restart CI eventually [12:50:36] err [12:50:48] marostegui: ema: I am going to restart CI in a few :] [12:51:11] hashar: can I quickly push it? [12:51:16] yeah yeah [12:51:16] Or do you prefer me to wait? [12:51:23] marostegui: I will wait, don't worry :] [12:51:25] doing it! will take just a minute! 
[12:51:27] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to db1135 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514467 (owner: 10Marostegui) [12:52:22] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to db1135 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514467 (owner: 10Marostegui) [12:52:47] (03CR) 10jenkins-bot: db-eqiad.php: More traffic to db1135 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514467 (owner: 10Marostegui) [12:53:27] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1135 (duration: 00m 54s) [12:53:28] hashar: I am done! :) [12:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:57] marostegui: you ar emagic :] CI restart should not take long [12:56:22] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [12:59:12] (03PS7) 10Ema: varnish: cache_upload global cache miss rate limit [puppet] - 10https://gerrit.wikimedia.org/r/513596 (https://phabricator.wikimedia.org/T224434) (owner: 10Jbond) [13:00:04] zeljkof: That opportune time is upon us again. Time for a MediaWiki train - European version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190605T1300). [13:00:45] o/ [13:00:49] I <3 train [13:00:53] 🚂 [13:02:22] petition to make jouncebot use 🚂 in its messages [13:03:03] !log restarting Jenkins [13:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:21] zeljkof: sorry I really had to restart Jenkins :/ [13:03:52] hashar: no problemo, it doesn't take forever to restart anymore, right? 
[13:07:25] 10Operations, 10Performance-Team, 10observability, 10User-Elukey: Consider adding per-shard metrics to the prometheus mcrouter exporter - https://phabricator.wikimedia.org/T225059 (10Krinkle) [13:08:03] (03PS2) 10Arturo Borrero Gonzalez: toolforge: migrate k8s code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514464 (https://phabricator.wikimedia.org/T215531) [13:08:18] zeljkof: yeah it is super fast [13:08:45] is it safe to recheck a patch now or should I wait for the jenkins restart to be done? [13:08:59] Lucas_WMDE: it has restarted already [13:09:02] okay :) [13:09:18] (03CR) 10jerkins-bot: [V: 04-1] toolforge: migrate k8s code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514464 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [13:09:19] worst case, the jobs get canceled, jenkins restarts, and eventually Zuul restarts the jobs [13:09:27] jenkins seems back up, starting with the train [13:10:51] I am off, be back later tonigh [13:10:52] t [13:10:58] (03PS1) 10Zfilipin: group1 wikis to 1.34.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514470 [13:11:00] (03CR) 10Zfilipin: [C: 03+2] group1 wikis to 1.34.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514470 (owner: 10Zfilipin) [13:11:00] ring me if needed :] [13:12:01] (03Merged) 10jenkins-bot: group1 wikis to 1.34.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514470 (owner: 10Zfilipin) [13:12:16] (03CR) 10jenkins-bot: group1 wikis to 1.34.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514470 (owner: 10Zfilipin) [13:14:19] (03CR) 10Ottomata: [C: 03+1] "Have only skimmed, but +1 from me!"
[debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [13:14:47] uh oh, scap failed :( [13:15:29] ah, looks like it only failed for one canary, it's still running [13:17:31] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.34.0-wmf.8 [13:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:55] !log start es2,es3 backup on codfw [13:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:36] (03CR) 10Ottomata: "Ah so this is just for getting the keytab files over to the puppetmaster so we can manually commit them to the private repo. Puppet will " [puppet] - 10https://gerrit.wikimedia.org/r/514447 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [13:23:09] (03CR) 10Muehlenhoff: "@Ottomata: Yes" [puppet] - 10https://gerrit.wikimedia.org/r/514447 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [13:24:22] (03CR) 10Elukey: "> Ah so this is just for getting the keytab files over to the" [puppet] - 10https://gerrit.wikimedia.org/r/514447 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [13:25:29] (03PS1) 10Elukey: profile::kerberos::kadminserver: add generate_keytabs.py [puppet] - 10https://gerrit.wikimedia.org/r/514471 (https://phabricator.wikimedia.org/T212257) [13:26:41] (03CR) 10Elukey: "@Moritz: I added the script as part of the kadminserver's profile files, and removed (as we discussed) the chmod util parts.. 
lemme know :" [puppet] - 10https://gerrit.wikimedia.org/r/514471 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [13:31:38] (03PS4) 10Ottomata: [EventBus] Add wgEventServiceStreamConfig variable and switch 2 topics to eventgate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509092 (https://phabricator.wikimedia.org/T222822) (owner: 10Ppchelko) [13:31:51] (03PS5) 10Ottomata: [EventBus] Add wgEventServiceStreamConfig variable and switch 2 topics to eventgate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509092 (https://phabricator.wikimedia.org/T222822) (owner: 10Ppchelko) [13:33:44] (03CR) 10Elukey: "Keith I left some minor comments, let me know what you think. Are you going to re-use the same kafka id when migrating? If so, when are yo" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/514361 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [13:37:12] (03PS14) 10Elukey: mcrouter: allow to tune server timeout and timeouts until tko [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) [13:38:42] PROBLEM - puppet last run on es2004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 7 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[openssh-client],Package[openssh-server],Exec[set debconf flag seen for wireshark-common/install-setuid] [13:38:46] (03PS15) 10Elukey: mcrouter: allow to tune timeouts until tko [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) [13:39:09] (03CR) 10Ottomata: [C: 03+1] "Sounds good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/514447 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [13:40:17] jouncebot: next [13:40:17] In 2 hour(s) and 19 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190605T1600) [13:40:32] jijiki: puppet disabled, going to merge [13:41:09] (03CR) 10Elukey: [C: 03+2] mcrouter: allow to tune timeouts until tko [puppet] - 10https://gerrit.wikimedia.org/r/468363 (https://phabricator.wikimedia.org/T203786) (owner: 10Elukey) [13:41:31] (03PS6) 10Ottomata: [EventBus] Add wgEventServiceStreamConfig and switch 2 topics in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509092 (https://phabricator.wikimedia.org/T222822) (owner: 10Ppchelko) [13:41:54] I was upgrading es2004 [13:42:07] probably will get fixed automatically on next puppet run [13:42:29] jijiki: ah snap sorry I didn't see it in the backscroll, my bad [13:43:03] err jynus [13:43:07] auto-complete fail [13:43:17] jynus: ok to proceed anyway? [13:43:25] proceed with what? [13:43:39] with my change, and re-enable puppet on canaries [13:44:08] RECOVERY - puppet last run on es2004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:44:26] ^yes [13:44:45] ack [13:44:59] just to be sure, zeljkof am I interfering with any work of yours? [13:45:40] swat is in 2hrs from what jouncebot says [13:46:00] oh but [13:46:12] yeah there was a deployment earlier on [13:46:14] elukey: I'm done with train [13:46:18] zeljkof: <3 [13:46:25] ok good [13:46:57] super, let's proceed.. I am going to depool one node, run puppet, and check that everything is good [13:47:22] mw1276.eqiad.wmnet [13:50:47] looks good, repooling [13:52:29] jijiki: we can proceed [13:52:42] (03CR) 10Muehlenhoff: [C: 03+1] "One nit inline."
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/514471 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [13:53:00] elukey: I will depool, run pupper, restart mcrouter and repool ? [13:53:08] puppet* [13:53:15] jijiki: run puppet is enough, it restarts mcrouter [13:53:31] alright then [13:53:34] enjoy krb [13:53:40] "enjoy" [13:53:43] :D [13:54:09] (03CR) 10Elukey: profile::kerberos::kadminserver: add generate_keytabs.py (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/514471 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [13:55:23] (03PS2) 10Elukey: profile::kerberos::kadminserver: add generate_keytabs.py [puppet] - 10https://gerrit.wikimedia.org/r/514471 (https://phabricator.wikimedia.org/T212257) [13:56:35] !log enabling puppet and pooling on mw* canaries [13:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:42] 10Operations, 10Traffic, 10HTTPS: Provide acme-chief/TLS SNI list support in compile_redirects() - https://phabricator.wikimedia.org/T225096 (10Vgutierrez) [13:57:06] !log restart mcrouter on MediaWiki app/api canaries to pick up new config change (timeouts before marking a memcached shard as TKO from 3 to 10) - T203786 [13:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:11] T203786: Mcrouter periodically reports soft TKOs for mc1029 (was mc1035, mc1022) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 [13:58:10] Amir1: I bet CirrusSearch is similarly broken [13:59:29] :(( [14:00:17] (03PS1) 10Jbond: rerepo: add mtail repo [puppet] - 10https://gerrit.wikimedia.org/r/514476 [14:00:22] Hmm, it's fine [14:01:31] (03PS2) 10Jbond: rerepo: add mtail component so we can backport buster mtail to stretch [puppet] - 10https://gerrit.wikimedia.org/r/514476 [14:01:58] (03PS1) 10Vgutierrez: redirects.dat: Provide acme-chief/TLS SNI list support in compile_redirects() [puppet] - 
10https://gerrit.wikimedia.org/r/514477 (https://phabricator.wikimedia.org/T225096) [14:05:17] 10Operations, 10ops-codfw, 10Traffic: lvs2002: raid battery failure - https://phabricator.wikimedia.org/T213417 (10Papaul) p:05Normal→03Low [14:06:18] (03PS3) 10Elukey: profile::kerberos::kadminserver: add generate_keytabs.py [puppet] - 10https://gerrit.wikimedia.org/r/514471 (https://phabricator.wikimedia.org/T212257) [14:06:43] (03PS1) 10Mathew.onipe: maps: enable osm replication cron [puppet] - 10https://gerrit.wikimedia.org/r/514479 (https://phabricator.wikimedia.org/T224395) [14:07:09] zeljkof: For the FR issue... It will need a full scap to rebuild i18n after it's backported :( [14:07:45] Reedy: no problemo, I can do that tomorrow [14:08:49] Reedy: or, what's the plan? backport and full scap today-tomorrow during a swat window? [14:09:11] Backport can go whenever [14:09:23] Just needs full scap before FR .8 is more places... [14:09:28] Because of the broken messages [14:09:30] jouncebot: now [14:09:30] For the next 0 hour(s) and 50 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190605T1300) [14:09:32] jouncebot: next [14:09:32] In 1 hour(s) and 50 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190605T1600) [14:09:38] We could just run it in the rest of this window, WFM :) [14:09:41] we/I [14:09:53] Reedy: sure, can you backport and scap now? [14:10:03] I'm just waiting for jenkins to stop saying no to my master patch [14:10:03] or, should I run scap? 
[14:10:07] :D [14:10:12] ok, let me know if you need me [14:10:22] cheers [14:13:21] (03PS1) 10Volans: dbconfig: move -s/--scope up one level [software/conftool] - 10https://gerrit.wikimedia.org/r/514480 [14:13:23] (03PS1) 10Volans: dbconfig: do not require a reason for section rw [software/conftool] - 10https://gerrit.wikimedia.org/r/514481 [14:13:25] (03PS1) 10Volans: dbconfig: unify casing for DB error messages [software/conftool] - 10https://gerrit.wikimedia.org/r/514482 [14:13:27] (03PS1) 10Volans: dbconfig: catch multiple sections error [software/conftool] - 10https://gerrit.wikimedia.org/r/514483 [14:13:29] (03PS1) 10Volans: kvobject: expose asdict() method [software/conftool] - 10https://gerrit.wikimedia.org/r/514484 [14:13:31] (03PS1) 10Volans: dbconfig: implement section all get [software/conftool] - 10https://gerrit.wikimedia.org/r/514485 [14:13:33] (03PS1) 10Volans: dbconfig: remove unnecessary return [software/conftool] - 10https://gerrit.wikimedia.org/r/514486 [14:15:33] Amir1: I've gotta go AFK for 5-10 minutes. If jenkins C+2's the master patch, can you cherry pick to .8 and +2? 
[14:15:48] (03PS7) 10Elukey: mcrouter: allow async foreign set/delete WAN cache operations [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz) [14:15:54] And I can deploy when I get back [14:17:58] (03PS1) 10Elukey: Remove Analytics Hadoop proxy config for superset [puppet] - 10https://gerrit.wikimedia.org/r/514487 (https://phabricator.wikimedia.org/T223919) [14:18:04] Reedy: sure [14:18:08] Ta :) [14:18:25] (03CR) 10CDanis: [C: 03+1] icinga: manage metamonitor known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/514444 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans) [14:20:10] (03PS1) 10Mholloway: Add nagios contact group for wikifeeds service [puppet] - 10https://gerrit.wikimedia.org/r/514489 (https://phabricator.wikimedia.org/T170455) [14:20:12] (03PS1) 10Mholloway: Add role/profile for wikifeeds service [puppet] - 10https://gerrit.wikimedia.org/r/514490 (https://phabricator.wikimedia.org/T170455) [14:21:38] (03PS2) 10Mholloway: Add role/profile for wikifeeds service [puppet] - 10https://gerrit.wikimedia.org/r/514490 (https://phabricator.wikimedia.org/T170455) [14:24:40] !log Poweroff db1091 for BBU replacement - T225060 [14:24:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:48] T225060: db1091 crashed - https://phabricator.wikimedia.org/T225060 [14:26:46] 10Operations, 10ops-eqiad, 10DBA: db1091 crashed - https://phabricator.wikimedia.org/T225060 (10Cmjohnson) Good afternoon! db1091...i do have a spare bbu but that spare has been helpful the last year or so. HP is slow to send out the batteries, they can take days to get because of their slow response time... 
[14:26:57] (03PS8) 10Ema: varnish: cache_upload rate limit [puppet] - 10https://gerrit.wikimedia.org/r/513596 (https://phabricator.wikimedia.org/T224434) (owner: 10Jbond) [14:28:17] (03CR) 10CDanis: [C: 03+2] kvobject: expose asdict() method [software/conftool] - 10https://gerrit.wikimedia.org/r/514484 (owner: 10Volans) [14:28:31] (03CR) 10CDanis: [C: 03+2] dbconfig: remove unnecessary return [software/conftool] - 10https://gerrit.wikimedia.org/r/514486 (owner: 10Volans) [14:29:19] (03CR) 10CDanis: [C: 03+2] dbconfig: do not require a reason for section rw [software/conftool] - 10https://gerrit.wikimedia.org/r/514481 (owner: 10Volans) [14:30:36] (03CR) 10CDanis: [C: 03+2] dbconfig: catch multiple sections error [software/conftool] - 10https://gerrit.wikimedia.org/r/514483 (owner: 10Volans) [14:31:25] (03PS2) 10Marostegui: DNS: Remove mgmt asset tag for rdb200[1-2] [dns] - 10https://gerrit.wikimedia.org/r/513608 (owner: 10Papaul) [14:31:36] (03CR) 10Ottomata: [C: 03+1] Remove Analytics Hadoop proxy config for superset [puppet] - 10https://gerrit.wikimedia.org/r/514487 (https://phabricator.wikimedia.org/T223919) (owner: 10Elukey) [14:31:44] (03CR) 10Ottomata: [C: 03+1] "We should remove the hive Datastore settings in the superset UI too." [puppet] - 10https://gerrit.wikimedia.org/r/514487 (https://phabricator.wikimedia.org/T223919) (owner: 10Elukey) [14:31:54] (03CR) 10Marostegui: [C: 03+2] DNS: Remove mgmt asset tag for rdb200[1-2] [dns] - 10https://gerrit.wikimedia.org/r/513608 (owner: 10Papaul) [14:32:59] (03CR) 10Elukey: [C: 03+2] Remove Analytics Hadoop proxy config for superset [puppet] - 10https://gerrit.wikimedia.org/r/514487 (https://phabricator.wikimedia.org/T223919) (owner: 10Elukey) [14:33:18] (03CR) 10Ottomata: "Hmmm, there is a table defined by Neil that uses the Hive connection: wmf.edit_hourly." 
[puppet] - 10https://gerrit.wikimedia.org/r/514487 (https://phabricator.wikimedia.org/T223919) (owner: 10Elukey) [14:33:59] (03PS9) 10Ema: varnish: cache_upload rate limit [puppet] - 10https://gerrit.wikimedia.org/r/513596 (https://phabricator.wikimedia.org/T224434) (owner: 10Jbond) [14:34:15] (03CR) 10Elukey: [C: 03+2] "> Hmmm, there is a table defined by Neil that uses the Hive" [puppet] - 10https://gerrit.wikimedia.org/r/514487 (https://phabricator.wikimedia.org/T223919) (owner: 10Elukey) [14:34:40] 10Operations, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10Papaul) [14:34:43] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10User-jijiki: Decommission rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (10Papaul) 05Open→03Resolved This is complete [14:34:47] (03CR) 10CDanis: [C: 03+2] dbconfig: unify casing for DB error messages [software/conftool] - 10https://gerrit.wikimedia.org/r/514482 (owner: 10Volans) [14:34:58] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10User-jijiki: Decommission rdb2001, rdb2002 - https://phabricator.wikimedia.org/T209425 (10Papaul) [14:44:18] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:44:30] (03PS2) 10Ema: Add 0025-extrachance-one-retry.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514443 [14:44:32] (03PS1) 10Ema: Add 0026-transient-full-cache_req_body-panic.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514496 [14:45:47] marostegui: that's you (netbox alert) :) [14:45:53] see https://netbox.wikimedia.org/extras/reports/puppetdb.PuppetDB/ [14:46:03] checking [14:46:30] volans: that's jynus! we are decommissioning the good old dbstore hosts! 
[14:46:37] yay [14:46:43] but they are still in puppetdb [14:46:53] decomissioning stage [14:46:55] volans: yeah, I am not fully aware on which stage they are [14:46:59] they are spare [14:47:00] I think it is WIP [14:47:09] yep it's just the order of things [14:47:12] colalso https://wikitech.wikimedia.org/wiki/Talk:Server_Lifecycle#Stages [14:47:21] https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_to_Spares_OR_Decommission [14:47:43] the change to decomissioning should happen after the decom script (actually the script will do it itself) [14:47:52] so these are active servers? [14:47:53] so that they are not anymore in puppet [14:47:56] that seems wrong to me [14:47:59] if you're mocing them to spare role [14:48:02] *moving [14:48:05] it's another thing [14:48:11] what? [14:48:20] 10Operations, 10Traffic, 10User-notice: Rate limit requests in violation of User-Agent policy more aggressively - https://phabricator.wikimedia.org/T224891 (10ema) [14:49:02] could someone please document the stages? [14:49:08] because they are very confusing to me [14:49:21] I have started decomissioning a server [14:49:23] the process still has races and incongruences I know, not your fault [14:49:27] so not it is active? [14:49:29] *now [14:49:38] you see the confussion? [14:49:41] (03CR) 10CDanis: [C: 03+2] dbconfig: implement section all get [software/conftool] - 10https://gerrit.wikimedia.org/r/514485 (owner: 10Volans) [14:49:45] I can put them active no problem [14:49:47] 10Operations, 10Traffic, 10User-notice: Rate limit requests in violation of User-Agent policy more aggressively - https://phabricator.wikimedia.org/T224891 (10ema) We (#traffic) have decided to continue allowing requests violating the UA policy. Instead of blocking them, we will apply stricter rate limiting... [14:49:50] for me the passage to spare::system is unnecessary and not needed [14:50:02] if we decom the server for real, as we should [14:50:15] I do as written... 
if the written was clearer :-D [14:50:40] jynus: signup for the session "[design] Hardware decommissioning: find an agreement on next steps" [14:50:53] (03CR) 10CDanis: [C: 03+1] dbconfig: move -s/--scope up one level (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/514480 (owner: 10Volans) [14:50:59] I'd like to get some agreement to move forward on this topic [14:51:04] (03CR) 10jerkins-bot: [V: 04-1] Add 0026-transient-full-cache_req_body-panic.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514496 (owner: 10Ema) [14:51:07] so should I set them up as active or spare? [14:52:30] so let's state the facts, out of my ideas [14:52:31] now they are active [14:52:44] (03PS10) 10Ema: varnish: cache_upload rate limit [puppet] - 10https://gerrit.wikimedia.org/r/513596 (https://phabricator.wikimedia.org/T224884) (owner: 10Jbond) [14:52:45] - if the host is in puppet and state is decom the alert will trigger [14:52:54] and dbstore1002, which is unreachable, is "decomissioning" [14:53:00] (03CR) 10jerkins-bot: [V: 04-1] Add 0025-extrachance-one-retry.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514443 (owner: 10Ema) [14:53:20] volans: just to be clear [14:53:26] I am not disagreeing [14:53:28] unreachable because already done the decom or because hardware failure? [14:53:40] I will do as asked, just it is confusing to me [14:53:41] I'm going to die of boredom waiting for jerkins [14:54:04] I totally agree that is confusing, that's why I want to find agreements next week to fix the current status [14:54:10] volans: as far as I know dbstore1002 is fully decommed [14:54:27] Reedy: look at the bright side, you won't be there to see the -1 [14:54:32] but obviously I do not have understanding of its phisical status [14:54:37] ok, it's not showing in the report, so I guess it's correct [14:54:37] lol [14:55:08] RECOVERY - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [14:55:10] but if I am a dc op and see a host as active, I may worry? [14:55:11] sorry for the trouble jynus, I hope to get it fixed soon [14:55:12] not watching jenkins today, I don't have anything up for merge [14:55:20] volans: not causing any issue [14:55:26] again, not complaining [14:55:53] just saying I thought I was doing the right thing but I apparently didn't [14:55:53] (03PS2) 10Andrew Bogott: openstack designate wmf_sink: rm unreachable code [puppet] - 10https://gerrit.wikimedia.org/r/514440 (owner: 10Alex Monk) [14:55:55] it's the mess of passing through spare::system that I think we could skip and get directly to kill the host, so things will be more atomical and without intermediate limbo-state [14:56:03] uf [14:56:09] ok, that I see problems with [14:56:23] limbo-state is actually the most common state [14:56:26] after active [14:56:34] (03CR) 10Andrew Bogott: [C: 03+2] openstack designate wmf_sink: rm unreachable code [puppet] - 10https://gerrit.wikimedia.org/r/514440 (owner: 10Alex Monk) [14:56:35] but do we need it? [14:56:39] jouncebot: next [14:56:40] In 1 hour(s) and 3 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190605T1600) [14:56:52] volans: again, I don't have an opinion, I am fully passive here [14:57:04] whatever you decide I will do [14:57:08] ok :) [14:57:33] my only feedback is that things like "making work" [14:57:46] or "stopping" a host are not punctual [14:58:18] I don't know for other services, but for dbs that can take weeks, without counting the pure dc work [14:58:35] hhvm sucks [14:58:35] (which I know it is also not fast, due to things like safe disk wiping, etc.)
[14:59:10] whether that has to be reflected on netbox or not, I don't know [14:59:16] yeah I know [14:59:25] that's the issue I would like to solve [14:59:30] let me give one use case where that would be confusing [14:59:31] sorry meeting starting... [14:59:44] I am dc ops, want to unrack a server but I see it as active [14:59:54] (confusing, maybe?) [14:59:55] not sure [15:00:04] ask them [15:00:08] yes, in fact it cannot be unracked until decomm'ed [15:01:15] as in running the decommissioning cookbook that removes it from puppet and other places [15:01:36] decomm has at least 3 phases: "make the host not pooled" "make the host spare" "config/network disable" [15:01:49] and a 4th, unrack/disk wipe [15:02:21] that is the 3 step [15:03:10] !log reedy@deploy1001 Started scap: Rebuild .8 i18n for FlaggedRevs [15:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:03] (03CR) 10Andrew Bogott: "I finally have time to think about this! I think I'm ready to merge, but I'd like some inline docs about how to use it first."
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506872 (https://phabricator.wikimedia.org/T220268) (owner: 10Alex Monk) [15:06:53] (03PS4) 10Andrew Bogott: toolserver: redirect ~nikola/svgtranslate.php to toolforge [puppet] - 10https://gerrit.wikimedia.org/r/512341 (https://phabricator.wikimedia.org/T224265) (owner: 10Aklapper) [15:08:35] (03PS2) 10Andrew Bogott: toolserver: redirect /tiles to https://tiles.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/514360 (owner: 10BryanDavis) [15:10:43] (03CR) 10Andrew Bogott: [C: 03+2] toolserver: redirect /tiles to https://tiles.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/514360 (owner: 10BryanDavis) [15:11:05] (03PS2) 10Bstorm: wiki replicas: Add specialized views of the "comment" table [puppet] - 10https://gerrit.wikimedia.org/r/513943 (https://phabricator.wikimedia.org/T224850) (owner: 10BryanDavis) [15:11:31] (03PS2) 10Ema: Add 0026-transient-full-cache_req_body-panic.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514496 [15:12:22] (03PS3) 10Andrew Bogott: Remove unused/untrusted IP ranges from trusted lists [puppet] - 10https://gerrit.wikimedia.org/r/509140 (https://phabricator.wikimedia.org/T222392) (owner: 10Ayounsi) [15:14:33] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: Provide acme-chief/TLS SNI list support in compile_redirects() - https://phabricator.wikimedia.org/T225096 (10Vgutierrez) p:05Triage→03Normal [15:14:40] (03CR) 10Herron: ">" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/514361 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [15:14:58] (03CR) 10jerkins-bot: [V: 04-1] Add 0026-transient-full-cache_req_body-panic.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514496 (owner: 10Ema) [15:16:44] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good per se, but we don't strictly need a separate component here, adding a new component is useful for bigger, gradual changes like" [puppet] - 
10https://gerrit.wikimedia.org/r/514476 (owner: 10Jbond) [15:17:46] (03CR) 10Jbond: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/514476 (owner: 10Jbond) [15:18:06] (03Abandoned) 10Jbond: rerepo: add mtail component so we can backport buster mtail to stretch [puppet] - 10https://gerrit.wikimedia.org/r/514476 (owner: 10Jbond) [15:23:21] 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Release Pipeline, 10serviceops, and 2 others: Machine vision image metadata service - https://phabricator.wikimedia.org/T224917 (10Mholloway) p:05Triage→03Normal [15:32:17] (03PS1) 10Reedy: Load Collection via extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514509 [15:33:35] 10Operations, 10netops, 10observability: Prometheus logs showing errors for routinator - https://phabricator.wikimedia.org/T225108 (10jbond) p:05Triage→03Normal [15:33:47] sync-apaches: 56% (ok: 150; fail: 0; left: 117) [15:34:00] wow [15:34:22] 31 minutes [15:34:25] so far [15:35:01] 80% done [15:35:40] PROBLEM - HHVM rendering on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [15:35:57] 10Operations, 10Analytics, 10EventBus, 10Core Platform Team Backlog (Watching / External), and 3 others: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005 (10herron) Here's a first shot at per-host replacement steps for kafka2003... [15:36:13] !log installing exim4 security updates [15:36:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "I've reviewed this, checked both netbox and our configs, and LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/509140 (https://phabricator.wikimedia.org/T222392) (owner: 10Ayounsi) [15:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:56] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 75190 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:37:06] 31 minutes. ouch [15:38:31] (03PS3) 10Ema: Add 0026-transient-full-cache_req_body-panic.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514496 [15:38:58] 10Operations, 10netops, 10observability: Prometheus logs showing errors for routinator - https://phabricator.wikimedia.org/T225108 (10ayounsi) Data collection stopped after the upgrade to Routinator 0.4.0: https://grafana.wikimedia.org/d/UwUa77GZk/rpki?refresh=5m&orgId=1&from=now-7d&to=now ` ayounsi@rpki100... [15:40:04] scap-cdb-rebuild: 31% (ok: 92; fail: 0; left: 197) [15:40:30] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 49.89, 22.09, 14.45 [15:40:48] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 59.33, 25.14, 15.48 [15:41:40] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 63.71, 30.40, 17.75 [15:41:58] PROBLEM - High CPU load on API appserver on mw1282 is CRITICAL: CRITICAL - load average: 62.25, 27.20, 18.20 [15:41:58] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 20.58, 19.78, 14.33 [15:42:16] RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 27.18, 24.24, 16.08 [15:42:20] PROBLEM - High CPU load on API appserver on mw1288 is CRITICAL: CRITICAL - load average: 73.53, 30.24, 18.98 [15:42:23] grumpy api servers [15:42:42] no it is reedy 0 - servers 2 [15:43:00] heh, yeah, it's probably busy servers + scap-cdb-rebuild [15:43:00] ahha [15:43:22] RECOVERY - High CPU load on API appserver on mw1282 is OK: OK - load 
average: 28.62, 25.52, 18.44 [15:43:44] RECOVERY - High CPU load on API appserver on mw1288 is OK: OK - load average: 31.78, 28.37, 19.37 [15:44:25] !log reedy@deploy1001 Finished scap: Rebuild .8 i18n for FlaggedRevs (duration: 41m 14s) [15:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:32] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 17.35, 23.89, 17.43 [15:45:30] 10Operations, 10netops: RPKI Validation - https://phabricator.wikimedia.org/T220669 (10ayounsi) [15:45:33] 10Operations, 10netops, 10observability: Prometheus logs showing errors for routinator - https://phabricator.wikimedia.org/T225108 (10ayounsi) [15:46:11] (03CR) 10Reedy: [C: 03+2] Load Collection via extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514509 (owner: 10Reedy) [15:46:25] (03Merged) 10jenkins-bot: Load Collection via extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514509 (owner: 10Reedy) [15:46:40] (03CR) 10jenkins-bot: Load Collection via extension.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514509 (owner: 10Reedy) [15:46:54] !log reedy@deploy1001 Scap failed!: Call to mwscript eval.php returned: None [15:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:58] (03CR) 10jerkins-bot: [V: 04-1] Add 0026-transient-full-cache_req_body-panic.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514496 (owner: 10Ema) [15:50:17] oh, meh [15:50:29] (03PS1) 10Reedy: Revert "Load Collection via extension.json" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514511 [15:50:35] (03CR) 10Reedy: [C: 03+2] "meh" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514511 (owner: 10Reedy) [15:50:44] hehe +2 "meh" [15:51:00] mixed feelings there [15:51:33] (03Merged) 10jenkins-bot: Revert "Load Collection via extension.json" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514511 (owner: 10Reedy) [15:53:56] 
(03PS1) 10Ema: Add lintian override: postinst-must-call-ldconfig [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514514 [15:54:34] 10Operations, 10netops, 10observability: Prometheus logs showing errors for routinator - https://phabricator.wikimedia.org/T225108 (10fgiunchedi) Yes that looks like an error on routinator side, you can also use `promtool check rules` to see what prometheus makes of that ` prometheus1003:~$ curl -s http://r... [15:55:29] (03CR) 10jenkins-bot: Revert "Load Collection via extension.json" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514511 (owner: 10Reedy) [15:57:46] Reedy: what was the error? [15:58:36] legoktm: extension.json wasn't in .7 for collection [15:58:40] it was still extension-wip [15:58:40] oops [15:59:01] this is why we have the sanity checks :) [15:59:09] :D [15:59:34] 10Operations, 10LDAP-Access-Requests: Grant WMDE engineers access to logstash / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (10WMDE-leszek) sounds sensible, thanks @fsero [15:59:36] it's on you, I was already crazy before getting hired! [15:59:50] ups, wrong channel [16:00:04] MaxSem, RoanKattouw, and Niharika: How many deployers does it take to do Morning SWAT (Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190605T1600). [16:00:04] ottomata: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:02:08] (03CR) 10jerkins-bot: [V: 04-1] Add lintian override: postinst-must-call-ldconfig [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514514 (owner: 10Ema) [16:02:11] (03CR) 10Elukey: "ah yes now I see the change in id in commons.yaml, missed that! LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/514361 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [16:03:26] ottomata: About? 
[16:03:36] (03PS7) 10Reedy: [EventBus] Add wgEventServiceStreamConfig and switch 2 topics in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509092 (https://phabricator.wikimedia.org/T222822) (owner: 10Ppchelko) [16:03:52] \o/ [16:03:53] here [16:04:00] wow that was a celebratory 'here' [16:04:20] Is it testable? [16:04:23] (03CR) 10Reedy: [C: 03+2] [EventBus] Add wgEventServiceStreamConfig and switch 2 topics in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509092 (https://phabricator.wikimedia.org/T222822) (owner: 10Ppchelko) [16:04:28] 10Operations, 10netops, 10observability: Prometheus logs showing errors for routinator - https://phabricator.wikimedia.org/T225108 (10ayounsi) Opened https://github.com/NLnetLabs/routinator/issues/154 upstream. [16:04:31] Or just want it pushing live? [16:04:45] 10Operations, 10netops, 10observability: Prometheus logs showing errors for routinator - https://phabricator.wikimedia.org/T225108 (10ayounsi) a:03ayounsi [16:04:47] Reedy: i think it is testable, since it is for group0 only [16:04:52] (03PS2) 10Ema: Add lintian override: postinst-must-call-ldconfig [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514514 [16:04:55] i believe that revision tags are added during a visualeditor edit [16:05:03] I don't care either way. Happy to just push it live if you want ;p [16:05:10] so I can try editing on test wiki mwdebug1002 [16:05:12] let's test [16:05:16] that would be nice. 
[16:05:55] (03Merged) 10jenkins-bot: [EventBus] Add wgEventServiceStreamConfig and switch 2 topics in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509092 (https://phabricator.wikimedia.org/T222822) (owner: 10Ppchelko) [16:06:12] (03CR) 10jenkins-bot: [EventBus] Add wgEventServiceStreamConfig and switch 2 topics in group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509092 (https://phabricator.wikimedia.org/T222822) (owner: 10Ppchelko) [16:09:15] ottomata: Should be on mwdebug1002 [16:09:36] testing [16:09:51] it works!!!! [16:10:02] proceed please! [16:11:18] !log remove BGP to AS38082 on cr4-ulsfo (left the IXP) [16:11:20] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add wgEventServiceStreamConfig and switch 2 topics in group0 T222822 (duration: 00m 56s) [16:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:27] swat done ;P [16:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:27] T222822: Make EventBus extension support configurable per-event/stream EventServiceName - https://phabricator.wikimedia.org/T222822 [16:11:38] Thank you Reedy ! 
[16:12:35] np [16:14:33] (03PS3) 10Ema: Add lintian override: postinst-must-call-ldconfig [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514514 [16:14:35] (03PS2) 10Ema: Add 0019-vary-stevedore-mem-leak.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/513976 [16:14:37] (03PS2) 10Ema: Add 0020-assert-error-http1_minimal_response.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/513977 (https://phabricator.wikimedia.org/T224694) [16:14:39] (03PS2) 10Ema: Add 0021-dont-test-gunzip-partial.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514043 [16:14:41] (03PS2) 10Ema: Add 0022-deref-objcore-synth-err.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514315 [16:14:43] (03PS2) 10Ema: Add 0023-pass-delivery-is-no-err.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514318 [16:14:45] (03PS2) 10Ema: Drop 0001-gethdr_extrachance.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514441 [16:14:47] (03PS2) 10Ema: Add 0024-vbt-get-force-fresh.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514442 [16:14:49] (03PS3) 10Ema: Add 0025-extrachance-one-retry.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514443 [16:14:51] (03PS4) 10Ema: Add 0026-transient-full-cache_req_body-panic.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514496 [16:15:57] (03PS1) 10Ottomata: [EventBus] use eventgate-main for 2 events on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514523 (https://phabricator.wikimedia.org/T222822) [16:16:38] (03PS2) 10Ottomata: [EventBus] use eventgate-main for 2 events on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514523 (https://phabricator.wikimedia.org/T211248) [16:17:25] Reedy: Pchelolo and I really wanted to do ^ but decided to just do group0 first. forgot we could just test like that on mwdebug1002. 
[16:17:40] since swat is 'over', mind if I deploy that? [16:20:29] ottomata: Go for it. [16:20:34] thanks [16:20:49] (03CR) 10Ottomata: [C: 03+2] [EventBus] use eventgate-main for 2 events on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514523 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [16:21:08] (03CR) 10jenkins-bot: [EventBus] use eventgate-main for 2 events on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514523 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [16:22:38] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: use eventgate-main for 2 events on all wikis - T211248 (duration: 00m 55s) [16:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:44] T211248: Modern Event Platform: Stream Intake Service: Migrate Mediawiki Eventbus events to eventgate-main - https://phabricator.wikimedia.org/T211248 [16:25:45] (03PS1) 10Ottomata: [EventBus] Revert - Send user-blocks-change using eventbus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514525 (https://phabricator.wikimedia.org/T211248) [16:27:37] (03CR) 10Ottomata: [C: 03+2] [EventBus] Revert - Send user-blocks-change using eventbus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514525 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [16:28:06] (03PS1) 10Ema: Add 0027-assert-error-vca_make_session.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514526 [16:29:14] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Revert user-blocks-change to use eventbus and old schema - T211248 (duration: 00m 54s) [16:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:21] T211248: Modern Event Platform: Stream Intake Service: Migrate Mediawiki Eventbus events to eventgate-main - https://phabricator.wikimedia.org/T211248 [16:30:03] 10Operations, 10netops, 10observability: Prometheus logs showing 
errors for routinator - https://phabricator.wikimedia.org/T225108 (10fgiunchedi) Unrelated to the issue at hand, but I'd also recommend prefixing metrics with `routinator_` so it is clear where they are coming from [16:30:14] (03PS2) 10Gehel: maps: enable osm replication cron [puppet] - 10https://gerrit.wikimedia.org/r/514479 (https://phabricator.wikimedia.org/T224395) (owner: 10Mathew.onipe) [16:31:17] (03CR) 10Gehel: [C: 03+2] maps: enable osm replication cron [puppet] - 10https://gerrit.wikimedia.org/r/514479 (https://phabricator.wikimedia.org/T224395) (owner: 10Mathew.onipe) [16:31:36] (03PS1) 10Ema: Add 0028-panic-return-cond-fetch.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514529 [16:33:19] (03CR) 10jenkins-bot: [EventBus] Revert - Send user-blocks-change using eventbus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514525 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [16:33:59] (03PS1) 10Ema: Add 0029-ban-lurker-bo-backoff.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514533 [16:34:15] (03CR) 10Lucas Werkmeister (WMDE): Enable reftabs on testwikidata (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197) (owner: 10Mvolz) [16:35:51] (03PS1) 10Ladsgroup: Remove unused config variable wgWikibaseEnableSenses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514534 [16:40:03] (03PS3) 10Arturo Borrero Gonzalez: toolforge: migrate k8s code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514464 (https://phabricator.wikimedia.org/T215531) [16:42:37] (03PS1) 10Ema: Add 0030-startup-show-version.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514538 [16:44:07] (03CR) 10jerkins-bot: [V: 04-1] toolforge: migrate k8s code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514464 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [16:44:28] (03PS2) 10Ema: Add 
0030-startup-show-version.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514538 [16:46:47] (03PS4) 10Arturo Borrero Gonzalez: toolforge: migrate k8s code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514464 (https://phabricator.wikimedia.org/T215531) [16:47:00] (03CR) 10jerkins-bot: [V: 04-1] Add 0028-panic-return-cond-fetch.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514529 (owner: 10Ema) [16:47:31] (03CR) 10jerkins-bot: [V: 04-1] toolforge: migrate k8s code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514464 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [16:48:11] Reedy: how do I deploy a config change to just mwdebug1002? [16:48:19] is https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Staging_changes right? [16:48:23] or is that for just mw code? [16:51:42] ottomata: Same as for MW code; change it on the deploy host and `scap pull` it. [16:52:58] ok, just log into the host and run scap pull [16:53:03] from home dir or wherever [16:56:19] (03CR) 10Bstorm: [C: 03+2] "This seems to test out right locally. I like it." 
[puppet] - 10https://gerrit.wikimedia.org/r/513943 (https://phabricator.wikimedia.org/T224850) (owner: 10BryanDavis) [16:56:29] (03PS3) 10Bstorm: wiki replicas: Add specialized views of the "comment" table [puppet] - 10https://gerrit.wikimedia.org/r/513943 (https://phabricator.wikimedia.org/T224850) (owner: 10BryanDavis) [16:57:16] (03PS5) 10Arturo Borrero Gonzalez: toolforge: migrate k8s code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514464 (https://phabricator.wikimedia.org/T215531) [16:58:37] (03CR) 10jerkins-bot: [V: 04-1] toolforge: migrate k8s code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514464 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [17:00:12] 10Operations, 10ops-eqiad, 10netops: upgrade mr1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10RobH) p:05Triage→03Normal [17:00:23] 10Operations, 10ops-eqiad, 10netops: upgrade mr1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10RobH) [17:01:09] 10Operations, 10ops-codfw, 10netops: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (10RobH) [17:02:35] 10Operations, 10ops-eqiad, 10netops: upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10RobH) [17:03:32] 10Operations, 10ops-eqiad, 10netops: upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10RobH) 05Open→03Stalled Please note @papaul is working with @ayongsi to upgrade the codfw msw1 on T224250. The current plan is to allow that to complete, and then replicate its wor... [17:05:55] * elukey off! o/ [17:06:02] (wrong chan :P_ [17:10:25] 10Operations, 10ops-eqiad, 10DBA: db1091 crashed - https://phabricator.wikimedia.org/T225060 (10Cmjohnson) 05Open→03Resolved The bbu has been replaced. 
[17:17:23] (03PS2) 10Volans: dbconfig: move -s/--scope up one level [software/conftool] - 10https://gerrit.wikimedia.org/r/514480 [17:17:25] (03PS2) 10Volans: dbconfig: do not require a reason for section rw [software/conftool] - 10https://gerrit.wikimedia.org/r/514481 [17:17:27] (03PS2) 10Volans: dbconfig: unify casing for DB error messages [software/conftool] - 10https://gerrit.wikimedia.org/r/514482 [17:17:29] (03PS2) 10Volans: dbconfig: catch multiple sections error [software/conftool] - 10https://gerrit.wikimedia.org/r/514483 [17:17:31] (03PS2) 10Volans: kvobject: expose asdict() method [software/conftool] - 10https://gerrit.wikimedia.org/r/514484 [17:17:33] (03PS2) 10Volans: dbconfig: implement section all get [software/conftool] - 10https://gerrit.wikimedia.org/r/514485 [17:17:35] (03PS2) 10Volans: dbconfig: remove unnecessary return [software/conftool] - 10https://gerrit.wikimedia.org/r/514486 [17:17:37] (03PS1) 10Volans: dbconfig: fix return value [software/conftool] - 10https://gerrit.wikimedia.org/r/514543 [17:17:39] (03PS1) 10Volans: selectors: do not pre-compile the regex [software/conftool] - 10https://gerrit.wikimedia.org/r/514544 [17:17:45] (03CR) 10Volans: dbconfig: move -s/--scope up one level (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/514480 (owner: 10Volans) [17:18:08] (03CR) 10CDanis: [C: 03+2] dbconfig: move -s/--scope up one level [software/conftool] - 10https://gerrit.wikimedia.org/r/514480 (owner: 10Volans) [17:20:45] (03Merged) 10jenkins-bot: dbconfig: move -s/--scope up one level [software/conftool] - 10https://gerrit.wikimedia.org/r/514480 (owner: 10Volans) [17:20:49] (03Merged) 10jenkins-bot: dbconfig: do not require a reason for section rw [software/conftool] - 10https://gerrit.wikimedia.org/r/514481 (owner: 10Volans) [17:21:03] (03Merged) 10jenkins-bot: dbconfig: unify casing for DB error messages [software/conftool] - 10https://gerrit.wikimedia.org/r/514482 (owner: 10Volans) [17:21:05] (03Merged) 
10jenkins-bot: dbconfig: catch multiple sections error [software/conftool] - 10https://gerrit.wikimedia.org/r/514483 (owner: 10Volans) [17:21:09] (03Merged) 10jenkins-bot: kvobject: expose asdict() method [software/conftool] - 10https://gerrit.wikimedia.org/r/514484 (owner: 10Volans) [17:22:10] 10Operations, 10ops-eqiad: Storage problems with new host db1133 - https://phabricator.wikimedia.org/T222731 (10Cmjohnson) Update on this server. I have updated all of the f/w including the raid card. I am able to isolate the problem to slot 0 right now. I moved the disks around and they do not report any e... [17:22:30] 10Operations, 10Wikimedia-Logstash: Migrate services using deprecated Gelf logstash input to Kafka enabled logging pipeline - https://phabricator.wikimedia.org/T225122 (10herron) p:05Triage→03Normal [17:22:53] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Kanban (Done with CPT), and 2 others: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10herron) [17:22:55] 10Operations, 10Wikimedia-Logstash: Migrate services using deprecated Gelf logstash input to Kafka enabled logging pipeline - https://phabricator.wikimedia.org/T225122 (10herron) [17:23:17] (03Merged) 10jenkins-bot: dbconfig: implement section all get [software/conftool] - 10https://gerrit.wikimedia.org/r/514485 (owner: 10Volans) [17:23:20] (03Merged) 10jenkins-bot: dbconfig: remove unnecessary return [software/conftool] - 10https://gerrit.wikimedia.org/r/514486 (owner: 10Volans) [17:23:32] (03PS2) 10Andrew Bogott: certmanager: Set up config for running inside labs realm [puppet] - 10https://gerrit.wikimedia.org/r/514454 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [17:23:37] (03PS22) 10Mathew.onipe: Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) [17:25:17] (03CR) 10Andrew Bogott: [C: 03+2] certmanager: Set up config for 
running inside labs realm [puppet] - 10https://gerrit.wikimedia.org/r/514454 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [17:25:29] (03CR) 10Volans: "I'd really like @_joe_ to have a deep look at this one, it's scary and touches conftool core's functionality." [software/conftool] - 10https://gerrit.wikimedia.org/r/514544 (owner: 10Volans) [17:28:40] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [17:30:19] (03PS1) 10BryanDavis: wiki replicas: Add specialized views of the "actor" table [puppet] - 10https://gerrit.wikimedia.org/r/514548 (https://phabricator.wikimedia.org/T224850) [17:31:42] 10Operations, 10ops-eqiad, 10DBA: db1091 crashed - https://phabricator.wikimedia.org/T225060 (10Marostegui) Thank you so much @Cmjohnson I can see the battery now: ` Cache Backup Power Source: Batteries Battery/Capacitor Count: 1 Battery/Capacitor Status: OK ` Next steps I will take: - Start MySQL... [17:32:16] !log Start MySQL with replication stopped on db1091 - T225060 [17:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:23] T225060: db1091 crashed - https://phabricator.wikimedia.org/T225060 [17:32:44] 10Operations, 10ops-eqiad: Storage problems with new host db1133 - https://phabricator.wikimedia.org/T222731 (10Marostegui) Thanks for the heads up! 
Let's see what Dell says [17:32:48] (03PS6) 10Arturo Borrero Gonzalez: toolforge: migrate k8s code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514464 (https://phabricator.wikimedia.org/T215531) [17:33:20] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Wikimedia-Logstash: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline - https://phabricator.wikimedia.org/T225125 (10herron) [17:36:56] !log Start replication db1091 - T225060 [17:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:20] (03CR) 10Bstorm: wiki replicas: Add specialized views of the "actor" table (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/514548 (https://phabricator.wikimedia.org/T224850) (owner: 10BryanDavis) [17:41:05] 10Operations, 10LDAP-Access-Requests: Grant WMDE engineers access to logstash / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (10RStallman-legalteam) Hello, We have NDAs on file for all except: Andrew Kostka, Jakob Warkotsch, Johannes Kroll, Tobias Gritschacher. Let me kno... [17:44:23] (03PS2) 10BryanDavis: wiki replicas: Add specialized views of the "actor" table [puppet] - 10https://gerrit.wikimedia.org/r/514548 (https://phabricator.wikimedia.org/T224850) [17:45:19] (03CR) 10BryanDavis: wiki replicas: Add specialized views of the "actor" table (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/514548 (https://phabricator.wikimedia.org/T224850) (owner: 10BryanDavis) [17:50:13] 10Operations, 10Analytics, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) [17:55:27] 10Operations, 10Analytics, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10ayounsi) What are the needed network changes? 
The usual two are: 1/ switch port config (usually for DCops), for that we need to know which hosts... [17:55:42] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [17:56:25] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Kanban (Done with CPT), and 2 others: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10herron) [17:56:27] 10Operations, 10Discovery-Search, 10Elasticsearch, 10Wikimedia-Logstash: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline - https://phabricator.wikimedia.org/T225125 (10herron) [17:57:58] 10Operations, 10Analytics, 10EventBus, 10Wikimedia-Logstash: Move eventgate logs to new logging infrastructure - https://phabricator.wikimedia.org/T225129 (10herron) [18:08:30] PROBLEM - MegaRAID on es2003 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:08:31] ACKNOWLEDGEMENT - MegaRAID on es2003 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T225131 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:08:35] 10Operations, 10ops-codfw: Degraded RAID on es2003 - https://phabricator.wikimedia.org/T225131 (10ops-monitoring-bot) [18:09:48] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [18:10:48] 10Operations, 10Analytics, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) > 1/ switch port config (usually for DCops), for that we need to know which hosts are going to which vlan cloudvirtan100[1-5] should be... 
[18:11:24] 10Operations, 10ops-codfw: Degraded RAID on es2003 - https://phabricator.wikimedia.org/T225131 (10Marostegui) @jcrespo I guess the transfer put some unexpected stress on these disks [18:23:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - bad mainboard/memory - https://phabricator.wikimedia.org/T220853 (10Andrew) Any update about this? Are parts on the way? [18:39:44] PROBLEM - Host mwdebug2002 is DOWN: PING CRITICAL - Packet loss = 100% [18:41:26] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10akosiaris) [18:41:42] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10akosiaris) [18:42:00] Reedy: Is T225115 another extension.json thing? [18:42:01] T225115: neither bureaucrats nor admins can grant the right for "autoreview" or "editor" since 1.34.0-wmf.8. - https://phabricator.wikimedia.org/T225115 [18:42:20] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:42:52] RECOVERY - Host mwdebug2002 is UP: PING OK - Packet loss = 0%, RTA = 36.39 ms [18:43:36] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10Ottomata) I don't think eventstreams is in k8s, is it? 
[18:45:05] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10akosiaris) >>! In T198901#5237895, @Ottomata wrote: > I don't think eve... [18:45:44] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10akosiaris) [18:46:03] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10akosiaris) [18:55:03] 10Operations, 10ops-codfw: codfw humidity too high - https://phabricator.wikimedia.org/T225137 (10RobH) p:05Triage→03Normal [18:55:56] 10Operations, 10ops-codfw: codfw humidity too high - https://phabricator.wikimedia.org/T225137 (10RobH) Since this may require CyrusOne techs to enter our cage, I've assigned this to @papaul to setup/arrange/handle directly with CyrusOne support. If I need to handle this instead (due to onsite time constraint... [19:07:06] 10Operations, 10DC-Ops, 10netops, 10observability: Send some LibreNMS alerts to dcops and netops only - https://phabricator.wikimedia.org/T224180 (10RobH) so I'd just email the google group. Then the default settings for the folks in that (DC ops) is to get email updates (unless they have disabled it.) 
[19:10:49] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), and 2 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10akosiaris) [19:11:25] (03PS1) 10Jbond: pbuilder: disable Acquire::Check-Valid-Until on repos [puppet] - 10https://gerrit.wikimedia.org/r/514555 [19:16:32] (03CR) 10Gehel: [C: 04-1] Add postgres slave init cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [19:21:10] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10ayounsi) [19:21:16] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10ayounsi) 05Resolved→03Open Alert is warning again. > DISK WARNING - free space: / 4594 MB (10%... 
[19:30:12] James_F: I'd be surprised if it's not [19:41:44] (03PS1) 10Reedy: Re-add $wgAddGroups $wgRemoveGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514563 (https://phabricator.wikimedia.org/T225115) [19:41:59] jouncebot: now [19:42:00] No deployments scheduled for the next 0 hour(s) and 18 minute(s) [19:42:28] (03PS2) 10Reedy: Re-add $wgAddGroups $wgRemoveGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514563 (https://phabricator.wikimedia.org/T225115) [19:42:41] (03CR) 10Reedy: [C: 03+2] Re-add $wgAddGroups $wgRemoveGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514563 (https://phabricator.wikimedia.org/T225115) (owner: 10Reedy) [19:43:36] (03Merged) 10jenkins-bot: Re-add $wgAddGroups $wgRemoveGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514563 (https://phabricator.wikimedia.org/T225115) (owner: 10Reedy) [19:44:10] (03CR) 10jenkins-bot: Re-add $wgAddGroups $wgRemoveGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514563 (https://phabricator.wikimedia.org/T225115) (owner: 10Reedy) [19:44:42] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:45:24] !log reedy@deploy1001 Synchronized wmf-config/flaggedrevs.php: T225115 (duration: 00m 54s) [19:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:29] T225115: neither bureaucrats nor admins can grant the right for "autoreview" or "editor" since 1.34.0-wmf.8. 
- https://phabricator.wikimedia.org/T225115 [19:48:18] !log Check data consistency on db1091 against db1135 - T225060 [19:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:24] T225060: db1091 crashed - https://phabricator.wikimedia.org/T225060 [19:57:52] !log contint1001: docker container prune -f && docker image prune -f # reclaimed 166 MB and 3.4 GB [19:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:58] (03PS1) 10CDanis: dbctl: print failures to stderr, don't print on successes [software/conftool] - 10https://gerrit.wikimedia.org/r/514567 [20:00:04] cscott, arlolra, subbu, bearND, and halfak: I, the Bot under the Fountain, allow thee, The Deployer, to do Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190605T2000). [20:06:08] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational [20:14:36] no parsoid deploy today [20:19:53] (03PS1) 10Alexandros Kosiaris: blubberoid: Don't page on LVS failures [puppet] - 10https://gerrit.wikimedia.org/r/514574 [20:21:13] 10Operations, 10Analytics, 10EventBus, 10Wikimedia-Logstash: Move eventgate logs to new logging infrastructure - https://phabricator.wikimedia.org/T225129 (10Ottomata) Hm, sure! EventGate is built using service-template-node and service-runner. So if we get a version of those things that can use the new... 
[20:23:36] !log mforns@deploy1001 Started deploy [analytics/refinery@0660e70]: deploying analytics/refinery up to 0660e70153dec892ae20bee7119a72cc17e8ec87 [20:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:57] (03PS1) 10Ottomata: Set LVS eventgate-* service to critical: true [puppet] - 10https://gerrit.wikimedia.org/r/514575 [20:25:18] !log akosiaris@deploy1001 scap-helm blubberoid upgrade -f blubberoid-values.yaml production stable/blubberoid [namespace: blubberoid, clusters: eqiad,codfw] [20:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:24] !log akosiaris@deploy1001 scap-helm blubberoid cluster eqiad completed [20:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:29] !log akosiaris@deploy1001 scap-helm blubberoid cluster codfw completed [20:25:30] !log akosiaris@deploy1001 scap-helm blubberoid finished [20:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:11] (03PS1) 10Reedy: Turn off some unwanted FR config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514601 (https://phabricator.wikimedia.org/T225138) [20:29:42] PROBLEM - designate-sink process on cloudservices1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-sink https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:29:50] andrewbogott, ^ ? 
[20:30:06] just testing things [20:30:14] it should recover in a second or two [20:30:19] (03PS2) 10Reedy: Turn off some unwanted FR config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514601 (https://phabricator.wikimedia.org/T225138) [20:30:21] that did page though [20:30:29] andrewbogott: that paged [20:30:42] indeed, paged for me [20:30:45] same [20:30:57] (03CR) 10Reedy: [C: 03+2] Turn off some unwanted FR config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514601 (https://phabricator.wikimedia.org/T225138) (owner: 10Reedy) [20:30:59] also there is nothing about designate-sink on the linked runbook page [20:31:02] hm, that surprises me [20:31:03] ACKNOWLEDGEMENT - designate-sink process on cloudservices1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-sink andrew bogott this will be back shortly https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:31:04] yep [20:31:12] RECOVERY - designate-sink process on cloudservices1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/designate-sink https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:31:22] * andrewbogott looks for a 'why that paged' task [20:31:34] if it is testing can we disable the paging? it is late in the EU evening [20:31:47] maybe related to https://phabricator.wikimedia.org/T223458 [20:32:05] * marostegui goes back to bed [20:32:14] yeah, I'll downtime if I mess with the service again. It shouldn't really be paging you at all though [20:32:24] thanks! 
[20:33:26] (03CR) 10jerkins-bot: [V: 04-1] Turn off some unwanted FR config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514601 (https://phabricator.wikimedia.org/T225138) (owner: 10Reedy) [20:33:28] (03CR) 10jerkins-bot: [V: 04-1] Turn off some unwanted FR config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514601 (https://phabricator.wikimedia.org/T225138) (owner: 10Reedy) [20:34:30] (03PS3) 10Reedy: Turn off some unwanted FR config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514601 (https://phabricator.wikimedia.org/T225138) [20:34:47] (03CR) 10Reedy: [C: 03+2] Turn off some unwanted FR config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514601 (https://phabricator.wikimedia.org/T225138) (owner: 10Reedy) [20:37:38] (03Merged) 10jenkins-bot: Turn off some unwanted FR config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514601 (https://phabricator.wikimedia.org/T225138) (owner: 10Reedy) [20:37:54] (03CR) 10jenkins-bot: Turn off some unwanted FR config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514601 (https://phabricator.wikimedia.org/T225138) (owner: 10Reedy) [20:39:22] !log reedy@deploy1001 Synchronized wmf-config/flaggedrevs.php: Turn off some FR config T225138 (duration: 00m 54s) [20:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:39] T225138: FlaggedRevs multiple levels of review status since 1.34.0-wmf.8 (81a18d9) - https://phabricator.wikimedia.org/T225138 [20:40:28] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [20:41:04] PROBLEM - etcd request latencies on neon is CRITICAL: 
instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:41:38] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:41:48] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [20:42:00] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:42:00] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={LIST,PATCH} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:43:06] !log mforns@deploy1001 Finished deploy [analytics/refinery@0660e70]: deploying analytics/refinery up to 0660e70153dec892ae20bee7119a72cc17e8ec87 (duration: 19m 30s) [20:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:18] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:45:52] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:46:16] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:46:16] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:02:06] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:07:50] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:10:16] andrewbogott: I added some digging to the task you mentioned [21:10:27] thanks! [21:12:08] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:13:50] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[21:21:55] (PS1) Mforns: analytics::refinery::job::refine Bump up refinery_jar_version [puppet] - https://gerrit.wikimedia.org/r/514616 [21:26:37] (PS2) Mforns: analytics::refinery::job::refine Bump up refinery_jar_version [puppet] - https://gerrit.wikimedia.org/r/514616 [21:33:30] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:02:16] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [22:05:24] (PS1) BryanDavis: toolserver: redirect ~nikola/svgtranslate.php to toolforge [puppet] - https://gerrit.wikimedia.org/r/514618 (https://phabricator.wikimedia.org/T224265) [22:10:26] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [22:10:28] is gerrit playing up for anyone? [22:10:35] ah, yes [22:11:08] meh [22:11:14] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [22:11:16] PROBLEM - SSH access on cobalt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit [22:11:34] weird, cobalt is working for me [22:12:03] gerrit is definitely unreachable though. [22:12:04] hmm... [22:12:12] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:15:24] !log restarting gerrit on cobalt due to it being down (seems like Java out of heap space) [22:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:08] PROBLEM - puppet last run on schema2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [22:17:41] thanks chaomodus :) #hugops [22:18:07] Operations, Mail, Phabricator, Patch-For-Review, and 2 others: Phabricator email comments not posted - https://phabricator.wikimedia.org/T224752 (mmodell) {rOPUPc23fe1fc58ac96f649625dda358b1b84abdad022} [22:18:14] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 865 bytes in 0.062 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [22:18:16] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.15.13-13-gd782b2dd6b (SSHD-CORE-1.6.0) (protocol 2.0) https://wikitech.wikimedia.org/wiki/Gerrit [22:18:16] :) cheers! [22:18:52] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27122 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [22:18:54] OH that ssh error is gerrit's internal ssh [22:19:00] I get it now haah [22:19:02] PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 2 minutes ago with 7 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_wmde/scripts],Exec[git_pull_wmde/toolkit-analyzer-build],Exec[git_pull_mediawiki/event-schemas] [22:19:05] (CR) Bstorm: wiki replicas: Add specialized views of the "actor" table (1 comment) [puppet] - https://gerrit.wikimedia.org/r/514548 (https://phabricator.wikimedia.org/T224850) (owner: BryanDavis) [22:19:06] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 3 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki],Exec[git_pull_analytics/reportupdater] [22:19:12] PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Puppet has 2 failures. 
Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas] [22:19:23] yeah that alert could be more explicit about the custom port [22:19:24] PROBLEM - puppet last run on webperf2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] [22:19:36] PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [22:19:38] those puppet errors will probably resolve themselves [22:20:56] PROBLEM - puppet last run on cumin2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] [22:21:15] yes, these are all of the cloned repos [22:21:54] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_analytics.wikimedia.org] [22:22:26] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 6 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki] [22:22:56] PROBLEM - puppet last run on webperf2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [22:23:16] PROBLEM - puppet last run on webperf1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [22:23:30] PROBLEM - puppet last run on db1125 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 8 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [22:23:36] (CR) Bstorm: [C: -1] "Since it currently crashes :)" [puppet] - https://gerrit.wikimedia.org/r/514548 (https://phabricator.wikimedia.org/T224850) (owner: BryanDavis) [22:24:32] (PS3) BryanDavis: wiki replicas: Add specialized views of the "actor" table [puppet] - https://gerrit.wikimedia.org/r/514548 (https://phabricator.wikimedia.org/T224850) [22:25:10] (CR) BryanDavis: wiki replicas: Add specialized views of the "actor" table (1 comment) [puppet] - https://gerrit.wikimedia.org/r/514548 (https://phabricator.wikimedia.org/T224850) (owner: BryanDavis) [22:28:58] (CR) Andrew Bogott: [C: +2] toolserver: redirect ~nikola/svgtranslate.php to toolforge [puppet] - https://gerrit.wikimedia.org/r/514618 (https://phabricator.wikimedia.org/T224265) (owner: BryanDavis) [22:29:23] (CR) Bstorm: "Looks good :)" [puppet] - https://gerrit.wikimedia.org/r/514548 (https://phabricator.wikimedia.org/T224850) (owner: BryanDavis) [22:29:32] (PS4) Bstorm: wiki replicas: Add specialized views of the "actor" table [puppet] - https://gerrit.wikimedia.org/r/514548 (https://phabricator.wikimedia.org/T224850) (owner: BryanDavis) [22:34:03] (CR) Bstorm: [C: +2] wiki replicas: Add specialized views of the "actor" table [puppet] - https://gerrit.wikimedia.org/r/514548 (https://phabricator.wikimedia.org/T224850) (owner: BryanDavis) [22:38:08] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:38:46] RECOVERY - puppet last run on schema2001 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [22:39:33] chaomodus it OOM? [22:39:39] yep [22:39:42] seems like it [22:40:08] chaomodus https://gerrit.wikimedia.org/r/monitoring?part=graph&graph=usedMemory looks ok to me [22:40:31] yes.. 
it does [22:40:41] but it was throwing heap errors [22:40:46] cannot allocate heap etc. [22:40:51] ah [22:41:14] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:42:36] RECOVERY - puppet last run on cumin2001 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [22:44:02] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:44:38] RECOVERY - puppet last run on webperf2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:44:56] RECOVERY - puppet last run on webperf1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:45:08] RECOVERY - puppet last run on db1125 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:46:12] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:48:43] well that's new. [22:48:52] thanks for the restart chaomodus [22:48:59] no worries :) [22:51:34] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:51:44] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [22:51:54] RECOVERY - puppet last run on webperf2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [23:00:04] MaxSem, RoanKattouw, and Niharika: (Dis)respected human, time to deploy Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190605T2300). Please do the needful. [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:02:06] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [23:04:04] (CR) Volans: [C: +2] "Agree!" 
[software/conftool] - https://gerrit.wikimedia.org/r/514567 (owner: CDanis) [23:06:39] (Merged) jenkins-bot: dbctl: print failures to stderr, don't print on successes [software/conftool] - https://gerrit.wikimedia.org/r/514567 (owner: CDanis) [23:13:30] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:36:20] (PS1) CDanis: dbctl: de-generic-ify helper argument names [software/conftool] - https://gerrit.wikimedia.org/r/514632 [23:48:54] (PS2) Volans: dbconfig: fix return value [software/conftool] - https://gerrit.wikimedia.org/r/514543 [23:48:55] (PS2) Volans: selectors: do not pre-compile the regex [software/conftool] - https://gerrit.wikimedia.org/r/514544 [23:48:57] (PS1) Volans: dbconfig: use lists for sectionLoads sections [software/conftool] - https://gerrit.wikimedia.org/r/514633 [23:52:51] (CR) Volans: dbctl: de-generic-ify helper argument names (1 comment) [software/conftool] - https://gerrit.wikimedia.org/r/514632 (owner: CDanis) [23:58:22] (CR) Volans: "I'd like Giuseppe to have a look at this one too if possible. I've tried at least other 2 different approaches and were definitely worse t" [software/conftool] - https://gerrit.wikimedia.org/r/514633 (owner: Volans)