[01:00:13] (PS1) SBassett: Temporary make account creation limits more restrictive [mediawiki-config] - https://gerrit.wikimedia.org/r/521189 (https://phabricator.wikimedia.org/T227416)
[01:06:33] Hey all - going to deploy https://gerrit.wikimedia.org/r/521189 as a stop-gap for T227416
[01:07:14] (CR) SBassett: [C: +2] Temporary make account creation limits more restrictive [mediawiki-config] - https://gerrit.wikimedia.org/r/521189 (https://phabricator.wikimedia.org/T227416) (owner: SBassett)
[01:08:12] (Merged) jenkins-bot: Temporary make account creation limits more restrictive [mediawiki-config] - https://gerrit.wikimedia.org/r/521189 (https://phabricator.wikimedia.org/T227416) (owner: SBassett)
[01:15:46] (CR) jenkins-bot: Temporary make account creation limits more restrictive [mediawiki-config] - https://gerrit.wikimedia.org/r/521189 (https://phabricator.wikimedia.org/T227416) (owner: SBassett)
[01:16:44] !log sbassett@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Temporary make account creation limits more restrictive (duration: 00m 50s)
[01:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:30:28] (PS1) DannyS712: Add 'templateeditor' user group and protection level on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/521191 (https://phabricator.wikimedia.org/T227420)
[01:37:43] (PS2) DannyS712: Add 'templateeditor' user group and protection level on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/521191 (https://phabricator.wikimedia.org/T227420)
[01:39:06] (CR) jerkins-bot: [V: -1] Add 'templateeditor' user group and protection level on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/521191 (https://phabricator.wikimedia.org/T227420) (owner: DannyS712)
[01:41:03] (PS3) DannyS712: Add 'templateeditor' user group and protection level on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/521191
(https://phabricator.wikimedia.org/T227420)
[01:51:15] (PS1) DannyS712: Remove "עמוד" namespace from wgFlaggedRevsNamespaces for hewikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/521194 (https://phabricator.wikimedia.org/T227000)
[01:51:53] (PS2) DannyS712: Remove "עמוד" namespace from wgFlaggedRevsNamespaces for hewikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/521194 (https://phabricator.wikimedia.org/T227000)
[02:14:41] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 93725400 and 5 seconds
[02:14:43] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 94801080 and 6 seconds
[02:19:07] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 39012888 and 3 seconds
[02:25:01] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 28549640 and 0 seconds
[02:26:25] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 69012640 and 3 seconds
[02:27:55] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 313464 and 66 seconds
[02:29:17] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 170936 and 38 seconds
[02:29:21] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 101472 and 40 seconds
[02:29:21] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2192 and 40 seconds
[03:07:40] (PS1) SBassett: Temporary make account creation limits more restrictive [mediawiki-config] -
https://gerrit.wikimedia.org/r/521195 (https://phabricator.wikimedia.org/T227416)
[03:10:31] (CR) SBassett: [C: +2] Temporary make account creation limits more restrictive [mediawiki-config] - https://gerrit.wikimedia.org/r/521195 (https://phabricator.wikimedia.org/T227416) (owner: SBassett)
[03:11:27] (Merged) jenkins-bot: Temporary make account creation limits more restrictive [mediawiki-config] - https://gerrit.wikimedia.org/r/521195 (https://phabricator.wikimedia.org/T227416) (owner: SBassett)
[03:11:42] (CR) jenkins-bot: Temporary make account creation limits more restrictive [mediawiki-config] - https://gerrit.wikimedia.org/r/521195 (https://phabricator.wikimedia.org/T227416) (owner: SBassett)
[03:18:15] ^ Deploying more aggressive acct creation throttles for all wikiquotes and wiktionaries due to ongoing spam attack
[03:19:15] !log sbassett@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Temporary make account creation limits more restrictive (duration: 00m 53s)
[03:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:04:12] (PS1) Marostegui: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/521198
[05:04:19] (PS3) Marostegui: mariadb: Promote db1132 to m2 master [puppet] - https://gerrit.wikimedia.org/r/519975 (https://phabricator.wikimedia.org/T226952)
[05:06:04] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/521198 (owner: Marostegui)
[05:06:56] (Merged) jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/521198 (owner: Marostegui)
[05:07:10] (CR) jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/521198 (owner: Marostegui)
[05:08:01] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1094 for upgrade (duration: 00m 50s)
[05:08:05] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:35] !log Stop MySQL on db1094 for upgrade
[05:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:10:17] PROBLEM - puppet last run on conf1005 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[05:11:43] !log Drop empty table edit_page_tracking from s7 - T57385
[05:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:49] T57385: Investigate dropping "edit_page_tracking" database table from Wikimedia wikis after archiving it - https://phabricator.wikimedia.org/T57385
[05:13:45] (PS1) Marostegui: db-eqiad.php: Slowly repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/521200
[05:14:32] (CR) jerkins-bot: [V: -1] db-eqiad.php: Slowly repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/521200 (owner: Marostegui)
[05:15:32] (PS2) Marostegui: db-eqiad.php: Slowly repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/521200
[05:18:31] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/521200 (owner: Marostegui)
[05:19:24] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/521200 (owner: Marostegui)
[05:19:38] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/521200 (owner: Marostegui)
[05:20:16] (PS1) Marostegui: db-eqiad.php: Fix typo with db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/521201
[05:21:56] (CR) Marostegui: [C: +2] db-eqiad.php: Fix typo with db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/521201 (owner: Marostegui)
[05:22:26] !log Drop empty table edit_page_tracking from some s3 wikis - T57385
[05:22:30] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:31] T57385: Investigate dropping "edit_page_tracking" database table from Wikimedia wikis after archiving it - https://phabricator.wikimedia.org/T57385
[05:22:45] (Merged) jenkins-bot: db-eqiad.php: Fix typo with db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/521201 (owner: Marostegui)
[05:22:59] (CR) jenkins-bot: db-eqiad.php: Fix typo with db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/521201 (owner: Marostegui)
[05:23:44] Operations, DBA: Investigate dropping "edit_page_tracking" database table from Wikimedia wikis after archiving it - https://phabricator.wikimedia.org/T57385 (Marostegui) Open→Resolved >>! In T57385#5310801, @ArielGlenn wrote: > Um, it has? I just found it on meta, though empty. > > wikiadmin@1...
[05:24:01] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1094 after upgrade (duration: 00m 49s)
[05:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:06] (CR) Marostegui: [C: +1] "Nice work!
My only comment as I said the other day is that it would be more helpful to have all the hosts in the same DC with the same "+"" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520768 (owner: Jcrespo)
[05:31:33] !log Compress medium wikis on labsdb1009 - T222978
[05:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:38] T222978: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978
[05:32:38] Operations, ops-codfw, serviceops: restbase2009 lockup - https://phabricator.wikimedia.org/T227408 (jijiki) p:Triage→Normal
[05:36:02] (PS1) Marostegui: db-eqiad.php: More weight to db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/521202
[05:37:09] (CR) Marostegui: [C: +2] db-eqiad.php: More weight to db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/521202 (owner: Marostegui)
[05:37:31] RECOVERY - puppet last run on conf1005 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[05:38:00] (Merged) jenkins-bot: db-eqiad.php: More weight to db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/521202 (owner: Marostegui)
[05:38:15] (CR) jenkins-bot: db-eqiad.php: More weight to db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/521202 (owner: Marostegui)
[05:39:11] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More weight to db1094 after upgrade (duration: 00m 51s)
[05:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:56] (PS1) Marostegui: db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/521203 (https://phabricator.wikimedia.org/T227062)
[05:42:53] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/521203 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui)
[05:43:44] (Merged) jenkins-bot: db-eqiad.php: Depool db1109
[mediawiki-config] - https://gerrit.wikimedia.org/r/521203 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui)
[05:43:58] (CR) jenkins-bot: db-eqiad.php: Depool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/521203 (https://phabricator.wikimedia.org/T227062) (owner: Marostegui)
[05:44:53] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1109 for binlog format change (duration: 00m 49s)
[05:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:06] !log Restart MySQL on db1109 to pick up STATEMENT as binlog format - T227062
[05:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:12] T227062: Failover s8 (wikidatawiki) db primary master db1071 to db1104 (read-only required) - https://phabricator.wikimedia.org/T227062
[05:46:55] (PS1) Marostegui: db-eqiad.php: Fully repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/521204
[05:47:56] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/521204 (owner: Marostegui)
[05:48:49] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/521204 (owner: Marostegui)
[05:49:07] (CR) jenkins-bot: db-eqiad.php: Fully repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/521204 (owner: Marostegui)
[05:50:06] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1094 after upgrade, slowly repool db1109 after changing its binlog format (duration: 00m 49s)
[05:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:51] (PS1) Marostegui: db-eqiad.php: More traffic to db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/521205
[06:07:02] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/521205 (owner:
Marostegui)
[06:07:55] (Merged) jenkins-bot: db-eqiad.php: More traffic to db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/521205 (owner: Marostegui)
[06:08:10] (CR) jenkins-bot: db-eqiad.php: More traffic to db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/521205 (owner: Marostegui)
[06:09:02] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1109 after changing its binlog format (duration: 00m 49s)
[06:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:47] (PS1) Marostegui: db-eqiad.php: Fully repool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/521207
[06:29:57] PROBLEM - puppet last run on db1132 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:30:59] PROBLEM - puppet last run on mw2258 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[06:35:25] RECOVERY - puppet last run on db1132 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:36:18] !log Run compare for s5 main tables on db2038 vs db2059 - T221533
[06:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:23] T221533: Decommission old coredb machines (<=db2042) - https://phabricator.wikimedia.org/T221533
[06:40:58] (CR) Marostegui: [C: +2] db-eqiad.php: Fully repool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/521207 (owner: Marostegui)
[06:42:11] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/521207 (owner: Marostegui)
[06:42:26] (CR) jenkins-bot: db-eqiad.php: Fully repool db1109 [mediawiki-config] - https://gerrit.wikimedia.org/r/521207 (owner: Marostegui)
[06:43:22] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1109 after changing its binlog format (duration: 00m 49s)
[06:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:12] Operations, ops-eqiad: Degraded RAID on restbase-dev1006 - https://phabricator.wikimedia.org/T227394 (MoritzMuehlenhoff) Open→Declined Duplicate of T224260
[06:58:11] RECOVERY - puppet last run on mw2258 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:41] (PS2) Elukey: role::statistics::explorer: add base firewall [puppet] - https://gerrit.wikimedia.org/r/520706 (https://phabricator.wikimedia.org/T170826)
[06:59:29] (CR) Elukey: [C: +2] role::statistics::explorer: add base firewall [puppet] - https://gerrit.wikimedia.org/r/520706 (https://phabricator.wikimedia.org/T170826) (owner: Elukey)
[07:00:07] !log add base::firewall to stat1004 - T170826
[07:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:13] T170826: Enable base::firewall on stat boxes after restricting Spark REPL
ports. - https://phabricator.wikimedia.org/T170826
[07:01:47] \o/
[07:02:09] (PS1) Ema: cache: rename cache::upload_ats role to cache::upload [puppet] - https://gerrit.wikimedia.org/r/521217 (https://phabricator.wikimedia.org/T227328)
[07:06:15] Operations, Analytics, hardware-requests, User-Elukey: eqiad: 2 misc nodes for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (elukey) Adding @Ottomata for a quick check about the next steps, but it sounds to me that having one kerberos host per DC seems the most flexible so...
[07:15:39] Operations, Analytics, hardware-requests, User-Elukey: eqiad: 2 misc nodes for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (MoritzMuehlenhoff) >>! In T227288#5308441, @elukey wrote: > If you think that we'll have a future use case for codfw, I am +1 to buy one misc node i...
[07:21:09] (PS1) Elukey: Add version 0.0.0+git20181106-2 [debs/prometheus-mcrouter-exporter] (debian) - https://gerrit.wikimedia.org/r/521219
[07:21:13] (PS2) Elukey: Add version 0.0.0+git20181106-2 [debs/prometheus-mcrouter-exporter] (debian) - https://gerrit.wikimedia.org/r/521219
[07:23:08] (PS3) Elukey: Add version 0.0.0+git20181106-2 [debs/prometheus-mcrouter-exporter] (debian) - https://gerrit.wikimedia.org/r/521219
[07:29:30] (PS1) Ladsgroup: Enable jsonld output format for wikibase entities everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/521221 (https://phabricator.wikimedia.org/T207168)
[07:31:15] (PS2) Ema: cache: rename cache::upload_ats role to cache::upload [puppet] - https://gerrit.wikimedia.org/r/521217 (https://phabricator.wikimedia.org/T227328)
[07:32:21] (CR) DCausse: [C: +1] "thanks!"
[mediawiki-config] - https://gerrit.wikimedia.org/r/521038 (https://phabricator.wikimedia.org/T227379) (owner: Daimona Eaytoy)
[07:32:34] Operations, Wikimedia-General-or-Unknown, serviceops, Patch-For-Review: Remove pear/mail packages from WMF MW app servers - https://phabricator.wikimedia.org/T195364 (jijiki)
[07:34:25] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[07:34:26] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:56] (PS7) Jcrespo: replication_tree.py: Console output of a replica set [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520768
[07:37:10] Operations, Performance-Team, serviceops: mcrouter codfw proxies sometimes lead to TKOs - https://phabricator.wikimedia.org/T227265 (jijiki)
[07:37:32] Operations, Performance-Team, serviceops, User-Elukey: mcrouter codfw proxies sometimes lead to TKOs - https://phabricator.wikimedia.org/T227265 (elukey)
[07:38:14] Operations, Core Platform Team, Page-Previews, RESTBase-API: Page summary endpoint in RESTBase not updated since about June 27 - https://phabricator.wikimedia.org/T226983 (jijiki)
[07:38:59] !log deploying sys schema to missing db production hosts
[07:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:12] !log rebooting weblog1001 for kernel security update
[07:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:59] Operations, Core Platform Team, Page-Previews, RESTBase-API: Page summary endpoint in RESTBase not updated since about June 27 - https://phabricator.wikimedia.org/T226983 (elukey) Explicitly adding @Eevans @Pchelolo @WDoranWMF (@mobrovac is now OOTO).
[07:43:01] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[07:43:02] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[07:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:20] !log rebooting hassium to pick up MDS-enabled qemu
[07:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:32] Operations, Analytics, hardware-requests, User-Elukey: eqiad: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (elukey)
[07:51:26] Operations, Analytics, hardware-requests, User-Elukey: codfw: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227425 (elukey)
[07:52:48] Operations, Analytics, hardware-requests, User-Elukey: eqiad: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (elukey) Amended this task and created T227425 :)
[08:14:28] Operations, Traffic: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ema)
[08:14:35] Operations, Traffic: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ema) p:Triage→Normal
[08:17:30] Operations, Traffic: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ema)
[08:23:10] (PS8) Jcrespo: replication_tree.py: Console output of a replica set [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520768
[08:23:20] RECOVERY - snapshot of s6 in codfw on db1115 is OK: snapshot for s6 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-07-08 07:12:27 from db2097.codfw.wmnet:3316 (489 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[08:27:10] !log updated buster installer images to final release
[08:27:13] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:52] (PS2) Muehlenhoff: Add DHCP entries for poolcounter100[45], poolcounter200[34] [puppet] - https://gerrit.wikimedia.org/r/520906 (https://phabricator.wikimedia.org/T226811)
[08:33:11] (CR) Effie Mouzeli: [V: +1 C: +1] "It's what we have running in production now:)" [debs/prometheus-mcrouter-exporter] (debian) - https://gerrit.wikimedia.org/r/521219 (owner: Elukey)
[08:33:23] (PS3) Ema: cache: rename cache::upload_ats role to cache::upload [puppet] - https://gerrit.wikimedia.org/r/521217 (https://phabricator.wikimedia.org/T227328)
[08:33:38] (CR) Elukey: [C: +2] Add version 0.0.0+git20181106-2 [debs/prometheus-mcrouter-exporter] (debian) - https://gerrit.wikimedia.org/r/521219 (owner: Elukey)
[08:35:07] (CR) Muehlenhoff: [C: +2] Add DHCP entries for poolcounter100[45], poolcounter200[34] [puppet] - https://gerrit.wikimedia.org/r/520906 (https://phabricator.wikimedia.org/T226811) (owner: Muehlenhoff)
[08:35:48] (PS1) Elukey: Fix typo in add_gets.patch [debs/prometheus-mcrouter-exporter] (debian) - https://gerrit.wikimedia.org/r/521224
[08:36:11] (CR) Muehlenhoff: [V: +2 C: +2] Alphabetize metrics [debs/prometheus-pdns-exporter] - https://gerrit.wikimedia.org/r/521041 (owner: BryanDavis)
[08:36:13] (CR) Elukey: [C: +2] Fix typo in add_gets.patch [debs/prometheus-mcrouter-exporter] (debian) - https://gerrit.wikimedia.org/r/521224 (owner: Elukey)
[08:36:41] (CR) Ema: "pcc here https://puppet-compiler.wmflabs.org/compiler1002/17243/" [puppet] - https://gerrit.wikimedia.org/r/521217 (https://phabricator.wikimedia.org/T227328) (owner: Ema)
[08:38:23] (PS9) Jcrespo: replication_tree.py: Console output of a replica set [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520768
[08:41:24] (CR) Vgutierrez: [C: +1] cache: rename cache::upload_ats role to cache::upload [puppet] - https://gerrit.wikimedia.org/r/521217
(https://phabricator.wikimedia.org/T227328) (owner: Ema)
[08:43:46] (CR) Muehlenhoff: Add dnsupdate, rd, recursion, security, and udp metrics (2 comments) [debs/prometheus-pdns-exporter] - https://gerrit.wikimedia.org/r/521042 (https://phabricator.wikimedia.org/T227411) (owner: BryanDavis)
[08:44:25] (CR) Muehlenhoff: [V: +2 C: +2] Log more info when `pdns_control list` fails [debs/prometheus-pdns-exporter] - https://gerrit.wikimedia.org/r/521043 (owner: BryanDavis)
[08:49:19] (CR) Jcrespo: [C: +2] "> Patch Set 6: Code-Review+1" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520768 (owner: Jcrespo)
[08:49:44] (Merged) jenkins-bot: replication_tree.py: Console output of a replica set [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520768 (owner: Jcrespo)
[08:50:48] (PS4) Jcrespo: Ask for confirmation before the critical stops on certain scripts [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520850
[08:54:06] (PS5) Elukey: aptrepo: add thirdparty/amd-rocm [puppet] - https://gerrit.wikimedia.org/r/520848 (https://phabricator.wikimedia.org/T224723)
[08:57:52] Operations, User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (fgiunchedi)
[09:01:10] (CR) Jcrespo: [C: +2] Ask for confirmation before the critical stops on certain scripts [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520850 (owner: Jcrespo)
[09:01:54] (CR) Volans: [C: +2] "LGTM, thanks for the patch!"
[cookbooks] - https://gerrit.wikimedia.org/r/520897 (https://phabricator.wikimedia.org/T203963) (owner: Elukey)
[09:01:56] (Merged) jenkins-bot: Ask for confirmation before the critical stops on certain scripts [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/520850 (owner: Jcrespo)
[09:03:03] (CR) jerkins-bot: [V: -1] sre.ganeti.makevm: add dns check before creating the vm [cookbooks] - https://gerrit.wikimedia.org/r/520897 (https://phabricator.wikimedia.org/T203963) (owner: Elukey)
[09:04:17] mmmmm
[09:04:42] (PS1) Jcrespo: switchover.py: Check binary log format before switch [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/521226
[09:05:01] (CR) jerkins-bot: [V: -1] switchover.py: Check binary log format before switch [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/521226 (owner: Jcrespo)
[09:05:56] volans: o/ anything changed in tox-docker?
[09:06:13] (PS2) Jcrespo: switchover.py: Check binary log format before switch [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/521226
[09:07:10] (PS3) Jcrespo: switchover.py: Check binary log format before switch [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/521226
[09:07:33] elukey: checking
[09:09:39] it might be the new release of pep257
[09:09:42] I'll take care of it
[09:12:32] yep, pydocstyle from 3.0.0 to 4.0.0 2 days ago
[09:14:06] (PS4) Ema: cache: rename cache::upload_ats role to cache::upload [puppet] - https://gerrit.wikimedia.org/r/521217 (https://phabricator.wikimedia.org/T227328)
[09:14:43] (CR) Filippo Giunchedi: "LGTM (the Prometheus part) but files will need cleanup" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/521217 (https://phabricator.wikimedia.org/T227328) (owner: Ema)
[09:15:59] (CR) Filippo Giunchedi: [C: +1] initial attempt at a varnishkafka exporter [debs/prometheus-varnishkafka-exporter] - https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066)
(owner: Cwhite)
[09:16:19] (PS5) Ema: cache: rename cache::upload_ats role to cache::upload [puppet] - https://gerrit.wikimedia.org/r/521217 (https://phabricator.wikimedia.org/T227328)
[09:16:43] (CR) Filippo Giunchedi: [C: +2] Log more info when `pdns_control list` fails [debs/prometheus-pdns-exporter] - https://gerrit.wikimedia.org/r/521043 (owner: BryanDavis)
[09:22:09] (CR) Ema: [C: +2] cache: rename cache::upload_ats role to cache::upload [puppet] - https://gerrit.wikimedia.org/r/521217 (https://phabricator.wikimedia.org/T227328) (owner: Ema)
[09:24:30] (PS1) Volans: docstrings: pep257 compatibility [cookbooks] - https://gerrit.wikimedia.org/r/521228
[09:24:31] elukey: ^^^
[09:26:26] (CR) Filippo Giunchedi: mediawiki::webserver: add mtail to gather latency, error rate metrics (4 comments) [puppet] - https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815) (owner: Giuseppe Lavagetto)
[09:27:39] (CR) Elukey: [C: +1] docstrings: pep257 compatibility [cookbooks] - https://gerrit.wikimedia.org/r/521228 (owner: Volans)
[09:27:47] (CR) Volans: [C: +2] docstrings: pep257 compatibility [cookbooks] - https://gerrit.wikimedia.org/r/521228 (owner: Volans)
[09:27:59] volans: very delicate changes, pay attention when deploying :P
[09:28:04] rotfl
[09:28:06] (CR) Filippo Giunchedi: "LGTM overall" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/520296 (https://phabricator.wikimedia.org/T209182) (owner: CRusnov)
[09:29:47] (Merged) jenkins-bot: docstrings: pep257 compatibility [cookbooks] - https://gerrit.wikimedia.org/r/521228 (owner: Volans)
[09:31:20] (PS5) Volans: sre.ganeti.makevm: add dns check before creating the vm [cookbooks] - https://gerrit.wikimedia.org/r/520897 (https://phabricator.wikimedia.org/T203963) (owner: Elukey)
[09:34:08] (CR) Effie Mouzeli: [C: +2] base::monitoring::host: ignore /mnt/hdfs from disk checks [puppet] -
https://gerrit.wikimedia.org/r/520989 (https://phabricator.wikimedia.org/T226698) (owner: Elukey)
[09:35:53] (PS1) Jcrespo: WMFReplication: Parallelize slaves() [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/521232
[09:36:09] (PS1) Reedy: Enable StopForumSpam on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/521233 (https://phabricator.wikimedia.org/T181217)
[09:36:16] (CR) jerkins-bot: [V: -1] WMFReplication: Parallelize slaves() [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/521232 (owner: Jcrespo)
[09:37:42] (CR) Effie Mouzeli: [C: +1] base::monitoring::host: ignore /mnt/hdfs from disk checks [puppet] - https://gerrit.wikimedia.org/r/520989 (https://phabricator.wikimedia.org/T226698) (owner: Elukey)
[09:39:03] (PS2) Ema: ATS: do not overwrite Server header but add it if missing [puppet] - https://gerrit.wikimedia.org/r/520875 (https://phabricator.wikimedia.org/T224119)
[09:39:42] (CR) Vgutierrez: [C: +1] "Thanks! <3" [puppet] - https://gerrit.wikimedia.org/r/520875 (https://phabricator.wikimedia.org/T224119) (owner: Ema)
[09:40:43] (CR) Filippo Giunchedi: [C: -1] "See inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: Jcrespo)
[09:41:15] (PS1) Muehlenhoff: Update partman selection for pool counters [puppet] - https://gerrit.wikimedia.org/r/521235 (https://phabricator.wikimedia.org/T224572)
[09:41:52] Operations, Traffic, Patch-For-Review: Rename role::cache::upload_ats to role::cache::upload - https://phabricator.wikimedia.org/T227328 (ema) Open→Resolved a:ema Done.
[09:41:55] Operations, Traffic: Replace Varnish backends with ATS on cache upload nodes - https://phabricator.wikimedia.org/T226589 (ema)
[09:42:36] Operations, Traffic: Replace Varnish backends with ATS on cache upload nodes - https://phabricator.wikimedia.org/T226589 (ema) Open→Resolved a:ema All cache_upload nodes are now using ATS instead of Varnish for on-disk caching. Closing.
[09:43:37] (PS2) Muehlenhoff: Update partman selection for pool counters [puppet] - https://gerrit.wikimedia.org/r/521235 (https://phabricator.wikimedia.org/T224572)
[09:45:56] (CR) Volans: [C: +2] sre.ganeti.makevm: add dns check before creating the vm [cookbooks] - https://gerrit.wikimedia.org/r/520897 (https://phabricator.wikimedia.org/T203963) (owner: Elukey)
[09:46:34] (CR) Muehlenhoff: [C: +2] Update partman selection for pool counters [puppet] - https://gerrit.wikimedia.org/r/521235 (https://phabricator.wikimedia.org/T224572) (owner: Muehlenhoff)
[09:46:36] (CR) Hashar: [C: +1] "The DEBUG level is merely because the stack dump is generated with DEBUG level:" [puppet] - https://gerrit.wikimedia.org/r/505253 (owner: Hashar)
[09:47:05] hashar: do you know if conftool/confctl is used/working in deployment-prep by any chance?
[09:47:30] (Merged) jenkins-bot: sre.ganeti.makevm: add dns check before creating the vm [cookbooks] - https://gerrit.wikimedia.org/r/520897 (https://phabricator.wikimedia.org/T203963) (owner: Elukey)
[09:49:50] !log removed /srv/prometheus/ops/targets/varnish-upload-ats_mtail_$DC.yaml from prometheus hosts
[09:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:03] !log elukey@cumin1001 START - Cookbook sre.ganeti.makevm
[09:51:03] !log elukey@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[09:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:17] this was a test --^
[09:51:33] (PS3) Ema: ATS: do not overwrite Server header but add it if missing [puppet] - https://gerrit.wikimedia.org/r/520875 (https://phabricator.wikimedia.org/T224119)
[09:52:08] (CR) Ema: [C: +2] ATS: do not overwrite Server header but add it if missing [puppet] - https://gerrit.wikimedia.org/r/520875 (https://phabricator.wikimedia.org/T224119) (owner: Ema)
[09:55:24] Operations, Operations-Software-Development, serviceops-radar, Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (elukey) ` elukey@cumin1001:~$ sudo cookbook sre.ganeti.makevm eqiad_A test_not_existing.eqiad.wmnet --vcpus 2 --memory 4...
[09:58:00] (03PS2) 10Jcrespo: WMFReplication: Parallelize slaves() [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521232 [10:00:14] (03PS1) 10Ema: cache: deploy acme-chief unified certs on upload@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/521241 (https://phabricator.wikimedia.org/T226477) [10:00:32] (03CR) 10jerkins-bot: [V: 04-1] cache: deploy acme-chief unified certs on upload@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/521241 (https://phabricator.wikimedia.org/T226477) (owner: 10Ema) [10:00:47] (03PS2) 10Ema: cache: deploy acme-chief unified certs on upload@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/521241 (https://phabricator.wikimedia.org/T226477) [10:01:12] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+1] "LGTM https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/17247/console" [puppet] - 10https://gerrit.wikimedia.org/r/520989 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [10:03:14] (03PS1) 10Vgutierrez: ncredir: Notify nginx when redirection_maps.conf is changed [puppet] - 10https://gerrit.wikimedia.org/r/521242 (https://phabricator.wikimedia.org/T133548) [10:04:14] (03CR) 10Ema: [C: 03+1] ncredir: Notify nginx when redirection_maps.conf is changed [puppet] - 10https://gerrit.wikimedia.org/r/521242 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [10:04:44] (03CR) 10Vgutierrez: [C: 03+1] cache: deploy acme-chief unified certs on upload@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/521241 (https://phabricator.wikimedia.org/T226477) (owner: 10Ema) [10:06:36] (03CR) 10Ema: [C: 03+2] cache: deploy acme-chief unified certs on upload@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/521241 (https://phabricator.wikimedia.org/T226477) (owner: 10Ema) [10:06:56] (03CR) 10Vgutierrez: [C: 03+2] ncredir: Notify nginx when redirection_maps.conf is changed [puppet] - 10https://gerrit.wikimedia.org/r/521242 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [10:07:10] 
(03PS2) 10Vgutierrez: ncredir: Notify nginx when redirection_maps.conf is changed [puppet] - 10https://gerrit.wikimedia.org/r/521242 (https://phabricator.wikimedia.org/T133548) [10:09:17] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/520989 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [10:11:40] (03CR) 10Jbond: [C: 03+2] Update urbanecm's .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/521185 (owner: 10Urbanecm) [10:11:48] (03PS2) 10Jbond: Update urbanecm's .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/521185 (owner: 10Urbanecm) [10:11:56] 10Operations, 10Operations-Software-Development, 10serviceops-radar, 10Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10elukey) >>! In T203963#5308788, @MoritzMuehlenhoff wrote: > Ah, and one more thing: After typing "done" for confirmation... [10:15:37] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/520957 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [10:22:59] (03CR) 10Jbond: "Thanks" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/520290 (owner: 10Jbond) [10:24:14] 10Operations, 10Operations-Software-Development, 10serviceops-radar, 10Patch-For-Review, and 3 others: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (10MoritzMuehlenhoff) >>! In T203963#5312748, @elukey wrote: > Riccardo wrote above that cumin's logging is disabled tempor... [10:25:20] (03CR) 10Volans: python3/icinga check: refactor check to python3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520290 (owner: 10Jbond) [10:29:16] (03PS1) 10Filippo Giunchedi: hieradata: enable centrallog1001 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/521245 (https://phabricator.wikimedia.org/T200706) [10:30:05] jan_drewniak: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikimedia Portals Update deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190708T1030). [10:33:20] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521246 (https://phabricator.wikimedia.org/T128546) [10:36:00] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521246 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:36:50] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521246 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:37:35] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521246 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:38:56] (03CR) 10Urbanecm: [C: 04-1] Add 'templateeditor' user group and protection level on commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521191 (https://phabricator.wikimedia.org/T227420) (owner: 10DannyS712) [10:38:56] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:521246| Bumping portals to master (T128546)]] (duration: 00m 51s) [10:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:02] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:39:46] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:521246| Bumping portals to master (T128546)]] (duration: 00m 49s) [10:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:52] (03CR) 10Urbanecm: [C: 04-1] Add 'templateeditor' user group and protection level on commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521191 (https://phabricator.wikimedia.org/T227420) (owner: 10DannyS712) [10:43:31] (03PS1) 10Muehlenhoff: Add 
poolcounter1004,1005,2003,2004 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/521248 (https://phabricator.wikimedia.org/T224572) [10:44:41] (03CR) 10Alexandros Kosiaris: [C: 04-1] RESTRouter: Add initial Helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/512923 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [10:45:06] 10Operations: reinstall RT server with private IP and Buster - https://phabricator.wikimedia.org/T180641 (10MoritzMuehlenhoff) [10:46:05] (03CR) 10Urbanecm: [C: 03+1] Remove "עמוד" namespace from wgFlaggedRevsNamespaces for hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521194 (https://phabricator.wikimedia.org/T227000) (owner: 10DannyS712) [10:48:37] (03CR) 10Muehlenhoff: [C: 03+2] Add poolcounter1004,1005,2003,2004 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/521248 (https://phabricator.wikimedia.org/T224572) (owner: 10Muehlenhoff) [10:51:48] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: codfw: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227425 (10MoritzMuehlenhoff) p:05Triage→03Normal [10:53:42] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Upgrade LVS servers to stretch - https://phabricator.wikimedia.org/T177961 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff With the recent replacement of lvs100*, all LVS servers are running Stretch. [10:56:37] (03PS2) 10Jbond: python3/icinga check: refactor check to python3 [puppet] - 10https://gerrit.wikimedia.org/r/520290 [10:56:55] !log installing poolcounter2003/2004 [10:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:40] (03CR) 10Jbond: "updated thanks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520290 (owner: 10Jbond) [10:59:50] (03CR) 10Volans: [C: 03+1] "Code looks good. 
Just one thing I realized, make sure the puppet side of it includes python3-requests as dependency (as opposed to the py2" [puppet] - 10https://gerrit.wikimedia.org/r/520290 (owner: 10Jbond) [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190708T1100). [11:00:05] Urbanecm and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:14] hi [11:00:17] o/ [11:00:36] o/ [11:00:58] Urbanecm: do you want to start? [11:01:02] yes Lucas_WMDE [11:01:10] Last minute question: do we have enough time for an extra patch? [11:01:21] Daimona, we'll see at the end of the window :) [11:01:32] feel free to add the patch to the calendar, it may or may not be deployed [11:01:44] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520992 (https://phabricator.wikimedia.org/T226764) (owner: 10Urbanecm) [11:01:49] o/ [11:01:53] Yeah, though I see we're already at 6, but it's not urgent :-) [11:02:13] Six is not hard limit, we can continue if there's time - but it can't be guaranteed :) [11:02:29] Yup, I'll wait & see [11:02:42] (03Merged) 10jenkins-bot: Add zh_classicalwiki to commonsuploads.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520992 (https://phabricator.wikimedia.org/T226764) (owner: 10Urbanecm) [11:02:47] Daimona, btw, I've +2'ed the backport, giving time for CI [11:02:55] Yep, looking at that [11:02:58] (03CR) 10jenkins-bot: Add zh_classicalwiki to commonsuploads.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520992 (https://phabricator.wikimedia.org/T226764) (owner: 10Urbanecm) [11:03:01] (03PS4) 10Urbanecm: Create "autopatrolled" user group on az.wiktionary 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/520507 (https://phabricator.wikimedia.org/T227208) (owner: 10DannyS712) [11:03:06] Will you run it after deployment, or later today? [11:03:08] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520507 (https://phabricator.wikimedia.org/T227208) (owner: 10DannyS712) [11:03:27] Daimona, probably after deployment [11:03:36] Great, thanks [11:04:10] (03Merged) 10jenkins-bot: Create "autopatrolled" user group on az.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520507 (https://phabricator.wikimedia.org/T227208) (owner: 10DannyS712) [11:04:57] !log urbanecm@deploy1001 Synchronized dblists/commonsuploads.dblist: SWAT: [[:gerrit:520507|Create "autopatrolled" user group on az.wiktionary]] (T227208) (duration: 00m 50s) [11:04:58] (03PS2) 10Urbanecm: Add several Ukrainian government websites to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520997 (https://phabricator.wikimedia.org/T227366) [11:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:03] T227208: Add autopatroller user group to az.wiktionary - https://phabricator.wikimedia.org/T227208 [11:05:07] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520997 (https://phabricator.wikimedia.org/T227366) (owner: 10Urbanecm) [11:05:10] (03CR) 10jenkins-bot: Create "autopatrolled" user group on az.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520507 (https://phabricator.wikimedia.org/T227208) (owner: 10DannyS712) [11:05:37] ehh, wrong message :( [11:06:07] (03Merged) 10jenkins-bot: Add several Ukrainian government websites to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520997 (https://phabricator.wikimedia.org/T227366) (owner: 10Urbanecm) [11:06:21] (03CR) 10jenkins-bot: Add several Ukrainian government websites to wgCopyUploadsDomains [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/520997 (https://phabricator.wikimedia.org/T227366) (owner: 10Urbanecm) [11:07:50] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:520507|Create "autopatrolled" user group on az.wiktionary]] (T227208) (duration: 00m 49s) [11:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:46] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521194 (https://phabricator.wikimedia.org/T227000) (owner: 10DannyS712) [11:09:18] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:520997|Add several Ukrainian government websites to wgCopyUploadsDomains]] (T227366) (duration: 00m 49s) [11:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:23] T227366: Please add numerous *.gov.ua domains to the wgCopyUploadsDomains whitelist. - https://phabricator.wikimedia.org/T227366 [11:09:49] (03Merged) 10jenkins-bot: Remove "עמוד" namespace from wgFlaggedRevsNamespaces for hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521194 (https://phabricator.wikimedia.org/T227000) (owner: 10DannyS712) [11:10:06] (03CR) 10jenkins-bot: Remove "עמוד" namespace from wgFlaggedRevsNamespaces for hewikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521194 (https://phabricator.wikimedia.org/T227000) (owner: 10DannyS712) [11:10:09] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [11:11:07] !log urbanecm@deploy1001 Synchronized wmf-config/flaggedrevs.php: SWAT: [[:gerrit:521194|Remove "עמוד" namespace from wgFlaggedRevsNamespaces for hewikisource]] (T227000) (duration: 00m 49s) [11:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:12] T227000: Configuration request for Flagged Reviews at 
Hebrew Wikisource - https://phabricator.wikimedia.org/T227000 [11:11:36] Lucas_WMDE, feel free to deploy your patches. My backport wasn't merged yet, so please hand SWAT over to me after you're done :) [11:11:56] ehh, wrong ping.. [11:11:59] Amir1, ^^ [11:12:15] noted [11:12:41] (03PS2) 10Ladsgroup: Enable jsonld output format for wikibase entities everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521221 (https://phabricator.wikimedia.org/T207168) [11:12:50] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521221 (https://phabricator.wikimedia.org/T207168) (owner: 10Ladsgroup) [11:13:05] Finally jerkins did it [11:13:13] Daimona, I see :) [11:13:23] It always takes ages... [11:13:36] yeah, esp for mediawiki/* [11:13:52] (03PS1) 10Vgutierrez: ncredir: Use a custom access_log log_format [puppet] - 10https://gerrit.wikimedia.org/r/521249 (https://phabricator.wikimedia.org/T133548) [11:13:54] (03Merged) 10jenkins-bot: Enable jsonld output format for wikibase entities everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521221 (https://phabricator.wikimedia.org/T207168) (owner: 10Ladsgroup) [11:14:48] (03PS2) 10Vgutierrez: ncredir: Use a custom access_log log_format [puppet] - 10https://gerrit.wikimedia.org/r/521249 (https://phabricator.wikimedia.org/T133548) [11:14:50] (03CR) 10Jbond: [C: 03+2] "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/520290 (owner: 10Jbond) [11:14:59] (03PS3) 10Jbond: python3/icinga check: refactor check to python3 [puppet] - 10https://gerrit.wikimedia.org/r/520290 [11:15:25] looks good, moving forward [11:16:28] (03CR) 10jenkins-bot: Enable jsonld output format for wikibase entities everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521221 (https://phabricator.wikimedia.org/T207168) (owner: 10Ladsgroup) [11:16:53] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:521221|Enable jsonld output format for 
wikibase entities everywhere (T207168)]] (duration: 00m 49s) [11:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:00] T207168: Provide JSON-LD support for Wikidata - https://phabricator.wikimedia.org/T207168 [11:18:22] (03CR) 10Ladsgroup: [C: 03+2] Disable Wikidata for ProofreadPage namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520780 (https://phabricator.wikimedia.org/T227201) (owner: 10Matěj Suchánek) [11:19:20] (03Merged) 10jenkins-bot: Disable Wikidata for ProofreadPage namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520780 (https://phabricator.wikimedia.org/T227201) (owner: 10Matěj Suchánek) [11:19:36] (03CR) 10jenkins-bot: Disable Wikidata for ProofreadPage namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520780 (https://phabricator.wikimedia.org/T227201) (owner: 10Matěj Suchánek) [11:22:26] !log ladsgroup@deploy1001 Synchronized wmf-config/Wikibase.php: [[gerrit:520780|Disable Wikidata for ProofreadPage namespaces (T227201)]] (duration: 00m 50s) [11:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:32] T227201: Disable Wikidata sitelinks in ProofreadPage namespaces - https://phabricator.wikimedia.org/T227201 [11:22:50] Urbanecm: I'm done! 
[11:22:55] thanks Amir1 [11:24:54] Daimona, deploying the script backport [11:25:03] Great [11:25:08] (should be a noop, but to get it on maintenance host, i need to do it) [11:25:38] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/AbuseFilter/: SWAT: [[:gerrit:520991|Fix query in normalizeThrottleParameters]] (T209565) (duration: 00m 51s) [11:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:43] T209565: Dry run for normalizeThrottleParameters.php - https://phabricator.wikimedia.org/T209565 [11:26:20] Daimona, running the dry run [11:26:30] Nice, I'll check logstash [11:26:40] thanks Daimona [11:26:45] !log installing poolcounter1004/1005 [11:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:38] (03PS15) 10Urbanecm: Remove HD logos for projects with no entry in wgLogo or add a wgLogo entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521183 (https://phabricator.wikimedia.org/T227418) [11:27:47] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521183 (https://phabricator.wikimedia.org/T227418) (owner: 10Urbanecm) [11:28:54] (03Merged) 10jenkins-bot: Remove HD logos for projects with no entry in wgLogo or add a wgLogo entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521183 (https://phabricator.wikimedia.org/T227418) (owner: 10Urbanecm) [11:29:09] (03CR) 10jenkins-bot: Remove HD logos for projects with no entry in wgLogo or add a wgLogo entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521183 (https://phabricator.wikimedia.org/T227418) (owner: 10Urbanecm) [11:30:49] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: SWAT: [[:gerrit:521183|Remove HD logos for projects with no entry in wgLogo or add a wgLogo entry]] (1/2, T227418) (duration: 00m 49s) [11:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:54] T227418: Several projects has an entry in 
wgLogoHD, but no entry in wgLogo - https://phabricator.wikimedia.org/T227418 [11:30:59] (03CR) 10DannyS712: Add 'templateeditor' user group and protection level on commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521191 (https://phabricator.wikimedia.org/T227420) (owner: 10DannyS712) [11:31:43] (03PS4) 10DannyS712: Add 'templateeditor' user group and protection level on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521191 (https://phabricator.wikimedia.org/T227420) [11:31:44] I'm going AFK 10 minutes, https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/521038/ is the extra patch in case there's time, and I don't see any error on logstash! [11:31:58] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:521183|Remove HD logos for projects with no entry in wgLogo or add a wgLogo entry]] (2/2, T227418) (duration: 00m 49s) [11:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:21] Daimona, can deploy it rn if you want :) [11:32:31] Uh actually I see one now [11:32:36] PHP Warning: Unable to start TLS: Can't contact LDAP server [11:32:51] is this related? [11:32:53] That'd be great, thanks [11:33:00] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521038 (https://phabricator.wikimedia.org/T227379) (owner: 10Daimona Eaytoy) [11:33:02] I don't know, will investigate on the task after we get the results [11:33:09] ^^ let's do it then :) ^^ [11:33:55] (03Merged) 10jenkins-bot: Fix array shape for $wgCirrusSearchExtraIndexes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521038 (https://phabricator.wikimedia.org/T227379) (owner: 10Daimona Eaytoy) [11:34:09] Daimona, is that patch testable on mwdebug1002? 
[11:34:10] (CR) jenkins-bot: Fix array shape for $wgCirrusSearchExtraIndexes [mediawiki-config] - https://gerrit.wikimedia.org/r/521038 (https://phabricator.wikimedia.org/T227379) (owner: Daimona Eaytoy)
[11:35:00] I don't think so
[11:35:13] ok
[11:35:18] I guess it's enough to see if we start getting new errors
[11:35:37] ok
[11:35:57] deploying
[11:36:44] !log urbanecm@deploy1001 Synchronized wmf-config/CirrusSearch-common.php: SWAT: [[:gerrit:521038|Fix array shape for $wgCirrusSearchExtraIndexes]] (T227379) (duration: 00m 51s)
[11:36:48] Daimona, deployed
[11:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:50] T227379: OtherIndexesUpdater.php: PHP Warning: Invalid argument supplied for foreach() - https://phabricator.wikimedia.org/T227379
[11:37:32] Daimona, and when you have time, please review https://phabricator.wikimedia.org/P8719, that's the output of the dry run
[11:39:13] !log Purged 14 logo urls for T227418
[11:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:21] T227418: Several projects has an entry in wgLogoHD, but no entry in wgLogo - https://phabricator.wikimedia.org/T227418
[11:39:58] Sure, here I am
[11:40:28] I guess the CirrusSearch one is fine
[11:40:41] cool!
[11:41:59] RECOVERY - snapshot of s3 in codfw on db1115 is OK: snapshot for s3 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-07-08 06:17:17 from db2098.codfw.wmnet:3313 (747 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[11:43:45] And the error for the dry-run seems unrelated
[11:43:51] Well
[11:43:55] that's perfect
[11:44:04] It comes from the script execution, but it looks like some config issue
[11:44:17] seems so
[11:44:24] so, let's run the script for real then Daimona ?
[11:44:27] Would you mind running it again, only for labtestwiki?
[11:44:30] Not yet
[11:44:31] sure
[11:44:33] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.11/includes/Title.php: SWAT: [[:gerrit:521253|Title: ensure getBaseTitle and getRootTitle return valid Titles]] (T225585) (duration: 00m 50s)
[11:44:39] I have to write an announcement for Tech News :)
[11:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:42] aha
[11:44:42] T225585: Unable to open user sub pages with trailing space (Blank page fatal: "invalid DB key") - https://phabricator.wikimedia.org/T225585
[11:44:50] Which I'll do now
[11:45:03] Then wait until it reaches people
[11:45:52] Daimona, ran
[11:46:04] (dry run for labtestwiki)
[11:46:27] Thanks, got the error again... I guess it's just something labtestwiki-specific, but I don't care
[11:46:54] ok
[11:47:09] ping me in the task when this will be ready for the script
[11:47:46] (PS18) Urbanecm: Test if 1x logo exists for all HD logos [mediawiki-config] - https://gerrit.wikimedia.org/r/521180
[11:47:52] !log EU SWAT done
[11:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:10] Sure, and thanks a lot!
[12:01:21] yw [12:02:22] (03CR) 10Urbanecm: [C: 03+2] "noop for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521180 (owner: 10Urbanecm) [12:03:14] (03Merged) 10jenkins-bot: Test if 1x logo exists for all HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521180 (owner: 10Urbanecm) [12:03:29] (03CR) 10jenkins-bot: Test if 1x logo exists for all HD logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521180 (owner: 10Urbanecm) [12:32:06] (03CR) 10Muehlenhoff: [C: 03+1] hieradata: enable centrallog1001 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/521245 (https://phabricator.wikimedia.org/T200706) (owner: 10Filippo Giunchedi) [12:35:42] (03CR) 10Jcrespo: "Line 40: options = parser.parse_args()" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo) [12:36:48] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-staging-values.yaml staging stable/cxserver [namespace: cxserver, clusters: staging] [12:36:50] !log kartik@deploy1001 scap-helm cxserver cluster staging completed [12:36:50] !log kartik@deploy1001 scap-helm cxserver finished [12:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:16] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-codfw-values.yaml production stable/cxserver [namespace: cxserver, clusters: codfw] [12:39:18] !log kartik@deploy1001 scap-helm cxserver cluster codfw completed [12:39:18] !log kartik@deploy1001 scap-helm cxserver finished [12:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:39] (03PS1) 
10Ema: varnish: stop sending the Via response header [puppet] - 10https://gerrit.wikimedia.org/r/521261 (https://phabricator.wikimedia.org/T194814) [12:42:12] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-eqiad-values.yaml production stable/cxserver [namespace: cxserver, clusters: eqiad] [12:42:14] !log kartik@deploy1001 scap-helm cxserver cluster eqiad completed [12:42:14] !log kartik@deploy1001 scap-helm cxserver finished [12:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:07] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/520989 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [12:51:34] (03PS4) 10Elukey: base::monitoring::host: ignore /mnt/hdfs from disk checks [puppet] - 10https://gerrit.wikimedia.org/r/520989 (https://phabricator.wikimedia.org/T226698) [12:52:14] !log copy mtail to buster-wikimedia - T225604 [12:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:19] T225604: log spam from mtail 3.0.0~rc19 on wezen - https://phabricator.wikimedia.org/T225604 [12:52:43] 10Operations, 10vm-requests: Site: eqiad/codfw 2 VMs each for pool counters - https://phabricator.wikimedia.org/T226811 (10MoritzMuehlenhoff) 05Open→03Resolved VMs have been created, implementation happens via T224572 [12:54:08] 10Operations, 10Core Platform Team, 10Page-Previews, 10RESTBase-API, 10Services (doing): Page summary endpoint in RESTBase not updated since about June 27 - https://phabricator.wikimedia.org/T226983 (10Pchelolo) PR here: https://github.com/wikimedia/restbase/pull/1162 Will test and deploy soon. [12:54:15] (03CR) 10Elukey: [C: 03+2] "Thanks all for the reviews!" 
[puppet] - 10https://gerrit.wikimedia.org/r/520989 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [12:59:42] (03PS3) 10Vgutierrez: ncredir: Use a custom access_log log_format [puppet] - 10https://gerrit.wikimedia.org/r/521249 (https://phabricator.wikimedia.org/T133548) [12:59:47] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: codfw: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227425 (10elukey) [13:01:24] (03PS4) 10Vgutierrez: ncredir: Use a custom access_log log_format [puppet] - 10https://gerrit.wikimedia.org/r/521249 (https://phabricator.wikimedia.org/T133548) [13:09:39] 10Operations, 10observability, 10Wikimedia-Incident: prometheus: upgrade to 2.11 - https://phabricator.wikimedia.org/T222113 (10fgiunchedi) [13:14:00] (03CR) 10Ema: [C: 03+1] ncredir: Use a custom access_log log_format [puppet] - 10https://gerrit.wikimedia.org/r/521249 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [13:14:33] (03CR) 10Vgutierrez: [C: 03+2] ncredir: Use a custom access_log log_format [puppet] - 10https://gerrit.wikimedia.org/r/521249 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [13:14:48] (03PS5) 10Vgutierrez: ncredir: Use a custom access_log log_format [puppet] - 10https://gerrit.wikimedia.org/r/521249 (https://phabricator.wikimedia.org/T133548) [13:18:30] (03PS28) 10Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 [13:20:31] (03PS1) 10Marostegui: mariadb: Enable performance_schema on parsercache/misc [puppet] - 10https://gerrit.wikimedia.org/r/521266 [13:22:48] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_upload site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:23:52] PROBLEM - HTTP availability for Varnish at esams on 
icinga1001 is CRITICAL: job=varnish-upload site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:24:40] (03PS2) 10Marostegui: mariadb: Enable performance_schema on parsercache/misc [puppet] - 10https://gerrit.wikimedia.org/r/521266 [13:26:07] (03CR) 10Marostegui: "I am fine with this, as long as we can override it" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521226 (owner: 10Jcrespo) [13:26:08] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [13:26:14] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:26:39] mhh looks like 503s for upload indeed [13:26:48] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [13:27:01] a spike though, already passed [13:27:12] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:27:45] it looks like cp3038 has the most [13:28:05] (03CR) 10Marostegui: "Puppet looks good: https://puppet-compiler.wmflabs.org/compiler1002/17251/" [puppet] - 10https://gerrit.wikimedia.org/r/521266 (owner: 10Marostegui) [13:28:14] and gone [13:30:17] !log bounce prometheus@k8s on prometheus1003 [13:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:04] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:33:22] RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [13:34:32] (03PS1) 10Elukey: profile::base: exclude fuse.fuse_dfs from disk space checks [puppet] - 10https://gerrit.wikimedia.org/r/521272 (https://phabricator.wikimedia.org/T226698) [13:35:00] 10Operations, 10Wikimedia-production-error: Labtestwiki returns 503 error - https://phabricator.wikimedia.org/T227476 (10Urbanecm) p:05Unbreak!→03Triage Probably not UBN!. I've tested this locally on a random application server according to https://wikitech.wikimedia.org/wiki/Debugging_in_production: ` [u... 
[13:36:45] (03PS1) 10Marostegui: mariadb: Promote db2069 to x1 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/521273 [13:37:58] PROBLEM - Prometheus prometheus1003/k8s restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9906 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s [13:39:46] (03PS1) 10Filippo Giunchedi: prometheus: alert when k8s cache isn't updating [puppet] - 10https://gerrit.wikimedia.org/r/521275 (https://phabricator.wikimedia.org/T227478) [13:40:31] (03CR) 10jerkins-bot: [V: 04-1] prometheus: alert when k8s cache isn't updating [puppet] - 10https://gerrit.wikimedia.org/r/521275 (https://phabricator.wikimedia.org/T227478) (owner: 10Filippo Giunchedi) [13:41:57] (03PS1) 10Elukey: Add analytics keytab for an-tool1006 [labs/private] - 10https://gerrit.wikimedia.org/r/521277 [13:42:13] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add analytics keytab for an-tool1006 [labs/private] - 10https://gerrit.wikimedia.org/r/521277 (owner: 10Elukey) [13:42:15] (03PS2) 10Filippo Giunchedi: prometheus: alert when k8s cache isn't updating [puppet] - 10https://gerrit.wikimedia.org/r/521275 (https://phabricator.wikimedia.org/T227478) [13:42:31] (03PS1) 10Andrew Bogott: bootstrap-vz: configure base image to use sssd for buster and stretch [puppet] - 10https://gerrit.wikimedia.org/r/521278 (https://phabricator.wikimedia.org/T227475) [13:42:53] jynus: re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/519203/14/modules/profile/files/prometheus/mysqld_exporter_config.py#209 in that PS the parser is never asked to parse config_path [13:43:28] (03CR) 10Elukey: "This works: https://puppet-compiler.wmflabs.org/compiler1002/17253/an-tool1006.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/521272 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [13:44:23] (03CR) 10Elukey: 
[C: 03+2] "Of course base::monitoring::host is overridden by profile::base, I didn't understand Effie's pcc correctly (since it was showing no-ops)." [puppet] - 10https://gerrit.wikimedia.org/r/520989 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [13:45:52] 10Operations, 10Analytics, 10hardware-requests, 10User-Elukey: eqiad: 1 misc node for the Kerberos KDC service - https://phabricator.wikimedia.org/T227288 (10Ottomata) +1 for 1 eqiad and 1 codfw [13:47:29] 10Operations, 10Page-Previews, 10RESTBase-API, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), 10Core Platform Team Kanban (Team 2): Page summary endpoint in RESTBase not updated since about June 27 - https://phabricator.wikimedia.org/T226983 (10WDoranWMF) [13:48:49] !log ppchelko@deploy1001 Started deploy [restbase/deploy@8e81e98]: Release 1.0, expose talk endpoints T225733, suggestions endpoints T224754, fix summary purging T226983 [13:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:56] T226983: Page summary endpoint in RESTBase not updated since about June 27 - https://phabricator.wikimedia.org/T226983 [13:48:57] T225733: Expose new talk endpoint via RESTBase - https://phabricator.wikimedia.org/T225733 [13:48:57] T224754: Deploy new recommendation-api endpoints for Suggested Edits in RESTBase - https://phabricator.wikimedia.org/T224754 [13:50:59] (03PS21) 10Daimona Eaytoy: Move all AbuseFilter config to abusefilter.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477063 (https://phabricator.wikimedia.org/T145931) [13:51:35] !log running "apt-get --allow-releaseinfo-update" on all buster hosts which were installed prior to the final buster release [13:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:28] !log import AMD ROCm's Debian repo key (9386B48A1A693C5C) manually on install1002 - T224723 [13:52:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:33] T224723: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 [13:52:41] (03PS6) 10Elukey: aptrepo: add thirdparty/amd-rocm [puppet] - 10https://gerrit.wikimedia.org/r/520848 (https://phabricator.wikimedia.org/T224723) [13:53:23] (03CR) 10Elukey: [C: 03+2] aptrepo: add thirdparty/amd-rocm [puppet] - 10https://gerrit.wikimedia.org/r/520848 (https://phabricator.wikimedia.org/T224723) (owner: 10Elukey) [13:53:28] !log reprepro --delete clearvanished on install1002 to cleanup trusty [13:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:33] akosiaris: https://gerrit.wikimedia.org/r/c/operations/puppet/+/521275 when you get a chance [13:56:42] (03PS3) 10Filippo Giunchedi: prometheus: alert when k8s cache isn't updating [puppet] - 10https://gerrit.wikimedia.org/r/521275 (https://phabricator.wikimedia.org/T227478) [13:58:53] (03PS1) 10Urbanecm: [test] there should be no unused images in /static/images/project-logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521281 (https://phabricator.wikimedia.org/T227419) [13:59:48] (03CR) 10jerkins-bot: [V: 04-1] [test] there should be no unused images in /static/images/project-logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521281 (https://phabricator.wikimedia.org/T227419) (owner: 10Urbanecm) [14:01:02] (03PS2) 10Filippo Giunchedi: hieradata: enable centrallog1001 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/521245 (https://phabricator.wikimedia.org/T200706) [14:01:15] RECOVERY - Prometheus prometheus1003/k8s restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s [14:01:25] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: enable centrallog1001 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/521245 (https://phabricator.wikimedia.org/T200706) (owner: 10Filippo Giunchedi) [14:01:50] 10Operations, 10Analytics, 10Patch-For-Review, 10User-Elukey: Import AMD rocm packages in wikimedia-buster - https://phabricator.wikimedia.org/T224723 (10elukey) ` root@install1002:~# reprepro --noskipold --component thirdparty/amd-rocm checkupdate buster-wikimedia Calculating packages to get... ` I am pr... [14:01:52] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article had an unexpected value for header etag: W/[object Object]/ee8664c0-a188-11e9-a093-ff56e13f2529 https://wikitech.wikimedia.org/wiki/RESTBase [14:01:54] (03PS2) 10Urbanecm: [test] there should be no unused images in /static/images/project-logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521281 (https://phabricator.wikimedia.org/T227419) [14:02:20] (03CR) 10Elukey: [C: 03+1] apache::mod_conf: Remove support for Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/520778 (owner: 10Muehlenhoff) [14:02:30] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article had an unexpected value for header etag: W/[object Object]/05a54720-a189-11e9-8f5f-24b90af79dfb https://wikitech.wikimedia.org/wiki/RESTBase [14:02:48] (03CR) 10jerkins-bot: [V: 04-1] [test] there should be no unused images in /static/images/project-logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521281 
(https://phabricator.wikimedia.org/T227419) (owner: 10Urbanecm) [14:02:50] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article had an unexpected value for header etag: W/[object Object]/114ad860-a189-11e9-b017-56cbda13e6dc https://wikitech.wikimedia.org/wiki/RESTBase [14:02:50] (03CR) 10Alexandros Kosiaris: [C: 03+1] prometheus: alert when k8s cache isn't updating [puppet] - 10https://gerrit.wikimedia.org/r/521275 (https://phabricator.wikimedia.org/T227478) (owner: 10Filippo Giunchedi) [14:02:52] working on it [14:03:15] !log eevans@deploy1001 scap-helm sessionstore upgrade staging -f sessionstore-staging-values.yaml stable/kask [namespace: sessionstore, clusters: staging] [14:03:16] !log eevans@deploy1001 scap-helm sessionstore cluster staging completed [14:03:16] !log eevans@deploy1001 scap-helm sessionstore finished [14:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:38] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article had an unexpected value for header etag: W/[object Object]/2ddf4830-a189-11e9-84dd-bdfc4d15a08c https://wikitech.wikimedia.org/wiki/RESTBase [14:03:44] (03PS3) 10Urbanecm: [test] there should be no unused images in /static/images/project-logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521281 (https://phabricator.wikimedia.org/T227419) [14:03:51] Pchelolo: do you need any help or is it under control? 
(just seen the alert) [14:03:55] (03CR) 10Jcrespo: [C: 04-1] mariadb: Enable performance_schema on parsercache/misc (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/521266 (owner: 10Marostegui) [14:04:08] elukey: it's a new endpoint not used by anything, so it's fine [14:04:24] I'm not sure why it is alerting though.. [14:04:35] (03CR) 10jerkins-bot: [V: 04-1] [test] there should be no unused images in /static/images/project-logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521281 (https://phabricator.wikimedia.org/T227419) (owner: 10Urbanecm) [14:05:00] Pchelolo: super [14:05:01] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@8e81e98]: Release 1.0, expose talk endpoints T225733, suggestions endpoints T224754, fix summary purging T226983 (duration: 16m 11s) [14:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:11] T226983: Page summary endpoint in RESTBase not updated since about June 27 - https://phabricator.wikimedia.org/T226983 [14:05:11] T225733: Expose new talk endpoint via RESTBase - https://phabricator.wikimedia.org/T225733 [14:05:11] T224754: Deploy new recommendation-api endpoints for Suggested Edits in RESTBase - https://phabricator.wikimedia.org/T224754 [14:05:20] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article had an unexpected value for header etag: W/[object Object]/6b2a5a40-a189-11e9-a267-04c7abe501b5 https://wikitech.wikimedia.org/wiki/RESTBase [14:05:24] Pchelolo: I think because service-checker will check every swagger endpoint [14:05:58] (03CR) 10Jcrespo: "> Patch Set 3:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521226 (owner: 10Jcrespo) [14:06:01] godog: yeee. 
the funny thing is that locally on the RB nodes 'all endpoints are healthy' [14:06:15] lol [14:08:54] (03PS4) 10Urbanecm: [test] there should be no unused images in /static/images/project-logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521281 (https://phabricator.wikimedia.org/T227419) [14:09:26] (03CR) 10Marostegui: "I am not sure I understand what you mean, what's correct and what isn't?" [puppet] - 10https://gerrit.wikimedia.org/r/521266 (owner: 10Marostegui) [14:09:48] (03CR) 10jerkins-bot: [V: 04-1] [test] there should be no unused images in /static/images/project-logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521281 (https://phabricator.wikimedia.org/T227419) (owner: 10Urbanecm) [14:10:11] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: alert when k8s cache isn't updating [puppet] - 10https://gerrit.wikimedia.org/r/521275 (https://phabricator.wikimedia.org/T227478) (owner: 10Filippo Giunchedi) [14:10:18] (03PS4) 10Filippo Giunchedi: prometheus: alert when k8s cache isn't updating [puppet] - 10https://gerrit.wikimedia.org/r/521275 (https://phabricator.wikimedia.org/T227478) [14:11:31] (03CR) 10Marostegui: "> > Patch Set 3:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521226 (owner: 10Jcrespo) [14:12:13] (03PS1) 10Urbanecm: Remove unused logos from /static/images/project-logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521282 (https://phabricator.wikimedia.org/T227419) [14:12:28] (03PS5) 10Urbanecm: [test] there should be no unused images in /static/images/project-logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521281 (https://phabricator.wikimedia.org/T227419) [14:13:25] (03CR) 10jerkins-bot: [V: 04-1] [test] there should be no unused images in /static/images/project-logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521281 (https://phabricator.wikimedia.org/T227419) (owner: 10Urbanecm) [14:13:32] (03PS1) 10Muehlenhoff: Switch pool counters for Thumbor in codfw to poolcounter2003 
[puppet] - 10https://gerrit.wikimedia.org/r/521283 (https://phabricator.wikimedia.org/T224572) [14:15:40] (03PS2) 10Urbanecm: Remove unused logos from /static/images/project-logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521282 (https://phabricator.wikimedia.org/T227419) [14:15:55] (03PS6) 10Urbanecm: [test] there should be no unused images in /static/images/project-logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521281 (https://phabricator.wikimedia.org/T227419) [14:17:21] (03CR) 10jerkins-bot: [V: 04-1] [test] there should be no unused images in /static/images/project-logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521281 (https://phabricator.wikimedia.org/T227419) (owner: 10Urbanecm) [14:17:42] (03PS3) 10Urbanecm: Remove unused logos from /static/images/project-logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521282 (https://phabricator.wikimedia.org/T227419) [14:17:54] (03PS7) 10Urbanecm: [test] there should be no unused images in /static/images/project-logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521281 (https://phabricator.wikimedia.org/T227419) [14:18:25] 10Operations, 10Discovery: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298 (10Papaul) a:03Papaul [14:18:47] (03CR) 10jerkins-bot: [V: 04-1] [test] there should be no unused images in /static/images/project-logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521281 (https://phabricator.wikimedia.org/T227419) (owner: 10Urbanecm) [14:21:38] !log shutting down elastic2054 for troubleshooting [14:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:16] PROBLEM - puppet last run on prometheus1004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
[14:26:59] (03CR) 10Jcrespo: "> Patch Set 3:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521226 (owner: 10Jcrespo) [14:32:05] !log ppchelko@deploy1001 Started deploy [restbase/deploy@9a99b17]: Loosen etag regex for talk endpoint and fix alert [14:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:41] 10Operations, 10Discovery: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298 (10Papaul) Multi-bit memory errors detected on a memory device at location(s) DIMM_B2. [14:33:45] 10Operations, 10decommission: Decommission analytics10[28-41] - https://phabricator.wikimedia.org/T227485 (10elukey) p:05Triage→03Normal [14:34:32] 10Operations, 10decommission: Decommission analytics10[28-41] - https://phabricator.wikimedia.org/T227485 (10elukey) [14:34:53] 10Operations, 10decommission: Decommission analytics10[28-41] - https://phabricator.wikimedia.org/T227485 (10elukey) 05Open→03Stalled Not actionable yet. [14:35:16] 10Operations, 10MediaWiki-Cache, 10Performance-Team (Radar), 10User-Elukey: Deprecate the usage of nutcracker for memcached - https://phabricator.wikimedia.org/T214275 (10Andrew) >>! In T214275#5307951, @elukey wrote: > > The latter should be doable, but the former seems a bit more complicated. Is there a... [14:36:01] (03PS3) 10Jcrespo: mariadb: Enable performance_schema on parsercache/misc [puppet] - 10https://gerrit.wikimedia.org/r/521266 (owner: 10Marostegui) [14:36:18] PROBLEM - puppet last run on prometheus1003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
[14:36:22] (03PS1) 10Filippo Giunchedi: prometheus: fix dashboard_links in k8s cache alert [puppet] - 10https://gerrit.wikimedia.org/r/521287 [14:37:34] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:38:07] 10Operations, 10Discovery: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298 (10Papaul) {F29708883} [14:39:24] (03CR) 10Effie Mouzeli: "We need to break this patch in smaller ones, check T226675 for details" [puppet] - 10https://gerrit.wikimedia.org/r/514226 (https://phabricator.wikimedia.org/T226675) (owner: 10Ppchelko) [14:39:33] (03CR) 10Marostegui: "Compiler looks good: https://puppet-compiler.wmflabs.org/compiler1002/17255/" [puppet] - 10https://gerrit.wikimedia.org/r/521266 (owner: 10Marostegui) [14:39:35] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: fix dashboard_links in k8s cache alert [puppet] - 10https://gerrit.wikimedia.org/r/521287 (owner: 10Filippo Giunchedi) [14:40:16] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:40:36] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:40:44] (03CR) 10Marostegui: [C: 03+2] mariadb: Enable performance_schema on parsercache/misc [puppet] - 10https://gerrit.wikimedia.org/r/521266 (owner: 10Marostegui) [14:40:52] (03PS4) 10Marostegui: mariadb: Enable performance_schema on parsercache/misc [puppet] - 10https://gerrit.wikimedia.org/r/521266 [14:40:54] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:42:56] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:43:22] PROBLEM - puppet last run on prometheus2004 is CRITICAL: 
CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [14:43:29] !log decommissioning restbase1017-c -- T222960 [14:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:34] T222960: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 [14:44:59] !log Restart MySQL on db1132 to enable performance_schema - T226952 [14:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:04] T226952: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 [14:46:20] RECOVERY - Host elastic2054 is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms [14:48:12] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@9a99b17]: Loosen etag regex for talk endpoint and fix alert (duration: 16m 07s) [14:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:17] (03PS2) 10BryanDavis: Add dnsupdate, rd, recursion, security, and udp metrics [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/521042 (https://phabricator.wikimedia.org/T227411) [14:48:20] 10Operations, 10Page-Previews, 10RESTBase-API, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), 10Core Platform Team Kanban (Team 2): Page summary endpoint in RESTBase not updated since about June 27 - https://phabricator.wikimedia.org/T226983 (10Pchelolo) 05Open→03Resol... 
[14:49:42] (03PS1) 10Marostegui: db-codfw.php: Promote db2069 as x1 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521291 [14:51:09] RECOVERY - puppet last run on prometheus1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:51:46] (03CR) 10BryanDavis: Add dnsupdate, rd, recursion, security, and udp metrics (032 comments) [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/521042 (https://phabricator.wikimedia.org/T227411) (owner: 10BryanDavis) [14:52:54] 10Operations, 10Discovery: elastic2054 unresponsive - https://phabricator.wikimedia.org/T227298 (10Papaul) I swapped B2 with A2, no more error. leaving this task open for a week. If we do have the same problem on A2, I will request a replacement. {F29709003} [14:55:44] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Performance-Team (Radar), 10User-Elukey: Deprecate the usage of nutcracker for memcached - https://phabricator.wikimedia.org/T214275 (10jijiki) @elukey I can do thumbor, not sure when yet. 
[14:56:26] (03PS1) 10Ottomata: Migrate page-* events to eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521293 (https://phabricator.wikimedia.org/T211248) [14:57:17] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Performance-Team (Radar), 10User-Elukey: Deprecate the usage of nutcracker for memcached - https://phabricator.wikimedia.org/T214275 (10jijiki) [14:57:20] 10Operations, 10Thumbor, 10serviceops: Replace nutcracker with mcrouter on thumbor* - https://phabricator.wikimedia.org/T221081 (10jijiki) [14:57:36] !log Failover x1 codfw from db2045 to db2069 [14:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:26] (03CR) 10Ppchelko: Migrate page-* events to eventgate-main (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521293 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [15:00:12] (03PS2) 10Ottomata: Migrate page-* events to eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521293 (https://phabricator.wikimedia.org/T211248) [15:00:38] (03CR) 10Ottomata: "done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521293 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [15:00:50] (03CR) 10Ppchelko: [C: 03+1] Migrate page-* events to eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521293 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [15:02:26] RECOVERY - puppet last run on prometheus1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:04:02] !log eevans@deploy1001 scap-helm sessionstore upgrade staging -f sessionstore-staging-values.yaml stable/kask [namespace: sessionstore, clusters: staging] [15:04:03] !log eevans@deploy1001 scap-helm sessionstore cluster staging completed [15:04:03] !log eevans@deploy1001 scap-helm sessionstore finished [15:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:59] (03CR) 10Thcipriani: [C: 03+1] contint1001: point Docker data to a different partition [puppet] - 10https://gerrit.wikimedia.org/r/520738 (https://phabricator.wikimedia.org/T207707) (owner: 10Hashar) [15:07:36] !log eevans@deploy1001 scap-helm sessionstore upgrade staging -f sessionstore-staging-values.yaml stable/kask [namespace: sessionstore, clusters: staging] [15:07:37] !log eevans@deploy1001 scap-helm sessionstore cluster staging completed [15:07:37] !log eevans@deploy1001 scap-helm sessionstore finished [15:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:11] (03PS2) 10Marostegui: mariadb: Promote db2069 to x1 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/521273 [15:09:23] PROBLEM - Prometheus k8s cache not updating on prometheus2003 is CRITICAL: instance=127.0.0.1:9906 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2003&var-datasource=codfw+prometheus/ops [15:10:20] (03PS1) 10Jbond: wmflib: add new dirtree function. [puppet] - 10https://gerrit.wikimedia.org/r/521295 [15:10:54] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2069 to x1 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/521273 (owner: 10Marostegui) [15:11:01] (03CR) 10jerkins-bot: [V: 04-1] wmflib: add new dirtree function. 
[puppet] - 10https://gerrit.wikimedia.org/r/521295 (owner: 10Jbond) [15:12:20] !log jiji@deploy1001 Started deploy [cpjobqueue/deploy@7379e91]: Migrating refreshLinks to PHP7 - T219150 [15:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:30] T219150: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 [15:13:46] !log jiji@deploy1001 Finished deploy [cpjobqueue/deploy@7379e91]: Migrating refreshLinks to PHP7 - T219150 (duration: 01m 26s) [15:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:51] RECOVERY - puppet last run on prometheus2004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:15:32] !log shutting down db2097 T225378 T216240 [15:15:37] (03CR) 10Marostegui: [C: 03+2] db-codfw.php: Promote db2069 as x1 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521291 (owner: 10Marostegui) [15:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:38] T216240: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 [15:15:39] T225378: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 [15:15:58] 10Operations, 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10jijiki) [15:16:51] (03Merged) 10jenkins-bot: db-codfw.php: Promote db2069 as x1 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521291 (owner: 10Marostegui) [15:17:06] (03CR) 10jenkins-bot: db-codfw.php: Promote db2069 as x1 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521291 (owner: 10Marostegui) [15:18:25] (03PS2) 10Jbond: wmflib: add new dirtree function. 
[puppet] - 10https://gerrit.wikimedia.org/r/521295 [15:19:53] (03PS1) 10Cparle: Configure help urls for MediaInfo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521298 (https://phabricator.wikimedia.org/T227226) [15:21:49] (03CR) 10Alaa Sarhan: [C: 03+1] statistics: Add wdqs host to wmde statistcs configuration [puppet] - 10https://gerrit.wikimedia.org/r/520901 (https://phabricator.wikimedia.org/T218710) (owner: 10Ladsgroup) [15:21:54] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Promote db2069 as x1 codfw master (duration: 00m 50s) [15:21:58] (03PS3) 10Jbond: wmflib: add new dirtree function. [puppet] - 10https://gerrit.wikimedia.org/r/521295 [15:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:49] (03PS1) 10Jcrespo: Fix switchover.py when master is replicating and uses GTID [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521300 [15:34:14] (03CR) 10Jcrespo: [C: 03+2] Fix switchover.py when master is replicating and uses GTID [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521300 (owner: 10Jcrespo) [15:34:28] 10Operations, 10Analytics, 10Analytics-Kanban, 10Cleanup, 10Patch-For-Review: Archive zookeeper puppet submodule - https://phabricator.wikimedia.org/T227164 (10Milimetric) p:05Triage→03High [15:36:14] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/521272 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [15:39:12] 10Operations, 10ops-codfw, 10DBA: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 (10Papaul) 05Open→03Resolved Before {F29709098} After {F29709100} This is complete return tracking information {F29709113} [15:39:37] 10Operations, 10ops-codfw, 10DBA: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 (10jcrespo) 05Resolved→03Open a:05Papaul→03jcrespo ` Mem: 
515690 ` HW seems to be fixed, owning for the followup (software) st... [15:40:03] (03Merged) 10jenkins-bot: Fix switchover.py when master is replicating and uses GTID [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521300 (owner: 10Jcrespo) [15:40:15] 10Operations, 10DBA: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 (10jcrespo) [15:41:50] PROBLEM - Prometheus k8s cache not updating on prometheus2004 is CRITICAL: instance=127.0.0.1:9906 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2004&var-datasource=codfw+prometheus/ops [15:42:33] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Papaul) @Marostegui Thanks [15:45:04] !log Failover db2069 to db2045 on x1 codfw [15:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:09] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Release-Engineering-Team-TODO, and 3 others: Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (10Jdforrester-WMF) [15:45:12] 10Operations, 10Epic, 10Maps (Kartotherian): Move Kartotherian and Tilerator to Kubernetes - https://phabricator.wikimedia.org/T216826 (10Jdforrester-WMF) [15:45:30] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2039 - https://phabricator.wikimedia.org/T225988 (10Papaul) [15:46:14] (03PS1) 10Marostegui: Revert "db-codfw.php: Promote db2069 as x1 codfw master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521305 [15:46:27] (03PS1) 10Marostegui: Revert "mariadb: Promote db2069 to x1 codfw master" [puppet] - 10https://gerrit.wikimedia.org/r/521306 [15:50:26] 10Operations, 10Dumps-Generation, 10SDC General, 10Wikidata: Capacity planning for Commons 
Structured Data - https://phabricator.wikimedia.org/T226093 (ArielGlenn) I've commented about this over on the other ticket. Let's see what they say. [15:50:55] (CR) Marostegui: [C: +2] Revert "mariadb: Promote db2069 to x1 codfw master" [puppet] - https://gerrit.wikimedia.org/r/521306 (owner: Marostegui) [15:51:19] (CR) Marostegui: [C: +2] Revert "db-codfw.php: Promote db2069 as x1 codfw master" [mediawiki-config] - https://gerrit.wikimedia.org/r/521305 (owner: Marostegui) [15:53:22] (PS1) SBassett: Temporary make account creation limits more restrictive [mediawiki-config] - https://gerrit.wikimedia.org/r/521307 (https://phabricator.wikimedia.org/T227416) [15:53:43] (Merged) jenkins-bot: Revert "db-codfw.php: Promote db2069 as x1 codfw master" [mediawiki-config] - https://gerrit.wikimedia.org/r/521305 (owner: Marostegui) [15:53:57] (PS1) Urbanecm: Change liwikinews logo to correct one per community wish [mediawiki-config] - https://gerrit.wikimedia.org/r/521308 (https://phabricator.wikimedia.org/T227418) [15:54:02] (CR) jenkins-bot: Revert "db-codfw.php: Promote db2069 as x1 codfw master" [mediawiki-config] - https://gerrit.wikimedia.org/r/521305 (owner: Marostegui) [15:54:53] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Promote db2045 instead of db2069 as x1 codfw master (duration: 00m 49s) [15:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:40] (CR) jerkins-bot: [V: -1] Change liwikinews logo to correct one per community wish [mediawiki-config] - https://gerrit.wikimedia.org/r/521308 (https://phabricator.wikimedia.org/T227418) (owner: Urbanecm) [15:56:23] Hey all - need to security-deploy https://gerrit.wikimedia.org/r/521307. Anyone still deploying wmf-config stuff right now? 
[15:56:53] not me [15:56:56] I am done [15:57:40] Operations, Analytics, Patch-For-Review, Wikimedia-Incident: Move icinga alarm for the EventStreams external endpoint to SRE - https://phabricator.wikimedia.org/T227065 (Milimetric) p:Normal→High [15:57:54] Operations, Analytics, Analytics-Kanban, Patch-For-Review, Wikimedia-Incident: Move icinga alarm for the EventStreams external endpoint to SRE - https://phabricator.wikimedia.org/T227065 (Milimetric) a:Ottomata [15:58:00] Ok, tx marostegui [15:59:38] !log bounce prometheus@k8s on prometheus200[34] - T227478 [15:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:43] T227478: prometheus@k8s on prometheus1003 stopped updating deployments / metrics - https://phabricator.wikimedia.org/T227478 [16:00:48] (CR) SBassett: [C: +2] Temporary make account creation limits more restrictive [mediawiki-config] - https://gerrit.wikimedia.org/r/521307 (https://phabricator.wikimedia.org/T227416) (owner: SBassett) [16:01:40] (Merged) jenkins-bot: Temporary make account creation limits more restrictive [mediawiki-config] - https://gerrit.wikimedia.org/r/521307 (https://phabricator.wikimedia.org/T227416) (owner: SBassett) [16:01:54] (CR) jenkins-bot: Temporary make account creation limits more restrictive [mediawiki-config] - https://gerrit.wikimedia.org/r/521307 (https://phabricator.wikimedia.org/T227416) (owner: SBassett) [16:05:38] !log sbassett@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Temporary make account creation limits more restrictive - part III (duration: 00m 50s) [16:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:56] PROBLEM - Prometheus prometheus2004/k8s restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1:9906 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted 
https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s [16:07:00] PROBLEM - Prometheus prometheus2003/k8s restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1:9906 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s [16:08:14] (PS2) Urbanecm: Change liwikinews logo to correct one per community wish [mediawiki-config] - https://gerrit.wikimedia.org/r/521308 (https://phabricator.wikimedia.org/T227418) [16:09:16] RECOVERY - Prometheus k8s cache not updating on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2004&var-datasource=codfw+prometheus/ops [16:09:18] (CR) jerkins-bot: [V: -1] Change liwikinews logo to correct one per community wish [mediawiki-config] - https://gerrit.wikimedia.org/r/521308 (https://phabricator.wikimedia.org/T227418) (owner: Urbanecm) [16:11:02] RECOVERY - Prometheus k8s cache not updating on prometheus2003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus2003&var-datasource=codfw+prometheus/ops [16:12:22] (PS3) Urbanecm: Change liwikinews logo to correct one per community wish [mediawiki-config] - https://gerrit.wikimedia.org/r/521308 (https://phabricator.wikimedia.org/T227418) [16:12:51] jouncebot, next [16:12:51] In 0 hour(s) and 47 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190708T1700) [16:13:56] (PS1) CRusnov: netbox: Add configuration and timers for csv dumps [puppet] - https://gerrit.wikimedia.org/r/521313 [16:14:45] (CR) jerkins-bot: [V: -1] netbox: Add configuration and timers for csv dumps [puppet] - https://gerrit.wikimedia.org/r/521313 (owner: CRusnov) [16:15:11] (PS8) CRusnov: Add new dumpbackup.py script [software/netbox-deploy] - https://gerrit.wikimedia.org/r/518166 (https://phabricator.wikimedia.org/T223292) [16:20:11] (PS2) CRusnov: netbox: Add configuration and timers for csv dumps [puppet] - https://gerrit.wikimedia.org/r/521313 [16:22:21] Operations, Phabricator, Release-Engineering-Team-TODO, Release-Engineering-Team (Development services): Phabricator: Make sure phabricator works properly including our puppet roles on jessie - https://phabricator.wikimedia.org/T158434 (mmodell) Open→Resolved [16:22:28] (CR) Marostegui: "Let's merge tomorrow after m2 failover?" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/521232 (owner: Jcrespo) [16:30:12] RECOVERY - Prometheus prometheus2003/k8s restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s [16:30:34] RECOVERY - Prometheus prometheus2004/k8s restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s [16:31:23] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for DLynch - https://phabricator.wikimedia.org/T227200 (marcella) I am David's manager and I confirm his business need for this access. Thank you! [16:33:40] Operations, Diffusion, Packaging, Patch-For-Review, and 4 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (mmodell) Open→Resolved [16:33:47] (CR) Elukey: "I am +100 but we need to create documentation for the alarm first, otherwise it will only add noise :(" [puppet] - https://gerrit.wikimedia.org/r/520475 (https://phabricator.wikimedia.org/T227065) (owner: Herron) [16:36:09] !log eevans@deploy1001 scap-helm sessionstore upgrade staging -f sessionstore-staging-values.yaml stable/kask [namespace: sessionstore, clusters: staging] [16:36:10] !log eevans@deploy1001 scap-helm sessionstore cluster staging completed [16:36:10] !log eevans@deploy1001 scap-helm sessionstore finished [16:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:12] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for DLynch - https://phabricator.wikimedia.org/T227200 (Nuria) @DLynch For the record: are you a 
permanent employee of the foundations and thus have a NDA on file? have you read? https://office.wikimedia.org/wiki/Data_acc... [16:38:17] !log eevans@deploy1001 scap-helm sessionstore upgrade staging -f sessionstore-staging-values.yaml stable/kask [namespace: sessionstore, clusters: staging] [16:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:52] (PS3) Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) [16:38:54] !log eevans@deploy1001 scap-helm sessionstore upgrade staging -f sessionstore-staging-values.yaml stable/kask [namespace: sessionstore, clusters: staging] [16:38:55] !log eevans@deploy1001 scap-helm sessionstore cluster staging completed [16:38:55] !log eevans@deploy1001 scap-helm sessionstore finished [16:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:06] (CR) jerkins-bot: [V: -1] Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) (owner: Urbanecm) [16:40:23] (PS4) Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) [16:40:43] !log eevans@deploy1001 scap-helm sessionstore upgrade staging -f sessionstore-staging-values.yaml stable/kask [namespace: sessionstore, clusters: staging] [16:40:44] !log eevans@deploy1001 scap-helm sessionstore cluster staging completed [16:40:44] !log eevans@deploy1001 scap-helm sessionstore finished [16:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:50] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:18] (CR) jerkins-bot: [V: -1] Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) (owner: Urbanecm) [16:41:22] (PS1) Hashar: zuul: actually write stack to stack_dump.log [puppet] - https://gerrit.wikimedia.org/r/521315 [16:42:24] (CR) Hashar: [C: +1] "Of course I screwed it up and missed updating the destination file: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/521315/" [puppet] - https://gerrit.wikimedia.org/r/505253 (owner: Hashar) [16:43:05] (CR) Hashar: "Follow up on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/505253/" [puppet] - https://gerrit.wikimedia.org/r/521315 (owner: Hashar) [16:45:28] (PS5) Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) [16:46:19] (CR) jerkins-bot: [V: -1] Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) (owner: Urbanecm) [16:53:29] (PS1) Urbanecm: Fix several incorrect logo sizes [mediawiki-config] - https://gerrit.wikimedia.org/r/521316 (https://phabricator.wikimedia.org/T211413) [16:54:09] (PS6) Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) [16:55:19] (CR) jerkins-bot: [V: -1] Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) (owner: Urbanecm) [16:57:39] (PS2) Urbanecm: Fix 
several incorrect logo sizes [mediawiki-config] - https://gerrit.wikimedia.org/r/521316 (https://phabricator.wikimedia.org/T211413) [16:57:45] (PS7) Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) [16:58:40] (CR) jerkins-bot: [V: -1] Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) (owner: Urbanecm) [17:00:05] gehel and onimisionipe: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190708T1700). [17:00:20] jouncebot: o/ cc:SMalyshev [17:02:40] (PS8) Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) [17:03:41] (CR) jerkins-bot: [V: -1] Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) (owner: Urbanecm) [17:04:03] love the weekly deploy deploy [17:04:51] (PS9) Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) [17:05:45] (CR) jerkins-bot: [V: -1] Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) (owner: Urbanecm) [17:06:28] (PS4) Ppchelko: RESTRouter: Add initial Helm chart [deployment-charts] - https://gerrit.wikimedia.org/r/512923 (https://phabricator.wikimedia.org/T223953) (owner: Mobrovac) [17:07:24] (CR) Jforrester: "Hmm. 
What about when we have temporary celebration images? We don't want to delete the permanent image, but we won't be pointing at it…" [mediawiki-config] - https://gerrit.wikimedia.org/r/521281 (https://phabricator.wikimedia.org/T227419) (owner: Urbanecm) [17:08:03] !log gehel@deploy1001 Started deploy [wdqs/wdqs@4b7cdf5]: new blazegraph and updater version [17:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:18] (PS5) Ppchelko: RESTRouter: Add initial Helm chart [deployment-charts] - https://gerrit.wikimedia.org/r/512923 (https://phabricator.wikimedia.org/T223953) (owner: Mobrovac) [17:09:11] (CR) Ppchelko: RESTRouter: Add initial Helm chart (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/512923 (https://phabricator.wikimedia.org/T223953) (owner: Mobrovac) [17:09:19] (PS10) Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) [17:10:18] (CR) Urbanecm: "> Patch Set 7:" [mediawiki-config] - https://gerrit.wikimedia.org/r/521281 (https://phabricator.wikimedia.org/T227419) (owner: Urbanecm) [17:10:38] (CR) jerkins-bot: [V: -1] Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) (owner: Urbanecm) [17:10:58] Operations, Deployments, Release: OSError: [Errno 1] Operation not permitted when running git fat pull - https://phabricator.wikimedia.org/T208259 (Gehel) I ran into this issue again when deploying WDQS today. Some of the binaries were owned by the previous deployer. My workaround was to reset owners... [17:12:58] Operations, Deployments, Release: OSError: [Errno 1] Operation not permitted when running git fat pull - https://phabricator.wikimedia.org/T208259 (Smalyshev) Do we need to do git fat pull on deploy machine? 
I never did it, I thought it happens on the target machine automagically. [17:14:22] (PS11) Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) [17:14:30] (CR) Smalyshev: "> Patch Set 2: Code-Review+1" [mediawiki-config] - https://gerrit.wikimedia.org/r/520078 (https://phabricator.wikimedia.org/T221916) (owner: Smalyshev) [17:15:36] (CR) jerkins-bot: [V: -1] Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) (owner: Urbanecm) [17:15:38] Operations, Deployments, Release: OSError: [Errno 1] Operation not permitted when running git fat pull - https://phabricator.wikimedia.org/T208259 (Gehel) >>! In T208259#5314368, @Smalyshev wrote: > Do we need to do git fat pull on deploy machine? I never did it, I thought it happens on the target ma... [17:16:39] Operations, Deployments, Release: OSError: [Errno 1] Operation not permitted when running git fat pull - https://phabricator.wikimedia.org/T208259 (Smalyshev) Not sure. Checking binaries is a good idea (though wdq9 test deploy should mostly take care of that, but who knows). Probably fixing permissio... [17:20:50] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@4b7cdf5]: new blazegraph and updater version (duration: 12m 47s) [17:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:35] SMalyshev: deploy completed, tests are green [17:21:50] gehel: great, thanks... 
[17:34:03] (CR) Jforrester: "> Patch Set 7:" [mediawiki-config] - https://gerrit.wikimedia.org/r/521281 (https://phabricator.wikimedia.org/T227419) (owner: Urbanecm) [17:34:10] (PS3) Urbanecm: Fix several incorrect logo sizes [mediawiki-config] - https://gerrit.wikimedia.org/r/521316 (https://phabricator.wikimedia.org/T211413) [17:35:48] (PS12) Urbanecm: Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) [17:36:44] (CR) jerkins-bot: [V: -1] Test if 2x logo version is 2 times bigger than 1x logo version [mediawiki-config] - https://gerrit.wikimedia.org/r/521181 (https://phabricator.wikimedia.org/T211413) (owner: Urbanecm) [17:42:09] (PS1) Elukey: aptrepo: add missing update for amd-rocm [puppet] - https://gerrit.wikimedia.org/r/521319 (https://phabricator.wikimedia.org/T224723) [17:53:14] Operations, Deployments, Release: OSError: [Errno 1] Operation not permitted when running git fat pull - https://phabricator.wikimedia.org/T208259 (thcipriani) >>! In T208259#5314362, @Gehel wrote: > I ran into this issue again when deploying WDQS today. Some of the binaries were owned by the previou... [17:58:22] Operations, Machine vision, serviceops, Service-deployment-requests, Services (watching): Internal deployment of open_nsfw-- image scoring service - https://phabricator.wikimedia.org/T225664 (MusikAnimal) >>! In T225664#5262056, @Joe wrote: > Hi! A very quick skim of the upstream project sugg... [18:00:05] MaxSem, RoanKattouw, and Niharika: My dear minions, it's time we take the moon! Just kidding. Time for Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190708T1800). [18:00:05] Smalyshev and Urbanecm: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:12] here [18:00:16] I can SWAT today! [18:00:32] (PS4) Urbanecm: Enable RDF output for MediaInfo [mediawiki-config] - https://gerrit.wikimedia.org/r/520078 (https://phabricator.wikimedia.org/T221916) (owner: Smalyshev) [18:00:38] (CR) Urbanecm: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/520078 (https://phabricator.wikimedia.org/T221916) (owner: Smalyshev) [18:00:53] Urbanecm: great thanks [18:01:37] (Merged) jenkins-bot: Enable RDF output for MediaInfo [mediawiki-config] - https://gerrit.wikimedia.org/r/520078 (https://phabricator.wikimedia.org/T221916) (owner: Smalyshev) [18:01:52] (CR) jenkins-bot: Enable RDF output for MediaInfo [mediawiki-config] - https://gerrit.wikimedia.org/r/520078 (https://phabricator.wikimedia.org/T221916) (owner: Smalyshev) [18:02:05] SMalyshev, your patch is on mwdebug1002, if you want to test there [18:02:49] Urbanecm: seems to work fine [18:02:56] thanks SMalyshev, deploying [18:03:32] (CR) Urbanecm: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/520446 (https://phabricator.wikimedia.org/T227136) (owner: DCausse) [18:04:14] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:520078|Enable RDF output for MediaInfo]] (T221916) (duration: 00m 49s) [18:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:19] T221916: Create RDF export for structured data stored for files - https://phabricator.wikimedia.org/T221916 [18:04:20] SMalyshev, deployed [18:04:27] (Merged) jenkins-bot: [cirrus] Increase elastic master timeout to 5m [mediawiki-config] - https://gerrit.wikimedia.org/r/520446 (https://phabricator.wikimedia.org/T227136) (owner: DCausse) [18:04:44] (CR) jenkins-bot: [cirrus] Increase elastic master timeout to 5m [mediawiki-config] - https://gerrit.wikimedia.org/r/520446 
(https://phabricator.wikimedia.org/T227136) (owner: DCausse) [18:04:56] SMalyshev, is the other patch testable? [18:05:05] Urbanecm: nope not really [18:05:18] the only test is to actually run the reindex and see when it times out [18:05:24] ok, deploying [18:05:31] (it at all, the point of the patch is kinda to avoid timing out) [18:05:40] *if [18:06:01] i see [18:06:02] deploying [18:06:12] (PS5) Urbanecm: Add 'templateeditor' user group and protection level on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/521191 (https://phabricator.wikimedia.org/T227420) (owner: DannyS712) [18:06:23] (CR) Urbanecm: [C: +2] "SWAT" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/521191 (https://phabricator.wikimedia.org/T227420) (owner: DannyS712) [18:06:36] !log urbanecm@deploy1001 Synchronized wmf-config/CirrusSearch-production.php: SWAT: [[:gerrit:520446|[cirrus] Increase elastic master timeout to 5m]] (T227136) (duration: 00m 49s) [18:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:42] T227136: Reindexing search index wikidatawiki for eqiad fails - https://phabricator.wikimedia.org/T227136 [18:06:43] SMalyshev, all deployed [18:06:51] Urbanecm: thank you! 
[18:06:54] yw [18:07:20] (Merged) jenkins-bot: Add 'templateeditor' user group and protection level on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/521191 (https://phabricator.wikimedia.org/T227420) (owner: DannyS712) [18:07:38] (CR) jenkins-bot: Add 'templateeditor' user group and protection level on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/521191 (https://phabricator.wikimedia.org/T227420) (owner: DannyS712) [18:09:48] (PS4) Urbanecm: Change liwikinews logo to correct one per community wish [mediawiki-config] - https://gerrit.wikimedia.org/r/521308 (https://phabricator.wikimedia.org/T227418) [18:09:53] (CR) Urbanecm: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/521308 (https://phabricator.wikimedia.org/T227418) (owner: Urbanecm) [18:10:23] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:521191|Add templateeditor user group and protection level on commons]] (T227420) (duration: 00m 49s) [18:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:29] T227420: Define new protection level 'templateeditor' and the associated right and usergroup on Commons - https://phabricator.wikimedia.org/T227420 [18:11:37] (Merged) jenkins-bot: Change liwikinews logo to correct one per community wish [mediawiki-config] - https://gerrit.wikimedia.org/r/521308 (https://phabricator.wikimedia.org/T227418) (owner: Urbanecm) [18:11:51] (CR) jenkins-bot: Change liwikinews logo to correct one per community wish [mediawiki-config] - https://gerrit.wikimedia.org/r/521308 (https://phabricator.wikimedia.org/T227418) (owner: Urbanecm) [18:13:27] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: SWAT: [[:gerrit:521308|Change liwikinews logo to correct one per community wish]] (1/2, T227418) (duration: 00m 49s) [18:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[18:13:32] T227418: Several projects has an entry in wgLogoHD, but no entry in wgLogo - https://phabricator.wikimedia.org/T227418 [18:14:44] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:521308|Change liwikinews logo to correct one per community wish]] (2/2, T227418) (duration: 00m 49s) [18:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:04] !log Morning SWAT done [18:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:46] (CR) Muehlenhoff: [C: +1] "Ah, yes, we missed that in the original patch." [puppet] - https://gerrit.wikimedia.org/r/521319 (https://phabricator.wikimedia.org/T224723) (owner: Elukey) [18:17:57] (CR) Muehlenhoff: [V: +2 C: +2] Add dnsupdate, rd, recursion, security, and udp metrics [debs/prometheus-pdns-exporter] - https://gerrit.wikimedia.org/r/521042 (https://phabricator.wikimedia.org/T227411) (owner: BryanDavis) [18:18:15] (PS2) Muehlenhoff: Log more info when `pdns_control list` fails [debs/prometheus-pdns-exporter] - https://gerrit.wikimedia.org/r/521043 (owner: BryanDavis) [18:18:37] Operations, ops-eqiad, Cassandra, Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (Eevans) All 3 Cassandra instances are decommissioned; We are ready to begin [18:19:15] (CR) Muehlenhoff: [V: +2 C: +2] Log more info when `pdns_control list` fails [debs/prometheus-pdns-exporter] - https://gerrit.wikimedia.org/r/521043 (owner: BryanDavis) [18:21:42] SMalyshev: do you have a few minutes for me? (and which channel should I join in that case?) 
[18:24:48] (PS1) BBlack: Un-submodule for nginx: move to prod env [1/2] [puppet] - https://gerrit.wikimedia.org/r/521323 (https://phabricator.wikimedia.org/T183454) [18:24:50] (PS1) BBlack: Un-submodule for nginx: rename to orig path [2/2] [puppet] - https://gerrit.wikimedia.org/r/521324 [18:25:35] (CR) jerkins-bot: [V: -1] Un-submodule for nginx: move to prod env [1/2] [puppet] - https://gerrit.wikimedia.org/r/521323 (https://phabricator.wikimedia.org/T183454) (owner: BBlack) [18:33:17] !log installing zeromq3 security updates [18:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:00] Pchelolo: yt? going to do page* eventgate-main [18:37:10] si senior [18:37:21] (PS3) Ottomata: Migrate page-* events to eventgate-main [mediawiki-config] - https://gerrit.wikimedia.org/r/521293 (https://phabricator.wikimedia.org/T211248) [18:37:23] (CR) Dzahn: "ah, i thought you wanted that to end up in debug.log, heh" [puppet] - https://gerrit.wikimedia.org/r/521315 (owner: Hashar) [18:37:56] (never mind my message) [18:38:34] (CR) Ottomata: [C: +2] Migrate page-* events to eventgate-main [mediawiki-config] - https://gerrit.wikimedia.org/r/521293 (https://phabricator.wikimedia.org/T211248) (owner: Ottomata) [18:39:14] Operations, Gerrit, Release-Engineering-Team-TODO, Release-Engineering-Team (Development services): Reimage cobalt and gerrit2001 as buster - https://phabricator.wikimedia.org/T176774 (Paladox) [18:39:34] (Merged) jenkins-bot: Migrate page-* events to eventgate-main [mediawiki-config] - https://gerrit.wikimedia.org/r/521293 (https://phabricator.wikimedia.org/T211248) (owner: Ottomata) [18:39:54] (CR) jenkins-bot: Migrate page-* events to eventgate-main [mediawiki-config] - https://gerrit.wikimedia.org/r/521293 (https://phabricator.wikimedia.org/T211248) (owner: Ottomata) [18:41:25] verifying on mwdebug1002 [18:42:28] move and create work as expected [18:42:30] 
deploying [18:43:50] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Produce page-* streams to eventgate-main - T211248 (duration: 00m 50s) [18:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:55] T211248: Modern Event Platform: Stream Intake Service: Migrate Mediawiki Eventbus events to eventgate-main - https://phabricator.wikimedia.org/T211248 [18:44:02] Pchelolo: deployed! [18:49:39] (PS1) Ottomata: Refine mediawiki_page* with schema aware Refine job [puppet] - https://gerrit.wikimedia.org/r/521328 (https://phabricator.wikimedia.org/T211248) [18:50:08] (CR) Ottomata: "To be merged in a couple of hours (or tomorrow)" [puppet] - https://gerrit.wikimedia.org/r/521328 (https://phabricator.wikimedia.org/T211248) (owner: Ottomata) [18:50:54] Lucas_WMDE: sure [18:51:32] Lucas_WMDE: you can go to wikimedia-discovery, or any place you like depending on content [18:53:05] SMalyshev: too late (I did eventually figure out how to construct Blazegraph IVs, see Gerrit) [18:54:09] if you know a better solution for https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/472704/10/blazegraph/src/test/java/org/wikidata/query/rdf/blazegraph/mwapi/MWApiServiceFactoryUnitTest.java#65, leave a comment :) [18:55:05] Lucas_WMDE: for constructing IVs etc. we have utility functions IIRC [18:56:29] (PS2) Andrew Bogott: bootstrap-vz: configure base image to use sssd for buster and stretch [puppet] - https://gerrit.wikimedia.org/r/521278 (https://phabricator.wikimedia.org/T227475) [18:56:31] (PS1) Andrew Bogott: bootstrap-vs: remove facter package pinning for buster [puppet] - https://gerrit.wikimedia.org/r/521330 [18:56:42] (CR) Dzahn: [C: +2] zuul: actually write stack to stack_dump.log [puppet] - https://gerrit.wikimedia.org/r/521315 (owner: Hashar) [18:57:31] Lucas_WMDE: I see you still have repeating code in getLimitsFromParams for parsing parameters, any reason for that? 
[19:02:08] (PS2) Andrew Bogott: bootstrap-vs: remove facter package pinning for buster [puppet] - https://gerrit.wikimedia.org/r/521330 [19:02:50] (CR) Andrew Bogott: [C: +2] bootstrap-vs: remove facter package pinning for buster [puppet] - https://gerrit.wikimedia.org/r/521330 (owner: Andrew Bogott) [19:03:54] (CR) Dzahn: [C: +2] ntp/systemd: add notes_urls for timesyncd and systemd timers [puppet] - https://gerrit.wikimedia.org/r/520957 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn) [19:04:20] (PS2) Dzahn: ntp/systemd: add notes_urls for timesyncd and systemd timers [puppet] - https://gerrit.wikimedia.org/r/520957 (https://phabricator.wikimedia.org/T197873) [19:05:50] no particular reason [19:06:01] not sure how well extracting it would work, though, with the "once" special case [19:06:07] I suppose “get and clear” could be extracted [19:06:23] but I’m going home now, see you [19:06:34] Lucas_WMDE: also I am not sure what you did with makeConstant is right [19:06:55] we have methods for creating constants already in BigdataValuesHelper [19:07:12] and I am not sure the way you're getting value factory is right... [19:07:52] I’m sure it’s not right, but even getting that far took me two hours [19:08:02] I’ll be in a much better mood to listen to your suggestions tomorrow :) [19:08:03] maybe not super-important in this particular case but better to do it right. 
I'll check on which way is right [19:08:15] Lucas_WMDE: sure, ping me then [19:08:41] we do have helpers for this in tests but I need to look which one to use [19:09:59] (03PS1) 10Muehlenhoff: Bump version in changelog [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/521333 [19:13:43] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Deployment services): Request access to deployment cluster for Jakob_WMDE - https://phabricator.wikimedia.org/T227193 (10greg) [19:16:07] (03CR) 10Muehlenhoff: [C: 03+2] Bump version in changelog [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/521333 (owner: 10Muehlenhoff) [19:20:57] 10Operations, 10Machine vision, 10serviceops, 10Service-deployment-requests, 10Services (watching): Internal deployment of open_nsfw-- image scoring service - https://phabricator.wikimedia.org/T225664 (10Ramsey-WMF) To add to what MusikAnimal said, for SDC we're mainly looking at using this information t... [19:22:04] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Deployment services): Request access to deployment cluster for Jakob_WMDE - https://phabricator.wikimedia.org/T227193 (10greg) Approved. 
[19:22:39] 10Operations, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests, 10Release-Engineering-Team (Deployment services): Request access to deployment cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223698 (10greg) [19:23:11] 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Traffic, 10Maps (Tilerator): Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776 (10Mholloway) a:05Mholloway→03None [19:23:37] !log uploaded prometheus-pdns-exporter 0.4.1 to stretch-wikimedia T227411 [19:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:42] T227411: prometheus-pdns-exporter log noise about unexpected metrics - https://phabricator.wikimedia.org/T227411 [19:32:58] jouncebot: now [19:32:58] No deployments scheduled for the next 0 hour(s) and 27 minute(s) [19:33:01] jouncebot: next [19:33:01] In 0 hour(s) and 26 minute(s): Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190708T2000) [19:34:17] (03PS1) 10DLynch: Oversample all EditAttemptStep events on VE-as-mobile-default wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521338 (https://phabricator.wikimedia.org/T227317) [19:35:32] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[zuul-reload] [19:36:43] (03PS6) 10Ppchelko: RESTRouter: Add initial Helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/512923 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [19:37:04] Reedy, if you want, I can deploy the fix myself - but I'm happy to leave it up to you if you want :) [19:38:26] seems you're on it, thanks [19:38:46] !log reedy@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/OATHAuth/src/Key/TOTPKey.php: T227502 (duration: 00m 50s) [19:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:53] T227502: 2FA Scratch codes error - https://phabricator.wikimedia.org/T227502 [19:39:38] (03CR) 10Ppchelko: RESTRouter: Add initial Helm chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/512923 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [19:39:53] I would've had a fix up earlier if the wifi hadn't died for 20 mins after my initial comment :) [19:41:05] well, you're probably more experienced with MW and its bugs than me :-). [19:42:08] The fix is fairly easily identified as you probably found, the services name didn't match up [19:42:26] yup [19:51:37] (03CR) 10Esanders: Oversample all EditAttemptStep events on VE-as-mobile-default wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521338 (https://phabricator.wikimedia.org/T227317) (owner: 10DLynch) [19:53:04] 10Operations, 10serviceops, 10Performance-Team (Radar), 10User-Elukey: mcrouter codfw proxies sometimes lead to TKOs - https://phabricator.wikimedia.org/T227265 (10kchapman) [20:00:04] cscott, arlolra, subbu, bearND, and halfak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / … . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190708T2000). 
[20:01:46] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:05:08] PROBLEM - Check systemd state on wdqs1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:14:59] 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Services (watching): Kibana functionality missing after upgrade: histograms - https://phabricator.wikimedia.org/T152782 (10greg) [20:30:29] 10Operations, 10cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (10Andrew) [20:31:28] 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10RobH) a:05RobH→03Cmjohnson The replacement SSD has been ordered on T226756. It should arrive within a week or so, then this can progress. (once the linked... [20:34:45] 10Operations, 10cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (10Andrew) [20:47:15] (03PS2) 10Ottomata: eventstreams: add admins contact to eventstreams check [puppet] - 10https://gerrit.wikimedia.org/r/520475 (https://phabricator.wikimedia.org/T227065) (owner: 10Herron) [20:52:06] 10Operations, 10Wikimedia-production-error: Labtestwiki returns 503 error - https://phabricator.wikimedia.org/T227476 (10Urbanecm) Okay, seems I've tested from an incorrect host. But anyway, labweb1001 gives similar result. ` [urbanecm@labweb1001 ~]$ curl -H 'Host: labtestwikitech.wikimedia.org' "http://$(hos... 
[20:53:38] !log Upgraded prometheus-pdns-exporter to 0.4.1 on cloudservices1003.wikimedia.org (T227411) [20:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:45] T227411: prometheus-pdns-exporter log noise about unexpected metrics - https://phabricator.wikimedia.org/T227411 [20:55:12] (03PS3) 10Dzahn: contint1001: point Docker data to a different partition [puppet] - 10https://gerrit.wikimedia.org/r/520738 (https://phabricator.wikimedia.org/T207707) (owner: 10Hashar) [20:57:06] !log Upgraded prometheus-pdns-exporter to 0.4.1 on cloudservices1004.wikimedia.org (T227411) [20:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:28] (03CR) 10Ottomata: "Added some docs at https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration" [puppet] - 10https://gerrit.wikimedia.org/r/520475 (https://phabricator.wikimedia.org/T227065) (owner: 10Herron) [21:00:05] bawolff and Reedy: #bothumor I � Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190708T2100). [21:05:32] (03CR) 10Dzahn: [C: 03+2] contint1001: point Docker data to a different partition [puppet] - 10https://gerrit.wikimedia.org/r/520738 (https://phabricator.wikimedia.org/T207707) (owner: 10Hashar) [21:08:27] 10Operations, 10cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (10Andrew) [21:31:34] (03PS2) 10DLynch: Oversample all EditAttemptStep events on VE-as-mobile-default wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521338 (https://phabricator.wikimedia.org/T227317) [21:35:57] 10Operations, 10Wikimedia-Site-requests: Global rename of Waldir → Waldyrious: supervision needed - https://phabricator.wikimedia.org/T225370 (10waldyrious) >>! In T225370#5299189, @jcrespo wrote: > I am not a developer, but to me T225370#5298483 would seem like an intended thing. 
You may want to document that... [21:59:57] 10Operations, 10observability, 10serviceops, 10Performance-Team (Radar), 10User-Elukey: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (10ayounsi) >>! In T224454#5308149, @fgiunchedi wrote: > re: bandwidth itself, I believe we do have port utilization alerts based... [22:00:46] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - /mnt/docker/overlay2/06df99b3cd1b8b00241659fb783f51c3a7d88f148dccf05ae5932dbd0778bdd1/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [22:03:42] RECOVERY - Disk space on contint1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [22:13:17] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Release-Engineering-Team-TODO (201907): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10thcipriani) >>! In T207707#5302763, @hashar wrote: > So I think we can just: > * **stic... [22:17:03] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO: contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10Dzahn) 05Open→03Resolved I merged @hashar 's change to the docker data dir and @thcipriani restarte... 
[22:17:07] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Release-Engineering-Team-TODO (201907): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Dzahn) [22:19:04] PROBLEM - MariaDB Slave Lag: pc1 on pc2010 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 840.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [22:28:14] (03CR) 10Ayounsi: [C: 03+1] Add Ipv6 PTR/AAA records for an-worker* and an-coord1001 [dns] - 10https://gerrit.wikimedia.org/r/520767 (https://phabricator.wikimedia.org/T225296) (owner: 10Elukey) [22:31:09] (03CR) 10Dzahn: [C: 03+1] Add Ipv6 PTR/AAA records for an-worker* and an-coord1001 [dns] - 10https://gerrit.wikimedia.org/r/520767 (https://phabricator.wikimedia.org/T225296) (owner: 10Elukey) [22:31:24] (03PS4) 10Dzahn: Add Ipv6 PTR/AAAA records for an-worker* and an-coord1001 [dns] - 10https://gerrit.wikimedia.org/r/520767 (https://phabricator.wikimedia.org/T225296) (owner: 10Elukey) [22:41:38] (03CR) 10Jbond: [C: 03+2] monitoring: Add type checking to monitoring::graphite_threshold [puppet] - 10https://gerrit.wikimedia.org/r/520746 (owner: 10Jbond) [22:41:47] (03PS7) 10Jbond: monitoring: Add type checking to monitoring::graphite_threshold [puppet] - 10https://gerrit.wikimedia.org/r/520746 [22:42:57] 10Operations, 10ops-eqiad, 10DC-Ops: b1-eqiad pdu refresh - https://phabricator.wikimedia.org/T227536 (10RobH) [22:43:31] 10Operations, 10ops-eqiad, 10DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [22:44:59] 10Operations, 10ops-eqiad, 10DC-Ops: b2-eqiad pdu refresh - https://phabricator.wikimedia.org/T227538 (10RobH) [22:45:02] (03CR) 10Esanders: [C: 03+1] Oversample all EditAttemptStep events on VE-as-mobile-default wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/521338 (https://phabricator.wikimedia.org/T227317) (owner: 
10DLynch) [22:45:44] 10Operations, 10ops-eqiad, 10DC-Ops: b3-eqiad pdu refresh - https://phabricator.wikimedia.org/T227539 (10RobH) [22:46:10] 10Operations, 10ops-eqiad, 10DC-Ops: b4-eqiad pdu refresh - https://phabricator.wikimedia.org/T227540 (10RobH) [22:46:44] 10Operations, 10ops-eqiad, 10DC-Ops: b5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227541 (10RobH) [22:47:09] 10Operations, 10ops-eqiad, 10DC-Ops: b6-eqiad pdu refresh - https://phabricator.wikimedia.org/T227541 (10RobH) [22:47:18] PROBLEM - puppet last run on cloudstore1008 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [22:48:14] PROBLEM - puppet last run on labstore1006 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [22:48:20] 10Operations, 10ops-eqiad, 10DC-Ops: b7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227542 (10RobH) [22:48:28] (03PS9) 10Dzahn: monitoring::graphite_threshold: add notes_link [puppet] - 10https://gerrit.wikimedia.org/r/520747 (https://phabricator.wikimedia.org/T197873) (owner: 10Jbond) [22:48:49] ^^ looking [22:49:06] 10Operations, 10ops-eqiad, 10DC-Ops: b8-eqiad pdu refresh - https://phabricator.wikimedia.org/T227543 (10RobH) [22:50:15] (03CR) 10Dzahn: [C: 03+2] "> the links in the following have already been reviewed:" [puppet] - 10https://gerrit.wikimedia.org/r/520747 (https://phabricator.wikimedia.org/T197873) (owner: 10Jbond) [22:50:30] 10Operations, 10ops-eqiad, 10DC-Ops: (July 22-26) install new PDUs in rows A/B (Top level tracking task) - https://phabricator.wikimedia.org/T226778 (10RobH) [22:50:42] jbond42|away: oh, just removed an "s" and added a "." 
to comment [22:50:50] using the online editor [22:51:29] (03PS10) 10Dzahn: monitoring::graphite_threshold: add notes_link [puppet] - 10https://gerrit.wikimedia.org/r/520747 (https://phabricator.wikimedia.org/T197873) (owner: 10Jbond) [22:51:32] and rebasing [22:51:43] yep no problem [22:53:06] (03PS1) 10Jbond: labstore: ensure graphite_threshold uses numerics [puppet] - 10https://gerrit.wikimedia.org/r/521370 [22:53:52] (03CR) 10Jbond: [C: 03+2] labstore: ensure graphite_threshold uses numerics [puppet] - 10https://gerrit.wikimedia.org/r/521370 (owner: 10Jbond) [22:54:01] (03PS2) 10Jbond: labstore: ensure graphite_threshold uses numerics [puppet] - 10https://gerrit.wikimedia.org/r/521370 [22:59:40] PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [23:00:04] MaxSem, RoanKattouw, and Niharika: Your horoscope predicts another unfortunate Evening SWAT (Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190708T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. 
[23:01:16] (03PS1) 10Jbond: labstore: fix paramater type [puppet] - 10https://gerrit.wikimedia.org/r/521371 [23:02:23] (03CR) 10Jbond: [C: 03+2] labstore: fix paramater type [puppet] - 10https://gerrit.wikimedia.org/r/521371 (owner: 10Jbond) [23:02:31] (03PS2) 10Jbond: labstore: fix paramater type [puppet] - 10https://gerrit.wikimedia.org/r/521371 [23:03:20] 10Operations, 10Release-Engineering-Team-TODO, 10Scoring-platform-team, 10Release-Engineering-Team (Deployment services): Contact number of some WMDE staff should be avalible to SRE/RelEng - https://phabricator.wikimedia.org/T210721 (10Jdforrester-WMF) [23:03:27] i was going to say i check the labstore one but that seems like you got it :) [23:03:35] !log changing password for user "Naomi.piquette" [23:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:04:18] mutante: yes im working on it [23:04:44] :) [23:05:15] will need to do one more i think i may have removed the notes_link in an earlier "fix" [23:07:22] PROBLEM - puppet last run on labstore1005 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [23:07:36] gotcha. yep [23:10:12] PROBLEM - puppet last run on labstore1007 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [23:11:06] PROBLEM - puppet last run on cloudstore1009 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[23:11:21] (03PS1) 10Jbond: labstore: fix paramter name notes_url vs notes_link [puppet] - 10https://gerrit.wikimedia.org/r/521372 [23:11:49] (03CR) 10Jbond: [C: 03+2] labstore: fix paramter name notes_url vs notes_link [puppet] - 10https://gerrit.wikimedia.org/r/521372 (owner: 10Jbond) [23:14:20] RECOVERY - puppet last run on labstore1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:14:36] mutante: should be fixed now but will hang around for another 30 mins :) [23:15:15] jbond42: looks like we already got recovery:) you can also use cumin to run puppet on labstore* and shorten the 30 minutes to like 5 [23:15:30] RECOVERY - puppet last run on labstore1007 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:15:58] well, there we go anyways :) [23:16:07] im not worried about the ones that are failed im confident they are good just want to make sure i haven't added a further regression [23:16:21] RECOVERY - puppet last run on cloudstore1009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:16:29] either way its not a problem just keeping half an eye in here [23:17:39] RECOVERY - puppet last run on labstore1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [23:18:33] RECOVERY - puppet last run on cloudstore1008 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [23:22:13] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [23:24:29] PROBLEM - puppet last run on graphite1004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [23:26:55] jbond42: so about notes URLs/links in general..
there is that special case i have left because it is: [23:27:48] using $title inside the name of the service.. it's inside a defined type and reused by multiple services [23:28:11] so either need a single URL that describes it for all or also use $title in the URL [23:29:41] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [23:30:30] (03PS1) 10Jbond: monitoring::graphite_threshhold: allow url encode strings [puppet] - 10https://gerrit.wikimedia.org/r/521375 [23:30:49] RECOVERY - puppet last run on labstore1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [23:30:56] mutante: this is where having a standard uri endpoint would help [23:31:25] jbond42: yea, we can declare it or "all services using uwsgi" [23:31:27] for [23:31:51] Monitoring/Services/$title [23:32:01] (03CR) 10Jbond: [C: 03+2] monitoring::graphite_threshhold: allow url encode strings [puppet] - 10https://gerrit.wikimedia.org/r/521375 (owner: 10Jbond) [23:33:15] mutante: its easy to do from the puppet side [23:34:19] yes, doing it [23:34:33] cool :D [23:35:09] RECOVERY - puppet last run on graphite1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:35:45] (03PS1) 10Dzahn: uwsgi::app: add notes_url for services using uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/521376 [23:37:13] (03PS2) 10Dzahn: uwsgi::app: add notes_url for services using uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/521376 (https://phabricator.wikimedia.org/T197873) [23:38:13] 10Operations, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Watching / External), 10Services (watching): rack/setup/install sessionstore200[123].codfw.wmnet - https://phabricator.wikimedia.org/T209389 (10Eevans) [23:38:24] 10Operations, 
10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Watching / External), 10Services (watching): rack/setup/install sessionstore200[123].codfw.wmnet - https://phabricator.wikimedia.org/T209389 (10Eevans) 05Stalled→03Resolved [23:41:51] (03PS1) 10Dzahn: postgresql: add icinga notes_url for postgres repl lag [puppet] - 10https://gerrit.wikimedia.org/r/521377 [23:44:38] (03CR) 10jerkins-bot: [V: 04-1] uwsgi::app: add notes_url for services using uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/521376 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [23:44:40] (03CR) 10jerkins-bot: [V: 04-1] postgresql: add icinga notes_url for postgres repl lag [puppet] - 10https://gerrit.wikimedia.org/r/521377 (owner: 10Dzahn) [23:47:59] (03PS2) 10Dzahn: postgresql: add icinga notes_url for postgres repl lag [puppet] - 10https://gerrit.wikimedia.org/r/521377 [23:53:16] (03PS3) 10Dzahn: uwsgi::app: add notes_url for services using uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/521376 (https://phabricator.wikimedia.org/T197873)