[00:01:30] (CR) Dzahn: [C: +2] Make caching of static performance site explicit [puppet] - https://gerrit.wikimedia.org/r/499537 (https://phabricator.wikimedia.org/T219417) (owner: Gilles)
[00:01:33] (PS8) Dzahn: Make caching of static performance site explicit [puppet] - https://gerrit.wikimedia.org/r/499537 (https://phabricator.wikimedia.org/T219417) (owner: Gilles)
[00:01:52] (CR) Mobrovac: [C: +1] "Ok, we are now good to move on this." [puppet] - https://gerrit.wikimedia.org/r/503452 (https://phabricator.wikimedia.org/T208087) (owner: Mobrovac)
[00:18:53] Operations, DBA, Jade, TechCom-RFC, and 3 others: Introduce a new namespace for collaborative judgements about wiki entities - https://phabricator.wikimedia.org/T200297 (Harej)
[00:19:22] Operations, DBA, Jade, TechCom-RFC, and 3 others: Introduce a new namespace for collaborative judgements about wiki entities - https://phabricator.wikimedia.org/T200297 (Harej) a:awight→None
[00:20:01] Operations, DBA, Jade, TechCom-RFC, and 3 others: Introduce a new namespace for collaborative judgements about wiki entities - https://phabricator.wikimedia.org/T200297 (Harej) I have not seen any updates for this RFC in a few months. My understanding is that most of the issues are addressed. Is...
[00:30:46] (CR) Krinkle: [C: +2] profiler: Discard incomplete stacks and monitor via statsd [mediawiki-config] - https://gerrit.wikimedia.org/r/503652 (https://phabricator.wikimedia.org/T176916) (owner: Krinkle)
[00:31:46] (Merged) jenkins-bot: profiler: Discard incomplete stacks and monitor via statsd [mediawiki-config] - https://gerrit.wikimedia.org/r/503652 (https://phabricator.wikimedia.org/T176916) (owner: Krinkle)
[00:32:17] * Krinkle staging on mwdebug1002
[00:33:28] !log creating new restbase schema -- T221031
[00:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:35] Operations, Core Platform Team (PHP7 (TEC4)), Core Platform Team Kanban (Doing), HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Krinkle)
[00:33:38] T221031: Create new mobile storage tables - https://phabricator.wikimedia.org/T221031
[00:33:48] (CR) jenkins-bot: profiler: Discard incomplete stacks and monitor via statsd [mediawiki-config] - https://gerrit.wikimedia.org/r/503652 (https://phabricator.wikimedia.org/T176916) (owner: Krinkle)
[00:33:54] !log krinkle@deploy1001 Synchronized wmf-config/profiler.php: I7589aa153 (duration: 00m 52s)
[00:33:55] Operations, Core Platform Team (PHP7 (TEC4)), Core Platform Team Kanban (Doing), HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Krinkle)
[00:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:01] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[00:34:06] Operations, PHP 7.2 support, Performance-Team (Radar): Set `enable_dl` to 0 in php.ini - https://phabricator.wikimedia.org/T220681 (Krinkle)
[00:34:24] Operations, PHP 7.2 support,
Patch-For-Review, Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (Krinkle)
[00:34:38] !log pooled maps2003 - postgres init complete!
[00:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:35] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[00:37:12] Operations, Cassandra, Maps: Collect metrics on maps cassandra - https://phabricator.wikimedia.org/T221055 (Mathew.onipe)
[00:43:07] (CR) Harej: [C: +1] "This looks good to go. I am not sure why the "do not merge" label is applied here." [mediawiki-config] - https://gerrit.wikimedia.org/r/480284 (https://phabricator.wikimedia.org/T212182) (owner: Awight)
[00:46:53] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[00:52:01] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[01:46:20] !log aaron@deploy1001 Synchronized php-1.33.0-wmf.25/includes/parser/Parser.php: 73529ae6c5ffb6 (duration: 00m 53s)
[01:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:56:27] PROBLEM - DNS labvirt1006.mgmt on labvirt1006.mgmt is CRITICAL: Domain labvirt1006.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:13:55] (PS1) Ayounsi: Add looking glass CNAMEs [dns] - https://gerrit.wikimedia.org/r/504233 (https://phabricator.wikimedia.org/T106056)
[03:15:05] (PS7) Ayounsi:
Bird-lg [puppet] - https://gerrit.wikimedia.org/r/390330 (https://phabricator.wikimedia.org/T106056)
[03:17:07] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/390330 (https://phabricator.wikimedia.org/T106056) (owner: Ayounsi)
[03:30:12] (PS8) Ayounsi: Bird-lg [puppet] - https://gerrit.wikimedia.org/r/390330 (https://phabricator.wikimedia.org/T106056)
[04:08:55] PROBLEM - puppet last run on elastic1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:35:25] RECOVERY - puppet last run on elastic1040 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:53:31] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:58:44] (PS1) Marostegui: Revert "db-eqiad.php: Depool pc1007" [mediawiki-config] - https://gerrit.wikimedia.org/r/504237
[05:00:02] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool pc1007" [mediawiki-config] - https://gerrit.wikimedia.org/r/504237 (owner: Marostegui)
[05:01:34] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool pc1007" [mediawiki-config] - https://gerrit.wikimedia.org/r/504237 (owner: Marostegui)
[05:01:48] (CR) jenkins-bot: Revert "db-eqiad.php: Depool pc1007" [mediawiki-config] - https://gerrit.wikimedia.org/r/504237 (owner: Marostegui)
[05:02:32] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool pc1007 (duration: 00m 31s)
[05:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:07:19] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[05:09:17] checking that
[05:13:06] It started at around 4:59 UTC
[05:13:47] And it looks like most of them are
TextSlotDiffRenderer.php
[05:15:59] <_joe_> so given there was no change I think
[05:16:05] <_joe_> unless I am missing something
[05:16:29] I did revert a change at 05:02
[05:16:34] a parsercache change
[05:16:37] but it should be a noop
[05:16:43] https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1&from=now-24h&to=now
[05:16:48] but it started before that
[05:17:33] https://logstash.wikimedia.org/goto/1fb5b0737c4fd5895a3bb22affa36555
[05:19:55] PROBLEM - puppet last run on logstash1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:21:52] <_joe_> so the problem is those requests are taking an insane amount of time to execute
[05:22:30] yeah, and it looks like it is always TextSlotDiffRenderer.php
[05:23:08] <_joe_> it's also a bit strange, I remembered we had a 200 seconds timeout on API
[05:23:39] <_joe_> sigh the error has the url but not the wiki
[05:23:55] <_joe_> oh here it is, under "server"
[05:25:17] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[05:29:25] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[05:51:41] RECOVERY - puppet last run on logstash1011 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[05:54:30] (PS1) Marostegui: install_server: Allow reimage dbprov1001,1002 [puppet] - https://gerrit.wikimedia.org/r/504239 (https://phabricator.wikimedia.org/T219399)
[05:58:58] (PS3) Giuseppe Lavagetto: profile::docker::builder: add periodic job to prune old images [puppet] - https://gerrit.wikimedia.org/r/504029
[06:08:42] (CR) Elukey: haproxy: improve metrics (via mtail) and logging (3 comments) [puppet] - https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner:
Gilles)
[06:09:41] (PS2) Elukey: admin: Add Bryan Davis (bd808) to 'researchers' group [puppet] - https://gerrit.wikimedia.org/r/504094 (https://phabricator.wikimedia.org/T220892) (owner: BryanDavis)
[06:23:46] (PS7) Effie Mouzeli: haproxy: improve metrics (via mtail) and logging [puppet] - https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: Gilles)
[06:27:29] (CR) Elukey: [C: +2] admin: Add Bryan Davis (bd808) to 'researchers' group [puppet] - https://gerrit.wikimedia.org/r/504094 (https://phabricator.wikimedia.org/T220892) (owner: BryanDavis)
[06:28:20] (CR) Effie Mouzeli: haproxy: improve metrics (via mtail) and logging (3 comments) [puppet] - https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: Gilles)
[06:30:17] Can anybody invite me to #_security
[06:30:36] I dunno why I keep losing access
[06:30:52] done
[06:31:22] Operations, SRE-Access-Requests, Patch-For-Review: Membership in "researchers" group for Bryan Davis - https://phabricator.wikimedia.org/T220892 (elukey) Open→Resolved a:elukey @bd808 done!
[06:33:07] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/bash_autologout.sh]
[06:36:56] (PS8) Effie Mouzeli: haproxy: improve metrics (via mtail) and logging [puppet] - https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: Gilles)
[06:51:53] (CR) Elukey: "Hey Aaron and Timo!
Sorry for the lag in answering, and probably for my follow up below (since I am still a big ignorant of all the bits a" [puppet] - https://gerrit.wikimedia.org/r/492948 (owner: Aaron Schulz)
[06:55:27] (CR) Vgutierrez: [C: +2] redirects.dat: Remove wikisource.gr [puppet] - https://gerrit.wikimedia.org/r/500716 (https://phabricator.wikimedia.org/T213705) (owner: Vgutierrez)
[06:55:37] (PS3) Vgutierrez: redirects.dat: Remove wikisource.gr [puppet] - https://gerrit.wikimedia.org/r/500716 (https://phabricator.wikimedia.org/T213705)
[06:57:21] (CR) Elukey: [C: +1] "Left a nit but it is more a lack of puppet knowledge from my side." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: Gilles)
[06:57:56] (CR) Vgutierrez: [C: +2] Add SPF record for wikisource.org [dns] - https://gerrit.wikimedia.org/r/503165 (https://phabricator.wikimedia.org/T193408) (owner: Vgutierrez)
[06:58:02] (PS2) Vgutierrez: Add SPF record for wikisource.org [dns] - https://gerrit.wikimedia.org/r/503165 (https://phabricator.wikimedia.org/T193408)
[06:59:33] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:59:57] (CR) Vgutierrez: [C: +2] Add SPF record for wikibooks.org [dns] - https://gerrit.wikimedia.org/r/503177 (https://phabricator.wikimedia.org/T193408) (owner: Vgutierrez)
[07:00:03] (PS2) Vgutierrez: Add SPF record for wikibooks.org [dns] - https://gerrit.wikimedia.org/r/503177 (https://phabricator.wikimedia.org/T193408)
[07:09:34] (PS2) Vgutierrez: Add SPF records for non-canonical non-parked domains [dns] - https://gerrit.wikimedia.org/r/503258 (https://phabricator.wikimedia.org/T220786)
[07:09:36] (PS1) Vgutierrez: Add SPF record for wikiba.se [dns] - https://gerrit.wikimedia.org/r/504241 (https://phabricator.wikimedia.org/T220786)
[07:09:38] (PS1) Vgutierrez: Add SPF record
for wikimedia.ee [dns] - https://gerrit.wikimedia.org/r/504242 (https://phabricator.wikimedia.org/T220786)
[07:09:40] (PS1) Vgutierrez: Add SPF record for toolserver.org [dns] - https://gerrit.wikimedia.org/r/504243 (https://phabricator.wikimedia.org/T220786)
[07:09:43] (PS1) Vgutierrez: Add SPF record for wmftest.org [dns] - https://gerrit.wikimedia.org/r/504244 (https://phabricator.wikimedia.org/T220786)
[07:11:39] !log upgrading Java on Hadoop/Kafka/Jumbo/Druid clusters
[07:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:49] (CR) Vgutierrez: "> Patch Set 1: Code-Review-1" [dns] - https://gerrit.wikimedia.org/r/503258 (https://phabricator.wikimedia.org/T220786) (owner: Vgutierrez)
[07:14:40] (CR) Marostegui: ">" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: Gilles)
[07:22:15] PROBLEM - puppet last run on analytics1075 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures.
Failed resources (up to 3 shown): Package[openjdk-8-jdk]
[07:24:46] (PS2) Ema: Revert "cp2005: use ATS backends instead of Varnish" [puppet] - https://gerrit.wikimedia.org/r/504038 (https://phabricator.wikimedia.org/T213263)
[07:25:26] !log cp2005: depool varnish-fe in preparation of traffic switchback to Varnish T213263
[07:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:30] T213263: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263
[07:25:44] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2005.codfw.wmnet,service=nginx
[07:25:45] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2005.codfw.wmnet,service=varnish-fe
[07:25:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:19] (CR) Ema: [C: +2] Revert "cp2005: use ATS backends instead of Varnish" [puppet] - https://gerrit.wikimedia.org/r/504038 (https://phabricator.wikimedia.org/T213263) (owner: Ema)
[07:26:31] (CR) Effie Mouzeli: haproxy: improve metrics (via mtail) and logging (1 comment) [puppet] - https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: Gilles)
[07:27:31] RECOVERY - puppet last run on analytics1075 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:30:21] (CR) Elukey: [C: +1] ">" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: Gilles)
[07:30:34] (CR) Marostegui: ">" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: Gilles)
[07:32:11] (PS1) Vgutierrez: acme_chief: Issue birdlg certificate [puppet] - https://gerrit.wikimedia.org/r/504248
(https://phabricator.wikimedia.org/T106056)
[07:32:53] !log cp2005: repool varnish-fe pointing to Varnish T213263
[07:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:57] T213263: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263
[07:33:23] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2005.codfw.wmnet,service=nginx
[07:33:24] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2005.codfw.wmnet,service=varnish-fe
[07:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:53] !log Upgrade db2093
[07:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:49] (PS2) Fsero: registryha: added 1 VM on eqiad and 2 on codfw [puppet] - https://gerrit.wikimedia.org/r/504063 (https://phabricator.wikimedia.org/T214289)
[07:41:28] (CR) jerkins-bot: [V: -1] registryha: added 1 VM on eqiad and 2 on codfw [puppet] - https://gerrit.wikimedia.org/r/504063 (https://phabricator.wikimedia.org/T214289) (owner: Fsero)
[07:44:31] (PS2) Ema: Revert "cp2002: use ATS backends instead of Varnish" [puppet] - https://gerrit.wikimedia.org/r/504037 (https://phabricator.wikimedia.org/T213263)
[07:45:00] !log cp2002: depool varnish-fe in preparation of traffic switchback to Varnish T213263
[07:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:05] T213263: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263
[07:45:15] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2002.codfw.wmnet,service=nginx
[07:45:16] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp2002.codfw.wmnet,service=varnish-fe
[07:45:17] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:31] (CR) Ema: [C: +2] Revert "cp2002: use ATS backends instead of Varnish" [puppet] - https://gerrit.wikimedia.org/r/504037 (https://phabricator.wikimedia.org/T213263) (owner: Ema)
[07:47:29] !log rebooting Swift frontends in eqiad for combined kernel/glibc/OpenSSL security updates
[07:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:36] (PS3) Fsero: registryha: added 1 VM on eqiad and 2 on codfw [puppet] - https://gerrit.wikimedia.org/r/504063 (https://phabricator.wikimedia.org/T214289)
[07:50:31] (CR) Fsero: [C: +2] registryha: added 1 VM on eqiad and 2 on codfw [puppet] - https://gerrit.wikimedia.org/r/504063 (https://phabricator.wikimedia.org/T214289) (owner: Fsero)
[07:50:38] !log cp2002: repool varnish-fe pointing to Varnish T213263
[07:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:42] T213263: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263
[07:50:47] (PS4) Fsero: registryha: added 1 VM on eqiad and 2 on codfw [puppet] - https://gerrit.wikimedia.org/r/504063 (https://phabricator.wikimedia.org/T214289)
[07:50:47] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2002.codfw.wmnet,service=nginx
[07:50:48] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp2002.codfw.wmnet,service=varnish-fe
[07:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:42] (PS1) KartikMistry: Remove ExternalGuidanceEnableContextDetection [mediawiki-config] - https://gerrit.wikimedia.org/r/504261 (https://phabricator.wikimedia.org/T219819)
[07:55:04] (CR) Vgutierrez: [C:
-1] "please use acme-chief <3" (14 comments) [puppet] - https://gerrit.wikimedia.org/r/390330 (https://phabricator.wikimedia.org/T106056) (owner: Ayounsi)
[08:01:36] !log rebooting Swift frontends in codfw for combined kernel/glibc/OpenSSL security updates
[08:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:15] (CR) Filippo Giunchedi: [C: +1] "> Patch Set 5:" [puppet] - https://gerrit.wikimedia.org/r/502785 (https://phabricator.wikimedia.org/T219803) (owner: Jbond)
[08:12:31] PROBLEM - puppet last run on an-worker1095 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:21:05] (CR) Giuseppe Lavagetto: haproxy: improve metrics (via mtail) and logging (1 comment) [puppet] - https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: Gilles)
[08:26:46] (CR) Jcrespo: [C: +1] install_server: Allow reimage dbprov1001,1002 [puppet] - https://gerrit.wikimedia.org/r/504239 (https://phabricator.wikimedia.org/T219399) (owner: Marostegui)
[08:28:05] (CR) Effie Mouzeli: haproxy: improve metrics (via mtail) and logging (1 comment) [puppet] - https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: Gilles)
[08:28:11] (PS2) Marostegui: install_server: Allow reimage dbprov1001,1002 [puppet] - https://gerrit.wikimedia.org/r/504239 (https://phabricator.wikimedia.org/T219399)
[08:28:41] (CR) Elukey: [C: +1] haproxy: improve metrics (via mtail) and logging (1 comment) [puppet] - https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: Gilles)
[08:28:53] (CR) Marostegui: [C: +2] install_server: Allow reimage dbprov1001,1002 [puppet] - https://gerrit.wikimedia.org/r/504239 (https://phabricator.wikimedia.org/T219399) (owner: Marostegui)
[08:33:02] !log rebooting ms-be1020 for combined kernel/glibc/OpenSSL update
[08:33:04] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:21] Operations, DBA, Jade, TechCom-RFC, and 3 others: Introduce a new namespace for collaborative judgements about wiki entities - https://phabricator.wikimedia.org/T200297 (jcrespo) Could you summarize the tables added and the resource commitment, just to make sure we all understand it well?
[08:36:06] (PS1) Elukey: profile::hadoop::spark2: auto upload of spark2-assembly.zip [puppet] - https://gerrit.wikimedia.org/r/504268 (https://phabricator.wikimedia.org/T218343)
[08:38:14] RECOVERY - puppet last run on an-worker1095 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:40:49] !log Updated the Wikidata property suggester with data from the 2019-04-08 JSON dump and applied the T132839 workarounds
[08:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:53] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839
[08:41:11] Operations, Cassandra, Maps: Collect metrics on maps cassandra - https://phabricator.wikimedia.org/T221055 (Mathew.onipe) p:Triage→Normal
[08:56:56] (PS2) Ema: Revert "cp1076: use ATS backends instead of Varnish" [puppet] - https://gerrit.wikimedia.org/r/504036 (https://phabricator.wikimedia.org/T213263)
[08:57:33] !log cp1076: depool varnish-fe in preparation of traffic switchback to Varnish T213263
[08:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:38] T213263: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263
[08:57:43] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1076.eqiad.wmnet,service=nginx
[08:57:45] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1076.eqiad.wmnet,service=varnish-fe
[08:57:45] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:05] (CR) Ema: [C: +2] Revert "cp1076: use ATS backends instead of Varnish" [puppet] - https://gerrit.wikimedia.org/r/504036 (https://phabricator.wikimedia.org/T213263) (owner: Ema)
[09:04:53] (PS1) Elukey: Remove HTTPS config from the Hadoop testing cluster [puppet] - https://gerrit.wikimedia.org/r/504275 (https://phabricator.wikimedia.org/T217412)
[09:05:15] !log cp1076: repool varnish-fe pointing to Varnish T213263
[09:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:19] T213263: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263
[09:05:30] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1076.eqiad.wmnet,service=nginx
[09:05:31] (CR) Elukey: [C: +2] Remove HTTPS config from the Hadoop testing cluster [puppet] - https://gerrit.wikimedia.org/r/504275 (https://phabricator.wikimedia.org/T217412) (owner: Elukey)
[09:05:32] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1076.eqiad.wmnet,service=varnish-fe
[09:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:59] (PS2) Ema: Revert "cache: hiera flag to use ATS as local backend" [puppet] - https://gerrit.wikimedia.org/r/504039 (https://phabricator.wikimedia.org/T213263)
[09:08:20] (PS1) Elukey: role::an_test_cluster::hadoop::master|standby: unset ferm TLS config [puppet] - https://gerrit.wikimedia.org/r/504277 (https://phabricator.wikimedia.org/T217412)
[09:08:22] (CR) Ema: [C: +2] Revert "cache: hiera flag to use ATS as local backend" [puppet] - https://gerrit.wikimedia.org/r/504039 (https://phabricator.wikimedia.org/T213263) (owner: Ema)
[09:08:56] (CR) Santhosh: [C: +1] Remove ExternalGuidanceEnableContextDetection [mediawiki-config] - https://gerrit.wikimedia.org/r/504261 (https://phabricator.wikimedia.org/T219819) (owner: KartikMistry)
[09:09:55] (PS2) Elukey: role::an_test_cluster::hadoop::master|standby: unset ferm TLS config [puppet] - https://gerrit.wikimedia.org/r/504277 (https://phabricator.wikimedia.org/T217412)
[09:10:32] (CR) Elukey: [C: +2] role::an_test_cluster::hadoop::master|standby: unset ferm TLS config [puppet] - https://gerrit.wikimedia.org/r/504277 (https://phabricator.wikimedia.org/T217412) (owner: Elukey)
[09:11:23] (CR) Jbond: [C: +2] puppet_major_version4: remove old puppet_major_version variable. [puppet] - https://gerrit.wikimedia.org/r/502785 (https://phabricator.wikimedia.org/T219803) (owner: Jbond)
[09:11:34] (PS7) Jbond: puppet_major_version4: remove old puppet_major_version variable. [puppet] - https://gerrit.wikimedia.org/r/502785 (https://phabricator.wikimedia.org/T219803)
[09:15:45] (CR) Effie Mouzeli: "> >" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: Gilles)
[09:23:59] (CR) Marostegui: [C: +1] haproxy: improve metrics (via mtail) and logging [puppet] - https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: Gilles)
[09:24:02] PROBLEM - puppet last run on mw1336 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[09:26:09] !log [late logging] swift container-to-container synchronization enabled between docker_registry_eqiad and docker_registry_codfw swift containers at 08:15:00 UTC
[09:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:36] (PS2) Jcrespo: mariadb: Reduce db1078 load in preparation for depool [mediawiki-config] - https://gerrit.wikimedia.org/r/504019 (https://phabricator.wikimedia.org/T219115)
[09:33:48] (PS2) Santhosh: Remove ExternalGuidanceEnableContextDetection [mediawiki-config] - https://gerrit.wikimedia.org/r/504261 (https://phabricator.wikimedia.org/T219819) (owner: KartikMistry)
[09:34:24] <_joe_> jbond42: oh you removed that horror <3
[09:34:41] (PS3) Santhosh: Remove ExternalGuidanceEnableContextDetection [mediawiki-config] - https://gerrit.wikimedia.org/r/504261 (https://phabricator.wikimedia.org/T219819) (owner: KartikMistry)
[09:34:41] <_joe_> jbond42: can we now use hiera.production.yaml as well?
[09:35:02] <_joe_> (yes, this is a trick to try to nerd-snipe you)
[09:35:31] _joe_: i can look into it but i wanted to ask you something about the hiera backends if you have a second
[09:35:34] (CR) Santhosh: [C: -1] "Hold till https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ExternalGuidance/+/497713 is merged" [mediawiki-config] - https://gerrit.wikimedia.org/r/504261 (https://phabricator.wikimedia.org/T219819) (owner: KartikMistry)
[09:35:46] <_joe_> sure, maybe not in this channel, too noisy :)
[09:36:00] yes
[09:37:37] !log Disabling puppet on dbproxy* and thumbor* to merge 502972
[09:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:29] (CR) Jcrespo: [C: +2] mariadb: Reduce db1078 load in preparation for depool [mediawiki-config] - https://gerrit.wikimedia.org/r/504019 (https://phabricator.wikimedia.org/T219115) (owner: Jcrespo)
[09:43:32] (Merged) jenkins-bot: mariadb: Reduce db1078 load in preparation for depool [mediawiki-config] - https://gerrit.wikimedia.org/r/504019 (https://phabricator.wikimedia.org/T219115) (owner: Jcrespo)
[09:44:03] (CR) Effie Mouzeli: [C: +2] haproxy: improve metrics (via mtail) and logging [puppet] - https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: Gilles)
[09:44:12] (PS9) Effie Mouzeli: haproxy: improve metrics (via mtail) and logging [puppet] - https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) (owner: Gilles)
[09:44:24] (CR) jenkins-bot: mariadb: Reduce db1078 load in preparation for depool [mediawiki-config] - https://gerrit.wikimedia.org/r/504019 (https://phabricator.wikimedia.org/T219115) (owner: Jcrespo)
[09:50:12] RECOVERY - puppet last run on mw1336 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:51:29] (PS4) Giuseppe Lavagetto: profile::docker::builder: add periodic job to prune old images [puppet] -
https://gerrit.wikimedia.org/r/504029
[09:51:59] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Reduce db1078 load (duration: 00m 53s)
[09:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:44] (PS2) Jcrespo: mariadb: Depool db1078 for hardware maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/504022 (https://phabricator.wikimedia.org/T219115)
[09:53:15] (PS1) Elukey: role::analytics_test_cluster::hadoop::ui: remove yarn/hdfs/mr TLS conf [puppet] - https://gerrit.wikimedia.org/r/504278 (https://phabricator.wikimedia.org/T217412)
[09:53:52] (PS1) Gilles: Enable Origin Trials on mobile Spanish Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/504279 (https://phabricator.wikimedia.org/T221065)
[09:56:31] (CR) Elukey: [C: +2] role::analytics_test_cluster::hadoop::ui: remove yarn/hdfs/mr TLS conf [puppet] - https://gerrit.wikimedia.org/r/504278 (https://phabricator.wikimedia.org/T217412) (owner: Elukey)
[09:59:20] !log Enabling puppet again on dbproxy* and thumbor*
[09:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:26] (PS1) Elukey: [WIP] Enable Kerberos in the Analytics Hadoop Test cluster [puppet] - https://gerrit.wikimedia.org/r/504280
[10:02:51] (PS2) Elukey: [WIP] Enable Kerberos in the Analytics Hadoop Test cluster [puppet] - https://gerrit.wikimedia.org/r/504280
[10:05:21] (PS5) Giuseppe Lavagetto: profile::docker::builder: add periodic job to prune old images [puppet] - https://gerrit.wikimedia.org/r/504029
[10:07:13] (PS3) Elukey: [WIP] Enable Kerberos in the Analytics Hadoop Test cluster [puppet] - https://gerrit.wikimedia.org/r/504280
[10:09:26] (PS25) Ema: cache: add profile::cache::varnish::frontend [puppet] - https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967)
[10:09:28] (PS16) Ema: cache: implement profile::cache::varnish::backend [puppet] -
10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) [10:09:59] (03CR) 10Gilles: [C: 03+2] Enable Origin Trials on mobile Spanish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504279 (https://phabricator.wikimedia.org/T221065) (owner: 10Gilles) [10:11:00] (03Merged) 10jenkins-bot: Enable Origin Trials on mobile Spanish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504279 (https://phabricator.wikimedia.org/T221065) (owner: 10Gilles) [10:14:00] (03CR) 10Ema: "This and the parent commit are now a noop (modulo the removal of a trailing whitespace in the varnish.service systemd unit). See https://p" [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [10:15:30] (03PS4) 10Elukey: [WIP] Enable Kerberos in the Analytics Hadoop Test cluster [puppet] - 10https://gerrit.wikimedia.org/r/504280 [10:16:17] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Enable Kerberos in the Analytics Hadoop Test cluster [puppet] - 10https://gerrit.wikimedia.org/r/504280 (owner: 10Elukey) [10:17:31] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10User-Joe: Experiment with a TLS proxy/router for pods - https://phabricator.wikimedia.org/T177394 (10fsero) [10:17:34] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: publish 1.9.1 envoy docker image - https://phabricator.wikimedia.org/T220382 (10fsero) 05Open→03Resolved image was published. 
[10:18:38] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (10fsero) [10:18:40] !log gilles@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T221065 Set up origin trials on Spanish Wikipedia mobile site (duration: 00m 52s) [10:18:40] 10Operations, 10Kubernetes, 10Patch-For-Review: set up a test node with new version, Redis as cache, a new Swift container and export metrics over graphana - https://phabricator.wikimedia.org/T210076 (10fsero) 05Open→03Resolved [10:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:44] T221065: Set up origin trials on Spanish Wikipedia mobile site - https://phabricator.wikimedia.org/T221065 [10:19:22] (03CR) 10jenkins-bot: Enable Origin Trials on mobile Spanish Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504279 (https://phabricator.wikimedia.org/T221065) (owner: 10Gilles) [10:21:29] (03PS5) 10Elukey: [WIP] Enable Kerberos in the Analytics Hadoop Test cluster [puppet] - 10https://gerrit.wikimedia.org/r/504280 [10:21:32] (03CR) 10Jbond: [C: 03+1] "lgtm src commit is correct commit and requirements match" [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/504013 (owner: 10Volans) [10:21:43] !log installing xapian-core update from stretch point release [10:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:56] !log T221065 mwscript purgeList.php eswiki --all --verbose on mwmaint1002 [10:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:26] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (10fsero) [10:23:32] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: Make swift containers for docker registry cross replicated. 
- https://phabricator.wikimedia.org/T214289 (10fsero) 05Open→03Resolved I enabled cross replication for swift today and it seems to work. The replication seems to be qu... [10:24:57] (03PS1) 10Effie Mouzeli: thumbor: enable haproxy mtail metrics [puppet] - 10https://gerrit.wikimedia.org/r/504284 [10:25:23] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: Make swift containers for docker registry cross replicated. - https://phabricator.wikimedia.org/T214289 (10Joe) Yeah if replication model is eventual consistency, I think we just want a single discovery record that we make active/pa... [10:25:45] (03CR) 10jerkins-bot: [V: 04-1] thumbor: enable haproxy mtail metrics [puppet] - 10https://gerrit.wikimedia.org/r/504284 (owner: 10Effie Mouzeli) [10:25:53] 10Operations, 10Kubernetes: Evaluate VMWare's Harbour as a docker registry - https://phabricator.wikimedia.org/T202504 (10fsero) 05Open→03Resolved a:03fsero since we are not moving forward with Harbor at least for now, closing this task. The evaluation was done and the results are listed on this task htt... [10:26:55] (03PS6) 10Giuseppe Lavagetto: profile::docker::builder: add periodic job to prune old images [puppet] - 10https://gerrit.wikimedia.org/r/504029 [10:33:35] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15804/boron.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/504029 (owner: 10Giuseppe Lavagetto) [10:37:14] (03PS2) 10Effie Mouzeli: thumbor: enable haproxy mtail metrics [puppet] - 10https://gerrit.wikimedia.org/r/504284 [10:38:05] (03CR) 10jerkins-bot: [V: 04-1] thumbor: enable haproxy mtail metrics [puppet] - 10https://gerrit.wikimedia.org/r/504284 (owner: 10Effie Mouzeli) [10:43:47] PROBLEM - puppet last run on boron is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/var/log/prune-production-images/prune-production-images] [10:45:23] 10Operations, 10User-fgiunchedi: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 (10fgiunchedi) [10:45:38] !log installing libjs-bootstrap updates from Stretch point release [10:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:54] (03PS1) 10Elukey: profile::kerberos:*: only use global variables [puppet] - 10https://gerrit.wikimedia.org/r/504289 [10:49:31] !log T221065 eswiki purge finished [10:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:36] T221065: Set up origin trials on Spanish Wikipedia mobile site - https://phabricator.wikimedia.org/T221065 [10:49:39] 10Operations, 10Patch-For-Review: Integrate Stretch 9.8 point update - https://phabricator.wikimedia.org/T216384 (10MoritzMuehlenhoff) [10:51:21] jouncebot, next [10:51:21] In 0 hour(s) and 8 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190416T1100) [10:53:46] (03PS1) 10Giuseppe Lavagetto: profile::docker::builder: fix logging basedir [puppet] - 10https://gerrit.wikimedia.org/r/504291 [10:53:52] (03PS6) 10Elukey: [WIP] Enable Kerberos in the Analytics Hadoop Test cluster [puppet] - 10https://gerrit.wikimedia.org/r/504280 [10:54:38] <_joe_> come on jenkins [10:55:26] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::docker::builder: fix logging basedir [puppet] - 10https://gerrit.wikimedia.org/r/504291 (owner: 10Giuseppe Lavagetto) [10:58:53] (03PS1) 10Urbanecm: Enable signatures in 2019: NS (ID 128) for wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504292 (https://phabricator.wikimedia.org/T221062) [10:58:56] RECOVERY - puppet last run on boron is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [10:59:35] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10decommission: Decommission 
analytics1003 - https://phabricator.wikimedia.org/T206524 (10elukey) [10:59:43] 10Operations, 10ops-eqiad, 10Analytics, 10decommission: Decommission analytics100[1,2] - https://phabricator.wikimedia.org/T205507 (10elukey) [10:59:53] jouncebot, refresh [10:59:54] I refreshed my knowledge about deployments. [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a European Mid-day SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190416T1100). [11:00:04] hoo, alaa_wmde, and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] here :) [11:00:31] 10Operations, 10Phabricator, 10serviceops, 10Patch-For-Review, and 3 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832 (10elukey) [11:00:37] 10Operations, 10Continuous-Integration-Infrastructure, 10Goal, 10Release-Engineering-Team (Kanban), 10User-fgiunchedi: Add Prometheus exporter to Jenkins instances - https://phabricator.wikimedia.org/T182759 (10elukey) [11:00:47] (03PS16) 10Jbond: puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) [11:01:02] I'll SWAT [11:01:41] (03CR) 10jerkins-bot: [V: 04-1] puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [11:01:47] Urbanecm: If you want, I can start with your patch [11:01:58] hoo, I don't have problems with either [11:02:11] feel free to choose the order :-) [11:02:43] (03CR) 10Hoo man: [C: 03+2] Enable signatures in 2019: NS (ID 128) for wikimaniawiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/504292 (https://phabricator.wikimedia.org/T221062) (owner: 10Urbanecm) [11:03:48] (03Merged) 10jenkins-bot: Enable signatures in 2019: NS (ID 128) for wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504292 (https://phabricator.wikimedia.org/T221062) (owner: 10Urbanecm) [11:04:03] (03CR) 10jenkins-bot: Enable signatures in 2019: NS (ID 128) for wikimaniawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504292 (https://phabricator.wikimedia.org/T221062) (owner: 10Urbanecm) [11:04:30] Urbanecm: You can test on mwdebug1002 now [11:04:34] testing [11:06:05] (03PS8) 10Jbond: Add prometheus interface to spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/496496 [11:06:16] hoo, works, please deploy [11:06:30] Good :) [11:07:25] !log hoo@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable signatures in 2019: NS (ID 128) for wikimaniawiki (T221062) (duration: 00m 52s) [11:07:59] (03CR) 10Hoo man: [C: 03+2] Revert "WikibaseClient: Conditionally enable mapframe support" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503643 (https://phabricator.wikimedia.org/T218051) (owner: 10Hoo man) [11:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:12] T221062: Enable signature in VisualEditor for namespace "2019" - https://phabricator.wikimedia.org/T221062 [11:09:00] (03Merged) 10jenkins-bot: Revert "WikibaseClient: Conditionally enable mapframe support" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503643 (https://phabricator.wikimedia.org/T218051) (owner: 10Hoo man) [11:10:52] !log hoo@deploy1001 Synchronized wmf-config/Wikibase.php: Revert "WikibaseClient: Conditionally enable mapframe support" (T218051) (duration: 00m 51s) [11:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:00] (03PS7) 10Hoo man: Add wgWikibaseMusicalNotationLineWidthInches to config [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/498661 (https://phabricator.wikimedia.org/T218191) (owner: 10Alaa Sarhan) [11:14:45] (03CR) 10Hoo man: [C: 03+2] Add wgWikibaseMusicalNotationLineWidthInches to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498661 (https://phabricator.wikimedia.org/T218191) (owner: 10Alaa Sarhan) [11:15:01] (03PS1) 10Arturo Borrero Gonzalez: openstack: codfw1dev: update DB for neutron-server [puppet] - 10https://gerrit.wikimedia.org/r/504298 (https://phabricator.wikimedia.org/T219626) [11:15:30] (03CR) 10jerkins-bot: [V: 04-1] openstack: codfw1dev: update DB for neutron-server [puppet] - 10https://gerrit.wikimedia.org/r/504298 (https://phabricator.wikimedia.org/T219626) (owner: 10Arturo Borrero Gonzalez) [11:15:32] (03CR) 10jenkins-bot: Revert "WikibaseClient: Conditionally enable mapframe support" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503643 (https://phabricator.wikimedia.org/T218051) (owner: 10Hoo man) [11:15:49] (03Merged) 10jenkins-bot: Add wgWikibaseMusicalNotationLineWidthInches to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498661 (https://phabricator.wikimedia.org/T218191) (owner: 10Alaa Sarhan) [11:16:02] (03CR) 10jenkins-bot: Add wgWikibaseMusicalNotationLineWidthInches to config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498661 (https://phabricator.wikimedia.org/T218191) (owner: 10Alaa Sarhan) [11:16:10] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/504298 (https://phabricator.wikimedia.org/T219626) (owner: 10Arturo Borrero Gonzalez) [11:17:41] (03PS2) 10Arturo Borrero Gonzalez: openstack: codfw1dev: update DB for neutron-server [puppet] - 10https://gerrit.wikimedia.org/r/504298 (https://phabricator.wikimedia.org/T219626) [11:17:51] (03CR) 10Hoo man: Add wgWikibaseMusicalNotationLineWidthInches to config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498661 (https://phabricator.wikimedia.org/T218191) (owner: 
10Alaa Sarhan) [11:18:09] !log hoo@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add wgWikibaseMusicalNotationLineWidthInches to config (T218191) (duration: 00m 52s) [11:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:14] T218191: Long score overlaps with statement box and edit button - https://phabricator.wikimedia.org/T218191 [11:18:16] PROBLEM - puppet last run on mw1268 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:19:33] (03PS3) 10Arturo Borrero Gonzalez: openstack: codfw1dev: update DB for neutron-server [puppet] - 10https://gerrit.wikimedia.org/r/504298 (https://phabricator.wikimedia.org/T219626) [11:20:53] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/504298 (https://phabricator.wikimedia.org/T219626) (owner: 10Arturo Borrero Gonzalez) [11:22:59] (03PS4) 10Arturo Borrero Gonzalez: openstack: codfw1dev: update DB for neutron-server [puppet] - 10https://gerrit.wikimedia.org/r/504298 (https://phabricator.wikimedia.org/T219626) [11:23:32] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/504298 (https://phabricator.wikimedia.org/T219626) (owner: 10Arturo Borrero Gonzalez) [11:24:13] (03PS17) 10Jbond: puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) [11:24:29] (03CR) 10jerkins-bot: [V: 04-1] puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [11:25:01] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: codfw1dev: update DB for neutron-server [puppet] - 10https://gerrit.wikimedia.org/r/504298 (https://phabricator.wikimedia.org/T219626) (owner: 10Arturo Borrero Gonzalez) [11:28:16] (03PS18) 10Jbond: puppet: Refactor of the base::puppet class [puppet] - 
10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) [11:33:18] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:34:10] (03PS7) 10Elukey: [WIP] Enable Kerberos in the Analytics Hadoop Test cluster [puppet] - 10https://gerrit.wikimedia.org/r/504280 [11:35:22] (03CR) 10Volans: "Looks good, minor nitpicks/improvements inline, but almost ready." (0316 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/496496 (owner: 10Jbond) [11:36:22] (03PS3) 10Effie Mouzeli: thumbor: enable haproxy mtail metrics [puppet] - 10https://gerrit.wikimedia.org/r/504284 [11:36:30] (03PS13) 10Faidon Liambotis: Port MakeVM to a cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) (owner: 10CRusnov) [11:37:11] (03CR) 10jerkins-bot: [V: 04-1] thumbor: enable haproxy mtail metrics [puppet] - 10https://gerrit.wikimedia.org/r/504284 (owner: 10Effie Mouzeli) [11:37:46] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10aborrero) [11:39:32] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10aborrero) [11:40:17] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudcontrol2001-dev & cloudvirt200[123]-dev - https://phabricator.wikimedia.org/T214448 (10aborrero) 05Open→03Resolved This should be done now. >>! In T214448#5107629, @Dzahn wrote: > on [cloudcontrol2001-dev: : > >... 
[11:41:02] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt200[123]-dev.codfw.wmnet: update role name [puppet] - 10https://gerrit.wikimedia.org/r/504302 [11:44:17] (03PS1) 10Jbond: facter3/puppet5: migrate systems to puppet5/facter3 [puppet] - 10https://gerrit.wikimedia.org/r/504303 (https://phabricator.wikimedia.org/T219803) [11:44:38] RECOVERY - puppet last run on mw1268 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [11:45:31] (03PS1) 10Elukey: Remove old override for aqs1004 [puppet] - 10https://gerrit.wikimedia.org/r/504305 [11:46:47] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/15813/" [puppet] - 10https://gerrit.wikimedia.org/r/504305 (owner: 10Elukey) [11:46:55] (03PS2) 10Elukey: Remove old override for aqs1004 [puppet] - 10https://gerrit.wikimedia.org/r/504305 [11:48:22] (03PS2) 10Arturo Borrero Gonzalez: cloudvirt200[123]-dev.codfw.wmnet: update role name [puppet] - 10https://gerrit.wikimedia.org/r/504302 [11:49:15] (03CR) 10Elukey: [C: 03+1] facter3/puppet5: migrate systems to puppet5/facter3 [puppet] - 10https://gerrit.wikimedia.org/r/504303 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [11:50:41] (03PS2) 10Jbond: facter3/puppet5: migrate systems to puppet5/facter3 [puppet] - 10https://gerrit.wikimedia.org/r/504303 (https://phabricator.wikimedia.org/T219803) [11:51:45] (03CR) 10Jbond: [C: 03+2] facter3/puppet5: migrate systems to puppet5/facter3 [puppet] - 10https://gerrit.wikimedia.org/r/504303 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [11:52:04] (03PS4) 10Effie Mouzeli: thumbor: enable haproxy mtail metrics [puppet] - 10https://gerrit.wikimedia.org/r/504284 [11:52:51] (03CR) 10jerkins-bot: [V: 04-1] thumbor: enable haproxy mtail metrics [puppet] - 10https://gerrit.wikimedia.org/r/504284 (owner: 10Effie Mouzeli) [11:53:08] (03PS3) 10Arturo Borrero Gonzalez: cloudvirt200[123]-dev.codfw.wmnet: update role name [puppet] - 10https://gerrit.wikimedia.org/r/504302 
[11:54:28] (03CR) 10Vgutierrez: cache: implement profile::cache::varnish::backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [11:55:13] (03PS1) 10Elukey: Remove stat1004 hiera host specific config [puppet] - 10https://gerrit.wikimedia.org/r/504306 [11:56:42] (03CR) 10Muehlenhoff: raid: add ssacli class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503334 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [11:56:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected https://puppet-compiler.wmflabs.org/compiler1001/15815/" [puppet] - 10https://gerrit.wikimedia.org/r/504302 (owner: 10Arturo Borrero Gonzalez) [11:57:16] (03CR) 10Vgutierrez: cache: add profile::cache::varnish::frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [11:57:18] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/15816/" [puppet] - 10https://gerrit.wikimedia.org/r/504306 (owner: 10Elukey) [11:57:26] (03PS2) 10Elukey: Remove stat1004 hiera host specific config [puppet] - 10https://gerrit.wikimedia.org/r/504306 [11:57:29] (03CR) 10Elukey: [V: 03+2 C: 03+2] Remove stat1004 hiera host specific config [puppet] - 10https://gerrit.wikimedia.org/r/504306 (owner: 10Elukey) [11:58:58] (03PS4) 10DCausse: [cirrus] add cloudelastic service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502832 (https://phabricator.wikimedia.org/T220625) [12:00:13] (03CR) 10Giuseppe Lavagetto: confctl: Add filter_objects and update_objects (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/503946 (owner: 10Giuseppe Lavagetto) [12:00:49] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [12:02:31] (03PS1) 10Elukey: role::an_cluster::hadoop::master|standby: 
remove unused hiera config [puppet] - 10https://gerrit.wikimedia.org/r/504307 [12:03:45] (03PS1) 10Jbond: facter3/puppet5: migrate systems to puppet5/facter3 [puppet] - 10https://gerrit.wikimedia.org/r/504308 (https://phabricator.wikimedia.org/T219803) [12:04:22] (03CR) 10Jbond: [C: 03+2] facter3/puppet5: migrate systems to puppet5/facter3 [puppet] - 10https://gerrit.wikimedia.org/r/504308 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [12:04:29] (03PS1) 10Arturo Borrero Gonzalez: cumin: aliases: include codfw1dev openstack deployment [puppet] - 10https://gerrit.wikimedia.org/r/504309 [12:04:32] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/15817/" [puppet] - 10https://gerrit.wikimedia.org/r/504307 (owner: 10Elukey) [12:04:40] (03PS2) 10Elukey: role::an_cluster::hadoop::master|standby: remove unused hiera config [puppet] - 10https://gerrit.wikimedia.org/r/504307 [12:04:42] (03CR) 10Elukey: [V: 03+2 C: 03+2] role::an_cluster::hadoop::master|standby: remove unused hiera config [puppet] - 10https://gerrit.wikimedia.org/r/504307 (owner: 10Elukey) [12:04:58] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:06:00] (03CR) 10Muehlenhoff: "Let's also yank the old labtest alias in the same patch?" [puppet] - 10https://gerrit.wikimedia.org/r/504309 (owner: 10Arturo Borrero Gonzalez) [12:07:31] !log rebooting cloudvirt200[123]-dev because deep changes in config [12:07:47] 10Operations, 10serviceops, 10vm-requests, 10User-jijiki: Site: 4 VM request for kubernetes - https://phabricator.wikimedia.org/T220822 (10jijiki) [12:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:24] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:09:29] (03PS5) 10Effie Mouzeli: thumbor: enable haproxy mtail metrics [puppet] - 10https://gerrit.wikimedia.org/r/504284 [12:11:22] (03PS1) 10Alexandros Kosiaris: Introduce kubernetes{1,2}00{5,6}.{eqiad,codfw}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/504311 (https://phabricator.wikimedia.org/T220822) [12:11:42] (03CR) 10jerkins-bot: [V: 04-1] Introduce kubernetes{1,2}00{5,6}.{eqiad,codfw}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/504311 (https://phabricator.wikimedia.org/T220822) (owner: 10Alexandros Kosiaris) [12:13:24] PROBLEM - puppet last run on ms-be2020 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [12:14:49] (03PS2) 10Alexandros Kosiaris: Introduce kubernetes{1,2}00{5,6}.{eqiad,codfw}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/504311 (https://phabricator.wikimedia.org/T220822) [12:15:44] (03CR) 10jerkins-bot: [V: 04-1] Introduce kubernetes{1,2}00{5,6}.{eqiad,codfw}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/504311 (https://phabricator.wikimedia.org/T220822) (owner: 10Alexandros Kosiaris) [12:17:04] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [12:17:09] !log installing OpenSSL 1.0.2 updates on cp* Varnish hosts [12:17:10] PROBLEM - Unmerged changes on repository puppet on puppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [12:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:26] PROBLEM - Unmerged changes on repository puppet on puppetmaster2002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[12:20:18] (03PS6) 10Effie Mouzeli: thumbor: enable haproxy mtail metrics [puppet] - 10https://gerrit.wikimedia.org/r/504284 (https://phabricator.wikimedia.org/T220499) [12:22:40] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/504284 (https://phabricator.wikimedia.org/T220499) (owner: 10Effie Mouzeli) [12:25:02] (03PS3) 10Alexandros Kosiaris: Introduce kubernetes{1,2}00{5,6}.{eqiad,codfw}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/504311 (https://phabricator.wikimedia.org/T220822) [12:26:46] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/504309 (owner: 10Arturo Borrero Gonzalez) [12:27:39] (03PS2) 10Arturo Borrero Gonzalez: cumin: aliases: include codfw1dev openstack deployment [puppet] - 10https://gerrit.wikimedia.org/r/504309 [12:28:33] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/504309 (owner: 10Arturo Borrero Gonzalez) [12:29:51] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10jbond) uniqueid fact is also missing [12:32:05] (03CR) 10Ema: cache: implement profile::cache::varnish::backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [12:32:55] (03CR) 10Ema: cache: add profile::cache::varnish::frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [12:35:04] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:37:31] (03PS1) 10Arturo Borrero Gonzalez: labtestcontrol2003: reimage and rename to cloudcontrol2003-dev [puppet] - 10https://gerrit.wikimedia.org/r/504320 (https://phabricator.wikimedia.org/T220095) [12:39:02] RECOVERY - puppet last run on ms-be2020 is OK: OK: Puppet is currently enabled, last run 31 
seconds ago with 0 failures [12:41:52] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [12:46:25] (03PS1) 10Arturo Borrero Gonzalez: labtestcontrol2003: rename to cloudcontrol2003-dev [dns] - 10https://gerrit.wikimedia.org/r/504321 (https://phabricator.wikimedia.org/T220095) [12:47:08] (03PS1) 10Jbond: facter3: add uniqueid fact [puppet] - 10https://gerrit.wikimedia.org/r/504322 (https://phabricator.wikimedia.org/T219803) [12:47:46] (03CR) 10Marostegui: [C: 03+1] raid: refactor structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [12:47:48] elukey: it looks like 639f3aed280ea2fd535877d44556a2c9e2279fbb hasn't been properly merged across the puppet masters, specifically labpuppetmaster1001 [12:47:49] (03CR) 10jerkins-bot: [V: 04-1] facter3: add uniqueid fact [puppet] - 10https://gerrit.wikimedia.org/r/504322 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [12:48:22] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1001 is OK: No changes to merge. 
[12:48:59] and fixed apparently :) [12:49:52] (03PS1) 10Gilles: Add tests for haproxy mtail program [puppet] - 10https://gerrit.wikimedia.org/r/504323 (https://phabricator.wikimedia.org/T220499) [12:50:19] (03CR) 10jerkins-bot: [V: 04-1] Add tests for haproxy mtail program [puppet] - 10https://gerrit.wikimedia.org/r/504323 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [12:50:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestcontrol2003: rename to cloudcontrol2003-dev [dns] - 10https://gerrit.wikimedia.org/r/504321 (https://phabricator.wikimedia.org/T220095) (owner: 10Arturo Borrero Gonzalez) [12:51:21] (03PS2) 10Gilles: Add tests for haproxy mtail program [puppet] - 10https://gerrit.wikimedia.org/r/504323 (https://phabricator.wikimedia.org/T220499) [12:51:36] (03CR) 10jerkins-bot: [V: 04-1] Add tests for haproxy mtail program [puppet] - 10https://gerrit.wikimedia.org/r/504323 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [12:51:36] RECOVERY - Unmerged changes on repository puppet on puppetmaster2001 is OK: No changes to merge. [12:52:00] (03PS2) 10Jbond: facter3: add uniqueid fact [puppet] - 10https://gerrit.wikimedia.org/r/504322 (https://phabricator.wikimedia.org/T219803) [12:52:30] !log swift eqiad-prod continue ms-be1013 decom - T220590 [12:52:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestcontrol2003: reimage and rename to cloudcontrol2003-dev [puppet] - 10https://gerrit.wikimedia.org/r/504320 (https://phabricator.wikimedia.org/T220095) (owner: 10Arturo Borrero Gonzalez) [12:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:58] T220590: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 [12:53:08] RECOVERY - Unmerged changes on repository puppet on puppetmaster2002 is OK: No changes to merge. [12:53:56] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. 
[12:54:42] !log cleanup redundant prometheus-elasticsearch units on elasticsearch servers [12:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:05] !log installing ghostscript update on thumbor1001 [12:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:20] (03CR) 10Gilles: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/504323 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [13:00:44] (03CR) 10jerkins-bot: [V: 04-1] Add tests for haproxy mtail program [puppet] - 10https://gerrit.wikimedia.org/r/504323 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [13:01:40] (03PS2) 10Ema: 0.2: use scanner.Bytes instead of .Text [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/499827 [13:01:46] (03CR) 10Ottomata: [C: 03+1] profile::hadoop::spark2: auto upload of spark2-assembly.zip [puppet] - 10https://gerrit.wikimedia.org/r/504268 (https://phabricator.wikimedia.org/T218343) (owner: 10Elukey) [13:02:08] (03CR) 10Vgutierrez: [C: 03+1] cache: implement profile::cache::varnish::backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/503381 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [13:02:18] (03CR) 10Vgutierrez: [C: 03+1] cache: add profile::cache::varnish::frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [13:02:55] (03PS3) 10Gilles: Add tests for haproxy mtail program [puppet] - 10https://gerrit.wikimedia.org/r/504323 (https://phabricator.wikimedia.org/T220499) [13:03:01] !log T220095 renaming/reimaging labtestcontrol2003 as cloudcontrol2003-dev [13:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:06] T220095: labtestcontrol2003: rename to cloudcontrol2003-dev - https://phabricator.wikimedia.org/T220095 [13:04:59] (03CR) 10Ema: [C: 03+2] 0.2: use scanner.Bytes instead of .Text [software/fifo-log-demux] - 
10https://gerrit.wikimedia.org/r/499827 (owner: 10Ema) [13:05:13] (03PS2) 10Elukey: profile::hadoop::spark2: auto upload of spark2-assembly.zip [puppet] - 10https://gerrit.wikimedia.org/r/504268 (https://phabricator.wikimedia.org/T218343) [13:05:17] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15819/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/504268 (https://phabricator.wikimedia.org/T218343) (owner: 10Elukey) [13:05:41] (03Abandoned) 10Elukey: mapred-site.xml: test [puppet/cdh] - 10https://gerrit.wikimedia.org/r/498135 (owner: 10Elukey) [13:06:09] (03Abandoned) 10Elukey: profile::hadoop::common: add ssl parameter to ssl-config.xml's set [puppet] - 10https://gerrit.wikimedia.org/r/496486 (https://phabricator.wikimedia.org/T217412) (owner: 10Elukey) [13:06:38] (03CR) 10Alexandros Kosiaris: [C: 03+2] Introduce kubernetes{1,2}00{5,6}.{eqiad,codfw}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/504311 (https://phabricator.wikimedia.org/T220822) (owner: 10Alexandros Kosiaris) [13:06:42] (03PS4) 10Alexandros Kosiaris: Introduce kubernetes{1,2}00{5,6}.{eqiad,codfw}.wmnet [dns] - 10https://gerrit.wikimedia.org/r/504311 (https://phabricator.wikimedia.org/T220822) [13:08:00] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. 
In the future we should probably move away from using this fact (and the underlying hostid(1) given that systemd generates /et" [puppet] - 10https://gerrit.wikimedia.org/r/504322 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [13:10:44] !log fifo-log-demux 0.2 uploaded to stretch-wikimedia [13:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:27] (03PS1) 10Mathew.onipe: maps: enable prometheus exporter for cassandra [puppet] - 10https://gerrit.wikimedia.org/r/504326 (https://phabricator.wikimedia.org/T221055) [13:12:57] (03CR) 10jerkins-bot: [V: 04-1] maps: enable prometheus exporter for cassandra [puppet] - 10https://gerrit.wikimedia.org/r/504326 (https://phabricator.wikimedia.org/T221055) (owner: 10Mathew.onipe) [13:13:12] !log cp-ats: upgrade fifo-log-demux to 0.2 and restart services [13:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:08] (03PS2) 10Mathew.onipe: maps: enable prometheus exporter for cassandra [puppet] - 10https://gerrit.wikimedia.org/r/504326 (https://phabricator.wikimedia.org/T221055) [13:15:22] (03CR) 10Gehel: [C: 03+2] Add wdqs data transfer cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/488256 (https://phabricator.wikimedia.org/T213401) (owner: 10Mathew.onipe) [13:18:09] 10Operations, 10Thumbor, 10serviceops: Replace nutcracker with mcrouter on thumbor* - https://phabricator.wikimedia.org/T221081 (10jijiki) [13:18:38] 10Operations, 10Thumbor, 10serviceops: Replace nutcracker with mcrouter on thumbor* - https://phabricator.wikimedia.org/T221081 (10jijiki) p:05Triage→03Normal [13:19:39] (03PS1) 10Gehel: wdqs: add arguments to not use LVS [cookbooks] - 10https://gerrit.wikimedia.org/r/504327 (https://phabricator.wikimedia.org/T213401) [13:20:30] (03CR) 10Herron: [C: 03+1] Add SPF records for non-canonical non-parked domains [dns] - 10https://gerrit.wikimedia.org/r/503258 (https://phabricator.wikimedia.org/T220786) (owner: 10Vgutierrez) 
[13:22:43] (03CR) 10Mathew.onipe: [C: 03+1] wdqs: add arguments to not use LVS [cookbooks] - 10https://gerrit.wikimedia.org/r/504327 (https://phabricator.wikimedia.org/T213401) (owner: 10Gehel) [13:22:53] (03CR) 10Gehel: [C: 03+2] wdqs: add arguments to not use LVS [cookbooks] - 10https://gerrit.wikimedia.org/r/504327 (https://phabricator.wikimedia.org/T213401) (owner: 10Gehel) [13:27:11] (03CR) 10Effie Mouzeli: [C: 03+2] Add tests for haproxy mtail program [puppet] - 10https://gerrit.wikimedia.org/r/504323 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [13:28:04] (03CR) 10Mathew.onipe: "Changes are expected according to PCC: https://puppet-compiler.wmflabs.org/compiler1002/15821/" [puppet] - 10https://gerrit.wikimedia.org/r/504326 (https://phabricator.wikimedia.org/T221055) (owner: 10Mathew.onipe) [13:31:37] (03CR) 10Gehel: [C: 03+1] "LGTM, review from Elukey would be welcomed!" [puppet] - 10https://gerrit.wikimedia.org/r/504326 (https://phabricator.wikimedia.org/T221055) (owner: 10Mathew.onipe) [13:32:15] (03CR) 10CDanis: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe Lavagetto) [13:33:43] (03CR) 10Effie Mouzeli: [V: 03+1] "LGTM https://puppet-compiler.wmflabs.org/compiler1002/15822/thumbor1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/504284 (https://phabricator.wikimedia.org/T220499) (owner: 10Effie Mouzeli) [13:34:09] !log Disabling puppet on thumbor* to merge 504284 [13:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:29] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/504323 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [13:34:39] (03CR) 10jerkins-bot: [V: 04-1] Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - 10https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: 10Giuseppe 
Lavagetto) [13:35:22] 10Operations, 10Puppet: puppet fact: migrate away from the uniqueid fact - https://phabricator.wikimedia.org/T221083 (10jbond) [13:35:39] (03CR) 10Jbond: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/504322 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [13:35:48] (03PS3) 10Jbond: facter3: add uniqueid fact [puppet] - 10https://gerrit.wikimedia.org/r/504322 (https://phabricator.wikimedia.org/T219803) [13:37:09] (03CR) 10Jbond: [C: 03+2] facter3: add uniqueid fact [puppet] - 10https://gerrit.wikimedia.org/r/504322 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [13:38:06] (03CR) 10Elukey: [C: 03+1] "it would be great if all the clusters were using profile::cassandra (even with only one instance) but it is probably too big and not in sc" [puppet] - 10https://gerrit.wikimedia.org/r/504326 (https://phabricator.wikimedia.org/T221055) (owner: 10Mathew.onipe) [13:38:36] (03CR) 10Effie Mouzeli: [V: 03+1] "LGTM https://puppet-compiler.wmflabs.org/compiler1002/15823/prometheus1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/504284 (https://phabricator.wikimedia.org/T220499) (owner: 10Effie Mouzeli) [13:38:41] PROBLEM - puppet last run on cumin1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] [13:38:43] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+2] thumbor: enable haproxy mtail metrics [puppet] - 10https://gerrit.wikimedia.org/r/504284 (https://phabricator.wikimedia.org/T220499) (owner: 10Effie Mouzeli) [13:38:57] (03PS7) 10Effie Mouzeli: thumbor: enable haproxy mtail metrics [puppet] - 10https://gerrit.wikimedia.org/r/504284 (https://phabricator.wikimedia.org/T220499) [13:39:08] !log resetting cookbooks repo on cumin1001 (local changes) [13:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:35] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer [13:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:00] !log gehel@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [13:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:44] gehel: what did I do wrong? :) [13:43:01] not sure yet, checking the logs [13:43:11] (03PS1) 10Ottomata: eventgate-analytics - Parameterize some service runner settings with defaults [deployment-charts] - 10https://gerrit.wikimedia.org/r/504330 [13:43:55] RECOVERY - puppet last run on cumin1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:46:09] (03PS1) 10Ema: swift: new cert for ms-fe.svc.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/504331 (https://phabricator.wikimedia.org/T204245) [13:47:14] (03CR) 10Muehlenhoff: [C: 03+1] profile::kerberos:*: only use global variables [puppet] - 10https://gerrit.wikimedia.org/r/504289 (owner: 10Elukey) [13:49:42] (03PS1) 10Gehel: wdqs: ensure reaceiver has started before sending file [cookbooks] - 10https://gerrit.wikimedia.org/r/504332 (https://phabricator.wikimedia.org/T213401) [13:49:48] onimisionipe: ^ [13:50:03] (03PS2) 10Gehel: wdqs: ensure receiver has started before sending file [cookbooks] - 10https://gerrit.wikimedia.org/r/504332 
(https://phabricator.wikimedia.org/T213401) [13:50:40] ah.. I see. [13:52:03] !log Enable puppet on thumbor* [13:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:15] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/504332 (https://phabricator.wikimedia.org/T213401) (owner: 10Gehel) [13:52:39] (03CR) 10Gehel: [C: 03+2] wdqs: ensure receiver has started before sending file [cookbooks] - 10https://gerrit.wikimedia.org/r/504332 (https://phabricator.wikimedia.org/T213401) (owner: 10Gehel) [13:53:52] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer [13:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:57] (03PS4) 10Effie Mouzeli: Add tests for haproxy mtail program [puppet] - 10https://gerrit.wikimedia.org/r/504323 (https://phabricator.wikimedia.org/T220499) (owner: 10Gilles) [13:54:03] !log gehel@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [13:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:58] (03PS1) 10Gehel: wdqs: fix source and destination that were inverted [cookbooks] - 10https://gerrit.wikimedia.org/r/504333 (https://phabricator.wikimedia.org/T213401) [13:57:07] onimisionipe: ^ and one more [13:57:20] looking [13:57:52] damn! sorry! [13:57:56] (03CR) 10Jcrespo: "I am not sure the new stuff will work or it is rational without changing that (global variables), but ok." 
[puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [13:58:01] (03CR) 10Mathew.onipe: [C: 03+1] wdqs: fix source and destination that were inverted [cookbooks] - 10https://gerrit.wikimedia.org/r/504333 (https://phabricator.wikimedia.org/T213401) (owner: 10Gehel) [13:58:35] !log Disable puppet on thumbor1001 for ~24h to serve traffic via haproxy - T187765 [13:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:39] T187765: Replace the Nginx fronting Thumbor with a reverse proxy capable of queuing requests - https://phabricator.wikimedia.org/T187765 [14:01:30] !log Depooling thumbor1001 [14:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:58] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=ms-fe2005.codfw.wmnet [14:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:42] (03CR) 10Jbond: [C: 03+1] "this looks good but i wonder if it would be better to expose some facts to use instead of making this a global variable" [puppet] - 10https://gerrit.wikimedia.org/r/504289 (owner: 10Elukey) [14:04:29] !log test https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/504331/ on ms-fe2005 T204245 [14:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:33] T204245: Run MediaWiki media originals active/active - https://phabricator.wikimedia.org/T204245 [14:07:56] !log Pooling thumbor1001 [14:08:06] (03PS2) 10Elukey: profile::kerberos:*: only use global variables [puppet] - 10https://gerrit.wikimedia.org/r/504289 [14:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:14] (03CR) 10Elukey: [C: 03+2] profile::kerberos:*: only use global variables [puppet] - 10https://gerrit.wikimedia.org/r/504289 (owner: 10Elukey) [14:10:44] (03CR) 10Vgutierrez: [C: 03+2] Add SPF records for non-canonical non-parked domains [dns] 
- 10https://gerrit.wikimedia.org/r/503258 (https://phabricator.wikimedia.org/T220786) (owner: 10Vgutierrez) [14:10:58] (03PS3) 10Vgutierrez: Add SPF records for non-canonical non-parked domains [dns] - 10https://gerrit.wikimedia.org/r/503258 (https://phabricator.wikimedia.org/T220786) [14:13:58] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review: Add SPF record for non-canonical domains that are not parked - https://phabricator.wikimedia.org/T220786 (10Vgutierrez) [14:17:02] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe2005.codfw.wmnet [14:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:19] (03CR) 10Ema: [C: 03+2] swift: new cert for ms-fe.svc.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/504331 (https://phabricator.wikimedia.org/T204245) (owner: 10Ema) [14:19:22] (03CR) 10Gehel: [C: 03+2] wdqs: fix source and destination that were inverted [cookbooks] - 10https://gerrit.wikimedia.org/r/504333 (https://phabricator.wikimedia.org/T213401) (owner: 10Gehel) [14:19:27] (03PS2) 10Ema: swift: new cert for ms-fe.svc.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/504331 (https://phabricator.wikimedia.org/T204245) [14:20:03] !log roll restart of all the druid daemons on druid100[1-6] to pick up new openjdk updates [14:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:21] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer [14:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:25] !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [14:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:35] (03PS1) 10Effie Mouzeli: thumbor: Fix mtail group and log path [puppet] - 10https://gerrit.wikimedia.org/r/504335 (https://phabricator.wikimedia.org/T220499) [14:23:17] that was suspiciously fast! 
[14:23:58] PROBLEM - DPKG on ms-be1018 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:24:00] PROBLEM - DPKG on ms-be1027 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:24:04] RECOVERY - Free Blazegraph allocators wdqs-blazegraph on wdqs1010 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=32&fullscreen [14:24:20] PROBLEM - DPKG on ms-be1039 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:24:46] ^ dpkg is me [14:25:44] PROBLEM - High lag on wdqs1009 is CRITICAL: 2.592e+06 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [14:26:08] RECOVERY - DPKG on ms-be1027 is OK: All packages OK [14:26:28] RECOVERY - DPKG on ms-be1039 is OK: All packages OK [14:27:29] (03CR) 10Effie Mouzeli: [V: 03+1] "LGTM https://puppet-compiler.wmflabs.org/compiler1002/15825/thumbor1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/504335 (https://phabricator.wikimedia.org/T220499) (owner: 10Effie Mouzeli) [14:27:45] gehel: cookbook worked? [14:27:59] onimisionipe: nope, still digging in the logs [14:28:05] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: cergen: exceptions trying to add alt_name - https://phabricator.wikimedia.org/T220591 (10ema) 05Open→03Resolved @Ottomata thanks! The new error message is helpful, and the proposed solution works. 
[14:28:55] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+2] thumbor: Fix mtail group and log path [puppet] - 10https://gerrit.wikimedia.org/r/504335 (https://phabricator.wikimedia.org/T220499) (owner: 10Effie Mouzeli) [14:29:12] (03PS2) 10Effie Mouzeli: thumbor: Fix mtail group and log path [puppet] - 10https://gerrit.wikimedia.org/r/504335 (https://phabricator.wikimedia.org/T220499) [14:30:22] !log swift-fe-codfw: nginx reload for new TLS certificate T204245 [14:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:26] T204245: Run MediaWiki media originals active/active - https://phabricator.wikimedia.org/T204245 [14:36:30] RECOVERY - DPKG on ms-be1018 is OK: All packages OK [14:38:18] (03PS1) 10Ema: swift: new cert for ms-fe.svc.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/504340 (https://phabricator.wikimedia.org/T204245) [14:40:02] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission: labtestcontrol2001.wikimedia.org - https://phabricator.wikimedia.org/T218021 (10Papaul) [14:40:16] (03PS1) 10Alexandros Kosiaris: install: Fix a typo with kubestagetcd [puppet] - 10https://gerrit.wikimedia.org/r/504341 [14:40:19] (03PS1) 10Alexandros Kosiaris: Add kubernetes[12]00[56] [puppet] - 10https://gerrit.wikimedia.org/r/504342 (https://phabricator.wikimedia.org/T220822) [14:40:23] (03PS1) 10Gehel: wdqs: add some logging to data transfer cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504343 (https://phabricator.wikimedia.org/T213401) [14:41:10] (03PS9) 10Jbond: Add prometheus interface to spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/496496 [14:41:45] onimisionipe: ^^ [14:41:57] looking [14:43:12] (03CR) 10Alexandros Kosiaris: [C: 03+2] install: Fix a typo with kubestagetcd [puppet] - 10https://gerrit.wikimedia.org/r/504341 (owner: 10Alexandros Kosiaris) [14:43:18] (03CR) 10jerkins-bot: [V: 04-1] wdqs: add some logging to data transfer cookbook 
[cookbooks] - 10https://gerrit.wikimedia.org/r/504343 (https://phabricator.wikimedia.org/T213401) (owner: 10Gehel) [14:45:08] (03PS2) 10Gehel: wdqs: add some logging to data transfer cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504343 (https://phabricator.wikimedia.org/T213401) [14:45:36] !log test https://gerrit.wikimedia.org/r/504340 on ms-fe1005 T204245 [14:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:41] T204245: Run MediaWiki media originals active/active - https://phabricator.wikimedia.org/T204245 [14:46:44] (03PS2) 10Ema: swift: new cert for ms-fe.svc.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/504340 (https://phabricator.wikimedia.org/T204245) [14:47:21] (03CR) 10Ema: [C: 03+2] swift: new cert for ms-fe.svc.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/504340 (https://phabricator.wikimedia.org/T204245) (owner: 10Ema) [14:49:16] (03CR) 10Mathew.onipe: [C: 03+1] wdqs: add some logging to data transfer cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504343 (https://phabricator.wikimedia.org/T213401) (owner: 10Gehel) [14:49:19] (03PS1) 10Ottomata: Include procps in wmfdebug image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/504345 [14:50:58] (03PS1) 10Ottomata: eventgate-analytics Add wmfdebug and profile options for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/504346 (https://phabricator.wikimedia.org/T220661) [14:51:12] (03CR) 10Jbond: [C: 03+2] raid: refactor structure [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) (owner: 10Jbond) [14:51:15] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - Parameterize some service runner settings with defaults [deployment-charts] - 10https://gerrit.wikimedia.org/r/504330 (owner: 10Ottomata) [14:51:19] (03CR) 10Gehel: [C: 03+2] wdqs: add some logging to data transfer cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504343 
(https://phabricator.wikimedia.org/T213401) (owner: 10Gehel) [14:51:22] (03PS7) 10Jbond: raid: refactor structure [puppet] - 10https://gerrit.wikimedia.org/r/503333 (https://phabricator.wikimedia.org/T220787) [14:51:47] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ms-fe1005.eqiad.wmnet [14:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:26] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer [14:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:07] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: expose metrics in prometheus format for new docker-registry and create a grafana dashboard - https://phabricator.wikimedia.org/T221099 (10fsero) [14:53:11] (03CR) 10Alexandros Kosiaris: [C: 04-1] Include procps in wmfdebug image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/504345 (owner: 10Ottomata) [14:53:13] (03PS2) 10Ottomata: eventgate-analytics Add wmfdebug and profile options for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/504346 (https://phabricator.wikimedia.org/T220661) [14:53:25] !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [14:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:00] (03PS2) 10Ottomata: Include procps in wmfdebug image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/504345 [14:54:12] (03CR) 10Ottomata: "Thx, copy/paste booboo" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/504345 (owner: 10Ottomata) [14:54:24] (03PS1) 10Jbond: Revert "raid: refactor structure" [puppet] - 10https://gerrit.wikimedia.org/r/504347 [14:54:30] PROBLEM - High lag on wdqs1010 is CRITICAL: 2.59e+06 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [14:55:14] PROBLEM - High lag on wdqs1009 is CRITICAL: 
2.59e+06 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [14:55:26] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: migrate endpoint from old registry instance to new one - https://phabricator.wikimedia.org/T221101 (10fsero) [14:55:35] ^ wdqs* is me, silencing [14:56:29] !log swift-fe-eqiad: nginx reload for new TLS certificate T204245 [14:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:33] T204245: Run MediaWiki media originals active/active - https://phabricator.wikimedia.org/T204245 [14:56:47] (03PS3) 10Ottomata: eventgate-analytics Add wmfdebug and profile options for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/504346 (https://phabricator.wikimedia.org/T220661) [14:58:42] (03PS1) 10Muehlenhoff: Fix data type for krb_kdc_servers [puppet] - 10https://gerrit.wikimedia.org/r/504348 [14:58:54] !log manual data transfer from wdqs1008 to wdqs1009 - T220830 [14:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:58] T220830: data reimport on wdqs1009 and wdqs1010 - https://phabricator.wikimedia.org/T220830 [15:00:05] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics Add wmfdebug and profile options for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/504346 (https://phabricator.wikimedia.org/T220661) (owner: 10Ottomata) [15:01:11] (03CR) 10Fsero: eventgate-analytics Add wmfdebug and profile options for debugging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/504346 (https://phabricator.wikimedia.org/T220661) (owner: 10Ottomata) [15:02:14] (03CR) 10Andrew Bogott: [C: 03+1] "This looks fine to me, although it changes a bunch of files that will be ripped out shortly." 
[puppet] - 10https://gerrit.wikimedia.org/r/503407 (owner: 10Alex Monk) [15:04:37] 10Operations, 10Cassandra, 10Maps, 10Patch-For-Review: Collect metrics on maps cassandra - https://phabricator.wikimedia.org/T221055 (10Gehel) a:03Mathew.onipe [15:06:02] (03PS1) 10Ema: Revert "ATS: use 'swift-ro' as the origin server for thumb traffic" [puppet] - 10https://gerrit.wikimedia.org/r/504349 (https://phabricator.wikimedia.org/T204245) [15:07:30] (03Abandoned) 10Jbond: Revert "raid: refactor structure" [puppet] - 10https://gerrit.wikimedia.org/r/504347 (owner: 10Jbond) [15:08:52] 10Operations, 10Maps (Tilerator): Tilerator crashed on maps200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T219849 (10Gehel) 05Open→03Resolved a:03Gehel Stretch migration is completed. This should be fixed, we'll reopen if this happens again. [15:08:55] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10Gehel) [15:08:59] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics Add wmfdebug and profile options for debugging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/504346 (https://phabricator.wikimedia.org/T220661) (owner: 10Ottomata) [15:09:07] (03PS3) 10Ema: trafficserver: Switch to using swift.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/502458 (https://phabricator.wikimedia.org/T204245) (owner: 10Alexandros Kosiaris) [15:09:54] (03PS1) 10Arturo Borrero Gonzalez: labtestcontrol2003: cleanup [dns] - 10https://gerrit.wikimedia.org/r/504350 (https://phabricator.wikimedia.org/T220095) [15:10:46] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestcontrol2003: cleanup [dns] - 10https://gerrit.wikimedia.org/r/504350 (https://phabricator.wikimedia.org/T220095) (owner: 10Arturo Borrero Gonzalez) [15:10:49] (03CR) 10Ema: [C: 03+2] Revert "ATS: use 'swift-ro' as the origin server for thumb traffic" [puppet] - 
10https://gerrit.wikimedia.org/r/504349 (https://phabricator.wikimedia.org/T204245) (owner: 10Ema) [15:11:03] (03CR) 10Ema: [C: 03+2] trafficserver: Switch to using swift.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/502458 (https://phabricator.wikimedia.org/T204245) (owner: 10Alexandros Kosiaris) [15:12:01] (03PS1) 10Muehlenhoff: Add initial Hiera settings for kdc1001 [puppet] - 10https://gerrit.wikimedia.org/r/504351 [15:12:03] (03PS1) 10Muehlenhoff: Assign role for kerberos1001 [puppet] - 10https://gerrit.wikimedia.org/r/504352 [15:12:35] (03PS2) 10Muehlenhoff: Add initial Hiera settings for kerberos1001 Change-Id: I3ff004d5ba2a9a5b2ae47cadcf2181f463678b39 [puppet] - 10https://gerrit.wikimedia.org/r/504351 [15:12:37] (03CR) 10Jbond: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/15827/" [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [15:12:50] (03CR) 10Ottomata: [V: 03+2 C: 03+2] "Ah I didn't mean to merge this!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/504346 (https://phabricator.wikimedia.org/T220661) (owner: 10Ottomata) [15:13:05] (03CR) 10jerkins-bot: [V: 04-1] Add initial Hiera settings for kerberos1001 Change-Id: I3ff004d5ba2a9a5b2ae47cadcf2181f463678b39 [puppet] - 10https://gerrit.wikimedia.org/r/504351 (owner: 10Muehlenhoff) [15:14:02] (03PS1) 10Ottomata: eventgate-analytics - sleep infinity in wmfdebug [deployment-charts] - 10https://gerrit.wikimedia.org/r/504354 [15:14:23] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - sleep infinity in wmfdebug [deployment-charts] - 10https://gerrit.wikimedia.org/r/504354 (owner: 10Ottomata) [15:14:32] (03CR) 10Elukey: [C: 03+1] Fix data type for krb_kdc_servers [puppet] - 10https://gerrit.wikimedia.org/r/504348 (owner: 10Muehlenhoff) [15:16:12] !log roll restart kafka on kafka-jumbo100[1-6] to pick up openjdk upgrades [15:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:19] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Include procps in wmfdebug image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/504345 (owner: 10Ottomata) [15:16:57] (03PS3) 10Muehlenhoff: Add initial Hiera settings for kerberos1001 [puppet] - 10https://gerrit.wikimedia.org/r/504351 [15:18:02] PROBLEM - Check systemd state on ms-be1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:18:42] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [15:18:45] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [15:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:45] !log otto@deploy1001 scap-helm eventgate-analytics finished [15:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:18] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [15:19:19] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [15:19:19] !log otto@deploy1001 scap-helm eventgate-analytics finished [15:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:38] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestcontrol2003: rename to cloudcontrol2003-dev - https://phabricator.wikimedia.org/T220095 (10aborrero) p:05Triage→03Normal a:05aborrero→03Papaul @papaul please update switch port description and physical labeling for this server,... 
[15:20:58] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging --version 0.0.28 -f eventgate-analytics-staging-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [15:20:59] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [15:20:59] !log otto@deploy1001 scap-helm eventgate-analytics finished [15:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:33] (03CR) 10Volans: "Looks good, couple of minor nitpicks." (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) (owner: 10CRusnov) [15:26:18] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:27:26] 10Operations, 10Patch-For-Review: apt-get update broken on jessie: jessie-updates and jessie-backports removed by Debian - https://phabricator.wikimedia.org/T219333 (10Jgreen) [15:27:56] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:27:58] (03CR) 10Volans: "Any other opinions on this?" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/503147 (owner: 10Volans) [15:28:22] RECOVERY - Check systemd state on ms-be1039 is OK: OK - running: The system is fully operational [15:29:31] (03PS1) 10Arturo Borrero Gonzalez: cloudnet2002-dev: use codfw1dev role instead of labtestn [puppet] - 10https://gerrit.wikimedia.org/r/504356 [15:30:01] (03CR) 10jerkins-bot: [V: 04-1] cloudnet2002-dev: use codfw1dev role instead of labtestn [puppet] - 10https://gerrit.wikimedia.org/r/504356 (owner: 10Arturo Borrero Gonzalez) [15:31:50] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:32:10] (03PS2) 10Arturo Borrero Gonzalez: cloudnet2002-dev: use codfw1dev role instead of labtestn [puppet] - 10https://gerrit.wikimedia.org/r/504356 [15:32:46] (03PS2) 10Muehlenhoff: Fix data type for krb_kdc_servers [puppet] - 10https://gerrit.wikimedia.org/r/504348 [15:34:11] (03CR) 10Muehlenhoff: [C: 03+2] Fix data type for krb_kdc_servers [puppet] - 10https://gerrit.wikimedia.org/r/504348 (owner: 10Muehlenhoff) [15:34:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected https://puppet-compiler.wmflabs.org/compiler1001/15829/" [puppet] - 10https://gerrit.wikimedia.org/r/504356 (owner: 10Arturo Borrero Gonzalez) [15:36:27] !log reimaging cloudnet2002-dev because role name change [15:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:49] 10Operations, 10Puppet: puppet fact: migrate away from the uniqueid fact - https://phabricator.wikimedia.org/T221083 (10colewhite) p:05Triage→03Normal [15:37:22] (03Abandoned) 10Ema: tlsproxy::localssl: add snakeoil cert support [puppet] - 10https://gerrit.wikimedia.org/r/479242 (owner: 10Ema) [15:37:28] (03PS1) 10Fsero: registryha: added prometheus scraping new registry [puppet] - 10https://gerrit.wikimedia.org/r/504360 (https://phabricator.wikimedia.org/T221099) [15:37:34] (03PS4) 10Muehlenhoff: Add initial Hiera 
settings for kerberos1001 [puppet] - 10https://gerrit.wikimedia.org/r/504351 [15:39:15] 10Operations, 10User-fgiunchedi: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 (10colewhite) p:05Triage→03Normal [15:39:42] 10Operations, 10Wikimedia-Logstash: config file change canarying for logstash - https://phabricator.wikimedia.org/T221052 (10colewhite) p:05Triage→03Normal [15:40:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1005 with 10G interfaces - https://phabricator.wikimedia.org/T221049 (10colewhite) p:05Triage→03Normal [15:40:44] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1006 with 10G interfaces - https://phabricator.wikimedia.org/T221048 (10colewhite) p:05Triage→03Normal [15:40:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: relocate/reimage cloudvirt1007 with 10G interfaces - https://phabricator.wikimedia.org/T221047 (10colewhite) p:05Triage→03Normal [15:41:45] 10Operations: maps hosts have bad permissions under /srv/deployment - https://phabricator.wikimedia.org/T220982 (10colewhite) p:05Triage→03High [15:42:40] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1013 - https://phabricator.wikimedia.org/T220907 (10colewhite) p:05Triage→03High [15:46:21] (03PS3) 10Arturo Borrero Gonzalez: cloudnet2002-dev: use codfw1dev role instead of labtestn [puppet] - 10https://gerrit.wikimedia.org/r/504356 [15:46:57] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:50:19] (03PS26) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [15:51:03] (03CR) 10Alexandros Kosiaris: [C: 04-2] "This would remove 70-persistent-net.rules on 162 machines, probably causing mayhem when they reboot. 
The full list obtain via cumin is" [puppet] - 10https://gerrit.wikimedia.org/r/503708 (owner: 10Alex Monk) [15:51:34] (03PS27) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [15:52:18] (03CR) 10Alexandros Kosiaris: [C: 03+1] registryha: added prometheus scraping new registry [puppet] - 10https://gerrit.wikimedia.org/r/504360 (https://phabricator.wikimedia.org/T221099) (owner: 10Fsero) [15:55:47] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Elasticsearch nodes overloading in eqiad - https://phabricator.wikimedia.org/T220901 (10colewhite) p:05Triage→03High [15:58:12] (03CR) 10Alex Monk: "Ack, thanks. Let's just explicitly absent away individual old files as necessary for now then?" [puppet] - 10https://gerrit.wikimedia.org/r/503708 (owner: 10Alex Monk) [16:00:05] godog and _joe_: Your horoscope predicts another unfortunate Puppet SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190416T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. [16:03:59] (03CR) 10Elukey: "Should we add those to common.yaml so all the hadoop roles will be able to pick them up?" 
[puppet] - 10https://gerrit.wikimedia.org/r/504351 (owner: 10Muehlenhoff) [16:04:00] (03CR) 10Elukey: [C: 03+1] Assign role for kerberos1001 [puppet] - 10https://gerrit.wikimedia.org/r/504352 (owner: 10Muehlenhoff) [16:05:08] (03CR) 10Jbond: [C: 03+2] puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [16:05:31] (03PS19) 10Jbond: puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) [16:07:04] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:08:51] (03PS1) 10Jbond: Revert "puppet: Refactor of the base::puppet class" [puppet] - 10https://gerrit.wikimedia.org/r/504364 [16:09:09] (03CR) 10Muehlenhoff: "I'm fine either way, this is initially for the setup of the KDC, we can drop the role-specific setting when the bigger WIP is merged?" [puppet] - 10https://gerrit.wikimedia.org/r/504351 (owner: 10Muehlenhoff) [16:09:20] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:09:52] (03CR) 10jerkins-bot: [V: 04-1] Revert "puppet: Refactor of the base::puppet class" [puppet] - 10https://gerrit.wikimedia.org/r/504364 (owner: 10Jbond) [16:11:45] (03CR) 10Alexandros Kosiaris: [C: 04-2] "> Ack, thanks. Let's just explicitly absent away individual old files as necessary for now then?" 
[puppet] - 10https://gerrit.wikimedia.org/r/503708 (owner: 10Alex Monk) [16:13:42] 10Operations, 10fundraising-tech-ops, 10procurement: SSL renewal: frdata.wm.o expires 19/05/13 - https://phabricator.wikimedia.org/T220882 (10colewhite) p:05Triage→03High [16:15:45] jbond42 hi, i think https://gerrit.wikimedia.org/r/501617 broke puppet, im getting https://phabricator.wikimedia.org/P8406 [16:16:31] looking [16:17:45] paladox: let me try and push a quick fix to that [16:17:53] thanks :) [16:20:50] paladox: do you change the class parameter for base::puppet::server to an empty string? [16:21:23] Um, im not sure how the labs puppet class does it with base::puppet::server [16:21:53] the puppet master uses labspuppet master [16:24:35] (03CR) 10Elukey: "Sure!" [puppet] - 10https://gerrit.wikimedia.org/r/504351 (owner: 10Muehlenhoff) [16:25:31] (03CR) 10Volans: "It seems there are conflicts when rebasing, it need manual rebase." [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [16:27:24] jbond42 i have this "puppetmaster: labs-puppetmaster.wikimedia.org" set [16:28:08] (03CR) 10Jcrespo: [C: 03+2] mariadb: Depool db1078 for hardware maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504022 (https://phabricator.wikimedia.org/T219115) (owner: 10Jcrespo) [16:28:22] Taking over mw deployments for 5 minutes [16:28:23] paladox: where do you have that set? [16:28:32] horizon [16:28:44] (in horizon hiera thing) [16:29:11] (03Merged) 10jenkins-bot: mariadb: Depool db1078 for hardware maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504022 (https://phabricator.wikimedia.org/T219115) (owner: 10Jcrespo) [16:29:33] ok let me try setting that on one of my test hosts [16:29:53] could you also try removing it and see if it inherits the correct config? 
[16:31:12] ok [16:32:59] (03CR) 10jenkins-bot: mariadb: Depool db1078 for hardware maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504022 (https://phabricator.wikimedia.org/T219115) (owner: 10Jcrespo) [16:33:55] paladox: ok im reverting the change [16:34:10] ok [16:34:27] in fact you can help me with that :D [16:34:49] (03PS23) 10Alex Monk: cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 [16:34:53] https://gerrit.wikimedia.org/r/c/operations/puppet/+/504364 [16:35:03] (03CR) 10Paladox: [C: 03+1] Revert "puppet: Refactor of the base::puppet class" [puppet] - 10https://gerrit.wikimedia.org/r/504364 (owner: 10Jbond) [16:35:55] (03CR) 10jerkins-bot: [V: 04-1] Revert "puppet: Refactor of the base::puppet class" [puppet] - 10https://gerrit.wikimedia.org/r/504364 (owner: 10Jbond) [16:36:45] (03Abandoned) 10Alex Monk: udev: purge unpuppetised rules [puppet] - 10https://gerrit.wikimedia.org/r/503708 (owner: 10Alex Monk) [16:37:38] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1078 (duration: 00m 52s) [16:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:25] jbond42 it dosen't appear that removing "puppetmaster:" fixes it, but i also have it set in wikitech (for the other instances to use that puppet master). [16:39:55] hiera for wikitech is cached so im not sure how long after changing it that the puppet master will pull in the new value. [16:40:30] im not sure either. 
have also got volans lookint at this to help [16:41:25] !log disabling notifications on db1078 T219115 [16:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:29] T219115: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115 [16:41:42] mw patch queue is now free, unless something unexpected happens [16:42:13] jbond42: it seems the [16:42:13] String $puppetmaster = lookup('puppetmaster'), [16:42:18] is the one failing I think [16:42:54] (03PS1) 10CRusnov: coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 [16:42:57] modules/profile/manifests/base/puppet.pp:15:3 [16:42:58] from the log [16:43:08] (03PS1) 10Jbond: Fix puppet: remove alais call as it may not work [puppet] - 10https://gerrit.wikimedia.org/r/504368 [16:43:21] volans: ^^ i think it may be related to this [16:43:45] !log upgrading and shutting down db1078 T219115 [16:43:57] (03CR) 10Volans: Fix puppet: remove alais call as it may not work (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/504368 (owner: 10Jbond) [16:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:16] hmm [16:44:18] ca_server = puppetmaster1001.eqiad.wmnet [16:44:28] some how that was inserted on wmcs instances. 
[16:44:40] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [16:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:43] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [16:44:44] !log otto@deploy1001 scap-helm eventgate-analytics finished [16:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:59] (03PS2) 10CRusnov: coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) [16:45:01] paladox: where? [16:45:15] gerrit-test.git.eqiad.wmflabs [16:45:22] (03PS2) 10Jbond: Fix puppet: remove alias call as it may not work [puppet] - 10https://gerrit.wikimedia.org/r/504368 [16:45:23] I mean in which file [16:45:28] oh, puppet.conf [16:45:52] mmmh I don't have it in the instance I'm checkijng [16:45:57] also puppet is broken [16:47:18] jbond42: so regardint this patch, I don't see profile::base::puppet::puppetmaster used anywhere [16:47:24] what should solve? 
[16:47:41] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values --set wmfdebug_enabled=true stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [16:47:42] yes you right that is just left over cruft it should be a noop [16:47:42] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [16:47:42] !log otto@deploy1001 scap-helm eventgate-analytics finished [16:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:45] otherwise we can just revert, but given that was quite a refactor seeing if we can avoid it [16:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:10] volans: i wuld love to avoid it if we can [16:48:45] !log puppet node clean bast2001.wikimedia.org ; puppet node deactivate bast2001.wikimedia.org ; it showed up in Icinga again despite running decom cookbook (T219492) [16:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:49] T219492: decommission bast2001 - https://phabricator.wikimedia.org/T219492 [16:48:52] does lookup() behave differently than hiera()? 
[16:49:14] it shouldn't [16:49:57] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:52:42] (03CR) 10Jforrester: Invariant config cleanup: I - Initial DB and performance items (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501003 (owner: 10Jforrester) [16:53:32] !log bast2001 - shutdown -h now - decom'ed (T219492) [16:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:37] jbond42: also curious that wmcs hostnames don't match Variant[Stdlib::Fqdn, Stdlib::Compat::Ip_address] [16:54:27] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript extensions/WikibaseQualityConstraints/maintenance/ImportConstraintEntities.php --wiki=testwikidatawiki --config-format=wgConf | tee T221108.php [16:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:32] T221108: Configure suggestion constraint level on Test Wikidata - https://phabricator.wikimedia.org/T221108 [16:54:32] volans: i think that was because of literal - [16:55:10] Pattern[/^(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*([A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9])$/] [16:55:53] the \- means that it should accept - [16:55:54] (03PS14) 10CRusnov: Port MakeVM to cookbook. [cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) [16:56:01] In paladox's error message that's because it's an empty string for some reason [16:56:05] volans: from the paste that paladox posted the class received '' [16:56:23] right [16:56:41] sorry my bad [16:57:29] (03CR) 10jerkins-bot: [V: 04-1] Port MakeVM to cookbook. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) (owner: 10CRusnov) [16:57:41] which i think means the lookup finds an entry with an empty string value as im fairly sure that the lookup function fails with a different message if it cant find something and no default is passed [16:58:49] I'm not familiar with the change you merge, hence I'm a bit lost, trying to make sense of it [16:59:45] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:00:05] cscott, arlolra, subbu, halfak, and Amir1: (Dis)respected human, time to deploy Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190416T1700). Please do the needful. [17:00:31] (03PS1) 10Lucas Werkmeister (WMDE): Configure suggestion constraint status on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504374 (https://phabricator.wikimedia.org/T221108) [17:01:04] no parsoid deploy today [17:02:47] (03PS1) 10EBernhardson: Return CirrusSearch to standard execution on eqiad cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504375 [17:03:04] jbond42: I'm trying something on my puppetmaster [17:03:31] sureack [17:03:38] that ofc has conflicts :( [17:08:52] volans: can you +1 https://gerrit.wikimedia.org/r/c/operations/puppet/+/504368 because i think it may be having some affect [17:08:58] https://gerrit.wikimedia.org/r/c/operations/puppet/+/504368 [17:09:11] probably due to module automatic lookups [17:09:13] (03CR) 10Volans: [C: 03+1] "let's try" [puppet] - 10https://gerrit.wikimedia.org/r/504368 (owner: 10Jbond) [17:09:28] (03CR) 10Jbond: [C: 03+2] Fix puppet: remove alias call as it may not work [puppet] - 10https://gerrit.wikimedia.org/r/504368 (owner: 10Jbond) [17:11:02] jbond42: converting lookup() to hiera() it works [17:11:12] on my self-hosted puppetmaster [17:11:31] but [17:11:38] I get 
[17:11:39] +ca_server = puppetmaster1001.eqiad.wmnet [17:12:21] I've this right now [17:12:21] - String $puppetmaster = lookup('puppetmaster'), [17:12:21] - Stdlib::Host $ca_server = lookup('puppet_ca_server'), [17:12:21] + String $puppetmaster = hiera('puppetmaster'), [17:12:21] + Stdlib::Host $ca_server = hiera('puppet_ca_server'), [17:12:25] PROBLEM - Apache HTTP on mw1263 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:13:24] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "no-op preparatory change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504374 (https://phabricator.wikimedia.org/T221108) (owner: 10Lucas Werkmeister (WMDE)) [17:13:36] (03PS2) 10Dzahn: remove bast2001 production IPs [dns] - 10https://gerrit.wikimedia.org/r/504210 (https://phabricator.wikimedia.org/T219492) [17:13:43] RECOVERY - Apache HTTP on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [17:14:26] now that i have removed that alias setting i get the Found key: "puppetmaster" value: "puppet-phabricator.phabricator.eqiad.wmflabs" [17:14:30] (03Merged) 10jenkins-bot: Configure suggestion constraint status on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504374 (https://phabricator.wikimedia.org/T221108) (owner: 10Lucas Werkmeister (WMDE)) [17:14:34] with sudo puppet lookup --compile --explain --node puppet-phabricator.phabricator.eqiad.wmflabs puppetmaster [17:14:46] paladox: could you test again [17:14:51] * paladox tests [17:14:52] jbond42: it works [17:14:58] but I still get [17:14:58] +ca_server = puppetmaster1001.eqiad.wmnet [17:15:08] (03CR) 10Dzahn: [C: 03+2] remove bast2001 production IPs [dns] - 10https://gerrit.wikimedia.org/r/504210 (https://phabricator.wikimedia.org/T219492) (owner: 10Dzahn) [17:15:26] jbond42: that's not defined in 
labs.yaml [17:15:28] volans: yes looking over my change i think that was empty by default and i added a global value [17:15:28] for the VMs [17:15:37] (03PS1) 10Cmjohnson: Adding dns prod/mgm for dbprov1001/1002 [dns] - 10https://gerrit.wikimedia.org/r/504378 (https://phabricator.wikimedia.org/T219399) [17:15:55] also it's defined for labpuppetmasters host hiera [17:15:58] and might be wrong? [17:16:01] huh, i got "+environment = production" [17:16:08] that's ok [17:16:14] we use a single environment [17:16:18] oh [17:16:19] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:504374|no-op preparatory change (T221108)]] (duration: 00m 52s) [17:16:20] it's the same code [17:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:24] T221108: Configure suggestion constraint level on Test Wikidata - https://phabricator.wikimedia.org/T221108 [17:16:32] volans: previously it was only used by the puppet master classes [17:16:37] (03PS2) 10Cmjohnson: Adding dns prod/mgm for dbprov1001/1002 [dns] - 10https://gerrit.wikimedia.org/r/504378 (https://phabricator.wikimedia.org/T219399) [17:16:39] jbond42: yes I know [17:16:49] so the labpuppetmasters are clients of the prod puppetmasters [17:16:52] but masters of the VMs [17:16:59] so in the [agent] they need prod value [17:17:06] while the VMs need the labpuppetmasters [17:17:17] now it's the opposite [17:17:18] though puppet works now :) [17:17:39] I'm not even sure the agent reads ca_server [17:17:46] i think the agents just don't need this value right. we only need this for puppet masters that are not a ca? 
[17:17:49] (03CR) 10jenkins-bot: Configure suggestion constraint status on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504374 (https://phabricator.wikimedia.org/T221108) (owner: 10Lucas Werkmeister (WMDE)) [17:17:56] volans: me neither [17:18:02] * jbond42 checking [17:18:04] that's already in the [master] section [17:18:17] (03CR) 10Cmjohnson: [C: 03+2] Adding dns prod/mgm for dbprov1001/1002 [dns] - 10https://gerrit.wikimedia.org/r/504378 (https://phabricator.wikimedia.org/T219399) (owner: 10Cmjohnson) [17:20:11] (03PS5) 10Muehlenhoff: Add initial Hiera settings for kerberos1001 [puppet] - 10https://gerrit.wikimedia.org/r/504351 [17:20:54] 10Operations, 10cloud-services-team: labpuppetmaster logs 'cannot collect exported resources without storeconfigs being set' - https://phabricator.wikimedia.org/T221115 (10Volans) [17:20:58] (03PS1) 10Lucas Werkmeister (WMDE): Enable suggestion constraint status on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504380 (https://phabricator.wikimedia.org/T221108) [17:21:20] (03PS1) 10Dzahn: remove bast2001 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/504381 (https://phabricator.wikimedia.org/T219492) [17:21:28] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [17:21:29] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:21:47] (03CR) 10Muehlenhoff: [C: 03+2] Add initial Hiera settings for kerberos1001 [puppet] - 10https://gerrit.wikimedia.org/r/504351 (owner: 10Muehlenhoff) [17:21:57] (03CR) 10jerkins-bot: [V: 04-1] Enable suggestion constraint status on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504380 (https://phabricator.wikimedia.org/T221108) (owner: 10Lucas Werkmeister (WMDE)) [17:22:11] (03CR) 10Dzahn: [C: 03+2] remove bast2001 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/504381 (https://phabricator.wikimedia.org/T219492) (owner: 10Dzahn) [17:22:20] (03PS2) 10Dzahn: remove bast2001 from 
site.pp [puppet] - 10https://gerrit.wikimedia.org/r/504381 (https://phabricator.wikimedia.org/T219492) [17:23:04] !log depooling mw1280 T218006 [17:23:15] we have some issue in cloudvps, and the bots are down [17:23:24] cdanis: ^^^ [17:24:20] !log force initialization of unassigned shards on elasticsearch eqiad [17:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:10] !log rebooted cloudnet1003 [17:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:19] !log cdanis@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,name=mw1280.eqiad.wmnet,cluster=api_appserver [17:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:18] !log restarting rabbitmq on cloudcontrol1003 [17:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:21] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [17:28:21] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:32] 10Operations, 10LDAP-Access-Requests: add SChang & LDoan to WMF LDAP group for transparency report editing - https://phabricator.wikimedia.org/T221118 (10Dzahn) [17:29:38] !log that was me downtiming cloudcontrol1003.wikimedia.org for 30 mins [17:32:11] 10Operations, 10PHP 7.2 support, 10Performance-Team (Radar): Set `enable_dl` to 0 in php.ini - https://phabricator.wikimedia.org/T220681 (10colewhite) p:05Triage→03Normal [17:32:13] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Assistance with LDAP Access for Transparency Report - https://phabricator.wikimedia.org/T140380 (10Dzahn) Hi @JbuattiWMF , i made a new ticket for your request and linked it here: -> T221118. We have a rotating clinic duty on who handles access requ... 
[17:32:13] !log that was me downtiming cloudcontrol1003.wikimedia.org for 30 mins [17:32:48] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Assistance with LDAP Access for Transparency Report - https://phabricator.wikimedia.org/T140380 (10Dzahn) 05Open→03Resolved [17:33:16] 10Operations, 10ops-eqiad, 10serviceops: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (10CDanis) 13:26:19 <+logmsgbot> !log cdanis@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,name=mw1280.eqiad.wmnet,cluster=api_appserver [17:33:26] (03PS1) 10Jbond: fix wmcs puppet: Add a default value for ca_server in labs [puppet] - 10https://gerrit.wikimedia.org/r/504384 [17:33:44] 10Operations, 10ops-codfw: rack/setup/install bast2002.wikimedia.org - https://phabricator.wikimedia.org/T196665 (10Dzahn) [17:33:51] (03CR) 10Paladox: [C: 03+1] fix wmcs puppet: Add a default value for ca_server in labs [puppet] - 10https://gerrit.wikimedia.org/r/504384 (owner: 10Jbond) [17:34:14] jbond42: actually, I've a problem with that [17:34:16] re-thinking of it [17:34:39] arturo: andrewbogott can you check ok [17:34:42] it requires all the owners of custom puppetmasters to add that to horizon for all VMs child of that puppetmaster [17:34:47] opps half finished sentence [17:35:17] 10Operations, 10ops-codfw: rack/setup/install bast2002.wikimedia.org - https://phabricator.wikimedia.org/T196665 (10Dzahn) 05Open→03Resolved bast2001 has been decom'ed and in T219492 and is now fully replaced by bast2002. https://wikitech.wikimedia.org/wiki/Bast2002 https://wikitech.wikimedia.org/wiki/H... [17:35:18] ok as a bandaid but probably we could just drop it in labs [17:35:28] sorry I just agreed to do this 2 minutes ago :) [17:36:02] So if someone was to create a new instance and try to run it against a their puppet master would that fail? 
[17:36:35] !log toolforge k8s reallocation (from nova-network to neutron) is causing troubles with IRC bots, expect missing entries in the SAL [17:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:42] volans: its ok looking at the hiera we may be able to do a better bandaid one sec [17:36:53] ok [17:37:11] jouncebot: now [17:37:11] For the next 0 hour(s) and 22 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190416T1700) [17:37:13] jouncebot: next [17:37:13] In 0 hour(s) and 22 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190416T1800) [17:37:50] so the agent seems to read and should use it, but probably only when sending the CSR and retrieving the signed cert [17:38:10] and if not defined defaults to the puppetmaster [17:39:26] volans: yes thats my reading as well which means its almost useless as puppet is not really configured by puppet before it sends its csr [17:40:02] jbond42: was there something I can help with? [17:40:40] andrewbogott: not yet i tagged you a bit too early :) [17:40:49] ok, lmk :) [17:40:51] (03PS15) 10CRusnov: Port MakeVM to cookbook. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) [17:41:59] (03PS1) 10Lucas Werkmeister (WMDE): Configure suggestion constraint status on Beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504386 (https://phabricator.wikimedia.org/T221107) [17:42:48] (03PS2) 10Jbond: fix wmcs puppet: Add a default value for ca_server in labs [puppet] - 10https://gerrit.wikimedia.org/r/504384 [17:43:35] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "production no-op" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504386 (https://phabricator.wikimedia.org/T221107) (owner: 10Lucas Werkmeister (WMDE)) [17:44:37] (03Merged) 10jenkins-bot: Configure suggestion constraint status on Beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504386 (https://phabricator.wikimedia.org/T221107) (owner: 10Lucas Werkmeister (WMDE)) [17:44:43] jbond42: looks good, let's get a compiler check [17:44:45] volans: do you have a wmcs host you can test the above on. as labs is more prefered i think it should take this value asuming the lookup dosen;t treat '' as nil [17:44:47] we can check wmcs too [17:44:47] now [17:45:04] pick compiler1001.puppet-diffs.eqiad.wmflabs [17:45:34] or you mean to test via puppetmaster hack? 
I can do that too [17:45:34] :) [17:45:41] volans: yes [17:45:45] ill do a compiler [17:45:51] ok trying [17:46:35] thanks [17:46:56] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: [[gerrit:504386|no-op preparatory change (T221107)]] (duration: 00m 52s) [17:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:00] T221107: Configure suggestion constraint level on Beta - https://phabricator.wikimedia.org/T221107 [17:47:35] !log beginning rolling ELK upgrade to 5.6.15 [17:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:53] (03PS1) 10Lucas Werkmeister (WMDE): Enable suggestion constraint status on Beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504389 (https://phabricator.wikimedia.org/T221107) [17:48:34] (03PS2) 10Lucas Werkmeister (WMDE): Enable suggestion constraint status on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504380 (https://phabricator.wikimedia.org/T221108) [17:48:41] jbond42: noop so far [17:49:49] jbond42: isn't that puppet_ca_server ? 
[17:50:30] modules/profile/manifests/base/puppet.pp: Stdlib::Host $ca_server = lookup('puppet_ca_server'), [17:51:05] ahh yes let me try that [17:51:19] (03CR) 10Lucas Werkmeister (WMDE): "Come to think of it, I’m not sure if we actually want to do this… we could also let constraints on beta become enabled once the default in" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504389 (https://phabricator.wikimedia.org/T221107) (owner: 10Lucas Werkmeister (WMDE)) [17:51:21] (03PS1) 10Mathew.onipe: icinga: filter out unassigned shards in check [puppet] - 10https://gerrit.wikimedia.org/r/504390 [17:51:42] (03CR) 10jenkins-bot: Configure suggestion constraint status on Beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504386 (https://phabricator.wikimedia.org/T221107) (owner: 10Lucas Werkmeister (WMDE)) [17:51:44] (03CR) 10Herron: [C: 03+2] Upgrade logstash plugins to 5.6.15 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/502613 (https://phabricator.wikimedia.org/T219571) (owner: 10Herron) [17:51:47] (03CR) 10Herron: [V: 03+2 C: 03+2] Upgrade logstash plugins to 5.6.15 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/502613 (https://phabricator.wikimedia.org/T219571) (owner: 10Herron) [17:51:51] (03PS3) 10Jbond: fix wmcs puppet: Add a default value for ca_server in labs [puppet] - 10https://gerrit.wikimedia.org/r/504384 [17:52:00] 10Operations, 10Operations-Software-Development, 10cloud-services-team: cumin aliases not matching any hosts - https://phabricator.wikimedia.org/T221125 (10Dzahn) [17:52:01] mmh now fails, give me a sec [17:52:34] profile::base::puppet needs to be converted to String too [17:53:07] -ca_server = puppetmaster1001.eqiad.wmnet [17:53:07] + [17:53:08] yay [17:53:48] jbond42: ^^^ [17:54:01] yes on sec thanks [17:54:13] didn't know if you saw :D [17:54:15] no hurry [17:54:54] (03PS4) 10Jbond: fix wmcs puppet: Add a default value for ca_server in labs [puppet] - 10https://gerrit.wikimedia.org/r/504384 
[17:56:38] ok i think we are good volans can i get a +1 :) [17:56:38] (03CR) 10Volans: [C: 03+1] "LGTM, let's test a compiler in prod to be sure" [puppet] - 10https://gerrit.wikimedia.org/r/504384 (owner: 10Jbond) [17:56:51] !log reedy@deploy1001 Synchronized php-1.33.0-wmf.25/extensions/WikimediaIncubator/: T220623 (duration: 00m 53s) [17:56:51] its like you knew i was typing ;) [17:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:55] lol [17:56:57] T220623: Unable to view certain pages on incubator.wikimedia.org (Fatal error: operator not supported) - https://phabricator.wikimedia.org/T220623 [17:57:13] (03CR) 10Jbond: [C: 03+2] fix wmcs puppet: Add a default value for ca_server in labs [puppet] - 10https://gerrit.wikimedia.org/r/504384 (owner: 10Jbond) [17:57:23] (03PS5) 10Jbond: fix wmcs puppet: Add a default value for ca_server in labs [puppet] - 10https://gerrit.wikimedia.org/r/504384 [17:57:25] volans: thanks for all the help [17:58:03] no prob at all [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190416T1800) [18:00:23] jbond42: all good on a test VM [18:00:32] (03CR) 10Dzahn: "good timing. support ends in 6 days :)" [puppet] - 10https://gerrit.wikimedia.org/r/503452 (https://phabricator.wikimedia.org/T208087) (owner: 10Mobrovac) [18:00:33] great thanks [18:01:03] thanks for the fixes! [18:01:08] and the large refactor [18:01:25] (03CR) 10Ayounsi: "Thank you, replied to your comments!" (0313 comments) [puppet] - 10https://gerrit.wikimedia.org/r/390330 (https://phabricator.wikimedia.org/T106056) (owner: 10Ayounsi) [18:01:52] volans: no probs :) [18:02:02] 10Operations, 10ops-eqiad: Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T220880 (10colewhite) we're pretty sure this is a false alarm [18:02:03] paladox: you should be all good now sorry for the interuption [18:02:16] thank you! 
(also no need to be sorry :)) [18:02:16] 10Operations, 10ops-eqiad: Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T220880 (10colewhite) 05Open→03Resolved [18:02:35] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15835/" [puppet] - 10https://gerrit.wikimedia.org/r/503452 (https://phabricator.wikimedia.org/T208087) (owner: 10Mobrovac) [18:02:44] (03PS2) 10Dzahn: Set restbase200[78] as spares and remove them from conftool [puppet] - 10https://gerrit.wikimedia.org/r/503452 (https://phabricator.wikimedia.org/T208087) (owner: 10Mobrovac) [18:02:51] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash, 10User-herron: logstash1012 lock up caused central logging stuck - https://phabricator.wikimedia.org/T220500 (10colewhite) p:05Triage→03High [18:03:01] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash, 10User-herron: logstash1012 lock up caused central logging stuck - https://phabricator.wikimedia.org/T220500 (10colewhite) p:05High→03Normal [18:03:05] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1078 for hardware maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504392 [18:03:25] (03PS9) 10Ayounsi: Bird-lg [puppet] - 10https://gerrit.wikimedia.org/r/390330 (https://phabricator.wikimedia.org/T106056) [18:03:44] 10Operations, 10Traffic, 10wikitech.wikimedia.org: Wikitech page views sometimes default to MobileFrontend - https://phabricator.wikimedia.org/T220567 (10colewhite) p:05Triage→03Normal [18:04:09] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool - https://phabricator.wikimedia.org/T220787 (10colewhite) p:05Triage→03Normal [18:04:15] (03CR) 10jerkins-bot: [V: 04-1] Bird-lg [puppet] - 10https://gerrit.wikimedia.org/r/390330 (https://phabricator.wikimedia.org/T106056) (owner: 10Ayounsi) [18:04:56] urandom: i am doing the "decom restbase2007/2008" thing but to decom i have to reactivate puppet which is disabled with the 
reason "has been decomed" ;) [18:05:11] actually they weren't yet [18:05:28] also needs more decom steps but will make the ticket for it then [18:06:04] 10Operations, 10DBA, 10Jade, 10TechCom-RFC, and 3 others: Introduce a new namespace for collaborative judgements about wiki entities - https://phabricator.wikimedia.org/T200297 (10colewhite) p:05Triage→03Normal [18:06:38] 10Operations, 10media-storage, 10User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (10colewhite) p:05Triage→03Normal [18:06:45] (03PS1) 10Brennen Bearnes: admin: add brennen to gerrit-admin [puppet] - 10https://gerrit.wikimedia.org/r/504393 [18:07:15] (03PS2) 10Mathew.onipe: icinga: filter out unassigned shards in check [puppet] - 10https://gerrit.wikimedia.org/r/504390 [18:07:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10User-Zppix, 10cloud-services-team (Kanban): VMs on cloudvirt1015 crashing - https://phabricator.wikimedia.org/T220853 (10colewhite) p:05Triage→03Normal [18:07:56] !log restbase2007, restbase2008 - re-enabled puppet which was disabled with reason 'decom'ed' but actually needed to run to decom after they had moved to role::spare::system (T208087) [18:08:16] (03CR) 10Volans: [C: 03+1] "LGTM, nit in the commit message" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) (owner: 10CRusnov) [18:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:25] T208087: Replace remaining Samsung SSDs - https://phabricator.wikimedia.org/T208087 [18:08:27] (03CR) 10Paladox: [C: 03+1] admin: add brennen to gerrit-admin [puppet] - 10https://gerrit.wikimedia.org/r/504393 (owner: 10Brennen Bearnes) [18:08:37] 10Operations, 10cloud-services-team: labpuppetmaster logs 'cannot collect exported resources without storeconfigs being set' - https://phabricator.wikimedia.org/T221115 (10colewhite) p:05Triage→03Normal [18:09:20] (03CR) 10Thcipriani: [C: 03+1] admin: add brennen to 
gerrit-admin [puppet] - 10https://gerrit.wikimedia.org/r/504393 (owner: 10Brennen Bearnes) [18:09:23] 10Operations, 10Operations-Software-Development, 10cloud-services-team: cumin aliases not matching any hosts - https://phabricator.wikimedia.org/T221125 (10colewhite) p:05Triage→03Normal [18:10:29] 10Operations, 10hardware-requests, 10User-Elukey: eqiad: (3) - zookeeper cluster for Analytics - https://phabricator.wikimedia.org/T220687 (10colewhite) p:05Triage→03Normal [18:10:37] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10Cmjohnson) [18:11:00] (03PS2) 10Dzahn: admin: add brennen to gerrit-admin [puppet] - 10https://gerrit.wikimedia.org/r/504393 (https://phabricator.wikimedia.org/T218858) (owner: 10Brennen Bearnes) [18:11:01] 10Operations, 10Mail: remove RT mail aliases - https://phabricator.wikimedia.org/T220844 (10colewhite) p:05Triage→03Normal [18:11:15] (03CR) 10Paladox: [C: 03+1] "Probably want to add "Bug: T218858"?" [puppet] - 10https://gerrit.wikimedia.org/r/504393 (https://phabricator.wikimedia.org/T218858) (owner: 10Brennen Bearnes) [18:11:34] 10Operations, 10SRE-Access-Requests: access for foks to labweb (in one way or another) (or make changePassword.php work on mwmaint hosts) - https://phabricator.wikimedia.org/T220860 (10colewhite) p:05Triage→03Normal [18:11:45] (03CR) 10jerkins-bot: [V: 04-1] admin: add brennen to gerrit-admin [puppet] - 10https://gerrit.wikimedia.org/r/504393 (https://phabricator.wikimedia.org/T218858) (owner: 10Brennen Bearnes) [18:12:52] (03CR) 10Dzahn: "i linked the relevant ticket for this and added the tag to it for ldap access requests. i notice the ticket says "More discussions to happ" [puppet] - 10https://gerrit.wikimedia.org/r/504393 (https://phabricator.wikimedia.org/T218858) (owner: 10Brennen Bearnes) [18:12:54] (03CR) 10Greg Grossmeier: [C: 03+1] "Approved from my side." 
[puppet] - 10https://gerrit.wikimedia.org/r/504393 (https://phabricator.wikimedia.org/T218858) (owner: 10Brennen Bearnes) [18:13:59] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10Cmjohnson) @robh, @Marostegui or @jcrespo These servers are finished with the on-site specific tasks. All yours to finish [18:14:06] !log powering off mw1280 to replace DIMM [18:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:01] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10jcrespo) a:03jcrespo Thanks! [18:16:04] (03PS3) 10CDanis: admin: add brennen to gerrit-admin [puppet] - 10https://gerrit.wikimedia.org/r/504393 (https://phabricator.wikimedia.org/T218858) (owner: 10Brennen Bearnes) [18:16:15] (03PS4) 10CDanis: admin: add brennen to gerrit-admin [puppet] - 10https://gerrit.wikimedia.org/r/504393 (https://phabricator.wikimedia.org/T218858) (owner: 10Brennen Bearnes) [18:16:56] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115 (10jcrespo) a:05Marostegui→03jcrespo This is fixed, but not closing because I cannot repool the server yet (Deployment schedule conflic). I will repool it tomorr... 
[18:17:41] mutante: so I guess you sorted it out [18:17:52] PROBLEM - Host mw1280 is DOWN: PING CRITICAL - Packet loss = 100% [18:18:07] (03CR) 10CDanis: [C: 03+2] admin: add brennen to gerrit-admin [puppet] - 10https://gerrit.wikimedia.org/r/504393 (https://phabricator.wikimedia.org/T218858) (owner: 10Brennen Bearnes) [18:18:09] mutante: but yeah, I only disable puppet because I mask the units [18:18:36] so re-enabling to apply a new role is fine [18:19:50] PROBLEM - Host mw1280.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:21:06] urandom: i understand, yep. yes, it's alright then if i do it _after_ they changed to spare role, which is what happened :) [18:21:34] (03PS1) 10Dbarratt: Deploy Partial blocks to Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504403 (https://phabricator.wikimedia.org/T220434) [18:22:00] RECOVERY - Host mw1280 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [18:23:52] 10Operations, 10ops-eqiad, 10serviceops: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (10Cmjohnson) I replaced both DIMM A1 and B1 since I had previously ordered one for mw1264 that I did not need. Please add back to but I have a feeling that a CPU may be bad. Let's leave this open for a week... 
[18:24:11] (03PS1) 10Ottomata: eventgate-analytics - write v8 profile log file to /tmp [deployment-charts] - 10https://gerrit.wikimedia.org/r/504405 [18:24:35] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - write v8 profile log file to /tmp [deployment-charts] - 10https://gerrit.wikimedia.org/r/504405 (owner: 10Ottomata) [18:24:58] RECOVERY - Host mw1280.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [18:25:05] 10Operations, 10Wikimedia-Logstash, 10User-herron: logstash1012 lock up caused central logging stuck - https://phabricator.wikimedia.org/T220500 (10Cmjohnson) [18:25:14] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values --set wmfdebug_enabled=true stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [18:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:17] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [18:25:17] !log otto@deploy1001 scap-helm eventgate-analytics finished [18:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:27] (03CR) 10Tchanders: [C: 03+1] Deploy Partial blocks to Chinese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504403 (https://phabricator.wikimedia.org/T220434) (owner: 10Dbarratt) [18:26:30] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:26:44] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:26:54] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text 
site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:27:16] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:27:20] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:27:30] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:28:10] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:28:12] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [18:28:40] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [18:29:18] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:29:18] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: 
CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [18:29:40] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:30:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services: labsdb1009.mgmt down - https://phabricator.wikimedia.org/T218789 (10Cmjohnson) 05Open→03Resolved Cable was loose, fixed User:root logged-in to ILOMXQ62005Z0.(10.65.4.59 / FE80::1E98:ECFF:FE2E:A6A6) iLO Standard 2.40 at Dec 02 2015 Server Name: Se... [18:30:42] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:31:16] (03PS1) 10Andrew Bogott: nova: turn off nova-fullstack monitoring in the 'main' region [puppet] - 10https://gerrit.wikimedia.org/r/504406 [18:31:22] 10Operations, 10ops-eqiad: mw1264 DIMM error - https://phabricator.wikimedia.org/T217274 (10Cmjohnson) 05Open→03Resolved Resolving for now [18:32:16] (03CR) 10Andrew Bogott: [C: 03+2] nova: turn off nova-fullstack monitoring in the 'main' region [puppet] - 10https://gerrit.wikimedia.org/r/504406 (owner: 10Andrew Bogott) [18:32:35] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops, 10User-jijiki: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10Cmjohnson) The server is out of warranty, can we get a replacement or use a spare replacement? 
[18:33:32] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [18:33:48] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:33:58] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:34:18] RECOVERY - Host labsdb1009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.90 ms [18:34:24] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:34:28] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:34:34] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:34:50] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:35:30] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:36:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:38:31] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:38:57] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:39:03] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:39:13] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:39:14] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1008 with 10G interfaces - https://phabricator.wikimedia.org/T216661 (10Cmjohnson) [18:39:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Update label and switch to rename labvirt1008 to cloudvirt1008 - https://phabricator.wikimedia.org/T220443 (10Cmjohnson) 05Open→03Resolved Completed [18:39:37] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1008 with 10G interfaces - https://phabricator.wikimedia.org/T216661 (10Cmjohnson) [18:39:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - 
https://phabricator.wikimedia.org/T216195 (10Cmjohnson) [18:39:59] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1008 with 10G interfaces - https://phabricator.wikimedia.org/T216661 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson Completed [18:40:03] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:40:19] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:40:39] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1024 with 10G interfaces - https://phabricator.wikimedia.org/T216724 (10Cmjohnson) @Andrew Is this re-imaged yet? Can we resolve this task? Thanks [18:41:45] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Update label and switch to rename labvirt1009 to cloudvirt1009 - https://phabricator.wikimedia.org/T216281 (10Cmjohnson) 05Open→03Resolved Completed [18:41:59] 10Operations, 10ops-eqiad: Update label and switch to rename labvirt1012 to cloudvirt1012 - https://phabricator.wikimedia.org/T216192 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson Completed [18:42:03] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [18:43:37] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:44:13] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:44:25] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:44:29] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:44:35] (03PS1) 10Jbond: puppet-compiler: clean up hiera config [puppet] - 10https://gerrit.wikimedia.org/r/504409 [18:45:09] (03CR) 10jerkins-bot: [V: 04-1] puppet-compiler: clean up hiera config [puppet] - 10https://gerrit.wikimedia.org/r/504409 (owner: 10Jbond) [18:45:11] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:45:17] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:46:00] !log branching 1.34.0-wmf.1 refs T220726 [18:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:05] T220726: 1.34.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T220726 [18:46:57] (03PS2) 10Jbond: puppet-compiler: clean up hiera config [puppet] - 10https://gerrit.wikimedia.org/r/504409 [18:47:25] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops, 10User-jijiki: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10Dzahn) >>! In T215411#5116050, @Cmjohnson wrote: > The server is out of warranty, can we get a replacement or use a spare replacement? Yes, per Robh: >>! In T2154... [18:47:25] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:49:40] !next [18:50:06] jouncebot: next [18:50:06] In 0 hour(s) and 9 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190416T1900) [18:50:07] yay my fixes made it into the branch whew [18:50:29] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [18:50:53] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:51:32] seeing some pretty big spikes in 500s over the last hour. another one rising right now [18:51:34] hey guys! 
ptwiki is sometimes returning Request from MYIP via cp1085 cp1085, Varnish XID 305608811 Error: 503, Backend fetch failed at Tue, 16 Apr 2019 18:50:35 GMT [18:51:43] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:51:57] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:52:05] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:52:13] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:52:17] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:52:33] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:52:43] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [18:52:47] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:53:00] 10Operations, 10ops-eqiad, 10Thumbor, 10serviceops, 10User-jijiki: thumbor1004 memory errors - https://phabricator.wikimedia.org/T215411 (10RobH) [18:53:01] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [18:53:04] (03PS3) 10Jbond: puppet-compiler: clean up hiera config [puppet] - 10https://gerrit.wikimedia.org/r/504409 [18:53:08] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values --set main_app.profiling_enabled=true stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [18:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:10] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [18:53:10] !log otto@deploy1001 scap-helm eventgate-analytics finished [18:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:31] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:54:05] hmm [18:54:07] i just got [18:54:08] Request from 2a00:23c4:ad14:9700:7c67:dbe1:2471:7d5f via cp1085 cp1085, Varnish XID 359390289 [18:54:08] Error: 503, Backend fetch failed at Tue, 16 Apr 2019 18:54:00 GMT [18:54:10] on enwiki [18:54:10] shdubsh: we're discussing in #wikimedia-traffic [18:54:35] is everything OK? 
[18:54:39] lots of 503s [18:55:11] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [18:55:11] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:55:12] !log cdanis@cp1085.eqiad.wmnet ~ % sudo -i depool [18:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:46] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10Marostegui) Yaaay! [18:56:56] (03CR) 10EBernhardson: [C: 03+1] mwgrep: Include JSON files in search [puppet] - 10https://gerrit.wikimedia.org/r/503349 (owner: 10Esanders) [18:57:01] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:58:01] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:00:04] twentyafterfour: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Americas version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190416T1900). [19:00:21] please hold train, we're experiencing some other issues [19:02:42] hello, I assume we are aware that many requests to all sites, including the APIs, are failing? 
I'm seeing 500s and 503s [19:02:45] 10Operations, 10Puppet: puppet fact: migrate away from the uniqueid fact - https://phabricator.wikimedia.org/T221083 (10jbond) [19:03:07] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:03:11] yes, we're aware [19:03:13] 10Operations, 10Puppet: puppet fact: migrate away from the uniqueid fact - https://phabricator.wikimedia.org/T221083 (10jbond) [19:03:21] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:04:14] ok thank you :) [19:04:26] !log restarting varnish backend on cp1085 [19:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:06] !log restarting varnish backend on cp1083 [19:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:53] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:08:31] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:08:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:08:39] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:08:55] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:09:13] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:09:31] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:09:37] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:09:39] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:09:45] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:11:33] !log twentyafterfour@deploy1001:/srv/mediawiki-staging$ scap prep 1.34.0-wmf.1 [19:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:19] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [19:14:15] twentyafterfour: just wanted to make sure you saw 15:00:21 please hold train, we're experiencing some other issues [19:14:43] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [19:15:07] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [19:15:33] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [19:15:35] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [19:15:35] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [19:18:32] 10Operations, 10CX-cxserver, 10Citoid, 10Graphoid, and 10 others: Make services swagger specs standard compliant - https://phabricator.wikimedia.org/T218217 (10MSantos) > In order to minimize copy-pasting, it's would be preferred to use swagger 3 components feature or YAML references. @Pchelolo Does that... [19:19:53] cdanis: I haven't started deploying anything yet [19:19:58] but thank you [19:20:35] so far I've just been branching and cloning [19:20:42] k! [19:25:48] (03PS1) 10Cmjohnson: Adding mgmt dns for db1139/1140 [dns] - 10https://gerrit.wikimedia.org/r/504414 (https://phabricator.wikimedia.org/T218985) [19:29:18] 10Operations, 10ops-ulsfo, 10Patch-For-Review: ulsfo: setup ulsfo PDUs - https://phabricator.wikimedia.org/T209101 (10RobH) Ok, so while we are now monitoring these PDUs, I have not yet done the following: [] - label every single power cable with a unique serial number [] - audit/update each server and docu... [19:29:30] 10Operations, 10ops-ulsfo, 10Patch-For-Review: ulsfo: setup ulsfo PDUs - https://phabricator.wikimedia.org/T209101 (10RobH) [19:33:05] PROBLEM - Host restbase1013.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:34:24] cmjohnson1: expected that restbase1013.mgmt went down just now? [19:35:30] mutante no [19:37:55] cmjohnson1: i can connect to it. 
rescheduling the icinga check [19:38:05] RECOVERY - Host restbase1013.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [19:38:11] (03PS3) 10Gehel: icinga: filter out unassigned shards in check [puppet] - 10https://gerrit.wikimedia.org/r/504390 (owner: 10Mathew.onipe) [19:39:12] (03CR) 10Gehel: [C: 03+2] icinga: filter out unassigned shards in check [puppet] - 10https://gerrit.wikimedia.org/r/504390 (owner: 10Mathew.onipe) [19:40:09] mutante it just came back [19:40:18] cmjohnson1: ack, it did [19:40:35] i told it to check again faster than usual [19:40:45] could have been a few seconds actual downtime [19:45:43] a new bot? [19:46:01] (03PS3) 10Gehel: maps: enable prometheus exporter for cassandra [puppet] - 10https://gerrit.wikimedia.org/r/504326 (https://phabricator.wikimedia.org/T221055) (owner: 10Mathew.onipe) [19:46:53] (03PS1) 10Ottomata: eventgate-analytics - debug_mode_enabled turns on Node inspector, profiler and wmfdebug sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/504418 (https://phabricator.wikimedia.org/T220661) [19:46:56] ACKNOWLEDGEMENT - EDAC syslog messages on thumbor1004 is CRITICAL: 24 ge 4 daniel_zahn https://phabricator.wikimedia.org/T215411 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [19:47:44] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - debug_mode_enabled turns on Node inspector, profiler and wmfdebug sidecar [deployment-charts] - 10https://gerrit.wikimedia.org/r/504418 (https://phabricator.wikimedia.org/T220661) (owner: 10Ottomata) [19:47:57] @wmpbot ? [19:48:09] typo [19:48:18] @wmopbot ? 
[19:48:19] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values --set main_app.debug_mode_enabled=true stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [19:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:22] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [19:48:22] !log otto@deploy1001 scap-helm eventgate-analytics finished [19:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:39] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values --set debug_mode_enabled=true stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [19:49:41] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [19:49:41] !log otto@deploy1001 scap-helm eventgate-analytics finished [19:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:52] (03CR) 10Gehel: [C: 03+2] maps: enable prometheus exporter for cassandra [puppet] - 10https://gerrit.wikimedia.org/r/504326 (https://phabricator.wikimedia.org/T221055) (owner: 10Mathew.onipe) [19:56:28] !log restarting cassandra on maps* for config change - T221055 [19:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:32] T221055: Collect metrics on maps cassandra - https://phabricator.wikimedia.org/T221055 [19:56:46] (03PS1) 10Ottomata: eventgate-analytics - also expose service-runner worker inspector port [deployment-charts] - 10https://gerrit.wikimedia.org/r/504419 
(https://phabricator.wikimedia.org/T220661) [19:57:47] (03PS2) 10Ottomata: eventgate-analytics - also expose service-runner worker inspector port [deployment-charts] - 10https://gerrit.wikimedia.org/r/504419 (https://phabricator.wikimedia.org/T220661) [19:58:07] (03PS2) 10Cmjohnson: Adding mgmt dns for db1139/1140 [dns] - 10https://gerrit.wikimedia.org/r/504414 (https://phabricator.wikimedia.org/T218985) [19:58:11] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1004 with 10G interfaces - https://phabricator.wikimedia.org/T221138 (10Andrew) [19:58:14] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1003 with 10G interfaces - https://phabricator.wikimedia.org/T221139 (10Andrew) [19:58:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1002 with 10G interfaces - https://phabricator.wikimedia.org/T221140 (10Andrew) [19:58:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1001 with 10G interfaces - https://phabricator.wikimedia.org/T221141 (10Andrew) [19:58:22] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - also expose service-runner worker inspector port [deployment-charts] - 10https://gerrit.wikimedia.org/r/504419 (https://phabricator.wikimedia.org/T220661) (owner: 10Ottomata) [19:59:00] (03PS2) 10Dzahn: restbase::base: remove include passwords::cassandra [puppet] - 10https://gerrit.wikimedia.org/r/503151 [19:59:50] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values --set debug_mode_enabled=true stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [19:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:53] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [19:59:53] !log 
otto@deploy1001 scap-helm eventgate-analytics finished [19:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [20:04:37] (03PS1) 10ArielGlenn: clean up old symlinks once per job, not for every output file [dumps] - 10https://gerrit.wikimedia.org/r/504421 [20:04:44] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values --set debug_mode_enabled=true stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [20:04:46] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [20:04:46] !log otto@deploy1001 scap-helm eventgate-analytics finished [20:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:04] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns for db1139/1140 [dns] - 10https://gerrit.wikimedia.org/r/504414 (https://phabricator.wikimedia.org/T218985) (owner: 10Cmjohnson) [20:05:49] !log mobrovac@deploy1001 Started deploy [restbase/deploy@dfca9e6] (dev-cluster): Use the simplified key/value bucket [20:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:14] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install db1139|db1140.eqiad.wmnet (2 dump slaves) - https://phabricator.wikimedia.org/T218985 (10Cmjohnson) [20:06:23] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:07:01] 10Operations, 10ops-eqiad, 10DBA, 
10Patch-For-Review: rack/setup/install db1139|db1140.eqiad.wmnet (2 dump slaves) - https://phabricator.wikimedia.org/T218985 (10Cmjohnson) a:05Cmjohnson→03RobH @Marostegui @robh these are racked and all on-site work is completed. [20:08:35] (03PS1) 10Andrew Bogott: Rename more labvirts to cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/504422 (https://phabricator.wikimedia.org/T221138) [20:09:22] (03PS2) 10Andrew Bogott: Rename more labvirts to cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/504422 (https://phabricator.wikimedia.org/T221138) [20:11:14] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@dfca9e6] (dev-cluster): Use the simplified key/value bucket (duration: 05m 24s) [20:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:07] (03PS1) 10Andrew Bogott: Rename more labvirts to cloudvirts [dns] - 10https://gerrit.wikimedia.org/r/504424 (https://phabricator.wikimedia.org/T221138) [20:12:30] (03PS2) 10Andrew Bogott: Rename more labvirts to cloudvirts [dns] - 10https://gerrit.wikimedia.org/r/504424 (https://phabricator.wikimedia.org/T221138) [20:13:14] (03CR) 10Andrew Bogott: [C: 03+2] Rename more labvirts to cloudvirts [dns] - 10https://gerrit.wikimedia.org/r/504424 (https://phabricator.wikimedia.org/T221138) (owner: 10Andrew Bogott) [20:13:26] (03CR) 10Andrew Bogott: [C: 03+2] Rename more labvirts to cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/504422 (https://phabricator.wikimedia.org/T221138) (owner: 10Andrew Bogott) [20:15:19] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [20:15:52] (03PS2) 10ArielGlenn: clean up old symlinks once per job, not for every output file [dumps] - 10https://gerrit.wikimedia.org/r/504421 [20:15:55] (03CR) 10Nuria: [C: 03+1] admin: allow analytics-admins to control jupyter user units [puppet] - 10https://gerrit.wikimedia.org/r/504067 (owner: 10Elukey) 
[20:17:31] (03CR) 10ArielGlenn: [C: 03+2] clean up old symlinks once per job, not for every output file [dumps] - 10https://gerrit.wikimedia.org/r/504421 (owner: 10ArielGlenn) [20:17:50] (03PS1) 10CRusnov: Change black check to not enforce quote style. [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504426 [20:19:42] !log ariel@deploy1001 Started deploy [dumps/dumps@796ccb5]: use safe_load yaml and getReplicaServer.php, cleanup symlinks once per job only [20:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:46] !log ariel@deploy1001 Finished deploy [dumps/dumps@796ccb5]: use safe_load yaml and getReplicaServer.php, cleanup symlinks once per job only (duration: 00m 04s) [20:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:57] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install db1139|db1140.eqiad.wmnet (2 dump slaves) - https://phabricator.wikimedia.org/T218985 (10Marostegui) So from DC Ops side only missing the production DNS entries? Thanks Chris! [20:23:15] !log mobrovac@deploy1001 Started deploy [restbase/deploy@dfca9e6]: Use the simplified key/value bucket - T215960 [20:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:19] T215960: Simplify MCS storage model - https://phabricator.wikimedia.org/T215960 [20:25:51] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:28:00] (03PS4) 10Mforns: Adapt saltrotate and EventLoggingSanitization params in data_purge.pp [puppet] - 10https://gerrit.wikimedia.org/r/485063 (https://phabricator.wikimedia.org/T212014) [20:29:53] can I resume the train now? [20:31:19] (03PS5) 10Mforns: Adapt saltrotate and EventLoggingSanitization params in data_purge.pp [puppet] - 10https://gerrit.wikimedia.org/r/485063 (https://phabricator.wikimedia.org/T212014) [20:31:19] Things look like they've settled down. 
I think you're safe to resume. [20:33:03] (03PS6) 10Mforns: Adapt saltrotate and EventLoggingSanitization params in data_purge.pp [puppet] - 10https://gerrit.wikimedia.org/r/485063 (https://phabricator.wikimedia.org/T212014) [20:37:07] (03CR) 10Mforns: [C: 04-1] "This is not WIP any more, but as it needs to wait for https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/484250/ to be deployed befor" [puppet] - 10https://gerrit.wikimedia.org/r/485063 (https://phabricator.wikimedia.org/T212014) (owner: 10Mforns) [20:38:10] (03CR) 10Dzahn: [C: 04-2] "https://puppet-compiler.wmflabs.org/compiler1002/15836/restbase1013.eqiad.wmnet/change.restbase1013.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/503151 (owner: 10Dzahn) [20:39:10] (03Abandoned) 10Dzahn: restbase::base: remove include passwords::cassandra [puppet] - 10https://gerrit.wikimedia.org/r/503151 (owner: 10Dzahn) [20:39:47] 10Operations, 10DC-Ops, 10LDAP-Access-Requests: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10ayounsi) p:05Triage→03High [20:42:40] 10Operations, 10DC-Ops, 10LDAP-Access-Requests: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) [20:42:57] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values --set debug_mode_enabled=true stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [20:42:58] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [20:42:58] !log otto@deploy1001 scap-helm eventgate-analytics finished [20:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:07] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@dfca9e6]: Use the simplified key/value bucket - T215960 (duration: 19m 52s) [20:43:11] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:12] T215960: Simplify MCS storage model - https://phabricator.wikimedia.org/T215960 [20:43:31] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10Cmjohnson) [20:43:50] (03PS1) 1020after4: testwikis wikis to 1.34.0-wmf.1 refs T220726 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504428 [20:43:52] (03CR) 1020after4: [C: 03+2] testwikis wikis to 1.34.0-wmf.1 refs T220726 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504428 (owner: 1020after4) [20:44:03] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:44:40] (03Merged) 10jenkins-bot: testwikis wikis to 1.34.0-wmf.1 refs T220726 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504428 (owner: 1020after4) [20:44:48] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10Cmjohnson) @ayounsi lvs1013 and 1014 on-site work has been completed. I did not add the LVS vlan....I will leave that to you. I still need to run the cross-connects bu... 
[20:45:01] !log twentyafterfour@deploy1001 Started scap: testwikis wikis to 1.34.0-wmf.1 refs T220726 [20:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:06] T220726: 1.34.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T220726 [20:45:20] 10Operations, 10Wikimedia-Logstash: Kibana breaks during rolling upgrade - https://phabricator.wikimedia.org/T221143 (10herron) p:05Triage→03Normal [20:45:31] 10Operations, 10Wikimedia-Logstash, 10User-herron: Kibana breaks during rolling upgrade - https://phabricator.wikimedia.org/T221143 (10herron) [20:52:23] 10Operations, 10DC-Ops, 10LDAP-Access-Requests: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) a:03Reedy [20:52:57] 10Operations, 10Mail: remove RT mail aliases - https://phabricator.wikimedia.org/T220844 (10Dzahn) a:03Dzahn [20:53:11] (03CR) 10jenkins-bot: testwikis wikis to 1.34.0-wmf.1 refs T220726 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504428 (owner: 1020after4) [20:53:41] RECOVERY - High lag on wdqs1009 is OK: (C)3600 ge (W)1200 ge 1179 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [20:55:04] !log removing keystone endpoints for the 'eqiad' region [20:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:11] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:58:55] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [20:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:58] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [20:58:58] !log otto@deploy1001 scap-helm eventgate-analytics finished [20:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:03] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [21:00:40] (03PS1) 10Andrew Bogott: openstack: remove some references to the old 'eqiad' region [puppet] - 10https://gerrit.wikimedia.org/r/504430 [21:02:19] 10Operations, 10DC-Ops, 10LDAP-Access-Requests: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10ayounsi) [21:02:23] (03CR) 10Andrew Bogott: [C: 03+2] openstack: remove some references to the old 'eqiad' region [puppet] - 10https://gerrit.wikimedia.org/r/504430 (owner: 10Andrew Bogott) [21:02:39] !log cdanis@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1280.eqiad.wmnet [21:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:50] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack and cable frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T213104 (10Cmjohnson) @jgreen I need an mgmt IP for this.... I apologize but I don't recall any of the fundraising mgmt info. 
please provide IP, Subnet, and Gateway [21:02:54] 10Operations, 10ops-eqiad, 10serviceops: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (10CDanis) 17:02:39 <+logmsgbot> !log cdanis@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1280.eqiad.wmnet [21:02:57] 10Operations, 10DC-Ops, 10LDAP-Access-Requests: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10ayounsi) [21:03:09] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [21:03:14] (03PS2) 10Alex Monk: openstack: Replace main clientpackages/observerenv stuff with eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/503407 [21:03:33] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [21:03:39] (03PS3) 10Andrew Bogott: openstack: Replace main clientpackages/observerenv stuff with eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/503407 (owner: 10Alex Monk) [21:04:07] 10Operations, 10ops-eqiad, 10serviceops: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (10CDanis) a:05Cmjohnson→03jijiki [21:05:22] 10Operations, 10ops-eqiad, 10serviceops: mw1280 crashed - https://phabricator.wikimedia.org/T218006 (10jijiki) 05Open→03Resolved @CDanis Thank you! I am resolving this for now. 
[21:07:13] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [21:09:19] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active [21:09:57] !log add wpao to wmf/ops in LDAP - T221142 [21:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:02] T221142: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 [21:11:06] 10Operations, 10DC-Ops, 10LDAP-Access-Requests: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10ayounsi) [21:11:14] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack and cable frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T213104 (10Jgreen) >>! In T213104#5116588, @Cmjohnson wrote: > @jgreen I need an mgmt IP for this.... I apologize but I don't recall any of the fundraising mgmt info. > > please p... [21:12:16] (03CR) 10DCausse: [C: 03+1] Return CirrusSearch to standard execution on eqiad cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504375 (owner: 10EBernhardson) [21:12:27] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack and cable frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T213104 (10Cmjohnson) Great! I will make all the changes locally if you can update your side. Thanks! 
[21:13:09] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [21:13:33] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [21:14:27] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active [21:14:29] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: rack and cable frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T213104 (10Cmjohnson) @Jgreen production ports Port eth0 is fasw-c1a port 17 port eth1 is fasw-c2a port 17 [21:17:27] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [21:17:57] (03CR) 10Andrew Bogott: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/15841/" [puppet] - 10https://gerrit.wikimedia.org/r/503407 (owner: 10Alex Monk) [21:18:05] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 62.53, 34.08, 20.99 [21:18:19] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [21:18:26] (03PS1) 10Jgreen: repurpose thulium.mgmt.frack.wmnet IP for frav1002.mgmt.frack.wmnet [dns] - 10https://gerrit.wikimedia.org/r/504431 (https://phabricator.wikimedia.org/T213104) [21:18:57] (03CR) 10Volans: coherence report: General improvements and rack checks (036 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [21:20:01] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [21:20:10] (03CR) 10Jgreen: [C: 03+2] repurpose thulium.mgmt.frack.wmnet IP for frav1002.mgmt.frack.wmnet [dns] - 
10https://gerrit.wikimedia.org/r/504431 (https://phabricator.wikimedia.org/T213104) (owner: 10Jgreen) [21:20:32] 10Operations, 10DC-Ops, 10LDAP-Access-Requests: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10ayounsi) [21:20:43] 10Operations, 10DC-Ops, 10LDAP-Access-Requests: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10RobH) I just added nick wiki_willy to the _security irc channel. [21:21:49] !log twentyafterfour@deploy1001 Finished scap: testwikis wikis to 1.34.0-wmf.1 refs T220726 (duration: 36m 47s) [21:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:54] T220726: 1.34.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T220726 [21:21:57] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 12.66, 22.81, 19.34 [21:23:29] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active [21:24:21] !log deleting 'eqiad' endpoint in keystone [21:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:21] (03PS1) 10Jforrester: wikitech: Enable the UserMerge extension for clean-up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504434 (https://phabricator.wikimedia.org/T165795) [21:28:00] bd808: ^^ [21:28:45] (03CR) 10BryanDavis: [C: 03+1] wikitech: Enable the UserMerge extension for clean-up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504434 (https://phabricator.wikimedia.org/T165795) (owner: 10Jforrester) [21:34:20] 10Operations, 10DC-Ops, 10LDAP-Access-Requests: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Reedy) a:05Reedy→03None [21:42:21] (03PS1) 10Andrew Bogott: Define profile::openstack::eqiad1::region in labs.yaml [puppet] - 10https://gerrit.wikimedia.org/r/504438 [21:42:49] (03CR) 10Paladox: [C: 03+1] Define profile::openstack::eqiad1::region in labs.yaml [puppet] - 10https://gerrit.wikimedia.org/r/504438 (owner: 
10Andrew Bogott) [21:42:53] (03CR) 10Andrew Bogott: [C: 03+2] Define profile::openstack::eqiad1::region in labs.yaml [puppet] - 10https://gerrit.wikimedia.org/r/504438 (owner: 10Andrew Bogott) [21:43:47] (03PS1) 1020after4: group0 wikis to 1.34.0-wmf.1 refs T220726 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504439 [21:43:49] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.34.0-wmf.1 refs T220726 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504439 (owner: 1020after4) [21:44:51] (03Merged) 10jenkins-bot: group0 wikis to 1.34.0-wmf.1 refs T220726 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504439 (owner: 1020after4) [21:46:54] (03PS1) 10Alex Monk: [DNM] Experiment: Try removing file that may not actually be used [puppet] - 10https://gerrit.wikimedia.org/r/504440 [21:47:21] (03CR) 10jerkins-bot: [V: 04-1] [DNM] Experiment: Try removing file that may not actually be used [puppet] - 10https://gerrit.wikimedia.org/r/504440 (owner: 10Alex Monk) [21:47:30] (03PS2) 10Alex Monk: [DNM] Experiment: Try removing file that may not actually be used [puppet] - 10https://gerrit.wikimedia.org/r/504440 [21:47:34] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.34.0-wmf.1 refs T220726 [21:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:38] T220726: 1.34.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T220726 [21:47:42] (03CR) 10Alex Monk: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/504440 (owner: 10Alex Monk) [21:47:57] (03CR) 10jerkins-bot: [V: 04-1] [DNM] Experiment: Try removing file that may not actually be used [puppet] - 10https://gerrit.wikimedia.org/r/504440 (owner: 10Alex Monk) [21:49:05] (03CR) 10jenkins-bot: group0 wikis to 1.34.0-wmf.1 refs T220726 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504439 (owner: 1020after4) [21:49:53] (03PS1) 10Dzahn: admins: add Willy Pao to ldap_only_users [puppet] - 
10https://gerrit.wikimedia.org/r/504441 (https://phabricator.wikimedia.org/T221142) [21:50:41] !log mobrovac@deploy1001 Started deploy [restbase/deploy@f1c767d] (dev-cluster): mobile-sections simplification: use the key/value bucket only [21:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:10] (03PS10) 10Ayounsi: Bird-lg [puppet] - 10https://gerrit.wikimedia.org/r/390330 (https://phabricator.wikimedia.org/T106056) [21:51:44] (03CR) 10Dzahn: [C: 03+2] admins: add Willy Pao to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/504441 (https://phabricator.wikimedia.org/T221142) (owner: 10Dzahn) [21:51:54] (03CR) 10jerkins-bot: [V: 04-1] Bird-lg [puppet] - 10https://gerrit.wikimedia.org/r/390330 (https://phabricator.wikimedia.org/T106056) (owner: 10Ayounsi) [21:53:50] twentyafterfour: Prod clear for a config deploy? [21:56:05] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@f1c767d] (dev-cluster): mobile-sections simplification: use the key/value bucket only (duration: 05m 24s) [21:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:21] !log mobrovac@deploy1001 Started deploy [restbase/deploy@f1c767d]: mobile-sections simplification: use the key/value bucket only - T215960 [21:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:25] T215960: Simplify MCS storage model - https://phabricator.wikimedia.org/T215960 [21:57:26] 10Operations, 10DC-Ops, 10LDAP-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) - added to puppet admins module in the ldap_only_users section - subcribed to both ops mailman lists [22:00:17] 10Operations, 10DC-Ops, 10LDAP-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) [22:00:54] (03PS11) 10Ayounsi: Bird-lg [puppet] - 10https://gerrit.wikimedia.org/r/390330 
(https://phabricator.wikimedia.org/T106056) [22:01:43] (03CR) 10Jforrester: [C: 03+2] wikitech: Enable the UserMerge extension for clean-up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504434 (https://phabricator.wikimedia.org/T165795) (owner: 10Jforrester) [22:02:02] James_F: I'll take a few minutes of deploy time after you're done as well :) [22:02:14] Krinkle: Want me to do it for you? [22:02:15] 10Operations, 10DC-Ops, 10LDAP-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) [22:02:24] James_F: sure, thanks :) [22:02:28] https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/NewUserMessage/+/504445/ [22:02:33] it's in Jenkins land now [22:02:49] (03Merged) 10jenkins-bot: wikitech: Enable the UserMerge extension for clean-up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504434 (https://phabricator.wikimedia.org/T165795) (owner: 10Jforrester) [22:03:03] Right. [22:03:44] James_F: clear [22:04:36] Thanks! [22:06:07] 10Operations, 10DC-Ops, 10LDAP-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Jdforrester-WMF) [22:09:26] 10Operations, 10DC-Ops, 10LDAP-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) [22:09:39] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T165795 Enable the UserMerge extension for clean-up on wikitech (duration: 01m 00s) [22:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:52] T165795: Ldap auth extension vs. ldap vs. username Case - https://phabricator.wikimedia.org/T165795 [22:11:12] Krinkle: Live on mwdebug1002. 
[22:11:49] (03CR) 10jenkins-bot: wikitech: Enable the UserMerge extension for clean-up [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504434 (https://phabricator.wikimedia.org/T165795) (owner: 10Jforrester) [22:13:09] James_F: only 1 new wiki I haven't visited before to test it, and looks good. [22:17:23] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@f1c767d]: mobile-sections simplification: use the key/value bucket only - T215960 (duration: 20m 02s) [22:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:28] T215960: Simplify MCS storage model - https://phabricator.wikimedia.org/T215960 [22:18:38] Krinkle: OK, syncing. [22:19:01] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10ayounsi) [22:19:45] (03PS1) 10Jforrester: wikitech: Give bureaucrats the usermerge right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504447 (https://phabricator.wikimedia.org/T165795) [22:20:13] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/NewUserMessage/includes/NewUserMessage.php: Disable onLocalUserCreated for known bot accounts (duration: 01m 01s) [22:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:17] (03CR) 10Jforrester: [C: 03+2] "Forgot we'd removed this from everyone because it the way it can break SUL wikis." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/504447 (https://phabricator.wikimedia.org/T165795) (owner: 10Jforrester) [22:20:25] (03PS1) 10Thcipriani: gerrit: increase core.packedGitLimit [puppet] - 10https://gerrit.wikimedia.org/r/504448 (https://phabricator.wikimedia.org/T221026) [22:21:16] (03Merged) 10jenkins-bot: wikitech: Give bureaucrats the usermerge right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504447 (https://phabricator.wikimedia.org/T165795) (owner: 10Jforrester) [22:21:18] 10Operations, 10SRE-Access-Requests, 10monitoring: Allow Bryan Davis to downtime alerts in Icinga - https://phabricator.wikimedia.org/T220887 (10Dzahn) @bd808 could you give an example host and/or service in Icinga where it doesn't let you schedule downtime? [22:23:09] (03CR) 10Paladox: [C: 03+1] gerrit: increase core.packedGitLimit [puppet] - 10https://gerrit.wikimedia.org/r/504448 (https://phabricator.wikimedia.org/T221026) (owner: 10Thcipriani) [22:23:19] (03CR) 10jenkins-bot: wikitech: Give bureaucrats the usermerge right [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504447 (https://phabricator.wikimedia.org/T165795) (owner: 10Jforrester) [22:23:36] 10Operations, 10SRE-Access-Requests, 10monitoring: Allow Bryan Davis to downtime alerts in Icinga - https://phabricator.wikimedia.org/T220887 (10Dzahn) There are 2 ways to do this. either with global permissions for all hosts and services via editing the cgi.cfg file in the puppet repo, or by ensuring that t... [22:24:05] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T165795 Give bureaucrats the usermerge right (duration: 00m 59s) [22:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:10] T165795: Ldap auth extension vs. ldap vs. username Case - https://phabricator.wikimedia.org/T165795 [22:24:33] Okie-dokie, I'm done with prod. 
[22:26:12] (03PS6) 10Dzahn: wikiba.se: add Apache rewrites for www to naked domain [puppet] - 10https://gerrit.wikimedia.org/r/500695 (https://phabricator.wikimedia.org/T99531) [22:28:38] (03CR) 10Dzahn: [C: 03+2] wikiba.se: add Apache rewrites for www to naked domain [puppet] - 10https://gerrit.wikimedia.org/r/500695 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [22:29:36] ^ this isnt on the cluster and not getting traffic yet.. [22:30:48] just because apache rewrites has historically been a dangerous area [22:31:36] 10Operations, 10ops-codfw: find horizontal PDUs in codfw - https://phabricator.wikimedia.org/T221153 (10RobH) p:05Triage→03Low [22:33:04] 10Operations, 10ops-codfw: find horizontal PDUs in codfw - https://phabricator.wikimedia.org/T221153 (10RobH) p:05Low→03Normal [22:33:41] (03CR) 10Dzahn: [C: 03+2] gerrit: increase core.packedGitLimit [puppet] - 10https://gerrit.wikimedia.org/r/504448 (https://phabricator.wikimedia.org/T221026) (owner: 10Thcipriani) [22:33:53] (03PS2) 10Dzahn: gerrit: increase core.packedGitLimit [puppet] - 10https://gerrit.wikimedia.org/r/504448 (https://phabricator.wikimedia.org/T221026) (owner: 10Thcipriani) [22:34:30] (03PS6) 10Bstorm: labstore: refactor the backup roles so they will match the main roles [puppet] - 10https://gerrit.wikimedia.org/r/503123 (https://phabricator.wikimedia.org/T209527) [22:35:23] (03PS7) 10Bstorm: labstore: refactor the backup roles so they will match the main roles [puppet] - 10https://gerrit.wikimedia.org/r/503123 (https://phabricator.wikimedia.org/T209527) [22:35:34] (03CR) 10Dzahn: "merged on master, letting you do the restart when you want" [puppet] - 10https://gerrit.wikimedia.org/r/504448 (https://phabricator.wikimedia.org/T221026) (owner: 10Thcipriani) [22:35:47] mutante: thanks! 
[22:35:50] yw:) [22:39:10] !log restarting gerrit for configuration update https://gerrit.wikimedia.org/r/504448 [22:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:58] !log gerrit back [22:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:37] (03CR) 10Bstorm: [C: 03+2] labstore: refactor the backup roles so they will match the main roles [puppet] - 10https://gerrit.wikimedia.org/r/503123 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [22:46:11] PROBLEM - puppet last run on webperf1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhgui] [22:46:55] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki] [22:47:22] known.. running puppet [22:47:31] PROBLEM - puppet last run on cumin2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] [22:48:57] (03PS5) 10Bstorm: cloudstore: deploy maps/scratch cluster as nfs::secondary [puppet] - 10https://gerrit.wikimedia.org/r/502342 (https://phabricator.wikimedia.org/T209527) [22:51:29] RECOVERY - puppet last run on webperf1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:52:11] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:52:49] RECOVERY - puppet last run on cumin2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [23:00:04] hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190416T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:00:43] huh, i must have mucked up my addition...i have patches to deploy [23:01:00] yea i put them in tomorrow ... shifting [23:01:34] Ha. [23:02:45] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504426 (owner: 10CRusnov) [23:05:40] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) a:05Joe→03Dzahn [23:08:35] (03CR) 10CRusnov: [C: 03+2] Change black check to not enforce quote style. [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504426 (owner: 10CRusnov) [23:08:45] (03PS1) 10Kaldari: Add static.inaturalist.org to $wgCopyUploadsDomains for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504469 (https://phabricator.wikimedia.org/T221154) [23:09:57] (03PS2) 10CRusnov: Change black check to not enforce quote style. 
[software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504426 [23:13:28] (03CR) 10Dzahn: "could you do me a favor and use this duplicate instead? https://gerrit.wikimedia.org/r/c/operations/dns/+/283870" [dns] - 10https://gerrit.wikimedia.org/r/504243 (https://phabricator.wikimedia.org/T220786) (owner: 10Vgutierrez) [23:13:46] (03PS3) 10Dzahn: added spf record to toolserver.org [dns] - 10https://gerrit.wikimedia.org/r/283870 (https://phabricator.wikimedia.org/T131930) (owner: 10Mschon) [23:15:55] (03CR) 10Krinkle: "A brief and partly ignorant/naive response from my side - the status quo before multi-dc is that memcached is not replicated or otherwise " [puppet] - 10https://gerrit.wikimedia.org/r/492948 (owner: 10Aaron Schulz) [23:16:03] (03PS4) 10Dzahn: Add SPF record to toolserver.org [dns] - 10https://gerrit.wikimedia.org/r/283870 (https://phabricator.wikimedia.org/T131930) (owner: 10Mschon) [23:16:23] (03CR) 10jerkins-bot: [V: 04-1] Add SPF record to toolserver.org [dns] - 10https://gerrit.wikimedia.org/r/283870 (https://phabricator.wikimedia.org/T131930) (owner: 10Mschon) [23:17:25] * ebernhardson mutters at npm failing the gate-and-submit ... [23:17:28] (03PS5) 10Dzahn: Add SPF record to toolserver.org [dns] - 10https://gerrit.wikimedia.org/r/283870 (https://phabricator.wikimedia.org/T220786) (owner: 10Mschon) [23:17:49] (03CR) 10jerkins-bot: [V: 04-1] Add SPF record to toolserver.org [dns] - 10https://gerrit.wikimedia.org/r/283870 (https://phabricator.wikimedia.org/T220786) (owner: 10Mschon) [23:20:07] PROBLEM - Host cr4-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [23:20:07] PROBLEM - Host cr4-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [23:20:07] PROBLEM - Host re0.cr4-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [23:21:13] XioNoX: ^ [23:21:14] XioNoX: expected? 
[23:21:29] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 59, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:21:54] (03PS6) 10Dzahn: Add SPF record to toolserver.org [dns] - 10https://gerrit.wikimedia.org/r/283870 (https://phabricator.wikimedia.org/T220786) (owner: 10Mschon) [23:23:05] (03PS2) 10EBernhardson: Return CirrusSearch to standard execution on eqiad cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504375 (https://phabricator.wikimedia.org/T220901) [23:23:50] volans: looks like there was an issue with the link between 3 and 4 before https://phabricator.wikimedia.org/T196030 [23:24:07] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:24:17] RECOVERY - Host cr4-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 75.46 ms [23:24:18] oh,ok [23:24:30] it might have rebooted [23:24:33] ack [23:25:09] PROBLEM - PyBal BGP sessions are established on lvs4007 is CRITICAL: 0 le 0 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=ulsfo+prometheus/ops [23:25:20] System booted: 2019-04-16 23:19:34 UTC (00:05:42 ago) [23:25:29] RECOVERY - Host cr4-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 75.40 ms [23:25:29] RECOVERY - Host re0.cr4-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 75.05 ms [23:25:53] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 49 probes of 407 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [23:26:27] RECOVERY - PyBal BGP sessions are established on lvs4007 is OK: (C)0 le (W)0 le 1 https://grafana.wikimedia.org/dashboard/db/pybal-bgp?var-datasource=ulsfo+prometheus/ops [23:27:44] mutante: if you 
could maybe open a task [23:28:38] (03PS1) 10Krinkle: Add entries to wgCSPFalsePositiveUrls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504474 (https://phabricator.wikimedia.org/T207900) [23:28:52] volans: sure, will do [23:31:09] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 18 probes of 407 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [23:33:25] right when I was biking home [23:33:43] XioNoX: ofc! you know, murphy's :D [23:33:52] System booted: 2019-04-16 23:19:32 UTC (00:14:14 ago) [23:34:28] yeah, see above :) [23:34:37] 10Operations, 10netops: cr4-ulsfo rebooted unexpectedly - https://phabricator.wikimedia.org/T221156 (10Dzahn) [23:34:43] there is a ticket [23:35:41] Apr 16 23:20:49 cr4-ulsfo eventd[5380]: SYSTEM_ABNORMAL_SHUTDOWN: System abnormally shut down [23:35:43] no kidding [23:35:56] rotfl [23:36:05] which command shows you that? [23:36:08] * volans eager to learn [23:36:20] `show log messages` [23:36:25] ack [23:36:51] volans: is everything stable btw? [23:37:21] !log ebernhardson@deploy1001 Synchronized php-1.33.0-wmf.25/extensions/CirrusSearch/includes/: Fix fatals on malformed search queries against overridden clusters (duration: 01m 06s) [23:37:34] seems so, I didn't see any other alert going off apart from the expected ones [23:37:43] well, all the alerts recovered [23:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:11] volans: did it cause any user facing issue? or was the failover smooth? 
[23:39:07] yeah, small dip https://grafana.wikimedia.org/d/000000343/load-balancers?orgId=1&panelId=11&fullscreen&from=now-1h&to=now [23:39:11] we did not hear any user complaints, just the icinga alerts [23:39:35] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review: Elasticsearch nodes overloading in eqiad - https://phabricator.wikimedia.org/T220901 (10EBernhardson) [23:39:40] (03CR) 10EBernhardson: [C: 03+2] Return CirrusSearch to standard execution on eqiad cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504375 (https://phabricator.wikimedia.org/T220901) (owner: 10EBernhardson) [23:39:59] I don't see anything obvious in the logs, I'll collect data and open a JTAC case [23:40:38] what's the task? [23:40:54] (03Merged) 10jenkins-bot: Return CirrusSearch to standard execution on eqiad cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504375 (https://phabricator.wikimedia.org/T220901) (owner: 10EBernhardson) [23:41:05] (03CR) 10jenkins-bot: Return CirrusSearch to standard execution on eqiad cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504375 (https://phabricator.wikimedia.org/T220901) (owner: 10EBernhardson) [23:41:42] XioNoX: https://phabricator.wikimedia.org/T221156 [23:41:55] thx [23:42:39] !log ebernhardson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Return CirrusSearch to standard execution against eqiad cluster (duration: 01m 00s) [23:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:52] XioNoX: how is it possible that the log for shutdown is at 23:20:49 and the system booted at 23:19:32? Also no logs in the 5 minutes prior [23:45:17] i mean, that counts as abnormal! [23:45:23] the log above is when it booted [23:45:31] when it boots it says why it went down [23:45:40] well, "why" [23:45:42] lol [23:45:45] ok [23:46:21] so it logs after the fact [23:46:29] that seems clear also from [23:46:29] Apr 16 23:20:49 cr4-ulsfo kernel: Rebooting... 
[23:46:30] :D [23:46:37] that is after the reboot time reported by uptime [23:47:32] (03PS3) 10CRusnov: coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) [23:47:36] no logs in the 5 min before is normal [23:47:47] the logs are usually quiet, with a cron job every 5min [23:47:47] (03CR) 10CRusnov: coherence report: General improvements and rack checks (036 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [23:47:53] ok [23:48:04] (03CR) 10jerkins-bot: [V: 04-1] coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [23:48:08] so the cause seems to be a spin lock on cpuid 0 [23:48:53] (03PS4) 10CRusnov: coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) [23:49:15] ah? 
[23:49:28] (03CR) 10jerkins-bot: [V: 04-1] coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [23:50:17] from the logs [23:50:25] ah yeah, I see it [23:50:57] https://www.irccloud.com/pastebin/xtT1M3DL/ [23:51:08] yep that part [23:51:18] (03PS5) 10CRusnov: coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) [23:51:38] https://www.irccloud.com/pastebin/i50kuuWL/ [23:51:44] it shared a core dump [23:51:47] how nice [23:53:59] nice, add that to the task ;) [23:54:04] yep [23:54:10] I'm heading off, kinda late here [23:54:13] thx for the 2nd look :) [23:54:35] yw, first time I tried my access for something real :D [23:58:33] (03PS1) 10Dzahn: icinga: let BryanDavis issue commands on all hosts and services [puppet] - 10https://gerrit.wikimedia.org/r/504482 (https://phabricator.wikimedia.org/T220887) [23:58:35] 10Operations, 10SRE-Access-Requests, 10monitoring: Allow Bryan Davis to downtime alerts in Icinga - https://phabricator.wikimedia.org/T220887 (10bd808) >>! In T220887#5116784, @Dzahn wrote: > @bd808 could you give an example host and/or service in Icinga where it doesn't let you schedule downtime? I don't k...