[00:01:20] (PS1) Ayounsi: Netbox, set the napalm_username variable and matching keyholder [puppet] - https://gerrit.wikimedia.org/r/464082 (https://phabricator.wikimedia.org/T205898)
[00:04:53] (CR) Ayounsi: "Some questions/comments:" [puppet] - https://gerrit.wikimedia.org/r/464082 (https://phabricator.wikimedia.org/T205898) (owner: Ayounsi)
[00:06:11] Operations, Traffic: Update certspotter - https://phabricator.wikimedia.org/T204993 (faidon) >>! In T204993#4610222, @MoritzMuehlenhoff wrote: > Adding the Debian maintainer :-) This seems fixed in 0.9-1 so updating stretch-backports to 0.9 could fix this. This is now done :)
[00:12:10] Operations, Wikimedia-Mailing-lists: Transfer Mailman List ownership - https://phabricator.wikimedia.org/T206089 (eliza) I’m unsure - i’ve cc’d Ellie. Eliza
[00:14:11] MaxSem: your patch (wmf.24) is live on mwdebug2002
[00:17:58] MaxSem: ?
[00:18:05] Are you around?
[00:18:17] sorry
[00:18:21] yes, it works
[00:18:40] ack. going live
[00:18:57] Operations, monitoring, Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (Dzahn) On icinga1001 we have /usr/lib/nagios/plugins/ with a lot of plugins, just like on einsteinium, BUT, on icinga1001 we don't have "/usr/lib/nagios/plug...
[00:19:29] !log ladsgroup@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/GlobalPreferences/resources/ext.GlobalPreferences.global.ooui.js: SWAT: [[gerrit:464071|Fail gracefully if we failed to find associated widget (T205991)]] (duration: 00m 55s)
[00:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:33] T205991: "Select options below to be global" missing a checkbox - https://phabricator.wikimedia.org/T205991
[00:22:48] Operations, OTRS, Stewards-and-global-tools: https://meta.wikimedia.org/wiki/Special:Contact/Stewards is being abused by spammers - https://phabricator.wikimedia.org/T188985 (JJMC89) Open>Resolved
[00:23:28] stephanebisson: your patch needs some time to test for the job runner and similar stuff
[00:23:39] I would do it tomorrow EU-mid day SWAT
[00:23:47] would it work for you?
[00:24:08] It's waaay over time of the SWAT
[00:24:18] Amir1: yeah, I won't be available though, is it ok for you?
[00:24:29] yup
[00:24:54] Amir1: thanks for all the other patches
[00:25:17] you're welcome. Sorry about it.
This SWAT was crazy
[00:38:18] MaxSem: the wmf.23 is live on mwdebug2002
[00:38:46] !log icinga1001 (not prod yet), removing all icinga packages, running puppet to reinstall them, debugging dpkg issue
[00:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:50] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 58.65 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:40:13] Amir1: works
[00:40:36] let's go
[00:41:27] !log ladsgroup@deploy1001 Synchronized php-1.32.0-wmf.23/extensions/GlobalPreferences/resources/ext.GlobalPreferences.global.ooui.js: SWAT: [[gerrit:464070|Fail gracefully if we failed to find associated widget (T205991)]] (duration: 00m 57s)
[00:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:41:31] T205991: "Select options below to be global" missing a checkbox - https://phabricator.wikimedia.org/T205991
[00:42:17] thank you Amir1
[00:42:50] yw
[00:42:57] !log Evening SWAT is done
[00:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:19] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 88.77 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[01:03:49] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:08:00] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:12:46] (PS1) Dzahn: base: do not allow mailman server to NRPE to other hosts for no reason [puppet] - https://gerrit.wikimedia.org/r/464086
[01:13:36] (PS2) Dzahn: base: do not allow mailman server to NRPE to other hosts for no reason [puppet] - https://gerrit.wikimedia.org/r/464086
[01:16:00] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:16:43] eh.. didn't touch einsteinium at all..looking
[01:21:09] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:21:21] (PS1) Dzahn: base/icinga/nrpe: move nrpe_allowed IPs to Hiera, add icinga1001 [puppet] - https://gerrit.wikimedia.org/r/464087 (https://phabricator.wikimedia.org/T202782)
[01:21:27] just ran puppet and it's ok
[01:22:13] (PS3) Dzahn: base: do not allow mailman server to NRPE to other hosts for no reason [puppet] - https://gerrit.wikimedia.org/r/464086 (https://phabricator.wikimedia.org/T202782)
[01:23:15] (PS2) Dzahn: base/icinga/nrpe: move nrpe_allowed IPs to Hiera, add icinga1001 [puppet] - https://gerrit.wikimedia.org/r/464087 (https://phabricator.wikimedia.org/T202782)
[01:24:53] (CR) Cwhite: [C: 1] base/icinga/nrpe: move nrpe_allowed IPs to Hiera, add icinga1001 [puppet] - https://gerrit.wikimedia.org/r/464087 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[01:25:09] (PS1) Dzahn: icinga: enable icinga service on icinga1001 [puppet] - https://gerrit.wikimedia.org/r/464088 (https://phabricator.wikimedia.org/T202782)
[01:39:19] PROBLEM - High lag on wdqs1010 is CRITICAL: 9.270e+04 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[03:03:29] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL:
/{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) timed out before a response was received
[03:04:29] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[03:07:49] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) timed out before a response was received
[03:09:50] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[03:50:58] Operations, Wikimedia-Mailing-lists: Transfer mailman ownership of Wikimania lists - https://phabricator.wikimedia.org/T206089 (Krinkle)
[04:45:09] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/NavigationTiming: T205580 - I04c52658fbf6d (duration: 01m 03s)
[04:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:45:14] T205580: Microbenchmark device power and record results in NavigationTiming - https://phabricator.wikimedia.org/T205580
[04:57:50] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 59.55 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:02:10] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 73.71 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:17:15] (PS1) Marostegui: db-codfw.php: Depool db2070 [mediawiki-config] - https://gerrit.wikimedia.org/r/464094 (https://phabricator.wikimedia.org/T205913)
[05:18:25] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2070 [mediawiki-config] - https://gerrit.wikimedia.org/r/464094 (https://phabricator.wikimedia.org/T205913) (owner: Marostegui)
[05:19:32] (Merged) jenkins-bot: db-codfw.php: Depool db2070 [mediawiki-config] - https://gerrit.wikimedia.org/r/464094 (https://phabricator.wikimedia.org/T205913) (owner: Marostegui)
[05:20:21] (PS1) Marostegui: Revert "db-codfw.php: Depool db2070" [mediawiki-config] - https://gerrit.wikimedia.org/r/464095
[05:20:42] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2070 (duration: 00m 56s)
[05:20:44] !log Deploy schema change on db2070 - T205913
[05:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:20:48] T205913: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913
[05:21:42] !log Deploy schema change on db1075 (s3 eqiad master), lag will be generated - T205913
[05:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:37] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2070" [mediawiki-config] - https://gerrit.wikimedia.org/r/464095 (owner: Marostegui)
[05:23:41] (CR) jenkins-bot: db-codfw.php: Depool db2070 [mediawiki-config] - https://gerrit.wikimedia.org/r/464094 (https://phabricator.wikimedia.org/T205913) (owner: Marostegui)
[05:23:47] (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2070" [mediawiki-config] - https://gerrit.wikimedia.org/r/464095 (owner: Marostegui)
[05:24:00] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2070" [mediawiki-config] - https://gerrit.wikimedia.org/r/464095 (owner: Marostegui)
[05:24:19] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/languages/Language.php: T206030 - I985dfa3eb17 (duration: 00m 56s)
[05:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:24] T206030: "PHP Notice: Undefined index: 810" from ApiQuerySiteinfo - https://phabricator.wikimedia.org/T206030
[05:25:24] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2070 (duration: 00m 57s)
[05:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:26] !log Deploy schema change on db1067 (s1
eqiad master), lag will be generated - T205913
[05:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:26:30] T205913: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913
[05:51:57] (PS1) Marostegui: db-codfw.php: Depool db2085:3111 [mediawiki-config] - https://gerrit.wikimedia.org/r/464099 (https://phabricator.wikimedia.org/T205913)
[05:53:17] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2085:3111 [mediawiki-config] - https://gerrit.wikimedia.org/r/464099 (https://phabricator.wikimedia.org/T205913) (owner: Marostegui)
[05:54:22] (Merged) jenkins-bot: db-codfw.php: Depool db2085:3111 [mediawiki-config] - https://gerrit.wikimedia.org/r/464099 (https://phabricator.wikimedia.org/T205913) (owner: Marostegui)
[05:54:52] Operations, Analytics, Traffic, Services (blocked): Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (Pchelolo) > Ok, @Pchelolo gets the persistence award! Yay! I've got the award! > Let me understand: are there other headers we would need besides the accept one...
[05:55:54] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2085:3311 (duration: 00m 58s)
[05:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:22] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@e1aab7b]: Request Parsoid HTML version 2.0.0 (0866a07)
[05:56:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:28] !log Deploy schema change on db2085:3311 - T205913
[05:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:32] T205913: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913
[05:59:54] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@e1aab7b]: Request Parsoid HTML version 2.0.0 (0866a07) (duration: 03m 32s)
[05:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:57] (PS1) Marostegui: Revert "db-codfw.php: Depool db2085:3111" [mediawiki-config] - https://gerrit.wikimedia.org/r/464100
[06:01:11] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2085:3111" [mediawiki-config] - https://gerrit.wikimedia.org/r/464100 (owner: Marostegui)
[06:02:22] (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2085:3111" [mediawiki-config] - https://gerrit.wikimedia.org/r/464100 (owner: Marostegui)
[06:03:31] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2085:3311 (duration: 00m 56s)
[06:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:47] (CR) jenkins-bot: db-codfw.php: Depool db2085:3111 [mediawiki-config] - https://gerrit.wikimedia.org/r/464099 (https://phabricator.wikimedia.org/T205913) (owner: Marostegui)
[06:07:49] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2085:3111" [mediawiki-config] - https://gerrit.wikimedia.org/r/464100 (owner: Marostegui)
[06:28:29] PROBLEM - puppet last run on analytics1071 is CRITICAL: CRITICAL:
Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_sysctl]
[06:31:20] (PS1) Marostegui: db-codfw.php: Depool db2055 [mediawiki-config] - https://gerrit.wikimedia.org/r/464102 (https://phabricator.wikimedia.org/T205913)
[06:33:18] (PS2) Giuseppe Lavagetto: profile::mediawiki::php: add support for php-fpm [puppet] - https://gerrit.wikimedia.org/r/455154 (https://phabricator.wikimedia.org/T201140)
[06:33:30] (CR) Marostegui: [C: 2] db-codfw.php: Depool db2055 [mediawiki-config] - https://gerrit.wikimedia.org/r/464102 (https://phabricator.wikimedia.org/T205913) (owner: Marostegui)
[06:35:05] (Merged) jenkins-bot: db-codfw.php: Depool db2055 [mediawiki-config] - https://gerrit.wikimedia.org/r/464102 (https://phabricator.wikimedia.org/T205913) (owner: Marostegui)
[06:36:25] (CR) jenkins-bot: db-codfw.php: Depool db2055 [mediawiki-config] - https://gerrit.wikimedia.org/r/464102 (https://phabricator.wikimedia.org/T205913) (owner: Marostegui)
[06:37:35] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2055 (duration: 00m 56s)
[06:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:40] !log Deploy schema change on db2055 - T205913
[06:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:44] T205913: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913
[06:38:20] (PS1) Marostegui: Revert "db-codfw.php: Depool db2055" [mediawiki-config] - https://gerrit.wikimedia.org/r/464103
[06:39:36] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2055" [mediawiki-config] - https://gerrit.wikimedia.org/r/464103 (owner: Marostegui)
[06:42:58] (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2055" [mediawiki-config] - https://gerrit.wikimedia.org/r/464103 (owner: Marostegui)
[06:42:59] !log marostegui@deploy1001
Synchronized wmf-config/db-codfw.php: Repool db2055 (duration: 00m 55s)
[06:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:09] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 27 probes of 317 (alerts on 25) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[06:50:09] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 16 probes of 317 (alerts on 25) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[06:50:23] (CR) Alexandros Kosiaris: [C: 1] "This is arguably better solved in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464087/. But indeed that default value in hiera i" [puppet] - https://gerrit.wikimedia.org/r/464086 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[06:50:23] (CR) Alexandros Kosiaris: [C: 1] base/icinga/nrpe: move nrpe_allowed IPs to Hiera, add icinga1001 [puppet] - https://gerrit.wikimedia.org/r/464087 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn)
[06:51:01] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2055" [mediawiki-config] - https://gerrit.wikimedia.org/r/464103 (owner: Marostegui)
[06:58:50] RECOVERY - puppet last run on analytics1071 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:05:38] !log mholloway-shell@deploy1001 Started deploy [kartotherian/deploy@27062b4] (maps1004): Specify WDQS endpoint at wdqs.discovery.wmnet in the service config (T205607)
[07:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:42] T205607: Kartotherian should use discovery endpoint to connect to wikidata query service - https://phabricator.wikimedia.org/T205607
[07:06:06] !log mholloway-shell@deploy1001 Finished deploy [kartotherian/deploy@27062b4] (maps1004): Specify WDQS endpoint at
wdqs.discovery.wmnet in the service config (T205607) (duration: 00m 28s)
[07:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:08] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` an-coord1001.eqiad.wmnet ``` The log can...
[07:15:11] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['an-coord1001.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['an-coord1001.eqiad.wmnet...
[07:15:24] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` an-coord1001.eqiad.wmnet ``` The log can...
[07:19:21] Operations, DBA, cloud-services-team, wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (jcrespo) So the plan is: * Ignore the extra dbs for s5 connection on multisource host (dbstore1002, labsdb1009/10/11) with: ``` STOP SLAVE 's5'; set def...
[07:21:18] Operations, Wikidata, Wikidata-Query-Service, User-Addshore: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Smalyshev) Open>Resolved Looks like the problems are gone for now.
[07:31:15] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['an-coord1001.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['an-coord1001.eqiad.wmnet...
[07:32:43] people still report broken UW on Commons
[07:32:43] (CR) Smalyshev: "I think this is ready, does it need anything else?" [puppet] - https://gerrit.wikimedia.org/r/461862 (https://phabricator.wikimedia.org/T202830) (owner: Smalyshev)
[07:32:48] please fix that asap
[07:34:46] can you link the task in here yannf ?
[07:35:11] (so people will have a quick pointer)
[07:50:59] (PS1) Alexandros Kosiaris: k8s: Alert on the sum of operation types [puppet] - https://gerrit.wikimedia.org/r/464104
[07:51:21] !log deploying replication filters to s5 at labsdb1009/10/11 and dbstore1002 T184805
[07:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:26] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805
[07:52:07] elukey, https://phabricator.wikimedia.org/T205636
[07:52:41] this was supposed to be fixed, and it is still in "Needs Triage" :/
[07:53:57] (CR) ArielGlenn: "> I think this is ready, does it need anything else?" [puppet] - https://gerrit.wikimedia.org/r/461862 (https://phabricator.wikimedia.org/T202830) (owner: Smalyshev)
[08:14:54] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (elukey) After hitting by mistake F10 the host got stuck several times in: `Unified Server Configurator does not support console redirection` After...
[08:18:02] Operations: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442 (faidon) Had a chat with @Joe, apparently rdb1005/6 are currently unused and can be reimaged at any point in time.
There are some longer-term goals here (rebuilding these with stretch, a newer version of Redis, which requi...
[08:18:43] !log starting importing of certain s3 wikis into eqiad s5 master T184805
[08:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:47] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805
[08:21:40] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1007, Errmsg: Error Cant create database mgwiktionary: database exists on query. Default database: mgwiktionary. [Query snipped]
[08:23:21] handling that
[08:29:11] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table mgwiktionary.site_stats doesnt exist on query. Default database: mgwiktionary. [Query snipped]
[08:29:59] jynus: ^
[08:30:20] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[08:30:37] yep yep
[08:32:15] Operations, Analytics, Traffic, Services (blocked): Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (phuedx) ☝️ Best I could do at short notice…
[08:33:47] Operations, Traffic: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (Vgutierrez) @Papaul that's right, reinstalling both servers would be the fastest/safest approach :)
[08:34:31] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[08:34:36] !log fixing replication filters on dbstore1002
[08:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:36] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (elukey) From System Setup I can see that `D0:94:66:5F:75:BC` (set in puppet) shows Link Status Connected, while the
other NICs are disconnected, so...
[08:37:01] PROBLEM - kubelet operational latencies on kubestage1002 is CRITICAL: instance=kubestage1002.eqiad.wmnet operation_type={create_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:38:11] RECOVERY - kubelet operational latencies on kubestage1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[08:47:12] (PS3) Giuseppe Lavagetto: profile::mediawiki::php: add support for php-fpm [puppet] - https://gerrit.wikimedia.org/r/455154 (https://phabricator.wikimedia.org/T201140)
[08:50:05] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (elukey) If I leave the PXE boot running (even if it seems stuck in a blank screen) I end up in: ``` Loading Linux 4.9.0-8-amd64 ... Loading initial...
[08:50:42] Operations, Wikimedia-Mailing-lists: I get a "403 Forbidden" error when subscribing to a list - https://phabricator.wikimedia.org/T205694 (Margott) I'm having the same error for GLAM mailing list.
{F26288183}
[08:51:05] (PS2) Giuseppe Lavagetto: mediawiki: Install php-gd for ZeroBanner [puppet] - https://gerrit.wikimedia.org/r/462584 (owner: Legoktm)
[08:51:56] (Abandoned) Elukey: role::cache::text: add matomo1001 backend for piwik.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/463957 (https://phabricator.wikimedia.org/T202962) (owner: Elukey)
[08:55:48] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: Install php-gd for ZeroBanner [puppet] - https://gerrit.wikimedia.org/r/462584 (owner: Legoktm)
[08:58:00] (PS1) Elukey: role::cache::text: add a backed for matomo1001 [puppet] - https://gerrit.wikimedia.org/r/464110 (https://phabricator.wikimedia.org/T202962)
[08:59:43] (PS4) Giuseppe Lavagetto: profile::mediawiki::php: add support for php-fpm [puppet] - https://gerrit.wikimedia.org/r/455154 (https://phabricator.wikimedia.org/T201140)
[08:59:45] (PS1) Giuseppe Lavagetto: profile::mediawiki::php: add apcu-bc for backwards compatibility [puppet] - https://gerrit.wikimedia.org/r/464111 (https://phabricator.wikimedia.org/T201140)
[09:00:46] (PS1) Elukey: Replace bohrium with matomo1001 in cache text configuration [puppet] - https://gerrit.wikimedia.org/r/464112 (https://phabricator.wikimedia.org/T202962)
[09:02:36] (PS2) Elukey: Replace bohrium with matomo1001 in cache text configuration [puppet] - https://gerrit.wikimedia.org/r/464112 (https://phabricator.wikimedia.org/T202962)
[09:05:28] Operations, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Optimize networking configuration for WDQS - https://phabricator.wikimedia.org/T206105 (Gehel)
[09:06:06] Operations, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Optimize networking configuration for WDQS - https://phabricator.wikimedia.org/T206105 (Gehel) p:Triage>High
[09:08:10] (PS1) Elukey: Clean up bohrium's references in cache text [puppet] -
https://gerrit.wikimedia.org/r/464113 (https://phabricator.wikimedia.org/T202962)
[09:18:44] (PS1) Alexandros Kosiaris: Specify policyTypes in Network Policies [deployment-charts] - https://gerrit.wikimedia.org/r/464116
[09:19:40] PROBLEM - MariaDB Slave SQL: x1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error executing row event: Table cebwiki.echo_event doesnt exist
[09:19:47] ^ we are on that
[09:22:48] (CR) Ema: [C: 1] "Typo in commit log, lgtm otherwise" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/464110 (https://phabricator.wikimedia.org/T202962) (owner: Elukey)
[09:23:24] (PS2) Elukey: role::cache::text: add a backend for matomo1001 [puppet] - https://gerrit.wikimedia.org/r/464110 (https://phabricator.wikimedia.org/T202962)
[09:23:42] !log fixing replication filters on dbstore1002 (again)
[09:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:01] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:24:34] (CR) Ema: [C: 1] Replace bohrium with matomo1001 in cache text configuration [puppet] - https://gerrit.wikimedia.org/r/464112 (https://phabricator.wikimedia.org/T202962) (owner: Elukey)
[09:24:53] (CR) Ema: [C: 1] Clean up bohrium's references in cache text [puppet] - https://gerrit.wikimedia.org/r/464113 (https://phabricator.wikimedia.org/T202962) (owner: Elukey)
[09:26:47] !log reducing io overhead temporarily in exchange for crash safety for s5 replicas T184805
[09:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:51] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805
[09:26:51] (CR) Alexandros Kosiaris: [V: 2 C: 2] Specify policyTypes in Network Policies [deployment-charts] - https://gerrit.wikimedia.org/r/464116 (owner: Alexandros Kosiaris)
[09:32:50] (PS1) Giuseppe Lavagetto:
wmfSetupEtcd: Correctly initialize the local cache [mediawiki-config] - https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370)
[09:32:51] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 51.89 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:35:22] (CR) Giuseppe Lavagetto: [C: -2] "This is not needed, in fact the real solution to the problem is to amend the anti-pattern in mediawiki-config's etcd functions, where we i" [puppet] - https://gerrit.wikimedia.org/r/464111 (https://phabricator.wikimedia.org/T201140) (owner: Giuseppe Lavagetto)
[09:38:39] (PS4) Banyek: wmf-pt-kill: WMF patched version 2 [debs/wmf-pt-kill] - https://gerrit.wikimedia.org/r/463931
[09:41:02] Operations, Core Platform Team Kanban (Watching / External), HHVM, Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Joe) >>! In T176370#4625630, @Legoktm wrote: >>>! In T176370#4625612, @dmaza wrote: >> Is there any particular reason...
[09:42:24] (PS5) Giuseppe Lavagetto: profile::mediawiki::php: add support for php-fpm [puppet] - https://gerrit.wikimedia.org/r/455154 (https://phabricator.wikimedia.org/T201140)
[09:42:41] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 79.81 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:45:49] Operations, monitoring: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (Gehel)
[09:47:45] (PS2) Gehel: mjolnir: Enable loggers created before logging is setup [puppet] - https://gerrit.wikimedia.org/r/455045 (owner: EBernhardson)
[09:48:28] (CR) Gehel: [C: 2] mjolnir: Enable loggers created before logging is setup [puppet] - https://gerrit.wikimedia.org/r/455045 (owner: EBernhardson)
[09:49:28] (PS6) Giuseppe Lavagetto: profile::mediawiki::php: add support for php-fpm [puppet] - https://gerrit.wikimedia.org/r/455154 (https://phabricator.wikimedia.org/T201140)
[09:50:39] (Abandoned) Arturo Borrero Gonzalez: cloudvps: unprotect some nova API queries [puppet] - https://gerrit.wikimedia.org/r/463790 (https://phabricator.wikimedia.org/T203177) (owner: Arturo Borrero Gonzalez)
[09:52:20] PROBLEM - kubelet operational latencies on kubestage1002 is CRITICAL: instance=kubestage1002.eqiad.wmnet operation_type={create_container,remove_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[09:54:31] RECOVERY - kubelet operational latencies on kubestage1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[09:59:56] (PS7) Giuseppe Lavagetto: profile::mediawiki::php: add support for php-fpm [puppet] - https://gerrit.wikimedia.org/r/455154 (https://phabricator.wikimedia.org/T201140)
[10:00:00] <_joe_> seventh time's the charm?
[10:02:10] (PS4) Jcrespo: mariadb: Setup db1116 for backup generation on eqiad of s7 and s8 [puppet] - https://gerrit.wikimedia.org/r/463484 (https://phabricator.wikimedia.org/T201392)
[10:03:41] (CR) Jcrespo: [C: 2] mariadb: Setup db1116 for backup generation on eqiad of s7 and s8 [puppet] - https://gerrit.wikimedia.org/r/463484 (https://phabricator.wikimedia.org/T201392) (owner: Jcrespo)
[10:06:43] <_joe_> make it eight :/
[10:06:44] Operations, netops: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (Marostegui) Are we still good for Thursday at 16:00 UTC for row B?
[10:06:57] (PS8) Giuseppe Lavagetto: profile::mediawiki::php: add support for php-fpm [puppet] - https://gerrit.wikimedia.org/r/455154 (https://phabricator.wikimedia.org/T201140)
[10:13:00] PROBLEM - kubelet operational latencies on kubestage1002 is CRITICAL: instance=kubestage1002.eqiad.wmnet operation_type={create_container,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[10:13:33] that's me ^
[10:14:01] RECOVERY - kubelet operational latencies on kubestage1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[10:21:42] (PS1) Urbanecm: Lift account creation cap for 2018-10-04 on cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/464122 (https://phabricator.wikimedia.org/T206119)
[10:22:51] PROBLEM - kubelet operational latencies on kubestage1002 is CRITICAL: instance=kubestage1002.eqiad.wmnet operation_type={create_container,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[10:24:00] RECOVERY - kubelet operational latencies on kubestage1002 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:24:31] PROBLEM - Check systemd state on cloudcontrol1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:30:31] PROBLEM - kubelet operational latencies on kubestage1002 is CRITICAL: instance=kubestage1002.eqiad.wmnet operation_type={create_container,remove_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:33:10] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:34:20] jouncebot, next [10:34:20] In 0 hour(s) and 25 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181003T1100) [10:34:51] RECOVERY - kubelet operational latencies on kubestage1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:36:01] PROBLEM - kubelet operational latencies on kubestage1001 is CRITICAL: instance=kubestage1001.eqiad.wmnet operation_type={create_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:38:11] RECOVERY - kubelet operational latencies on kubestage1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:38:31] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:41:50] !log start compressing dbstore1001:x1 tables [10:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:00] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:45:51] PROBLEM - kubelet operational latencies on kubestage1001 is CRITICAL: instance=kubestage1001.eqiad.wmnet operation_type={create_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:51:42] RECOVERY - Check systemd state on cloudcontrol1004 is OK: OK - running: The system is fully operational [10:52:50] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:57:01] PROBLEM - Etcd cluster health on conf1004 is CRITICAL: The etcd server is unhealthy [10:58:50] PROBLEM - kubelet operational latencies on kubestage1002 is CRITICAL: instance=kubestage1002.eqiad.wmnet operation_type={remove_container,stop_container,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a European Mid-day SWAT(Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181003T1100). [11:00:04] raynor, Amir1, and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:20] o/ [11:00:24] I can SWAT today [11:00:39] Hi [11:00:47] Can I ask my patches be deployed before others? 
[11:00:53] I have only 30 minutes of time :( [11:00:58] RECOVERY - kubelet operational latencies on kubestage1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:00:59] Urbanecm: sure, I'll start with you [11:01:00] o/ [11:01:27] thank you [11:01:27] Amir1, raynor: you're deployers, right? you can continue after me, Urbanecm is in a hurry [11:01:44] <_joe_> the etcd on conf1004 is my doing [11:01:48] <_joe_> sorry for the noise [11:01:51] yup, I'm deployer and I can wait [11:01:58] Urbanecm: no problemo :D [11:01:58] 10Operations, 10Wikimedia-Mailing-lists: Transfer mailman ownership of Wikimania lists - https://phabricator.wikimedia.org/T206089 (10Aklapper) @eyoung, @eliza: `wikimania-l` is not run by Ellie. For the other three mailing lists, Ellie can log in as mailing list admin and add the email address of the new mail... [11:02:20] Amir1 do you want to go first, my task requires a bit of testing [11:03:34] hmm, mine is not testable but I need to wait for ten minutes -ish to make sure logs are clean [11:03:44] I rather go last [11:04:19] Urbanecm: sorry, postman will be back in a minute [11:04:29] postman? 
[11:04:32] Oh :D [11:04:48] I though you're telling the postman will be back in a minute :D [11:05:05] Amir1: kk, I'll go first [11:05:21] Urbanecm: sorry, he came at the worst of times :) I'm back [11:05:28] (03PS1) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: use novaadmin credentials [puppet] - 10https://gerrit.wikimedia.org/r/464124 (https://phabricator.wikimedia.org/T203177) [11:05:58] (03CR) 10jerkins-bot: [V: 04-1] prometheus-openstack-exporter: use novaadmin credentials [puppet] - 10https://gerrit.wikimedia.org/r/464124 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [11:06:35] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464122 (https://phabricator.wikimedia.org/T206119) (owner: 10Urbanecm) [11:07:32] Urbanecm: uh, I've never had to run the script for throttle rule, is that new? [11:07:43] (03Merged) 10jenkins-bot: Lift account creation cap for 2018-10-04 on cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464122 (https://phabricator.wikimedia.org/T206119) (owner: 10Urbanecm) [11:08:17] zeljkof, I don't know, I was told by Reedy (IIRC) that it is necessary to run it if the event is in next 72 hours. [11:08:31] It is not necessary to run it unless it is last time rule [11:08:36] This is for tomorrow. [11:08:41] See the linked docs :) [11:08:46] It's necessary if there's been stuff on that IP to cause something to be logged [11:08:55] uh oh, we've had many last-minute throttle deployments, I've never run the script :/ [11:09:20] Thank you Reedy for your explanation, that explains why it sometime works without the script and sometime not :) [11:09:27] Reedy: so, it's just in case, not needed in general? 
[11:09:32] Yeah [11:09:43] Usually if the event is going on and the limits have already been reached [11:10:26] (03PS1) 10Giuseppe Lavagetto: profile::etcd::v3: do not setup firewall rules if only localhost is allowed [puppet] - 10https://gerrit.wikimedia.org/r/464125 [11:10:28] (03PS1) 10Giuseppe Lavagetto: role::configcluster_stretch: only allow localhost to connect to etcd [puppet] - 10https://gerrit.wikimedia.org/r/464126 [11:11:08] PROBLEM - kubelet operational latencies on kubestage1002 is CRITICAL: instance=kubestage1002.eqiad.wmnet operation_type={stop_container,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:13:07] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:13:08] RECOVERY - kubelet operational latencies on kubestage1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:13:41] logmsgbot's back! 
It's an important thing when deploy time [11:14:35] Urbanecm: ah, just noticed a typo in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/464122 (after merge) :/ [11:14:44] "commonswii" [11:14:49] (03CR) 10jenkins-bot: Lift account creation cap for 2018-10-04 on cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464122 (https://phabricator.wikimedia.org/T206119) (owner: 10Urbanecm) [11:14:50] ahh [11:14:52] will fix it [11:14:59] sorry [11:15:07] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:15:22] I should have noticed it before :/ [11:15:57] (03PS1) 10Urbanecm: Fix a typo in tomorrow throttle definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464127 [11:16:02] I should not make typos [11:16:06] Uploaded a fix ^^ [11:16:10] please review&merge [11:16:58] (03PS1) 10Elukey: profile::etcd::replication: allow the config of src/dst repl ports [puppet] - 10https://gerrit.wikimedia.org/r/464128 (https://phabricator.wikimedia.org/T205814) [11:17:32] (03PS2) 10Zfilipin: Fix a typo in lift account creation cap for cswiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464127 (https://phabricator.wikimedia.org/T206119) (owner: 10Urbanecm) [11:17:59] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464127 (https://phabricator.wikimedia.org/T206119) (owner: 10Urbanecm) [11:18:24] Urbanecm: I've updated the commit message slightly and added phab task number [11:18:51] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::etcd::v3: do not setup firewall rules if only localhost is allowed [puppet] - 10https://gerrit.wikimedia.org/r/464125 (owner: 10Giuseppe Lavagetto) [11:18:56] thx [11:18:58] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/12726/conf1004.eqiad.wmnet/" [puppet] - 
10https://gerrit.wikimedia.org/r/464125 (owner: 10Giuseppe Lavagetto) [11:19:04] (03Merged) 10jenkins-bot: Fix a typo in lift account creation cap for cswiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464127 (https://phabricator.wikimedia.org/T206119) (owner: 10Urbanecm) [11:19:24] now I see that my commit message is far from perfect, but it's merged, oh well [11:19:28] RECOVERY - kubelet operational latencies on kubestage1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:19:29] (03CR) 10Elukey: [C: 031] role::configcluster_stretch: only allow localhost to connect to etcd [puppet] - 10https://gerrit.wikimedia.org/r/464126 (owner: 10Giuseppe Lavagetto) [11:20:11] Urbanecm: please add the commit to the calendar [11:20:19] will do [11:20:26] (03PS3) 10Zfilipin: Fix a typo in zhwikiversity's importsources definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460947 (https://phabricator.wikimedia.org/T201328) (owner: 10Urbanecm) [11:20:28] RECOVERY - Etcd cluster health on conf1004 is OK: The etcd server is healthy [11:20:48] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:464127|Fix a typo in lift account creation cap for cswiki event (T206119)]] (duration: 00m 56s) [11:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:52] T206119: Lift account creation cap for 2018-10-04 - https://phabricator.wikimedia.org/T206119 [11:21:00] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (watching), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10mobrovac) >>! In T197242#4610467, @Mvolz wrote: >>>! In T197242#4587749, @Sebastian_Berlin-WMSE wrote: >... [11:21:02] done [11:21:08] Urbanecm: thanks! 
[11:21:11] yw [11:21:15] (03PS2) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: use novaadmin credentials [puppet] - 10https://gerrit.wikimedia.org/r/464124 (https://phabricator.wikimedia.org/T203177) [11:21:19] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460947 (https://phabricator.wikimedia.org/T201328) (owner: 10Urbanecm) [11:21:21] (03CR) 10Giuseppe Lavagetto: [C: 04-1] profile::etcd::replication: allow the config of src/dst repl ports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464128 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [11:21:47] (03CR) 10jerkins-bot: [V: 04-1] prometheus-openstack-exporter: use novaadmin credentials [puppet] - 10https://gerrit.wikimedia.org/r/464124 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [11:22:30] (03Merged) 10jenkins-bot: Fix a typo in zhwikiversity's importsources definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460947 (https://phabricator.wikimedia.org/T201328) (owner: 10Urbanecm) [11:23:15] Urbanecm: 460947 is at mwdebug2001 [11:23:31] (03PS3) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: use novaadmin credentials [puppet] - 10https://gerrit.wikimedia.org/r/464124 (https://phabricator.wikimedia.org/T203177) [11:24:00] (03CR) 10jerkins-bot: [V: 04-1] prometheus-openstack-exporter: use novaadmin credentials [puppet] - 10https://gerrit.wikimedia.org/r/464124 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [11:24:14] I don't have import privs there, but the wiki is up, so let's consider it working [11:24:21] please push it into production [11:24:27] Urbanecm: ok [11:25:42] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:460947|Fix a typo in zhwikiversitys importsources definition (T201328)]] (duration: 00m 57s) [11:25:42] (03PS3) 10Zfilipin: Create eliminator group at Vietnamese Wikibooks [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/460701 (https://phabricator.wikimedia.org/T202207) (owner: 10Urbanecm) [11:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:46] T201328: Transwiki import in zhwikiversity - https://phabricator.wikimedia.org/T201328 [11:25:52] Urbanecm: deployed [11:25:55] thx [11:26:03] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460701 (https://phabricator.wikimedia.org/T202207) (owner: 10Urbanecm) [11:27:08] (03Merged) 10jenkins-bot: Create eliminator group at Vietnamese Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460701 (https://phabricator.wikimedia.org/T202207) (owner: 10Urbanecm) [11:27:46] (03PS4) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: use novaadmin credentials [puppet] - 10https://gerrit.wikimedia.org/r/464124 (https://phabricator.wikimedia.org/T203177) [11:28:23] (03CR) 10jerkins-bot: [V: 04-1] prometheus-openstack-exporter: use novaadmin credentials [puppet] - 10https://gerrit.wikimedia.org/r/464124 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [11:28:25] Urbanecm: 460701 is at mwdebug2001 [11:29:10] working, please push it into production [11:29:18] ok [11:29:33] (03CR) 10jenkins-bot: Fix a typo in lift account creation cap for cswiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464127 (https://phabricator.wikimedia.org/T206119) (owner: 10Urbanecm) [11:29:35] (03CR) 10jenkins-bot: Fix a typo in zhwikiversity's importsources definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460947 (https://phabricator.wikimedia.org/T201328) (owner: 10Urbanecm) [11:29:38] (03CR) 10jenkins-bot: Create eliminator group at Vietnamese Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/460701 (https://phabricator.wikimedia.org/T202207) (owner: 10Urbanecm) [11:30:13] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:460701|Create 
eliminator group at Vietnamese Wikibooks (T202207)]] (duration: 00m 58s) [11:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:17] T202207: Create eliminator group at Vietnamese Wikibooks - https://phabricator.wikimedia.org/T202207 [11:30:22] Urbanecm: deployed! [11:30:25] thank you! [11:30:38] Amir1, raynor: swat is yours! [11:30:43] kk [11:31:29] (03PS5) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: use novaadmin credentials [puppet] - 10https://gerrit.wikimedia.org/r/464124 (https://phabricator.wikimedia.org/T203177) [11:32:06] (03CR) 10jerkins-bot: [V: 04-1] prometheus-openstack-exporter: use novaadmin credentials [puppet] - 10https://gerrit.wikimedia.org/r/464124 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [11:32:21] Amir1, I'll be quick, I have just one patch to push ,second one is postponed [11:32:47] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Cleanup WDQS logging configuration - https://phabricator.wikimedia.org/T206121 (10Gehel) [11:32:48] (03PS2) 10Pmiazga: Remove dead config relating to wgRelatedArticlesEnabledBucketSize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462573 (https://phabricator.wikimedia.org/T202306) (owner: 10Jdlrobson) [11:32:54] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Cleanup WDQS logging configuration - https://phabricator.wikimedia.org/T206121 (10Gehel) p:05Triage>03Normal [11:33:11] (03CR) 10Pmiazga: [C: 032] Remove dead config relating to wgRelatedArticlesEnabledBucketSize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462573 (https://phabricator.wikimedia.org/T202306) (owner: 10Jdlrobson) [11:33:53] okay, take your time [11:34:15] (03Merged) 10jenkins-bot: Remove dead config relating to wgRelatedArticlesEnabledBucketSize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462573 (https://phabricator.wikimedia.org/T202306) 
(owner: 10Jdlrobson) [11:35:12] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] "I think the jenkins report about wrong puppet:// URL is false, so removing it as reviewer." [puppet] - 10https://gerrit.wikimedia.org/r/464124 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [11:36:12] (03PS2) 10Elukey: profile::etcd::replication: allow the config of src/dst repl ports [puppet] - 10https://gerrit.wikimedia.org/r/464128 (https://phabricator.wikimedia.org/T205814) [11:38:38] !log downtime cloudcontrol1003,1004 for 2h for T203177 [11:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:43] T203177: cloudvps: metrics and analytics - https://phabricator.wikimedia.org/T203177 [11:39:48] (03CR) 10Filippo Giunchedi: [C: 031] k8s: Alert on the sum of operation types [puppet] - 10https://gerrit.wikimedia.org/r/464104 (owner: 10Alexandros Kosiaris) [11:41:08] (03PS1) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: typo in puppet URL: missing / [puppet] - 10https://gerrit.wikimedia.org/r/464130 (https://phabricator.wikimedia.org/T203177) [11:42:07] (03CR) 10Arturo Borrero Gonzalez: [C: 032] prometheus-openstack-exporter: typo in puppet URL: missing / [puppet] - 10https://gerrit.wikimedia.org/r/464130 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [11:42:08] !log pmiazga@deploy1001 Synchronized wmf-config: SWAT: [[gerrit:462573|Remove dead config relating to wgRelatedArticlesEnabledBucketSize (T202306)]] (duration: 00m 57s) [11:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:12] T202306: Remove EventLogging code from RelatedArticles - https://phabricator.wikimedia.org/T202306 [11:43:24] Amir1: I'm done [11:43:33] you can sync your patch [11:44:39] (03CR) 10jenkins-bot: Remove dead config relating to wgRelatedArticlesEnabledBucketSize [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462573 
(https://phabricator.wikimedia.org/T202306) (owner: 10Jdlrobson) [11:45:28] !log converting enwiki.slots to TokuDB on host dbstrore1002 (T205544) [11:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:32] T205544: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T205544 [11:45:49] (03PS1) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: s/content/source/ in systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/464131 (https://phabricator.wikimedia.org/T203177) [11:45:51] thanks [11:46:39] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463945 (https://phabricator.wikimedia.org/T203286) (owner: 10Sbisson) [11:47:14] (03CR) 10Arturo Borrero Gonzalez: [C: 032] prometheus-openstack-exporter: s/content/source/ in systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/464131 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [11:48:11] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/12728/" [puppet] - 10https://gerrit.wikimedia.org/r/464128 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [11:48:18] RECOVERY - swift-object-replicator on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [11:48:27] RECOVERY - swift-object-updater on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [11:48:27] RECOVERY - swift-object-auditor on ms-be1040 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [11:48:28] RECOVERY - swift-account-replicator on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [11:48:37] RECOVERY - swift-account-auditor on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [11:48:39] that's me ^ [11:48:48] RECOVERY - swift-account-server on 
ms-be1040 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [11:49:18] RECOVERY - swift-account-reaper on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [11:50:20] (03PS2) 10Ladsgroup: Don't purge articlequality, draftquality scores [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463945 (https://phabricator.wikimedia.org/T203286) (owner: 10Sbisson) [11:50:32] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463945 (https://phabricator.wikimedia.org/T203286) (owner: 10Sbisson) [11:51:39] (03Merged) 10jenkins-bot: Don't purge articlequality, draftquality scores [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463945 (https://phabricator.wikimedia.org/T203286) (owner: 10Sbisson) [11:54:08] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:463945|Don't purge articlequality, draftquality scores (T203286)]] (duration: 00m 57s) [11:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:13] T203286: New Pages Feed: run ORES backfill script in English Wikipedia - https://phabricator.wikimedia.org/T203286 [11:55:45] (03PS8) 10Vgutierrez: [WIP] Detect when cert config changes and re-issue [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [11:57:27] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Detect when cert config changes and re-issue [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [11:57:33] (03PS1) 10Banyek: admin: banyek dotfiles changed [puppet] - 10https://gerrit.wikimedia.org/r/464132 [11:58:26] (03CR) 10Banyek: [C: 032] admin: banyek dotfiles changed [puppet] - 10https://gerrit.wikimedia.org/r/464132 (owner: 10Banyek) [11:59:28] (03CR) 10jenkins-bot: Don't purge articlequality, draftquality scores [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463945 
(https://phabricator.wikimedia.org/T203286) (owner: 10Sbisson) [11:59:45] (03PS9) 10Vgutierrez: [WIP] Detect when cert config changes and re-issue [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181003T1200) [12:00:39] (03PS1) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: use systemd::service content parameter [puppet] - 10https://gerrit.wikimedia.org/r/464134 (https://phabricator.wikimedia.org/T203177) [12:02:17] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 58.27 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [12:03:00] 10Operations, 10monitoring: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10faidon) Why wasn't this caught by check_ping? Is it actual packet loss? [12:06:08] (03PS2) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: use systemd::service content parameter [puppet] - 10https://gerrit.wikimedia.org/r/464134 (https://phabricator.wikimedia.org/T203177) [12:07:27] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/12730/" [puppet] - 10https://gerrit.wikimedia.org/r/464134 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [12:12:41] !log Starting mwscript extensions/ORES/maintenance/BackfillPageTriageQueue.php --wiki enwiki (T203286) [12:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:45] T203286: New Pages Feed: run ORES backfill script in English Wikipedia - https://phabricator.wikimedia.org/T203286 [12:13:08] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 85.82 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 
[12:14:44] (03PS10) 10Vgutierrez: [WIP] Detect when cert config changes and re-issue [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [12:20:15] (03CR) 10Giuseppe Lavagetto: [C: 04-1] profile::etcd::replication: allow the config of src/dst repl ports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464128 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [12:20:24] <_joe_> elukey: sorry :P [12:20:53] (03PS2) 10Giuseppe Lavagetto: role::configcluster_stretch: only allow localhost to connect to etcd [puppet] - 10https://gerrit.wikimedia.org/r/464126 [12:20:58] (03PS1) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: replace instead of override systemd service file [puppet] - 10https://gerrit.wikimedia.org/r/464137 (https://phabricator.wikimedia.org/T203177) [12:22:04] _joe_ no no sorry I didn't think about it :( [12:23:06] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/12731/" [puppet] - 10https://gerrit.wikimedia.org/r/464137 (https://phabricator.wikimedia.org/T203177) (owner: 10Arturo Borrero Gonzalez) [12:26:29] !log Finished mwscript extensions/ORES/maintenance/BackfillPageTriageQueue.php --wiki enwiki (T203286) [12:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:34] T203286: New Pages Feed: run ORES backfill script in English Wikipedia - https://phabricator.wikimedia.org/T203286 [12:27:27] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:27:47] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:29:57] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:30:16] <_joe_> akosiaris: you know what's causing all those alerts on the api latencies? [12:30:38] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:30:39] (03PS3) 10Giuseppe Lavagetto: role::configcluster_stretch: only allow localhost to connect to etcd [puppet] - 10https://gerrit.wikimedia.org/r/464126 [12:31:54] (03CR) 10Giuseppe Lavagetto: [C: 032] role::configcluster_stretch: only allow localhost to connect to etcd [puppet] - 10https://gerrit.wikimedia.org/r/464126 (owner: 10Giuseppe Lavagetto) [12:32:18] (03PS3) 10Elukey: profile::etcd::replication: allow the config of src/dst repl ports [puppet] - 10https://gerrit.wikimedia.org/r/464128 (https://phabricator.wikimedia.org/T205814) [12:34:58] 10Operations, 10Patch-For-Review, 10User-fgiunchedi: Register and identify icinga-wm - https://phabricator.wikimedia.org/T205526 (10fgiunchedi) [12:35:16] 10Operations, 10IRCecho, 10Patch-For-Review, 10User-fgiunchedi: ircecho / icinga-wm crashlooping - https://phabricator.wikimedia.org/T205522 (10fgiunchedi) [12:35:26] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/12732/" [puppet] - 10https://gerrit.wikimedia.org/r/464128 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [12:37:17] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:40:10] _joe_: very well [12:40:15] it's me :-) [12:40:28] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:40:31] ah the ones about production, no it's not me [12:40:49] I was testing things in the staging environment however [12:41:12] but that does not explain argon/acrux [12:41:53] hm, it's for the CONNECT verb [12:42:24] that's usually very low and not at 14400s [12:50:08] that's weird [12:51:08] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 59.02 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [12:52:16] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (watching), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10Sebastian_Berlin-WMSE) >>! In T197242#4638029, @mobrovac wrote: > Also, the current idea is to start usi... [12:54:06] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10elukey) From the install1002 point of view: ``` Oct 3 12:38:12 install1002 dhcpd: DHCPOFFER on 10.64.21.104 to d0:94:66:5f:75:bc via 10.64.21.2 Oc... [12:54:39] !log DROP unused RESTBase tables - T204752 [12:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:44] T204752: Clean up restrictions tables in cassandra - https://phabricator.wikimedia.org/T204752 [12:55:03] (03PS1) 10Giuseppe Lavagetto: profile::etcd::v3: allow connections to the advertised port [puppet] - 10https://gerrit.wikimedia.org/r/464141 [12:55:39] _joe_: {instance="10.64.0.45:6443",job="k8s-api",resource="pods",scope="namespace",subresource="portforward",verb="CONNECT"} [12:55:48] that's 1 of the big ones [12:55:56] what on earth tried to do a portforward? 
[12:57:45] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::etcd::v3: allow connections to the advertised port [puppet] - 10https://gerrit.wikimedia.org/r/464141 (owner: 10Giuseppe Lavagetto) [13:00:19] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181003T1300) [13:08:15] (03CR) 10Alexandros Kosiaris: [C: 04-1] "This is problematic as it messes with all groups and lacks knowledge about canaries making it so that scap does not deploy code in the pas" [puppet] - 10https://gerrit.wikimedia.org/r/463709 (https://phabricator.wikimedia.org/T204907) (owner: 10Alexandros Kosiaris) [13:10:48] (03PS1) 10Giuseppe Lavagetto: profile::etcd::tlsproxy: correctly set the upstream server if using tls [puppet] - 10https://gerrit.wikimedia.org/r/464142 [13:13:11] 10Operations, 10Puppet, 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - https://phabricator.wikimedia.org/T204088 (10Ottomata) There are two changes needed, including https://gerrit.wikimedia.org/r/463966.... 
[13:15:01] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::etcd::tlsproxy: correctly set the upstream server if using tls [puppet] - 10https://gerrit.wikimedia.org/r/464142 (owner: 10Giuseppe Lavagetto) [13:15:48] (03PS1) 10GTirloni: prometheus-openstack-exporter: Add monitored endpoints [puppet] - 10https://gerrit.wikimedia.org/r/464144 (https://phabricator.wikimedia.org/T203177) [13:15:49] 10Operations, 10monitoring, 10User-fgiunchedi: Expand modern metrics infrastructure coverage (2018-19 Q2 goal) - https://phabricator.wikimedia.org/T205862 (10fgiunchedi) [13:15:57] PROBLEM - BGP status on cr1-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:16:03] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (10fgiunchedi) [13:16:19] (03CR) 10jerkins-bot: [V: 04-1] prometheus-openstack-exporter: Add monitored endpoints [puppet] - 10https://gerrit.wikimedia.org/r/464144 (https://phabricator.wikimedia.org/T203177) (owner: 10GTirloni) [13:18:19] (03PS2) 10GTirloni: prometheus-openstack-exporter: Add monitored endpoints [puppet] - 10https://gerrit.wikimedia.org/r/464144 (https://phabricator.wikimedia.org/T203177) [13:19:07] PROBLEM - BGP status on cr1-eqsin is CRITICAL: Use of uninitialized value duration in numeric gt () at /usr/lib/nagios/plugins/check_bgp line 316. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:19:57] (03CR) 10GTirloni: "> Patch Set 2: Verified+2" [puppet] - 10https://gerrit.wikimedia.org/r/464144 (https://phabricator.wikimedia.org/T203177) (owner: 10GTirloni) [13:20:27] PROBLEM - puppet last run on dns5002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml] [13:22:04] (03CR) 10Filippo Giunchedi: [C: 031] "Ping me when merging this, as a precaution we should stop puppet on production prometheus hosts and reenable only on one host to test this" [puppet] - 10https://gerrit.wikimedia.org/r/463966 (https://phabricator.wikimedia.org/T204088) (owner: 10Ottomata) [13:22:37] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 76.83 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:23:08] 10Operations, 10monitoring, 10User-fgiunchedi: Expand modern metrics infrastructure coverage (2018-19 Q2 goal) - https://phabricator.wikimedia.org/T205862 (10fgiunchedi) [13:23:12] 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196 (10fgiunchedi) [13:25:39] 10Operations, 10Wikimedia-Mailing-lists: I get a "403 Forbidden" error when subscribing to a list - https://phabricator.wikimedia.org/T205694 (10Margott) >>! In T205694#4625598, @herron wrote: > (maybe connecting from another device or location)? This actually worked(sorry I didnt read that early), when I hav... 
[13:26:57] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 49.42 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:28:09] (03CR) 10Giuseppe Lavagetto: [C: 031] Use cumin::selector instead of profile::cumin::target in get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/463966 (https://phabricator.wikimedia.org/T204088) (owner: 10Ottomata) [13:30:07] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 70.19 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:30:09] (03PS3) 10GTirloni: prometheus-openstack-exporter: Add monitored endpoints [puppet] - 10https://gerrit.wikimedia.org/r/464144 (https://phabricator.wikimedia.org/T203177) [13:31:30] godog: shall we merge now? [13:35:55] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10elukey) Got some help from Faidon, one setting in the BIOS for the serial console wasn't correct (I've set `Serial Port Address set to Serial Device... [13:36:00] (03CR) 10Alexandros Kosiaris: [C: 032] k8s: Alert on the sum of operation types [puppet] - 10https://gerrit.wikimedia.org/r/464104 (owner: 10Alexandros Kosiaris) [13:36:07] (03PS2) 10Alexandros Kosiaris: k8s: Alert on the sum of operation types [puppet] - 10https://gerrit.wikimedia.org/r/464104 [13:36:12] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] k8s: Alert on the sum of operation types [puppet] - 10https://gerrit.wikimedia.org/r/464104 (owner: 10Alexandros Kosiaris) [13:37:07] ottomata: sure, I can merge, disable puppet, etc [13:39:40] oh ok cool [13:39:43] godog: proceed then! 
[13:39:55] (03PS1) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) [13:40:05] jouncebot: next [13:40:08] In 2 hour(s) and 19 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181003T1600) [13:40:20] (03CR) 10Filippo Giunchedi: [C: 032] Use cumin::selector instead of profile::cumin::target in get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/463966 (https://phabricator.wikimedia.org/T204088) (owner: 10Ottomata) [13:40:31] (03PS2) 10Filippo Giunchedi: Use cumin::selector instead of profile::cumin::target in get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/463966 (https://phabricator.wikimedia.org/T204088) (owner: 10Ottomata) [13:40:37] (03CR) 10jerkins-bot: [V: 04-1] wdqs: auto deployment of wdqs on wdqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [13:40:57] PROBLEM - puppet last run on dns5002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 21 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml] [13:41:08] PROBLEM - BGP status on cr1-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:42:40] ottomata: ack, I'll test run [13:44:48] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 32 probes of 340 (alerts on 25) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [13:45:27] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 38 probes of 316 (alerts on 25) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [13:46:27] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10elukey) >>! In T204970#4637700, @elukey wrote: > After hitting by mistake F10 the host got stuck several times in: > > `Unified Server Configurator... [13:47:08] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` an-coord1001.eqiad.wmnet ``` The log can... 
[13:48:22] (03PS1) 1020after4: Install scap version 3.8.7-1 [puppet] - 10https://gerrit.wikimedia.org/r/464152 [13:49:25] (03CR) 1020after4: [C: 04-1] "Please upload the package before deploying this change :)" [puppet] - 10https://gerrit.wikimedia.org/r/464152 (owner: 1020after4) [13:49:48] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 1 probes of 340 (alerts on 25) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [13:50:28] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 16 probes of 316 (alerts on 25) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [13:50:58] RECOVERY - puppet last run on dns5002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:52:58] ottomata: looks like some hosts got removed, e.g. from prometheus1003 not sure why yet tho [13:53:19] uh oh.... godog maybe they never ran puppet since I merged the cumin::selector patch yesterday? [13:53:23] and then aren't in puppet db with it? [13:53:26] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Update Debian Package for Scap to 3.8.7-1 - https://phabricator.wikimedia.org/T204383 (10mmodell) [13:53:42] ottomata: ah yeah that's possible, getting the list now [13:54:29] analytics1068 / matomo1001 / maps1004 / lvs1011 / lvs1012 / an-coord1001 [13:54:38] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Update Debian Package for Scap to 3.8.7-1 - https://phabricator.wikimedia.org/T204383 (10mmodell) repurposing this task since it's still open. ping @fgiunchedi can you upload the latest scap when you have a chance? Related Pup... [13:55:04] (03PS2) 1020after4: Install scap version 3.8.7-1 [puppet] - 10https://gerrit.wikimedia.org/r/464152 [13:55:11] yup, iirc analytics1068 might be offline (elukey?) 
[13:55:19] an-coord1001 is new [13:55:34] (03PS2) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) [13:55:41] ok looks like it is working as intended, that churn is due to hosts not being in puppetdb for $reasons [13:55:45] (03CR) 10jerkins-bot: [V: 04-1] Install scap version 3.8.7-1 [puppet] - 10https://gerrit.wikimedia.org/r/464152 (owner: 1020after4) [13:56:12] (03CR) 10jerkins-bot: [V: 04-1] wdqs: auto deployment of wdqs on wdqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [13:56:46] ottomata: not sure about prometheus in beta but production prometheus LGTM [13:57:00] yep! [13:57:12] 1068 still down, an-coord1001 is in d-i now [13:57:35] matomo1001 might have puppet disabled [13:57:37] (still working on it) [13:57:44] ack, all expected [13:58:46] (03PS3) 1020after4: Install scap version 3.8.7-1 [puppet] - 10https://gerrit.wikimedia.org/r/464152 (https://phabricator.wikimedia.org/T204383) [14:00:01] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10fgiunchedi) [14:03:35] godog: Right now I'm sending a logstash log message to deployment-logstash2.eqiad.wmflabs:12201 but it doesn't reach it. Do you know why? is there a way to check why logstash discard these messages? [14:05:09] Amir1: sigh, can you login into deployment-logstash2 ? if so check /var/log/logstash in case there's hints [14:05:49] yeah, okay. 
let me check [14:05:50] thanks [14:07:26] !log converting wikidatawiki.change_tag to TokuDB on host dbstore1002 (T205544) [14:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:31] T205544: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T205544 [14:10:21] godog: I've got this: [14:10:24] Gelfd failed to parse a message skipping {:exception=>#, :backtrace=>["/usr/share/logstash/vendor/bundle/jruby/1.9/gems/gelfd-0.2.0/lib/gelfd/parser.rb:14:in `parse'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-gelf-3.0.5/lib/logstash/inputs/gelf.rb:104:in `udp_listener'", [14:10:24] "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-gelf-3.0.5/lib/logstash/inputs/gelf.rb:77:in `run'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:456:in `inputworker'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:449:in `start_input'"]}' [14:10:47] but I have no idea how to fix [14:11:40] Amir1: ftw there is a gelf formatter already https://github.com/cdumay/logging-gelf/tree/master/src/logging_gelf [14:11:58] instead of writing one, it might make sense to reuse it [14:12:40] I see it already does quite a bit more than your change [14:12:57] yeah [14:13:12] we also have python-logstash in use by thumbor FWIW [14:13:24] ah that too, I did not know that [14:16:03] akosiaris: godog My patch is basically using that but the library is unmaintained and pretty buggy towards python3, look at the issues made against it [14:18:23] there's also no evidence afaics that backtrace is due to your input heh [14:18:37] PROBLEM - Check systemd state on ms-be1040 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
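For readers following the GELF discussion above, here is a minimal sketch of what a GELF-over-UDP sender might look like using only the Python standard library. It assumes the payload is zlib-compressed JSON carrying the `version`, `host`, and `short_message` fields from the GELF payload specification. The helper names `build_gelf_payload` and `send_gelf` are hypothetical: they are not taken from Amir1's patch or from either library mentioned in the chat. The host and port in the commented usage line are the ones quoted in the chat.

```python
import json
import socket
import zlib


def build_gelf_payload(short_message, host, level=6, **extra):
    """Return a zlib-compressed GELF 1.1 datagram (hypothetical helper)."""
    record = {
        "version": "1.1",
        "host": host,
        "short_message": short_message,
        "level": level,  # syslog severity; 6 means informational
    }
    # GELF expects additional fields to carry an underscore prefix.
    for key, value in extra.items():
        record["_" + key] = value
    return zlib.compress(json.dumps(record).encode("utf-8"))


def send_gelf(payload, address):
    """Send one datagram over UDP. Fire-and-forget: delivery is not confirmed."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, address)


# Example call (left commented out so importing this sketch sends nothing):
# send_gelf(build_gelf_payload("test message", socket.gethostname()),
#           ("deployment-logstash2.eqiad.wmflabs", 12201))
```

Because UDP gives no acknowledgement, a sender like this cannot tell whether the receiver parsed or dropped the datagram, which is why the receiving side's log had to be checked in the exchange above.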
[14:20:25] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['an-coord1001.eqiad.wmnet'] ``` and were **ALL** successful. [14:22:06] 10Operations, 10Wikimedia-Mailing-lists: Transfer mailman ownership of Wikimania lists - https://phabricator.wikimedia.org/T206089 (10Dzahn) Hi Ellie, what Krenair said above is true. Each list just has a single individual password. If you have that (those), you can login and change the admin list yourself.... [14:23:43] 10Operations, 10Wikimedia-Mailing-lists: Transfer mailman ownership of Wikimania lists - https://phabricator.wikimedia.org/T206089 (10Dzahn) If you want Irene to be added to the wikimania-l admins you can email wikimania-l-owner@lists.wikimedia.org. [14:24:20] (03CR) 10Mathew.onipe: "> Patch Set 2: Verified-1" [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [14:25:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: labvirt1009 HP Raid alert - https://phabricator.wikimedia.org/T198479 (10Bstorm) @Cmjohnson Were you able to get around to this? It looks to be in the same state (which will be frustrating if it was already swapped, lol). [14:27:11] 10Operations: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442 (10Dzahn) a:03Dzahn Cool! Thanks. I will do that. 
[14:28:26] 10Operations, 10monitoring: add monitoring to alert on hosts without RAID - https://phabricator.wikimedia.org/T206131 (10Dzahn) [14:32:39] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10elukey) [14:33:23] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10elukey) Assigning to Rob to see if anything needs to be done from the DC ops side before closing. [14:35:35] 10Operations, 10Traffic: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (10Papaul) @Vgutierrez let me know when I have green light to start working on this. [14:36:44] RECOVERY - Check systemd state on ms-be1040 is OK: OK - running: The system is fully operational [14:37:06] (03CR) 10Anomie: [C: 031] "Looks sane. Haven't tested." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto) [14:42:54] !log fixed some prometheus metrics grants on dbstore1001:3306, db1116:3317 and db1116:3318 [14:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:08] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, 10Services (blocked): Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (10Nuria) a:03Ottomata [14:43:17] (03PS1) 10Vgutierrez: install_server: limit flat.cfg to lvs200[1-6] letting lvs20[09-10] out [puppet] - 10https://gerrit.wikimedia.org/r/464156 (https://phabricator.wikimedia.org/T205970) [14:44:52] 10Operations, 10Traffic, 10Patch-For-Review: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (10Vgutierrez) @Papaul as soon as https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464156/ gets merged :) [14:47:01] (03CR) 10Alexandros Kosiaris: [C: 031] "There is an 
alternative of having a hiera variable that is populated differently (enabled vs disabled) for mwmaint1001 vs mwmaint1002 but " (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [14:48:00] (03PS2) 10Bstorm: wiki replicas: Remove most comment joins from non-compat tables [puppet] - 10https://gerrit.wikimedia.org/r/463541 (https://phabricator.wikimedia.org/T181650) [14:48:58] (03CR) 10Bstorm: wiki replicas: Remove most comment joins from non-compat tables (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463541 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [14:56:04] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:57:38] (03PS11) 10Vgutierrez: Detect when cert config changes and re-issue [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [15:00:13] jouncebot: next [15:00:13] In 0 hour(s) and 59 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181003T1600) [15:04:54] (03PS5) 10Cwhite: naggen2: python3 and remove activerecord support [puppet] - 10https://gerrit.wikimedia.org/r/463133 (https://phabricator.wikimedia.org/T202782) [15:04:54] (03CR) 10jerkins-bot: [V: 04-1] naggen2: python3 and remove activerecord support [puppet] - 10https://gerrit.wikimedia.org/r/463133 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [15:04:54] (03PS2) 10Jcrespo: mariadb: Move some wikis from s3 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463935 (https://phabricator.wikimedia.org/T184805) [15:04:54] (03PS6) 10Cwhite: naggen2: python3 and remove activerecord support [puppet] - 10https://gerrit.wikimedia.org/r/463133 (https://phabricator.wikimedia.org/T202782) [15:05:02] (03PS1) 10Giuseppe Lavagetto: sre.switchdc.mediawiki: remove the restart parsoid step, now useless 
[cookbooks] - 10https://gerrit.wikimedia.org/r/464162 [15:06:29] (03PS1) 10Jcrespo: mariadb: Update dblists to move some wikis from s3 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464164 (https://phabricator.wikimedia.org/T184805) [15:06:59] (03PS7) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/462425 (https://phabricator.wikimedia.org/T196968) [15:07:34] 10Operations, 10netops: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (10ayounsi) Still good, here is the list of hosts currently on the new asw2-b-eqiad that will be impacted by Thursday 4th 16:00UTC 2h maintenance window (with a worse case of a 30min do... [15:10:12] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web::prod_sites: convert mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/462425 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [15:11:15] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:11:36] (03PS12) 10Vgutierrez: Detect when cert config changes and re-issue [software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [15:14:59] (03CR) 10Giuseppe Lavagetto: [C: 031] profile::etcd::replication: allow the config of src/dst repl ports [puppet] - 10https://gerrit.wikimedia.org/r/464128 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [15:16:12] jouncebot: now [15:16:12] No deployments scheduled for the next 0 hour(s) and 43 minute(s) [15:16:14] jouncebot: next [15:16:14] In 0 hour(s) and 43 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181003T1600) [15:22:11] 10Operations, 10ops-ulsfo, 10netops, 10Patch-For-Review: cr2-ulsfo crash - https://phabricator.wikimedia.org/T204782 (10RobH) We can resolve this since we decommissioned cr2 and are getting rid of it, right? [15:23:59] (03PS4) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: add monitored endpoints [puppet] - 10https://gerrit.wikimedia.org/r/464144 (https://phabricator.wikimedia.org/T203177) (owner: 10GTirloni) [15:24:34] (03CR) 10jerkins-bot: [V: 04-1] prometheus-openstack-exporter: add monitored endpoints [puppet] - 10https://gerrit.wikimedia.org/r/464144 (https://phabricator.wikimedia.org/T203177) (owner: 10GTirloni) [15:25:04] what is it that regularly runs puppet? [15:25:11] a cron somewhere? [15:25:27] 10Operations, 10ops-ulsfo, 10netops, 10Patch-For-Review: cr2-ulsfo crash - https://phabricator.wikimedia.org/T204782 (10ayounsi) 05Open>03Resolved a:03ayounsi Yep. 
[15:25:47] ah, /etc/cron.d/puppet [15:30:18] (03PS5) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: add monitored endpoints [puppet] - 10https://gerrit.wikimedia.org/r/464144 (https://phabricator.wikimedia.org/T203177) (owner: 10GTirloni) [15:30:52] 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Ban spam arriving to my tools email - https://phabricator.wikimedia.org/T202558 (10Bstorm) @MarcoAurelio Seem good enough? This certainly cleared up my spam folder a lot. [15:31:34] (03PS3) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) [15:32:21] (03CR) 10jerkins-bot: [V: 04-1] wdqs: auto deployment of wdqs on wdqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [15:34:54] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/12740/ the change is a noop in prod, and it's already applied in beta, where it's intende" [puppet] - 10https://gerrit.wikimedia.org/r/455154 (https://phabricator.wikimedia.org/T201140) (owner: 10Giuseppe Lavagetto) [15:35:10] (03PS9) 10Giuseppe Lavagetto: profile::mediawiki::php: add support for php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/455154 (https://phabricator.wikimedia.org/T201140) [15:35:19] (03PS3) 10Bstorm: openstack: add case for stretch and newton in client repos [puppet] - 10https://gerrit.wikimedia.org/r/463782 (https://phabricator.wikimedia.org/T203254) [15:35:37] <_joe_> bstorm_: newton? \o/ [15:35:49] <_joe_> I was told at the time it allows much more flexible networking [15:36:03] newton or neutron? 
[15:36:08] <_joe_> neutron [15:36:08] <_joe_> hehe [15:36:12] :) [15:36:12] <_joe_> I misread [15:36:16] <_joe_> and repeated it too [15:36:23] /13/8 [15:36:49] yeah newton is an openstack version [15:36:51] <_joe_> I blame english and its funny translitteration to the latin alphabet [15:36:54] <_joe_> :P [15:37:04] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 34.84 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [15:37:04] Yeah, I'm just trying to get a stretch client library install to not error out for reasons :) [15:37:11] which I don't think we use yet [15:37:19] 10Operations, 10monitoring: Upgrade to Prometheus 2.x - https://phabricator.wikimedia.org/T187987 (10colewhite) This might be worth trying: https://gitlab.com/gitlab-org/prometheus-storage-migrator [15:37:21] Stretch = newton packages [15:37:24] but which is already EOL and greater than the version we do use? [15:37:41] <_joe_> it's newer than mitaka, which is what we use IIRC [15:37:43] Yup! It EOL'd after the one we use, therefore it's the Future. 
[15:37:48] 🤣 [15:37:51] <_joe_> aahahah [15:37:56] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::mediawiki::php: add support for php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/455154 (https://phabricator.wikimedia.org/T201140) (owner: 10Giuseppe Lavagetto) [15:38:07] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10colewhite) a:03colewhite [15:38:27] (03CR) 10Arturo Borrero Gonzalez: [C: 032] prometheus-openstack-exporter: add monitored endpoints [puppet] - 10https://gerrit.wikimedia.org/r/464144 (https://phabricator.wikimedia.org/T203177) (owner: 10GTirloni) [15:38:30] think Ocata is also EOL [15:38:36] (03PS6) 10Arturo Borrero Gonzalez: prometheus-openstack-exporter: add monitored endpoints [puppet] - 10https://gerrit.wikimedia.org/r/464144 (https://phabricator.wikimedia.org/T203177) (owner: 10GTirloni) [15:38:44] pike's okay but it's got less time left than ubuntu trusty [15:38:51] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Compiler seems happy: https://puppet-compiler.wmflabs.org/compiler1001/12741/" [puppet] - 10https://gerrit.wikimedia.org/r/464144 (https://phabricator.wikimedia.org/T203177) (owner: 10GTirloni) [15:45:25] PROBLEM - puppet last run on snapshot1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:35] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:48:55] (03PS2) 10Gehel: Fix WDQS service name [puppet] - 10https://gerrit.wikimedia.org/r/464020 (owner: 10Smalyshev) [15:50:42] (03CR) 10Gehel: [C: 032] Fix WDQS service name [puppet] - 10https://gerrit.wikimedia.org/r/464020 (owner: 10Smalyshev) [15:51:15] PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6136.74 seconds [15:51:35] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4090.67 seconds [15:51:54] PROBLEM - puppet last run on snapshot1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:53:41] 10Operations, 10Wikimedia-Logstash: Procure and provision Logging pipeline hardware in multiple datacenters - https://phabricator.wikimedia.org/T205850 (10herron) a:03herron [15:56:05] 10Operations, 10netops, 10Wikimedia-Incident: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (10ayounsi) Optic replaced yesterday and confirmed no more issues. Steps for today: [] Verify cr2-eqiad is VRRP master [] Disable interfaces from cr1-eqiad:ae1 to as... [15:57:48] 10Operations, 10MediaWiki-Shell, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), 10Core Platform Team Kanban (Later): Update limit.sh to support systemd-based cgroup management - https://phabricator.wikimedia.org/T136603 (10CCicalese_WMF) [16:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181003T1600). [16:00:04] bpirkle: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:02:06] yay, it worked [16:02:38] :) [16:02:55] PROBLEM - puppet last run on snapshot1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:05:03] Looks like no ones volunteering to swat :P [16:05:37] (03PS1) 10Dduvall: ci: Disable Docker container logging [puppet] - 10https://gerrit.wikimedia.org/r/464174 (https://phabricator.wikimedia.org/T206134) [16:06:08] bpirkle: Can this be tested from a web browser? Can't say I've tried to Special:Export a Flow page [16:07:30] (03CR) 10jerkins-bot: [V: 04-1] Localisation updates from https://translatewiki.net. [dumps/dcat] - 10https://gerrit.wikimedia.org/r/464175 (owner: 10L10n-bot) [16:07:35] PROBLEM - ElasticSearch shard size check on search.svc.codfw.wmnet is CRITICAL: CRITICAL - enwiki_content_1538363684(55gb) [16:07:47] Hrm, I tested more manually than that (xdebug + fiddling around with the codebase) [16:08:03] Which doesn't help here [16:08:07] Generally in SWAT, we ask people to test it in prod [16:08:08] https://www.mediawiki.org/wiki/Special:Export/Talk:Google_Code-in/Mentors [16:08:14] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is OK: (C)60 le (W)70 le 73.91 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:08:17] That suggests we probably can't test it via a web browser :P [16:08:22] So it just needs deploying [16:09:52] Yeah. I'm pretty confident in the change itself, as Tim reviewed it. There were a number of extensions affected by T203424, we simply neglected to merge this one when we merged the rest. 
[16:09:53] T203424: Replace the WikiExporter backup dump streaming mode with batched queries - https://phabricator.wikimedia.org/T203424 [16:10:16] Fair enough [16:10:29] I'm sure apergos can confirm it fixes it as intended [16:11:47] hmm I don't think I have anything with flow in my mw-vagrant yet [16:11:48] (03PS1) 10Krinkle: profiler: Prevent Xenon flushes from being able to fatal a request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464178 (https://phabricator.wikimedia.org/T206092) [16:11:59] I can test it on a snapshot host when it arrives there [16:12:07] yeah, that's what I was meaning [16:12:13] (03CR) 10Jcrespo: "Moved dblists deploy to a separate commit: https://gerrit.wikimedia.org/r/464164, as this can be safely tested in read only mode beforehan" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463935 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [16:12:15] just the double check it's still not broken :) [16:12:18] right [16:12:44] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:12:51] (03CR) 10jerkins-bot: [V: 04-1] profiler: Prevent Xenon flushes from being able to fatal a request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464178 (https://phabricator.wikimedia.org/T206092) (owner: 10Krinkle) [16:16:22] jerkins is going fast today I see [16:21:37] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10ayounsi) [16:21:42] btw I just looked up the 'test it in prod' t-shirt yesterday... 
couldn't make up my mind to buy one though [16:22:48] (03PS2) 10Krinkle: profiler: Prevent Xenon flushes from being able to fatal a request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464178 (https://phabricator.wikimedia.org/T206092) [16:23:08] (03PS2) 10Dzahn: install_server: limit flat.cfg to lvs200[1-6] letting lvs20[09-10] out [puppet] - 10https://gerrit.wikimedia.org/r/464156 (https://phabricator.wikimedia.org/T205970) (owner: 10Vgutierrez) [16:24:02] (03CR) 10Dzahn: [C: 032] install_server: limit flat.cfg to lvs200[1-6] letting lvs20[09-10] out [puppet] - 10https://gerrit.wikimedia.org/r/464156 (https://phabricator.wikimedia.org/T205970) (owner: 10Vgutierrez) [16:24:16] !log reedy@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/Flow/: fixup flow exporting T203424 (duration: 01m 03s) [16:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:20] T203424: Replace the WikiExporter backup dump streaming mode with batched queries - https://phabricator.wikimedia.org/T203424 [16:25:17] bpirkle: apergos deployed [16:25:26] ah ha [16:26:02] thank you [16:26:44] excuse me while I check a puppet issue first, sorry [16:27:04] _joe_: Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Could not find data item profile::mediawiki::php::enable_fpm in any Hiera data file and no default supplied at /etc/puppet/modules/profile/manifests/mediawiki/php.pp:18:27 on node snapshot1008.eqiad.wmnet [16:27:10] I guess you know the right fix? [16:27:55] Have you tried turning it off and on again? [16:28:21] apergos: what role is it using? 
[16:28:46] all the snaps have the issue [16:29:03] so it's not the specific snapshot role [16:29:10] it needs a copy of this but for the snapshot role [16:29:10] hieradata/role/common/deployment_server.yaml:profile::mediawiki::php::enable_fpm: false [16:29:30] so hieradata/role/common/snapshot/producer.yaml or dumper.yaml [16:29:31] or both [16:30:07] all the appserver roles have it.. like common/mediawiki/appserver/api.yaml:profile::mediawiki::php::enable_fpm: false [16:30:17] there's 4 roles [16:31:10] only 2 of them have a yaml file in role/common/snapshot/ so far. but it can simply be copied [16:32:11] we can't add it at the profile level? there's a common profile that's in all these roles [16:32:59] profile::dumps::generation::worker::common [16:33:04] that's what I'd like to see get it [16:33:06] there is hieradata/common/profile/ as well.. so think yes [16:33:20] it's just far less common [16:33:26] 10Operations, 10netops, 10Goal: Increase network capacity (2018-19 Q1 Goal) - https://phabricator.wikimedia.org/T199142 (10ayounsi) [16:33:32] 10Operations, 10ops-ulsfo, 10Traffic, 10netops, 10Patch-For-Review: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552 (10ayounsi) 05stalled>03Resolved [16:33:41] as long as it's not by host name [16:35:50] yes, add it in hieradata/common/profile/dumps/generation/worker/common.yaml that's an existing file [16:36:21] where it also selects the PHP version as php7.0 [16:37:22] Reedy: bpirkle: tested flow regular and full dumps, they run without error, thank you very much [16:39:18] apergos: great, thank you! 
[16:41:04] (03PS4) 10Elukey: profile::etcd::replication: allow the config of src/dst repl ports [puppet] - 10https://gerrit.wikimedia.org/r/464128 (https://phabricator.wikimedia.org/T205814) [16:42:00] (03PS1) 10ArielGlenn: fix broken puppet on snapshots after php-fpm merge [puppet] - 10https://gerrit.wikimedia.org/r/464180 [16:42:14] (03CR) 10Elukey: [C: 032] profile::etcd::replication: allow the config of src/dst repl ports [puppet] - 10https://gerrit.wikimedia.org/r/464128 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [16:42:34] (03PS1) 10Dzahn: dumps: add (disabled) php-fpm support in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/464181 [16:42:36] mutante: ^^ like this? [16:42:41] lol [16:42:54] feel free to merge your copy [16:43:10] heh! ok [16:43:41] (03PS2) 10Dzahn: dumps: add (disabled) php-fpm support in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/464181 [16:44:11] (03CR) 10Dzahn: [C: 032] dumps: add (disabled) php-fpm support in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/464181 (owner: 10Dzahn) [16:44:24] (03Abandoned) 10ArielGlenn: fix broken puppet on snapshots after php-fpm merge [puppet] - 10https://gerrit.wikimedia.org/r/464180 (owner: 10ArielGlenn) [16:44:43] (03CR) 10Dzahn: "follow-up was needed for dumps hosts: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464181/" [puppet] - 10https://gerrit.wikimedia.org/r/455154 (https://phabricator.wikimedia.org/T201140) (owner: 10Giuseppe Lavagetto) [16:46:22] let me know when it's puppet-merged please [16:46:25] apergos: doesnt seem to work for role(dumps::generation::worker::dumper) [16:46:30] apergos: it is [16:46:42] doesnt work on snapshot1009 [16:47:20] I would expect the lookup to not work that way [16:47:21] (03CR) 10Alex Monk: [C: 031] "PS4-PS12 diff lgtm. Volans, since both me and Valentin worked on this can you approve?" 
[software/certcentral] - 10https://gerrit.wikimedia.org/r/460382 (owner: 10Alex Monk) [16:49:15] (03PS1) 10Dzahn: dumps: disable php-fpm in Hiera on role level [puppet] - 10https://gerrit.wikimedia.org/r/464183 [16:49:40] apergos: ^ pretty confident this will work though [16:49:57] yes, it's just annoying [16:50:25] please go ahead and pull the value from the profile common file too [16:50:29] might as well go into the same changeset [16:51:56] ack [16:52:04] (03PS2) 10Dzahn: dumps: disable php-fpm in Hiera on role level [puppet] - 10https://gerrit.wikimedia.org/r/464183 [16:53:39] seems fine to me [16:53:45] (03CR) 10Dzahn: "correction: using role level instead of profile level: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464183/" [puppet] - 10https://gerrit.wikimedia.org/r/455154 (https://phabricator.wikimedia.org/T201140) (owner: 10Giuseppe Lavagetto) [16:53:54] 'k [16:53:55] (03CR) 10Dzahn: [C: 032] dumps: disable php-fpm in Hiera on role level [puppet] - 10https://gerrit.wikimedia.org/r/464183 (owner: 10Dzahn) [16:54:10] (03PS3) 10Dzahn: dumps: disable php-fpm in Hiera on role level [puppet] - 10https://gerrit.wikimedia.org/r/464183 [16:55:46] lol of course it had to be rebased [16:56:25] godog: akosiaris: FYI it works, it just takes some time I think: https://logstash-beta.wmflabs.org/goto/efff8adfe1fe6ca0e34af4679928721c [16:56:45] apergos: works on snapshot1009 now [16:56:57] i expect recoveries.. i'll let you run puppet on the others? [16:57:40] I ran it on 1008 [16:57:42] it's fine there [16:57:45] thanks [16:57:45] RECOVERY - puppet last run on snapshot1008 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [16:58:02] I can wait the 30 mins for the others [16:58:10] cool. yw! 
i will be back later [16:58:26] see ya [16:58:44] RECOVERY - puppet last run on snapshot1009 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [16:59:25] 10Operations, 10Horizon, 10Traffic, 10Upstream: Horizon Designate dashboard not allowing creation of NS records - https://phabricator.wikimedia.org/T204013 (10Krenair) I got projectadmin in the `openstack` tenant back and made an instance called labs-t204013-osdev. Then I followed https://docs.openstack.or... [17:00:17] well I was already on 1005 so I ran it there too [17:01:44] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:03:24] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 54.69 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:07:25] (03PS4) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) [17:08:10] (03CR) 10jerkins-bot: [V: 04-1] wdqs: auto deployment of wdqs on wdqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [17:08:45] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 79.96 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:09:35] PROBLEM - Host lvs2009 is DOWN: PING CRITICAL - Packet loss = 100% [17:10:12] !log reinstalling OS on lvs2009 [17:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:40] 10Operations, 10Traffic, 10Patch-For-Review: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (10Papaul) puppet certs for lvs2009 and lvs2010 delete from master for OS reinstall [17:13:05] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin 
on einsteinium is CRITICAL: 58.65 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:13:35] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:14:35] <_joe_> apergos: sorry I went for a shower [17:14:45] <_joe_> the solution is to set that hiera variable to false [17:14:47] no worries [17:14:53] <_joe_> how did I miss the snapshots [17:14:55] already done and fixed thanks to mut ante [17:16:24] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 71.85 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:16:54] RECOVERY - puppet last run on snapshot1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:17:56] (03PS1) 10Cwhite: memcached, redis: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464366 (https://phabricator.wikimedia.org/T183454) [17:22:43] (03PS3) 10RobH: bast4002: switch over prometheus [puppet] - 10https://gerrit.wikimedia.org/r/393943 (https://phabricator.wikimedia.org/T179050) (owner: 10BBlack) [17:23:50] (03CR) 10jerkins-bot: [V: 04-1] bast4002: switch over prometheus [puppet] - 10https://gerrit.wikimedia.org/r/393943 (https://phabricator.wikimedia.org/T179050) (owner: 10BBlack) [17:24:27] (03PS1) 10RobH: changing prometheus.svc.ulsfo.wmnet entry to bast4002 [dns] - 10https://gerrit.wikimedia.org/r/464369 (https://phabricator.wikimedia.org/T179050) [17:25:55] (03PS4) 10RobH: bast4002: switch over prometheus [puppet] - 10https://gerrit.wikimedia.org/r/393943 (https://phabricator.wikimedia.org/T179050) (owner: 10BBlack) [17:26:17] !log disable cr1-eqiad:ae1 - T201145 [17:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:22] T201145: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 [17:28:12] !log 
start of recabling asw2-a-eqiad between asw and cr1 - T201145 [17:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:10] (03CR) 10RobH: [C: 032] bast4002: switch over prometheus [puppet] - 10https://gerrit.wikimedia.org/r/393943 (https://phabricator.wikimedia.org/T179050) (owner: 10BBlack) [17:33:41] (03PS1) 10RobH: remove bast4001 from prometheus firewall exceptions [puppet] - 10https://gerrit.wikimedia.org/r/464371 (https://phabricator.wikimedia.org/T179050) [17:34:50] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050 (10RobH) a:05RobH>03fgiunchedi Ok, updates from IRC sync up and followup actions: * @robh updated (per @fgiunchedi's instruction) @bblack's patchset https://gerrit.wikimedia.org... [17:38:54] 10Operations, 10Puppet, 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - https://phabricator.wikimedia.org/T204088 (10Ottomata) FINALLY GOT IT! https://beta-prometheus.wmflabs.org/beta/graph?g0.range_input=1... 
[17:39:08] 10Operations, 10Puppet, 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - https://phabricator.wikimedia.org/T204088 (10Ottomata) a:03Ottomata [17:39:33] 10Operations, 10Puppet, 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - https://phabricator.wikimedia.org/T204088 (10Ottomata) [17:42:29] (03PS5) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) [17:42:30] !log re-enable cr1-eqiad:ae1 - T201145 [17:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:35] T201145: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 [17:42:55] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 262.68 seconds [17:43:25] (03PS6) 10Dzahn: mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) [17:43:56] (03CR) 10Dzahn: mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [17:44:04] <_joe_> mutante: thanks for taking care of the fpm issue [17:44:11] (03CR) 10jerkins-bot: [V: 04-1] mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [17:44:14] (03PS1) 10Jdlrobson: Log client errors on beta cluster at 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464373 (https://phabricator.wikimedia.org/T202026) [17:44:17] <_joe_> in fact, on dumps that should stay to false even in the future 
:) [17:47:33] 10Operations, 10netops: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (10ayounsi) Steps to migrate asw2-b-eqiad to a supported topology. {F26293194} Step 1) [] Enable all VC ports on FPC2 and FPC7 ``` request virtual-chassis vc-port set pic-slot 0 port... [17:48:03] indeed it should :-) [17:48:10] _joe_: welcome:) i wasn't sure if it would work to set on profile level because i see we have hieradata/common/profile/ but usually i always just use roles. though in this case it means adding it in 4 places rather than one common one as apergos pointed out [17:48:27] yeah the profile lookups don't work like that though [17:48:36] turns out it didnt work so we used roles [17:48:40] <_joe_> well, no, that should've gone to all roles :) [17:48:51] it has.. in the second change [17:48:51] 10Operations, 10Puppet, 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - https://phabricator.wikimedia.org/T204088 (10Jdlrobson) Thats awesome!!!!🎉 🎉 🎉 🎉 🎉 🎉 @Thank you @Ottomata @Krenair and @fgiunchedi thi... [17:49:05] we just tried the other way first [17:49:11] <_joe_> there is a reason why we don't do DRY there [17:49:26] <_joe_> it's a good thing to see all the config for a role in one place [17:49:29] <_joe_> if possible [17:49:58] <_joe_> there are truly global settings that go in the common/ hierarchy, but this is not one of them [17:50:22] <_joe_> and btw, the mediawiki => profile::mediawiki move is not that far away now :) [17:50:35] <_joe_> I know it will be hard to kill, but one can try :P [17:51:12] 10Operations, 10ops-ulsfo, 10Traffic: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327 (10RobH) 05Open>03Resolved All new systems are in place, resolving this task. 
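A note on the lookup behaviour discussed above: the following is a minimal Python sketch, not Puppet's actual implementation, of why a key can be reported as missing "in any Hiera data file" even though it exists somewhere under hieradata/. It assumes (the log does not confirm this) that only the files listed in the hierarchy for a node's role are searched. The file paths, the role name example/dumper and all helper names are hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical in-memory stand-in for YAML data files: path -> key/value pairs.
Files = Dict[str, Dict[str, Any]]

KEY = "profile::mediawiki::php::enable_fpm"


class MissingKey(LookupError):
    """No searched file defines the key and no default was supplied."""


def hierarchy_for(role: str) -> List[str]:
    # Assumed search order: the role's own file first, then a global file.
    # Files outside this list are never consulted in this model.
    return [f"role/common/{role}.yaml", "common.yaml"]


def lookup(key: str, role: str, files: Files) -> Any:
    # Return the first match in hierarchy order, or fail loudly.
    for path in hierarchy_for(role):
        level = files.get(path, {})
        if key in level:
            return level[key]
    raise MissingKey(f"Could not find data item {key} for role {role}")


def describe(role: str, files: Files) -> str:
    # Convenience wrapper that turns the outcome into a printable string.
    try:
        return f"{KEY} = {lookup(KEY, role, files)!r}"
    except MissingKey as err:
        return f"error: {err}"


# First attempt: the value lives only in a profile-level file that the
# assumed hierarchy does not search.
profile_level_only: Files = {
    "common/profile/example/worker.yaml": {KEY: False},
}

# Second attempt: the same value is also set in the role-level file.
with_role_level: Files = {
    **profile_level_only,
    "role/common/example/dumper.yaml": {KEY: False},
}

print(describe("example/dumper", profile_level_only))
print(describe("example/dumper", with_role_level))
```

Under that assumption, setting the value in each role-level file, as the second change did, is what makes the lookup succeed, at the cost of repeating it once per role.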
[17:51:15] 10Operations, 10Traffic, 10Patch-For-Review: Investigate better DNS cache/lookup solutions - https://phabricator.wikimedia.org/T104442 (10RobH) [17:51:49] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wiktionary.org [puppet] - 10https://gerrit.wikimedia.org/r/462477 (https://phabricator.wikimedia.org/T196968) [17:52:33] <_joe_> mutante: ^^ I'm running the compiler now, we got to the list of the large projects now [17:53:24] RECOVERY - Host lvs2009 is UP: PING OK - Packet loss = 0%, RTA = 36.04 ms [17:53:28] <_joe_> I was planning on doing at least wiktionary and wikiquote tomorrow, can you rebase the patches and take a look? [17:53:29] _joe_: ok, both very cool.. mediawiki => profile and apache :) [17:53:41] yes, i will take a look [17:53:58] <_joe_> ah nevermind, all those changes need some modification [17:54:12] <_joe_> :/ [17:54:22] <_joe_> I might take a look later, or tomorrow maybe :P [17:54:22] what is it? [17:54:26] ok [17:54:36] <_joe_> upload_rewrite now is a struct [17:54:40] <_joe_> used to be a string [17:55:02] <_joe_> ofc if you want to modify them, please do [17:55:12] <_joe_> I'm not jealous of my patches :P [17:55:49] ok,cool! i will look later today [17:56:33] before the Internet Archive party starts , attending that in SF tonight [17:56:35] PROBLEM - Host lvs2009 is DOWN: PING CRITICAL - Packet loss = 100% [17:56:53] that host is being reinstalled, i know the name from ticket [18:00:23] (03PS6) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) [18:01:31] mutante: yes i log the message 12:10 < papaul> !log reinstalling OS on lvs2009 [18:03:23] _joe_, when you say 'an https interface to mediawiki' [18:03:36] you're not talking about the nginx/varnish layer being able to talk HTTPS are you? [18:04:29] i.e. 
https://en.wikipedia.beta.wmflabs.org/wiki/Main_Page [18:05:03] 10Operations, 10Puppet, 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - https://phabricator.wikimedia.org/T204088 (10Niedzielski) {icon thumbs-up}{icon thumbs-up}{icon thumbs-up}{icon thumbs-up}{icon thumbs-... [18:05:56] 10Operations, 10Puppet, 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - https://phabricator.wikimedia.org/T204088 (10Krenair) so this is resolved? [18:06:01] papaul: yep, thank you [18:07:00] re: hieradata/common/profile since the lookup didn't work like that for us kind makes me wonder how all the existing stuff inside that works though , if it does [18:07:17] !log disable ulsfo Zayo transit/transport links [18:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:33] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@d5bab41]: Bump cirrusSearchLinksUpdate concurrency to 20 [18:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:24] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:08:31] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@d5bab41]: Bump cirrusSearchLinksUpdate concurrency to 20 (duration: 00m 57s) [18:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:44] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 65, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:09:35] !log lvs2009 - schedule downtime in icinga for 4 hours, reinstall in progress [18:09:36] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:39] <_joe_> Krenair: no I mean TLS on the application servers in beta [18:10:57] <_joe_> we have it in production, we should add it there too, even though it's not as necessary [18:11:14] <_joe_> just to allow people to test connecting apps via TLS, which is a sane practice [18:11:14] _joe_, what certs do you use for that in prod? [18:11:23] puppet? [18:11:26] <_joe_> we generate certs with the puppet CA [18:11:41] so, extra certs from the puppet CA that aren't the host's normal puppet certs? [18:11:49] <_joe_> if you ping me in a couple weeks (post-switchback), I can show you how it's done or do it myself [18:11:52] <_joe_> yes [18:11:53] ok [18:11:56] thanks [18:12:05] <_joe_> sorry, bbl [18:12:59] 10Operations, 10Research, 10SRE-Access-Requests: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10herron) [18:13:03] 10Operations, 10MediaWiki-Debug-Logger, 10Performance-Team: Set up request profiling for PHP 7 - https://phabricator.wikimedia.org/T206152 (10Krinkle) p:05Triage>03Normal [18:14:44] ACKNOWLEDGEMENT - Host lvs2009 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T205970 [18:16:04] RECOVERY - Host lvs2009 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [18:16:40] !log lvs2010 - schduled downtime for host and services for 12 hours for reinstall [18:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:20] 10Operations, 10Traffic, 10Patch-For-Review: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (10Papaul) The reason the OS reinstall is taking long is that when all 4 NIC's are plugged in, the server can not auto configure the first NIC so I have to to in the BIOS and disa... 
[18:22:49] 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Ban spam arriving to my tools email - https://phabricator.wikimedia.org/T202558 (10MarcoAurelio) >>! In T202558#4638736, @Bstorm wrote: > @MarcoAurelio Seem good enough? This certainly cleared up my spam folder a lot. Hello @... [18:23:14] PROBLEM - Host lvs2009 is DOWN: PING CRITICAL - Packet loss = 100% [18:26:18] 10Operations, 10netops, 10Wikimedia-Incident: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (10ayounsi) 05Open>03Resolved This is now stable. Back to T187960 for the remaining steps. [18:27:04] RECOVERY - Host lvs2009 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [18:30:58] 10Operations, 10Traffic, 10Patch-For-Review: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (10Papaul) LVS 2009 ``` root@lvs2009:~# fdisk -l Disk /dev/sda: 223.6 GiB, 240057409536 bytes, 468862128 sectors Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physic... 
[18:35:14] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:35:25] 10Operations, 10Traffic, 10Patch-For-Review: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (10Papaul) - Re-enable the other 3 NIC's - First puppet run complete LVS2009 is ready [18:36:38] !log reinstalling OS on lvs2010 [18:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:03] (03PS1) 10Cwhite: hiera: remove diamond from etherpad [puppet] - 10https://gerrit.wikimedia.org/r/464380 (https://phabricator.wikimedia.org/T183454) [18:38:10] (03PS7) 10Dzahn: mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) [18:38:54] (03CR) 10jerkins-bot: [V: 04-1] mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [18:40:18] (03PS1) 10Krinkle: profiler: Move flush to a function and prep beta/prod consolidate (beta-only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464381 (https://phabricator.wikimedia.org/T176916) [18:40:20] (03PS1) 10Krinkle: profiler: Use wmfArcLampFlush() in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464382 (https://phabricator.wikimedia.org/T176916) [18:40:24] 10Operations, 10Research, 10SRE-Access-Requests: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10herron) [18:41:35] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:42:05] (03PS2) 10Pmiazga: Beta: Log client errors on beta cluster at 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464373 (https://phabricator.wikimedia.org/T202026) (owner: 10Jdlrobson) [18:42:10] 10Operations, 10Research, 10SRE-Access-Requests: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10herron) Thanks @Isaac! Looks like we just need a manager thumbs up here in the task before moving forward with granting access. [18:42:20] (03CR) 10Pmiazga: [C: 032] Beta: Log client errors on beta cluster at 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464373 (https://phabricator.wikimedia.org/T202026) (owner: 10Jdlrobson) [18:42:44] (03CR) 10Imarlier: profiler: Use wmfArcLampFlush() in prod (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464382 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [18:44:02] (03PS3) 10Krinkle: profiler: Prevent flush from fataling a request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464178 (https://phabricator.wikimedia.org/T206092) [18:44:09] (03Merged) 10jenkins-bot: Beta: Log client errors on beta cluster at 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464373 (https://phabricator.wikimedia.org/T202026) (owner: 10Jdlrobson) [18:47:07] (03PS8) 10Dzahn: mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) [18:47:10] (03PS1) 10Herron: admin: add chelsyx to statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/464384 (https://phabricator.wikimedia.org/T205736) [18:48:09] (03CR) 10jerkins-bot: [V: 04-1] mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [18:48:24] (03PS2) 10Cwhite: memcached, redis: remove diamond 
[puppet] - 10https://gerrit.wikimedia.org/r/464366 (https://phabricator.wikimedia.org/T183454) [18:48:36] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to to stats, analytics-search-users, statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10herron) [18:48:59] (03CR) 10Krinkle: profiler: Use wmfArcLampFlush() in prod (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464382 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [18:49:04] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to to analytics-search-users, statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10herron) [18:49:09] (03CR) 10Herron: [C: 032] admin: add chelsyx to statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/464384 (https://phabricator.wikimedia.org/T205736) (owner: 10Herron) [18:49:48] !log Deployed patches for T206130 [18:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:02] (03CR) 10Krinkle: [C: 032] profiler: Move flush to a function and prep beta/prod consolidate (beta-only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464381 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [18:50:16] (03PS9) 10Dzahn: mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) [18:50:29] (03CR) 10Imarlier: profiler: Use wmfArcLampFlush() in prod (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464382 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [18:50:57] (03PS3) 10Cwhite: memcached, redis: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464366 (https://phabricator.wikimedia.org/T183454) [18:51:58] (03Merged) 10jenkins-bot: profiler: Move flush to a function and prep beta/prod consolidate 
(beta-only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464381 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [18:54:19] hey [18:54:37] post-merge queue is stuck [18:54:37] https://integration.wikimedia.org/zuul/ [18:54:53] (03CR) 10jenkins-bot: Beta: Log client errors on beta cluster at 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464373 (https://phabricator.wikimedia.org/T202026) (owner: 10Jdlrobson) [18:54:55] (03CR) 10jenkins-bot: profiler: Move flush to a function and prep beta/prod consolidate (beta-only) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464381 (https://phabricator.wikimedia.org/T176916) (owner: 10Krinkle) [18:56:30] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/12747/" [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [18:57:40] (03CR) 10Dzahn: [C: 031] "compiler output: a bit surprised it changes _from_ present on mwmaint1001. also mwmaint1002 isn't known by compiler yet. but it still seem" [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [18:59:57] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Update Debian Package for Scap to 3.8.7-1 - https://phabricator.wikimedia.org/T204383 (10mmodell) [19:00:04] marxarelli: That opportune time is upon us again. Time for a MediaWiki train - Americas version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181003T1900). 
[19:01:03] (03PS4) 10Dzahn: Revert "Gerrit: Add missing resource /var/lib/gerrit2/review_site" [puppet] - 10https://gerrit.wikimedia.org/r/462770 (https://phabricator.wikimedia.org/T196835) (owner: 10Thcipriani) [19:02:12] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure, 10Technical-Debt, 10Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220 (10Krenair) [19:02:36] (03CR) 10Dzahn: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/12748/cobalt.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/462770 (https://phabricator.wikimedia.org/T196835) (owner: 10Thcipriani) [19:02:54] raynor: not stuck. It's just a bit behind. There's a lot of merges and it has a low priority to do only 1 at once. [19:02:55] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:03:30] raynor: is there something in particular you are waiting for from that pipeline? afaik it's mostly doc.wikimedia.org updates [19:03:43] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to to analytics-search-users, statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10herron) 05Open>03Resolved a:03herron >>! In T205736#4635602, @chelsyx wrote: > Thanks everyo... [19:03:57] nah, somehow it picked my task -> I merged the beta cluster config change and we really wanted to check it [19:04:02] (03PS7) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1010 [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) [19:04:21] (03PS1) 10Cwhite: profile, graphite: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464389 (https://phabricator.wikimedia.org/T183454) [19:04:22] and then I saw that some task is like 2h in the queue and it frightened me a bit, but it went trough... 
[19:04:29] (03CR) 10Dzahn: [C: 032] "this did:" [puppet] - 10https://gerrit.wikimedia.org/r/462770 (https://phabricator.wikimedia.org/T196835) (owner: 10Thcipriani) [19:04:33] looks like the post-merge for mediawiki-config have higher priority [19:04:48] paladox: see comment on gerrit above [19:04:55] +.wikimedia-footer { [19:04:59] or they are picked by different worker. Krinkle, thx for quick answer [19:05:03] that was expected, correct? [19:05:09] thcipriani the footer works! [19:05:14] mutante yup! [19:05:27] thanks! [19:05:41] alright [19:05:45] 10Operations, 10Traffic, 10Patch-For-Review: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (10Papaul) lvs2010 ``` root@lvs2010:~# fdisk -l Disk /dev/sda: 223.6 GiB, 240057409536 bytes, 468862128 sectors Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical)... [19:06:43] raynor: Yeah, the concurrency is limited per type of job, not per pipeline. so 1 npm doc update, 1 doxygen doc update, 1 beta update :) [19:08:15] (03PS3) 10Niedzielski: smaller wiki Minerva a/b tests are bumped to 100% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463875 (https://phabricator.wikimedia.org/T200792) (owner: 10Jdlrobson) [19:08:20] 10Operations, 10Traffic, 10Patch-For-Review: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (10Vgutierrez) Thanks @papaul! [19:08:41] 10Operations, 10Patch-For-Review: Audit/fix hosts with no RAID configured - https://phabricator.wikimedia.org/T136562 (10Vgutierrez) [19:08:44] 10Operations, 10Traffic: lvs2009/lvs2010 with no RAID configured - https://phabricator.wikimedia.org/T205970 (10Vgutierrez) 05Open>03Resolved a:03Papaul [19:09:04] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:09:04] (03PS1) 10Anomie: Enforce no-session constraint in static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464391 (https://phabricator.wikimedia.org/T127233) [19:09:52] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to to analytics-search-users, statistics-privatedata-users for Chelsy Xie - https://phabricator.wikimedia.org/T205736 (10chelsyx) Thank you @herron ! [19:10:05] (03CR) 10Gergő Tisza: [C: 031] Enforce no-session constraint in static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464391 (https://phabricator.wikimedia.org/T127233) (owner: 10Anomie) [19:10:18] anomie: can I interest you in debugging an issue with importDump.php --uploads? https://phabricator.wikimedia.org/T206013 [19:10:35] * anomie looks [19:10:58] anomie: I can put your key on -static if you want to log in and try. And/or I can get you the dump in question. [19:11:08] here comes the choo choo... [19:11:13] andrewbogott: I'll try static analysis first [19:11:30] anomie: thanks! Let me know what I can do to help [19:11:48] (03PS1) 10Dduvall: group1 wikis to 1.32.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464392 [19:11:49] (03CR) 10Dduvall: [C: 032] group1 wikis to 1.32.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464392 (owner: 10Dduvall) [19:13:23] (03Merged) 10jenkins-bot: group1 wikis to 1.32.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464392 (owner: 10Dduvall) [19:14:34] !log dduvall@deploy1001 scap failed: average error rate on 6/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [19:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:11] oh boy. 
all sorts of havoc [19:15:38] https://www.irccloud.com/pastebin/xaeaFWo3/ [19:15:46] 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Ban spam arriving to my tools email - https://phabricator.wikimedia.org/T202558 (10Bstorm) 05Open>03Resolved That's better than my personal accounts get :) Let's close this for now and re-open if needed. I hope your healt... [19:15:59] !log rolling back group1 after rapid rise in fatals [19:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:54] (03CR) 10Smalyshev: wdqs: auto deployment of wdqs on wdqs1010 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [19:18:42] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: rollback group1 to 1.32.0-wmf.23 [19:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:08] thcipriani: `scap sync-file` will handle the symlink properly, right? (see ^) [19:19:21] the *php* symlink rather [19:19:26] (03PS2) 10Aaron Schulz: wmfSetupEtcd: Correctly initialize the local cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto) [19:21:02] (03CR) 10Smalyshev: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/461862 (https://phabricator.wikimedia.org/T202830) (owner: 10Smalyshev) [19:21:05] marxarelli: scap sync-file will sync the symlink not its target [19:21:25] rgr. good [19:22:18] mutante: paladox thanks for the merge and the work there, glad to see that go live! Looks like css on the login screen is a bit odd, but can be fixed in followup. 
[19:22:31] yep :) [19:23:17] !log dduvall@deploy1001 Synchronized php: rollback group1 to 1.32.0-wmf.23 (duration: 00m 54s) [19:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:15] (03CR) 10Aaron Schulz: wmfSetupEtcd: Correctly initialize the local cache (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto) [19:24:21] (03PS1) 10Dduvall: Rollback group1 to 1.32.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464394 [19:24:31] (03CR) 10jenkins-bot: group1 wikis to 1.32.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464392 (owner: 10Dduvall) [19:25:02] andrewbogott: What git revision of MediaWiki are you using there? I'm having trouble matching up line numbers. [19:25:07] (03CR) 10Dduvall: [C: 032] "(Already synced during emergency rollback.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464394 (owner: 10Dduvall) [19:26:32] 10Operations, 10Puppet, 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - https://phabricator.wikimedia.org/T204088 (10Jdlrobson) 05Open>03Resolved Yup! I can see events here > https://grafana-labs-admin.w... [19:26:33] (03Merged) 10jenkins-bot: Rollback group1 to 1.32.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464394 (owner: 10Dduvall) [19:27:35] * addshore reads up [19:27:58] marxarelli: ooof, is there a ticket yet? [19:28:12] addshore: filing now [19:28:29] thanks, looks like it is a wikibase thing, just finished dinner and figured i would pop my head in [19:29:31] 10Operations, 10Security-team-backlog, 10monitoring, 10Patch-For-Review: icinga notification if elevated writing to badpass.log - https://phabricator.wikimedia.org/T150300 (10chasemp) >>! 
In T150300#4637103, @gerritbot wrote: > Change 464077 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Ti... [19:29:38] 10Operations, 10ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T206004 (10Cmjohnson) 05Open>03declined T [19:30:31] addshore: i appreciate it! https://phabricator.wikimedia.org/T206161 [19:31:07] 10Operations, 10ops-eqiad: helium (bacula) - Device not healthy -SMART- - https://phabricator.wikimedia.org/T205364 (10Cmjohnson) @Dzahn the disk was replaced but it's unconfigured good ....I have not tried to add it back but no success. can you give it a go please [19:33:20] !log running initial osm import in maps1004 [19:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:56] !log uploaded 2.99.9161-beta-1+wmf1 to stretch-wikimedia [19:35:56] anomie: it's 3.31 but I had some debug lines in when I ran the test. Let me reset and re-run [19:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:59] RECOVERY - MariaDB Slave Lag: s8 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 299.06 seconds [19:38:57] andrewbogott: If you can tell me the specific revision, e.g. from `git log -1`, that'd be helpful. [19:39:22] anomie: it's in the ticket, I think… 1.31.1 (8641df9) [19:39:24] is that enough? [19:39:38] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 291 bytes in 0.104 second response time [19:39:48] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 291 bytes in 0.102 second response time [19:39:49] andrewbogott: Ok, I see it upon reloading. 
[19:39:59] PROBLEM - Wikitech-static main page has content on labtestweb2001 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 291 bytes in 0.103 second response time [19:40:29] !log upgraded gdnsd to 2.99.9161 on authdns2001 [19:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:01] (03PS1) 10Cwhite: openstack, rabbitmq: remove diamond [puppet] - 10https://gerrit.wikimedia.org/r/464399 (https://phabricator.wikimedia.org/T183454) [19:46:58] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 33005 bytes in 0.242 second response time [19:47:08] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 33005 bytes in 0.210 second response time [19:47:19] RECOVERY - Wikitech-static main page has content on labtestweb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 33005 bytes in 0.189 second response time [19:47:20] (03CR) 10jenkins-bot: Rollback group1 to 1.32.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464394 (owner: 10Dduvall) [19:48:08] anomie: I updated the stack trace in the bug with a run from a cleaner install [19:48:09] (03CR) 10Krinkle: wmfSetupEtcd: Correctly initialize the local cache (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto) [19:48:13] so line numbers should be right now [19:48:21] * anomie looks [19:49:39] andrewbogott: Is that still with 8641df9 or did you checkout a new version? [19:50:06] anomie: same version [19:50:08] https://www.irccloud.com/pastebin/8fw7EvuZ/ [19:50:42] well, what the heck, I didn't rebase anything... 
[19:50:47] and yet now it's reporting a different version :( [19:50:55] So I don't know what the deal is [19:50:55] https://wikitech-static.wikimedia.org/wiki/Special:Version [19:51:09] Looks like you're on master there. [19:51:25] ah, yeah, typo when reseting [19:51:32] dammit, I'll do this once more and then update that stack trace again [19:52:44] * andrewbogott always enjoys how it takes 10+ minutes for 'composer update' to run [19:55:21] anomie: I updated the paste in the ticket, yet again. Sorry for the confusion [19:55:29] 10Operations, 10Community-Tech, 10MediaWiki-Parser, 10Thumbor, and 3 others: Show SVGs in page language if available - https://phabricator.wikimedia.org/T205040 (10MaxSem) a:03MaxSem [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor I � Unicode. All rise for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181003T2000). [20:00:19] no parsoid deploy today. [20:01:20] (03CR) 10Smalyshev: wdqs: auto deployment of wdqs on wdqs1010 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/464151 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [20:03:43] !log upgraded gdnsd to 2.99.9161 on multatuli [20:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:24] (03PS3) 10Bstorm: wiki replicas: Remove most comment joins from non-compat tables [puppet] - 10https://gerrit.wikimedia.org/r/463541 (https://phabricator.wikimedia.org/T181650) [20:18:26] 10Operations, 10monitoring: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Gehel) I can only speculate at this point, but the packet loss seems be happening in burst, depending on the check interval on ping, we might miss it. Not sure if icinga alerts on the fi... 
[20:18:44] !log optic swap on cr4-ulsfo:et-0/0/1 [20:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:07] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:23:17] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:26:07] (03PS1) 10Thcipriani: Gerrit: fix login screen css [puppet] - 10https://gerrit.wikimedia.org/r/464418 [20:28:49] !log deployed proposed WikibaseQualityConstraints fix and wikiversions bump for wikidatawiki to mwdebug1001 and mwdebug1002 for verification (T206161) [20:28:53] (03PS2) 10Thcipriani: Gerrit: fix login screen css [puppet] - 10https://gerrit.wikimedia.org/r/464418 [20:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:53] T206161: Spike in fatals from WikiPageEntityMetaDataLookup during group1 deployment of 1.32.0-wmf.24 - https://phabricator.wikimedia.org/T206161 [20:29:18] andrewbogott: Replied on the task with something to try. [20:31:51] (03CR) 10Anomie: [C: 031] "Seems sane. Haven't tested." [puppet] - 10https://gerrit.wikimedia.org/r/463541 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [20:34:18] anomie: sorry, I don't understand, what line am I changing in which file? You mention two files in your comment (and I have a WikiRevision.php but it doesn't have a line 789) [20:35:15] andrewbogott: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/8641df974d1a030f7b7fad1e8e45e0f51f9816e1/includes/import/ImportableUploadRevisionImporter.php#88 [20:35:43] anomie: that seems to help! It still throws a whole lot of errors but gets further [20:35:50] so we must have had two problems... 
[20:36:13] I looked before, and the files that are throwing that "Unable to open source data in /srv/mediawiki/w/includes/libs/mime/XmlTypeCheck.php on line 158" error are definitely not XML files [20:36:13] 10Operations: puppet compiler set to eqiad as primary dc while prod is codfw - https://phabricator.wikimedia.org/T206166 (10Dzahn) [20:36:19] although I guess they could be zipped or something [20:37:21] that error appears to happen for everything except for pdfs [20:39:18] andrewbogott: pdfs are detected earlier in MimeAnalyzer::doGuessMimeType(), so it doesn't get to the part about trying to load it as XML. SVG files should also not give that error though. [20:39:50] ok… I'm just eyeballing this, I might've missed some svgs. [20:39:57] (because SVGs are XML) [20:40:04] (03CR) 10Dzahn: [C: 031] "re: surprising results: https://phabricator.wikimedia.org/T206166" [puppet] - 10https://gerrit.wikimedia.org/r/463563 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [20:40:17] is that warning nothing to worry about? Like, is it just saying "this isn't xml, no problem" and moving on to another parser? [20:40:32] um... [20:40:42] https://www.irccloud.com/pastebin/lFEare6q/ [20:40:46] andrewbogott: Depends. If SVGs aren't giving the error then it's probably ok. If they are giving the error, then we have something else going on than just "this isn't xml" [20:41:44] andrewbogott: Sanity check: that Bd808-test.svg really is a valid SVG? [20:42:58] I'm not sure, but I see that happening with every svg [20:43:21] So probably they aren't all invalid. Hmm. 
[20:46:19] (03PS4) 10Dzahn: base: do not allow mailman server to NRPE to other hosts for no reason [puppet] - 10https://gerrit.wikimedia.org/r/464086 (https://phabricator.wikimedia.org/T202782) [20:46:49] (03CR) 10Dzahn: [C: 032] base: do not allow mailman server to NRPE to other hosts for no reason [puppet] - 10https://gerrit.wikimedia.org/r/464086 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [20:48:03] * bd808 denies uploading invalid SVGs to wikitech [20:48:11] anomie: it's reasonably possible that the dump file I'm working from is totally wrong [20:48:27] although it would have to be wrong in a very specific way... [20:48:38] https://wikitech.wikimedia.org/wiki/File:Bd808-test.svg looks valid... [20:48:59] 10Operations, 10ops-ulsfo, 10netops: Interface errors on cr4-ulsfo:et-0/0/1 - https://phabricator.wikimedia.org/T205937 (10RobH) Ok this was odd and I had to sync with @ayounsi via IRC. The spare optic is the same model, but made in China where the other 4 40g optics were made in Malaysia. The Chinese vers... [20:52:04] addshore: still waiting on jenkins but i'll deploy as soon as your patch is merged [20:52:11] thanks again for jumping on that so quickly [20:52:12] marxarelli: ack [20:52:17] ill be here [20:52:18] np :) [20:52:46] * apergos sideyes bd 808 [20:54:42] bd808, nope, https://validator.w3.org/check?uri=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Flabs%2F1%2F10%2FBd808-test.svg&charset=%28detect+automatically%29&doctype=Inline&group=0 [20:55:51] (03PS3) 10Dzahn: base/icinga/nrpe: move nrpe_allowed IPs to Hiera, add icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/464087 (https://phabricator.wikimedia.org/T202782) [20:56:13] (03PS1) 10Gilles: Install python 2 variant of sklearn on stat machines [puppet] - 10https://gerrit.wikimedia.org/r/464425 [21:10:05] addshore: k. jenkins is finally done. deploying... 
[21:10:14] [= [21:12:18] !log dduvall@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/WikibaseQualityConstraints/src/ServiceWiring.php: deploying fix to 1.32.0-wmf.24 for T206161 (duration: 00m 57s) [21:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:22] T206161: Spike in fatals from WikiPageEntityMetaDataLookup during group1 deployment of 1.32.0-wmf.24 - https://phabricator.wikimedia.org/T206161 [21:13:05] (03PS1) 10Dduvall: group1 wikis to 1.32.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464431 [21:13:07] (03CR) 10Dduvall: [C: 032] group1 wikis to 1.32.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464431 (owner: 10Dduvall) [21:13:25] anomie: I updated that ticket, for when you're interested [21:13:47] 10Operations, 10ops-eqiad: helium (bacula) - Device not healthy -SMART- - https://phabricator.wikimedia.org/T205364 (10Dzahn) I don't know how to do that. How did you try it? Are there maybe docs or examples how that is usually done? [21:14:22] (03Merged) 10jenkins-bot: group1 wikis to 1.32.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464431 (owner: 10Dduvall) [21:14:50] 10Operations, 10ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T206004 (10Dzahn) Why declined? It's still unfixed , isn't it? But now both tickets are closed. [21:16:27] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.32.0-wmf.24 [21:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:57] PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:17:11] marxarelli: looking good from here [21:17:23] !log dduvall@deploy1001 Synchronized php: group1 wikis to 1.32.0-wmf.24 (duration: 00m 55s) [21:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:42] addshore: from here too [21:19:06] wee! 
i'll update the roadmap [21:20:16] glad we managed to avoid disrupting the train too much :) [21:20:18] and i just realized i haven't had lunch yet. tamale time is now. [21:20:22] yeah, me too [21:20:30] it was great to have your support [21:20:57] 10Operations, 10hardware-requests, 10Core Platform Team (Session Management Service (CDP2)), 10User-Eevans: Hardware for session storage service - https://phabricator.wikimedia.org/T206017 (10Eevans) [21:21:06] 10Operations, 10ops-ulsfo, 10decommission: decommission lvs400[1-4].ulsfo.wmnet - https://phabricator.wikimedia.org/T178535 (10RobH) I went ahead and plugged in power to one of the power supplies for all four of these systems, then usb booted linux and ran a wipe instance per shell for each disk (sda and sdb... [21:21:43] [= until next time [21:21:54] :) [21:22:25] same deploy time, same deploy channel [21:25:13] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 8 others: Investigate more efficient memcached solution for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T97368 (10Addshore) The 10 second on server cache of the data resulted... 
[21:25:18] !log upgraded gdnsd to 2.99.9161 on authdns1001 [21:25:19] * addshore taps out [21:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:00] (CR) jenkins-bot: group1 wikis to 1.32.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/464431 (owner: Dduvall) [21:37:30] (PS1) Andrew Bogott: Fix the variable name of $wgSitenotice to $wgSiteNotice [wikitech-static] - https://gerrit.wikimedia.org/r/464435 (https://phabricator.wikimedia.org/T200479) [21:37:55] (CR) Andrew Bogott: [V: 2 C: 2] Fix the variable name of $wgSitenotice to $wgSiteNotice [wikitech-static] - https://gerrit.wikimedia.org/r/464435 (https://phabricator.wikimedia.org/T200479) (owner: Andrew Bogott) [21:39:33] (CR) Paladox: [C: 1] Gerrit: fix login screen css [puppet] - https://gerrit.wikimedia.org/r/464418 (owner: Thcipriani) [21:42:27] RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:46:12] andrewbogott: Nice catch. [21:46:17] andrewbogott: How did you find that typo? [21:46:25] I imagine it must've been a long walk in the wilderness. [21:46:27] It only took me, what, 18 months to notice? [21:46:32] :D [21:46:34] Mystery solved. [21:47:58] (CR) Dzahn: [C: 2] "https://puppet-compiler.wmflabs.org/compiler1002/12750/" [puppet] - https://gerrit.wikimedia.org/r/464087 (https://phabricator.wikimedia.org/T202782) (owner: Dzahn) [21:48:01] Hm.. I notice the logo is a 404 on static. 
[21:48:14] Its looking for it in /w/images/ [21:49:14] Krinkle: that might be https://phabricator.wikimedia.org/T206013 [21:49:44] although I'm not sure if the logo is part of the dump, it might need to be copied over as a special case [21:49:52] It's not part of the dump [21:50:06] it's referenced directly from wmf-config/static/ [21:50:15] which prod exposes at apache /static with an alias [21:50:17] PROBLEM - High CPU load on API appserver on mw2142 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:50:23] Might need an apache config change. [21:50:35] Although currently the wikitech-static site intentionally configures it to /w/images/ [21:50:45] which presumably means it got copied there by other means at some point, but went missing [21:51:06] (03PS3) 10Dzahn: Gerrit: fix login screen css [puppet] - 10https://gerrit.wikimedia.org/r/464418 (owner: 10Thcipriani) [21:51:25] storing it there via wikitech-static git would make the current config work, otherewise, the config override could be removed, and /static added as link like we do in prod [21:51:26] PROBLEM - MD RAID on pc1005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:27] PROBLEM - High CPU load on API appserver on mw1341 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:37] PROBLEM - Check whether ferm is active by checking the default input chain on pc1005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:38] PROBLEM - Check systemd state on labsdb1007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:46] PROBLEM - dhclient process on torrelay1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:46] PROBLEM - MariaDB Slave IO: s1 on db1114 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:46] PROBLEM - MD RAID on labsdb1007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[21:51:47] PROBLEM - Check size of conntrack table on torrelay1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:47] PROBLEM - High CPU load on API appserver on mw1317 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:47] PROBLEM - Disk space on db1114 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:47] PROBLEM - Check systemd state on pc1005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:47] PROBLEM - configured eth on labsdb1007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:47] PROBLEM - dhclient process on db1114 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:48] PROBLEM - MariaDB Slave IO: s7 on db1090 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:48] PROBLEM - Disk space on db1090 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:49] PROBLEM - mysqld processes on db1114 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:56] PROBLEM - DPKG on db1090 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:57] PROBLEM - configured eth on dns2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:57] PROBLEM - MariaDB read only s7 on db1090 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:57] PROBLEM - MariaDB read only s2 on db1090 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:57] PROBLEM - MariaDB read only s1 on db1114 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:57] PROBLEM - MD RAID on torrelay1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:57] PROBLEM - DPKG on labsdb1007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:51:58] PROBLEM - cassandra-a service on aqs1008 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[21:52:04] PROBLEM - mysqld processes on pc1005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:04] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:04] PROBLEM - Check whether ferm is active by checking the default input chain on elastic1036 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:04] PROBLEM - Check whether ferm is active by checking the default input chain on aqs1008 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:04] PROBLEM - DPKG on db1114 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:05] PROBLEM - Check systemd state on wtp1027 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:05] PROBLEM - Disk space on elastic1036 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:05] PROBLEM - dhclient process on wtp1027 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:06] damn [21:52:16] PROBLEM - dhclient process on pc1005 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:16] PROBLEM - Disk space on labsdb1007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:16] PROBLEM - Check whether ferm is active by checking the default input chain on db1114 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:16] PROBLEM - Check whether ferm is active by checking the default input chain on dns2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:16] PROBLEM - Check size of conntrack table on db1063 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:16] PROBLEM - DPKG on relforge1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:16] PROBLEM - Check systemd state on db1114 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[21:52:17] PROBLEM - Check size of conntrack table on labsdb1007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:17] PROBLEM - dhclient process on aqs1008 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:18] PROBLEM - DPKG on kubetcd2003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:18] PROBLEM - Check systemd state on dns2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:19] PROBLEM - Check whether ferm is active by checking the default input chain on wtp1027 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:20] that was probably me because i added an allowed host to NRPE [21:52:20] PROBLEM - nova-compute proc minimum on labvirt1003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:21] whoah, hello icinga-wm [21:52:25] PROBLEM - mysqld processes on db1063 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:25] PROBLEM - dhclient process on relforge1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:25] PROBLEM - dhclient process on oresrdb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:25] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp2007 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:25] PROBLEM - Check whether ferm is active by checking the default input chain on mw2180 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:25] PROBLEM - configured eth on mw2142 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:52:57] oof [21:53:00] need help? [21:53:15] well, first of all.. it's only an issue of icinga itself [21:53:21] ahh ok [21:53:23] and i added a new host to be allowed to do NRPE [21:53:31] which apparently broke that [21:54:04] mutante: are the mysql pages related? 
[21:54:13] (PS1) MSantos: Skipping download if PBF file exists [puppet] - https://gerrit.wikimedia.org/r/464439 (https://phabricator.wikimedia.org/T205462) [21:54:15] my phone is freaking out, what's up? [21:54:15] very likely, yes [21:54:21] NRPE is broken [21:54:24] not the actual hosts [21:54:25] uh? [21:54:29] can I ignore? [21:54:35] <_joe_> what's up? [21:54:40] (PS1) Dzahn: Revert "base/icinga/nrpe: move nrpe_allowed IPs to Hiera, add icinga1001" [puppet] - https://gerrit.wikimedia.org/r/464440 [21:54:43] so sounds like we should revert the recent NRPE change? [21:54:44] ah [21:54:45] NRPE is broken [21:54:54] yes, i am reverting it [21:55:11] (CR) Dzahn: [C: 2] Revert "base/icinga/nrpe: move nrpe_allowed IPs to Hiera, add icinga1001" [puppet] - https://gerrit.wikimedia.org/r/464440 (owner: Dzahn) [21:55:13] ok thanks! [21:56:50] PROBLEM - Disk space on db1080 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:56:50] PROBLEM - Check systemd state on wtp2008 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:56:50] PROBLEM - Varnish HTCP daemon on cp2002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:56:50] PROBLEM - Freshness of OCSP Stapling files on cp4023 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:56:50] PROBLEM - configured eth on cp3040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:56:51] PROBLEM - Check systemd state on analytics1060 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:56:51] PROBLEM - configured eth on analytics1046 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:56:52] PROBLEM - Check whether ferm is active by checking the default input chain on wtp1045 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[21:56:58] (PS1) MSantos: Skipping download if PBF file exists [puppet] - https://gerrit.wikimedia.org/r/464441 (https://phabricator.wikimedia.org/T205462) [21:57:25] !log einsteinium - disabling puppet [21:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:28] all under control? just broken nrpe? [21:57:36] appears so, yes [21:57:48] yea, sorry everybody, it is NRPE itself [21:57:57] stopped puppet on icinga server [21:58:00] <_joe_> mutante: need assistance? [21:58:00] PROBLEM - carbon-cache@f service on labmon1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:58:02] PROBLEM - kvm ssl cert on labvirt1014 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:58:02] PROBLEM - DPKG on authdns2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:58:02] PROBLEM - MD RAID on ores2004 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:58:02] PROBLEM - configured eth on mw2242 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:58:02] PROBLEM - swift-object-replicator on ms-be1040 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:58:02] PROBLEM - MariaDB Slave SQL: es3 on es1019 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:58:17] maybe if you can try to restart nagios-nrpe on everything [21:58:47] i wanted to stop icinga itself to save you from more alert spam.. let me stop the bot [21:59:35] <_joe_> before I restart nrpe everywhere [21:59:45] do we need agent runs everywhere first, to push the revert? 
[21:59:53] <_joe_> that too [21:59:55] yes [21:59:59] we need the revert applied to all nodes before a restart will be effective [22:00:10] <_joe_> jeez [22:00:13] <_joe_> ok [22:00:33] (03Abandoned) 10MSantos: Skipping download if PBF file exists [puppet] - 10https://gerrit.wikimedia.org/r/464439 (https://phabricator.wikimedia.org/T205462) (owner: 10MSantos) [22:00:55] it is possible that we just have to start nrpe [22:01:00] i am trying on a random host [22:01:16] !log icinga stopped manually [22:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:24] <_joe_> -allowed_hosts=127.0.0.1 [22:01:25] <_joe_> +allowed_hosts=127.0.0.1,208.80.153.74,208.80.155.119 [22:01:30] <_joe_> you need to run puppet [22:01:33] <_joe_> so [22:01:47] <_joe_> do it in batches of max 40 [22:01:57] _joe_: I’d be happy to do that via cumin with small batch size. it must be late for you [22:02:02] !log mw2242 - started nagios-nrpe-server [22:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:24] <_joe_> the query is cumin 'R:file = /etc/nagios/nrpe_local.cfg' [22:02:39] herron: i would be glad if you can run the cumin command, i am also not on a great connection [22:02:52] ok sure thing [22:02:54] <_joe_> 1319 hosts [22:02:59] let's first just try to start nagios-nrpe-service ? [22:03:07] <_joe_> mutante: that won't work [22:03:10] it is running on mw2242 [22:03:14] <_joe_> see what I wrote up there [22:03:21] <_joe_> that's the diff [22:03:27] sorry, I was in techcom and could not really just bail as it was my rfc under discussion [22:03:36] <_joe_> nrpe was only listening on localhost [22:03:40] <_joe_> I was in bed :P [22:03:55] arrg,, i see.. 
yes [22:04:29] <_joe_> herron: I'd do something like cumin -b 40 -p 95 'R:file = /etc/nagios/nrpe_local.cfg' run-puppet-agent [22:04:52] <_joe_> we're without monitoring, it's worth risking to overload the puppetmasters [22:04:57] i am going to neodymium as well [22:05:02] i can also run a batch [22:05:06] <_joe_> no [22:05:09] if you run there [22:05:15] then you risk overloading the puppetmasters [22:05:15] ok [22:05:16] <_joe_> only one of you please [22:05:18] let's just do one place [22:05:24] <_joe_> -b 40 is for a reason [22:05:29] <_joe_> do it in a tmux session [22:05:32] I’ve got it running [22:05:42] <_joe_> herron: log it :) [22:06:32] !log herron@neodymium:~$ sudo cumin -b 40 -p 95 'R:file = /etc/nagios/nrpe_local.cfg' run-puppet-agent [22:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:58] <_joe_> the file is wrong on only 274 hosts, luckily [22:07:15] depending on the outcome might need another pass [22:07:36] need to get used to using the new cumin master btw [22:07:45] heh [22:07:49] I have neodymium in muscle memory [22:08:30] XioNoX: ^ are the mgmt rules on network being updated for cumin[12]001? [22:08:41] so do we want to do something to suppress recoveries? [22:08:47] you could have grepped as first command to fail if the file was already correct [22:08:49] robh: yes [22:08:56] cool, i wasnt sure wanted to check! [22:08:58] =] [22:09:01] just an optimization to run less puppets [22:09:47] thx good point volans|off [22:10:08] Operations, Packaging, Scap, Patch-For-Review, and 2 others: Update Debian Package for Scap to 3.8.7-1 - https://phabricator.wikimedia.org/T204383 (mmodell) [22:10:32] <_joe_> volans|off: I did that but the command was already launched [22:12:11] sorry, so do we want to suppress recovery SMS or not worry about that? [22:12:44] if someone is not near their keyboard not having recoveries can be alarming.
i have no viewpoint otherwise [22:12:53] i mean, sms cost money but minimal so meh [22:13:54] SMalyshev: did you disable puppet on wdqs1010.eqiad.wmnet by any chance? [22:14:01] i am trying to come up with a good way to suppress them.. but i can only think of changing all timezones in the contacts or manually scheduling downtime.. hrm [22:14:12] might be able to blackhole the mail [22:14:29] it would be a lot of recoveries. easiest would be to set enable_notifications = 0, then let icinga recover, then re-enable [22:14:46] yea, shdubsh is right. let's do that [22:14:51] cool nice and simple [22:15:00] i wouldnt blackhole emails.... [22:15:04] emails are the least annoying thing [22:15:16] the sms are also emails just to special addresses though [22:15:18] sorry I mean the emails that go off to the SMS gateways [22:15:29] but enable_notifications is much more straightforward [22:15:44] puppet run complete? [22:15:53] 34% through [22:16:25] it’s slow with 40 hosts at a time. but if we have notifications off and ircecho disabled I don’t see why we couldn’t fire icinga back up [22:16:52] if you deem it is much quicker to run only on affected hosts and you already have the list or the command to run it only there [22:17:15] uploading patch [22:17:32] I don't see a big problem to ctrl+c the current one. _joe_ do you see issues to ctrl+c some puppet run? [22:17:41] <_joe_> no [22:17:55] mutante: patch? [22:18:04] (PS1) Dzahn: icinga: disable notifications from einsteinium [puppet] - https://gerrit.wikimedia.org/r/464448 [22:18:12] shdubsh: ^ that? [22:18:41] <_joe_> herron: I can open a tmux on cumin2001 and launch the command if you stop yours [22:18:45] it’s already nearly 50% I don’t see much benefit paying the forced puppet runs any more mind [22:18:46] do we need a patch to change an option for only a few minutes? [22:19:21] shdubsh: not if puppet is disabled.
ok [22:19:35] <_joe_> well, the fact we'd get monitoring back is a nifty bonus [22:19:41] I have the change ready to write [22:19:42] <_joe_> anyways, I'm off to bed then [22:19:52] <_joe_> see you tomorrow! [22:19:56] good night [22:19:59] shdubsh: i see you are in icinga.cfg already, go for it then [22:20:13] thanks _joe_ goodnight! [22:20:25] !log einsteinium: setting enable_notifications=0 and starting icinga [22:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:15] if I'm not needed anymore I'd go too [22:21:54] go! good night [22:22:04] good night v olans [22:22:07] checks are already starting to recover [22:22:08] will make sure it's all green before turning that back on [22:22:10] have a good night volans|off thanks [22:22:21] ack tty tomorrow thanks [22:23:39] 62% now, not too much longer [22:24:52] thanks a lot for taking the cumin part, herron [22:25:07] you bet [22:25:10] out of all places i am also on a moving bus [22:25:20] well timed! [22:25:46] yea.. :/ [22:25:49] somewhere murphy is having a good long laugh at our expense [22:26:51] ha! good ol’ greyhound deployment [22:28:07] somebody said something presidential alert [22:29:08] down another 200 alerts [22:29:58] nice. 85% now [22:31:36] hello, can anyone here give me ssh access to analytics-tool1002.eqiad.wmnet ? [22:31:44] watching the icinga counter decrease is oddly satisfying [22:32:10] apergos: yes puppet on 1010 is off [22:32:40] SMalyshev: can you re-enable it for a little bit and run puppet over there? [22:32:56] a bad icinga config went around and there's a need for cleanup [22:33:17] you should be able to turn it off again right away ( mutante, is that true? ) [22:34:08] mmm.. bad moment to ask everything is going kaput, sorry [22:35:11] forced puppet runs are finished [22:35:28] apergos: what's the problem with 1010? I am running some long-running task there and wouldn't want puppet to mess it up...
[22:36:21] going to kick off another failed-only pass just to be sure [22:36:38] !log herron@neodymium:~$ sudo cumin -b 15 -p 95 '*' 'run-puppet-agent -q --failed-only' [22:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:23] SMalyshev: it was all (or a lot) of the hosts: [22:37:28] they got this change https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/464087/ [22:37:37] apergos: yes, that is true. or if needed we can manually fix it [22:37:38] it should only have affected icinga, nothing else [22:38:00] (thanks mu tante) [22:38:22] 772 left... [22:39:02] if you dont want to start puppet, i can go and fix the NRPE config on them [22:39:11] apergos: would it be very bad if I do it several hours later? [22:39:38] no, just let mutante know (or maybe folks will manually fix) [22:39:42] or I can make manual fix, what I need to do? [22:39:57] how did it break though if puppet was off [22:39:59] we have at least one other host that will wait until morning [22:40:36] mutante: I don't know what happened on those with puppet off, I just know joe flagged them as weird [22:40:40] mutante: I disabled puppet sometime earlier today [22:40:59] I can check manually if there's a problem, just tell me how [22:41:07] puppet failed-only pass finished [22:41:29] 414 left... [22:43:02] I'm seeing only 6 unhandled left. Comfortable turning notifications back on? [22:43:45] not just yet I think, maybe a few more minutes [22:44:00] nice! [22:44:01] and none of those 6 are NRPE related [22:44:04] in “service problems” there are still some. just going through and re-scheduling next check to speed it up [22:44:45] but actually those seem to be just one host [22:44:49] nevermind! [22:44:55] I think we're in the clear [22:45:12] yeah looks good! [22:45:24] !log einsteinium: setting enable_notifications=1 and reloading icinga [22:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:43] woo hoo! 
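The optimization suggested at 22:08:47 (check the file first and only run puppet where it is still wrong) could look roughly like the sketch below. This is a minimal illustration and not what was run during the incident. The two monitoring addresses come from the diff pasted at 22:01:25; the function names `parse_allowed_hosts` and `needs_puppet_run` are hypothetical, and the parsing assumes a plain `key=value` file in which the last `allowed_hosts` line wins.

```python
# Hypothetical helper: decide whether a host's NRPE config still has the
# broken allowed_hosts value and therefore needs a forced puppet run.

# Addresses taken from the diff pasted in the channel at 22:01:25.
MONITORING_HOSTS = ("208.80.153.74", "208.80.155.119")


def parse_allowed_hosts(config_text):
    """Return the allowed_hosts entries found in NRPE-style config text.

    Assumption: simple key=value lines, '#' starts a comment line, and the
    last allowed_hosts definition is the one that counts.
    """
    hosts = []
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "allowed_hosts":
            hosts = [h.strip() for h in value.split(",") if h.strip()]
    return hosts


def needs_puppet_run(config_text, monitoring_hosts=MONITORING_HOSTS):
    """True when at least one monitoring host is missing from allowed_hosts."""
    allowed = set(parse_allowed_hosts(config_text))
    return not set(monitoring_hosts).issubset(allowed)


if __name__ == "__main__":
    broken = "allowed_hosts=127.0.0.1\n"
    fixed = "allowed_hosts=127.0.0.1,208.80.153.74,208.80.155.119\n"
    print(needs_puppet_run(broken))
    print(needs_puppet_run(fixed))
```

Run against the contents of /etc/nagios/nrpe_local.cfg on each host, a check like this would have limited the forced runs to the hosts _joe_ counted as still wrong, instead of all hosts matching the cumin query.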
[22:46:11] thanks so much [22:46:11] notifications banner is gone [22:46:17] yes, icinga web ui looking good [22:46:41] ircecho still down? [22:46:43] ok, I'm gonna bail, I was mostly just cheerleading anyways [22:46:48] good night y'all [22:46:54] night apergos :) [22:46:55] later apergos! [22:47:26] i will check the special cases now [22:47:52] it’s getting close to bathtime/bedtime for my little one here. ok if I split at this point? [22:48:12] yes, please do. i will handle the list of special ones [22:48:22] cool, have a good night! [22:48:27] later, herron :) [22:48:28] thanks again, you too [22:48:42] mutante: think we're good to re-enable puppet on einsteinium? [22:49:06] yes [22:49:18] !log re-enable puppet on einsteinium [22:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:28] will the bot come back on its own? [22:51:49] SMalyshev: the ones where puppet was disabled earlier today should not need any action, the bad change has been reverted and when re-enabled they should just get both changes [22:51:54] i will still double check though [22:55:20] o/ [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Evening SWAT (Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181003T2300). [23:00:04] niedzielski, James_F, and RoanKattouw: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:24] Heya. [23:01:45] Hey. I'll do the SWAT today [23:01:57] Cool. [23:02:03] Was about to offer. ;-) [23:02:43] Hmm niedzielski is not in this channel [23:02:58] i'm here!
[23:03:02] Oh hi [23:03:09] enick_847: I'll do yours first [23:03:21] (CR) Catrope: [C: 2] smaller wiki Minerva a/b tests are bumped to 100% [mediawiki-config] - https://gerrit.wikimedia.org/r/463875 (https://phabricator.wikimedia.org/T200792) (owner: Jdlrobson) [23:03:33] 👍 [23:05:02] (Merged) jenkins-bot: smaller wiki Minerva a/b tests are bumped to 100% [mediawiki-config] - https://gerrit.wikimedia.org/r/463875 (https://phabricator.wikimedia.org/T200792) (owner: Jdlrobson) [23:07:10] (Abandoned) Dzahn: icinga: disable notifications from einsteinium [puppet] - https://gerrit.wikimedia.org/r/464448 (owner: Dzahn) [23:07:39] SMalyshev: dont worry about it, puppet can be enabled whenever. it won't make a real difference for the issue [23:07:54] it will just apply an unrelated thing that isnt urgent [23:07:54] Operations, DBA, cloud-services-team, wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (jcrespo) a:jcrespo [23:09:19] (CR) jenkins-bot: smaller wiki Minerva a/b tests are bumped to 100% [mediawiki-config] - https://gerrit.wikimedia.org/r/463875 (https://phabricator.wikimedia.org/T200792) (owner: Jdlrobson) [23:09:28] RoanKattouw: what server should i test? mwdebug1002?
[23:09:44] enick_847: mwdebug2001, which may be called mw2017 in your debug tool [23:10:23] thanks RoanKattouw , i see it on mwdebug2001 [23:11:00] Cool, deploying everywhere now [23:12:28] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Bump Minerva A/B test rates to 100% on jawiki, ruwiki, fawiki (T200792) (duration: 00m 56s) [23:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:37] T200792: Run A/B test on page issues (Farsi, Japanese, Russian, English) - https://phabricator.wikimedia.org/T200792 [23:13:43] PROBLEM - Device not healthy -SMART- on bast4001 is CRITICAL: cluster=misc device=sdc instance=bast4001:9100 job=node site=ulsfo https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=bast4001&var-datasource=ulsfo%2520prometheus%252Fops [23:14:26] PROBLEM - LVS HTTP IPv4 on wdqs.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:16:19] thanks RoanKattouw , i'm watching our graphs for page issues and reading depth schemas which expect to increase [23:21:42] SMalyshev: that being said, it seems there might be an unrelated issue with wdqs? 
[23:23:27] !log catrope@deploy1001 Synchronized php-1.32.0-wmf.24/includes/utils/UIDGenerator.php: Make UID clock drift error have more details (T94522) (duration: 00m 58s) [23:23:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:33] T94522: Some requests fail with UIDGenerator error "Process clock is outdated or drifted" - https://phabricator.wikimedia.org/T94522 [23:24:03] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on einsteinium is CRITICAL: 47.04 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [23:25:45] Operations, ops-eqiad, ops-ulsfo, DC-Ops: connect atlas-ulsfo to scs-ulsfo - https://phabricator.wikimedia.org/T206185 (RobH) p:Triage>Normal [23:27:23] Operations, DNS, Traffic, WMF-Communications, and 4 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776 (Varnent) [23:29:04] RoanKattouw: I see about a 2x spike on the page issues schema and a similar increase in quantity of events for reading depth so I believe all is well. Thank you for swatting! [23:29:47] !log catrope@deploy1001 Synchronized php-1.32.0-wmf.23/extensions/PageTriage/: Hide copyvio AFC filter option behind flag (T205918) (duration: 00m 57s) [23:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:52] T205918: [betalabs] NPP: 'Potential issues' do not display 'None' and 'Copyvio' filters - https://phabricator.wikimedia.org/T205918 [23:31:15] RoanKattouw: Ouch, the VE patches took 28 minutes to merge. [23:31:23] Oh did they just finish? [23:31:30] Yeah. [23:31:35] Aha they did [23:31:39] mutante: what kind of issue? [23:31:40] Finally. [23:32:12] SMalyshev: socket timeout for wdqs.svc.codfw.wmnet [23:32:45] James_F: OK, all three are on mwdebug2001 [23:32:56] RoanKattouw: Fun. [23:34:51] Eurgh, so slow.
[23:35:14] James_F: Also sorry for +2ing then -1ing your Special:Preferences OOUI change, I spotted an inline comment my +2 added [23:36:31] Tsk, wrong channel. [23:36:39] RoanKattouw: wmf.23 and .24 both look good to roll. [23:41:18] !log catrope@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/VisualEditor/: Require Parsoid HTML 2.0.0, and handle its