[00:00:04] twentyafterfour: Your horoscope predicts another unfortunate Phabricator update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180927T0000). [00:01:21] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 64 ESP OK [00:01:21] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 64 ESP OK [00:01:23] RECOVERY - IPsec on cp1076 is OK: Strongswan OK - 68 ESP OK [00:01:31] RECOVERY - IPsec on cp1080 is OK: Strongswan OK - 68 ESP OK [00:01:31] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 64 ESP OK [00:01:31] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 52 ESP OK [00:01:32] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 64 ESP OK [00:01:32] RECOVERY - IPsec on cp1084 is OK: Strongswan OK - 68 ESP OK [00:01:32] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 64 ESP OK [00:01:32] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 52 ESP OK [00:01:33] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 64 ESP OK [00:01:33] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 52 ESP OK [00:01:34] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 64 ESP OK [00:01:41] RECOVERY - IPsec on cp1077 is OK: Strongswan OK - 52 ESP OK [00:01:42] RECOVERY - IPsec on cp1083 is OK: Strongswan OK - 52 ESP OK [00:01:42] RECOVERY - IPsec on cp1079 is OK: Strongswan OK - 52 ESP OK [00:01:42] RECOVERY - IPsec on cp1089 is OK: Strongswan OK - 52 ESP OK [00:01:51] RECOVERY - IPsec on cp1078 is OK: Strongswan OK - 68 ESP OK [00:01:52] RECOVERY - IPsec on cp1081 is OK: Strongswan OK - 52 ESP OK [00:02:01] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 52 ESP OK [00:02:01] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 52 ESP OK [00:02:01] RECOVERY - IPsec on cp1090 is OK: Strongswan OK - 68 ESP OK [00:02:02] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 64 ESP OK [00:02:11] RECOVERY - IPsec on cp1087 is OK: Strongswan OK - 52 ESP OK [00:02:11] RECOVERY - IPsec on cp1086 is OK: Strongswan OK - 68 ESP OK [00:02:11] 
RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 64 ESP OK [00:02:11] RECOVERY - IPsec on cp1082 is OK: Strongswan OK - 68 ESP OK [00:02:12] RECOVERY - IPsec on cp1088 is OK: Strongswan OK - 68 ESP OK [00:02:12] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 64 ESP OK [00:02:12] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 52 ESP OK [00:02:21] RECOVERY - IPsec on cp1085 is OK: Strongswan OK - 52 ESP OK [00:02:31] RECOVERY - IPsec on cp1075 is OK: Strongswan OK - 52 ESP OK [00:02:31] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 52 ESP OK [00:02:32] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 52 ESP OK [00:11:04] (03CR) 10Krinkle: "Afaik 'retry-after' doesn't, or shouldn't, relate to job loss (retry exhaustion during read-only). Because as you say, "retrying after X " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463072 (https://phabricator.wikimedia.org/T204154) (owner: 10Mobrovac) [00:12:11] 10Operations, 10Puppet: Puppet agent takes a long time to finish when adding IPv6 addresses - https://phabricator.wikimedia.org/T205577 (10Krenair) [00:14:34] (03PS1) 10Ayounsi: Icinga, update mr1-ulsfo IPs [puppet] - 10https://gerrit.wikimedia.org/r/463170 [00:16:49] (03CR) 10Ayounsi: [C: 032] Icinga, update mr1-ulsfo IPs [puppet] - 10https://gerrit.wikimedia.org/r/463170 (owner: 10Ayounsi) [00:17:44] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Traffic, 10Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281 (10Krinkle) 05Resolved>03Open Broken again. is responding.. (good)... 
[00:19:40] * Krinkle staging on mwdebug2001 [00:21:24] RECOVERY - Juniper alarms on mr1-ulsfo is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [00:21:44] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 40, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:22:40] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.23/extensions/FeaturedFeeds: T205573 (duration: 00m 59s) [00:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:48] T205573: Fatal error possible on Main Pages that use FeaturedFeeds - https://phabricator.wikimedia.org/T205573 [00:24:37] 10Operations, 10ops-ulsfo, 10Traffic, 10netops, 10Patch-For-Review: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552 (10ayounsi) [00:29:34] (03PS1) 10Dzahn: icinga: pass user/group from profile, change to nagios on icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/463175 (https://phabricator.wikimedia.org/T202782) [00:33:49] (03CR) 10Dzahn: "compiler shows no change on einsteinium and the desired change on icinga1001: https://puppet-compiler.wmflabs.org/compiler1002/12647/einst" [puppet] - 10https://gerrit.wikimedia.org/r/463175 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [00:35:47] (03CR) 10Dzahn: [C: 032] "actually using this to switch icinga1001 without touching einsteinium like this: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/4" [puppet] - 10https://gerrit.wikimedia.org/r/462833 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [00:36:10] (03PS2) 10Dzahn: icinga: pass user/group from profile, change to nagios on icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/463175 (https://phabricator.wikimedia.org/T202782) [00:38:43] (03PS3) 10Dzahn: icinga: pass user/group from profile, change to nagios on icinga1001 [puppet] - 
10https://gerrit.wikimedia.org/r/463175 (https://phabricator.wikimedia.org/T202782) [00:39:15] (03CR) 10Dzahn: [C: 032] icinga: pass user/group from profile, change to nagios on icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/463175 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [00:47:09] (03CR) 10Dzahn: [C: 032] "noop on tegmen, noop on einsteinium" [puppet] - 10https://gerrit.wikimedia.org/r/463175 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [00:49:29] (03CR) 10Dzahn: [C: 032] "fixed many errors on icinga1001.. some places remain that use hardcoded "icinga" user that become more obvious now" [puppet] - 10https://gerrit.wikimedia.org/r/463175 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [01:05:55] (03PS1) 10Dzahn: icinga: fix more places with hardcoded user name [puppet] - 10https://gerrit.wikimedia.org/r/463177 (https://phabricator.wikimedia.org/T202782) [01:06:25] (03CR) 10jerkins-bot: [V: 04-1] icinga: fix more places with hardcoded user name [puppet] - 10https://gerrit.wikimedia.org/r/463177 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [01:07:40] (03PS2) 10Dzahn: icinga: fix more places with hardcoded user name [puppet] - 10https://gerrit.wikimedia.org/r/463177 (https://phabricator.wikimedia.org/T202782) [01:14:18] !log repair /dev/sde1 on ms-be0240 - T199198 [01:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:14:26] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [01:14:42] !log repair /dev/sdn1 on ms-be0241 - T199198 [01:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:16:58] (03CR) 10Dzahn: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1002/12649/einsteinium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/463177 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [01:22:10] (03PS3) 10Dzahn: icinga: fix more places with 
hardcoded user name [puppet] - 10https://gerrit.wikimedia.org/r/463177 (https://phabricator.wikimedia.org/T202782) [01:22:40] (03CR) 10jerkins-bot: [V: 04-1] icinga: fix more places with hardcoded user name [puppet] - 10https://gerrit.wikimedia.org/r/463177 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [01:26:01] (03PS4) 10Dzahn: icinga: fix more places with hardcoded user name [puppet] - 10https://gerrit.wikimedia.org/r/463177 (https://phabricator.wikimedia.org/T202782) [01:26:43] (03CR) 10Dzahn: [C: 031] "https://puppet-compiler.wmflabs.org/compiler1002/12650/" [puppet] - 10https://gerrit.wikimedia.org/r/463177 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [01:28:41] (03CR) 10Dzahn: [C: 032] icinga: fix more places with hardcoded user name [puppet] - 10https://gerrit.wikimedia.org/r/463177 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [01:29:34] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:29:53] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:32:30] i am reading the link but i dont see that extra information it talks about in the alert ^ [01:33:44] RECOVERY - Filesystem available is greater than filesystem size on ms-be2041 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops [01:33:48] "CenturyLink Scheduled Maintenance (esams-eqiad link repair) - Philadelphia, PA, USA" [01:33:52] ok [01:34:01] that matches what we see afaict [01:35:26] no, wrong date. 
but there is another scheduled maintenance [01:39:40] (03CR) 10Dzahn: [C: 032] "noop on tegmen/einsteinium" [puppet] - 10https://gerrit.wikimedia.org/r/463177 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [01:45:31] (03CR) 10Dzahn: [C: 032] "fixed a few more errors on icinga1001 but the gift keeps on giving.. more to follow-up.. getting closer though" [puppet] - 10https://gerrit.wikimedia.org/r/463177 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [01:48:53] RECOVERY - Filesystem available is greater than filesystem size on ms-be2040 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops [02:02:52] (03PS1) 10Dzahn: nagios_common: make user/group configurable from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/463180 (https://phabricator.wikimedia.org/T202782) [02:03:27] (03CR) 10jerkins-bot: [V: 04-1] nagios_common: make user/group configurable from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/463180 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [02:05:55] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/463175/ and https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/463177/ fixed ma" [puppet] - 10https://gerrit.wikimedia.org/r/463180 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [02:22:06] 08Warning Alert for device cr2-esams.wikimedia.org - Inbound interface errors [02:35:11] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.22) (duration: 13m 25s) [02:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:41:37] !log starting inplace reindexing of viwiki and commonswiki - T204362 [02:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:41:45] T204362: Resolve elasticsearch shard size alert by doing an in place reindex - https://phabricator.wikimedia.org/T204362 [02:48:59] (03PS2) 
10Mathew.onipe: Switch public cluster to Kafka event source [puppet] - 10https://gerrit.wikimedia.org/r/462907 (https://phabricator.wikimedia.org/T189458) [02:49:17] (03CR) 10Mathew.onipe: Switch public cluster to Kafka event source (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/462907 (https://phabricator.wikimedia.org/T189458) (owner: 10Mathew.onipe) [02:53:49] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.23) (duration: 07m 16s) [02:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:54:36] (03PS4) 10Mathew.onipe: cumin: added aliases for each wdqs clusters [puppet] - 10https://gerrit.wikimedia.org/r/463088 (https://phabricator.wikimedia.org/T205542) [02:54:58] (03CR) 10Mathew.onipe: cumin: added aliases for each wdqs clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463088 (https://phabricator.wikimedia.org/T205542) (owner: 10Mathew.onipe) [03:04:40] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Thu Sep 27 03:04:40 UTC 2018 (duration 10m 51s) [03:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:38:53] (03CR) 10Smalyshev: [C: 031] Switch public cluster to Kafka event source [puppet] - 10https://gerrit.wikimedia.org/r/462907 (https://phabricator.wikimedia.org/T189458) (owner: 10Mathew.onipe) [03:42:06] 08̶W̶a̶r̶n̶i̶n̶g Device cr2-esams.wikimedia.org recovered from Inbound interface errors [04:32:02] 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529 (10Pine) I sent an email to the Education list admins to request that they comment in this ticket in response to the questions above. 
[05:17:51] (03PS1) 10Marostegui: db-codfw.php: Depool db2072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463185 [05:18:55] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463185 (owner: 10Marostegui) [05:19:57] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463185 (owner: 10Marostegui) [05:21:11] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2072 (duration: 01m 00s) [05:21:13] !log Deploy schema change on db2072 [05:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:25:10] 10Operations, 10Scoring-platform-team (Current), 10User-Ladsgroup: Let the ORES application set log severity, not uWSGI - https://phabricator.wikimedia.org/T181546 (10Ladsgroup) a:03Ladsgroup It will be also handled by changes to the logging system of ours. 
[05:26:37] !log Drop wikiuser on dbstore1002 [05:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:02] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463187 [05:32:57] (03CR) 10jenkins-bot: db-codfw.php: Depool db2072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463185 (owner: 10Marostegui) [05:38:23] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463187 (owner: 10Marostegui) [05:39:27] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463187 (owner: 10Marostegui) [05:40:37] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2072 (duration: 00m 57s) [05:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:44] !log Deploy schema change on s1 eqiad master, lag will be generated - T203709 [05:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:52] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 [05:42:40] (03PS4) 10Giuseppe Lavagetto: service: fix spec for debian 9+ [puppet] - 10https://gerrit.wikimedia.org/r/458495 (https://phabricator.wikimedia.org/T203645) [05:49:39] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2072" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463187 (owner: 10Marostegui) [06:01:33] (03PS3) 10Smalyshev: Enable phrase search config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462351 (https://phabricator.wikimedia.org/T163642) [06:02:31] (03CR) 10Smalyshev: "The code has been deployed so we can enable this by default." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/462351 (https://phabricator.wikimedia.org/T163642) (owner: 10Smalyshev) [06:07:15] (03CR) 10Giuseppe Lavagetto: [C: 032] service: fix spec for debian 9+ [puppet] - 10https://gerrit.wikimedia.org/r/458495 (https://phabricator.wikimedia.org/T203645) (owner: 10Giuseppe Lavagetto) [06:10:23] (03PS8) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: convert wikivoyage.org [puppet] - 10https://gerrit.wikimedia.org/r/461976 (https://phabricator.wikimedia.org/T196968) [06:10:46] (03CR) 10Volans: "LGTM, just minor stylish nitpicks inline." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/462791 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [06:11:05] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web::prod_sites: convert wikivoyage.org [puppet] - 10https://gerrit.wikimedia.org/r/461976 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [06:17:33] !log Deploy schema change on dbstore2002:3311 [06:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:03] <_joe_> sigh [06:19:23] (03PS1) 10Marostegui: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463192 [06:19:25] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::vhost: fix function call [puppet] - 10https://gerrit.wikimedia.org/r/462853 [06:19:45] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] mediawiki::web::vhost: fix function call [puppet] - 10https://gerrit.wikimedia.org/r/462853 (owner: 10Giuseppe Lavagetto) [06:19:53] PROBLEM - puppet last run on mwdebug2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:20:01] <_joe_> that's me ^^ [06:20:19] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463192 (owner: 10Marostegui) [06:23:44] (03PS2) 10Marostegui: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463192 [06:25:09] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463192 (owner: 10Marostegui) [06:26:11] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463192 (owner: 10Marostegui) [06:27:24] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1089 (duration: 00m 56s) [06:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:43] PROBLEM - puppet last run on rdb1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ferm/conf.d/00_main] [06:28:53] (03CR) 10Volans: [C: 04-1] "Thanks for taking care of this! There few additional things that should be done in the migrations, see comments inline." (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463133 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [06:29:54] RECOVERY - puppet last run on mwdebug2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:32:24] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. 
Failed resources (up to 3 shown): File[/etc/sudoers],File[/usr/local/share/ca-certificates/RapidSSL_SHA256_CA_-_G3.crt] [06:34:20] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1089 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463192 (owner: 10Marostegui) [06:46:55] (03CR) 10Volans: "See inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463112 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [06:54:43] (03PS18) 10Jcrespo: backups: Setup new backup check for all sections on zarcillo db [puppet] - 10https://gerrit.wikimedia.org/r/461679 (https://phabricator.wikimedia.org/T203969) [06:54:45] (03PS3) 10Jcrespo: mariadb backup monitoring: Add size checks [puppet] - 10https://gerrit.wikimedia.org/r/462724 (https://phabricator.wikimedia.org/T203969) [06:54:47] (03PS3) 10Jcrespo: mariadb: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/450314 (owner: 10Dzahn) [06:55:20] (03CR) 10jerkins-bot: [V: 04-1] backups: Setup new backup check for all sections on zarcillo db [puppet] - 10https://gerrit.wikimedia.org/r/461679 (https://phabricator.wikimedia.org/T203969) (owner: 10Jcrespo) [06:57:45] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:59:13] RECOVERY - puppet last run on rdb1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:00:10] !log Stop replication in sync on db1089 and dbstore1002:s1 [07:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:07] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463200 [07:04:49] \o all [07:04:55] !low rebooting mw1312-mw1329 for kernel security update [07:05:07] * addshore is going to actually perform the PC purge for wikidatawiki that he was discussing on the ops mailing list today [07:05:23] addshore: You are not adding any new key, right? 
[07:05:27] marostegui: nope [07:05:41] now using a hook! and it will happen in stages [07:05:54] !log disabling puppet on all databases [07:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:29] addshore: I am asking because I was remembering this: https://phabricator.wikimedia.org/T167784 [07:06:32] so we will make wikidatawiki reject parser cache values for pages that were cached before the 15th of Sept, and then afterwards change that time to the 19th, and that will be where we need to get to [07:06:44] (03PS1) 10Tarrow: RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463202 (https://phabricator.wikimedia.org/T205330) [07:06:47] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463200 (owner: 10Marostegui) [07:07:44] (03CR) 10Tarrow: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463202 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [07:07:54] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463200 (owner: 10Marostegui) [07:07:55] marostegui: hmmmmmm looking at ParserCache in mediawiki it won't send a purge request to the PC .... 
[07:08:22] (03PS4) 10Jcrespo: mariadb: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/450314 (owner: 10Dzahn) [07:08:58] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1089 (duration: 00m 55s) [07:09:03] addshore: we were told that the pc handles invalidations automatically because it checks that they are really valid rather than using them blindly [07:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:16] purging happens after a month or so from the maintenance hosts [07:09:27] aaah yes, i guess on a PC miss, it is going to insert a new value moments after [07:10:04] so disk shouldn't be an issue with this way of clearing the cache [07:10:43] <_joe_> addshore: actually, if you do a lot of inserts and a lot of deletes disk usage might explode, depending on the storage engine mysql uses [07:10:57] <_joe_> or if you have replication set up, and which kind [07:11:00] ack [07:11:14] <_joe_> so I'd check the assumption with jynus and marostegui [07:11:24] addshore: good, I was concerned about the whole process and how it can impact the disk usage [07:12:06] 08Warning Alert for device cr2-esams.wikimedia.org - Inbound interface errors [07:12:17] ack, so we will be "purging" cached values between the 5th and 15th, so 10 days' worth of parser cache initially [07:12:40] and then afterwards want to "purge" also from the 15th to the 19th, so 4 more days, but more recent days so there will likely be more entries there [07:13:15] how fast? [07:13:17] at any point if we decide disk or anything else is going south, then we can just revert the hook and the "purge" will stop, as it isn't really a purge, just telling mediawiki not to use the cache entries when it checks for them [07:13:17] addshore: is there a way to purge, let's say, days 5th and 6th and stop, so we can evaluate the impact? 
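The staged "purge" described above can be sketched roughly as follows. This is only an illustration in Python, not the MediaWiki hook or the content of the Gerrit patches: `CacheEntry`, `should_reject`, `count_rejected` and the sample timestamps are all hypothetical. It assumes the check is a plain comparison of an entry's cache time against a cutoff, so moving the cutoff forward widens the set of rejected entries, and nothing is deleted up front.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class CacheEntry:
    """Hypothetical stand-in for one parser cache entry."""
    key: str
    cached_at: datetime


def should_reject(entry: CacheEntry, cutoff: datetime) -> bool:
    """Reject (ignore) an entry that was cached before the cutoff."""
    return entry.cached_at < cutoff


def count_rejected(entries, cutoff: datetime) -> int:
    """Count how many entries a given cutoff would cause to be ignored."""
    return sum(1 for entry in entries if should_reject(entry, cutoff))


def utc(year: int, month: int, day: int, hour: int = 0) -> datetime:
    """Small helper to build timezone-aware UTC timestamps."""
    return datetime(year, month, day, hour, tzinfo=timezone.utc)


# Made-up sample entries; real entries and their volumes are unknown here.
entries = [
    CacheEntry("page-a", utc(2018, 9, 5, 12)),
    CacheEntry("page-b", utc(2018, 9, 7)),
    CacheEntry("page-c", utc(2018, 9, 10)),
    CacheEntry("page-d", utc(2018, 9, 14)),
    CacheEntry("page-e", utc(2018, 9, 20)),
]

# Cutoffs advanced in stages, mirroring the dates named in the patches.
stages = [utc(2018, 9, 6), utc(2018, 9, 8), utc(2018, 9, 12)]

for cutoff in stages:
    print(cutoff.date(), count_rejected(entries, cutoff))
```

Because each stage only changes the cutoff, stopping is a matter of not advancing it further (or reverting it), which matches the point that this is not a real purge.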
[07:13:24] marostegui: yes [07:13:44] we can start with 1 day :0 [07:13:45] :) [07:14:34] addshore: I would prefer if we start with 1 day, and then check graphs [07:14:38] marostegui: can do [07:15:33] anybody checking that cr2-esams alert? [07:15:56] marostegui: jynus and I'm going to write some docs about this afterward :) [07:16:34] addshore: sounds good, please let us know before you are ready to start, I want to grab some numbers so we can check the before and the after 1 day purge [07:16:37] jynus: It's hard to say as apparently we have no idea how many entries we will actually be ignoring, but in a single day there shouldn't be that many [07:16:43] marostegui: will do! [07:16:54] it would be nice if the PC was actually split per wiki, that would make this much easier [07:17:13] then we could query and see exactly how many cache entries we would be invalidating before doing it [07:17:40] (03PS1) 10Tarrow: RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463203 (https://phabricator.wikimedia.org/T205330) [07:18:04] PROBLEM - Filesystem available is greater than filesystem size on ms-be2042 is CRITICAL: cluster=swift device=/dev/sdh1 fstype=xfs instance=ms-be2042:9100 job=node mountpoint=/srv/swift-storage/sdh1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2042&var-datasource=codfw%2520prometheus%252Fops [07:18:26] (03CR) 10Tarrow: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/463203 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [07:18:44] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1089" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463200 (owner: 10Marostegui) [07:19:54] (03PS2) 10Addshore: RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463203 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [07:19:59] ^^ thats the patch [07:20:22] (03CR) 10Addshore: [C: 032] RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463203 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [07:21:21] we should be able to monitor the "rejected" values on https://grafana.wikimedia.org/dashboard/db/parser-cache?refresh=5m&orgId=1&from=now-6h&to=now [07:21:30] and see how many things are actually being rejected [07:21:47] (03Merged) 10jenkins-bot: RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463203 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [07:21:50] addshore: I want to see the impact of that on disk and on binlogs [07:22:19] addshore: log the maintenance on the Week of: section of deployments [07:22:29] will do! [07:22:31] (03CR) 10Jcrespo: [C: 032] mariadb: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/450314 (owner: 10Dzahn) [07:23:19] done https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=1804358&oldid=1804351 [07:23:32] will check it out on a mwdebug server now and test [07:23:45] addshore: ok, I am gathering some data before you start [07:23:57] ack! 
[07:24:04] marostegui: I'll wait for your ping to sync the patch [07:24:44] yep, give me some minutes [07:26:39] !log enabling puppet on all core eqiad hosts [07:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:07] !log test formatting sdc on ms-be1040 with crc=0 - T199198 [07:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:15] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [07:27:36] addshore: go for the first day [07:27:41] marostegui: okay! [07:28:30] syncing [07:29:22] !log addshore@deploy1001 Synchronized wmf-config/CommonSettings.php: [[gerrit:463203|RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-06]] (duration: 00m 57s) [07:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:32] marostegui: ^^ done [07:29:33] marostegui: the good news is that if something bad happens- we could optimize the eqiad dbs just easier [07:29:41] jynus: true XD [07:29:51] so note it isn't an instance purge, it will only reject the cache entries as requests try to access them [07:29:56] *instant [07:30:04] addshore: how long should we wait then? [07:30:06] yes, but it will populate more keys [07:30:20] which is the issue, not the purging :-) [07:30:46] !log test formatting sdd on ms-be1040 with crc=0 - T199198 [07:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:46] I'm still slightly unclear about that, it should be using the same key, so it should overwrite the old value in the PC? 
[07:33:33] (03CR) 10jenkins-bot: RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-06 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463203 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [07:34:13] jynus: marostegui we have rejected 587 keys so far [07:34:41] addshore: yeah, not seeing any impact on disk/binlogs [07:35:04] 721 total [07:35:33] i imagine the closer we get to the current time the more keys will be included in each day [07:36:00] shall we go for 2 more days? [07:36:21] addshore: yeah [07:36:43] we will make a patch for the 8th and deploy it as soon as it is up [07:37:06] 08̶W̶a̶r̶n̶i̶n̶g Device cr2-esams.wikimedia.org recovered from Inbound interface errors [07:39:48] (03PS1) 10Tarrow: RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-08 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463207 (https://phabricator.wikimedia.org/T205330) [07:40:29] (03CR) 10Addshore: [C: 032] RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-08 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463207 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [07:42:37] (03CR) 10Addshore: [C: 032] RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-08 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463207 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [07:43:40] (03Merged) 10jenkins-bot: RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-08 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463207 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [07:44:18] syncing [07:44:44] (03PS1) 10Tarrow: RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463208 (https://phabricator.wikimedia.org/T205330) [07:45:06] currently at a rate of 150 keys per minute with 1 day's worth being rejected [07:45:11] !log addshore@deploy1001 Synchronized wmf-config/CommonSettings.php: RejectParserCacheValue Hook 
to purge wikidatawiki to 2018-09-08 (duration: 00m 55s) [07:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:42] 10Operations, 10netops: Intermittent connectivity issues in eqiad's row C - https://phabricator.wikimedia.org/T201139 (10jcrespo) As an update to T201139#4483590 es1014 continues to show strange network patterns- I only see them at app layer, so take this with a grain of salt, but aside from continuing being u... [07:47:43] marostegui: and the rate has increased to roughly 600 per minute with 3 days' worth now being rejected [07:48:15] (03CR) 10jenkins-bot: RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-08 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463207 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [07:49:12] marostegui: how do you feel about rejecting another 4 days' worth? :) [07:50:13] addshore: one sec, I am checking [07:50:19] ack [07:51:56] 10Operations, 10netops: Enable cumin1001 in router ACLs - https://phabricator.wikimedia.org/T205513 (10MoritzMuehlenhoff) p:05Triage>03Normal [07:52:01] addshore: go for it [07:52:14] (03CR) 10Addshore: [C: 032] RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463208 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [07:53:06] 10Operations, 10Wikimedia-Mailing-lists: Open Foundation West Africa (OFWA) mailing list - https://phabricator.wikimedia.org/T203966 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff List has been created [07:53:17] (03Merged) 10jenkins-bot: RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463208 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [07:53:48] !log enabling puppet back on all db hosts [07:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:23] !log addshore@deploy1001 sync-file aborted: 
RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-08 (duration: 00m 01s) [07:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:24] !log addshore@deploy1001 Synchronized wmf-config/CommonSettings.php: RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-12 (duration: 00m 55s) [07:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:24] marostegui: it looks like the rate of rejected keys is now roughly 1000 per minute [07:57:22] 1500 per minute :) [07:57:47] let's let it settle before going for more [07:57:52] marostegui: okay! [07:58:10] !log rebooting db1108 for kernel & mariadb upgrade (T205288) [07:58:13] I have a meeting for the next hour, but will keep watching the graphs and in here and I'll ping you a little later to do some more rejecting :) [07:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:18] T205288: Maintenance M4 cluster - https://phabricator.wikimedia.org/T205288 [07:58:25] jynus: I've seen you enabled puppet; is it free to go? 
[07:58:26] addshore: sounds good [08:00:08] (03PS2) 10D3r1ck01: Enable Page-previews on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463119 (https://phabricator.wikimedia.org/T203981) [08:00:54] (03CR) 10jerkins-bot: [V: 04-1] Enable Page-previews on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463119 (https://phabricator.wikimedia.org/T203981) (owner: 10D3r1ck01) [08:01:45] banyek: sure, I said to ping me, but it didn't block anything [08:02:08] it was only in case that if I run enable and the server was down, I would need to make sure to run it on the failed hosts again [08:02:36] ah, ok [08:02:39] til [08:03:07] (03CR) 10jenkins-bot: RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463208 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [08:04:07] (03CR) 10DCausse: [C: 031] Enable phrase search config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462351 (https://phabricator.wikimedia.org/T163642) (owner: 10Smalyshev) [08:10:29] (03PS3) 10D3r1ck01: Enable Page-previews on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463119 (https://phabricator.wikimedia.org/T203981) [08:14:14] (03CR) 10Muehlenhoff: [C: 031] "Looks good, thanks! I'm merging the change." 
[puppet] - 10https://gerrit.wikimedia.org/r/463088 (https://phabricator.wikimedia.org/T205542) (owner: 10Mathew.onipe) [08:14:20] (03PS5) 10Muehlenhoff: cumin: added aliases for each wdqs clusters [puppet] - 10https://gerrit.wikimedia.org/r/463088 (https://phabricator.wikimedia.org/T205542) (owner: 10Mathew.onipe) [08:16:35] (03CR) 10Muehlenhoff: [C: 032] cumin: added aliases for each wdqs clusters [puppet] - 10https://gerrit.wikimedia.org/r/463088 (https://phabricator.wikimedia.org/T205542) (owner: 10Mathew.onipe) [08:18:32] (03PS4) 10Muehlenhoff: Print group memberships which granted Hadoop access to check for HDFS cleanups [puppet] - 10https://gerrit.wikimedia.org/r/459558 (https://phabricator.wikimedia.org/T200312) [08:19:15] addshore: for when you are back, I think we can purge a bit more [08:19:37] marostegui: okay ! :) [08:20:48] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/454478 (https://phabricator.wikimedia.org/T201140) (owner: 10Giuseppe Lavagetto) [08:26:18] (03PS11) 10Giuseppe Lavagetto: php: add service management for php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/454478 (https://phabricator.wikimedia.org/T201140) [08:26:22] <_joe_> moritzm: I'm merging it then [08:26:29] sounds good! [08:26:36] 10Operations, 10media-storage, 10Patch-For-Review, 10User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10fgiunchedi) Since yesterday we've seen reoccurence of this bug on previously-fixed filesystems (i.e. sdd on ms-be2040) thus I've started... 
[08:26:57] !low rebooting mw1330-mw1333, mw1339-mw1348 for kernel security updates [08:28:22] (03CR) 10Giuseppe Lavagetto: [C: 032] php: add service management for php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/454478 (https://phabricator.wikimedia.org/T201140) (owner: 10Giuseppe Lavagetto) [08:29:02] (03CR) 10Jcrespo: [C: 032] mariadb backup monitoring: Add size checks [puppet] - 10https://gerrit.wikimedia.org/r/462724 (https://phabricator.wikimedia.org/T203969) (owner: 10Jcrespo) [08:29:11] (03PS4) 10Jcrespo: mariadb backup monitoring: Add size checks [puppet] - 10https://gerrit.wikimedia.org/r/462724 (https://phabricator.wikimedia.org/T203969) [08:30:16] !log upgrading db1104 (kernel-mariadb) and rebooting it (T205514) [08:30:18] (03PS19) 10Jcrespo: backups: Setup new backup check for all sections on zarcillo db [puppet] - 10https://gerrit.wikimedia.org/r/461679 (https://phabricator.wikimedia.org/T203969) [08:30:22] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review, 10User-Banyek: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10Banyek) The donor host will be db1104 for recloning, and I update that first [08:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:25] T205514: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 [08:30:56] (03CR) 10jerkins-bot: [V: 04-1] backups: Setup new backup check for all sections on zarcillo db [puppet] - 10https://gerrit.wikimedia.org/r/461679 (https://phabricator.wikimedia.org/T203969) (owner: 10Jcrespo) [08:31:06] (03CR) 10Jcrespo: [V: 032] backups: Setup new backup check for all sections on zarcillo db [puppet] - 10https://gerrit.wikimedia.org/r/461679 (https://phabricator.wikimedia.org/T203969) (owner: 10Jcrespo) [08:31:13] (03CR) 10Jcrespo: [V: 032 C: 032] backups: Setup new backup check for all sections on zarcillo db [puppet] - 10https://gerrit.wikimedia.org/r/461679 (https://phabricator.wikimedia.org/T203969) (owner: 10Jcrespo) 
[08:39:32] !log Deploy schema change on labtestwiki - T203709 [08:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:40] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 [08:48:49] !log Deploy schema change on labswiki (db1073 master) - T203709 [08:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:57] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 [08:52:51] (03CR) 10Mobrovac: "The difference stems from the what change-prop is implemented. Originally, we thought to effectively pause the CP workers during the read-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463072 (https://phabricator.wikimedia.org/T204154) (owner: 10Mobrovac) [08:54:06] !log test formatting sde on ms-be1040 with crc=0 - T199198 [08:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:14] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [08:56:41] !log stopping replocication & mariadb on db1104 and db1092 as db1092 is getting recloned from db1104 (T205514) [08:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:49] T205514: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 [09:01:36] (03Abandoned) 10Filippo Giunchedi: base: link documentation for negative disk space available reported #2 [puppet] - 10https://gerrit.wikimedia.org/r/463071 (https://phabricator.wikimedia.org/T199198) (owner: 10Filippo Giunchedi) [09:01:42] PROBLEM - MariaDB Slave IO: s8 on db1092 is CRITICAL: CRITICAL slave_io_state could not connect [09:02:12] PROBLEM - MariaDB Slave SQL: s8 on db1092 is CRITICAL: CRITICAL slave_sql_state could not connect [09:02:27] These was me [09:02:35] jouncebot: now [09:02:39] No deployments scheduled for the next 1 hour(s) and 57 minute(s) [09:02:45] (03CR) 
10Filippo Giunchedi: [C: 031] Introduce cumin::selector dummy class [puppet] - 10https://gerrit.wikimedia.org/r/462810 (https://phabricator.wikimedia.org/T204088) (owner: 10Alex Monk) [09:03:22] RECOVERY - MariaDB Slave SQL: s8 on db1092 is OK: OK slave_sql_state Slave_SQL_Running: Yes [09:03:48] marostegui: I'll add another 3 days to the rejected keys? :) [09:03:52] RECOVERY - MariaDB Slave IO: s8 on db1092 is OK: OK slave_io_state Slave_IO_Running: Yes [09:03:58] addshore: go for it [09:04:27] (03Abandoned) 10Addshore: Set wikidatawiki CacheEpoch to 2018-09-15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462867 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [09:04:27] !log rebooting ores1* for kernel security updates [09:04:29] (03Abandoned) 10Addshore: Set wikidatawiki CacheEpoch to 2018-09-10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462866 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [09:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:36] (03Abandoned) 10Addshore: Set wikidatawiki CacheEpoch to 2018-09-03 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462865 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [09:04:49] (03Abandoned) 10Addshore: Invalidate wikidatawiki cache with wgCacheEpoch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462728 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [09:05:05] tarrow: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/463202/ has a merge conflict [09:05:10] can you push a new version? [09:07:06] 08Warning Alert for device cr2-esams.wikimedia.org - Inbound interface errors [09:08:27] addshore: did you already do the 12th? [09:08:49] tarrow: i dont see it https://gerrit.wikimedia.org/r/#/q/owner:Tarrow+status:open ? [09:09:00] yes, the 12th went out before our meeting [09:09:02] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::vhost: generalize the upload_rewrite mechanism. 
[puppet] - 10https://gerrit.wikimedia.org/r/463217 [09:09:05] 15th is next [09:09:30] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/463208 [09:09:33] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::web::vhost: generalize the upload_rewrite mechanism. [puppet] - 10https://gerrit.wikimedia.org/r/463217 (owner: 10Giuseppe Lavagetto) [09:09:34] ah [09:09:52] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [09:11:48] (03PS2) 10Tarrow: RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463202 (https://phabricator.wikimedia.org/T205330) [09:12:11] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [09:12:30] (03CR) 10Addshore: [C: 032] RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463202 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [09:13:33] (03Merged) 10jenkins-bot: RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463202 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [09:14:19] syncing [09:15:13] !log addshore@deploy1001 Synchronized wmf-config/CommonSettings.php: RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-15 (duration: 00m 57s) [09:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:33] * addshore watches the rejection rate [09:16:11] (03CR) 10jenkins-bot: RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463202 (https://phabricator.wikimedia.org/T205330) (owner: 
10Tarrow) [09:16:18] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::vhost: generalize the upload_rewrite mechanism. [puppet] - 10https://gerrit.wikimedia.org/r/463217 [09:17:21] rejected are still only around 20% of expired :) [09:17:32] tarrow: do you want to make the last patch too ready? [09:17:34] [= [09:17:54] to 19th at 2000 HRS? [09:18:09] yup, or whatever we decided it was! [= [09:18:28] that will then get us up to date! did you want one more chunk? [09:20:09] let me check first [09:20:12] marostegui: ack [09:20:59] Sorry to slow down the process, but I don't want any issues like T167784 [09:21:00] T167784: WMF ParserCache disk space exhaustion - https://phabricator.wikimedia.org/T167784 [09:21:10] marostegui: totally understandable :) [09:23:04] Looks like we're still only causing 5% of the current cache misses :) [09:23:14] 10Operations, 10netops: Level 3 link between cr2-eqiad and cr2-esams is down/flapping - https://phabricator.wikimedia.org/T205609 (10mark) [09:23:15] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/12654/mw1261.eqiad.wmnet/ seems ok, but I'd like another pair of eyes to look at it." [puppet] - 10https://gerrit.wikimedia.org/r/463217 (owner: 10Giuseppe Lavagetto) [09:23:36] yeh, the rate didn't actually increase that much with the last bump [09:23:36] (03PS1) 10Banyek: mariadb: depool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463219 (https://phabricator.wikimedia.org/T205514) [09:23:37] addshore: I think we can go for more [09:23:45] marostegui: amazing [09:23:56] tarrow: patch please [= [09:24:01] straight to target date then? 2018-09-19 20:00:00? [09:24:13] yup [09:24:14] 10Operations, 10netops: Level 3 link between cr2-eqiad and cr2-esams is down/flapping - https://phabricator.wikimedia.org/T205609 (10Vgutierrez) Link details: cr2-esams:xe-0/1/3 - Transport: cr2-eqiad:xe-4/1/3 (Level3, BDFS2448, 84ms) [09:24:20] woohoo! 
[09:24:22] Thats only a bump of ~ 4 days [09:25:11] (03CR) 10Marostegui: [C: 04-1] mariadb: depool db1104 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463219 (https://phabricator.wikimedia.org/T205514) (owner: 10Banyek) [09:27:36] (03PS1) 10Tarrow: RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463221 (https://phabricator.wikimedia.org/T205330) [09:27:56] (03CR) 10Tarrow: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463221 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [09:28:05] addshore: ^^ [09:29:37] (03PS2) 10Banyek: mariadb: depool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463219 (https://phabricator.wikimedia.org/T205514) [09:30:09] 10Operations, 10netops: Level 3 link between cr2-eqiad and cr2-esams is down/flapping - https://phabricator.wikimedia.org/T205609 (10Vgutierrez) CenturyLink Ticket #: 15243142 [09:31:33] (03CR) 10Addshore: [C: 032] RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463221 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [09:32:42] (03Merged) 10jenkins-bot: RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463221 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [09:33:55] (03CR) 10Marostegui: mariadb: depool db1104 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463219 (https://phabricator.wikimedia.org/T205514) (owner: 10Banyek) [09:34:07] syncing [09:35:01] !log addshore@deploy1001 Synchronized wmf-config/CommonSettings.php: RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-19 (duration: 00m 56s) [09:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:36] (03PS3) 10Banyek: mariadb: depool db1104 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/463219 (https://phabricator.wikimedia.org/T205514) [09:35:36] marostegui: that is as far as we need to go :) [09:35:42] I will check [09:37:06] 08̶W̶a̶r̶n̶i̶n̶g Device cr2-esams.wikimedia.org recovered from Inbound interface errors [09:37:57] marostegui: it looks like the rejection rate might have doubled for the first minute, 5k per minute [09:38:05] checking its impact [09:40:45] tarrow: so, things that are varnish cached of course for anon users may still have the issue, but logged in users will have it all fixed now :) [09:42:02] Woo! I'll write on the tix [09:42:44] tarrow: i realise i should have tagged the SQL entries with the ticket, but forgot, but at least the patches were tagged with it.... [09:44:03] marostegui: how's it looking? [09:44:21] so far so good [09:44:24] lovely [09:44:27] addshore: you need to do more? [09:44:30] the rate shouldn't change much from now on [09:44:32] no, all done [09:44:36] ah cool [09:44:38] it should bounce around a bit but slowly tail off [09:45:25] addshore: which SQL entries? 
[09:45:31] (03CR) 10jenkins-bot: RejectParserCacheValue Hook to purge wikidatawiki to 2018-09-19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463221 (https://phabricator.wikimedia.org/T205330) (owner: 10Tarrow) [09:45:33] (03CR) 10Marostegui: mariadb: depool db1104 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463219 (https://phabricator.wikimedia.org/T205514) (owner: 10Banyek) [09:45:39] tarrow: /SQL/SAL/ [09:47:23] (03PS4) 10Banyek: mariadb: depool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463219 (https://phabricator.wikimedia.org/T205514) [09:47:52] (03CR) 10Marostegui: [C: 031] mariadb: depool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463219 (https://phabricator.wikimedia.org/T205514) (owner: 10Banyek) [09:49:07] jouncebot: next [09:49:07] In 1 hour(s) and 10 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180927T1100) [09:49:09] jouncebot: next [09:49:09] In 1 hour(s) and 10 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180927T1100) [09:49:10] banyek: ^ [09:49:11] :) [09:49:15] heh [09:49:27] jinx [09:50:36] (03CR) 10Banyek: [C: 032] mariadb: depool db1104 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463219 (https://phabricator.wikimedia.org/T205514) (owner: 10Banyek) [09:51:25] (03CR) 10Banyek: [V: 032 C: 032] mariadb: depool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463219 (https://phabricator.wikimedia.org/T205514) (owner: 10Banyek) [09:51:38] (03Merged) 10jenkins-bot: mariadb: depool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463219 (https://phabricator.wikimedia.org/T205514) (owner: 10Banyek) [09:54:43] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T205514: depooling db1104, adding db1109 as temproray api host for s8 (duration: 00m 56s) [09:54:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:52] T205514: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 [10:00:23] (03CR) 10jenkins-bot: mariadb: depool db1104 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463219 (https://phabricator.wikimedia.org/T205514) (owner: 10Banyek) [10:02:24] and tarrow the varnished cached pages would expire 14 days after the deployment that caused an issue, if they aren't purged for another reason before hand [10:04:39] addshore: since, you're around, what the status of wgCacheEpoch? [10:07:09] (03PS1) 10Urbanecm: Add some namespaces aliases for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463227 (https://phabricator.wikimedia.org/T201675) [10:25:53] (03PS1) 10Jcrespo: mariadb backups monitoring: Make icinga messages more human friendly [puppet] - 10https://gerrit.wikimedia.org/r/463228 (https://phabricator.wikimedia.org/T203969) [10:25:55] zeljkof: all done [10:26:08] didnt use wgCacheEpoch in the end, we have a hook at the bottom of CommonSettings [10:26:15] I'll write a followup mail to ops at some point! [10:26:23] addshore: nothing broken? ;P [10:26:24] (03CR) 10jerkins-bot: [V: 04-1] mariadb backups monitoring: Make icinga messages more human friendly [puppet] - 10https://gerrit.wikimedia.org/r/463228 (https://phabricator.wikimedia.org/T203969) (owner: 10Jcrespo) [10:26:32] asking for a friend that runs train this week ;) [10:26:57] nothing broken [10:27:06] and the PC rejection rate is lovely and pretty low https://usercontent.irccloud-cdn.com/file/2FaCV34b/image.png [10:27:23] All good from my front too [10:27:37] (03PS1) 10Muehlenhoff: Remove two rsyncd modules on stat hosts [puppet] - 10https://gerrit.wikimedia.org/r/463229 [10:27:41] (03CR) 10Elukey: [C: 031] "Mostly very minor nits, added a comment just in case." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463217 (owner: 10Giuseppe Lavagetto) [10:28:34] (03CR) 10Elukey: [C: 031] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/463229 (owner: 10Muehlenhoff) [10:29:37] marostegui: it seems like this is a pretty solid way to approach purging the cache for wikidatawiki in the future when we need to [10:30:00] I'm going to add a comment next to the wgCacheEpoch in the config saying perhaps use this hook instead if you're thinking about bumping the epoch [10:30:20] it would be nice to also be able to purge the varnish cache in a similar way, but that might require some more work [10:30:21] addshore: yeah, better to do it slowly and double checking, at least for the first iterations [10:32:06] 08Warning Alert for device cr2-esams.wikimedia.org - Inbound interface errors [10:32:38] (03CR) 10Muehlenhoff: [C: 032] Remove two rsyncd modules on stat hosts [puppet] - 10https://gerrit.wikimedia.org/r/463229 (owner: 10Muehlenhoff) [10:32:57] addshore: thanks, train always makes me nervous, just checking :) [10:33:18] (03CR) 10Marostegui: "just a small typo, other than that, let's go with this approach!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463228 (https://phabricator.wikimedia.org/T203969) (owner: 10Jcrespo) [10:41:37] !log stop icinga-wm temporarily [10:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:11] (03PS1) 10Addshore: Add some comments about wikidata parser cache purging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463231 [10:45:59] (03CR) 10Addshore: [C: 032] Add some comments about wikidata parser cache purging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463231 (owner: 10Addshore) [10:46:04] (03CR) 10Jcrespo: mariadb backups monitoring: Make icinga messages more human friendly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463228 (https://phabricator.wikimedia.org/T203969) (owner: 10Jcrespo) [10:47:03] (03PS2) 10Jcrespo: mariadb backups monitoring: Make icinga messages more human friendly [puppet] - 10https://gerrit.wikimedia.org/r/463228 (https://phabricator.wikimedia.org/T203969) [10:47:13] (03Merged) 10jenkins-bot: Add some comments about wikidata parser cache purging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463231 (owner: 10Addshore) [10:47:41] (03CR) 10jerkins-bot: [V: 04-1] mariadb backups monitoring: Make icinga messages more human friendly [puppet] - 10https://gerrit.wikimedia.org/r/463228 (https://phabricator.wikimedia.org/T203969) (owner: 10Jcrespo) [10:48:34] !log addshore@deploy1001 Synchronized wmf-config/CommonSettings.php: NOOP phpdoc comment changes pt1/2 (duration: 00m 56s) [10:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:02] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: NOOP phpdoc comment changes pt2/2 (duration: 00m 56s) [10:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:39] (03CR) 10jenkins-bot: Add some comments about wikidata parser cache purging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463231 (owner: 
10Addshore) [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180927T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. [11:00:22] hmm? I've patch for SWAT! [11:00:52] zeljkof: Are you in for SWAT today? [11:02:30] kart_: I am! [11:02:36] did you add it to the calendar? [11:02:41] zeljkof: yes. [11:02:59] zeljkof: but wait till few min. Jenkins still running for cherry-pick. [11:04:07] kart_: I see it [11:04:12] ok, I'll get ready for deployment [11:06:54] kart_: the patch in master has a -1 from you? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/462878 [11:06:59] that's been fixed? [11:07:33] zeljkof: yes. Fixed. No one else can reproduce it :) [11:08:02] And, I hope it will stay that way! ;) [11:08:31] 18 min and Jenkins still running :/ [11:09:56] kart_: I'll +2 the patch as soon as I'm ready, I will take a while to merge too [11:10:15] zeljkof: merging is quicker. [11:10:26] I don't think so, it probably runs the same jobs [11:10:36] and if it's slow, it's slow [11:10:50] :~ [11:11:51] ok, +2d, jenkins will vote -1 anyway if there is a problem [11:12:46] test pipeline (after creating the patch) runs 3 jobs, gate-and-submit-swat pipeline (after +2) runs 5 jobs :D [11:13:08] the same 3 as test pipeline, and two additional [11:14:09] OK. we're good at first now. [11:18:14] (03PS4) 10D3r1ck01: Enable Page-previews on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463119 (https://phabricator.wikimedia.org/T203981) [11:36:04] kart_: please stand by, almost merged... [11:36:11] * zeljkof is holding breath [11:36:19] :) [11:36:28] kart_: merged! 
[11:36:36] (03PS3) 10Jcrespo: mariadb backups monitoring: Make icinga messages more human friendly [puppet] - 10https://gerrit.wikimedia.org/r/463228 (https://phabricator.wikimedia.org/T203969) [11:36:41] ok, I'll get it at debug server, and ping you [11:37:05] (03CR) 10jerkins-bot: [V: 04-1] mariadb backups monitoring: Make icinga messages more human friendly [puppet] - 10https://gerrit.wikimedia.org/r/463228 (https://phabricator.wikimedia.org/T203969) (owner: 10Jcrespo) [11:37:10] cool. [11:37:12] kart_: please note that because of datacenter switch debug server is mwdebug2001.codfw.wmnet (alias mw2017.codfw.wmnet) [11:37:38] noted. Thanks. [11:39:08] (03PS4) 10Jcrespo: mariadb backups monitoring: Make icinga messages more human friendly [puppet] - 10https://gerrit.wikimedia.org/r/463228 (https://phabricator.wikimedia.org/T203969) [11:39:41] (03CR) 10Jcrespo: [V: 032 C: 032] mariadb backups monitoring: Make icinga messages more human friendly [puppet] - 10https://gerrit.wikimedia.org/r/463228 (https://phabricator.wikimedia.org/T203969) (owner: 10Jcrespo) [11:39:48] (03CR) 10jerkins-bot: [V: 04-1] mariadb backups monitoring: Make icinga messages more human friendly [puppet] - 10https://gerrit.wikimedia.org/r/463228 (https://phabricator.wikimedia.org/T203969) (owner: 10Jcrespo) [11:39:55] kart_: ok, it's at debug server, please test [11:41:20] yeah. testing.. [11:43:06] zeljkof: all good. Go ahead. 
[11:43:33] kart_: ok, deploying [11:44:33] !log zfilipin@deploy1001 Synchronized php-1.32.0-wmf.23/extensions/ContentTranslation/: SWAT: [[gerrit:463232|Use numerical option when setting CX version preference (T205493)]] (duration: 00m 57s) [11:44:34] (03CR) 10Gehel: [C: 04-1] Fix: Regenerate map tiles up to zoom level 9 with notify-tilerator-regen (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/462918 (https://phabricator.wikimedia.org/T202201) (owner: 10Mholloway) [11:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:42] T205493: [testwiki-wmf.23] Cannot disable new version for translations - https://phabricator.wikimedia.org/T205493 [11:44:53] kart_: deployed! please check and thanks for deploying with #releng :D [11:45:01] !log EU SWAT finished [11:45:03] Thanks zeljkof! [11:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:06] 08̶W̶a̶r̶n̶i̶n̶g Device cr2-esams.wikimedia.org recovered from Inbound interface errors [11:50:52] (03PS3) 10Gehel: Fix: Regenerate map tiles up to zoom level 9 with notify-tilerator-regen [puppet] - 10https://gerrit.wikimedia.org/r/462918 (https://phabricator.wikimedia.org/T202201) (owner: 10Mholloway) [11:51:51] PROBLEM - swift-container-updater on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [11:51:51] PROBLEM - swift-container-replicator on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [11:52:01] PROBLEM - swift-account-reaper on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [11:52:02] PROBLEM - Check systemd state on ms-be1040 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:52:11] PROBLEM - swift-account-server on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [11:52:11] PROBLEM - swift-container-server on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [11:52:31] PROBLEM - swift-object-auditor on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [11:52:41] PROBLEM - swift-container-auditor on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:52:42] PROBLEM - swift-object-replicator on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [11:52:42] PROBLEM - swift-account-replicator on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [11:52:42] PROBLEM - swift-object-updater on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [11:52:42] PROBLEM - swift-account-auditor on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [11:54:34] (03PS4) 10Gehel: Fix: Regenerate map tiles up to zoom level 9 with notify-tilerator-regen [puppet] - 10https://gerrit.wikimedia.org/r/462918 (https://phabricator.wikimedia.org/T202201) (owner: 10Mholloway) [11:55:16] godog: ^^ expected? 
[11:55:52] PROBLEM - swift-container-auditor on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:55:52] PROBLEM - swift-account-replicator on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [11:55:52] PROBLEM - swift-object-replicator on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [11:55:52] PROBLEM - swift-object-updater on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [11:56:01] PROBLEM - swift-account-auditor on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [11:56:11] PROBLEM - swift-container-updater on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [11:56:11] PROBLEM - swift-container-replicator on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [11:56:21] PROBLEM - swift-account-reaper on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [11:56:22] PROBLEM - Check systemd state on ms-be1040 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:56:31] PROBLEM - swift-account-server on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [11:56:31] PROBLEM - swift-container-server on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [11:56:43] (03CR) 10Gehel: "puppet compiler looks good: https://puppet-compiler.wmflabs.org/compiler1002/12656/" [puppet] - 10https://gerrit.wikimedia.org/r/462918 (https://phabricator.wikimedia.org/T202201) (owner: 10Mholloway) [11:56:51] PROBLEM - swift-object-auditor on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [11:58:12] "error: entire web request took longer than 60 seconds and timed out" is driving me crazy :/ [11:58:32] there is a bit jump in fatal monitor [11:58:39] since about 10:30 [11:59:12] PROBLEM - swift-container-auditor on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:59:12] PROBLEM - swift-object-auditor on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [11:59:12] PROBLEM - swift-object-replicator on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [11:59:12] PROBLEM - swift-account-replicator on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [11:59:12] PROBLEM - swift-object-updater on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [11:59:21] PROBLEM - swift-account-auditor on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [11:59:31] PROBLEM - swift-container-updater on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python 
/usr/bin/swift-container-updater [11:59:31] PROBLEM - swift-container-replicator on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [11:59:42] PROBLEM - swift-account-reaper on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [11:59:42] PROBLEM - Check systemd state on ms-be1040 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:59:43] what's going on? [11:59:50] is anyone looking at this? [11:59:51] PROBLEM - swift-account-server on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [11:59:51] PROBLEM - swift-container-server on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180927T1200) [12:00:19] paravoid: I pinged godog and trying to find docs on swift, but no idea what's actually going on [12:00:31] paravoid: there is a single swift server with all services failing [12:00:56] I can see that [12:01:35] (03CR) 10Mholloway: [C: 031] "Thanks for updating!" 
[puppet] - 10https://gerrit.wikimedia.org/r/462918 (https://phabricator.wikimedia.org/T202201) (owner: 10Mholloway) [12:02:30] jouncebot: next [12:02:30] In 0 hour(s) and 57 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180927T1300) [12:04:32] PROBLEM - swift-object-auditor on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [12:04:51] PROBLEM - swift-account-replicator on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [12:04:51] PROBLEM - swift-object-replicator on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [12:04:51] PROBLEM - swift-account-auditor on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [12:05:11] PROBLEM - swift-account-reaper on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [12:06:35] I am logged in com2 [12:07:24] ah it says Puppet is disabled. filippo T199198 [12:07:25] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [12:07:34] wondering if he is running a repair on xfs [12:08:31] ah yes he is working on it [12:09:00] downtiming [12:09:29] elukey: gehel is already doing so [12:10:08] no harm in doing it twice :) [12:10:15] heh [12:10:38] apergos: ah sorry I didn't see it logged in here [12:10:46] no worries [12:11:05] were you guys talking in here? 
I don't see anything :( [12:11:12] ah not in here [12:11:34] !log installing libapache2-mod-perl2 security updates [12:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:50] jouncebot: now [12:12:50] For the next 0 hour(s) and 47 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180927T1200) [12:12:51] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review, 10User-Banyek: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10Banyek) recloning finished, the hosts are replicating again [12:16:53] (03PS5) 10Gehel: Fix: Regenerate map tiles up to zoom level 9 with notify-tilerator-regen [puppet] - 10https://gerrit.wikimedia.org/r/462918 (https://phabricator.wikimedia.org/T202201) (owner: 10Mholloway) [12:17:29] (03CR) 10Gehel: [C: 032] Fix: Regenerate map tiles up to zoom level 9 with notify-tilerator-regen [puppet] - 10https://gerrit.wikimedia.org/r/462918 (https://phabricator.wikimedia.org/T202201) (owner: 10Mholloway) [12:24:59] !log reboot of wdqs2004-2006 for kernel upgrade [12:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:06] (03PS1) 10Elukey: Updated README with last used version and known limitations [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/463243 [12:29:23] (03PS2) 10Elukey: Updated README with last used version and known limitations [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/463243 [12:31:21] (03PS1) 10Gehel: wdqs: don't send nginx logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/463248 (https://phabricator.wikimedia.org/T200563) [12:35:03] (03CR) 10Joal: "One mininit (I love that word), then good :)" (031 comment) [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/463243 (owner: 10Elukey) [12:36:25] (03CR) 10Elukey: Updated README with last used version and known limitations (031 comment) [software/druid_exporter] - 
10https://gerrit.wikimedia.org/r/463243 (owner: 10Elukey) [12:37:20] 10Operations: rsync puppet module doesn't delete removed config - https://phabricator.wikimedia.org/T205618 (10MoritzMuehlenhoff) [12:43:29] (03PS3) 10Elukey: Updated README with last used version and known limitations [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/463243 [12:43:43] (03CR) 10Elukey: [V: 032 C: 032] Updated README with last used version and known limitations [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/463243 (owner: 10Elukey) [12:55:04] (03PS1) 10Muehlenhoff: Filter out duplicates in allowed hosts for profile::analytics::database::meta::backup_dest [puppet] - 10https://gerrit.wikimedia.org/r/463253 [12:55:41] (03CR) 10jerkins-bot: [V: 04-1] Filter out duplicates in allowed hosts for profile::analytics::database::meta::backup_dest [puppet] - 10https://gerrit.wikimedia.org/r/463253 (owner: 10Muehlenhoff) [12:56:51] (03PS2) 10Muehlenhoff: Filter out duplicates in allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/463253 [12:59:49] that was indeed an expired downtime, thanks gehel elukey ! [13:00:04] zeljkof: #bothumor I ♥ Unicode. All rise for MediaWiki train - European version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180927T1300). [13:00:04] apologies for the noise/scare [13:00:04] godog: don't forget apergos ! [13:00:10] indeed, thanks apergos too! [13:00:47] (03CR) 10Bstorm: "I have a concern about the notion that the web and analytics servers get different roles because labsdb1010 is the secondary web server. " [puppet] - 10https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) (owner: 10Banyek) [13:01:09] 10Operations, 10DBA, 10JADE, 10MW-1.32-release-notes (WMF-deploy-2018-09-25 (1.32.0-wmf.23)), and 3 others: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10Marostegui) >>!
In T202596#4604545, @awight wrote: > Thanks for all the attention given... [13:04:38] are things ok? can I move forward with the train? [13:05:17] jouncebot: next [13:05:17] In 2 hour(s) and 54 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180927T1600) [13:05:22] if you're asking about the swift alters, you can ignore those [13:05:27] *alerts [13:05:40] just in general, there were a lot of things going on here, so just checking :) [13:05:47] ok, I'll move the train forward [13:06:16] good luck! [13:06:18] * apergos salutes [13:06:36] (03PS1) 10Gehel: wdqs: cleanup logback configuration [puppet] - 10https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563) [13:07:10] (03CR) 10jerkins-bot: [V: 04-1] wdqs: cleanup logback configuration [puppet] - 10https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563) (owner: 10Gehel) [13:08:23] (03CR) 10Jcrespo: "> I have a concern about the notion that the web and analytics" [puppet] - 10https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) (owner: 10Banyek) [13:08:48] (03PS2) 10Gehel: wdqs: cleanup logback configuration [puppet] - 10https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563) [13:09:50] (03CR) 10Banyek: "we can attach the different hieara variables directly to the hosts instead of the roles - that would work, but it would be bad to maintain" [puppet] - 10https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) (owner: 10Banyek) [13:12:45] (03CR) 10Jcrespo: "> we can attach the different hieara variables directly to the hosts" [puppet] - 10https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) (owner: 10Banyek) [13:13:57] !log repair /dev/sdh1 on ms-be2042 - T199198 [13:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:05] T199198: Some swift filesystems reporting 
negative disk usage - https://phabricator.wikimedia.org/T199198 [13:14:35] (03PS1) 10Zfilipin: all wikis to 1.32.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463256 [13:14:37] (03CR) 10Zfilipin: [C: 032] all wikis to 1.32.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463256 (owner: 10Zfilipin) [13:14:45] (03CR) 10Banyek: "> > we can attach the different hieara variables directly to the" [puppet] - 10https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) (owner: 10Banyek) [13:15:43] (03CR) 10Marostegui: "This is the key, what Jaime wrote: This is just a productionization of the current system" [puppet] - 10https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) (owner: 10Banyek) [13:15:45] (03Merged) 10jenkins-bot: all wikis to 1.32.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463256 (owner: 10Zfilipin) [13:17:56] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.23 [13:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:00] (03PS1) 10Filippo Giunchedi: secret: add dummy icinga irc secret [labs/private] - 10https://gerrit.wikimedia.org/r/463258 [13:20:38] (03CR) 10Filippo Giunchedi: [C: 032] secret: add dummy icinga irc secret [labs/private] - 10https://gerrit.wikimedia.org/r/463258 (owner: 10Filippo Giunchedi) [13:20:51] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] secret: add dummy icinga irc secret [labs/private] - 10https://gerrit.wikimedia.org/r/463258 (owner: 10Filippo Giunchedi) [13:21:05] (03PS1) 10Elukey: Add the analytics-alerts contact to the analytics contact group [puppet] - 10https://gerrit.wikimedia.org/r/463259 (https://phabricator.wikimedia.org/T172532) [13:21:50] (03CR) 10jerkins-bot: [V: 04-1] Add the analytics-alerts contact to the analytics contact group [puppet] - 10https://gerrit.wikimedia.org/r/463259 (https://phabricator.wikimedia.org/T172532) (owner: 
10Elukey) [13:23:39] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/12659/einsteinium.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/463051 (https://phabricator.wikimedia.org/T205526) (owner: 10Filippo Giunchedi) [13:25:29] zeljkof: do I undertand correctly that the train finished? (Can I deploy config change as repooling a db host?) [13:26:41] banyek: I'm done with train, waiting for "60 seconds" log spam to go away [13:27:03] As soon as logs go back to normal, go ahead [13:27:31] tx [13:27:46] (03CR) 10jenkins-bot: all wikis to 1.32.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463256 (owner: 10Zfilipin) [13:31:53] banyek: there's a lot of "Memcached error for key "{memcached-key}" on server "{memcached-server}": SERVER ERROR" in mediawiki-errors [13:32:00] moritzm, o/ [13:32:27] ah, looks like it's gone, logs look good to me now [13:32:32] halfak: hi [13:33:04] I wait for 10 minutes now, and the do the repool [13:33:05] zeljkof: might have been T203786 :( [13:33:06] T203786: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 [13:34:29] !log installing postgres security updates on labsdb1004 [13:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:31] (03CR) 10Volans: [C: 031] "LGTM, let's do this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/463051 (https://phabricator.wikimedia.org/T205526) (owner: 10Filippo Giunchedi) [13:35:47] (03CR) 10Filippo Giunchedi: [C: 032] icinga: use IRC account password [puppet] - 10https://gerrit.wikimedia.org/r/463051 (https://phabricator.wikimedia.org/T205526) (owner: 10Filippo Giunchedi) [13:36:25] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [13:37:25] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [13:40:23] 10Operations: Integrate jessie 8.11 point update - https://phabricator.wikimedia.org/T198058 (10MoritzMuehlenhoff) 05Open>03Resolved Completed. These updates are fully rolled out: ``` subversion clamav postgresql-9.4 ``` [13:41:24] (03PS2) 10Filippo Giunchedi: icinga: use IRC account password [puppet] - 10https://gerrit.wikimedia.org/r/463051 (https://phabricator.wikimedia.org/T205526) [13:41:32] !log repooling db1104 (T205514) [13:41:34] (03CR) 10Filippo Giunchedi: [C: 032] icinga: use IRC account password [puppet] - 10https://gerrit.wikimedia.org/r/463051 (https://phabricator.wikimedia.org/T205526) (owner: 10Filippo Giunchedi) [13:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:40] T205514: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 [13:42:42] (03CR) 10Bstorm: "> Patch Set 29:" [puppet] - 10https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) (owner: 10Banyek) [13:43:05] volans: merged [13:43:09] (03PS1) 10Banyek: Revert "mariadb: depool db1104" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463262 [13:43:11] ack [13:44:48] !log installing ca-certificates updates for 
jessie/stretch [13:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:23] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T205514: revert: depooling db1104, adding db1109 as temproray api host for s8 (duration: 00m 55s) [13:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:47] (03CR) 10Bstorm: "Compiler looks good. That's 99% of what I'd care about :) https://puppet-compiler.wmflabs.org/compiler1002/12660/labsdb1009.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) (owner: 10Banyek) [13:47:23] (03CR) 10Banyek: [C: 032] Revert "mariadb: depool db1104" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463262 (owner: 10Banyek) [13:47:48] banyek: didn't you deploy already? [13:48:15] I wrote it to the databases: I didn't merged it :/ [13:48:26] (03Merged) 10jenkins-bot: Revert "mariadb: depool db1104" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463262 (owner: 10Banyek) [13:49:54] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T205514: revert: depooling db1104, adding db1109 as temproray api host for s8 (duration: 00m 56s) [13:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:02] T205514: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 [13:55:09] (03PS1) 10Filippo Giunchedi: icinga: fix ircecho password ownership [puppet] - 10https://gerrit.wikimedia.org/r/463263 (https://phabricator.wikimedia.org/T205526) [13:56:12] volans: ^ [13:56:17] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/463263 (https://phabricator.wikimedia.org/T205526) (owner: 10Filippo Giunchedi) [13:56:19] a fractal of sadness [13:56:28] indeed! 
[13:56:37] (03CR) 10Filippo Giunchedi: [C: 032] icinga: fix ircecho password ownership [puppet] - 10https://gerrit.wikimedia.org/r/463263 (https://phabricator.wikimedia.org/T205526) (owner: 10Filippo Giunchedi) [13:57:43] (03CR) 10jenkins-bot: Revert "mariadb: depool db1104" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463262 (owner: 10Banyek) [13:59:58] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10User-Banyek: db1069 has errored disk in slot 7 - https://phabricator.wikimedia.org/T205253 (10Banyek) [14:00:58] welcome back icinga-wm! [14:01:09] now with a real account! [14:01:26] oh yes :D [14:01:29] nice job guys [14:01:54] hehe thanks vgutierrez! [14:02:05] closing one branch of the rabbit hole [14:02:47] (03CR) 10Marostegui: "Probably a good idea to disable puppet on the three hosts, merge it, and then run it manually on one of them, to see what we get :-)" [puppet] - 10https://gerrit.wikimedia.org/r/458810 (https://phabricator.wikimedia.org/T183983) (owner: 10Banyek) [14:03:27] 10Operations, 10Patch-For-Review: Register and identify icinga-wm - https://phabricator.wikimedia.org/T205526 (10fgiunchedi) [14:05:22] I'll request a cloak [14:06:13] (03PS1) 10Jcrespo: mariadb backups: Convert db1116 into an eqiad backup source host [puppet] - 10https://gerrit.wikimedia.org/r/463268 (https://phabricator.wikimedia.org/T196376) [14:10:11] (03CR) 10Vgutierrez: [C: 031] Test Server invariants [debs/pybal] - 10https://gerrit.wikimedia.org/r/445207 (https://phabricator.wikimedia.org/T184715) (owner: 10Mark Bergsma) [14:12:48] (03PS1) 10Giuseppe Lavagetto: monitoring: fix spec pre_condition [puppet] - 10https://gerrit.wikimedia.org/r/463270 [14:13:05] (03PS2) 10Giuseppe Lavagetto: monitoring: fix spec pre_condition [puppet] - 10https://gerrit.wikimedia.org/r/463270 [14:13:20] (03CR) 10Vgutierrez: [C: 031] Remove Server.modified and refresh preexisting servers individually [debs/pybal] - 10https://gerrit.wikimedia.org/r/446614 (owner: 10Mark Bergsma) [14:13:49] I'll 
stop icinga-wm briefly to request its cloak [14:14:03] (03CR) 10Giuseppe Lavagetto: [C: 032] monitoring: fix spec pre_condition [puppet] - 10https://gerrit.wikimedia.org/r/463270 (owner: 10Giuseppe Lavagetto) [14:14:34] (03CR) 10Vgutierrez: [C: 031] Don't recalculate server.up in refreshPreexistingServer [debs/pybal] - 10https://gerrit.wikimedia.org/r/447766 (owner: 10Mark Bergsma) [14:15:03] (03PS2) 10Giuseppe Lavagetto: Add the analytics-alerts contact to the analytics contact group [puppet] - 10https://gerrit.wikimedia.org/r/463259 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [14:16:55] PROBLEM - puppet last run on db2049 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[ca-certificates] [14:18:46] RECOVERY - Filesystem available is greater than filesystem size on ms-be2042 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2042&var-datasource=codfw%2520prometheus%252Fops [14:18:56] PROBLEM - puppet last run on db2057 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[ca-certificates] [14:24:05] RECOVERY - puppet last run on db2057 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [14:25:05] (03PS1) 10Greg Grossmeier: l10nupdate: disable temporarily [puppet] - 10https://gerrit.wikimedia.org/r/463275 [14:25:52] (03CR) 10jerkins-bot: [V: 04-1] l10nupdate: disable temporarily [puppet] - 10https://gerrit.wikimedia.org/r/463275 (owner: 10Greg Grossmeier) [14:27:05] RECOVERY - puppet last run on db2049 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:29:16] !log converting srwiki.pagelinks to TokuDB on host dbstore1002 (T205544) [14:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:28] T205544: dbstore1002 /srv filling up - https://phabricator.wikimedia.org/T205544 [14:30:46] (03CR) 10Jdlrobson: [C: 04-1] "1 thing and then good to go!" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463119 (https://phabricator.wikimedia.org/T203981) (owner: 10D3r1ck01) [14:37:23] (03PS2) 10Zfilipin: l10nupdate: disable temporarily [puppet] - 10https://gerrit.wikimedia.org/r/463275 (owner: 10Greg Grossmeier) [14:38:30] (03CR) 10jerkins-bot: [V: 04-1] l10nupdate: disable temporarily [puppet] - 10https://gerrit.wikimedia.org/r/463275 (owner: 10Greg Grossmeier) [14:39:26] stopping icinga-wm again for a minute [14:43:42] 10Operations, 10Puppet, 10User-Banyek: Debian package or files managed my puppet for pt-kill-wmf - https://phabricator.wikimedia.org/T203674 (10Banyek) The change will be merged on 2018-10-01 (Monday) morning. First only on one of the hosts (I recommend to pick one of the 'wikireplica_web' role hosts. Let's... [14:44:50] (03PS1) 10Bstorm: labstore: Change the dumps monitoring thresholds [puppet] - 10https://gerrit.wikimedia.org/r/463276 [14:45:05] PROBLEM - puppet last run on wtp1028 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Package[ca-certificates],Exec[set debconf flag seen for wireshark-common/install-setuid] [14:46:36] banyek: there is an alert on dbstore1002, s3 [14:46:53] were you doing maintenace there? [14:47:11] because I didn't muted the replication lag notification, I guess [14:47:16] let me check it quick [14:47:19] (sorry) [14:47:36] PROBLEM - MariaDB Slave Lag: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 902.84 seconds [14:47:59] yes, that was it [14:48:00] :( [14:48:48] !log disabling checks on cloudvirt1019 to replace raid controller cable T196507 [14:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:56] T196507: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 [14:49:33] (03PS3) 10Cwhite: icinga: enforce mode on nagios files [puppet] - 10https://gerrit.wikimedia.org/r/463112 (https://phabricator.wikimedia.org/T202782) [14:50:02] (03PS3) 10Thcipriani: l10nupdate: disable temporarily [puppet] - 10https://gerrit.wikimedia.org/r/463275 (owner: 10Greg Grossmeier) [14:50:03] (03CR) 10Cwhite: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463112 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [14:50:05] (03CR) 10jerkins-bot: [V: 04-1] icinga: enforce mode on nagios files [puppet] - 10https://gerrit.wikimedia.org/r/463112 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [14:58:10] jouncebot: next [14:58:10] In 1 hour(s) and 1 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180927T1600) [14:58:25] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [14:58:26] (03CR) 10Alexandros Kosiaris: [C: 04-1] "This doesn't look correct. 
There are a ton of hiera lookups that are happening for the exact same key from module classes and not the prof" [puppet] - 10https://gerrit.wikimedia.org/r/463180 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [14:58:59] (03CR) 10Elukey: [C: 032] Add the analytics-alerts contact to the analytics contact group [puppet] - 10https://gerrit.wikimedia.org/r/463259 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [14:59:27] PROBLEM - SSH on cloudvirt1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:00:06] PROBLEM - puppet last run on labtestvirt2003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[python-novaclient],Package[mysql-client] [15:00:10] (03PS1) 10Muehlenhoff: Update Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/463277 [15:00:47] (03CR) 10Alexandros Kosiaris: icinga: fix more places with hardcoded user name (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463177 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [15:02:35] (03CR) 10Alexandros Kosiaris: [C: 031] icinga: pass user/group from profile, change to nagios on icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/463175 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [15:02:41] PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[ca-certificates] [15:02:58] !log rebooting labtestvirt2003 for microcode tests [15:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:31] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (10Vgutierrez) 05Open>03Resolved [15:03:44] jouncebot: now [15:03:44] No deployments scheduled for the next 0 hour(s) and 56 minute(s) [15:03:58] !log T196507 2h downtime cloudvirt1019 in icinga [15:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:07] T196507: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 [15:04:20] (03PS3) 10Reedy: Undeploy EducationProgram [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447946 (https://phabricator.wikimedia.org/T125618) [15:04:37] (03CR) 10Reedy: [C: 032] Undeploy EducationProgram [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447946 (https://phabricator.wikimedia.org/T125618) (owner: 10Reedy) [15:05:43] (03Merged) 10jenkins-bot: Undeploy EducationProgram [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447946 (https://phabricator.wikimedia.org/T125618) (owner: 10Reedy) [15:07:04] (03PS1) 10Andrew Bogott: region-migrate: take a nap before we start the rsync [puppet] - 10https://gerrit.wikimedia.org/r/463279 (https://phabricator.wikimedia.org/T204745) [15:07:07] !log add peering with Telin in esams [15:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:13] (03CR) 10jenkins-bot: Undeploy EducationProgram [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447946 (https://phabricator.wikimedia.org/T125618) (owner: 10Reedy) [15:07:30] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Bye Bye Education Program, Bye Bye. 
T125618 (duration: 00m 58s) [15:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:38] T125618: Deprecate and remove the EducationProgram extension from Wikimedia servers after June 30, 2018 - https://phabricator.wikimedia.org/T125618 [15:08:15] (03CR) 10Volans: "Good job. I've put a bunch of really nitpick style details, a couple of questions and a couple of optional possible alternatives inline." (0315 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: 10Mathew.onipe) [15:08:41] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:08:43] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Bye Bye Education Program, Bye Bye. T125618 (duration: 00m 56s) [15:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:54] (03CR) 10Alexandros Kosiaris: icinga::ircbot: move from module to profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/462835 (owner: 10Dzahn) [15:09:32] RECOVERY - SSH on cloudvirt1019 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u6 (protocol 2.0) [15:09:59] 10Operations, 10Cloud-Services, 10Mail, 10Patch-For-Review: Create a Cloud VPS SMTP smarthost - https://phabricator.wikimedia.org/T41785 (10Andrew) Yeah, we were, and I did, and then I pasted the wrong thing :) ``` Andrews-MacBook-Pro-3:~ andrew$ dig +short mx-out01.wmflabs.org 185.15.56.18 Andrews-MacBo... [15:10:03] 10Operations, 10IRCecho: ircecho / icinga-wm crashlooping - https://phabricator.wikimedia.org/T205522 (10Volans) `icinga-wm` nick was registered, Filippo has also asked for a cloack. The hack has been reverted and ircecho is now able to join also channels with `+r` mode. In case of a crash it will be restarted... 
[15:10:04] (03CR) 10Arturo Borrero Gonzalez: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/463279 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott) [15:10:08] !log reedy@deploy1001 Synchronized wmf-config/extension-list: Bye Bye Education Program, Bye Bye. T125618 (duration: 00m 55s) [15:10:11] RECOVERY - puppet last run on labtestvirt2003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:20] RECOVERY - puppet last run on wtp1028 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:10:24] (03CR) 10Andrew Bogott: [C: 032] region-migrate: take a nap before we start the rsync [puppet] - 10https://gerrit.wikimedia.org/r/463279 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott) [15:12:26] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10Cmjohnson) [15:12:45] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi https://phabricator.wikimedia.org/T205609 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:12:45] ACKNOWLEDGEMENT - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi https://phabricator.wikimedia.org/T205609 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:14:04] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10Cmjohnson) a:05Cmjohnson>03None @elukey everything looks good on our end I was able to access the server root@an-coord1001.mgmt.eqiad.wmnet's password: /admin1-> [15:15:04] 10Operations, 10Analytics, 
10Patch-For-Review, 10User-Elukey: rack/setup/install an-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (10Cmjohnson) [15:15:06] 10Operations, 10ops-eqiad: update label on an-master100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T204999 (10Cmjohnson) 05Open>03Resolved [15:15:38] (03PS1) 10Mholloway: Kartotherian: Add wikidata_query_service var for configuring WDQS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/463281 (https://phabricator.wikimedia.org/T205607) [15:16:17] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: horizon: mwv-apt is now in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/463282 (https://phabricator.wikimedia.org/T204745) [15:16:56] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloudvps: horizon: mwv-apt is now in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/463282 (https://phabricator.wikimedia.org/T204745) (owner: 10Arturo Borrero Gonzalez) [15:17:06] 08Warning Alert for device cr2-esams.wikimedia.org - Inbound interface errors [15:18:50] PROBLEM - puppet last run on labstore1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[ca-certificates] [15:19:10] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:19:32] !log swapping out failed disk slot 3 rdb1004 [15:19:34] thanks cmjohnson1! 
[15:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:40] PROBLEM - HHVM rendering on mw2200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:21:40] RECOVERY - HHVM rendering on mw2200 is OK: HTTP OK: HTTP/1.1 200 OK - 74655 bytes in 0.679 second response time [15:22:10] PROBLEM - etcd request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:22:41] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) Sorry, they sent another new battery. Swapped the battery and let's see if it gets beyond recharging status [15:23:20] RECOVERY - etcd request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:23:30] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:23:51] 10Operations, 10ops-eqiad: Degraded RAID on rdb1004 - https://phabricator.wikimedia.org/T205284 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson Replaced disk in slot 3 [15:23:54] (03PS5) 10D3r1ck01: Enable Page-previews on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463119 (https://phabricator.wikimedia.org/T203981) [15:25:50] 10Operations, 10Citoid, 10Services, 10Patch-For-Review, and 3 others: Deploy translation-server-v2 - https://phabricator.wikimedia.org/T201611 (10akosiaris) https://gerrit.wikimedia.org/g/mediawiki/services/zotero/+/refs/heads/master would be the repository @Mvolz I just created it and gave it the same per... [15:26:21] PROBLEM - Check systemd state on analytics1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:27:18] this is me --^ [15:27:20] testing [15:27:25] notifications works [15:28:31] RECOVERY - Check systemd state on analytics1003 is OK: OK - running: The system is fully operational [15:30:41] (03CR) 10Dzahn: [C: 032] icinga::ircbot: move from module to profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/462835 (owner: 10Dzahn) [15:31:39] !log reedy@deploy1001 Synchronized php-1.32.0-wmf.23/maintenance/: add new mtx script (duration: 00m 58s) [15:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:01] RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:38:22] (03CR) 10Jdlrobson: [C: 031] "Lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463119 (https://phabricator.wikimedia.org/T203981) (owner: 10D3r1ck01) [15:39:19] (03PS1) 10Giuseppe Lavagetto: rake_modules/specdeps: fix logic in resolving specs that need running [puppet] - 10https://gerrit.wikimedia.org/r/463293 [15:40:07] (03PS1) 10Vogone: Remove 'metawiki' from 'wgForceUIMsgAsContentMsg' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463294 [15:40:49] (03CR) 10Jdlrobson: [C: 031] "Today is a 11am utc so 3hrs time" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463119 (https://phabricator.wikimedia.org/T203981) (owner: 10D3r1ck01) [15:42:06] 08̶W̶a̶r̶n̶i̶n̶g Device cr2-esams.wikimedia.org recovered from Inbound interface errors [15:48:44] (03PS2) 10Vogone: Remove 'metawiki' from 'wgForceUIMsgAsContentMsg' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463294 [15:49:21] RECOVERY - puppet last run on labstore1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:53:10] (03PS1) 10Elukey: Add an-coord1001 to puppet [puppet] - 10https://gerrit.wikimedia.org/r/463297 (https://phabricator.wikimedia.org/T204970) [15:55:43] !log reedy@deploy1001 Synchronized php-1.32.0-wmf.23/autoload.php: new mtx script (duration: 00m 56s) 
[15:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:44] volans: Many thanks for resurecting archiva in the analytics chan - This is most helpful :) [15:58:53] s/archiva/icinga [15:59:00] did too much archiva today it seems [15:59:04] (03PS1) 10Cmjohnson: Fixing dhcpd file to match correct mac [puppet] - 10https://gerrit.wikimedia.org/r/463300 (https://phabricator.wikimedia.org/T193655) [16:00:04] godog and _joe_: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180927T1600). [16:00:04] reedy: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:12] (03PS2) 10Bstorm: labstore: Change the dumps monitoring thresholds [puppet] - 10https://gerrit.wikimedia.org/r/463276 [16:00:36] * greg-g waves [16:01:11] (03CR) 10Paladox: [C: 031] "bump" [puppet] - 10https://gerrit.wikimedia.org/r/462770 (https://phabricator.wikimedia.org/T196835) (owner: 10Thcipriani) [16:01:20] (03CR) 10Cmjohnson: [C: 032] Fixing dhcpd file to match correct mac [puppet] - 10https://gerrit.wikimedia.org/r/463300 (https://phabricator.wikimedia.org/T193655) (owner: 10Cmjohnson) [16:03:04] greg-g: coincidence or you are waiting for puppet swat? :) [16:03:15] joal: yw, has proven to be quite a rabbit hole between filippo and me, at least now it will reconnect [16:03:42] greg-g: ah yeah https://gerrit.wikimedia.org/r/c/operations/puppet/+/463275 is yours [16:04:29] is that for the ongoing translations freeze? is there a task? [16:05:04] thcipriani should this https://gerrit.wikimedia.org/r/462770 be added to puppet swat? 
[16:06:14] (03CR) 10Filippo Giunchedi: [C: 032] l10nupdate: disable temporarily [puppet] - 10https://gerrit.wikimedia.org/r/463275 (owner: 10Greg Grossmeier) [16:06:24] (03PS4) 10Filippo Giunchedi: l10nupdate: disable temporarily [puppet] - 10https://gerrit.wikimedia.org/r/463275 (owner: 10Greg Grossmeier) [16:09:09] (03CR) 10Volans: "Minor nitpicks, looks good otherwise." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/462810 (https://phabricator.wikimedia.org/T204088) (owner: 10Alex Monk) [16:09:54] Vogone: here? I was looking at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/463294 though I'd like more reviewers and +1s [16:10:16] (03PS3) 10Bstorm: labstore: Change the dumps monitoring thresholds [puppet] - 10https://gerrit.wikimedia.org/r/463276 [16:11:12] godog: I think it's pretty uncontroversial but obviously not a 'puppet' but a MediaWiki change [16:11:36] i. e. I'm fine if it gets re-scheduled to the later SWAT, but I do not think there is anything wrong with the change itself [16:12:15] Vogone: ack, yeah I'd prefer if it was scheduled for SWAT instead [16:13:56] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Kalliope Tsouroupidou - https://phabricator.wikimedia.org/T202486 (10Kalliope) It's worked great. Thank you for your help!! [16:20:09] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (10Cmjohnson) a:03Bstorm Both of these servers are able to be installed. 
assigning to @Bstorm [16:24:23] (03CR) 10Steinsplitter: [C: 031] "The metawiki mainpage is using the TA extension as of today, therefore this change is needed to restore the initial behavior of the page i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463294 (owner: 10Vogone) [16:25:19] !log ppchelko@deploy1001 Started deploy [restbase/deploy@0f11d5d]: Canary on 2001 for content negotiations T128040 [16:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:27] T128040: Document and implement the REST API format versioning and negotiation policy - https://phabricator.wikimedia.org/T128040 [16:28:43] (03PS1) 10Elukey: Replace analytics team's contacts with analytics-alerts [puppet] - 10https://gerrit.wikimedia.org/r/463306 [16:29:20] RECOVERY - swift-container-auditor on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:29:30] (03PS1) 10Krinkle: logging: Disable 'Wikibase.NewItemIdFormatter' channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463307 (https://phabricator.wikimedia.org/T204791) [16:29:36] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@0f11d5d]: Canary on 2001 for content negotiations T128040 (duration: 04m 17s) [16:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:50] RECOVERY - swift-container-updater on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [16:29:50] RECOVERY - swift-container-replicator on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [16:29:59] (03CR) 10Addshore: [C: 031] logging: Disable 'Wikibase.NewItemIdFormatter' channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463307 (https://phabricator.wikimedia.org/T204791) (owner: 10Krinkle) [16:30:11] RECOVERY - swift-container-server on ms-be1040 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python 
/usr/bin/swift-container-server [16:30:14] (03CR) 10Krinkle: [C: 032] logging: Disable 'Wikibase.NewItemIdFormatter' channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463307 (https://phabricator.wikimedia.org/T204791) (owner: 10Krinkle) [16:30:17] that's me ^ [16:30:56] addshore: thx [16:31:27] (03Merged) 10jenkins-bot: logging: Disable 'Wikibase.NewItemIdFormatter' channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463307 (https://phabricator.wikimedia.org/T204791) (owner: 10Krinkle) [16:33:15] (03CR) 10jenkins-bot: logging: Disable 'Wikibase.NewItemIdFormatter' channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463307 (https://phabricator.wikimedia.org/T204791) (owner: 10Krinkle) [16:34:52] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T204791 (duration: 00m 57s) [16:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:00] T204791: Wikibase critical error "Failed to format entity ID. Cache key contains characters that are not allowed" - https://phabricator.wikimedia.org/T204791 [16:39:54] !log ppchelko@deploy1001 Started deploy [restbase/deploy@0f11d5d]: Full deploy for content negotiations T128040 [16:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:03] T128040: Document and implement the REST API format versioning and negotiation policy - https://phabricator.wikimedia.org/T128040 [16:43:01] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@0f11d5d]: Full deploy for content negotiations T128040 (duration: 03m 06s) [16:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:48] !log ppchelko@deploy1001 Started deploy [restbase/deploy@0f11d5d]: Full deploy for content negotiations T128040 take 2, feeds [16:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:31] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: 
/en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [16:47:37] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@0f11d5d]: Full deploy for content negotiations T128040 take 2, feeds (duration: 03m 49s) [16:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:45] T128040: Document and implement the REST API format versioning and negotiation policy - https://phabricator.wikimedia.org/T128040 [16:47:46] !log ppchelko@deploy1001 Started deploy [restbase/deploy@0f11d5d]: Full deploy for content negotiations T128040 take 3, feeds [16:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:31] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy [16:49:51] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:49:59] (03CR) 10Bstorm: [C: 032] labstore: Change the dumps monitoring thresholds [puppet] - 10https://gerrit.wikimedia.org/r/463276 (owner: 10Bstorm) [16:51:45] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@0f11d5d]: Full deploy for content negotiations T128040 take 3, feeds (duration: 03m 59s) [16:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:10] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:52:14] !log ppchelko@deploy1001 Started deploy [restbase/deploy@0f11d5d]: Full deploy for content negotiations T128040 take 4, feeds.... [16:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:16] (03PS1) 10Cwhite: fixes incorrect checking of hiera calls. 
[puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/463314 [16:56:30] (03PS6) 10D3r1ck01: Enable Page-previews on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463119 (https://phabricator.wikimedia.org/T203981) [16:57:01] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@0f11d5d]: Full deploy for content negotiations T128040 take 4, feeds.... (duration: 04m 46s) [16:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:09] T128040: Document and implement the REST API format versioning and negotiation policy - https://phabricator.wikimedia.org/T128040 [16:57:10] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [16:57:10] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [16:57:12] !log ppchelko@deploy1001 Started deploy [restbase/deploy@0f11d5d]: Full deploy for content negotiations T128040 take 5, feeds.... [16:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:11] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [16:58:11] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [16:58:33] ignoring restbase-dev alerts because obviously it's happening during deployment and even nice to see the recovery [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180927T1700). 
[17:00:04] 10Operations, 10ops-eqiad: apply hostname labels to an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T205034 (10Cmjohnson) 05Open>03Resolved [17:00:06] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: setup/install an-coord1001/wmf7621 - https://phabricator.wikimedia.org/T204970 (10Cmjohnson) [17:00:14] I have deployment for ores [17:00:28] 10Operations, 10Puppet, 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, and 2 others: Prometheus resources in deployment-prep to create grafana graphs of EventLogging - https://phabricator.wikimedia.org/T204088 (10Jdlrobson) We're seeing [[ https://grafana-labs-admin.wikimedia.org/dashboard/db/niedziels... [17:00:41] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Traffic, 10Regression: [Regression] stats.wikipedia.org redirect no longer works ("Domain not served here") - https://phabricator.wikimedia.org/T126281 (10Nuria) Ping @BBlack can we get the redirect from https://stats.wikipedia.org to https://stats.w... [17:02:38] (03CR) 10GTirloni: [C: 032] shinken - Change Puppet thresholds [puppet] - 10https://gerrit.wikimedia.org/r/461632 (https://phabricator.wikimedia.org/T161898) (owner: 10GTirloni) [17:02:52] (03PS2) 10GTirloni: shinken - Change Puppet thresholds [puppet] - 10https://gerrit.wikimedia.org/r/461632 (https://phabricator.wikimedia.org/T161898) [17:03:13] !log swapping failed disk slot 7 db1069 [17:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:21] !log ladsgroup@deploy1001 Started deploy [ores/deploy@a717199]: Send metrics for non-major responses [17:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:44] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@0f11d5d]: Full deploy for content negotiations T128040 take 5, feeds.... 
(duration: 06m 32s) [17:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:52] T128040: Document and implement the REST API format versioning and negotiation policy - https://phabricator.wikimedia.org/T128040 [17:05:02] (03PS9) 10Andrew Bogott: keystone: Create top-level domain for each new project [puppet] - 10https://gerrit.wikimedia.org/r/375089 (https://phabricator.wikimedia.org/T162977) (owner: 10Alex Monk) [17:05:58] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10User-Banyek: db1069 has errored disk in slot 7 - https://phabricator.wikimedia.org/T205253 (10Cmjohnson) @Marostegui The disk on slot 7 has been replaced, please resolve after rebuild [17:06:09] (03PS10) 10Andrew Bogott: keystone: Create top-level domain for each new project [puppet] - 10https://gerrit.wikimedia.org/r/375089 (https://phabricator.wikimedia.org/T162977) (owner: 10Alex Monk) [17:06:39] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review, 10User-Banyek: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10Cmjohnson) the HP required AHS log has been uploaded to their dropbox. Waiting on their response. 
[17:06:52] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10User-Banyek: db1069 has errored disk in slot 7 - https://phabricator.wikimedia.org/T205253 (10Banyek) thanks, I'll look after this [17:07:47] (03CR) 10Andrew Bogott: [C: 032] keystone: Create top-level domain for each new project [puppet] - 10https://gerrit.wikimedia.org/r/375089 (https://phabricator.wikimedia.org/T162977) (owner: 10Alex Monk) [17:07:55] Canary is fine, moving forward [17:09:31] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [17:11:06] (03PS1) 10Cwhite: icinga: linter compliance and stretch user [puppet] - 10https://gerrit.wikimedia.org/r/463316 (https://phabricator.wikimedia.org/T202782) [17:11:30] (03CR) 10jerkins-bot: [V: 04-1] icinga: linter compliance and stretch user [puppet] - 10https://gerrit.wikimedia.org/r/463316 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [17:11:41] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [17:13:00] (03PS1) 10Sbisson: Remove config for RCFilters variables being removed from Core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463319 (https://phabricator.wikimedia.org/T196033) [17:15:24] (03Abandoned) 10Cwhite: icinga: linter compliance and stretch user [puppet] - 10https://gerrit.wikimedia.org/r/463316 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [17:16:21] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet operation_type=create_container https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:18:05] (03PS1) 10Alex Monk: shinkengen config: Remove deleted labs project [puppet] - 
10https://gerrit.wikimedia.org/r/463320 [17:18:40] RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [17:22:55] (03CR) 10Bstorm: [C: 032] shinkengen config: Remove deleted labs project [puppet] - 10https://gerrit.wikimedia.org/r/463320 (owner: 10Alex Monk) [17:23:12] (03CR) 10Andrew Bogott: [C: 032] shinkengen config: Remove deleted labs project [puppet] - 10https://gerrit.wikimedia.org/r/463320 (owner: 10Alex Monk) [17:23:27] (03PS1) 10Alex Monk: shinken: Also remove wdq-mm contact group [puppet] - 10https://gerrit.wikimedia.org/r/463322 [17:24:59] (03CR) 10MF-Warburg: [C: 031] "Brilliant change, very necessary" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463294 (owner: 10Vogone) [17:26:01] RECOVERY - MariaDB Slave Lag: s3 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 157.47 seconds [17:27:10] PROBLEM - MegaRAID on db1069 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [17:27:12] ACKNOWLEDGEMENT - MegaRAID on db1069 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T205649 [17:27:16] !log ladsgroup@deploy1001 Finished deploy [ores/deploy@a717199]: Send metrics for non-major responses (duration: 23m 55s) [17:27:16] 10Operations, 10ops-eqiad: Degraded RAID on db1069 - https://phabricator.wikimedia.org/T205649 (10ops-monitoring-bot) [17:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:42] !log reboot asw2-a-eqiad (not in prod) - T201145 [17:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:50] T201145: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 [17:28:45] (03PS2) 10Muehlenhoff: Update Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/463277 [17:30:09] (03PS1) 10Alex Monk: Remove all remaining wdq_mm references [puppet] - 
10https://gerrit.wikimedia.org/r/463325 [17:30:55] (03CR) 10Muehlenhoff: [V: 032 C: 032] Update Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/463277 (owner: 10Muehlenhoff) [17:31:55] (03CR) 10Bstorm: [C: 032] shinken: Also remove wdq-mm contact group [puppet] - 10https://gerrit.wikimedia.org/r/463322 (owner: 10Alex Monk) [17:32:03] (03PS2) 10Bstorm: shinken: Also remove wdq-mm contact group [puppet] - 10https://gerrit.wikimedia.org/r/463322 (owner: 10Alex Monk) [17:32:07] (03CR) 10Dzahn: "only parts of it i got to before, the whole conversion away from includes is still good and we should do it. i kind of want to restore thi" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463316 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [17:34:11] (03CR) 10Smalyshev: wdqs: cleanup logback configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563) (owner: 10Gehel) [17:34:51] (03PS2) 10Elukey: Replace analytics team's contacts with analytics-alerts [puppet] - 10https://gerrit.wikimedia.org/r/463306 (https://phabricator.wikimedia.org/T172532) [17:37:31] (03PS1) 10Andrew Bogott: Keystone: followup patch for automatic domain creation/cleanup [puppet] - 10https://gerrit.wikimedia.org/r/463329 (https://phabricator.wikimedia.org/T162977) [17:37:44] (03CR) 10Thcipriani: [C: 031] "Seems to add the needed functionality for blubberoid and preserve existing service functionality. LGTM." 
[software/service-checker] - 10https://gerrit.wikimedia.org/r/461457 (owner: 10Dduvall) [17:38:09] (03CR) 10jerkins-bot: [V: 04-1] Keystone: followup patch for automatic domain creation/cleanup [puppet] - 10https://gerrit.wikimedia.org/r/463329 (https://phabricator.wikimedia.org/T162977) (owner: 10Andrew Bogott) [17:41:01] (03PS2) 10Andrew Bogott: Keystone: followup patch for automatic domain creation/cleanup [puppet] - 10https://gerrit.wikimedia.org/r/463329 (https://phabricator.wikimedia.org/T162977) [17:41:49] (03CR) 10Andrew Bogott: [C: 032] Keystone: followup patch for automatic domain creation/cleanup [puppet] - 10https://gerrit.wikimedia.org/r/463329 (https://phabricator.wikimedia.org/T162977) (owner: 10Andrew Bogott) [17:47:17] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@a0054ba]: Update mobileapps to 0d6c2b7 [17:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:39] (03PS4) 10Dzahn: icinga: enforce mode on nagios files [puppet] - 10https://gerrit.wikimedia.org/r/463112 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [17:50:04] (03CR) 10Dzahn: "PS4: fixed lint issue, arrow alignment" [puppet] - 10https://gerrit.wikimedia.org/r/463112 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [17:50:38] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@a0054ba]: Update mobileapps to 0d6c2b7 (duration: 03m 21s) [17:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:48] RECOVERY - Device not healthy -SMART- on db1069 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops [17:53:59] (03PS1) 10Andrew Bogott: nfs-mounts.yaml: remove defs for some deleted projects [puppet] - 10https://gerrit.wikimedia.org/r/463332 [17:54:05] !log arlolra@deploy1001 Started deploy [parsoid/deploy@6a2c25c]: Updating Parsoid to af3a920 [17:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:38] (03CR) 10Andrew Bogott: [C: 032] nfs-mounts.yaml: remove defs for some deleted projects [puppet] - 10https://gerrit.wikimedia.org/r/463332 (owner: 10Andrew Bogott) [17:55:10] (03CR) 10Alex Monk: [C: 031] nfs-mounts.yaml: remove defs for some deleted projects [puppet] - 10https://gerrit.wikimedia.org/r/463332 (owner: 10Andrew Bogott) [17:56:25] (03CR) 10Dzahn: [C: 031] "yesterday i removed/purged all the icinga/nagios related stuff from icinga1001 and let puppet recreate things afterwards. i can confirm th" [puppet] - 10https://gerrit.wikimedia.org/r/463112 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [17:56:45] (03PS5) 10Dzahn: icinga: enforce mode on nagios files [puppet] - 10https://gerrit.wikimedia.org/r/463112 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [17:57:57] jouncebot: convert "Thursday, September 25th at 3:30pm Central US" to @now :p [18:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Morning SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180927T1800). [18:00:05] Vogone and Smalyshev: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:47] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:04:48] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@6a2c25c]: Updating Parsoid to af3a920 (duration: 10m 43s) [18:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:23] hey all there's a late entry in the swat window from @d3r1ck [18:05:39] (not sure who's on duty today) [18:05:44] o/ [18:06:19] jdlrobson: o/ [18:06:37] I've just added it, is that okay? jdlrobson? [18:07:17] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:07:17] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:07:57] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:08:31] should be fine ^ per jouncebot, we're still waiting on someone being free to run the window [18:08:52] Okay! 
[18:11:18] PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 271 bytes in 0.014 second response time [18:11:55] PROBLEM - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/nfs/home - 305 bytes in 0.022 second response time [18:12:14] !log Updated Parsoid to af3a920 (T198511, T163438, T108776, T205334, T114413) [18:12:20] ^^ those are expected. bstorm_ is working on NFS in Cloud VPS land [18:12:37] RECOVERY - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.006 second response time [18:13:17] PROBLEM - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [18:13:22] Yeah. [18:13:30] Should be coming back now. [18:13:56] Vogone: SMalyshev are you around? [18:14:08] yes :) [18:14:16] RECOVERY - toolschecker: NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.150 second response time [18:14:17] RECOVERY - toolschecker: tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 1015 bytes in 12.560 second response time [18:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:22] T205334: Identify and remove dead code - https://phabricator.wikimedia.org/T205334 [18:14:22] T163438: VisualEditor broken on wikitech when codfw is primary: "Error loading data from server: apierror-visualeditor-docserver-http: HTTP 500." 
- https://phabricator.wikimedia.org/T163438 [18:14:25] T108776: Parsoid on wikitech fails - https://phabricator.wikimedia.org/T108776 [18:14:25] T114413: Support various conversions in Parsoid's pb2pb endpoint - https://phabricator.wikimedia.org/T114413 [18:14:26] T198511: VisualEditor losing Media: links - https://phabricator.wikimedia.org/T198511 [18:14:43] twentyafterfour thcipriani you can or anyone else run swat? [18:14:58] <_joe_> I was at dinner, everything allright I see [18:15:01] <_joe_> cool [18:15:35] PROBLEM - Getent speed check on labstore1004 is CRITICAL: CRITICAL: getent group tools.admin failed [18:17:05] That last one is more puzzling... [18:17:13] !log Starting mwscript extensions/ORES/maintenance/BackfillPageTriageQueue.php --wiki enwiki (T203286) [18:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:21] T203286: New Pages Feed: run ORES backfill script in English Wikipedia - https://phabricator.wikimedia.org/T203286 [18:17:27] RECOVERY - MegaRAID on db1069 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [18:17:35] jdlrobson: give me a few minutes and I can do it [18:24:15] 10Operations, 10netops: Intermittent connectivity issues in eqiad's row C - https://phabricator.wikimedia.org/T201139 (10ayounsi) Thanks for the update, note that es1014 is in row B (issues tracked in T201039) Rough timeline is to get row B fixed next week, no ETA yet for row C (but I'm not aware of ongoing is... [18:24:29] ok, I'm around [18:24:42] looks like no one jumped in in the interim, so I can SWAT [18:25:45] RECOVERY - Getent speed check on labstore1004 is OK: OK: getent group returns within a second [18:26:22] Vogone: could you get some more review on your change? I'm not familiar with the task or the code it's touching (which is from 2008 as you pointed out), so I'm not comfortable merging it in this window without someone more familiar looking it over. 
Sorry :( [18:26:39] SMalyshev: I didn't see in scrollback if you were around for SWAT? [18:26:47] thcipriani: it's rather urgent since the main page has been migrated to a new system [18:26:53] oops I wasn't [18:27:00] is it too late or we can do it now? [18:27:10] I was listening in to Metrics and forgot [18:27:22] SMalyshev: I'm just getting going on SWAT (off to a late start) [18:27:34] great [18:28:02] is there anyone available to review Vogone 's patch for T205633 ? [18:28:03] T205633: T18701 causes empty main pages on metawiki - https://phabricator.wikimedia.org/T205633 [18:29:23] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463119 (https://phabricator.wikimedia.org/T203981) (owner: 10D3r1ck01) [18:30:28] (03Merged) 10jenkins-bot: Enable Page-previews on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463119 (https://phabricator.wikimedia.org/T203981) (owner: 10D3r1ck01) [18:31:30] where is the patch? [18:31:48] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/463294/ [18:32:26] oh found it [18:32:42] Well i can confirm that that would work as advertised [18:33:08] I assume you know better than me that meta actually wants their to be only one main page? [18:33:33] (03PS1) 10Alex Monk: openstack horizon: Remove references to deleted projects [puppet] - 10https://gerrit.wikimedia.org/r/463337 [18:33:37] (03PS3) 10Legoktm: Remove 'metawiki' from 'wgForceUIMsgAsContentMsg' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463294 (https://phabricator.wikimedia.org/T205633) (owner: 10Vogone) [18:34:09] d3r1ck: jdlrobson you change is live on mwdebug2002, check please [18:34:26] okay thcipriani [18:34:36] *your [18:34:39] (03CR) 10Brian Wolff: [C: 031] "This patch does what it says on the tin (Make there be only one main page for meta)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/463294 (https://phabricator.wikimedia.org/T205633) (owner: 10Vogone) [18:35:39] (03Restored) 10Dzahn: icinga: linter compliance and stretch user [puppet] - 10https://gerrit.wikimedia.org/r/463316 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [18:35:57] thcipriani, jdlrobson, it works from my end :) [18:36:07] jdlrobson: Can you confirm please? :) [18:38:30] (03CR) 10Legoktm: [C: 031] "Based on what Brion said in 2008 (T18701#208739), seems like a good idea." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463294 (https://phabricator.wikimedia.org/T205633) (owner: 10Vogone) [18:38:30] Vogone: I'll get your patch merged in this window if you're still available. Sorry for the delay, just wanted some verification :) [18:38:47] yes, I understand :) [18:38:58] d3r1ck: still looking :) [18:39:13] jdlrobson: Okay! [18:39:21] I'm double checking too just in case :) [18:39:42] d3r1ck: thcipriani sync away! [18:39:45] it looks good my end too [18:39:53] * thcipriani does [18:40:08] \o/ [18:40:19] Once you are done with swat, is it ok I do a security related deployed [18:40:22] *deploy [18:42:14] jdlrobson, thcipriani, so we're good? 
[18:42:23] (03PS1) 10Alex Monk: Remove references to deleted labs project ci-staging [puppet] - 10https://gerrit.wikimedia.org/r/463339 [18:42:29] bawolff: sure, I'll ping you when I'm done [18:42:39] cool :) [18:43:11] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:463119|Enable Page-previews on Wikivoyage]] T203981 (duration: 00m 57s) [18:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:20] T203981: Enable Page Previews on English Wikivoyage - https://phabricator.wikimedia.org/T203981 [18:43:23] ^ jdlrobson d3r1ck should be live [18:43:40] (03CR) 10jenkins-bot: Enable Page-previews on Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463119 (https://phabricator.wikimedia.org/T203981) (owner: 10D3r1ck01) [18:43:49] w00t [18:43:55] \o/ [18:43:59] d3r1ck: am keeping an eye on https://grafana.wikimedia.org/dashboard/db/reading-web-page-previews?refresh=1m&orgId=1 [18:44:11] let's see if anything unexpected happens to those graphs [18:44:31] (unlikely but good to monitor :)) [18:44:37] SMalyshev: your change is live on mwdebug2002, check please [18:44:43] checking [18:45:03] (03PS4) 10Thcipriani: Remove 'metawiki' from 'wgForceUIMsgAsContentMsg' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463294 (https://phabricator.wikimedia.org/T205633) (owner: 10Vogone) [18:45:11] jdlrobson: Let me start looking too [18:45:18] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463294 (https://phabricator.wikimedia.org/T205633) (owner: 10Vogone) [18:45:38] (03PS2) 10Alex Monk: Remove references to deleted labs project ci-staging [puppet] - 10https://gerrit.wikimedia.org/r/463339 [18:46:27] (03Merged) 10jenkins-bot: Remove 'metawiki' from 'wgForceUIMsgAsContentMsg' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463294 (https://phabricator.wikimedia.org/T205633) (owner: 10Vogone) [18:46:29] 10Operations, 10Scap, 
10Datacenter-Switchover-2018, 10Release-Engineering-Team (Watching / External), 10Wikimedia-Incident: Scap is checking canary servers in dormant instead of active-dc - https://phabricator.wikimedia.org/T204907 (10Krinkle) In addition to the servers Scap checks, there is also the url... [18:46:41] thcipriani: yep, works as it should [18:46:51] 10Operations, 10Scap, 10Datacenter-Switchover-2018, 10Release-Engineering-Team (Watching / External), 10Wikimedia-Incident: Scap is checking canary servers in dormant instead of active-dc - https://phabricator.wikimedia.org/T204907 (10Krinkle) (I see James reported that at T205559.) [18:47:03] SMalyshev: cool, thanks for checking, going live [18:49:03] d3r1ck: looks like it's logged in users only [18:49:08] or am i wrong? [18:49:21] !log thcipriani@deploy1001 Synchronized php-1.32.0-wmf.23/extensions/WikimediaEvents/modules/wikibase/ext.wikimediaEvents.completionClicks.js: SWAT: [[gerrit:463296|Ignore clicks with empty search string]] T205301 (duration: 00m 56s) [18:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:29] T205301: Property searches in wikidatacompletionsearchclicks have mostly null values - https://phabricator.wikimedia.org/T205301 [18:49:31] ^ SMalyshev should be live now [18:49:43] thcipriani: cool, thanks! [18:49:51] jdlrobson: No, you're correct! [18:50:02] was that intentional? [18:50:03] SMalyshev: yw :) [18:50:14] Vogone: your change is live on mwdebug2002, check please [18:50:40] (03PS2) 10Dzahn: icinga: linter compliance and stretch user [puppet] - 10https://gerrit.wikimedia.org/r/463316 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [18:51:06] (03CR) 10Andrew Bogott: [C: 032] openstack horizon: Remove references to deleted projects [puppet] - 10https://gerrit.wikimedia.org/r/463337 (owner: 10Alex Monk) [18:51:46] jdlrobson: No, but are Annon users supposed to have access to that? 
[18:52:03] yeh that was the idea [18:52:09] :( [18:52:14] maybe caching? [18:52:19] according to the code.. it should be working [18:52:34] Default is 0 [18:52:37] Not 1 [18:52:47] shouldSendModuleToUser returns true if user is anon though.. [18:53:15] ahh caching [18:53:16] it works [18:53:19] action=purge did the trick [18:54:02] Hmmm... not working for me after purge, let me retry [18:54:50] Aha, jdlrobson it works! [18:54:58] Took a few secs like 45 [18:55:01] but it's working now [18:55:08] * d3r1ck grins [18:55:18] thcipriani: still waiting on Vogone ? [18:55:29] yes [18:55:39] nice! Looks good! [18:55:45] thanks for taking care of that d3r1ck [18:56:24] jdlrobson: Thanks for the enormous help :) [18:56:25] Vogone: have you verified changes via this extension? https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug [18:56:50] I just installed it, yes [18:56:59] takes some time ;) [18:57:03] d3r1ck: i've pinged the village pump to let them know this happened [18:57:14] jdlrobson: Great, thanks a lot :) [18:57:56] Vogone: ack :) [18:58:15] (03CR) 10Andrew Bogott: [C: 032] Remove references to deleted labs project ci-staging [puppet] - 10https://gerrit.wikimedia.org/r/463339 (owner: 10Alex Monk) [18:58:35] (03CR) 10jenkins-bot: Remove 'metawiki' from 'wgForceUIMsgAsContentMsg' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463294 (https://phabricator.wikimedia.org/T205633) (owner: 10Vogone) [18:59:04] !log update all MR routers to include cumin1001 - T205513 [18:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:59:12] T205513: Enable cumin1001 in router ACLs - https://phabricator.wikimedia.org/T205513 [19:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180927T1900) [19:01:08] Can I use this spot to deploy a security thing (The train is on european time right?) 
[19:02:20] it is on EU time today (already happened) still haven't totally finished SWAT [19:02:50] waiting on the go-ahead to sync [19:03:00] Oh cool cool. [19:03:26] thcipriani: sorry, didn't mean to seem like I'm chomping at the bit :) [19:03:32] no worries :) [19:05:51] thcipriani: I can't see a difference, but perhaps I am doing something wrong [19:06:34] just ship it [19:07:04] Vogone: for your change? Click on the site logo [19:07:19] And try again for ?uselang= [19:07:25] see if it goes to different places [19:07:59] yes now it works [19:08:02] a few minutes ago it did not [19:08:09] (still directed me to the old page) [19:08:34] cool, so looks good? [19:09:07] yeah [19:09:38] * thcipriani syncs [19:10:27] (03CR) 10Dzahn: [C: 032] icinga: enforce mode on nagios files [puppet] - 10https://gerrit.wikimedia.org/r/463112 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [19:10:31] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:463294|Remove "metawiki" from "wgForceUIMsgAsContentMsg"]] T205633 (duration: 00m 56s) [19:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:47] ^ Vogone should be live, thanks for your patience on the review piece [19:10:48] T205633: T18701 causes empty main pages on metawiki - https://phabricator.wikimedia.org/T205633 [19:10:50] bawolff: all yours [19:10:53] (03PS6) 10Dzahn: icinga: enforce mode on nagios files [puppet] - 10https://gerrit.wikimedia.org/r/463112 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [19:11:12] thcipriani: Thanks :) [19:11:33] thcipriani: thanks a lot! 
[19:12:13] yw, both :) [19:15:39] jdlrobson: Some changes happening on those graphs :) [19:26:02] (CR) Cwhite: [C: 1] "Looks like existing infra does not change: https://puppet-compiler.wmflabs.org/compiler1002/12663/" [puppet] - https://gerrit.wikimedia.org/r/463316 (https://phabricator.wikimedia.org/T202782) (owner: Cwhite) [19:26:14] (PS3) Cwhite: icinga: linter compliance and stretch user [puppet] - https://gerrit.wikimedia.org/r/463316 (https://phabricator.wikimedia.org/T202782) [19:29:42] !log update all cr1/2-eqiad to include cumin1001 - T205513 [19:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:50] T205513: Enable cumin1001 in router ACLs - https://phabricator.wikimedia.org/T205513 [19:30:22] (CR) Dzahn: [C: 2] "noop on einsteinium. fixed modes on these files on icinga1001" [puppet] - https://gerrit.wikimedia.org/r/463112 (https://phabricator.wikimedia.org/T202782) (owner: Cwhite) [19:31:14] (CR) Dzahn: "thank you :)" [puppet] - https://gerrit.wikimedia.org/r/450314 (owner: Dzahn) [19:31:25] !log deploy related to T194204 [19:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:07] * bawolff done now [19:33:10] Operations, netops: Enable cumin1001 in router ACLs - https://phabricator.wikimedia.org/T205513 (ayounsi) Open>Resolved a:ayounsi All filters updated, let me know if any issues. [19:33:34] @d3r1ck what are you seeing? I'm not seeing anything abnormal.
Still consistent with last 7 days [19:33:56] No, nothing abnormal [19:34:04] Just the normal rise and fall :) [19:47:34] 10Operations, 10TechCom-RFC, 10Traffic, 10Patch-For-Review, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10kchapman) TechCom has approved this RFC [19:47:55] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar), and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10kchapman) [19:57:08] is anything happening now, or is it safe to deploy parsoid? [20:00:28] arlolra, i would take that silence as safe to go. :) [20:00:58] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:01:16] !log arlolra@deploy1001 Started deploy [parsoid/deploy@0272096]: Updating Parsoid to ff6ffb5 [20:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:08] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [20:03:14] !log update pfw3-codfw/eqiad firewall rules - T205574 [20:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:22] T205574: deploy new pfw config - https://phabricator.wikimedia.org/T205574 [20:09:21] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@0272096]: Updating Parsoid to ff6ffb5 (duration: 08m 05s) [20:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:17] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected 
value for header vary: Accept, Accept-Encoding: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections had an unexpected value for hea [20:13:18] ccept-Encoding [20:13:18] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header vary: Accept, Accept-Encoding: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections had an unexpected value for hea [20:13:18] ccept-Encoding [20:13:19] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header vary: Accept, Accept-Encoding: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections had an unexpected value for hea [20:13:19] ccept-Encoding [20:13:27] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header vary: Accept, Accept-Encoding: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections had an unexpected value for hea [20:13:27] ccept-Encoding [20:13:37] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header vary: Accept, Accept-Encoding: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} 
(retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections had an unexpected v [20:13:38] y: Accept, Accept-Encoding [20:13:47] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header vary: Accept, Accept-Encoding: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections had an unexpected value for hea [20:13:47] ccept-Encoding [20:13:47] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header vary: Accept, Accept-Encoding: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections had an unexpected value for hea [20:13:47] ccept-Encoding [20:13:49] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header vary: Accept, Accept-Encoding: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections had an unexpected v [20:13:49] y: Accept, Accept-Encoding [20:13:49] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header vary: Accept, Accept-Encoding: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections had an 
unexpected value for hea [20:13:49] ccept-Encoding [20:13:49] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header vary: Accept, Accept-Encoding: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections had an unexpected value for hea [20:13:49] ccept-Encoding [20:13:57] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header vary: Accept, Accept-Encoding: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections had an unexpected value for hea [20:13:57] ccept-Encoding [20:13:58] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/media/{title}{/revision}{/tid} (Get media in test page) is CRITICAL: Test Get media in test page had an unexpected value for header vary: Accept, Accept-Encoding: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections had an unexpected value for hea [20:13:58] ccept-Encoding [20:14:53] arlolra: is all that noise fallout from your deploy? [20:15:02] 10Operations, 10fundraising-tech-ops, 10netops: deploy new pfw config - https://phabricator.wikimedia.org/T205574 (10cwdent) Typo in config file, updated version: 1538079234 [20:15:18] bd808: I'm trying to determine [20:15:21] hopefully not [20:15:24] bd808, arlolra and i were discussing that .. we don't think it is related. [20:15:55] bearND, mdholloway|afk might be able to help clarify what that is. 
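Note on the alerts above: they report an unexpected value for the vary response header, but the log truncates the details, so the exact comparison the health check performs is not shown here. Purely as an illustration, a minimal Python sketch of comparing Vary values while ignoring token order, case and spacing could look like this. The function names and the sample values are hypothetical and are not taken from the monitoring tool.

```python
def vary_tokens(value: str) -> frozenset:
    """Split a Vary header value into a set of lowercase field names."""
    return frozenset(
        token.strip().lower() for token in value.split(",") if token.strip()
    )


def vary_matches(actual: str, expected: str) -> bool:
    """Return True when both header values name the same set of fields."""
    return vary_tokens(actual) == vary_tokens(expected)


# Hypothetical sample values, only to show the comparison.
print(vary_matches("Accept, Accept-Encoding", "accept-encoding,accept"))
print(vary_matches("Accept, Accept-Encoding", "Accept-Encoding"))
```

A check that compares the raw strings instead would flag a reordered or re-cased header as a mismatch, which is one way a harmless change can turn into an alert.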
[20:22:42] (03PS3) 10Gehel: wdqs: cleanup logback configuration [puppet] - 10https://gerrit.wikimedia.org/r/463254 (https://phabricator.wikimedia.org/T200563) [20:24:37] (03CR) 10Dzahn: [C: 031] "noop in compiler. wmf-style: total violations delta -24 (!) and should fix icinga1001 errors :)" [puppet] - 10https://gerrit.wikimedia.org/r/463316 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [20:25:06] (03CR) 10Dzahn: [C: 032] icinga: linter compliance and stretch user [puppet] - 10https://gerrit.wikimedia.org/r/463316 (https://phabricator.wikimedia.org/T202782) (owner: 10Cwhite) [20:26:01] 10Operations, 10TechCom: change my email address in the techcom alias - https://phabricator.wikimedia.org/T205661 (10daniel) [20:47:39] 10Operations, 10Analytics, 10Traffic: Artificial spike in offset of unique devices from November to February 6th on wikidata - https://phabricator.wikimedia.org/T165560 (10Nuria) [20:49:24] 10Operations, 10fundraising-tech-ops, 10netops: deploy new pfw config - https://phabricator.wikimedia.org/T205574 (10cwdent) 05Open>03Resolved a:03cwdent Looks good, thanks @ayounsi [20:49:54] (03PS2) 10Herron: WIP: smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) [20:50:34] (03CR) 10jerkins-bot: [V: 04-1] WIP: smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [21:13:03] (03PS1) 10Dzahn: icinga: also set user/group for commands file [puppet] - 10https://gerrit.wikimedia.org/r/463374 [21:13:33] (03CR) 10jerkins-bot: [V: 04-1] icinga: also set user/group for commands file [puppet] - 10https://gerrit.wikimedia.org/r/463374 (owner: 10Dzahn) [21:18:07] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] 
https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:18:28] PROBLEM - Filesystem available is greater than filesystem size on ms-be1041 is CRITICAL: cluster=swift device=/dev/sdf1 fstype=xfs instance=ms-be1041:9100 job=node mountpoint=/srv/swift-storage/sdf1 site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be1041&var-datasource=eqiad%2520prometheus%252Fops [21:22:27] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:26:47] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:31:17] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:34:31] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: move/setup/install frauth2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T204079 (10cwdent) Thanks @papaul and @ayounsi - server is up and accessible. I tried changing to the bonded ethernet config w/o restarting (only kicke... 
[21:34:47] (PS1) Cwhite: icinga: allow rsync from multiple hosts [puppet] - https://gerrit.wikimedia.org/r/463379 (https://phabricator.wikimedia.org/T202782) [21:40:07] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [21:55:18] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:58:21] Puppet, Discovery-Search, Beta-Cluster-reproducible: Elasticsearch puppet config changes broke puppet in various instances - https://phabricator.wikimedia.org/T205672 (Krenair) [21:58:29] (PS1) BryanDavis: tools: Update usage of ::elasticsearch [puppet] - https://gerrit.wikimedia.org/r/463385 (https://phabricator.wikimedia.org/T198351) [21:58:55] (CR) Alex Monk: "Caused T205672" [puppet] - https://gerrit.wikimedia.org/r/441338 (https://phabricator.wikimedia.org/T198351) (owner: EBernhardson) [21:59:13] (PS2) BryanDavis: tools: Update usage of ::elasticsearch [puppet] - https://gerrit.wikimedia.org/r/463385 (https://phabricator.wikimedia.org/T198351) [22:02:44] (CR) Andrew Bogott: [C: 2] tools: Update usage of ::elasticsearch [puppet] - https://gerrit.wikimedia.org/r/463385 (https://phabricator.wikimedia.org/T198351) (owner: BryanDavis) [22:09:16] (CR) Dzahn: "looks good, let's just move the definition of the "active_host" to one central place, in hieradata/role/common/ rather than hieradata/host" [puppet] - https://gerrit.wikimedia.org/r/463379 (https://phabricator.wikimedia.org/T202782) (owner: Cwhite) [22:12:48] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0]
https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:13:29] (CR) Dzahn: "per IRC, yes, we can also compare $::fqdn to $active_server in puppet and already know we are "passive" if it's not matching and don't eve" [puppet] - https://gerrit.wikimedia.org/r/463379 (https://phabricator.wikimedia.org/T202782) (owner: Cwhite) [22:21:33] (PS1) BryanDavis: tools: Update usage of ::elasticsearch (take 2) [puppet] - https://gerrit.wikimedia.org/r/463386 (https://phabricator.wikimedia.org/T198351) [22:22:19] (CR) jerkins-bot: [V: -1] tools: Update usage of ::elasticsearch (take 2) [puppet] - https://gerrit.wikimedia.org/r/463386 (https://phabricator.wikimedia.org/T198351) (owner: BryanDavis) [22:23:08] (PS1) Cwhite: icinga: allow rsync from multiple hosts [puppet] - https://gerrit.wikimedia.org/r/463387 (https://phabricator.wikimedia.org/T202782) [22:25:34] (Abandoned) Cwhite: icinga: allow rsync from multiple hosts [puppet] - https://gerrit.wikimedia.org/r/463379 (https://phabricator.wikimedia.org/T202782) (owner: Cwhite) [22:34:03] (PS2) BryanDavis: tools: Update usage of ::elasticsearch (take 2) [puppet] - https://gerrit.wikimedia.org/r/463386 (https://phabricator.wikimedia.org/T198351) [22:34:39] (CR) jerkins-bot: [V: -1] tools: Update usage of ::elasticsearch (take 2) [puppet] - https://gerrit.wikimedia.org/r/463386 (https://phabricator.wikimedia.org/T198351) (owner: BryanDavis) [22:35:34] (CR) Dzahn: [C: 1] "compiler says it does what it should do on all hosts: https://puppet-compiler.wmflabs.org/compiler1002/12664/" [puppet] - https://gerrit.wikimedia.org/r/463387 (https://phabricator.wikimedia.org/T202782) (owner: Cwhite) [22:35:54] (CR) Dzahn: [C: 2] icinga: allow rsync from multiple hosts [puppet] - https://gerrit.wikimedia.org/r/463387 (https://phabricator.wikimedia.org/T202782) (owner: Cwhite) [22:37:36] Puppet, Discovery-Search,
Beta-Cluster-reproducible, Patch-For-Review: Elasticsearch puppet config changes broke puppet in various instances - https://phabricator.wikimedia.org/T205672 (Krenair) Added this to deployment-prep project hiera: `profile::elasticsearch::base_data_dir: /srv/elasticsearc... [22:38:50] (CR) Dzahn: [C: 2] "this was important because with the rsync not working on icinga1001 and puppet being disabled during rsync.. running puppet on icinga1001 " [puppet] - https://gerrit.wikimedia.org/r/463387 (https://phabricator.wikimedia.org/T202782) (owner: Cwhite) [22:57:27] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:58:06] (PS3) BryanDavis: tools: Update usage of ::elasticsearch (take 2) [puppet] - https://gerrit.wikimedia.org/r/463386 (https://phabricator.wikimedia.org/T198351) [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180927T2300). [23:00:04] SMalyshev: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:01:40] (CR) Dzahn: [C: 2] "removed unnecessary ferm rules from tegmen/icinga1001. added needed ferm rule on einsteinium.
rsync on icinga1001 from einsteinium now wor" [puppet] - https://gerrit.wikimedia.org/r/463387 (https://phabricator.wikimedia.org/T202782) (owner: Cwhite) [23:02:28] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:04:07] PROBLEM - Host lvs4004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [23:04:07] PROBLEM - Host lvs4002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [23:05:45] Puppet, Discovery-Search, Beta-Cluster-reproducible, Patch-For-Review: Elasticsearch puppet config changes broke puppet in various instances - https://phabricator.wikimedia.org/T205672 (Krenair) `elasticsearch_5@beta-search` service on deployment-logstash2 is still failing shortly after starting.... [23:07:47] PROBLEM - Host lvs4001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [23:07:59] !log Finished mwscript extensions/ORES/maintenance/BackfillPageTriageQueue.php --wiki enwiki (T203286) [23:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:09] T203286: New Pages Feed: run ORES backfill script in English Wikipedia - https://phabricator.wikimedia.org/T203286 [23:08:27] PROBLEM - Host lvs4003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [23:09:06] hm, ulsfo mgmt stuff going down? [23:09:11] but just for lvs hosts... [23:09:54] Things still aren't finished after moving servers [23:09:54] XioNoX ^ [23:10:08] Reedy, so this might be a known thing?
[23:10:24] I don't know for sure, but I don't think it's serving traffic etc [23:10:31] ok [23:10:32] yeah, those LVS are decomissioned [23:10:36] even then it's only mgmt so [23:10:46] the downtime probably expired [23:10:49] I'll ack them [23:12:05] done [23:12:09] ACKNOWLEDGEMENT - Host lvs4001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi Servers decommissioned (and not plugged back after ulsfo DC move) [23:12:09] ACKNOWLEDGEMENT - Host lvs4002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi Servers decommissioned (and not plugged back after ulsfo DC move) [23:12:09] ACKNOWLEDGEMENT - Host lvs4003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi Servers decommissioned (and not plugged back after ulsfo DC move) [23:12:09] ACKNOWLEDGEMENT - Host lvs4004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi Servers decommissioned (and not plugged back after ulsfo DC move) [23:12:23] thx for the ping! [23:12:55] (PS4) BryanDavis: tools: Update usage of ::elasticsearch (take 2) [puppet] - https://gerrit.wikimedia.org/r/463386 (https://phabricator.wikimedia.org/T198351) [23:18:20] anybody for swat? [23:18:28] sorry I missed the message again :( [23:22:10] (CR) BryanDavis: [C: 1] "Tested via cherry-pick on tools-puppetmaster-01. Applied diff was minor changes to /etc/elasticsearch/labs-tools/elasticsearch.yml: node.n" [puppet] - https://gerrit.wikimedia.org/r/463386 (https://phabricator.wikimedia.org/T198351) (owner: BryanDavis) [23:23:31] SMalyshev: sure, let's SWAT. [23:23:44] (PS4) Thcipriani: Enable phrase search config [mediawiki-config] - https://gerrit.wikimedia.org/r/462351 (https://phabricator.wikimedia.org/T163642) (owner: Smalyshev) [23:24:09] great!
[23:24:29] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462351 (https://phabricator.wikimedia.org/T163642) (owner: 10Smalyshev) [23:25:38] (03Merged) 10jenkins-bot: Enable phrase search config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462351 (https://phabricator.wikimedia.org/T163642) (owner: 10Smalyshev) [23:26:09] SMalyshev: live on mwdebug2002, check please [23:26:37] checking [23:27:09] yep, works fine [23:27:24] okie doke, going live [23:27:49] (03PS5) 10Andrew Bogott: tools: Update usage of ::elasticsearch (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/463386 (https://phabricator.wikimedia.org/T198351) (owner: 10BryanDavis) [23:29:04] (03CR) 10jenkins-bot: Enable phrase search config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462351 (https://phabricator.wikimedia.org/T163642) (owner: 10Smalyshev) [23:29:14] !log thcipriani@deploy1001 Synchronized wmf-config/WikibaseSearchSettings.php: SWAT: [[gerrit:462351|Enable phrase search config]] T163642 (duration: 00m 56s) [23:29:16] thcipriani: thanks! [23:29:21] SMalyshev: yw :) [23:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:23] T163642: Index Wikidata strings in statements for fulltext search - https://phabricator.wikimedia.org/T163642 [23:29:25] also, should be live now [23:34:52] paladox: ooh. there actually IS an error in DNS for that rsync issue :) [23:35:15] tcpdump showed me it's connecting from 2620:0:861:3:208:80:154:84 [23:35:27] heh [23:35:33] 2620:0:861:1:208:80:154:84 [23:35:39] see that "3" in the middle.. meep [23:35:43] wrong row [23:37:33] it looks consistent in forward and reverse lookup. 
but it's just consistently the wrong one :) [23:40:26] oh [23:42:04] (03PS1) 10Dzahn: fix IPv6 for icinga1001, is in eqiad row C, not row A [dns] - 10https://gerrit.wikimedia.org/r/463391 (https://phabricator.wikimedia.org/T202782) [23:43:40] (03PS2) 10Dzahn: fix IPv6 for icinga1001, is in eqiad row C, not row A [dns] - 10https://gerrit.wikimedia.org/r/463391 (https://phabricator.wikimedia.org/T202782) [23:45:49] (03CR) 10Dzahn: [C: 032] "root@icinga1001:/# ip a s | grep -E 'inet6.*global'" [dns] - 10https://gerrit.wikimedia.org/r/463391 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [23:47:38] leeeroy jenkins, pls verify [23:49:28] lol [23:53:13] (03CR) 10Dzahn: [C: 032] "recheck" [dns] - 10https://gerrit.wikimedia.org/r/463391 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [23:53:45] heh, yea, it did it immediately. just have to tell it
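Note on the IPv6 mismatch discussed above: the two addresses quoted in the log differ only in the fourth 16-bit group ("3" vs "1"), which is what places them in different subnets. Below is a minimal sketch of that comparison using Python's standard `ipaddress` module. It assumes a /64 prefix length and assumes the first address is the one seen by tcpdump and the second is the one DNS was publishing; the names `prefix64`, `differing_hextets`, `observed` and `published` are hypothetical and do not come from the log.

```python
import ipaddress


def prefix64(addr: str) -> ipaddress.IPv6Network:
    """Return the /64 network containing addr (assumed prefix length)."""
    # strict=False lets us pass a host address; the host bits are masked off.
    return ipaddress.ip_network(f"{addr}/64", strict=False)


def differing_hextets(a: str, b: str) -> list[int]:
    """Return the 0-based indexes of the 16-bit groups where a and b differ."""
    groups_a = ipaddress.ip_address(a).exploded.split(":")
    groups_b = ipaddress.ip_address(b).exploded.split(":")
    return [i for i, (x, y) in enumerate(zip(groups_a, groups_b)) if x != y]


# Addresses as quoted in the log; which one is "observed" and which is
# "published" is an assumption made for this illustration.
observed = "2620:0:861:3:208:80:154:84"
published = "2620:0:861:1:208:80:154:84"

print(prefix64(observed))                         # 2620:0:861:3::/64
print(prefix64(published))                        # 2620:0:861:1::/64
print(prefix64(observed) == prefix64(published))  # False
print(differing_hextets(observed, published))     # [3]
```

A check like this only shows that the two addresses sit in different /64 networks. Forward and reverse DNS can still agree with each other, as noted in the log, while both point at the wrong subnet.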