[00:03:11] 10Operations, 10Analytics, 10Analytics-Kanban: Move internal sites hosted on thorium to ganeti instance(s) - https://phabricator.wikimedia.org/T202011 (10Dzahn) There are now: analytics-tool1001 - for superset analytics-tool1002 - for turnilo analytics-tool1003 - for hue more details on T202013#4516863 C... [00:04:24] (03PS1) 10Brian Wolff: Add wikimedia.org to allowed source list for Mathoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454180 [00:07:10] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@9fbf9fa]: Don't clean up the queue size counters [00:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:07:57] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@9fbf9fa]: Don't clean up the queue size counters (duration: 00m 46s) [00:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:07] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS2914/IPv6: Connect, AS2914/IPv4: Connect [00:10:07] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 32, down: 0, shutdown: 0 [00:19:27] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active, AS2914/IPv6: Active [00:41:57] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0 [00:41:57] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0 [00:56:49] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Shorten logstash retention temporarily - https://phabricator.wikimedia.org/T201971 (10fgiunchedi) Thanks for your help on investigating this everyone! Very helpful insights. As it stands these are the options I believe: 1. unc... 
[01:25:03] (03PS2) 10Aaron Schulz: Enable broadcasted mcrouter operations for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452592 (https://phabricator.wikimedia.org/T198239) [01:31:11] (03CR) 10Aaron Schulz: [C: 032] Enable broadcasted mcrouter operations for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452592 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [01:32:27] (03Merged) 10jenkins-bot: Enable broadcasted mcrouter operations for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452592 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [01:35:09] !log aaron@deploy1001 Synchronized wmf-config/mc.php: Enable broadcasted mcrouter operations for all wikis (duration: 00m 51s) [01:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:38:16] (03CR) 10jenkins-bot: Enable broadcasted mcrouter operations for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452592 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [01:42:06] 08Warning Alert for device cr2-esams.wikimedia.org - Inbound interface errors [02:00:07] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 30, down: 0, shutdown: 2 [02:01:48] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0 [02:06:18] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 [02:08:47] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active, AS2914/IPv6: Active [02:12:06] 08̶W̶a̶r̶n̶i̶n̶g Device cr2-esams.wikimedia.org recovered from Inbound interface errors [02:25:47] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 32, down: 0, shutdown: 0 [02:28:55] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.16) (duration: 08m 32s) [02:29:00] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:38:09] (03PS2) 10Krinkle: Remove unused config wgWMETrackGeoFeatures (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454148 [02:38:12] (03CR) 10Krinkle: [C: 032] Remove unused config wgWMETrackGeoFeatures (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454148 (owner: 10Krinkle) [02:38:20] (03PS2) 10Krinkle: Remove unused config wgWMETrackGeoFeatures (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454149 [02:38:49] AaronSchulz: finished with deployment? [02:39:12] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Tue Aug 21 02:39:12 UTC 2018 (duration 10m 18s) [02:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:39:25] Krinkle: a good while ago [02:39:29] * Krinkle staging on deploy1001/mwdebug1002 [02:39:30] thx [02:39:45] (03Merged) 10jenkins-bot: Remove unused config wgWMETrackGeoFeatures (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454148 (owner: 10Krinkle) [02:39:58] (03CR) 10jenkins-bot: Remove unused config wgWMETrackGeoFeatures (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454148 (owner: 10Krinkle) [02:42:22] (03CR) 10Krinkle: [C: 032] Remove unused config wgWMETrackGeoFeatures (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454149 (owner: 10Krinkle) [02:43:11] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: rm unused I741b16452 (duration: 00m 56s) [02:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:43:37] (03Merged) 10jenkins-bot: Remove unused config wgWMETrackGeoFeatures (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454149 (owner: 10Krinkle) [02:47:20] (03PS1) 10Jalexander: Add IP ranges for ptWiki editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454186 (https://phabricator.wikimedia.org/T202354) [02:47:31] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: rm 
unused Id5f5295d (duration: 00m 50s) [02:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:48:29] (03CR) 10jerkins-bot: [V: 04-1] Add IP ranges for ptWiki editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454186 (https://phabricator.wikimedia.org/T202354) (owner: 10Jalexander) [02:54:18] (03CR) 10jenkins-bot: Remove unused config wgWMETrackGeoFeatures (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454149 (owner: 10Krinkle) [03:05:54] * Krinkle done with deploy [03:30:58] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 841.15 seconds [03:45:21] (03PS2) 10Jalexander: Add IP ranges for ptWiki editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454186 (https://phabricator.wikimedia.org/T202354) [03:49:08] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 224.65 seconds [03:53:24] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to restricted production access and analytics-privatedata-users for Karen Brown - https://phabricator.wikimedia.org/T201668 (10Jalexander) >>! In T201668#4516444, @RobH wrote: > Please note that I'm now on clinic duty this week, so I... [03:53:41] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Patrick Earley - https://phabricator.wikimedia.org/T201667 (10Jalexander) >>! In T201667#4516460, @RobH wrote: > Please note this task is currently blocked on @PEarleyWMF logging into thei... [04:41:21] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_wikimedia/TransparencyReport] [04:54:40] !log Set max_connections back from 800 to 500 on db1073 - T188589 [04:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:46] T188589: m5-master overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589 [04:55:09] 10Operations, 10Cloud-Services, 10DBA, 10Patch-For-Review: m5-master overloaded by idle connections to the nova database - https://phabricator.wikimedia.org/T188589 (10Marostegui) nova is now using 107 connections. nova_api is using 5 connections. The general health of the connection pool is a lot better w... [05:01:59] PROBLEM - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/TransparencyReport] [05:04:47] (03PS1) 10Marostegui: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454196 [05:06:37] (03CR) 10Marostegui: "I am going to test on dbproxy1006 - pasive for m1" [puppet] - 10https://gerrit.wikimedia.org/r/454039 (https://phabricator.wikimedia.org/T201021) (owner: 10Marostegui) [05:06:47] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454196 (owner: 10Marostegui) [05:07:59] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454196 (owner: 10Marostegui) [05:09:11] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1096:3315 (duration: 00m 51s) [05:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:06] !log Stop puppet on dbproxy1006 to do some haproxy logging tests - https://phabricator.wikimedia.org/T201021 [05:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:15] (03CR) 10jenkins-bot: db-eqiad.php: Depool 
db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454196 (owner: 10Marostegui) [06:00:59] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:05:04] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10daniel) > Delete the pages and drop the namespace. Note that storage isn't reclaimed, but this should be... [06:06:18] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 [06:06:29] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 [06:23:01] 10Operations, 10DBA, 10monitoring, 10Patch-For-Review: HAproxy on dbproxy hosts lack enough logging - https://phabricator.wikimedia.org/T201021 (10Marostegui) After all the tests, these are the options needed on `db-master.cfg`: ``` option tcplog option log-health-checks log /dev/log local0 in... 
[06:23:13] (03PS2) 10Marostegui: db-master.cfg: Enable haproxy health-check logging [puppet] - 10https://gerrit.wikimedia.org/r/454039 (https://phabricator.wikimedia.org/T201021) [06:23:18] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454197 [06:24:56] (03PS3) 10Marostegui: db-master.cfg: Enable haproxy health-check logging [puppet] - 10https://gerrit.wikimedia.org/r/454039 (https://phabricator.wikimedia.org/T201021) [06:25:09] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454197 (owner: 10Marostegui) [06:26:19] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:26:27] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454197 (owner: 10Marostegui) [06:27:29] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1096:3315 (duration: 00m 50s) [06:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:09] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml] [06:30:55] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454197 (owner: 10Marostegui) [06:58:29] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:07:41] 10Operations, 10Quarry, 10cloud-services-team (Kanban): Let quarry use the mariadb module - https://phabricator.wikimedia.org/T181205 (10jcrespo) Actually, instead of: ``` class {'mariadb::packages_wmf': class {'mariadb::service': ``` `class { 'mariadb::packages'` should work to install the regular service... 
[07:20:40] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10jcrespo) > dropping namespaces is not officially supported by MediaWiki at all Also deleting data such a... [07:23:52] (03PS3) 10Muehlenhoff: Enable intel-microcode for all bare metal servers with an Intel CPU [puppet] - 10https://gerrit.wikimedia.org/r/453997 (https://phabricator.wikimedia.org/T127825) [07:34:08] (03CR) 10Thiemo Kreuz (WMDE): [C: 031] Enable moved paragrah detection everywhere (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454026 (https://phabricator.wikimedia.org/T199800) (owner: 10WMDE-Fisch) [07:47:29] (03CR) 10Muehlenhoff: [C: 032] Enable intel-microcode for all bare metal servers with an Intel CPU [puppet] - 10https://gerrit.wikimedia.org/r/453997 (https://phabricator.wikimedia.org/T127825) (owner: 10Muehlenhoff) [07:52:23] (03CR) 10Marostegui: [C: 032] db-master.cfg: Enable haproxy health-check logging [puppet] - 10https://gerrit.wikimedia.org/r/454039 (https://phabricator.wikimedia.org/T201021) (owner: 10Marostegui) [07:52:31] (03PS4) 10Marostegui: db-master.cfg: Enable haproxy health-check logging [puppet] - 10https://gerrit.wikimedia.org/r/454039 (https://phabricator.wikimedia.org/T201021) [07:54:06] 10Operations, 10TCB-Team, 10WMDE-QWERTY-Team, 10wikidiff2: Release wikidiff2 v1.7.3 and update the production servers - https://phabricator.wikimedia.org/T202301 (10Pipetricker) [07:55:38] !log Slowly reload haproxies on dbproxy10XX - T201021 [07:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:44] T201021: HAproxy on dbproxy hosts lack enough logging - https://phabricator.wikimedia.org/T201021 [07:56:17] (03PS5) 10Volans: Add dnsdisc module to manipulate DNS Discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/451814 (https://phabricator.wikimedia.org/T199079) 
[07:56:19] (03PS3) 10Volans: Add a retry decorator [software/spicerack] - 10https://gerrit.wikimedia.org/r/453994 (https://phabricator.wikimedia.org/T199079) [07:59:51] (03PS1) 10Muehlenhoff: Remove enable_microcode logic [puppet] - 10https://gerrit.wikimedia.org/r/454203 [08:00:36] (03CR) 10jerkins-bot: [V: 04-1] Remove enable_microcode logic [puppet] - 10https://gerrit.wikimedia.org/r/454203 (owner: 10Muehlenhoff) [08:02:48] 10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar): Deploy mcrouter to production as a wancache backend - https://phabricator.wikimedia.org/T192370 (10aaron) [08:05:06] (03PS2) 10Matěj Suchánek: Update several Wikidata-related configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449017 [08:05:36] (03CR) 10Gehel: [C: 031] Add dnsdisc module to manipulate DNS Discovery (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/451814 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:06:57] (03CR) 10Matěj Suchánek: Update several Wikidata-related configs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449017 (owner: 10Matěj Suchánek) [08:10:30] (03CR) 10Gehel: [C: 031] "Minor comment inline, but the current version is already good enough to be merged." 
(031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/453373 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [08:14:01] (03PS1) 10Marostegui: db-replicas.cfg: Enable healthcheck logging [puppet] - 10https://gerrit.wikimedia.org/r/454205 (https://phabricator.wikimedia.org/T201021) [08:14:39] (03CR) 10jerkins-bot: [V: 04-1] db-replicas.cfg: Enable healthcheck logging [puppet] - 10https://gerrit.wikimedia.org/r/454205 (https://phabricator.wikimedia.org/T201021) (owner: 10Marostegui) [08:15:52] (03PS2) 10Marostegui: db-replicas.cfg: Enable healthcheck logging [puppet] - 10https://gerrit.wikimedia.org/r/454205 (https://phabricator.wikimedia.org/T201021) [08:16:38] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/454203 (owner: 10Muehlenhoff) [08:16:51] (03CR) 10Marostegui: [C: 032] db-replicas.cfg: Enable healthcheck logging [puppet] - 10https://gerrit.wikimedia.org/r/454205 (https://phabricator.wikimedia.org/T201021) (owner: 10Marostegui) [08:17:23] (03CR) 10jerkins-bot: [V: 04-1] Remove enable_microcode logic [puppet] - 10https://gerrit.wikimedia.org/r/454203 (owner: 10Muehlenhoff) [08:17:25] (03CR) 10DCausse: search.wikimedia.org should properly handle multivalue separation char (0x1F) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446801 (owner: 10DCausse) [08:17:34] (03PS3) 10DCausse: search.wikimedia.org should properly handle multivalue separation char (0x1F) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446801 [08:19:24] 10Operations, 10DBA, 10monitoring, 10Patch-For-Review: HAproxy on dbproxy hosts lack enough logging - https://phabricator.wikimedia.org/T201021 (10Marostegui) 05Open>03Resolved a:03Marostegui Deployed, reloaded haproxies everywhere and I can see now the checks on the logs. 
[08:19:52] (03CR) 10Gehel: search.wikimedia.org should properly handle multivalue separation char (0x1F) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446801 (owner: 10DCausse) [08:21:27] (03CR) 10DCausse: search.wikimedia.org should properly handle multivalue separation char (0x1F) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446801 (owner: 10DCausse) [08:21:54] (03PS4) 10DCausse: search.wikimedia.org should properly handle multivalue separation char (0x1F) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446801 [08:25:47] (03PS2) 10Muehlenhoff: Remove enable_microcode logic [puppet] - 10https://gerrit.wikimedia.org/r/454203 [08:27:02] (03PS2) 10Gehel: elasticsearch: storage device name changed with new partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/453094 (https://phabricator.wikimedia.org/T198391) [08:27:04] (03CR) 10jerkins-bot: [V: 04-1] Remove enable_microcode logic [puppet] - 10https://gerrit.wikimedia.org/r/454203 (owner: 10Muehlenhoff) [08:27:09] (03PS3) 10Gehel: elasticsearch: storage device name changed with new partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/453094 (https://phabricator.wikimedia.org/T198391) [08:28:42] (03PS3) 10Muehlenhoff: Remove enable_microcode logic [puppet] - 10https://gerrit.wikimedia.org/r/454203 [08:29:23] (03CR) 10Addshore: [C: 031] Update several Wikidata-related configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449017 (owner: 10Matěj Suchánek) [08:33:49] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Samuel Guebo - https://phabricator.wikimedia.org/T202362 (10Jalexander) [08:35:57] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Samuel Guebo - https://phabricator.wikimedia.org/T202362 (10JanWMF) approved [08:37:02] !log Rename blob_tracking and blob_orphans on db1089 - T59186 
[08:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:09] T59186: Drop blob_tracking and blob_orphans everywhere - https://phabricator.wikimedia.org/T59186 [08:40:55] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Ty Hargrove - https://phabricator.wikimedia.org/T202363 (10Jalexander) [08:41:25] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Samuel Guebo - https://phabricator.wikimedia.org/T202362 (10Jalexander) [08:42:13] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Ty Hargrove - https://phabricator.wikimedia.org/T202363 (10JanWMF) approved [08:46:18] (03PS2) 10Vgutierrez: [WIP] Certcentral integration tests [software/certcentral] - 10https://gerrit.wikimedia.org/r/454045 [08:47:57] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Certcentral integration tests [software/certcentral] - 10https://gerrit.wikimedia.org/r/454045 (owner: 10Vgutierrez) [08:51:24] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler02/12149/" [puppet] - 10https://gerrit.wikimedia.org/r/454203 (owner: 10Muehlenhoff) [08:54:47] (03PS1) 10Jcrespo: mariadb: Depool cluster24 (es2) from new writes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454210 (https://phabricator.wikimedia.org/T202364) [08:56:48] 10Operations, 10Wikimedia-Mailing-lists: Password reset request for wikimedia-nd mailing list - https://phabricator.wikimedia.org/T202247 (10Geekdidi) Awesome, I'll be waiting for the link :) [08:56:57] (03CR) 10Marostegui: "I assume you'd still set up the master on read only on a MySQL level?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/454210 (https://phabricator.wikimedia.org/T202364) (owner: 10Jcrespo) [08:58:23] (03CR) 10Jcrespo: "> I assume you'd still set up the master on read only on a MySQL" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454210 (https://phabricator.wikimedia.org/T202364) (owner: 10Jcrespo) [09:03:42] (03PS2) 10Jcrespo: mariadb: Depool cluster24 (es2) from new writes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454210 (https://phabricator.wikimedia.org/T202364) [09:04:25] (03PS1) 10Jcrespo: mariadb: Promote es1015 to es2 master and repool es2 for writes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454211 (https://phabricator.wikimedia.org/T202364) [09:05:49] (03PS2) 10Jcrespo: mariadb: Promote es1015 to es2 master and repool es2 for writes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454211 (https://phabricator.wikimedia.org/T202364) [09:08:25] (03PS1) 10MarcoAurelio: Configure gendered namespaces for pl.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454213 (https://phabricator.wikimedia.org/T202347) [09:14:18] (03CR) 10Gehel: [C: 032] elasticsearch: storage device name changed with new partitioning scheme [puppet] - 10https://gerrit.wikimedia.org/r/453094 (https://phabricator.wikimedia.org/T198391) (owner: 10Gehel) [09:15:34] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp3045.esams.wmnet', 'cp4026.ulsfo.wmnet', 'cp5001.eqsin.wmnet'] ``` The log c... 
[09:15:50] (03CR) 10MarcoAurelio: Configure gendered namespaces for pl.wiktionary (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454213 (https://phabricator.wikimedia.org/T202347) (owner: 10MarcoAurelio) [09:16:13] !log installing jetty9 security updates [09:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:27] jouncebot: next [09:17:27] In 1 hour(s) and 42 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180821T1100) [09:18:20] (03PS1) 10Jcrespo: mariadb: Promote es1015 to be the new es2 master instead of es1011 [puppet] - 10https://gerrit.wikimedia.org/r/454214 (https://phabricator.wikimedia.org/T202364) [09:27:48] !log installing mutt security updates [09:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:53] 10Operations, 10DBA: rack/setup/install dbproxy101[2-7].eqiad.wmnet - https://phabricator.wikimedia.org/T196690 (10Marostegui) 05Open>03Resolved [09:32:51] (03CR) 10Marostegui: [C: 031] mariadb: Promote es1015 to be the new es2 master instead of es1011 [puppet] - 10https://gerrit.wikimedia.org/r/454214 (https://phabricator.wikimedia.org/T202364) (owner: 10Jcrespo) [09:35:49] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[ca-certificates] [09:37:05] (03CR) 10Marostegui: [C: 031] mariadb: Promote es1015 to es2 master and repool es2 for writes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454211 (https://phabricator.wikimedia.org/T202364) (owner: 10Jcrespo) [09:43:37] (03PS1) 10Zhuyifei1999: mariadb::packages: Unversion server and client [puppet] - 10https://gerrit.wikimedia.org/r/454217 (https://phabricator.wikimedia.org/T181205) [09:46:13] !log installing libxml2 security updates on trusty [09:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:57] (03PS1) 10Jcrespo: mariadb: Point es2-master to es1015 after master switchover [dns] - 10https://gerrit.wikimedia.org/r/454219 (https://phabricator.wikimedia.org/T202364) [09:50:23] (03PS1) 10Muehlenhoff: Add library hint for libxcursor [puppet] - 10https://gerrit.wikimedia.org/r/454220 [09:50:40] (03CR) 10Marostegui: [C: 031] mariadb: Point es2-master to es1015 after master switchover [dns] - 10https://gerrit.wikimedia.org/r/454219 (https://phabricator.wikimedia.org/T202364) (owner: 10Jcrespo) [09:51:08] (03CR) 10Jcrespo: "This looks great, please give me some time to check we are not breaking other usages of this." 
[puppet] - 10https://gerrit.wikimedia.org/r/454217 (https://phabricator.wikimedia.org/T181205) (owner: 10Zhuyifei1999) [09:51:10] PROBLEM - Memory correctable errors -EDAC- on scb1002 is CRITICAL: 5.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=scb1002&var-datasource=eqiad%2520prometheus%252Fops [09:51:54] (03PS2) 10Muehlenhoff: Add library hint for libxcursor [puppet] - 10https://gerrit.wikimedia.org/r/454220 [09:52:02] (03CR) 10Zhuyifei1999: "Thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/454217 (https://phabricator.wikimedia.org/T181205) (owner: 10Zhuyifei1999) [09:52:39] (03PS1) 10Hashar: nodepool: reduce number of instances [puppet] - 10https://gerrit.wikimedia.org/r/454222 (https://phabricator.wikimedia.org/T201972) [09:54:41] (03CR) 10Muehlenhoff: [C: 032] Add library hint for libxcursor [puppet] - 10https://gerrit.wikimedia.org/r/454220 (owner: 10Muehlenhoff) [09:57:31] (03CR) 10Jcrespo: "I saw no other usages of packages.pp, so this should be ok to be deployed. One comment down here that should be easy to solve." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/454217 (https://phabricator.wikimedia.org/T181205) (owner: 10Zhuyifei1999) [09:58:53] !log installing libxcursor security updates [09:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:54] (03PS2) 10Zhuyifei1999: mariadb::packages: Unversion server and client [puppet] - 10https://gerrit.wikimedia.org/r/454217 (https://phabricator.wikimedia.org/T181205) [10:01:00] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [10:02:38] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp4026.ulsfo.wmnet'] ``` Of which those **FAILED**: ``` ['cp4026.ulsfo.wmnet'] ``` [10:06:55] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp4027.ulsfo.wmnet', 'cp3042.esams.wmnet', 'cp2007.codfw.wmnet'] ``` The log c... [10:07:38] (03CR) 10Zfilipin: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/453696 (https://phabricator.wikimedia.org/T202139) (owner: 10Urbanecm) [10:17:07] (03CR) 10Lucas Werkmeister (WMDE): [C: 031] Update several Wikidata-related configs (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449017 (owner: 10Matěj Suchánek) [10:28:58] (03CR) 10Lucas Werkmeister (WMDE): [C: 031] "Added to today’s EU SWAT." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/449017 (owner: 10Matěj Suchánek) [10:30:32] (03PS3) 10Lucas Werkmeister (WMDE): Update several Wikidata-related configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449017 (owner: 10Matěj Suchánek) [10:30:52] (03CR) 10Lucas Werkmeister (WMDE): [C: 031] Update several Wikidata-related configs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449017 (owner: 10Matěj Suchánek) [10:33:04] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:37:39] (03PS1) 10Marostegui: production.my.cnf: Set expire_log_days to 30 [puppet] - 10https://gerrit.wikimedia.org/r/454228 [10:39:27] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp4027.ulsfo.wmnet'] ``` Of which those **FAILED**: ``` ['cp4027.ulsfo.wmnet'] ``` [10:39:30] (03CR) 10Jcrespo: [C: 031] production.my.cnf: Set expire_log_days to 30 [puppet] - 10https://gerrit.wikimedia.org/r/454228 (owner: 10Marostegui) [10:40:10] (03CR) 10Marostegui: [C: 032] production.my.cnf: Set expire_log_days to 30 [puppet] - 10https://gerrit.wikimedia.org/r/454228 (owner: 10Marostegui) [10:48:35] (03PS4) 10Volans: Add service locator class Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/453373 (https://phabricator.wikimedia.org/T199079) [10:48:37] (03PS6) 10Volans: Add dnsdisc module to manipulate DNS Discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/451814 (https://phabricator.wikimedia.org/T199079) [10:48:39] (03PS4) 10Volans: Add a retry decorator [software/spicerack] - 10https://gerrit.wikimedia.org/r/453994 (https://phabricator.wikimedia.org/T199079) [10:48:41] (03PS1) 10Volans: Make all library instances 'immutable' [software/spicerack] - 10https://gerrit.wikimedia.org/r/454230 
(https://phabricator.wikimedia.org/T199079) [10:49:31] (03CR) 10Matěj Suchánek: Update several Wikidata-related configs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449017 (owner: 10Matěj Suchánek) [10:54:13] RIP wikibugs_ [10:54:24] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:54:33] sigh [10:54:58] Krenair / moritzm / robh please /msg Sigyn unkline wikibugs_ [10:56:00] or AlexZ [10:59:51] Hauskatze: No matches wikibugs_ in recent bans from #wikimedia-operations [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180821T1100). [11:00:04] jamesofur and Lucas_WMDE: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:17] p858snake: perhaps the kill was triggered at other channel? [11:00:39] I can swat today [11:00:45] -dev i bet [11:00:45] I'd ask a freenode staffer but for some reason I can't talk at #freenode [11:00:49] Lucas_WMDE: around for SWAT? 
[11:01:02] Hauskatze: try sigyn's channel [11:01:12] freenode-sigyn iirc [11:01:21] zeljkof: I’m here [11:02:39] Lucas_WMDE: you're next, I'll ping you in a few minutes when I deploy the first patch [11:02:47] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:02:48] ok thanks [11:03:17] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:04:08] in the meantime I’ll try to figure out if I can test the change ^^ [11:05:11] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:454186|Add IP ranges for ptWiki editathon (T202354)]] (duration: 00m 52s) [11:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:19] T202354: Temporary remove account creation limit for event on Portuguese Wikipedia on August 21, 2018 - https://phabricator.wikimedia.org/T202354 [11:05:39] Lucas_WMDE: is there a task related to the patch? [11:05:51] not as far as I know [11:06:10] the commit message does not say much :/ [11:06:16] okay, I found one case where I’ll be able to tes tit [11:06:18] *test it [11:06:20] I mean, what's the patch doing at all :D [11:06:27] nothing? just cleaup? [11:06:34] cleanup, that is [11:06:37] I can improve the commit message if you want :) [11:06:40] cleanup, yeah [11:06:42] please do [11:06:46] ok [11:07:05] I'm reluctant to merge and deploy a commit that I have no clue about :D [11:10:07] zeljkof: better now? [11:11:01] the second part of the commit message is the one I can test [11:11:20] Lucas_WMDE: looks good! I'll ping you when it's at mwdebug1002 [11:11:24] ok thanks! [11:15:01] Lucas_WMDE: it's at mwdebug1002 [11:15:16] zeljkof: seems to be working [11:15:25] Lucas_WMDE: ok to deploy? [11:15:28] yes [11:15:50] deploying... 
[11:16:32] !log zfilipin@deploy1001 Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:449017|Update several Wikidata-related configs]] (duration: 00m 50s) [11:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:39] Lucas_WMDE: deployed! [11:16:46] zeljkof: great, thank you! [11:16:54] !log EU SWAT finished [11:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:30] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1097:3315 (duration: 00m 50s) [11:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:07] p858snake: No matches wikibugs_ in recent bans from #wikimedia-operations,#mediawiki,#wikimedia-dev [11:49:58] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp2008.codfw.wmnet', 'cp2023.codfw.wmnet', 'cp3047.esams.wmnet'] ``` The log c... [11:52:35] (03CR) 10Alexandros Kosiaris: [C: 032] osm: The master is osmdb.eqiad.wmnet, not labsdb1007 [puppet] - 10https://gerrit.wikimedia.org/r/454255 (https://phabricator.wikimedia.org/T197246) (owner: 10Alexandros Kosiaris) [11:53:32] p858snake, okay, spoke to freenode staff and it's fixed [11:53:41] just restarted the bot and then realised it had already reconnected. 
oops [11:53:50] (03CR) 10Jcrespo: [C: 032] mariadb: Depool es1015 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454254 (https://phabricator.wikimedia.org/T202364) (owner: 10Jcrespo) [11:55:03] (03Merged) 10jenkins-bot: mariadb: Depool es1015 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454254 (https://phabricator.wikimedia.org/T202364) (owner: 10Jcrespo) [11:55:46] 10Operations, 10Traffic: Traffic Server - Prometheus integration - https://phabricator.wikimedia.org/T202381 (10ema) [11:55:55] 10Operations, 10Traffic: Traffic Server - Prometheus integration - https://phabricator.wikimedia.org/T202381 (10ema) p:05Triage>03Normal [11:56:36] PROBLEM - IPsec on cp1075 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp2023_v4, cp2023_v6 [11:56:36] PROBLEM - IPsec on cp1090 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2008_v4, cp2008_v6, cp3047_v4, cp3047_v6 [11:56:37] PROBLEM - IPsec on cp1078 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2008_v4, cp2008_v6, cp3047_v4, cp3047_v6 [11:56:37] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp2023_v4, cp2023_v6 [11:56:37] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp2008_v4, cp2008_v6 [11:56:46] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp3047_v4, cp3047_v6 [11:56:46] PROBLEM - IPsec on cp5009 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp2023_v4, cp2023_v6 [11:56:46] PROBLEM - IPsec on cp5008 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp2023_v4, cp2023_v6 [11:56:46] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp2008_v4, cp2008_v6 [11:56:46] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp2023_v4, cp2023_v6 [11:56:46] PROBLEM - IPsec on cp5005 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp2008_v4, cp2008_v6 [11:56:47] PROBLEM - IPsec on cp1086 is CRITICAL: Strongswan 
CRITICAL - ok: 64 not-conn: cp2008_v4, cp2008_v6, cp3047_v4, cp3047_v6 [11:56:47] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp3047_v4, cp3047_v6 [11:56:48] PROBLEM - IPsec on cp1076 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2008_v4, cp2008_v6, cp3047_v4, cp3047_v6 [11:56:56] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp3047_v4, cp3047_v6 [11:56:56] PROBLEM - IPsec on cp4029 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp2023_v4, cp2023_v6 [11:56:56] PROBLEM - IPsec on cp4030 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp2023_v4, cp2023_v6 [11:56:56] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp2008_v4, cp2008_v6 [11:56:56] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp3047_v4, cp3047_v6 [11:56:57] PROBLEM - IPsec on cp1079 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp2023_v4, cp2023_v6 [11:56:57] PROBLEM - IPsec on cp1085 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp2023_v4, cp2023_v6 [11:56:57] PROBLEM - IPsec on cp1087 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp2023_v4, cp2023_v6 [11:56:57] PROBLEM - IPsec on cp1084 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2008_v4, cp2008_v6, cp3047_v4, cp3047_v6 [11:56:58] PROBLEM - IPsec on cp1088 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2008_v4, cp2008_v6, cp3047_v4, cp3047_v6 [11:56:58] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp3047_v4, cp3047_v6 [11:57:06] PROBLEM - IPsec on cp1081 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp2023_v4, cp2023_v6 [11:57:06] PROBLEM - IPsec on cp4028 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp2023_v4, cp2023_v6 [11:57:06] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp3047_v4, cp3047_v6 [11:57:06] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp3047_v4, 
cp3047_v6 [11:57:07] PROBLEM - IPsec on cp4032 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp2023_v4, cp2023_v6 [11:57:07] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp2008_v4, cp2008_v6 [11:57:07] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp2023_v4, cp2023_v6 [11:57:07] PROBLEM - IPsec on cp1083 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp2023_v4, cp2023_v6 [11:57:08] PROBLEM - IPsec on cp1082 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2008_v4, cp2008_v6, cp3047_v4, cp3047_v6 [11:57:08] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp3047_v4, cp3047_v6 [11:57:10] 10Operations, 10Traffic: Traffic Server - Prometheus integration - https://phabricator.wikimedia.org/T202381 (10ema) [11:57:16] PROBLEM - IPsec on cp5002 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp2008_v4, cp2008_v6 [11:57:16] PROBLEM - IPsec on cp5007 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp2023_v4, cp2023_v6 [11:57:16] PROBLEM - IPsec on cp5011 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp2023_v4, cp2023_v6 [11:57:17] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 62 not-conn: cp3047_v4, cp3047_v6 [11:57:17] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp2008_v4, cp2008_v6 [11:57:17] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp2008_v4, cp2008_v6 [11:57:17] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp2023_v4, cp2023_v6 [11:57:18] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp2008_v4, cp2008_v6 [11:57:18] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp2008_v4, cp2008_v6 [11:57:26] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp2008_v4, cp2008_v6 [11:57:26] PROBLEM - IPsec on cp5010 is CRITICAL: Strongswan CRITICAL - ok: 30 
not-conn: cp2023_v4, cp2023_v6 [11:57:26] PROBLEM - IPsec on cp1077 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp2023_v4, cp2023_v6 [11:57:27] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp2008_v4, cp2008_v6 [11:57:35] sorry about that ^ [11:57:36] PROBLEM - IPsec on cp4031 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp2023_v4, cp2023_v6 [11:57:36] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp2008_v4, cp2008_v6 [11:57:36] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp2008_v4, cp2008_v6 [11:57:36] PROBLEM - IPsec on cp1089 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp2023_v4, cp2023_v6 [11:57:36] PROBLEM - IPsec on cp1080 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2008_v4, cp2008_v6, cp3047_v4, cp3047_v6 [11:57:36] PROBLEM - IPsec on cp5006 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp2008_v4, cp2008_v6 [11:57:37] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp2008_v4, cp2008_v6 [11:57:37] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp2008_v4, cp2008_v6 [11:57:37] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp2023_v4, cp2023_v6 [11:57:38] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp2008_v4, cp2008_v6 [11:57:46] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp2008_v4, cp2008_v6 [11:57:46] PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp2008_v4, cp2008_v6 [11:57:46] PROBLEM - IPsec on cp5012 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp2023_v4, cp2023_v6 [11:58:31] (03CR) 10jenkins-bot: mariadb: Depool es1015 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454254 (https://phabricator.wikimedia.org/T202364) (owner: 10Jcrespo) [11:59:05] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: 
Depool es1015 (duration: 00m 50s) [11:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:35] (03PS3) 10Ema: ATS: add caching rules support [puppet] - 10https://gerrit.wikimedia.org/r/453960 (https://phabricator.wikimedia.org/T199720) [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180821T1200) [12:01:29] PROBLEM - puppet last run on labsdb1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[pg_basebackup-osmdb.eqiad.wmnet] [12:01:49] (03CR) 10Ema: [C: 032] ATS: add caching rules support [puppet] - 10https://gerrit.wikimedia.org/r/453960 (https://phabricator.wikimedia.org/T199720) (owner: 10Ema) [12:05:55] !log stop es1015 for upgrade [12:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:35] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454257 [12:09:31] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 36 ESP OK [12:09:31] RECOVERY - IPsec on cp5006 is OK: Strongswan OK - 36 ESP OK [12:09:40] RECOVERY - IPsec on cp1083 is OK: Strongswan OK - 52 ESP OK [12:09:40] RECOVERY - IPsec on cp4028 is OK: Strongswan OK - 32 ESP OK [12:09:41] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 32 ESP OK [12:09:41] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 36 ESP OK [12:09:41] RECOVERY - IPsec on cp4032 is OK: Strongswan OK - 32 ESP OK [12:09:50] RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 36 ESP OK [12:09:51] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 36 ESP OK [12:09:51] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 36 ESP OK [12:10:00] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 32 ESP OK [12:10:00] RECOVERY - IPsec on cp5007 is OK: Strongswan OK - 32 ESP OK [12:10:00] RECOVERY - IPsec on cp5011 is OK: Strongswan OK - 32 ESP OK [12:10:00] RECOVERY - 
IPsec on cp5002 is OK: Strongswan OK - 36 ESP OK [12:10:00] RECOVERY - IPsec on cp1077 is OK: Strongswan OK - 52 ESP OK [12:10:00] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 36 ESP OK [12:10:00] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 36 ESP OK [12:10:01] RECOVERY - IPsec on cp1075 is OK: Strongswan OK - 52 ESP OK [12:10:10] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 36 ESP OK [12:10:11] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 36 ESP OK [12:10:11] RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 32 ESP OK [12:10:12] RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 36 ESP OK [12:10:20] RECOVERY - IPsec on cp5012 is OK: Strongswan OK - 32 ESP OK [12:10:20] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 36 ESP OK [12:10:21] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 36 ESP OK [12:10:21] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 32 ESP OK [12:10:21] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454257 [12:10:21] RECOVERY - IPsec on cp5008 is OK: Strongswan OK - 32 ESP OK [12:10:22] RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 36 ESP OK [12:10:22] RECOVERY - IPsec on cp5009 is OK: Strongswan OK - 32 ESP OK [12:10:30] RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 36 ESP OK [12:10:30] RECOVERY - IPsec on cp4031 is OK: Strongswan OK - 32 ESP OK [12:10:30] RECOVERY - IPsec on cp1079 is OK: Strongswan OK - 52 ESP OK [12:10:30] RECOVERY - IPsec on cp1085 is OK: Strongswan OK - 52 ESP OK [12:10:31] RECOVERY - IPsec on cp4029 is OK: Strongswan OK - 32 ESP OK [12:10:31] RECOVERY - IPsec on cp4030 is OK: Strongswan OK - 32 ESP OK [12:10:31] RECOVERY - IPsec on cp1087 is OK: Strongswan OK - 52 ESP OK [12:10:40] RECOVERY - IPsec on cp1081 is OK: Strongswan OK - 52 ESP OK [12:11:47] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454257 (owner: 10Marostegui) [12:12:21] 
PROBLEM - Disk space on ms-be2020 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sde1 is not accessible: Input/output error [12:13:01] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454257 (owner: 10Marostegui) [12:13:10] RECOVERY - IPsec on cp1078 is OK: Strongswan OK - 68 ESP OK [12:13:10] RECOVERY - IPsec on cp1090 is OK: Strongswan OK - 68 ESP OK [12:13:21] RECOVERY - IPsec on cp1086 is OK: Strongswan OK - 68 ESP OK [12:13:21] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 64 ESP OK [12:13:21] PROBLEM - HHVM rendering on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:13:21] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 64 ESP OK [12:13:21] RECOVERY - IPsec on cp1076 is OK: Strongswan OK - 68 ESP OK [12:13:21] PROBLEM - SSH on ms-be2020 is CRITICAL: Server answer [12:13:30] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 64 ESP OK [12:13:31] RECOVERY - IPsec on cp1080 is OK: Strongswan OK - 68 ESP OK [12:13:31] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 64 ESP OK [12:13:31] RECOVERY - IPsec on cp1084 is OK: Strongswan OK - 68 ESP OK [12:13:40] RECOVERY - IPsec on cp1088 is OK: Strongswan OK - 68 ESP OK [12:13:40] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 64 ESP OK [12:13:41] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 64 ESP OK [12:13:41] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 64 ESP OK [12:13:41] RECOVERY - IPsec on cp1082 is OK: Strongswan OK - 68 ESP OK [12:13:50] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 64 ESP OK [12:13:51] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 64 ESP OK [12:13:56] !log rolling restart of services in scb/eqiad to pick up the nodejs security update [12:13:59] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1097:3315 (duration: 00m 49s) [12:14:00] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 32 ESP OK [12:14:00] RECOVERY - IPsec on cp3034 is OK: Strongswan 
OK - 36 ESP OK [12:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:11] RECOVERY - HHVM rendering on mw1235 is OK: HTTP OK: HTTP/1.1 200 OK - 75021 bytes in 0.108 second response time [12:14:13] (03PS1) 10Marostegui: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454259 [12:14:20] RECOVERY - IPsec on cp5010 is OK: Strongswan OK - 32 ESP OK [12:14:27] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454257 (owner: 10Marostegui) [12:15:39] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454259 (owner: 10Marostegui) [12:15:41] RECOVERY - IPsec on cp1089 is OK: Strongswan OK - 52 ESP OK [12:15:41] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 36 ESP OK [12:15:41] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 36 ESP OK [12:16:20] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp2023 is CRITICAL: connect to address 10.192.48.27 and port 3126: Connection refused [12:16:20] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp2023 is CRITICAL: NRPE: Command check_confd_etc_varnish_directors.frontend.vcl not defined [12:16:21] PROBLEM - puppet last run on cp2023 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 25 seconds ago with 3 failures. 
Failed resources (up to 3 shown): Package[varnishkafka],Service[varnishmtail],Package[mtail],Exec[retry-load-new-vcl-file] [12:16:31] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:16:31] PROBLEM - eventlogging Varnishkafka log producer on cp2023 is CRITICAL: NRPE: Command check_varnishkafka-eventlogging not defined [12:16:32] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp2023 is CRITICAL: NRPE: Command check_confd_etc_varnish_directors.backend.vcl not defined [12:16:55] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454259 (owner: 10Marostegui) [12:17:31] RECOVERY - Confd template for /etc/varnish/directors.backend.vcl on cp2023 is OK: No errors detected [12:17:31] RECOVERY - eventlogging Varnishkafka log producer on cp2023 is OK: PROCS OK: 1 process with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf [12:17:31] (03PS1) 10Jcrespo: Revert "mariadb: Depool es1015 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454261 [12:18:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1100 (duration: 00m 49s) [12:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:09] (03PS2) 10Jcrespo: Revert "mariadb: Depool es1015 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454261 [12:18:20] RECOVERY - Confd template for /etc/varnish/directors.frontend.vcl on cp2023 is OK: No errors detected [12:19:20] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp2023 is OK: HTTP OK: HTTP/1.1 200 OK - 498 bytes in 0.072 second response time [12:20:40] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:20:58] (03PS3) 10Jcrespo: Revert "mariadb: Depool es1015 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454261 [12:21:08] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2008.codfw.wmnet', 'cp2023.codfw.wmnet', 'cp3047.esams.wmnet'] ``` and were **ALL** successful. [12:21:21] RECOVERY - puppet last run on cp2023 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:22:32] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454262 [12:23:10] RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 36 ESP OK [12:24:12] jynus: are you going to merge soon? [12:24:25] Or should I go ahead with my change? [12:28:34] ok, I will merge now [12:28:39] cool [12:28:44] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool es1015 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454261 (owner: 10Jcrespo) [12:29:59] (03Merged) 10jenkins-bot: Revert "mariadb: Depool es1015 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454261 (owner: 10Jcrespo) [12:30:20] PROBLEM - swift-account-reaper on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:30:29] PROBLEM - puppet last run on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:30:30] PROBLEM - swift-container-auditor on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:30:30] PROBLEM - swift-object-server on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[12:30:33] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454262 [12:30:39] PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:30:39] PROBLEM - swift-container-server on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:30:39] PROBLEM - swift-container-replicator on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:30:40] PROBLEM - swift-object-auditor on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:30:49] PROBLEM - dhclient process on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:30:50] PROBLEM - very high load average likely xfs on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:30:59] PROBLEM - swift-container-updater on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:31:07] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454259 (owner: 10Marostegui) [12:31:09] (03CR) 10jenkins-bot: Revert "mariadb: Depool es1015 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454261 (owner: 10Jcrespo) [12:31:49] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool es1015 (duration: 00m 50s) [12:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:59] PROBLEM - swift-account-server on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:31:59] PROBLEM - swift-account-replicator on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[12:32:01] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454262 (owner: 10Marostegui) [12:32:09] PROBLEM - configured eth on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:32:09] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:32:39] PROBLEM - DPKG on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:32:49] PROBLEM - HP RAID on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:33:13] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454262 (owner: 10Marostegui) [12:33:39] PROBLEM - swift-object-server on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:33:40] PROBLEM - swift-container-server on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:33:49] PROBLEM - swift-container-replicator on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:33:58] 10Operations, 10ops-eqiad, 10DC-Ops: Replace memory bank on scb1002 - https://phabricator.wikimedia.org/T196901 (10MoritzMuehlenhoff) @Cmjohnson Let me know the next time you're in the DC and I'll disable the host for diagnostics. [12:33:59] PROBLEM - swift-account-server on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:33:59] PROBLEM - swift-account-auditor on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:34:00] PROBLEM - very high load average likely xfs on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[12:34:16] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1100 (duration: 00m 50s) [12:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:44] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on scb1002 is CRITICAL: 5.001 ge 4 Muehlenhoff T196901 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=scb1002&var-datasource=eqiad%2520prometheus%252Fops [12:35:10] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:36:09] PROBLEM - swift-container-updater on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:36:10] PROBLEM - swift-object-replicator on ms-be2020 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:36:28] ms-be2020 seems to be in trouble [12:36:39] ** 463 printk messages dropped ** [14594094.219756] sd 0:1:0:0: rejecting I/O to offline device [12:36:56] that's what I get in console, repeatedly ^ [12:37:52] power-cycling [12:38:31] !log power-cycle ms-be2020: multiple alerts, no ssh access, I/O errors in console [12:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:50] PROBLEM - Host ms-be2020 is DOWN: PING CRITICAL - Packet loss = 100% [12:42:40] RECOVERY - swift-container-auditor on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:42:40] RECOVERY - swift-object-server on ms-be2020 is OK: PROCS OK: 73 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [12:42:40] RECOVERY - DPKG on ms-be2020 is OK: All packages OK [12:42:49] RECOVERY - swift-container-server on ms-be2020 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [12:42:49] RECOVERY - swift-object-auditor on ms-be2020 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python 
/usr/bin/swift-object-auditor [12:42:49] RECOVERY - swift-container-replicator on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [12:42:49] RECOVERY - Host ms-be2020 is UP: PING OK - Packet loss = 0%, RTA = 36.57 ms [12:43:00] RECOVERY - dhclient process on ms-be2020 is OK: PROCS OK: 0 processes with command name dhclient [12:43:09] RECOVERY - very high load average likely xfs on ms-be2020 is OK: OK - load average: 9.82, 2.36, 0.79 [12:43:09] RECOVERY - swift-account-server on ms-be2020 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [12:43:09] RECOVERY - swift-account-auditor on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [12:43:09] RECOVERY - swift-account-replicator on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [12:43:09] RECOVERY - swift-container-updater on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [12:43:19] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2020 is OK: OK ferm input default policy is set [12:43:20] RECOVERY - configured eth on ms-be2020 is OK: OK - interfaces up [12:43:20] RECOVERY - swift-object-replicator on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [12:43:29] RECOVERY - swift-account-reaper on ms-be2020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [12:43:30] RECOVERY - HP RAID on ms-be2020 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [12:43:40] RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational [12:44:59] 10Operations, 10media-storage: ms-be2020 
crashed - https://phabricator.wikimedia.org/T202397 (10ema) [12:45:09] 10Operations, 10media-storage: ms-be2020 crashed - https://phabricator.wikimedia.org/T202397 (10ema) p:05Triage>03Normal [12:45:19] RECOVERY - puppet last run on ms-be2020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:47:18] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454262 (owner: 10Marostegui) [12:47:49] RECOVERY - Disk space on ms-be2020 is OK: DISK OK [12:49:30] RECOVERY - SSH on ms-be2020 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) [12:55:28] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2069 storage crash - https://phabricator.wikimedia.org/T201603 (10Marostegui) a:03Marostegui I will get those differences fixed [12:56:40] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp3036.esams.wmnet', 'cp4028.ulsfo.wmnet', 'cp2016.codfw.wmnet'] ``` The log c... 
[13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180821T1300) [13:13:56] (03PS2) 10MarcoAurelio: Configure gendered namespaces for pl.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454213 (https://phabricator.wikimedia.org/T202347) [13:15:50] (03PS2) 10Muehlenhoff: Remove stray cp4 hosts from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/454241 (https://phabricator.wikimedia.org/T178815) [13:16:49] (03CR) 10Gehel: [C: 031] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/453373 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:17:42] (03CR) 10Gehel: [C: 031] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/451814 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:18:07] <3 [13:18:40] (03CR) 10Volans: [C: 032] Add service locator class Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/453373 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:19:19] (03CR) 10Muehlenhoff: [C: 032] Remove stray cp4 hosts from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/454241 (https://phabricator.wikimedia.org/T178815) (owner: 10Muehlenhoff) [13:19:36] (03Merged) 10jenkins-bot: Add service locator class Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/453373 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:20:15] (03CR) 10Volans: [C: 032] Add dnsdisc module to manipulate DNS Discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/451814 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:20:33] (03CR) 10Gehel: "LGTM (minor documentation fix, but feel free to merge anyway, or without another review)." 
(031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/453994 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:21:11] (03Merged) 10jenkins-bot: Add dnsdisc module to manipulate DNS Discovery [software/spicerack] - 10https://gerrit.wikimedia.org/r/451814 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:21:35] (03CR) 10Gehel: [C: 031] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/454230 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:21:42] (03CR) 10Gehel: [C: 031] Add a retry decorator [software/spicerack] - 10https://gerrit.wikimedia.org/r/453994 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [13:23:33] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp4028 is CRITICAL: connect to address 10.128.0.128 and port 3122: Connection refused [13:23:33] PROBLEM - confd service on cp4028 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:23:34] PROBLEM - HTTPS Unified ECDSA on cp4028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [13:24:14] ACKNOWLEDGEMENT - Check Varnish expiry mailbox lag on cp4028 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. Ema reimaging [13:24:14] ACKNOWLEDGEMENT - Check systemd state on cp4028 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
Ema reimaging [13:24:14] ACKNOWLEDGEMENT - HTTPS Unified ECDSA on cp4028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Ema reimaging [13:24:14] ACKNOWLEDGEMENT - HTTPS Unified RSA on cp4028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Ema reimaging [13:24:14] ACKNOWLEDGEMENT - Varnish HTTP text-frontend - port 3122 on cp4028 is CRITICAL: connect to address 10.128.0.128 and port 3122: Connection refused Ema reimaging [13:24:14] ACKNOWLEDGEMENT - Varnish HTTP text-frontend - port 3123 on cp4028 is CRITICAL: connect to address 10.128.0.128 and port 3123: Connection refused Ema reimaging [13:24:14] ACKNOWLEDGEMENT - configured eth on cp4028 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. Ema reimaging [13:24:28] wmf-downtime-host does not seem to collaborate today [13:27:10] !log stop db2090 for upgrade [13:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:47] ema: as in it doesn't work? 
[13:27:58] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp4028.ulsfo.wmnet'] ``` Of which those **FAILED**: ``` ['cp4028.ulsfo.wmnet'] ``` [13:28:18] I've restarted icinga yesterday for the known awol issue, that usually causes also downtime to not apply, but after the restart all recovered [13:30:03] volans: it does work when it feels like [13:31:34] (03PS1) 10Jcrespo: mariadb: Depool db1096 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454276 [13:32:53] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1096 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454276 (owner: 10Jcrespo) [13:34:08] (03Merged) 10jenkins-bot: mariadb: Depool db1096 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454276 (owner: 10Jcrespo) [13:36:12] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1096 (duration: 00m 49s) [13:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:55] (03CR) 10jenkins-bot: mariadb: Depool db1096 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454276 (owner: 10Jcrespo) [13:39:22] (03PS2) 10Andrew Bogott: nodepool: reduce number of instances [puppet] - 10https://gerrit.wikimedia.org/r/454222 (https://phabricator.wikimedia.org/T201972) (owner: 10Hashar) [13:40:16] (03CR) 10Andrew Bogott: [C: 032] "Cool!" 
[puppet] - 10https://gerrit.wikimedia.org/r/454222 (https://phabricator.wikimedia.org/T201972) (owner: 10Hashar) [13:43:02] RECOVERY - HTTPS Unified ECDSA on cp4028 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345519 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2018-11-22 07:59:59 +0000 (expires in 92 days) [13:44:12] !log stop db1096 (both s5 and s6) for upgrade [13:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:11] RECOVERY - confd service on cp4028 is OK: OK - confd is active [13:47:42] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp4028 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.149 second response time [14:05:58] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1096 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454286 [14:08:03] (03PS1) 10RobH: wmf staff mflorence was in wmf group but not in admin module [puppet] - 10https://gerrit.wikimedia.org/r/454287 [14:08:49] (03CR) 10RobH: [C: 032] wmf staff mflorence was in wmf group but not in admin module [puppet] - 10https://gerrit.wikimedia.org/r/454287 (owner: 10RobH) [14:14:43] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Shorten logstash retention temporarily - https://phabricator.wikimedia.org/T201971 (10herron) Adding the below to the `/etc/curator/cleanup_logstash.yaml` curator config temporarily and running `/usr/bin/curator --config /etc/c... [14:20:51] 10Operations, 10Patch-For-Review: setup replacements for maintenance_server (terbium, wasat) on Stretch - https://phabricator.wikimedia.org/T192092 (10RobH) [14:20:53] 10Operations, 10Patch-For-Review: rename wasat to mwmaint2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T193915 (10RobH) 05Resolved>03Open No one bothered to update the racktables entry, physical label, or the network port description.
No sub-task was made for any of this, which... [14:24:49] (03CR) 10DCausse: [C: 031] Switch entity reference type indexing from opt-in to opt-out [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452956 (https://phabricator.wikimedia.org/T199884) (owner: 10Smalyshev) [14:25:42] 10Operations, 10ops-codfw, 10netops: Rename of wasat to mwmaint2001 (switch labels et al) - https://phabricator.wikimedia.org/T199530 (10RobH) [14:26:07] 10Operations, 10ops-codfw, 10netops: Rename of wasat to mwmaint2001 (switch labels et al) - https://phabricator.wikimedia.org/T199530 (10RobH) [14:26:11] 10Operations, 10Patch-For-Review: rename wasat to mwmaint2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T193915 (10RobH) [14:26:20] 10Operations, 10HHVM, 10User-ArielGlenn: Run all jobs on PHP7 or HHVM - https://phabricator.wikimedia.org/T195393 (10RobH) [14:26:23] 10Operations, 10Patch-For-Review: setup replacements for maintenance_server (terbium, wasat) on Stretch - https://phabricator.wikimedia.org/T192092 (10RobH) [14:26:25] 10Operations, 10Patch-For-Review: rename wasat to mwmaint2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T193915 (10RobH) 05Open>03Resolved [14:26:30] 10Operations, 10ops-codfw, 10netops: Rename of wasat to mwmaint2001 (switch labels et al) - https://phabricator.wikimedia.org/T199530 (10RobH) a:03Papaul [14:30:39] 10Operations, 10Wikidata, 10Wikidata-Campsite, 10Wikimedia-General-or-Unknown, and 5 others: Multiple projects reporting Cannot access the database: No working replica DB server - https://phabricator.wikimedia.org/T195520 (10Lydia_Pintscher) [14:33:46] (03PS5) 10Volans: Add a retry decorator [software/spicerack] - 10https://gerrit.wikimedia.org/r/453994 (https://phabricator.wikimedia.org/T199079) [14:33:48] (03PS2) 10Volans: Make all library instances 'immutable' [software/spicerack] - 10https://gerrit.wikimedia.org/r/454230 (https://phabricator.wikimedia.org/T199079) [14:33:50] (03PS1) 10Volans: Add 
MediaWiki module to manipulate MediaWiki config [software/spicerack] - 10https://gerrit.wikimedia.org/r/454290 (https://phabricator.wikimedia.org/T199079) [14:36:51] (03PS1) 10Jcrespo: mysql user: Remove exception for mysql user being removed [puppet] - 10https://gerrit.wikimedia.org/r/454291 (https://phabricator.wikimedia.org/T134476) [14:40:41] (03CR) 10Jcrespo: "Disclaimer, I have checked all production databases I know of, I haven't checked other databases that may be on labs or analytics." [puppet] - 10https://gerrit.wikimedia.org/r/454291 (https://phabricator.wikimedia.org/T134476) (owner: 10Jcrespo) [14:41:35] (03CR) 10Volans: [C: 032] "Fixed" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/453994 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:42:27] (03Merged) 10jenkins-bot: Add a retry decorator [software/spicerack] - 10https://gerrit.wikimedia.org/r/453994 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:42:58] (03CR) 10Volans: [C: 032] Make all library instances 'immutable' [software/spicerack] - 10https://gerrit.wikimedia.org/r/454230 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:43:57] (03Merged) 10jenkins-bot: Make all library instances 'immutable' [software/spicerack] - 10https://gerrit.wikimedia.org/r/454230 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:44:08] volans: you're merging all those Spicerack CR? How am I going to procrastinate? [14:44:26] you +1ed all of them :D but there is a new one for you already gehel ;) [14:44:27] 10Operations, 10Packaging, 10Toolforge, 10Patch-For-Review: Please add php-imagick and php-redis packages to apt.wikimedia.org thirdparty/php72 - https://phabricator.wikimedia.org/T200666 (10MoritzMuehlenhoff) 05stalled>03Resolved a:03MoritzMuehlenhoff I've downloaded the packages via Secure Apt from...
[14:44:39] (03PS2) 10Volans: Add MediaWiki module to manipulate its config [software/spicerack] - 10https://gerrit.wikimedia.org/r/454290 (https://phabricator.wikimedia.org/T199079) [14:55:13] (03CR) 10Gehel: [C: 04-1] "A few comments inline." (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/454290 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [14:58:21] !log rolling restart of proton* to pick up nodejs security update [14:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:30] (03PS1) 10Rush: admin: add rush to analytics and research group [puppet] - 10https://gerrit.wikimedia.org/r/454299 [15:10:16] !log installing apache updates from stretch 9.5 point release [15:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:38] (03PS3) 10Vgutierrez: [WIP] Certcentral integration tests [software/certcentral] - 10https://gerrit.wikimedia.org/r/454045 [15:23:39] (03PS1) 10Dzahn: admins: add Marcella Florence to ldap_only admins [puppet] - 10https://gerrit.wikimedia.org/r/454303 (https://phabricator.wikimedia.org/T202162) [15:24:08] (03PS2) 10Dzahn: admins: add Marcella Florence to ldap_only admins [puppet] - 10https://gerrit.wikimedia.org/r/454303 (https://phabricator.wikimedia.org/T202162) [15:24:25] 10Operations, 10TCB-Team, 10WMDE-QWERTY-Team, 10wikidiff2: Release wikidiff2 v1.7.3 and update the production servers - https://phabricator.wikimedia.org/T202301 (10WMDE-Fisch) We plan to work on merging and cleaning the config variable handling with T194272 as next step so we might wait with the deploymen... 
[15:24:44] (03CR) 10Dzahn: [C: 032] admins: add Marcella Florence to ldap_only admins [puppet] - 10https://gerrit.wikimedia.org/r/454303 (https://phabricator.wikimedia.org/T202162) (owner: 10Dzahn) [15:25:15] 10Operations, 10TCB-Team, 10WMDE-QWERTY-Team, 10wikidiff2: Release wikidiff2 v1.7.3 - https://phabricator.wikimedia.org/T202301 (10WMDE-Fisch) [15:25:30] robh: lol, now we did it both and 2 patches in puppet-merge :) [15:25:40] sorry in meeting [15:25:43] cannot read irc [15:26:32] (03PS1) 10Dzahn: Revert "admins: add Marcella Florence to ldap_only admins" [puppet] - 10https://gerrit.wikimedia.org/r/454304 [15:27:13] (03CR) 10Dzahn: [C: 032] "has been done already by Robh, added twice" [puppet] - 10https://gerrit.wikimedia.org/r/454304 (owner: 10Dzahn) [15:27:25] i fixed that this AM [15:27:29] and merged it live awhile ago [15:27:31] it wasnt merged on master [15:27:35] oh, my bad [15:27:38] no puppet-merge [15:27:41] and then i did it too [15:27:52] so now i merged both and reverted mine [15:27:55] RECOVERY - puppet last run on labsdb1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:30:11] all good now. be back later [15:33:49] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1096 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454286 (owner: 10Jcrespo) [15:34:05] 10Operations, 10Maps, 10Maps-Sprint, 10Reading-Infrastructure-Team-Backlog, 10Traffic: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732 (10Mholloway) As discussed today, let's gradually make these shorter (e.g., 24h, 12h, 6h, 3h, 1h) and closely monitor the effect... 
[15:34:24] (03CR) 10Filippo Giunchedi: [C: 031] Remove stray swift backend servers from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/454250 (https://phabricator.wikimedia.org/T162785) (owner: 10Muehlenhoff) [15:35:06] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1096 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454286 (owner: 10Jcrespo) [15:35:17] (03PS1) 10Milimetric: Add new cx translation reportupdater job [puppet] - 10https://gerrit.wikimedia.org/r/454308 [15:38:28] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1096 (duration: 00m 50s) [15:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:17] 10Operations, 10ops-eqiad: rack/setup/install puppetmaster1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201342 (10Cmjohnson) [15:42:45] 10Operations, 10ops-eqiad: rack/setup/install puppetmaster1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201342 (10Cmjohnson) a:05Cmjohnson>03RobH assigning to robh to finish installs [15:44:24] (03PS2) 10Rush: admin: add rush to analytics and research group [puppet] - 10https://gerrit.wikimedia.org/r/454299 [15:44:49] (03PS3) 10Rush: admin: add rush to analytics and research group [puppet] - 10https://gerrit.wikimedia.org/r/454299 [15:46:23] (03CR) 10Rush: [C: 032] admin: add rush to analytics and research group [puppet] - 10https://gerrit.wikimedia.org/r/454299 (owner: 10Rush) [15:46:33] (03CR) 10Rush: [C: 032] "Otto +1'd in IRC :)" [puppet] - 10https://gerrit.wikimedia.org/r/454299 (owner: 10Rush) [15:47:12] (03CR) 10Ottomata: [C: 031] admin: add rush to analytics and research group [puppet] - 10https://gerrit.wikimedia.org/r/454299 (owner: 10Rush) [15:48:03] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1096 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454286 (owner: 10Jcrespo) [15:54:11] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - 
https://phabricator.wikimedia.org/T201343 (10Cmjohnson) [15:54:50] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Cmjohnson) a:05Cmjohnson>03RobH assigning to @robh for installs [15:58:15] 10Operations, 10Fundraising-Backlog, 10fundraising-tech-ops: reports.frdev.wm.o -- still in use? - https://phabricator.wikimedia.org/T170640 (10Jgreen) a:03Jgreen [16:00:04] godog, moritzm, and _joe_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180821T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:01:10] moritzm: `I've downloaded the packages via Secure Apt from a systemd container` <-- is that documented? [16:10:10] 10Operations, 10ops-eqiad, 10Operations-Software-Development: rack/setup/install clustermgmt1001.eqiad.wmnet (new cumin master) - https://phabricator.wikimedia.org/T201346 (10Cmjohnson) a:05Cmjohnson>03RobH @robh these are ready but I see there is another last minute name change. When you do production D... 
[16:10:24] 10Operations, 10ops-eqiad, 10Operations-Software-Development: rack/setup/install clustermgmt1001.eqiad.wmnet (new cumin master) - https://phabricator.wikimedia.org/T201346 (10Cmjohnson) [16:13:06] 10Operations, 10ops-eqiad: rack/setup/install sulfur.wikimedia.org - https://phabricator.wikimedia.org/T201364 (10Cmjohnson) [16:13:28] (03PS18) 10Gehel: Switch elasticsearch to use tlsproxy module [puppet] - 10https://gerrit.wikimedia.org/r/444610 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [16:13:34] 10Operations, 10ops-eqiad: rack/setup/install sulfur.wikimedia.org - https://phabricator.wikimedia.org/T201364 (10Cmjohnson) a:05Cmjohnson>03RobH @robh added to public vlan assigning to you for installs [16:14:22] (03PS4) 10Vgutierrez: [WIP] Certcentral integration tests [software/certcentral] - 10https://gerrit.wikimedia.org/r/454045 [16:17:09] !log replacing PEM2 on cr2-codfw [16:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:26] 10Operations, 10ops-eqiad, 10monitoring: rack/setup/install icinga1001.wikimedia.org - https://phabricator.wikimedia.org/T201344 (10Cmjohnson) [16:18:34] !log disable ae1 on cr1-eqiad - T202075 [16:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:41] T202075: Move asw2-a<->cr1 uplink back to asw-a - https://phabricator.wikimedia.org/T202075 [16:18:59] 10Operations, 10ops-eqiad, 10monitoring: rack/setup/install icinga1001.wikimedia.org - https://phabricator.wikimedia.org/T201344 (10Cmjohnson) a:05Cmjohnson>03RobH @robh added to public vlan, all yours for install [16:21:34] !log disable ae1 on asw-a-eqiad - T202075 [16:21:38] 10Operations, 10ops-codfw: Check/replace PEM2 on cr2-codfw - https://phabricator.wikimedia.org/T202166 (10Papaul) ``` PEM replaced show chassis environment pem PEM 0 status: State Online Temperature OK DC Output Vo... 
[16:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:35] 10Operations, 10ops-codfw: Check/replace PEM2 on cr2-codfw - https://phabricator.wikimedia.org/T202166 (10Papaul) 05Open>03Resolved [16:22:41] (03CR) 10Gehel: [C: 04-1] Switch elasticsearch to use tlsproxy module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/444610 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [16:24:35] 10Operations, 10Maps, 10Maps-Sprint, 10Reading-Infrastructure-Team-Backlog, 10Traffic: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732 (10BBlack) Catching up a little here: what I'm seeing right now on tile images, from the public POV, is basically: `Cache-contr... [16:27:25] !log upgrade hp raid firmware on ms-be2020 - T141756 [16:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:32] T141756: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 [16:28:03] 08̶W̶a̶r̶n̶i̶n̶g Device cr2-codfw.wikimedia.org recovered from Juniper environment status [16:28:54] 10Operations, 10media-storage: ms-be2020 crashed - https://phabricator.wikimedia.org/T202397 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi The host is fine upon reboot, in the sense that disks are healthy, I suspect the raid controller bailed. I've updated the firmware as per T141756, resolving this fo... [16:29:25] PROBLEM - puppet last run on elastic1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:29:53] (03PS5) 10Vgutierrez: Certcentral integration tests [software/certcentral] - 10https://gerrit.wikimedia.org/r/454045 [16:30:04] 10Operations, 10ops-eqiad, 10netops: Move asw2-a<->cr1 uplink back to asw-a - https://phabricator.wikimedia.org/T202075 (10Cmjohnson) connections are complete. 
I was able to use the same cables with the exception of these two xe-3/0/0 xe-2/0/44 xe-1/1/0 4776 is now cable number 2172 xe-4/1/0 xe-7/0/45 xe-8... [16:30:05] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.2688 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [16:31:05] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [16:32:09] !log enabling ae1 on asw-a-eqiad then ae1 on cr1-eqiad - T202075 [16:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:16] T202075: Move asw2-a<->cr1 uplink back to asw-a - https://phabricator.wikimedia.org/T202075 [16:32:54] 10Operations, 10ops-eqiad, 10Parsoid, 10Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10Cmjohnson) [16:33:27] 10Operations, 10ops-eqiad, 10Parsoid, 10Patch-For-Review: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10Cmjohnson) a:05Cmjohnson>03RobH @robh this is ready for install [16:34:16] looks like there was a brief syslog input rate spike re: logstash1007 [16:35:13] probably some very minor/brief network loss from the ae1 link up? [16:35:54] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/add to spares tracking 2 single cpu misc class systems - https://phabricator.wikimedia.org/T196697 (10Cmjohnson) [16:36:05] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 227, down: 2, dormant: 0, excluded: 0, unused: 0 [16:36:39] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/add to spares tracking 2 single cpu misc class systems - https://phabricator.wikimedia.org/T196697 (10Cmjohnson) Both servers are added and available, I am leaving task open until network issue on row A is resolved and will update switch port once fig... 
[16:37:29] 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review: rack/setup/install analyticsmaster100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T201939 (10Cmjohnson) [16:39:16] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.6072 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [16:39:44] 10Operations, 10Maps, 10Maps-Sprint, 10Reading-Infrastructure-Team-Backlog, 10Traffic: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732 (10Gehel) >>! In T186732#4519685, @BBlack wrote: > Reducing the Varnish-level TTLs seems counter-productive for efficiency at al... [16:40:16] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [16:40:25] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 [16:40:45] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:42:37] looks like a brief flurry of memcached errors, but yeah could be related to link state change [16:42:55] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:44:26] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10Patch-For-Review: Rack/Setup frbast2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T196417 (10Papaul) a:05Papaul>03Jgreen @Jgreen Server is back up please confirm and close task. Thanks. 
[16:50:00] (03PS1) 10Cmjohnson: Removing mgmt dns for decom host bast1001 [dns] - 10https://gerrit.wikimedia.org/r/454316 (https://phabricator.wikimedia.org/T191153) [16:50:36] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns for decom host bast1001 [dns] - 10https://gerrit.wikimedia.org/r/454316 (https://phabricator.wikimedia.org/T191153) (owner: 10Cmjohnson) [16:50:52] 10Operations, 10ops-eqiad, 10netops: Move asw2-a<->cr1 uplink back to asw-a - https://phabricator.wikimedia.org/T202075 (10ayounsi) 05Open>03Resolved a:03ayounsi [16:51:36] 10Operations, 10ops-eqiad, 10netops: Move asw2-a<->cr1 uplink back to asw-a - https://phabricator.wikimedia.org/T202075 (10ayounsi) Done, not a single ping was missed to a asw canary host (dns1001). [16:53:07] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: decom bast1001 - https://phabricator.wikimedia.org/T191153 (10Cmjohnson) [16:53:16] 10Operations, 10Patch-For-Review: replace bast1001 (new hardware) - https://phabricator.wikimedia.org/T183412 (10Cmjohnson) [16:53:20] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: decom bast1001 - https://phabricator.wikimedia.org/T191153 (10Cmjohnson) 05Open>03Resolved [16:55:04] (03PS1) 10Cmjohnson: Removing mgmt dns for decom host chromium and hydrogen [dns] - 10https://gerrit.wikimedia.org/r/454317 (https://phabricator.wikimedia.org/T201522) [16:55:34] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns for decom host chromium and hydrogen [dns] - 10https://gerrit.wikimedia.org/r/454317 (https://phabricator.wikimedia.org/T201522) (owner: 10Cmjohnson) [16:56:31] 10Operations, 10ops-eqiad, 10Traffic, 10decommission, 10Patch-For-Review: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10Cmjohnson) [16:56:39] 10Operations, 10ops-eqiad, 10Traffic, 10decommission, 10Patch-For-Review: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10Cmjohnson) 05Open>03Resolved [16:57:15] (03PS1) 10Mforns: Change spark log4j 
config to output logs to stdout [puppet/cdh] - 10https://gerrit.wikimedia.org/r/454318 (https://phabricator.wikimedia.org/T202429) [16:57:55] (03CR) 10Mforns: [C: 04-1] "Still needs to be tested. Thanks" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/454318 (https://phabricator.wikimedia.org/T202429) (owner: 10Mforns) [16:58:15] (03PS1) 10Cmjohnson: Removing mgmt dns for decom host notebook1001 [dns] - 10https://gerrit.wikimedia.org/r/454319 (https://phabricator.wikimedia.org/T192103) [16:58:41] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns for decom host notebook1001 [dns] - 10https://gerrit.wikimedia.org/r/454319 (https://phabricator.wikimedia.org/T192103) (owner: 10Cmjohnson) [16:59:23] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission notebook1001 - https://phabricator.wikimedia.org/T192103 (10Cmjohnson) [16:59:28] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission notebook1001 - https://phabricator.wikimedia.org/T192103 (10Cmjohnson) 05Open>03Resolved [16:59:45] RECOVERY - puppet last run on elastic1045 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: (Dis)respected human, time to deploy Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180821T1700). Please do the needful. [17:01:46] No ORES deployment today. 
[17:02:32] 10Operations, 10ops-codfw, 10netops: Rename of wasat to mwmaint2001 (switch labels et al) - https://phabricator.wikimedia.org/T199530 (10Papaul) [17:02:45] 10Operations, 10ops-codfw, 10netops: Rename of wasat to mwmaint2001 (switch labels et al) - https://phabricator.wikimedia.org/T199530 (10Papaul) 05Open>03Resolved [17:02:48] 10Operations, 10Patch-For-Review: rename wasat to mwmaint2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T193915 (10Papaul) [17:05:00] 10Operations, 10ops-codfw: Check/replace PEM2 on cr2-codfw - https://phabricator.wikimedia.org/T202166 (10Papaul) {F25203069} [17:05:23] 10Operations, 10ops-eqiad, 10DBA, 10decommission, 10Patch-For-Review: Decommission db1052 - https://phabricator.wikimedia.org/T199861 (10Cmjohnson) [17:05:46] 10Operations, 10ops-eqiad, 10DBA, 10decommission, 10Patch-For-Review: Decommission db1052 - https://phabricator.wikimedia.org/T199861 (10Cmjohnson) 05Open>03Resolved [17:06:55] (03PS1) 10Cmjohnson: Removing mgmt dns for decom host hafnium [dns] - 10https://gerrit.wikimedia.org/r/454320 (https://phabricator.wikimedia.org/T193420) [17:07:13] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns for decom host hafnium [dns] - 10https://gerrit.wikimedia.org/r/454320 (https://phabricator.wikimedia.org/T193420) (owner: 10Cmjohnson) [17:08:18] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review, 10Performance-Team (Radar): Decommission hafnium - https://phabricator.wikimedia.org/T193420 (10Cmjohnson) [17:08:27] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review, 10Performance-Team (Radar): Decommission hafnium - https://phabricator.wikimedia.org/T193420 (10Cmjohnson) 05Open>03Resolved [17:13:56] !log starting branch cut for 1.32.0-wmf.18 [17:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:03] (03PS2) 10Ottomata: Add new cx translation reportupdater job [puppet] - 10https://gerrit.wikimedia.org/r/454308 (owner: 10Milimetric) [17:17:13] 
(03CR) 10Ottomata: [V: 032 C: 032] Add new cx translation reportupdater job [puppet] - 10https://gerrit.wikimedia.org/r/454308 (owner: 10Milimetric) [17:18:41] 10Operations, 10ops-eqdfw, 10netops: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 (10ayounsi) [17:20:20] thcipriani: Are you just starting the train now? [17:21:09] 10Operations, 10CommRel-Internals, 10Wikimedia-Mailing-lists: Close https://lists.wikimedia.org/mailman/listinfo/cep and keep the archive for now - https://phabricator.wikimedia.org/T155683 (10Dzahn) a:03Dzahn [17:21:27] Niharika: not doing deployment just yet, but getting the branch setup to deploy, yeah [17:22:05] Okay cool. I got confused because the calendar says the train will be on EU time today. [17:22:16] PROBLEM - IPMI Sensor Status on cp3036 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] [17:23:57] this should be right: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180821T1900 [17:34:46] which is to say: I will be deploying train in a few hours [17:44:28] cmjohnson on ticket spree today :) [17:44:39] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: ulsfo migration tracking - https://phabricator.wikimedia.org/T202433 (10RobH) p:05Triage>03High [17:44:44] (03PS19) 10Gehel: Switch elasticsearch to use tlsproxy module [puppet] - 10https://gerrit.wikimedia.org/r/444610 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [17:44:53] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: ulsfo migration tracking - https://phabricator.wikimedia.org/T202433 (10RobH) [17:46:18] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: ulsfo migration tracking - https://phabricator.wikimedia.org/T202433 (10RobH) [17:51:25] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: ulsfo migration tracking - https://phabricator.wikimedia.org/T202433 (10RobH) [17:52:54] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: ulsfo migration tracking - 
https://phabricator.wikimedia.org/T202433 (10Reedy) [17:56:09] (03PS2) 10Dzahn: replace a couple http links with https where possible [puppet] - 10https://gerrit.wikimedia.org/r/453541 (https://phabricator.wikimedia.org/T202033) [17:56:56] (03PS3) 10Dzahn: replace a couple http links with https where possible [puppet] - 10https://gerrit.wikimedia.org/r/453541 (https://phabricator.wikimedia.org/T202033) [17:58:48] !log restarting ircecho on einsteinium - T202314 [17:58:50] (03CR) 10Dzahn: [C: 032] "harmless, these are all just inside comments" [puppet] - 10https://gerrit.wikimedia.org/r/453541 (https://phabricator.wikimedia.org/T202033) (owner: 10Dzahn) [17:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:56] T202314: icingawm is missing from #wikimedia-fundraising channel - https://phabricator.wikimedia.org/T202314 [17:59:17] cwd: rejoined here ^^^ [18:00:49] PROBLEM - DPKG on stat1005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:02:01] 10Operations, 10Patch-For-Review: Feedback Appreciatted: Use of HTTP Without TLS - https://phabricator.wikimedia.org/T202033 (10Dzahn) So yea.. i'm really not sure what we should do with this ticket. It might become a bit spammy to go through all of this and keep linking it. Might be better for an Etherpad or... 
[18:02:39] 10Operations, 10Patch-For-Review: Feedback Appreciatted: Use of HTTP Without TLS - https://phabricator.wikimedia.org/T202033 (10Dzahn) p:05Triage>03Low [18:03:41] 10Operations, 10Patch-For-Review: Feedback Appreciated: Use of HTTP Without TLS - https://phabricator.wikimedia.org/T202033 (10Reedy) [18:03:47] 10Operations: Onboarding Cole White - https://phabricator.wikimedia.org/T202136 (10fgiunchedi) [18:04:06] 10Operations: Feedback Appreciated: Use of HTTP Without TLS - https://phabricator.wikimedia.org/T202033 (10Reedy) [18:04:27] (03PS4) 10Dzahn: wikistats(vps): convert apache to httpd module [puppet] - 10https://gerrit.wikimedia.org/r/453544 [18:07:41] 10Operations, 10ops-ulsfo, 10Traffic, 10netops, 10Patch-For-Review: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552 (10RobH) [18:07:43] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: ulsfo migration tracking - https://phabricator.wikimedia.org/T202433 (10RobH) [18:08:19] 10Operations, 10ops-ulsfo, 10Traffic: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050 (10RobH) [18:08:21] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: ulsfo migration tracking - https://phabricator.wikimedia.org/T202433 (10RobH) [18:08:57] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: ulsfo migration tracking - https://phabricator.wikimedia.org/T202433 (10RobH) so bast4002 wasn't fully deployed as a bastion yet, it can go online in new site in advance of other systems (since its not in production) if that is useful. 
[18:11:50] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [18:13:51] 10Operations: Feedback Appreciated: Use of HTTP Without TLS - https://phabricator.wikimedia.org/T202033 (10Dzahn) [18:15:59] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [18:16:44] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data for Tonina WMDE - https://phabricator.wikimedia.org/T202069 (10RobH) [18:17:07] 10Operations, 10SRE-Access-Requests, 10User-Addshore: Requesting access to view EventLogging data for Tonina WMDE - https://phabricator.wikimedia.org/T202069 (10RobH) [18:17:53] !log restarting ircecho on einsteinium - T202314 [18:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:00] T202314: icingawm is missing from #wikimedia-fundraising channel - https://phabricator.wikimedia.org/T202314 [18:18:22] 10Operations: Feedback Appreciated: Use of HTTP Without TLS - https://phabricator.wikimedia.org/T202033 (10Dzahn) @Akondrahman I moved your list into a pastebin so that it can be edited without causing a notifcation on this ticket each time. I edited your task description to include it from there. If you want... 
[18:18:31] (03PS1) 10RobH: adding user tonia to admin module [puppet] - 10https://gerrit.wikimedia.org/r/454327 (https://phabricator.wikimedia.org/T202069) [18:21:02] (03PS1) 10RobH: adding tonina to groups in admin module [puppet] - 10https://gerrit.wikimedia.org/r/454328 (https://phabricator.wikimedia.org/T202069) [18:21:52] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Addshore: Requesting access to view EventLogging data for Tonina WMDE - https://phabricator.wikimedia.org/T202069 (10RobH) [18:22:02] (03PS1) 10MarcoAurelio: Switch 'wikidata-staff' add/remove 'interface-admin' for own account [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454330 (https://phabricator.wikimedia.org/T202065) [18:23:13] (03CR) 10jerkins-bot: [V: 04-1] Switch 'wikidata-staff' add/remove 'interface-admin' for own account [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454330 (https://phabricator.wikimedia.org/T202065) (owner: 10MarcoAurelio) [18:24:12] (03PS2) 10MarcoAurelio: Switch 'wikidata-staff' add/remove 'interface-admin' for own account [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454330 (https://phabricator.wikimedia.org/T202065) [18:30:48] 10Operations, 10Wikimedia-Mailing-lists, 10Chinese-Sites: Create mailing list for Bureaucrat of zh.wikipedia - https://phabricator.wikimedia.org/T202435 (10MarcoAurelio) [18:31:04] (03PS1) 10Volans: dnsdisc: replace retry logic with decorator [software/spicerack] - 10https://gerrit.wikimedia.org/r/454334 (https://phabricator.wikimedia.org/T199079) [18:31:05] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Patrick Earley - https://phabricator.wikimedia.org/T201667 (10RobH) [18:31:45] (03PS1) 10Ottomata: Sum eventlogging throughput rate when computing quantiles for alerts [puppet] - 10https://gerrit.wikimedia.org/r/454335 [18:32:07] 10Operations, 10Wikimedia-Mailing-lists, 10Chinese-Sites: Create mailing list for Bureaucrat of 
zh.wikipedia - https://phabricator.wikimedia.org/T202435 (10MarcoAurelio) @Wong128hk Hello. Please include all the details required as per https://meta.wikimedia.org/wiki/Mailing_lists#Create_a_new_list * Request... [18:33:17] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Patrick Earley - https://phabricator.wikimedia.org/T201667 (10RobH) [18:33:22] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Patrick Earley - https://phabricator.wikimedia.org/T201667 (10RobH) a:05PEarleyWMF>03None [18:33:41] (03CR) 10Rxy: [C: 031] Switch 'wikidata-staff' add/remove 'interface-admin' for own account [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454330 (https://phabricator.wikimedia.org/T202065) (owner: 10MarcoAurelio) [18:35:38] (03CR) 10Ottomata: [C: 032] Sum eventlogging throughput rate when computing quantiles for alerts [puppet] - 10https://gerrit.wikimedia.org/r/454335 (owner: 10Ottomata) [19:59:07] (03PS1) 10Cwhite: nagios_common: add cwhite to contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/454360 (https://phabricator.wikimedia.org/T202136) [19:59:35] (03CR) 10Filippo Giunchedi: [C: 031] nagios_common: add cwhite to contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/454360 (https://phabricator.wikimedia.org/T202136) (owner: 10Cwhite) [20:00:46] (03CR) 10Cwhite: [C: 032] nagios_common: add cwhite to contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/454360 (https://phabricator.wikimedia.org/T202136) (owner: 10Cwhite) [20:00:54] (03PS2) 10Herron: logstash: reduce replica count on old logstash indices [puppet] - 10https://gerrit.wikimedia.org/r/454354 (https://phabricator.wikimedia.org/T201971) [20:01:05] (03CR) 10Dzahn: "note there is also ./modules/icinga/files/cgi.cfg" [puppet] - 10https://gerrit.wikimedia.org/r/454360 (https://phabricator.wikimedia.org/T202136) (owner: 10Cwhite) [20:01:16] 
(03PS2) 10Cwhite: nagios_common: add cwhite to contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/454360 (https://phabricator.wikimedia.org/T202136) [20:02:42] (03PS1) 10Thcipriani: Group0 to 1.32.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454361 [20:03:12] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler02/12150/logstash1007.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/454354 (https://phabricator.wikimedia.org/T201971) (owner: 10Herron) [20:05:13] (03PS1) 10Cwhite: icinga: add cwhite to icinga authorizations [puppet] - 10https://gerrit.wikimedia.org/r/454362 (https://phabricator.wikimedia.org/T202136) [20:06:05] (03PS2) 10Cwhite: icinga: add cwhite to icinga authorizations [puppet] - 10https://gerrit.wikimedia.org/r/454362 (https://phabricator.wikimedia.org/T202136) [20:07:28] !log temporarily disabling puppet agents during puppetmaster apache updates [20:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:17] (03CR) 10Cwhite: [C: 031] icinga: add cwhite to icinga authorizations [puppet] - 10https://gerrit.wikimedia.org/r/454362 (https://phabricator.wikimedia.org/T202136) (owner: 10Cwhite) [20:08:22] (03CR) 10Filippo Giunchedi: [C: 031] icinga: add cwhite to icinga authorizations [puppet] - 10https://gerrit.wikimedia.org/r/454362 (https://phabricator.wikimedia.org/T202136) (owner: 10Cwhite) [20:08:25] (03CR) 10Thcipriani: [C: 032] Group0 to 1.32.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454361 (owner: 10Thcipriani) [20:09:08] (03CR) 10Cwhite: [C: 032] icinga: add cwhite to icinga authorizations [puppet] - 10https://gerrit.wikimedia.org/r/454362 (https://phabricator.wikimedia.org/T202136) (owner: 10Cwhite) [20:09:44] (03Merged) 10jenkins-bot: Group0 to 1.32.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454361 (owner: 10Thcipriani) [20:10:29] (03CR) 10Dzahn: [C: 031] "caveat: Icinga web UI will let you login capitalized AND uncapitalized.. 
but for these permissions it has to match the right LDAP field. y" [puppet] - 10https://gerrit.wikimedia.org/r/454362 (https://phabricator.wikimedia.org/T202136) (owner: 10Cwhite) [20:11:12] !log thcipriani@deploy1001 rebuilt and synchronized wikiversions files: Group0 to 1.32.0-wmf.18 [20:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:33] (03CR) 10jenkins-bot: Group0 to 1.32.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454361 (owner: 10Thcipriani) [20:17:02] !log re-enabling puppet agents after puppetmaster apache update [20:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:40] PROBLEM - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/TransparencyReport] [20:21:41] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/TransparencyReport] [20:29:28] (03CR) 10Gehel: [C: 031] "LGTM, suggestion inline (but feel free to ignore)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/454354 (https://phabricator.wikimedia.org/T201971) (owner: 10Herron) [20:32:45] PROBLEM - MariaDB Slave Lag: s3 on db1095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 680.95 seconds [20:49:53] (03CR) 10Gehel: [C: 031] "Minor comment about documentation (and this comment does not even relate to this CR directly). So feel free to merge." 
(032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/454334 (https://phabricator.wikimedia.org/T199079) (owner: 10Volans) [21:00:41] !log rebooting mw2232 for some tests [21:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:24] (03PS1) 10Dzahn: wikistats(VPS): jessie/stretch support for php module [puppet] - 10https://gerrit.wikimedia.org/r/454376 [21:18:03] 10Operations, 10Traffic, 10monitoring: False alarms on varnish-http-requests 70% GET drop in 30 min alert - https://phabricator.wikimedia.org/T201630 (10ayounsi) The main goal of that alert is to be notified if a site suddenly sees its traffic drop, from a network or other issue, but isn't 100% unreachable (... [21:22:02] @seen prtksxna [21:22:02] mutante: Last time I saw prtksxna they were quitting the network with reason: Quit: So long, and thanks for all the fish! N/A at 8/21/2018 3:24:19 PM (5h57m43s ago) [21:41:35] (03PS1) 10BBlack: Revert "depool eqiad for front-edge traffic" [dns] - 10https://gerrit.wikimedia.org/r/454420 [21:45:13] 10Operations, 10Patch-For-Review: Onboarding Cole White - https://phabricator.wikimedia.org/T202136 (10fgiunchedi) [21:54:00] 10Operations, 10Patch-For-Review: Onboarding Cole White - https://phabricator.wikimedia.org/T202136 (10fgiunchedi) [21:58:19] 10Operations, 10Patch-For-Review: Onboarding Cole White - https://phabricator.wikimedia.org/T202136 (10fgiunchedi) [22:02:09] 10Operations, 10Patch-For-Review: Onboarding Cole White - https://phabricator.wikimedia.org/T202136 (10fgiunchedi) [22:07:55] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10RobH) a:05RobH>03Cmjohnson @Cmjohnson Please relocate these two machines into racks with 1G (not 10G) connections. (It seems silly to leave them... 
[22:16:24] 10Operations, 10Patch-For-Review: Onboarding Cole White - https://phabricator.wikimedia.org/T202136 (10fgiunchedi) [22:17:11] (03PS2) 10Dzahn: wikistats(VPS): jessie/stretch support for php module [puppet] - 10https://gerrit.wikimedia.org/r/454376 [22:19:20] (03CR) 10Dzahn: [C: 032] wikistats(VPS): jessie/stretch support for php module [puppet] - 10https://gerrit.wikimedia.org/r/454376 (owner: 10Dzahn) [22:19:29] 10Operations, 10Patch-For-Review: Onboarding Cole White - https://phabricator.wikimedia.org/T202136 (10fgiunchedi) [22:21:37] (03PS10) 10Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 [22:22:17] 10Operations, 10SRE-Access-Requests: Requesting access to restricted production access and analytics-privatedata-users for Samuel Guebo - https://phabricator.wikimedia.org/T202362 (10RobH) p:05Triage>03Normal [22:22:22] (03CR) 10jerkins-bot: [V: 04-1] Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [22:28:42] !log bromine - git reset --hard origin ; git checkout master ; git pull origin ; puppet agent -tv (to fix broken puppet run due to unclean git repo.. unknown why it broke) [22:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:59] 10Operations: Feedback Appreciated: Use of HTTP Without TLS - https://phabricator.wikimedia.org/T202033 (10Akondrahman) Excellent. Thanks for your feedback, appreciate it. 
- Akond [22:29:08] (03PS1) 10RobH: adding scandium install params [puppet] - 10https://gerrit.wikimedia.org/r/454423 (https://phabricator.wikimedia.org/T201366) [22:29:14] !log bromine - last log line refers to /srv/org/wikimedia/TransparencyReport (transparency.wikimedia.org) [22:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:22] (03PS11) 10Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 [22:29:49] icinga-wm: magic recovery button [22:32:28] !log vega - cd /srv/org/wikimedia/TransparencyReport ; git reset --hard origin ; puppet agent -tv (to fix transparency.wikimedia.org content repo and puppet run, just like on bromine) [22:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:34] (03PS1) 10RobH: scandium prod dns entries [dns] - 10https://gerrit.wikimedia.org/r/454426 (https://phabricator.wikimedia.org/T201366) [22:32:35] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [22:32:51] (03CR) 10RobH: [C: 032] adding scandium install params [puppet] - 10https://gerrit.wikimedia.org/r/454423 (https://phabricator.wikimedia.org/T201366) (owner: 10RobH) [22:33:16] (03PS2) 10RobH: scandium prod dns entries [dns] - 10https://gerrit.wikimedia.org/r/454426 (https://phabricator.wikimedia.org/T201366) [22:33:32] (03CR) 10RobH: [C: 032] scandium prod dns entries [dns] - 10https://gerrit.wikimedia.org/r/454426 (https://phabricator.wikimedia.org/T201366) (owner: 10RobH) [22:34:26] 10Operations, 10Parsoid: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10RobH) p:05High>03Normal [22:34:43] (03CR) 10Ayounsi: "Moved bird::anycast_healthchecker_check to a create_resources() + Hiera instead of a hardcoded function." 
[puppet] - 10https://gerrit.wikimedia.org/r/397723 (owner: 10Ayounsi) [22:35:55] RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:37:59] (03CR) 10Dzahn: "a common::profile would also work, the point is just that it makes sure a httpd class is applied once and only once on any given node, so " [puppet] - 10https://gerrit.wikimedia.org/r/449350 (owner: 10Dzahn) [22:46:53] 10Operations, 10Growth-Team, 10Mail, 10Notifications: SRE query: Is it possible to measure how many e-mails are sent to "black hole" e-mail addresses? - https://phabricator.wikimedia.org/T202329 (10faidon) a:03herron [22:49:07] (03PS2) 10Dzahn: tendril: move httpd out of module to profile [puppet] - 10https://gerrit.wikimedia.org/r/449350 [22:49:59] (03CR) 10jerkins-bot: [V: 04-1] tendril: move httpd out of module to profile [puppet] - 10https://gerrit.wikimedia.org/r/449350 (owner: 10Dzahn) [22:53:29] (03PS3) 10Dzahn: tendril: move httpd out of module to profile [puppet] - 10https://gerrit.wikimedia.org/r/449350 [22:54:53] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) >>! In T200297#4517617, @daniel wrote: >> Delete the pages and drop the namespace. Note that stor... [23:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180821T2300). [23:00:05] Smalyshev: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:54] (03CR) 10Bartosz Dziewoński: [C: 04-1] Configure gendered namespaces for pl.wiktionary (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/454213 (https://phabricator.wikimedia.org/T202347) (owner: 10MarcoAurelio) [23:02:57] here, but have to be afk for 10 mins [23:06:25] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0 [23:06:45] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0 [23:08:39] (03PS4) 10Dzahn: tendril: move httpd out of module to profile [puppet] - 10https://gerrit.wikimedia.org/r/449350 [23:09:25] (03CR) 10jerkins-bot: [V: 04-1] tendril: move httpd out of module to profile [puppet] - 10https://gerrit.wikimedia.org/r/449350 (owner: 10Dzahn) [23:16:14] (03CR) 10Filippo Giunchedi: "I'm +1 on the idea, though I believe reducing replica count should be part of regular curator cleanup we do to drop logstash indices. Is i" [puppet] - 10https://gerrit.wikimedia.org/r/454354 (https://phabricator.wikimedia.org/T201971) (owner: 10Herron) [23:20:45] 10Operations, 10Parsoid: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10RobH) a:05RobH>03None [23:21:25] 10Operations, 10Parsoid: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10RobH) I'm not quite sure who on the #parsoid handling team would be involved in pushign this into service to replace ruthenium. If no one chimes in by next Monday, I'll be listing... 
[23:22:16] (03PS5) 10Dzahn: tendril: move httpd out of module to profile [puppet] - 10https://gerrit.wikimedia.org/r/449350 [23:24:20] (03CR) 10Dzahn: "moved all the http(s) related things to a profile::tendril::webserver, fixed the TODO" [puppet] - 10https://gerrit.wikimedia.org/r/449350 (owner: 10Dzahn) [23:26:20] 10Operations, 10DBA, 10JADE, 10TechCom-RFC, 10Scoring-platform-team (Current): Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10Harej) >>! In T200297#4517617, @daniel wrote: >> Delete the pages and drop the namespace. Note that stora... [23:26:39] SMalyshev: whenever you're around I can SWAT your change [23:28:20] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/12152/dbmonitor1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/449350 (owner: 10Dzahn) [23:29:04] thcipriani: great, I'm back now [23:29:19] okie doke [23:29:21] thanks for waiting [23:29:40] (03PS4) 10Thcipriani: Switch entity reference type indexing from opt-in to opt-out [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452956 (https://phabricator.wikimedia.org/T199884) (owner: 10Smalyshev) [23:29:57] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452956 (https://phabricator.wikimedia.org/T199884) (owner: 10Smalyshev) [23:30:03] sure thing :) [23:31:02] (03PS6) 10Dzahn: tendril: move httpd out of module to profile [puppet] - 10https://gerrit.wikimedia.org/r/449350 [23:31:27] (03Merged) 10jenkins-bot: Switch entity reference type indexing from opt-in to opt-out [mediawiki-config] - 10https://gerrit.wikimedia.org/r/452956 (https://phabricator.wikimedia.org/T199884) (owner: 10Smalyshev) [23:32:04] SMalyshev: your change is live on mwdebug1002, if there's anything you can check there [23:32:22] thcipriani: checking [23:32:30] (03CR) 10jenkins-bot: Switch entity reference type indexing from opt-in to opt-out [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/452956 (https://phabricator.wikimedia.org/T199884) (owner: 10Smalyshev) [23:33:53] 10Operations, 10Parsoid: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10ssastry) >>! In T201366#4521278, @RobH wrote: > I'm not quite sure who on the #parsoid handling team would be involved in pushign this into service to replace ruthenium. > > If no... [23:33:59] (03CR) 10Dzahn: [C: 032] "being bold since you said "not a strong opinion" and it compiles fine. first applying only on codfw to double check" [puppet] - 10https://gerrit.wikimedia.org/r/449350 (owner: 10Dzahn) [23:34:48] (03CR) 10Dzahn: [C: 032] "23:31:53 wmf-style: total violations delta -2 and fixes a TODO" [puppet] - 10https://gerrit.wikimedia.org/r/449350 (owner: 10Dzahn) [23:36:06] (03CR) 10Dzahn: [C: 032] "Resolved violations: class 'tendril' declares class httpd from another module" [puppet] - 10https://gerrit.wikimedia.org/r/449350 (owner: 10Dzahn) [23:37:25] thcipriani: hmmm it doesn't work exactly like I intended but this may be not an issue of the patch... so I still think it's good to go [23:37:26] (03CR) 10Dzahn: [C: 032] "noop on dbmonitor1001/2001" [puppet] - 10https://gerrit.wikimedia.org/r/449350 (owner: 10Dzahn) [23:38:02] SMalyshev: ok, is there any particular order these files need to be synced in? Or can they go out at the same time? [23:38:19] thcipriani: no particular order, any is fine [23:38:27] okie doke, going live [23:41:21] thanks! 
[23:42:07] !log thcipriani@deploy1001 Synchronized wmf-config: SWAT: [[gerrit:452956|Switch entity reference type indexing from opt-in to opt-out]] T199884 (duration: 00m 57s) [23:42:13] ^ SMalyshev live now [23:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:15] T199884: Support haswbstatement in other properties - https://phabricator.wikimedia.org/T199884 [23:42:22] thanks [23:42:27] yw :) [23:42:31] (03CR) 10Dzahn: [C: 04-1] "http://puppet-compiler.wmflabs.org/12153/contint1001.wikimedia.org/change.contint1001.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/453554 (owner: 10Dzahn) [23:44:08] (03CR) 10Dzahn: [C: 04-2] "there must be more usage of the apache module in other profiles included in the ci::master role.. back to WIP" [puppet] - 10https://gerrit.wikimedia.org/r/453554 (owner: 10Dzahn) [23:44:30] (03PS1) 10RobH: icinga1001 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/454433 (https://phabricator.wikimedia.org/T201344) [23:45:41] (03CR) 10RobH: [C: 032] icinga1001 production dns entries [dns] - 10https://gerrit.wikimedia.org/r/454433 (https://phabricator.wikimedia.org/T201344) (owner: 10RobH) [23:45:56] (03PS2) 10Dzahn: dnsrecursor: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/450315 [23:51:09] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/12154/dns2001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/450315 (owner: 10Dzahn) [23:51:20] thcipriani: is swat done? 
[23:51:36] legoktm: yes [23:51:44] just the one patch today [23:52:08] great, /me goes to mess with jenkins :) [23:54:32] (03PS2) 10Dzahn: puppetdb: add postgres backup to bacula [puppet] - 10https://gerrit.wikimedia.org/r/449523 [23:55:07] (03CR) 10Dzahn: [C: 032] "noop everywhere" [puppet] - 10https://gerrit.wikimedia.org/r/450315 (owner: 10Dzahn) [23:57:42] (03PS1) 10RobH: icinga1001 install params [puppet] - 10https://gerrit.wikimedia.org/r/454434 (https://phabricator.wikimedia.org/T201344) [23:58:21] (03CR) 10RobH: [C: 032] icinga1001 install params [puppet] - 10https://gerrit.wikimedia.org/r/454434 (https://phabricator.wikimedia.org/T201344) (owner: 10RobH)