[00:15:48] (PS1) Smalyshev: Fix target file for diff between weekly and daily [puppet] - https://gerrit.wikimedia.org/r/450483 (https://phabricator.wikimedia.org/T201217)
[00:18:11] (PS2) Smalyshev: Fix target file for diff between weekly and daily [puppet] - https://gerrit.wikimedia.org/r/450483 (https://phabricator.wikimedia.org/T201217)
[00:49:35] (PS1) Alex Monk: deployment-prep: Change urldownloader host [mediawiki-config] - https://gerrit.wikimedia.org/r/450484
[00:51:31] (PS1) Alex Monk: deployment-prep: Update urldownloader host [puppet] - https://gerrit.wikimedia.org/r/450485
[00:56:20] (PS1) Alex Monk: url_downloader/squid3: Work on stretch [puppet] - https://gerrit.wikimedia.org/r/450486
[00:57:03] (CR) jerkins-bot: [V: -1] url_downloader/squid3: Work on stretch [puppet] - https://gerrit.wikimedia.org/r/450486 (owner: Alex Monk)
[01:08:29] (CR) Krinkle: [C: 1] "When deploying this, the InitialiseSettings.php change must be deployed first to avoid a potential short thunder of E_NOTICE in production" [mediawiki-config] - https://gerrit.wikimedia.org/r/444574 (owner: Prtksxna)
[01:12:01] (CR) Krinkle: [C: 1] deployment-prep: Update urldownloader host [puppet] - https://gerrit.wikimedia.org/r/450485 (owner: Alex Monk)
[01:12:23] (PS2) Alex Monk: url_downloader/squid3: Work on stretch [puppet] - https://gerrit.wikimedia.org/r/450486
[01:12:33] (CR) Krinkle: "Has this been beta-picked?" [mediawiki-config] - https://gerrit.wikimedia.org/r/450484 (owner: Alex Monk)
[01:12:43] (CR) Krinkle: [C: 1] "(wrong tab)" [mediawiki-config] - https://gerrit.wikimedia.org/r/450484 (owner: Alex Monk)
[01:12:47] (CR) Krinkle: [C: 1] "Has this been beta-picked?" [puppet] - https://gerrit.wikimedia.org/r/450485 (owner: Alex Monk)
[01:13:03] (CR) jerkins-bot: [V: -1] url_downloader/squid3: Work on stretch [puppet] - https://gerrit.wikimedia.org/r/450486 (owner: Alex Monk)
[01:15:24] (PS3) Alex Monk: url_downloader/squid3: Work on stretch [puppet] - https://gerrit.wikimedia.org/r/450486
[01:16:12] (CR) jerkins-bot: [V: -1] url_downloader/squid3: Work on stretch [puppet] - https://gerrit.wikimedia.org/r/450486 (owner: Alex Monk)
[01:18:08] (CR) Alex Monk: "no" [puppet] - https://gerrit.wikimedia.org/r/450485 (owner: Alex Monk)
[01:18:40] (PS4) Alex Monk: url_downloader/squid3: Work on stretch [puppet] - https://gerrit.wikimedia.org/r/450486
[01:19:22] (CR) jerkins-bot: [V: -1] url_downloader/squid3: Work on stretch [puppet] - https://gerrit.wikimedia.org/r/450486 (owner: Alex Monk)
[01:21:56] (PS5) Alex Monk: url_downloader/squid3: Work on stretch [puppet] - https://gerrit.wikimedia.org/r/450486
[01:29:37] (CR) Alex Monk: "now it has" [puppet] - https://gerrit.wikimedia.org/r/450485 (owner: Alex Monk)
[01:42:57] (Abandoned) Alex Monk: beta: Set up deployment-deploy02 as deployment-mira replacement [puppet] - https://gerrit.wikimedia.org/r/449643 (https://phabricator.wikimedia.org/T192561) (owner: Alex Monk)
[02:03:09] (PS5) Alex Monk: [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962)
[02:03:49] (CR) jerkins-bot: [V: -1] [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: Alex Monk)
[02:05:33] (PS6) Alex Monk: [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962)
[02:06:12] (CR) jerkins-bot: [V: -1] [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: Alex Monk)
[02:10:58] (PS7) Alex Monk: [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962)
[02:11:37] (CR) jerkins-bot: [V: -1] [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: Alex Monk)
[02:18:15] (PS8) Alex Monk: [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962)
[02:18:32] (CR) jerkins-bot: [V: -1] [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: Alex Monk)
[02:25:28] (PS9) Alex Monk: [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962)
[02:26:09] (CR) jerkins-bot: [V: -1] [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: Alex Monk)
[02:35:14] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.15) (duration: 13m 40s)
[02:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:45:43] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Mon Aug 6 02:45:42 UTC 2018 (duration 10m 28s)
[02:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:06:05] (PS1) Alex Monk: Fix basic functionality [software/certcentral] - https://gerrit.wikimedia.org/r/450492
[03:07:11] (CR) jerkins-bot: [V: -1] Fix basic functionality [software/certcentral] - https://gerrit.wikimedia.org/r/450492 (owner: Alex Monk)
[03:09:24] (PS2) Alex Monk: Fix basic functionality [software/certcentral] - https://gerrit.wikimedia.org/r/450492
[03:10:29] (CR) jerkins-bot: [V: -1] Fix basic functionality [software/certcentral] - https://gerrit.wikimedia.org/r/450492 (owner: Alex Monk)
[03:11:20] (PS3) Alex Monk: Fix basic functionality [software/certcentral] - https://gerrit.wikimedia.org/r/450492
[03:12:23] (CR) jerkins-bot: [V: -1] Fix basic functionality [software/certcentral] - https://gerrit.wikimedia.org/r/450492 (owner: Alex Monk)
[03:14:15] (PS4) Alex Monk: Fix basic functionality [software/certcentral] - https://gerrit.wikimedia.org/r/450492
[03:15:18] (CR) jerkins-bot: [V: -1] Fix basic functionality [software/certcentral] - https://gerrit.wikimedia.org/r/450492 (owner: Alex Monk)
[03:16:07] (PS5) Alex Monk: Fix basic functionality [software/certcentral] - https://gerrit.wikimedia.org/r/450492
[03:19:45] (CR) Alex Monk: [C: 2] Fix basic functionality [software/certcentral] - https://gerrit.wikimedia.org/r/450492 (owner: Alex Monk)
[03:20:51] (Merged) jenkins-bot: Fix basic functionality [software/certcentral] - https://gerrit.wikimedia.org/r/450492 (owner: Alex Monk)
[03:21:05] (PS22) Alex Monk: provide ACMEv2 support based on certbot/acme library [software/certcentral] - https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: Vgutierrez)
[03:21:27] (CR) jenkins-bot: Fix basic functionality [software/certcentral] - https://gerrit.wikimedia.org/r/450492 (owner: Alex Monk)
[03:25:17] (PS10) Alex Monk: [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962)
[03:25:58] (CR) jerkins-bot: [V: -1] [WIP] Central certificates service [puppet] - https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: Alex Monk)
[03:26:46] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 939.19 seconds
[03:45:06] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 250.85 seconds
[03:46:37] (CR) Alex Monk: "I haven't yet had time to integrate this with certcentral.py's certificate_management but it looks like it'll be something along these lin" [software/certcentral] - https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: Vgutierrez)
[04:09:55] !log on mwmaint1001 running populateContentTables.php on mediawikiwiki T183488
[04:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:10:01] T183488: MCR schema migration stage 2: populate new fields - https://phabricator.wikimedia.org/T183488
[04:29:45] !log on mwmaint1001 running populateContentTables.php on metawiki T183488
[04:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:29:50] T183488: MCR schema migration stage 2: populate new fields - https://phabricator.wikimedia.org/T183488
[05:00:27] (PS1) BryanDavis: Kubernetes: ignore terminating objects when searching [software/tools-webservice] - https://gerrit.wikimedia.org/r/450495 (https://phabricator.wikimedia.org/T156626)
[06:30:06] PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:30:36] PROBLEM - puppet last run on mw1319 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/vim/vimrc.local]
[06:31:26] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ferm/ferm.conf]
[06:32:15] PROBLEM - puppet last run on analytics1071 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/bash_autologout.sh]
[06:37:56] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2054 is CRITICAL: cluster=mysql device=cciss,11 instance=db2054:9100 job=node site=codfw Marostegui T201245 - The acknowledgement expires at: 2018-08-08 06:37:46. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2054&var-datasource=codfw%2520prometheus%252Fops
[06:41:50] (PS1) Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/450509
[06:43:34] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/450509 (owner: Marostegui)
[06:44:54] (Merged) jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/450509 (owner: Marostegui)
[06:46:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1077 (duration: 00m 53s)
[06:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:22] !log Deploy schema change on db1077 with replication, this will generate lag on labsdb:s3 T144010 T51190 T199368
[06:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:31] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190
[06:49:31] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010
[06:49:31] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368
[06:50:16] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 600.52 seconds
[06:56:05] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:56:36] RECOVERY - puppet last run on mw1319 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:23] (CR) Vgutierrez: provide ACMEv2 support based on certbot/acme library (1 comment) [software/certcentral] - https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: Vgutierrez)
[06:57:25] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:57:45] (CR) jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - https://gerrit.wikimedia.org/r/450509 (owner: Marostegui)
[06:58:15] RECOVERY - puppet last run on analytics1071 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:01:49] (CR) Vgutierrez: "> I haven't yet had time to integrate this with certcentral.py's" [software/certcentral] - https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: Vgutierrez)
[07:11:16] (PS1) Ema: Route grafana through cache_text [dns] - https://gerrit.wikimedia.org/r/450513 (https://phabricator.wikimedia.org/T164609)
[07:13:06] (CR) Ema: [C: 2] Route grafana through cache_text [dns] - https://gerrit.wikimedia.org/r/450513 (https://phabricator.wikimedia.org/T164609) (owner: Ema)
[07:41:32] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/450515
[07:51:33] (PS3) ArielGlenn: Fix target file for diff between weekly and daily [puppet] - https://gerrit.wikimedia.org/r/450483 (https://phabricator.wikimedia.org/T201217) (owner: Smalyshev)
[07:51:45] <_joe_> !log restarting logstash on logstash1008, losing packets again
[07:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:56] (CR) ArielGlenn: [C: 2] Fix target file for diff between weekly and daily [puppet] - https://gerrit.wikimedia.org/r/450483 (https://phabricator.wikimedia.org/T201217) (owner: Smalyshev)
[07:54:17] (PS4) Volans: wmf-decommission-host: initial version [puppet] - https://gerrit.wikimedia.org/r/446887 (https://phabricator.wikimedia.org/T198649)
[07:54:23] (CR) Volans: "Thanks for the review, replies inline" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/446887 (https://phabricator.wikimedia.org/T198649) (owner: Volans)
[07:55:35] <_joe_> and ofc it worked
[07:56:07] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/450515 (owner: Marostegui)
[07:57:25] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/450515 (owner: Marostegui)
[08:02:09] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - https://gerrit.wikimedia.org/r/450515 (owner: Marostegui)
[08:03:43] <_joe_> !log restarting logstash on logstash1009, losing packets again
[08:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:55] (PS3) Volans: Initial structure [software/spicerack] - https://gerrit.wikimedia.org/r/448046 (https://phabricator.wikimedia.org/T199079)
[08:07:57] (PS6) Volans: Add common base utility modules [software/spicerack] - https://gerrit.wikimedia.org/r/448047 (https://phabricator.wikimedia.org/T199079)
[08:08:04] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1077 (duration: 00m 51s)
[08:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:26] (CR) jerkins-bot: [V: -1] Add common base utility modules [software/spicerack] - https://gerrit.wikimedia.org/r/448047 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[08:08:28] (CR) jerkins-bot: [V: -1] Initial structure [software/spicerack] - https://gerrit.wikimedia.org/r/448046 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[08:09:39] _joe_: any plan to release the latest conftool on pypi? the current one is 1y old
[08:10:32] it's kinda needed ;)
[08:13:56] (CR) Zhuyifei1999: Kubernetes: ignore terminating objects when searching (3 comments) [software/tools-webservice] - https://gerrit.wikimedia.org/r/450495 (https://phabricator.wikimedia.org/T156626) (owner: BryanDavis)
[08:17:48] <_joe_> volans: sooner or later :P
[08:17:59] <_joe_> am I blocking you in any way?
[08:17:59] it's a blocker for spicerack tests ;)
[08:18:04] <_joe_> ok
[08:18:15] (CR) Zhuyifei1999: [C: -1] Removing gridengine as default and encouraging the use of Kubernetes (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) (owner: Nehajha)
[08:18:57] sorry to bother, but it's on your account on pypi
[08:19:25] <_joe_> nah that's fair
[08:23:18] <_joe_> !log restarting logstash on logstash1007, losing packets as well
[08:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:51] !log Remove unused mysql users allowed to connect from 208.80.152.226
[08:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:26] (CR) Volans: [C: 1] "LGTM, two optional nitpicks inline" (2 comments) [software/certcentral] - https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: Vgutierrez)
[08:52:00] (CR) Volans: "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/448046 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[08:54:24] (CR) Ema: [C: 1] mediawiki: fix typo in rewrite rule [puppet] - https://gerrit.wikimedia.org/r/449723 (owner: Giuseppe Lavagetto)
[08:54:48] (CR) Volans: [C: 2] Initial structure [software/spicerack] - https://gerrit.wikimedia.org/r/448046 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[08:54:58] (CR) Volans: "recheck" [software/spicerack] - https://gerrit.wikimedia.org/r/448047 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[08:55:56] (Merged) jenkins-bot: Initial structure [software/spicerack] - https://gerrit.wikimedia.org/r/448046 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[08:56:26] (CR) Volans: [C: 2] Add common base utility modules [software/spicerack] - https://gerrit.wikimedia.org/r/448047 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[08:56:29] (PS1) Filippo Giunchedi: logstash: alert on udp loss ratio [puppet] - https://gerrit.wikimedia.org/r/450522 (https://phabricator.wikimedia.org/T200960)
[08:57:04] (CR) jerkins-bot: [V: -1] logstash: alert on udp loss ratio [puppet] - https://gerrit.wikimedia.org/r/450522 (https://phabricator.wikimedia.org/T200960) (owner: Filippo Giunchedi)
[08:57:14] (Merged) jenkins-bot: Add common base utility modules [software/spicerack] - https://gerrit.wikimedia.org/r/448047 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[09:00:13] Operations, Traffic: Discard of cold, labeled VCL crashes varnish parent and child - https://phabricator.wikimedia.org/T200207 (ema) Open>Resolved a:ema No more cold VCLs, [[https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/445357/ | workaround ]] working fine. Closing.
[09:00:29] (PS2) Filippo Giunchedi: logstash: alert on udp loss ratio [puppet] - https://gerrit.wikimedia.org/r/450522 (https://phabricator.wikimedia.org/T200960)
[09:02:26] (PS6) Ema: trafficserver: initial module/profile/role [puppet] - https://gerrit.wikimedia.org/r/450204 (https://phabricator.wikimedia.org/T200178)
[09:10:03] (PS3) Filippo Giunchedi: logstash: alert on udp loss ratio [puppet] - https://gerrit.wikimedia.org/r/450522 (https://phabricator.wikimedia.org/T200960)
[09:15:17] (PS4) Filippo Giunchedi: logstash: alert on udp loss ratio [puppet] - https://gerrit.wikimedia.org/r/450522 (https://phabricator.wikimedia.org/T200960)
[09:16:08] (PS23) Vgutierrez: provide ACMEv2 support based on certbot/acme library [software/certcentral] - https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717)
[09:16:35] (CR) Vgutierrez: provide ACMEv2 support based on certbot/acme library (2 comments) [software/certcentral] - https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: Vgutierrez)
[09:17:00] (PS2) Giuseppe Lavagetto: mediawiki: fix typo in rewrite rule [puppet] - https://gerrit.wikimedia.org/r/449723
[09:18:15] (CR) Vgutierrez: [C: 2] provide ACMEv2 support based on certbot/acme library [software/certcentral] - https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: Vgutierrez)
[09:18:41] (CR) Volans: provide ACMEv2 support based on certbot/acme library (1 comment) [software/certcentral] - https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: Vgutierrez)
[09:18:50] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: fix typo in rewrite rule [puppet] - https://gerrit.wikimedia.org/r/449723 (owner: Giuseppe Lavagetto)
[09:19:24] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/11989/" [puppet] - https://gerrit.wikimedia.org/r/450522 (https://phabricator.wikimedia.org/T200960) (owner: Filippo Giunchedi)
[09:19:26] (CR) jenkins-bot: provide ACMEv2 support based on certbot/acme library [software/certcentral] - https://gerrit.wikimedia.org/r/446618 (https://phabricator.wikimedia.org/T199717) (owner: Vgutierrez)
[09:21:56] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 263.28 seconds
[09:22:15] Operations, Traffic, Goal: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 (Vgutierrez)
[09:22:19] Operations, Traffic: Pick up a suitable ACME library for certcentral - https://phabricator.wikimedia.org/T199717 (Vgutierrez) Open>Resolved a:Vgutierrez
[09:25:28] (CR) Filippo Giunchedi: [C: 2] logstash: alert on udp loss ratio [puppet] - https://gerrit.wikimedia.org/r/450522 (https://phabricator.wikimedia.org/T200960) (owner: Filippo Giunchedi)
[09:25:35] (PS5) Filippo Giunchedi: logstash: alert on udp loss ratio [puppet] - https://gerrit.wikimedia.org/r/450522 (https://phabricator.wikimedia.org/T200960)
[09:27:26] Can this be merged? https://gerrit.wikimedia.org/r/c/operations/puppet/+/450232
[09:27:44] jobs.wikimedia.org is sorta broken because of the migration
[09:28:53] <_joe_> Amir1: it will have to wait I have time to rebase it and that my current apache change is applied
[09:29:33] Sure thing, I just don't want it to fall off the radar
[09:29:36] Thanks
[09:30:43] (PS1) Filippo Giunchedi: logstash: daily bandaid restart [puppet] - https://gerrit.wikimedia.org/r/450527 (https://phabricator.wikimedia.org/T200960)
[09:31:08] !log mobrovac@deploy1001 Started deploy [cpjobqueue/deploy@62716a5]: (no justification provided)
[09:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:35] !log mobrovac@deploy1001 Finished deploy [cpjobqueue/deploy@62716a5]: (no justification provided) (duration: 00m 27s)
[09:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:21] !log mobrovac@deploy1001 Started deploy [cpjobqueue/deploy@62716a5]: CirrusSearch jobs: Increase checker concurrency to 10 - T198462
[09:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:25] T198462: Rethink pacing the cirrusSearchCheckerJob - https://phabricator.wikimedia.org/T198462
[09:33:05] !log mobrovac@deploy1001 Finished deploy [cpjobqueue/deploy@62716a5]: CirrusSearch jobs: Increase checker concurrency to 10 - T198462 (duration: 00m 44s)
[09:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:49] Operations, Discovery-Search (Current work), Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['relforge1002.eqiad.wmnet'] ``` The log...
[09:39:15] PROBLEM - ElasticSearch health check for shards on relforge1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.4.13:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.4.13, port=9200): Read timed out. (read timeout=4)
[09:43:23] (CR) MarcoAurelio: [C: 1] "+1 lgtm" [mediawiki-config] - https://gerrit.wikimedia.org/r/450441 (https://phabricator.wikimedia.org/T200698) (owner: Gergő Tisza)
[09:45:26] (PS2) Giuseppe Lavagetto: Use the correct destination for jobs.wikimedia.org and similar [puppet] - https://gerrit.wikimedia.org/r/450232 (owner: Ladsgroup)
[09:46:44] <_joe_> Amir1: whoever controls the new site should create a redirect though
[09:48:30] _joe_: honestly, I don't even like the new website :) It's an accessibility nightmare
[09:53:57] Operations, Traffic, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` cp5003.eqsin.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201808...
[10:01:35] (CR) Volans: [C: 1] "LGTM syntax wise, I'll leave it to you for the choice of the partman recipe ;)" [puppet] - https://gerrit.wikimedia.org/r/450062 (https://phabricator.wikimedia.org/T193649) (owner: Gehel)
[10:03:43] (CR) Volans: [C: 1] "LGTM syntax wise, I'll leave it to you for the choice of the partman recipe ;)" [puppet] - https://gerrit.wikimedia.org/r/450064 (https://phabricator.wikimedia.org/T193649) (owner: Gehel)
[10:06:29] !log mobrovac@deploy1001 Started deploy [restbase/deploy@478652a]: Add new wikis and fix lang fallback for lang variants
[10:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:41] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@478652a]: Add new wikis and fix lang fallback for lang variants (duration: 11m 12s)
[10:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:05] !log mobrovac@deploy1001 Started deploy [restbase/deploy@478652a]: Add new wikis and fix lang fallback for lang variants
[10:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:56] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@478652a]: Add new wikis and fix lang fallback for lang variants (duration: 10m 51s)
[10:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:00] !log mobrovac@deploy1001 Started deploy [restbase/deploy@478652a]: Add new wikis and fix lang fallback for lang variants
[10:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:13] (PS2) Filippo Giunchedi: logstash: daily bandaid restart [puppet] - https://gerrit.wikimedia.org/r/450527 (https://phabricator.wikimedia.org/T200960)
[10:30:22] (CR) Filippo Giunchedi: [C: 2] logstash: daily bandaid restart [puppet] - https://gerrit.wikimedia.org/r/450527 (https://phabricator.wikimedia.org/T200960) (owner: Filippo Giunchedi)
[10:33:44] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@478652a]: Add new wikis and fix lang fallback for lang variants (duration: 04m 44s)
[10:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:47] !log mobrovac@deploy1001 Started deploy [restbase/deploy@478652a]: Add new wikis and fix lang fallback for lang variants
[10:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:08] (PS2) Gergő Tisza: Give hewiki interface-admins the rights interface-editors have [mediawiki-config] - https://gerrit.wikimedia.org/r/450441 (https://phabricator.wikimedia.org/T200698)
[10:38:25] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@478652a]: Add new wikis and fix lang fallback for lang variants (duration: 04m 38s)
[10:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:45] Operations, Traffic, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp5003.eqsin.wmnet'] ``` and were **ALL** successful.
[10:44:27] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: Set up puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792 (Krenair)
[10:44:29] Puppet, Toolforge, Patch-For-Review: Make standalone puppetmasters optionally use PuppetDB - https://phabricator.wikimedia.org/T153577 (Krenair) Open>Resolved
[10:45:36] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.1489 ge 0.05 https://grafana.wikimedia.org/dashboard/db/logstash
[10:46:04] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: Set up puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792 (Krenair) Open>Resolved
[10:47:57] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.05 ge (W)0.01 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[10:48:05] Puppet, Cloud-Services, Toolforge, Patch-For-Review: role::puppetmaster::puppetdb uses nginx as reverse proxy and cannot be used together with Apache applications - https://phabricator.wikimedia.org/T154105 (Krenair) Not strictly a T153577 blocker as you can use them on separate hosts.
[10:48:14] Puppet, Toolforge, Patch-For-Review: Make standalone puppetmasters optionally use PuppetDB - https://phabricator.wikimedia.org/T153577 (Krenair)
[10:48:20] Puppet, Cloud-Services, Toolforge, Patch-For-Review: role::puppetmaster::puppetdb uses nginx as reverse proxy and cannot be used together with Apache applications - https://phabricator.wikimedia.org/T154105 (Krenair)
[10:54:49] !log repair sde on ms-be2040
[10:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:17] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.07395 ge 0.05 https://grafana.wikimedia.org/dashboard/db/logstash
[10:58:46] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.09331 ge 0.05 https://grafana.wikimedia.org/dashboard/db/logstash
[11:00:04] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180806T1100).
[11:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180806T1100). Please do the needful.
[11:00:05] rxy, Krenair, and tgr: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:06] (0_0)ノ
[11:00:24] o/
[11:00:31] o/
[11:00:37] I can SWAT today
[11:00:50] hey
[11:00:56] tgr: want to deploy your patch(es)? (did not check the calendar yet)
[11:01:15] zeljkof: sure, why not
[11:01:31] tgr: go ahead then, while I review other patches
[11:01:41] tgr: let me know when you are done
[11:01:54] let's see if I manage to not screw up file permissions this time
[11:02:27] tgr: :D
[11:02:34] (CR) Gergő Tisza: [C: 2] Give hewiki interface-admins the rights interface-editors have [mediawiki-config] - https://gerrit.wikimedia.org/r/450441 (https://phabricator.wikimedia.org/T200698) (owner: Gergő Tisza)
[11:02:42] (PS1) Filippo Giunchedi: thumbor: add wikimedia-id-internal-local-public private container [puppet] - https://gerrit.wikimedia.org/r/450539 (https://phabricator.wikimedia.org/T201187)
[11:02:49] Krenair, rxy: you are not deployers, right?
[11:02:56] no
[11:03:07] we're not
[11:03:19] I'm not a deployer
[11:03:31] Operations, Thumbor, Patch-For-Review: Thumbnails don't seem to be being created/saved for id_internalwikimedia - https://phabricator.wikimedia.org/T201187 (fgiunchedi) I believe that's because thumbor has to know about private containers, I've proposed https://gerrit.wikimedia.org/r/c/operations/pup...
[11:03:36] Krenair, rxy: please stand by, I'll ping you when your patches are at mwdebug1002 for testing, let me know if you need help testing there
[11:03:48] PROBLEM - swift-object-auditor on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:03:49] PROBLEM - swift-container-server on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:03:49] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/450540 (https://phabricator.wikimedia.org/T128546)
[11:03:55] (Merged) jenkins-bot: Give hewiki interface-admins the rights interface-editors have [mediawiki-config] - https://gerrit.wikimedia.org/r/450441 (https://phabricator.wikimedia.org/T200698) (owner: Gergő Tisza)
[11:03:58] PROBLEM - swift-object-updater on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:03:58] PROBLEM - dhclient process on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:03:59] PROBLEM - swift-account-reaper on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:04:05] zeljkof, well my one is just changes LabsServices.php
[11:04:18] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:04:18] PROBLEM - swift-object-server on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:04:27] it's a no-op in prod as that file isn't loaded there
[11:04:28] PROBLEM - Check size of conntrack table on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:04:29] PROBLEM - swift-container-auditor on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:04:29] PROBLEM - swift-object-replicator on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:04:39] PROBLEM - swift-account-replicator on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:04:44] Krenair: ah, so I can deploy it without testing at mwdebug?
[11:04:48] PROBLEM - swift-account-auditor on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:04:48] PROBLEM - puppet last run on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:04:56] yeah
[11:05:33] Operations, Discovery-Search (Current work), Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['relforge1002.eqiad.wmnet'] ``` and were **ALL** successful.
[11:05:37] herron: just checking, there are a few `Could not complete SSL handshake.` errors reported here, can we continue with SWAT?
[11:05:42] (CR) Jdrewniak: [C: 2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/450540 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[11:06:04] I'll take a look at ms-be2019
[11:06:09] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.05 ge (W)0.01 ge 0.001978 https://grafana.wikimedia.org/dashboard/db/logstash
[11:06:18] PROBLEM - swift-account-server on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:06:28] RECOVERY - swift-container-auditor on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[11:06:28] RECOVERY - Check size of conntrack table on ms-be2019 is OK: OK: nf_conntrack is 4 % full
[11:06:28] PROBLEM - swift-container-replicator on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:06:28] RECOVERY - swift-object-replicator on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[11:06:38] RECOVERY - swift-account-auditor on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[11:06:39] RECOVERY - swift-account-replicator on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[11:06:48] RECOVERY - swift-object-auditor on ms-be2019 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[11:06:58] RECOVERY - swift-container-server on ms-be2019 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[11:06:58] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/450540 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[11:06:58] RECOVERY - swift-object-updater on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[11:07:08] RECOVERY - swift-account-reaper on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[11:07:08] RECOVERY - dhclient process on ms-be2019 is OK: PROCS OK: 0 processes with command name dhclient
[11:07:09] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2019 is OK: OK ferm input default policy is set
[11:07:09] RECOVERY - swift-account-server on ms-be2019 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[11:07:09] RECOVERY - swift-object-server on ms-be2019 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[11:07:20] RECOVERY - swift-container-replicator on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[11:08:12] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:450441|Give hewiki interface-admins the rights interface-editors have]] (T200698) (duration: 00m 50s)
[11:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:17] T200698: Merge two hewiki user groups - https://phabricator.wikimedia.org/T200698
[11:08:33] zeljkof: done
[11:09:39] RECOVERY - puppet last run on ms-be2019 is OK: OK: Puppet is currently enabled, last run 30 minutes ago with 0 failures
[11:10:34] tgr: great! jan_drewniak is finishing portals update, I'll continue SWAT after he is done :)
[11:11:38] jan_drewniak: please let me know when you are done
[11:12:29] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.05468 ge 0.05 https://grafana.wikimedia.org/dashboard/db/logstash
[11:13:06] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:450526|Bumping portals to master (T128546)]] (duration: 00m 49s)
[11:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:10] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[11:13:55] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:450526|Bumping portals to master (T128546)]] (duration: 00m 48s)
[11:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:18] jan_drewniak: done with portals deploy?
[11:15:33] zeljkof: yup! all done now
[11:16:17] jan_drewniak: thanks! I'll continue with swat
[11:16:41] rxy: please stand by, your commit will be ready for testing in a few minutes
[11:16:57] I'm ready
[11:17:48] godog: nice! (Packet loss ratio for UDP on logstash1007) ^^^
[11:20:00] rxy: the commit is at mwdebug1002, please test and let me know if I can deploy it
[11:21:19] zeljkof: ok. My patch is work correctly. Please deploy it
[11:21:39] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.05487 ge 0.05 https://grafana.wikimedia.org/dashboard/db/logstash
[11:21:39] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.06013 ge 0.05 https://grafana.wikimedia.org/dashboard/db/logstash
[11:22:01] rxy: ok, deploying
[11:23:09] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:450401|Add suppressredirect permission to rollbacker and patroller at zhwiki (T201160)]] (duration: 00m 49s)
[11:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:14] T201160: Assign "suppressredirect" to patroller and rollbacker on zhwiki - https://phabricator.wikimedia.org/T201160
[11:23:28] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.06765 ge 0.05 https://grafana.wikimedia.org/dashboard/db/logstash
[11:23:30] rxy: it's deployed, please test and thanks for deploying with #releng! :D
[11:24:00] Krenair: please stand by, your patch will be at mwdebug1002 in a few minutes
[11:25:00] Krenair: is there a task associated with the patch? not required, but highly recommended to add it to the commit message
[11:25:27] zeljkof: work correctly in mw1264.eqiad.wmnet. Thanks!
[11:25:49] zeljkof, um it's a LabsServices.php patch
[11:25:55] mwdebug1002 is a prod host
[11:26:09] I don't have a task for it no
[11:26:16] Krenair: ah, sorry, forgot it's a labs patch :/
[11:26:32] Krenair: ok, merging, will ping you when it's deployed
[11:26:37] thanks
[11:26:38] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.05 ge (W)0.01 ge 0.006936 https://grafana.wikimedia.org/dashboard/db/logstash
[11:27:02] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/450484 (owner: Alex Monk)
[11:27:08] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.05 ge (W)0.01 ge 0.005083 https://grafana.wikimedia.org/dashboard/db/logstash
[11:28:19] (Merged) jenkins-bot: deployment-prep: Change urldownloader host [mediawiki-config] - https://gerrit.wikimedia.org/r/450484 (owner: Alex Monk)
[11:30:08] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.05 ge (W)0.01 ge 0.007021 https://grafana.wikimedia.org/dashboard/db/logstash
[11:31:25] !log zfilipin@deploy1001 Synchronized wmf-config/LabsServices.php: SWAT: [[gerrit:450484|deployment-prep: Change urldownloader host (T201160)]] (duration: 00m 50s)
[11:31:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:30] T201160: Assign "suppressredirect" to patroller and rollbacker on zhwiki - https://phabricator.wikimedia.org/T201160
[11:31:50] Krenair: it's deployed, please test and thanks for deploying with #releng! :)
[11:32:03] thanks
[11:32:31] !log EU SWAT finished
[11:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:29] RECOVERY - Filesystem available is greater than filesystem size on ms-be2040 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops
[11:37:22] (PS1) Marostegui: dbproxy: Change failover retries [puppet] - https://gerrit.wikimedia.org/r/450542
[11:47:49] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.1254 ge 0.05 https://grafana.wikimedia.org/dashboard/db/logstash
[11:48:51] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/11991/" [puppet] - https://gerrit.wikimedia.org/r/450542 (owner: Marostegui)
[11:50:41] (CR) jenkins-bot: Give hewiki interface-admins the rights interface-editors have [mediawiki-config] - https://gerrit.wikimedia.org/r/450441 (https://phabricator.wikimedia.org/T200698) (owner: Gergő Tisza)
[11:50:43] (CR) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/450540 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[11:50:44] (CR) jenkins-bot: Add suppressredirect permission to rollbacker and patroller at zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/450401 (https://phabricator.wikimedia.org/T201160) (owner: Rxy)
[11:50:47] (CR) jenkins-bot: deployment-prep: Change urldownloader host [mediawiki-config] - https://gerrit.wikimedia.org/r/450484 (owner: Alex Monk)
[11:54:49] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.05327 ge 0.05 https://grafana.wikimedia.org/dashboard/db/logstash
[11:55:59] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.08683 ge 0.05 https://grafana.wikimedia.org/dashboard/db/logstash
[12:03:09] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.05 ge (W)0.01 ge 0.00567 https://grafana.wikimedia.org/dashboard/db/logstash
[12:03:29] (PS3) Giuseppe Lavagetto: Use the correct destination for jobs.wikimedia.org and similar [puppet] - https://gerrit.wikimedia.org/r/450232 (owner: Ladsgroup)
[12:06:48] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.07849 ge 0.05 https://grafana.wikimedia.org/dashboard/db/logstash
[12:09:50] (CR) Giuseppe Lavagetto: [C: 2] Use the correct destination for jobs.wikimedia.org and similar [puppet] - https://gerrit.wikimedia.org/r/450232 (owner: Ladsgroup)
[12:14:18] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.05231 ge 0.05 https://grafana.wikimedia.org/dashboard/db/logstash
[12:14:58] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.07915 ge 0.05 https://grafana.wikimedia.org/dashboard/db/logstash
[12:17:04] Krenair: Thanks for this: https://phabricator.wikimedia.org/T194267 ; btw, why deployment-tin is shutoff? could you please start the instance if no problem?
[12:17:12] https://tools.wmflabs.org/openstack-browser/server/deployment-tin.deployment-prep.eqiad.wmflabs
[12:17:19] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.07941 ge 0.05 https://grafana.wikimedia.org/dashboard/db/logstash
[12:17:22] rxy, hi, no, that instance is being deleted soon
[12:17:37] ah, ok.
[12:17:43] it's shut down so people don't try to use it
[12:18:04] you probably want to use deployment-deploy01 instead
[12:19:24] (PS1) Vgutierrez: Implement config file parsing outside CertCentral class [software/certcentral] - https://gerrit.wikimedia.org/r/450548
[12:19:26] Thanks for info. I tried this: https://www.mediawiki.org/wiki/Beta_Cluster#Testing_changes_on_Beta_Cluster
[12:22:08] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.05522 ge 0.05 https://grafana.wikimedia.org/dashboard/db/logstash
[12:23:14] (PS1) Arturo Borrero Gonzalez: toolforge: add texlive-full package [puppet] - https://gerrit.wikimedia.org/r/450549 (https://phabricator.wikimedia.org/T197176)
[12:23:17] fixed
[12:24:33] (PS2) Arturo Borrero Gonzalez: toolforge: add texlive-full package [puppet] - https://gerrit.wikimedia.org/r/450549 (https://phabricator.wikimedia.org/T197176)
[12:24:47] zeljkof, it worked btw, thanks
[12:24:57] beta-scap-eqiad is slow :(
[12:25:59] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.05 ge (W)0.01 ge 0.006769 https://grafana.wikimedia.org/dashboard/db/logstash
[12:26:13] (CR) Arturo Borrero Gonzalez: [C: 2] toolforge: add texlive-full package [puppet] - https://gerrit.wikimedia.org/r/450549 (https://phabricator.wikimedia.org/T197176) (owner: Arturo Borrero Gonzalez)
[12:26:48] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.05 ge (W)0.01 ge 0.009534 https://grafana.wikimedia.org/dashboard/db/logstash
[12:26:48] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.08793 ge 0.05 https://grafana.wikimedia.org/dashboard/db/logstash
[12:29:36] volans: yeah! sadly it brings bad news :(
[12:30:06] (PS9) Krinkle: webperf: Rename role::xenon to profile::webperf::xenon [puppet] - https://gerrit.wikimedia.org/r/443757 (https://phabricator.wikimedia.org/T195312)
[12:30:42] (PS9) Krinkle: mediawiki: Change xenon interval for Beta Cluster from 10min to 30s [puppet] - https://gerrit.wikimedia.org/r/443762
[12:31:42] (PS7) Krinkle: webperf: Enable xenondata_host on perfsite in Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/443764 (https://phabricator.wikimedia.org/T195312)
[12:32:15] (PS8) Krinkle: webperf: Split Redis from the rest of the arclamp profile [puppet] - https://gerrit.wikimedia.org/r/444331 (https://phabricator.wikimedia.org/T195312)
[12:32:48] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.05 ge (W)0.01 ge 0.00644 https://grafana.wikimedia.org/dashboard/db/logstash
[12:34:58] (PS6) Krinkle: webperf: Add arclamp profile to webperf::profiling_tools role [puppet] - https://gerrit.wikimedia.org/r/445066 (https://phabricator.wikimedia.org/T195312)
[12:37:00] Operations, User-fgiunchedi: rack/setup/install centrallog1001.eqiad.wmnet - https://phabricator.wikimedia.org/T200706 (fgiunchedi) Thanks @robh !
[12:37:29] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.06302 ge 0.05 https://grafana.wikimedia.org/dashboard/db/logstash
[12:42:19] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.05 ge (W)0.01 ge 0.007415 https://grafana.wikimedia.org/dashboard/db/logstash
[12:43:00] ok that alert is way too noisy now, I'm bumping it a little
[12:45:27] (PS1) Filippo Giunchedi: logstash: bump udp loss thresholds [puppet] - https://gerrit.wikimedia.org/r/450552 (https://phabricator.wikimedia.org/T200960)
[12:45:35] (CR) Alex Monk: [C: 2] Implement config file parsing outside CertCentral class [software/certcentral] - https://gerrit.wikimedia.org/r/450548 (owner: Vgutierrez)
[12:46:07] (PS2) Filippo Giunchedi: logstash: bump udp loss thresholds [puppet] - https://gerrit.wikimedia.org/r/450552 (https://phabricator.wikimedia.org/T200960)
[12:46:53] (Merged) jenkins-bot: Implement config file parsing outside CertCentral class [software/certcentral] - https://gerrit.wikimedia.org/r/450548 (owner: Vgutierrez)
[12:47:10] Can this be merged? https://gerrit.wikimedia.org/r/c/operations/puppet/+/450395 It doesn't affect prod in any way possible (wikilabels is only used in wikilabels project in cloud VPS)
[12:47:42] Amir1: looking
[12:47:46] <_joe_> Amir1: later, I'm merging a delicate apache change
[12:47:53] (CR) jenkins-bot: Implement config file parsing outside CertCentral class [software/certcentral] - https://gerrit.wikimedia.org/r/450548 (owner: Vgutierrez)
[12:47:55] (PS2) Giuseppe Lavagetto: mediawiki: convert remnant.conf to use one file per vhost [puppet] - https://gerrit.wikimedia.org/r/449724 (https://phabricator.wikimedia.org/T196968)
[12:48:09] <_joe_> arturo: can you please wait a few minutes?
[12:48:19] sure
[12:48:22] <_joe_> thanks
[12:48:34] <_joe_> I prefer to have ease of revert in case things go sideways
[12:48:44] :-)
[12:48:56] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: convert remnant.conf to use one file per vhost [puppet] - https://gerrit.wikimedia.org/r/449724 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[12:50:18] RECOVERY - ElasticSearch health check for shards on relforge1001 is OK: OK - elasticsearch status relforge-eqiad: status: green, number_of_nodes: 2, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 28, task_max_waiting_in_queue_millis: 0, cluster_name: relforge-eqiad, relocating_shards: 2, active_shards_percent_as_number: 100.0, active_shards: 49, initializ
[12:50:18] er_of_data_nodes: 2, delayed_unassigned_shards: 0
[12:51:38] Krenair: great!
[12:52:35] (CR) Filippo Giunchedi: [C: 2] logstash: bump udp loss thresholds [puppet] - https://gerrit.wikimedia.org/r/450552 (https://phabricator.wikimedia.org/T200960) (owner: Filippo Giunchedi)
[12:53:33] (PS1) Giuseppe Lavagetto: deploy-apache-change: raise concurrency of puppet runs [puppet] - https://gerrit.wikimedia.org/r/450554
[12:59:20] (CR) Arturo Borrero Gonzalez: "I plan to merge this in a couple of hours." [puppet] - https://gerrit.wikimedia.org/r/450395 (https://phabricator.wikimedia.org/T184437) (owner: Ladsgroup)
[12:59:24] (PS1) Gehel: logstash: enable GC logs [puppet] - https://gerrit.wikimedia.org/r/450555
[13:00:02] (CR) jerkins-bot: [V: -1] logstash: enable GC logs [puppet] - https://gerrit.wikimedia.org/r/450555 (owner: Gehel)
[13:00:09] (CR) Gehel: "I'm not sure if the UDP loss we see is correlated to GC or not, but to make sure, we should collect some data." [puppet] - https://gerrit.wikimedia.org/r/450555 (owner: Gehel)
[13:01:27] (PS2) Gehel: logstash: enable GC logs [puppet] - https://gerrit.wikimedia.org/r/450555
[13:01:47] Operations, Discovery-Search (Current work), Patch-For-Review: migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (Gehel) relforge migrated to stretch, and looking good!
[13:05:26] (PS9) Krinkle: webperf: Split Redis from the rest of the arclamp profile [puppet] - https://gerrit.wikimedia.org/r/444331 (https://phabricator.wikimedia.org/T195312)
[13:05:28] (PS7) Krinkle: webperf: Add arclamp profile to webperf::profiling_tools role [puppet] - https://gerrit.wikimedia.org/r/445066 (https://phabricator.wikimedia.org/T195312)
[13:09:22] are /^(actinium|alcyone|alsafi|aluminium)\.wikimedia\.org$/ running trusty?
[13:10:33] (CR) Krinkle: "Puppet compiler is clean this time, no more conflict. – https://puppet-compiler.wmflabs.org/compiler02/11992/" [puppet] - https://gerrit.wikimedia.org/r/445066 (https://phabricator.wikimedia.org/T195312) (owner: Krinkle)
[13:12:28] (PS8) Krinkle: webperf: Add arclamp profile to webperf::profiling_tools role [puppet] - https://gerrit.wikimedia.org/r/445066 (https://phabricator.wikimedia.org/T195312)
[13:15:31] Operations, Operations-Software-Development: wmf-auto-reimage: 'execution expired' on first puppet run - https://phabricator.wikimedia.org/T201317 (ema)
[13:15:55] Operations, Operations-Software-Development: wmf-auto-reimage: 'execution expired' on first puppet run - https://phabricator.wikimedia.org/T201317 (ema) p:Triage>Normal
[13:28:33] (PS2) Giuseppe Lavagetto: deploy-apache-change: raise concurrency of puppet runs [puppet] - https://gerrit.wikimedia.org/r/450554
[13:29:26] (CR) Giuseppe Lavagetto: [C: 2] deploy-apache-change: raise concurrency of puppet runs [puppet] - https://gerrit.wikimedia.org/r/450554 (owner: Giuseppe Lavagetto)
[13:32:04] Operations, Analytics, EventBus, Discovery-Search (Current work), and 2 others: Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (Ottomata) Not 100%, but I believe MirrorMaker will handle the additional throughput. If it doesn't, it wil...
[13:38:58] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.06287 ge 0.05 https://grafana.wikimedia.org/dashboard/db/logstash
[13:39:49] Operations, Traffic, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` ['cp5004.eqsin.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/20...
[13:42:38] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.05 ge (W)0.01 ge 0.005062 https://grafana.wikimedia.org/dashboard/db/logstash
[13:44:19] (PS3) Gehel: logstash: enable GC logs [puppet] - https://gerrit.wikimedia.org/r/450555
[13:46:22] (CR) Gehel: "ppc looks happy: https://puppet-compiler.wmflabs.org/compiler02/11994/logstash1007.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/450555 (owner: Gehel)
[13:46:41] Operations, ops-codfw, DBA: Degraded RAID on db2054 - https://phabricator.wikimedia.org/T201245 (Papaul) p:Triage>Normal
[13:48:24] (CR) Filippo Giunchedi: [C: 1] "LGTM! see inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/450555 (owner: Gehel)
[13:49:28] (CR) Filippo Giunchedi: [C: 1] "Ah, also please add Bug: T200362" [puppet] - https://gerrit.wikimedia.org/r/450555 (owner: Gehel)
[13:49:55] (PS4) Gehel: logstash: enable GC logs [puppet] - https://gerrit.wikimedia.org/r/450555 (https://phabricator.wikimedia.org/T200362)
[13:50:32] (CR) Gehel: "> Patch Set 3:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/450555 (https://phabricator.wikimedia.org/T200362) (owner: Gehel)
[13:52:34] (CR) Filippo Giunchedi: [C: 1] logstash: enable GC logs [puppet] - https://gerrit.wikimedia.org/r/450555 (https://phabricator.wikimedia.org/T200362) (owner: Gehel)
[13:54:56] (PS5) Gehel: logstash: enable GC logs [puppet] - https://gerrit.wikimedia.org/r/450555 (https://phabricator.wikimedia.org/T200362)
[13:55:26] (CR) Jcrespo: [C: 1] "I am not 100% sure about replicas- as in those cases we are in read only, but I think it is ok -1 minutes of unavailability before doing a" [puppet] - https://gerrit.wikimedia.org/r/450542 (owner: Marostegui)
[13:55:47] (CR) Gehel: [C: 2] logstash: enable GC logs [puppet] - https://gerrit.wikimedia.org/r/450555 (https://phabricator.wikimedia.org/T200362) (owner: Gehel)
[13:55:49] (PS2) Marostegui: dbproxy: Change failover retries [puppet] - https://gerrit.wikimedia.org/r/450542
[13:56:30] Operations, Thumbor, Patch-For-Review: Thumbnails don't seem to be being created/saved for id_internalwikimedia - https://phabricator.wikimedia.org/T201187 (fgiunchedi) No script no, just a review like the above. I'll deploy that later today.
[13:56:46] heads up: global rename stuck https://phabricator.wikimedia.org/T201314
[13:57:05] (PS3) Marostegui: dbproxy: Change failover retries [puppet] - https://gerrit.wikimedia.org/r/450542
[13:57:54] (CR) Marostegui: [C: 2] dbproxy: Change failover retries [puppet] - https://gerrit.wikimedia.org/r/450542 (owner: Marostegui)
[14:04:54] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.08102 ge 0.05 https://grafana.wikimedia.org/dashboard/db/logstash
[14:05:05] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.07169 ge 0.05 https://grafana.wikimedia.org/dashboard/db/logstash
[14:05:58] (PS1) Dbarratt: Disable Special:Block Feedback Request [mediawiki-config] - https://gerrit.wikimedia.org/r/450573
[14:06:55] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.05611 ge 0.05 https://grafana.wikimedia.org/dashboard/db/logstash
[14:07:20] !log Reload haproxy to apply new configuration
[14:07:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:31] (PS2) Giuseppe Lavagetto: mediawiki: prepare the transition for the main sites [puppet] - https://gerrit.wikimedia.org/r/449725 (https://phabricator.wikimedia.org/T196968)
[14:13:36] Operations, monitoring, User-fgiunchedi: rack/setup/install graphite1004 - https://phabricator.wikimedia.org/T196484 (fgiunchedi) Thanks @RobH ! Yeah role spare makes sense in this case.
[14:15:46] (CR) Giuseppe Lavagetto: [C: 2] "I rechecked the change 10 times, it is a verbatim copy of what we already did on the mwdebug servers, and it looks like the puppet compile" [puppet] - https://gerrit.wikimedia.org/r/449725 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[14:18:24] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.05 ge (W)0.01 ge 0.0004335 https://grafana.wikimedia.org/dashboard/db/logstash
[14:19:22] PROBLEM - Elasticsearch HTTPS on relforge1002 is CRITICAL: SSL CRITICAL - failed to verify relforge.svc.eqiad.wmnet against relforge1002.eqiad.wmnet
[14:20:00] (PS3) Filippo Giunchedi: logstash: bump udp loss thresholds [puppet] - https://gerrit.wikimedia.org/r/450552 (https://phabricator.wikimedia.org/T200960)
[14:21:24] (CR) Filippo Giunchedi: [C: 2] thumbor: add wikimedia-id-internal-local-public private container [puppet] - https://gerrit.wikimedia.org/r/450539 (https://phabricator.wikimedia.org/T201187) (owner: Filippo Giunchedi)
[14:21:25] (PS2) Filippo Giunchedi: thumbor: add wikimedia-id-internal-local-public private container [puppet] - https://gerrit.wikimedia.org/r/450539 (https://phabricator.wikimedia.org/T201187)
[14:23:22] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.05256 ge 0.05 https://grafana.wikimedia.org/dashboard/db/logstash
[14:23:36] shush!
[14:23:50] the threshold bump should be deployed soon
[14:25:59] (PS2) Dbarratt: Disable Special:Block Feedback Request [mediawiki-config] - https://gerrit.wikimedia.org/r/450573
[14:26:44] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.1693 ge 0.05 https://grafana.wikimedia.org/dashboard/db/logstash
[14:28:48] Operations, Traffic, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` cp5009.eqsin.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201808...
[14:30:08] !log roll-restart thumbor after adding new private containers - T201187
[14:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:13] T201187: Thumbnails don't seem to be being created/saved for id_internalwikimedia - https://phabricator.wikimedia.org/T201187
[14:33:01] Operations, Thumbor, Patch-For-Review: Thumbnails don't seem to be being created/saved for id_internalwikimedia - https://phabricator.wikimedia.org/T201187 (fgiunchedi) Can you try again?
[14:34:09] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[14:36:12] Operations, Traffic, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp5004.eqsin.wmnet'] ``` and were **ALL** successful.
[14:39:35] (PS2) Bstorm: gridengine: remove puppetized hosts file [puppet] - https://gerrit.wikimedia.org/r/450247 (https://phabricator.wikimedia.org/T139190)
[14:40:24] (CR) Bstorm: [C: 2] gridengine: remove puppetized hosts file [puppet] - https://gerrit.wikimedia.org/r/450247 (https://phabricator.wikimedia.org/T139190) (owner: Bstorm)
[14:42:40] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.04316 https://grafana.wikimedia.org/dashboard/db/logstash
[14:48:19] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 56 connecting: cp1080_v4, cp1080_v6
[14:50:42] Operations, ops-codfw, DBA: Degraded RAID on db2054 - https://phabricator.wikimedia.org/T201245 (Papaul) a:Papaul>Marostegui @Marostegui Disk replaced
[14:51:55] Operations, ops-codfw, DBA: Degraded RAID on db2054 - https://phabricator.wikimedia.org/T201245 (Marostegui) Thanks! ``` physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, Rebuilding) ```
[14:53:33] (PS1) BBlack: cp1080: remove from conftool/hieradata lists [puppet] - https://gerrit.wikimedia.org/r/450582 (https://phabricator.wikimedia.org/T201174)
[14:56:16] (PS1) BBlack: cp3031: remove from conftool/hieradata lists [puppet] - https://gerrit.wikimedia.org/r/450583 (https://phabricator.wikimedia.org/T200806)
[14:56:33] !log Stop MySQL for onsite maintenance - T200641
[14:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:37] (CR) BBlack: [C: 2] cp1080: remove from conftool/hieradata lists [puppet] - https://gerrit.wikimedia.org/r/450582 (https://phabricator.wikimedia.org/T201174) (owner: BBlack)
[14:56:38] T200641: pc2006 rebooted itself - https://phabricator.wikimedia.org/T200641
[14:56:45] (CR) BBlack: [C: 2] cp3031: remove from conftool/hieradata lists [puppet] - https://gerrit.wikimedia.org/r/450583 (https://phabricator.wikimedia.org/T200806) (owner: BBlack)
[15:01:47] (PS4) Giuseppe Lavagetto: mediawiki: add vhost define [puppet] - https://gerrit.wikimedia.org/r/439893 (https://phabricator.wikimedia.org/T196968)
[15:01:49] (PS1) Giuseppe Lavagetto: mediawiki: makes includes explicit in private-https.conf [puppet] - https://gerrit.wikimedia.org/r/450585
[15:01:51] (PS1) Giuseppe Lavagetto: mediawiki: serve small private wikis with mediawiki::web::vhost [puppet] - https://gerrit.wikimedia.org/r/450586 (https://phabricator.wikimedia.org/T196968)
[15:02:49] !log rebooting cp1075-90
[15:02:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:03] RECOVERY - IPsec on cp1077 is OK: Strongswan OK - 52 ESP OK
[15:03:12] RECOVERY - IPsec on cp1075 is OK: Strongswan OK - 52 ESP OK
[15:03:51] yep
[15:05:02] !log shutting down pc2006 for maintenance
[15:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:02] we can't
[15:08:33] RECOVERY - IPsec on cp1079 is OK: Strongswan OK - 52 ESP OK
[15:11:06] (PS5) Giuseppe Lavagetto: mediawiki: add vhost define [puppet] - https://gerrit.wikimedia.org/r/439893 (https://phabricator.wikimedia.org/T196968)
[15:11:08] (PS2) Giuseppe Lavagetto: mediawiki: makes includes explicit in private-https.conf [puppet] - https://gerrit.wikimedia.org/r/450585
[15:11:10] (PS2) Giuseppe Lavagetto: mediawiki: serve small private wikis with mediawiki::web::vhost [puppet] - https://gerrit.wikimedia.org/r/450586 (https://phabricator.wikimedia.org/T196968)
[15:12:03] RECOVERY - IPsec on cp1085 is OK: Strongswan OK - 52 ESP OK
[15:12:06] (PS6) Giuseppe Lavagetto: mediawiki: add vhost define [puppet] - https://gerrit.wikimedia.org/r/439893 (https://phabricator.wikimedia.org/T196968)
[15:13:13] RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 56 ESP OK
[15:13:35] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: add vhost define [puppet] - https://gerrit.wikimedia.org/r/439893 (https://phabricator.wikimedia.org/T196968) (owner: Giuseppe Lavagetto)
[15:14:58] Operations, ops-codfw, DBA: pc2006 rebooted itself - https://phabricator.wikimedia.org/T200641 (Papaul) This server has pro support as mentioned in T139283 Hi Papaul, This server (7D3H282) has pro support so technically I cannot help you with it but if you will update the bios to 2.1.7: http://ww...
[15:18:02] Operations, ops-codfw, DBA: pc2006 rebooted itself - https://phabricator.wikimedia.org/T200641 (Marostegui) Thanks @Papaul - let's upgrade the BIOS then.
[15:19:20] Operations, Traffic, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp5009.eqsin.wmnet'] ``` and were **ALL** successful.
[15:21:54] Operations, Operations-Software-Development: wmf-auto-reimage: 'execution expired' on first puppet run - https://phabricator.wikimedia.org/T201317 (ema) I've modified wmf_auto_reimage_lib to run puppet with --debug at the first run. Here is what happens around the failure (logs cleaned up from various ju...
[15:25:14] Operations, ops-eqsin, Traffic: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (RobH) Dell finally replied back to me (3 days later) giving me a list of 4 engineers to go onsite. They keep doing that (listing more than are going.) So now I have to figure ou...
[15:36:21] Operations, ops-codfw, DBA: pc2006 rebooted itself - https://phabricator.wikimedia.org/T200641 (Papaul) a:Papaul>Marostegui @Marostegui Bios update to 2.8.0 . It is all yours {F24596942}
[15:37:26] Operations, ops-codfw, DBA: Degraded RAID on db2054 - https://phabricator.wikimedia.org/T201245 (Marostegui) a:Marostegui>Papaul This disk failed: ``` physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, Failed) ``` If it was a new disk, can we try to pull it out and pull it back ag...
[15:38:31] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2054 - https://phabricator.wikimedia.org/T201245 (10Papaul) Done [15:39:53] 10Operations, 10ops-codfw, 10DBA: pc2006 rebooted itself - https://phabricator.wikimedia.org/T200641 (10Marostegui) Thanks - I have started MySQL and will leave it running during the night. There is not much we can do with this host as per T200641#4481566 - and we will at some point replace these hosts with... [15:40:34] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2054 - https://phabricator.wikimedia.org/T201245 (10Marostegui) Thanks, let's see how it goes this time! ``` physicaldrive 1I:1:7 (port 1I:box 1:bay 7, SAS, 600 GB, Rebuilding) ``` [15:45:44] 10Operations, 10Patch-For-Review: Netbox: setup backups - https://phabricator.wikimedia.org/T190184 (10Dzahn) Files have been created for each day and are showing up in Bacula (helium) in bconsole now. (how to is at: https://wikitech.wikimedia.org/wiki/Bacula#Restore_(aka_Panic_mode)) bconsole: ``` $ cd po... 
[15:50:16] (03CR) 10Aezell: [C: 031] Disable Special:Block Feedback Request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450573 (owner: 10Dbarratt) [15:50:26] jouncebot: next [15:50:26] In 1 hour(s) and 9 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180806T1700) [15:50:37] !log pooling cp1090 (upload@eqiad, first of new hardware) [15:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:45] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp1090.eqiad.wmnet [15:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:49] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.142 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [15:54:46] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1074.eqiad.wmnet [15:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:28] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.2696 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [15:57:17] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2054 - https://phabricator.wikimedia.org/T201245 (10Papaul) Moved all disks in one of the decom server (db2013). Disk wipe in progress [15:57:29] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.132 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [15:58:05] (03PS1) 10Vgutierrez: Allow CSR generation without Subject Alternative Names [software/certcentral] - 10https://gerrit.wikimedia.org/r/450600 [15:58:36] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2054 - https://phabricator.wikimedia.org/T201245 (10Marostegui) >>! In T201245#4481644, @Papaul wrote: > Moved all disks in one of the decom server (db2013). Disk wipe in progress I guess this is for: T195228 ? 
[15:58:59] papaul: ^ [15:59:52] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2054 - https://phabricator.wikimedia.org/T201245 (10Papaul) yes thanks [16:00:04] 10Operations, 10ops-codfw, 10DBA, 10decommission: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228 (10Papaul) Moved all disks in one of the decom server (db2013). Disk wipe in progress [16:01:02] 10Operations, 10ops-codfw, 10DBA, 10decommission: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228 (10Papaul) switch port information ge-6/0/12 [16:02:19] 10Operations, 10ops-codfw, 10cloud-services-team, 10netops: connect eth1 on labtestnet2002 and labtestnet2003 - https://phabricator.wikimedia.org/T199821 (10Andrew) eth0 and eth1 are both up on labtestnet2002. On labtestnet2003 eth1 still shows as down: ``` root@labtestnet2003:~# ip addr 1: lo: 10Operations, 10ops-codfw, 10cloud-services-team, 10netops: connect eth1 on labtestnet2002 and labtestnet2003 - https://phabricator.wikimedia.org/T199821 (10aborrero) The cable is unplugged or the switch port is down? ``` aborrero@labtestnet2003:~ $ sudo ethtool eth1 | grep Link Link detected: no aborrer... 
[16:08:21] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0.04643 https://grafana.wikimedia.org/dashboard/db/logstash [16:08:32] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.04562 https://grafana.wikimedia.org/dashboard/db/logstash [16:11:41] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.02821 https://grafana.wikimedia.org/dashboard/db/logstash [16:14:22] (03PS8) 10Arturo Borrero Gonzalez: wikilabels: Enforce SSL [puppet] - 10https://gerrit.wikimedia.org/r/450395 (https://phabricator.wikimedia.org/T184437) (owner: 10Ladsgroup) [16:15:04] (03PS1) 10Vgutierrez: Add self_signed property on Certificate class [software/certcentral] - 10https://gerrit.wikimedia.org/r/450607 [16:15:29] (03CR) 10Arturo Borrero Gonzalez: [C: 032] wikilabels: Enforce SSL [puppet] - 10https://gerrit.wikimedia.org/r/450395 (https://phabricator.wikimedia.org/T184437) (owner: 10Ladsgroup) [16:17:13] ACKNOWLEDGEMENT - Elasticsearch HTTPS on relforge1002 is CRITICAL: SSL CRITICAL - failed to verify relforge.svc.eqiad.wmnet against relforge1002.eqiad.wmnet Gehel ssl cert needs to be refreshed after reimage [16:18:41] (03PS1) 10Herron: logstash: double jvm heap size to 512m [puppet] - 10https://gerrit.wikimedia.org/r/450609 (https://phabricator.wikimedia.org/T200960) [16:21:31] (03PS1) 10BryanDavis: toolforge: Document inclusion of texlive-full package [puppet] - 10https://gerrit.wikimedia.org/r/450610 (https://phabricator.wikimedia.org/T197176) [16:21:31] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0 [16:21:57] XioNoX: ^^ [16:22:53] vgutierrez: thx [16:23:27] 10Operations, 10Analytics, 10EventBus, 10Discovery-Search (Current work), and 2 others: Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10EBernhardson) >>! 
In T200215#4481210, @Ottomata wrote: > Not 100%, but I believe MirrorMaker will handle th... [16:23:44] 10Operations, 10Analytics, 10EventBus, 10Discovery-Search (Current work), and 2 others: Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10EBernhardson) [16:26:21] FYI, it's the backup Telia link between codfw and eqiad [16:26:30] 10Operations, 10ops-codfw, 10cloud-services-team, 10netops: connect eth1 on labtestnet2002 and labtestnet2003 - https://phabricator.wikimedia.org/T199821 (10Papaul) switch port is enable and link is up ``` show interfaces ge-1/0/17 descriptions Interface Admin Link Description ge-1/0/17 up... [16:26:49] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler02/11997/" [puppet] - 10https://gerrit.wikimedia.org/r/450609 (https://phabricator.wikimedia.org/T200960) (owner: 10Herron) [16:26:54] no maintenance, so I'm expecting them to send an outage notification soon [16:27:34] 10Operations, 10Analytics, 10EventBus, 10Discovery-Search (Current work), and 2 others: Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10Ottomata) Oh yes, singular is better, that was a typo great. [16:28:17] Platform Operations, serious stuff | Status: Up | Log: https://wikitech.wikimedia.org/wiki/Server_Admin_Log | Channel logs: https://bit.ly/opsirclog | Ops Clinic Duty: godog [16:29:11] uhuh thanks marostegui I forgot [16:29:43] I was basically changing the status, that it will still said that we were having network issues :) [16:34:04] (03CR) 10Alex Monk: [C: 032] Add self_signed property on Certificate class [software/certcentral] - 10https://gerrit.wikimedia.org/r/450607 (owner: 10Vgutierrez) [16:34:49] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: cp1080 uncorrectable DIMM error slot A5 - https://phabricator.wikimedia.org/T201174 (10Cmjohnson) I checked the log today and the error has not returned. 
[16:35:26] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: cp1080 uncorrectable DIMM error slot A5 - https://phabricator.wikimedia.org/T201174 (10BBlack) Ok I'll take a stab at another imaging today and see how it goes, thanks! [16:35:34] 10Operations, 10ops-codfw, 10cloud-services-team, 10netops: connect eth1 on labtestnet2002 and labtestnet2003 - https://phabricator.wikimedia.org/T199821 (10aborrero) >>! In T199821#4481716, @Papaul wrote: > switch port is enable and link is up something is now working as expected. ``` aborrero@labtestn... [16:38:01] (03CR) 10Alex Monk: "Why? SAN fields are mandatory for publicly-trusted certificates." [software/certcentral] - 10https://gerrit.wikimedia.org/r/450600 (owner: 10Vgutierrez) [16:40:13] (03CR) 10Vgutierrez: "> Why? SAN fields are mandatory for publicly-trusted certificates." [software/certcentral] - 10https://gerrit.wikimedia.org/r/450600 (owner: 10Vgutierrez) [16:42:48] 10Operations, 10ops-eqiad, 10cloud-services-team, 10Patch-For-Review: Labservices1001 crashing, probable overheating - https://phabricator.wikimedia.org/T196252 (10RobH) a:03Andrew So, this thread is mildly confusing. From what I can see, labservices1001 (warranty expired 2017-04), had its thermal paste... [16:45:08] (03CR) 10Alex Monk: [C: 032] Allow CSR generation without Subject Alternative Names [software/certcentral] - 10https://gerrit.wikimedia.org/r/450600 (owner: 10Vgutierrez) [16:46:02] RECOVERY - Device not healthy -SMART- on db2054 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2054&var-datasource=codfw%2520prometheus%252Fops [16:46:10] (03Merged) 10jenkins-bot: Allow CSR generation without Subject Alternative Names [software/certcentral] - 10https://gerrit.wikimedia.org/r/450600 (owner: 10Vgutierrez) [16:46:12] (03Merged) 10jenkins-bot: Add self_signed property on Certificate class [software/certcentral] - 10https://gerrit.wikimedia.org/r/450607 (owner: 10Vgutierrez) [16:46:30] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10ayounsi) No flows logged since at least the last 3 days. So it looks fine to me. I removed the syslog statement to minimize noise while... [16:47:11] (03CR) 10jenkins-bot: Add self_signed property on Certificate class [software/certcentral] - 10https://gerrit.wikimedia.org/r/450607 (owner: 10Vgutierrez) [16:47:13] (03CR) 10jenkins-bot: Allow CSR generation without Subject Alternative Names [software/certcentral] - 10https://gerrit.wikimedia.org/r/450600 (owner: 10Vgutierrez) [16:48:13] 10Operations, 10Analytics, 10EventBus, 10Discovery-Search (Current work), and 2 others: Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10Ottomata) Done. 
[16:48:59] 10Operations, 10JADE, 10TechCom, 10Goal, and 3 others: [Blocked] Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381 (10awight) [16:49:06] (03PS1) 10Andrew Bogott: Horizon: disable during labservices1001 maintenance [puppet] - 10https://gerrit.wikimedia.org/r/450615 [16:49:46] (03CR) 10Andrew Bogott: [C: 032] Horizon: disable during labservices1001 maintenance [puppet] - 10https://gerrit.wikimedia.org/r/450615 (owner: 10Andrew Bogott) [16:50:26] 10Operations, 10Analytics, 10EventBus, 10Discovery-Search (Current work), and 2 others: Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10EBernhardson) Thanks! [16:51:00] 10Operations, 10DNS, 10Traffic: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (10Vgutierrez) I'll handle this from here @RobH, thanks :) [16:51:16] 10Operations, 10DNS, 10Traffic: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (10Vgutierrez) a:05BBlack>03Vgutierrez [16:52:51] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp1089.eqiad.wmnet [16:52:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:58] 10Operations, 10Puppet: Stop introducing new code expanded from erb templates - https://phabricator.wikimedia.org/T200984 (10Volans) I fully agree with the principle, but I have to admit that I'm also guilty as charged as I've recently add a few lines wrapper bash script that is dependent of django variables,... 
[16:53:05] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1068.eqiad.wmnet [16:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:08] !log power down labservices1001 for thermal paste fix, T196252 [16:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:16] T196252: Labservices1001 crashing, probable overheating - https://phabricator.wikimedia.org/T196252 [16:54:05] (03PS1) 10Andrew Bogott: Revert "Horizon: disable during labservices1001 maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/450617 [16:56:22] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.2742 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [17:00:04] gehel: #bothumor I ♥ Unicode. All rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180806T1700). [17:00:15] jouncebot: o/ [17:00:50] (03PS8) 10EBernhardson: Add mjolnir kafka daemon to primary elasticsearch clusters [puppet] - 10https://gerrit.wikimedia.org/r/445254 (https://phabricator.wikimedia.org/T198490) [17:03:07] 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: cp1080 uncorrectable DIMM error slot A5 - https://phabricator.wikimedia.org/T201174 (10BBlack) First attempt to reboot for PXE install stops now with: ```UEFI0339: The Dual Inline Memory Module (DIMM) in the memory slot B5 is disabled because of in... 
[17:03:41] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.02207 https://grafana.wikimedia.org/dashboard/db/logstash [17:04:17] !log gehel@deploy1001 Started deploy [wdqs/wdqs@0d3c6a6]: new version of wdqs GUI and updater (wdqs1009 only) [17:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:49] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@0d3c6a6]: new version of wdqs GUI and updater (wdqs1009 only) (duration: 00m 33s) [17:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:12] (03PS9) 10EBernhardson: Add mjolnir kafka daemon to primary elasticsearch clusters [puppet] - 10https://gerrit.wikimedia.org/r/445254 (https://phabricator.wikimedia.org/T198490) [17:07:52] PROBLEM - toolschecker: Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 504 Gateway Time-out - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 356 bytes in 60.007 second response time [17:08:11] PROBLEM - Check systemd state on wdqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:08:53] ^ issue with WDQS while testing new deployment, will silence and I'm on it (wdqs1009 is a test server) [17:09:22] RECOVERY - Check systemd state on wdqs1009 is OK: OK - running: The system is fully operational [17:10:41] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 [17:10:57] (03PS10) 10EBernhardson: Add mjolnir kafka daemon to primary elasticsearch clusters [puppet] - 10https://gerrit.wikimedia.org/r/445254 (https://phabricator.wikimedia.org/T198490) [17:16:24] (03CR) 10Andrew Bogott: [C: 032] Revert "Horizon: disable during labservices1001 maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/450617 (owner: 10Andrew Bogott) [17:17:41] PROBLEM - Host labtestnet2003 is DOWN: PING CRITICAL - Packet loss = 100% [17:19:11] 10Operations, 10netops: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (10ayounsi) USB method isn't working neither, followed up with JTAC, if no quick resolution, we have spare EX4300 to swap it with. ``` loader> install --format file:///jinstall-ex-4300-14.1... [17:22:23] 10Operations, 10netops: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (10ayounsi) @Cmjohnson please swap the switch with another EX4300, and only connect console and the usb drive. Once it's ready to join the virtual chassis, I'll need you to connect the VC c... 
[17:22:31] RECOVERY - toolschecker: Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.733 second response time [17:25:22] RECOVERY - Host labtestnet2003 is UP: PING OK - Packet loss = 16%, RTA = 77.05 ms [17:26:15] 10Operations, 10netops: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (10ayounsi) Here they are: 1/ Add logs statements to know if the packets are getting in/out of the fabric ``` Filter on the ingress interface: set firewall family ethernet-switching f... [17:27:22] RECOVERY - HP RAID on db2054 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK [17:28:51] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1168 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [17:29:01] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.1337 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [17:29:52] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.1374 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [17:34:12] !log mobrovac@deploy1001 Started deploy [citoid/deploy@4f5ba15]: Use WorldCat for open search as well - T162357 [17:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:16] T162357: Add support for worldcat search api xml results - https://phabricator.wikimedia.org/T162357 [17:36:21] !log uploaded linux 4.9.110+deb9u1~wmf1 for jessie-wikimedia to apt.wikimedia.org [17:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:02] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0.04435 https://grafana.wikimedia.org/dashboard/db/logstash [17:37:12] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.04544 
https://grafana.wikimedia.org/dashboard/db/logstash [17:37:40] !log mobrovac@deploy1001 Finished deploy [citoid/deploy@4f5ba15]: Use WorldCat for open search as well - T162357 (duration: 03m 29s) [17:37:43] 10Operations, 10ops-codfw, 10cloud-services-team, 10netops: connect eth1 on labtestnet2002 and labtestnet2003 - https://phabricator.wikimedia.org/T199821 (10aborrero) 05Open>03Resolved I didn't press the right buttons: ``` aborrero@labtestnet2003:~ $ sudo ip link set dev eth1 up aborrero@labtestnet20... [17:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:00] 10Operations, 10ops-eqiad, 10cloud-services-team, 10Patch-For-Review: Labservices1001 crashing, probable overheating - https://phabricator.wikimedia.org/T196252 (10Cmjohnson) added thermal paste [17:42:52] 10Operations, 10netops: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (10Cmjohnson) I swapped out the ex4300 with a current spare wmf7314. @ayounsi can you give me the details of the RMA and the shipping label. Please email. [17:45:09] (03CR) 10Herron: [V: 032 C: 032] initial commit of 4.4.0-1 [debs/puppetdb] (4.4.0-1) - 10https://gerrit.wikimedia.org/r/415591 (owner: 10Herron) [17:46:52] (03PS1) 10Bstorm: gridengine: remove variables that were for deleted template [puppet] - 10https://gerrit.wikimedia.org/r/450624 (https://phabricator.wikimedia.org/T139190) [17:47:01] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.03635 https://grafana.wikimedia.org/dashboard/db/logstash [17:47:13] 10Operations, 10Scap: Wrong umask when deploying from screen - https://phabricator.wikimedia.org/T200690 (10thcipriani) >>! In T200690#4472933, @akosiaris wrote: > This got me into thing that maybe we should instead have a check in scap to disallow running it with a wrong umask. It's possible to acquire a wron... 
[17:49:12] (03PS11) 10EBernhardson: Add mjolnir kafka daemon to primary elasticsearch clusters [puppet] - 10https://gerrit.wikimedia.org/r/445254 (https://phabricator.wikimedia.org/T198490) [17:54:13] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2054 - https://phabricator.wikimedia.org/T201245 (10Marostegui) 05Open>03Resolved All good this time! ``` root@db2054:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337FE1C0) Port Name: 1I Port Na... [17:55:02] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.123 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [17:56:32] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp1088.eqiad.wmnet [17:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:39] (03CR) 10Herron: [C: 032] logstash: double jvm heap size to 512m [puppet] - 10https://gerrit.wikimedia.org/r/450609 (https://phabricator.wikimedia.org/T200960) (owner: 10Herron) [17:56:44] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1099.eqiad.wmnet [17:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:46] (03PS2) 10Herron: logstash: double jvm heap size to 512m [puppet] - 10https://gerrit.wikimedia.org/r/450609 (https://phabricator.wikimedia.org/T200960) [17:57:05] !log rebooting bast4001 [17:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:31] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp1087.eqiad.wmnet [17:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:45] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1067.eqiad.wmnet [17:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:42] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1337 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash 
[17:58:52] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.1591 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [18:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Morning SWAT (Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180806T1800). [18:00:04] davidwbarratt: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:33] here~ [18:03:32] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.04601 https://grafana.wikimedia.org/dashboard/db/logstash [18:04:40] !log double logstash jvm heap size from 256m to 512m and rolling restart of logstash instances T200960 [18:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:45] T200960: Logstash has ~90% packet loss since June 29 - https://phabricator.wikimedia.org/T200960 [18:05:02] PROBLEM - Disk space on elastic1026 is CRITICAL: DISK CRITICAL - free space: /srv 51128 MB (10% inode=99%) [18:05:43] I can SWAT [18:05:53] 10Operations, 10ops-eqiad, 10cloud-services-team, 10Patch-For-Review: Labservices1001 crashing, probable overheating - https://phabricator.wikimedia.org/T196252 (10Andrew) 05Open>03Resolved Hopefully resolved; we'll see if it overheats again. 
Thanks @Cmjohnson [18:06:24] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450573 (owner: 10Dbarratt) [18:06:39] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service: WDQS diskspace is low - https://phabricator.wikimedia.org/T196485 (10RobH) [18:07:02] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1042 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [18:07:59] (03Merged) 10jenkins-bot: Disable Special:Block Feedback Request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450573 (owner: 10Dbarratt) [18:08:22] PROBLEM - DPKG on bast3002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [18:08:31] !log bast2002 - installing package upgrades (future bastion, to be setup) [18:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:12] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0.04315 https://grafana.wikimedia.org/dashboard/db/logstash [18:09:22] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.03449 https://grafana.wikimedia.org/dashboard/db/logstash [18:09:32] RECOVERY - DPKG on bast3002 is OK: All packages OK [18:09:51] RECOVERY - Disk space on elastic1026 is OK: DISK OK [18:09:52] !log rebooting bast3002 for kernel upgrade [18:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:07] davidwbarratt: you change is live on mwdebug1002, check please [18:10:34] checking.. [18:10:51] looks good to me! 
[18:12:28] ok, going live [18:12:28] (03CR) 10jenkins-bot: Disable Special:Block Feedback Request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450573 (owner: 10Dbarratt) [18:15:18] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:450573|Disable Special:Block Feedback Request]] (duration: 00m 52s) [18:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:24] ^ davidwbarratt live now [18:15:31] (03PS1) 10Herron: Edit Project Config [debs/prometheus-logstash-exporter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/450636 [18:15:41] thcipriani looks perfect! thanks! [18:15:41] (03PS2) 10Bstorm: gridengine: remove variables that were for deleted template [puppet] - 10https://gerrit.wikimedia.org/r/450624 (https://phabricator.wikimedia.org/T139190) [18:15:54] (03Abandoned) 10Herron: Edit Project Config [debs/prometheus-logstash-exporter] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/450636 (owner: 10Herron) [18:16:14] !log rebooting radium for kernel security update [18:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:40] (03CR) 10Bstorm: [C: 032] gridengine: remove variables that were for deleted template [puppet] - 10https://gerrit.wikimedia.org/r/450624 (https://phabricator.wikimedia.org/T139190) (owner: 10Bstorm) [18:16:57] (03PS1) 10Herron: initial import of prometheus-logstash-exporter-0.1.2 [debs/prometheus-logstash-exporter] - 10https://gerrit.wikimedia.org/r/450637 (https://phabricator.wikimedia.org/T200362) [18:17:43] 10Operations, 10netops: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (10ayounsi) Replacing fpc5 didn't solve the issue... Following up... 
[18:21:51] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. [18:23:01] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. [18:23:17] (03PS1) 10Andrew Bogott: site.pp: remove def for labvirt1021 and 1022 [puppet] - 10https://gerrit.wikimedia.org/r/450638 [18:23:21] 10Operations, 10ops-eqiad: rack/setup/install labservices100[34].wikimedia.org - https://phabricator.wikimedia.org/T201341 (10RobH) p:05Triage>03Normal [18:23:45] 10Operations, 10ops-eqiad: rack/setup/install labservices100[34].wikimedia.org - https://phabricator.wikimedia.org/T201341 (10RobH) [18:35:52] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.1567 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [18:38:12] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.0456 https://grafana.wikimedia.org/dashboard/db/logstash [18:42:22] huh [18:46:44] 10Operations, 10ops-eqiad: rack/setup/install puppetmaster1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201342 (10RobH) p:05Triage>03Normal [18:48:57] 10Operations, 10ops-eqiad: rack/setup/install puppetmaster1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201342 (10RobH) [18:51:10] 10Operations, 10ops-eqiad: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10RobH) p:05Triage>03Normal [18:54:30] !log rolling reboot of mx servers for security updates [18:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:17] 10Operations, 10ops-eqiad, 10monitoring: rack/setup/install monitor1001.wikimedia.org - https://phabricator.wikimedia.org/T201344 (10RobH) p:05Triage>03High [18:56:43] !log rebooting 
fermium (lists) for security updates [18:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:50] 10Operations, 10ops-eqiad, 10Operations-Software-Development: rack/setup/install clustermgmt1001.eqiad.wmnet (new cumin master) - https://phabricator.wikimedia.org/T201346 (10RobH) p:05Triage>03Normal [19:03:46] !log bast1001 - installing package updates, incl systemd,kernel [19:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:37] !log bast1002 - the last log line is about bast1002, not bast1001 [19:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:19] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) @JKatzWMF Definitely makes sense to test this before pushing it everywh... [19:07:42] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.1246 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [19:12:43] 10Operations, 10SRE-Access-Requests: Access to dumps servers - https://phabricator.wikimedia.org/T201350 (10Imarlier) [19:12:49] (03PS1) 10RobH: decom bast1001 prod dns [dns] - 10https://gerrit.wikimedia.org/r/450645 (https://phabricator.wikimedia.org/T191153) [19:13:42] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0.02812 https://grafana.wikimedia.org/dashboard/db/logstash [19:14:00] !log bast1002 - rebooting [19:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:05] (03PS1) 10RobH: decom bast1001 [puppet] - 10https://gerrit.wikimedia.org/r/450646 (https://phabricator.wikimedia.org/T191153) [19:15:55] 10Operations, 10SRE-Access-Requests: Access to dumps servers - https://phabricator.wikimedia.org/T201350 (10ArielGlenn) It's the labstore boxes you want, either 1006 or 1007 
depending, and maybe you just want to make the file available and ask someone to drop it into the right location? And that would likely b... [19:16:12] 10Operations, 10ops-eqiad, 10decommission: decom bast1001 - https://phabricator.wikimedia.org/T191153 (10RobH) [19:18:06] (03CR) 10RobH: [C: 032] decom bast1001 prod dns [dns] - 10https://gerrit.wikimedia.org/r/450645 (https://phabricator.wikimedia.org/T191153) (owner: 10RobH) [19:18:20] (03CR) 10RobH: [C: 032] decom bast1001 [puppet] - 10https://gerrit.wikimedia.org/r/450646 (https://phabricator.wikimedia.org/T191153) (owner: 10RobH) [19:20:20] 10Operations, 10ops-eqiad, 10decommission: decom bast1001 - https://phabricator.wikimedia.org/T191153 (10RobH) a:03Cmjohnson [19:20:21] PROBLEM - puppet last run on analytics1069 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[spark2] [19:23:31] PROBLEM - puppet last run on analytics1042 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[spark2] [19:33:30] (03PS4) 10Andrew Bogott: labs-ip-alias-dump: Update to work with pdns-recursor v4.x [puppet] - 10https://gerrit.wikimedia.org/r/449627 (https://phabricator.wikimedia.org/T200294) [19:34:54] (03CR) 10Andrew Bogott: [C: 032] labs-ip-alias-dump: Update to work with pdns-recursor v4.x [puppet] - 10https://gerrit.wikimedia.org/r/449627 (https://phabricator.wikimedia.org/T200294) (owner: 10Andrew Bogott) [19:37:06] 10Operations, 10ops-eqiad: bast1002 - hardware (memory) issue - https://phabricator.wikimedia.org/T201355 (10Dzahn) [19:46:31] !log gehel@deploy1001 Started deploy [wdqs/wdqs@d552447]: new version of wdqs GUI and updater (wdqs1009 only) [19:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:54] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@d552447]: new version of wdqs GUI and updater (wdqs1009 only) (duration: 00m 22s) [19:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:30] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.1213 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [19:48:50] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.1079 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [19:48:51] RECOVERY - puppet last run on analytics1042 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:50:01] !log gehel@deploy1001 Started deploy [wdqs/wdqs@d552447]: new version of wdqs GUI and updater [19:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:51] RECOVERY - puppet last run on analytics1069 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [19:52:00] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0.04043 https://grafana.wikimedia.org/dashboard/db/logstash [19:52:20] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 
ge (W)0.05 ge 0.02287 https://grafana.wikimedia.org/dashboard/db/logstash [19:52:52] 10Operations, 10TimedMediaHandler-Transcode: Increase job runners on video scalers to maximize load efficiency - https://phabricator.wikimedia.org/T201358 (10brion) [19:53:56] 10Operations, 10TimedMediaHandler-Transcode: Increase job runners on video scalers to maximize load efficiency - https://phabricator.wikimedia.org/T201358 (10brion) Hmm, that load factor moved. I have no idea what to change now. :D [19:54:40] (03PS1) 10Dzahn: smokeping: comment out broken bast1002 [puppet] - 10https://gerrit.wikimedia.org/r/450649 (https://phabricator.wikimedia.org/T201355) [19:54:43] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10JKatzWMF) Great! Thank you for confirming, @Imarlier and, again, I am really exc... [19:55:05] Anybody know how the job queue runners (eg videoscalers) control how many threads they run now? [19:55:25] ACKNOWLEDGEMENT - High lag on wdqs1009 is CRITICAL: 8815 ge 3600 Gehel server is catching up on updates after failed deployment (now fixed) https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:55:51] (03PS2) 10Dzahn: smokeping: comment out broken bast1002 [puppet] - 10https://gerrit.wikimedia.org/r/450649 (https://phabricator.wikimedia.org/T201355) [19:56:27] (03CR) 10Dzahn: [C: 032] smokeping: comment out broken bast1002 [puppet] - 10https://gerrit.wikimedia.org/r/450649 (https://phabricator.wikimedia.org/T201355) (owner: 10Dzahn) [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: (Dis)respected human, time to deploy Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180806T2000). Please do the needful. 
[20:00:56] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@d552447]: new version of wdqs GUI and updater (duration: 10m 55s) [20:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:30] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.1852 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [20:01:40] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1814 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [20:01:51] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.1588 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [20:02:04] seriously logstash? [20:02:22] SMalyshev: deployment completed, tests are grean [20:02:30] s/grean/green/ [20:03:21] (03PS2) 10BryanDavis: Kubernetes: ignore terminating objects when searching [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/450495 (https://phabricator.wikimedia.org/T156626) [20:04:21] hey all, anyone else have a problem with bast1002? [20:04:52] bast1002 is being poked at [20:04:53] it looks prety down [20:04:54] is this known? [20:04:55] ok [20:05:00] folks in analyitcs were asking [20:05:06] should they just use a different bastion? [20:05:10] yes for now :) [20:06:36] (03CR) 10BryanDavis: Kubernetes: ignore terminating objects when searching (033 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/450495 (https://phabricator.wikimedia.org/T156626) (owner: 10BryanDavis) [20:07:36] (03CR) 10BryanDavis: Kubernetes: ignore terminating objects when searching (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/450495 (https://phabricator.wikimedia.org/T156626) (owner: 10BryanDavis) [20:09:19] 10Operations, 10Scap: Wrong umask when deploying from screen - https://phabricator.wikimedia.org/T200690 (10Tgr) > I wonder if there are git hooks we could setup on the deployment servers to address this without having to put any logic into scap? 
Sure (just check the umask in the hook and fail) but there are... [20:09:39] it's probably staying down for awhile [20:09:45] maybe memory issues [20:19:58] !log arlolra@deploy1001 Started deploy [parsoid/deploy@6b16b57]: Updating Parsoid to e02124b [20:20:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:03] 10Operations, 10ops-eqiad, 10Patch-For-Review: bast1002 - hardware (memory) issue - https://phabricator.wikimedia.org/T201355 (10Dzahn) I booted into BIOS to check if it sees the 2 disks .. and it does. [20:31:19] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@6b16b57]: Updating Parsoid to e02124b (duration: 11m 20s) [20:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:41] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1103 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [20:39:06] !log Updated Parsoid to e02124b (T199849, T198400, T199577, T199509, T200403, T201054) [20:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:19] T198400: Create Wikipedia Santali - https://phabricator.wikimedia.org/T198400 [20:39:20] T199849: VisualEditor manipulation based on TemplateData source code formatting does not handle newlines before and after correctly - https://phabricator.wikimedia.org/T199849 [20:39:20] T199509: Create Wikimania wiki - https://phabricator.wikimedia.org/T199509 [20:39:20] T199577: Create Wikiversity Chinese - https://phabricator.wikimedia.org/T199577 [20:39:21] T200403: Investigate onExtLink performance - https://phabricator.wikimedia.org/T200403 [20:39:21] T201054: Reference seems broken in Cat#Running section - https://phabricator.wikimedia.org/T201054 [20:39:51] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0.02687 https://grafana.wikimedia.org/dashboard/db/logstash [20:40:00] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.01664 
https://grafana.wikimedia.org/dashboard/db/logstash [20:40:11] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.04558 https://grafana.wikimedia.org/dashboard/db/logstash [20:40:12] (03Restored) 10Krinkle: prometheus: make ferm DNS record type configurable [puppet] - 10https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T153468) (owner: 10Hashar) [20:41:12] (03PS11) 10Bstorm: WIP toolforge: start writing module [puppet] - 10https://gerrit.wikimedia.org/r/448791 [21:00:04] bawolff and Reedy: Dear deployers, time to do the Weekly Security deployment window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180806T2100). [21:09:19] (03PS2) 10Andrew Bogott: site.pp: remove def for labvirt1021 and 1022 [puppet] - 10https://gerrit.wikimedia.org/r/450638 [21:09:21] (03PS1) 10Andrew Bogott: designate: allow API access from the keystone host [puppet] - 10https://gerrit.wikimedia.org/r/450863 (https://phabricator.wikimedia.org/T162977) [21:15:03] (03PS2) 10Dzahn: deployment-prep: Update urldownloader host [puppet] - 10https://gerrit.wikimedia.org/r/450485 (owner: 10Alex Monk) [21:15:54] (03CR) 10Dzahn: [C: 032] deployment-prep: Update urldownloader host [puppet] - 10https://gerrit.wikimedia.org/r/450485 (owner: 10Alex Monk) [21:16:52] (03CR) 10Andrew Bogott: [C: 032] designate: allow API access from the keystone host [puppet] - 10https://gerrit.wikimedia.org/r/450863 (https://phabricator.wikimedia.org/T162977) (owner: 10Andrew Bogott) [21:16:59] (03PS2) 10Andrew Bogott: designate: allow API access from the keystone host [puppet] - 10https://gerrit.wikimedia.org/r/450863 (https://phabricator.wikimedia.org/T162977) [21:23:05] (03PS1) 10Dzahn: yubiauth: add auth1002 in Hiera to fix rsync/cron spam [puppet] - 10https://gerrit.wikimedia.org/r/450865 (https://phabricator.wikimedia.org/T196698) [21:23:30] (03PS1) 10Jon Harald Søby: Change category collation 
for sewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450866 (https://phabricator.wikimedia.org/T182431) [21:25:11] 10Operations, 10netops: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 (10ayounsi) @Papaul, can you pre-populate the following interfaces with SFP+-10G-LR? ``` xe-0/1/0 {#11399} xe-0/1/1 {#11401} xe-0/1/3 {#11389} xe-0/1/4 {#11403} xe-0/1/6 {#11397} ``` You can store a few spares... [21:28:55] 10Operations, 10netops: Intermitent connectivity issues in eqiad's row C - https://phabricator.wikimedia.org/T201139 (10ayounsi) The current guess is that those errors were side effects of the other switches failing. "DDOS_PROTOCOL_VIOLATION" syslog seem to be read hearing. Still monitoring, let me me know if... [21:32:36] (03CR) 10Jon Harald Søby: "Whoever deploys this should also run the following script:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450866 (https://phabricator.wikimedia.org/T182431) (owner: 10Jon Harald Søby) [21:37:34] (03CR) 10Dzahn: [C: 032] yubiauth: add auth1002 in Hiera to fix rsync/cron spam [puppet] - 10https://gerrit.wikimedia.org/r/450865 (https://phabricator.wikimedia.org/T196698) (owner: 10Dzahn) [21:37:56] 10Operations, 10netops: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 (10ayounsi) Planning on doing the swap on August 14th, 11am CDT, 9am PDT, 4pm UTC. 1h. Pending no planned maintenance from redundant links. [21:38:48] (03CR) 10Dzahn: [C: 032] "going ahead.. sync is just _from_ the primary server..so adding the new one to the list .." [puppet] - 10https://gerrit.wikimedia.org/r/450865 (https://phabricator.wikimedia.org/T196698) (owner: 10Dzahn) [21:44:09] 10Operations, 10Patch-For-Review: rack/setup/install auth1002 - https://phabricator.wikimedia.org/T196698 (10Dzahn) Adding the role to auth1002 started adding the rsync/cron between auth1002 and auth1002, but because it was not aded in Hiera yet it did not get the needed ferm firewall rule on auth1001 to allow... 
[21:44:21] 10Operations, 10netops: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 (10Papaul) [21:45:23] (03CR) 10Dzahn: [C: 032] "that added the ferm rule." [puppet] - 10https://gerrit.wikimedia.org/r/450865 (https://phabricator.wikimedia.org/T196698) (owner: 10Dzahn) [21:48:09] 10Operations, 10netops: Rack/Setup new codfw QFX5100 10G switch - https://phabricator.wikimedia.org/T197147 (10ayounsi) [21:55:04] (03PS1) 10Dzahn: Revert "smokeping: comment out broken bast1002" [puppet] - 10https://gerrit.wikimedia.org/r/450870 [21:55:40] (03CR) 10Dzahn: "bast1002 might be ok again, it is up again. but the memory error still should be checked" [puppet] - 10https://gerrit.wikimedia.org/r/450870 (owner: 10Dzahn) [22:02:16] 10Operations, 10netops: Rack/Setup new codfw QFX5100 10G switch - https://phabricator.wikimedia.org/T197147 (10ayounsi) [22:08:17] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.2501 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [22:08:45] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp1086.eqiad.wmnet [22:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:00] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1073.eqiad.wmnet [22:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:30] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp1085.eqiad.wmnet [22:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:45] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp1066.eqiad.wmnet [22:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:46] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.04023 https://grafana.wikimedia.org/dashboard/db/logstash [22:15:08] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:16:10] !log bblack@neodymium conftool action : set/weight=4; selector: name=cp1090.eqiad.wmnet,service=nginx [22:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:18] !log bblack@neodymium conftool action : set/weight=4; selector: name=cp1088.eqiad.wmnet,service=nginx [22:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:24] !log bblack@neodymium conftool action : set/weight=4; selector: name=cp1086.eqiad.wmnet,service=nginx [22:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:51] !log bblack@neodymium conftool action : set/weight=4; selector: name=cp1076.eqiad.wmnet,service=nginx [22:16:52] !log bblack@neodymium conftool action : set/weight=4; selector: name=cp1078.eqiad.wmnet,service=nginx [22:16:53] !log bblack@neodymium conftool action : set/weight=4; selector: name=cp1082.eqiad.wmnet,service=nginx [22:16:53] !log bblack@neodymium conftool action : set/weight=4; selector: name=cp1084.eqiad.wmnet,service=nginx [22:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:20] !log bblack@neodymium conftool action : set/weight=4; selector: name=cp1076.eqiad.wmnet,service=varnish-fe [22:17:21] !log bblack@neodymium conftool action : set/weight=4; selector: name=cp1078.eqiad.wmnet,service=varnish-fe [22:17:21] !log bblack@neodymium conftool action : set/weight=4; selector: name=cp1082.eqiad.wmnet,service=varnish-fe [22:17:22] !log bblack@neodymium conftool action : set/weight=4; selector: name=cp1084.eqiad.wmnet,service=varnish-fe [22:17:22] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:23] !log bblack@neodymium conftool action : set/weight=4; selector: name=cp1086.eqiad.wmnet,service=varnish-fe [22:17:24] !log bblack@neodymium conftool action : set/weight=4; selector: name=cp1088.eqiad.wmnet,service=varnish-fe [22:17:24] !log bblack@neodymium conftool action : set/weight=4; selector: name=cp1090.eqiad.wmnet,service=varnish-fe [22:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:25] 10Operations, 10ops-eqiad: rack/setup/install sulfur.wikimedia.org - https://phabricator.wikimedia.org/T201364 (10RobH) p:05Triage>03Normal [22:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:20] 10Operations, 10ops-eqiad, 10Parsoid: rack/setup/install scandium.eqiad.wmnet (parsoid test box) - https://phabricator.wikimedia.org/T201366 (10RobH) p:05Triage>03High [22:23:58] 10Operations, 10ops-eqiad: rack/setup/add to spares tracking 1 dual cpu misc system - https://phabricator.wikimedia.org/T201367 (10RobH) p:05Triage>03Normal [22:34:35] (03PS1) 10Andrew Bogott: makedomain: add --delete and --all functions [puppet] - 10https://gerrit.wikimedia.org/r/450875 [22:35:48] RECOVERY - puppet last run on labvirt1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:44:17] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.1258 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [22:44:48] mutante, hey [22:45:16] mutante, thanks for merging my thing. 
I was just wondering what OS version /^(actinium|alcyone|alsafi|aluminium)\.wikimedia\.org$/ all had [22:46:37] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.04749 https://grafana.wikimedia.org/dashboard/db/logstash [22:51:05] (03PS2) 10Jforrester: Load TimedMediaHandler via static extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448176 (https://phabricator.wikimedia.org/T140852) [22:51:37] (03PS2) 10Jforrester: Load TimedMediaHandler's i18n via static extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448177 (https://phabricator.wikimedia.org/T140852) [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180806T2300). Please do the needful. [23:00:04] Jhs and James_F: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:11] * James_F waves. [23:00:14] * Jhs too [23:00:23] I can do it, unless someone desperately wants to? [23:01:29] I'll take that as "sure". :-) [23:01:39] :) [23:02:04] (03CR) 10Jforrester: [C: 032] Change category collation for sewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450866 (https://phabricator.wikimedia.org/T182431) (owner: 10Jon Harald Søby) [23:03:35] (03Merged) 10jenkins-bot: Change category collation for sewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450866 (https://phabricator.wikimedia.org/T182431) (owner: 10Jon Harald Søby) [23:03:58] Okie-dokie. [23:04:00] James_F, remember to run the script :) [23:04:06] I know, I know. :-) [23:04:17] great :) [23:04:19] Is this really testable, other than running the script? 
[23:04:37] not really, the results should be identical to what's there now [23:04:44] but we can test that it doesn't screw things up [23:04:50] OK, in that case, I'll just proceed. [23:06:40] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT Update sewiki collation config T182431 (duration: 00m 51s) [23:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:45] T182431: Switch category collation for sewiki to uca-se-u-kn - https://phabricator.wikimedia.org/T182431 [23:07:26] !log jforrester@mwmaint1001 SWAT Updating sewiki collation via updateCollation.php T182431 [23:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:59] Jhs: OK, done. Everything looks OK? [23:09:12] James_F, yup, looks good. Thanks! [23:09:18] Jhs: Happy to help! [23:09:26] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.1057 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [23:09:36] (03CR) 10jenkins-bot: Change category collation for sewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450866 (https://phabricator.wikimedia.org/T182431) (owner: 10Jon Harald Søby) [23:09:44] (03CR) 10Jforrester: [C: 032] Follow-up 4c97a86fe8: Add wikimania.wikimedia.org to CORS origins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450051 (owner: 10Jforrester) [23:09:46] (03CR) 10Jforrester: [C: 032] Load TimedMediaHandler via static extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448176 (https://phabricator.wikimedia.org/T140852) (owner: 10Jforrester) [23:09:46] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 35.68, 33.72, 32.11 [23:09:48] (03CR) 10Jforrester: [C: 032] Load TimedMediaHandler's i18n via static extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448177 (https://phabricator.wikimedia.org/T140852) (owner: 10Jforrester) [23:09:50] (03CR) 10Jforrester: [C: 032] 
Delete multiversion/submodules.json, putatively unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446847 (owner: 10Jforrester) [23:11:15] (03Merged) 10jenkins-bot: Follow-up 4c97a86fe8: Add wikimania.wikimedia.org to CORS origins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450051 (owner: 10Jforrester) [23:11:17] (03CR) 10jerkins-bot: [V: 04-1] Load TimedMediaHandler's i18n via static extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448177 (https://phabricator.wikimedia.org/T140852) (owner: 10Jforrester) [23:12:27] (03Merged) 10jenkins-bot: Delete multiversion/submodules.json, putatively unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446847 (owner: 10Jforrester) [23:14:16] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.1009 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [23:15:09] (03PS3) 10Jforrester: Load TimedMediaHandler via static extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448176 (https://phabricator.wikimedia.org/T140852) [23:15:10] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT Add new wikimania wiki to CORS origins I493cadb871 (duration: 00m 49s) [23:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:47] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0.0157 https://grafana.wikimedia.org/dashboard/db/logstash [23:21:37] !log jforrester@deploy1001 Synchronized multiversion/: SWAT Delete multiversion/submodules.json (duration: 00m 48s) [23:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:42] (03CR) 10Jforrester: Load TimedMediaHandler via static extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448176 (https://phabricator.wikimedia.org/T140852) (owner: 10Jforrester) [23:22:45] (03CR) 10Jforrester: [C: 032] Load TimedMediaHandler via static extension registration 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/448176 (https://phabricator.wikimedia.org/T140852) (owner: 10Jforrester) [23:23:58] (03Merged) 10jenkins-bot: Load TimedMediaHandler via static extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448176 (https://phabricator.wikimedia.org/T140852) (owner: 10Jforrester) [23:24:15] (03PS2) 10Jforrester: PageImages: Make it possible to add extra namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446257 (https://phabricator.wikimedia.org/T198716) [23:24:41] (03CR) 10jenkins-bot: Follow-up 4c97a86fe8: Add wikimania.wikimedia.org to CORS origins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450051 (owner: 10Jforrester) [23:24:43] (03CR) 10jenkins-bot: Delete multiversion/submodules.json, putatively unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446847 (owner: 10Jforrester) [23:24:45] (03CR) 10jenkins-bot: Load TimedMediaHandler via static extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448176 (https://phabricator.wikimedia.org/T140852) (owner: 10Jforrester) [23:26:28] (03PS3) 10Jforrester: Load TimedMediaHandler's i18n via static extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448177 (https://phabricator.wikimedia.org/T140852) [23:26:38] (03CR) 10Jforrester: [C: 032] Load TimedMediaHandler's i18n via static extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448177 (https://phabricator.wikimedia.org/T140852) (owner: 10Jforrester) [23:27:02] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT Load TimedMediaHandler via static extension registration T140852 (duration: 00m 48s) [23:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:07] T140852: Load all Wikimedia-deployed extensions and skins via extension registration - https://phabricator.wikimedia.org/T140852 [23:27:52] (03CR) 10Jforrester: [C: 032] PageImages: Make 
it possible to add extra namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446257 (https://phabricator.wikimedia.org/T198716) (owner: 10Jforrester) [23:27:59] (03Merged) 10jenkins-bot: Load TimedMediaHandler's i18n via static extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448177 (https://phabricator.wikimedia.org/T140852) (owner: 10Jforrester) [23:29:14] (03Merged) 10jenkins-bot: PageImages: Make it possible to add extra namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446257 (https://phabricator.wikimedia.org/T198716) (owner: 10Jforrester) [23:30:22] !log jforrester@deploy1001 Synchronized wmf-config/extension-list: SWAT Load TimedMediaHandler i18n via static extension registration (duration: 00m 48s) [23:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:05] (03PS2) 10Jforrester: PageImages: Make it possible to add extra namespaces, part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446258 [23:32:10] (03CR) 10Jforrester: [C: 032] PageImages: Make it possible to add extra namespaces, part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446258 (owner: 10Jforrester) [23:32:18] (03PS2) 10Jforrester: PageImages: Add NS_CATEGORY for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446259 [23:32:23] (03CR) 10Jforrester: [C: 032] PageImages: Add NS_CATEGORY for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446259 (owner: 10Jforrester) [23:32:39] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT PageImages: Make it possible to add extra namespaces, part I - I6c51c1fe1 (duration: 00m 49s) [23:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:01] !log bblack@neodymium conftool action : set/weight=1; selector: dc=eqiad,cluster=cache_upload,service=varnish-fe [23:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:23] 
(03Merged) 10jenkins-bot: PageImages: Make it possible to add extra namespaces, part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446258 (owner: 10Jforrester) [23:34:13] (03Merged) 10jenkins-bot: PageImages: Add NS_CATEGORY for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446259 (owner: 10Jforrester) [23:35:06] !log bblack@neodymium conftool action : set/weight=1; selector: dc=eqiad,cluster=cache_upload,service=nginx [23:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:21] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT PageImages: Make it possible to add extra namespaces, part II - I3f225d051 (duration: 00m 48s) [23:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:11] !log jforrester@deploy1001 sync-file aborted: SWAT PageImages: Add NS_CATEGORY for Commons T198716 (duration: 00m 04s) [23:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:15] T198716: Enable PageImages on Commons categories - https://phabricator.wikimedia.org/T198716 [23:38:44] (03PS1) 10Jforrester: Revert "PageImages: Make it possible to add extra namespaces, part II" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450878 [23:38:48] (03CR) 10Jforrester: [C: 032] Revert "PageImages: Make it possible to add extra namespaces, part II" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450878 (owner: 10Jforrester) [23:39:06] (03PS1) 10Jforrester: Revert "PageImages: Add NS_CATEGORY for Commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450879 [23:39:07] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.213 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [23:39:13] (03CR) 10Jforrester: [C: 032] Revert "PageImages: Add NS_CATEGORY for Commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450879 (owner: 10Jforrester) [23:39:51] (03CR) 10jenkins-bot: Load TimedMediaHandler's 
i18n via static extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448177 (https://phabricator.wikimedia.org/T140852) (owner: 10Jforrester) [23:39:53] (03CR) 10jenkins-bot: PageImages: Make it possible to add extra namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446257 (https://phabricator.wikimedia.org/T198716) (owner: 10Jforrester) [23:39:55] (03CR) 10jenkins-bot: PageImages: Make it possible to add extra namespaces, part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446258 (owner: 10Jforrester) [23:39:57] (03CR) 10jenkins-bot: PageImages: Add NS_CATEGORY for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446259 (owner: 10Jforrester) [23:40:37] (03Merged) 10jenkins-bot: Revert "PageImages: Make it possible to add extra namespaces, part II" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450878 (owner: 10Jforrester) [23:40:42] (03CR) 10jenkins-bot: Revert "PageImages: Make it possible to add extra namespaces, part II" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450878 (owner: 10Jforrester) [23:40:45] (03Merged) 10jenkins-bot: Revert "PageImages: Add NS_CATEGORY for Commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450879 (owner: 10Jforrester) [23:41:44] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT Revert PageImages: Make it possible to add extra namespaces, part II, fatalmonitor unhappy (duration: 00m 49s) [23:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:30] !log bblack@neodymium conftool action : set/weight=1; selector: dc=esams,cluster=cache_text,service=varnish-fe [23:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:44] !log SWAT complete. 
[23:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:49] !log bblack@neodymium conftool action : set/weight=1; selector: dc=esams,cluster=cache_text,service=nginx [23:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:26] (03PS1) 10Jforrester: Re-apply "PageImages: Make it possible to add extra namespaces, part II" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450884 [23:45:35] (03CR) 10Jforrester: [C: 04-2] Re-apply "PageImages: Make it possible to add extra namespaces, part II" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450884 (owner: 10Jforrester) [23:45:58] (03PS1) 10Jforrester: Re-apply "PageImages: Add NS_CATEGORY for Commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450885 (https://phabricator.wikimedia.org/T198716) [23:46:47] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.1727 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [23:47:12] (03PS2) 10Jforrester: Re-apply "PageImages: Add NS_CATEGORY for Commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450885 (https://phabricator.wikimedia.org/T198716) [23:47:27] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.2332 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [23:55:49] (03PS2) 10Jforrester: Re-apply "PageImages: Make it possible to add extra namespaces, part II" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/450884