[00:05:31] (PS7) Dzahn: Phabricator: Remove ubuntu / upstart support [puppet] - https://gerrit.wikimedia.org/r/379794 (owner: Paladox)
[00:06:45] (CR) Dzahn: [C: 2] Phabricator: Remove ubuntu / upstart support [puppet] - https://gerrit.wikimedia.org/r/379794 (owner: Paladox)
[00:08:40] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active, AS1299/IPv4: Active
[00:12:41] (PS2) Dzahn: profile::discovery_dashboards: Add daily forecasts dashboard [puppet] - https://gerrit.wikimedia.org/r/380786 (https://phabricator.wikimedia.org/T112170) (owner: Bearloga)
[00:12:49] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 352, down: 2, shutdown: 2
[00:14:21] (CR) Dzahn: [C: 1] "wait a minute, we have "--enable-sshd" here and didn't merge this yet and were wondering why sshd doesn't start :p heh" [puppet] - https://gerrit.wikimedia.org/r/379420 (owner: Paladox)
[00:14:59] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 60, down: 1, dormant: 0, excluded: 0, unused: 0
[00:15:04] (CR) Dzahn: [C: 2] profile::discovery_dashboards: Add daily forecasts dashboard [puppet] - https://gerrit.wikimedia.org/r/380786 (https://phabricator.wikimedia.org/T112170) (owner: Bearloga)
[00:18:54] (CR) Dzahn: [C: 1] "http://puppet-compiler.wmflabs.org/8049/" [puppet] - https://gerrit.wikimedia.org/r/379420 (owner: Paladox)
[00:19:40] (PS5) Dzahn: Gerrit: Enable ui for slaves [puppet] - https://gerrit.wikimedia.org/r/379420 (https://phabricator.wikimedia.org/T176532) (owner: Paladox)
[00:20:09] (PS6) Dzahn: Gerrit: Enable ui for slaves [puppet] - https://gerrit.wikimedia.org/r/379420 (https://phabricator.wikimedia.org/T176532) (owner: Paladox)
[00:21:12] (CR) Dzahn: [C: 2] Gerrit: Enable ui for slaves [puppet] - https://gerrit.wikimedia.org/r/379420 (https://phabricator.wikimedia.org/T176532) (owner: Paladox)
[00:24:26] (CR) Dzahn: "no-op on cobalt, added options on gerrit2001" [puppet] - https://gerrit.wikimedia.org/r/379420 (https://phabricator.wikimedia.org/T176532) (owner: Paladox)
[00:27:44] (CR) Dzahn: [C: 1] Fix problem with throttle rule for John Michael Kohler Art Center. [mediawiki-config] - https://gerrit.wikimedia.org/r/379661 (https://phabricator.wikimedia.org/T176287) (owner: Zoranzoki21)
[00:31:26] (CR) Dzahn: [C: -1] "this is superseded by https://gerrit.wikimedia.org/r/#/c/379136/ and https://gerrit.wikimedia.org/r/#/c/378768/17 but still uses the gerri" [puppet] - https://gerrit.wikimedia.org/r/356516 (owner: Dzahn)
[00:36:14] (CR) Dzahn: "the comments on https://gerrit.wikimedia.org/r/#/c/356516/ also apply here i guess" [puppet] - https://gerrit.wikimedia.org/r/378768 (https://phabricator.wikimedia.org/T157414) (owner: Paladox)
[00:36:19] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0
[00:36:45] (Abandoned) Dzahn: icinga/base: turn screen monitoring into a WARN-only check [puppet] - https://gerrit.wikimedia.org/r/376636 (https://phabricator.wikimedia.org/T165348) (owner: Dzahn)
[00:38:26] (Abandoned) Dzahn: elasticsearch: replace validate_bool with validate_legacy [puppet] - https://gerrit.wikimedia.org/r/377366 (owner: Dzahn)
[00:39:09] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active, AS1299/IPv4: Active
[00:43:10] (CR) Dzahn: [C: 1] "http://puppet-compiler.wmflabs.org/8050/" [puppet] - https://gerrit.wikimedia.org/r/378708 (https://phabricator.wikimedia.org/T175876) (owner: Ayounsi)
[00:48:22] Operations, Gerrit, Patch-For-Review, Release-Engineering-Team (Next): Gerrit is failing to start gerrit-ssh on gerrit2001 - https://phabricator.wikimedia.org/T176532#3638321 (Dzahn) Well, that change above has "--enable-sshd" as option when it is a slave, and it wasn't merged.. and we were wonde...
[00:50:17] (CR) Dzahn: "heh, it happened again :) i should have learned from last time. fixed!" [puppet] - https://gerrit.wikimedia.org/r/380565 (https://phabricator.wikimedia.org/T176529) (owner: Dzahn)
[00:50:19] (PS2) Dzahn: admins: partially re-enable shell access for cwdent [puppet] - https://gerrit.wikimedia.org/r/380565 (https://phabricator.wikimedia.org/T176529)
[00:57:14] (CR) Dzahn: [C: 1] "you said we are still waiting with these, but what was it that we wanted to check first" [puppet] - https://gerrit.wikimedia.org/r/366812 (owner: Muehlenhoff)
[01:01:15] (CR) Dzahn: [C: 1] "That linked patch is abandoned meanwhile. Are we still going to use docker now and never be on stretch with puppet?" [puppet] - https://gerrit.wikimedia.org/r/361680 (https://phabricator.wikimedia.org/T166611) (owner: Paladox)
[01:22:49] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 354, down: 2, shutdown: 0
[02:05:17] (PS1) Dzahn: icinga/base: re-enable screen/tmux monitoring [puppet] - https://gerrit.wikimedia.org/r/380901 (https://phabricator.wikimedia.org/T165348)
[02:06:39] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[02:32:51] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.19) (duration: 08m 42s)
[02:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:46:09] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[02:46:09] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[02:46:19] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[02:46:19] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[02:46:19] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[02:46:19] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[02:46:19] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[02:46:20] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[02:46:29] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[02:46:59] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[02:46:59] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[02:47:00] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[02:47:01] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[02:47:01] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[02:47:01] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[02:47:01] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[02:47:09] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[02:47:09] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[02:47:09] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[02:47:09] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[02:52:00] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[02:52:00] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[02:52:09] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[02:52:10] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[02:52:10] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[02:52:10] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[02:52:19] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[02:52:20] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[02:52:20] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:04:09] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:04:09] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:04:09] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:04:10] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:04:10] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:04:10] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:04:10] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:04:11] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:04:19] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:04:19] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:04:19] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:04:19] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:04:19] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:04:20] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:04:20] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:04:21] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:04:21] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:04:22] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:04:22] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:04:23] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:04:23] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:04:24] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:05:09] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:05:10] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:11:01] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.1) (duration: 15m 55s)
[03:11:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:16:29] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[03:16:29] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[03:16:29] RECOVERY - MariaDB Slave Lag: m3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 88433.44 seconds
[03:16:29] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[03:16:29] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[03:16:39] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[03:16:39] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[03:16:39] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[03:16:39] RECOVERY - MariaDB Slave Lag: s4 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 88813.84 seconds
[03:17:22] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Sep 27 03:17:22 UTC 2017 (duration 6m 22s)
[03:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:20:20] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:20:20] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:20:20] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:20:29] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:20:29] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:20:29] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:20:29] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:20:29] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:20:30] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:20:30] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:20:31] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:20:31] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:20:32] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:20:39] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:20:40] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:20:40] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:21:19] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:21:19] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:21:19] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:21:19] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:21:20] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:21:20] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:21:20] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:21:21] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:21:21] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:21:22] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:21:22] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:26:29] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:26:39] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:30:00] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[03:32:29] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:32:29] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:32:29] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:32:29] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:32:30] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:32:30] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:32:30] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:32:31] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:32:31] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:32:32] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:32:32] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:32:33] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:32:33] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:32:34] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:33:09] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[03:33:19] PROBLEM - Apache HTTP on mw2201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:33:29] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:33:29] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:33:29] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:33:29] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:33:29] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[03:33:30] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:33:30] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:33:31] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:33:31] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[03:33:32] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[03:34:09] RECOVERY - Apache HTTP on mw2201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.120 second response time
[04:07:15] (Abandoned) Zoranzoki21: Fix problem with throttle rule for John Michael Kohler Art Center. [mediawiki-config] - https://gerrit.wikimedia.org/r/379661 (https://phabricator.wikimedia.org/T176287) (owner: Zoranzoki21)
[04:09:29] (PS1) Zoranzoki21: Revert "Add John Michael Kohler Art Center throttle rule" because rule expired. [mediawiki-config] - https://gerrit.wikimedia.org/r/380913 (https://phabricator.wikimedia.org/T176287)
[04:10:28] (PS2) Zoranzoki21: Revert "Add John Michael Kohler Art Center throttle rule" because rule expired. [mediawiki-config] - https://gerrit.wikimedia.org/r/380913 (https://phabricator.wikimedia.org/T176287)
[04:14:40] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:14:40] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:14:40] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:14:40] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:14:40] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:14:40] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:14:41] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:14:41] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:14:42] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:14:42] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:14:43] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:14:49] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:14:50] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:14:50] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:14:50] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:14:50] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:14:50] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:14:50] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:14:51] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:14:51] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:14:52] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:14:52] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:14:53] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:14:53] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:22:40] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0
[04:25:49] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:25:50] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:25:50] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:25:50] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:25:50] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:25:50] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:25:50] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:25:51] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:25:51] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:25:52] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:25:59] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:25:59] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:25:59] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:25:59] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:25:59] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:26:00] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:26:00] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:26:01] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:26:09] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:26:09] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:26:39] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0
[04:26:50] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:26:51] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:26:51] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:26:51] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:26:51] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:26:51] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:26:51] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:26:51] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:26:51] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:27:30] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0
[04:28:49] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0
[04:30:19] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[04:33:20] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[04:36:09] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:36:09] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:36:10] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:36:10] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:36:19] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0
[04:36:50] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:36:50] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:36:50] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:36:50] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:36:50] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:36:50] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:36:51] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:36:51] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:36:52] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:36:52] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:36:59] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:36:59] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:36:59] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:36:59] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:36:59] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:37:00] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:37:01] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:37:01] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:37:01] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:37:02] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:37:02] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:37:03] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:37:03] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:37:19] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[04:37:59] (CR) Zoranzoki21: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/380913 (https://phabricator.wikimedia.org/T176287) (owner: Zoranzoki21)
[04:46:59] RECOVERY - MariaDB Slave Lag: m2 on dbstore1001 is OK: OK slave_sql_lag not a slave
[04:46:59] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[04:46:59] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:46:59] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[04:46:59] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[04:47:00] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:47:00] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[04:47:01] RECOVERY - MariaDB Slave Lag: s6 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 81306.27 seconds
[04:47:01] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[04:47:02] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[04:47:02] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:47:03] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:47:03] RECOVERY - MariaDB Slave Lag: x1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 38.37 seconds
[04:47:04] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:47:19] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:47:19] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[04:48:17] (PS1) Madhuvishy: paws_internal: Add the analytics client role back to notebook1001 [puppet] - https://gerrit.wikimedia.org/r/380914
[04:49:02] (CR) Madhuvishy: [C: 2] paws_internal: Add the analytics client role back to notebook1001 [puppet] - https://gerrit.wikimedia.org/r/380914 (owner: Madhuvishy)
[04:51:09] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:51:10] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:51:19] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:51:20] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:52:00] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:52:00] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:52:00] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:52:00] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:52:00] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:52:00] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:52:01] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:52:01] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:52:02] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:52:02] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:52:03] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:52:03] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:52:09] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:52:09] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:52:09] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[04:52:10] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:52:10] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:52:10] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:52:10] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:52:11] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[04:52:11] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:53:00] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0
[04:54:09] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0
[04:57:09] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:57:11] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:57:11] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:57:11] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:59:09] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:59:09] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:59:10] RECOVERY - MariaDB Slave Lag: m3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 86195.03 seconds
[04:59:10] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[04:59:28] (CR) Zoranzoki21: [C: 1] admins: partially re-enable shell access for cwdent [puppet] - https://gerrit.wikimedia.org/r/380565 (https://phabricator.wikimedia.org/T176529) (owner: Dzahn)
[05:03:09] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[05:03:09] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[05:03:09] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[05:03:09] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[05:03:09] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[05:03:10] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[05:03:10] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[05:03:11] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect
[05:03:11] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect
[05:03:12] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[05:03:12] PROBLEM - MariaDB Slave Lag: s6 on
dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:03:13] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:03:13] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:03:14] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:03:29] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:03:29] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:09:19] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:10:20] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [05:14:29] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:14:29] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:14:30] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:14:30] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:15:10] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:15:10] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:15:10] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:15:10] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:15:10] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:15:11] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not 
connect [05:15:11] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:15:19] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:15:19] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:15:19] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:15:19] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:15:19] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:15:20] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:15:20] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:15:21] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:15:21] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:15:22] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:15:22] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:15:23] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:15:23] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:15:24] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:23:41] (03CR) 10Zoranzoki21: [C: 031] Revert "Add John Michael Kohler Art Center throttle rule" because rule expired. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/380913 (https://phabricator.wikimedia.org/T176287) (owner: 10Zoranzoki21) [05:24:01] (03CR) 10Zoranzoki21: [C: 031] "Maybe vote will resolve problem" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380913 (https://phabricator.wikimedia.org/T176287) (owner: 10Zoranzoki21) [05:26:30] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:26:30] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:26:40] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:26:40] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:19] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:19] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:27:19] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:27:19] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:20] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:20] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:20] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:27:21] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:27:21] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:27:23] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:27:23] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL 
slave_sql_lag could not connect [05:27:23] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:27:23] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:27:24] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:29:30] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [05:32:30] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [05:34:49] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [05:34:49] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [05:35:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [05:36:10] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [05:36:30] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 [05:37:31] (I got the nova test) [05:38:39] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [05:40:29] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:40:30] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:40:30] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:40:30] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:40:30] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is
CRITICAL: CRITICAL slave_sql_state could not connect [05:40:30] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:40:30] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:40:31] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:40:31] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:40:32] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:40:32] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:40:33] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:40:33] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:40:34] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:40:49] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:40:49] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:41:22] !log service nova-fullstack restart on labnet1001 [05:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:39] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [05:44:40] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [05:46:19] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [05:47:39] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:47:39] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL 
slave_sql_lag could not connect [05:47:39] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:47:39] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:47:39] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:47:40] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [05:47:40] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:47:41] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:47:41] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:47:42] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:47:42] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:47:43] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:47:43] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:47:44] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:47:59] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [05:47:59] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [05:59:20] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [06:00:00] 5xx known issue? [06:00:42] <_joe_> kart_: more or less, let me look [06:01:00] Okay. Thanks! [06:01:43] <_joe_> kart_: still experiencing 503s? 
[06:01:50] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [06:07:03] checking --^ [06:08:06] seems all related to cp3032, already recovered [06:08:49] previous spike was cp3030, weird [06:08:51] _joe_: seems OK now. [06:09:36] Checking CX. [06:09:55] I am seeing another spike now, cp3043 [06:11:47] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=now-6h&to=now&var-datasource=esams%20prometheus%2Fops&var-cache_type=text [06:13:29] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [06:14:57] cp3043 looks worse than the others [06:15:56] !log restart varnish backend on cp3043 [06:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:21] 10Operations, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Next): Gerrit is failing to start gerrit-ssh on gerrit2001 - https://phabricator.wikimedia.org/T176532#3638503 (10Paladox) Gerrit ssh won’t start because it can’t connect to the mysql db. Since chad couldn’t get init to work.
[06:29:31] !log Reboot db2044 after storage failure - T174764 [06:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:37] T174764: db2044 HW RAID failure - https://phabricator.wikimedia.org/T174764 [06:32:32] (03PS1) 10Marostegui: db-codfw.php: Add task to db2044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380916 [06:33:46] !log restart varnish-be on cp3041 [06:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:17] (03CR) 10Marostegui: [C: 032] db-codfw.php: Add task to db2044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380916 (owner: 10Marostegui) [06:35:49] (03Merged) 10jenkins-bot: db-codfw.php: Add task to db2044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380916 (owner: 10Marostegui) [06:36:01] !log installing git security updates (our config is not affected, but still fixing the underlying issue) [06:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:13] (03CR) 10jenkins-bot: db-codfw.php: Add task to db2044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380916 (owner: 10Marostegui) [06:37:29] !log restart varnish-be on cp3040 [06:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:53] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add task number to db2044 line (duration: 00m 48s) [06:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:19] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [06:39:19] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [06:40:44] 10Operations, 10ops-codfw, 10DBA: db2044 HW RAID failure - https://phabricator.wikimedia.org/T174764#3638520 (10Marostegui) a:05Marostegui>03Papaul I have rebooted the server via ILO as there were not much I could debug in this status: ``` [06:28:34] 
root@db2044:~# df -hT -bash: /bin/df: Input/output err... [06:40:53] (03PS3) 10Zoranzoki21: Revert "Add John Michael Kohler Art Center throttle rule" because rule expired. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380913 (https://phabricator.wikimedia.org/T176287) [06:41:49] (03CR) 10Zoranzoki21: "Now is ok.. I will now remove personal vote." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380913 (https://phabricator.wikimedia.org/T176287) (owner: 10Zoranzoki21) [06:49:09] (03CR) 10Thiemo Mättig (WMDE): [C: 031] "This is quite obviously a bugfix. Who can merge it?" [puppet] - 10https://gerrit.wikimedia.org/r/380774 (https://phabricator.wikimedia.org/T163922) (owner: 10Lucas Werkmeister (WMDE)) [06:50:09] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:51:39] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:55:13] !log Deploy alter table on db1052 (enwiki master) - T174509 [06:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:20] PROBLEM - puppet last run on restbase1012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[git] [06:55:20] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [06:57:16] !log Deploy alter table on s6 codfw - T174509 [06:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:39] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 [07:02:19] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0 [07:02:36] (03PS3) 10Muehlenhoff: Adapt mediawiki::packages::math for stretch [puppet] - 10https://gerrit.wikimedia.org/r/380726 [07:02:39] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0 [07:03:15] (03CR) 10Muehlenhoff: [C: 032] Adapt mediawiki::packages::math for stretch [puppet] - 10https://gerrit.wikimedia.org/r/380726 (owner: 10Muehlenhoff) [07:03:46] (03CR) 10Thiemo Mättig (WMDE): [C: 031] Remove unnecessary `id` attributes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377406 (https://phabricator.wikimedia.org/T175670) (owner: 10VolkerE) [07:04:00] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2072 and db2071" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380920 [07:04:04] (03PS2) 10Marostegui: Revert "db-codfw.php: Depool db2072 and db2071" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380920 [07:13:47] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2072 and db2071" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380920 (owner: 10Marostegui) [07:16:07] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2072 and db2071" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380920 (owner: 10Marostegui) [07:16:17] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2072 and db2071" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/380920 (owner: 10Marostegui) [07:16:29] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [07:16:49] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [07:17:41] 10Operations, 10Gerrit, 10Release-Engineering-Team: 404 for /changes/ with Gerrit / git-review - https://phabricator.wikimedia.org/T176835#3638554 (10qoreqyas) [07:17:57] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2071 and db2072 - T174509 (duration: 00m 47s) [07:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:03] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [07:20:37] (03PS1) 10Marostegui: db-codfw: Depool db2070, db2069, db2042 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380921 (https://phabricator.wikimedia.org/T174509) [07:23:49] RECOVERY - puppet last run on restbase1012 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [07:24:27] (03CR) 10Marostegui: [C: 032] db-codfw: Depool db2070, db2069, db2042 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380921 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [07:25:56] (03Merged) 10jenkins-bot: db-codfw: Depool db2070, db2069, db2042 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380921 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [07:26:05] (03CR) 10jenkins-bot: db-codfw: Depool db2070, db2069, db2042 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380921 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [07:27:21] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2070, db2069 and db2042 to optimize templatelinks and pagelinks tables - T174509 (duration: 00m 48s) [07:27:26] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:26] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [07:28:50] 10Operations, 10Gerrit, 10Release-Engineering-Team: 404 for /changes/ with Gerrit / git-review - https://phabricator.wikimedia.org/T176835#3638603 (10Paladox) Hi, this is a known problem with git-review. Try cloning over ssh please. [07:37:45] !log Optimize pagelinks and template links tables on db2070 db2069 db2042 db2016 - T174509 [07:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:51] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [07:45:13] (03PS1) 10Marostegui: db-codfw.php: Depool db2076, db2067, db2060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380922 (https://phabricator.wikimedia.org/T174509) [07:48:23] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2076, db2067, db2060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380922 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [07:48:29] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [07:48:29] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [07:48:29] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [07:48:29] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [07:48:29] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [07:48:30] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [07:48:30] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [07:48:31] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is 
CRITICAL: CRITICAL slave_sql_lag could not connect [07:48:31] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [07:48:32] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [07:48:32] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [07:48:33] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [07:48:33] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [07:48:34] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [07:48:34] ^ backups [07:48:49] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [07:48:49] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [07:50:47] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2076, db2067, db2060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380922 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [07:50:58] (03CR) 10jenkins-bot: db-codfw.php: Depool db2076, db2067, db2060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380922 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [07:51:10] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [07:51:14] marostegui: should the backups put themselves in downtime first?
:-P [07:51:23] volans: yeah, that'd be nice XD [07:52:01] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2076 db2067 db2060 to optimize templatelinks and pagelinks tables - T174509 (duration: 00m 48s) [07:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:05] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509 [07:55:18] !log Optimize pagelinks and template links tables on  db2076 db2067 db2060 - T174509 [07:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:23] (03PS1) 10Marostegui: db-codfw.php: Depool db2064 and db2063 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380924 (https://phabricator.wikimedia.org/T174509) [08:12:29] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 [08:12:48] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2064 and db2063 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380924 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [08:13:30] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [08:15:14] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2064 and db2063 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380924 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [08:16:11] (03CR) 10jenkins-bot: db-codfw.php: Depool db2064 and db2063 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380924 (https://phabricator.wikimedia.org/T174509) (owner: 10Marostegui) [08:16:22] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2064 db2063 to optimize templatelinks and pagelinks tables - T174509 (duration: 00m 48s) [08:16:27] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/380565 
(https://phabricator.wikimedia.org/T176529) (owner: 10Dzahn)
[08:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:27] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[08:16:58] !log Optimize tables pagelinks templatelinks on db2064 and db2063 - T174509
[08:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:25] (03PS10) 10Zoranzoki21: Access for Slaporte (Stephen LaPorte) to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/379851 (https://phabricator.wikimedia.org/T176518)
[08:20:58] 10Operations, 10Ops-Access-Requests, 10Research: Server access for Miriam Redi - https://phabricator.wikimedia.org/T176682#3638680 (10Miriam) Hi @herron , Thanks for this! I have created my ssh keypair from my wiki laptop, please find the public key on my page: https://office.wikimedia.org/wiki/User:Miriam_...
[08:28:50] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[08:28:50] RECOVERY - MariaDB Slave Lag: s7 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 84615.03 seconds
[08:28:50] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[08:28:50] RECOVERY - MariaDB Slave Lag: m3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 75512.03 seconds
[08:28:50] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[08:28:50] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[08:28:51] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[08:28:51] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[08:31:37] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3638689 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['mw1312.eqiad.wmnet', 'mw1313.eqiad.wmnet...
[08:33:47] !log mobrovac@tin Started deploy [restbase/deploy@1cc530b]: Introduce the /page/title/{title}/{revision} end point - T158100
[08:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:52] T158100: Deprecate and remove the public title/{title} endpoint - https://phabricator.wikimedia.org/T158100
[08:34:46] !log raise traffic weights to 30 for mw13[19-28] incrementally - T165519
[08:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:50] T165519: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519
[08:43:12] 10Operations: nginx xmldumps duplicate logrotate config - https://phabricator.wikimedia.org/T176840#3638696 (10elukey)
[08:43:48] !log mobrovac@tin Finished deploy [restbase/deploy@1cc530b]: Introduce the /page/title/{title}/{revision} end point - T158100 (duration: 10m 01s)
[08:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:53] T158100: Deprecate and remove the public title/{title} endpoint - https://phabricator.wikimedia.org/T158100
[08:46:57] 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, and 2 others: Create wikibase/wikiba.se-deploy repo - https://phabricator.wikimedia.org/T176841#3638715 (10Ladsgroup)
[08:48:37] Can this be merged? https://gerrit.wikimedia.org/r/#/c/380774/1
[08:48:58] (03CR) 10Gehel: Exceptions: convert remaining spuriours ones (033 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/380798 (owner: 10Volans)
[08:49:13] it's very straightforward and we don't have puppet swat today
[08:53:48] !log mobrovac@tin Started deploy [changeprop/deploy@7a3dc66]: Use the /page/title/{title}/{revision} end point for revision-visibility-change - T158100
[08:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:54] T158100: Deprecate and remove the public title/{title} endpoint - https://phabricator.wikimedia.org/T158100
[08:55:13] (03CR) 10Hoo man: [C: 031] "Queries are ok, per T176760#3638733." [puppet] - 10https://gerrit.wikimedia.org/r/380628 (owner: 10Hoo man)
[08:55:14] !log mobrovac@tin Finished deploy [changeprop/deploy@7a3dc66]: Use the /page/title/{title}/{revision} end point for revision-visibility-change - T158100 (duration: 01m 26s)
[08:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:19] Amir1: yeah it looks trivial enough, but since it is apache config everybody is always cautious before merging :)
[08:56:30] lemme check with others but it should be good to merge
[08:56:38] Thanks :)
[08:56:52] (03CR) 10Gehel: [C: 031] "LGTM" [software/cumin] - 10https://gerrit.wikimedia.org/r/380773 (owner: 10Volans)
[09:00:11] (03CR) 10Gehel: Exceptions: convert remaining spuriours ones (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/380798 (owner: 10Volans)
[09:02:00] (03CR) 10Volans: "reply inline" (032 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/380798 (owner: 10Volans)
[09:02:25] (03PS2) 10Elukey: Fix /data/ redirect for commons [puppet] - 10https://gerrit.wikimedia.org/r/380774 (https://phabricator.wikimedia.org/T163922) (owner: 10Lucas Werkmeister (WMDE))
[09:10:04] (03CR) 10Giuseppe Lavagetto: [C: 031] Fix /data/ redirect for commons [puppet] - 10https://gerrit.wikimedia.org/r/380774 (https://phabricator.wikimedia.org/T163922) (owner: 10Lucas Werkmeister (WMDE))
[09:16:59] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request access to logstash (nda group) for @framawiki - https://phabricator.wikimedia.org/T176364#3638804 (10Qgil) >>! In T176364#3636861, @Dzahn wrote: > Looking at https://wikimediafoundation.org/wiki/Staff_and_contractors it seems @Qgil is the one...
[09:21:44] Amir1: getting ready to merge the change, but I'll have to test a bit with puppet disabled before proceeding (just to be sure)
[09:22:05] elukey: thank you very much
[09:23:50] Amir1: can you give me some examples to test before/after? (I only checked the one on the task)
[09:24:10] okay
[09:26:27] commons.wikimedia.org/data/main/Data:Bundestagswahl2017/wahlkreis81.map
[09:26:37] elukey: ^
[09:26:49] elukey: should redirect to https://commons.wikimedia.org/wiki/Data:Bundestagswahl2017/wahlkreis81.map
[09:26:54] (03PS1) 10Marostegui: mariadb: Add db1103 to s3 [puppet] - 10https://gerrit.wikimedia.org/r/380937 (https://phabricator.wikimedia.org/T172679)
[09:27:33] (03CR) 10Volans: [C: 032] Configuration: do not raise on empty configuration [software/cumin] - 10https://gerrit.wikimedia.org/r/380773 (owner: 10Volans)
[09:28:03] okok the one on the task, good
[09:28:13] (03CR) 10Elukey: [C: 032] Fix /data/ redirect for commons [puppet] - 10https://gerrit.wikimedia.org/r/380774 (https://phabricator.wikimedia.org/T163922) (owner: 10Lucas Werkmeister (WMDE))
[09:30:55] (03PS1) 10Marostegui: db-eqiad.php: Depool db1035 to clone it to db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380938 (https://phabricator.wikimedia.org/T172679)
[09:31:13] (03Merged) 10jenkins-bot: Configuration: do not raise on empty configuration [software/cumin] - 10https://gerrit.wikimedia.org/r/380773 (owner: 10Volans)
[09:31:59] (03CR) 10Marostegui: [C: 032] "Puppet looks good: https://puppet-compiler.wmflabs.org/compiler02/8052/" [puppet] - 10https://gerrit.wikimedia.org/r/380937 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui)
[09:32:05] (03PS2) 10Marostegui: mariadb: Add db1103 to s3 [puppet] - 10https://gerrit.wikimedia.org/r/380937 (https://phabricator.wikimedia.org/T172679)
[09:35:12] Thanks
[09:36:36] Amir1: all good, re-enabling puppet on app/api-servers.. apache will reload during the next ~30/40 mins everywhere
[09:36:45] PROBLEM - Check whether ferm is active by checking the default input chain on pybal-test2003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[09:36:48] (03PS2) 10Marostegui: db-eqiad.php: Depool db1035 to clone it to db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380938 (https://phabricator.wikimedia.org/T172679)
[09:36:53] awesome, thank you
[09:36:55] PROBLEM - Check whether ferm is active by checking the default input chain on pybal-test2001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[09:37:15] this is me ^
[09:37:48] (03PS1) 10Gilles: Thumbor: fix logstash message type [puppet] - 10https://gerrit.wikimedia.org/r/380942 (https://phabricator.wikimedia.org/T150734)
[09:37:50] (03PS1) 10Gilles: Thumbor: don’t rewrite host value in logstash messages [puppet] - 10https://gerrit.wikimedia.org/r/380943 (https://phabricator.wikimedia.org/T150734)
[09:38:26] (03CR) 10jerkins-bot: [V: 04-1] Thumbor: don’t rewrite host value in logstash messages [puppet] - 10https://gerrit.wikimedia.org/r/380943 (https://phabricator.wikimedia.org/T150734) (owner: 10Gilles)
[09:38:28] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1035 to clone it to db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380938 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui)
[09:38:55] RECOVERY - Check whether ferm is active by checking the default input chain on pybal-test2001 is OK: OK ferm input default policy is set
[09:39:45] RECOVERY - Check whether ferm is active by checking the default input chain on pybal-test2003 is OK: OK ferm input default policy is set
[09:40:00] (03PS2) 10Gilles: Thumbor: fix logstash message type [puppet] - 10https://gerrit.wikimedia.org/r/380942 (https://phabricator.wikimedia.org/T150734)
[09:40:03] (03PS2) 10Gilles: Thumbor: don't rewrite host value in logstash messages [puppet] - 10https://gerrit.wikimedia.org/r/380943 (https://phabricator.wikimedia.org/T150734)
[09:41:06] (03PS2) 10Volans: Exceptions: convert remaining spuriours ones [software/cumin] - 10https://gerrit.wikimedia.org/r/380798
[09:41:08] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1035 to clone it to db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380938 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui)
[09:41:22] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1035 to clone it to db1103 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380938 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui)
[09:42:23] (03PS1) 10Marostegui: s3.hosts: Add db1103 [software] - 10https://gerrit.wikimedia.org/r/380944 (https://phabricator.wikimedia.org/T172679)
[09:42:37] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1035 to transfer its data to db1103 - T172679 (duration: 00m 48s)
[09:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:42] T172679: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679
[09:43:15] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:45:13] !log Stop mysql on db1035 to copy its data to db1103 - T172679
[09:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:54] (03CR) 10Marostegui: [C: 032] s3.hosts: Add db1103 [software] - 10https://gerrit.wikimedia.org/r/380944 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui)
[09:48:19] (03Merged) 10jenkins-bot: s3.hosts: Add db1103 [software] - 10https://gerrit.wikimedia.org/r/380944 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui)
[09:50:24] (03CR) 10Gehel: "good enough. I don't entirely agree with the way exceptions are re-written, but this seems to be a documented python best practice..." [software/cumin] - 10https://gerrit.wikimedia.org/r/380798 (owner: 10Volans)
[09:50:27] (03CR) 10Gehel: [C: 031] Exceptions: convert remaining spuriours ones [software/cumin] - 10https://gerrit.wikimedia.org/r/380798 (owner: 10Volans)
[09:50:34] (03CR) 10Muehlenhoff: [C: 04-1] "silver still uses and is based on trusty :-/" [puppet] - 10https://gerrit.wikimedia.org/r/380712 (owner: 10Muehlenhoff)
[09:52:54] (03CR) 10Volans: [C: 032] Exceptions: convert remaining spuriours ones [software/cumin] - 10https://gerrit.wikimedia.org/r/380798 (owner: 10Volans)
[09:55:55] (03Merged) 10jenkins-bot: Exceptions: convert remaining spuriours ones [software/cumin] - 10https://gerrit.wikimedia.org/r/380798 (owner: 10Volans)
[10:01:09] 10Operations, 10Datasets-General-or-Unknown, 10User-ArielGlenn: logrotate issue (cron spam) on dumps hosts - https://phabricator.wikimedia.org/T176810#3638898 (10ArielGlenn)
[10:01:11] 10Operations: nginx xmldumps duplicate logrotate config - https://phabricator.wikimedia.org/T176840#3638900 (10ArielGlenn)
[10:02:12] (03CR) 10Ladsgroup: "Any updates?" [puppet] - 10https://gerrit.wikimedia.org/r/376500 (https://phabricator.wikimedia.org/T175242) (owner: 10Ladsgroup)
[10:06:16] (03PS1) 10Volans: cumin (WMCS): allow to setup cumin in a project [puppet] - 10https://gerrit.wikimedia.org/r/380947 (https://phabricator.wikimedia.org/T176314)
[10:06:25] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[10:06:26] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[10:06:26] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[10:06:26] RECOVERY - MariaDB Slave Lag: m2 on dbstore1001 is OK: OK slave_sql_lag not a slave
[10:06:26] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[10:06:26] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[10:06:26] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[10:06:39] (03CR) 10jerkins-bot: [V: 04-1] cumin (WMCS): allow to setup cumin in a project [puppet] - 10https://gerrit.wikimedia.org/r/380947 (https://phabricator.wikimedia.org/T176314) (owner: 10Volans)
[10:09:17] (03PS2) 10Volans: cumin (WMCS): allow to setup cumin in a project [puppet] - 10https://gerrit.wikimedia.org/r/380947 (https://phabricator.wikimedia.org/T176314)
[10:12:27] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[10:18:00] (03PS3) 10Volans: cumin (WMCS): allow to setup cumin in a project [puppet] - 10https://gerrit.wikimedia.org/r/380947 (https://phabricator.wikimedia.org/T176314)
[10:19:47] RECOVERY - MariaDB Slave Lag: x1 on dbstore1001 is OK: OK slave_sql_lag not a slave
[10:19:48] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state not a slave
[10:19:48] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
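[Editor's note: the /data/ redirect fix tested above (gerrit 380774, T163922) rewrites commons.wikimedia.org/data/... URLs to the corresponding wiki page. As a hedged illustration only, not the actual merged Apache config, the behaviour verified in the channel corresponds to a mod_rewrite rule of roughly this shape:]

```apache
# Illustrative sketch only; the rule actually merged in gerrit 380774 may differ.
# Redirect /data/<format>/<title> to the wiki page for that title, e.g.
#   /data/main/Data:Bundestagswahl2017/wahlkreis81.map
#   -> https://commons.wikimedia.org/wiki/Data:Bundestagswahl2017/wahlkreis81.map
RewriteEngine On
RewriteRule ^/data/[^/]+/(.+)$ https://commons.wikimedia.org/wiki/$1 [R=301,L,NE]
```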
[10:19:48] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state not a slave
[10:19:48] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state not a slave
[10:36:44] (03PS4) 10Volans: cumin (WMCS): allow to setup cumin in a project [puppet] - 10https://gerrit.wikimedia.org/r/380947 (https://phabricator.wikimedia.org/T176314)
[10:42:00] (03CR) 10Volans: "Compiler looks sane in prod: https://puppet-compiler.wmflabs.org/compiler02/8055/" [puppet] - 10https://gerrit.wikimedia.org/r/380947 (https://phabricator.wikimedia.org/T176314) (owner: 10Volans)
[10:42:10] (03PS1) 10Matthias Mullie: Add 3D extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380948
[10:49:11] Amir1: let me know if you guys see more garbled redirect cached, I'll try to clean them up.. afaics the task is not unbreak-now, but I can be convinced otherwise :)
[10:54:18] elukey: I think so too, let me talk to Daniel
[10:55:03] but first, food
[10:55:09] +1
[10:59:47] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3639002 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1314.eqiad.wmnet', 'mw1313.eqiad.wmnet', 'mw1312.eqiad.wmnet'] ``` and were **ALL** suc...
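[Editor's note: the dbstore1001 "MariaDB Slave Lag" lines above come from a monitoring check that maps a replication-lag reading (or "not a slave") to an Icinga state. A minimal Python sketch of that classification logic; the function name and the warn/crit thresholds are illustrative assumptions, not the production plugin:]

```python
# Hypothetical sketch, not the actual Icinga plugin: classify a
# replication-lag reading the way the "MariaDB Slave Lag" alerts report it.
def classify_lag(lag_seconds, warn=300.0, crit=600.0):
    """Map a Seconds_Behind_Master value (None meaning "not a slave")
    to an Icinga-style state string. Thresholds are assumed values."""
    if lag_seconds is None:
        return "OK: not a slave"
    if lag_seconds >= crit:
        return "CRITICAL: lag %.2fs" % lag_seconds
    if lag_seconds >= warn:
        return "WARNING: lag %.2fs" % lag_seconds
    return "OK: lag %.2fs" % lag_seconds
```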
[11:07:19] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[11:07:19] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[11:14:04] (03CR) 10Muehlenhoff: [C: 031] cumin (WMCS): allow to setup cumin in a project [puppet] - 10https://gerrit.wikimedia.org/r/380947 (https://phabricator.wikimedia.org/T176314) (owner: 10Volans)
[11:15:53] (03PS1) 10Matthias Mullie: Enable 3D on test & test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380950
[11:16:31] (03Abandoned) 10Matthias Mullie: Enable Extension:3d in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350210 (owner: 10Matthias Mullie)
[11:16:51] (03CR) 10Matthias Mullie: [C: 04-1] "Not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380950 (owner: 10Matthias Mullie)
[11:17:23] (03CR) 10jerkins-bot: [V: 04-1] Enable 3D on test & test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380950 (owner: 10Matthias Mullie)
[11:19:38] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3639073 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['mw1315.eqiad.wmnet', 'mw1316.eqiad.wmnet...
[11:21:50] (03Abandoned) 10MarcoAurelio: [WIP DNM] Create computed list of wikis that can use SecurePoll [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371926 (https://phabricator.wikimedia.org/T174398) (owner: 10MarcoAurelio)
[11:26:26] (03PS2) 10Matthias Mullie: Enable 3D on test & test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380950
[11:26:40] (03CR) 10Matthias Mullie: [C: 04-1] "Not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380950 (owner: 10Matthias Mullie)
[11:46:40] (03CR) 10EddieGP: "Thanks for cleaning up behind yourself! However, this isn't really needed - the throttle exemption isn't active any more anyway (because i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380913 (https://phabricator.wikimedia.org/T176287) (owner: 10Zoranzoki21)
[11:50:56] (03PS3) 10Hoo man: Enable ArticlePlaceholder on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380793 (https://phabricator.wikimedia.org/T176771) (owner: 10Jayprakash12345)
[11:57:58] phuedx: Hi :)
[11:58:10] elukey: o/ hey
[11:58:19] what have i done?! ;)
[11:58:33] phuedx: your current hive query is a little bit huge :D
[11:58:46] do you mind to come to the wikimedia-analytics channel to chat about it?
[11:59:08] sure!
[11:59:21] thanks :)
[11:59:30] elukey: wait...
[11:59:44] you're not going to shout at me as soon as i join the other channel are you?
[11:59:56] ;)
[11:59:57] haahahah
[12:00:04] hoo: (Dis)respected human, time to deploy ArticlePlaceholder (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170927T1200). Please do the needful.
[12:00:04] No GERRIT patches in the queue for this window AFAICS.
[12:00:11] nono we'd like to discuss with you other options
[12:00:29] one month of webrequest data is a lot and the cluster allocates a ton of resources for it
[12:00:35] derp
[12:00:39] joining now!
[12:02:06] (03CR) 10Hoo man: [C: 032] Enable ArticlePlaceholder on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380793 (https://phabricator.wikimedia.org/T176771) (owner: 10Jayprakash12345)
[12:03:37] (03Merged) 10jenkins-bot: Enable ArticlePlaceholder on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380793 (https://phabricator.wikimedia.org/T176771) (owner: 10Jayprakash12345)
[12:05:29] PROBLEM - Hadoop HistoryServer on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer
[12:05:34] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Enable the ArticlePlaceholder on bnwiki (T176771) (duration: 01m 00s)
[12:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:40] T176771: Enable ArticlePlaceholder on bnwiki - https://phabricator.wikimedia.org/T176771
[12:05:55] forgot to rebase again :S
[12:06:10] (03CR) 10jenkins-bot: Enable ArticlePlaceholder on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380793 (https://phabricator.wikimedia.org/T176771) (owner: 10Jayprakash12345)
[12:06:48] !log hoo@tin Synchronized wmf-config/InitialiseSettings.php: Enable the ArticlePlaceholder on bnwiki (T176771) (duration: 00m 49s)
[12:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:15] (03PS4) 10Muehlenhoff: Add support for stretch to hhvm::debug [puppet] - 10https://gerrit.wikimedia.org/r/380722
[12:13:37] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Add db1103 to the array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380954 (https://phabricator.wikimedia.org/T172679)
[12:14:27] hoo: can you let me know once you are done deplyong?
[12:14:28] !log installing apache updates on einsteinium/tegmen
[12:14:29] deploying
[12:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:51] marostegui: I'm done already… go ahead
[12:15:04] oh! thanks :)
[12:15:45] (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Add db1103 to the array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380954 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui)
[12:18:09] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Add db1103 to the array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380954 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui)
[12:18:19] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Add db1103 to the array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380954 (https://phabricator.wikimedia.org/T172679) (owner: 10Marostegui)
[12:19:28] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add db1103 to the array of hosts (duration: 00m 49s)
[12:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:24] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add db1103 to the array of hosts (duration: 00m 48s)
[12:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:10] RECOVERY - MariaDB Slave Lag: s4 on dbstore1001 is OK: OK slave_sql_lag not a slave
[12:25:11] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state not a slave
[12:25:11] RECOVERY - MariaDB Slave Lag: s5 on dbstore1001 is OK: OK slave_sql_lag not a slave
[12:25:11] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state not a slave
[12:33:54] (03PS1) 10Elukey: Raise HDFS's namenode maximum jvm heap size to 6G [puppet] - 10https://gerrit.wikimedia.org/r/380957
[12:38:09] mobrovac: Amir1: I am merging https://gerrit.wikimedia.org/r/#/c/376500/
[12:47:34] !log T175242 disable puppet across aqs kafka maps maps-test ores restbase restbase-dev sca scb wtp clusters for merging https://gerrit.wikimedia.org/r/#/c/376500/
[12:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:39] (03CR) 10Alexandros Kosiaris: [C: 032] service: Use LVS endpoint for logstash [puppet] - 10https://gerrit.wikimedia.org/r/376500 (https://phabricator.wikimedia.org/T175242) (owner: 10Ladsgroup)
[12:47:40] T175242: all log producers need to use the logstash LVS endpoint - https://phabricator.wikimedia.org/T175242
[12:47:45] (03PS2) 10Alexandros Kosiaris: service: Use LVS endpoint for logstash [puppet] - 10https://gerrit.wikimedia.org/r/376500 (https://phabricator.wikimedia.org/T175242) (owner: 10Ladsgroup)
[12:47:49] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] service: Use LVS endpoint for logstash [puppet] - 10https://gerrit.wikimedia.org/r/376500 (https://phabricator.wikimedia.org/T175242) (owner: 10Ladsgroup)
[12:53:27] (03CR) 10Elukey: [C: 032] Raise HDFS's namenode maximum jvm heap size to 6G [puppet] - 10https://gerrit.wikimedia.org/r/380957 (owner: 10Elukey)
[12:53:33] (03PS2) 10Elukey: Raise HDFS's namenode maximum jvm heap size to 6G [puppet] - 10https://gerrit.wikimedia.org/r/380957
[12:54:01] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/8056/" [puppet] - 10https://gerrit.wikimedia.org/r/380957 (owner: 10Elukey)
[12:54:53] (03PS1) 10Aklapper: Phab: Allow aklapper to delete panels on dashboards [puppet] - 10https://gerrit.wikimedia.org/r/380959
[12:54:54] !log T175242 enabled puppet in aqs kafka maps maps-test selected hosts and ran puppet manually.
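[Editor's note: the change merged above (gerrit 380957) raises the HDFS NameNode JVM heap ceiling from 4G to 6G. In a Hadoop deployment this kind of setting is typically rendered into hadoop-env.sh as a `-Xmx` flag; a hedged sketch of that shape only, not the actual puppet-rendered file or parameter name:]

```shell
# Illustrative hadoop-env.sh fragment (assumed form, not the rendered
# puppet output): raise the NameNode's maximum JVM heap to 6G.
export HADOOP_NAMENODE_OPTS="-Xmx6g ${HADOOP_NAMENODE_OPTS:-}"
```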
[12:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:58] T175242: all log producers need to use the logstash LVS endpoint - https://phabricator.wikimedia.org/T175242
[12:55:40] (03CR) 10Aklapper: "Spam example: https://phabricator.wikimedia.org/dashboard/view/265/" [puppet] - 10https://gerrit.wikimedia.org/r/380959 (owner: 10Aklapper)
[12:56:29] (03PS1) 10ArielGlenn: dumps servers: override nginx log rotation file provided by package [puppet] - 10https://gerrit.wikimedia.org/r/380960 (https://phabricator.wikimedia.org/T176810)
[12:58:30] jouncebot: next
[12:58:30] In 0 hour(s) and 1 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170927T1300)
[12:59:12] (03CR) 10Zoranzoki21: "OK. How to request whitelist? I will abandon patch.. Thank you for help!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380913 (https://phabricator.wikimedia.org/T176287) (owner: 10Zoranzoki21)
[12:59:34] (03Abandoned) 10Zoranzoki21: Revert "Add John Michael Kohler Art Center throttle rule" because rule expired. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380913 (https://phabricator.wikimedia.org/T176287) (owner: 10Zoranzoki21)
[13:00:06] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170927T1300).
[13:00:06] Jayprakash12345: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:14] I can SWAT today!
[13:00:27] Oooh, I have a patch to add
[13:00:31] Jayprakash12345: around for swat?
[13:00:34] But I'm not at my laptop right now :p
[13:00:39] addshore: go ahead
[13:00:41] oh
[13:01:10] !log T175242 tilerator and tileratorui need manually restart
[13:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:15] T175242: all log producers need to use the logstash LVS endpoint - https://phabricator.wikimedia.org/T175242
[13:01:20] (03PS5) 10Zfilipin: Add autopatrolled user group to dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380689 (https://phabricator.wikimedia.org/T176709) (owner: 10Jayprakash12345)
[13:01:37] addshore: will you be at your laptop in the next hour?
[13:01:45] I'll be around, or you can deploy yourself
[13:03:35] (03PS2) 10ArielGlenn: dumps servers: override nginx log rotation file provided by package [puppet] - 10https://gerrit.wikimedia.org/r/380960 (https://phabricator.wikimedia.org/T176810)
[13:04:05] zeljkof: at laptop now!
[13:04:06] [=
[13:04:18] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380689 (https://phabricator.wikimedia.org/T176709) (owner: 10Jayprakash12345)
[13:04:20] (03CR) 10ArielGlenn: [C: 032] dumps servers: override nginx log rotation file provided by package [puppet] - 10https://gerrit.wikimedia.org/r/380960 (https://phabricator.wikimedia.org/T176810) (owner: 10ArielGlenn)
[13:04:49] addshore: add the commit to the calendar, want to deploy yourself, or should I?
[13:05:24] zeljkof: added to the calander, I can deploy it:)
[13:05:56] addshore: I'll ping you when I am done with 380689
[13:06:02] ack!
[13:06:12] should be in a minute or two
[13:06:32] I have +2ed it already so I dont have to twiddle my thumbs for jenkins later, it is a CSS only change so nothing to worry about :)
[13:06:58] (03Merged) 10jenkins-bot: Add autopatrolled user group to dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380689 (https://phabricator.wikimedia.org/T176709) (owner: 10Jayprakash12345)
[13:07:18] (03CR) 10jenkins-bot: Add autopatrolled user group to dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380689 (https://phabricator.wikimedia.org/T176709) (owner: 10Jayprakash12345)
[13:09:04] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:380689|Add autopatrolled user group to dty.wikipedia (T176709)]] (duration: 00m 47s)
[13:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:09] T176709: Add autopatrolled flag right on dty.wikipedia - https://phabricator.wikimedia.org/T176709
[13:09:14] addshore: I'm done, go ahead
[13:09:17] ack!
[13:09:27] feel free to close the swat window when you are done
[13:10:31] (03PS1) 10ArielGlenn: fix comment in common nginx logrot file for dumps [puppet] - 10https://gerrit.wikimedia.org/r/380964
[13:11:06] syncing
[13:11:18] (03CR) 10ArielGlenn: [C: 032] fix comment in common nginx logrot file for dumps [puppet] - 10https://gerrit.wikimedia.org/r/380964 (owner: 10ArielGlenn)
[13:11:49] !log addshore@tin Synchronized php-1.31.0-wmf.1/extensions/TwoColConflict/modules/ext.TwoColConflict.BaseVersionSelector.css: SWAT: [[gerrit:380930|Address changes in the label style in OOUI]] (duration: 00m 46s)
[13:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:55] !log SWAT done
[13:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:00] zeljkof: thanks!
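[Editor's note: the cron spam on the dumps hosts (T176810) came from the nginx package and puppet both shipping logrotate entries for the same logs; the fix merged above (gerrit 380960) has puppet override the package-provided /etc/logrotate.d/nginx so only one definition rotates them. A hedged sketch of the general shape of such an override; the paths and directives here are assumptions, not the actual merged file:]

```
# /etc/logrotate.d/nginx -- illustrative puppet-managed override of the
# package file; directives are assumed, not the actual merged config.
/var/log/nginx/*.log {
    daily
    rotate 14
    missingok
    notifempty
    compress
    delaycompress
    sharedscripts
    postrotate
        invoke-rc.d nginx rotate >/dev/null 2>&1
    endscript
}
```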
[13:12:15] addshore: thank you :)
[13:12:59] 10Operations, 10Datasets-General-or-Unknown, 10Patch-For-Review, 10User-ArielGlenn: logrotate issue (cron spam) on dumps hosts - https://phabricator.wikimedia.org/T176810#3639480 (10ArielGlenn) That ought to take care of it.
[13:14:30] (03PS2) 10ArielGlenn: Set a reasonable --batch-size for Wikidata entity dumps [puppet] - 10https://gerrit.wikimedia.org/r/380628 (owner: 10Hoo man)
[13:14:47] (03PS5) 10Volans: cumin (WMCS): allow to setup cumin in a project [puppet] - 10https://gerrit.wikimedia.org/r/380947 (https://phabricator.wikimedia.org/T176314)
[13:14:50] RECOVERY - Hadoop HistoryServer on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer
[13:15:27] (03CR) 10ArielGlenn: [C: 032] Set a reasonable --batch-size for Wikidata entity dumps [puppet] - 10https://gerrit.wikimedia.org/r/380628 (owner: 10Hoo man)
[13:15:52] !log T175242 restbase requires manual restart
[13:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:57] T175242: all log producers need to use the logstash LVS endpoint - https://phabricator.wikimedia.org/T175242
[13:18:29] (03PS1) 10Ema: pybaltest: add IPv6 to HTTP/BGP ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/380967
[13:18:53] (03PS6) 10Volans: cumin (WMCS): allow to setup cumin in a project [puppet] - 10https://gerrit.wikimedia.org/r/380947 (https://phabricator.wikimedia.org/T176314)
[13:20:15] thanks, apergos :)
[13:20:48] yw
[13:22:32] (03CR) 10Alexandros Kosiaris: [C: 031] "Make sure to add first AAAA records otherwise ferm will not be able to resolve() and fail" [puppet] - 10https://gerrit.wikimedia.org/r/380967 (owner: 10Ema)
[13:23:01] (03CR) 10Volans: "Updated after testing in labs. All seems to work fine, I'll write some quick doc on Wikitech." [puppet] - 10https://gerrit.wikimedia.org/r/380947 (https://phabricator.wikimedia.org/T176314) (owner: 10Volans)
[13:23:46] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/380967 (owner: 10Ema)
[13:24:04] (03CR) 10MarkTraceur: [C: 031] Add 3D extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380948 (owner: 10Matthias Mullie)
[13:25:22] !log T175242 parsoid requires manual restart
[13:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:25:30] T175242: all log producers need to use the logstash LVS endpoint - https://phabricator.wikimedia.org/T175242
[13:28:13] akosiaris: have run/enabled puppet everywhere re T175242
[13:28:15] ?
[13:28:22] one host per cluster up to now
[13:28:33] finishing the scb one right now
[13:28:39] and then gonna enable and restart services
[13:28:44] across the clusters
[13:28:44] but you didn't restart anything?
[13:28:49] I'm going to add one more backport to SWAT and deploy it.
[13:28:59] I have restarted stuff on those nodes
[13:29:15] restbase, tilerator, tileratorui up to now
[13:29:30] oh and parsoid
[13:29:35] which rb host akosiaris so I can check?
[13:30:05] restbase1007 and restbase-dev1004
[13:30:15] tcpdump show traffic as expected
[13:30:20] 13:29:14.497168 IP 10.64.0.89.54402 > 10.2.2.36.12201: UDP, length 255
[13:30:25] kk will check logstash
[13:31:33] !log T175242 eventstreams requires manual restart
[13:31:34] ok, logs are present there
[13:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:39] T175242: all log producers need to use the logstash LVS endpoint - https://phabricator.wikimedia.org/T175242
[13:31:46] ok looks like this wraps up the first part
[13:31:51] I am proceding to the second part
[13:31:55] enabling puppet and running it
[13:32:10] kk
[13:32:18] (03PS1) 10Gilles: Thumbor: enable STL engine on eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/380969 (https://phabricator.wikimedia.org/T171339)
[13:33:10] (03CR) 10Gilles: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/380969 (https://phabricator.wikimedia.org/T171339) (owner: 10Gilles)
[13:33:50] (03PS1) 10Herron: Add shell account cicalese with group researchers [puppet] - 10https://gerrit.wikimedia.org/r/380970 (https://phabricator.wikimedia.org/T176749)
[13:34:05] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to pingback data for cicalese - https://phabricator.wikimedia.org/T176749#3639593 (10herron) a:03herron
[13:35:17] (03CR) 10DCausse: Generate daily diffs for categories RDF (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T173774) (owner: 10Smalyshev)
[13:36:40] (03PS4) 10ArielGlenn: Template-ise rsync/public.pp hosts allow [puppet] - 10https://gerrit.wikimedia.org/r/379517 (owner: 10Reedy)
[13:36:45] (03PS2) 10Ema: pybaltest: add IPv6 to HTTP/BGP ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/380967
[13:37:06] (03CR) 10jerkins-bot: [V: 04-1] Template-ise rsync/public.pp hosts allow [puppet] - 10https://gerrit.wikimedia.org/r/379517 (owner: 10Reedy)
[13:37:44] (03PS1) 10Gehel: wdqs: reduce blazegraph heap size to 10GB [puppet] - 10https://gerrit.wikimedia.org/r/380972 (https://phabricator.wikimedia.org/T175919)
[13:39:22] (03CR) 10DCausse: Generate daily diffs for categories RDF (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T173774) (owner: 10Smalyshev)
[13:40:35] (03CR) 10Muehlenhoff: [C: 031] Add shell account cicalese with group researchers [puppet] - 10https://gerrit.wikimedia.org/r/380970 (https://phabricator.wikimedia.org/T176749) (owner: 10Herron)
[13:41:01] (03CR) 10Herron: [C: 032] Add shell account cicalese with group researchers [puppet] - 10https://gerrit.wikimedia.org/r/380970 (https://phabricator.wikimedia.org/T176749) (owner: 10Herron)
[13:41:10] (03PS2) 10Herron: Add shell account cicalese with group researchers [puppet] - 10https://gerrit.wikimedia.org/r/380970 (https://phabricator.wikimedia.org/T176749)
[13:41:20] (03PS5) 10ArielGlenn: Template-ise rsync/public.pp hosts allow [puppet] - 10https://gerrit.wikimedia.org/r/379517 (owner: 10Reedy)
[13:42:07] (03CR) 10Giuseppe Lavagetto: nutcracker: create the service only after the package install (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/380487 (owner: 10Elukey)
[13:42:12] !log raised Hadoop HDFS namenode master daemon max heap size to 6G (prev 4G) on analytics100[12]
[13:42:14] (03PS1) 10Volans: CHANGELOG: add changelogs for release v1.2.1 [software/cumin] - 10https://gerrit.wikimedia.org/r/380973
[13:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:28] (03CR) 10Muehlenhoff: [C: 031] cumin (WMCS): allow to setup cumin in a project [puppet] - 10https://gerrit.wikimedia.org/r/380947 (https://phabricator.wikimedia.org/T176314) (owner: 10Volans)
[13:42:32] !log anomie@tin Synchronized php-1.31.0-wmf.1/includes/specials/SpecialWatchlist.php: SWAT: {{gerrit|380968}} Fix watchlist "in the last X hours" display (duration: 00m 48s)
[13:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:30] PROBLEM - puppet last run on kafka1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:43:31] PROBLEM - puppet last run on kafka1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:43:40] PROBLEM - puppet last run on kafka1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:43:40] PROBLEM - puppet last run on kafka1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:43:50] PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:43:55] !log T175242 re-enable puppet across aqs kafka maps maps-test ores restbase restbase-dev sca scb wtp clusters for merging https://gerrit.wikimedia.org/r/#/c/376500/. Run puppet as well in a batched execution
[13:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:00] T175242: all log producers need to use the logstash LVS endpoint - https://phabricator.wikimedia.org/T175242
[13:44:00] (03CR) 10MarkTraceur: [C: 031] Enable 3D on test & test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380950 (owner: 10Matthias Mullie)
[13:44:05] those ^ are me and are harmless should be fixed soon
[13:45:44] mobrovac: I 've left praseodymium, cerium, xenon with puppet disabled for now. Should I do those too ?
[13:45:59] yup [13:46:04] feel free to do so [13:46:31] RECOVERY - puppet last run on kafka1013 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [13:46:40] RECOVERY - puppet last run on kafka1020 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [13:46:40] RECOVERY - puppet last run on kafka1022 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [13:46:40] RECOVERY - puppet last run on kafka1018 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [13:46:51] RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [13:47:35] (03PS2) 10Elukey: nutcracker: create the service only after the package install [puppet] - 10https://gerrit.wikimedia.org/r/380487 [13:48:21] (03CR) 10Volans: [C: 032] CHANGELOG: add changelogs for release v1.2.1 [software/cumin] - 10https://gerrit.wikimedia.org/r/380973 (owner: 10Volans) [13:48:49] (03PS6) 10ArielGlenn: Template-ise rsync/public.pp hosts allow [puppet] - 10https://gerrit.wikimedia.org/r/379517 (owner: 10Reedy) [13:49:27] (03PS1) 10Ema: Add AAAA records for pybal-test200[123] [dns] - 10https://gerrit.wikimedia.org/r/380976 [13:50:02] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to pingback data for cicalese - https://phabricator.wikimedia.org/T176749#3639664 (10herron) 05Open>03Resolved Great, thanks @Ottomata @tstarling @CCicalese_WMF your shell account `cicalese` is now present on `stat1006`: ``` st... [13:50:06] 10Operations, 10Ops-Access-Requests: Requesting access to pingback data for cicalese - https://phabricator.wikimedia.org/T176749#3639666 (10herron) [13:50:36] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I think we need to have the relationship where mediawiki::web::modules gets included." 
[puppet] - 10https://gerrit.wikimedia.org/r/380472 (owner: 10Elukey) [13:51:10] 10Operations, 10Ops-Access-Requests: Requesting access to pingback data for cicalese - https://phabricator.wikimedia.org/T176749#3639681 (10CCicalese_WMF) Excellent! Thank you all! [13:51:47] (03CR) 10Alexandros Kosiaris: [C: 031] Add AAAA records for pybal-test200[123] [dns] - 10https://gerrit.wikimedia.org/r/380976 (owner: 10Ema) [13:51:57] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v1.2.1 [software/cumin] - 10https://gerrit.wikimedia.org/r/380973 (owner: 10Volans) [13:52:56] 10Operations, 10Discovery, 10WMDE-Analytics-Engineering, 10Wikidata, and 2 others: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875#3639688 (10Addshore) [13:52:59] (03CR) 10Alexandros Kosiaris: [C: 031] "Ah indeed, interface::add_ip6_mapped is also required" [puppet] - 10https://gerrit.wikimedia.org/r/380967 (owner: 10Ema) [13:53:52] (03CR) 10Ema: [C: 032] Add AAAA records for pybal-test200[123] [dns] - 10https://gerrit.wikimedia.org/r/380976 (owner: 10Ema) [13:54:16] (03CR) 10Muehlenhoff: [C: 031] nutcracker: create the service only after the package install [puppet] - 10https://gerrit.wikimedia.org/r/380487 (owner: 10Elukey) [13:55:11] (03CR) 10Muehlenhoff: "Sorry, wrong tab :-)" [puppet] - 10https://gerrit.wikimedia.org/r/380487 (owner: 10Elukey) [13:55:51] (03Abandoned) 10BBlack: libssl11-dev dep for jessie [software/nginx] (stretch-backports-wmf) - 10https://gerrit.wikimedia.org/r/377834 (owner: 10BBlack) [13:55:53] 10Operations, 10Discovery, 10WMDE-Analytics-Engineering, 10Wikidata, and 2 others: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875#3639710 (10Addshore) [13:55:56] (03Abandoned) 10BBlack: Forward-port WMF nginx patches from 1.11.10-1+wmf3 [software/nginx] (stretch-backports-wmf) - 10https://gerrit.wikimedia.org/r/377835 (owner: 10BBlack) [13:56:02] (03Abandoned) 10BBlack: Release 
1.13.3-1+wmf1 [software/nginx] (stretch-backports-wmf) - 10https://gerrit.wikimedia.org/r/377836 (owner: 10BBlack) [13:56:38] (03PS7) 10ArielGlenn: Template-ise rsync/public.pp hosts allow [puppet] - 10https://gerrit.wikimedia.org/r/379517 (owner: 10Reedy) [13:57:25] zeljkof: How's SWAT doing? [13:57:39] marktraceur: all done [13:57:54] Perfect! [13:57:57] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3639712 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1316.eqiad.wmnet', 'mw1317.eqiad.wmnet', 'mw1315.eqiad.wmnet'] ``` and were **ALL** suc... [13:58:46] (03PS1) 10BBlack: libssl11-dev dep for jessie [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/380981 [13:58:48] (03PS1) 10BBlack: Forward-port WMF nginx patches from 1.11.10-1+wmf3 [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/380982 [13:58:50] (03PS1) 10BBlack: Release 1.13.5-1+wmf1 [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/380983 [13:58:58] Then matthiasmullie is about to go forward with our deployment of Extension:3D code, in preparation for pushing to testwikis - we're not sure how long it will all take, but if we don't have any issues we may push a config change as well to enable. [14:00:02] (03CR) 10Giuseppe Lavagetto: Add ruby base image and a fluentd image (036 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/379793 (owner: 10Giuseppe Lavagetto) [14:00:04] marktraceur: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Multimedia deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170927T1400). [14:00:04] No GERRIT patches in the queue for this window AFAICS. 
[14:00:20] (03CR) 10Matthias Mullie: [C: 032] Add 3D extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380948 (owner: 10Matthias Mullie) [14:00:57] gilles: o/ [14:01:28] Filippo doesn't feel super well and he appointed me your personal puppet helper/minion for today [14:01:38] elukey: thanks [14:01:41] Hooray! [14:01:54] (03Merged) 10jenkins-bot: Add 3D extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380948 (owner: 10Matthias Mullie) [14:02:13] (03CR) 10jenkins-bot: Add 3D extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380948 (owner: 10Matthias Mullie) [14:02:22] elukey: we need https://gerrit.wikimedia.org/r/#/c/380969/ merged and thumbor restarted in codfw and eqiad. usually out of caution we restart on codfw first to check that the processes restart correctly with the new config, etc. [14:02:42] (03PS2) 10Elukey: Thumbor: enable STL engine on eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/380969 (https://phabricator.wikimedia.org/T171339) (owner: 10Gilles) [14:03:14] gilles: sure, makes sense, reading the task to familiarize with the topic. [14:03:48] !log T175242 restart tilerator, tileratorui, restbase across the fleet to pick up the change in a rolling restart manner with a batch size of 2 [14:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:52] T175242: all log producers need to use the logstash LVS endpoint - https://phabricator.wikimedia.org/T175242 [14:04:02] ok so 3d2png got deployed yesterday [14:04:05] so we are good [14:04:21] 10Operations, 10Ops-Access-Requests, 10Research: Server access for Miriam Redi - https://phabricator.wikimedia.org/T176682#3639744 (10herron) [14:04:30] yeah I've checked on thumbor1001 and thumbor2001 and it's there at the expected path [14:05:15] gilles: do we need to disable puppet, or does thumbor not restart on config change?
(I am pretty sure the latter but checking) [14:05:40] it doesn't restart automatically on config change, no, you need to do a roll-restart [14:05:45] (03PS1) 10Volans: Upstream release 1.2.1 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/380986 [14:05:50] nice [14:06:07] running pcc just to check the change [14:06:28] then I think we can merge, restart one in codfw and proceed with the rest [14:07:07] and https://wikitech.wikimedia.org/wiki/Service_restarts#Thumbor says that each host needs to be depooled first right? [14:07:09] sounds good [14:07:19] elukey: correct [14:07:50] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/8066/thumbor1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/380969 (https://phabricator.wikimedia.org/T171339) (owner: 10Gilles) [14:08:06] gilles: merging and running puppet [14:09:13] (03PS3) 10Ema: pybaltest: add IPv6 to HTTP/BGP ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/380967 [14:09:19] (03CR) 10Ema: [V: 032 C: 032] pybaltest: add IPv6 to HTTP/BGP ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/380967 (owner: 10Ema) [14:10:21] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [14:10:50] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [14:11:16] single spike, already recovered [14:11:34] elukey: yup, do we know what caused it? [14:12:43] ema: didn't check in depth, but I didn't see a single source of ints [14:12:58] I see that mw1212 isn't happy right now [14:13:42] elukey: ^ [14:14:03] checking it right now [14:14:09] but the alarm fired 1 min ago no? [14:14:13] does the timing match? 
[14:14:50] (03CR) 10Volans: [C: 032] Upstream release 1.2.1 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/380986 (owner: 10Volans) [14:15:20] PROBLEM - puppet last run on pybal-test2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 58 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token] [14:15:46] ignore the pybal-test2002 error for now [14:16:09] ah mw1212 should be the usual hhvm restart [14:16:29] !log T175242 restart parsoid across the fleet to pick up the change in a rolling restart manner with a batch size of 5 [14:16:33] (03PS2) 10BBlack: Forward-port WMF nginx patches from 1.11.10-1+wmf3 [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/380982 [14:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:35] T175242: all log producers need to use the logstash LVS endpoint - https://phabricator.wikimedia.org/T175242 [14:16:35] (03PS2) 10BBlack: Release 1.13.5-1+wmf1 [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/380983 [14:16:37] (03PS1) 10Herron: Add mirrys shell account with groups researchers, analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/380987 (https://phabricator.wikimedia.org/T176682) [14:17:11] (03Merged) 10jenkins-bot: Upstream release 1.2.1 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/380986 (owner: 10Volans) [14:17:37] !log T175242 restart eventstreams across the fleet to pick up the change in a rolling restart manner with a batch size of 2 [14:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:17] (03CR) 10Muehlenhoff: [C: 031] Add mirrys shell account with groups researchers, analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/380987 (https://phabricator.wikimedia.org/T176682) (owner: 10Herron) [14:19:04] gilles: next step should be to depool thumbor2001 and then restart thumbor? (first one me, second one you?) 
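The SAL entries above describe batched rolling restarts (batch size 2 for tilerator/tileratorui/restbase and eventstreams, 5 for parsoid). Cumin's `-b` batch-size flag is what expresses this; the `A:` host aliases and exact restart commands below are assumptions rather than taken from the log, so this sketch only prints the invocations instead of executing them:

```shell
# Dry-run: print (not execute) the cumin invocations implied by the batched
# restarts logged above. The "A:..." aliases and systemctl commands are assumed.
restbase_cmd='sudo cumin -b 2 "A:restbase" "systemctl restart restbase"'
parsoid_cmd='sudo cumin -b 5 "A:parsoid" "systemctl restart parsoid"'
printf '%s\n%s\n' "$restbase_cmd" "$parsoid_cmd"
```

With `-b`, cumin moves on to the next batch only as hosts in the current one complete, which is what keeps a fleet-wide restart "rolling" instead of simultaneous.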
[14:19:30] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:19:33] thumbor2001 doesn't need to be depooled in practice, only eqiad does [14:19:34] (03CR) 10EddieGP: "> How to request whitelist?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380913 (https://phabricator.wikimedia.org/T176287) (owner: 10Zoranzoki21) [14:19:38] codfw isn't serving traffic yet [14:19:50] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:19:53] yes makes sense :) [14:20:15] ok I'm going to restart on thumbor2001, then [14:20:17] (03CR) 10Herron: [C: 032] Add mirrys shell account with groups researchers, analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/380987 (https://phabricator.wikimedia.org/T176682) (owner: 10Herron) [14:20:25] (03PS2) 10Herron: Add mirrys shell account with groups researchers, analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/380987 (https://phabricator.wikimedia.org/T176682) [14:20:26] super [14:21:02] (03PS3) 10BBlack: Forward-port WMF nginx patches from 1.11.10-1+wmf3 [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/380982 [14:21:03] (03PS3) 10BBlack: Release 1.13.5-1+wmf1 [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/380983 [14:21:06] (03PS1) 10BBlack: Revert to debhelper 9 compat [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/380990 [14:21:08] (03CR) 10Alexandros Kosiaris: [C: 04-1] Add ruby base image and a fluentd image (033 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/379793 (owner: 10Giuseppe Lavagetto) [14:22:38] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: name=wtp1001.eqiad.wmnet [14:22:38] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: name=wtp1002.eqiad.wmnet [14:22:38] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: 
name=wtp1003.eqiad.wmnet [14:22:39] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: name=wtp1004.eqiad.wmnet [14:22:39] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: name=wtp1005.eqiad.wmnet [14:22:40] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: name=wtp1006.eqiad.wmnet [14:22:41] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: name=wtp1007.eqiad.wmnet [14:22:42] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: name=wtp1008.eqiad.wmnet [14:22:42] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: name=wtp1009.eqiad.wmnet [14:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:43] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: name=wtp1010.eqiad.wmnet [14:22:44] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: name=wtp1011.eqiad.wmnet [14:22:44] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: name=wtp1012.eqiad.wmnet [14:22:45] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: name=wtp1013.eqiad.wmnet [14:22:46] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: name=wtp1014.eqiad.wmnet [14:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:47] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: name=wtp1015.eqiad.wmnet [14:22:48] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: name=wtp1016.eqiad.wmnet [14:22:49] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: name=wtp1017.eqiad.wmnet [14:22:49] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: name=wtp1018.eqiad.wmnet [14:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:50] !log akosiaris@puppetmaster1001 conftool action : 
set/weight=5; selector: name=wtp1019.eqiad.wmnet [14:22:51] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: name=wtp1020.eqiad.wmnet [14:22:52] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: name=wtp1021.eqiad.wmnet [14:22:52] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: name=wtp1022.eqiad.wmnet [14:22:53] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: name=wtp1023.eqiad.wmnet [14:22:53] !log mlitn@tin Started scap: Enable 3D extension [14:22:53] !log akosiaris@puppetmaster1001 conftool action : set/weight=5; selector: name=wtp1024.eqiad.wmnet [14:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:56] elukey: works fine, you can roll-restart the rest [14:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:08] wow, spam \o/ [14:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:20] gilles: just to be sure, you restarted thumbor-instances right ? 
[14:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:23] Bots, bots, bots bots bots bots, bots, bots bots bots bots [14:23:24] !log change wtp1001 to wtp1024 weights to 5 from 15 in preparation for deprecation [14:23:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:02] elukey: thumbor@*.service but -instances probably works since that's what filippo's instructions say [14:24:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:35] RECOVERY - puppet last run on pybal-test2002 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [14:24:41] elukey: https://wikitech.wikimedia.org/wiki/Thumbor#Operations [14:24:56] (03CR) 10ArielGlenn: "Reedy: This is what I was thinking of, although I need to filter out out internal hosts from the list passed from the profile." 
[puppet] - 10https://gerrit.wikimedia.org/r/379517 (owner: 10Reedy) [14:25:31] ahhh didn't know it, I'll add the link in the Service_restarts page [14:25:48] ok doing codfw now [14:26:17] akosiaris: conftool supports regexes :) [14:27:11] bblack: yeah I know... it's just the for loop is faster. Seems people don't like it so I 'll try to avoid it next time [14:27:18] gilles: codfw done [14:28:08] !log uploaded cumin_1.2.1-1_amd64.deb to apt.wikimedia.org jessie-wikimedia [14:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:34] matthiasmullie: marktraceur: is there an STL available somewhere on a production wiki? [14:28:42] Not yet, no [14:29:34] elukey: you can go ahead with eqiad [14:30:22] gilles: just depooled thumbor1001, restarting it [14:30:29] I was trying to check if the host was drained [14:30:49] nginx will retry requests if they die [14:31:01] so you don't have to wait for it to be completely idle [14:31:02] toomanyports to check, just tailing nginx :P [14:31:14] (03PS1) 10Gehel: elasticsearch: use the lgostash LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/380991 (https://phabricator.wikimedia.org/T175242) [14:31:36] actually since nginx is per-server maybe you do [14:31:47] the cluster load balancing probably doesn't retry [14:32:00] (03CR) 10Andrew Bogott: [C: 04-1] "Looks good; one very minor comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/380947 (https://phabricator.wikimedia.org/T176314) (owner: 10Volans) [14:32:24] gilles: 1001 is ready to repool, do you want to check first? 
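The wtp1001-wtp1024 weight change above went out as 24 individual conftool calls, which is what prompted the "conftool supports regexes" aside. A dry-run sketch printing both forms; the confctl regex-selector syntax here is an assumption modeled on the logged output, so nothing is executed:

```shell
# Dry-run: print the 24 per-host confctl commands the log shows, then the
# single regex-selector equivalent (selector syntax is an assumption).
for i in $(seq -w 1 24); do
  printf 'confctl select "name=wtp10%s.eqiad.wmnet" set/weight=5\n' "$i"
done
regex_cmd='confctl select "name=wtp10(0[1-9]|1[0-9]|2[0-4]).eqiad.wmnet" set/weight=5'
printf '%s\n' "$regex_cmd"
```

The loop is faster to type, as noted in the chat, but the single selector produces one SAL entry instead of 24, which avoids the bot spam seen above.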
[14:32:30] sure [14:33:06] elukey: works fine [14:33:13] super [14:33:17] repooling [14:33:30] (03CR) 10Volans: cumin (WMCS): allow to setup cumin in a project (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/380947 (https://phabricator.wikimedia.org/T176314) (owner: 10Volans) [14:33:48] (03CR) 10Alexandros Kosiaris: [C: 04-1] "minor type, rest LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/380991 (https://phabricator.wikimedia.org/T175242) (owner: 10Gehel) [14:34:09] (03PS7) 10Volans: cumin (WMCS): allow to setup cumin in a project [puppet] - 10https://gerrit.wikimedia.org/r/380947 (https://phabricator.wikimedia.org/T176314) [14:34:45] (03CR) 10Gehel: elasticsearch: use the lgostash LVS endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/380991 (https://phabricator.wikimedia.org/T175242) (owner: 10Gehel) [14:34:55] (03PS2) 10Gehel: elasticsearch: use the logstash LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/380991 (https://phabricator.wikimedia.org/T175242) [14:35:05] RECOVERY - MariaDB Slave Lag: s6 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 65413.05 seconds [14:35:12] gilles: all good from nginx perspective, proceeding with 100[234] if you are ok [14:35:33] 10Operations, 10Ops-Access-Requests, 10Research, 10Patch-For-Review: Server access for Miriam Redi - https://phabricator.wikimedia.org/T176682#3639856 (10herron) 05Open>03Resolved It looks like `mirrys` was @Miriam's existing ldap uid. Since the ldap and shell usernames should match I've updated the t... 
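The eqiad thumbor restart just completed follows a fixed per-host cycle: depool, restart the instances, sanity-check, repool. A dry-run sketch that prints the steps for the pooled eqiad hosts; the wrapper names follow https://wikitech.wikimedia.org/wiki/Thumbor#Operations but should be treated as assumptions here:

```shell
# Dry-run of the per-host cycle used for the pooled eqiad thumbor hosts.
# Steps are printed, not executed; the verify step is a placeholder.
steps=''
for host in thumbor1001 thumbor1002 thumbor1003 thumbor1004; do
  steps="${steps}${host}: depool; systemctl restart thumbor-instances; check nginx logs; pool
"
done
printf '%s' "$steps"
```

As discussed above, codfw hosts were restarted without depooling since codfw was not serving traffic, so the depool/pool steps only matter in eqiad.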
[14:35:34] 10Operations, 10Ops-Access-Requests, 10Research: Server access for Miriam Redi - https://phabricator.wikimedia.org/T176682#3639858 (10herron) [14:35:38] elukey: yep, you can go ahead [14:35:52] (03PS1) 10Gehel: aqs: switch to LVS endpoint for logstash [puppet] - 10https://gerrit.wikimedia.org/r/380992 (https://phabricator.wikimedia.org/T175242) [14:38:05] (03PS1) 10Gehel: striker: switch to LVS endpoint for logstash [puppet] - 10https://gerrit.wikimedia.org/r/380993 (https://phabricator.wikimedia.org/T175242) [14:39:01] gilles: done! [14:39:35] elukey: thanks! [14:39:59] matthiasmullie: marktraceur: let me know when you have an STL file up. the thumbnails should "just work" now [14:40:20] gilles: We still don't know if we're going to enable yet, but...we'll let you know [14:40:28] (03PS1) 10Gehel: mediawiki: switch to LVS endpoint for logstash [puppet] - 10https://gerrit.wikimedia.org/r/380994 (https://phabricator.wikimedia.org/T175242) [14:40:37] There doesn't appear to be a deploy window scheduled after us, so we may just go over. [14:41:01] Hi marostegui :), Do you have time for a global rename? 
(Will file a bugticket as well if needed) [14:41:18] enabling shouldn't even take long, just a few individual files to sync [14:41:29] (03CR) 10Herron: [C: 032] admins: partially re-enable shell access for cwdent [puppet] - 10https://gerrit.wikimedia.org/r/380565 (https://phabricator.wikimedia.org/T176529) (owner: 10Dzahn) [14:41:35] I'm hopeful [14:41:39] (03PS3) 10Herron: admins: partially re-enable shell access for cwdent [puppet] - 10https://gerrit.wikimedia.org/r/380565 (https://phabricator.wikimedia.org/T176529) (owner: 10Dzahn) [14:41:43] (03PS1) 10Gehel: ocg: switch to LVS endpoint for logstash [puppet] - 10https://gerrit.wikimedia.org/r/380995 (https://phabricator.wikimedia.org/T175242) [14:43:41] !log cp1008/pinkunicorn upgraded to test build of nginx-1.13.5-1+wmf1 [14:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:26] (03PS1) 10Giuseppe Lavagetto: Adapt product of dh-make-golang to WMF [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/380996 [14:45:01] !log rolling restart of all the Yarn nodemanager daemons on analytics1028-1068 [14:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:34] matthiasmullie: Is scap still going? I'm kind of amazed it would take this long [14:47:46] yeah, still going [14:48:04] * elukey coffee [14:48:04] 10Operations, 10ops-codfw, 10DBA: db2044 HW RAID failure - https://phabricator.wikimedia.org/T174764#3639885 (10Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Your reques... 
it also rebuilds l10n caches and stuff [14:48:37] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to production bastions for cwdent - https://phabricator.wikimedia.org/T176529#3639886 (10herron) 05Open>03Resolved Shell account `cwdent` has been re-enabled and will be fully deployed over the next half-hour via puppet. After... [14:48:41] 10Operations, 10Ops-Access-Requests: Requesting access to production bastions for cwdent - https://phabricator.wikimedia.org/T176529#3639888 (10herron) [14:48:49] but it's currently syncing code [14:48:59] probably shouldn't take too much longer [14:49:00] K. [14:50:32] (03PS1) 10Giuseppe Lavagetto: Adapt product of dh-make-golang to WMF [debs/golang-github-howeyc-fsnotify] - 10https://gerrit.wikimedia.org/r/380997 [14:52:25] (03PS8) 10Rush: cumin (WMCS): allow to setup cumin in a project [puppet] - 10https://gerrit.wikimedia.org/r/380947 (https://phabricator.wikimedia.org/T176314) (owner: 10Volans) [14:52:46] (03CR) 10Rush: [C: 031] "I haven't tested it but this is a great" [puppet] - 10https://gerrit.wikimedia.org/r/380947 (https://phabricator.wikimedia.org/T176314) (owner: 10Volans) [14:53:56] (03PS2) 10Giuseppe Lavagetto: Adapt product of dh-make-golang to WMF [debs/golang-github-howeyc-fsnotify] - 10https://gerrit.wikimedia.org/r/380997 [14:54:15] PROBLEM - puppet last run on elastic1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:55:34] !next [14:55:39] Err [14:55:51] jouncebot: next [14:55:51] In 3 hour(s) and 4 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170927T1800) [14:55:58] (03PS3) 10Giuseppe Lavagetto: Adapt product of dh-make-golang to WMF [debs/golang-github-howeyc-fsnotify] - 10https://gerrit.wikimedia.org/r/380997 [14:57:13] Multimedia deploy window is about to end - does anyone have a problem if we keep going for ~15 min to scap a few config changes?
[14:58:01] (03PS2) 10Giuseppe Lavagetto: Adapt product of dh-make-golang to WMF [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/380996 [14:58:02] !log lvs1007: upgrade pybal to 1.14.0 [14:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:37] (03CR) 10Volans: [C: 032] cumin (WMCS): allow to setup cumin in a project [puppet] - 10https://gerrit.wikimedia.org/r/380947 (https://phabricator.wikimedia.org/T176314) (owner: 10Volans) [14:58:51] <_joe_> brb [15:00:02] !log mlitn@tin Finished scap: Enable 3D extension (duration: 37m 09s) [15:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:15] RECOVERY - puppet last run on elastic1023 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [15:00:56] PROBLEM - PyBal backends health check on lvs1007 is CRITICAL: PYBAL CRITICAL - OK - All pools are healthy [15:01:02] haha lol@pybal [15:02:26] :P [15:02:36] RIPyBal [15:03:57] (03PS3) 10Matthias Mullie: Enable 3D on test & test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380950 [15:04:32] i am looking forward to the day these alerts are based on prometheus metrics [15:04:40] (03CR) 10Andrew Bogott: [C: 032] Remove role::salt::masters::labs from labcontrol* hosts [puppet] - 10https://gerrit.wikimedia.org/r/379770 (owner: 10Muehlenhoff) [15:04:46] (03PS2) 10Andrew Bogott: Remove role::salt::masters::labs from labcontrol* hosts [puppet] - 10https://gerrit.wikimedia.org/r/379770 (owner: 10Muehlenhoff) [15:06:25] (03CR) 10Matthias Mullie: [C: 032] Enable 3D on test & test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380950 (owner: 10Matthias Mullie) [15:07:56] (03Merged) 10jenkins-bot: Enable 3D on test & test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380950 (owner: 10Matthias Mullie) [15:08:04] It's happening!!! 
[15:08:06] (03CR) 10jenkins-bot: Enable 3D on test & test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/380950 (owner: 10Matthias Mullie) [15:08:59] https://zippy.gfycat.com/MeekPeskyCockroach.gif [15:09:37] !log mlitn@tin Synchronized wmf-config/InitialiseSettings.php: Config 3D to be loaded on test, test2 (duration: 00m 48s) [15:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:45] (03PS1) 10Ema: pybal: update check_pybal for pybal 1.14.0 output format [puppet] - 10https://gerrit.wikimedia.org/r/381003 [15:09:46] gilles: The glasses are super appropriate! [15:10:20] there's an idea, an anaglyph mode for the viewer [15:10:32] !log mlitn@tin Synchronized wmf-config/CommonSettings.php: Load 3D extension (duration: 00m 49s) [15:10:32] #later [15:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:20] threejs can do it https://threejs.org/examples/webgl_effects_anaglyph.html [15:11:27] !log mlitn@tin Synchronized wmf-config/InitialiseSettings-labs.php: Remove 3D beta config (duration: 00m 48s) [15:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:34] https://test.wikipedia.org/wiki/File:Programmatically_created_crystal.stl [15:11:38] YESSSSSSS [15:12:10] damn you're fast - not even done scapping everything :D [15:12:17] (03PS2) 10Volans: Stop defining salt grain per labs project [puppet] - 10https://gerrit.wikimedia.org/r/380476 (owner: 10Muehlenhoff) [15:12:18] !log mlitn@tin Synchronized wmf-config/CommonSettings-labs.php: Remove 3D beta config (duration: 00m 47s) [15:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:28] Not sure the MMV integration is working [15:12:33] Oh, yes it is [15:12:34] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/380476 (owner: 10Muehlenhoff) [15:12:35] WFM! [15:12:45] Hooray! 
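The four "Synchronized wmf-config/..." entries above follow the usual one-file-at-a-time scap pattern for config rollouts. A dry-run sketch printing the invocations; the `scap sync-file` subcommand and the log messages are assumptions paraphrasing the logged lines:

```shell
# Dry-run: print scap sync-file invocations matching the "Synchronized
# wmf-config/..." log lines above (subcommand and messages are assumed).
files='InitialiseSettings.php CommonSettings.php InitialiseSettings-labs.php CommonSettings-labs.php'
for f in $files; do
  printf 'scap sync-file wmf-config/%s "Enable 3D extension"\n' "$f"
done
```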
[15:12:46] all done [15:12:55] Off to write a bunch of excited update emails [15:13:23] 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, and 2 others: Create wikibase/wikiba.se-deploy repo - https://phabricator.wikimedia.org/T176841#3639955 (10Dzahn) New Gerrit repos (projects) might have to be requested on wiki instead, afaict. https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests [15:13:26] for further rollout to more wikis, no extra steps for thumbor, it will just work [15:13:55] (03CR) 10Andrew Bogott: [C: 032] Stop defining salt grain per labs project [puppet] - 10https://gerrit.wikimedia.org/r/380476 (owner: 10Muehlenhoff) [15:14:18] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/380477 (owner: 10Muehlenhoff) [15:14:45] 10Operations, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Next): Gerrit is failing to start gerrit-ssh on gerrit2001 - https://phabricator.wikimedia.org/T176532#3639956 (10Dzahn) No, i did not create a hole. I just think these are 2 unrelated issues. gerrit service doesnt start because of the... [15:14:56] awesome, thanks gilles! [15:15:33] !log revoking salt certs for WMCS [15:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:00] (03PS2) 10Muehlenhoff: Remove support for setting custom Salt grains [puppet] - 10https://gerrit.wikimedia.org/r/380477 [15:20:30] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request access to logstash (nda group) for @framawiki - https://phabricator.wikimedia.org/T176364#3639978 (10Dzahn) Thanks @Qgil! It seems all other criteria have been met so we can now move on to the last step "Get sign off by a C-level staff of the... 
[15:21:12] (03CR) 10Muehlenhoff: [C: 032] Remove support for setting custom Salt grains [puppet] - 10https://gerrit.wikimedia.org/r/380477 (owner: 10Muehlenhoff) [15:22:26] (03CR) 10Thcipriani: [C: 04-1] "Like the idea, want to ensure this doesn't merge until scap supports this for MediaWiki deploys." [puppet] - 10https://gerrit.wikimedia.org/r/380503 (https://phabricator.wikimedia.org/T172333) (owner: 10Alexandros Kosiaris) [15:22:47] (03PS3) 10Giuseppe Lavagetto: Adapt product of dh-make-golang to WMF [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/380996 [15:24:37] !log rebooting cp402[2356] (bnx2x oops) [15:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:48] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Adapt product of dh-make-golang to WMF [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/380996 (owner: 10Giuseppe Lavagetto) [15:27:00] 10Operations, 10ops-ulsfo, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3639998 (10BBlack) @RobH any updates here on diags? [15:27:01] (03PS1) 10Muehlenhoff: Remove role::salt::masters::labs [puppet] - 10https://gerrit.wikimedia.org/r/381006 [15:27:07] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Adapt product of dh-make-golang to WMF [debs/golang-github-howeyc-fsnotify] - 10https://gerrit.wikimedia.org/r/380997 (owner: 10Giuseppe Lavagetto) [15:27:45] 10Operations, 10ops-ulsfo, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3640003 (10RobH) I stupidly forgot this machine with the rest of the repairs I was doing! I'll go back onsite to work on this soon! 
[15:27:51] (03PS1) 10Dzahn: puppetmaster: drop salt support from wmf-reimage [puppet] - 10https://gerrit.wikimedia.org/r/381008 [15:28:08] <_joe_> !log uploaded prometheus-statsd-exporter to stretch-wikimedia T175539 [15:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:14] T175539: Build containers for statsd, prometheus-statsd-exporter - https://phabricator.wikimedia.org/T175539 [15:29:05] (03CR) 10Ema: [C: 032] pybal: update check_pybal for pybal 1.14.0 output format [puppet] - 10https://gerrit.wikimedia.org/r/381003 (owner: 10Ema) [15:29:12] (03PS2) 10Ema: pybal: update check_pybal for pybal 1.14.0 output format [puppet] - 10https://gerrit.wikimedia.org/r/381003 [15:29:14] (03CR) 10Ema: [V: 032 C: 032] pybal: update check_pybal for pybal 1.14.0 output format [puppet] - 10https://gerrit.wikimedia.org/r/381003 (owner: 10Ema) [15:31:06] RECOVERY - PyBal backends health check on lvs1007 is OK: PYBAL OK - All pools are healthy [15:31:25] good boy [15:31:27] (03PS4) 10BBlack: Global: Turn off ethernet flow for all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/379799 [15:31:53] (03PS1) 10Dzahn: openstack: rm monitor_labs_salt_keys.py [puppet] - 10https://gerrit.wikimedia.org/r/381009 [15:33:22] (03PS2) 10Dzahn: puppetmaster: drop salt support from wmf-reimage [puppet] - 10https://gerrit.wikimedia.org/r/381008 [15:34:26] 10Operations, 10ops-codfw, 10DBA: db2044 HW RAID failure - https://phabricator.wikimedia.org/T174764#3640031 (10Marostegui) Thank you! [15:36:37] 10Operations, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Next): Gerrit is failing to start gerrit-ssh on gerrit2001 - https://phabricator.wikimedia.org/T176532#3640045 (10Paladox) @Dzahn Ssh should start regardless of weather we specify that option or not since it is there just for consitenc... 
[15:36:49] (03CR) 10Andrew Bogott: [C: 032] openstack: rm monitor_labs_salt_keys.py [puppet] - 10https://gerrit.wikimedia.org/r/381009 (owner: 10Dzahn) [15:37:11] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: drop salt support from wmf-reimage [puppet] - 10https://gerrit.wikimedia.org/r/381008 (owner: 10Dzahn) [15:37:43] (03Abandoned) 10Andrew Bogott: nova monitoring instaces and salt keys: add new options [puppet] - 10https://gerrit.wikimedia.org/r/211432 (owner: 10ArielGlenn) [15:38:24] (03CR) 10Dzahn: [C: 031] "good to go now. But there will have to be a second change that adds him to the right groups." [puppet] - 10https://gerrit.wikimedia.org/r/379851 (https://phabricator.wikimedia.org/T176518) (owner: 10Zoranzoki21) [15:39:46] (03CR) 10BBlack: [C: 04-1] "Testing this on some live bnx2x nodes, the runtime change via ethtool seems to flap the ethernet link status. With 100ms ping intervals r" [puppet] - 10https://gerrit.wikimedia.org/r/379799 (owner: 10BBlack) [15:45:34] 10Operations, 10Ops-Access-Requests: Requesting access to pingback data for cicalese - https://phabricator.wikimedia.org/T176749#3640060 (10CCicalese_WMF) I am attempting to login to stat1006, but it is prompting me for a password. I believe I have my .ssh/config set up correctly per advice at https://wikitech... 
[15:45:45] (03PS1) 10Marostegui: db-eqiad.php: Pool db1103 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381015 [15:46:24] (03PS5) 10Gehel: base - switch to logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373515 [15:48:56] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Pool db1103 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381015 (owner: 10Marostegui) [15:48:59] !log lvs4004: upgrade pybal to 1.14.0 [15:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:41] (03CR) 10jerkins-bot: [V: 04-1] base - switch to logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373515 (owner: 10Gehel) [15:51:54] !log lvs4002: upgrade pybal to 1.14.0 [15:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:22] (03Merged) 10jenkins-bot: db-eqiad.php: Pool db1103 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381015 (owner: 10Marostegui) [15:54:04] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Pool db1103 with low weight - T172679 (duration: 00m 47s) [15:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:10] T172679: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679 [15:54:56] mmmh I'm seeing many -1 from CI on gerrit due to aborted after 3 minutes [15:56:24] backlog is small so maybe teh tests themselves are having issue https://graphite.wikimedia.org/render/?areaMode=stacked&height=400&width=800&target=alias(color(zuul.geard.queue.running,%27blue%27),%27Running%27)&target=alias(color(zuul.geard.queue.waiting,%27red%27),%27Waiting%27)&title=Gearman%20job%20queue%20&from=-8h [15:56:32] (03PS5) 10BBlack: Global: Turn off ethernet flow for all interfaces at boot time [puppet] - 10https://gerrit.wikimedia.org/r/379799 [15:56:34] (03PS5) 10BBlack: LVS: Disable LRO [puppet] - 10https://gerrit.wikimedia.org/r/379800 [15:56:36] (03PS7) 10BBlack: 
Caches: Disable LRO [puppet] - 10https://gerrit.wikimedia.org/r/379801 [15:56:38] (03PS1) 10BBlack: Global: runtime disable ethernet flow on fresh install [puppet] - 10https://gerrit.wikimedia.org/r/381017 [15:56:49] yeah, I'm looking at latest commits [15:58:04] (03PS2) 10BBlack: Global: runtime disable ethernet flow on fresh install [puppet] - 10https://gerrit.wikimedia.org/r/381017 [15:58:06] (03PS6) 10BBlack: LVS: Disable LRO [puppet] - 10https://gerrit.wikimedia.org/r/379800 [15:58:08] (03PS8) 10BBlack: Caches: Disable LRO [puppet] - 10https://gerrit.wikimedia.org/r/379801 [15:58:25] !log ema@neodymium conftool action : set/pooled=yes; selector: name=cp4022.ulsfo.wmnet [15:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:41] !log ema@neodymium conftool action : set/pooled=yes; selector: name=cp4023.ulsfo.wmnet [15:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:29] !log ema@neodymium conftool action : set/pooled=yes; selector: name=cp4025.ulsfo.wmnet [16:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:48] !log ema@neodymium conftool action : set/pooled=yes; selector: name=cp4026.ulsfo.wmnet [16:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:35] (03CR) 10jenkins-bot: db-eqiad.php: Pool db1103 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381015 (owner: 10Marostegui) [16:03:12] PROBLEM - Apache HTTP on mw1194 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time [16:04:12] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.141 second response time [16:06:27] (03CR) 10Volans: [C: 031] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/373515 (owner: 10Gehel) [16:06:52] volans: yeah, I'm still trying to understand that one... 
[16:07:17] something odd, some ruby command takes 1m to run, and at 3m we cutoff apparently for timeout gehel [16:07:32] now it succeeded [16:07:36] so seems transient [16:07:48] interesting... [16:08:29] ok, puppet compiler on a few nodes seems happy, jenkins is happy (at last) I'll merge [16:08:32] (03CR) 10Gehel: [C: 032] base - switch to logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/373515 (owner: 10Gehel) [16:10:03] 10Operations, 10ops-codfw: rack/setup/install furud.codfw.wmnet - https://phabricator.wikimedia.org/T176506#3640131 (10RobH) [16:11:12] PROBLEM - puppet last run on oresrdb1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/puppet] [16:11:23] PROBLEM - puppet last run on labnet1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/puppet] [16:11:25] damn, that's me, reverting... [16:11:42] PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/puppet] [16:11:42] PROBLEM - puppet last run on mw1299 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/puppet] [16:11:42] PROBLEM - puppet last run on kubestagetcd1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/puppet] [16:11:52] PROBLEM - puppet last run on labstore1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/puppet] [16:12:12] PROBLEM - puppet last run on cp2020 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/logrotate.d/puppet] [16:12:12] PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/puppet] [16:12:13] PROBLEM - puppet last run on mw2144 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/puppet] [16:12:22] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/puppet] [16:12:22] PROBLEM - puppet last run on prometheus1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/puppet] [16:12:42] (03PS1) 10Gehel: Revert "base - switch to logrotate::rule" [puppet] - 10https://gerrit.wikimedia.org/r/381021 [16:12:52] (03CR) 10Gehel: [V: 032 C: 032] Revert "base - switch to logrotate::rule" [puppet] - 10https://gerrit.wikimedia.org/r/381021 (owner: 10Gehel) [16:13:42] PROBLEM - puppet last run on mw1296 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/puppet] [16:13:52] (03PS1) 10Andrew Bogott: bootstrapvz: update for newer sudo-ldap packages on jessie [puppet] - 10https://gerrit.wikimedia.org/r/381022 [16:14:02] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/puppet] [16:14:32] (03CR) 10Andrew Bogott: [C: 032] bootstrapvz: update for newer sudo-ldap packages on jessie [puppet] - 10https://gerrit.wikimedia.org/r/381022 (owner: 10Andrew Bogott) [16:15:20] Oh, that might actually have been a transient error (looking at the logs...) 
[16:21:51] (03PS1) 10Smalyshev: Make using CirrusSearch engine default for wbsearchentities on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381025 (https://phabricator.wikimedia.org/T175741) [16:23:52] PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 14 seconds ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/puppet] [16:27:25] !log setting local-as to selected transit BGP sessions - T167840 [16:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:30] T167840: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840 [16:28:02] (03PS1) 10RobH: furud.codfw.wmnet production dns [dns] - 10https://gerrit.wikimedia.org/r/381026 (https://phabricator.wikimedia.org/T176506) [16:28:31] (03CR) 10Paladox: [C: 031] "> the comments on https://gerrit.wikimedia.org/r/#/c/356516/ also" [puppet] - 10https://gerrit.wikimedia.org/r/378768 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [16:30:03] PROBLEM - BGP status on cr2-knams is CRITICAL: BGP CRITICAL - AS6908/IPv4: Active [16:30:18] that's me ^ [16:30:23] (03CR) 10RobH: [C: 032] furud.codfw.wmnet production dns [dns] - 10https://gerrit.wikimedia.org/r/381026 (https://phabricator.wikimedia.org/T176506) (owner: 10RobH) [16:31:35] (03PS4) 10Ema: pybal: BGP MED configuration [puppet] - 10https://gerrit.wikimedia.org/r/380516 (https://phabricator.wikimedia.org/T165584) [16:32:03] RECOVERY - BGP status on cr2-knams is OK: BGP OK - up: 11, down: 0, shutdown: 0 [16:34:03] 10Operations, 10Ops-Access-Requests: Requesting access to pingback data for cicalese - https://phabricator.wikimedia.org/T176749#3640232 (10CCicalese_WMF) Fixed! Thank you @herron! [16:34:32] damn, puppet:/// links are not in the catalog, so not atomic at all (not that catalog compilation is atomic either, but at least the timeframe for problems is shorter). 
Another good reason not to use puppet:/// [16:36:57] (03PS1) 10Gehel: base: switch to logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/381030 [16:36:59] (03PS1) 10Gehel: base: switch to logrotate::rule (cleanup) [puppet] - 10https://gerrit.wikimedia.org/r/381031 [16:38:55] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install furud.codfw.wmnet - https://phabricator.wikimedia.org/T176506#3640236 (10RobH) [16:38:56] (03CR) 10Krinkle: [C: 031] Make values stackable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [16:39:03] RECOVERY - puppet last run on mw1255 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [16:39:42] RECOVERY - puppet last run on cp2020 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [16:39:42] RECOVERY - puppet last run on oresrdb1002 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [16:39:42] RECOVERY - puppet last run on mw1252 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [16:39:42] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [16:39:43] RECOVERY - puppet last run on mw2144 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [16:39:43] RECOVERY - puppet last run on labnet1003 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [16:40:03] RECOVERY - puppet last run on kubestagetcd1003 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [16:40:03] RECOVERY - puppet last run on mw1299 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [16:40:12] RECOVERY - puppet last run on labstore1006 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [16:40:39] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/381030 (owner: 10Gehel) [16:40:42] RECOVERY - puppet last run on prometheus1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:40:48] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/381031 (owner: 10Gehel) [16:42:12] RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [16:42:23] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [16:44:15] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install furud.codfw.wmnet - https://phabricator.wikimedia.org/T176506#3640271 (10RobH) Ok, since this is setting up the disk shelves, just noting in my audit of the system during setup, it does indeed see all 24 disks in the two md1400 disk shelves.... [16:47:09] (03PS1) 10RobH: furud install params [puppet] - 10https://gerrit.wikimedia.org/r/381032 (https://phabricator.wikimedia.org/T176506) [16:47:33] (03CR) 10RobH: [C: 032] furud install params [puppet] - 10https://gerrit.wikimedia.org/r/381032 (https://phabricator.wikimedia.org/T176506) (owner: 10RobH) [16:50:47] (03CR) 10Zhuyifei1999: Microtask for Outreachy(Round15) that describes the understanding of the webservice commands. 
webservice --backend kubernetes start webservi (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/380568 (owner: 10Sowjanyavemuri) [16:52:05] (03PS1) 10Chad: Group1 to 1.31.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381034 [16:52:51] (03CR) 10Chad: [C: 04-2] "Not yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381034 (owner: 10Chad) [16:52:59] (03PS19) 10Paladox: Gerrit: Use systemd::service for systemd [puppet] - 10https://gerrit.wikimedia.org/r/378768 (https://phabricator.wikimedia.org/T157414) [16:53:17] (03Abandoned) 10BBlack: Revert to debhelper 9 compat [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/380990 (owner: 10BBlack) [16:53:23] (03Abandoned) 10BBlack: libssl11-dev dep for jessie [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/380981 (owner: 10BBlack) [16:53:28] (03Abandoned) 10BBlack: Forward-port WMF nginx patches from 1.11.10-1+wmf3 [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/380982 (owner: 10BBlack) [16:53:31] (03Abandoned) 10BBlack: Release 1.13.5-1+wmf1 [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/380983 (owner: 10BBlack) [16:55:49] (03PS1) 10BBlack: Forward-port WMF nginx patches from 1.11.10-1+wmf3 [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/381035 [16:55:50] (03PS1) 10BBlack: Release 1.13.5-1+wmf1 for stretch [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/381036 [17:05:08] (03CR) 10BBlack: [C: 032] Forward-port WMF nginx patches from 1.11.10-1+wmf3 [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/381035 (owner: 10BBlack) [17:05:11] (03CR) 10BBlack: [C: 032] Release 1.13.5-1+wmf1 for stretch [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/381036 (owner: 10BBlack) [17:09:32] PROBLEM - BGP status on cr2-knams is CRITICAL: BGP CRITICAL - AS6908/IPv6: Active [17:10:00] (03PS1) 10BBlack: Revert to debhelper 9 compat [software/nginx] (wmf-1.13-jessie) - 
10https://gerrit.wikimedia.org/r/381040 [17:10:02] (03PS1) 10BBlack: libssl11-dev dep for jessie [software/nginx] (wmf-1.13-jessie) - 10https://gerrit.wikimedia.org/r/381041 [17:10:04] (03PS1) 10BBlack: Create nginx-{full,light,extras}-dbg by hand. [software/nginx] (wmf-1.13-jessie) - 10https://gerrit.wikimedia.org/r/381042 [17:10:07] (03PS1) 10BBlack: Release 1.13.5-1+wmf1~jessie1 [software/nginx] (wmf-1.13-jessie) - 10https://gerrit.wikimedia.org/r/381043 [17:12:33] RECOVERY - BGP status on cr2-knams is OK: BGP OK - up: 11, down: 0, shutdown: 0 [17:17:06] (03PS1) 10Gilles: Thumbor: apply expensive type throttle to STL [puppet] - 10https://gerrit.wikimedia.org/r/381045 (https://phabricator.wikimedia.org/T166699) [17:17:21] (03PS2) 10Gilles: Thumbor: apply expensive type throttle to STL [puppet] - 10https://gerrit.wikimedia.org/r/381045 (https://phabricator.wikimedia.org/T166699) [17:20:14] 10Operations, 10ops-eqiad: rack/setup/install flerovium.eqiad.wmnet - https://phabricator.wikimedia.org/T176505#3640383 (10RobH) [17:20:38] 10Operations, 10ops-codfw: rack/setup/install furud.codfw.wmnet - https://phabricator.wikimedia.org/T176506#3627866 (10RobH) [17:23:58] 10Operations, 10ops-codfw: rack/setup/install furud.codfw.wmnet - https://phabricator.wikimedia.org/T176506#3640392 (10RobH) [17:29:02] !log uploaded nginx-1.13.5-1+wmf1 to stretch-wikimedia, and matching nginx-1.13.5-1+wmf1~jessie1 to jessie-wikimedia [17:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:26] (03CR) 10BBlack: [C: 032] Revert to debhelper 9 compat [software/nginx] (wmf-1.13-jessie) - 10https://gerrit.wikimedia.org/r/381040 (owner: 10BBlack) [17:29:30] (03CR) 10BBlack: [C: 032] libssl11-dev dep for jessie [software/nginx] (wmf-1.13-jessie) - 10https://gerrit.wikimedia.org/r/381041 (owner: 10BBlack) [17:29:32] (03CR) 10BBlack: [C: 032] Create nginx-{full,light,extras}-dbg by hand. 
[software/nginx] (wmf-1.13-jessie) - 10https://gerrit.wikimedia.org/r/381042 (owner: 10BBlack) [17:29:34] (03CR) 10BBlack: [C: 032] Release 1.13.5-1+wmf1~jessie1 [software/nginx] (wmf-1.13-jessie) - 10https://gerrit.wikimedia.org/r/381043 (owner: 10BBlack) [17:30:31] 10Operations, 10Analytics: rack/setup/install furud.codfw.wmnet - https://phabricator.wikimedia.org/T176506#3640394 (10RobH) a:05RobH>03faidon [17:31:13] 10Operations, 10Analytics: rack/setup/install furud.codfw.wmnet - https://phabricator.wikimedia.org/T176506#3627866 (10RobH) Ok, the two MD1200 disk shelves detect, and they are current not configured in any raid array. The OS is installed and calling into puppet, but is set to role spare for now. Assigned t... [17:33:37] (03CR) 10Ayounsi: [C: 032] Add OpenGear support to Rancid [puppet] - 10https://gerrit.wikimedia.org/r/378708 (https://phabricator.wikimedia.org/T175876) (owner: 10Ayounsi) [17:33:47] (03PS2) 10Ayounsi: Add OpenGear support to Rancid [puppet] - 10https://gerrit.wikimedia.org/r/378708 (https://phabricator.wikimedia.org/T175876) [17:43:09] (03PS1) 10EddieGP: varnish: remove references to mfLazyLoadReferences [puppet] - 10https://gerrit.wikimedia.org/r/381050 (https://phabricator.wikimedia.org/T175381) [17:45:17] PROBLEM - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100% [17:47:08] (03CR) 10BryanDavis: "Created https://phabricator.wikimedia.org/T176891 to follow up on this." [puppet] - 10https://gerrit.wikimedia.org/r/380318 (owner: 10BryanDavis) [17:49:17] PROBLEM - Host lvs1007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:52:23] (03CR) 10Jdlrobson: [C: 031] varnish: remove references to mfLazyLoadReferences [puppet] - 10https://gerrit.wikimedia.org/r/381050 (https://phabricator.wikimedia.org/T175381) (owner: 10EddieGP) [17:52:42] ACKNOWLEDGEMENT - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T167299 ? 
[17:52:42] ACKNOWLEDGEMENT - Host lvs1007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T167299 ? [17:54:27] RECOVERY - Host lvs1007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms [17:55:18] (03PS1) 10RobH: redirect labs-admin to cloud-admin mailing list [puppet] - 10https://gerrit.wikimedia.org/r/381052 (https://phabricator.wikimedia.org/T167155) [17:56:07] Hi! Dunno who might SWAT this morning... Just added 3 CentralNotice updates to the calendar [17:56:53] (03CR) 10RobH: [C: 032] redirect labs-admin to cloud-admin mailing list [puppet] - 10https://gerrit.wikimedia.org/r/381052 (https://phabricator.wikimedia.org/T167155) (owner: 10RobH) [17:57:13] addshore hashar anomie RainbowSprinkles ^ [17:58:08] not me! [17:58:12] Ah hmmmmm [17:58:32] Pretty long list of possible deployers there... [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Morning SWAT (Max 8 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170927T1800). [18:00:04] James_F: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:29] * James_F waves. [18:01:38] Hmmm jouncebot didn't get my edit to the Deployments page [18:01:59] AndyRussG: Off by a few seconds probably. [18:02:01] jouncebot: reload [18:02:08] RoanKattouw: You around to SWAT? [18:02:31] jouncebot: refresh [18:02:34] I refreshed my knowledge about deployments. [18:02:41] Sure I can do it [18:02:45] paladox: thx! [18:02:46] Ta. [18:02:49] RoanKattouw: also thx! [18:02:51] your welcome :) [18:03:46] RoanKattouw: Also subbu might have https://gerrit.wikimedia.org/r/#/c/381055/ for the SWAT as well. 
[18:04:06] subbu: Did you want that ---^^ SWATted? [18:04:10] yes please. [18:04:20] AndyRussG: I'm confused about the CN patches, what am I supposed to do exactly? The patches into the wmf_deploy branch are already merged [18:05:43] bblack: did it ever happen to you that you lost an existing LE cert, like reinstalling without having a backup and then later you cant get a new one because LE says "too many certs already issued with the exact name". do i have to contact them in that case? [18:05:45] RoanKattouw: merged just now, not deployed [18:06:10] Maybe we have to do a core change? aaarg [18:06:32] I'll check, hold on [18:06:33] mutante i've had that error before :). [18:06:40] RoanKattouw: thanks! [18:06:43] i think we should backup gerrit's one. [18:06:49] paladox: what did you do about it? [18:07:00] i had to use a different domain [18:07:13] i am waiting for the prevous one to expire currently. [18:07:32] a random wikisaur tk ?:) [18:07:53] AndyRussG: OK it just magically went through [18:07:54] nope [18:08:00] RoanKattouw: in core, the pointer for the CN submodule is just at wmf_deploy head... [18:08:01] gerrit.git.wmflabs.org was the prevous domain [18:08:13] *zing* (the sound of magic) [18:08:13] ooh, wmflabs, i see [18:08:31] i had to register gerrit2.git.wmflabs.org as i lost the certificate for gerrit.git.wmflabs.org when i had to recreate the domain [18:08:32] i meant recreate the instance [18:08:53] mutante: I think that limit is time-bounded. You'd have to get some number of identical certs in a fairly short window of time to hit it? and I think the only answer is waiting it out. [18:09:23] mutante: https://letsencrypt.org/docs/rate-limits/ [18:09:32] "There's a rate limit of 5 certificates per domain (TLD + 1, so subdomains count too) per 7 days in place. [18:09:35] just found that [18:09:55] bblack: thanks! ok [18:10:11] found on community.letsencrypt.org they tell him "You will have to wait a week. 
[18:10:31] well, i'm glad it didnt happen with gerrit.wm.org :) [18:10:36] but just gerrit-slave [18:10:47] I think you're probably hitting: "We also have a Duplicate Certificate limit of 5 certificates per week" [18:10:56] and isnt really used yet, has another things to solve , like db access [18:11:04] if you get the same cert with exact same details 5x in one week, you're toast on the 6th [18:11:20] but it could also be the top one in my link: [18:11:24] "The main limit is Certificates per Registered Domain, (20 per week)" [18:11:41] meaning we can only get 20x new certificates that have any SANs in wikimedia.org, per week [18:12:01] if we're doing a lot of one-off wikimedia.org certs, and their renewal times happen to collide with some reinstalls, etc... we could trip that [18:12:22] AndyRussG: Your CN patches are now on mwdebug1002, please test [18:12:33] *nod*, i reinstalled that host , jessie->stretch and didnt save the existing acme files [18:12:46] how many times though? [18:12:53] https://crt.sh/?q=%25.wikimedia.org [18:13:05] once.. this week [18:13:22] and not very long before that i had fixed the cert generation [18:13:32] the log says gerrit-slave was done 6 times in the past few days? [18:13:36] with a puppet change that fixed the hostname, to be gerrit-slave if on slave hosts [18:13:44] so it had just started working a few days prior [18:13:45] They do have a request form for getting a higher rate limit [18:13:58] They might grant us that, so that we dont' have to worry? [18:14:00] uh? that surprises me [18:14:18] https://letsencrypt.org/docs/rate-limits/ -> https://docs.google.com/forms/d/e/1FAIpQLSfg56b_wLmUN7n-WWhbwReE11YoHXs_fpJyZcEjDaR69Q-kJQ/viewform?c=0&w=1 [18:14:57] oh, nice google form [18:15:01] "If you are a large hosting provider or organization working on a Let’s Encrypt integration, we have a rate limiting form that" [18:15:04] Sounds like us :) [18:15:16] should we try that, bblack? 
[18:15:24] Rate Limit Adjustment Request for us [18:16:14] RoanKattouw: K one sec [18:16:16] James_F: subbu: Your patches are on mwdebug1002 now too, please test [18:16:46] this specific problem i have can easily wait. i wonder why it was 6 times though. separately we may want to fill out that form for WMF in general [18:17:40] Yeah, we presumably don't want to combine certs for unrelated domains. Certainly would make puppetisation more complicated I guess. [18:17:45] mutante: if we can wait for now, let's wait, maybe open a phab task to discuss raising the limit [18:17:55] RoanKattouw: Hmm. Patch has not yet come through for VE… [18:18:07] D'oh [18:18:08] --recursive [18:18:13] Ha. [18:18:16] mutante: (mostly because my brain isn't fully engaged in this, and I'm not sure we couldn't come up with some security downsides to raising it, in case someone's trying to use LE to get fakes) [18:18:34] James_F: Try now [18:18:38] (03CR) 10Herron: [C: 032] MX: Change zen.spamhaus.org DNSBL action from warn to drop [puppet] - 10https://gerrit.wikimedia.org/r/378718 (https://phabricator.wikimedia.org/T175879) (owner: 10Herron) [18:18:44] (03PS2) 10Herron: MX: Change zen.spamhaus.org DNSBL action from warn to drop [puppet] - 10https://gerrit.wikimedia.org/r/378718 (https://phabricator.wikimedia.org/T175879) [18:19:51] bblack: fair enough! yep, will do that and wait, maybe ticket [18:20:09] RoanKattouw: Yup, now it works. :-) [18:20:28] RECOVERY - Host lvs1007 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [18:20:42] RoanKattouw: But Subbu's needs an i18n scap. [18:20:48] Oh, right [18:20:55] That makes things easier [18:21:00] +1 [18:21:10] Did you test his too? [18:21:12] If you can push the VE one out quicker that'd be good. [18:21:13] Yes. [18:21:17] Thanks [18:21:20] So now I just need a green light from AndyRussG [18:21:48] RoanKattouw: all good! 
[18:21:58] thx much :)
[18:23:25] !log catrope@tin Started scap: SWAT: T176762, T175358, T172023, T174719, and add html5-misnesting to Linter
[18:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:34] T172023: CentralNotice: JS selection widgets no longer work in interface to add a campaign - https://phabricator.wikimedia.org/T172023
[18:23:34] T175358: CN Campaign Suppression prior to scheduled start time - https://phabricator.wikimedia.org/T175358
[18:23:34] T174719: Investigate email: BH data storage/transfer issue for iPad donations - https://phabricator.wikimedia.org/T174719
[18:23:34] T176762: [Regression pre-wmf.1] The first letter of every line is disappearing while typing and getting appended at the end of it/on the next line after inserting - https://phabricator.wikimedia.org/T176762
[18:24:40] subbu: Yours is in there too but I didn't discover the task number until it was too late, so you won't get the benefit of the bot auto-posting on your task
[18:24:52] (03PS1) 10Herron: MX: Change zen.spamhaus.org DNSBL action from warn to drop [puppet] - 10https://gerrit.wikimedia.org/r/381057 (https://phabricator.wikimedia.org/T175879)
[18:25:01] RoanKattouw, np. thanks.
[18:25:28] (03CR) 10Herron: [C: 032] MX: Change zen.spamhaus.org DNSBL action from warn to drop [puppet] - 10https://gerrit.wikimedia.org/r/381057 (https://phabricator.wikimedia.org/T175879) (owner: 10Herron)
[18:26:44] RoanKattouw: sounds like a bit of a finicky bot, eh ;)
[18:27:25] (03CR) 10Lydia Pintscher: [C: 031] Make using CirrusSearch engine default for wbsearchentities on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381025 (https://phabricator.wikimedia.org/T175741) (owner: 10Smalyshev)
[18:29:22] AndyRussG: Eh the bot is fine, it's Subbu's fault for not tagging the task on the commit :)
[18:29:38] (03PS3) 10Sowjanyavemuri: Microtask for Outreachy(Round15) that describes the understanding of the webservice commands. webservice --backend kubernetes start webservice --backend kubernetes stop [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/380568
[18:29:46] I did have to manually list the task numbers, having the script automatically pick them up would be cool, but it wouldn't have helped me here
[18:30:01] hmmmm
[18:30:04] ya
[18:30:12] ohwellz
[18:38:05] (03PS1) 10BBlack: Revert "Repool text@ulsfo" [dns] - 10https://gerrit.wikimedia.org/r/381060
[18:38:07] (03PS1) 10BBlack: Revert "depool ulsfo" [dns] - 10https://gerrit.wikimedia.org/r/381061
[18:38:47] (03CR) 10BBlack: [C: 032] Revert "Repool text@ulsfo" [dns] - 10https://gerrit.wikimedia.org/r/381060 (owner: 10BBlack)
[18:38:50] (03CR) 10BBlack: [C: 032] Revert "depool ulsfo" [dns] - 10https://gerrit.wikimedia.org/r/381061 (owner: 10BBlack)
[18:39:10] !log re-pooling upload@ulsfo (text@ulsfo remains pooled)
[18:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:19] (03PS1) 10Herron: MX: Change DNSBL error message [puppet] - 10https://gerrit.wikimedia.org/r/381063 (https://phabricator.wikimedia.org/T175879)
[18:46:20] !log catrope@tin Finished scap: SWAT: T176762, T175358, T172023, T174719, and add html5-misnesting to Linter (duration: 22m 55s)
[18:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:27] T172023: CentralNotice: JS selection widgets no longer work in interface to add a campaign - https://phabricator.wikimedia.org/T172023
[18:46:27] T175358: CN Campaign Suppression prior to scheduled start time - https://phabricator.wikimedia.org/T175358
[18:46:27] T174719: Investigate email: BH data storage/transfer issue for iPad donations - https://phabricator.wikimedia.org/T174719
[18:46:27] T176762: [Regression pre-wmf.1] The first letter of every line is disappearing while typing and getting appended at the end of it/on the next line after inserting - https://phabricator.wikimedia.org/T176762
[18:46:35] (03CR) 10Herron: [C: 032] MX: Change DNSBL error message [puppet] - 10https://gerrit.wikimedia.org/r/381063 (https://phabricator.wikimedia.org/T175879) (owner: 10Herron)
[18:46:55] RoanKattouw: Yay.
[18:47:51] (03CR) 10Daniel Kinzler: [C: 031] Make using CirrusSearch engine default for wbsearchentities on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381025 (https://phabricator.wikimedia.org/T175741) (owner: 10Smalyshev)
[18:53:30] (03PS2) 10Madhuvishy: toolforge: Remove /usr/bin/sql [puppet] - 10https://gerrit.wikimedia.org/r/380685 (https://phabricator.wikimedia.org/T176688) (owner: 10BryanDavis)
[19:00:04] no_justification: Dear deployers, time to do the MediaWiki train deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170927T1900).
[19:00:04] No GERRIT patches in the queue for this window AFAICS.
[19:00:44] * no_justification smacks jouncebot
[19:05:21] there will be
[19:06:15] (03CR) 10Chad: [C: 032] Group1 to 1.31.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381034 (owner: 10Chad)
[19:09:43] (03Merged) 10jenkins-bot: Group1 to 1.31.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381034 (owner: 10Chad)
[19:09:53] (03CR) 10jenkins-bot: Group1 to 1.31.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381034 (owner: 10Chad)
[19:10:25] !log demon@tin Synchronized php-1.31.0-wmf.1/maintenance/backup.inc: unbreak dumps (duration: 00m 49s)
[19:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:33] w00t!
[19:10:48] Looks like it's working
[19:10:54] I'm gonna check right now
[19:11:18] bam. works
[19:11:23] thankye much
[19:11:42] Yay for easy fixes :)
[19:12:02] (03CR) 10Herron: "I'd like to test this change with the puppet compiler but it threw an error about facts for the tools-mail host. Is there a certain way t" [puppet] - 10https://gerrit.wikimedia.org/r/379239 (https://phabricator.wikimedia.org/T175964) (owner: 10Herron)
[19:13:09] !log demon@tin Synchronized php: symlink bump (duration: 00m 47s)
[19:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:34] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 -> wmf.1
[19:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:28] PROBLEM - puppet last run on mw1247 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:41:00] (03CR) 10Madhuvishy: [C: 032] toolforge: Remove /usr/bin/sql [puppet] - 10https://gerrit.wikimedia.org/r/380685 (https://phabricator.wikimedia.org/T176688) (owner: 10BryanDavis)
[19:46:38] RECOVERY - puppet last run on mw1247 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[19:47:00] 10Operations, 10Traffic: Evaluate requesting a rate limit change from Letsencrypt - https://phabricator.wikimedia.org/T176905#3640874 (10Dzahn)
[19:48:08] 10Operations, 10Traffic: Evaluate requesting a rate limit change from Letsencrypt - https://phabricator.wikimedia.org/T176905#3640894 (10Dzahn) p:05Triage>03Low
[19:48:36] (03PS1) 10Chad: Gerrit: replication should be a forced push [puppet] - 10https://gerrit.wikimedia.org/r/381072
[19:50:26] (03CR) 10Dzahn: [C: 032] Gerrit: replication should be a forced push [puppet] - 10https://gerrit.wikimedia.org/r/381072 (owner: 10Chad)
[19:52:07] (03PS1) 10Hashar: prometheus: force ferm dns resolution to Ipv4 [puppet] - 10https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T176314)
[19:52:32] (03CR) 10Hashar: "Need something smarter, but that is a good enough for a cherry pick on the beta cluster." [puppet] - 10https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T176314) (owner: 10Hashar)
[19:53:45] no_justification, i've modifyed the systemd script here https://gerrit.wikimedia.org/r/#/c/378768/19/modules/gerrit/templates/initscripts/gerrit.systemd.erb
[19:53:55] using java -jar instead of the script
[19:54:03] i am going to test it to see if it works
[20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Services – Parsoid / OCG / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170927T2000).
[20:00:05] No GERRIT patches in the queue for this window AFAICS.
[20:02:12] Nothing for ORES
[20:02:27] (03PS2) 10Madhuvishy: device_backup: Update cron MAILTo to new mailing list address [puppet] - 10https://gerrit.wikimedia.org/r/378989 (https://phabricator.wikimedia.org/T168480)
[20:05:36] (03CR) 10Hashar: "The class ends up on the deployment-memc* hosts and that choke on labs. Most probably we need a hiera setting in that profile to disable I" [puppet] - 10https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T176314) (owner: 10Hashar)
[20:09:16] (03PS20) 10Paladox: Gerrit: Use systemd::service for systemd [puppet] - 10https://gerrit.wikimedia.org/r/378768 (https://phabricator.wikimedia.org/T157414)
[20:09:18] (03PS21) 10Paladox: Gerrit: Use systemd::service for systemd [puppet] - 10https://gerrit.wikimedia.org/r/378768 (https://phabricator.wikimedia.org/T157414)
[20:09:53] no_justification i can abandon the other change, as i've removed the reload command as you carn't reload a java -jar command ^^.
[20:09:59] i've tested it and works
[20:10:08] (03Abandoned) 10Paladox: Gerrit: Fix systemd script [puppet] - 10https://gerrit.wikimedia.org/r/379136 (owner: 10Paladox)
[20:10:26] There's no such thing as reload, correct.
[20:10:33] (03CR) 10Madhuvishy: [C: 032] device_backup: Update cron MAILTo to new mailing list address [puppet] - 10https://gerrit.wikimedia.org/r/378989 (https://phabricator.wikimedia.org/T168480) (owner: 10Madhuvishy)
[20:11:04] ah yep, restart then.
[20:11:29] restart command works
[20:11:37] without any additional things
[20:15:04] :)
[20:15:23] yes,gerrit.sh didnt have action reload
[20:15:40] and cool that it appears to work now with KillSignal=SIGINT
[20:15:44] thanks paladox
[20:16:04] not sure about "status"
[20:16:09] your welcome :)
[20:16:19] i will remove status as it can get it from systemd
[20:16:24] ok
[20:16:27] though the only thing i am not sure about
[20:16:37] is that it shows all the logs in the systemctl command now
[20:16:52] systemctl status gerrit shows the logs from error_log
[20:17:09] but i am not sure if it shows an error line will it cause systemd to stop working?
[20:17:43] logstash ? :p
[20:18:03] i mean https://gerrit.wikimedia.org/r/#/c/332531/ ?
[20:18:14] nope
[20:18:34] because we are running the command directly, it's showing what ever it pastes into error_log
[20:18:37] in the screen
[20:20:02] do we have to add logging options to it.. like.. eh.. Djava.util.logging.config.file= /path/to/logging/file
[20:21:40] 10Puppet, 10Cloud-Services: Make changing puppetmasters for Labs instances more easy - https://phabricator.wikimedia.org/T152941#3640998 (10hashar)
[20:21:51] nope
[20:22:50] it logs to error_log still just it will show it in the systemctl screen now too.
[20:22:52] mutante ^^
[20:24:46] (03PS1) 10Ayounsi: Rancid: Add scs-ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/381094 (https://phabricator.wikimedia.org/T175876)
[20:26:23] (03PS22) 10Paladox: Gerrit: Use systemd::service for systemd [puppet] - 10https://gerrit.wikimedia.org/r/378768 (https://phabricator.wikimedia.org/T157414)
[20:26:49] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Use systemd::service for systemd [puppet] - 10https://gerrit.wikimedia.org/r/378768 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox)
[20:28:07] (03PS23) 10Paladox: Gerrit: Use systemd::service for systemd [puppet] - 10https://gerrit.wikimedia.org/r/378768 (https://phabricator.wikimedia.org/T157414)
[20:28:50] (03CR) 10Ayounsi: [C: 032] Rancid: Add scs-ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/381094 (https://phabricator.wikimedia.org/T175876) (owner: 10Ayounsi)
[20:29:35] paladox: oh, ok, yea, i think that's ok?
[20:30:04] mutante yeh
[20:30:23] lgtm, tested it using systemctl stop gerrit, systemctl restart gerrit and systemctl start gerrit
[20:30:53] nice. try a reboot and see if it comes up by itself
[20:31:59] ok
[20:33:17] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request access to logstash (nda group) for @framawiki - https://phabricator.wikimedia.org/T176364#3641010 (10Dereckson) (Thanks @Framawiki to have developed the scope of your request. Yes, it shows a real need for the permission requested.)
[20:34:16] though the icinga alarm needs fixing
[20:36:01] 10Operations, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Backlog): Reimage gerrit2001 as stretch - https://phabricator.wikimedia.org/T168562#3641012 (10Dzahn) gerrit2001 currently has a puppet error because Letsencrypt cert request gets denied by LE due to hitting rate limits. This affects t...
[20:36:10] mutante it starts :)
[20:36:16] after rebooting
[20:37:04] great!
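(The unit being tested above — running gerrit.war directly with `java -jar`, stopping via SIGINT, no reload action, stdout/stderr landing in the journal, starting at boot — would render to roughly the following. This is a minimal sketch, not the actual contents of gerrit.systemd.erb; the user, heap size, and Restart policy are illustrative assumptions.)

```ini
[Unit]
Description=Gerrit Code Review
After=network.target

[Service]
User=gerrit2
# Run the daemon directly instead of wrapping gerrit.sh
ExecStart=/usr/bin/java -Xmx4g -jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site
# Gerrit shuts down cleanly on SIGINT, as discussed above
KillSignal=SIGINT
# No ExecReload: a plain `java -jar` process has nothing to reload
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Because ExecStart runs in the foreground, anything the daemon writes to stdout/stderr shows up in `systemctl status gerrit` and `journalctl -u gerrit`, which matches what paladox observed; an error line in that output does not by itself make systemd consider the unit failed.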
[20:37:24] paladox: what about port 29418
[20:37:28] checking
[20:37:29] is it listening
[20:38:31] yep
[20:38:32] adds comments in icinga to the gerrit2001 alert
[20:38:32] [2017-09-27 20:38:02,092] [main] INFO com.google.gerrit.sshd.SshDaemon : Started Gerrit SSHD-CORE-1.2.0 on *:29418
[20:38:47] :) ok
[20:38:51] mutante i wonder how do we fix the icinga check for gerrit?
[20:39:01] which one?
[20:39:09] as it will show
[20:39:10] Service: gerrit-test3 check gerrit
[20:39:14] Service: gerrit-test3 check gerrit
[20:39:19] PROCS CRITICAL: 0 processes with regex args '^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war'
[20:40:00] it alerts now because your commandline changed?
[20:40:04] yeh
[20:40:33] it's in modules/gerrit/manifests/jetty.pp
[20:40:37] nrpe_command =>
[20:41:28] thanks, though what do i change it too?
[20:41:35] please
[20:41:54] we could change the command based on "if something in Hiera" or other.. ehm..
[20:42:15] or the hostname not being the prod ones
[20:42:18] PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:42:33] well, you change it to match the new commandline
[20:42:52] ok
[20:42:56] you just change the end of it, the regex
[20:45:58] thanks
[20:46:00] found a fix
[20:46:52] i wonder how do i add variables in
[20:46:53] nrpe_command => "/usr/lib/nagios/plugins/check_procs -w 1:1 -c 1:1 --ereg-argument-array '^<%= @java_home %>/bin/java -Xmx<%= @heap_limit %> -jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site'",
[20:46:55] ?
[20:47:05] since it is using "" and then ''
[20:48:26] paladox: like the existing ones, java_home and heap_limit
[20:48:45] yep i mean how do i insert them. can they be in "
[20:48:50] "''"
[20:48:57] in a single string?
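(The failing check above is `check_procs` counting processes whose argument list matches an extended regex; the old regex anchored on `^GerritCodeReview`, which no longer matches once the unit execs `java -jar` directly. A rough local approximation of that matching, assuming procps `pgrep` is available; `check_procs_like` is a hypothetical helper, not the Nagios plugin itself.)

```shell
# Approximation of check_procs --ereg-argument-array: count processes whose
# full command line matches an extended regex, and expect exactly one.
check_procs_like() {
  local count
  count=$(pgrep -fc "$1" || true)   # pgrep -c prints 0 when nothing matches
  if [ "$count" -eq 1 ]; then
    echo "PROCS OK: $count process with regex args '$1'"
  else
    echo "PROCS CRITICAL: $count processes with regex args '$1'"
  fi
}

# After the systemd change the daemon's argv starts with the java binary,
# so the check must match the java invocation instead of "GerritCodeReview":
check_procs_like '/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon'
```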
[20:49:29] looks like it, yea, as long as " is the outer one
[20:50:12] the ' is around the argument to --ereg-argument-array
[20:50:30] ah ok
[20:50:32] thanks
[20:50:39] you dont need new ' or " to add an additional variable, just <%=
[20:51:24] ok
[20:51:40] i thought it's ${} in puppet code? and <%= in erb.
[20:51:58] (03PS24) 10Paladox: Gerrit: Use systemd::service for systemd [puppet] - 10https://gerrit.wikimedia.org/r/378768 (https://phabricator.wikimedia.org/T157414)
[20:52:00] fixed it now in ^^ :)
[20:52:01] works
[20:54:52] (03PS1) 10Andrew Bogott: tools clush: exclude puppetmaster from hostgroup [puppet] - 10https://gerrit.wikimedia.org/r/381122
[20:58:26] (03CR) 10Andrew Bogott: [C: 032] tools clush: exclude puppetmaster from hostgroup [puppet] - 10https://gerrit.wikimedia.org/r/381122 (owner: 10Andrew Bogott)
[20:59:18] paladox: it's like inline_template syntax without using inline_template, because when you are inside the commandline itself it is not puppet code
[20:59:37] yep
[20:59:38] so it looks like in erb templates
[20:59:50] i did
[20:59:51] nrpe_command => "/usr/lib/nagios/plugins/check_procs -w 1:1 -c 1:1 --ereg-argument-array '^${java_home}/bin/java -Xmx${heap_limit} -jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site'",
[21:00:02] i'm not even sure why it doesnt need to have inline_template() around that
[21:00:13] inline_template?
[21:00:13] but thats the working example
[21:00:16] yeh
[21:00:18] it works :)
[21:00:25] https://www.safaribooksonline.com/library/view/puppet-3-cookbook/9781782169765/ch02s06.html
[21:00:45] oh i see
[21:00:58] i did it using ${} :)
[21:01:28] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:01:44] eh.. looking if that is true
[21:02:03] expects nothing :p
[21:02:49] yep, puppet ran fine
[21:03:01] RFC meeting starting now in #wikimedia-office: HHVM support and PHP 7 migration
[21:03:28] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[21:04:22] (03PS1) 10Ayounsi: Rancid add 4 conservers as up; 1 as down until replaced [puppet] - 10https://gerrit.wikimedia.org/r/381125 (https://phabricator.wikimedia.org/T175876)
[21:06:57] (03CR) 10Ayounsi: [C: 032] Rancid add 4 conservers as up; 1 as down until replaced [puppet] - 10https://gerrit.wikimedia.org/r/381125 (https://phabricator.wikimedia.org/T175876) (owner: 10Ayounsi)
[21:07:05] (03PS2) 10Ayounsi: Rancid add 4 conservers as up; 1 as down until replaced [puppet] - 10https://gerrit.wikimedia.org/r/381125 (https://phabricator.wikimedia.org/T175876)
[21:10:18] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[21:11:25] 10Operations, 10DC-Ops, 10Patch-For-Review: document all scs connections - https://phabricator.wikimedia.org/T175876#3641109 (10ayounsi) I added all the conservers to rancid except scs-c1-eqiad.mgmt.eqiad.wmnet. Ping me when it's back online and I can take care of it.
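(The quoting question paladox and mutante work through above comes down to two different interpolation mechanisms: `${var}` is expanded inside double-quoted strings in Puppet manifests, while `<%= @var %>` is ERB syntax for `.erb` template files. A small sketch with illustrative variable names, not the real jetty.pp manifest:)

```puppet
# In a Puppet manifest, a double-quoted string interpolates ${...}.
# The single quotes are just literal characters inside the outer
# double-quoted string, so no extra quoting is needed around variables.
$cmd = "/usr/lib/nagios/plugins/check_procs -w 1:1 -c 1:1 --ereg-argument-array '^${java_home}/bin/java -Xmx${heap_limit} -jar ${war_path} daemon'"

# In an .erb template the same values would be written as ERB tags:
#   '^<%= @java_home %>/bin/java -Xmx<%= @heap_limit %> -jar <%= @war_path %> daemon'
```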
[21:15:12] (03PS1) 10Dzahn: base: screen-monitor, raise CRIT limit to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/381130 (https://phabricator.wikimedia.org/T165348)
[21:16:37] (03CR) 10Dzahn: [C: 032] base: screen-monitor, raise CRIT limit to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/381130 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn)
[21:17:41] (03PS2) 10Dzahn: icinga/base: re-enable screen/tmux monitoring [puppet] - 10https://gerrit.wikimedia.org/r/380901 (https://phabricator.wikimedia.org/T165348)
[21:22:45] (03PS3) 10Dzahn: icinga/base: re-enable screen/tmux monitoring [puppet] - 10https://gerrit.wikimedia.org/r/380901 (https://phabricator.wikimedia.org/T165348)
[21:24:05] (03PS1) 10Ayounsi: check_ifstatus: ignore swfab interfaces [puppet] - 10https://gerrit.wikimedia.org/r/381132
[21:24:25] (03PS4) 10Dzahn: icinga/base: re-enable screen/tmux monitoring [puppet] - 10https://gerrit.wikimedia.org/r/380901 (https://phabricator.wikimedia.org/T165348)
[21:25:20] (03CR) 10Dzahn: "re-reverting in https://gerrit.wikimedia.org/r/#/c/380901/" [puppet] - 10https://gerrit.wikimedia.org/r/376577 (owner: 10Dzahn)
[21:28:20] (03PS2) 10Ayounsi: check_ifstatus: ignore swfab interfaces [puppet] - 10https://gerrit.wikimedia.org/r/381132
[21:28:37] (03CR) 10Dzahn: [C: 032] icinga/base: re-enable screen/tmux monitoring [puppet] - 10https://gerrit.wikimedia.org/r/380901 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn)
[21:29:14] (03CR) 10Ayounsi: [C: 032] check_ifstatus: ignore swfab interfaces [puppet] - 10https://gerrit.wikimedia.org/r/381132 (owner: 10Ayounsi)
[21:29:22] (03PS3) 10Ayounsi: check_ifstatus: ignore swfab interfaces [puppet] - 10https://gerrit.wikimedia.org/r/381132
[21:34:28] _joe_ hi, when you have a chance could you review https://gerrit.wikimedia.org/r/#/c/378768/ please?
[21:34:40] It is migrating the gerrit systemd service from the deb to puppet
[21:34:57] so that we can archive the gerrit deb repo, since we now use scap to deploy gerrit.
[21:35:16] RECOVERY - Router interfaces on pfw3-eqiad is OK: OK: host 208.80.154.219, interfaces up: 69, down: 0, dormant: 0, excluded: 3, unused: 0
[21:36:14] no_justification would you also +1 or -1 ^^ please. Needs you +1 at least so that it can be merged :)
[21:36:37] I'm kinda busy
[21:36:47] we just need that merged so we can finally archive the deb repo. and yep, when you have a chance :).
[21:42:51] There's no rush on any of this by the way
[21:51:28] (03PS1) 10Dzahn: screen-monitoring, whitelist es200[234] by regex [puppet] - 10https://gerrit.wikimedia.org/r/381136 (https://phabricator.wikimedia.org/T165348)
[21:53:46] (03PS2) 10Dzahn: screen-monitoring, whitelist es200[234] by regex [puppet] - 10https://gerrit.wikimedia.org/r/381136 (https://phabricator.wikimedia.org/T165348)
[21:54:35] (03PS3) 10Dzahn: screen-monitoring, whitelist es200[234] by regex [puppet] - 10https://gerrit.wikimedia.org/r/381136 (https://phabricator.wikimedia.org/T165348)
[21:56:36] (03CR) 10Dzahn: [C: 032] screen-monitoring, whitelist es200[234] by regex [puppet] - 10https://gerrit.wikimedia.org/r/381136 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn)
[22:03:16] PROBLEM - Long running screen/tmux on db1020 is CRITICAL: CRIT: Long running SCREEN process. (PID: 6254, 52060346s 31536000s).
[22:06:11] 10Operations, 10MediaWiki-Platform-Team, 10NewPHP: Find PHP7 alternative for HHVM's Xenon - https://phabricator.wikimedia.org/T176916#3641313 (10Krinkle)
[22:06:46] (03PS1) 10Dzahn: screen-monitoring: whitelist all db/es by regex [puppet] - 10https://gerrit.wikimedia.org/r/381137 (https://phabricator.wikimedia.org/T165348)
[22:14:13] (03PS2) 10Dzahn: screen-monitoring: whitelist all db/es by regex [puppet] - 10https://gerrit.wikimedia.org/r/381137 (https://phabricator.wikimedia.org/T165348)
[22:17:16] (03PS3) 10Dzahn: screen-monitoring: whitelist all db/es by regex [puppet] - 10https://gerrit.wikimedia.org/r/381137 (https://phabricator.wikimedia.org/T165348)
[22:20:11] (03CR) 10Dzahn: [C: 032] screen-monitoring: whitelist all db/es by regex [puppet] - 10https://gerrit.wikimedia.org/r/381137 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn)
[22:34:41] !log maxsem@tin Synchronized php-1.31.0-wmf.1/includes/EditPage.php: https://gerrit.wikimedia.org/r/#/c/381139/1 (duration: 00m 49s)
[22:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:55] (03CR) 10Catrope: "@Zoranzoki21: Because a trusted user (Subbu) in this case +1ed the patch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379966 (https://phabricator.wikimedia.org/T175971) (owner: 10Zoranzoki21)
[22:55:27] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3641426 (10Dzahn) 05stalled>03Open
[23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Evening SWAT (Max 8 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170927T2300).
[23:00:04] Smalyshev: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:19] here!
[23:00:30] I'll SWAT
[23:00:48] (03PS2) 10Catrope: Make using CirrusSearch engine default for wbsearchentities on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381025 (https://phabricator.wikimedia.org/T175741) (owner: 10Smalyshev)
[23:00:52] (03CR) 10Catrope: [C: 032] Make using CirrusSearch engine default for wbsearchentities on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381025 (https://phabricator.wikimedia.org/T175741) (owner: 10Smalyshev)
[23:02:14] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3641436 (10Dzahn) Done! So far. It has been added to Icinga again after the merges above. You can now see some WARNs here: https://icinga.wikimedia.org/cgi-bin/icinga/status.cg...
[23:02:52] haha I see someone patched jouncebot
[23:03:12] RoanKattouw: thank Niharika :) )
[23:03:17] haha
[23:03:31] Thank? no_justification seemed mad. :P
[23:03:38] Thank you Niharika for correctly reflecting our perverse incentives
[23:03:58] Haha.
[23:07:10] * RoanKattouw throws some knives in the general direction of Nodepool/Zuul/Jenkins
[23:09:25] (03Merged) 10jenkins-bot: Make using CirrusSearch engine default for wbsearchentities on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381025 (https://phabricator.wikimedia.org/T175741) (owner: 10Smalyshev)
[23:09:30] (03CR) 10jenkins-bot: Make using CirrusSearch engine default for wbsearchentities on test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/381025 (https://phabricator.wikimedia.org/T175741) (owner: 10Smalyshev)
[23:09:34] looks like knives helped...
[23:10:07] * greg-g adds that to our roadmap
[23:10:28] greg-g: T169279 in lieu of knives
[23:10:29] T169279: Add mediawiki-config to the gate-and-submit-swat pipeline - https://phabricator.wikimedia.org/T169279
[23:10:45] Paladox submitted a patch for it and hashar seemed to argue that it isn't necessary, but my experience would beg to differ
[23:10:53] SMalyshev: Your change is on mwdebug1002, please teste
[23:10:54] -e
[23:11:13] sure, testing
[23:12:16] RoanKattouw: seems to be active and working fine, thanks!
[23:16:26] (03PS2) 10Smalyshev: Make using CirrusSearch engine default for wbsearchentities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379426 (https://phabricator.wikimedia.org/T175741)
[23:17:05] !log catrope@tin Synchronized wmf-config/Wikibase.php: Make CirrusSearch default for wbsearchentities on testwikidatawiki T175741 (duration: 00m 48s)
[23:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:17:12] T175741: Set ElasticSearch implementation as default for wbsearchentites on Wikidata - https://phabricator.wikimedia.org/T175741
[23:31:12] (03CR) 10Dzahn: "ok, thanks!. i don't think we can call it a problem, since we are adding a new thing. so it will wait for proper setup i would say." [puppet] - 10https://gerrit.wikimedia.org/r/380827 (https://phabricator.wikimedia.org/T168562) (owner: 10Dzahn)
[23:39:48] !log cp4024 is running hardware testing, leave it alone T174891
[23:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:54] T174891: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891
[23:41:03] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3641514 (10Dzahn) What do you think? The only thing that keeps me from closing this as resolved is now "should i lower the CRIT threshold to something less than a year" or shoul...
[23:50:57] 10Operations, 10ops-ulsfo, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3641515 (10RobH) Sometimes entering the serial console registers a keystroke, so I have this running via a screen session on iron. That way reconnecting wont input a keystroke and cancel testing. I'll che...