[00:15:14] (CR) Dzahn: [C: -1] "http://puppet-compiler.wmflabs.org/6509/" [puppet] - https://gerrit.wikimedia.org/r/353599 (owner: Dzahn)
[00:20:38] Operations, ops-codfw: Degraded RAID on ms-be2029 - https://phabricator.wikimedia.org/T166021#3287967 (Papaul) p:Triage>Normal
[00:47:26] RECOVERY - HP RAID on ms-be1037 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[00:53:27] Operations, ops-codfw: Degraded RAID on ms-be2029 - https://phabricator.wikimedia.org/T166021#3287988 (Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below. Your request is...
[02:24:58] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 07m 30s)
[02:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:46:56] RECOVERY - HP RAID on ms-be1035 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[02:57:18] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.2) (duration: 13m 38s)
[02:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:04:03] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed May 24 03:04:03 UTC 2017 (duration 6m 45s)
[03:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:16:56] RECOVERY - HP RAID on ms-be1038 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[04:08:17] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:08:17] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:08:17] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:08:26] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:08:26] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:08:26] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:08:26] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:08:26] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:08:27] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:08:27] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:08:28] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:08:28] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:08:29] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:08:36] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:08:46] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:08:46] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:08:47] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:08:47] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:08:47] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:08:47] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:15:07] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:15:16] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:15:16] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:15:26] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:15:26] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:15:26] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:15:26] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:15:26] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:15:36] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:15:36] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[04:17:37] o_O?
[04:18:16] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:18:26] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[04:18:27] RECOVERY - MariaDB Slave Lag: m2 on dbstore1001 is OK: OK slave_sql_lag not a slave
[04:18:27] RECOVERY - MariaDB Slave Lag: s7 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 84355.27 seconds
[04:18:27] RECOVERY - MariaDB Slave Lag: s3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 88216.27 seconds
[04:18:27] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:18:27] RECOVERY - MariaDB Slave Lag: m3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 87161.28 seconds
[04:18:27] RECOVERY - MariaDB Slave Lag: x1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 5.29 seconds
[04:18:36] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[04:18:36] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[04:18:37] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[04:18:37] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:18:37] RECOVERY - MariaDB Slave Lag: s5 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 87215.32 seconds
[04:18:37] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[04:18:37] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[04:18:38] RECOVERY - MariaDB Slave Lag: s4 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 87164.35 seconds
[04:19:06] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:19:07] RECOVERY - MariaDB Slave Lag: s2 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89112.50 seconds
[04:19:07] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:19:16] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[04:19:16] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[04:19:16] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:19:16] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:19:16] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:19:16] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[04:19:16] RECOVERY - MariaDB Slave Lag: s6 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 87337.43 seconds
[04:19:17] RECOVERY - MariaDB Slave Lag: s1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 88271.44 seconds
[04:19:17] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[04:19:18] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[04:19:18] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[05:16:16] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:16] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:16] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:16] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:26] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:26] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:26] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:26] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:26] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:27] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:27] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:28] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:28] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:36] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:36] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:46] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:47] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:47] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:47] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:47] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:18:47] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:25:16] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[05:25:16] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[05:25:26] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[05:25:26] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:25:36] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[05:25:36] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[05:25:36] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[05:25:36] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[05:25:36] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:25:37] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[05:26:06] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:26:06] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[05:26:06] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:26:16] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:26:16] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:26:16] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[05:26:16] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[05:26:16] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[05:26:17] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:26:17] RECOVERY - MariaDB Slave Lag: s6 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 87254.47 seconds
[05:26:18] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:26:18] RECOVERY - MariaDB Slave Lag: s1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89345.48 seconds
[05:26:19] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[05:26:19] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:26:26] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[05:26:26] RECOVERY - MariaDB Slave Lag: m2 on dbstore1001 is OK: OK slave_sql_lag not a slave
[05:35:16] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:48:26] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]
[05:48:36] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]
[05:49:26] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[05:49:36] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[05:50:46] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1124.90 Read Requests/Sec=318.00 Write Requests/Sec=1.00 KBytes Read/Sec=40000.40 KBytes_Written/Sec=28.40
[05:52:24] Operations, Puppet, Patch-For-Review, RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#3288070 (Joe) >>! In T147718#3286942, @Ottomata wrote: > Also, if configuration of profiles can only be done via hiera, doesn't that mean any module p...
[05:57:24] Operations, Puppet, Patch-For-Review, RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#3288071 (Joe) >>! In T147718#3275482, @Ottomata wrote: > I have a question about the new profile guidelines: > >> Profile classes should only have pa...
[06:00:46] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=4.40 Read Requests/Sec=1.70 Write Requests/Sec=74.60 KBytes Read/Sec=15.60 KBytes_Written/Sec=1064.80
[06:02:42] !log Deploy alter table on s2 db1047 - https://phabricator.wikimedia.org/T162611
[06:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:55] !log Run pt-table-checksum on s7.frwiktionary - https://phabricator.wikimedia.org/T163190
[06:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:16] RECOVERY - MariaDB Slave Lag: s2 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89996.16 seconds
[06:34:54] !log Deploy alter table on s2.fawiki directly on codfw master (db2029) after running the clean up duplicates script - https://phabricator.wikimedia.org/T164530
[06:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:26] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]
[06:41:36] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]
[06:44:36] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[06:49:36] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 11 probes of 280 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[06:51:04] (PS1) Marostegui: db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/355378 (https://phabricator.wikimedia.org/T164530)
[06:53:00] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/355378 (https://phabricator.wikimedia.org/T164530) (owner: Marostegui)
[06:55:27] (Merged) jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/355378 (https://phabricator.wikimedia.org/T164530) (owner: Marostegui)
[06:55:29] (PS5) Muehlenhoff: Add explicit dependency on ghostscript [puppet] - https://gerrit.wikimedia.org/r/313963
[06:56:05] (CR) jenkins-bot: db-eqiad.php: Depool db1079 [mediawiki-config] - https://gerrit.wikimedia.org/r/355378 (https://phabricator.wikimedia.org/T164530) (owner: Marostegui)
[06:56:37] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1079 - T164530 (duration: 00m 54s)
[06:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:46] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[06:58:36] (PS1) Marostegui: db-eqiad.php: Repool db1079, depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/355379 (https://phabricator.wikimedia.org/T164530)
[06:59:26] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[06:59:36] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[07:01:31] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1079, depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/355379 (https://phabricator.wikimedia.org/T164530) (owner: Marostegui)
[07:02:37] (Merged) jenkins-bot: db-eqiad.php: Repool db1079, depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/355379 (https://phabricator.wikimedia.org/T164530) (owner: Marostegui)
[07:02:49] (CR) jenkins-bot: db-eqiad.php: Repool db1079, depool db1086 [mediawiki-config] - https://gerrit.wikimedia.org/r/355379 (https://phabricator.wikimedia.org/T164530) (owner: Marostegui)
[07:03:55] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1079, depool db1086 - T164530 (duration: 00m 42s)
[07:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:05] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[07:06:54] (PS1) Marostegui: db-eqiad.php: Repool db1086, depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/355380 (https://phabricator.wikimedia.org/T164530)
[07:11:36] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[07:12:06] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[07:12:06] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[07:12:06] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[07:12:07] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[07:12:16] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[07:13:36] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[07:14:26] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy
[07:14:26] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[07:14:56] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[07:14:56] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy
[07:14:56] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
[07:14:57] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[07:15:06] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy
[07:15:41] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1086, depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/355380 (https://phabricator.wikimedia.org/T164530) (owner: Marostegui)
[07:16:40] (Merged) jenkins-bot: db-eqiad.php: Repool db1086, depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/355380 (https://phabricator.wikimedia.org/T164530) (owner: Marostegui)
[07:16:47] (CR) jenkins-bot: db-eqiad.php: Repool db1086, depool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/355380 (https://phabricator.wikimedia.org/T164530) (owner: Marostegui)
[07:17:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1086, depool db1094 - T164530 (duration: 00m 41s)
[07:17:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:47] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[07:18:27] (PS6) Jcrespo: raid: Implement the option to check write cache policies [puppet] - https://gerrit.wikimedia.org/r/355249 (https://phabricator.wikimedia.org/T166108)
[07:19:56] (PS7) Jcrespo: raid: Implement the option to check write cache policies [puppet] - https://gerrit.wikimedia.org/r/355249 (https://phabricator.wikimedia.org/T166108)
[07:23:40] (PS1) Marostegui: db-eqiad.php: Repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/355382 (https://phabricator.wikimedia.org/T164530)
[07:26:23] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/355382 (https://phabricator.wikimedia.org/T164530) (owner: Marostegui)
[07:27:59] (Merged) jenkins-bot: db-eqiad.php: Repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/355382 (https://phabricator.wikimedia.org/T164530) (owner: Marostegui)
[07:28:07] (CR) jenkins-bot: db-eqiad.php: Repool db1094 [mediawiki-config] - https://gerrit.wikimedia.org/r/355382 (https://phabricator.wikimedia.org/T164530) (owner: Marostegui)
[07:28:59] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1094 - T164530 (duration: 00m 41s)
[07:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:09] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[07:47:41] (CR) Giuseppe Lavagetto: [C: 1] Add explicit dependency on ghostscript [puppet] - https://gerrit.wikimedia.org/r/313963 (owner: Muehlenhoff)
[07:56:15] (CR) Volans: [C: -1] "I've some comments on the logic of the generalization, see them inline." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/355249 (https://phabricator.wikimedia.org/T166108) (owner: Jcrespo)
[07:56:26] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:57:27] Hi all, I need some help with Python virtualenv on Kubernetes on Wikitech ToolLabs. What's the best place to ask for such questions?
[07:58:26] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:58:26] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:58:26] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:58:27] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:58:27] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:58:27] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:58:27] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:58:28] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:58:28] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:58:29] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:58:30] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:58:36] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:58:36] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:58:36] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:58:36] PROBLEM - puppet last run on conf1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 18 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[openjdk-7-jdk]
[07:58:46] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:58:46] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:58:46] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:58:46] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:58:46] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:58:47] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:59:49] ^ those are probably the backups
[08:00:13] Xelgen: #wikimedia-labs is probably the best place for it
[08:02:36] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[08:02:36] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[08:02:36] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[08:02:37] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[08:02:37] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[08:02:37] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[08:03:16] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[08:03:16] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[08:03:16] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[08:03:16] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[08:03:16] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[08:03:17] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[08:03:17] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [08:03:18] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [08:03:18] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [08:03:19] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [08:03:19] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [08:03:26] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [08:03:26] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [08:03:26] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [08:03:48] volans: Thanks! [08:04:22] ye [08:04:25] *yw [08:06:27] (03CR) 10Jcrespo: raid: Implement the option to check write cache policies (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/355249 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo) [08:10:36] RECOVERY - puppet last run on conf1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [08:13:11] 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 10netops: rack/setup/wire/deploy msw2-c1-eqiad - https://phabricator.wikimedia.org/T166171#3288263 (10ayounsi) As ETA is very short for the new routers and switches, let's wait for them and plan/rack everything at the same time. [08:13:37] (03CR) 10Volans: [C: 04-1] "reply inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/355249 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo) [08:14:01] 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 10netops: rack/setup/wire/deploy msw2-c1-eqiad - https://phabricator.wikimedia.org/T166171#3288264 (10ayounsi) a:03Cmjohnson [08:16:36] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 23 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[openjdk-7-jdk] [08:17:15] (03PS8) 10Jcrespo: raid: Implement the option to check write cache policies [puppet] - 10https://gerrit.wikimedia.org/r/355249 (https://phabricator.wikimedia.org/T166108) [08:23:36] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [08:27:28] 06Operations: Upgrade facter to version 2.4.6 - https://phabricator.wikimedia.org/T166203#3288272 (10Volans) [08:28:47] (03PS9) 10Jcrespo: raid: Implement the option to check write cache policies [puppet] - 10https://gerrit.wikimedia.org/r/355249 (https://phabricator.wikimedia.org/T166108) [08:28:50] (03CR) 10Filippo Giunchedi: "I see a similar patch was merged and then reverted? What happened?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355174 (https://phabricator.wikimedia.org/T160616) (owner: 10Aaron Schulz) [08:29:38] (03CR) 10Jcrespo: "What about now?" 
[puppet] - 10https://gerrit.wikimedia.org/r/355249 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo) [08:32:44] (03CR) 10Filippo Giunchedi: [C: 031] Use gdb from jessie-backports on jessie [puppet] - 10https://gerrit.wikimedia.org/r/355110 (owner: 10Muehlenhoff) [08:34:01] jynus: the patch is on top of many others not merged [08:34:19] (03PS9) 10Filippo Giunchedi: Enable memcache-based Thumbor broken thumbnail throttling [puppet] - 10https://gerrit.wikimedia.org/r/342811 (https://phabricator.wikimedia.org/T151065) (owner: 10Gilles) [08:34:25] volans: nope [08:34:43] just same topic? [08:34:47] can people learn to read? "same topic" [08:34:57] I use mysql as my default topic [08:35:08] can gerrit show differently chained changes AND same topic? :D [08:35:15] yes [08:35:19] it creates 3 sections [08:35:31] same topic "depends" and something else [08:35:42] I was confused when 2 people told me the same [08:35:55] apparently I know gerrit better than most people :-) [08:36:16] I found gerrit UI super confusing, the new one more than the previous ;) [08:36:45] yeah, Related Changes (6) [08:36:45] Submitted Together (2) [08:36:48] Same Topic (1) [08:36:51] I could change the topic [08:36:54] but the one on focus changes [08:36:59] but I left it as mysql [08:37:03] depending if the others are there or not [08:37:13] to show this was intended for mysql hosts first [08:37:49] so the change looks better to me for the check part, not sure what was the outcome of the puppet-related part chat [08:38:00] there are things missing [08:38:01] there [08:38:05] variables, not variables, default values, etc...
:) [08:38:07] but I do not want to do them here [08:38:07] (03CR) 10Filippo Giunchedi: [C: 032] Enable memcache-based Thumbor broken thumbnail throttling [puppet] - 10https://gerrit.wikimedia.org/r/342811 (https://phabricator.wikimedia.org/T151065) (owner: 10Gilles) [08:38:15] we have to refactor [08:38:24] raid to be included on base directly [08:38:31] according to the manual style [08:38:36] style manual [08:38:44] modules should not include other modules [08:38:53] so raid should be included by base directly [08:39:04] I will be doing that, but not as part of this change [08:39:14] sure [08:39:22] and then looking at the hp stuff [08:39:32] but that is more complicated [08:39:41] in fact, we do not have the cache enabled on most servers [08:39:49] the write cache [08:39:58] for ssd IIRC it is better without [08:40:42] I am not disagreeing, but I would like to 1) test and 2) maybe tune it for reads [08:41:08] I think now it is fully disabled [08:41:20] I think we may be able to enable it for reads only [08:41:42] not sure, just need to look at it [08:42:37] this is why I want to have something now, and promote it for a couple of core services (mysql, swift, analytics) [08:42:46] and then keep improving [08:43:05] for example, the ipmi check is there, but it is horrible to enable it [08:43:35] then there is the BBU monitoring, which will be mostly heuristic [08:43:41] but it should be there, too [08:48:26] !log joal@tin Started deploy [analytics/refinery@9377d9c]: Deploying to fix yesterday's deploy bugs [08:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:12] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: cp3036.esams.wmnet [08:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:35] !log depool cp3036 for T133387 testing [08:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:42] T133387: Enabling IGMP snooping on QFX
switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387 [08:50:34] jynus: ack but then the value in hiera might not apply as the logic between megaraid and hp is probably different [08:51:03] we might have to define a "dbpolicy" that does the right thing based on the hardware [08:51:11] !log joal@tin Finished deploy [analytics/refinery@9377d9c]: Deploying to fix yesterday's deploy bugs (duration: 02m 44s) [08:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:40] volans: I genuinely do not get what you mean [08:51:47] (03PS1) 10Ema: prometheus: enable qdisc collector on cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/355391 (https://phabricator.wikimedia.org/T147569) [08:51:56] "the logic between megaraid and hp is probably different"? [08:52:35] !log Deploy alter table on codfw master (db2019 and let it replicate) on s4 - T166206 [08:52:41] that if for megaraid we use the cache with a specific policy and for hp maybe we don't use it at all or in a different way [08:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:42] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206 [08:52:44] profile::base::check_raid_policy: 'WriteBack' [08:52:52] you mean that I will not want writeback on hp? [08:53:15] will not apply, and we might need to define it as "dbpolicy" and then in the script do the "right" thing based on which hardware is present [08:53:27] you might want no cache on HP and WriteBack on megaraid [08:53:33] it depends also on SSD vs HDD [08:53:35] but as of now, I want it [08:53:44] if I didn't want it [08:53:59] then I would restrict that regex to the hosts that I want [08:54:45] would you be happy with doing that now?
[08:55:10] basically, selecting all old hosts, which are the ones problematic right now [08:55:45] old_db_hosts_policy [08:56:03] and then decide later whether to apply it or what to do on the newer ssd hosts? [08:56:40] if all the megaraid hosts need that policy and have that policy you can keep it for now, but be aware that you might need to refactor it to support different use cases later [08:57:11] yes, that is why I put it on hiera [08:57:21] so that it is much easier to change [09:00:05] (03PS1) 10Giuseppe Lavagetto: calico: add new version 2.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/355392 (https://phabricator.wikimedia.org/T165024) [09:00:07] (03PS1) 10Giuseppe Lavagetto: role::kubernetes::worker: upgrade calico on one host [puppet] - 10https://gerrit.wikimedia.org/r/355393 (https://phabricator.wikimedia.org/T165024) [09:00:09] (03PS1) 10Giuseppe Lavagetto: role::kubernetes::worker: upgrade calico everywhere [puppet] - 10https://gerrit.wikimedia.org/r/355394 (https://phabricator.wikimedia.org/T165024) [09:03:07] (03PS1) 10Alexandros Kosiaris: Drain esams for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/355395 (https://phabricator.wikimedia.org/T133387) [09:06:00] (03PS2) 10Muehlenhoff: Use gdb from jessie-backports on jessie [puppet] - 10https://gerrit.wikimedia.org/r/355110 (https://phabricator.wikimedia.org/T165043) [09:10:03] (03CR) 10Alexandros Kosiaris: [C: 032] Drain esams for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/355395 (https://phabricator.wikimedia.org/T133387) (owner: 10Alexandros Kosiaris) [09:10:34] (03PS3) 10Muehlenhoff: Use gdb from jessie-backports on jessie [puppet] - 10https://gerrit.wikimedia.org/r/355110 [09:10:46] !log drain esams for network tests for T133387 [09:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:55] T133387: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387
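[editor's note] The per-hardware write cache policy check discussed above (hiera key `profile::base::check_raid_policy`, different expectations for megaraid vs HP, a possible "dbpolicy" indirection) could be sketched roughly as below. The controller names, the policy mapping, and the function name are illustrative assumptions, not the actual code merged in operations/puppet r/355249:

```python
#!/usr/bin/env python
# Sketch of a write cache policy check in the Nagios plugin convention.
# EXPECTED_POLICY plays the role of a "dbpolicy"-style mapping: what a
# hiera value could resolve to per controller family (assumed values).
EXPECTED_POLICY = {
    'megaraid': 'WriteBack',    # HDD hosts behind a BBU-backed cache
    'hpsa': 'WriteThrough',     # hypothetical: HP/SSD hosts skip the write cache
}


def check_policy(controller, ld_policies):
    """Compare each logical drive's cache policy with the expected one.

    Returns (exit_code, message): 0 = OK, 2 = CRITICAL, 3 = UNKNOWN.
    """
    expected = EXPECTED_POLICY.get(controller)
    if expected is None:
        return 3, 'UNKNOWN: no expected policy for controller %s' % controller
    wrong = [p for p in ld_policies if p != expected]
    if wrong:
        return 2, ('CRITICAL: %d LD(s) must have write cache policy %s, '
                   'currently using: %s'
                   % (len(wrong), expected, ', '.join(sorted(set(wrong)))))
    return 0, 'OK: %d logical drive(s) using %s policy' % (len(ld_policies), expected)


if __name__ == '__main__':
    # Mirrors the db1015 alert seen later in this log.
    print(check_policy('megaraid', ['WriteThrough'])[1])
```

Restricting the check to "old hosts" would then just be a matter of which hosts get the hiera key, as discussed.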
[09:12:03] (03PS1) 10Phuedx: mobileFrontend: Move first paragraph before infobox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355397 (https://phabricator.wikimedia.org/T150325) [09:15:26] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: cp3036.esams.wmnet [09:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:34] (03CR) 10Jcrespo: [C: 031] raid: Implement the option to check write cache policies [puppet] - 10https://gerrit.wikimedia.org/r/355249 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo) [09:17:07] (03PS10) 10Jcrespo: raid: Implement the option to check write cache policies [puppet] - 10https://gerrit.wikimedia.org/r/355249 (https://phabricator.wikimedia.org/T166108) [09:21:02] 06Operations: Upgrade facter to version 2.4.6 - https://phabricator.wikimedia.org/T166203#3288486 (10Volans) The upgrade will be performed with those steps: - disable puppet reliably (waiting for any in-flight run) - compile the catalog and output the facts to a directory - upgrade facter - compile the catalog a... 
[09:26:42] !log upload prometheus-node-exporter 0.14.0~git20170523-0 to jessie-wikimedia - T160156 [09:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:52] T160156: Add node_exporter ipvs ipv6 support - https://phabricator.wikimedia.org/T160156 [09:34:41] 06Operations, 10Monitoring, 10Traffic, 15User-fgiunchedi: Add node_exporter ipvs ipv6 support - https://phabricator.wikimedia.org/T160156#3288519 (10fgiunchedi) a:03ema [09:35:18] (03CR) 10Jcrespo: [C: 032] raid: Implement the option to check write cache policies [puppet] - 10https://gerrit.wikimedia.org/r/355249 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo) [09:39:10] 06Operations, 10ops-eqiad, 15User-Elukey, 15User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3288538 (10Joe) Here is my proposal regarding these systems: Based on current clusters and substitution mw1307-1311: jobrunners, in row C. They substitute mw1161-67 that are in ro... [09:40:49] PROBLEM - swift-container-server on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:40:49] PROBLEM - swift-account-auditor on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:40:50] PROBLEM - swift-account-replicator on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:40:50] PROBLEM - swift-object-server on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:40:50] PROBLEM - dhclient process on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:40:50] PROBLEM - swift-account-reaper on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:40:50] PROBLEM - swift-object-updater on ms-be1019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:41:39] RECOVERY - swift-container-server on ms-be1019 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [09:41:39] RECOVERY - swift-account-reaper on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [09:41:39] RECOVERY - swift-account-replicator on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [09:41:39] RECOVERY - swift-account-auditor on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [09:41:39] RECOVERY - swift-object-server on ms-be1019 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [09:41:40] RECOVERY - dhclient process on ms-be1019 is OK: PROCS OK: 0 processes with command name dhclient [09:41:40] RECOVERY - swift-object-updater on ms-be1019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [09:44:55] 06Operations, 07HHVM, 07Upstream: HHVM segfault in memory cleanup - https://phabricator.wikimedia.org/T162586#3288562 (10hashar) I guess now we can switch CI from HHVM 3.12. to the latest 3.18 you have build? :-} [09:45:27] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3288563 (10elukey) I tried to manually hack mw1161 in prod but nothing changed from th... [09:45:48] 06Operations, 10ops-eqiad, 15User-Elukey, 15User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3288565 (10Joe) Another option is not to care much how the current distribution goes but to just evenly distribute servers across rows, and then go on and rebalance the whole cluster... 
[09:49:39] !log upgrade prometheus-node-exporter on cache hosts to 0.14.0~git20170523-0 T147569 [09:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:47] T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569 [09:53:26] !log rebooting asw-esams for upgrade (T133387) [09:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:35] T133387: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387 [09:53:39] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [09:54:10] PROBLEM - Host ns2-v4 is DOWN: CRITICAL - Network Unreachable (91.198.174.239) [09:54:29] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [09:54:39] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [09:54:39] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [09:54:54] PROBLEM - Host text-lb.esams.wikimedia.org_ipv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:862:ed1a::1) [09:54:58] PROBLEM - Host upload-lb.esams.wikimedia.org_ipv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:862:ed1a::2:b) [09:55:03] PROBLEM - Host upload-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.208) [09:55:08] PROBLEM - Host text-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.192) [09:55:18] PROBLEM - Host cr1-esams is DOWN: CRITICAL - Network Unreachable (91.198.174.245) [09:55:32] PROBLEM - Host misc-web-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.217) [09:55:32] PROBLEM - Host mr1-esams is DOWN: CRITICAL - Network Unreachable (91.198.174.247) [09:55:36] 
I suppose the pages are from the reboot [09:55:37] PROBLEM - Host maps-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.209) [09:55:40] XioNoX: is that you? [09:55:42] PROBLEM - Host misc-web-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:55:44] is esams depooled? [09:55:45] yeah, esams is depooled [09:55:45] υεσ [09:55:46] yes [09:55:47] PROBLEM - Host maps-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [09:55:47] PROBLEM - Host ns2-v6 is DOWN: PING CRITICAL - Packet loss = 100% [09:55:48] IIRC [09:55:48] yes yes [09:55:58] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [09:55:58] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [09:55:58] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 41, down: 3, dormant: 0, excluded: 0, unused: 0BRet-0/2/0: down - Core: asw-esams:et-3/0/48 {#10669}BRet-0/2/1: down - Core: asw-esams:et-0/0/48 {#10644}BRae1: down - Core: asw-esams:ae1BR [09:55:58] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 37, down: 4, dormant: 0, excluded: 0, unused: 0BRxe-0/0/3: down - Core: asw-esams:xe-0/0/32 (Relined, SMF4303) [10Gbps DF CWDM C59 cwdm1-knams]BRxe-0/0/0: down - Core: asw-esams:xe-3/0/43 (GBLX leg 1) {#14006} [10Gbps DF CWDM C61]BRae1: down - Core: asw-esams:ae3BRxe-1/0/0: down - Core: asw-esams:xe-3/0/42 (GBLX leg 2) {#14007} [10Gbp [09:56:20] controlled doomsday scenario [09:56:30] well it pages lots of people :P [09:56:38] PROBLEM - Host wikidata is DOWN: CRITICAL - Network Unreachable (91.198.174.192) [09:56:44] some of whom aren't here on IRC and aware of impending planned doom :P [09:57:02] hahahah yes [09:57:03] wikidata down? [09:57:07] why? 
[09:57:10] heh [09:57:22] <_joe_> not for me [09:57:25] what is host "wikidata" [09:57:40] jynus: 91.198.174.192 reverses to text-lb.esams.wikimedia.org. [09:57:41] no, I mean that alert, not sure what it monitors [09:57:43] looks like a non-qualified hostname [09:58:01] yeah it's a crap definition in the config [09:58:06] should fix that [09:58:06] or a weird check [09:58:18] ah, I get it [09:58:18] RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 121.13 ms [09:58:23] RECOVERY - Host misc-web-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 119.80 ms [09:58:23] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 122.30 ms [09:58:27] RECOVERY - Host maps-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 119.81 ms [09:58:32] RECOVERY - Host upload-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 120.84 ms [09:58:32] RECOVERY - Host cr1-esams is UP: PING OK - Packet loss = 0%, RTA = 121.35 ms [09:58:33] RECOVERY - Host ns2-v6 is UP: PING OK - Packet loss = 0%, RTA = 119.57 ms [09:59:07] RECOVERY - Host upload-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 120.92 ms [09:59:11] RECOVERY - Host maps-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 119.62 ms [09:59:16] RECOVERY - Host misc-web-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 121.42 ms [09:59:20] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0 [09:59:20] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 57, down: 0, dormant: 0, excluded: 0, unused: 0 [09:59:20] well at least there is some good news.
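[editor's note] The "crap definition in the config" discussed above is a monitoring host tied to one caching center's IP (91.198.174.192, text-lb.esams), which is why "wikidata" paged when esams rebooted. A fix in the direction of the later patch "Decouple wikidata monitoring from the IP address" (r/355411) might look like the following Nagios/Icinga-style fragment; the object attributes shown are an assumption for illustration, not the merged change:

```cfg
# Hypothetical fix: monitor wikidata by hostname so the check follows
# DNS/geodns instead of pinning one caching center's load balancer IP.
define host {
    use        generic-host          ; assumed template name
    host_name  wikidata
    address    www.wikidata.org      ; was: 91.198.174.192 (text-lb.esams)
}
```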
We did not avoid the pages for the LVS services (my mistake) but we did avoid the per host spam storm [09:59:25] RECOVERY - Host text-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 119.63 ms [09:59:29] RECOVERY - Host text-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 119.58 ms [09:59:51] 06Operations, 10Security-Reviews, 07Surveys: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#3288591 (10Elitre) Can someone clarify who should do the "due diligence that as an organization we're comfortable with how they would protect our users privacy and the controls they have in place t... [09:59:52] !log asw-esams back up (T133387) [10:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:01] T133387: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387 [10:00:09] <_joe_> akosiaris: you fixed all the alerts, but the paging ones [10:00:49] :-) [10:01:20] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:01:20] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:01:21] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:01:21] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Puppet has 19 failures. Last run 2 minutes ago with 19 failures. 
Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check-fresh-files-in-dir.py],File[/usr/local/bin/puppet-enabled],File[/usr/lib/nagios/plugins/check_sysctl],File[/etc/sysctl.d] [10:01:50] RECOVERY - Host wikidata is UP: PING OK - Packet loss = 0%, RTA = 120.29 ms [10:05:38] !log forcing puppet run on failed hosts only in esams T133387 [10:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:46] T133387: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387 [10:07:14] (03PS1) 10Hashar: beta: set profile::etcd::tlsproxy::read_only=false [puppet] - 10https://gerrit.wikimedia.org/r/355402 [10:07:20] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [10:07:26] <_joe_> hashar: hah, thanks [10:07:56] (03CR) 10Giuseppe Lavagetto: [C: 032] beta: set profile::etcd::tlsproxy::read_only=false [puppet] - 10https://gerrit.wikimedia.org/r/355402 (owner: 10Hashar) [10:08:00] _joe_: testing it on beta :) [10:08:08] <_joe_> hashar: that's gonna work for sure [10:08:32] and someone magically dropped all cherry picks from beta bah :( [10:09:20] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [10:12:27] <_joe_> hashar: _all_? [10:12:35] <_joe_> jeez some were local-only [10:12:36] yeah some bad rebase :-) [10:12:41] <_joe_> ouch [10:12:46] 25 patches currently. I am going to rebase it locally [10:12:53] <_joe_> it's gonna be hard to reconstruct it? 
[10:12:55] and see what can be polished / merged easily via a puppet swat [10:13:13] the script that runs in the crontab uses git tag [10:13:21] so it is all about git reset --hard to the last good tag :} [10:13:52] rebases are usually quite easy, the faults tend to happen when a patch gets merged in puppet.git but the local repo still has some outdated version of it [10:16:52] the reflog should also have the previous HEAD, in case the cron wasn't there [10:17:20] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [10:18:12] I am going to test a new alert on db1015 [10:18:25] it is a depooled host, so don't worry [10:19:21] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [10:19:55] _joe_: etcd is all happy on beta. Thank you! [10:21:28] (03PS2) 10Ema: prometheus: enable qdisc collector on cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/355391 (https://phabricator.wikimedia.org/T147569) [10:24:42] it is starting to lag, but no error yet [10:25:07] (03CR) 10Filippo Giunchedi: [C: 031] prometheus: enable qdisc collector on cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/355391 (https://phabricator.wikimedia.org/T147569) (owner: 10Ema) [10:25:55] (03PS3) 10Ema: prometheus: enable qdisc collector on cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/355391 (https://phabricator.wikimedia.org/T147569) [10:26:01] (03CR) 10Ema: [V: 032 C: 032] prometheus: enable qdisc collector on cache hosts [puppet] - 10https://gerrit.wikimedia.org/r/355391 (https://phabricator.wikimedia.org/T147569) (owner: 10Ema) [10:29:27] (03PS4) 10Hashar: [WIP] logstash: send errors to sentry [puppet] - 10https://gerrit.wikimedia.org/r/263024 (https://phabricator.wikimedia.org/T85239) (owner: 10Gergő Tisza) [10:30:38] (03CR) 10jerkins-bot: [V: 04-1] [WIP] logstash: send errors to sentry [puppet] - 10https://gerrit.wikimedia.org/r/263024
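[editor's note] The cherry-pick recovery described above (the update cron tags the last known-good HEAD, so a bad rebase is one reset away, with the reflog as a fallback) can be demonstrated with a throwaway repo; the tag name `snapshot-latest` and the file contents are assumptions for illustration, not the actual beta cron's naming:

```shell
# Demonstrate the recovery flow: tag the known-good state (as the cron
# would), simulate a bad rebase, then reset back to the tag.
repo=$(mktemp -d) && cd "$repo" && git init -q .
git config user.email puppet@example.org
git config user.name puppet

echo 'role::beta' > site.pp
git add site.pp && git commit -qm 'known good state'
git tag snapshot-latest               # hypothetical tag written by the cron

echo 'broken rebase result' > site.pp
git commit -qam 'bad rebase'          # the state after the cherry-picks were lost

git reset -q --hard snapshot-latest   # back to the last good tag
# If the tag were missing, the reflog still records the previous HEAD:
#   git reset --hard 'HEAD@{1}'
cat site.pp                           # → role::beta
```

The reflog route works even without the cron's tag, as noted in the chat, since every branch move leaves an entry behind.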
(https://phabricator.wikimedia.org/T85239) (owner: 10Gergő Tisza) [10:31:33] (03CR) 10Hashar: "Rebased since role::logstash::* classes have been split to their own files." [puppet] - 10https://gerrit.wikimedia.org/r/263024 (https://phabricator.wikimedia.org/T85239) (owner: 10Gergő Tisza) [10:33:50] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2154777 [10:34:20] PROBLEM - configured eth on ms-be1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:34:23] PROBLEM - MD RAID on ms-be1021 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:35:11] RECOVERY - configured eth on ms-be1021 is OK: OK - interfaces up [10:35:11] RECOVERY - MD RAID on ms-be1021 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [10:36:10] PROBLEM - MegaRAID on db1015 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [10:37:27] (03PS1) 10Alexandros Kosiaris: Decouple wikidata monitoring from the IP address [puppet] - 10https://gerrit.wikimedia.org/r/355411 [10:38:51] (03CR) 10Hashar: [WIP] logstash: send errors to sentry (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/263024 (https://phabricator.wikimedia.org/T85239) (owner: 10Gergő Tisza) [10:38:58] (03PS5) 10Hashar: [WIP] logstash: send errors to sentry [puppet] - 10https://gerrit.wikimedia.org/r/263024 (https://phabricator.wikimedia.org/T85239) (owner: 10Gergő Tisza) [10:40:27] (03CR) 10Muehlenhoff: [C: 032] Use gdb from jessie-backports on jessie [puppet] - 10https://gerrit.wikimedia.org/r/355110 (owner: 10Muehlenhoff) [10:40:32] (03PS4) 10Muehlenhoff: Use gdb from jessie-backports on jessie [puppet] - 10https://gerrit.wikimedia.org/r/355110 [10:41:43] (03CR) 10Hashar: "I have applied PS5 on the beta cluster. 
Puppet on deployment-logstash02 seems to pass, I have no idea whether the sentry service still wo" [puppet] - 10https://gerrit.wikimedia.org/r/263024 (https://phabricator.wikimedia.org/T85239) (owner: 10Gergő Tisza) [10:43:03] !log upgrade prometheus-node-exporter on cache hosts to 0.14.0~git20170523-0 T160156 [10:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:11] T160156: Add node_exporter ipvs ipv6 support - https://phabricator.wikimedia.org/T160156 [10:43:28] heh [10:43:36] !log upgrade prometheus-node-exporter on lvs hosts to 0.14.0~git20170523-0 T160156 [10:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:04] (03PS1) 10Muehlenhoff: Revert "Use gdb from jessie-backports on jessie" [puppet] - 10https://gerrit.wikimedia.org/r/355412 [10:45:10] PROBLEM - puppet last run on ms-fe2007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:45:20] PROBLEM - puppet last run on ms-be1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:45:20] PROBLEM - puppet last run on mw1232 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:45:20] PROBLEM - puppet last run on copper is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:45:20] PROBLEM - puppet last run on cp1063 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:45:20] (03CR) 10Muehlenhoff: [V: 032 C: 032] Revert "Use gdb from jessie-backports on jessie" [puppet] - 10https://gerrit.wikimedia.org/r/355412 (owner: 10Muehlenhoff) [10:45:20] PROBLEM - puppet last run on aqs1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:45:20] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:45:21] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:45:30] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:45:30] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:45:40] PROBLEM - puppet last run on db1087 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:45:40] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:45:40] PROBLEM - puppet last run on elastic1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:45:41] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:45:41] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:45:41] PROBLEM - puppet last run on cp2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:45:41] PROBLEM - puppet last run on darmstadtium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:45:42] PROBLEM - puppet last run on wtp1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:45:42] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:45:43] PROBLEM - puppet last run on db1048 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues
[10:45:43] PROBLEM - puppet last run on poolcounter1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:45:44] PROBLEM - puppet last run on db2075 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:45:46] <_joe_> uh?
[10:45:50] PROBLEM - puppet last run on cp2022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:45:50] PROBLEM - puppet last run on es2011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:45:50] <_joe_> what's this?
[10:45:50] PROBLEM - puppet last run on db2064 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:45:50] PROBLEM - puppet last run on mw2202 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:45:54] <_joe_> oh I see
[10:45:59] just reverted, stupid puppet
[10:46:10] PROBLEM - puppet last run on acamar is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:46:10] PROBLEM - puppet last run on elastic2026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:46:10] RECOVERY - MegaRAID on db1015 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[10:46:10] PROBLEM - puppet last run on mw2217 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:46:10] PROBLEM - puppet last run on mw2238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:46:11] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:46:26] !log stopped temporarily ircecho to avoid alert spam
[10:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:46] 06Operations, 10netops, 13Patch-For-Review: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387#3288680 (10akosiaris) A `tcpdump -vvvv -ttt -i eth0 icmp6 and 'ip6[40] = 134'` on cp3036 shows RAs still being received by the box with i...
[10:48:55] (03PS1) 10Alexandros Kosiaris: Revert "Drain esams for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/355413 (https://phabricator.wikimedia.org/T133387)
[10:49:07] (03PS1) 10Volans: Puppet: more reliable run-puppet-agent [puppet] - 10https://gerrit.wikimedia.org/r/355414
[10:49:09] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "Drain esams for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/355413 (https://phabricator.wikimedia.org/T133387) (owner: 10Alexandros Kosiaris)
[10:49:40] RECOVERY - puppet last run on cp1062 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[10:49:40] RECOVERY - puppet last run on cp2004 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[10:49:42] (03PS1) 10Mark Bergsma: Add some protocol BGP class test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355415
[10:49:44] (03PS1) 10Mark Bergsma: Move BGP tests into a sub package and add to pybal test suite [debs/pybal] - 10https://gerrit.wikimedia.org/r/355416
[10:50:03] !log repool esams T133387
[10:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:11] T133387: Enabling IGMP snooping on QFX switches breaks IPv6 (HTCP purges flood across codfw) - https://phabricator.wikimedia.org/T133387
[10:50:17] and ofc puppet just run on tegmen to re-enable ircecho
[10:50:20] RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[10:50:20] RECOVERY - puppet last run on cp1063 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[10:50:30] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[10:50:30] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[10:50:40] RECOVERY - puppet last run on wtp1022 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[10:50:40] RECOVERY - puppet last run on wtp1020 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[10:50:40] RECOVERY - puppet last run on wtp1009 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[10:50:41] RECOVERY - puppet last run on wtp1003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[10:50:50] RECOVERY - puppet last run on tureis is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[10:50:50] RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[10:50:59] (03PS1) 10Giuseppe Lavagetto: Add netlink-based Ipvsmanager implementation [debs/pybal] - 10https://gerrit.wikimedia.org/r/355417
[10:51:00] RECOVERY - puppet last run on mw2223 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[10:51:00] RECOVERY - puppet last run on cp2014 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[10:51:05] (03CR) 10jerkins-bot: [V: 04-1] Add netlink-based Ipvsmanager implementation [debs/pybal] - 10https://gerrit.wikimedia.org/r/355417 (owner: 10Giuseppe Lavagetto)
[10:51:10] RECOVERY - puppet last run on mw2200 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:51:23] (03CR) 10jerkins-bot: [V: 04-1] Move BGP tests into a sub package and add to pybal test suite [debs/pybal] - 10https://gerrit.wikimedia.org/r/355416 (owner: 10Mark Bergsma)
[10:51:50] RECOVERY - puppet last run on restbase2010 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[10:51:50] RECOVERY - puppet last run on cp2026 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[10:51:50] RECOVERY - puppet last run on mw2215 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[10:51:50] RECOVERY - puppet last run on mw2213 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[10:52:10] RECOVERY - puppet last run on mw2214 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[10:52:10] RECOVERY - puppet last run on mw2217 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[10:52:40] RECOVERY - puppet last run on mw1183 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[10:53:17] (03CR) 10Alexandros Kosiaris: [C: 032] Decouple wikidata monitoring from the IP address [puppet] - 10https://gerrit.wikimedia.org/r/355411 (owner: 10Alexandros Kosiaris)
[10:53:21] (03PS2) 10Alexandros Kosiaris: Decouple wikidata monitoring from the IP address [puppet] - 10https://gerrit.wikimedia.org/r/355411
[10:54:40] RECOVERY - puppet last run on bohrium is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[10:54:40] RECOVERY - puppet last run on analytics1066 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[10:54:40] RECOVERY - puppet last run on analytics1063 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[10:54:40] RECOVERY - puppet last run on analytics1064 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[10:54:41] RECOVERY - puppet last run on cp1008 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[10:54:50] RECOVERY - puppet last run on cp2022 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[10:55:20] RECOVERY - puppet last run on mw1165 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[10:55:30] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[10:55:30] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[10:55:30] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[10:55:40] RECOVERY - puppet last run on mw1163 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[10:55:41] RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[10:56:20] RECOVERY - puppet last run on aqs1005 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[10:56:20] RECOVERY - puppet last run on aqs1004 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[10:56:30] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[10:56:30] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[10:56:30] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[10:56:40] RECOVERY - puppet last run on cp2007 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[10:56:43] (03PS2) 10Mark Bergsma: Move BGP tests into a sub package and add to pybal test suite [debs/pybal] - 10https://gerrit.wikimedia.org/r/355416
[10:56:50] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[10:57:20] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[10:58:20] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[10:58:20] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[10:58:20] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[10:58:21] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[10:58:30] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[10:58:30] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[10:58:30] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[10:58:30] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[10:58:30] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[10:58:30] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[10:58:30] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[10:58:50] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[10:59:00] RECOVERY - puppet last run on cp2018 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[10:59:30] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[10:59:30] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[11:00:20] RECOVERY - puppet last run on elastic1028 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[11:00:30] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[11:00:30] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[11:01:00] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[11:01:20] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[11:01:21] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[11:02:37] (03CR) 10Alexandros Kosiaris: [C: 031] calico: add new version 2.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/355392 (https://phabricator.wikimedia.org/T165024) (owner: 10Giuseppe Lavagetto)
[11:03:00] RECOVERY - puppet last run on cp2021 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[11:03:40] RECOVERY - puppet last run on mw1225 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[11:03:40] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[11:03:41] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[11:03:41] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[11:04:09] (03CR) 10Ema: [C: 031] Move BGP tests into a sub package and add to pybal test suite [debs/pybal] - 10https://gerrit.wikimedia.org/r/355416 (owner: 10Mark Bergsma)
[11:04:20] RECOVERY - puppet last run on ms-be1033 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[11:04:20] RECOVERY - puppet last run on ms-be1035 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[11:05:20] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[11:05:25] (03CR) 10Alexandros Kosiaris: "actually let's do eqiad first ?" [puppet] - 10https://gerrit.wikimedia.org/r/355393 (https://phabricator.wikimedia.org/T165024) (owner: 10Giuseppe Lavagetto)
[11:05:40] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[11:06:30] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[11:06:30] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[11:06:30] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[11:06:31] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[11:07:30] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[11:08:30] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[11:08:30] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[11:08:30] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[11:11:01] PROBLEM - DPKG on meitnerium is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:11:40] RECOVERY - puppet last run on lvs1006 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[11:13:10] RECOVERY - puppet last run on ms-fe2007 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[11:13:20] RECOVERY - puppet last run on mw1232 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[11:13:40] RECOVERY - puppet last run on db1087 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[11:13:40] RECOVERY - puppet last run on darmstadtium is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[11:13:40] RECOVERY - puppet last run on db1048 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[11:13:40] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[11:13:41] RECOVERY - puppet last run on db2075 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[11:13:50] RECOVERY - puppet last run on elastic2008 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[11:14:10] RECOVERY - DPKG on meitnerium is OK: All packages OK
[11:14:10] RECOVERY - puppet last run on elastic2026 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[11:14:10] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[11:14:10] RECOVERY - puppet last run on heze is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[11:14:20] RECOVERY - puppet last run on ms-be1029 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[11:14:20] RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[11:14:40] RECOVERY - puppet last run on lvs1007 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[11:14:40] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[11:14:40] RECOVERY - puppet last run on mw1285 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[11:14:40] RECOVERY - puppet last run on mw1276 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[11:14:40] RECOVERY - puppet last run on elastic1052 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[11:14:40] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[11:14:40] RECOVERY - puppet last run on db1061 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[11:14:41] RECOVERY - puppet last run on poolcounter1002 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[11:14:50] RECOVERY - puppet last run on es2011 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[11:14:50] RECOVERY - puppet last run on db2064 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[11:14:50] RECOVERY - puppet last run on mw2160 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[11:15:10] RECOVERY - puppet last run on mw2232 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[11:15:11] RECOVERY - puppet last run on db2078 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[11:15:11] RECOVERY - puppet last run on db2081 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[11:15:11] RECOVERY - puppet last run on mw2238 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[11:15:11] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[11:15:12] 06Operations, 10Monitoring, 10Traffic, 15User-fgiunchedi: Add node_exporter ipvs ipv6 support - https://phabricator.wikimedia.org/T160156#3288748 (10ema) 05Open>03Resolved Fixed! We now have per-IPv4/IPv6 backend metrics available: ``` node_ipvs_backend_connections_active{local_address="2620:0:863:ed1...
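The T160156 comment above quotes the new per-backend ipvs metric in Prometheus exposition format (the sample is truncated in the log). As a minimal sketch of what consuming such a line involves, the standard-library snippet below parses one exposition-format sample and classifies its `local_address` label as IPv4 or IPv6; the metric name matches the one quoted in the task, but the label values are invented for illustration:

```python
import re
import ipaddress

# Hypothetical sample line; only the metric name comes from T160156.
SAMPLE = ('node_ipvs_backend_connections_active'
          '{local_address="2620:0:863:ed1a::1",local_port="443"} 42')

LINE_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
                     r'\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_sample(line):
    """Parse one exposition-format line into (name, labels dict, float value)."""
    m = LINE_RE.match(line)
    labels = dict(LABEL_RE.findall(m.group("labels")))
    return m.group("name"), labels, float(m.group("value"))

def address_family(labels):
    """Return 4 or 6 depending on the local_address label."""
    return ipaddress.ip_address(labels["local_address"]).version

name, labels, value = parse_sample(SAMPLE)
print(name, address_family(labels), value)
```

This is only a toy parser for single samples; a real consumer would use a Prometheus client library rather than regexes.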
[11:15:20] RECOVERY - puppet last run on mw1247 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[11:15:20] RECOVERY - puppet last run on mw1248 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[11:15:20] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[11:15:21] RECOVERY - puppet last run on ms-fe1008 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[11:15:21] RECOVERY - puppet last run on mw1245 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[11:15:21] RECOVERY - puppet last run on etcd1006 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[11:15:30] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[11:15:40] RECOVERY - puppet last run on db1084 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[11:15:40] RECOVERY - puppet last run on elastic1050 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[11:15:40] RECOVERY - puppet last run on elastic1039 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[11:15:40] RECOVERY - puppet last run on ganeti1004 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[11:15:40] RECOVERY - puppet last run on kubernetes1002 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[11:15:40] RECOVERY - puppet last run on fermium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:15:40] RECOVERY - puppet last run on lvs1004 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[11:15:41] RECOVERY - puppet last run on maps2002 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[11:15:50] RECOVERY - puppet last run on maps-test2001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[11:15:50] RECOVERY - puppet last run on scb2005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:15:50] RECOVERY - puppet last run on mc2023 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[11:15:50] RECOVERY - puppet last run on mw2147 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[11:15:50] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[11:16:10] RECOVERY - puppet last run on mw2105 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[11:16:10] RECOVERY - puppet last run on elastic2006 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[11:16:10] RECOVERY - puppet last run on mw2242 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[11:16:10] RECOVERY - puppet last run on mw2114 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[11:16:10] RECOVERY - puppet last run on wdqs2003 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[11:16:20] RECOVERY - puppet last run on scb2006 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[11:16:21] RECOVERY - puppet last run on mw2113 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[11:16:21] RECOVERY - puppet last run on mw2174 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[11:16:21] RECOVERY - puppet last run on mw2168 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[11:16:21] RECOVERY - puppet last run on kafka1014 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[11:16:40] RECOVERY - puppet last run on db1103 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[11:16:40] RECOVERY - puppet last run on dbproxy1008 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[11:16:50] RECOVERY - puppet last run on mw2110 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[11:16:50] RECOVERY - puppet last run on db2034 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[11:17:00] RECOVERY - puppet last run on mw2258 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[11:17:10] RECOVERY - puppet last run on mw2186 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[11:18:20] PROBLEM - Check correctness of the icinga configuration on tegmen is CRITICAL: Icinga configuration contains errors
[11:19:23] (03CR) 10Mark Bergsma: [C: 032] Move BGP tests into a sub package and add to pybal test suite [debs/pybal] - 10https://gerrit.wikimedia.org/r/355416 (owner: 10Mark Bergsma)
[11:19:42] akosiaris: Error: Could not find any host matching 'wikidata' (config file '/etc/nagios/nagios_service.cfg', starting on line 4744)
[11:22:07] grrr
[11:22:10] * akosiaris looking
[11:23:42] (03PS1) 10Alexandros Kosiaris: Specify the correct host for wikidata icinga config [puppet] - 10https://gerrit.wikimedia.org/r/355421
[11:23:43] akosiaris: seems we reference host "wikidata" in modules/icinga/manifests/monitor/wikidata.pp
[11:23:56] yep, that one :)
[11:24:02] yeah, PEBKAC
[11:24:03] sorry
[11:24:09] (03PS3) 10Mark Bergsma: Move BGP tests into a sub package and add to pybal test suite [debs/pybal] - 10https://gerrit.wikimedia.org/r/355416
[11:24:11] (03PS2) 10Mark Bergsma: Add some protocol BGP class test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355415
[11:24:14] (03PS2) 10Alexandros Kosiaris: Specify the correct host for wikidata icinga config [puppet] - 10https://gerrit.wikimedia.org/r/355421
[11:24:20] np
[11:24:20] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Specify the correct host for wikidata icinga config [puppet] - 10https://gerrit.wikimedia.org/r/355421 (owner: 10Alexandros Kosiaris)
[11:27:20] (03CR) 10Mark Bergsma: [C: 032] Move BGP tests into a sub package and add to pybal test suite [debs/pybal] - 10https://gerrit.wikimedia.org/r/355416 (owner: 10Mark Bergsma)
[11:28:21] RECOVERY - Check correctness of the icinga configuration on tegmen is OK: Icinga configuration is correct
[11:28:35] (03Merged) 10jenkins-bot: Move BGP tests into a sub package and add to pybal test suite [debs/pybal] - 10https://gerrit.wikimedia.org/r/355416 (owner: 10Mark Bergsma)
[11:36:28] !log uploaded puppet_3.8.5-2~bpo8+2 to apt.wikimedia.org
[11:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:51] (03PS1) 10Alexandros Kosiaris: logrotate: Fix uwsgi postrotate script [puppet] - 10https://gerrit.wikimedia.org/r/355423
[11:39:07] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#3288813 (10Ottomata) > any further discussion should go to a new ticket or (better) on the puppet coding rules talk page on wikitech. Ah ok, I went back...
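The "Could not find any host matching 'wikidata'" error above is Icinga 1.x aborting config validation because a `define service` block's `host_name` referenced a host object that was never defined. A hedged sketch of the failure mode (all object names here are hypothetical; the actual fix is Gerrit change 355421):

```
# Broken: no "define host" with host_name "wikidata" exists anywhere,
# so Icinga rejects the whole configuration during validation.
define service {
    host_name              wikidata              ; no matching host object
    service_description    wikidata.org dispatch
    check_command          check_https!wikidata.org
}

# Fixed: the service points at a host object that is actually defined
# elsewhere in the generated configuration.
define service {
    host_name              icinga-host.example
    service_description    wikidata.org dispatch
    check_command          check_https!wikidata.org
}
```

Because every `host_name` is resolved at validation time, a single dangling reference like this takes down config reloads fleet-wide, which is why the tegmen "configuration contains errors" alert fired until the patch was merged.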
[11:39:16] (03PS2) 10Alexandros Kosiaris: logrotate: Fix uwsgi postrotate script [puppet] - 10https://gerrit.wikimedia.org/r/355423
[11:39:19] (03CR) 10Alexandros Kosiaris: [C: 032] logrotate: Fix uwsgi postrotate script [puppet] - 10https://gerrit.wikimedia.org/r/355423 (owner: 10Alexandros Kosiaris)
[11:52:29] !log progressively adding "remove-private" to ix4/6 and transit4/6 bgp groups on cr2-esams T83037
[11:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:36] T83037: Filter outgoing BGP announcements on AS regex - https://phabricator.wikimedia.org/T83037
[11:56:13] (03CR) 10Filippo Giunchedi: [C: 031] Puppet: more reliable run-puppet-agent [puppet] - 10https://gerrit.wikimedia.org/r/355414 (owner: 10Volans)
[12:07:37] 06Operations, 06Operations-Software-Development, 10Pybal, 10Traffic, 13Patch-For-Review: Unhandled pybal error causing services to be depooled in etcd but not in lvs - https://phabricator.wikimedia.org/T134893#2281050 (10fgiunchedi) Something similar just happened on lvs1003, where thumbor1001 isn't bein...
[12:08:08] !log bounce pybal on lvs1003 - T134893
[12:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:16] T134893: Unhandled pybal error causing services to be depooled in etcd but not in lvs - https://phabricator.wikimedia.org/T134893
[12:08:59] <_joe_> godog: hah that has to do with alex deploying the change on the etcd systemd unit
[12:09:46] !log updating puppet on puppetmaster2002
[12:09:52] because of reconnections not handled or ? _joe_
[12:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:27] <_joe_> godog: uhm
[12:13:31] (03PS1) 10Mark Bergsma: Add bgp.ip unit test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355425
[12:14:01] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2022571
[12:16:12] 06Operations, 10netops: Filter outgoing BGP announcements on AS regex - https://phabricator.wikimedia.org/T83037#3288863 (10ayounsi) Pushed to all cr* in AMS. BGP sessions and advertised routes haven't changed. Will roll it to more sites shortly.
[12:20:55] !log upgrade application servers using HHVM 3.18 to the latest 3.18.2+wmf4 build
[12:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:39] (03CR) 10VolkerE: [C: 04-1] dynamicproxy: Make use of errorpage template (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/350494 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle)
[12:42:23] 06Operations, 10ops-eqiad, 15User-Elukey, 15User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3288899 (10Cmjohnson) @Joe in order for me to do add 30 servers to row C I will need to remove mw1161-1200 from row C. Even you want to go with option A which is a messy way of rac...
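T134893 above describes pybal losing sync after an unhandled error: a backend (thumbor1001) was depooled in etcd but stayed pooled in LVS until pybal was bounced. Setting pybal's actual internals aside, the underlying repair is a reconciliation step between the desired state (etcd) and the actual state (LVS); a toy sketch with invented names:

```python
def reconcile(etcd_pooled, lvs_pooled):
    """Return (to_add, to_remove) so that LVS converges to the etcd view.

    etcd_pooled: set of backend names pooled according to etcd (desired state)
    lvs_pooled:  set of backend names currently pooled in LVS (actual state)
    """
    to_add = etcd_pooled - lvs_pooled     # pooled in etcd, missing from LVS
    to_remove = lvs_pooled - etcd_pooled  # stale entries, e.g. a lost depool event
    return to_add, to_remove

# Example mirroring the incident: a depool recorded in etcd never reached LVS.
desired = {"thumbor1002", "thumbor1003"}
actual = {"thumbor1001", "thumbor1002", "thumbor1003"}
print(reconcile(desired, actual))
```

Running such a reconciliation periodically (or on reconnect) makes a missed watch event self-healing instead of requiring a manual `bounce pybal`.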
[12:51:04] errors on SpecialRecentChanges::doMainQuery seem unusually high, cc deployers
[12:51:38] let me see if I see a pattern
[12:52:05] it is a single server/wiki
[12:52:37] it is not new
[12:54:02] PROBLEM - DPKG on mw2192 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[12:55:00] (03PS12) 10Krinkle: dynamicproxy: Make use of errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350494 (https://phabricator.wikimedia.org/T113114)
[12:55:01] RECOVERY - DPKG on mw2192 is OK: All packages OK
[12:55:59] (03CR) 10Krinkle: dynamicproxy: Make use of errorpage template (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/350494 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle)
[12:56:08] (03PS4) 10Krinkle: varnish: Switch browsersec to use errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/355338 (https://phabricator.wikimedia.org/T113114)
[12:59:26] o/
[12:59:30] jouncebot: refresh
[12:59:32] I refreshed my knowledge about deployments.
[12:59:33] jouncebot: next
[12:59:34] In 0 hour(s) and 0 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170524T1300)
[13:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170524T1300).
[13:00:22] nothing for this slot apparently
[13:14:46] !log upload prometheus-hhvm-exporter 0.3-1 to jessie-wikimedia - T158286
[13:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:56] T158286: Raise default logging level of prometheus-hhvm-exporter - https://phabricator.wikimedia.org/T158286
[13:18:36] 06Operations, 10Ops-Access-Requests: Access to search logs for Jan Dittrich - https://phabricator.wikimedia.org/T165943#3289001 (10akosiaris) @elukey @ottomata, I am not sure what this entails, care to help? I am looking at https://wikitech.wikimedia.org/wiki/Analytics/Data_access and I am not sure which of th...
[13:23:37] 06Operations, 10Ops-Access-Requests: Access to search logs for Jan Dittrich - https://phabricator.wikimedia.org/T165943#3289003 (10Ottomata) I'm also not totally sure what data Jan is looking for, but if I had to guess, it would be webrequest logs, which would mean that `analytics-privatedata-users` group is t...
[13:26:44] 06Operations, 10ops-eqiad, 15User-Elukey, 15User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3268837 (10elukey) >>! In T165519#3288899, @Cmjohnson wrote: > I +1 adding 30 servers to row C, 6 each to A and B and removing the > existing servers. This will require more work on...
[13:28:47] !log slowly upgrading facter across the fleet checking is a noop T166203
[13:28:53] 06Operations, 10Ops-Access-Requests: Access to search logs for Jan Dittrich - https://phabricator.wikimedia.org/T165943#3289017 (10Jan_Dittrich) > I'm also not totally sure what data Jan is looking for We would like to find out with parameters (like AND, OR, intitle: …) and namespaces (like help:…) users use...
[13:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:28:55] T166203: Upgrade facter to version 2.4.6 - https://phabricator.wikimedia.org/T166203
[13:33:21] 06Operations, 10ops-eqiad, 15User-Elukey, 15User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3289025 (10Cmjohnson) @elukey I think moving them in batch would be ideal and have the least amount of impact but the servers have to be removed/replaced in order. Replacing them in...
[13:37:20] (03CR) 10VolkerE: [C: 031] dynamicproxy: Make use of errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350494 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle)
[13:43:39] (03PS3) 10Ottomata: Changes needed for upgrading to Druid 0.10 [puppet] - 10https://gerrit.wikimedia.org/r/351691 (https://phabricator.wikimedia.org/T164008)
[13:43:46] (03CR) 10Ottomata: [V: 032 C: 032] Changes needed for upgrading to Druid 0.10 [puppet] - 10https://gerrit.wikimedia.org/r/351691 (https://phabricator.wikimedia.org/T164008) (owner: 10Ottomata)
[13:47:17] (03PS3) 10Filippo Giunchedi: Test for unreferenced files introduced by changes [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/354939
[13:48:22] 06Operations, 07HHVM, 07Upstream: HHVM segfault in memory cleanup - https://phabricator.wikimedia.org/T162586#3289051 (10MoritzMuehlenhoff) Yeah, that was even possible with +wmf3 (ran the crashing test manually in vagrant), but even more so with +wmf4.
[13:50:21] PROBLEM - puppet last run on mw2187 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 17 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[hhvm],Package[hhvm-dbg]
[13:53:06] (03CR) 10Muehlenhoff: "I had to revert this earlier the day; the catalogue compiled fine in PCC, but when applied to hosts, it bailed like this:" [puppet] - 10https://gerrit.wikimedia.org/r/355110 (owner: 10Muehlenhoff)
[13:53:21] RECOVERY - puppet last run on mw2187 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[13:54:52] !log upgrade Druid daemons on druid100[123] to 0.10 - T164008
[13:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:01] T164008: Update druid to latest release - https://phabricator.wikimedia.org/T164008
[13:57:15] (03CR) 10Hashar: "See also the pending patch https://gerrit.wikimedia.org/r/#/c/349413/ which is factoring out the code to load $wgConf . I was looking fo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344798 (https://phabricator.wikimedia.org/T161416) (owner: 10Dereckson)
[13:58:34] (03PS2) 10Andrew Bogott: Tidy up tools node motd [puppet] - 10https://gerrit.wikimedia.org/r/354668
[13:59:23] !log cr2-esams: enabling netflows experimentally
[13:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:41] !log Start running pt-table-checksum on s1 (will not run over night for now) - T162807
[13:59:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:48] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[14:01:46] (03CR) 10Andrew Bogott: [C: 032] Tidy up tools node motd [puppet] - 10https://gerrit.wikimedia.org/r/354668 (owner: 10Andrew Bogott)
[14:02:31] (03PS2) 10Volans: Puppet: more reliable run-puppet-agent [puppet] - 10https://gerrit.wikimedia.org/r/355414
[14:04:36] !log installing jasper security updates on trusty (jessie already fixed)
[14:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:49] 06Operations, 10Mail, 10Wikimedia-Mailing-lists, 05Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529#3289082 (10Aklapper) >>! In T160529#3260174, @jayvdb wrote: > Please can we get a proper solution soon. Would that be T160529#3164928 or what would be "a proper solution"? (Tryin...
[14:08:28] (03CR) 10Volans: [C: 032] Puppet: more reliable run-puppet-agent [puppet] - 10https://gerrit.wikimedia.org/r/355414 (owner: 10Volans)
[14:08:42] (03PS4) 10Filippo Giunchedi: Test for unreferenced files introduced by changes [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/354939
[14:08:50] (03CR) 10Filippo Giunchedi: Test for unreferenced files introduced by changes (032 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/354939 (owner: 10Filippo Giunchedi)
[14:18:08] 06Operations, 10ops-eqiad, 15User-Elukey, 15User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3289089 (10Joe) @Cmjohnson I suggest we do the following: # Start with racking the 12 servers in row A and B # Decom mw1170-mw1179, and replace them with new systems. # Wait for me...
[14:19:46] (03CR) 10Rush: "yes please, un-mickey-mousing this is a positive thing." [puppet] - 10https://gerrit.wikimedia.org/r/354668 (owner: 10Andrew Bogott)
[14:20:39] (03PS2) 10Rush: tools: have maintain-kubeusers chown $HOME/.kube [puppet] - 10https://gerrit.wikimedia.org/r/354839 (https://phabricator.wikimedia.org/T165875) (owner: 10BryanDavis)
[14:21:31] (03PS1) 10Gilles: Upgrade to 0.1.39 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/355429 (https://phabricator.wikimedia.org/T151065)
[14:22:33] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 0.1.39 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/355429 (https://phabricator.wikimedia.org/T151065) (owner: 10Gilles)
[14:22:44] (03CR) 10Hashar: phpunit: factor out logic to handle globals vars (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349413 (owner: 10Hashar)
[14:27:25] (03CR) 10Rush: [C: 032] tools: have maintain-kubeusers chown $HOME/.kube [puppet] - 10https://gerrit.wikimedia.org/r/354839 (https://phabricator.wikimedia.org/T165875) (owner: 10BryanDavis)
[14:28:08] (03CR) 10Volans: "LGTM in general, few minor comments inline" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/354007 (owner: 10Filippo Giunchedi)
[14:38:48] (03PS5) 10Hashar: phpunit: factor out logic to handle globals vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349413
[14:40:18] (03PS1) 10Ottomata: Release 0.10.0-2 [debs/druid] - 10https://gerrit.wikimedia.org/r/355430 (https://phabricator.wikimedia.org/T164008)
[14:40:47] (03CR) 10Hashar: "I went back to PS1 and rebased it.
We have to ensure globals have been restored when data providers mess up with them, else they would " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349413 (owner: 10Hashar) [14:42:16] (03PS1) 10Muehlenhoff: Extend expiry dates for two accounts [puppet] - 10https://gerrit.wikimedia.org/r/355431 [14:44:17] (03CR) 10Muehlenhoff: [C: 032] Extend expiry dates for two accounts [puppet] - 10https://gerrit.wikimedia.org/r/355431 (owner: 10Muehlenhoff) [14:44:59] chasemp: can I puppet-merge your kube change along? [14:45:24] moritzm: yes shit, sorry I merged it on puppetmaster1002 [14:45:34] k, done [14:45:35] I need to break myself of that autocomplete [14:46:02] (03CR) 10VolkerE: [C: 031] mediawiki: Define 'mediawiki::errorpage' to simplify usage [puppet] - 10https://gerrit.wikimedia.org/r/355257 (owner: 10Krinkle) [14:47:00] (03CR) 10VolkerE: [C: 031] varnish: Convert errorpage into re-usable template [puppet] - 10https://gerrit.wikimedia.org/r/350493 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [14:54:09] !log uploaded gerrit 2.13.8+wmf2 to apt.wikimedia.org [14:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:40] \o/ [14:58:32] paravoid, hi, i wanted to ask about https://gerrit.wikimedia.org/r/#/c/354041/ . If i was to make it a erb template for redis.conf and then make the os_version a variable in the pp file and use that variable i just defined in the erb template to add the stuff aimed at just jessie+ would that work? Or am i thinking wrong? 
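The erb-template approach paladox asks about above could look roughly like the sketch below. This is a hypothetical illustration only: the `@os_version` variable name, the fact lookup, and the file layout are assumptions, not the actual WMF puppet module; and as noted later in the discussion, the real work is deciding which redis directives are valid on both the jessie and stretch versions.

```erb
# Hypothetical sketch of a distro-gated redis.conf.erb (illustration only).
# The wrapping .pp file would set something like:
#   $os_version = $facts['os']['release']['major']
# and pass it into the template scope as @os_version.
<% if @os_version.to_i >= 9 -%>
# directives understood only by the redis version shipped with stretch (9+)
<% end -%>
# directives common to jessie and stretch follow here
```

In practice the reviewers preferred distro-specific settings files over a single conditional template, which keeps each distro's config readable on its own.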
[15:00:58] !log deploy thumbor 0.1.39 for memcache-based throttling - T151065 [15:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:06] T151065: Implement DC-local cache failure limiter in Thumbor - https://phabricator.wikimedia.org/T151065 [15:04:01] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 2 [15:19:39] 06Operations, 10Ops-Access-Requests: Access to search logs for Jan Dittrich - https://phabricator.wikimedia.org/T165943#3289230 (10Ottomata) Ah, ok. I betcha this is in the cirrussearchrequestset Hive table maintained by the Discovery folks. This data is currently accessible by the `analytics-users` group.... [15:20:53] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3289234 (10Papaul) @akosiaris can you please provide me with a partman recipe to use? Thanks [15:21:04] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "The way this is intended to work is that we create a settings file that is distro-specific, given the versions might differ." [puppet] - 10https://gerrit.wikimedia.org/r/354041 (owner: 10Paladox) [15:21:19] <_joe_> paladox: I took a look, that patch is to be thrown away [15:25:31] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [15:26:21] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [15:30:47] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3289272 (10akosiaris) IIRC these boxes don't have a RAID controller, so let's go for a RAID1 with LVM. Seems like `raid1-lvm.cfg` (https://github.com/wikimedia/puppet/blob/production/mo... [15:31:20] _joe_ thanks, but it fails on stretch [15:31:25] how would we fix it on stretch? 
[15:31:40] (03Abandoned) 10Paladox: redis: add support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/354041 (owner: 10Paladox) [15:31:53] <_joe_> paladox: writing the config for stretch [15:32:23] Thanks :) [15:32:24] <_joe_> which means someone who studied the changelog of redis between the jessie version and the stretch version must decide a baseline of configs to distribute everywhere [15:32:36] ok [15:32:38] <_joe_> that makes it a tad more complicated than moving things around in puppet [15:35:34] !log krinkle@tin Synchronized php-1.30.0-wmf.2/resources/Resources.php: Restore mediawiki.page.watch.ajax dependency - Iebfda85c7 (duration: 00m 42s) [15:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:00] _joe_ Are you doing that now? Or was that a suggestion? Should i file a task for somebody to do it? [15:43:16] <_joe_> paladox: no, I'm not [15:43:24] ok [15:45:09] 06Operations, 07Puppet, 06Labs: Update redis puppet class to support stretch - https://phabricator.wikimedia.org/T166233#3289296 (10Paladox) [15:45:21] (03PS2) 10Hashar: Test wgLogoHD keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344798 (https://phabricator.wikimedia.org/T161416) (owner: 10Dereckson) [15:45:23] (03PS1) 10Hashar: test: factor out wgConf loading [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355440 [15:45:40] !log test-upgrade grafana 4.3.1 on labmon1001 [15:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:39] (03CR) 10Paladox: Test wgLogoHD keys (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344798 (https://phabricator.wikimedia.org/T161416) (owner: 10Dereckson) [15:47:08] (03CR) 10Hashar: [V: 031] "I believe it is fine as is." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/349413 (owner: 10Hashar) [15:47:59] (03CR) 10Hashar: Test wgLogoHD keys (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344798 (https://phabricator.wikimedia.org/T161416) (owner: 10Dereckson) [15:49:00] (03CR) 10Hashar: [V: 031] "Rebased on top of https://gerrit.wikimedia.org/r/355440 test: factor out wgConf loading" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344798 (https://phabricator.wikimedia.org/T161416) (owner: 10Dereckson) [15:49:01] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [15:49:01] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [15:49:01] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [15:50:51] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [15:50:51] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [15:50:51] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [15:53:51] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:01:19] (03CR) 10Dereckson: "Nice work." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/355440 (owner: 10Hashar) [16:02:19] Dereckson: so yeah I went with the refactoring work :) [16:02:39] Dereckson: so in theory it should be now rather easy to test initialisesettings.php [16:05:01] (03PS2) 10Dzahn: DNS/Decom Remove mgmt DNS entries for ms-fe200[1-4] [dns] - 10https://gerrit.wikimedia.org/r/344651 (owner: 10Papaul) [16:06:30] (03CR) 10Dzahn: [C: 032] DNS/Decom Remove mgmt DNS entries for ms-fe200[1-4] [dns] - 10https://gerrit.wikimedia.org/r/344651 (owner: 10Papaul) [16:07:41] (03PS3) 10Andrew Bogott: openstackclients: add an optional project arg to allinstances() [puppet] - 10https://gerrit.wikimedia.org/r/354515 [16:11:16] (03PS1) 10Mark Bergsma: Add basic unit tests for protocol BGP send methods [debs/pybal] - 10https://gerrit.wikimedia.org/r/355445 [16:11:42] (03CR) 10Andrew Bogott: [C: 032] openstackclients: add an optional project arg to allinstances() [puppet] - 10https://gerrit.wikimedia.org/r/354515 (owner: 10Andrew Bogott) [16:12:25] (03PS3) 10Andrew Bogott: novastats: Update some reports to use more up-to-date code. [puppet] - 10https://gerrit.wikimedia.org/r/354516 [16:13:07] (03CR) 10Filippo Giunchedi: [C: 031] "Clarified in multi dc meeting the related patch was eventually merged" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355174 (https://phabricator.wikimedia.org/T160616) (owner: 10Aaron Schulz) [16:14:05] (03PS2) 10Ottomata: Release 0.10.0-2 [debs/druid] - 10https://gerrit.wikimedia.org/r/355430 (https://phabricator.wikimedia.org/T164008) [16:15:19] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3289364 (10RobH) @akosiaris: I'd actually recommend we go with something that uses a /srv/ mount and ext4, with no swap like most servers? Unless these are different, raid1-lvm puts th... [16:18:43] (03CR) 10Andrew Bogott: [C: 032] novastats: Update some reports to use more up-to-date code. 
[puppet] - 10https://gerrit.wikimedia.org/r/354516 (owner: 10Andrew Bogott) [16:19:19] (03Abandoned) 10Andrew Bogott: dynamicproxy: When rotating logs, HUP the nginx process. [puppet] - 10https://gerrit.wikimedia.org/r/348954 (owner: 10Andrew Bogott) [16:22:51] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [16:26:51] !log restarting varnish backend on cp1099 (mailbox lag) [16:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:51] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 0 [16:34:17] 06Operations, 10ops-codfw: Degraded RAID on ms-be2029 - https://phabricator.wikimedia.org/T166021#3289373 (10Papaul) a:05Papaul>03fgiunchedi Disk replacement complete [16:37:58] !log Stop pt-table-checksum on s1 - T162807 [16:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:07] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [16:39:39] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3289377 (10Papaul) Disk wipe complete system is backup with 1 disk [16:51:57] 06Operations, 10ops-codfw, 06cloud-services-team: rack/setup/install labtestvirt2002 - https://phabricator.wikimedia.org/T166237#3289416 (10RobH) [16:54:36] !log pause slowly upgrading facter across the fleet, resuming tomorrow T166203 [16:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:45] T166203: Upgrade facter to version 2.4.6 - https://phabricator.wikimedia.org/T166203 [16:56:02] !log restarting and upgrading db2047 [16:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:27] 06Operations, 10OCG-General, 10Reading-Community-Engagement, 06Reading-Web-Backlog, and 3 others: [EPIC] (Proposal) Replicate core 
OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3289449 (10bmansurov) [16:58:29] !log installing ghostscript regression update on trusty (jessie security update was not affected) [16:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:12] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:06:51] RECOVERY - puppet last run on ms-be2029 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [17:06:52] (03PS1) 10BBlack: r::c::perf - FQ outbound flow rate cap @ 1Gbps [puppet] - 10https://gerrit.wikimedia.org/r/355451 (https://phabricator.wikimedia.org/T147569) [17:07:11] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:07:55] (03PS2) 10BBlack: r::c::perf - FQ outbound flow rate cap @ 1Gbps [puppet] - 10https://gerrit.wikimedia.org/r/355451 (https://phabricator.wikimedia.org/T147569) [17:07:57] brief spike --^ [17:08:23] (03PS4) 10Andrew Bogott: Horizon: Add sudo policy panel [puppet] - 10https://gerrit.wikimedia.org/r/353156 (https://phabricator.wikimedia.org/T162097) [17:08:25] (03PS1) 10Andrew Bogott: Horizon sudo panel: Better distinguish between 'create' and 'modify' actions [puppet] - 10https://gerrit.wikimedia.org/r/355452 (https://phabricator.wikimedia.org/T162097) [17:09:11] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [17:09:25] re: ulsfo 5xx spike above - it was confined to the text cluster @ ulsfo, and it was 500s not 503s [17:09:46] so it's likely that specific clients (in western US or asia) were triggering applayer 500s there... [17:10:12] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [17:10:34] both of those new alerts are from the same thing, basically. 
the 5xx alerter is kind of horrible at being laggy/inaccurate about what's going on [17:11:26] the actual spike was confined to: text@ulsfo, 17:02 -> 17:05, and a peak rate of ~20/sec of 500s [17:11:34] indeed, btw I think it is cspreport [17:11:38] https://phabricator.wikimedia.org/T166229 [17:12:01] (the ~20/sec being out of ~10k/sec total reqs to text@ulsfo) [17:13:27] 06Operations, 10MediaWiki-General-or-Unknown, 06Security-Team, 10Traffic: Mediawiki replies with 500 on wrongly formatted CSP report - https://phabricator.wikimedia.org/T166229#3289487 (10fgiunchedi) [17:13:41] yeah godog seems right looking from https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X [17:13:53] 06Operations, 10Phabricator, 10Traffic: phab.wmfusercontent.org "homepage" yields a 500 - https://phabricator.wikimedia.org/T166120#3289489 (10fgiunchedi) [17:15:11] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:15:12] (03CR) 10BBlack: [C: 032] r::c::perf - FQ outbound flow rate cap @ 1Gbps [puppet] - 10https://gerrit.wikimedia.org/r/355451 (https://phabricator.wikimedia.org/T147569) (owner: 10BBlack) [17:16:11] 06Operations, 10Phabricator, 10Traffic: phab.wmfusercontent.org "homepage" yields a 500 - https://phabricator.wikimedia.org/T166120#3285916 (10jcrespo) T104735 [17:16:11] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:17:30] 06Operations, 10Phabricator, 10Traffic: phab.wmfusercontent.org "homepage" yields a 500 - https://phabricator.wikimedia.org/T166120#3285916 (10Dzahn) fwiw, class phabricator::redirector has: ``` 16 $alt_host = 'fab.wmfusercontent.org' ``` but "Host fab.wmfusercontent.org not found: 3(NXDOMAIN)".... 
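The Icinga alerts quoted above report things like "22.22% of data above the critical threshold [1000.0]". A minimal sketch of that percentage logic — not the real check_graphite script, which differs in details such as null handling and windowing — could look like this:

```python
def percent_above(datapoints, threshold):
    """Fraction of non-null datapoints strictly above `threshold`, in percent.

    Simplified reimplementation of the logic implied by alert text like
    "22.22% of data above the critical threshold [1000.0]"; the production
    check script is more elaborate (time windows, null policies, etc.).
    """
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    above = sum(1 for v in values if v > threshold)
    return 100.0 * above / len(values)

# Two of nine non-null datapoints above 1000 -> about 22.22%,
# matching the wording of the alerts above.
series = [120, 300, 1500, 2400, 250, None, 400, 180, 90, 310]
print(round(percent_above(series, 1000.0), 2))
```

This also shows why such alerts lag: a brief spike keeps the percentage elevated until the offending datapoints age out of the window.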
[17:18:34] 06Operations, 10Fundraising-Backlog, 07Technical-Debt: Determine if benefactorevents.wikimedia.org should be hosted on the production cluster or still on Microsoft Azure - https://phabricator.wikimedia.org/T166240#3289505 (10Dereckson) [17:22:39] 06Operations, 10ops-codfw, 10hardware-requests: Decomission ms-fe2001-4 - https://phabricator.wikimedia.org/T159413#3289521 (10Dzahn) and now it's actually resolved [17:25:19] 06Operations, 10ops-codfw: db2049 cannot install jessie - let's try upgrading the firmware first - https://phabricator.wikimedia.org/T165739#3289525 (10jcrespo) @Papaul - do you know when you will be able to work on this (upgrading firmware)? It is not that urgent, but data on it it will get outdated if it is... [17:28:30] 06Operations, 10MediaWiki-General-or-Unknown, 06Security-Team, 10Traffic: Mediawiki replies with 500 on wrongly formatted CSP report - https://phabricator.wikimedia.org/T166229#3288964 (10Legoktm) There's a code comment that says: ```lang=php // 500 so it shows up in browser's developer console. $this-... [17:29:10] 06Operations, 10Gerrit, 07LDAP, 06Release-Engineering-Team (Backlog): Remove user gerrit2 from ldap - https://phabricator.wikimedia.org/T160122#3289546 (10demon) Just did this with @MoritzMuehlenhoff. Shouldn't have any issues, but please reopen if anyone sees something broken. 
[17:29:18] 06Operations, 10Gerrit, 07LDAP, 06Release-Engineering-Team (Kanban): Remove user gerrit2 from ldap - https://phabricator.wikimedia.org/T160122#3289547 (10demon) 05Open>03Resolved a:03demon [17:30:32] 06Operations, 10Gerrit, 07LDAP, 06Release-Engineering-Team (Kanban): Remove user gerrit2 from ldap - https://phabricator.wikimedia.org/T160122#3289554 (10Paladox) Ah thanks :) [17:30:53] (03PS1) 10Dzahn: phabricator: set alt_host in redirector to "phab" instead of "fab" [puppet] - 10https://gerrit.wikimedia.org/r/355455 [17:31:21] PROBLEM - Check systemd state on db1099 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:31:31] PROBLEM - Check systemd state on db1103 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:32:02] (03CR) 10Paladox: [C: 031] phabricator: set alt_host in redirector to "phab" instead of "fab" [puppet] - 10https://gerrit.wikimedia.org/r/355455 (owner: 10Dzahn) [17:32:31] PROBLEM - Check systemd state on db1102 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:32:41] PROBLEM - Check systemd state on db1101 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:32:43] PROBLEM - Check systemd state on db1096 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:35:11] PROBLEM - Check systemd state on db2082 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:35:41] PROBLEM - Check systemd state on db2083 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:35:41] PROBLEM - Check systemd state on db2072 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:35:41] PROBLEM - Check systemd state on db2089 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:35:51] PROBLEM - Check systemd state on db2073 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:36:01] PROBLEM - Check systemd state on db2077 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:36:01] PROBLEM - Check systemd state on db2085 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:36:01] PROBLEM - Check systemd state on db2080 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:36:01] PROBLEM - Check systemd state on db2079 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:36:01] PROBLEM - Check systemd state on db2081 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:36:02] PROBLEM - Check systemd state on db2087 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:36:05] Hmm [17:36:11] PROBLEM - Check systemd state on db2084 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:36:22] Is that known? [17:36:41] PROBLEM - Check systemd state on db2074 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:36:41] PROBLEM - Check systemd state on db2090 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:36:51] PROBLEM - Check systemd state on db2076 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:36:51] PROBLEM - Check systemd state on db2086 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:37:01] PROBLEM - Check systemd state on db2075 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:37:03] PROBLEM - Check systemd state on db2088 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:37:11] PROBLEM - Check systemd state on db2091 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:37:21] PROBLEM - Check systemd state on db2078 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:37:38] May 24 17:29:41 db1096 systemd[1]: Unit prometheus-mysqld-exporter.service entered failed state. [17:37:52] seems to be prometheus-mysqld-exporter.service [17:37:53] ^ it's that, so probably not user-facing-critical [17:37:54] yeah sorry [17:38:01] PROBLEM - Check systemd state on db2092 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:38:10] it's not mariadb/mysql [17:38:12] ah [17:38:17] 06Operations, 10ops-codfw: db2049 cannot install jessie - let's try upgrading the firmware first - https://phabricator.wikimedia.org/T165739#3289561 (10Papaul) @jcrespo didn't know i supposed to work on this system, now i know will try to work on it tomorrow. [17:40:41] that is me [17:40:43] sorry [17:41:49] (03PS1) 10Ottomata: Revert "Changes needed for upgrading to Druid 0.10" [puppet] - 10https://gerrit.wikimedia.org/r/355457 [17:42:32] (03PS2) 10Ottomata: Revert "Changes needed for upgrading to Druid 0.10" [puppet] - 10https://gerrit.wikimedia.org/r/355457 [17:42:43] only "new" servers failed? 
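The "Check systemd state" alerts above fire because a single failed unit (here prometheus-mysqld-exporter.service on hosts without mysql installed) flips systemd's overall state to "degraded". A minimal sketch of how such an NRPE-style check could map `systemctl is-system-running` output to Nagios statuses — the real WMF check script may differ in wording and in which states it handles — is:

```python
# Hedged sketch of a "Check systemd state" style check; the state strings
# come from `systemctl is-system-running`, the numeric codes are the
# conventional Nagios exit statuses.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_systemd_state(state):
    state = state.strip()
    if state == "running":
        return OK, "OK - running: The system is fully operational"
    if state == "degraded":
        return CRITICAL, ("CRITICAL - degraded: The system is operational "
                          "but one or more units failed")
    if state in ("initializing", "starting"):
        return WARNING, "WARNING - %s: The system is still booting" % state
    return UNKNOWN, "UNKNOWN - unexpected state %r" % state

# One failed unit is enough to make the whole host report "degraded",
# which is exactly the flood of CRITICALs seen above.
print(check_systemd_state("degraded")[1])
```

This is why uninstalling (or masking) the irrelevant exporter on those hosts clears the alert: once no unit is in the failed state, `is-system-running` returns "running" again.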
[17:42:52] ah [17:43:00] maybe servers that do not have mysql installed [17:43:09] so I am going to just uninstall it [17:43:53] not only not-user-facing, those servers are not (yet) in use [17:44:34] (03CR) 10Ottomata: [C: 032] Revert "Changes needed for upgrading to Druid 0.10" [puppet] - 10https://gerrit.wikimedia.org/r/355457 (owner: 10Ottomata) [17:45:55] !log rolling druid back to 0.9.0 [17:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:42] 06Operations, 06MediaWiki-Platform-Team, 06Performance-Team, 07Availability (Multiple-active-datacenters), and 6 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3289594 (10aaron) Based on the last multi-dc meeting, this should consist... [17:48:18] (03PS2) 10Jdlrobson: Enable related pages for everyone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351534 (https://phabricator.wikimedia.org/T155079) [17:48:41] RECOVERY - Check systemd state on db1096 is OK: OK - running: The system is fully operational [17:50:41] RECOVERY - Check systemd state on db1101 is OK: OK - running: The system is fully operational [17:51:21] RECOVERY - Check systemd state on db1099 is OK: OK - running: The system is fully operational [17:51:31] RECOVERY - Check systemd state on db1103 is OK: OK - running: The system is fully operational [17:51:31] RECOVERY - Check systemd state on db1102 is OK: OK - running: The system is fully operational [17:51:41] RECOVERY - Check systemd state on db2072 is OK: OK - running: The system is fully operational [17:51:41] RECOVERY - Check systemd state on db2083 is OK: OK - running: The system is fully operational [17:51:41] RECOVERY - Check systemd state on db2089 is OK: OK - running: The system is fully operational [17:51:41] RECOVERY - Check systemd state on db2090 is OK: OK - running: The system is fully operational [17:51:41] RECOVERY - Check systemd state on db2074 is OK: OK - running: The 
system is fully operational [17:51:51] RECOVERY - Check systemd state on db2076 is OK: OK - running: The system is fully operational [17:51:51] RECOVERY - Check systemd state on db2073 is OK: OK - running: The system is fully operational [17:51:51] RECOVERY - Check systemd state on db2086 is OK: OK - running: The system is fully operational [17:51:52] sorry about that, that check was a really nice thing from alex [17:52:01] RECOVERY - Check systemd state on db2077 is OK: OK - running: The system is fully operational [17:52:01] RECOVERY - Check systemd state on db2085 is OK: OK - running: The system is fully operational [17:52:01] RECOVERY - Check systemd state on db2080 is OK: OK - running: The system is fully operational [17:52:01] RECOVERY - Check systemd state on db2075 is OK: OK - running: The system is fully operational [17:52:01] RECOVERY - Check systemd state on db2079 is OK: OK - running: The system is fully operational [17:52:02] RECOVERY - Check systemd state on db2088 is OK: OK - running: The system is fully operational [17:52:02] RECOVERY - Check systemd state on db2081 is OK: OK - running: The system is fully operational [17:52:02] RECOVERY - Check systemd state on db2087 is OK: OK - running: The system is fully operational [17:52:11] RECOVERY - Check systemd state on db2084 is OK: OK - running: The system is fully operational [17:52:11] RECOVERY - Check systemd state on db2082 is OK: OK - running: The system is fully operational [17:52:11] RECOVERY - Check systemd state on db2091 is OK: OK - running: The system is fully operational [17:52:15] (03PS5) 10Andrew Bogott: Horizon: Add sudo policy panel [puppet] - 10https://gerrit.wikimedia.org/r/353156 (https://phabricator.wikimedia.org/T162097) [17:52:17] (03PS2) 10Andrew Bogott: Horizon sudo panel: Better distinguish between 'create' and 'modify' actions [puppet] - 10https://gerrit.wikimedia.org/r/355452 (https://phabricator.wikimedia.org/T162097) [17:52:19] (03PS1) 10Andrew Bogott: Horizon sudo panel: 
Add policy checks [puppet] - 10https://gerrit.wikimedia.org/r/355459 [17:52:21] RECOVERY - Check systemd state on db2078 is OK: OK - running: The system is fully operational [17:52:39] those new servers seem to have only the default puppet role, and that caught me by surprise [17:53:26] 06Operations, 10Phabricator, 10Traffic: phab.wmfusercontent.org "homepage" yields a 500 - https://phabricator.wikimedia.org/T166120#3289633 (10mmodell) @dzahn: good catch but that's overridden: In modules/role/manifests/phabricator/main.pp $altdom = hiera('phabricator_altdomain', 'phab.wmfusercontent.or... [17:54:57] (03PS2) 10Andrew Bogott: Horizon sudo panel: Add policy checks [puppet] - 10https://gerrit.wikimedia.org/r/355459 [17:57:01] RECOVERY - Check systemd state on db2092 is OK: OK - running: The system is fully operational [17:57:29] 06Operations, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Upgrade mysqld_exporter to 0.10.0 - https://phabricator.wikimedia.org/T161296#3289671 (10jcrespo) I have upgraded all jessie mysql servers to the latest version. We now have to look at whether we can enable the pt-heartbeat monitoring. Probably the be... [17:58:46] (03CR) 10Andrew Bogott: "You can see this code live in action on labtesthorizon. For arcane labtest/ldap reasons it only works properly in the 'labtestproject' te" [puppet] - 10https://gerrit.wikimedia.org/r/353156 (https://phabricator.wikimedia.org/T162097) (owner: 10Andrew Bogott) [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170524T1800). Please do the needful. [18:00:04] framawiki, phuedx, and gilles: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[18:00:14] o/ [18:00:27] \o [18:00:41] !log T164865: Upgrading Cassandra from 3.7.3-instaclustr to 3.10 [18:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:50] T164865: Prototype and test range delete-based current revision storage - https://phabricator.wikimedia.org/T164865 [18:01:13] o/ [18:02:23] I can SWAT today, looks like a full window [18:02:31] (03PS3) 10Thcipriani: Set $wgUploadNavigationUrl on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354737 (https://phabricator.wikimedia.org/T165901) (owner: 10Framawiki) [18:02:55] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354737 (https://phabricator.wikimedia.org/T165901) (owner: 10Framawiki) [18:03:11] (03PS1) 10BryanDavis: Labs: Add wmcs-roots admin group to NFS servers [puppet] - 10https://gerrit.wikimedia.org/r/355463 [18:03:26] all of mine can be deployed together, they depend on each other anyway, except one [18:03:35] madhuvishy: ^^ I think that's the right magic [18:03:55] gilles: okie doke, just got to get them all through jenkins :) [18:04:23] (03Merged) 10jenkins-bot: Set $wgUploadNavigationUrl on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354737 (https://phabricator.wikimedia.org/T165901) (owner: 10Framawiki) [18:04:24] thcipriani: they need to be merged in order for the tests to pass, fyi [18:05:03] yup, saw your comments there [18:05:19] bd808: looks right [18:05:22] framawiki: $wgUploadNavigationUrl on srwiki is live on mwdebug1002, check please [18:05:42] (03CR) 10Madhuvishy: [C: 032] Labs: Add wmcs-roots admin group to NFS servers [puppet] - 10https://gerrit.wikimedia.org/r/355463 (owner: 10BryanDavis) [18:05:51] ok, i look [18:05:51] (03CR) 10jenkins-bot: Set $wgUploadNavigationUrl on srwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354737 (https://phabricator.wikimedia.org/T165901) (owner: 10Framawiki) [18:07:51] RECOVERY - HP RAID on ms-be1032 is OK: OK: Slot 3: 
OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [18:08:21] (03PS2) 10Thcipriani: Create a new namespace "Vikiproje" for trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355244 (https://phabricator.wikimedia.org/T166102) (owner: 10Framawiki) [18:08:57] thcipriani, ok for me for $wgUploadNavigationUrl on srwiki [18:09:18] framawiki: ok, thanks for checking, going live everywhere [18:10:51] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:354737|Set $wgUploadNavigationUrl on srwiki]] T165901 (duration: 00m 42s) [18:10:57] ^ framawiki live everywhere now [18:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:00] T165901: Change target for "Upload file" in sidebar on sr.wikipedia.org - https://phabricator.wikimedia.org/T165901 [18:11:05] thanks thcipriani !* [18:11:14] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355244 (https://phabricator.wikimedia.org/T166102) (owner: 10Framawiki) [18:12:18] (03Merged) 10jenkins-bot: Create a new namespace "Vikiproje" for trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355244 (https://phabricator.wikimedia.org/T166102) (owner: 10Framawiki) [18:12:30] (03CR) 10jenkins-bot: Create a new namespace "Vikiproje" for trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355244 (https://phabricator.wikimedia.org/T166102) (owner: 10Framawiki) [18:13:30] framawiki: new namespace for trwiki is live on mwdebug1002, check please [18:15:36] thcipriani, new namespace for trwiki is ok for me [18:15:59] framawiki: ok, will sync live, then I will run the namespaceDupes script for trwiki [18:17:49] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:355244|Create a new namespace "Vikiproje" for trwiki]] T166102 (duration: 00m 41s) [18:17:56] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:57] T166102: Create a new namespace for trwiki "Vikiproje" - https://phabricator.wikimedia.org/T166102 [18:18:08] !log running mwscript namespaceDupes.php trwiki --fix [18:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:36] > 7 links to fix, 7 were resolvable. [18:18:48] framawiki: should be live now, thank you for the patches :) [18:18:55] thanks thcipriani ! [18:20:10] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355397 (https://phabricator.wikimedia.org/T150325) (owner: 10Phuedx) [18:20:17] (03PS2) 10Thcipriani: mobileFrontend: Move first paragraph before infobox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355397 (https://phabricator.wikimedia.org/T150325) (owner: 10Phuedx) [18:20:25] (03CR) 10Thcipriani: mobileFrontend: Move first paragraph before infobox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355397 (https://phabricator.wikimedia.org/T150325) (owner: 10Phuedx) [18:20:31] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355397 (https://phabricator.wikimedia.org/T150325) (owner: 10Phuedx) [18:21:38] (03Merged) 10jenkins-bot: mobileFrontend: Move first paragraph before infobox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355397 (https://phabricator.wikimedia.org/T150325) (owner: 10Phuedx) [18:21:46] (03CR) 10jenkins-bot: mobileFrontend: Move first paragraph before infobox [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355397 (https://phabricator.wikimedia.org/T150325) (owner: 10Phuedx) [18:22:46] phuedx: mobilefrontend config change live on mwdebug1002, check please [18:24:00] thcipriani: testing with some reading web folk [18:24:11] ok :) [18:25:09] just going through some test pages w/ and w/out infoboxes [18:28:18] thcipriani: tested on a bunch of pages with known edge cases and popular pages [18:28:33] jdlrobson is just checking the 
api [18:28:35] sorry for the delay [18:29:08] (03PS1) 10Ottomata: Trigger a rerun of druid-hdfs-storage-cdh-link during a Druid upgrade [puppet] - 10https://gerrit.wikimedia.org/r/355469 (https://phabricator.wikimedia.org/T164008) [18:29:09] thcipriani: ok. go! [18:29:13] phuedx: np, still waiting on jenkins for other patches, no rush :) [18:29:15] ok, going live [18:30:01] (03PS2) 10Ottomata: Trigger a rerun of druid-hdfs-storage-cdh-link during a Druid upgrade [puppet] - 10https://gerrit.wikimedia.org/r/355469 (https://phabricator.wikimedia.org/T164008) [18:31:02] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:355397|mobileFrontend: Move first paragraph before infobox]] T150325 (duration: 00m 41s) [18:31:08] ^ phuedx live everywhere [18:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:12] T150325: Move first paragraph before infobox on stable - https://phabricator.wikimedia.org/T150325 [18:32:14] (03CR) 10jerkins-bot: [V: 04-1] Trigger a rerun of druid-hdfs-storage-cdh-link during a Druid upgrade [puppet] - 10https://gerrit.wikimedia.org/r/355469 (https://phabricator.wikimedia.org/T164008) (owner: 10Ottomata) [18:36:43] (03PS3) 10Ottomata: Trigger a rerun of druid-hdfs-storage-cdh-link during a Druid upgrade [puppet] - 10https://gerrit.wikimedia.org/r/355469 (https://phabricator.wikimedia.org/T164008) [18:36:45] (03PS1) 10Ottomata: Changes needed for upgrading to Druid 0.10 [puppet] - 10https://gerrit.wikimedia.org/r/355471 (https://phabricator.wikimedia.org/T164008) [18:39:04] gilles: all your patches are staged on mwdebug1002, check please [18:39:41] thcipriani: checking [18:39:54] (03CR) 10jerkins-bot: [V: 04-1] Trigger a rerun of druid-hdfs-storage-cdh-link during a Druid upgrade [puppet] - 10https://gerrit.wikimedia.org/r/355469 (https://phabricator.wikimedia.org/T164008) (owner: 10Ottomata) [18:41:21] PROBLEM - configured eth on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout 
after 10 seconds. [18:41:31] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:41:36] (03PS4) 10Ottomata: Trigger a rerun of druid-hdfs-storage-cdh-link during a Druid upgrade [puppet] - 10https://gerrit.wikimedia.org/r/355469 (https://phabricator.wikimedia.org/T164008) [18:41:41] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:41:41] PROBLEM - MD RAID on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:41:49] (03PS13) 10Elukey: [WIP] First prototype of the EventLogging purge script [puppet] - 10https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933) [18:42:11] RECOVERY - configured eth on ms-be1020 is OK: OK - interfaces up [18:42:21] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1020 is OK: OK ferm input default policy is set [18:42:31] RECOVERY - puppet last run on ms-be1020 is OK: OK: Puppet is currently enabled, last run 18 minutes ago with 0 failures [18:42:31] RECOVERY - MD RAID on ms-be1020 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [18:43:06] (03CR) 10jerkins-bot: [V: 04-1] Trigger a rerun of druid-hdfs-storage-cdh-link during a Druid upgrade [puppet] - 10https://gerrit.wikimedia.org/r/355469 (https://phabricator.wikimedia.org/T164008) (owner: 10Ottomata) [18:43:12] (03CR) 10jerkins-bot: [V: 04-1] [WIP] First prototype of the EventLogging purge script [puppet] - 10https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [18:44:45] thcipriani: my changes possibly don't do what they're supposed to, but with no breakage that I can see, you can go ahead and deploy it [18:46:00] (03CR) 10Ejegg: [C: 04-1] "Oops, we can actually delete most of this. That db server is gone, and nothing outside the payments cluster can talk to its replacement." 
(032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342857 (https://phabricator.wikimedia.org/T147479) (owner: 10Chad) [18:46:27] gilles: trying to think of the best way to deploy it without causing log errors considering there is no guarantee of the order in which files will by synced, how does this sound: pdfhandler, pagedtiffhandler, timedmediahandler, full sync? [18:46:52] or am I overthinking this? [18:47:03] thcipriani: yes I believe extensions first will only cause warnings [18:48:09] gilles: "yes" to? sorry, having trouble parsing. [18:48:16] yes to your plan [18:48:22] ah, ok :) [18:48:39] alright, going [18:50:43] ah, I think my patches were doing what I expected them to after all, I underestimated how long varnish caches 404s for [18:51:48] 06Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3289924 (10Ottomata) @Cmjohnson estimate on these? We'd like to get them up and running by the end of this quarter, so I'm going to nee... 
[18:52:34] (03PS1) 10Chad: WIP: Also clean up ExtensionMessages files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355475 [18:54:20] (03PS7) 10Chad: Drop most contribution tracking config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342857 (https://phabricator.wikimedia.org/T147479) [18:54:48] !log thcipriani@tin Synchronized php-1.30.0-wmf.2/extensions/PdfHandler/PdfHandler_body.php: SWAT: [[gerrit:355388|Update getContentHeaders signature]] T150741 (duration: 00m 40s) [18:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:57] T150741: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741 [18:55:48] 06Operations, 10ops-eqiad, 15User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3289946 (10Cmjohnson) @fgiunchedi Hey missed that Tuesday....let's do this Thrusday morning but has to be 1021...the ticket with Dell is 1021 and I have to be consi... 
[18:55:54] !log thcipriani@tin Synchronized php-1.30.0-wmf.2/extensions/PagedTiffHandler/PagedTiffHandler_body.php: SWAT: [[gerrit:355405|Update getContentHeaders signature]] T150741 (duration: 00m 42s) [18:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:52] !log thcipriani@tin Synchronized php-1.30.0-wmf.2/extensions/TimedMediaHandler/handlers: SWAT: [[gerrit:355406|Make getContentHeaders rely on fallback width/height]] T150741 (duration: 00m 41s) [18:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:17] !log thcipriani@tin Started scap: SWAT: [[gerrit:355389|Use file width/height instead of metadata for getContentHeaders]] [[gerrit:355390|Batch/pipeline backend operations in refreshFileHeaders]] T150741 [18:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:30] 06Operations, 10ops-eqiad, 10netops: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3289951 (10faidon) RIPE responded with a new USB image; I sent that to Chris over a separate medium. [18:57:33] (03CR) 10Ejegg: [C: 032] "Thanks for the cleanup!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342857 (https://phabricator.wikimedia.org/T147479) (owner: 10Chad) [18:58:39] (03Merged) 10jenkins-bot: Drop most contribution tracking config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342857 (https://phabricator.wikimedia.org/T147479) (owner: 10Chad) [18:58:47] (03CR) 10jenkins-bot: Drop most contribution tracking config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342857 (https://phabricator.wikimedia.org/T147479) (owner: 10Chad) [19:00:04] thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170524T1900). Please do the needful. 
[19:00:04] some RB warnings starting to pop up [19:00:06] en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is WARNING: Test Get citation for Darth Vader responds with unexpected body: [0]/encyclopediaTitle => None [19:00:29] apparently intermittent so it won't alert here, but still [19:00:29] !log thcipriani@tin Finished scap: SWAT: [[gerrit:355389|Use file width/height instead of metadata for getContentHeaders]] [[gerrit:355390|Batch/pipeline backend operations in refreshFileHeaders]] T150741 (duration: 03m 12s) [19:00:37] ^ gilles all sync'd [19:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:38] T150741: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741 [19:00:43] thcipriani: My cleanup patch just landed, but I didn't pull to tin yet [19:00:46] thcipriani: checking [19:01:07] RainbowSprinkles: I'm done swatting now, so you should be clear to run your cleanup, ping me after and I'll run train [19:01:15] Ok shouldn't take more than a minute or two [19:02:11] !log demon@tin Synchronized .gitignore: Completeness (duration: 00m 41s) [19:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:59] also, wowza, scap sync took 3m 12s. The delta was pretty small and there were no l10n changes to actually sync, but I was still expecting slower. [19:03:14] ejegg: Just sync'd your config removal live. Lemme know if you see any issues [19:03:20] checking [19:03:23] !log demon@tin Synchronized wmf-config/CommonSettings.php: Dropping old ContribTracking config (duration: 00m 41s) [19:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:07] all looking good RainbowSprinkles. Thanks! [19:04:40] No, thank you! 
You don't know how happy I am--this is the last piece of untracked configuration we had on the MW deploy server [19:04:46] thcipriani: all looks good, thank you for the SWAT [19:04:49] https://dbtree.wikimedia.org/ is broken again [19:04:57] gilles: glad to hear it :) [19:04:59] nice and clean now, huh? [19:05:11] Dbtree broken for me aswell [19:06:17] 06Operations, 06Performance-Team, 10Thumbor, 05MW-1.30-release-notes (WMF-deploy-2017-05-23_(1.30.0-wmf.2)), 13Patch-For-Review: Thumbor should reject thumbnail requests that are the same size as the original or bigger - https://phabricator.wikimedia.org/T150741#3289985 (10Gilles) refreshFileHeaders is s... [19:07:39] !log demon@tin Synchronized wmf-config/: Dropping old contribution-tracking-setup.php -- finally (duration: 00m 42s) [19:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:06] !log otto@tin Started deploy [eventlogging/analytics@c90a609]: (no justification provided) [19:11:08] !log otto@tin Finished deploy [eventlogging/analytics@c90a609]: (no justification provided) (duration: 00m 02s) [19:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:37] !log otto@tin Started deploy [eventlogging/analytics@c90a609]: (no justification provided) [19:11:39] !log otto@tin Finished deploy [eventlogging/analytics@c90a609]: (no justification provided) (duration: 00m 02s) [19:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:51] RainbowSprinkles did we switch to /srv/gerrit? 
[19:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:00] !log demon@tin Synchronized wmf-config/: Dropping old ExtensionMessages (duration: 00m 42s) [19:12:01] !log otto@tin Started deploy [eventlogging/analytics@c90a609]: (no justification provided) [19:12:03] !log otto@tin Finished deploy [eventlogging/analytics@c90a609]: (no justification provided) (duration: 00m 02s) [19:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:09] paladox: The git repos are in /srv/gerrit/git [19:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:23] Ah i see [19:12:25] Im getting puppet errors now [19:12:28] The software itself still operates out of /var/lib/gerrit2/review_site [19:12:33] !log otto@tin Started deploy [eventlogging/analytics@c90a609]: (no justification provided) [19:12:34] !log otto@tin Finished deploy [eventlogging/analytics@c90a609]: (no justification provided) (duration: 00m 01s) [19:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:45] ottomata: Having trouble? 
[19:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:52] errit2 failed: Could not find user gerrit2 [19:12:59] RainbowSprinkles ^^ [19:13:05] We deleted gerrit2 from ldap [19:13:09] (03PS1) 10Jdlrobson: Hygiene: Remove no longer supported config flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355477 [19:13:11] (03PS1) 10Jdlrobson: Enable print styles in Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355478 (https://phabricator.wikimedia.org/T163287) [19:13:20] yep, i guess i need to create the user now with the package [19:13:21] Probably need to do a fresh build from scratch so you get the system user created [19:13:31] Er, or just create the user [19:13:39] Copy+paste from the package preinstall step [19:13:51] /usr/sbin/deluser: The user `gerrit2' does not exist. [19:13:58] Don't deluser, just make the user [19:13:59] :) [19:14:06] !log otto@tin Started deploy [eventlogging/analytics@c90a609]: (no justification provided) [19:14:08] !log otto@tin Finished deploy [eventlogging/analytics@c90a609]: (no justification provided) (duration: 00m 02s) [19:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:22] Ah [19:14:30] * paladox addgroup --system gerrit2 [19:15:05] Basically: since we already had gerrit2 in ldap, the install script skipped making it. Since we nuked from LDAP, you need the user locally. A fresh install would do this (or just adduser :)) [19:15:07] that's adding a group, not a user thouhg [19:15:20] Ah, thanks [19:15:29] mutante yep, just realised :) [19:16:17] 06Operations, 10ops-eqiad: analytics1030 failed bbu - https://phabricator.wikimedia.org/T165529#3290023 (10Cmjohnson) @elukey the controller card has been swapped this is all yours..you can power on via mgmt. Please resolve this once you confirm. 
[19:16:29] Fixed now :) [19:16:35] thanks RainbowSprinkles :) [19:16:44] (03PS1) 10Jdlrobson: Wikivoyage should allow page images outside the lead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355479 (https://phabricator.wikimedia.org/T166251) [19:17:33] yw [19:19:09] 06Operations, 10ops-eqiad: Degraded RAID on db1024 - https://phabricator.wikimedia.org/T165934#3290048 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson Disk was replaced and raid restored. [19:20:06] !log otto@tin Started deploy [eventlogging/analytics@c90a609]: (no justification provided) [19:20:08] !log otto@tin Finished deploy [eventlogging/analytics@c90a609]: (no justification provided) (duration: 00m 01s) [19:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:33] (03PS1) 10Thcipriani: group1 wikis to 1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355481 [19:21:35] (03CR) 10Thcipriani: [C: 032] group1 wikis to 1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355481 (owner: 10Thcipriani) [19:21:38] RainbowSprinkles some how sudo su gerrit2 is not working [19:21:43] keeps keeping me as root [19:22:19] That's weird, doesn't do that in prod [19:22:27] Legacy sudoers rules? [19:22:30] (03PS1) 10Ottomata: Use is_not_bot filter function for eventlogging mysql consumer [puppet] - 10https://gerrit.wikimedia.org/r/355482 (https://phabricator.wikimedia.org/T67508) [19:22:33] (03Merged) 10jenkins-bot: group1 wikis to 1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355481 (owner: 10Thcipriani) [19:22:45] Look in /etc/sudoers.d/? 
[19:22:47] (03CR) 10jenkins-bot: group1 wikis to 1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355481 (owner: 10Thcipriani) [19:23:53] (03CR) 10Ottomata: [C: 032] Use is_not_bot filter function for eventlogging mysql consumer [puppet] - 10https://gerrit.wikimedia.org/r/355482 (https://phabricator.wikimedia.org/T67508) (owner: 10Ottomata) [19:24:10] that was weird. just had a spike of "Cannot access the database: Unknown error" [19:24:28] Ok [19:24:31] * paladox looks [19:25:09] Nope nothing in there [19:25:21] RainbowSprinkles ah [19:25:31] /bin/falsh [19:25:39] /bin/false [19:25:43] Ah, yeah gerrit2 needs /bin/bash [19:25:44] adduser --system --ingroup gerrit2 --home /var/lib/gerrit2 --shell /bin/false --no-create-home gerrit2 [19:25:51] * paladox submits patch [19:25:54] thanks [19:25:58] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.30.0-wmf.2 [19:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:10] (03Draft1) 10Paladox: Use /bin/bash instead of /bin/false for gerrit2 user [debs/gerrit] - 10https://gerrit.wikimedia.org/r/355483 [19:27:12] (03PS2) 10Paladox: Use /bin/bash instead of /bin/false for gerrit2 user [debs/gerrit] - 10https://gerrit.wikimedia.org/r/355483 [19:27:17] RainbowSprinkles ^^ [19:27:18] :) [19:27:43] 06Operations, 10Security-Reviews, 07Surveys: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#3290068 (10egalvezwmf) Hi @Elitre ! That would be the security team, I just don't expect this is high on their priority or ours at this time, but it is definitely on my radar for the long-term. The... [19:28:31] (03CR) 10Chad: [C: 031] "Should get a debian/changelog entry, but otherwise fine" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/355483 (owner: 10Paladox) [19:28:31] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:28:41] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:28:41] PROBLEM - MD RAID on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:29:21] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1020 is OK: OK ferm input default policy is set [19:29:31] RECOVERY - puppet last run on ms-be1020 is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures [19:29:32] RECOVERY - MD RAID on ms-be1020 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [19:29:41] (03PS3) 10Paladox: Use /bin/bash instead of /bin/false for gerrit2 user [debs/gerrit] - 10https://gerrit.wikimedia.org/r/355483 [19:29:43] (03CR) 10Dereckson: Test wgLogoHD keys (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344798 (https://phabricator.wikimedia.org/T161416) (owner: 10Dereckson) [19:29:59] RainbowSprinkles should i use your's or my username in the changelog entry [19:30:10] Use yours :) [19:30:22] Ok [19:30:23] thanks [19:30:37] (03PS4) 10Paladox: Use /bin/bash instead of /bin/false for gerrit2 user [debs/gerrit] - 10https://gerrit.wikimedia.org/r/355483 [19:30:41] Done :) [19:30:49] Set at +1 utc instead of +0 [19:31:04] (03CR) 10Zppix: [C: 031] Use /bin/bash instead of /bin/false for gerrit2 user [debs/gerrit] - 10https://gerrit.wikimedia.org/r/355483 (owner: 10Paladox) [19:31:34] Luckly i like this systemd script it runs as gerrit2 so i can run gerrit without sudoing into the user. [19:31:45] (03CR) 10Ottomata: [C: 032] Release 0.10.0-2 [debs/druid] - 10https://gerrit.wikimedia.org/r/355430 (https://phabricator.wikimedia.org/T164008) (owner: 10Ottomata) [19:33:01] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2066647 [19:33:09] * paladox rebuilds gerrit 2.14 ontop of that change [19:33:38] I usually just do +0 and normalize to UTC, but w/e works [19:34:00] What's w/e? 
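The `sudo su gerrit2` confusion above can be illustrated with a tiny sketch (the uid/gid in the sample entry are made up, not taken from the log): the login shell is the last colon-separated field of the passwd entry, and when it is `/bin/false`, `su` runs `/bin/false`, which exits immediately and drops you straight back into root's shell, so it looks like nothing happened.

```python
# Hypothetical illustration of the gerrit2 shell problem discussed above.
# The login shell is the final field of an /etc/passwd-style line; with
# /bin/false there, `su gerrit2` silently returns to the caller's shell.
def login_shell(passwd_line: str) -> str:
    """Return the login-shell field of an /etc/passwd-style line."""
    return passwd_line.rstrip("\n").split(":")[-1]

entry = "gerrit2:x:998:998::/var/lib/gerrit2:/bin/false"
assert login_shell(entry) == "/bin/false"  # why `su gerrit2` seemed to do nothing
```

The patch merged later in the log swaps that field to `/bin/bash`, which is what lets replication (and operators) actually obtain a shell as gerrit2.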
[19:34:21] (03PS5) 10Paladox: Use /bin/bash instead of /bin/false for gerrit2 user [debs/gerrit] - 10https://gerrit.wikimedia.org/r/355483 [19:36:02] paladox: whatever [19:36:22] Thanks [19:36:57] (03CR) 10Hashar: "I have cherry picked this change on the CI puppet master since we still have some Trusty permanent slaves with hhvm." [puppet] - 10https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) (owner: 10Paladox) [19:37:17] (03CR) 10Paladox: "> I have cherry picked this change on the CI puppet master since we" [puppet] - 10https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) (owner: 10Paladox) [19:37:40] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3290113 (10akosiaris) raid1-lvm-ext4-srv-noswap works too. Fine by me. [19:38:46] 06Operations, 10Phabricator: Intermittent DB connectivity problem on phabricator, needs investigation - https://phabricator.wikimedia.org/T163507#3290115 (10mmodell) I haven't seen any signs of this lately but it's possible that we just haven't been hit by the perfect storm of simultaneous crawlers. From memo... [19:38:51] PROBLEM - puppet last run on labsdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:39:30] 06Operations, 10Phabricator: Intermittent DB connectivity problem on phabricator, needs investigation - https://phabricator.wikimedia.org/T163507#3290116 (10mmodell) p:05High>03Normal [19:46:55] (03PS6) 10Paladox: Use /bin/bash instead of /bin/false for gerrit2 user [debs/gerrit] - 10https://gerrit.wikimedia.org/r/355483 [19:46:57] (03PS7) 10Paladox: Upgrade gerrit to 2.14.1 (DO NOT MERGE) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/350440 [19:55:03] (03CR) 10Dzahn: "but why?" 
[debs/gerrit] - 10https://gerrit.wikimedia.org/r/355483 (owner: 10Paladox) [19:55:24] (03PS7) 10Paladox: Use /bin/bash instead of /bin/false for gerrit2 user [debs/gerrit] - 10https://gerrit.wikimedia.org/r/355483 [19:56:57] (03CR) 10Chad: "Gerrit2 needs to be able to ssh to other gerrit boxes for replication. It's like this in production, but the package didn't enforce it." [debs/gerrit] - 10https://gerrit.wikimedia.org/r/355483 (owner: 10Paladox) [19:58:35] mutante ^^ [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170524T2000). [20:00:22] (03PS8) 10Paladox: Use /bin/bash instead of /bin/false for gerrit2 user [debs/gerrit] - 10https://gerrit.wikimedia.org/r/355483 [20:03:04] (03CR) 10Dzahn: [C: 032] "thanks, that makes it more clear that it is for replication, not for humans running commands as it. gotcha and confirmed it's like this in" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/355483 (owner: 10Paladox) [20:03:15] Thanks ^^^ :) [20:07:51] RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [20:08:21] (03PS4) 10Paladox: Gerrit: Remove "" around T\\d+ in gerrit.config [puppet] - 10https://gerrit.wikimedia.org/r/352710 [20:09:02] I carn't tell what the difference is for this: https://phabricator.wikimedia.org/P5483 (gerrit) [20:09:34] 06Operations, 10ops-eqiad: rack/setup/install ganeti1005-ganeti1008 - https://phabricator.wikimedia.org/T166076#3290213 (10Cmjohnson) [20:09:54] paladox: Trailing whitespace. 
gerrit init trims it, but the erb template results in a trailing space [20:10:02] I never managed to clean that up [20:10:02] Oh [20:10:04] thanks [20:15:45] https://www.facebook.com/pages/Wikimedia-Foundation/101762606683454 [20:15:47] woops [20:15:55] that was meant for mutante in a pm [20:16:02] sorry for pasting it in the wrong channel [20:16:28] re: https://gerrit.wikimedia.org/r/#/c/352710/4/modules/gerrit/templates/gerrit.config.erb but it doesn't remove the quotes around the other "match = " lines in there? [20:17:15] (03PS3) 10Cmjohnson: Adding mgmt dns for new parsoid wtp125-1048 T165520 [dns] - 10https://gerrit.wikimedia.org/r/355106 [20:17:39] The problem dosen't happen for the other ones [20:17:42] Mutante ^^ [20:18:06] paladox: re: that Facebook page, is it broken? There are HTML tags on the left side and German text? [20:18:26] The about text [20:18:42] mutante, i was trying to say it was an unoffical page. Carn't find the offical one. [20:20:10] no parsoid deploy today [20:20:15] paladox: ok, i have no idea, by i think Jeff Elder is the guy. https://meta.wikimedia.org/wiki/Social_media [20:21:05] thanks [20:21:07] paladox: see that "our Facebook group" link there, but i can't see from there without login [20:21:24] Yep [20:22:19] (03CR) 10RobH: [C: 031] Adding mgmt dns for new parsoid wtp125-1048 T165520 [dns] - 10https://gerrit.wikimedia.org/r/355106 (owner: 10Cmjohnson) [20:22:21] PROBLEM - HP RAID on ms-be2029 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. 
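The invisible diff in that paste (P5483) comes down to trailing whitespace: the erb template emits a trailing space that `gerrit init` trims back off. A throwaway helper like this makes such a difference visible; the sample strings mirror the `match = "T\d+"` line under discussion but are illustrative, not copied from the actual gerrit.config.

```python
# Make trailing whitespace visible, for comparing two config renders that
# "look" identical in a paste. Sample lines are illustrative assumptions.
def show_trailing(line: str, marker: str = "~") -> str:
    """Replace trailing whitespace with a visible marker character."""
    stripped = line.rstrip()
    return stripped + marker * (len(line) - len(stripped))

before = 'match = "T\\\\d+" '   # erb output: note the trailing space
after = 'match = "T\\\\d+"'     # what `gerrit init` writes back
assert show_trailing(before) == show_trailing(after) + "~"
```

Erb itself can avoid the problem with trim-mode tags (`-%>`), which is the usual fix when the template, rather than a post-processing step, should own the exact output.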
[20:27:39] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns for new parsoid wtp125-1048 T165520 [dns] - 10https://gerrit.wikimedia.org/r/355106 (owner: 10Cmjohnson) [20:29:42] !log Run fixProofreadIndexPagesContentModel on vec.wikisource (requested by Tpt), aborted after 50k (as that's greater than the expected number of rows) [20:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:51] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [20:29:52] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [20:29:52] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [20:29:52] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [20:30:01] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [20:30:21] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [20:30:31] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [20:30:41] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy [20:30:41] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [20:30:41] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy [20:32:41] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [20:32:51] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [20:33:11] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy 
[20:33:21] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [20:34:17] (03CR) 10Chad: [C: 032] scap clean: Some docs, minor pylint fixes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355365 (owner: 10Chad) [20:35:20] (03Merged) 10jenkins-bot: scap clean: Some docs, minor pylint fixes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355365 (owner: 10Chad) [20:35:39] (03PS8) 10Paladox: Upgrade gerrit to 2.14.1 (DO NOT MERGE) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/350440 [20:35:54] (03CR) 10jenkins-bot: scap clean: Some docs, minor pylint fixes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355365 (owner: 10Chad) [20:38:49] !log demon@tin Synchronized scap/plugins/clean.py: cleanups (duration: 00m 41s) [20:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:08] (03PS1) 10Papaul: DHCP/partman: Add DHCP and partman entries for ores200[1-9] [puppet] - 10https://gerrit.wikimedia.org/r/355501 [20:40:22] (03PS9) 10Paladox: Upgrade gerrit to 2.14.1 (DO NOT MERGE) [debs/gerrit] - 10https://gerrit.wikimedia.org/r/350440 [20:41:22] marostegui: hi, would you have a quick idea for https://phabricator.wikimedia.org/T166261 ? 
[20:42:10] (03PS2) 10Dzahn: DHCP/partman: Add DHCP and partman entries for ores200[1-9] [puppet] - 10https://gerrit.wikimedia.org/r/355501 (owner: 10Papaul) [20:43:11] Dereckson: rollback the train for wikitonary til fixed if its user visable [20:43:19] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3290264 (10Papaul) [20:43:19] wikisource [20:43:26] Yes wikisource [20:43:28] My bad [20:43:28] (03CR) 10Dzahn: [C: 032] "saw comments on ticket, yes, uses raid1-lvm-ext4-srv-noswap recipe" [puppet] - 10https://gerrit.wikimedia.org/r/355501 (owner: 10Papaul) [20:44:04] Yes, Ive offered the same train rollback to Tpt, but there are workarounds like run a bot to force content model update [20:44:22] or run the maintenance scripts, and watch when they're done [20:45:15] Dereckson: workarounds arent great they tend to go perm and then it never really gets fixed [20:45:22] Ugh. That query was getting stuck in beta yesterday. [20:45:58] LIMIT on UPDATE warns as unsafe. [20:46:11] (when doing statement-based replication) [20:46:21] Script should be adjusted to do them in batches by page_id [20:52:09] 06Operations, 10ops-eqiad: rack/setup/install ganeti1005-ganeti1008 - https://phabricator.wikimedia.org/T166076#3290280 (10Cmjohnson) Racked and labeled 2 in row A (a4 and a6) 1 in row B4 and 1 in row B3. Racktables updated [20:55:40] (03PS5) 10Paladox: Gerrit: Remove "" around T\\d+ in gerrit.config [puppet] - 10https://gerrit.wikimedia.org/r/352710 [20:55:45] Dereckson: i replied to task [20:56:37] I also just rewrote that batching loop [20:56:45] So it's not crappy and works in replicated environments [20:57:21] RainbowSprinkles: is it possible to do extensive testing before deploying to wikisource... to prevent more issues? 
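The replication-safety point made above can be sketched roughly as follows. An `UPDATE ... LIMIT N` touches whichever N rows the server happens to pick, which is non-deterministic and therefore unsafe under statement-based replication; selecting explicit page_id batches and updating by key replays identically on every replica. This is NOT the actual `fixProofreadIndexPagesContentModel.php` logic: the table and column names follow MediaWiki's `page` table, but the function name, namespace id, and content-model string are illustrative assumptions.

```python
# Sketch of key-range batching instead of UPDATE ... LIMIT, assuming a
# DB-API style connection `conn`. A read-only SELECT ... LIMIT is fine;
# only the write statement needs to be deterministic for replication.
INDEX_NAMESPACE = 252  # assumed id of ProofreadPage's Index namespace
BATCH_SIZE = 500

def fix_content_model_in_batches(conn):
    """Walk the page table in primary-key order, updating BATCH_SIZE rows
    per statement, each batch named explicitly by page_id."""
    cur = conn.cursor()
    last_id = 0
    while True:
        cur.execute(
            "SELECT page_id FROM page"
            " WHERE page_id > ? AND page_namespace = ?"
            " ORDER BY page_id LIMIT ?",
            (last_id, INDEX_NAMESPACE, BATCH_SIZE),
        )
        ids = [row[0] for row in cur.fetchall()]
        if not ids:
            break
        placeholders = ",".join("?" * len(ids))
        # Deterministic: the affected rows are listed by key, so the same
        # statement updates the same rows on the master and every replica.
        cur.execute(
            "UPDATE page SET page_content_model = 'proofread-index'"
            " WHERE page_id IN (%s)" % placeholders,
            ids,
        )
        conn.commit()
        last_id = ids[-1]
```

Committing per batch also keeps transactions short, which avoids the long replica lag a single 50k-row update would cause.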
[20:58:39] 06Operations, 10ops-eqiad, 10netops: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3191129 (10Cmjohnson) @faidon the usb with the image is attached.
[20:58:54] I mean I didn't change what it does. Just how it does it
[20:59:03] But if you want to go on a testing spree by all means go ahead
[20:59:29] (03PS1) 10Chad: Rolling wikisources back to wmf.1 for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355529
[20:59:34] Dereckson: ^
[20:59:44] Dereckson: it should of been wmf 2 no?
[20:59:52] Sorry RainbowSprinkles
[21:00:10] (03PS6) 10Dzahn: Gerrit: Remove "" around T\\d+ in gerrit.config [puppet] - 10https://gerrit.wikimedia.org/r/352710 (owner: 10Paladox)
[21:00:17] Why would it be wmf.2?
[21:00:25] If wmf.2 is what broke it
[21:00:55] Arent we on .3?
[21:01:01] Nope
[21:01:12] https://tools.wmflabs.org/versions/
[21:01:20] Carry on... im going crazy
[21:02:03] RainbowSprinkles: according tpt, rollback is a bad idea, as we started to migrate data
[21:02:13] Ah, ok
[21:02:22] Sooo, in that case let's merge my patch and move forward?
[21:02:33] Tpt is reviewing it
[21:02:39] awesome
[21:02:53] (03Abandoned) 10Chad: Rolling wikisources back to wmf.1 for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355529 (owner: 10Chad)
[21:03:29] Dereckson: is it only affecting wikisource?
[21:03:42] 06Operations, 10Security-Reviews, 07Surveys: Re-evaluate Limesurvey - https://phabricator.wikimedia.org/T109606#3290324 (10Aklapper) //If// someone wants a security review from the #Security-Team (=not a single person), please follow https://www.mediawiki.org/wiki/Wikimedia_Security_Team/Security_reviews#Req...
[21:03:42] Zppix: frrwiki (already fixed) and test2 (fixed too)
[21:04:19] RainbowSprinkles: https://gerrit.wikimedia.org/r/#/c/355534/ to cherry-pick
[21:04:26] Dereckson: so all the projects with that ext
[21:04:48] Zppix: wikisource.dblist + frrwiki + test2 + sourceswiki
[21:05:05] Merging to wmf.2
[21:05:07] Ack thanks
[21:05:24] (03CR) 10Dzahn: [C: 032] Gerrit: Remove "" around T\\d+ in gerrit.config [puppet] - 10https://gerrit.wikimedia.org/r/352710 (owner: 10Paladox)
[21:05:32] thanks ^^
[21:08:40] 06Operations, 10ops-eqiad, 07kubernetes: rack/setup/instal (2)l kubernetes staging hosts - https://phabricator.wikimedia.org/T166264#3290332 (10RobH)
[21:09:09] 06Operations, 10ops-eqiad, 07kubernetes: rack/setup/instal (2)l kubernetes staging hosts - https://phabricator.wikimedia.org/T166264#3290349 (10RobH) Since this initially was requested by Alex, I've assigned it to him for feedback on the hostname and racking proposals. Please provide feedback, and assign to...
[21:10:01] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[21:10:01] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[21:10:01] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[21:10:08] !log demon@tin Synchronized php-1.30.0-wmf.2/extensions/ProofreadPage/maintenance/fixProofreadIndexPagesContentModel.php: Now with proper batch support (duration: 00m 41s)
[21:10:09] (03PS7) 10Hashar: contint: New role for Docker based CI slave [puppet] - 10https://gerrit.wikimedia.org/r/320942 (https://phabricator.wikimedia.org/T150502) (owner: 10Dduvall)
[21:10:16] Dereckson: Script is live everywhere. Maintenance can continue now :)
[21:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:10:19] !log Fixed wikisource Index: content model for ta.wikisource, en.wikisource and not wikisource databases (frrwiki + test2 + sourceswiki)
[21:10:22] RainbowSprinkles: ok
[21:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:10:51] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[21:10:51] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[21:10:52] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[21:12:18] (03CR) 10jerkins-bot: [V: 04-1] contint: New role for Docker based CI slave [puppet] - 10https://gerrit.wikimedia.org/r/320942 (https://phabricator.wikimedia.org/T150502) (owner: 10Dduvall)
[21:12:30] (03PS8) 10Hashar: contint: New role for Docker based CI slave [puppet] - 10https://gerrit.wikimedia.org/r/320942 (https://phabricator.wikimedia.org/T150502) (owner: 10Dduvall)
[21:13:47] (03CR) 10jerkins-bot: [V: 04-1] contint: New role for Docker based CI slave [puppet] - 10https://gerrit.wikimedia.org/r/320942 (https://phabricator.wikimedia.org/T150502) (owner: 10Dduvall)
[21:18:39] (03CR) 10Andrew Bogott: [C: 032] dynamicproxy: Make use of errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350494 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle)
[21:26:51] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[21:28:58] (03PS1) 10Cmjohnson: Updating asset tag labels to correctly match the associated servers [dns] - 10https://gerrit.wikimedia.org/r/355550
[21:31:25] (03CR) 10Cmjohnson: [C: 032] Updating asset tag labels to correctly match the associated servers [dns] - 10https://gerrit.wikimedia.org/r/355550 (owner: 10Cmjohnson)
[21:34:29] !log Run fixProofreadIndexPagesContentModel.php new version (with [[Gerrit:355534]] fix) to every wikisource
[21:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:51] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[21:36:47] 06Operations, 10DBA: dbtree: dbtree.wikimedia.org currently returns a 500 error - https://phabricator.wikimedia.org/T166267#3290453 (10Dereckson) p:05Triage>03High
[21:38:07] (03PS4) 10Krinkle: varnish: Make errorpage.html balanced and use placeholder [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114)
[21:38:27] (03CR) 10Krinkle: "Testing in Beta failed:" [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle)
[21:38:36] (03CR) 10Krinkle: "Fixed syntax error." [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle)
[21:43:53] 06Operations, 10DBA: dbtree: dbtree.wikimedia.org currently returns a 500 error - https://phabricator.wikimedia.org/T166267#3290480 (10Dereckson)
[21:46:45] 06Operations, 10DBA: dbtree: dbtree.wikimedia.org currently returns a 500 error - https://phabricator.wikimedia.org/T166267#3290483 (10Dzahn) a:03Dzahn
[21:46:56] 06Operations, 10DBA: dbtree: dbtree.wikimedia.org currently returns a 500 error - https://phabricator.wikimedia.org/T166267#3290441 (10Dzahn) my fault it seems.. on it
[21:52:09] 06Operations, 10DBA: dbtree: dbtree.wikimedia.org currently returns a 500 error - https://phabricator.wikimedia.org/T166267#3290491 (10Dzahn) it was an extra "," that was causing a syntax error. fixed with hotfix. gerrit follow-up coming..
[21:53:35] !log T164865: Disabling range delete-based render culling, dev env
[21:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:46] T164865: Prototype and test range delete-based current revision storage - https://phabricator.wikimedia.org/T164865
[21:57:14] (03PS1) 10Dzahn: fix syntax error in index.php, extra comma [software/dbtree] - 10https://gerrit.wikimedia.org/r/355558
[21:57:38] (03PS2) 10Dzahn: fix syntax error in index.php, extra comma [software/dbtree] - 10https://gerrit.wikimedia.org/r/355558 (https://phabricator.wikimedia.org/T166267)
[21:58:46] (03CR) 10Dereckson: [C: 031] fix syntax error in index.php, extra comma [software/dbtree] - 10https://gerrit.wikimedia.org/r/355558 (https://phabricator.wikimedia.org/T166267) (owner: 10Dzahn)
[21:59:16] (03PS3) 10Dzahn: fix syntax error in index.php, extra dot [software/dbtree] - 10https://gerrit.wikimedia.org/r/355558 (https://phabricator.wikimedia.org/T166267)
[21:59:22] (03CR) 10Dzahn: [V: 032 C: 032] fix syntax error in index.php, extra dot [software/dbtree] - 10https://gerrit.wikimedia.org/r/355558 (https://phabricator.wikimedia.org/T166267) (owner: 10Dzahn)
[22:02:34] !log terbium: dbtree: git stash and git pull origin to fix unclean repo state, deploy fix to syntax error
[22:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:05:03] 06Operations, 10DBA, 13Patch-For-Review: dbtree: dbtree.wikimedia.org currently returns a 500 error - https://phabricator.wikimedia.org/T166267#3290547 (10Dzahn) fixed. deployed on terbium and wasat. also: 15:06 < mutante> !log terbium: dbtree: git stash and git pull origin to fix unclean repo state, depl...
[22:05:13] 06Operations, 10DBA, 13Patch-For-Review: dbtree: dbtree.wikimedia.org currently returns a 500 error - https://phabricator.wikimedia.org/T166267#3290548 (10Dzahn) 05Open>03Resolved
[22:10:24] mutante: I'm testing out https://gerrit.wikimedia.org/r/#/c/350493/4/modules/varnish/manifests/common/vcl.pp but it seems no matter what I do, even manual vcl-reload, the change to errorpage.html is not picked up.
[22:10:43] I did understand that std.fileread is cached somewhere, and that our puppet doesn't reload vcl when the html file changes
[22:10:48] but I assumed a reload would clear that cache
[22:10:54] Do you know how to clear it?
[22:11:25] Krinkle: not if it's not vcl-reload, no
[22:11:52] Krinkle: sorry, maybe ask folks in -traffic though
[22:13:00] "Please note that std.fileread is only read once and is cached until varnish is reloaded."
[22:20:52] (03CR) 10Krinkle: "It now compiles fine, but it seems the std.fileread call is cached quite strongly. I'm unable to make it refresh. Tried vcl-reload, and al" [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle)
[22:28:02] (03CR) 10Bmansurov: [C: 031] Enable related pages for everyone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351534 (https://phabricator.wikimedia.org/T155079) (owner: 10Jdlrobson)
[22:29:57] (03CR) 10Bmansurov: [C: 031] Hygiene: Remove no longer supported config flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355477 (owner: 10Jdlrobson)
[22:32:25] (03CR) 10Bmansurov: [C: 031] Enable print styles in Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355478 (https://phabricator.wikimedia.org/T163287) (owner: 10Jdlrobson)
[22:33:02] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 3892
[22:33:16] !log krinkle@tin Synchronized php-1.30.0-wmf.2/extensions/wikihiero: Fix styles queue warning - T92459 (duration: 00m 42s)
[22:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:27] T92459: ResourceLoader should restrict addModuleStyles() to modules that only provide styles - https://phabricator.wikimedia.org/T92459
[22:35:51] (03CR) 10Bmansurov: [C: 031] Wikivoyage should allow page images outside the lead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355479 (https://phabricator.wikimedia.org/T166251) (owner: 10Jdlrobson)
[22:39:10] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 3 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3290652 (10ggellerman)
[23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170524T2300).
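The std.fileread caching issue discussed in the log above — an edited errorpage.html not being picked up even after vcl-reload — comes down to where the file is read. A minimal VCL fragment illustrating the pattern (a sketch only, assuming a hypothetical /etc/varnish/errorpage.html path, not the actual WMF errorpage VCL):

```vcl
vcl 4.0;
import std;

sub vcl_backend_error {
    set beresp.http.Content-Type = "text/html; charset=utf-8";
    # Per the quote in the discussion, std.fileread() reads the file only
    # once and caches the contents, so editing errorpage.html on disk is
    # not reflected by a vcl-reload alone; the cache persists until the
    # varnish worker process itself is restarted.
    synthetic(std.fileread("/etc/varnish/errorpage.html"));
    return (deliver);
}
```

This matches Krinkle's observation that the call "is cached quite strongly": the reload replaces the compiled VCL, but the cached file contents survive it.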
[23:00:04] SMalyshev, AaronSchulz, and Jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:09] \o hello
[23:00:10] (03PS1) 10Catrope: Enable $wgEchoPerUserBlacklist in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355567 (https://phabricator.wikimedia.org/T150419)
[23:00:12] I can SWAT
[23:00:17] and I have an addition of my own
[23:00:17] here
[23:06:53] (03PS4) 10Catrope: Allow absolute script path for getMediaWikiCli() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282 (owner: 10Smalyshev)
[23:06:57] (03CR) 10Catrope: [C: 032] Allow absolute script path for getMediaWikiCli() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282 (owner: 10Smalyshev)
[23:07:24] OK, Stas's first, I'll put it on terbium for testing since it's maintenance/CLI-related
[23:07:53] RoanKattouw: thanks, checking
[23:07:56] (03Merged) 10jenkins-bot: Allow absolute script path for getMediaWikiCli() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282 (owner: 10Smalyshev)
[23:08:06] (03CR) 10jenkins-bot: Allow absolute script path for getMediaWikiCli() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282 (owner: 10Smalyshev)
[23:08:45] ah, not there yet I assume
[23:08:59] There now
[23:09:09] yay, works!
[23:09:10] AaronSchulz: You around for SWAT?
[23:09:13] RoanKattouw: thanks!
[23:09:57] (03PS2) 10Catrope: Hygiene: Remove no longer supported config flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355477 (owner: 10Jdlrobson)
[23:10:24] !log catrope@tin Synchronized multiversion/MWMultiVersion.php: Allow absolute script path for getMediaWikiCli() (duration: 00m 44s)
[23:10:31] RECOVERY - MariaDB Slave Lag: s2 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89964.74 seconds
[23:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:10:34] (03CR) 10Catrope: [C: 032] Hygiene: Remove no longer supported config flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355477 (owner: 10Jdlrobson)
[23:11:13] jdlrobson: Yours is next., would you like them one by one or in groups?
[23:11:22] groups is fine
[23:11:27] We can probably combine that ---^ one with the print styles one for example
[23:11:27] OK
[23:11:32] (03Merged) 10jenkins-bot: Hygiene: Remove no longer supported config flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355477 (owner: 10Jdlrobson)
[23:11:34] (03PS2) 10Catrope: Enable print styles in Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355478 (https://phabricator.wikimedia.org/T163287) (owner: 10Jdlrobson)
[23:11:38] (03CR) 10Catrope: [C: 032] Enable print styles in Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355478 (https://phabricator.wikimedia.org/T163287) (owner: 10Jdlrobson)
[23:11:41] (03CR) 10jenkins-bot: Hygiene: Remove no longer supported config flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355477 (owner: 10Jdlrobson)
[23:12:33] (03Merged) 10jenkins-bot: Enable print styles in Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355478 (https://phabricator.wikimedia.org/T163287) (owner: 10Jdlrobson)
[23:13:12] jdlrobson: OK, please test the above on mwdebug1002
[23:13:46] (03CR) 10jenkins-bot: Enable print styles in Minerva [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355478 (https://phabricator.wikimedia.org/T163287) (owner: 10Jdlrobson)
[23:14:52] (on it)
[23:15:13] (03PS3) 10Catrope: Enable related pages for everyone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351534 (https://phabricator.wikimedia.org/T155079) (owner: 10Jdlrobson)
[23:17:02] RoanKattouw: looks good sync away!
[23:17:44] (03CR) 10Catrope: [C: 032] Enable related pages for everyone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351534 (https://phabricator.wikimedia.org/T155079) (owner: 10Jdlrobson)
[23:18:10] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable print styles in Minerva (T163287) (duration: 00m 42s)
[23:18:17] jdlrobson: That one --^^ seems tricky to test, since it was previously already enabled for 90% on enwiki, should I just sync that one right away?
[23:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:18:19] T163287: Deploy new print styles to all mobile devices - https://phabricator.wikimedia.org/T163287
[23:18:49] RoanKattouw: yeh you can just sync that one
[23:18:56] (03Merged) 10jenkins-bot: Enable related pages for everyone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351534 (https://phabricator.wikimedia.org/T155079) (owner: 10Jdlrobson)
[23:19:04] (03CR) 10jenkins-bot: Enable related pages for everyone [mediawiki-config] - 10https://gerrit.wikimedia.org/r/351534 (https://phabricator.wikimedia.org/T155079) (owner: 10Jdlrobson)
[23:19:08] i was just gonna hit it > 10 times to check :)
[23:19:46] (03PS2) 10Catrope: Wikivoyage should allow page images outside the lead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355479 (https://phabricator.wikimedia.org/T166251) (owner: 10Jdlrobson)
[23:19:51] (03CR) 10Catrope: [C: 032] Wikivoyage should allow page images outside the lead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355479 (https://phabricator.wikimedia.org/T166251) (owner: 10Jdlrobson)
[23:19:55] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable related pages for everyone (T155079) (duration: 00m 42s)
[23:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:20:05] T155079: Deploy related pages to mobile web stable to 100% of users for English - https://phabricator.wikimedia.org/T155079
[23:20:27] jdlrobson: It's been too long since I took stat classes so I can't tell you how many times you'd need to hit it to get >95% confidence :P
[23:20:33] :)
[23:20:54] (03Merged) 10jenkins-bot: Wikivoyage should allow page images outside the lead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355479 (https://phabricator.wikimedia.org/T166251) (owner: 10Jdlrobson)
[23:21:13] (03CR) 10jenkins-bot: Wikivoyage should allow page images outside the lead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355479 (https://phabricator.wikimedia.org/T166251) (owner: 10Jdlrobson)
[23:21:23] jdlrobson: OK, Wikivoyage non-lead images is on mwdebug1002, please test
[23:21:47] (03CR) 10Catrope: [C: 032] Enable $wgEchoPerUserBlacklist in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355567 (https://phabricator.wikimedia.org/T150419) (owner: 10Catrope)
[23:22:46] (03Merged) 10jenkins-bot: Enable $wgEchoPerUserBlacklist in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355567 (https://phabricator.wikimedia.org/T150419) (owner: 10Catrope)
[23:23:13] (03CR) 10jenkins-bot: Enable $wgEchoPerUserBlacklist in beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355567 (https://phabricator.wikimedia.org/T150419) (owner: 10Catrope)
[23:25:23] i dont think i can test this easily RoanKattouw but it should be fine
[23:25:34] unless you want to wait for the next jobqueue run
[23:25:46] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 for kaldari - https://phabricator.wikimedia.org/T166165#3287147 (10Dzahn) Hi @kaldari if you take a look at the table on https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups would you say you are asking for "statistics-priva...
[23:27:00] Oh wasn't aware of that
[23:27:02] Then I'll just sync
[23:27:27] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 for kaldari - https://phabricator.wikimedia.org/T166165#3290772 (10kaldari) @Dzahn: analytics-privatedata-users, please.
[23:28:05] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Allow page images outside the lead on Wikivoyage wikis (T166251) (duration: 00m 41s)
[23:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:28:15] T166251: No page images on various pages in Wikivoyage - https://phabricator.wikimedia.org/T166251
[23:29:03] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 for kaldari - https://phabricator.wikimedia.org/T166165#3290788 (10Dzahn) I think it's "analytics-privatedata-users" because https://wikitech.wikimedia.org/wiki/Analytics#By_access_system says "Data Lake [Hadoop cluster]" and the page linked ab...
[23:29:55] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 for kaldari - https://phabricator.wikimedia.org/T166165#3290791 (10kaldari) Yep :)
[23:32:04] (03PS1) 10Dzahn: admins: add kaldari to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/355568 (https://phabricator.wikimedia.org/T166165)
[23:32:14] thanks RoanKattouw
[23:32:27] OK that's all of Jon's done
[23:32:37] AaronSchulz: You around? You have a patch listed for SWAT
[23:35:58] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1002 for kaldari - https://phabricator.wikimedia.org/T166165#3290832 (10kaldari)
[23:44:46] (03PS2) 10Dzahn: confd: use logrotate::conf for logrotate [puppet] - 10https://gerrit.wikimedia.org/r/352652
[23:47:41] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:48:31] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1020 is OK: OK ferm input default policy is set
[23:57:41] (03CR) 10Dzahn: [C: 032] "resource change but no real change http://puppet-compiler.wmflabs.org/6517/" [puppet] - 10https://gerrit.wikimedia.org/r/352652 (owner: 10Dzahn)