[00:14:53] Operations, ops-eqiad, Traffic, netops: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3564595 (RobH) a: Cmjohnson→RobH I've contacted Dasher about this system failing to take updates, will update task when I have more.
[00:30:59] RECOVERY - Apache HTTP on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.001 second response time
[00:30:59] RECOVERY - Nginx local proxy to apache on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.007 second response time
[00:39:18] RECOVERY - configured eth on mw1228 is OK: OK - interfaces up
[00:39:38] RECOVERY - Disk space on mw1228 is OK: DISK OK
[00:39:38] RECOVERY - Check systemd state on mw1228 is OK: OK - running: The system is fully operational
[00:39:48] RECOVERY - Check whether ferm is active by checking the default input chain on mw1228 is OK: OK ferm input default policy is set
[00:39:48] RECOVERY - HHVM processes on mw1228 is OK: PROCS OK: 6 processes with command name hhvm
[00:39:48] RECOVERY - dhclient process on mw1228 is OK: PROCS OK: 0 processes with command name dhclient
[00:40:08] RECOVERY - Check size of conntrack table on mw1228 is OK: OK: nf_conntrack is 0 % full
[00:42:19] RECOVERY - DPKG on mw1228 is OK: All packages OK
[00:44:26] (PS1) Chad: Drop flaggedrevs from testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/374659
[00:44:53] (PS1) Cmjohnson: Adding mgmt dns entries for mw1307-1328 T165519 [dns] - https://gerrit.wikimedia.org/r/374660
[00:46:19] (CR) Reedy: [C: +1] "DIEEEEEEE" [mediawiki-config] - https://gerrit.wikimedia.org/r/374659 (owner: Chad)
[00:46:39] Operations, ops-eqiad, Patch-For-Review, User-Joe: Decom mw1170-mw1179, and replace them with new systems. - https://phabricator.wikimedia.org/T167130#3564672 (Cmjohnson) @joe, these have been decommissioned and removed from rack. I have racked and have mostly setup the first 22 new servers for y...
[00:47:49] (CR) Chad: [C: +2] "I'm gonna go with no, it hasn't exactly been touched in a long time." [mediawiki-config] - https://gerrit.wikimedia.org/r/374659 (owner: Chad)
[00:48:02] Operations, ops-eqiad: Broken disk on mw1228 - https://phabricator.wikimedia.org/T168613#3564675 (Cmjohnson) @volans @Joe the disks have been replaced and the server reinstalled. This is all yours. Please resolve the task if you are satisfied ...if not please ping me in irc.
[00:49:03] Operations, ops-eqiad, Analytics-Kanban: Degraded RAID on analytics1055 - https://phabricator.wikimedia.org/T172809#3510136 (Cmjohnson) The new disk arrived today and will be swapped on 08/30.
[00:49:16] (Merged) jenkins-bot: Drop flaggedrevs from testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/374659 (owner: Chad)
[00:49:26] (CR) jenkins-bot: Drop flaggedrevs from testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/374659 (owner: Chad)
[00:50:54] Operations, ops-eqiad, Cloud-Services, Patch-For-Review: rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3564686 (Cmjohnson)
[00:52:06] !log demon@tin Synchronized dblists/flaggedrevs.dblist: killing FR on testwiki (duration: 00m 47s)
[00:52:18] Operations, ops-eqiad, Patch-For-Review, User-Elukey, User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3564689 (Cmjohnson)
[00:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:52:20] Operations, ops-eqiad, Patch-For-Review, User-Joe: Decom mw1170-mw1179, and replace them with new systems. - https://phabricator.wikimedia.org/T167130#3564687 (Cmjohnson) Open→Resolved a: Cmjohnson
[00:53:14] !log demon@tin Synchronized wmf-config/flaggedrevs.php: killing FR on testwiki (duration: 00m 46s)
[00:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:09] Operations, ops-eqiad: Run hardware checks on mw1294 - https://phabricator.wikimedia.org/T167406#3564690 (Cmjohnson) Oddly enough this server h/w log is showing cpu errors in both slots but the log entries are dated. The server will need to come down, reseat the CPUs and clear the log. Then monitor to se...
[00:57:39] RECOVERY - nutcracker port on mw1228 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[00:57:39] RECOVERY - nutcracker process on mw1228 is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker
[00:57:43] Operations, ops-eqiad, Traffic: cp1053 possible hardware issues - https://phabricator.wikimedia.org/T165252#3261314 (Cmjohnson) @bblack The server is out of warranty but we could try and re-do the thermal paste.
[01:00:39] RECOVERY - IPMI Temperature on mw1228 is OK: Sensor Type(s) Temperature Status: OK
[01:00:56] (PS1) Chad: Disable EducationProgram from legalteamwiki, unused [mediawiki-config] - https://gerrit.wikimedia.org/r/374661
[01:01:39] RECOVERY - puppet last run on mw1228 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[01:03:48] PROBLEM - Check systemd state on mw1228 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[01:04:08] PROBLEM - Apache HTTP on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 80: Connection refused
[01:04:09] PROBLEM - Nginx local proxy to apache on mw1228 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 327 bytes in 0.006 second response time
[01:04:19] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379
[01:04:38] PROBLEM - MD RAID on rdb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:05:08] PROBLEM - Check health of redis instance on 6380 on rdb2005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:05:09] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1228 is OK: OK: synced at Wed 2017-08-30 01:05:06 UTC.
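The "Check health of redis instance" probe above (and its recovery just below) boils down to a PING plus a replication-delay read. A minimal sketch of that logic, assuming the redis-py client is installed; the host, port, and thresholds are illustrative, not the actual NRPE plugin:

```python
import redis  # assumes the redis-py client is installed

def check_redis(host="127.0.0.1", port=6379, max_delay=60):
    """Rough approximation of an Icinga-style redis health probe."""
    try:
        r = redis.StrictRedis(host=host, port=port, socket_timeout=10)
        r.ping()
    except redis.RedisError as exc:
        return 2, "CRITICAL ERROR - can not ping %s on port %d (%s)" % (host, port, exc)
    info = r.info()
    # Sum key counts across db0, db1, ... entries in INFO output.
    keys = sum(v["keys"] for k, v in info.items() if k.startswith("db"))
    # On a replica, master_last_io_seconds_ago approximates replication delay.
    delay = info.get("master_last_io_seconds_ago", 0)
    if delay > max_delay:
        return 2, "CRITICAL: replication_delay is %s" % delay
    return 0, "OK: REDIS %s has %d keys - replication_delay is %s" % (
        info["redis_version"], keys, delay)

print(check_redis())
```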
[01:05:18] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9783549 keys, up 5 minutes 10 seconds - replication_delay is 0
[01:05:28] RECOVERY - MD RAID on rdb2005 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[01:05:58] RECOVERY - Check health of redis instance on 6380 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 5080385 keys, up 5 minutes 52 seconds - replication_delay is 0
[01:23:22] (CR) Jalexander: [C: +1] "Can confirm, I said he could :D" [mediawiki-config] - https://gerrit.wikimedia.org/r/374661 (owner: Chad)
[01:23:53] (CR) Chad: [C: +2] Disable EducationProgram from legalteamwiki, unused [mediawiki-config] - https://gerrit.wikimedia.org/r/374661 (owner: Chad)
[01:25:16] (Merged) jenkins-bot: Disable EducationProgram from legalteamwiki, unused [mediawiki-config] - https://gerrit.wikimedia.org/r/374661 (owner: Chad)
[01:26:27] (CR) jenkins-bot: Disable EducationProgram from legalteamwiki, unused [mediawiki-config] - https://gerrit.wikimedia.org/r/374661 (owner: Chad)
[01:27:18] RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 74518 bytes in 0.224 second response time
[01:27:19] RECOVERY - Apache HTTP on mw1228 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.074 second response time
[01:27:19] RECOVERY - Check systemd state on mw1228 is OK: OK - running: The system is fully operational
[01:27:19] RECOVERY - Nginx local proxy to apache on mw1228 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.086 second response time
[01:27:34] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: remove education program from legalteamwiki (duration: 00m 47s)
[01:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:33] (PS1) Chad: Don't bother polluting function namespace, just use an anonymous one [mediawiki-config] - https://gerrit.wikimedia.org/r/374663
[01:35:49] (CR) jerkins-bot: [V: -1] Don't bother polluting function namespace, just use an anonymous one [mediawiki-config] - https://gerrit.wikimedia.org/r/374663 (owner: Chad)
[01:36:13] (PS1) Chad: Revoke my old SSH key, unused now [puppet] - https://gerrit.wikimedia.org/r/374664
[01:38:30] (PS2) Chad: Don't bother polluting function namespace, just use an anonymous one [mediawiki-config] - https://gerrit.wikimedia.org/r/374663
[02:02:03] (PS1) Alex Monk: Remove allowances for IPs that are no longer in use [puppet] - https://gerrit.wikimedia.org/r/374665
[02:05:08] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 29 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[02:10:08] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 16 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[02:17:09] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 33 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[02:17:24] !log demon@tin Started deploy [gerrit/gerrit@f33c63f]: Testing, no-op
[02:17:31] !log demon@tin Finished deploy [gerrit/gerrit@f33c63f]: Testing, no-op (duration: 00m 07s)
[02:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:22:15] (PS1) Chad: Gerrit: Start using plugins from scap-deployed version [puppet] - https://gerrit.wikimedia.org/r/374667 (https://phabricator.wikimedia.org/T157414)
[02:25:22] (PS1) Chad: gerrit (2.13.8+git1-wmf.7) jessie-wikimedia; urgency=medium [debs/gerrit] - https://gerrit.wikimedia.org/r/374668
[02:30:08] (PS2) Chad: gerrit (2.13.8+git1-wmf.7) jessie-wikimedia; urgency=medium [debs/gerrit] - https://gerrit.wikimedia.org/r/374668
[02:30:40] (PS3) Chad: gerrit (2.13.8+git1-wmf.7) jessie-wikimedia; urgency=medium [debs/gerrit] - https://gerrit.wikimedia.org/r/374668
[02:32:14] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.15) (duration: 08m 28s)
[02:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:47:59] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:48:08] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:48:19] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:48:19] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:48:19] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:48:19] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:48:19] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:48:19] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:48:19] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:48:20] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:48:20] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:48:21] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:48:28] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:48:28] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:48:28] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:48:38] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:48:38] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:48:38] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:48:38] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:48:38] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:53:48] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:55:08] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:55:09] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:55:09] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:55:19] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:55:19] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:55:38] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:55:39] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:55:39] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:55:39] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:02:18] RECOVERY - MariaDB Slave Lag: m3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 80906.34 seconds
[03:02:18] RECOVERY - MariaDB Slave Lag: s3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 88674.34 seconds
[03:02:18] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[03:02:18] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[03:02:18] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[03:05:28] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:05:28] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:05:28] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:08:50] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.16) (duration: 15m 42s)
[03:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:12:29] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:12:29] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:16:09] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Aug 30 03:16:09 UTC 2017 (duration 7m 19s)
[03:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:16:48] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[03:16:48] RECOVERY - MariaDB Slave Lag: s6 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 85396.31 seconds
[03:16:48] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[03:16:48] RECOVERY - MariaDB Slave Lag: m2 on dbstore1001 is OK: OK slave_sql_lag not a slave
[03:16:48] RECOVERY - MariaDB Slave Lag: x1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 0.34 seconds
[03:16:48] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[03:16:48] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[03:16:49] RECOVERY - MariaDB Slave Lag: s2 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89776.36 seconds
[03:19:58] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:19:58] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:19:58] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:19:58] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:21:39] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[03:21:39] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[03:21:39] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[03:21:48] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[03:21:48] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[03:21:48] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[03:21:48] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[03:21:49] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[03:22:18] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[03:22:19] RECOVERY - MariaDB Slave Lag: s7 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 86408.90 seconds
[03:22:19] RECOVERY - MariaDB Slave Lag: s4 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 88608.91 seconds
[03:22:19] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[03:22:29] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[03:22:29] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[03:22:29] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[03:22:29] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[03:22:29] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[03:22:29] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[03:22:29] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[03:22:30] RECOVERY - MariaDB Slave Lag: s3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89863.71 seconds
[03:22:30] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[03:22:31] RECOVERY - MariaDB Slave Lag: m3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 82036.72 seconds
[03:22:31] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[03:22:32] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[03:32:59] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:37:09] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 19 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[03:37:39] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 847.96 seconds
[03:44:18] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 24 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[03:49:18] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 17 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[04:02:01] PROBLEM - MariaDB disk space on dbstore1002 is CRITICAL: DISK CRITICAL - free space: /srv 397540 MB (5% inode=99%)
[04:25:38] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:25:48] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:26:08] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:26:19] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:27:28] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:27:29] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:27:29] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:27:29] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:27:38] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:27:38] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:27:38] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:27:38] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:27:39] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:08] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:09] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:18] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:19] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:19] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:19] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:19] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:19] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:19] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:20] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:28:20] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
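The MariaDB Slave IO/SQL/Lag checks flapping above are NRPE probes that ultimately read SHOW SLAVE STATUS on the replica. A rough sketch of that mapping, assuming PyMySQL is available; host, credentials, and the lag threshold are illustrative, not the actual plugin:

```python
import pymysql  # assumes PyMySQL is installed

def check_slave(host, user, password, max_lag=300):
    """Map SHOW SLAVE STATUS fields onto Icinga-style states (illustrative)."""
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    with conn.cursor() as cur:
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone()
    if status is None:
        # Hosts like m2 report "not a slave" and pass.
        return 0, "OK slave_sql_state not a slave"
    if status["Slave_IO_Running"] != "Yes":
        return 2, "CRITICAL slave_io_state Slave_IO_Running: %s" % status["Slave_IO_Running"]
    lag = status["Seconds_Behind_Master"]
    if lag is not None and lag > max_lag:
        return 2, "CRITICAL slave_sql_lag Replication lag: %s seconds" % lag
    return 0, "OK slave_sql_lag Replication lag: %s seconds" % lag
```

The "Socket timeout after 10 seconds" bursts are the NRPE transport timing out before the query returns, not the replication state itself changing, which is why the same services recover in bulk minutes later.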
[04:34:39] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:34:48] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:34:48] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:35:18] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:35:18] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:35:28] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:22:08] Operations, Mail, OTRS, Patch-For-Review: Automatically merge bounces/DSNs in ticket - https://phabricator.wikimedia.org/T173733#3564850 (pajz) Still works as expected, e.g. https://ticket.wikimedia.org/otrs/index.pl?Action=AgentTicketZoom;TicketID=10219163#12120794 https://ticket.wikimedia.org...
[05:22:36] Operations, Electron-PDFs, Patch-For-Review, Readers-Web-Backlog (Tracking), Services (blocked): pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3564851 (Joe) This task was about pdfrender failing to start, and that problem has...
[05:27:38] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 187.57 seconds
[05:27:39] RECOVERY - MariaDB Slave Lag: s7 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 88296.02 seconds
[05:27:39] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:28:49] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:28:49] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:30:58] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:34:58] PROBLEM - Check size of conntrack table on mw1301 is CRITICAL: CRITICAL: nf_conntrack is 90 % full
[05:35:38] PROBLEM - Check size of conntrack table on mw1305 is CRITICAL: CRITICAL: nf_conntrack is 91 % full
[05:35:38] PROBLEM - Check size of conntrack table on mw1303 is CRITICAL: CRITICAL: nf_conntrack is 92 % full
[05:35:48] PROBLEM - Check size of conntrack table on mw1300 is CRITICAL: CRITICAL: nf_conntrack is 91 % full
[05:35:58] PROBLEM - Check size of conntrack table on mw1299 is CRITICAL: CRITICAL: nf_conntrack is 94 % full
[05:36:11] mw1300-1305 is me ... that was unexpected
[05:36:28] PROBLEM - Check size of conntrack table on mw1302 is CRITICAL: CRITICAL: nf_conntrack is 94 % full
[05:36:49] should recover momentarily
[05:37:49] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[05:38:28] RECOVERY - Check size of conntrack table on mw1302 is OK: OK: nf_conntrack is 59 % full
[05:38:38] RECOVERY - Check size of conntrack table on mw1305 is OK: OK: nf_conntrack is 72 % full
[05:38:38] RECOVERY - Check size of conntrack table on mw1303 is OK: OK: nf_conntrack is 74 % full
[05:38:39] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:38:48] RECOVERY - Check size of conntrack table on mw1300 is OK: OK: nf_conntrack is 68 % full
[05:38:49] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:38:49] RECOVERY - MariaDB Slave Lag: s7 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 87717.22 seconds
[05:38:58] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[05:38:58] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:38:58] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[05:38:58] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:38:59] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:38:59] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[05:38:59] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[05:39:00] RECOVERY - MariaDB Slave Lag: m3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 82558.02 seconds
[05:39:00] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:39:01] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:39:01] RECOVERY - Check size of conntrack table on mw1301 is OK: OK: nf_conntrack is 70 % full
[05:39:02] RECOVERY - Check size of conntrack table on mw1299 is OK: OK: nf_conntrack is 70 % full
[05:39:08] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:39:08] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:39:09] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[05:39:09] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[05:39:18] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[05:39:18] RECOVERY - MariaDB Slave Lag: m2 on dbstore1001 is OK: OK slave_sql_lag not a slave
[05:39:18] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[05:39:18] RECOVERY - MariaDB Slave Lag: x1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 3284.96 seconds
[05:39:19] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[05:39:19] RECOVERY - MariaDB Slave Lag: s6 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 85541.01 seconds
[05:39:19] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[05:39:19] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
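The conntrack alerts above compare the kernel's connection-tracking table occupancy against its configured maximum; bringing up a batch of appservers (mw1300-1305 here) can briefly spike it. A minimal sketch of that percentage check, reading the same /proc files the plugin presumably consults (Linux with nf_conntrack loaded); the 90% threshold is illustrative:

```python
def conntrack_percent_full():
    """Read nf_conntrack occupancy from /proc (Linux only)."""
    with open("/proc/sys/net/netfilter/nf_conntrack_count") as f:
        count = int(f.read())
    with open("/proc/sys/net/netfilter/nf_conntrack_max") as f:
        maximum = int(f.read())
    return 100 * count // maximum

pct = conntrack_percent_full()
state = "CRITICAL" if pct >= 90 else "OK"  # illustrative threshold
print("%s: nf_conntrack is %d %% full" % (state, pct))
```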
[05:43:38] Operations, Data-Services, Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#3564855 (madhuvishy)
[05:43:41] Operations, ops-codfw, DC-Ops, Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3564853 (madhuvishy) Open→Resolved Thank you so much, that all looks right. Closing this as resolved!
[06:07:08] RECOVERY - MariaDB Slave Lag: s4 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89987.84 seconds
[06:16:00] (Abandoned) Reception123: Added wordmark for Wikipedia Atikamekw [mediawiki-config] - https://gerrit.wikimedia.org/r/368198 (owner: Reception123)
[06:19:51] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler02/7652/index-future.html" [puppet] - https://gerrit.wikimedia.org/r/374553 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[06:19:56] (PS2) Giuseppe Lavagetto: puppetmaster::passenger: fix template scoping [puppet] - https://gerrit.wikimedia.org/r/374553 (https://phabricator.wikimedia.org/T171704)
[06:31:03] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler02/7653/index-future.html and https://puppet-compiler.wmflabs.org/compiler02/7653/" [puppet] - https://gerrit.wikimedia.org/r/374554 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[06:31:08] (PS2) Giuseppe Lavagetto: labspuppetmaster: fix array interpolation in strings [puppet] - https://gerrit.wikimedia.org/r/374554 (https://phabricator.wikimedia.org/T171704)
[06:36:08] (PS2) Muehlenhoff: Revoke my old SSH key, unused now [puppet] - https://gerrit.wikimedia.org/r/374664 (owner: Chad)
[06:37:35] (CR) Muehlenhoff: [C: +2] Revoke my old SSH key, unused now [puppet] - https://gerrit.wikimedia.org/r/374664 (owner: Chad)
[06:37:40] (PS3) Muehlenhoff: Revoke my old SSH key, unused now [puppet] - https://gerrit.wikimedia.org/r/374664 (owner: Chad)
[06:38:07] Operations, ops-eqiad, Analytics, Analytics-Cluster, and 2 others: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3564922 (elukey) First thing that came up in my mind is a previous fight with Partman that ended up in this note: https://wikitec...
[06:49:30] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:49:38] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:06:17] Operations, ops-eqiad, DBA: BBU issues on db1055, RAID cache on WriteThrough - https://phabricator.wikimedia.org/T174265#3564954 (Marostegui) As this is looking good after the whole night, I am going to start slowly repooling it back
[07:07:36] (PS1) Marostegui: db-eqiad.php: Slowly repool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/374701 (https://phabricator.wikimedia.org/T174265)
[07:09:30] (CR) Marostegui: [C: +2] db-eqiad.php: Slowly repool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/374701 (https://phabricator.wikimedia.org/T174265) (owner: Marostegui)
[07:10:59] (Merged) jenkins-bot: db-eqiad.php: Slowly repool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/374701 (https://phabricator.wikimedia.org/T174265) (owner: Marostegui)
[07:11:09] (CR) jenkins-bot: db-eqiad.php: Slowly repool db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/374701 (https://phabricator.wikimedia.org/T174265) (owner: Marostegui)
[07:12:24] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1055 - T174265 (duration: 00m 52s)
[07:12:24] !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: mw1294.eqiad.wmnet
[07:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:40] T174265: BBU issues on db1055, RAID cache on WriteThrough - https://phabricator.wikimedia.org/T174265
[07:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:11] !log powering down mw1294 for hardware diagnostics (T167406)
[07:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:40] Operations, ops-eqiad: Run hardware checks on mw1294 - https://phabricator.wikimedia.org/T167406#3564964 (MoritzMuehlenhoff) @Cmjohnson I've depooled the server and powered it down, CPUs can be shuffled.
[07:20:07] (PS1) Giuseppe Lavagetto: prometheus::class_config: avoid validate_re for an integer [puppet] - https://gerrit.wikimedia.org/r/374704 (https://phabricator.wikimedia.org/T171704)
[07:22:06] Operations, Analytics, User-Elukey, User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#3564969 (elukey)
[07:22:58] Operations, Analytics, User-Elukey, User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#3284834 (elukey) Due to the new Kafka Jumbo cluster (and other things like the Eventlogging cleaner script) I didn't get much time to schedule/plan this work, that may en...
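db1055 is being brought back the usual cautious way: repool at low weight, then raise the weight in steps while watching for trouble (the repeated "Give db1055 more traffic" syncs later in the log, ending with the full-weight repool around 11:06). A toy sketch of that ramp pattern; the fractions, soak time, and the apply_weight hook are invented for illustration, not the real tooling:

```python
import time

RAMP = [0.1, 0.25, 0.5, 1.0]  # hypothetical fractions of the normal weight

def repool_gradually(host, full_weight, apply_weight, soak_seconds=1):
    """Raise a replica's weight step by step, soaking between bumps (illustrative)."""
    for fraction in RAMP:
        apply_weight(host, int(full_weight * fraction))
        time.sleep(soak_seconds)  # in reality: watch lag and error dashboards here

repool_gradually("db1055", 100,
                 lambda h, w: print("edit db-eqiad.php: %s => %d, then sync" % (h, w)))
```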
[07:39:02] (PS1) Marostegui: db-eqiad.php: Depool db1059 [mediawiki-config] - https://gerrit.wikimedia.org/r/374705 (https://phabricator.wikimedia.org/T168661)
[07:40:38] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1059 [mediawiki-config] - https://gerrit.wikimedia.org/r/374705 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[07:42:07] (Merged) jenkins-bot: db-eqiad.php: Depool db1059 [mediawiki-config] - https://gerrit.wikimedia.org/r/374705 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[07:42:20] (CR) jenkins-bot: db-eqiad.php: Depool db1059 [mediawiki-config] - https://gerrit.wikimedia.org/r/374705 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[07:44:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1059 for a MariaDB upgrade - T168661 (duration: 00m 53s)
[07:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:18] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[07:47:10] !log Upgrade MariaDB on db1059 to 10.0.32 - T168661
[07:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:53] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw1228.eqiad.wmnet
[07:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:08] RECOVERY - salt-minion processes on mw1228 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:48:34] Operations, ops-eqiad: Broken disk on mw1228 - https://phabricator.wikimedia.org/T168613#3564981 (MoritzMuehlenhoff) Open→Resolved a: RobH→MoritzMuehlenhoff Thanks Chris, looks fine. I fixed the salt key, ran "scap pull" and repooled the server.
[07:48:59] (PS1) KartikMistry: WIP: Add Matxin MT service for ContentTranslation [puppet] - https://gerrit.wikimedia.org/r/374706
[07:49:26] (PS1) Marostegui: mariadb: Update socket location for db1059 [puppet] - https://gerrit.wikimedia.org/r/374707 (https://phabricator.wikimedia.org/T148507)
[07:51:24] RECOVERY - MariaDB disk space on dbstore1002 is OK: DISK OK
[07:52:35] (PS2) KartikMistry: WIP: Add Matxin MT service for ContentTranslation [puppet] - https://gerrit.wikimedia.org/r/374706
[07:52:55] (CR) jerkins-bot: [V: -1] WIP: Add Matxin MT service for ContentTranslation [puppet] - https://gerrit.wikimedia.org/r/374706 (owner: KartikMistry)
[07:54:21] (CR) Marostegui: [C: +2] mariadb: Update socket location for db1059 [puppet] - https://gerrit.wikimedia.org/r/374707 (https://phabricator.wikimedia.org/T148507) (owner: Marostegui)
[07:54:39] PROBLEM - MD RAID on ms-be2024 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0
[07:54:48] RECOVERY - mediawiki-installation DSH group on mw1228 is OK: OK
[07:54:48] PROBLEM - Check systemd state on ms-be2024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:54:59] PROBLEM - swift-container-updater on ms-be2024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[07:55:09] PROBLEM - Disk space on ms-be2024 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sda4 is not accessible: Input/output error
[07:55:12] (PS3) KartikMistry: WIP: Add Matxin MT service for ContentTranslation [puppet] - https://gerrit.wikimedia.org/r/374706
[07:59:08] godog: FYI I'm forcing the change status on ms-be2024 to trigger again the raid handler, it was triggered with SOFT state somehow...
[08:00:18] RECOVERY - MD RAID on ms-be2024 is OK: forcing the re-trigger of the raid handler
[08:01:28] PROBLEM - MD RAID on ms-be2024 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0
[08:01:29] ACKNOWLEDGEMENT - MD RAID on ms-be2024 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T174534
[08:01:35] there we go!
[08:01:41] Operations, ops-codfw: Degraded RAID on ms-be2024 - https://phabricator.wikimedia.org/T174534#3564998 (ops-monitoring-bot)
[08:01:54] (PS1) Marostegui: db-eqiad.php: Increase weight on db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/374709 (https://phabricator.wikimedia.org/T174265)
[08:02:24] Operations, ops-codfw, media-storage: Degraded RAID on ms-be2024 - https://phabricator.wikimedia.org/T174534#3565004 (Volans)
[08:03:46] (CR) Marostegui: [C: +2] db-eqiad.php: Increase weight on db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/374709 (https://phabricator.wikimedia.org/T174265) (owner: Marostegui)
[08:05:13] (Merged) jenkins-bot: db-eqiad.php: Increase weight on db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/374709 (https://phabricator.wikimedia.org/T174265) (owner: Marostegui)
[08:06:18] (CR) jenkins-bot: db-eqiad.php: Increase weight on db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/374709 (https://phabricator.wikimedia.org/T174265) (owner: Marostegui)
[08:06:20] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Give db1055 more traffic - T174265 (duration: 00m 47s)
[08:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:32] T174265: BBU issues on db1055, RAID cache on WriteThrough - https://phabricator.wikimedia.org/T174265
[08:11:04] volans: odd! I've never seen that before
[08:11:47] what part?
[08:12:56] volans: "it was triggered with SOFT state somehow"
[08:14:09] godog: see https://etherpad.wikimedia.org/p/volans-tmp3
[08:14:23] the HARD call was in the middle and with a timeout message, so was skipped
[08:14:33] but the IRC notification was with the correct message
[08:14:50] look at service_state_type
[08:15:15] * volans would like a no-wrap option for etherpad :D
[08:15:30] ah ok, so in case of soft the raid handler had nothing to do and that's it
[08:15:54] yeah, we wait for the HARD state ofc
[08:16:56] but in this case we got CRIT-SOFT -> OK-SOFT -> CRIT-HARD (timeout) -> CRIT-SOFT -> CRIT-SOFT
[08:17:01] that doesn't make sense to me
[08:17:21] but I'm not gonna dig deeper this time ;)
[08:17:37] maybe akosiaris wants to have some more fun :-P
[08:18:25] hehehe
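The behaviour volans and godog are puzzling over comes from event handlers that deliberately ignore SOFT states (retries still in progress) and only act once a check reaches a HARD state. A minimal sketch of that gating; the function and return strings are illustrative, not the actual raid handler, though Icinga does pass the state via macros like $SERVICESTATE$ and $SERVICESTATETYPE$:

```python
def raid_handler(service_state, service_state_type, attempt):
    """Act only on a HARD CRITICAL, the way the auto-ack handler appears to gate."""
    if service_state_type != "HARD":
        # SOFT states mean retries are still in progress: do nothing yet.
        return "skipped (SOFT, attempt %s)" % attempt
    if service_state != "CRITICAL":
        return "skipped (state %s)" % service_state
    return "acknowledge alert and file a task"

# The sequence volans saw: CRIT-SOFT -> OK-SOFT -> CRIT-HARD (timeout) -> CRIT-SOFT
for state, state_type, attempt in [("CRITICAL", "SOFT", 1), ("OK", "SOFT", 2),
                                   ("CRITICAL", "HARD", 3), ("CRITICAL", "SOFT", 1)]:
    print(state, state_type, "->", raid_handler(state, state_type, attempt))
```

The wrinkle in this incident was that the one HARD call carried a timeout message instead of the real check output, so the handler skipped it; forcing a state change re-triggered it with the correct message.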
[08:20:52] andrewbogott: essentially yes, though it doesn't have to be, we're using "file_sd" for targets for prometheus to ask metrics to. It just so happens that puppet generates the files for prometheus in the same role, but it doesn't have to be
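file_sd is Prometheus's file-based service discovery: the server watches JSON (or YAML) files listing scrape targets, so any external tool (here, puppet) can regenerate them without touching the Prometheus config. A minimal sketch of writing one such target file; the output path, port, and labels are invented for illustration:

```python
import json

# Hypothetical target list; in production puppet templates something similar.
targets = [{
    "targets": ["mw1228.eqiad.wmnet:9100", "mw1229.eqiad.wmnet:9100"],
    "labels": {"cluster": "appserver", "site": "eqiad"},
}]

with open("node_targets.json", "w") as f:
    json.dump(targets, f, indent=2)

# prometheus.yml would then point a scrape job at the file, e.g.:
#   file_sd_configs:
#     - files: ['/etc/prometheus/targets/node_targets.json']
```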
[08:29:11] (PS1) Ema: VCL: drop VSV00001 (DSA 3924-1) DoS workaround [puppet] - https://gerrit.wikimedia.org/r/374716
[08:29:38] (CR) Paladox: "We need to add the its-phabricator plugin to the scap repo." [debs/gerrit] - https://gerrit.wikimedia.org/r/374668 (owner: Chad)
[08:32:36] (CR) Paladox: Gerrit: Start using plugins from scap-deployed version (1 comment) [puppet] - https://gerrit.wikimedia.org/r/374667 (https://phabricator.wikimedia.org/T157414) (owner: Chad)
[08:43:16] volans: nice try
[08:46:13] (PS2) Giuseppe Lavagetto: prometheus: avoid validate_re for an integer [puppet] - https://gerrit.wikimedia.org/r/374704 (https://phabricator.wikimedia.org/T171704)
[08:46:15] (PS1) Giuseppe Lavagetto: requesttracker: fix further template scoping [puppet] - https://gerrit.wikimedia.org/r/374736 (https://phabricator.wikimedia.org/T171704)
[08:46:17] (PS1) Giuseppe Lavagetto: service::node: use validate_numeric for validating parameters [puppet] - https://gerrit.wikimedia.org/r/374737
[08:46:19] (PS1) Giuseppe Lavagetto: sysfs::conffile: use validate_numeric for number validation [puppet] - https://gerrit.wikimedia.org/r/374738 (https://phabricator.wikimedia.org/T171704)
[08:46:21] (PS1) Giuseppe Lavagetto: varnish::common::vcl: fix template scoping [puppet] - https://gerrit.wikimedia.org/r/374739 (https://phabricator.wikimedia.org/T171704)
[08:46:54] (PS1) Marostegui: db-eqiad.php: Give more weight to db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/374741 (https://phabricator.wikimedia.org/T174265)
[08:47:03] (CR) Gehel: [C: +2] apertium - switch to logrotate::rule [puppet] - https://gerrit.wikimedia.org/r/373510 (owner: Gehel)
[08:47:12] (PS5) Gehel: apertium - switch to logrotate::rule [puppet] - https://gerrit.wikimedia.org/r/373510
[08:48:40] Operations, Mail, OTRS, Patch-For-Review: Automatically merge bounces/DSNs in ticket - https://phabricator.wikimedia.org/T173733#3565069 (akosiaris) stalled→Resolved Yes, agreed. Resolving.
[08:49:07] (CR) Marostegui: [C: +2] db-eqiad.php: Give more weight to db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/374741 (https://phabricator.wikimedia.org/T174265) (owner: Marostegui)
[08:50:33] (Merged) jenkins-bot: db-eqiad.php: Give more weight to db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/374741 (https://phabricator.wikimedia.org/T174265) (owner: Marostegui)
[08:50:44] (CR) jenkins-bot: db-eqiad.php: Give more weight to db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/374741 (https://phabricator.wikimedia.org/T174265) (owner: Marostegui)
[08:51:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Give db1055 more traffic - T174265 (duration: 00m 48s)
[08:51:48] PROBLEM - Apache HTTP on mw1223 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time
[08:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:49] PROBLEM - HHVM rendering on mw1223 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[08:51:49] T174265: BBU issues on db1055, RAID cache on WriteThrough - https://phabricator.wikimedia.org/T174265
[08:52:48] RECOVERY - Apache HTTP on mw1223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.264 second response time
[08:52:58] RECOVERY - HHVM rendering on mw1223 is OK: HTTP OK: HTTP/1.1 200 OK - 74622 bytes in 2.367 second response time
[08:56:47] Operations, ops-codfw, media-storage: Degraded RAID on ms-be2024 - https://phabricator.wikimedia.org/T174534#3565097 (fgiunchedi) a: Papaul I don't seem to be able to login on ms-be2024 at all, from console it looks like both SSDs are considered offline: ``` [2506558.978048] sd 0:1:0:0: rejecting...
[08:56:59] akosiaris: had to try ;)
[09:07:22] !log restart all jvm daemons on druid100[123] for security updates
[09:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:12] (PS3) Alexandros Kosiaris: kubelet: Remove configure-cbr0 parameter [puppet] - https://gerrit.wikimedia.org/r/374556 (https://phabricator.wikimedia.org/T170119)
[09:08:18] (CR) Alexandros Kosiaris: [V: +2 C: +2] kubelet: Remove configure-cbr0 parameter [puppet] - https://gerrit.wikimedia.org/r/374556 (https://phabricator.wikimedia.org/T170119) (owner: Alexandros Kosiaris)
[09:09:37] (PS1) Marostegui: db-eqiad.php: Repool db1059 [mediawiki-config] - https://gerrit.wikimedia.org/r/374764 (https://phabricator.wikimedia.org/T168661)
[09:10:30] (PS2) Marostegui: db-eqiad.php: Repool db1059 [mediawiki-config] - https://gerrit.wikimedia.org/r/374764 (https://phabricator.wikimedia.org/T168661)
[09:11:00] (PS1) Hashar: Rebuild for Jessie + PHP 5.5 [debs/pkg-php/php-defaults] (debian/jessie-wikimedia) - https://gerrit.wikimedia.org/r/374766
[09:13:30] (CR) Marostegui: [C: +2] db-eqiad.php: Repool db1059 [mediawiki-config] - https://gerrit.wikimedia.org/r/374764 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[09:15:07] (CR) Hashar: "recheck" [debs/pkg-php/php-defaults] (debian/jessie-wikimedia) - https://gerrit.wikimedia.org/r/374766 (owner: Hashar)
[09:16:21] (Merged) jenkins-bot: db-eqiad.php: Repool db1059 [mediawiki-config] - https://gerrit.wikimedia.org/r/374764 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[09:16:38] (CR) jenkins-bot: db-eqiad.php: Repool db1059 [mediawiki-config] - https://gerrit.wikimedia.org/r/374764 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[09:17:15] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1059 - T168661 (duration: 00m 47s)
[09:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:27] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[09:18:34] (CR) Filippo Giunchedi: "See inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/374650 (https://phabricator.wikimedia.org/T169039) (owner: Rush)
[09:20:30] Operations, Traffic: Unclear LVS bandwidth graph in "load balancers" dashboard - https://phabricator.wikimedia.org/T174432#3565123 (ema) p: Triage→Normal
[09:22:16] (PS1) Muehlenhoff: Cleanup JSON messaging between server and clients [debs/debdeploy] - https://gerrit.wikimedia.org/r/374776
[09:26:35] (PS1) Marostegui: db-eqiad.php: Give more weight to db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/374777 (https://phabricator.wikimedia.org/T174265)
[09:28:55] (PS1) Giuseppe Lavagetto: varnish: convert to string integers [puppet] - https://gerrit.wikimedia.org/r/374778 (https://phabricator.wikimedia.org/T171704)
[09:31:01] (CR) Marostegui: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/374777 (https://phabricator.wikimedia.org/T174265) (owner: Marostegui)
[09:37:27] (CR) Volans: "Much nicer indeed! Thanks for fixing this. A couple of comments inline." (3 comments) [debs/debdeploy] - https://gerrit.wikimedia.org/r/374776 (owner: Muehlenhoff)
[09:39:17] hashar: is there anything wrong with jenkins?
[09:39:49] A patch I have submitted looks stalled even though it looks like it has finished
[09:39:56] i am checking in zuul
[09:40:46] Operations, Traffic: Unclear LVS bandwidth graph in "load balancers" dashboard - https://phabricator.wikimedia.org/T174432#3565159 (fgiunchedi) Yes they are LVS-specific in the sense that the metrics backing the graphs come from `/proc/net/ip_vs*` and thus only for ipvs-managed services, and indeed for lv...
[09:44:56] it went thru now…weird
[09:45:06] 20 minutes after it said success in zuul
[09:45:11] (CR) Marostegui: [C: +2] db-eqiad.php: Give more weight to db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/374777 (https://phabricator.wikimedia.org/T174265) (owner: Marostegui)
[09:49:21] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Give db1055 more traffic - T174265 (duration: 00m 47s)
[09:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:34] T174265: BBU issues on db1055, RAID cache on WriteThrough - https://phabricator.wikimedia.org/T174265
[09:49:47] Operations, Traffic: Unclear LVS bandwidth graph in "load balancers" dashboard - https://phabricator.wikimedia.org/T174432#3561810 (ema) >>! In T174432#3562830, @BBlack wrote: > Are the non-icmp graphs somehow LVS-specific? Yes, the metrics are: node_ipvs_backend_connections_active, node_ipvs_incoming_p...
[09:50:19] PROBLEM - Check systemd state on chlorine is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:50:19] PROBLEM - puppet last run on chlorine is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 12 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[kube-apiserver]
[09:51:14] (CR) Filippo Giunchedi: [C: +2] "Sorry about the delay in review :(" [puppet] - https://gerrit.wikimedia.org/r/368522 (owner: MaxSem)
[09:51:51] (PS3) Filippo Giunchedi: logging: Remove exceptionmonitor [puppet] - https://gerrit.wikimedia.org/r/368522 (owner: MaxSem)
[09:59:02] (CR) Filippo Giunchedi: "recheck" [puppet] - https://gerrit.wikimedia.org/r/368522 (owner: MaxSem)
[10:00:04] godog: looks like you are having the same issue I was having
[10:00:15] But at least my change appeared on zuul, yours doesn't (yet)
[10:01:50] marostegui: ah, yeah sounds like the same issue
[10:01:54] I see the changes in zuul now
[10:02:17] (CR) jenkins-bot: db-eqiad.php: Give more weight to db1055 [mediawiki-config] - https://gerrit.wikimedia.org/r/374777 (https://phabricator.wikimedia.org/T174265) (owner: Marostegui)
[10:03:08] let's see what it does, mine was stuck in "success" for 20 minutes
[10:04:46] is db1055 back into prod?
[10:04:55] with restricted weight
[10:05:17] yeah, no problem, I just wanted to remove the downtimes/disabled notifications
[10:05:21] ah sure
[10:05:24] let me do that
[10:05:40] so we notice in case something happens- or we would forget
[10:05:47] yeah, very good point
[10:05:49] doing it now
[10:06:34] (CR) Muehlenhoff: "Thanks, all done." (3 comments) [debs/debdeploy] - https://gerrit.wikimedia.org/r/374776 (owner: Muehlenhoff)
[10:06:53] (PS2) Muehlenhoff: Cleanup JSON messaging between server and clients [debs/debdeploy] - https://gerrit.wikimedia.org/r/374776
[10:17:19] (CR) Ema: [C: +2] VCL: drop VSV00001 (DSA 3924-1) DoS workaround [puppet] - https://gerrit.wikimedia.org/r/374716 (owner: Ema)
[10:17:24] (PS2) Ema: VCL: drop VSV00001 (DSA 3924-1) DoS workaround [puppet] - https://gerrit.wikimedia.org/r/374716
[10:17:28] (CR) Ema: [V: +2 C: +2] VCL: drop VSV00001 (DSA 3924-1) DoS workaround [puppet] - https://gerrit.wikimedia.org/r/374716 (owner: Ema)
[10:18:04] (PS1) Hashar: [WMF] jessie: tweak build dependencies [debs/pkg-php/php] (debian/jessie-wikimedia-5.5) - https://gerrit.wikimedia.org/r/374782 (https://phabricator.wikimedia.org/T161882)
[10:18:13] (CR) Volans: "Minor comments to be a bit more DRY" (4 comments) [debs/debdeploy] - https://gerrit.wikimedia.org/r/374776 (owner: Muehlenhoff)
[10:20:35] (PS1) Marostegui: db-eqiad.php: Give db1055 more traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/374783 (https://phabricator.wikimedia.org/T174265)
[10:22:14] (CR) Marostegui: [C: +2] db-eqiad.php: Give db1055 more traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/374783 (https://phabricator.wikimedia.org/T174265) (owner: Marostegui)
[10:22:39] Operations, ops-eqiad, DBA, Phabricator, Patch-For-Review: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3565221 (jcrespo) Resolved→Open I think this happened again yesterday- I will just modify this task into a decommissioning one.
[10:23:43] (Merged) jenkins-bot: db-eqiad.php: Give db1055 more traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/374783 (https://phabricator.wikimedia.org/T174265) (owner: Marostegui)
[10:24:00] Operations, DBA, Phabricator: Decom db1048 (BBU Faulty - slave lagging) - https://phabricator.wikimedia.org/T160731#3565225 (jcrespo)
[10:24:42] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Give db1055 more traffic - T174265 (duration: 00m 47s)
[10:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:56] T174265: BBU issues on db1055, RAID cache on WriteThrough - https://phabricator.wikimedia.org/T174265
[10:26:00] (CR) jenkins-bot: db-eqiad.php: Give db1055 more traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/374783 (https://phabricator.wikimedia.org/T174265) (owner: Marostegui)
[10:31:53] (CR) Muehlenhoff: Cleanup JSON messaging between server and clients (4 comments) [debs/debdeploy] - https://gerrit.wikimedia.org/r/374776 (owner: Muehlenhoff)
[10:32:03] (PS3) Muehlenhoff: Cleanup JSON messaging between server and clients [debs/debdeploy] - https://gerrit.wikimedia.org/r/374776
[10:36:20] (CR) Volans: [C: +1] "LGTM, nitpicking inline (feel free to ignore it for now and sorry for not having spotted it before)" (1 comment) [debs/debdeploy] - https://gerrit.wikimedia.org/r/374776 (owner: Muehlenhoff)
[10:38:42] (CR) Muehlenhoff: Cleanup JSON messaging between server and clients (1 comment) [debs/debdeploy] - https://gerrit.wikimedia.org/r/374776 (owner: Muehlenhoff)
[10:43:56] Operations, Traffic, Wikidata, wikiba.se, Wikidata-Sprint-2016-11-08: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#1293753 (Lydia_Pintscher) After discussion with Faidon at Wikimania we agreed: * hosting can move now * domain is registe...
[11:00:45] (PS1) Marostegui: db-eqiad.php: Repool db1055 with full weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374787 (https://phabricator.wikimedia.org/T174265)
[11:03:27] (CR) Marostegui: [C: +2] db-eqiad.php: Repool db1055 with full weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374787 (https://phabricator.wikimedia.org/T174265) (owner: Marostegui)
[11:03:53] (PS4) Muehlenhoff: Cleanup JSON messaging between server and clients [debs/debdeploy] - https://gerrit.wikimedia.org/r/374776
[11:04:34] Hey, I'm trying to get https://phabricator.wikimedia.org/T169969 done but I'm not sure about directories of ores logs in logstash nodes. Can I have access to them for one hour?
[11:04:58] (Merged) jenkins-bot: db-eqiad.php: Repool db1055 with full weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374787 (https://phabricator.wikimedia.org/T174265) (owner: Marostegui)
[11:05:04] It's okay if it's not possible
[11:06:01] (CR) jenkins-bot: db-eqiad.php: Repool db1055 with full weight [mediawiki-config] - https://gerrit.wikimedia.org/r/374787 (https://phabricator.wikimedia.org/T174265) (owner: Marostegui)
[11:06:10] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1055 original weight - T174265 (duration: 00m 46s)
[11:06:11] sorry, it's not logstash nodes, graphite
[11:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:24] T174265: BBU issues on db1055, RAID cache on WriteThrough - https://phabricator.wikimedia.org/T174265
[11:06:32] Operations, ops-eqiad, DBA, Patch-For-Review: BBU issues on db1055, RAID cache on WriteThrough - https://phabricator.wikimedia.org/T174265#3565301 (Marostegui) Open→Resolved a: Cmjohnson The original weight values have been set now. I will close this for now Thanks @Cmjohnson for help...
[11:11:44] Made https://phabricator.wikimedia.org/T174542
[11:11:49] Operations, Ops-Access-Requests, Scoring-platform-team, Graphite, User-fgiunchedi: Temporarily access request to graphite nodes - https://phabricator.wikimedia.org/T174542#3565331 (Ladsgroup)
[11:17:59] Operations, Commons, MediaWiki-extensions-Scribunto, Patch-For-Review, Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 when accessed from non-English languages specified in the template - https://phabricator.wikimedia.org/T171392#3463322 (Johnuni...
[11:19:08] Operations, DBA, Scoring-platform-team, cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3369055 (faidon) JFTR, since I didn't see it mentioned neither here nor in T142807, how impending is that decomm? Days/weeks/months?
[11:21:16] (CR) Muehlenhoff: [C: +2] Cleanup JSON messaging between server and clients [debs/debdeploy] - https://gerrit.wikimedia.org/r/374776 (owner: Muehlenhoff)
[11:26:10] Operations, Mail: mail.wikimedia.org SSL cert expiring Mon 23 Oct 2017 - https://phabricator.wikimedia.org/T174081#3550469 (faidon) For the history side of it :), mx1002/mx2002 never existed, it was just me hoping to get around in building additional MXes (and possibly splitting roles, e.g. inbound and o...
[11:51:03] 10Operations, 10Ops-Access-Requests, 10Scoring-platform-team, 10Graphite, 10User-fgiunchedi: Temporarily access request to graphite nodes - https://phabricator.wikimedia.org/T174542#3565331 (10MarcoAurelio) Amir is trusted, support from me. [11:54:48] (03PS1) 10Muehlenhoff: Handle incorrect package names in reverse dependency query [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/374791 [12:00:12] (03PS2) 10Muehlenhoff: Make the server group / Cumin alias configurable [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/373870 [12:00:14] (03PS1) 10Marostegui: db-eqiad.php: Depool db1053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374792 (https://phabricator.wikimedia.org/T168661) [12:02:54] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374792 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:03:03] (03PS1) 10Marostegui: mariadb: Update db1053 socket location [puppet] - 10https://gerrit.wikimedia.org/r/374793 (https://phabricator.wikimedia.org/T148507) [12:04:28] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374792 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:05:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1053 - T168661 (duration: 00m 47s) [12:05:35] !log Upgrade MariaDB to 10.0.32 on db1053 - T168661 [12:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:43] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [12:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:09] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374792 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:06:56] (03PS2) 10Hashar: Rebuild for Jessie + PHP 5.5 [debs/pkg-php/php-defaults] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/374766 [12:09:03] (03CR) 10Volans: [C: 031] "LGTM" [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/374791 (owner: 10Muehlenhoff) [12:12:15] (03CR) 10Marostegui: [C: 032] mariadb: Update db1053 socket location [puppet] - 10https://gerrit.wikimedia.org/r/374793 (https://phabricator.wikimedia.org/T148507) (owner: 10Marostegui) [12:17:22] (03CR) 10Muehlenhoff: [C: 032] Handle incorrect package names in reverse dependency query [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/374791 (owner: 10Muehlenhoff) [12:19:57] (03PS1) 10Alexandros Kosiaris: kubernetes: Refactor/add admission controllers [puppet] - 10https://gerrit.wikimedia.org/r/374795 (https://phabricator.wikimedia.org/T170119) [12:20:05] (03PS1) 10Marostegui: db-eqiad.php: Repool db1053 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374796 (https://phabricator.wikimedia.org/T168661) [12:20:54] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 when accessed from non-English languages specified in the template - https://phabricator.wikimedia.org/T171392#3565472 (10Strainu... 
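The db1053/db1055 depool-and-repool churn above is a series of edits to wmf-config/db-eqiad.php, where each replica's share of read traffic is a weight in the sectionLoads arrays. A minimal sketch of what such a change looks like, assuming the usual sectionLoads layout (the section, hostnames and weights here are illustrative, not the values actually synced):

```php
// Illustrative excerpt in the style of wmf-config/db-eqiad.php; not the real file.
'sectionLoads' => [
    's1' => [
        'db1052' => 0,   // master: weight 0, so it serves no general reads
        'db1053' => 50,  // just repooled: low weight while caches warm up
        'db1055' => 300, // healthy replica at full weight
    ],
],
```

Each step (depool, repool with low weight, increase, restore original) is a separate commit merged by jenkins-bot and pushed out with scap, which is why the log shows a `Synchronized wmf-config/db-eqiad.php` entry after every bump.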
[12:23:24] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1053 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374796 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:25:18] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1053 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374796 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:26:01] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1053 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374796 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:26:44] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1053 with low weight - T168661 (duration: 00m 46s) [12:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:57] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [12:35:36] (03PS1) 10Marostegui: db-eqiad.php: Increase weight on db1053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374802 (https://phabricator.wikimedia.org/T168661) [12:37:41] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight on db1053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374802 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:39:11] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight on db1053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374802 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:39:20] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight on db1053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374802 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:40:21] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1053 weight - T168661 (duration: 00m 47s) [12:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:32] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [12:40:41] Amir1: I was looking at your request for graphite access, there's a list of ores metrics files in your home on tin, please LMK if that would be enough to understand what to purge? if not we can do the request [12:41:10] sure thanks [12:41:53] godog: can I download them? 
there is no PII in it [12:42:34] Amir1: yeah I think so [12:44:08] thanks [12:50:08] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [12:50:39] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0 [12:51:33] (03PS1) 10Marostegui: db-eqiad.php: Increase db1053 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374804 (https://phabricator.wikimedia.org/T168661) [12:51:35] !log restart wdqs-updater on wdqs2001 for config change [12:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:22] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase db1053 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374804 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:53:35] (03PS1) 10Hashar: contint: aptly server in labs [puppet] - 10https://gerrit.wikimedia.org/r/374805 (https://phabricator.wikimedia.org/T161882) [12:54:48] (03Merged) 10jenkins-bot: db-eqiad.php: Increase db1053 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374804 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:55:40] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1053 weight - T168661 (duration: 00m 45s) [12:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:53] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [12:56:01] (03CR) 10jenkins-bot: db-eqiad.php: Increase db1053 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374804 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:57:49] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [12:58:18] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [13:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170830T1300). [13:00:05] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:11] Here [13:01:40] (03PS1) 10Gehel: wdqs - tuning of logback configuration to send logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/374806 (https://phabricator.wikimedia.org/T172710) [13:02:11] Who'll be the SWATter? [13:02:14] zeljkof? 
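Some context for the graphite exchange just above (T169969): graphite stores each metric as a Whisper file on disk, so purging old ORES metrics comes down to deleting stale .wsp files, and a plain listing of them was enough for Amir to proceed. A hedged sketch, assuming the conventional /var/lib/carbon/whisper layout; the path and the one-year cutoff are assumptions, not the actual cleanup job:

```bash
# Dry run: list ORES whisper files that have not been updated in over a year
find /var/lib/carbon/whisper/ores -name '*.wsp' -mtime +365 -print

# Once the listing looks sane, the same expression with -delete removes them
# (ideally run as the carbon/graphite user rather than root):
# find /var/lib/carbon/whisper/ores -name '*.wsp' -mtime +365 -delete
```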
o/ [13:02:56] o/ [13:03:03] Hi hashar :) [13:03:12] (03CR) 10Hashar: [C: 032] Enable SandboxLink on cywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372531 (https://phabricator.wikimedia.org/T173054) (owner: 10Urbanecm) [13:03:50] will push it on mwdebug1001 [13:04:04] ack [13:04:08] (03CR) 10Muehlenhoff: Make the server group / Cumin alias configurable (033 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/373870 (owner: 10Muehlenhoff) [13:04:19] (03PS3) 10Muehlenhoff: Make the server group / Cumin alias configurable [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/373870 [13:04:25] (03PS2) 10Gehel: wdqs - tuning of logback configuration to send logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/374806 (https://phabricator.wikimedia.org/T172710) [13:04:44] (03PS5) 10Hashar: Enable SandboxLink on cywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372531 (https://phabricator.wikimedia.org/T173054) (owner: 10Urbanecm) [13:04:49] (03CR) 10Hashar: [C: 032] Enable SandboxLink on cywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372531 (https://phabricator.wikimedia.org/T173054) (owner: 10Urbanecm) [13:04:51] bah [13:06:12] (03Merged) 10jenkins-bot: Enable SandboxLink on cywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372531 (https://phabricator.wikimedia.org/T173054) (owner: 10Urbanecm) [13:06:21] (03CR) 10jenkins-bot: Enable SandboxLink on cywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372531 (https://phabricator.wikimedia.org/T173054) (owner: 10Urbanecm) [13:06:31] 10Operations, 10ops-esams: bast3002 sdb broken - https://phabricator.wikimedia.org/T169035#3565570 (10mark) These systems are 3,5" drives, not hot swap. We have a lot of SFF spares (mostly SSDs), but no LFF, and these are well out of warranty. I could steal a drive from one of the (many) other decom'ed servers... [13:06:53] Urbanecm: it is on mwdebug1001 [13:07:19] working, please deploy to the whole universe [13:07:38] RECOVERY - Check systemd state on chlorine is OK: OK - running: The system is fully operational [13:09:51] (03PS2) 10Hashar: contint: aptly server in labs [puppet] - 10https://gerrit.wikimedia.org/r/374805 (https://phabricator.wikimedia.org/T161882) [13:10:00] hashar, Urbanecm: sorry, can't do SWAT today [13:10:11] hashar's SWATting [13:10:28] syncing :D [13:10:39] PROBLEM - Check systemd state on chlorine is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:10:41] Syncing is part of swatting, isn't it?
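A note on the "push it on mwdebug1001" step in this SWAT: config changes are staged on a debug-only appserver and spot-checked there before being synced fleet-wide. One common way to pin a request to that backend is the X-Wikimedia-Debug header; a hedged sketch (the header syntax is as documented for this setup and may have differed at the time):

```bash
# Route a single request through mwdebug1001 to verify the staged cywiki change
curl -sI -H 'X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet' \
    'https://cy.wikipedia.org/wiki/Hafan' | head -n 5
```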
[13:11:08] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Enable SandboxLink on cywiki - T173054 (duration: 00m 48s) [13:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:21] T173054: Enable SandboxLink on cywiki - https://phabricator.wikimedia.org/T173054 [13:13:36] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374809 (https://phabricator.wikimedia.org/T168661) [13:15:30] (03CR) 10Gehel: [C: 032] wdqs - tuning of logback configuration to send logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/374806 (https://phabricator.wikimedia.org/T172710) (owner: 10Gehel) [13:16:47] !log restarting wdqs-blazegraph and wdqs-updater on all wdqs nodes for logback config change - T172710 [13:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:01] T172710: send wdqs logs to logstash - https://phabricator.wikimedia.org/T172710 [13:19:10] !log European SWAT completed [13:19:16] Urbanecm: thank you [13:19:22] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374809 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [13:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:22] Urbanecm: yeah syncing is part of swatting [13:19:30] You're welcome [13:19:51] (03PS2) 10Ottomata: Let scoring platform team run "lsof" for diagnostics [puppet] - 10https://gerrit.wikimedia.org/r/374593 (https://phabricator.wikimedia.org/T174402) (owner: 10Awight) [13:20:56] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374809 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [13:20:59] (03CR) 10Ottomata: [C: 032] Let scoring platform team run "lsof" for diagnostics [puppet] - 10https://gerrit.wikimedia.org/r/374593 (https://phabricator.wikimedia.org/T174402) (owner: 10Awight) [13:21:07] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374809 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [13:21:42] (03PS3) 10Hashar: contint: aptly server in labs [puppet] - 10https://gerrit.wikimedia.org/r/374805 (https://phabricator.wikimedia.org/T161882) [13:21:50] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1053 weight - T168661 (duration: 00m 46s) [13:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:04] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [13:23:06] !log installing ffmpeg security updates [13:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:31] godog: it looks enough. (It's a lot of files :((() [13:25:37] Thanks [13:25:37] 10Operations, 10monitoring, 10User-fgiunchedi: Monitor and alarm on SMART attributes - https://phabricator.wikimedia.org/T86552#3565629 (10fgiunchedi) [13:25:53] Amir1: np, yeah it is lots alright :) [13:25:58] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: send wdqs logs to logstash - https://phabricator.wikimedia.org/T172710#3565630 (10Gehel) Logs are now correctly sent to logstash. @Smalyshev: can you review and let me know if we are missing anything? Else just close thi...
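On the wdqs logging change merged above (T172710): shipping logs from a Java service like Blazegraph to logstash is commonly done with a TCP appender from the logstash-logback-encoder library. A hedged sketch of such a logback.xml fragment; the destination host, port and level are assumptions, not the reviewed config (see gerrit 374806 for the real change):

```xml
<!-- Illustrative fragment only -->
<appender name="logstash"
          class="net.logstash.logback.appender.LogstashTcpSocketAppender">
  <destination>logstash1003.eqiad.wmnet:11514</destination>
  <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
</appender>
<root level="INFO">
  <appender-ref ref="logstash"/>
</root>
```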
[13:26:23] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3565634 (10elukey) The following recipe is almost working fine, except that I didn't manage to create the srv logical volume yet for... [13:27:38] hahahha elukey you and robh are crazy people [13:27:54] partman can make a person insane, be careful! [13:28:21] partman could be a trick played on your mind by an evil demon [13:28:25] ottomata: I am definitely crazy I know, but it was a matter of pride, partman cannot win all the time [13:28:29] hahah [13:28:48] jokes aside, lemme know if you like the partitions [13:29:22] (03PS4) 10Hashar: contint: aptly server in labs [puppet] - 10https://gerrit.wikimedia.org/r/374805 (https://phabricator.wikimedia.org/T161882) [13:29:24] (03PS1) 10Hashar: aptly: support components for clients [puppet] - 10https://gerrit.wikimedia.org/r/374813 [13:32:56] 10Operations, 10Ops-Access-Requests, 10Scoring-platform-team, 10Graphite, 10User-fgiunchedi: Temporarily access request to graphite nodes - https://phabricator.wikimedia.org/T174542#3565694 (10Addshore) [13:34:09] ottomata: the idea is to have two lvm physical volumes, one for root and one for the "data" stuff, and then corresponding logical volumes (/ and /srv) [13:34:49] elukey: sounds good [13:34:55] I am testing now why the /srv partition doesn't mount but it should be a stupid partman typo [13:34:58] !log installing libxml2 security updates [13:35:00] is the window still open hashar? [13:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:14] phuedx: yeah sure [13:37:13] (03PS2) 10Alexandros Kosiaris: kubernetes: Refactor/add admission controllers [puppet] - 10https://gerrit.wikimedia.org/r/374795 (https://phabricator.wikimedia.org/T170119) [13:42:14] (03PS1) 10Phuedx: pagePreviews: Scale A/B test bucket sizes by 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374815 (https://phabricator.wikimedia.org/T172291) [13:42:23] 10Operations, 10Ops-Access-Requests, 10Scoring-platform-team, 10Graphite, 10User-fgiunchedi: Temporarily access request to graphite nodes - https://phabricator.wikimedia.org/T174542#3565714 (10Ladsgroup) I've got a list of them so I don't need this anymore.
Thanks :) [13:42:28] 10Operations, 10ORES, 10Scoring-platform-team-Backlog, 10Graphite, 10User-fgiunchedi: Regularly purge old ores graphite metrics - https://phabricator.wikimedia.org/T169969#3565716 (10Ladsgroup) [13:42:30] 10Operations, 10Ops-Access-Requests, 10Scoring-platform-team, 10Graphite, 10User-fgiunchedi: Temporarily access request to graphite nodes - https://phabricator.wikimedia.org/T174542#3565715 (10Ladsgroup) 05Open>03Resolved [13:42:31] (03CR) 10Alexandros Kosiaris: [C: 031] "https://puppet-compiler.wmflabs.org/compiler02/7658/argon.eqiad.wmnet/ is quite OK, fixed a smaller issue with the trailing ", production " [puppet] - 10https://gerrit.wikimedia.org/r/374795 (https://phabricator.wikimedia.org/T170119) (owner: 10Alexandros Kosiaris) [13:42:56] (03PS3) 10Alexandros Kosiaris: kubernetes: Refactor/add admission controllers [puppet] - 10https://gerrit.wikimedia.org/r/374795 (https://phabricator.wikimedia.org/T170119) [13:43:27] hashar: https://gerrit.wikimedia.org/r/#/c/374815/ [13:45:09] elukey : we're going to increase the popups event rate and will be keeping a very close eye on disk space usage on dbstore1002 (per https://phabricator.wikimedia.org/T172291#3565685) [13:45:28] also o/ hashar , how was your vacation? [13:45:42] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3565720 (10elukey) Forgot a "\" after a "."; where, exactly, is left to the reader as an exercise :D This is what we end up with: ``` roo... [13:45:57] (03PS3) 10Hashar: Rebuild for Jessie + PHP 5.5 [debs/pkg-php/php-defaults] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/374766 [13:46:07] phuedx: exhausting :] [13:46:33] (03CR) 10Hashar: [C: 032] pagePreviews: Scale A/B test bucket sizes by 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374815 (https://phabricator.wikimedia.org/T172291) (owner: 10Phuedx) [13:47:22] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3565738 (10elukey) The /boot partition might need more space, I used the `prometheus.cfg` configuration as baseline and forgot to up... [13:47:32] phuedx: sure (Cc marostegui ) [13:47:48] weird [13:47:55] i didn't see marostegui in my autocomplete list :/ [13:48:00] sorry! [13:48:07] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3565739 (10Ottomata) Weird! /dev/dm-0 is an LVM partition? [13:49:18] sure phuedx!
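For readers following the kafka-jumbo partman thread (T174457): the layout elukey describes (/boot on a plain partition, plus LVM logical volumes for / and /srv) comes out roughly like the sketch below in debian-installer preseed form. This is a hedged illustration with placeholder sizes, not the actual kafka-jumbo.cfg; note the trailing backslashes, the exact class of typo being joked about:

```
d-i partman-auto/method string lvm
d-i partman-auto/expert_recipe string \
    boot-root :: \
      512 512 512 ext4 $primary{ } $bootable{ } \
        method{ format } format{ } use_filesystem{ } \
        filesystem{ ext4 } mountpoint{ /boot } . \
      50000 50000 50000 ext4 $lvmok{ } lv_name{ root } \
        method{ format } format{ } use_filesystem{ } \
        filesystem{ ext4 } mountpoint{ / } . \
      10000 10000 -1 ext4 $lvmok{ } lv_name{ srv } \
        method{ format } format{ } use_filesystem{ } \
        filesystem{ ext4 } mountpoint{ /srv } .
```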
[13:49:20] thanks for the heads up [13:50:38] !log restarting squid on url downloaders to pick up libxml security update [13:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:58] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2068413 [13:51:14] (03Merged) 10jenkins-bot: pagePreviews: Scale A/B test bucket sizes by 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374815 (https://phabricator.wikimedia.org/T172291) (owner: 10Phuedx) [13:51:18] (03PS1) 10Marostegui: db-eqiad.php: Restore db1053 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374819 (https://phabricator.wikimedia.org/T168661) [13:51:25] (03CR) 10jenkins-bot: pagePreviews: Scale A/B test bucket sizes by 10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374815 (https://phabricator.wikimedia.org/T172291) (owner: 10Phuedx) [13:53:04] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: kafka-jumbo.cfg partman recipe creation/troubleshooting - https://phabricator.wikimedia.org/T174457#3565761 (10elukey) I can definitely see the following that looks correct: ``` root@kafka-jumbo1001:/# mount | grep root /dev/mapper... [13:53:08] phuedx: deploying :) [13:53:17] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1053 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374819 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [13:53:24] (03PS2) 10Marostegui: db-eqiad.php: Restore db1053 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374819 (https://phabricator.wikimedia.org/T168661) [13:54:41] hashar: thanks! there's not much to test as it's a constant change [13:55:05] and if I got it right https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?var-schema=Popups&refresh=5m&orgId=1&from=now-12h&to=now [13:55:11] should show a 10x increase in rate ? [13:55:24] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: pagePreviews: Scale A/B test bucket sizes by 10 - T172291 (duration: 00m 46s) [13:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:36] T172291: Launch page previews A/B test on enwiki and dewiki - https://phabricator.wikimedia.org/T172291 [13:56:47] (03CR) 10jenkins-bot: db-eqiad.php: Restore db1053 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374819 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [13:56:58] hashar: yes [13:57:07] it'll take a little while to happen [13:57:12] but yes :) [13:57:20] * hashar blames cache [13:57:40] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1053 original weight - T168661 (duration: 00m 46s) [13:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:54] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [13:59:19] phuedx: it bumped already :) [13:59:41] it's always good to know that things are working! [14:00:07] \o/ [14:00:18] I am taking a quick break / away [14:00:20] (03CR) 10Alexandros Kosiaris: [C: 032] "LGTM. We should take some care to revert when we upgrade to version 4.2 of puppet and handle it natively in the provider. 
Merging" [puppet] - 10https://gerrit.wikimedia.org/r/374438 (https://phabricator.wikimedia.org/T167104) (owner: 10Thcipriani) [14:00:24] be back in 15 minutes or so [14:00:39] (03PS2) 10Alexandros Kosiaris: Mask jobchron and jobrunner in non-active DC [puppet] - 10https://gerrit.wikimedia.org/r/374438 (https://phabricator.wikimedia.org/T167104) (owner: 10Thcipriani) [14:00:56] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Mask jobchron and jobrunner in non-active DC [puppet] - 10https://gerrit.wikimedia.org/r/374438 (https://phabricator.wikimedia.org/T167104) (owner: 10Thcipriani) [14:02:33] !log restarting apache on krypton to pick up libxml security update [14:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:07] !log restarting squid on installurl downloaders to pick up libxml security update [14:04:15] !log restarting squid on install* servers to pick up libxml security update [14:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:13] phuedx: confirmed. The graph bumped :] [14:07:27] !log restart java daemons on druid100[456] for jvm security updates [14:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:00] marostegui: woah! that's a big temporary table! [14:08:20] yeah :( [14:14:18] PROBLEM - DPKG on conf1004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:15:18] RECOVERY - DPKG on conf1004 is OK: All packages OK [14:31:46] !log restarting nginx on sodium to pick up libxml security update [14:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:38] 10Operations, 10Cloud-Services, 10Mail: Create a labs SMTP smarthost - https://phabricator.wikimedia.org/T41785#3566017 (10faidon) [14:40:45] (03PS1) 10Alexandros Kosiaris: mask file should be in /etc directory [puppet] - 10https://gerrit.wikimedia.org/r/374822 (https://phabricator.wikimedia.org/T167104) [14:41:41] 10Operations, 10ops-eqiad, 10User-fgiunchedi: ms-be1016 controller cache failure - https://phabricator.wikimedia.org/T150206#3566023 (10Cmjohnson) 05Open>03Resolved The battery was replaced. [14:45:29] (03PS1) 10Mforns: [WIP] Optimize EventLogging purging script using timestamps [puppet] - 10https://gerrit.wikimedia.org/r/374823 (https://phabricator.wikimedia.org/T156933) [14:45:50] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Optimize EventLogging purging script using timestamps [puppet] - 10https://gerrit.wikimedia.org/r/374823 (https://phabricator.wikimedia.org/T156933) (owner: 10Mforns) [14:46:40] (03CR) 10Alexandros Kosiaris: [C: 032] mask file should be in /etc directory [puppet] - 10https://gerrit.wikimedia.org/r/374822 (https://phabricator.wikimedia.org/T167104) (owner: 10Alexandros Kosiaris) [14:49:49] PROBLEM - Host mw1294.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:50:27] 10Operations, 10ops-eqiad: Run hardware checks on mw1294 - https://phabricator.wikimedia.org/T167406#3566066 (10Cmjohnson) Shuffled the CPU's....time to wait and see [14:51:49] (03CR) 10Ema: [C: 031] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/374739 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) [14:52:58] (03CR) 10Ema: [C: 031] varnish: convert to string integers [puppet] - 10https://gerrit.wikimedia.org/r/374778 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) [14:52:59] (03PS1) 10Muehlenhoff: Fix broken regexp in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/374825 [14:54:58] RECOVERY - Host mw1294.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [14:56:49] PROBLEM - DPKG on labweb1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:57:49] RECOVERY - DPKG on labweb1002 is OK: All packages OK [15:02:15] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw1294.eqiad.wmnet [15:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:50] 10Operations, 10ops-eqiad: Run hardware checks on mw1294 - https://phabricator.wikimedia.org/T167406#3566094 (10MoritzMuehlenhoff) I ran "scap pull" and re-enabled the server for live traffic. Let's see whether this reoccurs. [15:04:38] (03Abandoned) 10Muehlenhoff: Handle incorrect package names in reverse dependency query [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/373866 (owner: 10Muehlenhoff) [15:14:53] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Degraded RAID on analytics1055 - https://phabricator.wikimedia.org/T172809#3566159 (10Cmjohnson) @elukey I swapped the disk in slot 1...megacli still shows failed but also does not show the updated s/n for the new disk it could be preserved cache. Please try an... [15:18:48] 10Operations, 10ORES, 10Scoring-platform-team, 10Graphite, 10User-fgiunchedi: Regularly purge old ores graphite metrics - https://phabricator.wikimedia.org/T169969#3566203 (10Ladsgroup) [15:21:00] ottomata: Thanks for merging my lsof change! It doesn’t seem to work though, so I’m wondering if it might not be deployed, or perhaps I have the wrong sudo statement... [15:21:54] I would cat /etc/sudoers.d/ores-admin but don’t have perms :-) [15:24:33] 10Operations, 10Goal, 10Kubernetes: Prepare to service applications from kubernetes - https://phabricator.wikimedia.org/T162039#3566239 (10akosiaris) [15:24:40] 10Operations, 10Goal, 10Kubernetes: Define a process to keep images up-to-date on similar standards as the rest of production - https://phabricator.wikimedia.org/T162043#3566234 (10akosiaris) 05Open>03Resolved a:03akosiaris The process has been well defined in T167269. Resolving this [15:25:50] awight: the change you merged allows the deploy-service user to sudo to ls [15:25:54] 10Operations, 10ORES, 10Scoring-platform-team: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3566244 (10akosiaris) 05Open>03stalled Stalling a bit while T169246 is ongoing [15:25:58] can you sudo as the deploy-service user? [15:26:02] so uhhh [15:26:14] 10Operations, 10Goal, 10Kubernetes: Prepare to service applications from kubernetes - https://phabricator.wikimedia.org/T162039#3150583 (10akosiaris) [15:26:15] i thought this is what you wanted :) [15:26:19] awight: what's a box you use? [15:26:22] for ores? [15:26:24] want to try something [15:26:42] 10Operations, 10Goal, 10Kubernetes: Prepare to service applications from kubernetes - https://phabricator.wikimedia.org/T162039#3566254 (10akosiaris) 05Open>03Resolved a:03akosiaris The immediate subtasks have all been completed long ago. Resolving [15:27:14] ottomata: I must be missing something here!
I’m on ores1001.eqiad.wmnet, and can run "sudo service celery-ores-worker restart" [15:27:57] So perhaps blindly grepping didn’t lead me to the right puppet config ;-) [15:28:01] yeah, but your change affects the service::uwsgi deploy user [15:28:05] not your user [15:28:08] hehe [15:28:10] or groups [15:28:11] lemme try again [15:29:14] awight: you probably want to change the privileges in data.yaml in the admin module for the ores-admin group [15:29:38] oh [15:29:40] awight [15:29:40] also [15:29:44] puppet isn't running here? [15:29:44] Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'alex/halfak stress tests'); [15:29:44] Use 'puppet agent --enable' to re-enable. [15:29:51] O_O lol [15:30:25] is there an ores machine with puppet enabled right now? [15:30:31] scb1001 perhaps [15:30:54] I tried there as well, so I think you must be right about the sudo config only applying to the deployment user [15:32:44] (patch forthcoming) [15:32:47] (03PS1) 10Awight: Give ores-admin users lsof, rather than the deployment user [puppet] - 10https://gerrit.wikimedia.org/r/374832 [15:39:28] (03PS5) 10Hashar: contint: aptly server in labs [puppet] - 10https://gerrit.wikimedia.org/r/374805 (https://phabricator.wikimedia.org/T161882) [15:39:38] RECOVERY - Disk space on ms-be2024 is OK: DISK OK [15:40:08] RECOVERY - swift-container-updater on ms-be2024 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [15:40:22] (03PS2) 10Mforns: [WIP] Optimize EventLogging purging script using timestamps [puppet] - 10https://gerrit.wikimedia.org/r/374823 (https://phabricator.wikimedia.org/T156933) [15:40:29] RECOVERY - MD RAID on ms-be2024 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [15:40:39] RECOVERY - Check systemd state on ms-be2024 is OK: OK - running: The system is fully operational [15:40:44] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Optimize EventLogging purging script using timestamps [puppet] - 10https://gerrit.wikimedia.org/r/374823 (https://phabricator.wikimedia.org/T156933) (owner: 10Mforns) [15:41:32] 10Operations, 10hardware-requests: codfw/eqiad: 2x systems for prometheus - https://phabricator.wikimedia.org/T148513#3566348 (10RobH) [15:43:21] !log T169939: Decommission Cassandra: restbase1009-a.eqiad.wmnet [15:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:34] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [15:43:55] 10Operations, 10hardware-requests: CODFW: (4) hardware access request for kubernetes - https://phabricator.wikimedia.org/T161700#3566372 (10RobH) [15:43:57] 10Operations, 10hardware-requests: codfw/eqiad:(9+9) hardware access request for ORES - https://phabricator.wikimedia.org/T142578#3566373 (10RobH) [15:44:05] i hate that phab makes me claim a task if its unowned when i resolve it. [15:44:16] 10Operations, 10Cloud-Services, 10Mail: Create a labs SMTP smarthost - https://phabricator.wikimedia.org/T41785#3566374 (10herron) For sure! Maybe a pair of instances in different locations for durability? Regarding handling root@ mail for individual instances. If it would be useful for some projects to h... [15:45:16] 10Operations, 10ops-codfw, 10media-storage: Degraded RAID on ms-be2024 - https://phabricator.wikimedia.org/T174534#3566375 (10Papaul) @fgiunchedi no sign of disk error at my end. I rebooted the system and it looks like the system is back up.
But I really don't trust HP, I will leave the task open and monito... [15:46:00] RECOVERY - Hadoop DataNode on analytics1055 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [15:46:38] RECOVERY - Hadoop NodeManager on analytics1055 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:48:34] 10Operations, 10Deployment-Systems, 10JobRunner-Service, 10Patch-For-Review, and 2 others: Figure out how to disable starting of jobrunner/jobchron in the non-active DC - https://phabricator.wikimedia.org/T167104#3566417 (10thcipriani) >>! In T167104#3566055, @gerritbot wrote: > Change 374822 merged by Ale... [15:49:44] !log cp1049 - restart varnish backend, mailbox lag [15:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:12] (03CR) 10Filippo Giunchedi: "> I am a bit concerned about doubling the write bandwidth consumed by" [puppet] - 10https://gerrit.wikimedia.org/r/373863 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [15:50:58] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 0 [15:53:38] PROBLEM - Check Varnish expiry mailbox lag on cp1062 is CRITICAL: CRITICAL: expiry mailbox lag is 2069335 [15:54:06] (03PS1) 10Hashar: aptly: support https and switch contint to it [puppet] - 10https://gerrit.wikimedia.org/r/374837 [15:55:22] (03PS6) 10Hashar: contint: aptly server in labs [puppet] - 10https://gerrit.wikimedia.org/r/374805 (https://phabricator.wikimedia.org/T161882) [15:55:24] (03PS2) 10Hashar: aptly: support https and switch contint to it [puppet] - 10https://gerrit.wikimedia.org/r/374837 [15:56:34] !log re-added analytics1055 among the hdfs/yarn workers after maintenance [15:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:19] 10Operations, 10ops-codfw, 10media-storage: Degraded RAID on ms-be2024 - https://phabricator.wikimedia.org/T174534#3566442 (10fgiunchedi) 05Open>03Resolved Thanks @Papaul ! I also can't find anything obviously wrong after a reboot, tentatively resolving :( [15:57:55] 10Operations, 10hardware-requests: codfw/eqiad:(9+9) hardware access request for ORES - https://phabricator.wikimedia.org/T142578#3566454 (10RobH) [15:59:12] (03PS3) 10Hashar: aptly: support https and switch contint to it [puppet] - 10https://gerrit.wikimedia.org/r/374837 [15:59:23] 10Operations, 10ops-codfw, 10media-storage: Degraded RAID on ms-be2024 - https://phabricator.wikimedia.org/T174534#3566482 (10fgiunchedi) 05Resolved>03Open Reopening as per request [16:00:33] (03PS1) 10Ladsgroup: Use new logo of WMF for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/374838 (https://phabricator.wikimedia.org/T174576) [16:01:31] (03PS2) 10Ladsgroup: Use new logo of WMF for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/374838 (https://phabricator.wikimedia.org/T174576) [16:01:58] (03CR) 10Hashar: "I need to split debian packages by components. That makes it easier to track what ends up being installed on an instance." [puppet] - 10https://gerrit.wikimedia.org/r/374813 (owner: 10Hashar) [16:02:23] (03CR) 10Ladsgroup: "I also compressed it so it's waaay smaller now." [puppet] - 10https://gerrit.wikimedia.org/r/374838 (https://phabricator.wikimedia.org/T174576) (owner: 10Ladsgroup) [16:03:08] (03CR) 10Hashar: "jessie-integration/php55 will be to host Zend PHP 5.5 packages for Jessie ( T161882 ).
I want a standalone component to make it easier to" [puppet] - 10https://gerrit.wikimedia.org/r/374805 (https://phabricator.wikimedia.org/T161882) (owner: 10Hashar) [16:04:45] (03CR) 10Hashar: "I have created a web proxy in labs which comes with HTTPS support out of the box. That let us make the repository public so developers ca" [puppet] - 10https://gerrit.wikimedia.org/r/374837 (owner: 10Hashar) [16:07:22] (03PS2) 10Jforrester: RCFilters: Enable on watchlist for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374382 [16:07:28] (03PS2) 10Jforrester: Cleanup: Removed wgEnableRcFiltersBetaFeature setting for Beta Cluster, true everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374383 [16:07:29] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:07:39] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:07:39] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:07:45] (03CR) 10Krinkle: [C: 04-1] "Per Phab task. I think this should remain in all caps for consistency, and the second line should not be in the same font style afaik." [puppet] - 10https://gerrit.wikimedia.org/r/374838 (https://phabricator.wikimedia.org/T174576) (owner: 10Ladsgroup) [16:07:49] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:07:49] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:07:58] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:07:59] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:07:59] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:08:08] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:08:09] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:08:09] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:08:09] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6, 
cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:08:09] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:08:11] 10Operations, 10Cloud-Services, 10Mail: Create a labs SMTP smarthost - https://phabricator.wikimedia.org/T41785#3566527 (10faidon) Indeed! Note that ToolForge already has something like that for tool authors that does LDAP calls etc. if I recall correctly, so perhaps these two efforts could complement each o... [16:08:17] 10Operations: Request for python package csvsort on stat1005.equiad.wmnet - https://phabricator.wikimedia.org/T174577#3566528 (10Adrian_Bielefeldt) [16:08:18] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:08:18] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:08:18] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:08:18] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:08:18] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:08:18] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:08:19] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:08:28] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:08:28] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:08:28] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:08:28] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:08:29] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:08:29] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:08:35] looking ^ [16:08:39] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:08:39] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:08:48] PROBLEM - IPsec on cp2007 
is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:08:48] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:08:49] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:08:58] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:08:58] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:08:58] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:08:59] PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:09:08] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:09:08] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:10:37] ema: it's bblack [16:11:00] right, I was expecting just one, but I guess it's all 8 [16:11:18] due to their halfway-correctly-installed nature :P [16:13:24] !log lowering vrrp priority for the cr2<->asw-d-codfw link - T174366 [16:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:37] T174366: Power alarm flap on asw-d-codfw:et-7/0/52 channel 3 - https://phabricator.wikimedia.org/T174366 [16:14:19] PROBLEM - Host labvirt1015 is DOWN: PING CRITICAL - Packet loss = 100% [16:14:56] bblack: I've ACKed the ipsec alert spam in the meanwhile [16:17:37] 10Operations, 10Cloud-Services, 10Mail: Create a labs SMTP smarthost - https://phabricator.wikimedia.org/T41785#434985 (10chasemp) That Toolforge mail server is a real mess. It maybe the remaining holdover from the early days of un-puppetized things and we have been kicking that can down the road for a good... [16:30:06] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey: Decommission old memcached hosts - mc1001->mc1018 - https://phabricator.wikimedia.org/T164341#3566633 (10Volans) @elukey, @Joe, @Cmjohnson: for testing purposes of the migration of the reimage script from salt to cumin, could I grab `mc100[1-2]`... [16:34:10] !log start work on the cr2<->asw-d-codfw link - T174366 [16:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:22] T174366: Power alarm flap on asw-d-codfw:et-7/0/52 channel 3 - https://phabricator.wikimedia.org/T174366 [16:37:49] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 114, down: 2, dormant: 0, excluded: 0, unused: 0 [16:38:49] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [16:41:55] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Degraded RAID on analytics1055 - https://phabricator.wikimedia.org/T172809#3566690 (10elukey) 05Open>03Resolved Done! Host back to working, thanks Chris! [16:43:03] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3369055 (10chasemp) No definite date has been set as we are working on T173511 as the precursor to moving over quarry (and probably PAWS). I think we are tal... 
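For the VRRP step logged above (T174366): on the Juniper routers, draining traffic off one side of a redundant link is typically done by lowering that side's vrrp-group priority so the peer wins the master election. A hedged sketch in JunOS set-command form; the interface, unit, group number and address are invented for illustration:

```
# On cr2-codfw: let the peer take over VRRP mastership for this subnet
set interfaces ae1 unit 2020 family inet address 10.192.32.3/24 vrrp-group 20 priority 70
commit confirmed 5
```

The later "rolling back vrrp priority change" entry is simply the inverse edit once the optic swap checked out.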
[16:45:48] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 82 ESP OK [16:45:48] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 82 ESP OK [16:45:49] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 82 ESP OK [16:45:49] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 68 ESP OK [16:45:59] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 68 ESP OK [16:46:08] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 82 ESP OK [16:46:08] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 68 ESP OK [16:46:09] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 48 ESP OK [16:46:09] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 68 ESP OK [16:46:09] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 48 ESP OK [16:46:18] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 60 ESP OK [16:46:18] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 48 ESP OK [16:46:18] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 68 ESP OK [16:46:18] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 82 ESP OK [16:46:19] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 48 ESP OK [16:46:19] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 68 ESP OK [16:46:19] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 68 ESP OK [16:46:28] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 60 ESP OK [16:46:29] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 68 ESP OK [16:46:29] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 68 ESP OK [16:46:29] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 68 ESP OK [16:46:38] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 68 ESP OK [16:46:40] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 82 ESP OK [16:46:40] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 82 ESP OK [16:46:40] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 82 ESP OK [16:46:40] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 60 ESP OK [16:46:40] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 60 ESP OK [16:46:40] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 82 ESP OK [16:46:48] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 48 ESP OK [16:46:48] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 48 ESP OK [16:46:48] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 60 ESP OK [16:46:48] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 82 ESP OK [16:46:58] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 48 ESP OK [16:46:58] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 60 ESP OK [16:47:08] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 60 ESP OK [16:47:08] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 60 ESP OK [16:47:08] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 48 ESP OK [16:48:45] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3566722 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp4021.ulsfo.wmnet', 'cp4022.ulsfo.wmnet', 'cp4023.ulsfo.... 
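Circling back to the ores-admin sudo thread a bit earlier: as ottomata pointed out, per-group sudo privileges live in the admin module's data.yaml, and awight's follow-up patch (gerrit 374832) moves the lsof grant from the deployment user to the ores-admin group. A hedged sketch of what such a stanza can look like; the gid, members and exact rule text are assumptions, not the merged patch:

```yaml
# modules/admin/data/data.yaml (illustrative fragment)
groups:
  ores-admin:
    gid: 777                # placeholder gid
    description: ORES service admins
    members: [awight, halfak]
    privileges:
      - 'ALL = NOPASSWD: /usr/bin/lsof'
```

Puppet renders these privilege lists into files under /etc/sudoers.d/, which is why the earlier attempt to cat /etc/sudoers.d/ores-admin was the right instinct, permissions aside.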
[16:49:37] !log rolling back vrrp priority change for the cr2<->asw-d-codfw link - T174366 [16:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:48] T174366: Power alarm flap on asw-d-codfw:et-7/0/52 channel 3 - https://phabricator.wikimedia.org/T174366 [16:50:53] (03Abandoned) 10Hashar: Revert "Revert "Group1 wikis to wmf.14"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/372415 (owner: 10Thcipriani) [16:51:07] 10Operations, 10ops-codfw, 10netops: Power alarm flap on asw-d-codfw:et-7/0/52 channel 3 - https://phabricator.wikimedia.org/T174366#3566753 (10ayounsi) 05Open>03Resolved a:03ayounsi Papaul replaced the optic on the switch side, levels back to normal: ``` > show interfaces diagnostics optics et-7/0/52... [16:51:58] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 8 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [16:55:08] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:55:08] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:55:08] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:55:09] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:55:09] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:55:09] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:55:09] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:55:09] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:55:10] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:55:10] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:55:18] PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:55:18] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:55:19] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:55:28] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:55:28] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:55:28] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, 
cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:55:28] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:55:29] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:55:32] 10Operations, 10Deployment-Systems, 10JobRunner-Service, 10Patch-For-Review, and 2 others: Figure out how to disable starting of jobrunner/jobchron in the non-active DC - https://phabricator.wikimedia.org/T167104#3566767 (10akosiaris) \o/. Isn't there anything else left to do or can we declare victory on t... [16:55:38] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:55:38] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:55:38] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:55:38] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:55:39] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:55:39] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:55:39] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:55:40] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:55:40] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:55:41] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:55:49] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:55:49] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:55:49] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [16:55:49] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:55:49] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, 
cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:55:49] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:55:58] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 70 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:55:58] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp4021_v4, cp4021_v6, cp4022_v4, cp4022_v6, cp4023_v4, cp4023_v6, cp4024_v4, cp4024_v6, cp4025_v4, cp4025_v6, cp4026_v4, cp4026_v6 [16:55:58] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 44 not-conn: cp4027_v4, cp4027_v6, cp4028_v4, cp4028_v6 [17:02:28] 10Operations, 10Deployment-Systems, 10JobRunner-Service, 10Patch-For-Review, and 2 others: Figure out how to disable starting of jobrunner/jobchron in the non-active DC - https://phabricator.wikimedia.org/T167104#3566808 (10Krinkle) 05Open>03Resolved a:03Krinkle [17:02:34] 10Operations, 10Deployment-Systems, 10JobRunner-Service, 10Patch-For-Review, and 2 others: Figure out how to disable starting of jobrunner/jobchron in the non-active DC - https://phabricator.wikimedia.org/T167104#3566813 (10Krinkle) a:05Krinkle>03thcipriani [17:08:08] (03PS3) 10Ladsgroup: Use new logo of WMF for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/374838 (https://phabricator.wikimedia.org/T174576) [17:13:06] (03PS4) 10Ladsgroup: Use new logo of WMF for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/374838 (https://phabricator.wikimedia.org/T174576) [17:22:06] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:24:36] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 48 ESP OK [17:24:55] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 60 ESP OK [17:24:55] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 60 ESP OK [17:24:56] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 48 ESP OK [17:25:05] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 48 ESP OK [17:25:06] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 60 ESP OK [17:25:06] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 48 ESP OK [17:25:25] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 48 ESP OK [17:25:25] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 60 ESP OK [17:25:25] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 48 ESP OK [17:25:25] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 48 ESP OK [17:25:25] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 60 ESP OK [17:25:26] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 60 ESP OK [17:25:26] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 60 ESP OK [17:25:35] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 48 ESP OK [17:25:36] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 60 ESP OK [17:27:46] !log Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds [17:27:52] sjoerddebruin: ^ ;) [17:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:59] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [17:29:10] !log T169939: Decommission Cassandra: restbase1009-b.eqiad.wmnet [17:29:22] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:22] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [17:30:47] (03PS1) 10Chad: Group1 to wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374844 [17:35:30] (03PS1) 10Madhuvishy: nfsmount: Add temporary exception to the block-for-export check [puppet] - 10https://gerrit.wikimedia.org/r/374845 (https://phabricator.wikimedia.org/T171508) [17:39:47] (03CR) 10Herron: [C: 032] icinga: add -u option to check_nrpe commands [puppet] - 10https://gerrit.wikimedia.org/r/374368 (https://phabricator.wikimedia.org/T172131) (owner: 10Herron) [17:39:52] (03PS2) 10Herron: icinga: add -u option to check_nrpe commands [puppet] - 10https://gerrit.wikimedia.org/r/374368 (https://phabricator.wikimedia.org/T172131) [17:44:03] PROBLEM - Check systemd state on cp4025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:44:03] PROBLEM - traffic-pool service on cp4021 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [17:45:54] PROBLEM - Check systemd state on cp4024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:46:43] PROBLEM - traffic-pool service on cp4026 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [17:46:43] PROBLEM - traffic-pool service on cp4028 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [17:46:43] PROBLEM - traffic-pool service on cp4027 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [17:47:33] PROBLEM - Check systemd state on cp4023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:48:32] PROBLEM - traffic-pool service on cp4025 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [17:49:24] PROBLEM - Check systemd state on cp4021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:49:26] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3567032 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp4022.ulsfo.wmnet'] ``` The log can be found in `/var/lo... [17:49:29] 10Operations, 10Cloud-Services, 10Mail: Create a labs SMTP smarthost - https://phabricator.wikimedia.org/T41785#434985 (10Legoktm) I was actually just asking @bd808 about the status of mail access for Cloud VPS projects yesterday. At least for my use case, it would be great if mail to @ ignore cp402x, sorry for the spam! (new installs) [17:50:13] PROBLEM - Check systemd state on cp4027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:50:13] PROBLEM - Check systemd state on cp4028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:50:13] PROBLEM - traffic-pool service on cp4024 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [17:52:02] PROBLEM - Check systemd state on cp4026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
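For context on the check_nrpe change merged at 17:39 above: the -u flag makes NRPE socket timeouts come back as UNKNOWN instead of CRITICAL, so a slow or briefly unreachable agent no longer pages as a hard failure. A minimal sketch of the resulting invocation; the plugin path and target host here are illustrative assumptions, and only -u is what the patch adds:
```
# -H target host, -c remote command to run, -t socket timeout in seconds,
# -u report a socket timeout as UNKNOWN rather than CRITICAL
/usr/lib/nagios/plugins/check_nrpe -H mw1228.eqiad.wmnet -c check_disk -t 10 -u
```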
[17:52:03] PROBLEM - traffic-pool service on cp4023 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [17:55:22] PROBLEM - Host cp4023 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:23] PROBLEM - Host cp4025 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:33] PROBLEM - Host cp4024 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:43] PROBLEM - Host cp4026 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:52] PROBLEM - Host cp4028 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:52] PROBLEM - Host cp4027 is DOWN: PING CRITICAL - Packet loss = 100% [17:56:02] PROBLEM - Host cp4021 is DOWN: PING CRITICAL - Packet loss = 100% [17:56:03] RECOVERY - Check systemd state on cp4024 is OK: OK - running: The system is fully operational [17:56:03] RECOVERY - Host cp4026 is UP: PING WARNING - Packet loss = 66%, RTA = 78.52 ms [17:56:03] RECOVERY - Host cp4024 is UP: PING WARNING - Packet loss = 66%, RTA = 78.51 ms [17:56:03] RECOVERY - Host cp4023 is UP: PING WARNING - Packet loss = 80%, RTA = 78.48 ms [17:56:03] RECOVERY - Host cp4025 is UP: PING WARNING - Packet loss = 80%, RTA = 78.48 ms [17:56:12] RECOVERY - Check systemd state on cp4026 is OK: OK - running: The system is fully operational [17:56:12] RECOVERY - traffic-pool service on cp4023 is OK: OK - traffic-pool is active [17:56:12] RECOVERY - Host cp4028 is UP: PING OK - Packet loss = 0%, RTA = 78.83 ms [17:56:13] RECOVERY - Host cp4021 is UP: PING OK - Packet loss = 0%, RTA = 78.52 ms [17:56:22] RECOVERY - Host cp4027 is UP: PING OK - Packet loss = 0%, RTA = 78.48 ms [17:56:23] RECOVERY - Check systemd state on cp4025 is OK: OK - running: The system is fully operational [17:56:23] RECOVERY - traffic-pool service on cp4021 is OK: OK - traffic-pool is active [17:56:23] RECOVERY - Check systemd state on cp4027 is OK: OK - running: The system is fully operational [17:56:23] RECOVERY - Check systemd state on cp4028 is OK: OK - running: The system is fully operational [17:56:23] RECOVERY - traffic-pool service on cp4024 is OK: OK - traffic-pool is active [17:56:32] RECOVERY - Check systemd state on cp4021 is OK: OK - running: The system is fully operational [17:56:42] RECOVERY - traffic-pool service on cp4025 is OK: OK - traffic-pool is active [17:56:52] RECOVERY - Check systemd state on cp4023 is OK: OK - running: The system is fully operational [17:57:02] RECOVERY - traffic-pool service on cp4028 is OK: OK - traffic-pool is active [17:57:04] RECOVERY - traffic-pool service on cp4026 is OK: OK - traffic-pool is active [17:57:04] RECOVERY - traffic-pool service on cp4027 is OK: OK - traffic-pool is active [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170830T1800). Please do the needful. [18:00:04] Pchelolo: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[18:00:24] I can do it [18:00:26] I'm here [18:01:04] thank you MaxSem [18:01:38] (03PS4) 10MaxSem: Enable JobQueueEventBus on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374399 (owner: 10Ppchelko) [18:01:45] (03CR) 10MaxSem: [C: 032] Enable JobQueueEventBus on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374399 (owner: 10Ppchelko) [18:03:53] (03Merged) 10jenkins-bot: Enable JobQueueEventBus on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374399 (owner: 10Ppchelko) [18:05:42] Pchelolo, if there's a way to test that change, it's on mwdebug1002 [18:05:53] testing MaxSem [18:06:32] (03PS2) 10Madhuvishy: nfsmount: Add temporary exception to the block-for-export check [puppet] - 10https://gerrit.wikimedia.org/r/374845 (https://phabricator.wikimedia.org/T171508) [18:07:00] give me 5 minutes to ensure all is ok [18:07:52] (03PS3) 10Madhuvishy: nfsmount: Add temporary exception to the block-for-export check [puppet] - 10https://gerrit.wikimedia.org/r/374845 (https://phabricator.wikimedia.org/T171508) [18:10:06] bien MaxSem all seem to work [18:11:27] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/374399/4 (duration: 00m 47s) [18:11:33] Pchelolo, ^ [18:11:37] (03CR) 10Rush: [C: 031] "let's give it a whirl :)" [puppet] - 10https://gerrit.wikimedia.org/r/374845 (https://phabricator.wikimedia.org/T171508) (owner: 10Madhuvishy) [18:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:06] thank you, I [18:12:15] I'll monitor it for some time [18:17:23] 10Operations, 10Cloud-VPS, 10netops: dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596#3567207 (10Krenair) [18:18:13] (03PS1) 10Krinkle: Enable jQuery 3 on nlwiki sister projects (b, n, q, s, wikt) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374850 (https://phabricator.wikimedia.org/T124742) [18:18:45] 10Operations, 10Cloud-VPS, 10netops: dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596#3567234 (10Krenair) [18:19:56] 10Operations, 10Cloud-VPS, 10netops: dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596#3567207 (10Krenair) See also T167357 where this task will probably become obsolete, I just wanted to document the effect of this really. [18:21:52] 10Operations, 10Cloud-Services, 10Mail: Create a labs SMTP smarthost - https://phabricator.wikimedia.org/T41785#3567268 (10bd808) >>! In T41785#3567033, @Legoktm wrote: > it would be great if mail to @.wmflabs.org (where something is fixed string(s) or any string) just worked for forw... [18:22:11] (03CR) 10jenkins-bot: Enable JobQueueEventBus on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374399 (owner: 10Ppchelko) [18:23:16] (03CR) 10Madhuvishy: [C: 032] nfsmount: Add temporary exception to the block-for-export check [puppet] - 10https://gerrit.wikimedia.org/r/374845 (https://phabricator.wikimedia.org/T171508) (owner: 10Madhuvishy) [18:23:33] 10Operations, 10Electron-PDFs, 10Patch-For-Review, 10Readers-Web-Backlog (Tracking), 10Services (blocked): pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3567275 (10GWicke) I personally am not sure whether the startup issues are caused by... 
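The mwdebug1002 test step above works because a freshly synced config can be exercised on one debug backend before it is relied on fleet-wide. A rough sketch of pinning a request to that host, assuming the X-Wikimedia-Debug header routing honoured by the debug proxies (the exact header format is an assumption):
```
# send one request through the debug backend instead of the normal pool,
# then pull the deployed branch name out of Special:Version
curl -s -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' \
    'https://en.wikipedia.org/wiki/Special:Version' | grep -io 'wmf\.[0-9]*' | head -n1
```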
[18:25:07] 10Operations, 10Cloud-Services, 10Mail: Create a labs SMTP smarthost - https://phabricator.wikimedia.org/T41785#3567277 (10Krenair) >>! In T41785#3566374, @herron wrote: > For sure! Maybe a pair of instances in different locations for durability? You mean on separate physical hosts, right? I think we're st... [18:27:05] (03PS4) 10Alexandros Kosiaris: kubernetes: Refactor/add admission controllers [puppet] - 10https://gerrit.wikimedia.org/r/374795 (https://phabricator.wikimedia.org/T170119) [18:27:14] hoo: <3 [18:29:14] 10Operations, 10Cloud-Services, 10Mail: Create a labs SMTP smarthost - https://phabricator.wikimedia.org/T41785#3567298 (10Krenair) >>! In T41785#3567268, @bd808 wrote: > Next tricky step is that `.wmflabs.org` does not exist in DNS by default. We could just say these projects can't get mail un... [18:30:12] (03PS2) 10Rush: prometheus: allow setting a specific listening address and port [puppet] - 10https://gerrit.wikimedia.org/r/374650 (https://phabricator.wikimedia.org/T169039) [18:31:18] (03PS5) 10Ladsgroup: Use new logo of WMF for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/374838 (https://phabricator.wikimedia.org/T174576) [18:31:25] (03PS3) 10Rush: prometheus: allow setting a specific listening address and port [puppet] - 10https://gerrit.wikimedia.org/r/374650 (https://phabricator.wikimedia.org/T169039) [18:32:30] (03CR) 10Rush: prometheus: allow setting a specific listening address and port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/374650 (https://phabricator.wikimedia.org/T169039) (owner: 10Rush) [18:33:13] (03PS5) 10Alexandros Kosiaris: kubernetes: Refactor/add admission controllers [puppet] - 10https://gerrit.wikimedia.org/r/374795 (https://phabricator.wikimedia.org/T170119) [18:41:33] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 82 ESP OK [18:41:33] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 82 ESP OK [18:41:43] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 68 ESP OK [18:41:43] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 82 ESP OK [18:41:43] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 68 ESP OK [18:41:52] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 68 ESP OK [18:41:53] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 82 ESP OK [18:41:53] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 68 ESP OK [18:41:53] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 82 ESP OK [18:41:53] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 82 ESP OK [18:42:02] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 68 ESP OK [18:42:02] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 82 ESP OK [18:42:02] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 68 ESP OK [18:42:12] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 68 ESP OK [18:42:12] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 82 ESP OK [18:42:13] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 82 ESP OK [18:42:22] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 68 ESP OK [18:42:22] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 82 ESP OK [18:42:22] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 68 ESP OK [18:42:23] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 68 ESP OK [18:42:32] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 68 ESP OK [18:44:12] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3567351 (10Krinkle) p:05Triage>03High [18:48:38] robh: y u no like luca's partman? or maybe you do? [18:48:40] what's the skinny? 
[18:49:26] ? [18:49:46] I dont understand the use of lvm on the firs tdisk, and the odd labeling is strange [18:49:50] he and i discussed it [18:49:54] not sure why you think i dont like it? [18:50:00] (I never said anything bad about it?) [18:50:32] I think the main / should be more like 80GB not 30GB. The LVM labeling as dm-0 rather than dev/mapper/lvm is odd [18:50:38] but it may be due to mixed lvm and non lvm on sda [18:51:50] I advised that we have faidon review it to ensure its ok to use. its different than anythign else out there but not much we can do about that [18:51:58] coo :) [18:52:01] since the use of hwraid for 2 os disks and then raid10 for the rest is newer for us [18:52:04] i actually kinda like that / is lvm [18:52:13] that way if we want to say, add a special var/log parittion later, we can [18:52:17] other uses of the flexbay + 12 disks seem to never ptut he 12 disks into raid [18:52:21] just leave as jbod [18:52:27] yeah thats fine with me [18:52:28] aye [18:52:59] review for what? [18:53:12] new partman for kakfa-jumbo [18:53:26] i just assume since its new and non standard we should ensure archtects are aware of it [18:53:32] damn i cannot type today. [18:53:50] paravoid: review as in a 'this is sensible and im ok with it existing' kind of thing. =] [18:54:16] i think elukey planned to tweak it slightly more before submitting to you [18:54:35] (03PS1) 10Alex Monk: Add me back to deployment-prep shinken contacts [puppet] - 10https://gerrit.wikimedia.org/r/374866 [18:58:36] (03CR) 10Chad: [C: 032] Group1 to wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374844 (owner: 10Chad) [18:58:39] aye cool, was just wondering, i'm fine with whattever! just anxious to get working on these thangs! [18:59:13] 10Operations, 10JobRunner-Service, 10MediaWiki-Platform-Team, 10monitoring, and 3 others: Collect error logs from jobchron/jobrunner services in Logstash - https://phabricator.wikimedia.org/T172479#3567427 (10Krinkle) [18:59:17] having some phabricator searching fun: is someone aware of the wmflabs.org cert renewal? it expires in ~6 weeks [18:59:37] 6 WEEKS [18:59:45] HOW CAN WE EVER MANAGE TO RENEW A CERT BEFORE THEN!?! [18:59:50] Reedy [19:00:04] RainbowSprinkles: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170830T1900). [19:00:24] Krenair: hello. There is probably an icinga check for that. You can poke #wikimedia-cloud about it [19:00:29] ssh [19:01:19] yeah there is [19:01:19] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=tools.wmflabs.org&service=HTTPS-wmflabs [19:01:22] Krenair: T174053 [19:01:27] (03Merged) 10jenkins-bot: Group1 to wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374844 (owner: 10Chad) [19:01:27] but it's in S4, so you may not have access to it [19:01:47] don't think I ever had access to that, IIRC S4 is procurement [19:01:53] but ok [19:01:55] the icinga alarm is already acked and points to T174053 [19:01:57] robh: should probably do that it was approved last week [19:02:01] and yeah that link won't work for me either hasharAway, but ty [19:02:01] Krenair: yeah, S4 is procurement [19:02:06] In january 2018, LE will start to support wildcart certs [19:02:10] yeah we know [19:02:19] but that cert expires in october, so ;) [19:02:37] paravoid: yeah its on my list today [19:02:50] paravoid: but in general, are you planning to use that later, or still buy a cert like that? 
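On the "add a special var/log partition later" point in the partman discussion above: keeping / on LVM means leftover extents in the volume group can be carved into a new logical volume after install, with no repartitioning. A minimal sketch under assumed names (vg0 as the volume group; the size is illustrative):
```
# create a 20G logical volume from free space in vg0, then format and mount it
lvcreate --size 20G --name varlog vg0
mkfs.ext4 /dev/vg0/varlog
mount /dev/vg0/varlog /mnt   # migrate data, then give it a permanent fstab entry
```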
[19:02:51] but yeah, its in october it wasnt on my 'drop everyting do this second' just on my 'get done this week' column [19:02:56] robh: what is sensible and I should be OK with that existing [19:03:25] Sagan: probably, but haven't thought about it much yet [19:03:46] paravoid: ok, but at least there is no general opposal yet, I guess? :) [19:03:52] nope :) [19:03:54] robh, I figured :) [19:03:54] yeah it's possible we could go LE, but the devil is always in the details [19:04:03] yeah that [19:04:08] ok :) [19:04:27] we've gone LE for our non wildcard stuff whenever possible, we like them =] [19:04:27] one thing I've noted so far is that I think they only authorize wildcards via the DNS challenge proto, not the HTTP one like we use today [19:04:27] !log demon@tin Synchronized php: Symlink swap for group1/wmf.16 (duration: 00m 46s) [19:04:35] it was going to be dns-based verification only right? [19:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:56] and who knows if they'll still have limitations on how many wildcard SANs in a cert, or subdomains of subdomains in a cert, or still dissallow top-N domains, etc [19:05:17] which in labs might actually be OK if we write some code to integrate it with designate [19:05:29] 10Operations, 10Cloud-Services, 10Mail: Create a labs SMTP smarthost - https://phabricator.wikimedia.org/T41785#3567441 (10herron) >! In T41785#3567277, @Krenair wrote: > > You mean on separate physical hosts, right? I think we're still limited to eqiad if it is to be a Cloud VPS instance :) Nah, this does... [19:05:39] dunno how you'd handle that in prod [19:06:01] and then we've got our end to deal with - we distribute the big wildcard cert to ~100 servers in 4 datacenters. we'd need some central fetcher to do the automated updates, and then distribute it around and avoid SPOF, etc [19:06:30] 10Operations, 10Cloud-Services, 10Mail: Create a labs SMTP smarthost - https://phabricator.wikimedia.org/T41785#3567458 (10Krenair) >>! In T41785#3567441, @herron wrote: >>! In T41785#3567277, @Krenair wrote: >> >> You mean on separate physical hosts, right? I think we're still limited to eqiad if it is to... [19:06:33] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.16 [19:06:33] Krenair: in theory we can do it with a gdnsd plugin, or some kind of templated zonefile updates, either way [19:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:14] (03CR) 10jenkins-bot: Group1 to wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374844 (owner: 10Chad) [19:07:44] !log ppchelko@tin Started deploy [changeprop/deploy@ff93ea8]: Don't deduplicate retry messages [19:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:12] lastly but not leastly - we currently purchase our big wildcard cert from two independent CAs (GlobalSign and Digicert) for CA-level redundancy in case of things like the CA having an OCSP outage that can affect our users (and then we deploy one vesion of the cert to some DCs, and the other to others, so both are in active use, and have a plan to move all sites to just one of the CAs if the ot [19:08:18] her fails) [19:08:41] 10Operations, 10Cloud-Services, 10Mail: Create a labs SMTP smarthost - https://phabricator.wikimedia.org/T41785#3567462 (10faidon) Also see T47827, T47828, T47829 and T61142. This task is supposed to be for the smarthost which sounds like a good first step. 
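On the wildcard point above (DNS challenge only): for wildcard issuance the ACME CA validates a TXT record under _acme-challenge instead of fetching a token over HTTP, which is why hooking issuance into the authoritative DNS (the gdnsd templating or designate ideas mentioned above) becomes the hard part. A hedged sketch of what the DNS side looks like from outside; the domain and token are placeholders:
```
# the ACME client publishes a per-order token, and the CA queries it
dig +short TXT _acme-challenge.example.org
# expected answer (placeholder): "gfj9Xq...Rg85nM"

# CAA records, discussed just below, declare which CAs may issue for a name at all
dig +short CAA wikimedia.org
```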
I'd recommend keeping separate instances for inboun... [19:08:44] so it's not like we'd replace all of that with LE-only. We'd need a second LE-like CA that's independent (or more likely, keep one of our traditional vendors alongside LE) [19:08:50] so that makes the integration tricky too [19:09:01] !log ppchelko@tin Finished deploy [changeprop/deploy@ff93ea8]: Don't deduplicate retry messages (duration: 01m 17s) [19:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:22] the current certs are OV too, while LE is DV [19:09:50] also, until https://tools.ietf.org/html/draft-ietf-acme-caa-02 or something like that gets implemented [19:09:50] (but so far, I think that makes little difference to UAs right? unlike EV) [19:10:08] it would be probably best to not add LE to our CAA [19:10:30] as if someone manages to hijack e.g. our IP space, they can issue a cert from LE as well and start getting our traffic [19:10:47] (it is in our CAA though, for wikimedia.org, because we have LE certs there now) [19:10:52] (just not the other big domains) [19:11:00] yeah I know... [19:11:13] at least it's not for wikipedia.org, that's something :) [19:11:30] but really, if they hijack our IP space we have bigger problems and CAA won't necessarily fix it [19:11:36] they can fake the DNS responses too after all [19:11:47] yeah, but not the cert :) [19:11:59] oh you mean they can fake the CAA [19:12:01] yeah but they can fake the CAA record to the CA to get the cert [19:12:04] yeah I suppose that's true too [19:12:04] if they can fake DNS responses they can get a valid cert signed [19:12:22] scary stuff [19:12:25] CAA is unlikely to be cached anywhere useful before it's requested from the authservers for a check [19:12:47] in general all of DNS is scary stuff, though [19:13:08] DNSSEC! [19:13:09] * paravoid ducks [19:13:20] hence DNSSEC, the awful extension to an awful protocol that makes some people sleep a little better (but at what cost?) [19:14:24] does anyone know what's with that wikibase/tac cronspam? [19:14:31] I see volans has touched it last [19:14:53] paravoid: let me check [19:15:33] paravoid: that cron has a timeout in front... [19:15:37] I guess is that :( [19:16:33] is a temporary one that will be on for 2-3 weeks AFAIK, I just helped Amir1 to clean it a bit after was already deployed [19:17:11] volans: it has a timeout so it should be killed [19:17:48] paravoid: Can you tell me what's wrong? [19:17:57] Amir1: we get email from cron: /usr/bin/xargs: /usr/bin/tac: terminated by signal 13 [19:18:19] my first guess is that this is triggered when the timeout kills it [19:18:26] hmm, maybe timeout sends that signal [19:18:33] okay. I think it's easy to fix [19:20:07] adding another 2> /dev/null? :D [19:20:33] 13 looks strange http://www.comptechdoc.org/os/linux/programming/linux_pgsignals.html [19:20:58] * volans wonders why is the xargs reporting the kill given that the process running is mwscript and the xargs is only part of the subshell that finds the ID at the start [19:21:21] the emails are all at minute 28, that confirms that is when they are killed by the timeout [19:23:11] volans: If we tell timeout to kill it with SIGTERM, won't it send the email? 
[19:24:25] I doubt that [19:25:19] I think I know what it is, and is not the timeout [19:27:22] ok, my bad, is my exit inside awk that kills the reader of the pipe, we have 2 solutions [19:27:39] 1) add 2> /dev/null at the end of the xargs block [19:27:47] 2) remove the exit and add a head -n1 [19:28:23] Amir1: ^^^ [19:29:02] the second one seems easier but it should be tail I guess [19:29:17] nope because with cat we start from the end ;) [19:29:39] sending a patch [19:29:49] volans: no, let me do it [19:29:56] I need to reduce the time a little [19:30:04] sure [19:32:00] thanks paravoid, I should have looked at the cronspam after the merge... fix in the way ;) [19:36:27] 10Operations, 10Cloud-Services, 10Mail: Create a labs SMTP smarthost - https://phabricator.wikimedia.org/T41785#3567608 (10bd808) [19:37:21] (03PS1) 10Ladsgroup: Fix cronspam [puppet] - 10https://gerrit.wikimedia.org/r/374872 [19:37:28] (03PS2) 10Chad: Gerrit: Start using plugins from scap-deployed version [puppet] - 10https://gerrit.wikimedia.org/r/374667 (https://phabricator.wikimedia.org/T157414) [19:37:51] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Start using plugins from scap-deployed version [puppet] - 10https://gerrit.wikimedia.org/r/374667 (https://phabricator.wikimedia.org/T157414) (owner: 10Chad) [19:38:29] (03PS2) 10Ladsgroup: mediawiki: Fix cronspam [puppet] - 10https://gerrit.wikimedia.org/r/374872 [19:38:41] volans: [19:38:42] https://gerrit.wikimedia.org/r/374872 [19:39:30] (03CR) 10Volans: [C: 032] mediawiki: Fix cronspam [puppet] - 10https://gerrit.wikimedia.org/r/374872 (owner: 10Ladsgroup) [19:40:14] (03PS3) 10Chad: Gerrit: Start using plugins from scap-deployed version [puppet] - 10https://gerrit.wikimedia.org/r/374667 (https://phabricator.wikimedia.org/T157414) [19:40:39] 10Operations, 10Cloud-Services, 10Mail: Create a labs SMTP smarthost - https://phabricator.wikimedia.org/T41785#3567639 (10bd808) Should we start a #cloud-vps project in the spirit of [[https://tools.wmflabs.org/openstack-browser/project/bastion|bastion ]] to host VMs for inbound/outbound SMTP services? [19:41:14] Amir1: thanks! 
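The signal 13 diagnosis above is the classic broken-pipe pattern: when the process reading a pipe exits early, the next write upstream raises SIGPIPE, and wrappers such as xargs then report "terminated by signal 13". A self-contained illustration (bash-specific PIPESTATUS), deliberately unrelated to the actual cron command:
```
# head exits after one line; yes is then killed by SIGPIPE on its next write
yes | head -n1
# 141 = 128 + 13, i.e. the writer died of SIGPIPE
echo "${PIPESTATUS[@]}"   # prints: 141 0
```
The two fixes floated above map onto this directly: 2>/dev/null merely hides the report, while dropping the early exit in favour of head -n1 restructures which process stops first.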
Merged and puppet run on terbium, so we should get the last cronspam in 48m and since then should be ok [19:44:34] (03PS1) 10RobH: new *.wmflabs.org certificate for cert expiry on 2017-10-16 [puppet] - 10https://gerrit.wikimedia.org/r/374873 (https://phabricator.wikimedia.org/T174053) [19:46:27] 10Puppet, 10Cloud-Services: Migrate references from $instance.eqiad.wmflabs to $instance.$project.eqiad.wmflabs - https://phabricator.wikimedia.org/T153608#3567672 (10Krenair) [19:47:41] 10Operations, 10Cloud-Services: Missing Labs hiera entry in labs-private repo - https://phabricator.wikimedia.org/T152767#3567673 (10Krenair) [19:47:46] 10Operations, 10cloud-services-team: update *.wmflabs.org by 2017-10-16 - https://phabricator.wikimedia.org/T174611#3567674 (10RobH) [19:51:08] (03PS1) 10Reedy: Add centralnotice tables to maintain-views.yaml [puppet] - 10https://gerrit.wikimedia.org/r/374875 (https://phabricator.wikimedia.org/T135405) [19:51:54] (03PS3) 10Mforns: [WIP] Optimize EventLogging purging script using timestamps [puppet] - 10https://gerrit.wikimedia.org/r/374823 (https://phabricator.wikimedia.org/T156933) [19:52:14] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Optimize EventLogging purging script using timestamps [puppet] - 10https://gerrit.wikimedia.org/r/374823 (https://phabricator.wikimedia.org/T156933) (owner: 10Mforns) [19:52:28] 10Operations, 10cloud-services-team (Kanban): update *.wmflabs.org by 2017-10-16 - https://phabricator.wikimedia.org/T174611#3567702 (10bd808) [19:55:54] (03PS1) 10Reedy: Remove/collapse a few conditionals in CentralNotice config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374877 [19:56:46] 10Operations, 10Cloud-Services, 10Mail: Create a labs SMTP smarthost - https://phabricator.wikimedia.org/T41785#3567720 (10Krenair) I think so [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170830T2000). [20:00:21] Nothing for ORES today [20:06:14] (03PS1) 10Mforns: [WIP] Add reportupdater job to trigger page-creation metrics [puppet] - 10https://gerrit.wikimedia.org/r/374878 (https://phabricator.wikimedia.org/T170850) [20:06:52] (03CR) 10Mforns: [C: 04-1] "Depends on https://gerrit.wikimedia.org/r/#/c/373373 being merged first." [puppet] - 10https://gerrit.wikimedia.org/r/374878 (https://phabricator.wikimedia.org/T170850) (owner: 10Mforns) [20:07:34] !log T169939: Decommission Cassandra: restbase1009-c.eqiad.wmnet [20:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:52] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [20:10:04] (03CR) 10Krinkle: [C: 031] Use new logo of WMF for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/374838 (https://phabricator.wikimedia.org/T174576) (owner: 10Ladsgroup) [20:10:32] 10Operations, 10Deployment-Systems, 10JobRunner-Service, 10Patch-For-Review, and 2 others: Figure out how to disable starting of jobrunner/jobchron in the non-active DC - https://phabricator.wikimedia.org/T167104#3567789 (10hashar) Awesome and well played! [20:13:27] Can Ops review (and merge) this puppet patch? https://gerrit.wikimedia.org/r/#/c/374838/ [20:13:42] Every time someone sees the RGB logo in gerrit, a goat dies [20:14:22] or tell me what's the next steps. Should I get sign off from someone else? [20:14:23] lol, ops reviewing css? 
[20:14:51] the whole patch, since it's puppet [20:18:19] 10Operations, 10JobRunner-Service, 10Beta-Cluster-reproducible, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): jobrunner / jobchron systemd services are in error state after a stop - https://phabricator.wikimedia.org/T168044#3567827 (10hashar) [20:19:19] 10Operations, 10JobRunner-Service, 10Beta-Cluster-reproducible, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): jobrunner / jobchron systemd services are in error state after a stop - https://phabricator.wikimedia.org/T168044#3354078 (10hashar) 05Open>03stalled That one depends on T129148 co... [20:19:50] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/369923 (owner: 10Hashar) [20:21:30] (03Abandoned) 10Hashar: (WIP) trigger all modules [puppet] - 10https://gerrit.wikimedia.org/r/369923 (owner: 10Hashar) [20:21:30] !log ppchelko@tin Started deploy [changeprop/deploy@8d5fa29]: Start rejecting deduplicated messages [20:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:35] !log ppchelko@tin Finished deploy [changeprop/deploy@8d5fa29]: Start rejecting deduplicated messages (duration: 01m 05s) [20:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:33] (03CR) 10Hashar: [C: 031] "Alex has been nicely taking care of those alarms." [puppet] - 10https://gerrit.wikimedia.org/r/374866 (owner: 10Alex Monk) [20:26:41] (03CR) 10Hashar: "Filippo, Faidon, could you check the production Swift machine and check whether nscd has some unusual CPU usage?" [puppet] - 10https://gerrit.wikimedia.org/r/358799 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [20:27:25] (03Abandoned) 10Hashar: mediawiki-firejail: explicitly signal end of options [puppet] - 10https://gerrit.wikimedia.org/r/338979 (https://phabricator.wikimedia.org/T158649) (owner: 10Hashar) [20:29:27] (03CR) 10Hashar: "Brandon, I got the gdnsd configuration files generated thanks to rspec-puppet. But since CI lacks MaxMind GeoDNS files it is hitting a wal" [puppet] - 10https://gerrit.wikimedia.org/r/343747 (owner: 10Hashar) [20:30:21] (03CR) 10Greg Grossmeier: [C: 031] Add me back to deployment-prep shinken contacts [puppet] - 10https://gerrit.wikimedia.org/r/374866 (owner: 10Alex Monk) [20:32:03] (03PS2) 10Hashar: contint: upgrade git on zuul mergers [puppet] - 10https://gerrit.wikimedia.org/r/360420 (https://phabricator.wikimedia.org/T161086) [20:32:08] (03PS3) 10Hashar: contint: upgrade git on zuul mergers [puppet] - 10https://gerrit.wikimedia.org/r/360420 (https://phabricator.wikimedia.org/T161086) [20:32:45] (03CR) 10Hashar: "I guess I am going to manually upgrade git on the zuul-merger servers (contint1001 and contint2001). 
If all goes fine, this patch can be m" [puppet] - 10https://gerrit.wikimedia.org/r/360420 (https://phabricator.wikimedia.org/T161086) (owner: 10Hashar) [20:33:21] (03Abandoned) 10Hashar: Rake: memoize git_changed_in_head() [puppet] - 10https://gerrit.wikimedia.org/r/359951 (https://phabricator.wikimedia.org/T166888) (owner: 10Hashar) [20:33:34] (03Abandoned) 10Hashar: Rake: optimize typos task for CI [puppet] - 10https://gerrit.wikimedia.org/r/357804 (https://phabricator.wikimedia.org/T166888) (owner: 10Hashar) [20:35:49] (03PS3) 10Hashar: tests: disable ruby output buffering [puppet] - 10https://gerrit.wikimedia.org/r/359457 [20:36:23] (03CR) 10Hashar: [C: 031] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/299151 (owner: 10Hashar) [20:36:39] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/341602 (owner: 10Hashar) [20:37:00] (03PS3) 10Hashar: contint: boilerplate for spec tests [puppet] - 10https://gerrit.wikimedia.org/r/342206 [20:37:47] (03PS9) 10Hashar: interface: IPAddr.new() requires an address family [puppet] - 10https://gerrit.wikimedia.org/r/336840 [20:38:21] (03CR) 10jerkins-bot: [V: 04-1] interface: IPAddr.new() requires an address family [puppet] - 10https://gerrit.wikimedia.org/r/336840 (owner: 10Hashar) [20:38:42] RainbowSprinkles, looks like today's train caused lot of stack overflows [20:41:12] MaxSem: I think those are all https://phabricator.wikimedia.org/T173520 [20:43:20] thcipriani, https://logstash.wikimedia.org/app/kibana#/dashboard/Fatal-Monitor?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-4h,mode:quick,to:now))&_a=(filters:!(('$$hashKey':'object:1366','$state':(store:appState),bool:(must:!((terms:(level:!(NOTICE,INFO,WARNING))),(term:(type:mediawiki)))),meta:(alias:!n,disabled:!f,index:'logstash-*',key:bool,negate:!t,value:'%7B%22must%22:%5B%7B%22terms%22:%7B%22level%22:%5B%22NOTICE%22 [20:43:20] ,%22INFO%22,%22WARNING%22%5D%7D%7D,%7B%22term%22:%7B%22type%22:%22mediawiki%22%7D%7D%5D%7D')),('$$hashKey':'object:1367','$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:message,negate:!t,value:SlowTimer),query:(match:(message:(query:SlowTimer,type:phrase)))),('$$hashKey':'object:1368','$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:message,negate:!t,value:'Invalid%20host%20name'),query:(m [20:43:21] atch:(message:(query:'Invalid%20host%20name',type:phrase)))),('$$hashKey':'object:1369','$state':(store:appState),meta:(alias:!n,disabled:!f,index:'logstash-*',key:level,negate:!t,value:INFO),query:(match:(level:(query:INFO,type:phrase))))),options:(darkTheme:!f),panels:!((col:1,id:Top-20-Hosts,panelIndex:2,row:3,size_x:8,size_y:2,type:visualization),(col:1,columns:!(type,level,wiki,host,message,http_status),id:Default-Events-List,panelIndex:3 [20:43:26] ,row:11,size_x:12,size_y:23,sort:!('@timestamp',desc),type:search),(col:1,id:Fatal-Events-Over-Time,panelIndex:4,row:1,size_x:12,size_y:2,type:visualization),(col:1,id:Trending-Messages,panelIndex:5,row:5,size_x:12,size_y:6,type:visualization),(col:9,id:MediaWiki-Versions,panelIndex:6,row:3,size_x:4,size_y:2,type:visualization)),query:(query_string:(analyze_wildcard:!t,query:'(type:mediawiki%20AND%20(channel:exception%20OR%20channel:wfLogDBErr [20:43:31] 
or))%20OR%20type:hhvm')),title:'Fatal%20Monitor',uiState:(P-2:(spy:(mode:(fill:!f,name:!n)),vis:(legendOpen:!f,params:(sort:(columnIndex:!n,direction:!n)))),P-4:(spy:(mode:(fill:!f,name:!n)),vis:(colors:(exception:%23C15C17,hhvm:%23BF1B00))),P-5:(spy:(mode:(fill:!f,name:!n)),vis:(params:(sort:(columnIndex:!n,direction:!n)))),P-6:(spy:(mode:(fill:!f,name:!n)),vis:(legendOpen:!t)))) [20:43:36] ffffuuuuuuu [20:43:40] fuck you kibana [20:43:42] sorry :P [20:44:12] it's a lot. I'm going to rollback while I try to figure out what to do about these. [20:45:31] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 back to 1.30.0-wmf.15 [20:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:48] (03CR) 10Paladox: [C: 031] "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/374667 (https://phabricator.wikimedia.org/T157414) (owner: 10Chad) [20:46:51] ideally could find someone to review https://gerrit.wikimedia.org/r/#/c/372565 [20:50:14] (03PS1) 10Thcipriani: Revert "Group1 to wmf.16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374890 [20:50:40] (03CR) 10Thcipriani: [C: 032] Revert "Group1 to wmf.16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374890 (owner: 10Thcipriani) [20:52:17] (03Merged) 10jenkins-bot: Revert "Group1 to wmf.16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374890 (owner: 10Thcipriani) [20:57:02] 10Operations, 10netops: set up cr3-esams - https://phabricator.wikimedia.org/T174616#3567954 (10ayounsi) [20:57:38] 10Operations, 10netops: set up cr3-esams - https://phabricator.wikimedia.org/T174616#3567973 (10ayounsi) [20:58:04] thcipriani: done [20:58:25] (03CR) 10jenkins-bot: Revert "Group1 to wmf.16" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374890 (owner: 10Thcipriani) [20:58:31] AaronSchulz: thank you! [20:59:02] I expect that overflow to be avoided...hopefully there isn't another one elsewhere ;) [21:00:08] ok. Merging for wmf.16 now, I'll deploy that and re-roll forward [21:00:55] (03PS1) 10Smalyshev: Remove Q25267 (degree Celsius) from conversion config. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374892 (https://phabricator.wikimedia.org/T174353) [21:02:17] (03CR) 10Lucas Werkmeister (WMDE): [C: 031] Remove Q25267 (degree Celsius) from conversion config. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374892 (https://phabricator.wikimedia.org/T174353) (owner: 10Smalyshev) [21:03:19] !log thcipriani@tin Synchronized php-1.30.0-wmf.16/extensions/ProofreadPage/ProofreadPage.body.php: [[gerrit:374882|ProofreadPage: Avoids a stack overflow]] T173520 (duration: 00m 47s) [21:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:31] T173520: Fatal error: Stack overflow in [files] for wmf.14 - https://phabricator.wikimedia.org/T173520 [21:05:08] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 back to 1.30.0-wmf.16 [21:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:34] well, I don't see any stack overflows... [21:16:23] greg-g, can we have a window to enable watchlist filters (like RC Filters, but for watchlist) Tuesday 7 AM Pacific: https://phabricator.wikimedia.org/T164234 ? [21:18:08] matt_flaschen: sure thing. 
godspeed :) [21:18:23] Thanks [21:23:46] 10Operations, 10Cassandra, 10Epic, 10Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3568035 (10Eevans) To summarize the discussion in https://gerrit.wikimedia.org/r/373863 (and elsewhere): Linux MD is capable of some interesting,... [21:27:43] 10Operations, 10Cloud-Services, 10Mail: Create a labs SMTP smarthost - https://phabricator.wikimedia.org/T41785#3568057 (10bd808) I created {T174618} for the project. [21:49:23] (03PS1) 10Alex Monk: [WIP] shinkengen for all projects [puppet] - 10https://gerrit.wikimedia.org/r/374897 (https://phabricator.wikimedia.org/T166845) [22:07:35] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 7 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3568300 (10GWicke) [22:12:37] (03PS4) 10Ottomata: [WIP] Initial commit of certpy [software/certpy] - 10https://gerrit.wikimedia.org/r/359960 (https://phabricator.wikimedia.org/T166167) [22:20:02] PROBLEM - IPMI Temperature on es1019 is CRITICAL: CHECK_NRPE: Socket timeout after 60 seconds. [22:57:03] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 7 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3537304 (10GWicke) A possible contribution to the backlog building could be the infinite retry / immortal job problem described in T73853. Looking for ol... [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170830T2300). [23:00:04] Smalyshev: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:22] I'll take swat [23:00:26] (cuz I'm stealing half of it) [23:00:36] I'm here [23:01:26] (03CR) 10Chad: [C: 032] Remove Q25267 (degree Celsius) from conversion config. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374892 (https://phabricator.wikimedia.org/T174353) (owner: 10Smalyshev) [23:02:08] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3568521 (10Krinkle) [23:02:55] (03Merged) 10jenkins-bot: Remove Q25267 (degree Celsius) from conversion config. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374892 (https://phabricator.wikimedia.org/T174353) (owner: 10Smalyshev) [23:03:11] (03CR) 10jenkins-bot: Remove Q25267 (degree Celsius) from conversion config. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374892 (https://phabricator.wikimedia.org/T174353) (owner: 10Smalyshev) [23:04:40] thcipriani: 76eb9ac4cdaa72fa59a7a5ae73a8a3daa9f18b1e is locally committed on tin but not pushed? [23:05:43] !log demon@tin Synchronized wmf-config/unitConversionConfig.json: swat (duration: 00m 47s) [23:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:48] RainbowSprinkles: ah, crap, yeah lemme push that [23:07:02] SMalyshev: You're live everywhere [23:07:11] thanks! 
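The "committed on tin but not pushed" catch above is easy to self-check on the deploy host. A sketch, assuming the staging checkout lives at /srv/mediawiki-staging and tracks origin/master (both are assumptions):
```
cd /srv/mediawiki-staging
# list commits that exist locally but are absent from the remote tracking branch
git log --oneline origin/master..HEAD
```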
[23:07:11] (03PS1) 10Thcipriani: Revert "Revert "Group1 to wmf.16"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374909 [23:07:37] 10Operations, 10MediaWiki-JobQueue: The refreshLinks jobs enqueue rate is 10 times the normal rate - https://phabricator.wikimedia.org/T129517#3568537 (10Krinkle) [23:07:38] (03CR) 10Thcipriani: [C: 032] Revert "Revert "Group1 to wmf.16"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374909 (owner: 10Thcipriani) [23:07:53] it's deployed, just wasn't pushed/merged. [23:09:04] !log demon@tin Synchronized php-1.30.0-wmf.15/extensions/Newsletter/includes/specials/SpecialNewsletter.php: T174604 (duration: 00m 48s) [23:09:08] (03Merged) 10jenkins-bot: Revert "Revert "Group1 to wmf.16"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374909 (owner: 10Thcipriani) [23:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:17] (03CR) 10jenkins-bot: Revert "Revert "Group1 to wmf.16"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374909 (owner: 10Thcipriani) [23:09:17] T174604: Call to undefined method Newsletter::getSubscriberCount() - https://phabricator.wikimedia.org/T174604 [23:09:24] ^ RainbowSprinkles clean! [23:09:29] ty :) [23:10:11] was there any change to mediawiki deployed recently (within past hour)? [23:10:13] !log demon@tin Synchronized php-1.30.0-wmf.16/extensions/Newsletter/includes/specials/SpecialNewsletter.php: T174604 (duration: 00m 47s) [23:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:33] i see some things differently now [23:10:53] You'll have to be far more specific than that [23:11:12] (because to your first question: yes, stuff has been deployed) [23:11:59] lol [23:12:21] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3568551 (10GWicke) HTMLCacheUpdate root job timestamp distribution, jobs executed within the last 15 hours: ``` 1233 20170407 8237 20170408 1... [23:12:45] Danny_B: As long as you don't see dejavu [23:12:49] Then someone has changed the matrix [23:13:27] ;-) [23:13:49] i guess something relevant to parser, since
<br/>
is no longer in output where it used to be [23:16:17] (03CR) 10Greg Grossmeier: [C: 031] Use new logo of WMF for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/374838 (https://phabricator.wikimedia.org/T174576) (owner: 10Ladsgroup) [23:16:29] which is kinda unpleasant since it changes behavior and look of things [23:17:45] (03CR) 10Chad: [C: 04-1] "Symbolic -1 because I hate the logo. But +1 I guess :(" [puppet] - 10https://gerrit.wikimedia.org/r/374838 (https://phabricator.wikimedia.org/T174576) (owner: 10Ladsgroup) [23:19:10] symbolic +1 to chad! [23:19:47] Monochrome logos are so fucking unimaginative and boring. [23:20:14] I should come up with a custom gerrit logo so we don't have to use the WMF one! [23:20:15] :D [23:20:34] its never too late ;) [23:20:47] I hear that folks like goats! [23:20:53] so back to the deployed change... ;-) any changes to parser? [23:21:11] No idea! [23:21:14] I just deploy stuff [23:21:21] I have no idea what I'm deploying 99% of the time [23:21:22] :D [23:21:25] It's part of the fun! [23:21:31] black box deployments! [23:22:11] https://www.mediawiki.org/wiki/MediaWiki_1.30/wmf.16/Changelog [23:22:30] bd808, yer spoiling the fun! [23:22:30] I guess I wrote that! [23:22:31] :p [23:24:10] RainbowSprinkles: you at least ran the script that writes it :D [23:24:36] And the script doesn't handle logins anymore [23:24:38] Since the rewrite [23:24:43] So it's always posted by an IP :p [23:26:26] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3568578 (10Krinkle) [23:27:21] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3544008 (10Krinkle) Added a mitigation section to the task description. Also a summary of the impact of the mitigations so far (based on input from @aaro... [23:27:26] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 5 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3568580 (10Krinkle) [23:37:47] 10Operations, 10netops: set up cr3-esams - https://phabricator.wikimedia.org/T174616#3568606 (10ayounsi) [23:49:45] hmm, i can't find in that changelog anything what would seem relevant to that change i spotted [23:51:05] basically suddenly some
<br/>
s disappeared [23:55:47] Hey folks... [23:56:11] I've uploaded some new files to bromine, to appear in https://releases.wikimedia.org/mobile/android/wikipedia/stable/ [23:56:32] but the web directory listing doesn't show the new files. Is there some cache that needs to be cleared? [23:58:14] dbrant: It's varnish cached, yes. Try adding some ?foo to the URL and you should see a new result :) [23:58:17] (it'll expire pretty soon) [23:58:36] Ohhhhh [23:58:38] WAIT [23:58:41] It's not on bromine now! [23:58:47] releases1001.eqiad.wmnet [23:58:49] (we moved it) [23:58:50] duhhhh [23:58:53] Sorry! [23:59:04] We should add a warning or something on bromine [23:59:15] RainbowSprinkles: ahh! I see. I'll try that... [23:59:18] thanks [23:59:41] releases1001 should have the same directory structure setup
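On the ?foo trick above: Varnish caches the directory listing per URL, so any query string it has never seen forces a cache miss and a fresh fetch from the backend. A quick check along those lines (the parameter name is arbitrary):
```
# an unseen query string bypasses the cached listing
curl -s 'https://releases.wikimedia.org/mobile/android/wikipedia/stable/?nocache=1' | head -n 40
```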