[00:00:05] twentyafterfour: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180329T0000)
[00:00:05] No GERRIT patches in the queue for this window AFAICS.
[00:05:54] PROBLEM - Host cp2009 is DOWN: PING CRITICAL - Packet loss = 100%
[00:11:34] PROBLEM - Host cp2010 is DOWN: PING CRITICAL - Packet loss = 100%
[00:15:44] PROBLEM - IPsec on kafka-jumbo1006 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[00:15:44] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:15:44] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:15:44] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[00:15:44] PROBLEM - IPsec on kafka-jumbo1001 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[00:15:45] PROBLEM - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[00:15:45] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:15:45] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[00:15:45] PROBLEM - IPsec on cp4032 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:15:46] PROBLEM - IPsec on kafka1023 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[00:15:54] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:15:54] PROBLEM - IPsec on cp4027 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:15:54] PROBLEM - IPsec on cp4031 is CRITICAL: Strongswan CRITICAL - ok: 53 no-child-sa: cp1068_v4 not-conn: cp2010_v4, cp2010_v6
[00:15:54] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:15:55] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:15:55] PROBLEM - IPsec on cp4029 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:15:55] PROBLEM - IPsec on cp3031 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:16:04] PROBLEM - IPsec on cp5009 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:16:04] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[00:16:04] PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:16:05] PROBLEM - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[00:16:14] PROBLEM - IPsec on cp4028 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:16:14] PROBLEM - IPsec on cp5012 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:16:14] PROBLEM - IPsec on cp4030 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:16:14] PROBLEM - IPsec on kafka-jumbo1002 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[00:16:14] PROBLEM - IPsec on cp5010 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:16:14] PROBLEM - IPsec on cp5007 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:16:14] PROBLEM - IPsec on cp5011 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:16:15] PROBLEM - IPsec on cp5008 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:16:15] PROBLEM - IPsec on kafka-jumbo1003 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[00:16:16] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[00:16:16] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[00:16:17] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:16:24] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:16:24] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:16:34] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:16:34] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:16:35] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:16:44] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:16:44] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:18:04] RECOVERY - Host cp2010 is UP: PING WARNING - Packet loss = 93%, RTA = 36.00 ms
[00:18:14] RECOVERY - IPsec on kafka-jumbo1002 is OK: Strongswan OK - 136 ESP OK
[00:18:14] RECOVERY - IPsec on cp5012 is OK: Strongswan OK - 56 ESP OK
[00:18:14] RECOVERY - IPsec on cp5010 is OK: Strongswan OK - 56 ESP OK
[00:18:14] RECOVERY - IPsec on cp5007 is OK: Strongswan OK - 56 ESP OK
[00:18:15] RECOVERY - IPsec on cp5011 is OK: Strongswan OK - 56 ESP OK
[00:18:15] RECOVERY - IPsec on kafka-jumbo1003 is OK: Strongswan OK - 136 ESP OK
[00:18:15] RECOVERY - IPsec on cp5008 is OK: Strongswan OK - 56 ESP OK
[00:18:15] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 136 ESP OK
[00:18:15] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 136 ESP OK
[00:18:16] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 56 ESP OK
[00:18:25] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 56 ESP OK
[00:18:25] RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 56 ESP OK
[00:18:34] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 56 ESP OK
[00:18:34] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 56 ESP OK
[00:18:35] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 56 ESP OK
[00:18:44] RECOVERY - IPsec on kafka-jumbo1006 is OK: Strongswan OK - 136 ESP OK
[00:18:44] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 56 ESP OK
[00:18:44] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 56 ESP OK
[00:18:44] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 136 ESP OK
[00:18:45] RECOVERY - IPsec on kafka-jumbo1001 is OK: Strongswan OK - 136 ESP OK
[00:18:45] RECOVERY - IPsec on kafka-jumbo1004 is OK: Strongswan OK - 136 ESP OK
[00:18:45] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 56 ESP OK
[00:18:45] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 56 ESP OK
[00:18:45] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 136 ESP OK
[00:18:46] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 56 ESP OK
[00:18:54] RECOVERY - IPsec on kafka1023 is OK: Strongswan OK - 136 ESP OK
[00:18:54] RECOVERY - IPsec on cp4032 is OK: Strongswan OK - 56 ESP OK
[00:18:54] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 56 ESP OK
[00:18:54] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 56 ESP OK
[00:18:54] RECOVERY - IPsec on cp4027 is OK: Strongswan OK - 56 ESP OK
[00:18:55] RECOVERY - IPsec on cp4031 is OK: Strongswan OK - 56 ESP OK
[00:18:55] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 56 ESP OK
[00:19:04] RECOVERY - IPsec on cp4029 is OK: Strongswan OK - 56 ESP OK
[00:19:04] RECOVERY - IPsec on cp3031 is OK: Strongswan OK - 56 ESP OK
[00:19:04] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 136 ESP OK
[00:19:04] RECOVERY - IPsec on cp5009 is OK: Strongswan OK - 56 ESP OK
[00:19:04] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 56 ESP OK
[00:19:05] RECOVERY - IPsec on kafka-jumbo1005 is OK: Strongswan OK - 136 ESP OK
[00:19:14] RECOVERY - IPsec on cp4028 is OK: Strongswan OK - 56 ESP OK
[00:19:14] RECOVERY - IPsec on cp4030 is OK: Strongswan OK - 56 ESP OK
[00:20:05] PROBLEM - Keyholder SSH agent on deploy1001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[00:21:20] this is ongoing work on cp codfw servers
[00:21:46] those recoveries are actually from a single server at a time
[00:22:31] jouncebot: next
[00:22:31] In 12 hour(s) and 37 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180329T1300)
[00:22:53] twentyafterfour: are you working on phab today?
[00:22:59] jouncebot: now
[00:22:59] For the next 0 hour(s) and 37 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180329T0000)
[00:25:25] arming keyholder on deploy1001
[00:25:59] twentyafterfour: if we could just get deploy1001 synced before the next swat that would be great
[00:26:05] RECOVERY - Keyholder SSH agent on deploy1001 is OK: OK: Keyholder is armed with all configured keys.
[00:26:44] (03PS1) 10Dzahn: Revert "Revert "switch deployment server from tin to deploy1001"" [puppet] - 10https://gerrit.wikimedia.org/r/422632
[00:27:02] (03CR) 10Dzahn: "deploy1001 is back as jessie" [puppet] - 10https://gerrit.wikimedia.org/r/422632 (owner: 10Dzahn)
[00:53:14] PROBLEM - Host cp2010 is DOWN: PING CRITICAL - Packet loss = 100%
[00:57:20] (03PS1) 10Samwilson: Make a note about the loading order of GlobalPreferences and Echo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422642 (https://phabricator.wikimedia.org/T190353)
[00:58:04] RECOVERY - Host cp2009 is UP: PING OK - Packet loss = 0%, RTA = 36.99 ms
[00:58:34] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:58:34] PROBLEM - IPsec on cp4030 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:58:34] PROBLEM - IPsec on cp4028 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:58:35] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:58:44] PROBLEM - IPsec on cp5009 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:58:44] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:58:45] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:58:45] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:58:45] PROBLEM - IPsec on cp5012 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:58:54] PROBLEM - IPsec on cp5007 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:58:54] PROBLEM - IPsec on cp5011 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:58:54] PROBLEM - IPsec on cp5010 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:58:54] PROBLEM - IPsec on cp5008 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:58:54] PROBLEM - IPsec on kafka-jumbo1006 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[00:58:55] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[00:58:55] PROBLEM - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[00:58:55] PROBLEM - IPsec on kafka-jumbo1001 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[00:58:55] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[00:59:04] PROBLEM - IPsec on kafka1023 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[00:59:04] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:59:04] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:59:05] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:59:05] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:59:05] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:59:05] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:59:05] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:59:14] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[00:59:14] PROBLEM - IPsec on cp4032 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:59:15] PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:59:15] PROBLEM - IPsec on cp4027 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:59:15] PROBLEM - IPsec on cp4031 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:59:15] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:59:24] PROBLEM - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[00:59:24] PROBLEM - IPsec on kafka-jumbo1002 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[00:59:24] PROBLEM - IPsec on cp4029 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:59:24] PROBLEM - IPsec on cp3031 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:59:25] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[00:59:25] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2010_v4, cp2010_v6
[00:59:25] PROBLEM - IPsec on kafka-jumbo1003 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[00:59:25] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2010_v4, cp2010_v6
[01:14:18] 10Operations, 10Traffic, 10User-zeljkofilipin: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4089968 (10Krinkle) 05Open>03declined Agreed. If anything, this may also very well fix issues for people writing new things with...
[01:14:55] 10Operations, 10Traffic, 10User-zeljkofilipin: Figure out how Varnish errorpage was enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4089970 (10Krinkle)
[01:15:12] 10Operations, 10Traffic, 10User-zeljkofilipin: Figure out how Varnish errorpage was enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4074032 (10Krinkle) 05declined>03Resolved a:03BBlack
[01:18:47] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#4089973 (10Krinkle)
[01:28:24] PROBLEM - Host cp2015 is DOWN: PING CRITICAL - Packet loss = 100%
[01:34:44] RECOVERY - Host cp2015 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms
[01:43:14] PROBLEM - Host cp2015 is DOWN: PING CRITICAL - Packet loss = 100%
[01:45:14] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 56 ESP OK
[01:45:14] RECOVERY - IPsec on kafka1023 is OK: Strongswan OK - 136 ESP OK
[01:45:15] RECOVERY - IPsec on cp3040 is OK: Strongswan OK - 56 ESP OK
[01:45:15] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 56 ESP OK
[01:45:24] RECOVERY - Host cp2010 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms
[01:45:25] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 136 ESP OK
[01:45:25] RECOVERY - IPsec on cp5009 is OK: Strongswan OK - 56 ESP OK
[01:45:25] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 56 ESP OK
[01:45:34] RECOVERY - IPsec on cp5007 is OK: Strongswan OK - 56 ESP OK
[01:45:34] RECOVERY - IPsec on cp5011 is OK: Strongswan OK - 56 ESP OK
[01:45:34] RECOVERY - IPsec on cp5012 is OK: Strongswan OK - 56 ESP OK
[01:45:34] RECOVERY - IPsec on cp5008 is OK: Strongswan OK - 56 ESP OK
[01:45:34] RECOVERY - IPsec on cp5010 is OK: Strongswan OK - 56 ESP OK
[01:45:34] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 56 ESP OK
[01:45:34] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 56 ESP OK
[01:45:35] RECOVERY - IPsec on kafka-jumbo1005 is OK: Strongswan OK - 136 ESP OK
[01:45:35] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 56 ESP OK
[01:45:36] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 56 ESP OK
[01:45:36] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 56 ESP OK
[01:45:37] RECOVERY - IPsec on kafka-jumbo1002 is OK: Strongswan OK - 136 ESP OK
[01:45:54] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 56 ESP OK
[01:45:54] RECOVERY - IPsec on cp3031 is OK: Strongswan OK - 56 ESP OK
[01:45:54] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 56 ESP OK
[01:45:55] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 56 ESP OK
[01:45:55] RECOVERY - IPsec on cp4028 is OK: Strongswan OK - 56 ESP OK
[01:45:55] RECOVERY - IPsec on cp4030 is OK: Strongswan OK - 56 ESP OK
[01:46:05] RECOVERY - IPsec on kafka-jumbo1006 is OK: Strongswan OK - 136 ESP OK
[01:46:05] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 136 ESP OK
[01:46:14] RECOVERY - IPsec on kafka-jumbo1004 is OK: Strongswan OK - 136 ESP OK
[01:46:14] RECOVERY - IPsec on kafka-jumbo1001 is OK: Strongswan OK - 136 ESP OK
[01:46:14] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 136 ESP OK
[02:00:14] !log l10nupdate@tin LocalisationUpdate failed: git pull of core failed
[02:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:49:04] RECOVERY - Host cp2015 is UP: PING OK - Packet loss = 0%, RTA = 36.07 ms
[03:16:42] (03CR) 10Krinkle: [C: 031] Swap mediawiki.org to use standard docroot naming scheme [puppet] - 10https://gerrit.wikimedia.org/r/421949 (owner: 10Chad)
[03:23:35] (03PS2) 10Krinkle: Remove indirection from search-redirect.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411284 (owner: 10Chad)
[03:24:15] (03CR) 10Krinkle: [C: 031] "The Gerrit diff is uselessly confusing, but confirmed locally that this removes symlink and moves file in place. Not sure what the "W" mea" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411284 (owner: 10Chad)
[03:25:34] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 801.82 seconds
[04:05:44] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 135.46 seconds
[04:35:22] !log ran scap pull on deploy1001
[04:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:52:30] Hi, There are many failed uploads today. Any particular reason?
[04:52:42] i.e. broken files
[04:53:13] e.g. https://commons.wikimedia.org/wiki/File:Valentin_Guichaux_posant_%C3%A0_cot%C3%A9_de_son_Mao.jpg
[04:53:54] I already deleted more than a dozen like this one
[04:56:42] https://phabricator.wikimedia.org/T190988
[05:32:21] (03PS5) 10Madhuvishy: nfsclient: Setup dumps mounts from new servers [puppet] - 10https://gerrit.wikimedia.org/r/403767 (https://phabricator.wikimedia.org/T188643)
[06:31:44] PROBLEM - puppet last run on labcontrol1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/sudoers]
[06:31:45] PROBLEM - puppet last run on ms-be1027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ferm/ferm.conf]
[06:35:43] (03PS2) 10ArielGlenn: clean up internal rsync client list for dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/422454
[06:36:28] (03CR) 10ArielGlenn: [C: 032] clean up internal rsync client list for dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/422454 (owner: 10ArielGlenn)
[06:42:18] (03PS1) 10Madhuvishy: hieradata: Add settings for dumps distribution servers [puppet] - 10https://gerrit.wikimedia.org/r/422835
[06:49:49] (03CR) 10Madhuvishy: [C: 032] hieradata: Add settings for dumps distribution servers [puppet] - 10https://gerrit.wikimedia.org/r/422835 (owner: 10Madhuvishy)
[06:50:33] 10Operations, 10User-Elukey: Sporadic logrotate issue for stretch mediawiki appservers - https://phabricator.wikimedia.org/T185195#4090147 (10elukey) Another rather confusing thing that I've noticed while checking logs on videoscalers (on stretch) is that `/var/log/apache2/jobqueue-access.log.1` keeps getting...
[06:51:49] (03PS6) 10Madhuvishy: nfsclient: Setup dumps mounts from new servers [puppet] - 10https://gerrit.wikimedia.org/r/403767 (https://phabricator.wikimedia.org/T188643)
[06:52:17] (03CR) 10jerkins-bot: [V: 04-1] nfsclient: Setup dumps mounts from new servers [puppet] - 10https://gerrit.wikimedia.org/r/403767 (https://phabricator.wikimedia.org/T188643) (owner: 10Madhuvishy)
[06:56:20] (03PS1) 10Madhuvishy: dumps: Absent /public/dumps mount served from labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/422848 (https://phabricator.wikimedia.org/T188643)
[06:56:44] RECOVERY - puppet last run on labcontrol1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:45] RECOVERY - puppet last run on ms-be1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:02:32] 10Operations, 10media-storage, 10Patch-For-Review: swift upgrade plans: jessie and swift 2.x - https://phabricator.wikimedia.org/T117972#4090173 (10ema)
[07:03:12] 10Operations, 10Traffic, 10media-storage: upload.wikimedia.org returns HTTP 501 instead of 416 for non-satisfiable byte ranges - https://phabricator.wikimedia.org/T147162#4090176 (10ema) p:05Triage>03Normal
[07:16:13] 10Operations, 10Traffic, 10media-storage: upload.wikimedia.org returns HTTP 501 instead of 416 for non-satisfiable byte ranges - https://phabricator.wikimedia.org/T147162#2683206 (10ema) Reopening this bug as swift still returns `501` when it should return `416`. I have noticed [[https://grafana.wikimedia.o...
[07:18:38] !log reboot cache@eqiad for retpoline kernel updates: T188092
[07:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:51] !log installing openssl security updates
[07:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:18] (03CR) 10Muehlenhoff: [C: 031] "Thanks, looks good to me." [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/422425 (https://phabricator.wikimedia.org/T186250) (owner: 10Jrdnch)
[07:27:21] (03PS1) 10Madhuvishy: dumps: Set up symlinks on instances under /public/dumps [puppet] - 10https://gerrit.wikimedia.org/r/422867 (https://phabricator.wikimedia.org/T188643)
[07:27:55] (03CR) 10jerkins-bot: [V: 04-1] dumps: Set up symlinks on instances under /public/dumps [puppet] - 10https://gerrit.wikimedia.org/r/422867 (https://phabricator.wikimedia.org/T188643) (owner: 10Madhuvishy)
[07:35:40] (03CR) 10Elukey: [V: 032 C: 032] "Really great work, thanks a lot!" [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/422425 (https://phabricator.wikimedia.org/T186250) (owner: 10Jrdnch)
[07:39:39] 10Operations, 10DNS, 10Mail, 10Patch-For-Review: Outbound mail from Greenhouse is broken - https://phabricator.wikimedia.org/T189065#4090230 (10ema)
[07:45:02] (03PS1) 10ArielGlenn: clean up dumps 'latest' links that are too old [puppet] - 10https://gerrit.wikimedia.org/r/422879 (https://phabricator.wikimedia.org/T189527)
[07:58:35] (03PS4) 10Madhuvishy: statistics: Mount dumps share from labstore1006|7 on stat1005|6 [puppet] - 10https://gerrit.wikimedia.org/r/420083 (https://phabricator.wikimedia.org/T188644)
[07:58:43] (03CR) 10Alexandros Kosiaris: [C: 032] apertium: New upstream release [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/419351 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry)
[08:02:50] (03CR) 10Muehlenhoff: [C: 031] "Looks good now, one comment for potential improvement." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421891 (https://phabricator.wikimedia.org/T190400) (owner: 10Ottomata)
[08:07:25] !log roll restart of cassandra on aqs* for openjdk-8 upgrades
[08:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:35] (03PS1) 10Madhuvishy: statistics: Absent existing dumps mount at /mnt/data [puppet] - 10https://gerrit.wikimedia.org/r/422892 (https://phabricator.wikimedia.org/T188644)
[08:12:46] (03PS1) 10Madhuvishy: statistics: Symlink /mnt/data to nfs share from active server [puppet] - 10https://gerrit.wikimedia.org/r/422896 (https://phabricator.wikimedia.org/T188644)
[08:14:45] PROBLEM - cassandra-a CQL 10.64.0.126:9042 on aqs1004 is CRITICAL: connect to address 10.64.0.126 and port 9042: Connection refused
[08:14:49] whattt
[08:14:52] I silenced it
[08:14:53] ufff
[08:15:34] PROBLEM - cassandra-b CQL 10.64.0.127:9042 on aqs1004 is CRITICAL: connect to address 10.64.0.127 and port 9042: Connection refused
[08:16:35] RECOVERY - cassandra-b CQL 10.64.0.127:9042 on aqs1004 is OK: TCP OK - 0.000 second response time on 10.64.0.127 port 9042
[08:16:44] RECOVERY - cassandra-a CQL 10.64.0.126:9042 on aqs1004 is OK: TCP OK - 0.000 second response time on 10.64.0.126 port 9042
[08:18:39] !log installing OpenJDK security updates on elastic* hosts (along with current version of the search plugins package)
[08:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:52] !log T189075 upload apertium_3.5.1-1+wmf1 to apt.wikimedia.org/jessie-wikimedia/main
[08:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:58] T189075: Package apertium-separable and dependencies - https://phabricator.wikimedia.org/T189075
[08:19:11] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-fra] - 10https://gerrit.wikimedia.org/r/421813 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry)
[08:19:16] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-cat] - 10https://gerrit.wikimedia.org/r/421825 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry)
[08:19:21] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-lex-tools] - 10https://gerrit.wikimedia.org/r/419356 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry)
[08:19:28] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-separable] - 10https://gerrit.wikimedia.org/r/421808 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry)
[08:21:22] (I silenced druid hosts since those were the ones that I need to do afterwards, my brain is not working fine with the latest updates for speculative execution)
[08:23:08] 10Operations, 10Traffic, 10monitoring: prometheus: slow dashboards due to suboptimal query_range performance - https://phabricator.wikimedia.org/T190992#4090268 (10ema)
[08:23:19] 10Operations, 10Traffic, 10monitoring: prometheus: slow dashboards due to suboptimal query_range performance - https://phabricator.wikimedia.org/T190992#4090279 (10ema) p:05Triage>03Normal
[08:27:30] 10Operations, 10Analytics, 10Traffic: Investigate and fix odd uri_host values - https://phabricator.wikimedia.org/T188804#4090285 (10ema) p:05Triage>03Normal
[08:27:52] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4090287 (10ema) p:05Triage>03Normal
[08:28:32] 10Operations, 10Analytics, 10Traffic: Spammy events coming our way for sites such us https://ru.wikipedia.kim - https://phabricator.wikimedia.org/T190843#4090289 (10ema) p:05Triage>03Normal
[08:29:39] (03PS2) 10ArielGlenn: clean up dumps 'latest' links that are too old [puppet] - 10https://gerrit.wikimedia.org/r/422879 (https://phabricator.wikimedia.org/T189527)
[08:29:48] akosiaris: nice. all green checks :)
[08:30:20] 10Operations, 10Pybal, 10Traffic: Upgrade pybal-test instances to stretch - https://phabricator.wikimedia.org/T190993#4090290 (10Vgutierrez) p:05Triage>03Normal
[08:33:33] 10Operations, 10Pybal, 10Traffic: Upgrade pybal-test instances to stretch - https://phabricator.wikimedia.org/T190993#4090302 (10Vgutierrez)
[08:37:34] ACKNOWLEDGEMENT - Check systemd state on notebook1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Elukey Still wip
[08:39:46] (03CR) 10Filippo Giunchedi: [C: 031] mtail: Provide ttfb histogram for varnishbackend [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez)
[08:40:12] \o/
[08:41:24] heheh thanks for your patience vgutierrez \o/
[08:54:05] (03PS1) 10Vgutierrez: install_server: Reimage pybal-test2003 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/422897 (https://phabricator.wikimedia.org/T190993)
[08:55:40] (03CR) 10Muehlenhoff: [C: 031] install_server: Reimage pybal-test2003 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/422897 (https://phabricator.wikimedia.org/T190993) (owner: 10Vgutierrez)
[09:02:14] PROBLEM - Check systemd state on cp1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:14] PROBLEM - Check systemd state on cp1064 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:14] PROBLEM - Check systemd state on cp3010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:15] PROBLEM - Check systemd state on cp1065 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:24] PROBLEM - Check systemd state on cp2026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:24] PROBLEM - Check systemd state on cp2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:24] PROBLEM - Check systemd state on cp3049 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:24] PROBLEM - Check systemd state on cp3047 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:24] PROBLEM - Check systemd state on cp3046 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:24] PROBLEM - Check systemd state on cp3008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:25] PROBLEM - Check systemd state on cp2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:25] PROBLEM - Check systemd state on cp4029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:25] PROBLEM - Check systemd state on cp2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:26] PROBLEM - Check systemd state on cp1058 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:26] PROBLEM - Check systemd state on cp2022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:27] PROBLEM - Check systemd state on cp2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:30] ema --^
[09:02:38] PROBLEM - Check systemd state on cp2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:38] PROBLEM - Check systemd state on cp2024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:39] PROBLEM - Check systemd state on cp1074 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:44] PROBLEM - Check systemd state on cp1062 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:44] PROBLEM - Check systemd state on cp1067 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:44] PROBLEM - Check systemd state on cp1072 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:44] PROBLEM - Check systemd state on cp3030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:44] PROBLEM - Check systemd state on cp3035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:44] PROBLEM - Check systemd state on cp3045 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:44] PROBLEM - Check systemd state on cp3037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:45] PROBLEM - Check systemd state on cp1052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:45] PROBLEM - Check systemd state on cp5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:46] PROBLEM - Check systemd state on cp5012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:46] PROBLEM - Check systemd state on cp1048 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:47] PROBLEM - Check systemd state on cp1071 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:58] PROBLEM - Check systemd state on cp4030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:58] PROBLEM - Check systemd state on cp5003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:59] PROBLEM - Check systemd state on cp1045 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:02:59] PROBLEM - Check systemd state on cp4031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:03:00] PROBLEM - Check systemd state on cp4024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:03:00] PROBLEM - Check systemd state on cp4032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:03:01] PROBLEM - Check systemd state on cp4022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:03:01] PROBLEM - Check systemd state on cp5004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:03:02] PROBLEM - Check systemd state on cp5011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:03:02] PROBLEM - Check systemd state on cp5002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:03:03] that's mtail
[09:03:03] PROBLEM - Check systemd state on cp1053 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:03:03] PROBLEM - Check systemd state on cp1063 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:03:04] PROBLEM - Check systemd state on cp4028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:03:04] PROBLEM - Check systemd state on cp4021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:03:05] PROBLEM - Check systemd state on cp1050 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:03:05] yup
[09:03:05] PROBLEM - Check systemd state on cp3048 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:03:06] PROBLEM - Check systemd state on cp5007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:03:06] PROBLEM - Check systemd state on cp1051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:03:14] PROBLEM - Check systemd state on cp3039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:03:14] PROBLEM - Check systemd state on cp3034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:03:14] PROBLEM - Check systemd state on cp3038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:03:14] PROBLEM - Check systemd state on cp3040 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:03:14] PROBLEM - Check systemd state on cp3031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:03:14] PROBLEM - Check systemd state on cp3033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:03:14] PROBLEM - Check systemd state on cp3044 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:03:15] PROBLEM - Check systemd state on cp3042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:03:53] gooood, for a moment I was a bit scared, seems fine :) [09:04:11] sorry :( [09:04:34] nah it is fine! [09:05:36] vgutierrez: don't forget to !log it just in case / for context [09:05:38] vgutierrez: next time just !log it so it will be more staightforward that you were working on it [09:05:41] yeah [09:05:56] *more straightforward to know [09:06:01] :) [09:11:55] RECOVERY - Check systemd state on cp1054 is OK: OK - running: The system is fully operational [09:13:03] cp1054 just came back online after reboot ^ [09:15:47] vgutierrez: so, I think what happened is that the mtail package ships its own systemd unit [09:16:32] !log roll restart aqs on aqs100* for icu/openssl upgrades [09:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:46] vgutierrez: forcing a puppet run on cp3034 [09:16:58] I did that on cp1008, no changes [09:22:33] systemctl status mtail on 1054 says `Active: inactive (dead)` [09:22:52] on 3034: `Active: failed` [...] 
[09:22:58] fatal error, Mar 29 09:16:24 cp1008 mtail[5093]: F0329 09:16:24.598467 5093 mtail.go:219] listen tcp :3903: bind: address already in use [09:23:58] (03PS3) 10ArielGlenn: clean up dumps 'latest' links that are too old [puppet] - 10https://gerrit.wikimedia.org/r/422879 (https://phabricator.wikimedia.org/T189527) [09:24:24] right, so, we have varnishmtail.service running mtail (binding on 3903) [09:24:50] varnishmtail-backend.service instead explicitly starts mtail passing port 3904 as an argument [09:25:15] godog: do we need to run mtail.service on cache hosts? Probably not [09:25:49] ema: likely not [09:26:16] maybe simply set service=>stopped via puppet? [09:26:27] yeah [09:26:39] looks like that, dunno why this was working with mtail 3.0-rc though [09:26:46] *rc4 [09:27:52] (03CR) 10ArielGlenn: [C: 032] clean up dumps 'latest' links that are too old [puppet] - 10https://gerrit.wikimedia.org/r/422879 (https://phabricator.wikimedia.org/T189527) (owner: 10ArielGlenn) [09:29:00] also, why is 1054 happy? I'm gonna reboot 1061 anyways now for kernel upgrade, let's see if there's some magic coincidence at reboot fixing things [09:29:26] (not that we should rely on that of course, just out of curiosity) [09:31:38] wait, now that I look at the puppet code [09:31:49] we do pass $ensure to class mtail [09:32:12] and then we have [09:32:15] hieradata/role/common/cache/text.yaml:mtail::ensure: 'stopped' [09:32:59] so perhaps upgrading the package tries to start the service, that fails, and systemd thinks the unit is failed from that moment onwards [09:33:45] looks like that [09:34:33] do they just need a reset-failed? 
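Put together, the triage discussed above would look roughly like the transcript below on an affected cache host. The unit names and the 3903/3904 ports are taken from the log; the commands are illustrative and only meaningful on those hosts, and the status output shown in comments is a paraphrase, not captured output.

```shell
# mtail.service (shipped by the mtail package) failed because the
# varnishmtail.service-run mtail already binds :3903
# (varnishmtail-backend.service passes :3904 explicitly).
systemctl status mtail                 # shows Active: failed, bind: address already in use
ss -tlnp '( sport = :3903 )'           # confirm which process actually holds the port
systemctl reset-failed mtail.service   # clear the failed state so the systemd check recovers
systemctl mask mtail.service           # keep future package upgrades from starting it again
```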
[09:34:55] RECOVERY - Check systemd state on cp1061 is OK: OK - running: The system is fully operational [09:35:15] RECOVERY - Check systemd state on cp1008 is OK: OK - running: The system is fully operational [09:35:24] RECOVERY - Check systemd state on cp3034 is OK: OK - running: The system is fully operational [09:35:50] yeah reset-failed does work of course, it's not great if we have to remember doing that after every upgrade though [09:36:14] just tried on 3034, and I guess someone else tried on 1008 [09:36:56] more in general, do we want debian packages to auto-start services? :) [09:37:13] on this note, I've got to go afk for a few minutes [09:39:06] ema: that could be avoided setting up /usr/sbin/policy-rc.d [09:39:34] as ofc you already know :) [09:39:59] so I'm triggering the reset-failed for the affected nodes [09:42:25] RECOVERY - Check systemd state on cp3040 is OK: OK - running: The system is fully operational [09:42:25] RECOVERY - Check systemd state on cp3039 is OK: OK - running: The system is fully operational [09:42:25] RECOVERY - Check systemd state on cp3044 is OK: OK - running: The system is fully operational [09:42:25] RECOVERY - Check systemd state on cp3038 is OK: OK - running: The system is fully operational [09:42:25] RECOVERY - Check systemd state on cp3031 is OK: OK - running: The system is fully operational [09:42:25] RECOVERY - Check systemd state on cp3033 is OK: OK - running: The system is fully operational [09:42:25] RECOVERY - Check systemd state on cp3042 is OK: OK - running: The system is fully operational [09:42:26] RECOVERY - Check systemd state on cp3032 is OK: OK - running: The system is fully operational [09:42:34] RECOVERY - Check systemd state on cp4023 is OK: OK - running: The system is fully operational [09:42:34] RECOVERY - Check systemd state on cp3036 is OK: OK - running: The system is fully operational [09:42:34] RECOVERY - Check systemd state on cp3007 is OK: OK - running: The system is fully operational [09:42:34] 
RECOVERY - Check systemd state on cp1058 is OK: OK - running: The system is fully operational [09:42:34] RECOVERY - Check systemd state on cp4026 is OK: OK - running: The system is fully operational [09:42:34] RECOVERY - Check systemd state on cp3010 is OK: OK - running: The system is fully operational [09:42:35] RECOVERY - Check systemd state on cp2026 is OK: OK - running: The system is fully operational [09:42:35] RECOVERY - Check systemd state on cp2014 is OK: OK - running: The system is fully operational [09:42:35] RECOVERY - Check systemd state on cp5007 is OK: OK - running: The system is fully operational [09:42:36] RECOVERY - Check systemd state on cp2013 is OK: OK - running: The system is fully operational [09:42:44] RECOVERY - Check systemd state on cp2001 is OK: OK - running: The system is fully operational [09:42:44] RECOVERY - Check systemd state on cp1073 is OK: OK - running: The system is fully operational [09:42:44] RECOVERY - Check systemd state on cp1068 is OK: OK - running: The system is fully operational [09:42:44] RECOVERY - Check systemd state on cp2022 is OK: OK - running: The system is fully operational [09:42:44] RECOVERY - Check systemd state on cp2018 is OK: OK - running: The system is fully operational [09:42:44] RECOVERY - Check systemd state on cp2025 is OK: OK - running: The system is fully operational [09:42:45] RECOVERY - Check systemd state on cp1066 is OK: OK - running: The system is fully operational [09:42:45] RECOVERY - Check systemd state on cp2004 is OK: OK - running: The system is fully operational [09:42:45] RECOVERY - Check systemd state on cp2011 is OK: OK - running: The system is fully operational [09:42:46] RECOVERY - Check systemd state on cp2008 is OK: OK - running: The system is fully operational [09:42:46] RECOVERY - Check systemd state on cp4029 is OK: OK - running: The system is fully operational [09:42:47] RECOVERY - Check systemd state on cp3049 is OK: OK - running: The system is fully operational [09:42:53] * 
vgutierrez the flooder :( [09:43:04] RECOVERY - Check systemd state on cp1049 is OK: OK - running: The system is fully operational [09:43:04] RECOVERY - Check systemd state on cp1045 is OK: OK - running: The system is fully operational [09:43:04] RECOVERY - Check systemd state on cp3035 is OK: OK - running: The system is fully operational [09:43:04] RECOVERY - Check systemd state on cp3030 is OK: OK - running: The system is fully operational [09:43:04] RECOVERY - Check systemd state on cp3045 is OK: OK - running: The system is fully operational [09:43:04] RECOVERY - Check systemd state on cp3037 is OK: OK - running: The system is fully operational [09:43:04] RECOVERY - Check systemd state on cp2010 is OK: OK - running: The system is fully operational [09:43:05] RECOVERY - Check systemd state on cp2002 is OK: OK - running: The system is fully operational [09:43:05] RECOVERY - Check systemd state on cp2020 is OK: OK - running: The system is fully operational [09:43:06] RECOVERY - Check systemd state on cp1063 is OK: OK - running: The system is fully operational [09:43:06] RECOVERY - Check systemd state on cp1053 is OK: OK - running: The system is fully operational [09:43:07] RECOVERY - Check systemd state on cp1050 is OK: OK - running: The system is fully operational [09:43:14] RECOVERY - Check systemd state on cp4025 is OK: OK - running: The system is fully operational [09:43:14] RECOVERY - Check systemd state on cp4024 is OK: OK - running: The system is fully operational [09:43:14] RECOVERY - Check systemd state on cp4030 is OK: OK - running: The system is fully operational [09:43:14] RECOVERY - Check systemd state on cp4027 is OK: OK - running: The system is fully operational [09:43:14] RECOVERY - Check systemd state on cp4022 is OK: OK - running: The system is fully operational [09:43:14] RECOVERY - Check systemd state on cp4032 is OK: OK - running: The system is fully operational [09:43:14] RECOVERY - Check systemd state on cp5012 is OK: OK - running: The system 
is fully operational [09:43:15] RECOVERY - Check systemd state on cp4031 is OK: OK - running: The system is fully operational [09:43:15] Recovery messages are nice! [09:43:34] RECOVERY - Check systemd state on cp5008 is OK: OK - running: The system is fully operational [09:43:34] RECOVERY - Check systemd state on cp5009 is OK: OK - running: The system is fully operational [09:43:34] RECOVERY - Check systemd state on cp5010 is OK: OK - running: The system is fully operational [09:43:34] RECOVERY - Check systemd state on cp5003 is OK: OK - running: The system is fully operational [09:43:34] RECOVERY - Check systemd state on cp5004 is OK: OK - running: The system is fully operational [09:43:34] RECOVERY - Check systemd state on cp5011 is OK: OK - running: The system is fully operational [09:43:34] RECOVERY - Check systemd state on cp5005 is OK: OK - running: The system is fully operational [09:43:35] RECOVERY - Check systemd state on cp5002 is OK: OK - running: The system is fully operational [09:46:01] elukey: better to know that nothing was broken on the first time [09:46:07] but still, not cool [09:46:50] vgutierrez: if you could see the amount of spam that I've caused in here it would probably make you feel better :) [09:47:23] there was an issue, you were on it and you fixed it [09:48:22] hmmm regarding what ema was mentioning, do we actually want dpkg triggering unit starts? [09:49:43] or maybe we should ship a policy-rc.d handling those? https://people.debian.org/~hmh/invokerc.d-policyrc.d-specification.txt [09:52:55] let's not bother with policy-rc.d, what are we trying to achieve, we only want to use the mtail command, but not run the service? [09:55:40] indeed [09:55:54] we don't want the default mtail.unit running [09:56:07] (in our cache nodes at least) [09:57:05] since all the cache hosts are jessie, let's simply use "systemctl mask mtail", then [09:58:30] see e.g.
how that's used in modules/graphite/manifests/init.pp [09:59:04] moritzm: well it's really a question of who should decide whether a service should be started or not. In my mind, that decision should be made by puppet, not by a package upgrade [09:59:43] yeah, but if you have made the decision to mask via puppet, the package update will follow that [10:01:00] there's also the case of packages that we want to be running, but that we don't necessarily want to be started upon upgrade (pybal comes to mind) [10:02:21] yup.. pybal is a tricky one, with BGP enabled, just starting pybal is going to affect the traffic [10:02:39] 10Operations, 10ops-ulsfo, 10Traffic: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050#4090724 (10fgiunchedi) FWIW I'm happy to assist with the Prometheus part, e.g. next week [10:03:15] PROBLEM - BGP status on cr1-eqord is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect, AS6939/IPv6: Connect [10:06:19] you mean "restarted upon upgrade"? [10:06:58] indeed [10:07:12] no I mean started [10:07:16] (upon upgrade) [10:07:24] RECOVERY - BGP status on cr1-eqord is OK: BGP OK - up: 62, down: 2, shutdown: 2 [10:07:51] systemctl stop pybal ; wait for traffic to failover, check things, ...
; upgrade pybal [10:08:01] the upgrade should not result in pybal being started [10:08:41] (logs for cr1-eqord - https://librenms.wikimedia.org/device/device=140/tab=logs/ - seemed one AS flapping) [10:11:52] (03PS7) 10Vgutierrez: mtail: Provide ttfb histogram for varnishbackend [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) [10:12:20] (03PS4) 10Ema: varnishxcps: remove python daemon [puppet] - 10https://gerrit.wikimedia.org/r/421338 (https://phabricator.wikimedia.org/T184942) [10:13:05] the auto-generated maintainer scripts (like the ones created by dh_systemd) use the Debian-specific wrapper which abides by policy-rc.d (deb-systemd-invoke) [10:13:57] but external packages often use maintainer scripts which use systemctl, so policy-rc.d isn't fully reliable [10:14:24] ideally systemd would gain such a scheme, but I don't know if that was proposed/discussed upstream [10:14:40] 10Operations, 10Traffic, 10media-storage, 10User-fgiunchedi: upload.wikimedia.org returns HTTP 501 instead of 416 for non-satisfiable byte ranges - https://phabricator.wikimedia.org/T147162#4090776 (10fgiunchedi) a:05fgiunchedi>03None Quite possible! Initially I thought it might have to do with thumbna... [10:16:35] (03CR) 10Ema: "> It's actually not so correct and now that I relook at my change," [puppet] - 10https://gerrit.wikimedia.org/r/422106 (owner: 10Alexandros Kosiaris) [10:20:03] (03CR) 10Vgutierrez: [C: 032] install_server: Reimage pybal-test2003 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/422897 (https://phabricator.wikimedia.org/T190993) (owner: 10Vgutierrez) [10:20:23] (03PS2) 10Vgutierrez: install_server: Reimage pybal-test2003 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/422897 (https://phabricator.wikimedia.org/T190993) [10:24:45] so those who can potentially start a service are: (1) admin (2) systemd on startup (3) puppet (4) debian package.
In the mtail.service case we want none of the above to start it, hence we can use systemctl mask and be done with it. In the pybal case (and possibly others) we want only (1) and (2) [10:29:03] 10Operations, 10Traffic, 10media-storage, 10User-fgiunchedi: Swift invalid range requests causing 501s - https://phabricator.wikimedia.org/T183902#4090814 (10fgiunchedi) [10:29:06] 10Operations, 10Traffic, 10media-storage, 10User-fgiunchedi: upload.wikimedia.org returns HTTP 501 instead of 416 for non-satisfiable byte ranges - https://phabricator.wikimedia.org/T147162#4090816 (10fgiunchedi) [10:30:13] 10Operations, 10Traffic, 10media-storage, 10User-fgiunchedi: Swift invalid range requests causing 501s - https://phabricator.wikimedia.org/T183902#3866584 (10fgiunchedi) As per related {T147162} this is likely a bug in our usage of `webob` in `rewrite.py`. We should be using `swob` which is what swift itse... [10:35:37] in the case of pybal the maintainer scripts are under our control, so we could fix them to do what we want [10:36:01] IIRC quagga also doesn't restart by default after updates (also to avoid BGP issues) [10:36:47] the maint script could detect the current runtime status and e.g. only start pybal if it was previously running [10:37:35] (03PS2) 10ArielGlenn: clean up all 'latest' links from most runs older than current run [dumps] - 10https://gerrit.wikimedia.org/r/421851 (https://phabricator.wikimedia.org/T189527) [10:50:18] !log restarting elastic@codfw for JVM and plugin upgrade (T189239) [10:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:25] T189239: Deploy initial version of the extra-analysis plugin - https://phabricator.wikimedia.org/T189239 [10:54:14] dcausse: do you need us to downtime some alarms?
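For the policy-rc.d idea floated above, the decision logic could look like the sketch below. It is written as a shell function so it can be exercised directly; the unit names are the ones discussed in the log, and the return codes (101 = action forbidden, 104 = action allowed) follow the Debian policy-rc.d specification linked at 09:49:43. This illustrates the mechanism, not any script actually deployed.

```shell
# Hypothetical /usr/sbin/policy-rc.d logic, expressed as a function.
# A real policy-rc.d is invoked by invoke-rc.d / deb-systemd-invoke
# roughly as: policy-rc.d <service> <action> [<runlevel>].
# Return codes per the spec: 101 = action forbidden by policy,
# 104 = no policy present, action allowed.
policy_rc_d() {
  service="$1"
  action="$2"
  case "$service" in
    mtail|pybal)
      # never let a package upgrade auto-start these units;
      # stop/status and other actions remain allowed
      if [ "$action" = "start" ]; then
        return 101
      fi
      ;;
  esac
  return 104
}
```

As noted at 10:13:57, this only constrains maintainer scripts that go through deb-systemd-invoke/invoke-rc.d; packages whose maintainer scripts call systemctl directly bypass it, which is why masking the unit is the more robust option for mtail.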
[10:54:36] (03PS1) 10Volans: Puppetmaster: store reports also in puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/422907 (https://phabricator.wikimedia.org/T190918) [10:54:48] volans: if you can downtime the Shard check on search.svc.codfw that'd be great :) [10:55:00] sure, how long? [10:55:08] 1 day I think [10:55:13] (03PS1) 10Muehlenhoff: Extend account expiry date for groovier [puppet] - 10https://gerrit.wikimedia.org/r/422908 [10:56:44] dcausse: ElasticSearch health check for shards on search.svc.codfw.wmnet downtimed for 24h with a link to your task [10:57:04] (03CR) 10Muehlenhoff: [C: 032] Extend account expiry date for groovier [puppet] - 10https://gerrit.wikimedia.org/r/422908 (owner: 10Muehlenhoff) [10:57:06] volans: thanks! [10:57:37] yw :) [11:01:10] (03CR) 10ArielGlenn: [C: 032] clean up all 'latest' links from most runs older than current run [dumps] - 10https://gerrit.wikimedia.org/r/421851 (https://phabricator.wikimedia.org/T189527) (owner: 10ArielGlenn) [11:02:03] !log ariel@tin Started deploy [dumps/dumps@96ba844]: cleanup 'latest' links, rss files from old runs [11:02:07] !log ariel@tin Finished deploy [dumps/dumps@96ba844]: cleanup 'latest' links, rss files from old runs (duration: 00m 04s) [11:02:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:01] (03PS1) 10Vgutierrez: prometheus: varnish_x_cache rate for the last 2m [puppet] - 10https://gerrit.wikimedia.org/r/422910 (https://phabricator.wikimedia.org/T184942) [11:06:40] (03PS3) 10BBlack: eqsin: turn-up India [dns] - 10https://gerrit.wikimedia.org/r/422395 (https://phabricator.wikimedia.org/T189252) [11:10:50] (03CR) 10Volans: "Compiler results:" [puppet] - 10https://gerrit.wikimedia.org/r/422907 (https://phabricator.wikimedia.org/T190918) (owner: 10Volans) [11:13:20] (03PS1) 10Elukey: role::configcluster_stretch: add IPv6 static addresses [puppet] - 
10https://gerrit.wikimedia.org/r/422911 (https://phabricator.wikimedia.org/T166081) [11:13:26] -1 following [11:13:45] (03CR) 10jerkins-bot: [V: 04-1] role::configcluster_stretch: add IPv6 static addresses [puppet] - 10https://gerrit.wikimedia.org/r/422911 (https://phabricator.wikimedia.org/T166081) (owner: 10Elukey) [11:15:20] 10Operations, 10Puppet, 10Patch-For-Review: Puppet: enable reports to puppetdb - https://phabricator.wikimedia.org/T190918#4090897 (10Volans) From the quick test I've made yesterday enabling reporting also to puppetdb for some minutes, I got ~200 hosts reported and showing data in Puppetboard, I didn't notic... [11:19:52] 10Operations, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#4090905 (10elukey) Before proceeding I'd wait for @Joe's confirmation. I'd like to: 1) add static IPv6 addresses to conf100[456] with https://... [11:20:41] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366#1824561 (10Aklapper) [11:22:39] (03CR) 10Filippo Giunchedi: "LGTM, better to merge next week IMHO" [puppet] - 10https://gerrit.wikimedia.org/r/422907 (https://phabricator.wikimedia.org/T190918) (owner: 10Volans) [11:23:33] (03CR) 10Volans: [C: 04-2] "Agreed on Tue." [puppet] - 10https://gerrit.wikimedia.org/r/422907 (https://phabricator.wikimedia.org/T190918) (owner: 10Volans) [11:24:17] 10Operations, 10Puppet, 10Patch-For-Review: Puppet: enable reports to puppetdb - https://phabricator.wikimedia.org/T190918#4090942 (10Volans) The plan as of now is to enable it on next Tuesday, to avoid issues in the long weekend. 
[11:26:08] (03CR) 10BBlack: [C: 032] eqsin: turn-up India [dns] - 10https://gerrit.wikimedia.org/r/422395 (https://phabricator.wikimedia.org/T189252) (owner: 10BBlack) [11:29:24] PROBLEM - BGP status on cr1-eqord is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect, AS6939/IPv6: Connect [11:33:34] RECOVERY - BGP status on cr1-eqord is OK: BGP OK - up: 62, down: 2, shutdown: 2 [11:38:34] PROBLEM - BGP status on cr1-eqord is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect, AS6939/IPv4: Connect [11:42:35] RECOVERY - BGP status on cr1-eqord is OK: BGP OK - up: 62, down: 2, shutdown: 2 [11:52:25] (03Abandoned) 10Vgutierrez: prometheus: varnish_x_cache rate for the last 2m [puppet] - 10https://gerrit.wikimedia.org/r/422910 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [12:14:40] (03PS3) 10ArielGlenn: Add ability to skip recombine of meta-current page content, per project [dumps] - 10https://gerrit.wikimedia.org/r/421858 (https://phabricator.wikimedia.org/T179059) [12:17:05] (03CR) 10ArielGlenn: [C: 032] Add ability to skip recombine of meta-current page content, per project [dumps] - 10https://gerrit.wikimedia.org/r/421858 (https://phabricator.wikimedia.org/T179059) (owner: 10ArielGlenn) [12:18:37] !log ariel@tin Started deploy [dumps/dumps@982cebd]: ability to configure production of recombined metacurrent page content file [12:18:40] !log ariel@tin Finished deploy [dumps/dumps@982cebd]: ability to configure production of recombined metacurrent page content file (duration: 00m 02s) [12:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:38] (03PS1) 10ArielGlenn: turn off production of single meta-current page content dump for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/422914 (https://phabricator.wikimedia.org/T179059) [12:29:03] !log recreating replicas for skwiki_content in elastic@codfw due to stalled shard recovery 
[12:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:24] 10Operations, 10Traffic: Unwanted service startups and their triggers - https://phabricator.wikimedia.org/T191017#4091056 (10ema) [12:33:44] 10Operations, 10Traffic: Unwanted service startups and their triggers - https://phabricator.wikimedia.org/T191017#4091066 (10ema) p:05Triage>03Normal [12:34:15] 10Operations, 10User-fgiunchedi: Provide a pxe-bootable rescue image - https://phabricator.wikimedia.org/T78135#4091069 (10fgiunchedi) a:05fgiunchedi>03None [12:36:15] (03CR) 10ArielGlenn: [C: 032] turn off production of single meta-current page content dump for wikidata [puppet] - 10https://gerrit.wikimedia.org/r/422914 (https://phabricator.wikimedia.org/T179059) (owner: 10ArielGlenn) [12:36:49] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4091075 (10Cwek) Can I ask something? How to measure where traffic should route through? latency? I suggest a website... 
[12:38:29] 10Operations, 10Tracking: Provide an option menu when booting via PXE - https://phabricator.wikimedia.org/T191018#4091076 (10fgiunchedi) p:05Triage>03Normal [12:54:32] (03CR) 10D3r1ck01: "Some minor trailings to be removed :)" (034 comments) [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/421011 (https://phabricator.wikimedia.org/T179059) (owner: 10ArielGlenn) [12:54:37] !log installing ICU security updates on trusty [12:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:34] (03CR) 10D3r1ck01: command to recombine page content xml files into one (031 comment) [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/421011 (https://phabricator.wikimedia.org/T179059) (owner: 10ArielGlenn) [12:55:45] 10Operations, 10Patch-For-Review: Standardizing our partman recipes - https://phabricator.wikimedia.org/T156955#4091108 (10fgiunchedi) Another consideration that has emerged from the ops mini offsite in SF in Jan 2018 is that partman recipes will likely need to take into account vendor differences too. [12:55:53] (03CR) 10Alexandros Kosiaris: [C: 04-2] "> In light of this, how does one properly get rid of an nrpe::monitor_service? I'm tempted of setting nrpe_command to /bin/true and ensure" [puppet] - 10https://gerrit.wikimedia.org/r/422106 (owner: 10Alexandros Kosiaris) [12:56:46] (03PS1) 10Niedzielski: Update: SSH public key for niedzielski [puppet] - 10https://gerrit.wikimedia.org/r/422918 [13:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180329T1300). [13:00:05] dcausse: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:43] o/ [13:01:40] o/ [13:02:00] dcausse: you are not a deployer, right? [13:02:04] (I can SWAT today) [13:02:22] zeljkof: I can swat my patch if you want [13:02:34] dcausse: oh, in that case, go ahead :) [13:02:42] ok deploying [13:03:09] great, swat is all yours, feel free to close it with !log EU SWAT finished [13:03:23] (03PS1) 10Elukey: profile::zookeeper:server: add the support for prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/422920 (https://phabricator.wikimedia.org/T177460) [13:04:17] zeljkof: just to clarify is tin still the deploy host (deployment.eqiad.wmnet)? [13:04:23] Question about puppet coding -- can I use $::site in a template, if I want to have the template be different in each data center? [13:04:54] dcausse: as far as I know [13:04:58] ok [13:05:03] (03PS1) 10Andrew Bogott: labtestweb: correct wikitech_nova_ldap_proxyagent_pass [puppet] - 10https://gerrit.wikimedia.org/r/422921 (https://phabricator.wikimedia.org/T190727) [13:05:11] Try :) [13:05:45] I've seen mails so I was not 100% sure [13:06:25] and group1 is still on wmf.26 [13:06:29] (03PS2) 10Elukey: profile::zookeeper:server: add the support for prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/422920 (https://phabricator.wikimedia.org/T177460) [13:07:20] 10Operations, 10Patch-For-Review: Standardizing our partman recipes - https://phabricator.wikimedia.org/T156955#4091128 (10fgiunchedi) Auditing of partman recipe usage below. The easiest to tackle for standardization are stateless hosts and hosts where state is relatively simple to handle (e.g. in `/srv`, like... 
[13:08:21] (03PS1) 10Andrew Bogott: labtestweb: add an additional dummy var [labs/private] - 10https://gerrit.wikimedia.org/r/422922 [13:08:33] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 11 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#4091129 (10kchapman) @Tgr we are just putting it in the Declined TechCom-RFC workboard, not in Phabricator as a whole. For reference, this is how we app... [13:10:52] Pchelolo: if you're around how did you check errors in T190958 (in logstash?) [13:10:53] T190958: CirrusSearchCheckerJob should have a title - https://phabricator.wikimedia.org/T190958 [13:10:55] (03PS3) 10Elukey: profile::zookeeper:server: add the support for prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/422920 (https://phabricator.wikimedia.org/T177460) [13:11:38] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4091163 (10Nemo_bis) >>! In T189252#4091075, @Cwek wrote: > Can I ask something? > How to measure where traffic shoul... 
[13:12:13] 10Operations, 10Goal, 10User-Elukey, 10User-fgiunchedi: Export Prometheus-compatible JVM metrics from JVMs in production - https://phabricator.wikimedia.org/T177197#4091166 (10fgiunchedi)
[13:12:37] 10Operations, 10Goal, 10User-Elukey, 10User-fgiunchedi: Export Prometheus-compatible JVM metrics from JVMs in production - https://phabricator.wikimedia.org/T177197#4091167 (10elukey)
[13:12:39] (03PS2) 10ArielGlenn: command to recombine page content xml files into one [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/421011 (https://phabricator.wikimedia.org/T179059)
[13:13:06] dcausse: there's a JobQueue-EventBus dashboard, I just filter out known stuff after a deployment and see what's left
[13:13:33] thanks
[13:13:49] (03PS2) 10Muehlenhoff: Update: SSH public key for niedzielski [puppet] - 10https://gerrit.wikimedia.org/r/422918 (owner: 10Niedzielski)
[13:14:34] 10Operations, 10Goal, 10User-Elukey, 10User-fgiunchedi: Export Prometheus-compatible JVM metrics from JVMs in production - https://phabricator.wikimedia.org/T177197#4091175 (10fgiunchedi)
[13:14:49] (03PS4) 10Elukey: profile::zookeeper:server: add the support for prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/422920 (https://phabricator.wikimedia.org/T177460)
[13:15:00] ok I see "Failed creating job from description" in this dashboard
[13:15:03] (03CR) 10Muehlenhoff: [C: 032] Update: SSH public key for niedzielski [puppet] - 10https://gerrit.wikimedia.org/r/422918 (owner: 10Niedzielski)
[13:15:33] (03PS1) 10Imarlier: coal: Need to use a unique consumer group in each data center [puppet] - 10https://gerrit.wikimedia.org/r/422926 (https://phabricator.wikimedia.org/T110903)
[13:16:14] (03CR) 10ArielGlenn: command to recombine page content xml files into one (035 comments) [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/421011 (https://phabricator.wikimedia.org/T179059) (owner: 10ArielGlenn)
[13:19:36] (03CR) 10Elukey: "pcc https://puppet-compiler.wmflabs.org/compiler03/10730/" [puppet] - 10https://gerrit.wikimedia.org/r/422920 (https://phabricator.wikimedia.org/T177460) (owner: 10Elukey)
[13:21:20] (03CR) 10Imarlier: "https://puppet-compiler.wmflabs.org/compiler03/10729/ - consumer group variable ends up being set correctly in both eqiad and codfw" [puppet] - 10https://gerrit.wikimedia.org/r/422926 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier)
[13:21:54] (03CR) 10Elukey: [C: 031] ""each site gets all the data present on kafka in its dc", if we are ok about this sentence the change looks good :)" [puppet] - 10https://gerrit.wikimedia.org/r/422926 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier)
[13:22:40] (03CR) 10Imarlier: "> "each site gets all the data present on kafka in its dc", if we are" [puppet] - 10https://gerrit.wikimedia.org/r/422926 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier)
[13:28:43] scap is slow... (waiting on sync-masters, ok: 1, left: 1)
[13:28:50] (03PS1) 10Alexandros Kosiaris: Don't error out on interface::add_ip6_mapped on node level [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/422927
[13:29:05] (03CR) 10jerkins-bot: [V: 04-1] Don't error out on interface::add_ip6_mapped on node level [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/422927 (owner: 10Alexandros Kosiaris)
[13:29:14] akosiaris: thanksssssss
[13:29:25] (03PS3) 10Elukey: eventlogging: move alarms from graphite to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/422135 (https://phabricator.wikimedia.org/T114199)
[13:29:27] (03CR) 10Alexandros Kosiaris: "Indeed. Let's see if we can fix that. https://gerrit.wikimedia.org/r/#/c/422927/" [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn)
[13:29:39] (03CR) 10Ottomata: [C: 031] profile::zookeeper:server: add the support for prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/422920 (https://phabricator.wikimedia.org/T177460) (owner: 10Elukey)
[13:30:04] (03CR) 10jerkins-bot: [V: 04-1] eventlogging: move alarms from graphite to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/422135 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey)
[13:31:05] I don't run the test suite once and I get a -1, lovely :D
[13:31:08] zeljkof: any idea why scap is so slow: more than 5 minutes in sync-masters: 50% (ok: 1; fail: 0; left: 1)
[13:31:14] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1001 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging
[13:31:29] ?? ^
[13:31:32] dcausse: hm, I don't know :(
[13:31:46] (03PS2) 10Alexandros Kosiaris: Don't error out on interface::add_ip6_mapped on node level [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/422927
[13:31:54] (03PS2) 10Elukey: coal: Need to use a unique consumer group in each data center [puppet] - 10https://gerrit.wikimedia.org/r/422926 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier)
[13:31:54] robh: can you take a look?
[13:32:04] > icinga-wm> IRC echo bot PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1001 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging
[13:32:34] dcausse: looks like they made some changes after all, something seems to be wrong
[13:32:41] hm..
[13:32:49] Should I Ctrl-C scap?
[13:33:06] (03CR) 10Elukey: [C: 032] coal: Need to use a unique consumer group in each data center [puppet] - 10https://gerrit.wikimedia.org/r/422926 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier)
[13:33:12] dcausse: it is way too much time...
[13:33:28] ctrl-c might be a good idea
[13:33:35] ok killing
[13:33:35] but not sure how to continue
[13:34:49] !log aborted scap sync-dir php-1.31.0-wmf.27/extensions/CirrusSearch/ (was taking too much time at: waiting on sync-masters, ok: 1, left: 1)
[13:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:59] the deployment server is tin
[13:36:04] zeljkof: what should I do? revert my patch
[13:36:21] (03PS4) 10Elukey: eventlogging: move alarms from graphite to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/422135 (https://phabricator.wikimedia.org/T114199)
[13:36:21] volans: I'm on tin but scap was taking too much time syncing masters
[13:36:26] dcausse: I don't know, I don't think I had a similar situation
[13:36:42] hashar: are you around? some help with scap problems would be great :)
[13:36:47] and then I saw this alter about deploy1001
[13:36:58] twentyafterfour: in case you are awake :) ^
[13:37:00] dcausse: ack, i saw mentioned that deploy1001 was having issue
[13:37:03] (03CR) 10Elukey: "Nuria: changes applied, let's chat if merge or not :)" [puppet] - 10https://gerrit.wikimedia.org/r/422135 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey)
[13:37:43] so deploy1001 was reimaged back to jessie yesterday, according to Daniel's email
[13:37:44] PROBLEM - Host cp2017 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:30] dcausse: I'm wondering if scap is trying to do something on deploy1001 and failing because of something missing
[13:38:42] ema, vgutierrez ^^^ FYI (cp2017)
[13:38:50] yes that seems related
[13:39:03] papaul: is it you working on cp2017?
[13:39:15] I think I'll revert and rebase tin so that mediawiki-staging is in line with what's deployed
[13:41:15] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1001 is OK: Files ownership is ok.
[13:41:54] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:41:55] PROBLEM - IPsec on kafka-jumbo1003 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6
[13:41:55] PROBLEM - IPsec on kafka-jumbo1002 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6
[13:42:04] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:05] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:14] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:14] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:14] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6
[13:42:14] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6
[13:42:15] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:15] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:15] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:15] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:24] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:24] PROBLEM - IPsec on kafka-jumbo1006 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6
[13:42:24] PROBLEM - IPsec on kafka1023 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6
[13:42:24] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:24] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:25] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6
[13:42:25] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:25] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:34] PROBLEM - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6
[13:42:34] PROBLEM - IPsec on kafka-jumbo1001 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6
[13:42:35] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:35] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:44] PROBLEM - IPsec on cp5005 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:44] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6
[13:42:44] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:44] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:44] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6
[13:42:45] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:45] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:45] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:45] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:54] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:54] PROBLEM - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6
[13:42:54] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:54] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:54] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:54] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:55] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:55] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:55] PROBLEM - IPsec on cp5002 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:56] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:56] PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:42:57] PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[13:44:07] (03PS2) 10Andrew Bogott: labtestweb: correct wikitech_nova_ldap_proxyagent_pass [puppet] - 10https://gerrit.wikimedia.org/r/422921 (https://phabricator.wikimedia.org/T190727)
[13:44:15] (03PS2) 10Andrew Bogott: labtestweb: add an additional dummy var [labs/private] - 10https://gerrit.wikimedia.org/r/422922
[13:44:25] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 66 ESP OK
[13:44:25] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 66 ESP OK
[13:44:25] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 136 ESP OK
[13:44:27] (03CR) 10Andrew Bogott: [V: 032 C: 032] labtestweb: add an additional dummy var [labs/private] - 10https://gerrit.wikimedia.org/r/422922 (owner: 10Andrew Bogott)
[13:44:34] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 66 ESP OK
[13:44:34] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 66 ESP OK
[13:44:34] RECOVERY - Host cp2017 is UP: PING OK - Packet loss = 0%, RTA = 36.04 ms
[13:44:34] RECOVERY - IPsec on kafka-jumbo1001 is OK: Strongswan OK - 136 ESP OK
[13:44:35] RECOVERY - IPsec on kafka-jumbo1004 is OK: Strongswan OK - 136 ESP OK
[13:44:44] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 136 ESP OK
[13:44:44] RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 66 ESP OK
[13:44:44] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 66 ESP OK
[13:44:44] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 66 ESP OK
[13:44:44] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 66 ESP OK
[13:44:44] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 136 ESP OK
[13:44:44] RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 66 ESP OK
[13:44:45] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 66 ESP OK
[13:44:45] RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 66 ESP OK
[13:44:52] (03CR) 10Andrew Bogott: [C: 032] labtestweb: correct wikitech_nova_ldap_proxyagent_pass [puppet] - 10https://gerrit.wikimedia.org/r/422921 (https://phabricator.wikimedia.org/T190727) (owner: 10Andrew Bogott)
[13:44:54] RECOVERY - IPsec on kafka-jumbo1005 is OK: Strongswan OK - 136 ESP OK
[13:44:54] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 66 ESP OK
[13:44:54] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 66 ESP OK
[13:44:54] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 66 ESP OK
[13:44:54] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 66 ESP OK
[13:44:54] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 66 ESP OK
[13:44:54] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 66 ESP OK
[13:44:55] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 66 ESP OK
[13:44:55] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 66 ESP OK
[13:44:56] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 66 ESP OK
[13:44:56] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 66 ESP OK
[13:44:57] RECOVERY - IPsec on kafka-jumbo1003 is OK: Strongswan OK - 136 ESP OK
[13:45:14] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 66 ESP OK
[13:45:14] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 66 ESP OK
[13:45:14] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 66 ESP OK
[13:45:15] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 136 ESP OK
[13:45:15] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 136 ESP OK
[13:45:15] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 66 ESP OK
[13:45:16] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 66 ESP OK
[13:45:16] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 66 ESP OK
[13:45:16] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 66 ESP OK
[13:45:24] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 66 ESP OK
[13:45:24] RECOVERY - IPsec on kafka-jumbo1006 is OK: Strongswan OK - 136 ESP OK
[13:45:24] RECOVERY - IPsec on kafka1023 is OK: Strongswan OK - 136 ESP OK
[13:52:06] !log reverted and rebased tin for undeployed patch due to scap issues (https://gerrit.wikimedia.org/r/#/c/422906/ https://gerrit.wikimedia.org/r/#/c/422929/)
[13:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:44] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 648 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 114293 keys, up 11 hours 34 minutes - replication_delay is 648
[13:54:06] (03PS3) 10Elukey: profile::restbase: add sysctl settings to improve tcp performance [puppet] - 10https://gerrit.wikimedia.org/r/421901 (https://phabricator.wikimedia.org/T190213)
[13:55:14] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 626 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 114072 keys, up 11 hours 33 minutes - replication_delay is 626
[13:55:34] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 651 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 114156 keys, up 11 hours 39 minutes - replication_delay is 651
[13:57:49] don't recall if this is the time that codfw restarts
[13:58:23] # Puppet Name: jobqueue-redis-conditional-restart
[13:58:24] 0 2 * * * /usr/local/bin/restart-redis-if-slave 6378 6379 6380 6381 6478 6479 6480 6481
[14:00:17] the slave is trying to sync with the master but the connection drops after a bit, should be the "usual" issue of the buffer, last time it cleared out by itself
[14:00:39] !log restarting parsoid and related service on ruthenium to pick up openssl update
[14:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:44] (03PS1) 10Nikerabbit: Update SSH key for nikerabbit [puppet] - 10https://gerrit.wikimedia.org/r/422932
[14:03:15] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1109 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 114072 keys, up 11 hours 41 minutes - replication_delay is 1109
[14:05:14] (03CR) 10Herron: [C: 031] "sounds like a plan! let's keep an eye on the command queue depth of both puppetdb servers during/after this is merged on Tuesday. It can " [puppet] - 10https://gerrit.wikimedia.org/r/422907 (https://phabricator.wikimedia.org/T190918) (owner: 10Volans)
[14:10:24] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1536 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 114072 keys, up 11 hours 48 minutes - replication_delay is 1536
[14:11:45] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 114136 keys, up 11 hours 53 minutes - replication_delay is 0
[14:13:04] PROBLEM - Host cp2017 is DOWN: PING CRITICAL - Packet loss = 100%
[14:13:45] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 113905 keys, up 11 hours 58 minutes - replication_delay is 0
[14:14:01] (03PS6) 10Ema: WIP: VCL: improve handling of uncacheable responses [puppet] - 10https://gerrit.wikimedia.org/r/421542 (https://phabricator.wikimedia.org/T180712)
[14:14:09] goood redis, thanks
[14:14:34] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 113843 keys, up 11 hours 52 minutes - replication_delay is 0
[14:14:54] RECOVERY - Host cp2017 is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms
[14:17:04] PROBLEM - Host cp2017 is DOWN: PING CRITICAL - Packet loss = 100%
[14:18:44] (03PS2) 10Imarlier: wmf-config: Enable oversampling for remaining countries in Asia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422419 (https://phabricator.wikimedia.org/T189252)
[14:19:24] (03CR) 10Imarlier: wmf-config: Enable oversampling for remaining countries in Asia (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422419 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier)
[14:21:54] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6
[14:21:54] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6
[14:21:55] PROBLEM - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6
[14:22:04] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:04] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:04] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:05] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:05] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:13] 10Operations, 10Release-Engineering-Team, 10Scap: Scap stalled at sync-masters, ok: 1, left: 1 - https://phabricator.wikimedia.org/T191029#4091403 (10dcausse)
[14:22:14] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:14] PROBLEM - IPsec on kafka-jumbo1003 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6
[14:22:14] PROBLEM - IPsec on kafka-jumbo1002 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6
[14:22:14] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:14] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:14] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:14] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:15] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:15] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:16] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:16] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:17] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:34] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:34] PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 64 connecting: cp2017_v6 not-conn: cp2017_v4
[14:22:34] PROBLEM - IPsec on kafka1023 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6
[14:22:35] PROBLEM - IPsec on kafka-jumbo1006 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6
[14:22:35] PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 64 connecting: cp2017_v6 not-conn: cp2017_v4
[14:22:35] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 64 connecting: cp2017_v6 not-conn: cp2017_v4
[14:22:35] PROBLEM - IPsec on cp5002 is CRITICAL: Strongswan CRITICAL - ok: 64 connecting: cp2017_v6 not-conn: cp2017_v4
[14:22:44] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:44] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6
[14:22:44] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:44] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:44] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:44] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:45] PROBLEM - IPsec on kafka-jumbo1001 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6
[14:22:45] PROBLEM - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2017_v4, cp2017_v6
[14:22:45] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:54] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:54] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:22:54] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2017_v4, cp2017_v6
[14:27:29] (03PS1) 10Muehlenhoff: Update Cumin alias for notebook/swap [puppet] - 10https://gerrit.wikimedia.org/r/422937
[14:28:29] (03CR) 10Muehlenhoff: [C: 032] Update Cumin alias for notebook/swap [puppet] - 10https://gerrit.wikimedia.org/r/422937 (owner: 10Muehlenhoff)
[14:30:26] (03PS1) 10Ottomata: Use promethues based alert rather than burrow lag check alert [puppet] - 10https://gerrit.wikimedia.org/r/422939 (https://phabricator.wikimedia.org/T189611)
[14:30:31] elukey: ^
[14:30:52] (03PS1) 10Dzahn: Revert "Revert "remove deploy1001 from dsh hosts and scap masters"" [puppet] - 10https://gerrit.wikimedia.org/r/422940 (https://phabricator.wikimedia.org/T191029)
[14:31:15] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "remove deploy1001 from dsh hosts and scap masters"" [puppet] - 10https://gerrit.wikimedia.org/r/422940 (https://phabricator.wikimedia.org/T191029) (owner: 10Dzahn)
[14:31:54] (03CR) 10Elukey: [C: 031] Use promethues based alert rather than burrow lag check alert [puppet] - 10https://gerrit.wikimedia.org/r/422939 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata)
[14:32:21] (03PS2) 10Ottomata: Use promethues based alert rather than burrow lag check alert [puppet] - 10https://gerrit.wikimedia.org/r/422939 (https://phabricator.wikimedia.org/T189611)
[14:33:17] (03PS7) 10Ema: WIP: VCL: improve handling of uncacheable responses [puppet] - 10https://gerrit.wikimedia.org/r/421542 (https://phabricator.wikimedia.org/T180712)
[14:37:45] (03CR) 10Ottomata: [C: 032] Use promethues based alert rather than burrow lag check alert [puppet] - 10https://gerrit.wikimedia.org/r/422939 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata)
[14:40:02] (03PS2) 10Ppchelko: Stop reading refreshLinks jobs from the Redis queue. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416942 (https://phabricator.wikimedia.org/T185052)
[14:40:34] (03CR) 10Ppchelko: "Rebased on master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416942 (https://phabricator.wikimedia.org/T185052) (owner: 10Ppchelko)
[14:48:52] (03PS1) 10Dzahn: remove deploy1001 from scap masters, keep in scap hosts [puppet] - 10https://gerrit.wikimedia.org/r/422941 (https://phabricator.wikimedia.org/T191029)
[14:50:57] (03CR) 10Mobrovac: [C: 032] Stop reading refreshLinks jobs from the Redis queue. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416942 (https://phabricator.wikimedia.org/T185052) (owner: 10Ppchelko)
[14:52:00] (03PS2) 10Dzahn: remove deploy1001 from scap masters, keep in scap hosts [puppet] - 10https://gerrit.wikimedia.org/r/422941 (https://phabricator.wikimedia.org/T191029)
[14:52:25] (03Merged) 10jenkins-bot: Stop reading refreshLinks jobs from the Redis queue. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416942 (https://phabricator.wikimedia.org/T185052) (owner: 10Ppchelko)
[14:52:39] PROBLEM - Host cp2021 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:39] (03CR) 10jenkins-bot: Stop reading refreshLinks jobs from the Redis queue. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416942 (https://phabricator.wikimedia.org/T185052) (owner: 10Ppchelko)
[14:53:18] (03CR) 10Dzahn: [C: 032] remove deploy1001 from scap masters, keep in scap hosts [puppet] - 10https://gerrit.wikimedia.org/r/422941 (https://phabricator.wikimedia.org/T191029) (owner: 10Dzahn)
[14:53:41] (03PS1) 10Elukey: profile::kafka::mirror:alerts: fix prometheus query [puppet] - 10https://gerrit.wikimedia.org/r/422944
[14:54:55] !log mobrovac@tin Synchronized wmf-config/jobqueue.php: Cleanup: Use only EventBus for refreshLinks - T185052 (duration: 01m 18s)
[14:54:59] haha
[14:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:01] T185052: Migrate RefreshLinks job to kafka - https://phabricator.wikimedia.org/T185052
[14:55:03] (03PS1) 10Ottomata: Fix consuer max lag check query [puppet] - 10https://gerrit.wikimedia.org/r/422945 (https://phabricator.wikimedia.org/T189611)
[14:55:07] elu you do yours
[14:55:27] (03Abandoned) 10Ottomata: Fix consuer max lag check query [puppet] - 10https://gerrit.wikimedia.org/r/422945 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata)
[14:55:32] elukey: merging yours
[14:55:33] (03CR) 10Elukey: [C: 032] profile::kafka::mirror:alerts: fix prometheus query [puppet] - 10https://gerrit.wikimedia.org/r/422944 (owner: 10Elukey)
[14:55:37] (03CR) 10Ottomata: [C: 032] profile::kafka::mirror:alerts: fix prometheus query [puppet] - 10https://gerrit.wikimedia.org/r/422944 (owner: 10Elukey)
[14:55:39] haha
[14:55:39] (03PS2) 10Elukey: profile::kafka::mirror:alerts: fix prometheus query [puppet] - 10https://gerrit.wikimedia.org/r/422944
[14:55:41] (03PS3) 10Ottomata: profile::kafka::mirror:alerts: fix prometheus query [puppet] - 10https://gerrit.wikimedia.org/r/422944 (owner: 10Elukey)
[14:55:42] ahhaha
[14:55:43] hahahah
[14:55:48] (03CR) 10Ottomata: [V: 032 C: 032] profile::kafka::mirror:alerts: fix prometheus query [puppet] - 10https://gerrit.wikimedia.org/r/422944 (owner: 10Elukey)
[14:55:48] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 64 connecting: cp2017_v4, cp2017_v6
[14:55:52] BAM I DID IT
[14:55:55] ahhahah
[14:58:38] RECOVERY - Host cp2021 is UP: PING OK - Packet loss = 0%, RTA = 36.29 ms
[15:01:24] (03PS1) 10Muehlenhoff: Update SSH key for dbrant [puppet] - 10https://gerrit.wikimedia.org/r/422946
[15:03:12] (03PS5) 10Elukey: profile::zookeeper:server: add the support for prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/422920 (https://phabricator.wikimedia.org/T177460)
[15:03:41] (03CR) 10Ema: [C: 031] Fix dummy metrics implementation [debs/pybal] - 10https://gerrit.wikimedia.org/r/421847 (https://phabricator.wikimedia.org/T190527) (owner: 10Vgutierrez)
[15:09:02] RECOVERY - IPsec on kafka1023 is OK: Strongswan OK - 136 ESP OK
[15:09:02] RECOVERY - IPsec on kafka-jumbo1006 is OK: Strongswan OK - 136 ESP OK
[15:09:02] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 66 ESP OK
[15:09:02] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 66 ESP OK
[15:09:03] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 66 ESP OK
[15:09:03] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 136 ESP OK
[15:09:12] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 66 ESP OK
[15:09:13] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 66 ESP OK
[15:09:13] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 66 ESP OK
[15:09:22] RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 66 ESP OK
[15:09:22] RECOVERY - IPsec on kafka-jumbo1001 is OK: Strongswan OK - 136 ESP OK
[15:09:22] RECOVERY - IPsec on kafka-jumbo1004 is OK: Strongswan OK - 136 ESP OK
[15:09:22] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 136 ESP OK
[15:09:22] RECOVERY - Host cp2017 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms
[15:09:22] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 136 ESP OK
[15:09:23] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 66 ESP OK
[15:09:23] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 66 ESP OK
[15:09:23] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 66 ESP OK
[15:09:24] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 66 ESP OK
[15:09:24] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 66 ESP OK
[15:09:25] RECOVERY - IPsec on kafka-jumbo1005 is OK: Strongswan OK - 136 ESP OK
[15:09:33] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 66 ESP OK
[15:09:33] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 66 ESP OK
[15:09:33] RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 66 ESP OK
[15:09:33] RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 66 ESP OK
[15:09:33] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 66 ESP OK
[15:09:33] RECOVERY - IPsec on cp5002 is OK: Strongswan OK - 66 ESP OK
[15:09:33] RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 66 ESP OK
[15:09:42] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 66 ESP OK
[15:09:42] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 66 ESP OK
[15:09:42] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 66 ESP OK
[15:09:42] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 66 ESP OK
[15:09:43] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 66 ESP OK
[15:09:43] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 66 ESP OK
[15:09:43] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 66 ESP OK
[15:09:43] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 136 ESP OK
[15:09:43] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 136 ESP OK
[15:09:52] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 66 ESP OK
[15:09:53] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 66 ESP OK
[15:09:53] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 66 ESP OK
[15:09:53] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 66 ESP OK
[15:09:53] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 66 ESP OK
[15:13:32] RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 66 ESP OK
[15:14:02] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 66 ESP OK
[15:14:12] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4091592 (10ayounsi) >>! In T189252#4091075, @Cwek wrote: > How to measure where traffic should route through? latency... [15:15:22] RECOVERY - IPsec on kafka-jumbo1002 is OK: Strongswan OK - 136 ESP OK [15:17:23] RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 66 ESP OK [15:18:42] RECOVERY - IPsec on kafka-jumbo1003 is OK: Strongswan OK - 136 ESP OK [15:22:29] (03PS2) 10Vgutierrez: Fix dummy metrics implementation [debs/pybal] - 10https://gerrit.wikimedia.org/r/421847 (https://phabricator.wikimedia.org/T190527) [15:24:50] (03CR) 10Vgutierrez: [C: 032] Fix dummy metrics implementation [debs/pybal] - 10https://gerrit.wikimedia.org/r/421847 (https://phabricator.wikimedia.org/T190527) (owner: 10Vgutierrez) [15:25:43] 10Operations, 10User-Elukey: Sporadic logrotate issue for stretch mediawiki appservers - https://phabricator.wikimedia.org/T185195#4091621 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff I ran some tests and can confirm that PrivateTmp=true is the culprit. I haven't yet figured why it breaks, but i'll have a clo... [15:26:33] PROBLEM - Host cp2021 is DOWN: PING CRITICAL - Packet loss = 100% [15:27:03] 10Operations, 10User-Elukey: Apache reload fails on stretch-based app servers - https://phabricator.wikimedia.org/T185195#4091623 (10MoritzMuehlenhoff) [15:32:02] 10Operations, 10Phabricator, 10RelEng-Archive-FY201718-Q1: reinstall iridium (phabricator) as phab1001 with jessie - https://phabricator.wikimedia.org/T152129#4091629 (10RobH) [15:32:04] 10Operations, 10ops-eqiad, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#4091627 (10RobH) 05Open>03Resolved >>! In T172487#4079943, @MoritzMuehlenhoff wrote: > The server is still visible in Cumin: > > ``` > jmm@sarin:~$ sudo cumin irid* > 1 hosts will be targeted: > iri... 
[15:40:34] (03CR) 10Vgutierrez: Introduce server.is_pooled and make server.pooled usage more consistent (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/421053 (owner: 10Mark Bergsma) [15:41:24] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4091656 (10Papaul) @BBlack @RobH I already did the test on 6 of the systems that are depooled, and also upgraded the iDRAC and BIOS. You can see the... [16:03:00] 10Operations, 10Commons, 10Traffic: Caching problem with file description page on Commons - https://phabricator.wikimedia.org/T191028#4091702 (10zhuyifei1999) [16:07:33] 10Operations, 10Puppet, 10Patch-For-Review: Upgrade PuppetDB to version 4.4 - https://phabricator.wikimedia.org/T177253#4091708 (10herron) [16:07:35] 10Operations, 10Puppet, 10Patch-For-Review: compile/diff catalogs between puppetdb v2 (production) and puppetdb v4 - https://phabricator.wikimedia.org/T188544#4091706 (10herron) 05Open>03Resolved Resolving since the original task (run catalog diffs in production) has been completed [16:08:12] Hello, do we have puppet swat now? 
[16:09:30] stephanebisson: mmm I don't think so [16:09:44] jouncebot: current [16:09:54] jouncebot: next [16:09:54] In 0 hour(s) and 50 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180329T1700) [16:10:20] It's on there: https://wikitech.wikimedia.org/wiki/Deployments#Thursday,_March_29 [16:13:04] 10Operations, 10ops-codfw, 10DC-Ops, 10hardware-requests: Decommission restbase-test200[123] - https://phabricator.wikimedia.org/T187447#4091718 (10Papaul) [16:13:25] jouncebot: now [16:13:25] For the next 0 hour(s) and 46 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180329T1600) [16:14:22] PROBLEM - Apache HTTP on mw2285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:14:54] so, can anyone do the puppet swat? [16:14:55] 10Operations, 10Puppet, 10Patch-For-Review: Upgrade PuppetDB to version 4.4 - https://phabricator.wikimedia.org/T177253#4091724 (10herron) [16:14:57] 10Operations, 10Puppet, 10Patch-For-Review: Port puppetlabs PuppetDB 4.4 package to stretch - https://phabricator.wikimedia.org/T185502#4091722 (10herron) 05Open>03Resolved a:03herron [16:15:12] RECOVERY - Apache HTTP on mw2285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.106 second response time [16:16:00] 10Operations, 10Puppet: Port puppetlabs PuppetDB 4.4 package to stretch - https://phabricator.wikimedia.org/T185502#3917661 (10herron) [16:16:29] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4091737 (10RobH) @Papaul: The remaining systems will need to be depooled and repooled one at a time for work, please coordinate with either myself o... [16:17:45] stephanebisson: is the patch meant for puppet swat? 
it seems like the other one, namely related to a service that might be affected after deploy (but I only took a quick look) [16:20:29] for example, it is not clear from the description what this change will do [16:21:19] elukey: I think it's meant for puppet swat. It populates /etc//vars.yaml so it can be used by subsequent deployments. [16:21:36] For 3 services: kartotherian, tilerator, tileratorui [16:21:39] (03CR) 10Muehlenhoff: [C: 032] Update SSH key for dbrant [puppet] - 10https://gerrit.wikimedia.org/r/422946 (owner: 10Muehlenhoff) [16:21:59] Rather it adds 1 or 2 keys to the vars file that already exists. [16:22:58] I've cherry-picked this patch to deployment-puppetmaster02.eqiad.wmflabs and tested it on deployment-maps01.eqiad.wmflabs by triggering a puppet run and checking the var files. [16:24:25] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4091743 (10BBlack) Well we should maybe pause at this point and ask if this test is doing any good? It seems odd that 3/6 tested had the SEL entrie... [16:25:43] RECOVERY - Host cp2021 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms [16:28:22] (03CR) 10Elukey: "adding pcc: https://puppet-compiler.wmflabs.org/compiler03/10732/" [puppet] - 10https://gerrit.wikimedia.org/r/422239 (https://phabricator.wikimedia.org/T112948) (owner: 10Sbisson) [16:30:26] stephanebisson: can you please add somebody that owns/knows those services to +1 it? (as requested in the puppet swat page) [16:31:54] elukey: That would be Guillaume. We'll wait for next week then. [16:32:18] stephanebisson: it would be great yes :) [16:36:24] 10Operations, 10ops-codfw, 10DC-Ops, 10hardware-requests: Decommission restbase-test200[123] - https://phabricator.wikimedia.org/T187447#4091758 (10Papaul) a:05Papaul>03RobH @Robh can you please work on the switch ports and assign back to me for mgmt DNS removal. Thanks. 
[16:40:11] 10Operations, 10Puppet: Upgrade PuppetDB to version 4.4 - https://phabricator.wikimedia.org/T177253#4091771 (10herron) [16:41:12] 10Operations, 10Puppet: Upgrade PuppetDB to version 4.4 - https://phabricator.wikimedia.org/T177253#3652246 (10herron) [16:41:15] 10Operations, 10Puppet: puppetdb4: use postgres db backend in puppet-compiler - https://phabricator.wikimedia.org/T187258#4091776 (10herron) 05Open>03Resolved a:03herron compiler02 has been upgraded and re-enabled which completes this task [16:41:21] 10Operations, 10Puppet: puppetdb4: use postgres db backend in puppet-compiler - https://phabricator.wikimedia.org/T187258#4091780 (10herron) [16:41:22] (03CR) 10Nuria: eventlogging: move alarms from graphite to prometheus (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/422135 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [16:43:31] (03CR) 10Elukey: eventlogging: move alarms from graphite to prometheus (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/422135 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [16:43:34] 10Operations, 10Puppet, 10Goal, 10Patch-For-Review: Modernize Puppet Configuration Management (2017-18 Q3 Goal) - https://phabricator.wikimedia.org/T184561#4091788 (10herron) [16:43:37] 10Operations, 10Puppet: Upgrade PuppetDB to version 4.4 - https://phabricator.wikimedia.org/T177253#4091785 (10herron) 05Open>03Resolved a:03herron PuppetDB 4.4 upgrade is complete [16:43:57] 10Operations, 10Puppet: Upgrade PuppetDB to version 4.4 - https://phabricator.wikimedia.org/T177253#4091793 (10herron) [16:44:00] 10Operations, 10Puppet, 10Patch-For-Review: Add PuppetDB version selector (puppet/hiera) - https://phabricator.wikimedia.org/T185501#4091792 (10herron) [16:44:02] 10Operations, 10Puppet, 10Patch-For-Review: Extend puppetmaster::puppetdb to support puppetlabs packaged puppetdb 4.4 - https://phabricator.wikimedia.org/T185500#4091790 (10herron) 05Open>03Resolved a:03herron [16:44:04] 10Operations, 
10ops-codfw, 10DC-Ops, 10hardware-requests: Decommission restbase-test200[123] - https://phabricator.wikimedia.org/T187447#4091794 (10RobH) a:05RobH>03Papaul @papaul: The switch port info you provided does NOT match the switch: ``` robh@asw-b-codfw> show interfaces descriptions | grep... [16:44:06] 10Operations, 10Puppet, 10Goal, 10Patch-For-Review: Modernize Puppet Configuration Management (2017-18 Q3 Goal) - https://phabricator.wikimedia.org/T184561#3888176 (10herron) [16:44:57] 10Operations, 10Puppet: Upgrade PuppetDB to version 4.4 - https://phabricator.wikimedia.org/T177253#3652246 (10herron) [16:44:59] 10Operations, 10Puppet, 10Patch-For-Review: puppetdb4: upgrade puppetdbquery module - https://phabricator.wikimedia.org/T187259#4091798 (10herron) 05Open>03Resolved a:03herron [16:45:53] 10Operations, 10Puppet: Upgrade PuppetDB to version 4.4 - https://phabricator.wikimedia.org/T177253#4091806 (10herron) [16:45:55] 10Operations, 10Puppet: naggen2: support puppetdb 4 settings and api - https://phabricator.wikimedia.org/T188032#4091804 (10herron) 05Open>03Resolved a:03herron [16:46:06] 10Operations, 10Puppet: naggen2: support puppetdb 4 settings and api - https://phabricator.wikimedia.org/T188032#3993941 (10herron) [16:49:25] (03CR) 10Dzahn: [C: 031] Don't error out on interface::add_ip6_mapped on node level [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/422927 (owner: 10Alexandros Kosiaris) [16:50:14] (03PS1) 10Dduvall: Update SSH key for dduvall [puppet] - 10https://gerrit.wikimedia.org/r/422962 [17:05:14] !log mobrovac@tin Started deploy [restbase/deploy@af592d6]: Add bawikibooks - T191033 [17:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:21] T191033: ba.wikibooks.org REST API seems broken - https://phabricator.wikimedia.org/T191033 [17:05:59] (03PS6) 10Elukey: profile::zookeeper:server: add the support for prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/422920 
(https://phabricator.wikimedia.org/T177460) [17:09:16] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/10734/" [puppet] - 10https://gerrit.wikimedia.org/r/422920 (https://phabricator.wikimedia.org/T177460) (owner: 10Elukey) [17:13:48] 10Operations, 10Analytics, 10User-Elukey: Tune Varnishkafka delivery errors to be more sensitive - https://phabricator.wikimedia.org/T173492#4091941 (10fdans) p:05Normal>03Low [17:16:26] (03PS7) 10Elukey: profile::zookeeper:server: add the support for prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/422920 (https://phabricator.wikimedia.org/T177460) [17:25:49] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#4092012 (10fdans) [17:26:16] (03CR) 10Chad: [C: 032] "I couldn't tell you what it actually stands for, but best I can wager it's for when a symlink is moved? Idk, that's the only place I've se" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411284 (owner: 10Chad) [17:26:35] 10Operations, 10Commons, 10Traffic: Caching problem with file description page on Commons - https://phabricator.wikimedia.org/T191028#4092016 (10Aklapper) Thanks for reporting this. https://commons.wikimedia.org/wiki/File:Ariano_Irpino_ZI.jpeg shows an image with a cloud and in the "File History" section the... [17:27:34] (03Merged) 10jenkins-bot: Remove indirection from search-redirect.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411284 (owner: 10Chad) [17:28:08] 10Operations, 10Discovery-Search: Additional network ports for elasticsearch servers? - https://phabricator.wikimedia.org/T189854#4092023 (10EBjune) 05Open>03declined [17:28:35] (03CR) 10jenkins-bot: Remove indirection from search-redirect.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/411284 (owner: 10Chad) [17:28:50] 10Operations, 10Discovery-Search: Additional network ports for elasticsearch servers? 
- https://phabricator.wikimedia.org/T189854#4055439 (10EBjune) We'll make sure this gets addressed during upcoming server replacements. [17:29:44] 10Operations, 10Commons, 10Traffic: Caching problem with file description page on Commons - https://phabricator.wikimedia.org/T191028#4092032 (10Aklapper) 05Open>03stalled //If// this is about reverting to the non-cloud version and the preview on top of the `File:` page still shows the cloud version, you... [17:29:44] Who should I talk to to get the PostgreSQL command in https://phabricator.wikimedia.org/T190238 run on maps-test2004? [17:30:05] !log demon@tin Synchronized docroot/wwwportal/w/search-redirect.php: removing symlink indirection (duration: 01m 16s) [17:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:26] 10Operations, 10Commons, 10Traffic: Caching problem with file description page on Commons - https://phabricator.wikimedia.org/T191028#4092058 (10Aklapper) [17:33:27] wikibugs: ping [17:33:43] (03CR) 10Dzahn: [C: 032] Don't error out on interface::add_ip6_mapped on node level [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/422927 (owner: 10Alexandros Kosiaris) [17:34:50] (03PS1) 10Elukey: Refactor hadoop/hive monitoring profiles to a simpler structure [puppet] - 10https://gerrit.wikimedia.org/r/423000 (https://phabricator.wikimedia.org/T167790) [17:34:52] oh.. 
that isn't operations/puppet repo of course, heh [17:35:49] !log mobrovac@tin Finished deploy [restbase/deploy@af592d6]: Add bawikibooks - T191033 (duration: 30m 35s) [17:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:57] T191033: ba.wikibooks.org REST API seems broken - https://phabricator.wikimedia.org/T191033 [17:39:50] (03PS1) 10Chad: Adding zuul plugin for cross-repo dependency mgmt [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/423002 [17:41:29] (03CR) 10Paladox: [C: 031] Adding zuul plugin for cross-repo dependency mgmt [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/423002 (owner: 10Chad) [17:41:47] (03CR) 10Chad: "Uploaded to Archiva: https://archiva.wikimedia.org/#artifact/com.googlesource.gerrit.plugins/zuul/2.14.7-9" [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/423002 (owner: 10Chad) [17:44:32] (03PS2) 10Elukey: Refactor hadoop/hive monitoring profiles to a simpler structure [puppet] - 10https://gerrit.wikimedia.org/r/423000 (https://phabricator.wikimedia.org/T167790) [17:46:50] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar), 10Performance-Team-notice: Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#4092111 (10Krinkle) [17:48:56] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/10736/" [puppet] - 10https://gerrit.wikimedia.org/r/423000 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [17:50:51] (03CR) 10Dzahn: [C: 032] "thanks !:)" [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/422927 (owner: 10Alexandros Kosiaris) [17:51:27] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4092127 (10RobH) Please note I've asked @papaul to memtest86+ cp2022 WITHOUT flashing the bios/drac. 
Once we have that result, we'll also then star... [17:53:04] (03PS8) 10Elukey: profile::zookeeper:server: add the support for prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/422920 (https://phabricator.wikimedia.org/T177460) [17:53:06] (03PS6) 10Dzahn: site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143 [17:53:38] (03CR) 10jerkins-bot: [V: 04-1] site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn) [17:55:10] !log pausing restarts of elastic@codfw (6 nodes left) [17:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Morning SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180329T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. [18:00:17] lies [18:00:23] I'll deploy my stuff [18:00:52] jouncebot: refresh [18:00:53] I refreshed my knowledge about deployments. 
[18:00:58] jouncebot: now [18:00:58] For the next 0 hour(s) and 59 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180329T1800) [18:05:32] (03CR) 10Ottomata: [C: 031] Refactor hadoop/hive monitoring profiles to a simpler structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/423000 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [18:14:23] Niharika: your patch is live on mwdebug1002 [18:14:24] (03PS5) 10Elukey: eventlogging: move alarms from graphite to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/422135 (https://phabricator.wikimedia.org/T114199) [18:15:13] (03CR) 10jerkins-bot: [V: 04-1] eventlogging: move alarms from graphite to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/422135 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [18:15:49] MaxSem: Which wmf version? [18:15:55] both [18:16:11] MaxSem: Doesn't seem to work. [18:16:28] Limit on summary box is still 1000. [18:16:47] well, you changed it in the backend only, right? [18:17:05] also, debug=true? [18:17:14] Bah, why are the two not connected. [18:17:27] (03PS6) 10Elukey: eventlogging: move alarms from graphite to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/422135 (https://phabricator.wikimedia.org/T114199) [18:18:17] MaxSem: It was cache. Seems fine. [18:18:19] Deploy. [18:18:59] * MaxSem flips the siren on [18:22:06] !log maxsem@tin Synchronized php-1.31.0-wmf.27/includes/: Shorten summary length to 500 (duration: 02m 14s) [18:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:24] Niharika: ^ [18:22:49] waiting for OK to deploy to 26 [18:25:25] MaxSem: Works. [18:26:37] you scared me MaxSem :P [18:27:15] 👻👻👻 [18:27:28] little ghosts? 
[18:27:42] !log maxsem@tin Synchronized php-1.31.0-wmf.26/includes/: Shorten summary length to 500 (duration: 02m 06s) [18:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:57] Niharika: ^ [18:28:00] (03PS3) 10Elukey: Refactor hadoop/hive monitoring profiles to a simpler structure [puppet] - 10https://gerrit.wikimedia.org/r/423000 (https://phabricator.wikimedia.org/T167790) [18:30:27] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/10738/" [puppet] - 10https://gerrit.wikimedia.org/r/423000 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [18:38:41] hey MaxSem can i add a late entry to swat? [18:38:45] found an unbreak now [18:38:54] you can try :} [18:39:05] (03PS5) 10Bstorm: wiki replicas: refactor and record grants and set user [puppet] - 10https://gerrit.wikimedia.org/r/422199 (https://phabricator.wikimedia.org/T181650) [18:40:14] https://gerrit.wikimedia.org/r/#/c/423012/ < MaxSem adding it to deploy page [18:47:33] MaxSem: debug1001 ? [18:47:37] jdlrobson: pulled on mwdebug1002 [18:47:57] MaxSem: verified fix [18:47:59] deploy away! [18:49:28] !log maxsem@tin Synchronized php-1.31.0-wmf.27/skins/MinervaNeue: https://gerrit.wikimedia.org/r/#/c/423012/ (duration: 01m 17s) [18:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:06] jdlrobson: ^ [18:50:23] thanks so much MaxSem [18:50:27] my designer was about to kill me ;-) [18:50:43] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn) [18:51:48] (03CR) 10jerkins-bot: [V: 04-1] site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn) [19:00:04] twentyafterfour: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180329T1900). [19:00:05] No GERRIT patches in the queue for this window AFAICS. 
[19:00:26] (03PS1) 10Bstorm: wiki replicas: working on getting the hiera data right [labs/private] - 10https://gerrit.wikimedia.org/r/423014 [19:01:32] (03CR) 10Bstorm: [V: 032 C: 032] wiki replicas: working on getting the hiera data right [labs/private] - 10https://gerrit.wikimedia.org/r/423014 (owner: 10Bstorm) [19:01:34] (03PS1) 10Ottomata: Initial debian release of 3.0.0 [debs/python-uritemplate] - 10https://gerrit.wikimedia.org/r/423015 [19:01:49] (03Abandoned) 10Ottomata: Initial debian release of 3.0.0 [debs/python-uritemplate] - 10https://gerrit.wikimedia.org/r/423015 (owner: 10Ottomata) [19:02:41] (03PS1) 10Ottomata: Initial debian release of 3.0.0 [debs/python-uritemplate] (debian) - 10https://gerrit.wikimedia.org/r/423016 [19:02:46] Train is still blocked [19:03:31] (03PS1) 10Ottomata: Initial debian release of 3.0.0 [debs/python-uritemplate] - 10https://gerrit.wikimedia.org/r/423017 (https://phabricator.wikimedia.org/T190767) [19:04:18] (03Abandoned) 10Ottomata: Initial debian release of 3.0.0 [debs/python-uritemplate] - 10https://gerrit.wikimedia.org/r/423017 (https://phabricator.wikimedia.org/T190767) (owner: 10Ottomata) [19:05:42] (03Abandoned) 10Ottomata: Initial debian release of 3.0.0 [debs/python-uritemplate] (debian) - 10https://gerrit.wikimedia.org/r/423016 (owner: 10Ottomata) [19:06:31] (03PS1) 10Ottomata: Initial debian release of 3.0.0 [debs/python-uritemplate] (debian) - 10https://gerrit.wikimedia.org/r/423018 (https://phabricator.wikimedia.org/T190767) [19:06:50] (03CR) 10Ottomata: [V: 032 C: 032] Initial debian release of 3.0.0 [debs/python-uritemplate] (debian) - 10https://gerrit.wikimedia.org/r/423018 (https://phabricator.wikimedia.org/T190767) (owner: 10Ottomata) [19:08:12] (03PS1) 10Ladsgroup: labs: Mark fawiki collation method explicitly uppercase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423020 (https://phabricator.wikimedia.org/T190965) [19:09:35] (03CR) 10Ladsgroup: [C: 032] labs: Mark fawiki collation method 
explicitly uppercase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423020 (https://phabricator.wikimedia.org/T190965) (owner: 10Ladsgroup) [19:10:21] ^ rebased on tin [19:10:41] (03Merged) 10jenkins-bot: labs: Mark fawiki collation method explicitly uppercase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423020 (https://phabricator.wikimedia.org/T190965) (owner: 10Ladsgroup) [19:10:43] (03CR) 10jenkins-bot: labs: Mark fawiki collation method explicitly uppercase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423020 (https://phabricator.wikimedia.org/T190965) (owner: 10Ladsgroup) [19:10:58] Amir1: Wait, why uppercase instead of 'uca-fa' ? [19:11:12] I haven't been following this super closely [19:12:01] 10Operations, 10Commons, 10Traffic: Caching problem with file description page on Commons - https://phabricator.wikimedia.org/T191028#4092635 (10zhuyifei1999) 05stalled>03Open Looks like its thumbnails did not get purged. 'Original file' https://upload.wikimedia.org/wikipedia/commons/c/ca/Ariano_Irpino_Z... 
[19:12:02] bawolff: I went with the default option there [19:12:08] the default is uppercase [19:13:28] If we want to check if the bug goes away in the new uca, we probably want to test uca-fa as that's the one with the bug [19:13:47] hmm, I can do that [19:14:01] 10Operations, 10Commons, 10Traffic: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4092640 (10zhuyifei1999) [19:19:48] (03PS1) 10Ottomata: Initial debian release 1.6.6 [debs/python-google-api] (debian) - 10https://gerrit.wikimedia.org/r/423021 [19:19:50] (03PS2) 10Ottomata: Initial debian release 1.6.6 [debs/python-google-api] (debian) - 10https://gerrit.wikimedia.org/r/423021 (https://phabricator.wikimedia.org/T190767) [19:20:22] (03CR) 10Ottomata: [V: 032 C: 032] Initial debian release 1.6.6 [debs/python-google-api] (debian) - 10https://gerrit.wikimedia.org/r/423021 (https://phabricator.wikimedia.org/T190767) (owner: 10Ottomata) [19:21:31] (03CR) 10Chad: [V: 032 C: 032] Adding zuul plugin for cross-repo dependency mgmt [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/423002 (owner: 10Chad) [19:24:50] (03PS1) 10Ottomata: Install python google api client on stretch stat and notebook boxes [puppet] - 10https://gerrit.wikimedia.org/r/423023 (https://phabricator.wikimedia.org/T190767) [19:25:36] (03CR) 10Ottomata: [C: 032] Install python google api client on stretch stat and notebook boxes [puppet] - 10https://gerrit.wikimedia.org/r/423023 (https://phabricator.wikimedia.org/T190767) (owner: 10Ottomata) [19:27:01] (03PS1) 10Ladsgroup: labs: turn fawiki's collation to uca-fa [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423024 (https://phabricator.wikimedia.org/T190965) [19:31:01] Hopefully will be the last one [19:31:03] sorry [19:32:01] (03CR) 10Ladsgroup: [C: 032] labs: turn fawiki's collation to uca-fa [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423024 
(https://phabricator.wikimedia.org/T190965) (owner: 10Ladsgroup) [19:32:43] rebased on tin [19:32:52] (03PS6) 10Ppchelko: Remove special jobrunners for refreshLinks and htmlCacheUpdate. [puppet] - 10https://gerrit.wikimedia.org/r/416481 (https://phabricator.wikimedia.org/T185052) [19:32:54] (03CR) 10jerkins-bot: [V: 04-1] Remove special jobrunners for refreshLinks and htmlCacheUpdate. [puppet] - 10https://gerrit.wikimedia.org/r/416481 (https://phabricator.wikimedia.org/T185052) (owner: 10Ppchelko) [19:33:42] (03PS1) 10Ottomata: Include ores::base on stat nodes [puppet] - 10https://gerrit.wikimedia.org/r/423027 (https://phabricator.wikimedia.org/T181646) [19:34:00] (03Merged) 10jenkins-bot: labs: turn fawiki's collation to uca-fa [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423024 (https://phabricator.wikimedia.org/T190965) (owner: 10Ladsgroup) [19:34:31] (03CR) 10jenkins-bot: labs: turn fawiki's collation to uca-fa [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423024 (https://phabricator.wikimedia.org/T190965) (owner: 10Ladsgroup) [19:35:05] (03CR) 10jerkins-bot: [V: 04-1] Include ores::base on stat nodes [puppet] - 10https://gerrit.wikimedia.org/r/423027 (https://phabricator.wikimedia.org/T181646) (owner: 10Ottomata) [19:35:09] (03CR) 10Ottomata: [V: 032 C: 032] Include ores::base on stat nodes [puppet] - 10https://gerrit.wikimedia.org/r/423027 (https://phabricator.wikimedia.org/T181646) (owner: 10Ottomata) [19:35:56] (03PS7) 10Ppchelko: Remove special jobrunners for refreshLinks and htmlCacheUpdate. [puppet] - 10https://gerrit.wikimedia.org/r/416481 (https://phabricator.wikimedia.org/T185052) [19:35:59] (03CR) 10jerkins-bot: [V: 04-1] Remove special jobrunners for refreshLinks and htmlCacheUpdate. 
[puppet] - 10https://gerrit.wikimedia.org/r/416481 (https://phabricator.wikimedia.org/T185052) (owner: 10Ppchelko) [19:38:07] (03CR) 10Bstorm: "Changed the private repo to include duplicated passwords at the right location https://puppet-compiler.wmflabs.org/compiler02/10740/. Test" [puppet] - 10https://gerrit.wikimedia.org/r/422199 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [19:38:41] (03PS8) 10Ppchelko: Remove special jobrunners for refreshLinks and htmlCacheUpdate. [puppet] - 10https://gerrit.wikimedia.org/r/416481 (https://phabricator.wikimedia.org/T185052) [19:39:46] (03CR) 10Ppchelko: "Rebased" [puppet] - 10https://gerrit.wikimedia.org/r/416481 (https://phabricator.wikimedia.org/T185052) (owner: 10Ppchelko) [19:40:53] (03PS6) 10Bstorm: wiki replicas: refactor and record grants and set user [puppet] - 10https://gerrit.wikimedia.org/r/422199 (https://phabricator.wikimedia.org/T181650) [19:41:39] 10Operations, 10HHVM, 10User-Elukey, 10User-notice: ICU 57 migration for wikis using non-default collation - https://phabricator.wikimedia.org/T189295#4092701 (10Ladsgroup) [19:41:43] 10Operations, 10Beta-Cluster-Infrastructure, 10Patch-For-Review, 10User-Ladsgroup: Remove uca-fa from beta cluster - https://phabricator.wikimedia.org/T190965#4092699 (10Ladsgroup) 05Open>03Resolved All pages are in the right place: https://fa.wikipedia.beta.wmflabs.org/wiki/%D8%B1%D8%AF%D9%87:%D8%B5%D... [19:45:38] (03CR) 10Mobrovac: [C: 031] Remove special jobrunners for refreshLinks and htmlCacheUpdate. 
[puppet] - 10https://gerrit.wikimedia.org/r/416481 (https://phabricator.wikimedia.org/T185052) (owner: 10Ppchelko) [20:07:51] !log shutdown cp2022 for hw testing [20:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:52] (03PS1) 10Rush: wip openstack: neutron router l3-agent HA [puppet] - 10https://gerrit.wikimedia.org/r/423032 (https://phabricator.wikimedia.org/T188266) [20:09:39] (03PS1) 10Ottomata: Blacklist job topics from main -> jumbo mirrormaker [puppet] - 10https://gerrit.wikimedia.org/r/423033 (https://phabricator.wikimedia.org/T189464) [20:09:42] PROBLEM - Host cp2022 is DOWN: PING CRITICAL - Packet loss = 100% [20:11:05] (03CR) 10jerkins-bot: [V: 04-1] wip openstack: neutron router l3-agent HA [puppet] - 10https://gerrit.wikimedia.org/r/423032 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [20:11:57] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4092735 (10Papaul) cp2022 SEL "Normal","Sat May 30 2015 03:52:02","Log cleared." "Warning","Wed Jun 01 2016 17:39:30","Correctable memory error rat... 
[20:12:59] (03CR) 10Ottomata: [C: 032] Blacklist job topics from main -> jumbo mirrormaker [puppet] - 10https://gerrit.wikimedia.org/r/423033 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata)
[20:14:12] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:12] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:13] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:13] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:22] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6
[20:14:22] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:22] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:22] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6
[20:14:22] PROBLEM - IPsec on kafka-jumbo1002 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6
[20:14:22] PROBLEM - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6
[20:14:22] PROBLEM - IPsec on cp5005 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:23] PROBLEM - IPsec on kafka-jumbo1001 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6
[20:14:23] PROBLEM - IPsec on cp5002 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:24] PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:24] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:32] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:32] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:32] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:32] PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:32] PROBLEM - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6
[20:14:33] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:42] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:42] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:43] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:43] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:43] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6
[20:14:43] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6
[20:14:43] PROBLEM - IPsec on kafka-jumbo1003 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6
[20:14:52] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:52] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:52] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:52] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:53] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:14:53] PROBLEM - IPsec on kafka1023 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6
[20:15:02] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:15:02] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:15:02] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:15:02] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:15:02] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:15:02] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6
[20:15:03] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:15:03] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:15:03] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:15:04] PROBLEM - IPsec on kafka-jumbo1006 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2022_v4, cp2022_v6
[20:15:04] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:15:05] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2022_v4, cp2022_v6
[20:34:12] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4092761 (10ayounsi)
[20:38:29] (03PS2) 10Rush: wip openstack: neutron router l3-agent HA [puppet] - 10https://gerrit.wikimedia.org/r/423032 (https://phabricator.wikimedia.org/T188266)
[20:39:19] (03CR) 10jerkins-bot: [V: 04-1] wip openstack: neutron router l3-agent HA [puppet] - 10https://gerrit.wikimedia.org/r/423032 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush)
[20:50:58] (03PS1) 10Madhuvishy: dumps: Add nfs, load, network monitoring for dist servers [puppet] - 10https://gerrit.wikimedia.org/r/423040 (https://phabricator.wikimedia.org/T168486)
[20:51:33] (03CR) 10Andrew Bogott: [C: 04-1] "This looks good, but for reviewing and history-diving purposes I'd like it if the role->profile refactor is in a separate patch from the a" [puppet] - 10https://gerrit.wikimedia.org/r/422199 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm)
[20:52:04] (03CR) 10jerkins-bot: [V: 04-1] dumps: Add nfs, load, network monitoring for dist servers [puppet] - 10https://gerrit.wikimedia.org/r/423040 (https://phabricator.wikimedia.org/T168486) (owner: 10Madhuvishy)
[20:54:34] (03CR) 10Bstorm: "Fair, but trying to add my credentials fails unless I refactor in one way or another. I could trick out the linter with a class declarati" [puppet] - 10https://gerrit.wikimedia.org/r/422199 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm)
[20:54:36] (03PS2) 10Madhuvishy: dumps: Add nfs, load, network monitoring for dist servers [puppet] - 10https://gerrit.wikimedia.org/r/423040 (https://phabricator.wikimedia.org/T168486)
[20:55:07] (03Abandoned) 10Dzahn: Revert "Revert "remove deploy1001 from dsh hosts and scap masters"" [puppet] - 10https://gerrit.wikimedia.org/r/422940 (https://phabricator.wikimedia.org/T191029) (owner: 10Dzahn)
[20:56:29] (03PS3) 10Madhuvishy: dumps: Add nfs, load, network monitoring for dist servers [puppet] - 10https://gerrit.wikimedia.org/r/423040 (https://phabricator.wikimedia.org/T168486)
[20:57:36] (03CR) 10Madhuvishy: [C: 032] dumps: Add nfs, load, network monitoring for dist servers [puppet] - 10https://gerrit.wikimedia.org/r/423040 (https://phabricator.wikimedia.org/T168486) (owner: 10Madhuvishy)
[21:23:02] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 66 ESP OK
[21:23:02] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 66 ESP OK
[21:23:02] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 66 ESP OK
[21:23:11] RECOVERY - Host cp2022 is UP: PING WARNING - Packet loss = 54%, RTA = 36.03 ms
[21:23:11] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 136 ESP OK
[21:23:11] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 136 ESP OK
[21:23:11] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 66 ESP OK
[21:23:12] RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 66 ESP OK
[21:23:12] RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 66 ESP OK
[21:23:12] RECOVERY - IPsec on cp5002 is OK: Strongswan OK - 66 ESP OK
[21:23:12] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 66 ESP OK
[21:23:12] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 66 ESP OK
[21:23:21] RECOVERY - IPsec on kafka1023 is OK: Strongswan OK - 136 ESP OK
[21:23:21] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 136 ESP OK
[21:23:21] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 66 ESP OK
[21:23:21] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 66 ESP OK
[21:23:22] RECOVERY - IPsec on kafka-jumbo1006 is OK: Strongswan OK - 136 ESP OK
[21:23:31] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 66 ESP OK
[21:23:31] RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 66 ESP OK
[21:23:31] RECOVERY - IPsec on cp4023 is OK: Strongswan OK - 66 ESP OK
[21:23:31] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 66 ESP OK
[21:23:32] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 66 ESP OK
[21:23:32] RECOVERY - IPsec on kafka-jumbo1003 is OK: Strongswan OK - 136 ESP OK
[21:23:41] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 66 ESP OK
[21:23:41] RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 66 ESP OK
[21:23:41] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 66 ESP OK
[21:23:41] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 66 ESP OK
[21:23:41] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 66 ESP OK
[21:23:41] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 136 ESP OK
[21:23:41] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 136 ESP OK
[21:23:42] RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 66 ESP OK
[21:23:42] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 66 ESP OK
[21:23:43] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 66 ESP OK
[21:23:43] RECOVERY - IPsec on kafka-jumbo1002 is OK: Strongswan OK - 136 ESP OK
[21:23:44] RECOVERY - IPsec on kafka-jumbo1004 is OK: Strongswan OK - 136 ESP OK
[21:24:01] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 66 ESP OK
[21:24:02] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 66 ESP OK
[21:24:31] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 66 ESP OK
[21:25:50] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4093023 (10Papaul) cp2022 SEL after test "Normal","Thu Mar 29 2018 20:11:58","Log cleared." "Warning","Thu Mar 29 2018 20:14:22","Fan 5A RPM is les...
[21:28:11] PROBLEM - Host cp2022 is DOWN: PING CRITICAL - Packet loss = 100%
[21:29:31] RECOVERY - Host cp2022 is UP: PING OK - Packet loss = 0%, RTA = 36.02 ms
[21:33:03] (03PS1) 10Jdlrobson: Rollout VirtualPageViews (final stage) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423047 (https://phabricator.wikimedia.org/T189906)
[21:56:49] wikibugs is lagged or something
[21:57:46] (03CR) 10Dzahn: [C: 032] site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn)
[21:58:00] (03CR) 10Dzahn: [V: 032 C: 032] site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn)
[21:58:02] (03PS7) 10Dzahn: site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143
[21:58:11] yea, it works, just with that delay.. used to be almost realtime
[21:58:36] (03CR) 10jerkins-bot: [V: 04-1] site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn)
[21:58:50] that -1 is a bug ;)
[21:58:59] that Alex tried to fix already
[22:00:46] (03PS3) 10Paladox: Gerrit: Switch gc back on [puppet] - 10https://gerrit.wikimedia.org/r/421593 (https://phabricator.wikimedia.org/T190045)
[22:00:48] (03CR) 10Dzahn: [V: 032 C: 032] site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn)
[22:03:15] (03CR) 10Dzahn: [V: 032 C: 032] "thanks for the change to wmf-style check, it doesn't seem to be applied yet but i'm not sure if that is because there is a deployment step" [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn)
[22:06:57] !log andrew@tin Started deploy [horizon/deploy@14d3e7d]: Updating Horizon with possible fix for T189706
[22:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:07:04] T189706: Floating Ip panel missing from new horizon update - https://phabricator.wikimedia.org/T189706
[22:10:12] !log andrew@tin Finished deploy [horizon/deploy@14d3e7d]: Updating Horizon with possible fix for T189706 (duration: 03m 16s)
[22:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:47] (03PS1) 10Dzahn: add IPv6 records for vega.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/423059 (https://phabricator.wikimedia.org/T188163)
[22:19:34] (03CR) 10Dzahn: [C: 032] add IPv6 records for vega.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/423059 (https://phabricator.wikimedia.org/T188163) (owner: 10Dzahn)
[22:25:34] 10Operations, 10Commons, 10Traffic: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4091369 (10BBlack) >>! In T191028#4092635, @zhuyifei1999 wrote: > AFAIK, client-side purging on upload.wm.o has been intentionally disabled...
[22:27:04] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4093321 (10Volker_E)
[22:28:09] (03PS1) 1020after4: New ssh key for twentyafterfour [puppet] - 10https://gerrit.wikimedia.org/r/423062
[22:28:54] 10Operations, 10Commons, 10Traffic: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4093330 (10zhuyifei1999) >>! In T191028#4093308, @BBlack wrote: > So to purge an image from the upload.wikimedia.org caches, you purge via M...
[22:29:29] (03CR) 10Dzahn: [C: 032] "[radon:~] $ host vega.codfw.wmnet" [dns] - 10https://gerrit.wikimedia.org/r/423059 (https://phabricator.wikimedia.org/T188163) (owner: 10Dzahn)
[22:31:59] 10Operations, 10Commons, 10Traffic: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4093337 (10BBlack) It definitely does purge Varnish for the original file on upload.wikimedia.org. And in general, when a replacement file i...
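The purge discussion above points at MediaWiki's `action=purge` API as the supported path to flush caches for a file (rather than sending HTTP PURGE to the CDN directly). A minimal sketch of building such a request, assuming the Commons API endpoint; the request is only constructed here, not sent:

```python
from urllib.parse import urlencode

def build_purge_request(title, api="https://commons.wikimedia.org/w/api.php"):
    """Return (url, body) for a MediaWiki action=purge request.

    action=purge must be sent as a POST; for file pages it asks MediaWiki
    to invalidate its own caches and emit the corresponding CDN purges.
    """
    params = {
        "action": "purge",
        "titles": title,   # one or more page titles, pipe-separated
        "format": "json",
    }
    return api, urlencode(params)

url, body = build_purge_request("File:Ariano_Irpino_ZI.jpeg")
print(url)
print(body)  # -> action=purge&titles=File%3AAriano_Irpino_ZI.jpeg&format=json
```

Note the subtlety the ticket is circling: per BBlack's comment, this reliably purges the original file on upload.wikimedia.org, while thumbnail purging is the part under investigation.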
[22:33:38] (03PS1) 10EBernhardson: Configure 5 buckets for next Cirrus AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423063 (https://phabricator.wikimedia.org/T187148)
[22:33:40] (03CR) 10jerkins-bot: [V: 04-1] Configure 5 buckets for next Cirrus AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423063 (https://phabricator.wikimedia.org/T187148) (owner: 10EBernhardson)
[22:37:29] (03PS2) 10EBernhardson: Configure 5 buckets for next Cirrus AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423063 (https://phabricator.wikimedia.org/T187148)
[22:40:05] 10Operations, 10Commons, 10Traffic: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4093343 (10BBlack) To be sure of what I'm saying, and perhaps provide some trace data that may help pinpoint whatever the actual problem is,...
[22:41:38] (03PS7) 10Bstorm: wiki replicas: refactor and record grants and set user [puppet] - 10https://gerrit.wikimedia.org/r/422199 (https://phabricator.wikimedia.org/T181650)
[22:47:06] 10Operations, 10Commons, 10Traffic: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4093349 (10zhuyifei1999) Interesting. I'm unaware of that. Thanks. I issued two more purges, but https://upload.wikimedia.org/wikipedia/com...
[22:47:17] 10Operations, 10Commons, 10Traffic: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4093350 (10BBlack) Hmmm, now I'm noting what is probably the critical discrepancy here.... When I visit https://commons.wikimedia.org/wiki/...
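The "Configure 5 buckets for next Cirrus AB test" patch above splits search traffic into five test buckets. The config change itself only names the buckets; one common way such assignment works (a hypothetical illustration, not the actual CirrusSearch/WikimediaEvents implementation) is hashing a session token so each user lands deterministically in one bucket:

```python
import hashlib

# Bucket names are illustrative, not taken from the actual patch.
BUCKETS = ["control", "bucket1", "bucket2", "bucket3", "bucket4"]

def assign_bucket(token: str, buckets=BUCKETS) -> str:
    """Deterministically map a session token to one of N test buckets."""
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return buckets[int(digest, 16) % len(buckets)]

# The same token always lands in the same bucket, so a user's search
# experience stays consistent for the duration of the test:
assert assign_bucket("session-abc123") == assign_bucket("session-abc123")
```

Hashing rather than random draws is what lets the client-side module (synced later in this log) re-derive the bucket on every page view without storing state.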
[22:49:32] 10Operations, 10Commons, 10Thumbor, 10Traffic, 10media-storage: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4093352 (10BBlack)
[22:57:22] PROBLEM - HHVM rendering on mw2209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:58:21] RECOVERY - HHVM rendering on mw2209 is OK: HTTP OK: HTTP/1.1 200 OK - 74263 bytes in 0.394 second response time
[22:59:53] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4093386 (10Nuria) 05Open>03Resolved
[23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180329T2300).
[23:00:05] marlier and ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:31] \o
[23:00:34] i can deploy
[23:01:21] i'm going to guess deployment.eqiad.wmnet (tin) is still the right place to deploy from :)
[23:02:08] (03CR) 10EBernhardson: [C: 032] Configure 5 buckets for next Cirrus AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423063 (https://phabricator.wikimedia.org/T187148) (owner: 10EBernhardson)
[23:02:40] ebernhardson: it should be always pointed at the right place
[23:07:08] (03Merged) 10jenkins-bot: Configure 5 buckets for next Cirrus AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423063 (https://phabricator.wikimedia.org/T187148) (owner: 10EBernhardson)
[23:08:15] (03CR) 10jenkins-bot: Configure 5 buckets for next Cirrus AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423063 (https://phabricator.wikimedia.org/T187148) (owner: 10EBernhardson)
[23:10:09] marlier: around for swat?
[23:12:11] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: T187148: Configure 5 buckets for cirrus AB test (duration: 01m 17s)
[23:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:18] T187148: Evaluate features provided by `query_explorer` functionality of ltr plugin - https://phabricator.wikimedia.org/T187148
[23:12:40] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4093427 (10BBlack) So, intersecting our info at the top ("Y" for eqsin as best site, not zero-blocked), the peering u...
[23:22:49] marlier: ping?
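The `!log ... Synchronized ... (duration: ...)` lines above and below are scap's Server Admin Log entries for each sync. Their shape is regular enough to parse, which is handy when auditing what was deployed in a SWAT window (a sketch; the format is assumed from the entries in this log):

```python
import re

# Assumed shape of scap's SAL entries in this log:
#   !log <user>@tin Synchronized <path>: <summary> (duration: MMm SSs)
SYNC_RE = re.compile(
    r"Synchronized (?P<path>\S+): (?P<summary>.*) \(duration: (?P<min>\d+)m (?P<sec>\d+)s\)"
)

def parse_sync(line):
    """Extract (synced path, duration in seconds) from a SAL entry, or None."""
    m = SYNC_RE.search(line)
    if not m:
        return None
    return m.group("path"), int(m.group("min")) * 60 + int(m.group("sec"))

line = ("!log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: "
        "SWAT: T187148: Configure 5 buckets for cirrus AB test (duration: 01m 17s)")
print(parse_sync(line))  # -> ('wmf-config/InitialiseSettings.php', 77)
```

Note the greedy `\S+` for the path still stops before the `": "` separator because the regex engine backtracks to satisfy the literal that follows.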
[23:28:14] (03PS8) 10Bstorm: wiki replicas: record grants and set user for index maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/422199 (https://phabricator.wikimedia.org/T181650)
[23:35:43] (03CR) 10Krinkle: [C: 031] wmf-config: Enable oversampling for remaining countries in Asia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422419 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier)
[23:35:52] (03CR) 10EBernhardson: [C: 032] wmf-config: Enable oversampling for remaining countries in Asia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422419 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier)
[23:35:56] (03PS3) 10EBernhardson: wmf-config: Enable oversampling for remaining countries in Asia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422419 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier)
[23:36:05] ebernhardson: I can watch it in Ian's absence, no problem.
[23:37:54] !log ebernhardson@tin Synchronized php-1.31.0-wmf.26/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js: SWAT: T187148: Start cirrus AB test (duration: 01m 16s)
[23:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:38:00] T187148: Evaluate features provided by `query_explorer` functionality of ltr plugin - https://phabricator.wikimedia.org/T187148
[23:38:12] PROBLEM - Outgoing network saturation on labstore1006 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [106250000.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[23:39:12] PROBLEM - Incoming network saturation on labstore1007 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [106250000.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[23:40:22] !log ebernhardson@tin Synchronized php-1.31.0-wmf.27/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js: SWAT: T187148: Start cirrus AB test (duration: 01m 16s)
[23:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:41:00] (03CR) 10EBernhardson: [C: 032] wmf-config: Enable oversampling for remaining countries in Asia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422419 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier)
[23:42:13] (03Merged) 10jenkins-bot: wmf-config: Enable oversampling for remaining countries in Asia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422419 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier)
[23:42:28] (03CR) 10jenkins-bot: wmf-config: Enable oversampling for remaining countries in Asia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422419 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier)
[23:42:36] Krinkle: up on mwdebug1001 if there is anything to test
[23:44:43] ebernhardson: confirmed
[23:47:08] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: T189252: Enable perf oversampling for remaining countries in Asia (duration: 01m 16s)
[23:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:47:14] T189252: Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252
[23:48:44] Krinkle: all synced out
[23:50:15] ebernhardson: confirmed in prod when bypassing cache with XMD to a codfw host, but don't yet see it without cache bypass, may take <5min for startup module to expire, so will check again in a few minutes