[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate Evening SWAT (Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181212T0000).
[00:00:04] tgr: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:26] o/
[00:01:02] am I the only one with patches? I guess I should do it myself then
[00:04:07] still no way to connect from the office network to production? :(
[00:08:07] hm, I can't reach bast4001 from the guest wifi either so it's probably a different issue
[00:10:22] oh, it's 4002 now
[00:10:30] how do people learn about these things?
[00:14:44] tgr: ops-l, which is probably not all deployers, or is it?
[00:15:42] Also at https://wikitech.wikimedia.org/wiki/Production_shell_access#SSH_configuration, but that's not something you'd look at regularly :)
[00:17:21] Operations, Analytics, Research, Patch-For-Review, User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (Nuria) Both 2 and 3 above can start being worked as soon as @bmansurov has any bandwidth.
[00:17:45] I am on ops-l, I did a mail search and the last mail including the word bast4001 was from March and did not mention any change
[00:18:10] yeah, wikitech is how I figured it out eventually
[00:20:32] (PS2) Gergő Tisza: Bring up password change logging to the same standards as login logging [mediawiki-config] - https://gerrit.wikimedia.org/r/467110
[00:22:12] (CR) Gergő Tisza: [C: 2] Bring up password change logging to the same standards as login logging [mediawiki-config] - https://gerrit.wikimedia.org/r/467110 (owner: Gergő Tisza)
[00:23:17] (Merged) jenkins-bot: Bring up password change logging to the same standards as login logging [mediawiki-config] - https://gerrit.wikimedia.org/r/467110 (owner: Gergő Tisza)
[00:23:35] (PS2) Gergő Tisza: Add some missing groups to the privileged list [mediawiki-config] - https://gerrit.wikimedia.org/r/478467
[00:23:56] Operations, Analytics, Research, Patch-For-Review, User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (leila)
[00:26:11] Operations, Wikimedia-General-or-Unknown: Request for information about hosting services for WM-ES - https://phabricator.wikimedia.org/T211414 (Aklapper)
[00:27:22] (CR) Gergő Tisza: "Oh, duh. Why am I calling wfGetPrivilegedGroups twice?" [mediawiki-config] - https://gerrit.wikimedia.org/r/467110 (owner: Gergő Tisza)
[00:30:52] (CR) Gergő Tisza: "> Oh, duh. Why am I calling wfGetPrivilegedGroups twice?" [mediawiki-config] - https://gerrit.wikimedia.org/r/467110 (owner: Gergő Tisza)
[00:31:43] (CR) jenkins-bot: Bring up password change logging to the same standards as login logging [mediawiki-config] - https://gerrit.wikimedia.org/r/467110 (owner: Gergő Tisza)
[00:42:03] (CR) Gergő Tisza: [C: 2] Add some missing groups to the privileged list [mediawiki-config] - https://gerrit.wikimedia.org/r/478467 (owner: Gergő Tisza)
[00:43:05] (Merged) jenkins-bot: Add some missing groups to the privileged list [mediawiki-config] - https://gerrit.wikimedia.org/r/478467 (owner: Gergő Tisza)
[00:43:55] (CR) jenkins-bot: Add some missing groups to the privileged list [mediawiki-config] - https://gerrit.wikimedia.org/r/478467 (owner: Gergő Tisza)
[00:46:21] !log tgr@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:467110|Bring up password change logging to the same standards as login logging]] [[gerrit:478467|Add some missing groups to the privileged list]] (duration: 00m 53s)
[00:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:44] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:478467|Add some missing groups to the privileged list]] (duration: 00m 51s)
[00:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:30] Operations, MediaWiki-Page-deletion, MW-1.32-release, Performance-Team (Radar): Deleting pages on the English Wikipedia is very slow - https://phabricator.wikimedia.org/T207530 (Krinkle)
[01:17:22] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[01:25:28] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[01:26:32] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[01:26:37] that second dashboard isnt found because grafana changed, but on the first one it is recovered ^
[01:30:06] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[02:33:15] tgr: re: finding the right bastion and how to learn about changes. yea, it should have been announced on the ops list. i made this experimental thing to improve on that workflow https://people.wikimedia.org/~dzahn/bastion.sh.txt
[02:50:14] Hi, I ran into merge conflict here. Any suggestions? https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CheckUser/+/464472/
[03:02:52] (CR) Mathew.onipe: [C: 1] "Some methods are missing complete docstrings." (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/478030 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[03:08:49] (CR) Mathew.onipe: [C: 1] puppet: add PuppetMaster class (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/477707 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[03:22:23] PROBLEM - HHVM rendering on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:24:30] RECOVERY - HHVM rendering on mw1314 is OK: HTTP OK: HTTP/1.1 200 OK - 75978 bytes in 0.139 second response time
[03:46:53] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1464.38 seconds
[03:48:41] (PS5) Mathew.onipe: setup: change curator version to '>=5.0.0,<5.4.0' to match our current elasticsearch version [software/spicerack] - https://gerrit.wikimedia.org/r/477958
[03:51:18] RECOVERY - Long running screen/tmux on people1001 is OK: OK: No SCREEN or tmux processes detected.
[03:54:48] (Abandoned) Robingan7: Add several logos [mediawiki-config] - https://gerrit.wikimedia.org/r/478571 (owner: Robingan7)
[04:14:29] (PS1) Andrew Bogott: Horizon: move mwoffliner to eqiad1-r [puppet] - https://gerrit.wikimedia.org/r/479153 (https://phabricator.wikimedia.org/T204745)
[04:15:42] (PS2) Andrew Bogott: Horizon: move mwoffliner to eqiad1-r [puppet] - https://gerrit.wikimedia.org/r/479153 (https://phabricator.wikimedia.org/T204745)
[04:16:31] (CR) Andrew Bogott: [C: 2] Horizon: move mwoffliner to eqiad1-r [puppet] - https://gerrit.wikimedia.org/r/479153 (https://phabricator.wikimedia.org/T204745) (owner: Andrew Bogott)
[04:19:40] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 120.56 seconds
[04:37:10] (Abandoned) Robingan7: Revise images [mediawiki-config] - https://gerrit.wikimedia.org/r/478738 (owner: Robingan7)
[04:45:48] (CR) Tim Starling: Class wrapper for ProductionServices.php etc. (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/477956 (owner: Tim Starling)
[04:49:57] (PS1) CRusnov: Add management console report [software/netbox-reports] - https://gerrit.wikimedia.org/r/479155
[04:50:28] (PS2) CRusnov: Add management console report [software/netbox-reports] - https://gerrit.wikimedia.org/r/479155 (https://phabricator.wikimedia.org/T205899)
[04:56:07] (PS4) Tim Starling: Refactor profiler.php and X-Wikimedia-Debug parsing [mediawiki-config] - https://gerrit.wikimedia.org/r/477939
[04:56:09] (PS3) Tim Starling: Class wrapper for ProductionServices.php etc. [mediawiki-config] - https://gerrit.wikimedia.org/r/477956
[04:56:11] (PS3) Tim Starling: Put profiler hostnames in ProductionServices.php [mediawiki-config] - https://gerrit.wikimedia.org/r/477957
[04:56:13] (PS4) Tim Starling: Excimer and Tideways support [mediawiki-config] - https://gerrit.wikimedia.org/r/478137
[04:57:11] (CR) jerkins-bot: [V: -1] Refactor profiler.php and X-Wikimedia-Debug parsing [mediawiki-config] - https://gerrit.wikimedia.org/r/477939 (owner: Tim Starling)
[04:58:17] (CR) jerkins-bot: [V: -1] Excimer and Tideways support [mediawiki-config] - https://gerrit.wikimedia.org/r/478137 (owner: Tim Starling)
[04:59:58] (CR) Krinkle: [C: 2] [cirrus] Add all three elasticsearch cluster to labs services [mediawiki-config] - https://gerrit.wikimedia.org/r/478892 (https://phabricator.wikimedia.org/T211526) (owner: DCausse)
[05:00:06] (CR) Krinkle: [C: 2] "Beta-only change." [mediawiki-config] - https://gerrit.wikimedia.org/r/478892 (https://phabricator.wikimedia.org/T211526) (owner: DCausse)
[05:00:19] (CR) Krinkle: [C: 2] tests: Assert LabsServices contains all prod keys [mediawiki-config] - https://gerrit.wikimedia.org/r/478569 (https://phabricator.wikimedia.org/T211526) (owner: Krinkle)
[05:01:03] (CR) jerkins-bot: [V: -1] tests: Assert LabsServices contains all prod keys [mediawiki-config] - https://gerrit.wikimedia.org/r/478569 (https://phabricator.wikimedia.org/T211526) (owner: Krinkle)
[05:05:09] (PS2) Krinkle: [cirrus] Add all three elasticsearch cluster to labs services [mediawiki-config] - https://gerrit.wikimedia.org/r/478892 (https://phabricator.wikimedia.org/T211526) (owner: DCausse)
[05:05:16] (CR) Krinkle: [C: 2] [cirrus] Add all three elasticsearch cluster to labs services [mediawiki-config] - https://gerrit.wikimedia.org/r/478892 (https://phabricator.wikimedia.org/T211526) (owner: DCausse)
[05:05:20] (PS4) Krinkle: tests: Assert LabsServices contains all prod keys [mediawiki-config] - https://gerrit.wikimedia.org/r/478569 (https://phabricator.wikimedia.org/T211526)
[05:05:26] (CR) Krinkle: [C: 2] tests: Assert LabsServices contains all prod keys [mediawiki-config] - https://gerrit.wikimedia.org/r/478569 (https://phabricator.wikimedia.org/T211526) (owner: Krinkle)
[05:05:43] (PS1) Krinkle: Fix minor tech debt around AuthManager audit logging [mediawiki-config] - https://gerrit.wikimedia.org/r/479156
[05:06:23] (Merged) jenkins-bot: [cirrus] Add all three elasticsearch cluster to labs services [mediawiki-config] - https://gerrit.wikimedia.org/r/478892 (https://phabricator.wikimedia.org/T211526) (owner: DCausse)
[05:06:27] (Merged) jenkins-bot: tests: Assert LabsServices contains all prod keys [mediawiki-config] - https://gerrit.wikimedia.org/r/478569 (https://phabricator.wikimedia.org/T211526) (owner: Krinkle)
[05:07:22] (CR) jenkins-bot: [cirrus] Add all three elasticsearch cluster to labs services [mediawiki-config] - https://gerrit.wikimedia.org/r/478892 (https://phabricator.wikimedia.org/T211526) (owner: DCausse)
[05:07:24] (CR) jenkins-bot: tests: Assert LabsServices contains all prod keys [mediawiki-config] - https://gerrit.wikimedia.org/r/478569 (https://phabricator.wikimedia.org/T211526) (owner: Krinkle)
[05:28:19] (PS8) Robingan7: Upload HD logos for several projects [mediawiki-config] - https://gerrit.wikimedia.org/r/478498
[05:45:56] (CR) Krinkle: Refactor profiler.php and X-Wikimedia-Debug parsing (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/477939 (owner: Tim Starling)
[05:48:39] Operations, Patch-For-Review: Switch to predictable network interface names? - https://phabricator.wikimedia.org/T158429 (faidon) Open>Resolved a:faidon Has been implemented for all hosts starting with stretch and going forward for a long time now!
[05:49:38] Operations, Traffic, netops, IPv6: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (faidon) Makes sense, +1, go for it! A lot has happened since this task was filled in 2015 (e.g. not having precise anymore, T163196 etc.) and including `int...
[05:49:50] (CR) Krinkle: Refactor profiler.php and X-Wikimedia-Debug parsing (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/477939 (owner: Tim Starling)
[06:00:13] !log Deploy schema change on s8 primary master (db1071) T86338 T202167
[06:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:19] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338
[06:00:19] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167
[06:06:03] Operations, ops-codfw, DBA, decommission, Patch-For-Review: Decommission parsercache hosts: pc2004 pc2005 pc2006 (Dec 2018 lease return) - https://phabricator.wikimedia.org/T209858 (Marostegui) After those last merges, is this good to be closed? @Papaul @robh? Thanks!
[06:10:29] Operations, DBA, Performance-Team: Increase parsercache keys TTL from 22 days back to 30 days - https://phabricator.wikimedia.org/T210992 (Marostegui) Thanks @aaron - I will get a patch out after the code freeze
[06:13:41] Operations, ops-codfw, DBA, decommission, Patch-For-Review: Decommission parsercache hosts: pc2004 pc2005 pc2006 (Dec 2018 lease return) - https://phabricator.wikimedia.org/T209858 (Papaul) @Marostegui no need to close the task. It can be assign to @RobH so he can keep track
[06:14:16] Operations, ops-codfw, DBA, decommission, Patch-For-Review: Decommission parsercache hosts: pc2004 pc2005 pc2006 (Dec 2018 lease return) - https://phabricator.wikimedia.org/T209858 (Marostegui) a:Papaul>RobH
[06:22:24] Operations, ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (Papaul) same error again at 22:47 Normal,Tue 11 Dec 2018 22:47:04,An OEM diagnostic event occurred., Normal,Tue 11 Dec 2018 22:47:04,An OEM diagnostic event occurred., Normal,Tue 11 Dec 2018 22:47:04,An OE...
[06:30:26] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.009 second response time
[06:33:18] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:37:24] !log Deploy schema change on s4 primary master (db1068) T86338
[06:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:28] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338
[06:37:44] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[06:39:24] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.556 second response time
[06:40:50] (PS1) Marostegui: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479159
[06:41:56] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479159 (owner: Marostegui)
[06:43:02] (Merged) jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479159 (owner: Marostegui)
[06:44:26] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1088 for mysql upgrade (duration: 01m 07s)
[06:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:31] Operations, Performance-Team: Graphite generates a lot of 502 in Grafana - https://phabricator.wikimedia.org/T211747 (Peter)
[06:44:38] !log Stop MySQL on db1088 for mysql and kernel upgrade
[06:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:50] (CR) jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479159 (owner: Marostegui)
[06:52:29] (PS1) Marostegui: db-eqiad.php: Increase weight for db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479160
[06:53:37] (CR) Marostegui: [C: 2] db-eqiad.php: Increase weight for db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479160 (owner: Marostegui)
[06:54:42] (Merged) jenkins-bot: db-eqiad.php: Increase weight for db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479160 (owner: Marostegui)
[06:55:54] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic for db1088 (duration: 00m 52s)
[06:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:20] (CR) jenkins-bot: db-eqiad.php: Increase weight for db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479160 (owner: Marostegui)
[07:03:56] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479162
[07:06:41] (CR) Marostegui: [C: 2] db-eqiad.php: Increase traffic for db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479162 (owner: Marostegui)
[07:07:45] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479162 (owner: Marostegui)
[07:09:36] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic for db1088 (duration: 00m 51s)
[07:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:02] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479162 (owner: Marostegui)
[07:32:13] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479163
[07:33:19] (CR) Marostegui: [C: 2] db-eqiad.php: Increase traffic for db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479163 (owner: Marostegui)
[07:34:32] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479163 (owner: Marostegui)
[07:35:30] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic for db1088 (duration: 00m 51s)
[07:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:16] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479163 (owner: Marostegui)
[07:38:14] !log Deploy schema change on db2040 (s7 codfw master), this will generate lag on codfw T86338 T202167
[07:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:19] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338
[07:38:20] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167
[07:40:38] (PS1) Marostegui: db-eqiad.php: Increase weight for db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479165
[07:47:37] (CR) Marostegui: [C: 2] db-eqiad.php: Increase weight for db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479165 (owner: Marostegui)
[07:48:42] (Merged) jenkins-bot: db-eqiad.php: Increase weight for db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479165 (owner: Marostegui)
[07:49:37] (CR) jenkins-bot: db-eqiad.php: Increase weight for db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479165 (owner: Marostegui)
[07:49:53] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase traffic for db1088 (duration: 00m 52s)
[07:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:47] (CR) Urbanecm: [C: -1] Upload HD logos for several projects (4 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/478498 (owner: Robingan7)
[07:55:39] (PS9) Rafidaslam: Initial configuration for napwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/478945 (https://phabricator.wikimedia.org/T210752)
[08:00:12] (PS1) Marostegui: db-eqiad.php: Fully repool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479168
[08:01:26] (CR) Marostegui: [C: 2] db-eqiad.php: Fully repool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479168 (owner: Marostegui)
[08:02:26] (Merged) jenkins-bot: db-eqiad.php: Fully repool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479168 (owner: Marostegui)
[08:02:42] (CR) jenkins-bot: db-eqiad.php: Fully repool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/479168 (owner: Marostegui)
[08:03:32] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1088 (duration: 00m 51s)
[08:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:06] (PS1) Muehlenhoff: Remove now obsolete Diamond collector and related conffiles [puppet] - https://gerrit.wikimedia.org/r/479169 (https://phabricator.wikimedia.org/T183454)
[08:06:27] (CR) Hashar: "recheck" [debs/python-etcd] - https://gerrit.wikimedia.org/r/478985 (https://phabricator.wikimedia.org/T209136) (owner: Filippo Giunchedi)
[08:07:48] (CR) Hashar: [C: 1] "Ema proposed the CI change to make the debian-glue job voting :)" [debs/python-etcd] - https://gerrit.wikimedia.org/r/478985 (https://phabricator.wikimedia.org/T209136) (owner: Filippo Giunchedi)
[08:17:29] (CR) Urbanecm: [C: -1] Upload some new logos (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/478570 (owner: Robingan7)
[08:18:27] Operations: Connecting to mwmaint1002 though bast4002 fails - https://phabricator.wikimedia.org/T211748 (jrbs)
[08:18:47] !log decommissioning cassandra-b, restbase2006 -- T210843
[08:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:55] T210843: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843
[08:19:22] Operations: Connecting to mwmaint1002 though bast4002 fails - https://phabricator.wikimedia.org/T211748 (jrbs)
[08:21:33] (PS7) Filippo Giunchedi: logstash: copy 'severity' into 'level' where needed [puppet] - https://gerrit.wikimedia.org/r/476473 (https://phabricator.wikimedia.org/T205851)
[08:21:58] (CR) Urbanecm: [C: -1] "Last issue, otherwise, it's good. Thanks!" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/478945 (https://phabricator.wikimedia.org/T210752) (owner: Rafidaslam)
[08:23:49] (CR) Filippo Giunchedi: [C: 2] logstash: copy 'severity' into 'level' where needed [puppet] - https://gerrit.wikimedia.org/r/476473 (https://phabricator.wikimedia.org/T205851) (owner: Filippo Giunchedi)
[08:28:54] (CR) Filippo Giunchedi: "Would it be ok to reuse the group-id for all logstash kafka consumers? I'd guess so but asking in the context of adding another consumer i" [puppet] - https://gerrit.wikimedia.org/r/479136 (https://phabricator.wikimedia.org/T205850) (owner: Herron)
[08:31:31] (CR) Filippo Giunchedi: [C: 1] "LGTM modulo commit message" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/478774 (https://phabricator.wikimedia.org/T210486) (owner: Cwhite)
[08:37:00] !log Remove old backup directory from db1116 - T206743
[08:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:03] T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743
[08:38:29] !log depooling db1082 for schema change - T85757
[08:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:32] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[08:38:48] (CR) Banyek: [C: 2] mariadb: depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/477588 (https://phabricator.wikimedia.org/T85757) (owner: Banyek)
[08:41:39] (PS3) Banyek: mariadb: depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/477588 (https://phabricator.wikimedia.org/T85757)
[08:41:45] (CR) Banyek: [V: 2 C: 2] mariadb: depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/477588 (https://phabricator.wikimedia.org/T85757) (owner: Banyek)
[08:42:53] (Merged) jenkins-bot: mariadb: depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/477588 (https://phabricator.wikimedia.org/T85757) (owner: Banyek)
[08:48:39] (CR) Vgutierrez: [V: 2 C: 2] secrets: get rid of private keys of no longer used TLS certificates [labs/private] - https://gerrit.wikimedia.org/r/478982 (https://phabricator.wikimedia.org/T211697) (owner: Vgutierrez)
[08:48:57] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: depool db1082 (duration: 00m 51s)
[08:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:01] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[08:50:11] (PS10) Rafidaslam: Initial configuration for napwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/478945 (https://phabricator.wikimedia.org/T210752)
[08:52:09] Operations, Wikimedia-Logstash, Patch-For-Review, User-herron: Onboard at least 10 new non-sensitive log producers to the logging pipeline - https://phabricator.wikimedia.org/T205852 (hashar)
[08:52:30] (CR) jenkins-bot: mariadb: depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/477588 (https://phabricator.wikimedia.org/T85757) (owner: Banyek)
[09:05:11] (CR) Urbanecm: [C: 1] "LGTM!" [mediawiki-config] - https://gerrit.wikimedia.org/r/478945 (https://phabricator.wikimedia.org/T210752) (owner: Rafidaslam)
[09:15:07] Operations, ops-eqiad, media-storage: rack/setup/install ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T209618 (fgiunchedi) a:fgiunchedi>RobH @RobH looks like of these hosts only ms-be1050 is accessible from cumin atm? ditto for logging in as my user via ssh ` root@cumin1001...
[09:15:28] !log installing pixman security updates on trusty (Debian already fixed)
[09:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:43] Operations, ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (fgiunchedi) Given that the other hosts in this batch are fine and we've replaced the parts Dell wanted to replace what's the next step?
[09:16:47] (PS11) Rafidaslam: Initial configuration for napwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/478945 (https://phabricator.wikimedia.org/T210752)
[09:20:58] !log stopping replication on db1082 for schema change - T85757
[09:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:02] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[09:22:38] !log executing schema change with replication on db1082 - T85757
[09:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:08] !log fixing triggers on db1124:3315- T85757
[09:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:40] !log restarting replication on db1082 after schema change - T85757
[09:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:30] (PS1) Muehlenhoff: Add email address for joewalsh [puppet] - https://gerrit.wikimedia.org/r/479172
[09:26:59] (CR) GTirloni: Remove now obsolete Diamond collector and related conffiles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/479169 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff)
[09:28:49] Operations, ops-eqiad, DBA: Degraded RAID on db1063 - https://phabricator.wikimedia.org/T211537 (Banyek) Open>Resolved The sync finished, thank you @Cmjohnson ` Virtual Drive: 0 (Target Id: 0) RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0 State: Optimal Number Of Drives per...
[09:29:06] !log repooling db1082 - T85757
[09:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:10] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[09:29:17] (PS1) Banyek: Revert "mariadb: depool db1082" [mediawiki-config] - https://gerrit.wikimedia.org/r/479173
[09:31:35] (CR) Banyek: [C: 2] Revert "mariadb: depool db1082" [mediawiki-config] - https://gerrit.wikimedia.org/r/479173 (owner: Banyek)
[09:32:42] (Merged) jenkins-bot: Revert "mariadb: depool db1082" [mediawiki-config] - https://gerrit.wikimedia.org/r/479173 (owner: Banyek)
[09:33:58] (CR) Muehlenhoff: Remove now obsolete Diamond collector and related conffiles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/479169 (https://phabricator.wikimedia.org/T183454) (owner: Muehlenhoff)
[09:34:10] (CR) Hashar: "Make sure to deploy InitialiseSettings.php first in order to have the new feature flag populated. Else logging.php would trigger a notice " (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/478621 (https://phabricator.wikimedia.org/T211124) (owner: Filippo Giunchedi)
[09:34:44] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T85757: repool db1082 (duration: 00m 52s)
[09:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:48] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757
[09:34:59] (PS2) Muehlenhoff: Remove now obsolete Diamond collector and related conffile [puppet] - https://gerrit.wikimedia.org/r/479169 (https://phabricator.wikimedia.org/T183454)
[09:39:34] (PS2) Muehlenhoff: Add email address for joewalsh [puppet] - https://gerrit.wikimedia.org/r/479172
[09:40:08] !log repooling labsdb1010 - T210693
[09:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:12] T210693: Create materialized views on Wiki Replica hosts for better query performance - https://phabricator.wikimedia.org/T210693
[09:40:59] (PS1) Banyek: Revert "mariadb: depool db1110" [mediawiki-config] - https://gerrit.wikimedia.org/r/479174
[09:41:44] (CR) jenkins-bot: Revert "mariadb: depool db1082" [mediawiki-config] - https://gerrit.wikimedia.org/r/479173 (owner: Banyek)
[09:42:36] (PS2) Banyek: Revert "mariadb: depool db1110" [mediawiki-config] - https://gerrit.wikimedia.org/r/479174
[09:43:12] (CR) Muehlenhoff: [C: 2] Add email address for joewalsh [puppet] - https://gerrit.wikimedia.org/r/479172 (owner: Muehlenhoff)
[09:43:56] (CR) Banyek: [C: 2] Revert "mariadb: depool db1110" [mediawiki-config] - https://gerrit.wikimedia.org/r/479174 (owner: Banyek)
[09:44:58] (Merged) jenkins-bot: Revert "mariadb: depool db1110" [mediawiki-config] - https://gerrit.wikimedia.org/r/479174 (owner: Banyek)
[09:46:17] (PS1) Banyek: Revert "Revert "mariadb: depool db1110"" [mediawiki-config] - https://gerrit.wikimedia.org/r/479175
[09:46:38] I reverted the wrong patch ^
[09:46:46] Reverting the revert
[09:47:55] (CR) Banyek: [C: 2] Revert "Revert "mariadb: depool db1110"" [mediawiki-config] - https://gerrit.wikimedia.org/r/479175 (owner: Banyek)
[09:49:05] (Merged) jenkins-bot: Revert "Revert "mariadb: depool db1110"" [mediawiki-config] - https://gerrit.wikimedia.org/r/479175 (owner: Banyek)
[09:50:46] (PS1) Banyek: Revert "labsdb: depool labsdb1010" [puppet] - https://gerrit.wikimedia.org/r/479177
[09:52:06] (PS2) Banyek: Revert "labsdb: depool labsdb1010" [puppet] - https://gerrit.wikimedia.org/r/479177
[09:53:02] (CR) Banyek: [C: 2] Revert "labsdb: depool labsdb1010" [puppet] - https://gerrit.wikimedia.org/r/479177 (owner: Banyek)
[09:54:07] (CR) jenkins-bot: Revert "mariadb: depool db1110" [mediawiki-config] - https://gerrit.wikimedia.org/r/479174 (owner: Banyek)
[09:54:11] (CR) jenkins-bot: Revert "Revert "mariadb: depool db1110"" [mediawiki-config] - https://gerrit.wikimedia.org/r/479175 (owner: Banyek)
[09:54:30] Operations, Operations-Software-Development: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (fgiunchedi)
[10:05:08] Operations, Performance-Team, Traffic, media-storage: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (fgiunchedi) Thanks @Gilles for kickstarting this! For context these are the notes I took when we did the first round of cleanup a couple of years ba...
[10:08:45] !log executing schema change on db1070 (s5 master) - T85757 [10:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:49] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [10:16:35] 10Operations, 10Performance-Team, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), and 4 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10mobrovac) [10:17:37] (03PS1) 10Giuseppe Lavagetto: Remove references to the old, decommissioned etcd cluster [dns] - 10https://gerrit.wikimedia.org/r/479178 [10:23:33] (03CR) 10MarcoAurelio: [C: 04-1] "- Wikisources get Wikidata support. Why is this project left out of wikidataclient.dblist? (probably that means leaving this wiki out of n" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478945 (https://phabricator.wikimedia.org/T210752) (owner: 10Rafidaslam) [10:26:50] !log Icinga is having issue restarting properly, investigation ongoing [10:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:31] (03PS4) 10MarcoAurelio: Add NS_PROJECT localised name for tt.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477979 (https://phabricator.wikimedia.org/T211312) [10:35:25] (03PS11) 10DCausse: [cirrus] Start writing to psi & omega [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381) [10:35:27] (03PS11) 10DCausse: [cirrus] Start using replica group settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381) [10:35:29] (03PS13) 10DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381) [10:35:31] (03PS1) 10DCausse: [cirrus] fix temp clusters for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479180 (https://phabricator.wikimedia.org/T210381) 
[10:38:57] (03PS1) 10GTirloni: Limit manifest starts (max 10) [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/479181 (https://phabricator.wikimedia.org/T107878) [10:40:17] !log mobrovac@deploy1001 Started deploy [restbase/deploy@44e0955]: Bring restbase201[3-8] up to date, try #2 - T211416 [10:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:20] T211416: Put restbase201[3-8] into conftool and LVS - https://phabricator.wikimedia.org/T211416 [10:40:32] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@44e0955]: Bring restbase201[3-8] up to date, try #2 - T211416 (duration: 00m 15s) [10:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:22] !log mobrovac@deploy1001 Started deploy [restbase/deploy@44e0955]: Bring restbase201[3-8] up to date, try #2b - T211416 [10:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:56] (03PS2) 10GTirloni: Limit manifest starts (max 10) [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/479181 (https://phabricator.wikimedia.org/T107878) [10:48:39] 10Operations, 10Operations-Software-Development: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10ema) p:05Triage>03Normal [10:51:33] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@44e0955]: Bring restbase201[3-8] up to date, try #2b - T211416 (duration: 10m 11s) [10:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:37] T211416: Put restbase201[3-8] into conftool and LVS - https://phabricator.wikimedia.org/T211416 [10:52:56] (03CR) 10Ema: [C: 031] hiera: add trafficserver cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/478774 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [10:58:15] (03PS1) 10DCausse: elasticsearch: configure LVS endpoint for new eqiad clusters [puppet] - 10https://gerrit.wikimedia.org/r/479184 
(https://phabricator.wikimedia.org/T207195) [11:03:38] !log restarting Icinga with debug log on icinga1001 [11:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:46] 10Operations, 10WMF-Legal, 10Wikimedia-General-or-Unknown, 10Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (10GTirloni) I was wondering about this myself today and found this task. Without a license, it's not possible for someone to [[ https:/... [11:03:55] Can everyone stop deploying changes? We are having issues with icinga at the moment and we are blind [11:03:58] mobrovac: ^ [11:04:48] marostegui: i'm not deploying, no worries (the logs you see were just testing etcd on new rb hosts) [11:05:00] mobrovac: A right :) [11:05:16] (03CR) 10GTirloni: [C: 031] "Probably needs an updated CONTRIBUTORS/.mailmap but LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/183862 (https://phabricator.wikimedia.org/T67270) (owner: 10Rush) [11:11:19] (03PS2) 10Hoo man: Wikidata: Display Kartographer mapframes for geocoordinate statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477976 (https://phabricator.wikimedia.org/T184933) [11:14:16] !log restarting icinga with dropped downtimes from last night (start_date > 1544489652) [11:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:22] it was >= ;) and that's Tuesday, December 11, 2018 12:54:12 AM [11:16:27] (03PS1) 10GTirloni: toolforge: Increase shinken 'High iowait' to 70/90 warning/critical. 
[puppet] - 10https://gerrit.wikimedia.org/r/479187 (https://phabricator.wikimedia.org/T161898) [11:21:12] ACKNOWLEDGEMENT - cassandra-a CQL 10.192.48.46:9042 on restbase2005 is CRITICAL: connect to address 10.192.48.46 and port 9042: Connection refused Muehlenhoff T210843 [11:21:12] ACKNOWLEDGEMENT - cassandra-a SSL 10.192.48.46:7001 on restbase2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Muehlenhoff T210843 [11:21:13] ACKNOWLEDGEMENT - cassandra-b CQL 10.192.48.47:9042 on restbase2005 is CRITICAL: connect to address 10.192.48.47 and port 9042: Connection refused Muehlenhoff T210843 [11:21:13] ACKNOWLEDGEMENT - cassandra-b SSL 10.192.48.47:7001 on restbase2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Muehlenhoff T210843 [11:21:13] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.48.48:9042 on restbase2005 is CRITICAL: connect to address 10.192.48.48 and port 9042: Connection refused Muehlenhoff T210843 [11:21:13] ACKNOWLEDGEMENT - cassandra-c SSL 10.192.48.48:7001 on restbase2005 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Muehlenhoff T210843 [11:21:13] ACKNOWLEDGEMENT - cassandra-a CQL 10.192.48.49:9042 on restbase2006 is CRITICAL: connect to address 10.192.48.49 and port 9042: Connection refused Muehlenhoff T210843 [11:21:13] ACKNOWLEDGEMENT - cassandra-a SSL 10.192.48.49:7001 on restbase2006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused Muehlenhoff T210843 [11:30:42] !log re-enabled puppet on icinga[12]001, re-activated crontab to sync files on 2001 and manually run it + run puppet [11:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:23] <_joe_> mobrovac: what are you using etcd for on the rb hosts? 
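[editor's note] The downtime cutoff quoted in the 11:14:16 !log entry (1544489652) and the human-readable date given for it at 11:15:22 can be cross-checked with a few lines of Python. This is a minimal sketch, assuming the number is Unix epoch seconds and that the quoted date is in UTC; `describe_epoch` is an illustrative helper, not part of any Icinga tooling.

```python
from datetime import datetime, timezone

# Cutoff quoted in the 11:14:16 !log entry (downtimes with start_date >= this value).
DOWNTIME_CUTOFF = 1544489652


def describe_epoch(seconds: int) -> str:
    """Render Unix epoch seconds as a human-readable UTC timestamp."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # %A/%B/%p use the default (English) locale unless the program changes it.
    return moment.strftime("%A, %B %d, %Y %I:%M:%S %p")


if __name__ == "__main__":
    print(describe_epoch(DOWNTIME_CUTOFF))
```

Read as UTC, the value lands on the date given in the channel, which suggests the quoted time was UTC rather than a local timezone.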
[11:32:07] <_joe_> oh sorry, unrelated it seems [11:32:22] <_joe_> anyways, the problem is over, deployments can resume [11:34:13] ack :) [11:54:00] (03PS1) 10Muehlenhoff: Remove Diamond from ORES hosts [puppet] - 10https://gerrit.wikimedia.org/r/479189 (https://phabricator.wikimedia.org/T183454) [11:54:36] (03CR) 10jerkins-bot: [V: 04-1] Remove Diamond from ORES hosts [puppet] - 10https://gerrit.wikimedia.org/r/479189 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [11:55:40] (03PS2) 10Muehlenhoff: Remove Diamond from ORES hosts [puppet] - 10https://gerrit.wikimedia.org/r/479189 (https://phabricator.wikimedia.org/T183454) [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181212T1200). [12:00:04] hoo, Urbanecm, and Hauskatze: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:10] present [12:00:15] o/ [12:00:34] I can swat today [12:01:00] sorry, jet-lagged, just woke up, hoo please go ahead while I get ready [12:01:16] Will do :) [12:01:31] (03CR) 10Hoo man: [C: 032] Wikidata: Display Kartographer mapframes for geocoordinate statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477976 (https://phabricator.wikimedia.org/T184933) (owner: 10Hoo man) [12:02:30] (03Merged) 10jenkins-bot: Wikidata: Display Kartographer mapframes for geocoordinate statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477976 (https://phabricator.wikimedia.org/T184933) (owner: 10Hoo man) [12:03:55] !log hoo@deploy1001 Synchronized wmf-config/Wikibase.php: Display Kartographer mapframes for geocoordinate statements (T184933) (duration: 00m 52s) [12:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:59] T184933: Display map for geocoordinate statements - https://phabricator.wikimedia.org/T184933 [12:06:56] * Hauskatze bribes hoo to continue with the SWAT :P [12:09:09] I can continue until zeljkof is back [12:09:41] I'm bakc [12:09:44] back [12:09:47] :D [12:09:56] hoo: feel free to continue :D [12:10:01] but I can swat if you prefer [12:10:24] Would be nicer… I'll probably head for lunch sometime soon :) [12:10:41] ok, I'll swat [12:11:04] (03CR) 10jenkins-bot: Wikidata: Display Kartographer mapframes for geocoordinate statements [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477976 (https://phabricator.wikimedia.org/T184933) (owner: 10Hoo man) [12:12:29] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478890 (https://phabricator.wikimedia.org/T198507) (owner: 10Urbanecm) [12:12:39] Urbanecm: does 478890 need to be tested? 
[12:13:14] No, you can deploy to prod directly [12:13:34] (03Merged) 10jenkins-bot: Upload new logos for cawikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478890 (https://phabricator.wikimedia.org/T198507) (owner: 10Urbanecm) [12:17:29] !log zfilipin@deploy1001 Synchronized static/images/project-logos/: SWAT: [[gerrit:478890|Upload new logos for cawikimedia (T198507)]] (duration: 00m 52s) [12:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:32] T198507: Change logo on the Wiki of Wikimedia Canada - https://phabricator.wikimedia.org/T198507 [12:17:53] Urbanecm: 478890 deployed, purging [12:18:33] ack [12:19:28] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10aborrero) p:05Low>03Normal My team agreed on following up with eqiad1. The only requirement is we have a clear rollback pla... [12:19:53] Urbanecm: purged, please test new logos [12:19:59] will do zeljkof [12:22:08] (03CR) 10Hoo man: [C: 031] use lbzip2 for recompression of wikidata weeky json dumps [puppet] - 10https://gerrit.wikimedia.org/r/474159 (https://phabricator.wikimedia.org/T206535) (owner: 10ArielGlenn) [12:22:16] Urbanecm: is 478891 related to a phab task? T198507? 
[12:22:28] let me check [12:22:56] (03PS2) 10Urbanecm: Use HD logos for cawikimedia in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478891 (https://phabricator.wikimedia.org/T198507) [12:23:01] indeed, added to commit msg [12:23:24] (03CR) 10jenkins-bot: Upload new logos for cawikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478890 (https://phabricator.wikimedia.org/T198507) (owner: 10Urbanecm) [12:23:29] thanks [12:23:35] yw [12:23:43] (03PS3) 10Zfilipin: Use HD logos for cawikimedia in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478891 (https://phabricator.wikimedia.org/T198507) (owner: 10Urbanecm) [12:24:03] Urbanecm: does it need to be tested at mwdebug? [12:24:18] I don't think so, should be fully covered by tests [12:24:23] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478891 (https://phabricator.wikimedia.org/T198507) (owner: 10Urbanecm) [12:25:26] (03Merged) 10jenkins-bot: Use HD logos for cawikimedia in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478891 (https://phabricator.wikimedia.org/T198507) (owner: 10Urbanecm) [12:27:29] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:478891|Use HD logos for cawikimedia in IS.php (T198507)]] (duration: 00m 52s) [12:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:33] T198507: Change logo on the Wiki of Wikimedia Canada - https://phabricator.wikimedia.org/T198507 [12:27:43] Urbanecm: 478891 deployed [12:27:46] ack [12:28:01] (03PS2) 10Zfilipin: Enable extension SandboxLink for nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478896 (https://phabricator.wikimedia.org/T210325) (owner: 10Urbanecm) [12:29:08] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478896 (https://phabricator.wikimedia.org/T210325) (owner: 10Urbanecm) [12:29:23] Urbanecm: does 478896 need to be tested? 
[12:29:32] (at mwdebug) [12:29:43] I prefer testing it [12:30:04] ok [12:30:12] (03Merged) 10jenkins-bot: Enable extension SandboxLink for nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478896 (https://phabricator.wikimedia.org/T210325) (owner: 10Urbanecm) [12:30:38] PROBLEM - High load average on labstore1007 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [12:31:00] Urbanecm: it's at mwdebug1002 [12:31:07] looking [12:31:07] 478896 [12:31:29] zeljkof, lgtm, please deploy [12:31:45] ok, deploying [12:32:45] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:478896|Enable extension SandboxLink for nowiki (T210325)]] (duration: 00m 52s) [12:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:48] T210325: Add the extension "SandboxLink" for nowiki - https://phabricator.wikimedia.org/T210325 [12:33:08] Urbanecm: deployed [12:33:11] ack, thanks [12:33:49] (03PS2) 10Zfilipin: Add new namespace aliases for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471345 (https://phabricator.wikimedia.org/T207544) (owner: 10Arcayn) [12:34:27] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471345 (https://phabricator.wikimedia.org/T207544) (owner: 10Arcayn) [12:35:27] !log T205969 icinga downtime load-avg check for labstore1007 until January (1 month) [12:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:31] T205969: labstore1007: high load avg issue - https://phabricator.wikimedia.org/T205969 [12:35:32] (03Merged) 10jenkins-bot: Add new namespace aliases for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471345 (https://phabricator.wikimedia.org/T207544) (owner: 10Arcayn) [12:35:59] (03CR) 10jenkins-bot: Use HD logos for cawikimedia in IS.php [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/478891 (https://phabricator.wikimedia.org/T198507) (owner: 10Urbanecm) [12:36:01] (03CR) 10jenkins-bot: Enable extension SandboxLink for nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478896 (https://phabricator.wikimedia.org/T210325) (owner: 10Urbanecm) [12:36:03] (03CR) 10jenkins-bot: Add new namespace aliases for zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471345 (https://phabricator.wikimedia.org/T207544) (owner: 10Arcayn) [12:36:16] Urbanecm: 471345 at mwdebug1002 [12:36:23] looking [12:39:49] zeljkof, please deploy and run namespaceDupes.php afterwards [12:40:10] ok [12:41:10] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:471345|Add new namespace aliases for zhwikiversity (T207544)]] (duration: 00m 52s) [12:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:14] T207544: Create namespace aliases in zhwikiversity - https://phabricator.wikimedia.org/T207544 [12:43:12] (03CR) 10Zfilipin: "`mwscript namespaceDupes.php zhwikiversity --fix` results at T207544#4816698" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471345 (https://phabricator.wikimedia.org/T207544) (owner: 10Arcayn) [12:43:31] Urbanecm: deployed, `mwscript namespaceDupes.php zhwikiversity --fix` results at T207544#4816698 [12:44:20] Hauskatze: please stand by, you're next [12:44:30] zeljkof: Tu sam :) [12:44:32] (03PS5) 10Zfilipin: Add NS_PROJECT localised name for tt.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477979 (https://phabricator.wikimedia.org/T211312) (owner: 10MarcoAurelio) [12:45:40] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477979 (https://phabricator.wikimedia.org/T211312) (owner: 10MarcoAurelio) [12:46:43] (03Merged) 10jenkins-bot: Add NS_PROJECT localised name for tt.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477979 
(https://phabricator.wikimedia.org/T211312) (owner: 10MarcoAurelio) [12:47:35] Hauskatze: it's at mwdebug1002, please test [12:48:09] zeljkof: looks okay, don't forget namespace dupes after deploying [12:48:20] zeljkof, thank you for deploying my patches [12:48:21] ok, deploying and running the script [12:48:29] Urbanecm: I'm glad I could help :) [12:48:34] (03CR) 10jenkins-bot: Add NS_PROJECT localised name for tt.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477979 (https://phabricator.wikimedia.org/T211312) (owner: 10MarcoAurelio) [12:49:29] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:477979|Add NS_PROJECT localised name for tt.wiktionary (T211312)]] (duration: 00m 52s) [12:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:33] T211312: Add NS_PROJECT translation for tt.wiktionary - https://phabricator.wikimedia.org/T211312 [12:51:07] (03CR) 10Zfilipin: "`mwscript namespaceDupes.php ttwiktionary --fix` results at T211312#4816718" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477979 (https://phabricator.wikimedia.org/T211312) (owner: 10MarcoAurelio) [12:51:13] Hauskatze: `mwscript namespaceDupes.php ttwiktionary --fix` results at T211312#4816718 [12:51:18] and it's deployed [12:51:22] thanks :) [12:51:51] lunch time! 
- see you [12:51:54] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: export NFS dumps for soweego project [puppet] - 10https://gerrit.wikimedia.org/r/479195 (https://phabricator.wikimedia.org/T209818) [12:52:04] !log EU SWAT finished [12:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:06] RECOVERY - High load average on labstore1007 is OK: OK: Less than 85.00% above the threshold [16.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [12:57:14] (03CR) 10GTirloni: [C: 031] cloudvps: export NFS dumps for soweego project [puppet] - 10https://gerrit.wikimedia.org/r/479195 (https://phabricator.wikimedia.org/T209818) (owner: 10Arturo Borrero Gonzalez) [12:58:57] 10Operations, 10Traffic, 10Continuous-Integration-Infrastructure (Slipway), 10Patch-For-Review, 10User-ArielGlenn: CI jobs for authdns linting need to run on Stretch - https://phabricator.wikimedia.org/T205439 (10hashar) @BBlack that refactoring is awesome! As for why the task got stuck, I did a first a... [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181212T1300) [13:03:56] (03PS12) 10Rafidaslam: Initial configuration for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478945 (https://phabricator.wikimedia.org/T210752) [13:04:44] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478945 (https://phabricator.wikimedia.org/T210752) (owner: 10Rafidaslam) [13:10:12] 10Operations, 10Analytics, 10Security-Team, 10WMF-Legal, 10Software-Licensing: Can exfat be used in WMF production? - https://phabricator.wikimedia.org/T210667 (10JBennett) >>! In T210667#4812704, @Legoktm wrote: >>>! In T210667#4795435, @JBennett wrote: >> Thanks everyone of for their thoughtful conside... [13:19:15] (03CR) 10Filippo Giunchedi: "Good start! 
A few comments inline" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [13:22:19] jouncebot: next [13:22:19] In 0 hour(s) and 37 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181212T1400) [13:23:05] (03CR) 10Volans: [C: 04-1] "I think there is one missing bit. The rest looks good, I'm not sure about the list of hosts that should be in one or the other cluster, I'" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/479184 (https://phabricator.wikimedia.org/T207195) (owner: 10DCausse) [13:23:29] 10Operations, 10Traffic, 10Continuous-Integration-Infrastructure (Slipway), 10Patch-For-Review, 10User-ArielGlenn: CI jobs for authdns linting need to run on Stretch - https://phabricator.wikimedia.org/T205439 (10BBlack) >>! In T205439#4816740, @hashar wrote: > Out of curiosity: how do you ship the GeoDN... [13:24:45] (03PS1) 10Marostegui: db-eqiad.php: Depool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479198 (https://phabricator.wikimedia.org/T202167) [13:25:56] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479198 (https://phabricator.wikimedia.org/T202167) (owner: 10Marostegui) [13:26:51] !log installing PHP security updates [13:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:00] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479198 (https://phabricator.wikimedia.org/T202167) (owner: 10Marostegui) [13:28:00] (03CR) 10DCausse: elasticsearch: configure LVS endpoint for new eqiad clusters (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/479184 (https://phabricator.wikimedia.org/T207195) (owner: 10DCausse) [13:28:11] (03PS2) 10DCausse: elasticsearch: configure LVS endpoint for new eqiad clusters [puppet] - 
10https://gerrit.wikimedia.org/r/479184 (https://phabricator.wikimedia.org/T207195) [13:28:18] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1098:3316 db1098:3317 for kernel and mysql upgrade (duration: 00m 52s) [13:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:21] (03CR) 10Volans: [C: 031] "LGTM (modulo which host should have which service)" [puppet] - 10https://gerrit.wikimedia.org/r/479184 (https://phabricator.wikimedia.org/T207195) (owner: 10DCausse) [13:32:53] (03PS13) 10Rafidaslam: Initial configuration for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478945 (https://phabricator.wikimedia.org/T210752) [13:33:18] (03PS1) 10BBlack: Add README explaining the small binary db here [dns] - 10https://gerrit.wikimedia.org/r/479204 (https://phabricator.wikimedia.org/T205439) [13:34:01] 10Operations, 10Traffic, 10Continuous-Integration-Infrastructure (Slipway), 10Patch-For-Review, 10User-ArielGlenn: CI jobs for authdns linting need to run on Stretch - https://phabricator.wikimedia.org/T205439 (10BBlack) ^ Fixing it to be self-explanatory! 
:) [13:34:35] 10Operations, 10ops-codfw, 10Core Platform Team, 10Services (doing), and 2 others: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 (10Eevans) [13:34:55] (03CR) 10BBlack: [C: 032] Add README explaining the small binary db here [dns] - 10https://gerrit.wikimedia.org/r/479204 (https://phabricator.wikimedia.org/T205439) (owner: 10BBlack) [13:36:07] (03CR) 10Rafidaslam: "> Patch Set 11: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478945 (https://phabricator.wikimedia.org/T210752) (owner: 10Rafidaslam) [13:37:43] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479198 (https://phabricator.wikimedia.org/T202167) (owner: 10Marostegui) [13:39:01] (03PS1) 10Banyek: mariadb: depooling db1122 for renaming tables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479205 (https://phabricator.wikimedia.org/T211544) [13:41:31] banyek: you don't really need to depool a host to rename tables [13:42:21] If they are not in use you should encounter no metadata locking issues [13:42:22] I wanted to make the process as close to the drop one as possible even though renaming is atomic [13:42:30] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloudvps: export NFS dumps for soweego project [puppet] - 10https://gerrit.wikimedia.org/r/479195 (https://phabricator.wikimedia.org/T209818) (owner: 10Arturo Borrero Gonzalez) [13:42:40] But I abandon the change then [13:42:48] and do the rename as-is [13:42:50] banyek: You don't need to depool it to drop them either ;) [13:44:57] marostegui: in one hour, query patterns for lots of API queries + watchlist will change for commons and wikidata (tag_summary -> change_tag thingy) I already checked stuff for mediawiki.org and looked fine [13:45:11] but if things went mayhem, it's probably this [13:45:16] Amir1: Thanks for the heads up [13:45:19] banyek: ^ [13:45:37] kk [13:48:46] I always forgot, that we don't have 
to scan through buffer pools before dropping tables anymore [13:48:56] 10Operations, 10Performance-Team, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), and 4 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10Joe) I don't think we need to overthink this, but knowing what kind of latency increase we can expect migh... [13:52:56] (03Abandoned) 10Banyek: mariadb: depooling db1122 for renaming tables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479205 (https://phabricator.wikimedia.org/T211544) (owner: 10Banyek) [13:57:41] 10Operations, 10Performance-Team, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), and 4 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10Eevans) >>! In T211721#4816810, @Joe wrote: > I don't think we need to overthink this, but knowing what ki... [13:58:03] (03CR) 10Mathew.onipe: [C: 031] "Double checked nodes to clusters config. All seems good!" [puppet] - 10https://gerrit.wikimedia.org/r/479184 (https://phabricator.wikimedia.org/T207195) (owner: 10DCausse) [14:00:04] zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - European version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181212T1400). [14:00:39] o/ [14:00:43] lol [14:01:24] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:04:48] !log Stop MySQL on db1098:3316 and db1098:3317 for kernel and mysql upgrade [14:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:58] (03PS1) 10Ladsgroup: Revert "Revert "ores: Remove added celery configs"" [puppet] - 10https://gerrit.wikimedia.org/r/479206 [14:05:26] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "ores: Remove added celery configs"" [puppet] - 10https://gerrit.wikimedia.org/r/479206 (owner: 10Ladsgroup) [14:05:38] Amir1: Only commons and wikidata will be "affected"? [14:06:17] marostegui: group1, most notably wikidata and commons but hewiki, cawiki and some others might show weird things [14:06:29] (03PS7) 10Arturo Borrero Gonzalez: Neutron: allow VMs to access the neutron API [puppet] - 10https://gerrit.wikimedia.org/r/478786 (https://phabricator.wikimedia.org/T211391) (owner: 10Andrew Bogott) [14:07:00] Amir1: ok - keep an eye on fatals and https://logstash.wikimedia.org/goto/9ac0b716462a5627846cdc060b521c38 please :) [14:07:20] marostegui: sure thing! 
[14:07:26] Cheers [14:07:31] (03CR) 10Arturo Borrero Gonzalez: [C: 032] Neutron: allow VMs to access the neutron API [puppet] - 10https://gerrit.wikimedia.org/r/478786 (https://phabricator.wikimedia.org/T211391) (owner: 10Andrew Bogott) [14:07:52] !log mobrovac@deploy1001 Started restart [restbase/deploy@5946231]: Restart RB to pick up the new seeds in codfw - T211416 [14:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:57] T211416: Put restbase201[3-8] into conftool and LVS - https://phabricator.wikimedia.org/T211416 [14:09:25] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Compiler: https://puppet-compiler.wmflabs.org/compiler1002/13904/" [puppet] - 10https://gerrit.wikimedia.org/r/478786 (https://phabricator.wikimedia.org/T211391) (owner: 10Andrew Bogott) [14:11:02] (03PS1) 10Zfilipin: group1 wikis to 1.33.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479207 [14:11:04] (03CR) 10Zfilipin: [C: 032] group1 wikis to 1.33.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479207 (owner: 10Zfilipin) [14:12:05] zeljkof: Please ping me when the train is done (and determined stable) this morning. I want to make a config change affecting group 2. Thanks! [14:12:13] (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479207 (owner: 10Zfilipin) [14:12:21] anomie: sure [14:13:51] 10Operations, 10Analytics, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10bmansurov) @Nuria, what you mention makes sense. I created this task in order to get the current recommendations into MySQL. I think we... 
[14:14:00] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.8 [14:14:46] (03CR) 10Herron: "> Would it be ok to reuse the group-id for all logstash kafka" [puppet] - 10https://gerrit.wikimedia.org/r/479136 (https://phabricator.wikimedia.org/T205850) (owner: 10Herron) [14:14:52] !log zfilipin@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.8 (duration: 00m 51s) [14:15:44] (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479207 (owner: 10Zfilipin) [14:16:01] zfilipin@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [14:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:34] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Anycast (Auth)DNS - https://phabricator.wikimedia.org/T98006 (10BBlack) Some interesting stuff here (see also the Mailing Lists link there in the datatracker for discussion): https://datatracker.ietf.org/doc/draft-moura-dnsop-authoritative-recommendati... [14:18:10] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational [14:18:52] Amir1: Is the change deployed? 
[14:18:54] !log restart uwsgi-netbox on netmon2001 [14:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:14] marostegui: on its way [14:19:27] Amir1: I am seeing things going a bit crazy on a commons api slave [14:19:38] hm, looks like this started to happen when I deployed wmf.8 to group1 [14:19:40] wikidata is on wmf.8 meaning it's deployed there [14:19:43] `Wikimedia\Rdbms\Database::selectSQLText called from ApiBase::filterIDs with incorrect parameters: $conds must be a string or an array` [14:19:47] banyek: ^ [14:19:49] https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors?_g=h@bfc149c&_a=h@6ad664e [14:20:01] Amir1: https://grafana.wikimedia.org/d/000000273/mysql?panelId=3&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1084&var-port=9104&kiosk&refresh=10s&from=now-3h&to=now (this is a commons API slave) [14:20:12] it might be me. Let me double check everything [14:20:26] * banyek trying to understand [14:20:54] wait, is there another deployment going on during train window? cc Amir1 [14:21:02] marostegui: maybe because it's not cached, let's see if it warms up [14:21:09] zeljkof: no, it's part of wmf.8 [14:21:24] Amir1: yeah, let's give it some minutes [14:22:35] (03PS1) 10GTirloni: prometheus: fix openstack-exporter cache dir permission [puppet] - 10https://gerrit.wikimedia.org/r/479209 (https://phabricator.wikimedia.org/T211766) [14:22:47] zeljkof: hey, I can't find Wikimedia\Rdbms\Database::selectSQLText called from ApiBase::filterIDs with incorrect parameters: $conds must be a string or an array on logstash [14:23:24] found it [14:23:26] on dbquery [14:23:34] marostegui: it's probably related [14:23:36] yeah, huge spike on dbquery [14:23:45] can we revert? [14:23:50] and it just doubled [14:24:05] * anomie sees an ApiBase error being discussed and starts investigating [14:24:11] Amir1, marostegui: should I revert deployment to group1? [14:24:18] Hey anomie welcome to the party! 
:) [14:24:23] <_joe_> revert whatever caused the spike [14:24:31] I'll have a patch up in a minute [14:24:49] <_joe_> marostegui: are we in an outage? [14:24:52] Not yet [14:25:06] ok, so revert, merge patch, try again? cc Amir1 [14:25:10] <_joe_> +1 [14:25:21] Is there a bug number yet? [14:25:21] +1 to that yes [14:25:21] <_joe_> zeljkof: actually +2 [14:25:22] <_joe_> do it [14:25:28] <_joe_> :P [14:25:35] mediawiki-errors is not happy, reverting https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors [14:25:37] https://grafana.wikimedia.org/d/000000273/mysql?panelId=3&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1084&var-port=9104&kiosk&refresh=10s&from=now-3h&to=now [14:26:08] marostegui: the handler stat is going down, that's good [14:26:25] zeljkof, marostegui: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/479210 [14:26:39] Amir1: going down? it went from 8k to 300k :) [14:27:01] let's merge anomie's patch then? [14:27:20] sure, it would fix the API error issue [14:27:36] Amir1: what about the full scans? [14:27:37] it seems to be going down [14:27:46] marostegui: now it's 174k :D [14:28:08] maybe it's from something else [14:28:12] <_joe_> can we first revert, then decide calmly which patches are needed? [14:28:16] <_joe_> please [14:29:00] <_joe_> zeljkof: are you reverting then? [14:29:07] _joe_: reverting [14:29:11] it spikes again [14:29:11] running scap [14:29:26] (03CR) 10Filippo Giunchedi: [C: 031] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/479136 (https://phabricator.wikimedia.org/T205850) (owner: 10Herron) [14:29:31] (03CR) 10Arturo Borrero Gonzalez: [C: 031] "Worth checking if this creates any conflict with the systemd service. Other than that LGTM."
[puppet] - 10https://gerrit.wikimedia.org/r/479209 (https://phabricator.wikimedia.org/T211766) (owner: 10GTirloni) [14:29:33] Amir1: on wikidata things seems to be fine (not sure if it is deployed there) [14:29:46] it's deployed there [14:29:52] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.33.0-wmf.8" [14:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:00] reverted [14:30:08] Amir1: Interesting, I don't see the same behaviour there [14:30:14] The API error is super simple: Passing null or false for $conds to $db->select() is deprecated and logs a warning, and a recent patch added code that did that. The fix is trivial, just change it to empty-string or empty-array. [14:30:22] <_joe_> marostegui, banyek do you see the query going down? [14:30:35] <_joe_> the query rate, even [14:30:42] marostegui: I think it's because of the ApiBase issue (not related to change_tag) [14:30:51] otherwise wikidata would also go crazy [14:30:58] PROBLEM - Apache HTTP on mw1274 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [14:31:16] _joe_: no, full scans are still there on commons only [14:31:22] I am looking on the graph mentioned before it went down a bit, but spiked again. Now it seems settling down but that doesn't mean anything [14:31:59] mediawiki-errors back to normal after the revert [14:32:10] RECOVERY - Apache HTTP on mw1274 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.180 second response time [14:32:14] please create a phab ticket and reference it in the commit that fixes the problem [14:32:20] we need it for tracking incidents [14:32:35] 10Operations, 10Epic, 10cloud-services-team (Kanban): CloudVPS: our ideal future model - https://phabricator.wikimedia.org/T209460 (10aborrero) >>! In T209460#4815447, @Multichill wrote: > Can someone point me to the current network layout? 
Vlans, ip space in use, what's used to route/filter traffic, etc.? K... [14:32:40] <_joe_> give it time [14:32:45] <_joe_> it's going down now [14:32:46] ok, it is back to normal values now [14:32:48] https://grafana.wikimedia.org/d/000000273/mysql?panelId=3&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1084&var-port=9104&kiosk&from=1544624163731&to=1544625153818 [14:32:53] <_joe_> ok [14:33:00] <_joe_> until we're sure what caused that [14:33:08] <_joe_> we should not move forward with the release [14:33:23] I'll create a phab task [14:33:24] <_joe_> (I feel like captain obvious right now, but still) [14:33:25] marostegui:change_tag table on wikidata has an order of magnitude higher number of rows than commons [14:34:08] Amir1: Yeah, I was checking sizes [14:35:08] (03PS2) 10GTirloni: prometheus: fix openstack-exporter cache dir permission [puppet] - 10https://gerrit.wikimedia.org/r/479209 (https://phabricator.wikimedia.org/T211766) [14:35:41] (03CR) 10jerkins-bot: [V: 04-1] prometheus: fix openstack-exporter cache dir permission [puppet] - 10https://gerrit.wikimedia.org/r/479209 (https://phabricator.wikimedia.org/T211766) (owner: 10GTirloni) [14:36:01] Amir1, anomie: T211769 [14:36:02] T211769: Wikimedia\Rdbms\Database::selectSQLText called from ApiBase::filterIDs with incorrect parameters: $conds must be a string or an array - https://phabricator.wikimedia.org/T211769 [14:36:06] (03PS7) 10Filippo Giunchedi: logstash: add new logging kafka consumer [puppet] - 10https://gerrit.wikimedia.org/r/476472 (https://phabricator.wikimedia.org/T205851) [14:36:56] (03PS3) 10GTirloni: prometheus: fix openstack-exporter cache dir permission [puppet] - 10https://gerrit.wikimedia.org/r/479209 (https://phabricator.wikimedia.org/T211766) [14:37:11] (03CR) 10jerkins-bot: [V: 04-1] logstash: add new logging kafka consumer [puppet] - 10https://gerrit.wikimedia.org/r/476472 (https://phabricator.wikimedia.org/T205851) (owner: 10Filippo Giunchedi) [14:38:10] (03CR) 
10GTirloni: [C: 032] prometheus: fix openstack-exporter cache dir permission [puppet] - 10https://gerrit.wikimedia.org/r/479209 (https://phabricator.wikimedia.org/T211766) (owner: 10GTirloni) [14:42:29] (03PS1) 10Zfilipin: Revert "group1 wikis to 1.33.0-wmf.8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479213 [14:43:11] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479215 [14:43:54] !log renaming tables on db1122 ptwiki: flagged* -> T211544_flagged* - T211544 [14:43:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:57] T211544: Drop FlaggedRevs tables in database for ptwikipedia - https://phabricator.wikimedia.org/T211544 [14:44:23] (03CR) 10Zfilipin: [C: 032] Revert "group1 wikis to 1.33.0-wmf.8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479213 (owner: 10Zfilipin) [14:45:10] anomie: looks like you're working on T211769, ok to assign it to you? [14:45:11] T211769: Wikimedia\Rdbms\Database::selectSQLText called from ApiBase::filterIDs with incorrect parameters: $conds must be a string or an array - https://phabricator.wikimedia.org/T211769 [14:45:28] (03Merged) 10jenkins-bot: Revert "group1 wikis to 1.33.0-wmf.8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479213 (owner: 10Zfilipin) [14:46:00] zeljkof: I'm already done with T211769, the patch is waiting for review and backport. [14:46:43] (03CR) 10Andrew Bogott: [C: 031] toolforge: Increase shinken 'High iowait' to 70/90 warning/critical. [puppet] - 10https://gerrit.wikimedia.org/r/479187 (https://phabricator.wikimedia.org/T161898) (owner: 10GTirloni) [14:46:57] anomie: I'm waiting for jenkins before +2ing it [14:47:13] as for the backport, zeljkof what do you think? [14:48:03] Unless you're going to skip wmf.8 entirely, it'll need to be backported before the train can go forward. FWIW it's probably happening on group 0 already, just at a low enough rate to not be noticed. 
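The ApiBase warning tracked in T211769 comes down to a caller passing null or false where the query builder expects a string or an array of conditions. A rough Python sketch of that pattern follows. It is not MediaWiki's PHP code: `select_sql_text`, `filter_ids_buggy` and `filter_ids_fixed` are hypothetical names, and the "warn, then treat as no conditions" behaviour is an assumption based on the description in the log.

```python
import warnings


def select_sql_text(table, fields, conds):
    """Build a SELECT statement. Hypothetical stand-in, not the real Rdbms API."""
    if not isinstance(conds, (str, list)):
        # Assumed behaviour: log a deprecation warning, then act as if
        # no conditions were given.
        warnings.warn("conds must be a string or an array",
                      DeprecationWarning, stacklevel=2)
        conds = []
    if isinstance(conds, list):
        conds = " AND ".join(conds)
    sql = f"SELECT {','.join(fields)} FROM {table}"
    return f"{sql} WHERE {conds}" if conds else sql


def filter_ids_buggy(ids):
    # Passes None when there is nothing to filter on, which triggers the warning.
    conds = [f"page_id IN ({','.join(map(str, ids))})"] if ids else None
    return select_sql_text("page", ["page_id"], conds)


def filter_ids_fixed(ids):
    # Passes an empty list instead, so no warning is logged.
    conds = [f"page_id IN ({','.join(map(str, ids))})"] if ids else []
    return select_sql_text("page", ["page_id"], conds)
```

Both callers produce the same SQL for an empty input; only the buggy one emits the warning, which matches the observation that the error flooded the logs without by itself changing query results.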
[14:49:19] Amir1, please backport, feel free to deploy, or I can do it, whatever you prefer, I'll roll the train forward then [14:49:27] For the record, I have no idea about the full scans. The warning from ApiBase shouldn't have anything to do with that. [14:49:29] Sure. [14:49:45] anomie: I'll assign you to the task then, since you are the only one with the patch [14:50:18] oh, I see you've already claimed it :) [14:50:43] I'll create an incident report for the train [14:52:12] zeljkof: can I deploy wmf-config? [14:52:46] (03CR) 10jenkins-bot: Revert "group1 wikis to 1.33.0-wmf.8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479213 (owner: 10Zfilipin) [14:53:20] marostegui: what do you need to deploy? [14:53:30] it was a DB repool, but I can wait [14:53:32] train is paused, if there is something you need to do, go ahead [14:53:58] zeljkof: that doesn't mess up with the train then? [14:54:41] I'm not deploying anything until 479210 is merged and backported [14:54:48] ok! [14:54:50] marostegui: I really don't know [14:55:03] but train is on hold for now [14:55:16] zeljkof: I am just seeing something not pushed on deploy1001 on mediawiki-staging [14:55:21] so not sure if that is you or not [14:55:27] Your branch is ahead of 'origin/master' by 1 commit [14:56:27] marostegui: tendril doesn't give out any slow queries on commons :/ [14:57:05] Amir1: Yeah, I also checked that [14:59:12] (03PS2) 10Robingan7: "Use uploaded logos in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478570 [14:59:27] (03CR) 10jerkins-bot: [V: 04-1] "Use uploaded logos in IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478570 (owner: 10Robingan7) [14:59:34] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Not in production of course, but yes I 'll help" (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/479026 (https://phabricator.wikimedia.org/T211708) (owner: 10Jeena Huneidi) [15:03:10] (03CR) 10Ebe123: [C: 04-1] "Some comments" (033 comments) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/478570 (owner: 10Robingan7) [15:04:22] 10Operations, 10Performance-Team, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), and 4 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10Anomie) >>! In T211721#4816819, @Eevans wrote: >>>! In T211721#4816810, @Joe wrote: >> I don't think we ne... [15:05:31] (03PS1) 10Muehlenhoff: Update stat alias for cumin for new role [puppet] - 10https://gerrit.wikimedia.org/r/479220 [15:06:19] marostegui: A query like https://commons.wikimedia.org/w/api.php?titles=File%3ALocation_Of_Chelno-Vershinsky_District_%28Samara_Oblast%29.svg&iiprop=url%7Cthumbnail%7Ctimestamp&iiurlwidth=300&iiurlheight=-1&iiurlparam=300px&prop=imageinfo&format=json&action=query&redirects=true&uselang=ru timed out, it doesn't use change_tag. It's something else [15:06:33] (I got the query from logstash) [15:07:37] you've got the SQL? [15:07:50] Can dig it up, it's not in the logs [15:07:57] (03PS2) 10Cmjohnson: Adding mgmt dns for sessionstore100[1-3] [dns] - 10https://gerrit.wikimedia.org/r/479027 (https://phabricator.wikimedia.org/T209393) [15:08:03] https://logstash.wikimedia.org/goto/b03356bbf7a313593ff88de2166b9041 [15:08:24] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns for sessionstore100[1-3] [dns] - 10https://gerrit.wikimedia.org/r/479027 (https://phabricator.wikimedia.org/T209393) (owner: 10Cmjohnson) [15:08:51] basically everything times out, It might be because it got so overloaded that started to timeout for everything [15:09:00] yeah, could be that [15:09:11] I wanted to try the query manually [15:13:44] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I 've pasted the results of the ab tests in https://phabricator.wikimedia.org/P7909 btw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/479026 (https://phabricator.wikimedia.org/T211708) (owner: 10Jeena Huneidi) [15:15:00] marostegui: sorry for the late reply, this is what I did 
https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Rollback [15:15:35] I guess adding change-id to the commit message changed the commit? hm, it shouldn't right? [15:16:11] maybe `git fetch` at deploy1001 would fix it, since the commit was made there, then pushed to gerrit [15:16:22] zeljkof: Don't know - what I am saying is that deploy1001 is now looking like: https://phabricator.wikimedia.org/P7910 [15:17:20] marostegui: fixed with `git fetch` :) [15:17:33] \o/ [15:17:36] `Your branch is up-to-date with 'origin/master'.` [15:17:46] I will deploy my change then! Thanks :) [15:17:57] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479215 (owner: 10Marostegui) [15:18:01] in case of revert, the commit is created first there, then pushed to gerrit [15:18:14] so we don't have to wait for gerrit/jenkins [15:18:22] yeah, makes sense [15:18:22] marostegui: Regarding the full scans, my best guess from what I can see of the queries in T211769#4817076 is that they came from ImageListPager. I really wish I could see the whole queries.
[15:18:23] T211769: Wikimedia\Rdbms\Database::selectSQLText called from ApiBase::filterIDs with incorrect parameters: $conds must be a string or an array - https://phabricator.wikimedia.org/T211769 [15:18:27] (03PS1) 10Banyek: mariadb: dbstore_multi - dbstore1003-dbstre1005 [puppet] - 10https://gerrit.wikimedia.org/r/479224 (https://phabricator.wikimedia.org/T210478) [15:18:42] anomie: yeah, it is silly that sys DB redacts the query :( [15:19:01] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479215 (owner: 10Marostegui) [15:19:03] (03CR) 10jerkins-bot: [V: 04-1] mariadb: dbstore_multi - dbstore1003-dbstre1005 [puppet] - 10https://gerrit.wikimedia.org/r/479224 (https://phabricator.wikimedia.org/T210478) (owner: 10Banyek) [15:19:11] anomie: But are there any changes to that file on this train version? [15:19:37] marostegui: Just one. That I don't see how it would have done that, which is why I want to see more details... [15:20:30] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1098:3316 db1098:3317 after kernel and mysql upgrade (duration: 00m 53s) [15:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:24] !log decommissioning cassandra-c, restbase2006 -- T210843 [15:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:28] T210843: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 [15:22:04] PROBLEM - cassandra-b SSL 10.192.48.50:7001 on restbase2006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:22:14] anomie: I think I got the whole query [15:22:23] which one you think is the one? 
[15:22:28] 10Operations, 10ops-codfw, 10Core Platform Team, 10Services (doing), and 2 others: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 (10Eevans) [15:22:39] !log ladsgroup@deploy1001 Synchronized php-1.33.0-wmf.8/includes/api/ApiBase.php: T211769 (duration: 00m 52s) [15:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:45] (03PS2) 10Banyek: mariadb: dbstore_multi - dbstore1003-dbstore1005 [puppet] - 10https://gerrit.wikimedia.org/r/479224 (https://phabricator.wikimedia.org/T210478) [15:22:46] anomie: zeljkof ^ [15:22:50] PROBLEM - cassandra-b CQL 10.192.48.50:9042 on restbase2006 is CRITICAL: connect to address 10.192.48.50 and port 9042: Connection refused [15:22:51] marostegui: Any query that has oi_timestamp AS img_timestamp as the first field in the SELECT. [15:23:09] Let me update the task [15:23:14] I got the full query [15:23:33] Amir1: this resolves T211769? ok to move the train forward again? [15:23:34] T211769: Wikimedia\Rdbms\Database::selectSQLText called from ApiBase::filterIDs with incorrect parameters: $conds must be a string or an array - https://phabricator.wikimedia.org/T211769 [15:23:58] (03PS1) 10Vgutierrez: certcentral: Provide TLS certificates for mx[12]001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/479226 (https://phabricator.wikimedia.org/T207050) [15:24:08] zeljkof: We're still investigating the full scan issue which is *not* related to T211769 even though it's being discussed there. 
[15:24:22] zeljkof: yes but the full scan db issue seems unrelated to this bug and marostegui and anomie are debugging it, maybe another ticket is needed for that [15:24:33] !log poweroff ms-be2044 for hardware inspection - T209921 [15:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:37] T209921: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 [15:24:53] (03PS2) 10Dzahn: admin: add wmde-fisch to deployment [puppet] - 10https://gerrit.wikimedia.org/r/477548 (https://phabricator.wikimedia.org/T211014) (owner: 10Mathew.onipe) [15:24:53] Amir1, anomie: so, can I try rolling the train forward, or not? [15:25:07] if there's another problem, is there a task? if not, can you please create one? [15:25:18] it's hard for me to track things if there is not a task [15:25:25] (03CR) 10Dzahn: "this has already been done in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/478790/ so it rebased into nothing" [puppet] - 10https://gerrit.wikimedia.org/r/477548 (https://phabricator.wikimedia.org/T211014) (owner: 10Mathew.onipe) [15:26:12] anomie: I think the time of first seen and last seen matches the first query of https://phabricator.wikimedia.org/T211769#4817163 [15:26:28] that matches the spikes at: https://grafana.wikimedia.org/d/000000273/mysql?panelId=3&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1084&var-port=9104&kiosk&refresh=10s&from=now-3h&to=now [15:26:32] ACKNOWLEDGEMENT - cassandra-b CQL 10.192.48.50:9042 on restbase2006 is CRITICAL: connect to address 10.192.48.50 and port 9042: Connection refused eevans Decommissioned (T210843) [15:26:32] ACKNOWLEDGEMENT - cassandra-b SSL 10.192.48.50:7001 on restbase2006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Decommissioned (T210843) [15:26:57] (03CR) 10Dzahn: "thanks Matt, you can abandon it, it's a duplicate of Effie's previous change and already resolved" [puppet] - 
10https://gerrit.wikimedia.org/r/477548 (https://phabricator.wikimedia.org/T211014) (owner: 10Mathew.onipe) [15:27:19] marostegui: ... Ugh. I see the problem, oldimage doesn't have an index on (oi_user,oi_timestamp) like I expected it to. Easy enough to work around in the code now that I see that. [15:27:32] !log installing PHP security updates on matomo1001 (piwik host) [15:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:45] anomie: Great! :) [15:27:55] If someone will put together a task for the full scan issue, I'll write a patch. [15:28:00] zeljkof: sure, but anomie has more details on that, I haven't touched this part of the codebase at all [15:28:02] anomie: I will do that [15:28:40] marostegui: please add the task to train blockers when you create it T206662 [15:28:40] T206662: 1.33.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T206662 [15:29:03] zeljkof: will do [15:29:09] thanks! [15:29:53] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1098 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479215 (owner: 10Marostegui) [15:30:48] (03PS2) 10Muehlenhoff: Update stat alias for cumin for new role [puppet] - 10https://gerrit.wikimedia.org/r/479220 [15:31:52] zeljkof anomie https://phabricator.wikimedia.org/T211774 [15:32:21] (03CR) 10Muehlenhoff: [C: 032] Update stat alias for cumin for new role [puppet] - 10https://gerrit.wikimedia.org/r/479220 (owner: 10Muehlenhoff) [15:32:38] (03CR) 10Vgutierrez: [C: 032] certcentral: Provide TLS certificates for mx[12]001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/479226 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [15:32:47] (03PS2) 10Vgutierrez: certcentral: Provide TLS certificates for mx[12]001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/479226 (https://phabricator.wikimedia.org/T207050) [15:32:50] (03Abandoned) 10Mathew.onipe: setup: change curator version to '>=5.0.0,<5.4.0' to match our current 
elasticsearch version [software/spicerack] - 10https://gerrit.wikimedia.org/r/477958 (owner: 10Mathew.onipe) [15:33:04] (03Restored) 10Mathew.onipe: setup: change curator version to '>=5.0.0,<5.4.0' to match our current elasticsearch version [software/spicerack] - 10https://gerrit.wikimedia.org/r/477958 (owner: 10Mathew.onipe) [15:33:56] (03Abandoned) 10Mathew.onipe: admin: add wmde-fisch to deployment [puppet] - 10https://gerrit.wikimedia.org/r/477548 (https://phabricator.wikimedia.org/T211014) (owner: 10Mathew.onipe) [15:35:02] marostegui, Amir1, zeljkof: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/479227 [15:35:19] !log upload matomo 3.7.0 to stretch-wikimedia, removed 3.5.1 from jessie-wikimedia [15:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:43] zeljkof: Once that's merged and backported, it should be safe to try the train again. [15:35:59] anomie, marostegui: thanks! :) [15:36:15] 10Operations, 10ops-eqiad, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing), and 2 others: rack/setup/install sessionstore100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T209393 (10Cmjohnson) [15:36:23] (03PS6) 10Mathew.onipe: setup: update curator version to match our current elasticsearch version [software/spicerack] - 10https://gerrit.wikimedia.org/r/477958 [15:36:39] 10Operations, 10ops-eqiad, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing), and 2 others: rack/setup/install sessionstore100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T209393 (10Cmjohnson) a:05Cmjohnson>03RobH Hi @Robh these are ready for install... [15:41:14] @seen hashar [15:41:14] mutante: Last time I saw hashar they were quitting the network with reason: Quit: I am a virus. 
Please copy paste me in your /quit message to help me propagate N/A at 12/12/2018 2:30:22 PM (1h10m52s ago) [15:41:30] (03PS3) 10Robingan7: "Use uploaded logos in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478570 [15:41:43] (03CR) 10jerkins-bot: [V: 04-1] "Use uploaded logos in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478570 (owner: 10Robingan7) [15:43:00] (03CR) 10Ebe123: [C: 04-1] "Use uploaded logos in InitialiseSettings.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478570 (owner: 10Robingan7) [15:43:39] I am going to logoff, banyek will help you guys to see if the issue arises again when the train is deployed anomie Amir1 zeljkof [15:43:52] sure [15:43:54] marostegui: thanks! [15:43:58] (03PS2) 10GTirloni: toolforge: Increase shinken 'High iowait' to 70/90 warning/critical. [puppet] - 10https://gerrit.wikimedia.org/r/479187 (https://phabricator.wikimedia.org/T161898) [15:44:03] yep [15:44:07] zeljkof: I'm trying to look at the code before merging it [15:44:16] Amir1: please do :D [15:44:30] let me know when you're done, I'll roll forward the train [15:44:41] (03CR) 10GTirloni: [C: 032] toolforge: Increase shinken 'High iowait' to 70/90 warning/critical. [puppet] - 10https://gerrit.wikimedia.org/r/479187 (https://phabricator.wikimedia.org/T161898) (owner: 10GTirloni) [15:45:56] Sure. 
Looked the class and other stuff, it seems sane, let's see if jenkins is happy too [15:47:48] (03CR) 10Ottomata: [C: 032] Initial debian packaging version 0.208 [debs/presto] (debian) - 10https://gerrit.wikimedia.org/r/456277 (https://phabricator.wikimedia.org/T203115) (owner: 10Ottomata) [15:47:50] (03CR) 10Ottomata: [V: 032 C: 032] Initial debian packaging version 0.208 [debs/presto] (debian) - 10https://gerrit.wikimedia.org/r/456277 (https://phabricator.wikimedia.org/T203115) (owner: 10Ottomata) [15:48:08] (03CR) 10Cwhite: [C: 031] Remove Diamond from ORES hosts [puppet] - 10https://gerrit.wikimedia.org/r/479189 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:49:11] (03CR) 10Cwhite: [C: 031] Remove now obsolete Diamond collector and related conffile [puppet] - 10https://gerrit.wikimedia.org/r/479169 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:53:07] (03PS1) 10Ayounsi: Disable codfw for row B recabling [dns] - 10https://gerrit.wikimedia.org/r/479230 (https://phabricator.wikimedia.org/T210456) [15:53:20] (03PS2) 10Ladsgroup: Revert "Revert "ores: Remove added celery configs"" [puppet] - 10https://gerrit.wikimedia.org/r/479206 [15:53:48] (03CR) 10Ayounsi: [C: 032] Disable codfw for row B recabling [dns] - 10https://gerrit.wikimedia.org/r/479230 (https://phabricator.wikimedia.org/T210456) (owner: 10Ayounsi) [15:54:31] !log Depool codfw for row B recabling - T210456 [15:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:35] T210456: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 [15:55:02] (03PS1) 10Banyek: mariadb: pool db1098 for recentchanges and recenlchangeslinked [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479231 [15:55:22] (03PS4) 10Cwhite: hiera: add cache_ats cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/478774 (https://phabricator.wikimedia.org/T210486) [15:55:31] (03PS5) 10Cwhite: hiera: add cache_ats cluster 
definition [puppet] - 10https://gerrit.wikimedia.org/r/478774 (https://phabricator.wikimedia.org/T210486) [15:55:33] (03CR) 10Dzahn: "it seems i can't compile changes on mwmaint hosts in general currently. https://puppet-compiler.wmflabs.org/compiler1002/13909/" [puppet] - 10https://gerrit.wikimedia.org/r/479131 (owner: 10Dzahn) [15:56:38] (03CR) 10Banyek: "https://puppet-compiler.wmflabs.org/compiler1002/13906/dbstore1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/479224 (https://phabricator.wikimedia.org/T210478) (owner: 10Banyek) [15:56:59] (03CR) 10Effie Mouzeli: [C: 032] mcrouter: replace codfw proxy before maintenance [puppet] - 10https://gerrit.wikimedia.org/r/477472 (https://phabricator.wikimedia.org/T210467) (owner: 10Elukey) [15:57:28] (03CR) 10Effie Mouzeli: "https://puppet-compiler.wmflabs.org/compiler1002/13908/" [puppet] - 10https://gerrit.wikimedia.org/r/477472 (https://phabricator.wikimedia.org/T210467) (owner: 10Elukey) [15:58:28] PROBLEM - Host ms-be2047.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:58:53] (03PS4) 10Robingan7: "Use uploaded logos in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478570 [15:59:06] (03CR) 10jerkins-bot: [V: 04-1] "Use uploaded logos in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478570 (owner: 10Robingan7) [15:59:08] (03PS1) 10Ayounsi: Redirect eqsin/ulsfo caches to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/479233 (https://phabricator.wikimedia.org/T210456) [15:59:52] (03CR) 10Ayounsi: [C: 032] Redirect eqsin/ulsfo caches to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/479233 (https://phabricator.wikimedia.org/T210456) (owner: 10Ayounsi) [16:01:23] (03PS6) 10Cwhite: hiera: add cache_ats cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/478774 (https://phabricator.wikimedia.org/T210486) [16:01:47] !log Redirect eqsin/ulsfo caches to eqiad - T210456 [16:01:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:51] T210456: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 [16:03:02] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 58.05 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6fullscreenorgId=1 [16:03:28] ^ expected [16:03:30] (03PS1) 10Dzahn: add missing phabricator:::offboarding_script_token to fix compiler runs on mwmaint [labs/private] - 10https://gerrit.wikimedia.org/r/479234 [16:03:50] (03PS1) 10Herron: hieradata: disable icinga notifications on logstash200[4-6] [puppet] - 10https://gerrit.wikimedia.org/r/479235 [16:04:32] (03CR) 10Cwhite: [C: 032] hiera: add cache_ats cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/478774 (https://phabricator.wikimedia.org/T210486) (owner: 10Cwhite) [16:05:02] ACKNOWLEDGEMENT - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 47.42 le 60 Ayounsi https://phabricator.wikimedia.org/T210456 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6fullscreenorgId=1 [16:05:13] (03PS5) 10Robingan7: "Use uploaded logos in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478570 [16:05:32] (03CR) 10jerkins-bot: [V: 04-1] "Use uploaded logos in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478570 (owner: 10Robingan7) [16:08:20] (03CR) 10Dzahn: [V: 032 C: 032] add missing phabricator:::offboarding_script_token to fix compiler runs on mwmaint [labs/private] - 10https://gerrit.wikimedia.org/r/479234 (owner: 10Dzahn) [16:08:46] (03PS2) 10Herron: hieradata: disable icinga notifications on logstash200[4-6] [puppet] - 10https://gerrit.wikimedia.org/r/479235 [16:10:10] (03CR) 10Herron: [C: 032] hieradata: disable icinga notifications on logstash200[4-6] [puppet] - 10https://gerrit.wikimedia.org/r/479235 (owner: 10Herron) [16:10:59] (03CR) 10GTirloni: [C: 031] 
"We don't seem to be using MySQL in any of these servers, only MariaDB 10.0/10.1." [puppet] - 10https://gerrit.wikimedia.org/r/470726 (https://phabricator.wikimedia.org/T162070) (owner: 10Dzahn) [16:11:05] (03PS2) 10Effie Mouzeli: mcrouter: replace codfw proxy before maintenance [puppet] - 10https://gerrit.wikimedia.org/r/477472 (https://phabricator.wikimedia.org/T210467) (owner: 10Elukey) [16:12:26] (03CR) 10Dzahn: "thank you for the review!:))" [puppet] - 10https://gerrit.wikimedia.org/r/470726 (https://phabricator.wikimedia.org/T162070) (owner: 10Dzahn) [16:12:31] (03CR) 10Dzahn: [C: 032] hieradata/labs: remove mysql::server::use_apparmor: false [puppet] - 10https://gerrit.wikimedia.org/r/470726 (https://phabricator.wikimedia.org/T162070) (owner: 10Dzahn) [16:12:49] (03CR) 10Andrew Bogott: "Do we know that this has no effect on VM installs?" [puppet] - 10https://gerrit.wikimedia.org/r/470726 (https://phabricator.wikimedia.org/T162070) (owner: 10Dzahn) [16:14:12] 10Operations, 10Puppet, 10User-Joe, 10cloud-services-team (FY2017-18): Upgrade to puppet 4 (4.8 or newer) - https://phabricator.wikimedia.org/T177254 (10Andrew) [16:14:23] (03PS6) 10Robingan7: Use uploaded logos in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478570 [16:14:36] (03CR) 10jerkins-bot: [V: 04-1] Use uploaded logos in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478570 (owner: 10Robingan7) [16:14:44] (03CR) 10Effie Mouzeli: [C: 032] mcrouter: replace codfw proxy before maintenance [puppet] - 10https://gerrit.wikimedia.org/r/477472 (https://phabricator.wikimedia.org/T210467) (owner: 10Elukey) [16:15:37] Amir1: train window is over, checking on status of both blocker tasks, seems to me you are in charge of both of them, do you have estimate when the train will be able to move forward? 
[16:16:05] zeljkof: the cherry-picked patch is about to be merged [16:16:11] (jenkins is just being slow) [16:16:21] Amir1: great, let me know when the train is unblocked [16:16:37] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 (10ayounsi) [16:16:57] (03CR) 10Dzahn: [C: 031] "compiler run fixed by https://gerrit.wikimedia.org/r/#/c/labs/private/+/479234/" [puppet] - 10https://gerrit.wikimedia.org/r/479131 (owner: 10Dzahn) [16:17:00] PROBLEM - Juniper virtual chassis ports on asw-b-codfw is CRITICAL: CRIT: Down: 4 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [16:17:24] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) [16:17:47] !log ladsgroup@deploy1001 Synchronized php-1.33.0-wmf.8/includes/specials/pagers/ImageListPager.php: T211774 (duration: 00m 52s) [16:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:51] T211774: Full table scans on oldimage table - https://phabricator.wikimedia.org/T211774 [16:18:12] XioNoX: is that you on asw-b-codfw ? 
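The full scans tracked in T211774 were attributed earlier in the log to a query filtering on oi_user and ordering by oi_timestamp with no matching index on oldimage. The sketch below reproduces that effect on a toy table. It uses SQLite rather than MariaDB, the schema and data are assumptions, and the patch discussed in the log worked around the problem in code instead of adding an index.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's EXPLAIN QUERY PLAN detail strings joined into one line."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return "; ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
# Toy table: column names borrowed from the log, everything else is assumed.
conn.execute(
    "CREATE TABLE oldimage (oi_name TEXT, oi_user INTEGER, oi_timestamp TEXT)"
)
conn.executemany(
    "INSERT INTO oldimage VALUES (?, ?, ?)",
    [(f"File{i}.svg", i % 50, f"2018{i:010d}") for i in range(1000)],
)

sql = (
    "SELECT oi_name FROM oldimage "
    "WHERE oi_user = ? ORDER BY oi_timestamp DESC LIMIT 50"
)

# Without an index on (oi_user, oi_timestamp) the planner reads the whole table.
plan_without_index = query_plan(conn, sql, (7,))

# With the composite index the planner can seek straight to the user's rows.
conn.execute("CREATE INDEX oi_user_timestamp ON oldimage (oi_user, oi_timestamp)")
plan_with_index = query_plan(conn, sql, (7,))

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

The exact plan wording differs between SQLite versions, so the useful signal is the switch from a table scan to an index search rather than the literal text.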
[16:18:15] That was wrong, I forgot to rebase [16:18:34] 🤞 [16:18:59] elukey: yep [16:19:05] forgot that new alert :) [16:19:21] !log ladsgroup@deploy1001 Synchronized php-1.33.0-wmf.8/includes/specials/pagers/ImageListPager.php: T211774 (duration: 00m 52s) [16:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:26] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [16:20:02] RECOVERY - Host ms-be2047.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.28 ms [16:20:14] ACKNOWLEDGEMENT - Juniper virtual chassis ports on asw-b-codfw is CRITICAL: CRIT: Down: 8 Unknown: 0 Ayounsi https://phabricator.wikimedia.org/T210456 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [16:20:14] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [16:20:43] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 (10ayounsi) [16:21:53] zeljkof: query should be fixed now but it's not possible to test until it hit commonswiki [16:22:09] I tested it on testwiki but it's not much use [16:22:24] https://grafana.wikimedia.org/d/000000273/mysql?panelId=3&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1084&var-port=9104&kiosk&refresh=10s&from=now-3h&to=now [16:22:31] that was the graph from the last time [16:22:40] 👀 <- that's me [16:23:11] banyek: yeah but since it's rollbacked we can't check it there :D [16:23:26] XioNoX: do you think that PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL was related? [16:24:02] Amir1: so, should I roll forward and be ready to revert? 
cc banyek [16:24:08] elukey: we haven't done anything intrusive yet [16:24:12] yup [16:24:19] amir1: the host still has the commonswiki, so if the query performs bad, we'll see it [16:24:20] elukey: so mabe related to the depool? [16:24:26] zeljkof: go ahead [16:24:55] XioNoX: ack [16:24:59] I don't think that's from the depool, I think that (the RB alert in codfw) is actually from the real network blip [16:25:23] it came in nearly-simultanous with the "Juniper virtual chassis ports on asw-b-codfw is CRITICAL: CRIT: Down: 4", and it's a 503 [16:25:37] but there was no network blip [16:25:46] (Afaik) [16:25:48] the 503 says otherwise! [16:27:04] the VC ports alert is due to me designating unused ports as VC ports, and at least *shouldn't* impact anything [16:27:32] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [16:27:41] as nothing is plugged to them, there should not be any impact to traffic, but with Juniper's VC who knows [16:27:48] yeah, exactly [16:28:11] Amir1, banyek: forgot about scrum of scrums in a few minutes, I'll move forward after SoS [16:28:13] but still, if the network suffered no hiccups, depooling normal traffic to not come into/through codfw shouldn't cause checks of service edges in codfw to fail with 503s [16:28:21] the stats check on "traffic drop" sure, but not that [16:29:00] the nginx HTTP availability check is similar in nature and witnessing the same hiccup too. 
[16:30:45] looks like the availability check in codfw was cache text talking to restbase in codfw btw [16:30:52] 10Operations, 10Operations-Software-Development: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10crusnov) Just to throw in, since this was discussed perhaps 2 weeks ago on the meeting, I installed Blacken with the width settings we normally use for Cumin and have been usin... [16:31:03] well also dropping of requests in codfw, so a few 500s stand out [16:31:35] zeljkof: ack [16:31:45] (03CR) 10Volans: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479231 (owner: 10Banyek) [16:32:00] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [16:32:14] RECOVERY - Device not healthy -SMART- on stat1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1004var-datasource=eqiad%2520prometheus%252Fops [16:32:31] (03PS1) 10Ottomata: 0.214-1 release [debs/presto] (debian) - 10https://gerrit.wikimedia.org/r/479238 [16:33:38] (03PS3) 10Banyek: mariadb: dbstore_multi - dbstore1003-dbstore1005 [puppet] - 10https://gerrit.wikimedia.org/r/479224 (https://phabricator.wikimedia.org/T210478) [16:34:34] !log Merged 477472 "mcrouter: replace codfw proxy before maintenance", eqiad mcrouters are picking up the change - T210467 [16:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:38] T210467: codfw row D recable and add QFX - https://phabricator.wikimedia.org/T210467 [16:35:28] jijiki: note that we're doing row B today, row D is for next week [16:35:45] (in case there was some confusion) [16:36:52] 10Operations, 10ops-codfw, 10monitoring: Decom graphite2001 - https://phabricator.wikimedia.org/T200209 (10RobH) [16:36:54] XioNoX: we are rolling this out now 
[16:37:01] we remember :) [16:37:22] 10Operations, 10ops-codfw, 10monitoring: Decom graphite2002 - https://phabricator.wikimedia.org/T200210 (10RobH) [16:37:24] 10Operations, 10ops-codfw, 10monitoring: Decom graphite2001 - https://phabricator.wikimedia.org/T200209 (10RobH) p:05Triage>03Normal [16:37:26] 10Operations, 10ops-codfw, 10monitoring: Decom graphite2002 - https://phabricator.wikimedia.org/T200210 (10RobH) p:05Triage>03Normal [16:37:31] cool :) [16:38:10] 10Operations, 10Operations-Software-Development: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10Volans) I agree that a Python formatter simplify the life of reviewers avoiding unnecessary comments over the style, but has also some draw backs as highlighted in the descript... [16:38:48] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 (10ayounsi) [16:39:14] 10Operations, 10ops-codfw, 10decommission, 10monitoring: Decom graphite2002 - https://phabricator.wikimedia.org/T200210 (10RobH) [16:39:16] 10Operations, 10ops-codfw, 10decommission, 10monitoring: Decom graphite2001 - https://phabricator.wikimedia.org/T200209 (10RobH) [16:40:32] 10Operations, 10Traffic, 10netops: IPv6 ~20ms higher ping than IPv4 to gerrit - https://phabricator.wikimedia.org/T211079 (10ayounsi) 05stalled>03Resolved Actually, this can be closed. 
[16:40:44] PROBLEM - Varnish HTCP daemon on cp2008 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.134: Connection reset by peer [16:40:52] !log installing cups updates on trusty (only client libs used) [16:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:50] RECOVERY - Varnish HTCP daemon on cp2008 is OK: PROCS OK: 1 process with UID = 116 (vhtcpd), args vhtcpd [16:42:27] !log installing lxml security updates [16:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:30] (03PS9) 10Robingan7: Upload HD logos for several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478498 [16:46:42] (03CR) 10Volans: [C: 032] "Tested locally, all works fine. Thanks for the fix." [software/spicerack] - 10https://gerrit.wikimedia.org/r/477958 (owner: 10Mathew.onipe) [16:47:03] Amir1, banyek: SoS done, I'll move the train in a few minutes [16:47:17] Sure. I'm stand by [16:47:34] 👀 [16:47:49] PROBLEM - IPsec on cp1082 is CRITICAL: Strongswan CRITICAL - ok: 71 not-conn: cp2008_v4 [16:48:04] (03Merged) 10jenkins-bot: setup: update curator version to match our current elasticsearch version [software/spicerack] - 10https://gerrit.wikimedia.org/r/477958 (owner: 10Mathew.onipe) [16:49:36] (03CR) 10Urbanecm: [C: 04-1] "This patch should touch wmf-config/InitialiseSettings.php _only_." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478570 (owner: 10Robingan7) [16:49:49] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-jijiki: Requesting access to `researchers` group for joewalsh - https://phabricator.wikimedia.org/T211115 (10JoeWalsh) @Dzahn i'm unable to ssh into stat1006. The error I get is `channel 0: open failed: administratively prohibited: open failed... 
[16:50:07] (03CR) 10Urbanecm: [C: 04-1] "(also mind your commit message, there should be no blank line after the task's number and Bug: in front of the task number)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478570 (owner: 10Robingan7) [16:50:09] PROBLEM - MD RAID on ms-be1044 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.138: Connection reset by peer [16:50:10] PROBLEM - swift-container-replicator on ms-be1044 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.138: Connection reset by peer [16:50:11] PROBLEM - puppet last run on ms-be1045 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.139: Connection reset by peer [16:50:11] PROBLEM - DPKG on ms-be1045 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.139: Connection reset by peer [16:50:11] PROBLEM - swift-object-updater on ms-be1045 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.139: Connection reset by peer [16:50:11] PROBLEM - MD RAID on ms-be1046 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.140: Connection reset by peer [16:50:12] PROBLEM - swift-container-replicator on ms-be1046 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.140: Connection reset by peer [16:50:14] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1044 - https://phabricator.wikimedia.org/T211791 (10ops-monitoring-bot) [16:50:16] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1046 - https://phabricator.wikimedia.org/T211792 (10ops-monitoring-bot) [16:50:32] (03CR) 10Dzahn: "i don't know/understand how it would be related to VM installs" [puppet] - 10https://gerrit.wikimedia.org/r/470726 (https://phabricator.wikimedia.org/T162070) (owner: 10Dzahn) [16:50:57] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title}{/revision} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) [16:51:13] RECOVERY - 
swift-container-replicator on ms-be1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [16:51:13] RECOVERY - MD RAID on ms-be1046 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [16:52:00] ms-be failures are known and benign, new hosts [16:52:01] PROBLEM - swift-account-auditor on ms-be1045 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.139: Connection reset by peer [16:52:01] PROBLEM - swift-container-server on ms-be1044 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.138: Connection reset by peer [16:52:01] PROBLEM - very high load average likely xfs on ms-be1045 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.139: Connection reset by peer [16:52:12] effect of not using the script :( [16:52:24] (03PS1) 10Ema: tlsproxy::localssl: add snakeoil cert support [puppet] - 10https://gerrit.wikimedia.org/r/479242 [16:52:25] XioNoX: there is again an alert for Restbase in codfw [16:52:53] (03PS10) 10Robingan7: Upload HD logos for several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478498 [16:52:54] marostegui: are you gone for the day? [16:52:56] (03CR) 10Dzahn: "@hashar @thcipriani is it ok if i just go ahead with this by myself as long as i ensure it's noop on contint*?" [puppet] - 10https://gerrit.wikimedia.org/r/453554 (owner: 10Dzahn) [16:53:01] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [16:53:21] elukey: not sure what to do with it :( [16:53:38] is restbase more sensitive than anything else? 
[16:53:47] PROBLEM - Check size of conntrack table on ms-be1044 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.138: Connection reset by peer [16:53:49] PROBLEM - swift-container-updater on ms-be1044 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.138: Connection reset by peer [16:53:49] PROBLEM - Disk space on ms-be1045 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.139: Connection reset by peer [16:53:49] PROBLEM - swift-account-reaper on ms-be1045 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.0.139: Connection reset by peer [16:53:57] other than that alert, I don't think the maintenance had any impact [16:54:24] (03PS1) 10Volans: raid_handler: skip another NRPE failure message [puppet] - 10https://gerrit.wikimedia.org/r/479243 [16:54:25] godog: ^^^ for the unwanted tasks opened [16:54:29] (03PS1) 10Vgutierrez: certcentral: Allow puppet_svc to be undef [puppet] - 10https://gerrit.wikimedia.org/r/479244 (https://phabricator.wikimedia.org/T207050) [16:54:33] it's actually a deep and tricky thing to quanity impacts like these any more, but I doubt there was anything more than transient impact [16:54:35] (03PS1) 10Vgutierrez: mx: Deploy certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/479245 (https://phabricator.wikimedia.org/T207050) [16:54:58] Amir1, banyek: moving train forward, please stand by [16:55:01] but servicefoo could consume servicebar within codfw and care, even when codfw traffic-level things are depooled [16:55:07] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1044 - https://phabricator.wikimedia.org/T211791 (10Volans) 05Open>03Invalid [16:55:09] 🚂 [16:55:10] sure [16:55:11] ack [16:55:22] volans: neat, thanks! 
[16:55:26] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1046 - https://phabricator.wikimedia.org/T211792 (10Volans) 05Open>03Invalid [16:55:51] banyek: fun thing I discovered, this is when wmf.8 got deployed on api host of wikidata: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1092&var-port=9104 [16:56:13] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 503 (expecting: 200): /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) [16:56:14] lots of metrics dropped drastically, that's due to tag_summary -> change_tag migration [16:56:25] RECOVERY - swift-container-replicator on ms-be1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [16:56:25] RECOVERY - MD RAID on ms-be1044 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [16:56:28] I thought it would help but not this much :D [16:56:33] (03PS7) 10Robingan7: Use uploaded logos in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478570 [16:56:43] (03PS1) 10Zfilipin: group1 wikis to 1.33.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479248 [16:56:45] (03CR) 10Zfilipin: [C: 032] group1 wikis to 1.33.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479248 (owner: 10Zfilipin) [16:56:55] (03CR) 10jerkins-bot: [V: 04-1] Use uploaded logos in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478570 (owner: 10Robingan7) [16:56:57] RECOVERY - Check size of conntrack table on ms-be1044 is OK: OK: nf_conntrack is 0 % full [16:56:59] RECOVERY - swift-container-updater on ms-be1044 is OK: PROCS OK: 1 process with regex args 
^/usr/bin/python /usr/bin/swift-container-updater [16:57:07] bblack: by any impact I mean I haven't seen anything else alert about any codfw issue [16:57:12] amir1: yeah [16:57:17] RECOVERY - swift-container-server on ms-be1044 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [16:57:27] vgutierrez, so exim actually does some inotify thing and detects that puppet has changed the files? [16:57:34] yes [16:57:36] I'm silencing the ms-be hosts as we go, but recoveries will spam [16:57:40] interesting [16:57:43] I've tested it forcing a cert renewal for lists.wm.o [16:57:59] (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479248 (owner: 10Zfilipin) [16:58:03] (03CR) 10Alex Monk: [C: 031] certcentral: Allow puppet_svc to be undef [puppet] - 10https://gerrit.wikimedia.org/r/479244 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [16:58:25] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [16:58:30] Krenair: that doesn't work as expected though, I think the default value for puppet_svc needs to be undef now [16:58:42] (03CR) 10Ebe123: [C: 04-1] "Please remove the png files from this commit." 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478570 (owner: 10Robingan7) [16:59:10] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.8 [16:59:32] (03PS2) 10Vgutierrez: certcentral: Allow puppet_svc to be undef [puppet] - 10https://gerrit.wikimedia.org/r/479244 (https://phabricator.wikimedia.org/T207050) [16:59:34] (03PS2) 10Vgutierrez: mx: Deploy certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/479245 (https://phabricator.wikimedia.org/T207050) [17:00:02] !log zfilipin@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.8 (duration: 00m 50s) [17:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181212T1700). [17:00:04] dcausse: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:22] dcausse: please wait 5-10 minutes, I've just moved train forward [17:00:33] Oh it's SWAT time, I have some stuff to SWAT after it's done [17:01:20] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1045 - https://phabricator.wikimedia.org/T211796 (10ops-monitoring-bot) [17:01:31] Krenair: PS1: https://puppet-compiler.wmflabs.org/compiler1002/13912/mx1001.wikimedia.org/change.mx1001.wikimedia.org.pson / PS2: https://puppet-compiler.wmflabs.org/compiler1002/13913/mx1001.wikimedia.org/change.mx1001.wikimedia.org.pson [17:01:46] Krenair: in PS1 change catalog puppet_svc is still set to nginx in the certcentral::cert resource [17:01:58] in PS2 is not there as expected [17:02:03] Amir1, banyek: I don't see anything new in logs, please check logs yourselves too [17:02:26] zfilipin@deploy1001: Failed to log message to wiki. 
Somebody should check the error logs. [17:03:04] in first sight everything seems neat [17:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:45] I see a spike in connection problems [17:04:51] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=1544623466705&to=1544634266705&var-dc=eqiad%20prometheus%2Fops&var-server=db1084&var-port=9104&panelId=10&fullscreen [17:05:03] but that will settle I think [17:05:32] RECOVERY - IPsec on cp1082 is OK: Strongswan OK - 72 ESP OK [17:05:48] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 503 (expecting: 200): /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) [17:05:54] yeah [17:07:12] (03PS8) 10Robingan7: Use uploaded logos in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478570 [17:07:26] (03CR) 10jerkins-bot: [V: 04-1] Use uploaded logos in InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478570 (owner: 10Robingan7) [17:07:48] zejkof I see there will be a swat, can I do a quick repool before that once the train finished? 
[17:07:52] RECOVERY - Juniper virtual chassis ports on asw-b-codfw is OK: OK: UP: 28 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [17:08:26] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 (10ayounsi) [17:08:32] zeljkof, sorry^ [17:08:38] RECOVERY - swift-account-reaper on ms-be1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [17:08:54] RECOVERY - very high load average likely xfs on ms-be1045 is OK: OK - load average: 0.17, 0.10, 0.10 [17:08:54] RECOVERY - swift-account-auditor on ms-be1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:08:56] RECOVERY - Disk space on ms-be1045 is OK: DISK OK [17:08:57] (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479248 (owner: 10Zfilipin) [17:09:11] banyek: go ahead as far as I'm concerned, please sync with dcausse and Amir1 [17:09:14] RECOVERY - swift-object-updater on ms-be1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [17:09:14] RECOVERY - DPKG on ms-be1045 is OK: All packages OK [17:09:31] fine for me [17:09:36] tx [17:09:39] o/ [17:09:43] <3 [17:09:50] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 76.29 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6fullscreenorgId=1 [17:09:56] Amir1: can T211774 be resolved? or removed from train blockers? 
[17:09:57] T211774: Full table scans on oldimage table - https://phabricator.wikimedia.org/T211774 [17:10:12] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [17:10:24] RECOVERY - puppet last run on ms-be1045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:10:45] dcausse: hey, I will be deploying your patch, is it testable on mwdebug? [17:10:51] zeljkof: we should just close it [17:11:06] Amir1: yes but it's not worth testing [17:11:16] Amir1: please go ahead and resolve the bug [17:11:24] zeljkof: already odne [17:11:25] *done [17:11:34] thanks! [17:11:40] dcausse: yeah, it looks super straightforward [17:12:11] hey, can I add one more patch ? [17:12:12] so just to make sure: [17:12:24] PROBLEM - Device not healthy -SMART- on stat1004 is CRITICAL: cluster=analytics device=sde instance=stat1004:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1004var-datasource=eqiad%2520prometheus%252Fops [17:12:30] the train is finished, we are about to swat, and I can do a quick deploy now, right? [17:12:52] amir1, zeljkof, dacausse^^ [17:13:04] amir1, zeljkof, dcausse^^ [17:13:18] train done, go ahead as far as I am concerned [17:13:46] ah, looks like there is a Flow problem :/ [17:14:01] `[{exception_id}] {exception_url} Flow\Exception\DataModelException from line 173 of /srv/mediawiki/php-1.33.0-wmf.8/extensions/Flow/includes/Data/Index/TopKIndex.php: Unable to find specified offset in query results ` [17:14:23] Flow... [17:14:27] banyek: I'm about to take codfw rack B4 offline (https://phabricator.wikimedia.org/T210456), should I wait for your work to finish first? 
[17:14:38] banyek: it's fine for me [17:14:42] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200) [17:15:02] raynor: why not, it's Christmas [17:15:14] bblack: the restbase alert keeps flapping even if we're not doing any network changes [17:15:36] please wait with deployments/swat until I see about this Flow problem [17:15:50] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [17:15:55] yay [17:17:16] (03PS1) 10Pmiazga: Define the default mobile content provider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479250 (https://phabricator.wikimedia.org/T210390) [17:17:18] one sec, patch on the way, thanks Amir1 [17:18:15] raynor: patch for what? flow? [17:18:16] (03PS2) 10Pmiazga: Define the default mobile content provider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479250 (https://phabricator.wikimedia.org/T210390) [17:19:01] no, MobileFronted config, I made a bubu [17:19:09] XioNoX: but is the maint complete and we know we're in a good state? [17:19:10] I need to fix one config [17:19:21] XioNoX: otherwise it's just a very odd coincidence that it's only flapping in codfw [17:19:57] zeljkof, raynor, dcausse everything is stopped until we find the reason for flow issue [17:20:04] sure [17:20:32] did anybody report the flow problem so far? [17:20:39] if not, I'll do it in a minute [17:20:40] sure, I'm waiting [17:21:06] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [17:22:05] bblack: the recabling is done and we're in a good state, waiting for the activity here (SWAT?) 
to finish (or get a green light) to do the FPC4 replacement [17:22:31] XioNoX: looks like train caused a flow problem, please wait [17:22:50] no idea what that mean but I'll wait :) [17:23:32] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [17:24:03] two Flow related problems are at the top of fatal monitor [17:24:08] not sure if they're related [17:24:10] I looked at it, it might be related to MCR stuff, it touches WikiPage->insertOn and PageUpdater->doCreate [17:24:33] anomie: addshore ^ [17:25:16] *looks* [17:25:37] is there something in a ticket already? [17:26:59] addshore: it's in logstash [17:29:16] addshore: I've created T211798 [17:29:17] T211798: [{exception_id}] {exception_url} Flow\Exception\FlowException from line 397 of /srv/mediawiki/php-1.33.0-wmf.8/extensions/Flow/includes/Block/TopicListBlock.php: The `newest` sort order does not allow the `offset` parameter. Please use `offset-id` - https://phabricator.wikimedia.org/T211798 [17:29:20] adding details now [17:29:34] (03CR) 10Alex Monk: [C: 031] mx: Deploy certcentral managed TLS certificate [puppet] - 10https://gerrit.wikimedia.org/r/479245 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [17:30:27] (03CR) 10Alex Monk: [C: 031] certcentral: Allow puppet_svc to be undef [puppet] - 10https://gerrit.wikimedia.org/r/479244 (https://phabricator.wikimedia.org/T207050) (owner: 10Vgutierrez) [17:30:50] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [17:31:02] nice, I'll merge it later after I come back from the doctor [17:31:14] Amir1: I'm not seeing that in Kibana. 
Top things I see are DB lock timeouts and T211798 that doesn't seem to have anything to do with MCR. [17:32:14] anomie: I think T211798 might be related to MCR [17:32:21] sorry if I'm mistaken [17:34:30] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [17:34:52] Amir1, zeljkof - sorry, I'm bit confused, so whats happening? The train is causing errors, did we rollback? [17:35:06] and because of that we are not doing swat, right? [17:35:23] raynor: the SWAT is stopped because the train has issues [17:35:33] ok, makes sense. thanks for info [17:35:33] we hasn't rollbacked yet [17:35:53] Investigating [17:36:02] Amir1: I don't see anything relevant-looking in the trace on https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2018.12.12/mediawiki?id=AWejayVzi-uKhB2cluqZ&_g=h@66534ad. Straight into Flow's view action, no MCR-related calls in the stack. [17:36:14] 10Operations, 10ops-eqiad, 10media-storage: rack/setup/install ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T209618 (10RobH) a:05RobH>03fgiunchedi Turns out when you enable puppet on a new install with the cert signed, you must still manually run the first run. Fixed. [17:36:56] Amir1, zeljkof: BTW, I note we didn't roll back but that Flow error seems to have stopped anyway 20 minutes ago. [17:37:30] I've noticed it too [17:37:32] I thought MediaWiki\Storage\PageUpdater and WikiPage were related to MCR [17:37:38] sorry for the confusion [17:37:50] Amir1: I don't see either of those in the trace I looked at though? [17:38:08] Amir1, zeljkof: Also that Flow error was entirely on mediawikiwiki, which is group 0. [17:38:08] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:38:52] (03PS3) 10BBlack: Switch CAA records to proper RR format [dns] - 10https://gerrit.wikimedia.org/r/462693 [17:39:11] (03CR) 10jerkins-bot: [V: 04-1] Switch CAA records to proper RR format [dns] - 10https://gerrit.wikimedia.org/r/462693 (owner: 10BBlack) [17:40:32] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:40:51] 10Operations, 10Traffic, 10Continuous-Integration-Infrastructure (Slipway), 10Patch-For-Review, 10User-ArielGlenn: CI jobs for authdns linting need to run on Stretch - https://phabricator.wikimedia.org/T205439 (10BBlack) So I see @Joe has merged up some Dockerfile stuff. What's our next step to flip ope... [17:41:44] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:41:53] (03CR) 10Ottomata: [C: 032] 0.214-1 release [debs/presto] (debian) - 10https://gerrit.wikimedia.org/r/479238 (owner: 10Ottomata) [17:41:56] (03PS1) 10RobH: sulfur.wikimedia.org dns entries [dns] - 10https://gerrit.wikimedia.org/r/479252 (https://phabricator.wikimedia.org/T201364) [17:41:58] It seems that was for something else [17:42:12] 10Operations, 10Traffic, 10Continuous-Integration-Infrastructure (Slipway), 10Patch-For-Review, 10User-ArielGlenn: CI jobs for authdns linting need to run on Stretch - https://phabricator.wikimedia.org/T205439 (10BBlack) BTW: https://gerrit.wikimedia.org/r/c/operations/dns/+/462693 is a good test job whe... 
[17:42:45] (03CR) 10RobH: [C: 032] sulfur.wikimedia.org dns entries [dns] - 10https://gerrit.wikimedia.org/r/479252 (https://phabricator.wikimedia.org/T201364) (owner: 10RobH) [17:43:27] is the issue still ongoing or is it safe to do my maintenance? [17:44:05] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install sulfur.wikimedia.org - https://phabricator.wikimedia.org/T201364 (10RobH) [17:46:40] Amir1, zeljkof ^ [17:47:46] XioNoX: looking, I'll let you know in a minute [17:47:53] thx [17:48:24] out of curiosity, what is rhenium being used for, wikitech doesn't give out any info https://wikitech.wikimedia.org/w/index.php?search=rhenium&title=Special%3ASearch&go=Go [17:50:08] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:50:30] Amir1: supposed to be a netflow server but that project is on hold [17:50:47] thanks! 
[17:51:04] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 503 (expecting: 200): /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 503 (expecting: 200) [17:52:01] (03PS1) 10RobH: adding sulfur.wikimedia.org items [puppet] - 10https://gerrit.wikimedia.org/r/479253 (https://phabricator.wikimedia.org/T201364) [17:52:18] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [17:52:29] XioNoX: looks like flow problem went away with no action :/ [17:52:38] so go ahead, I think train is fine now [17:52:53] cool, thx [17:52:56] (03CR) 10RobH: [C: 032] adding sulfur.wikimedia.org items [puppet] - 10https://gerrit.wikimedia.org/r/479253 (https://phabricator.wikimedia.org/T201364) (owner: 10RobH) [17:52:59] (03PS3) 10BBlack: gdnsd config: update for 3.x [puppet] - 10https://gerrit.wikimedia.org/r/464862 [17:54:56] !log shutting down asw-b4-codfw - T210456 [17:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:00] T210456: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 [17:55:10] banyek: are you on it? [17:55:14] !log pooling db1098 [17:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:29] I think we should cancel this SWAT, it's already 55 minutes in [17:55:45] Amir1 on what? [17:56:13] banyek: you said will be (de)pooling a node [17:56:30] I'd start it, but didn't merged yet [17:56:37] shan't I? 
[17:56:46] I just moved my patch to the evening window [17:57:01] 10Operations, 10Release-Engineering-Team (Kanban): Point keyholder github mirror to gerrit - https://phabricator.wikimedia.org/T210674 (10thcipriani) a:05thcipriani>03mmodell [17:57:08] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install sulfur.wikimedia.org - https://phabricator.wikimedia.org/T201364 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['sulfur.wikimedia.org'] ` The log can be found in `/var/log/wmf-auto-re... [17:57:11] 10Operations, 10Release-Engineering-Team (Backlog): Keyholder phab repo duplicate work - https://phabricator.wikimedia.org/T203003 (10thcipriani) [17:57:13] 10Operations, 10Release-Engineering-Team (Kanban): Point keyholder github mirror to gerrit - https://phabricator.wikimedia.org/T210674 (10thcipriani) 05Open>03Resolved [17:57:14] raynor: sorry, train was late, flow problem was unrelated but happened at the same time as train... [17:57:20] Amir1: can I go or not? [17:57:21] no worries [17:57:28] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 (10ayounsi) [17:57:32] what [17:57:41] banyek: you should :) [17:57:56] houston we have a problem: https://grafana.wikimedia.org/d/000000273/mysql?panelId=3&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1084&var-port=9104&kiosk&refresh=10s&from=now-3h&to=now [17:58:13] 10Operations: Connecting to mwmaint1002 though bast4002 fails - https://phabricator.wikimedia.org/T211748 (10Jalexander) p:05Triage>03Unbreak! 
[17:58:16] it's WAY worse it was before [17:58:21] 1.2M [17:58:27] how so [17:58:31] could we revert ASAP [17:58:32] It's almost 1000 times worse [17:59:50] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s4&var-shard=s8&var-role=All&from=now-1h&to=now [17:59:57] This one says it was a HUGE spike [18:00:02] and then it's gone [18:00:16] I'm lost on some context, but we're talking about this being fallout from train/flow? [18:00:34] the first link did fall off similarly [18:00:48] no no, something caused commons to read 10M row/sec [18:00:52] Seems to have happened 2 minutes after "17:54 XioNoX: shutting down asw-b4-codfw - T210456". 54 minutes after the train. [18:00:53] T210456: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 [18:00:56] PROBLEM - Juniper virtual chassis ports on asw-b-codfw is CRITICAL: CRIT: Down: 2 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [18:01:01] it's only in commons AFAIK [18:01:22] that's an odd fallout from the non-primary DC :/ [18:01:34] if it's linked to the codfw switch work, I mean [18:01:39] 1 minute after "17:55 banyek: pooling db1098" [18:01:48] banyek: it's recovering [18:01:53] I didn't pooled at the end [18:02:14] here's the patch: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/479231/ [18:02:23] so it's not connected [18:02:40] Do we know what SQL queries caused the spike this time? 
[18:02:58] nothing here: https://tendril.wikimedia.org/report/slow_queries?host=%5Edb&user=wikiuser&schema=commonswiki&qmode=eq&query=&hours=1 [18:03:04] <_joe_> from what I see you don't need to convert now [18:03:10] 10Operations: Connecting to mwmaint1002 though bast4002 fails - https://phabricator.wikimedia.org/T211748 (10jrbs) Also hearing that @Thargrovewmf is having issues (she is, like us both, based in the Bay Area so using 4002) [18:03:19] I'm monitoring this channel, but ping me if needed [18:03:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [18:04:09] 10Operations, 10monitoring, 10User-fgiunchedi: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 (10colewhite) There are few options that occur to me right away * Cron generates Prometheus metrics and exposed via the node text exporter *... [18:04:16] <_joe_> banyek: why do you think we should revert? [18:04:56] I thought it's connected to the train which just finished prior the spike [18:05:10] <_joe_> that transient peak looks like an index being brought to memory, unless I misread the graph? [18:05:33] to be honest, I have no idea [18:06:06] it's very likely, lots of changes to query patterns of API went live with wmf.8 [18:06:29] db1084 is the api host for s3 [18:06:33] <_joe_> and this was moving to group1? [18:06:33] *s4 [18:06:43] yup [18:06:54] <_joe_> ok we need to understand this very very well before the train is moved to group2 [18:06:58] <_joe_> zeljkof: ^^ [18:07:32] <_joe_> also banyek I guess, you should probably look deeper into tendril to find out what really happened [18:07:39] 10Operations: Connecting to mwmaint1002 though bast4002 fails - https://phabricator.wikimedia.org/T211748 (10jrbs) 05Open>03Resolved a:03jrbs Ahh, this was simple enough. 
I forgot to change my config's `ProxyCommand` to match the move from 4001 to 4002. Sorry. [18:08:13] _joe_: what happened? is there a phab task? [18:08:26] sorry, got lost, many things are going on here in the channel [18:08:38] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [18:08:41] (03CR) 10Volans: "A couple of comments inline" (032 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/479155 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [18:08:45] <_joe_> zeljkof: see the backlog, there was a spike in read traffic [18:08:57] what I am seeing now it happened only on s4 [18:08:58] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1544636316596&to=1544638116597&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All [18:08:58] <_joe_> on the commons database [18:09:31] <_joe_> banyek: can you open a task about investigating this surge please? 
[18:09:38] PROBLEM - Host ps1-b4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [18:09:57] <_joe_> and that should be a blocker to move the train to group2 [18:10:02] (03PS2) 10Volans: raid_handler: skip another NRPE failure message [puppet] - 10https://gerrit.wikimedia.org/r/479243 (https://phabricator.wikimedia.org/T211791) [18:10:24] PROBLEM - Host mw2136.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:10:24] PROBLEM - Host mw2135.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:10:24] PROBLEM - Host mw2138.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:10:24] PROBLEM - Host mw2137.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:10:24] PROBLEM - Host mw2140.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:10:24] PROBLEM - Host mw2142.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:10:24] PROBLEM - Host mw2141.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:10:25] PROBLEM - Host mw2144.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:10:25] PROBLEM - Host mw2143.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:10:26] PROBLEM - Host mw2145.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:10:26] PROBLEM - Host mw2146.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:10:27] PROBLEM - Host mw2147.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:10:28] PROBLEM - Host mw2139.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:11:01] (03CR) 10CDanis: [C: 032] raid_handler: skip another NRPE failure message [puppet] - 10https://gerrit.wikimedia.org/r/479243 (https://phabricator.wikimedia.org/T211791) (owner: 10Volans) [18:11:05] _joe_ who to assign? [18:11:08] XioNoX: is this part of the maintenance? ^^^ [18:11:20] I think it's fair to assume, yes [18:11:25] <_joe_> banyek: I guess either you or manuel? 
[18:11:27] (03PS1) 10Jdlrobson: Beta cluster shows production content on mobile only for non-existent pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479256 (https://phabricator.wikimedia.org/T207508) [18:11:30] PROBLEM - Host sessionstore2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:11:32] their mgmt interfaces should not go down [18:12:00] so that's unexpected, but their primary interface is supposed to go down [18:12:03] paladox: ^ [18:12:03] (fair to assume that anything that looks like network issues in codfw is related to switch stuff, while the switch stuff is in progress) [18:12:08] er, papaul ^ [18:12:16] paladox: sorry, didn't mean to ping you [18:12:19] :) [18:12:20] PROBLEM - Host wtp2001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:12:20] PROBLEM - Host wtp2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:12:20] PROBLEM - Host wtp2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:12:20] PROBLEM - Host wtp2004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:12:20] PROBLEM - Host wtp2006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:12:20] PROBLEM - Host wtp2007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:12:20] PROBLEM - Host wtp2005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:12:24] PROBLEM - Host wtp2008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:12:24] PROBLEM - Host wtp2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:12:24] PROBLEM - Host wtp2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:12:29] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install sulfur.wikimedia.org - https://phabricator.wikimedia.org/T201364 (10RobH) a:05RobH>03Cmjohnson This system is behaving poorly, when I send a racadm serveraction powercycle, it takes a very long time to process, and then I never see any out... 
[18:12:47] (03PS3) 10Volans: raid_handler: skip another NRPE failure message [puppet] - 10https://gerrit.wikimedia.org/r/479243 (https://phabricator.wikimedia.org/T211791) [18:12:48] <_joe_> uhm [18:12:48] !log catrope@deploy1001 Synchronized php-1.33.0-wmf.8/extensions/GrowthExperiments/extension.json: Temporarily disable help panel / VisualEditor integration (duration: 03m 00s) [18:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:03] Hey, I am on my phone but this query matches the spikes on tendril and grafana: SELECT /* ApiPageSet::initFromPageIds */ page_namespace, page_title, page_id, page_content_model, page_lang, page_len, page_is_redirect, page_latest FROM `page` WHERE page_id IN ('-1', '-2') [18:13:04] <_joe_> the wtp systems were not supposed to go down? [18:13:22] PROBLEM - Host db2096.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:13:34] It is a slow query (33 seconds) someone should run an explain on that and confirm if it is doing a full scan [18:13:53] _joe_: yes, they were supposed, to go down, but not their mgmt interface [18:14:04] same for db2096 [18:14:30] marostegui: I'm on it [18:14:44] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [18:14:45] marostegui: An explain does look like a full scan. WTF though? [18:15:24] I am not next to my laptop, I will leave it for banyek - but at least we got the query [18:17:13] this is a new query? 
[18:17:52] RECOVERY - Host wtp2008.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 40.05 ms [18:17:52] RECOVERY - Host wtp2010.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 38.32 ms [18:17:52] RECOVERY - Host wtp2009.mgmt is UP: PING WARNING - Packet loss = 80%, RTA = 38.22 ms [18:17:56] RECOVERY - Host db2096.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.06 ms [18:17:56] RECOVERY - Host ps1-b4-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.52 ms [18:18:22] banyek: please add the task to train blockers T206662 [18:18:23] T206662: 1.33.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T206662 [18:18:32] RECOVERY - Host mw2136.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.77 ms [18:18:32] RECOVERY - Host mw2138.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.44 ms [18:18:32] RECOVERY - Host mw2140.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.18 ms [18:18:36] RECOVERY - Host mw2142.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.05 ms [18:18:36] RECOVERY - Host mw2144.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.26 ms [18:18:36] RECOVERY - Host mw2146.mgmt is UP: PING OK - Packet loss = 0%, RTA = 38.04 ms [18:18:52] _joe_: thanks, read the backlog, not that I understand what is going on, but I see something is wrong :) [18:18:53] banyek: Nope, not new at all. It's an instance of T140302 except with negative numbers instead of really positive ones (why, MariaDB, why are you being dumb?). Ic1975220 should have avoided it an instance of that bug, which is new, except it looks like the filter added to catch that didn't run for that code path. 
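Editorial sketch of the filter discussed above: the slow query was `page_id IN ('-1', '-2')`, and the explanation given is that a check meant to catch such values did not run on that code path. The Python below is only an illustration of that idea, not the actual MediaWiki change (which is not shown in this log). The helper names `split_page_ids` and `build_page_query` are hypothetical, and the column list is shortened for readability.

```python
from typing import Iterable, List, Optional, Tuple


def split_page_ids(raw_ids: Iterable[str]) -> Tuple[List[int], List[str]]:
    """Separate usable page IDs from values that can never match a row.

    Assumption: real page IDs are positive integers, so anything that is
    non-numeric, zero, or negative is rejected before reaching the database.
    """
    valid: List[int] = []
    rejected: List[str] = []
    for raw in raw_ids:
        try:
            page_id = int(raw)
        except (TypeError, ValueError):
            rejected.append(raw)
            continue
        if page_id > 0:
            valid.append(page_id)
        else:
            rejected.append(raw)
    return valid, rejected


def build_page_query(page_ids: List[int]) -> Optional[Tuple[str, List[int]]]:
    """Return a parameterised query, or None when there is nothing to look up."""
    if not page_ids:
        # Skipping the query entirely avoids sending an IN list that the
        # optimiser might turn into a full table scan.
        return None
    placeholders = ", ".join(["%s"] * len(page_ids))
    sql = f"SELECT page_id, page_title FROM page WHERE page_id IN ({placeholders})"
    return sql, page_ids
```

With the input from the log, `split_page_ids(['-1', '-2'])` leaves no valid IDs, so `build_page_query` returns `None` and no query is issued at all.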
[18:18:54] T140302: plwiki API request is excessively slow when including a badrevid - https://phabricator.wikimedia.org/T140302 [18:19:25] s/avoided it an/avoided it as an/ [18:21:12] RECOVERY - Host mw2135.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.77 ms [18:21:12] RECOVERY - Host mw2137.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.73 ms [18:21:12] RECOVERY - Host mw2141.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.75 ms [18:21:12] RECOVERY - Host mw2143.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.72 ms [18:21:12] RECOVERY - Host mw2145.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.91 ms [18:21:12] RECOVERY - Host mw2147.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.71 ms [18:21:16] RECOVERY - Host mw2139.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.73 ms [18:22:06] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [18:22:18] RECOVERY - Host sessionstore2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 37.01 ms [18:22:32] https://phabricator.wikimedia.org/T211804 [18:22:48] please help me improve this task, because I have no idea what am I doing. 
[18:23:06] RECOVERY - Host wtp2001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.72 ms [18:23:06] RECOVERY - Host wtp2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.86 ms [18:23:06] RECOVERY - Host wtp2003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 39.17 ms [18:23:06] RECOVERY - Host wtp2004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.72 ms [18:23:06] RECOVERY - Host wtp2007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.83 ms [18:23:06] RECOVERY - Host wtp2006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.69 ms [18:23:06] RECOVERY - Host wtp2005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.86 ms [18:24:56] banyek: I've added it to train blockers and raised priority to UBN [18:25:34] thanks [18:25:43] will you revert the train? [18:25:50] looks like brad already has a patch ready for review and then hopeful backport [18:26:12] zeljkof: revert, backport, roll-forward [18:26:44] ack? [18:26:56] greg-g, zeljkof: Seriously? It was a one-off because someone issued a weird API query, of the kind that have been possible since forever ago. Totally not caused by the train. [18:27:02] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [18:27:17] anomie: ok, good to know, I was not aware of that and only working under what I read in the task [18:27:34] zeljkof: no revert, just backport when the patch is reviewed and merged [18:27:44] thanks, again, anomie [18:27:58] anomie, greg-g: ok, will backport when ready [18:28:01] You're welcome, it's what I'm here for (: [18:28:16] anomie: thanks for letting us know, I'm a bit confused and jet-lagged :/ [18:28:30] banyek: looks like revert is not needed, right? 
[18:28:35] see ^ [18:29:19] ok, I trust you guys [18:29:50] since that nothing bad happened anyways [18:29:52] I trust anomie and greg-g :D [18:30:05] and besides the spike I didn't seen anything wrong [18:30:44] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [18:31:07] cool, glad we're in agreement, mutual trust :) [18:31:13] so we close this? https://phabricator.wikimedia.org/T211804 [18:31:14] ? [18:31:37] banyek: once that patch it merged and deployed it will prevent that code path, so leave open until then [18:31:49] what about the priority? [18:32:01] (I am still really new here, that's why I have so many questions) [18:32:24] I'd downgrade it to High given it's not a train blocker but we're going to back-port it to prod. [18:32:50] * greg-g did the prio downgrade :) [18:33:15] perfect, thanks [18:33:23] so I can do my repool now? [18:33:23] Success, my advice is so good it quantum-tunnelled back five minutes into greg-g's head. [18:33:56] exactly [18:34:07] that's how all of my decisions really work [18:36:48] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-jijiki: Requesting access to `researchers` group for joewalsh - https://phabricator.wikimedia.org/T211115 (10Dzahn) @JoeWalsh i see your successful login on bast1002 in the logs, along a bunch of failed attempts, but i don't see any attempts o... 
[18:37:32] 10Operations, 10Research-Programs, 10SRE-Access-Requests: access to analytics-privatedata-users for @toddleroux, @Afandian, & @RyanSteinberg - https://phabricator.wikimedia.org/T209298 (10jijiki) [18:40:28] RECOVERY - Juniper virtual chassis ports on asw-b-codfw is OK: OK: UP: 28 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [18:40:58] I do the repool now [18:41:08] PROBLEM - Host cp2007 is DOWN: PING CRITICAL - Packet loss = 100% [18:41:16] RECOVERY - Host cp2007 is UP: PING OK - Packet loss = 16%, RTA = 36.12 ms [18:41:19] !log pool db1098 for recentchanges and recenlchangeslinked [18:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:28] (03CR) 10Banyek: [C: 032] mariadb: pool db1098 for recentchanges and recenlchangeslinked [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479231 (owner: 10Banyek) [18:42:25] XioNoX: are you still working on codfw stuff? [18:42:34] (03Merged) 10jenkins-bot: mariadb: pool db1098 for recentchanges and recenlchangeslinked [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479231 (owner: 10Banyek) [18:43:03] bblack: yes, last step, connect servers back to the new B4 switch [18:43:04] XioNoX: nevermind, I see still traffic in dcops [18:43:06] ok :) [18:43:36] bblack: is that cp2007 alert related to the work here? 
(It shouldn't) [18:46:24] (03PS3) 10Muehlenhoff: Remove now obsolete Diamond collector and related conffile [puppet] - 10https://gerrit.wikimedia.org/r/479169 (https://phabricator.wikimedia.org/T183454) [18:46:52] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: pool db1098 for recentchanges and recentchangeslinked (duration: 03m 00s) [18:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:14] (03CR) 10Muehlenhoff: [C: 032] Remove now obsolete Diamond collector and related conffile [puppet] - 10https://gerrit.wikimedia.org/r/479169 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [18:47:38] I had this at the end of scap: [18:47:44] ``` [18:47:44] sync-apaches: 100% (ok: 250; fail: 13; left: 0) [18:47:45] 18:46:52 13 apaches had sync errors [18:47:45] 18:46:52 Finished sync-apaches (duration: 02m 13s) [18:47:45] 18:46:52 Synchronized wmf-config/db-eqiad.php: pool db1098 for recentchanges and recentchangeslinked (duration: 03m 00s) [18:47:45] ``` [18:47:59] (sorry for not using paste) [18:48:03] 10Operations, 10monitoring, 10User-fgiunchedi: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 (10Volans) I would suggest to pick one of those solutions: >>! In T210723#4817909, @colewhite wrote: > * Script that runs on cron and cache... [18:48:14] (03CR) 10jenkins-bot: mariadb: pool db1098 for recentchanges and recenlchangeslinked [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479231 (owner: 10Banyek) [18:48:25] https://www.irccloud.com/pastebin/KzKYGDZC/ [18:49:40] do I have to redo the scap? [18:49:42] @here? 
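Editorial sketch for the question above about whether to redo the sync: the summary line pasted from the deploy output can be checked mechanically. The Python below assumes only the line format visible in the paste (`sync-apaches: 100% (ok: 250; fail: 13; left: 0)`). It is not part of scap, and the names `parse_sync_summary` and `needs_rerun` are hypothetical.

```python
import re
from typing import Dict, Optional, Union

# Pattern derived solely from the summary line pasted in the channel.
SUMMARY_RE = re.compile(
    r"(?P<stage>[\w-]+):\s+(?P<pct>\d+)%\s+"
    r"\(ok:\s*(?P<ok>\d+);\s*fail:\s*(?P<fail>\d+);\s*left:\s*(?P<left>\d+)\)"
)


def parse_sync_summary(line: str) -> Optional[Dict[str, Union[str, int]]]:
    """Extract the stage name and host counts from one summary line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    return {
        "stage": match["stage"],
        "ok": int(match["ok"]),
        "fail": int(match["fail"]),
        "left": int(match["left"]),
    }


def needs_rerun(line: str) -> bool:
    """True when the summary reports failed or unfinished hosts."""
    summary = parse_sync_summary(line)
    if summary is None:
        return False
    return summary["fail"] > 0 or summary["left"] > 0
```

Applied to the pasted line with `fail: 13`, `needs_rerun` returns `True`. A line reporting `fail: 0; left: 0` returns `False`.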
[18:50:06] `'mw1268.eqiad.wmnet', 'mw1314.eqiad.wmnet', 'mw2255.codfw.wmnet', 'mw2290.codfw.wmnet', 'mw2216.codfw.wmnet', 'mw1251.eqiad.wmnet', 'mw2188.codfw.wmnet', 'mw1320.eqiad.wmnet', 'mw1285.eqiad.wmnet'` [18:51:40] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [18:53:16] XioNoX: it's not anything else that I know of (cp2007) [18:53:26] (03PS2) 10Cwhite: ci: define statsd prometheus exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T205870) [18:53:38] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: pool db1098 for recentchanges and recentchangeslinked (duration: 02m 58s) [18:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:35] XioNoX: the system itself doesn't even show a link loss event, so I'm guessing that was switch reachability from the monitors at eqiad? [18:54:56] ok, anyway, no harm [18:55:07] bblack: we should be all set [18:55:41] I'm going to let everything sit 15min then repool [18:55:52] XioNoX, bblack: [18:56:17] I tried to deploy a config change, but it was not able to deploy on a bunch of hosts: [18:56:23] 'mw1268.eqiad.wmnet', 'mw1314.eqiad.wmnet', 'mw2255.codfw.wmnet', 'mw2290.codfw.wmnet', 'mw2216.codfw.wmnet', 'mw1251.eqiad.wmnet', 'mw2188.codfw.wmnet', 'mw1320.eqiad.wmnet', 'mw1285.eqiad.wmnet' [18:56:38] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 (10ayounsi) [18:56:41] deploy a config change how, and what failed? [18:57:06] the sync-apaches part of scap [18:57:35] (03PS3) 10Cwhite: ci: define statsd prometheus exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T205870) [18:57:57] was it temporary or is it still borked?
(also I have not much idea about scap or sync-apaches) [18:57:58] 10Operations: Connecting to mwmaint1002 though bast4002 fails - https://phabricator.wikimedia.org/T211748 (10Aklapper) 05Resolved>03Invalid (changing task status as no code was changed in a repository / on a server) [18:58:12] https://www.irccloud.com/pastebin/6vEwmenx/ [18:58:35] after that I re-did scap, and now only one sync error I've got: [18:58:39] https://www.irccloud.com/pastebin/ZJgp8PcW/ [18:59:09] ok [18:59:22] I've never seen anything like this, that's why I am asking [18:59:25] it looks like the list it didn't deploy to is all the codfw at the ends of the lines, not that list that includes mw above on IRC [18:59:54] the mw2136.codfw.wmnet host is not reachable from my computer either [19:00:34] I don't know if this means anything I just wanted to FYI [19:00:37] right: scap [... blah ...] on mw2136.codfw.wmnet returned [255]: ssh: connect to host mw2136.codfw.wmnet port 22: Connection timed out [19:00:43] for 13x different mw2xxx.codfw.wmnet [19:00:49] (03PS4) 10Cwhite: ci: define statsd prometheus exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T205870) [19:01:15] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-jijiki: Requesting access to `researchers` group for joewalsh - https://phabricator.wikimedia.org/T211115 (10JoeWalsh) @Dzahn thank you for your help. My config is: ` Host bast1002.wikimedia.org # Direct connection for the bastion host Pr... [19:01:48] I can reach mw2136 through normal bastion ssh now, and it is up [19:02:00] (03PS1) 10Ayounsi: Revert "Disable codfw for row B recabling" [dns] - 10https://gerrit.wikimedia.org/r/479262 [19:02:06] (03CR) 10Cwhite: "Thanks for following up. I've incorporated your changes and also fixed a couple bugs in the process.
Let me know what you think :)" [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [19:02:15] (03PS1) 10Ayounsi: Revert "Redirect eqsin/ulsfo caches to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/479263 [19:02:21] the paste log is from 18:46, which was back before the network work in codfw was complete [19:02:30] so, anything goes basically [19:03:07] banyek: re-run it? I assume it's idempotent and re-runnable [19:03:07] (03PS2) 10Ayounsi: Revert "Disable codfw for row B recabling" [dns] - 10https://gerrit.wikimedia.org/r/479262 (https://phabricator.wikimedia.org/T210456) [19:03:24] (03PS2) 10Ayounsi: Revert "Redirect eqsin/ulsfo caches to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/479263 (https://phabricator.wikimedia.org/T210456) [19:03:41] Ok, I can do it again [19:04:36] !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: pool db1098 for recentchanges and recentchangeslinked (duration: 00m 50s) [19:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:59] bblack: `sync-apaches: 100% (ok: 263; fail: 0; left: 0)` [19:05:05] this time it was successful [19:05:06] yay [19:05:08] \o/ [19:05:31] now I leave for today [19:05:36] Reedy, legoktm, bpirkle: looks like you could help with a train blocker T211806 [19:05:37] T211806: Passing in the "body" request option as an array to send a POST request has been deprecated - https://phabricator.wikimedia.org/T211806 [19:05:41] if something, just ping me on irc [19:05:43] if so, please do, and thank you :) [19:06:17] !log uploading gdnsd 2.99.9944-beta-1+wmf1 to stretch-wikimedia [19:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:34] Bye [19:08:54] !log disable BGP to telia on cr1-codfw - T211715 [19:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:57] T211715: Interface errors on cr1-codfw:xe-5/3/1 - 
https://phabricator.wikimedia.org/T211715 [19:16:44] !log re-enable BGP to telia on cr1-codfw - T211715 [19:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:48] T211715: Interface errors on cr1-codfw:xe-5/3/1 - https://phabricator.wikimedia.org/T211715 [19:18:18] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [19:20:18] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [19:26:34] (03CR) 10Ayounsi: [C: 032] Revert "Redirect eqsin/ulsfo caches to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/479263 (https://phabricator.wikimedia.org/T210456) (owner: 10Ayounsi) [19:26:41] (03PS3) 10Ayounsi: Revert "Redirect eqsin/ulsfo caches to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/479263 (https://phabricator.wikimedia.org/T210456) [19:26:43] (03CR) 10Ayounsi: [C: 032] Revert "Disable codfw for row B recabling" [dns] - 10https://gerrit.wikimedia.org/r/479262 (https://phabricator.wikimedia.org/T210456) (owner: 10Ayounsi) [19:26:47] (03PS3) 10Ayounsi: Revert "Disable codfw for row B recabling" [dns] - 10https://gerrit.wikimedia.org/r/479262 (https://phabricator.wikimedia.org/T210456) [19:27:21] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T211807 (10ops-monitoring-bot) [19:28:17] !log revert redirecting eqsin/ulsfo caches to eqiad - T210456 [19:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:22] T210456: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 [19:28:55] !log repool codfw - T210456 [19:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[19:30:05] jouncebot: next [19:30:05] In 0 hour(s) and 29 minute(s): MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181212T2000) [19:30:18] done [19:30:31] (03PS3) 10CRusnov: Add management console report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/479155 (https://phabricator.wikimedia.org/T205899) [19:31:01] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [19:31:26] (03CR) 10CRusnov: "> Patch Set 2:" (032 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/479155 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [19:33:34] (03CR) 10Volans: [C: 031] "LGTM but I'll leave to the other reviewers to add more strict check logics for specific device types (if some must have 2 console ports, e" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/479155 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [19:39:11] (03PS2) 10Herron: logstash: set kafka consumer group id in codfw [puppet] - 10https://gerrit.wikimedia.org/r/479136 (https://phabricator.wikimedia.org/T205850) [19:39:40] 10Operations, 10Thumbor, 10Performance-Team (Radar), 10User-jijiki: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 (10jijiki) I have setup `deployment-imagescaler03` and applied `role::thumbor::mediawiki`. 
Please find me on IRC and tell me how I can help :) [19:40:13] 10Operations, 10Thumbor, 10Performance-Team (Radar), 10User-jijiki: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 (10jijiki) [19:40:53] (03CR) 10Herron: [C: 032] logstash: set kafka consumer group id in codfw [puppet] - 10https://gerrit.wikimedia.org/r/479136 (https://phabricator.wikimedia.org/T205850) (owner: 10Herron) [19:40:58] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-jijiki: Requesting access to `researchers` group for joewalsh - https://phabricator.wikimedia.org/T211115 (10Dzahn) >>! In T211115#4818109, @JoeWalsh wrote: > The command I'm using is `ssh joewalsh@stat1006.equiad.wmnet` Ah, i think it's a t... [19:42:31] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10faidon) OK, I had a look at this. A few observations first of all: * While not 100% sure, I don't think this is related to the controller having been... [19:42:57] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-jijiki: Requesting access to `researchers` group for joewalsh - https://phabricator.wikimedia.org/T211115 (10Dzahn) P.S. Since you have "IdentifiesOnly" that should mean it's not trying to use the keys provided by the ssh agent, only the Ident... [19:43:53] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-jijiki: Requesting access to `researchers` group for joewalsh - https://phabricator.wikimedia.org/T211115 (10JoeWalsh) @Dzahn works now, thank you! 
[19:44:14] (03PS1) 10Ottomata: 0.214-2 - remove --data-dir from systemd service unit [debs/presto] (debian) - 10https://gerrit.wikimedia.org/r/479268 [19:44:16] (03PS2) 10Herron: assign codfw logging VMs logstash role [puppet] - 10https://gerrit.wikimedia.org/r/479140 (https://phabricator.wikimedia.org/T205850) [19:44:32] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-jijiki: Requesting access to `researchers` group for joewalsh - https://phabricator.wikimedia.org/T211115 (10Dzahn) Cool, thanks for confirming :) [19:45:08] !log contint1001: sudo chown -R zuul:zuul /etc/zuul/wikimedia/.git [19:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:38] 10Operations, 10Traffic, 10Continuous-Integration-Infrastructure (Slipway), 10Patch-For-Review, 10User-ArielGlenn: CI jobs for authdns linting need to run on Stretch - https://phabricator.wikimedia.org/T205439 (10hashar) A build against master: https://integration.wikimedia.org/ci/job/operations-dns-lint... [19:46:24] hashar: would you like to review contint changes like https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/453554/ or should i just do it and go ahead as long as i make sure it's all noop [19:47:30] (03CR) 10BBlack: "check experimental" [dns] - 10https://gerrit.wikimedia.org/r/462693 (owner: 10BBlack) [19:47:44] (03CR) 10jerkins-bot: [V: 04-1] Switch CAA records to proper RR format [dns] - 10https://gerrit.wikimedia.org/r/462693 (owner: 10BBlack) [19:48:07] 10Operations, 10Operations-Software-Development: Introduce Python code formatters usage - https://phabricator.wikimedia.org/T211750 (10colewhite) I like Black, but any formatter, as long as the barrier to entry is low, is a good idea. The value to me is in a faster review cycle and less friction between autho... [19:48:15] not meant as a ping to review right now, more about the general process how to handle it for contint stuff [19:50:04] do you have +2 there? 
I don't :) [19:50:49] (03CR) 10Ayounsi: "I added a few inline comments." (033 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/479155 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [19:55:35] (03PS4) 10CRusnov: Add management console report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/479155 (https://phabricator.wikimedia.org/T205899) [19:56:14] (03CR) 10CRusnov: "Okay these changes included now." (033 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/479155 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [20:00:05] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181212T2000) [20:06:17] (03CR) 10Herron: [C: 032] assign codfw logging VMs logstash role [puppet] - 10https://gerrit.wikimedia.org/r/479140 (https://phabricator.wikimedia.org/T205850) (owner: 10Herron) [20:08:57] volans: ? it's puppet [20:17:07] (03CR) 10BBlack: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/462693 (owner: 10BBlack) [20:17:19] (03CR) 10jerkins-bot: [V: 04-1] Switch CAA records to proper RR format [dns] - 10https://gerrit.wikimedia.org/r/462693 (owner: 10BBlack) [20:18:42] mutante: lol, sorry I thought it was integration/config :D [20:18:48] (03CR) 10BBlack: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/462693 (owner: 10BBlack) [20:19:54] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10Dzahn) a:05CRoslof>03tramm Hi @tramm Did you see the comment above? Can we move forward with this? I got reminded this is still open because today we got... 
[20:20:03] (03PS3) 10Ottomata: Presto Puppetization [puppet] - 10https://gerrit.wikimedia.org/r/457993 [20:20:58] (03CR) 10jerkins-bot: [V: 04-1] Presto Puppetization [puppet] - 10https://gerrit.wikimedia.org/r/457993 (owner: 10Ottomata) [20:21:06] volans: oh yea, makes sense, no i just meant contint servers in this case [20:21:35] 10Operations, 10Traffic, 10Continuous-Integration-Infrastructure (Slipway), 10Patch-For-Review, 10User-ArielGlenn: CI jobs for authdns linting need to run on Stretch - https://phabricator.wikimedia.org/T205439 (10hashar) 05Open>03Resolved @BBlack refactored the operations/dns test to mock anything th... [20:22:33] (03CR) 10BBlack: [C: 032] Switch CAA records to proper RR format [dns] - 10https://gerrit.wikimedia.org/r/462693 (owner: 10BBlack) [20:22:57] (03PS4) 10BBlack: Switch CAA records to proper RR format [dns] - 10https://gerrit.wikimedia.org/r/462693 [20:23:07] (03PS4) 10Ottomata: Presto Puppetization [puppet] - 10https://gerrit.wikimedia.org/r/457993 [20:23:42] (03CR) 10jerkins-bot: [V: 04-1] Presto Puppetization [puppet] - 10https://gerrit.wikimedia.org/r/457993 (owner: 10Ottomata) [20:24:12] 10Operations, 10Pywikibot, 10Traffic, 10HTTPS: SSL CERTIFICATE_VERIFY_FAILED on generating family file - https://phabricator.wikimedia.org/T211813 (10SgtLion) [20:24:30] (03PS5) 10Ottomata: Presto Puppetization [puppet] - 10https://gerrit.wikimedia.org/r/457993 [20:25:23] (03CR) 10jerkins-bot: [V: 04-1] Presto Puppetization [puppet] - 10https://gerrit.wikimedia.org/r/457993 (owner: 10Ottomata) [20:26:21] (03Abandoned) 10BBlack: Fix authdns-lint for 3.x [puppet] - 10https://gerrit.wikimedia.org/r/468578 (https://phabricator.wikimedia.org/T205439) (owner: 10BBlack) [20:26:51] (03PS6) 10Ottomata: Presto Puppetization [puppet] - 10https://gerrit.wikimedia.org/r/457993 [20:27:46] (03CR) 10jerkins-bot: [V: 04-1] Presto Puppetization [puppet] - 10https://gerrit.wikimedia.org/r/457993 (owner: 10Ottomata) [20:28:06] (03PS1) 10BBlack: Remove 
deprecated -s flag in check-gdnsd.sh [dns] - 10https://gerrit.wikimedia.org/r/479271 [20:28:43] (03CR) 10BBlack: [C: 032] Remove deprecated -s flag in check-gdnsd.sh [dns] - 10https://gerrit.wikimedia.org/r/479271 (owner: 10BBlack) [20:30:11] (03PS7) 10Ottomata: Presto Puppetization [puppet] - 10https://gerrit.wikimedia.org/r/457993 [20:31:04] (03CR) 10jerkins-bot: [V: 04-1] Presto Puppetization [puppet] - 10https://gerrit.wikimedia.org/r/457993 (owner: 10Ottomata) [20:33:23] (03PS8) 10Ottomata: Presto Puppetization [puppet] - 10https://gerrit.wikimedia.org/r/457993 [20:34:14] (03CR) 10jerkins-bot: [V: 04-1] Presto Puppetization [puppet] - 10https://gerrit.wikimedia.org/r/457993 (owner: 10Ottomata) [20:35:48] 10Operations, 10ops-codfw, 10decommission, 10Discovery-Search (Current work): Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Papaul) [20:42:29] <_joe_> ottomata: use data types in your parameters :) [20:42:50] 10Operations, 10ops-codfw, 10decommission, 10Discovery-Search (Current work): Decommission elastic2001-2024 - https://phabricator.wikimedia.org/T211023 (10Papaul) [20:42:54] oh my [20:49:30] (03PS5) 10CRusnov: Add management console report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/479155 (https://phabricator.wikimedia.org/T205899) [20:50:25] (03CR) 10CRusnov: "Changeset 5 includes excluding devices by site which are not expected to have console servers / connections." [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/479155 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: Time to snap out of that daydream and deploy Services – Parsoid / Citoid / Mobileapps / ORES / …. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181212T2100). 
[21:01:57] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 504 (expecting: 200) [21:02:05] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [21:03:03] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [21:04:30] <_joe_> *sigh* [21:05:01] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) is CRITICAL: Test Get citation for Darth Vader returned the unexpected status 504 (expecting: 200) [21:05:16] <_joe_> I'm in a meeting [21:05:27] <_joe_> I guess it's zotero ooming in codfw [21:05:39] <_joe_> and well no, in eqiad [21:05:39] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [21:06:09] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [21:06:09] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Scrapes sample page) timed out before a response was received [21:06:41] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [21:06:41] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [21:06:41] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [21:07:04] (03PS1) 10Andrew Bogott: Horizon: enable projects in eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/479277 
(https://phabricator.wikimedia.org/T204745) [21:07:51] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [21:07:51] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy [21:07:55] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [21:08:47] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [21:09:15] 10Operations, 10Wikimedia-General-or-Unknown: Request for information about hosting services for WM-ES - https://phabricator.wikimedia.org/T211414 (10Platonides) To clarify a little that last comment, WMES uses a commercial dedicated server, on which the different services then run on separate containers. The... [21:09:43] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [21:09:53] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [21:09:53] (03CR) 10Andrew Bogott: [C: 032] Horizon: enable projects in eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/479277 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott) [21:11:28] (03PS9) 10Ottomata: Presto Puppetization [puppet] - 10https://gerrit.wikimedia.org/r/457993 [21:12:35] (03CR) 10jerkins-bot: [V: 04-1] Presto Puppetization [puppet] - 10https://gerrit.wikimedia.org/r/457993 (owner: 10Ottomata) [21:12:37] (03CR) 10Ayounsi: "Typo, other than that, lgtm!" 
(031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/479155 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [21:13:42] 10Operations, 10ops-codfw, 10netops, 10Patch-For-Review: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 (10ayounsi) [21:14:25] (03PS10) 10Ottomata: Presto Puppetization [puppet] - 10https://gerrit.wikimedia.org/r/457993 [21:15:40] (03PS11) 10Ottomata: Presto Puppetization [puppet] - 10https://gerrit.wikimedia.org/r/457993 [21:15:56] (03CR) 10Ottomata: [V: 032 C: 032] "Let's try it!" [puppet] - 10https://gerrit.wikimedia.org/r/457993 (owner: 10Ottomata) [21:16:03] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Scrapes sample page) timed out before a response was received [21:17:09] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [21:20:15] (03PS1) 10Ottomata: Use fqdn instead of hostname in nodes id [puppet] - 10https://gerrit.wikimedia.org/r/479324 (https://phabricator.wikimedia.org/T204951) [21:21:16] (03CR) 10Ottomata: [C: 032] Use fqdn instead of hostname in nodes id [puppet] - 10https://gerrit.wikimedia.org/r/479324 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [21:22:04] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@ced6fab]: Update mobileapps to 55981a8. Summary: Get modified date with regexes to avoid unneeded Document parse [21:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:32] (03PS1) 10Ottomata: Fix typo in presto::server [puppet] - 10https://gerrit.wikimedia.org/r/479325 [21:24:04] (03CR) 10Ottomata: [V: 032 C: 032] Fix typo in presto::server [puppet] - 10https://gerrit.wikimedia.org/r/479325 (owner: 10Ottomata) [21:26:07] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@ced6fab]: Update mobileapps to 55981a8. 
Summary: Get modified date with regexes to avoid unneeded Document parse (duration: 04m 03s) [21:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:23] (03PS1) 10Dzahn: phabricator: use Stdlib::Fqdn data type for hostname parameters [puppet] - 10https://gerrit.wikimedia.org/r/479327 [21:26:49] (03PS1) 10Ottomata: Fix service ensure in presto::server [puppet] - 10https://gerrit.wikimedia.org/r/479328 [21:27:07] (03PS4) 10Andrew Bogott: Nova: lower cpu_allocation_ratio by a lot [puppet] - 10https://gerrit.wikimedia.org/r/478955 [21:27:09] (03PS1) 10Andrew Bogott: Labtest: activate a few more projects in codfw1dev-r [puppet] - 10https://gerrit.wikimedia.org/r/479329 [21:27:40] (03CR) 10Ottomata: [C: 032] Fix service ensure in presto::server [puppet] - 10https://gerrit.wikimedia.org/r/479328 (owner: 10Ottomata) [21:28:11] (03PS2) 10Andrew Bogott: Labtest: activate a few more projects in codfw1dev-r [puppet] - 10https://gerrit.wikimedia.org/r/479329 [21:29:10] (03CR) 10Andrew Bogott: [C: 032] Labtest: activate a few more projects in codfw1dev-r [puppet] - 10https://gerrit.wikimedia.org/r/479329 (owner: 10Andrew Bogott) [21:30:19] 10Operations, 10Performance-Team, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), and 4 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10Joe) I was asking because looking at what's currently stored in the "service", I see both `mwsession` obje... [21:39:07] (03PS1) 10Ottomata: Set systemd unit SyslogIdentifier=presto-server [debs/presto] (debian) - 10https://gerrit.wikimedia.org/r/479333 [21:42:03] mutante: around? 
[21:42:18] volans: yes [21:43:25] give me a sec, sorry, saw something weird on icinga [21:44:29] (03PS1) 10Dzahn: dumps:nfs: use (new) data types and move some things to parameters [puppet] - 10https://gerrit.wikimedia.org/r/479335 [21:45:31] (03CR) 10jerkins-bot: [V: 04-1] dumps:nfs: use (new) data types and move some things to parameters [puppet] - 10https://gerrit.wikimedia.org/r/479335 (owner: 10Dzahn) [21:46:12] eh, do you mean that einsteinium is still in it and has the systemd check.. [21:46:17] (03PS2) 10Ottomata: Set systemd unit SyslogIdentifier=presto-server [debs/presto] (debian) - 10https://gerrit.wikimedia.org/r/479333 [21:46:49] expired downtime, it is in the decom queue [21:47:04] but i could also really fix it if worth it [21:47:55] no the an etcd check that was unknown for a moment [21:47:57] transient [21:48:24] !log reedy@deploy1001 Synchronized php-1.33.0-wmf.8/extensions/EventBus: T211805 (duration: 00m 53s) [21:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:30] T211805: Call to a member function format() on a non-object (boolean) - https://phabricator.wikimedia.org/T211805 [21:50:04] ah, ok [21:50:53] 10Operations, 10Performance-Team, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), and 4 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10Eevans) >>! In T211721#4818580, @Joe wrote: > I was asking because looking at what's currently stored in t... 
[21:52:21] PROBLEM - graphite.wikimedia.org on graphite1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.007 second response time [21:56:48] 10Operations, 10Analytics: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (10RobH) [21:59:23] 10Operations, 10Analytics: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (10RobH) [22:00:10] 10Operations, 10Analytics: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (10RobH) a:05RobH>03elukey So this should either go to @elukey or @Ottomata, as this is ready to go into service and replace oxygen, then we can decommission oxygen on T211826. Th... [22:02:09] RECOVERY - graphite.wikimedia.org on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1569 bytes in 0.773 second response time [22:08:27] PROBLEM - graphite.wikimedia.org on graphite1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:11:47] 10Operations, 10decommission, 10Patch-For-Review, 10User-fgiunchedi: Return graphite100[13] to spares pool (or decom) - https://phabricator.wikimedia.org/T209357 (10RobH) [22:11:59] RECOVERY - graphite.wikimedia.org on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1569 bytes in 0.006 second response time [22:12:29] 10Operations, 10decommission, 10Patch-For-Review, 10User-fgiunchedi: Return graphite100[13] to spares pool (or decom) - https://phabricator.wikimedia.org/T209357 (10RobH) [22:12:31] 10Operations, 10decommission, 10hardware-requests, 10User-fgiunchedi: Return graphite200[12] to spares pool - https://phabricator.wikimedia.org/T199321 (10RobH) [22:19:31] 10Operations, 10Analytics: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (10Ottomata) @elukey can do this if he wants to, but I don't think Analytics considers oxygen to be part of its domain :) It's used only by SRE. 
[22:20:59] PROBLEM - graphite.wikimedia.org on graphite1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:24:05] RECOVERY - graphite.wikimedia.org on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1569 bytes in 8.636 second response time [22:25:25] (03PS1) 10Ayounsi: Assign public /29 for cloud-instance-transport1-b-eqiad [dns] - 10https://gerrit.wikimedia.org/r/479337 (https://phabricator.wikimedia.org/T207663) [22:25:38] (03CR) 10jerkins-bot: [V: 04-1] Assign public /29 for cloud-instance-transport1-b-eqiad [dns] - 10https://gerrit.wikimedia.org/r/479337 (https://phabricator.wikimedia.org/T207663) (owner: 10Ayounsi) [22:29:55] (03CR) 10Ottomata: [C: 032] Set systemd unit SyslogIdentifier=presto-server [debs/presto] (debian) - 10https://gerrit.wikimedia.org/r/479333 (owner: 10Ottomata) [22:32:26] (03PS1) 10Ottomata: Set presto hive connector name to hive-hadoop2 [puppet] - 10https://gerrit.wikimedia.org/r/479339 (https://phabricator.wikimedia.org/T204951) [22:32:49] (03CR) 10Ottomata: [V: 032 C: 032] Set presto hive connector name to hive-hadoop2 [puppet] - 10https://gerrit.wikimedia.org/r/479339 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [22:33:18] 10Operations, 10Analytics: setup/install weblog1001/WMF4750 as oxygen replacement - https://phabricator.wikimedia.org/T207760 (10RobH) Ahh, due to past discussions via linked tasks, I assumed he was part of the refresh-replace, so I made assumptions! if this needs to go to someone else @elukey let me know! 
[22:36:33] PROBLEM - cassandra-c SSL 10.192.48.51:7001 on restbase2006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [22:36:47] PROBLEM - cassandra-c CQL 10.192.48.51:9042 on restbase2006 is CRITICAL: connect to address 10.192.48.51 and port 9042: Connection refused [22:38:51] (03PS1) 10RobH: sessionstore100[1-3].eqiad.wmnet dns entries [dns] - 10https://gerrit.wikimedia.org/r/479342 (https://phabricator.wikimedia.org/T209393) [22:38:53] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10ayounsi) To be pushed: `lang=diff,name=cr1-eqiad [edit interfaces ae2 unit 1120 family inet] address 10.64.22.2/24 { ..... [22:39:45] (03CR) 10Ayounsi: "Not sure what's wrong with that CR?" [dns] - 10https://gerrit.wikimedia.org/r/479337 (https://phabricator.wikimedia.org/T207663) (owner: 10Ayounsi) [22:41:04] (03CR) 10RobH: [C: 032] sessionstore100[1-3].eqiad.wmnet dns entries [dns] - 10https://gerrit.wikimedia.org/r/479342 (https://phabricator.wikimedia.org/T209393) (owner: 10RobH) [22:42:12] (03CR) 10Jeena Huneidi: "> Patch Set 2: Code-Review-1" (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/479026 (https://phabricator.wikimedia.org/T211708) (owner: 10Jeena Huneidi) [22:42:15] (03PS1) 10Ottomata: Use thrift:// uri for hive http:// uri for discovery uri [puppet] - 10https://gerrit.wikimedia.org/r/479343 (https://phabricator.wikimedia.org/T204951) [22:42:30] (03CR) 10Ottomata: [V: 032 C: 032] Use thrift:// uri for hive http:// uri for discovery uri [puppet] - 10https://gerrit.wikimedia.org/r/479343 (https://phabricator.wikimedia.org/T204951) (owner: 10Ottomata) [22:47:59] !log change email for User:Denrique [22:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:23] (03PS1) 10RobH: new sessionstore100[1-3].eqiad.wmnet puppet updates [puppet] - 
10https://gerrit.wikimedia.org/r/479346 (https://phabricator.wikimedia.org/T209393) [22:52:09] (03PS2) 10RobH: new sessionstore100[1-3].eqiad.wmnet puppet updates [puppet] - 10https://gerrit.wikimedia.org/r/479346 (https://phabricator.wikimedia.org/T209393) [22:54:43] (03PS3) 10RobH: new sessionstore100[1-3].eqiad.wmnet puppet updates [puppet] - 10https://gerrit.wikimedia.org/r/479346 (https://phabricator.wikimedia.org/T209393) [22:54:45] (03CR) 10MarcoAurelio: [C: 031] "LGTM now (no comment about the logos as I haven't checked them this time). Thank you." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478945 (https://phabricator.wikimedia.org/T210752) (owner: 10Rafidaslam) [22:55:52] 10Operations, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Watching / External), 10Services (watching): rack/setup/install sessionstore200[123].codfw.wmnet - https://phabricator.wikimedia.org/T209389 (10RobH) a:05RobH>03Eevans So I think there is some confusio... 
[22:56:01] (03CR) 10RobH: [C: 032] new sessionstore100[1-3].eqiad.wmnet puppet updates [puppet] - 10https://gerrit.wikimedia.org/r/479346 (https://phabricator.wikimedia.org/T209393) (owner: 10RobH) [22:56:30] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.48.51:9042 on restbase2006 is CRITICAL: connect to address 10.192.48.51 and port 9042: Connection refused eevans Decommissioned (T210843) [22:56:30] ACKNOWLEDGEMENT - cassandra-c SSL 10.192.48.51:7001 on restbase2006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Decommissioned (T210843) [22:57:20] (03PS2) 10Ayounsi: Assign public /29 for cloud-instance-transport1-b-eqiad [dns] - 10https://gerrit.wikimedia.org/r/479337 (https://phabricator.wikimedia.org/T207663) [22:59:22] 10Operations, 10ops-codfw, 10Core Platform Team, 10Services (doing), and 2 others: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 (10Eevans) [23:00:02] 10Operations, 10ops-codfw, 10Core Platform Team, 10Services (doing), and 2 others: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 (10Eevans) 05Open>03Resolved This is done! [23:00:04] 10Operations, 10ops-codfw, 10Patch-For-Review, 10Services (watching), 10User-fgiunchedi: rack/setup/install restbase201[3-8].codfw.wmnet - https://phabricator.wikimedia.org/T209615 (10Eevans) [23:01:55] (03CR) 10Tim Starling: Refactor profiler.php and X-Wikimedia-Debug parsing (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477939 (owner: 10Tim Starling) [23:06:14] PROBLEM - puppet last run on argon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:07:05] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10ayounsi) @aborrero Everything is ready to be merged/commited. 
I used the name `vip-gw-cloudnet.wikimedia.org.` let me know if... [23:12:19] (03PS1) 10BryanDavis: osm::planet_sync: configure logrotate to use non-root user [puppet] - 10https://gerrit.wikimedia.org/r/479348 (https://phabricator.wikimedia.org/T211013) [23:25:33] 10Operations, 10hardware-requests, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing), 10User-Eevans: Hardware for session storage service - https://phabricator.wikimedia.org/T206017 (10RobH) 05Open>03Resolved [23:26:21] 10Operations, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing), 10User-Eevans: rack/setup/install sessionstore100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T209393 (10RobH) [23:27:52] 10Operations, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Kanban (Doing), 10User-Eevans: rack/setup/install sessionstore100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T209393 (10RobH) 05Open>03stalled a:05RobH>03Eevans Ok, I've chatted with @eevans about... [23:28:24] 10Operations, 10Core Platform Team (Session Management Service (CDP2)), 10Core Platform Team Backlog (Watching / External), 10Services (watching): rack/setup/install sessionstore200[123].codfw.wmnet - https://phabricator.wikimedia.org/T209389 (10RobH) 05Open>03stalled Ok, I've chatted with @eevans abou... [23:28:28] (03PS1) 10Cwhite: profile: enable statsd_exporter and add matching rules to logstash::collector [puppet] - 10https://gerrit.wikimedia.org/r/479353 (https://phabricator.wikimedia.org/T205870) [23:30:08] (03CR) 10Cwhite: "I wasn't sure how specific these collectors should be. 
Starting general and we can make more specific if that's the direction we want to " [puppet] - 10https://gerrit.wikimedia.org/r/479353 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [23:33:23] 10Operations, 10media-storage: rack/setup/install ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T209618 (10RobH) [23:37:30] RECOVERY - puppet last run on argon is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [23:39:00] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Check analytics1037 power supply status - https://phabricator.wikimedia.org/T179192 (10RobH) 05stalled>03Resolved This hasn't reoccurred in a very long time, none since this task creation, resolving. [23:42:43] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Core Platform Team Backlog (Watching / External), 10Services (watching): TEC3:O3:O3.1:Q2 Goal - Move Blubberoid, ZoteroV2, and Graphoid through the production CD Pipeline - https://phabricator.wikimedia.org/T205919 (10dduvall) [23:49:05] (03PS2) 10Dzahn: dumps:nfs: use (new) data types and move some things to parameters [puppet] - 10https://gerrit.wikimedia.org/r/479335 [23:50:08] (03CR) 10jerkins-bot: [V: 04-1] dumps:nfs: use (new) data types and move some things to parameters [puppet] - 10https://gerrit.wikimedia.org/r/479335 (owner: 10Dzahn) [23:51:33] (03PS2) 10Dzahn: phabricator: use Stdlib::Fqdn data type for hostname parameters [puppet] - 10https://gerrit.wikimedia.org/r/479327 [23:54:16] (03CR) 10Dduvall: [C: 031] "Works for me locally, including `helm test` after deploying the latest build that includes a working openapi spec. :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/479026 (https://phabricator.wikimedia.org/T211708) (owner: 10Jeena Huneidi) [23:59:49] (03PS5) 10Dzahn: phabricator: Enable php-fpm on phab1002 [puppet] - 10https://gerrit.wikimedia.org/r/478032 (https://phabricator.wikimedia.org/T211353) (owner: 10Paladox)