[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I ♥ Unicode. All rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190109T0000). [00:00:04] tgr: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:02:47] 10Operations, 10ops-codfw: asw-c-codfw - FPC 1 PEM 1 is not powered - https://phabricator.wikimedia.org/T213233 (10ayounsi) [00:02:57] greg-g: Can I add a deploy window for SDC on Thursday at 08:00–09:00? (It's currently just the EU train placeholder.) [00:03:54] (03CR) 10Dzahn: [C: 04-1] "i tried to amend to this to do that and then noticed i had to rebase.. and when i did that i saw "AuthName "Developer account (use wiki lo" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480869 (owner: 10Framawiki) [00:06:30] (03PS14) 10Gergő Tisza: Make password policy and logging code saner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481115 [00:06:57] (03PS2) 10Dzahn: Specify allowed ldap groups by site logins [puppet] - 10https://gerrit.wikimedia.org/r/480869 (owner: 10Framawiki) [00:10:14] (03CR) 10Dzahn: "- since 54d6f1f9101 the term "developer account" is used to describe LDAP backend" [puppet] - 10https://gerrit.wikimedia.org/r/480869 (owner: 10Framawiki) [00:11:27] (03CR) 10Dzahn: "let's also add the allowed group names as suggested on I5ffc7aabfea78 .. 
rebased that on top of this here" [puppet] - 10https://gerrit.wikimedia.org/r/467723 (https://phabricator.wikimedia.org/T179461) (owner: 10BryanDavis) [00:12:35] (03CR) 10Gergő Tisza: [C: 03+2] Make password policy and logging code saner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481115 (owner: 10Gergő Tisza) [00:13:43] (03Merged) 10jenkins-bot: Make password policy and logging code saner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481115 (owner: 10Gergő Tisza) [00:23:22] (03CR) 10jenkins-bot: Make password policy and logging code saner [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481115 (owner: 10Gergő Tisza) [00:27:17] (03PS1) 10Gergő Tisza: Fix wfGetPrivilegedGroups merge conflict [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483040 [00:28:04] (03CR) 10Gergő Tisza: [C: 03+2] "SWAT hotfix" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483040 (owner: 10Gergő Tisza) [00:29:22] (03Merged) 10jenkins-bot: Fix wfGetPrivilegedGroups merge conflict [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483040 (owner: 10Gergő Tisza) [00:33:20] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:481115|Make password policy and logging code saner]] (duration: 00m 55s) [00:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:33:27] PROBLEM - puppet last run on ms-be1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:34:14] tgr: SWAT done? 
[00:34:28] !log tgr@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:481115|Make password policy and logging code saner]] (duration: 00m 52s) [00:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:35:02] (03PS2) 10Jforrester: [Wikimania] Create year namespaces for each Wikimania, 2005–2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455049 (https://phabricator.wikimedia.org/T202683) [00:36:20] (03CR) 10jenkins-bot: Fix wfGetPrivilegedGroups merge conflict [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483040 (owner: 10Gergő Tisza) [00:38:00] James_F: yeah, done [00:38:21] Cool, taking deployment conch. [00:38:24] (03CR) 10Jforrester: [C: 03+2] [Wikimania] Create year namespaces for each Wikimania, 2005–2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455049 (https://phabricator.wikimedia.org/T202683) (owner: 10Jforrester) [00:39:30] (03Merged) 10jenkins-bot: [Wikimania] Create year namespaces for each Wikimania, 2005–2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455049 (https://phabricator.wikimedia.org/T202683) (owner: 10Jforrester) [00:39:49] James_F: sure [00:40:06] greg-g: Awesome, will edit. 
[00:43:50] (03PS1) 10Jforrester: [Wikimania] Fix typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483043 [00:43:58] (03CR) 10Jforrester: [C: 03+2] [Wikimania] Fix typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483043 (owner: 10Jforrester) [00:44:47] (03PS3) 10Ayounsi: Monitoring: add VRRP check [puppet] - 10https://gerrit.wikimedia.org/r/481154 (https://phabricator.wikimedia.org/T150264) (owner: 10Faidon Liambotis) [00:45:30] (03Merged) 10jenkins-bot: [Wikimania] Fix typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483043 (owner: 10Jforrester) [00:45:32] (03CR) 10jerkins-bot: [V: 04-1] Monitoring: add VRRP check [puppet] - 10https://gerrit.wikimedia.org/r/481154 (https://phabricator.wikimedia.org/T150264) (owner: 10Faidon Liambotis) [00:47:49] (03PS1) 10AndyRussG: Give protect right to centralnoticeadmin on Meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483044 (https://phabricator.wikimedia.org/T209873) [00:48:05] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T202683 [Wikimania] Create year namespaces for each Wikimania, 2005–2019 (duration: 00m 53s) [00:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:48:07] T202683: On Wikimania wiki, create a namespace for each year from 2005 to 2019 - https://phabricator.wikimedia.org/T202683 [00:48:14] (03Abandoned) 10Addshore: WIP Add grafana_json_datasource [puppet] - 10https://gerrit.wikimedia.org/r/322220 (https://phabricator.wikimedia.org/T147328) (owner: 10Addshore) [00:48:54] (03CR) 10jenkins-bot: [Wikimania] Create year namespaces for each Wikimania, 2005–2019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/455049 (https://phabricator.wikimedia.org/T202683) (owner: 10Jforrester) [00:48:56] (03CR) 10jenkins-bot: [Wikimania] Fix typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483043 (owner: 10Jforrester) [00:51:12] (03PS4) 10Ayounsi: Monitoring: add VRRP check [puppet] - 
10https://gerrit.wikimedia.org/r/481154 (https://phabricator.wikimedia.org/T150264) (owner: 10Faidon Liambotis) [00:52:13] (03CR) 10AndyRussG: [C: 04-1] "Requires confirmation that this is acceptable policy." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483044 (https://phabricator.wikimedia.org/T209873) (owner: 10AndyRussG) [00:52:27] (03PS2) 10AndyRussG: Give protect right to centralnoticeadmin on Meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483044 (https://phabricator.wikimedia.org/T209873) [00:53:10] (03PS1) 10Jforrester: [Wikimania] Add 2019 content to default search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483045 [00:53:50] (03CR) 10Jforrester: [C: 03+2] [Wikimania] Add 2019 content to default search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483045 (owner: 10Jforrester) [00:54:59] (03Merged) 10jenkins-bot: [Wikimania] Add 2019 content to default search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483045 (owner: 10Jforrester) [00:55:28] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/14223/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/481154 (https://phabricator.wikimedia.org/T150264) (owner: 10Faidon Liambotis) [00:56:20] (03PS1) 10Smalyshev: Make config suitable for multiple instances of Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/483047 (https://phabricator.wikimedia.org/T213234) [00:57:16] (03CR) 10jerkins-bot: [V: 04-1] Make config suitable for multiple instances of Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/483047 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [00:58:22] Eurgh, when will this Wikibase patch ever land?! 
[00:58:29] (03PS1) 10Krinkle: xhgui: Disable deletion features [puppet] - 10https://gerrit.wikimedia.org/r/483048 (https://phabricator.wikimedia.org/T213218) [00:58:52] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [Wikimania] Add 2019 content to default search (duration: 00m 53s) [00:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:00:40] (03PS2) 10Jforrester: Remove mobilelanding.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482493 (https://phabricator.wikimedia.org/T187716) (owner: 10MaxSem) [01:00:53] (03CR) 10Jforrester: [C: 03+2] "zero.… now 301s (not 302s), so this can be safely removed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482493 (https://phabricator.wikimedia.org/T187716) (owner: 10MaxSem) [01:01:16] Oh, right, you can't `sync-file` when deleting. Lame. [01:02:00] (03Merged) 10jenkins-bot: Remove mobilelanding.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482493 (https://phabricator.wikimedia.org/T187716) (owner: 10MaxSem) [01:02:08] (03CR) 10jenkins-bot: [Wikimania] Add 2019 content to default search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483045 (owner: 10Jforrester) [01:02:13] (03CR) 10jenkins-bot: Remove mobilelanding.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482493 (https://phabricator.wikimedia.org/T187716) (owner: 10MaxSem) [01:04:31] !log jforrester@deploy1001 Synchronized docroot/: T187716 Remove mobilelanding.php, no longer pointed to by Apache (duration: 00m 52s) [01:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:34] T187716: Sunset Wikipedia Zero - https://phabricator.wikimedia.org/T187716 [01:04:41] RECOVERY - puppet last run on ms-be1034 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [01:05:10] (03PS2) 10Jforrester: Disable ZeroBanner and ZeroPortal on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482097 
(https://phabricator.wikimedia.org/T212864) [01:05:24] (03PS2) 10Jforrester: Drop the Wikipedia Zero debug log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482099 (https://phabricator.wikimedia.org/T212865) [01:08:30] herron: CC-ed you on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/483048/ - no rush, but let me know if it works to roll out sometime this week (affects 3 nodes: tungsten, webperf[12]002; of which only tungsten is currently live) [01:14:19] (03PS2) 10Smalyshev: Make config suitable for multiple instances of Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/483047 (https://phabricator.wikimedia.org/T213234) [01:15:16] (03CR) 10jerkins-bot: [V: 04-1] Make config suitable for multiple instances of Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/483047 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [01:15:24] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.12/extensions/Wikibase/repo/RepoHooks.php: T213227 Don't have onApiCheckCanExecute die for inactive entity types (duration: 00m 53s) [01:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:15:27] T213227: Wikibase RepoHooks::onApiCheckCanExecute dies (via EntityHandler::getEntityNamespaces's assert) for all edits on wikis where Item isn't enabled? - https://phabricator.wikimedia.org/T213227 [01:15:50] OK, conch released. 
[01:36:25] Krinkle: ok, will take a closer look tomorrow [01:36:46] (03PS3) 10Smalyshev: Make config suitable for multiple instances of Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/483047 (https://phabricator.wikimedia.org/T213234) [01:37:39] (03CR) 10jerkins-bot: [V: 04-1] Make config suitable for multiple instances of Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/483047 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [02:00:32] (03PS4) 10Smalyshev: Make config suitable for multiple instances of Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/483047 (https://phabricator.wikimedia.org/T213234) [02:01:01] (03CR) 10jerkins-bot: [V: 04-1] Make config suitable for multiple instances of Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/483047 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [02:02:48] (03PS5) 10Smalyshev: Make config suitable for multiple instances of Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/483047 (https://phabricator.wikimedia.org/T213234) [02:03:43] (03CR) 10jerkins-bot: [V: 04-1] Make config suitable for multiple instances of Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/483047 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [02:04:30] (03PS2) 10Jforrester: robots.php: Drop the special treatment for Wikipedia Zero [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482100 (https://phabricator.wikimedia.org/T212865) [02:04:38] (03PS2) 10Jforrester: zerowiki: Stop whitelisting ZeroPortal to logged out users, no longer available [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482101 (https://phabricator.wikimedia.org/T212865) [02:04:46] (03PS2) 10Jforrester: Drop ZeroBanner and ZeroPortal from production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482102 (https://phabricator.wikimedia.org/T212865) [02:04:54] (03PS2) 10Jforrester: Stop configuring ZeroBanner and ZeroPortal, unused [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/482103 (https://phabricator.wikimedia.org/T212865) [02:05:00] (03PS2) 10Jforrester: Stop loading i18n for ZeroBanner and ZeroPortal, unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482104 (https://phabricator.wikimedia.org/T212865) [02:07:10] (03PS6) 10Smalyshev: Make config suitable for multiple instances of Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/483047 (https://phabricator.wikimedia.org/T213234) [02:07:53] (03CR) 10jerkins-bot: [V: 04-1] Make config suitable for multiple instances of Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/483047 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [02:08:57] (03PS6) 10Jforrester: Install but don't enable the WikibaseMediaInfo extension, part IV [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446844 (https://phabricator.wikimedia.org/T180981) [02:08:59] (03PS5) 10Jforrester: Enable WikibaseMediaInfo on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466955 (https://phabricator.wikimedia.org/T159708) [02:09:01] (03PS3) 10Jforrester: [Beta Cluster] Cleanup SDC config, all same as prod now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/470459 [02:12:23] (03PS7) 10Smalyshev: Make config suitable for multiple instances of Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/483047 (https://phabricator.wikimedia.org/T213234) [02:13:02] (03CR) 10jerkins-bot: [V: 04-1] Make config suitable for multiple instances of Blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/483047 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [02:16:29] (03CR) 10Smalyshev: "@Gehel - not sure why it doesn't work, do you have any ideas? 
Do we need separate prometheus exporters for two Blazegraph instances or we " [puppet] - 10https://gerrit.wikimedia.org/r/483047 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [02:17:08] (03CR) 10BryanDavis: Specify allowed ldap groups by site logins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/480869 (owner: 10Framawiki) [02:19:17] (03PS2) 10Jforrester: Enforce a 10-byte password for privileged users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479570 (https://phabricator.wikimedia.org/T208246) [02:19:19] (03PS2) 10Jforrester: Require an 8-byte new password for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479571 (https://phabricator.wikimedia.org/T211622) [02:19:21] (03PS2) 10Jforrester: Require that passwords are not in any common list for privileged groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479573 [02:19:23] (03PS2) 10Jforrester: Require that passwords are not in the most common 100k list for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479574 (https://phabricator.wikimedia.org/T151425) [02:19:32] (03Abandoned) 10Jforrester: Require passwords do not match account names for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479572 (https://phabricator.wikimedia.org/T208441) (owner: 10Jforrester) [02:20:06] (03CR) 10Jforrester: [C: 04-2] "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479570 (https://phabricator.wikimedia.org/T208246) (owner: 10Jforrester) [02:21:36] (03CR) 10Jforrester: "> Patch Set 1: Code-Review-1" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479571 (https://phabricator.wikimedia.org/T211622) (owner: 10Jforrester) [02:22:08] (03CR) 10Jforrester: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479573 (owner: 10Jforrester) [02:22:47] (03CR) 10Jforrester: Require that passwords are not in the most common 100k list for all users (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/479574 (https://phabricator.wikimedia.org/T151425) (owner: 10Jforrester) [02:23:08] (03CR) 10BryanDavis: Improve list of privileged groups (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483022 (owner: 10Gergő Tisza) [02:24:59] (03PS2) 10Gergő Tisza: Improve list of privileged groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483022 [02:25:17] (03CR) 10jerkins-bot: [V: 04-1] Improve list of privileged groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483022 (owner: 10Gergő Tisza) [02:30:26] (03CR) 10Gergő Tisza: Improve list of privileged groups (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483022 (owner: 10Gergő Tisza) [02:30:42] (03PS3) 10Gergő Tisza: Improve list of privileged groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483022 [02:33:28] (03CR) 10Gergő Tisza: [C: 03+1] Enforce a 10-byte password for privileged users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479570 (https://phabricator.wikimedia.org/T208246) (owner: 10Jforrester) [02:36:39] (03CR) 10Gergő Tisza: [C: 03+1] "Not sure how this is more complete but if you feel strongly about it :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479573 (owner: 10Jforrester) [02:40:35] (03CR) 10Gergő Tisza: [C: 03+1] "Does what it says. I8d8f738176 will probably supersede it though." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479574 (https://phabricator.wikimedia.org/T151425) (owner: 10Jforrester) [02:47:34] (03CR) 10Ottomata: "I think we probably won't end up doing this after all... we will see." 
[puppet] - 10https://gerrit.wikimedia.org/r/482867 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [03:32:09] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 890.82 seconds [04:36:27] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 274.31 seconds [04:42:55] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received [04:44:01] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [04:49:46] 10Operations, 10WMF-Communications, 10Wikimedia-Mailing-lists, 10Design: Update Wikimedia logo on Mailman web pages from colored version to black and white version - https://phabricator.wikimedia.org/T212674 (10Ladsgroup) 05Open→03Declined Thanks for the notes. [05:08:35] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received [05:09:39] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [05:12:01] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [05:15:51] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [05:16:59] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [05:18:09] RECOVERY - Check systemd state on wdqs1007 is OK: OK - running: The system is fully operational [05:19:23] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [05:21:41] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Scrapes sample page) timed out before a response was received [05:22:47] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [05:59:00] !log kartik@deploy1001 Started deploy [cxserver/deploy@1098942]: Update cxserver to 656c468 [05:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:09] RECOVERY - Check systemd state on wdqs1008 is OK: OK - running: The system is fully operational [06:03:08] !log kartik@deploy1001 Finished deploy [cxserver/deploy@1098942]: Update cxserver to 656c468 (duration: 04m 08s) [06:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:15] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [06:18:51] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [06:19:39] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [06:24:51] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [06:41:35] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [06:41:47] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [06:42:37] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [06:45:01] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [06:45:11] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [06:49:05] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [06:50:37] (03PS3) 10Elukey: Specify allowed ldap groups by site logins [puppet] - 10https://gerrit.wikimedia.org/r/480869 (owner: 10Framawiki) [06:51:26] (03CR) 10Elukey: [C: 03+2] Specify allowed ldap groups by site logins [puppet] - 10https://gerrit.wikimedia.org/r/480869 (owner: 10Framawiki) [06:58:50] (03CR) 10Elukey: [C: 03+2] "As FYI I don't see the string message in Chrome at the moment, found https://codereview.chromium.org/1466473003" [puppet] - 10https://gerrit.wikimedia.org/r/480869 (owner: 10Framawiki) [07:04:06] (03PS3) 10Elukey: systemd: introduce timer::job define [puppet] - 10https://gerrit.wikimedia.org/r/482790 (owner: 10Giuseppe Lavagetto) [07:11:53] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [07:14:39] (03CR) 10Elukey: systemd: introduce timer::job define (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482790 (owner: 10Giuseppe Lavagetto) [07:15:29] (03PS4) 10Elukey: systemd: introduce timer::job define [puppet] - 10https://gerrit.wikimedia.org/r/482790 (owner: 10Giuseppe Lavagetto) [07:16:16] (03CR) 10jerkins-bot: [V: 04-1] systemd: introduce timer::job define [puppet] - 10https://gerrit.wikimedia.org/r/482790 (owner: 10Giuseppe Lavagetto) [07:16:40] buuuu [07:17:42] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/14225/an-coord1001.eqiad.wmnet/ looks good now, but tests are broken since the change [$i" 
[puppet] - 10https://gerrit.wikimedia.org/r/482790 (owner: 10Giuseppe Lavagetto) [07:19:07] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [07:26:13] <_joe_> elukey: what are you doing with my patch :P [07:26:20] ahahahah [07:26:35] I am stopping, just wanted to progress it :) [07:26:45] <_joe_> what did you change? [07:27:14] the thing that I commented in timer::job, [$interval] => $interval [07:27:20] <_joe_> that's wrong [07:27:25] <_joe_> go see systemd::timer [07:27:36] <_joe_> it expects an array of values [07:27:45] <_joe_> with minimum length 1 [07:28:21] sure but 1) me without coffee 2) timer_intervals in profile::analytics::systemd_timer was already an array :P [07:28:49] <_joe_> yeah I changed the definition to be more generic in the base define [07:28:58] <_joe_> not everyone will want to use 'OnCalendar' [07:29:14] <_joe_> also go get coffee :P [07:30:09] so $interval in profile::analytics::systemd_timer should only be a {..} ? [07:30:19] and [$interval] stays? 
[07:31:45] <_joe_> yep [07:31:55] <_joe_> gimme 20 minutes and I'll get back to that patch [07:32:31] I can try if you want :P [07:36:19] (03PS5) 10Elukey: systemd: introduce timer::job define [puppet] - 10https://gerrit.wikimedia.org/r/482790 (owner: 10Giuseppe Lavagetto) [07:36:26] * elukey hides [07:37:43] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/14226/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/482790 (owner: 10Giuseppe Lavagetto) [07:38:48] after this I'll move all the profile::analytics::systemd_timer occurrences to systemd::timer::job [07:39:10] 10Operations, 10Continuous-Integration-Infrastructure, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Grant sudo access for CI admins to doc.wikimedia.org publishing user - https://phabricator.wikimedia.org/T213169 (10hashar) [07:39:11] <_joe_> I don't think it's needed [07:39:21] <_joe_> you might want to specialize your timer defs further [07:39:26] <_joe_> and you have your own defaults [07:39:49] <_joe_> for instance, you might want to add a use_kerberos parameter [07:39:53] <_joe_> and add a wrapper [07:40:00] <_joe_> I did the same for mediawiki [07:43:52] !log contint1001: restarted Zuul to take in account SMTP configuration | https://gerrit.wikimedia.org/r/376739 | T93414 [07:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:55] T93414: Regularly run mwext-{name}-testextension-* jobs to make sure they are still passing after core or dependency changes - https://phabricator.wikimedia.org/T93414 [07:44:50] _joe_ yeah but the use_kerberos could simply be to pass a different command [07:45:40] anyway, will check what's best :) [07:45:45] thanks for the suggestions [07:52:50] <_joe_> elukey: let's merge that change then? 
[07:55:56] !log installing libseccomp updates from stretch point release [07:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:52] _joe_ lemme re-check it again just to be sure, then I'll merge ok? [08:03:14] <_joe_> elukey I'm fixing a last detail, nothing that changes significantly the results [08:03:22] ack [08:05:17] (03PS6) 10Giuseppe Lavagetto: systemd: introduce timer::job define [puppet] - 10https://gerrit.wikimedia.org/r/482790 [08:05:23] <_joe_> done :P [08:10:19] (03PS1) 10Elukey: Prevent using /var/lib/hadoop/data/c partition on an1054 [puppet] - 10https://gerrit.wikimedia.org/r/483062 (https://phabricator.wikimedia.org/T213038) [08:12:32] (03PS2) 10Elukey: Prevent using /var/lib/hadoop/data/c partition on an1054 [puppet] - 10https://gerrit.wikimedia.org/r/483062 (https://phabricator.wikimedia.org/T213038) [08:13:46] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/14228/analytics1054.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/483062 (https://phabricator.wikimedia.org/T213038) (owner: 10Elukey) [08:14:03] (03CR) 10Elukey: [C: 03+2] Prevent using /var/lib/hadoop/data/c partition on an1054 [puppet] - 10https://gerrit.wikimedia.org/r/483062 (https://phabricator.wikimedia.org/T213038) (owner: 10Elukey) [08:17:49] (03PS7) 10Elukey: systemd: introduce timer::job define [puppet] - 10https://gerrit.wikimedia.org/r/482790 (owner: 10Giuseppe Lavagetto) [08:19:53] (03CR) 10Elukey: [C: 03+2] systemd: introduce timer::job define [puppet] - 10https://gerrit.wikimedia.org/r/482790 (owner: 10Giuseppe Lavagetto) [08:21:01] RECOVERY - puppet last run on analytics1054 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [08:23:54] _joe_ no op on the analytics hosts for the timer change :) [08:23:57] thanks! 
[08:25:50] 10Operations, 10serviceops, 10Patch-For-Review, 10User-jijiki: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10elukey) 1. is done :) [08:26:29] <_joe_> nice, thanks [08:28:57] !log installing openssl security updates on stretch-based DB servers [08:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:52] (03CR) 10Giuseppe Lavagetto: site.pp: fold videoscalers into jobrunners (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482268 (owner: 10Giuseppe Lavagetto) [08:36:20] (03PS2) 10Giuseppe Lavagetto: site.pp: fold videoscalers into jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/482268 [08:42:13] (03PS3) 10Giuseppe Lavagetto: site.pp: fold videoscalers into jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/482268 [08:52:37] (03CR) 10Giuseppe Lavagetto: [C: 03+2] site.pp: fold videoscalers into jobrunners [puppet] - 10https://gerrit.wikimedia.org/r/482268 (owner: 10Giuseppe Lavagetto) [08:58:32] (03CR) 10Mathew.onipe: Elasticsearch failed shard allocation check (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [08:59:20] (03PS2) 10Mathew.onipe: Elasticsearch failed shard allocation check [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) [08:59:59] (03CR) 10jerkins-bot: [V: 04-1] Elasticsearch failed shard allocation check [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [09:03:52] (03CR) 10Jon Harald Søby: [C: 03+1] Remove NS 104 from wgContentNamespaces for euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/481266 (https://phabricator.wikimedia.org/T191396) (owner: 10Framawiki) [09:03:55] (03PS1) 10Banyek: mariadb: depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/483066 (https://phabricator.wikimedia.org/T210693) [09:05:20] (03PS3) 
10Mathew.onipe: Elasticsearch failed shard allocation check [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) [09:07:03] (03PS2) 10Hashar: cache/trafficserver: switch doc.wikimedia.org to doc1001 backend [puppet] - 10https://gerrit.wikimedia.org/r/480536 (https://phabricator.wikimedia.org/T137890) (owner: 10Dzahn) [09:10:54] (03CR) 10Jcrespo: [C: 03+1] mariadb: depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/483066 (https://phabricator.wikimedia.org/T210693) (owner: 10Banyek) [09:11:15] (03CR) 10Banyek: [C: 03+2] mariadb: depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/483066 (https://phabricator.wikimedia.org/T210693) (owner: 10Banyek) [09:11:28] (03PS2) 10Banyek: mariadb: depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/483066 (https://phabricator.wikimedia.org/T210693) [09:13:51] (03PS2) 10Giuseppe Lavagetto: videoscaler: remove last references to videoscalers as a separate cluster. [puppet] - 10https://gerrit.wikimedia.org/r/482269 [09:20:20] (03PS1) 10Elukey: profile::refinery::job::camus: conver netflow to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/483069 (https://phabricator.wikimedia.org/T172532) [09:21:48] RECOVERY - High lag on wdqs1007 is OK: (C)3600 ge (W)1200 ge 1158 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [09:22:13] (03CR) 10Hashar: "Funnily, when I have introduced the spec boiler plate to the profile module, I had hinted at this change:" [puppet] - 10https://gerrit.wikimedia.org/r/480957 (owner: 10Hashar) [09:24:01] (03PS2) 10Elukey: profile::refinery::job::camus: conver netflow to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/483069 (https://phabricator.wikimedia.org/T172532) [09:25:18] arturo: I don't see why not. 
Sure, feel free to merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/481341/ [09:25:58] akosiaris: will do later [09:26:41] !log depooled labsdb1010 [09:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:55] !log dropping materialized views on labsdb1010 - T210693 [09:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:57] T210693: Create materialized views on Wiki Replica hosts for better query performance - https://phabricator.wikimedia.org/T210693 [09:30:02] (03PS1) 10Elukey: cdh::exec: improper use of unless_command variable [puppet/cdh] - 10https://gerrit.wikimedia.org/r/483072 [09:30:26] (03CR) 10Elukey: [V: 03+2 C: 03+2] cdh::exec: improper use of unless_command variable [puppet/cdh] - 10https://gerrit.wikimedia.org/r/483072 (owner: 10Elukey) [09:31:49] (03PS1) 10Elukey: Update cdh submodule to latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/483073 [09:32:03] (03CR) 10Elukey: [V: 03+2 C: 03+2] Update cdh submodule to latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/483073 (owner: 10Elukey) [09:36:25] (03PS2) 10Volans: Upstream release v0.0.11 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/482804 [09:36:42] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/14232/ seems to do the right thing." [puppet] - 10https://gerrit.wikimedia.org/r/482269 (owner: 10Giuseppe Lavagetto) [09:37:04] (03CR) 10Giuseppe Lavagetto: [C: 03+2] videoscaler: remove last references to videoscalers as a separate cluster. [puppet] - 10https://gerrit.wikimedia.org/r/482269 (owner: 10Giuseppe Lavagetto) [09:37:15] (03PS3) 10Giuseppe Lavagetto: videoscaler: remove last references to videoscalers as a separate cluster. 
[puppet] - 10https://gerrit.wikimedia.org/r/482269 [09:38:44] !log repooling labsdb1010 - T210693 [09:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:47] T210693: Create materialized views on Wiki Replica hosts for better query performance - https://phabricator.wikimedia.org/T210693 [09:39:04] (03PS1) 10Banyek: Revert "mariadb: depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/483076 [09:39:22] (03PS1) 10ArielGlenn: version 0.0.9 [debs/mwbzutils] - 10https://gerrit.wikimedia.org/r/483077 (https://phabricator.wikimedia.org/T213200) [09:40:15] (03CR) 10Muehlenhoff: [C: 03+1] Upstream release v0.0.11 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/482804 (owner: 10Volans) [09:40:44] (03CR) 10Banyek: [C: 03+2] Revert "mariadb: depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/483076 (owner: 10Banyek) [09:40:50] (03PS2) 10Banyek: Revert "mariadb: depool labsdb1010" [puppet] - 10https://gerrit.wikimedia.org/r/483076 [09:43:40] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.11 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/482804 (owner: 10Volans) [09:44:11] !log Some CI npm jobs get broken due to a faulty node module. https://phabricator.wikimedia.org/T213249 [09:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:15] PROBLEM - DPKG on db2062 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:49:16] (03Merged) 10jenkins-bot: Upstream release v0.0.11 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/482804 (owner: 10Volans) [09:50:36] 10Operations, 10Traffic, 10Patch-For-Review: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (10ema) 05Open→03Resolved We've added TLS support for maps and fixed the SAN list on swift to ensure proper TLS connections with upload origin servers. This is thus done. 
[09:52:43] (03PS7) 10Giuseppe Lavagetto: role::beta: introduce docker_services [puppet] - 10https://gerrit.wikimedia.org/r/478637 [10:00:00] !log uploaded spicerack_0.0.11 to apt.wikimedia.org stretch-wikimedia T205884 [10:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:02] T205884: Spicerack: split wmf-auto-reimage-lib into Spicerack modules - https://phabricator.wikimedia.org/T205884 [10:01:09] !log upgraded spicerack to 0.0.11 on cumin2001 T205884 [10:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:58] akosiaris: hey, when you have a minute, can I make a new key to get back my prod access? [10:23:05] (03PS1) 10Elukey: systemd::timer: allow more normal forms for datetime type [puppet] - 10https://gerrit.wikimedia.org/r/483085 (https://phabricator.wikimedia.org/T172532) [10:26:36] (03PS3) 10Ema: cache/trafficserver: switch doc.wikimedia.org to doc1001 backend [puppet] - 10https://gerrit.wikimedia.org/r/480536 (https://phabricator.wikimedia.org/T137890) (owner: 10Dzahn) [10:27:07] (03CR) 10Ema: [C: 03+2] cache/trafficserver: switch doc.wikimedia.org to doc1001 backend [puppet] - 10https://gerrit.wikimedia.org/r/480536 (https://phabricator.wikimedia.org/T137890) (owner: 10Dzahn) [10:27:13] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/14236/" [puppet] - 10https://gerrit.wikimedia.org/r/483085 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [10:27:43] Amir1: sure. So what happened? new laptop? [10:28:51] akosiaris: no, Flixbus left me in the middle of nowhere when they stopped for a bathroom break. 
Laptop is one issue, passport is the other one [10:28:59] akosiaris: I'm making the patch [10:30:29] RECOVERY - DPKG on db2062 is OK: All packages OK [10:30:43] !log fixed package installation status on db2062 [10:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:08] (03PS1) 10Arturo Borrero Gonzalez: openstack: virt: dont install libssl1.0.0 in mitaka/stretch [puppet] - 10https://gerrit.wikimedia.org/r/483087 [10:31:40] (03PS2) 10Elukey: systemd::timer: allow more normal forms for datetime type [puppet] - 10https://gerrit.wikimedia.org/r/483085 (https://phabricator.wikimedia.org/T172532) [10:33:01] (03PS3) 10Elukey: systemd::timer: allow more normal forms for datetime type [puppet] - 10https://gerrit.wikimedia.org/r/483085 (https://phabricator.wikimedia.org/T172532) [10:33:08] (03CR) 10Arturo Borrero Gonzalez: "Compiler result is OK: https://integration.wikimedia.org/ci/view/operations/job/operations-puppet-catalog-compiler/14237/console" [puppet] - 10https://gerrit.wikimedia.org/r/483087 (owner: 10Arturo Borrero Gonzalez) [10:34:41] (03PS2) 10Zoranzoki21: Disable unused Flow extension on de.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478463 (https://phabricator.wikimedia.org/T207626) [10:35:38] (03PS1) 10Ladsgroup: Bring back Amir [puppet] - 10https://gerrit.wikimedia.org/r/483088 [10:37:03] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/483087 (owner: 10Arturo Borrero Gonzalez) [10:37:19] (03PS4) 10Arturo Borrero Gonzalez: wmcs: Add postgres maps users for eqiad1-r region [puppet] - 10https://gerrit.wikimedia.org/r/481341 (https://phabricator.wikimedia.org/T212596) (owner: 10BryanDavis) [10:37:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: Add postgres maps users for eqiad1-r region [puppet] - 10https://gerrit.wikimedia.org/r/481341 (https://phabricator.wikimedia.org/T212596) (owner: 10BryanDavis) [10:38:06] (03CR) 10Gehel: [C: 04-1] 
Elasticsearch failed shard allocation check (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [10:39:38] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "I don't see a reason to not accept this, but I'm not allowed to +2 in this repository." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482501 (owner: 10Zoranzoki21) [10:40:09] (03PS2) 10Arturo Borrero Gonzalez: openstack: virt: dont install libssl1.0.0 in mitaka/stretch [puppet] - 10https://gerrit.wikimedia.org/r/483087 [10:41:11] (03PS11) 10Fsero: Initial docker::registry::ha puppetization. [puppet] - 10https://gerrit.wikimedia.org/r/482675 (https://phabricator.wikimedia.org/T210076) [10:45:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: virt: dont install libssl1.0.0 in mitaka/stretch [puppet] - 10https://gerrit.wikimedia.org/r/483087 (owner: 10Arturo Borrero Gonzalez) [10:48:23] (03CR) 10DCausse: [C: 04-1] Elasticsearch failed shard allocation check (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [10:49:55] (03CR) 10Mathew.onipe: [C: 03+1] wdqs: reduce heap size to 31G to keep compressed oops [puppet] - 10https://gerrit.wikimedia.org/r/482863 (owner: 10Gehel) [11:02:16] 10Operations, 10Traffic, 10Patch-For-Review: HTTP/2 requests fail with too-long URLs - https://phabricator.wikimedia.org/T209590 (10ema) 05Open→03Resolved a:03ema The patch by @Vgutierrez fixed this bug. Closing. 
[11:03:05] RECOVERY - High lag on wdqs1008 is OK: (C)3600 ge (W)1200 ge 1190 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:03:26] (03CR) 10Elukey: "This patch needs https://gerrit.wikimedia.org/r/#/c/483085/" [puppet] - 10https://gerrit.wikimedia.org/r/483069 (https://phabricator.wikimedia.org/T172532) (owner: 10Elukey) [11:10:18] 10Operations, 10Traffic: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 (10ema) [11:10:26] 10Operations, 10Traffic: Partial cache_upload traffic switchover to ATS and switchback to Varnish - https://phabricator.wikimedia.org/T213263 (10ema) p:05Triage→03Normal [11:21:12] (03PS3) 10Ema: cache: hiera flag to use ATS as local backend [puppet] - 10https://gerrit.wikimedia.org/r/482024 (https://phabricator.wikimedia.org/T213263) [11:21:14] (03PS1) 10Ema: Add new conftool service "ats-be" [puppet] - 10https://gerrit.wikimedia.org/r/483094 (https://phabricator.wikimedia.org/T213263) [11:21:16] (03PS1) 10Ema: cache: define ATS nodes in hiera [puppet] - 10https://gerrit.wikimedia.org/r/483095 (https://phabricator.wikimedia.org/T213263) [11:23:48] !log stopping db1082 and db2052 s5 replication in sync to migrate db1124:s5 master [11:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:12] 10Operations, 10DBA, 10StructuredDiscussions, 10Growth-Team (Current Sprint), 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10Banyek) [11:30:08] 10Operations, 10Traffic, 10Patch-For-Review: Separate Traffic layer caches for PHP7/HHVM - https://phabricator.wikimedia.org/T206339 (10ema) [11:33:48] Hi there! Could someone please say bye bye to https://phabricator.wikimedia.org/p/Mikemcgee/? 
[11:36:26] (03CR) 10Alexandros Kosiaris: [C: 03+2] citoid: Move back to using zotero.svc.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/482808 (owner: 10Alexandros Kosiaris) [11:36:34] (03PS3) 10Alexandros Kosiaris: citoid: Move back to using zotero.svc.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/482808 [11:53:20] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Overall this patch seems good; however see some comments inline." (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/482675 (https://phabricator.wikimedia.org/T210076) (owner: 10Fsero) [11:54:20] !log enabling gtid on db1082 [11:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:50] !log enabling gtid on db1124:s5 [11:55:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy European Mid-day SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190109T1200). [12:00:04] bmansurov, Zoranzoki21, and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:10] o/ [12:00:13] \o [12:00:18] <3 [12:00:24] \o/ [12:00:25] \o/ [12:00:29] \o/ [12:00:39] :) [12:00:43] * addshore is not deploying, just watching [12:00:49] Nice start of SWAT :) Who will take this torch? [12:00:59] Amir1: you have backports? 
[12:01:12] zeljkof: yes and this is sorta important [12:01:26] (It can cause data corruption) [12:02:08] (03CR) 10Gehel: [C: 04-1] Elasticsearch failed shard allocation check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/482297 (https://phabricator.wikimedia.org/T212850) (owner: 10Mathew.onipe) [12:02:14] zeljkof: You can deploy Amir1's patches first [12:02:18] I can wait [12:02:25] (03PS1) 10Jcrespo: mariadb: Repool db1082 with minimal traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483102 (https://phabricator.wikimedia.org/T213108) [12:02:25] Thanks [12:02:26] Amir1: your patches will take a while to merge, right? [12:02:42] !log repool wdqs100[78] - data import complete - T213210 [12:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:44] T213210: WDQS is hitting allocator limit on Blazegraph - https://phabricator.wikimedia.org/T213210 [12:02:53] zeljkof: I don't know TBH [12:03:10] if cherry-picks are heavy, it can wait [12:03:20] Amir1: I guess we can deploy config changes while you wait for your changes to merge, let me know when it's merged and I'll let you deploy [12:03:43] Amir1: they run a lot of tests, I remember it taking 10-20 minutes to merge [12:03:48] (03PS3) 10Alexandros Kosiaris: sca: Remove the cluster from conftool [puppet] - 10https://gerrit.wikimedia.org/r/482809 [12:03:50] (03PS3) 10Alexandros Kosiaris: lvs: Remove all mentions of zoterov2 [puppet] - 10https://gerrit.wikimedia.org/r/482810 [12:03:52] (03PS1) 10Alexandros Kosiaris: sca: Remove the cluster [puppet] - 10https://gerrit.wikimedia.org/r/483103 [12:03:56] (03CR) 10Volans: "recheck" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443366 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [12:04:05] oh okay, zeljkof do you want to +2? 
[12:04:15] Amir1: go ahead and merge your changes, I'll deploy config changes, let me know when your commits are merged [12:04:20] o/ Sorry I'm late [12:04:28] bmansurov: no problemo [12:04:34] sure [12:04:42] (03CR) 10Giuseppe Lavagetto: [C: 03+1] sca: Remove the cluster from conftool [puppet] - 10https://gerrit.wikimedia.org/r/482809 (owner: 10Alexandros Kosiaris) [12:04:55] Amir1: one of your commits has -1 from jenkins :/ [12:04:56] zeljkof: I can be back in 10 mins, is that OK? [12:05:06] (03CR) 10Giuseppe Lavagetto: [C: 03+1] lvs: Remove all mentions of zoterov2 [puppet] - 10https://gerrit.wikimedia.org/r/482810 (owner: 10Alexandros Kosiaris) [12:05:06] bmansurov: sure, ping me when you're back [12:05:12] zeljkof: thanks [12:05:24] for the record: I can swat today [12:05:33] it seems it's random failure [12:05:45] I'll deploy Zoranzoki21's changes until bmansurov is back and Amir1's commits are merged, sounds good? [12:05:48] (03CR) 10jerkins-bot: [V: 04-1] Kernels refactor [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443366 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [12:05:57] zeljkof: Ok from my point [12:06:16] 10Operations, 10Citoid: Production access need to zotero / citoid machines - https://phabricator.wikimedia.org/T213269 (10Mvolz) [12:08:29] 10Operations, 10Patch-For-Review: Onboarding John Bond - https://phabricator.wikimedia.org/T213079 (10MoritzMuehlenhoff) [12:09:02] (03CR) 10Jcrespo: [C: 03+2] mariadb: Repool db1082 with minimal traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483102 (https://phabricator.wikimedia.org/T213108) (owner: 10Jcrespo) [12:10:10] (03Merged) 10jenkins-bot: mariadb: Repool db1082 with minimal traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483102 (https://phabricator.wikimedia.org/T213108) (owner: 10Jcrespo) [12:10:45] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478463 (https://phabricator.wikimedia.org/T207626) 
(owner: 10Zoranzoki21) [12:11:00] Zoranzoki21: does a script need to be run for 478463? [12:11:35] zeljkof: Script for? [12:11:48] (03Merged) 10jenkins-bot: Disable unused Flow extension on de.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478463 (https://phabricator.wikimedia.org/T207626) (owner: 10Zoranzoki21) [12:11:50] zeljkof: Rollbacking pages at ns 0? [12:11:51] the task says "Please disable the extension and move back the page to namespace 0." [12:12:01] yeah, no sure how that's done [12:12:01] zeljkof: Yes [12:12:34] zeljkof: Is my patch on mwdebug1002? [12:12:36] which script :) [12:12:48] zeljkof: Checking... [12:12:49] just got merged, in a minute [12:13:14] (03PS1) 10Jbond42: Add John Bond (jbond) to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/483106 (https://phabricator.wikimedia.org/T213079) [12:15:05] (03CR) 10jenkins-bot: mariadb: Repool db1082 with minimal traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483102 (https://phabricator.wikimedia.org/T213108) (owner: 10Jcrespo) [12:15:07] (03CR) 10jenkins-bot: Disable unused Flow extension on de.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478463 (https://phabricator.wikimedia.org/T207626) (owner: 10Zoranzoki21) [12:15:29] mmh, I didn't notice swat was ongoing, will deploy later [12:15:43] (03PS1) 10Giuseppe Lavagetto: aptrepo: allow importing xdebug in thirdparty/php72 [puppet] - 10https://gerrit.wikimedia.org/r/483107 (https://phabricator.wikimedia.org/T212757) [12:15:50] jynus: I was just about to ask about the patch :) [12:16:12] rebase as normal , I will deploy later, it is independent [12:16:20] jynus: ok, thanks [12:16:34] ping me when finished [12:16:40] Zoranzoki21: 483102 is at mwdebug [12:16:44] jynus: will do [12:16:50] thanks and sorry [12:17:11] no problem, I thought maybe something is going on [12:17:16] eventually pooling state will get removed from that [12:17:23] zeljkof: I'm ready whenever you are. 
[12:17:42] zeljkof: I think this script should do job https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Flow/+/master/maintenance/convertToText.php [12:17:45] zeljkof: it crashed yesterday, but the repool can wait [12:18:27] bmansurov: ok, I'm deploying one patch now, you can be next [12:18:33] (03CR) 10Mathew.onipe: [C: 03+1] wdqs: dashboard for WDQS lag has moved [puppet] - 10https://gerrit.wikimedia.org/r/482847 (owner: 10Gehel) [12:18:36] zeljkof: cool [12:19:25] zeljkof: Lets go to next [12:20:09] Zoranzoki21: I can't proceed with un-deployed code, can I deploy the patch, or should I revert it? [12:20:16] (03PS3) 10Bmansurov: Disable reader trust survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476370 (https://phabricator.wikimedia.org/T209882) [12:20:48] zeljkof: Revert it, then if anyone confirm to script is correct, I will request it for next SWAT.. Skip next patch for Flow too [12:22:19] (03PS1) 10Jcrespo: mariadb: Fully repool db1082 after recovery [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483108 (https://phabricator.wikimedia.org/T213108) [12:22:40] zeljkof: npm is going crazy :/ https://integration.wikimedia.org/ci/job/mediawiki-quibble-composer-mysql-php70-docker/11289/console [12:23:29] Amir1: yeah, shasum check error [12:23:39] as far as I remember, there is a caching problem [12:23:45] I have no idea how to fix this [12:23:50] hasharAway: can you help? 
[12:24:09] There is one task related to it [12:24:12] Amir1: I think just re-running the job fixes it, new vm/container will be used and it should work [12:24:29] oh okay, already did the recheck [12:25:46] Zoranzoki21: revert https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/483109 [12:25:58] zeljkof: ok [12:26:39] (03PS1) 10Zfilipin: Revert "Disable unused Flow extension on de.wikiversity" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483109 (https://phabricator.wikimedia.org/T207626) [12:26:39] (03CR) 10Zoranzoki21: [C: 03+1] Revert "Disable unused Flow extension on de.wikiversity" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483109 (https://phabricator.wikimedia.org/T207626) (owner: 10Zfilipin) [12:26:39] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483109 (https://phabricator.wikimedia.org/T207626) (owner: 10Zfilipin) [12:26:53] Zoranzoki21: bmansurov has one commit, so I'll deploy it now, then continue with your commits, ok? [12:27:03] zeljkof: Ok [12:27:25] (03Merged) 10jenkins-bot: Revert "Disable unused Flow extension on de.wikiversity" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483109 (https://phabricator.wikimedia.org/T207626) (owner: 10Zfilipin) [12:27:58] zeljkof: mine will go in in one minute [12:27:58] (03CR) 10jenkins-bot: Revert "Disable unused Flow extension on de.wikiversity" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483109 (https://phabricator.wikimedia.org/T207626) (owner: 10Zfilipin) [12:28:17] 10Operations, 10monitoring, 10Graphite, 10Performance-Team (Radar): Graphite generates a lot of 502 in Grafana - https://phabricator.wikimedia.org/T211747 (10Peter) 05Open→03Resolved a:03Peter I cannot reproduce now, seems to be fixed, thank you @CDanis and @fgiunchedi ! 
[12:28:29] (03PS1) 10Zoranzoki21: Reverted "Revert "Disable unused Flow extension on de.wikiversity"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483111 (https://phabricator.wikimedia.org/T207626) [12:30:41] Amir1: should I wait for you then? [12:31:16] it would be great, it's almost done [12:31:45] stupid question, I've reverted a commit that was not deployed, I don't need to deploy the revert, right? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/483109 [12:32:15] since the original commit was reverted before deployment, ah I do need to fetch and rebase at least [12:32:55] I'd do the same [12:33:16] so, no deployment of the revert, right? :) [12:33:45] Amir1: swat is yours, let me know when you've deployed your commits and I'll continue cc bmansurov Zoranzoki21 [12:34:08] zeljkof: I don't have the deployment rights right now [12:34:12] :( [12:34:21] bmansurov, Zoranzoki21: apologies for the delay, slight turbulence with the flight today [12:34:29] Amir1: ah, I need to deploy :D [12:34:33] zeljkof: no problem, as long as we land safely [12:34:43] (03CR) 10Muehlenhoff: [C: 03+1] "This change would require SRE meeting review, but Faidon green-lighted this change in advance as he's on vacation this week." 
[puppet] - 10https://gerrit.wikimedia.org/r/483106 (https://phabricator.wikimedia.org/T213079) (owner: 10Jbond42) [12:34:49] Amir1: no problemo, I forgot about that :D [12:34:52] zeljkof: No worry [12:35:04] I would recommend bmansurov go ahead [12:35:05] bmansurov: we'll discuss landing later :D [12:35:11] this is taking ages and zuul is lying [12:35:15] ;) [12:35:21] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:35:25] ok, I'll deploy bmansurov's patch then so he can go [12:35:33] \o/ [12:36:02] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476370 (https://phabricator.wikimedia.org/T209882) (owner: 10Bmansurov) [12:36:29] I'll be afk for a minute or two [12:37:06] (03Merged) 10jenkins-bot: Disable reader trust survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476370 (https://phabricator.wikimedia.org/T209882) (owner: 10Bmansurov) [12:38:52] bmansurov: the patch is at mwdebug1002, please test and let me know if I can deploy it [12:39:01] zeljkof: ok, testing [12:39:32] zeljkof: looks good, let's deploy it [12:39:37] ok [12:40:15] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:40:49] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:476370|Disable reader trust survey (T209882)]] (duration: 01m 07s) [12:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:59] T209882: Quicksurvey for reader trust - https://phabricator.wikimedia.org/T209882 [12:41:01] (03CR) 10jenkins-bot: Disable reader trust survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/476370 (https://phabricator.wikimedia.org/T209882) (owner: 10Bmansurov) [12:41:30] bmansurov: it's deployed, please test and thanks for deploying with #releng :) [12:41:43] zeljkof: thanks! [12:43:17] Amir1: your patches are merged, I can deploy them as soon as you're back [12:43:31] Zoranzoki21: I'll continue with your patches while waiting for Amir1 [12:43:38] zeljkof: Ok [12:44:46] (03PS2) 10Zfilipin: Enable signature button in toolbar for the "Arbitration" namespace in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482591 (https://phabricator.wikimedia.org/T213049) (owner: 10Zoranzoki21) [12:44:54] !log installing OpenSSL 1.0.2 security updates for stretch [12:44:55] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482591 (https://phabricator.wikimedia.org/T213049) (owner: 10Zoranzoki21) [12:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:51] zeljkof: I'm back now. Thanks! 
[12:45:58] (03Merged) 10jenkins-bot: Enable signature button in toolbar for the "Arbitration" namespace in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482591 (https://phabricator.wikimedia.org/T213049) (owner: 10Zoranzoki21) [12:46:56] Amir1: you're next then, just to finish 482591 [12:47:02] cool [12:47:27] Zoranzoki21: 482591 is at mwdebug1002 [12:47:58] zeljkof: Testing [12:49:56] zeljkof: Looks good [12:50:43] ok, deploying [12:51:42] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:482591|Enable signature button in toolbar for the "Arbitration" namespace in ruwiki (T213049)]] (duration: 00m 52s) [12:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:45] T213049: Enable signature button in toolbar for the "Arbitration" namespace in ruwiki - https://phabricator.wikimedia.org/T213049 [12:51:53] Zoranzoki: what're you getting deployed? [12:52:17] Zoranzoki21: it's deployed, since there's only 10 minutes left and two patches from Amir1, please move the rest of your patches to another swat [12:52:41] zeljkof: https://gerrit.wikimedia.org/r/c/482516/ can be merged directly [12:52:58] if there is free time, you can work on 482516 [12:53:14] Zoranzoki21: I will not have the time, is it urgent? [12:53:19] zeljkof: No [12:53:25] Amir1: should I deploy your patches to mwdebug first? 
[12:53:35] Zoranzoki21: please move them to another swat then [12:53:45] zeljkof: Ok [12:54:01] (03CR) 10jenkins-bot: Enable signature button in toolbar for the "Arbitration" namespace in ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482591 (https://phabricator.wikimedia.org/T213049) (owner: 10Zoranzoki21) [12:54:11] zeljkof: yes please [12:55:54] 10Operations, 10Cloud-Services, 10Wikibase-Quality, 10Wikibase-Quality-Constraints, and 5 others: Flood of WDQS requests from wbqc - https://phabricator.wikimedia.org/T204267 (10Addshore) So the spike reported again above was before https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikibaseQualityCon... [12:56:37] Amir1: 483097 is at mwdebug1002 [12:57:19] testing [12:58:05] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10Wikidata, and 6 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Addshore) [12:58:31] it's super slow [12:58:34] that's weird [12:58:52] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for uwsgi-striker [puppet] - 10https://gerrit.wikimedia.org/r/483114 (https://phabricator.wikimedia.org/T135991) [12:59:19] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for uwsgi-striker [puppet] - 10https://gerrit.wikimedia.org/r/483114 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:59:20] it was just a random transaction issue [12:59:29] zeljkof: it's working, please proceed [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190109T1300) [13:00:40] Amir1: ok, deploying [13:00:53] !log extending eu swat for 5-10 minutes [13:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:20] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for uwsgi-striker [puppet] - 10https://gerrit.wikimedia.org/r/483114 (https://phabricator.wikimedia.org/T135991) 
[13:03:56] Amir1: this might take a while... I'm not sure why, but it's stuck at `Checking for new runtime errors locally` [13:05:05] :/ [13:05:07] am I doing something wrong? this is the command I'm running: `scap sync-file php-1.33.0-wmf.12/ 'SWAT: [[gerrit:483097|Fix order of arguments in ChangeTags::getPrevTags ([T212703])]]'` [13:05:07] T212703: Impossible to remove change tags from revisions via UI - https://phabricator.wikimedia.org/T212703 [13:05:27] so I guess it's syncing everything in the branch :/ [13:05:31] I'm sorry. It looks okay to me. [13:05:42] (03PS1) 10Zoranzoki21: Remove main page special casing from eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483117 (https://phabricator.wikimedia.org/T212849) [13:05:44] (03PS2) 10Alexandros Kosiaris: Revert "Revoke ladsgroup access due to lost laptop" [puppet] - 10https://gerrit.wikimedia.org/r/481608 [13:05:49] yeah, we could just deploy the changetags.php (and omit tests) [13:06:02] (03CR) 10Alexandros Kosiaris: [C: 03+2] Bring back Amir [puppet] - 10https://gerrit.wikimedia.org/r/483088 (owner: 10Ladsgroup) [13:06:06] ah, I didn't notice that one is the tests [13:06:18] ok, it's moving forward, maybe it will just work [13:06:22] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Verified identify via a hangout, merging" [puppet] - 10https://gerrit.wikimedia.org/r/483088 (owner: 10Ladsgroup) [13:06:34] (03PS2) 10Alexandros Kosiaris: Bring back Amir [puppet] - 10https://gerrit.wikimedia.org/r/483088 (owner: 10Ladsgroup) [13:06:57] (03PS1) 10Zoranzoki21: Remove main page special casing from ruwikibooks and ruwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483118 (https://phabricator.wikimedia.org/T212849) [13:06:59] (03PS3) 10Alexandros Kosiaris: Revert "Revoke ladsgroup access due to lost laptop" [puppet] - 10https://gerrit.wikimedia.org/r/481608 [13:07:12] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Verified identify via a hangout, merging" [puppet] - 
10https://gerrit.wikimedia.org/r/481608 (owner: 10Alexandros Kosiaris) [13:07:36] (03CR) 10Muehlenhoff: "Note that Daniel pushed a follow up commit to your original patch which added Amir to the absent group, that also needs to be folded into " [puppet] - 10https://gerrit.wikimedia.org/r/481608 (owner: 10Alexandros Kosiaris) [13:08:14] !log zfilipin@deploy1001 Synchronized php-1.33.0-wmf.12/: SWAT: [[gerrit:483097|Fix order of arguments in ChangeTags::getPrevTags ([T212703])]] (duration: 06m 54s) [13:08:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:21] Amir1: it's deployed! [13:08:27] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Has been already AFAICT in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/483088/2" [puppet] - 10https://gerrit.wikimedia.org/r/481608 (owner: 10Alexandros Kosiaris) [13:08:39] not sure why it took so long, probably because it was checking the entire branch :) [13:08:44] thanks. what about wmf.9? [13:08:51] I'll deploy the next one to mwdebug [13:09:18] thank you so much, sorry for the trouble. it might cause issues that are really hard to fix :(((( [13:09:32] (03CR) 10Volans: [C: 03+1] "Indeed, I can confirm that it was already pre-approved." [puppet] - 10https://gerrit.wikimedia.org/r/483106 (https://phabricator.wikimedia.org/T213079) (owner: 10Jbond42) [13:11:02] Amir1: 483099 is at mwdebug1002 and no problem at all :) [13:11:53] (03PS4) 10Alexandros Kosiaris: lvs: Remove all mentions of zoterov2 [puppet] - 10https://gerrit.wikimedia.org/r/482810 [13:11:55] (03PS4) 10Alexandros Kosiaris: sca: Remove the cluster from conftool [puppet] - 10https://gerrit.wikimedia.org/r/482809 (https://phabricator.wikimedia.org/T212772) [13:11:57] (03PS2) 10Alexandros Kosiaris: sca: Remove the cluster [puppet] - 10https://gerrit.wikimedia.org/r/483103 (https://phabricator.wikimedia.org/T212772) [13:13:14] Amir1: just checking if you got my ping, since we're out of swat window, are you testing 483099? 
[13:13:17] mwdebug1002 is super super slow [13:13:22] sorry, I'm testing [13:13:49] I basically can't open any page [13:14:10] hm, not sure why [13:14:18] :/ [13:15:00] zeljkof: it's working [13:15:06] please move forward [13:15:41] Amir1: ok, deploying [13:21:17] 10Operations, 10Citoid: Production access need to zotero / citoid machines - https://phabricator.wikimedia.org/T213269 (10akosiaris) Yes of course. File a bug under #sre-access-requests. Per the #sre-access-requests main page, make sure to follow the instructions on https://wikitech.wikimedia.org/wiki/Requesti... [13:22:05] !log zfilipin@deploy1001 Synchronized php-1.33.0-wmf.9/: SWAT: [[gerrit:483099|Fix order of arguments in ChangeTags::getPrevTags ([T212703])]] (duration: 05m 50s) [13:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:00] T212703: Impossible to remove change tags from revisions via UI - https://phabricator.wikimedia.org/T212703 [13:23:04] Thanks! [13:24:55] Okay, I'm leaving for the rest of today. 
zeljkof contact me on irc if anything goes wrong [13:26:08] Amir1: sorry, had to step away for a minute, just saw it's deployed :) [13:26:20] no worries [13:26:32] ^^ [13:26:43] !log EU SWAT finished [13:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:10] jynus: apologies for the delay, but I'm finally done with swat [13:27:30] np [13:29:29] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1082 with low weight (duration: 00m 52s) [13:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:35] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for passive Icinga node [puppet] - 10https://gerrit.wikimedia.org/r/483125 (https://phabricator.wikimedia.org/T135991) [13:31:00] (03PS1) 10Hashar: Remove ci::publisher, no more used [puppet] - 10https://gerrit.wikimedia.org/r/483126 (https://phabricator.wikimedia.org/T137890) [13:31:24] (03PS1) 10DCausse: Add gitreview [debs/prometheus-elasticsearch-exporter] - 10https://gerrit.wikimedia.org/r/483127 [13:31:40] (03CR) 10Hashar: "Ema merged the Varnish change to switch doc.wikimedia.org to the new host doc1001.eqiad.wmnet. 
It is all working as expected \o/" [puppet] - 10https://gerrit.wikimedia.org/r/483126 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [13:32:04] !log forcing removal of restbase1016-c (host down way too long to salvage) -- T212418 [13:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:07] T212418: Memory error on restbase1016 - https://phabricator.wikimedia.org/T212418 [13:32:47] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove ci::publisher, no more used [puppet] - 10https://gerrit.wikimedia.org/r/483126 (https://phabricator.wikimedia.org/T137890) (owner: 10Hashar) [13:33:28] (03PS2) 10Gehel: wdqs: dashboard for WDQS lag has moved [puppet] - 10https://gerrit.wikimedia.org/r/482847 [13:34:17] (03CR) 10Gehel: [C: 03+2] wdqs: dashboard for WDQS lag has moved [puppet] - 10https://gerrit.wikimedia.org/r/482847 (owner: 10Gehel) [13:34:44] (03PS2) 10Gehel: wdqs: reduce heap size to 31G to keep compressed oops [puppet] - 10https://gerrit.wikimedia.org/r/482863 [13:35:52] (03CR) 10Gehel: [C: 03+2] "LGTM" [debs/prometheus-elasticsearch-exporter] - 10https://gerrit.wikimedia.org/r/483127 (owner: 10DCausse) [13:35:54] (03CR) 10Gehel: [V: 03+2 C: 03+2] Add gitreview [debs/prometheus-elasticsearch-exporter] - 10https://gerrit.wikimedia.org/r/483127 (owner: 10DCausse) [13:36:13] (03CR) 10Gehel: [C: 03+2] wdqs: reduce heap size to 31G to keep compressed oops [puppet] - 10https://gerrit.wikimedia.org/r/482863 (owner: 10Gehel) [13:39:12] 10Operations, 10serviceops, 10User-jijiki: Add `supervised` option to redis configuration - https://phabricator.wikimedia.org/T212102 (10jijiki) [13:44:44] (03PS1) 10Gehel: wdqs: reduce heap size to 31G to keep compressed oops [puppet] - 10https://gerrit.wikimedia.org/r/483129 [13:46:50] (03CR) 10Mathew.onipe: [C: 03+1] wdqs: reduce heap size to 31G to keep compressed oops [puppet] - 10https://gerrit.wikimedia.org/r/483129 (owner: 10Gehel) [13:47:12] (03CR) 10Gehel: [C: 03+2] wdqs: reduce heap size 
to 31G to keep compressed oops [puppet] - 10https://gerrit.wikimedia.org/r/483129 (owner: 10Gehel) [13:56:41] (03Abandoned) 10Banyek: admin: Change banyek's .bash_profile [puppet] - 10https://gerrit.wikimedia.org/r/466901 (owner: 10Banyek) [14:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190109T1400) [14:07:55] (03PS2) 10Muehlenhoff: Add John Bond (jbond) to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/483106 (https://phabricator.wikimedia.org/T213079) (owner: 10Jbond42) [14:09:32] (03CR) 10Muehlenhoff: [C: 03+2] Add John Bond (jbond) to the ops group [puppet] - 10https://gerrit.wikimedia.org/r/483106 (https://phabricator.wikimedia.org/T213079) (owner: 10Jbond42) [14:09:58] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 3.047e+05 gt 1e+05 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [14:10:55] seems already recovered --^ [14:11:15] the cgroup reading refreshlinks seems to have lagged a bit [14:12:42] 10Operations, 10Traffic, 10Patch-For-Review: HTTP/2 requests fail with too-long URLs - https://phabricator.wikimedia.org/T209590 (10Ricordisamoa) I still get dropped connections (not 414) with much longer URLs (total header size from 9442 bytes onward). Is that expected? 
[14:15:54] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 2182 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [14:18:22] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received [14:18:28] (03PS3) 10Volans: Kernels refactor [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443366 (https://phabricator.wikimedia.org/T198592) [14:18:30] (03PS3) 10Volans: DataTables: refactor column grouping [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443367 (https://phabricator.wikimedia.org/T198592) [14:18:32] (03PS3) 10Volans: Add search capability [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443368 (https://phabricator.wikimedia.org/T198592) [14:18:34] (03PS1) 10Volans: tests: mark test strings with escape as raw [software/debmonitor] - 10https://gerrit.wikimedia.org/r/483131 [14:18:36] (03PS2) 10Elukey: Remove decommed nodes from Analytics Hadoop's net topology [puppet] - 10https://gerrit.wikimedia.org/r/482767 (https://phabricator.wikimedia.org/T209929) [14:20:42] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [14:21:02] (03CR) 10jerkins-bot: [V: 04-1] Kernels refactor [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443366 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [14:21:06] (03CR) 10jerkins-bot: [V: 04-1] DataTables: refactor column grouping [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443367 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [14:21:09] (03CR) 10jerkins-bot: [V: 04-1] Add search capability [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443368 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [14:21:11] (03CR) 10jerkins-bot: [V: 
04-1] tests: mark test strings with escape as raw [software/debmonitor] - 10https://gerrit.wikimedia.org/r/483131 (owner: 10Volans) [14:21:54] (03PS1) 10Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483132 (https://phabricator.wikimedia.org/T212254) [14:22:08] 10Operations, 10Citoid, 10SRE-Access-Requests: Production access need to zotero / citoid machines - https://phabricator.wikimedia.org/T213269 (10Mvolz) [14:22:36] (03PS1) 10Volans: Rebuilt wheels for Django security update [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/483133 [14:25:27] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483132 (https://phabricator.wikimedia.org/T212254) (owner: 10Marostegui) [14:26:30] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483132 (https://phabricator.wikimedia.org/T212254) (owner: 10Marostegui) [14:26:43] 10Operations, 10Citoid, 10SRE-Access-Requests: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (10Mvolz) [14:27:36] 10Operations, 10Citoid, 10SRE-Access-Requests: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (10Mvolz) [14:27:37] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1077 T212254 (duration: 00m 52s) [14:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:40] T212254: Drop valid_tag table - https://phabricator.wikimedia.org/T212254 [14:27:58] 10Operations, 10Citoid, 10SRE-Access-Requests: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (10Mvolz) [14:28:54] !log valid_tag table on db1077 with replication (lag will be generated on labs s3) - T212254 [14:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:30:53] (03CR) 10Muehlenhoff: [C: 03+1] Rebuilt wheels for Django security update [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/483133 (owner: 10Volans) [14:31:47] (03PS5) 10Alexandros Kosiaris: lvs: Remove all mentions of zoterov2 [puppet] - 10https://gerrit.wikimedia.org/r/482810 [14:31:48] (03PS5) 10Alexandros Kosiaris: sca: Remove the cluster from conftool [puppet] - 10https://gerrit.wikimedia.org/r/482809 (https://phabricator.wikimedia.org/T212772) [14:31:51] (03PS3) 10Alexandros Kosiaris: sca: Remove the cluster [puppet] - 10https://gerrit.wikimedia.org/r/483103 (https://phabricator.wikimedia.org/T212772) [14:31:52] (03PS1) 10Alexandros Kosiaris: mtail: Remove sca2004 from tests [puppet] - 10https://gerrit.wikimedia.org/r/483134 [14:32:11] (03CR) 10Volans: [V: 03+2 C: 03+2] Rebuilt wheels for Django security update [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/483133 (owner: 10Volans) [14:32:21] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483135 [14:34:00] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483135 (owner: 10Marostegui) [14:34:10] !log volans@deploy1001 Started deploy [debmonitor/deploy@0f096de]: Deploy Django security upgrade [14:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:26] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483135 (owner: 10Marostegui) [14:35:56] 10Operations, 10Citoid, 10SRE-Access-Requests: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (10marcella) This request has my support as Marielle's engineering manager. Access will allow Marielle to better complete her work focused on improving t... 
[14:36:00] !log volans@deploy1001 Finished deploy [debmonitor/deploy@0f096de]: Deploy Django security upgrade (duration: 01m 50s) [14:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1077 T212254 (duration: 00m 53s) [14:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:36] T212254: Drop valid_tag table - https://phabricator.wikimedia.org/T212254 [14:36:39] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483132 (https://phabricator.wikimedia.org/T212254) (owner: 10Marostegui) [14:36:40] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1077" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483135 (owner: 10Marostegui) [14:36:51] (03PS1) 10Elukey: Assign role::spare::system to analytics1028->41 [puppet] - 10https://gerrit.wikimedia.org/r/483136 (https://phabricator.wikimedia.org/T209929) [14:38:33] (03CR) 10Elukey: [C: 03+2] Remove decommed nodes from Analytics Hadoop's net topology [puppet] - 10https://gerrit.wikimedia.org/r/482767 (https://phabricator.wikimedia.org/T209929) (owner: 10Elukey) [14:39:20] !log restart Hadoop HDFS namenodes on an-master100[1,2] to complete decom of analytics1028->41 [14:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:26] (03PS3) 10Hashar: Attempt to pull images before building [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/475843 (https://phabricator.wikimedia.org/T200720) [14:44:23] 10Operations, 10monitoring: Report problems found by mcelog - https://phabricator.wikimedia.org/T197086 (10CDanis) I think this work has mostly already happened? We have some mtail rules for mce events. https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/mtail/files/programs/k... 
[14:45:56] PROBLEM - puppet last run on proton1002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[apport],Package[diamond] [14:47:28] (03CR) 10Gehel: "> @Gehel - not sure why it doesn't work, do you have any ideas? Do we" [puppet] - 10https://gerrit.wikimedia.org/r/483047 (https://phabricator.wikimedia.org/T213234) (owner: 10Smalyshev) [14:51:21] 10Operations, 10Citoid, 10SRE-Access-Requests: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (10herron) [14:53:15] (03CR) 10Filippo Giunchedi: [C: 03+1] mtail: Remove sca2004 from tests [puppet] - 10https://gerrit.wikimedia.org/r/483134 (owner: 10Alexandros Kosiaris) [14:54:01] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, modulo typo" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483125 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:54:34] 10Operations, 10Citoid, 10SRE-Access-Requests: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (10Mvolz) @herron thanks - I have my ssh keypair, where do I put the public one? It doesn't seem to say in the docs. Gerrit? [14:57:53] 10Operations, 10Citoid, 10SRE-Access-Requests: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (10herron) Great! It would be fine to paste the public ssh key here in the task. Also, to clarify, are the groups being requested `zotero-admin` and `... [14:58:07] 10Operations, 10Citoid, 10SRE-Access-Requests: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (10herron) p:05Triage→03Normal [15:00:29] 10Operations, 10Citoid, 10SRE-Access-Requests: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (10mobrovac) >>! 
In T213269#4865988, @herron wrote: > Also, to clarify, are the groups being requested `zotero-admin` and `citoid-admin`? The `zotero-adm... [15:01:37] (03PS2) 10Jcrespo: mariadb: Fully repool db1082 after recovery [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483108 (https://phabricator.wikimedia.org/T213108) [15:03:34] 10Operations, 10Citoid, 10SRE-Access-Requests: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (10Mvolz) ` ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIKynyl6NqNr/+mz1p1RFkE4KLXr+gZQrisj56XtNjYOA marielle@dull ` Huh, these ED25519 ones are really short! [15:05:07] 10Operations, 10monitoring, 10Goal: TEC6: Upgrade metrics monitoring infrastructure core components (Q3 2018/19 goal) - https://phabricator.wikimedia.org/T213288 (10CDanis) [15:05:25] 10Operations, 10monitoring: Upgrade to Prometheus 2.x - https://phabricator.wikimedia.org/T187987 (10CDanis) [15:05:27] 10Operations, 10monitoring, 10Goal: TEC6: Upgrade metrics monitoring infrastructure core components (Q3 2018/19 goal) - https://phabricator.wikimedia.org/T213288 (10CDanis) [15:06:09] (03CR) 10Ottomata: [C: 03+1] Assign role::spare::system to analytics1028->41 [puppet] - 10https://gerrit.wikimedia.org/r/483136 (https://phabricator.wikimedia.org/T209929) (owner: 10Elukey) [15:06:23] (03PS1) 10Marostegui: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483139 (https://phabricator.wikimedia.org/T86338) [15:07:42] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483139 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [15:07:50] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for passive Icinga node [puppet] - 10https://gerrit.wikimedia.org/r/483125 (https://phabricator.wikimedia.org/T135991) [15:07:54] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 763.18 seconds 
[15:08:47] (03PS2) 10Elukey: Assign role::spare::system to analytics1028->41 [puppet] - 10https://gerrit.wikimedia.org/r/483136 (https://phabricator.wikimedia.org/T209929) [15:08:50] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483139 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [15:09:12] 10Operations, 10Patch-For-Review: Onboarding John Bond - https://phabricator.wikimedia.org/T213079 (10MoritzMuehlenhoff) [15:09:54] (03PS1) 10Arturo Borrero Gonzalez: apt: repository: trust also the source repo [puppet] - 10https://gerrit.wikimedia.org/r/483140 [15:10:02] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1106 T86338 T202167 (duration: 00m 52s) [15:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:07] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [15:10:07] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [15:10:08] !log Deploy schema change on db1106 (sanitarium s1 master) with replication, lag will be generated on s1 labs - T86338 T202167 [15:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:26] (03CR) 10Elukey: [C: 03+2] Assign role::spare::system to analytics1028->41 [puppet] - 10https://gerrit.wikimedia.org/r/483136 (https://phabricator.wikimedia.org/T209929) (owner: 10Elukey) [15:11:51] (03CR) 10Alexandros Kosiaris: [C: 04-1] "This is a bit unclear to me. In what way the intent the commit message says is going to be accomplished? 
Differently put, how will switchi" [puppet] - 10https://gerrit.wikimedia.org/r/482860 (owner: 10MSantos) [15:11:59] (03PS3) 10Jcrespo: mariadb: Fully repool db1082 after recovery [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483108 (https://phabricator.wikimedia.org/T213108) [15:12:00] RECOVERY - puppet last run on proton1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:13:36] (03PS2) 10Giuseppe Lavagetto: aptrepo: allow importing xdebug in thirdparty/php72 [puppet] - 10https://gerrit.wikimedia.org/r/483107 (https://phabricator.wikimedia.org/T212757) [15:15:03] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1106 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483139 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [15:15:06] (03CR) 10Giuseppe Lavagetto: [C: 03+2] aptrepo: allow importing xdebug in thirdparty/php72 [puppet] - 10https://gerrit.wikimedia.org/r/483107 (https://phabricator.wikimedia.org/T212757) (owner: 10Giuseppe Lavagetto) [15:20:09] (03PS1) 10Elukey: Remove decomissioned nodes from Analytics Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/483141 (https://phabricator.wikimedia.org/T209929) [15:21:26] (03CR) 10Elukey: [C: 03+2] Remove decomissioned nodes from Analytics Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/483141 (https://phabricator.wikimedia.org/T209929) (owner: 10Elukey) [15:22:09] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:22:33] I will restart it [15:23:00] (03CR) 10Ottomata: ":D" [puppet] - 10https://gerrit.wikimedia.org/r/482790 (owner: 10Giuseppe Lavagetto) [15:23:11] 10Operations, 10Traffic, 10Patch-For-Review: HTTP/2 requests fail with too-long URLs - https://phabricator.wikimedia.org/T209590 (10Anomie) 05Resolved→03Open Confirmed. The URLs described earlier work now, and there's even a range of URLs that give a 414 from Apache with HTTP/2 now, but I still get a dro... 
[15:23:17] !log restarting scb* pdfrender [15:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:01] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [15:24:04] (03PS1) 10Mathew.onipe: New upstream version [debs/prometheus-elasticsearch-exporter] - 10https://gerrit.wikimedia.org/r/483143 (https://phabricator.wikimedia.org/T210592) [15:26:06] 10Operations, 10Traffic, 10monitoring: prometheus-based graph significantly slower than statsd equivalent - https://phabricator.wikimedia.org/T212312 (10CDanis) Is this basically the same as T190992? Anyway I'm making all the 'slow prometheus query' tasks sub-tasks of the prometheus 2.x upgrade T187987 as t... [15:26:42] 10Operations, 10Traffic, 10monitoring: prometheus-based graph significantly slower than statsd equivalent - https://phabricator.wikimedia.org/T212312 (10CDanis) [15:26:44] 10Operations, 10Traffic, 10monitoring: prometheus: slow dashboards due to suboptimal query_range performance - https://phabricator.wikimedia.org/T190992 (10CDanis) [15:26:46] 10Operations, 10monitoring: Upgrade to Prometheus 2.x - https://phabricator.wikimedia.org/T187987 (10CDanis) [15:27:21] (03PS1) 10Gehel: make blazegraph port configurable [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/483144 [15:27:41] (03PS2) 10Gehel: make blazegraph port configurable [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/483144 (https://phabricator.wikimedia.org/T213289) [15:28:05] (03CR) 10DCausse: "does not seem to be a merge commit?" 
[debs/prometheus-elasticsearch-exporter] - 10https://gerrit.wikimedia.org/r/483143 (https://phabricator.wikimedia.org/T210592) (owner: 10Mathew.onipe) [15:30:38] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483147 [15:32:35] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483147 (owner: 10Marostegui) [15:33:40] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483147 (owner: 10Marostegui) [15:34:33] (03CR) 10SBassett: [C: 03+1] "All of the groups listed in the commit message sound sane to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483022 (owner: 10Gergő Tisza) [15:34:56] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1106 T86338 T202167 (duration: 00m 51s) [15:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:59] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [15:35:00] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [15:35:21] 10Operations, 10monitoring: Upgrade metrics monitoring infrastructure core components (FY2018-2019 Q3 TEC6) - https://phabricator.wikimedia.org/T213158 (10fgiunchedi) [15:35:23] 10Operations, 10monitoring, 10Goal: TEC6: Upgrade metrics monitoring infrastructure core components (Q3 2018/19 goal) - https://phabricator.wikimedia.org/T213288 (10fgiunchedi) [15:37:54] 10Operations, 10Patch-For-Review: Onboarding John Bond - https://phabricator.wikimedia.org/T213079 (10MoritzMuehlenhoff) [15:39:36] Going to do a quick deploy. Will only affect TestCommons. 
[15:39:59] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1106" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483147 (owner: 10Marostegui) [15:41:03] 10Operations, 10monitoring, 10Goal: TEC6: Upgrade metrics monitoring infrastructure core components (Q3 2018/19 goal) - https://phabricator.wikimedia.org/T213288 (10CDanis) [15:41:29] RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 52.54 seconds [15:41:42] (03PS2) 10Ottomata: [WIP] Helm chart for eventgate-analytics deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/483035 (https://phabricator.wikimedia.org/T211247) [15:42:37] hi James_F [15:42:41] :) [15:42:55] Hey addshore. Thanks for the merge. Let's find out what breaks now. ;-) [15:43:03] nothing! mwahahahaha [15:43:04] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 3 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10jijiki) [15:43:09] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 (10jijiki) 05Open→03Stalled [15:45:58] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 3 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10jijiki) [15:46:01] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 (10jijiki) 05Stalled→03Resolved [15:46:18] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 (10jijiki) [15:48:15] (03CR) 10Jcrespo: [C: 03+2] mariadb: Fully repool db1082 after recovery [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483108 (https://phabricator.wikimedia.org/T213108) (owner: 10Jcrespo) [15:48:19] RECOVERY - Juniper alarms on asw-c-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [15:48:21] (03PS4) 10Jcrespo: mariadb: Fully repool db1082 after recovery [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483108 (https://phabricator.wikimedia.org/T213108) [15:48:24] 10Operations, 10ops-codfw: asw-c-codfw - FPC 1 PEM 1 is not powered - https://phabricator.wikimedia.org/T213233 (10Papaul) 05Open→03Resolved papaul@asw-c-codfw> show chassis environment | match Power Power FPC 1 Power Supply 0 OK FPC 1 Power Supply 1 OK F... [15:48:30] 10Operations, 10fundraising-tech-ops, 10netops: Refresh Minfraud IP list - https://phabricator.wikimedia.org/T213100 (10cwdent) 05Open→03Resolved a:03cwdent @ayounsi thanks for the help, this looks good [15:50:02] !log Drop valid_tag tables from db1095 (s3) - T212254 [15:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:05] T212254: Drop valid_tag table - https://phabricator.wikimedia.org/T212254 [15:55:06] 10Operations, 10Performance-Team (Radar), 10User-Elukey: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (10elukey) I'd prefer 1.5.6 too, we'd be really close to upstream (atm 1.5.12) and getting help from them would surely be easier if needed. I'd also love to be able... 
[16:01:45] (03CR) 10Mathew.onipe: "> Patch Set 1:" [debs/prometheus-elasticsearch-exporter] - 10https://gerrit.wikimedia.org/r/483143 (https://phabricator.wikimedia.org/T210592) (owner: 10Mathew.onipe) [16:01:59] (03PS1) 10Muehlenhoff: Update Grafana package source [puppet] - 10https://gerrit.wikimedia.org/r/483159 [16:06:05] (03CR) 10CDanis: [C: 03+1] Update Grafana package source [puppet] - 10https://gerrit.wikimedia.org/r/483159 (owner: 10Muehlenhoff) [16:06:13] thanks moritzm :) [16:08:40] (03PS1) 10Vgutierrez: certcentral: Allow specifying authorized hosts and regex in the config [WIP] [software/certcentral] - 10https://gerrit.wikimedia.org/r/483163 (https://phabricator.wikimedia.org/T213301) [16:11:26] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.12/extensions/Wikibase/repo/RepoHooks.php: T213227 RepoHooks::onApiCheckCanExecute: Only fail if the edit is for our entity's slot (duration: 00m 54s) [16:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:28] T213227: Wikibase RepoHooks::onApiCheckCanExecute dies (via EntityHandler::getEntityNamespaces's assert) for all edits on wikis where Item isn't enabled? - https://phabricator.wikimedia.org/T213227 [16:12:48] (03PS1) 10Volans: remote: add workaround for Cumin bug [software/spicerack] - 10https://gerrit.wikimedia.org/r/483164 (https://phabricator.wikimedia.org/T213296) [16:13:03] And I'm out. [16:14:27] addshore: All looks good to me. Will mark as Resolved. 
[16:16:50] (03CR) 10Mathew.onipe: make blazegraph port configurable (031 comment) [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/483144 (https://phabricator.wikimedia.org/T213289) (owner: 10Gehel) [16:17:19] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1082 with full weight (duration: 00m 53s) [16:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:33] (03CR) 10jenkins-bot: mariadb: Fully repool db1082 after recovery [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483108 (https://phabricator.wikimedia.org/T213108) (owner: 10Jcrespo) [16:18:55] (03PS1) 10Jbond42: Small change to test merge permissions to use [puppet] - 10https://gerrit.wikimedia.org/r/483168 (https://phabricator.wikimedia.org/T213079) [16:21:01] (03CR) 10Volans: make blazegraph port configurable (031 comment) [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/483144 (https://phabricator.wikimedia.org/T213289) (owner: 10Gehel) [16:21:44] (03CR) 10Cwhite: [C: 03+1] mtail: Remove sca2004 from tests [puppet] - 10https://gerrit.wikimedia.org/r/483134 (owner: 10Alexandros Kosiaris) [16:22:34] (03CR) 10Cwhite: [C: 03+1] Enable base::service_auto_restart for passive Icinga node [puppet] - 10https://gerrit.wikimedia.org/r/483125 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:24:29] (03PS1) 10Marostegui: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483170 (https://phabricator.wikimedia.org/T86338) [16:24:48] 10Operations, 10ops-eqiad, 10Analytics: Rack A2's hosts alarm for PSU broken - https://phabricator.wikimedia.org/T212861 (10jcrespo) [16:24:52] 10Operations, 10DBA, 10Data-Services, 10Patch-For-Review: db1082 power loss resulted on mysql crash - https://phabricator.wikimedia.org/T213108 (10jcrespo) 05Open→03Resolved db1082 is fully repooled; it and db1124 had GTID re-enabled. 
[16:26:36] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483170 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [16:26:51] 10Operations, 10ops-eqiad, 10Analytics: Rack A2's hosts alarm for PSU broken - https://phabricator.wikimedia.org/T212861 (10jcrespo) I rebuilt db1082- we are no blocker for any maintenance on those servers, but we would prefer to stop mysql if there is a chance for the server to lose power, while it does not... [16:27:33] (03PS3) 10Gehel: make blazegraph port configurable [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/483144 (https://phabricator.wikimedia.org/T213289) [16:28:05] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483170 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [16:28:12] (03CR) 10Gehel: make blazegraph port configurable (032 comments) [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/483144 (https://phabricator.wikimedia.org/T213289) (owner: 10Gehel) [16:29:15] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1083 T86338 T202167 (duration: 00m 53s) [16:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:18] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [16:29:19] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [16:29:22] (03PS1) 10Dzahn: admins: add Daimona to ldap_only admins [puppet] - 10https://gerrit.wikimedia.org/r/483173 (https://phabricator.wikimedia.org/T211962) [16:29:34] !log Deploy schema change on db1083 - T86338 T202167 [16:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:46] (03PS2) 10Vgutierrez: certcentral: Allow specifying authorized hosts and regex in the config [WIP] [software/certcentral] - 
10https://gerrit.wikimedia.org/r/483163 (https://phabricator.wikimedia.org/T213301) [16:30:17] 10Operations, 10Patch-For-Review: Onboarding John Bond - https://phabricator.wikimedia.org/T213079 (10MoritzMuehlenhoff) [16:30:30] 10Operations, 10Patch-For-Review: Onboarding John Bond - https://phabricator.wikimedia.org/T213079 (10MoritzMuehlenhoff) [16:31:27] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483170 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [16:31:35] (03CR) 10Filippo Giunchedi: [C: 03+1] Update Grafana package source [puppet] - 10https://gerrit.wikimedia.org/r/483159 (owner: 10Muehlenhoff) [16:31:45] (03CR) 10Dzahn: [C: 03+2] admins: add Daimona to ldap_only admins [puppet] - 10https://gerrit.wikimedia.org/r/483173 (https://phabricator.wikimedia.org/T211962) (owner: 10Dzahn) [16:31:47] (03CR) 10Mathew.onipe: [C: 03+1] make blazegraph port configurable [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/483144 (https://phabricator.wikimedia.org/T213289) (owner: 10Gehel) [16:31:52] (03PS2) 10Dzahn: admins: add Daimona to ldap_only admins [puppet] - 10https://gerrit.wikimedia.org/r/483173 (https://phabricator.wikimedia.org/T211962) [16:32:24] 10Operations, 10monitoring, 10Goal: TEC6: Upgrade metrics monitoring infrastructure core components (Q3 2018/19 goal) - https://phabricator.wikimedia.org/T213288 (10herron) p:05Triage→03Normal [16:35:08] 10Operations, 10monitoring, 10netops, 10Patch-For-Review: Icinga check for VRRP - https://phabricator.wikimedia.org/T150264 (10ayounsi) a:05ayounsi→03faidon CR tested and updated, ready for reviews. 
[16:36:12] 10Operations, 10Continuous-Integration-Infrastructure, 10SRE-Access-Requests, 10Patch-For-Review, 10Performance-Team (Radar): Add krinkle to contint-docker group - https://phabricator.wikimedia.org/T213015 (10herron) Proceeding with this [16:36:48] (03PS2) 10Herron: Add krinkle to contint-docker group [puppet] - 10https://gerrit.wikimedia.org/r/482483 (https://phabricator.wikimedia.org/T213015) (owner: 10Krinkle) [16:37:41] (03CR) 10Herron: [C: 03+2] Add krinkle to contint-docker group [puppet] - 10https://gerrit.wikimedia.org/r/482483 (https://phabricator.wikimedia.org/T213015) (owner: 10Krinkle) [16:39:37] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10monitoring: upgrade prometheus-blazegraph-exporter to python3 - https://phabricator.wikimedia.org/T213305 (10Gehel) [16:40:36] (03PS2) 10Gehel: Change frequency of OSM replication on maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/482860 (owner: 10MSantos) [16:54:18] (03PS1) 10Dzahn: admins: fix real name field for Daimona [puppet] - 10https://gerrit.wikimedia.org/r/483178 [16:55:15] (03PS2) 10Dzahn: admins: fix real name field for Daimona [puppet] - 10https://gerrit.wikimedia.org/r/483178 [16:58:52] (03CR) 10Dzahn: [C: 03+2] admins: fix real name field for Daimona [puppet] - 10https://gerrit.wikimedia.org/r/483178 (owner: 10Dzahn) [17:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Morning SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190109T1700). [17:00:04] Zoranzoki21: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[17:00:43] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483180 [17:00:48] tarrow: there is one now too ;0 [17:01:23] addshore: oooh, true. 2 secs [17:02:24] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483180 (owner: 10Marostegui) [17:03:51] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483180 (owner: 10Marostegui) [17:05:28] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1083 T86338 T202167 (duration: 00m 52s) [17:05:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:32] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [17:05:32] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [17:05:40] (03PS1) 10Alexandros Kosiaris: mathoid: Move config.yaml into a template [deployment-charts] - 10https://gerrit.wikimedia.org/r/483184 [17:05:57] (03PS1) 10Tarrow: Increase PHP constraint check entities to 150 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483185 (https://phabricator.wikimedia.org/T209504) [17:07:04] (03CR) 10Tarrow: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483185 (https://phabricator.wikimedia.org/T209504) (owner: 10Tarrow) [17:07:43] oh tarrow although technically the slot is full [17:07:48] not sure if anyone is running swat though? 
[17:09:26] (03PS3) 10Vgutierrez: certcentral: Allow specifying authorized hosts and regex in the config [software/certcentral] - 10https://gerrit.wikimedia.org/r/483163 (https://phabricator.wikimedia.org/T213301) [17:10:07] (03CR) 10Alexandros Kosiaris: mathoid: Move config.yaml into a template (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/483184 (owner: 10Alexandros Kosiaris) [17:10:20] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483180 (owner: 10Marostegui) [17:11:00] (03CR) 10jerkins-bot: [V: 04-1] certcentral: Allow specifying authorized hosts and regex in the config [software/certcentral] - 10https://gerrit.wikimedia.org/r/483163 (https://phabricator.wikimedia.org/T213301) (owner: 10Vgutierrez) [17:11:45] It doesn't seem anyone is... Also, is there any reason for there to be no EU midday SWAT tomorrow? [17:11:58] greg-g: would know! [17:15:36] (03Abandoned) 10Dzahn: ircecho: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/448770 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [17:16:02] !log uploaded gdnsd-2.99.9949-beta-1+wmf1 to reprepro for stretch-wikimedia [17:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:09] I can SWAT. [17:17:41] James_F: while you're at it can you please check namespaceDupes.php --wiki=bewikibooks ? I think we forgot to run it last time (although I asked for it) [17:17:45] tarrow: Did you have something urgent? [17:18:21] James_F: nope, not really urgent :) [17:18:27] !log Ran `namespaceDupes.php --wiki=bewikibooks` on mwmaint1002, no change [17:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:32] Hauskatze: Done, but no-op. [17:18:47] James_F: any conflicts ? [17:18:56] Hauskatze: No. 
[17:19:00] perfect :) [17:19:04] thanky [17:19:22] tarrow: Zoranzoki21 doesn't seem to be around, so sling me a patch and I'll deploy. [17:19:49] James_F: thanks for asking though; if we have time at the end of the window I can deploy it myself. I'm just nervous enough about my own patches before I wade into SWATing others :) [17:19:52] (03CR) 10Ottomata: [C: 03+1] "Nice!" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/483184 (owner: 10Alexandros Kosiaris) [17:20:06] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/483185 ? [17:20:12] James_F: ah, cool. mind if I take over and do mine then? [17:20:21] yep! [17:20:28] tarrow: Go for it. I'll back off. :-) [17:20:34] Thanks! [17:21:22] (03CR) 10Ottomata: [C: 03+1] mathoid: Move config.yaml into a template (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/483184 (owner: 10Alexandros Kosiaris) [17:21:56] i'll be happy to stand-in for zoranzoki if that's fine [17:23:24] (03PS4) 10BBlack: authdns::scripts: no more python-jinja2 [puppet] - 10https://gerrit.wikimedia.org/r/480873 [17:23:26] (03PS1) 10BBlack: authdns: add NRPE for gdnsd checkconf [puppet] - 10https://gerrit.wikimedia.org/r/483187 [17:23:28] (03PS1) 10BBlack: authdns: reload (replace) gdnsd on config changes [puppet] - 10https://gerrit.wikimedia.org/r/483188 [17:24:12] oh, he has a lotta patches [17:24:28] (03CR) 10jerkins-bot: [V: 04-1] authdns: add NRPE for gdnsd checkconf [puppet] - 10https://gerrit.wikimedia.org/r/483187 (owner: 10BBlack) [17:24:33] nvm [17:24:35] (03CR) 10jerkins-bot: [V: 04-1] authdns: reload (replace) gdnsd on config changes [puppet] - 10https://gerrit.wikimedia.org/r/483188 (owner: 10BBlack) [17:24:35] addshore: tarrow fixed [17:25:07] woo [17:26:07] (03PS1) 10Marostegui: dbproxy1010: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/483190 (https://phabricator.wikimedia.org/T86338) [17:26:11] 10Operations, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team 
(Kanban): rebuild tools-grid-master as a large instance - https://phabricator.wikimedia.org/T162955 (10bd808) 05Open→03Resolved a:03Bstorm Both tools-sgegrid-master.tools.eqiad.wmflabs and tools-sgegrid-shadow.tools.eqiad.wmfla... [17:27:03] (03CR) 10Marostegui: [C: 03+2] dbproxy1010: Depool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/483190 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [17:28:03] !log Reload haproxy on dbproxy1010 to depool labsdb1011 - T86338 [17:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:07] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [17:28:16] (03CR) 10Tarrow: [C: 03+2] Increase PHP constraint check entities to 150 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483185 (https://phabricator.wikimedia.org/T209504) (owner: 10Tarrow) [17:29:23] (03Merged) 10jenkins-bot: Increase PHP constraint check entities to 150 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483185 (https://phabricator.wikimedia.org/T209504) (owner: 10Tarrow) [17:34:28] (03PS6) 10BBlack: Remove authdns-gen-zones.py [dns] - 10https://gerrit.wikimedia.org/r/480871 [17:34:58] (03CR) 10BBlack: [C: 03+2] Remove authdns-gen-zones.py [dns] - 10https://gerrit.wikimedia.org/r/480871 (owner: 10BBlack) [17:35:53] (03CR) 10BBlack: [C: 03+2] authdns::scripts: no more python-jinja2 [puppet] - 10https://gerrit.wikimedia.org/r/480873 (owner: 10BBlack) [17:36:06] (03PS5) 10BBlack: authdns::scripts: no more python-jinja2 [puppet] - 10https://gerrit.wikimedia.org/r/480873 [17:36:20] !log tarrow@deploy1001 Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 00m 53s) [17:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:39] tarrow: Tsk, no justification. 
:-P [17:36:43] (03CR) 10jenkins-bot: Increase PHP constraint check entities to 150 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483185 (https://phabricator.wikimedia.org/T209504) (owner: 10Tarrow) [17:37:05] tsk tsk :P [17:39:07] !log That last one was SWAT: [[gerrit:483185|T209504 Increase PHP constraint check entities to 150]] [17:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:10] T209504: Perform more constraint type checks in PHP before falling back to SPARQL - https://phabricator.wikimedia.org/T209504 [17:39:23] whoopsie [17:39:39] right; I'm now done if you want to do any others [17:40:42] That's a lid on SWAT. [17:40:45] (03PS2) 10BBlack: authdns: add NRPE for gdnsd checkconf [puppet] - 10https://gerrit.wikimedia.org/r/483187 [17:40:46] (03PS2) 10BBlack: authdns: reload (replace) gdnsd on config changes [puppet] - 10https://gerrit.wikimedia.org/r/483188 [17:41:19] (03CR) 10jerkins-bot: [V: 04-1] authdns: add NRPE for gdnsd checkconf [puppet] - 10https://gerrit.wikimedia.org/r/483187 (owner: 10BBlack) [17:41:41] Eh, whilst I'm here I'll make some Zero progress. 
[17:41:47] (03CR) 10jerkins-bot: [V: 04-1] authdns: reload (replace) gdnsd on config changes [puppet] - 10https://gerrit.wikimedia.org/r/483188 (owner: 10BBlack) [17:42:00] (03PS3) 10Jforrester: Disable ZeroBanner and ZeroPortal on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482097 (https://phabricator.wikimedia.org/T212864) [17:42:07] (03CR) 10Jforrester: [C: 03+2] Disable ZeroBanner and ZeroPortal on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482097 (https://phabricator.wikimedia.org/T212864) (owner: 10Jforrester) [17:43:06] (03Merged) 10jenkins-bot: Disable ZeroBanner and ZeroPortal on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482097 (https://phabricator.wikimedia.org/T212864) (owner: 10Jforrester) [17:43:14] we still have some puppet-driven stuff that does data pulls from zero.wikimedia.org I think [17:43:31] (which can be killed too, but I think it will throw some alerts or cronjob failures or something until we do) [17:43:52] bblack: Hmm. [17:45:15] bblack: Oh, zerofetch.py? Yeah, that's on my list to kill. [17:45:20] probably in general, it might make sense to decom zero from the edge before decomming it from the inside stuff :) [17:45:38] bblack: Yes, that's what I did. [17:45:39] yeah zerofetch and related bits [17:46:19] So just disable the zero_update manifest from prod? [17:46:46] (03CR) 10CDanis: [C: 03+1] mtail: Remove sca2004 from tests [puppet] - 10https://gerrit.wikimedia.org/r/483134 (owner: 10Alexandros Kosiaris) [17:46:51] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 604.07 seconds [17:47:13] we'll need to manually clean up as well. when you just remove puppet-defined things it just stops managing them, but doesn't e.g. delete cronjobs on the host, etc. [17:47:23] Ah. Sucky. 
[17:47:51] but then a real cleanup of zero_update means delete the files it deployed too, and then there's still deployed running VCL code that needs those files to exist [17:48:09] and other VCL code dependent on that which processes various cache-level zero things [17:48:19] Yeah. [17:48:25] it all has to be unwound at the edge in a certain order I think [17:48:29] OK, I'll revert for now. [17:48:51] (03PS1) 10Jforrester: Revert "Disable ZeroBanner and ZeroPortal on all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483192 [17:48:57] (03CR) 10Jforrester: [C: 03+2] Revert "Disable ZeroBanner and ZeroPortal on all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483192 (owner: 10Jforrester) [17:49:40] (03CR) 10jenkins-bot: Disable ZeroBanner and ZeroPortal on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/482097 (https://phabricator.wikimedia.org/T212864) (owner: 10Jforrester) [17:49:57] (03Merged) 10jenkins-bot: Revert "Disable ZeroBanner and ZeroPortal on all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483192 (owner: 10Jforrester) [17:50:05] sorry and thanks! we can probably clean up the edge/varnish stuff fairly quickly, maybe today/tomorrow, I just wasn't quite ready yet [17:50:12] (03CR) 10jenkins-bot: Revert "Disable ZeroBanner and ZeroPortal on all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483192 (owner: 10Jforrester) [17:50:13] No problem at all. :-) [17:50:23] Good spot, though. [17:50:46] (03PS1) 10Jforrester: Re-do "Disable ZeroBanner and ZeroPortal on all wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483193 [17:50:56] (03CR) 10Jforrester: [C: 04-2] "Wait for SRE VCL stuff to be done first." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/483193 (owner: 10Jforrester) [17:56:03] 10Operations, 10Continuous-Integration-Infrastructure, 10SRE-Access-Requests, 10Patch-For-Review, 10Performance-Team (Radar): Add krinkle to contint-docker group - https://phabricator.wikimedia.org/T213015 (10herron) 05Open→03Resolved a:03herron [17:56:26] (03CR) 10BBlack: "Note sure what's going on here with the dep cycle failure reported by CI, whereas compiler seems happy?" [puppet] - 10https://gerrit.wikimedia.org/r/483187 (owner: 10BBlack) [17:59:15] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/483168 (https://phabricator.wikimedia.org/T213079) (owner: 10Jbond42) [17:59:25] (03PS12) 10Fsero: Initial docker::registry::ha puppetization. [puppet] - 10https://gerrit.wikimedia.org/r/482675 (https://phabricator.wikimedia.org/T210076) [18:00:00] bblack: i think it might be fixed if you add a symlink line to .fixtures.yaml in modules/authdns [18:00:11] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 251.32 seconds [18:00:17] (03CR) 10Smalyshev: [C: 03+1] make blazegraph port configurable [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/483144 (https://phabricator.wikimedia.org/T213289) (owner: 10Gehel) [18:00:18] so that the resource name isnt unknown anymore.. for 'nrpe::monitor_service' [18:00:34] resource type [18:02:32] 10Operations, 10Citoid, 10SRE-Access-Requests: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (10herron) >>! In T213269#4865991, @mobrovac wrote: > The `zotero-admin` group is defunct effectively. The groups should be `citoid-admin`, `deployment` a... 
[18:02:41] 10Operations, 10Citoid, 10SRE-Access-Requests: Requesting access to Citoid/Zotero production servers for MVOLZ - https://phabricator.wikimedia.org/T213269 (10herron) [18:03:04] (03CR) 10Dzahn: "i think you need a line "nrpe: "../../../../nrpe" in modules/authdns/fixtures.yml so that tests now about the NRPE module and that should " [puppet] - 10https://gerrit.wikimedia.org/r/483187 (owner: 10BBlack) [18:03:21] Is it known that https://doc.wikimedia.org/cover/ is giving 500s? [18:03:59] anomie someone else reported that too [18:04:07] https://phabricator.wikimedia.org/T213306 [18:04:13] it is known that hashar moved the docs site today [18:04:24] not sure if he knows this particular one [18:04:38] !log Drop valid_tag from s3 master (db1075) - T212254 [18:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:43] T212254: Drop valid_tag table - https://phabricator.wikimedia.org/T212254 [18:04:47] (03CR) 10Volans: [C: 04-1] "Minor fixes required inline." (032 comments) [debs/prometheus-blazegraph-exporter] - 10https://gerrit.wikimedia.org/r/483144 (https://phabricator.wikimedia.org/T213289) (owner: 10Gehel) [18:05:56] (03CR) 10Herron: [C: 03+1] Enable base::service_auto_restart for passive Icinga node [puppet] - 10https://gerrit.wikimedia.org/r/483125 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [18:06:23] (03PS3) 10Dzahn: authdns: add NRPE for gdnsd checkconf [puppet] - 10https://gerrit.wikimedia.org/r/483187 (owner: 10BBlack) [18:06:57] (03CR) 10jerkins-bot: [V: 04-1] authdns: add NRPE for gdnsd checkconf [puppet] - 10https://gerrit.wikimedia.org/r/483187 (owner: 10BBlack) [18:07:21] (03CR) 10BBlack: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/481833 (owner: 10Volans) [18:13:23] (03PS4) 10Dzahn: authdns: add NRPE for gdnsd checkconf [puppet] - 10https://gerrit.wikimedia.org/r/483187 (owner: 10BBlack) [18:13:47] (03PS13) 10Fsero: Initial docker::registry::ha puppetization. 
[puppet] - 10https://gerrit.wikimedia.org/r/482675 (https://phabricator.wikimedia.org/T210076) [18:13:58] (03CR) 10jerkins-bot: [V: 04-1] authdns: add NRPE for gdnsd checkconf [puppet] - 10https://gerrit.wikimedia.org/r/483187 (owner: 10BBlack) [18:15:15] PROBLEM - puppet last run on proton1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[ip addr add 2620:0:861:103:10:64:32:61/64 dev ens5] [18:15:46] (03PS1) 10Marostegui: Revert "dbproxy1010: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/483196 [18:15:59] (03PS2) 10Marostegui: Revert "dbproxy1010: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/483196 [18:16:27] PROBLEM - HHVM rendering on mw1345 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time [18:16:55] (03PS14) 10Fsero: Initial docker::registry::ha puppetization. [puppet] - 10https://gerrit.wikimedia.org/r/482675 (https://phabricator.wikimedia.org/T210076) [18:17:39] RECOVERY - HHVM rendering on mw1345 is OK: HTTP OK: HTTP/1.1 200 OK - 77006 bytes in 0.142 second response time [18:17:54] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10WMDE-leszek) 05Open→03Stalled I've submitted RFC about the whole concept of Wikibase front end changes as T213318. I've taken the liber... 
[18:18:18] !log add bgp sessions to AS38895 on cr1-eqsin [18:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:03] (03CR) 10Fsero: "Thanks for the review, i've added your suggestions and PCC looks happy also" [puppet] - 10https://gerrit.wikimedia.org/r/482675 (https://phabricator.wikimedia.org/T210076) (owner: 10Fsero) [18:19:53] !log Rename table tag_summary on enwiki on db1089 - T212255 [18:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:56] T212255: Drop tag_summary table - https://phabricator.wikimedia.org/T212255 [18:23:27] (03CR) 10BBlack: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/481833 (owner: 10Volans) [18:25:14] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1010: Depool labsdb1011" [puppet] - 10https://gerrit.wikimedia.org/r/483196 (owner: 10Marostegui) [18:25:35] (03CR) 10Herron: [C: 03+1] "afaict this only affects the tungsten test system, looks good to me. Krinkle ping me to merge when you have a few minutes to test after?" 
[puppet] - 10https://gerrit.wikimedia.org/r/483048 (https://phabricator.wikimedia.org/T213218) (owner: 10Krinkle) [18:26:29] !log add bgp sessions to AS31800 on cr1-eqsin [18:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:36] log Reload haproxy on dbproxy1010 to repool labsdb1011 - T86338 [18:26:36] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [18:28:37] (03CR) 10Dzahn: "ehm..yea, sorry, tried to fix it but while it solved one issue i am also not sure how to fix the remaining one, will ask Hashar" [puppet] - 10https://gerrit.wikimedia.org/r/483187 (owner: 10BBlack) [18:30:12] (03PS1) 10BBlack: CI check [dns] - 10https://gerrit.wikimedia.org/r/483198 [18:30:25] PROBLEM - BGP status on cr1-eqsin is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active, AS1299/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:33:42] (03CR) 10Andrew Bogott: "Hello, Alex! This issue is still blocking us -- can we please merge this as a stopgap, or get more attention on the actual proper fix?" 
[puppet] - 10https://gerrit.wikimedia.org/r/481215 (https://phabricator.wikimedia.org/T212327) (owner: 10Bstorm) [18:39:44] (03CR) 10BBlack: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/483198 (owner: 10BBlack) [18:40:03] RECOVERY - BGP status on cr1-eqsin is OK: BGP OK - up: 269, down: 1, shutdown: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:41:15] RECOVERY - puppet last run on proton1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:42:41] PROBLEM - Router interfaces on cr1-eqsin is CRITICAL: CRITICAL: host 103.102.166.129, interfaces up: 71, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:43:07] !log authdns2001 (ns1) - upgrade gdnsd to 9949 beta release [18:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:23] RECOVERY - Router interfaces on cr1-eqsin is OK: OK: host 103.102.166.129, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:48:37] PROBLEM - BGP status on cr1-eqsin is CRITICAL: BGP CRITICAL - AS1299/IPv6: Active, AS1299/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:57:29] (03CR) 10DCausse: [C: 04-1] "I'd ask gehel what he does on previous merges (perhaps pushed directly to gerrit? or simply the warning shown by git review is just OK). 
B" [debs/prometheus-elasticsearch-exporter] - 10https://gerrit.wikimedia.org/r/483143 (https://phabricator.wikimedia.org/T210592) (owner: 10Mathew.onipe) [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190109T1900) [19:16:12] (03PS1) 10CRusnov: Rebuilt wheels for Django security update [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/483202 [19:17:19] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/483202 (owner: 10CRusnov) [19:18:26] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Rebuilt wheels for Django security update [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/483202 (owner: 10CRusnov) [19:23:30] !log crusnov@deploy1001 Started deploy [netbox/deploy@7fe39e1]: Deploy Django security upgrade [19:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:01] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:28:03] !log crusnov@deploy1001 Finished deploy [netbox/deploy@7fe39e1]: Deploy Django security upgrade (duration: 04m 33s) [19:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:25] chaomodus: ^^^ (systemd not happy on 2001) [19:29:34] transient? [19:29:39] meh [19:30:15] RECOVERY - BGP status on cr1-eqsin is OK: BGP OK - up: 271, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:31:37] chaomodus: ah the segfault we have also during cron.dayly [19:31:46] T212697 [19:31:47] T212697: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 [19:32:42] a restart should be enough to fix it chaomodus ;) [19:35:57] chaomodus: did it worked? 
[19:36:17] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (Scrapes sample page) timed out before a response was received [19:36:22] PROBLEM - LVS HTTP IPv4 on zotero.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:43] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational [19:36:48] ah hah yes :) [19:36:48] nice! :) [19:37:03] took me a min to get all the parts together [19:37:17] is the zotero issue known? [19:37:23] i was just typing that =] [19:37:27] RECOVERY - LVS HTTP IPv4 on zotero.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 138 bytes in 0.008 second response time [19:37:42] and now a clear, anyone about who knows enough about to investigate? [19:38:00] yeah, me [19:38:02] looking [19:38:07] thank you =] [19:38:11] ehehe zotero master arrived :) [19:38:16] although I have a pretty good guess already [19:38:33] .... damn it now i just see alex in a zorro mask in my brain [19:38:45] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [19:38:45] lol the zotero effect [19:39:07] perfect name for it [19:39:29] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received [19:39:50] is it just OOMing again? [19:40:22] no, that's the fun part [19:40:24] akosiaris: also share the magic spell needed to fix it ;) [19:40:35] it's something different this time around [19:40:35] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [19:40:47] not sure yet what [19:42:15] * volans subscribes for the notification of the new spell, once found [19:42:22] also, can we help? 
[19:42:49] sigh the fact that thing feels the need to log the entire html page it parses at times [19:42:58] oh dear [19:43:25] Lol [19:43:38] 10Operations, 10Prod-Kubernetes, 10Release Pipeline, 10Documentation, 10Release-Engineering-Team (Next): TEC3:O6:O:6.1:Q3: Deployment Pipeline Documentation - https://phabricator.wikimedia.org/T213090 (10greg) [19:43:40] 10Operations, 10Prod-Kubernetes, 10Documentation, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Next): Update Blubber documentation - https://phabricator.wikimedia.org/T213198 (10greg) [19:43:53] 10Operations, 10Prod-Kubernetes, 10Release Pipeline, 10Documentation, 10Release-Engineering-Team (Next): Document helm chart creation - https://phabricator.wikimedia.org/T213197 (10greg) [19:44:17] hmm pods are running just fine, no restarts, memory looks fine at https://grafana.wikimedia.org/d/000000620/xxxx-zotero-debugging-kubernetes?orgId=1&from=now-15m&to=now [19:44:19] weird [19:44:41] ah no [19:44:43] scratch that [19:44:44] my bad [19:44:45] https://grafana.wikimedia.org/d/000000620/xxxx-zotero-debugging-kubernetes?panelId=41&fullscreen&orgId=1&from=now-30m&to=now [19:44:53] yeah OOM again. cdanis nailed it [19:45:37] although the pod limit is at 4Gi it seems like at around 1.5GB zotero fails to do whatever it is doing [19:45:44] i'm going to make another guess: excessive memory consumption associated with a particular query, which the user gave up on after it errored out a dozen-ish times [19:46:00] yeah we are on the same page [19:48:18] 10Operations, 10ExternalGuidance, 10Traffic: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10dr0ptp4kt) Hi @bblack , any suggestion here? We determined on a separate thread that I'll be the one writing the patch. CC @Arrbee @ggellerman [19:48:42] iirc, isn't around 1.6GB the default heap size limit for node.js? 
[19:49:12] if I'm not needed dinner's ready [19:49:17] quite possibly [19:49:40] volans: yeah you aren't needed. go. And there isn't no magic incanation this time around. The issue fixed itself [19:50:45] http://prestonparry.com/articles/IncreaseNodeJSMemorySize/ says so as well [19:50:49] eh :) [19:50:52] ack ttyl [19:50:53] thanks [19:52:04] that being said that FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed is not in the logs [19:52:58] but zotero's logs is not something to trust or be proud off anyway [19:53:10] haha, i think that message comes from the v8 interpreter internals though [19:55:32] 10Operations, 10Mail, 10Phabricator, 10serviceops, and 2 others: Convert Phabricator mail config to use cluster.mailers - https://phabricator.wikimedia.org/T212989 (10greg) [20:00:04] marxarelli: How many deployers does it take to do MediaWiki train - Americas version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190109T2000). [20:00:39] (03CR) 10Krinkle: Hotfix for logging in php-fpm (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478021 (https://phabricator.wikimedia.org/T211184) (owner: 10Giuseppe Lavagetto) [20:04:00] 10Operations, 10RESTBase-Cassandra, 10Services (next): restbase cassandra driver excessive logging when cassandra hosts are down - https://phabricator.wikimedia.org/T212424 (10Pchelolo) [20:04:46] tgr: I'm thinking that maybe i didn't do a very good job with that security review [20:07:09] (03PS1) 10Dduvall: group1 wikis to 1.33.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483216 [20:07:11] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.33.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483216 (owner: 10Dduvall) [20:07:23] (03PS1) 10Gehel: wdqs: preliminary work to manage multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/483217 (https://phabricator.wikimedia.org/T213234) [20:08:12] (03CR) 10jerkins-bot: [V: 04-1] wdqs: preliminary work to manage multiple 
instances [puppet] - 10https://gerrit.wikimedia.org/r/483217 (https://phabricator.wikimedia.org/T213234) (owner: 10Gehel) [20:08:44] (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483216 (owner: 10Dduvall) [20:09:24] (03PS2) 10Gehel: wdqs: preliminary work to manage multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/483217 (https://phabricator.wikimedia.org/T213234) [20:10:11] (03CR) 10jerkins-bot: [V: 04-1] wdqs: preliminary work to manage multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/483217 (https://phabricator.wikimedia.org/T213234) (owner: 10Gehel) [20:10:29] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.12 [20:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:23] !log dduvall@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.12 (duration: 00m 53s) [20:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:23] (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/483216 (owner: 10Dduvall) [20:13:57] PROBLEM - Apache HTTP on mw1238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:14:41] PROBLEM - HHVM rendering on mw1238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:14:43] PROBLEM - Nginx local proxy to apache on mw1238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:53] PROBLEM - Nginx local proxy to apache on mw1241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:53] RECOVERY - Nginx local proxy to apache on mw1238 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 6.450 second response time [20:15:55] RECOVERY - HHVM rendering on mw1238 is OK: HTTP OK: HTTP/1.1 200 OK - 77015 bytes in 9.293 second response time [20:16:13] RECOVERY - Apache HTTP on mw1238 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.040 
second response time [20:16:57] RECOVERY - Nginx local proxy to apache on mw1241 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.047 second response time [20:20:43] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 663.33 seconds [20:25:18] waiting to see if that lag recovers. today's deployment caused a blip of "timed out" errors that have since subsided [20:27:22] marxarelli: you talking about dbstore1002? [20:27:29] yeah [20:28:05] don't wait for that host, it is an analytics host which is not in production and it is used for other things [20:28:42] let me do a quuck scan to make sure everything is fine on the DB land [20:28:44] *quick [20:28:59] ah k, that would be great. ty [20:29:38] (03PS3) 10Gehel: wdqs: preliminary work to manage multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/483217 (https://phabricator.wikimedia.org/T213234) [20:30:39] marxarelli: I think you are good [20:30:54] I've found a DB problem with TestCommons (the Wikibase tables didn't get created, probably for the same reason that some of our ES tables didn't when things broke during creation). [20:31:06] But that's just causing app-layer errors. [20:31:17] (03PS4) 10Gehel: wdqs: preliminary work to manage multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/483217 (https://phabricator.wikimedia.org/T213234) [20:32:23] marxarelli: When all's clear, I'd like to create the tables on TestCommons. [20:32:46] marostegui: great. thanks for looking! [20:32:53] James_F: we're clear [20:32:57] Cool. 
[20:34:29] (03PS5) 10Gehel: wdqs: preliminary work to manage multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/483217 (https://phabricator.wikimedia.org/T213234) [20:35:18] (03CR) 10jerkins-bot: [V: 04-1] wdqs: preliminary work to manage multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/483217 (https://phabricator.wikimedia.org/T213234) (owner: 10Gehel) [20:35:19] addshore: Ran `mwscript sql.php --wiki=testcommonswiki extensions/Wikibase/repo/sql/Wikibase.sql`, it errored: [20:35:29] https://www.irccloud.com/pastebin/Xf9F1skG/ [20:36:27] (03PS1) 10Dzahn: geoip::maxmind: replace deprecated validate_string functions with validate_legacy [puppet] - 10https://gerrit.wikimedia.org/r/483222 [20:36:44] (And no change to the tables.) [20:37:01] (03CR) 10jerkins-bot: [V: 04-1] geoip::maxmind: replace deprecated validate_string functions with validate_legacy [puppet] - 10https://gerrit.wikimedia.org/r/483222 (owner: 10Dzahn) [20:37:24] (03PS6) 10Gehel: wdqs: preliminary work to manage multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/483217 (https://phabricator.wikimedia.org/T213234) [20:37:48] (03PS2) 10Dzahn: geoip::maxmind: replace deprecated validate_string functions with validate_legacy [puppet] - 10https://gerrit.wikimedia.org/r/483222 [20:38:23] (03CR) 10jerkins-bot: [V: 04-1] geoip::maxmind: replace deprecated validate_string functions with validate_legacy [puppet] - 10https://gerrit.wikimedia.org/r/483222 (owner: 10Dzahn) [20:38:37] (03PS1) 10Hashar: doc: add php7.2-xml package [puppet] - 10https://gerrit.wikimedia.org/r/483223 (https://phabricator.wikimedia.org/T213306) [20:39:26] (03PS3) 10Dzahn: geoip::maxmind: replace deprecated validate_string functions with validate_legacy [puppet] - 10https://gerrit.wikimedia.org/r/483222 [20:39:37] mutante: Guten Tag. Eventually I lack php-xml on the host :) I have send https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/483223/ ! 
[20:39:42] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/483223 (https://phabricator.wikimedia.org/T213306) (owner: 10Hashar) [20:40:00] (03CR) 10jerkins-bot: [V: 04-1] geoip::maxmind: replace deprecated validate_string functions with validate_legacy [puppet] - 10https://gerrit.wikimedia.org/r/483222 (owner: 10Dzahn) [20:40:07] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 625.87 seconds [20:40:46] mutante: and locally I usually run: bundle exec rake --job=1 test [20:40:47] ;) [20:41:54] James_F: unable to open input file? :P [20:42:09] that sounds like something odd happening? [20:42:15] Yeah. [20:42:17] Reedy: ^^ you didnt make the wikibase tables ;) [20:42:39] All the .sql files are owned by mwdeploy. [20:43:59] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:45:04] OK, ran it again and it worked. [20:45:21] interesting [20:45:25] !log Created Wikibase repo tables on TestCommons [20:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:29] lets see if RC works :) [20:45:56] wb_terms, wb_items_per_site, wb_id_counters got created, but not wb_changes. [20:46:34] (03PS1) 10Dzahn: conftool::client: remove obsolete trusty distribution check [puppet] - 10https://gerrit.wikimedia.org/r/483226 [20:46:51] hashar: i wanted to link you to that ticket and i see this is all related. ok [20:47:05] gotcha.. on it [20:47:23] and I will have later on fix a bunch of permissions issues :) [20:47:28] addshore: … because Wikibase.sql doesn't know about wb_changes. Running changes.sql too. [20:47:37] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:47:47] (03CR) 10Dzahn: [C: 03+2] doc: add php7.2-xml package [puppet] - 10https://gerrit.wikimedia.org/r/483223 (https://phabricator.wikimedia.org/T213306) (owner: 10Hashar) [20:47:50] James_F: let me see if anything else is not in Wikibase.sql [20:48:27] addshore: And bingo, a caption edit appears in RC. [20:48:32] winner [20:48:43] (03PS7) 10Gehel: wdqs: preliminary work to manage multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/483217 (https://phabricator.wikimedia.org/T213234) [20:49:06] James_F: to make sure you have got everything you need you could comapare the wikidatawiki tables vs the commonstestwiki tables [20:49:10] (03PS1) 10Alexandros Kosiaris: Add an stdout log stanza to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/483227 [20:49:46] (03CR) 10jerkins-bot: [V: 04-1] wdqs: preliminary work to manage multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/483217 (https://phabricator.wikimedia.org/T213234) (owner: 10Gehel) [20:50:04] hashar: unrelated matter.. i tried to fix the jenkins-bot -1 on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/483187/ by adding to .fixtures.yml but while i removed one error there is still another one.. while compiler is happy. would you know why that is? [20:50:30] hashar: doc1001: Notice: /Stage[main]/Profile::Doc/Package[php7.2-xml]/ensure: created [20:50:36] good [20:50:50] that fixes https://doc.wikimedia.org/cover/ :) [20:50:57] addshore: TestCommons has wb_changes, wb_id_counters, wb_items_per_site, wb_terms, wbc_entity_usage. 
[20:51:03] hashar: confirmed, it does :) [20:51:08] paladox: ^ [20:51:16] addshore: Wikidata has those plus: wb_changes_dispatch, wb_changes_subscription, wb_property_info, wbqc_constraints, wbs_propertypairs [20:51:17] and https://phabricator.wikimedia.org/T213306 is fixed [20:51:28] now I should find a way to get apache2 error log to be readable by wikidev folks [20:51:29] anomie: & [20:51:33] ^ [20:52:16] James_F: it would make sense to create wb_changes_dispatch, wb_changes_subscription, wb_property_info too even if we are not using them initially [20:52:21] addshore: We don't have QualityConstraints so that one is fine. [20:52:24] then all of the Wikibase repo tables are created [20:52:26] OK, one moment. [20:52:32] hashar: 'ALL = NOPASSWD: /bin/journalctl *', [20:52:32] 'ALL = (syslog) NOPASSWD: ALL'] [20:52:49] ^ just copy from admins/data/data.yaml from other admin classes [20:53:25] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 262.96 seconds [20:53:41] !log authdns1001 (ns0) - upgrade gdnsd to 9949 beta release [20:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:00] addshore: OK, now the same except for wbqc_constraints. [20:54:07] james wonderful [20:54:09] James_F: :P [20:54:19] (03PS8) 10Gehel: wdqs: preliminary work to manage multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/483217 (https://phabricator.wikimedia.org/T213234) [20:55:21] (03CR) 10Gehel: "PCC agrees this is a noop: https://puppet-compiler.wmflabs.org/compiler1002/14247/" [puppet] - 10https://gerrit.wikimedia.org/r/483217 (https://phabricator.wikimedia.org/T213234) (owner: 10Gehel) [20:56:03] addshore: OK, so… should we add the tables to real Commons now, or wait until tomorrow? [20:56:41] * addshore goes to double check them quickly [20:57:33] mutante: ahh journalctl. 
I got it already :) [20:57:53] hashar: ah:) yea, sudo journalctl :) [20:58:01] it lacks apache error logs though [20:59:11] hashar: https://httpd.apache.org/docs/trunk/de/mod/mod_journald.html ? [20:59:19] addshore: I'd rather run rebuildall.php on TestCommons if you think that's sane? [20:59:38] James_F: rebuildall sounds fine to me! [21:00:00] !log Running rebuildall on TestCommons [21:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190109T2100). [21:00:23] and James_F i compared the tables on commonstestwiki and wikidata wiki and the field etc all match perfectly, so i think we can create them on real commons when ready [21:00:48] addshore: Worked great. [21:02:31] addshore: Excellent. OK, going ahead now. [21:03:53] !log ppchelko@deploy1001 Started deploy [cpjobqueue/deploy@bfa9241]: Increase concurrency for categoryMembershipJob T192691 [21:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:55] T192691: [Commons] A new image added to a category is not shown in Watchlist - https://phabricator.wikimedia.org/T192691 [21:04:30] !log Creating Wikibase repo tables on Commons for T68108 [21:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:33] T68108: [Epic] Store media information for files on Wikimedia Commons as structured data - https://phabricator.wikimedia.org/T68108 [21:04:38] !log ppchelko@deploy1001 Finished deploy [cpjobqueue/deploy@bfa9241]: Increase concurrency for categoryMembershipJob T192691 (duration: 00m 45s) [21:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:50] Yup, they're live. [21:09:21] great! 
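For reference, the table comparison from the 20:50 to 20:54 exchange amounts to a set difference. A minimal Python sketch, assuming the two table lists are copied by hand from the messages above; the variable names and the `missing_tables` helper are illustrative only and are not part of MediaWiki or Wikibase:

```python
# Tables reported for TestCommons at 20:50:57 (copied from the log).
TESTCOMMONS_TABLES = {
    "wb_changes",
    "wb_id_counters",
    "wb_items_per_site",
    "wb_terms",
    "wbc_entity_usage",
}

# Wikidata was reported at 20:51:16 as having "those plus" these tables.
WIKIDATA_TABLES = TESTCOMMONS_TABLES | {
    "wb_changes_dispatch",
    "wb_changes_subscription",
    "wb_property_info",
    "wbqc_constraints",
    "wbs_propertypairs",
}


def missing_tables(reference, target, ignore=frozenset()):
    """Return, sorted, the tables in `reference` that `target` lacks.

    `ignore` lists tables that are expected to be absent and should
    not be reported (hypothetical convenience parameter).
    """
    return sorted(set(reference) - set(target) - set(ignore))


if __name__ == "__main__":
    # wbqc_constraints is skipped because, per the chat, TestCommons
    # does not run QualityConstraints.
    for table in missing_tables(
        WIKIDATA_TABLES, TESTCOMMONS_TABLES, ignore={"wbqc_constraints"}
    ):
        print(table)
```

In practice the two sets would come from listing the tables on each wiki's database rather than from hard-coded literals; the literals here only mirror what was pasted into the channel.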
[21:09:22] :D [21:15:35] I'm all clear from prod, sorry for not saying. [21:15:38] (03PS1) 10Gehel: wdqs: create multiple instances of blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/483234 (https://phabricator.wikimedia.org/T213234) [21:16:24] (03PS2) 10Gehel: [WIP] wdqs: create multiple instances of blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/483234 (https://phabricator.wikimedia.org/T213234) [21:16:56] (03CR) 10jerkins-bot: [V: 04-1] [WIP] wdqs: create multiple instances of blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/483234 (https://phabricator.wikimedia.org/T213234) (owner: 10Gehel) [21:19:44] 10Operations, 10SRE-Access-Requests: unable to access Eventlogging via stat1006 - https://phabricator.wikimedia.org/T213344 (10Capt_Swing) [21:20:57] !log multatuli (ns2) - upgrade gdnsd to 9949 beta release [21:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:36] (03CR) 10Smalyshev: [C: 03+1] "some nitpicks, but in general looks ok to me" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/483217 (https://phabricator.wikimedia.org/T213234) (owner: 10Gehel) [21:28:19] 10Operations, 10Fundraising-Backlog, 10Mail, 10fundraising-tech-ops: Stronger DKIM key for fundraising emails? - https://phabricator.wikimedia.org/T210445 (10bsisolak) The key is correct, and IBM will validate it before turing it live. I would recommend in four months, you remove the old key. [21:28:44] (03CR) 10BryanDavis: "> As FYI I don't see the string message in Chrome at the moment," [puppet] - 10https://gerrit.wikimedia.org/r/480869 (owner: 10Framawiki) [21:33:00] 10Operations, 10SRE-Access-Requests: unable to access Eventlogging via stat1006 - https://phabricator.wikimedia.org/T213344 (10Ottomata) 05Open→03Resolved a:03Ottomata @Capt_Swing The issue is the presence of the .my.cnf file in your home dir on stat1006. It's being read by default and overriding the re... 
[21:34:39] (03CR) 10Smalyshev: [WIP] wdqs: create multiple instances of blazegraph (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/483234 (https://phabricator.wikimedia.org/T213234) (owner: 10Gehel) [21:37:34] hashar: unfortunately after looking at it closer.. the mod_journald option is a dead-end for now.. because it's only in httpd 2.5 and not backported to 2.4 and even upgrading doc1001 to buster means still 2.4 and we'd have to patch it. [21:37:57] so the second sudo line there.. run things as syslog user [21:41:22] mutante: I thought about making /var/log/apache2/ accessible/readable by the wikidev group [21:41:38] then find a way for apache to write the log I am interested in with the wikidev group [21:41:52] but apache is not a member of wikidev so.. [21:42:22] else, get the apache error logs emitted to logstash ;) [21:42:30] well.. others have pondered about this before [21:42:41] because multiple admin groups have sudo privs to read logs [21:42:55] i would say let's copy that [21:43:05] unless we find a way that is generally better and then use it for all [21:43:23] but maybe avoid doing it slightly different in each [21:43:30] +1 [21:44:00] +1 to sending it to logstash though too [21:44:34] an example I have seen when digging is statistics::web which sets /var/log/apache2 to be owned by wikidev [21:45:48] hashar: eh.. all that being said, why can't you read the error.log but i can? [21:45:52] and then there is a puppet file { '/var/log/apache2/access.metrics.log': group => 'wikidev' } [21:45:53] i am not using sudo or root [21:46:02] [doc1001:~] $ cat /var/log/apache2/error.log [21:46:06] maybe you are a member of the "adm" group?
[21:46:16] uid=2075(dzahn) gid=500(wikidev) groups=500(wikidev),4(adm),700(ops) [21:46:19] ack [21:46:22] 10Operations, 10SRE-Access-Requests: unable to access Eventlogging via stat1006 - https://phabricator.wikimedia.org/T213344 (10Capt_Swing) excellent, thank you so much @Ottomata [21:46:23] \o/ [21:46:58] I am not sure what the "adm" group is for [21:47:15] but it seems to be used to be able to read most logs under /var/log/ [21:47:16] Clearly it's for Admirals. [21:47:24] James_F: ;)))))))))))) [21:47:44] -rw------- 1 root root 339 Jan 9 20:49 php7.2-fpm.log [21:47:45] bah [21:47:46] ) [21:47:54] 10Operations, 10Analytics, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10bmansurov) [21:48:07] (03CR) 10Smalyshev: "I'd probably separate it into two patches - first creates infra for multiple instances but keeps current one running as is and second adds" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483234 (https://phabricator.wikimedia.org/T213234) (owner: 10Gehel) [21:48:26] adm: Group adm is used for system monitoring tasks. Members of this group can read many log files in /var/log, and can use xconsole. [21:48:35] mutante: so yeah maybe making us members of adm would be enough [21:48:43] and similar to sudo journalctl * [21:49:15] (03CR) 10Smalyshev: [WIP] wdqs: create multiple instances of blazegraph (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/483234 (https://phabricator.wikimedia.org/T213234) (owner: 10Gehel) [21:50:08] 10Operations, 10Recommendation-API, 10Research, 10SRE-Access-Requests, and 3 others: Add Baha as a deployer for Recommendation API - https://phabricator.wikimedia.org/T212945 (10bmansurov) [21:50:18] mutante: I will ask the ops list for feedback. But that is for tomorrow ;) [21:53:04] hashar: that is ideal, yep :) [21:53:54] mutante: and thanks again for the doc.wm.o migration.
oojs-ui demos work again now https://doc.wikimedia.org/oojs-ui/master/demos/demos.php?page=widgets&theme=wikimediaui&direction=ltr&platform=desktop [21:54:29] lol, now i want group "doc-admirals" [21:54:59] hehe. Anyway time to go to bed *wave* [21:55:13] yw and good night hashar [21:59:07] (03CR) 10Smalyshev: "Also questions: now that we'd have two services, how we tell scap to restart both on deploy? Will have also to make scap to produce two co" [puppet] - 10https://gerrit.wikimedia.org/r/483234 (https://phabricator.wikimedia.org/T213234) (owner: 10Gehel) [22:02:46] (03PS1) 10Jgreen: Add SHA256 selector record for fundraising mail contractor (IBM/Silverpop). [dns] - 10https://gerrit.wikimedia.org/r/483294 [22:03:23] (03PS2) 10Jgreen: Add SHA256 selector record for fundraising mail contractor (IBM/Silverpop). [dns] - 10https://gerrit.wikimedia.org/r/483294 (https://phabricator.wikimedia.org/T210445) [22:06:33] (03CR) 10Jgreen: [C: 03+1] Add SHA256 selector record for fundraising mail contractor (IBM/Silverpop). [dns] - 10https://gerrit.wikimedia.org/r/483294 (https://phabricator.wikimedia.org/T210445) (owner: 10Jgreen) [22:11:23] Presumably there's some magic in puppet I need to do to make upload.wikimedia.org/…/testcommons/ readable? manifests/web/prod_sites.pp has us in it but it's not clear if there's more to do? [22:12:03] 10Operations, 10Fundraising-Backlog, 10Mail, 10fundraising-tech-ops, 10Patch-For-Review: Stronger DKIM key for fundraising emails? - https://phabricator.wikimedia.org/T210445 (10Jgreen) >>! In T210445#4867613, @bsisolak wrote: > The key is correct, and IBM will validate it before turing it live. I would... 
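The 21:45 to 21:48 exchange about who can read /var/log/apache2/error.log came down to whether an account is in the "adm" group. A small Python sketch of that check, assuming the input is a line of `id` output like the one pasted at 21:46:16; the helper name `groups_from_id_output` and the second sample line are hypothetical and only serve the illustration:

```python
import re


def groups_from_id_output(line):
    """Parse the `groups=` field of an `id` output line into {name: gid}.

    Assumes the common `gid(name),gid(name)` layout with no spaces
    inside group names.
    """
    field = re.search(r"groups=(\S+)", line)
    if field is None:
        return {}
    groups = {}
    for entry in field.group(1).split(","):
        parsed = re.fullmatch(r"(\d+)\((.+)\)", entry)
        if parsed:
            groups[parsed.group(2)] = int(parsed.group(1))
    return groups


# Line as pasted in the channel at 21:46:16.
DZAHN_ID = "uid=2075(dzahn) gid=500(wikidev) groups=500(wikidev),4(adm),700(ops)"

# Hypothetical account that is only in wikidev (not taken from the log).
EXAMPLE_ID = "uid=9999(example) gid=500(wikidev) groups=500(wikidev)"

if __name__ == "__main__":
    for label, line in (("dzahn", DZAHN_ID), ("example", EXAMPLE_ID)):
        in_adm = "adm" in groups_from_id_output(line)
        print(f"{label}: member of adm = {in_adm}")
```

This only inspects group membership; whether a given log file is actually readable still depends on that file's owner, group and mode, as the root-only php7.2-fpm.log listing at 21:47:44 shows.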
[22:27:16] (03CR) 10Addshore: [C: 03+1] Enable WikibaseMediaInfo on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466955 (https://phabricator.wikimedia.org/T159708) (owner: 10Jforrester) [22:27:21] (03CR) 10Addshore: [C: 03+1] Install but don't enable the WikibaseMediaInfo extension, part IV [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446844 (https://phabricator.wikimedia.org/T180981) (owner: 10Jforrester) [23:14:45] (03PS1) 10Smalyshev: Puppetize blazegraph config for cases where deployed one is not enough [puppet] - 10https://gerrit.wikimedia.org/r/483310 [23:15:22] (03CR) 10jerkins-bot: [V: 04-1] Puppetize blazegraph config for cases where deployed one is not enough [puppet] - 10https://gerrit.wikimedia.org/r/483310 (owner: 10Smalyshev) [23:20:50] 10Operations, 10netops: Add eqsin routing special cases to jnt - https://phabricator.wikimedia.org/T211930 (10ayounsi) a:05ayounsi→03faidon Thanks for the feedback! If the back and forth in diffs via a task gets old, we can find a different solution. 1. 2. 3. Indeed using the ASXXX_in and _out as well as... [23:29:23] (03CR) 10MSantos: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/482860 (owner: 10MSantos) [23:30:11] (03PS3) 10MSantos: Change frequency of OSM replication on maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/482860 [23:34:30] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` ['mw2151.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201901092334_dzahn... 
[23:34:57] !log reinstalling mw2151.codfw.wmnet because it was the very last mw* host on jessie [23:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:58] !log mw2151 - change netbox status from active to staged - it's not actually active, it's role(spare) and was jessie (T192457) [23:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:01] T192457: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 [23:47:12] (03PS2) 10Smalyshev: Puppetize blazegraph config for cases where deployed one is not enough [puppet] - 10https://gerrit.wikimedia.org/r/483310 [23:47:45] (03CR) 10jerkins-bot: [V: 04-1] Puppetize blazegraph config for cases where deployed one is not enough [puppet] - 10https://gerrit.wikimedia.org/r/483310 (owner: 10Smalyshev) [23:48:41] (03PS3) 10Smalyshev: Puppetize blazegraph config for cases where deployed one is not enough [puppet] - 10https://gerrit.wikimedia.org/r/483310 [23:49:34] (03CR) 10jerkins-bot: [V: 04-1] Puppetize blazegraph config for cases where deployed one is not enough [puppet] - 10https://gerrit.wikimedia.org/r/483310 (owner: 10Smalyshev) [23:51:03] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 26 ge 4 daniel_zahn https://phabricator.wikimedia.org/T207721 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [23:51:58] !log thumb1004 - still needs broken RAM replaced, expired downtime, re-ACKed (T207721) [23:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:01] T207721: Broken memory on thumbor1004 - https://phabricator.wikimedia.org/T207721 [23:53:54] (03PS4) 10Smalyshev: Puppetize blazegraph config for cases where deployed one is not enough [puppet] - 10https://gerrit.wikimedia.org/r/483310 [23:54:32] 10Operations, 10ops-eqiad: Heating alerts and broken RAM on kafka1014 - 
https://phabricator.wikimedia.org/T204479 (10Dzahn) [23:56:11] 10Operations, 10netops, 10Performance-Team (Radar): Stop prioritizing peering over transit - https://phabricator.wikimedia.org/T204281 (10ayounsi) The following need to be pushed to routers to stop adding a higher local_pref to routes learned via peering (IXP+private). `lang=diff [edit policy-options policy-... [23:57:08] 10Operations, 10ops-eqiad: Heating alerts and broken RAM on kafka1014 - https://phabricator.wikimedia.org/T204479 (10Dzahn) as a new issue, kafka1014 reports broken RAM since recently https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=kafka1014&service=Memory+correctable+errors+-EDAC- [23:59:28] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on kafka1014 is CRITICAL: 7.001 ge 4 daniel_zahn https://phabricator.wikimedia.org/T204479#4868003 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=kafka1014&var-datasource=eqiad+prometheus/ops