[00:18:56] 07Puppet, 07Beta-Cluster-reproducible: puppet failures due to "Could not find class" or "Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type" - https://phabricator.wikimedia.org/T131946#2627266 (10AlexMonk-WMF) ```Sep 12 00:01:48 deployment-ms-fe01 puppet-agent[16051]: Could no... [00:30:48] (03PS1) 10Ema: cache_upload esams: route to codfw [puppet] - 10https://gerrit.wikimedia.org/r/309928 (https://phabricator.wikimedia.org/T131502) [00:32:12] (03CR) 10Ema: [C: 032] cache_upload esams: route to codfw [puppet] - 10https://gerrit.wikimedia.org/r/309928 (https://phabricator.wikimedia.org/T131502) (owner: 10Ema) [00:41:57] 07Puppet, 10Beta-Cluster-Infrastructure: Puppet failing on deployment-conf03 due to missing files - https://phabricator.wikimedia.org/T144703#2627269 (10AlexMonk-WMF) a:03AlexMonk-WMF [00:42:02] (03PS1) 10Alex Monk: etcd::ssl: fix puppet ssldir path [puppet] - 10https://gerrit.wikimedia.org/r/309929 (https://phabricator.wikimedia.org/T144703) [00:46:36] (03PS1) 10Ema: Upgrade upload esams to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/309930 (https://phabricator.wikimedia.org/T131502) [00:46:44] (03CR) 10Alex Monk: "cherry-picked on deployment-puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/309929 (https://phabricator.wikimedia.org/T144703) (owner: 10Alex Monk) [00:48:34] (03CR) 10Ema: [C: 032 V: 032] Upgrade upload esams to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/309930 (https://phabricator.wikimedia.org/T131502) (owner: 10Ema) [00:51:20] !log upgrade cp3034 to varnish 4 T131502 [00:51:21] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [00:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:55:57] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 48 seconds ago with 2 failures. 
Failed resources (up to 3 shown): Service[varnish],Service[varnish-frontend] [00:58:37] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:28:07] !log upgrade cp3035 to varnish 4 T131502 [01:28:08] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [01:28:11] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 138.93 seconds [01:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:31:51] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 35 seconds ago with 2 failures. Failed resources (up to 3 shown): Service[varnish],Service[varnish-frontend] [01:34:15] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:06:23] !log upgrade cp3036 to varnish 4 T131502 [02:06:25] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [02:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:21:48] (03CR) 10Andrew Bogott: [C: 031] "I'd run this through the puppet compiler if you haven't already... looks good to me though. Thanks for the cleanup!" 
[puppet] - 10https://gerrit.wikimedia.org/r/309695 (owner: 10Alex Monk) [02:23:34] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.18) (duration: 10m 37s) [02:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:29:27] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Sep 12 02:29:27 UTC 2016 (duration 5m 53s) [02:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:34:44] !log upgrade cp3037 to varnish 4 T131502 [02:34:46] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [02:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:37:31] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 48 seconds ago with 2 failures. Failed resources (up to 3 shown): Service[varnish],Service[varnish-frontend] [02:40:02] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:49:19] !log upgrade cp3038 to varnish 4 T131502 [02:49:20] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [02:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:51:05] RECOVERY - HTTPS on cp3038 is OK: Thread 1 terminated abnormally: Cant call method peer_certificate on an undefined value at /usr/lib/nagios/plugins/check_ssl line 166. [02:52:36] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 26 seconds ago with 1 failures. 
Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [02:55:07] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:58:41] PROBLEM - HTTPS on cp3038 is CRITICAL: SSLXNN CRITICAL - 37 CRITICAL [03:04:23] oh I guess we just crossed the 90 mark this weekend for the unified [03:04:27] annoying that that's critical :P [03:05:03] !log upgrade cp3039 to varnish 4 T131502 [03:05:05] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [03:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:06:02] RECOVERY - HTTPS on cp3039 is OK: Thread 1 terminated abnormally: Cant call method peer_certificate on an undefined value at /usr/lib/nagios/plugins/check_ssl line 166. [03:06:52] ACKNOWLEDGEMENT - HTTPS on cp1045 is CRITICAL: SSLXNN CRITICAL - 37 CRITICAL Brandon Black Unified cert expiry in 90 days, ACK - The acknowledgement expires at: 2016-11-13 03:06:01. [03:06:52] ACKNOWLEDGEMENT - HTTPS on cp1046 is CRITICAL: SSLXNN CRITICAL - 37 CRITICAL Brandon Black Unified cert expiry in 90 days, ACK - The acknowledgement expires at: 2016-11-13 03:06:01. [03:06:52] ACKNOWLEDGEMENT - HTTPS on cp1047 is CRITICAL: SSLXNN CRITICAL - 37 CRITICAL Brandon Black Unified cert expiry in 90 days, ACK - The acknowledgement expires at: 2016-11-13 03:06:01. [03:06:52] ACKNOWLEDGEMENT - HTTPS on cp1048 is CRITICAL: SSLXNN CRITICAL - 37 CRITICAL Brandon Black Unified cert expiry in 90 days, ACK - The acknowledgement expires at: 2016-11-13 03:06:01. [03:06:52] ACKNOWLEDGEMENT - HTTPS on cp1049 is CRITICAL: SSLXNN CRITICAL - 37 CRITICAL Brandon Black Unified cert expiry in 90 days, ACK - The acknowledgement expires at: 2016-11-13 03:06:01. 
[03:07:41] one more spammer toll dealt with :) [03:07:45] *troll [03:09:37] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [03:13:41] PROBLEM - HTTPS on cp3039 is CRITICAL: SSLXNN CRITICAL - 37 CRITICAL [03:21:14] bblack, Faidon uploaded a patch to lower the critical days [03:21:24] https://gerrit.wikimedia.org/r/309923 [03:21:33] !log upgrade cp3044 to varnish 4 T131502 [03:21:35] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [03:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:27:52] (03PS2) 10BBlack: Reduce check_sslxNN alert thresholds to 30d/15d [puppet] - 10https://gerrit.wikimedia.org/r/309923 (owner: 10Faidon Liambotis) [03:28:16] (03CR) 10BBlack: [C: 032 V: 032] Reduce check_sslxNN alert thresholds to 30d/15d [puppet] - 10https://gerrit.wikimedia.org/r/309923 (owner: 10Faidon Liambotis) [03:28:54] Krenair: thanks [03:36:48] !log upgrade cp3045 to varnish 4 T131502 [03:36:49] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [03:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:50:21] (03CR) 10Alex Monk: "The changes look about right to me: https://puppet-compiler.wmflabs.org/4039/" [puppet] - 10https://gerrit.wikimedia.org/r/309695 (owner: 10Alex Monk) [03:51:17] !log upgrade cp3046 to varnish 4 T131502 [03:51:18] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [03:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:53:51] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 37 seconds ago with 2 failures. 
Failed resources (up to 3 shown): Service[varnish],Service[varnish-frontend] [03:56:22] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [04:06:57] !log upgrade cp3047 to varnish 4 T131502 [04:06:58] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [04:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:09:12] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 16 seconds ago with 2 failures. Failed resources (up to 3 shown): Service[varnish],Service[varnish-frontend] [04:11:42] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [04:20:13] !log upgrade cp3048 to varnish 4 T131502 [04:20:15] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [04:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:23:09] PROBLEM - puppet last run on db2047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:34:50] !log upgrade cp3049 to varnish 4 T131502 [04:34:51] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [04:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:35:40] RECOVERY - HTTPS on cp3049 is OK: Thread 1 terminated abnormally: Cant call method peer_certificate on an undefined value at /usr/lib/nagios/plugins/check_ssl line 166. [04:37:10] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 23 seconds ago with 2 failures. 
Failed resources (up to 3 shown): Service[varnish],Service[varnish-frontend] [04:39:41] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [04:43:13] PROBLEM - HTTPS on cp3049 is CRITICAL: SSLXNN CRITICAL - 37 CRITICAL [04:46:04] RECOVERY - puppet last run on db2047 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [05:19:59] <_joe_> bblack: uh you're almost done with esams? :) [05:26:36] (03CR) 10Giuseppe Lavagetto: [C: 032] Mark mw2017 and mw2099 as codfw test app servers [puppet] - 10https://gerrit.wikimedia.org/r/309554 (owner: 10Muehlenhoff) [05:36:21] (03PS6) 10Giuseppe Lavagetto: prometheus::ops: allow using puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/304983 (https://phabricator.wikimedia.org/T142846) [05:48:56] 07Puppet, 07Beta-Cluster-reproducible: puppet failures due to "Could not find class" or "Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type" - https://phabricator.wikimedia.org/T131946#2627449 (10AlexMonk-WMF) ```Sep 12 02:10:46 deployment-mathoid puppet-agent[30097]: Could no... [05:59:10] 07Puppet, 07Beta-Cluster-reproducible: puppet failures due to "Could not find class" or "Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type" - https://phabricator.wikimedia.org/T131946#2183851 (10Joe) >>! In T131946#2229797, @mmodell wrote: > I'd like to open the broader discu... 
[06:05:46] Operations: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#2627494 (elukey)
[06:27:07] RECOVERY - Disk space on scb1002 is OK: DISK OK
[06:28:06] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [1000.0]
[06:41:07] what the SSL check really needs is to be split up into 2-3 different checks probably
[06:41:25] one should be checking if the SSL negotiation works and all that, run against each cp*
[06:41:48] one should be checking that all the cp* have the same cert
[06:42:12] and one against the service IP should be checking for expiry times and all that
[06:43:29] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1000.0]
[06:43:42] Puppet, Beta-Cluster-reproducible: puppet failures due to "Could not find class" or "Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type" - https://phabricator.wikimedia.org/T131946#2627529 (AlexMonk-WMF) >>! In T131946#2627467, @Joe wrote: >>>! In T131946#2229797, @mmode...
[06:44:17] (CR) Muehlenhoff: [C: -1] "That attribute is used in the current LDAP; plenty of DNS host entries have attributes like" [puppet] - https://gerrit.wikimedia.org/r/309009 (owner: Alex Monk)
[06:50:05] PROBLEM - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is CRITICAL: Connection refused
[06:52:01] (PS1) Urbanecm: Add throttling rule for University of Canterbury [mediawiki-config] - https://gerrit.wikimedia.org/r/309943 (https://phabricator.wikimedia.org/T145327)
[06:55:13] PROBLEM - puppet last run on wasat is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures.
Failed resources (up to 3 shown): Package[jq] [07:00:35] (03CR) 10Alexandros Kosiaris: [C: 04-1] labs LDAP: remove puppetVar attribute (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/309009 (owner: 10Alex Monk) [07:04:38] 07Puppet, 07Beta-Cluster-reproducible: puppet failures due to "Could not find class" or "Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type" - https://phabricator.wikimedia.org/T131946#2627561 (10Joe) >>! In T131946#2627529, @AlexMonk-WMF wrote: >>>! In T131946#2627467, @Joe w... [07:04:57] PROBLEM - puppet last run on elastic2010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:13:11] (03CR) 10Brian Wolff: "The VIPS extension currently does not support exiftool, but it would be fairly easy to add support for it." [puppet] - 10https://gerrit.wikimedia.org/r/309505 (https://phabricator.wikimedia.org/T71336) (owner: 10Dereckson) [07:15:22] (03CR) 10Gilles: [C: 031] thumbor: tune nginx next_upstream behaviour [puppet] - 10https://gerrit.wikimedia.org/r/309574 (https://phabricator.wikimedia.org/T139606) (owner: 10Filippo Giunchedi) [07:16:29] !log installing openjpeg security updates [07:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:19:21] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:21:03] RECOVERY - puppet last run on wasat is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:25:28] !log reimaging mw2077-mw2079, mw2017 to jessie [07:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:30:44] RECOVERY - puppet last run on elastic2010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:31:14] RECOVERY - HTTPS on cp3031 is OK: SSLXNN OK - 37 OK [07:31:14] RECOVERY - HTTPS on cp2009 is OK: SSLXNN OK - 37 OK [07:31:37] RECOVERY - HTTPS on cp1050 is 
OK: SSLXNN OK - 37 OK [07:31:37] RECOVERY - HTTPS on cp1066 is OK: SSLXNN OK - 37 OK [07:31:37] RECOVERY - HTTPS on cp2014 is OK: SSLXNN OK - 37 OK [07:31:37] RECOVERY - HTTPS on cp3043 is OK: SSLXNN OK - 37 OK [07:31:37] RECOVERY - HTTPS on cp4001 is OK: SSLXNN OK - 37 OK [07:31:37] RECOVERY - HTTPS on cp4016 is OK: SSLXNN OK - 37 OK [07:31:37] RECOVERY - HTTPS on cp4007 is OK: SSLXNN OK - 37 OK [07:32:58] RECOVERY - HTTPS on cp1048 is OK: SSLXNN OK - 37 OK [07:32:58] RECOVERY - HTTPS on cp3030 is OK: SSLXNN OK - 37 OK [07:33:20] RECOVERY - HTTPS on cp2012 is OK: SSLXNN OK - 37 OK [07:33:27] RECOVERY - HTTPS on cp3032 is OK: SSLXNN OK - 37 OK [07:33:27] RECOVERY - HTTPS on cp1062 is OK: SSLXNN OK - 37 OK [07:33:27] RECOVERY - HTTPS on cp1059 is OK: SSLXNN OK - 37 OK [07:33:27] RECOVERY - HTTPS on cp1051 is OK: SSLXNN OK - 37 OK [07:33:27] RECOVERY - HTTPS on cp3034 is OK: SSLXNN OK - 37 OK [07:33:27] RECOVERY - HTTPS on cp2024 is OK: SSLXNN OK - 37 OK [07:33:47] RECOVERY - HTTPS on cp1099 is OK: SSLXNN OK - 37 OK [07:33:47] RECOVERY - HTTPS on cp2011 is OK: SSLXNN OK - 37 OK [07:33:47] RECOVERY - HTTPS on cp3035 is OK: SSLXNN OK - 37 OK [07:33:58] RECOVERY - HTTPS on cp3036 is OK: SSLXNN OK - 37 OK [07:33:58] RECOVERY - HTTPS on cp2003 is OK: SSLXNN OK - 37 OK [07:33:58] RECOVERY - HTTPS on cp3004 is OK: SSLXNN OK - 37 OK [07:34:17] RECOVERY - HTTPS on cp2006 is OK: SSLXNN OK - 37 OK [07:34:37] RECOVERY - HTTPS on cp3006 is OK: SSLXNN OK - 37 OK [07:34:37] RECOVERY - HTTPS on cp3010 is OK: SSLXNN OK - 37 OK [07:34:37] RECOVERY - HTTPS on cp3038 is OK: SSLXNN OK - 37 OK [07:34:37] RECOVERY - HTTPS on cp2019 is OK: SSLXNN OK - 37 OK [07:34:37] RECOVERY - HTTPS on cp3009 is OK: SSLXNN OK - 37 OK [07:34:48] RECOVERY - HTTPS on cp1068 is OK: SSLXNN OK - 37 OK [07:34:57] RECOVERY - HTTPS on cp1058 is OK: SSLXNN OK - 37 OK [07:34:57] RECOVERY - HTTPS on cp2002 is OK: SSLXNN OK - 37 OK [07:34:57] RECOVERY - HTTPS on cp3007 is OK: SSLXNN OK - 37 OK [07:34:57] RECOVERY - 
HTTPS on cp2022 is OK: SSLXNN OK - 37 OK [07:41:27] (03PS2) 10DCausse: Upgrade elasticsearch pluglins to 2.4.0 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/309597 (https://phabricator.wikimedia.org/T145199) [07:43:51] (03CR) 10DCausse: "updated extra-2.4.0 to includ the latest patch" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/309597 (https://phabricator.wikimedia.org/T145199) (owner: 10DCausse) [07:45:05] (03CR) 10Alexandros Kosiaris: prometheus::ops: allow using puppetdb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/304983 (https://phabricator.wikimedia.org/T142846) (owner: 10Giuseppe Lavagetto) [07:48:16] <_joe_> akosiaris: I hate lint ignore comments, and I think those were introduced because we were just finding existing occurrences [07:48:25] <_joe_> but I'm ok with being consistent [07:48:30] <_joe_> I'll amend [07:48:50] * _joe_ hates enforced linters so much [07:49:19] <_joe_> I think there is no rule that can beat our best judgement on what's more readable in specific cases [07:53:15] good morning [07:54:37] 07Puppet, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: Puppet failing on deployment-conf03 due to missing files - https://phabricator.wikimedia.org/T144703#2627614 (10hashar) [07:56:53] (03CR) 10Alexandros Kosiaris: [C: 04-1] "questions inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/304983 (https://phabricator.wikimedia.org/T142846) (owner: 10Giuseppe Lavagetto) [08:04:39] PROBLEM - HHVM rendering on mw2017 is CRITICAL: Connection refused [08:04:43] (03CR) 10Giuseppe Lavagetto: prometheus::ops: allow using puppetdb (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/304983 (https://phabricator.wikimedia.org/T142846) (owner: 10Giuseppe Lavagetto) [08:05:39] PROBLEM - DPKG on mw2017 is CRITICAL: Connection refused by host [08:06:17] (03CR) 10Gehel: Upgrade elasticsearch pluglins to 2.4.0 (031 comment) [software/elasticsearch/plugins] - 
https://gerrit.wikimedia.org/r/309597 (https://phabricator.wikimedia.org/T145199) (owner: DCausse)
[08:06:19] PROBLEM - Disk space on mw2017 is CRITICAL: Connection refused by host
[08:06:49] PROBLEM - HHVM processes on mw2017 is CRITICAL: Connection refused by host
[08:10:15] (CR) DCausse: "yes I've released extra a first time but I missed a commit" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/309597 (https://phabricator.wikimedia.org/T145199) (owner: DCausse)
[08:13:09] (PS7) Giuseppe Lavagetto: prometheus::ops: allow using puppetdb [puppet] - https://gerrit.wikimedia.org/r/304983 (https://phabricator.wikimedia.org/T142846)
[08:13:22] <_joe_> akosiaris: yeah the function had a big bug
[08:13:36] <_joe_> akosiaris: you're supposed to call get_nodes({'cluster
[08:14:27] <_joe_> akosiaris: you're supposed to call get_nodes({'cluster' => [a,b], 'site' => ['eqiad', 'codfw']}) and get all nodes corresponding to the query, grouped by cluster/site
[08:17:38] Operations, LDAP, Patch-For-Review: Enhance group membership visibility using the memberof LDAP overlay - https://phabricator.wikimedia.org/T142817#2627665 (MoritzMuehlenhoff) This is now enabled, but there seems to be a problem of the memberof overlay in combination with mirror mode/syncrepl: Change...
[08:17:54] _joe_: tbh, I would expect a function called get_nodes to return nodes as the first level of the data structure with cluster, site as attributes. get_clusters() seems more appropriate to the data structure the function is returning
[08:19:09] Operations: Cronspam from terbium - https://phabricator.wikimedia.org/T145360#2627669 (jcrespo) I am not sure wikitech should be reachable by terbium maintenance, and less by a production credential user like wikiadmin. Labswiki is not a production-core wiki, and it is not in the production network.
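The naming question raised at 08:14:27 and 08:17:54 comes down to two possible return shapes for the same query. The following is only a rough Python sketch of that difference, not the actual wmflib function (which is a Puppet function): the inventory, the host names and the helper `matches` are all invented for illustration, and only the selector format and the names `get_nodes` / `get_clusters` come from the log.

```python
from collections import defaultdict

# Hypothetical inventory: host names and cluster letters are invented for illustration.
INVENTORY = [
    {"host": "host1001", "cluster": "a", "site": "eqiad"},
    {"host": "host1002", "cluster": "a", "site": "eqiad"},
    {"host": "host2001", "cluster": "a", "site": "codfw"},
    {"host": "host2002", "cluster": "b", "site": "codfw"},
    {"host": "host3001", "cluster": "b", "site": "esams"},
    {"host": "host1003", "cluster": "c", "site": "eqiad"},
]


def matches(node, selector):
    """True if the node's value for every selector key is in the allowed list."""
    return all(node.get(key) in allowed for key, allowed in selector.items())


def get_clusters(selector, inventory=INVENTORY):
    """Grouped shape: {cluster: {site: [hosts]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for node in inventory:
        if matches(node, selector):
            grouped[node["cluster"]][node["site"]].append(node["host"])
    # Convert the nested defaultdicts to plain dicts before returning.
    return {cluster: dict(sites) for cluster, sites in grouped.items()}


def get_nodes(selector, inventory=INVENTORY):
    """Node-first shape: {host: {"cluster": ..., "site": ...}}."""
    return {
        node["host"]: {"cluster": node["cluster"], "site": node["site"]}
        for node in inventory
        if matches(node, selector)
    }


selector = {"cluster": ["a", "b"], "site": ["eqiad", "codfw"]}
by_cluster = get_clusters(selector)
by_node = get_nodes(selector)
```

Under these assumptions, `by_cluster` is keyed by cluster and then site, which suits per-cluster, per-site configuration, while `by_node` is keyed by host, which is what the name `get_nodes` suggests.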
[08:19:42] <_joe_> akosiaris: ok, I think the DS we're returning is more useful in general
[08:19:58] <_joe_> and yes, that name might be more appropriate
[08:21:01] <_joe_> changing that
[08:24:20] <_joe_> akosiaris: fun fact: the Nagios_hostextinfo exported resource includes 'host_name' as a parameter in active_record, and doesn't in puppetdb
[08:24:36] how can that be?
[08:24:47] <_joe_> well, I just verified it :)
[08:25:09] grrr
[08:27:30] <_joe_> select param_names.name, param_values.value from param_values JOIN param_names on param_name_id = param_names.id where resource_id = XXX
[08:27:42] <_joe_> (note the beautiful normalization of the database)
[08:28:39] <_joe_> this is going to make my naggen2 changes all more ugly
[08:28:40] <_joe_> sigh
[08:37:24] Puppet, Beta-Cluster-reproducible: puppet failures due to "Could not find class" or "Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type" - https://phabricator.wikimedia.org/T131946#2627690 (hashar) ``` # du -m --max-depth 2 --one-file-system / |sort -rn|head -n10 6477...
[08:38:38] Puppet, Beta-Cluster-reproducible: puppet failures due to "Could not find class" or "Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type" - https://phabricator.wikimedia.org/T131946#2627695 (hashar) Sorry I have commented on the wrong task. My removed comment was about T1...
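The query pasted at 08:27:30 reads resource parameters from a normalized pair of tables. As a hedged illustration, the sketch below runs that same join against an in-memory SQLite database. The table layout is an assumption built only from the column names in the query, the rows are invented, and the real active_record store is a different database with more columns.

```python
import sqlite3


def fetch_params(conn, resource_id):
    """Return {param name: value} for one resource, using the join from the log."""
    rows = conn.execute(
        "SELECT param_names.name, param_values.value "
        "FROM param_values JOIN param_names ON param_name_id = param_names.id "
        "WHERE resource_id = ?",
        (resource_id,),
    ).fetchall()
    # Each row is a (name, value) pair; a repeated name would keep only the last value.
    return dict(rows)


# Assumed minimal schema and made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE param_names (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE param_values (
        id INTEGER PRIMARY KEY,
        resource_id INTEGER,
        param_name_id INTEGER,
        value TEXT
    );
    INSERT INTO param_names (id, name) VALUES (1, 'host_name'), (2, 'notes');
    INSERT INTO param_values (resource_id, param_name_id, value) VALUES
        (42, 1, 'example-host'),
        (42, 2, 'illustrative note'),
        (43, 2, 'other resource');
    """
)

params = fetch_params(conn, 42)
```

In this toy data, resource 42 carries a `host_name` row and resource 43 does not, which mirrors the kind of difference between backends described at 08:24:20.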
[08:39:01] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] [08:41:01] (03CR) 10Alexandros Kosiaris: prometheus::ops: allow using puppetdb (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/304983 (https://phabricator.wikimedia.org/T142846) (owner: 10Giuseppe Lavagetto) [08:41:43] (03Abandoned) 10Alex Monk: labs LDAP: remove puppetVar attribute [puppet] - 10https://gerrit.wikimedia.org/r/309009 (owner: 10Alex Monk) [08:41:48] (03CR) 10Alexandros Kosiaris: "Aside from the non-intuitive function name get_nodes() (already suggested a slightly better name in PS6) rest LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/304983 (https://phabricator.wikimedia.org/T142846) (owner: 10Giuseppe Lavagetto) [08:43:01] (03PS8) 10Giuseppe Lavagetto: prometheus::ops: allow using puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/304983 (https://phabricator.wikimedia.org/T142846) [08:43:21] <_joe_> akosiaris: I fixed the function name and a couple more things, I think it's ok now [08:43:44] <_joe_> I'm going to merge it, verify it works as expected, and move on to naggen2 [08:43:59] <_joe_> I think after that we're done with things querying the db directly [08:44:14] ok [08:44:26] (03CR) 10Giuseppe Lavagetto: [C: 032] prometheus::ops: allow using puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/304983 (https://phabricator.wikimedia.org/T142846) (owner: 10Giuseppe Lavagetto) [08:45:42] RECOVERY - Disk space on mw2017 is OK: DISK OK [08:46:32] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:47:52] RECOVERY - DPKG on mw2017 is OK: All packages OK [08:48:20] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2627710 (10hashar) [08:50:13] (03PS1) 10Giuseppe Lavagetto: wmflib: fix get_clusters calls to custom functions [puppet] - 
10https://gerrit.wikimedia.org/r/309951 [08:53:45] (03CR) 10Giuseppe Lavagetto: [C: 032] wmflib: fix get_clusters calls to custom functions [puppet] - 10https://gerrit.wikimedia.org/r/309951 (owner: 10Giuseppe Lavagetto) [08:54:02] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [08:59:19] <_joe_> grrr [09:00:16] (03PS1) 10Giuseppe Lavagetto: wmflib: brown paper bag fix to get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/309952 [09:02:25] (03CR) 10Giuseppe Lavagetto: "The change itself seems correct, but we might want to move to use" [puppet] - 10https://gerrit.wikimedia.org/r/309929 (https://phabricator.wikimedia.org/T144703) (owner: 10Alex Monk) [09:03:03] (03CR) 10Giuseppe Lavagetto: [C: 032] wmflib: brown paper bag fix to get_clusters [puppet] - 10https://gerrit.wikimedia.org/r/309952 (owner: 10Giuseppe Lavagetto) [09:06:33] (03PS2) 10Filippo Giunchedi: thumbor: tune nginx next_upstream behaviour [puppet] - 10https://gerrit.wikimedia.org/r/309574 (https://phabricator.wikimedia.org/T139606) [09:06:59] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2627760 (10elukey) [09:08:13] (03CR) 10Filippo Giunchedi: [C: 032] thumbor: tune nginx next_upstream behaviour [puppet] - 10https://gerrit.wikimedia.org/r/309574 (https://phabricator.wikimedia.org/T139606) (owner: 10Filippo Giunchedi) [09:17:21] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2627786 (10elukey) I wanted to spin up a new Debian instance with Horizon for deployment-prep but it seems that we are already hitting the resource limits: {F4459123} Maybe... 
[09:19:27] (03PS10) 10Volans: Automation: automatically reimage host [puppet] - 10https://gerrit.wikimedia.org/r/308520 (https://phabricator.wikimedia.org/T143536) [09:20:22] (03PS1) 10Gilles: Upgrade to 0.1.17 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/309955 [09:22:03] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [09:22:22] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [09:23:37] (03PS11) 10Volans: Automation: automatically reimage host [puppet] - 10https://gerrit.wikimedia.org/r/308520 (https://phabricator.wikimedia.org/T143536) [09:27:12] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [09:28:47] (03PS3) 10Giuseppe Lavagetto: Bump scap version to 3.2.5-1 [puppet] - 10https://gerrit.wikimedia.org/r/309635 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [09:29:29] (03PS2) 10Gilles: Upgrade to 0.1.17 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/309955 [09:33:44] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 0.1.17 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/309955 (owner: 10Gilles) [09:33:57] (03CR) 10Giuseppe Lavagetto: [C: 032] Bump scap version to 3.2.5-1 [puppet] - 10https://gerrit.wikimedia.org/r/309635 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [09:39:45] (03PS3) 10Giuseppe Lavagetto: Mathoid: Use Scap3 to deploy the config [puppet] - 10https://gerrit.wikimedia.org/r/308574 (https://phabricator.wikimedia.org/T144755) (owner: 10Mobrovac) [09:40:22] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [09:43:19] (03CR) 10Giuseppe Lavagetto: [C: 032] Mathoid: Use Scap3 to deploy the config [puppet] - 
10https://gerrit.wikimedia.org/r/308574 (https://phabricator.wikimedia.org/T144755) (owner: 10Mobrovac) [09:46:01] 07Puppet, 07Beta-Cluster-reproducible: puppet failures due to "Could not find class" or "Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type" - https://phabricator.wikimedia.org/T131946#2627829 (10hashar) There is a bit of a misunderstanding about how beta differs from producti... [09:47:02] !log depool cp4006 (503 Could not get storage) [09:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:50:55] (03PS2) 10Muehlenhoff: Mark mw2017 and mw2099 as codfw test app servers [puppet] - 10https://gerrit.wikimedia.org/r/309554 [10:02:57] !log deploying schema change on s4 hosts T139090 [10:02:59] T139090: Deploy I2b042685 to all databases - https://phabricator.wikimedia.org/T139090 [10:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:04:21] !log Testing schema change on db1039 - T141951 [10:04:22] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - mathoid_10042 - Could not depool server scb1001.eqiad.wmnet because of too many down! [10:04:23] T141951: Add local_user_id and global_user_id fields to localuser table in centralauth database - https://phabricator.wikimedia.org/T141951 [10:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:04:52] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - mathoid_10042 - Could not depool server scb1001.eqiad.wmnet because of too many down! [10:05:30] <_joe_> mathoid is back up, btw [10:05:42] PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - mathoid_10042 - Could not depool server scb1001.eqiad.wmnet because of too many down! 
[10:06:52] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [10:07:25] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [10:07:53] !log reimage mw2198, mw2199 to Jessie (again) T143536 [10:07:54] T143536: Upgrade all mw* servers to debian jessie - https://phabricator.wikimedia.org/T143536 [10:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:08:12] RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy [10:14:34] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [10:19:45] !log decomissioning mw2061-mw2074 (Bug: T144745) [10:19:46] T144745: Remove mw2061-mw2074 - https://phabricator.wikimedia.org/T144745 [10:19:50] (03PS2) 10Muehlenhoff: Decom mw2061-mw2074 [puppet] - 10https://gerrit.wikimedia.org/r/309572 (https://phabricator.wikimedia.org/T144745) [10:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:26:50] (03CR) 10Alexandros Kosiaris: [C: 032] package_builder: support '-backports' in distribution [puppet] - 10https://gerrit.wikimedia.org/r/309568 (owner: 10Hashar) [10:26:54] (03PS3) 10Alexandros Kosiaris: package_builder: support '-backports' in distribution [puppet] - 10https://gerrit.wikimedia.org/r/309568 (owner: 10Hashar) [10:26:56] (03CR) 10Alexandros Kosiaris: [V: 032] package_builder: support '-backports' in distribution [puppet] - 10https://gerrit.wikimedia.org/r/309568 (owner: 10Hashar) [10:28:19] !log renaming tables in db1015 - T132837 [10:28:20] T132837: hitcounter and _counter tables are on the cluster but were deleted/unsused? 
- https://phabricator.wikimedia.org/T132837 [10:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:39:00] (PS1) Mobrovac: service::node: Also do the Scap fetch phase when refreshing the config [puppet] - https://gerrit.wikimedia.org/r/309963 [10:40:35] (PS1) Ema: Upgrade upload eqiad to Varnish 4 [puppet] - https://gerrit.wikimedia.org/r/309964 (https://phabricator.wikimedia.org/T131502) [10:46:56] (CR) Ema: [C: 2] Upgrade upload eqiad to Varnish 4 [puppet] - https://gerrit.wikimedia.org/r/309964 (https://phabricator.wikimedia.org/T131502) (owner: Ema) [10:48:05] !log upgrade cp1048 to varnish 4 T131502 [10:48:06] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [10:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:50:32] Operations, Performance-Team, Thumbor, Patch-For-Review: thumbor error spawning ghostscript 'libcgroup initialization failed: Cgroup is not mounted' - https://phabricator.wikimedia.org/T144938#2627969 (fgiunchedi) Open>Resolved a:fgiunchedi with native systemd memory cgroup this is no... 
[10:50:35] Operations, Performance-Team, Thumbor, Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2627972 (fgiunchedi) [11:03:26] !log upgrade cp1049 to varnish 4 T131502 [11:03:28] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [11:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:04:32] PROBLEM - Apache HTTP on mw2199 is CRITICAL: Connection timed out [11:04:52] PROBLEM - Apache HTTP on mw2198 is CRITICAL: Connection timed out [11:05:41] PROBLEM - nutcracker process on mw2198 is CRITICAL: Timeout while attempting connection [11:05:41] PROBLEM - puppet last run on mw2199 is CRITICAL: Timeout while attempting connection [11:06:11] PROBLEM - puppet last run on mw2198 is CRITICAL: Timeout while attempting connection [11:06:11] PROBLEM - salt-minion processes on mw2199 is CRITICAL: Timeout while attempting connection [11:06:23] (PS1) Giuseppe Lavagetto: mathoid: add test variable [puppet] - https://gerrit.wikimedia.org/r/309966 [11:06:31] <_joe_> mobrovac: ^^ [11:07:26] mw2198-9 was me, reimaging, silenced [11:08:38] _joe_: running pcc on it [11:08:44] <_joe_> mobrovac: ok [11:10:52] RECOVERY - nutcracker process on mw2198 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [11:11:02] <_joe_> seems legit [11:11:14] RECOVERY - salt-minion processes on mw2199 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:11:26] (CR) Mobrovac: [C: 1] "PCC looking good - https://puppet-compiler.wmflabs.org/4044/" [puppet] - https://gerrit.wikimedia.org/r/309966 (owner: Giuseppe Lavagetto) [11:11:34] _joe_: ^ [11:11:44] <_joe_> yeah already saw it [11:11:47] (CR) Giuseppe Lavagetto: [C: 2] mathoid: add test variable [puppet] - https://gerrit.wikimedia.org/r/309966 (owner: Giuseppe Lavagetto) [11:12:03] hehehe [11:12:09] (PS2) Giuseppe Lavagetto: 
mathoid: add test variable [puppet] - https://gerrit.wikimedia.org/r/309966 [11:12:16] RECOVERY - Apache HTTP on mw2199 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.078 second response time [11:12:29] (CR) Giuseppe Lavagetto: [V: 2] mathoid: add test variable [puppet] - https://gerrit.wikimedia.org/r/309966 (owner: Giuseppe Lavagetto) [11:12:41] RECOVERY - Apache HTTP on mw2198 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.192 second response time [11:14:07] <_joe_> ok so, running puppet on scb2002 first [11:16:16] (PS2) Urbanecm: Add throttling rule for University of Canterbury [mediawiki-config] - https://gerrit.wikimedia.org/r/309943 (https://phabricator.wikimedia.org/T145327) [11:16:27] <_joe_> mobrovac: uhm take a look [11:16:35] at puppet there? [11:16:54] <_joe_> nope sorry, it's ok [11:17:04] <_joe_> just the symlink wasn't updated [11:17:14] <_joe_> but the actual config file was [11:17:50] (PS3) Urbanecm: Add HD logos for hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/309375 (https://phabricator.wikimedia.org/T145017) [11:17:52] !log upgrade cp1050 to varnish 4 T131502 [11:17:54] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [11:17:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:18:09] <_joe_> mobrovac: so, we're ok [11:18:09] _joe_: right, but the test var isn't in it because it's not specified in the template [11:18:14] <_joe_> yes [11:18:49] _joe_: to be extra sure, we should change a value that is already in the template, like the logstash host or something like that, run puppet only in codfw and see it update [11:18:54] (PS1) Giuseppe Lavagetto: Revert "mathoid: add test variable" [puppet] - https://gerrit.wikimedia.org/r/309968 [11:19:08] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Revert "mathoid: add test variable" [puppet] - https://gerrit.wikimedia.org/r/309968 (owner: Giuseppe Lavagetto) [11:19:29] but 
yeah, i ca see the config file got touched by scap [11:19:50] <_joe_> mobrovac: no need for that, really [11:20:01] <_joe_> also, the service got restarted as expected [11:20:04] <_joe_> :) [11:20:16] yay [11:20:24] _joe_: thnx for the extra check [11:21:38] Operations, Mail, OTRS, Wiki-Loves-Monuments: E-mails not being received by OTRS - https://phabricator.wikimedia.org/T145293#2628071 (siebrand) wikilovesmonuments@wikimedia.nl is a Google Group with the following aliases: * wlm@wikimedia.nl and * wikilovesmonuments@wmnederland.nl Messages to... [11:25:30] Operations, Project-Admins, Release-Engineering-Team: #blocked-on-schema-change was archived, now schema change workflow is broken - https://phabricator.wikimedia.org/T145361#2628076 (jcrespo) @Aklapper you literally told me to create this project: T119751#1835024 Now this and #Blocked-on-operation... [11:26:53] (PS4) Urbanecm: Add HD logos for hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/309375 (https://phabricator.wikimedia.org/T145017) [11:27:58] (PS5) Urbanecm: Add HD logos for hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/309375 (https://phabricator.wikimedia.org/T145017) [11:28:23] Operations, Mail, OTRS, Wiki-Loves-Monuments: E-mails not being received by OTRS - https://phabricator.wikimedia.org/T145293#2628082 (siebrand) [[ https://support.google.com/a/answer/1185267 | Google states ]] that info-nl@wikilovesmonuments.org (presumed the Google account of that email adress)... 
[11:29:21] Operations, Performance-Team, Thumbor, Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2628085 (Gilles) [11:29:24] Operations, Performance-Team, Thumbor: invalid literal for int() with base 10: - https://phabricator.wikimedia.org/T145061#2628083 (Gilles) Open>Resolved Definitely fixed since the cgroup issue was resolved [11:29:40] Operations, Performance-Team, Thumbor: Unsupported header value None - https://phabricator.wikimedia.org/T145051#2628086 (Gilles) Open>Resolved Was almost certainly the cgroup issue, doesn't happen anymore [11:29:43] Operations, Performance-Team, Thumbor, Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2437379 (Gilles) [11:30:08] Operations, Performance-Team, Thumbor: thumbor handling of originals 404 - https://phabricator.wikimedia.org/T144956#2628091 (Gilles) Open>Resolved a:Gilles This should be fixed now [11:30:11] Operations, Performance-Team, Thumbor, Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2437379 (Gilles) [11:32:45] Operations, Performance-Team, Thumbor: SVG type check too slow in production - https://phabricator.wikimedia.org/T145377#2628098 (Gilles) [11:32:51] !log upgrade cp1062 to varnish 4 T131502 [11:32:53] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [11:32:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:33:00] Operations, Performance-Team, Thumbor: SVG type check too slow in production - https://phabricator.wikimedia.org/T145377#2628114 (Gilles) a:fgiunchedi>Gilles [11:33:54] (PS4) ArielGlenn: fix up locking for misc dumps [dumps] - https://gerrit.wikimedia.org/r/308016 [11:40:38] !log change-prop deploying 79b172a [11:40:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:43:03] Operations, Performance-Team, Thumbor: SVG type check too slow in production - https://phabricator.wikimedia.org/T145377#2628098 (MoritzMuehlenhoff) python-magic/file should be reasonably fast? [11:47:44] !log upgrade cp1063 to varnish 4 T131502 [11:47:45] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [11:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:57:33] Operations, Project-Admins, Release-Engineering-Team: #blocked-on-schema-change was archived, now schema change workflow is broken - https://phabricator.wikimedia.org/T145361#2628197 (jcrespo) [11:58:42] (CR) Gehel: [C: 2] Upgrade elasticsearch pluglins to 2.4.0 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/309597 (https://phabricator.wikimedia.org/T145199) (owner: DCausse) [11:59:05] (CR) Gehel: [V: 2] Upgrade elasticsearch pluglins to 2.4.0 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/309597 (https://phabricator.wikimedia.org/T145199) (owner: DCausse) [12:00:15] !log upgrade cp1064 to varnish 4 T131502 [12:00:16] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [12:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:01:44] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 seconds ago with 2 failures. Failed resources (up to 3 shown): Service[varnish],Service[varnish-frontend] [12:03:52] Operations, Project-Admins, Release-Engineering-Team: #blocked-on-schema-change was archived, now schema change workflow is broken - https://phabricator.wikimedia.org/T145361#2628201 (Aklapper) >>! In T145361#2628076, @jcrespo wrote: > @Aklapper you literally told me to create this project: More rec... 
[12:04:23] RECOVERY - puppet last run on cp1064 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:04:44] Operations, Discovery, Discovery-Search, Elasticsearch, Wikimedia-Logstash: Upgrade elasticsearch to 2.4.0 - https://phabricator.wikimedia.org/T145058#2628205 (Gehel) Note: upgrade to elastic 2.4.0 require new packaging of elasticsearch plugins. See T145199. [12:11:45] !log upgrade cp1071 to varnish 4 T131502 [12:11:47] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [12:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:16:10] Operations, Project-Admins, Release-Engineering-Team: #blocked-on-schema-change was archived, now schema change workflow is broken - https://phabricator.wikimedia.org/T145361#2628241 (jcrespo) @Aklapper, now that I know the context (I wasn't aware of that task, and that it a problem by itself), it wa... [12:17:09] Operations, Performance-Team, Thumbor: SVG type check too slow in production - https://phabricator.wikimedia.org/T145377#2628243 (Gilles) I'm just going to rely on Thumbor, which does its own basic check for SVG now that it recently added support for it. [12:17:48] Operations, Performance-Team, Thumbor: SVG type check too slow in production - https://phabricator.wikimedia.org/T145377#2628244 (Gilles) https://github.com/thumbor/thumbor/blob/master/thumbor/engines/__init__.py#L122 [12:27:19] Operations, ops-eqiad, hardware-requests: decommission snapshot1002, 1003, 1004 - https://phabricator.wikimedia.org/T141762#2628266 (ArielGlenn) Polite nag :-) [12:30:42] !log upgrade cp1072 to varnish 4 T131502 [12:30:44] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [12:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:32:43] (CR) Hashar: [C: 1] "We can drop it indeed." 
[puppet] - https://gerrit.wikimedia.org/r/301523 (owner: Krinkle) [12:32:53] RECOVERY - puppet last run on mw2199 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [12:33:25] RECOVERY - puppet last run on mw2198 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [12:35:12] volans: --^ script working fine? [12:35:48] almost :) [12:36:29] * elukey dances [12:38:21] (PS1) Mobrovac: Change-Prop: Use Scap3 for config deploys [puppet] - https://gerrit.wikimedia.org/r/309979 (https://phabricator.wikimedia.org/T144595) [12:44:40] (CR) Mobrovac: "PCC ok - https://puppet-compiler.wmflabs.org/4045/" [puppet] - https://gerrit.wikimedia.org/r/309979 (https://phabricator.wikimedia.org/T144595) (owner: Mobrovac) [12:45:12] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 683 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5081093 keys - replication_delay is 683 [12:47:25] Operations, Ops-Access-Requests: Requesting access to stat1003, stat1002, and fluorine for MelodyKramer - https://phabricator.wikimedia.org/T145387#2628342 (MelodyKramer) [12:47:34] !log upgrade cp1073 to varnish 4 T131502 [12:47:35] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [12:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:51:22] (PS2) Mobrovac: Change-Prop: Use Scap3 for config deploys [puppet] - https://gerrit.wikimedia.org/r/309979 (https://phabricator.wikimedia.org/T144595) [12:51:23] SWAT is going to be busy :) [12:53:02] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5054647 keys - replication_delay is 0 [12:53:27] Operations, Discovery, Discovery-Search, Elasticsearch, Wikimedia-Logstash: Upgrade elasticsearch to 2.4.0 - https://phabricator.wikimedia.org/T145058#2628359 (Gehel) 2.4.0 upgrade generate 
error when trying to process our existing mappings: java.lang.IllegalArgumentException: Cannot set p... [12:54:26] (PS1) BBlack: upload: raise be->be max conns to 10K [puppet] - https://gerrit.wikimedia.org/r/309982 [12:55:32] (CR) Ema: [C: 1] upload: raise be->be max conns to 10K [puppet] - https://gerrit.wikimedia.org/r/309982 (owner: BBlack) [12:55:51] (CR) Hashar: Add HD logos for hewiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/309375 (https://phabricator.wikimedia.org/T145017) (owner: Urbanecm) [12:55:52] Operations, Ops-Access-Requests: Requesting access to stat1003, stat1002, and fluorine for MelodyKramer - https://phabricator.wikimedia.org/T145387#2628361 (MelodyKramer) Wait! I regenerated the key for my work address. Please use this one: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQC+0/t+joJHGh5gAGAl4qGNKMt... [12:56:02] (CR) BBlack: [C: 2] upload: raise be->be max conns to 10K [puppet] - https://gerrit.wikimedia.org/r/309982 (owner: BBlack) [12:56:09] (PS1) Gehel: Revert "Upgrade elasticsearch pluglins to 2.4.0" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/309983 [12:56:14] hashar: what's the plan for swat today? 
[12:56:39] (CR) Gehel: [C: 2 V: 2] Revert "Upgrade elasticsearch pluglins to 2.4.0" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/309983 (owner: Gehel) [12:57:11] (CR) Hashar: Add HD logos for hewiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/309375 (https://phabricator.wikimedia.org/T145017) (owner: Urbanecm) [12:57:21] (PS2) Hashar: Allow sysops/'crats to assign massmessage-sender in urwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/309368 (https://phabricator.wikimedia.org/T144701) (owner: Urbanecm) [12:57:23] (PS6) Hashar: Add HD logos for hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/309375 (https://phabricator.wikimedia.org/T145017) (owner: Urbanecm) [12:57:25] (PS2) Hashar: Limit file uploads on Ladino Wikipedia to sysops [mediawiki-config] - https://gerrit.wikimedia.org/r/309348 (https://phabricator.wikimedia.org/T145090) (owner: Urbanecm) [12:57:35] zeljkof: I have rebased a few patches from mediawiki-config [12:59:22] hashar: who is doing the swat? you? [12:59:48] godog: is /mnt/upload7 entirely gone from production and moved to SWIFT? [12:59:58] we are at maximum, 8 patches :) [13:00:04] hashar, Dereckson, addshore, and aude: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160912T1300). Please do the needful. [13:00:04] Urbanecm, Krenair, stephanebisson, MatmaRex, yurik, and jgirault: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[13:00:15] Present [13:00:16] yeah hi [13:00:18] hello [13:00:22] hello [13:00:29] going to push Urbanecm fix first [13:00:39] (CR) Hashar: [C: 2] Limit file uploads on Ladino Wikipedia to sysops [mediawiki-config] - https://gerrit.wikimedia.org/r/309348 (https://phabricator.wikimedia.org/T145090) (owner: Urbanecm) [13:00:41] (CR) Hashar: [C: 2] Allow sysops/'crats to assign massmessage-sender in urwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/309368 (https://phabricator.wikimedia.org/T144701) (owner: Urbanecm) [13:00:43] (CR) Hashar: [C: 2] Add HD logos for hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/309375 (https://phabricator.wikimedia.org/T145017) (owner: Urbanecm) [13:01:08] I am doing all four in one batch [13:01:09] (Merged) jenkins-bot: Limit file uploads on Ladino Wikipedia to sysops [mediawiki-config] - https://gerrit.wikimedia.org/r/309348 (https://phabricator.wikimedia.org/T145090) (owner: Urbanecm) [13:01:11] (Merged) jenkins-bot: Allow sysops/'crats to assign massmessage-sender in urwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/309368 (https://phabricator.wikimedia.org/T144701) (owner: Urbanecm) [13:01:13] (Merged) jenkins-bot: Add HD logos for hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/309375 (https://phabricator.wikimedia.org/T145017) (owner: Urbanecm) [13:01:46] (PS3) Hashar: Add throttling rule for University of Canterbury [mediawiki-config] - https://gerrit.wikimedia.org/r/309943 (https://phabricator.wikimedia.org/T145327) (owner: Urbanecm) [13:02:05] (CR) Hashar: Add throttling rule for University of Canterbury (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/309943 (https://phabricator.wikimedia.org/T145327) (owner: Urbanecm) [13:02:21] zeljkof: sorry missed your ping. 
I am handling it :D [13:02:51] (CR) Hashar: [C: 2] Add throttling rule for University of Canterbury [mediawiki-config] - https://gerrit.wikimedia.org/r/309943 (https://phabricator.wikimedia.org/T145327) (owner: Urbanecm) [13:03:25] (Merged) jenkins-bot: Add throttling rule for University of Canterbury [mediawiki-config] - https://gerrit.wikimedia.org/r/309943 (https://phabricator.wikimedia.org/T145327) (owner: Urbanecm) [13:03:49] Urbanecm: I have pulled all four of your changes on mw1099 [13:04:15] Checking... [13:04:48] hashar: ok, have fun :) [13:04:56] !log upgrade cp1074 to varnish 4 T131502 [13:04:58] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [13:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:07:16] stephanebisson: I am not sure about https://gerrit.wikimedia.org/r/#/c/309623/1/includes/controller/ModerationController.php [13:07:23] Operations, Project-Admins, Release-Engineering-Team: #blocked-on-schema-change was archived, now schema change workflow is broken - https://phabricator.wikimedia.org/T145361#2628396 (Aklapper) @jcrespo: Makes a lot of sense! No worries; I just was a bit confused by the choice of words (and probably... [13:07:28] and how it is sending apparently mass of queries to the database master :( [13:08:39] hashar: it's deferred, but yes. Otherwise we create work with lagged data and end up setting wrong notifications count [13:08:40] stephanebisson: then that is "just" for messages being moderated isn't it ? [13:08:43] hashar: yeah afaik /mnt/upload7 is gone, though Krenair would/can confirm [13:08:58] godog: he sent the patch to drop upload7 so I guess that is a confirmation :] [13:09:07] hehe then yeah [13:09:12] hashar: it's for flow post or topics being moderated. It doesn't happen very often in practice but it could. 
[13:09:23] godog: I will deploy his change soonish ( https://gerrit.wikimedia.org/r/#/c/280170/ ) [13:09:27] PROBLEM - puppet last run on db2053 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ferm/conf.d/00_main] [13:09:41] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [13:09:48] ^interesting [13:09:58] I am talking about db2053 [13:09:59] hashar: it's not really urgent. Do you prefer to check with RoanKattouw (the patch author)? [13:10:18] stephanebisson: if it is not urgent, yeah would be better to double check [13:10:33] hashar: ok, no worries [13:10:51] stephanebisson: though I barely know what the code is going to do, there might be a risk of suddenly flooding the master with queries. But maybe it is just a false alarm [13:11:17] can you move it to the next SWAT window pending confirmation from Catrope? [13:11:42] hashar: I will bring it up with Roan today but this code is gonna roll the train tomorrow ;) [13:11:48] (CR) Muehlenhoff: [C: 2] Decom mw2061-mw2074 [puppet] - https://gerrit.wikimedia.org/r/309572 (https://phabricator.wikimedia.org/T144745) (owner: Muehlenhoff) [13:11:53] (PS3) Muehlenhoff: Decom mw2061-mw2074 [puppet] - https://gerrit.wikimedia.org/r/309572 (https://phabricator.wikimedia.org/T144745) [13:11:54] stephanebisson: yeah so better hurry :] [13:12:21] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [13:12:23] stephanebisson: I thought it has been added to the swat slot because of some emergency. If it is a corner case, probably better to get some extra time to confirm the DB_MASTER is going to be happy [13:12:31] Urbanecm: how is the check going on? Looks fine to me [13:12:36] hashar: You can scap it everywhere, checked. 
[13:12:41] PROBLEM - mediawiki-installation DSH group on mw2199 is CRITICAL: Host mw2199 is not in mediawiki-installation dsh group [13:12:53] PROBLEM - mediawiki-installation DSH group on mw2198 is CRITICAL: Host mw2198 is not in mediawiki-installation dsh group [13:13:11] that's me, fixing [13:15:13] syncing [13:15:17] in several batches [13:15:19] !log hashar@tin Synchronized static/images/project-logos/: Add HD logos for hewiki T145017 (duration: 00m 50s) [13:15:20] T145017: Pixelized logo - hewiki - https://phabricator.wikimedia.org/T145017 [13:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:16:03] hashar: Okay, thanks. [13:16:36] !log hashar@tin Synchronized wmf-config/throttle.php: Add throttling rule for University of Canterbury T145327 (duration: 00m 46s) [13:16:38] T145327: Requesting temporary lift of IP cap for Editathon in University of Canterbury - https://phabricator.wikimedia.org/T145327 [13:16:41] MatmaRex: will do https://gerrit.wikimedia.org/r/#/c/309813/ [13:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:17:39] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 48s) [13:17:42] Hello. MatmaRex: thanks to have quickly planned that. [13:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:17:59] 7 fatal error: Argument 2 passed to UploadStash::__construct() must be an instance of User, bool given in /srv/mediawiki/php-1.28.0-wmf.18/includes/upload/UploadStash.php on line 94 [13:18:04] it is showing up in fatal monitor :) [13:18:15] hashar: yes, this patch fixes that [13:19:31] jgirault: and the Kartographer patch seems straightforward ( https://gerrit.wikimedia.org/r/#/c/309925/ ) then it seems it is hard/not reproducable easily? 
[13:19:46] Operations, Discovery, Discovery-Search, Elasticsearch, Wikimedia-Logstash: Upgrade elasticsearch to 2.4.0 - https://phabricator.wikimedia.org/T145058#2628458 (Gehel) [[ https://github.com/elastic/elasticsearch/issues/20413 | Bug has been filled upstream ]]. [13:20:17] I recommended to deploy it Saturday as an unbreak now, but MatmaRex stressed on the fact upload conditions to reproduce the bug are really limited: 1. use Special:Upload 2. the upload must succeed, but with some error like an invalid title, so it could wait Monday [13:20:45] yeah sounds good [13:20:48] or safe [13:20:58] !log upgrade cp1099 to varnish 4 T131502 [13:20:59] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [13:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:21:18] Operations, Ops-Access-Requests: Requesting access to stat1003, stat1002, and fluorine for MelodyKramer - https://phabricator.wikimedia.org/T145387#2628342 (elukey) Hi Melody! Bureaucracy questions: 1) Have you read and completed all the steps in https://wikitech.wikimedia.org/wiki/Production_shell_a... [13:21:27] Operations, Ops-Access-Requests: Requesting access to stat1003, stat1002, and fluorine for MelodyKramer - https://phabricator.wikimedia.org/T145387#2628465 (elukey) p:Triage>Normal [13:21:30] hashar yes it's a safe patch [13:21:32] * hashar waits for tests to complete [13:21:43] jgirault: is that something you can check on mw1099 easily? [13:21:49] or should I just push it everywhere? 
[13:22:24] hashar no you should put it everywhere [13:22:43] okkk [13:22:49] waiting for merge to occur [13:22:55] Operations, Ops-Access-Requests: Requesting access to stat1003, stat1002, and fluorine for MelodyKramer - https://phabricator.wikimedia.org/T145387#2628467 (Dereckson) [13:23:04] (please ping me when things need testing) [13:23:17] MatmaRex: in a couple minute [13:23:31] the slow php test job is completing [13:23:58] merged [13:25:03] Operations, Ops-Access-Requests: Requesting access to stat1003, stat1002, and fluorine for MelodyKramer - https://phabricator.wikimedia.org/T145387#2628471 (MelodyKramer) @elukey 1. Nope, those weren't in the instructions I was given. I'll do that now and update when finished. 2. Not sure. I'll ask in... [13:25:39] MatmaRex: Dereckson pulled on mw1099 [13:26:23] hashar: aha, testing [13:26:52] jgirault: I am pushing the Kartographer patch :) [13:27:01] who can change the topic for me... [13:27:07] moritzm: [13:27:17] I'm stealing clinic duty for the week [13:27:21] and you're a chanop [13:27:27] !log hashar@tin Synchronized php-1.28.0-wmf.18/extensions/Kartographer: Fix mw.Uri crushing bug T145178 (duration: 00m 49s) [13:27:28] T145178: Sometimes Kartographer crashes on page load - https://phabricator.wikimedia.org/T145178 [13:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:27:52] hashar ok! [13:27:55] thanks [13:28:09] jgirault: poke me if something else is needed [13:28:22] or #wikimedia-releng :D [13:28:23] hashar: yup, verified working [13:28:29] nice!! [13:28:32] Operations, Ops-Access-Requests: Requesting access to stat1003, stat1002, and fluorine for MelodyKramer - https://phabricator.wikimedia.org/T145387#2628499 (MelodyKramer) @elukey I have now completed all of the steps in the requesting access document. [13:28:44] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[13:29:08] apergos: done [13:29:18] I'll ping you for a brief handover in about 10 mins [13:29:29] thank you! [13:29:31] wfm [13:29:54] MatmaRex: syncing it to the whole cluster [13:30:06] (PS2) Giuseppe Lavagetto: naggen2: add --puppetdb switch [puppet] - https://gerrit.wikimedia.org/r/305037 (https://phabricator.wikimedia.org/T142846) [13:31:04] jouncebot: now [13:31:04] For the next 0 hour(s) and 28 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160912T1300) [13:31:22] !log hashar@tin Synchronized php-1.28.0-wmf.18/includes/upload/UploadStash.php: Revert "Clean up user handling in UploadStash" T145228 (duration: 00m 46s) [13:31:23] T145228: Typehint issue in UploadStash::__construct() (causes uploads from stash via Special:Upload to fatal) - https://phabricator.wikimedia.org/T145228 [13:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:31:37] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [13:32:32] !log hashar@tin Synchronized php-1.28.0-wmf.18/maintenance/cleanupUploadStash.php: Revert "Clean up user handling in UploadStash" T145228 (duration: 00m 46s) [13:32:33] T145228: Typehint issue in UploadStash::__construct() (causes uploads from stash via Special:Upload to fatal) - https://phabricator.wikimedia.org/T145228 [13:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:32:56] so last one is dropping upload7 ( https://gerrit.wikimedia.org/r/#/c/280170/ ) [13:33:31] hashar: Are my patches everywhere? 
[13:33:47] (PS1) Andrew Bogott: Add labspuppetbackend::mysql_password: dummy for the puppet compiler [labs/private] - https://gerrit.wikimedia.org/r/309988 [13:33:52] (CR) Andrew Bogott: [C: 2] Remove references to the old virt* servers [puppet] - https://gerrit.wikimedia.org/r/309695 (owner: Alex Monk) [13:34:11] (PS2) Andrew Bogott: Add labspuppetbackend::mysql_password: dummy for the puppet compiler [labs/private] - https://gerrit.wikimedia.org/r/309988 [13:34:19] (CR) Andrew Bogott: [C: 2 V: 2] Add labspuppetbackend::mysql_password: dummy for the puppet compiler [labs/private] - https://gerrit.wikimedia.org/r/309988 (owner: Andrew Bogott) [13:34:34] thanks! [13:34:48] RECOVERY - puppet last run on db2053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:34:58] (CR) Giuseppe Lavagetto: [C: 2] naggen2: add --puppetdb switch [puppet] - https://gerrit.wikimedia.org/r/305037 (https://phabricator.wikimedia.org/T142846) (owner: Giuseppe Lavagetto) [13:35:03] 14 fatal error: Argument 2 passed to UploadStash::__construct() must be an instance of User, bool given in /srv/mediawiki/php-1.28.0-wmf.18/includes/upload/UploadStash.php on line 94 [13:35:07] MatmaRex: ^^ [13:35:54] hashar: if you still see these in the logs, that means my patch was not actually deployed. [13:36:06] Operations, Patch-For-Review: Remove mw2061-mw2074 - https://phabricator.wikimedia.org/T144745#2628531 (MoritzMuehlenhoff) mw2061-mw2074 have been removed from puppet, Icinga, conftool and Salt and were powered down. [13:37:44] Operations, Operations-Software-Development, HHVM, Patch-For-Review: Upgrade all mw* servers to debian jessie - https://phabricator.wikimedia.org/T143536#2628535 (MoritzMuehlenhoff) All mw* servers in codfw with the exception of mw2152 (the video scaler) are now running jessie. 
[13:37:49] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002, and fluorine for MelodyKramer - https://phabricator.wikimedia.org/T145387#2628536 (10elukey) @MelodyKramer thanks! So now there are a couple of things missing (whenever you have time): 1) @Deskana approval. 2) A brief description of... [13:38:03] (03PS3) 10Andrew Bogott: Remove references to the old virt* servers [puppet] - 10https://gerrit.wikimedia.org/r/309695 (owner: 10Alex Monk) [13:38:07] maybe I have screwed up something [13:38:42] "All mw* servers in codfw with the exception of mw2152 (the video scaler) are now running jessie." \o/ [13:39:26] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [13:39:27] congratulations [13:39:31] MatmaRex: those were errors from BEFORE I did the sync. So it should all be fine [13:40:31] <_joe_> elukey, moritzm awesome job! [13:41:35] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002, and fluorine for MelodyKramer - https://phabricator.wikimedia.org/T145387#2628538 (10MelodyKramer) 2. I'm going to be doing some analysis of external and referred traffic to our projects, like those backing our external referrers das... 
[13:43:59] (03CR) 10Mobrovac: [C: 031] "Checked in Beta" [puppet] - 10https://gerrit.wikimedia.org/r/309979 (https://phabricator.wikimedia.org/T144595) (owner: 10Mobrovac) [13:45:05] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [13:45:27] (03CR) 10Hashar: [C: 032] Remove upload7 references [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280170 (https://phabricator.wikimedia.org/T129586) (owner: 10Reedy) [13:46:28] (03PS3) 10Hashar: Remove upload7 references [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280170 (https://phabricator.wikimedia.org/T129586) (owner: 10Reedy) [13:46:56] (03CR) 10Hashar: Remove upload7 references [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280170 (https://phabricator.wikimedia.org/T129586) (owner: 10Reedy) [13:47:01] (03CR) 10Hashar: [C: 032] "rebased, no more upload7 :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280170 (https://phabricator.wikimedia.org/T129586) (owner: 10Reedy) [13:47:32] (03Merged) 10jenkins-bot: Remove upload7 references [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280170 (https://phabricator.wikimedia.org/T129586) (owner: 10Reedy) [13:48:41] (03PS2) 10Alexandros Kosiaris: puppetmaster: recurse on /srv/private permissions [puppet] - 10https://gerrit.wikimedia.org/r/309610 [13:48:42] !log hashar@tin Synchronized wmf-config: Remove upload7 references T129586 (duration: 00m 50s) [13:48:43] (03PS1) 10Alexandros Kosiaris: (WIP) puppetmaster: Experiment in defining a gitpuppet group [puppet] - 10https://gerrit.wikimedia.org/r/309992 [13:48:43] T129586: /mnt/upload7 does not exist anywhere, yet it is referenced in multiple places in wmf-config - https://phabricator.wikimedia.org/T129586 [13:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:50:03] all - 1 patches for European SWAT have been 
deployed [13:50:39] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: recurse on /srv/private permissions [puppet] - 10https://gerrit.wikimedia.org/r/309610 (owner: 10Alexandros Kosiaris) [13:51:32] (03CR) 10jenkins-bot: [V: 04-1] (WIP) puppetmaster: Experiment in defining a gitpuppet group [puppet] - 10https://gerrit.wikimedia.org/r/309992 (owner: 10Alexandros Kosiaris) [13:52:43] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2628562 (10elukey) @greg, @hashar - I just created deployment-mediawiki04 in deployment-prep and the VCPU quota is now maxed out (there were only 4 VCPUs left). Do we need... [13:55:45] RECOVERY - mediawiki-installation DSH group on mw2198 is OK: OK [13:55:56] RECOVERY - mediawiki-installation DSH group on mw2199 is OK: OK [13:56:10] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2628588 (10hashar) We had two 8 CPU / 16 G instances created to migrate the databases to Jessie T138778 that is scheduled for Thursday. Once migrated I guess they will be del... [13:57:35] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2628594 (10Gilles) [13:57:38] 06Operations, 06Performance-Team, 10Thumbor: SVG type check too slow in production - https://phabricator.wikimedia.org/T145377#2628592 (10Gilles) 05Open>03Resolved [14:03:25] (03PS3) 10Giuseppe Lavagetto: Change-Prop: Use Scap3 for config deploys [puppet] - 10https://gerrit.wikimedia.org/r/309979 (https://phabricator.wikimedia.org/T144595) (owner: 10Mobrovac) [14:03:42] <_joe_> mobrovac: did you bring the code up on tin? 
[14:03:57] <_joe_> mobrovac: changeprop will be easier as it will not restart automagically [14:04:58] _joe_: yup [14:05:18] <_joe_> mobrovac: if the code is up, I'll merge and run puppet on the hosts [14:05:27] yup, let's do it! [14:05:33] (03CR) 10Giuseppe Lavagetto: [C: 032] Change-Prop: Use Scap3 for config deploys [puppet] - 10https://gerrit.wikimedia.org/r/309979 (https://phabricator.wikimedia.org/T144595) (owner: 10Mobrovac) [14:06:24] (03PS1) 10BBlack: cache_upload: restore normal inter-DC routing [puppet] - 10https://gerrit.wikimedia.org/r/309993 (https://phabricator.wikimedia.org/T131502) [14:07:56] <_joe_> mobrovac: running puppet on all nodes now [14:08:07] kk, let's see :) [14:09:12] <_joe_> mobrovac: done [14:09:17] all good? [14:09:41] i'll do a deploy from tin _joe_ [14:10:01] <_joe_> ok [14:10:15] !log change-prop deploying 404b07c to enable scap config deploys [14:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:13:05] <_joe_> mobrovac: seems legit on 2002 [14:13:40] (03PS1) 10Filippo Giunchedi: install_server: use separate /srv for bastions [puppet] - 10https://gerrit.wikimedia.org/r/309995 [14:13:42] (03PS1) 10Filippo Giunchedi: site: install prometheus server in esams and ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/309996 (https://phabricator.wikimedia.org/T126785) [14:14:01] _joe_: yup, looking good in eqiad too! [14:14:15] yuhuu [14:14:17] <_joe_> mobrovac: great [14:14:22] thnx _joe_! 
[14:14:27] (03PS12) 10Volans: Automation: automatically reimage host [puppet] - 10https://gerrit.wikimedia.org/r/308520 (https://phabricator.wikimedia.org/T143536) [14:16:56] (03CR) 10Volans: "@moritz @elukey:" [puppet] - 10https://gerrit.wikimedia.org/r/308520 (https://phabricator.wikimedia.org/T143536) (owner: 10Volans) [14:17:20] <_joe_> volans: if you want I can write the conftool integration for you :) [14:17:46] 06Operations: Install puppetDB at WMF - https://phabricator.wikimedia.org/T139476#2628712 (10Joe) [14:17:48] (03PS1) 10Gilles: Upgrade upstream code to 0.1.18 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/309998 [14:17:48] 06Operations, 13Patch-For-Review: Create replacement for our scripts that depend on exported resources - https://phabricator.wikimedia.org/T142846#2628711 (10Joe) 05Open>03Resolved [14:18:01] (03PS2) 10Alexandros Kosiaris: (WIP) puppetmaster: Experiment in defining a gitpuppet group [puppet] - 10https://gerrit.wikimedia.org/r/309992 [14:18:02] 06Operations, 13Patch-For-Review: Create replacement for our scripts that depend on exported resources - https://phabricator.wikimedia.org/T142846#2548142 (10Joe) [14:18:05] (03PS1) 10Elukey: Add mediawiki04 to the list of labs appservers in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/309999 (https://phabricator.wikimedia.org/T144006) [14:19:24] (03PS1) 10Gilles: Upgrade to 0.1.18 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/310000 [14:19:34] _joe_: ofc feel free, but that part works fine; more steps are needed IMHO (bugs/improvements in icinga downtime and wmf_reimage scripts, missing steps, etc.) 
[14:19:46] if you want to spend some time on this ;) [14:19:47] (03PS3) 10Alexandros Kosiaris: (WIP) puppetmaster: Experiment in defining a gitpuppet group [puppet] - 10https://gerrit.wikimedia.org/r/309992 [14:21:30] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade upstream code to 0.1.18 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/309998 (owner: 10Gilles) [14:21:35] PROBLEM - Varnishkafka log producer on cp1048 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [14:22:22] (03CR) 10Filippo Giunchedi: "recheck" [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/310000 (owner: 10Gilles) [14:22:59] PROBLEM - Varnishkafka log producer on cp1049 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [14:23:50] yeah it says Sep 12 14:15:24 cp1048 varnishkafka[2621]: VSLQ_Dispatch: Varnish Log abandoned or overrun. [14:24:23] 06Operations, 13Patch-For-Review: Remove mw2061-mw2074 - https://phabricator.wikimedia.org/T144745#2628723 (10mark) Yes, these should be decommissioned as they're very old. 
[14:24:52] but cp104[89] seemed done a while a go [14:24:53] *ago [14:25:32] ema --^ [14:25:32] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2628724 (10fgiunchedi) [14:26:02] (03CR) 10Alexandros Kosiaris: "PCC seems to be happy at https://puppet-compiler.wmflabs.org/4051/" [puppet] - 10https://gerrit.wikimedia.org/r/309992 (owner: 10Alexandros Kosiaris) [14:27:02] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/309992 (owner: 10Alexandros Kosiaris) [14:27:47] (03CR) 10Alexandros Kosiaris: [C: 031] site: install prometheus server in esams and ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/309996 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [14:28:30] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 0.1.18 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/310000 (owner: 10Gilles) [14:28:49] (03PS1) 10Muehlenhoff: Remove mw2061-mw2074 from DNS [dns] - 10https://gerrit.wikimedia.org/r/310002 (https://phabricator.wikimedia.org/T144745) [14:29:47] RECOVERY - Varnishkafka log producer on cp1048 is OK: PROCS OK: 1 process with command name varnishkafka [14:30:40] (03CR) 10Mobrovac: [C: 04-1] "Obsoleted by I30a7a30cc884f1a97744b4142462985e9f8d4fa1" [puppet] - 10https://gerrit.wikimedia.org/r/309377 (owner: 10Ppchelko) [14:31:43] (03Abandoned) 10Ppchelko: Change-Prop: Bump transcludes concurrency once again. [puppet] - 10https://gerrit.wikimedia.org/r/309377 (owner: 10Ppchelko) [14:32:18] elukey: vk restarted by hand on cp1049, did you fix 1048? 
[14:32:35] nope I was waiting for you before taking any action [14:32:45] (03PS3) 10Alexandros Kosiaris: puppetmaster: recurse on /srv/private permissions [puppet] - 10https://gerrit.wikimedia.org/r/309610 [14:32:47] (03PS2) 10Alexandros Kosiaris: puppetmaster: Fixes to the post-receive hook on a frontend [puppet] - 10https://gerrit.wikimedia.org/r/309609 [14:32:49] (03PS4) 10Alexandros Kosiaris: puppetmaster: Experiment in defining a gitpuppet group [puppet] - 10https://gerrit.wikimedia.org/r/309992 [14:33:19] ema: systemd [14:33:33] just checked the logs [14:33:47] RECOVERY - Varnishkafka log producer on cp1049 is OK: PROCS OK: 1 process with command name varnishkafka [14:34:58] (03CR) 10Ema: [C: 031] cache_upload: restore normal inter-DC routing [puppet] - 10https://gerrit.wikimedia.org/r/309993 (https://phabricator.wikimedia.org/T131502) (owner: 10BBlack) [14:35:42] (03PS4) 10Alexandros Kosiaris: puppetmaster: recurse on /srv/private permissions [puppet] - 10https://gerrit.wikimedia.org/r/309610 [14:35:44] (03PS3) 10Alexandros Kosiaris: puppetmaster: Fixes to the post-receive hook on a frontend [puppet] - 10https://gerrit.wikimedia.org/r/309609 [14:35:46] (03PS2) 10Alexandros Kosiaris: puppetmaster: Ship a gitconfig file [puppet] - 10https://gerrit.wikimedia.org/r/309608 [14:35:48] (03PS5) 10Alexandros Kosiaris: puppetmaster: Experiment in defining a gitpuppet group [puppet] - 10https://gerrit.wikimedia.org/r/309992 [14:36:30] 06Operations, 10Cassandra, 06Services: restbase2004.codfw.wmnet data corruption - https://phabricator.wikimedia.org/T144826#2628746 (10Eevans) [14:36:58] !log T144826: Removing compaction rate limit, increasing compactor threads (from 10 to 20), and beginning scrub of local_group_wikipedia_T_parsoid_html.data (restbase2004-b.codfw.wmnet) [14:36:59] T144826: restbase2004.codfw.wmnet data corruption - https://phabricator.wikimedia.org/T144826 [14:37:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master 
[14:38:03] !log change-prop deploying 5d5d39e [14:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:38:37] !log powering down mw2017 for hardware maintenance [14:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:38:43] (03CR) 10Elukey: [C: 031] Remove mw2061-mw2074 from DNS [dns] - 10https://gerrit.wikimedia.org/r/310002 (https://phabricator.wikimedia.org/T144745) (owner: 10Muehlenhoff) [14:39:06] 06Operations, 10ops-eqiad: upgrade package_builder machine with SSD - https://phabricator.wikimedia.org/T130759#2628750 (10fgiunchedi) a:03Cmjohnson moving to @Cmjohnson , the data on copper is fairly volatile so I _think_ it could be just reimaged. Though if there's slots alongside the existing disks could... [14:39:58] 06Operations, 10Cassandra, 06Services: restbase2004.codfw.wmnet data corruption - https://phabricator.wikimedia.org/T144826#2628754 (10fgiunchedi) I've checked via the syslog server for reoccurence of this error in dmesg and only restbase2004 has a non-inquiry CDB reported ``` $ fgrep restbase /tmp/hpsa_sys... [14:41:50] 06Operations, 10ops-eqiad: upgrade package_builder machine with SSD - https://phabricator.wikimedia.org/T130759#2628757 (10MoritzMuehlenhoff) Hmm, when reimaging, we'd lose all the builds from /var/cache/pbuilder/result (and not all of those are necessarily imported to carbon directly). It's probably better to... [14:44:09] hashar: As stephanebisson said, ModerationController is invoked infrequently enough (and in DeferredUpdates most of the time) that I'm not worried. Also this is right after writing to the master, and most calls to it already pass DB_MASTER anyway [14:44:09] 06Operations, 10ops-eqiad: upgrade package_builder machine with SSD - https://phabricator.wikimedia.org/T130759#2628760 (10akosiaris) Em.. are those considered data we should not be deleting ? Cause I personally have deleted almost everything once or twice. 
[14:45:10] (03PS3) 10Alexandros Kosiaris: puppetmaster: Ship a gitconfig file [puppet] - 10https://gerrit.wikimedia.org/r/309608 [14:45:14] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppetmaster: Ship a gitconfig file [puppet] - 10https://gerrit.wikimedia.org/r/309608 (owner: 10Alexandros Kosiaris) [14:45:30] (03PS4) 10Alexandros Kosiaris: puppetmaster: Fixes to the post-receive hook on a frontend [puppet] - 10https://gerrit.wikimedia.org/r/309609 [14:45:34] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppetmaster: Fixes to the post-receive hook on a frontend [puppet] - 10https://gerrit.wikimedia.org/r/309609 (owner: 10Alexandros Kosiaris) [14:45:57] (03PS5) 10Alexandros Kosiaris: puppetmaster: recurse on /srv/private permissions [puppet] - 10https://gerrit.wikimedia.org/r/309610 [14:46:01] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppetmaster: recurse on /srv/private permissions [puppet] - 10https://gerrit.wikimedia.org/r/309610 (owner: 10Alexandros Kosiaris) [14:46:58] RoanKattouw: AaronSchulz also commented on the ticket (https://gerrit.wikimedia.org/r/#/c/309623/). I've signed you up for evening SWAT if it's ok with you. [14:47:26] OK looking [14:48:24] ACKNOWLEDGEMENT - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is CRITICAL: Connection refused eevans Corruption wack-a-mole (see: https://phabricator.wikimedia.org) [14:48:30] OK Aaron's suggestion sounds reasonable to me [14:48:51] RoanKattouw: good morning! 
[14:48:59] I don't know enough to know how to turn it into code, but if you do, go ahead and do it [14:49:01] I have played it safe, not really knowing what will happen : [14:49:37] (03PS6) 10Alexandros Kosiaris: puppetmaster: Define a gitpuppet group [puppet] - 10https://gerrit.wikimedia.org/r/309992 [14:49:45] 06Operations, 10ops-eqiad: upgrade package_builder machine with SSD - https://phabricator.wikimedia.org/T130759#2628795 (10MoritzMuehlenhoff) /var/cache/pbuilder/result is certainly not long-term storage, but in my case it frequently contains builds, which need some time to test before they can be merged on ca... [14:49:50] Good call, I'd forgotten this one was in a loop [14:50:08] also stephane noticed the current patch is in master [14:50:20] 06Operations, 10ops-eqiad: upgrade package_builder machine with SSD - https://phabricator.wikimedia.org/T130759#2628797 (10fgiunchedi) Generally I'm rsync'ing the built packages to carbon and uploading there, plus some source code/packages in my /home which can be removed at will. [14:50:25] so it is going to be pushed to prod this week. [14:50:43] but I guess you can get a patch before that lands on group1 / group2 :] [14:51:21] RoanKattouw, hashar: I'm on it [14:51:29] !log depool cp4015, restart and repool cp4006's backend [14:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:51:33] (03PS1) 10Alexandros Kosiaris: puppetmaster: Fix typo for .git/config in private repo [puppet] - 10https://gerrit.wikimedia.org/r/310007 [14:52:16] 06Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2628801 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff All mw* systems in codfw have been fixed by Papaul, closing the ticket. 
[14:52:20] stephanebisson: then swat as needed :] And if needed, whatever patch you come up with can be cherry picked to wmf.19 which will be cut tomorrow [14:56:18] PROBLEM - puppet last run on puppetmaster2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/private/.git/config] [14:56:24] (03CR) 10Elukey: "It looks good to me except a minor style comment, but I am a bit ignorant about the change and its repercussions. Maybe we can chat a bit " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861) (owner: 10Jstenval) [14:57:33] (03CR) 10Giuseppe Lavagetto: [C: 031] puppetmaster: Define a gitpuppet group [puppet] - 10https://gerrit.wikimedia.org/r/309992 (owner: 10Alexandros Kosiaris) [14:58:03] 06Operations, 10ops-eqiad: upgrade package_builder machine with SSD - https://phabricator.wikimedia.org/T130759#2628818 (10fgiunchedi) >>! In T130759#2628795, @MoritzMuehlenhoff wrote: > Personally I also have quite some patches/test builds/sources in my home which I'd want to save before a reimage and others... [15:03:17] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/srv/private/.git/config],Etcd_role[conftool],Etcd_user[conftool] [15:03:47] (03PS1) 10Gilles: Fix SLOW_PROCESSING_LIMIT [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/310015 [15:04:15] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium): Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T145057#2628851 (10hashar) [15:05:48] (03PS7) 10Alexandros Kosiaris: puppetmaster: Define a gitpuppet group [puppet] - 10https://gerrit.wikimedia.org/r/309992 [15:05:52] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppetmaster: Define a gitpuppet group [puppet] - 10https://gerrit.wikimedia.org/r/309992 (owner: 10Alexandros Kosiaris) [15:06:00] (03PS2) 10Alexandros Kosiaris: puppetmaster: Fix typo for .git/config in private repo [puppet] - 10https://gerrit.wikimedia.org/r/310007 [15:06:03] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppetmaster: Fix typo for .git/config in private repo [puppet] - 10https://gerrit.wikimedia.org/r/310007 (owner: 10Alexandros Kosiaris) [15:09:30] RECOVERY - puppet last run on puppetmaster2001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [15:10:56] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:14:22] (03PS1) 10BBlack: 1.4: bugfix: must handle NULL ip_string in v4 [software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/310019 [15:14:47] (03PS1) 10Alexandros Kosiaris: puppetmaster: Add datacenter-ops on frontends [puppet] - 10https://gerrit.wikimedia.org/r/310020 [15:15:00] (03PS2) 10Alexandros Kosiaris: puppetmaster: Add datacenter-ops on frontends [puppet] - 10https://gerrit.wikimedia.org/r/310020 [15:15:04] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] puppetmaster: Add datacenter-ops on frontends [puppet] - 10https://gerrit.wikimedia.org/r/310020 (owner: 10Alexandros Kosiaris) [15:15:32] 
06Operations, 10Traffic, 10Continuous-Integration-Infrastructure (phase-out-gallium): Move gallium to an internal host? - https://phabricator.wikimedia.org/T133150#2628884 (10hashar) [15:15:57] !log drain and restart cassandra instances on restbase2001 with new CA - T143044 [15:15:58] T143044: Renew RESTBase self-signed root certificate authority - https://phabricator.wikimedia.org/T143044 [15:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:16:22] (03PS2) 10Madhuvishy: nfsclient: Create /data/scratch symlink only if mount is present [puppet] - 10https://gerrit.wikimedia.org/r/308941 [15:16:31] (03PS1) 10Mobrovac: Citoid: Use Scap3 for config deploys [puppet] - 10https://gerrit.wikimedia.org/r/310021 (https://phabricator.wikimedia.org/T144597) [15:16:45] PROBLEM - Disk space on scb1002 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=87%) [15:17:06] PROBLEM - Host ms-be1022 is DOWN: PING CRITICAL - Packet loss = 100% [15:17:26] akosiaris: ^^^ ? [15:17:49] mobrovac: ? [15:18:01] FYI daemon.log.1 is 2.6GB on scb1002 [15:18:03] akosiaris: disk space on scb1002 critical [15:18:08] 3.6GB sorry [15:18:09] ah there we go [15:18:11] yeah, known.. [15:18:19] it's damn celery [15:18:20] (03PS8) 10Eevans: Simplification of Cassandra Logstash filtering [puppet] - 10https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861) (owner: 10Jstenval) [15:18:28] I have an open task to fix it with research [15:18:36] (03CR) 10Eevans: Simplification of Cassandra Logstash filtering (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861) (owner: 10Jstenval) [15:18:55] I'll just gzip the logs for now [15:19:28] akosiaris: you might not have enough space in / [15:19:56] daemon.log.3.gz is 304M, I don't know how much it was before compression ;) [15:20:17] probably 3.6GB .... 
[15:20:22] or something close to that [15:20:25] lol [15:20:35] :D [15:21:24] so, ORES celery workers log to stdout/stderr, hence the issue [15:21:57] RECOVERY - Disk space on scb1002 is OK: DISK OK [15:22:39] akosiaris: why is ores not logging stuff to /srv ? [15:22:49] on second thought, given the amount of logs, perhaps better [15:22:50] :P [15:23:04] (03PS2) 10Elukey: Add mediawiki04 to the list of labs appservers in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/309999 (https://phabricator.wikimedia.org/T144006) [15:23:07] mobrovac: cause we haven't done it yet, remains to be done [15:23:26] akosiaris, didn't know this was an issue. [15:23:40] halfak: yeah I discovered just last week [15:23:46] If you file a task that explains what you think should be done, we'll prioritize it. [15:23:46] RECOVERY - Host ms-be1022 is UP: PING OK - Packet loss = 0%, RTA = 3.34 ms [15:23:51] do we need that level of logging? [15:23:51] halfak: ok thanks [15:24:14] volans: it probably wouldn't hurt [15:24:24] Most likely, we'll be able to remove some of the messages from the logs. We did a lot of that earlier in ORES' life. [15:24:26] space in /srv is sufficient [15:24:38] (probably). famous last words [15:24:40] I'm guessing that there are timeout errors making it into the logs and they ought not to. [15:25:17] halfak: no, it's INFO level messages from all workers [15:25:22] things like [15:25:29] Sep 12 15:25:23 scb1002 celery[17515]: [2016-09-12 15:25:23,638: INFO/Worker-101285] Looking up reverted in 1a27cce0-da42-43a4-9244-fa83f6aa574b [15:25:41] Sep 12 15:25:23 scb1002 celery[17515]: 2016-09-12 15:25:23,639 INFO:ores.scoring_systems.celery_queue -- Found reverted in 1a27cce0-da42-43a4-9244-fa83f6aa574b! [15:25:47] etc [15:26:06] Woah. We should not be logging INFO! 
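Editor's note on the exchange above: the disk filled because every worker emitted INFO-level lines to stdout/stderr, which ended up in daemon.log. A minimal sketch of the kind of fix being discussed, assuming only Python's standard `logging` module; the logger name `example.scoring` and the helper `build_logger` are hypothetical and are not taken from ORES or Celery:

```python
import io
import logging


def build_logger(name: str, level: int, stream) -> logging.Logger:
    """Return a logger writing to `stream` that drops records below `level`."""
    logger = logging.getLogger(name)
    logger.handlers.clear()      # start from a clean slate for this sketch
    logger.propagate = False     # keep records away from the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s:%(name)s -- %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(level)       # records below this level are discarded
    return logger


# An in-memory buffer stands in for stdout/stderr so the effect is visible.
buffer = io.StringIO()
log = build_logger("example.scoring", logging.WARNING, buffer)

log.info("Looking up reverted in 1a27cce0")   # dropped: below WARNING
log.warning("Timed out scoring 1a27cce0")     # kept

captured = buffer.getvalue()
```

With the threshold at WARNING, the per-request "Looking up ..." and "Found ..." style messages never reach the handler, so they would not accumulate in the system log. How the real service configures its loggers may differ from this sketch.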
[15:28:30] (03PS1) 10BBlack: Revert "upload VCL: do not cache objects with CL:0 and status 200" [puppet] - 10https://gerrit.wikimedia.org/r/310023 [15:28:50] (03CR) 10Ema: [C: 031] 1.4: bugfix: must handle NULL ip_string in v4 [software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/310019 (owner: 10BBlack) [15:29:28] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 10Wikimedia-Logstash: Upgrade elasticsearch to 2.4.0 - https://phabricator.wikimedia.org/T145058#2628978 (10Gehel) 05Open>03declined Elasticsearch 2.4.0 has a bug that prevents us from upgrading. We will analyze the next release when ava... [15:29:36] PROBLEM - swift-account-reaper on ms-be1022 is CRITICAL: Connection refused by host [15:29:36] PROBLEM - Disk space on ms-be1022 is CRITICAL: Connection refused by host [15:29:36] PROBLEM - Check size of conntrack table on ms-be1022 is CRITICAL: Connection refused by host [15:29:47] PROBLEM - swift-container-updater on ms-be1022 is CRITICAL: Connection refused by host [15:29:49] PROBLEM - swift-object-server on ms-be1022 is CRITICAL: Connection refused by host [15:30:04] godog: ^^^ [15:30:05] PROBLEM - swift-account-replicator on ms-be1022 is CRITICAL: Connection refused by host [15:30:17] PROBLEM - swift-account-server on ms-be1022 is CRITICAL: Connection refused by host [15:30:18] PROBLEM - swift-object-auditor on ms-be1022 is CRITICAL: Connection refused by host [15:30:27] 06Operations, 10CirrusSearch, 06Discovery, 06Discovery-Search (Current work): Upgrade elasticsearch and plugins to 2.3.5 - https://phabricator.wikimedia.org/T145404#2628996 (10Gehel) [15:30:40] PROBLEM - swift-account-auditor on ms-be1022 is CRITICAL: Connection refused by host [15:30:46] PROBLEM - swift-container-auditor on ms-be1022 is CRITICAL: Connection refused by host [15:30:50] volans: thanks, silenced, I guess cmjohnson1 is working on it [15:30:56] PROBLEM - swift-object-updater on ms-be1022 is CRITICAL: Connection refused by host [15:30:56] 
PROBLEM - MD RAID on ms-be1022 is CRITICAL: Connection refused by host [15:30:56] PROBLEM - very high load average likely xfs on ms-be1022 is CRITICAL: Connection refused by host [15:31:06] PROBLEM - SSH on ms-be1022 is CRITICAL: Connection refused [15:31:06] PROBLEM - HP RAID on ms-be1022 is CRITICAL: Connection refused by host [15:31:17] yes I am working on it...godog...sorry forgot to silence [15:31:30] HP and their logs that do not exist [15:31:36] I am rebooting now [15:31:40] it should come back in a minute [15:31:52] hehe np cmjohnson1, I've silenced it until tomorrow [15:33:00] (03PS1) 10Gilles: Clean up Thumbor configuration [puppet] - 10https://gerrit.wikimedia.org/r/310024 [15:33:27] (03CR) 10jenkins-bot: [V: 04-1] Clean up Thumbor configuration [puppet] - 10https://gerrit.wikimedia.org/r/310024 (owner: 10Gilles) [15:33:56] (03PS1) 10BBlack: VCL: workaround netmapper-1.3 NULL ip_string [puppet] - 10https://gerrit.wikimedia.org/r/310025 [15:34:26] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, 03Discovery-Wikidata-Query-Service-Sprint: some icinga checks on WDQS do not send notifications - https://phabricator.wikimedia.org/T144948#2629013 (10Gehel) [15:34:26] (03CR) 10BBlack: [C: 032] Revert "upload VCL: do not cache objects with CL:0 and status 200" [puppet] - 10https://gerrit.wikimedia.org/r/310023 (owner: 10BBlack) [15:34:57] (03CR) 10BBlack: [C: 032 V: 032] VCL: workaround netmapper-1.3 NULL ip_string [puppet] - 10https://gerrit.wikimedia.org/r/310025 (owner: 10BBlack) [15:35:51] (03CR) 10Mobrovac: [C: 031] "Checked in Beta, works." 
[puppet] - 10https://gerrit.wikimedia.org/r/310021 (https://phabricator.wikimedia.org/T144597) (owner: 10Mobrovac) [15:36:30] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work): Decrease time required to fully restart the Cirrus elasticsearch clusters - https://phabricator.wikimedia.org/T145065#2629018 (10Gehel) [15:36:44] (03CR) 10Ema: [C: 032 V: 032] 1.4: bugfix: must handle NULL ip_string in v4 [software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/310019 (owner: 10BBlack) [15:37:30] Is there a database problem? [15:37:35] RECOVERY - swift-account-reaper on ms-be1022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:37:36] RECOVERY - Disk space on ms-be1022 is OK: DISK OK [15:37:38] (03PS2) 10Gilles: Clean up Thumbor configuration [puppet] - 10https://gerrit.wikimedia.org/r/310024 [15:37:39] Nevermind. Page is just loading slow [15:37:46] RECOVERY - Check size of conntrack table on ms-be1022 is OK: OK: nf_conntrack is 0 % full [15:37:46] RECOVERY - swift-container-updater on ms-be1022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [15:37:47] RECOVERY - swift-object-server on ms-be1022 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [15:37:56] RECOVERY - swift-account-replicator on ms-be1022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [15:38:07] RECOVERY - swift-account-server on ms-be1022 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [15:38:16] RECOVERY - swift-object-auditor on ms-be1022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [15:38:26] RECOVERY - swift-account-auditor on ms-be1022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [15:38:37] RECOVERY - swift-container-auditor on ms-be1022 is OK: 
PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:38:46] RECOVERY - swift-object-updater on ms-be1022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [15:38:46] RECOVERY - MD RAID on ms-be1022 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [15:38:46] RECOVERY - very high load average likely xfs on ms-be1022 is OK: OK - load average: 1.98, 1.71, 0.71 [15:38:56] RECOVERY - SSH on ms-be1022 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [15:39:15] RECOVERY - HP RAID on ms-be1022 is OK: OK: Slot 3: OK: 2I:4:2, 2I:4:1, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [15:39:18] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. [15:39:31] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work): Decrease time required to fully restart the Cirrus elasticsearch clusters - https://phabricator.wikimedia.org/T145065#2629031 (10Gehel) a:03Gehel [15:41:21] !log roll-restart cassandra in codfw with new CA and certs T143044 [15:41:22] T143044: Renew RESTBase self-signed root certificate authority - https://phabricator.wikimedia.org/T143044 [15:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:41:55] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [15:41:56] (03PS1) 10Giuseppe Lavagetto: puppetmaster: rsync volatile and ca dirs between puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/310026 [15:42:27] (03PS2) 10Giuseppe Lavagetto: puppetmaster: rsync volatile and ca dirs between puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/310026 [15:43:38] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: rsync volatile and ca dirs between puppetmasters 
[puppet] - 10https://gerrit.wikimedia.org/r/310026 (owner: 10Giuseppe Lavagetto) [15:44:26] <_joe_> akosiaris: ^^ [15:45:37] 06Operations, 10Ops-Access-Requests: Requesting access to stat1003, stat1002, and fluorine for MelodyKramer - https://phabricator.wikimedia.org/T145387#2629052 (10Deskana) Approved. I've discussed this with Mel and can vouch for her request. [15:46:06] (03PS2) 10BBlack: cache_upload: restore normal inter-DC routing [puppet] - 10https://gerrit.wikimedia.org/r/309993 (https://phabricator.wikimedia.org/T131502) [15:46:37] (03CR) 10BBlack: [C: 032 V: 032] cache_upload: restore normal inter-DC routing [puppet] - 10https://gerrit.wikimedia.org/r/309993 (https://phabricator.wikimedia.org/T131502) (owner: 10BBlack) [15:47:14] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Make puppet generate path config for WDQS nodes - https://phabricator.wikimedia.org/T144537#2629062 (10Smalyshev) 05Open>03Resolved [15:47:17] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Move data storage to /srv/wdqs/ on codfw WDQS nodes - https://phabricator.wikimedia.org/T144536#2629063 (10Smalyshev) [15:47:38] (03CR) 10Alexandros Kosiaris: [C: 04-1] "looks good, comments inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/310026 (owner: 10Giuseppe Lavagetto) [15:56:10] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 13Patch-For-Review: Elasticsearch logs are not send to logstash after 2.3.3 upgrade - https://phabricator.wikimedia.org/T136696#2629098 (10debt) Moving to the backlog at this time to tackle later on [15:58:59] (03PS1) 10Cmjohnson: Removing both mgmt and production dns entries for decommissioned apache servers mw1131-1151 and mw1153-mw1160. All server have been wiped and powered off. [dns] - 10https://gerrit.wikimedia.org/r/310030 [16:00:04] greg-g: Dear anthropoid, the time has come. 
Please deploy Test long running operation (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160912T1600). [16:03:03] oh right [16:03:06] jouncebot: now [16:03:06] For the next 2 hour(s) and 56 minute(s): Test long running operation (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160912T1600) [16:03:17] * greg-g reminds self to test after team meeting [16:07:53] (03PS1) 10Volans: Git: Add username into commit message in private repo [puppet] - 10https://gerrit.wikimedia.org/r/310031 [16:10:07] akosiaris, _joe_ ^^^ just a quick proposal, feel free to trash it if you think doesn't fit ;) [16:11:51] volans: \o/ that's bugged me for so long [16:14:40] (03CR) 10Giuseppe Lavagetto: [C: 031] "let's merge this fix, I will work later on using properly" [puppet] - 10https://gerrit.wikimedia.org/r/309929 (https://phabricator.wikimedia.org/T144703) (owner: 10Alex Monk) [16:15:03] (03CR) 10Alexandros Kosiaris: "nice!I was contemplating populating .git/config files for everyone to achieve the same thing, this is nicer. Comments inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/310031 (owner: 10Volans) [16:15:14] (03CR) 10Alexandros Kosiaris: [C: 04-1] Git: Add username into commit message in private repo [puppet] - 10https://gerrit.wikimedia.org/r/310031 (owner: 10Volans) [16:15:27] volans: thanks. I 've commented on it [16:16:38] (03PS2) 10Volans: Git: Add username into commit message in private repo [puppet] - 10https://gerrit.wikimedia.org/r/310031 [16:16:44] akosiaris: I was modifying it, reading now :D [16:17:17] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [16:18:58] (03PS3) 10Volans: Git: Add username into commit message in private repo [puppet] - 10https://gerrit.wikimedia.org/r/310031 [16:20:16] (03CR) 10Volans: "agree for the -e but if the exit status it != 0 the commit will be blocked, not sure is what we want here." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/310031 (owner: 10Volans) [16:20:20] (03Abandoned) 10Elukey: Add mediawiki04 to the list of labs appservers in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/309999 (https://phabricator.wikimedia.org/T144006) (owner: 10Elukey) [16:23:06] (03PS1) 10Elukey: Add deployment-mediawiki04 to the deployment-prep scap dsh [puppet] - 10https://gerrit.wikimedia.org/r/310034 (https://phabricator.wikimedia.org/T144006) [16:23:08] (03PS1) 10Elukey: Add deployment-mediawiki04 to the deployment-prep Varnish config [puppet] - 10https://gerrit.wikimedia.org/r/310035 (https://phabricator.wikimedia.org/T144006) [16:23:52] akosiaris: let me know if I should fix the mode for the post-commit too [16:28:53] 06Operations, 06Labs, 10Labs-Infrastructure, 10Traffic, and 2 others: update *.wmflabs.org certificate (existing expires on 2016-09-16) - https://phabricator.wikimedia.org/T145120#2629309 (10chasemp) a:05chasemp>03yuvipanda [16:29:49] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [1000.0] [16:30:12] !log wiping/repooling cp4015 [16:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:31:58] (03CR) 10Alexandros Kosiaris: Git: Add username into commit message in private repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/310031 (owner: 10Volans) [16:35:14] (03PS4) 10Volans: Git: Add username into commit message in private repo [puppet] - 10https://gerrit.wikimedia.org/r/310031 [16:35:44] (03CR) 10Volans: Git: Add username into commit message in private repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/310031 (owner: 10Volans) [16:37:30] 06Operations, 10Beta-Cluster-Infrastructure, 07HHVM, 13Patch-For-Review: Move the MW Beta appservers to Debian - https://phabricator.wikimedia.org/T144006#2629355 (10hashar) https://wikitech.wikimedia.org/wiki/HHVM/Troubleshooting has some interesting 
bits furl http://en.wikipedia.beta.wmflabs.org/wiki/M... [16:37:49] (03CR) 10Alexandros Kosiaris: "I meant -e as a sed argument (not the shell). Which probably has nothing to do with exit code != 0. But on that subject, I think it's fine" [puppet] - 10https://gerrit.wikimedia.org/r/310031 (owner: 10Volans) [16:37:51] (03PS1) 10BBlack: upload VCL: no probes for be->be [puppet] - 10https://gerrit.wikimedia.org/r/310037 [16:38:08] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:38:47] (03CR) 10BBlack: [C: 032 V: 032] upload VCL: no probes for be->be [puppet] - 10https://gerrit.wikimedia.org/r/310037 (owner: 10BBlack) [16:43:55] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Sudo access to the Analytics Druid cluster for the Analytics team - https://phabricator.wikimedia.org/T144726#2608377 (10MoritzMuehlenhoff) Approved in today's Ops meeting. [16:44:26] (03CR) 10Hashar: [C: 031] "That should update the dsh files that scap is relying on. Then we can trigger the Jenkins job on https://integration.wikimedia.org/ci/view" [puppet] - 10https://gerrit.wikimedia.org/r/310034 (https://phabricator.wikimedia.org/T144006) (owner: 10Elukey) [16:45:06] (03PS1) 10DCausse: Upgrade elasticsearch plugins to 2.3.5 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/310038 (https://phabricator.wikimedia.org/T145404) [16:45:27] (03CR) 10DCausse: [C: 04-1] "archiva not yet updated" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/310038 (https://phabricator.wikimedia.org/T145404) (owner: 10DCausse) [16:45:47] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1000.0] [16:45:59] (03CR) 10Hashar: [C: 031] "On beta there is no LVS/Pybal/Conftool. 
The mw app servers are listed in the Varnish backend configuration as directors, IIRC varnish jus" [puppet] - 10https://gerrit.wikimedia.org/r/310035 (https://phabricator.wikimedia.org/T144006) (owner: 10Elukey) [16:49:15] (03PS3) 10Giuseppe Lavagetto: puppetmaster: rsync volatile and ca dirs between puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/310026 [16:49:17] (03PS5) 10Volans: Git: Add username into commit message in private repo [puppet] - 10https://gerrit.wikimedia.org/r/310031 [16:50:20] (03CR) 10Volans: "@akosiarias: added -e to sed also" [puppet] - 10https://gerrit.wikimedia.org/r/310031 (owner: 10Volans) [16:50:37] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: rsync volatile and ca dirs between puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/310026 (owner: 10Giuseppe Lavagetto) [16:52:40] 06Operations, 10ops-eqiad: Decomission mw1153-mw1160 - https://phabricator.wikimedia.org/T141352#2629479 (10Cmjohnson) 05Open>03Resolved all servers wiped and removed from rack and racktables, dns entries, removed, added to decom spreadsheet. [16:54:04] 06Operations, 10hardware-requests: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2629488 (10Cmjohnson) [16:55:58] (03PS1) 10Hashar: Revert "contint:firewall: let phabricator talk to gearman" [puppet] - 10https://gerrit.wikimedia.org/r/310039 (https://phabricator.wikimedia.org/T137323) [16:56:41] (03CR) 10Cmjohnson: [C: 032] Removing both mgmt and production dns entries for decommissioned apache servers mw1131-1151 and mw1153-mw1160. All server have been wiped an [dns] - 10https://gerrit.wikimedia.org/r/310030 (owner: 10Cmjohnson) [16:57:09] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Migrate CI services from gallium to contint1001 - https://phabricator.wikimedia.org/T137358#2629559 (10hashar) [16:58:14] (03CR) 10Alexandros Kosiaris: [C: 04-1] "One inline comment, otherwise LGTM and good to merge. Thanks!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/310031 (owner: 10Volans) [17:00:04] gehel: Respected human, time to deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160912T1700). Please do the needful. [17:00:04] gehel: A patch you scheduled for Weekly Wikidata query service deployment window is about to be deployed. Please be available during the process. [17:00:43] (03PS1) 10BBlack: Revert "upload VCL: no probes for be->be" [puppet] - 10https://gerrit.wikimedia.org/r/310041 [17:00:57] (03CR) 10BBlack: [C: 032 V: 032] Revert "upload VCL: no probes for be->be" [puppet] - 10https://gerrit.wikimedia.org/r/310041 (owner: 10BBlack) [17:00:59] (03PS6) 10Volans: Git: Add username into commit message in private repo [puppet] - 10https://gerrit.wikimedia.org/r/310031 [17:01:12] (03CR) 10Volans: Git: Add username into commit message in private repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/310031 (owner: 10Volans) [17:01:44] scheduling conflict for WDQS deployment, it will be delayed by 30 minutes [17:01:52] SMalyshev: ^ [17:01:53] (03CR) 10Alexandros Kosiaris: "minor comments inline, otherwise LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/310026 (owner: 10Giuseppe Lavagetto) [17:03:10] jouncebot: now [17:03:10] For the next 0 hour(s) and 26 minute(s): Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160912T1700) [17:03:10] For the next 1 hour(s) and 56 minute(s): Test long running operation (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160912T1600) [17:03:20] neat [17:03:35] (03PS7) 10Volans: Git: Add username into commit message in private repo [puppet] - 10https://gerrit.wikimedia.org/r/310031 [17:05:47] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:06:44] (03CR) 10Volans: [C: 032] Git: Add username into 
commit message in private repo [puppet] - 10https://gerrit.wikimedia.org/r/310031 (owner: 10Volans) [17:11:09] 06Operations, 10DBA, 10MediaWiki-Maintenance-scripts, 06Release-Engineering-Team, and 2 others: Add section for long-running tasks on the Deployment page (specially for database maintenance) - https://phabricator.wikimedia.org/T144661#2629609 (10greg) Here's what the output looks like from jouncebot when t... [17:11:42] 06Operations, 07HHVM: Migrate deployment servers (tin/mira) to jessie - https://phabricator.wikimedia.org/T144578#2629610 (10demon) >>! In T144578#2622431, @MoritzMuehlenhoff wrote: > I don't think it would cause problems: The I/O performance of the Ganeti clusters should be adequate for deployments (but of co... [17:13:59] 06Operations, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: /mnt/upload7 does not exist anywhere, yet it is referenced in multiple places in wmf-config - https://phabricator.wikimedia.org/T129586#2629613 (10AlexMonk-WMF) 05Open>03Resolved thanks @hashar [17:14:23] (03CR) 10DCausse: "archiva updated" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/310038 (https://phabricator.wikimedia.org/T145404) (owner: 10DCausse) [17:14:47] (03PS1) 10Cmjohnson: Removing mgmt dns entries for decommissioned elastic1001-1016. [dns] - 10https://gerrit.wikimedia.org/r/310045 [17:16:15] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns entries for decommissioned elastic1001-1016. 
[dns] - 10https://gerrit.wikimedia.org/r/310045 (owner: 10Cmjohnson) [17:21:21] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:23:15] (03PS1) 10Volans: Git: fix commit hook [puppet] - 10https://gerrit.wikimedia.org/r/310047 [17:23:33] akosiaris: FYI, my bad, typo, fixing it [17:25:20] 06Operations, 10DBA, 10MediaWiki-Maintenance-scripts, 06Release-Engineering-Team, and 2 others: Add section for long-running tasks on the Deployment page (specially for database maintenance) - https://phabricator.wikimedia.org/T144661#2629638 (10bd808) >>! In T144661#2629609, @greg wrote: > @bd808: it migh... [17:25:23] (03CR) 10Volans: [C: 032] Git: fix commit hook [puppet] - 10https://gerrit.wikimedia.org/r/310047 (owner: 10Volans) [17:26:21] 06Operations, 10DBA, 10MediaWiki-Maintenance-scripts, 06Release-Engineering-Team, and 2 others: Add section for long-running tasks on the Deployment page (specially for database maintenance) - https://phabricator.wikimedia.org/T144661#2629655 (10greg) yeah, I'd prefer both because I know some people ignore... [17:29:30] !log roll-restart cassandra in eqiad with new CA and certs T143044 [17:29:32] T143044: Renew RESTBase self-signed root certificate authority - https://phabricator.wikimedia.org/T143044 [17:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:43:48] got it, anyway if those metrics there are sufficient we could use that [17:44:01] copy-pasta: fail [17:44:31] rotfl [17:45:03] you'd think I learned how copy/paste works over the years [17:45:23] (03PS1) 10BryanDavis: debian: Explictly manage python dependencies [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/310052 (https://phabricator.wikimedia.org/T145326) [17:46:01] did you mean paste or pasta?
[17:46:36] !log deploying latest wikidata query service [17:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:47:03] in the first case I meant copy-pasta as a mockery of what I just did [17:47:22] Oh [17:47:35] that copy-pasta doesn't sound very delicious [17:47:41] LOL [17:48:07] nah it isn't, just a bunch of paper stripes mushed together [17:50:55] !log wdqs1001 put in maitnenance, some issue with config file deployment [17:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:52:57] PROBLEM - WDQS HTTP on wdqs1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.014 second response time [17:53:36] PROBLEM - WDQS SPARQL on wdqs1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.003 second response time [17:53:49] ^that's me, silencing [17:56:22] does anyone has experience in deploying config files with scap3? [17:56:51] (03PS1) 10Yuvipanda: tools: Provision ssl private key from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/310054 (https://phabricator.wikimedia.org/T145120) [17:56:58] my config is created correctly in .git/config-files, the log says that a symlink is created, but it does not seem to be the case [17:57:04] gehel I think Dereckson has [17:57:12] hello [17:57:28] isn't scap sync / scap deploy ? [17:57:41] gehel: ask bd808 or thcipriani [17:57:58] gehel: #wikimedia-releng is generally very responsive for scap questions [17:58:17] Dereckson: thanks! I tried #scap3, but not much activity there [17:59:36] gehel: sorry we're all in a meeting, missed it. What is the command you're running? [17:59:56] simply "scap deploy"... [18:00:04] anomie, ostriches, thcipriani, hashar, and twentyafterfour: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160912T1800). Please do the needful. 
[18:00:04] MatmaRex: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [18:00:09] thcipriani: if you're busy, I can rollback and wait, no problem [18:00:52] 06Operations, 10Traffic: Strong cipher preference ordering for cache terminators - https://phabricator.wikimedia.org/T144626#2629774 (10BBlack) My current thinking on this at the moment is that it's probably not worth the more-complex prefhacks. We should probably stall on seeing a little more broad adoption... [18:01:02] thcipriani: you want I handle this SWAT if you're in a meeting? [18:01:05] gehel: it might be ideal to rollback for now. I might try the deploy with --force since the deploy command is supposed to be idempotent but the check for that may have some kind of incorrect check. [18:01:44] thcipriani: ok, roll back for now, I'll ping you later to analyze deeper [18:01:47] Dereckson: if you're around and available that would be great. Thank you. This is a non-typical meeting time for me :\ [18:01:51] gehel: thank you sounds good. [18:02:01] thcipriani: no problem, you're welcome [18:02:07] thcipriani: thanks! [18:03:42] MatmaRex: ping? [18:05:44] (03PS4) 10Giuseppe Lavagetto: puppetmaster: rsync volatile and ca dirs between puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/310026 [18:05:48] (03CR) 10Giuseppe Lavagetto: puppetmaster: rsync volatile and ca dirs between puppetmasters (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/310026 (owner: 10Giuseppe Lavagetto) [18:08:21] RECOVERY - WDQS HTTP on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 9706 bytes in 0.015 second response time [18:09:00] RECOVERY - WDQS SPARQL on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 9706 bytes in 0.010 second response time [18:10:32] Dereckson: around. sorry, for being late [18:11:11] we're waiting Zuul [18:11:17] ah merged [18:11:24] Dereckson: i can't really test my patch, since i was unable to reproduce the issue. 
it's apparently intermittent [18:11:46] you can at least check it doesn't create a new js error [18:11:47] it depends on the loading order of various modules :/ [18:11:51] yep [18:11:55] well, it can't :D [18:13:32] !log rolled back wdqs to HEAD^1 [18:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:16:30] (03CR) 10Merlijn van Deen: [C: 032] debian: Explictly manage python dependencies [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/310052 (https://phabricator.wikimedia.org/T145326) (owner: 10BryanDavis) [18:17:30] MatmaRex: live on mw1099 [18:18:11] (03Merged) 10jenkins-bot: debian: Explictly manage python dependencies [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/310052 (https://phabricator.wikimedia.org/T145326) (owner: 10BryanDavis) [18:18:37] servers log look good to me [18:19:12] looks good to me too [18:22:53] anyone using mw1099 at the moment? [18:22:59] oh, you are [18:23:06] anyone using mw1017, then? [18:23:23] ori: I'm done with mw1099 if you need it [18:23:43] !log dereckson@tin Synchronized php-1.28.0-wmf.18/resources/Resources.php: Add missing dependency to 'mediawiki.Upload.BookletLayout' module (T145315) (duration: 00m 47s) [18:23:45] T145315: Exception in module-execute in module mediawiki.Upload.BookletLayout - https://phabricator.wikimedia.org/T145315 [18:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:24:17] I'll just use mw1017 [18:24:29] MatmaRex: here you are ^ [18:24:41] thanks! 
[18:24:51] !log Changing wikiversion for group2 wikis on mw1017 to debug regression (T145359) [18:24:53] T145359: Investigate increase in firstPaint for p95 & p99 - https://phabricator.wikimedia.org/T145359 [18:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:30:09] (03PS1) 10Urbanecm: Fix HD logos for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310056 (https://phabricator.wikimedia.org/T145017) [18:36:25] (03PS2) 10Yuvipanda: updated *.wmflabs.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/309379 (https://phabricator.wikimedia.org/T145120) (owner: 10RobH) [18:36:28] (03CR) 10Yuvipanda: [C: 032 V: 032] updated *.wmflabs.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/309379 (https://phabricator.wikimedia.org/T145120) (owner: 10RobH) [18:36:44] (03PS2) 10Yuvipanda: tools: Provision ssl private key from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/310054 (https://phabricator.wikimedia.org/T145120) [18:36:49] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Provision ssl private key from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/310054 (https://phabricator.wikimedia.org/T145120) (owner: 10Yuvipanda) [18:39:56] ^^ why does it keep saying that [18:40:07] how do i get the customised cloak [18:40:42] (03PS3) 10Ottomata: Update camus jar version [puppet] - 10https://gerrit.wikimedia.org/r/309323 (https://phabricator.wikimedia.org/T144716) (owner: 10Joal) [18:41:01] you're asking the wrong person, I got mine in 200... 7? 8? before working at wmf anyways [18:41:18] so they've undoubtedly changed the procedure 15 times since then [18:42:15] apergos lol, i think i fixed it now ^^ :) [18:42:23] There we go [18:42:39] I was chery picking two patches and off course you can only do one. [18:42:45] ah ha [18:43:08] Im hopping now it will be stable with the package updates i did :) [18:43:24] apergos ^^ [18:43:39] any big changes? 
[18:44:06] Well maybe [18:44:07] https://phabricator.wikimedia.org/rTGRTcc6d5cbd169f5529d2d579c690d70c65c230a17e [18:44:12] apergos see ^^ [18:44:30] it depends, it now supports the newer ssh standards in openssh [18:44:35] ohhh [18:44:37] that's nice [18:45:04] apergos The only one i coulden update was swig since it broke the thing and i doint know why. [18:45:53] (03PS1) 10ArielGlenn: set up static nfs lock manager ports for dataset hosts [puppet] - 10https://gerrit.wikimedia.org/r/310059 [18:45:57] apergos over the weekend it looked like it worked for 4 days before i restarted it. [18:46:26] Better then when it keeped disconnecting every couple of hours, so im hopping the package upgrades improve reliability and decrease disconnections :) [18:46:48] that would be nice [18:46:57] (03CR) 10jenkins-bot: [V: 04-1] set up static nfs lock manager ports for dataset hosts [puppet] - 10https://gerrit.wikimedia.org/r/310059 (owner: 10ArielGlenn) [18:47:01] yeahyeah [18:47:33] Yep [18:49:28] (03CR) 10Ottomata: [C: 032] Update camus jar version [puppet] - 10https://gerrit.wikimedia.org/r/309323 (https://phabricator.wikimedia.org/T144716) (owner: 10Joal) [18:49:51] (03PS2) 10ArielGlenn: set up static nfs lock manager ports for dataset hosts [puppet] - 10https://gerrit.wikimedia.org/r/310059 [18:50:58] (03PS3) 10Urbanecm: Lift of IP cap - WomenInSience [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309511 (https://phabricator.wikimedia.org/T145115) [18:51:10] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [18:52:43] (03PS4) 10Urbanecm: Lift of IP cap - WomenInSience [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309511 (https://phabricator.wikimedia.org/T145115) [18:53:40] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5042631 keys - replication_delay is 0 [18:56:46] anomie: ostriches thcipriani hasharAway twentyafterfour 
Is there a time for deploying https://gerrit.wikimedia.org/r/#/c/310056/ ? Sorry for my lateness! [18:57:17] I thought that the Morning SWAT is from 21:00 to 22:00 UTC+2 but it isn't :D [18:57:38] (03PS1) 10Yuvipanda: tools: Provision proxy's cert from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/310062 (https://phabricator.wikimedia.org/T145120) [18:57:43] (03CR) 10Mobrovac: [C: 04-1] "Obsoleted now that change-prop is using scap3's config deploy. The same change for the deploy repo is Ideaab257e8bf9f6fd9345ee855d53f23aac" [puppet] - 10https://gerrit.wikimedia.org/r/308077 (owner: 10Ppchelko) [18:57:55] (03PS2) 10Yuvipanda: tools: Provision proxy's cert from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/310062 (https://phabricator.wikimedia.org/T145120) [18:57:59] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Provision proxy's cert from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/310062 (https://phabricator.wikimedia.org/T145120) (owner: 10Yuvipanda) [18:58:13] (03PS3) 10Yuvipanda: tools: Provision ssl private key from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/310054 (https://phabricator.wikimedia.org/T145120) [18:58:34] (03Abandoned) 10Ppchelko: Change-Prop: Switch to new events. [puppet] - 10https://gerrit.wikimedia.org/r/308077 (owner: 10Ppchelko) [18:58:48] (03CR) 10Yuvipanda: [V: 032] "fuck you for hiding the submit button, gerrit." [puppet] - 10https://gerrit.wikimedia.org/r/310054 (https://phabricator.wikimedia.org/T145120) (owner: 10Yuvipanda) [18:58:57] (03PS3) 10Yuvipanda: tools: Provision proxy's cert from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/310062 (https://phabricator.wikimedia.org/T145120) [18:59:08] (03CR) 10Yuvipanda: [V: 032] "fuck you for hiding the submit button, gerrit." 
[puppet] - 10https://gerrit.wikimedia.org/r/310062 (https://phabricator.wikimedia.org/T145120) (owner: 10Yuvipanda) [18:59:18] we should do weekly stats on swear words in commit messages/changesets [18:59:30] or monthly, send em out with all the rest of themonthly reports [19:00:15] LOL [19:00:44] "number of swear words this month:" [19:00:51] "number of pissed-off people:" [19:01:34] Urbanecm: I can deploy your change, sure. [19:01:51] Thanks a lot thcipriani ! [19:01:57] no problem :) [19:04:02] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310056 (https://phabricator.wikimedia.org/T145017) (owner: 10Urbanecm) [19:04:28] (03Merged) 10jenkins-bot: Fix HD logos for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310056 (https://phabricator.wikimedia.org/T145017) (owner: 10Urbanecm) [19:06:34] /win go caemir [19:06:56] Urbanecm: change is live on mw1099, check please [19:07:45] (I see the update FWIW) [19:08:13] It seems it works but I have no Retina display so I can't check it fully. [19:08:35] Urbanecm: ack. I will push the change everywhere. [19:08:40] Okay [19:08:41] Thanks [19:10:53] !log thcipriani@tin Synchronized static/images/project-logos: SWAT: [[gerrit:310056|Fix HD logos for hewiki (T145017)]] (duration: 00m 48s) [19:10:56] T145017: Pixelized logo - hewiki - https://phabricator.wikimedia.org/T145017 [19:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:11:49] Urbanecm: sync'd and purged [19:11:55] Thanks. [19:12:13] !log T133805: Disabling Puppet for GC experiment on restbase1013.eqiad.wmnet [19:12:15] T133805: Isolated testing of GC settings for aggressive Cassandra chunk_length_kb values - https://phabricator.wikimedia.org/T133805 [19:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:12:54] Urbanecm: I guess for quick fix up we can deploy any time [19:13:04] :) [19:13:16] But better is scheduling, isn't it? 
[19:13:27] Urbanecm: it depends on the situation [19:13:31] !log T133805: Restarting Cassandra to apply G1 region size of 32M on restbase1013-a.eqiad.wmnet [19:13:32] T133805: Isolated testing of GC settings for aggressive Cassandra chunk_length_kb values - https://phabricator.wikimedia.org/T133805 [19:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:13:53] Urbanecm: that one is an easy fix that impact only hewiki so I would do it anytime. [19:14:02] Urbanecm: though if nobody is around, you can schedule it :] [19:14:08] :) thanks anyway for the deployment. [19:18:56] 06Operations, 06Labs, 10Labs-Infrastructure, 10Traffic, and 2 others: update *.wmflabs.org certificate (existing expires on 2016-09-16) - https://phabricator.wikimedia.org/T145120#2630146 (10yuvipanda) I've moved over tools-static and tools, and they're good now. Need to do novaproxy next. [19:20:10] RECOVERY - HTTPS-wmflabs on tools.wmflabs.org is OK: SSL OK - Certificate *.wmflabs.org valid until 2017-10-16 15:41:05 +0000 (expires in 398 days) [19:20:27] \o/ ^ [19:21:06] 398? :) [19:21:24] I guess they gave you an extra month on renewal or something [19:21:54] I hate that GlobalSign does things like that by-default, when we'd prefer a strict policy of not having cert lifetimes over X (currently 1y), but it's not a major issue [19:26:31] (03PS1) 10Chad: Revert "getMWVersion: Unused, dubiously useful" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310068 [19:27:18] (03PS2) 10Chad: Revert "getMWVersion: Unused, dubiously useful" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310068 (https://phabricator.wikimedia.org/T145336) [19:27:30] 06Operations, 06Labs, 10Labs-Infrastructure, 10Traffic, and 2 others: update *.wmflabs.org certificate (existing expires on 2016-09-16) - https://phabricator.wikimedia.org/T145120#2630182 (10yuvipanda) I had to do this for the following set of hosts on tools: 1. tools-proxy-* 2. tools-static-* I've done... 
[19:28:28] jouncebot: next
[19:28:28] In 0 hour(s) and 31 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160912T2000)
[19:32:18] (03CR) 10ArielGlenn: [C: 031] "afaik a plain revert should just work, so..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310068 (https://phabricator.wikimedia.org/T145336) (owner: 10Chad)
[19:38:28] (03CR) 10Chad: [C: 032] Revert "getMWVersion: Unused, dubiously useful" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310068 (https://phabricator.wikimedia.org/T145336) (owner: 10Chad)
[19:38:54] (03Merged) 10jenkins-bot: Revert "getMWVersion: Unused, dubiously useful" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310068 (https://phabricator.wikimedia.org/T145336) (owner: 10Chad)
[19:42:34] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1007-c.eqiad.wmnet
[19:42:36] T143226: Cluster-wide major compactions: parsoid.html table - https://phabricator.wikimedia.org/T143226
[19:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:42:46] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1010-c.eqiad.wmnet
[19:42:47] T143226: Cluster-wide major compactions: parsoid.html table - https://phabricator.wikimedia.org/T143226
[19:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:42:57] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1011-c.eqiad.wmnet
[19:42:58] T143226: Cluster-wide major compactions: parsoid.html table - https://phabricator.wikimedia.org/T143226
[19:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:44:14] !log !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1008-c.eqiad.wmnet
[19:44:15] T143226: Cluster-wide major compactions: parsoid.html table - https://phabricator.wikimedia.org/T143226
[19:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:44:28] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1012-c.eqiad.wmnet
[19:44:29] T143226: Cluster-wide major compactions: parsoid.html table - https://phabricator.wikimedia.org/T143226
[19:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:44:42] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1013-c.eqiad.wmnet
[19:44:43] T143226: Cluster-wide major compactions: parsoid.html table - https://phabricator.wikimedia.org/T143226
[19:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:52:41] !log demon@tin Synchronized multiversion/getMWVersion: for dumps <3 (duration: 00m 46s)
[19:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:52:47] apergos: ^^
[19:53:18] I just checked it about 30 seconds before that sync went around
[19:53:21] perfect, thank you!
[19:53:39] I should see the results later tonight
[19:59:37] in fact abstracts for commons have already started. automagically \o/
[20:00:04] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160912T2000).
[20:00:21] no mobileapps deploy today
[20:02:00] PROBLEM - puppet last run on ms-be2022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:02:21] 06Operations, 10Wikimedia-Site-requests, 07Wikimedia-log-errors: Requests to localhost spam the 'localhost' and 'xff' log buckets - https://phabricator.wikimedia.org/T129982#2630310 (10hashar) From the Gerrit change: I have added it to European SWAT window of Tuesday, September 13 at 13:00–14:00 Also sent...
[20:04:14] (03PS1) 10Yuvipanda: labs: Worst typo [puppet] - 10https://gerrit.wikimedia.org/r/310072
[20:04:23] 06Operations, 10Beta-Cluster-Infrastructure, 13Patch-For-Review: /mnt/upload7 does not exist anywhere, yet it is referenced in multiple places in wmf-config - https://phabricator.wikimedia.org/T129586#2630321 (10hashar) Well done @AlexMonk-WMF that the last item of a long tail. Quite an achievement \o/
[20:04:50] (03PS2) 10Yuvipanda: labs: Worst typo [puppet] - 10https://gerrit.wikimedia.org/r/310072
[20:05:00] (03CR) 10Yuvipanda: [C: 032 V: 032] "can I blame this one on gerrit too? no?" [puppet] - 10https://gerrit.wikimedia.org/r/310072 (owner: 10Yuvipanda)
[20:12:56] clocking out for the day. have a good one folks
[20:13:27] !log starting Parsoid deploy
[20:13:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:21:55] !log change-prop deploying 86a60b3
[20:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:24:53] where oh where is wtp2019.codfw.wmnet
[20:26:59] RECOVERY - puppet last run on ms-be2022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:28:18] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1009-b.eqiad.wmnet
[20:28:20] T143226: Cluster-wide major compactions: parsoid.html table - https://phabricator.wikimedia.org/T143226
[20:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:28:31] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1014-b.eqiad.wmnet
[20:28:32] T143226: Cluster-wide major compactions: parsoid.html table - https://phabricator.wikimedia.org/T143226
[20:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:28:42] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1015-b.eqiad.wmnet
[20:28:44] T143226: Cluster-wide major compactions: parsoid.html table - https://phabricator.wikimedia.org/T143226
[20:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:32:17] T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase200[1-9]-c.codfw.wmnet
[20:32:17] T143226: Cluster-wide major compactions: parsoid.html table - https://phabricator.wikimedia.org/T143226
[20:32:40] !log Parsoid deploy failed, rolling back
[20:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:40:10] !log Parsoid back on 7c43009c
[20:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:45:44] 06Operations, 10Cassandra, 06Services: restbase2004.codfw.wmnet data corruption - https://phabricator.wikimedia.org/T144826#2630545 (10Eevans) To summarize an IRC convo with @fgiunchedi, I will wait for the current scrub to complete, then online the node. Tomorrow we will try rebooting the host (an act of d...
[20:55:54] 06Operations, 06Labs, 10Labs-Infrastructure, 10Traffic, and 2 others: update *.wmflabs.org certificate (existing expires on 2016-09-16) - https://phabricator.wikimedia.org/T145120#2630676 (10yuvipanda) ok, done on novaproxy-01 and -02 as well! I've also documened the tools ssl certs in https://wikitech.w...
[21:00:04] dapatrick and bawolff: Dear anthropoid, the time has come. Please deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160912T2100).
[21:04:18] 06Operations, 06Labs, 10Labs-Infrastructure, 10Traffic, and 2 others: update *.wmflabs.org certificate (existing expires on 2016-09-16) - https://phabricator.wikimedia.org/T145120#2630718 (10yuvipanda) 05Open>03Resolved
[21:17:28] !log For completeness, "back" in my last log is a mistake. I scap deployed the wrong --rev, but that was ultimately the version we wanted deployed anyways, so no harm no foul. (T145460)
[21:17:29] T145460: Rollback failed when target is down - https://phabricator.wikimedia.org/T145460
[21:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:37:28] (03PS10) 10Dduvall: beta: Create and mount LVM volumes for mariadb [puppet] - 10https://gerrit.wikimedia.org/r/305668 (https://phabricator.wikimedia.org/T138778)
[21:51:07] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/DigiCert_High_Assurance_CA-3.crt]
[22:10:56] (03PS1) 10Andrew Bogott: puppettable: Eliminate most uses of instance_id [puppet] - 10https://gerrit.wikimedia.org/r/310148
[22:10:58] (03PS1) 10Andrew Bogott: puppet tab: Rely less on fqdns and instance objects [puppet] - 10https://gerrit.wikimedia.org/r/310149
[22:11:00] (03PS1) 10Andrew Bogott: Puppet tab: A tiny bit of error handling [puppet] - 10https://gerrit.wikimedia.org/r/310150
[22:12:49] (03CR) 10jenkins-bot: [V: 04-1] puppet tab: Rely less on fqdns and instance objects [puppet] - 10https://gerrit.wikimedia.org/r/310149 (owner: 10Andrew Bogott)
[22:13:37] (03CR) 10jenkins-bot: [V: 04-1] Puppet tab: A tiny bit of error handling [puppet] - 10https://gerrit.wikimedia.org/r/310150 (owner: 10Andrew Bogott)
[22:16:09] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[22:41:14] (03PS1) 10BBlack: Revert "Revert "upload VCL: do not cache objects with CL:0 and status 200"" [puppet] - 10https://gerrit.wikimedia.org/r/310155
[22:41:25] (03CR) 10BBlack: [C: 032 V: 032] Revert "Revert "upload VCL: do not cache objects with CL:0 and status 200"" [puppet] - 10https://gerrit.wikimedia.org/r/310155 (owner: 10BBlack)
[22:41:30] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review: Certain images failing to load in ulsfo - https://phabricator.wikimedia.org/T144257#2631114 (10Tgr)
[22:47:52] (03PS1) 10Aaron Schulz: Set DBReplication log group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310159
[22:56:11] (03PS1) 10Yuvipanda: tools: Pick up cert for k8s master from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/310160
[22:58:09] (03PS2) 10Aaron Schulz: Set some database loggin groups to log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310159
[22:58:18] (03PS2) 10Yuvipanda: tools: Pick up cert for k8s master from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/310160
[22:58:22] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Pick up cert for k8s master from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/310160 (owner: 10Yuvipanda)
[22:59:38] (03PS1) 10BBlack: Revert "cache_upload: switch to file storage backend on Varnish 4" [puppet] - 10https://gerrit.wikimedia.org/r/310161
[23:00:04] RoanKattouw, ostriches, MaxSem, and Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160912T2300). Please do the needful.
[23:00:04] RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:27] (03PS2) 10BBlack: Revert "cache_upload: switch to file storage backend on Varnish 4" [puppet] - 10https://gerrit.wikimedia.org/r/310161
[23:00:35] Hah, that patch is abandoned, don't deploy it
[23:00:38] (03CR) 10BBlack: [C: 032 V: 032] Revert "cache_upload: switch to file storage backend on Varnish 4" [puppet] - 10https://gerrit.wikimedia.org/r/310161 (owner: 10BBlack)
[23:00:39] And with that, SWAT is empty
[23:01:11] yuvipanda: ok for merge?
[23:01:20] bblack yup
[23:05:32] (03PS1) 10Yuvipanda: tools: Get ssl key for docker registry from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/310164
[23:05:49] (03PS2) 10Yuvipanda: tools: Get ssl key for docker registry from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/310164
[23:05:54] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Get ssl key for docker registry from puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/310164 (owner: 10Yuvipanda)
[23:13:59] !log switch cache_upload eqiad to -sdeprecated_persistent...
[23:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:14:54] (03PS3) 10Aaron Schulz: Set some database logging groups to log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310159
[23:14:59] (03PS4) 10Aaron Schulz: Set some database logging groups to log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/310159
[23:22:34] There are two throttle rules changes pending. I'm adding them to SWAT.
[23:24:15] Calendar updated
[23:24:52] (03PS5) 10Dereckson: Lift of IP cap - WomenInSience [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309511 (https://phabricator.wikimedia.org/T145115) (owner: 10Urbanecm)
[23:24:58] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309511 (https://phabricator.wikimedia.org/T145115) (owner: 10Urbanecm)
[23:25:25] (03Merged) 10jenkins-bot: Lift of IP cap - WomenInSience [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309511 (https://phabricator.wikimedia.org/T145115) (owner: 10Urbanecm)
[23:26:08] hey, new Gerrit version isn't as efficient as before for dependencies patches, it wants I manually rebase a patch against 309511
[23:27:28] ah understood
[23:27:40] I've a conflict with *another* throttle change meanwhile merged
[23:33:31] !log T144826: Restarting Cassandra on restbase2004-b.codfw.wmnet (scrub complete, re-joining cluster)
[23:33:33] T144826: restbase2004.codfw.wmnet data corruption - https://phabricator.wikimedia.org/T144826
[23:33:36] 06Operations, 10Traffic, 13Patch-For-Review: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2631303 (10BBlack) To recap the current state of affairs and recent investigation/experimentation: 1. We finished upgrading the remaining DCs to Varnish4 earlier today, as shown in th...
[23:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:36:11] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[23:36:11] (03PS2) 10Dereckson: Women in Science (Vancouver, BCIT) throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309737 (https://phabricator.wikimedia.org/T145253)
[23:36:11] RECOVERY - cassandra-b CQL 10.192.32.138:9042 on restbase2004 is OK: TCP OK - 0.037 second response time on port 9042
[23:36:36] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309737 (https://phabricator.wikimedia.org/T145253) (owner: 10Dereckson)
[23:37:01] (03Merged) 10jenkins-bot: Women in Science (Vancouver, BCIT) throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309737 (https://phabricator.wikimedia.org/T145253) (owner: 10Dereckson)
[23:37:32] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[23:38:03] 309511 and 309737 live on mw1099
[23:38:33] looks good
[23:39:53] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[23:40:15] !log dereckson@tin Synchronized wmf-config/throttle.php: Women in Science throttle rules (T145115 and T145253) (duration: 00m 47s)
[23:40:17] T145115: Throttle rule for Women in Science, Vancouver, UBC - https://phabricator.wikimedia.org/T145115
[23:40:18] T145253: Request of temporary lift of IP cap for British Columbia Institute of Technology Library edit-a-thon - https://phabricator.wikimedia.org/T145253
[23:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:40:53] !log switch cache_upload codfw to -sdeprecated_persistent...
[23:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:41:04] SWAT done.
[23:41:32] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[23:42:23] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:46:02] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[23:49:53] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1000.0]
[23:52:23] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp1099 is CRITICAL: Connection refused
[23:54:53] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[23:57:14] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp1099 is OK: HTTP OK: HTTP/1.1 200 OK - 177 bytes in 0.009 second response time