[00:21:49] RECOVERY - puppet last run on mw2113 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:36:21] !log reset email address for User:INeverCry after identity verification [01:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:37:48] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1729945 (10ellery) @awight Thanks for double checking, I understand the difference. See this notebook for an ana... [01:46:31] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [01:48:01] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 35.46 ms [02:23:41] !log repairing table and restarting replication on s5 from dbstore1002 (non-production host) [02:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:24:36] and replication works again, 44372 seconds behind master, it will catch up soon [02:27:01] !log repairing table and restarting replication on s7 from labsdb1002 [02:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:28:11] 53502 seconds behind master, it will also catch up soon [02:28:21] <_joe_> jynus: did you just skip the error?
[02:28:27] no [02:28:34] there is no consistency error [02:28:38] <_joe_> ok, I'll ask you more tomorrow [02:28:44] just TokuDB shitting itself [02:28:50] <_joe_> ah ok [02:28:56] the engine crashes the table [02:29:02] and replication stops [02:29:06] very sad [02:29:26] there is a ticket about it, search Tokudb on phab [02:30:52] _joe_, https://phabricator.wikimedia.org/T109069 [02:31:06] and that is why I banned Toku from our production [02:31:12] only for analytics and labs [02:33:04] <_joe_> we tried toku sortly before I left JOB~1 [02:33:09] <_joe_> *shortly [02:33:17] <_joe_> but I was kind of unimpressed [02:38:47] !log l10nupdate@tin Synchronized php-1.27.0-wmf.2/cache/l10n: l10nupdate for 1.27.0-wmf.2 (duration: 08m 38s) [02:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:40:39] _joe_, it is the most stable engine with high compression we have now for mysql [02:40:49] and it is not very stable [02:41:43] this is the cure: http://rocksdb.org/ [02:41:56] lsm-based, but not yet production ready [02:43:42] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.2) at 2015-10-16 02:43:42+00:00 [02:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:06:25] !log l10nupdate@tin Synchronized php-1.27.0-wmf.3/cache/l10n: l10nupdate for 1.27.0-wmf.3 (duration: 08m 31s) [03:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:11:14] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.3) at 2015-10-16 03:11:14+00:00 [03:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:45:11] PROBLEM - puppet last run on es2003 is CRITICAL: CRITICAL: puppet fail [03:50:30] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [03:51:11] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below
the confidence bounds [03:52:02] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [04:09:52] (03CR) 10MZMcBride: "I'd still like to see a clearly articulated reason for making this change in the commit message. A wiki is a pretty good collaboration pla" [puppet] - 10https://gerrit.wikimedia.org/r/240888 (https://phabricator.wikimedia.org/T110070) (owner: 10Smalyshev) [04:11:52] RECOVERY - puppet last run on es2003 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [04:21:11] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 4 below the confidence bounds [04:28:00] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 5 below the confidence bounds [04:29:41] 6operations: Database corruption in hewiki (categorylinks) - https://phabricator.wikimedia.org/T115682#1730018 (10eranroz) 3NEW [04:31:46] 6operations, 10Wikimedia-General-or-Unknown, 7Database: Database corruption in hewiki (categorylinks) - https://phabricator.wikimedia.org/T115682#1730026 (10Legoktm) [04:32:41] PROBLEM - puppet last run on wtp1014 is CRITICAL: CRITICAL: Puppet has 1 failures [04:51:21] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds [04:56:21] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds [04:59:11] RECOVERY - puppet last run on wtp1014 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [05:04:04] 6operations: Add a command like option to mwgrep to allow it to search a particular page across all wikis - https://phabricator.wikimedia.org/T115683#1730055 (10kaldari) 3NEW [05:07:33] 6operations: Add a command like option to mwgrep to allow it to search a 
particular page across all wikis - https://phabricator.wikimedia.org/T115683#1730071 (10kaldari) Basically this would just entail setting the title.keyword filter to either be '.*\\.(js|css)' (the current default) or whatever was passed as... [05:07:45] 6operations, 7Easy: Add a command like option to mwgrep to allow it to search a particular page across all wikis - https://phabricator.wikimedia.org/T115683#1730072 (10kaldari) [05:08:18] 6operations, 7Easy: Add a command like option to mwgrep to allow it to search a particular page across all wikis - https://phabricator.wikimedia.org/T115683#1730055 (10kaldari) [05:11:22] 6operations, 6Community-Tech, 7Easy: Add a command-line option to mwgrep to allow it to search a particular page across all wikis - https://phabricator.wikimedia.org/T115683#1730079 (10kaldari) [05:12:30] 6operations, 6Community-Tech, 7Easy: Add a command-line option to mwgrep to allow it to search a particular page across all wikis - https://phabricator.wikimedia.org/T115683#1730055 (10kaldari) [05:12:44] 6operations, 6Community-Tech, 7Easy: Add a command-line option to mwgrep to allow it to search a particular page across all wikis - https://phabricator.wikimedia.org/T115683#1730055 (10kaldari) [05:21:30] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 9 below the confidence bounds [05:30:35] 6operations, 6Community-Tech, 7Easy: Add a command-line option to mwgrep to allow it to search a particular page across all wikis - https://phabricator.wikimedia.org/T115683#1730096 (10MZMcBride) While I can understand the motivation for creating this task, I'm pretty wary of continuing to extend and enhance... 
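The 05:07:33 comment on T115683 describes the proposed change as choosing between the current default title.keyword pattern and whatever title the caller passes. A minimal sketch of that selection in Python, for illustration only: the helper names `title_filter` and `title_matches` are hypothetical, and escaping the caller's title and matching against the whole string are assumptions rather than mwgrep's actual behaviour.

```python
import re

# Default pattern quoted in the task comment: any title ending in .js or .css.
DEFAULT_TITLE_PATTERN = r".*\.(js|css)"


def title_filter(title=None):
    """Return the pattern to apply to title.keyword (hypothetical helper)."""
    if title is None:
        return DEFAULT_TITLE_PATTERN
    # Escape the caller's title so it is matched literally (an assumption).
    return re.escape(title)


def title_matches(pattern, title):
    """Whole-string match, assumed here to mirror how the filter is applied."""
    return re.fullmatch(pattern, title) is not None
```

With no argument the default keeps today's behaviour of searching only .js and .css pages; passing a title narrows the search to that one page name on every wiki.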
[05:56:40] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [06:05:52] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [06:07:32] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [06:20:49] (03PS1) 10Yuvipanda: k8s: Make ssldir configurable for k8s master [puppet] - 10https://gerrit.wikimedia.org/r/246807 [06:20:54] (03CR) 10jenkins-bot: [V: 04-1] k8s: Make ssldir configurable for k8s master [puppet] - 10https://gerrit.wikimedia.org/r/246807 (owner: 10Yuvipanda) [06:21:42] PROBLEM - puppet last run on multatuli is CRITICAL: CRITICAL: puppet fail [06:22:02] (03PS2) 10Yuvipanda: k8s: Make ssldir configurable for k8s master [puppet] - 10https://gerrit.wikimedia.org/r/246807 [06:25:56] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Oct 16 06:25:56 UTC 2015 (duration 25m 55s) [06:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:30:11] PROBLEM - puppet last run on mw1086 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:51] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:02] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:11] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:30] PROBLEM - puppet last run on mw2043 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:11] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:42] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:52] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:01] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:20] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:47:51] PROBLEM - salt-minion processes on 
analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:47:51] PROBLEM - Hadoop DataNode on analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:47:51] PROBLEM - Hadoop NodeManager on analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:47:51] PROBLEM - dhclient process on analytics1034 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:48:41] RECOVERY - puppet last run on multatuli is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:21] RECOVERY - Hadoop NodeManager on analytics1034 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [06:49:21] RECOVERY - salt-minion processes on analytics1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [06:49:21] RECOVERY - dhclient process on analytics1034 is OK: PROCS OK: 0 processes with command name dhclient [06:49:21] RECOVERY - Hadoop DataNode on analytics1034 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [06:55:21] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0] [06:56:11] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:56:13] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:56:32] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:56:32] RECOVERY - puppet last run on mw2043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:01] RECOVERY - puppet last run on mw1086 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:11] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 50 seconds ago 
with 0 failures [06:57:41] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:50] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:52] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:58:02] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:02:12] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:29:40] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [07:31:22] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [07:37:12] (03CR) 10Smalyshev: "I think it would be more efficient to keep the discussion in one place, namely https://phabricator.wikimedia.org/T110070" [puppet] - 10https://gerrit.wikimedia.org/r/240888 (https://phabricator.wikimedia.org/T110070) (owner: 10Smalyshev) [09:04:42] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [09:09:43] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [09:13:11] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [09:18:14] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [09:24:52] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [09:30:01] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: 
CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [10:07:33] 6operations, 6Release-Engineering-Team: Monitor Phabricator and Gerrit availability - https://phabricator.wikimedia.org/T115611#1730343 (10Aklapper) To some extend related: {T109279} [10:15:10] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [11:11:10] 6operations, 10Wikimedia-General-or-Unknown, 7Database: hewiki's categorylinks shown as not empty though it is; purging does not help - https://phabricator.wikimedia.org/T115682#1730411 (10Aklapper) [11:28:34] 7Blocked-on-Operations, 6operations, 3Discovery-Maps-Sprint: allow maps cluster Varnish cache purging - https://phabricator.wikimedia.org/T112836#1730448 (10BBlack) Time and effort :) [11:31:44] (03PS2) 10Faidon Liambotis: Move base::firewall include into the mx role [puppet] - 10https://gerrit.wikimedia.org/r/245969 (owner: 10Muehlenhoff) [11:32:05] (03CR) 10Faidon Liambotis: [C: 032] Move base::firewall include into the mx role [puppet] - 10https://gerrit.wikimedia.org/r/245969 (owner: 10Muehlenhoff) [11:33:12] (03PS2) 10Faidon Liambotis: Mark old MXes as spares [puppet] - 10https://gerrit.wikimedia.org/r/246392 (https://phabricator.wikimedia.org/T115489) (owner: 10Muehlenhoff) [11:33:19] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Mark old MXes as spares [puppet] - 10https://gerrit.wikimedia.org/r/246392 (https://phabricator.wikimedia.org/T115489) (owner: 10Muehlenhoff) [11:38:07] (03PS2) 10Faidon Liambotis: deployment: use the same ferm rule as mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/246207 (owner: 10Alex Monk) [11:38:18] (03PS3) 10Faidon Liambotis: deployment: use the same ferm rule as mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/246207 (owner: 10Alex Monk) [11:38:56] (03CR) 10Faidon Liambotis: [C: 032] deployment: use the same ferm rule as mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/246207 (owner: 10Alex Monk) [11:52:42] PROBLEM - puppet last 
run on mira is CRITICAL: CRITICAL: puppet fail [11:55:28] (03PS1) 10Faidon Liambotis: deployment: use ferm::rule, not ::service [puppet] - 10https://gerrit.wikimedia.org/r/246817 [11:56:07] (03CR) 10Faidon Liambotis: [C: 032] deployment: use ferm::rule, not ::service [puppet] - 10https://gerrit.wikimedia.org/r/246817 (owner: 10Faidon Liambotis) [12:02:33] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [12:45:25] 6operations, 6Community-Tech, 7Easy: Add a command-line option to mwgrep to allow it to search a particular page across all wikis - https://phabricator.wikimedia.org/T115683#1730532 (10Krenair) I agree that we (Wikimedia engineers) should be working towards making mwgrep mostly obsolete (as far as public wik... [12:57:03] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [12:57:34] PROBLEM - puppet last run on db2034 is CRITICAL: CRITICAL: puppet fail [12:58:44] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [13:05:13] PROBLEM - puppet last run on db2063 is CRITICAL: CRITICAL: puppet fail [13:08:29] (03PS2) 10Muehlenhoff: Move base::firewall include into the gerrit::production role [puppet] - 10https://gerrit.wikimedia.org/r/245975 [13:09:07] (03CR) 10Muehlenhoff: [C: 032 V: 032] Move base::firewall include into the gerrit::production role [puppet] - 10https://gerrit.wikimedia.org/r/245975 (owner: 10Muehlenhoff) [13:13:17] (03CR) 10Ori.livneh: [C: 032] Turn off UserDailyContribs extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246689 (https://phabricator.wikimedia.org/T85984) (owner: 10Ori.livneh) [13:13:24] (03Merged) 10jenkins-bot: Turn off UserDailyContribs extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246689 (https://phabricator.wikimedia.org/T85984) (owner: 10Ori.livneh) [13:14:07] (03PS1) 10Muehlenhoff: Make db2055 to db2070 as role spare [puppet] - 
10https://gerrit.wikimedia.org/r/246823 [13:17:16] (03PS4) 10Faidon Liambotis: Remove classes snapshot::common, snapshot::packages [puppet] - 10https://gerrit.wikimedia.org/r/245616 [13:17:18] (03PS1) 10Faidon Liambotis: Remove class role::dataset::publicdirs, noop [puppet] - 10https://gerrit.wikimedia.org/r/246824 [13:17:20] (03PS1) 10Faidon Liambotis: dataset: move system user creation to module [puppet] - 10https://gerrit.wikimedia.org/r/246825 [13:17:22] (03PS1) 10Faidon Liambotis: dataset: inline the non-role role classes [puppet] - 10https://gerrit.wikimedia.org/r/246826 [13:17:24] (03PS1) 10Faidon Liambotis: dataset: remove system::role from the dataset module [puppet] - 10https://gerrit.wikimedia.org/r/246827 [13:17:26] (03PS1) 10Faidon Liambotis: snapshot: create a proper role::snapshot [puppet] - 10https://gerrit.wikimedia.org/r/246828 [13:18:13] !log ori@tin Synchronized wmf-config/CommonSettings.php: I14ecd0ae87: Turn off UserDailyContribs extension (duration: 00m 18s) [13:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:21:06] paravoid: nice [13:23:11] what is? 
[13:23:15] the clean-up [13:23:22] moritzm: https://gerrit.wikimedia.org/r/#/c/246691/ [13:23:24] oh I gave up for now [13:23:36] it needs another 20 patchsets or so to become sane :( [13:24:05] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "While this is good in general, you should include standard AFTER you declare 'role X' with the role keyword, see https://wikitech.wikimedi" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/245616 (owner: 10Faidon Liambotis) [13:24:34] RECOVERY - puppet last run on db2034 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [13:25:11] (03PS1) 10Muehlenhoff: Mark multatuli as spare [puppet] - 10https://gerrit.wikimedia.org/r/246831 [13:27:58] (03PS1) 10Muehlenhoff: Mark rubidium as spare [puppet] - 10https://gerrit.wikimedia.org/r/246832 [13:30:04] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/246691 (owner: 10Ori.livneh) [13:33:53] RECOVERY - puppet last run on db2063 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:34:37] Any sysadmin online? [13:36:51] yes, what's up? [13:37:35] (03CR) 10Faidon Liambotis: [C: 04-1] "This isn't equivalent from what I can see. role spare doesn't include base::firewall. Perhaps it should (I see no reason why not to?), but" [puppet] - 10https://gerrit.wikimedia.org/r/246831 (owner: 10Muehlenhoff) [13:39:15] (03PS2) 10Ori.livneh: add ferm rule opening port 80 for grafana [puppet] - 10https://gerrit.wikimedia.org/r/246691 [13:39:34] (03CR) 10Ori.livneh: [C: 032 V: 032] add ferm rule opening port 80 for grafana [puppet] - 10https://gerrit.wikimedia.org/r/246691 (owner: 10Ori.livneh) [13:39:52] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Again this seems correct in general but there is an issue with using the role keyword twice; this will give a compilation error on snapsho" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/246828 (owner: 10Faidon Liambotis) [13:39:59] Hi paravoid. 
I don't know who is best for this, but hopefully you can advise. I have a user with a login problem. When you ask for a temp password, and use the temp password, they are sent to a "reset password" screen. When they reset the password, it is eaten by the system, and both the temp and new password do not work [13:40:03] I have tried it myself [13:40:10] with the temp passwords sent to me [13:40:22] (03CR) 10MZMcBride: [C: 04-1] "This is unacceptable. Git guidelines require that a commit message (1) explain what change is being made; and (2) provide a clear rational" [puppet] - 10https://gerrit.wikimedia.org/r/240888 (https://phabricator.wikimedia.org/T110070) (owner: 10Smalyshev) [13:41:00] In order to merge the account, I recently disconnected the account from the global account and renamed it. However, without a password it cannot be reattached [13:41:20] Any ideas? [13:50:24] Taketa: no, please file a phabricator task [13:58:32] (03CR) 10Chad: [C: 04-2] "Waiting for next week, deploys on hold." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246281 (owner: 10Chad) [14:04:49] 6operations, 6Release-Engineering-Team: Monitor Phabricator and Gerrit availability - https://phabricator.wikimedia.org/T115611#1730624 (10Dzahn) gerrit process monitoring: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=ytterbium&service=gerrit+process phabricator http monitoring: http... [14:06:06] 6operations, 6Release-Engineering-Team: Monitor Phabricator and Gerrit availability - https://phabricator.wikimedia.org/T115611#1730626 (10Dzahn) Gerrit availability: http://status.wikimedia.org/8777/249692/Gerrit Phabricator availability: http://status.wikimedia.org/8777/388149/Phabricator probably additio...
[14:10:40] 6operations, 10Wikimedia-General-or-Unknown: security@mediawiki.org : Create a public key and publish it on the public key servers - https://phabricator.wikimedia.org/T40860#1730632 (10Dzahn) @Aklapper for now, until this ticket is resolved, i'd suggest to tell the user to: gpg --search-keys csteipp@wikimedia... [14:12:08] 7Blocked-on-Operations, 6operations, 3Discovery-Maps-Sprint: allow maps cluster Varnish cache purging - https://phabricator.wikimedia.org/T112836#1730633 (10Yurik) Could you quantify them somehow? :) [14:14:55] paravoid : I created https://phabricator.wikimedia.org/T115699 [14:19:17] (03CR) 10Dzahn: [C: 04-1] Switch to git-based portal [puppet] - 10https://gerrit.wikimedia.org/r/240888 (https://phabricator.wikimedia.org/T110070) (owner: 10Smalyshev) [14:52:07] (03PS1) 10Dzahn: admin: deactivate rmoen's shell account [puppet] - 10https://gerrit.wikimedia.org/r/246839 (https://phabricator.wikimedia.org/T115544) [14:55:40] (03CR) 10Alex Monk: "Shouldn't ensure be changed to absent and the user added to the absent group?" [puppet] - 10https://gerrit.wikimedia.org/r/246839 (https://phabricator.wikimedia.org/T115544) (owner: 10Dzahn) [14:55:43] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds [14:57:29] (03CR) 10Dzahn: "afaict, no. the key should be emptied but still ensured present, to have it actively removed. adding Chase though to confirm" [puppet] - 10https://gerrit.wikimedia.org/r/246839 (https://phabricator.wikimedia.org/T115544) (owner: 10Dzahn) [15:00:56] Krenair: ok, so both. 
first do that above with the key, then a follow-up with what you said [15:01:03] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds [15:01:11] 6operations, 10Analytics, 6Discovery, 10EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1730753 (10Eevans) >>! In T114443#1726139, @Eevans wrote: > I've expanded upon @gwicke's prototype a bit, progress here: https://github.com/wikimedia/restevent Actually, after some addi... [15:03:50] (03CR) 10Dzahn: "both, first this and then a follow-up with what you said" [puppet] - 10https://gerrit.wikimedia.org/r/246839 (https://phabricator.wikimedia.org/T115544) (owner: 10Dzahn) [15:06:04] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds [15:07:55] mutante: https://wikitech.wikimedia.org/wiki/Help:Git_rebase [15:08:08] (My rant from a few years ago about the dangers of ‘git pull’) [15:11:04] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 16 data above and 9 below the confidence bounds [15:11:34] mutante: this? https://etherpad.wikimedia.org/p/ops-offsite-discussions [15:13:46] andrewbogott: yes :) i had this in mind https://etherpad.wikimedia.org/p/TechOps-2015-10-14 [15:13:49] and it links [15:14:46] 6operations, 6Release-Engineering-Team: Monitor Phabricator and Gerrit availability - https://phabricator.wikimedia.org/T115611#1730798 (10greg) Just to be sure: we are NOT talking about the traditional sense of HA (https://en.wikipedia.org/wiki/High_availability) with things like redundancy, quick failover, e... 
[15:20:41] (03PS1) 10Ori.livneh: xenon-grep: add `BagOStuff` to exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/246843 [15:27:45] (03PS2) 10Ori.livneh: xenon-grep: use shell-style wildcards for exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/246843 [15:28:12] (03CR) 10Ori.livneh: [C: 032] xenon-grep: use shell-style wildcards for exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/246843 (owner: 10Ori.livneh) [15:30:08] 6operations: Drop `user_daily_contribs` table from all production wikis - https://phabricator.wikimedia.org/T115711#1730831 (10ori) 3NEW a:3jcrespo [15:31:59] 6operations, 10ops-eqiad, 10netops: Decom Tele2 @ eqiad - https://phabricator.wikimedia.org/T115712#1730843 (10faidon) 3NEW a:3RobH [15:35:14] (03PS12) 10EBernhardson: Refactor monolog handling for kafka logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) [15:35:24] (03CR) 10jenkins-bot: [V: 04-1] Refactor monolog handling for kafka logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) (owner: 10EBernhardson) [15:43:06] (03CR) 10Giuseppe Lavagetto: [C: 031] "Looks good to me, but I won't merge it now" [puppet] - 10https://gerrit.wikimedia.org/r/246687 (https://phabricator.wikimedia.org/T115588) (owner: 10Mobrovac) [15:43:41] (03PS2) 10Giuseppe Lavagetto: Remove tmh100[12].yaml hieradata [puppet] - 10https://gerrit.wikimedia.org/r/246680 (owner: 10Reedy) [15:44:36] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/246680 (owner: 10Reedy) [15:44:46] (03PS1) 10Dzahn: admin: add new group for datacenter ops [puppet] - 10https://gerrit.wikimedia.org/r/246848 [15:45:39] (03CR) 10jenkins-bot: [V: 04-1] admin: add new group for datacenter ops [puppet] - 10https://gerrit.wikimedia.org/r/246848 (owner: 10Dzahn) [15:46:25] (03PS1) 10Dzahn: admin: add papaul to datacenter ops group [puppet] - 
10https://gerrit.wikimedia.org/r/246849 [15:47:10] (03CR) 10jenkins-bot: [V: 04-1] admin: add papaul to datacenter ops group [puppet] - 10https://gerrit.wikimedia.org/r/246849 (owner: 10Dzahn) [15:49:16] 6operations, 10Wikimedia-Logstash: gdash reports for php/apache errors - https://phabricator.wikimedia.org/T81030#1730906 (10greg) [15:50:06] 6operations, 10Wikimedia-Logstash: gdash reports for php/apache errors - https://phabricator.wikimedia.org/T81030#882575 (10greg) [15:51:10] (03PS1) 10Dzahn: admin: add datacenter-ops group to palladium [puppet] - 10https://gerrit.wikimedia.org/r/246850 [15:51:44] (03CR) 10jenkins-bot: [V: 04-1] admin: add datacenter-ops group to palladium [puppet] - 10https://gerrit.wikimedia.org/r/246850 (owner: 10Dzahn) [15:55:55] (03CR) 10Alexandros Kosiaris: [C: 032] RESTBase: make the domain to monitor configurable [puppet] - 10https://gerrit.wikimedia.org/r/246687 (https://phabricator.wikimedia.org/T115588) (owner: 10Mobrovac) [15:56:00] (03PS2) 10Alexandros Kosiaris: RESTBase: make the domain to monitor configurable [puppet] - 10https://gerrit.wikimedia.org/r/246687 (https://phabricator.wikimedia.org/T115588) (owner: 10Mobrovac) [16:00:15] 6operations: create new admin group for datacenter ops to add new systems to puppet - https://phabricator.wikimedia.org/T115718#1730969 (10Dzahn) 3NEW [16:02:24] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [16:11:42] (03PS3) 10Yuvipanda: k8s: Make ssldir configurable for k8s master [puppet] - 10https://gerrit.wikimedia.org/r/246807 [16:11:46] (03PS2) 10Dzahn: admin: deactivate rmoen's shell account [puppet] - 10https://gerrit.wikimedia.org/r/246839 (https://phabricator.wikimedia.org/T115544) [16:11:49] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Make ssldir configurable for k8s master [puppet] - 10https://gerrit.wikimedia.org/r/246807 (owner: 10Yuvipanda) [16:12:08] (03PS2) 10Yuvipanda: toollabs: install mailutils [puppet] - 
10https://gerrit.wikimedia.org/r/246289 (https://phabricator.wikimedia.org/T114073) (owner: 10Merlijn van Deen) [16:12:14] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: install mailutils [puppet] - 10https://gerrit.wikimedia.org/r/246289 (https://phabricator.wikimedia.org/T114073) (owner: 10Merlijn van Deen) [16:13:25] (03CR) 10Yuvipanda: [C: 04-1] "Should just be one regex instead of an array, both for simplicity and performance reasons." [puppet] - 10https://gerrit.wikimedia.org/r/246125 (https://phabricator.wikimedia.org/T90844) (owner: 10Alex Monk) [16:17:01] (03CR) 10Dzahn: [C: 032] admin: deactivate rmoen's shell account [puppet] - 10https://gerrit.wikimedia.org/r/246839 (https://phabricator.wikimedia.org/T115544) (owner: 10Dzahn) [16:17:21] (03PS3) 10Dzahn: admin: deactivate rmoen's shell account [puppet] - 10https://gerrit.wikimedia.org/r/246839 (https://phabricator.wikimedia.org/T115544) [16:18:12] (03PS1) 10Ori.livneh: xenon-grep: tweaks to output formatting [puppet] - 10https://gerrit.wikimedia.org/r/246855 [16:18:39] (03PS2) 10Ori.livneh: xenon-grep: tweaks to output formatting [puppet] - 10https://gerrit.wikimedia.org/r/246855 [16:18:48] (03CR) 10Ori.livneh: [C: 032 V: 032] xenon-grep: tweaks to output formatting [puppet] - 10https://gerrit.wikimedia.org/r/246855 (owner: 10Ori.livneh) [16:21:27] (03PS1) 10Dzahn: admin: ensure rmoen user is absented [puppet] - 10https://gerrit.wikimedia.org/r/246856 (https://phabricator.wikimedia.org/T115544) [16:22:16] (03CR) 10Yuvipanda: [C: 031] "If you ack that you tested this and it works I'll +2 :)" [puppet] - 10https://gerrit.wikimedia.org/r/246231 (owner: 10Alex Monk) [16:24:19] (03PS3) 10Giuseppe Lavagetto: IdleConnection: set keepalive [debs/pybal] - 10https://gerrit.wikimedia.org/r/244717 (https://phabricator.wikimedia.org/T113151) (owner: 10Ori.livneh) [16:25:16] 6operations, 7Database: Drop `user_daily_contribs` table from all production wikis - https://phabricator.wikimedia.org/T115711#1731043 (10Krenair) 
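The 16:13:25 review comment on change 246125 asks for one regex instead of an array of them. A rough sketch of what that amounts to, written in Python purely for illustration (the change itself lives in the puppet repository): the function names and the sample user-agent patterns in the usage note are hypothetical.

```python
import re


def compile_blocklist(patterns):
    """Join several user-agent patterns into one compiled alternation."""
    if not patterns:
        # An empty join would match every string, so return a pattern
        # that can never match instead.
        return re.compile(r"(?!)")
    # Wrap each entry in a non-capturing group so a '|' inside one entry
    # cannot leak into its neighbours.
    return re.compile("|".join(f"(?:{p})" for p in patterns))


def is_blocked(blocklist, user_agent):
    """True if any of the joined patterns occurs in the header value."""
    return blocklist.search(user_agent) is not None
```

For example, `compile_blocklist([r"ExampleBot", r"BadCrawler/\d+"])` yields a single pattern that is tested once per request, rather than looping over each entry.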
[16:25:54] PROBLEM - puppet last run on fluorine is CRITICAL: CRITICAL: puppet fail [16:28:44] (03CR) 10Alex Monk: "How about moving him out of bastiononly and granting this group bastion access?" [puppet] - 10https://gerrit.wikimedia.org/r/246849 (owner: 10Dzahn) [16:29:14] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [16:29:42] icinga-wm: lies [16:29:46] (03CR) 10Hashar: [C: 04-1] "It is missing a single quote." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/246848 (owner: 10Dzahn) [16:29:47] eh, not anymore :) [16:33:04] (03CR) 10Alex Monk: "krenair@terbium:~$ ./ldaplist.sh -l projects zulip" [puppet] - 10https://gerrit.wikimedia.org/r/246231 (owner: 10Alex Monk) [16:34:14] (03PS2) 10Yuvipanda: ldaplist: Add support for projects/projectroles [puppet] - 10https://gerrit.wikimedia.org/r/246231 (owner: 10Alex Monk) [16:34:27] (03CR) 10Yuvipanda: [C: 032 V: 032] ldaplist: Add support for projects/projectroles [puppet] - 10https://gerrit.wikimedia.org/r/246231 (owner: 10Alex Monk) [16:34:45] krrrit-wm: thanks! [16:34:47] err [16:34:49] Krenair: [16:34:51] thanks [16:37:33] _joe_: reviewed, but just as krrrit-wm was restarting [16:38:43] (03PS1) 10BryanDavis: labs-vagrant: Add a stub labs-vagrant command [puppet] - 10https://gerrit.wikimedia.org/r/246857 [16:38:56] (03PS2) 10Dzahn: admin: add new group for datacenter ops [puppet] - 10https://gerrit.wikimedia.org/r/246848 (https://phabricator.wikimedia.org/T115718) [16:40:23] (03PS3) 10Dzahn: admin: add new group for datacenter ops [puppet] - 10https://gerrit.wikimedia.org/r/246848 (https://phabricator.wikimedia.org/T115718) [16:41:12] <_joe_> ori: we just tested it with akosiaris btw [16:41:13] (03CR) 10Dzahn: admin: add new group for datacenter ops (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/246848 (https://phabricator.wikimedia.org/T115718) (owner: 10Dzahn) [16:41:30] _joe_: and? worked? 
[16:41:52] (PS2) BryanDavis: labs-vagrant: Add a stub labs-vagrant command [puppet] - https://gerrit.wikimedia.org/r/246857 [16:42:14] <_joe_> ori: yes! [16:42:22] nice! [16:42:24] <_joe_> of course we're kind of abusing what tcp keepalive is thought for, but whatever [16:42:31] (PS3) Alex Monk: dynamicproxy: Make blocked user agents configurable [puppet] - https://gerrit.wikimedia.org/r/246125 (https://phabricator.wikimedia.org/T90844) [16:42:39] i think it's clever [16:42:45] but does it truly kick in even though we don't send any data? [16:45:34] (CR) BryanDavis: "tested via cherry-pick on vagrant-lxc-trusty.mediawiki-core-team.eqiad.wmflabs" [puppet] - https://gerrit.wikimedia.org/r/246857 (owner: BryanDavis) [16:46:03] (PS3) Dzahn: RESTBase: make the domain to monitor configurable [puppet] - https://gerrit.wikimedia.org/r/246687 (https://phabricator.wikimedia.org/T115588) (owner: Mobrovac) [16:50:15] (CR) Ori.livneh: "http://man7.org/linux/man-pages/man7/tcp.7.html suggests we may want to use TCP_USER_TIMEOUT as well / instead" [debs/pybal] - https://gerrit.wikimedia.org/r/244717 (https://phabricator.wikimedia.org/T113151) (owner: Ori.livneh) [16:51:22] <_joe_> ori: ah! interesting [16:51:42] (PS2) Dzahn: admin: add dc-ops group to role access_new_install [puppet] - https://gerrit.wikimedia.org/r/246850 (https://phabricator.wikimedia.org/T115718) [16:52:42] (CR) Alex Monk: [C: -1] "dataset-admins?"
[puppet] - https://gerrit.wikimedia.org/r/246850 (https://phabricator.wikimedia.org/T115718) (owner: Dzahn) [16:53:25] dataset-admins having access to iron and palladium sounds fun [16:53:51] luckily no one is in that group [16:54:13] (PS3) Dzahn: admin: add dc-ops group to role access_new_install [puppet] - https://gerrit.wikimedia.org/r/246850 (https://phabricator.wikimedia.org/T115718) [16:54:40] (CR) Dzahn: "arr, yea, "datacenter-ops" is what i meant" [puppet] - https://gerrit.wikimedia.org/r/246850 (https://phabricator.wikimedia.org/T115718) (owner: Dzahn) [16:56:34] mutante: could you please force puppet to run on aqs100x so we make sure the icinga errors are gone? [16:56:40] (PS3) Yuvipanda: labs-vagrant: Add a stub labs-vagrant command [puppet] - https://gerrit.wikimedia.org/r/246857 (owner: BryanDavis) [16:57:23] (CR) Yuvipanda: [C: 2 V: 2] labs-vagrant: Add a stub labs-vagrant command [puppet] - https://gerrit.wikimedia.org/r/246857 (owner: BryanDavis) [16:57:43] (PS2) Dzahn: admin: ensure rmoen user is absented [puppet] - https://gerrit.wikimedia.org/r/246856 (https://phabricator.wikimedia.org/T115544) [16:58:49] mobrovac: already did. needs puppet run on neon, did that too. i have the icinga tab open ... [16:59:01] ah hm [16:59:15] looks... [17:00:49] mobrovac: hold on :) i saw the change applied on aqs1001 now..
[17:00:55] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [17:00:59] there we go [17:01:02] yay :) [17:01:13] :) [17:01:18] thnx mutante [17:01:29] np, thanks for the fix [17:01:49] operations, Analytics, Services, Patch-For-Review: Automatic monitoring not working for AQS - https://phabricator.wikimedia.org/T115588#1731198 (Dzahn) that fixed it :) https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=aqs1001&service=Restbase+endpoints+health [17:02:01] operations, Analytics, Services, Patch-For-Review: Automatic monitoring not working for AQS - https://phabricator.wikimedia.org/T115588#1731199 (Dzahn) Open>Resolved [17:02:57] (CR) Dzahn: [C: 2] admin: ensure rmoen user is absented [puppet] - https://gerrit.wikimedia.org/r/246856 (https://phabricator.wikimedia.org/T115544) (owner: Dzahn) [17:03:10] operations, Community-Tech, Easy: Add a command-line option to mwgrep to allow it to search a particular page across all wikis - https://phabricator.wikimedia.org/T115683#1731202 (kaldari) @MZMcBride: Good point. If mwgrep were available to the community, they would be able to keep a better eye on on-wi... [17:03:14] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [17:03:18] operations, Analytics, Services: Automatic monitoring not working for AQS - https://phabricator.wikimedia.org/T115588#1728063 (Dzahn) [17:04:54] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[17:06:25] PROBLEM - puppet last run on mw2121 is CRITICAL: CRITICAL: puppet fail [17:13:14] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [17:26:14] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [17:26:25] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [17:26:33] operations, Analytics, Discovery, EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1731284 (GWicke) >>! In T114443#1730753, @Eevans wrote: > # Already leverages a (really slick) [[https://meta.wikimedia.org/wiki/Category:Schemas_%28active%29?status=active|JSON schema... [17:28:03] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [17:35:24] RECOVERY - puppet last run on mw2121 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [17:47:51] (CR) DCausse: "I tested with a monolog config made by Erik." (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) (owner: EBernhardson) [17:52:24] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:52:59] operations, Community-Tech, Easy: Add a command-line option to mwgrep to allow it to search a particular page across all wikis - https://phabricator.wikimedia.org/T115683#1731379 (kaldari) [17:55:36] operations, Analytics, Discovery, EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1731399 (ori) >>! In T114443#1731284, @GWicke wrote: > See T88459#1604768. tl;dr: It's not necessarily clear that saving very little code (see above) for EL schema fetching outweights...
[17:55:44] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 5.301 second response time [17:56:21] (CR) Hashar: [C: 1] admin: add new group for datacenter ops [puppet] - https://gerrit.wikimedia.org/r/246848 (https://phabricator.wikimedia.org/T115718) (owner: Dzahn) [18:16:54] operations, Analytics, Discovery, EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1731526 (GWicke) >>! In T114443#1731399, @ori wrote: >>>! In T114443#1731284, @GWicke wrote: >> See T88459#1604768. tl;dr: It's not necessarily clear that saving very little code (see... [18:23:27] operations, Analytics, Discovery, EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1731564 (Eevans) >>! In T114443#1731284, @GWicke wrote: >>>! In T114443#1730753, @Eevans wrote: >> # Already leverages a (really slick) [[https://meta.wikimedia.org/wiki/Category:Schem... [18:27:45] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000000.0] [18:33:04] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [5000000.0] [18:36:35] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 8 below the confidence bounds [18:38:53] (PS1) Ori.livneh: mwgrep: add '--title' arg [puppet] - https://gerrit.wikimedia.org/r/246891 (https://phabricator.wikimedia.org/T114443) [18:41:25] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 1.00% above the threshold [1000000.0] [18:41:50] (CR) Ori.livneh: [C: 2 V: 2] mwgrep: add '--title' arg [puppet] - https://gerrit.wikimedia.org/r/246891 (https://phabricator.wikimedia.org/T114443) (owner: Ori.livneh) [18:42:05] ori: you linked to the Eventbus mvp ticket on the mwgrep patch [18:43:23] PROBLEM - HTTP
error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 6 below the confidence bounds [18:46:33] madhuvishy: oops [18:46:35] my bad [18:46:44] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds [18:47:00] (CR) Ori.livneh: "Sorry, the bug reference should have been T115683" [puppet] - https://gerrit.wikimedia.org/r/246891 (https://phabricator.wikimedia.org/T114443) (owner: Ori.livneh) [18:47:14] operations, Community-Tech, Easy: Add a command-line option to mwgrep to allow it to search a particular page across all wikis - https://phabricator.wikimedia.org/T115683#1731694 (kaldari) Open>Resolved a:kaldari FWIW, I went ahead and just ran my query with a modified mwgrep (per Krenair), sin... [18:47:15] ori: np, it showed up on the phab ticket so noticed. [18:49:09] (CR) Alex Monk: "So after I (among others) suggested that we should not expand this script, you came along and wrote the patch and then self-+2'd it, and c" [puppet] - https://gerrit.wikimedia.org/r/246891 (https://phabricator.wikimedia.org/T114443) (owner: Ori.livneh) [18:49:13] operations, Analytics, Discovery, EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1731699 (mobrovac) [18:50:22] (CR) Ori.livneh: "Krenair: yes" [puppet] - https://gerrit.wikimedia.org/r/246891 (https://phabricator.wikimedia.org/T114443) (owner: Ori.livneh) [18:55:14] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds [19:00:24] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds [19:05:06] Ops-Access-Requests, operations: Requesting access to analytics-privatedata-users for Bryan Davis -
https://phabricator.wikimedia.org/T115548#1731744 (Tnegrin) approved [19:05:34] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds [19:07:22] (PS4) Smalyshev: Switch to git-based portal [puppet] - https://gerrit.wikimedia.org/r/240888 (https://phabricator.wikimedia.org/T110070) [19:08:00] (CR) Smalyshev: "I do not think 10-page discussion where the actual maintainer of the page endorses the change and explains in detail why actually qualifie" [puppet] - https://gerrit.wikimedia.org/r/240888 (https://phabricator.wikimedia.org/T110070) (owner: Smalyshev) [19:12:15] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [19:13:38] I'm looking into reports of media file serving errors (https://phabricator.wikimedia.org/T115563), what would be the right graphite key for that? [19:17:58] (PS2) Dzahn: Move base::firewall include into the racktables and rt roles [puppet] - https://gerrit.wikimedia.org/r/245962 (owner: Muehlenhoff) [19:19:05] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds [19:20:16] (CR) Dzahn: [C: 2] Move base::firewall include into the racktables and rt roles [puppet] - https://gerrit.wikimedia.org/r/245962 (owner: Muehlenhoff) [19:23:35] tgr: on the swift side? see "frontend" on https://grafana.wikimedia.org/dashboard/db/swift-eqiad [19:23:38] (Abandoned) Dzahn: torrus: remove role from netmon1001 [puppet] - https://gerrit.wikimedia.org/r/245890 (https://phabricator.wikimedia.org/T87840) (owner: Dzahn) [19:29:11] !log deleted /var/lib/carbon/whisper/MediaWiki/MediaWiki on graphite1001 & graphite2001 per tgr's request [19:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:37:32] ori: thanks!
[19:38:15] godog: btw, dunno if you saw my ping the other day, but grafana 2.x will have prometheus support [19:38:18] 2.5 rather [19:40:47] ori: ah yeah I saw that but forgot to ack, that's cool! [19:40:57] operations, ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, Services, and 2 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#1731803 (Amire80) [19:41:02] operations, ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, Services, and 2 others: Test CXServer in Jessie - https://phabricator.wikimedia.org/T107307#1731804 (Amire80) [19:41:09] I think it got merged in master already? [19:41:41] ori: btw I'm going to decom tessera in an effort to have a single frontend if that's ok [19:42:34] nevermind, looks like you did that already (and thanks!) [19:48:08] godog: thanks, I was looking for varnish but that's useful as well [19:50:32] (PS2) Dzahn: Move base::firewall includes for roles on krypton [puppet] - https://gerrit.wikimedia.org/r/245968 (owner: Muehlenhoff) [19:53:06] (CR) Dzahn: [C: 1] Make spare role include base::firewall [puppet] - https://gerrit.wikimedia.org/r/246388 (owner: Muehlenhoff) [19:53:33] (CR) Dzahn: [C: 1] aqs: Include base::firewall in the role [puppet] - https://gerrit.wikimedia.org/r/246330 (owner: Muehlenhoff) [19:54:07] (CR) Dzahn: [C: 1] wdqs: Move the ferm rules into the role [puppet] - https://gerrit.wikimedia.org/r/246224 (owner: Muehlenhoff) [19:55:00] operations, Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1731857 (faidon) >>! In T115416#1724286, @Nemo_bis wrote: > Well, gerrit and Phabricator emails are certainly very bad. Multiple bugs have been reported in the last few years...
[19:55:08] (CR) Dzahn: [C: 1] Move base::debdeploy into the base class [puppet] - https://gerrit.wikimedia.org/r/246220 (owner: Muehlenhoff) [19:57:42] operations, Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1731867 (faidon) FWIW, we haven't changed anything recently - @JKrauska, this is probably something you should contact Google Apps about, we're customers after all. [20:02:34] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [20:02:35] (CR) Dzahn: [C: 1] Move base:firewall include into the memcached role [puppet] - https://gerrit.wikimedia.org/r/245964 (owner: Muehlenhoff) [20:03:24] (CR) Dzahn: [C: 1] Move base::firewall include into the openldap::corp role [puppet] - https://gerrit.wikimedia.org/r/245972 (owner: Muehlenhoff) [20:04:04] (CR) Dzahn: [C: 1] Move base::firewall includes for roles on krypton [puppet] - https://gerrit.wikimedia.org/r/245968 (owner: Muehlenhoff) [20:04:12] (CR) Filippo Giunchedi: [C: 1] admin: add new group for datacenter ops [puppet] - https://gerrit.wikimedia.org/r/246848 (https://phabricator.wikimedia.org/T115718) (owner: Dzahn) [20:05:34] (CR) Dzahn: [C: 1] "lgtm, just remove the trailing whitespace" [puppet] - https://gerrit.wikimedia.org/r/245876 (owner: Muehlenhoff) [20:11:30] (CR) Filippo Giunchedi: [C: -1] "one question on merging behaviour" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/246850 (https://phabricator.wikimedia.org/T115718) (owner: Dzahn) [20:23:44] operations, Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1731940 (Tbayer) I seem to recall that WMF Office IT enabled 'aggressive' spam filtering on @ wikimedia.org Gmail on August 27. I understand that they have the option to whi...
[20:26:31] operations, Traffic, HTTPS: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1731946 (Umherirrender) [20:26:35] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1731944 (Umherirrender) Open>Resolved [20:26:44] operations, Analytics, Discovery, EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1731947 (GWicke) > For starters, it means that we have alternatives for environments where Kafka is overkill (small third-party installations, dev environments, mw-vagrant, etc). Using... [20:29:04] PROBLEM - Analytics Cassanda CQL query interface on aqs1001 is CRITICAL: Connection refused [20:30:55] RECOVERY - Analytics Cassanda CQL query interface on aqs1001 is OK: TCP OK - 0.001 second response time on port 9042 [20:33:40] (PS1) Muehlenhoff: Add salt grains for hadoop workers [puppet] - https://gerrit.wikimedia.org/r/246944 [20:33:42] (PS1) Muehlenhoff: Add salt grains for hadoop master and standby [puppet] - https://gerrit.wikimedia.org/r/246945 [20:33:44] (PS1) Muehlenhoff: Add salt grains for gitblit [puppet] - https://gerrit.wikimedia.org/r/246946 [20:33:46] (PS1) Muehlenhoff: Add salt grains for spares [puppet] - https://gerrit.wikimedia.org/r/246947 [20:33:48] (PS1) Muehlenhoff: Add salt grains for aqs [puppet] - https://gerrit.wikimedia.org/r/246948 [20:33:50] (PS1) Muehlenhoff: Add salt grains for releases role [puppet] - https://gerrit.wikimedia.org/r/246949 [20:33:52] (PS1) Muehlenhoff: Add salt grains for restbase canaries [puppet] - https://gerrit.wikimedia.org/r/246950 [20:33:54] (PS1) Muehlenhoff: Add salt grains for restbase [puppet] - https://gerrit.wikimedia.org/r/246951 [20:33:56] (PS1) Muehlenhoff: Add salt grains for etcd [puppet] - https://gerrit.wikimedia.org/r/246952
[20:33:58] (PS1) Muehlenhoff: Add salt grains for test systems [puppet] - https://gerrit.wikimedia.org/r/246953 [20:34:00] (PS1) Muehlenhoff: Add salt grains for mxes [puppet] - https://gerrit.wikimedia.org/r/246954 [20:34:02] (PS1) Muehlenhoff: Add salt grains for ocg [puppet] - https://gerrit.wikimedia.org/r/246955 [20:34:04] (PS1) Muehlenhoff: Add salt grains for sca [puppet] - https://gerrit.wikimedia.org/r/246956 [20:34:06] (PS1) Muehlenhoff: Add salt grains for scb [puppet] - https://gerrit.wikimedia.org/r/246957 [20:34:07] oh wow :) [20:34:08] (PS1) Muehlenhoff: Add salt grains for videoscalers [puppet] - https://gerrit.wikimedia.org/r/246958 [20:35:44] (PS1) Muehlenhoff: archiva: use the role keyword [puppet] - https://gerrit.wikimedia.org/r/246959 [20:36:49] (PS1) Muehlenhoff: argon: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/246960 [20:38:10] (CR) Dzahn: admin: add dc-ops group to role access_new_install (1 comment) [puppet] - https://gerrit.wikimedia.org/r/246850 (https://phabricator.wikimedia.org/T115718) (owner: Dzahn) [20:40:08] (PS1) Muehlenhoff: bast4001: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/246962 [20:42:11] (PS1) Muehlenhoff: bromine: Move to using the role keyword [puppet] - https://gerrit.wikimedia.org/r/246964 [20:43:03] 7~/win 40 [20:43:06] er :) [20:43:14] operations, RESTBase, Services: Switch RESTBase to use Node.js 4 - https://phabricator.wikimedia.org/T107762#1731988 (GWicke) 4.2.1 is now in unstable: https://packages.debian.org/search?keywords=nodejs [20:45:11] (PS1) Muehlenhoff: californium: Move to the role keyword [puppet] - https://gerrit.wikimedia.org/r/246965 [20:47:25] (PS1) Muehlenhoff: carbon: Move to the role keyword [puppet] - https://gerrit.wikimedia.org/r/246966 [20:48:20] (CR) Dzahn: "if we had different contact groups per each role, and then applied more than 1 role to a node, this would
likely not work currently and ju" [puppet] - https://gerrit.wikimedia.org/r/244814 (owner: John F. Lewis) [20:51:25] (PS1) Muehlenhoff: conf*: Convert to fully use the role keyword [puppet] - https://gerrit.wikimedia.org/r/246967 [20:51:49] https://phabricator.wikimedia.org/T115755!log Restarting Jenkins to remove potential dead locks before the week-end [20:52:19] !log Restarting Jenkins to remove potential dead locks before the week-end [20:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:52:28] (PS1) Muehlenhoff: db1047: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/246968 [20:54:13] (PS1) Muehlenhoff: db1069: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/246969 [20:56:13] (PS1) Muehlenhoff: erbium: Move to using the role keyword [puppet] - https://gerrit.wikimedia.org/r/246971 [20:58:46] (PS1) Muehlenhoff: etherpad1001: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/246978 [20:59:52] (PS1) Muehlenhoff: eventlog1001: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/246979 [21:01:01] (PS1) Muehlenhoff: fluorine: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/246980 [21:01:04] (CR) Dzahn: [C: -1] "in .pp files, such as role/restbase.pp the lookup for contact groups will not happen implicitly, these are not classes but defines. so the" [puppet] - https://gerrit.wikimedia.org/r/244814 (owner: John F. Lewis) [21:01:55] (CR) Alexandros Kosiaris: [C: -2] "Aside from the big-ish change that is kind of difficult to review, this will not work. puppet defines do not support implicit hiera lookup" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/244814 (owner: John F. Lewis) [21:02:15] (CR) Alexandros Kosiaris: [C: -1] move all non-default contact_group variables to hiera [puppet] - https://gerrit.wikimedia.org/r/244814 (owner: John F.
Lewis) [21:02:40] (CR) Alexandros Kosiaris: [C: 1] etherpad1001: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/246978 (owner: Muehlenhoff) [21:04:12] (CR) Dzahn: [C: 1] dnsrecursor: Move base::firewall include into the role [puppet] - https://gerrit.wikimedia.org/r/244692 (owner: Muehlenhoff) [21:08:25] operations: document debian packaging guidelines - https://phabricator.wikimedia.org/T115757#1732017 (fgiunchedi) NEW [21:14:40] operations, Patch-For-Review: create new admin group for datacenter ops to add new systems to puppet - https://phabricator.wikimedia.org/T115718#1732027 (Dzahn) a:Dzahn [21:18:19] (CR) Dzahn: [C: -1] "doesn't match our schema - Faidon said it might work after the switch to openldap" [puppet] - https://gerrit.wikimedia.org/r/229299 (https://phabricator.wikimedia.org/T107702) (owner: Dzahn) [21:20:10] (CR) Dzahn: [C: 1] "+1, but only after coordinating with releng and finding a good time for it" [puppet] - https://gerrit.wikimedia.org/r/240083 (owner: Muehlenhoff) [21:31:35] operations: consider debian "experimental" workflow for internal repository - https://phabricator.wikimedia.org/T115758#1732058 (fgiunchedi) NEW [21:32:56] operations: consider debian "experimental" workflow for internal repository - https://phabricator.wikimedia.org/T115758#1732058 (fgiunchedi) other examples include the nodejs upgrade in {T107762} [21:33:03] operations: Audit uses of package=>latest - https://phabricator.wikimedia.org/T115348#1732072 (fgiunchedi) [21:33:04] operations: consider debian "experimental" workflow for internal repository - https://phabricator.wikimedia.org/T115758#1732071 (fgiunchedi) [21:46:51] operations: deployment: user trebuchet gets added and removed from group wikidev on every puppet run - https://phabricator.wikimedia.org/T115760#1732096 (faidon) NEW a:chasemp [21:53:37] (PS1) Alexandros Kosiaris: hiera_lookup: Use sub instead of tr [puppet] -
https://gerrit.wikimedia.org/r/246987 [21:59:05] operations: Do not apply spam headers on email assessed NOT to be spam - https://phabricator.wikimedia.org/T111595#1732118 (faidon) > Would it be possible to tweak these settings to NOT apply the header unless it's believed to be spam? This actually hurts debugging, though, which is what the rationale is for... [21:59:13] (PS13) EBernhardson: Refactor monolog handling for kafka logs [mediawiki-config] - https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) [21:59:54] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000000.0] [22:00:32] (PS1) Dzahn: maps: move hieradata from codfw to role/common [puppet] - https://gerrit.wikimedia.org/r/246992 [22:07:25] (PS1) Dzahn: ntp: do not 'ensure latest' [puppet] - https://gerrit.wikimedia.org/r/247005 (https://phabricator.wikimedia.org/T115348) [22:08:24] (PS1) Dzahn: puppet: do not 'ensure latest' [puppet] - https://gerrit.wikimedia.org/r/247007 (https://phabricator.wikimedia.org/T115348) [22:09:49] (PS1) Dzahn: interface: do not 'ensure latest' [puppet] - https://gerrit.wikimedia.org/r/247008 (https://phabricator.wikimedia.org/T115348) [22:11:54] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0] [22:12:04] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: puppet fail [22:27:32] operations, Mail: Google Mail marking Phabricator and Gerrit notification emails as spam - https://phabricator.wikimedia.org/T115416#1732158 (Quiddity) >>! In T115416#1726149, @daniel wrote: > You can whitelist mails from wikimedia.org via gmail's web interface: > http://email.about.com/od/gmailtips/qt/et_... [22:28:23] (CR) MZMcBride: "The updated commit message makes no sense. I'll discuss on Phabricator."
[puppet] - https://gerrit.wikimedia.org/r/240888 (https://phabricator.wikimedia.org/T110070) (owner: Smalyshev) [22:30:53] (CR) Alexandros Kosiaris: [V: -1] "Depends on apertium-isl, -1 to block for now" [debs/contenttranslation/apertium-isl-eng] - https://gerrit.wikimedia.org/r/244416 (https://phabricator.wikimedia.org/T114988) (owner: KartikMistry) [22:32:24] (CR) Alexandros Kosiaris: [V: -1] "The following packages have unmet dependencies:" [debs/contenttranslation/apertium-isl] - https://gerrit.wikimedia.org/r/244415 (https://phabricator.wikimedia.org/T114988) (owner: KartikMistry) [22:39:17] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:47:21] operations, Traffic, HTTPS: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1732202 (BBlack) [22:47:25] operations, Commons, MediaWiki-File-management, MediaWiki-Tarball-Backports, and 7 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1732200 (BBlack) Resolved>Open ^ Was the software released? We still haven't removed the exception itself... [22:49:08] lol [22:53:54] (CR) Alexandros Kosiaris: [C: 2 V: 2] Added Debian package for apertium-mlt-ara [debs/contenttranslation/apertium-mlt-ara] - https://gerrit.wikimedia.org/r/244410 (https://phabricator.wikimedia.org/T111902) (owner: KartikMistry) [22:56:32] (PS2) Milimetric: Add a public endpoint for AQS [puppet] - https://gerrit.wikimedia.org/r/245887 (https://phabricator.wikimedia.org/T114830) [23:32:58] operations: Do not apply spam headers on email assessed NOT to be spam - https://phabricator.wikimedia.org/T111595#1732255 (JKrauska) Google has been unhelpful in explaining /WHY/ it is a problem, but it comes up each time I open a ticket about this. 'Your very own spam analysis is saying this email isn't sp...
[23:49:16] Ops-Access-Requests, operations, Icinga: give John Lewis permissions to send commands in icinga for fermium/mailman - https://phabricator.wikimedia.org/T105229#1732274 (Dzahn) We tested if using "John F. Lewis" as contact name breaks Icinga config and it actually seems ok (as in not breaking syntax ch...