[00:04:41] Ops-Access-Requests, operations, Patch-For-Review: Requesting access to stat1002 for JUnikowski_WMF - https://phabricator.wikimedia.org/T113298#1695859 (RobH) Open>Resolved I neglected to come back and update this task, I merged Jonathan's access to bastions earlier today. As @Krenair pointed ou...
[00:06:40] (CR) Dzahn: "it seems there are at least roles "wikimetrics" and "labsores" using redis but not having ferm rules. per yuvi's comment then" [puppet] - https://gerrit.wikimedia.org/r/242915 (owner: Muehlenhoff)
[00:08:48] (CR) Yuvipanda: "labsores doesn't have base::firewall so that's ok and wikimetrics is running off self-hosted puppetmaster that also doesn't have base::fir" [puppet] - https://gerrit.wikimedia.org/r/242915 (owner: Muehlenhoff)
[00:08:58] mutante: if it's just those two..
[00:14:49] (CR) Dzahn: "role::ci::slave::browsertests" [puppet] - https://gerrit.wikimedia.org/r/242915 (owner: Muehlenhoff)
[00:15:48] (CR) Dzahn: "these also look like they use redis but have no ferm in the role" [puppet] - https://gerrit.wikimedia.org/r/242915 (owner: Muehlenhoff)
[00:39:37] (CR) Tim Landscheidt: "(The task for automating this would be T100990 if someone is interested to take this on.)" [puppet] - https://gerrit.wikimedia.org/r/243047 (owner: Yuvipanda)
[00:43:39] !log ori@tin Synchronized php-1.27.0-wmf.1/vendor: Add wikimedia/relpath 1.0.3 (duration: 00m 21s)
[00:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:44:32] !log ori@tin Synchronized php-1.27.0-wmf.1/includes/resourceloader: I21bb3f08e7f and follow-ups (duration: 00m 18s)
[00:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:29:43] (PS1) Dzahn: add wiki.voyage as parked domain [dns] - https://gerrit.wikimedia.org/r/243091 (https://phabricator.wikimedia.org/T88851)
[02:36:45] !log l10nupdate@tin Synchronized php-1.27.0-wmf.1/cache/l10n: l10nupdate for 1.27.0-wmf.1 (duration: 08m 04s)
[02:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:41:33] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.1) at 2015-10-02 02:41:33+00:00
[02:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:42:29] !log tstarling@tin Synchronized php-1.27.0-wmf.1/extensions/ParsoidBatchAPI: stats (duration: 00m 17s)
[02:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:04:47] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied
[03:27:56] RECOVERY - Disk space on labstore1002 is OK: DISK OK
[04:04:35] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-maps/snapshot is not accessible: Permission denied
[04:05:03] lol?
[04:05:18] yuvipanda: ---^^
[04:05:29] yeaaaah
[04:05:35] RoanKattouw: there's a bug for it, Coren is/was working on it
[04:05:50] it just means 'oh, so I am doing a backup now!'
[04:06:03] Oh! Aha
[04:06:08] OK so in a way that's good news I suppose
[04:06:15] RoanKattouw: yeah it's reassuring in some ways.
[04:06:54] "I'm still getting whipped, so the universe must not have ended!"
[04:43:35] (PS1) Dzahn: add wikipedia.es with link to wikipedia.org [dns] - https://gerrit.wikimedia.org/r/243099
[04:43:53] (PS2) Dzahn: add wikipedia.es with link to wikipedia.org [dns] - https://gerrit.wikimedia.org/r/243099 (https://phabricator.wikimedia.org/T101060)
[04:56:15] (PS1) Dzahn: park wikiquotes.info [dns] - https://gerrit.wikimedia.org/r/243100 (https://phabricator.wikimedia.org/T106114)
[05:01:40] (PS1) Dzahn: park wikipedia.lol [dns] - https://gerrit.wikimedia.org/r/243101 (https://phabricator.wikimedia.org/T88861)
[05:06:50] (PS1) Yuvipanda: k8s: Make kubelet serve/register with fqdn [puppet] - https://gerrit.wikimedia.org/r/243102
[05:07:35] (PS1) Dzahn: unlink wikimedia.xyz from wikimedia.com and park it [dns] - https://gerrit.wikimedia.org/r/243103 (https://phabricator.wikimedia.org/T92547)
[05:08:31] (CR) Yuvipanda: [C: 2] k8s: Make kubelet serve/register with fqdn [puppet] - https://gerrit.wikimedia.org/r/243102 (owner: Yuvipanda)
[05:31:58] <_joe_> yuvipanda: yep the bug is in the way nrpe reads globs etc
[05:32:09] <_joe_> clearly something needs escaping
[06:07:05] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Oct 2 06:07:05 UTC 2015 (duration 7m 4s)
[06:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:25:42] (PS2) Muehlenhoff: Add ferm rules for Spark [puppet] - https://gerrit.wikimedia.org/r/240341 (https://phabricator.wikimedia.org/T83597)
[06:28:38] (CR) Muehlenhoff: [C: 2 V: 2] Add ferm rules for Spark [puppet] - https://gerrit.wikimedia.org/r/240341 (https://phabricator.wikimedia.org/T83597) (owner: Muehlenhoff)
[06:29:17] PROBLEM - Check size of conntrack table on analytics1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:29:26] PROBLEM - RAID on analytics1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:29:56] PROBLEM - puppet last run on analytics1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:30:16] PROBLEM - SSH on analytics1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:30:17] PROBLEM - YARN NodeManager Node-State on analytics1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:30:26] PROBLEM - Hadoop NodeManager on analytics1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:30:36] PROBLEM - puppet last run on mw2043 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:07] PROBLEM - salt-minion processes on analytics1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:31:16] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:27] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:27] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:37] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:37] PROBLEM - Hadoop DataNode on analytics1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:31:46] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:56] PROBLEM - dhclient process on analytics1055 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:32:17] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:26] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:26] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:27] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:47] RECOVERY - Hadoop NodeManager on analytics1055 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[06:34:16] RECOVERY - Check size of conntrack table on analytics1055 is OK: OK: nf_conntrack is 0 % full
[06:34:25] RECOVERY - salt-minion processes on analytics1055 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:34:25] RECOVERY - RAID on analytics1055 is OK: OK: optimal, 13 logical, 14 physical
[06:34:55] RECOVERY - Hadoop DataNode on analytics1055 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[06:34:55] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 31 minutes ago with 0 failures
[06:35:06] RECOVERY - dhclient process on analytics1055 is OK: PROCS OK: 0 processes with command name dhclient
[06:35:15] RECOVERY - SSH on analytics1055 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[06:35:15] RECOVERY - YARN NodeManager Node-State on analytics1055 is OK: OK: YARN NodeManager analytics1055.eqiad.wmnet:8041 Node-State: RUNNING
[06:55:56] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:56:26] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:56:57] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:05] RECOVERY - puppet last run on mw2043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:07] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:57:07] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:57:08] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:56] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:56] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:56] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:04:05] operations, Availability, Varnish: Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#1696280 (aaron)
[07:08:06] (PS1) Aaron Schulz: [WIP] Switched to pt-heartbeat lag detection [mediawiki-config] - https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266)
[07:20:59] operations, Traffic, Availability: Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#1696320 (faidon)
[07:21:13] AaronSchulz: misses a description :)
[07:21:49] AaronSchulz: also we're using #Traffic for all the traffic/edge/CDN/whatever stuff these days, as it's more than just "Varnish"
[07:44:12] (CR) Filippo Giunchedi: [C: 1] Cassandra: dequote some booleans. [puppet] - https://gerrit.wikimedia.org/r/241238 (https://phabricator.wikimedia.org/T113783) (owner: Andrew Bogott)
[08:04:05] (CR) Addshore: [C: 1] Avoid breaking full phabricator URLs [puppet] - https://gerrit.wikimedia.org/r/242237 (owner: Daniel Kinzler)
[08:04:46] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above 100
[08:21:36] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 100
[08:29:04] operations, RESTBase, Services: Switch RESTBase to use Node.js 4 - https://phabricator.wikimedia.org/T107762#1696377 (MoritzMuehlenhoff) 4.2 will be the LTS release: https://github.com/nodejs/node/issues/3000#issuecomment-144894835
[08:31:45] (PS1) Filippo Giunchedi: update collector version [software/cassandra-metrics-collector] - https://gerrit.wikimedia.org/r/243121 (https://phabricator.wikimedia.org/T113733)
[08:44:45] (CR) Hashar: "role::ci::slave::browsertests is used on the Jenkins slaves in labs. Only one of them had some ferm rules and I have dropped them." [puppet] - https://gerrit.wikimedia.org/r/242915 (owner: Muehlenhoff)
[08:45:35] !log restarting Nodepool to take in account changes made to the logging configuration https://gerrit.wikimedia.org/r/#/c/240986/
[08:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:47:36] (CR) Hashar: "I have restarted nodepool service on labnodepool1001.eqiad.wmnet." [puppet] - https://gerrit.wikimedia.org/r/240986 (owner: Hashar)
[08:48:16] (CR) Muehlenhoff: "I double-checked, the other roles don't use ferm either:" [puppet] - https://gerrit.wikimedia.org/r/242915 (owner: Muehlenhoff)
[08:51:56] PROBLEM - swift-container-server on ms-be1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:53:26] PROBLEM - swift-object-updater on ms-be1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:54:56] RECOVERY - swift-object-updater on ms-be1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[08:55:21] (PS1) Muehlenhoff: Move base::firewall include into the role [puppet] - https://gerrit.wikimedia.org/r/243122
[08:56:46] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [5000000.0]
[08:58:15] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [5000000.0]
[09:00:16] RECOVERY - swift-container-server on ms-be1012 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[09:01:14] (PS2) Jcrespo: Adding wikiuser and wikiadmin grants to read the heartbeat table [puppet] - https://gerrit.wikimedia.org/r/242211
[09:02:50] (CR) Jcrespo: [C: 2] "I will merge this now and slowly deploy it (it will not take effect immediately)." [puppet] - https://gerrit.wikimedia.org/r/242211 (owner: Jcrespo)
[09:03:15] PROBLEM - puppet last run on mw1151 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:06:46] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[09:07:06] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[09:09:28] (PS1) Muehlenhoff: releases: Move the base::firewall include into the role [puppet] - https://gerrit.wikimedia.org/r/243123
[09:29:56] RECOVERY - puppet last run on mw1151 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[09:31:25] (PS1) Filippo Giunchedi: cassandra: new metrics-collector version [puppet] - https://gerrit.wikimedia.org/r/243127 (https://phabricator.wikimedia.org/T113733)
[09:35:01] (CR) Filippo Giunchedi: [C: -1] cassandra: new metrics-collector version (1 comment) [puppet] - https://gerrit.wikimedia.org/r/243127 (https://phabricator.wikimedia.org/T113733) (owner: Filippo Giunchedi)
[09:41:35] PROBLEM - puppet last run on mw1162 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:44:06] PROBLEM - puppet last run on mw1232 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:50:03] (PS1) Jcrespo: Adding ferm::mariadb role to labsdb1004 (tools slave) [puppet] - https://gerrit.wikimedia.org/r/243129
[09:50:53] (CR) jenkins-bot: [V: -1] Adding ferm::mariadb role to labsdb1004 (tools slave) [puppet] - https://gerrit.wikimedia.org/r/243129 (owner: Jcrespo)
[09:55:01] (PS2) Jcrespo: Adding ferm::mariadb role to labsdb1004 (tools slave) [puppet] - https://gerrit.wikimedia.org/r/243129
[09:55:43] (CR) jenkins-bot: [V: -1] Adding ferm::mariadb role to labsdb1004 (tools slave) [puppet] - https://gerrit.wikimedia.org/r/243129 (owner: Jcrespo)
[09:57:08] (PS3) Jcrespo: Adding ferm::mariadb role to labsdb1004 (tools slave) [puppet] - https://gerrit.wikimedia.org/r/243129
[09:59:13] (CR) Muehlenhoff: [C: 1] "Looks good to me. Also double-checked the ports on labsdb1004" [puppet] - https://gerrit.wikimedia.org/r/243129 (owner: Jcrespo)
[10:00:46] PROBLEM - swift-container-auditor on ms-be1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:01:05] PROBLEM - puppet last run on ms-be1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:01:24] (CR) Filippo Giunchedi: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: Milimetric)
[10:02:09] (CR) Jcrespo: [C: 2] Adding ferm::mariadb role to labsdb1004 (tools slave) [puppet] - https://gerrit.wikimedia.org/r/243129 (owner: Jcrespo)
[10:02:25] RECOVERY - swift-container-auditor on ms-be1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[10:02:37] RECOVERY - puppet last run on ms-be1012 is OK: OK: Puppet is currently enabled, last run 22 minutes ago with 0 failures
[10:03:40] Blocked-on-Operations, operations, Parsoid, Salt, Scrum-of-Scrums: Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments) - https://phabricator.wikimedia.org/T102039#1696552 (ArielGlenn)
[10:05:09] (CR) Alexandros Kosiaris: [C: 1] "Looks good to me as well." [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: Milimetric)
[10:08:26] RECOVERY - puppet last run on mw1162 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[10:10:56] RECOVERY - puppet last run on mw1232 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:18:35] PROBLEM - RAID on ms-be1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:19:05] PROBLEM - DPKG on ms-be1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:19:26] (CR) Jcrespo: [C: 1] "Do not veto." [puppet/mariadb] - https://gerrit.wikimedia.org/r/242452 (owner: Alexandros Kosiaris)
[10:20:06] RECOVERY - RAID on ms-be1012 is OK: OK: optimal, 14 logical, 14 physical
[10:20:35] RECOVERY - DPKG on ms-be1012 is OK: All packages OK
[10:24:28] (PS1) Faidon Liambotis: sslcert: fix update-ocsp's non-proxy mode [puppet] - https://gerrit.wikimedia.org/r/243133
[10:25:13] (CR) Faidon Liambotis: [C: 2] sslcert: fix update-ocsp's non-proxy mode [puppet] - https://gerrit.wikimedia.org/r/243133 (owner: Faidon Liambotis)
[10:28:15] PROBLEM - salt-minion processes on ms-be1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:28:15] PROBLEM - swift-container-replicator on ms-be1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:28:35] PROBLEM - swift-account-server on ms-be1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:28:35] PROBLEM - swift-container-updater on ms-be1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:29:46] PROBLEM - swift-object-replicator on ms-be1012 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:30:45] ^ looking, it was rebuilding one disk
[10:30:46] operations, Database: Puppetize grants for mysql analytics servers - https://phabricator.wikimedia.org/T114476#1696609 (jcrespo) NEW a:jcrespo
[10:31:25] RECOVERY - swift-object-replicator on ms-be1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[10:31:45] RECOVERY - swift-container-replicator on ms-be1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[10:31:45] RECOVERY - salt-minion processes on ms-be1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:31:56] RECOVERY - swift-account-server on ms-be1012 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[10:31:56] RECOVERY - swift-container-updater on ms-be1012 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[10:41:03] operations, Database: Grant 'show view' permissions on s1-analytics-slave/jmorgan to user jmorgan - https://phabricator.wikimedia.org/T114396#1696646 (jcrespo) @Capt_Swing these are your new grants: ``` MariaDB ANALYTICS localhost (none) > SHOW GRANTS FOR 'jmorgan'@'%'; +----------------------------------...
[10:41:07] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[10:43:05] (PS1) Giuseppe Lavagetto: Add support for http_status to ProxyFetch [debs/pybal] - https://gerrit.wikimedia.org/r/243139 (https://phabricator.wikimedia.org/T102393)
[10:43:31] <_joe_> paravoid: ^^ (both lines above :P)
[10:44:36] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[10:53:27] !log restarting HHVM on mw1130
[10:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:54:45] RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.105 second response time
[10:54:45] RECOVERY - HHVM rendering on mw1130 is OK: HTTP OK: HTTP/1.1 200 OK - 70212 bytes in 0.617 second response time
[10:56:04] so...close...to..0...alerts
[11:03:38] !log installed rpcbind security updates on all Ubuntu servers which run it (jessie was already updated, since the DSA was released earlier)
[11:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:06:36] PROBLEM - puppet last run on mw1029 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:07:35] (CR) John Vandenberg: Fixed PEP-8 issues (11 comments) [dumps] - https://gerrit.wikimedia.org/r/207504 (owner: Dereckson)
[11:08:49] operations, Traffic, Patch-For-Review, Pybal: Make pybal accept 30[12] for ProxyFetch - https://phabricator.wikimedia.org/T102393#1696702 (Joe) a:BBlack>Joe
[11:11:42] (PS1) Muehlenhoff: WIP: Hiera-based assignment of grains for debdeploy [puppet] - https://gerrit.wikimedia.org/r/243142 (https://phabricator.wikimedia.org/T111006)
[11:17:30] (CR) John Vandenberg: [C: 1] "ping" [dumps] - https://gerrit.wikimedia.org/r/207699 (owner: Dereckson)
[11:20:12] (CR) John Vandenberg: [C: 1] More UNIX agnostic, less GNU/Linux-centric scripts [dumps] - https://gerrit.wikimedia.org/r/207694 (owner: Dereckson)
[11:21:46] RECOVERY - puppet last run on mw1029 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[11:36:36] (PS2) Muehlenhoff: Hiera-based assignment of grains for debdeploy [puppet] - https://gerrit.wikimedia.org/r/243142 (https://phabricator.wikimedia.org/T111006)
[11:38:05] (PS7) Alexandros Kosiaris: Add the new OTRS scheduler watchdog cron entry [puppet] - https://gerrit.wikimedia.org/r/242184
[11:38:07] (PS3) Alexandros Kosiaris: otrs: Ship systemd unit file for OTRS scheduler [puppet] - https://gerrit.wikimedia.org/r/242860
[11:41:27] (PS3) Alexandros Kosiaris: Add a .gitreview file [puppet/mariadb] - https://gerrit.wikimedia.org/r/242453
[11:41:29] (PS3) Alexandros Kosiaris: actually use is_critical in monitor_replication [puppet/mariadb] - https://gerrit.wikimedia.org/r/242452
[11:44:37] (CR) Alexandros Kosiaris: "Kind of changed my mind again. I only fixed the erroneous usage of is_critical in monitor_replication and added some comments to revisit t" [puppet/mariadb] - https://gerrit.wikimedia.org/r/242452 (owner: Alexandros Kosiaris)
[11:44:46] (CR) Alexandros Kosiaris: [C: 2 V: 2] actually use is_critical in monitor_replication [puppet/mariadb] - https://gerrit.wikimedia.org/r/242452 (owner: Alexandros Kosiaris)
[11:45:44] (CR) Alexandros Kosiaris: [C: 2] otrs: Ship systemd unit file for OTRS scheduler [puppet] - https://gerrit.wikimedia.org/r/242860 (owner: Alexandros Kosiaris)
[11:46:02] (CR) Alexandros Kosiaris: [C: 2] Add the new OTRS scheduler watchdog cron entry [puppet] - https://gerrit.wikimedia.org/r/242184 (owner: Alexandros Kosiaris)
[11:50:18] (PS1) Alexandros Kosiaris: otrs: move systemd unit file into correct location [puppet] - https://gerrit.wikimedia.org/r/243143
[11:50:47] (CR) Alexandros Kosiaris: [C: 2 V: 2] otrs: move systemd unit file into correct location [puppet] - https://gerrit.wikimedia.org/r/243143 (owner: Alexandros Kosiaris)
[12:00:56] (PS1) Alexandros Kosiaris: otrs: scheduler is forking, set Type accordingly [puppet] - https://gerrit.wikimedia.org/r/243146
[12:01:23] <_joe_> akosiaris: you're enjoying the last bit of it, heh
[12:01:48] * akosiaris thrilled
[12:02:08] (CR) Alexandros Kosiaris: [C: 2] otrs: scheduler is forking, set Type accordingly [puppet] - https://gerrit.wikimedia.org/r/243146 (owner: Alexandros Kosiaris)
[12:06:49] (PS1) Alexandros Kosiaris: mariadb: update submodule in production repo [puppet] - https://gerrit.wikimedia.org/r/243148
[12:11:27] (PS4) Muehlenhoff: Create ferm rules for Hadoop NameNode and ResourceManager for master and standby [puppet] - https://gerrit.wikimedia.org/r/237335
[12:13:18] (CR) Muehlenhoff: [C: 2 V: 2] Create ferm rules for Hadoop NameNode and ResourceManager for master and standby [puppet] - https://gerrit.wikimedia.org/r/237335 (owner: Muehlenhoff)
[12:16:18] (PS1) Muehlenhoff: Enable ferm on analytics1002 [puppet] - https://gerrit.wikimedia.org/r/243150
[12:16:19] (PS1) Muehlenhoff: Enable ferm on analytics1001 [puppet] - https://gerrit.wikimedia.org/r/243151
[12:19:43] (PS1) Alexandros Kosiaris: otrs: Fix typo in scheduler cron entry [puppet] - https://gerrit.wikimedia.org/r/243152
[12:20:31] (CR) Alexandros Kosiaris: [C: 2 V: 2] otrs: Fix typo in scheduler cron entry [puppet] - https://gerrit.wikimedia.org/r/243152 (owner: Alexandros Kosiaris)
[12:20:43] (PS2) Alexandros Kosiaris: otrs: Fix typo in scheduler cron entry [puppet] - https://gerrit.wikimedia.org/r/243152
[12:20:48] (CR) Alexandros Kosiaris: [V: 2] otrs: Fix typo in scheduler cron entry [puppet] - https://gerrit.wikimedia.org/r/243152 (owner: Alexandros Kosiaris)
[12:58:56] PROBLEM - puppet last run on db2060 is CRITICAL: CRITICAL: puppet fail
[13:13:13] Hi ops folks, we're having some reports on enwiki of user scripts/gadgets intermittently failing to load for some users
[13:13:27] See WP:VPT and https://github.com/azatoth/twinkle/issues/294
[13:13:43] May be connected to T114462 as well
[13:14:06] RECOVERY - puppet last run on db2060 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[13:14:28] <_joe_> tto: I was about to ask what points to production problems directly :)
[13:15:07] <_joe_> what I see there seems to be related to javascript errors
[13:15:32] Hm, yes the latter task seems to be unrelated
[13:16:19] operations, Continuous-Integration-Config, Patch-For-Review: Forbid quoted booleans in puppet manifests - https://phabricator.wikimedia.org/T113783#1696901 (hashar) a:Andrew
[13:16:32] All roads point to ContentTranslation script error, I'll look into that instead!
[13:16:39] <_joe_> so, let me look at VP:T
[13:16:45] operations, Continuous-Integration-Config, Patch-For-Review: Forbid quoted booleans in puppet manifests - https://phabricator.wikimedia.org/T113783#1675730 (hashar) p:Triage>Normal
[13:16:45] <_joe_> I was about to say that
[13:16:59] <_joe_> I'll take a look at cxserver for sanity too
[13:29:39] (PS7) Andrew Bogott: Cassandra: dequote some booleans. [puppet] - https://gerrit.wikimedia.org/r/241238 (https://phabricator.wikimedia.org/T113783)
[13:31:55] (CR) Andrew Bogott: [C: 2] Cassandra: dequote some booleans. [puppet] - https://gerrit.wikimedia.org/r/241238 (https://phabricator.wikimedia.org/T113783) (owner: Andrew Bogott)
[13:33:32] (PS2) Andrew Bogott: puppet-lint: enable quoted_booleans-check [puppet] - https://gerrit.wikimedia.org/r/241111 (https://phabricator.wikimedia.org/T113783) (owner: Hashar)
[13:37:03] <_joe_> tto: no smoking guns in cxserver logs either
[13:37:46] _joe_: I think the CX people have been messing around with RL dependencies, which is somehow causing the problem
[13:37:51] So not an ops issue :)
[13:38:04] Thanks for checking the server though
[13:40:33] operations: Import Wikimania 2015 Videos - https://phabricator.wikimedia.org/T106565#1696950 (Hydriz)
[13:40:54] <_joe_> tto: yw
[13:44:27] (CR) Hashar: "Note since quoted_boolean is just a warning, that is only caught by the strict job which is NOT voting." [puppet] - https://gerrit.wikimedia.org/r/241111 (https://phabricator.wikimedia.org/T113783) (owner: Hashar)
[13:44:43] (PS1) Andrew Bogott: Varnish: Add lint:ignore:quoted_booleans around a boolean that needs quoting. [puppet] - https://gerrit.wikimedia.org/r/243176 (https://phabricator.wikimedia.org/T113783)
[13:53:05] (PS2) Andrew Bogott: Varnish: Add lint:ignore:quoted_booleans around a boolean that needs quoting. [puppet] - https://gerrit.wikimedia.org/r/243176 (https://phabricator.wikimedia.org/T113783)
[13:54:24] (CR) Andrew Bogott: [C: 2] Varnish: Add lint:ignore:quoted_booleans around a boolean that needs quoting. [puppet] - https://gerrit.wikimedia.org/r/243176 (https://phabricator.wikimedia.org/T113783) (owner: Andrew Bogott)
[13:55:15] (PS3) Andrew Bogott: puppet-lint: enable quoted_booleans-check [puppet] - https://gerrit.wikimedia.org/r/241111 (https://phabricator.wikimedia.org/T113783) (owner: Hashar)
[13:57:21] (CR) Andrew Bogott: [C: 2] puppet-lint: enable quoted_booleans-check [puppet] - https://gerrit.wikimedia.org/r/241111 (https://phabricator.wikimedia.org/T113783) (owner: Hashar)
[14:01:40] (PS1) Andrew Bogott: puppet-lint: Turn on --no-puppet_url_without_modules-check [puppet] - https://gerrit.wikimedia.org/r/243177
[14:03:30] (Abandoned) Faidon Liambotis: otrs: disable SessionCheckRemoteIP [puppet] - https://gerrit.wikimedia.org/r/242789 (https://phabricator.wikimedia.org/T87217) (owner: Faidon Liambotis)
[14:09:03] (CR) Jgreen: "Are you sure about precedence? My memory is that settings in config.pm take precedence over the dynamic ones." [puppet] - https://gerrit.wikimedia.org/r/242789 (https://phabricator.wikimedia.org/T87217) (owner: Faidon Liambotis)
[14:10:57] operations, OTRS, Security: Make OTRS sessions IP-address-agnostic - https://phabricator.wikimedia.org/T87217#1696982 (Steinsplitter)
[14:12:47] (PS1) Andrew Bogott: labstore: rearrange args to cleanup_snapshots [puppet] - https://gerrit.wikimedia.org/r/243178
[14:12:49] (PS1) Andrew Bogott: analytics: modernize the ensure => link syntax in a couple of places. [puppet] - https://gerrit.wikimedia.org/r/243179
[14:12:55] operations, Analytics, Discovery, EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1696984 (mobrovac) >>! In T114443#1695708, @GWicke wrote: > 1) Provide edit related events (ex: edit, creation, deletion, revision deletion, rename). Consumers: RESTBase / change propa...
[14:18:03] (PS2) Andrew Bogott: analytics: modernize the ensure => link syntax in a couple of places. [puppet] - https://gerrit.wikimedia.org/r/243179
[14:18:05] (PS2) Andrew Bogott: labstore: rearrange args to cleanup_snapshots [puppet] - https://gerrit.wikimedia.org/r/243178
[14:20:10] (PS1) Alexandros Kosiaris: otrs: disable the scheduler watchdog [puppet] - https://gerrit.wikimedia.org/r/243182
[14:21:51] (PS2) coren: Disable NFS lookup cache on NFS client instances [puppet] - https://gerrit.wikimedia.org/r/241663 (https://phabricator.wikimedia.org/T106170)
[14:24:37] (CR) coren: [C: 2] "Nihil obstat." [puppet] - https://gerrit.wikimedia.org/r/241663 (https://phabricator.wikimedia.org/T106170) (owner: coren)
[14:27:10] (CR) Andrew Bogott: "tested with puppet compiler on analytics1015.eqiad.wmnet" [puppet] - https://gerrit.wikimedia.org/r/243179 (owner: Andrew Bogott)
[14:27:27] (CR) Andrew Bogott: "tested with puppet compiler on labstore1001" [puppet] - https://gerrit.wikimedia.org/r/243178 (owner: Andrew Bogott)
[14:27:46] PROBLEM - puppet last run on mw2118 is CRITICAL: CRITICAL: puppet fail
[14:31:39] (PS3) Andrew Bogott: labstore: rearrange args to cleanup_snapshots [puppet] - https://gerrit.wikimedia.org/r/243178
[14:31:45] (PS3) Andrew Bogott: analytics: modernize the ensure => link syntax in a couple of places. [puppet] - https://gerrit.wikimedia.org/r/243179
[14:32:01] !log Change of NFS mount options in labs pushed - puppet may report failures to refresh the mounts (once) on instances; expected and harmless.
[14:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:32:40] (CR) Andrew Bogott: [C: 2] labstore: rearrange args to cleanup_snapshots [puppet] - https://gerrit.wikimedia.org/r/243178 (owner: Andrew Bogott)
[14:33:09] (CR) Andrew Bogott: [C: 2] analytics: modernize the ensure => link syntax in a couple of places. [puppet] - https://gerrit.wikimedia.org/r/243179 (owner: Andrew Bogott)
[14:33:32] (CR) Eevans: [C: 1] update collector version [software/cassandra-metrics-collector] - https://gerrit.wikimedia.org/r/243121 (https://phabricator.wikimedia.org/T113733) (owner: Filippo Giunchedi)
[14:33:34] (PS2) Andrew Bogott: puppet-lint: Turn on --no-puppet_url_without_modules-check [puppet] - https://gerrit.wikimedia.org/r/243177
[14:36:09] <_joe_> andrewbogott: you're on fire!
[14:36:59] _joe_: ‘Never again!’ I want to make the quoted-boolean test voting so things don’t backslide [14:37:27] !log providing extra grants to wiki db users for heartbeat monitoring [14:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:38:20] (03PS3) 10Andrew Bogott: puppet-lint: Turn off --no-puppet_url_without_modules-check [puppet] - 10https://gerrit.wikimedia.org/r/243177 [14:38:42] hm, a commit message typo made that last patch seem much more ambitious than it really was [14:39:34] maybe? double negatives are confusing in english [14:39:57] (03PS4) 10Andrew Bogott: puppet-lint: Turn on --no-puppet_url_without_modules-check [puppet] - 10https://gerrit.wikimedia.org/r/243177 [14:41:19] 6operations, 3labs-sprint-116: Quoted booleans probably stopping a lot of pages - https://phabricator.wikimedia.org/T113781#1697008 (10Andrew) 5Open>3Resolved [14:42:23] 6operations, 10Continuous-Integration-Config, 5Patch-For-Review: Forbid quoted booleans in puppet manifests - https://phabricator.wikimedia.org/T113783#1697010 (10Andrew) These are now caught by the strict lint check. That check isn't voting, however. [14:45:21] 7Puppet, 6operations, 5Patch-For-Review: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#1697034 (10Andrew) I would like us to just disable puppet_url_without_modules-check for now, and then make the strict test voting. That will lock in all the lint fix... 
[14:45:28] (03CR) 10Hashar: "We had a few discussions about puppet_url_without_modules in T87132" [puppet] - 10https://gerrit.wikimedia.org/r/243177 (owner: 10Andrew Bogott) [14:45:42] (03PS5) 10Andrew Bogott: puppet-lint: Turn on --no-puppet_url_without_modules-check [puppet] - 10https://gerrit.wikimedia.org/r/243177 (https://phabricator.wikimedia.org/T87132) [14:48:14] 7Puppet, 6operations, 5Patch-For-Review: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#1697055 (10hashar) >>! In T87132#1697034, @Andrew wrote: > I would like us to just disable puppet_url_without_modules-check for now, and then make the strict test vot... [14:51:00] so [14:51:11] puppet-lint is going to vote -1 on warnings !!!!!!!!!!! [14:54:38] (03CR) 10Hashar: [C: 031] "Since we are warning free and puppet_url_without_modules is a bit harder to properly fix. It is probably easier to just ignore it for now" [puppet] - 10https://gerrit.wikimedia.org/r/243177 (https://phabricator.wikimedia.org/T87132) (owner: 10Andrew Bogott) [14:54:46] RECOVERY - puppet last run on mw2118 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [14:57:06] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [500.0] [14:57:20] (03CR) 10BBlack: [C: 031] park border-wikipedia.de [dns] - 10https://gerrit.wikimedia.org/r/241122 (owner: 10Dzahn) [14:58:08] (03CR) 10BBlack: [C: 031] park visualwikipedia domains [dns] - 10https://gerrit.wikimedia.org/r/197362 (owner: 10Dzahn) [14:58:20] (03CR) 10BBlack: [C: 031] add wiki.voyage as parked domain [dns] - 10https://gerrit.wikimedia.org/r/243091 (https://phabricator.wikimedia.org/T88851) (owner: 10Dzahn) [14:58:41] (03CR) 10BBlack: [C: 04-1] add wikipedia.es with link to wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/243099 (https://phabricator.wikimedia.org/T101060) (owner: 10Dzahn) [14:58:53] (03CR) 10BBlack: [C: 031] park 
wikiquotes.info [dns] - 10https://gerrit.wikimedia.org/r/243100 (https://phabricator.wikimedia.org/T106114) (owner: 10Dzahn) [14:59:08] (03CR) 10BBlack: [C: 031] park wikipedia.lol [dns] - 10https://gerrit.wikimedia.org/r/243101 (https://phabricator.wikimedia.org/T88861) (owner: 10Dzahn) [14:59:24] (03CR) 10BBlack: [C: 031] unlink wikimedia.xyz from wikimedia.com and park it [dns] - 10https://gerrit.wikimedia.org/r/243103 (https://phabricator.wikimedia.org/T92547) (owner: 10Dzahn) [14:59:33] (03CR) 10Eevans: cassandra: new metrics-collector version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/243127 (https://phabricator.wikimedia.org/T113733) (owner: 10Filippo Giunchedi) [15:00:21] (03CR) 10Andrew Bogott: [C: 032] puppet-lint: Turn on --no-puppet_url_without_modules-check [puppet] - 10https://gerrit.wikimedia.org/r/243177 (https://phabricator.wikimedia.org/T87132) (owner: 10Andrew Bogott) [15:04:06] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:04:12] (03PS1) 10ArielGlenn: page title dumps: skip labswiki, it's not accessible to snapshots [puppet] - 10https://gerrit.wikimedia.org/r/243188 [15:05:26] (03CR) 10ArielGlenn: [C: 032] page title dumps: skip labswiki, it's not accessible to snapshots [puppet] - 10https://gerrit.wikimedia.org/r/243188 (owner: 10ArielGlenn) [15:09:25] (03PS2) 10RobH: adding spage to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/242581 [15:09:46] (03PS1) 10Muehlenhoff: Bugfixes and finetuning [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/243190 [15:10:16] (03CR) 10RobH: [C: 032] adding spage to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/242581 (owner: 10RobH) [15:13:23] !log ori@tin Synchronized php-1.27.0-wmf.1/resources/src/mediawiki/mediawiki.js: I8029208: Don't clobber existing styles when adding more in IE9 (duration: 00m 17s) [15:13:25] MatmaRex: ^ [15:13:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:34] (03CR) 10Eevans: [C: 031] "(FWIW, )LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/242896 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [15:13:56] (03PS1) 10ArielGlenn: allow wikiqueries to skip dbs in a specified list [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/243191 [15:14:00] 10Ops-Access-Requests, 6operations: add spage to analytics-privatedata-users group for hive access - https://phabricator.wikimedia.org/T114150#1697198 (10RobH) 5Open>3Resolved No objections were raised in the 3 day wait, so I've gone ahead and merged your access into analytics-privatedata-users. Your acce... [15:15:01] (03CR) 10ArielGlenn: [C: 032 V: 032] allow wikiqueries to skip dbs in a specified list [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/243191 (owner: 10ArielGlenn) [15:15:53] (03CR) 10Muehlenhoff: [C: 032 V: 032] Bugfixes and finetuning [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/243190 (owner: 10Muehlenhoff) [15:23:00] (03PS1) 10ArielGlenn: dumps: skip labswiki for media listings, snpashots have no access [puppet] - 10https://gerrit.wikimedia.org/r/243193 [15:24:04] (03CR) 10ArielGlenn: [C: 032] dumps: skip labswiki for media listings, snpashots have no access [puppet] - 10https://gerrit.wikimedia.org/r/243193 (owner: 10ArielGlenn) [15:27:40] (03PS1) 10ArielGlenn: for media lists, be able to skip dbs in a specified list [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/243194 [15:29:00] (03CR) 10ArielGlenn: [C: 032 V: 032] for media lists, be able to skip dbs in a specified list [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/243194 (owner: 10ArielGlenn) [15:29:25] (03PS4) 10RobH: admin: add dpatrick to statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/242163 (https://phabricator.wikimedia.org/T114119) (owner: 10John F. 
Lewis) [15:29:57] (03CR) 10RobH: [C: 032] admin: add dpatrick to statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/242163 (https://phabricator.wikimedia.org/T114119) (owner: 10John F. Lewis) [15:30:03] 6operations, 10Traffic, 6WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#1697267 (10BBlack) >>! In T101048#1694267, @Slaporte wrote: >> Do we have / can we produce a list of all domains registered to us globally, with any registr... [15:32:01] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Analytics statistics-users access on stat1002 for dpatrick - https://phabricator.wikimedia.org/T114119#1697288 (10RobH) 5Open>3Resolved No objections were raised during the 3 day waiting period, so I've gone ahead and merged this access live. [15:38:54] 6operations, 10Traffic, 6WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#1697314 (10Slaporte) >>! In T101048#1697267, @BBlack wrote: >>>! In T101048#1694267, @Slaporte wrote: >>> Do we have / can we produce a list of all domains... 
[15:44:27] (03PS1) 10Faidon Liambotis: sslcert: add preamble for sslcert::dhparam [puppet] - 10https://gerrit.wikimedia.org/r/243195 [15:44:29] (03PS1) 10Faidon Liambotis: sslcert: add --config argument to update-ocsp [puppet] - 10https://gerrit.wikimedia.org/r/243196 [15:44:31] (03PS1) 10Faidon Liambotis: tlsproxy: fix a couple of OCSP-related dependencies [puppet] - 10https://gerrit.wikimedia.org/r/243197 [15:44:33] (03PS1) 10Faidon Liambotis: tlsproxy: switch update-ocsp(-all) to config files [puppet] - 10https://gerrit.wikimedia.org/r/243198 [15:44:35] (03PS1) 10Faidon Liambotis: tlsproxy: add support for update-ocsp-all hooks [puppet] - 10https://gerrit.wikimedia.org/r/243199 [15:44:37] (03PS1) 10Faidon Liambotis: Move tlsproxy's OCSP stapler/updater to sslcert [puppet] - 10https://gerrit.wikimedia.org/r/243200 [15:44:39] (03PS1) 10Faidon Liambotis: tlsproxy: inline ocsp_stapler/ocsp_updater [puppet] - 10https://gerrit.wikimedia.org/r/243201 [15:44:41] (03PS1) 10Faidon Liambotis: mail: add OCSP stapling to role::mail::mx [puppet] - 10https://gerrit.wikimedia.org/r/243202 [15:44:45] bblack: ^^^ for your preliminary review, completely untested so far :) [15:44:52] :P [15:45:37] (03CR) 10BBlack: [C: 031] sslcert: add preamble for sslcert::dhparam [puppet] - 10https://gerrit.wikimedia.org/r/243195 (owner: 10Faidon Liambotis) [15:46:03] gotta go, bbiab [15:50:35] RECOVERY - Disk space on labstore1002 is OK: DISK OK [15:55:45] (03PS1) 10ArielGlenn: addschanges dumps: skip labswiki, not reachable from snapshots [puppet] - 10https://gerrit.wikimedia.org/r/243204 [15:56:39] (03CR) 10ArielGlenn: [C: 032] addschanges dumps: skip labswiki, not reachable from snapshots [puppet] - 10https://gerrit.wikimedia.org/r/243204 (owner: 10ArielGlenn) [15:57:18] (03CR) 10BBlack: [C: 04-1] "By moving the declaration of sslcert::ocsp::hook for nginx into tlsproxy::localssl, we'd end up with multiple declarations, as a single s" [puppet] - 10https://gerrit.wikimedia.org/r/243201 (owner: 
10Faidon Liambotis) [15:57:36] (03PS1) 10coren: Labstore: fix disk check false positive [puppet] - 10https://gerrit.wikimedia.org/r/243205 (https://phabricator.wikimedia.org/T113435) [16:00:28] (03PS1) 10ArielGlenn: addschanges dumps: allow dbs in a specified list to be skipped [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/243206 [16:00:58] (03CR) 10coren: [C: 032] "Trivial fix to hieradata with strictly contained scope." [puppet] - 10https://gerrit.wikimedia.org/r/243205 (https://phabricator.wikimedia.org/T113435) (owner: 10coren) [16:01:06] (03PS2) 10coren: Labstore: fix disk check false positive [puppet] - 10https://gerrit.wikimedia.org/r/243205 (https://phabricator.wikimedia.org/T113435) [16:01:35] (03CR) 10ArielGlenn: [C: 032] addschanges dumps: allow dbs in a specified list to be skipped [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/243206 (owner: 10ArielGlenn) [16:01:44] (03CR) 10ArielGlenn: [V: 032] addschanges dumps: allow dbs in a specified list to be skipped [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/243206 (owner: 10ArielGlenn) [16:06:58] (03CR) 10Ottomata: [C: 031] "Let's do this one today, and wait til Monday for the analytics1001 one?" [puppet] - 10https://gerrit.wikimedia.org/r/243150 (owner: 10Muehlenhoff) [16:07:55] akosiaris: joal just posted this: [16:07:56] https://phabricator.wikimedia.org/T107056#1697401 [16:08:06] need hole poked in Analytics VLAN for AQS cassandra [16:08:31] joal: just that port/ [16:08:33] ? [16:08:39] I think so yes [16:08:54] maybe we should get port 80 too? [16:09:01] we're not exposing this to the real world yet, right? [16:09:13] and restbase will serve on 80? [16:09:16] or will that be behind varnish? [16:09:30] ottomata: I think it'll be behind varnish [16:09:54] ok, will it be public though? immediately? [16:10:01] accessing aqs on restbqse port from statXXXX is a good idea ! 
[16:10:32] ottomata: I don't think we'll make it public immediatly, but soon [16:12:42] k, yeah so we should just ask for that too [16:12:49] Cool [16:12:54] I don't remember th [16:13:06] though what restbase default port is [16:13:57] Puppet has it parameterized ottomata [16:16:34] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1697452 (10Ottomata) Edit events would be awesome and totally doable with this MVP, but I'm a little worried about the amount of bike shedding that will go into designing that schema! W... [16:29:40] !log deployed grafana 8e92884bae (with backport of upstream fb9f9548829f2d4cecf35cda933700e5c2fa1bd6) [16:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:41:18] 6operations, 10ops-eqiad: db1051 degraded raid (disk) - https://phabricator.wikimedia.org/T113786#1697509 (10jcrespo) 5Open>3Resolved a:3jcrespo ``` Device Present ================ Virtual Drives : 1 Degraded : 0 Offline : 0 Physical Devices : 14... 
[16:51:47] (03PS2) 10Dzahn: add wiki.voyage as parked domain [dns] - 10https://gerrit.wikimedia.org/r/243091 (https://phabricator.wikimedia.org/T88851) [16:52:10] (03CR) 10Dzahn: [C: 032] add wiki.voyage as parked domain [dns] - 10https://gerrit.wikimedia.org/r/243091 (https://phabricator.wikimedia.org/T88851) (owner: 10Dzahn) [17:06:54] (03PS1) 10Ori.livneh: Configure MultiWriteBagOStuff for ResourceLoader [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243212 [17:10:22] (03CR) 10Ori.livneh: [C: 032] Configure MultiWriteBagOStuff for ResourceLoader [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243212 (owner: 10Ori.livneh) [17:10:27] (03Merged) 10jenkins-bot: Configure MultiWriteBagOStuff for ResourceLoader [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243212 (owner: 10Ori.livneh) [17:12:31] !log ori@tin Synchronized wmf-config/CommonSettings.php: I8318fe892: Configure MultiWriteBagOStuff for ResourceLoader (duration: 00m 17s) [17:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:19:31] (03CR) 1020after4: [C: 032] Add config deployment [tools/scap] - 10https://gerrit.wikimedia.org/r/240292 (https://phabricator.wikimedia.org/T109512) (owner: 10Thcipriani) [17:20:16] (03Merged) 10jenkins-bot: Add config deployment [tools/scap] - 10https://gerrit.wikimedia.org/r/240292 (https://phabricator.wikimedia.org/T109512) (owner: 10Thcipriani) [17:21:00] (weee ^^ ) [17:25:36] (03PS1) 10Ori.livneh: admin/ori: update deployment scripts for new branch name format [puppet] - 10https://gerrit.wikimedia.org/r/243215 [17:26:09] (03CR) 10Ori.livneh: [C: 032 V: 032] admin/ori: update deployment scripts for new branch name format [puppet] - 10https://gerrit.wikimedia.org/r/243215 (owner: 10Ori.livneh) [17:26:27] (03PS2) 10Dzahn: park wikiquotes.info [dns] - 10https://gerrit.wikimedia.org/r/243100 (https://phabricator.wikimedia.org/T106114) [17:27:18] (03CR) 10Dzahn: [C: 032] park wikiquotes.info [dns] - 
10https://gerrit.wikimedia.org/r/243100 (https://phabricator.wikimedia.org/T106114) (owner: 10Dzahn) [17:29:05] 6operations: determine new swift ms-be hostnames (codfw/eqiad) - https://phabricator.wikimedia.org/T114500#1697674 (10RobH) 3NEW a:3fgiunchedi [17:29:26] 6operations: determine new swift ms-be hostnames (codfw/eqiad) - https://phabricator.wikimedia.org/T114500#1697684 (10RobH) These are the hosts being ordered on https://rt.wikimedia.org/Ticket/Display.html?id=9624 [17:36:00] (03PS1) 10Ottomata: Set python3 PYTHONHASHSEED for spark + python3 compatibility [puppet/cdh] - 10https://gerrit.wikimedia.org/r/243217 (https://phabricator.wikimedia.org/T113419) [17:38:01] (03PS2) 10Dzahn: park wikipedia.lol [dns] - 10https://gerrit.wikimedia.org/r/243101 (https://phabricator.wikimedia.org/T88861) [17:39:46] (03CR) 10Dzahn: [C: 032] park wikipedia.lol [dns] - 10https://gerrit.wikimedia.org/r/243101 (https://phabricator.wikimedia.org/T88861) (owner: 10Dzahn) [17:40:23] ori: nice trick: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/admin/files/home/ori/.bash_profile;6538eb295bbb3427688b69945ce776f4369f7ca5$113 [17:40:37] thanks :) [17:40:55] I saw the .hosts and thought "is .hosts a thing?" and duckduckgo said no [17:40:55] greg-g: i am esp. 
proud of https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/admin/files/home/ori/.bash_profile;6538eb295bbb3427688b69945ce776f4369f7ca5$22 [17:41:10] colorizes the hostname on the prompt based on a hash of the hostname [17:41:16] hah [17:41:18] after a while you associate certain colors with hosts [17:41:22] that's awesome [17:41:26] makes it easier when managing multiple tabs [17:41:30] tin is blue, for example :) [17:41:32] I had mine manually set, but gave up over time [17:44:26] (03CR) 10Ottomata: [C: 032 V: 032] Set python3 PYTHONHASHSEED for spark + python3 compatibility [puppet/cdh] - 10https://gerrit.wikimedia.org/r/243217 (https://phabricator.wikimedia.org/T113419) (owner: 10Ottomata) [17:45:04] (03PS1) 10Ottomata: Update cdh submodule with python3 + spark fix [puppet] - 10https://gerrit.wikimedia.org/r/243219 (https://phabricator.wikimedia.org/T113419) [17:46:04] (03CR) 10Ottomata: [C: 032] Update cdh submodule with python3 + spark fix [puppet] - 10https://gerrit.wikimedia.org/r/243219 (https://phabricator.wikimedia.org/T113419) (owner: 10Ottomata) [17:47:34] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1697875 (10bd808) >>! In T114443#1697452, @Ottomata wrote: > What about api.php logging? See T108618. Those logs are still being collected via udp2log, whereas the edit ones do have a... [17:50:48] (03PS3) 10Dzahn: park border-wikipedia.de [dns] - 10https://gerrit.wikimedia.org/r/241122 [17:51:19] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1697887 (10Ottomata) @bd808, what's your timeline on this? Producing directly into Kafka is fine, but we are trying to do two things with this MVP: - centralize and standardize schemas... 
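The prompt trick described above (coloring the hostname based on a hash of the hostname) can be sketched roughly as follows. This is a minimal illustration in Python, not the contents of the linked .bash_profile: the CRC32 checksum, the six-color palette, and the names `COLORS`, `host_color`, and `colorize` are all assumptions made for the example.

```python
import zlib

# Hypothetical palette: ANSI foreground codes 31-36
# (red, green, yellow, blue, magenta, cyan).
COLORS = [31, 32, 33, 34, 35, 36]


def host_color(hostname: str) -> int:
    """Pick a stable ANSI color code from a checksum of the hostname.

    CRC32 is used here only because it is deterministic across runs;
    the real prompt may hash the name differently.
    """
    checksum = zlib.crc32(hostname.encode("utf-8"))
    return COLORS[checksum % len(COLORS)]


def colorize(hostname: str) -> str:
    """Wrap the hostname in ANSI escapes for its assigned color."""
    return f"\033[{host_color(hostname)}m{hostname}\033[0m"


if __name__ == "__main__":
    # The same host always gets the same color, so each terminal tab
    # becomes recognizable at a glance.
    for host in ("tin", "neon", "labstore1002"):
        print(colorize(host))
```

Because the color depends only on the name, nothing has to be configured per host. The tradeoff is that two hosts can land on the same color when the palette is small.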
[17:52:55] (03CR) 10Dzahn: [C: 032] park border-wikipedia.de [dns] - 10https://gerrit.wikimedia.org/r/241122 (owner: 10Dzahn) [17:53:25] !log rolling restart of all hadoop-yarn-nodemanagers to pick up python3 + spark fix [17:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:53:34] !log ori@tin Synchronized php-1.27.0-wmf.1/includes/resourceloader: I680f3fda66c5: Configure ResourceLoader-specific ObjectCache instance (duration: 00m 17s) [17:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:53:48] (03CR) 10Dzahn: "this pointed to wikipedia but was not configured on the Apaches, all you got was an error message" [dns] - 10https://gerrit.wikimedia.org/r/241122 (owner: 10Dzahn) [17:56:33] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1697908 (10bd808) >>! In T114443#1697887, @Ottomata wrote: > @bd808, what's your timeline on this? Producing directly into Kafka is fine, but we are trying to do two things with this MV... 
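Some background on the PYTHONHASHSEED change being rolled out here: Python 3 randomizes string hashes per process by default, so separate worker processes can disagree about the hash of the same string unless PYTHONHASHSEED is pinned. The sketch below only demonstrates that general Python behavior; it is not taken from the cdh module, and the helper name `string_hash_in_child` and the sample string are made up for the example.

```python
import os
import subprocess
import sys


def string_hash_in_child(seed: str, text: str = "wikimedia") -> int:
    """Return hash(text) as computed by a fresh Python process.

    The child process is started with PYTHONHASHSEED set to `seed`,
    which fixes the otherwise per-process random string hashing.
    """
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    # With a pinned seed, two independent processes agree on the hash.
    first = string_hash_in_child("0")
    second = string_hash_in_child("0")
    print(first == second)
```

With the seed pinned, every process that hashes the same key gets the same value, which is what hash-based partitioning across workers relies on.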
[17:58:16] PROBLEM - Hadoop NodeManager on analytics1039 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:00:16] (03PS2) 10Dzahn: unlink wikimedia.xyz from wikimedia.com and park it [dns] - 10https://gerrit.wikimedia.org/r/243103 (https://phabricator.wikimedia.org/T92547) [18:01:46] RECOVERY - Hadoop NodeManager on analytics1039 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:01:50] (03PS2) 10Thcipriani: Make deployment rev represent config state [tools/scap] - 10https://gerrit.wikimedia.org/r/243009 [18:02:05] (03CR) 10Dzahn: [C: 032] unlink wikimedia.xyz from wikimedia.com and park it [dns] - 10https://gerrit.wikimedia.org/r/243103 (https://phabricator.wikimedia.org/T92547) (owner: 10Dzahn) [18:02:32] (03CR) 10GWicke: [C: 031] Add Analytics Query Service role [puppet] - 10https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: 10Milimetric) [18:02:47] \o/ [18:03:07] (03PS3) 10Dzahn: park visualwikipedia domains [dns] - 10https://gerrit.wikimedia.org/r/197362 [18:04:44] 6operations, 6Labs, 10Labs-Infrastructure: install/setup labservices1001 - https://phabricator.wikimedia.org/T106584#1697943 (10Andrew) 5Open>3Resolved [18:04:59] 6operations, 6Labs, 10Labs-Infrastructure: install/setup labservices1001 - https://phabricator.wikimedia.org/T106584#1472236 (10Andrew) [18:06:59] (03CR) 10Dzahn: [C: 032] park visualwikipedia domains [dns] - 10https://gerrit.wikimedia.org/r/197362 (owner: 10Dzahn) [18:07:57] (03PS2) 10Dzahn: park softwarewikipedia domains [dns] - 10https://gerrit.wikimedia.org/r/241123 [18:10:17] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/rOAPAf3dc1fc7396e2dd899be7109aa3ca31ebfd706be" [dns] - 10https://gerrit.wikimedia.org/r/197362 (owner: 10Dzahn) [18:12:37] (03PS4) 10Chad: [WIP] Sync /srv/mediawiki-staging to co-masters [tools/scap] - 
10https://gerrit.wikimedia.org/r/224313 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis) [18:12:53] (03CR) 10Chad: "PS4 was a very manual rebase." [tools/scap] - 10https://gerrit.wikimedia.org/r/224313 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis) [18:13:27] PlayStation 4? [18:13:56] PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:14:58] Bsadowski1: I do all my rebasing on the Playstation 4, don't you? :D [18:16:16] (03PS14) 10Ottomata: Add Analytics Query Service role [puppet] - 10https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: 10Milimetric) [18:17:15] RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING [18:20:17] (03CR) 10Dzahn: [C: 031] Move ferm rules out of the module [puppet] - 10https://gerrit.wikimedia.org/r/242915 (owner: 10Muehlenhoff) [18:24:45] 6operations, 10Traffic, 6WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#1698026 (10Dzahn) @bblack looking at the google doc above.. are you gonna +1 if i add all of the missing ones as links to "parking" then? hmm [18:27:13] bd808: why for are o.ri and su.bbu showing up in this query: https://tools.wmflabs.org/sal/production?p=0&q=twentyafterfour%40tin&d= [18:28:11] the parser splits it into a search for "twenty.after.four OR tin" [18:28:18] oh, %40 [18:28:20] hrm [18:28:27] https://tools.wmflabs.org/sal/production?p=0&q=%22twentyafterfour%40tin%22&d= [18:28:40] perfect [18:28:54] ty [18:29:02] yw [18:29:56] (03CR) 10Jcrespo: [C: 04-1] "Need to roll in heartbeat and its grants to x1 and es*, at least, as a prerequisite." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: 10Aaron Schulz) [18:30:17] 6operations, 10Traffic, 6WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#1698049 (10BBlack) It's going to take me a couple days to process all of this, but the list is a good start. I'd like to be able to at least start binning... [18:31:13] (03PS3) 10Chad: scap: Add co-master configuration [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis) [18:31:18] greg-g: more gory details on how that search works are at the SAL stuff works just like the bash one [18:32:36] 6operations, 10Traffic, 6WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#1698060 (10Dzahn) Ok, sounds good. I'll try to help with putting them into categories. [18:32:52] (03PS4) 10Chad: scap: Add co-master configuration [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis) [18:32:56] bd808: bd [18:37:12] (03CR) 10Dzahn: "https://phabricator.wikimedia.org/rODNSb4989dcf69c5b3bfbca49c7a5f54836b4421ca85" [dns] - 10https://gerrit.wikimedia.org/r/241123 (owner: 10Dzahn) [18:38:12] 6operations, 10MediaWiki-Cache, 6Performance-Team, 7Availability: Setup a 3 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1698087 (10aaron) a:5aaron>3None [18:38:36] 6operations, 10Traffic, 6WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#1698088 (10BBlack) Yeah there's really a few different dimensions to categorize here. There's the "reasoning" categorization, as in "Is this held because i... 
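The fix for the SAL search above was to wrap the term in double quotes, which appear as %22 in the URL, so that "twentyafterfour@tin" is searched as one phrase instead of being split. A small sketch of building such a URL, assuming only the query-string layout visible in the links pasted in the channel; the function names are hypothetical.

```python
from urllib.parse import quote

# Base URL and parameter layout copied from the links in the log.
SAL_SEARCH = "https://tools.wmflabs.org/sal/production?p=0&q={query}&d="


def sal_phrase_query(phrase: str) -> str:
    """Percent-encode a phrase wrapped in double quotes.

    safe="" makes quote() encode every reserved character, including
    '@' (%40) and '"' (%22).
    """
    return quote(f'"{phrase}"', safe="")


def sal_search_url(phrase: str) -> str:
    """Build a SAL search URL that matches the whole phrase."""
    return SAL_SEARCH.format(query=sal_phrase_query(phrase))


if __name__ == "__main__":
    print(sal_search_url("twentyafterfour@tin"))
```

Running it prints the same URL as the second link in the exchange, the one that returned the expected results.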
[18:39:12] 6operations, 10MediaWiki-Cache, 6Performance-Team, 10procurement, 7Availability: Setup a 3 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1698089 (10ori) [18:39:51] !log kafka preferred-replica-election [18:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:42:48] (03CR) 10Dzahn: [C: 032] "not used. i only added the redirects myself in the past to avoid showing an Apache error. nowadays not worth the cost of getting certifica" [dns] - 10https://gerrit.wikimedia.org/r/241123 (owner: 10Dzahn) [18:45:13] (03PS3) 10Dzahn: add wikipedia.es as parked domain [dns] - 10https://gerrit.wikimedia.org/r/243099 (https://phabricator.wikimedia.org/T101060) [18:46:06] (03CR) 10Dzahn: "amended to "just parking", no redirect for wikipedia.es to es.wikipedia.org then" [dns] - 10https://gerrit.wikimedia.org/r/243099 (https://phabricator.wikimedia.org/T101060) (owner: 10Dzahn) [18:47:15] PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:47:56] (03CR) 10Ottomata: [C: 032] Add Analytics Query Service role [puppet] - 10https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: 10Milimetric) [18:48:46] RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING [18:50:00] (03PS4) 10Dzahn: add wikipedia.es as parked domain [dns] - 10https://gerrit.wikimedia.org/r/243099 (https://phabricator.wikimedia.org/T101060) [18:52:18] come on Spanish NIC. First "This TLD has no whois server" and then "www.nic.es uses an invalid security certificate." 
on the website supposed to replace it [18:52:26] PROBLEM - puppet last run on aqs1003 is CRITICAL: CRITICAL: puppet fail [18:52:28] PROBLEM - puppet last run on aqs1002 is CRITICAL: CRITICAL: puppet fail [18:52:28] PROBLEM - puppet last run on aqs1001 is CRITICAL: CRITICAL: puppet fail [18:52:53] that looks like the role got merged :) [18:53:18] first words "puppet fail" [18:53:47] PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:54:52] 6operations, 7Tracking: Upgrade Wikimedia servers to Ubuntu Trusty (14.04) (tracking) - https://phabricator.wikimedia.org/T65899#1698176 (10Neil_P._Quinn_WMF) [18:55:13] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1698184 (10Ottomata) Hm, @gwicke, if Mediawiki has access to schemas, and is relatively sure that it can produce valid messages for a given topic, why should a Mediawiki client use Event... [18:56:03] mutante: on it! [18:56:08] its a ganglia cluster problem [18:56:15] ottomata: something about missing ganglia.. .. what you said [18:57:05] RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING [18:57:46] mutante: do you know where I need to set this? looking in ganglia stuff... [18:57:51] not sure if it is inhiera [18:58:12] ottomata: cool that it's merged. i made an admin group to unblock that. it has the rights to control restbase/cassandra. it just doesnt have people in it becuase that is the access request [18:58:24] right ja [18:58:27] i know [18:59:07] ottomata: yes, hiera. so i think this: [18:59:10] found it! 
[18:59:20] role/eqiad/ganeti.yaml:cluster: ganeti [18:59:22] like that, right [18:59:46] 6operations, 10MediaWiki-Cache, 6Performance-Team, 10hardware-requests, 7Availability: Setup a 3 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1698215 (10ori) [18:59:58] (03PS1) 10Ottomata: Add aqs ganglia cluster [puppet] - 10https://gerrit.wikimedia.org/r/243230 [19:00:04] mutante: ^ [19:00:52] (03CR) 10Dzahn: [C: 031] ".. or you could already add an empty "codfw" too.. shrug" [puppet] - 10https://gerrit.wikimedia.org/r/243230 (owner: 10Ottomata) [19:00:54] (03CR) 10Ottomata: [C: 032] Add aqs ganglia cluster [puppet] - 10https://gerrit.wikimedia.org/r/243230 (owner: 10Ottomata) [19:00:58] ottomata: ah, right! [19:01:05] the cluster itself ,, yes [19:01:29] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1698223 (10GWicke) @ottomata, main reason would be the ability to work with $simple_queue, $binary_kafka, $amazon_queue and so on without changes in MW code. This isn't so theoretical. W... [19:02:28] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1698226 (10Ottomata) Oh, so the service is planned to be kafka agnostic? (I will come hang on your side after lunch :) ) [19:04:45] (03CR) 10Dzahn: [C: 032] "it's new that we have it at all. 
that redirect you see now will be gone once it points to our DNS" [dns] - 10https://gerrit.wikimedia.org/r/243099 (https://phabricator.wikimedia.org/T101060) (owner: 10Dzahn) [19:05:10] mutante: some puppet stuff might still be weird, will fix after lunch [19:05:33] ottomata: 'alright [19:05:43] RECOVERY - puppet last run on aqs1001 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [19:05:50] that looks good though :) [19:06:15] PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:06:16] 6operations, 10ops-eqiad: Return polonium/lead to spares - https://phabricator.wikimedia.org/T113962#1698239 (10Krenair) [19:06:32] oh, cool, looking good! [19:06:43] took two runs, so there are clearly some puppet deps that aren't right [19:06:44] RECOVERY - puppet last run on aqs1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:06:45] RECOVERY - puppet last run on aqs1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:07:53] RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING [19:08:15] RECOVERY - MariaDB Slave Lag: s1 on db1051 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [19:08:38] ok [19:09:04] ottomata: if it's only that .. 2 runs the very first time it runs is not extremely uncommon [19:23:36] 6operations, 10Traffic, 6WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#1698266 (10Dzahn) summary of progress today: category: "WMF owns it but it wasn't in our DNS zones yet, added as parked domain without traffic": wiki.voya... [19:29:47] 6operations, 7Database: Grant 'show view' permissions on s1-analytics-slave/jmorgan to user jmorgan - https://phabricator.wikimedia.org/T114396#1698286 (10Capt_Swing) thanks @jcrespo! 
[19:31:41] operations, Database: Grant 'show view' permissions on s1-analytics-slave/jmorgan to user jmorgan - https://phabricator.wikimedia.org/T114396#1698287 (jcrespo) Open>Resolved [19:33:10] operations, WMF-Legal, domains: get wiki.voyage? - https://phabricator.wikimedia.org/T88851#1698288 (Dzahn) [19:33:12] operations, WMF-Legal, domains: get wiki.voyage? - https://phabricator.wikimedia.org/T88851#1698294 (Dzahn) [19:33:19] operations, WMF-Legal, domains: get wiki.voyage? - https://phabricator.wikimedia.org/T88851#1698298 (Dzahn) Open>Resolved a:Dzahn [19:44:03] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [20:07:36] Contact group 'aqs-admins' ... is not defined anywhere! [20:07:55] (CR) QChris: [C: -1] "Since we're running an old gerrit, the corresponding old its-base" [puppet] - https://gerrit.wikimedia.org/r/242237 (owner: Daniel Kinzler) [20:08:01] ^ ottomata ? [20:10:04] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail [20:38:33] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [20:46:51] (PS1) Ori.livneh: Configure APCBagOStuff for ResourceLoader [mediawiki-config] - https://gerrit.wikimedia.org/r/243243 [20:47:26] (CR) Ori.livneh: [C: 2] Configure APCBagOStuff for ResourceLoader [mediawiki-config] - https://gerrit.wikimedia.org/r/243243 (owner: Ori.livneh) [20:47:31] (Merged) jenkins-bot: Configure APCBagOStuff for ResourceLoader [mediawiki-config] - https://gerrit.wikimedia.org/r/243243 (owner: Ori.livneh) [21:03:54] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/).
[21:06:02] !log ori@tin Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 17s) [21:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:07:14] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [21:15:10] !log mwscript cleanupRemovedModules.php --wiki test2wiki [21:15:14] !log mwscript cleanupRemovedModules.php --wiki nlwiki [21:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:15:16] !log mwscript cleanupRemovedModules.php --wiki dewiki [21:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:15:58] so have we killed /usr/local/apache/common now Krinkle? [21:16:21] Krenair: That was effectively gone yesterday already when we changed the cache key [21:16:27] :) [21:16:29] Ori also removed those rows already [21:16:38] This is removing rows for modules that no longer exist [21:17:25] On nlwiki it removed 2x500 rows in module_deps and 37x500 rows in msg_resource [21:17:28] RoanKattouw: ^ [21:17:36] (batches) [21:17:47] also 1x500 (197 actual) rows in msg_resource_links [21:21:19] !log mwscript cleanupRemovedModules.php --wiki testwikidatawiki [21:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:23:38] (PS1) ArielGlenn: create cache dir for staged dump runs [puppet] - https://gerrit.wikimedia.org/r/243249 [21:24:36] (CR) ArielGlenn: [C: 2] create cache dir for staged dump runs [puppet] - https://gerrit.wikimedia.org/r/243249 (owner: ArielGlenn) [21:28:21] (PS1) ArielGlenn: dumps: various fixes [dumps] (ariel) - https://gerrit.wikimedia.org/r/243251 [21:28:23] (PS1) ArielGlenn: dumps: more pylint of worker.py and related files [dumps] (ariel) - https://gerrit.wikimedia.org/r/243252
[21:33:00] !log mwscript cleanupRemovedModules.php --wiki zhwiktionary [21:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:34:38] (CR) ArielGlenn: [C: 2 V: 2] dumps: various fixes [dumps] (ariel) - https://gerrit.wikimedia.org/r/243251 (owner: ArielGlenn) [21:35:25] (Abandoned) Reedy: Add wiki.voyage to DNS [dns] - https://gerrit.wikimedia.org/r/242924 (https://phabricator.wikimedia.org/T88851) (owner: Reedy) [21:40:13] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [21:43:24] (CR) ArielGlenn: [C: 2 V: 2] dumps: more pylint of worker.py and related files [dumps] (ariel) - https://gerrit.wikimedia.org/r/243252 (owner: ArielGlenn) [21:45:13] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 210 seconds ago with 0 failures [21:52:19] operations, Traffic, WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#1698767 (Dzahn) >>! In T101048#1698088, @BBlack wrote: > - Is this set to our DNS servers at the registrar? (if not - MM servers for parking/redirect? we...
[21:56:03] PROBLEM - check_puppetrun on beryllium is CRITICAL: CRITICAL: Puppet has 57 failures [21:56:14] ^^^ looking [21:57:21] should be fine now [22:00:13] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [22:01:03] RECOVERY - check_puppetrun on beryllium is OK: OK: Puppet is currently enabled, last run 248 seconds ago with 0 failures [22:01:29] (PS1) Ottomata: Set datacenters in aqs restbase config [puppet] - https://gerrit.wikimedia.org/r/243333 [22:04:43] (CR) GWicke: [C: 1] Set datacenters in aqs restbase config [puppet] - https://gerrit.wikimedia.org/r/243333 (owner: Ottomata) [22:05:03] (CR) Ottomata: [C: 2] Set datacenters in aqs restbase config [puppet] - https://gerrit.wikimedia.org/r/243333 (owner: Ottomata) [22:05:13] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: Puppet has 1 failures [22:10:13] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [22:21:23] (PS1) ArielGlenn: update monitor script to import from library of modules [dumps] (ariel) - https://gerrit.wikimedia.org/r/243336 [22:22:43] operations, Patch-For-Review, domains: add support for wikimedia.xyz - https://phabricator.wikimedia.org/T92547#1698862 (Dzahn) The domain has been parked in DNS and won't get traffic anymore. The Apache redirect can be reverted.
[22:24:12] (CR) ArielGlenn: [C: 2 V: 2] update monitor script to import from library of modules [dumps] (ariel) - https://gerrit.wikimedia.org/r/243336 (owner: ArielGlenn) [22:26:07] (PS1) Dzahn: Revert "adding support to redirect wikimedia.xyz to wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/243338 [22:26:14] (CR) jenkins-bot: [V: -1] Revert "adding support to redirect wikimedia.xyz to wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/243338 (owner: Dzahn) [22:26:30] (PS2) Dzahn: Revert "adding support to redirect wikimedia.xyz to wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/243338 (https://phabricator.wikimedia.org/T92547) [22:26:34] (CR) jenkins-bot: [V: -1] Revert "adding support to redirect wikimedia.xyz to wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/243338 (https://phabricator.wikimedia.org/T92547) (owner: Dzahn) [22:27:35] operations, Patch-For-Review, domains: add support for wikimedia.xyz - https://phabricator.wikimedia.org/T92547#1698868 (Dzahn) @Robh ^ i'd revert your addition to the Apache config from back in the days. feel like reviewing/rebasing?
[22:31:14] (PS1) ArielGlenn: dumps monitor config: skip labswiki, not reachable from snapshots [puppet] - https://gerrit.wikimedia.org/r/243339 [22:32:22] (CR) ArielGlenn: [C: 2] dumps monitor config: skip labswiki, not reachable from snapshots [puppet] - https://gerrit.wikimedia.org/r/243339 (owner: ArielGlenn) [22:32:52] (PS1) Dzahn: apache: remove visualwikipedia redirects [puppet] - https://gerrit.wikimedia.org/r/243340 [22:33:59] (PS1) Dzahn: apache: remove wikiartpedia redirects [puppet] - https://gerrit.wikimedia.org/r/243341 [22:35:07] (PS1) Dzahn: apache: remove softwarewikipedia redirects [puppet] - https://gerrit.wikimedia.org/r/243342 [22:37:00] operations, Traffic, WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#1698898 (BBlack) Thanks! [22:37:22] (PS1) Dzahn: apache: remove webhostingwikipedia redirects [puppet] - https://gerrit.wikimedia.org/r/243344 [22:39:24] PROBLEM - Disk space on holmium is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=86%) [22:40:16] (PS1) Dzahn: apache: remove wikifamily redirects [puppet] - https://gerrit.wikimedia.org/r/243345 [22:42:19] (PS1) Dzahn: apache: remove wikidisclosure redirects [puppet] - https://gerrit.wikimedia.org/r/243347 [22:43:11] lol re. holmium disk [22:43:32] (PS1) Dzahn: apache: remove wikimaps redirects [puppet] - https://gerrit.wikimedia.org/r/243348 [22:46:14] !log holmium: apt-get clean for a little disk space - /var/log/designate/designate-mdns.log is more than half the size of / - needs logrotate [22:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:46:36] andrewbogott: ^ do we want to keep that log? [22:48:47] operations, Graphite: grafana access control - https://phabricator.wikimedia.org/T108546#1698932 (Tgr) >>! In T108546#1627937, @Tgr wrote: > Why don't we just switch to Grafana 2 then?
Already being tracked in T104738. [22:49:06] mutante: that log file is over 250GB big? [22:49:25] PROBLEM - Disk space on holmium is CRITICAL: DISK CRITICAL - free space: / 242 MB (2% inode=86%) [22:49:29] (assuming there's 500GB on /, according to Ganglia) [22:49:48] SPF|Cloud: no 6GB, but / is 9.1 [22:49:54] bblack: gilles subbu would appreciate if you guys could take a look, comment, award a token for consideration at the summit: https://phabricator.wikimedia.org/T114542 [22:49:58] okay [22:50:05] SPF|Cloud: i'm gzipping it now [22:50:13] mr. liambotis, you too when you're awake ^^^ [22:50:30] gzipping temp. needs space too, haha. [22:50:51] SPF|Cloud: the rest of the space you see is /srv [22:50:58] i'll just move it [22:51:03] RECOVERY - Disk space on holmium is OK: DISK OK [22:51:07] and make a ticket for logrotate [22:51:52] SPF|Cloud: actually, no, doesnt here [22:52:04] why not? [22:52:15] because of the /srv partition with more space? [22:59:21] dr0ptp4kt, will do. thanks. [22:59:28] subbu: thx [22:59:44] no, i just gzipped it in place. it was never 100% full, monitoring tells us at 95%, then i freed more apt-get clean [23:03:22] operations, Labs, Labs-Infrastructure: add logrotate for designate logs - https://phabricator.wikimedia.org/T114544#1698962 (Dzahn) [23:03:38] operations, Labs, Labs-Infrastructure: add logrotate for designate logs (holmium disk space) - https://phabricator.wikimedia.org/T114544#1698966 (Dzahn) [23:04:07] operations, Labs, Labs-Infrastructure: add logrotate for designate logs (holmium disk space) - https://phabricator.wikimedia.org/T114544#1698955 (Dzahn) [23:06:41] Ops-Access-Requests, operations, Patch-For-Review: Requesting access to stat1002 for JUnikowski_WMF - https://phabricator.wikimedia.org/T113298#1698970 (JUnikowski_WMF) Hi @RobH. Thanks a lot--I just managed to log into stat1002 through bastion!
[23:11:05] operations, Labs, Labs-Infrastructure: add logrotate for designate logs (holmium disk space) - https://phabricator.wikimedia.org/T114544#1698983 (Dzahn) moved to gzipped log to /srv which has lots of free space and is not used [23:20:22] (Abandoned) Dzahn: Revert "adding support to redirect wikimedia.xyz to wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/243338 (https://phabricator.wikimedia.org/T92547) (owner: Dzahn) [23:33:55] PROBLEM - puppet last run on aqs1001 is CRITICAL: CRITICAL: puppet fail [23:42:53] (PS1) Dzahn: apache: remove wikimedia.xyz redirect [puppet] - https://gerrit.wikimedia.org/r/243354 (https://phabricator.wikimedia.org/T92547) [23:45:53] (CR) Dzahn: [C: -1] "should be "ferm rules/service in role","base::firewall on node", "nothing in module"." [puppet] - https://gerrit.wikimedia.org/r/243123 (owner: Muehlenhoff) [23:48:30] (CR) Dzahn: [C: -1] "should be "ferm rules in role", "base::firewall on nodes", "nothing in module". that's been the rule per Alex, unless we changed that. add" [puppet] - https://gerrit.wikimedia.org/r/243122 (owner: Muehlenhoff) [23:50:14] (CR) Dzahn: "in my opinion it should be consistent the other way around, no groups include bastions" [puppet] - https://gerrit.wikimedia.org/r/242779 (owner: Yuvipanda) [23:51:20] (CR) Dzahn: "i have not seen/used the "create_resources" part with salt, but the idea seems right if that adds grains" [puppet] - https://gerrit.wikimedia.org/r/243142 (https://phabricator.wikimedia.org/T111006) (owner: Muehlenhoff) [23:51:30] (PS1) Alex Monk: [WIP] Labs DNS: Stop hardcoding instance IPs in Puppet [puppet] - https://gerrit.wikimedia.org/r/243357 [23:51:50] mutante, you mean no group should imply bastion access? [23:54:31] yes [23:55:11] mutante, so everyone should have to be added manually?
[23:55:14] yes [23:55:29] mutante, that sounds like a bad idea considering many groups rely on bastion access to do anything [23:57:34] many but not all. letting a group just do one thing and treating them like flags seems better to me [23:57:41] we created it for a reason [23:58:54] groups including other groups - does that really make it simpler? [23:59:14] rdepends group