[00:58:55] operations, Analytics, Discovery, EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1701171 (Nuria) >@Ottomata, main reason would be the ability to work with $simple_queue, $binary_kafka, $amazon_queue and so on without changes in MW code. This isn't so theoretical. W...
[01:17:00] operations, Analytics, Discovery, EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1701193 (GWicke) @Nuria, see the task description, heading "Initial use cases".
[01:19:25] operations, Analytics, Discovery, EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1701197 (ori) >>! In T114443#1701193, @GWicke wrote: > @Nuria, see the task description, heading "Initial use cases". Potential applications are one thing; a concise problem-statement...
[01:49:29] operations, Analytics, Discovery, EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1701206 (GWicke)
[01:52:44] operations, Editing-Department, Parsing-Team, Services: Services team goals October - December 2015 (Q2 2015/16) - https://phabricator.wikimedia.org/T111819#1701208 (GWicke)
[01:53:27] operations, Analytics, Discovery, EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1701209 (GWicke)
[01:53:38] operations, Analytics, Discovery, EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1695470 (GWicke) @ori, I changed the text to clarify which of those are potential, and which are concrete plans for this quarter. Please follow the provided links if things are still u...
[02:29:02] !log l10nupdate@tin Synchronized php-1.27.0-wmf.1/cache/l10n: l10nupdate for 1.27.0-wmf.1 (duration: 08m 25s)
[02:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:29:26] operations, Parsoid: All parsoid servers almost unresponsive, under high cpu load - https://phabricator.wikimedia.org/T114558#1701215 (ssastry) At this point, this is a tangent to the subject of the ticket (and I am responsible for the tangent, but one final clarification and we can take this on...
[02:33:57] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.1) at 2015-10-05 02:33:57+00:00
[02:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:40:34] hi
[02:40:58] anybody here?
[04:10:39] operations, Parsoid: All parsoid servers almost unresponsive, under high cpu load - https://phabricator.wikimedia.org/T114558#1701245 (ssastry) Turns out the reason for the really tight fit was simply because I was doing the wrong thing (smoothening out 2 varying curves and fitting them visually, instead...
[05:33:18] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Oct 5 05:33:18 UTC 2015 (duration 33m 17s)
[05:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:46:02] operations, Analytics, Discovery, EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1701296 (Joe) Apart from the concerns on a practical use case which I agree with, I have a big doubt about the implementation idea: I am in general a fan of the paradigm that it's bet...
[05:52:30] operations, Analytics, Discovery, EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1701312 (Joe) >>! In T114443#1698223, @GWicke wrote: > @ottomata, main reason would be the ability to work with $simple_queue, $binary_kafka, $amazon_queue and so on without changes in...
[06:30:03] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:04] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:14] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:23] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:24] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:34] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:35] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:44] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:54] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:25] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:34] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:43] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:05] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:24] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:23] PROBLEM - puppet last run on mw2052 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:44] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:53] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:53] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:55:43] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:56:13] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:56:14] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is 
currently enabled, last run 8 seconds ago with 0 failures [06:56:54] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:56:55] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:57:04] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:57:15] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:57:24] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:24] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:33] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:57:34] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:34] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:57:43] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:43] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:57:43] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:57:44] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:57:45] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:53] RECOVERY - puppet last run on mw2052 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [07:03:54] 6operations, 
10Traffic, 7Pybal: pybal fails to detect dead servers under production lb IPs for port 80 - https://phabricator.wikimedia.org/T113151#1701361 (10Joe) I was not able to reproduce this behaviour in a small test setup, but in the meantime I implemented support for 3xx responses in proxyfetch, which... [07:43:13] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 686 [07:48:12] (03PS1) 10Alexandros Kosiaris: dbstore: Set Replication alerts to not page [puppet] - 10https://gerrit.wikimedia.org/r/243614 [07:48:31] jynus: ^ thoughts ? [07:49:42] +1 [07:51:10] (03CR) 10Jcrespo: [C: 031] dbstore: Set Replication alerts to not page [puppet] - 10https://gerrit.wikimedia.org/r/243614 (owner: 10Alexandros Kosiaris) [07:54:30] they were never a problem before, something has changed on the code [07:54:43] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Seconds_Behind_Master: 222 [07:54:58] that is why we never were bothered by them [07:56:33] Something related to linksupdate that makes toku very slow [07:58:02] (03CR) 10Jcrespo: [C: 032] dbstore: Set Replication alerts to not page [puppet] - 10https://gerrit.wikimedia.org/r/243614 (owner: 10Alexandros Kosiaris) [07:58:28] (03PS2) 10Giuseppe Lavagetto: Add support for http_status to ProxyFetch [debs/pybal] - 10https://gerrit.wikimedia.org/r/243139 (https://phabricator.wikimedia.org/T102393) [08:04:44] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above 100 [08:07:15] <_joe_> !log upgrading HHVM on all API appservers [08:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:08:27] <_joe_> (if salt behaves, that is [08:09:40] (03CR) 10Jcrespo: "$wmgParserCacheDBs are read/write, but do not have a replication topology, as they are pure caches. 
Probably not affected as they do not u" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: 10Aaron Schulz) [08:19:43] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 100 [08:26:25] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 9.09% of data above the critical threshold [500.0] [08:33:07] 6operations, 5Patch-For-Review: Tweak sysctl settings for nf_conntrack - https://phabricator.wikimedia.org/T105307#1701435 (10MoritzMuehlenhoff) 5Open>3Resolved net.netfilter.nf_conntrack_tcp_timeout_time_wait has been set to 65 seconds across the cluster. [08:35:04] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:39:33] 6operations: determine new swift ms-be hostnames (codfw/eqiad) - https://phabricator.wikimedia.org/T114500#1701450 (10fgiunchedi) a:5fgiunchedi>3RobH @robh hostnames look good, thanks! [08:41:12] (03CR) 10Hashar: [C: 031] "Rule is disabled now so this can be landed." [puppet] - 10https://gerrit.wikimedia.org/r/238778 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [08:42:58] 6operations, 10Traffic, 7discovery-system, 5services-tooling: Figure out a security model for etcd - https://phabricator.wikimedia.org/T97972#1701455 (10Joe) [08:43:00] (03CR) 10Hashar: [C: 04-1] "I guess we can just disable the rule for now until rubocop version is bumped with a version that includes a fix for above issue." [puppet] - 10https://gerrit.wikimedia.org/r/238779 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [08:43:25] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/2: down - Transit: ! 
NTT (service ID 234631) {#1061} [10Gbps]BR [09:02:25] (03CR) 10Filippo Giunchedi: cassandra: new metrics-collector version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/243127 (https://phabricator.wikimedia.org/T113733) (owner: 10Filippo Giunchedi) [09:04:27] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] update collector version [software/cassandra-metrics-collector] - 10https://gerrit.wikimedia.org/r/243121 (https://phabricator.wikimedia.org/T113733) (owner: 10Filippo Giunchedi) [09:15:18] !log stop puppet on restbase and maps in preparation for https://gerrit.wikimedia.org/r/#/c/242896/1 [09:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:15:28] (03PS2) 10Filippo Giunchedi: cassandra: add multi-instance support, disabled [puppet] - 10https://gerrit.wikimedia.org/r/242896 (https://phabricator.wikimedia.org/T95253) [09:15:36] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add multi-instance support, disabled [puppet] - 10https://gerrit.wikimedia.org/r/242896 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [09:18:00] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: Need a way to create a systemd service that is initially stopped - https://phabricator.wikimedia.org/T105749#1701526 (10Joe) For the record, this is possible by doing the following: ``` base::service_unit{ 'foo':... 
[09:18:12] !log roll-restart cassandra on restbase test cluster [09:18:14] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: Need a way to create a systemd service that is initially stopped - https://phabricator.wikimedia.org/T105749#1701528 (10Joe) 5Open>3Resolved [09:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:19:25] 6operations, 10MediaWiki-API, 7HHVM, 7Pywikibot-tests, 7Wikimedia-log-errors: internal_api_error_BadMethodCallException: [xxx] Exception Caught: Call to a member function getNames() on a non-object (NULL) - https://phabricator.wikimedia.org/T109929#1701532 (10Joe) 5Open>3Resolved [09:20:05] 6operations, 5codfw-appserver-setup, 5wikis-in-codfw: install/deploy codfw appservers - https://phabricator.wikimedia.org/T85227#1701533 (10Joe) 5Open>3Resolved a:3Joe [09:20:06] 6operations, 5codfw-appserver-setup, 5wikis-in-codfw: Set up the mediawiki application layer in codfw - https://phabricator.wikimedia.org/T86894#1701535 (10Joe) [09:20:19] 6operations, 5codfw-appserver-setup, 5wikis-in-codfw: Set up the mediawiki application layer in codfw - https://phabricator.wikimedia.org/T86894#1701537 (10Joe) 5Open>3Resolved [09:34:34] PROBLEM - Host mw1153 is DOWN: PING CRITICAL - Packet loss = 100% [09:37:01] (03CR) 10Filippo Giunchedi: [C: 031] varnish: misspass limiter [puppet] - 10https://gerrit.wikimedia.org/r/241643 (owner: 10BBlack) [09:39:21] (03PS2) 10Muehlenhoff: Move ferm rules out of the module [puppet] - 10https://gerrit.wikimedia.org/r/242915 [09:42:53] <_joe_> uh, what's up with mw1153 [09:44:16] 6operations, 7Database: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#1701608 (10jcrespo) [09:46:26] <_joe_> ok, so. 
mw1153 hit a bug in the kernel relating to its ethernet driver it seems, or the network card is broken [09:46:33] <_joe_> either way, rebooting it [09:47:35] 6operations, 7Graphite: Upgrade to Grafana v2.x - https://phabricator.wikimedia.org/T104738#1701614 (10BBlack) Was looking for logScale stuff recently. It seems the version of Grafana we're running right now is some interim pre-release-2.0.0 state before 2.0.0-Beta1. There's been a few 2.0.0-BetaX, then 2.0.... [09:49:33] RECOVERY - Host mw1153 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [09:50:02] !log roll-restart cassandra in restbase codfw [09:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:50:16] <_joe_> !log rebooted mw1153, soft lockup due to bnx2 failure [09:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:50:29] (03PS3) 10Muehlenhoff: Move ferm rules out of the module [puppet] - 10https://gerrit.wikimedia.org/r/242915 [09:51:12] 7Puppet, 6operations: Add the puppet CA to the certification authorities trusted by our systems, on demand - https://phabricator.wikimedia.org/T114638#1701620 (10Joe) [09:51:45] (03CR) 10BBlack: "+1-ish as we're only using 301s for the cases that matter right now, but see inline re: other redirects?" 
(031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/243139 (https://phabricator.wikimedia.org/T102393) (owner: 10Giuseppe Lavagetto) [09:52:27] (03CR) 10Muehlenhoff: [C: 032 V: 032] Move ferm rules out of the module [puppet] - 10https://gerrit.wikimedia.org/r/242915 (owner: 10Muehlenhoff) [09:54:58] (03CR) 10Giuseppe Lavagetto: "@bblack: the code correctly handles 302s and 303s as well:" (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/243139 (https://phabricator.wikimedia.org/T102393) (owner: 10Giuseppe Lavagetto) [09:56:34] (03CR) 10BBlack: [C: 031] "Ah ok :)" [debs/pybal] - 10https://gerrit.wikimedia.org/r/243139 (https://phabricator.wikimedia.org/T102393) (owner: 10Giuseppe Lavagetto) [09:56:52] (03CR) 10Giuseppe Lavagetto: [C: 032] Add support for http_status to ProxyFetch [debs/pybal] - 10https://gerrit.wikimedia.org/r/243139 (https://phabricator.wikimedia.org/T102393) (owner: 10Giuseppe Lavagetto) [09:57:13] (03PS2) 10Giuseppe Lavagetto: Minor fixes to instrumentation [debs/pybal] - 10https://gerrit.wikimedia.org/r/243413 [09:57:16] (03Merged) 10jenkins-bot: Add support for http_status to ProxyFetch [debs/pybal] - 10https://gerrit.wikimedia.org/r/243139 (https://phabricator.wikimedia.org/T102393) (owner: 10Giuseppe Lavagetto) [09:58:42] <_joe_> !log installing the new HHVM package to appservers [09:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:07:59] (03CR) 10BBlack: [C: 031] Fix signal handling, some cleanup [debs/pybal] - 10https://gerrit.wikimedia.org/r/243414 (owner: 10Giuseppe Lavagetto) [10:16:13] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [10:16:27] <_joe_> uhm wtf [10:17:54] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 1.51 ms [10:19:49] (03PS4) 10BBlack: remove last vcl_config fe default [puppet] - 10https://gerrit.wikimedia.org/r/243401 [10:20:13] (03CR) 10BBlack: [C: 032 V: 032] remove last vcl_config fe default [puppet] - 
10https://gerrit.wikimedia.org/r/243401 (owner: 10BBlack) [10:24:54] (03PS2) 10BBlack: sslcert: add preamble for sslcert::dhparam [puppet] - 10https://gerrit.wikimedia.org/r/243195 (owner: 10Faidon Liambotis) [10:25:02] (03CR) 10BBlack: [C: 032 V: 032] sslcert: add preamble for sslcert::dhparam [puppet] - 10https://gerrit.wikimedia.org/r/243195 (owner: 10Faidon Liambotis) [10:25:31] !log roll-restart cassandra in restbase eqiad [10:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:29:34] PROBLEM - Cassandra database on restbase1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [10:29:45] PROBLEM - Cassandra CQL query interface on restbase1001 is CRITICAL: Connection refused [10:31:40] (03CR) 10BBlack: [C: 031] tlsproxy: fix a couple of OCSP-related dependencies [puppet] - 10https://gerrit.wikimedia.org/r/243197 (owner: 10Faidon Liambotis) [10:31:48] that's me ^ [10:33:03] RECOVERY - Cassandra database on restbase1001 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [10:33:14] RECOVERY - Cassandra CQL query interface on restbase1001 is OK: TCP OK - 0.000 second response time on port 9042 [10:34:58] (03CR) 10BBlack: [C: 031] tlsproxy: switch update-ocsp(-all) to config files [puppet] - 10https://gerrit.wikimedia.org/r/243198 (owner: 10Faidon Liambotis) [10:36:25] (03CR) 10BBlack: [C: 031] tlsproxy: add support for update-ocsp-all hooks [puppet] - 10https://gerrit.wikimedia.org/r/243199 (owner: 10Faidon Liambotis) [10:38:05] PROBLEM - Cassandra CQL query interface on restbase1002 is CRITICAL: Connection refused [10:38:57] PROBLEM - Cassandra database on restbase1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 111 (cassandra), command name java, args CassandraDaemon [10:41:22] (03CR) 10BBlack: [C: 031] Move tlsproxy's OCSP stapler/updater to sslcert [puppet] - 10https://gerrit.wikimedia.org/r/243200 (owner: 
10Faidon Liambotis) [10:43:44] (03CR) 10BBlack: "^ ditto for the nagios freshness check. basically, I think you can inline ocsp_stapler, but still need to require a separate singleton oc" [puppet] - 10https://gerrit.wikimedia.org/r/243201 (owner: 10Faidon Liambotis) [10:47:35] (03PS1) 10Muehlenhoff: Move ferm rules out of the module, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/243629 [10:50:33] RECOVERY - Cassandra database on restbase1002 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [10:51:23] RECOVERY - Cassandra CQL query interface on restbase1002 is OK: TCP OK - 0.005 second response time on port 9042 [10:51:52] (03PS1) 10Alexandros Kosiaris: aqs: change contact group to analytics [puppet] - 10https://gerrit.wikimedia.org/r/243632 [10:52:48] (03PS2) 10Alexandros Kosiaris: aqs: change contact group to analytics [puppet] - 10https://gerrit.wikimedia.org/r/243632 [10:53:00] (03CR) 10Alexandros Kosiaris: [C: 032] aqs: change contact group to analytics [puppet] - 10https://gerrit.wikimedia.org/r/243632 (owner: 10Alexandros Kosiaris) [10:53:24] (03CR) 10Alexandros Kosiaris: [V: 032] aqs: change contact group to analytics [puppet] - 10https://gerrit.wikimedia.org/r/243632 (owner: 10Alexandros Kosiaris) [10:55:59] (03Abandoned) 10Alexandros Kosiaris: Smallest change needed to unbreak nagios config [puppet] - 10https://gerrit.wikimedia.org/r/243408 (https://phabricator.wikimedia.org/T114556) (owner: 10Jcrespo) [10:56:07] (03PS2) 10Muehlenhoff: Move ferm rules out of the module, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/243629 [10:57:44] 6operations, 10Analytics, 6Services, 7Icinga, 5Patch-For-Review: Icinga configuration broken by aqs - https://phabricator.wikimedia.org/T114556#1701698 (10jcrespo) 5Open>3Resolved a:3jcrespo Resolved on https://gerrit.wikimedia.org/r/#/c/243632/ [10:58:59] 6operations, 10Analytics, 6Services, 7Icinga: Icinga configuration broken by aqs - 
https://phabricator.wikimedia.org/T114556#1701702 (10Revi) [10:59:50] (03CR) 10Muehlenhoff: [C: 032 V: 032] Move ferm rules out of the module, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/243629 (owner: 10Muehlenhoff) [11:04:43] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 [11:09:24] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [11:09:34] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [11:11:14] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [11:12:44] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [11:26:58] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/2: down - Transit: ! 
NTT (service ID 234631) {#1061} [10Gbps]BR [11:27:08] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [11:27:27] PROBLEM - Restbase root url on aqs1003 is CRITICAL: Connection refused [11:28:10] PROBLEM - Restbase root url on aqs1001 is CRITICAL: Connection refused [11:29:11] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [11:29:29] PROBLEM - Restbase root url on aqs1002 is CRITICAL: Connection refused [11:30:40] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [11:31:39] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [11:34:30] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 [11:35:13] 6operations, 6Analytics-Kanban, 10netops, 5Patch-For-Review: Puppetize a server with a role that sets up Cassandra on Analytics machines [13 pts] {slug} - https://phabricator.wikimedia.org/T107056#1701759 (10akosiaris) [11:37:47] 6operations, 6Analytics-Kanban, 10netops, 5Patch-For-Review: Puppetize a server with a role that sets up Cassandra on Analytics machines [13 pts] {slug} - https://phabricator.wikimedia.org/T107056#1701767 (10akosiaris) >>! In T107056#1697401, @JAllemandou wrote: > Hey DevOps Guys, > As part of that task, w... 
[11:42:30] PROBLEM - puppet last run on mw2025 is CRITICAL: CRITICAL: puppet fail [11:42:51] (03CR) 10Alexandros Kosiaris: "@cscott, thanks for explaining why the parsoid configuration is still required. I was about to ask why we haven't drop it yet. That being " [puppet] - 10https://gerrit.wikimedia.org/r/243400 (owner: 10Cscott) [11:52:32] (03PS1) 10Alexandros Kosiaris: aqs: Allow CQL access from analytics [puppet] - 10https://gerrit.wikimedia.org/r/243635 (https://phabricator.wikimedia.org/T107056) [12:01:10] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [12:09:58] 10Ops-Access-Requests, 6operations: Access request to hive and webrequests - https://phabricator.wikimedia.org/T114642#1701804 (10dcausse) 3NEW [12:10:19] RECOVERY - puppet last run on mw2025 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [12:26:10] PROBLEM - puppet last run on mw1040 is CRITICAL: CRITICAL: Puppet has 1 failures [12:26:30] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/2: down - Transit: ! NTT (service ID 234631) {#1061} [10Gbps]BR [12:26:59] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [12:47:21] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 [12:52:19] RECOVERY - puppet last run on mw1040 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [12:53:02] 6operations, 10ops-eqiad: analytics1049 /dev/sdi busted - https://phabricator.wikimedia.org/T114034#1701898 (10Cmjohnson) 5Open>3Resolved [12:55:45] (03CR) 10Faidon Liambotis: [C: 031] "Awesome! 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/243395 (https://phabricator.wikimedia.org/T89177) (owner: 10BBlack) [13:03:51] (03CR) 10Faidon Liambotis: [C: 031] "That's so much cleaner!" [puppet] - 10https://gerrit.wikimedia.org/r/243406 (https://phabricator.wikimedia.org/T89177) (owner: 10BBlack) [13:19:31] (03CR) 10Faidon Liambotis: [C: 04-1] "Yes on the concept, no on the specifics; see inline :)" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/227327 (https://phabricator.wikimedia.org/T114161) (owner: 10Alex Monk) [13:29:43] (03PS1) 10Hashar: contint: resurect contint::browsertests [puppet] - 10https://gerrit.wikimedia.org/r/243644 [13:32:00] I'm looking into high cassandra latencies, it seems also parsoid is affected according to http://grafana.wikimedia.org/#/dashboard/db/restbase [13:32:57] on the cassandra side what changed today was merging https://gerrit.wikimedia.org/r/#/c/242896/1 still not sure if related [13:36:18] (03PS8) 10Hashar: Move Ruby related packages to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/237876 (https://phabricator.wikimedia.org/T110865) (owner: 10Zfilipin) [13:37:33] (03CR) 10Hashar: "I had contint::browsertests included again with https://gerrit.wikimedia.org/r/#/c/243644/" [puppet] - 10https://gerrit.wikimedia.org/r/237876 (https://phabricator.wikimedia.org/T110865) (owner: 10Zfilipin) [13:38:19] RECOVERY - Disk space on stat1003 is OK: DISK OK [13:42:54] ^wow, easiest maintenance ever- I send an email, and the alert disappears :-) [13:53:31] (03CR) 10Hashar: [C: 031 V: 032] "cherry picked on integration puppetmaster." [puppet] - 10https://gerrit.wikimedia.org/r/243644 (owner: 10Hashar) [13:53:51] (03CR) 10Hashar: [C: 031 V: 032] "cherry picked on integration puppetmaster." 
[puppet] - 10https://gerrit.wikimedia.org/r/237876 (https://phabricator.wikimedia.org/T110865) (owner: 10Zfilipin) [13:55:09] jynus: great :-D Maybe you would be willing to merge the above two contint patches for me now you have some "free time" :-} [13:56:51] free time, I do not understand the concept? [13:57:43] In exchange, I would like some help in the future with some CI work I have pending [13:57:51] (03PS7) 10BBlack: move netmapper processing to common VCL [puppet] - 10https://gerrit.wikimedia.org/r/243395 (https://phabricator.wikimedia.org/T89177) [13:59:00] jynus: sold ! [13:59:06] jynus: what do you have in mind? [13:59:06] (03PS2) 10Alexandros Kosiaris: otrs: disable the scheduler watchdog [puppet] - 10https://gerrit.wikimedia.org/r/243182 [13:59:36] hashar, some mysql changes, but I am not too familiar with CI environement [14:00:04] not something for now, really, let me have a look at those patches [14:00:15] the easiest is probably to fill a task [14:00:18] (03CR) 10Alexandros Kosiaris: [C: 032] otrs: disable the scheduler watchdog [puppet] - 10https://gerrit.wikimedia.org/r/243182 (owner: 10Alexandros Kosiaris) [14:00:26] yes [14:00:32] that will spam a bunch of folks and we call all figure out something together [14:02:34] 237876 is easy, but let me grep around because this week we had an issue with a similar movement [14:03:42] yeah I ran it on the CI slaves and puppet pass [14:03:48] should be fine :-} [14:05:29] (03PS2) 10Jcrespo: contint: resurect contint::browsertests [puppet] - 10https://gerrit.wikimedia.org/r/243644 (owner: 10Hashar) [14:06:08] yes, I think it impossible to make it fail if it worked before, unlike the other change I mentioned [14:06:14] (03PS8) 10BBlack: move netmapper processing to common VCL [puppet] - 10https://gerrit.wikimedia.org/r/243395 (https://phabricator.wikimedia.org/T89177) [14:06:32] (03CR) 10Jcrespo: [C: 032] contint: resurect contint::browsertests [puppet] - 10https://gerrit.wikimedia.org/r/243644 (owner: 
10Hashar) [14:06:38] (03PS9) 10BBlack: move netmapper processing to common VCL [puppet] - 10https://gerrit.wikimedia.org/r/243395 (https://phabricator.wikimedia.org/T89177) [14:06:51] (03PS9) 10Jcrespo: Move Ruby related packages to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/237876 (https://phabricator.wikimedia.org/T110865) (owner: 10Zfilipin) [14:06:55] * hashar watch the race [14:06:58] ups [14:06:59] sorry [14:07:06] bblack, please [14:07:14] (03CR) 10BBlack: "PS8 vs PS6 moves the split of MCC-MNC + metadata up to the common code, instead of off in the zero+analytics -specific code." [puppet] - 10https://gerrit.wikimedia.org/r/243395 (https://phabricator.wikimedia.org/T89177) (owner: 10BBlack) [14:07:16] your turn [14:07:25] Gerrit has an interesting merging strategy which is "cherry pick" [14:07:31] that keeps a linear history [14:07:51] I think it was discussed some time ago [14:07:56] but that cherry-pick all the patches and adds metadata to the commit message [14:07:59] and the sha1 changes [14:08:30] (03CR) 10Jcrespo: [C: 032] Move Ruby related packages to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/237876 (https://phabricator.wikimedia.org/T110865) (owner: 10Zfilipin) [14:08:55] (03PS1) 10Rush: check legal html for en.wikibooks as requests [puppet] - 10https://gerrit.wikimedia.org/r/243650 [14:09:41] that should be it [14:10:27] (03PS2) 10Rush: check legal html for en.wikibooks as requests [puppet] - 10https://gerrit.wikimedia.org/r/243650 [14:11:41] (03CR) 10Eevans: cassandra: new metrics-collector version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/243127 (https://phabricator.wikimedia.org/T113733) (owner: 10Filippo Giunchedi) [14:12:59] (03CR) 10Rush: [C: 032] check legal html for en.wikibooks as requests [puppet] - 10https://gerrit.wikimedia.org/r/243650 (owner: 10Rush) [14:14:07] jynus: bblack please what? 
[14:14:16] (03PS1) 10Muehlenhoff: Move ferm rules out of the module [puppet] - 10https://gerrit.wikimedia.org/r/243651 [14:14:18] (03PS1) 10Muehlenhoff: Move ferm rules out of the module [puppet] - 10https://gerrit.wikimedia.org/r/243652 [14:14:30] you don't have to rebase when I push new commits, only if I merge, which I haven't :P [14:14:50] bblack, oh, I thought you were about to [14:14:54] I was suspecting you two were racing to get changes to merge :D [14:15:08] and was saying sorry for that [14:15:18] (03PS10) 10BBlack: Move all X-Analytics code to analytics.inc, include in common VCL [puppet] - 10https://gerrit.wikimedia.org/r/243406 (https://phabricator.wikimedia.org/T89177) [14:15:20] (03PS10) 10BBlack: move netmapper processing to common VCL [puppet] - 10https://gerrit.wikimedia.org/r/243395 (https://phabricator.wikimedia.org/T89177) [14:16:35] (03PS8) 10BBlack: varnish: misspass limiter [puppet] - 10https://gerrit.wikimedia.org/r/241643 [14:16:50] my little sub-tree of VCL patch deps is getting complicated to rebase heh [14:19:25] jynus: all puppet runs are fine. thank you very much [14:19:38] thanks for checking [14:23:04] (03PS1) 10Andrew Bogott: Fix a couple of recent lint regressions. [puppet] - 10https://gerrit.wikimedia.org/r/243653 [14:23:32] (03PS1) 10Rush: icinga: define en.wb.o host [puppet] - 10https://gerrit.wikimedia.org/r/243654 [14:23:40] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1701995 (10Nuria) >EventLogging: Decode, validate and enqueue JSON events for EL. mmm..I am not sure who would be the users of this endpoint at this time, do you have a case for EL that... 
[14:24:15] (03PS2) 10Rush: icinga: define en.wb.o host [puppet] - 10https://gerrit.wikimedia.org/r/243654 [14:25:08] (03CR) 10Rush: [C: 032] icinga: define en.wb.o host [puppet] - 10https://gerrit.wikimedia.org/r/243654 (owner: 10Rush) [14:26:51] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [14:27:16] ^fixing now [14:34:58] :-) [14:36:52] (03PS11) 10BBlack: Move all X-Analytics code to analytics.inc, include in common VCL [puppet] - 10https://gerrit.wikimedia.org/r/243406 (https://phabricator.wikimedia.org/T89177) [14:37:19] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [14:38:29] (03PS2) 10Andrew Bogott: Fix a couple of recent lint regressions. [puppet] - 10https://gerrit.wikimedia.org/r/243653 [14:39:36] (03CR) 10Andrew Bogott: [C: 032] Fix a couple of recent lint regressions. [puppet] - 10https://gerrit.wikimedia.org/r/243653 (owner: 10Andrew Bogott) [14:40:40] (03CR) 10Cscott: "From the code, it doesn't appear that CX is currently being used in any project which doesn't match XX.wikipedia.org, and those wikis are " [puppet] - 10https://gerrit.wikimedia.org/r/243400 (owner: 10Cscott) [14:40:54] (03PS6) 10Andrew Bogott: Add an additional puppet config to use with minimal runs. 
[puppet] - 10https://gerrit.wikimedia.org/r/221563 [14:41:07] !log puppet-lint Jenkins job is now strict and will -1 on errors as well as warnings https://gerrit.wikimedia.org/r/#/c/243185/ [14:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:41:55] all good [14:42:23] I am dropping the 'pplint-HEAD' job [14:43:27] (03PS2) 10Faidon Liambotis: sslcert: add --config argument to update-ocsp [puppet] - 10https://gerrit.wikimedia.org/r/243196 [14:43:39] (03CR) 10Faidon Liambotis: [C: 032] sslcert: add --config argument to update-ocsp [puppet] - 10https://gerrit.wikimedia.org/r/243196 (owner: 10Faidon Liambotis) [14:44:45] (03PS2) 10Faidon Liambotis: tlsproxy: fix a couple of OCSP-related dependencies [puppet] - 10https://gerrit.wikimedia.org/r/243197 [14:44:51] (03CR) 10Faidon Liambotis: [C: 032] tlsproxy: fix a couple of OCSP-related dependencies [puppet] - 10https://gerrit.wikimedia.org/r/243197 (owner: 10Faidon Liambotis) [14:48:27] (03PS2) 10Faidon Liambotis: tlsproxy: switch update-ocsp(-all) to config files [puppet] - 10https://gerrit.wikimedia.org/r/243198 [14:48:29] (03PS2) 10Faidon Liambotis: tlsproxy: add support for update-ocsp-all hooks [puppet] - 10https://gerrit.wikimedia.org/r/243199 [14:48:31] (03PS2) 10Faidon Liambotis: Move tlsproxy's OCSP stapler/updater to sslcert [puppet] - 10https://gerrit.wikimedia.org/r/243200 [14:51:35] (03PS1) 10Hashar: contint: fix bundler package name on Jessie [puppet] - 10https://gerrit.wikimedia.org/r/243659 [14:53:06] (03PS1) 10Giuseppe Lavagetto: Add class base::puppet::ca [puppet] - 10https://gerrit.wikimedia.org/r/243661 (https://phabricator.wikimedia.org/T114638) [14:53:08] (03PS1) 10Giuseppe Lavagetto: k8s: switch to using base::puppet::ca [puppet] - 10https://gerrit.wikimedia.org/r/243662 (https://phabricator.wikimedia.org/T114638) [14:53:10] (03PS1) 10Giuseppe Lavagetto: etcd: switch to using base::puppet::ca [puppet] - 10https://gerrit.wikimedia.org/r/243663 
(https://phabricator.wikimedia.org/T114638) [14:53:12] (03PS1) 10Giuseppe Lavagetto: conftool: switch to using base::puppet::ca [puppet] - 10https://gerrit.wikimedia.org/r/243664 (https://phabricator.wikimedia.org/T114638) [14:53:14] (03PS1) 10Giuseppe Lavagetto: eventlogging: switch to using base::puppet::ca [puppet] - 10https://gerrit.wikimedia.org/r/243665 (https://phabricator.wikimedia.org/T114638) [14:53:16] (03PS1) 10Giuseppe Lavagetto: toolschecker: switch to using base::puppet::ca [puppet] - 10https://gerrit.wikimedia.org/r/243666 (https://phabricator.wikimedia.org/T114638) [14:53:25] 6operations, 7HHVM, 5Patch-For-Review: Package and deploy HHVM 3.6.5+dfsg1-1+wm7 - https://phabricator.wikimedia.org/T112640#1702068 (10Joe) [14:53:33] 6operations, 7HHVM, 5Patch-For-Review: Package and deploy HHVM 3.6.5+dfsg1-1+wm7 - https://phabricator.wikimedia.org/T112640#1640999 (10Joe) 5Open>3Resolved [14:54:09] 10Ops-Access-Requests, 6operations, 6Editing-Department, 6Release-Engineering-Team: Please add Frédéric Bolduc, Thalia Chan, and David Lynch to ldap/wmf - https://phabricator.wikimedia.org/T114646#1702074 (10hashar) Seems a task for #ops-access-requests . I at least don't have any access to LDAP. James me... [14:55:50] 6operations, 7HHVM, 5Patch-For-Review: Package and deploy HHVM 3.6.5+dfsg1-1+wm7 - https://phabricator.wikimedia.org/T112640#1702081 (10Joe) [14:57:59] (03PS3) 10Faidon Liambotis: tlsproxy: switch update-ocsp(-all) to config files [puppet] - 10https://gerrit.wikimedia.org/r/243198 [14:58:14] 6operations, 10Deployment-Systems, 10Salt, 5Patch-For-Review: [Trebuchet] Salt times out on parsoid restarts - https://phabricator.wikimedia.org/T63882#1702094 (10ArielGlenn) I've looked at this some over the weekend. Some notes: It looks like the timeout is not passed through from service-restart all th... 
[14:59:07] (03CR) 10Faidon Liambotis: [C: 032] tlsproxy: switch update-ocsp(-all) to config files [puppet] - 10https://gerrit.wikimedia.org/r/243198 (owner: 10Faidon Liambotis) [14:59:33] (03CR) 10Merlijn van Deen: [C: 04-1] "Aren't the k8s CA cert (= the one used to create k8s certificates?) and the puppet CA cert different?" [puppet] - 10https://gerrit.wikimedia.org/r/243662 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151005T1500). Please do the needful. [15:00:04] matt_flaschen: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [15:00:14] akosiaris: I'm going to roll restart cassandra on maps-test2* to pick up the systemd unit unless objections? [15:00:19] (03CR) 10Faidon Liambotis: [C: 04-1] Add class base::puppet::ca (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/243661 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [15:00:28] Here [15:03:26] matt_flaschen: I can SWAT, just reviewing this patch, it's kinda huge for SWAT. [15:03:30] (03PS1) 10Faidon Liambotis: tlsproxy: no arguments for update-ocsp-all [puppet] - 10https://gerrit.wikimedia.org/r/243670 [15:03:43] (03CR) 10Faidon Liambotis: [C: 032 V: 032] tlsproxy: no arguments for update-ocsp-all [puppet] - 10https://gerrit.wikimedia.org/r/243670 (owner: 10Faidon Liambotis) [15:04:20] thcipriani, okay, thanks, let me know if you have any questions. We just wanted to get it out, since it has a pretty noticeable user impact. [15:06:12] matt_flaschen: anything in particular that I should know about pushing this out? Doesn't seem to change any i18n. Any order to the sync? Or would a sync-dir for the whole extension work fine? 
[15:06:42] (03CR) 10Giuseppe Lavagetto: Add class base::puppet::ca (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/243661 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [15:07:06] thcipriani, sync-dir should be fine. [15:11:00] Sorry, my wifi went down for a second, did I miss anything? [15:11:36] No reply after your last message [15:11:59] Thanks [15:13:32] (03PS3) 10Faidon Liambotis: tlsproxy: add support for update-ocsp-all hooks [puppet] - 10https://gerrit.wikimedia.org/r/243199 [15:13:34] (03PS3) 10Faidon Liambotis: Move tlsproxy's OCSP stapler/updater to sslcert [puppet] - 10https://gerrit.wikimedia.org/r/243200 [15:13:36] (03PS2) 10Faidon Liambotis: tlsproxy: inline ocsp_stapler, rename ocsp_updater [puppet] - 10https://gerrit.wikimedia.org/r/243201 [15:13:38] (03PS2) 10Faidon Liambotis: mail: add OCSP stapling to role::mail::mx [puppet] - 10https://gerrit.wikimedia.org/r/243202 [15:15:04] alrighty, pulling it down to tin [15:15:17] (03CR) 10Faidon Liambotis: [C: 032] tlsproxy: add support for update-ocsp-all hooks [puppet] - 10https://gerrit.wikimedia.org/r/243199 (owner: 10Faidon Liambotis) [15:15:40] (03CR) 10Giuseppe Lavagetto: "Nope, we actually use the puppet-generated certs and the puppet CA." 
[puppet] - 10https://gerrit.wikimedia.org/r/243662 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [15:19:09] !log thcipriani@tin Synchronized php-1.27.0-wmf.1/extensions/Flow: SWAT: Fix exception on board and topic history pages [[gerrit:243337]] (duration: 00m 20s) [15:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:19:14] ^ matt_flaschen check please [15:19:15] (03PS14) 10Mforns: Consume EventLogging validation logs from Logstash [puppet] - 10https://gerrit.wikimedia.org/r/241984 (https://phabricator.wikimedia.org/T113627) [15:19:27] (03CR) 10Mforns: Consume EventLogging validation logs from Logstash (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/241984 (https://phabricator.wikimedia.org/T113627) (owner: 10Mforns) [15:19:28] thcipriani, thanks, looking now. [15:22:24] thcipriani, tested. Looks fixed. Thanks. [15:22:33] matt_flaschen: thanks for checking! [15:25:40] (03PS1) 10Hashar: labs: get zip via 'require_package' [puppet] - 10https://gerrit.wikimedia.org/r/243674 [15:27:36] (03CR) 10Hashar: [C: 031 V: 032] "Spotted on Jessie instance integration-slave-jessie-1001.integration.eqiad.wmflabs which has the package builder module applied." [puppet] - 10https://gerrit.wikimedia.org/r/243674 (owner: 10Hashar) [15:27:38] (03PS1) 10Filippo Giunchedi: cassandra: enable multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/243675 (https://phabricator.wikimedia.org/T95253) [15:32:56] (03PS2) 10Hashar: contint: fix bundler package name on Jessie [puppet] - 10https://gerrit.wikimedia.org/r/243659 (https://phabricator.wikimedia.org/T110865) [15:34:12] 7Puppet, 6operations, 5Patch-For-Review: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#1702211 (10hashar) CI is now voting on puppet-lint error/warnings. 
Seems all the repository has been cleaned up :-} [15:34:23] 10Ops-Access-Requests, 6operations: Access request to hive and webrequests - https://phabricator.wikimedia.org/T114642#1702214 (10Revi) [15:38:36] 10Ops-Access-Requests, 6operations, 6Editing-Department, 6Release-Engineering-Team: Please add Frédéric Bolduc, Thalia Chan, and David Lynch to ldap/wmf - https://phabricator.wikimedia.org/T114646#1702226 (10Krenair) There is an LDAP admins group (CC'd) on the servers (so ops should not have to handle thes... [15:41:14] 6operations, 7Monitoring: I do not receive pages, ever - https://phabricator.wikimedia.org/T114653#1702235 (10jcrespo) 3NEW [15:43:07] jynus: most people would be happy about that ;) [15:43:45] yuvipanda|maybe: just merged the multi-dc code for cirrussearch, in theory we should be able to test sending data as early as tomorrow if we only turn it on for testwiki [15:44:13] 10Ops-Access-Requests, 6operations, 6Editing-Department, 6Release-Engineering-Team: Please add Frédéric Bolduc, Thalia Chan, and David Lynch to ldap/wmf - https://phabricator.wikimedia.org/T114646#1702246 (10Jdforrester-WMF) >>! In T114646#1702226, @Krenair wrote: > There is an LDAP admins group (CC'd) on... [16:05:00] !log Updated Wikidata's property suggester with data from today's json dump [16:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:05:06] sjoerddebruin: ^ fyi [16:05:15] Oh, ok. 
[16:09:10] 6operations, 10Traffic: Re-balance North American traffic in DNS for codfw - https://phabricator.wikimedia.org/T114659#1702350 (10BBlack) 3NEW [16:09:22] 6operations, 10Traffic, 10netops: Re-balance North American traffic in DNS for codfw - https://phabricator.wikimedia.org/T114659#1702358 (10BBlack) [16:11:04] 6operations, 10MediaWiki-extensions-ZeroPortal, 10Traffic, 6Zero, 5Patch-For-Review: zerofetcher in production is getting throttled for API logins - https://phabricator.wikimedia.org/T111045#1702382 (10BBlack) 5Open>3Resolved a:3BBlack [16:11:17] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1702384 (10GWicke) I guess we have slightly different ideas about what a message bus should be: 1) a way to get blobs from a to b, and 2) a way to expose a stream of events in a defined... [16:14:06] 6operations, 7Icinga: make critical icinga services always send email but keep honoring timezones for pages - https://phabricator.wikimedia.org/T114661#1702395 (10Dzahn) [16:15:58] 6operations, 7Icinga: make critical icinga services always send email but keep honoring timezones for pages - https://phabricator.wikimedia.org/T114661#1702412 (10Dzahn) ..if this turns out to be not possible or too complicated then send email to a global alias address / group contact with tiemzone 24x7 [16:24:17] (03CR) 10Merlijn van Deen: "Huh. Okay, I don't get why that would be the case, but I assume this will be documented at some point." [puppet] - 10https://gerrit.wikimedia.org/r/243662 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [16:24:51] (03PS1) 10Alex Monk: Sharper apple-touch icon for commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243685 (https://phabricator.wikimedia.org/T114275) [16:28:01] (03PS1) 10John F. 
Lewis: admin: add dcausse to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/243686 (https://phabricator.wikimedia.org/T114642) [16:28:11] 6operations, 10Wikimedia-Logstash: Logstash elasticsearch cluster filled up, dropping logstash events - https://phabricator.wikimedia.org/T113571#1702488 (10bd808) I just checked the disk utilization on logstash100[4-6] and found that they all have at least 400G of free space. The largest daily index we have r... [16:32:23] !log running backPopulateRenameQueueLogs.php ([[gerrit:237169]]) on metawiki [16:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:33:02] legoktm: \o/ [16:33:04] awesome [16:35:14] 6operations, 10Traffic, 10netops: Re-balance North American traffic in DNS for codfw - https://phabricator.wikimedia.org/T114659#1702543 (10BBlack) Note: we've done a 24-hour test of all eqiad traffic to codfw already, so there's no real load concerns here, just latency and link-balance optimization. 
[16:35:48] Glaisher: :) [16:41:30] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0] [16:44:10] (03PS1) 10Yuvipanda: elasticsearch: Use the default elasticsearch role for nobelium [puppet] - 10https://gerrit.wikimedia.org/r/243687 [16:44:40] (03PS1) 10Merlijn van Deen: sites/redirects: Redirect *.pywikibot.org to tool labs [puppet] - 10https://gerrit.wikimedia.org/r/243688 (https://phabricator.wikimedia.org/T106311) [16:45:00] (03CR) 10Yuvipanda: [C: 032] elasticsearch: Use the default elasticsearch role for nobelium [puppet] - 10https://gerrit.wikimedia.org/r/243687 (owner: 10Yuvipanda) [16:45:17] (03CR) 10jenkins-bot: [V: 04-1] sites/redirects: Redirect *.pywikibot.org to tool labs [puppet] - 10https://gerrit.wikimedia.org/r/243688 (https://phabricator.wikimedia.org/T106311) (owner: 10Merlijn van Deen) [16:47:50] (03PS2) 10Merlijn van Deen: sites/redirects: Redirect *.pywikibot.org to tool labs [puppet] - 10https://gerrit.wikimedia.org/r/243688 (https://phabricator.wikimedia.org/T106311) [16:48:21] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:48:25] (03CR) 10jenkins-bot: [V: 04-1] sites/redirects: Redirect *.pywikibot.org to tool labs [puppet] - 10https://gerrit.wikimedia.org/r/243688 (https://phabricator.wikimedia.org/T106311) (owner: 10Merlijn van Deen) [16:50:06] does anyone know how to get a dump of entries in kibana? [16:50:20] PROBLEM - puppet last run on mw2056 is CRITICAL: CRITICAL: puppet fail [16:51:41] Glaisher: it finished running [16:51:49] yay :D [16:53:30] 10Ops-Access-Requests, 6operations, 6Services, 7RESTBase-architecture: Access to aqs100x for gwicke, eevans and mobrovac - https://phabricator.wikimedia.org/T114383#1702639 (10RobH) This request has been approved in the operations meeting. We'll enable this access later today. 
[16:53:35] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: RESTBase Admin access on aqs1001, aqs1002, and aqs1003 for Joseph and Dan - https://phabricator.wikimedia.org/T113416#1702640 (10RobH) This request has been approved in the operations meeting. We'll enable this access later today. [16:54:02] (03PS3) 10Merlijn van Deen: sites/redirects: Redirect *.pywikibot.org to tool labs [puppet] - 10https://gerrit.wikimedia.org/r/243688 (https://phabricator.wikimedia.org/T106311) [16:56:22] (03PS1) 10Yuvipanda: elasticsearch: Tweak nobelium parameters [puppet] - 10https://gerrit.wikimedia.org/r/243691 [16:57:15] (03CR) 10Yuvipanda: [C: 032] elasticsearch: Tweak nobelium parameters [puppet] - 10https://gerrit.wikimedia.org/r/243691 (owner: 10Yuvipanda) [16:59:43] PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Puppet has 1 failures [17:01:45] Coren/robh : https://phabricator.wikimedia.org/T113298 this ticket was for access for user junikowski to stat1002, and adding to analytics-privatedata-users - In the actual patch, he has been added to statistics-privatedata-users, and cannot execute queries. Could you please fix this? [17:02:31] what's the difference between those groups? [17:02:45] and the non-privatedata groups [17:02:55] Krenair: statistics gives different access - to sample logs etc. analytics creates a hadoop user [17:03:03] well, first someone needs to request it on task [17:03:08] and note if the other one should be removed [17:03:21] robh: analytics was the one requested [17:03:48] patch says Add to analytics-privatedata-users https://gerrit.wikimedia.org/r/#/c/240691/ [17:03:50] I'm not disagreeing with you, I'm stating that asking for it to be fixed in IRC isn't valid. There isn't enough accountability [17:03:51] =] [17:04:00] The task should be reopened [17:04:08] and me putting 'this person said to fix this for this other person said in irc' looks shady no? [17:04:23] Krenair: robh okay will do. 
[17:04:29] madhuvishy: so if you or ideally the actual requestor of the access states that's needed [17:05:05] madhuvishy: so they need hive data right? [17:05:11] yes [17:05:16] i've seen a LOT of confusion on that so it's not unheard of [17:05:26] i fixed a few others last week =] [17:05:44] madhuvishy: just half of them need both groups, and half only need hive group [17:05:48] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: RESTBase Admin access on aqs1001, aqs1002, and aqs1003 for Joseph and Dan - https://phabricator.wikimedia.org/T113416#1702687 (10Milimetric) Thanks! [17:05:48] robh: which is why the group name was explicitly mentioned I thought. [17:06:20] indeed [17:06:21] and coren [17:06:28] and coren's patch says analytics-privatedata-users in message [17:06:32] (sorry for double ping marc) [17:06:48] so i'll just fix and move it over, no biggie [17:07:19] madhuvishy: hrmm, i'm not sure you need to comment on ticket, you are right that it was a mistake in the initial patchset [17:07:27] not a confusion, as the proper group was called out [17:07:30] i'll take care of it =] [17:07:46] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1002 for JUnikowski_WMF - https://phabricator.wikimedia.org/T113298#1702692 (10madhuvishy) This patch: https://gerrit.wikimedia.org/r/#/c/240691/ adds the user to statistics-privatedata-users, and not analytics-privatedata-user... [17:08:01] robh: aah just added, thanks anyway [17:08:05] =] [17:08:27] thx for noticing, did Jonathan point out his access wasn't working to you? [17:08:38] he states he logged in and tested it worked [17:08:43] so i'm wondering why he didn't notice no hive data? 
[17:10:34] robh: because he was able to log in - the group he was added to grants him that [17:10:50] robh: but if he tries to query, it throws an access control exception [17:11:01] so he thought it was on analytics' end and reached out [17:11:11] cool, just making sure he was aware of the issue [17:12:28] (03PS1) 10RobH: fixing junikowski's access [puppet] - 10https://gerrit.wikimedia.org/r/243695 [17:12:29] fixing now [17:12:35] thanks! [17:12:50] (03CR) 10RobH: [C: 032] fixing junikowski's access [puppet] - 10https://gerrit.wikimedia.org/r/243695 (owner: 10RobH) [17:13:44] PROBLEM - SSH on analytics1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:15:14] PROBLEM - RAID on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:17:33] PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:17:51] ok, let's see what's up with analytics1035 [17:18:28] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1002 for JUnikowski_WMF - https://phabricator.wikimedia.org/T113298#1702726 (10RobH) Sorry about that, I've gone ahead and merged an updated patch to fix this, and it's been applied to stat1002. [17:18:34] RECOVERY - RAID on analytics1035 is OK: OK: optimal, 13 logical, 14 physical [17:18:35] madhuvishy: his fix is live so it should work for him now [17:18:38] .... 
[17:18:40] damn it analytics1035 [17:18:43] what the heeeeellll [17:18:54] RECOVERY - puppet last run on mw2056 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [17:19:12] !log analytics1035 pegged out, ssh unresponsive and raid failures, and then fixed itself 5 minutes later [17:19:14] RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING [17:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:22:13] RECOVERY - SSH on analytics1035 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [17:22:27] (03PS4) 10Faidon Liambotis: Move tlsproxy's OCSP stapler/updater to sslcert [puppet] - 10https://gerrit.wikimedia.org/r/243200 [17:22:29] (03PS3) 10Faidon Liambotis: tlsproxy: inline ocsp_stapler, rename ocsp_updater [puppet] - 10https://gerrit.wikimedia.org/r/243201 [17:22:31] (03PS3) 10Faidon Liambotis: mail: add OCSP stapling to role::mail::mx [puppet] - 10https://gerrit.wikimedia.org/r/243202 [17:24:23] PROBLEM - YARN NodeManager Node-State on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:25:23] RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:27:05] (03PS5) 10Faidon Liambotis: Move tlsproxy's OCSP stapler/updater to sslcert [puppet] - 10https://gerrit.wikimedia.org/r/243200 [17:27:07] (03PS4) 10Faidon Liambotis: tlsproxy: inline ocsp_stapler, rename ocsp_updater [puppet] - 10https://gerrit.wikimedia.org/r/243201 [17:27:09] !log rolling restart of restbase cluster to rule out driver issues causing the increased p99 read lacenty [17:27:09] (03PS4) 10Faidon Liambotis: mail: add OCSP stapling to role::mail::mx [puppet] - 10https://gerrit.wikimedia.org/r/243202 [17:27:13] PROBLEM - RAID on analytics1035 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:27:26] (03CR) 10Faidon Liambotis: [C: 032] Move tlsproxy's OCSP stapler/updater to sslcert [puppet] - 10https://gerrit.wikimedia.org/r/243200 (owner: 10Faidon Liambotis) [17:28:28] (03CR) 10Faidon Liambotis: [C: 032] tlsproxy: inline ocsp_stapler, rename ocsp_updater [puppet] - 10https://gerrit.wikimedia.org/r/243201 (owner: 10Faidon Liambotis) [17:28:30] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for VBaranetsky - https://phabricator.wikimedia.org/T114308#1702809 (10RobH) @VBaranetsky: What is your wikitech username? I don't see a VBaranetsky or anything similar (perhaps I've missed it?) You'll need to register on https://wikitech.wiki... [17:29:21] For the sake of those that see subbu's question about "dumping logstash" and wonder the same thing, one possible solution is https://github.com/bd808/ggml [17:30:37] RECOVERY - RAID on analytics1035 is OK: OK: optimal, 13 logical, 14 physical [17:31:01] (03CR) 10Faidon Liambotis: [C: 032] mail: add OCSP stapling to role::mail::mx [puppet] - 10https://gerrit.wikimedia.org/r/243202 (owner: 10Faidon Liambotis) [17:31:24] RECOVERY - YARN NodeManager Node-State on analytics1035 is OK: OK: YARN NodeManager analytics1035.eqiad.wmnet:8041 Node-State: RUNNING [17:31:56] (03PS1) 10Faidon Liambotis: lists: add OCSP stapling to role::lists [puppet] - 10https://gerrit.wikimedia.org/r/243698 [17:34:49] (03CR) 10Faidon Liambotis: [C: 032] lists: add OCSP stapling to role::lists [puppet] - 10https://gerrit.wikimedia.org/r/243698 (owner: 10Faidon Liambotis) [17:35:44] PROBLEM - Exim SMTP on mx2001 is CRITICAL: CRITICAL - Cannot make SSL connection [17:36:35] interesting [17:36:43] PROBLEM - Exim SMTP on mx1001 is CRITICAL: CRITICAL - Cannot make SSL connection [17:37:10] these are not real [17:37:13] don't panic [17:37:41] i wish most alets were followed by that.. 
;] [17:37:52] so I need to update our ffmpeg2theora package with another bugfix patch [17:38:07] i'm a bit at a loss as to how [17:38:39] the operations/debs/ffmpeg2theorawmf repo does not appear to be what we're building packages from at present [17:39:30] do i need to be diving into this reprepo thing? https://wikitech.wikimedia.org/wiki/Reprepro or is that a red herring? [17:41:34] brion: red herring for you I'd say. Who did the last update? [17:41:37] package update [17:41:45] mm lemme check bugs [17:42:37] brion: poking same person again is probably way to go :D [17:43:43] ok looks like it was godog :D [17:44:26] !log rolling restart of eqiad restbase nodes done [17:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:45:48] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Update ffmpeg2theora package to fix additional ogv transcode failures - https://phabricator.wikimedia.org/T114557#1699347 (10brion) (Note last major update to our ffmpeg2theora package was for T69953 -- just need to swap out the patch file for newer one.) [17:46:46] brion: hey, hah I didn't spot the git repo, anyways you can fetch the debian source package from our repo and add the patch, uploading to reprepro is restricted tho I can likely work on it tomorrow morning [17:46:51] UTC morning that is [17:47:30] godog: awesome i'll prep it and have it ready for later :D thanks! [17:48:16] brion: sounds good! 
[17:53:16] grumble grumble [17:57:08] (03PS1) 10Faidon Liambotis: Revert OCSP stapling to roles mail::mx and lists [puppet] - 10https://gerrit.wikimedia.org/r/243703 [17:57:11] *sigh* [17:57:34] (03PS9) 10Hoo man: Use one node definition for both tin and mira [puppet] - 10https://gerrit.wikimedia.org/r/223458 (https://phabricator.wikimedia.org/T95436) [17:57:46] (03CR) 10Faidon Liambotis: [C: 032] Revert OCSP stapling to roles mail::mx and lists [puppet] - 10https://gerrit.wikimedia.org/r/243703 (owner: 10Faidon Liambotis) [17:58:33] bblack: ^^ all of this ocsp work for nothing... [17:59:44] RECOVERY - Exim SMTP on mx2001 is OK: OK - Certificate will expire on 09/22/2016 18:01. [18:00:45] RECOVERY - Exim SMTP on mx1001 is OK: OK - Certificate will expire on 09/22/2016 18:01. [18:01:01] (03PS1) 10Faidon Liambotis: lists: extend check_smtp check to TLS as well [puppet] - 10https://gerrit.wikimedia.org/r/243706 [18:01:16] (03CR) 10Faidon Liambotis: [C: 032] lists: extend check_smtp check to TLS as well [puppet] - 10https://gerrit.wikimedia.org/r/243706 (owner: 10Faidon Liambotis) [18:03:00] \o/ [18:03:05] have fun, ori-vacation [18:03:23] thanks :) [18:09:56] whoa, reall? [18:10:02] ori-vacation: indeed, enjoy! [18:10:49] ori-vacation: so can we merge all the performance killing patches now? ;) [18:11:38] 6operations, 10Analytics-Cluster, 5Patch-For-Review: Turn off webrequest udp2log instances. - https://phabricator.wikimedia.org/T97294#1703087 (10Jgreen) [18:11:40] 6operations, 10Fundraising-Backlog, 6Security, 10fundraising-tech-ops: Delete gadolinium:/a/log/fundraising/ - https://phabricator.wikimedia.org/T92336#1703088 (10Jgreen) [18:11:44] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Update ffmpeg2theora package to fix additional ogv transcode failures - https://phabricator.wikimedia.org/T114557#1703089 (10brion) Updated patch: {F2660426} drop into the debian/patches/ replacing the old fix-resample-sefault.patch. 
[18:13:30] (03PS2) 10Dzahn: apache: remove wikimedia.xyz redirect [puppet] - 10https://gerrit.wikimedia.org/r/243354 (https://phabricator.wikimedia.org/T92547) [18:13:33] twentyafterfour: hey! around? [18:13:49] SMalyshev: yes [18:14:04] SMalyshev: What's up [18:14:39] twentyafterfour: remember that portals thing? do we have any resolution for it? [18:15:18] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1703097 (10Eevans) >>! In T114443#1701296, @Joe wrote: > Apart from the concerns on a practical use case which I agree with, I have a big doubt about the implementation idea: > > I am i... [18:15:55] paravoid: that sucks, but at least some things got cleaned up along the way [18:17:40] what the heck, gerrit [18:17:45] "Invalid xsrfKey in request" [18:18:19] (03PS3) 10Dzahn: apache: remove wikimedia.xyz redirect [puppet] - 10https://gerrit.wikimedia.org/r/243354 (https://phabricator.wikimedia.org/T92547) [18:18:24] and now it works again [18:18:46] (03PS1) 10Ori.livneh: graphite-web: fix incorrect variable name introduced in I1e41e6e3 [puppet] - 10https://gerrit.wikimedia.org/r/243710 [18:19:02] (03CR) 10Ori.livneh: [C: 032 V: 032] graphite-web: fix incorrect variable name introduced in I1e41e6e3 [puppet] - 10https://gerrit.wikimedia.org/r/243710 (owner: 10Ori.livneh) [18:20:44] (03CR) 10John F. 
Lewis: [C: 031] apache: remove wikimedia.xyz redirect [puppet] - 10https://gerrit.wikimedia.org/r/243354 (https://phabricator.wikimedia.org/T92547) (owner: 10Dzahn) [18:20:58] (03PS1) 10Yuvipanda: tools: Move toolswatcher into puppet [puppet] - 10https://gerrit.wikimedia.org/r/243711 [18:21:02] (03CR) 10RobH: [C: 031] "makes sense" [puppet] - 10https://gerrit.wikimedia.org/r/243354 (https://phabricator.wikimedia.org/T92547) (owner: 10Dzahn) [18:22:30] (03PS4) 10Dzahn: apache: remove wikimedia.xyz redirect [puppet] - 10https://gerrit.wikimedia.org/r/243354 (https://phabricator.wikimedia.org/T92547) [18:23:31] (03CR) 10Yuvipanda: [C: 04-1] webservicemonitor: some improvements (032 comments) [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/239377 (https://phabricator.wikimedia.org/T109362) (owner: 10coren) [18:25:06] (03CR) 10John F. Lewis: [C: 031] add AAAA record for argon (irc,rc streams) [dns] - 10https://gerrit.wikimedia.org/r/214506 (https://phabricator.wikimedia.org/T105422) (owner: 10Dzahn) [18:26:17] SMalyshev: is there a task for it? I don't believe there is any resolution yet [18:26:56] twentyafterfour: https://phabricator.wikimedia.org/T110070 is the general task but we don't have specific one for deployment [18:26:59] I'll create it now [18:27:53] (03CR) 10Dzahn: [C: 032] apache: remove wikimedia.xyz redirect [puppet] - 10https://gerrit.wikimedia.org/r/243354 (https://phabricator.wikimedia.org/T92547) (owner: 10Dzahn) [18:31:40] twentyafterfour: https://phabricator.wikimedia.org/T114694 [18:32:11] twentyafterfour: which person/group should it be assigned to you think? 
[18:32:34] 6operations, 5Patch-For-Review, 7domains: add support for wikimedia.xyz - https://phabricator.wikimedia.org/T92547#1703148 (10Dzahn) 5Open>3Resolved support removed, this closes the ticket cycle :) [18:32:38] (03PS1) 10RobH: adding joal, milimetric, gwicke, eevans, mobrovac to aqs-admins [puppet] - 10https://gerrit.wikimedia.org/r/243712 [18:32:51] that patchset pinged a lot of folks ;D [18:33:14] 6operations, 7domains: add support for wikimedia.xyz - https://phabricator.wikimedia.org/T92547#1703154 (10Dzahn) [18:33:21] (03CR) 10RobH: [C: 032] adding joal, milimetric, gwicke, eevans, mobrovac to aqs-admins [puppet] - 10https://gerrit.wikimedia.org/r/243712 (owner: 10RobH) [18:34:12] 6operations, 7Database: decom ishmael - https://phabricator.wikimedia.org/T109777#1703160 (10Dzahn) [18:34:19] 10Ops-Access-Requests, 6operations, 6Services, 7RESTBase-architecture: Access to aqs100x for gwicke, eevans and mobrovac - https://phabricator.wikimedia.org/T114383#1703161 (10RobH) 5Open>3Resolved This has been merged live with patchset https://gerrit.wikimedia.org/r/#/c/243712/ [18:34:30] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: RESTBase Admin access on aqs1001, aqs1002, and aqs1003 for Joseph and Dan - https://phabricator.wikimedia.org/T113416#1703163 (10RobH) 5Open>3Resolved This has been merged live with patchset https://gerrit.wikimedia.org/r/#/c/243712/ [18:35:09] SMalyshev: I added #deployment-systems and #scap3. How urgent is this? I'd like to "do it the right way" but that might take an extra week or so because scap3 isn't finished yet. [18:35:26] twentyafterfour: week or so is ok [18:35:41] I don't want it to linger indefinitely but it can wait a week [18:36:17] twentyafterfour: could you add "blocked by" on scap3 to it so we'd know when it's ready to move forward? 
[18:36:39] SMalyshev: ok I'll claim the task, hopefully I can get it working with scap3 in a reasonable time, if that doesn't look promising then I'll look for another more expedient solution. [18:36:55] twentyafterfour: thank you! [18:37:11] (03PS1) 10Dzahn: ishmael: remove module, decom service [puppet] - 10https://gerrit.wikimedia.org/r/243714 (https://phabricator.wikimedia.org/T109777) [18:37:13] ebernhardson: btw think you can ssh into nobelium and confirm that ES is fine? I did a curl localhost:9200 and it seemed ok... [18:37:40] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:37:47] yuvipanda: es version is right, but it doesn't have the plugins :( [18:37:58] ebernhardson: I thought I had just fixed that [18:38:12] hmm, maybe it just needs a reboot. it only picks up the plugins after restarting service [18:38:22] !log restarted gitblit [18:38:24] ebernhardson: ah yes then probably it does [18:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:38:46] yuvipanda: just restarted it, will find out :) [18:39:03] ebernhardson: ok [18:39:06] how do we deploy the plugins? [18:39:18] (03CR) 10John F. Lewis: [C: 031] ishmael: remove module, decom service [puppet] - 10https://gerrit.wikimedia.org/r/243714 (https://phabricator.wikimedia.org/T109777) (owner: 10Dzahn) [18:39:45] chasemp: in prod, via git-fat and a special plugins repository [18:39:53] i think deployed via trebuchet? [18:40:21] yes [18:40:22] I thought so -- that makes me think we'll have to 'git deploy' them (right?) 
[18:40:24] terribuchet [18:40:29] zing [18:40:32] chasemp: there's a provider => trebuchet [18:40:32] sigh...nobelium is trying to find a master in a 1 node cluster :( will poke it [18:40:40] chasemp: and that' sused [18:40:53] oh right I forget it bootstraps etc etc [18:41:38] ebernhardson: discovery.zen.ping.multicast: false [18:41:40] I imagine [18:42:40] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 62418 bytes in 1.142 second response time [18:42:54] chasemp: its set as master capable, and that only 1 master node is necessary. I thought that would be enough but i guess not :S we should just switch this all to unicast at some point... [18:43:32] agreed, I have a task for it somewhere [18:43:41] PROBLEM - ElasticSearch health check for shards on nobelium is CRITICAL: CRITICAL - elasticsearch http://10.64.37.14:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.37.14, port=9200): Read timed out. (read timeout=4) [18:44:10] ^ thats me, just ignore it [18:44:53] chasemp: different problem, it looks to be trying to join the production cluster [18:44:59] uh [18:45:10] how is that possible, did you set the cluster name etc in hiera? [18:45:25] at least, the logfile is named production-search-eqiad.log looking closer [18:48:08] chasemp: the cluster was previously set in hiera, but the current config file names the production cluster...looking into why [18:48:20] (the generated elasticsearch.yml) [18:51:18] yuvipanda: i don't see the string 'labsearch' ever mentioned in operations/puppet, but that was the previous name of the cluster. is that hiera loaded from some other store? [18:51:31] oh wait i'm looking at old repo...1 sec :) [18:53:05] ebernhardson: I think that just changed... 
[18:53:18] ebernhardson: I had it stop using a separate role and just using the elasticsearch::server role [18:53:27] maybe I should override 'cluster' but that's also used by ganglia [18:53:39] yuvipanda: yea we have to override the cluster hieradata [18:53:49] (or turn off multicast would work too i suppose) [18:54:04] is it elasticsearch::cluster? [18:54:26] cluster_name [18:55:04] ebernhardson: ^ [18:55:06] well [18:55:10] once gerrit decides to accpet my change [18:55:39] (03CR) 10MaxSem: [C: 031] "w00t!" [puppet] - 10https://gerrit.wikimedia.org/r/243406 (https://phabricator.wikimedia.org/T89177) (owner: 10BBlack) [18:56:26] wtf gerrit [18:56:41] just me or gerrit down for others too? [18:58:53] can fetch can't push [19:00:34] yuvipanda: accepted my push to cirrussearch [19:00:55] (03PS2) 10Yuvipanda: tools: Move toolswatcher into puppet [puppet] - 10https://gerrit.wikimedia.org/r/243711 [19:00:56] (03PS1) 10Yuvipanda: elasticsearch: Set nobelium cluster name explicitly [puppet] - 10https://gerrit.wikimedia.org/r/243722 [19:17:12] (03CR) 10EBernhardson: "LGTM, although this probably can be based directly on master" [puppet] - 10https://gerrit.wikimedia.org/r/243722 (owner: 10Yuvipanda) [19:20:06] (03PS5) 10Dzahn: lint: fix 'variable not enclosed' pt2 [puppet] - 10https://gerrit.wikimedia.org/r/242057 [19:21:00] (03CR) 10Andrew Bogott: [C: 031] lint: fix 'variable not enclosed' pt2 [puppet] - 10https://gerrit.wikimedia.org/r/242057 (owner: 10Dzahn) [19:21:29] (03CR) 10Dzahn: [C: 032] lint: fix 'variable not enclosed' pt2 [puppet] - 10https://gerrit.wikimedia.org/r/242057 (owner: 10Dzahn) [19:23:02] (03PS3) 10Yuvipanda: tools: Move toolswatcher into puppet [puppet] - 10https://gerrit.wikimedia.org/r/243711 [19:23:23] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Move toolswatcher into puppet [puppet] - 10https://gerrit.wikimedia.org/r/243711 (owner: 10Yuvipanda) [19:23:45] (03PS2) 10Yuvipanda: elasticsearch: Set nobelium cluster name explicitly [puppet] - 
10https://gerrit.wikimedia.org/r/243722 [19:23:53] (03CR) 10Yuvipanda: [C: 032 V: 032] elasticsearch: Set nobelium cluster name explicitly [puppet] - 10https://gerrit.wikimedia.org/r/243722 (owner: 10Yuvipanda) [19:23:54] Hey milimetric [19:25:06] ebernhardson: hmm, I wonder if comcast is throttling my ssh access or something [19:25:57] yuvipanda: i doubt it, i've had comcast for ~15 years and havn't had any kinda of filtering issues, network was either up or down [19:26:31] ebernhardson: hmm, so youtube / google play work fine [19:28:42] ebernhardson: I think it's just the entire connection is crap except for youtube / play music [19:28:57] multi second ping times to elsewhere [19:30:21] yuvipanda: interesting, i've certainly heard of issues like that (sounds like some connection is bottlenecked between networks) but never experienced them [19:30:43] youtube/google stuff might be coming from a server google puts inside comcast's network [19:30:47] hmm [19:33:57] (03PS1) 10Dzahn: remove ishmael service [dns] - 10https://gerrit.wikimedia.org/r/243727 (https://phabricator.wikimedia.org/T109777) [19:34:56] goddamit I can't ssh to anywhere [19:35:41] hmm [19:35:50] it's just to wikimedia actually [19:36:01] :S [19:36:34] (03PS1) 10Jdlrobson: Enable banners on all namespaces on Russian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243728 (https://phabricator.wikimedia.org/T114566) [19:37:58] ebernhardson: wooo, it turns out ssh was trying to ssh via ipv6 [19:38:01] and that fails [19:38:03] forcing v4 works [19:39:32] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [19:39:50] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [19:42:42] yuvipanda: you are saying IPv6 broke? 
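The workaround reported at 19:37-19:38 (ssh tried IPv6, failed, and worked once forced to IPv4) can be sketched as below. This is only an illustration and nothing that was run in the channel: the helper name `ipv4_addresses` is hypothetical, and it shows what restricting name resolution to the IPv4 family looks like. With OpenSSH itself the equivalent is the `-4` flag, or `AddressFamily inet` in the client configuration.

```python
import socket
from typing import List


def ipv4_addresses(host: str, port: int = 22) -> List[str]:
    """Return the IPv4 addresses for host, ignoring any IPv6 results.

    Hypothetical helper: family=AF_INET asks the resolver for IPv4
    results only, which mirrors forcing an SSH client to use v4.
    """
    infos = socket.getaddrinfo(
        host, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET the sockaddr is an (address, port) pair.
    seen: List[str] = []
    for _family, _type, _proto, _canon, sockaddr in infos:
        if sockaddr[0] not in seen:
            seen.append(sockaddr[0])
    return seen
```

Calling it with a bastion hostname would list only the addresses an IPv4-forced client could connect to; which hostname to pass is left open here because the log does not name one.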
[19:42:56] checks a bastion [19:42:58] mutante: see other channel [19:43:17] ok [19:44:22] ok, stopped at "comcast" [19:44:32] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [19:44:51] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [19:52:37] yuvipanda: hmm, i ran into that ipv6 issue the other day too. ipv6 fully works but i had some other oddity going on that faidon pointed out (i forget now, but it had something to do with ControlMaster) [19:53:10] ebernhardson: hmm I see. I don't use ControlMaster [19:53:17] ebernhardson: switched to the Verizon MiFi :) [19:54:12] ebernhardson: anyway, nobelium's cluster.name set to labsearch now [19:55:08] sounds like a good solution :) [19:55:26] ebernhardson: :D yeah [19:55:50] ebernhardson: anything else needed for nobelium? [19:56:05] yuvipanda: did you do that via hiera on a wikipage? (just curious) [19:56:18] chasemp: no I merged a change a while ago [19:56:27] and then struggled to get puppet merge because comcast [19:56:36] makes sense, was just curious [19:56:46] chasemp: hiera on wikipage only works for labs instances, not for any host in the production vlan [19:56:48] err [19:56:51] yuvipanda: looks reasonable from here, i'll put together the patch for testwiki and send it out tomorow evening swat after train deploy [19:56:56] for any host with .wmnet :) [19:57:00] ebernhardson: \o/ awesome [19:57:04] yuvipanda: right I figured that out as you responding :D [19:57:12] my brain did the math so to speak [19:57:16] :D [19:57:32] RECOVERY - ElasticSearch health check for shards on nobelium is OK: OK - elasticsearch status labsearch: status: red, number_of_nodes: 1, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 0, cluster_name: labsearch, relocating_shards: 0, active_shards: 0, initializing_shards: 0, number_of_data_nodes: 1, delayed_unassigned_shards: 0 
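The nobelium problem above (a single-node instance picking up the production cluster name, later fixed by setting the name to labsearch) suggests a simple sanity check on the `_cluster/health` response. The sketch below is an assumption-laden illustration, not the Icinga check used here: `health_problems` is a hypothetical name, and it only looks at the `cluster_name` and `status` fields that appear in the RECOVERY message above.

```python
import json
from typing import List


def health_problems(body: str, expected_cluster: str) -> List[str]:
    """List problems found in an Elasticsearch _cluster/health JSON body.

    Hypothetical helper: flags a node that reports a different cluster
    name than expected, and a cluster whose status is not green.
    """
    health = json.loads(body)
    problems: List[str] = []
    name = health.get("cluster_name")
    if name != expected_cluster:
        problems.append(f"cluster_name is {name!r}, expected {expected_cluster!r}")
    status = health.get("status")
    if status != "green":
        problems.append(f"status is {status}")
    return problems
```

Fed the values from the RECOVERY line (cluster_name labsearch, status red), it would report only the red status, which matches a freshly started node with no shards yet.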
[19:57:42] (03PS1) 10Jdlrobson: Use beta labs logo for login form on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243732 (https://phabricator.wikimedia.org/T114552) [20:00:04] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151005T2000). [20:00:13] no parsoid deploy today [20:04:57] (03PS6) 10EBernhardson: Refactor monolog handling for kafka logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) [20:04:59] (03PS1) 10EBernhardson: Drop cirrussearch write jobs after 3 hours of failures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243744 [20:05:08] (03CR) 10Hashar: "Good to see the task is easily solvable via a $wg variable :-}" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243732 (https://phabricator.wikimedia.org/T114552) (owner: 10Jdlrobson) [20:05:12] (03PS2) 10EBernhardson: Drop cirrussearch write jobs after 3 hours of failures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243744 [20:12:48] any bright mind could merge in a contint change for me please ? : D Made a mistake in a package name for Jessie ..... https://gerrit.wikimedia.org/r/#/c/243659/ [20:13:14] I take all the blame :D [20:14:39] (03CR) 10Dzahn: "yep. it's "bundler". 
Description-en: Manage Ruby application dependencies" [puppet] - 10https://gerrit.wikimedia.org/r/243659 (https://phabricator.wikimedia.org/T110865) (owner: 10Hashar) [20:14:43] (03PS3) 10Dzahn: contint: fix bundler package name on Jessie [puppet] - 10https://gerrit.wikimedia.org/r/243659 (https://phabricator.wikimedia.org/T110865) (owner: 10Hashar) [20:14:44] \O/ [20:14:55] (03CR) 10Dzahn: [C: 032] contint: fix bundler package name on Jessie [puppet] - 10https://gerrit.wikimedia.org/r/243659 (https://phabricator.wikimedia.org/T110865) (owner: 10Hashar) [20:15:19] mutante: for the story I was pretty sure I tested it ... The only problem is that I have tested Jessie on a Trusty box hehe [20:16:01] hashar: :p merged on palladium now [20:16:07] greeaat [20:16:19] tested jessie on stretch, hehe [20:16:29] no mobileapps deploy today [20:19:08] andrewbogott: i see like ONE (1) "quoted boolean" error in the entire repo [20:22:36] andrewbogott: ah, "submodule" hrrrr [20:27:04] hashar: to be able to re-enable a lint check option, do i need to fix it in all submodules too or just in the entire repo itself [20:27:12] mutante: hm, I suppose the linter doesn’t check submodules until there’s a patch in the submodule, huh? [20:27:19] andrewbogott: heh, yea [20:27:38] i am also down to just a single "variable not enclosed" :) [20:27:45] great! [20:30:27] (03PS1) 10Dzahn: lint: fix the last "variable not enclosed" [puppet] - 10https://gerrit.wikimedia.org/r/243796 [20:30:36] (03PS1) 10John F. Lewis: lint: enclose variables [puppet/cdh] - 10https://gerrit.wikimedia.org/r/243797 [20:30:45] mutante: ^ [20:31:06] (03CR) 10Dzahn: [C: 032] lint: fix the last "variable not enclosed" [puppet] - 10https://gerrit.wikimedia.org/r/243796 (owner: 10Dzahn) [20:35:43] (03CR) 10Andrew Bogott: [C: 032] lint: enclose variables [puppet/cdh] - 10https://gerrit.wikimedia.org/r/243797 (owner: 10John F. Lewis) [20:37:09] andrewbogott: that was a submodule.. 
that means it needs another update somehow [20:37:16] oh, right [20:37:22] um… need me to do that? [20:37:24] greg-g, services are not deploying, and zero asked me to urgently deploy something if that's ok [20:37:38] (03PS1) 10John F. Lewis: bump cdh submodule (lint fixes) [puppet] - 10https://gerrit.wikimedia.org/r/243799 [20:37:44] andrewbogott: ^ nope [20:37:48] (03PS1) 10Yuvipanda: tools: Add puppetmaster/client roles [puppet] - 10https://gerrit.wikimedia.org/r/243800 (https://phabricator.wikimedia.org/T112005) [20:37:49] andrewbogott: i dont know :p [20:38:09] JohnFLewis: thanks :) [20:38:10] (03PS2) 10John F. Lewis: bump cdh submodule (lint fixes) [puppet] - 10https://gerrit.wikimedia.org/r/243799 [20:38:39] (03PS2) 10Yuvipanda: tools: Add puppetmaster/client roles [puppet] - 10https://gerrit.wikimedia.org/r/243800 (https://phabricator.wikimedia.org/T112005) [20:40:35] (03PS2) 10Andrew Bogott: labs: get zip via 'require_package' [puppet] - 10https://gerrit.wikimedia.org/r/243674 (owner: 10Hashar) [20:40:57] valhallasw`cloud: ^ I wrote a puppet patch for it too [20:40:59] the puppetmaster [20:41:29] valhallasw`cloud: lets me automate credential provisioning for k8s :) [20:41:41] yuvipanda: I'm going to bed now :-) [20:41:44] (03CR) 10Andrew Bogott: [C: 032] labs: get zip via 'require_package' [puppet] - 10https://gerrit.wikimedia.org/r/243674 (owner: 10Hashar) [20:41:51] valhallasw`cloud: kk [20:41:58] valhallasw`cloud: hate timezones, etc [20:42:06] yes, come back to europe :> [20:42:10] (03PS3) 10Andrew Bogott: bump cdh submodule (lint fixes) [puppet] - 10https://gerrit.wikimedia.org/r/243799 (owner: 10John F. Lewis) [20:42:15] (03PS3) 10Yuvipanda: tools: Add puppetmaster/client roles [puppet] - 10https://gerrit.wikimedia.org/r/243800 (https://phabricator.wikimedia.org/T112005) [20:42:23] (03CR) 10Andrew Bogott: [C: 032] bump cdh submodule (lint fixes) [puppet] - 10https://gerrit.wikimedia.org/r/243799 (owner: 10John F. 
Lewis) [20:42:24] valhallasw`cloud: it is tempting etc :) we'll se [20:43:25] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1002 for JUnikowski_WMF - https://phabricator.wikimedia.org/T113298#1703563 (10JUnikowski_WMF) Thanks, @RobH and @madhuvishy! [20:44:36] (03PS1) 10Dzahn: lint: re-enable 'variable not enclosed' check [puppet] - 10https://gerrit.wikimedia.org/r/243803 [20:44:40] hashar: ^ [20:44:48] (03CR) 10Yuvipanda: [C: 032] tools: Add puppetmaster/client roles [puppet] - 10https://gerrit.wikimedia.org/r/243800 (https://phabricator.wikimedia.org/T112005) (owner: 10Yuvipanda) [20:45:11] (03PS7) 10EBernhardson: Refactor monolog handling for kafka logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) [20:45:21] mutante: puppet-lint will do all the magic for us! [20:46:02] (03PS2) 10Dzahn: lint: re-enable 'variable not enclosed' check [puppet] - 10https://gerrit.wikimedia.org/r/243803 (https://phabricator.wikimedia.org/T87132) [20:46:25] (03PS4) 10Yuvipanda: tools: Add puppetmaster/client roles [puppet] - 10https://gerrit.wikimedia.org/r/243800 (https://phabricator.wikimedia.org/T112005) [20:46:39] (03CR) 10Yuvipanda: [V: 032] tools: Add puppetmaster/client roles [puppet] - 10https://gerrit.wikimedia.org/r/243800 (https://phabricator.wikimedia.org/T112005) (owner: 10Yuvipanda) [20:46:44] yurik: ok, what is it? :) [20:46:47] (03CR) 10EBernhardson: "max had some concerns over the test case pulling in InitializeSettings.php, that it would change global state for other tests. PS7 adds a " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) (owner: 10EBernhardson) [20:46:49] yuvipanda: one second? [20:47:05] hmm? [20:47:16] yuvipanda: you just re-introduced some warnings that i just removed from the entire repo [20:47:24] greg-g, https://gerrit.wikimedia.org/r/#/c/242661/ [20:47:28] ow [20:47:30] did I? 
[20:47:30] so that blocks enabling the jenkins check [20:47:35] oops [20:47:37] I know which one [20:47:39] let me fix that [20:47:40] minute => "*/$run_every_minutes", [20:47:44] thanks :) [20:47:56] yurik: kk [20:48:14] yuvipanda: Hey, got a moment [20:48:15] ? [20:48:24] (03PS1) 10Yuvipanda: puppetmaster: Quote variable inclusions [puppet] - 10https://gerrit.wikimedia.org/r/243804 [20:48:27] (03CR) 10Andrew Bogott: [C: 031] lint: re-enable 'variable not enclosed' check [puppet] - 10https://gerrit.wikimedia.org/r/243803 (https://phabricator.wikimedia.org/T87132) (owner: 10Dzahn) [20:48:27] mutante: yw and sorry I missed it [20:48:29] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: Quote variable inclusions [puppet] - 10https://gerrit.wikimedia.org/r/243804 (owner: 10Yuvipanda) [20:48:45] hoo: sup [20:48:45] (03PS2) 10Yuvipanda: puppetmaster: Quote variable inclusions [puppet] - 10https://gerrit.wikimedia.org/r/243804 [20:48:53] yuvipanda: thx, we'll let jenkins tell us [20:49:02] mutante: yeah [20:49:07] although this -1 was a rebase -1 [20:49:18] hashar: "all the magic" ? [20:49:25] yuvipanda: On Wednesday the fix for the last blocker for https://gerrit.wikimedia.org/r/238396 will be deployed [20:49:26] mutante: actually running the check :-) [20:49:41] hashar: ok, so not "i noticed this doesnt trigger so i'm gonna re-enable it" hehe [20:50:51] anyway sleepy time [20:50:59] good night [20:52:29] hoo: ah :) want me to remove my -2? [20:52:55] yuvipanda: That would be nice yes. [20:53:04] (03PS1) 10Yuvipanda: tools: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/243805 [20:53:09] (03CR) 10jenkins-bot: [V: 04-1] tools: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/243805 (owner: 10Yuvipanda) [20:53:13] Also would you be willing to deploy that on Thursday? 
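The 'variable not enclosed' warning quoted at 20:47:40 comes from a double-quoted string that interpolates `$run_every_minutes` without braces; the enclosed form would be `"*/${run_every_minutes}"`. A rough sketch of that kind of check follows. It is not puppet-lint's implementation: the function name and both regular expressions are assumptions made for illustration, and they ignore multi-line strings and heredocs.

```python
import re
from typing import List

# Contents of a double-quoted string, allowing backslash escapes.
_DOUBLE_QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')
# A "$name" (optionally namespaced with "::") that is not written "${name}".
_BARE_VARIABLE = re.compile(r'(?<!\\)\$(?!\{)((?:::)?[a-z_]\w*(?:::\w+)*)')


def unenclosed_variables(line: str) -> List[str]:
    """Return variable names interpolated without braces on one line.

    Only double-quoted strings are inspected, since single-quoted
    strings are not interpolated in Puppet.
    """
    found: List[str] = []
    for match in _DOUBLE_QUOTED.finditer(line):
        found.extend(_BARE_VARIABLE.findall(match.group(1)))
    return found
```

Run over the line from the log, it flags `run_every_minutes`; the braced form yields an empty list.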
[20:53:19] (03PS1) 10Rush: iridium: define a second IP address for VCS things [dns] - 10https://gerrit.wikimedia.org/r/243806 [20:53:33] (03CR) 10Yuvipanda: [V: 032] "Hoo tells me it is almost ready to merge. Removing -2" [puppet] - 10https://gerrit.wikimedia.org/r/238396 (https://phabricator.wikimedia.org/T111015) (owner: 10Bene) [20:53:38] err [20:53:42] not sure why I added V+2 [20:53:45] I'm looking for someone to merge that change, preferably during European business hours, so that we can still react to trouble [20:53:51] hoo: put it up for puppetswat [20:54:08] hoo: oh, hmm, puppetswat isn't during european business hours (well, towards the very end of) [20:54:19] It's quite late :S [20:54:28] hoo: unfortunately (?) I'm in the US west coast now and can't do it :( getting bblack to +1 it might help recruit others [20:54:57] (03CR) 10Yuvipanda: [WIP] Enable automatic redirect to mobile Wikidata [puppet] - 10https://gerrit.wikimedia.org/r/238396 (https://phabricator.wikimedia.org/T111015) (owner: 10Bene) [20:55:04] !log yurik@tin Synchronized php-1.27.0-wmf.1/extensions/ZeroBanner: Deploying ZeroBanner 242661 (duration: 00m 17s) [20:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:55:18] (03CR) 10Rush: [C: 032] iridium: define a second IP address for VCS things [dns] - 10https://gerrit.wikimedia.org/r/243806 (owner: 10Rush) [20:56:15] (03CR) 10Yuvipanda: [C: 032] puppetmaster: Quote variable inclusions [puppet] - 10https://gerrit.wikimedia.org/r/243804 (owner: 10Yuvipanda) [20:56:28] (03PS2) 10Yuvipanda: tools: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/243805 [20:56:37] (03PS1) 10Andrew Bogott: tools.pp: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/243807 [20:56:40] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/243805 (owner: 10Yuvipanda) [20:56:49] (03PS2) 10Andrew Bogott: tools.pp: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/243807 
[20:56:59] yuvipanda: That sounds like a good idea [20:57:01] rebased into nothing :) [20:58:31] (03Abandoned) 10Andrew Bogott: tools.pp: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/243807 (owner: 10Andrew Bogott) [20:58:49] hoo: :) [20:58:57] andrewbogott: sorry about that [20:58:59] at least it didn't hit prod [20:59:05] :) [20:59:08] because I puppet merged them together [21:00:01] (03PS3) 10Hoo man: Enable automatic redirect to mobile Wikidata [puppet] - 10https://gerrit.wikimedia.org/r/238396 (https://phabricator.wikimedia.org/T111015) (owner: 10Bene) [21:00:04] yurik: Dear anthropoid, the time has come. Please deploy Zero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151005T2100). [21:00:16] (03PS4) 10Hoo man: Enable automatic redirect to mobile Wikidata [puppet] - 10https://gerrit.wikimedia.org/r/238396 (https://phabricator.wikimedia.org/T111015) (owner: 10Bene) [21:00:20] !log yurik@tin Synchronized php-1.27.0-wmf.1/extensions/ZeroBanner: Rolling back ZeroBanner 242661 (duration: 00m 18s) [21:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:00:52] bblack: If you got a moment, it would be awesome, if you could +1 https://gerrit.wikimedia.org/r/238396 so that it can go out on Thursday easily [21:02:05] yuvipanda: Thanks... and come back to enjoy CEST! :D [21:04:02] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1703643 (10Nuria) >a way to expose a stream of events in a defined format that can be consumed easily by a range of clients. This talks about consumption, not production but I do not wan... 
[21:06:00] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 18.18% of data above the critical threshold [500.0] [21:13:39] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for VBaranetsky - https://phabricator.wikimedia.org/T114308#1703654 (10RobH) a:5RobH>3VBaranetsky [21:14:28] (03PS1) 10Yuvipanda: labs: Don't explicitly include base::puppet [puppet] - 10https://gerrit.wikimedia.org/r/243811 [21:14:59] (03PS2) 10Yuvipanda: labs: Don't explicitly include base::puppet [puppet] - 10https://gerrit.wikimedia.org/r/243811 [21:15:17] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access request to hive and webrequests - https://phabricator.wikimedia.org/T114642#1703656 (10RobH) @dcausse, Please have your manager approve your addition to the analytics-privatedata-users group. (This is what allows access to the hive data.) [21:15:55] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: adding dcausse to analytics-privatedata-users (hive and webrequests) - https://phabricator.wikimedia.org/T114642#1703657 (10RobH) [21:16:00] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:16:14] !log yurik@tin Synchronized php-1.27.0-wmf.1/extensions/ZeroBanner: Take2: Deploying ZeroBanner 242661+243808 (duration: 00m 17s) [21:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:16:42] (03CR) 10Yuvipanda: [C: 032 V: 032] labs: Don't explicitly include base::puppet [puppet] - 10https://gerrit.wikimedia.org/r/243811 (owner: 10Yuvipanda) [21:17:50] (03CR) 10BBlack: [C: 031] Enable automatic redirect to mobile Wikidata [puppet] - 10https://gerrit.wikimedia.org/r/238396 (https://phabricator.wikimedia.org/T111015) (owner: 10Bene) [21:22:03] 6operations, 10ops-eqiad, 7Swift: [determine] rack ms-be1019-1021 - https://phabricator.wikimedia.org/T114711#1703666 (10RobH) 3NEW [21:22:11] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail 
[21:23:19] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: adding dcausse to analytics-privatedata-users (hive and webrequests) - https://phabricator.wikimedia.org/T114642#1703674 (10Tfinc) Approved [21:26:45] 6operations, 10ops-codfw, 7Swift: [determine] rack ms-be2016-2021 - https://phabricator.wikimedia.org/T114712#1703678 (10RobH) 3NEW [21:26:54] 6operations, 10ops-codfw, 7Swift: [determine] rack ms-be2016-2021 - https://phabricator.wikimedia.org/T114712#1703687 (10RobH) We'll need to determine the racking location of these systems. Please note these come with both 1G and 10G connection options, but we'll be using the 10G options with DAC cables in... [21:29:23] 6operations, 10ops-codfw, 7Swift: [determine] rack ms-be2016-2021 - https://phabricator.wikimedia.org/T114712#1703678 (10RobH) [21:31:41] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1703697 (10RobH) [21:31:58] (03PS1) 10Yuvipanda: tools: Do not shadow 'puppetmaster' from elsewhere [puppet] - 10https://gerrit.wikimedia.org/r/243812 [21:32:03] (03CR) 10jenkins-bot: [V: 04-1] tools: Do not shadow 'puppetmaster' from elsewhere [puppet] - 10https://gerrit.wikimedia.org/r/243812 (owner: 10Yuvipanda) [21:32:43] (03PS2) 10Yuvipanda: tools: Do not shadow 'puppetmaster' from elsewhere [puppet] - 10https://gerrit.wikimedia.org/r/243812 [21:33:30] (03CR) 10Yuvipanda: [C: 032] tools: Do not shadow 'puppetmaster' from elsewhere [puppet] - 10https://gerrit.wikimedia.org/r/243812 (owner: 10Yuvipanda) [21:35:54] 6operations: determine new swift ms-be hostnames (codfw/eqiad) - https://phabricator.wikimedia.org/T114500#1703707 (10RobH) [21:35:55] 6operations, 10ops-eqiad, 7Swift: [determine] rack ms-be1019-1021 - https://phabricator.wikimedia.org/T114711#1703706 (10RobH) [21:36:00] 6operations: determine new swift ms-be hostnames (codfw/eqiad) - https://phabricator.wikimedia.org/T114500#1697674 (10RobH) [21:36:01] 
6operations, 10ops-codfw, 7Swift: [determine] rack ms-be2016-2021 - https://phabricator.wikimedia.org/T114712#1703708 (10RobH) [21:36:06] 6operations, 10ops-codfw, 7Swift: [determine] rack ms-be2016-2021 - https://phabricator.wikimedia.org/T114712#1703678 (10RobH) [21:36:07] 6operations, 10ops-eqiad, 7Swift: [determine] rack ms-be1019-1021 - https://phabricator.wikimedia.org/T114711#1703666 (10RobH) [21:36:09] 6operations: determine new swift ms-be hostnames (codfw/eqiad) - https://phabricator.wikimedia.org/T114500#1703710 (10RobH) 5Open>3Resolved [21:36:25] the joys of task cross-linking irc echos =P [21:41:51] (03PS3) 10Dzahn: lint: re-enable 'variable not enclosed' check [puppet] - 10https://gerrit.wikimedia.org/r/243803 (https://phabricator.wikimedia.org/T87132) [21:41:58] (03CR) 10Dzahn: [C: 032] lint: re-enable 'variable not enclosed' check [puppet] - 10https://gerrit.wikimedia.org/r/243803 (https://phabricator.wikimedia.org/T87132) (owner: 10Dzahn) [21:45:38] (03PS1) 10Yuvipanda: labs: Update split horizon for new proxy IP [puppet] - 10https://gerrit.wikimedia.org/r/243814 [21:45:42] bd808: ^ [21:45:52] Krenair: ^ :D did you put up your code as a patch somewhere? [21:46:01] (03PS2) 10Yuvipanda: labs: Update split horizon for new proxy IP [puppet] - 10https://gerrit.wikimedia.org/r/243814 [21:46:08] which code yuvipanda? 
[21:46:20] Krenair: the internal / external ip generator [21:46:38] yuvipanda, https://gerrit.wikimedia.org/r/#/c/243357/ [21:47:09] (03CR) 10Yuvipanda: [C: 032] labs: Update split horizon for new proxy IP [puppet] - 10https://gerrit.wikimedia.org/r/243814 (owner: 10Yuvipanda) [21:47:21] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [21:51:11] PROBLEM - RAID on db1026 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [21:52:38] oops, I think that l10nupdate puppet patch I did back in July broke l10nupdate on beta [21:54:56] (we disabled l10nupdate by default and explicitly set it on for tin, off for mira. but didn't touch deployment-bastion) [21:57:20] 6operations, 10RESTBase, 6Services, 3Mobile-Content-Service, 7Varnish: Varnish not letting through RESTBase back-end service responses for rest.wm.org - https://phabricator.wikimedia.org/T113223#1703769 (10GWicke) p:5Triage>3Low [21:58:43] (03CR) 10Krinkle: Send image varnish frontend data from logs to statsd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/234157 (https://phabricator.wikimedia.org/T105681) (owner: 10Gilles) [22:04:00] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [22:04:11] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [22:07:20] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [22:07:31] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. 
[22:08:40] PROBLEM - puppet last run on cp4012 is CRITICAL: CRITICAL: puppet fail [22:14:41] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: puppet fail [22:26:20] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [22:27:53] (03CR) 10Yuvipanda: [C: 031] Add class base::puppet::ca (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/243661 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [22:31:22] PROBLEM - Freshness of OCSP Stapling files on cp1043 is CRITICAL: CRITICAL: File /var/cache/ocsp/wmfusercontent.org.ocsp is more than 29100 secs old! [22:32:34] 6operations, 10Parsoid: All parsoid servers almost unresponsive, under high cpu load - https://phabricator.wikimedia.org/T114558#1703834 (10ssastry) Here is the summary of my findings so far: 1. I took a dump from [[https://logstash.wikimedia.org/#/dashboard/elasticsearch/parsoid-cpu-timeouts|Parsoid CPU time... [22:34:14] (03PS1) 10Yuvipanda: k8s: Adjust ssldir for self hosted puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/243820 [22:34:30] (03PS2) 10Yuvipanda: k8s: Adjust ssldir for self hosted puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/243820 [22:35:30] RECOVERY - puppet last run on cp4012 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [22:36:03] (03CR) 10Yuvipanda: [C: 032] k8s: Adjust ssldir for self hosted puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/243820 (owner: 10Yuvipanda) [22:42:10] (03CR) 10BryanDavis: [C: 031] "One very tiny nit inline but in general this looks reasonable" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240615 (https://phabricator.wikimedia.org/T103505) (owner: 10EBernhardson) [22:55:04] 6operations, 10Parsoid: All parsoid servers almost unresponsive, under high cpu load - https://phabricator.wikimedia.org/T114558#1703936 (10ssastry) Here is the script I ran on different production 
server (it is either parsoid.log.2.gz or parsoid.log.3.gz depending on how the logs getted rolled up). ``` $ gunz... [22:55:41] PROBLEM - Freshness of OCSP Stapling files on cp1044 is CRITICAL: CRITICAL: File /var/cache/ocsp/wmfusercontent.org.ocsp is more than 29100 secs old! [22:58:10] 6operations, 10ops-codfw: update spares sheet with DAC cable count - https://phabricator.wikimedia.org/T114720#1703965 (10RobH) 3NEW a:3Papaul [23:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151005T2300). Please do the needful. [23:00:04] James_F jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:01:41] 4pm already/ [23:02:13] thanks for doing the submodule update commit James_F [23:02:43] * James_F waves. [23:02:44] jdlrobson, around? [23:02:45] Yeah. [23:02:54] Krenair: yep [23:03:34] Krenair: sounds like there might be a big zero issue to fix as well.. [23:03:42] (03PS2) 10Alex Monk: Enable WikidataPageBanner on Catalan Wikipedia and Chinese Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242942 (https://phabricator.wikimedia.org/T114392) (owner: 10Jdlrobson) [23:03:46] zero completely broken in earlier SWAT. [23:04:01] (03CR) 10Alex Monk: [C: 032] Enable WikidataPageBanner on Catalan Wikipedia and Chinese Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242942 (https://phabricator.wikimedia.org/T114392) (owner: 10Jdlrobson) [23:04:08] jdlrobson, they didn't fix that? [23:04:24] (03Merged) 10jenkins-bot: Enable WikidataPageBanner on Catalan Wikipedia and Chinese Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242942 (https://phabricator.wikimedia.org/T114392) (owner: 10Jdlrobson) [23:04:25] seems not. It's still broken, but luckily it's just a one line change I believe. 
[23:04:30] I heard a class was missing or something but assumed they dealt with it immediately [23:05:15] we fixed the issue with a class missing but something popped up since then somehow [23:05:19] https://gerrit.wikimedia.org/r/243827 jdlrobson [23:05:55] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/242942/ (duration: 00m 17s) [23:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:06:01] jdlrobson, ^ [23:06:07] 6operations, 10ops-codfw: update spares sheet with DAC cable count - https://phabricator.wikimedia.org/T114720#1703995 (10RobH) Background: We'll be using 6 of these for the racking of the new swift backend servers ordered (to be racked via T114712). So I want to ensure we have all the needed DAC cables, sinc... [23:07:31] Krenair: i am testing [23:07:35] k [23:07:38] 6operations, 10ops-codfw, 7Swift: [determine] rack ms-be2016-2021 - https://phabricator.wikimedia.org/T114712#1704015 (10RobH) a:3fgiunchedi I'd like to get @fgiunchedi's viewpoint on this racking, just to ensure I'm not overlooking anything. If my proposed racking locations work, please assign back to me... [23:08:54] Krenair: works for Catalan and zh wikivoyage. Thanks :D [23:08:59] great [23:09:09] (03PS2) 10Alex Monk: Enable banners on all namespaces on Russian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243728 (https://phabricator.wikimedia.org/T114566) (owner: 10Jdlrobson) [23:09:45] jdlrobson, ummm [23:09:52] is ruwikivoyage like a one-person community or something? [23:10:40] Requester may be a sysop, but it's not properly linked... [23:11:36] Krenair: looks great! Thanks :) [23:11:46] ... [23:11:55] I didn't deploy anything... [23:12:00] on Russian? [23:12:04] no [23:12:13] mm why is that workign for me then.. 
now i'm very confused [23:12:55] oh wait [23:12:58] i was testing user namespace [23:13:02] that was already enabled [23:13:04] ignore :) [23:13:08] i'll test user talk [23:14:50] !log krenair@tin Synchronized php-1.27.0-wmf.1/extensions/VisualEditor/modules/ve-mw/ui: https://gerrit.wikimedia.org/r/#/c/243729/ (duration: 00m 17s) [23:14:51] James_F, ^ [23:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:14:56] Thanks. Checking. [23:15:55] Krenair: Yup, looks good. Thanks! [23:15:58] great [23:17:22] @jdlrobson I can confirm that the unexpected token error does not occur when modifying the headers via extension, so that script must only get loaded if the IP is properly added to a partner config.... which makes absolutely no sense to me but that's how it works :/ [23:19:02] (03CR) 10Alex Monk: [C: 04-1] "this can't go on all beta wikis, it's a wikipedia logo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243732 (https://phabricator.wikimedia.org/T114552) (owner: 10Jdlrobson) [23:20:40] MaxSem or Krenair would one of you mind adding https://gerrit.wikimedia.org/r/243728 to the SWAT today. Zero experience of wiki is pretty broken without it :-/ [23:21:30] 6operations, 10Parsoid: All parsoid servers almost unresponsive, under high cpu load - https://phabricator.wikimedia.org/T114558#1704067 (10cscott) Dumping some IRC conversation: (06:51:28 PM) cscott-free: yeah, it's because for certain long lived articles the number of contributors is huge. (06:53:32 PM) csco... [23:21:45] you mean https://gerrit.wikimedia.org/r/243827 jdlrobson? [23:21:51] 827 vs. 728 [23:22:37] Krenair: correct. [23:22:39] sorry for that [23:23:16] who can test those zero changes? [23:23:37] jhobs: [23:23:56] he's sick but came in especially given yurik is unavailable... 
[23:24:15] yuri's on a plane atm so yes I can test them [23:24:33] as long as someone pings me [23:24:43] oops, we both cherry-picked the same thing jdlrobson :) [23:26:05] Krenair: hah :) [23:26:43] (03CR) 10Alex Monk: [C: 04-1] "see ticket" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243728 (https://phabricator.wikimedia.org/T114566) (owner: 10Jdlrobson) [23:27:36] Krenair: https://ru.wikivoyage.org/wiki/Wikivoyage:%D0%9F%D0%B8%D0%B2%D0%BD%D0%B0%D1%8F_%D0%BF%D1%83%D1%82%D0%B5%D1%88%D0%B5%D1%81%D1%82%D0%B2%D0%B5%D0%BD%D0%BD%D0%B8%D0%BA%D0%BE%D0%B2#.D0.9F.D1.80.D0.BE.D0.BF.D0.B0.D0.BB_.D0.B1.D0.B0.D0.BD.D0.BD.D0.B5.D1.80_.D0.B2_.D0.BF.D0.B8.D0.B2.D0.BD.D0.BE.D0.B9 [23:27:48] it was a regression when they shifted to the banner extension [23:30:22] ah, ok [23:30:56] jdlrobson, can we change CommonSettings to set this instead? [23:31:24] 6operations, 10Parsoid: Investigate Oct 3 outage of the Parsoid cluster due to high cpu usage + high memory usage (sharp spike in both) around 08:35 UTC - https://phabricator.wikimedia.org/T114558#1704095 (10ssastry) [23:31:25] or instead come up with a way of setting WPBNamespaces to apply to all? by setting it to false or something? [23:31:39] I don't want to duplicate the array of namespaces [23:32:48] Krenair: yeh i wasn't sure how best to do this. This was the best I could come up with. Open to new ideas that will work. [23:33:01] we could also make the extension support "All namespaces" [23:33:10] i was just keen to get the Russian Wikivoyagers back their banners. [23:34:34] jdlrobson, if I merge this commit will you open a tech debt task to fix this properly? [23:34:45] Krenair: of course. 
[23:36:32] Krenair: https://phabricator.wikimedia.org/T114723?workflow=create [23:37:29] (03CR) 10Alex Monk: [C: 032] Enable banners on all namespaces on Russian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243728 (https://phabricator.wikimedia.org/T114566) (owner: 10Jdlrobson) [23:37:35] (03Merged) 10jenkins-bot: Enable banners on all namespaces on Russian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243728 (https://phabricator.wikimedia.org/T114566) (owner: 10Jdlrobson) [23:38:09] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/243728/ (duration: 00m 17s) [23:38:11] jdlrobson, please test [23:38:34] Krenair, fatals [23:38:44] fatal that doesn't look good! [23:38:51] (03PS1) 10MaxSem: Revert "Enable banners on all namespaces on Russian Wikivoyage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243836 [23:38:52] is wikitech broken? [23:39:02] PROBLEM - HHVM rendering on mw1187 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50416 bytes in 0.014 second response time [23:39:02] I get a blank page [23:39:05] shooott [23:39:06] typo [23:39:07] arrray [23:39:09] Damn r [23:39:14] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: rv (duration: 00m 17s) [23:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:39:27] jdlrobson: is that a pirate array? :) [23:39:41] * jdlrobson wonders why phpcs didnt pick that up [23:39:58] wikitech is back [23:40:08] jdlrobson, I'll file a bug [23:40:40] Krenair, should I abandon https://gerrit.wikimedia.org/r/#/c/243836/ ? 
[23:40:41] RECOVERY - HHVM rendering on mw1187 is OK: HTTP OK: HTTP/1.1 200 OK - 70501 bytes in 0.116 second response time [23:40:51] no [23:40:53] sorry about that :-( [23:40:56] I'm about to merge it, MaxSem [23:41:02] I just reset --hard HEAD^ on tin and sync'd [23:41:14] (03CR) 10Alex Monk: [C: 032] Revert "Enable banners on all namespaces on Russian Wikivoyage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243836 (owner: 10MaxSem) [23:41:20] (03Merged) 10jenkins-bot: Revert "Enable banners on all namespaces on Russian Wikivoyage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243836 (owner: 10MaxSem) [23:42:09] hmm, phplint isn't able to detect this kind of errors [23:42:16] (03PS1) 10Jdlrobson: Enable banners on all namespaces on Russian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243837 (https://phabricator.wikimedia.org/T114566) [23:42:31] phplint runs php -l * [23:42:39] arrray is valid function name [23:42:43] it's syntactically valid [23:42:53] does phpcs not have a concept of globals/ functions ? [23:42:54] so, instead tests are to blame [23:43:05] (03PS1) 10Gilles: Fix varnishmedia comment [puppet] - 10https://gerrit.wikimedia.org/r/243838 [23:43:14] btw, why did wikitech/mw.org have an error but not enwiki? [23:43:16] phpcs doesn't run on mediawiki-config... [23:43:21] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1002 is CRITICAL: CRITICAL - Expecting active but unit nfs-exports is failed [23:43:22] greg-g, enwiki did have an error. [23:43:29] oh, I just didn't get it [23:43:30] jdlrobson, to see this kind of problems, it has to actually execute the code in question [23:43:32] (03CR) 10Gilles: Send image varnish frontend data from logs to statsd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/234157 (https://phabricator.wikimedia.org/T105681) (owner: 10Gilles) [23:43:33] Several users noticed. 
[23:43:42] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [500.0] [23:44:07] jdlrobson: Krenair: when ya'll are done, since users noticed, a quick incident report should be written [23:44:14] I knew you were going to say that... [23:44:20] :) [23:44:29] For some reason the bot didn't log the initial sync to wikitech [23:44:30] greg-g: noted [23:44:32] just the revert [23:47:14] Krenair, can't really blame thebot for not logging to a broken wiki ;P [23:47:22] uh, right [23:47:23] good point [23:48:08] heh [23:48:33] from my IRC logs it looks like it lasted about 65 seconds [23:48:37] https://tools.wmflabs.org/sal/production got both !logs [23:49:23] 6operations, 6Editing-Department, 6Parsing-Team, 6Services: Services team goals October - December 2015 (Q2 2015/16) - https://phabricator.wikimedia.org/T111819#1704131 (10GWicke) [23:50:00] greg-g, okay if we continue with the ZeroBanner fix? [23:50:47] Please greg-g - the bugs didn't get squashed earlier and all our Zero traffic is throwing js exceptions :/ [23:51:46] yeah [23:53:01] filed as https://phabricator.wikimedia.org/T114725 <-- greg-g [23:53:07] ty [23:53:20] jhobs, ping [23:53:34] !log krenair@tin Synchronized php-1.27.0-wmf.1/extensions/ZeroBanner/includes/ZeroSpecialPage.php: https://gerrit.wikimedia.org/r/#/c/243833/ (duration: 00m 17s) [23:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:53:43] Krenair: live on enwiki? 
[23:53:48] yes [23:53:52] 6operations, 10ops-codfw: update spares sheet with DAC cable count - https://phabricator.wikimedia.org/T114720#1704158 (10RobH) [23:53:53] 6operations, 10ops-codfw, 7Swift: [determine] rack ms-be2016-2021 - https://phabricator.wikimedia.org/T114712#1704157 (10RobH) [23:54:27] jhobs: i can confirm its fixed but you may need to clear cache [23:54:52] https://en.m.wikipedia.org/w/index.php?title=Special:ZeroRatedMobileAccess&zcmd=js-banner < can we purge the cache for this? bblack ? [23:54:59] jdlrobson: i'm clearing cache but it doesn't appear to be fixed [23:55:12] jdlrobson: but i still see the