[00:02:05] hoo: did you file a bug for opensearch changes? [00:02:16] RoanKattouw: can confirm working now without debug=true [00:02:18] \o/ [00:02:50] legoktm: No, I couldn't figure out any changes when manually undoing all of the API patches [00:02:54] so I'm rather clueless [00:03:20] I have a patch that fixes wikibase by shifting indexes, but that's not nice [00:03:46] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0] [00:04:09] gwicke: Could that be related to the RB change? ---^^ [00:04:27] hoo: what did the output used to look like? [00:04:51] legoktm: Like it does now as far as I can tell, which is weird [00:06:20] :/ [00:06:37] oh, doh [00:06:39] I got it [00:06:48] The warning stuff is the problem [00:06:51] it switched positions [00:06:59] that's why the indexes shifted [00:07:14] RoanKattouw: looking [00:07:59] RoanKattouw: https://gerrit.wikimedia.org/r/#/c/207671/1/wmf-config/InitialiseSettings.php looks sane to me [00:08:35] Even if I check out master locally it still doesn't match up with prod :S [00:10:57] RECOVERY - RAID on db1060 is OK optimal, 1 logical, 2 physical [00:11:00] hoo: warning? [00:11:11] https://en.wikipedia.org/w/api.php?search=Foo&action=opensearch&format=jsonfm&dd [00:11:19] legoktm: The unrecognized parameter thing [00:11:40] o.O [00:11:40] It used to have the key warning, now it has a numeric key, thus the keys of the rest shift [00:12:14] ooh fun [00:12:17] That's why I didn't notice a change... 
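(Editor's aside: the regression being diagnosed above can be sketched in a few lines. This is a toy model in Python, not the actual MediaWiki ApiOpenSearch code: the opensearch response is a positional array `[query, titles, descriptions, urls]`, and clients read it by index. When a warning is emitted with a numeric key instead of the string key `warnings`, it lands inside the positional payload and everything after it shifts by one, which is exactly what the B/C patch synced at 01:17 restores.)

```python
# Toy model of the ApiOpenSearch index-shift regression. Names and
# helper structure are illustrative, not the real MediaWiki internals.
def build_response(query, titles, warning=None, string_warning_key=True):
    """Serialize an opensearch-style response.

    With string_warning_key=True (the old, B/C behaviour) the warning
    keeps the named 'warnings' key and the payload stays at indexes
    0..3. With False (the regression), the warning takes a numeric slot
    and every positional element shifts by one.
    """
    payload = [query, titles, ["desc"] * len(titles), ["url"] * len(titles)]
    if warning is None:
        return payload
    if string_warning_key:
        # A JSON object keeps the named key out of the numeric sequence,
        # so "titles" is still found at key "1".
        obj = {str(i): v for i, v in enumerate(payload)}
        obj["warnings"] = warning
        return obj
    # Regression: the warning consumes slot 0 and shifts the rest.
    return [warning] + payload

ok = build_response("Foo", ["Foo", "Foobar"], warning={"main": "unrecognized parameter"})
bad = build_response("Foo", ["Foo", "Foobar"], warning={"main": "unrecognized parameter"},
                     string_warning_key=False)
print(ok["1"])   # ['Foo', 'Foobar'] -- titles still where clients expect them
print(bad[1])    # 'Foo' -- shifted: index 1 is now the query string
```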
I removed all cruft to the point the results matched p [00:12:28] yeah, file a bug for that [00:13:08] RoanKattouw: nothing in https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor that looks related to RB [00:13:16] OK [00:13:27] Just wanted to check, because that 500 spike happened shortly after the RB deploy [00:13:30] lots of OOMs it seems [00:13:33] Correlation != causation but you gotta check [00:15:21] *nod* [00:15:27] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [00:17:08] that's a good sign [00:23:45] (03PS1) 10Ori.livneh: add role::logging::eventlistener to text and mobile varnishes [puppet] - 10https://gerrit.wikimedia.org/r/207692 [00:57:25] (03PS1) 10Dereckson: More UNIX agnostic, less GNU/Linux-centric scripts [dumps] - 10https://gerrit.wikimedia.org/r/207694 [01:09:31] (03CR) 10Tim Landscheidt: [C: 031] "checkbashisms didn't complain." [dumps] - 10https://gerrit.wikimedia.org/r/207694 (owner: 10Dereckson) [01:14:13] (03PS2) 10Dzahn: chapcomwiki -> affcomwiki [puppet] - 10https://gerrit.wikimedia.org/r/169944 (https://bugzilla.wikimedia.org/39482) (owner: 10Reedy) [01:15:52] 6operations, 10Wikimedia-Mailing-lists: move analytics-internal list to analytics-wmf - https://phabricator.wikimedia.org/T97618#1248009 (10Krenair) [01:17:20] !log legoktm Synchronized php-1.26wmf4/includes/api/ApiOpenSearch.php: Restore B/C for ApiOpenSearch json output if warnings are present (duration: 00m 30s) [01:17:30] Logged the message, Master [01:18:08] !log legoktm Synchronized php-1.26wmf3/includes/api/ApiOpenSearch.php: Restore B/C for ApiOpenSearch json output if warnings are present (duration: 00m 20s) [01:18:09] 30s for one file? [01:18:10] o_0 [01:18:14] Logged the message, Master [01:18:15] Reedy: there's one host being slow [01:18:21] aha [01:18:30] > 21:04 bd808: load avg on snapshot04 11.11; scap slow waiting on it [01:25:03] * YuviPanda YuviKTM [01:25:05] err [01:26:58] switcheroo? 
/nick Reedori [01:29:56] (03PS2) 10Dereckson: Fixed PEP-8 issues [dumps] - 10https://gerrit.wikimedia.org/r/207504 [01:32:14] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1248038 (10GWicke) An example for why we should have some sanity checks during deploy: https://wikitech.wikimedia.org/wiki/Incident_docume... [01:32:19] (03PS1) 10Dereckson: Created .gitignore [dumps] - 10https://gerrit.wikimedia.org/r/207699 [01:32:54] (03PS1) 10Yuvipanda: tools: Remove mariadb-client from precise exec hosts [puppet] - 10https://gerrit.wikimedia.org/r/207700 [01:36:12] (03PS2) 10Yuvipanda: tools: Remove mariadb-client from precise exec hosts [puppet] - 10https://gerrit.wikimedia.org/r/207700 [01:36:22] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Remove mariadb-client from precise exec hosts [puppet] - 10https://gerrit.wikimedia.org/r/207700 (owner: 10Yuvipanda) [01:37:16] (03PS3) 10Dereckson: Fixed PEP-8 issues [dumps] - 10https://gerrit.wikimedia.org/r/207504 [01:39:17] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [01:50:57] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [01:51:28] PROBLEM - puppet last run on cp3004 is CRITICAL puppet fail [02:02:12] (03PS1) 10Dereckson: Fixed various issues detected by pyflakes [dumps] - 10https://gerrit.wikimedia.org/r/207708 [02:08:17] RECOVERY - puppet last run on cp3004 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures [02:20:59] (03PS1) 10Dereckson: Fixed code documentation punctuation [dumps] - 10https://gerrit.wikimedia.org/r/207712 [02:31:26] PROBLEM - puppet last run on es2004 is CRITICAL puppet fail [02:31:42] !log l10nupdate Synchronized php-1.26wmf3/cache/l10n: (no message) (duration: 10m 59s) [02:31:54] Logged the message, Master [02:36:37] PROBLEM - puppet last run on 
snapshot1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:38:16] RECOVERY - puppet last run on snapshot1004 is OK Puppet is currently enabled, last run 15 minutes ago with 0 failures [02:39:06] !log LocalisationUpdate completed (1.26wmf3) at 2015-04-30 02:38:03+00:00 [02:39:13] Logged the message, Master [02:49:47] RECOVERY - puppet last run on es2004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [02:59:47] PROBLEM - puppet last run on mw1066 is CRITICAL Puppet has 1 failures [03:00:15] !log l10nupdate Synchronized php-1.26wmf4/cache/l10n: (no message) (duration: 07m 09s) [03:00:25] Logged the message, Master [03:04:12] !log LocalisationUpdate completed (1.26wmf4) at 2015-04-30 03:03:09+00:00 [03:04:20] Logged the message, Master [03:14:47] RECOVERY - puppet last run on mw1066 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [03:36:45] (03CR) 10Dereckson: "Followup: I2732313ea5d61814e247c27c60da40a6acd1f283" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207273 (https://phabricator.wikimedia.org/T97397) (owner: 10Dereckson) [03:37:12] (03PS1) 10Dereckson: Restrict local uploads on mai.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207725 (https://phabricator.wikimedia.org/T97397) [05:15:53] !log draining esams, planned upsteam network maintenance [05:16:04] Logged the message, Master [05:21:58] grrrit-wm1: grrrr [05:28:30] did it break again now? 
[05:28:43] * YuviKTM restarts it again [05:32:56] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193 [05:34:36] RECOVERY - BGP status on cr2-ulsfo is OK host 198.35.26.193, sessions up: 45, down: 0, shutdown: 0 [05:45:57] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 787.712830401 [05:48:30] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Apr 30 05:47:26 UTC 2015 (duration 47m 25s) [05:48:35] Logged the message, Master [06:03:16] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193 [06:03:17] (03PS1) 10Yuvipanda: tools: Catch appropriate File Not Found error for webservice2 [puppet] - 10https://gerrit.wikimedia.org/r/207733 [06:03:40] (03PS2) 10Yuvipanda: tools: Catch appropriate File Not Found error for webservice2 [puppet] - 10https://gerrit.wikimedia.org/r/207733 [06:04:01] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Catch appropriate File Not Found error for webservice2 [puppet] - 10https://gerrit.wikimedia.org/r/207733 (owner: 10Yuvipanda) [06:06:37] RECOVERY - BGP status on cr2-ulsfo is OK host 198.35.26.193, sessions up: 45, down: 0, shutdown: 0 [06:24:56] PROBLEM - BGP status on cr2-ulsfo is CRITICAL No response from remote host 198.35.26.193 [06:28:27] RECOVERY - BGP status on cr2-ulsfo is OK host 198.35.26.193, sessions up: 45, down: 0, shutdown: 0 [06:29:38] PROBLEM - puppet last run on cp4004 is CRITICAL puppet fail [06:29:57] PROBLEM - puppet last run on cp1056 is CRITICAL Puppet has 1 failures [06:30:37] PROBLEM - puppet last run on db1018 is CRITICAL Puppet has 1 failures [06:31:07] PROBLEM - puppet last run on cp3037 is CRITICAL Puppet has 1 failures [06:31:46] PROBLEM - puppet last run on db1015 is CRITICAL Puppet has 1 failures [06:33:48] PROBLEM - puppet last run on db2065 is CRITICAL Puppet has 1 failures [06:34:27] PROBLEM - puppet last run on 
mw1046 is CRITICAL Puppet has 1 failures [06:34:47] PROBLEM - puppet last run on mw1052 is CRITICAL Puppet has 1 failures [06:34:47] PROBLEM - puppet last run on mw2073 is CRITICAL Puppet has 1 failures [06:35:08] PROBLEM - puppet last run on mw2022 is CRITICAL Puppet has 1 failures [06:35:16] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL host 208.80.154.196, interfaces up: 228, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/2: down - Core: cr2-knams:xe-1/1/0 (GTT, 00341724) [10Gbps MPLS]BR [06:35:46] PROBLEM - puppet last run on mw2143 is CRITICAL Puppet has 1 failures [06:35:47] PROBLEM - puppet last run on mw2017 is CRITICAL Puppet has 1 failures [06:35:47] PROBLEM - puppet last run on mw2036 is CRITICAL Puppet has 1 failures [06:35:47] PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 1 failures [06:36:07] PROBLEM - puppet last run on mw2134 is CRITICAL Puppet has 1 failures [06:37:47] PROBLEM - puppet last run on cp3049 is CRITICAL puppet fail [06:38:36] RECOVERY - Router interfaces on cr1-eqiad is OK host 208.80.154.196, interfaces up: 230, down: 0, dormant: 0, excluded: 0, unused: 0 [06:39:36] PROBLEM - puppet last run on cp3019 is CRITICAL puppet fail [06:45:37] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [06:46:17] RECOVERY - puppet last run on cp3037 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:46:17] RECOVERY - puppet last run on mw1046 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:46:48] RECOVERY - puppet last run on cp1056 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:26] RECOVERY - puppet last run on db1018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:37] RECOVERY - puppet last run on mw2036 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:06] RECOVERY - puppet last run on db2065 is OK 
Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:49:47] RECOVERY - puppet last run on cp4004 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:49:57] RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:50:08] RECOVERY - puppet last run on db1015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:17] RECOVERY - puppet last run on mw2022 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:50:37] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [06:50:56] RECOVERY - puppet last run on mw2143 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:57] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:52:57] RECOVERY - puppet last run on cp3049 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:54:37] RECOVERY - puppet last run on cp3019 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [07:06:46] RECOVERY - puppet last run on mw1052 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [07:07:46] RECOVERY - puppet last run on mw2017 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [07:08:07] RECOVERY - puppet last run on mw2134 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [07:22:54] 6operations, 10Architecture, 10MediaWiki-RfCs, 10RESTBase, and 4 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1248351 (10tstarling) The usual policy for internal network services, e.g. MW contacting MySQL, search, Redis, Swift, etc. has been to not retry at all. I th... 
[07:35:36] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] [07:35:36] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0] [07:35:57] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [07:36:36] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the critical threshold [20000.0] [07:37:07] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0] [07:37:47] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 11.11% of data above the critical threshold [20000.0] [07:38:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] [07:39:07] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [07:39:37] PROBLEM - Varnishkafka Delivery Errors per minute on cp4003 is CRITICAL 11.11% of data above the critical threshold [20000.0] [07:40:37] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [07:40:58] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 11.11% of data above the critical threshold [20000.0] [07:41:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0] [07:43:07] RECOVERY - Varnishkafka Delivery Errors per minute on cp4003 is OK Less than 1.00% above the threshold [0.0] [07:44:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [07:44:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [07:45:46] PROBLEM - 
Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 22.22% of data above the critical threshold [20000.0] [07:46:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 22.22% of data above the critical threshold [20000.0] [07:49:47] PROBLEM - Varnishkafka Delivery Errors per minute on cp4003 is CRITICAL 22.22% of data above the critical threshold [20000.0] [07:50:25] (03CR) 10Ori.livneh: [C: 031] graphite: stop system carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/206127 (owner: 10Filippo Giunchedi) [07:50:56] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 22.22% of data above the critical threshold [20000.0] [07:51:37] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0] [07:52:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [07:52:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [07:52:36] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 22.22% of data above the critical threshold [20000.0] [07:52:37] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [07:53:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0] [07:53:56] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [07:54:31] akosiaris: what OS is beta running ? precise/trusty/jessie ? 
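(Editor's aside: the flapping alerts above follow the graphite-check pattern "N% of data above the critical threshold". A reimplementation sketch, not the actual check_graphite plugin, and the warn/crit percentages are illustrative: with a nine-sample window, one sample over the limit gives 11.11% and two give 22.22%, which is why those two figures keep alternating.)

```python
# Sketch of a graphite-style "percent of datapoints over threshold"
# check. The window size and warn/crit cutoffs are assumptions chosen
# to reproduce the 11.11%/22.22% steps seen in the log.
def percent_above(datapoints, threshold):
    """Percentage of non-null datapoints strictly above `threshold`."""
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    over = sum(1 for v in values if v > threshold)
    return 100.0 * over / len(values)

def status(datapoints, threshold=20000.0, warn=1.0, crit=10.0):
    """Map the percentage onto an icinga-style state string."""
    pct = percent_above(datapoints, threshold)
    if pct >= crit:
        return "CRITICAL %.2f%% of data above the critical threshold [%.1f]" % (pct, threshold)
    if pct >= warn:
        return "WARNING %.2f%% of data above the threshold [%.1f]" % (pct, threshold)
    return "OK Less than %.2f%% above the threshold [%.1f]" % (warn, threshold)

window = [0, 0, 0, 0, 0, 0, 0, 0, 25000]   # one bad sample out of nine
print(status(window))   # CRITICAL 11.11% of data above the critical threshold [20000.0]
```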
[07:54:36] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0] [07:54:46] RECOVERY - Varnishkafka Delivery Errors per minute on cp4003 is OK Less than 1.00% above the threshold [0.0] [07:57:18] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0] [07:57:36] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [07:57:36] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [07:57:37] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [07:58:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 22.22% of data above the critical threshold [20000.0] [08:00:37] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 22.22% of data above the critical threshold [20000.0] [08:00:46] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 22.22% of data above the critical threshold [20000.0] [08:01:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:01:57] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 22.22% of data above the critical threshold [20000.0] [08:02:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [08:03:16] PROBLEM - Varnishkafka Delivery Errors per minute on cp4003 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:03:26] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0] [08:04:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [08:05:46] RECOVERY - Varnishkafka Delivery Errors per minute 
on cp4006 is OK Less than 1.00% above the threshold [0.0] [08:06:06] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:06:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0] [08:06:57] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0] [08:07:46] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:09:26] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [08:10:47] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 22.22% of data above the critical threshold [20000.0] [08:11:07] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 22.22% of data above the critical threshold [20000.0] [08:11:46] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 22.22% of data above the critical threshold [20000.0] [08:12:18] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:12:57] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:15:16] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 22.22% of data above the critical threshold [20000.0] [08:15:29] (03PS1) 10Faidon Liambotis: Revert "Depool esams, planned upsteam network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/207736 [08:15:41] (03CR) 10Faidon Liambotis: [C: 032] Revert "Depool esams, planned upsteam network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/207736 (owner: 10Faidon Liambotis) [08:15:57] !log repooling esams, network maintenance is over [08:16:03] Logged the message, Master [08:16:06] PROBLEM - Varnishkafka Delivery Errors per 
minute on cp4002 is CRITICAL 22.22% of data above the critical threshold [20000.0] [08:18:19] (03PS1) 10Giuseppe Lavagetto: hhvm: make base_jit_size configurable [puppet] - 10https://gerrit.wikimedia.org/r/207737 (https://phabricator.wikimedia.org/T93194) [08:18:25] <_joe_> ori: ^^ [08:18:27] PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:18:37] <_joe_> (just FYI) [08:19:28] _joe_: seems sane [08:19:40] (03CR) 10Ori.livneh: [C: 031] hhvm: make base_jit_size configurable [puppet] - 10https://gerrit.wikimedia.org/r/207737 (https://phabricator.wikimedia.org/T93194) (owner: 10Giuseppe Lavagetto) [08:19:49] <_joe_> yeah I didn't want to hold you here :) [08:19:55] <_joe_> it's very late there [08:20:05] <_joe_> or very early, depending on how you look at it :P [08:20:26] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0] [08:21:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0] [08:22:06] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:22:47] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [08:23:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4003 is OK Less than 1.00% above the threshold [0.0] [08:23:37] RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0] [08:24:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [08:24:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [08:25:26] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0] [08:25:26] PROBLEM - Varnishkafka Delivery Errors per 
minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:26:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [08:26:27] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:26:53] (03PS2) 10Giuseppe Lavagetto: hhvm: make base_jit_size configurable [puppet] - 10https://gerrit.wikimedia.org/r/207737 (https://phabricator.wikimedia.org/T93194) [08:27:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [08:27:37] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [08:28:39] (03CR) 10Giuseppe Lavagetto: [C: 032] hhvm: make base_jit_size configurable [puppet] - 10https://gerrit.wikimedia.org/r/207737 (https://phabricator.wikimedia.org/T93194) (owner: 10Giuseppe Lavagetto) [08:29:47] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0] [08:30:24] <_joe_> apergos: say I want to add a new module to our salt installation, where should I put it? [08:30:26] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0] [08:30:33] <_joe_> do we even have a repository for salt modules? [08:30:57] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 22.22% of data above the critical threshold [20000.0] [08:31:07] no, there's not a repo for them [08:31:17] I wonder why varnishkafka is still complaining [08:31:18] just put it in puppet in the appropriate module there I guess [08:31:21] <_joe_> so, how do we install them? [08:31:30] <_joe_> apergos: one I write? [08:31:36] yes, one you write [08:31:48] <_joe_> uhm seems... wrong to me [08:32:07] <_joe_> but well, if it's the status quo [08:32:16] well e.g. 
the git_deploy salt modules and returners are installed via puppet now [08:32:47] <_joe_> ok I'll take a look [08:32:56] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 22.22% of data above the critical threshold [20000.0] [08:33:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4003 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:33:40] lemme know what you do in the end [08:33:53] <_joe_> apergos: maybe I wasn't clear [08:34:10] ? [08:34:20] <_joe_> but where exactly are those modules in puppet? [08:34:27] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:34:27] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:35:15] puppet/modules/deployment/files/modules those are just the ones for git-deploy [08:35:33] <_joe_> oh ok so not in the salt module :) [08:35:34] <_joe_> thanks [08:35:43] and the returners etc are in the files directory too [08:35:49] <_joe_> ok [08:35:54] no, not in the puppet module for salt :-D [08:36:13] <_joe_> ok :) [08:36:20] so I was suggesting maybe making a puppet module for salt modules [08:36:32] :-D [08:36:47] <_joe_> maybe we could create a "salt" repository instead :P [08:36:58] <_joe_> ...and make it a submodule in git [08:37:03] <_joe_> for the joy of bblack [08:37:12] vetoed :-P [08:37:27] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:37:47] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [08:38:12] by module for salt you do mean something like the test module which has e.g. test.ping in it, right? when you say you have written a/some modules? 
[08:38:18] RECOVERY - Varnishkafka Delivery Errors per minute on cp4003 is OK Less than 1.00% above the threshold [0.0] [08:39:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [08:39:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [08:39:38] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [08:40:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0] [08:40:58] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [08:42:54] <_joe_> apergos: yes something like that and no, I've written none [08:42:57] <_joe_> I'm getting started [08:43:01] ok [08:43:15] <_joe_> I want to write something that may help us with orchestration of our cluster [08:43:39] if you do, please add me as reviewer, not that I need to review it, I just wanna see what you are working on :-) [08:43:39] <_joe_> we should also really start working on updating palladium and the rest :) [08:43:45] <_joe_> yep [08:43:55] to jessie you mean? 
[08:44:04] <_joe_> yes [08:44:08] yep [08:44:14] sodium [08:46:07] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:46:33] it would be nice to get all the precise hosts updated in the first round [08:47:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:48:41] <_joe_> yeah, in the case of the puppetmaster we have a scalability incentive [08:48:55] <_joe_> although I strongly hope 3.4 clients can work with a 3.7 master [08:49:05] <_joe_> or rebuilding puppet again for precise, ugh [08:49:10] yuck [08:49:12] <_joe_> that was an ugly process [08:50:22] labs testing for that I guess [08:51:06] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [08:52:16] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0] [08:58:06] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [08:58:23] 6operations: Choose a consistent, distributed k/v storage for configuration management/discovery - https://phabricator.wikimedia.org/T95656#1248432 (10Joe) a:3Joe [09:01:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [09:02:32] 6operations: consul evaluation - https://phabricator.wikimedia.org/T96832#1248433 (10Joe) I tested consul a bit, and it seems very solid in general for such a young project. Although its features look great in principle, all the things like DNS, service autodiscovery, health checks kind of overlap and clash wit... 
[09:02:47] 6operations: Choose a consistent, distributed k/v storage for configuration management/discovery - https://phabricator.wikimedia.org/T95656#1248437 (10Joe) [09:02:49] 6operations: consul evaluation - https://phabricator.wikimedia.org/T96832#1248434 (10Joe) 5Open>3Resolved a:3Joe [09:03:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:06:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:08:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [09:08:28] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:09:46] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [09:10:37] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:11:16] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:13:36] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0] [09:14:07] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:14:46] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:15:36] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0] [09:15:42] why the hell is it still complaining [09:16:16] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [09:17:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above 
the threshold [0.0] [09:19:46] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [09:23:11] 6operations: zookeeper evaluation - https://phabricator.wikimedia.org/T96839#1248474 (10Joe) Zookeeper is a solid, proven (https://aphyr.com/posts/291-call-me-maybe-zookeeper) distributed k-v with solid performance and reliability. Pros: - It's backed by a large community and different companies contribute to i... [09:25:56] (03CR) 10Nikerabbit: hhvm: make base_jit_size configurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/207737 (https://phabricator.wikimedia.org/T93194) (owner: 10Giuseppe Lavagetto) [09:26:37] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:26:56] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:27:02] 6operations, 10MediaWiki-Debug-Logging, 5Patch-For-Review: Investigation if Fluorine needs bigger disks or we retain too much data - https://phabricator.wikimedia.org/T92417#1248482 (10fgiunchedi) so api.log went from 35M to 18G ``` -rw-r--r-- 1 udp2log udp2log 35M Apr 28 06:25 api.log-20150428.gz -rw-r--r-... 
[09:28:37] PROBLEM - puppet last run on ms-fe2001 is CRITICAL puppet fail [09:30:06] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:31:37] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [09:31:56] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [09:33:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [09:33:27] matanya: all of them btw [09:38:47] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [09:39:02] 6operations: Choose a consistent, distributed k/v storage for configuration management/discovery - https://phabricator.wikimedia.org/T95656#1248501 (10Joe) [09:39:04] 6operations: zookeeper evaluation - https://phabricator.wikimedia.org/T96839#1248499 (10Joe) 5Open>3Resolved [09:39:27] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:42:47] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:43:26] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:43:27] thanks akosiaris [09:43:47] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60586 bytes in 0.549 second response time [09:44:37] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0] [09:45:36] RECOVERY - puppet last run on ms-fe2001 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [09:46:07] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0] [09:46:47] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is 
OK Less than 1.00% above the threshold [0.0] [09:47:16] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:48:37] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:50:36] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [09:53:37] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [09:54:13] 6operations: etcd evaluation - https://phabricator.wikimedia.org/T96825#1248516 (10Joe) Etcd is a relatively new (the first usable versions date to late 2013 IIRC) system, yet it's become mature enough to be deemed production-ready, it seems. it's quite feature-rich and has a very nice REST API. Pros: - Very ac... [09:54:28] 6operations: etcd evaluation - https://phabricator.wikimedia.org/T96825#1248517 (10Joe) 5Open>3Resolved [09:54:29] 6operations: Choose a consistent, distributed k/v storage for configuration management/discovery - https://phabricator.wikimedia.org/T95656#1248519 (10Joe) [09:55:48] (03PS3) 10Filippo Giunchedi: graphite: stop system carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/206127 [09:55:55] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: stop system carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/206127 (owner: 10Filippo Giunchedi) [09:58:46] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0] [09:59:36] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0] [10:00:44] wth is that [10:01:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4003 is CRITICAL 11.11% of data above the critical threshold [20000.0] [10:02:16] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold 
[0.0] [10:06:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4003 is OK Less than 1.00% above the threshold [0.0] [10:12:27] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [10:13:26] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] [10:18:07] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [10:18:16] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the critical threshold [20000.0] [10:18:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0] [10:18:47] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0] [10:19:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [10:19:29] (03PS1) 10Muehlenhoff: Switch to a non-trunk build, using abi=1 for our first build. [debs/linux] - 10https://gerrit.wikimedia.org/r/207751 [10:20:30] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: labvirt1005 memory errors - https://phabricator.wikimedia.org/T97521#1248596 (10hashar) @Andrew thank you for the instances migrations! 
[10:21:07] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [10:22:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [10:23:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0] [10:24:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [10:25:47] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] [10:26:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 11.11% of data above the critical threshold [20000.0] [10:29:52] (03PS2) 10Muehlenhoff: Switch to a non-trunk build, using abi=1 for our first build. [debs/linux] - 10https://gerrit.wikimedia.org/r/207751 [10:30:48] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [10:31:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0] [10:35:56] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0] [10:36:54] !log upgrade statsite on labmon1001 [10:37:02] Logged the message, Master [10:39:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [10:43:57] PROBLEM - Disk space on uranium is CRITICAL: DISK CRITICAL - free space: / 346 MB (3% inode=83%) [10:44:52] <_joe_> uhm what's up with uranium [10:45:03] beta cluster down? [10:45:19] http://en.wikipedia.beta.wmflabs.org/ says 503 [10:45:24] <_joe_> zeljkof: I'm looking at something else atm, what's the symptom? 
[10:45:32] <_joe_> zeljkof: ok I'll take a look in a few [10:45:46] _joe_: thanks, there is no rush, but looks down to me [10:46:16] <_joe_> zeljkof: we have a new HHVM version in beta, so that may be relevant [10:46:29] _joe_: need an hand with uranium? I can take a look too [10:46:39] <_joe_> godog: thanks, yes please [10:46:47] looking [10:47:49] <_joe_> zeljkof: I guess it's my fault btw [10:47:57] _joe_: :) [10:48:14] <_joe_> zeljkof: thanks for telling me [10:48:41] <_joe_> I gave the TC cache 17.5 gigabytes of space instead of 175 mb :P [10:48:43] _joe_: no problem [10:49:12] <_joe_> "it's just an order of magnitude anyway" [10:49:26] <_joe_> (it's two actually, but still) [10:50:42] (03PS1) 10Giuseppe Lavagetto: beta: correct HHVM TC size [puppet] - 10https://gerrit.wikimedia.org/r/207752 [10:51:06] (03CR) 10Giuseppe Lavagetto: [C: 032] beta: correct HHVM TC size [puppet] - 10https://gerrit.wikimedia.org/r/207752 (owner: 10Giuseppe Lavagetto) [10:51:14] (03CR) 10Giuseppe Lavagetto: [V: 032] beta: correct HHVM TC size [puppet] - 10https://gerrit.wikimedia.org/r/207752 (owner: 10Giuseppe Lavagetto) [10:52:59] !log delete old /tmp/ganglia-graph from uranium [10:53:06] Logged the message, Master [10:53:57] RECOVERY - Disk space on uranium is OK: DISK OK [10:54:37] <_joe_> zeljkof: it should be recovering [10:54:44] _joe_: thanks [10:55:08] _joe_: looks fine to me [11:00:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the critical threshold [20000.0] [11:00:57] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0] [11:02:23] 6operations: stray ganglia-graph files left in /tmp - https://phabricator.wikimedia.org/T97637#1248680 (10fgiunchedi) 3NEW [11:02:33] "there's your problem" [11:03:07] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [11:03:38] what about the 
delivery errors instead? [11:03:46] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0] [11:04:27] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] [11:04:36] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0] [11:05:57] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [11:07:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] [11:08:03] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1248695 (10mark) >>! In T93790#1231396, @GWicke wrote: > Based on our GC / latency data so far, each instance should have no more than 1T of storage (aiming for ~750G working load), along... [11:08:07] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [11:09:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [11:09:28] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [11:09:37] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 11.11% of data above the critical threshold [20000.0] [11:12:16] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0] [11:16:16] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [11:22:45] _joe_: oh I guess I should have pinged on IRC [11:23:55] <_joe_> Nikerabbit: what's up? 
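[editor's note] _joe_'s self-correction above ("it's two actually") checks out; a trivial shell sanity check of the TC-size slip, assuming SI units:

```shell
# 175 MB intended vs 17.5 GB configured, both in bytes (SI units assumed).
intended=$((175 * 1000 * 1000))
configured=$((17500 * 1000 * 1000))
echo $((configured / intended))   # prints 100: two orders of magnitude
```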
[11:24:36] <_joe_> if you wrote me an email, yes I check them once/hour [11:24:48] <_joe_> when I'm not too focused on something, that is [11:25:08] akosiaris: seen this before? https://phabricator.wikimedia.org/T97637 ganglia graphs left in /tmp [11:26:27] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [11:27:18] 6operations: stray ganglia-graph files left in /tmp - https://phabricator.wikimedia.org/T97637#1248720 (10akosiaris) Nope, first time I see this. This looks http://sourceforge.net/p/ganglia/mailman/message/32998550/ relevant [11:27:23] godog: http://sourceforge.net/p/ganglia/mailman/message/32998550/ [11:27:41] someone has seen it too [11:27:51] ah, google-fu fail :( [11:29:28] _joe_: I meant the hhvm 175 gigabytes [11:30:47] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0] [11:32:28] <_joe_> yeah, sorry about that [11:32:58] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [11:35:47] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [11:37:08] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] [11:37:44] it didn't affect me, just wondering whether it would have avoided bringing down beta if I commented here instead of the patch itself [11:37:57] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [11:39:03] <_joe_> I've missed that, sorry [11:40:26] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the critical threshold [20000.0] [11:40:37] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0] [11:41:27] RECOVERY - Varnishkafka Delivery 
Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [11:43:56] PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 11.11% of data above the critical threshold [20000.0] [11:44:36] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] [11:45:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0] [11:46:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0] [11:48:48] (03PS1) 10Muehlenhoff: * Amend older changelog entries with security issues fixed in 3.19.x so that we properly keep track [debs/linux] - 10https://gerrit.wikimedia.org/r/207755 [11:48:56] RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0] [11:49:36] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [11:49:37] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [11:49:56] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [11:51:56] PROBLEM - Varnishkafka Delivery Errors per minute on cp4003 is CRITICAL 11.11% of data above the critical threshold [20000.0] [11:54:56] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [11:58:37] RECOVERY - Varnishkafka Delivery Errors per minute on cp4003 is OK Less than 1.00% above the threshold [0.0] [12:01:07] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0] [12:03:27] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 11.11% of data above the critical threshold [20000.0] [12:04:16] PROBLEM - Varnishkafka Delivery Errors per 
minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] [12:04:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [12:04:57] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 11.11% of data above the critical threshold [20000.0] [12:06:56] 6operations: stray ganglia-graph files left in /tmp - https://phabricator.wikimedia.org/T97637#1248763 (10fgiunchedi) dug into this a bit more, graph.php ```lang=php 1274 case "rrdtool": 1275 if (strlen($command) < 100000) { 1276 my_passthru($command); 1277 } else { 1278 $tf = te... [12:08:16] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [12:08:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0] [12:09:16] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0] [12:09:17] !log restarting Jenkins https://phabricator.wikimedia.org/T96183 [12:09:25] Logged the message, Master [12:09:46] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] [12:10:37] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the critical threshold [20000.0] [12:11:46] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [12:12:54] (03PS1) 10Filippo Giunchedi: ganglia_new: bandaid cleanup /tmp [puppet] - 10https://gerrit.wikimedia.org/r/207759 (https://phabricator.wikimedia.org/T97637) [12:13:07] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0] [12:13:57] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0] [12:14:47] RECOVERY - Varnishkafka Delivery Errors per 
minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [12:15:07] 6operations, 10hardware-requests: Eqiad Spare allocation: 1 hardware access request for OSM Maps project - https://phabricator.wikimedia.org/T97638#1248778 (10Yurik) 3NEW [12:15:26] 6operations, 10OpenStreetMap, 10hardware-requests: Eqiad Spare allocation: 1 hardware access request for OSM Maps project - https://phabricator.wikimedia.org/T97638#1248785 (10Yurik) [12:16:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [12:16:47] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [12:24:48] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0] [12:25:11] !log upgrade statsite on ms-fe1* [12:25:19] Logged the message, Master [12:27:46] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] [12:27:54] !log upgrade statsite on ms-be1* [12:28:00] Logged the message, Master [12:28:08] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [12:28:36] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [12:30:42] (03PS2) 10BBlack: sanitize Accept-Encoding for cache efficiency T97128 [puppet] - 10https://gerrit.wikimedia.org/r/206387 [12:30:45] I've silenced that ^ on ulsfo for three hours [12:31:07] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0] [12:31:42] actually downtime might not have been a good idea, recoveries will still notify iirc [12:33:36] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [12:37:37] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 33.33% of data above the 
critical threshold [500.0] [12:41:16] refreshLinks: 10707552 queued; 75 claimed (2 active, 73 abandoned); 0 delayed [12:41:18] (enwiki) [12:41:23] that number just keeps increasing... [12:50:01] Krenair: context? [12:50:10] is this an immediate issue, or just general long-term thing? [12:50:15] Not sure. [12:50:29] Someone just pointed out in tech that enwiki has a ridiculously large job queue at the moment [12:50:40] this is the entry that stands out on showJobs.php [12:50:56] what software runs that queue? [12:51:09] (consumes, I mean) [12:51:29] MediaWiki job queue [12:53:18] anomie, any thoughts? [12:53:29] pfff [12:53:42] we no more monitor job insert / job pop :( [12:54:53] ops: elasticsearch alerts on logstash100[456] is part of some known ongoing work right? https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=16&hoststatustypes=3&serviceprops=2097162&nostatusheader [12:58:04] bblack: yes, robh and bd808 were reimaging them yesterday [12:58:23] 6operations, 10MediaWiki-Logging, 7Graphite: MediaWiki jobs statsd metric are no more monitored due to a metric name change - https://phabricator.wikimedia.org/T97640#1248840 (10hashar) 3NEW [12:59:14] Ah. [12:59:22] krenair@terbium:~$ mwscript showJobs.php enwiki --type refreshLinks --list [12:59:22] [49b2df1b] [no req] JobQueueError from line 713 of /srv/mediawiki/php-1.26wmf3/includes/jobqueue/JobQueueRedis.php: Redis server error: read error on connection [12:59:34] That works for another job type I tried. [13:03:16] PROBLEM - puppet last run on cp3013 is CRITICAL puppet fail [13:03:48] 6operations, 10MediaWiki-Logging, 7Graphite: MediaWiki jobs statsd metric are no more monitored due to a metric name change - https://phabricator.wikimedia.org/T97640#1248852 (10hashar) [13:03:48] The change occurred on 29/04/2015 at 9:47am UTC. [13:05:32] Coren: you still need a Commons admin ? 
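[editor's note] The "ganglia_new: bandaid cleanup /tmp" patch above presumably just reaps stray ganglia-graph files periodically; a minimal sketch of that kind of cleanup, exercised against a scratch directory (the glob pattern and the one-day age threshold are assumptions, not the actual puppet change):

```shell
# Sketch only: delete ganglia-graph temp files older than a day.
# Demonstrated on a scratch directory rather than the real /tmp.
scratch=$(mktemp -d)
touch "$scratch/ganglia-graphXXabc"
find "$scratch" -name 'ganglia-graph*' -mmin +1440 -delete  # too young to match yet
ls "$scratch" | wc -l    # still 1 file
find "$scratch" -name 'ganglia-graph*' -delete              # unconditional cleanup
ls "$scratch" | wc -l    # 0 files
rm -rf "$scratch"
```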
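[editor's note] The showJobs.php summary line Krenair pastes above is easy to pick apart in a pinch; an illustrative one-liner, assuming the exact field layout seen in the log (this is just the printed summary, not a stable API):

```shell
# Illustrative parse of the showJobs.php summary format quoted in the log.
line='refreshLinks: 10707552 queued; 75 claimed (2 active, 73 abandoned); 0 delayed'
printf '%s\n' "$line" | awk '{print "queued:", $2}'   # prints "queued: 10707552"
```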
[13:06:27] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [13:10:27] PROBLEM - MySQL InnoDB on db1040 is CRITICAL: CRIT longest blocking idle transaction sleeps for 630 seconds [13:11:07] PROBLEM - MySQL Idle Transactions on db1040 is CRITICAL: CRIT longest blocking idle transaction sleeps for 670 seconds [13:13:47] RECOVERY - MySQL InnoDB on db1040 is OK longest blocking idle transaction sleeps for 25 seconds [13:14:27] RECOVERY - MySQL Idle Transactions on db1040 is OK longest blocking idle transaction sleeps for 0 seconds [13:17:58] 6operations: consul evaluation - https://phabricator.wikimedia.org/T96832#1248868 (10Manybubbles) DNS for service resolution is pretty cute. If we didn't have LVS/pybal it'd be pretty compelling. I do wonder how DNS resolution caching plays here too. LVS/pybal is comparatively simpler to think about - or maybe i... [13:21:00] 6operations, 10MediaWiki-Logging, 7Graphite: MediaWiki jobs statsd metric are no more monitored due to a metric name change - https://phabricator.wikimedia.org/T97640#1248870 (10hashar) The Statsd has been changed recently https://gerrit.wikimedia.org/r/#/c/191854/ by @ori but that seems to have been deploye... [13:21:57] RECOVERY - puppet last run on cp3013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:24:53] godog: some statsd metrics have been renamed yesterday morning. The ones under MediaWiki.stats. ended up being moved straight under MediaWiki. [13:25:08] godog: an impact is that jobqueues are no more monitored (we use check_graphite). I have filled https://phabricator.wikimedia.org/T97640 [13:25:08] (03PS1) 10Anomie: Bump timestamp in 'ValidateExtendedMetadataCache' hook for T97469 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207769 (https://phabricator.wikimedia.org/T97591) [13:26:01] <_joe_> hashar: so your proposal is that we break this again, instead of patching the software to work as expected? 
[13:26:04] <_joe_> :) [13:26:22] I dont even know what is happening to be honest :] [13:26:43] heya moritzm, yt? i tried your cipher setting and am getting an error [13:26:57] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [13:27:12] 6operations: consul evaluation - https://phabricator.wikimedia.org/T96832#1248879 (10BBlack) There's a pretty big feature mismatch between what we could do over DNS and what we can do with LVS/pybal, in terms of accurately weighting traffic, instantaneous response to failed servers, and (very importantly in some... [13:28:42] https://gdash.wikimedia.org/dashboards/reqerror/ <- wtf? [13:29:46] anyone have a lead on the 5xx there from ~12:30-12:50, and now a new little plateau appearing? [13:30:04] (new one looks like 500, not 503) [13:30:16] PROBLEM - check_mysql on payments1002 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [13:30:16] PROBLEM - check_mysql on payments1003 is CRITICAL: Cant connect to local MySQL server through socket /var/run/mysqld/mysqld.sock (2) [13:30:17] PROBLEM - check_mysql on payments1004 is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [13:30:52] fixing ^^^ [13:31:09] ok the new little spike seems to have ended up being tiny, but that one circa 12:30-12:50 is both awful and inexplicable so far [13:31:17] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 7346.99788415 [13:31:18] hashar: I'll take a look [13:31:45] godog: that might be a mediawiki related changes but I havent found anything at that time yesterday :/ [13:32:08] godog: maybe some change strips '.stats' from metric names :D [13:32:47] ottomata: on which host? [13:33:22] hashar: no .stats. 
has been gone for a long time [13:34:05] um, tried a few of them, just can't log in when i've got that [13:34:06] i get [13:34:22] /Users/otto/.ssh/config line 5: Bad SSH2 cipher spec 'chacha20-poly1305@openssh.com,aes256-gcm@openssh.com,aes128-gcm@openssh.com,aes256-ctr,aes192-ctr,aes128-ctr'. [13:34:41] that was with bast1001.w [13:34:51] but same for any host, seems more like a local problem [13:35:16] RECOVERY - check_mysql on payments1003 is OK: Uptime: 223 Threads: 1 Questions: 1101 Slow queries: 28 Opens: 432 Flush tables: 1 Open tables: 64 Queries per second avg: 4.937 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [13:35:16] RECOVERY - check_mysql on payments1002 is OK: Uptime: 691432 Threads: 1 Questions: 975751 Slow queries: 2773 Opens: 502 Flush tables: 1 Open tables: 57 Queries per second avg: 1.411 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [13:35:16] what version of the openssh client do you use? maybe it's too old [13:35:17] PROBLEM - check_mysql on payments1004 is CRITICAL: Cant connect to local MySQL server through socket /var/run/mysqld/mysqld.sock (2) [13:35:27] Hi anomie! How's core API going along? :) [13:35:33] (03PS1) 10Ottomata: Remove some (large) unused graphs from varnishkafka ganglia view [puppet] - 10https://gerrit.wikimedia.org/r/207771 (https://phabricator.wikimedia.org/T97637) [13:35:48] OpenSSH_6.2p2, OSSLShim 0.9.8r 8 Dec 2011 [13:36:08] anomie: Also... wondering if you might want to help deploy this today: https://gerrit.wikimedia.org/r/#/c/207723/ [13:36:20] thanx in advance :) [13:36:30] (03CR) 10Ottomata: [C: 032] Remove some (large) unused graphs from varnishkafka ganglia view [puppet] - 10https://gerrit.wikimedia.org/r/207771 (https://phabricator.wikimedia.org/T97637) (owner: 10Ottomata) [13:37:16] anomie: oh wait, I was looking at the wrong swat, I thought you were on today's morning deploy... [13:38:18] ottomata: ah, sorry. this is only available with openssh 6.4 and later, I'll send a followup.
you'll need to stick with the default for older SSH clients [13:39:22] ok, i mean, maybe there is an easy os x upgrade [13:39:34] i'm using default os x whatever, so i expect others on os x might have the same problem [13:40:16] RECOVERY - check_mysql on payments1004 is OK: Uptime: 293 Threads: 2 Questions: 1045 Slow queries: 28 Opens: 433 Flush tables: 1 Open tables: 64 Queries per second avg: 3.566 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [13:40:27] ottomata: I know next to nothing about mac OS X, let me know what you find out :-) [13:40:47] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [13:41:34] 10Ops-Access-Requests, 6operations, 6Release-Engineering: Grant access for aklapper to phab-admins - https://phabricator.wikimedia.org/T97642#1248924 (10chasemp) 3NEW [13:50:36] AndyRussG: Bunch of bugs raised by the ApiResult change, mostly due to extensions doing crazy stuff. That patch would be good for SWAT, see https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Case_1b:_extension_changes for how to set up the patch. [13:51:36] Hi anomie! Sorry I haven't had my coffee yet... I see you _are_ on today's swat... Yeah just added it to the Swat on the Deployments page [13:52:15] Yeah I can prepare the patches, though I don't have +2 on branches other than master... [13:52:42] thanks!
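[editor's note] moritzm's point above can be checked locally before touching ~/.ssh/config: `ssh -Q cipher` (itself only present in reasonably recent OpenSSH clients) lists what the installed client supports, so a missing chacha20 entry means sticking with the defaults. A small sketch; the two echo messages are illustrative, not any real tooling:

```shell
# Gate the stricter Ciphers line on actual client support; falls back to
# the default when `ssh -Q` is absent (older clients) or lacks chacha20.
supported=$(ssh -Q cipher 2>/dev/null || true)
case "$supported" in
  *chacha20-poly1305@openssh.com*) echo "new Ciphers line is usable" ;;
  *) echo "keep the default Ciphers" ;;
esac
```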
[13:53:01] (03PS1) 10Rush: admin cleanup yuvi has root as part of ops on labmon [puppet] - 10https://gerrit.wikimedia.org/r/207773 [13:53:18] AndyRussG: Ping me when you need the extension backport +2ed, then [13:54:36] K u bet [13:55:58] (03PS2) 10Rush: admin cleanup yuvi has root as part of ops on labmon [puppet] - 10https://gerrit.wikimedia.org/r/207773 [13:56:24] (03CR) 10Rush: [C: 032 V: 032] admin cleanup yuvi has root as part of ops on labmon [puppet] - 10https://gerrit.wikimedia.org/r/207773 (owner: 10Rush) [13:57:07] whoa, confd, cool _joe_, didn't know that was a thing [14:05:16] 6operations, 10Wikimedia-Logstash: Elasticsearch not starting on Jessie hosts - https://phabricator.wikimedia.org/T97645#1248972 (10bd808) 3NEW a:3bd808 [14:08:24] 6operations, 10Wikimedia-Logstash: Elasticsearch not starting on Jessie hosts - https://phabricator.wikimedia.org/T97645#1248981 (10Manybubbles) Fun! [14:08:54] (03PS1) 10Rush: admin adjust spacing to be consistent [puppet] - 10https://gerrit.wikimedia.org/r/207780 [14:11:22] 6operations, 10MediaWiki-Logging, 7Graphite: MediaWiki jobs statsd metric are no more monitored due to a metric name change - https://phabricator.wikimedia.org/T97640#1248990 (10fgiunchedi) I think there's two factors at play, https://gerrit.wikimedia.org/r/#/c/206781/ dealt with the extended counters which... [14:12:08] (03CR) 10Rush: [C: 032 V: 032] admin adjust spacing to be consistent [puppet] - 10https://gerrit.wikimedia.org/r/207780 (owner: 10Rush) [14:15:12] (03PS1) 10Rush: admin reduce service permissions complexity [puppet] - 10https://gerrit.wikimedia.org/r/207781 [14:17:09] (03CR) 10jenkins-bot: [V: 04-1] admin reduce service permissions complexity [puppet] - 10https://gerrit.wikimedia.org/r/207781 (owner: 10Rush) [14:25:43] AndyRussG: Working on the extension-update patches in mediawiki/core now, for your SWAT? 
[14:26:22] anomie: ah sure one sec [14:28:57] 6operations, 7network: Establish IPsec tunnel between codfw and eqiad pfw - https://phabricator.wikimedia.org/T89294#1249044 (10faidon) [14:28:59] 6operations, 10fundraising-tech-ops, 7network: pfw-eqiad JunOS upgrade - https://phabricator.wikimedia.org/T96569#1249041 (10faidon) 5Open>3Resolved a:3faidon This was done last Monday (Apr 27). pfw-codfw was also updated to match it. No complaints so far, so all went as planned! [14:29:01] 6operations, 10MediaWiki-Logging, 7Graphite: MediaWiki jobs statsd metric are no more monitored due to a metric name change - https://phabricator.wikimedia.org/T97640#1249045 (10fgiunchedi) [14:29:04] 6operations, 7Monitoring: Job queue stats are broken - https://phabricator.wikimedia.org/T87594#1249046 (10fgiunchedi) [14:29:53] (03PS1) 10Filippo Giunchedi: mediawiki: adjust jobq alarms [puppet] - 10https://gerrit.wikimedia.org/r/207785 (https://phabricator.wikimedia.org/T87594) [14:30:06] 6operations, 7network: Establish IPsec tunnel between codfw and eqiad pfw - https://phabricator.wikimedia.org/T89294#1249053 (10faidon) 5Open>3Resolved a:3faidon pfws now have an IPsec tunnel established between them, as well as a BGP feed over that tunnel to exchange their respective routes. We need se... [14:30:34] (03PS1) 10Filippo Giunchedi: gdash: adjust jobq dashboard [puppet] - 10https://gerrit.wikimedia.org/r/207786 (https://phabricator.wikimedia.org/T87594) [14:34:04] (03Abandoned) 10Rush: admin reduce service permissions complexity [puppet] - 10https://gerrit.wikimedia.org/r/207781 (owner: 10Rush) [14:34:26] (03PS1) 10Rush: admin simplify service permissions grants [puppet] - 10https://gerrit.wikimedia.org/r/207788 [14:35:05] anomie: here's wmf3 https://gerrit.wikimedia.org/r/#/c/207787/ [14:35:28] AndyRussG: Put it on the Deployments page, and the wmf4 one once you get that. 
[14:37:27] (03CR) 10Rush: [C: 032] admin simplify service permissions grants [puppet] - 10https://gerrit.wikimedia.org/r/207788 (owner: 10Rush) [14:40:26] (03PS3) 10BBlack: sanitize Accept-Encoding for cache efficiency [puppet] - 10https://gerrit.wikimedia.org/r/206387 (https://phabricator.wikimedia.org/T97128) [14:41:57] anomie: heh looks like someone or something automatically updated wmf4: https://git.wikimedia.org/commit/mediawiki%2Fcore/cb9434caf95b33cf81271a172491760d5893874a [14:42:19] Funny that I wasn't getting a diff for wmf4 submodule change [14:42:21] (03PS1) 10Ottomata: Attempt to get resource defaults set by not globally qualifying kafkatee::input [puppet] - 10https://gerrit.wikimedia.org/r/207790 [14:43:03] (03PS2) 10Ottomata: Attempt to get resource defaults set by not globally qualifying kafkatee::input [puppet] - 10https://gerrit.wikimedia.org/r/207790 [14:43:14] (03CR) 10Ottomata: [C: 032 V: 032] Attempt to get resource defaults set by not globally qualifying kafkatee::input [puppet] - 10https://gerrit.wikimedia.org/r/207790 (owner: 10Ottomata) [14:43:17] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1249066 (10GWicke) > What am I missing here? The comparison is for slightly upgraded storage of 27T, equivalent to three additional (9 total) nodes of the current spec. > For compariso... [14:43:37] <^d> This automatic updating of deployment branches is scaring the crap out of me. [14:44:17] When did that start happening? [14:45:38] in general we've seemed a bit borderline-unstable all week to me. Seems like too much going on without enough review/process/feedback or something. I've been trying to fathom what's at the root in the change of the feel of things, or whether it's just an unlucky week. [14:47:38] <^d> anomie: No idea. 
[14:48:05] <^d> I have a suspicion that the wmf/* branches were set up to auto-track their extension counterparts like a meta repo [14:48:12] <^d> Which is SCARY AS CRAP FOR DEPLOYMENT [14:48:48] (03PS1) 10Ottomata: Specify kafkatee::input parameters manually, resource defaults wasn't working [puppet] - 10https://gerrit.wikimedia.org/r/207795 [14:49:25] (03PS2) 10Ottomata: Specify kafkatee::input parameters manually, resource defaults wasn't working [puppet] - 10https://gerrit.wikimedia.org/r/207795 [14:49:34] <^d> Gah, yes [14:49:38] ^d: But it didn't happen on wmf3, just wmf4? [14:49:59] <^d> Somebody did it for wmf4. [14:50:20] (03CR) 10Ottomata: [C: 032] Specify kafkatee::input parameters manually, resource defaults wasn't working [puppet] - 10https://gerrit.wikimedia.org/r/207795 (owner: 10Ottomata) [14:50:32] kart_, AndyRussG, Dereckson: Ping for SWAT in 10 minutes [14:50:46] anomie: ack [14:51:12] * anomie did SWAT the past two mornings, and wonders if ^d, manybubbles, thcipriani, or marktraceur want a turn [14:51:36] <^d> https://phabricator.wikimedia.org/P585 [14:51:43] Hi. [14:51:43] <^d> anomie: There we have it ^ [14:52:14] ^d: The question is who do we need to smack upside the head for it? [14:52:42] <^d> git blame should tell us, just check the .gitmodules on the branch :D [14:52:50] <^d> And see who added branch = . or w/e [14:53:54] (03PS1) 10Ottomata: Fix typo with extra ' in 5xx kafkatee output [puppet] - 10https://gerrit.wikimedia.org/r/207800 [14:53:55] anomie: thanks! Added the wmf3 core gerrit change and wmf4 automerge commit sha to the Deployments page [14:54:05] ^d: Looks like "branch = wmf/1.26wmf4" was there when Mukunda created it...
[14:54:05] (03CR) 10Ottomata: [C: 032 V: 032] Fix typo with extra ' in 5xx kafkatee output [puppet] - 10https://gerrit.wikimedia.org/r/207800 (owner: 10Ottomata) [14:55:12] <^d> dd34a56 Revert "Track branch for special extensions" because: --set-upstream-to doesn't work on git 1.7.9.5 which is what we are running on tin. [14:55:16] <^d> And ec04d21 Track branch for special extensions [14:55:21] <^d> Look suspect :) [14:55:40] <^d> Or not [14:56:21] <^d> 3a14f58 Fix branched sub-submodule support [14:56:21] <^d> 9de378d Branch submodules for branched extensions where requested [14:56:21] what did I do? [14:56:23] <^d> Maybe? [14:56:44] <^d> twentyafterfour: wmf/* branches on core automatically subscribe to their extension wmf/* branches now? [14:57:12] subscribe? [14:57:29] <^d> Automatically updating mw/core wmf/* branches when you commit to a wmf/* branch on an extension [14:57:33] did it accidentally turn into mediawiki/extensions? [14:57:46] <^d> Basically, for wmf/ branches at wmf4 and beyond [14:58:10] Roan asked me about something like that, I don't think I did it [14:58:14] sounds like a feature! [14:58:16] Example: https://git.wikimedia.org/commit/mediawiki%2Fcore/cb9434caf95b33cf81271a172491760d5893874a auto-happened [14:58:39] (03PS1) 10Ottomata: Don't set output.format if it is undef, also change default $output_format to undef [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/207802 [14:58:53] <^d> I'm going to undo the behavior. It's wildly shitty in Gerrit to begin with (see mw/extensions meta repo) and it's scaring the bejebus out of me.
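For context, Gerrit's superproject subscription is driven by `branch =` lines in the deployment branch's .gitmodules, as ^d describes. A hypothetical entry of the shape being discussed (module name and URL are illustrative, not the actual file):

```
[submodule "extensions/Example"]
	path = extensions/Example
	url = https://gerrit.wikimedia.org/r/mediawiki/extensions/Example
	branch = wmf/1.26wmf4
```

With a `branch` value set (`branch = .` means "same branch name as the superproject"), Gerrit can auto-submit a commit to the superproject that bumps the submodule pointer whenever a change merges to that branch of the submodule, which is the auto-update behavior being undone here.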
People will accidentally deploy stuff :p [14:59:01] (03PS2) 10Ottomata: Don't set output.format if it is undef, also change default $output_format to undef [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/207802 [14:59:01] anomie: will that create any issues? My CX update is in 1 hour :) [14:59:42] (03CR) 10Ottomata: [C: 032 V: 032] Don't set output.format if it is undef, also change default $output_format to undef [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/207802 (owner: 10Ottomata) [14:59:59] (03PS1) 10Ottomata: Update kafkatee module with output_format change [puppet] - 10https://gerrit.wikimedia.org/r/207803 [15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur, anomie, AndyRussG, Dereckson: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150430T1500). Please do the needful. [15:00:11] (03CR) 10Ottomata: [C: 032 V: 032] Update kafkatee module with output_format change [puppet] - 10https://gerrit.wikimedia.org/r/207803 (owner: 10Ottomata) [15:00:12] kart_: It means that merges to extension wmf/1.26wmf4 branches will auto-merge the change to update the extension reference in mediawiki/core (until ^d fixes it) [15:00:34] anomie: ah. wmf3 is not affected? [15:00:39] kart_: No [15:00:56] No one took me up on the offer to do SWAT, so I guess I have to.
[15:01:14] (03CR) 10Anomie: [C: 032] Bump timestamp in 'ValidateExtendedMetadataCache' hook for T97469 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207769 (https://phabricator.wikimedia.org/T97591) (owner: 10Anomie) [15:01:16] (03PS7) 10Andrew Bogott: puppetsigner: Clean up certs for instances we can't find in ldap [puppet] - 10https://gerrit.wikimedia.org/r/205897 [15:01:21] 6operations, 7Graphite: test sending varnishkafka and swift statsd traffic directly - https://phabricator.wikimedia.org/T95687#1249085 (10fgiunchedi) I did a bit of back of the envelope calculations, each varnishkafka/logster sends ~6.5 metrics/s, with ~130 hosts that's an additional ~700 metrics/s, I think it... [15:02:07] * anomie tries out his new script that should beep at him when Jenkins merges the SWAT patch [15:02:27] (03Abandoned) 10coren: Make gordon an alternate to dickson [dns] - 10https://gerrit.wikimedia.org/r/115093 (owner: 10coren) [15:03:00] (03Merged) 10jenkins-bot: Bump timestamp in 'ValidateExtendedMetadataCache' hook for T97469 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207769 (https://phabricator.wikimedia.org/T97591) (owner: 10Anomie) [15:03:50] !log anomie Synchronized wmf-config/CommonSettings.php: SWAT: Bump timestamp in 'ValidateExtendedMetadataCache' hook for T97469 [[gerrit:207769]] (duration: 00m 30s) [15:03:58] Logged the message, Master [15:04:07] (03PS2) 10Anomie: Enable Content Translation for Deployment 20150430 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207472 (https://phabricator.wikimedia.org/T97540) (owner: 10KartikMistry) [15:04:13] (03CR) 10Anomie: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207472 (https://phabricator.wikimedia.org/T97540) (owner: 10KartikMistry) [15:04:18] (03Merged) 10jenkins-bot: Enable Content Translation for Deployment 20150430 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207472 (https://phabricator.wikimedia.org/T97540) (owner: 10KartikMistry) [15:04:46] (03PS1) 
10Filippo Giunchedi: varnishkafka: use statsd.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/207805 (https://phabricator.wikimedia.org/T95687) [15:05:08] !log anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable Content Translation for Deployment 20150430 [[gerrit:207472]] (duration: 00m 18s) [15:05:11] kart_: ^ Test please [15:05:14] Logged the message, Master [15:05:27] Sure [15:07:16] anomie: looks good. [15:07:26] (03PS2) 10Anomie: Restrict local uploads on mai.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207725 (https://phabricator.wikimedia.org/T97397) (owner: 10Dereckson) [15:07:33] (03CR) 10Anomie: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207725 (https://phabricator.wikimedia.org/T97397) (owner: 10Dereckson) [15:07:36] <^d> Weird, I don't see why make-wmf-branch would've done it [15:07:37] (03Merged) 10jenkins-bot: Restrict local uploads on mai.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207725 (https://phabricator.wikimedia.org/T97397) (owner: 10Dereckson) [15:08:09] !log anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Restrict local uploads on mai.wikipedia [[gerrit:207725]] (duration: 00m 14s) [15:08:10] (03CR) 10KartikMistry: [C: 031] CX: Add languages for Deployment on 20150430 [puppet] - 10https://gerrit.wikimedia.org/r/207473 (https://phabricator.wikimedia.org/T97540) (owner: 10KartikMistry) [15:08:11] Dereckson: ^ Test please [15:08:15] Logged the message, Master [15:08:27] Works. 
[15:08:35] akosiaris: or ottomata or godog Merge this, https://gerrit.wikimedia.org/r/#/c/207473/ Please :) [15:08:43] (03PS2) 10Anomie: Enable GeoData at cawikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199930 (https://phabricator.wikimedia.org/T93637) (owner: 10Gerardduenas) [15:08:50] (03CR) 10Anomie: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199930 (https://phabricator.wikimedia.org/T93637) (owner: 10Gerardduenas) [15:08:55] (03Merged) 10jenkins-bot: Enable GeoData at cawikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199930 (https://phabricator.wikimedia.org/T93637) (owner: 10Gerardduenas) [15:09:08] i gotcha kart [15:09:15] (03PS2) 10Ottomata: CX: Add languages for Deployment on 20150430 [puppet] - 10https://gerrit.wikimedia.org/r/207473 (https://phabricator.wikimedia.org/T97540) (owner: 10KartikMistry) [15:09:40] !log anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable GeoData at cawikibooks [[gerrit:199930]] (duration: 00m 19s) [15:09:41] Dereckson: ^ Test please [15:09:45] Logged the message, Master [15:09:48] Works too. Thanks for the deploy. [15:10:04] bblack: yt? [15:10:11] (03PS6) 10BryanDavis: Add AffCom user group application contact page on meta (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207332 (https://phabricator.wikimedia.org/T95789) (owner: 10Alex Monk) [15:10:15] ottomata: yes? [15:10:24] i'm using dh-systemd to install a kafkatee.service file [15:10:30] it installs it in /lib/systemd/system [15:10:32] not /etc... [15:10:33] that ok? [15:10:37] yup that's normal [15:10:41] ok cool just checking [15:10:47] mind a quick look? [15:10:48] https://gerrit.wikimedia.org/r/#/c/207806/ [15:10:49] should be simple [15:11:00] we don't intend for our puppet to then replace it with a different one, right?
[15:11:03] (03CR) 10Ottomata: [C: 032] CX: Add languages for Deployment on 20150430 [puppet] - 10https://gerrit.wikimedia.org/r/207473 (https://phabricator.wikimedia.org/T97540) (owner: 10KartikMistry) [15:11:11] right bblack [15:11:36] i don't think i have a plan to build a new version for trusty, and the existing trusty version in our apt uses upstart [15:11:44] so, i'll only build this for jessie [15:11:57] kart_: merged. [15:11:59] !log anomie Synchronized php-1.26wmf4/extensions/EducationProgram/: SWAT: EducationProgram: ApiListStudents: Use XML-friendly tag names [[gerrit:207779]] (duration: 00m 25s) [15:12:03] AndyRussG: ^ Test please (you can use test2.wikipedia.org) [15:12:04] Logged the message, Master [15:12:25] <^d> anomie, twentyafterfour: https://gerrit.wikimedia.org/r/#/c/207808/ [15:12:31] anomie: does the XML mangling patch need to be backported? [15:12:47] ottomata: thanks! [15:12:54] legoPanda: Not unless some other extension turns up with the same bug. But it probably should go to 1.25 [15:13:50] anomie: K testing [15:14:19] ottomata: it's too bad kafkatee doesn't have an option to simply not write a pidfile at all. under systemd w/ type=simple, you really don't need one [15:14:36] hm [15:14:47] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:15:02] also, needs -D to prevent daemonization [15:15:20] as in ExecStart=/usr/bin/kafkatee -Dc /etc/kafkatee.conf [15:15:43] anomie: several wikis would like to deploy ShortURL, which requires (i) a config change merged (ii) to run a script to populate the short urls table. I presume the need to run a script after it's merged prevents those changes from being passed in SWAT. If so, what's the procedure to plan a window for this deploy? [15:15:52] ottomata: ^ [15:16:03] Dereckson: How scary is the script? [15:16:13] anomie: yes the api works fine on test2!
Caveat: I hadn't checked that the error was happening before on test2 (though it's safe to assume it was) (I had been testing with the beta cluster and locally) [15:16:30] bblack, cool, looking, was checking to see if i could prevent pidfile thing [15:16:36] wouldn't be that hard to add a conditional [15:16:38] you can't, I looked at the source [15:16:50] I mean, you can, but you'd need to patch up kafkatee.c [15:17:06] https://github.com/wikimedia/analytics-kafkatee/blob/master/kafkatee.c#L225 [15:17:07] right [15:17:13] if (!conf.pid_file_path) [15:17:15] maybe? [15:17:22] yeah but it's sprinkled all over [15:17:23] anomie: SELECT, 100 pages at a time, INSERT [15:17:27] it would be in several places :) [15:17:37] i only see one ezd_pidfile_open [15:17:45] close already doesn't do anything if path is false [15:17:52] https://github.com/wikimedia/analytics-kafkatee/blob/68815711e4ccbb69519c08bfa40809c0a184fcc3/ezd.c#L335 [15:18:23] anomie: script doesn't seem scary, 60 lines straightforward to grasp, wikis are rather small (ne. sa. kn.) but it seems that could take some time. [15:18:24] Dereckson: Over the whole page table? That sounds like it could be slow enough that I'd not want to SWAT it. [15:18:24] i dunno, you tell me if it is worth it? i'm already building new packages for this thing now so, now is the time :) [15:18:34] I concur. [15:19:17] ottomata: if you want to fix it upstream first in a new kafkatee version that recognizes some kind of arg/option to disable the pidfile, sure. It's not really that important, though. [15:19:31] (you probably don't want to change the default behavior with the compiled-in default pidfile) [15:19:42] yeah, ok, i mean, it is simpler for me to not worry about it :) [15:20:02] AndyRussG: Now I'm waiting for Jenkins to merge the wmf3 update, then I'll have you test it on one of those wikis. [15:20:12] just would avoid all that mess with PostStop rm and such.
In a systemd type=simple world, there's no need for daemonization or pidfiles [15:20:13] bblack, i also added User and PermissionsStartOnly [15:20:14] https://gerrit.wikimedia.org/r/#/c/207806/2/debian/kafkatee.service [15:20:18] that make sense? [15:20:47] anomie: will do, thx :) [15:20:53] anomie: so, I see with greg-g to plan a specific deploy window? [15:21:02] Dereckson: Yeah [15:21:04] (03PS5) 10Phuedx: Enable Browse experiment on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206375 (https://phabricator.wikimedia.org/T94739) [15:21:50] ^d: wmf4 still not so good? [15:22:00] ottomata: User= definitely, we shouldn't be running daemons like these as root. PermissionsStartOnly seems pointless: StopPost doesn't need it (the pidfile will be owned by the same user as the main proc), and the reload kill should work fine without it too [15:22:09] ^d: need CX update there. [15:22:21] ok cool [15:22:28] !log anomie Synchronized php-1.26wmf3/extensions/EducationProgram/: SWAT: EducationProgram: ApiListStudents: Use XML-friendly tag names [[gerrit:207778]] (duration: 00m 39s) [15:22:29] AndyRussG: ^ Test please [15:22:35] Logged the message, Master [15:23:52] anomie: ragesoss: enwiki lgtm :) [15:24:11] * anomie declares SWAT done! [15:25:07] woohoo! [15:25:14] lgtm too! [15:25:28] woohoo x two! [15:26:17] <^d> kart_: All fixed [15:27:10] ^d: cool! [15:32:32] (03CR) 10Anomie: [C: 031] Add AffCom user group application contact page on meta (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207332 (https://phabricator.wikimedia.org/T95789) (owner: 10Alex Monk) [15:32:52] anomie: Dereckson fwiw you dont *have* to run the script - it'll autogenerate a short URL every time each page is viewed [15:33:18] (I wrote shorturl - was my first extension! Wheeee) [15:33:28] YuviKTM: A different one each view, or just once per title? 
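The kafkatee unit file being reviewed above can be sketched roughly like this (a sketch only: the Description, User value, and ExecReload line are assumptions; `Type=simple` and the `-Dc /etc/kafkatee.conf` invocation are as quoted in the chat):

```
[Unit]
Description=kafkatee log consumer
After=network.target

[Service]
# Type=simple + a foreground process (-D) means no daemonization and no pidfile
Type=simple
User=kafkatee
ExecStart=/usr/bin/kafkatee -Dc /etc/kafkatee.conf
# reload via a signal to the main process; no PermissionsStartOnly needed,
# per bblack's review comment above
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target
```

dh-systemd installing this under /lib/systemd/system (not /etc/systemd/system) is the normal packaging convention bblack confirms: /etc is reserved for local admin overrides.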
[15:33:42] Once per title [15:33:51] Then it gets cached like fuck etc [15:34:25] But [15:34:33] If db writes on view sound scary [15:34:40] mutante, ugh [15:34:42] You can wait and run the script [15:34:48] grep "RT-\d+" * -RP [15:34:51] on the puppet repo [15:35:12] I don't understand why people do this [15:38:30] <_joe_> what you don't understand? [15:39:37] Krenair: any updates on job queue? [15:40:19] _joe_, people using different formats for this sort of thing [15:40:39] Betacommand, well I haven't looked at it [15:41:07] Krenair: 11m and growing [15:41:51] hi -- I'm helping Mark with the capex for 2016. What are the names of the Virginia and Dallas data centers? [15:42:35] tnegrin: dallas has a couple, but the primary site names are eqiad and codfw [15:42:43] 6operations, 10ops-eqiad: db1060 raid degraded - https://phabricator.wikimedia.org/T96471#1249186 (10Cmjohnson) 5Open>3Resolved Raid is no longer degraded [15:42:46] we are adding a peering site to dallas though, but you meant the main sites right? [15:42:53] yes [15:43:11] eqiad = Equinix in VA, located close to IAD (Dulles airport) [15:43:16] codfw is CyrusOne, near the Dallas airport [15:43:39] https://meta.wikimedia.org/wiki/Wikimedia_servers#Hosting [15:44:45] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1249193 (10kevinator) p:5Normal>3Low [15:44:50] 6operations, 10Architecture, 10MediaWiki-RfCs, 10RESTBase, and 4 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1249194 (10ssastry) > Apparently the reason Parsoid needs to retry is because it does not have the ability to fail gracefully when the API fails. According t... [15:45:28] (03CR) 10Alex Monk: "Thanks for cleaning this up."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/207332 (https://phabricator.wikimedia.org/T95789) (owner: 10Alex Monk) [15:45:39] legoPanda: Something is up, not sure what [15:45:47] legoPanda, yeah, it's been like this for a few hours [15:45:51] at least a few hours* [15:46:05] https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions#Datacenter_Sites [15:46:16] who edited a template? :P [15:46:17] no one in ops maintains that meta page ;] [15:46:39] asked about it, but no one knew about it/did anything I guess [15:46:55] robh, no, but I update the meta page when I hear about changes [15:47:52] it's certainly nicer to look at than the wikitech page, heh [15:49:13] 7Puppet, 10Browser-Tests: [Regression] QA: Puppet failing for Role::Ci::Slave::Browsertests/elasticsearch - https://phabricator.wikimedia.org/T74255#1249214 (10greg) p:5Low>3Normal [15:49:33] 6operations, 10ops-eqiad, 10Incident-20141130-eqiad-C4: asw-c4-eqiad hardware fault? - https://phabricator.wikimedia.org/T93730#1249217 (10faidon) We've agreed for a maintenance window of replacing the switch: Wednesday May 6th, 13:00 UTC. [15:52:19] I appear to have corrupted my copy of the puppet repo :/ [15:52:55] alex@alex-laptop:~/Development/Wikimedia/Operations-Puppet (rip-bz)$ git commit -a [15:52:55] error: invalid object 100644 7c5832fab3036d4ca8ac4c791eca5bbb5eda4030 for 'manifests/role/elasticsearch.pp' [15:52:55] error: invalid object 100644 7c5832fab3036d4ca8ac4c791eca5bbb5eda4030 for 'manifests/role/elasticsearch.pp' [15:52:55] error: Error building trees [15:55:20] Krenair: http://fpaste.org/217256/9314143/raw/ [15:55:42] legoPanda, yep, that's what I found earlier [15:55:46] didn't know what to do from there though [15:58:36] why does it fail to connect for some types but not others? :/ [16:00:05] kart_: Dear anthropoid, the time has come. Please deploy Content Translation deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150430T1600).
[16:01:22] jouncebot: yes sir! [16:01:32] 6operations, 10Architecture, 10MediaWiki-RfCs, 10RESTBase, and 4 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1249241 (10GWicke) > Do you have a model for a "hanging backend"? I don't have a comprehensive model, but can give you examples we encountered in the past.... [16:03:12] 7Puppet, 10Browser-Tests: [Regression] QA: Puppet failing for Role::Ci::Slave::Browsertests/elasticsearch - https://phabricator.wikimedia.org/T74255#1249247 (10greg) p:5Normal>3Low [16:16:12] !log kartik Started scap: Update ContentTranslation [16:16:18] Logged the message, Master [16:16:21] Krenair: the errors in JobQueueFederated.log look similar [16:17:33] (03PS2) 10Ottomata: Run hdfs balancer weekly [puppet] - 10https://gerrit.wikimedia.org/r/206461 [16:17:52] (03PS1) 10Rush: admin cleanup for citoid and mathoid perms [puppet] - 10https://gerrit.wikimedia.org/r/207818 [16:18:48] https://lists.wikimedia.org/pipermail/wikimedia-l/2015-April/077517.html huhu :( [16:19:02] Eloquence :'-( [16:21:18] (03PS3) 10Ottomata: Run hdfs balancer weekly [puppet] - 10https://gerrit.wikimedia.org/r/206461 [16:21:24] great -- thanks robh [16:21:50] 6operations, 10Architecture, 10MediaWiki-RfCs, 10RESTBase, and 4 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1249322 (10ssastry) > Apparently the reason Parsoid needs to retry is because it does not have the ability to fail gracefully when the API fails. > There a... 
[16:23:25] (03CR) 10Ottomata: [C: 032] Run hdfs balancer weekly [puppet] - 10https://gerrit.wikimedia.org/r/206461 (owner: 10Ottomata) [16:27:45] (03PS2) 10Rush: admin cleanup for citoid and mathoid perms [puppet] - 10https://gerrit.wikimedia.org/r/207818 [16:27:55] (03CR) 10Rush: [C: 032 V: 032] admin cleanup for citoid and mathoid perms [puppet] - 10https://gerrit.wikimedia.org/r/207818 (owner: 10Rush) [16:31:01] (03PS1) 10Ottomata: Properly redirect hdfs balancer cron output [puppet] - 10https://gerrit.wikimedia.org/r/207823 [16:31:17] (03PS2) 10Ottomata: Properly redirect hdfs balancer cron output [puppet] - 10https://gerrit.wikimedia.org/r/207823 [16:31:23] (03CR) 10Ottomata: [C: 032 V: 032] Properly redirect hdfs balancer cron output [puppet] - 10https://gerrit.wikimedia.org/r/207823 (owner: 10Ottomata) [16:34:00] hmm. scap seems stuck? [16:35:03] bd808: ^d: Should I wait? [16:35:31] kart_: let me look and see what's stuck. It may be snapshot1004 again [16:35:52] bd808: at sync-common: 99% (ok: 464; fail: 0; left: 1) [16:36:00] 6operations, 10Architecture, 10MediaWiki-RfCs, 10RESTBase, and 4 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1249344 (10ssastry) I might have understated the ability of our current template encapsulation to handle this ... because of the changed DOM structure, edit... [16:36:05] kart_: Yup. it's snapshot1004.eqiad.wmnet again [16:36:54] you can unstick it by opening another ssh session to tin and killing the /usr/bin/ssh process connecting to snapshot1004.eqiad.wmnet to run sync-common [16:37:22] get the PID from `ps axuwww|grep sync-common` [16:37:36] You own the process so you can just kill -9 it [16:38:04] that will leave snapshot1004 un-synced but that shouldn't hurt anything in the short term [16:39:11] okay! [16:40:18] bd808: first kill in tin and then run sync-common on snapshot1004? [16:40:51] kart_: that's optional yes.
you can wait for the rest of the scap to finish before you worry about snapshot1004 [16:42:54] !log kartik Finished scap: Update ContentTranslation (duration: 26m 42s) [16:42:57] bd808: ok, killed and unstuck [16:43:02] Logged the message, Master [16:43:06] kart_: awesome [16:43:09] bd808: and finished. now. sync-common again? [16:43:25] I'm on snapshot1004, I can run it there [16:43:49] bd808: please :) [16:44:25] bd808: let me know when started. [16:44:27] !log started sync-common on snapshot1004 to fix aborted sync [16:44:33] Logged the message, Master [16:44:50] kart_: I expect this will take a while but I'm running it [16:44:55] bd808: thanks! [16:46:55] (03PS1) 10Ottomata: Use lockfiles instead of ps to conditionally start hdfs balancer [puppet] - 10https://gerrit.wikimedia.org/r/207831 [16:47:00] (03CR) 10jenkins-bot: [V: 04-1] Use lockfiles instead of ps to conditionally start hdfs balancer [puppet] - 10https://gerrit.wikimedia.org/r/207831 (owner: 10Ottomata) [16:47:07] (03PS2) 10Ottomata: Use lockfiles instead of ps to conditionally start hdfs balancer [puppet] - 10https://gerrit.wikimedia.org/r/207831 [16:47:56] (03CR) 10Ottomata: [C: 032] Use lockfiles instead of ps to conditionally start hdfs balancer [puppet] - 10https://gerrit.wikimedia.org/r/207831 (owner: 10Ottomata) [16:50:36] (03PS1) 10Ottomata: Store not running warning in balancer.log [puppet] - 10https://gerrit.wikimedia.org/r/207833 [16:50:40] (03CR) 10jenkins-bot: [V: 04-1] Store not running warning in balancer.log [puppet] - 10https://gerrit.wikimedia.org/r/207833 (owner: 10Ottomata) [16:50:43] (03PS2) 10Ottomata: Store not running warning in balancer.log [puppet] - 10https://gerrit.wikimedia.org/r/207833 [16:51:09] (03PS1) 10Rush: admin group sudoers permissions cleanup [puppet] - 10https://gerrit.wikimedia.org/r/207834 [16:52:51] (03PS2) 10Rush: admin group sudoers permissions cleanup [puppet] - 10https://gerrit.wikimedia.org/r/207834 [16:53:54] (03CR) 10Ottomata: [C: 032] Store not 
running warning in balancer.log [puppet] - 10https://gerrit.wikimedia.org/r/207833 (owner: 10Ottomata) [16:57:50] <^d> jouncebot: next [16:57:51] In 6 hour(s) and 2 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150430T2300) [16:58:32] (03PS3) 10Rush: admin group sudoers permissions cleanup [puppet] - 10https://gerrit.wikimedia.org/r/207834 [16:58:47] (03CR) 10Rush: [C: 032 V: 032] admin group sudoers permissions cleanup [puppet] - 10https://gerrit.wikimedia.org/r/207834 (owner: 10Rush) [16:59:24] !log aborted sync-common on snapshot1004.eqiad.wmnet after 15 minutes for inactivity; trying again [16:59:31] Logged the message, Master [17:01:04] !log demon Synchronized php-1.26wmf3/includes/Setup.php: trying something (duration: 00m 18s) [17:01:09] Logged the message, Master [17:02:47] PROBLEM - puppet last run on mw2181 is CRITICAL puppet fail [17:03:07] PROBLEM - puppet last run on mw2053 is CRITICAL puppet fail [17:03:08] PROBLEM - puppet last run on mw1244 is CRITICAL puppet fail [17:03:09] (03PS1) 10Rush: admin user sudoers permissions cleanup [puppet] - 10https://gerrit.wikimedia.org/r/207836 [17:03:17] PROBLEM - puppet last run on mw2102 is CRITICAL puppet fail [17:03:17] PROBLEM - puppet last run on mw2133 is CRITICAL puppet fail [17:03:17] PROBLEM - puppet last run on mw2106 is CRITICAL puppet fail [17:03:21] (03PS2) 10Rush: admin user sudoers permissions cleanup [puppet] - 10https://gerrit.wikimedia.org/r/207836 [17:03:26] PROBLEM - puppet last run on mw1124 is CRITICAL puppet fail [17:03:27] PROBLEM - puppet last run on mw1257 is CRITICAL puppet fail [17:03:37] PROBLEM - puppet last run on mw2202 is CRITICAL puppet fail [17:03:38] PROBLEM - puppet last run on mw1246 is CRITICAL puppet fail [17:03:46] PROBLEM - puppet last run on mw1218 is CRITICAL puppet fail [17:03:46] PROBLEM - puppet last run on mw1062 is CRITICAL puppet fail [17:03:47] PROBLEM - puppet last run on mw1040 is CRITICAL puppet fail 
[17:03:47] PROBLEM - puppet last run on mw1245 is CRITICAL puppet fail [17:04:02] !log demon Synchronized php-1.26wmf3/includes/Setup.php: meh, didn't work (duration: 00m 27s) [17:04:06] PROBLEM - puppet last run on mw1216 is CRITICAL puppet fail [17:04:06] PROBLEM - puppet last run on mw1161 is CRITICAL puppet fail [17:04:06] PROBLEM - puppet last run on mw1221 is CRITICAL puppet fail [17:04:07] PROBLEM - puppet last run on mw1147 is CRITICAL puppet fail [17:04:07] PROBLEM - puppet last run on mw2183 is CRITICAL puppet fail [17:04:09] Logged the message, Master [17:04:16] PROBLEM - puppet last run on mw1234 is CRITICAL puppet fail [17:04:17] PROBLEM - puppet last run on mw1134 is CRITICAL puppet fail [17:04:17] PROBLEM - puppet last run on mw1132 is CRITICAL puppet fail [17:04:27] PROBLEM - puppet last run on mw1196 is CRITICAL puppet fail [17:04:27] chasemp: ^ ? [17:04:27] PROBLEM - puppet last run on mw1252 is CRITICAL puppet fail [17:04:27] PROBLEM - puppet last run on mw2197 is CRITICAL puppet fail [17:04:27] PROBLEM - puppet last run on mw2046 is CRITICAL puppet fail [17:04:28] PROBLEM - puppet last run on mw1089 is CRITICAL puppet fail [17:04:28] PROBLEM - puppet last run on mw2160 is CRITICAL puppet fail [17:04:36] PROBLEM - puppet last run on mw2179 is CRITICAL puppet fail [17:04:36] PROBLEM - puppet last run on mw1178 is CRITICAL puppet fail [17:04:36] PROBLEM - puppet last run on mw2214 is CRITICAL puppet fail [17:04:41] hmm ok maybe me looking [17:04:47] PROBLEM - puppet last run on mw2099 is CRITICAL puppet fail [17:04:47] PROBLEM - puppet last run on mw1109 is CRITICAL puppet fail [17:04:48] PROBLEM - puppet last run on mw1031 is CRITICAL puppet fail [17:04:56] PROBLEM - puppet last run on mw1106 is CRITICAL puppet fail [17:04:57] PROBLEM - puppet last run on mw1233 is CRITICAL puppet fail [17:04:57] PROBLEM - puppet last run on mw1240 is CRITICAL puppet fail [17:04:57] PROBLEM - puppet last run on mw2169 is CRITICAL puppet fail [17:04:57] PROBLEM - 
puppet last run on mw1045 is CRITICAL puppet fail [17:04:57] PROBLEM - puppet last run on mw2122 is CRITICAL puppet fail [17:04:57] PROBLEM - puppet last run on mw2103 is CRITICAL puppet fail [17:04:57] PROBLEM - puppet last run on mw2112 is CRITICAL puppet fail [17:05:06] PROBLEM - puppet last run on mw2032 is CRITICAL puppet fail [17:05:06] PROBLEM - puppet last run on mw2041 is CRITICAL puppet fail [17:05:06] PROBLEM - puppet last run on mw2028 is CRITICAL puppet fail [17:05:06] PROBLEM - puppet last run on mw2074 is CRITICAL puppet fail [17:05:06] PROBLEM - puppet last run on mw2034 is CRITICAL puppet fail [17:05:06] PROBLEM - puppet last run on mw1130 is CRITICAL puppet fail [17:05:07] PROBLEM - puppet last run on mw1048 is CRITICAL puppet fail [17:05:07] PROBLEM - puppet last run on mw1140 is CRITICAL puppet fail [17:05:16] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: Sudo::Group[wikidev] is already declared in file /etc/puppet/modules/mediawiki/manifests/users.pp:84; cannot redeclare at /etc/puppet/modules/admin/manifests/group.pp:46 on node mw1252.eqiad.wmnet [17:05:17] PROBLEM - puppet last run on mw2199 is CRITICAL puppet fail [17:05:18] PROBLEM - puppet last run on mw1059 is CRITICAL puppet fail [17:05:27] PROBLEM - puppet last run on mw1038 is CRITICAL puppet fail [17:05:27] PROBLEM - puppet last run on mw1067 is CRITICAL puppet fail [17:05:36] PROBLEM - puppet last run on mw1256 is CRITICAL puppet fail [17:05:37] PROBLEM - puppet last run on mw2121 is CRITICAL puppet fail [17:05:37] PROBLEM - puppet last run on mw1072 is CRITICAL puppet fail [17:05:46] PROBLEM - puppet last run on mw2170 is CRITICAL puppet fail [17:05:47] PROBLEM - puppet last run on mw1197 is CRITICAL puppet fail [17:05:56] PROBLEM - puppet last run on mw1200 is CRITICAL puppet fail [17:05:56] PROBLEM - puppet last run on mw1080 is CRITICAL puppet fail [17:05:57] PROBLEM - puppet last run on mw1005 is CRITICAL puppet fail [17:05:59] 
bblack: yeah that's from a change I made but is errant to begin with not necessarily teh change I think [17:06:06] let me see why it was already defined outside of the scope of things [17:06:07] PROBLEM - puppet last run on mw2164 is CRITICAL puppet fail [17:06:07] PROBLEM - puppet last run on mw2108 is CRITICAL puppet fail [17:06:08] PROBLEM - puppet last run on mw2057 is CRITICAL puppet fail [17:06:08] PROBLEM - puppet last run on mw1224 is CRITICAL puppet fail [17:06:16] PROBLEM - puppet last run on mw1115 is CRITICAL puppet fail [17:06:17] PROBLEM - puppet last run on mw1028 is CRITICAL puppet fail [17:06:27] PROBLEM - puppet last run on mw2076 is CRITICAL puppet fail [17:06:27] PROBLEM - puppet last run on mw1250 is CRITICAL puppet fail [17:06:36] PROBLEM - puppet last run on mw1160 is CRITICAL puppet fail [17:06:37] PROBLEM - puppet last run on mw2161 is CRITICAL puppet fail [17:06:37] PROBLEM - puppet last run on mw2213 is CRITICAL puppet fail [17:06:37] PROBLEM - puppet last run on mw2118 is CRITICAL puppet fail [17:06:37] PROBLEM - puppet last run on snapshot1003 is CRITICAL puppet fail [17:06:37] PROBLEM - puppet last run on mw1141 is CRITICAL puppet fail [17:06:37] PROBLEM - puppet last run on mw2100 is CRITICAL puppet fail [17:06:37] PROBLEM - puppet last run on mw1063 is CRITICAL puppet fail [17:06:46] PROBLEM - puppet last run on mw2124 is CRITICAL puppet fail [17:06:47] PROBLEM - puppet last run on mw2033 is CRITICAL puppet fail [17:06:47] PROBLEM - puppet last run on mw2015 is CRITICAL puppet fail [17:06:47] PROBLEM - puppet last run on mw2014 is CRITICAL puppet fail [17:06:47] PROBLEM - puppet last run on mw2080 is CRITICAL puppet fail [17:06:48] PROBLEM - puppet last run on mw1041 is CRITICAL puppet fail [17:06:57] PROBLEM - puppet last run on mw1006 is CRITICAL puppet fail [17:06:57] PROBLEM - puppet last run on mw1007 is CRITICAL puppet fail [17:06:58] PROBLEM - puppet last run on mw1145 is CRITICAL puppet fail [17:07:06] PROBLEM - puppet last 
run on mw1254 is CRITICAL puppet fail [… similar puppet-fail alerts omitted …] [17:08:08] PROBLEM - puppet last run on
mw1217 is CRITICAL puppet fail [… similar puppet-fail alerts omitted …] [17:09:00] ohai icinga [17:09:06] PROBLEM - puppet last run on mw1166 is CRITICAL puppet fail [… more puppet-fail alerts omitted …] [17:09:07] PROBLEM - puppet
last run on mw1118 is CRITICAL puppet fail [17:09:17] I'm going to revert and look at the deeper problem [17:09:17] PROBLEM - puppet last run on mw2090 is CRITICAL puppet fail [… similar puppet-fail alerts omitted …] [17:09:28] the revert hides an issue instead of fixing it but .... yeah [17:09:28] PROBLEM - puppet last run on mw2022 is CRITICAL puppet fail [17:09:28] PROBLEM - puppet last run on mw1123 is CRITICAL puppet fail [17:09:42] (03Abandoned) 10Rush: admin user sudoers permissions cleanup [puppet] - 10https://gerrit.wikimedia.org/r/207836 (owner: 10Rush) [17:09:46] PROBLEM - puppet last run on mw1025 is CRITICAL puppet fail [… similar puppet-fail alerts omitted …] [17:10:06] PROBLEM - puppet last run on
mw2168 is CRITICAL puppet fail [… similar puppet-fail alerts omitted …] [17:10:38] I'm now convinced: endless walls of critical puppet fail really is the best use of a shared communication channel. Thank you for the enlightenment, icinga-wm.
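The "Duplicate declaration" in the catalog error above is Puppet refusing to compile when two manifests declare a resource with the same type and title. A minimal sketch of the failing pattern — the resource parameters here are hypothetical, not the actual Wikimedia manifests:

```puppet
# modules/mediawiki/manifests/users.pp (sketch)
sudo::group { 'wikidev':
    privileges => ['ALL = (ALL) NOPASSWD: ALL'],  # hypothetical parameter
}

# modules/admin/manifests/group.pp (sketch) — a second Sudo::Group['wikidev']
# in the same catalog => "Duplicate declaration ... cannot redeclare"
sudo::group { 'wikidev':
    privileges => [],  # hypothetical parameter
}
```

The usual fix is to let exactly one module own the declaration (or use stdlib's ensure_resource() when two code paths may both need it), which is what the cleanup patches later in the log work toward.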
[17:10:46] PROBLEM - puppet last run on mw1237 is CRITICAL puppet fail [17:10:46] PROBLEM - puppet last run on mw1011 is CRITICAL puppet fail [17:10:47] PROBLEM - puppet last run on mw1168 is CRITICAL puppet fail [17:10:47] PROBLEM - puppet last run on mw2190 is CRITICAL puppet fail [17:10:49] (03PS1) 10Rush: Revert "admin group sudoers permissions cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/207841 [17:10:56] PROBLEM - puppet last run on mw1249 is CRITICAL puppet fail [17:10:56] PROBLEM - puppet last run on mw2030 is CRITICAL puppet fail [17:10:57] PROBLEM - puppet last run on mw1151 is CRITICAL puppet fail [17:10:57] PROBLEM - puppet last run on mw1076 is CRITICAL puppet fail [17:10:59] (03PS2) 10Rush: Revert "admin group sudoers permissions cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/207841 [17:11:07] PROBLEM - puppet last run on mw2200 is CRITICAL puppet fail [17:11:08] PROBLEM - puppet last run on mw2062 is CRITICAL puppet fail [17:11:16] PROBLEM - puppet last run on mw2192 is CRITICAL puppet fail [17:11:35] ^ dead, watching log manually till the spam fades [17:12:07] (03CR) 10Rush: [C: 032] Revert "admin group sudoers permissions cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/207841 (owner: 10Rush) [17:13:02] bblack: the problem is we are using a human group to allocate permissions to our deploy (read: service account) users [17:13:13] and trying to clean up the perms file in the human case [17:13:28] well a group isn't really a group [17:13:28] that's a weird situation [17:13:49] bd808: was it aborted? [17:13:54] sure I mean, we are mixing bot and human permissions [17:13:58] in weird and dangerous ways [17:14:13] kart_: I aborted, restarted and it is still stuck [17:14:15] or I should say, I think it's ok to mix bots and humans.
the group is just the label by which we say "these things can access these other things" [17:14:43] probably not a good idea when the group in question is the universal PUG for all users, even non-technical :) [17:14:46] the uids associated with that "group" don't have to form a coherent group in any other sense, like bots-v-humans-v-lesserhumans [17:15:38] but yeah, I'm sure wikidev to some degree suffers from becoming pointless due to overuse [17:16:06] replace pointless with dangerous I think in this case [17:16:12] sort of like how people used to set daemons to run as "nobody" because that was better than root, and then realized that if every daemon used "nobody", breaches spread pretty easily... [17:17:07] bd808: okay. [17:18:37] 6operations, 10Traffic, 5Patch-For-Review: Investigate Vary:Accept-Encoding issues on cache clusters - https://phabricator.wikimedia.org/T97128#1249514 (10BBlack) 5Open>3Invalid a:3BBlack Varnish already handles this sanely, I was just reading the wrong outdated documentation apparently! [17:18:54] (03Abandoned) 10BBlack: sanitize Accept-Encoding for cache efficiency [puppet] - 10https://gerrit.wikimedia.org/r/206387 (https://phabricator.wikimedia.org/T97128) (owner: 10BBlack) [17:19:42] akosiaris, around? [17:21:55] bd808: do I have to do anything with it? Nikerabbit can follow on it if we have to :) [17:22:08] kart_: no, you're clear [17:22:52] 6operations, 10Analytics-Cluster, 5Patch-For-Review: Set up ops kafkatee instance as part of udp2log transition - https://phabricator.wikimedia.org/T96616#1249528 (10Ottomata) 5Open>3Resolved [17:23:47] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1249533 (10BBlack) There seems to be some conflict in the investigations here, as I did say earlier "I think we co...
[17:24:02] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1249534 (10BBlack) [17:24:38] ^d: But but but... it's a feature! [17:24:40] :) [17:24:50] <^d> Ugh [17:31:43] (03PS1) 10Dzahn: admin: add aklapper to phabricator-admins [puppet] - 10https://gerrit.wikimedia.org/r/207846 (https://phabricator.wikimedia.org/T97642) [17:32:21] (03CR) 10jenkins-bot: [V: 04-1] admin: add aklapper to phabricator-admins [puppet] - 10https://gerrit.wikimedia.org/r/207846 (https://phabricator.wikimedia.org/T97642) (owner: 10Dzahn) [17:33:11] Jeff_Green: yt? [17:34:13] ya [17:34:44] SO! [17:34:45] https://github.com/wikimedia/operations-puppet/blob/production/templates/udp2log/filters.erbium.erb#L8 [17:34:46] :) [17:34:56] i know you are busy with crazy stuff [17:34:58] can I help? [17:35:19] what if I set up a kafkatee instance on erbium that was outputting the exact same logs? [17:35:37] that would be fine by me [17:36:08] ok, would you have time to verify that the output works for you, or should I just switch it and make kafkatee output to the same place, and turn off the udp2log filters? [17:37:18] ottomata: we're probably better off having one of the other fr-tech folks verify them, they're more familiar with the parser and whatnot [17:37:25] PROBLEM - puppet last run on oxygen is CRITICAL: CRITICAL: Puppet has 1 failures [17:37:52] (also, restarting icinga-wm, but that's from when it was offline ^) [17:38:36] also it's really tempting to just export NFS from frack (rather than pull in kafkatee) and be done with it :-P [17:39:08] Jeff_Green: also, another q, do you know if your filters do not need bits or upload traffic? [17:39:19] i don't think they do [17:39:25] (03CR) 10Anomie: "Hook works, but see If41e3dae." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/207279 (https://phabricator.wikimedia.org/T97469) (owner: 10Anomie) [17:40:16] ottomata1: i agree re upload traffic, a little less certain about bits [17:40:29] ha, well i think udp2log does not contain bits now anyway... :p [17:40:43] ha then I guess we have our answer! [17:41:02] haha [17:41:03] yeah [17:41:06] ok cool, that will be good [17:41:18] yeah, sorry, i might have missed something, do you have time to verify the output? [17:41:24] or should I just swap in place with the udp2log logs [17:41:39] !log sync-common on snapshot1004 failed after 33 minutes with rsync timeout [17:41:46] Logged the message, Master [17:43:04] ottomata1: sec, talking to katie [17:44:00] k [17:44:01] 6operations, 6Security, 10Wikimedia-General-or-Unknown, 7Mail: DMARC: Users cannot send emails via a wiki's [[Special:EmailUser]] - https://phabricator.wikimedia.org/T66795#1249686 (10TheDJ) The setting that needs to be changed is $wgUserEmailUseReplyTo This variable changes the behavior in SpecialEmailu... [17:44:38] ottomata1: is it feasible to write logs in parallel side by side for a bit? [17:45:08] we're thinking adam is the best candidate because he's most knowledgeable about the processor [17:45:21] (03CR) 10Phuedx: "@kaldari, @Bmansurov: As I said on the task, I think that this should go down (it's labs-only ATM)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206375 (https://phabricator.wikimedia.org/T94739) (owner: 10Phuedx) [17:45:49] yes we can do that [17:50:35] 6operations, 10Traffic: Evaluate limited caching inside nginx - https://phabricator.wikimedia.org/T96851#1249725 (10BBlack) p:5Low>3High [17:53:27] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:54:22] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?)
- https://phabricator.wikimedia.org/T91347#1249751 (10BBlack) Oh, I guess it's amazing sometimes what a fresh look can reveal. I thought we had... [17:54:52] Bad SSH2 cipher spec 'chacha20-poly1305@openssh.com,aes256-gcm@openssh.com,aes128-gcm@openssh.com,aes256-ctr,aes192-ctr,aes128-ctr'. [17:54:55] * AaronSchulz sighs [17:56:11] AaronSchulz: Mac? [17:56:30] there's info on upgrade/workaround for Macs in the thread [17:57:37] also, I really hate how when I sort by-priority and then drag-n-drop on a phab workboard, it edits priorities on drop :P [17:57:45] debian [17:57:59] (especially when it's a list of closed tasks that are greyed out, and thus the priority of them isn't obvious at all) [17:58:06] very old debian? [17:58:17] hmmm [17:59:25] (03PS1) 10Ottomata: Set up kafkatee instance on erbium to output fundraising logs [puppet] - 10https://gerrit.wikimedia.org/r/207858 (https://phabricator.wikimedia.org/T97294) [18:00:52] (03CR) 10Ottomata: [C: 032] Set up kafkatee instance on erbium to output fundraising logs [puppet] - 10https://gerrit.wikimedia.org/r/207858 (https://phabricator.wikimedia.org/T97294) (owner: 10Ottomata) [18:01:39] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban, 10Traffic: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1249819 (10ori) >>! In T91347#1249751, @BBlack wrote: > Oh, I guess it's amazing sometimes what a fre... 
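The "Bad SSH2 cipher spec" error above is what an older OpenSSH client prints when its configuration lists cipher names the binary doesn't recognize — chacha20-poly1305@openssh.com, for instance, only appeared in OpenSSH 6.5, so an older client (e.g. on old Debian) rejects the whole list. A sketch of the client-side workaround (the host pattern is illustrative; upgrading the client, as the thread suggests, is the better fix):

```
# ~/.ssh/config — sketch: list only ciphers the old client actually
# supports, instead of the full modern spec that triggers the error
Host *.wikimedia.org *.wmnet
    Ciphers aes256-ctr,aes192-ctr,aes128-ctr
```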
[18:02:41] (03PS2) 10BBlack: add role::logging::eventlistener to text and mobile varnishes [puppet] - 10https://gerrit.wikimedia.org/r/207692 (owner: 10Ori.livneh) [18:02:45] (03PS1) 10Ottomata: Add kafkatee logrotate for fundraising instance [puppet] - 10https://gerrit.wikimedia.org/r/207859 [18:03:00] (03CR) 10BBlack: [C: 032 V: 032] add role::logging::eventlistener to text and mobile varnishes [puppet] - 10https://gerrit.wikimedia.org/r/207692 (owner: 10Ori.livneh) [18:03:07] (03CR) 10Ottomata: [C: 032 V: 032] Add kafkatee logrotate for fundraising instance [puppet] - 10https://gerrit.wikimedia.org/r/207859 (owner: 10Ottomata) [18:04:04] I guess you did mine too :P [18:05:37] PROBLEM - puppet last run on erbium is CRITICAL Puppet has 1 failures [18:05:55] oh ha, bblack, sorry, yours ended up showing right before my scrollback, and I didn't notice the double committer [18:06:05] hope that's ok! [18:06:31] Jeff_Green: check out /a/log/fundraising-kafkatee on erbium [18:07:32] which adam should I bug? wight? [18:07:48] 6operations, 7HHVM: Custom session handler corrupted by session_destroy, "Failed to initialize storage module" - https://phabricator.wikimedia.org/T97675#1249847 (10Anomie) 3NEW [18:08:03] 6operations, 7HHVM: Custom session handler corrupted by session_destroy, "Failed to initialize storage module" - https://phabricator.wikimedia.org/T97675#1249858 (10Anomie) [18:11:46] 6operations, 10Traffic, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1249875 (10BBlack) looking at mediawiki-config repo, it's not immediately obvious to me how we'd test this on one wiki or anything like that. We could obviously gut the $wmfHostname['bits']...
[18:12:44] bblack: i'll show you [18:13:13] 6operations, 10Analytics-Cluster, 5Patch-For-Review: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1249877 (10Ottomata) 3NEW [18:13:31] Jeff_Green: https://phabricator.wikimedia.org/T97676 [18:13:50] great [18:14:24] 6operations, 10Analytics-Cluster, 5Patch-For-Review: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1249884 (10Ottomata) [18:15:16] ottomata: I guess short term you should rotate that log until we see how adam wants to use it, so it doesn't fill the hdd? [18:18:46] RECOVERY - puppet last run on erbium is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [18:18:51] it's got a rotate [18:19:13] /etc/logrotate.d/kafkatee-fundraising [18:19:22] 6operations, 6Release-Engineering: Move sudo permissions for deployment from modules/mediawiki/manifests/users.pp to data.yaml - https://phabricator.wikimedia.org/T97678#1249908 (10chasemp) 3NEW [18:21:59] great [18:23:28] 6operations, 6Release-Engineering: Move sudo permissions for deployment from modules/mediawiki/manifests/users.pp to data.yaml - https://phabricator.wikimedia.org/T97678#1249921 (10chasemp) [18:27:20] (03PS1) 10Ori.livneh: Add 'wmgUseBits' config option; true by default. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207872 [18:27:58] (03PS2) 10Ori.livneh: Add 'wmgUseBits' config option; true by default. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207872 [18:28:10] (03CR) 10Ori.livneh: [C: 032] Add 'wmgUseBits' config option; true by default. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207872 (owner: 10Ori.livneh) [18:28:17] (03Merged) 10jenkins-bot: Add 'wmgUseBits' config option; true by default. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207872 (owner: 10Ori.livneh) [18:29:44] ori: \o/ [18:31:27] bblack: voila: http://se.wikibooks.org/ [18:31:55] look ma, no bits!
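The /etc/logrotate.d/kafkatee-fundraising file mentioned above isn't quoted in the log; a hypothetical sketch of what such an entry conventionally looks like (every path and directive here is illustrative, not the deployed config):

```
/a/log/fundraising-kafkatee/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
}
```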
[18:33:34] 6operations, 10Traffic, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1249980 (10ori) >>! In T95448#1249875, @BBlack wrote: > looking at mediawiki-config repo, it's not immediately obvious to me how we'd test this on one wiki or anything like that. We could o... [18:34:00] I still see bits! [18:34:36] bblack: well, for 'powered by' [18:34:42] meh ok, i'll fix that too sec [18:35:07] and images/wikimedia-button.png [18:35:08] (03PS1) 10Jforrester: Disable wmgVisualEditorEnableTocWidget even on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207874 [18:35:10] and the resource loader [18:36:05] bblack: huh? not for me [18:36:36] oh, cache maybe? [18:36:43] I could try globally banning se.wb.o [18:36:51] probably. add a cache-busting thing to the URI [18:37:14] 6operations, 6Release-Engineering: Move sudo permissions for deployment from modules/mediawiki/manifests/users.pp to data.yaml - https://phabricator.wikimedia.org/T97678#1249993 (10chasemp) [18:38:08] yeah that was it [18:38:11] I'm getting "fork failed: Resource temporarily unavailable" when trying to ssh to gerrit. Anybody know what could cause this? [18:38:18] ~900ms vs ~720ms just now [18:38:35] (for whatever the chrome dev extension marks as when everything's done) [18:39:47] bblack, 7.8 [18:40:43] SMalyshev: is that a local error message? are you actually using "ssh", or you mean git-over-ssh? [18:41:00] (FWIW, my git-over-ssh to gerrit seems ok) [18:41:09] bblack: happens on both. I don't really know what is causing it [18:41:40] and a lot of "ssh_exchange_identification: Connection closed by remote host" after it [18:41:55] (03PS1) 10Rush: admin cleanup permissions granted to wikidev [puppet] - 10https://gerrit.wikimedia.org/r/207877 [18:41:56] ah could be the bastion? [18:42:13] SMalyshev: you've reached the maximum number of processes [18:42:16] (if you're hitting gerrit through one, you don't have to!)
[18:42:32] 6operations, 6Release-Engineering: Move sudo permissions for deployment from modules/mediawiki/manifests/users.pp to data.yaml - https://phabricator.wikimedia.org/T97678#1250005 (10chasemp) https://gerrit.wikimedia.org/r/#/c/207877/ [18:42:33] ori: locally? not likely as I can run other stuff with no problem [18:43:08] ori: also happens right after reboot too [18:43:29] bblack: the thing is I can ssh to bastion... [18:44:18] SMalyshev: I'm on the gerrit server, nothing seems amiss there in terms of load or memory exhaustion, etc [18:44:23] and all my sessions to it work fine [18:44:37] (03PS2) 10Alex Monk: Admin: Cleanup permissions granted to wikidev [puppet] - 10https://gerrit.wikimedia.org/r/207877 (https://phabricator.wikimedia.org/T97678) (owner: 10Rush) [18:44:38] bblack: looks like it's bastion... if I exclude bastion proxy command, it works fine [18:45:00] well, I'd exclude that in general (make an exception for gerrit.wikimedia.org in the Host line) [18:45:04] what bastion? [18:45:26] bblack: bast1001.wikimedia.org [18:46:07] hmmm [18:46:22] I don't see any general problems there either, and it's not like you have any other procs running there for a resource limit [18:46:38] maybe ssh forwarding misconfig -> infinite loop of bastion->bastion connections when trying to reach gerrit? [18:46:43] !log rebooting labstore1002 in preparation for the switch to make sure it starts up cleanly. [18:46:46] (03CR) 10Ori.livneh: [C: 031] "Looks correct. Haven't tested." [puppet] - 10https://gerrit.wikimedia.org/r/207877 (https://phabricator.wikimedia.org/T97678) (owner: 10Rush) [18:46:50] Logged the message, Master [18:47:24] SMalyshev: what did the ssh config look like re gerrit + bastion?
[18:47:47] PROBLEM - Host labstore1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:49:31] ori: I'm gonna ban se.wb.o just to make further testing easier, but I'll wait for your poweredby [18:49:50] bblack: Host *.wikimedia.org *.eqiad.wmnet [18:49:50] ProxyCommand ssh -a -W %h:%p bastion-wikimedia [18:50:03] maybe *.wikimedia.org is a problem... [18:50:11] ... you're going to what se.wb.o? [18:50:13] what's bastion-wikimedia? [18:50:29] Host bastion-wikimedia [18:50:29] HostName bast1001.wikimedia.org [18:50:34] Krenair: I'm going to flush out varnish caching of se.wikibooks.org - it's a closed wiki [18:50:40] oh right, closed, ok [18:50:46] SMalyshev, you send all your gerrit traffic via bastion? [18:50:51] I wonder if it created a loop... [18:51:01] SMalyshev: yes, that creates a loop :) [18:51:08] Krenair: I guess so... I copied it from somewhere I'm sure [18:51:12] bblack: hang on, poweredby change coming [18:51:20] (03CR) 10Rush: [C: 032] Admin: Cleanup permissions granted to wikidev [puppet] - 10https://gerrit.wikimedia.org/r/207877 (https://phabricator.wikimedia.org/T97678) (owner: 10Rush) [18:51:24] Yeah, I wouldn't trust those copy+paste wikimedia SSH configs [18:51:30] make the Host line like "Host *.wikimedia.org *.eqiad.wmnet !gerrit.wikimedia.org !bast1001.wikimedia.org" [18:51:34] SMalyshev: ^ [18:51:58] bblack: ok, excluding it seems to fix the problem, thanks! [18:52:03] Aren't there other hosts around still that don't need bastion? [18:52:06] Silver used to be one. [18:52:24] yeah there's no simple 2-liner that covers correct bastion usage :) [18:53:11] but for the most part, if you're willing to live with a little excess-bastion, it's not so bad. gerrit's a good one to exclude for all the git traffic though. [18:53:32] Stupid post that takes forever and a half. 
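Putting the exchange above together: the original config proxied every *.wikimedia.org host through the bastion, and the thread's diagnosis was that this produced bastion→bastion recursion until the process limit was hit ("fork failed: Resource temporarily unavailable"). The negated patterns bblack suggests make gerrit and the bastion itself bypass the ProxyCommand; a sketch of the corrected ~/.ssh/config from the fragments quoted in the log:

```
Host bastion-wikimedia
    HostName bast1001.wikimedia.org

# Negations: the bastion itself (stops the proxy recursion) and gerrit
# (directly reachable, and it carries all the git traffic anyway)
Host *.wikimedia.org *.eqiad.wmnet !gerrit.wikimedia.org !bast1001.wikimedia.org
    ProxyCommand ssh -a -W %h:%p bastion-wikimedia
```

As noted in the channel, other directly-reachable hosts may deserve their own `!` entries too; there is no simple two-liner that covers correct bastion usage.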
[18:54:56] (03PS1) 10Ori.livneh: Define $wgAssetsHost based on wmgUseBits; use it to reference standard chrome [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207886 [18:55:28] ori: Are you able to sync a trivial Beta Cluster config change whilst you're there? (Sorry to ask.) [18:55:39] James_F: sure [18:55:40] which? [18:55:45] ori: https://gerrit.wikimedia.org/r/#/c/207874/ [18:55:45] What in the name of Baal is going on with labs HW? [18:55:57] (03CR) 10Ori.livneh: [C: 032] Disable wmgVisualEditorEnableTocWidget even on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207874 (owner: 10Jforrester) [18:56:02] Thanks! [18:56:03] (03Merged) 10jenkins-bot: Disable wmgVisualEditorEnableTocWidget even on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207874 (owner: 10Jforrester) [18:56:11] (03CR) 10Ori.livneh: [C: 032] Define $wgAssetsHost based on wmgUseBits; use it to reference standard chrome [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207886 (owner: 10Ori.livneh) [18:56:16] (03Merged) 10jenkins-bot: Define $wgAssetsHost based on wmgUseBits; use it to reference standard chrome [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207886 (owner: 10Ori.livneh) [18:58:58] (03PS1) 10Rush: Revert "Revert "admin group sudoers permissions cleanup"" [puppet] - 10https://gerrit.wikimedia.org/r/207890 [18:59:16] (03PS2) 10Rush: Revert "Revert "admin group sudoers permissions cleanup"" [puppet] - 10https://gerrit.wikimedia.org/r/207890 [18:59:33] bblack: need to apply a fix-up to that, sec [19:00:44] (03CR) 10Rush: [C: 032] Revert "Revert "admin group sudoers permissions cleanup"" [puppet] - 10https://gerrit.wikimedia.org/r/207890 (owner: 10Rush) [19:00:55] hmm yeah didn't seem to affect https://bits.wikimedia.org/images/wikimedia-button.png for the poweredby [19:03:06] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [19:04:57] (03PS1) 10Ori.livneh: Fix-up for 
I9ee6bec1f [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207893 [19:05:29] (03CR) 10Ori.livneh: [C: 032] Fix-up for I9ee6bec1f [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207893 (owner: 10Ori.livneh) [19:06:20] (03Merged) 10jenkins-bot: Fix-up for I9ee6bec1f [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207893 (owner: 10Ori.livneh) [19:07:53] !log ori Synchronized wmf-config/CommonSettings.php: I93cdc4a2e and I9ee6bec1f: Define $wgAssetsHost based on wmgUseBits; use it to reference standard chrome (duration: 00m 16s) [19:08:00] Logged the message, Master [19:08:12] godog: yt? [19:08:27] bblack: {{done}} [19:11:58] ori: purged [19:12:48] there's still the logo from upload, but there's going to be lots of stuff from upload in the common case I guess [19:13:06] (03PS1) 10Rush: Admin: consolidate command to grant root [puppet] - 10https://gerrit.wikimedia.org/r/207895 [19:13:57] works, ship it! :) [19:14:26] (03CR) 10Rush: [C: 032] Admin: consolidate command to grant root [puppet] - 10https://gerrit.wikimedia.org/r/207895 (owner: 10Rush) [19:14:37] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [19:15:05] bblack: IMO the next step is to enable it on mediawiki.org. It has real users, but its users are developers who are good at spotting and reporting bugs. [19:15:07] _joe_: yt? [19:15:17] ori: that seems reasonable [19:15:22] <_joe_> ottomata: more or less, yes [19:15:29] thinking about reqstats overhaul [19:15:55] https://phabricator.wikimedia.org/T83580 [19:16:01] <_joe_> ottomata: there is a phab ticket from faidon assigned to me [19:16:02] why not just keep it really really simple [19:16:26] and run multiple varnishncsas -m ... | some process that just counts # lines and sends stats per minute?
[19:16:26] (having logos coming from upload may actually help in the common case, to get the upload conn primed for the article images, I guess - assuming the user is a fresh uncached hit straight into a page) [19:16:43] e.g. [19:17:13] varnishncsa -n frontend -m TxStatus:^4.. | count_lines_and_send_every_60_seconds -statsd -metric reqstats.4xx [19:17:14] or whatever [19:17:38] ( just looking for a brain bouncer, if you have stopped working feel free to ignore :) ) [19:17:58] my brain says, "why does stats always involve pipelining together 16 tools?":) [19:18:13] bblack ^ that would be really simple, no? [19:18:32] I guess because reimplementing the VSL stuff that vk or vncsa does would be a real PITA [19:18:45] but after it's out of vncsa, it could still be just one daemon on the host [19:18:55] ? [19:19:07] vsl stuff? [19:19:11] the shmlog stuff [19:19:16] oh [19:19:26] which is why your count_lines_and_send daemon can't skip using varnishncsa as a feeder [19:19:31] you mean doing more of a varnishstats kinda thing, rather than reading shmlogs for all of them? [19:19:56] no, I mean having the sending daemon actually read the shmlog for itself instead of piping from a varnishncsa command [19:20:02] (03CR) 10Alex Monk: Create Wikipedia Konkani (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [19:20:09] but VSL does not make that easy, so yeah :/ [19:20:29] (03PS1) 10Rush: Admin: granting commands as root syntax [puppet] - 10https://gerrit.wikimedia.org/r/207899 [19:21:05] 6operations, 10ops-codfw: labstore1002 does not pass POST - https://phabricator.wikimedia.org/T97688#1250127 (10coren) 3NEW a:3Cmjohnson [19:21:07] ah [19:21:09] yeah [19:21:23] the chain I'm hyperbolizing about is the current one, which seems to be (if I'm interpreting it all correctly)....
[19:21:43] 6operations, 10ops-codfw: labstore1002 does not pass POST - https://phabricator.wikimedia.org/T97688#1250137 (10coren) [19:21:48] (03CR) 10Rush: [C: 032] Admin: granting commands as root syntax [puppet] - 10https://gerrit.wikimedia.org/r/207899 (owner: 10Rush) [19:22:05] varnishkafka reads shmlog, outputs filesystem files in /var/cache/varnishkafka/webrequest.stats.json, then a once-per-minute logster cronjob picks that up and sends it to a local statsd process, which then sends it off the host elsewhere [19:22:12] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1250140 (10GWicke) [19:22:38] ohoh [19:22:38] nono [19:22:43] oh [19:22:53] um sorta [19:22:54] bblack, but [19:23:08] the stats that are in stats.json have nothing to do with the shmlogs [19:23:14] those are just vk operational stats [19:23:16] like, kafka rtt times [19:23:18] etc. [19:23:19] ah [19:23:30] rtts [19:23:35] rttts [19:23:36] heh [19:23:54] i mean, here you go [19:23:54] well still, why does this have to hit a filesystem and a cronjob and then another daemon before leaving the host? [19:23:59] here's your 4xx per minute command: [19:24:00] varnishncsa -n frontend -m TxStatus:^4..
| pv -l --interval 60 > /dev/null [19:24:18] bblack, ok we are talking about two things here [19:24:26] so, 1) you think the .stats.json stuff is unnecessary, maybe so [19:24:34] the thing i want to talk about is request stats [19:24:37] that is currently doing this: [19:24:56] varnishncsa -> udp2log -> perl script somewhere that counts things -> statsd [19:25:16] replacing that is what https://phabricator.wikimedia.org/T83580 is about [19:25:24] now, for the vk .stats.json files [19:25:45] the reason it does that, rather than what, send directly to statsd, is because magnus wanted to be agnostic about how stats were collected [19:26:14] so, if we really cared, yes, we could implement a statsd sender in vk, which might be nice to have [19:26:14] oh we had this conversation, in the ticket, a week ago [19:26:53] ja, anyway yeah, about the reqstats [19:26:59] (03PS1) 10Ori.livneh: wmgUseBits: false for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207902 [19:27:12] would it be bad to run a bunch of local varnishncsas that used -m to filter for a single metric to count? [19:27:22] (03CR) 10Ori.livneh: [C: 032] wmgUseBits: false for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207902 (owner: 10Ori.livneh) [19:27:24] ori: I'm really excited about the potential for this to drop page times for SPDY clients :) [19:27:34] (well, and to be able to nuke the bits infrastructure) [19:27:47] meeeee too [19:27:49] (03Merged) 10jenkins-bot: wmgUseBits: false for mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207902 (owner: 10Ori.livneh) [19:27:50] or, would it be better to run a single varnishncsa (or shmlog reader of some kind) and parse ALL of the requests and bucket and send stats [19:28:23] i tend to think that sense this is all local shmlogs, letting varnishncsa filter the shmlogs with -m would be better, but i don't know for sure [19:28:27] since* [19:28:52] no, I think your "vncsa -m ....
| whatever" pattern is reasonable [19:29:20] mostly just because touching nvcsa/vkafka code or writing another VSL (shmlog) consumer all sounds like a total PITA and more difficult-to-maintain code [19:29:25] yeah [19:29:30] aye ok, cool, good to have some backup for crazy ideas. i will suggest this on the ticket [19:29:37] (03CR) 10Alex Monk: Create Wikipedia Konkani (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [19:33:49] ottomata: you can make that into a systemd service unit too, using ExecStart=/bin/sh -c 'nvcsa -m ... | whatever' + Type=simple [19:34:54] 6operations, 6Release-Engineering, 5Patch-For-Review: Move sudo permissions for deployment from modules/mediawiki/manifests/users.pp to data.yaml - https://phabricator.wikimedia.org/T97678#1250166 (10chasemp) 5Open>3Resolved a:3chasemp [19:35:07] krenair I am on unreliable internet [19:35:19] yeah bblack, was thinking that too :) [19:35:26] so can't really help today with adding the wiki [19:35:26] i'm seeing if I can make pv output in statsd format [19:35:29] then it's as simple as [19:35:43] varnishncsa | pv --interval 60 | netcat statsd.eqiad.wmnet 8125 [19:35:51] heh [19:36:05] audephone, that's ok [19:36:19] If you want, leave it off wikidataclient dblist and I can take care of it on Monday [19:36:20] the wikidata bit seems straightforward [19:36:25] ok [19:36:54] The site needs to be added to the sites table on other wikis but [19:36:59] ottomata: well I assumed you had other uses of this pattern, which didn't just count lines/sec [19:37:25] we modified some rows by hand to set them to use https [19:37:51] not sure we can just run the script everywhere as it is now [19:37:53] bblack, i'm not certain, but i think that for all of the things listed in that ticket, we can just use -m to make the output only be lines that match the metric we are counting [19:38:03] which means each one turns into lines / second
[19:38:07] or / minute [19:38:08] or whatever [19:39:06] what kinds of filters? [19:39:11] oh like 5xx [19:39:22] yeah [19:39:42] if you're doing one for overall rate, you could use varnishstat to be much more efficient of course [19:39:51] e.g. [19:39:52] # varnishstat -1 -f client_req -w 60 -j [19:39:52] { "timestamp": "2015-04-30T19:38:31", "client_req": {"value": 123326082, "flag": "a", "description": "Client requests received"} [19:40:02] will keep outputting those JSON blocks once a minute [19:40:06] (03CR) 10Alex Monk: Create Wikipedia Konkani (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [19:40:42] ah cool [19:40:45] ok yeah [19:40:51] nice [19:40:58] but there's no filtering there, it's just the global stat counter for all of that instance [19:41:03] yeah [19:41:04] ok [19:41:53] (and I think in the client_req case, you do your own math, the number is always-incrementing, count of reqs so far) [19:44:24] (03PS1) 10Andrew Bogott: Throttle the rsync in cold-migrate. [puppet] - 10https://gerrit.wikimedia.org/r/207908 [19:45:18] (03PS1) 10Paladox: Change setting name from $wmincClosedWikis to $wgwmincClosedWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207909 [19:45:39] (03PS2) 10Paladox: Change setting name from $wmincClosedWikis to $wgwmincClosedWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207909 [19:45:43] (03CR) 10Andrew Bogott: [C: 032] Throttle the rsync in cold-migrate. [puppet] - 10https://gerrit.wikimedia.org/r/207908 (owner: 10Andrew Bogott) [19:45:51] (03PS3) 10Paladox: Change setting name from $wmincClosedWikis to $wgwmincClosedWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207909 [19:50:24] (03PS8) 10Alex Monk: Create Wikipedia Konkani [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn) [19:52:28] greg-g, hey, around? 
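An editor's note on the counter math bblack describes: `client_req` only ever increases, so whatever consumes `varnishstat -1 -f client_req -w 60 -j` has to diff successive samples itself. A minimal sketch of that step — the sample values are made up, standing in for the parsed JSON `value` fields:

```shell
#!/bin/sh
# deltas: given successive readings of an always-incrementing counter
# (like varnishstat's client_req), print the per-interval differences --
# the numbers you would actually forward to statsd/graphite.
deltas() {
  prev=""
  for value in "$@"; do
    if [ -n "$prev" ]; then
      echo $((value - prev))
    fi
    prev="$value"
  done
}

# Made-up samples, one per "varnishstat ... -w 60" tick:
deltas 123326082 123329082 123333082   # prints 3000 then 4000
```

In practice the values would be extracted from each JSON block (e.g. with jq) before being fed to the delta step.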
[19:57:09] 6operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 5Patch-For-Review: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1250220 (10Nemo_bis) > since they *have* reached the threshold some time ago Impossible, or the locale would have been exported, per http... [19:59:00] actually, never mind [20:01:41] (03PS1) 10Andrew Bogott: Add some glance policies for Horizon. [puppet] - 10https://gerrit.wikimedia.org/r/207913 [20:03:52] (03PS2) 10Andrew Bogott: Add some glance policies for Horizon. [puppet] - 10https://gerrit.wikimedia.org/r/207913 [20:05:07] (03CR) 10Andrew Bogott: [C: 032] Add some glance policies for Horizon. [puppet] - 10https://gerrit.wikimedia.org/r/207913 (owner: 10Andrew Bogott) [20:07:01] <_joe_> ottomata: sorry I fell asleep :P [20:07:16] s'ok, i'm getting somewhere :) [20:07:56] <_joe_> ottomata: I would have used the varnishkafka feed tbh [20:08:38] <_joe_> the idea here is to use one mean to extract info [20:08:46] ? [20:08:59] you mean you like the idea of parsing everything by consuming from kafka? [20:09:36] <_joe_> I don't know how our varnishkafka is set up [20:09:51] <_joe_> do you attach topic tags to messages incoming in kafka? [20:10:11] yes, per cache type [20:10:13] upload, bits, etc. 
[20:10:23] <_joe_> I guessed it was possible to tag 5xx too with an additional topic [20:10:31] kafkatee will let us consume from multiple topics [20:10:32] hmmm [20:10:36] not sure that would make much sense [20:10:39] would duplicate the data really [20:10:50] <_joe_> ottomata: basically we want the 5xx separable for datacenter/cache type [20:10:58] (03PS1) 10Andrew Bogott: Standardize on 'glance_policy.json' vs 'image_policy.json' [puppet] - 10https://gerrit.wikimedia.org/r/207919 [20:11:04] <_joe_> so I can see the 5xx for text esams [20:11:11] <_joe_> for all text [20:11:14] <_joe_> and so on :) [20:11:26] PROBLEM - puppet last run on californium is CRITICAL Puppet has 1 failures [20:11:40] _joe_, could tag that at the metric level on the host itself [20:11:43] <_joe_> ottomata: duplicate data? you can't attach multiple tags to data? [20:11:50] (03CR) 10Andrew Bogott: [C: 032] Standardize on 'glance_policy.json' vs 'image_policy.json' [puppet] - 10https://gerrit.wikimedia.org/r/207919 (owner: 10Andrew Bogott) [20:11:59] _joe_: i'm not sure i understand [20:12:01] what are you suggesting? 
[20:12:15] <_joe_> ottomata: never used kafka, I used rabbitmq and you can tag any data incoming with multiple tags [20:12:28] <_joe_> so you could have esams,5xx [20:12:46] hm, not really, there isn't much metadata, the data is all in the message [20:12:48] you can key messages [20:12:56] but, you can't really choose to only consume those keys [20:12:57] RECOVERY - puppet last run on californium is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [20:13:08] the keys are used to route messages to partitions [20:13:19] and partitions are the finest level of consumer granularity [20:13:30] so you could use that to say, make sure all of user A's messages go to the same consumer [20:13:46] but, _joe_, we have that data in the message data already [20:13:47] so yes [20:13:49] you could do [20:14:10] consume from kafka | jq fancy filter where http_status == 5xx | count it [20:14:26] but, that means you would be filtering out all 5xxes from ALL webrequests [20:14:31] peaks at around 200K / second [20:15:01] <_joe_> yeah too much [20:15:11] paravoid and godog seem to want to do this at the cache level, which might make sense, cause then you can just aggregate for that stuff in graphite [20:15:16] if we made the metric names like [20:15:33] (03PS7) 10BryanDavis: Add AffCom user group application contact page on meta (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207332 (https://phabricator.wikimedia.org/T95789) (owner: 10Alex Monk) [20:15:34] reqstats.eqiad.cp1052.5xx.count [20:15:36] or something [20:15:41] <_joe_> yeah that makes sense of course [20:15:41] <_joe_> mh [20:15:42] <_joe_> no [20:15:46] or whatever [20:15:52] whatever yall graphiters think is best [20:16:12] <_joe_> reqstats.eqiad.text.cp1052.503.count [20:16:20] <_joe_> if possible, that would be best :P [20:16:20] good w me :) [20:16:39] so, since varnishncsa can use -m to match on tags [20:16:49] this is really easy to do if we just run one vncsa per metric we want
[20:16:54] something like [20:16:57] varnishncsa -n frontend -m TxStatus:^4.. | pv --numeric -l --interval 60 [20:17:07] that's the count of 4xxs per minute [20:17:11] <_joe_> ok [20:17:12] per cache host [20:17:14] (03PS8) 10BryanDavis: Add AffCom user group application contact page on meta (take 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207332 (https://phabricator.wikimedia.org/T95789) (owner: 10Alex Monk) [20:17:24] <_joe_> sounds nice, yes [20:17:31] i was trying to figure out a fancy way to make pv output in statsd format [20:17:34] almost there, but i gotta run soon [20:17:36] this almost works! [20:17:39] bd808: hmm what's with the double disk failures in md0 on logstash1004-6? [20:17:42] <_joe_> and you send that number to statsd :P [20:17:58] varnishncsa -n frontend -m TxStatus:^4.. | pv -f -l --interval 2 -F 'test.reqstats.4xx.per_minute:%b|c' > /dev/null [20:18:01] ja [20:18:23] but, pv, doesn't work with --format and --numeric :( soooo, anyway, there will be a way to do it [20:18:46] jgage: no clue. _joe_ was working on getting Elasticsearch up and running there. They aren't in use yet. [20:20:23] perhaps a hack to decrease the cost of writes while they fill with data [20:23:03] 6operations, 7Monitoring: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1250338 (10Ottomata) What if we just abstract out something like this: varnishncsa -n frontend -m TxStatus:^4.. | pv --numeric -l --interval 60 > /dev/null That right there is a count of 4xxes per minute. Gotta run so... [20:43:39] !log renaming <2k users who were missed in the original run (SUL finalization) [20:43:50] Logged the message, Master [20:49:58] AaronSchulz: do you know why enwiki has 11m refreshLinks jobs queued? [20:50:45] AaronSchulz: also http://fpaste.org/217374/30427039/raw/ :/ [21:03:01] (03PS1) 10Andrew Bogott: No #s allowed in .json, apparently.
[puppet] - 10https://gerrit.wikimedia.org/r/207977 [21:05:27] !log Finally got sync-common to run to completion on snapshot1004; runtime 45 minutes! [21:05:34] Logged the message, Master [21:06:09] greg-g, can we have a window at 2:30 Pacific (in 25 minutes) for stuff related to https://phabricator.wikimedia.org/T94953 ? [21:06:29] (03CR) 10Andrew Bogott: [C: 032] No #s allowed in .json, apparently. [puppet] - 10https://gerrit.wikimedia.org/r/207977 (owner: 10Andrew Bogott) [21:17:18] (03PS1) 10Aaron Schulz: Revert "Set dedicated SUL rename runner loop" [puppet] - 10https://gerrit.wikimedia.org/r/207982 [21:24:16] PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 11.11% of data above the critical threshold [20000.0] [21:26:47] 6operations, 10Traffic, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1250636 (10BBlack) This is live on mediawikiwiki now as well ( https://www.mediawiki.org/ )! [21:26:58] 6operations, 10Traffic, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1250640 (10BBlack) [21:29:07] RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0] [21:35:59] (03PS1) 10Andrew Bogott: Further restrict glance policies. [puppet] - 10https://gerrit.wikimedia.org/r/207988 [21:36:33] (03CR) 10Andrew Bogott: [C: 032] Further restrict glance policies. [puppet] - 10https://gerrit.wikimedia.org/r/207988 (owner: 10Andrew Bogott) [21:36:54] (03PS1) 10GWicke: Increase new generation size to 1/4 heap [puppet] - 10https://gerrit.wikimedia.org/r/207989 [21:37:31] (03PS2) 10Andrew Bogott: Further restrict glance policies. 
[puppet] - 10https://gerrit.wikimedia.org/r/207988 [21:38:20] (03Abandoned) 10GWicke: Increase the new generation to 1/4 heap [puppet] - 10https://gerrit.wikimedia.org/r/192762 (owner: 10GWicke) [21:42:19] matt_flaschen: yeah [21:42:26] sorry, was afk [21:43:08] greg-g, well I didn't start yet. How about 3? [21:44:20] sure [21:45:38] (03CR) 10Eevans: Increase new generation size to 1/4 heap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/207989 (owner: 10GWicke) [21:46:52] (03PS1) 10Andrew Bogott: Restrict a few nova policices in Horizon. [puppet] - 10https://gerrit.wikimedia.org/r/207997 [21:47:42] (03CR) 10GWicke: Increase new generation size to 1/4 heap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/207989 (owner: 10GWicke) [21:47:49] (03PS2) 10Andrew Bogott: Restrict a few nova policies in Horizon. [puppet] - 10https://gerrit.wikimedia.org/r/207997 [21:48:05] (03PS2) 10GWicke: Increase new generation size to 1/4 heap [puppet] - 10https://gerrit.wikimedia.org/r/207989 [21:48:17] (03PS1) 10Ori.livneh: wmgUseBits: false for itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207998 [21:48:21] (03CR) 10Andrew Bogott: [C: 032] Restrict a few nova policies in Horizon. [puppet] - 10https://gerrit.wikimedia.org/r/207997 (owner: 10Andrew Bogott) [21:48:38] 6operations, 10Wikimedia-Logstash, 5Patch-For-Review, 15User-Bd808-Test: Rack and Setup (3) Logstash Servers - https://phabricator.wikimedia.org/T96692#1250712 (10bd808) [21:50:23] (03CR) 10BBlack: [C: 031] wmgUseBits: false for itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207998 (owner: 10Ori.livneh) [21:52:10] (03CR) 10Bmansurov: "@kaldari, would you please then?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206375 (https://phabricator.wikimedia.org/T94739) (owner: 10Phuedx) [21:53:30] bmansurov: So I should deploy it? 
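Stepping back to the reqstats thread above: the format ottomata was coaxing out of `pv -F` is just statsd's counter syntax, `name:value|c`, sent as a UDP datagram to port 8125. A sketch of the formatting step only — the metric name follows _joe_'s `reqstats.eqiad.text.cp1052.503.count` scheme, and the UDP send is left as a comment since netcat flags vary between implementations:

```shell
#!/bin/sh
# statsd_count: format a statsd counter increment. A real sender would pipe
# the payload over UDP, e.g.:
#   statsd_count reqstats.eqiad.text.cp1052.503.count 17 | nc -u -w0 statsd.eqiad.wmnet 8125
# (those nc flags are illustrative, not verified against the prod netcat).
statsd_count() {
  printf '%s:%s|c\n' "$1" "$2"
}

statsd_count reqstats.eqiad.text.cp1052.503.count 17
# prints: reqstats.eqiad.text.cp1052.503.count:17|c
```

Counters (`|c`) are summed by statsd per flush interval, which is why a plain per-minute line count is enough on the sending side.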
[21:53:46] 6operations, 10Wikimedia-Logstash: Elasticsearch not starting on Jessie hosts - https://phabricator.wikimedia.org/T97645#1250722 (10bd808) @joe volunteered to help me fix this up. The first problem he found was that the version of Elasticsearch we got via apt-get was ancient (1.0.3+dfsg-5). We need 1.3.6 to ma... [21:55:25] 6operations, 10Wikimedia-Logstash: Elasticsearch not starting on Jessie hosts - https://phabricator.wikimedia.org/T97645#1250724 (10demon) Ouch yeah that'd do it. Come to think of it, I must've installed from upstream's apt when I did this... my experience might be less useful. [21:56:37] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:57:06] PROBLEM - dhclient process on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:57:27] PROBLEM - salt-minion processes on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:58:22] 6operations, 10Wikimedia-Logstash: Elasticsearch not starting on Jessie hosts - https://phabricator.wikimedia.org/T97645#1250739 (10Manybubbles) Nice! 1.5.2 is the most current version of Elasticsearch and 1.6.X is the has some compelling work to make rolling restarts faster so I've mostly been waiting on tha... 
[22:02:23] (03CR) 10Eevans: Increase new generation size to 1/4 heap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/207989 (owner: 10GWicke) [22:03:43] (03CR) 10GWicke: Increase new generation size to 1/4 heap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/207989 (owner: 10GWicke) [22:06:05] kaldari: yes [22:09:39] (03CR) 10Eevans: Increase new generation size to 1/4 heap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/207989 (owner: 10GWicke) [22:10:38] (03CR) 10GWicke: Increase new generation size to 1/4 heap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/207989 (owner: 10GWicke) [22:11:59] (03CR) 10Eevans: [C: 031] Increase new generation size to 1/4 heap [puppet] - 10https://gerrit.wikimedia.org/r/207989 (owner: 10GWicke) [22:13:15] (03PS3) 10GWicke: Increase new generation size to 1/4 heap [puppet] - 10https://gerrit.wikimedia.org/r/207989 [22:14:21] (03CR) 10Eevans: [C: 031] Increase new generation size to 1/4 heap [puppet] - 10https://gerrit.wikimedia.org/r/207989 (owner: 10GWicke) [22:14:45] !log mattflaschen Started scap: Deploy Flow changes to 1.26wmf4 facilitate LQT->Flow conversion [22:14:50] Logged the message, Master [22:29:10] jouncebot, next [22:29:10] In 0 hour(s) and 30 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150430T2300) [22:35:43] (03PS1) 10Andrew Bogott: Updated labs graphic to fit. [puppet] - 10https://gerrit.wikimedia.org/r/208014 [22:36:28] (03CR) 10Andrew Bogott: [C: 032] Updated labs graphic to fit. [puppet] - 10https://gerrit.wikimedia.org/r/208014 (owner: 10Andrew Bogott) [22:40:49] hmm, did something change on terbium? Attempting to connect to a db is getting me a sudo prompt [22:42:52] It might be hanging on: [22:42:54] sync-common: 99% (ok: 464; fail: 0; left: 1) [22:42:59] Anyone seen that? 
[22:43:01] <^d> snapshot1004 [22:43:35] <^d> matt_flaschen: bd808 said something earlier about opening a second terminal to tin and killing your own ssh connection to the node [22:43:39] <^d> (since you'll own the pid) [22:44:34] Thanks ^d, done. [22:44:51] <^d> yw [22:45:35] <^d> jamesofur: Doing what exactly? [22:46:13] ^d: any attempt to get into an sql database, so the last two attempts were 'sql votewiki' and 'sql votewiki -h db1027' for example [22:46:31] but looks to be the same for all db's, at least for me [22:46:39] <^d> hmm, wonder if it was the sudo changes earlier [22:46:42] <^d> chasemp ^? [22:46:58] wasn't that reverted? [22:47:11] <^d> Hmm, wfm. [22:47:19] Since jamesofur's account is restricted rather than deployment it will be different for him [22:47:22] it wouldn't affect that afaik [22:47:37] but let me look [22:47:46] <^d> Just my first inclination because sudo and a deploy target :) [22:47:47] (It works for me too) [22:48:02] (03PS1) 10Andrew Bogott: Spruce up Horizon graphics [puppet] - 10https://gerrit.wikimedia.org/r/208015 [22:48:20] !log mattflaschen Finished scap: Deploy Flow changes to 1.26wmf4 facilitate LQT->Flow conversion (duration: 33m 35s) [22:48:30] Logged the message, Master [22:48:59] (03CR) 10Andrew Bogott: [C: 032] Spruce up Horizon graphics [puppet] - 10https://gerrit.wikimedia.org/r/208015 (owner: 10Andrew Bogott) [22:49:38] In https://github.com/wikimedia/operations-puppet/commit/423c8d6bbb46abec36ef2c0ba1306c41460bd2c5 chasemp changed some rights from wikidev (i.e., including jamesofur) to deployment-only (so not jamesofur) [22:50:10] this included the rights to sudo as www-data, apache, mwdeploy, l10nupdate [22:50:38] which is probably needed to get the password to the databases [22:51:10] the rights were being inherited by all users there sure, but it didn't remove the old in the cases I saw is why I was thinking no change [22:51:23] k [22:51:33] Oh, yeah [22:51:41] while I've certainly sudo'd as apache before (I can't 
remember what for actually... it was a while ago) I don't generally need it for sql access or maintenance scripts. Unless things like the sql alias or mwscript do that automatically of course [22:51:55] the sql command uses mwscript to pull the right database hostname for the given wiki [22:52:02] ah [22:52:11] yeah, does mwscript do apache automatically? [22:52:16] yes [22:52:23] * jamesofur will be sad if no more maintenance scripts :( [22:52:24] (03PS1) 10Andrew Bogott: Further attempts to disable snapshotting in horizon [puppet] - 10https://gerrit.wikimedia.org/r/208018 [22:52:29] that will make the election annoying later [22:52:35] sudo -u "$MEDIAWIKI_WEB_USER" php "$MEDIAWIKI_DEPLOYMENT_DIR_DIR_USE/multiversion/MWScript.php" "$@" [22:52:36] etc. [22:52:37] PROBLEM - RAID on snapshot1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:52:41] ahh [22:53:07] I'm guessing this was kind of implicit before and just "worked" because of the murky perms [22:53:09] (that's probably why i used sudo -u apache at some point in the past, when it wasn't in that script or someone teaching me wasn't sure if it was) [22:53:17] what group should this actually be managed by? [22:53:29] chasemp, it should work for deployers and restricted users [22:53:39] I think basically everyone with access to terbium, so deployers and restricted [22:53:43] yeah, what Krenair said [22:53:43] restricted is specifically granted access to terbium :) [22:53:45] (03CR) 10Andrew Bogott: [C: 032] Further attempts to disable snapshotting in horizon [puppet] - 10https://gerrit.wikimedia.org/r/208018 (owner: 10Andrew Bogott) [22:54:16] RECOVERY - RAID on snapshot1004 is OK no RAID installed [22:55:04] this should cover it then? [22:55:04] 'ALL = (www-data,apache,mwdeploy,l10nupdate) NOPASSWD: ALL' [22:55:14] I think so [22:55:57] lgtm, the other stuff was for restarting servers I don't think they can even access [22:57:05] jamesofur: try again? 
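The `sudo -u "$MEDIAWIKI_WEB_USER"` line chasemp quotes is the whole mechanism: maintenance wrappers escalate to the web user so scripts read the same config (and DB credentials) as Apache, which is why restricted users need that sudoers entry. A dry-run sketch of such a wrapper — it echoes instead of exec'ing, and the paths and defaults are illustrative, not the production values:

```shell
#!/bin/sh
# mwscript (sketch): run a MediaWiki maintenance script as the web user.
# This version only PRINTS the command it would run, so it is safe to try
# anywhere. Defaults below are illustrative assumptions.
MEDIAWIKI_WEB_USER="${MEDIAWIKI_WEB_USER:-www-data}"
MEDIAWIKI_DEPLOYMENT_DIR="${MEDIAWIKI_DEPLOYMENT_DIR:-/srv/mediawiki}"

mwscript() {
  echo sudo -u "$MEDIAWIKI_WEB_USER" \
    php "$MEDIAWIKI_DEPLOYMENT_DIR/multiversion/MWScript.php" "$@"
}

mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=hywiki wikilove
```

Swap the `echo` for `exec` to get the real behavior; without the sudo step, a maintenance script run as a normal user would lack the web user's credentials.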
[22:57:27] chasemp: thumbs up [22:57:29] 'evening [22:57:30] works, thank you [22:57:37] ok I will put in the proper fix then [22:57:48] hi Dereckson [22:57:55] Krenair: do they need "mwdeploy,l10nupdate" as well? [22:58:00] Probably not. [22:58:03] agreed ok [22:58:21] Actually those two probably should actually be restricted to deployers [22:58:31] right [22:58:43] which we can do now :) [22:59:07] (03CR) 10Dereckson: "Translation done." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207662 (https://phabricator.wikimedia.org/T97563) (owner: 10Dereckson) [22:59:50] <^d> go team! [23:00:04] RoanKattouw, ^d, rmoen, Dereckson, Kaldari, RoanKattouw, legoktm: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150430T2300). [23:00:10] o/ [23:00:29] Alright [23:00:38] Dereckson: Your config patch first again? [23:00:46] kaldari: You there for your SWTA? [23:00:46] <^d> Ah ok guessing it was you since you're on the list [23:00:56] I'm here [23:01:03] That's fine for me. [23:01:17] (03CR) 10Catrope: [C: 032] Enable WikiLove on hy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207662 (https://phabricator.wikimedia.org/T97563) (owner: 10Dereckson) [23:01:23] (03Merged) 10jenkins-bot: Enable WikiLove on hy.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207662 (https://phabricator.wikimedia.org/T97563) (owner: 10Dereckson) [23:01:42] RoanKattouw: Are you doing the SWAT today? [23:01:51] Yup [23:01:55] RoanKattouw, that patch needs schema changes doesn't it? [23:02:15] (03PS1) 10Rush: Admin: restricted folks need to sudo as apache [puppet] - 10https://gerrit.wikimedia.org/r/208020 [23:02:27] PROBLEM - puppet last run on snapshot1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:02:29] RoanKattouw: If you want me to create submodule updates for my patches, just let me know. 
[23:02:36] kaldari: Please do [23:02:39] Krenair: Which one? [23:02:41] sure [23:02:55] RoanKattouw, Dereckson's config change [23:03:00] I suppose so [23:03:02] (03CR) 10Alex Monk: [C: 031] Admin: restricted folks need to sudo as apache [puppet] - 10https://gerrit.wikimedia.org/r/208020 (owner: 10Rush) [23:03:05] Crap [23:03:09] mwscript hywiki createExtensionTables.php wikilove? [23:03:09] Checked on https://www.mediawiki.org/wiki/Extension:WikiLove#Installation indeed [23:03:10] Thanks for bringing that up [23:03:18] !log catrope Synchronized wmf-config/InitialiseSettings.php: Enable WikiLove on hywiki (duration: 00m 49s) [23:03:23] Logged the message, Master [23:03:30] oh, extensions/WikimediaMaintenance/createExtensionTables.php probably [23:03:46] (03CR) 10Rush: [C: 032] Admin: restricted folks need to sudo as apache [puppet] - 10https://gerrit.wikimedia.org/r/208020 (owner: 10Rush) [23:04:05] Yeah [23:04:08] catrope@tin:/srv/mediawiki-staging$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=hywiki wikilove [23:04:17] Without --wiki= it thinks hywiki is the extension name [23:04:22] !log Created wikilove tables on hywiki [23:04:27] Logged the message, Mr. Obvious [23:04:31] Testing. [23:04:35] sigh [23:04:44] (03PS4) 10Paladox: Change setting name from $wmincClosedWikis to $wgwmincClosedWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207909 [23:05:24] of course it does [23:05:47] there's a getArg( 0 ) call in there IIRC [23:05:58] Works. [23:06:10] thanks for the heads up ^d and review Krenair [23:06:22] fyi maybe more ppl who got more perms than they should turn up [23:06:53] (03CR) 10Catrope: [C: 032] Enable Browse experiment on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206375 (https://phabricator.wikimedia.org/T94739) (owner: 10Phuedx) [23:08:45] RoanKattouw: waiting for the 2 patches to merge, then will build the submodule updates.
Hopefully it won't take forever :) [23:09:09] OK [23:09:12] Yeah Jenkins is slow :( [23:09:13] Achievement unlocked: give a barnstar on a wiki with a non-latin alphabet. [23:09:18] Thanks RoanKattouw for the deploy. [23:09:33] Thanks Dereckson, always a pleasure working with you [23:09:34] our tests are slow [23:09:34] is this shorturl deployment? [23:09:39] Dereckson: Level Up! :) [23:09:42] And thanks to Krenair for noticing that I needed to create tables [23:09:42] (03Merged) 10jenkins-bot: Enable Browse experiment on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206375 (https://phabricator.wikimedia.org/T94739) (owner: 10Phuedx) [23:09:46] YuviKTM: Nope [23:09:53] ah [23:10:02] WikiLove [23:10:10] yeah Krenair, I didn't imagined such extension could have a table ^^ [23:10:40] Will doublecheck that everytime in the future. [23:10:56] :D [23:11:04] there’s a shorturl deployment being pushed through as well... [23:11:44] Where? [23:11:46] It's not on my SWAT list [23:11:48] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1250990 (10GWicke) [23:12:43] RoanKattouw: not today, I think. [23:12:52] RoanKattouw: it needed a maint run over the entire page table [23:12:57] so got pushed to next week on its own window [23:12:58] RoanKattouw: First submodule update (wmf4) is here: https://gerrit.wikimedia.org/r/#/c/208022/ [23:13:06] 6operations, 10Wikimedia-Logstash: Elasticsearch not starting on Jessie hosts - https://phabricator.wikimedia.org/T97645#1250993 (10bd808) >>! In T97645#1250739, @Manybubbles wrote: > Nice! > > 1.5.2 is the most current version of Elasticsearch and 1.6.X is the has some compelling work to make rolling restart... [23:13:57] (03CR) 10Ori.livneh: [C: 032] "announced by Nemo_bis on itwiki (thanks!) 
https://it.wikipedia.org/wiki/Wikipedia:Bar/2015_05_1#Piccola_velocizzazione_dei_JavaScript_stan" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207998 (owner: 10Ori.livneh) [23:14:29] ori: Dude [23:14:33] ori: It's the middle of a SWAT window [23:14:51] RoanKattouw: ack, my bad [23:15:08] do you have config changes to push? if not, just let it sit unmerged until after your window and i'll take care of it [23:15:12] Also, out of curiosity, what does wmgUseBits=false do? [23:15:21] I don't [23:15:22] So that's fine [23:15:41] makes the site not use bits for static assets [23:16:07] OK... [23:16:09] (03PS1) 10Dduvall: ci: Role for running Raita [puppet] - 10https://gerrit.wikimedia.org/r/208024 [23:16:16] I guess that makes some amount of sense [23:16:23] RoanKattouw: Second submodule update (wmf3) is here: https://gerrit.wikimedia.org/r/#/c/208023/ . I'll add them to the deploy schedule in place of the extension patches. [23:16:35] bits was useful in some sense when we shared lots of assets across wikis [23:16:38] But RL reduced that [23:17:05] And it's really only beneficial if lots of people visit multiple wikis [23:17:14] (03PS1) 10Andrew Bogott: Further attempts to fix the Horizon splash [puppet] - 10https://gerrit.wikimedia.org/r/208025 [23:17:46] RoanKattouw: with HTTPS-everywhere on the horizon and SPDY support, the cost of having to establish a connection and perform TLS handshake with another domain vs.
simply multiplexing additional requests over an already-open connection makes the purported benefit of bits quite dubious [23:17:46] (03CR) 10Andrew Bogott: [C: 032] Further attempts to fix the Horizon splash [puppet] - 10https://gerrit.wikimedia.org/r/208025 (owner: 10Andrew Bogott) [23:17:55] Yeah I agree [23:18:06] And most bits URLs look like /en.wikipedia.org/load.php anyway [23:18:14] (When bits was first introduced, that wasn't true) [23:18:37] yeah [23:18:46] mediawiki.org was switched earlier today [23:20:19] (03PS2) 10Dduvall: ci: Role for running Raita [puppet] - 10https://gerrit.wikimedia.org/r/208024 [23:20:43] (03Merged) 10jenkins-bot: wmgUseBits: false for itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207998 (owner: 10Ori.livneh) [23:23:20] (03CR) 10Mobrovac: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/207378 (owner: 10KartikMistry) [23:25:17] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 4 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [23:27:07] RECOVERY - puppet last run on snapshot1004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:28:39] kaldari: MobileFrontend in flight [23:29:07] RoanKattouw: ready to test when it's on test.wiki [23:29:20] It's slow :( [23:29:26] !log catrope Synchronized php-1.26wmf3/extensions/MobileFrontend/: SWAT (duration: 01m 43s) [23:29:34] Logged the message, Master [23:31:43] RoanKattouw: Seems to be working on en.wiki (since it looks like that was synced first) [23:32:10] RoanKattouw: I guess you like to live dangerously :) [23:32:19] !log EventLogging events logged client-side appear not to be making it to eventlog1001.eqiad.wmnet; Ori investigating.
[23:32:24] Logged the message, Master [23:33:20] !log catrope Synchronized php-1.26wmf3/extensions/CentralAuth/: SWAT (duration: 00m 31s) [23:33:27] Logged the message, Master [23:33:38] legoPanda: ----^^ [23:35:16] !log catrope Synchronized php-1.26wmf4/includes/skins/SkinTemplate.php: Add mw-content-ltr/rtl for missing pages (duration: 00m 35s) [23:35:21] Logged the message, Master [23:38:38] !log catrope Synchronized php-1.26wmf4/extensions/MobileFrontend: SWAT (duration: 00m 58s) [23:38:46] Logged the message, Master [23:39:00] (03PS1) 10Ori.livneh: Follow-up to I2af6fb3d7: fix regex escapes in role::logging::eventlistener [puppet] - 10https://gerrit.wikimedia.org/r/208030 [23:39:16] (03CR) 10Ori.livneh: [C: 032 V: 032] Follow-up to I2af6fb3d7: fix regex escapes in role::logging::eventlistener [puppet] - 10https://gerrit.wikimedia.org/r/208030 (owner: 10Ori.livneh) [23:39:29] !log catrope Synchronized php-1.26wmf4/extensions/Flow: SWAT (duration: 00m 51s) [23:39:35] Logged the message, Master [23:39:44] !log catrope Synchronized php-1.26wmf4/extensions/CentralAuth: SWAT (duration: 00m 15s) [23:39:49] Logged the message, Master [23:41:12] matt_flaschen: Your Flow patches were just SWATted [23:41:19] legoPanda: And your CentralAuth patches are done [23:41:27] kaldari: And your MobileFrontend patch for wmf4 [23:41:32] RoanKattouw: thanks! [23:41:49] checking... [23:43:40] RoanKattouw: Looks great. Thanks! [23:49:15] 6operations, 6WMF-Legal, 10Wikimedia-General-or-Unknown: dbtree loads third party resources - https://phabricator.wikimedia.org/T96499#1251080 (10Krinkle) Please don't use bits.wikimedia.org outside a MediaWiki context. Just add a distribution file to the repo that provides this dashboard. E.g. from `{docroo... [23:49:22] RoanKattouw: https://gerrit.wikimedia.org/r/208036 and https://gerrit.wikimedia.org/r/208037 [23:49:49] legoPanda: OK, will deploy. Can you add that to the SWAT wiki page? 
[23:50:00] sure :P [23:50:37] 6operations, 10Architecture, 10MediaWiki-RfCs, 10RESTBase, and 4 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1251090 (10tstarling) >>! In T97204#1249241, @GWicke wrote: >> Do you have a model for a "hanging backend"? > > I don't have a comprehensive model, but can... [23:50:53] added [23:51:24] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1251095 (10GWicke) Here is a sample system from Dell: {F159098} - 2.5" disk 1U hot-swap chassis - 12-core processors - 64G RAM - 10G ethernet - no disks (add $1200 for 3x Samsung 850 Ev... [23:57:47] RoanKattouw: updates merged :) [23:59:02] !log catrope Synchronized php-1.26wmf4/extensions/PageTriage/: SWAT (duration: 00m 30s) [23:59:10] Logged the message, Master