[00:11:32] RECOVERY - uWSGI web apps on graphite1001 is OK: OK: All defined uWSGI apps are running.
[00:29:57] ops-codfw, operations: rack and initial configuration of wtp2001-2020 - https://phabricator.wikimedia.org/T86807#1006559 (Papaul) Racktable updated.
[01:02:26] operations, Deployment-Systems: [Trebuchet] Salt times out on parsoid restarts - https://phabricator.wikimedia.org/T63882#1006561 (yuvipanda) So I guess the 'fix' is to upgrade to a newer version of salt?
[01:05:22] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[01:05:51] PROBLEM - RAID on analytics1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:08:06] (PS1) Yuvipanda: toollabs: Make webservice2 send jobs to trusty by default [puppet] - https://gerrit.wikimedia.org/r/187940 (https://phabricator.wikimedia.org/T88228)
[01:12:43] RECOVERY - RAID on analytics1004 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[01:18:12] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[01:20:02] PROBLEM - RAID on analytics1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:32:42] RECOVERY - RAID on analytics1004 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[01:38:34] PROBLEM - RAID on analytics1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:43:18] RECOVERY - RAID on analytics1004 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[01:46:46] Is Gerrit broken?
[01:47:23] hmm... no
[01:47:28] must have been my config
[01:52:01] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[02:10:53] !log l10nupdate Synchronized php-1.25wmf14/cache/l10n: (no message) (duration: 00m 02s)
[02:11:07] Logged the message, Master
[02:12:01] !log LocalisationUpdate completed (1.25wmf14) at 2015-02-01 02:10:58+00:00
[02:12:05] Logged the message, Master
[02:20:20] !log l10nupdate Synchronized php-1.25wmf15/cache/l10n: (no message) (duration: 00m 01s)
[02:20:24] Logged the message, Master
[02:21:27] !log LocalisationUpdate completed (1.25wmf15) at 2015-02-01 02:20:24+00:00
[02:21:31] Logged the message, Master
[02:24:31] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[02:32:03] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:33:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[02:34:07] legoktm: https://gerrit.wikimedia.org/r/#/c/187940/
[02:34:08] :D
[02:34:19] (PS2) Yuvipanda: toollabs: Make webservice2 send jobs to trusty by default [puppet] - https://gerrit.wikimedia.org/r/187940 (https://phabricator.wikimedia.org/T88228)
[02:34:51] (CR) Yuvipanda: [C: 2] toollabs: Make webservice2 send jobs to trusty by default [puppet] - https://gerrit.wikimedia.org/r/187940 (https://phabricator.wikimedia.org/T88228) (owner: Yuvipanda)
[02:39:59] nice!
[02:42:09] legoktm: I’m going to move all of magnus’ tools now
[02:46:11] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[02:49:12] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[03:04:00] (PS1) Yuvipanda: tools: Move hiera data into ops/puppet repo [puppet] - https://gerrit.wikimedia.org/r/187944
[03:04:28] (PS2) Yuvipanda: tools: Move hiera data into ops/puppet repo [puppet] - https://gerrit.wikimedia.org/r/187944
[03:05:00] (CR) Yuvipanda: [C: 2 V: 2] tools: Move hiera data into ops/puppet repo [puppet] - https://gerrit.wikimedia.org/r/187944 (owner: Yuvipanda)
[03:33:31] PROBLEM - puppet last run on mw1053 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:33:41] PROBLEM - puppet last run on mw1163 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:34:02] PROBLEM - puppet last run on mw1248 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:34:12] PROBLEM - puppet last run on mw1032 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:34:12] PROBLEM - puppet last run on mw1239 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:34:31] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:34:52] PROBLEM - puppet last run on mw1201 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:35:02] PROBLEM - puppet last run on mw1167 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:35:11] PROBLEM - puppet last run on mw1121 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:35:22] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 2 failures
[03:36:11] PROBLEM - puppet last run on amssq38 is CRITICAL: CRITICAL: Puppet has 2 failures
[03:38:22] (PS1) Yuvipanda: tools: Add support to bigbrother to just say webservice2 [puppet] - https://gerrit.wikimedia.org/r/187949
[03:38:35] (PS2) Yuvipanda: tools: Add support to bigbrother to just say webservice2 [puppet] - https://gerrit.wikimedia.org/r/187949
[03:51:42] RECOVERY - puppet last run on mw1248 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[03:51:52] RECOVERY - puppet last run on mw1032 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[03:51:52] RECOVERY - puppet last run on mw1239 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[03:52:12] RECOVERY - puppet last run on db1001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[03:52:12] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[03:52:32] RECOVERY - puppet last run on mw1201 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[03:52:51] PROBLEM - puppet last run on elastic1015 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:52:51] RECOVERY - puppet last run on mw1121 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[03:53:12] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[03:53:42] PROBLEM - puppet last run on acamar is CRITICAL: CRITICAL: Puppet has 1 failures
[03:53:51] RECOVERY - puppet last run on mw1167 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[03:54:01] RECOVERY - puppet last run on amssq38 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[04:09:32] RECOVERY - puppet last run on elastic1015 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[04:11:11] RECOVERY - puppet last run on mw1163 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[04:11:31] RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[04:12:33] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Feb 1 04:11:30 UTC 2015 (duration 11m 29s)
[04:12:37] Logged the message, Master
[06:28:22] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:42] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:52] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:42] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:42] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:52] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:45:32] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:46:02] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:46:32] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:46:33] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:47:32] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:47:32] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[08:14:57] Beta-Cluster, operations: Convert git-sync-upstream script from bash to python - https://phabricator.wikimedia.org/T88238#1006691 (yuvipanda) NEW a: yuvipanda
[09:03:02] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0]
[09:06:52] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[09:35:31] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[10:41:31] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[11:28:44] PROBLEM - puppet last run on nembus is CRITICAL: CRITICAL: puppet fail
[11:33:10] Beta-Cluster, operations: Convert git-sync-upstream script from bash to python - https://phabricator.wikimedia.org/T88238#1006743 (hashar) What is git-sync-upstream? Where is it?
[11:38:51] operations, Deployment-Systems: [Trebuchet] Salt times out on parsoid restarts - https://phabricator.wikimedia.org/T63882#1006753 (hashar) >>! In T63882#1006561, @yuvipanda wrote: > So I guess the 'fix' is to upgrade to a newer version of salt? As per my earlier comment yes. Fix included in v2014.7
[11:47:32] RECOVERY - puppet last run on nembus is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[17:14:21] PROBLEM - Apache HTTP on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:14:22] PROBLEM - HHVM rendering on mw1207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:19:53] PROBLEM - HHVM busy threads on mw1207 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [115.2]
[17:20:22] PROBLEM - HHVM queue size on mw1207 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [80.0]
[18:00:12] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.01
[18:10:21] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0
[18:25:32] PROBLEM - Varnishkafka Delivery Errors per minute on cp3015 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [20000.0]
[18:31:02] PROBLEM - Varnishkafka Delivery Errors per minute on cp3016 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [20000.0]
[18:35:01] RECOVERY - Varnishkafka Delivery Errors per minute on cp3015 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:38:22] RECOVERY - Varnishkafka Delivery Errors per minute on cp3016 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:49:42] PROBLEM - Varnishkafka Delivery Errors per minute on cp3016 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [20000.0]
[18:56:52] RECOVERY - Varnishkafka Delivery Errors per minute on cp3016 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:31:29] Beta-Cluster, operations: Convert git-sync-upstream script from bash to python - https://phabricator.wikimedia.org/T88238#1007035 (mmodell) @hashar it's in operations/puppet: [[ https://git.wikimedia.org/blob/operations%2Fpuppet/f5b70fc0da2c3247d5fe287306eda0df2560d316/modules%2Fpuppetmaster%2Ftemplates%2Fgit...
[19:40:59] :)
[19:41:00] set -e
[19:41:00] #set -x
[19:45:47] Beta-Cluster: deployment-mx does not have salt master set to deployment-salt - https://phabricator.wikimedia.org/T87849#1007039 (hashar) I have updated deployment-mx instance configuration variables: ``` deployment_server_override salt_master_finger_override salt_master_override ``` Accepted the key on de...
[19:46:01] Beta-Cluster: deployment-mx does not have salt master set to deployment-salt - https://phabricator.wikimedia.org/T87849#1007040 (hashar) Open>Resolved a: hashar
[20:50:02] Beta-Cluster, operations: Convert git-sync-upstream script from bash to python - https://phabricator.wikimedia.org/T88238#1007074 (Joe) I don't agree at all. The script is extremely simple and is just doing some very simple operations with git. I don't see why rewriting this in python would give us any advant...
[21:58:42] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[22:11:13] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[22:12:13] Beta-Cluster, operations: Convert git-sync-upstream script from bash to python - https://phabricator.wikimedia.org/T88238#1007133 (greg) (Also, why is this a Beta Cluster project bug? What am I missing/not understanding?)
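[Editor's note: the T88238 thread above debates rewriting git-sync-upstream from bash to python; the motivation given later in the thread is instrumentation, i.e. reporting how many cherry-picked commits the puppetmaster carries to graphite. A minimal sketch of that reporting idea, in which the repo path, branch names, metric name, and graphite host are all illustrative assumptions, not the actual script:]

```shell
# Hypothetical sketch only -- paths, branches, and metric names are assumptions.

# Count commits on the local branch that the upstream branch does not have
# (i.e. the locally cherry-picked patches kept on top of upstream).
cherry_pick_count() {
    git -C "${1:-/var/lib/git/operations/puppet}" \
        rev-list --count origin/production..HEAD
}

# Format one datapoint in graphite's plaintext protocol:
# "<metric> <value> <unix-timestamp>"
graphite_line() {
    printf '%s %s %s\n' "$1" "$2" "${3:-$(date +%s)}"
}

# Usage (on the puppetmaster; graphite's plaintext listener is TCP 2003):
#   graphite_line deployment.puppet.cherry_picks "$(cherry_pick_count)" \
#       | nc -q1 graphite1001 2003
```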
[22:13:47] (CR) Ori.livneh: [C: 1] graphite: explicit install python-twisted-core [puppet] - https://gerrit.wikimedia.org/r/187683 (https://phabricator.wikimedia.org/T85909) (owner: Filippo Giunchedi)
[23:24:21] (PS1) Ori.livneh: ve: make profiles a little less noisy with additional CLI options [puppet] - https://gerrit.wikimedia.org/r/187996
[23:31:16] (CR) Ori.livneh: [C: 2] ve: make profiles a little less noisy with additional CLI options [puppet] - https://gerrit.wikimedia.org/r/187996 (owner: Ori.livneh)
[23:34:22] PROBLEM - Ori committing changes on the weekend on palladium is CRITICAL: CRITICAL: Ori committed a change on a weekend
[23:34:32] (sorry)
[23:35:53] ACKNOWLEDGEMENT - Ori committing changes on the weekend on palladium is CRITICAL: CRITICAL: Ori committed a change on a weekend ori.livneh applied a non-risky change to osmium
[23:46:20] Beta-Cluster, operations: Convert git-sync-upstream script from bash to python - https://phabricator.wikimedia.org/T88238#1007175 (yuvipanda) @greg it's a bc bug because of the task it blocks - I want to move it to python so we can send stats to graphite about how many cherry-picked commits are there, and hav...
[23:55:25] (PS1) Ori.livneh: vbench: don't disable sandbox or localstorage; clear cookies between runs [puppet] - https://gerrit.wikimedia.org/r/187998
[23:55:31] YuviPanda: ^ if you're still up for it
[23:56:42] yeah
[23:56:51] * YuviPanda meatpuppets
[23:57:01] ori: I’ve ‘woken up’ for the day.
[23:57:20] thanks
[23:57:58] (CR) Yuvipanda: [C: 2] vbench: don't disable sandbox or localstorage; clear cookies between runs [puppet] - https://gerrit.wikimedia.org/r/187998 (owner: Ori.livneh)
[23:58:02] puppet merging in a min
[23:58:10] or whenever palladium lets me in
[23:58:10] thanks
[23:58:16] i can do that
[23:58:29] actually, i guess it better be you
[23:58:48] yeah, done
[23:59:02] i am going to restart the parsoid cluster .. it has had high load since y'day .. looks like some set of pages or events caused memory leaks on most nodes around the same time which spiked the load (likely because those processes are GC-ing heavily).
[23:59:03] thank you
[23:59:23] i've copied parsoid logs from 4 nodes onto bast for further investigation.
[23:59:38] subbu: cool. you guys have root on the parsoid nodes now, right?
[23:59:40] i mean: restart the parsoid service on the parsoid cluster, to be specific.
[23:59:48] YuviPanda, i am going to use dsh from bast1001
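[Editor's note: the restart subbu describes above — fanning the parsoid service restart out from the bastion with dsh — could look roughly like the sketch below. The dsh group name and the service command are assumptions; the function only prints the command so it can be reviewed before running. In dsh, -g selects a host group, -M prefixes each output line with the host name, and -c runs the hosts concurrently.]

```shell
# Hypothetical sketch; group name, flags, and service command are assumptions.
parsoid_restart_cmd() {
    echo "dsh -g ${1:-parsoid} -M -c -- sudo service parsoid restart"
}

# Print (not execute) the fan-out command for review:
parsoid_restart_cmd
```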