[00:00:54] (PS1) Ori.livneh: Import chromium module from mediawiki-vagrant [puppet] - https://gerrit.wikimedia.org/r/186614
[00:52:59] PROBLEM - puppet last run on acamar is CRITICAL: CRITICAL: puppet fail
[01:11:49] RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[01:33:25] (CR) Reedy: "I saw it on production last night when we were looking at the profiler issue" [mediawiki-config] - https://gerrit.wikimedia.org/r/186392 (owner: Reedy)
[01:39:24] (PS1) Reedy: Use / for regex delimiting [mediawiki-config] - https://gerrit.wikimedia.org/r/186616
[01:39:38] (CR) Reedy: "Done in https://gerrit.wikimedia.org/r/186616" [mediawiki-config] - https://gerrit.wikimedia.org/r/186392 (owner: Reedy)
[02:10:26] !log l10nupdate Synchronized php-1.25wmf14/cache/l10n: (no message) (duration: 00m 02s)
[02:10:30] !log LocalisationUpdate completed (1.25wmf14) at 2015-01-25 02:10:30+00:00
[02:10:38] Logged the message, Master
[02:10:44] Logged the message, Master
[02:22:55] !log l10nupdate Synchronized php-1.25wmf15/cache/l10n: (no message) (duration: 00m 02s)
[02:22:59] !log LocalisationUpdate completed (1.25wmf15) at 2015-01-25 02:22:58+00:00
[02:23:06] Logged the message, Master
[02:23:10] Logged the message, Master
[03:33:39] PROBLEM - puppet last run on mw1188 is CRITICAL: CRITICAL: Puppet has 2 failures
[03:33:49] PROBLEM - puppet last run on mw1171 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:33:59] PROBLEM - puppet last run on wtp1013 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:33:59] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:34:08] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:34:50] PROBLEM - puppet last run on mw1032 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:41:09] PROBLEM - puppet last run on lvs4001 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:42:09] PROBLEM - puppet last run on mw1127 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:48:29] PROBLEM - puppet last run on mw1217 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:51:19] RECOVERY - puppet last run on mw1188 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[03:51:28] RECOVERY - puppet last run on mw1171 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[03:51:38] RECOVERY - puppet last run on wtp1013 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[03:51:39] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[03:51:39] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[03:51:58] PROBLEM - puppet last run on amssq48 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:52:19] RECOVERY - puppet last run on mw1032 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[03:58:39] RECOVERY - puppet last run on lvs4001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[03:59:49] RECOVERY - puppet last run on mw1127 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[04:04:58] RECOVERY - puppet last run on mw1217 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[04:08:29] RECOVERY - puppet last run on amssq48 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[04:08:47] Analytics, operations, ops-core: Deprecate HTTPS udp2log stream? - https://phabricator.wikimedia.org/T86656#993537 (Ironholds) Seems sensible. (Out of interest, as part of the ticket-that-must-not-be-named, do we want to also stop generating the sampled logs altogether? Are we using them for anything now?)
[04:14:33] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Jan 25 04:14:33 UTC 2015 (duration 14m 32s)
[04:14:41] Logged the message, Master
[04:38:41] (PS1) PleaseStand: Update logo URL for nostalgiawiki to point to Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/186617
[04:41:51] operations: Kill network.pp - https://phabricator.wikimedia.org/T87519#993541 (mark)
[04:42:13] (CR) Hoo man: "I've restored https://meta.wikimedia.org/wiki/File:Wiki_orig_logo.png for now." [mediawiki-config] - https://gerrit.wikimedia.org/r/186617 (owner: PleaseStand)
[05:59:09] (Abandoned) BryanDavis: Fix regex pattern delimiters [mediawiki-config] - https://gerrit.wikimedia.org/r/186607 (owner: BryanDavis)
[06:28:38] (PS2) Ori.livneh: Import chromium module from mediawiki-vagrant [puppet] - https://gerrit.wikimedia.org/r/186614
[06:28:40] (PS1) Ori.livneh: Add role::ve::test_rig [puppet] - https://gerrit.wikimedia.org/r/186620
[06:29:08] YuviPanda: ^
[06:29:22] (I am pinging you while staring at your phone to see if you get notified :P)
[06:29:37] lol
[06:29:49] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:49] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:50] (apparently not)
[06:30:08] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:03] ori: haha
[06:32:32] ori: I was immediately notified yeah. Just didn't open phone and see
[06:35:59] PROBLEM - puppet last run on db2028 is CRITICAL: CRITICAL: puppet fail
[06:46:19] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:46:19] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:28] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:53:49] RECOVERY - puppet last run on db2028 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[07:25:39] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out.
[07:26:39] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 1, timed_out: False, active_primary_shards: 43, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 127, initializing_shards: 1, number_of_data_nodes: 3
[08:09:53] (PS1) Tim Landscheidt: Tools: Properly puppetize crontab replacement [puppet] - https://gerrit.wikimedia.org/r/186627 (https://phabricator.wikimedia.org/T86445)
[08:33:30] (CR) Tim Landscheidt: [C: -1] "On Toolsbeta, this leads to:" [puppet] - https://gerrit.wikimedia.org/r/186627 (https://phabricator.wikimedia.org/T86445) (owner: Tim Landscheidt)
[08:45:58] PROBLEM - puppet last run on cp1070 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:02:29] RECOVERY - puppet last run on cp1070 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[12:47:48] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: puppet fail
[13:06:39] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[13:33:45] Analytics, operations, ops-core: Deprecate HTTPS udp2log stream? - https://phabricator.wikimedia.org/T86656#993718 (ezachte) The sampled logs are used for about 15 monthly and quarterly reports, for which replacement is still in the 'someday, somehow by someone' phase. http://stats.wikimedia.org/wikimedia/s...
[17:19:00] (PS3) QChris: Add logs from 'misc' caches to kafka pipeline [puppet] - https://gerrit.wikimedia.org/r/184183
[17:19:02] (PS1) QChris: Re-enable varnishkafka for bits again [puppet] - https://gerrit.wikimedia.org/r/186641
[17:24:05] (CR) QChris: Add logs from 'misc' caches to kafka pipeline (1 comment) [puppet] - https://gerrit.wikimedia.org/r/184183 (owner: QChris)
[18:38:48] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: puppet fail
[18:43:55] !log trimmed Logstash redis input queues to 0 events; dropped ~4M backlogged events
[18:44:05] Logged the message, Master
[18:48:39] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[18:50:48] memcached-serious log has a lot of `Memcached error for key "..." on server "/var/run/nutcracker/nutcracker.sock:0": SYSTEM ERROR` messages for mw1118
[18:52:27] the nutcracker process looks to be running and the socket exists but 4K of the last 5K messages in that log are about it being broken
[18:54:28] PROBLEM - Varnishkafka Delivery Errors per minute on cp3015 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [20000.0]
[18:55:44] !log high rate of nutcracker "SYSTEM ERROR" errors on mw1118
[18:55:51] Logged the message, Master
[18:58:48] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:00:28] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[19:06:09] RECOVERY - Varnishkafka Delivery Errors per minute on cp3015 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:09:20] Analytics, operations, ops-core: Deprecate HTTPS udp2log stream? - https://phabricator.wikimedia.org/T86656#993854 (faidon) If you're talking about 1:1000 sampled text logs, these are immensely useful for day to day operations. But let's keep this on-topic, we can discuss this further in a separate task if yo...
[20:19:14] Analytics, operations, ops-core: Deprecate HTTPS udp2log stream? - https://phabricator.wikimedia.org/T86656#993857 (Ironholds) Yep; Erik's answer is enough. My comment was merely an aside.
[20:20:29] PROBLEM - Varnishkafka Delivery Errors per minute on cp3015 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [20000.0]
[20:26:18] RECOVERY - Varnishkafka Delivery Errors per minute on cp3015 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:40:29] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00666666666667
[20:45:39] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0
[20:50:09] !log depooled mw1118 while investigating T85428
[20:50:13] ^ _joe_ fyi
[20:50:15] Logged the message, Master
[23:27:02] (PS4) KartikMistry: cxserver: Add Yandex support [puppet] - https://gerrit.wikimedia.org/r/186538
[23:28:04] (CR) KartikMistry: [C: -1] ".. shouldn't merge until tested enough in Beta though." [puppet] - https://gerrit.wikimedia.org/r/186538 (owner: KartikMistry)
[23:28:35] (PS4) KartikMistry: Use cxserver/deploy in deployment [puppet] - https://gerrit.wikimedia.org/r/184217