[00:02:45] Like outright load as if unresolvable domain
[00:08:01] (PS1) Catrope: Fix notification icon path for foundationwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/319967
[00:16:57] (PS1) Catrope: Make notification logos high-density [mediawiki-config] - https://gerrit.wikimedia.org/r/319968 (https://phabricator.wikimedia.org/T147219)
[00:26:26] PROBLEM - puppet last run on elastic1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:45:26] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:54:26] RECOVERY - puppet last run on elastic1020 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[00:58:36] Operations: update-ca-certificates, run via puppets sslcert module, doesn't update symlinks to replaced certificates - https://phabricator.wikimedia.org/T150058#2773428 (AlexMonk-WMF) @akosiaris, this cert changed because we changed puppetmasters to a different host. I'm vaguely aware of paladium being retir...
[01:34:46] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 1812.519114 Seconds
[01:35:06] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 1835.61616 Seconds
[01:35:46] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 20.572239 Seconds
[01:36:06] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 43.520286 Seconds
[02:06:16] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:17:01] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.1) (duration: 05m 25s)
[02:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:21:37] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Nov 5 02:21:37 UTC 2016 (duration 4m 36s)
[02:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:25:38] Operations, Monitoring: Huge log files on icinga machines - https://phabricator.wikimedia.org/T150061#2773501 (Dzahn) > 9.2G account Do we really want process accounting on? root@einsteinium:/var/log/account# lastcomm | head icinga F icinga __ 0.00 secs Sat Nov 5 02:21 check...
[02:34:16] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[03:23:36] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 702.76 seconds
[03:34:36] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 297.72 seconds
[03:37:26] PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:00:16] RECOVERY - Last backup of the maps filesystem on labstore1001 is OK: OK - Last run for unit replicate-maps was successful
[04:04:16] PROBLEM - Last backup of the maps filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-maps was exit-code
[04:05:26] RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[04:16:10] (CR) Chad: "Awww eiximenis, I miss you...." [labs/private] - https://gerrit.wikimedia.org/r/319924 (owner: Dzahn)
[04:22:26] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:50:26] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[05:11:36] PROBLEM - HHVM rendering on mw1294 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.005 second response time
[05:12:36] PROBLEM - puppet last run on mw1243 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:12:37] RECOVERY - HHVM rendering on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 71789 bytes in 1.611 second response time
[05:25:20] (CR) Alex Monk: "what about praseodymium?" [labs/private] - https://gerrit.wikimedia.org/r/319924 (owner: Dzahn)
[05:35:26] PROBLEM - puppet last run on ms-be1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:41:36] RECOVERY - puppet last run on mw1243 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[05:54:26] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:04:26] RECOVERY - puppet last run on ms-be1011 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:06:26] PROBLEM - Disk space on logstash1003 is CRITICAL: DISK CRITICAL - free space: / 1724 MB (3% inode=96%)
[06:23:26] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:27:26] RECOVERY - Disk space on logstash1003 is OK: DISK OK
[06:32:16] PROBLEM - Disk space on logstash1002 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%)
[06:32:26] PROBLEM - Disk space on logstash1003 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%)
[06:33:06] PROBLEM - Disk space on logstash1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%)
[06:42:16] PROBLEM - puppet last run on mc1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:44:06] PROBLEM - puppet last run on logstash1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:48:16] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:10:16] RECOVERY - puppet last run on mc1001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[07:25:26] PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:33:16] PROBLEM - puppet last run on logstash1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:54:26] RECOVERY - puppet last run on analytics1028 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[08:36:14] Operations, Discovery, Maps, Maps-data, Epic: Epic: cultivating the Maps garden - https://phabricator.wikimedia.org/T137616#2773576 (Gehel)
[08:49:14] Operations, Discovery, Maps, WMF-Legal, Interactive-Sprint: Define tile usage policy - https://phabricator.wikimedia.org/T141815#2773589 (Gehel) @debt: as @BBlack pointed out in the start of this thread, we tend to have a fairly liberal view on who can reuse our content / services. Bulk downl...
[10:46:56] PROBLEM - Varnish HTTP text-backend - port 3128 on cp1068 is CRITICAL: connect to address 10.64.0.105 and port 3128: Connection refused
[10:55:56] RECOVERY - Varnish HTTP text-backend - port 3128 on cp1068 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.001 second response time
[11:53:16] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:21:16] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[12:23:16] PROBLEM - Apache HTTP on mw1240 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time
[12:23:26] PROBLEM - HHVM rendering on mw1240 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.002 second response time
[12:24:16] RECOVERY - Apache HTTP on mw1240 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.025 second response time
[12:24:26] RECOVERY - HHVM rendering on mw1240 is OK: HTTP OK: HTTP/1.1 200 OK - 71778 bytes in 0.066 second response time
[12:25:46] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:53:48] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[14:03:26] PROBLEM - puppet last run on rcs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:09:47] (PS1) Ladsgroup: ores (labs): Define log directory in worker nodes [puppet] - https://gerrit.wikimedia.org/r/319984 (https://phabricator.wikimedia.org/T149925)
[14:28:15] grrrit-wm: force-restart
[14:28:36] grrrit-wm: force-restart
[14:28:40] re-connecting to irc and gerrit
[14:29:16] re-connected to gerrit and irc.
[14:31:26] RECOVERY - puppet last run on rcs1002 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[14:54:35] <_joe_> paladox: I see a big potential abuse there
[14:54:44] <_joe_> grrrit-wm: force-restart
[14:54:52] <_joe_> does it work with anyone?
[14:54:55] _joe_ oh, yeh i am going to add an extra check
[14:54:58] <_joe_> I hope not :)
[14:55:10] and _joe_ nope i just tried with an ip that wasn't whitelisted
[14:55:14] and it didn't run the command
[14:55:15] :)
[14:55:18] <_joe_> ok
[14:55:34] I'm just testing the extra check now
[14:55:46] grrrit-wm is not running the new check, so it is still based on the nick
[14:58:33] _joe_ i will need to whitelist you to be able to run it
[14:58:57] <_joe_> paladox: I got root, I can just go to toolabs and restart it
[14:58:58] <_joe_> :P
[14:59:05] Oh lol
[14:59:20] i can do that now
[15:01:09] _joe_ now try it
[15:01:11] :)
[15:01:21] <_joe_> grrrit-wm: force-restart
[15:01:34] it will take a few secs
[15:01:39] re-connecting to irc and gerrit
[15:01:46] There ^^ :)
[15:01:58] <_joe_> ok, this is surely faster than ssh-ing to tools :)
[15:02:23] Yep, _joe_ it will also automatically try to reconnect to ssh if it loses ssh
[15:02:33] like if gerrit goes down it will try and connect to it
[15:02:45] there's also a
[15:02:51] grrrit-wm: nick
[15:02:51] and
[15:02:56] grrrit-wm: restart
[15:02:56] re-connecting to gerrit
[15:02:57] command
[15:02:57] reconnected to gerrit
[15:03:02] _joe_ ^^
[15:12:10] _joe_ https://gerrit.wikimedia.org/r/#/c/319983/ that will make things more secure
[15:13:28] <_joe_> paladox: seems like a sensible idea
[15:13:37] Yep :)
[15:25:45] (PS1) BBlack: varnish: backend restarts on v4 only [puppet] - https://gerrit.wikimedia.org/r/319995
[15:27:12] (CR) BBlack: [C: 2 V: 2] varnish: backend restarts on v4 only [puppet] - https://gerrit.wikimedia.org/r/319995 (owner: BBlack)
[15:29:16] PROBLEM - puppet last run on analytics1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:52:16] PROBLEM - puppet last run on labsdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:58:16] RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[16:16:55] Operations, Labs, Labs-Infrastructure, netops, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2773843 (Reedy)
[16:20:58] (PS1) BBlack: VCL: only create new hit-for-pass on miss [puppet] - https://gerrit.wikimedia.org/r/319997
[16:21:16] RECOVERY - puppet last run on labsdb1001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[16:22:53] (CR) BBlack: [C: 2 V: 2] VCL: only create new hit-for-pass on miss [puppet] - https://gerrit.wikimedia.org/r/319997 (owner: BBlack)
[16:41:16] PROBLEM - puppet last run on es1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:53:34] (PS1) BBlack: Revert "VCL: only create new hit-for-pass on miss" [puppet] - https://gerrit.wikimedia.org/r/319999
[16:53:45] (CR) BBlack: [C: 2 V: 2] Revert "VCL: only create new hit-for-pass on miss" [puppet] - https://gerrit.wikimedia.org/r/319999 (owner: BBlack)
[17:07:46] PROBLEM - puppet last run on elastic1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:09:16] RECOVERY - puppet last run on es1018 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[17:26:51] Operations, Labs, Labs-Infrastructure, netops, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2773987 (bd808) What about using https://blueprints.launchpad.net/keystone/+spec/delegated-auth-via-oauth ?
[17:27:36] Not to go on constantly about video scalers, but they seem to have some kind of load balancing drama today.
[17:36:46] RECOVERY - puppet last run on elastic1029 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[18:57:13] (CR) VolkerE: [C: -1] "Three minor things: At large looks good to me." [mediawiki-config] - https://gerrit.wikimedia.org/r/319968 (https://phabricator.wikimedia.org/T147219) (owner: Catrope)
[19:00:32] Operations: update-ca-certificates, run via puppets sslcert module, doesn't update symlinks to replaced certificates - https://phabricator.wikimedia.org/T150058#2774050 (Joe) @AlexMonk-WMF nope, we moved the CA from the old server to the new one; changing the CA in production will surely need quite some work...
[19:14:57] !log Deleted huge logstash1001:/var/log/logstash/logstash.log.1 log file; disk full and difficult to debug with no free space on /
[19:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:06] RECOVERY - Disk space on logstash1001 is OK: DISK OK
[19:16:17] was wondering about that error bd808 :)
[19:23:31] !log Restarted logstash on logstash1001
[19:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:34:46] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.122:9200/_cluster/health error while fetching: (Connection aborted., error(111, Connection refused))
[19:35:56] PROBLEM - logstash process on logstash1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash
[19:38:14] !log Elasticsearch on logstash1001 won't restart due to missing /etc/elasticsearch/scripts directory
[19:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:56] RECOVERY - logstash process on logstash1001 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash
[19:40:13] bd808 logstash seems to be having lots of issues, any way i can be of assistance?
[19:40:40] Zppix: kind of doubtful. you aren't a shell user nor do you have root on these boxes
[19:40:59] ok
[19:41:06] RECOVERY - puppet last run on logstash1001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[19:41:09] bd808 i'll make you some virtual beer is that ok xD
[19:41:25] Zppix: :) thanks
[19:41:46] no problem let me know if there's anything i can do... i'm generally always in here or -dev
[19:42:26] Operations, Labs, Labs-Infrastructure, netops, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2773828 (AlexMonk-WMF) I'm pretty sure the Nova half of #1 is unnecessary, instances can already hit the nova API (it runs on labnet), the...
[19:43:46] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 39, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 100.0, active
[19:45:06] PROBLEM - puppet last run on logstash1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[logstash]
[19:45:13] Operations, Labs, Labs-Infrastructure, netops, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2774119 (AlexMonk-WMF) Actually, I'll go further: We should delete any user other than a whitelisted account that managed to get a success...
[19:45:27] !log Forced several puppet runs on logstash1001 until things stopped changing; out of disk seemed to have messed up apt upgrades
[19:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:40] grrrit-wm: force-restart
[19:47:42] re-connecting to gerrit and irc.
[19:48:23] re-connected to gerrit and irc.
[19:49:14] _joe_ ^^ whitelist is now more secure
[19:49:47] deployed the change, so it should now basically check against your hostname, like wikimedia/ and so on
[19:49:47] except for thcipriani and i can't remember who else (because they have no cloak)
[19:50:31] twentyafterfour and thcipriani don't have a cloak so they are exempt so we just whitelisted their nicks instead
[19:54:17] !log ELK stack problems are related to Elasticsearch index mapping. Some events are being rejected for not matching the expected mappings and that is filling up the disk on the logstash ingestion hosts
[19:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:00:39] A database query error has occurred. This may indicate a bug in the software.[WB46SwpAIDcAAC@jSjYAAAAK] 2016-11-05 20:00:27: Fatal exception of type "DBQueryError"
[20:00:45] On https://commons.wikimedia.org/w/index.php?title=File:A_Conversation_With_Tim_Kaine.webm&action=delete
[20:01:14] But… it deleted it, lol.
[20:12:06] RECOVERY - puppet last run on logstash1001 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[20:16:56] PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:32:43] Operations, Labs, Labs-Infrastructure, netops, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2774143 (Andrew) >>! In T150092#2773987, @bd808 wrote: > What about using https://blueprints.launchpad.net/keystone/+spec/delegated-auth-v...
[20:33:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[20:33:31] Operations, Labs, Labs-Infrastructure, netops, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2774144 (Andrew) >>! In T150092#2774119, @AlexMonk-WMF wrote: >>>! In T150092#2774103, @AlexMonk-WMF wrote: >> We should probably disallow...
[20:36:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:38:31] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:44:56] RECOVERY - puppet last run on achernar is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[20:50:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:52:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[21:21:51] Operations, Labs, Labs-Infrastructure, netops, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2774171 (AlexMonk-WMF) >>! In T150092#2774144, @Andrew wrote: >>>! In T150092#2774119, @AlexMonk-WMF wrote: >>>>! In T150092#2774103, @Ale...
[21:28:36] (PS1) BryanDavis: logstash: Temporarily disable EventBus channel [mediawiki-config] - https://gerrit.wikimedia.org/r/320016
[21:28:55] Reedy: can you give me a quick sanity check on ^ that change?
[21:29:40] (CR) Reedy: [C: 1] logstash: Temporarily disable EventBus channel [mediawiki-config] - https://gerrit.wikimedia.org/r/320016 (owner: BryanDavis)
[21:29:57] thx
[21:30:25] bd808: awesome, thanks for taking care of that! anything I can do to help in the next 30 min?
[21:30:39] godog: Can you make java not suck?
[21:31:11] godog: stand by while I sync the config change just in case everything goes weird and I need a root?
[21:31:20] bd808: sure!
[21:31:34] Reedy: I'm afraid I can't do that :(
[21:32:03] not letting developers manage memory allocation scope turns out to be a bad idea, who knew
[21:32:10] (PS2) BryanDavis: logstash: Temporarily disable EventBus channel [mediawiki-config] - https://gerrit.wikimedia.org/r/320016 (https://phabricator.wikimedia.org/T150106)
[21:32:56] (CR) BryanDavis: [C: 2] logstash: Temporarily disable EventBus channel [mediawiki-config] - https://gerrit.wikimedia.org/r/320016 (https://phabricator.wikimedia.org/T150106) (owner: BryanDavis)
[21:33:28] (Merged) jenkins-bot: logstash: Temporarily disable EventBus channel [mediawiki-config] - https://gerrit.wikimedia.org/r/320016 (https://phabricator.wikimedia.org/T150106) (owner: BryanDavis)
[21:36:04] !log bd808@tin Synchronized wmf-config/InitialiseSettings.php: logstash: Temporarily disable EventBus channel (T150106) (duration: 00m 50s)
[21:36:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:36:12] T150106: Eventbus exception logs causing indexing failures in ELK Elasticsearch - https://phabricator.wikimedia.org/T150106
[21:37:53] Operations, Wikimedia-Logstash: fix partition scheme for logstash ingester hosts - https://phabricator.wikimedia.org/T150108#2774185 (fgiunchedi)
[21:39:44] !log Deleted huge logstash1002:/var/log/logstash/logstash.log.1 log file; disk full
[21:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:16] RECOVERY - Disk space on logstash1002 is OK: DISK OK
[21:40:20] !log Deleted huge logstash1003:/var/log/logstash/logstash.log.1 log file; disk full
[21:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:40:26] RECOVERY - Disk space on logstash1003 is OK: DISK OK
[21:41:11] godog: good news for folks trying to have a weekend: the logstash error log has stopped growing
[21:46:16] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[21:55:02] bd808: looks like the change worked on 1002, not seeing high volume spam in logstash.log
[21:55:28] godog: yeah. the immediate crisis seems to be resolved
[21:55:54] \o/
[22:02:16] RECOVERY - puppet last run on logstash1003 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[22:21:36] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:50:36] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures