[00:17:12] (PS1) Tim Landscheidt: Labs: Allow per-host Hiera overrides via wikitech [puppet] - https://gerrit.wikimedia.org/r/233184 (https://phabricator.wikimedia.org/T104202)
[00:22:02] PROBLEM - very high load average likely xfs on ms-be1004 is CRITICAL - load average: 227.40, 160.11, 79.09
[00:23:35] (CR) Tim Landscheidt: "Tested on Toolsbeta: toolsbeta-exec-202 (https://wikitech.wikimedia.org/wiki/Hiera:Toolsbeta):" [puppet] - https://gerrit.wikimedia.org/r/233184 (https://phabricator.wikimedia.org/T104202) (owner: Tim Landscheidt)
[00:26:52] PROBLEM - Disk space on mw1010 is CRITICAL: DISK CRITICAL - free space: / 8182 MB (3% inode=93%)
[00:48:03] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[00:53:53] PROBLEM - puppet last run on erbium is CRITICAL Puppet has 1 failures
[01:07:53] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[01:42:22] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 15 not-conn: cp3018_v6
[01:44:13] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 16 ESP OK
[01:47:59] something with images
[01:48:05] they don't display
[01:55:17] Danny_B: Example link?
[02:00:52] RECOVERY - Last backup of the tools filesystem on labstore1002 is OK - Last run successful
[02:19:52] RECOVERY - puppet last run on erbium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:19:59] !log l10nupdate@tin Synchronized php-1.26wmf19/cache/l10n: l10nupdate for 1.26wmf19 (duration: 06m 23s)
[02:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:21:24] Katie: it works now. perhaps some temporary overload or dropout
[02:50:52] PROBLEM - puppet last run on db2047 is CRITICAL puppet fail
[03:00:53] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[03:01:12] RECOVERY - Last backup of the others filesystem on labstore1002 is OK - Last run successful
[03:08:33] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[03:17:53] RECOVERY - puppet last run on db2047 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:23:23] PROBLEM - puppet last run on erbium is CRITICAL Puppet has 1 failures
[03:47:13] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[03:50:12] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL check_failover servers up 1 down 1
[03:50:23] RECOVERY - puppet last run on erbium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:53:38] operations, Phabricator: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109964#1564721 (Krenair)
[03:54:52] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[03:59:27] operations, Mail: Move trademark@ alias to Google Mail - https://phabricator.wikimedia.org/T109868#1564723 (Krenair)
[04:00:53] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[04:10:24] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[04:25:19] operations: mw2187 - read-only filesystem - https://phabricator.wikimedia.org/T109717#1564742 (Krenair) ```krenair@mw2187:~$ sync-common 04:21:38 Copying to mw2187.codfw.wmnet from tin.eqiad.wmnet 04:21:38 Started rsync common 04:23:34 Finished rsync common (duration: 01m 56s) krenair@mw2187:~$ ``` Weird.
[04:44:48] operations, Discovery, Maps: Determine limited maps deployment options - https://phabricator.wikimedia.org/T109159#1564743 (Krenair)
[05:13:11] ACKNOWLEDGEMENT - haproxy failover on dbproxy1003 is CRITICAL check_failover servers up 2 down 1 Jcrespo not in use
[05:53:32] PROBLEM - puppet last run on erbium is CRITICAL Puppet has 1 failures
[06:13:33] PROBLEM - puppet last run on mw2029 is CRITICAL Puppet has 1 failures
[06:20:22] RECOVERY - puppet last run on erbium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:30:03] PROBLEM - puppet last run on cp3007 is CRITICAL puppet fail
[06:32:04] PROBLEM - puppet last run on mw1110 is CRITICAL Puppet has 2 failures
[06:32:12] PROBLEM - puppet last run on mw1119 is CRITICAL Puppet has 3 failures
[06:32:13] PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 1 failures
[06:32:23] PROBLEM - puppet last run on mw1215 is CRITICAL Puppet has 1 failures
[06:32:53] PROBLEM - puppet last run on mw2158 is CRITICAL Puppet has 1 failures
[06:33:02] PROBLEM - puppet last run on mw2043 is CRITICAL Puppet has 1 failures
[06:33:02] PROBLEM - puppet last run on mw2036 is CRITICAL Puppet has 2 failures
[06:33:04] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 3 failures
[06:33:05] PROBLEM - puppet last run on mw1090 is CRITICAL Puppet has 3 failures
[06:33:12] PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 2 failures
[06:34:03] RECOVERY - Disk space on mw1010 is OK: DISK OK
[06:38:52] RECOVERY - puppet last run on mw2029 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:43:55] (CR) 20after4: "Ori: that does sound promising. We had discussed ways to make a single proxy do the job but we weren't aware of the "SO_PEERCRED" trick..." [puppet] - https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: Thcipriani)
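For context on the "SO_PEERCRED" trick mentioned in that review: on Linux, a process listening on a Unix domain socket can ask the kernel for the pid, uid and gid of the peer that connected, so a single proxy can tell its local clients apart without running a separate socket per user. Below is a minimal sketch of the mechanism only, assuming Python on Linux and a hypothetical socket path; it is not taken from the change under review (https://gerrit.wikimedia.org/r/232843).

```
import os
import socket
import struct

# SO_PEERCRED (Linux-only) returns struct ucred {pid, uid, gid} -- three
# native ints -- describing the process on the far end of a Unix domain socket.
SOCK_PATH = "/run/example-proxy.sock"  # hypothetical path for illustration

if os.path.exists(SOCK_PATH):
    os.unlink(SOCK_PATH)

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(SOCK_PATH)
server.listen(1)

conn, _ = server.accept()
creds = conn.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, struct.calcsize("3i"))
pid, uid, gid = struct.unpack("3i", creds)
print("peer pid=%d uid=%d gid=%d" % (pid, uid, gid))
```

Because the credentials come from the kernel rather than from anything the client sends, they cannot be spoofed, which is what makes a single shared proxy socket workable in the scenario being discussed.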
[06:46:28] oh, there's a phabricator bug about the phabricator outages. awesome (https://phabricator.wikimedia.org/T109964 )
[06:49:46] operations, Commons, Database: remove Tag: HHVM on commons - https://phabricator.wikimedia.org/T109967#1564789 (Steinsplitter) NEW
[06:55:33] RECOVERY - puppet last run on mw1215 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:56:13] RECOVERY - puppet last run on mw2043 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:56:13] RECOVERY - puppet last run on mw1090 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:23] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:57:14] RECOVERY - puppet last run on mw1110 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:57:14] RECOVERY - puppet last run on cp3007 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:57:22] RECOVERY - puppet last run on mw1119 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:57:32] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 1 second ago with 0 failures
[06:58:04] RECOVERY - puppet last run on mw2158 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:13] RECOVERY - puppet last run on mw2036 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:13] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:11:58] operations, Database: db2034 crashed - https://phabricator.wikimedia.org/T109282#1564799 (jcrespo) Open>Resolved a: jcrespo
[07:50:03] PROBLEM - puppet last run on labcontrol1001 is CRITICAL Puppet has 1 failures
[07:54:53] PROBLEM - OCG health on ocg1001 is CRITICAL ocg_job_status 663265 msg: ocg_render_job_queue 3082 msg (>=3000 critical)
[07:55:14] PROBLEM - OCG health on ocg1002 is CRITICAL ocg_job_status 663715 msg: ocg_render_job_queue 3263 msg (>=3000 critical)
[07:56:13] PROBLEM - OCG health on ocg1003 is CRITICAL ocg_job_status 664681 msg: ocg_render_job_queue 3626 msg (>=3000 critical)
[08:17:03] RECOVERY - puppet last run on labcontrol1001 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures
[08:38:43] RECOVERY - OCG health on ocg1003 is OK ocg_job_status 688368 msg: ocg_render_job_queue 66 msg
[08:39:13] RECOVERY - OCG health on ocg1001 is OK ocg_job_status 688454 msg: ocg_render_job_queue 0 msg
[08:39:43] RECOVERY - OCG health on ocg1002 is OK ocg_job_status 688518 msg: ocg_render_job_queue 0 msg
[08:40:15] operations, Phabricator: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109964#1564846 (Nemo_bis) Some discussion/work on the matter at http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20150822.txt after `[0...
[09:52:22] PROBLEM - puppet last run on erbium is CRITICAL Puppet has 1 failures
[10:19:13] RECOVERY - puppet last run on erbium is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures
[11:19:03] PROBLEM - puppet last run on mw1122 is CRITICAL Puppet has 1 failures
[11:44:13] RECOVERY - puppet last run on mw1122 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures
[14:23:53] PROBLEM - puppet last run on mw2174 is CRITICAL puppet fail
[14:50:52] RECOVERY - puppet last run on mw2174 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures
[14:55:10] operations, Commons, Database: remove Tag: HHVM on commons - https://phabricator.wikimedia.org/T109967#1565228 (Krenair) I see no reason for this to be Commons-specific
[14:55:20] operations, Commons, Database: remove Tag: HHVM on commons - https://phabricator.wikimedia.org/T109967#1565229 (Krenair)
[15:55:33] PROBLEM - puppet last run on ms-be1004 is CRITICAL puppet fail
[16:04:35] (PS2) Andrew Bogott: Set up an rsync daemon that allows rsyncing of nova instances among virt hosts. [puppet] - https://gerrit.wikimedia.org/r/233068
[16:24:23] RECOVERY - puppet last run on ms-be1004 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures
[16:34:27] YuviPanda: did you see my comments about the puppet SWAT thing?
[16:34:48] JohnFLewis: yeah, I'll respond tomorrow. Mostly we'll do Tues / Thurs, yes
[16:35:08] okay, just wanted to check it on that :)
[16:49:42] PROBLEM - Host ms-be1004 is DOWN: PING CRITICAL - Packet loss = 100%
[16:53:53] (PS3) Andrew Bogott: Set up an rsync daemon that allows rsyncing of nova instances among virt hosts. [puppet] - https://gerrit.wikimedia.org/r/233068
[16:54:26] !log bouncing Cassandra on restbase1001 to apply temporary GC settings
[16:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:55:11] (CR) Andrew Bogott: "I'm feeling better about this now, although of course it's untestable on labs." [puppet] - https://gerrit.wikimedia.org/r/233068 (owner: Andrew Bogott)
[17:27:59] (PS1) Aklapper: [WIP] Phabricator project creation/changes log email for Phab admins [puppet] - https://gerrit.wikimedia.org/r/233219 (https://phabricator.wikimedia.org/T85183)
[17:28:36] (CR) Aklapper: "WIP and untested." [puppet] - https://gerrit.wikimedia.org/r/233219 (https://phabricator.wikimedia.org/T85183) (owner: Aklapper)
[17:45:34] operations, Phabricator: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109964#1565720 (Aklapper) Dup of T109941 ?
[18:07:55] operations, Phabricator: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109964#1565738 (Krenair) I don't think so...
[19:55:42] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 8.33% of data above the critical threshold [500.0]
[20:01:42] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
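The recurring "HTTP 5xx req/min on graphite1001" alerts in this log are percentage-over-threshold checks: recent datapoints for a Graphite metric are fetched, and the check reports CRITICAL when more than a small fraction of them exceed the configured level (500.0 here, with 250.0 quoted in the recovery message). The sketch below illustrates that logic only; the render URL, metric name and 1% cut-off are assumptions for illustration, not the actual check_graphite configuration used by Icinga here.

```
import json
import urllib.request

# Hypothetical Graphite query; the real check and its exact target are not in this log.
GRAPHITE_URL = "https://graphite.example.org/render?target=reqstats.5xx&from=-10min&format=json"
WARNING, CRITICAL = 250.0, 500.0   # thresholds quoted in the alert messages above
MAX_PERCENT = 1.0                  # assumed cut-off: alert when >1% of datapoints are over

def percent_over(datapoints, threshold):
    """Return the percentage of non-null [value, timestamp] datapoints above threshold."""
    values = [v for v, _ts in datapoints if v is not None]
    if not values:
        return 0.0
    return 100.0 * sum(1 for v in values if v > threshold) / len(values)

series = json.loads(urllib.request.urlopen(GRAPHITE_URL).read().decode())[0]["datapoints"]
crit_pct = percent_over(series, CRITICAL)
warn_pct = percent_over(series, WARNING)

if crit_pct > MAX_PERCENT:
    print("CRITICAL %.2f%% of data above the critical threshold [%.1f]" % (crit_pct, CRITICAL))
elif warn_pct > MAX_PERCENT:
    print("WARNING %.2f%% of data above the threshold [%.1f]" % (warn_pct, WARNING))
else:
    print("OK Less than %.2f%% above the threshold [%.1f]" % (MAX_PERCENT, WARNING))
```

Treating the alert as a fraction of datapoints over a limit, rather than a single instantaneous value, is what keeps short spikes (like the brief 5xx bursts seen throughout this log) from flapping the check on every stray sample.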