[00:17:12] (PS1) Tim Landscheidt: Labs: Allow per-host Hiera overrides via wikitech [puppet] - https://gerrit.wikimedia.org/r/233184 (https://phabricator.wikimedia.org/T104202)
[00:22:02] PROBLEM - very high load average likely xfs on ms-be1004 is CRITICAL - load average: 227.40, 160.11, 79.09
[00:23:35] (CR) Tim Landscheidt: "Tested on Toolsbeta: toolsbeta-exec-202 (https://wikitech.wikimedia.org/wiki/Hiera:Toolsbeta):" [puppet] - https://gerrit.wikimedia.org/r/233184 (https://phabricator.wikimedia.org/T104202) (owner: Tim Landscheidt)
[00:26:52] PROBLEM - Disk space on mw1010 is CRITICAL: DISK CRITICAL - free space: / 8182 MB (3% inode=93%)
[00:48:03] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[00:53:53] PROBLEM - puppet last run on erbium is CRITICAL Puppet has 1 failures
[01:07:53] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[01:42:22] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 15 not-conn: cp3018_v6
[01:44:13] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 16 ESP OK
[01:47:59] something with images
[01:48:05] they don't display
[01:55:17] Danny_B: Example link?
[02:00:52] RECOVERY - Last backup of the tools filesystem on labstore1002 is OK - Last run successful
[02:19:52] RECOVERY - puppet last run on erbium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:19:59] !log l10nupdate@tin Synchronized php-1.26wmf19/cache/l10n: l10nupdate for 1.26wmf19 (duration: 06m 23s)
[02:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:21:24] Katie: it works now. perhaps some temporary overload or dropout
[02:50:52] PROBLEM - puppet last run on db2047 is CRITICAL puppet fail
[03:00:53] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[03:01:12] RECOVERY - Last backup of the others filesystem on labstore1002 is OK - Last run successful
[03:08:33] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[03:17:53] RECOVERY - puppet last run on db2047 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:23:23] PROBLEM - puppet last run on erbium is CRITICAL Puppet has 1 failures
[03:47:13] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[03:50:12] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL check_failover servers up 1 down 1
[03:50:23] RECOVERY - puppet last run on erbium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:53:38] operations, Phabricator: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109964#1564721 (Krenair)
[03:54:52] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[03:59:27] operations, Mail: Move trademark@ alias to Google Mail - https://phabricator.wikimedia.org/T109868#1564723 (Krenair)
[04:00:53] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[04:10:24] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[04:25:19] operations: mw2187 - read-only filesystem - https://phabricator.wikimedia.org/T109717#1564742 (Krenair) ```krenair@mw2187:~$ sync-common 04:21:38 Copying to mw2187.codfw.wmnet from tin.eqiad.wmnet 04:21:38 Started rsync common 04:23:34 Finished rsync common (duration: 01m 56s) krenair@mw2187:~$ ``` Weird.
[04:44:48] operations, Discovery, Maps: Determine limited maps deployment options - https://phabricator.wikimedia.org/T109159#1564743 (Krenair)
[05:13:11] ACKNOWLEDGEMENT - haproxy failover on dbproxy1003 is CRITICAL check_failover servers up 2 down 1 Jcrespo not in use
[05:53:32] PROBLEM - puppet last run on erbium is CRITICAL Puppet has 1 failures
[06:13:33] PROBLEM - puppet last run on mw2029 is CRITICAL Puppet has 1 failures
[06:20:22] RECOVERY - puppet last run on erbium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:30:03] PROBLEM - puppet last run on cp3007 is CRITICAL puppet fail
[06:32:04] PROBLEM - puppet last run on mw1110 is CRITICAL Puppet has 2 failures
[06:32:12] PROBLEM - puppet last run on mw1119 is CRITICAL Puppet has 3 failures
[06:32:13] PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 1 failures
[06:32:23] PROBLEM - puppet last run on mw1215 is CRITICAL Puppet has 1 failures
[06:32:53] PROBLEM - puppet last run on mw2158 is CRITICAL Puppet has 1 failures
[06:33:02] PROBLEM - puppet last run on mw2043 is CRITICAL Puppet has 1 failures
[06:33:02] PROBLEM - puppet last run on mw2036 is CRITICAL Puppet has 2 failures
[06:33:04] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 3 failures
[06:33:05] PROBLEM - puppet last run on mw1090 is CRITICAL Puppet has 3 failures
[06:33:12] PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 2 failures
[06:34:03] RECOVERY - Disk space on mw1010 is OK: DISK OK
[06:38:52] RECOVERY - puppet last run on mw2029 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:43:55] (CR) 20after4: "Ori: that does sound promising. We had discussed ways to make a single proxy do the job but we weren't aware of the "SO_PEERCRED" trick..." [puppet] - https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: Thcipriani)
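For context on the "SO_PEERCRED" trick mentioned in that review: on Linux, a process listening on a Unix domain socket can ask the kernel for the pid, uid and gid of the peer that connected, so a single proxy can tell its local clients apart without running a separate socket per user. Below is a minimal sketch of the mechanism only, assuming Python on Linux and a hypothetical socket path; it is not taken from the change under review (https://gerrit.wikimedia.org/r/232843).

```
import os
import socket
import struct

# SO_PEERCRED (Linux-only) returns struct ucred {pid, uid, gid} -- three
# native ints -- describing the process on the far end of a Unix domain socket.
SOCK_PATH = "/run/example-proxy.sock"  # hypothetical path for illustration

if os.path.exists(SOCK_PATH):
    os.unlink(SOCK_PATH)

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(SOCK_PATH)
server.listen(1)

conn, _ = server.accept()
creds = conn.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, struct.calcsize("3i"))
pid, uid, gid = struct.unpack("3i", creds)
print("peer pid=%d uid=%d gid=%d" % (pid, uid, gid))
```

Because the credentials come from the kernel rather than from anything the client sends, they cannot be spoofed, which is what makes a single shared proxy socket workable in the scenario being discussed.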
[06:46:28] oh, there's a phabricator bug about the phabricator outages. awesome (https://phabricator.wikimedia.org/T109964 )
[06:49:46] operations, Commons, Database: remove Tag: HHVM on commons - https://phabricator.wikimedia.org/T109967#1564789 (Steinsplitter) NEW
[06:55:33] RECOVERY - puppet last run on mw1215 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:56:13] RECOVERY - puppet last run on mw2043 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:56:13] RECOVERY - puppet last run on mw1090 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:23] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:57:14] RECOVERY - puppet last run on mw1110 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:57:14] RECOVERY - puppet last run on cp3007 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:57:22] RECOVERY - puppet last run on mw1119 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:57:32] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 1 second ago with 0 failures
[06:58:04] RECOVERY - puppet last run on mw2158 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:13] RECOVERY - puppet last run on mw2036 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:13] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:11:58] operations, Database: db2034 crashed - https://phabricator.wikimedia.org/T109282#1564799 (jcrespo) Open>Resolved a: jcrespo
[07:50:03] PROBLEM - puppet last run on labcontrol1001 is CRITICAL Puppet has 1 failures
[07:54:53] PROBLEM - OCG health on ocg1001 is CRITICAL ocg_job_status 663265 msg: ocg_render_job_queue 3082 msg (>=3000 critical)
[07:55:14] PROBLEM - OCG health on ocg1002 is CRITICAL ocg_job_status 663715 msg: ocg_render_job_queue 3263 msg (>=3000 critical)
[07:56:13] PROBLEM - OCG health on ocg1003 is CRITICAL ocg_job_status 664681 msg: ocg_render_job_queue 3626 msg (>=3000 critical)
[08:17:03] RECOVERY - puppet last run on labcontrol1001 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures
[08:38:43] RECOVERY - OCG health on ocg1003 is OK ocg_job_status 688368 msg: ocg_render_job_queue 66 msg
[08:39:13] RECOVERY - OCG health on ocg1001 is OK ocg_job_status 688454 msg: ocg_render_job_queue 0 msg
[08:39:43] RECOVERY - OCG health on ocg1002 is OK ocg_job_status 688518 msg: ocg_render_job_queue 0 msg
[08:40:15] operations, Phabricator: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109964#1564846 (Nemo_bis) Some discussion/work on the matter at http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20150822.txt after `[0...
[09:52:22] PROBLEM - puppet last run on erbium is CRITICAL Puppet has 1 failures
[10:19:13] RECOVERY - puppet last run on erbium is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures
[11:19:03] PROBLEM - puppet last run on mw1122 is CRITICAL Puppet has 1 failures
[11:44:13] RECOVERY - puppet last run on mw1122 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures
[14:23:53] PROBLEM - puppet last run on mw2174 is CRITICAL puppet fail
[14:50:52] RECOVERY - puppet last run on mw2174 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures
[14:55:10] operations, Commons, Database: remove Tag: HHVM on commons - https://phabricator.wikimedia.org/T109967#1565228 (Krenair) I see no reason for this to be Commons-specific
[14:55:20] operations, Commons, Database: remove Tag: HHVM on commons - https://phabricator.wikimedia.org/T109967#1565229 (Krenair)
[15:55:33] PROBLEM - puppet last run on ms-be1004 is CRITICAL puppet fail
[16:04:35] (PS2) Andrew Bogott: Set up an rsync daemon that allows rsyncing of nova instances among virt hosts. [puppet] - https://gerrit.wikimedia.org/r/233068
[16:24:23] RECOVERY - puppet last run on ms-be1004 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures
[16:34:27] YuviPanda: did you see my comments about the puppet SWAT thing?
[16:34:48] JohnFLewis: yeah, I'll respond tomorrow. Mostly we'll do Tues / Thurs, yes
[16:35:08] okay, just wanted to check it on that :)
[16:49:42] PROBLEM - Host ms-be1004 is DOWN: PING CRITICAL - Packet loss = 100%
[16:53:53] (PS3) Andrew Bogott: Set up an rsync daemon that allows rsyncing of nova instances among virt hosts. [puppet] - https://gerrit.wikimedia.org/r/233068
[16:54:26] !log bouncing Cassandra on restbase1001 to apply temporary GC settings
[16:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:55:11] (CR) Andrew Bogott: "I'm feeling better about this now, although of course it's untestable on labs." [puppet] - https://gerrit.wikimedia.org/r/233068 (owner: Andrew Bogott)
[17:27:59] (PS1) Aklapper: [WIP] Phabricator project creation/changes log email for Phab admins [puppet] - https://gerrit.wikimedia.org/r/233219 (https://phabricator.wikimedia.org/T85183)
[17:28:36] (CR) Aklapper: "WIP and untested." [puppet] - https://gerrit.wikimedia.org/r/233219 (https://phabricator.wikimedia.org/T85183) (owner: Aklapper)
[17:45:34] operations, Phabricator: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109964#1565720 (Aklapper) Dup of T109941 ?
[18:07:55] operations, Phabricator: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109964#1565738 (Krenair) I don't think so...
[19:55:42] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 8.33% of data above the critical threshold [500.0]
[20:01:42] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
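The recurring "HTTP 5xx req/min on graphite1001" alerts in this log are percentage-over-threshold checks: recent datapoints for a Graphite metric are fetched, and the check reports CRITICAL when more than a small fraction of them exceed the configured level (500.0 here, with 250.0 quoted in the recovery message). The sketch below illustrates that logic only; the render URL, metric name and 1% cut-off are assumptions for illustration, not the actual check_graphite configuration used by Icinga here.

```
import json
import urllib.request

# Hypothetical Graphite query; the real check and its exact target are not in this log.
GRAPHITE_URL = "https://graphite.example.org/render?target=reqstats.5xx&from=-10min&format=json"
WARNING, CRITICAL = 250.0, 500.0   # thresholds quoted in the alert messages above
MAX_PERCENT = 1.0                  # assumed cut-off: alert when >1% of datapoints are over

def percent_over(datapoints, threshold):
    """Return the percentage of non-null [value, timestamp] datapoints above threshold."""
    values = [v for v, _ts in datapoints if v is not None]
    if not values:
        return 0.0
    return 100.0 * sum(1 for v in values if v > threshold) / len(values)

series = json.loads(urllib.request.urlopen(GRAPHITE_URL).read().decode())[0]["datapoints"]
crit_pct = percent_over(series, CRITICAL)
warn_pct = percent_over(series, WARNING)

if crit_pct > MAX_PERCENT:
    print("CRITICAL %.2f%% of data above the critical threshold [%.1f]" % (crit_pct, CRITICAL))
elif warn_pct > MAX_PERCENT:
    print("WARNING %.2f%% of data above the threshold [%.1f]" % (warn_pct, WARNING))
else:
    print("OK Less than %.2f%% above the threshold [%.1f]" % (MAX_PERCENT, WARNING))
```

Treating the alert as a fraction of datapoints over a limit, rather than a single instantaneous value, is what keeps short spikes (like the brief 5xx bursts seen throughout this log) from flapping the check on every stray sample.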