[01:51:53] * jgage looks at ms-be3003 [01:56:09] [Sat Jun 21 23:16:29 2014] end_request: I/O error, dev sdk, sector 3409757152 [02:13:53] !log LocalisationUpdate completed (1.24wmf9) at 2014-06-22 02:12:50+00:00 [02:14:04] Logged the message, Master [02:21:48] so ms-be3003 has a 55gb root partition at 100% full, but i don't see that space being used in /var /usr /home /root /etc .. suspect that maybe /srv/swift-storage/ has been written to under one of the mount points (on the root partition). [02:22:08] i want to stop swift and unmount the swift-storage disks to check [02:22:27] checking whether i need to depool ms-be3003 or something first.. [02:24:53] !log LocalisationUpdate completed (1.24wmf10) at 2014-06-22 02:23:50+00:00 [02:24:59] Logged the message, Master [02:26:00] https://wikitech.wikimedia.org/wiki/Swift/How_To#Remove_a_failed_storage_node_from_the_cluster [02:26:39] this doesn't seem to be a crisis, i've never touched this stuff before, and the rest of the team are afk. i think i'll wait to make any changes. [02:31:28] ACKNOWLEDGEMENT - Disk space on ms-be3003 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=94%): Jeff Gage See discussion on IRC, sdk issues but root partition full [02:42:36] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Jun 22 02:41:30 UTC 2014 (duration 41m 29s) [02:42:40] Logged the message, Master [05:54:11] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Sun 22 Jun 2014 02:53:13 UTC [06:19:04] (03PS1) 10Yuvipanda: toollabs: Add phantomjs package [operations/puppet] - 10https://gerrit.wikimedia.org/r/141280 (https://bugzilla.wikimedia.org/66928) [06:19:08] legoktm: ^ [06:19:12] :D [06:19:21] (03PS2) 10Legoktm: toollabs: Add phantomjs package [operations/puppet] - 10https://gerrit.wikimedia.org/r/141280 (https://bugzilla.wikimedia.org/66928) (owner: 10Yuvipanda) [06:19:32] it said "Cannot Merge" for some reason [06:19:44] (03CR) 10Legoktm: [C: 031] "Thanks!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141280 (https://bugzilla.wikimedia.org/66928) (owner: 10Yuvipanda) [06:32:51] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Sun Jun 22 06:32:42 UTC 2014 [08:03:37] has anyone reported database connection issues yet? [08:03:52] i.e. 
"Cannot contact the database server: Too many connections (10.64.16.10)" [08:04:29] nope [08:04:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data exceeded the critical threshold [500.0] [08:04:46] just happened to me, but hasn't happened again [08:05:01] <_joe_> kaldari: the alarm you see here ^ may be related [08:05:50] <_joe_> sorry I'm not able to look right now if it's not a straightforward outage [08:06:18] that's OK, just wanted to share :) [08:06:37] <_joe_> it's been a transinet problem anyway [08:06:43] <_joe_> thanks :) [08:06:55] <_joe_> https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color(cactiStyle(alias(reqstats.500,%22500%20resp/min%22)),%22red%22)&target=color(cactiStyle(alias(reqstats.5xx,%225xx%20resp/min%22)),%22blue%22) [08:07:13] <_joe_> kaldari: you've been one of the 25K users served with an error [08:07:57] ah, looks like a blip of evil [08:08:43] <_joe_> it could be worth looking into, not at 9 am on a sunday in june :) [08:17:31] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [08:57:31] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:58:01] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:58:21] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.067 second response time [08:59:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:00:01] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 69035 bytes in 8.072 second response time [09:00:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:02:31] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.162 second response time [09:03:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:06:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:06:31] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:08:01] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:10:01] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 69035 bytes in 8.598 second response time [09:10:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.659 second response time [09:10:41] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [09:12:31] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.231 second response time [09:19:01] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:20:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:20:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:21:31] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.918 second response time [09:21:31] RECOVERY - Apache HTTP on mw1157 is OK: 
HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.609 second response time [09:21:51] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 69035 bytes in 0.482 second response time [09:27:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:27:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:27:41] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:28:01] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:28:21] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.568 second response time [09:28:21] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.072 second response time [09:28:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:28:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:28:31] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:28:32] mh got a slew of paging, looking into what's up with rendering on pybal [09:29:01] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 69035 bytes in 9.301 second response time [09:29:31] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.214 second response time [09:29:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.132 second response time [09:30:31] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.891 second response time [09:30:41] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.654 second response time [09:31:31] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.751 second response time [09:34:31] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.951 second response time [09:34:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:34:31] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:35:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:35:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:35:31] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:37:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:38:12] looks like the imagescalers got all busier at the same time [09:38:31] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.951 second response time [09:38:31] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.703 second response time [09:38:31] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.849 second response time [09:39:04] http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&tab=v&vn=swift+backend+eqiad&hide-hf=false swift object change graph has a nice spike [09:39:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.501 
second response time [09:39:31] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.582 second response time [09:39:32] Hey [09:39:34] I don't see more uploads than from yesterday at this time though, just a quick look at commons rc anyways. I did not try looking at sample logs [09:39:54] morning [09:40:01] hey apergos akosiaris_ ! [09:40:31] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.039 second response time [09:40:32] maybe I should not be looking at successful uploads but at attempts [09:40:36] What is happening? [09:41:26] akosiaris_: I think the imagescalers are chugging through a backlog and have all apache workers busy, why I don't know yet http://ganglia.wikimedia.org/latest/?r=2hr&cs=&ce=&m=ap_idle_workers&s=by+name&c=Image+scalers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [09:41:58] apergos: so perhaps just a whole lot of uploads? [09:42:34] well they would have had to fail I guess, in order not to show up in rc [09:43:04] or maybe there is a bunch of stuff in the job queue, this has happened before, bad coordination with mass uploads of very large images [09:43:15] which then fail to scale [09:43:32] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:43:32] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:43:41] but when I looked at the scaler jobs I didn't see a bunch of things hung or multiples of the same image being scaled, so dunno [09:44:31] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.274 second response time [09:45:21] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.969 second response time [09:48:32] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:48:41] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:48:47] It seems to have died down. Or is it me? [09:49:26] It is me [09:49:30] heh [09:50:21] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.632 second response time [09:50:41] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.804 second response time [09:53:14] bah I'm looking at thumbnail.log but nothing jumps my eye [09:54:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:56:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:57:21] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.609 second response time [09:57:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:59:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.380 second response time [09:59:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:01:21] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.074 second response time [10:01:21] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.046 second response time [10:05:46] mh what would be the job queue group name with the resize jobs? 
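One way to answer the job-queue question above, assuming shell access to a MediaWiki maintenance host and the usual mwscript wrapper (both assumptions here), is to dump per-type job counts for commonswiki. This is only a sketch, and as noted just below, the thumbnail scaling itself may not go through the job queue at all:

  # counts per job type (queued / claimed / abandoned); same output format quoted later in this log
  mwscript showJobs.php --wiki=commonswiki --group
  # bare total, if you only care whether a backlog exists at all
  mwscript showJobs.php --wiki=commonswiki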
[10:06:10] I was trying to look for where these images are coming from [10:08:44] I don't know, I'd have to dig around in there [10:09:44] most uploads aren't queued unless that's been changed since last time I looked at it (quite possible) [10:09:51] I mean the scaling isn't queued [10:10:42] ah ok! nevermind then I thought it was queued [10:11:42] the case I was thinking f was a special deal with a bulk upload tool [10:18:55] gwtoolset? [10:19:38] whatever fae wass using for uploads (I think that was the user) [10:20:31] anyways.... it's been nearly 20 mins now and no whines [10:21:25] what sounds like the right tool though [10:25:24] I"m going to wander off (trying to find parts for a tool I need... on a Sunday :-/), hope things stay quiet [10:25:41] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:25:44] grrrrrr [10:25:50] I totally jinxd it [10:29:00] haha [10:29:17] :( [10:29:17] Not totally... yet [10:31:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:32:21] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.915 second response time [10:33:31] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.556 second response time [10:36:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:37:01] so yeah what seems like bulk uploads is sth like 1403431665802American_Football_EM_2014_-_AUT-DEU_-239.JPG [10:37:21] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.069 second response time [10:37:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:37:32] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:38:01] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:38:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:38:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:38:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:38:32] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:38:41] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:39:21] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time [10:39:21] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.390 second response time [10:39:31] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.819 second response time [10:39:31] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.214 second response time [10:39:31] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.406 second response time [10:39:51] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 69035 bytes in 0.552 second response time [10:40:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.267 second response time [10:41:31] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.047 second 
response time [11:16:44] (03CR) 10Tim Landscheidt: [C: 031] toollabs: Add phantomjs package [operations/puppet] - 10https://gerrit.wikimedia.org/r/141280 (https://bugzilla.wikimedia.org/66928) (owner: 10Yuvipanda) [11:18:41] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [11:45:18] (03PS1) 1001tonythomas: Removed primary hostname to support custom Return-Path [operations/puppet] - 10https://gerrit.wikimedia.org/r/141287 [11:47:00] (03PS2) 1001tonythomas: Removed primary hostname to support custom Return-Path [operations/puppet] - 10https://gerrit.wikimedia.org/r/141287 [11:54:11] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Sun 22 Jun 2014 08:53:03 UTC [12:57:11] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.001 second response time [13:05:11] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.009 second response time [13:13:01] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Sun Jun 22 13:12:51 UTC 2014 [13:35:31] (03CR) 10Faidon Liambotis: [C: 04-1] "This is multiple different changes into one. I think they're all wrong :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141287 (owner: 1001tonythomas) [13:55:17] (03PS2) 10Odder: Change some user group rights on ruwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140910 (https://bugzilla.wikimedia.org/66871) [14:04:32] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data exceeded the critical threshold [500.0] [14:17:31] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [15:34:45] (03CR) 10Andrew Bogott: [C: 032] toollabs: Add phantomjs package [operations/puppet] - 10https://gerrit.wikimedia.org/r/141280 (https://bugzilla.wikimedia.org/66928) (owner: 10Yuvipanda) [15:54:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:57:22] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.069 second response time [15:57:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:00:31] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.513 second response time [16:03:41] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 8 below the confidence bounds [16:17:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:18:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:19:21] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.055 second response time [16:20:21] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.048 second response time [16:39:41] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [17:13:06] (03PS1) 10Se4598: beta.labs: logo and favicon change for dewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141305 [17:14:27] (03PS2) 10Se4598: beta.labs: logo and favicon change for dewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141305 [17:18:36] (03CR) 10Umherirrender: [C: 031] beta.labs: logo and favicon change for dewiki [operations/mediawiki-config] - 
10https://gerrit.wikimedia.org/r/141305 (owner: 10Se4598) [17:20:31] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:21:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:21] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.069 second response time [17:25:41] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:27:31] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.070 second response time [17:29:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:31:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:31:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:31] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.950 second response time [17:32:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.236 second response time [17:33:31] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.909 second response time [17:36:21] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.071 second response time [17:37:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:41] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:21] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.625 second response time [17:40:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:31] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:01] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:21] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.045 second response time [17:41:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:31] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.726 second response time [17:41:32] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:42:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.303 second response time [17:42:31] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.407 second response time [17:42:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data exceeded the critical threshold [500.0] [17:42:41] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.078 second response time [17:43:01] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 69035 bytes in 6.783 second response time [17:43:21] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.072 second 
response time [17:46:21] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.070 second response time [17:47:49] godog: if we were talking here there'd be commons admins and volunteers to jump in and help :) [17:48:13] * ori introduces godog to matanya and twkozlowski [17:48:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:48:41] Apache seems a bit unhappy today. [17:49:21] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.430 second response time [17:52:02] indeed, it is the imagescalers being overloaded with work to do [17:53:44] godog: can i help ? [17:53:52] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Image%20scalers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1403459578&g=load_report&z=large [17:53:55] what's happened here? [17:56:06] matanya twkozlowski there seem to be batch uploads overloading imagescalers (running out of free apache workers essentially) [17:56:18] i see puppet saturating cpu on mw1158 too, but maybe that's ephemeral [17:56:50] godog: Yes, but I see no unusual activity. [17:56:57] Just folks uploading files. [17:57:26] ori: it is an image scaler, i guess it has something to do with it [17:57:49] twkozlowski: ok I'm looking again [17:57:53] i don't think so, because the spike in cpu corresponds to a spike in network traffic [17:58:02] godog: can you find the offending file/s ? [17:58:26] mw1155 appears to be down [17:58:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:31] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [17:59:21] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.498 second response time [17:59:44] (or is Ganglia lying to me? :) [18:00:16] matanya: yep looking [18:00:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:00:39] twkozlowski: there's been quite some activity this morning, not sure how it is how [18:01:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.621 second response time [18:01:55] * matanya wishes he had shell [18:02:33] matanya: Don't we all :p [18:06:21] godog: seems like it mostly recovered [18:07:38] matanya: the idle workers are not many tho http://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=ap_idle_workers&s=by+name&c=Image+scalers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [18:08:22] godog: maybe some thing is in a transcode loop or something ? [18:08:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:08:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:08:39] can you look at the transcoding logs ? 
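Anyone with shell who wanted to act on the "can you look at the transcoding logs?" request could start with something like the sketch below; the log path is an assumption (wherever thumbnail.log and the transcode logs actually live on the log aggregation host), so adjust accordingly:

  LOG=/a/mw-log/thumbnail.log   # assumed location, adjust to the real log host path
  # most recent failure lines
  grep -i 'failed' "$LOG" | tail -n 20
  # crude ranking of which originals keep reappearing in recent entries (quoted filename field assumed)
  tail -n 20000 "$LOG" | grep -o '"[^"]*"' | sort | uniq -c | sort -rn | head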
[18:09:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:09:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:01] i think i found the issue godog [18:10:03] https://commons.wikimedia.org/wiki/Special:TimedMediaHandler [18:10:21] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.070 second response time [18:10:22] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.166 second response time [18:10:31] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.080 second response time [18:10:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:31] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:33] 1006 in the queue [18:10:54] 306 failed, and retrying those [18:11:22] so it seems like the queue is full [18:11:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.782 second response time [18:11:31] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.353 second response time [18:12:23] or this report is a lie [18:12:54] mmhh so there might be pressure from that too now [18:13:32] oh, what's up? Scalers hammered down? [18:13:48] how many items do you see in the queue on the server side ? [18:13:50] yes hoo [18:15:04] matanya: on commonswiki? webVideoTranscode: 0 queued; 180 claimed (32 active, 148 abandoned); 0 delayed [18:15:21] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.125 second response time [18:15:24] that is suspicious [18:16:01] I see at least 10 that seem queueed [18:16:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:17:17] e.g https://commons.wikimedia.org/wiki/File:Rame_MCL80_M%C3%A9tro_C_de_Lyon_21062014.ogv [18:18:00] which i just restarted [18:19:21] mhh [18:20:32] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:20:32] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:20:32] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:01] godog: is 1158 down ? [18:21:18] nope [18:21:31] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.763 second response time [18:21:59] matanya: what hoo said :) looks up [18:23:03] puppet run on mw1160 also :/ [18:23:26] godog: i'll let you debug, as i can't really help without shell. 
if you need anything, please do poke [18:23:31] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.975 second response time [18:25:21] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.052 second response time [18:25:31] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.944 second response time [18:28:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:29:21] matanya: the problem is with image scalers btw not video scalers [18:29:21] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.071 second response time [18:29:24] puppet run on mw1160 also :/ [18:29:29] whoops [18:29:31] wrong window [18:29:47] wrong window [18:37:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:21] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.065 second response time [18:39:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:41:21] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.077 second response time [18:41:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:41:41] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.115 second response time [18:45:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:46:34] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.059 second response time [18:46:34] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:46:34] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:46:41] PROBLEM - Host cp4018 is DOWN: PING CRITICAL - Packet loss = 100% [18:47:31] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.553 second response time [18:47:31] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.330 second response time [18:48:21] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time [18:59:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:01:31] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.811 second response time [19:01:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:02:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.535 second response time [19:03:22] yikes [19:03:44] godog: are the nics of the swift boxes saturated? 
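To answer the NIC question directly on one of the swift frontends instead of reading it off ganglia, a quick sketch (assumes the sysstat package is installed and that the relevant interface is eth0; both are assumptions):

  # per-interface throughput, three 5-second samples
  sar -n DEV 5 3 | grep -E 'IFACE|eth0'
  # fallback without sysstat: interface name, RX bytes, TX bytes from the kernel counters, sampled twice
  awk '/eth0/ {print $1, $2, $10}' /proc/net/dev; sleep 5; awk '/eth0/ {print $1, $2, $10}' /proc/net/dev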
[19:04:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:04:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:32] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.610 second response time [19:06:31] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:06:31] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:07:21] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.771 second response time [19:08:22] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.084 second response time [19:08:22] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.440 second response time [19:08:30] hoo: mmh a bit high but sustainable I think http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=bytes_out&s=by+name&c=Swift+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [19:09:57] no news from librenms too [19:10:41] mh [19:11:19] maybe we should think about https://gerrit.wikimedia.org/r/127632 again [19:12:31] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:13:17] indeed, I was expecting the imagescalers to be fully cpu-bound but that doesn't seem the case [19:13:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:14:11] nowhere close, actually... all workers busy and the load is still rather moderate [19:14:21] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time [19:14:32] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:14:32] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:14:52] anyways I couldn't find anything obviously wrong, except that the number of objects making it to swift is unusally high http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&tab=v&vn=swift+backend+eqiad&hide-hf=false [19:15:01] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:31] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.257 second response time [19:15:32] (03PS1) 10Odder: Add a Library of Congress domain to whitelist [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141308 (https://bugzilla.wikimedia.org/66945) [19:15:51] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 68902 bytes in 0.460 second response time [19:16:01] which would coincide with many uploads [19:17:53] do we have metrics on that? 
[19:18:05] (upload count) [19:19:01] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:06] not that I know of, perhaps there's something in graphite but a mw expert can confirm/deny [19:19:14] ok I'm silencing that alarm [19:19:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:31] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:31] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:37] yikes [19:19:41] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:01] bd808|BUFFER perhaps knows if we have upload count metrics, matanya twkozlowski ? [19:20:03] cpu load is lt. 15% [19:20:12] in mw1154 [19:20:31] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.672 second response time [19:20:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data exceeded the critical threshold [500.0] [19:21:11] (03Restored) 10Hoo man: Increase apache MaxClients to 23 in order to have 40 more scaling slots [operations/puppet] - 10https://gerrit.wikimedia.org/r/127632 (owner: 10Hoo man) [19:21:16] Not that I am aware of, godog [19:21:21] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.072 second response time [19:21:26] As in, no graphs or anything [19:21:31] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.816 second response time [19:21:31] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.025 second response time [19:21:34] I can manually fetch metrics from the DB [19:22:01] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 68902 bytes in 7.139 second response time [19:22:56] MariaDB [commonswiki_p]> SELECT COUNT(*) FROM logging WHERE log_type = 'upload' AND log_timestamp LIKE "20140622%"; [19:22:58] 12340 [19:23:08] so nothing special [19:23:19] 14653 yesterday [19:23:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 21.43% of data exceeded the critical threshold [500.0] [19:23:31] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.025 second response time [19:24:22] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.072 second response time [19:24:22] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.080 second response time [19:24:31] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.026 second response time [19:24:42] !log silenced LVS healthcheck on rendering.svc until 23:23 UTC [19:24:48] Logged the message, Master [19:25:28] godog: :-) [19:25:35] hello akosiaris_ [19:25:43] akosiaris_: hey, yeah that was getting old :) [19:25:52] Hey matanya [19:26:44] godog: yeah [19:26:47] akosiaris_: same problem as this morning, however upload rate now seems ok judging from upload.log [19:27:42] hmm, I can not shake the feeling though it is just the same problem as a couple of weeks ago [19:28:29] as in gwtoolset related uploads 
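To see whether those uploads arrive in bursts (gwtoolset-style batches) rather than at a flat rate, hoo's count above can be broken down per hour; a sketch against the same commonswiki_p logging table, connection details omitted:

  # log_timestamp is YYYYMMDDHHMMSS, so the first 10 characters give an hour bucket
  mysql commonswiki_p -e "
    SELECT SUBSTR(log_timestamp, 1, 10) AS hr, COUNT(*) AS uploads
    FROM logging
    WHERE log_type = 'upload' AND log_timestamp LIKE '20140622%'
    GROUP BY hr ORDER BY hr;"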
[19:28:49] (03PS2) 10Hoo man: Increase apache MaxClients to 23 in order to have 40 more scaling slots [operations/puppet] - 10https://gerrit.wikimedia.org/r/127632 [19:30:16] akosiaris_: could be, the object rate change in swift has been elevated since this morning [19:30:49] on which wikis do those end up? commons, I guess [19:30:54] godog: i'm just pointing out _joe_ moved the swift boxes to puppet3 on friday [19:32:13] hoo: yes commons I believe [19:32:35] matanya: mmhh swift seems fine to me I think but thanks ! [19:33:21] akosiaris_: you refer to https://commons.wikimedia.org/wiki/User_talk:F%C3%A6/2014#Large_file_uploads ? [19:35:27] matanya: yes [19:35:50] I didn't see any abnormal uploads today [19:35:59] and sundays are usually quiet [19:36:17] Yep, the numbers look fairly normal to me also [19:40:31] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [19:41:50] Well, 20th of April was Sunday too. [19:42:06] Not that that matters [19:42:39] Numbers are more important [19:44:27] hoo: you were right btw swift frontend network is at capacity [19:45:09] (03Abandoned) 10Hoo man: Increase apache MaxClients to 23 in order to have 40 more scaling slots [operations/puppet] - 10https://gerrit.wikimedia.org/r/127632 (owner: 10Hoo man) [19:45:24] Feared that :/ [19:51:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:52:10] (03PS3) 1001tonythomas: Removed primary hostname to support custom Return-Path [operations/puppet] - 10https://gerrit.wikimedia.org/r/141287 [19:52:31] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.124 second response time [19:53:25] (03PS4) 1001tonythomas: Updated exim errors_to to support custom Return-Path [operations/puppet] - 10https://gerrit.wikimedia.org/r/141287 [19:57:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:35] (03CR) 1001tonythomas: "@Faidon:- errors_to = wiki@wikimedia.org always overwrites any Return-Path set in $headers.
Editing this makes the Return-Path the require" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141287 (owner: 1001tonythomas) [19:58:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:58:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:58:32] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:58:32] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:59:22] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.041 second response time [19:59:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:59:41] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:00:21] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.628 second response time [20:00:22] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.710 second response time [20:00:31] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.346 second response time [20:00:31] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.775 second response time [20:00:31] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.269 second response time [20:03:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:06:21] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.861 second response time [20:06:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:06:31] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:31] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.403 second response time [20:09:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.248 second response time [20:11:22] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.911 second response time [20:12:31] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.830 second response time [20:17:50] ideas on how to lessen the load on imagescalers/swift? or better where the load is coming from [20:18:34] upload caches do seem a bit busier [20:19:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:19:32] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:32] mark: So, we got more hits on images/ thumbs? [20:20:45] not drastically it seems [20:22:16] the top referers on a random upload cache (backend!) 
I checked are mobile [20:22:22] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.335 second response time [20:22:32] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.858 second response time [20:22:32] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:32] but more misses, I guess [20:24:31] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.746 second response time [20:26:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:26:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:26:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:26:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data exceeded the critical threshold [500.0] [20:26:31] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:26:32] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:26:41] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:26:46] (03CR) 10Steinsplitter: [C: 031] beta.labs: logo and favicon change for dewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141305 (owner: 10Se4598) [20:27:31] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.210 second response time [20:27:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:28:37] what happened to ms-be1012 yesterday? [20:31:31] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.257 second response time [20:31:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.057 second response time [20:31:47] mark: found this in SAL 16:12 _joe_: restarted ms-be1012, see http://paste.debian.net/106247/ for console output [20:32:09] seems to be around the time the extra load on the image scalers started [20:32:12] but: http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&m=swift_object_change&h=Swift+eqiad+prod&c=Swift+eqiad [20:32:31] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.641 second response time [20:32:31] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.659 second response time [20:32:32] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.272 second response time [20:32:41] xfs issues ? 
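A couple of quick checks for the "xfs issues ?" question, run on ms-be1012 itself; a sketch only, assuming the usual Ubuntu log locations:

  # any XFS or I/O complaints in the kernel ring buffer / kernel log since the restart?
  dmesg | grep -Ei 'xfs|i/o error' | tail -n 40
  grep -i xfs /var/log/kern.log | tail -n 40
  # crude check: list any xfs filesystem that is no longer mounted rw (i.e. remounted read-only after errors)
  mount -t xfs | grep -v '(rw'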
[20:32:54] at the time one user had reported slow responses from upload.wm.o (no details) [20:35:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:36:31] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.480 second response time [20:36:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:36:32] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:36:38] also swift frontend machines are at network capacity [20:37:21] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.842 second response time [20:37:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:37:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:37:31] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:02] also seeing some of these on thumbnail.log, but could be a red herring as I don't know if they are normal 2014-06-22 20:37:27 mw1154 commonswiki: Thumbnail failed on mw1154: could not get local copy of "Karl_Marx.jpg" [20:38:15] (03CR) 10Steinsplitter: [C: 04-1] "I was told that $stdlogo should be used for beta, can you please rework this?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141305 (owner: 10Se4598) [20:38:31] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.335 second response time [20:38:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.541 second response time [20:39:31] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.487 second response time [20:39:31] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.016 second response time [20:39:31] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.901 second response time [20:43:21] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time [20:48:29] godog: Those are mostly images that went missing somehow... that's nothing special, just (hopefully) old stuff we no longer have around [20:49:17] hoo: oh ok, so "local copy" means local to swift not local to the imagescaler [20:53:10] it means the image scaler couldn't download the image from swift (local copy) to operate on [20:55:54] what mark said [20:57:29] (03PS3) 10Se4598: beta.labs: logo and favicon change for dewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141305 [20:59:31] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [21:04:21] (03CR) 10Hoo man: [C: 032] "Nothing controversial, visually separating beta and production makes sense (consensus)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141305 (owner: 10Se4598) [21:04:46] (03Merged) 10jenkins-bot: beta.labs: logo and favicon change for dewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141305 (owner: 10Se4598) [21:06:04] !log hoo Synchronized wmf-config/InitialiseSettings-labs.php: For cluster consistency... (duration: 00m 08s) [21:06:09] Logged the message, Master [21:07:05] ok thanks!
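For the "could not get local copy" entries, one check that needs no cluster shell at all is to follow the public redirect for the original and see whether it still resolves; only a sketch, and it exercises the public path (varnish plus swift) rather than swift internals:

  # -I: HEAD only, -L: follow the Special:FilePath redirect through to upload.wikimedia.org
  curl -sIL "https://commons.wikimedia.org/wiki/Special:FilePath/Karl_Marx.jpg" | grep -E '^(HTTP|Location)'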
[21:07:51] i just went over a few originals for which thumbs were being created [21:07:55] none of those originals were new [21:08:22] of course new thumbnails can be created any time [21:08:27] but that's kinda curious at this rate [21:08:55] some harvester going through? One of these random image from commons sites? ... [21:11:35] and creating new thumbnail sizes then [21:12:11] possible, yes [21:16:32] there are a lot of PUTs: 20DELETE%202xx%22)&target=alias(secondYAxis(movingAverage(sumSeries(swift.eqiad-prod.*.proxy-server.object.POST.2*.timing.rate),5)),%22object%20POST%202xx%22)> [21:17:34] that's my impression too [21:17:37] but where are they coming from [21:18:02] i see a lot of weird thumb sizes in the logs [21:18:08] but then again, that's hard to say [21:18:11] wouldn't that be the imagescalers uploading back the thumbs? [21:18:24] 627px etc ;) [21:18:29] godog: yes [21:25:32] what's an example of such a thumb? [22:04:49] I think it's caused by this user agent: User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B176 Safari/7534.48.3 WikiLinks/2.12.0 [22:05:37] but I'm off now [22:08:41] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: Connection timed out [22:09:41] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 69110 bytes in 0.599 second response time [22:09:57] http://wikilinks.net/ ? [22:11:17] Nemo_bis: sounds plausible [22:15:31] PROBLEM - puppetmaster backend https on strontium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error [22:15:42] Nemo_bis: is that new? [22:16:19] hasn't been updated since may [22:17:04] Wouldn't be the first time something using a false UA [22:17:16] sure [22:20:04] ori: Not sure how long it takes for apple to approve etc. updates... not sure why someone would fake exactly that UA [22:20:24] dunno, i stopped debugging this [22:20:31] wasn't getting anywhere and there are smarter people around [22:21:07] nobody really got anywhere... the problems just stopped at some point [22:25:31] RECOVERY - puppetmaster backend https on strontium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.980 second response time [22:25:54] <_joe_> !log restarted apache on strontium, passenger crashed (again). [22:25:59] Logged the message, Master [22:26:04] <_joe_> one gets back home... [22:54:02] Is mailman being sad? Mails I send to a specific list aren't showing up in the archives or reported as being received by the other list members. [22:57:25] lfaraone: Those are probably stuck queued https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=mchenry.wikimedia.org&r=hour&z=default&jr=&js=&st=1403468761&v=545&m=exim%20queued%20messages&vl=messages&z=large
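On the mailman question that closes the log: a quick way to confirm whether messages really are sitting in the exim queue on the mail relay (mchenry, per the ganglia link above) would be something like the following; sudo access there and the stock Debian/Ubuntu mailman paths are assumptions:

  # how many messages exim currently has queued
  sudo exim -bpc
  # oldest queue entries, with sender and recipients
  sudo exim -bp | head -n 30
  # mailman's own outbound queue directory (path assumed)
  sudo ls -l /var/lib/mailman/qfiles/out | head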