[01:51:53] * jgage looks at ms-be3003 [01:56:09] [Sat Jun 21 23:16:29 2014] end_request: I/O error, dev sdk, sector 3409757152 [02:13:53] !log LocalisationUpdate completed (1.24wmf9) at 2014-06-22 02:12:50+00:00 [02:14:04] Logged the message, Master [02:21:48] so ms-be3003 has a 55gb root partition at 100% full, but i don't see that space being used in /var /usr /home /root /etc .. suspect that maybe /srv/swift-storage/ has been written to under one of the mount points (on the root partition). [02:22:08] i want to stop swift and unmount the swift-storage disks to check [02:22:27] checking whether i need to depool ms-be3003 or something first.. [02:24:53] !log LocalisationUpdate completed (1.24wmf10) at 2014-06-22 02:23:50+00:00 [02:24:59] Logged the message, Master [02:26:00] https://wikitech.wikimedia.org/wiki/Swift/How_To#Remove_a_failed_storage_node_from_the_cluster [02:26:39] this doesn't seem to be a crisis, i've never touched this stuff before, and the rest of the team are afk. i think i'll wait to make any changes. [02:31:28] ACKNOWLEDGEMENT - Disk space on ms-be3003 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=94%): Jeff Gage See discussion on IRC, sdk issues but root partition full [02:42:36] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Jun 22 02:41:30 UTC 2014 (duration 41m 29s) [02:42:40] Logged the message, Master [05:54:11] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Sun 22 Jun 2014 02:53:13 UTC [06:19:04] (03PS1) 10Yuvipanda: toollabs: Add phantomjs package [operations/puppet] - 10https://gerrit.wikimedia.org/r/141280 (https://bugzilla.wikimedia.org/66928) [06:19:08] legoktm: ^ [06:19:12] :D [06:19:21] (03PS2) 10Legoktm: toollabs: Add phantomjs package [operations/puppet] - 10https://gerrit.wikimedia.org/r/141280 (https://bugzilla.wikimedia.org/66928) (owner: 10Yuvipanda) [06:19:32] it said "Cannot Merge" for some reason [06:19:44] (03CR) 10Legoktm: [C: 031] "Thanks!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141280 (https://bugzilla.wikimedia.org/66928) (owner: 10Yuvipanda) [06:32:51] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Sun Jun 22 06:32:42 UTC 2014 [08:03:37] has anyone reported database connection issues yet? [08:03:52] i.e. 
"Cannot contact the database server: Too many connections (10.64.16.10)" [08:04:29] nope [08:04:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data exceeded the critical threshold [500.0] [08:04:46] just happened to me, but hasn't happened again [08:05:01] <_joe_> kaldari: the alarm you see here ^ may be related [08:05:50] <_joe_> sorry I'm not able to look right now if it's not a straightforward outage [08:06:18] that's OK, just wanted to share :) [08:06:37] <_joe_> it's been a transinet problem anyway [08:06:43] <_joe_> thanks :) [08:06:55] <_joe_> https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color(cactiStyle(alias(reqstats.500,%22500%20resp/min%22)),%22red%22)&target=color(cactiStyle(alias(reqstats.5xx,%225xx%20resp/min%22)),%22blue%22) [08:07:13] <_joe_> kaldari: you've been one of the 25K users served with an error [08:07:57] ah, looks like a blip of evil [08:08:43] <_joe_> it could be worth looking into, not at 9 am on a sunday in june :) [08:17:31] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [08:57:31] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:58:01] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:58:21] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.067 second response time [08:59:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:00:01] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 69035 bytes in 8.072 second response time [09:00:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:02:31] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.162 second response time [09:03:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:06:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:06:31] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:08:01] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:10:01] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 69035 bytes in 8.598 second response time [09:10:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.659 second response time [09:10:41] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [09:12:31] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.231 second response time [09:19:01] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:20:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:20:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:21:31] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.918 second response time [09:21:31] RECOVERY - Apache HTTP on mw1157 is OK: 
HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.609 second response time [09:21:51] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 69035 bytes in 0.482 second response time [09:27:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:27:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:27:41] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:28:01] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:28:21] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.568 second response time [09:28:21] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.072 second response time [09:28:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:28:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:28:31] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:28:32] mh got a slew of paging, looking into what's up with rendering on pybal [09:29:01] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 69035 bytes in 9.301 second response time [09:29:31] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.214 second response time [09:29:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.132 second response time [09:30:31] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.891 second response time [09:30:41] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.654 second response time [09:31:31] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.751 second response time [09:34:31] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.951 second response time [09:34:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:34:31] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:35:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:35:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:35:31] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:37:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:38:12] looks like the imagescalers got all busier at the same time [09:38:31] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.951 second response time [09:38:31] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.703 second response time [09:38:31] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.849 second response time [09:39:04] http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&tab=v&vn=swift+backend+eqiad&hide-hf=false swift object change graph has a nice spike [09:39:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.501 
second response time [09:39:31] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.582 second response time [09:39:32] Hey [09:39:34] I don't see more uploads than from yesterday at this time though, just a quick look at commons rc anyways. I did not try looking at sample logs [09:39:54] morning [09:40:01] hey apergos akosiaris_ ! [09:40:31] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.039 second response time [09:40:32] maybe I should not be looking at successful uploads but at attempts [09:40:36] What is happening? [09:41:26] akosiaris_: I think the imagescalers are chugging through a backlog and have all apache workers busy, why I don't know yet http://ganglia.wikimedia.org/latest/?r=2hr&cs=&ce=&m=ap_idle_workers&s=by+name&c=Image+scalers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [09:41:58] apergos: so perhaps just a whole lot of uploads? [09:42:34] well they would have had to fail I guess, in order not to show up in rc [09:43:04] or maybe there is a bunch of stuff in the job queue, this has happened before, bad coordination with mass uploads of very large images [09:43:15] which then fail to scale [09:43:32] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:43:32] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:43:41] but when I looked at the scaler jobs I didn't see a bunch of things hung or multiples of the same image being scaled, so dunno [09:44:31] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.274 second response time [09:45:21] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.969 second response time [09:48:32] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:48:41] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:48:47] It seems to have died down. Or is it me? [09:49:26] It is me [09:49:30] heh [09:50:21] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.632 second response time [09:50:41] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.804 second response time [09:53:14] bah I'm looking at thumbnail.log but nothing jumps my eye [09:54:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:56:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:57:21] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.609 second response time [09:57:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:59:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.380 second response time [09:59:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:01:21] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.074 second response time [10:01:21] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.046 second response time [10:05:46] mh what would be the job queue group name with the resize jobs? 
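One way to answer the job-queue question above, assuming shell access to a MediaWiki maintenance host and the usual mwscript wrapper (both assumptions here), is to dump per-type job counts for commonswiki. This is only a sketch, and as noted just below, the thumbnail scaling itself may not go through the job queue at all:

  # counts per job type (queued / claimed / abandoned); same output format quoted later in this log
  mwscript showJobs.php --wiki=commonswiki --group
  # bare total, if you only care whether a backlog exists at all
  mwscript showJobs.php --wiki=commonswiki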
[10:06:10] I was trying to look for where these images are coming from [10:08:44] I don't know, I'd have to dig around in there [10:09:44] most uploads aren't queued unless that's been changed since last time I looked at it (quite possible) [10:09:51] I mean the scaling isn't queued [10:10:42] ah ok! nevermind then I thought it was queued [10:11:42] the case I was thinking f was a special deal with a bulk upload tool [10:18:55] gwtoolset? [10:19:38] whatever fae wass using for uploads (I think that was the user) [10:20:31] anyways.... it's been nearly 20 mins now and no whines [10:21:25] what sounds like the right tool though [10:25:24] I"m going to wander off (trying to find parts for a tool I need... on a Sunday :-/), hope things stay quiet [10:25:41] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:25:44] grrrrrr [10:25:50] I totally jinxd it [10:29:00] haha [10:29:17] :( [10:29:17] Not totally... yet [10:31:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:32:21] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.915 second response time [10:33:31] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.556 second response time [10:36:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:37:01] so yeah what seems like bulk uploads is sth like 1403431665802American_Football_EM_2014_-_AUT-DEU_-239.JPG [10:37:21] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.069 second response time [10:37:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:37:32] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:38:01] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:38:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:38:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:38:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:38:32] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:38:41] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:39:21] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time [10:39:21] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.390 second response time [10:39:31] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.819 second response time [10:39:31] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.214 second response time [10:39:31] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.406 second response time [10:39:51] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 69035 bytes in 0.552 second response time [10:40:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.267 second response time [10:41:31] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.047 second 
response time [11:16:44] (03CR) 10Tim Landscheidt: [C: 031] toollabs: Add phantomjs package [operations/puppet] - 10https://gerrit.wikimedia.org/r/141280 (https://bugzilla.wikimedia.org/66928) (owner: 10Yuvipanda) [11:18:41] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [11:45:18] (03PS1) 1001tonythomas: Removed primary hostname to support custom Return-Path [operations/puppet] - 10https://gerrit.wikimedia.org/r/141287 [11:47:00] (03PS2) 1001tonythomas: Removed primary hostname to support custom Return-Path [operations/puppet] - 10https://gerrit.wikimedia.org/r/141287 [11:54:11] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Sun 22 Jun 2014 08:53:03 UTC [12:57:11] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.001 second response time [13:05:11] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.009 second response time [13:13:01] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Sun Jun 22 13:12:51 UTC 2014 [13:35:31] (03CR) 10Faidon Liambotis: [C: 04-1] "This is multiple different changes into one. I think they're all wrong :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141287 (owner: 1001tonythomas) [13:55:17] (03PS2) 10Odder: Change some user group rights on ruwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140910 (https://bugzilla.wikimedia.org/66871) [14:04:32] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data exceeded the critical threshold [500.0] [14:17:31] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [15:34:45] (03CR) 10Andrew Bogott: [C: 032] toollabs: Add phantomjs package [operations/puppet] - 10https://gerrit.wikimedia.org/r/141280 (https://bugzilla.wikimedia.org/66928) (owner: 10Yuvipanda) [15:54:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:57:22] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.069 second response time [15:57:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:00:31] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.513 second response time [16:03:41] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 8 below the confidence bounds [16:17:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:18:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:19:21] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.055 second response time [16:20:21] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.048 second response time [16:39:41] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [17:13:06] (03PS1) 10Se4598: beta.labs: logo and favicon change for dewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141305 [17:14:27] (03PS2) 10Se4598: beta.labs: logo and favicon change for dewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141305 [17:18:36] (03CR) 10Umherirrender: [C: 031] beta.labs: logo and favicon change for dewiki [operations/mediawiki-config] - 
10https://gerrit.wikimedia.org/r/141305 (owner: 10Se4598) [17:20:31] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:21:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:21] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.069 second response time [17:25:41] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:27:31] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.070 second response time [17:29:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:31:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:31:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:32:31] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.950 second response time [17:32:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.236 second response time [17:33:31] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.909 second response time [17:36:21] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.071 second response time [17:37:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:38:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:39:41] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:21] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.625 second response time [17:40:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:40:31] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:01] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:21] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.045 second response time [17:41:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:41:31] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.726 second response time [17:41:32] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:42:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.303 second response time [17:42:31] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.407 second response time [17:42:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data exceeded the critical threshold [500.0] [17:42:41] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.078 second response time [17:43:01] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 69035 bytes in 6.783 second response time [17:43:21] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.072 second 
response time [17:46:21] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.070 second response time [17:47:49] godog: if we were talking here there'd be commons admins and volunteers to jump in and help :) [17:48:13] * ori introduces godog to matanya and twkozlowski [17:48:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:48:41] Apache seems a bit unhappy today. [17:49:21] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.430 second response time [17:52:02] indeed, it is the imagescalers being overloaded with work to do [17:53:44] godog: can i help ? [17:53:52] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Image%20scalers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1403459578&g=load_report&z=large [17:53:55] what's happened here? [17:56:06] matanya twkozlowski there seem to be batch uploads overloading imagescalers (running out of free apache workers essentially) [17:56:18] i see puppet saturating cpu on mw1158 too, but maybe that's ephemeral [17:56:50] godog: Yes, but I see no unusual activity. [17:56:57] Just folks uploading files. [17:57:26] ori: it is an image scaler, i guess it has something to do with it [17:57:49] twkozlowski: ok I'm looking again [17:57:53] i don't think so, because the spike in cpu corresponds to a spike in network traffic [17:58:02] godog: can you find the offending file/s ? [17:58:26] mw1155 appears to be down [17:58:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:31] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [17:59:21] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.498 second response time [17:59:44] (or is Ganglia lying to me? :) [18:00:16] matanya: yep looking [18:00:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:00:39] twkozlowski: there's been quite some activity this morning, not sure how it is how [18:01:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.621 second response time [18:01:55] * matanya wishes he had shell [18:02:33] matanya: Don't we all :p [18:06:21] godog: seems like it mostly recovered [18:07:38] matanya: the idle workers are not many tho http://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=ap_idle_workers&s=by+name&c=Image+scalers+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [18:08:22] godog: maybe some thing is in a transcode loop or something ? [18:08:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:08:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:08:39] can you look at the transcoding logs ? 
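Anyone with shell who wanted to act on the "can you look at the transcoding logs?" request could start with something like the sketch below; the log path is an assumption (wherever thumbnail.log and the transcode logs actually live on the log aggregation host), so adjust accordingly:

  LOG=/a/mw-log/thumbnail.log   # assumed location, adjust to the real log host path
  # most recent failure lines
  grep -i 'failed' "$LOG" | tail -n 20
  # crude ranking of which originals keep reappearing in recent entries (quoted filename field assumed)
  tail -n 20000 "$LOG" | grep -o '"[^"]*"' | sort | uniq -c | sort -rn | head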
[18:09:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:09:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:01] i think i found the issue godog [18:10:03] https://commons.wikimedia.org/wiki/Special:TimedMediaHandler [18:10:21] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.070 second response time [18:10:22] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.166 second response time [18:10:31] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.080 second response time [18:10:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:31] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:33] 1006 in the queue [18:10:54] 306 failed, and retrying those [18:11:22] so it seems like the queue is full [18:11:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.782 second response time [18:11:31] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.353 second response time [18:12:23] or this report is a lie [18:12:54] mmhh so there might be pressure from that too now [18:13:32] oh, what's up? Scalers hammered down? [18:13:48] how many items do you see in the queue on the server side ? [18:13:50] yes hoo [18:15:04] matanya: on commonswiki? webVideoTranscode: 0 queued; 180 claimed (32 active, 148 abandoned); 0 delayed [18:15:21] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.125 second response time [18:15:24] that is suspicious [18:16:01] I see at least 10 that seem queueed [18:16:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:17:17] e.g https://commons.wikimedia.org/wiki/File:Rame_MCL80_M%C3%A9tro_C_de_Lyon_21062014.ogv [18:18:00] which i just restarted [18:19:21] mhh [18:20:32] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:20:32] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:20:32] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:01] godog: is 1158 down ? [18:21:18] nope [18:21:31] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.763 second response time [18:21:59] matanya: what hoo said :) looks up [18:23:03] puppet run on mw1160 also :/ [18:23:26] godog: i'll let you debug, as i can't really help without shell. 
if you need anything, please do poke [18:23:31] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.975 second response time [18:25:21] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.052 second response time [18:25:31] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.944 second response time [18:28:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:29:21] matanya: the problem is with image scalers btw not video scalers [18:29:21] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.071 second response time [18:29:24] puppet run on mw1160 also :/ [18:29:29] whoops [18:29:31] wrong window [18:29:47] wrong window [18:37:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:38:21] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.065 second response time [18:39:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:41:21] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.077 second response time [18:41:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:41:41] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.115 second response time [18:45:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:46:34] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.059 second response time [18:46:34] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:46:34] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:46:41] PROBLEM - Host cp4018 is DOWN: PING CRITICAL - Packet loss = 100% [18:47:31] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.553 second response time [18:47:31] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.330 second response time [18:48:21] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time [18:59:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:01:31] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.811 second response time [19:01:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:02:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.535 second response time [19:03:22] yikes [19:03:44] godog: are the nics of the swift boxes saturated? 
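To answer the NIC question directly on one of the swift frontends instead of reading it off ganglia, a quick sketch (assumes the sysstat package is installed and that the relevant interface is eth0; both are assumptions):

  # per-interface throughput, three 5-second samples
  sar -n DEV 5 3 | grep -E 'IFACE|eth0'
  # fallback without sysstat: interface name, RX bytes, TX bytes from the kernel counters, sampled twice
  awk '/eth0/ {print $1, $2, $10}' /proc/net/dev; sleep 5; awk '/eth0/ {print $1, $2, $10}' /proc/net/dev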
[19:04:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:04:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:05:32] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.610 second response time [19:06:31] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:06:31] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:07:21] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.771 second response time [19:08:22] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.084 second response time [19:08:22] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.440 second response time [19:08:30] hoo: mmh a bit high but sustainable I think http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=bytes_out&s=by+name&c=Swift+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [19:09:57] no news from librenms too [19:10:41] mh [19:11:19] maybe we should think about https://gerrit.wikimedia.org/r/127632 again [19:12:31] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:13:17] indeed, I was expecting the imagescalers to be fully cpu-bound but that doesn't seem the case [19:13:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:14:11] nowhere close, actually... all workers busy and the load is still rather moderate [19:14:21] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time [19:14:32] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:14:32] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:14:52] anyways I couldn't find anything obviously wrong, except that the number of objects making it to swift is unusally high http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&tab=v&vn=swift+backend+eqiad&hide-hf=false [19:15:01] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:31] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.257 second response time [19:15:32] (03PS1) 10Odder: Add a Library of Congress domain to whitelist [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141308 (https://bugzilla.wikimedia.org/66945) [19:15:51] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 68902 bytes in 0.460 second response time [19:16:01] which would coincide with many uploads [19:17:53] do we have metrics on that? 
[19:18:05] (upload count) [19:19:01] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:06] not that I know of, perhaps there's something in graphite but a mw expert can confirm/deny [19:19:14] ok I'm silencing that alarm [19:19:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:31] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:31] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:37] yikes [19:19:41] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:01] bd808|BUFFER perhaps knows if we have upload count metrics, matanya twkozlowski ? [19:20:03] cpu load is lt. 15% [19:20:12] in mw1154 [19:20:31] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.672 second response time [19:20:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data exceeded the critical threshold [500.0] [19:21:11] (03Restored) 10Hoo man: Increase apache MaxClients to 23 in order to have 40 more scaling slots [operations/puppet] - 10https://gerrit.wikimedia.org/r/127632 (owner: 10Hoo man) [19:21:16] Not that I am aware of, godog [19:21:21] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.072 second response time [19:21:26] As in, no graphs or anything [19:21:31] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.816 second response time [19:21:31] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.025 second response time [19:21:34] I can manually fetch metrics from the DB [19:22:01] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 68902 bytes in 7.139 second response time [19:22:56] MariaDB [commonswiki_p]> SELECT COUNT(*) FROM logging WHERE log_type = 'upload' AND log_timestamp LIKE "20140622%"; [19:22:58] 12340 [19:23:08] so nothing special [19:23:19] 14653 yesterday [19:23:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 21.43% of data exceeded the critical threshold [500.0] [19:23:31] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.025 second response time [19:24:22] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.072 second response time [19:24:22] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.080 second response time [19:24:31] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.026 second response time [19:24:42] !log silenced LVS healthcheck on rendering.svc until 23:23 UTC [19:24:48] Logged the message, Master [19:25:28] godog: :-) [19:25:35] hello akosiaris_ [19:25:43] akosiaris_: hey, yeah that was getting old :) [19:25:52] Hey matanya [19:26:44] godog: yeah [19:26:47] akosiaris_: same problem as this morning, however upload rate now seems ok judging from upload.log [19:27:42] hmm, I can not shake the feeling though it is just the same problem as a couple of weeks ago [19:28:29] as in gwtoolset related uploads 
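To see whether those uploads arrive in bursts (gwtoolset-style batches) rather than at a flat rate, hoo's count above can be broken down per hour; a sketch against the same commonswiki_p logging table, connection details omitted:

  # log_timestamp is YYYYMMDDHHMMSS, so the first 10 characters give an hour bucket
  mysql commonswiki_p -e "
    SELECT SUBSTR(log_timestamp, 1, 10) AS hr, COUNT(*) AS uploads
    FROM logging
    WHERE log_type = 'upload' AND log_timestamp LIKE '20140622%'
    GROUP BY hr ORDER BY hr;"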
[19:28:49] (03PS2) 10Hoo man: Increase apache MaxClients to 23 in order to have 40 more scaling slots [operations/puppet] - 10https://gerrit.wikimedia.org/r/127632 [19:30:16] akosiaris_: could be, the object rate change in swift has been elevated since this morning [19:30:49] on which wikis do those end up? commons, I guess [19:30:54] godog: i'm just pointing out _joe_ moved the swift boxes to puppet3 on friday [19:32:13] hoo: yes commons I believe [19:32:35] matanya: mmhh swift seems fine to me I think but thanks ! [19:33:21] akosiaris_: you refer to https://commons.wikimedia.org/wiki/User_talk:F%C3%A6/2014#Large_file_uploads ? [19:35:27] matanya: yes [19:35:50] I didn't see any abnormal uploads today [19:35:59] and sundays are usually quiet [19:36:17] Yep, the numbers look fairly normal to me also [19:40:31] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [19:41:50] Well, 20th of April was Sunday too. [19:42:06] Not that that matters [19:42:39] Numbers are more important [19:44:27] hoo: you were right btw swift frontend network is at capacity [19:45:09] (03Abandoned) 10Hoo man: Increase apache MaxClients to 23 in order to have 40 more scaling slots [operations/puppet] - 10https://gerrit.wikimedia.org/r/127632 (owner: 10Hoo man) [19:45:24] Feared that :/ [19:51:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:52:10] (03PS3) 1001tonythomas: Removed primary hostname to support custom Return-Path [operations/puppet] - 10https://gerrit.wikimedia.org/r/141287 [19:52:31] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 4.124 second response time [19:53:25] (03PS4) 1001tonythomas: Updated exim errors_to to support custom Return-Path [operations/puppet] - 10https://gerrit.wikimedia.org/r/141287 [19:57:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:35] (03CR) 1001tonythomas: "@Faidon:- errors_to = wiki@wikimedia.org always overwrites any Return-Path set in $headers.
Editing this makes the Return-Path the require" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141287 (owner: 1001tonythomas) [19:58:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:58:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:58:32] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:58:32] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:59:22] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.041 second response time [19:59:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:59:41] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:00:21] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.628 second response time [20:00:22] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.710 second response time [20:00:31] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.346 second response time [20:00:31] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.775 second response time [20:00:31] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.269 second response time [20:03:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:06:21] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.861 second response time [20:06:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:06:31] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:31] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.403 second response time [20:09:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.248 second response time [20:11:22] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 2.911 second response time [20:12:31] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.830 second response time [20:17:50] ideas on how to lessen the load on imagescalers/swift? or better where the load is coming from [20:18:34] upload caches do seem a bit busier [20:19:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:19:32] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:32] mark: So, we got more hits on images/ thumbs? [20:20:45] not drastically it seems [20:22:16] the top referers on a random upload cache (backend!) 
I checked are mobile [20:22:22] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.335 second response time [20:22:32] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.858 second response time [20:22:32] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:32] but more misses, I guess [20:24:31] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.746 second response time [20:26:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:26:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:26:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:26:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data exceeded the critical threshold [500.0] [20:26:31] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:26:32] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:26:41] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:26:46] (03CR) 10Steinsplitter: [C: 031] beta.labs: logo and favicon change for dewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141305 (owner: 10Se4598) [20:27:31] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.210 second response time [20:27:31] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:28:37] what happened to ms-be1012 yesterday? [20:31:31] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.257 second response time [20:31:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.057 second response time [20:31:47] mark: found this in SAL 16:12 _joe_: restarted ms-be1012, see http://paste.debian.net/106247/ for console output [20:32:09] seems to be around the time the extra load on the image scalers started [20:32:12] but: http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&m=swift_object_change&h=Swift+eqiad+prod&c=Swift+eqiad [20:32:31] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 9.641 second response time [20:32:31] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.659 second response time [20:32:32] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.272 second response time [20:32:41] xfs issues ? 
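A couple of quick checks for the "xfs issues ?" question, run on ms-be1012 itself; a sketch only, assuming the usual Ubuntu log locations:

  # any XFS or I/O complaints in the kernel ring buffer / kernel log since the restart?
  dmesg | grep -Ei 'xfs|i/o error' | tail -n 40
  grep -i xfs /var/log/kern.log | tail -n 40
  # crude check: list any xfs filesystem that is no longer mounted rw (i.e. remounted read-only after errors)
  mount -t xfs | grep -v '(rw'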
[20:32:54] at the time one user had reported slow responses from upload.wm.o (no details) [20:35:31] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:36:31] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.480 second response time [20:36:31] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:36:32] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:36:38] also swift frontend machines are at network capacity [20:37:21] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.842 second response time [20:37:31] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:37:31] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:37:31] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:02] also seeing some of these on thumbnail.log, but could be a red herring as I don't know if they are normal 2014-06-22 20:37:27 mw1154 commonswiki: Thumbnail failed on mw1154: could not get local copy of "Karl_Marx.jpg" [20:38:15] (03CR) 10Steinsplitter: [C: 04-1] "I was told that $stdlogo should be used for beta, can you please rework this?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141305 (owner: 10Se4598) [20:38:31] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.335 second response time [20:38:31] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.541 second response time [20:39:31] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 6.487 second response time [20:39:31] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 7.016 second response time [20:39:31] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 8.901 second response time [20:43:21] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time [20:48:29] godog: Those are mostly images that went missing somehow... that's nothing special, just (hopefully) old stuff we no longer have around [20:49:17] hoo: oh ok, so "local copy" means local to swift not local to the imagescaler [20:53:10] it means the image scaler couldn't download the image from swift (local copy) to operate on [20:55:54] what mark said [20:57:29] (03PS3) 10Se4598: beta.labs: logo and favicon change for dewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141305 [20:59:31] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [21:04:21] (03CR) 10Hoo man: [C: 032] "Nothing controversial, visually separating beta and production makes sense (consensus)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141305 (owner: 10Se4598) [21:04:46] (03Merged) 10jenkins-bot: beta.labs: logo and favicon change for dewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141305 (owner: 10Se4598) [21:06:04] !log hoo Synchronized wmf-config/InitialiseSettings-labs.php: For cluster consistency... (duration: 00m 08s) [21:06:09] Logged the message, Master [21:07:05] ok thanks!
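For the "could not get local copy" entries, one check that needs no cluster shell at all is to follow the public redirect for the original and see whether it still resolves; only a sketch, and it exercises the public path (varnish plus swift) rather than swift internals:

  # -I: HEAD only, -L: follow the Special:FilePath redirect through to upload.wikimedia.org
  curl -sIL "https://commons.wikimedia.org/wiki/Special:FilePath/Karl_Marx.jpg" | grep -E '^(HTTP|Location)'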
[21:07:51] i just went over a few originals for which thumbs were being created [21:07:55] none of those originals were new [21:08:22] of course new thumbnails can be created any time [21:08:27] but that's kinda curious at this rate [21:08:55] some harvester going through? One of these random image from commons sites? ... [21:11:35] and creating new thumbnail sizes then [21:12:11] possible, yes [21:16:32] there are a lot of PUTs: 20DELETE%202xx%22)&target=alias(secondYAxis(movingAverage(sumSeries(swift.eqiad-prod.*.proxy-server.object.POST.2*.timing.rate),5)),%22object%20POST%202xx%22)> [21:17:34] that's my impression too [21:17:37] but where are they coming from [21:18:02] i see a lot of weird thumb sizes in the logs [21:18:08] but then again, that's hard to say [21:18:11] wouldn't that be the imagescalers uploading back the thumbs? [21:18:24] 627px etc ;) [21:18:29] godog: yes [21:25:32] what's an example of such a thumb? [22:04:49] I think it's caused by this user agent: User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B176 Safari/7534.48.3 WikiLinks/2.12.0 [22:05:37] but I'm off now [22:08:41] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: Connection timed out [22:09:41] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 69110 bytes in 0.599 second response time [22:09:57] http://wikilinks.net/ ? [22:11:17] Nemo_bis: sounds plausible [22:15:31] PROBLEM - puppetmaster backend https on strontium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error [22:15:42] Nemo_bis: is that new? [22:16:19] hasn't been updated since may [22:17:04] Wouldn't be the first time something using a false UA [22:17:16] sure [22:20:04] ori: Not sure how long it takes for apple to approve etc. updates... not sure why someone would fake exactly that UA [22:20:24] dunno, i stopped debugging this [22:20:31] wasn't getting anywhere and there are smarter people around [22:21:07] nobody really got anywhere... the problems just stopped at some point [22:25:31] RECOVERY - puppetmaster backend https on strontium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.980 second response time [22:25:54] <_joe_> !log restarted apache on strontium, passenger crashed (again). [22:25:59] Logged the message, Master [22:26:04] <_joe_> one gets back home... [22:54:02] Is mailman being sad? Mails I send to a specific list aren't showing up in the archives or reported as being received by the other list members. [22:57:25] lfaraone: Those are probably stuck queued https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20pmtpa&h=mchenry.wikimedia.org&r=hour&z=default&jr=&js=&st=1403468761&v=545&m=exim%20queued%20messages&vl=messages&z=large
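On the mailman question that closes the log: a quick way to confirm whether messages really are sitting in the exim queue on the mail relay (mchenry, per the ganglia link above) would be something like the following; sudo access there and the stock Debian/Ubuntu mailman paths are assumptions:

  # how many messages exim currently has queued
  sudo exim -bpc
  # oldest queue entries, with sender and recipients
  sudo exim -bp | head -n 30
  # mailman's own outbound queue directory (path assumed)
  sudo ls -l /var/lib/mailman/qfiles/out | head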