[00:06:18] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [00:17:52] New patchset: Asher; "seeting a global mysql PS1 prompt [rt 3091]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29517 [00:18:09] binasher: where's the sql query stats? [00:18:38] https://ishmael.wikimedia.org [00:18:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29517 [00:19:18] thanks [00:19:27] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29517 [00:21:11] does anyone know what happened to the job queue CPU usagE? [00:21:27] http://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&c=Application+servers+pmtpa&m=load_one&s=by+name&mc=2&g=cpu_report [00:22:07] TimStarling: i think the change is when the job queue was switched to dedicated hosts only [00:23:10] I see [00:24:01] a lot of the new job runners have those glitches in the CPU graphs, with 2^31% wait CPU [00:24:07] is that a precise bug? [00:24:44] maybe it's 2^64% [00:24:51] it's a lot of CPU anyway [00:25:36] i'm not seeing it on the few i've checked [00:25:58] http://ganglia.wikimedia.org/latest/graph.php?h=mw8.pmtpa.wmnet&m=cpu_report&r=week&s=by%20name&hc=4&mc=2&st=1350951935&g=cpu_report&z=medium&c=Jobrunners%20pmtpa [00:26:44] oh yeah, it shows up on the weekly view [00:27:39] The avg number seems to be roughly 2^45 [00:27:59] that looks like a gmond bug [00:29:08] I see we roll our own ganglia-monitor packages [00:30:01] RECOVERY - MySQL Replication Heartbeat on es10 is OK: OK replication delay 0 seconds [00:30:45] RECOVERY - MySQL Slave Delay on es10 is OK: OK replication delay 0 seconds [00:32:15] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [00:32:15] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [00:46:09] it looks like ganglia has never had any bounds checks or counter overflow detection in its CPU metrics [00:46:23] so maybe it is a kernel change [00:47:54] fatal.log is mostly just acwiki and arwiki, heh [00:51:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:52:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.867 seconds [01:07:21] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [01:16:19] it's not the kernel [01:16:50] unless the counter goes backwards [01:17:16] the kernel counters agree with the uptime [01:17:26] so they didn't overflow [01:24:18] PROBLEM - Puppet freshness on srv220 is CRITICAL: Puppet has not run in the last 10 hours [01:24:18] PROBLEM - Puppet freshness on srv219 is CRITICAL: Puppet has not run in the last 10 hours [01:27:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:33:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.036 seconds [01:43:21] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 298 seconds [01:46:39] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 15 seconds [01:54:03] could anybody clear NFS? 
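The 2^31% and 2^45 figures discussed above are what an unsigned counter wrap looks like once it is pushed through a delta-based percentage. A minimal sketch, assuming gmond derives CPU percentages from deltas of /proc/stat jiffy counters; the bit widths and the formula here are illustrative, not gmond's actual code:

```python
# Minimal sketch (not gmond's actual code) of how an unsigned-delta
# CPU percentage blows up when a /proc/stat counter goes backwards.
# The 32/64-bit widths and the percentage formula are assumptions
# for illustration only.

def cpu_percent(prev_iowait, cur_iowait, prev_total, cur_total, bits=32):
    mask = (1 << bits) - 1
    # Unsigned wrap-around subtraction: a counter that merely went
    # backwards by one tick becomes a delta of roughly 2**bits.
    d_iowait = (cur_iowait - prev_iowait) & mask
    d_total = (cur_total - prev_total) & mask
    return 100.0 * d_iowait / d_total if d_total else 0.0

# Normal sample: 5 of 100 jiffies spent in iowait -> 5%
print(cpu_percent(1000, 1005, 20000, 20100))           # 5.0

# Counter moved backwards by one tick between samples:
print(cpu_percent(1005, 1004, 20000, 20100))           # ~4.3e9 %, i.e. ~2**32
print(cpu_percent(1005, 1004, 20000, 20100, bits=64))  # ~1.8e19 %, i.e. ~2**64
```

An average of "roughly 2^45" in the weekly graph would then just be one such spike smeared across the averaging window, which is consistent with a counter wrap rather than any real load figure.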
https://bugzilla.wikimedia.org/show_bug.cgi?id=41008 [01:59:53] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 243 seconds [02:06:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds [02:29:34] !log LocalisationUpdate completed (1.21wmf2) at Tue Oct 23 02:29:31 UTC 2012 [02:29:48] Logged the message, Master [02:32:15] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [02:43:16] I/O wait issue: http://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&me=Wikimedia&m=load_one&s=by+name&mc=2&g=cpu_report [02:44:02] it does make the graph rather hard to read [02:47:36] RECOVERY - Puppet freshness on srv219 is OK: puppet ran at Tue Oct 23 02:47:12 UTC 2012 [02:50:06] RECOVERY - Puppet freshness on srv220 is OK: puppet ran at Tue Oct 23 02:49:55 UTC 2012 [02:55:38] !log LocalisationUpdate completed (1.21wmf1) at Tue Oct 23 02:55:38 UTC 2012 [02:55:52] Logged the message, Master [03:20:15] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [03:26:15] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [03:40:57] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [03:43:12] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [03:50:51] PROBLEM - MySQL Replication Heartbeat on db1001 is CRITICAL: CRIT replication delay 242 seconds [03:52:05] PROBLEM - MySQL Slave Delay on db1001 is CRITICAL: CRIT replication delay 303 seconds [03:57:18] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [03:58:57] RECOVERY - MySQL Replication Heartbeat on db1001 is OK: OK replication delay 13 seconds [04:00:00] RECOVERY - MySQL Slave Delay on db1001 is OK: OK replication delay 0 seconds [04:47:15] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [05:17:21] !log tstarling synchronized php-1.21wmf2/includes/SpecialPage.php [05:17:33] Logged the message, Master [05:25:48] PROBLEM - MySQL Replication Heartbeat on db1050 is CRITICAL: CRIT replication delay 182 seconds [05:26:43] PROBLEM - MySQL Slave Delay on db1050 is CRITICAL: CRIT replication delay 187 seconds [05:53:52] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/29278 [06:09:18] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [06:20:42] RECOVERY - MySQL Slave Delay on db1050 is OK: OK replication delay 27 seconds [06:20:51] RECOVERY - MySQL Replication Heartbeat on db1050 is OK: OK replication delay 26 seconds [06:40:35] !log krinkle synchronized php-1.21wmf2/resources/mediawiki/mediawiki.searchSuggest.js 'Ibcc56f64' [06:40:48] Logged the message, Master [06:42:13] !log krinkle synchronized php-1.21wmf1/resources/mediawiki/mediawiki.searchSuggest.js 'Ibcc56f64' [06:42:25] Logged the message, Master [07:22:43] good morning [07:41:08] RECOVERY - MySQL Slave Delay on es1001 is OK: OK replication delay NULL seconds [07:46:05] PROBLEM - MySQL Slave Delay on es1001 is CRITICAL: CRIT replication delay 2733579 seconds [07:50:01] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [07:54:59] New patchset: J; "rename ::jobserver to mediawiki_new::jobserver" [operations/puppet] (production) - 
https://gerrit.wikimedia.org/r/29324 [07:56:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29324 [07:57:43] paravoid: moved files and templates and updated locations (https://gerrit.wikimedia.org/r/#/c/29324/) if you can have another look [08:26:17] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:31:06] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.578 second response time [08:34:34] j^: I will [08:44:15] New patchset: Mark Bergsma; "Collect duplicate headers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29546 [08:45:09] j^: have you tested it on that labs machine? [08:45:24] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29546 [08:45:29] paravoid: yes, would not have gotten the paths right otherwise [08:49:05] :-) [08:49:52] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29324 [08:51:51] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find template 'jobrunner/mw-job-runner.default.erb' at /var/lib/git/operations/puppet/manifests/jobrunner.pp:31 on node mw1.pmtpa.wmnet [08:51:52] yay [08:52:19] that's from a production jobrunner unfortunately [08:52:20] looking [08:53:45] it fixed itself [08:53:49] puppet was slow to reload probably [08:53:58] because that file doesn't exist anymore, heh [08:54:12] okay, good to go [08:54:20] and verified it won't kill production jobrunners :) [08:54:41] j^: thanks for all the attempts and sorry you had to get through all those hoops [08:55:35] New patchset: J; "enable videoscaler on tmh1/tmh2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29547 [08:56:21] paravoid: once those 2 boxes run the code, it was worth it [08:56:38] hehe [08:56:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29547 [08:56:47] so, care to share a few details on how this works? [08:56:54] are you doing transcodes via ffmpeg? [08:57:07] I remember a discussion about ffmpeg packages but it was in a different context [08:57:29] I don't think we've ever got a mail of how TMH works from an operations perspective, but I may be wrong [08:57:36] I've been told it's an old project, so it might predate me :) [08:57:55] to transcodes are done using ffmpeg (for WebM) and ffmpeg2theora (for Ogg) [08:58:25] all required features are in the packages that are in precise so no need for custom packages anymore [08:58:30] yay! [08:58:44] esp. for ffmpeg/libav, it's a HELL to package [08:58:45] i think i wrote something to the list some time ago [08:58:51] they break ABIs in every version or so [08:59:09] yes, commandline api too [09:00:02] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [09:00:43] so I see we have ffmpeg/ffmpeg2theora on imagescaler::packages [09:00:58] but imagescalers need them too to generate thumbs I presume? 
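For reference, a rough sketch of the two conversions described above (WebM via ffmpeg, Ogg via ffmpeg2theora), assuming the stock precise packages; the flags, quality settings and file names are illustrative assumptions, not the videoscalers' actual settings:

```python
# Rough sketch of the two transcode invocations mentioned above
# (WebM via ffmpeg, Ogg via ffmpeg2theora). Flags, qualities and
# file names are illustrative assumptions, not TMH's settings.
import subprocess

def transcode_webm(src, dst, height=360):
    subprocess.check_call([
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale=-1:{height}",
        "-c:v", "libvpx", "-b:v", "512k",
        "-c:a", "libvorbis",
        dst,
    ])

def transcode_ogv(src, dst):
    subprocess.check_call([
        "ffmpeg2theora", src,
        "--videoquality", "5",
        "-o", dst,
    ])

transcode_webm("source_upload.ogv", "derivative.360p.webm")
transcode_ogv("source_upload.ogv", "derivative.360p.ogv")
```

The surrounding job (fetching the source from the file backend, storing the derivative back, updating the database) is handled by MediaWiki, so the sketch only covers the conversion step.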
[09:01:29] other than that from the ops perspective its 2 new boxes (later possibly more) that are jobrunners that only handle webVideoTranscode jobs [09:01:32] those are basically take video from filebackend; convert to other format; insert new format into filebackend and update database [09:01:58] * paravoid nods [09:02:05] yes imagescalers need ffmpeg / ffmpeg2theora to extract thumbs from videos [09:02:38] once they all run precise and we only use TMH we can remove some of those packages(i.e. oggvideotools) [09:02:52] that's okay, who cares for an extra package or two [09:03:08] I'm just trying to understand which role does what :) [09:03:26] and what's the plan here for the TMH rollout? [09:03:50] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29547 [09:04:03] right now waiting for the imagescaler update to precise, after that enable it on some more wikis (mediawiki.org/foundation) [09:04:21] some /more/? is it enabled somewhere already? [09:04:26] if that goes well enable transcoding, after that commons and all other wikis [09:04:33] its currently enabled on test2 [09:04:46] http://test2.wikipedia.org/wiki/User:JanGerber/sandbox [09:05:15] (force-running puppet on tmh1/2 [09:05:22] broken frame in 3rd video is due to imagescalers running and old version of ffmpeg [09:05:38] (will be fixed with precise update) [09:05:53] and which machine runs the videoscaler job here? [09:06:18] for test2? [09:06:27] you an use TMH without transcoding, thats what we do right now [09:06:47] aha [09:06:52] err: Could not retrieve catalog from remote server: Error 400 on SERVER: $ganglia_clusters[$cluster] is not an hash or array when accessing it with ip_oct at /var/lib/git/operations/puppet/manifests/ganglia.pp:150 on node tmh1.pmtpa.wmnet [09:06:59] yay [09:07:14] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:08:44] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.471 second response time [09:09:42] so the ganglia cluster is derived from the ::common classes' group [09:10:25] mark: is there some logic behind ganglia's ip_oct? [09:10:43] no [09:10:52] okay :) [09:10:57] they're kind of unordered heh [09:11:03] legacy [09:11:07] k :) [09:11:25] thx [09:13:54] j^: fixing [09:15:33] mark: another question: don't the tmh1/tmh2 names violate our naming policy? :) [09:15:51] are they in eqiad? then yes [09:15:53] we don't have imgscaler1/2/3 [09:15:55] no they're in pmtpa [09:16:07] but they're not mwN [09:16:26] * mark shrugs [09:16:33] who cares [09:16:35] I don't [09:18:36] New patchset: Mark Bergsma; "Temporarily pipe .ogv videos from frontend to backend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29548 [09:18:59] varnish is sending content-length twice [09:19:04] i'm wondering if that's what confuses firefox [09:19:19] have you reproduced it? [09:19:37] New patchset: Faidon; "Add a "video scalers" ganglia group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29549 [09:19:37] i've reproduced the lack of video length yes [09:19:45] ff 15 on my gf's ubuntu is showing it [09:19:51] but not ff 17? [09:20:04] no but currently I can't even get that to respect /etc/hosts :P [09:20:11] which was working before [09:20:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29548 [09:20:38] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29549 [09:21:01] so what i'm gonna do now temporarily, is to pipe requests for videos on the frontend [09:21:03] to see if that makes a difference [09:21:07] takes care of the duplicate headers [09:22:12] brb [09:23:36] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29549 [09:27:06] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:27:56] mark: should I merge your change? [09:30:22] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.525 second response time [09:56:30] oh oops [09:56:35] thought i already did [09:56:35] yes please [09:57:49] done [09:58:15] my thinking being "it'll be deployed when I come out of the shower" ;) [09:58:33] haha [09:58:59] it'd be nice to be able to deploy varnish configs as we deploy squid [09:59:09] push rather than pull [09:59:17] and quicker than a full puppet run [09:59:37] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:01:15] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.652 second response time [10:07:06] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [10:15:13] j^: tmh1 is done, puppet's running on tmh2 now [10:33:05] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [10:33:05] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [10:35:28] lunch [10:37:24] see [10:37:33] now i'm piping the request to the backend [10:37:33] and firefox works [11:08:01] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [11:20:38] !log depooling ms-fe2, in preparation for ubuntu/swift upgrade [11:20:52] Logged the message, Master [11:26:17] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:27:48] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.760 second response time [11:36:56] PROBLEM - Host ms-fe2 is DOWN: PING CRITICAL - Packet loss = 100% [11:42:38] RECOVERY - Host ms-fe2 is UP: PING OK - Packet loss = 0%, RTA = 1.72 ms [11:45:21] sigh, what a broken fs config [11:45:23] fs/partman [11:45:49] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:46:45] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: Connection refused [11:46:50] PROBLEM - Memcached on ms-fe2 is CRITICAL: Connection refused [11:47:26] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.269 second response time [11:47:35] PROBLEM - SSH on ms-fe2 is CRITICAL: Connection refused [12:06:11] PROBLEM - NTP on ms-fe2 is CRITICAL: NTP CRITICAL: No response from NTP server [12:11:53] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:15:05] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.710 second response time [12:33:02] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [12:47:14] i think we found the content length issue [12:47:22] building a new package now [12:53:22] !log Inserted new varnish 3.0.3plus~rc1-wm3 packages into the precise-wikimedia APT repository, with a duplicate content-length header fix on streaming [12:53:35] Logged the message, Master [12:53:45] yay [12:55:34] New patchset: Mark Bergsma; "Install 
varnish 3.0.3~rc1-wm3 (with duplicate CL header fix), remove workaround" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29564 [12:56:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29564 [12:57:16] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29564 [13:07:08] New patchset: Mark Bergsma; "Unset Content-Length header when starting streaming to avoid duplicates" [operations/debs/varnish] (patches/streaming-range) - https://gerrit.wikimedia.org/r/29565 [13:07:42] New patchset: Mark Bergsma; "Unset Content-Length header when starting streaming to avoid duplicates" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29566 [13:07:42] New patchset: Mark Bergsma; "Don't exit on fflush() error" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29567 [13:07:42] New patchset: Mark Bergsma; "varnish (3.0.3plus~rc1-wm3) precise; urgency=low" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29568 [13:08:24] Change merged: Mark Bergsma; [operations/debs/varnish] (patches/streaming-range) - https://gerrit.wikimedia.org/r/29565 [13:08:56] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29566 [13:09:14] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29567 [13:09:33] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29568 [13:19:20] New patchset: Matthias Mullie; "(bug 40672) Abuse filter: Increase 5% limit to allow filtering for very short posts" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29571 [13:19:22] argh [13:19:25] but it doesn't fix the firefox issue [13:19:53] New review: Matthias Mullie; "Do not merge before these are merged:" [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/29571 [13:21:02] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [13:25:14] New patchset: Mark Bergsma; "Remove the duplicate Accept-Ranges header" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29572 [13:26:15] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29572 [13:27:06] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [13:37:06] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:44] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.137 second response time [13:42:13] solved [13:42:17] it was the duplicate Accept-Ranges header [13:43:33] \o/ [13:44:01] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [13:44:24] can you test it? 
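One way to test for exactly this class of problem is to issue a ranged request and inspect the raw header list before any client library collapses duplicates. A minimal sketch using http.client, which keeps repeated headers as separate entries; the host and path below are placeholders:

```python
# Minimal check for duplicate response headers (e.g. Content-Length,
# Accept-Ranges, Age) on a ranged request. http.client keeps repeated
# headers as separate (name, value) pairs instead of joining them,
# which is what we need here. The host and path are placeholders.
import http.client
from collections import Counter

def duplicate_headers(host, path, byte_range="0-0"):
    conn = http.client.HTTPConnection(host)
    conn.request("GET", path, headers={"Range": f"bytes={byte_range}"})
    resp = conn.getresponse()
    resp.read()   # drain the (tiny) ranged body
    conn.close()
    counts = Counter(name.lower() for name, _ in resp.getheaders())
    return resp.status, {h: n for h, n in counts.items() if n > 1}

status, dupes = duplicate_headers(
    "upload.example.org",
    "/wikipedia/commons/x/xx/Some_video.ogv")
print(status, dupes or "no duplicate headers")
```

A client library that joins duplicates would show "Accept-Ranges: bytes, bytes", which (per the guess above) is presumably what Firefox ends up with internally, so the raw pair list is the more useful view for this test.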
[13:45:02] sure [13:45:20] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:09] probably firefox collects them internally to "Accept-Ranges: bytes, bytes" at which point it no longer recognizes it [13:46:50] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.895 second response time [13:47:46] wfm [13:48:20] oh wait, that was with esams [13:48:21] haha [13:48:33] stupid firefox dns cache [13:48:49] yes [13:49:10] make sure to do a full cache refresh of the video as well [13:49:12] otherwise ff might already know the video length [13:52:21] takes a while but I'm not sure if it's just bad throughput to eqiad... [13:52:37] aborted due to a network error [13:54:50] ok, so I know why [13:54:52] I tried with another video [13:55:23] T+42.9 sent the GET [13:55:33] T+43.1 got the tcp ack [13:55:41] data started flowing at 74.5 [13:55:55] a miss? [13:56:00] if that takes even longer, I presume Firefox times out [13:56:29] cp1021 hit, cp 1026 miss [13:56:44] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:56:47] that's only streaming from backend to frontend then [13:56:50] should start very quickly [13:57:40] hang on, let me double check that [13:58:06] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [13:59:53] yep, double-checked it [14:00:09] let me get you the headers from the cap [14:00:19] GET /wikipedia/commons/1/1f/Monthly_Metrics_Meeting_February_2%2C_2012_%28360p%29.ogv HTTP/1.1 [14:00:28] Range: bytes=700153856- [14:00:59] HTTP/1.1 206 Partial Content [14:01:11] X-Cache: cp1021 hit (1) [14:01:17] X-Cache-frontend: cp1026 miss (0) [14:01:26] Content-Range: bytes 700153856-700183371/700183372 [14:01:26] Accept-Ranges: bytes [14:01:26] Content-Length: 29516 [14:02:26] and that took like 30s for the first byte to come in? [14:02:30] yes [14:02:37] strange [14:03:26] aha, immediately before that [14:03:36] Request to first data byte: 127 ms [14:03:36] it requested bytes=0- [14:03:39] for that same request [14:03:52] ah, on the same connection? [14:03:53] upload started to send that immediately [14:04:07] but then my firefox closed the connection (I see multiple RST) [14:04:13] tcp rst [14:04:50] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.128 second response time [14:06:43] same file, different connection (port +1) [14:07:56] hmm [14:07:59] lemme try to simulate that [14:08:36] The Ogg format doesn't encapsulate the duration of media, so for the progress bar on the video controls to display the duration of the video, Gecko needs to determine the length of the media using other means. [14:08:40] There are two ways Gecko can do this. The best way is to offer an X-Content-Duration header when serving Ogg media files. This header provides the duration of the video in seconds (not in HH:MM:SS format) as a floating-point value. [14:08:44] If your server provides the X-Content-Duration header when serving Ogg media, Gecko doesn't have to do any extra HTTP requests to seek to the end of the file to calculate its duration. This makes the entire process much more efficient as well as more accurate. [14:08:50] As an inferior alternative, Gecko can estimate the video length based on the Content-Length. See next point. 
[14:08:52] https://developer.mozilla.org/en-US/docs/Configuring_servers_for_Ogg_media [14:08:56] that's why it seeks to the end [14:08:58] yes [14:09:04] I guessed that, so my Googling was very efficient :P [14:09:17] that's why I made sure range requests are now working for high ranges ;) [14:09:26] but we can set X-Content-Duration [14:09:34] swift can hold arbitrary headers [14:09:38] yes we should [14:09:45] but this should work too [14:09:49] yeah [14:10:21] i tried to simulate it on this DSL line, with two firstbyte-range invocations and one second in between [14:10:26] was very quick [14:10:36] so i'm not sure why your second request would be stalling [14:11:50] wouldn't that happen if streaming was not enabled? [14:12:05] yes [14:12:08] but it is enabled for this file [14:12:51] do we have a header that gets set when streaming is enabled? [14:13:05] no not anymore [14:13:08] as it was a bit misleading [14:13:18] it entered the cache and would be served on cached results [14:13:22] oh, hah [14:13:49] chrome does not support the X-Content-Duration header so even if we add it, chrome will still do range requests [14:14:12] heh, good to have a media expert around :) [14:14:51] webm has the index/duration at the start so that will also reduce the inital range requests [14:16:27] this is what I did to simulate: http://p.defau.lt/?hvZEmAof6GxBTv_2_VWggw [14:17:29] paravoid: can that header be added to objects after they're already in swift? [14:17:38] yes [14:17:40] (afaik) [14:17:47] we should probably have aaron make a batch job to do that [14:17:52] besides inserting it with new objects [14:18:05] last time i asked about changing mimetype i was told it would not be possible [14:18:15] but might be possible now [14:18:34] I'm not sure, so let me recheck [14:19:45] does swift store them in filesystem attributes? [14:19:49] yeah [14:19:55] well, in one single attribute [14:24:14] yeah, with POST you can update just the metadata [14:24:19] at least that's what I understand from the docs [14:24:29] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:24:38] the arbitrary header feature is not documented although it's definitely there in the code :-) [14:24:53] shall I make a bugzilla ticket? [14:24:58] initially they supported only X-Object-Meta-* but it seems they've added arbitrary headers [14:25:00] or do you want to? [14:25:14] I can do it [14:25:19] thanks, cc me [14:25:28] j^: should this be against TMH? [14:26:08] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.605 second response time [14:26:30] paravoid: File->getLength() in core should work ( * Get the duration of a media file in seconds [14:26:54] this will return the right value with TMH and should return the right value with OggHandler right now [14:27:14] the bug to set X-Content-Duration I meant [14:27:27] i'm gonna make varnish filter those headers now [14:27:41] mark: which ones? [14:27:42] x-content-duration? [14:27:43] Age, Accept-Ranges [14:27:59] the current hack for accept-ranges is stupid, and age is harder to deduplicate [14:28:03] I just want it to behave well [14:28:03] paravoid: ah ok, yes set it to TMH, it might have to be done in core though [14:28:36] but this affects OggHandler too, doesn't it? [14:31:34] paravoid: i think the patch will go to core/includes/filebackend/SwiftFileBackend.php [14:31:44] no changes needed in OggHandler and TMH [14:31:54] okay, I'll file it in core [14:31:58] thanks. 
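In the Swift API a POST to an object URL rewrites its metadata without re-uploading the data, so existing uploads can be given X-Content-Duration after the fact. A minimal sketch over the plain HTTP API; the storage URL, auth token, container/object names and duration value are all placeholders:

```python
# Minimal sketch of setting X-Content-Duration on an object that is
# already in Swift, via the plain Swift HTTP API (an object POST
# updates metadata without re-uploading the data). Storage URL, token,
# container/object names and the duration value are placeholders.
import requests

STORAGE_URL = "http://ms-fe.example.internal/v1/AUTH_mw"   # placeholder
TOKEN = "AUTH_tk_..."                                       # placeholder

def set_duration(container, obj, seconds):
    resp = requests.post(
        f"{STORAGE_URL}/{container}/{obj}",
        headers={
            "X-Auth-Token": TOKEN,
            # Floating-point seconds, per the MDN description quoted above.
            "X-Content-Duration": f"{seconds:.3f}",
        },
    )
    resp.raise_for_status()   # Swift answers 202 Accepted on success

set_duration("wikipedia-commons-local-public.d2",   # placeholder container
             "d/d2/Some_video.ogv", 2012.5)
```

Two caveats, both hedged: an object POST replaces the object's existing custom metadata rather than merging with it, so a batch job should re-send anything it wants to keep; and whether the proxy passes an arbitrary header like X-Content-Duration through (rather than only X-Object-Meta-*) depends on the Swift version and configuration discussed above, so it is worth verifying on a single object before running any batch update.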
[14:32:09] !log reedy synchronized php-1.21wmf1/extensions/PostEdit 'Update PostEdit to master' [14:32:17] aude: ^ [14:32:23] Logged the message, Master [14:34:25] Reedy: thanks [14:35:30] paravoid: whats the bug number for X-Content-Duration, will subscribe to the bug [14:35:44] I'm filing it now :) [14:37:19] Do we have a way of finding out the source of a request? Getting a tonne of api requests asking for parse on special pages... which is resulting in a huge spam of exceptions [14:38:08] new varnish build coming up [14:38:15] https://bugzilla.wikimedia.org/show_bug.cgi?id=41304 [14:38:20] mark: bugzilla says no match for your mail address [14:38:31] maybe they can not throw an exception? fail in some other way? [14:45:07] mark: want the pcap for the 30s time for first byte? [14:45:17] I'd have to filter it out a bit, since it's 125MB :) [14:45:31] sure [14:47:14] there [14:47:17] those headers nicely filtered [14:47:24] let me roll out this build ;) [14:47:26] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:48:04] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [14:48:11] PROBLEM - Host ms-fe2 is DOWN: PING CRITICAL - Packet loss = 100% [14:48:56] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [14:49:14] !log reedy synchronized php-1.21wmf2/includes/api/ApiParse.php 'Dont parse special pages' [14:49:23] !log removing srv194 from apaches pool for drive error [14:49:24] Logged the message, Master [14:49:36] Logged the message, notpeter [14:49:44] !log Inserted new varnish build 3.0.3plus~rc1-wm4 into precise-wikimedia, fixing duplicate header issue [14:49:51] Logged the message, Master [14:50:07] notpeter: gahhh I saw the nagios alerts, thought to check then forgot about it [14:50:26] heh, no prob :) [14:50:26] New patchset: Mark Bergsma; "Roll a new Varnish build which filters duplicate headers Age and Accepted-Ranges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29585 [14:50:30] !log temporarily stopping puppet on brewster [14:50:42] Logged the message, Master [14:50:49] Ahhhh, that's better [14:50:53] I'm in a state where I'm looking after the apaches. your attention is elsewhere :) [14:51:07] (and I'm thankful that it's elsewhere, on swift :) ) [14:51:31] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29585 [14:51:42] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29585 [14:53:56] RECOVERY - Host ms-fe2 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [14:57:11] !log reedy synchronized php-1.21wmf2/includes/api/ApiParse.php 'Dont parse special pages' [14:57:21] PROBLEM - Apache HTTP on srv194 is CRITICAL: Connection refused [14:57:24] Logged the message, Master [14:58:01] New patchset: Mark Bergsma; "Filter the Age and Accept-Ranges HTTP headers before readding on deliver" [operations/debs/varnish] (patches/filter-headers) - https://gerrit.wikimedia.org/r/29588 [14:58:34] New patchset: Mark Bergsma; "Filter the Age and Accept-Ranges HTTP headers before readding on deliver" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29589 [14:58:34] New patchset: Mark Bergsma; "varnish (3.0.3plus~rc1-wm4) precise; urgency=low" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29590 [14:59:01] Change merged: Mark Bergsma; [operations/debs/varnish] (patches/filter-headers) - https://gerrit.wikimedia.org/r/29588 [14:59:25] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29589 [14:59:41] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29590 [15:02:29] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:03:02] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:03:40] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:03:58] rendering about to go down [15:04:54] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:05:01] you mean you're about to do that? [15:05:14] or outage? [15:05:16] no, 4/7 hosts are looking rather down in nagios [15:05:36] although I'm on one right now and things don't look awful... [15:05:50] 6/7 are down [15:05:51] wtf [15:05:53] PROBLEM - Apache HTTP on srv190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:36] 7/7 [15:06:40] ja [15:06:47] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:52] and there we go [15:07:01] heh [15:07:21] I'm on srv190, no load or anything [15:07:29] yeah [15:07:32] try to curl one, though [15:07:32] PROBLEM - Apache HTTP on srv219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:33] times out bad [15:07:50] PROBLEM - Apache HTTP on srv220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:11] ganglia too, that's strnage [15:08:40] both aggregators are totally overloaded [15:08:48] but neither ram nor cpu.... [15:10:52] we have low limits on those apaches on purpose [15:11:37] Anybody able to help with some technical glitches in deleting a file? Ironholds from #wikipedia-en-admins told me to go here. [15:11:52] https://en.wikipedia.org/wiki/File:Sandesh_Kadur_portrait.jpg [15:12:03] ugh oh [15:12:09] paravoid: ? [15:12:12] ""Error deleting file: Could not delete file "mwstore://local-swift/local-public/d/d2/Sandesh_Kadur_portrait.jpg". Happened twice. [15:12:13] swift... [15:12:20] or nfs [15:12:36] FutPerf: we're looking at it, thanks. [15:12:39] No idea what swift is, but that's what Ironholds too thought it would be. [15:12:42] Thanks. 
[15:12:46] There was also another one: [15:13:00] https://en.wikipedia.org/wiki/File:Walter_Mercado.jpg [15:13:07] MySQL error: 1205: Lock wait timeout exceeded; try restarting transaction (10.0.6.73) [15:13:08] paravoid: load from rebalancing swift? [15:13:10] nfs looks fine [15:13:36] swuft doesn't [15:13:43] ja [15:13:45] but I'm not sure if it's rendering cascaded to swift [15:13:47] or the other way around [15:13:47] swift response times are through the roof [15:13:50] yeah [15:14:27] swift cpu had a very steep rise around 14:50 [15:14:27] could be vid requests going to swift backend? [15:14:41] on ms-fe3 and 4 [15:14:45] yes [15:14:56] ms-fe1 but a bit better there [15:14:56] RECOVERY - SSH on ms-fe2 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:14:57] There's a tonne of swift warnings apparing in the apache logs [15:15:17] ms-be3 is melting [15:15:27] goddamn it [15:15:55] swift is at 500 Mbps in ganglia [15:16:00] hehe [15:16:04] ms-be3 is saturating its GigE [15:16:05] same outage [15:16:12] it has a huge i/o wait too [15:16:17] and I don't think I've patched that [15:16:23] you know what to do [15:16:32] why isn't swift caching it though [15:16:37] er [15:16:37] squid [15:17:35] 2.6G ogg [15:17:44] nice [15:17:51] not sure which yet [15:17:56] New patchset: Andrew Bogott; "Add "role::mediawiki-update::labs" for updating labs-MW-installations" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28355 [15:18:34] The_MediaWiki_Web_API_and_How_to_use_it_-_San_Francisco_Wikipedia_Hackathon_2012.ogv [15:19:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28355 [15:19:19] 879 fds open with that file only on ms-be3 [15:20:01] well it's not gonna increase throughput [15:20:09] as the GigEs are already saturated on both backends and frontends [15:20:14] so we need to fix the squid level [15:20:38] fixing i/o wait is a good idea nevertheless [15:20:42] sure [15:21:04] !log manually patching swift on ms-be3 to comment-out the drop_cache() calls [15:21:17] Logged the message, Master [15:21:29] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 59891 bytes in 0.182 seconds [15:21:35] ffs [15:22:09] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [15:22:23] RECOVERY - Apache HTTP on srv190 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [15:22:23] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.363 second response time [15:22:32] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [15:22:32] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.639 second response time [15:22:41] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.046 second response time [15:23:26] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.071 second response time [15:25:05] PROBLEM - swift-container-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [15:25:26] i'm not inclined to fix squid [15:25:26] PROBLEM - swift-account-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [15:25:29] i'm inclined to put load on varnish tomorrow ;) [15:25:32] PROBLEM - 
swift-object-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [15:26:44] RECOVERY - swift-container-server on ms-be3 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [15:26:53] RECOVERY - swift-account-server on ms-be3 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [15:27:11] RECOVERY - swift-object-server on ms-be3 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [15:27:50] mark: we might have to [15:27:56] it's not just ms-be3's network that saturated [15:27:58] it was the proxies too [15:28:06] that's what I said [15:28:22] oh did you [15:28:24] was busy patching :) [15:30:59] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:31:05] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:31:13] !log reedy synchronized php-1.21wmf2/includes/EditPage.php [15:31:23] Logged the message, Master [15:31:59] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:32:17] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:32:46] squid is not caching that video atm [15:33:14] I can't see the video squid logs [15:33:21] could it be videoscalers? [15:33:27] could be [15:33:30] the new videoscalers that we deployed today [15:33:30] i have no idea [15:33:56] PROBLEM - Apache HTTP on srv190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:34:11] dont think so, they had not activity so far [15:34:11] i assume you can see where the requests are coming from? [15:34:23] PROBLEM - Apache HTTP on srv220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:34:37] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Video%2520scalers%2520pmtpa&tab=m&vn= [15:34:39] as soon as I can find them [15:34:41] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:35:17] PROBLEM - Apache HTTP on srv219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:35:32] definitely that video though [15:35:50] the frontends have falled back on ms-be12, also saturated [15:37:33] mark: at least through sq56.wikimedia.org [15:38:16] yes that's the CARP host [15:38:36] New review: jan; "1) There have to be some includes in mediawiki.pp: " [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/28355 [15:39:06] http://www.mediawiki.org/wiki/API:Main_page includes the video [15:39:56] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:40:08] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:41:00] j^: I was looking to an ipv6 ip that had some hits [15:41:05] and it was you [15:41:09] (via sixxs, via your 6bone handle) [15:41:21] easy to find [15:41:26] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.195 second response time [15:41:35] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.601 second response time [15:42:04] paravoid: but that was just me looking for places that video is linked from [15:42:09] yeah [15:43:23] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:44:26] mark: I think it's that swift bug [15:44:30] where it does the full fetch [15:44:34] nice [15:44:39] even if you 
request the range [15:44:43] it's just a guess [15:44:52] yeah i'm not seeing much on the squids or in the logs [15:44:53] but I can't explain how they have a full 1gbps in [15:44:53] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.495 second response time [15:44:56] but no out [15:45:26] can I go to dinner? [15:46:20] may i (grammer burn) [15:46:54] =P [15:46:59] RECOVERY - Apache HTTP on srv190 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.061 second response time [15:47:17] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.302 second response time [15:47:22] !log analytics1011-1022 will be dying today, byebye c2100 [15:47:22] heh [15:47:27] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.053 second response time [15:47:27] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.052 second response time [15:47:27] I restarted the proxy on ms-fe1 [15:47:34] Logged the message, RobH [15:47:44] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 59879 bytes in 0.211 seconds [15:47:56] ffs [15:48:11] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.052 second response time [15:48:11] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [15:48:17] drdee: did you wanna login to any of the c2100s and tell them to go to hell before i pull them? ;] [15:48:41] plus yer the only one on your team around that i see and i just wanna let you know im pulling them [15:48:47] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [15:50:48] paravoid: i'm going to dinner, but if you need help, please text/call me [15:52:32] mark: seems to have recovered [15:52:36] after the proxy restart [15:52:39] but thanks :) [15:54:35] gah, again [15:55:22] !log analytics1012-1022 are starting wipe and having network cable pulled [15:55:35] Logged the message, RobH [15:56:04] all three proxies saturated [16:02:08] PROBLEM - Host analytics1022 is DOWN: PING CRITICAL - Packet loss = 100% [16:02:29] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [16:02:29] PROBLEM - Host analytics1021 is DOWN: PING CRITICAL - Packet loss = 100% [16:02:35] PROBLEM - Host analytics1019 is DOWN: PING CRITICAL - Packet loss = 100% [16:02:35] PROBLEM - Host analytics1012 is DOWN: PING CRITICAL - Packet loss = 100% [16:02:44] PROBLEM - Host analytics1018 is DOWN: PING CRITICAL - Packet loss = 100% [16:02:57] PROBLEM - Host analytics1015 is DOWN: PING CRITICAL - Packet loss = 100% [16:02:57] PROBLEM - Host analytics1016 is DOWN: PING CRITICAL - Packet loss = 100% [16:03:02] PROBLEM - Host analytics1017 is DOWN: PING CRITICAL - Packet loss = 100% [16:03:11] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [16:04:41] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [16:06:27] !log db45 replacing disk2 (slot 1) [16:06:36] Logged the message, Master [16:07:18] New patchset: RobH; "analytics1011-1022 are being returned for new hardware, decommissioning the name so it removes from nagios as well as dhcp files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29600 [16:07:38] !log reedy synchronized php-1.21wmf2/includes/Message.php [16:07:44] RobH: at least for the swift servers, we talked about keeping the same names [16:07:52] Logged the 
message, Master [16:08:00] chris and ma rk [16:08:06] !log reedy synchronized php-1.21wmf2/includes/db/Database.php [16:08:20] Logged the message, Master [16:08:22] paravoid: i didn't remove names for swift...going to keep it the same [16:08:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29600 [16:08:44] cmjohnson1: right. I just saw the commit above [16:08:52] k..cool [16:09:02] cmjohnson1: So in gerrit you dont have +2? [16:09:07] but you have +1 right? [16:09:16] no [16:09:25] you cannot add anything? [16:09:49] when you go to review, what options do you have, just comment? [16:10:01] (take a look at my change for example) [16:10:05] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [16:14:03] !log reedy synchronized php-1.21wmf2/extensions/PagedTiffHandler/ [16:14:17] Logged the message, Master [16:16:34] back [16:16:42] paravoid: so you think it's the swift range bug? [16:20:05] not very sure [16:20:34] there were a lot of requests by one particular client [16:24:16] New review: Cmjohnson; "Looks good" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/29600 [16:24:22] there was a bit of a spike on sq56's network graphs, but really not that much [16:28:26] I really have to go, my accountant waits for me [16:28:30] I'll be back in an hour or so [16:28:33] (hopefully :) [16:28:47] and I'll get my laptop with me just in case [16:28:49] ttyiab [16:29:50] ok [16:29:50] gosh [16:29:50] and then people wonder why we use RT [16:29:51] bugzilla's UI is even worse [16:36:00] !log db56 replacing disk 7 (slot6) [16:36:14] Logged the message, Master [16:44:25] New patchset: Andrew Bogott; "Add "role::mediawiki-update::labs" for updating labs-MW-installations" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28355 [16:45:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28355 [17:11:40] !log reedy synchronized php-1.21wmf2/extensions/PagedTiffHandler [17:11:53] Logged the message, Master [17:12:20] New patchset: preilly; "Partner IP Live testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29608 [17:13:17] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29608 [17:13:50] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29608 [17:15:23] !log reedy synchronized wmf-config/CommonSettings.php 'wgMaxImageArea to 25MP' [17:15:35] Logged the message, Master [17:22:36] New patchset: Dereckson; "(bug 41213) Namespace configuration for gl.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29611 [17:27:30] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28355 [17:37:29] seems quieter now [17:37:51] (not back yet, just checking in) [17:38:26] PROBLEM - SSH on ssl1001 is CRITICAL: Server answer: [17:39:53] !log analytics1011 now offline and wiping as well [17:40:05] RECOVERY - SSH on ssl1001 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [17:40:06] Logged the message, RobH [17:41:26] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [17:44:34] New patchset: J; "update all extensions even if one fails" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29615 [17:45:38] New patchset: Ottomata; "Installing tofrodos on stat1 for Stefan" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29617 [17:46:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29615 [17:46:44] New patchset: J; "update all extensions even if one fails" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29615 [17:47:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29617 [17:47:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29615 [17:50:25] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29617 [17:51:02] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [18:25:23] RECOVERY - Puppet freshness on stat1 is OK: puppet ran at Tue Oct 23 18:24:55 UTC 2012 [18:28:49] New patchset: Jdlrobson; "add article context to factual error" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29623 [18:29:19] New patchset: Jdlrobson; "add article context to factual error (bug 39467)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29623 [18:36:38] binasher: for https://bugzilla.wikimedia.org/show_bug.cgi?id=41174 can't someone do the find/grep on nas1? [18:37:54] I suspect it'll need doing client side [18:38:02] nfs client side [18:39:00] AaronSchulz: sure.. [18:39:09] probably faster than swift ;) [18:40:15] !log shutting down es10 for dimm replacement [18:40:28] Logged the message, Master [18:40:35] AaronSchulz: not want to write a maintenance script? 
:p [18:43:05] PROBLEM - Host es10 is DOWN: PING CRITICAL - Packet loss = 100% [18:50:17] lesliecarr: could you please look at this today rt3779 [18:53:03] cmjohnson1: cool [18:57:19] cmjohnson1: done [18:58:23] RECOVERY - Host es10 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [18:58:35] lesliecarr: thx [18:58:41] binasher: all set [18:58:43] no prob [19:01:05] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [19:02:53] PROBLEM - mysqld processes on es10 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [19:04:43] RECOVERY - mysqld processes on es10 is OK: PROCS OK: 1 process with command name mysqld [19:04:58] cmjohnson1: thanks! [19:09:19] New patchset: Cmjohnson; "Adding db78 dhcpd file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29631 [19:10:02] !log asher synchronized wmf-config/db.php 'returning es10 to service' [19:10:16] Logged the message, Master [19:10:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29631 [19:16:44] New patchset: Cmjohnson; "Removing db62 from the ms-fe config, fixing db cnfg to include db 62" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29635 [19:17:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29635 [19:18:29] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 215 seconds [19:18:42] does db62 need its raid reconfigured back to raid-10 along with a reimage? [19:18:49] yes [19:19:11] once the changes are merged i am going to fix [19:20:12] binasher: can you +2 my my changes? [19:29:43] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29635 [19:29:43] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29631 [19:30:11] cmjohnson1: i merged it on sockpuppet too [19:30:36] cool..thx [19:30:36] binasher: so, how when is memcached stuff happening? [19:30:50] s/how// [19:34:30] dear people who know php . . . if one were to want to set a systemwide environment variable, would it be possible? [19:34:38] AaronSchulz: the new client still uses $wgMemCachedServers i assume? [19:34:51] i.e. and arbitrary variable that can be read from cli or apache [19:35:21] binasher: by default, but you can override that [19:36:18] ok [19:37:27] so maybe we can add a wgMemCachedServersNew to mc.php and code to select the new client with that config array if test.wikipedia.org or if srv193 [19:38:24] I don't like calling stuff "New", eventually it becomes old, and we all look silly :) [19:38:30] but yes, that's the idea [19:38:33] Jeff_Green: google suggests its somewhat messy.. "The environment variable will only exist for the duration of the current request." [19:38:45] yeah, that's what drove me here [19:39:34] AaronSchulz: s/New/Test ? :) [19:39:46] binasher: you'd have $wgObjectCaches['memc-pecl'] = array( ... ) and you could use that new global for the 'servers' setting [19:41:37] AaronSchulz: would it be ok to do this in an if{} in CommonSettings after the current wgObjectCaches definitions even though it isn't "common" if testwiki specific? 
[19:42:25] ah hah, there's already test2wiki stuff in commonsettings [19:42:46] Reedy: perhaps I can achieve the same result with a regular system environment variable [19:44:26] binasher: the $wgObjectCaches['memc-pecl'] line doesn't need to be in an "if" [19:44:45] but a $wgMainCacheType line will [19:45:02] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [19:45:42] I see $wgMainCacheType is in mc.php now [19:47:00] AaronSchulz: what would it be instead of "CACHE_MEMCACHED" ? [19:47:43] you can give 'memc-pecl', the string name of a registered object store [19:49:01] so you register it in CommonSettings and enable it for testwiki in mc.php [19:49:18] though let me think about load order here [19:51:03] yeah, that will be fine if you set wgObjectCaches where the other multiwrite stuff is in CommonSettings [19:52:18] * AaronSchulz wonders if we could keep stuff consolidated in mc.php a bit more in the future [20:02:37] !log fixing db62 raid cnfg [20:02:49] Logged the message, Master [20:05:13] !log olivneh synchronized php-1.21wmf2/extensions/PostEdit [20:05:22] Logged the message, Master [20:06:02] PROBLEM - Host db62 is DOWN: PING CRITICAL - Packet loss = 100% [20:06:12] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 189 seconds [20:07:59] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [20:17:35] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 206 seconds [20:31:39] !log dist-upgrade & rebooting payments[1-4] [20:31:53] Logged the message, Master [20:32:26] PROBLEM - Host payments1 is DOWN: PING CRITICAL - Packet loss = 100% [20:34:05] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [20:34:05] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [20:35:35] RECOVERY - Host payments1 is UP: PING OK - Packet loss = 0%, RTA = 1.48 ms [20:40:48] !log reedy synchronized php-1.21wmf2/includes/ [20:41:00] Logged the message, Master [20:43:50] notpeter: who's on RT duty today? I presume it's not Asher again :) [20:44:12] it's for a week! [20:44:20] oh, I thought it changed daily [20:45:12] ok. binasher, whom should andre__ and I contact to check on https://bugzilla.wikimedia.org/show_bug.cgi?id=41008 "Clear out NFS"? [20:46:31] That's labs... [20:46:59] Labs has a separate NFS? ok, so I'll switch the product & component back [20:47:34] sumanah, ctfoo told me it should go to ryan lane [20:47:57] andre__: ah, you already did it. ok. [20:48:47] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [20:52:13] New patchset: Dereckson; "(bug 41213) Namespace configuration for gl.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29611 [20:54:23] New patchset: Ryan Lane; "Fix virt cluster in gmetad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29691 [20:55:44] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29691 [20:55:58] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29691 [20:55:58] New patchset: Dzahn; "add the star.wikidata.org SSL cert" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29692 [20:56:58] !log restarting gmetad [20:57:04] !log restarting gmetad on nickel [20:57:12] Logged the message, Master [20:57:12] New review: Dzahn; "RT-3534" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/29692 [20:57:12] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29692 [20:57:27] Logged the message, Master [21:01:42] New patchset: Ryan Lane; "Change ganglia aggregators for virt cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29693 [21:03:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29693 [21:04:05] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29693 [21:07:28] New patchset: Ryan Lane; "Ugh. Apparently I read upside down." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29696 [21:08:08] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [21:08:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29696 [21:09:11] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [21:25:47] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29696 [21:27:22] New patchset: Ryan Lane; "Fix aggregators in gmetad config for virt cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29701 [21:28:38] New patchset: Pgehres; "Adding back URLs for LandingCheck." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29702 [21:28:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29701 [21:29:48] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29701 [21:38:59] !log mlitn Started syncing Wikimedia installation... : updating ArticleFeedbackv5, PageTriage & E3Experiments [21:39:11] Logged the message, Master [21:41:12] !log reedy synchronized php-1.21wmf2/includes/api/ [21:41:24] Logged the message, Master [21:42:59] New patchset: Dzahn; "protoproxy.pp config for wikidata.org (SSL/LB)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29706 [21:44:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29706 [21:59:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:02:36] !log reedy synchronized php-1.21wmf2/includes/api/ [22:02:49] Logged the message, Master [22:03:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds [22:12:59] if anyone has noticed ganglia spamming our logs with "slurpfile() read() buffer overflow on file /proc/stat", you may be interested to know that I just got it fixed upstream [22:13:16] !log mlitn Finished syncing Wikimedia installation... : updating ArticleFeedbackv5, PageTriage & E3Experiments [22:13:28] Logged the message, Master [22:16:05] slurpfile? 
heh [22:16:12] New patchset: Dzahn; "protoproxy.pp config for wikidata.org (SSL/LB)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29706 [22:16:41] https://github.com/ganglia/monitor-core/pull/63 [22:17:21] New patchset: Dzahn; "protoproxy.pp config for wikidata.org (SSL/LB)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29706 [22:18:19] also I noticed that our job runners sometimes see spikes of high system CPU usage [22:18:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29706 [22:18:50] like right now on mw7: http://ganglia.wikimedia.org/latest/graph.php?h=mw7.pmtpa.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1351030697&g=cpu_report&z=medium&c=Jobrunners%20pmtpa [22:19:05] apparently we use ulimit -v to limit memory usage of job runners [22:19:42] and when the limit is exhausted, libxml goes into a tight loop of mmap() calls for a few minutes [22:20:16] so I'm also working on a PHP patch to use emalloc() for libxml, since so many problems come back to that [22:21:26] binasher may possibly be interested in that [22:23:28] indeed.. preilly too ^^ [22:24:23] TimStarling: didn't you already have a emalloc() patch before? [22:24:38] I don't think so [22:24:43] if I did, this one is a bit different [22:25:01] okay cool [22:25:08] because now I've read the relevant libxml2 documentation, apparently there's a "right" way to override those hooks [22:25:09] TimStarling: where are you working on it? [22:25:28] ha ha ha — a "right" way [22:25:45] I'm working against PHP's git master, when it's done I guess I'll post it to php-internals [22:26:11] and I guess we will want to backport it to our own packages [22:26:37] binasher: still working on mc? [22:28:19] TimStarling: okay [22:29:40] PHP doesn't have a system for pull requests does it? [22:31:03] TimStarling: PHP accepts pull requests via github. Discussions are done on github, but depending on the topic can also be relayed to the official PHP developer mailinglist internals@lists.php.net. [22:31:09] TimStarling: https://github.com/php/php-src [22:31:26] ok [22:31:40] TimStarling: and new features require an RFC and must be accepted by the developers. See https://wiki.php.net/rfc and https://wiki.php.net/rfc/voting for more information on the process. [22:31:41] Bug fixes do not require an RFC, but require a bugtracker ticket. Always open a ticket at http://bugs.php.net and reference the bug id using #NNNNNN. [22:32:12] TimStarling: is there a libxml emalloc() bug open? 
[22:33:21] probably not [22:33:40] if there isn't, I'll open one shortly before I submit the pull request, and then reference the pull request from the bug [22:33:54] that should delay the trigger-happy bug triagers [22:34:09] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [22:34:25] you know, the ones who close everything as "invalid" [22:35:25] I thought they closed stuff as "bogus" [22:35:36] yeah, bogus seems to have disappeared [22:35:41] maybe it has been folded into invalid [22:36:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:42:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.388 seconds [22:45:26] actually if I use my @php.net address to open the bug, I should have a fair chance [22:45:39] I think that may have been a mistake I made in the past, using @wikimedia.org instead of @php.net [22:46:07] TimStarling: good idea [22:46:33] TimStarling: let me know when you post something — I'll respond to it quickly and ping a few other folks too [22:47:04] I just did a few searches, I don't think there is a bug open already [22:53:17] !log reedy synchronized php-1.21wmf2/includes/WikiPage.php 'debugging' [22:53:26] Logged the message, Master [22:53:35] ah yes, I did actually have a previous patch [22:53:51] I am remembering now because it hit the same segfault as the current patch is hitting [22:53:57] not a big problem though [22:54:22] !log reedy synchronized php-1.21wmf2/includes/WikiPage.php 'debugging' [22:54:35] Logged the message, Master [22:57:20] !log reedy synchronized wmf-config/ [22:57:32] Logged the message, Master [22:57:54] !log reedy synchronized php-1.21wmf2/includes/WikiPage.php [22:58:07] Logged the message, Master [23:01:23] TimStarling: oh okay [23:01:31] TimStarling: so is the old patch invalid now? [23:01:47] it was never completed [23:02:21] so yes [23:02:44] !log reedy synchronized php-1.21wmf2/includes/api/ [23:02:56] Logged the message, Master [23:09:26] !log reedy synchronized php-1.21wmf2/extensions/ExtensionDistributor [23:09:39] Logged the message, Master [23:11:20] Reedy, am I ok to scap or you're fixing something? [23:13:16] Feel free [23:17:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:22:10] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [23:25:01] !log maxsem Started syncing Wikimedia installation... : Weekly MobileFrontend and Zero deployment [23:25:14] Logged the message, Master [23:28:06] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [23:29:29] TimStarling: \o/ (re: libxml/emalloc()) [23:29:41] and yeah, I was telling you about those proper hooks at the all staff [23:30:11] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29706 [23:30:11] there are global variables, which are in the public headers, but you're not meant to set them [23:30:15] not that it's any big discovery or anything [23:30:29] there is a function called xmlMemSetup() which allows you to set the hooks [23:31:00] I had this in my browser's history: http://xmlsoft.org/xmlmem.html#setting [23:32:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.686 seconds [23:34:08] why does scap deploy to servers like searchidx2 and spence? do they run mw too?
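(Context for the emalloc()/xmlMemSetup() thread above: memory that libxml2 grabs through its default allocator is invisible to PHP's memory_limit and only runs into the process-wide ulimit -v ceiling, which is where the mmap() retry loop appears, whereas memory requested through PHP's own request allocator is accounted and fails fast. The snippet below is only a userland illustration of that fail-fast behaviour, not the patch itself, and the 64M limit is arbitrary.)

<?php
// Illustration only: allocations made through PHP's request allocator are
// counted against memory_limit and abort with a fatal error once the limit
// is reached, rather than leaving the process spinning on failed mmap()
// calls under a ulimit -v cap.
ini_set( 'memory_limit', '64M' );
$chunks = array();
while ( true ) {
	$chunks[] = str_repeat( 'x', 1 << 20 ); // ~1 MiB per iteration, via emalloc
}
// Expected outcome: "PHP Fatal error: Allowed memory size of 67108864 bytes
// exhausted", after which a job runner can simply move on to the next job.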
[23:37:13] searchidx2 runs dumpBackup.php, or at least used to [23:37:22] to generate an XML file for indexing [23:37:50] spence uses MediaWiki for some nagios checks [23:39:10] TimStarling: Have we got a way to find out a source of numerous api requests? IP/referer/user agent? We're getting tonnes of action=parse requests for SpecialPages..... [23:39:40] have you tried api.log on fluorine? [23:39:47] it has the IP, at least [23:40:02] then you can cross-reference that with user-agent from the squid logs, if there are enough requests [23:40:41] Duh [23:40:45] How did I not see that log [23:41:13] Are the squid logs anywhere I can get at them? [23:41:42] so, grepping for page=Special I'll only get english (no big deal) [23:42:34] can you log in to locke? [23:42:45] yup [23:43:10] so /a/squid/sampled-1000.log on locke [23:44:07] but it only has one in every 1000 requests, so there has to be a lot of requests before some will show up there [23:44:44] if there aren't enough, we can set up a special-case filter [23:45:04] there was 10s per second appearing in the exception log earlier today on fluorine [23:45:54] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28604 [23:46:03] Some of the action=parse for Special:Contributions resolve back to a vps company [23:58:59] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours
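(The triage workflow discussed at 23:39–23:46, pulling candidate client IPs out of api.log and then matching them against user agents in the sampled squid log, is easy to script. The sketch below is not an existing tool; the default filename, the page=Special filter, and the assumption that each matching line carries the client IP as a whitespace-separated field are placeholders to adjust to the real log format.)

<?php
// Hypothetical helper: count client IPs seen on action=parse requests for
// Special pages in an api.log-style file, so the top talkers can then be
// cross-referenced with /a/squid/sampled-1000.log on locke.
$log = isset( $argv[1] ) ? $argv[1] : 'api.log';
$counts = array();
foreach ( new SplFileObject( $log ) as $line ) {
	if ( strpos( $line, 'action=parse' ) === false || strpos( $line, 'page=Special' ) === false ) {
		continue; // note: the page=Special filter only catches English wikis
	}
	// Assumption: the client IP appears somewhere on the line as its own field.
	foreach ( preg_split( '/\s+/', trim( $line ) ) as $field ) {
		if ( filter_var( $field, FILTER_VALIDATE_IP ) !== false ) {
			$counts[$field] = isset( $counts[$field] ) ? $counts[$field] + 1 : 1;
			break;
		}
	}
}
arsort( $counts );
foreach ( array_slice( $counts, 0, 20, true ) as $ip => $n ) {
	echo "$n\t$ip\n";
}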