[00:06:18] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [00:17:52] New patchset: Asher; "seeting a global mysql PS1 prompt [rt 3091]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29517 [00:18:09] binasher: where's the sql query stats? [00:18:38] https://ishmael.wikimedia.org [00:18:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29517 [00:19:18] thanks [00:19:27] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29517 [00:21:11] does anyone know what happened to the job queue CPU usagE? [00:21:27] http://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&c=Application+servers+pmtpa&m=load_one&s=by+name&mc=2&g=cpu_report [00:22:07] TimStarling: i think the change is when the job queue was switched to dedicated hosts only [00:23:10] I see [00:24:01] a lot of the new job runners have those glitches in the CPU graphs, with 2^31% wait CPU [00:24:07] is that a precise bug? [00:24:44] maybe it's 2^64% [00:24:51] it's a lot of CPU anyway [00:25:36] i'm not seeing it on the few i've checked [00:25:58] http://ganglia.wikimedia.org/latest/graph.php?h=mw8.pmtpa.wmnet&m=cpu_report&r=week&s=by%20name&hc=4&mc=2&st=1350951935&g=cpu_report&z=medium&c=Jobrunners%20pmtpa [00:26:44] oh yeah, it shows up on the weekly view [00:27:39] The avg number seems to be roughly 2^45 [00:27:59] that looks like a gmond bug [00:29:08] I see we roll our own ganglia-monitor packages [00:30:01] RECOVERY - MySQL Replication Heartbeat on es10 is OK: OK replication delay 0 seconds [00:30:45] RECOVERY - MySQL Slave Delay on es10 is OK: OK replication delay 0 seconds [00:32:15] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [00:32:15] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [00:46:09] it looks like ganglia has never had any bounds checks or counter overflow detection in its CPU metrics [00:46:23] so maybe it is a kernel change [00:47:54] fatal.log is mostly just acwiki and arwiki, heh [00:51:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:52:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.867 seconds [01:07:21] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [01:16:19] it's not the kernel [01:16:50] unless the counter goes backwards [01:17:16] the kernel counters agree with the uptime [01:17:26] so they didn't overflow [01:24:18] PROBLEM - Puppet freshness on srv220 is CRITICAL: Puppet has not run in the last 10 hours [01:24:18] PROBLEM - Puppet freshness on srv219 is CRITICAL: Puppet has not run in the last 10 hours [01:27:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:33:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.036 seconds [01:43:21] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 298 seconds [01:46:39] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 15 seconds [01:54:03] could anybody clear NFS? 
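The 2^31% and 2^45 figures discussed above are what an unsigned counter wrap looks like once it is pushed through a delta-based percentage. A minimal sketch, assuming gmond derives CPU percentages from deltas of /proc/stat jiffy counters; the bit widths and the formula here are illustrative, not gmond's actual code:

```python
# Minimal sketch (not gmond's actual code) of how an unsigned-delta
# CPU percentage blows up when a /proc/stat counter goes backwards.
# The 32/64-bit widths and the percentage formula are assumptions
# for illustration only.

def cpu_percent(prev_iowait, cur_iowait, prev_total, cur_total, bits=32):
    mask = (1 << bits) - 1
    # Unsigned wrap-around subtraction: a counter that merely went
    # backwards by one tick becomes a delta of roughly 2**bits.
    d_iowait = (cur_iowait - prev_iowait) & mask
    d_total = (cur_total - prev_total) & mask
    return 100.0 * d_iowait / d_total if d_total else 0.0

# Normal sample: 5 of 100 jiffies spent in iowait -> 5%
print(cpu_percent(1000, 1005, 20000, 20100))           # 5.0

# Counter moved backwards by one tick between samples:
print(cpu_percent(1005, 1004, 20000, 20100))           # ~4.3e9 %, i.e. ~2**32
print(cpu_percent(1005, 1004, 20000, 20100, bits=64))  # ~1.8e19 %, i.e. ~2**64
```

An average of "roughly 2^45" in the weekly graph would then just be one such spike smeared across the averaging window, which is consistent with a counter wrap rather than any real load figure.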
https://bugzilla.wikimedia.org/show_bug.cgi?id=41008 [01:59:53] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 243 seconds [02:06:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds [02:29:34] !log LocalisationUpdate completed (1.21wmf2) at Tue Oct 23 02:29:31 UTC 2012 [02:29:48] Logged the message, Master [02:32:15] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [02:43:16] I/O wait issue: http://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&me=Wikimedia&m=load_one&s=by+name&mc=2&g=cpu_report [02:44:02] it does make the graph rather hard to read [02:47:36] RECOVERY - Puppet freshness on srv219 is OK: puppet ran at Tue Oct 23 02:47:12 UTC 2012 [02:50:06] RECOVERY - Puppet freshness on srv220 is OK: puppet ran at Tue Oct 23 02:49:55 UTC 2012 [02:55:38] !log LocalisationUpdate completed (1.21wmf1) at Tue Oct 23 02:55:38 UTC 2012 [02:55:52] Logged the message, Master [03:20:15] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [03:26:15] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [03:40:57] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [03:43:12] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [03:50:51] PROBLEM - MySQL Replication Heartbeat on db1001 is CRITICAL: CRIT replication delay 242 seconds [03:52:05] PROBLEM - MySQL Slave Delay on db1001 is CRITICAL: CRIT replication delay 303 seconds [03:57:18] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [03:58:57] RECOVERY - MySQL Replication Heartbeat on db1001 is OK: OK replication delay 13 seconds [04:00:00] RECOVERY - MySQL Slave Delay on db1001 is OK: OK replication delay 0 seconds [04:47:15] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [05:17:21] !log tstarling synchronized php-1.21wmf2/includes/SpecialPage.php [05:17:33] Logged the message, Master [05:25:48] PROBLEM - MySQL Replication Heartbeat on db1050 is CRITICAL: CRIT replication delay 182 seconds [05:26:43] PROBLEM - MySQL Slave Delay on db1050 is CRITICAL: CRIT replication delay 187 seconds [05:53:52] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/29278 [06:09:18] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [06:20:42] RECOVERY - MySQL Slave Delay on db1050 is OK: OK replication delay 27 seconds [06:20:51] RECOVERY - MySQL Replication Heartbeat on db1050 is OK: OK replication delay 26 seconds [06:40:35] !log krinkle synchronized php-1.21wmf2/resources/mediawiki/mediawiki.searchSuggest.js 'Ibcc56f64' [06:40:48] Logged the message, Master [06:42:13] !log krinkle synchronized php-1.21wmf1/resources/mediawiki/mediawiki.searchSuggest.js 'Ibcc56f64' [06:42:25] Logged the message, Master [07:22:43] good morning [07:41:08] RECOVERY - MySQL Slave Delay on es1001 is OK: OK replication delay NULL seconds [07:46:05] PROBLEM - MySQL Slave Delay on es1001 is CRITICAL: CRIT replication delay 2733579 seconds [07:50:01] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [07:54:59] New patchset: J; "rename ::jobserver to mediawiki_new::jobserver" [operations/puppet] (production) - 
https://gerrit.wikimedia.org/r/29324 [07:56:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29324 [07:57:43] paravoid: moved files and templates and updated locations (https://gerrit.wikimedia.org/r/#/c/29324/) if you can have another look [08:26:17] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:31:06] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.578 second response time [08:34:34] j^: I will [08:44:15] New patchset: Mark Bergsma; "Collect duplicate headers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29546 [08:45:09] j^: have you tested it on that labs machine? [08:45:24] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29546 [08:45:29] paravoid: yes, would not have gotten the paths right otherwise [08:49:05] :-) [08:49:52] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29324 [08:51:51] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find template 'jobrunner/mw-job-runner.default.erb' at /var/lib/git/operations/puppet/manifests/jobrunner.pp:31 on node mw1.pmtpa.wmnet [08:51:52] yay [08:52:19] that's from a production jobrunner unfortunately [08:52:20] looking [08:53:45] it fixed itself [08:53:49] puppet was slow to reload probably [08:53:58] because that file doesn't exist anymore, heh [08:54:12] okay, good to go [08:54:20] and verified it won't kill production jobrunners :) [08:54:41] j^: thanks for all the attempts and sorry you had to get through all those hoops [08:55:35] New patchset: J; "enable videoscaler on tmh1/tmh2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29547 [08:56:21] paravoid: once those 2 boxes run the code, it was worth it [08:56:38] hehe [08:56:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29547 [08:56:47] so, care to share a few details on how this works? [08:56:54] are you doing transcodes via ffmpeg? [08:57:07] I remember a discussion about ffmpeg packages but it was in a different context [08:57:29] I don't think we've ever got a mail of how TMH works from an operations perspective, but I may be wrong [08:57:36] I've been told it's an old project, so it might predate me :) [08:57:55] to transcodes are done using ffmpeg (for WebM) and ffmpeg2theora (for Ogg) [08:58:25] all required features are in the packages that are in precise so no need for custom packages anymore [08:58:30] yay! [08:58:44] esp. for ffmpeg/libav, it's a HELL to package [08:58:45] i think i wrote something to the list some time ago [08:58:51] they break ABIs in every version or so [08:59:09] yes, commandline api too [09:00:02] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [09:00:43] so I see we have ffmpeg/ffmpeg2theora on imagescaler::packages [09:00:58] but imagescalers need them too to generate thumbs I presume? 
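For reference, a rough sketch of the two conversions described above (WebM via ffmpeg, Ogg via ffmpeg2theora), assuming the stock precise packages; the flags, quality settings and file names are illustrative assumptions, not the videoscalers' actual settings:

```python
# Rough sketch of the two transcode invocations mentioned above
# (WebM via ffmpeg, Ogg via ffmpeg2theora). Flags, qualities and
# file names are illustrative assumptions, not TMH's settings.
import subprocess

def transcode_webm(src, dst, height=360):
    subprocess.check_call([
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale=-1:{height}",
        "-c:v", "libvpx", "-b:v", "512k",
        "-c:a", "libvorbis",
        dst,
    ])

def transcode_ogv(src, dst):
    subprocess.check_call([
        "ffmpeg2theora", src,
        "--videoquality", "5",
        "-o", dst,
    ])

transcode_webm("source_upload.ogv", "derivative.360p.webm")
transcode_ogv("source_upload.ogv", "derivative.360p.ogv")
```

The surrounding job (fetching the source from the file backend, storing the derivative back, updating the database) is handled by MediaWiki, so the sketch only covers the conversion step.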
[09:01:29] other than that from the ops perspective its 2 new boxes (later possibly more) that are jobrunners that only handle webVideoTranscode jobs [09:01:32] those are basically take video from filebackend; convert to other format; insert new format into filebackend and update database [09:01:58] * paravoid nods [09:02:05] yes imagescalers need ffmpeg / ffmpeg2theora to extract thumbs from videos [09:02:38] once they all run precise and we only use TMH we can remove some of those packages(i.e. oggvideotools) [09:02:52] that's okay, who cares for an extra package or two [09:03:08] I'm just trying to understand which role does what :) [09:03:26] and what's the plan here for the TMH rollout? [09:03:50] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29547 [09:04:03] right now waiting for the imagescaler update to precise, after that enable it on some more wikis (mediawiki.org/foundation) [09:04:21] some /more/? is it enabled somewhere already? [09:04:26] if that goes well enable transcoding, after that commons and all other wikis [09:04:33] its currently enabled on test2 [09:04:46] http://test2.wikipedia.org/wiki/User:JanGerber/sandbox [09:05:15] (force-running puppet on tmh1/2 [09:05:22] broken frame in 3rd video is due to imagescalers running and old version of ffmpeg [09:05:38] (will be fixed with precise update) [09:05:53] and which machine runs the videoscaler job here? [09:06:18] for test2? [09:06:27] you an use TMH without transcoding, thats what we do right now [09:06:47] aha [09:06:52] err: Could not retrieve catalog from remote server: Error 400 on SERVER: $ganglia_clusters[$cluster] is not an hash or array when accessing it with ip_oct at /var/lib/git/operations/puppet/manifests/ganglia.pp:150 on node tmh1.pmtpa.wmnet [09:06:59] yay [09:07:14] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:08:44] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.471 second response time [09:09:42] so the ganglia cluster is derived from the ::common classes' group [09:10:25] mark: is there some logic behind ganglia's ip_oct? [09:10:43] no [09:10:52] okay :) [09:10:57] they're kind of unordered heh [09:11:03] legacy [09:11:07] k :) [09:11:25] thx [09:13:54] j^: fixing [09:15:33] mark: another question: don't the tmh1/tmh2 names violate our naming policy? :) [09:15:51] are they in eqiad? then yes [09:15:53] we don't have imgscaler1/2/3 [09:15:55] no they're in pmtpa [09:16:07] but they're not mwN [09:16:26] * mark shrugs [09:16:33] who cares [09:16:35] I don't [09:18:36] New patchset: Mark Bergsma; "Temporarily pipe .ogv videos from frontend to backend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29548 [09:18:59] varnish is sending content-length twice [09:19:04] i'm wondering if that's what confuses firefox [09:19:19] have you reproduced it? [09:19:37] New patchset: Faidon; "Add a "video scalers" ganglia group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29549 [09:19:37] i've reproduced the lack of video length yes [09:19:45] ff 15 on my gf's ubuntu is showing it [09:19:51] but not ff 17? [09:20:04] no but currently I can't even get that to respect /etc/hosts :P [09:20:11] which was working before [09:20:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29548 [09:20:38] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29549 [09:21:01] so what i'm gonna do now temporarily, is to pipe requests for videos on the frontend [09:21:03] to see if that makes a difference [09:21:07] takes care of the duplicate headers [09:22:12] brb [09:23:36] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29549 [09:27:06] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:27:56] mark: should I merge your change? [09:30:22] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.525 second response time [09:56:30] oh oops [09:56:35] thought i already did [09:56:35] yes please [09:57:49] done [09:58:15] my thinking being "it'll be deployed when I come out of the shower" ;) [09:58:33] haha [09:58:59] it'd be nice to be able to deploy varnish configs as we deploy squid [09:59:09] push rather than pull [09:59:17] and quicker than a full puppet run [09:59:37] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:01:15] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.652 second response time [10:07:06] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [10:15:13] j^: tmh1 is done, puppet's running on tmh2 now [10:33:05] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [10:33:05] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [10:35:28] lunch [10:37:24] see [10:37:33] now i'm piping the request to the backend [10:37:33] and firefox works [11:08:01] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [11:20:38] !log depooling ms-fe2, in preparation for ubuntu/swift upgrade [11:20:52] Logged the message, Master [11:26:17] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:27:48] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.760 second response time [11:36:56] PROBLEM - Host ms-fe2 is DOWN: PING CRITICAL - Packet loss = 100% [11:42:38] RECOVERY - Host ms-fe2 is UP: PING OK - Packet loss = 0%, RTA = 1.72 ms [11:45:21] sigh, what a broken fs config [11:45:23] fs/partman [11:45:49] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:46:45] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: Connection refused [11:46:50] PROBLEM - Memcached on ms-fe2 is CRITICAL: Connection refused [11:47:26] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.269 second response time [11:47:35] PROBLEM - SSH on ms-fe2 is CRITICAL: Connection refused [12:06:11] PROBLEM - NTP on ms-fe2 is CRITICAL: NTP CRITICAL: No response from NTP server [12:11:53] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:15:05] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.710 second response time [12:33:02] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [12:47:14] i think we found the content length issue [12:47:22] building a new package now [12:53:22] !log Inserted new varnish 3.0.3plus~rc1-wm3 packages into the precise-wikimedia APT repository, with a duplicate content-length header fix on streaming [12:53:35] Logged the message, Master [12:53:45] yay [12:55:34] New patchset: Mark Bergsma; "Install 
varnish 3.0.3~rc1-wm3 (with duplicate CL header fix), remove workaround" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29564 [12:56:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29564 [12:57:16] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29564 [13:07:08] New patchset: Mark Bergsma; "Unset Content-Length header when starting streaming to avoid duplicates" [operations/debs/varnish] (patches/streaming-range) - https://gerrit.wikimedia.org/r/29565 [13:07:42] New patchset: Mark Bergsma; "Unset Content-Length header when starting streaming to avoid duplicates" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29566 [13:07:42] New patchset: Mark Bergsma; "Don't exit on fflush() error" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29567 [13:07:42] New patchset: Mark Bergsma; "varnish (3.0.3plus~rc1-wm3) precise; urgency=low" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29568 [13:08:24] Change merged: Mark Bergsma; [operations/debs/varnish] (patches/streaming-range) - https://gerrit.wikimedia.org/r/29565 [13:08:56] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29566 [13:09:14] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29567 [13:09:33] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29568 [13:19:20] New patchset: Matthias Mullie; "(bug 40672) Abuse filter: Increase 5% limit to allow filtering for very short posts" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29571 [13:19:22] argh [13:19:25] but it doesn't fix the firefox issue [13:19:53] New review: Matthias Mullie; "Do not merge before these are merged:" [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/29571 [13:21:02] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [13:25:14] New patchset: Mark Bergsma; "Remove the duplicate Accept-Ranges header" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29572 [13:26:15] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29572 [13:27:06] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [13:37:06] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:44] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.137 second response time [13:42:13] solved [13:42:17] it was the duplicate Accept-Ranges header [13:43:33] \o/ [13:44:01] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [13:44:24] can you test it? 
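One way to test for exactly this class of problem is to issue a ranged request and inspect the raw header list before any client library collapses duplicates. A minimal sketch using http.client, which keeps repeated headers as separate entries; the host and path below are placeholders:

```python
# Minimal check for duplicate response headers (e.g. Content-Length,
# Accept-Ranges, Age) on a ranged request. http.client keeps repeated
# headers as separate (name, value) pairs instead of joining them,
# which is what we need here. The host and path are placeholders.
import http.client
from collections import Counter

def duplicate_headers(host, path, byte_range="0-0"):
    conn = http.client.HTTPConnection(host)
    conn.request("GET", path, headers={"Range": f"bytes={byte_range}"})
    resp = conn.getresponse()
    resp.read()   # drain the (tiny) ranged body
    conn.close()
    counts = Counter(name.lower() for name, _ in resp.getheaders())
    return resp.status, {h: n for h, n in counts.items() if n > 1}

status, dupes = duplicate_headers(
    "upload.example.org",
    "/wikipedia/commons/x/xx/Some_video.ogv")
print(status, dupes or "no duplicate headers")
```

A client library that joins duplicates would show "Accept-Ranges: bytes, bytes", which (per the guess above) is presumably what Firefox ends up with internally, so the raw pair list is the more useful view for this test.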
[13:45:02] sure [13:45:20] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:09] probably firefox collects them internally to "Accept-Ranges: bytes, bytes" at which point it no longer recognizes it [13:46:50] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.895 second response time [13:47:46] wfm [13:48:20] oh wait, that was with esams [13:48:21] haha [13:48:33] stupid firefox dns cache [13:48:49] yes [13:49:10] make sure to do a full cache refresh of the video as well [13:49:12] otherwise ff might already know the video length [13:52:21] takes a while but I'm not sure if it's just bad throughput to eqiad... [13:52:37] aborted due to a network error [13:54:50] ok, so I know why [13:54:52] I tried with another video [13:55:23] T+42.9 sent the GET [13:55:33] T+43.1 got the tcp ack [13:55:41] data started flowing at 74.5 [13:55:55] a miss? [13:56:00] if that takes even longer, I presume Firefox times out [13:56:29] cp1021 hit, cp 1026 miss [13:56:44] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:56:47] that's only streaming from backend to frontend then [13:56:50] should start very quickly [13:57:40] hang on, let me double check that [13:58:06] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [13:59:53] yep, double-checked it [14:00:09] let me get you the headers from the cap [14:00:19] GET /wikipedia/commons/1/1f/Monthly_Metrics_Meeting_February_2%2C_2012_%28360p%29.ogv HTTP/1.1 [14:00:28] Range: bytes=700153856- [14:00:59] HTTP/1.1 206 Partial Content [14:01:11] X-Cache: cp1021 hit (1) [14:01:17] X-Cache-frontend: cp1026 miss (0) [14:01:26] Content-Range: bytes 700153856-700183371/700183372 [14:01:26] Accept-Ranges: bytes [14:01:26] Content-Length: 29516 [14:02:26] and that took like 30s for the first byte to come in? [14:02:30] yes [14:02:37] strange [14:03:26] aha, immediately before that [14:03:36] Request to first data byte: 127 ms [14:03:36] it requested bytes=0- [14:03:39] for that same request [14:03:52] ah, on the same connection? [14:03:53] upload started to send that immediately [14:04:07] but then my firefox closed the connection (I see multiple RST) [14:04:13] tcp rst [14:04:50] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.128 second response time [14:06:43] same file, different connection (port +1) [14:07:56] hmm [14:07:59] lemme try to simulate that [14:08:36] The Ogg format doesn't encapsulate the duration of media, so for the progress bar on the video controls to display the duration of the video, Gecko needs to determine the length of the media using other means. [14:08:40] There are two ways Gecko can do this. The best way is to offer an X-Content-Duration header when serving Ogg media files. This header provides the duration of the video in seconds (not in HH:MM:SS format) as a floating-point value. [14:08:44] If your server provides the X-Content-Duration header when serving Ogg media, Gecko doesn't have to do any extra HTTP requests to seek to the end of the file to calculate its duration. This makes the entire process much more efficient as well as more accurate. [14:08:50] As an inferior alternative, Gecko can estimate the video length based on the Content-Length. See next point. 
[14:08:52] https://developer.mozilla.org/en-US/docs/Configuring_servers_for_Ogg_media [14:08:56] that's why it seeks to the end [14:08:58] yes [14:09:04] I guessed that, so my Googling was very efficient :P [14:09:17] that's why I made sure range requests are now working for high ranges ;) [14:09:26] but we can set X-Content-Duration [14:09:34] swift can hold arbitrary headers [14:09:38] yes we should [14:09:45] but this should work too [14:09:49] yeah [14:10:21] i tried to simulate it on this DSL line, with two firstbyte-range invocations and one second in between [14:10:26] was very quick [14:10:36] so i'm not sure why your second request would be stalling [14:11:50] wouldn't that happen if streaming was not enabled? [14:12:05] yes [14:12:08] but it is enabled for this file [14:12:51] do we have a header that gets set when streaming is enabled? [14:13:05] no not anymore [14:13:08] as it was a bit misleading [14:13:18] it entered the cache and would be served on cached results [14:13:22] oh, hah [14:13:49] chrome does not support the X-Content-Duration header so even if we add it, chrome will still do range requests [14:14:12] heh, good to have a media expert around :) [14:14:51] webm has the index/duration at the start so that will also reduce the inital range requests [14:16:27] this is what I did to simulate: http://p.defau.lt/?hvZEmAof6GxBTv_2_VWggw [14:17:29] paravoid: can that header be added to objects after they're already in swift? [14:17:38] yes [14:17:40] (afaik) [14:17:47] we should probably have aaron make a batch job to do that [14:17:52] besides inserting it with new objects [14:18:05] last time i asked about changing mimetype i was told it would not be possible [14:18:15] but might be possible now [14:18:34] I'm not sure, so let me recheck [14:19:45] does swift store them in filesystem attributes? [14:19:49] yeah [14:19:55] well, in one single attribute [14:24:14] yeah, with POST you can update just the metadata [14:24:19] at least that's what I understand from the docs [14:24:29] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:24:38] the arbitrary header feature is not documented although it's definitely there in the code :-) [14:24:53] shall I make a bugzilla ticket? [14:24:58] initially they supported only X-Object-Meta-* but it seems they've added arbitrary headers [14:25:00] or do you want to? [14:25:14] I can do it [14:25:19] thanks, cc me [14:25:28] j^: should this be against TMH? [14:26:08] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.605 second response time [14:26:30] paravoid: File->getLength() in core should work ( * Get the duration of a media file in seconds [14:26:54] this will return the right value with TMH and should return the right value with OggHandler right now [14:27:14] the bug to set X-Content-Duration I meant [14:27:27] i'm gonna make varnish filter those headers now [14:27:41] mark: which ones? [14:27:42] x-content-duration? [14:27:43] Age, Accept-Ranges [14:27:59] the current hack for accept-ranges is stupid, and age is harder to deduplicate [14:28:03] I just want it to behave well [14:28:03] paravoid: ah ok, yes set it to TMH, it might have to be done in core though [14:28:36] but this affects OggHandler too, doesn't it? [14:31:34] paravoid: i think the patch will go to core/includes/filebackend/SwiftFileBackend.php [14:31:44] no changes needed in OggHandler and TMH [14:31:54] okay, I'll file it in core [14:31:58] thanks. 
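In the Swift API a POST to an object URL rewrites its metadata without re-uploading the data, so existing uploads can be given X-Content-Duration after the fact. A minimal sketch over the plain HTTP API; the storage URL, auth token, container/object names and duration value are all placeholders:

```python
# Minimal sketch of setting X-Content-Duration on an object that is
# already in Swift, via the plain Swift HTTP API (an object POST
# updates metadata without re-uploading the data). Storage URL, token,
# container/object names and the duration value are placeholders.
import requests

STORAGE_URL = "http://ms-fe.example.internal/v1/AUTH_mw"   # placeholder
TOKEN = "AUTH_tk_..."                                       # placeholder

def set_duration(container, obj, seconds):
    resp = requests.post(
        f"{STORAGE_URL}/{container}/{obj}",
        headers={
            "X-Auth-Token": TOKEN,
            # Floating-point seconds, per the MDN description quoted above.
            "X-Content-Duration": f"{seconds:.3f}",
        },
    )
    resp.raise_for_status()   # Swift answers 202 Accepted on success

set_duration("wikipedia-commons-local-public.d2",   # placeholder container
             "d/d2/Some_video.ogv", 2012.5)
```

Two caveats, both hedged: an object POST replaces the object's existing custom metadata rather than merging with it, so a batch job should re-send anything it wants to keep; and whether the proxy passes an arbitrary header like X-Content-Duration through (rather than only X-Object-Meta-*) depends on the Swift version and configuration discussed above, so it is worth verifying on a single object before running any batch update.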
[14:32:09] !log reedy synchronized php-1.21wmf1/extensions/PostEdit 'Update PostEdit to master' [14:32:17] aude: ^ [14:32:23] Logged the message, Master [14:34:25] Reedy: thanks [14:35:30] paravoid: whats the bug number for X-Content-Duration, will subscribe to the bug [14:35:44] I'm filing it now :) [14:37:19] Do we have a way of finding out the source of a request? Getting a tonne of api requests asking for parse on special pages... which is resulting in a huge spam of exceptions [14:38:08] new varnish build coming up [14:38:15] https://bugzilla.wikimedia.org/show_bug.cgi?id=41304 [14:38:20] mark: bugzilla says no match for your mail address [14:38:31] maybe they can not throw an exception? fail in some other way? [14:45:07] mark: want the pcap for the 30s time for first byte? [14:45:17] I'd have to filter it out a bit, since it's 125MB :) [14:45:31] sure [14:47:14] there [14:47:17] those headers nicely filtered [14:47:24] let me roll out this build ;) [14:47:26] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:48:04] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [14:48:11] PROBLEM - Host ms-fe2 is DOWN: PING CRITICAL - Packet loss = 100% [14:48:56] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [14:49:14] !log reedy synchronized php-1.21wmf2/includes/api/ApiParse.php 'Dont parse special pages' [14:49:23] !log removing srv194 from apaches pool for drive error [14:49:24] Logged the message, Master [14:49:36] Logged the message, notpeter [14:49:44] !log Inserted new varnish build 3.0.3plus~rc1-wm4 into precise-wikimedia, fixing duplicate header issue [14:49:51] Logged the message, Master [14:50:07] notpeter: gahhh I saw the nagios alerts, thought to check then forgot about it [14:50:26] heh, no prob :) [14:50:26] New patchset: Mark Bergsma; "Roll a new Varnish build which filters duplicate headers Age and Accepted-Ranges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29585 [14:50:30] !log temporarily stopping puppet on brewster [14:50:42] Logged the message, Master [14:50:49] Ahhhh, that's better [14:50:53] I'm in a state where I'm looking after the apaches. your attention is elsewhere :) [14:51:07] (and I'm thankful that it's elsewhere, on swift :) ) [14:51:31] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29585 [14:51:42] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29585 [14:53:56] RECOVERY - Host ms-fe2 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [14:57:11] !log reedy synchronized php-1.21wmf2/includes/api/ApiParse.php 'Dont parse special pages' [14:57:21] PROBLEM - Apache HTTP on srv194 is CRITICAL: Connection refused [14:57:24] Logged the message, Master [14:58:01] New patchset: Mark Bergsma; "Filter the Age and Accept-Ranges HTTP headers before readding on deliver" [operations/debs/varnish] (patches/filter-headers) - https://gerrit.wikimedia.org/r/29588 [14:58:34] New patchset: Mark Bergsma; "Filter the Age and Accept-Ranges HTTP headers before readding on deliver" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29589 [14:58:34] New patchset: Mark Bergsma; "varnish (3.0.3plus~rc1-wm4) precise; urgency=low" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29590 [14:59:01] Change merged: Mark Bergsma; [operations/debs/varnish] (patches/filter-headers) - https://gerrit.wikimedia.org/r/29588 [14:59:25] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29589 [14:59:41] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29590 [15:02:29] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:03:02] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:03:40] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:03:58] rendering about to go down [15:04:54] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:05:01] you mean you're about to do that? [15:05:14] or outage? [15:05:16] no, 4/7 hosts are looking rather down in nagios [15:05:36] although I'm on one right now and things don't look awful... [15:05:50] 6/7 are down [15:05:51] wtf [15:05:53] PROBLEM - Apache HTTP on srv190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:36] 7/7 [15:06:40] ja [15:06:47] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:52] and there we go [15:07:01] heh [15:07:21] I'm on srv190, no load or anything [15:07:29] yeah [15:07:32] try to curl one, though [15:07:32] PROBLEM - Apache HTTP on srv219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:33] times out bad [15:07:50] PROBLEM - Apache HTTP on srv220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:11] ganglia too, that's strnage [15:08:40] both aggregators are totally overloaded [15:08:48] but neither ram nor cpu.... [15:10:52] we have low limits on those apaches on purpose [15:11:37] Anybody able to help with some technical glitches in deleting a file? Ironholds from #wikipedia-en-admins told me to go here. [15:11:52] https://en.wikipedia.org/wiki/File:Sandesh_Kadur_portrait.jpg [15:12:03] ugh oh [15:12:09] paravoid: ? [15:12:12] ""Error deleting file: Could not delete file "mwstore://local-swift/local-public/d/d2/Sandesh_Kadur_portrait.jpg". Happened twice. [15:12:13] swift... [15:12:20] or nfs [15:12:36] FutPerf: we're looking at it, thanks. [15:12:39] No idea what swift is, but that's what Ironholds too thought it would be. [15:12:42] Thanks. 
[15:12:46] There was also another one: [15:13:00] https://en.wikipedia.org/wiki/File:Walter_Mercado.jpg [15:13:07] MySQL error: 1205: Lock wait timeout exceeded; try restarting transaction (10.0.6.73) [15:13:08] paravoid: load from rebalancing swift? [15:13:10] nfs looks fine [15:13:36] swuft doesn't [15:13:43] ja [15:13:45] but I'm not sure if it's rendering cascaded to swift [15:13:47] or the other way around [15:13:47] swift response times are through the roof [15:13:50] yeah [15:14:27] swift cpu had a very steep rise around 14:50 [15:14:27] could be vid requests going to swift backend? [15:14:41] on ms-fe3 and 4 [15:14:45] yes [15:14:56] ms-fe1 but a bit better there [15:14:56] RECOVERY - SSH on ms-fe2 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:14:57] There's a tonne of swift warnings apparing in the apache logs [15:15:17] ms-be3 is melting [15:15:27] goddamn it [15:15:55] swift is at 500 Mbps in ganglia [15:16:00] hehe [15:16:04] ms-be3 is saturating its GigE [15:16:05] same outage [15:16:12] it has a huge i/o wait too [15:16:17] and I don't think I've patched that [15:16:23] you know what to do [15:16:32] why isn't swift caching it though [15:16:37] er [15:16:37] squid [15:17:35] 2.6G ogg [15:17:44] nice [15:17:51] not sure which yet [15:17:56] New patchset: Andrew Bogott; "Add "role::mediawiki-update::labs" for updating labs-MW-installations" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28355 [15:18:34] The_MediaWiki_Web_API_and_How_to_use_it_-_San_Francisco_Wikipedia_Hackathon_2012.ogv [15:19:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28355 [15:19:19] 879 fds open with that file only on ms-be3 [15:20:01] well it's not gonna increase throughput [15:20:09] as the GigEs are already saturated on both backends and frontends [15:20:14] so we need to fix the squid level [15:20:38] fixing i/o wait is a good idea nevertheless [15:20:42] sure [15:21:04] !log manually patching swift on ms-be3 to comment-out the drop_cache() calls [15:21:17] Logged the message, Master [15:21:29] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 59891 bytes in 0.182 seconds [15:21:35] ffs [15:22:09] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time [15:22:23] RECOVERY - Apache HTTP on srv190 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [15:22:23] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.363 second response time [15:22:32] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [15:22:32] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.639 second response time [15:22:41] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.046 second response time [15:23:26] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.071 second response time [15:25:05] PROBLEM - swift-container-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [15:25:26] i'm not inclined to fix squid [15:25:26] PROBLEM - swift-account-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [15:25:29] i'm inclined to put load on varnish tomorrow ;) [15:25:32] PROBLEM - 
swift-object-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [15:26:44] RECOVERY - swift-container-server on ms-be3 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [15:26:53] RECOVERY - swift-account-server on ms-be3 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [15:27:11] RECOVERY - swift-object-server on ms-be3 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [15:27:50] mark: we might have to [15:27:56] it's not just ms-be3's network that saturated [15:27:58] it was the proxies too [15:28:06] that's what I said [15:28:22] oh did you [15:28:24] was busy patching :) [15:30:59] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:31:05] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:31:13] !log reedy synchronized php-1.21wmf2/includes/EditPage.php [15:31:23] Logged the message, Master [15:31:59] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:32:17] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:32:46] squid is not caching that video atm [15:33:14] I can't see the video squid logs [15:33:21] could it be videoscalers? [15:33:27] could be [15:33:30] the new videoscalers that we deployed today [15:33:30] i have no idea [15:33:56] PROBLEM - Apache HTTP on srv190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:34:11] dont think so, they had not activity so far [15:34:11] i assume you can see where the requests are coming from? [15:34:23] PROBLEM - Apache HTTP on srv220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:34:37] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Video%2520scalers%2520pmtpa&tab=m&vn= [15:34:39] as soon as I can find them [15:34:41] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:35:17] PROBLEM - Apache HTTP on srv219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:35:32] definitely that video though [15:35:50] the frontends have falled back on ms-be12, also saturated [15:37:33] mark: at least through sq56.wikimedia.org [15:38:16] yes that's the CARP host [15:38:36] New review: jan; "1) There have to be some includes in mediawiki.pp: " [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/28355 [15:39:06] http://www.mediawiki.org/wiki/API:Main_page includes the video [15:39:56] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:40:08] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:41:00] j^: I was looking to an ipv6 ip that had some hits [15:41:05] and it was you [15:41:09] (via sixxs, via your 6bone handle) [15:41:21] easy to find [15:41:26] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.195 second response time [15:41:35] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.601 second response time [15:42:04] paravoid: but that was just me looking for places that video is linked from [15:42:09] yeah [15:43:23] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:44:26] mark: I think it's that swift bug [15:44:30] where it does the full fetch [15:44:34] nice [15:44:39] even if you 
request the range [15:44:43] it's just a guess [15:44:52] yeah i'm not seeing much on the squids or in the logs [15:44:53] but I can't explain how they have a full 1gbps in [15:44:53] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.495 second response time [15:44:56] but no out [15:45:26] can I go to dinner? [15:46:20] may i (grammer burn) [15:46:54] =P [15:46:59] RECOVERY - Apache HTTP on srv190 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.061 second response time [15:47:17] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.302 second response time [15:47:22] !log analytics1011-1022 will be dying today, byebye c2100 [15:47:22] heh [15:47:27] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.053 second response time [15:47:27] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.052 second response time [15:47:27] I restarted the proxy on ms-fe1 [15:47:34] Logged the message, RobH [15:47:44] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 59879 bytes in 0.211 seconds [15:47:56] ffs [15:48:11] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.052 second response time [15:48:11] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [15:48:17] drdee: did you wanna login to any of the c2100s and tell them to go to hell before i pull them? ;] [15:48:41] plus yer the only one on your team around that i see and i just wanna let you know im pulling them [15:48:47] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [15:50:48] paravoid: i'm going to dinner, but if you need help, please text/call me [15:52:32] mark: seems to have recovered [15:52:36] after the proxy restart [15:52:39] but thanks :) [15:54:35] gah, again [15:55:22] !log analytics1012-1022 are starting wipe and having network cable pulled [15:55:35] Logged the message, RobH [15:56:04] all three proxies saturated [16:02:08] PROBLEM - Host analytics1022 is DOWN: PING CRITICAL - Packet loss = 100% [16:02:29] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [16:02:29] PROBLEM - Host analytics1021 is DOWN: PING CRITICAL - Packet loss = 100% [16:02:35] PROBLEM - Host analytics1019 is DOWN: PING CRITICAL - Packet loss = 100% [16:02:35] PROBLEM - Host analytics1012 is DOWN: PING CRITICAL - Packet loss = 100% [16:02:44] PROBLEM - Host analytics1018 is DOWN: PING CRITICAL - Packet loss = 100% [16:02:57] PROBLEM - Host analytics1015 is DOWN: PING CRITICAL - Packet loss = 100% [16:02:57] PROBLEM - Host analytics1016 is DOWN: PING CRITICAL - Packet loss = 100% [16:03:02] PROBLEM - Host analytics1017 is DOWN: PING CRITICAL - Packet loss = 100% [16:03:11] PROBLEM - Host analytics1013 is DOWN: PING CRITICAL - Packet loss = 100% [16:04:41] PROBLEM - Host analytics1014 is DOWN: PING CRITICAL - Packet loss = 100% [16:06:27] !log db45 replacing disk2 (slot 1) [16:06:36] Logged the message, Master [16:07:18] New patchset: RobH; "analytics1011-1022 are being returned for new hardware, decommissioning the name so it removes from nagios as well as dhcp files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29600 [16:07:38] !log reedy synchronized php-1.21wmf2/includes/Message.php [16:07:44] RobH: at least for the swift servers, we talked about keeping the same names [16:07:52] Logged the 
message, Master [16:08:00] chris and ma rk [16:08:06] !log reedy synchronized php-1.21wmf2/includes/db/Database.php [16:08:20] Logged the message, Master [16:08:22] paravoid: i didn't remove names for swift...going to keep it the same [16:08:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29600 [16:08:44] cmjohnson1: right. I just saw the commit above [16:08:52] k..cool [16:09:02] cmjohnson1: So in gerrit you dont have +2? [16:09:07] but you have +1 right? [16:09:16] no [16:09:25] you cannot add anything? [16:09:49] when you go to review, what options do you have, just comment? [16:10:01] (take a look at my change for example) [16:10:05] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [16:14:03] !log reedy synchronized php-1.21wmf2/extensions/PagedTiffHandler/ [16:14:17] Logged the message, Master [16:16:34] back [16:16:42] paravoid: so you think it's the swift range bug? [16:20:05] not very sure [16:20:34] there were a lot of requests by one particular client [16:24:16] New review: Cmjohnson; "Looks good" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/29600 [16:24:22] there was a bit of a spike on sq56's network graphs, but really not that much [16:28:26] I really have to go, my accountant waits for me [16:28:30] I'll be back in an hour or so [16:28:33] (hopefully :) [16:28:47] and I'll get my laptop with me just in case [16:28:49] ttyiab [16:29:50] ok [16:29:50] gosh [16:29:50] and then people wonder why we use RT [16:29:51] bugzilla's UI is even worse [16:36:00] !log db56 replacing disk 7 (slot6) [16:36:14] Logged the message, Master [16:44:25] New patchset: Andrew Bogott; "Add "role::mediawiki-update::labs" for updating labs-MW-installations" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28355 [16:45:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28355 [17:11:40] !log reedy synchronized php-1.21wmf2/extensions/PagedTiffHandler [17:11:53] Logged the message, Master [17:12:20] New patchset: preilly; "Partner IP Live testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29608 [17:13:17] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29608 [17:13:50] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29608 [17:15:23] !log reedy synchronized wmf-config/CommonSettings.php 'wgMaxImageArea to 25MP' [17:15:35] Logged the message, Master [17:22:36] New patchset: Dereckson; "(bug 41213) Namespace configuration for gl.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29611 [17:27:30] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28355 [17:37:29] seems quieter now [17:37:51] (not back yet, just checking in) [17:38:26] PROBLEM - SSH on ssl1001 is CRITICAL: Server answer: [17:39:53] !log analytics1011 now offline and wiping as well [17:40:05] RECOVERY - SSH on ssl1001 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [17:40:06] Logged the message, RobH [17:41:26] PROBLEM - Host analytics1011 is DOWN: PING CRITICAL - Packet loss = 100% [17:44:34] New patchset: J; "update all extensions even if one fails" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29615 [17:45:38] New patchset: Ottomata; "Installing tofrodos on stat1 for Stefan" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29617 [17:46:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29615 [17:46:44] New patchset: J; "update all extensions even if one fails" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29615 [17:47:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29617 [17:47:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29615 [17:50:25] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29617 [17:51:02] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [18:25:23] RECOVERY - Puppet freshness on stat1 is OK: puppet ran at Tue Oct 23 18:24:55 UTC 2012 [18:28:49] New patchset: Jdlrobson; "add article context to factual error" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29623 [18:29:19] New patchset: Jdlrobson; "add article context to factual error (bug 39467)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29623 [18:36:38] binasher: for https://bugzilla.wikimedia.org/show_bug.cgi?id=41174 can't someone do the find/grep on nas1? [18:37:54] I suspect it'll need doing client side [18:38:02] nfs client side [18:39:00] AaronSchulz: sure.. [18:39:09] probably faster than swift ;) [18:40:15] !log shutting down es10 for dimm replacement [18:40:28] Logged the message, Master [18:40:35] AaronSchulz: not want to write a maintenance script? 
:p [18:43:05] PROBLEM - Host es10 is DOWN: PING CRITICAL - Packet loss = 100% [18:50:17] lesliecarr: could you please look at this today rt3779 [18:53:03] cmjohnson1: cool [18:57:19] cmjohnson1: done [18:58:23] RECOVERY - Host es10 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [18:58:35] lesliecarr: thx [18:58:41] binasher: all set [18:58:43] no prob [19:01:05] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [19:02:53] PROBLEM - mysqld processes on es10 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [19:04:43] RECOVERY - mysqld processes on es10 is OK: PROCS OK: 1 process with command name mysqld [19:04:58] cmjohnson1: thanks! [19:09:19] New patchset: Cmjohnson; "Adding db78 dhcpd file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29631 [19:10:02] !log asher synchronized wmf-config/db.php 'returning es10 to service' [19:10:16] Logged the message, Master [19:10:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29631 [19:16:44] New patchset: Cmjohnson; "Removing db62 from the ms-fe config, fixing db cnfg to include db 62" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29635 [19:17:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29635 [19:18:29] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 215 seconds [19:18:42] does db62 need its raid reconfigured back to raid-10 along with a reimage? [19:18:49] yes [19:19:11] once the changes are merged i am going to fix [19:20:12] binasher: can you +2 my my changes? [19:29:43] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29635 [19:29:43] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29631 [19:30:11] cmjohnson1: i merged it on sockpuppet too [19:30:36] cool..thx [19:30:36] binasher: so, how when is memcached stuff happening? [19:30:50] s/how// [19:34:30] dear people who know php . . . if one were to want to set a systemwide environment variable, would it be possible? [19:34:38] AaronSchulz: the new client still uses $wgMemCachedServers i assume? [19:34:51] i.e. and arbitrary variable that can be read from cli or apache [19:35:21] binasher: by default, but you can override that [19:36:18] ok [19:37:27] so maybe we can add a wgMemCachedServersNew to mc.php and code to select the new client with that config array if test.wikipedia.org or if srv193 [19:38:24] I don't like calling stuff "New", eventually it becomes old, and we all look silly :) [19:38:30] but yes, that's the idea [19:38:33] Jeff_Green: google suggests its somewhat messy.. "The environment variable will only exist for the duration of the current request." [19:38:45] yeah, that's what drove me here [19:39:34] AaronSchulz: s/New/Test ? :) [19:39:46] binasher: you'd have $wgObjectCaches['memc-pecl'] = array( ... ) and you could use that new global for the 'servers' setting [19:41:37] AaronSchulz: would it be ok to do this in an if{} in CommonSettings after the current wgObjectCaches definitions even though it isn't "common" if testwiki specific? 
[19:42:25] ah hah, there's already test2wiki stuff in commonsettings [19:42:46] Reedy: perhaps I can achieve the same result with a regular system environment variable [19:44:26] binasher: the $wgObjectCaches['memc-pecl'] line doesn't need to be in an "if" [19:44:45] but a $wgMainCacheType line will [19:45:02] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [19:45:42] I see $wgMainCacheType is in mc.php now [19:47:00] AaronSchulz: what would it be instead of "CACHE_MEMCACHED" ? [19:47:43] you can give 'memc-pecl', the string name of a registered object store [19:49:01] so you register it in CommonSettings and enable it for testwiki in mc.php [19:49:18] though let me think about load order here [19:51:03] yeah, that will be fine if you set wgObjectCaches where the other multiwrite stuff is in CommonSettings [19:52:18] * AaronSchulz wonders if we could keep stuff consolidated in mc.php a bit more in the future [20:02:37] !log fixing db62 raid cnfg [20:02:49] Logged the message, Master [20:05:13] !log olivneh synchronized php-1.21wmf2/extensions/PostEdit [20:05:22] Logged the message, Master [20:06:02] PROBLEM - Host db62 is DOWN: PING CRITICAL - Packet loss = 100% [20:06:12] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 189 seconds [20:07:59] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [20:17:35] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 206 seconds [20:31:39] !log dist-upgrade & rebooting payments[1-4] [20:31:53] Logged the message, Master [20:32:26] PROBLEM - Host payments1 is DOWN: PING CRITICAL - Packet loss = 100% [20:34:05] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [20:34:05] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [20:35:35] RECOVERY - Host payments1 is UP: PING OK - Packet loss = 0%, RTA = 1.48 ms [20:40:48] !log reedy synchronized php-1.21wmf2/includes/ [20:41:00] Logged the message, Master [20:43:50] notpeter: who's on RT duty today? I presume it's not Asher again :) [20:44:12] it's for a week! [20:44:20] oh, I thought it changed daily [20:45:12] ok. binasher, whom should andre__ and I contact to check on https://bugzilla.wikimedia.org/show_bug.cgi?id=41008 "Clear out NFS"? [20:46:31] That's labs... [20:46:59] Labs has a separate NFS? ok, so I'll switch the product & component back [20:47:34] sumanah, ctfoo told me it should go to ryan lane [20:47:57] andre__: ah, you already did it. ok. [20:48:47] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [20:52:13] New patchset: Dereckson; "(bug 41213) Namespace configuration for gl.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29611 [20:54:23] New patchset: Ryan Lane; "Fix virt cluster in gmetad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29691 [20:55:44] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29691 [20:55:58] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29691 [20:55:58] New patchset: Dzahn; "add the star.wikidata.org SSL cert" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29692 [20:56:58] !log restarting gmetad [20:57:04] !log restarting gmetad on nickel [20:57:12] Logged the message, Master [20:57:12] New review: Dzahn; "RT-3534" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/29692 [20:57:12] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29692 [20:57:27] Logged the message, Master [21:01:42] New patchset: Ryan Lane; "Change ganglia aggregators for virt cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29693 [21:03:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29693 [21:04:05] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29693 [21:07:28] New patchset: Ryan Lane; "Ugh. Apparently I read upside down." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29696 [21:08:08] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [21:08:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29696 [21:09:11] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [21:25:47] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29696 [21:27:22] New patchset: Ryan Lane; "Fix aggregators in gmetad config for virt cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29701 [21:28:38] New patchset: Pgehres; "Adding back URLs for LandingCheck." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29702 [21:28:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29701 [21:29:48] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29701 [21:38:59] !log mlitn Started syncing Wikimedia installation... : updating ArticleFeedbackv5, PageTriage & E3Experiments [21:39:11] Logged the message, Master [21:41:12] !log reedy synchronized php-1.21wmf2/includes/api/ [21:41:24] Logged the message, Master [21:42:59] New patchset: Dzahn; "protoproxy.pp config for wikidata.org (SSL/LB)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29706 [21:44:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29706 [21:59:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:02:36] !log reedy synchronized php-1.21wmf2/includes/api/ [22:02:49] Logged the message, Master [22:03:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds [22:12:59] if anyone has noticed ganglia spamming our logs with "slurpfile() read() buffer overflow on file /proc/stat", you may be interested to know that I just got it fixed upstream [22:13:16] !log mlitn Finished syncing Wikimedia installation... : updating ArticleFeedbackv5, PageTriage & E3Experiments [22:13:28] Logged the message, Master [22:16:05] slurpfile? 
heh [22:16:12] New patchset: Dzahn; "protoproxy.pp config for wikidata.org (SSL/LB)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29706 [22:16:41] https://github.com/ganglia/monitor-core/pull/63 [22:17:21] New patchset: Dzahn; "protoproxy.pp config for wikidata.org (SSL/LB)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29706 [22:18:19] also I noticed that our job runners sometimes see spikes of high system CPU usage [22:18:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29706 [22:18:50] like right now on mw7: http://ganglia.wikimedia.org/latest/graph.php?h=mw7.pmtpa.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1351030697&g=cpu_report&z=medium&c=Jobrunners%20pmtpa [22:19:05] apparently we use ulimit -v to limit memory usage of job runners [22:19:42] and when the limit is exhausted, libxml goes into a tight loop of mmap() calls for a few minutes [22:20:16] so I'm also working on a PHP patch to use emalloc() for libxml, since so many problems come back to that [22:21:26] binasher may possibly be interested in that [22:23:28] indeed.. preilly too ^^ [22:24:23] TimStarling: didn't you already have a emalloc() patch before? [22:24:38] I don't think so [22:24:43] if I did, this one is a bit different [22:25:01] okay cool [22:25:08] because now I've read the relevant libxml2 documentation, apparently there's a "right" way to override those hooks [22:25:09] TimStarling: where are you working on it? [22:25:28] ha ha ha — a "right" way [22:25:45] I'm working against PHP's git master, when it's done I guess I'll post it to php-internals [22:26:11] and I guess we will want to backport it to our own packages [22:26:37] binasher: still working on mc? [22:28:19] TimStarling: okay [22:29:40] PHP doesn't have a system for pull requests does it? [22:31:03] TimStarling: PHP accepts pull requests via github. Discussions are done on github, but depending on the topic can also be relayed to the official PHP developer mailinglist internals@lists.php.net. [22:31:09] TimStarling: https://github.com/php/php-src [22:31:26] ok [22:31:40] TimStarling: and new features require an RFC and must be accepted by the developers. See https://wiki.php.net/rfc and https://wiki.php.net/rfc/voting for more information on the process. [22:31:41] Bug fixes do not require an RFC, but require a bugtracker ticket. Always open a ticket at http://bugs.php.net and reference the bug id using #NNNNNN. [22:32:12] TimStarling: is there a libxml emalloc() bug open? 
[22:33:21] probably not [22:33:40] if there isn't, I'll open one shortly before I submit the pull request, and then reference the pull request from the bug [22:33:54] that should delay the trigger-happy bug triagers [22:34:09] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [22:34:25] you know, the ones who close everything as "invalid" [22:35:25] I thought they closed stuff as "bogus" [22:35:36] yeah, bogus seems to have disappeared [22:35:41] maybe it has been folded into invalid [22:36:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:42:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.388 seconds [22:45:26] actually if I use my @php.net address to open the bug, I should have a fair chance [22:45:39] I think that may have been a mistake I made in the past, using @wikimedia.org instead of @php.net [22:46:07] TimStarling: good idea [22:46:33] TimStarling: let me know when you post something — I'll respond to it quickly and ping a few other folks too [22:47:04] I just did a few searches, I don't think there is a bug open already [22:53:17] !log reedy synchronized php-1.21wmf2/includes/WikiPage.php 'debugging' [22:53:26] Logged the message, Master [22:53:35] ah yes, I did actually have a previous patch [22:53:51] I am remembering now because it hit the same segfault as the current patch is hitting [22:53:57] not a big problem though [22:54:22] !log reedy synchronized php-1.21wmf2/includes/WikiPage.php 'debugging' [22:54:35] Logged the message, Master [22:57:20] !log reedy synchronized wmf-config/ [22:57:32] Logged the message, Master [22:57:54] !log reedy synchronized php-1.21wmf2/includes/WikiPage.php [22:58:07] Logged the message, Master [23:01:23] TimStarling: oh okay [23:01:31] TimStarling: so is the old patch invalid now? [23:01:47] it was never completed [23:02:21] so yes [23:02:44] !log reedy synchronized php-1.21wmf2/includes/api/ [23:02:56] Logged the message, Master [23:09:26] !log reedy synchronized php-1.21wmf2/extensions/ExtensionDistributor [23:09:39] Logged the message, Master [23:11:20] Reedy, am I ok to scap or you're fixing something? [23:13:16] Feel free [23:17:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:22:10] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [23:25:01] !log maxsem Started syncing Wikimedia installation... : Weekly MobileFrontend and Zero deployment [23:25:14] Logged the message, Master [23:28:06] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [23:29:29] TimStarling: \o/ (re: libxml/emalloc()) [23:29:41] and yeah, I was telling you about those proper hooks at the all staff [23:30:11] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29706 [23:30:11] there are global variables, which are in the public headers, but you're not meant to set them [23:30:15] not that it's any big discovery or anything [23:30:29] there is a function called xmlMemSetup() which allows you to set the hooks [23:31:00] I had this in my browser's history: http://xmlsoft.org/xmlmem.html#setting [23:32:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.686 seconds [23:34:08] why does scap deploy to servers like searchidx2 and spence? do they run mw too?
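(Context for the emalloc()/xmlMemSetup() thread above: memory that libxml2 grabs through its default allocator is invisible to PHP's memory_limit and only runs into the process-wide ulimit -v ceiling, which is where the mmap() retry loop appears, whereas memory requested through PHP's own request allocator is accounted and fails fast. The snippet below is only a userland illustration of that fail-fast behaviour, not the patch itself, and the 64M limit is arbitrary.)

<?php
// Illustration only: allocations made through PHP's request allocator are
// counted against memory_limit and abort with a fatal error once the limit
// is reached, rather than leaving the process spinning on failed mmap()
// calls under a ulimit -v cap.
ini_set( 'memory_limit', '64M' );
$chunks = array();
while ( true ) {
	$chunks[] = str_repeat( 'x', 1 << 20 ); // ~1 MiB per iteration, via emalloc
}
// Expected outcome: "PHP Fatal error: Allowed memory size of 67108864 bytes
// exhausted", after which a job runner can simply move on to the next job.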
[23:37:13] searchidx2 runs dumpBackup.php, or at least used to [23:37:22] to generate an XML file for indexing [23:37:50] spence uses MediaWiki for some nagios checks [23:39:10] TimStarling: Have we got a way to find out a source of numerous api requests? IP/referer/user agent? We're getting tonnes of action=parse requests for SpecialPages..... [23:39:40] have you tried api.log on fluorine? [23:39:47] it has the IP, at least [23:40:02] then you can cross-reference that with user-agent from the squid logs, if there are enough requests [23:40:41] Duh [23:40:45] How did I not see that log [23:41:13] Are the squid logs anywhere I can get at them? [23:41:42] so, grepping for page=Special I'll only get english (no big deal) [23:42:34] can you log in to locke? [23:42:45] yup [23:43:10] so /a/squid/sampled-1000.log on locke [23:44:07] but it only has one in every 1000 requests, so there has to be a lot of requests before some will show up there [23:44:44] if there aren't enough, we can set up a special-case filter [23:45:04] there was 10s per second appearing in the exception log earlier today on fluorine [23:45:54] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28604 [23:46:03] Some of the action=parse for Special:Contributions resolve back to a vps company [23:58:59] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours
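(The triage workflow discussed at 23:39–23:46, pulling candidate client IPs out of api.log and then matching them against user agents in the sampled squid log, is easy to script. The sketch below is not an existing tool; the default filename, the page=Special filter, and the assumption that each matching line carries the client IP as a whitespace-separated field are placeholders to adjust to the real log format.)

<?php
// Hypothetical helper: count client IPs seen on action=parse requests for
// Special pages in an api.log-style file, so the top talkers can then be
// cross-referenced with /a/squid/sampled-1000.log on locke.
$log = isset( $argv[1] ) ? $argv[1] : 'api.log';
$counts = array();
foreach ( new SplFileObject( $log ) as $line ) {
	if ( strpos( $line, 'action=parse' ) === false || strpos( $line, 'page=Special' ) === false ) {
		continue; // note: the page=Special filter only catches English wikis
	}
	// Assumption: the client IP appears somewhere on the line as its own field.
	foreach ( preg_split( '/\s+/', trim( $line ) ) as $field ) {
		if ( filter_var( $field, FILTER_VALIDATE_IP ) !== false ) {
			$counts[$field] = isset( $counts[$field] ) ? $counts[$field] + 1 : 1;
			break;
		}
	}
}
arsort( $counts );
foreach ( array_slice( $counts, 0, 20, true ) as $ip => $n ) {
	echo "$n\t$ip\n";
}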