[00:06:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:09:39] !log maxsem Finished syncing Wikimedia installation... : Weekly MobileFrontend and Zero deployment [00:09:51] Logged the message, Master [00:17:29] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29702 [00:17:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.447 seconds [00:18:34] !log maxsem synchronized php-1.21wmf2/extensions/MobileFrontend/ [00:18:48] Logged the message, Master [00:19:21] binasher: so one is pecl mc? :) [00:19:52] s/one/when [00:20:29] AaronSchulz: feel free to do the config instead of waiting for me [00:20:49] ok, I'll try to get something in gerrit this week [00:21:00] i am currently going over a large quantity of timing data from the fundraising team in prep for a meeting with erik in 10 minutes [00:21:01] i'll prob have time tomorrow though [00:21:02] !log maxsem synchronized php-1.21wmf1/extensions/MobileFrontend/ [00:21:14] Logged the message, Master [00:25:10] can someone please flush mobile Varnish? [00:26:11] MaxSem: ok, hold on [00:26:51] !log flushing varnish mobile cache [00:27:03] Logged the message, Master [00:27:49] binasher: would you happen to know if our version of varnish supports HTCP operations? [00:28:32] not directly, we use a special daemon for that [00:29:24] kk; I've seen what wikia did to make it support it - I presume ours is something similar? [00:30:45] or; actually; that doesn't even matter -- what I'm really after is if I use SquidUpdate::HTCPPurge() it will work with our varnish boxen? [00:31:04] mutante, thanks! [00:31:15] np, Max [00:37:19] binasher: ? ^ [00:45:02] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 190 seconds [00:49:05] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [00:50:38] !log maxsem synchronized php-1.21wmf1/extensions/MobileFrontend [00:50:46] Logged the message, Master [00:52:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:04:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds [01:08:02] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 4 seconds [01:37:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:41:17] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 244 seconds [01:46:14] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [01:52:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds [01:59:35] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 249 seconds [02:11:00] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [02:25:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:30] !log LocalisationUpdate completed (1.21wmf2) at Wed Oct 24 02:27:30 UTC 2012 [02:27:46] Logged the message, Master [02:28:06] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 203 seconds [02:37:59] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 25 seconds [02:38:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds [02:48:06] 
!log LocalisationUpdate completed (1.21wmf1) at Wed Oct 24 02:48:06 UTC 2012 [02:48:20] Logged the message, Master [03:50:44] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 26 seconds [03:52:05] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [05:02:08] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [05:05:02] New patchset: Daniel Friesen; "Use $wgHiddenPrefs to hide real name." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29741 [05:50:38] TimStarling: Any chance you could merge and deploy Daniel's rev from above? [05:51:07] The real name field is showing up on the English Wikipedia and other sites. Someone didn't read the release notes before deploying... :-) [05:55:29] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29741 [05:59:32] !log tstarling synchronized wmf-config/CommonSettings.php [05:59:47] Logged the message, Master [06:09:06] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [06:18:56] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:20:45] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:21:15] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:21:29] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:22:05] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.059 second response time [06:23:53] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.450 second response time [06:27:56] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time [06:29:19] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [06:35:34] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [06:35:34] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [06:36:19] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:37:40] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:38:43] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:42:01] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.233 second response time [06:45:46] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [06:46:13] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.052 second response time [07:08:52] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [07:09:46] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [07:21:14] hello [07:52:21] mark: I'm reading Daniel's mail (and getting confused) [07:52:55] didn't you say we don't have any more IPs in esams? 
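On the SquidUpdate::HTCPPurge() question above (around [00:30]): HTCP purging in MediaWiki of this era is one multicast UDP datagram per URL, so it is cache-agnostic; Varnish never speaks HTCP itself, a listener daemon on each cache box (the "special daemon" binasher refers to; see the varnishhtcpd commits later in this log) translates the packets into local PURGE requests. A minimal sketch with 1.21-era setting names; the multicast group and URL below are illustrative assumptions, not production values:

    // In wmf-config/CommonSettings.php or LocalSettings.php (sketch only):
    $wgHTCPMulticastAddress = '239.128.0.112'; // assumed group the cache-side daemons join
    $wgHTCPPort = 4827;                        // conventional HTCP port
    $wgHTCPMulticastTTL = 8;                   // must be high enough to cross routers

    // Anywhere in MediaWiki code, each URL becomes one HTCP CLR packet:
    SquidUpdate::HTCPPurge( array(
        'http://en.m.wikipedia.org/wiki/Main_Page', // hypothetical URL
    ) );

So in principle the answer to "will it work with our varnish boxen" is yes, provided each varnish host runs the listener and its group/port match the wiki's settings.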
[08:19:12] yes i did [08:25:05] i'm replying now [08:34:49] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [08:41:52] PROBLEM - SSH on ms-fe2 is CRITICAL: Connection refused [08:51:46] RECOVERY - SSH on ms-fe2 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [09:20:57] New patchset: Faidon; "autoinstall: partman raid1 profiles mega-commit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29754 [09:21:33] I don't thing anyone wants to review that :) [09:22:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29754 [09:22:49] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [09:23:43] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [09:24:39] New patchset: Hashar; "import zuul module from OpenStack" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25235 [09:25:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27611 [09:25:43] New review: Hashar; "* updated zuul.init from upstream" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/25235 [09:25:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25235 [09:28:49] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [09:53:19] New patchset: Faidon; "autoinstall: partman raid1 profiles mega-commit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29754 [09:54:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29754 [09:54:50] * mark cheers [09:55:34] for the raid1 megacommit? :) [09:56:36] yes [09:56:43] i never got why people need to make a gazillion profiles [09:59:45] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [10:08:27] New review: Faidon; "Noone is going to go near that, I might just as well +2 it myself." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/29754 [10:11:00] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [10:11:58] New patchset: Faidon; "autoinstall: partman raid1 profiles mega-commit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29754 [10:12:10] would it be possible to rename a header in squid? looking into adding X-Content-Duration to videos, swift only allows setting X-Object-Meta-Duration, so if squid could turn X-Object-Meta-Duration into X-Content-Duration it might work [10:12:47] j^: no it doesn't [10:12:52] docs are incorrect [10:12:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27611 [10:12:59] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29754 [10:13:01] they added the feature of arbitrary headers at one point [10:13:06] and never updated the docs [10:13:34] (that's my guess from my interpretation of the code, might be incorrect) [10:14:16] https://code.launchpad.net/~notmyname/swift/arbitrary_headers/+merge/54455 [10:14:19] yeah, that's merged [10:14:38] but I have to explicitly allow it on the swift configs [10:14:59] paravoid: i was testing with git head of swift here and i can only set X-Object-Meta [10:15:13] ah ok, allow in config... [10:15:50] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [10:16:14] try it and if it works, I'll open a bug for their docs [10:16:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27611 [10:17:46] it should be "allowed_headers = X-Content-Duration" in the [object-server] section [10:19:26] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29754 [10:19:32] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [10:20:34] New patchset: Hashar; "import zuul module from OpenStack" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25235 [10:21:35] New review: Hashar; "removed the /etc/zuul/wikimedia file definition out of the module. Will be created by the wikimedia ..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/25235 [10:21:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27611 [10:21:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25235 [10:39:10] RECOVERY - Puppet freshness on ms-fe2 is OK: puppet ran at Wed Oct 24 10:39:03 UTC 2012 [10:44:34] RECOVERY - NTP on ms-fe2 is OK: NTP OK: Offset -0.06889736652 secs [10:49:49] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [11:00:21] j^: any luck? [11:03:02] New patchset: Mark Bergsma; "Make sure high range requests don't block on large objects being retrieved" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29763 [11:04:09] New patchset: Mark Bergsma; "Make sure high range requests don't block on large objects being retrieved" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29763 [11:05:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29763 [11:08:21] New patchset: Mark Bergsma; "Make sure high range requests don't block on large objects being retrieved" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29763 [11:08:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:09:19] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29763 [11:11:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.963 seconds [11:12:40] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29763 [11:15:19] PROBLEM - Host ms-fe2 is DOWN: PING CRITICAL - Packet loss = 100% [11:16:58] RECOVERY - Host ms-fe2 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [11:16:58] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.011 seconds [11:17:07] RECOVERY - Memcached on ms-fe2 is OK: TCP OK - 0.002 second response time on port 11211 [11:21:03] New patchset: Mark Bergsma; "req.hash_ignore_busy can't be set in vcl_hash for some reason" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29764 [11:22:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29764 [11:22:08] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29764 [11:27:37] New patchset: Mark Bergsma; "req.hash_ignore_busy can't be read in vcl_miss for some reason" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29765 [11:28:08] oh heh [11:28:32] sloppy bastards [11:28:39] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29765 [11:29:15] I guess you can set a hidden header [11:29:15] yes [11:29:15] but I hate that too [11:29:15] ugly [11:29:21] i want to set booleans [11:29:30] there's a vmod which adds that I think [11:29:30] but can't be bothered [11:31:55] Request to first data byte: 35026 ms [11:32:08] I was able to replicate your observation now [11:32:12] of course when I tested it yesterday, it was cached [11:32:23] we'll see how it does in half an hour when puppet has run [11:33:11] python-eventlet (0.9.17-0ubuntu1~cloud0) precise-folsom; urgency=low [11:33:14] * Backport for Ubuntu Cloud Archive: [11:33:19] - Newer version of python-eventlet is required for nova. [11:33:19] ohrly [11:33:33] how hard is it to have a line or two explaining WHY [11:33:39] goddamit [11:34:16] because it doesn't leak sockets? ;) [11:34:59] heh yeah [11:35:13] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=ms-fe1.pmtpa.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1351078449&g=network_report&z=large&c=Swift%20pmtpa [11:35:27] traffic is increased since yesterday [11:35:55] not hitting the 1gbps but still, increased compared to what normally is [11:36:15] any idea why? [11:36:15] and more importantly, there's a big gap between in and out [11:36:24] that bug? [11:36:25] maybe... [11:36:30] that needs to be fixed [11:36:44] yeah, I'm still waiting for a reply [11:36:46] from the swift people [11:36:58] the range one you mean, right? [11:37:02] yes [11:37:14] I mailed them a separate mail from the container sync thread [11:37:50] that had a) the 90 DELETEs/s that killed swift and b) the Gertie incident with i) range bug ii) posix_fadvise() calls [11:39:00] can you CC me on such mails? 
i'm interested [11:39:45] there, problem solved :) [11:39:58] Request to first data byte: 500 ms [11:40:35] the 2nd range request for the end of the file now came in well before the first one completed (for the entire file) [11:41:37] hehe [11:41:37] you rock [11:42:09] thanks for the pcap, that was hugely helpful [11:43:47] there, I bounced you some mails [11:43:57] and I'll keep you Cc'ed in future ones [11:45:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:46:02] tnx [11:47:52] PROBLEM - Puppet freshness on cp1043 is CRITICAL: Puppet has not run in the last 10 hours [11:48:36] increased load since yesterday is expected, since I took down ms-fe2 for the upgrade [11:48:43] but the gap between in/out puzzles me [11:48:54] the range bug might be an explanation [11:49:06] should be able to see that in a pcap, right? [11:49:59] a 400mbit pcap? :) [11:51:10] New patchset: J; "Bug 41304 Add X-Content-Duration to allowed_headers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29768 [11:51:40] use ngrep and try to replicate it with a range request for a given object? [11:52:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29768 [11:52:21] j^: did you test that? [11:52:53] PROBLEM - Puppet freshness on srv255 is CRITICAL: Puppet has not run in the last 10 hours [11:52:53] the bug exists, there's no question about that. I'm wondering if we currently experiencing it or not [11:52:55] I'll find a way [11:53:07] in the meantime, I'll put the updated ms-fe2 back into rotation :) [11:54:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.323 seconds [11:55:48] paravoid: tested here with local swift setup (using git head, 1.7.5 though) [11:56:27] !log putting ms-fe2 (upgraded to precise/swift 1.7.4) back into the pool [11:56:30] cool to see some fast results on X-Content-Duration :) [11:56:40] Logged the message, Master [11:56:42] indeed [11:58:37] we could combine that with some throughput rate limiting [11:59:16] calculate the bitrate of the video and serve it at a slightly higher rate [12:03:03] there are web servers that have that feature [12:03:17] yes [12:03:20] but there's not much reason varnish couldn't also have it I guess [12:04:17] I'm no expert but I think it's a bit harder than that though [12:04:24] the bitrate is variable [12:04:29] sure [12:04:36] so you might be way off from (avg + N%) [12:04:44] so you'd want to set it to at least 1.5* avg or so [12:05:39] still better than pumping out entire gigabytes for a file a user might abort after 30s [12:06:55] not sure how many users have gigabit connections to the datacenter [12:07:19] i do [12:07:29] :) [12:07:51] i just notice that when google linked that Gertie_the_Dinosaur video, not exactly a big video at all [12:07:56] we were serving it at 3 Gbps [12:08:10] if you do rate limiting, you want to burst at full speed for some amount and slow it down after that [12:08:15] the max our 3 swifcod do [12:08:15] yes [12:08:46] and currently, one backend varnish or squid instance can serve a video at max 1 Gbps [12:08:49] we're gonna up that [12:08:56] but it's worrying ;) [12:10:10] I worry too [12:10:17] gertie was a small video btw [12:10:22] like I said [12:10:36] right [12:11:52] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [12:29:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds [12:31:05] New review: Hashar; "Quoting the output variable is already handled with change I1ccf76ce: beta: autoupdater now reports ..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/29615 [12:31:09] there [12:31:19] made vmod_std's std.collect() work in vcl_deliver :) [12:34:07] New patchset: Mark Bergsma; "Make std.collect work in vcl_deliver" [operations/debs/varnish] (patches/vcl_deliver-collect) - https://gerrit.wikimedia.org/r/29771 [12:34:50] New patchset: Mark Bergsma; "Make std.collect work in vcl_deliver" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29772 [12:34:50] New patchset: Mark Bergsma; "varnish (3.0.3plus~rc1-wm5) precise; urgency=low" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29773 [12:41:04] New patchset: J; "update all extensions even if one fails" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29615 [12:42:10] New patchset: Mark Bergsma; "Collect response headers Via and X-Varnish on delivery" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29774 [12:42:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.085 seconds [12:43:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29615 [12:43:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29774 [12:43:15] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29774 [12:43:54] Change merged: Mark Bergsma; [operations/debs/varnish] (patches/vcl_deliver-collect) - https://gerrit.wikimedia.org/r/29771 [12:44:13] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29772 [12:44:20] New review: J; "> I have no idea what ':' is for in shell.. What is it supposed to do ?" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/29615 [12:44:38] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29773 [12:56:31] New patchset: MaxSem; "Rm old logging code" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29775 [13:12:57] New patchset: Mark Bergsma; "Collapse X-Cache into one header" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29777 [13:14:06] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29777 [13:17:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:23] paravoid: hello :-) Been late sorry. [13:30:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.022 seconds [13:39:55] !log installing base OS on db62 [13:40:09] Logged the message, Master [13:48:37] hey ^demon, gerrit question [13:48:43] <^demon> Shoot. [13:48:56] i am trying to add ottomata as reviewer to a patchset [13:49:31] and i get the error application error, ottomata is neither a registered user nor a group [13:49:57] but his name and email address pop up in the reviewer box [13:50:37] paravoid: can you test videos again if you have time? 
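The Varnish work merged above (the range-request fix, changes 29763–29765, and the header collection work, changes 29771 and 29774) comes down to a few lines of VCL. A sketch under Varnish 3.0 semantics; note that the std.collect() calls on response headers only work with Mark's patched vmod_std from change 29771, stock 3.0.3 rejects them in vcl_deliver:

    import std;

    sub vcl_recv {
        # Range fix: don't coalesce a high byte-range request behind a busy
        # in-flight fetch of the whole (possibly huge) object. Per the commit
        # messages above, hash_ignore_busy is settable here but not in
        # vcl_hash, and not readable in vcl_miss.
        if (req.http.Range) {
            set req.hash_ignore_busy = true;
        }
    }

    sub vcl_deliver {
        # Header cleanup: fold the duplicates each cache layer appends into a
        # single comma-separated header on the way out.
        std.collect(resp.http.Via);
        std.collect(resp.http.X-Varnish);
    }

On the rate-limiting idea floated after the Gertie discussion, the arithmetic is modest: a 150 MB, 10-minute video averages 2 Mbit/s, so pacing delivery at mark's suggested 1.5× average (3 Mbit/s) stays ahead of playback, yet a viewer who aborts after 30 seconds has been sent only about 11 MB instead of whatever a multi-Gbps link could push in that window.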
[13:50:49] New review: Hashar; "bash has sooo many tricks:" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/29615 [13:51:26] i see firefox poking around when seeking [13:51:37] <^demon> drdee: No clue. Don't see anything in the error log. [13:52:07] can you try to add him to patchset 29779? [13:52:49] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [13:52:52] i must say, i have often weird application errors, also after submitting a code review i will get an application error but the review still gets through [13:53:02] could there be an issue with my account somehow? [13:53:24] <^demon> No, I doubt there's any issues with your account. [13:53:31] <^demon> I get the same error on adding Otto. [13:53:39] <^demon> Still, nothing in the error logs. [13:53:44] pffeww it's not me :D [13:53:54] <^demon> I'll file upstream. But without a stacktrace I dunno. [13:54:21] maybe look in the ajax call? [13:54:45] <^demon> Javascript console says nothing special. [13:57:52] ^demon, and so when i enter my review, i get application error, cannot submit change xyz, needs Code Review but obviously it has a code review [13:58:08] <^demon> I have no idea. [13:58:58] Change abandoned: J; "ok, i guess once change I1ccf76ce is merged the logs will at least show problems with an extension s..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29615 [14:00:54] ^demon: Hi chad :-] [14:02:21] ^demon: I have screwed up a freshly created repository : integration/zuu-conf . I have sent two commits as changes in Gerrit but could not merge them since it was lacking an "initial commit". I eventually pushed them straight to gerrit but now the changes are still there and marked abandoned =) [14:02:39] ^demon: so whenever I send a new commit, ,it can be merged since it is based on an abandoned change :-/ [14:02:57] <^demon> Rewrite history? [14:03:04] might try that [14:03:18] <^demon> If nobody's really using it but you yet, no harm really. [14:03:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:04:05] ^demon: any idea how to rewrite the first commit ? :-] [14:05:06] git checkout --orphan newmaster [14:08:07] RECOVERY - Host db62 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [14:18:08] ^demon: that worked. Thanks :-] [14:18:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds [14:18:38] <^demon> yw :) [14:21:05] New patchset: Mark Bergsma; "Retab" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29784 [14:21:24] !log upgarding srv221 and srv222 to precise [14:21:32] seriously [14:21:38] labs is slow as hell again [14:21:38] Logged the message, notpeter [14:21:52] I think I will end up renting a few servers :-] [14:22:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29784 [14:23:32] New patchset: Mark Bergsma; "Retab" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29784 [14:24:31] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29784 [14:24:56] New patchset: Mark Bergsma; "Retab" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29784 [14:25:58] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29784 [14:32:45] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [14:33:50] New patchset: Hashar; "import zuul module from OpenStack" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25235 [14:34:46] New patchset: Mark Bergsma; "Attempt to make varnishhtcpd a bit more robust on errors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29786 [14:35:48] New review: Hashar; "rebased again to get change I0f8e3fe81 :" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/27611 [14:35:48] New patchset: Pyoungmeister; "setting srv221 and srv222 to use modules as part of upgrade to precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29787 [14:36:19] PROBLEM - Host srv221 is DOWN: PING CRITICAL - Packet loss = 100% [14:36:39] yay [14:36:48] that's imagescalers, isn't it [14:36:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27611 [14:36:53] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29786 [14:36:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25235 [14:36:53] New patchset: Pyoungmeister; "setting srv221 and srv222 to use modules as part of upgrade to precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29787 [14:36:53] yes [14:37:43] <^demon> paravoid: Did we still want to do those precise upgrades tomorrow? [14:37:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29787 [14:38:06] 3am is a bit too late for me to be doing upgrades I'm afraid [14:38:36] <^demon> We can pick a better time for you. [14:39:00] <^demon> Sometime in the morning for me, perhaps. [14:39:12] I was hoping to chat with someone from the SF office who might have the time but I haven't been able to do that yet [14:39:35] I think Daniel is busy with wikidata/voyage stuff these days anyway [14:39:36] <^demon> Well Ryan and I were going to do manganese & formey Thurs. [14:39:50] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29787 [14:40:42] I kinda feel bad about asking Ryan stuff considering he's not getting the labs help from me as originally hoped :P [14:41:15] but yeah, if you guys are willing to do it earlier then I can make it [14:41:16] PROBLEM - SSH on srv222 is CRITICAL: Connection refused [14:41:27] <^demon> Well manganese & formey are going to be easy. Is gallium not straightforward? [14:41:38] PROBLEM - Apache HTTP on srv222 is CRITICAL: Connection refused [14:41:45] New patchset: Mark Bergsma; "Restart varnishhtcpd on changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29790 [14:41:46] paravoid: yep. 
I'm going to finish upgrading them today [14:41:52] hashar knows [14:41:57] but going to wait a couple of days before declaring them done [14:42:10] RECOVERY - Host srv221 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [14:42:23] I don't know anything about gallium really, my plan is to do a dist-upgrade and attempt fixing whatever shit breaks with hashar's help [14:42:35] my "plan" I should say [14:42:42] paravoid: sorry missed your chats yesterday :( [14:42:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29790 [14:42:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29790 [14:42:56] wtf. installs failing....... [14:43:18] <^demon> paravoid: Well I can pretty much do manganese & formey myself. Just need someone on hand in case Something Happens. [14:43:21] paravoid: have you passed the gallium upgrade to someone else or will you handle it yourself ? [14:43:22] notpeter: I messed with partman today, but not with mw.cfg [14:43:37] yeah. it was a fail on installing software [14:43:53] hashar: see above :) [14:43:54] and yesterday, I was reimaging pc3 and it failed on getting/running the late_commands [14:45:04] so hmm [14:45:13] as I understand it notpeter is going to be the one upgrading gallium ? ;-D [14:45:46] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:46:15] erm, not what I said [14:46:38] New review: Hashar; "The updater fail because of permissions issues :/ The script run as mwdeploy:mwdeploy where as admi..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29615 [14:46:40] if it makes sense to sync gerrit & jenkins (you tell me) and ^demon is willing to push it earlier then I'm doing it [14:47:16] <^demon> Well the upgrades aren't really dependent on one another, but makes sense to do them around the same time since they're related. [14:47:22] I don't mind doing it tomorrow at 10pm - midnight CET. (1pm - 3pm PST) [14:47:27] <^demon> Jenkins downtime interrupts gerrit, not the other way around. [14:47:49] <^demon> There's some deployments going on until 3pm PDT, which is why I picked 3-5pm PDT originally. [14:47:50] the one sure thing is that whenever Gerrit is done, Jenkins stays idling [14:48:03] <^demon> But we could go much earlier. Something that morning? [14:48:17] Jenkins can be stopped independently. That simply requires manually retriggering any changes that have been submitted to Gerrit during the downtime. [14:49:58] ^demon: the earlier your morning the better for me, I guess for hashar too [14:50:53] <^demon> That works for me. As long as we're done before 17:00 UTC, we won't be affecting any deployments I see. [14:50:53] or we could upgrade gallium much earlier [14:51:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:51:43] okay, want to do the jenkins upgrade on 15:00 UTC let's say? [14:52:31] <^demon> Jenkins and gerrit, 15-17:00? [14:52:45] one hour earlier ? Need to get my daughter at 16:00 UTC [14:53:10] unlikely my wife can leave earlier, though I could ask her this evening. [14:53:38] I don't mind, but I think it's getting a bit early for chad [14:53:44] <^demon> That's fine by me. It's 10am my time. [14:53:51] so 14:00 UTC ? [14:54:12] ack [14:54:13] nice [14:54:21] <^demon> Sounds good. I'll send out notices.
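For reference, hashar's orphan-branch fix from earlier ([14:04]–[14:18]) comes down to four commands; a sketch, using the branch name from the chat and assuming rights to force-push (it rewrites the remote branch's history, so it needs the force-push ACL in Gerrit):

    git checkout --orphan newmaster   # new parentless branch; the index keeps the current tree
    git commit -m "Initial commit"    # recreate the missing root commit
    git branch -M master              # rename over the old master
    git push -f origin master        # rewrite the remote branch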
[14:54:27] I am out to get my daughter, will connect again later on this evening though [14:54:27] thanks [14:54:31] thanks you two :-] [14:54:36] cya in a few hours [14:54:46] ^demon: I obviously can provide support to you too [14:54:55] instead of Ryan I mean. [14:55:24] <^demon> Yeah. Hopefully it'll go as smoothly as my testing locally & on labs went. [14:56:08] RECOVERY - SSH on srv222 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:01:48] New patchset: Platonides; "(Bug 41350) The translation of Wikipedia is said to be wrong on pa.wikipedia.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29792 [15:02:57] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [15:03:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.051 seconds [15:06:19] PROBLEM - NTP on srv221 is CRITICAL: NTP CRITICAL: Offset unknown [15:06:31] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.011 seconds [15:06:46] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [15:10:45] ^demon: thanks for the ping and the mail [15:10:53] and for being flexible regarding the time [15:11:15] <^demon> Yeah no problem. Totally makes sense to get it all done at once. [15:12:50] <^demon> Ryan will be happy when I tell him he's off the hook now too ;-) [15:16:07] RECOVERY - NTP on srv221 is OK: NTP OK: Offset -0.1137486696 secs [15:17:35] New patchset: Ottomata; "Using only an10 as ganglia aggregator for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29797 [15:18:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29797 [15:19:54] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29797 [15:20:16] PROBLEM - NTP on srv222 is CRITICAL: NTP CRITICAL: Offset unknown [15:30:37] New patchset: Ottomata; "ganglia.pp - only using analytics1010 as ganglia aggregator (for now)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29798 [15:31:39] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29798 [15:31:58] RECOVERY - NTP on srv222 is OK: NTP OK: Offset -0.03310871124 secs [15:33:37] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29798 [15:37:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:42:34] mark: it seems we hit 1gbps on ms-fe1 moments ago [15:42:55] PROBLEM - Host ps1-c1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.11) [15:42:55] PROBLEM - Host ps1-d3-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.19) [15:42:55] PROBLEM - Host ps1-a1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.1) [15:42:55] PROBLEM - Host ps1-b5-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.10) [15:42:55] PROBLEM - Host ps1-c3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.13) [15:42:56] PROBLEM - Host ps1-c2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.12) [15:42:56] PROBLEM - Host ps1-b4-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.9) [15:43:13] PROBLEM - Host ps1-a4-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [15:43:13] PROBLEM - Host ps1-a2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.2) [15:43:13] PROBLEM - Host ps1-b2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.7) [15:43:13] PROBLEM - Host ps1-d1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.17) [15:43:13] PROBLEM - Host ps1-d3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.16) [15:43:14] PROBLEM - Host ps1-d2-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.18) [15:43:14] PROBLEM - Host ps1-b1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.6) [15:43:15] PROBLEM - Host ps1-a3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.3) [15:43:15] PROBLEM - Host ps1-b3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.8) [15:43:16] PROBLEM - Host ps1-d2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.15) [15:43:16] PROBLEM - Host ps1-a5-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.5) [15:43:58] PROBLEM - Host ps1-d1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.14) [15:45:39] PROBLEM - Host mr1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.2.3) [15:47:35] RECOVERY - Host ps1-d1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.33 ms [15:47:35] RECOVERY - Host ps1-c1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.54 ms [15:47:35] RECOVERY - Host ps1-c3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.26 ms [15:47:35] RECOVERY - Host ps1-d1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.51 ms [15:47:35] RECOVERY - Host ps1-a2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.02 ms [15:47:37] New review: Nikerabbit; "Ping, there are unanswered comments here." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28238 [15:47:43] RECOVERY - Host ps1-a5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 1.68 ms [15:47:43] RECOVERY - Host ps1-b4-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.53 ms [15:47:43] RECOVERY - Host ps1-a4-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.34 ms [15:47:43] RECOVERY - Host ps1-c2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.42 ms [15:47:43] RECOVERY - Host ps1-b2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.84 ms [15:47:44] RECOVERY - Host ps1-d3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.72 ms [15:47:44] RECOVERY - Host ps1-a1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 1.88 ms [15:47:53] RECOVERY - Host ps1-b5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.71 ms [15:48:04] RECOVERY - Host ps1-d3-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.54 ms [15:48:04] RECOVERY - Host ps1-d2-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.33 ms [15:48:10] RECOVERY - Host ps1-b1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.95 ms [15:48:10] RECOVERY - Host ps1-a3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.53 ms [15:48:10] RECOVERY - Host ps1-b3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.48 ms [15:48:55] RECOVERY - Host ps1-d2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.10 ms [15:50:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.896 seconds [15:51:30] RECOVERY - Host mr1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 1.91 ms [15:58:53] New patchset: Jgreen; "adding user pcoombe to aluminium+grosley" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29803 [15:59:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29803 [16:01:28] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29803 [16:02:34] RECOVERY - Puppet freshness on aluminium is OK: puppet ran at Wed Oct 24 16:02:27 UTC 2012 [16:02:59] New patchset: Mark Bergsma; "Proof of concept for hashing thumbs to original's hash key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29805 [16:03:59] !log removing h310 controller from db62 and adding original h700 card [16:03:59] Logged the message, Master [16:04:01] New review: Amire80; "Need consensus from the Wiki communmity." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/29792 [16:04:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29805 [16:07:13] PROBLEM - Host db62 is DOWN: PING CRITICAL - Packet loss = 100% [16:09:04] New patchset: Mark Bergsma; "Proof of concept for hashing thumbs to original's hash key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29805 [16:09:46] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [16:10:14] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29805 [16:11:16] PROBLEM - Host db22 is DOWN: PING CRITICAL - Packet loss = 100% [16:12:11] New review: Reedy; "I think i" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28238 [16:12:31] New patchset: Jgreen; "fixed class name for user pcoombe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29806 [16:13:39] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29806 [16:13:49] New patchset: Matthias Mullie; "Sverige = sv" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29807 [16:15:00] !log shutting down db61 for rack relocation [16:15:15] Logged the message, Master [16:18:33] New patchset: Demon; "Updated $wgConf->suffixes for wikidata/wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29809 [16:23:18] heya um [16:23:19] i'm having a weird problem [16:23:40] my gmond.conf on one of the analtycis machines is not respecting the $ganglia_aggregator variable [16:23:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:29] i have this: [16:24:29] # analytics 1010 is set up as a ganglia [16:24:29] # aggregator for the Analytics cluster. [16:24:29] if $hostname == "analytics1010" { [16:24:29] $ganglia_aggregator = "true" [16:24:29] } [16:24:31] in site.pp [16:24:33] but [16:24:45] deaf = yes [16:24:48] is still in gmond.conf [16:25:18] bitch slap it until it shows you respect [16:25:54] bitchslap => true, [16:25:54] unless => respected, [16:25:57] :) [16:26:08] i'm looking at puppet to see if i see anything here [16:26:23] # aggregator should not be deaf (they should listen) [16:26:23] # ganglia_aggregator for production are defined in site.pp; [16:26:23] # for labs, 'deaf = "no"' is defined in gmond.conf.labsstub [16:26:23] if $ganglia_aggregator { [16:26:23] $deaf = "no" [16:26:23] } else { [16:26:23] $deaf = "yes" [16:26:23] } [16:26:27] and in gmond_template.erb [16:26:28] deaf = <%= deaf %> [16:26:45] actually, i'm pretty sure what is happening, as its happened to me before [16:26:51] oh yeha? [16:27:10] so the ganglia variables (by default aggregator = no) are getting set and acted upon in the role class [16:27:21] and then at the very end analytics1010 is set to aggregator = true [16:27:41] put the aggregator = true before the include role::analytics [16:28:36] hmmmmmmmmmmm [16:29:28] will try, although I didnt' think puppet was ordered like that [16:30:00] New patchset: Ottomata; "Attempting to get ganglia_aggregator respected!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29810 [16:30:48] in theory it's not [16:30:54] in practice, it's dumb [16:31:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29810 [16:31:10] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29810 [16:36:46] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [16:36:46] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [16:38:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.142 seconds [16:38:50] Welp, thanks LeslieCarr, that did it! 
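The fix that worked here is purely about evaluation order: the ganglia manifest computes $deaf from $ganglia_aggregator at the point the class is parsed, so the variable has to be assigned before the role class pulls ganglia in. A sketch of the resulting site.pp shape, with the node and role names taken from the chat:

    node "analytics1010.eqiad.wmnet" {
        # must precede the include: it is read when the ganglia class
        # inside the role is evaluated
        $ganglia_aggregator = "true"
        include role::analytics
    }

As Leslie says, Puppet manifest evaluation is in theory not order-sensitive, but dynamically scoped top-scope variables of this era make it so in practice.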
[16:39:30] New review: Reedy; "http://www.youtube.com/watch?v=gx4jn77VKlQ" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29810 [16:42:50] hm LeslieCarr, can I ask you some ganglia questions? [16:42:53] not sure how much you know [16:43:01] sure [16:43:02] not sure how much i know either ;) [16:43:05] so, i'm trying to set up hadoop ganglia metrics [16:43:16] I need to specify ganglia hosts in my hadoop config files [16:43:23] ok [16:43:31] it is workign only on analytics1010 right now, which is the machine that I set up to be the ganglia aggregator [16:43:45] but, there is also the udp_send_channel [16:43:50] that gets set up by puppet with a multicast addy [16:43:59] "analytics" => { [16:43:59] "name" => "Analytics cluster", [16:43:59] "ip_oct" => "32" }, [16:44:03] $mcast_address = "${ip_prefix}.${ipoct}" [16:44:19] that gets put in the gmond.conf file [16:44:30] should I be using the $mcast_address as the ganglia addy in my hadoop configs [16:44:43] or analytics1010.eqiad.wmnet [16:44:43] ? [16:45:26] New review: Reedy; "https://gerrit.wikimedia.org/r/#/c/29807/" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28238 [16:45:41] i would send all the data to the multicast addy so 1010 can send it on to th master [16:45:50] ahh ahhhh, i see [16:45:55] so, everybody sends to multicast [16:46:09] and deaf = no makes that guy listen to multicast and send to gmetad master [16:46:13] k lemme try that [16:46:38] yep [16:53:52] RECOVERY - Puppet freshness on analytics1001 is OK: puppet ran at Wed Oct 24 16:53:40 UTC 2012 [16:56:55] New patchset: MaxSem; "Bug 38617 - Redirector.c should read redirect regex from config" [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/27035 [17:10:28] !log db62 loading OS [17:10:39] Logged the message, Master [17:10:49] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [17:11:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:55] RECOVERY - Host db62 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [17:21:09] !log lcarr synchronized wmf-config/throttle.php 'I updated a throttle' [17:21:22] Logged the message, Master [17:22:22] PROBLEM - MySQL Recent Restart on db62 is CRITICAL: Connection refused by host [17:22:22] PROBLEM - MySQL Slave Delay on db62 is CRITICAL: Connection refused by host [17:22:49] PROBLEM - MySQL Replication Heartbeat on db62 is CRITICAL: Connection refused by host [17:23:28] PROBLEM - SSH on db62 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:28] PROBLEM - MySQL disk space on db62 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:23:34] PROBLEM - Full LVS Snapshot on db62 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:23:52] PROBLEM - MySQL Slave Running on db62 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:24:01] PROBLEM - MySQL Idle Transactions on db62 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
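Following Leslie's advice above ([16:44]–[16:46]), the Hadoop side should publish to the multicast group rather than to analytics1010 directly, so whichever host is the non-deaf aggregator picks the metrics up. A sketch of hadoop-metrics2.properties, assuming the metrics2 framework and the Ganglia 3.1-wire-format sink (the file name and sink class are assumptions, not from the log):

    *.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
    *.sink.ganglia.period=10
    # the Analytics cluster group derived in ganglia.pp: ${ip_prefix}.${ip_oct}
    namenode.sink.ganglia.servers=239.192.1.32:8649
    datanode.sink.ganglia.servers=239.192.1.32:8649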
[17:24:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.052 seconds [17:25:52] !log preilly synchronized php-1.21wmf1/extensions/MobileFrontend 'update after deploy' [17:26:07] Logged the message, Master [17:26:35] !log preilly synchronized php-1.21wmf2/extensions/MobileFrontend 'update after deploy' [17:26:49] Logged the message, Master [17:28:52] paravoid: http://commons.wikimedia.org/wiki/Commons:Village_pump#Problem_is_back_.28Oct_23.29 [17:38:09] !log lcarr synchronized wmf-config/throttle.php 'I updated a throttle' [17:38:21] Logged the message, Master [17:40:40] !log preilly synchronized php-1.21wmf1/extensions/ZeroRatedMobileAccess 'update post deploy' [17:40:49] Logged the message, Master [17:41:09] !log preilly synchronized php-1.21wmf2/extensions/ZeroRatedMobileAccess 'update post deploy' [17:41:23] Logged the message, Master [17:41:57] notpeter - ping [17:46:13] PROBLEM - Host db62 is DOWN: PING CRITICAL - Packet loss = 100% [17:48:10] RECOVERY - SSH on db62 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [17:48:22] RECOVERY - Host db62 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [17:53:18] !log upgrading srv223 and srv224 to precise [17:53:30] Logged the message, notpeter [17:58:30] ping notpeter [17:59:13] hey, what's up? [17:59:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:59:56] u upgraded srv190 to the latest/greatest distro (imagescaler) [18:00:01] PROBLEM - Host srv223 is DOWN: PING CRITICAL - Packet loss = 100% [18:00:12] srv223 and 224 are the last of the lucid scalers [18:00:24] after I upgrade them, we should be ready to go [18:00:40] ahh... u are upgrading them now ? [18:00:47] (although, i'd still like to let them sit until monday just to make sure that nothing goes terribly wrong and we have to roll back) [18:00:52] yep [18:00:53] cool [18:00:54] thks [18:00:57] robla - ping [18:01:00] see above [18:01:08] so, actually, at ths moment, we're fully precise on imagescalers [18:01:11] yeah, we about to send out an update email [18:01:15] *was [18:02:10] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Everything else to 1.21wmf2 [18:02:22] Logged the message, Master [18:05:43] RECOVERY - Host srv223 is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [18:05:55] PROBLEM - SSH on srv224 is CRITICAL: Connection refused [18:06:28] PROBLEM - Apache HTTP on srv224 is CRITICAL: Connection refused [18:07:28] re [18:09:16] woosters: update sent [18:09:46] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:09:57] awesome [18:09:57] PROBLEM - SSH on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:25] thanks notpeter [18:10:40] RECOVERY - SSH on srv224 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:11:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.916 seconds [18:11:25] RECOVERY - SSH on srv223 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:12:40] PROBLEM - NTP on db62 is CRITICAL: NTP CRITICAL: No response from NTP server [18:14:50] Can someone depool mw40 please? 
[18:15:11] It's causing a load of php warnings with relation to reading/writing files [18:15:39] See tail -n 1000 /home/wikipedia/syslog/apache.log | grep resource [18:19:22] yargh, ganglia on analytics is not happy anymore, only on analytics1010 [18:19:37] LeslieCarr or notpeter, are either of you available to help me poke? I'm not sure where to look [18:19:37] http://ganglia.wikimedia.org/latest/?c=Analytics%20cluster%20eqiad&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [18:19:54] i see the gmond process sending stuff [18:22:56] hrm [18:23:10] question #1 - why do you had ganglia and freedom [18:23:54] so i'll look at analytics1009 and analytics1010 [18:23:54] New patchset: Pyoungmeister; "setting srv223 and srv224 to use modules for precise upgrade" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29820 [18:24:26] eh? [18:24:37] hate* [18:24:37] ah [18:24:40] hehe [18:24:51] i also obviously hate spelling [18:24:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29820 [18:25:05] i am trying to set ganglia freeeeeeeeeee [18:25:15] man, don't tell me about hating freedom, I looked at the statue of liberty from a roof last weekend! [18:25:20] that must count for something [18:25:23] man, in one channel we're talking about hating freedoms, in another I'm talking about unknown unknowns.... what is our discourse coming to? [18:25:41] in analytic we are playing word games where only I know the rules [18:25:48] dschoon and drdee just lost [18:25:53] heh [18:25:58] PROBLEM - NTP on srv224 is CRITICAL: NTP CRITICAL: No response from NTP server [18:26:23] no ottomata, you think you determine the rules, that is the matrix you are in [18:26:23] this is because ottomata, in addition to hating freedom, hates america. [18:26:51] he also hates cats, love, and jump-rope [18:27:04] wtf? [18:27:04] no, the statue of liberty is for old fashioned freedoms, pre-1941 [18:27:04] you never told me that, ottomata [18:27:04] cats ? [18:27:08] who can hate cats ? [18:27:18] iknorite? [18:27:19] I mean, I do. but that's for allergy reasons... [18:27:20] lulcatz???? [18:27:32] he must clearly be a bad person, who has rigged the game as a result. [18:27:54] don't hate the player or the game. hate the person who makes the rules for the game [18:28:35] "don't hate the player or the game, hate the umpire"? [18:28:44] Can somebody please depool mw40? And optionally kick it a few times to find out why it's making noise on fwrite calls etc [18:28:49] PROBLEM - NTP on srv223 is CRITICAL: NTP CRITICAL: No response from NTP server [18:29:13] ^demon: notpeter: are we going to upgrade formey to Precise? [18:29:21] woosters: (in lieu of binasher), see Reedy's request [18:29:42] <^demon> hashar: formey, manganese & gallium. And it was paravoid working with us I thought, not notpeter. [18:29:50] um, cats schmats, dogs rule [18:29:56] i do not hate jumprope [18:30:10] in fact, I jumped rope while holding 2 slices of pizza just last week [18:30:14] mw40 ? [18:30:15] I mean, everything will be upgarded eventually... but as of right now I don't know about formey in particular [18:30:16] love? questionaable [18:30:26] robla / reedy ? [18:30:27] ^demon: nice :-] That also mean doxygen will be upgraded too :-) [18:30:28] not notpeter == peter? 
Reedy: sure [18:30:43] that is the matrix :D [18:31:00] woosters: yup [18:31:40] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29820 [18:33:02] notpeter - can u kick mw40? [18:33:12] yep [18:33:17] thks [18:33:24] ottomata: so i can say definitively that analytics1010 is not receiving traffic from other ganglia hosts while the other switches are [18:33:37] other switches? [18:33:38] s/switches/aggregators/ [18:33:41] ah [18:33:41] lemme check out switch config [18:33:47] interesting [18:33:52] me and nouns today are not getting along [18:34:02] but the mcast_address looks correct, right? [18:34:02] in gmond.conf? [18:34:34] udp_send_channel { [18:34:34] mcast_join = 239.192.1.32 [18:34:34] port = 8649 [18:34:45] yeah [18:34:45] i tried to netcat -ul on that addy [18:34:46] didn't see any traffic [18:35:15] so we have been having some issues with stupid asw-c-eqiad and its firmware [18:35:19] i'm so tempted to just downgrade it [18:35:44] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [18:37:09] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28351 [18:37:31] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.008 seconds [18:37:53] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds [18:38:26] LeslieCarr: yeah do it [18:38:40] ottomata: also, looks like i didn't put the pim interface for the analytics vlan on cr2-eqiad, which is probably the real issue [18:39:19] Import failed: A database error has occurred. Did you forget to run maintenance/update.php after upgrading? See: https://www.mediawiki.org/wiki/Manual:Upgrading#Run_the_update_script Query: INSERT IGNORE INTO `globalimagelinks` (gil_wiki,gil_page,gil_page_namespace_id,gil_page_namespace,gil_page_title,gil_to) VALUES ('itwikibooks','30416','0','','Miele','Runny_hunny.jpg'),('itwikibooks','30416','0','','Miele','Honey_comb.jpg') [18:39:19] ,('itwikibooks','30416','0','','Miele','Commons-logo.svg'),('itwikibooks','30416','0','','Miele','Wikiquote-logo.svg') Function: GlobalUsage::insertLinks Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.0.6.41) [18:39:22] well… maybe that was the real issue …, though perhaps not :-/ [18:39:27] !log catrope synchronized php-1.21wmf2/includes/upload/UploadStash.php 'Debugging exception' [18:39:39] Logged the message, Master [18:39:41] still not seeing analytics1010 picking up any multicast traffic [18:40:08] hm [18:41:18] New patchset: Ryan Lane; "We manage gerrit's version manually; ensure that" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29822 [18:42:04] so, LeslieCarr, on an09 [18:42:13] this is opened by gmond [18:42:13] analytics1009.eqiad.wmnet:52187->239.192.1.32:8649 [18:42:16] which looks correct [18:42:21] so I think the machines are sending [18:42:28] New patchset: Aaron Schulz; "Added memcached-pecl cache object (enabled for testwiki)." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29824 [18:42:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29822 [18:42:49] binasher: take a look at https://gerrit.wikimedia.org/r/#/c/29824/ [18:42:56] the multicast traffic just isn't making it around, i guess [18:42:57] ?
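Collected, the checks being used here (the interface name is an assumption):

    netcat -lu 239.192.1.32 8649     # join the group; prints any gmond traffic received
    tcpdump -ni eth0 udp port 8649   # confirm senders are putting packets on the wire
    tcpdump -ni eth0 igmp            # verify the host emits IGMP joins for the switch to snoop

If netcat sees traffic but the switch's igmp-snooping membership table (queried below with "show igmp-snooping membership") disagrees, the join state and the data path are out of sync, which points at the switch firmware or the missing PIM interface rather than at gmond itself.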
[18:43:30] yeah, it looks like they are sending [18:43:31] :-/ [18:43:40] and tcpdump is showing they're spewing traffic [18:43:54] but analytics1010 just doesn't want to listen to the multicast party [18:44:23] notpeter: Are you messing with srv223/224? [18:44:24] !log catrope synchronized php-1.21wmf2/includes/upload/UploadStash.php 'Debugging exception' [18:44:37] Logged the message, Master [18:44:52] it's not just an10, right? i can't get that multicast stream via netcat (I should be able to, right?) [18:44:59] sudo netcat -lu 239.192.1.32 8649 [18:45:00] RoanKattouw: yeah, sorry [18:45:07] they're getting upgrades [18:45:15] last of the imagescalers! [18:45:25] No worries, just checking [18:45:40] cool [18:45:48] LeslieCarr: show igmp-snooping membership on asw-c-eqiad [18:45:54] shows only one entry [18:45:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:46:09] !log catrope synchronized php-1.21wmf2/includes/upload/UploadStash.php 'Debugging exception' [18:46:23] Logged the message, Master [18:46:44] analytics1026 ..that's interesting [18:47:21] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29822 [18:48:04] binasher: so, pc3 refuses to pick up and run late_commands [18:48:10] weird [18:48:10] which means.... no ssh server or key... [18:48:13] yeah [18:48:19] it's failed 3 times :/ [18:48:25] anywho, that's what's up with pc3 [18:48:37] any idea why? [18:48:37] an26 is currently consuming the udp2log multicast stream [18:48:49] to count the number of bytes it produces hourly [18:49:06] LeslieCarr ^ [18:49:23] !log catrope synchronized php-1.21wmf2/includes/upload/UploadStash.php 'rm debugging live hacks' [18:49:35] Logged the message, Master [18:49:54] binasher: not yet [18:50:29] ah [18:50:36] still very interesting ... [18:54:23] ottomata: if you have no objection, i'm going to try restarting ganglia on analytics1010 to see if that will trigger it trying to rejoin the multicast group [18:54:32] if not, my best guess right now is that it's the software on that switch [18:54:38] yeah please [18:54:46] restart anything you need [18:54:53] but hmm [18:54:55] are the analytics machines in use right now ? (there's a few other groups of machines on that switch as well) [18:55:15] yes? [18:55:26] 1001-1010, and 23-27 are in use [18:55:30] 1023-1027* [18:55:39] grrr, not restarting :( [18:55:41] 1011-1022 were the C2100s, and have been taken offline [18:55:44] it's not restarting? [18:55:49] i mean it restarted, but not listening [18:55:50] oh, aye [18:55:50] so [18:56:05] i can't listen anywhere though [18:56:07] shouldn't I be able to? [18:56:08] netcat -lu 239.192.1.32 8649 [18:56:09] ? [18:56:11] New patchset: Demon; "Purposefully broken, will abandon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29825 [18:56:32] you should be able to listen on analytics1010, as it should join that group [18:56:47] with multicast it doesn't just broadcast the packets, a machine requests to join and then gets sent packets [18:56:52] hmmmmm [18:57:08] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/29825 [18:57:08] right, but netcat -lu asks to join, right?
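Since multicast delivery depends on the receiver joining the group (as LeslieCarr explains above), membership can also be checked on the host itself; a sketch, again assuming eth0:

    ip maddr show dev eth0   # lists multicast groups joined on that interface
    netstat -gn              # same information, via the kernel's group table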
[18:57:18] (at least, i've made that work for the udp2log multicast) [18:57:22] oh interesting [18:57:29] i do see traffic with netcat -lu on an10 [18:57:48] so maybe this is a ganglia problem [18:57:49] hm [18:58:01] New patchset: Demon; "Purposefully broken, will abandon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29825 [18:58:13] well, which source ip is that from ? [18:58:41] oh from analytics1004 [18:58:58] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/29825 [18:59:02] that's interesting because the switch doesn't show it as being in the multicast group [18:59:32] New patchset: Demon; "Purposefully broken, will abandon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29825 [19:01:14] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/29825 [19:01:33] ottomata: have a phone conf, bbiab [19:02:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds [19:03:12] eh? that IP is the multicast addy, no? [19:03:24] New patchset: Demon; "Purposefully broken, will abandon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29825 [19:06:00] ^demon: you killed gerrit :-] [19:06:17] <^demon> Gerrit's up. [19:06:34] \O/ [19:06:37] ottomata: it is the multicast address [19:06:50] <^demon> hashar: Was it down? [19:06:57] but it also looks like the switch doesn't think that it is part of the multicast group [19:07:04] ^demon: I got a few 503 errors, probably nothing to worry about [19:07:30] <^demon> hashar: I'm playing around with https://integration.mediawiki.org/ci/job/operations-puppet/configure. It doesn't seem to be pulling the changes, just building production. [19:07:58] ohh [19:08:16] I thought ops wanted to use the python Gerrit hooks [19:08:37] New patchset: MaxSem; "WIP: support for multiple Solr cores" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29827 [19:08:37] <^demon> No, we've removing those. [19:08:37] <^demon> Ryan and I hate them. [19:08:45] bahhh [19:08:55] <^demon> s/removing/removed/ [19:08:59] <^demon> I thought I had this working already [19:09:20] LeslieCarr: [19:09:20] but it also looks like the switch doesn't think that it is part of the multicast group [19:09:22] what is 'it' in that sentence? :p [19:09:34] ^demon: let me review the jenkins conf [19:09:35] analytics1010 [19:09:50] /bin/sed -i 's%import \"../private%#import \"../private%' manifests/base.pp | /usr/bin/puppet parser validate --color none manifests/site.pp [19:09:52] ... [19:10:26] ^demon: just call "rake validate" :-] [19:10:33] <^demon> That doesn't work. [19:10:40] aye, right, as in, an10 gmond is not joining the group [19:10:40] <^demon> It has nothing to do with the batch script, that's fine. [19:10:41] k will look at that later [19:10:54] but on an10, if I run [19:10:54] netcat -lu 239.192.1.32 8649 [19:11:06] I get traffic (from only a few nodes, I think) [19:11:28] <^demon> hashar: Every single one of the builds says "Checking out Revision 034735af235312c600a4dfda9048c7d1b99dc54c (origin/HEAD, origin/production)" [19:11:28] <^demon> (Same hash) [19:11:34] so, that seems to me like the networking configs are working, right?
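For completeness, the join can also be made explicit rather than relying on netcat's behaviour; a hedged alternative using socat, with eth0 assumed:

    # Joins 239.192.1.32 via IGMP and dumps whatever arrives on port 8649 to stdout.
    socat -u UDP4-RECV:8649,ip-add-membership=239.192.1.32:eth0 STDOUT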
[19:11:42] it's just ganglia itself that isn't joining [19:12:11] … maybe …. it also could just be that since it's all on the same physical switch/vlan it's just snooping …. [19:12:11] ^demon: I am changing the branches to build from "**" to "$GERRIT_BRANCH" [19:12:12] but. I should be able to join that group with netcat -lu from any node, right? [19:12:13] oh hmmmmmm [19:12:14] anyways, phone conf for realzies right now [19:13:00] ^demon: note that you can retrigger a change from https://integration.mediawiki.org/ci/gerrit_manual_trigger/ [19:13:10] <^demon> Yeah, I know. [19:13:12] ah but, LeslieCarr [19:13:13] <^demon> I was making sure I had a syntax error :) [19:13:19] i don't think I would be able to run netcat -ul on that addy [19:13:28] if gmond was using it [19:13:30] example: [19:13:41] on an26, where I have udp2log listening on a multicast addy [19:13:53] $ netcat -ul 233.58.59.1 8420 [19:13:53] netcat: Address already in use [19:13:55] so, on an10 [19:13:56] ah :) [19:14:07] if gmond was actually using that, I wouldn't be able to join it on the same machine [19:14:28] hrm…. [19:15:36] ^demon: ahhh it missed the git strategy, which should be set to "Gerrit Trigger" [19:16:08] 19:15:45 Fetching upstream changes from https://gerrit.wikimedia.org/r/p/operations/puppet.git [19:16:09] 19:15:59 Commencing build of Revision 9a1ce9776c426c4081cbf12942b77adf478a721d (production) [19:16:09] yeahhh [19:16:13] 19:16:02 Finished: FAILURE [19:16:32] <^demon> There we go. [19:16:36] ^demon: so the Git plugin has different strategies when it comes to grabbing the commits [19:16:36] <^demon> Thanks [19:17:05] ^demon: in chronological order or reverse one [19:17:05] or using the Gerrit trigger [19:17:05] Change abandoned: Demon; "Done testing." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29825 [19:17:20] it is missing the ansi colors :/ [19:17:26] and I am wondering why rake lint does not work [19:17:34] err rake validate [19:18:03] binasher: have you played around with telnet and mc* ? [19:18:31] ^demon: ahhh I remember why 'rake validate' does not work. It relies on a class which is not available in the puppet client installed on gallium [19:18:34] <^demon> Right. [19:18:36] <^demon> I tried it :) [19:19:00] binasher - ping [19:19:28] dfoy made another request for traceroute - rt 3758 [19:19:31] AaronSchulz: ? [19:19:42] as in, to test it [19:19:51] ^demon: so that would probably self-fix when gallium is upgraded :-] [19:20:16] RECOVERY - NTP on srv224 is OK: NTP OK: Offset -0.02762901783 secs [19:20:35] woosters: both of his traceroute requests are done in that ticket, and both were to the same address [19:20:54] he wants telnet [19:20:54] ^demon: any reason why you are removing the color output from puppet ? [19:21:00] woosters: what? [19:21:02] Chris forgot to do that [19:21:28] RECOVERY - NTP on srv223 is OK: NTP OK: Offset -0.03235328197 secs [19:21:44] woosters: no he didn't, that wasn't a real request for anything [19:22:07] <^demon> hashar: That's what the old script did, I just copy+pasted. [19:23:05] his request 'We're following up with our partners, who are asking for a traceroute and telnet output from our Vumi server (zhen/silver).'
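On the "telnet and mc*" question, the memcached text protocol is easy to poke by hand; a sketch using nc against one of the boxes from the loop quoted further down (mc1 stands in for any of them):

    # set a 5-byte value with a 60-second TTL, read it back, then quit
    printf 'set test_key 0 60 5\r\nhello\r\nget test_key\r\nquit\r\n' | nc mc1 11211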
[19:23:32] ^demon: you can try out with colors though the yellow on a white background is probably not going to play nice [19:23:34] there is no connectivity between us and their ip, as the traceroutes all show [19:23:47] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [19:24:02] woosters: they don't even have telnet open there.. that's gibberish [19:24:37] ya, i would be surprised if the telnet port is open on their side [19:25:18] Eloquence: I bet you have ogg video lengths now, when using varnish upload in eqiad ;) [19:25:42] hi Mark [19:25:43] hi [19:25:49] u got upload going in eqiad? [19:25:56] no, tomorrow [19:26:03] mark, cool - lemme try [19:26:04] found and fixed/optimized some more stuff today [19:26:11] and had to clear the cache, so i'll need to do a slow rampup [19:27:18] it's actually completely empty now, so whatever you do will be cache misses - worst case [19:27:53] AaronSchulz: are you saying there's a problem? [19:29:40] mark - how's ur idea about hashing thumbs to original's hash key? [19:29:43] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [19:29:52] just posted to wikitech-l about it [19:30:06] it has pros and cons [19:31:08] mark: thanks for the reply, from which IP range should i use an IP for wikidata-lb.wikimedia.org though. (the other -lb.wikimedia.org are all in the esams network) [19:31:22] mark: yes! it works. the user experience seems identical on squid vs. varnish now :-) [19:31:29] it's also working pretty much the same [19:31:30] binasher: no, just curious what testing was done so far [19:31:49] Eloquence: we're gonna improve it some more, by sending an X-Content-Duration header with the video length [19:31:49] mediawiki/swift can set/send that [19:31:52] AaronSchulz: for x in mc{1..16} ; do echo $x ; echo -e "stats\nquit" | nc $x 11211 | grep uptime ; done [19:31:53] awesome [19:32:31] mutante: i don't understand what you mean [19:32:41] wikidata-lb.wikimedia.org will be a georecord [19:32:46] it doesn't have an ip address [19:33:42] binasher: ok [19:34:09] mark: if in zone file "wikidata.org" i point to wikidata-lb.wikimedia.org i will also need to add that record in wikimedia.org. Or i just point to wikidata-lb.eqiad.wikimedia.org [19:34:10] mutante: the only difference is, the geomap for that record will not send anything to esams (for now) [19:34:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:20] mutante: no [19:34:31] are you aware of the powerdns geobackend and how that works? [19:34:33] it's documented on wikitech [19:36:28] mutante: wikipedia-lb.wikimedia.org etc don't exist in the wikimedia.org zone file either [19:36:32] no, hmm, i just see the others wikiquote-lb, wikimedia-lb having IP addresses in esams. [19:36:56] no that's wikiquote-lb.esams.wikimedia.org [19:36:59] wikipedia-lb 1H IN A 91.198.174.225 [19:37:01] check the $ORIGIN [19:37:06] ah [19:37:42] mutante: this is making me nervous, i'll do the config change tomorrow ok? :) and then show you what I did for next time [19:38:14] i need to go now [19:39:07] mark: ok, thanks, i have diffs sitting on sockpuppet, they just wish they didn't have to wait until tomorrow to set up the wiki [19:39:37] if you absolutely need to push this today, get someone to help you who understands dns well, as well as our dns setup [19:39:40] (so not Ryan ;-p) [19:39:57] be careful [19:40:24] <^demon> mutante, mark: Thanks for your help.
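The $ORIGIN point is the crux of mark's explanation: the bare -lb name is a geo record served by the powerdns geobackend, while the per-site name has a fixed address. A sketch of how the difference shows up (answers to the first query vary with the resolver's location):

    dig +short wikipedia-lb.wikimedia.org         # geo record, varies by where you ask from
    dig +short wikipedia-lb.esams.wikimedia.org   # fixed per-site record, e.g. 91.198.174.225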
[19:40:34] I'm out of here, have a good night [19:40:35] <^demon> mark: I'll talk to you in the morning about it if you have any questions from our side. [19:40:45] ^demon: ok, we can get this done tomorrow [19:40:50] <^demon> Thanks, have a good night. [19:40:51] the whole thing [19:41:01] night! [19:41:02] mark: ok, good night [19:49:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [19:52:33] New review: Dzahn; "wikidata apache config, looks good to start with" [operations/apache-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/27546 [19:52:33] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/27546 [19:58:45] dzahn is doing a graceful restart of all apaches [19:59:01] !log dzahn gracefulled all apaches [19:59:30] Logged the message, Master [19:59:30] mutante: sooooo, you'll actually have to do that in 30 minutes [19:59:33] oh, wait no [19:59:44] that's just for that ops-maintained one [19:59:44] nvm [19:59:44] sorry [20:00:03] ops-maintained one? [20:00:21] the one in operations/puppet [20:00:28] (not being very clear) [20:00:31] (sorry) [20:00:41] the one I wrote an email about the other day [20:00:43] what's happening in 30 minutes [20:00:46] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [20:00:55] puppet will have run on all hosts [20:00:55] but! [20:00:57] you're doing something different [20:00:57] sorry [20:01:11] ah, yeah [20:01:42] i pushed that via sync-apache, yep [20:02:00] ^demon: done, the apache config is on [20:02:24] <^demon> I saw, thanks. [20:02:45] ok [20:03:13] en.wikidata now redirects to meta [20:03:25] as opposed to the "unconfigured domain" message [20:17:04] ottomata: ok back and stuffs [20:17:39] New review: Asher; "Please remove redirect.conf from here, and commit under the operations/puppet repo with a manifest t..." [operations/debs/squid] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/27035 [20:18:28] cool [20:18:28] danke [20:18:45] so questions: [20:19:17] 1. why can't I see multicast traffic using netcat on nodes other than an10? [20:19:17] 2. why doesn't an10's gmond join the multicast group? [20:20:25] ottomata: i am going to use the power of the "internets" to see if there's someone with a similar problem ? [20:20:38] as number 2 [20:20:45] hmmmm ok! [20:22:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:06] binasher, are there any secrets in compiling redirector for production other than gcc -O3 -o redirector -lpcre redirector.c on a 64-bit Precise machine? [20:31:00] MaxSem: nope, then strip redirector [20:35:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds [20:36:32] :-/ so according to netstat, analytics1010 is listening … however it's not according to the switch … my gut instinct is that it's the stupid switch software ...
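The redirector build exchange above, written out as a script; a sketch under the stated assumptions (no packaging steps beyond what MaxSem and binasher mention):

    # -O3 and PCRE linkage as in the question; listing the source before -lpcre
    # is the safer ordering for linkers that default to --as-needed.
    gcc -O3 -o redirector redirector.c -lpcre
    strip redirector   # binasher's one addition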
[20:36:50] lemme see if there's anything in release notes on a minor upgrade of this [20:37:44] minor upgrades would allow me to upgrade each individual switch without breaking the stack until they are all at the same version [20:37:49] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours [20:39:56] * Damianz thinks 3..2..1 'wth isn't it part of the stack anymore' [20:40:58] New patchset: Pyoungmeister; "adding mc1 and mc2 as data sources in ganglia.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29876 [20:42:53] Change abandoned: Pyoungmeister; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29876 [20:44:20] !log demon synchronized php-1.21wmf2/includes/EditPage.php 'Deploying I4c2055be' [20:44:35] Logged the message, Master [20:44:38] New patchset: Pyoungmeister; "adding mc1 and mc2 as data sources in ganglia.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29877 [20:45:22] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29877 [20:48:58] !log demon synchronized php-1.21wmf2/includes/Title.php 'Syncing Ic2d3f0b8' [20:49:10] Logged the message, Master [20:50:52] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [21:00:37] ottomata: hrm, so i am just curious, i'm going to try making analytics1008 a ganglia master :) [21:04:02] certainly [21:04:13] i probably want to have two of them eventually anyway, right? [21:04:27] for redundancy? (not sure how that is supposed to work, just heard that was a good thing to do) [21:06:40] ottomata: yeah [21:06:41] i was just doing this to see what's up [21:06:46] ok this is so crazy [21:06:49] yeah? [21:06:59] tcpdump on analytics1008, all sorts of traffic [21:07:33] or maybe not [21:07:54] New patchset: MaxSem; "New, configurable redirector and its config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29883 [21:08:03] hadoop is running [21:08:08] gah this makes no sense [21:08:08] so probably lots of traffic all over the place [21:08:12] especially with an01 [21:08:15] but not on 239.192.1.32 :) [21:08:18] aye [21:08:26] it was having traffic from all the hosts… and then suddenly, not any more [21:08:37] that addy? [21:08:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:09:09] yeah [21:09:11] whoa, netcat -ul on an08 shows stuff [21:09:11] gah [21:09:21] netcat -ul 239.192.1.32 8649 [21:09:35] so you say gmond got traffic on that for a min, and then stopped? [21:10:04] well via tcpdump [21:11:46] Maybe no one is taking a dump on the server. [21:12:07] can I try something real quick? [21:12:13] gonna restart gmond on an08 [21:12:46] LeslieCarr^? [21:13:04] ok [21:13:12] hehe i just restarted it [21:13:12] and magic! [21:13:13] it's happy [21:13:18] oh? [21:13:20] actually i wonder [21:13:27] don't do a netcat [21:13:40] yeah whoa [21:13:40] i'm not [21:13:45] i see the traffic too [21:13:48] via tcpdump [21:13:55] hm [21:13:58] i shouldn't be able to do netcat [21:13:59] right? [21:14:03] i am wondering if netcat could be causing it ?
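A hedged way to pair the gmond restart with a quick census of who is actually sending to the group (eth0 assumed; field 3 of tcpdump's default output is the source address.port):

    service gmond restart
    timeout 10 tcpdump -l -n -i eth0 'dst host 239.192.1.32 and udp port 8649' \
      | awk '{print $3}' | sort | uniq -c | sort -rn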
[21:14:10] not certain [21:14:10] i'm not running it [21:14:12] i did before though [21:14:12] like i would think not but i'm not sure [21:14:27] maybe running it once causes the interface to join the mcast group or something (i'm a little ignorant here) [21:14:28] i'm going to try to run it right now [21:14:30] ok [21:14:53] hrm, that didn't stop the traffic flow [21:15:02] well there goes that theory [21:15:02] gmond is listening [21:15:07] tcp 0 0 0.0.0.0:8649 0.0.0.0:* LISTEN 19825/gmond [21:15:07] udp 0 0 239.192.1.32:8649 0.0.0.0:* 19825/gmond [21:15:22] New patchset: MaxSem; "Bug 38617 - Redirector.c should read redirect regex from config" [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/27035 [21:15:22] should we restart gmond again and see if it keeps working? [21:15:27] sounds good [21:15:30] want to do the honors ? [21:15:36] done [21:15:51] still looks pretty happy to me [21:15:55] yup [21:16:06] hm [21:16:34] still showing as down in ganglia.wikimedia.org [21:17:02] yeah, so my guess is it's not sending to nickel [21:17:19] an10 is [21:19:37] hrm, so right now i'd see if setting up any machines that are not 1010 purely via puppet works [21:22:14] hmmmmm, ok, I will try an03? [21:23:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.032 seconds [21:25:54] New patchset: Ottomata; "Attempting to use analytics1003 as ganglia aggregator" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29884 [21:26:54] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29884 [21:27:06] oo, jenkins bot doesn't notify of lint check pass in here yet [21:27:21] [21:27:30] it should notify if it fails though? [21:27:40] hopefully heh [21:28:49] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [21:33:12] New patchset: Dereckson; "(bug 40848) Enable WikiLove extension on ur.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29886 [21:37:03] !log aaron synchronized php-1.21wmf2/includes/EditPage.php 'deployed 51eede5af0fe0bb28d7abdf81a04292ba8a4864c' [21:37:18] Logged the message, Master [21:39:58] MatmaRex_: and in #wikimedia-operations there is now sort of an on-duty liaison that changes regularly, which is basically the person you can talk to to ask "who in Ops should I talk to about this problem?" [21:39:59] (from #mediawiki) [21:40:06] Apparently this is the person said to be 'on rt duty' in the topic [21:40:28] Is there any chance that ('rt duty') could be changed? It's not very clear [21:40:28] Yes, though he's not here =D [21:40:49] hmmm LeslieCarr [21:40:57] an03 is now an aggregator [21:41:26] but I only see it sending traffic to the mcast addr [21:41:29] not receiving any [21:43:27] how does gmetad know to ask an03 for stats? [21:43:47] hmm, yeah i changed it in the data_source and ran puppet on nickel [21:48:46] PROBLEM - Puppet freshness on cp1043 is CRITICAL: Puppet has not run in the last 10 hours [21:53:22] hehe [21:53:25] sorry [21:53:28] cool, now happiness ?
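The gmetad side of the an03 question: gmetad only polls hosts named in a data_source line, which is what the puppet run on nickel rewrites. A sketch, with the config path assumed and the hostnames used only as examples:

    grep '^data_source' /etc/ganglia/gmetad.conf
    # expected shape of the line for this cluster:
    # data_source "Analytics cluster eqiad" analytics1003.eqiad.wmnet analytics1010.eqiad.wmnet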
[21:53:28] s'ok [21:53:43] PROBLEM - Puppet freshness on srv255 is CRITICAL: Puppet has not run in the last 10 hours [21:53:54] yeah, so i don't see traffic from any other nodes [21:55:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:56:59] no happiness [21:57:01] LeslieCarr ^ [21:57:02] :( [21:57:16] kinda like it was earlier, when an10 was master [21:57:23] only an03 is communicating with multicast addy [21:59:26] gah [21:59:32] why does it fail [21:59:42] brb, gotta go pick up some food real quick [22:02:32] mk [22:02:42] i might head out pretty soon, so maybe I'll bug you more about this tomorrow :) [22:02:46] thanks for the help so far [22:06:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.273 seconds [22:12:46] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [22:19:50] preilly/paravoid: there is a slight complication with the libxml thing [22:22:11] specifically, libxml will call xmlMalloc() whether it is going to store a pointer in a true global variable or in a thread-specific state [22:24:16] it looks like it only stores pointers in global variables when xmlInitParser() is called, and only frees them when xmlCleanupParser() is called, so I can probably use that to distinguish between persistent allocation and per-request allocation [22:24:31] but it makes it slightly more dubious and less likely to be accepted upstream [22:25:14] also, initialisation in a multithreaded SAPI has its own complications [22:25:20] * AaronSchulz looks at preilly's empty chair [22:25:55] in fact, the existing code is probably broken in multithreaded mode [22:32:47] ottomata: back [22:32:47] sigh [22:33:01] sigh? [22:33:02] 'heh [22:33:04] sigh. [22:33:25] TimStarling: what was the reason you didn't want to use igbinary? [22:35:57] we can use it, as long as igbinary.compact_strings is off [22:37:10] do you want to? [22:37:57] LeslieCarr, I really gotta run, do what you can, I might be back on laters [22:38:02] or tomorrow [22:38:02] byyyeeee [22:39:48] I'm not really enthusiastic about it, mostly because it's a fairly small win in exchange for some added complexity, and the fact that the code is not so well-tested [22:39:55] Change merged: Asher; [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/27035 [22:40:12] what's that serialization format that domas likes? [22:40:18] lol [22:40:28] there's one that actually lets you omit the property names [22:40:42] instead of just reducing the amount of punctuation [22:40:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:47:18] we have memcached.serializer => igbinary => igbinary and igbinary.compact_strings => On => On by default.. TimStarling: would you prefer we go back to the php serializer vs. disabling compact_strings? [22:47:26] New patchset: Dereckson; "Namespace configuration for hsb.wiktionary.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29890 [22:48:10] just disable compact_strings [22:48:39] New patchset: Dereckson; "(bug 29890) Namespace configuration for hsb.wiktionary.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29890 [22:49:00] New review: Dereckson; "PS2: More comprehensive commit message." 
[operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/29890 [22:49:54] New patchset: Dereckson; "(bug 41328) Namespace configuration for hsb.wiktionary.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29890 [22:50:19] New review: Dereckson; "PS3: Fixing bug id." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/29890 [22:53:15] binasher: is it expected that upload.wm.o is not sending cache-control headers? [22:53:46] pgehres: As seen from the outside world? [22:53:48] !log authdns changes adding db63-db78 mgmt [22:53:59] Logged the message, Master [22:54:06] RoanKattouw: not sure what you mean? [22:55:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds [22:56:07] pgehres: Like, you're getting no CC headers on upload.wm.o responses as seen from a machine outside the cluster? [22:56:22] yeah [22:56:36] it is expected [22:57:04] :-( okay [22:57:36] due to the sheer number of images, it seems that we are taking forever to revalidate the entire list every time [22:57:41] swift and the nfs hosts it replaced don't send cc headers, the upload squid conf has defaults in its place, as should the newer varnish cluster [22:58:52] are you saying that it should be set by the squids? [22:59:42] i actually haven't queried upload internally, i am just browsing anon and inspecting the headers [23:06:10] New patchset: Matthias Mullie; "Add Wikivoyage extensions to extensions list" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29891 [23:06:12] Can someone de-pool mw45 for the same reason as mw40? [23:06:54] Change merged: CSteipp; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29891 [23:06:58] !log authdns update adding db78.pmtpa.wmnet [23:07:10] Logged the message, Master [23:07:42] New patchset: Asher; "disable igbinary.compat_strings" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29892 [23:08:35] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29892 [23:09:03] RoanKattouw: ? [23:09:24] outreach.wm.o (seriously NSFW) [23:09:32] !log db78 loading base OS [23:09:42] Logged the message, Master [23:11:19] pgehres: no, just saying that we have the upload servers configured to cache images in the absence of a cc header [23:11:29] PROBLEM - Auth DNS on ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [23:12:12] !log restarting pdns on ns0 [23:12:19] Logged the message, Master [23:12:55] RECOVERY - Auth DNS on ns0.wikimedia.org is OK: DNS OK: 0.008 seconds response time. www.wikipedia.org returns 208.80.154.225 [23:13:32] AaronSchulz: memcached-serious log group? :) [23:14:10] binasher: everything ok with libmemcached/igbinary? [23:14:21] hmm? [23:14:29] just wondering [23:14:35] if the packages work etc. [23:14:49] so far [23:15:12] great [23:20:38] binasher: I originally made memcached-pecl, and then realized that didn't match the name of the calling function [23:20:45] * AaronSchulz is too ocd about stuff like that [23:21:49] heh [23:22:19] ok, the change looks good but i want to let puppet update igbinary.ini everywhere and then bounce apaches before merging [23:22:36] although, i guess just making sure srv193 is updated is good enough for this [23:27:38] binasher: what did that igbinary puppet change you just made affect?
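The anonymous header inspection pgehres describes can be reproduced with curl; a sketch, with the image path chosen only as an example:

    curl -sI 'http://upload.wikimedia.org/wikipedia/commons/6/63/Wikipedia-logo.png' \
      | grep -iE 'cache-control|age|x-cache'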
[23:28:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:29:18] AaronSchulz: "Serialization performance depends on the "compact_strings" option which enables [23:29:19] duplicate string tracking. String are inserted to a hash table which adds some [23:29:20] overhead. In usual scenarios this does not have much significance since usage [23:29:21] pattern is "serialize rarely, unserialize often". With "compact_strings" [23:29:22] option igbinary is usually a bit slower than the standard serializer. Without [23:29:23] it, a bit faster." [23:30:02] it defaults to on, updated ini sets to off [23:31:10] and that affects memcached as used now? I thought we just used php serialization? [23:31:18] no [23:31:32] no effect on memcached as used now [23:32:01] just prep for the pecl client [23:32:10] well, I will need to amend 29824 to use igbinary then [23:33:57] ah, does MemcachedPeclBagOStuff override memcached.serializer? [23:35:12] New patchset: Aaron Schulz; "Added memcached-pecl cache object (enabled for testwiki)." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29824 [23:35:40] New patchset: MaxSem; "New, configurable redirector and its config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29883 [23:36:22] yes [23:36:25] binasher: it default to setting Memcached::SERIALIZER_PHP on the Memcached object unless specifically given igbinary [23:36:32] *defaults [23:36:47] New patchset: MaxSem; "Update tests" [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/29895 [23:37:08] AaronSchulz: ok, that's probably a good thing [23:37:08] because I was afraid that someone would accidentally enable it without turning off compact_strings [23:37:47] so to turn on igbinary, you first have to read the MemcachedPeclBagOStuff documentation where it says that you should disable compact_strings [23:38:40] haha :) [23:39:09] part of the idiot-proof Starling configuration security system [23:40:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.677 seconds [23:42:58] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29883 [23:51:11] binasher: if we wanted to downgrade the image scalers to Lucid tonight, would that be possible [23:52:24] ? [23:52:34] robla: if it makes sense to do so vs. fixing something [23:53:05] notpeter: ^^ [23:53:15] what's the problem? [23:53:34] https://bugzilla.wikimedia.org/show_bug.cgi?id=41361 [23:53:58] we started discussing this in #mediawiki, and TimStarling suggested we might want to revert [23:54:22] I'd be surprised if downgrading affected that bug [23:54:29] that must be the wrong bug # [23:54:32] https://commons.wikimedia.org/w/index.php?title=Commons:Village_pump&oldid=81616486#More_image_scaler_problems [23:54:45] I think he means https://bugzilla.wikimedia.org/show_bug.cgi?id=41362 [23:54:55] erm, the URLs in that bug are originals [23:54:55] oh [23:55:15] blame the French [23:55:23] or did I do some mistake with URLs again [23:55:41] the list on commons looks like several separate issues to me [23:55:55] sure [23:56:05] and I doubt they're related to the upgrade [23:56:05] yeah, I was using 41361 as a proxy for that, since it had the URL [23:56:06] it's not recent stuff [23:56:12] maybe the first two are interesting [23:56:36] two or three [23:56:49] is the ogg stuff recent? 
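A hedged spot check of the serializer behaviour Aaron and Tim describe above — the ini value puppet now sets, plus the pecl client's explicit serializer option (run on any apache with the igbinary and memcached extensions loaded):

    php -r '
      var_dump(ini_get("igbinary.compact_strings"));   // should be off after the puppet run
      $m = new Memcached();
      // MemcachedPeclBagOStuff-style default: PHP serializer unless igbinary is requested
      $m->setOption(Memcached::OPT_SERIALIZER, Memcached::SERIALIZER_PHP);
      var_dump($m->getOption(Memcached::OPT_SERIALIZER) === Memcached::SERIALIZER_PHP);
    '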
[23:57:23] it looks like we have test cases, so we should be able to debug them without the code actually being live [23:57:23] seems to be if the comment time is to be believed [23:59:12] the oggThumb thing is pretty explicit [23:59:25] not much isolation work to do there
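For the oggThumb case, a local reproduction is straightforward; a sketch with the flags quoted from memory (check oggThumb's help output) and a placeholder input file:

    # grab one frame at t=1s, scaled to 320x240, into thumb0.jpg
    oggThumb -t 1.0 -s 320x240 -n thumb%d.jpg Testcase.ogv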