[00:06:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:09:39] !log maxsem Finished syncing Wikimedia installation... : Weekly MobileFrontend and Zero deployment [00:09:51] Logged the message, Master [00:17:29] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29702 [00:17:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.447 seconds [00:18:34] !log maxsem synchronized php-1.21wmf2/extensions/MobileFrontend/ [00:18:48] Logged the message, Master [00:19:21] binasher: so one is pecl mc? :) [00:19:52] s/one/when [00:20:29] AaronSchulz: feel free to do the config instead of waiting for me [00:20:49] ok, I'll try to get something in gerrit this week [00:21:00] i am currently going over a large quantity of timing data from the fundraising team in prep for a meeting with erik in 10 minutes [00:21:01] i'll prob have time tomorrow though [00:21:02] !log maxsem synchronized php-1.21wmf1/extensions/MobileFrontend/ [00:21:14] Logged the message, Master [00:25:10] can someone please flush mobile Varnish? [00:26:11] MaxSem: ok, hold on [00:26:51] !log flushing varnish mobile cache [00:27:03] Logged the message, Master [00:27:49] binasher: would you happen to know if our version of varnish supports HTCP operations? [00:28:32] not directly, we use a special daemon for that [00:29:24] kk; I've seen what wikia did to make it support it - I presume ours is something similar? [00:30:45] or; actually; that doesn't even matter -- what I'm really after is if I use SquidUpdate::HTCPPurge() it will work with our varnish boxen? [00:31:04] mutante, thanks! [00:31:15] np, Max [00:37:19] binasher: ? ^ [00:45:02] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 190 seconds [00:49:05] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [00:50:38] !log maxsem synchronized php-1.21wmf1/extensions/MobileFrontend [00:50:46] Logged the message, Master [00:52:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:04:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds [01:08:02] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 4 seconds [01:37:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:41:17] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 244 seconds [01:46:14] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [01:52:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds [01:59:35] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 249 seconds [02:11:00] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [02:25:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:30] !log LocalisationUpdate completed (1.21wmf2) at Wed Oct 24 02:27:30 UTC 2012 [02:27:46] Logged the message, Master [02:28:06] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 203 seconds [02:37:59] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 25 seconds [02:38:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds [02:48:06] 
!log LocalisationUpdate completed (1.21wmf1) at Wed Oct 24 02:48:06 UTC 2012 [02:48:20] Logged the message, Master [03:50:44] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 26 seconds [03:52:05] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [05:02:08] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [05:05:02] New patchset: Daniel Friesen; "Use $wgHiddenPrefs to hide real name." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29741 [05:50:38] TimStarling: Any chance you could merge and deploy Daniel's rev from above? [05:51:07] The real name field is showing up on the English Wikipedia and other sites. Someone didn't read the release notes before deploying... :-) [05:55:29] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29741 [05:59:32] !log tstarling synchronized wmf-config/CommonSettings.php [05:59:47] Logged the message, Master [06:09:06] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [06:18:56] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:20:45] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:21:15] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:21:29] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:22:05] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.059 second response time [06:23:53] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.450 second response time [06:27:56] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time [06:29:19] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time [06:35:34] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [06:35:34] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [06:36:19] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:37:40] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:38:43] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:42:01] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.233 second response time [06:45:46] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [06:46:13] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.052 second response time [07:08:52] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [07:09:46] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [07:21:14] hello [07:52:21] mark: I'm reading Daniel's mail (and getting confused) [07:52:55] didn't you say we don't have any more IPs in esams? 
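On the SquidUpdate::HTCPPurge() question above (around [00:30]): HTCP purging in MediaWiki of this era is one multicast UDP datagram per URL, so it is cache-agnostic; Varnish never speaks HTCP itself, a listener daemon on each cache box (the "special daemon" binasher refers to; see the varnishhtcpd commits later in this log) translates the packets into local PURGE requests. A minimal sketch with 1.21-era setting names; the multicast group and URL below are illustrative assumptions, not production values:

    // In wmf-config/CommonSettings.php or LocalSettings.php (sketch only):
    $wgHTCPMulticastAddress = '239.128.0.112'; // assumed group the cache-side daemons join
    $wgHTCPPort = 4827;                        // conventional HTCP port
    $wgHTCPMulticastTTL = 8;                   // must be high enough to cross routers

    // Anywhere in MediaWiki code, each URL becomes one HTCP CLR packet:
    SquidUpdate::HTCPPurge( array(
        'http://en.m.wikipedia.org/wiki/Main_Page', // hypothetical URL
    ) );

So in principle the answer to "will it work with our varnish boxen" is yes, provided each varnish host runs the listener and its group/port match the wiki's settings.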
[08:19:12] yes i did [08:25:05] i'm replying now [08:34:49] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [08:41:52] PROBLEM - SSH on ms-fe2 is CRITICAL: Connection refused [08:51:46] RECOVERY - SSH on ms-fe2 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [09:20:57] New patchset: Faidon; "autoinstall: partman raid1 profiles mega-commit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29754 [09:21:33] I don't thing anyone wants to review that :) [09:22:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29754 [09:22:49] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [09:23:43] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [09:24:39] New patchset: Hashar; "import zuul module from OpenStack" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25235 [09:25:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27611 [09:25:43] New review: Hashar; "* updated zuul.init from upstream" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/25235 [09:25:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25235 [09:28:49] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [09:53:19] New patchset: Faidon; "autoinstall: partman raid1 profiles mega-commit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29754 [09:54:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29754 [09:54:50] * mark cheers [09:55:34] for the raid1 megacommit? :) [09:56:36] yes [09:56:43] i never got why people need to make a gazillion profiles [09:59:45] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [10:08:27] New review: Faidon; "Noone is going to go near that, I might just as well +2 it myself." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/29754 [10:11:00] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [10:11:58] New patchset: Faidon; "autoinstall: partman raid1 profiles mega-commit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29754 [10:12:10] would it be possible to rename a header in squid? looking into adding X-Content-Duration to videos, swift only allows setting X-Object-Meta-Duration, so if squid could turn X-Object-Meta-Duration into X-Content-Duration it might work [10:12:47] j^: no it doesn't [10:12:52] docs are incorrect [10:12:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27611 [10:12:59] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29754 [10:13:01] they added the feature of arbitrary headers at one point [10:13:06] and never updated the docs [10:13:34] (that's my guess from my interpretation of the code, might be incorrect) [10:14:16] https://code.launchpad.net/~notmyname/swift/arbitrary_headers/+merge/54455 [10:14:19] yeah, that's merged [10:14:38] but I have to explicitly allow it on the swift configs [10:14:59] paravoid: i was testing with git head of swift here and i can only set X-Object-Meta [10:15:13] ah ok, allow in config... [10:15:50] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [10:16:14] try it and if it works, I'll open a bug for their docs [10:16:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27611 [10:17:46] it should be "allowed_headers = X-Content-Duration" in the [object-server] section [10:19:26] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29754 [10:19:32] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [10:20:34] New patchset: Hashar; "import zuul module from OpenStack" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25235 [10:21:35] New review: Hashar; "removed the /etc/zuul/wikimedia file definition out of the module. Will be created by the wikimedia ..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/25235 [10:21:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27611 [10:21:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25235 [10:39:10] RECOVERY - Puppet freshness on ms-fe2 is OK: puppet ran at Wed Oct 24 10:39:03 UTC 2012 [10:44:34] RECOVERY - NTP on ms-fe2 is OK: NTP OK: Offset -0.06889736652 secs [10:49:49] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [11:00:21] j^: any luck? [11:03:02] New patchset: Mark Bergsma; "Make sure high range requests don't block on large objects being retrieved" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29763 [11:04:09] New patchset: Mark Bergsma; "Make sure high range requests don't block on large objects being retrieved" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29763 [11:05:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29763 [11:08:21] New patchset: Mark Bergsma; "Make sure high range requests don't block on large objects being retrieved" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29763 [11:08:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:09:19] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29763 [11:11:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.963 seconds [11:12:40] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29763 [11:15:19] PROBLEM - Host ms-fe2 is DOWN: PING CRITICAL - Packet loss = 100% [11:16:58] RECOVERY - Host ms-fe2 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [11:16:58] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.011 seconds [11:17:07] RECOVERY - Memcached on ms-fe2 is OK: TCP OK - 0.002 second response time on port 11211 [11:21:03] New patchset: Mark Bergsma; "req.hash_ignore_busy can't be set in vcl_hash for some reason" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29764 [11:22:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29764 [11:22:08] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29764 [11:27:37] New patchset: Mark Bergsma; "req.hash_ignore_busy can't be read in vcl_miss for some reason" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29765 [11:28:08] oh heh [11:28:32] sloppy bastards [11:28:39] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29765 [11:29:15] I guess you can set a hidden header [11:29:15] yes [11:29:15] but I hate that too [11:29:15] ugly [11:29:21] i want to set booleans [11:29:30] there's a vmod which adds that I think [11:29:30] but can't be bothered [11:31:55] Request to first data byte: 35026 ms [11:32:08] I was able to replicate your observation now [11:32:12] of course when I tested it yesterday, it was cached [11:32:23] we'll see how it does in half an hour when puppet has run [11:33:11] python-eventlet (0.9.17-0ubuntu1~cloud0) precise-folsom; urgency=low [11:33:14] * Backport for Ubuntu Cloud Archive: [11:33:19] - Newer version of python-eventlet is required for nova. [11:33:19] ohrly [11:33:33] how hard is it to have a line or two explaining WHY [11:33:39] goddamit [11:34:16] because it doesn't leak sockets? ;) [11:34:59] heh yeah [11:35:13] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=ms-fe1.pmtpa.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1351078449&g=network_report&z=large&c=Swift%20pmtpa [11:35:27] traffic is increased since yesterday [11:35:55] not hitting the 1gbps but still, increased compared to what normally is [11:36:15] any idea why? [11:36:15] and more importantly, there's a big gap between in and out [11:36:24] that bug? [11:36:25] maybe... [11:36:30] that needs to be fixed [11:36:44] yeah, I'm still waiting for a reply [11:36:46] from the swift people [11:36:58] the range one you mean, right? [11:37:02] yes [11:37:14] I mailed them a separate mail from the container sync thread [11:37:50] that had a) the 90 DELETEs/s that killed swift and b) the Gertie incident with i) range bug ii) posix_fadvise() calls [11:39:00] can you CC me on such mails? 
i'm interested [11:39:45] there, problem solved :) [11:39:58] Request to first data byte: 500 ms [11:40:35] the 2nd range request for the end of the file now came in well before the first one completed (for the entire file) [11:41:37] hehe [11:41:37] you rock [11:42:09] thanks for the pcap, that was hugely helpful [11:43:47] there, I bounced you some mails [11:43:57] and I'll keep you Cc'ed in future ones [11:45:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:46:02] tnx [11:47:52] PROBLEM - Puppet freshness on cp1043 is CRITICAL: Puppet has not run in the last 10 hours [11:48:36] increased load since yesterday is expected, since I took down ms-fe2 for the upgrade [11:48:43] but the gap between in/out puzzles me [11:48:54] the range bug might be an explanation [11:49:06] should be able to see that in a pcap, right? [11:49:59] a 400mbit pcap? :) [11:51:10] New patchset: J; "Bug 41304 Add X-Content-Duration to allowed_headers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29768 [11:51:40] use ngrep and try to replicate it with a range request for a given object? [11:52:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29768 [11:52:21] j^: did you test that? [11:52:53] PROBLEM - Puppet freshness on srv255 is CRITICAL: Puppet has not run in the last 10 hours [11:52:53] the bug exists, there's no question about that. I'm wondering if we currently experiencing it or not [11:52:55] I'll find a way [11:53:07] in the meantime, I'll put the updated ms-fe2 back into rotation :) [11:54:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.323 seconds [11:55:48] paravoid: tested here with local swift setup (using git head, 1.7.5 though) [11:56:27] !log putting ms-fe2 (upgraded to precise/swift 1.7.4) back into the pool [11:56:30] cool to see some fast results on X-Content-Duration :) [11:56:40] Logged the message, Master [11:56:42] indeed [11:58:37] we could combine that with some throughput rate limiting [11:59:16] calculate the bitrate of the video and serve it at a slightly higher rate [12:03:03] there are web servers that have that feature [12:03:17] yes [12:03:20] but there's not much reason varnish couldn't also have it I guess [12:04:17] I'm no expert but I think it's a bit harder than that though [12:04:24] the bitrate is variable [12:04:29] sure [12:04:36] so you might be way off from (avg + N%) [12:04:44] so you'd want to set it to at least 1.5* avg or so [12:05:39] still better than pumping out entire gigabytes for a file a user might abort after 30s [12:06:55] not sure how many users have gigabit connections to the datacenter [12:07:19] i do [12:07:29] :) [12:07:51] i just notice that when google linked that Gertie_the_Dinosaur video, not exactly a big video at all [12:07:56] we were serving it at 3 Gbps [12:08:10] if you do rate limiting, you want to burst at full speed for some amount and slow it down after that [12:08:15] the max our 3 swifcod do [12:08:15] yes [12:08:46] and currently, one backend varnish or squid instance can serve a video at max 1 Gbps [12:08:49] we're gonna up that [12:08:56] but it's worrying ;) [12:10:10] I worry too [12:10:17] gertie was a small video btw [12:10:22] like I said [12:10:36] right [12:11:52] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [12:29:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds [12:31:05] New review: Hashar; "Quoting the output variable is already handled with change I1ccf76ce: beta: autoupdater now reports ..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/29615 [12:31:09] there [12:31:19] made vmod_std's std.collect() work in vcl_deliver :) [12:34:07] New patchset: Mark Bergsma; "Make std.collect work in vcl_deliver" [operations/debs/varnish] (patches/vcl_deliver-collect) - https://gerrit.wikimedia.org/r/29771 [12:34:50] New patchset: Mark Bergsma; "Make std.collect work in vcl_deliver" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29772 [12:34:50] New patchset: Mark Bergsma; "varnish (3.0.3plus~rc1-wm5) precise; urgency=low" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29773 [12:41:04] New patchset: J; "update all extensions even if one fails" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29615 [12:42:10] New patchset: Mark Bergsma; "Collect response headers Via and X-Varnish on delivery" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29774 [12:42:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.085 seconds [12:43:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29615 [12:43:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29774 [12:43:15] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29774 [12:43:54] Change merged: Mark Bergsma; [operations/debs/varnish] (patches/vcl_deliver-collect) - https://gerrit.wikimedia.org/r/29771 [12:44:13] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29772 [12:44:20] New review: J; "> I have no idea what ':' is for in shell.. What is it supposed to do ?" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/29615 [12:44:38] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29773 [12:56:31] New patchset: MaxSem; "Rm old logging code" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29775 [13:12:57] New patchset: Mark Bergsma; "Collapse X-Cache into one header" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29777 [13:14:06] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29777 [13:17:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:22:23] paravoid: hello :-) Been late sorry. [13:30:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.022 seconds [13:39:55] !log installing base OS on db62 [13:40:09] Logged the message, Master [13:48:37] hey ^demon, gerrit question [13:48:43] <^demon> Shoot. [13:48:56] i am trying to add ottomata as reviewer to a patchset [13:49:31] and i get the error application error, ottomata is neither a registered user nor a group [13:49:57] but his name and email address pop up in the reviewer box [13:50:37] paravoid: can you test videos again if you have time? 
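The Varnish work merged above (the range-request fix, changes 29763–29765, and the header collection work, changes 29771 and 29774) comes down to a few lines of VCL. A sketch under Varnish 3.0 semantics; note that the std.collect() calls on response headers only work with Mark's patched vmod_std from change 29771, stock 3.0.3 rejects them in vcl_deliver:

    import std;

    sub vcl_recv {
        # Range fix: don't coalesce a high byte-range request behind a busy
        # in-flight fetch of the whole (possibly huge) object. Per the commit
        # messages above, hash_ignore_busy is settable here but not in
        # vcl_hash, and not readable in vcl_miss.
        if (req.http.Range) {
            set req.hash_ignore_busy = true;
        }
    }

    sub vcl_deliver {
        # Header cleanup: fold the duplicates each cache layer appends into a
        # single comma-separated header on the way out.
        std.collect(resp.http.Via);
        std.collect(resp.http.X-Varnish);
    }

On the rate-limiting idea floated after the Gertie discussion, the arithmetic is modest: a 150 MB, 10-minute video averages 2 Mbit/s, so pacing delivery at mark's suggested 1.5× average (3 Mbit/s) stays ahead of playback, yet a viewer who aborts after 30 seconds has been sent only about 11 MB instead of whatever a multi-Gbps link could push in that window.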
[13:50:49] New review: Hashar; "bash has sooo many tricks:" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/29615 [13:51:26] i see firefox poking around when seeking [13:51:37] <^demon> drdee: No clue. Don't see anything in the error log. [13:52:07] can you try to add him to patchset 29779? [13:52:49] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours [13:52:52] i must say, i have often weird application errors, also after submitting a code review i will get an application error but the review still gets through [13:53:02] could there be an issue with my account somehow? [13:53:24] <^demon> No, I doubt there's any issues with your account. [13:53:31] <^demon> I get the same error on adding Otto. [13:53:39] <^demon> Still, nothing in the error logs. [13:53:44] pffeww it's not me :D [13:53:54] <^demon> I'll file upstream. But without a stacktrace I dunno. [13:54:21] maybe look in the ajax call? [13:54:45] <^demon> Javascript console says nothing special. [13:57:52] ^demon, and so when i enter my review, i get application error, cannot submit change xyz, needs Code Review but obviously it has a code review [13:58:08] <^demon> I have no idea. [13:58:58] Change abandoned: J; "ok, i guess once change I1ccf76ce is merged the logs will at least show problems with an extension s..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29615 [14:00:54] ^demon: Hi chad :-] [14:02:21] ^demon: I have screwed up a freshly created repository : integration/zuu-conf . I have sent two commits as changes in Gerrit but could not merge them since it was lacking an "initial commit". I eventually pushed them straight to gerrit but now the changes are still there and marked abandoned =) [14:02:39] ^demon: so whenever I send a new commit, ,it can be merged since it is based on an abandoned change :-/ [14:02:57] <^demon> Rewrite history? [14:03:04] might try that [14:03:18] <^demon> If nobody's really using it but you yet, no harm really. [14:03:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:04:05] ^demon: any idea how to rewrite the first commit ? :-] [14:05:06] git checkout --orphan newmaster [14:08:07] RECOVERY - Host db62 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [14:18:08] ^demon: that worked. Thanks :-] [14:18:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds [14:18:38] <^demon> yw :) [14:21:05] New patchset: Mark Bergsma; "Retab" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29784 [14:21:24] !log upgarding srv221 and srv222 to precise [14:21:32] seriously [14:21:38] labs is slow as hell again [14:21:38] Logged the message, notpeter [14:21:52] I think I will end up renting a few servers :-] [14:22:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29784 [14:23:32] New patchset: Mark Bergsma; "Retab" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29784 [14:24:31] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29784 [14:24:56] New patchset: Mark Bergsma; "Retab" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29784 [14:25:58] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29784 [14:32:45] New patchset: Hashar; "zuul configuration for Wikimedia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [14:33:50] New patchset: Hashar; "import zuul module from OpenStack" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25235 [14:34:46] New patchset: Mark Bergsma; "Attempt to make varnishhtcpd a bit more robust on errors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29786 [14:35:48] New review: Hashar; "rebased again to get change I0f8e3fe81 :" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/27611 [14:35:48] New patchset: Pyoungmeister; "setting srv221 and srv222 to use modules as part of upgrade to precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29787 [14:36:19] PROBLEM - Host srv221 is DOWN: PING CRITICAL - Packet loss = 100% [14:36:39] yay [14:36:48] that's imagescalers, isn't it [14:36:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27611 [14:36:53] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29786 [14:36:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25235 [14:36:53] New patchset: Pyoungmeister; "setting srv221 and srv222 to use modules as part of upgrade to precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29787 [14:36:53] yes [14:37:43] <^demon> paravoid: Did we still want to do those precise upgrades tomorrow? [14:37:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29787 [14:38:06] 3am is a bit too late for me to be doing upgrades I'm afraid [14:38:36] <^demon> We can pick a better time for you. [14:39:00] <^demon> Sometime in the morning for me, perhaps. [14:39:12] I was hoping to chat with someone from the SF office who might have the time but I haven't been able to do that yet [14:39:35] I think Daniel is busy with wikidata/voyage stuff these days anyway [14:39:36] <^demon> Well Ryan and I were going to do manganese & formey Thurs. [14:39:50] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29787 [14:40:42] I kinda feel bad about asking Ryan stuff considering he's not getting the labs help from me as originally hoped :P [14:41:15] but yeah, if you guys are willing to do it earlier then I can make it [14:41:16] PROBLEM - SSH on srv222 is CRITICAL: Connection refused [14:41:27] <^demon> Well manganese & formey are going to be easy. Is gallium not straightforward? [14:41:38] PROBLEM - Apache HTTP on srv222 is CRITICAL: Connection refused [14:41:45] New patchset: Mark Bergsma; "Restart varnishhtcpd on changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29790 [14:41:46] paravoid: yep. 
I'm going to finish upgrading them today [14:41:52] hashar knows [14:41:57] but going to wait a couple of days before declaring them done [14:42:10] RECOVERY - Host srv221 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [14:42:23] I don't know anything about gallium really, my plan is to do a dist-upgrade and attempt fixing whatever shit breaks with hashar's help [14:42:35] my "plan" I should say [14:42:42] paravoid: sorry missed your chats yesterday :( [14:42:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29790 [14:42:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29790 [14:42:56] wtf. installs failing....... [14:43:18] <^demon> paravoid: Well I can pretty much do manganese & formey myself. Just need someone on hand in case Something Happens. [14:43:21] paravoid: have you passed the gallium upgrade to someone else or will you handle it yourself ? [14:43:22] notpeter: I messed with partman today, but not with mw.cfg [14:43:37] yeah. it was a fail on installing software [14:43:53] hashar: see above :) [14:43:54] and yesterday, I was reimaging pc3 and it failed on getting/running the late_commands [14:45:04] so hmm [14:45:13] as I understand it notpeter is going to be the one upgrading gallium ? ;-D [14:45:46] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:46:15] erm, not what I said [14:46:38] New review: Hashar; "The updater fail because of permissions issues :/ The script run as mwdeploy:mwdeploy where as admi..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29615 [14:46:40] if it makes sense to sync gerrit & jenkins (you tell me) and ^demon is willing to push it earlier then I'm doing it [14:47:16] <^demon> Well the upgrades aren't really dependent on one another, but makes sense to do them around the same time since they're related. [14:47:22] I don't mind doing it tomorrow at 10pm - midnight CET. (1pm - 3pm PST) [14:47:27] <^demon> Jenkins downtime interrupts gerrit, not the other way around. [14:47:49] <^demon> There's some deployments going on until 3pm PDT, which is why I picked 3-5pm PDT originally. [14:47:50] the one sure thing is that whenever Gerrit is done, Jenkins stays idling [14:48:03] <^demon> But we could go much earlier. Something that morning? [14:48:17] Jenkins can be stopped independently. That simply requires manually retriggering any changes that have been submitted to Gerrit during the downtime. [14:49:58] ^demon: the earlier your morning the better for me, I guess for hashar too [14:50:53] <^demon> That works for me. As long as we're done before 17:00 UTC, we won't be affecting any deployments I see. [14:50:53] or we could upgrade gallium much earlier [14:51:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:51:43] okay, want to do the jenkins upgrade on 15:00 UTC let's say? [14:52:31] <^demon> Jenkins and gerrit, 15-17:00? [14:52:45] one hour earlier ? Need to get my daughter at 16:00 UTC [14:53:10] unlikely my wife can leave earlier, though I could ask her this evening. [14:53:38] I don't mind, but I think it's getting a bit early for chad [14:53:44] <^demon> That's fine by me. It's 10am my time. [14:53:51] so 14:00 UTC ? [14:54:12] ack [14:54:13] nice [14:54:21] <^demon> Sounds good. I'll send out notices.
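For reference, hashar's orphan-branch fix from earlier ([14:04]–[14:18]) comes down to four commands; a sketch, using the branch name from the chat and assuming rights to force-push (it rewrites the remote branch's history, so it needs the force-push ACL in Gerrit):

    git checkout --orphan newmaster   # new parentless branch; the index keeps the current tree
    git commit -m "Initial commit"    # recreate the missing root commit
    git branch -M master              # rename over the old master
    git push -f origin master        # rewrite the remote branch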
[14:54:27] I am out to get my daughter, will connect again later on this evening though [14:54:27] thanks [14:54:31] thanks you two :-] [14:54:36] cya in a few hours [14:54:46] ^demon: I obviously can provide support to you too [14:54:55] instead of Ryan I mean. [14:55:24] <^demon> Yeah. Hopefully it'll go as smoothly as my testing locally & on labs went. [14:56:08] RECOVERY - SSH on srv222 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:01:48] New patchset: Platonides; "(Bug 41350) The translation of Wikipedia is said to be wrong on pa.wikipedia.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29792 [15:02:57] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [15:03:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.051 seconds [15:06:19] PROBLEM - NTP on srv221 is CRITICAL: NTP CRITICAL: Offset unknown [15:06:31] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.011 seconds [15:06:46] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.003 seconds [15:10:45] ^demon: thanks for the ping and the mail [15:10:53] and for being flexible regarding the time [15:11:15] <^demon> Yeah no problem. Totally makes sense to get it all done at once. [15:12:50] <^demon> Ryan will be happy when I tell him he's off the hook now too ;-) [15:16:07] RECOVERY - NTP on srv221 is OK: NTP OK: Offset -0.1137486696 secs [15:17:35] New patchset: Ottomata; "Using only an10 as ganglia aggregator for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29797 [15:18:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29797 [15:19:54] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29797 [15:20:16] PROBLEM - NTP on srv222 is CRITICAL: NTP CRITICAL: Offset unknown [15:30:37] New patchset: Ottomata; "ganglia.pp - only using analytics1010 as ganglia aggregator (for now)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29798 [15:31:39] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29798 [15:31:58] RECOVERY - NTP on srv222 is OK: NTP OK: Offset -0.03310871124 secs [15:33:37] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29798 [15:37:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:42:34] mark: it seems we hit 1gbps on ms-fe1 moments ago [15:42:55] PROBLEM - Host ps1-c1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.11) [15:42:55] PROBLEM - Host ps1-d3-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.19) [15:42:55] PROBLEM - Host ps1-a1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.1) [15:42:55] PROBLEM - Host ps1-b5-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.10) [15:42:55] PROBLEM - Host ps1-c3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.13) [15:42:56] PROBLEM - Host ps1-c2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.12) [15:42:56] PROBLEM - Host ps1-b4-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.9) [15:43:13] PROBLEM - Host ps1-a4-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [15:43:13] PROBLEM - Host ps1-a2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.2) [15:43:13] PROBLEM - Host ps1-b2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.7) [15:43:13] PROBLEM - Host ps1-d1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.17) [15:43:13] PROBLEM - Host ps1-d3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.16) [15:43:14] PROBLEM - Host ps1-d2-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.18) [15:43:14] PROBLEM - Host ps1-b1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.6) [15:43:15] PROBLEM - Host ps1-a3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.3) [15:43:15] PROBLEM - Host ps1-b3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.8) [15:43:16] PROBLEM - Host ps1-d2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.15) [15:43:16] PROBLEM - Host ps1-a5-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.5) [15:43:58] PROBLEM - Host ps1-d1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.14) [15:45:39] PROBLEM - Host mr1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.2.3) [15:47:35] RECOVERY - Host ps1-d1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.33 ms [15:47:35] RECOVERY - Host ps1-c1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.54 ms [15:47:35] RECOVERY - Host ps1-c3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.26 ms [15:47:35] RECOVERY - Host ps1-d1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.51 ms [15:47:35] RECOVERY - Host ps1-a2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.02 ms [15:47:37] New review: Nikerabbit; "Ping, there are unanswered comments here." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28238 [15:47:43] RECOVERY - Host ps1-a5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 1.68 ms [15:47:43] RECOVERY - Host ps1-b4-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.53 ms [15:47:43] RECOVERY - Host ps1-a4-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.34 ms [15:47:43] RECOVERY - Host ps1-c2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.42 ms [15:47:43] RECOVERY - Host ps1-b2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.84 ms [15:47:44] RECOVERY - Host ps1-d3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.72 ms [15:47:44] RECOVERY - Host ps1-a1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 1.88 ms [15:47:53] RECOVERY - Host ps1-b5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.71 ms [15:48:04] RECOVERY - Host ps1-d3-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.54 ms [15:48:04] RECOVERY - Host ps1-d2-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.33 ms [15:48:10] RECOVERY - Host ps1-b1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.95 ms [15:48:10] RECOVERY - Host ps1-a3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.53 ms [15:48:10] RECOVERY - Host ps1-b3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.48 ms [15:48:55] RECOVERY - Host ps1-d2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.10 ms [15:50:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.896 seconds [15:51:30] RECOVERY - Host mr1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 1.91 ms [15:58:53] New patchset: Jgreen; "adding user pcoombe to aluminium+grosley" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29803 [15:59:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29803 [16:01:28] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29803 [16:02:34] RECOVERY - Puppet freshness on aluminium is OK: puppet ran at Wed Oct 24 16:02:27 UTC 2012 [16:02:59] New patchset: Mark Bergsma; "Proof of concept for hashing thumbs to original's hash key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29805 [16:03:59] !log removing h310 controller from db62 and adding original h700 card [16:03:59] Logged the message, Master [16:04:01] New review: Amire80; "Need consensus from the Wiki communmity." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/29792 [16:04:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29805 [16:07:13] PROBLEM - Host db62 is DOWN: PING CRITICAL - Packet loss = 100% [16:09:04] New patchset: Mark Bergsma; "Proof of concept for hashing thumbs to original's hash key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29805 [16:09:46] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [16:10:14] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29805 [16:11:16] PROBLEM - Host db22 is DOWN: PING CRITICAL - Packet loss = 100% [16:12:11] New review: Reedy; "I think i" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28238 [16:12:31] New patchset: Jgreen; "fixed class name for user pcoombe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29806 [16:13:39] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29806 [16:13:49] New patchset: Matthias Mullie; "Sverige = sv" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29807 [16:15:00] !log shutting down db61 for rack relocation [16:15:15] Logged the message, Master [16:18:33] New patchset: Demon; "Updated $wgConf->suffixes for wikidata/wikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29809 [16:23:18] heya um [16:23:19] i'm having a weird problem [16:23:40] my gmond.conf on one of the analtycis machines is not respecting the $ganglia_aggregator variable [16:23:53] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:29] i have this: [16:24:29] # analytics 1010 is set up as a ganglia [16:24:29] # aggregator for the Analytics cluster. [16:24:29] if $hostname == "analytics1010" { [16:24:29] $ganglia_aggregator = "true" [16:24:29] } [16:24:31] in site.pp [16:24:33] but [16:24:45] deaf = yes [16:24:48] is still in gmond.conf [16:25:18] bitch slap it until it shows you respect [16:25:54] bitchslap => true, [16:25:54] unless => respected, [16:25:57] :) [16:26:08] i'm looking at puppet to see if i see anything here [16:26:23] # aggregator should not be deaf (they should listen) [16:26:23] # ganglia_aggregator for production are defined in site.pp; [16:26:23] # for labs, 'deaf = "no"' is defined in gmond.conf.labsstub [16:26:23] if $ganglia_aggregator { [16:26:23] $deaf = "no" [16:26:23] } else { [16:26:23] $deaf = "yes" [16:26:23] } [16:26:27] and in gmond_template.erb [16:26:28] deaf = <%= deaf %> [16:26:45] actually, i'm pretty sure what is happening, as its happened to me before [16:26:51] oh yeha? [16:27:10] so the ganglia variables (by default aggregator = no) are getting set and acted upon in the role class [16:27:21] and then at the very end analytics1010 is set to aggregator = true [16:27:41] put the aggregator = true before the include role::analytics [16:28:36] hmmmmmmmmmmm [16:29:28] will try, although I didnt' think puppet was ordered like that [16:30:00] New patchset: Ottomata; "Attempting to get ganglia_aggregator respected!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29810 [16:30:48] in theory it's not [16:30:54] in practice, it's dumb [16:31:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29810 [16:31:10] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29810 [16:36:46] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [16:36:46] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [16:38:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.142 seconds [16:38:50] Welp, thanks LeslieCarr, that did it! 
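The fix that worked here is purely about evaluation order: the ganglia manifest computes $deaf from $ganglia_aggregator at the point the class is parsed, so the variable has to be assigned before the role class pulls ganglia in. A sketch of the resulting site.pp shape, with the node and role names taken from the chat:

    node "analytics1010.eqiad.wmnet" {
        # must precede the include: it is read when the ganglia class
        # inside the role is evaluated
        $ganglia_aggregator = "true"
        include role::analytics
    }

As Leslie says, Puppet manifest evaluation is in theory not order-sensitive, but dynamically scoped top-scope variables of this era make it so in practice.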
[16:39:30] New review: Reedy; "http://www.youtube.com/watch?v=gx4jn77VKlQ" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29810 [16:42:50] hm LeslieCarr, can I ask you some ganglia questions? [16:42:53] not sure how much you know [16:43:01] sure [16:43:02] not sure how much i know either ;) [16:43:05] so, i'm trying to set up hadoop ganglia metrics [16:43:16] I need to specify ganglia hosts in my hadoop config files [16:43:23] ok [16:43:31] it is workign only on analytics1010 right now, which is the machine that I set up to be the ganglia aggregator [16:43:45] but, there is also the udp_send_channel [16:43:50] that gets set up by puppet with a multicast addy [16:43:59] "analytics" => { [16:43:59] "name" => "Analytics cluster", [16:43:59] "ip_oct" => "32" }, [16:44:03] $mcast_address = "${ip_prefix}.${ipoct}" [16:44:19] that gets put in the gmond.conf file [16:44:30] should I be using the $mcast_address as the ganglia addy in my hadoop configs [16:44:43] or analytics1010.eqiad.wmnet [16:44:43] ? [16:45:26] New review: Reedy; "https://gerrit.wikimedia.org/r/#/c/29807/" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28238 [16:45:41] i would send all the data to the multicast addy so 1010 can send it on to th master [16:45:50] ahh ahhhh, i see [16:45:55] so, everybody sends to multicast [16:46:09] and deaf = no makes that guy listen to multicast and send to gmetad master [16:46:13] k lemme try that [16:46:38] yep [16:53:52] RECOVERY - Puppet freshness on analytics1001 is OK: puppet ran at Wed Oct 24 16:53:40 UTC 2012 [16:56:55] New patchset: MaxSem; "Bug 38617 - Redirector.c should read redirect regex from config" [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/27035 [17:10:28] !log db62 loading OS [17:10:39] Logged the message, Master [17:10:49] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [17:11:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:18:55] RECOVERY - Host db62 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [17:21:09] !log lcarr synchronized wmf-config/throttle.php 'I updated a throttle' [17:21:22] Logged the message, Master [17:22:22] PROBLEM - MySQL Recent Restart on db62 is CRITICAL: Connection refused by host [17:22:22] PROBLEM - MySQL Slave Delay on db62 is CRITICAL: Connection refused by host [17:22:49] PROBLEM - MySQL Replication Heartbeat on db62 is CRITICAL: Connection refused by host [17:23:28] PROBLEM - SSH on db62 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:28] PROBLEM - MySQL disk space on db62 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:23:34] PROBLEM - Full LVS Snapshot on db62 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:23:52] PROBLEM - MySQL Slave Running on db62 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:24:01] PROBLEM - MySQL Idle Transactions on db62 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
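Following Leslie's advice above ([16:44]–[16:46]), the Hadoop side should publish to the multicast group rather than to analytics1010 directly, so whichever host is the non-deaf aggregator picks the metrics up. A sketch of hadoop-metrics2.properties, assuming the metrics2 framework and the Ganglia 3.1-wire-format sink (the file name and sink class are assumptions, not from the log):

    *.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
    *.sink.ganglia.period=10
    # the Analytics cluster group derived in ganglia.pp: ${ip_prefix}.${ip_oct}
    namenode.sink.ganglia.servers=239.192.1.32:8649
    datanode.sink.ganglia.servers=239.192.1.32:8649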
[17:24:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.052 seconds [17:25:52] !log preilly synchronized php-1.21wmf1/extensions/MobileFrontend 'update after deploy' [17:26:07] Logged the message, Master [17:26:35] !log preilly synchronized php-1.21wmf2/extensions/MobileFrontend 'update after deploy' [17:26:49] Logged the message, Master [17:28:52] paravoid: http://commons.wikimedia.org/wiki/Commons:Village_pump#Problem_is_back_.28Oct_23.29 [17:38:09] !log lcarr synchronized wmf-config/throttle.php 'I updated a throttle' [17:38:21] Logged the message, Master [17:40:40] !log preilly synchronized php-1.21wmf1/extensions/ZeroRatedMobileAccess 'update post deploy' [17:40:49] Logged the message, Master [17:41:09] !log preilly synchronized php-1.21wmf2/extensions/ZeroRatedMobileAccess 'update post deploy' [17:41:23] Logged the message, Master [17:41:57] notpeter - ping [17:46:13] PROBLEM - Host db62 is DOWN: PING CRITICAL - Packet loss = 100% [17:48:10] RECOVERY - SSH on db62 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [17:48:22] RECOVERY - Host db62 is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms [17:53:18] !log upgrading srv223 and srv224 to precise [17:53:30] Logged the message, notpeter [17:58:30] ping notpeter [17:59:13] hey, what's up? [17:59:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:59:56] u upgraded srv190 to the latest/greatest distro (imagescaler) [18:00:01] PROBLEM - Host srv223 is DOWN: PING CRITICAL - Packet loss = 100% [18:00:12] srv223 and 224 are the last of the lucid scalers [18:00:24] after I upgrade them, we should be ready to go [18:00:40] ahh... u are upgrading them now ? [18:00:47] (although, i'd still like to let them sit until monday just to make sure that nothing goes terribly wrong and we have to roll back) [18:00:52] yep [18:00:53] cool [18:00:54] thks [18:00:57] robla - ping [18:01:00] see above [18:01:08] so, actually, at ths moment, we're fully precise on imagescalers [18:01:11] yeah, we about to send out an update email [18:01:15] *was [18:02:10] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Everything else to 1.21wmf2 [18:02:22] Logged the message, Master [18:05:43] RECOVERY - Host srv223 is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [18:05:55] PROBLEM - SSH on srv224 is CRITICAL: Connection refused [18:06:28] PROBLEM - Apache HTTP on srv224 is CRITICAL: Connection refused [18:07:28] re [18:09:16] woosters: update sent [18:09:46] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:09:57] awesome [18:09:57] PROBLEM - SSH on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:10:25] thanks notpeter [18:10:40] RECOVERY - SSH on srv224 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:11:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.916 seconds [18:11:25] RECOVERY - SSH on srv223 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:12:40] PROBLEM - NTP on db62 is CRITICAL: NTP CRITICAL: No response from NTP server [18:14:50] Can someone depool mw40 please? 
[18:15:11] It's causing a load of php warnings with relation to reading/writing files [18:15:39] See tail -n 1000 /home/wikipedia/syslog/apache.log | grep resource [18:19:22] yargh, ganglia on analytics is not happy anymore, only on analytics1010 [18:19:37] LeslieCarr or notpeter, are either of you available to help me poke? I'm not sure where to look [18:19:37] http://ganglia.wikimedia.org/latest/?c=Analytics%20cluster%20eqiad&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [18:19:54] i see the gmond process sending stuff [18:22:56] hrm [18:23:10] question #1 - why do you had ganglia and freedom [18:23:54] so i'll look at analytics1009 and analytics1010 [18:23:54] New patchset: Pyoungmeister; "setting srv223 and srv224 to use modules for precise upgrade" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29820 [18:24:26] eh? [18:24:37] hate* [18:24:37] ah [18:24:40] hehe [18:24:51] i also obviously hate spelling [18:24:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29820 [18:25:05] i am trying to set ganglia freeeeeeeeeee [18:25:15] man, don't tell me about hating freedom, I looked at the statue of liberty from a roof last weekend! [18:25:20] that must count for something [18:25:23] man, in one channel we're talking about hating freedoms, in another I'm talking about unknown unknowns.... what is our discourse coming to? [18:25:41] in analytic we are playing word games where only I know the rules [18:25:48] dschoon and drdee just lost [18:25:53] heh [18:25:58] PROBLEM - NTP on srv224 is CRITICAL: NTP CRITICAL: No response from NTP server [18:26:23] no ottomata, you think you determine the rules, that is the matrix you are in [18:26:23] this is because ottomata, in addition to hating freedom, hates america. [18:26:51] he also hates cats, love, and jump-rope [18:27:04] wtf? [18:27:04] no, the statue of liberty is for old fashioned freedoms, pre-1941 [18:27:04] you never told me that, ottomata [18:27:04] cats ? [18:27:08] who can hate cats ? [18:27:18] iknorite? [18:27:19] I mean, I do. but that's for allergy reasons... [18:27:20] lulcatz???? [18:27:32] he must clearly be a bad person, who has rigged the game as a result. [18:27:54] don't hate the player or the game. hate the person who makes the rules for the game [18:28:35] "don't hate the player or the game, hate the umpire"? [18:28:44] Can somebody please depool mw40? And optionally kick it a few times to find out why it's making noise on fwrite calls etc [18:28:49] PROBLEM - NTP on srv223 is CRITICAL: NTP CRITICAL: No response from NTP server [18:29:13] ^demon: notpeter: are we going to upgrade formey to Precise? [18:29:21] woosters: (in lieu of binasher), see Reedy's request [18:29:42] <^demon> hashar: formey, manganese & gallium. And it was paravoid working with us I thought, not notpeter. [18:29:50] um, cats schmats, dogs rule [18:29:56] i do not hate jumprope [18:30:10] in fact, I jumped rope while holding 2 slices of pizza just last week [18:30:14] mw40 ? [18:30:15] I mean, everything will be upgarded eventually... but as of right now I don't know about formey in particular [18:30:16] love? questionaable [18:30:26] robla / reedy ? [18:30:27] ^demon: nice :-] That also mean doxygen will be upgraded too :-) [18:30:28] not notpeter == peter? 
Reedy: sure [18:30:43] that is the matrix :D [18:31:00] woosters: yup [18:31:40] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29820 [18:33:02] notpeter - can u kick mw40? [18:33:12] yep [18:33:17] thks [18:33:24] ottomata: so i can say definitively that analytics1010 is not receiving traffic from other ganglia hosts while the other switches are [18:33:37] other switches? [18:33:38] s/switches/aggregators/ [18:33:41] ah [18:33:41] lemme check out switch config [18:33:47] interesting [18:33:52] me and nouns today are not getting along [18:34:02] but the mcast_address looks correct, right? [18:34:02] in gmond.conf? [18:34:34] udp_send_channel { [18:34:34] mcast_join = 239.192.1.32 [18:34:34] port = 8649 [18:34:45] yeah [18:34:45] i tried to netcat -ul on that addy [18:34:46] didn't see any traffic [18:35:15] so we have been having some issues with stupid asw-c-eqiad and its firmware [18:35:19] i'm so tempted to just downgrade it [18:35:44] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [18:37:09] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28351 [18:37:31] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.008 seconds [18:37:53] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.002 seconds [18:38:26] LeslieCarr: yeah do it [18:38:40] ottomata: also, looks like i didn't put the pim interface for the analytics vlan on cr2-eqiad, which is probably the real issue [18:39:19] Import failed: A database error has occurred. Did you forget to run maintenance/update.php after upgrading? See: https://www.mediawiki.org/wiki/Manual:Upgrading#Run_the_update_script Query: INSERT IGNORE INTO `globalimagelinks` (gil_wiki,gil_page,gil_page_namespace_id,gil_page_namespace,gil_page_title,gil_to) VALUES ('itwikibooks','30416','0','','Miele','Runny_hunny.jpg'),('itwikibooks','30416','0','','Miele','Honey_comb.jpg') [18:39:19] ,('itwikibooks','30416','0','','Miele','Commons-logo.svg'),('itwikibooks','30416','0','','Miele','Wikiquote-logo.svg') Function: GlobalUsage::insertLinks Error: 1205 Lock wait timeout exceeded; try restarting transaction (10.0.6.41) [18:39:22] well… maybe that was the real issue …, though perhaps not :-/ [18:39:27] !log catrope synchronized php-1.21wmf2/includes/upload/UploadStash.php 'Debugging exception' [18:39:39] Logged the message, Master [18:39:41] still not seeing analytics1010 picking up any multicast traffic [18:40:08] hm [18:41:18] New patchset: Ryan Lane; "We manage gerrit's version manually; ensure that" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29822 [18:42:04] so, LeslieCarr, on an09 [18:42:13] this is opened by gmond [18:42:13] analytics1009.eqiad.wmnet:52187->239.192.1.32:8649 [18:42:16] which looks correct [18:42:21] so I think the machines are sending [18:42:28] New patchset: Aaron Schulz; "Added memcached-pecl cache object (enabled for testwiki)." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29824 [18:42:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29822 [18:42:49] binasher: take a look at https://gerrit.wikimedia.org/r/#/c/29824/ [18:42:56] the multicast traffic just isn't making it around, i guess [18:42:57] ?
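Collected, the checks being used here (the interface name is an assumption):

    netcat -lu 239.192.1.32 8649     # join the group; prints any gmond traffic received
    tcpdump -ni eth0 udp port 8649   # confirm senders are putting packets on the wire
    tcpdump -ni eth0 igmp            # verify the host emits IGMP joins for the switch to snoop

If netcat sees traffic but the switch's igmp-snooping membership table (queried below with "show igmp-snooping membership") disagrees, the join state and the data path are out of sync, which points at the switch firmware or the missing PIM interface rather than at gmond itself.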
[18:43:30] yeah, it looks like they are sending [18:43:31] :-/ [18:43:40] and tcpdump is showing they're spewing traffic [18:43:54] but analytics1010 just doesn't want to listen to the multicast party [18:44:23] notpeter: Are you messing with srv223/224? [18:44:24] !log catrope synchronized php-1.21wmf2/includes/upload/UploadStash.php 'Debugging exception' [18:44:37] Logged the message, Master [18:44:52] it's not just an10, right? i can't get that multicast stream via netcat (I should be able to, right?) [18:44:59] sudo netcat -lu 239.192.1.32 8649 [18:45:00] RoanKattouw: yeah, sorry [18:45:07] they're getting upgrades [18:45:15] last of the imagescalers! [18:45:25] No worries, just checking [18:45:40] cool [18:45:48] LeslieCarr: show igmp-snooping membership on asw-c-eqiad [18:45:54] shows only one entry [18:45:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:46:09] !log catrope synchronized php-1.21wmf2/includes/upload/UploadStash.php 'Debugging exception' [18:46:23] Logged the message, Master [18:46:44] analytics1026 ..that's interesting [18:47:21] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29822 [18:48:04] binasher: so, pc3 refuses to pick up and run late_commands [18:48:10] weird [18:48:10] which means.... no ssh server or key... [18:48:13] yeah [18:48:19] it's failed 3 times :/ [18:48:25] anywho, that's what's up with pc3 [18:48:37] any idea why? [18:48:37] an26 is currently consuming the udp2log multicast stream [18:48:49] to count the number of bytes it produces hourly [18:49:06] LeslieCarr ^ [18:49:23] !log catrope synchronized php-1.21wmf2/includes/upload/UploadStash.php 'rm debugging live hacks' [18:49:35] Logged the message, Master [18:49:54] binasher: not yet [18:50:29] ah [18:50:36] still very interesting ... [18:54:23] ottomata: if you have no objection, i'm going to try restarting ganglia on analytics1010 to see if that will trigger it trying to rejoin the multicast group [18:54:32] if not, my best guess right now is that it's the software on that switch [18:54:38] yeah please [18:54:46] restart anything you need [18:54:53] but hmm [18:54:55] are the analytics machines in use right now ? (there's a few other groups of machines on that switch as well) [18:55:15] yes? [18:55:26] 1001-1010, and 23-27 are in use [18:55:30] 1023-1027* [18:55:39] grrr, not restarting :( [18:55:41] 1011-1022 were the C2100s, and have been taken offline [18:55:44] it's not restarting? [18:55:49] i mean it restarted, but not listening [18:55:50] oh, aye [18:55:50] so [18:56:05] i can't listen anywhere though [18:56:07] shouldn't I be able to? [18:56:08] netcat -lu 239.192.1.32 8649 [18:56:09] ? [18:56:11] New patchset: Demon; "Purposefully broken, will abandon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29825 [18:56:32] you should be able to listen on analytics1010, as it should join that group [18:56:47] with multicast it doesn't just broadcast the packets, a machine requests to join and then gets sent packets [18:56:52] hmmmmm [18:57:08] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/29825 [18:57:08] right, but netcat -lu asks to join, right?
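Since multicast delivery depends on the receiver joining the group (as LeslieCarr explains above), membership can also be checked on the host itself; a sketch, again assuming eth0:

    ip maddr show dev eth0   # lists multicast groups joined on that interface
    netstat -gn              # same information, via the kernel's group table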
[18:57:18] (at least, i've made that work for the udp2log multicast) [18:57:22] oh interesting [18:57:29] i do see traffic with netcat -lu on an10 [18:57:48] so maybe this is a ganglia problem [18:57:49] hm [18:58:01] New patchset: Demon; "Purposefully broken, will abandon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29825 [18:58:13] well, which source ip is that from ? [18:58:41] oh from analytics1004 [18:58:58] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/29825 [18:59:02] that's interesting because the switch doesn't show it as being in the multicast group [18:59:32] New patchset: Demon; "Purposefully broken, will abandon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29825 [19:01:14] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/29825 [19:01:33] ottomata: have a phone conf, bbiab [19:02:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds [19:03:12] eh? that IP is the multicast addy, no? [19:03:24] New patchset: Demon; "Purposefully broken, will abandon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29825 [19:06:00] ^demon: you killed gerrit :-] [19:06:17] <^demon> Gerrit's up. [19:06:34] \O/ [19:06:37] ottomata: it is the multicast address [19:06:50] <^demon> hashar: Was it down? [19:06:57] but it also looks like the switch doesn't think that it is part of the multicast group [19:07:04] ^demon: I got a few 503 errors, probably nothing to worry about [19:07:30] <^demon> hashar: I'm playing around with https://integration.mediawiki.org/ci/job/operations-puppet/configure. It doesn't seem to be pulling the changes, just building production. [19:07:58] ohh [19:08:16] I thought ops wanted to use the python Gerrit hooks [19:08:37] New patchset: MaxSem; "WIP: support for multiple Solr cores" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29827 [19:08:37] <^demon> No, we've removing those. [19:08:37] <^demon> Ryan and I hate them. [19:08:45] bahhh [19:08:55] <^demon> s/removing/removed/ [19:08:59] <^demon> I thought I had this working already [19:09:20] LeslieCarr: [19:09:20] but it also looks like the switch doesn't think that it is part of the multicast group [19:09:22] what is 'it' in that sentence? :p [19:09:34] ^demon: let me review the jenkins conf [19:09:35] analytics1010 [19:09:50] /bin/sed -i 's%import \"../private%#import \"../private%' manifests/base.pp | /usr/bin/puppet parser validate --color none manifests/site.pp [19:09:52] ... [19:10:26] ^demon: just call "rake validate" :-] [19:10:33] <^demon> That doesn't work. [19:10:40] aye, right, as in, an10 gmond is not joining the group [19:10:40] <^demon> It has nothing to do with the batch script, that's fine. [19:10:41] k will look at that later [19:10:54] but on an10, if I run [19:10:54] netcat -lu 239.192.1.32 8649 [19:11:06] I get traffic (from only a few nodes, I think) [19:11:28] <^demon> hashar: Every single one of the builds says "Checking out Revision 034735af235312c600a4dfda9048c7d1b99dc54c (origin/HEAD, origin/production)" [19:11:28] <^demon> (Same hash) [19:11:34] so, that seems to me like the networking configs are working, right?
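For completeness, the join can also be made explicit rather than relying on netcat's behaviour; a hedged alternative using socat, with eth0 assumed:

    # Joins 239.192.1.32 via IGMP and dumps whatever arrives on port 8649 to stdout.
    socat -u UDP4-RECV:8649,ip-add-membership=239.192.1.32:eth0 STDOUT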
[19:11:42] it's just ganglia itself that isn't joining [19:12:11] … maybe …. it also could just be that since it's all on the same physical switch/vlan it's just snooping …. [19:12:11] ^demon: I am changing the branches to build from "**" to "$GERRIT_BRANCH" [19:12:12] but. I should be able to join that group with netcat -lu from any node, right? [19:12:13] oh hmmmmmm [19:12:14] anyways, phone conf for realzies right now [19:13:00] ^demon: note that you can retrigger a change from https://integration.mediawiki.org/ci/gerrit_manual_trigger/ [19:13:10] <^demon> Yeah, I know. [19:13:12] ah but, LeslieCarr [19:13:13] <^demon> I was making sure I had a syntax error :) [19:13:19] i don't think I would be able to run netcat -ul on that addy [19:13:28] if gmond was using it [19:13:30] example: [19:13:41] on an26, where I have udp2log listening on a multicast addy [19:13:53] $ netcat -ul 233.58.59.1 8420 [19:13:53] netcat: Address already in use [19:13:55] so, on an10 [19:13:56] ah :) [19:14:07] if gmond was actually using that, I wouldn't be able to join it on the same machine [19:14:28] hrm…. [19:15:36] ^demon: ahhh it missed the git strategy, which should be set to "Gerrit Trigger" [19:16:08] 19:15:45 Fetching upstream changes from https://gerrit.wikimedia.org/r/p/operations/puppet.git [19:16:09] 19:15:59 Commencing build of Revision 9a1ce9776c426c4081cbf12942b77adf478a721d (production) [19:16:09] yeahhh [19:16:13] 19:16:02 Finished: FAILURE [19:16:32] <^demon> There we go. [19:16:36] ^demon: so the Git plugin has different strategies when it comes to grabbing the commits [19:16:36] <^demon> Thanks [19:17:05] ^demon: in chronological order or reverse one [19:17:05] or using the Gerrit trigger [19:17:05] Change abandoned: Demon; "Done testing." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29825 [19:17:20] it is missing the ansi colors :/ [19:17:26] and I am wondering why rake lint does not work [19:17:34] err rake validate [19:18:03] binasher: have you played around with telnet and mc* ? [19:18:31] ^demon: ahhh I remember why 'rake validate' does not work. It relies on a class which is not available in the puppet client installed on gallium [19:18:34] <^demon> Right. [19:18:36] <^demon> I tried it :) [19:19:00] binasher - ping [19:19:28] dfoy made another request for traceroute - rt 3758 [19:19:31] AaronSchulz: ? [19:19:42] as in, to test it [19:19:51] ^demon: so that would probably self-fix when gallium is upgraded :-] [19:20:16] RECOVERY - NTP on srv224 is OK: NTP OK: Offset -0.02762901783 secs [19:20:35] woosters: both of his traceroute requests are done in that ticket, and both were to the same address [19:20:54] he wants telnet [19:20:54] ^demon: any reason why you are removing the color output from puppet ? [19:21:00] woosters: what? [19:21:02] Chris forgot to do that [19:21:28] RECOVERY - NTP on srv223 is OK: NTP OK: Offset -0.03235328197 secs [19:21:44] woosters: no he didn't, that wasn't a real request for anything [19:22:07] <^demon> hashar: That's what the old script did, I just copy+pasted. [19:23:05] his request 'We're following up with our partners, who are asking for a traceroute and telnet output from our Vumi server (zhen/silver).'
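On the "telnet and mc*" question, the memcached text protocol is easy to poke by hand; a sketch using nc against one of the boxes from the loop quoted further down (mc1 stands in for any of them):

    # set a 5-byte value with a 60-second TTL, read it back, then quit
    printf 'set test_key 0 60 5\r\nhello\r\nget test_key\r\nquit\r\n' | nc mc1 11211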
[19:23:32] ^demon: you can try out with colors though the yellow on a white background is probably not going to play nice [19:23:34] there is no connectivity between us and their ip, as the traceroutes all show [19:23:47] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [19:24:02] woosters: they don't even have telnet open there.. that's gibberish [19:24:37] ya, i would be surprised if the telnet port is open on their side [19:25:18] Eloquence: I bet you have ogg video lengths now, when using varnish upload in eqiad ;) [19:25:42] hi Mark [19:25:43] hi [19:25:49] u got upload going in eqiad? [19:25:56] no, tomorrow [19:26:03] mark, cool - lemme try [19:26:04] found and fixed/optimized some more stuff today [19:26:11] and had to clear the cache, so i'll need to do a slow rampup [19:27:18] it's actually completely empty now, so whatever you do will be cache misses - worst case [19:27:53] AaronSchulz: are you saying there's a problem? [19:29:40] mark - how's ur idea about hashing thumbs to original's hash key? [19:29:43] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [19:29:52] just posted to wikitech-l about it [19:30:06] it has pros and cons [19:31:08] mark: thanks for the reply, from which IP range should i use an IP for wikidata-lb.wikimedia.org though. (the other -lb.wikimedia.org are all in the esams network) [19:31:22] mark: yes! it works. the user experience seems identical on squid vs. varnish now :-) [19:31:29] it's also working pretty much the same [19:31:30] binasher: no, just curious what testing was done so far [19:31:49] Eloquence: we're gonna improve it some more, by sending an X-Content-Duration header with the video length [19:31:49] mediawiki/swift can set/send that [19:31:52] AaronSchulz: for x in mc{1..16} ; do echo $x ; echo -e "stats\nquit" | nc $x 11211 | grep uptime ; done [19:31:53] awesome [19:32:31] mutante: i don't understand what you mean [19:32:41] wikidata-lb.wikimedia.org will be a georecord [19:32:46] it doesn't have an ip address [19:33:42] binasher: ok [19:34:09] mark: if in zone file "wikidata.org" i point to wikidata-lb.wikimedia.org i will also need to add that record in wikimedia.org. Or i just point to wikidata-lb.eqiad.wikimedia.org [19:34:10] mutante: the only difference is, the geomap for that record will not send anything to esams (for now) [19:34:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:34:20] mutante: no [19:34:31] are you aware of the powerdns geobackend and how that works? [19:34:33] it's documented on wikitech [19:36:28] mutante: wikipedia-lb.wikimedia.org etc don't exist in the wikimedia.org zone file either [19:36:32] no, hmm, i just see the others wikiquote-lb, wikimedia-lb having IP addresses in esams. [19:36:56] no that's wikiquote-lb.esams.wikimedia.org [19:36:59] wikipedia-lb 1H IN A 91.198.174.225 [19:37:01] check the $ORIGIN [19:37:06] ah [19:37:42] mutante: this is making me nervous, i'll do the config change tomorrow ok? :) and then show you what I did for next time [19:38:14] i need to go now [19:39:07] mark: ok, thanks, i have diffs sitting on sockpuppet, they just wish they didn't have to wait until tomorrow to set up the wiki [19:39:37] if you absolutely need to push this today, get someone to help you who understands dns well, as well as our dns setup [19:39:40] (so not Ryan ;-p) [19:39:57] be careful [19:40:24] <^demon> mutante, mark: Thanks for your help.
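The $ORIGIN point is the crux of mark's explanation: the bare -lb name is a geo record served by the powerdns geobackend, while the per-site name has a fixed address. A sketch of how the difference shows up (answers to the first query vary with the resolver's location):

    dig +short wikipedia-lb.wikimedia.org         # geo record, varies by where you ask from
    dig +short wikipedia-lb.esams.wikimedia.org   # fixed per-site record, e.g. 91.198.174.225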
[19:40:34] I'm out of here, have a good night [19:40:35] <^demon> mark: I'll talk to you in the morning about it if you have any questions from our side. [19:40:45] ^demon: ok, we can get this done tomorrow [19:40:50] <^demon> Thanks, have a good night. [19:40:51] the whole thing [19:41:01] night! [19:41:02] mark: ok, good night [19:49:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [19:52:33] New review: Dzahn; "wikidata apache config, looks good to start with" [operations/apache-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/27546 [19:52:33] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/27546 [19:58:45] dzahn is doing a graceful restart of all apaches [19:59:01] !log dzahn gracefulled all apaches [19:59:30] Logged the message, Master [19:59:30] mutante: sooooo, you'll actually have to do that in 30 minutes [19:59:33] oh, wait no [19:59:44] that's just for that ops-maintained one [19:59:44] nvm [19:59:44] sorry [20:00:03] ops-maintained one? [20:00:21] the one in operations/puppet [20:00:28] (not being very clear) [20:00:31] (sorry) [20:00:41] the one I wrote an email about the other day [20:00:43] what's happening in 30 minutes [20:00:46] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [20:00:55] puppet will have run on all hosts [20:00:55] but! [20:00:57] you're doing something different [20:00:57] sorry [20:01:11] ah, yeah [20:01:42] i pushed that via sync-apache, yep [20:02:00] ^demon: done, the apache config is on [20:02:24] <^demon> I saw, thanks. [20:02:45] ok [20:03:13] en.wikidata now redirects to meta [20:03:25] as opposed to the "unconfigured domain" message [20:17:04] ottomata: ok back and stuffs [20:17:39] New review: Asher; "Please remove redirect.conf from here, and commit under the operations/puppet repo with a manifest t..." [operations/debs/squid] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/27035 [20:18:28] cool [20:18:28] danke [20:18:45] so questions: [20:19:17] 1. why can't I see multicast traffic using netcat on nodes other than an10? [20:19:17] 2. why doesn't an10's gmond join the multicast group? [20:20:25] ottomata: i am going to use the power of the "internets" to see if there's someone with a similar problem ? [20:20:38] as number 2 [20:20:45] hmmmm ok! [20:22:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:06] binasher, are there any secrets in compiling redirector for production other than gcc -O3 -o redirector -lpcre redirector.c on a 64-bit Precise machine? [20:31:00] MaxSem: nope, then strip redirector [20:35:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds [20:36:32] :-/ so according to netstat, analytics1010 is listening … however it's not according to the switch … my gut instinct is that it's the stupid switch software ...
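The redirector build exchange above, written out as a script; a sketch under the stated assumptions (no packaging steps beyond what MaxSem and binasher mention):

    # -O3 and PCRE linkage as in the question; listing the source before -lpcre
    # is the safer ordering for linkers that default to --as-needed.
    gcc -O3 -o redirector redirector.c -lpcre
    strip redirector   # binasher's one addition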
[20:36:50] lemme see if there's anything in release notes on a minor upgrade of this [20:37:44] minor upgrades would allow me to upgrade each individual switch without breaking the stack until they are all at the same version [20:37:49] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours [20:39:56] * Damianz thinks 3..2..1 'wth isn't it part of the stack anymore' [20:40:58] New patchset: Pyoungmeister; "adding mc1 and mc2 as data sources in ganglia.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29876 [20:42:53] Change abandoned: Pyoungmeister; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29876 [20:44:20] !log demon synchronized php-1.21wmf2/includes/EditPage.php 'Deploying I4c2055be' [20:44:35] Logged the message, Master [20:44:38] New patchset: Pyoungmeister; "adding mc1 and mc2 as data sources in ganglia.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29877 [20:45:22] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29877 [20:48:58] !log demon synchronized php-1.21wmf2/includes/Title.php 'Syncing Ic2d3f0b8' [20:49:10] Logged the message, Master [20:50:52] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [21:00:37] ottomata: hrm, so i am just curious, i'm going to try making analytics1008 a ganglia master :) [21:04:02] certainly [21:04:13] i probably want to have two of them eventually anyway, right? [21:04:27] for redundancy? (not sure how that is supposed to work, just heard that was a good thing to do) [21:06:40] ottomata: yeah [21:06:41] i was just doing this to see what's up [21:06:46] ok this is so crazy [21:06:49] yeah? [21:06:59] tcpdump on analytics1008, all sorts of traffic [21:07:33] or maybe not [21:07:54] New patchset: MaxSem; "New, configurable redirector and its config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29883 [21:08:03] hadoop is running [21:08:08] gah this makes no sense [21:08:08] so probably lots of traffic all over the place [21:08:12] especially with an01 [21:08:15] but not on 239.192.1.32 :) [21:08:18] aye [21:08:26] it was having traffic from all the hosts… and then suddenly, not any more [21:08:37] that addy? [21:08:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:09:09] yeah [21:09:11] whoa, netcat -ul on an08 shows stuff [21:09:11] gah [21:09:21] netcat -ul 239.192.1.32 8649 [21:09:35] so you say gmond got traffic on that for a min, and then stopped? [21:10:04] well via tcpdump [21:11:46] Maybe no one is taking a dump on the server. [21:12:07] can I try something real quick? [21:12:13] gonna restart gmond on an08 [21:12:46] LeslieCarr^? [21:13:04] ok [21:13:12] hehe i just restarted it [21:13:12] and magic! [21:13:13] it's happy [21:13:18] oh? [21:13:20] actually i wonder [21:13:27] don't do a netcat [21:13:40] yeah whoa [21:13:40] i'm not [21:13:45] i see the traffic too [21:13:48] via tcpdump [21:13:55] hm [21:13:58] i shouldn't be able to do netcat [21:13:59] right? [21:14:03] i am wondering if netcat could be causing it ?
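A hedged way to pair the gmond restart with a quick census of who is actually sending to the group (eth0 assumed; field 3 of tcpdump's default output is the source address.port):

    service gmond restart
    timeout 10 tcpdump -l -n -i eth0 'dst host 239.192.1.32 and udp port 8649' \
      | awk '{print $3}' | sort | uniq -c | sort -rn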
[21:14:10] not certain [21:14:10] i'm not running it [21:14:12] i did before though [21:14:12] like i would think not but i'm not sure [21:14:27] maybe running it once causes the interface to join the mcast group or something (i'm a little ignorant here) [21:14:28] i'm going to try to run it right now [21:14:30] ok [21:14:53] hrm, that didn't stop the traffic flow [21:15:02] well there goes that theory [21:15:02] gmond is listening [21:15:07] tcp 0 0 0.0.0.0:8649 0.0.0.0:* LISTEN 19825/gmond [21:15:07] udp 0 0 239.192.1.32:8649 0.0.0.0:* 19825/gmond [21:15:22] New patchset: MaxSem; "Bug 38617 - Redirector.c should read redirect regex from config" [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/27035 [21:15:22] should we restart gmond again and see if it keeps working? [21:15:27] sounds good [21:15:30] want to do the honors ? [21:15:36] done [21:15:51] still looks pretty happy to me [21:15:55] yup [21:16:06] hm [21:16:34] still showing as down in ganglia.wikimedia.org [21:17:02] yeah, so my guess is it's not sending to nickel [21:17:19] an10 is [21:19:37] hrm, so right now i'd see if setting up any machines that are not 1010 purely via puppet works [21:22:14] hmmmmm, ok, I will try an03? [21:23:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.032 seconds [21:25:54] New patchset: Ottomata; "Attempting to use analytics1003 as ganglia aggregator" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29884 [21:26:54] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29884 [21:27:06] oo, jenkins bot doesn't notify of lint check pass in here yet [21:27:21] [21:27:30] it should notify if it fails though? [21:27:40] hopefully heh [21:28:49] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [21:33:12] New patchset: Dereckson; "(bug 40848) Enable WikiLove extension on ur.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29886 [21:37:03] !log aaron synchronized php-1.21wmf2/includes/EditPage.php 'deployed 51eede5af0fe0bb28d7abdf81a04292ba8a4864c' [21:37:18] Logged the message, Master [21:39:58] MatmaRex_: and in #wikimedia-operations there is now sort of an on-duty liaison that changes regularly, which is basically the person you can talk to to ask "who in Ops should I talk to about this problem?" [21:39:59] (from #mediawiki) [21:40:06] Apparently this is the person said to be 'on rt duty' in the topic [21:40:28] Is there any chance that ('rt duty') could be changed? It's not very clear [21:40:28] Yes, though he's not here =D [21:40:49] hmmm LeslieCarr [21:40:57] an03 is now an aggregator [21:41:26] but I only see it sending traffic to the mcast addr [21:41:29] not receiving any [21:43:27] how does gmetad know to ask an03 for stats? [21:43:47] hmm, yeah i changed it in the data_source and ran puppet on nickel [21:48:46] PROBLEM - Puppet freshness on cp1043 is CRITICAL: Puppet has not run in the last 10 hours [21:53:22] hehe [21:53:25] sorry [21:53:28] cool, now happiness ?
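The gmetad side of the an03 question: gmetad only polls hosts named in a data_source line, which is what the puppet run on nickel rewrites. A sketch, with the config path assumed and the hostnames used only as examples:

    grep '^data_source' /etc/ganglia/gmetad.conf
    # expected shape of the line for this cluster:
    # data_source "Analytics cluster eqiad" analytics1003.eqiad.wmnet analytics1010.eqiad.wmnet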
[21:53:28] s'ok [21:53:43] PROBLEM - Puppet freshness on srv255 is CRITICAL: Puppet has not run in the last 10 hours [21:53:54] yeah, so i don't see traffic from any other nodes [21:55:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:56:59] no happiness [21:57:01] LeslieCarr ^ [21:57:02] :( [21:57:16] kinda like it was earlier, when an10 was master [21:57:23] only an03 is communicating with multicast addy [21:59:26] gah [21:59:32] why does it fail [21:59:42] brb, gotta go pick up some food real quick [22:02:32] mk [22:02:42] i might head out pretty soon, so maybe I'll bug you more about this tomorrow :) [22:02:46] thanks for the help so far [22:06:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.273 seconds [22:12:46] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [22:19:50] preilly/paravoid: there is a slight complication with the libxml thing [22:22:11] specifically, libxml will call xmlMalloc() whether it is going to store a pointer in a true global variable or in a thread-specific state [22:24:16] it looks like it only stores pointers in global variables when xmlInitParser() is called, and only frees them when xmlCleanupParser() is called, so I can probably use that to distinguish between persistent allocation and per-request allocation [22:24:31] but it makes it slightly more dubious and less likely to be accepted upstream [22:25:14] also, initialisation in a multithreaded SAPI has its own complications [22:25:20] * AaronSchulz looks at preilly's empty chair [22:25:55] in fact, the existing code is probably broken in multithreaded mode [22:32:47] ottomata: back [22:32:47] sigh [22:33:01] sigh? [22:33:02] 'heh [22:33:04] sigh. [22:33:25] TimStarling: what was the reason you didn't want to use igbinary? [22:35:57] we can use it, as long as igbinary.compact_strings is off [22:37:10] do you want to? [22:37:57] LeslieCarr, I really gotta run, do what you can, I might be back on laters [22:38:02] or tomorrow [22:38:02] byyyeeee [22:39:48] I'm not really enthusiastic about it, mostly because it's a fairly small win in exchange for some added complexity, and the fact that the code is not so well-tested [22:39:55] Change merged: Asher; [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/27035 [22:40:12] what's that serialization format that domas likes? [22:40:18] lol [22:40:28] there's one that actually lets you omit the property names [22:40:42] instead of just reducing the amount of punctuation [22:40:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:47:18] we have memcached.serializer => igbinary => igbinary and igbinary.compact_strings => On => On by default.. TimStarling: would you prefer we go back to the php serializer vs. disabling compact_strings? [22:47:26] New patchset: Dereckson; "Namespace configuration for hsb.wiktionary.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29890 [22:48:10] just disable compact_strings [22:48:39] New patchset: Dereckson; "(bug 29890) Namespace configuration for hsb.wiktionary.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29890 [22:49:00] New review: Dereckson; "PS2: More comprehensive commit message." 
[operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/29890 [22:49:54] New patchset: Dereckson; "(bug 41328) Namespace configuration for hsb.wiktionary.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29890 [22:50:19] New review: Dereckson; "PS3: Fixing bug id." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/29890 [22:53:15] binasher: is it expected that upload.wm.o is not sending cache-control headers? [22:53:46] pgehres: As seen from the outside world? [22:53:48] !log authdns changes adding db63-db78 mgmt [22:53:59] Logged the message, Master [22:54:06] RoanKattouw: not sure what you mean? [22:55:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds [22:56:07] pgehres: Like, you're getting no CC headers on upload.wm.o responses as seen from a machine outside the cluster? [22:56:22] yeah [22:56:36] it is expected [22:57:04] :-( okay [22:57:36] due to the sheer number of images, it seems that we are taking forever to revalidate the entire list every time [22:57:41] swift and the nfs hosts it replaced don't send cc headers, the upload squid conf has defaults in its place, as should the newer varnish cluster [22:58:52] are you saying that it should be set by the squids? [22:59:42] i actually haven't queried upload internally, i am just browsing anon and inspecting the headers [23:06:10] New patchset: Matthias Mullie; "Add Wikivoyage extensions to extensions list" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29891 [23:06:12] Can someone de-pool mw45 for the same reason as mw40? [23:06:54] Change merged: CSteipp; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29891 [23:06:58] !log authdns update adding db78.pmtpa.wmnet [23:07:10] Logged the message, Master [23:07:42] New patchset: Asher; "disable igbinary.compat_strings" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29892 [23:08:35] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29892 [23:09:03] RoanKattouw: ? [23:09:24] outreach.wm.o (seriously NSFW) [23:09:32] !log db78 loading base OS [23:09:42] Logged the message, Master [23:11:19] pgehres: no, just saying that we have the upload servers configured to cache images in the absence of a cc header [23:11:29] PROBLEM - Auth DNS on ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [23:12:12] !log restarting pdns on ns0 [23:12:19] Logged the message, Master [23:12:55] RECOVERY - Auth DNS on ns0.wikimedia.org is OK: DNS OK: 0.008 seconds response time. www.wikipedia.org returns 208.80.154.225 [23:13:32] AaronSchulz: memcached-serious log group? :) [23:14:10] binasher: everything ok with libmemcached/igbinary? [23:14:21] hmm? [23:14:29] just wondering [23:14:35] if the packages work etc. [23:14:49] so far [23:15:12] great [23:20:38] binasher: I originally made memcached-pecl, and then realized that didn't match the name of the calling function [23:20:45] * AaronSchulz is too ocd about stuff like that [23:21:49] heh [23:22:19] ok, the change looks good but i want to let puppet update igbinary.ini everywhere and then bounce apaches before merging [23:22:36] although, i guess just making sure srv193 is updated is good enough for this [23:27:38] binasher: what did that igbinary puppet change you just made affect?
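The anonymous header inspection pgehres describes can be reproduced with curl; a sketch, with the image path chosen only as an example:

    curl -sI 'http://upload.wikimedia.org/wikipedia/commons/6/63/Wikipedia-logo.png' \
      | grep -iE 'cache-control|age|x-cache'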
[23:28:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:29:18] AaronSchulz: "Serialization performance depends on the "compact_strings" option which enables [23:29:19] duplicate string tracking. String are inserted to a hash table which adds some [23:29:20] overhead. In usual scenarios this does not have much significance since usage [23:29:21] pattern is "serialize rarely, unserialize often". With "compact_strings" [23:29:22] option igbinary is usually a bit slower than the standard serializer. Without [23:29:23] it, a bit faster." [23:30:02] it defaults to on, updated ini sets to off [23:31:10] and that affects memcached as used now? I thought we just used php serialization? [23:31:18] no [23:31:32] no effect on memcached as used now [23:32:01] just prep for the pecl client [23:32:10] well, I will need to amend 29824 to use igbinary then [23:33:57] ah, does MemcachedPeclBagOStuff override memcached.serializer? [23:35:12] New patchset: Aaron Schulz; "Added memcached-pecl cache object (enabled for testwiki)." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29824 [23:35:40] New patchset: MaxSem; "New, configurable redirector and its config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29883 [23:36:22] yes [23:36:25] binasher: it default to setting Memcached::SERIALIZER_PHP on the Memcached object unless specifically given igbinary [23:36:32] *defaults [23:36:47] New patchset: MaxSem; "Update tests" [operations/debs/squid] (master) - https://gerrit.wikimedia.org/r/29895 [23:37:08] AaronSchulz: ok, that's probably a good thing [23:37:08] because I was afraid that someone would accidentally enable it without turning off compact_strings [23:37:47] so to turn on igbinary, you first have to read the MemcachedPeclBagOStuff documentation where it says that you should disable compact_strings [23:38:40] haha :) [23:39:09] part of the idiot-proof Starling configuration security system [23:40:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.677 seconds [23:42:58] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29883 [23:51:11] binasher: if we wanted to downgrade the image scalers to Lucid tonight, would that be possible [23:52:24] ? [23:52:34] robla: if it makes sense to do so vs. fixing something [23:53:05] notpeter: ^^ [23:53:15] what's the problem? [23:53:34] https://bugzilla.wikimedia.org/show_bug.cgi?id=41361 [23:53:58] we started discussing this in #mediawiki, and TimStarling suggested we might want to revert [23:54:22] I'd be surprised if downgrading affected that bug [23:54:29] that must be the wrong bug # [23:54:32] https://commons.wikimedia.org/w/index.php?title=Commons:Village_pump&oldid=81616486#More_image_scaler_problems [23:54:45] I think he means https://bugzilla.wikimedia.org/show_bug.cgi?id=41362 [23:54:55] erm, the URLs in that bug are originals [23:54:55] oh [23:55:15] blame the French [23:55:23] or did I do some mistake with URLs again [23:55:41] the list on commons looks like several separate issues to me [23:55:55] sure [23:56:05] and I doubt they're related to the upgrade [23:56:05] yeah, I was using 41361 as a proxy for that, since it had the URL [23:56:06] it's not recent stuff [23:56:12] maybe the first two are interesting [23:56:36] two or three [23:56:49] is the ogg stuff recent? 
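A hedged spot check of the serializer behaviour Aaron and Tim describe above — the ini value puppet now sets, plus the pecl client's explicit serializer option (run on any apache with the igbinary and memcached extensions loaded):

    php -r '
      var_dump(ini_get("igbinary.compact_strings"));   // should be off after the puppet run
      $m = new Memcached();
      // MemcachedPeclBagOStuff-style default: PHP serializer unless igbinary is requested
      $m->setOption(Memcached::OPT_SERIALIZER, Memcached::SERIALIZER_PHP);
      var_dump($m->getOption(Memcached::OPT_SERIALIZER) === Memcached::SERIALIZER_PHP);
    '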
[23:57:23] it looks like we have test cases, so we should be able to debug them without the code actually being live [23:57:23] seems to be if the comment time is to be believed [23:59:12] the oggThumb thing is pretty explicit [23:59:25] not much isolation work to do there
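For the oggThumb case, a local reproduction is straightforward; a sketch with the flags quoted from memory (check oggThumb's help output) and a placeholder input file:

    # grab one frame at t=1s, scaled to 320x240, into thumb0.jpg
    oggThumb -t 1.0 -s 320x240 -n thumb%d.jpg Testcase.ogv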