[00:01:14] Ryan_Lane: if i'm logged in, can i do that via Special:UserLogin&type=signup ?
[00:01:19] yep
[00:04:34] New patchset: Ryan Lane; "Restrict a region to its own services" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20249
[00:05:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20249
[00:06:44] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20249
[00:18:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:20:29] New patchset: Ori.livneh; "Event logging via varnishlog over ZeroMQ" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20039
[00:21:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20039
[00:25:19] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20039
[00:29:55] AaronSchulz: it's not just the image scalers.
[00:31:10] most of the apaches?
[00:32:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.225 seconds
[00:32:11] AaronSchulz: sorting by the most hits (over the last 1000 lines), no image scaler even makes the list.
[00:32:22] sorry, no image scaler makes the top 60.
[00:35:04] most of the GETs seems to be on other wikis getting local copies of commons files
[00:35:07] top 70 list is a range from 24 hits to 193 hits (I'm only counting GETs)
[00:35:19] (of the last 1000 lines of log)
[00:35:34] * AaronSchulz adds todo list entries to beef up some profiling
[00:38:12] New patchset: Ryan Lane; "Fix naming of virt hosts in eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20250
[00:38:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20250
[00:39:04] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20250
[00:39:36] binasher: I merged the zeromq change
[00:41:53] New patchset: Dzahn; "load mod_rewrite" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20251
[00:42:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20251
[00:42:38] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20251
[00:46:35] New patchset: Ryan Lane; "Disabling live-migration configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20252
[00:47:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20252
[00:47:42] New patchset: Dzahn; "duh, fix syntax error in dependency definition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20253
[00:48:22] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20253
[00:48:22] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20252
[01:04:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:16:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.377 seconds
[01:42:20] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 270 seconds
[01:43:23] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 241 seconds
[01:50:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:50:44] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 682s
[01:55:41] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds
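The "top hits over the last 1000 lines, GETs only" tally described at [00:32:11]-[00:35:19] can be sketched as a standard shell pipeline. This is a hypothetical reconstruction, not the command actually run: the two-column log format (method, then URL) and the sample file path are stand-ins for illustration, and the field numbers would differ for a real Apache log.

```shell
# Hypothetical sketch of the tally discussed above: take the last 1000
# lines of a request log, keep only GETs, and rank URLs by hit count.
# The log format here (method followed by URL) is a stand-in.
printf '%s\n' \
  'GET /wiki/Foo' \
  'GET /wiki/Foo' \
  'GET /wiki/Bar' \
  'POST /w/index.php' > /tmp/requests.sample

tail -n 1000 /tmp/requests.sample \
  | awk '$1 == "GET" { print $2 }' \
  | sort | uniq -c | sort -rn | head -70
```

With the sample input above, the pipeline ranks /wiki/Foo first with 2 hits.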
[02:00:56] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 0 seconds
[02:01:14] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 12s
[02:02:30] New patchset: Ryan Lane; "Add policy.json to restrict actions to proper roles" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20259
[02:03:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20259
[02:03:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.708 seconds
[02:08:15] New patchset: Ryan Lane; "Adding in patched code for keystone" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20260
[02:08:56] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20259
[02:08:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20260
[02:09:02] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20260
[02:16:45] New patchset: Ryan Lane; "Block keystone ports, but open them to needed services" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20261
[02:17:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20261
[02:17:35] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20261
[02:20:34] New patchset: Ryan Lane; "Add keystone_service and keystone_admin protocols" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20262
[02:21:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20262
[02:23:05] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20262
[02:36:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:37:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds
[03:15:17] PROBLEM - mysqld processes on es8 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[03:15:35] PROBLEM - mysqld processes on es7 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[03:16:11] PROBLEM - mysqld processes on es6 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[03:16:29] PROBLEM - mysqld processes on es5 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[04:46:10] RECOVERY - LDAPS on virt0 is OK: TCP OK - 0.004 second response time on port 636
[04:46:19] RECOVERY - LDAP on virt0 is OK: TCP OK - 0.001 second response time on port 389
[05:23:12] New patchset: Ryan Lane; "Add tenant_name and user_name attribute config to keystone" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20267
[05:23:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20267
[05:24:00] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20267
[05:31:08] Ryan_Lane: cd /srv/org/wikimedia/controller/wikis/1.20wmf2 this fails because it can't cd to the dir...
[05:31:22] yeah
[05:31:30] I haven't added mediawiki yet
[05:31:47] ah ha
[05:32:25] I thought maybe it was a version thing. nm then
[05:47:49] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours
[05:47:49] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours
[05:47:49] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours
[05:47:49] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours
[05:48:52] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours
[05:48:52] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[05:48:52] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[05:48:52] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[05:48:52] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[05:48:52] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[05:48:52] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[05:48:53] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[05:48:53] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[08:00:03] good morning
[08:15:51] s/good//
[08:16:06] why? :)
[08:17:02] mornings are never good
[08:17:07] that is why we sleep through them!
[08:44:40] PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:45:34] PROBLEM - swift-container-updater on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:45:43] PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:45:52] PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:46:10] PROBLEM - swift-account-replicator on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:46:10] PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:46:10] PROBLEM - swift-container-server on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:46:19] PROBLEM - swift-object-server on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:46:55] PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:46:55] PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:47:04] PROBLEM - swift-object-updater on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:47:04] PROBLEM - swift-account-server on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:55:37] PROBLEM - SSH on ms-be6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:58:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:59:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.061 seconds
[09:32:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:36:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.025 seconds
[09:50:44] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours
[09:50:44] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours
[09:50:44] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[09:50:44] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[09:58:41] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[10:10:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:20:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds
[10:53:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:05:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.654 seconds
[11:39:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:51:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.508 seconds
[12:02:06] New patchset: Mark Bergsma; "Rename misc::deployment-host to ::deployment, merge misc::scripts and password scripts into it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20290
[12:02:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20290
[12:05:58] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20290
[12:14:50] New patchset: Mark Bergsma; "Rename misc::l10nupdate into misc::deployment::l10nupdate" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20291
[12:15:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20291
[12:16:52] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20291
[12:23:27] New patchset: Mark Bergsma; "Move misc::translationnotifications into misc::maintenance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20292
[12:24:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20292
[12:24:30] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20292
[12:25:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:27:53] New patchset: Mark Bergsma; "Add FIXMEs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20293
[12:28:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20293
[12:29:15] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20293
[12:40:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.047 seconds
[12:53:49] hi chris
[12:54:02] would you be able to run some extra uplinks for the row D sdtpa switches today?
[12:54:40] hi mark: yes
[12:54:50] good
[12:54:56] I think we're gonna give all 3 racks an extra gige uplink
[12:54:58] so 3 new runs
[12:55:05] i believe those are to either rack C1 or B1
[12:55:32] but perhaps it's good to verify that before you start
[12:55:59] probably a good idea
[13:13:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:24:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.545 seconds
[13:34:42] !log moving apache logs from pushing to fluorine back to nfs1 for now
[13:34:52] Logged the message, RobH
[13:45:41] New patchset: RobH; "moving apache logs back to nfs1 from fluorine (temp)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20447
[13:46:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20447
[13:52:39] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20447
[13:59:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:11:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.697 seconds
[14:16:47] PROBLEM - Apache HTTP on mw44 is CRITICAL: Connection refused
[14:38:38] cmjohnson1: so... new uplinks should be run to the panel that terminates the first half of line card 11 on csw1-sdtpa
[14:38:43] is probably rack C1, I think
[14:42:43] ok
[14:42:57] then for each rack d1, d2, d3, run patches to that
[14:42:59] don't plug em in yet
[14:45:24] mark: btw, if it's an "optimization" and not a rollback, then we still have some room
[14:45:31] although 2xGbE couldn't hurt either
[14:45:35] yeah I figured
[14:45:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:53:39] !log Setup static 802.3ad trunk on asw-d3-sdtpa with one port up
[14:53:48] Logged the message, Master
[14:57:25] perhaps we should do away with asw-d3-sdtpa now we have an EX4500 in there
[14:57:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.237 seconds
[15:00:26] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time
[15:02:06] New patchset: Ottomata; "Adding Wikipedia Zero filter for Tata India." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20454
[15:02:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20454
[15:03:20] !log all log traffic appears to have moved back to nfs1, copying fluorine logs back to nfs for reformat
[15:03:29] Logged the message, RobH
[15:03:36] hi guys, would one of you kind opsies gimme a little +2 on that one please?
[15:03:40] https://gerrit.wikimedia.org/r/#/c/20454/
[15:04:14] !log all data copied, all traffic moved, fluorine coming down for reinstallation into internal vlan and lvs restructure
[15:04:23] !log lvm restructure
[15:04:24] Logged the message, RobH
[15:04:32] Logged the message, RobH
[15:04:42] and precise please
[15:04:50] mark: Would you be so kind as to please move its network port, it can die now
[15:04:55] i am pulling up dns to change it there
[15:04:56] yeah
[15:04:59] thanks!
[15:04:59] do you happen to know the row?
[15:05:03] will find out now
[15:05:06] i thnk a
[15:05:14] yep, a4-eqiad
[15:08:24] done
[15:09:17] PROBLEM - Host fluorine is DOWN: PING CRITICAL - Packet loss = 100%
[15:09:28] heh, no duh... hrmm, i guess i need to put in decomm
[15:09:34] and run on nagios once before pushing back in service?
[15:09:44] (nagios will have fluorine as its external fqdn)
[15:10:15] ideally
[15:11:14] !log authdns-update for fluorine ip update
[15:11:23] Logged the message, RobH
[15:12:51] New patchset: RobH; "decommissioning fluorine to pull from nagios, as its ip is changing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20455
[15:13:22] now i wish we setup decommissioning.pp to use fqdn.
[15:13:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20455
[15:14:01] New review: RobH; "never self review, also do as i say, not as i do" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/20455
[15:14:02] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20455
[15:15:03] mark: quick q (yes, another one): you were working on exim, correct?
[15:15:09] new smarthosts?
[15:15:11] yeah but waiting on new misc servers
[15:15:16] yeah, too
[15:16:10] as previously discussed, it'd be nice to have a root@ labs VMs be routed differently
[15:16:12] mark: https://rt.wikimedia.org/Ticket/Display.html?id=3278 has erik's approval
[15:16:20] but i kinda want yours on there too please, to confirm the final quote
[15:16:22] that was your task wasn't it
[15:16:24] on the labs relay
[15:16:29] not the production one
[15:16:36] we don't have a labs relay right now
[15:16:39] and I was coming to that
[15:16:40] yeah, fix that :)
[15:16:58] do you see any value in having a labs-specific relay rather than a 4-line router that does things a bit different for *.wmflabs?
[15:17:28] I can see some pros and cons, but I'm leaning towards "same systems"
[15:17:41] to avoid infrastructure duplication
[15:18:13] the con I'm seeing is that on a possible mega-spamming, our production queues will get filled up with junk
[15:18:18] i don't want labs interminglued
[15:18:32] mingled
[15:18:36] * RobH kicks puppet on spence
[15:18:40] this will take a bit.
[15:18:46] a labs relay can do specific things for labs more easily than trying to combine it with production
[15:19:04] what are you thinking exactly?
[15:19:32] automatically routing root@ to the labs project owner's mail address or something
[15:20:11] well yeah, other than that I mean
[15:20:20] that's a given :)
[15:20:20] other than that not much
[15:20:22] but that's enough
[15:20:33] that's basically a 4 line router or something
[15:20:43] which depends on labs infrastructure, which I don't like
[15:20:57] i don't want production email affected by labs infra outages
[15:21:18] hrm
[15:21:21] i see no reason NOT to have a separate labs relay really
[15:21:32] I'm not a big fan of this great divide and that goes both ways, but okay
[15:22:27] also
[15:22:33] outgoing mail should come from a labs ip
[15:22:40] if some instance gets used for spamming or so
[15:22:49] i don't want those outgoing mails coming from a production ip
[15:22:55] unfortunately it's all one ip prefix right now
[15:22:58] but still
[15:23:26] yeah, still, prefix != host for 95% of RBLs
[15:23:34] that's a good point.
[15:23:54] we send wiki mails from a separate ip too now, even on the same relay
[15:23:58] so that's possible, but meh
[15:24:07] just make it seperate and be done with it
[15:25:21] I guess I'll run it on virt0/virt1000, it'd suck having extra machines just for that :)
[15:25:42] yeah
[15:25:44] no need
[15:25:44] are you going to revamp the puppet class(es) for exim?
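The "4 line router" floated in this exchange could, as a rough sketch, look like the following Exim configuration. Everything here is an assumption drawn from the conversation, not a deployed config: the router name, the *.wmflabs domain match, and the blackhole action are illustrative only.

```
# Hypothetical sketch of the "4 line router" idea: handle mail for labs
# instance domains separately (here, simply discarding it). In a real
# setup the data could instead be an LDAP lookup of the project owner.
labs_devnull:
  driver = redirect
  domains = *.wmflabs
  data = :blackhole:
```

A redirect router with `data = :blackhole:` accepts and silently discards the message; swapping the data for a lookup expansion would give the "route root@ to the project owner" variant instead.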
[15:25:49] not much
[15:25:51] if so, I should probably wait for that
[15:25:59] a lot has been done already
[15:26:35] okay, I'll work with them then
[15:26:36] thanks :)
[15:27:09] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours
[15:29:04] currently there are two classes really
[15:29:08] or more
[15:29:16] but one is for the simple exim instance on all servers
[15:29:26] and then there's one for our actual mail servers
[15:29:37] the latter one was redone by peter not long ago
[15:29:39] I'll figure it out
[15:29:54] and feel free to suggest changes of course
[15:29:56] I have other things on the top of my list
[15:30:00] but I'm so fed up with the cronspam
[15:30:00] sure
[15:30:02] me too
[15:30:14] we could devnull it already hehe
[15:30:29] that's what I initially thought, put a router to silence it since noone looks at it anyway
[15:30:37] i'd be fine with that
[15:30:42] but then I thought, devnulling it and putting an ldap query is like a one-line diff
[15:30:48] hence my original question :)
[15:30:54] don't like that ;)
[15:31:10] just check for labs instance ip space and devnull that
[15:31:44] !log puppet is taking forever on nagios, as soon as it completes, will remove fluorine from decom and rerun
[15:31:46] so slow.
[15:31:54] Logged the message, RobH
[15:32:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:33:49] of course the main cronspam right now wouldn't be affected by it
[15:33:53] as virt1000 is production ;)
[15:34:00] I know
[15:35:30] I'm happy you don't want to segregate that too :)
[15:35:56] no
[15:36:08] we haven't been very consistent on the dependency on labs' ldap though
[15:36:08] New patchset: RobH; "fluorine moved from wikimedia.org to eqiad.wmnet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20456
[15:36:14] it has never got into anything mission critical
[15:36:19] but we use it for e.g. gerrit
[15:36:22] i know
[15:36:23] or asher's db tools
[15:36:35] and that's about as far as it'll get
[15:36:41] and I've heard there are plans to merge wikitech with labsconsole
[15:36:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20456
[15:36:52] i don't like that
[15:36:53] which starts to get a bit mission critical
[15:38:02] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20456
[15:38:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.322 seconds
[15:44:35] !log Created static 802.3ad trunks on asw-d1-sdtpa and asw-d2-sdtpa for uplinks
[15:44:44] Logged the message, Master
[15:47:22] !log Created static 802.3ad trunks on csw1-sdtpa for asw-d1-sdtpa to asw-d3-sdtpa uplinks, deployed with only the existing port enabled
[15:47:31] Logged the message, Master
[15:48:13] damn it, i keep forgetting one file or something for this damned reinstall
[15:48:21] atleast spence is finally running service updates
[15:49:24] New patchset: RobH; "fluorine dhcpd update" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20457
[15:49:50] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours
[15:49:50] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours
[15:49:50] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[15:49:50] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours
[15:49:50] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours
[15:49:50] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[15:49:51] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[15:49:52] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours
[15:49:52] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[15:49:52] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[15:49:53] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[15:49:53] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[15:49:54] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[15:50:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20457
[15:51:07] New patchset: RobH; "fluorine dhcpd update & recomed from decom file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20457
[15:51:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20457
[15:53:08] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20457
[15:54:25] !log fluorine now coming down for reinstall, all log files backed up to nfs
[15:54:34] Logged the message, RobH
[16:06:41] paravoid or notpeter, maybe you two have some input
[16:06:49] im reimaging fluorine, and it has 2.4T for an lvm
[16:06:55] im goign to use 2.0 of that
[16:07:15] but its apache log files, any reason not to use ext3/4 (you can shrink a lvm with that filesystem right?)
[16:07:52] since its not db driven, i am not sure if using xfs for the filesystem is preferred in this instance.
[16:08:13] shrinking fs is not a very good idea generally
[16:08:18] even with ext3/4 you can't do it online
[16:08:22] right, but its atleast possible
[16:08:22] but that doesn't work with xfs either
[16:08:32] yes
[16:08:35] xfs can only grow
[16:08:39] i dunno, just trying to take the approach of keeping the lvm as dynamic as possible.
[16:08:42] Anything is possible :)
[16:08:43] ext3/4 can grow online and shrink offline
[16:08:46] seems like maybe i should just ext3 it.
[16:08:55] maplebed: good morning!
[16:09:13] back
[16:09:15] first of all
[16:09:21] dns for 10.x is currently offline
[16:09:26] i think the last dns update was rob
[16:09:29] can you figure out what's wrong?
[16:09:35] it dug ok for me, wtf, sorry, will check
[16:09:36] my bad
[16:09:53] RobH: use xfs
[16:10:04] ok
[16:10:16] xfs is reliable, fast, and good with large files
[16:10:23] cmjohnson1: awesome
[16:10:33] you can plug them all in in port 0/1/47 of the 3 switches
[16:10:41] so the port before the existing uplink
[16:10:46] and then i'll give you the port assignments on the other end
[16:11:06] they are disabled
[16:11:10] so don't be surprised when they don't come up
[16:11:12] xfs has been very unreliable for me and particuarly in power failures, but this was years ago
[16:11:22] as you once found out, we have to be very careful with these uplinks and creating loops
[16:11:40] paravoid: yeah it used to be not so great
[16:11:55] RobH: you can?
[16:11:44] mark: What do you mean its offline? I can dig fqdn in the 10.address space against ns0/1/2?
[16:11:59] we seem to use it around here a lot with no problems, so it must have gone better
[16:12:17] getting SERVFAILs here
[16:12:23] maplebed: ping?
[16:12:24] dig @ns0.wikimedia.org fluorine.eqiad.wmnet ?
[16:12:31] no, i said 10.x
[16:12:31] morning!
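The grow/shrink behavior being debated above (ext3/4 can grow online and shrink offline; xfs can only grow) translates into a fairly standard LVM command sequence. The volume group, LV, and mount point names below are made up for illustration and the sizes are arbitrary; this is a sketch of the commands involved, not anything taken from fluorine itself.

```
# Growing is safe online for both filesystems (hypothetical names):
lvextend -L +200G /dev/vg0/aplogs
resize2fs /dev/vg0/aplogs          # ext3/4: grows while mounted
xfs_growfs /mnt/aplogs             # xfs: also grows while mounted

# Shrinking is ext3/4-only, and offline-only:
umount /mnt/aplogs
e2fsck -f /dev/vg0/aplogs          # fsck is required before a shrink
resize2fs /dev/vg0/aplogs 1500G    # shrink the filesystem first...
lvreduce -L 1500G /dev/vg0/aplogs  # ...then the LV; xfs has no equivalent
```

The ordering matters: shrink the filesystem before the logical volume, and grow the logical volume before the filesystem, or data past the boundary is lost.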
[16:12:35] different zone
[16:13:02] dig @ns0.wikimedia.org 15.3.0.10.in-addr.arpa
[16:13:03] i see my issue
[16:13:07] i had a typo =P
[16:13:17] or dig -x @ns0.wikimedia.org 10.0.3.15 :)
[16:13:22] so when you do authdns-update, check whether it actually loads the zone without issues
[16:13:34] duly noted
[16:13:41] you got lucky here
[16:13:46] we all did
[16:13:47] nothing much seems to be affected by it
[16:13:57] well rob likes to say he's never caused a dns outage
[16:14:00] my svn commit 'typo, im lucky nothing really broke'
[16:14:02] so ;)
[16:14:05] mark: still didnt, thats a hiccup.
[16:14:08] ;p
[16:14:26] in simply the best zone to cause a hiccup in.
[16:14:26] obviously doing this typo in the wikimedia.org zone would have been disastrous
[16:14:35] indeed, i live a charmed life.
[16:15:02] !log authdns-update to correct typo, nothing really broke badly, i r lucky
[16:15:12] Logged the message, RobH
[16:15:21] someone in the office make sure to point out my dns break to Ryan_Lane
[16:15:42] hi cmjohnson1, paravoid. happy friday!
[16:15:50] hi ben
[16:15:57] almost weekend here!
[16:16:02] so close!
[16:16:24] so did the swift migration finish yesterday?
[16:16:27] we weren't entirely sure
[16:16:30] cmjohnson1: yep
[16:16:32] sec
[16:16:37] yeah, that's what I wanted to ask you
[16:16:55] cmjohnson1: the second uplink for asw-d1-sdtpa will go to port 11/19
[16:16:56] mark: the thingc that's left to do is the squid acls to send upload.wm.org/originals to swift instead of ms7.
[16:17:00] (so probably port 19 on that patch panel)
[16:17:05] maplebed: have you seen the graph?
[16:17:17] something changed a few hours after I left
[16:17:20] errr...
[16:17:22] umm...
[16:17:25] huh.
[16:17:30] that happened after I left too.
[16:17:36] cmjohnson1: the second uplink for asw-d2-sdtpa will go to port 11/16
[16:17:40] maplebed: http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&hreg[]=^ms-fe[1-9].pmtpa&mreg[]=swift_[A-Z]%2B_hits%24&gtype=stack&title=Swift+queries+per+second&aggregate=1
[16:17:42] I know aaron was continuing to work on it when I left
[16:17:48] and the second uplink for asw-d3-sdtpa will go to port 11/17
[16:17:58] maplebed: http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&hreg[]=^ms-fe[1-9].pmtpa&mreg[]=swift_[0-9]%2B_hits__%24&mreg[]=swift_other_hits__%24&gtype=line&title=Swift+percentage+queries+by+status+code&aggregate=1
[16:18:01] etc.
[16:18:04] I had a chat with him just before taking off that the sheer volume felt wrong.
[16:18:22] but he's not here yet this morning.
[16:18:25] anything in the server log?
[16:18:27] so did any ops actually stay around/
[16:19:03] well, it was stable when we all left.
[16:19:11] so no, when night came we went home.
[16:19:14] cmjohnson1: what is it?
[16:19:25] port seems down
[16:20:14] the SAL shows aaron pushing a bunch of stuff at 03:00ish UTC.
[16:20:22] which corresponds to the change in the graph.
[16:20:56] mark: btw, when we get rid of hardy for dobson, we can install bind9utils and hook named-checkzone into authdns-update
[16:21:05] (I tried to do it now, but... hardy :/)
[16:21:19] paravoid: yeah
[16:22:14] maplebed: yes. the log is fairly cryptic, at least for us non-developers. I'd like to know more, but you obviously don't know either
[16:22:25] we'll ask him...
[16:25:49] ok
[16:27:35] can't tell, probably, it's disabled
[16:27:38] insert the other ones first ;)
[16:27:46] d3 will come up
[16:28:06] 11/17
[16:29:24] that's working
[16:29:30] going to enable the other two
[16:30:45] cool
[16:30:46] all working
[16:30:48] thanks!
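The bind9utils idea mentioned at [16:20:56] would amount to a pre-push guard in the update script along these lines; this is a hypothetical sketch only, and the zone name and file path are placeholders, not the real authdns layout.

```
# Hypothetical pre-push check for authdns-update: refuse to deploy a
# zone that named-checkzone cannot load (placeholder zone and path).
named-checkzone 10.in-addr.arpa /path/to/zones/10.in-addr.arpa || {
    echo "zone check failed; aborting authdns-update" >&2
    exit 1
}
```

With a guard like this in place, the typo'd 10.in-addr.arpa zone discussed earlier would have been rejected before it was ever served.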
[16:31:08] !log Enabled all 3 new 802.3ad aggregated links for asw-d1-sdtpa to asw-d3-sdtpa
[16:31:17] Logged the message, Master
[16:34:39] cmjohnson1: project2
[16:34:54] 11/15 Up Forward Full 1G None No level0 0012.f2c5.5600 project2
[16:34:57] is that still there?
[16:34:59] burn that.
[16:38:20] ah
[16:38:22] not setup yet then
[16:38:22] ok
[16:38:55] good
[16:41:35] in fact, why...oh damn it
[16:41:45] thomas at dell was researching the raid controller purchase for us
[16:41:47] off RT
[16:41:58] and when he left, i bet it got forgotten, i will email russ
[16:42:42] maplebed: I have to go out for a few hours but I'll be back; are you going to ask Aaron in the meantime when you see him?
[16:42:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:42:50] paravoid: yes.
[16:42:55] hrmm, my comment was unclear, its in rt, i fired off the email from there
[16:47:06] maplebed: :-)
[16:50:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.044 seconds
[16:59:10] Logged the message, Master
[17:00:21] Logged the message, Master
[17:02:35] PROBLEM - Host search23 is DOWN: PING CRITICAL - Packet loss = 100%
[17:08:11] awjr, how are we going to integrate ISO with CLRD and the API?
[17:09:47] RECOVERY - Host search23 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[17:10:08] MaxSem: that is up to us to figure out - i imagine having one class/script that we can include in the API to fetch information from cldr/iso-3166-2. we'll probably need to extract the cldr data from the cldr extension since we can't really use it as-is w/o mediawiki (at least not easily)
[17:10:58] we're not going to add them to erfgoed's main repo?
[17:11:34] no we not
[17:11:46] !i've got an idea!
[17:11:52] MaxSem we could
[17:12:35] MaxSem: should ew move this convo back to #wikimedia-mobile?
[17:13:10] oh, fail [17:13:19] :p [17:13:43] too many channels [17:23:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:38] !log moving apache logs back to fluorine [17:23:47] Logged the message, RobH [17:25:58] New patchset: RobH; "moving apache logging back to fluorine" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20461 [17:26:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20461 [17:26:56] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20461 [17:28:32] AaronSchulz: do you mind if we do the swift external reads on monday instead? then upgrade the cluster on tuesday? [17:29:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.150 seconds [17:32:51] ping AaronSchulz ^^^^ [17:34:02] sounds fine [17:36:56] PROBLEM - Lucene on search23 is CRITICAL: Connection refused [17:40:32] RECOVERY - Apache HTTP on srv281 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.530 second response time [17:45:32] maplebed: what's ms-be6 up to? [17:47:46] oomph. It had failing disks yesterday... I'll look. [17:48:06] !log apache logs confirmed hitting fluorine, all will transition over as they puppet update [17:48:15] Logged the message, RobH [17:48:21] !log once all apaches update, will migrate old logs from nfs to fluorine [17:48:31] Logged the message, RobH [17:54:27] broken. [17:54:33] load 200, cpu 100% [17:54:41] I haven't logged in yet to see wtf's going on. [17:55:37] I've got to wrap up one thing then I'll look at it. [17:55:55] it looked fine when I left last night... ::sigh:: (minus sdf dying of course.) [17:56:25] yeah, just fine. [17:56:30] they looked healthy [17:56:37] I reformatted them etc. and they seemed just fine. 
[18:01:59] RECOVERY - Lucene on search23 is OK: TCP OK - 0.001 second response time on port 8123 [18:03:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:03:27] cmjohnson1: omg. search32 is still up. this is amazing. I don't even know what to think anymore. black is white, coke is pepsi. it's crazy [18:03:30] thank you! [18:10:20] New patchset: Bhartshorne; "changing copper swift cluster to not shard containers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20468 [18:11:03] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20468 [18:18:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.022 seconds [18:31:24] lunchtime [18:34:50] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [18:39:36] New review: Bhartshorne; "setting it back to sharding containers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20468 [18:39:57] New patchset: Bhartshorne; "Revert "changing copper swift cluster to not shard containers"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20469 [18:40:42] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20469 [18:49:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:54:59] AaronSchulz: tests on copper complete; it's working as expected (but was broken for a while) [19:01:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.161 seconds [19:06:05] AaronSchulz: I looked at the auth headers from 1.5 vs. 1.4.3 and the only difference I see is an additional header returned by 1.5: X-Timestamp. [19:24:49] hey guys, i think i'm doing something dumb [19:24:58] how do I access the mgmt interface on a R310? [19:25:01] just ssh, right? 
[19:25:03] ssh analytics1023.mgmt.eqiad.wmnet [19:25:05] ? [19:26:04] hm, i'm having key problems then [19:26:09] OH [19:26:10] no the pw [19:26:11] i know it [19:26:12] doh [19:26:22] doh, of course [19:26:35] was thinking it was going to magically use my key somehow [19:26:37] thank you! [19:36:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.258 seconds [19:47:44] New patchset: Ottomata; "Adding analytics1023-1027 host entries." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20477 [19:48:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20477 [19:49:28] notpeter, wouldya do me a kindness? ^ [19:53:31] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [19:53:31] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [19:53:31] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [19:53:31] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [19:54:19] or maplebed could gimme a little sweet approvin'? [19:54:27] https://gerrit.wikimedia.org/r/20477 [19:59:40] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [20:21:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:53] hey, i have a couple of packages that are ready to be uploaded to the repo, but i don't have reprepro perms. anyone have a moment to spare? 
[20:31:54] they're ready to go in /home/olivneh/lucid on fenari [20:32:06] New patchset: Cmjohnson; "adding new public key for chris johnson" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20483 [20:32:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20483 [20:36:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [20:42:15] cmjohnson1: was afk, will do now [20:42:51] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20483 [20:43:05] ottomata: ok [20:43:16] danke! [20:43:17] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20477 [20:43:22] mergin' your stuffs now [20:44:40] dankkkkeeee [20:56:11] AaronSchulz: I see a bunch of containers like aaron-ubuntu-wiki-local-{deleted,public,thumb}.## on copper, so they are getting created on demand. [20:57:04] * AaronSchulz fixes a CF bug in error handling while at it [20:58:04] AaronSchulz: interestingly enough, of the 13 -public shards, 7 of them have no objects in them. [20:58:10] does that make sense given your tests? [20:59:20] it's possible if I deleted files [21:06:45] New patchset: Asher; "updating versions to match new vumi packages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20487 [21:06:51] preilly: ^^ [21:07:31] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20487 [21:09:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:14:15] AaronSchulz: any idea what it means to get a 404 on a PUT? [21:15:08] no [21:15:40] I see at least two examples of [21:16:50] an object in the deleted container getting 4 HEADs in rapid succession (all returning 404) followed by a PUT that also gets a 404. 
[21:17:33] slipped in between are some HEADs that get a 200 to an public.##/archive/ URL [21:17:33] my debug log shows 4 404s for getFileStat [21:21:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.574 seconds [21:28:58] AaronSchulz: though I certainly see a bunch of chatter (repeated requests and such), I don't see anything that looks like it's failing when it shouldn't. [21:41:58] anyone have a moment to help me upload a couple of packages to the repo? they're ready to go [21:42:08] New patchset: preilly; "Vumi production settings" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20489 [21:42:40] binasher: ^^ I forgot the $passwords::mobile::vumi::…. stuff [21:42:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20489 [21:48:49] New patchset: preilly; "Vumi production settings" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20489 [21:48:56] binasher: ^^ [21:49:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20489 [21:50:41] preilly: " (Reply for more, price 50p)" that's kinda pricey, [21:50:53] binasher: yeah [21:51:13] i dunno if i can merge it.. seems exploitive [21:53:56] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20489 [21:54:32] maplebed: can you try upgrading the other two? [21:54:58] sure. [21:55:03] are you still seeing issues? [21:55:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:55:31] yeah [21:56:14] do you want me to pastebin logs? [21:56:23] or come over here and look at them? [21:57:18] pastebin for now I guess [21:59:16] New patchset: Ryan Lane; "Point eqiad nova keystone references to eqiad keystone config, and fix auth IP for keystone pmtpa." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/20492 [21:59:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20492 [22:00:15] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20492 [22:00:44] !log rebuilding wikitech-l archives [22:00:59] Logged the message, Master [22:00:59] !log rebuilding wikitech wiki l18n cache [22:01:09] AaronSchulz: all three are upgraded. [22:01:12] Logged the message, Master [22:01:36] AaronSchulz: do you want the trunkwiki logs? [22:01:43] meh, l10n :p [22:01:51] if there are not too big ;) [22:02:05] oomph. [22:02:11] there are a lot of the unittest logs. [22:02:12] *they [22:02:21] heh [22:02:39] I can't pastebin these. [22:03:53] New patchset: preilly; "remove unneeded line" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20494 [22:04:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20494 [22:04:54] New patchset: Ryan Lane; "Setting keystone service address too" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20495 [22:05:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20495 [22:05:50] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20495 [22:06:42] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20494 [22:08:18] Error undeleting file: Could not copy file "mwstore://localSwift/local-deleted/q/3/q3smvdrv9y4qg0l6ne2id1iv2o21bkg.jpg" to "mwstore://localSwift/local-public/archive/5/57/20120817203819!Penguins.jpg". 
[22:08:25] raarr [22:08:46] it's too hot for penguins now [22:10:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.508 seconds [22:16:06] !log rebooting/reimaging zhen and silver for upgrade to precise [22:16:16] Logged the message, notpeter [22:19:35] AaronSchulz: I see an object tombstone where you're trying to put the archive...Penguins file [22:19:37] PROBLEM - Host silver is DOWN: PING CRITICAL - Packet loss = 100% [22:21:16] PROBLEM - Host zhen is DOWN: CRITICAL - Host Unreachable (208.80.152.140) [22:25:10] RECOVERY - Host silver is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [22:26:49] RECOVERY - Host zhen is UP: PING OK - Packet loss = 0%, RTA = 1.95 ms [22:28:46] PROBLEM - SSH on silver is CRITICAL: Connection refused [22:31:10] PROBLEM - SSH on zhen is CRITICAL: Connection refused [22:34:03] AaronSchulz: could you re-run the test? I kicked memcached to make sure it wasn't caching something about container or accounts. [22:34:21] the swift folks say that PUT returning a 404 means the container doesn't exist. [22:34:35] which is weird, but sometimes the response is cached. [22:35:40] Wouldn't a 400 be more suited to that... [22:37:01] the container not existing != "malformed syntax", so I'd say no, 400 wouldn't be right. [22:38:18] * AaronSchulz wonders if his bash terminal is wonk [22:39:34] PROBLEM - Host zhen is DOWN: CRITICAL - Host Unreachable (208.80.152.140) [22:40:01] RECOVERY - SSH on zhen is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:40:10] RECOVERY - Host zhen is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [22:43:37] RECOVERY - SSH on silver is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:44:09] AaronSchulz: can you re-run the unit tests? [22:44:13] I restarted memcached. 
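[editor's note: the "PUT returns 404 when the container doesn't exist" behavior discussed above implies a create-container-and-retry pattern on the client side. This is a minimal sketch of that pattern; the client interface and container names here are hypothetical illustrations, not the real python-swiftclient API or MediaWiki's actual SwiftFileBackend code.]

```python
class ContainerMissing(Exception):
    """Raised when an object PUT comes back 404 (container not found)."""

def put_with_retry(client, container, name, data):
    """PUT an object; on a 404, create the container and retry once.

    A stale cached "container does not exist" response can also surface
    as a 404 here, so creating the container (a no-op if it already
    exists) before retrying covers both cases.
    """
    try:
        return client.put_object(container, name, data)
    except ContainerMissing:
        client.put_container(container)
        return client.put_object(container, name, data)

class FakeSwift:
    """Tiny in-memory stand-in used only to exercise the retry logic."""
    def __init__(self):
        self.containers = {}
    def put_container(self, container):
        self.containers.setdefault(container, {})
    def put_object(self, container, name, data):
        if container not in self.containers:
            raise ContainerMissing(container)
        self.containers[container][name] = data
        return 201  # Created
```

[a real backend would distinguish a 404 on the object path from other errors before retrying; the fake client above only models the container-missing case.]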
[22:45:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:45:37] gah, I'd need a minute [22:53:29] AaronSchulz: are you sure that the unit tests pass when aimed at a fully operational 1.4.3 cluster? [22:53:34] and that we're not chasing ghosts? [22:53:44] I may have found a fix [22:53:50] ah? [22:53:54] details? [22:54:21] tests are still running [22:55:05] AaronSchulz: btw, in http://pastebin.com/nGhPuMy8 the third URL may be for a different container due to the URL encoding. [22:55:06] gah, a few minor warnings from my hacks [22:55:19] it looks like a breaking change in the swift api [22:55:49] details? [22:56:25] http://docs.openstack.org/api/openstack-object-storage/1.0/content/copy-object.html [22:56:32] only X-COPY-FROM works [22:56:48] Destination: // is broken [22:57:52] RECOVERY - Puppet freshness on virt1000 is OK: puppet ran at Fri Aug 17 22:57:39 UTC 2012 [22:58:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.045 seconds [22:59:05] maplebed: hi [22:59:15] welcome! [22:59:30] repeating AaronSchulz's comment above: [22:59:41] [3:55 PM] it looks like a breaking change in the swift api [22:59:42] [3:56 PM] http://docs.openstack.org/api/openstack-object-storage/1.0/content/copy-object.html [22:59:42] [3:56 PM] only X-COPY-FROM works [22:59:42] [3:56 PM] Destination: // is broken [23:00:07] wrong channel? :) [23:00:23] nope. notmyname just joined. the echo was for him. [23:00:30] I see [23:00:51] AaronSchulz: "breaking change". so it used to work and now it doesn't? 
[23:01:05] right, at least it worked in 1.4.3 [23:01:23] checking [23:01:23] maplebed: I'll just make a fix to our CF extension [23:01:41] I prefer headers with X- anyway :) [23:01:49] ;) [23:02:48] PROBLEM - NTP on zhen is CRITICAL: NTP CRITICAL: No response from NTP server [23:04:42] maplebed: hmm wait [23:04:45] PROBLEM - mysqld processes on es1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:04:50] now I'm getting failures again [23:04:58] AaronSchulz: looks like the test was originally written in sept. 2010. last time it was modified was june 2011. 1.4.3 was september 2011 [23:05:03] maybe phpunit just aborted out on some random warning without showing the failure [23:05:12] PROBLEM - mysqld processes on es1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:05:12] PROBLEM - mysqld processes on es1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:05:30] PROBLEM - mysqld processes on es1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:05:30] PROBLEM - mysqld processes on es1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:05:33] I'm not getting warm fuzzies about these unit tests. [23:05:52] but that would explain why it started working with manual tests [23:06:06] PROBLEM - mysqld processes on es1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:07:18] PROBLEM - NTP on silver is CRITICAL: NTP CRITICAL: No response from NTP server [23:08:41] maplebed: maybe it's not Destination: then since I'm getting the same problem again undeleted a file [23:08:54] :( [23:09:11] ok, so your manual tests work but the unit tests don't? [23:10:31] neither works now [23:10:59] oomph. [23:11:33] this worked a few weeks ago when Jan was talking about it :/ [23:11:34] notmyname: thanks for swinging by, but it sounds like we've got some shit to work through before we can say anything about what's actually broken. 
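[editor's note: the two server-side copy forms being compared above, per the OpenStack object-copy doc linked in the log, are a PUT naming the source in X-Copy-From versus a COPY naming the target in Destination. This sketch only builds the method, path, and headers for each form so the difference is concrete; no cluster is contacted, and the account/container/object names are illustrative.]

```python
def copy_via_put(dst_container, dst_object, src_container, src_object):
    """PUT to the destination, naming the source in X-Copy-From
    (the form reported above as still working against swift 1.5)."""
    return (
        "PUT",
        "/v1/AUTH_test/%s/%s" % (dst_container, dst_object),
        {"X-Copy-From": "/%s/%s" % (src_container, src_object),
         "Content-Length": "0"},
    )

def copy_via_copy(src_container, src_object, dst_container, dst_object):
    """COPY the source, naming the target in the Destination header
    (the form reported broken above, though notmyname could not
    reproduce it on his SAIO)."""
    return (
        "COPY",
        "/v1/AUTH_test/%s/%s" % (src_container, src_object),
        {"Destination": "/%s/%s" % (dst_container, dst_object)},
    )
```

[both header values take a "/container/object" path; the asymmetry is only in which end of the copy the request URL names.]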
[23:11:49] maplebed: can you downgrade it? [23:12:02] AaronSchulz: just one host or all three? [23:12:13] just copper [23:12:48] lemme try. [23:13:06] it's a pain going back and forth though [23:13:10] are you sure you want it downgraded? [23:13:33] maplebed: ok. cool. find me if you need me. FWIW, I can use both PUT+X-Copy-From and COPY+Destination on my SAIO just fine [23:13:42] ok. [23:15:09] New patchset: Pyoungmeister; "removing dupe def of vumi config file." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20502 [23:15:26] maplebed: if it's not a huge pain [23:15:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20502 [23:16:20] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20502 [23:16:36] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [23:17:58] !log added cn=essextest project to labs LDAP for essex testing. this project isn't compatible with diablo [23:18:08] Logged the message, Master [23:18:44] AaronSchulz: copper is downgraded. [23:19:00] * AaronSchulz starts the tests [23:21:06] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [23:21:08] AaronSchulz: btw, since copper is now downgraded, you can point your MW instance at it to get 1.4.3 and at magnesium or zinc to get 1.5, all backed by the same cluster. [23:21:22] so to re-test 1.5 stuff I won't have to upgrade copper. 
[23:23:01] yeah, I was thinking about that [23:23:13] seems like its still failing [23:25:27] RECOVERY - NTP on zhen is OK: NTP OK: Offset -0.0864289999 secs [23:25:45] RECOVERY - NTP on silver is OK: NTP OK: Offset -0.1150801182 secs [23:30:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:38:10] New patchset: Pyoungmeister; "upping required versions for mobile::vumi class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20505 [23:38:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20505 [23:40:49] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20505 [23:42:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.724 seconds