[00:01:14] Ryan_Lane: if i'm logged in, can i do that via Special:UserLogin&type=signup ?
[00:01:19] yep
[00:04:34] New patchset: Ryan Lane; "Restrict a region to its own services" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20249
[00:05:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20249
[00:06:44] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20249
[00:18:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:20:29] New patchset: Ori.livneh; "Event logging via varnishlog over ZeroMQ" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20039
[00:21:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20039
[00:25:19] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20039
[00:29:55] AaronSchulz: it's not just the image scalers.
[00:31:10] most of the apaches?
[00:32:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.225 seconds
[00:32:11] AaronSchulz: sorting by the most hits (over the last 1000 lines), no image scaler even makes the list.
[00:32:22] sorry, no image scaler makes the top 60.
[00:35:04] most of the GETs seems to be on other wikis getting local copies of commons files
[00:35:07] top 70 list is a range from 24 hits to 193 hits (I'm only counting GETs)
[00:35:19] (of the last 1000 lines of log)
[00:35:34] * AaronSchulz adds todo list entries to beef up some profiling
[00:38:12] New patchset: Ryan Lane; "Fix naming of virt hosts in eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20250
[00:38:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20250
[00:39:04] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20250
[00:39:36] binasher: I merged the zeromq change
[00:41:53] New patchset: Dzahn; "load mod_rewrite" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20251
[00:42:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20251
[00:42:38] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20251
[00:46:35] New patchset: Ryan Lane; "Disabling live-migration configuration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20252
[00:47:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20252
[00:47:42] New patchset: Dzahn; "duh, fix syntax error in dependency definition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20253
[00:48:22] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20253
[00:48:22] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20252
[01:04:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:16:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.377 seconds
[01:42:20] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 270 seconds
[01:43:23] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 241 seconds
[01:50:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:50:44] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 682s
[01:55:41] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds
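The "top hits over the last 1000 lines, GETs only" tally described at [00:32:11]-[00:35:19] can be sketched as a standard shell pipeline. This is a hypothetical reconstruction, not the command actually run: the two-column log format (method, then URL) and the sample file path are stand-ins for illustration, and the field numbers would differ for a real Apache log.

```shell
# Hypothetical sketch of the tally discussed above: take the last 1000
# lines of a request log, keep only GETs, and rank URLs by hit count.
# The log format here (method followed by URL) is a stand-in.
printf '%s\n' \
  'GET /wiki/Foo' \
  'GET /wiki/Foo' \
  'GET /wiki/Bar' \
  'POST /w/index.php' > /tmp/requests.sample

tail -n 1000 /tmp/requests.sample \
  | awk '$1 == "GET" { print $2 }' \
  | sort | uniq -c | sort -rn | head -70
```

With the sample input above, the pipeline ranks /wiki/Foo first with 2 hits.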
[02:00:56] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 0 seconds
[02:01:14] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 12s
[02:02:30] New patchset: Ryan Lane; "Add policy.json to restrict actions to proper roles" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20259
[02:03:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20259
[02:03:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.708 seconds
[02:08:15] New patchset: Ryan Lane; "Adding in patched code for keystone" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20260
[02:08:56] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20259
[02:08:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20260
[02:09:02] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20260
[02:16:45] New patchset: Ryan Lane; "Block keystone ports, but open them to needed services" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20261
[02:17:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20261
[02:17:35] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20261
[02:20:34] New patchset: Ryan Lane; "Add keystone_service and keystone_admin protocols" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20262
[02:21:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20262
[02:23:05] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20262
[02:36:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:37:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds
[03:15:17] PROBLEM - mysqld processes on es8 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[03:15:35] PROBLEM - mysqld processes on es7 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[03:16:11] PROBLEM - mysqld processes on es6 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[03:16:29] PROBLEM - mysqld processes on es5 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[04:46:10] RECOVERY - LDAPS on virt0 is OK: TCP OK - 0.004 second response time on port 636
[04:46:19] RECOVERY - LDAP on virt0 is OK: TCP OK - 0.001 second response time on port 389
[05:23:12] New patchset: Ryan Lane; "Add tenant_name and user_name attribute config to keystone" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20267
[05:23:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20267
[05:24:00] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20267
[05:31:08] Ryan_Lane: cd /srv/org/wikimedia/controller/wikis/1.20wmf2 this fails because it can't cd to the dir...
[05:31:22] yeah
[05:31:30] I haven't added mediawiki yet
[05:31:47] ah ha
[05:32:25] I thought maybe it was a version thing. nm then
[05:47:49] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours
[05:47:49] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours
[05:47:49] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours
[05:47:49] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours
[05:48:52] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours
[05:48:52] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[05:48:52] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[05:48:52] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[05:48:52] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[05:48:52] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[05:48:52] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[05:48:53] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[05:48:53] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[08:00:03] good morning
[08:15:51] s/good//
[08:16:06] why? :)
[08:17:02] mornings are never good
[08:17:07] that is why we sleep through them!
[08:44:40] PROBLEM - swift-object-replicator on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:45:34] PROBLEM - swift-container-updater on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:45:43] PROBLEM - swift-account-auditor on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:45:52] PROBLEM - swift-account-reaper on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:46:10] PROBLEM - swift-account-replicator on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:46:10] PROBLEM - swift-container-replicator on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:46:10] PROBLEM - swift-container-server on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:46:19] PROBLEM - swift-object-server on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:46:55] PROBLEM - swift-container-auditor on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:46:55] PROBLEM - swift-object-auditor on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:47:04] PROBLEM - swift-object-updater on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:47:04] PROBLEM - swift-account-server on ms-be6 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:55:37] PROBLEM - SSH on ms-be6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:58:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:59:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.061 seconds
[09:32:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:36:56] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.025 seconds
[09:50:44] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours
[09:50:44] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours
[09:50:44] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[09:50:44] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[09:58:41] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[10:10:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:20:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds
[10:53:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:05:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.654 seconds
[11:39:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:51:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.508 seconds
[12:02:06] New patchset: Mark Bergsma; "Rename misc::deployment-host to ::deployment, merge misc::scripts and password scripts into it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20290
[12:02:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20290
[12:05:58] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20290
[12:14:50] New patchset: Mark Bergsma; "Rename misc::l10nupdate into misc::deployment::l10nupdate" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20291
[12:15:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20291
[12:16:52] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20291
[12:23:27] New patchset: Mark Bergsma; "Move misc::translationnotifications into misc::maintenance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20292
[12:24:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20292
[12:24:30] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20292
[12:25:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:27:53] New patchset: Mark Bergsma; "Add FIXMEs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20293
[12:28:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20293
[12:29:15] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20293
[12:40:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.047 seconds
[12:53:49] hi chris
[12:54:02] would you be able to run some extra uplinks for the row D sdtpa switches today?
[12:54:40] hi mark: yes
[12:54:50] good
[12:54:56] I think we're gonna give all 3 racks an extra gige uplink
[12:54:58] so 3 new runs
[12:55:05] i believe those are to either rack C1 or B1
[12:55:32] but perhaps it's good to verify that before you start
[12:55:59] probably a good idea
[13:13:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:24:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.545 seconds
[13:34:42] !log moving apache logs from pushing to fluorine back to nfs1 for now
[13:34:52] Logged the message, RobH
[13:45:41] New patchset: RobH; "moving apache logs back to nfs1 from fluorine (temp)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20447
[13:46:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20447
[13:52:39] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20447
[13:59:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:11:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.697 seconds
[14:16:47] PROBLEM - Apache HTTP on mw44 is CRITICAL: Connection refused
[14:38:38] cmjohnson1: so... new uplinks should be run to the panel that terminates the first half of line card 11 on csw1-sdtpa
[14:38:43] is probably rack C1, I think
[14:42:43] ok
[14:42:57] then for each rack d1, d2, d3, run patches to that
[14:42:59] don't plug em in yet
[14:45:24] mark: btw, if it's an "optimization" and not a rollback, then we still have some room
[14:45:31] although 2xGbE couldn't hurt either
[14:45:35] yeah I figured
[14:45:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:53:39] !log Setup static 802.3ad trunk on asw-d3-sdtpa with one port up
[14:53:48] Logged the message, Master
[14:57:25] perhaps we should do away with asw-d3-sdtpa now we have an EX4500 in there
[14:57:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.237 seconds
[15:00:26] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time
[15:02:06] New patchset: Ottomata; "Adding Wikipedia Zero filter for Tata India." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20454
[15:02:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20454
[15:03:20] !log all log traffic appears to have moved back to nfs1, copying fluorine logs back to nfs for reformat
[15:03:29] Logged the message, RobH
[15:03:36] hi guys, would one of you kind opsies gimme a little +2 on that one please?
[15:03:40] https://gerrit.wikimedia.org/r/#/c/20454/
[15:04:14] !log all data copied, all traffic moved, fluorine coming down for reinstallation into internal vlan and lvs restructure
[15:04:23] !log lvm restructure
[15:04:24] Logged the message, RobH
[15:04:32] Logged the message, RobH
[15:04:42] and precise please
[15:04:50] mark: Would you be so kind as to please move its network port, it can die now
[15:04:55] i am pulling up dns to change it there
[15:04:56] yeah
[15:04:59] thanks!
[15:04:59] do you happen to know the row?
[15:05:03] will find out now
[15:05:06] i thnk a
[15:05:14] yep, a4-eqiad
[15:08:24] done
[15:09:17] PROBLEM - Host fluorine is DOWN: PING CRITICAL - Packet loss = 100%
[15:09:28] heh, no duh... hrmm, i guess i need to put in decomm
[15:09:34] and run on nagios once before pushing back in service?
[15:09:44] (nagios will have fluorine as its external fqdn)
[15:10:15] ideally
[15:11:14] !log authdns-update for fluorine ip update
[15:11:23] Logged the message, RobH
[15:12:51] New patchset: RobH; "decommissioning fluorine to pull from nagios, as its ip is changing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20455
[15:13:22] now i wish we setup decommissioning.pp to use fqdn.
[15:13:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20455
[15:14:01] New review: RobH; "never self review, also do as i say, not as i do" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/20455
[15:14:02] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20455
[15:15:03] mark: quick q (yes, another one): you were working on exim, correct?
[15:15:09] new smarthosts?
[15:15:11] yeah but waiting on new misc servers
[15:15:16] yeah, too
[15:16:10] as previously discussed, it'd be nice to have a root@ labs VMs be routed differently
[15:16:12] mark: https://rt.wikimedia.org/Ticket/Display.html?id=3278 has erik's approval
[15:16:20] but i kinda want yours on there too please, to confirm the final quote
[15:16:22] that was your task wasn't it
[15:16:24] on the labs relay
[15:16:29] not the production one
[15:16:36] we don't have a labs relay right now
[15:16:39] and I was coming to that
[15:16:40] yeah, fix that :)
[15:16:58] do you see any value in having a labs-specific relay rather than a 4-line router that does things a bit different for *.wmflabs?
[15:17:28] I can see some pros and cons, but I'm leaning towards "same systems"
[15:17:41] to avoid infrastructure duplication
[15:18:13] the con I'm seeing is that on a possible mega-spamming, our production queues will get filled up with junk
[15:18:18] i don't want labs interminglued
[15:18:32] mingled
[15:18:36] * RobH kicks puppet on spence
[15:18:40] this will take a bit.
[15:18:46] a labs relay can do specific things for labs more easily than trying to combine it with production
[15:19:04] what are you thinking exactly?
[15:19:32] automatically routing root@ to the labs project owner's mail address or something
[15:20:11] well yeah, other than that I mean
[15:20:20] that's a given :)
[15:20:20] other than that not much
[15:20:22] but that's enough
[15:20:33] that's basically a 4 line router or something
[15:20:43] which depends on labs infrastructure, which I don't like
[15:20:57] i don't want production email affected by labs infra outages
[15:21:18] hrm
[15:21:21] i see no reason NOT to have a separate labs relay really
[15:21:32] I'm not a big fan of this great divide and that goes both ways, but okay
[15:22:27] also
[15:22:33] outgoing mail should come from a labs ip
[15:22:40] if some instance gets used for spamming or so
[15:22:49] i don't want those outgoing mails coming from a production ip
[15:22:55] unfortunately it's all one ip prefix right now
[15:22:58] but still
[15:23:26] yeah, still, prefix != host for 95% of RBLs
[15:23:34] that's a good point.
[15:23:54] we send wiki mails from a separate ip too now, even on the same relay
[15:23:58] so that's possible, but meh
[15:24:07] just make it seperate and be done with it
[15:25:21] I guess I'll run it on virt0/virt1000, it'd suck having extra machines just for that :)
[15:25:42] yeah
[15:25:44] no need
[15:25:44] are you going to revamp the puppet class(es) for exim?
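The "4 line router" floated in this exchange could, as a rough sketch, look like the following Exim configuration. Everything here is an assumption drawn from the conversation, not a deployed config: the router name, the *.wmflabs domain match, and the blackhole action are illustrative only.

```
# Hypothetical sketch of the "4 line router" idea: handle mail for labs
# instance domains separately (here, simply discarding it). In a real
# setup the data could instead be an LDAP lookup of the project owner.
labs_devnull:
  driver = redirect
  domains = *.wmflabs
  data = :blackhole:
```

A redirect router with `data = :blackhole:` accepts and silently discards the message; swapping the data for a lookup expansion would give the "route root@ to the project owner" variant instead.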
[15:25:49] not much
[15:25:51] if so, I should probably wait for that
[15:25:59] a lot has been done already
[15:26:35] okay, I'll work with them then
[15:26:36] thanks :)
[15:27:09] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Puppet has not run in the last 10 hours
[15:29:04] currently there are two classes really
[15:29:08] or more
[15:29:16] but one is for the simple exim instance on all servers
[15:29:26] and then there's one for our actual mail servers
[15:29:37] the latter one was redone by peter not long ago
[15:29:39] I'll figure it out
[15:29:54] and feel free to suggest changes of course
[15:29:56] I have other things on the top of my list
[15:30:00] but I'm so fed up with the cronspam
[15:30:00] sure
[15:30:02] me too
[15:30:14] we could devnull it already hehe
[15:30:29] that's what I initially thought, put a router to silence it since noone looks at it anyway
[15:30:37] i'd be fine with that
[15:30:42] but then I thought, devnulling it and putting an ldap query is like a one-line diff
[15:30:48] hence my original question :)
[15:30:54] don't like that ;)
[15:31:10] just check for labs instance ip space and devnull that
[15:31:44] !log puppet is taking forever on nagios, as soon as it completes, will remove fluorine from decom and rerun
[15:31:46] so slow.
[15:31:54] Logged the message, RobH
[15:32:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:33:49] of course the main cronspam right now wouldn't be affected by it
[15:33:53] as virt1000 is production ;)
[15:34:00] I know
[15:35:30] I'm happy you don't want to segregate that too :)
[15:35:56] no
[15:36:08] we haven't been very consistent on the dependency on labs' ldap though
[15:36:08] New patchset: RobH; "fluorine moved from wikimedia.org to eqiad.wmnet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20456
[15:36:14] it has never got into anything mission critical
[15:36:19] but we use it for e.g. gerrit
[15:36:22] i know
[15:36:23] or asher's db tools
[15:36:35] and that's about as far as it'll get
[15:36:41] and I've heard there are plans to merge wikitech with labsconsole
[15:36:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20456
[15:36:52] i don't like that
[15:36:53] which starts to get a bit mission critical
[15:38:02] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20456
[15:38:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.322 seconds
[15:44:35] !log Created static 802.3ad trunks on asw-d1-sdtpa and asw-d2-sdtpa for uplinks
[15:44:44] Logged the message, Master
[15:47:22] !log Created static 802.3ad trunks on csw1-sdtpa for asw-d1-sdtpa to asw-d3-sdtpa uplinks, deployed with only the existing port enabled
[15:47:31] Logged the message, Master
[15:48:13] damn it, i keep forgetting one file or something for this damned reinstall
[15:48:21] atleast spence is finally running service updates
[15:49:24] New patchset: RobH; "fluorine dhcpd update" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20457
[15:49:50] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours
[15:49:50] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours
[15:49:50] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[15:49:50] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours
[15:49:50] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours
[15:49:50] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[15:49:51] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[15:49:52] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours
[15:49:52] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[15:49:52] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[15:49:53] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[15:49:53] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[15:49:54] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[15:50:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20457
[15:51:07] New patchset: RobH; "fluorine dhcpd update & recomed from decom file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20457
[15:51:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20457
[15:53:08] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20457
[15:54:25] !log fluorine now coming down for reinstall, all log files backed up to nfs
[15:54:34] Logged the message, RobH
[16:06:41] paravoid or notpeter, maybe you two have some input
[16:06:49] im reimaging fluorine, and it has 2.4T for an lvm
[16:06:55] im goign to use 2.0 of that
[16:07:15] but its apache log files, any reason not to use ext3/4 (you can shrink a lvm with that filesystem right?)
[16:07:52] since its not db driven, i am not sure if using xfs for the filesystem is preferred in this instance.
[16:08:13] shrinking fs is not a very good idea generally
[16:08:18] even with ext3/4 you can't do it online
[16:08:22] right, but its atleast possible
[16:08:22] but that doesn't work with xfs either
[16:08:32] yes
[16:08:35] xfs can only grow
[16:08:39] i dunno, just trying to take the approach of keeping the lvm as dynamic as possible.
[16:08:42] Anything is possible :)
[16:08:43] ext3/4 can grow online and shrink offline
[16:08:46] seems like maybe i should just ext3 it.
[16:08:55] maplebed: good morning!
[16:09:13] back
[16:09:15] first of all
[16:09:21] dns for 10.x is currently offline
[16:09:26] i think the last dns update was rob
[16:09:29] can you figure out what's wrong?
[16:09:35] it dug ok for me, wtf, sorry, will check
[16:09:36] my bad
[16:09:53] RobH: use xfs
[16:10:04] ok
[16:10:16] xfs is reliable, fast, and good with large files
[16:10:23] cmjohnson1: awesome
[16:10:33] you can plug them all in in port 0/1/47 of the 3 switches
[16:10:41] so the port before the existing uplink
[16:10:46] and then i'll give you the port assignments on the other end
[16:11:06] they are disabled
[16:11:10] so don't be surprised when they don't come up
[16:11:12] xfs has been very unreliable for me and particuarly in power failures, but this was years ago
[16:11:22] as you once found out, we have to be very careful with these uplinks and creating loops
[16:11:40] paravoid: yeah it used to be not so great
[16:11:55] RobH: you can?
[16:11:44] mark: What do you mean its offline? I can dig fqdn in the 10.address space against ns0/1/2?
[16:11:59] we seem to use it around here a lot with no problems, so it must have gone better
[16:12:17] getting SERVFAILs here
[16:12:23] maplebed: ping?
[16:12:24] dig @ns0.wikimedia.org fluorine.eqiad.wmnet ?
[16:12:31] no, i said 10.x
[16:12:31] morning!
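The grow/shrink behavior being debated above (ext3/4 can grow online and shrink offline; xfs can only grow) translates into a fairly standard LVM command sequence. The volume group, LV, and mount point names below are made up for illustration and the sizes are arbitrary; this is a sketch of the commands involved, not anything taken from fluorine itself.

```
# Growing is safe online for both filesystems (hypothetical names):
lvextend -L +200G /dev/vg0/aplogs
resize2fs /dev/vg0/aplogs          # ext3/4: grows while mounted
xfs_growfs /mnt/aplogs             # xfs: also grows while mounted

# Shrinking is ext3/4-only, and offline-only:
umount /mnt/aplogs
e2fsck -f /dev/vg0/aplogs          # fsck is required before a shrink
resize2fs /dev/vg0/aplogs 1500G    # shrink the filesystem first...
lvreduce -L 1500G /dev/vg0/aplogs  # ...then the LV; xfs has no equivalent
```

The ordering matters: shrink the filesystem before the logical volume, and grow the logical volume before the filesystem, or data past the boundary is lost.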
[16:12:35] different zone
[16:13:02] dig @ns0.wikimedia.org 15.3.0.10.in-addr.arpa
[16:13:03] i see my issue
[16:13:07] i had a typo =P
[16:13:17] or dig -x @ns0.wikimedia.org 10.0.3.15 :)
[16:13:22] so when you do authdns-update, check whether it actually loads the zone without issues
[16:13:34] duly noted
[16:13:41] you got lucky here
[16:13:46] we all did
[16:13:47] nothing much seems to be affected by it
[16:13:57] well rob likes to say he's never caused a dns outage
[16:14:00] my svn commit 'typo, im lucky nothing really broke'
[16:14:02] so ;)
[16:14:05] mark: still didnt, thats a hiccup.
[16:14:08] ;p
[16:14:26] in simply the best zone to cause a hiccup in.
[16:14:26] obviously doing this typo in the wikimedia.org zone would have been disastrous
[16:14:35] indeed, i live a charmed life.
[16:15:02] !log authdns-update to correct typo, nothing really broke badly, i r lucky
[16:15:12] Logged the message, RobH
[16:15:21] someone in the office make sure to point out my dns break to Ryan_Lane
[16:15:42] hi cmjohnson1, paravoid. happy friday!
[16:15:50] hi ben
[16:15:57] almost weekend here!
[16:16:02] so close!
[16:16:24] so did the swift migration finish yesterday?
[16:16:27] we weren't entirely sure
[16:16:30] cmjohnson1: yep
[16:16:32] sec
[16:16:37] yeah, that's what I wanted to ask you
[16:16:55] cmjohnson1: the second uplink for asw-d1-sdtpa will go to port 11/19
[16:16:56] mark: the thingc that's left to do is the squid acls to send upload.wm.org/originals to swift instead of ms7.
[16:17:00] (so probably port 19 on that patch panel)
[16:17:05] maplebed: have you seen the graph?
[16:17:17] something changed a few hours after I left
[16:17:20] errr...
[16:17:22] umm...
[16:17:25] huh.
[16:17:30] that happened after I left too.
[16:17:36] cmjohnson1: the second uplink for asw-d2-sdtpa will go to port 11/16
[16:17:40] maplebed: http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&hreg[]=^ms-fe[1-9].pmtpa&mreg[]=swift_[A-Z]%2B_hits%24&gtype=stack&title=Swift+queries+per+second&aggregate=1
[16:17:42] I know aaron was continuing to work on it when I left
[16:17:48] and the second uplink for asw-d3-sdtpa will go to port 11/17
[16:17:58] maplebed: http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&hreg[]=^ms-fe[1-9].pmtpa&mreg[]=swift_[0-9]%2B_hits__%24&mreg[]=swift_other_hits__%24&gtype=line&title=Swift+percentage+queries+by+status+code&aggregate=1
[16:18:01] etc.
[16:18:04] I had a chat with him just before taking off that the sheer volume felt wrong.
[16:18:22] but he's not here yet this morning.
[16:18:25] anything in the server log?
[16:18:27] so did any ops actually stay around/
[16:19:03] well, it was stable when we all left.
[16:19:11] so no, when night came we went home.
[16:19:14] cmjohnson1: what is it?
[16:19:25] port seems down
[16:20:14] the SAL shows aaron pushing a bunch of stuff at 03:00ish UTC.
[16:20:22] which corresponds to the change in the graph.
[16:20:56] mark: btw, when we get rid of hardy for dobson, we can install bind9utils and hook named-checkzone into authdns-update
[16:21:05] (I tried to do it now, but... hardy :/)
[16:21:19] paravoid: yeah
[16:22:14] maplebed: yes. the log is fairly cryptic, at least for us non-developers. I'd like to know more, but you obviously don't know either
[16:22:25] we'll ask him...
[16:25:49] ok
[16:27:35] can't tell, probably, it's disabled
[16:27:38] insert the other ones first ;)
[16:27:46] d3 will come up
[16:28:06] 11/17
[16:29:24] that's working
[16:29:30] going to enable the other two
[16:30:45] cool
[16:30:46] all working
[16:30:48] thanks!
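The bind9utils idea mentioned at [16:20:56] would amount to a pre-push guard in the update script along these lines; this is a hypothetical sketch only, and the zone name and file path are placeholders, not the real authdns layout.

```
# Hypothetical pre-push check for authdns-update: refuse to deploy a
# zone that named-checkzone cannot load (placeholder zone and path).
named-checkzone 10.in-addr.arpa /path/to/zones/10.in-addr.arpa || {
    echo "zone check failed; aborting authdns-update" >&2
    exit 1
}
```

With a guard like this in place, the typo'd 10.in-addr.arpa zone discussed earlier would have been rejected before it was ever served.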
[16:31:08] !log Enabled all 3 new 802.3ad aggregated links for asw-d1-sdtpa to asw-d3-sdtpa
[16:31:17] Logged the message, Master
[16:34:39] cmjohnson1: project2
[16:34:54] 11/15 Up Forward Full 1G None No level0 0012.f2c5.5600 project2
[16:34:57] is that still there?
[16:34:59] burn that.
[16:38:20] ah
[16:38:22] not setup yet then
[16:38:22] ok
[16:38:55] good
[16:41:35] in fact, why...oh damn it
[16:41:45] thomas at dell was researching the raid controller purchase for us
[16:41:47] off RT
[16:41:58] and when he left, i bet it got forgotten, i will email russ
[16:42:42] maplebed: I have to go out for a few hours but I'll be back; are you going to ask Aaron in the meantime when you see him?
[16:42:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:42:50] paravoid: yes.
[16:42:55] hrmm, my comment was unclear, its in rt, i fired off the email from there
[16:47:06] maplebed: :-)
[16:50:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.044 seconds
[16:59:10] Logged the message, Master
[17:00:21] Logged the message, Master
[17:02:35] PROBLEM - Host search23 is DOWN: PING CRITICAL - Packet loss = 100%
[17:08:11] awjr, how are we going to integrate ISO with CLRD and the API?
[17:09:47] RECOVERY - Host search23 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[17:10:08] MaxSem: that is up to us to figure out - i imagine having one class/script that we can include in the API to fetch information from cldr/iso-3166-2. we'll probably need to extract the cldr data from the cldr extension since we can't really use it as-is w/o mediawiki (at least not easily)
[17:10:58] we're not going to add them to erfgoed's main repo?
[17:11:34] no we not
[17:11:46] !i've got an idea!
[17:11:52] MaxSem we could
[17:12:35] MaxSem: should ew move this convo back to #wikimedia-mobile?
[17:13:10] oh, fail [17:13:19] :p [17:13:43] too many channels [17:23:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:38] !log moving apache logs back to fluorine [17:23:47] Logged the message, RobH [17:25:58] New patchset: RobH; "moving apache logging back to fluorine" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20461 [17:26:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20461 [17:26:56] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20461 [17:28:32] AaronSchulz: do you mind if we do the swift external reads on monday instead? then upgrade the cluster on tuesday? [17:29:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.150 seconds [17:32:51] ping AaronSchulz ^^^^ [17:34:02] sounds fine [17:36:56] PROBLEM - Lucene on search23 is CRITICAL: Connection refused [17:40:32] RECOVERY - Apache HTTP on srv281 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.530 second response time [17:45:32] maplebed: what's ms-be6 up to? [17:47:46] oomph. It had failing disks yesterday... I'll look. [17:48:06] !log apache logs confirmed hitting fluorine, all will transition over as they puppet update [17:48:15] Logged the message, RobH [17:48:21] !log once all apaches update, will migrate old logs from nfs to fluorine [17:48:31] Logged the message, RobH [17:54:27] broken. [17:54:33] load 200, cpu 100% [17:54:41] I haven't logged in yet to see wtf's going on. [17:55:37] I've got to wrap up one thing then I'll look at it. [17:55:55] it looked fine when I left last night... ::sigh:: (minus sdf dying of course.) [17:56:25] yeah, just fine. [17:56:30] they looked healthy [17:56:37] I reformatted them etc. and they seemed just fine. 
[18:01:59] RECOVERY - Lucene on search23 is OK: TCP OK - 0.001 second response time on port 8123 [18:03:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:03:27] cmjohnson1: omg. search32 is still up. this is amazing. I don't even know what to think anymore. black is white, coke is pepsi. it's crazy [18:03:30] thank you! [18:10:20] New patchset: Bhartshorne; "changing copper swift cluster to not shard containers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20468 [18:11:03] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20468 [18:18:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.022 seconds [18:31:24] lunchtime [18:34:50] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [18:39:36] New review: Bhartshorne; "setting it back to sharding containers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20468 [18:39:57] New patchset: Bhartshorne; "Revert "changing copper swift cluster to not shard containers"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20469 [18:40:42] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20469 [18:49:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:54:59] AaronSchulz: tests on copper complete; it's working as expected (but was broken for a while) [19:01:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.161 seconds [19:06:05] AaronSchulz: I looked at the auth headers from 1.5 vs. 1.4.3 and the only difference I see is an additional header returned by 1.5: X-Timestamp. [19:24:49] hey guys, i think i'm doing something dumb [19:24:58] how do I access the mgmt interface on a R310? [19:25:01] just ssh, right? 
[19:25:03] ssh analytics1023.mgmt.eqiad.wmnet [19:25:05] ? [19:26:04] hm, i'm having key problems then [19:26:09] OH [19:26:10] no the pw [19:26:11] i know it [19:26:12] doh [19:26:22] doh, of course [19:26:35] was thinking it was going to magically use my key somehow [19:26:37] thank you! [19:36:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.258 seconds [19:47:44] New patchset: Ottomata; "Adding analytics1023-1027 host entries." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20477 [19:48:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20477 [19:49:28] notpeter, wouldya do me a kindness? ^ [19:53:31] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [19:53:31] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [19:53:31] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [19:53:31] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [19:54:19] or maplebed could gimme a little sweet approvin'? [19:54:27] https://gerrit.wikimedia.org/r/20477 [19:59:40] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [20:21:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:30:53] hey, i have a couple of packages that are ready to be uploaded to the repo, but i don't have reprepro perms. anyone have a moment to spare? 
[20:31:54] they're ready to go in /home/olivneh/lucid on fenari [20:32:06] New patchset: Cmjohnson; "adding new public key for chris johnson" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20483 [20:32:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20483 [20:36:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [20:42:15] cmjohnson1: was afk, will do now [20:42:51] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20483 [20:43:05] ottomata: ok [20:43:16] danke! [20:43:17] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20477 [20:43:22] mergin' your stuffs now [20:44:40] dankkkkeeee [20:56:11] AaronSchulz: I see a bunch of containers like aaron-ubuntu-wiki-local-{deleted,public,thumb}.## on copper, so they are getting created on demand. [20:57:04] * AaronSchulz fixes a CF bug in error handling while at it [20:58:04] AaronSchulz: interestingly enough, of the 13 -public shards, 7 of them have no objects in them. [20:58:10] does that make sense given your tests? [20:59:20] it's possible if I deleted files [21:06:45] New patchset: Asher; "updating versions to match new vumi packages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20487 [21:06:51] preilly: ^^ [21:07:31] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20487 [21:09:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:14:15] AaronSchulz: any idea what it means to get a 404 on a PUT? [21:15:08] no [21:15:40] I see at least two examples of [21:16:50] an object in the deleted container getting 4 HEADs in rapid succession (all returning 404) followed by a PUT that also gets a 404. 
[21:17:33] slipped in between are some HEADs that get a 200 to an public.##/archive/ URL [21:17:33] my debug log shows 4 404s for getFileStat [21:21:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.574 seconds [21:28:58] AaronSchulz: though I certainly see a bunch of chatter (repeated requests and such), I don't see anything that looks like it's failing when it shouldn't. [21:41:58] anyone have a moment to help me upload a couple of packages to the repo? they're ready to go [21:42:08] New patchset: preilly; "Vumi production settings" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20489 [21:42:40] binasher: ^^ I forgot the $passwords::mobile::vumi::…. stuff [21:42:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20489 [21:48:49] New patchset: preilly; "Vumi production settings" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20489 [21:48:56] binasher: ^^ [21:49:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20489 [21:50:41] preilly: " (Reply for more, price 50p)" that's kinda pricey, [21:50:53] binasher: yeah [21:51:13] i dunno if i can merge it.. seems exploitive [21:53:56] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20489 [21:54:32] maplebed: can you try upgrading the other two? [21:54:58] sure. [21:55:03] are you still seeing issues? [21:55:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:55:31] yeah [21:56:14] do you want me to pastebin logs? [21:56:23] or come over here and look at them? [21:57:18] pastebin for now I guess [21:59:16] New patchset: Ryan Lane; "Point eqiad nova keystone references to eqiad keystone config, and fix auth IP for keystone pmtpa." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/20492 [21:59:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20492 [22:00:15] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20492 [22:00:44] !log rebuilding wikitech-l archives [22:00:59] Logged the message, Master [22:00:59] !log rebuilding wikitech wiki l18n cache [22:01:09] AaronSchulz: all three are upgraded. [22:01:12] Logged the message, Master [22:01:36] AaronSchulz: do you want the trunkwiki logs? [22:01:43] meh, l10n :p [22:01:51] if there are not too big ;) [22:02:05] oomph. [22:02:11] there are a lot of the unittest logs. [22:02:12] *they [22:02:21] heh [22:02:39] I can't pastebin these. [22:03:53] New patchset: preilly; "remove unneeded line" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20494 [22:04:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20494 [22:04:54] New patchset: Ryan Lane; "Setting keystone service address too" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20495 [22:05:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20495 [22:05:50] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20495 [22:06:42] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20494 [22:08:18] Error undeleting file: Could not copy file "mwstore://localSwift/local-deleted/q/3/q3smvdrv9y4qg0l6ne2id1iv2o21bkg.jpg" to "mwstore://localSwift/local-public/archive/5/57/20120817203819!Penguins.jpg". 
[22:08:25] raarr [22:08:46] it's too hot for penguins now [22:10:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.508 seconds [22:16:06] !log rebooting/reimaging zhen and silver for upgrade to precise [22:16:16] Logged the message, notpeter [22:19:35] AaronSchulz: I see an object tombstone where you're trying to put the archive...Penguins file [22:19:37] PROBLEM - Host silver is DOWN: PING CRITICAL - Packet loss = 100% [22:21:16] PROBLEM - Host zhen is DOWN: CRITICAL - Host Unreachable (208.80.152.140) [22:25:10] RECOVERY - Host silver is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [22:26:49] RECOVERY - Host zhen is UP: PING OK - Packet loss = 0%, RTA = 1.95 ms [22:28:46] PROBLEM - SSH on silver is CRITICAL: Connection refused [22:31:10] PROBLEM - SSH on zhen is CRITICAL: Connection refused [22:34:03] AaronSchulz: could you re-run the test? I kicked memcached to make sure it wasn't caching something about container or accounts. [22:34:21] the swift folks say that PUT returning a 404 means the container doesn't exist. [22:34:35] which is weird, but sometimes the response is cached. [22:35:40] Wouldn't a 400 be more suited to that... [22:37:01] the container not existing != "malformed syntax", so I'd say no, 400 wouldn't be right. [22:38:18] * AaronSchulz wonders if his bash terminal is wonk [22:39:34] PROBLEM - Host zhen is DOWN: CRITICAL - Host Unreachable (208.80.152.140) [22:40:01] RECOVERY - SSH on zhen is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:40:10] RECOVERY - Host zhen is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [22:43:37] RECOVERY - SSH on silver is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:44:09] AaronSchulz: can you re-run the unit tests? [22:44:13] I restarted memcached. 
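[editor's note: the "PUT returns 404 when the container doesn't exist" behavior discussed above implies a create-container-and-retry pattern on the client side. This is a minimal sketch of that pattern; the client interface and container names here are hypothetical illustrations, not the real python-swiftclient API or MediaWiki's actual SwiftFileBackend code.]

```python
class ContainerMissing(Exception):
    """Raised when an object PUT comes back 404 (container not found)."""

def put_with_retry(client, container, name, data):
    """PUT an object; on a 404, create the container and retry once.

    A stale cached "container does not exist" response can also surface
    as a 404 here, so creating the container (a no-op if it already
    exists) before retrying covers both cases.
    """
    try:
        return client.put_object(container, name, data)
    except ContainerMissing:
        client.put_container(container)
        return client.put_object(container, name, data)

class FakeSwift:
    """Tiny in-memory stand-in used only to exercise the retry logic."""
    def __init__(self):
        self.containers = {}
    def put_container(self, container):
        self.containers.setdefault(container, {})
    def put_object(self, container, name, data):
        if container not in self.containers:
            raise ContainerMissing(container)
        self.containers[container][name] = data
        return 201  # Created
```

[a real backend would distinguish a 404 on the object path from other errors before retrying; the fake client above only models the container-missing case.]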
[22:45:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:45:37] gah, I'd need a minute [22:53:29] AaronSchulz: are you sure that the unit tests pass when aimed at a fully operational 1.4.3 cluster? [22:53:34] and that we're not chasing ghosts? [22:53:44] I may have found a fix [22:53:50] ah? [22:53:54] details? [22:54:21] tests are still running [22:55:05] AaronSchulz: btw, in http://pastebin.com/nGhPuMy8 the third URL may be for a different container due to the URL encoding. [22:55:06] gah, a few minor warnings from my hacks [22:55:19] it looks like a breaking change in the swift api [22:55:49] details? [22:56:25] http://docs.openstack.org/api/openstack-object-storage/1.0/content/copy-object.html [22:56:32] only X-COPY-FROM works [22:56:48] Destination: // is broken [22:57:52] RECOVERY - Puppet freshness on virt1000 is OK: puppet ran at Fri Aug 17 22:57:39 UTC 2012 [22:58:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.045 seconds [22:59:05] maplebed: hi [22:59:15] welcome! [22:59:30] repeating AaronSchulz's comment above: [22:59:41] [3:55 PM] it looks like a breaking change in the swift api [22:59:42] [3:56 PM] http://docs.openstack.org/api/openstack-object-storage/1.0/content/copy-object.html [22:59:42] [3:56 PM] only X-COPY-FROM works [22:59:42] [3:56 PM] Destination: // is broken [23:00:07] wrong channel? :) [23:00:23] nope. notmyname just joined. the echo was for him. [23:00:30] I see [23:00:51] AaronSchulz: "breaking change". so it used to work and now it doesn't? 
[23:01:05] right, at least it worked in 1.4.3 [23:01:23] checking [23:01:23] maplebed: I'll just make a fix to our CF extension [23:01:41] I prefer headers with X- anyway :) [23:01:49] ;) [23:02:48] PROBLEM - NTP on zhen is CRITICAL: NTP CRITICAL: No response from NTP server [23:04:42] maplebed: hmm wait [23:04:45] PROBLEM - mysqld processes on es1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:04:50] now I'm getting failures again [23:04:58] AaronSchulz: looks like the test was originally written in sept. 2010. last time it was modified was june 2011. 1.4.3 was september 2011 [23:05:03] maybe phpunit just aborted out on some random warning without showing the failure [23:05:12] PROBLEM - mysqld processes on es1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:05:12] PROBLEM - mysqld processes on es1010 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:05:30] PROBLEM - mysqld processes on es1005 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:05:30] PROBLEM - mysqld processes on es1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:05:33] I'm not getting warm fuzzies about these unit tests. [23:05:52] but that would explain why it started working with manual tests [23:06:06] PROBLEM - mysqld processes on es1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:07:18] PROBLEM - NTP on silver is CRITICAL: NTP CRITICAL: No response from NTP server [23:08:41] maplebed: maybe it's not Destination: then since I'm getting the same problem again undeleted a file [23:08:54] :( [23:09:11] ok, so your manual tests work but the unit tests don't? [23:10:31] neither works now [23:10:59] oomph. [23:11:33] this worked a few weeks ago when Jan was talking about it :/ [23:11:34] notmyname: thanks for swinging by, but it sounds like we've got some shit to work through before we can say anything about what's actually broken. 
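[editor's note: the two server-side copy forms being compared above, per the OpenStack object-copy doc linked in the log, are a PUT naming the source in X-Copy-From versus a COPY naming the target in Destination. This sketch only builds the method, path, and headers for each form so the difference is concrete; no cluster is contacted, and the account/container/object names are illustrative.]

```python
def copy_via_put(dst_container, dst_object, src_container, src_object):
    """PUT to the destination, naming the source in X-Copy-From
    (the form reported above as still working against swift 1.5)."""
    return (
        "PUT",
        "/v1/AUTH_test/%s/%s" % (dst_container, dst_object),
        {"X-Copy-From": "/%s/%s" % (src_container, src_object),
         "Content-Length": "0"},
    )

def copy_via_copy(src_container, src_object, dst_container, dst_object):
    """COPY the source, naming the target in the Destination header
    (the form reported broken above, though notmyname could not
    reproduce it on his SAIO)."""
    return (
        "COPY",
        "/v1/AUTH_test/%s/%s" % (src_container, src_object),
        {"Destination": "/%s/%s" % (dst_container, dst_object)},
    )
```

[both header values take a "/container/object" path; the asymmetry is only in which end of the copy the request URL names.]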
[23:11:49] maplebed: can you downgrade it? [23:12:02] AaronSchulz: just one host or all three? [23:12:13] just copper [23:12:48] lemme try. [23:13:06] it's a pain going back and forth though [23:13:10] are you sure you want it downgraded? [23:13:33] maplebed: ok. cool. find me if you need me. FWIW, I can use both PUT+X-Copy-From and COPY+Destination on my SAIO just fine [23:13:42] ok. [23:15:09] New patchset: Pyoungmeister; "removing dupe def of vumi config file." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20502 [23:15:26] maplebed: if it's not a huge pain [23:15:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20502 [23:16:20] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20502 [23:16:36] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [23:17:58] !log added cn=essextest project to labs LDAP for essex testing. this project isn't compatible with diablo [23:18:08] Logged the message, Master [23:18:44] AaronSchulz: copper is downgraded. [23:19:00] * AaronSchulz starts the tests [23:21:06] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [23:21:08] AaronSchulz: btw, since copper is now downgraded, you can point your MW instance at it to get 1.4.3 and at magnesium or zinc to get 1.5, all backed by the same cluster. [23:21:22] so to re-test 1.5 stuff I won't have to upgrade copper. 
[23:23:01] yeah, I was thinking about that [23:23:13] seems like its still failing [23:25:27] RECOVERY - NTP on zhen is OK: NTP OK: Offset -0.0864289999 secs [23:25:45] RECOVERY - NTP on silver is OK: NTP OK: Offset -0.1150801182 secs [23:30:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:38:10] New patchset: Pyoungmeister; "upping required versions for mobile::vumi class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20505 [23:38:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/20505 [23:40:49] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/20505 [23:42:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.724 seconds