[00:05:26] ebernhardson, are you LDing? [00:06:56] MaxSem: yes, as soon as spage double checks my core update patch [00:07:12] err, it's not a core update, it's an extension update in core [00:07:18] can I push my config change meanwhile? [00:07:23] sure [00:07:27] thx [00:07:43] (CR) MaxSem: [C: 2] Mobile ulsfo LVS appears in XFF chains, whitelist [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111927 (owner: MaxSem) [00:07:47] greg-g, we're going to give up our slot for today. [00:07:56] (Merged) jenkins-bot: Mobile ulsfo LVS appears in XFF chains, whitelist [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111927 (owner: MaxSem) [00:10:32] !log maxsem synchronized wmf-config/squid.php 'https://gerrit.wikimedia.org/r/111927' [00:10:40] Logged the message, Master [00:11:43] ebernhardson, I'm done [00:12:29] MaxSem: ok [00:12:31] MaxSem: I said yes, right? [00:13:06] superm401: kk [00:13:27] (CR) Mattflaschen: [C: -1] "I didn't catch it earlier, but PageFilter also expects $wgGettingStartedExcludedCategories in canonicalized form." (1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111460 (owner: Phuedx) [00:14:29] (CR) Mattflaschen: "Or I guess RedisCategorySync only applies on every save with a category change, but that's still a lot." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111460 (owner: Phuedx) [00:14:57] !log ebernhardson synchronized php-1.23wmf13/extensions/Echo/ 'LD two patches to Echo-1.23wmf13' [00:14:59] greg-g, sorry misunderstood [00:15:05] Logged the message, Master [00:15:34] !log ebernhardson synchronized php-1.23wmf13/extensions/Flow/ 'LD two patches to Flow-1.23wmf13' [00:15:43] Logged the message, Master [00:15:55] greg-g, meanwhile due to math lulz MobileApp that was scheduled for today hasn't been deployed - can we reschedule? [00:16:25] MaxSem: yeah, Monday at 2pm? 
[00:16:30] pacific, that is [00:16:56] greg-g, deal:) [00:17:56] thanks [00:17:57] perfect [00:17:58] ty [00:21:14] ebernhardson: let me know when you're done :) [00:21:24] !log ebernhardson synchronized php-1.23wmf12/extensions/Echo/ 'LD two patches to Echo - 1.23wmf12' [00:21:32] Logged the message, Master [00:21:51] !log ebernhardson synchronized php-1.23wmf12/extensions/Flow/ 'LD two patches to Flow - 1.23wmf12' [00:21:54] Krinkle: all done now [00:21:59] Logged the message, Master [00:29:02] !log krinkle synchronized php-1.23wmf13/extensions/VisualEditor 'I156b24551a40' [00:29:10] Logged the message, Master [00:31:49] !log krinkle synchronized php-1.23wmf12/extensions/VisualEditor 'I1cc789596dd' [00:31:56] Logged the message, Master [00:43:13] !log krinkle synchronized php-1.23wmf12/extensions/VisualEditor 'I1cc789596dd (re-sync, forgot to update inner submodule)' [00:43:21] Logged the message, Master [00:47:28] Anyone around who can do a graceful restart of apache on mw1142? [00:48:01] It looks like the APC cache may be whacked out there causing class not found errors for a file that is clearly on disk [00:48:21] sure [00:48:51] thanks ori [00:48:59] !log graceful'd mw1142 [00:49:05] Logged the message, Master [00:54:23] !log kaldari synchronized php-1.23wmf12/extensions/MobileFrontend 'syncing MobileFrontend make sure all the js is up to date' [00:54:31] Logged the message, Master [00:55:12] Hmm… still seeing the same fatal [01:00:23] Reedy, ori: I think the graceful fixed it. No errors in the last 2 minutes and only 1 in the last 8. [01:00:54] <^demon|away> sounds like it was apc then [01:01:05] <^demon|away> turning it off and on again always works ;-) [01:01:59] I've never had a good relationship with apc [01:02:41] I was glad to see php5.5 put the zend cache in core [01:03:04] Let's just hope it's not a bundle of fail :) [01:04:10] <^demon|away> I wonder if we can bump the minimum requirement at some point in the not too distant future. 
[01:04:26] <^demon|away> Too many distros probably still shipping 5.3 :\ [01:04:48] For WMF we've got to wait for 14.04 to be deployed [01:04:54] <^demon|away> or hhvm :p [01:05:11] Which gives us 5.5.8 [01:05:15] yeah [01:05:21] Who is quicker? Platform or Ops? [01:05:25] :D [01:06:12] * bd808 puts his money on 14.04 being released before we are hhvm clean all the way through [01:06:24] I was meaning 14.04 deployed [01:06:36] Just being released isn't much use ;) [01:06:43] Yeah. That part I can't say [01:06:48] Unless we want to stop deploying master [01:07:24] At $DAYJOB-1 it only took until 2012-01 to upgrade from 8.04 to 10.04 :\ [01:07:26] I think it was 4 or 5 months after for 12.04 (for the apache app servers) [01:07:56] Though, puppet et al is in a much nicer state so... [01:15:03] Did mw1215 get depooled? 96 segfaults in last 5 minutes; 6000 in last 6 hours [01:16:31] Not obviously [01:16:35] and/or logged [01:17:45] Looks like they started at 19:56 and have continued since [01:19:44] !log mw1215 logging 10-30 segfaults per minute since 19:56 [01:19:55] Logged the message, Master [01:22:15] !log graceful'd mw1215 [01:22:21] Logged the message, Master [01:28:55] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [01:29:55] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 35.42 ms [01:39:32] !log segfaults on mw1215 stopped after graceful restart [01:39:44] Logged the message, Master [02:27:34] !log LocalisationUpdate completed (1.23wmf12) at 2014-02-07 02:27:34+00:00 [02:27:43] Logged the message, Master [03:07:57] Could not load worker load.php?debug=false&lang=en&modules=ext.codeEditor.ace [03:07:57] DOMException [code: 18] [03:07:57] SecurityError: Failed to create a worker: script at 'https://bits.wikimedia.org/static-1.23wmf13/extensions/CodeEditor/modules/ace/worker-javascript.js' cannot be accessed from origin 'https://www.mediawiki.org'. 
 at new WorkerClient (https://bits.wikimedia.org/www.mediawiki.org/load.php?debug=false&lang=en&modules=ext.codeEditor.ace) [03:08:18] Hm.. did something get messed up in the origin whitelist? [03:12:27] !log LocalisationUpdate completed (1.23wmf13) at 2014-02-07 03:12:27+00:00 [03:12:35] Logged the message, Master [03:50:13] !log removing myself from ops and wmf groups [03:50:22] Logged the message, Master [03:51:44] (PS1) Ryan Lane: Revoking my access [operations/puppet] - https://gerrit.wikimedia.org/r/111960 [03:51:59] (CR) Ryan Lane: [C: 1] "I'm not interested anymore." [operations/puppet] - https://gerrit.wikimedia.org/r/111960 (owner: Ryan Lane) [03:52:21] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-02-07 03:52:20+00:00 [03:52:29] Logged the message, Master [03:54:04] Ryan_Lane: :( [03:59:55] So this is http://meatballwiki.org/wiki/GoodBye? [04:00:09] I'm still going to work inside of labs [04:00:16] but as a regular volunteer [04:00:23] and absolutely nothing for wikimedia ever again [04:00:40] You know it's Wikimedia Labs, right? ;-) [04:01:02] *foundation [04:01:11] I just don't have the time or the resolve [04:01:18] Fair enough. [04:02:16] Ryan_Lane: I guess this means your contract has expired? [04:02:24] not yet [04:02:28] Heh. [04:02:54] If Labs becomes anything like the Toolserver, it'll need as much sysadmin help as possible. [04:03:01] But, of course, there's no obligation. [04:07:10] Ryan_Lane: any details on what led to this decision? [04:07:18] nope [04:07:29] time and resolve [04:43:38] (PS1) Andrew Bogott: Configure novnc for instance debugging. [operations/puppet] - https://gerrit.wikimedia.org/r/111965 [04:45:35] (PS2) Andrew Bogott: Configure novnc for instance debugging. [operations/puppet] - https://gerrit.wikimedia.org/r/111965 [04:47:07] (CR) Andrew Bogott: [C: 2] Configure novnc for instance debugging. 
[operations/puppet] - https://gerrit.wikimedia.org/r/111965 (owner: Andrew Bogott) [04:49:37] (CR) Tim Starling: [C: -1] "I suggest letting this sit for 24 hours or so." [operations/puppet] - https://gerrit.wikimedia.org/r/111960 (owner: Ryan Lane) [04:57:03] (PS1) Andrew Bogott: Configure novnc, take two [operations/puppet] - https://gerrit.wikimedia.org/r/111967 [04:58:28] (CR) Andrew Bogott: [C: 2] Configure novnc, take two [operations/puppet] - https://gerrit.wikimedia.org/r/111967 (owner: Andrew Bogott) [05:07:47] (PS1) Diederik: Enabling TLSv1.1 and TLSv1.2 on misc Apache services [operations/puppet] - https://gerrit.wikimedia.org/r/111969 [05:34:28] (PS1) Andrew Bogott: Set tunnel_type and tunnel_types. Trying to get dhcp to work... [operations/puppet] - https://gerrit.wikimedia.org/r/111970 [05:35:54] (CR) Andrew Bogott: [C: 2] Set tunnel_type and tunnel_types. Trying to get dhcp to work... [operations/puppet] - https://gerrit.wikimedia.org/r/111970 (owner: Andrew Bogott) [06:57:43] (CR) TTO: [C: -1] "No consensus for this has been shown yet, see comment 1 at the bug." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110876 (owner: Ebe123) [07:37:15] (PS1) Andrew Bogott: Specify sysctl priority for openstack settings. [operations/puppet] - https://gerrit.wikimedia.org/r/111978 [07:39:02] (CR) Andrew Bogott: [C: 2] Specify sysctl priority for openstack settings. [operations/puppet] - https://gerrit.wikimedia.org/r/111978 (owner: Andrew Bogott) [08:11:18] (PS1) Andrew Bogott: Cut down to just one compute node for now. [operations/puppet] - https://gerrit.wikimedia.org/r/111979 [08:17:06] (CR) Andrew Bogott: [C: 2] Cut down to just one compute node for now. [operations/puppet] - https://gerrit.wikimedia.org/r/111979 (owner: Andrew Bogott) [08:31:42] mark: up? 
[08:32:11] I'm trying to track a dhcp request and figure out where it's failing… could use some help. [09:46:55] (PS1) Matanya: (bug 61014) add he.wiki checkusers additional rights [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111985 [10:01:36] andre__: how can i subscribe to bugs in only one product? [10:01:56] matanya, I need to fix https://bugzilla.wikimedia.org/show_bug.cgi?id=37105 for that... :-/ [10:02:33] oh, thanks [10:02:59] we will move to phabricator at the end :) [10:03:16] :D [10:03:50] (Actually I'm surprised to see what's planned for Bugzilla 5.0, in case we have Bugzilla around for a little longer.) [10:03:59] *positively [10:08:29] (CR) Alexandros Kosiaris: [C: -1] "Tested this on zirconium. Didn't like it:" [operations/puppet] - https://gerrit.wikimedia.org/r/111969 (owner: Diederik) [11:15:55] (CR) Alexandros Kosiaris: [C: 2] lucene: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - https://gerrit.wikimedia.org/r/111520 (owner: Matanya) [11:20:02] (PS1) Alexandros Kosiaris: Fix erroneous warning on puppet-merge [operations/puppet] - https://gerrit.wikimedia.org/r/112001 [11:23:06] (CR) Alexandros Kosiaris: [C: 2] protoproxy: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - https://gerrit.wikimedia.org/r/111776 (owner: Matanya) [11:23:42] (CR) Alexandros Kosiaris: [C: 2] Fix erroneous warning on puppet-merge [operations/puppet] - https://gerrit.wikimedia.org/r/112001 (owner: Alexandros Kosiaris) [11:29:36] paravoid, around? [11:30:00] hi [11:30:01] I am [11:30:27] yesterday, I discovered an LVS IP, 10.2.4.26, in XFF chains on mobile - I thought LVS is supposed to be transparent [11:31:26] they are [11:31:30] what was the request? [11:31:43] mobile editing [11:32:08] can you send me the full request? [11:32:11] ulsfo? 
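The XFF question above comes down to how a client address is picked out of an X-Forwarded-For chain: walk the chain from right to left and skip every hop that is a trusted proxy. An internal hop that is missing from the trusted list, such as the LVS address 10.2.4.26, then gets reported as the client. What follows is a minimal Python sketch of that general idea, not MediaWiki's actual implementation; the function name, the trusted ranges and the sample client address 203.0.113.7 are assumptions for illustration only.

```python
import ipaddress


def client_ip_from_xff(xff_header, remote_addr, trusted_networks):
    """Return the right-most hop that is not a trusted proxy (hypothetical helper)."""
    trusted = [ipaddress.ip_network(n) for n in trusted_networks]

    def is_trusted(addr):
        ip = ipaddress.ip_address(addr)  # raises ValueError on malformed input
        return any(ip in net for net in trusted)

    # Hops listed in the header, followed by the peer that connected to us.
    hops = [h.strip() for h in xff_header.split(",") if h.strip()]
    hops.append(remote_addr)

    for hop in reversed(hops):
        if not is_trusted(hop):
            return hop
    # Every hop was trusted: fall back to the left-most entry.
    return hops[0]


# Hypothetical chain: real client, LVS address, cache host.
chain = "203.0.113.7, 10.2.4.26, 10.128.0.111"
before = client_ip_from_xff(chain, "10.128.0.111", ["10.128.0.0/24"])
after = client_ip_from_xff(chain, "10.128.0.111", ["10.128.0.0/24", "10.2.4.26/32"])
print(before, after)
```

With only the cache range trusted, the LVS address is returned as the client; once 10.2.4.26 is added to the trusted list, the lookup reaches the real client address, which is the effect a whitelist change of this kind is meant to have.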
[11:32:16] yeah it's ulsfo [11:32:28] might be varnish 3.0.5 changes [11:32:28] I don't have the request info, just XFF from cu_log [11:33:17] cuc_xff: , 10.2.4.26, 10.128.0.111, 10.128.0.111 [11:34:24] hrm [11:37:03] localssl maybe? [11:38:23] it was an edit so was definitely made over HTTPS [11:40:22] (CR) Alexandros Kosiaris: [C: 2] Tools: Fully qualify variables for Puppet 3 compatibility (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/111781 (owner: Matanya) [11:41:28] (PS1) Alexandros Kosiaris: Remove parsoid.py file resource from deployment [operations/puppet] - https://gerrit.wikimedia.org/r/112004 [11:45:34] (CR) Alexandros Kosiaris: [C: -1] "I'd rather this was done in stages. At least one to add the configuration to netmon and one to remove the configuration from manutius whi" [operations/puppet] - https://gerrit.wikimedia.org/r/108314 (owner: Matanya) [11:48:26] (CR) Tim Landscheidt: Tools: Fully qualify variables for Puppet 3 compatibility (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/111781 (owner: Matanya) [11:53:19] (CR) Alexandros Kosiaris: "Gabriel, please take a look at this, just to make sure we haven't removed parsoid.py by error." 
[operations/puppet] - https://gerrit.wikimedia.org/r/112004 (owner: Alexandros Kosiaris) [11:54:44] grumble, we don't log X-F-Proto [11:58:14] paravoid, protocol is present in xff.log [11:58:44] ...which looks a bit broken:P [11:58:55] Fri, 07 Feb 2014 07:14:49 +0000 http://ru.wikipedia.orghttp://ru.wikipedia.org/w/api.php , 10.64.0.219 [11:59:39] I have a suspicion that there's supposed to be something before the comma:) [12:03:22] (CR) Alexandros Kosiaris: [C: 2] puppetmaster: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - https://gerrit.wikimedia.org/r/111759 (owner: Matanya) [12:15:28] (CR) Alexandros Kosiaris: Tools: Fully qualify variables for Puppet 3 compatibility (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/111781 (owner: Matanya) [12:26:17] (CR) Alexandros Kosiaris: [C: 2] ldap: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - https://gerrit.wikimedia.org/r/111745 (owner: Matanya) [12:27:56] merge day akosiaris ? [12:28:11] hashar: can you please merge https://gerrit.wikimedia.org/r/#/c/111985/1 ? [12:28:21] more like hour. I got to head back to OSM at some point [12:29:07] cool, thanks for this work [12:29:26] maybe i should ask hashar_ ? 
:) [12:29:28] (CR) Alexandros Kosiaris: [C: 2] ganglia: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - https://gerrit.wikimedia.org/r/111743 (owner: Matanya) [12:29:53] hi [12:29:59] i am happy that you are not creating dependencies between all this commits and I can pick whatever I like to merge :-) [12:30:07] s/this/these/ [12:30:32] matanya: can't do anything today sorry, already overloaded :D [12:30:55] np hashar_ at your free time :) [12:31:38] akosiaris: i try to keep it clean [12:32:10] matanya: :D :D [12:33:01] finally under 30 in the queue [12:33:47] matanya: Could not look up qualified variable 'ganglia_new::monitor::config::cname'; class ganglia_new::monitor::config has not been evaluated at /etc/puppet/manifests/nagios.pp:58 [12:33:59] hmm [12:34:05] you need to include ganglia_new::monitor::config for this to be in the scope [12:34:23] this is the last one you merged? [12:34:32] hopefully this class only has variables inside and no resources so it will be easy [12:34:42] no ori merged it yesterday i think [12:34:56] I just found out [12:37:05] (CR) Yuvipanda: "This was split off from VectorBeta for code separation concerns, and is going to be deployed in a couple of weeks (once the ongoing design" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111765 (owner: Yuvipanda) [12:37:08] matanya: https://gerrit.wikimedia.org/r/#/c/107819/ [12:37:23] yeah, already fixing [12:37:29] thanks [12:37:29] hashar_: I saw you said you were overloaded, but +2 on https://gerrit.wikimedia.org/r/#/c/111765/? :) [12:39:46] (CR) Hashar: "We first noticed the redirect cache issue when migrating the beta cluster text cache from squid to varnish back in July 2013. 
The malfunc" [operations/puppet] - https://gerrit.wikimedia.org/r/111917 (owner: BryanDavis) [12:41:36] (PS1) Matanya: nagios: follow up fix for I235 [operations/puppet] - https://gerrit.wikimedia.org/r/112019 [12:41:47] akosiaris: ^ [12:43:53] (CR) Alexandros Kosiaris: [C: -1] coredb_mysql: puppet 3 compatibility fix: fully qualify variable (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/108313 (owner: Matanya) [12:46:08] matanya: I can not merge that. It requires packages from ganglia_new [12:46:16] meh... it is getting convoluted [12:46:22] yes [12:46:36] i probably need to restructure [12:46:44] i must run in a sec [12:46:57] I am that close to saying to hell with ganglia_new and ganglia.. let's go with ganglia_new2 :P [12:47:04] i will fix all your comments tomorrow night, i hope [12:47:15] :P [12:47:16] mind you it is not going to be easy... [12:48:01] there are 2-3 things like ganglia and nagios and backups that are very weird cause they need to be mixed but break things etc... which is why we try to move them into the role classes [12:49:18] ok, i'll fight those out. maybe making the role changes before would be easier [12:53:26] (CR) Alexandros Kosiaris: [C: -1] "Requiring that will bring all the ganglia packages with it. Not the best design approach. 
Need to figure this out a bit For example, why w" [operations/puppet] - https://gerrit.wikimedia.org/r/112019 (owner: Matanya) [13:06:09] (PS5) Alexandros Kosiaris: mysql: change nrpe monitoring to use nrpe::monitor [operations/puppet] - https://gerrit.wikimedia.org/r/110844 (owner: Matanya) [13:09:07] (CR) Alexandros Kosiaris: [C: 2] mysql: change nrpe monitoring to use nrpe::monitor [operations/puppet] - https://gerrit.wikimedia.org/r/110844 (owner: Matanya) [13:09:15] MaxSem: sorry, I went to have lunch in the meantime [13:09:18] MaxSem: but I found it :) [13:10:14] !log staggered restart of cp4xxx localssl, to deploy Ie94ccc (committed Oct 29th) [13:10:14] MaxSem: thanks a lot [13:10:20] Logged the message, Master [13:37:46] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [14:01:40] (CR) Ebe123: "The original discussion is archived. Posted a confirmation post on the Discussion page: https://zh.wikivoyage.org/wiki/Wikivoyage:%E4%BA%9" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110876 (owner: Ebe123) [14:35:46] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [14:40:55] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [14:41:55] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 35.58 ms [14:51:08] greg-g or others, are https://wikitech.wikimedia.org/wiki/Incident_documentation/20140121-BitsApplicationServers#Actionables tracked somewhere? [14:54:22] yup, in RT [14:55:42] is greg-g around? 
:) [14:55:45] probably not, pretty early [14:55:50] Couple of hours probably [14:56:45] PROBLEM - NTP on mw31 is CRITICAL: NTP CRITICAL: Offset unknown [15:00:46] RECOVERY - NTP on mw31 is OK: NTP OK: Offset 0.01572930813 secs [15:08:08] (CR) Jgreen: [C: 2 V: 1] "Yeah, it's fine" [operations/dns] - https://gerrit.wikimedia.org/r/111621 (owner: Dzahn) [15:32:23] * Jeff_Green is abusing db48 with horrible slow queries [15:42:35] PROBLEM - Puppet freshness on db1044 is CRITICAL: Last successful Puppet run was Fri 07 Feb 2014 12:42:15 PM UTC [15:44:42] (PS1) Andrew Bogott: Add eth1.1102 interface to compute nodes [operations/puppet] - https://gerrit.wikimedia.org/r/112030 [15:46:33] (PS2) Andrew Bogott: Add eth1.1102 interface to compute nodes [operations/puppet] - https://gerrit.wikimedia.org/r/112030 [15:48:53] (CR) Mark Bergsma: [C: -1] Add eth1.1102 interface to compute nodes (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/112030 (owner: Andrew Bogott) [15:49:31] (PS3) Andrew Bogott: Add eth1.1102 interface to compute nodes [operations/puppet] - https://gerrit.wikimedia.org/r/112030 [15:49:50] mark: you're right! [15:51:42] :) [15:51:56] so after you've done that [15:52:06] you should check whether you can reach the network node from virt1001 [15:52:14] ok [15:52:15] also [15:52:29] (CR) Andrew Bogott: [C: 2] Add eth1.1102 interface to compute nodes [operations/puppet] - https://gerrit.wikimedia.org/r/112030 (owner: Andrew Bogott) [15:52:32] i wonder if virt1001 needs an ip in that subnet [15:52:43] it may not [15:54:14] yeah probably not [15:54:28] but then you won't be able to do a ping test, that's ok though [15:59:00] (PS1) Andrew Bogott: Don't configure a nova-network interface in havana. [operations/puppet] - https://gerrit.wikimedia.org/r/112031 [16:00:16] (CR) Andrew Bogott: [C: 2] Don't configure a nova-network interface in havana. 
[operations/puppet] - https://gerrit.wikimedia.org/r/112031 (owner: Andrew Bogott) [16:04:35] PROBLEM - Puppet freshness on pc2 is CRITICAL: Last successful Puppet run was Fri 07 Feb 2014 01:03:27 PM UTC [16:06:50] mark, I can ping labnet1001 from virt1001. Is that good news or bad news? [16:11:35] PROBLEM - Puppet freshness on pc1 is CRITICAL: Last successful Puppet run was Fri 07 Feb 2014 01:10:40 PM UTC [16:15:35] PROBLEM - Puppet freshness on db9 is CRITICAL: Last successful Puppet run was Fri 07 Feb 2014 01:15:24 PM UTC [16:30:54] (CR) Alexandros Kosiaris: [C: -2] "We investigated this more with Diederik. That syntax is supported on 2.2.23 and above. We have 2.2.22 and no plans to upgrade to 2.2.23 a" [operations/puppet] - https://gerrit.wikimedia.org/r/111969 (owner: Diederik) [16:31:35] PROBLEM - Puppet freshness on pc1001 is CRITICAL: Last successful Puppet run was Fri 07 Feb 2014 01:30:57 PM UTC [16:32:35] PROBLEM - Puppet freshness on pc1003 is CRITICAL: Last successful Puppet run was Fri 07 Feb 2014 01:32:09 PM UTC [16:33:35] PROBLEM - Puppet freshness on pc3 is CRITICAL: Last successful Puppet run was Fri 07 Feb 2014 01:33:00 PM UTC [16:34:45] Could someone run a graceful restart on the mw1165 apache process? It's depooled but still segfaulting (likely in response to icinga checks) and 2 other hosts that were having segfault issues yesterday cleared after an apache graceful restart. 
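Figures such as "10-30 segfaults per minute" can be derived by bucketing log lines by minute. The sketch below assumes a purely hypothetical log format in which every line starts with a `YYYY-MM-DD HH:MM:SS` timestamp and segfault entries contain the word "segfault"; the real log format on the app servers is not shown in this log, so treat the sample lines and the function name as illustrative only.

```python
from collections import Counter


def segfaults_per_minute(lines, marker="segfault"):
    """Count lines containing `marker`, keyed by their 'YYYY-MM-DD HH:MM' prefix."""
    counts = Counter()
    for line in lines:
        if marker in line.lower():
            counts[line[:16]] += 1  # first 16 characters cover date, hour and minute
    return counts


# Made-up sample lines in the assumed format.
sample = [
    "2014-02-07 16:30:01 mw1165 apache2: child exited on signal 11 (segfault)",
    "2014-02-07 16:30:42 mw1165 apache2: child exited on signal 11 (segfault)",
    "2014-02-07 16:31:05 mw1165 apache2: graceful restart requested",
    "2014-02-07 16:31:30 mw1165 apache2: child exited on signal 11 (segfault)",
]
for minute, count in sorted(segfaults_per_minute(sample).items()):
    print(minute, count)
```

Comparing the per-minute counts before and after a graceful restart shows whether the restart actually stopped the crashes, although it says nothing about their cause.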
[16:35:05] that's the wrong way to handle this [16:35:14] ok [16:35:28] maybe it's bad memory, and restarting it uses different parts of memory, for example [16:35:39] it will stop segfaulting, but it will also not be a better fix [16:36:08] I just restarted in any case, to help you with the log spam :) [16:36:27] andrewbogott: back [16:36:33] Thanks, I guess ;) [16:36:36] andrewbogott: that's certainly not bad news, but that doesn't tell us much either [16:36:53] andrewbogott: so what is likely happening is that you're simply pinging labnet1001 from virt1001 across eth0 on both systems [16:36:56] the "management" interface [16:37:07] actual instances will talk across the "data" interface [16:37:20] yep, makes sense [16:37:25] and you can't actually test that with ping because virt1001 doesn't have an ip on that interface/subnet [16:37:30] which might be fine due to the way it works [16:37:34] just makes it slightly harder to test [16:37:35] PROBLEM - Puppet freshness on pc1002 is CRITICAL: Last successful Puppet run was Fri 07 Feb 2014 01:37:23 PM UTC [16:37:38] however [16:37:51] if all goes well, dhcp messages should travel across that interface and end up on labnet1001 as well [16:43:42] hm, seems like the behavior hasn't changed [16:43:59] ok i'll help debug for a little bit [16:44:18] before I move on to interesting things like ops meeting agendas and stuff ;p [16:45:05] ok so I see eth1.1118 still exists [16:45:09] we should probably remove that [16:45:26] mark: Yeah, it's getting recreated by puppet and I haven't hunted down where yet. [16:45:31] Or, I thought I had, but… clearly not. [16:45:32] the default route on virt1001 is set towards eth0 as normal [16:45:53] did you try a tcpdump on eth1.1102 when creating an instance, and if so, did you see the dhcp message there? [16:46:18] I'll do that now [16:46:24] i'll watch along then [16:46:53] anything wrong with Gerrit? 
[16:46:59] trying to clone core for a while, no result [16:47:23] * twkozlowski stuck at Cloning into 'core'... [16:47:47] mark, no traffic at all on eth1.1102 [16:47:56] indeed [16:48:01] then the problem is -likely- on virt1001 itself [16:48:31] would you expect me to need to tell neutron about eth1.1102 explicitly? [16:48:48] neutron probably not [16:48:54] neutron lives on labnet1001 doesn't it [16:49:00] which doesn't use eth1.1102 but eth4.1102 [16:49:18] effectively virt1001:eth1.1102 and labnet1001:eth4.1102 are connected together though [16:49:22] (at least if the switches are setup right) [16:49:33] but since we're not even seeing anything going out on eth1.1102, probably there's another issue first [16:49:49] Well, there's openvswitch and the openvswitch neutron plugin -- they're on virt1001 [16:49:50] or is there some neutron component on virt1001? [16:49:53] ok [16:49:54] then maybe yes [16:50:01] somehow, instances need to be connected to eth1.1102 [16:50:11] in the old labs setup, nova-network did this using bridging [16:50:22] probably something like that is still happening, or a variant thereof [16:50:28] yes, there's br-int which I can see the traffic on [16:50:37] and certainly there needs to be some config setting somewhere on virt1001 that mentions eth1.1102 ;) [16:50:44] ok let me investigate what that is [16:51:50] root@virt1001:~# brctl show br-int [16:51:50] bridge name bridge id STP enabled interfaces [16:51:50] br-int can't get info Operation not supported [16:51:55] i wonder what that interface is exactly [16:51:56] let's see [16:52:16] /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini [16:52:21] is probably the place this would be configured [16:52:54] * mark checks [16:52:56] that has a 'local_ip = 10.64.20.4' setting which I think is meant to indicate the data port [16:53:55] yeah, the doc says "local_ip = DATA_INTERFACE_IP" [16:54:36] aha [16:54:43] but we never assigned that ip to any interface did we? 
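The check being done by hand here is whether the `local_ip` value in ovs_neutron_plugin.ini is actually assigned to some interface on the host. Below is a small Python sketch of that comparison. It works on text only: the ini contents and the output of `ip -o -4 addr show` are passed in as strings. The `[ovs]` section name, the helper names and the sample eth0 address are assumptions for illustration, not taken from the real host.

```python
import configparser


def read_local_ip(ini_text, section="ovs"):
    """Return the local_ip setting from the plugin ini text, or None if absent."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return parser.get(section, "local_ip", fallback=None)


def interfaces_with_ip(ip_addr_output, wanted_ip):
    """List interfaces carrying wanted_ip, given `ip -o -4 addr show` output."""
    matches = []
    for line in ip_addr_output.splitlines():
        fields = line.split()
        # Expected shape: "<index>: <ifname> inet <address>/<prefix> ..."
        if len(fields) >= 4 and fields[2] == "inet":
            if fields[3].split("/")[0] == wanted_ip:
                matches.append(fields[1])
    return matches


# Sample inputs; the eth0 address is made up.
ini_text = "[ovs]\nlocal_ip = 10.64.20.4\n"
ip_output = (
    "1: lo    inet 127.0.0.1/8 scope host lo\n"
    "2: eth0    inet 10.64.0.10/22 brd 10.64.3.255 scope global eth0\n"
)
local_ip = read_local_ip(ini_text)
print(local_ip, interfaces_with_ip(ip_output, local_ip))
```

An empty list for the configured address corresponds to the situation described in the conversation: the setting names an IP that no interface on the compute node carries.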
[16:54:46] which in our case the data port doesn't have an ip... [16:54:48] right. [16:54:53] so let's fix that :) [16:55:12] what do the docs say that DATA_INTERFACE is? br-int? [16:55:25] we could test this without puppet first [16:55:50] the docs don't say much :( [16:55:57] http://docs.openstack.org/trunk/install-guide/install/apt/content/install-neutron.install-plugin-compute.ovs.gre.html [16:56:34] hiii akosiaris, you there? [16:56:49] gre? [16:56:51] I think that the docs are expecting us to only have one interface on the compute node. That's the only way I can make sense of what's unsaid [16:56:51] omg [16:57:07] paravoid: um… arbitrary choice because the config is simpler. Bad? [16:57:24] ignore my comments, I have no idea about the rest of the setup [16:57:37] I was just amazed to see GRE [16:57:40] I think vlans might be better [16:58:03] http://docs.openstack.org/trunk/install-guide/install/apt/content/install-neutron.install-plug-in.ovs.vlan.html [16:58:08] that makes more sense to me anyway ;) [16:58:29] what's the current though? one VLAN per tenant? [16:58:35] the current what? [16:58:38] (I can go away if I'm just producing noise) [16:58:42] *thought [16:58:43] plan :) [16:58:56] paravoid: hopefully just one big vlan, same as in tampa. Unclear so far if neutron will let us do that. [16:58:59] the current plan is to get anything working ;) [16:59:04] Or, rather, it will let us, but unclear if we can have that and also floating ips. [16:59:17] no, stop thinking about flat network [16:59:19] that's not what we have [16:59:22] not at all [16:59:26] um... [16:59:44] But all the instances on all the tenants are all on one network, aren't they? 
[16:59:46] you could say we have a flat network behind an openstack network node/router [16:59:56] but that's really not what neutron's flat network model is [17:00:15] which also explains why floating ips can't possibly work [17:00:20] as there's nothing that can do NAT for those floating ips [17:00:25] neutron's flat network model is essentially unmanaged, right? [17:00:34] yeah, bridging out to some interface [17:00:38] i.e. the openstack suite doesn't care at all about the network [17:00:40] where the network hardware takes further care of it [17:00:42] yep [17:00:45] you do whatever you want [17:00:45] right [17:01:09] mark, is 'one big vlan' the same thing as 'flat'? [17:01:18] not in neutron speak at least [17:01:28] ok, then… [17:01:52] you should really only look at the "provider router with per-tenant network" model they describe in the docs [17:01:56] just, you said "no, stop thinking about flat network" and I don't think I said 'flat' so unsure if you're disagreeing with me or not :) [17:02:01] and once we have that working we can see if we can get to "one network for all tenants" :) [17:02:07] ok [17:02:28] if you're doing anything described in the docs for flat network (neutron terminology) we're missing several components [17:03:04] if you do one network per tenant, how do you run all these networks via the switches? [17:03:14] different broadcast domains I assume [17:03:15] via which switches? [17:03:35] via the switches that connect the compute nodes with each other and with the network node [17:03:40] via GRE or vlan tagging [17:03:52] vlan tagging, how? will we just assign a range of VLANs to openstack? [17:03:59] no need [17:04:01] QinQ ;) [17:04:07] jesus, they do QinQ? [17:04:18] i think that's what they're doing, I haven't looked at the details yet [17:04:20] but yes [17:04:28] it's rapidly getting really complex [17:06:24] i have a feeling we're just gonna deploy nova-network :P [17:08:12] well… should I just drop neutron now? 
Are you that convinced that neutron is too complex? [17:10:04] well [17:10:12] given the fact that it needs quite some network design and thinking [17:10:16] and some time to experiment [17:10:21] and you're probably unwilling to do that this weekend :) [17:10:31] and also need help from some people with a lot of network experience [17:10:34] it's looking unrealistic at this point [17:10:39] not unwilling but probably unqualified [17:10:42] yes [17:10:56] feel free to continue to experiment with it if you want, of course [17:10:57] but yes [17:10:59] well, is the goal to have it /done/ by monday, or just to have it assessed by Monday? [17:11:12] assessed, but we can only realistically do that if we know it's gonna work [17:11:17] which means we have to have it somewhat working [17:12:34] This also sort of ties into the hypothetical canonical consultant thing... [17:12:53] if we did hire them, it would be mostly for neutron. But making the call about neutron or not before we hire them is a bit backwards [17:13:08] I don't think a contractor can reasonably help with this tbh [17:13:12] it ties a lot to our networking setup [17:13:25] exactly [17:13:26]