[00:05:26] ebernhardson, are you LDing? [00:06:56] MaxSem: yes, as soon as spage double checks my core update patch [00:07:12] err, its not a core update its an extension update in core [00:07:18] can I push my config change meanwhile? [00:07:23] sure [00:07:27] thx [00:07:43] (CR) MaxSem: [C: 2] Mobile ulsfo LVS appears in XFF chains, whitelist [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111927 (owner: MaxSem) [00:07:47] greg-g, we're going to give up our slot for today. [00:07:56] (Merged) jenkins-bot: Mobile ulsfo LVS appears in XFF chains, whitelist [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111927 (owner: MaxSem) [00:10:32] !log maxsem synchronized wmf-config/squid.php 'https://gerrit.wikimedia.org/r/111927' [00:10:40] Logged the message, Master [00:11:43] ebernhardson, I'm done [00:12:29] MaxSem: ok [00:12:31] MaxSem: I said yes, right? [00:13:06] superm401: kk [00:13:27] (CR) Mattflaschen: [C: -1] "I didn't catch it earlier, but PageFilter also expects $wgGettingStartedExcludedCategories in canonicalized form." (1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111460 (owner: Phuedx) [00:14:29] (CR) Mattflaschen: "Or I guess RedisCategorySync only applies on every save with a category change, but that's still a lot." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111460 (owner: Phuedx) [00:14:57] !log ebernhardson synchronized php-1.23wmf13/extensions/Echo/ 'LD two patches to Echo-1.23wmf13' [00:14:59] greg-g, sorry misunderstood [00:15:05] Logged the message, Master [00:15:34] !log ebernhardson synchronized php-1.23wmf13/extensions/Flow/ 'LD two patches to Flow-1.23wmf13' [00:15:43] Logged the message, Master [00:15:55] greg-g, meanwhile due to math lulz MobileApp that was scheduled for today haven't been deployed - can we reschedule? [00:16:25] MaxSem: yeah, Monday at 2pm? 
[00:16:30] pacific, that is [00:16:56] greg-g, deal:) [00:17:56] thanks [00:17:57] perfect [00:17:58] ty [00:21:14] ebernhardson: let me know when you're done :) [00:21:24] !log ebernhardson synchronized php-1.23wmf12/extensions/Echo/ 'LD two patches to Echo - 1.23wmf12' [00:21:32] Logged the message, Master [00:21:51] !log ebernhardson synchronized php-1.23wmf12/extensions/Flow/ 'LD two patches to Flow - 1.23wmf12' [00:21:54] Krinkle: all done now [00:21:59] Logged the message, Master [00:29:02] !log krinkle synchronized php-1.23wmf13/extensions/VisualEditor 'I156b24551a40' [00:29:10] Logged the message, Master [00:31:49] !log krinkle synchronized php-1.23wmf12/extensions/VisualEditor 'I1cc789596dd' [00:31:56] Logged the message, Master [00:43:13] !log krinkle synchronized php-1.23wmf12/extensions/VisualEditor 'I1cc789596dd (re-sync, forgot to update inner submodule)' [00:43:21] Logged the message, Master [00:47:28] Anyone around who can do a graceful restart of apache on mw1142? [00:48:01] It looks like the APC cache may be whacked out there causing class not found errors for a file that is clearly on disk [00:48:21] sure [00:48:51] thanks ori [00:48:59] !log graceful'd mw1142 [00:49:05] Logged the message, Master [00:54:23] !log kaldari synchronized php-1.23wmf12/extensions/MobileFrontend 'syncing MobileFrontend make sure all the js is up to date' [00:54:31] Logged the message, Master [00:55:12] Hmm… still seeing the same fatal [01:00:23] Reedy, ori: I think the graceful fixed it. No errors in the last 2 minutes and only 1 in the last 8. [01:00:54] <^demon|away> sounds like it was apc then [01:01:05] <^demon|away> turning it off and on again always works ;-) [01:01:59] I've never had a good relationship with apc [01:02:41] I was glad to see php5.5 put the zend cache in core [01:03:04] Let's just hope it's not a bundle of fail :) [01:04:10] <^demon|away> I wonder if we can bump the minimum requirement at some point in the not too distant future. 
[01:04:26] <^demon|away> Too many distros probably still shipping 5.3 :\ [01:04:48] For WMF we've got to wait for 14.04 to be deployed [01:04:54] <^demon|away> or hhvm :p [01:05:11] Which gives us 5.5.8 [01:05:15] yeah [01:05:21] Who is quicker? Platform or Ops? [01:05:25] :D [01:06:12] * bd808 puts his money on 14.04 being released before we are hhvm clean all the way through [01:06:24] I was meaning 14.04 deployed [01:06:36] Just being released isn't much use ;) [01:06:43] Yeah. That part I can't say [01:06:48] Unless we want to stop deploying master [01:07:24] At $DAYJOB-1 it only took until 2012-01 to upgrade from 8.04 to 10.04 :\ [01:07:26] I think it was 4 or 5 months after for 12.04 (for the apache app servers) [01:07:56] Though, puppet et al is in a much nicer state so... [01:15:03] Did mw1215 get depooled? 96 segfaults in last 5 minutes; 6000 in last 6 hours [01:16:31] Not obviously [01:16:35] and/or logged [01:17:45] Looks like they started at 19:56 and have continued since [01:19:44] !log mw1215 logging 10-30 segfaults per minute since 19:56 [01:19:55] Logged the message, Master [01:22:15] !log graceful'd mw1215 [01:22:21] Logged the message, Master [01:28:55] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [01:29:55] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 35.42 ms [01:39:32] !log segfaults on mw1215 stopped after graceful restart [01:39:44] Logged the message, Master [02:27:34] !log LocalisationUpdate completed (1.23wmf12) at 2014-02-07 02:27:34+00:00 [02:27:43] Logged the message, Master [03:07:57] Could not load worker load.php?debug=false&lang=en&modules=ext.codeEditor.ace [03:07:57] DOMException [code: 18] [03:07:57] SecurityError: Failed to create a worker: script at 'https://bits.wikimedia.org/static-1.23wmf13/extensions/CodeEditor/modules/ace/worker-javascript.js' cannot be accessed from origin 'https://www.mediawiki.org'. 
at new WorkerClient (https://bits.wikimedia.org/www.mediawiki.org/load.php?debug=false&lang=en&modules=ext.codeEditor.ace) [03:08:18] Hm.. did something get messed up in the origin whitelist? [03:12:27] !log LocalisationUpdate completed (1.23wmf13) at 2014-02-07 03:12:27+00:00 [03:12:35] Logged the message, Master [03:50:13] !log removing myself from ops and wmf groups [03:50:22] Logged the message, Master [03:51:44] (PS1) Ryan Lane: Revoking my access [operations/puppet] - https://gerrit.wikimedia.org/r/111960 [03:51:59] (CR) Ryan Lane: [C: 1] "I'm not interested anymore." [operations/puppet] - https://gerrit.wikimedia.org/r/111960 (owner: Ryan Lane) [03:52:21] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-02-07 03:52:20+00:00 [03:52:29] Logged the message, Master [03:54:04] Ryan_Lane: :( [03:59:55] So this is http://meatballwiki.org/wiki/GoodBye? [04:00:09] I'm still going to work inside of labs [04:00:16] but as a regular volunteer [04:00:23] and absolutely nothing for wikimedia ever again [04:00:40] You know it's Wikimedia Labs, right? ;-) [04:01:02] *foundation [04:01:11] I just don't have the time or the resolve [04:01:18] Fair enough. [04:02:16] Ryan_Lane: I guess this means your contract has expired? [04:02:24] not yet [04:02:28] Heh. [04:02:54] If Labs becomes anything like the Toolserver, it'll need as much sysadmin help as possible. [04:03:01] But, of course, there's no obligation. [04:07:10] Ryan_Lane: any details on what led to this decision? [04:07:18] nope [04:07:29] time and resolve [04:43:38] (PS1) Andrew Bogott: Configure novnc for instance debugging. [operations/puppet] - https://gerrit.wikimedia.org/r/111965 [04:45:35] (PS2) Andrew Bogott: Configure novnc for instance debugging. [operations/puppet] - https://gerrit.wikimedia.org/r/111965 [04:47:07] (CR) Andrew Bogott: [C: 2] Configure novnc for instance debugging. 
[operations/puppet] - https://gerrit.wikimedia.org/r/111965 (owner: Andrew Bogott) [04:49:37] (CR) Tim Starling: [C: -1] "I suggest letting this sit for 24 hours or so." [operations/puppet] - https://gerrit.wikimedia.org/r/111960 (owner: Ryan Lane) [04:57:03] (PS1) Andrew Bogott: Configure novnc, take two [operations/puppet] - https://gerrit.wikimedia.org/r/111967 [04:58:28] (CR) Andrew Bogott: [C: 2] Configure novnc, take two [operations/puppet] - https://gerrit.wikimedia.org/r/111967 (owner: Andrew Bogott) [05:07:47] (PS1) Diederik: Enabling TLSv1.1 and TLSv1.2 on misc Apache services [operations/puppet] - https://gerrit.wikimedia.org/r/111969 [05:34:28] (PS1) Andrew Bogott: Set tunnel_type and tunnel_types. Trying to get dhcp to work... [operations/puppet] - https://gerrit.wikimedia.org/r/111970 [05:35:54] (CR) Andrew Bogott: [C: 2] Set tunnel_type and tunnel_types. Trying to get dhcp to work... [operations/puppet] - https://gerrit.wikimedia.org/r/111970 (owner: Andrew Bogott) [06:57:43] (CR) TTO: [C: -1] "No consensus for this has been shown yet, see comment 1 at the bug." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110876 (owner: Ebe123) [07:37:15] (PS1) Andrew Bogott: Specify sysctl priority for openstack settings. [operations/puppet] - https://gerrit.wikimedia.org/r/111978 [07:39:02] (CR) Andrew Bogott: [C: 2] Specify sysctl priority for openstack settings. [operations/puppet] - https://gerrit.wikimedia.org/r/111978 (owner: Andrew Bogott) [08:11:18] (PS1) Andrew Bogott: Cut down to just one compute node for now. [operations/puppet] - https://gerrit.wikimedia.org/r/111979 [08:17:06] (CR) Andrew Bogott: [C: 2] Cut down to just one compute node for now. [operations/puppet] - https://gerrit.wikimedia.org/r/111979 (owner: Andrew Bogott) [08:31:42] mark: up? 
[08:32:11] I'm trying to track a dhcp request and figure out where it's failing… could use some help. [09:46:55] (PS1) Matanya: (bug 61014) add he.wiki checkusers additional rights [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111985 [10:01:36] andre__: how can i subscribe to bugs in only one product? [10:01:56] matanya, I need to fix https://bugzilla.wikimedia.org/show_bug.cgi?id=37105 for that... :-/ [10:02:33] oh, thanks [10:02:59] we will move to phabricaotr at the end :) [10:03:16] :D [10:03:50] (Actually I'm surprised to see what's planned for Bugzilla 5.0, in case we have Bugzilla around for a little longer.) [10:03:59] *positively [10:08:29] (CR) Alexandros Kosiaris: [C: -1] "Tested this on zirconium. Didn't like it:" [operations/puppet] - https://gerrit.wikimedia.org/r/111969 (owner: Diederik) [11:15:55] (CR) Alexandros Kosiaris: [C: 2] lucene: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - https://gerrit.wikimedia.org/r/111520 (owner: Matanya) [11:20:02] (PS1) Alexandros Kosiaris: Fix erroneous warning on puppet-merge [operations/puppet] - https://gerrit.wikimedia.org/r/112001 [11:23:06] (CR) Alexandros Kosiaris: [C: 2] protoproxy: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - https://gerrit.wikimedia.org/r/111776 (owner: Matanya) [11:23:42] (CR) Alexandros Kosiaris: [C: 2] Fix erroneous warning on puppet-merge [operations/puppet] - https://gerrit.wikimedia.org/r/112001 (owner: Alexandros Kosiaris) [11:29:36] paravoid, around? [11:30:00] hi [11:30:01] I am [11:30:27] yesterday, I discovered an LVS IP, 10.2.4.26, in XFF chains on mobile - I thought LVS is supposed to be transparent [11:31:26] they are [11:31:30] what was the request? [11:31:43] mobile editing [11:32:08] can you send me the full request? [11:32:11] ulsfo? 
[11:32:16] yeah it's ulsfo [11:32:28] might be varnish 3.0.5 changes [11:32:28] I don't have the request info, just XFF from cu_log [11:33:17] cuc_xff: , 10.2.4.26, 10.128.0.111, 10.128.0.111 [11:34:24] hrm [11:37:03] localssl maybe? [11:38:23] it was an edit so was definitely made over HTTPS [11:40:22] (CR) Alexandros Kosiaris: [C: 2] Tools: Fully qualify variables for Puppet 3 compatibility (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/111781 (owner: Matanya) [11:41:28] (PS1) Alexandros Kosiaris: Remove parsoid.py file resource from deployment [operations/puppet] - https://gerrit.wikimedia.org/r/112004 [11:45:34] (CR) Alexandros Kosiaris: [C: -1] "I 'd rather this was done in stages. At least one to add the configuration to netmon and one to remove the configuration from manutius whi" [operations/puppet] - https://gerrit.wikimedia.org/r/108314 (owner: Matanya) [11:48:26] (CR) Tim Landscheidt: Tools: Fully qualify variables for Puppet 3 compatibility (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/111781 (owner: Matanya) [11:53:19] (CR) Alexandros Kosiaris: "Gabriel, please take a look at this, just to make sure we haven't removed parsoid.py by error." 
[operations/puppet] - https://gerrit.wikimedia.org/r/112004 (owner: Alexandros Kosiaris) [11:54:44] grumble, we don't log X-F-Proto [11:58:14] paravoid, protocol is present in xff.log [11:58:44] ...which looks a bit broken:P [11:58:55] Fri, 07 Feb 2014 07:14:49 +0000 http://ru.wikipedia.orghttp://ru.wikipedia.org/w/api.php , 10.64.0.219 [11:59:39] I have a suspicion that there's supposed to be something before the comma:) [12:03:22] (CR) Alexandros Kosiaris: [C: 2] puppetmaster: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - https://gerrit.wikimedia.org/r/111759 (owner: Matanya) [12:15:28] (CR) Alexandros Kosiaris: Tools: Fully qualify variables for Puppet 3 compatibility (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/111781 (owner: Matanya) [12:26:17] (CR) Alexandros Kosiaris: [C: 2] ldap: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - https://gerrit.wikimedia.org/r/111745 (owner: Matanya) [12:27:56] merge day akosiaris ? [12:28:11] hashar: can you please merge https://gerrit.wikimedia.org/r/#/c/111985/1 ? [12:28:21] more like hour. I got to head back to OSM at some point [12:29:07] cool, thanks for this work [12:29:26] maybe i should ask hashar_ ? 
:) [12:29:28] (CR) Alexandros Kosiaris: [C: 2] ganglia: puppet 3 compatibility fix: fully qualify variable [operations/puppet] - https://gerrit.wikimedia.org/r/111743 (owner: Matanya) [12:29:53] hi [12:29:59] i am happy that you are not creating dependencies between all this commits and I can pick whatever I like to merge :-) [12:30:07] s/this/these/ [12:30:32] matanya: can't do anything today sorry, already overloaded :D [12:30:55] np hashar_ at your free time :) [12:31:38] akosiaris: i try to keep it clean [12:32:10] matanya: :D :D [12:33:01] finally under 30 in the queue [12:33:47] matanya: Could not look up qualified variable 'ganglia_new::monitor::config::cname'; class ganglia_new::monitor::config has not been evaluated at /etc/puppet/manifests/nagios.pp:58 [12:33:59] hmm [12:34:05] you need to include ganglia_new::monitor::config for this to be in the scope [12:34:23] this is the last one you merged? [12:34:32] hopefully this class only has variables inside and no resources so it will be easy [12:34:42] no ori merged it yesterday i think [12:34:56] I just found out [12:37:05] (CR) Yuvipanda: "This was split off from VectorBeta for code separation concerns, and is going to be deployed in a couple of weeks (once the ongoing design" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/111765 (owner: Yuvipanda) [12:37:08] matanya: https://gerrit.wikimedia.org/r/#/c/107819/ [12:37:23] yeah, aready fixing [12:37:29] thanks [12:37:29] hashar_: I saw you said you were overloaded, but +2 on https://gerrit.wikimedia.org/r/#/c/111765/? :) [12:39:46] (CR) Hashar: "We first noticed the redirect cache issue when migrating the beta cluster text cache from squid to varnish back in July 2013. 
The malfunc" [operations/puppet] - https://gerrit.wikimedia.org/r/111917 (owner: BryanDavis) [12:41:36] (PS1) Matanya: nagios: follow up fix for I235 [operations/puppet] - https://gerrit.wikimedia.org/r/112019 [12:41:47] akosiaris: ^ [12:43:53] (CR) Alexandros Kosiaris: [C: -1] coredb_mysql: puppet 3 compatibility fix: fully qualify variable (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/108313 (owner: Matanya) [12:46:08] matanya: I can not merge that. It requires packages from ganglia_new [12:46:16] meh... it is getting convoluted [12:46:22] yes [12:46:36] i probably need to restructure [12:46:44] i must run in a sec [12:46:57] I am that close to saying to hell with ganglia_new and ganglia.. let's go with ganglia_new2 :P [12:47:04] i will fix all your comments tomorrow night, i hope [12:47:15] :P [12:47:16] mind you it is not going to be easy... [12:48:01] there are 2-3 things like ganglia and nagios and backups that are very weird cause they need to be mixed but break things etc... which is why we try to move them into the role classes [12:49:18] ok, i'll fight those out. maybe making the role changes before would be easier [12:53:26] (CR) Alexandros Kosiaris: [C: -1] "Requiring that will bring all the ganglia packages with it. Not the best design approach. 
Need to figure this out a bit For example, why w" [operations/puppet] - https://gerrit.wikimedia.org/r/112019 (owner: Matanya) [13:06:09] (PS5) Alexandros Kosiaris: mysql: change nrpe monitoring to use nrpe::monitor [operations/puppet] - https://gerrit.wikimedia.org/r/110844 (owner: Matanya) [13:09:07] (CR) Alexandros Kosiaris: [C: 2] mysql: change nrpe monitoring to use nrpe::monitor [operations/puppet] - https://gerrit.wikimedia.org/r/110844 (owner: Matanya) [13:09:15] MaxSem: sorry, I went to have lunch in the meantime [13:09:18] MaxSem: but I found it :) [13:10:14] !log staggered restart of cp4xxx localssl, to deploy Ie94ccc (committed Oct 29th) [13:10:14] MaxSem: thanks a lot [13:10:20] Logged the message, Master [13:37:46] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [14:01:40] (CR) Ebe123: "The original discussion is archived. Posted a confirmation post on the Discussion page: https://zh.wikivoyage.org/wiki/Wikivoyage:%E4%BA%9" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/110876 (owner: Ebe123) [14:35:46] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [14:40:55] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100% [14:41:55] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 35.58 ms [14:51:08] greg-g or others, are https://wikitech.wikimedia.org/wiki/Incident_documentation/20140121-BitsApplicationServers#Actionables tracked somewhere? [14:54:22] yup, in RT [14:55:42] is greg-g around? 
:) [14:55:45] probably not, pretty early [14:55:50] Couple of hours probably [14:56:45] PROBLEM - NTP on mw31 is CRITICAL: NTP CRITICAL: Offset unknown [15:00:46] RECOVERY - NTP on mw31 is OK: NTP OK: Offset 0.01572930813 secs [15:08:08] (CR) Jgreen: [C: 2 V: 1] "Yeah, it's fine" [operations/dns] - https://gerrit.wikimedia.org/r/111621 (owner: Dzahn) [15:32:23] * Jeff_Green is abusing db48 with horrible slow queries [15:42:35] PROBLEM - Puppet freshness on db1044 is CRITICAL: Last successful Puppet run was Fri 07 Feb 2014 12:42:15 PM UTC [15:44:42] (PS1) Andrew Bogott: Add eth1.1102 interface to compute nodes [operations/puppet] - https://gerrit.wikimedia.org/r/112030 [15:46:33] (PS2) Andrew Bogott: Add eth1.1102 interface to compute nodes [operations/puppet] - https://gerrit.wikimedia.org/r/112030 [15:48:53] (CR) Mark Bergsma: [C: -1] Add eth1.1102 interface to compute nodes (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/112030 (owner: Andrew Bogott) [15:49:31] (PS3) Andrew Bogott: Add eth1.1102 interface to compute nodes [operations/puppet] - https://gerrit.wikimedia.org/r/112030 [15:49:50] mark: you're right! [15:51:42] :) [15:51:56] so after you've done that [15:52:06] you should check whether you can reach the network node from virt1001 [15:52:14] ok [15:52:15] also [15:52:29] (CR) Andrew Bogott: [C: 2] Add eth1.1102 interface to compute nodes [operations/puppet] - https://gerrit.wikimedia.org/r/112030 (owner: Andrew Bogott) [15:52:32] i wonder if virt1001 needs an ip in that subnet [15:52:43] it may not [15:54:14] yeah probably not [15:54:28] but then you won't be able to do a ping test, that's ok though [15:59:00] (PS1) Andrew Bogott: Don't configure a nova-network interface in havana. [operations/puppet] - https://gerrit.wikimedia.org/r/112031 [16:00:16] (CR) Andrew Bogott: [C: 2] Don't configure a nova-network interface in havana. 
[operations/puppet] - https://gerrit.wikimedia.org/r/112031 (owner: Andrew Bogott) [16:04:35] PROBLEM - Puppet freshness on pc2 is CRITICAL: Last successful Puppet run was Fri 07 Feb 2014 01:03:27 PM UTC [16:06:50] mark, I can ping labnet1001 from virt1001. Is that good news or bad news? [16:11:35] PROBLEM - Puppet freshness on pc1 is CRITICAL: Last successful Puppet run was Fri 07 Feb 2014 01:10:40 PM UTC [16:15:35] PROBLEM - Puppet freshness on db9 is CRITICAL: Last successful Puppet run was Fri 07 Feb 2014 01:15:24 PM UTC [16:30:54] (CR) Alexandros Kosiaris: [C: -2] "We investigated this more with Diederik. That syntax is supported on 2.2.23 and above. We have 2.2.22 and not plans to upgrade to 2.2.23 a" [operations/puppet] - https://gerrit.wikimedia.org/r/111969 (owner: Diederik) [16:31:35] PROBLEM - Puppet freshness on pc1001 is CRITICAL: Last successful Puppet run was Fri 07 Feb 2014 01:30:57 PM UTC [16:32:35] PROBLEM - Puppet freshness on pc1003 is CRITICAL: Last successful Puppet run was Fri 07 Feb 2014 01:32:09 PM UTC [16:33:35] PROBLEM - Puppet freshness on pc3 is CRITICAL: Last successful Puppet run was Fri 07 Feb 2014 01:33:00 PM UTC [16:34:45] Could someone run a graceful restart on the mw1165 apache process? It's depooled but still segfaulting (likely in response to icinga checks) and 2 other hosts that wer having segfault issues yesterday cleared after an apache graceful restart. 
[16:35:05] that's the wrong way to handle this [16:35:14] ok [16:35:28] maybe it's bad memory, and restarting it uses different parts of memory, for example [16:35:39] it will stop segfaulting, but it will also not be a better fix [16:36:08] I just restarted in any case, to help you with the log spam :) [16:36:27] andrewbogott: back [16:36:33] Thanks, I guess ;) [16:36:36] andrewbogott: that's certainly not bad news, but that doesn't tell us much either [16:36:53] andrewbogott: so what is likely happening is that you're simply pinging labnet1001 from virt1001 across eth0 on both systems [16:36:56] the "management" interface [16:37:07] actual instances will talk across the "data" interface [16:37:20] yep, makes sense [16:37:25] and you can't actually test that with ping because virt1001 doesn't have an ip on that interface/subnet [16:37:30] which might be fine due to the way it works [16:37:34] just makes it slightly harder to test [16:37:35] PROBLEM - Puppet freshness on pc1002 is CRITICAL: Last successful Puppet run was Fri 07 Feb 2014 01:37:23 PM UTC [16:37:38] however [16:37:51] if all goes well, dhcp messages should travel across that interface and end up on labnet1001 as well [16:43:42] hm, seems like the behavior hasn't changed [16:43:59] ok i'll help debug for a little bit [16:44:18] before I move on to interesting things like ops meeting agendas and stuff ;p [16:45:05] ok so I see eth1.1118 still exists [16:45:09] we should probably remove that [16:45:26] mark: Yeah, it's getting recreated by puppet and I haven't hunted down where yet. [16:45:31] Or, I thought I had, but… clearly not. [16:45:32] the default route on virt1001 is set towards eth0 as normal [16:45:53] did you try a tcpdump on eth1.1102 when creating an instance, and if so, did you see the dhcp message there? [16:46:18] I'll do that now [16:46:24] i'll watch along then [16:46:53] anything wrong with Gerrit? 
[16:46:59] trying to clone core for a while, no result [16:47:23] * twkozlowski stuck at Cloning into 'core'... [16:47:47] mark, no traffic at all on eth1.1102 [16:47:56] indeed [16:48:01] then the problem is -likely- on virt1001 itself [16:48:31] would you expect me to need to tell neutron about eth1.1102 explicitly? [16:48:48] neutron probably not [16:48:54] neutron lives on labnet1001 doesn't it [16:49:00] which doesn't use eth1.1102 but eth4.1102 [16:49:18] effectively virt1001:eth1.1102 and labnet1001:eth4.1102 are connected together though [16:49:22] (at least if the switches are setup right) [16:49:33] but since we're not even seeing anything going out on eth1.1102, probably there's another issue first [16:49:49] Well, there's openvswitch and the openvswitch neutron plugin -- they're on virt1001 [16:49:50] or is there some neutron component on virt1001? [16:49:53] ok [16:49:54] then maybe yes [16:50:01] somehow, instances need to be connected to eth1.1102 [16:50:11] in the old labs setup, nova-network did this using bridging [16:50:22] probably something like that is still happening, or a variant thereof [16:50:28] yes, there's br-int which I can see the traffic on [16:50:37] and certainly there needs to be some config setting somewhere on virt1001 that mentions eth1.1102 ;) [16:50:44] ok let me investigate what that is [16:51:50] root@virt1001:~# brctl show br-int [16:51:50] bridge name bridge id STP enabled interfaces [16:51:50] br-int can't get info Operation not supported [16:51:55] i wonder what that interface is exactly [16:51:56] let's see [16:52:16] /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini [16:52:21] is probably the place this would be configured [16:52:54] * mark checks [16:52:56] that has a 'local_ip = 10.64.20.4' setting which I think is meant to indicate the data port [16:53:55] yeah, the doc says "local_ip = DATA_INTERFACE_IP" [16:54:36] aha [16:54:43] but we never assigned that ip to any interface did we? 
[16:54:46] which in our case the data port doesn't have an ip... [16:54:48] right. [16:54:53] so let's fix that :) [16:55:12] what do the docs say that DATA_INTERFACE is? br-int? [16:55:25] we could test this without puppet first [16:55:50] the docs don't say much :( [16:55:57] http://docs.openstack.org/trunk/install-guide/install/apt/content/install-neutron.install-plugin-compute.ovs.gre.html [16:56:34] hiii akosiaris, you there? [16:56:49] gre? [16:56:51] I think that the docs are expecting us to only have one interface on the compute node. That's the only way I can make sense of what's unsaid [16:56:51] omg [16:57:07] paravoid: um… arbitrary choice because the config is simpler. Bad? [16:57:24] ignore my comments, I have no idea about the rest of the setup [16:57:37] I was just amazed to see GRE [16:57:40] I think vlans might be better [16:58:03] http://docs.openstack.org/trunk/install-guide/install/apt/content/install-neutron.install-plug-in.ovs.vlan.html [16:58:08] that makes more sense to me anyway ;) [16:58:29] what's the current though? one VLAN per tenant? [16:58:35] the current what? [16:58:38] (I can go away if I'm just producing noise) [16:58:42] *thought [16:58:43] plan :) [16:58:56] paravoid: hopefully just one big vlan, same as in tampa. Unclear so far if neutron will let us do that. [16:58:59] the current plan is to get anything working ;) [16:59:04] Or, rather, it will let us, but unclear if we can have that and also floating ips. [16:59:17] no, stop thinking about flat network [16:59:19] that's not what we have [16:59:22] not at all [16:59:26] um... [16:59:44] But all the instances on all the tenants are all on one network, aren't they? 
[16:59:46] you could say we have a flat network behind an openstack network node/router [16:59:56] but that's really not what neutron's flat network model is [17:00:15] which also explains why floating ips can't possibly work [17:00:20] as there's nothing that can do NAT for those floating ips [17:00:25] neutron's flat network model is essentially unmanaged, right? [17:00:34] yeah, bridging out to some interface [17:00:38] i.e. the openstack suite doesn't care at all about the network [17:00:40] where the network hardware takes further care of it [17:00:42] yep [17:00:45] you do whatever you want [17:00:45] right [17:01:09] mark, is 'one big vlan' the same thing as 'flat'? [17:01:18] not in neutron speak at least [17:01:28] ok, then… [17:01:52] you should really only look at the "provider router with per-tenant network" model they describe in the docs [17:01:56] just, you said "no, stop thinking about flat network" and I don't think I said 'flat' so unsure if you're disagreeing with me or not :) [17:02:01] and once we have that working we can see if we can get to "one network for all tenants" :) [17:02:07] ok [17:02:28] if you're doing anything described in the docs for flat network (neutron terminology) we're missing several components [17:03:04] if you do one network per tenant, how do you run all these networks via the switches? [17:03:14] different broadcast domains I assume [17:03:15] via which switches? [17:03:35] via the switches that connect the compute nodes with each other and with the network node [17:03:40] via GRE or vlan tagging [17:03:52] vlan tagging, how? will we just assign a range of VLANs to openstack? [17:03:59] no need [17:04:01] QinQ ;) [17:04:07] jesus, they do QinQ? [17:04:18] i think that's what they're doing, I haven't looked at the details yet [17:04:20] but yes [17:04:28] it's rapidly getting really complex [17:06:24] i have a feeling we're just gonna deploy nova-network :P [17:08:12] well… should I just drop neutron now? 
Are you that convinced that neutron is too complex? [17:10:04] well [17:10:12] given the fact that it needs quite some network design and thinking [17:10:16] and some time to experiment [17:10:21] and you're probably unwilling to do that this weekend :) [17:10:31] and also need help from some people with a lot of network experience [17:10:34] it's looking unrealistic at this point [17:10:39] not unwilling but probably unqualified [17:10:42] yes [17:10:56] feel free to continue to experiment with it if you want, of course [17:10:57] but yes [17:10:59] well, is the goal to have it /done/ by monday, or just to have it assessed by Monday? [17:11:12] assessed, but we can only realistically do that if we know it's gonna work [17:11:17] which means we have to have it somewhat working [17:12:34] This also sort of ties into the hypothetical canonical consultant thing... [17:12:53] if we did hire them, it would be mostly for neutron. But making the call about neutron or not before we hire them is a bit backwards [17:13:08] I don't think a contractor can reasonably help with this tbh [17:13:12] it ties a lot to our networking setup [17:13:25] exactly [17:13:26] and needs access to switches etc. which we'd never give [17:13:36] this just needs a lot of planning and design [17:13:38] (as it's the same network with prod) [17:13:40] and that hasn't happened yet [17:13:55] so right now, the only thing we're comfortable doing is the old labs topology [17:14:22] (which we did a bit of design & planning on back then) [17:14:34] and i'm already quite sure that neutron doesn't work the same way [17:14:48] neutron is a lot more flexible but also adds a lot of complexity and dependencies [17:14:56] would have been fine if we had known this earlier and could work with it [17:16:06] Seems weird that they would deprecate nova-network in favor of a solution that requires a totally different level of design and setup. 
[17:16:25] i think you could do the same thing as nova-network with a different neutron plugin than openvswitch [17:16:25] perhaps [17:16:33] I mean -- I'm not saying that they won't do it. Just it seems weird that neutron doesn't have any simple paths. [17:16:41] openvswitch is just one neutron plugin, right [17:16:59] Yeah, although I don't know much about the alternatives. [17:17:02] likely the most powerful one [17:17:08] but also quite complex [17:17:39] so instead of researching those alternate plugins, it seems easier to just stick with nova-network [17:21:05] o O ( if only we had enough public IPs to do a flat network with regular, public IPs ) [17:21:18] let's do ipv6 only ;-) [17:21:22] yeah! [17:21:48] ok. I largely agree -- several meetings ago I was advocating for dropping neutron, but was outvoted. This is decision is highly subject to who is in the room at the time :) [17:21:49] who needs mysql anyway [17:22:14] At this point I think the only path that I might prefer is... [17:22:32] well I think noone seriously looked at what it would take to use neutron until this last week [17:22:34] or am I wrong there? [17:22:40] 1) try to locate an expert who has actually built a few neutron networks, and [17:22:44] 2) ask them what they think [17:22:56] the problem is [17:23:01] i'm not so worried about getting neutron to work [17:23:08] I'm sure we could do that if we spend a little bit of time on it [17:23:14] No, I think you're right -- I was probably overestimating how much Ryan had really explored the issue. [17:23:30] but whether that will work in a short time with our network and existing assumptions in existing code etc is a different matter [17:23:40] ryan isn't really very knowledgeable with networking either [17:23:44] sure. 
[17:23:50] and this subject really needs some active involvement of a network engineer [17:24:11] if I had all the time in the world to help you i'd also be less worried already [17:24:14] So, I think my line was something like "Neutron is a nice-to-have, and now that we're in a last-minute panic we should start throwing nice-to-haves overboard" [17:24:17] (Abandoned) Diederik: Enabling TLSv1.1 and TLSv1.2 on misc Apache services [operations/puppet] - https://gerrit.wikimedia.org/r/111969 (owner: Diederik) [17:24:19] Sounds like you agree with that :) [17:24:21] then at least we'd get the network part sorted, don't know about the api/mediawiki code [17:24:26] yes [17:26:14] anyway… I will try to revive some of the old nova-network bits and see if I have any better luck. I certainly wouldn't mind moving on to other more tractable parts of this project. [17:26:24] yeah [17:26:32] thanks so much for your efforts though, we've learned a lot [17:27:01] we will move to neutron at some point [17:27:06] let's at least handle that a bit better [17:27:17] and hopefully with not too many disruptions to labs users [17:27:20] Well, a lot of my time has been spent on things that are peripheral to neutron, so I don't feel like all that much is wasted. [17:27:30] it's not wasted at all [17:27:45] it's just that the result of it won't be used just yet ;) [17:27:53] It'll be a good long while before we're /forced/ to move to neutron, anyway -- so if it does disrupt labs at least it won't be right on the tail of the migration. [17:28:01] yes [17:28:08] is there a deprecation date for nova-network? [17:28:29] https://lists.launchpad.net/openstack/msg25506.html [17:28:38] deferred to 'icehouse' [17:28:51] is that the release after havana?
[17:29:08] they're ordered alphabetically [17:29:13] so, yes [17:29:15] ok [17:29:22] and then they don't immediately remove it [17:29:30] so realistically, one release after that [17:29:40] i'm not so worried [17:29:52] and I think a lot more people might have some delays migrating to neutron ;) [17:30:51] (CR) Chad: [C: 1] "Really don't know. I guess it's not used?" [operations/puppet] - https://gerrit.wikimedia.org/r/111444 (owner: Matanya) [17:30:57] openstack is really not very kiss, is it [17:31:09] they have a plugin for KISS :-) [17:31:15] ;-) I mean [17:31:20] if there are five ways to do something, they will take the most complicated one and abstract it further [17:32:03] heh, I'm trying to find docs about the eol for nova-network, found a very long soul-searching document discussing the issue but, of course, no actual prediction [17:35:34] my joke was wrong [17:35:44] paravoid: KISS will be in the next release ;-) [17:35:55] haha yeah that's even better [17:38:04] ok, this is more useful: http://docs.openstack.org/trunk/openstack-ops/content/nova-network-deprecation.html [17:38:42] insufficient testing and simplicity when used for the more straightforward use cases nova-network traditionally supported [17:38:42] "nova-network may continue to be a viable option for the next 12 to 18 months" [17:38:45] you don't say! [17:38:58] yeah, that second paragraph covers a lot of ground [17:39:01] The teams are also looking into the possibility of re-opening development on nova-network [17:40:44] haha [17:40:58] i don't mind neutron [17:41:05] but i also don't mind not being forced to use it ;) [17:43:11] ok -- going to sleep soon but I'll work on nova-network a bit tomorrow.
[17:43:26] have a good weekend [17:43:35] whether that's working or not :-) [17:43:44] (PS7) Nemo bis: Split exim stats to own class and add it to mchenry [operations/puppet] - https://gerrit.wikimedia.org/r/110524 [17:52:40] (CR) Alexandros Kosiaris: [C: 2] Split exim stats to own class and add it to mchenry [operations/puppet] - https://gerrit.wikimedia.org/r/110524 (owner: Nemo bis) [17:56:38] 1 [17:58:07] 2 [17:58:53] 3 [18:19:54] yoyo ori, tyt? [18:19:55] yt? [18:20:05] ottomata: hey [18:20:17] thinking about how to automatically run git submodule update --init in mw vagrant [18:20:45] not sure where to put it [18:20:59] what is vagrant git-update intended to do? [18:21:00] it'd try it on the host environment first [18:21:02] update mediawiki, right? [18:21:03] yeah totally [18:21:05] host env for sure [18:21:09] that's where I'd do it [18:21:16] was wondering why git-update doesn't work in host env [18:24:03] hmm [18:24:34] (CR) Manybubbles: [C: 1] "I don't believe it is used." [operations/puppet] - https://gerrit.wikimedia.org/r/111444 (owner: Matanya) [18:27:08] !log temporarily disabling puppet on mchenry and then disable collect_exim_stats_via_gmetric cron. Seems like mchenry has not ganglia at all [18:27:15] Logged the message, Master [18:34:05] hey greg-g: it seems that we categorize site outages using two different categories: Event reports and Incident documentation, see https://wikitech.wikimedia.org/wiki/Category:Events_reports [18:34:18] would it be useful to standardize on one and if so which one should that be? [18:43:35] PROBLEM - Puppet freshness on db1044 is CRITICAL: Last successful Puppet run was Fri 07 Feb 2014 12:42:15 PM UTC [18:49:24] !log reenable puppet on mchenry as well as the exim cronjob.
[18:49:32] Logged the message, Master [18:52:36] domas: yes, second [18:52:44] er domas not for you ;) [18:52:46] drdee: yes, second [18:52:55] ha [18:53:10] drdee: hysterical raisans [18:53:40] * drdee is not surprised [18:53:48] <^d> grrrit-wm: tab complete is hard [18:54:14] but https://wikitech.wikimedia.org/wiki/Incident_documentation only pulls the Event reports so that list is incomplete [18:54:35] greg-g: worth fixing? or can of worms? [18:58:23] drdee: worth fixing [18:58:34] if you feel like it, use incident documentation [18:58:40] ok, wil do that [18:58:43] thanks man [18:58:57] I think it's at a JFDI point now, previously it wasn't [18:59:22] :) [18:59:35] someone needs to write one for yesterday [18:59:43] Reedy probably [18:59:47] i wasn't around for the whole thing [19:00:42] ori: we got most of it on the etherpad... [19:00:45] * greg-g finds that url [19:01:02] oh yeah, you started it no? [19:01:03] https://etherpad.wikimedia.org/p/Feb6Outage [19:01:14] I'll try to clean it up and post it asap [19:01:21] thanks [19:01:48] thanks for starting it :) [19:04:37] !log revoked grosley's puppet cert [19:04:44] Logged the message, Master [19:05:35] PROBLEM - Puppet freshness on pc2 is CRITICAL: Last successful Puppet run was Fri 07 Feb 2014 01:03:27 PM UTC [19:06:14] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Invalid parameter nrpe_command at /etc/puppet/modules/mysql_wmf/manifests/init.pp:19 on node pc2.pmtpa.wmnet [19:07:46] introduced by https://gerrit.wikimedia.org/r/#/c/110844/ ? 
[19:09:33] (CR) Dzahn: "err: Could not retrieve catalog from remote server: Error 400 on SERVER: Invalid parameter nrpe_command at /etc/puppet/modules/mysql_wmf/m" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/110844 (owner: Matanya) [19:10:04] greg-g: checkout https://wikitech.wikimedia.org/wiki/Incident_documentation it's complete ;) [19:11:13] drdee: weeeee [19:11:17] thanks man [19:11:37] https://wikitech.wikimedia.org/wiki/Category:Events_reports is now empty [19:11:48] yw [19:12:35] PROBLEM - Puppet freshness on pc1 is CRITICAL: Last successful Puppet run was Fri 07 Feb 2014 01:10:40 PM UTC [19:15:40] (PS1) Dzahn: fix mysql process monitoring and puppet run on pc [operations/puppet] - https://gerrit.wikimedia.org/r/112050 [19:16:35] PROBLEM - Puppet freshness on db9 is CRITICAL: Last successful Puppet run was Fri 07 Feb 2014 01:15:24 PM UTC [19:17:03] (CR) Dzahn: "should be fixed Ib7411cadba127ca018bc5406ada4bf1e4643dd50" [operations/puppet] - https://gerrit.wikimedia.org/r/110844 (owner: Matanya) [19:18:45] (CR) Dzahn: [C: 2] "should fix puppet runs on all the mysql servers introduced by monitoring switch" [operations/puppet] - https://gerrit.wikimedia.org/r/112050 (owner: Dzahn) [19:19:55] RECOVERY - Puppet freshness on pc2 is OK: puppet ran at Fri Feb 7 19:19:50 UTC 2014 [19:20:26] (CR) Dzahn: "and it does.
<+icinga-wm> RECOVERY - Puppet freshness on pc2 is OK: puppet ran at Fri Feb 7 19:19:50 UTC 2014" [operations/puppet] - https://gerrit.wikimedia.org/r/112050 (owner: Dzahn) [19:26:05] RECOVERY - Puppet freshness on db9 is OK: puppet ran at Fri Feb 7 19:26:03 UTC 2014 [19:27:04] !log running puppet on db9,pc1,pc2 etc, fixes mysqld process monitoring [19:27:15] Logged the message, Master [19:27:45] RECOVERY - Puppet freshness on pc1 is OK: puppet ran at Fri Feb 7 19:27:39 UTC 2014 [19:31:03] dr0ptp4kt: can you confirm that the recommendations on https://wikitech.wikimedia.org/wiki/Incident_documentation/20131030-Wikidata have been implemented and also give me the correct title for that page? [19:31:05] RECOVERY - Puppet freshness on pc1001 is OK: puppet ran at Fri Feb 7 19:31:02 UTC 2014 [19:31:46] dr0ptp4kt: 's/Wikidata/Zero/g' ? [19:32:09] regex replace :) can't talk right now. will email myself reminder [19:32:19] or yurik: ^ [19:32:23] well it's a wiki title :) [19:32:25] RECOVERY - Puppet freshness on pc1003 is OK: puppet ran at Fri Feb 7 19:32:24 UTC 2014 [19:32:55] RECOVERY - Puppet freshness on pc3 is OK: puppet ran at Fri Feb 7 19:32:44 UTC 2014 [19:33:05] or drdee should be able to move the page under a different name [19:33:41] I am, but want to make sure it gets the right title now [19:37:05] RECOVERY - Puppet freshness on pc1002 is OK: puppet ran at Fri Feb 7 19:36:57 UTC 2014 [19:41:32] (CR) Dzahn: [C: 1] sudoers: remove two files, seems not to be used anywhere [operations/puppet] - https://gerrit.wikimedia.org/r/111444 (owner: Matanya) [19:41:55] RECOVERY - Puppet freshness on db1044 is OK: puppet ran at Fri Feb 7 19:41:45 UTC 2014 [20:00:48] (CR) Dzahn: [C: -1] "not a good merge for wmf :(" [operations/puppet] - https://gerrit.wikimedia.org/r/111960 (owner: Ryan Lane) [20:05:51] (CR) Mark Bergsma: [C: -1] Revoking my access [operations/puppet] - https://gerrit.wikimedia.org/r/111960 (owner: Ryan Lane) [20:06:22]
PROBLEM - MySQL disk space on db1011 is CRITICAL: NRPE: Command check_mysql not defined [20:06:22] PROBLEM - MySQL disk space on es1006 is CRITICAL: NRPE: Command check_mysql not defined [20:06:32] PROBLEM - MySQL disk space on es7 is CRITICAL: NRPE: Command check_mysql not defined [20:06:32] PROBLEM - MySQL disk space on db1020 is CRITICAL: NRPE: Command check_mysql not defined [20:06:32] PROBLEM - MySQL disk space on es1008 is CRITICAL: NRPE: Command check_mysql not defined [20:06:32] PROBLEM - MySQL disk space on es1009 is CRITICAL: NRPE: Command check_mysql not defined [20:06:32] PROBLEM - MySQL disk space on db72 is CRITICAL: NRPE: Command check_mysql not defined [20:06:42] This should really have a better redirect http://static.wikipedia.org/ [20:06:42] PROBLEM - MySQL disk space on es1007 is CRITICAL: NRPE: Command check_mysql not defined [20:06:42] PROBLEM - MySQL disk space on db1033 is CRITICAL: NRPE: Command check_mysql not defined [20:06:42] PROBLEM - MySQL disk space on db68 is CRITICAL: NRPE: Command check_mysql not defined [20:06:42] PROBLEM - MySQL disk space on db67 is CRITICAL: NRPE: Command check_mysql not defined [20:06:42] PROBLEM - MySQL disk space on db38 is CRITICAL: NRPE: Command check_mysql not defined [20:06:43] PROBLEM - MySQL disk space on db9 is CRITICAL: NRPE: Command check_mysql not defined [20:06:43] PROBLEM - MySQL disk space on db74 is CRITICAL: NRPE: Command check_mysql not defined [20:06:43] PROBLEM - MySQL disk space on db1001 is CRITICAL: NRPE: Command check_mysql not defined [20:06:44] PROBLEM - MySQL disk space on db1002 is CRITICAL: NRPE: Command check_mysql not defined [20:06:45] PROBLEM - MySQL disk space on db1042 is CRITICAL: NRPE: Command check_mysql not defined [20:06:45] PROBLEM - MySQL disk space on db1051 is CRITICAL: NRPE: Command check_mysql not defined [20:06:46] PROBLEM - MySQL disk space on es1002 is CRITICAL: NRPE: Command check_mysql not defined [20:06:46] PROBLEM - MySQL disk space on db1028 is CRITICAL: 
NRPE: Command check_mysql not defined [20:06:46] PROBLEM - MySQL disk space on pc1003 is CRITICAL: NRPE: Command check_mysql not defined [20:06:47] PROBLEM - MySQL disk space on es8 is CRITICAL: NRPE: Command check_mysql not defined [20:06:48] PROBLEM - MySQL disk space on es4 is CRITICAL: NRPE: Command check_mysql not defined [20:06:48] PROBLEM - MySQL disk space on db69 is CRITICAL: NRPE: Command check_mysql not defined [20:06:49] PROBLEM - MySQL disk space on pc3 is CRITICAL: NRPE: Command check_mysql not defined [20:06:52] PROBLEM - MySQL disk space on db63 is CRITICAL: NRPE: Command check_mysql not defined [20:06:52] PROBLEM - MySQL disk space on db1019 is CRITICAL: NRPE: Command check_mysql not defined [20:06:52] PROBLEM - MySQL disk space on db1031 is CRITICAL: NRPE: Command check_mysql not defined [20:06:52] PROBLEM - MySQL disk space on db1010 is CRITICAL: NRPE: Command check_mysql not defined [20:06:52] PROBLEM - MySQL disk space on db1016 is CRITICAL: NRPE: Command check_mysql not defined [20:06:53] PROBLEM - MySQL disk space on db1005 is CRITICAL: NRPE: Command check_mysql not defined [20:06:53] PROBLEM - MySQL disk space on db1024 is CRITICAL: NRPE: Command check_mysql not defined [20:06:54] PROBLEM - MySQL disk space on db1026 is CRITICAL: NRPE: Command check_mysql not defined [20:06:54] PROBLEM - MySQL disk space on db1030 is CRITICAL: NRPE: Command check_mysql not defined [20:06:55] PROBLEM - MySQL disk space on db1036 is CRITICAL: NRPE: Command check_mysql not defined [20:06:55] PROBLEM - MySQL disk space on db1047 is CRITICAL: NRPE: Command check_mysql not defined [20:06:56] PROBLEM - MySQL disk space on db1037 is CRITICAL: NRPE: Command check_mysql not defined [20:06:56] PROBLEM - MySQL disk space on db1041 is CRITICAL: NRPE: Command check_mysql not defined [20:06:57] PROBLEM - MySQL disk space on db1060 is CRITICAL: NRPE: Command check_mysql not defined [20:06:57] PROBLEM - MySQL disk space on es1003 is CRITICAL: NRPE: Command check_mysql not 
defined [20:06:58] PROBLEM - MySQL disk space on db1059 is CRITICAL: NRPE: Command check_mysql not defined [20:06:58] PROBLEM - MySQL disk space on es1010 is CRITICAL: NRPE: Command check_mysql not defined [20:06:58] PROBLEM - MySQL disk space on db1022 is CRITICAL: NRPE: Command check_mysql not defined [20:06:59] PROBLEM - MySQL disk space on db1055 is CRITICAL: NRPE: Command check_mysql not defined [20:07:00] PROBLEM - MySQL disk space on db1006 is CRITICAL: NRPE: Command check_mysql not defined [20:07:00] PROBLEM - MySQL disk space on db1046 is CRITICAL: NRPE: Command check_mysql not defined [20:07:01] PROBLEM - MySQL disk space on db1058 is CRITICAL: NRPE: Command check_mysql not defined [20:07:01] PROBLEM - MySQL disk space on pc1002 is CRITICAL: NRPE: Command check_mysql not defined [20:07:02] PROBLEM - MySQL disk space on db1003 is CRITICAL: NRPE: Command check_mysql not defined [20:07:02] PROBLEM - MySQL disk space on db1004 is CRITICAL: NRPE: Command check_mysql not defined [20:07:02] :p [20:07:03] PROBLEM - MySQL disk space on es1001 is CRITICAL: NRPE: Command check_mysql not defined [20:07:03] PROBLEM - MySQL disk space on db1007 is CRITICAL: NRPE: Command check_mysql not defined [20:07:04] PROBLEM - MySQL disk space on db1049 is CRITICAL: NRPE: Command check_mysql not defined [20:07:04] PROBLEM - MySQL disk space on db1021 is CRITICAL: NRPE: Command check_mysql not defined [20:07:05] PROBLEM - MySQL disk space on db48 is CRITICAL: NRPE: Command check_mysql not defined [20:07:12] PROBLEM - MySQL disk space on db1018 is CRITICAL: NRPE: Command check_mysql not defined [20:07:12] PROBLEM - MySQL disk space on db1040 is CRITICAL: NRPE: Command check_mysql not defined [20:07:12] PROBLEM - MySQL disk space on db1027 is CRITICAL: NRPE: Command check_mysql not defined [20:07:12] PROBLEM - MySQL disk space on db1023 is CRITICAL: NRPE: Command check_mysql not defined [20:07:12] PROBLEM - MySQL disk space on db1029 is CRITICAL: NRPE: Command check_mysql not defined 
[20:07:13] PROBLEM - MySQL disk space on es1005 is CRITICAL: NRPE: Command check_mysql not defined [20:07:13] PROBLEM - MySQL disk space on db1048 is CRITICAL: NRPE: Command check_mysql not defined [20:07:14] PROBLEM - MySQL disk space on pc1001 is CRITICAL: NRPE: Command check_mysql not defined [20:07:14] PROBLEM - MySQL disk space on db71 is CRITICAL: NRPE: Command check_mysql not defined [20:07:15] PROBLEM - MySQL disk space on pc1 is CRITICAL: NRPE: Command check_mysql not defined [20:07:15] PROBLEM - MySQL disk space on db1015 is CRITICAL: NRPE: Command check_mysql not defined [20:07:16] PROBLEM - MySQL disk space on db1043 is CRITICAL: NRPE: Command check_mysql not defined [20:07:16] PROBLEM - MySQL disk space on db1044 is CRITICAL: NRPE: Command check_mysql not defined [20:07:17] PROBLEM - MySQL disk space on db1050 is CRITICAL: NRPE: Command check_mysql not defined [20:07:17] PROBLEM - MySQL disk space on db1052 is CRITICAL: NRPE: Command check_mysql not defined [20:07:17] PROBLEM - MySQL disk space on db1045 is CRITICAL: NRPE: Command check_mysql not defined [20:07:18] PROBLEM - MySQL disk space on db1035 is CRITICAL: NRPE: Command check_mysql not defined [20:07:19] PROBLEM - MySQL disk space on es1004 is CRITICAL: NRPE: Command check_mysql not defined [20:07:19] PROBLEM - MySQL disk space on db35 is CRITICAL: NRPE: Command check_mysql not defined [20:07:19] PROBLEM - MySQL disk space on db73 is CRITICAL: NRPE: Command check_mysql not defined [20:07:22] PROBLEM - MySQL disk space on db1038 is CRITICAL: NRPE: Command check_mysql not defined [20:07:22] PROBLEM - MySQL disk space on db1017 is CRITICAL: NRPE: Command check_mysql not defined [20:07:22] PROBLEM - MySQL disk space on db1039 is CRITICAL: NRPE: Command check_mysql not defined [20:07:22] PROBLEM - MySQL disk space on pc2 is CRITICAL: NRPE: Command check_mysql not defined [20:08:13] (PS2) Dzahn: add salt grains automatically in system::role [operations/puppet] - 
https://gerrit.wikimedia.org/r/107831 [20:08:57] ^ that must be another problem introduced by the switch in NRPE monitoring in mysql module merged earlier today .. sigh [20:12:04] oh, heh, whitespace in file names.?? [20:12:06] check_mysql disk space.cfg: ASCII text [20:14:25] (PS3) Dzahn: add salt grains automatically in system::role [operations/puppet] - https://gerrit.wikimedia.org/r/107831 [20:18:31] (Abandoned) Dzahn: add salt grains for applicationserver roles [operations/puppet] - https://gerrit.wikimedia.org/r/83768 (owner: Dzahn) [20:22:04] i just noticed that md1 is rebuilding on virt11, and icinga isn't aware. 4 days until complete at this rate, ouch. [20:25:12] (PS1) Dzahn: fix whitespace in mysql nrpe check command [operations/puppet] - https://gerrit.wikimedia.org/r/112060 [20:25:58] jgage: the RAID check didn't see it you say? [20:26:03] right [20:26:13] maybe it stopped when the rebuild started? [20:26:19] stopped showing it i mean [20:26:47] seems like it [20:26:47] OK: Active: 16, Working: 16, Failed: 0, Spare: 0 [20:26:55] jgage: this should fix all the icinga-wm above [20:26:58] https://gerrit.wikimedia.org/r/#/c/112060/ [20:27:14] file "check_mysql disk space.cfg" bah :p [20:27:40] nngh whitespace [20:27:42] jgage: yea, that's what i thought, it just alarms when there is Failed > 0 [20:27:56] but rebuilding != failed ? hmmm [20:28:17] (CR) Gage: [C: 2] fix whitespace in mysql nrpe check command [operations/puppet] - https://gerrit.wikimedia.org/r/112060 (owner: Dzahn) [20:29:01] :) [20:29:13] i'll make a ticket about mdstat rebuild monitoring [20:29:53] cool [20:30:01] i saw you made that master ticket:) [20:30:06] improve monitoring [20:30:17] so i'll add you as reviewer if topic branch = monitoring, k?:) [20:31:02] yes please :) [20:31:33] i wanted to call that ticket "make all monitoring perfect forever" [20:31:44] great!
so i merged that on palladium and running puppet on pc2 [20:33:18] jgage: just one thing about that ticket if you use that name, it will actually never be possible to resolve it [20:33:25] but that may be ok [20:34:09] (CR) Dzahn: "Nrpe::Check[check_mysql_disk_space]/File[/etc/nagios/nrpe.d/check_mysql_disk_space.cfg]/ensure: created" [operations/puppet] - https://gerrit.wikimedia.org/r/112060 (owner: Dzahn) [20:34:46] really that ticket is just a kludge because i couldn't figure out how to create a new tag in RT [20:35:41] jgage: i do the same, i actually like the tracking tickets [20:35:57] i'd leave it like that, and you can always rename [20:47:36] expects recoveries [20:48:19] RECOVERY - MySQL disk space on pc2 is OK: DISK OK [20:48:19] RECOVERY - MySQL disk space on pc1 is OK: DISK OK [20:48:38] RECOVERY - MySQL disk space on pc1001 is OK: DISK OK [20:48:49] RECOVERY - MySQL disk space on pc1003 is OK: DISK OK [20:48:49] RECOVERY - MySQL disk space on pc3 is OK: DISK OK [20:48:56] yay [20:48:58] RECOVERY - MySQL disk space on pc1002 is OK: DISK OK [20:51:00] :) good, yep, that needed neon to finish the run [20:51:15] as opposed to the monitored hosts for the first fix [20:52:43] (CR) Dzahn: "12:55 <+icinga-wm> RECOVERY - MySQL disk space on pc2 is OK: DISK OK" [operations/puppet] - https://gerrit.wikimedia.org/r/112060 (owner: Dzahn) [21:05:00] (PS1) Ottomata: Adding alerts for Kafka Broker replication metrics [operations/puppet] - https://gerrit.wikimedia.org/r/112065 [21:05:01] RECOVERY - MySQL disk space on db1044 is OK: DISK OK [21:10:33] (CR) Ottomata: [C: 2 V: 2] Adding alerts for Kafka Broker replication metrics [operations/puppet] - https://gerrit.wikimedia.org/r/112065 (owner: Ottomata) [21:11:17] ori: maybe we should just run git submodule update --init as a hook [21:11:23] after merge or something?
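The hook idea raised just above could be sketched roughly as follows. This is only an illustration, not what mediawiki-vagrant actually does: it assumes a Python script saved as `.git/hooks/post-merge` and made executable, and the function names `submodule_update_command` and `main` are made up for the example. Git runs a `post-merge` hook after a successful `git merge` or `git pull`, with the working directory set to the top of the working tree.

```python
#!/usr/bin/env python3
"""Hypothetical post-merge hook: refresh submodules after each merge or pull."""
import os
import subprocess
import sys


def submodule_update_command(recursive=False):
    """Build the git command line; --recursive is optional."""
    cmd = ["git", "submodule", "update", "--init"]
    if recursive:
        cmd.append("--recursive")
    return cmd


def main(run=subprocess.call):
    """Run the update and return git's exit status.

    `run` is injectable so the function can be exercised without git.
    """
    return run(submodule_update_command())


# Only act when installed under the hook name (an assumption of this sketch),
# so importing or testing the file never touches a repository.
if __name__ == "__main__" and os.path.basename(sys.argv[0]) == "post-merge":
    sys.exit(main())
```

A post-merge hook cannot affect the outcome of the merge itself, so a failing submodule update would only show up as the hook's exit status and git's own error output.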
[21:33:04] 1311 line 99 of /usr/local/apache/common-local/php-1.23wmf13/extensions/Flow/includes/Model/Workflow.php: Interwiki to enwiki not implemented [21:33:14] hrm, 1311 exceptions [21:33:53] AaronSchulz: sorry about that, we probably could have just not logged the error. [21:34:07] AaronSchulz: it will go away soon, we have a refactor necessary to handle the error better than just catch/log [21:34:08] * AaronSchulz hates interwikis [21:45:54] !log catrope synchronized php-1.23wmf13/extensions/MobileFrontend/MobileFrontend.php 'Feature flag for Minerva Beta Feature' [21:46:02] Logged the message, Master [21:46:43] !log catrope synchronized php-1.23wmf13/extensions/MobileFrontend/includes/MobileFrontend.hooks.php 'Use feature flag for Minerva Beta Feature' [21:46:51] Logged the message, Master [21:49:29] greg-g: {{done}} [21:49:45] greg-g: (Hiding from wm-bot by doing it in this channel instead.) [21:50:15] Reedy: is there a bug for all the EP duplicate key errors that happen in ORMRow? [21:52:23] domas: that Special:Contributions namespace filter gets evil pretty easily [21:52:54] * AaronSchulz wonders what to do about that [21:53:13] James_F: :) [22:36:03] (PS3) Dzahn: sudoers: remove sudoers.nrpe_fr & sudoers.search [operations/puppet] - https://gerrit.wikimedia.org/r/111444 (owner: Matanya) [22:36:13] AaronSchulz: heh [22:36:55] (CR) Dzahn: [C: 2] "no manifests use these files, only sudoers.default, sudoers.appserver and sudoers.drupal_fundraising are actually sourced. (grep -r "fil" [operations/puppet] - https://gerrit.wikimedia.org/r/111444 (owner: Matanya) [22:39:01] (CR) Dzahn: "no manifests use these files, only sudoers.default, sudoers.appserver and sudoers.drupal_fundraising are actually sourced. 
(grep -r "fil" [operations/puppet] - https://gerrit.wikimedia.org/r/111444 (owner: Matanya) [22:45:41] (CR) Dzahn: "duplicate of I9b0e7f15929aae97fb39650dd17e662642dabba6" [operations/puppet] - https://gerrit.wikimedia.org/r/109074 (owner: Matanya) [23:49:03] phuedx: ping, got a question re: your shell request. do you already have a labs user? [23:50:17] phuedx: ok, you do, fairly obvious, cool, then i can go ahead and use that UID and make a change [23:55:55] (PS1) BryanDavis: logstash: Add normalized_message field to all events [operations/puppet] - https://gerrit.wikimedia.org/r/112149