[03:50:08] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 13 seconds
[03:50:42] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[03:53:24] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours
[03:53:24] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[03:53:24] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[03:53:24] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours
[03:55:30] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.015 second response time on port 11000
[06:03:40] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[06:48:57] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[06:48:57] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[07:06:17] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[08:16:19] hello :)
[08:25:27] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[08:25:54] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 184 seconds
[08:30:06] moin
[08:47:31] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[09:36:24] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 8 seconds
[12:44:55] PROBLEM - Host ms-be3 is DOWN: PING CRITICAL - Packet loss = 100%
[13:12:28] hashar: do you know who can help us with wikidata ssl bug?
[13:12:29] https://bugzilla.wikimedia.org/show_bug.cgi?id=41437
[13:12:45] and squid stuff
[13:14:24] New patchset: Jgreen; "add jgreen nagios admin privs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30580
[13:15:33] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30580
[13:15:57] aude: ops I guess :-]
[13:16:23] aude: it sounds like the wikidata.org virtual domain is wrongly configured in apache
[13:16:38] New patchset: Demon; "Disabling wikidata.org for SUL until we fix SSL" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30581
[13:16:38] which ops person?
[13:17:04] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30581
[13:17:36] ^demon: thanks for looking at this
[13:17:43] <^demon> No problem.
[13:18:14] do we also have to disable the central auth icon?
[13:18:15] aude: I am not sure. I guess wikidata using https pass via an nginx proxy which terminates the SSL connection. I have no idea who is the nginx xpert
[13:18:31] hmmm
[13:19:55] I have no knowledge about nginx setup, most probably something in manifests/protoproxy.pp
[13:19:55] !log demon synchronized wmf-config/CommonSettings.php 'Disabling wikidata for SUL until we fix SSL'
[13:20:10] Logged the message, Master
[13:20:25] <^demon> aude: The logo is what does the remote login. We don't want the logo to show until SSL is fixed.
[13:20:49] agree
[13:22:16] <^demon> You don't have to disable the icon setting. It won't show the icon if the domain isn't configured.
[13:22:28] <^demon> (Just confirmed by logging out/in)
[13:23:59] ok
[13:24:45] <^demon> Who was configuring ULS earlier?
[13:24:51] <^demon> http://p.defau.lt/?RBwd_BU7aoUWkgQVlgqQcA was uncommitted on fenari :\
[13:24:54] reedy
[13:24:55] i think
[13:25:05] * ^demon glares at Reedy
[13:25:16] yes reedy
[13:25:35] we have some issues with uls and squid
[13:26:03] https://bugzilla.wikimedia.org/show_bug.cgi?id=41451
[13:29:09] Change abandoned: CSteipp; "Added by Chad" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28862
[13:29:34] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:30:47] PROBLEM - Apache HTTP on mw24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:30:55] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:31:17] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.966 second response time
[13:32:17] RECOVERY - Apache HTTP on mw24 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time
[13:32:25] RECOVERY - Apache HTTP on mw21 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.269 second response time
[13:42:24] New patchset: Jgreen; "attempting to put aluminium,grosley,erzurumi,loudon in "fundraising" nagios host group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30583
[13:43:01] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30583
[13:54:01] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours
[13:54:01] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[13:54:01] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours
[13:54:01] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[13:55:31] New patchset: Hydriz; "(bug 40474) Set Transwiki namespace on Chinese (zh) wikimedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30585
[14:17:59] Who would be able to see if an IP was being denied access to sites? There is an email on OTRS from an ISP saying they have a number of complaints.
[14:18:14] > 121.54.13.51
[14:18:14] > 121.54.2.188
[14:18:23] Ticket 2012102210000612
[14:20:24] not at the squid level
[14:20:53] can they view the material but not edit?
[14:21:43] From what I gather, they can't load the site
[14:21:52] I always try to make sure it's not just an editing block
[14:22:12] well like I say not at the squid level
[14:22:22] OK. Thanks
[14:22:24] sure
[14:29:44] New patchset: Hydriz; "(bug 38134) Enable Extension:GoogleNewsSitemap on es wikinews" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30589
[14:54:19] New patchset: Silke Meyer; "Added puppet files for Wikidata on labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30593
[14:57:34] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:59:08] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.881 second response time
[15:10:36] New patchset: Mark Bergsma; "Add stanzas for new esams cp and object storage servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30595
[15:29:59] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30595
[15:39:07] New review: jan; "I would create a new module like "mediawiki" for this with wikidata as a part. This would be much ni..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/30593
[15:39:52] PROBLEM - Host db42 is DOWN: PING CRITICAL - Packet loss = 100%
[15:40:11] !log reedy synchronized php-1.21wmf3 'Initial file sync-out'
[15:40:24] Logged the message, Master
[15:41:15] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: test2wiki to 1.21wmf3
[15:41:28] Logged the message, Master
[15:41:39] New patchset: Mark Bergsma; "Add cp30*" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30600
[15:42:52] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30600
[15:46:43] !log reedy Started syncing Wikimedia installation... : Rebuilding localisation cache for 1.21wmf3
[15:47:03] Logged the message, Master
[15:47:47] :)
[15:49:52] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100%
[15:50:31] !log reedy synchronized php-1.21wmf3/extensions/EventLogging/
[15:50:44] Logged the message, Master
[15:51:20] !log reedy synchronized php-1.21wmf3/extensions/PostEdit/
[15:51:22] RECOVERY - Host db42 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[15:51:33] Logged the message, Master
[15:52:12] * aude can't stay logged into wikidata at the moment
[15:52:35] !log Running sync-common on searchidx1001
[15:52:47] Logged the message, Master
[15:54:25] !log reedy synchronized wmf-config/
[15:54:34] Logged the message, Master
[15:56:11] Reedy: did you just take down test2?
[15:56:17] Yeah
[15:56:22] Scap takes an age
[15:56:24] OK
[15:56:29] zeljkof: ^^
[15:56:30] And we have to have a wiki on a version to get a cache built
[15:56:32] shouldn't be too long
[15:56:51] thanks
[15:57:06] Reedy, chrismcmahon: thanks for letting me know :)
[15:57:15] if you're lucky, on refresh you might hit an apache with the cache ;)
[15:57:36] I was just scanning the irc channels trying to figure out if it is a known problem
[15:57:37] Does anyone have any objections to using "y" as wikivoyage interwiki prefix?
[15:59:42] zeljkof: either here or wikimedia-tech usually has that information, and it's almost always Reedy 's fault. :)
[15:59:55] chrismcmahon: good to know :)
[16:01:43] !log reedy Finished syncing Wikimedia installation... : Rebuilding localisation cache for 1.21wmf3
[16:01:56] Logged the message, Master
[16:03:47] New patchset: Silke Meyer; "Added puppet files for Wikidata on labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30593
[16:04:46] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[16:06:30] New review: Silke Meyer; "addressed comments by Jan concerning tabs etc." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/30593
[16:08:08] !log reedy Started syncing Wikimedia installation... : Rebuilding localisation cache for 1.21wmf3
[16:08:22] Logged the message, Master
[16:09:51] New patchset: Mark Bergsma; "Make cp3019-22 bits servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30607
[16:12:29] !log reedy Started syncing Wikimedia installation... : Rebuilding localisation cache for 1.21wmf3
[16:12:44] Logged the message, Master
[16:23:55] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[16:24:55] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30607
[16:26:58] New patchset: Mark Bergsma; "Add cp3019-3022 as new bits servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30609
[16:27:15] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30609
[16:27:51] Ganglia is broken
[16:27:51] There was an error collecting ganglia data (127.0.0.1:8654): XML error: Invalid document end at 1
[16:30:42] !log mark synchronized wmf-config/CommonSettings.php 'Add new bits server IPs'
[16:30:55] Logged the message, Master
[16:32:46] New patchset: Mark Bergsma; "Only make the first two hosts Ganglia aggregators" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30611
[16:33:15] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30611
[16:38:12] !log reedy synchronized php-1.21wmf3/cache/l10n/ 'Sync'
[16:38:24] why didn't watchmouse detect dead nagios ?
[16:38:24] Logged the message, Master
[16:41:25] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikidatawiki to 1.21wmf3
[16:41:38] Logged the message, Master
[16:42:48] Jeff_Green: nagios is upset about not having a @monitor_group defined for fundraising_pmtpa
[16:43:05] well well well
[16:43:38] for host erzurumi
[16:43:38] !log reedy synchronized php-1.21wmf3/includes/Revision.php
[16:43:42] I was trying to set up a new group but it took forever for puppet to get that to nagios
[16:43:48] Logged the message, Master
[16:44:01] !log reedy synchronized php-1.21wmf3/includes/WikiPage.php
[16:44:15] Logged the message, Master
[16:44:38] LeslieCarr: where are you seeing the error?
[16:44:38] so i don't see a @monitor_group statement in manifests here
[16:45:14] well nagios died on spence, so /etc/nagios/puppet_hosts.cfg line 7868
[16:45:23] blargh
[16:45:47] which is the first time ez (i can't type that name out all the time, my fingers get tangled) is "mentioned"
[16:46:23] so what I did was to add $cluster and $nagios_group to several hosts in site.pp
[16:46:26] but, you could make a monitor_group statement in site.pp
[16:46:27] and then puppetblast that
[16:46:41] it's not my favorite place, however since you don't appear to have any role statements
[16:46:52] New review: Faidon; "I think all of these classes are an overkill. Can't we have a php::extension definition that install..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/29975
[16:47:01] I can add role statements too
[16:47:07] actually you could use this opportunity to make a role/fundraising.pp , make a comment that it's mostly in private repos, then put the monitor_group statement there
[16:47:14] hehehe, brain sync :)
[16:47:45] my brain can't be sync'd today, i'm underslept
[16:49:00] oh noes
[16:49:59] let's do a quick fix first and then I'll go back and role-ify
[16:50:36] !log reedy synchronized php-1.21wmf3/includes/
[16:50:49] Logged the message, Master
[16:52:02] cool, you got it or want me to fix'er up
[16:52:19] if i understand correctly I can just do like:
[16:52:54] @monitor_group { "${cluster}_${::site}": description => "${cluster} ${::site}"}
[16:52:54] right where I've got $nagios_group right?
[16:53:03] if that's right, I got it
[16:54:25] yeah
[16:54:32] k.
[16:54:41] woot
[16:55:04] I'm trying to get to a point where I can separate out the fundraising boxes for an additional notification victim
[16:55:17] b/c fr-tech wants to suffer the pains
[16:57:01] New patchset: Jgreen; "adding monitor_group for fundraising_* boxes, temporarily in site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30613
[16:57:59] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30613
[16:58:52] now we wait 4E29 seconds for puppet to get the changes onto the nagios box...
[17:06:24] is anyone looking at the nagios outage?
[17:06:32] paravoid: I think jeff and leslie
[17:06:40] Jeff_Green: LeslieCarr want help?
[17:07:38] yep
[17:07:41] !log reedy synchronized php-1.21wmf2/cache/interwiki.cdb 'Updating interwiki cache'
[17:07:42] we are, it was the monitor group
[17:07:53] Logged the message, Master
[17:08:00] !log reedy synchronized php-1.21wmf3/cache/interwiki.cdb 'Updating interwiki cache'
[17:08:09] LeslieCarr: where are you on it currently?
[17:08:12] Logged the message, Master
[17:08:22] i'm waiting for puppetd -tv to blast the config onto spence
[17:08:43] Jeff_Green: want me to fix by hand until that finishes?
[17:09:11] actually, nvm, you got this under control :)
[17:10:50] if you want to that's fine
[17:11:35] i'm working on a new role/fundraising.pp while we wait for the magic of efficient puppet power
[17:12:09] heh
[17:21:33] New patchset: Jgreen; "moving fundraising config out of site.pp into a new role/fundraising.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30620
[17:23:02] mark: sq37 is back up...replaced the controller card
[17:26:51] New review: Mark Bergsma; "This is the correct fix." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/24797
[17:26:51] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24797
[17:27:37] nagios still doesn't like the config
[17:30:53] New patchset: Jgreen; "moving fundraising config out of site.pp into a new role/fundraising.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30620
[17:34:03] RECOVERY - Host sq37 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms
[17:34:28] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30620
[17:34:48] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[17:34:48] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[17:34:48] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[17:35:15] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.006 second response time on port 11000
[17:49:32] New patchset: Demon; "Revert "Disabling wikidata.org for SUL until we fix SSL"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30623
[17:49:51] Can someone please graceful all the apaches?
[17:49:56] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30623
[17:50:10] mutante pushed a CORS change for me on friday, but as it was via puppet, it'd take ages to propogate...
[17:50:59] catrope is doing a graceful restart of all apaches
[17:51:07] !log catrope gracefulled all apaches
[17:51:13] Well, "all"
[17:51:20] notpeter: still up for helping with the puppet/nagios? i have a fail-to-understand
[17:51:20] Logged the message, Master
[17:51:20] mw1: System failed sanity check: VIP not configured on lo
[17:51:23] For mw1-mw15
[17:51:36] srv283: 29 Oct 17:50:58 ntpdate[30669]: no server suitable for synchronization found
[17:51:37] srv283: Error: unable to contact NTP server
[17:51:49] Also for srv277, srv275, mw34, mw59
[17:52:01] notpeter: Are the failures above related to precise maybe? --^^^
[17:52:17] !log demon synchronized wmf-config/CommonSettings.php 'Turning SUL back on for wikidata.org'
[17:52:29] Logged the message, Master
[17:52:50] RoanKattouw: those are now only jobrunners, thus don't have vips
[17:53:06] as for ntp, nfi
[17:53:15] Jeff_Green: sure
[17:53:20] what's going on
[17:53:33] Ah so wikidata SSL is working now
[17:54:11] the issue is getting puppet to create entries for puppet_hostgroups.cfg and puppet_servicegroups.cfg
[17:54:25] <^demon> Krenair: For wikidata.org & www.wikidata.org. Lang subdomains need a little further tweaking.
[17:54:30] ok
[17:54:55] notpeter: Ah, I see, so mw1-15 aren't running Apache any more?
[17:55:24] notpeter: In that case, could you update /etc/dsh/group/apaches ?
[17:55:26] notpeter: I added @monitor_group { "${cluster}_${::site}": description => "${cluster}_${::site}"} to each of the classes in role/fundraising.pp that are used, as far as I can tell that's what should be necessary, but puppet didn't update those two files
[17:55:38] RoanKattouw: sure
[17:55:40] Awesome
[17:56:45] done. sorry about that. didn't know that that group was used
[17:57:11] what are the names of the hosts in question?
[17:57:52] Jeff_Green: er, wait
[17:57:56] nagios is running
[17:57:59] what is the goal here
[17:58:11] it's running because I hand-edited the config
[17:58:22] ah, ok :)
[17:58:37] goal is to add fundraising-{realm} groups so I can work on split notification
[17:58:44] fr-tech wants to get SMS when things explode
[17:59:38] what is a host that's having issues?
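The recurring failure mode in this stretch of the log is nagios dying on spence because a puppet-generated config referenced a hostgroup (fundraising_pmtpa) that did not exist, and the fix being hand-edits to the live config. A way to avoid that is to verify the generated config before bouncing the daemon. The wrapper below is a hypothetical sketch (the `preflight_restart` helper name and the paths in the comment are assumptions, not anything from the log):

```shell
# Hypothetical preflight wrapper: run a config-check command and only
# restart the service if the check passes, so a bad puppet-generated
# config (e.g. a missing hostgroup) cannot take down the running daemon.
preflight_restart() {
  check="$1"; restart="$2"
  if $check; then
    $restart && echo "restarted"
  else
    echo "config check failed; leaving running service alone" >&2
    return 1
  fi
}

# Real-world usage would look something like (paths are assumptions):
#   preflight_restart 'nagios -v /etc/nagios/nagios.cfg' '/etc/init.d/nagios restart'
preflight_restart true 'echo restarting'
```

Nagios itself ships a config-verification mode (`nagios -v <config file>`) that exits non-zero on errors, which is what the first argument would be in practice.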
[18:01:35] strictly speaking, spence
[18:01:39] * aude needs help with setting up access-origin with wikidata and ULS
[18:01:42] https://bugzilla.wikimedia.org/show_bug.cgi?id=41489
[18:01:50] anyone can help us, please? :)
[18:01:58] Jeff_Green: sure...
[18:02:06] what was it barfing on
[18:02:21] notpeter: sec. dialing in
[18:02:50] ottomata ?
[18:03:06] notpeter: something like "no hostgroup in puppet_hostgroups.cfg for fundraising_pmtpa" for example
[18:03:20] I added them by hand
[18:03:34] ooo!
[18:03:34] sorry
[18:04:07] aude: I think I've found it
[18:04:12] !log reedy synchronized wmf-config/CommonSettings.php 'Add wikidata to wgCrossSiteAJAXdomains'
[18:04:18] aude: ^^
[18:04:24] Logged the message, Master
[18:06:50] aude: Looks better in a chrome incognito window now...
[18:07:08] Reedy: yay
[18:07:18] My current chrome window is caching it to hell and bakc
[18:07:26] That's Chrome for ya
[18:08:33] * aude learns new stuff everyday
[18:08:54] ULS: Unknown language als. load.php:267
[18:08:54] ULS: Unknown language bat-smg. load.php:267
[18:08:54] ULS: Unknown language fiu-vro. load.php:267
[18:09:00] There's just those appearing now
[18:09:30] can't get in via sip
[18:09:30] i've seen that
[18:09:34] those files don't exist yet and it's a bug
[18:09:44] multiple tries = fail to connect whatsoever
[18:10:23] aude: Also, CreateItem is now quite a bit faster when tabbing! :D
[18:11:29] ^demon, I'm looking through the apache config now and the only things that I see are configured differently to the other wiki domains are ServerName and ServerAlias
[18:11:36] Gerrit 503'ing again.
[18:11:53] Reedy: I filed a bug for ULS issues earlier today
[18:12:06] And I just fixed it
[18:12:06] Jeff_Green: monitor group doesn't go in node def
[18:12:24] it shouldn't require CORS since that doesn't work in all browsers, it shouldn't re-try the same json file 16 times and the unkwown language codes
[18:12:30] Jeff_Green: look at role/applicationserver.pp
[18:12:35] Reedy: no you didn't, that only hides the problem in some browsers.
[18:12:47] notpeter: yes
[18:13:01] <^demon> Krenair: Apache config is correct. It needs further DNS work.
[18:13:14] I see
[18:13:14] <^demon> Krinkle: This has been going on all morning.
[18:13:19] <^demon> I don't know why, yet.
[18:14:05] The apache restart earlier - was that wikidata ssl-related?
[18:14:29] no
[18:15:58] notpeter: Have you been messing with searchidx1 by any chance?
[18:15:58] Warning: the RSA host key for 'searchidx1001' differs from the key for the IP address '10.64.0.119'
[18:16:03] !log catrope synchronized php-1.21wmf3/extensions/VisualEditor/ 'Update VisualEditor'
[18:16:06] notpeter: i see the @monitor_group stuff at the top . . . I have that per-class in role/fundraising.pp
[18:16:13] RoanKattouw: nope
[18:16:16] Logged the message, Master
[18:16:17] hah
[18:16:28] notpeter: is there something else in this I'm missing?
[18:16:31] Oh hmm maybe it's just my homedir that's weird?
[18:16:34] Offending key for IP in /home/catrope/.ssh/known_hosts:17
[18:16:36] Matching host key in /etc/ssh/ssh_known_hosts:4753
[18:17:51] Jeff_Green: shouldn't be
[18:18:02] (although that's not what I see in git. has this been merged?)
[18:18:44] yep
[18:19:15] it's in the copy on sockpuppet
[18:21:14] Jeff_Green: in what I have in my checkout, I see @monitor_group defs inside of class role::fundraiser::blah
[18:21:20] yah
[18:21:34] ok, pull those out of the role defs
[18:21:47] just throw them in the base of fundraising.pp
[18:21:58] really? ok
[18:22:27] should work
[18:22:32] our stuff is weird....
[18:22:56] it really is
[18:25:38] !log aaron synchronized php-1.21wmf3/extensions/TimedMediaHandler 'deployed 05118462e3ddbfc507772cc3c4e1117a8527e0b5'
[18:25:52] Logged the message, Master
[18:26:37] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[18:27:43] New patchset: Jgreen; "add system_role for fundraising hosts, attempt to fix nagios fundraiser host groups" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30632
[18:28:25] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30632
[18:34:51] chmod g+w /home/wikipedia/common/docroot/www.wikidata.org/favicon.ico
[18:34:58] chmod g+w /home/wikipedia/common/docroot/www.wikidata.org/robots.txt
[18:35:03] ^ Can someone do that on fenari please?
[18:35:49] done
[18:36:25] thanks
[18:37:09] apergos: Oh, can you do g+w for the www.wikidata.org directory too please?
[18:37:23] done
[18:37:47] thanks
[18:37:52] could have done that and let you fix the other stuff by yerself, oh well :-D
[18:37:54] notpeter: did you get me ping about searchidx1001 host key whining?
[18:38:42] I pinged him last week about it ;)
[18:42:15] nope
[18:42:21] meant roan
[18:42:42] er
[18:42:42] no
[18:42:47] gah
[18:43:00] not sure why it's not getting picked up
[18:43:01] ta
[18:43:03] uh
[18:43:05] watz? :)
[18:43:13] exactly
[18:46:05] New patchset: MaxSem; "WIP: support for multiple Solr cores" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29827
[18:46:05] New patchset: MaxSem; "Solr replication" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26571
[18:46:25] New review: MaxSem; "Just a rebase." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/26571
[18:47:45] notpeter: now I'm getting these "Could not retrieve catalog from remote server: Error 400 on SERVER: Exported resource Nagios_host[mw1017] cannot override local resource on node spence.wikimedia.org"
[18:48:09] what the fuck?
[18:48:38] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[18:48:43] oh, yeah, I have seen that error before. it's seemed to be transient
[18:48:43] i'm assuming it's unrelated
[18:48:46] I don't fully understand it
[18:48:49] yeah
[18:49:17] we should just unpuppetize nagios
[18:49:59] it's 1000% worse to try to get puppet to do your bidding than it would be to edit the machine over 2400 baud dialup from a 386 running windows 3.0
[18:52:25] lulz
[18:57:51] that's bashworthy
[18:58:03] ah 1017
[18:58:04] nice
[18:58:19] there's like 4-5 hosts it keeps whining about, in random order
[19:00:13] hehehe
[19:00:27] so naginator works really well
[19:00:41] i just hav ebeen putting off fixing neon
[19:00:41] :-/
[19:06:34] New patchset: Lcarr; "admins.pp: annotate the include as disabled" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23789
[19:06:52] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23789
[19:08:00] what about the stuff that gets up around the side of the gerrit queue?
[19:08:07] Eh.. I just got a 404 error on www.wikidata.org for /w/index.php
[19:08:07] hah, LeslieCarr
[19:08:15] with a default 404 DocumentHandler response from Apache.
[19:08:21] it is fixed now, but how is that even possible?
[19:08:37] jeremyb: if you wanna rebase https://gerrit.wikimedia.org/r/#/c/8344/ i'll merge that too
[19:09:40] LeslieCarr: in a bit? working on a bug i just noticed
[19:09:45] no hurries
[19:09:48] danke!
[19:09:55] just irc ping me when it's ready ?
[19:09:56] sure
[19:09:59] kewlio
[19:11:07] and an officewiki favicon on https://www.wikidata.org/favicon.ico
[19:11:43] LeslieCarr, I will ping you too :)
[19:11:47] New patchset: Jgreen; "adjusting banner filters for the fundraisers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30638
[19:11:48] got any time in the next couple of hours for ganglia?
[19:12:22] * LeslieCarr hides
[19:12:22] when the meeting is done, i go eat lunch, then ganglia sure
[19:12:31] notpeter: ah HA. fourth puppetd -tv is finally fetching the config
[19:12:36] and if Jeff_Green will be not underwater, also lock down some firewall stuff on payments eqiad
[19:13:12] LeslieCarr: ha ok
[19:13:32] after lunch, don't wanna do something and then walk away ;)
[19:13:46] * Jeff_Green keeps fingers crossed we all don't end up in Oz
[19:13:52] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30638
[19:14:44] Jeff_Green: woo! kinda...
[19:14:59] spence:/etc/nagios3 is just cruft, right?
[19:15:28] every time I get on spence i get confused about which /etc/nagios* dir is used
[19:16:01] nagios3 is cruft
[19:16:02] spence is insane
[19:16:14] notpeter: your suggestion worked, nagios+puppet are once again not enemies sort of
[19:16:18] !log reedy synchronized php-1.21wmf3/includes/Revision.php
[19:16:31] Logged the message, Master
[19:16:32] LeslieCarr: thx. I'm renaming it /etc/nagios3-unused
[19:16:38] cool
[19:17:07] heh
[19:36:43] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki and mediawikiwiki to 1.21wmf3
[19:36:56] Logged the message, Master
[19:37:38] New patchset: Jgreen; "undo added system_role in role/fundraising.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30642
[19:38:07] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30642
[19:39:21] !log reedy synchronized live-1.5/
[19:39:35] Logged the message, Master
[19:52:17] mutante: oh for the mysql wikivoyage sizing, make sure the tables are all innodb too
[20:16:43] New patchset: Jgreen; "switch fundraising hosts to fundraising ganglia and nagios clusters, additional banner logging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30646
[20:19:46] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30646
[20:24:16] Jeff_Green: ready to firewall lockdown ?
[20:30:00] ottomata: ready to figure out why ganglia is being bad ?
[20:30:29] yeahhhhhhhh
[20:31:10] last we left it, we had an03 as an aggregator
[20:31:25] and it wasn't getting any ganglia traffic on the mulitcast addy, other than its own
[20:32:31] leslie actually ... i'm sorta stuck on a ganglia problem too, let's hold on the firewall thing until tomorrow?
[20:32:41] ok
[20:33:05] ottomata: so there was something weird … when puppet set it, it didn't set, when we set it manually and restarted, it did listen to all the traffic
[20:33:06] right ?
[20:33:47] right, puppet was weird, so you had me set the $ganglia_aggregator variable before including the role class
[20:33:50] and that seemed to fix the puppet weirdness
[20:34:01] i don't remember when or why it did listen to all traffic
[20:34:18] that fixed the fact that it wasn't setting anything to be an aggregator
[20:34:24] right
[20:34:33] that made it do deaf = no
[20:34:33] properly
[20:34:37] thought - maybe we need the gmond service to be listening via puppet if the config file changes
[20:34:39] lemme look and see
[20:35:01] i think it is
[20:35:15] when it changes the file I see that gmond is refreshed
[20:35:52] damn it is subscribed
[20:35:58] yeah
[20:36:13] hrm ...
[20:36:31] it must be doing something differently ...
[20:38:36] i wonder, lemme try a service ganglia-monitor stop/start -- maybe when it does a reload it doesn't do something important
[20:38:50] that worked
[20:39:42] so it must be that reload doesn't actually do something important
[20:40:04] hahaha
[20:40:09] " reload)
[20:40:10] ;;"
[20:40:18] yeah, it doesn't do anything
[20:40:21] this explains so much
[20:41:20] so i believe "hasrestart => true" should theoretically do this ?
[20:41:34] restart it?
[20:41:42] i think unless hasrefresh => true
[20:41:49] hrm
[20:41:54] it will restart when the subscribe resources changes
[20:42:00] argh. ganglia, nagios, puppet, frack, and fundraising is the ugliest tangle I've dealt with in a long time
[20:42:23] but, LeslieCarr
[20:42:29] do we have this working outside of puppet?
[20:42:50] lets just stop puppet on the machine and make sure it works as an aggregator manually
[20:42:50] yeah, check it out on analytics1003 - i did a manual stop/start and it's receiving all the traffic
[20:42:50] hm
[20:43:34] how do you know it is receiving all traffic?
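The empty `reload)` arm quoted in the exchange above is the whole story: Puppet's subscribe triggered a service refresh, the init script's reload action did nothing, so gmond never re-read its config and `deaf = no` never took effect until a manual stop/start. A minimal sketch of that dispatch (the `service_ctl` helper is made up for illustration; this is not the actual Debian ganglia-monitor init script):

```shell
# Sketch of an init-script dispatch whose "reload" arm is an empty
# case, as quoted in the log. Only stop/start (or restart) ever
# touches the daemon, so config changes sit unread until someone
# restarts it by hand.
service_ctl() {
  case "$1" in
    start)   echo "gmond started" ;;
    stop)    echo "gmond stopped" ;;
    restart) echo "gmond stopped"; echo "gmond started" ;;
    reload)  ;;   # no-op: the daemon is never signalled
  esac
}

service_ctl reload    # prints nothing
service_ctl restart   # actually cycles the daemon
```

This is also why the `hasrestart => true` suggestion in the log points in the right direction: if Puppet issues a full restart on refresh instead of a reload, the broken reload arm is never exercised.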
[20:43:39] http://ganglia.wikimedia.org/latest/?c=Analytics%20cluster%20eqiad&m=load_one&r=hour&s=by%20name&hc=4&mc=2
[20:43:48] first off that, secondly tcpdump
[20:44:05] tcpdump -n -i eth0 host 239.192.1.32 and not host 10.64.21.103
[20:44:41] i only see analytics1003
[20:44:41] 10.64.21.103
[20:44:53] !log mlitn synchronized php-1.21wmf2/extensions/ArticleFeedbackv5
[20:45:08] Logged the message, Master
[20:45:32] gah, it was getting all the traffic a minute ago
[20:45:32] grrrr
[20:45:35] i did a netcat, maybe that busted it
[20:45:40] check that out - http://ganglia.wikimedia.org/latest/?c=Analytics%20cluster%20eqiad&m=load_one&r=hour&s=by%20name&hc=4&mc=2
[20:45:46] maybe it did ?
[20:46:03] restarted gmond
[20:46:15] ahhh yeah
[20:46:15] that is better
[20:46:22] i wonder if my netcats have been messing this up the whole time!
[20:46:39] i was just about to ask if you've been using netcat every time
[20:46:46] use tcpdump to listen in :)
[20:48:21] which one is the other aggregator ? 1010 ?
[20:50:06] yeah that'd be cool
[20:50:15] 1003 and 1010
[20:50:15] it isn't in puppet right now
[20:50:16] i'll fix that
[20:50:21] (I can do that, right? 2 aggregators?)
[20:53:37] New patchset: Ottomata; "site.pp - using both an03 and an10 as ganglia aggregator for analytics cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30652
[20:54:12] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30652
[20:56:00] yes 2 aggregators
[20:59:08] New review: Hashar; "Awesome! Thanks a ton mark :-]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24797
[21:00:41] erghhhh, LeslieCarr, +2 aggregators == unhappy?
[21:01:02] i see all the traffic still
[21:01:02] but ganglia web gui says they're all down
[21:01:02] cept an03
[21:03:05] grrr
[21:03:11] other ones have two aggregators ..
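The "use tcpdump to listen in" advice above comes down to passive versus active observation: tcpdump captures at the packet level without binding the UDP port, so unlike a netcat listener it cannot contend with gmond for the multicast traffic it is supposed to receive. A tiny sketch that builds the same pcap filter the log uses (`mcast_filter` is a hypothetical helper name, not a real tool):

```shell
# Build the pcap filter from the log: all traffic for the ganglia
# multicast group, minus the local host's own packets.
mcast_filter() {
  group="$1"; self="$2"
  printf 'host %s and not host %s' "$group" "$self"
}

mcast_filter 239.192.1.32 10.64.21.103
# which feeds the actual command shown in the log (requires root):
#   tcpdump -n -i eth0 "$(mcast_filter 239.192.1.32 10.64.21.103)"
```

Seeing only the local host in that capture is exactly the symptom discussed above: the aggregator's own multicast packets arrive, but nothing from the rest of the cluster.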
[21:03:15] y u hate us ganglia [21:05:43] i just restarted gmond on an10 as non-aggregator [21:07:50] ok [21:08:30] yeah now its happier [21:08:31] i unno [21:09:24] ah well, 1 is fine for now [21:10:36] New patchset: Ottomata; "Boo, ganglia doesn't like me with 2 aggregators!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30655 [21:11:34] !log reedy synchronized wmf-config/CommonSettings.php 'Disable lucene on wikidatawiki for now' [21:11:48] Logged the message, Master [21:11:52] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30655 [21:15:26] AHHHHH no [21:15:26] LeslieCarr [21:15:30] something really weird just happens after a while [21:15:41] i didn't change anything this time [21:15:41] and you didn't do any netcat ? [21:15:43] and now they are all down [21:15:43] nope [21:15:51] y u kill it [21:16:40] :-/ [21:18:31] hrm .... [21:19:50] i'm a killah, oh man, i gotta stop looking at my computer though, it is hurricaning outside and I am missing it! [21:20:00] nothing is working today for me! yaaaaaaa [21:20:13] so, hurricane playtime is now I think [21:20:28] if you figure anything out, lemme know, [21:20:29] !log reedy synchronized wmf-config/CommonSettings.php 'Change wikidata search config even more' [21:20:35] binasher: i got this: en wiki about 60GB, de and it about 10GB [21:20:43] ottomata: is this your first? [21:20:43] Logged the message, Master [21:20:48] first, what? [21:20:49] hurricane? [21:20:50] hmmm, no [21:20:53] yes! [21:20:58] i went boogie boarding in a hurricane once [21:21:03] on the james river [21:21:11] the waves were up in the trees [21:21:11] what zone are you?
[21:21:11] really fun [21:21:12] dunno [21:21:16] but I have some friends who live in dumob [21:21:19] dumbo [21:21:25] which is zone A [21:21:26] and they are having a slumber party [21:21:30] miiiight go over there :p [21:21:39] don't worry though, they have canoes [21:21:46] otto's kinda crazy ;) [21:22:13] for the uninitiated: http://project.wnyc.org/news-maps/hurricane-zones/hurricane-zones.html [21:24:54] !log reedy synchronized wmf-config/CommonSettings.php 'Change wikidata search config even more' [21:25:07] Logged the message, Master [21:25:34] !log reedy synchronized wmf-config/CommonSettings.php 'Revert' [21:25:47] Logged the message, Master [21:26:10] csteipp: /win 26 [21:26:17] arr, typo :) [21:28:24] !log reedy synchronized wmf-config/CommonSettings.php [21:28:24] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:28:34] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:28:34] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:28:34] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:28:34] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:28:37] Logged the message, Master [21:28:42] PROBLEM - Apache HTTP on mw45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:28:42] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:28:51] PROBLEM - Apache HTTP on mw47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:28:57] Reedy: ^ ? [21:29:10] idk what the latency is on those alerts...
[21:29:33] and i guess the first one which you reverted is related to the complaints in #-tech [21:29:54] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time [21:30:03] RECOVERY - Apache HTTP on mw54 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.058 second response time [21:30:03] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time [21:30:03] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.292 second response time [21:30:03] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.071 second response time [21:30:15] RECOVERY - Apache HTTP on mw45 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.043 second response time [21:30:15] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.054 second response time [21:30:21] RECOVERY - Apache HTTP on mw47 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [21:38:40] Arrrrgh! [21:39:09] Why does wikitech.wikimedia.7val.com/index.php?title=Scap&redirect=no rank higher in Google than wikitech.wikimedia.org/view/Scap [21:39:11] for wikitech scap [21:39:16] !google wikitech scap [21:39:17] http://google.com/?q=wikitech+scap [21:39:27] !google del [21:39:28] Successfully removed google [21:39:44] !google is https://www.google.com/search?q=$url_encoded_* [21:39:44] i'm guessing that 7val.com has some SEO done to it ? [21:39:44] Key was added [21:39:59] !google wikitech scap [21:39:59] https://www.google.com/search?q=wikitech+scap [21:40:09] slow bot [21:40:14] yep [21:40:47] !wikitech scap [21:40:47] do we even provide dumps for wikitech? [21:40:48] http://wikitech.wikimedia.org/view/scap [21:41:22] jeremyb: it isn't hosted in the Wikimedia cluster, so no. [21:41:31] (which is intentional and important) [21:41:41] Krinkle: well that much i knew. 
it's linode [21:41:52] k [21:41:52] Krinkle: doesn't preclude providing dumps [21:42:00] yep, 7val is SEO'ed [21:42:01] woo [21:42:06] sure, it just means more work. [21:42:28] and last time somebody spent time on wikitech was when it upgraded to 1.17wmf1 [21:42:38] so odd that they'd put effort into SEO and then get ugly URLs (not pretty!) [21:42:40] I guess that says it all [21:42:52] Krinkle: nah, it's been upgraded since [21:43:16] i just figured it would be 1 less option for how to mirror if there's not a dump [21:43:16] jeremyb: really? looks like 1.17 to me. [21:43:37] Krinkle: i was pretty damned sure [21:43:48] it claims 1.17wmf1 on [[special:version]] [21:44:06] mutante should know, i think he's the one that did it. ;) [21:44:13] the db schema has been upgraded, the files have been rolled back [21:44:44] yes, we will have a newer one, but by installing current mw from scratch and importing data [21:44:57] (on another instance, on precise) [21:51:16] aha [21:51:21] Krinkle: ^ [21:51:30] ok [21:55:37] hashar: Can you explain what you intend to do with Zuul? [21:55:42] re "Will magically fix itself when we switch to Zuul." [21:55:49] Krinkle: replace Gerrit Trigger plugin [21:56:20] and overall a system which lets us fine-tune how the jobs are triggered [21:56:54] how is that related to https://bugzilla.wikimedia.org/show_bug.cgi?id=39742 ?
[21:57:16] ohh [21:57:23] yeah should have elaborated a bit more I guess [21:57:34] zuul crafts a reference with the latest master + the patch [21:57:44] so you just have to fetch that instead of doing a clone [21:57:49] or a git archive [21:57:53] or git clone -l something [21:58:01] zuul basically takes care of it for us [21:59:28] heh, anyone who thinks this is too early?:) [21:59:34] #3801: Kill prototype.wikimedia.org [22:00:03] mutante: there was a discussion about that one less than a week ago [22:00:09] mutante: something about aftv5 [22:00:12] iirc [22:00:24] mutante: talk about prototype with chrismcmahon [22:00:31] I think he got the most up to date information regarding the prototype box [22:00:38] iirc it is still being used [22:01:07] thanks guys! [22:01:21] copy/pastes your comments:) [22:01:52] mutante hashar afaik, the last project to use prototype was AFTv5, and right now AFTv5 is supported on beta labs. everything on prototype is out of date right now, and ops has wanted to get rid of it for some time now [22:02:23] we got a ticket for it from Andre, because he is cleaning up Bugzilla [22:02:32] chrismcmahon: would you mind sending a mail to engineering list to make it clear ? [22:02:35] but it was just "Assuming that Labs has superseded this" [22:02:42] chrismcmahon: I am sure that Ryan will be more than happy to shut it down :-] [22:05:23] mutante: Ryan Lane said something last week about taking down prototype, he's probably the final voice [22:07:18] chrismcmahon: ok, multiple people want to get rid of it :) [22:07:36] pretty sure it was just about making sure it is not being used [22:08:59] did something change with the imagescalers in the last hour? [22:10:38] mutante: you could just (virtually) unplug its ethernet and see who complains...
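The Zuul mechanism hashar describes above — a job fetching a prepared ref (latest master plus the patch) instead of cloning fresh — looks roughly like this in a job script. The URL, project, and ref below are placeholder values; in a real job Zuul exports the actual ones to the environment, so treat this as a sketch of the flow rather than the production job:

```shell
# Placeholder values standing in for what Zuul would export to the job.
ZUUL_URL="https://gerrit.wikimedia.org/r/p"
ZUUL_PROJECT="mediawiki/core"
ZUUL_REF="refs/zuul/master/Zabc123"
# The job fetches the ref Zuul crafted into an existing checkout and
# checks it out, avoiding a full clone (or git archive) per build.
FETCH_CMD="git fetch $ZUUL_URL/$ZUUL_PROJECT $ZUUL_REF && git checkout FETCH_HEAD"
echo "$FETCH_CMD"
```

The script only prints the git invocation, since the ref exists solely on the Zuul merger that prepared it.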
[22:10:41] ;-) [22:13:13] chrismcmahon: checked, anomie has both sudo and netadmin rights on beta :-] all fine I guess [22:13:17] will cover the rest tomorrow [22:14:25] http://commonsprototype.tesla.usability.wikimedia.org/wiki/Main_Page [22:15:09] http://commonsprototype.tesla.usability.wikimedia.org/wiki/Special:RecentChanges :/ [22:15:14] user account creation spam [22:15:24] at least it is used by chinese bots! [22:15:24] is this also on the list to be killed? [22:15:28] hehee [22:15:38] probably need to be killed too [22:16:16] http://prototype.tesla.usability.wikimedia.org/ [22:16:32] lets get rid of all this:) [22:17:56] good night hashar [22:17:56] ;-)))))))) [22:17:56] thx! [22:17:56] cya [22:17:58] cya [22:19:21] ah, tesla [22:19:35] yeah, they should all die [22:23:21] mutante: commonsprototype is no longer used [22:23:37] Yeah, all of tesla is dead afaik [22:24:06] (those facing as *.tesla that is) [22:24:07] a few on prototype.wikimedia.org are still used by E2/E3 I believe [22:24:11] MoodBar/AFT maybe [22:26:16] thanks [22:28:25] mutante, have you already emailed fabrice? if not I can shoot him a quick note [22:28:45] I think they're using ee-prototype.wmflabs.org for most everything by now. [22:36:29] Eloquence: i havent mailed yet [22:51:17] mutante: Do you have a list and/or plan to make a list of stuff running on prototype.wikimedia.org? [22:52:01] mutante: I have an outdated list from a few months ago that i could turn into a wikitech-l/engineering-l mailing. And nuke whatever doesn't get comments in 2-3 weeks. 
[22:52:09] And address whatever does get reploes [22:52:15] replies [23:14:50] New patchset: Pyoungmeister; "using proper syntax for searchidx rsync conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30726 [23:19:06] New patchset: Pyoungmeister; "using proper syntax for searchidx rsync conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30726 [23:19:22] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30726 [23:19:36] ooo yay, cables breaking [23:19:50] where? [23:20:19] ac2 as you linked? [23:20:21] yep [23:20:32] that's pretty crazy [23:20:32] well the cable may be fine, it may be the landing [23:20:34] probably the landing [23:20:39] heh [23:20:40] yes [23:20:47] if the equipment loses power/goes underwater, it's the same effect in the short term :) [23:21:34] truuuuuueeeee [23:21:57] bit.ly/wikitechLatest [23:22:02] mutante: http://lists.wikimedia.org/pipermail/wikitech-l/2012-October/064128.html [23:32:11] haha - best quote of the day -- "Hurricane Electric suffering due to lack of Electric due to Hurricane?" [23:32:29] lulz [23:32:34] haha [23:55:06] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [23:55:06] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [23:55:06] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours