[00:51:16] New review: Dereckson; "Configuration OK." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/11745 [01:22:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:25:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds [01:30:54] PROBLEM - Lucene on search1001 is CRITICAL: Connection timed out [01:39:36] RECOVERY - Lucene on search1001 is OK: TCP OK - 3.020 second response time on port 8123 [01:40:40] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 252 seconds [01:41:51] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 230 seconds [01:44:42] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: Connection timed out [01:45:27] PROBLEM - Lucene on search1002 is CRITICAL: Connection timed out [01:48:45] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 644s [01:48:45] PROBLEM - Lucene on search1001 is CRITICAL: Connection timed out [01:49:21] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [01:51:36] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 21s [01:52:12] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 26 seconds [01:54:00] PROBLEM - Lucene on search1003 is CRITICAL: Connection timed out [01:56:15] RECOVERY - LVS Lucene on search-pool1.svc.eqiad.wmnet is OK: TCP OK - 3.035 second response time on port 8123 [01:56:52] RECOVERY - Lucene on search1003 is OK: TCP OK - 9.027 second response time on port 8123 [01:57:18] RECOVERY - Lucene on search1002 is OK: TCP OK - 3.024 second response time on port 8123 [01:59:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:03:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 
400 Bad Request - 336 bytes in 8.097 seconds [02:06:54] RECOVERY - Lucene on search1001 is OK: TCP OK - 0.027 second response time on port 8123 [04:04:14] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [05:06:08] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [06:03:03] !log powercycling db1047 [06:03:08] Logged the message, Master [06:05:49] RECOVERY - Host db1047 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [06:09:25] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 83904 seconds [06:10:49] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 83767 seconds [06:12:24] for anyone watching, it's slowly catching up and I am keeping an eye on it [06:55:47] New patchset: ArielGlenn; "last of the deployment scripts, point production to a given dir" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/11832 [06:57:16] New patchset: Petrb; "(bug 37662) change wgUploadNavigationUrl @ dawiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11745 [06:57:24] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11745 [06:58:09] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11832 [06:58:11] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/11832 [07:25:55] New review: Dereckson; "Styles issue in commit and comment." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/11745 [07:37:52] New patchset: Petrb; "(bug 37662) change wgUploadNavigationUrl @ dawiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11745 [07:37:53] apergos: hello. 
Dario reported 2 hours ago db1047 to be down again [07:37:58] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11745 [07:42:16] and I rebooted it, logging it in this channel [07:42:55] it is currently in slave catchup mode, and will remain that way for a few hours yet [07:50:20] New review: Dereckson; "(no comment)" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/11745 [07:50:32] apergos: great :-] [07:51:01] * apergos recommends the SAL for nice morning reading :-P [07:51:38] (sorry if' I'm a bit punchy today, last night was a long frustrating night and the amount of sheer bs I am hearing from most politicians and the media is simply astounding) [07:53:55] New patchset: Petrb; "(bug 37662) change wgUploadNavigationUrl @ dawiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11745 [07:54:02] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11745 [07:59:56] New patchset: Amire80; "(bug 37648) new wmgTranslateWorkflowStates format" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11833 [08:00:06] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11833 [08:03:27] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [08:10:21] New patchset: Dereckson; "(bug 37672) Use odf on collection for ml projects" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11834 [08:10:27] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11834 [08:16:22] New review: Hashar; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11834 [08:16:25] Change merged: Hashar; [operations/mediawiki-config] 
(master) - https://gerrit.wikimedia.org/r/11834 [08:16:52] New review: Nikerabbit; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11833 [08:19:00] New review: Hashar; "Synced on live site." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11834 [08:54:03] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [08:56:27] paravoid: so, dns secondary? [08:56:34] dns/ldap, that is [08:59:53] shit, the resolver isn't responding on the old IP anymore [09:01:54] paravoid: you forgot to disable the cron puppet restarter [09:09:20] heh, you european guys come online so late :D [09:13:26] rawr [09:13:33] we really need something better than racktables [09:13:51] I have no clue what's in use or not [09:14:21] we have needed something better than racktables since the day we started using it, sadly [09:14:34] we're using misc servers for bits? :( [09:14:40] New patchset: Hydriz; "(Bug 36895) Restrict media upload to autopatrolled users on sv.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11837 [09:14:46] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11837 [09:14:47] I guess I'll take titanium [09:14:47] and anyone from here coming on late (what, we are?) is justified causelast night sucked *ss [09:15:03] if it's not being used... [09:15:16] oppsies [09:15:24] what happened? [09:16:02] I was kidding about coming on late. I've been coming on at noon every day since I've been here [09:16:10] elections giving us the worst possib outcome [09:16:15] ahhhh [09:16:16] that sucks [09:16:20] the two parties that supposedly got resoundingly rebuffed in may [09:16:24] nazis? [09:16:25] are now going to form a government [09:16:35] oh the nazis are back in with 6.9 instead of 7% [09:16:41] * Ryan_Lane groans [09:16:47] it's pretyt much awful [09:17:26] oh. 
huh you must be in our timezone, right :-D sorry... I'm still tired (mentally exhausted I guess) so all neurons not firing well... [09:17:38] well, it's 11 here right now :) [09:17:42] I'm on early today! [09:17:49] yep :-D [09:17:52] eh, how do you automatically break up a line into 72 chars per lines? [09:17:57] s/lines/line/ [09:17:59] region-fill? :-P [09:18:02] no clue [09:18:04] er, fill-region. sorry [09:18:10] (that's an emacs joke :-P) [09:19:30] drac --getmacfornic=1 titanium.mgmt [09:19:32] <3 [09:19:43] one of my favorite scripts :) [09:20:27] New patchset: Hydriz; "(Bug 36895) Restrict media upload to autopatrolled users on sv.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11837 [09:20:33] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11837 [09:22:04] oohhh is it in our puppet repo somewhere? [09:22:12] good question [09:22:14] it should be [09:22:19] it's installed on fenari [09:23:00] New patchset: Ryan Lane; "Add dhcp for virt1000" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11838 [09:23:04] note to people watching db1047: it's probably 1.5 to 2 hours away from caught up on slave rep) [09:23:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11838 [09:23:36] there is a virt1000? isn't it 1001? [09:23:44] :P [09:25:07] we have virt1001-1008 [09:25:11] which are compute nodes [09:25:15] and virt1000 which is a controller [09:25:19] virt0 is a controller [09:25:25] and virt1-8 are compute nodes [09:26:20] virt1-8 o_O [09:26:33] growth by 3 nodes, seems good :) [09:26:39] more than that [09:26:44] we're replacing old hardware with new [09:26:52] \o/ [09:27:01] the new hardware has 185GB ram [09:27:04] the old has 50 [09:27:11] old = gluster? :P [09:27:15] yes [09:27:20] ah when is local disks for all gluster going to happen? 
[09:27:23] new won't have shared storage at first [09:27:24] new = ? [09:27:27] :( [09:27:36] I mean instead of gluster [09:27:44] when the new hardware is put in place [09:27:52] we can probably start that today [09:28:01] I'm getting the secondary resolver up first, though [09:28:27] grrr the new servers are not nagios-ed [09:28:34] which new ones? [09:28:38] they aren't installed yet [09:29:03] lol [09:29:27] they were just racked/wired at the end of last week [09:29:50] then are we also going to use the nodes in eqiad? [09:29:54] yes [09:30:00] we're going to have 16 compute nodes total [09:30:02] 8 in each zone [09:30:41] and seems like we are finally going to use *.eqiad.wmflabs [09:30:47] we're using it for stuff right now [09:31:01] !log added virt1000 to dns, using titanium misc server [09:31:06] Logged the message, Master [09:31:54] hm. need to add it to a vlan [09:32:35] public subnet... [09:32:43] now which switch? heh [09:33:55] asw-b-eqiad [09:35:23] New patchset: Dereckson; "(bug 37675) Portals reviewing in ruwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11839 [09:35:29] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11839 [09:47:26] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11838 [09:47:29] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11838 [09:47:44] New patchset: Hydriz; "(Bug 35712) Add an alias to Help namespace in ml.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11840 [09:47:50] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11840 [09:48:16] \i/ [09:48:20] \o/ too [09:52:06] now, question is, will this pxe on the first attempt? :) [09:54:29] well, I obviously pulled the wrong MAC [09:57:33] ok. 
I officially have no clue what macs this script is pulling [09:58:24] oops [09:58:26] New patchset: Ryan Lane; "Fixing the mac for virt1000" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11841 [09:58:33] also I don't know where the script is on fenari [09:58:34] maybe it only pulls the mgmt macs? [09:58:38] which drac [09:58:42] oh [09:58:52] sbin [09:58:54] thanks [09:58:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11841 [09:58:55] yw [09:59:13] it uses python-paramiko to ssh in and run commands on the drac [09:59:20] I see that [09:59:22] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11841 [09:59:25] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11841 [09:59:48] shouldn't be too hard to adjust it for cdracs recent enugh to be able to display the other macs [09:59:52] *dracs [10:00:18] well, I wrote this for when we installed eqiad [10:00:28] and yay it is in puppet [10:00:31] * Ryan_Lane nods [10:00:49] I almost always do everything via puppet [10:01:06] sweet [10:08:12] apergos: how is Athens on Monday noon after elections? [10:08:23] I haven't been out [10:08:29] ok:) [10:08:45] but via im and irc I can tell you that the general mood is gloomy to outright depressed [10:09:05] ..i see [10:09:25] even better, there was a short-livd market rally after the results... [10:09:29] which is already over [10:09:42] and spain is back in the hot seat again. [10:09:54] just like all this never happened. [10:13:59] Ryan_Lane: re: MACs and Ciscos. (if they were Cisco). "detail" in cimc/network are the mgmt MACs, while lom-mac-list detail are the real MACs , not mgmt. "LOM" made us think of lights-out management or something but they mean "LAN on motherboard". 
so that stuff might be cause for wrong MACs [10:14:11] mutante: this was a dell [10:14:20] I wrote a script a while back to pull info from the drac [10:14:27] but it was for the eqiad buildout [10:14:34] aha, i heard of existing scripts from Rob [10:14:36] so, likely I wrote it to pull mgmt mac [10:14:47] gotcha [10:15:22] heh lom = lan on mobo. funny [10:15:58] heh yea, that confused me with the first Cisco [10:16:55] mark: am I supposed to be using 208.80.154 in eqiad? [10:17:56] bah. of course I am [10:18:09] yay, mutante here, cool! [10:19:58] oh. great [10:20:10] there seems to have already been a virt1000 assigned? [10:20:13] Danny_B|backup: i might have to disappoint you about rank.php. quite busy already [10:20:45] (sad) [10:20:56] \o/ media tarballs announcement [10:21:23] well, that host is going to go back to being krypton [10:21:36] \o/ another set of incremental media tarballs (kicks into action) [10:21:38] since no one renamed it anywhere except dns [10:21:58] no fucking wonder it booted with the wrong MAC [10:23:07] hm. I better check to see if it was labeled in the router [10:24:18] it was :D [10:24:21] well, renaming that back [10:26:26] New patchset: Dereckson; "(bug 37676) Namespaces configuration on kn.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11843 [10:26:32] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11843 [10:30:06] New review: Dereckson; "I prepared the config change, but it's still waiting a shellpolicy flag" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/11843 [10:39:36] hm. no clue why this would say no free leases [10:40:32] mark: host is in dns, mac is in the leases file, dhcp3 was restarted… [10:40:39] any idea why something would show no free leases? [10:41:46] we always dhcp from brewster, right? [10:43:32] is it in ./files/autoinstall/subnets ? 
[10:43:42] it's not getting the initial pxe [10:43:55] PXE-E51: No DHCP or proxyDHCP offers were received. [10:44:17] and the dhcp server says no free leases, so it isn't giving it an address [10:44:47] hmm.. the wrong linux-host-entries file? [10:45:09] linux-host-entries.ttyS1-115200 [10:45:15] it was still brewster when doing analytics,ack [10:46:39] Ryan_Lane: virt1001 , right? that is already in the "ttyS0" file [10:46:44] no [10:46:47] virt1000 [10:46:53] why tty0? [10:47:10] S0 vs. S1 = Dell vs. Cisco , serial console differs [10:47:17] afair [10:47:25] well, there's also a linux-host-entries.tty0 [10:47:30] either way, this is a dell [10:47:32] it uses S1 [10:47:40] oh, tty0 literally sounds like a typo maybe [10:48:29] I really don't understand how it's making the request with the correct mac and the dhcp server isn't giving virt1000's IP [10:48:48] Ryan_Lane: does the dhcpd get the right dns entry? [10:48:50] host and dig show the proper ip [10:49:11] missing subnet in dhcpd.conf ? [10:49:22] also possible [10:49:38] New patchset: Dereckson; "(bug 37676) Namespaces configuration on kn.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11843 [10:49:39] subnet 208.80.154.128 netmask 255.255.255.192 { [10:49:44] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11843 [10:50:32] maybe I have it on the wrong vlan? [10:50:57] via 208.80.154.2: [10:51:00] which vlan is that [10:51:00] ? 
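(Editorial aside on the "no free leases" hunt above.) A quick way to sanity-check the vlan theory is to test whether the address the request arrives "via" actually falls inside the subnet quoted from dhcpd.conf. The following is a minimal sketch, assuming the "via" address in the dhcpd log is the relay the request came through. The function name `served_by_subnet` and the in-subnet host address are hypothetical and used only for illustration.

```python
import ipaddress

# Subnet exactly as quoted from dhcpd.conf in the log.
DECLARED_SUBNET = ipaddress.ip_network("208.80.154.128/255.255.255.192")


def served_by_subnet(address, subnet=DECLARED_SUBNET):
    """Return True if `address` falls inside `subnet`."""
    return ipaddress.ip_address(address) in subnet


if __name__ == "__main__":
    # Relay address seen in the log ("via 208.80.154.2").
    print(served_by_subnet("208.80.154.2"))
    # Hypothetical host address, picked only because it lies inside the /26.
    print(served_by_subnet("208.80.154.130"))
```

If the relay address is outside the declared subnet, the request is probably arriving on a different vlan than the one the host entry was written for, which would be consistent with the "wrong router asking" remark.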
[10:51:01] via 208.80.154.3: [10:51:09] surely looks like the wrong router asking [10:51:16] public-b [10:51:18] yes [10:51:18] New review: Dereckson; "waiting local consensus confirmation (shellpolicy)" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/11843 [10:51:24] lemme check router config [10:51:25] that's not a labs vlan is it [10:51:39] no [10:51:45] this shouldn't be on a labs one [10:51:51] this is the controller [10:51:54] ok [10:53:32] -_- [10:53:39] so... [10:53:51] it's really, really crappy when things aren't labeled properly :( [10:54:01] ? [10:54:16] I looked in racktables to take a system [10:54:24] first checking to make sure virt1000 wasn't taken [10:54:32] of course, I should have checked dns too [10:54:36] because it already was [10:54:38] did you rename it? [10:54:49] yes, and I switched the names up [10:54:55] why couldn't you just use the existing name? [10:54:58] now I need to re-do everything [10:55:04] the existing misc name? [10:55:06] yes [10:55:12] I'd really prefer not to [10:55:26] * mark sighs [10:55:38] we're already using virt0 in pmtpa for the controller [10:56:31] if this was running random stuff, I wouldn't mind [10:56:36] but, this is part of a cluster [10:56:37] it is running random stuff [10:56:45] crap! I just figured out why I'm dog tired, and it's not post election blues... it's first day cold "whack you over the head so you sleep" symptoms... afk to try to nap a little [10:56:47] it's the controller for a cluster [10:56:50] yes [10:56:52] so it's misc [10:57:05] we name the lvs servers lvs [10:57:07] it's misc [10:57:12] those are clustered [10:57:31] the controllers for labs have services that should failover with each other too [10:57:43] where are the other controllers then? 
[10:58:05] we've also stopped using cluster names for other services too, which I dislike [10:58:15] we have databases with misc names, and bits servers with misc names [10:58:26] yes [10:58:34] it makes no sense [10:58:36] because we took misc servers for them, and we avoid renaming servers [10:59:02] when they aren't in puppet, I don't see the point in not renaming them [10:59:17] have you at least created tickets for them in the data center queues and such? [10:59:18] it's really obvious what a system is when it has a name that matches its function [10:59:55] I was going to do so after I was finished renaming it [11:00:02] and had it working [11:00:21] i'm really just gonna stop caring [11:00:28] you guys can do whatever you want with (re)naming servers [11:01:00] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds [11:01:36] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 1 seconds [11:02:59] I generally leave the names alone, but I prefer consistency across datacenters [11:04:24] hey you named it virt0 [11:04:41] well, it was originally mobile2 [11:04:50] right [11:05:03] perhaps we should just go all the way and give everything just an asset name [11:05:09] with aliases or something [11:05:41] also, some of the services on virt0 can be clustered with the virt nodes themselves [11:05:48] for instance, the api servie [11:06:23] * Ryan_Lane shrugs [11:10:14] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 23.9048310811 (gt 8.0) [11:11:53] afaik robh usually does that for all servers (asset tag alias in dns but for the mgmt interface) WMF1234.mgmt.eqiad etc. [11:14:17] he means for the servers themselves too. heh [11:16:25] <3 junos [11:22:14] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.710031441441 [11:22:36] TFTPE11: ARP timeout [11:22:40] o.O [11:23:25] well, I can get an IP now, but tftp is failing [11:25:36] mark: any clue? 
[11:25:47] which subnet is it in now? [11:25:52] public1-a-eqiad [11:26:05] so TFTP is coming from carbon, not brewster [11:26:16] yes [11:26:25] did you check whether puppet ran there? [11:26:31] it did [11:26:32] i guess it should [11:26:40] what do the logs on carbon say? [11:27:15] lemme re-run the pxe [11:27:45] nothing in daemon or syslog [11:28:14] can you ping the server while it's doing that? [11:28:54] * Ryan_Lane sighs [11:28:56] ignore me [11:29:03] i try [11:29:05] ;-) [11:31:12] this is taking me so much longer than it should [11:32:59] 80th try's a charm [11:33:00] ? [11:33:01] :) [11:33:05] \o/ [11:33:29] hooray for a booting system! [11:33:47] food [11:37:46] Ryan_Lane: and hi [11:37:58] Ryan_Lane: I did puppetd --disable, which is supposed to block puppet even if it's running :/ [11:38:08] hehe [11:38:14] there's a cron job removing that lock file [11:38:18] argh [11:38:19] you noob ;-) [11:38:27] (everyone gets bitten by that at first) [11:38:43] I'm used to having puppet run as a cronjob instead of a daemon [11:38:55] still sometimes lock files get stuck [11:39:07] when it crashes or so [11:39:22] I guess a cron job can wait for the process to disappear and clean up [11:39:25] feel free to implement that [11:39:35] as long as they don't all run at once [11:40:16] that's what fqdn_rand() is for :) [11:40:21] yeah [11:49:33] Ryan_Lane: I don't understand why we're tying eqiad with this outage [11:49:45] changing a couple of glue records has nothing to do with a secondary server [12:00:14] so it was a glue record after all? [12:02:26] yep [12:02:47] sad thing is, I said so the first day of the incident, then checked it out and decided that I was wrong [12:02:50] dammit [12:02:56] I still don't forgive myself [12:03:13] btw, do you know perhaps how to change the glue in markmonitor's panel? 
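(Editorial aside on the fqdn_rand() remark at 11:40:16.) The idea is that each host derives a stable, host-specific offset from its own name, so cron jobs across the fleet do not all fire in the same minute. The following is a rough Python sketch of that idea only. It is not Puppet's actual fqdn_rand() algorithm, and the function name `stagger_minute` and the example hostnames are hypothetical.

```python
import hashlib


def stagger_minute(fqdn, period=60, seed=""):
    """Map a hostname to a stable offset in the range [0, period)."""
    # Hashing the name gives the same answer on every run for the same host,
    # while different hosts tend to land on different offsets.
    digest = hashlib.sha256(f"{fqdn}:{seed}".encode("utf-8")).hexdigest()
    return int(digest, 16) % period


if __name__ == "__main__":
    # Hypothetical FQDNs, for illustration only.
    for host in ("host-a.example.org", "host-b.example.org"):
        print(host, stagger_minute(host))
```

A cleanup job scheduled at minute `stagger_minute(fqdn)` would run once per hour per host without every host hitting the puppetmaster at once. The optional `seed` lets two different jobs on the same host get different offsets.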
[12:03:19] I don't have access and Ryan could not figure it out [12:05:35] i checked it at as well [12:05:37] didn't see one either [12:06:16] so, if we can't /ever/ change nsN's IPs? [12:06:17] scary :) [12:06:40] we have done so many times in the past [12:06:50] but not with markmonitor? [12:06:53] indeed [12:07:02] but they can do it for us [12:07:39] PROBLEM - Host mw1015 is DOWN: PING CRITICAL - Packet loss = 100% [12:07:50] ok I lost my MM password already [12:08:55] btw, there's a new spree of DDoS using amplified DNS traffic [12:09:33] does docs.python.org work for you? [12:10:26] hmmz [12:10:28] my ipv6 is down [12:10:55] doesn't work for me [12:11:34] ah they're hosted at xs4all [12:11:38] and this isp is xs4all as well [12:12:16] even over v4 though [12:12:29] ok [12:12:46] New patchset: Dereckson; "(bug 37679) Change zh.wikipedia rollback configuration" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11848 [12:12:52] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11848 [12:25:04] paravoid: well, we're going to change the dns entries anyway [12:25:16] why not bring up the second resolver and be done with it? 
[12:25:35] we're changing the NS records, that is [12:26:11] if you think we're ready, sure, np [12:26:24] I have no idea how prepared we/you're for that [12:26:33] ldap replication and whatnot I suppose [12:26:45] ldap replication, yes [12:29:30] it's already puppetized [12:29:39] we may need to do slight changes [12:38:04] New patchset: ArielGlenn; "doc on how to deploy dump scripts" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/11849 [12:38:53] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11849 [12:38:55] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/11849 [13:02:21] def __hash__(self): [13:02:21] import operator [13:02:21] return reduce(operator.xor, map(hash, self.itervalues()), 0) [13:02:24] will you stab me now, ryan [13:02:31] ....? [13:02:34] oh [13:02:35] haha [13:02:41] I did it with a lambda at first [13:02:43] but killed it off [13:05:37] well, that's a simple enough lambda [13:12:41] mark: in dns::auth-server, why have args like $ipaddress="", just to set them to dns_auth_ipaddress, then check to see if they are set? [13:12:53] why not just have dns_auth_ipaddress as a required param? [13:13:59] New patchset: Hashar; "(bug 37545) farsi: change default Collection namespace" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11161 [13:14:06] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11161 [13:15:06] New review: Hashar; "Patchset 2 : uses a wmg as advised by Sam" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/11161 [13:15:38] because parameters didn't exist at the time? [13:15:56] this is a parameterized class [13:16:05] is now [13:16:13] why not change the params, then? [13:16:18] and get rid of the ifs? [13:16:28] or, lemme rephrase, can I do so? 
:) [13:19:49] man, the openstack stuff needs to be redone so, so badl [13:19:52] *badly [13:19:59] as parameterized classes and roles [13:38:06] New patchset: Ryan Lane; "Turn parameterize auth::dns::ldap and make a role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11854 [13:38:09] mark: tell me if you hate this ^^ [13:38:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11854 [13:41:07] In the long run I'll be turning all of the openstack cluster configuration into a hash, and using that rather than what I'm doing here [13:41:22] but that's far too much work for what I need to do right now [13:43:12] paravoid: I think the ciscos in pmtpa are ready for use [13:43:31] oh? [13:43:34] are they? [13:43:57] yep [13:44:24] networked and everything? [13:44:24] cool! [13:44:56] let me check my mail to see if I find relevant info there [13:45:05] there' s nothing in the mail [13:45:11] so, it's wired, but that's it [13:45:20] they should be in racktables [13:45:26] mgmt interfaces? [13:45:29] hostnames? [13:45:33] should also hopefully be done [13:46:36] virt6-15 [13:46:40] -_- [13:46:43] see, there's /something/ in the mail :) [13:46:44] heh [13:46:59] what was the drac --getmacfornic you run before? [13:47:02] forgot to ask you [13:47:06] it's on fenari [13:47:13] it gives back the mgmt macs though :( [13:47:25] it won't work for the ciscos [13:47:34] because they're not dracs? :) [13:47:38] yes [13:47:42] would be nice to adjust it for ciscos [13:47:47] so it works for either [13:51:42] so, [13:51:50] should I set them all up? [13:52:01] I'd set up 6, 7, and 8 [13:52:08] then we can replace 1, 2, 3 and 4 [13:52:47] we can block migrate the instances off of the old ones [13:53:02] as to not setup gluster on the new ones you mean [13:53:05] yes [13:53:10] do these have hw raid? [13:53:14] no :( [13:53:27] do we have a proper partman recipe? 
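(Editorial aside on the `__hash__` snippet pasted at 13:02:21.) That snippet is Python 2: `reduce` is a builtin there and `itervalues()` exists on dicts. A Python 3 rendering might look like the sketch below. The class name `HashableDict` is hypothetical, since the log does not show which class the method belongs to. The use of `itervalues()` only suggests it is dict-like.

```python
from functools import reduce
import operator


class HashableDict(dict):
    """Hypothetical dict subclass; the log does not show the real class."""

    def __hash__(self):
        # XOR is commutative and associative, so the result does not
        # depend on the order in which the values are visited.
        return reduce(operator.xor, map(hash, self.values()), 0)


if __name__ == "__main__":
    a = HashableDict(x=1, y=2)
    b = HashableDict(y=2, x=1)
    print(hash(a) == hash(b))
```

Two caveats apply to this approach: only the values are hashed, so dicts with the same values under different keys collide, and identical values cancel each other out under XOR. All values must themselves be hashable, and the object should not be mutated while it is in use as a dict key or set member.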
[13:53:28] not usable hardware raid anyway [13:53:34] well, andrewbogott was working on it [13:53:43] didn't you just setup a cisco? [13:53:47] I did. manually [13:53:50] that's the only reason I'm asking instead of looking myself :) [13:53:51] ah [13:54:04] paravoid: The issue is that the cisco server's drive configs are not all the same. [13:54:05] the partmap recipie uses the same sizes for partitions [13:54:12] andrewbogott: ARGHHH [13:54:16] recipe* [13:54:18] RobH was looking at it but quit in disgust. [13:54:51] paravoid: That said, my partman recipe works great for some of them :) [13:54:55] Ryan_Lane: you said you have a preference for raid 10? [13:55:00] yes [13:55:02] vs. raid 6 I presume [13:55:14] it eats up more space, but is faster [13:55:47] I think we need the speed more than the space [13:55:52] paravoid: And, really, there are only two different configurations. So we could just categorize them and have two different partmans. [13:56:02] But I'm still hoping RobH will magically make them all conform. [13:56:07] that would be ideal [13:56:34] paravoid: mind doing a quick review of: https://gerrit.wikimedia.org/r/11854 [13:56:34] ? [13:57:36] paravoid: If your cisco has drives a-h then use 'virt-raid10.cfg' and if it has c-j use 'virt-raid10-cisco.cfg'. [13:58:04] c-j? [13:58:06] eh!? [13:58:12] Ryan_Lane: (looking) [13:58:50] I plan on making all of the openstack stuff parameterized and have roles [13:58:57] yay :) [13:58:58] this is just a start at that [13:59:20] since I need this one right now :) [13:59:31] paravoid: Yes, they each have eight drives but some of them seem to start at sdc instead of sda. [14:02:18] New review: Faidon; "Role class class includes "dns::auth-server" instead of "dns::auth-server::ldap"." 
[operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/11854 [14:02:29] bah [14:02:31] heh [14:02:39] :-) [14:02:42] that's what reviews are for :) [14:02:49] indeed [14:03:12] that would have ended poorly [14:03:13] New patchset: Ryan Lane; "Turn parameterize auth::dns::ldap and make a role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11854 [14:03:29] that was the only issue? [14:03:30] me likes the direction of this change *a lot* [14:03:35] yeah, me too [14:03:40] this is way more flexible [14:03:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11854 [14:04:23] even better, for labs we can have a "role::dns::ldap::labs" class, and it can use ipaddress_eth0 for everything [14:04:36] so [14:04:40] where are you using those variables? [14:04:46] in the template [14:04:58] which template? [14:05:12] I don't see dns_auth_soa_name anywhere [14:05:16] and I wonder what it is :) [14:05:28] I see it [14:05:28] nor the other dns_auth_* [14:05:29] :) [14:05:32] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [14:05:33] I see those too ;) [14:05:37] which file? [14:05:42] templates/powerdns/pdns-ldap.conf.erb [14:06:37] ah, they're not in the diff but there were before [14:06:47] they aren't there now? [14:07:03] no they are, I was just looking at the diff [14:07:12] ah [14:07:14] * Ryan_Lane nods [14:07:28] hrm [14:07:32] so [14:07:41] the two zones [14:07:50] should have the same authority in both nameservers [14:07:58] ah [14:07:59] we shouldn't have labs-ns0 on one server and labs-ns1 on the other server [14:08:00] right [14:08:03] indeed [14:08:14] I don't think anything would break [14:08:20] but still, it's not a very good practice [14:08:32] that can also be arbitrary [14:08:54] that's dns_auth_soa_name, right? 
[14:08:58] well [14:09:09] dns_auth_soa_name sets powerdns' default-soa-name [14:09:39] which is "name to insert in the SOA record if none set in the backend" [14:09:41] I don't see dns_auth_master used anywhere [14:09:42] according to pdns documentation [14:09:51] oh, that's used in the manifests [14:10:11] well, the backends are going to have labs-ns0 [14:10:32] because you have SOA in LDAP, right? [14:10:35] yes [14:10:37] that's what I thought [14:10:42] so it doesn't matter at all [14:10:48] you can even unset that variable completely I think [14:10:58] well, I'll keep it, for shits and giggles [14:10:59] or not, maybe pdns will bork :) [14:11:33] ah [14:11:37] dns_auth_master isn't used in the ldap one [14:12:25] New patchset: Ryan Lane; "Turn parameterize auth::dns::ldap and make a role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11854 [14:12:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11854 [14:13:06] New patchset: Ryan Lane; "Turn parameterize auth::dns::ldap and make a role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11854 [14:13:22] I had forgotten to make the params required [14:13:27] and removed the unneeded one [14:13:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11854 [14:13:51] !log rebooting snapshot4, kernel and other updates [14:13:56] Logged the message, Master [14:13:59] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11854 [14:14:03] +2'ed but not submitted [14:14:09] cool. 
thanks [14:14:17] :) [14:14:22] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11854 [14:14:25] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11854 [14:14:29] now let's see if it installs without any issues [14:14:31] :) [14:17:04] New patchset: Ryan Lane; "Adding firewall rules for new controller" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11857 [14:17:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11857 [14:18:39] !log reboot snapshot3, package and kerne updates [14:18:43] Logged the message, Master [14:19:38] Could not evaluate: Could not find group ssl-cert [14:19:39] hm [14:19:41] really? [14:19:53] eeh?! [14:20:00] petan: re [14:20:14] petan: how are you doing ? [14:20:15] yes [14:20:27] damn it, servermon does ~6900 sql queries for our hardware inventory :( [14:20:31] how could ssl-cert be missing? [14:20:40] OrenBo: good [14:21:41] !log creating new list MediaWiki-commits, not in use yet, but will replace outdated -cvs list soon [14:21:45] Logged the message, Master [14:21:49] Ryan_Lane: you have install_certificate [14:21:59] long time no meet [14:22:05] is ssl-cert installed by apache, or something? [14:22:10] ssl-certificates [14:22:12] iirc [14:22:19] ah [14:22:21] er, ca-certificates [14:22:21] ssl-cert [14:22:29] it's a package ; [14:22:30] ;) [14:23:01] there's something in puppet about this [14:23:09] notpeter: can you send me an email - with links to the stuff you showed me at berlin [14:23:24] Ryan_Lane: certs::groups::ssl-cert [14:23:26] New patchset: Ryan Lane; "Add required ssl-cert package" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11859 [14:23:28] it's probably incorrect [14:23:33] bah, why? 
[14:23:38] let's just install the package [14:23:40] notpeter: the pupet defs for lucene search [14:23:48] yes, but fix that class too then? :) [14:23:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11859 [14:24:10] OrenBo: yes! sorry for being slow. I'm just now back in the swing of things in the US. I shall do that today [14:24:14] (or remove) [14:24:18] that's for hardy [14:24:23] it's unused above that [14:24:56] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11857 [14:24:58] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11857 [14:25:06] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11859 [14:25:06] it was wrong for hardy too [14:25:09] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11859 [14:25:36] heh [14:25:43] I'll remove it, then [14:26:00] yep, just extracted hardy's ssl-cert deb [14:26:07] and verified that its postinst creates the group [14:26:18] heh [14:26:35] New patchset: Ryan Lane; "Remove unneeded class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11860 [14:26:56] gerrit-wm: hurry up, you [14:27:03] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11860 [14:27:03] New review: Faidon; "Verified that hardy's ssl-cert package creates the group." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11860 [14:27:03] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11860 [14:27:07] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11860 [14:27:10] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11860 [14:28:28] I'm going to need to adjust these firewall rules for replication to work [14:28:30] New patchset: Dzahn; "prepare lighttpd and mailman redirects for mailist list rename to not break old URLs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11861 [14:28:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11861 [14:29:18] notpeter: that no problem - I was busy with writing up some research papers on my other work (robots) [14:29:39] notpeter: that no problem - but I'm back working on SOLR now and I'd like to puppetize it too. [14:29:48] New patchset: Dzahn; "prepare lighttpd and mailman redirects for mailist list rename to not break old URLs. gah, csv != cvs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11861 [14:29:55] OrenBo: sweet! sounds good. [14:30:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11861 [14:30:27] It will be if we cooperate on it! [14:32:19] New patchset: Ryan Lane; "Add ldap rules for virt0/virt100" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11862 [14:32:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11862 [14:34:32] New review: Dzahn; "i'll merge it. needs timing right after making the switch on shell." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/11861 [14:37:54] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11862 [14:37:57] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11862 [14:45:19] New review: Jeremyb; "I'm sorry, maybe I wasn't clear." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/11161 [14:47:51] New patchset: Hashar; "(bug 37662) change wgUploadNavigationUrl @ dawiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11745 [14:47:57] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11745 [14:49:29] fuuuuuuuck [14:49:32] I ran puppet on virt0 [14:49:34] hahaha [14:49:52] i needed to add the new firewall rules :( [14:49:57] New review: Hashar; "Made 'campaign' the last parameter." [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11745 [14:50:00] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11745 [14:52:41] there we go [14:52:56] !log hume is out of disk space again. Probably the wmf branches taking toooo much space [14:53:01] Logged the message, Master [14:55:05] hi guys [14:55:36] New review: Hashar; "Deployed on live site." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11745 [14:55:59] Erik Z is supposd to have readonly NFS mounts on stat1 for a bit [14:56:00] there's not much space on hume to move things aroun anyways [14:56:12] they haven't been working since we reinstalled the OS last week [14:56:16] can anyone here help me out? 
[14:56:45] apergos: yup :/ I have opened an RT Ticket for peter youngmeister to have a look at [14:57:03] apergos: also, it seems hume disk need to be replaced anyway so … that will eventually be fixed one day ;) [14:57:21] actually, it is just one nfs mount [14:57:33] /mnt/data -> 208.80.152.185:/data [14:57:39] I wonder what "uncommon" is over here [14:57:44] $ sudo mount /mnt/data [14:57:44] mount: Connection refused [14:57:46] as in [14:57:57] /usr/local/apache/uncommon [14:58:16] whatever it is, it was updated june 11 so it's recent [14:58:17] mark: ^^ [14:58:25] wasn't there some discussion about getting rid of nfs? [14:58:40] yes yes [14:58:43] this is a readonly mount right [14:58:44] and temporary [14:58:46] hmm it's small nm [14:59:16] that's ds2 btw (the nfs) [14:59:18] I wonder if we can lose 1.17 1.17test 1.18 [14:59:48] paravoid: try this: dig @208.80.154.18 bastion.wmflabs.org [14:59:49] ;) [15:00:26] oooooooooooh, Ryan_Lane [15:00:40] or, even better: dig @labs-ns1.wikimedia.org bastion.wmflabs.org [15:00:53] Change abandoned: Hashar; "Oh that was there already. So abandoning per Jeremy." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11161 [15:01:04] hashar: do you know what uses those dirs on hume anyways? [15:01:31] I guess if we're running maintenace scripts etc [15:01:43] ah Reedy yay you're here [15:01:48] so I'm looking at: [15:02:02] rats. [15:02:08] nothing big enough to clear out enough space [15:02:45] paravoid: so, I think we're good to go to change the NS records now [15:02:45] 684M php-1.20wmf2, 1.2G php-1.20wmf3, 1.9G php-1.20wmf4, 1.2G php-1.20wmf5 [15:02:51] that is the problem in a nutshwll [15:02:53] shell [15:03:17] do the cache/l10n dirs still exists in wmf2/wmf3? [15:03:27] lemme look [15:03:52] at least on wmf3 it would look like it does [15:04:00] 87m for wmf3 [15:04:27] 89m wmf2 [15:04:37] not much but if I can toss em I will [15:04:40] can't we delete wmf2 and wmf3 ? 
[15:04:54] Maybe wmf2, but not sure [15:05:00] hey, now we have an ldap secondary too, I should update the gerrit config [15:05:04] it might still be referenced in cached files.... [15:05:10] well if it snot in wikiversions.dat .. [15:05:11] I can oss he cache l10ns? or no? [15:05:12] apergos: we could resize /archive a bit [15:05:31] Reedy: [15:05:34] *toss [15:05:40] *the [15:05:44] yeah, l10n caches can go [15:05:48] ok great [15:05:50] we're not using them as php stuff [15:05:58] it's JS/images etc that are still referenced [15:06:16] New review: Mark Bergsma; "Fully qualified global variables please :)" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/11296 [15:06:19] I don't think 'hume' is serving any web traffic anyway [15:06:25] 173M free :-D [15:06:25] Ryan_Lane: and what about a local mysqld? (not another DC) [15:06:35] isn't hume only used to run slow running scripts / batches ? [15:06:51] I thoughtit was for expensive jobs yeah [15:06:56] /dev/sda1 6.8G 3.2G 3.3G 50% / [15:06:59] \O/ [15:07:08] jeremyb: for gerrit? [15:07:14] anyway, the next time someone run scap, it will be overloaded again :-( [15:07:19] Ryan_Lane: yah [15:07:22] I have a ticket in to procure misc dbs for eqiad [15:07:29] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [15:07:42] (oops my df copy paste was the wrong partition) [15:07:50] it's not /dev/sda1, it's /usr/local/apache [15:07:54] and it's 173M there [15:08:35] does scap copy to hume? [15:08:42] !log testing gerrit config with multiple ldap servers [15:08:46] Logged the message, Master [15:08:55] gerrit is going down for a sec [15:09:15] that worked [15:09:36] apergos: hume is part of dsh group mediawiki-installation which in turn is the group used by scap to run sync [15:09:51] Ryan_Lane: now block ldap access to one with iptables and then the other? [15:09:53] so unless we lose the l10n dirs on fenari... [15:10:02] jeremyb: eh? [15:10:07] block access? 
[15:10:10] Ryan_Lane: to make sure it really works ;P [15:10:18] heh [15:10:20] simulate an outage [15:10:27] I'd prefer not to :) [15:10:29] also, labsconsole can get the new one? [15:10:30] just yet [15:10:44] * Damianz unplugs jeremyb from his router [15:10:50] Reedy: can I lose fenari:/home/wikipedia/common/php-1.20wmf2/cache/l10n (and wmf3 the same)? [15:11:31] oh, yeah [15:11:32] crap [15:11:35] I forgot about those [15:11:42] I'm guessing scap pushed them back out again [15:11:46] yep [15:11:51] !log added virt1000 as a secondary ldap server for labsconsole [15:11:55] ok tossing them [15:11:56] Logged the message, Master [15:12:02] (*wooow*) [15:12:03] maybe create the directories on hume so they are not writable, that would prevent scap from copying them again [15:12:11] (at the price of giving us an error message though) [15:12:13] this doesn't seem right [15:12:16] if I toss em from fenari then we don't worry about it [15:12:26] sure :-] [15:12:33] because it isn't. heh [15:12:39] I don't know how to configure my own plugin [15:12:51] Reedy: maybe add "delete old cache/l10n" to your release process ? [15:13:00] I somewhat did :p [15:13:02] done. [15:13:05] \O/ [15:13:21] do you have a link to that ticket hashar? 
[15:13:25] I'll note this there [15:14:06] apergos: I add a ticket about the / partition : https://rt.wikimedia.org/Ticket/Display.html?id=3125 [15:14:16] apergos: but that issue you have worked on is a different issue [15:14:20] since that is /usr/local/apache [15:14:31] oh yours was for / [15:14:33] nm then [15:14:53] hmm but well I will comment on the ticket anywayss [15:15:35] New patchset: Matthias Mullie; "Bug 37616 - Article Feedback - Increase Test Sample to 10% of English Wikipedia (using article ID code)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11866 [15:15:41] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11866 [15:15:47] hi guys [15:15:52] can anyone help me with this part of this ticket? [15:15:52] https://rt.wikimedia.org/Ticket/Display.html?id=2162#txn-66336 [15:16:04] I need the /mnt/data -> 208.80.152.185:/data mount on stat1 to work [15:16:07] it is read only [15:16:12] Erik Z uses this to generate some stats [15:17:10] Jeff_Green: /archive isn't the problem over there.... [15:17:23] it's the apache dir [15:17:53] what? [15:18:13] 5231616 5054504 177112 97% /usr/local/apache <--- problem [15:18:14] yes? [15:18:27] tank-archive is /archive [15:18:32] tank-apache is /usr/local/apache [15:18:33] ? [15:18:50] if tank-apache can be resized, sweet [15:19:01] I'm suggesting we steal some space from tank-archive and move it to tank-apache [15:19:06] oh :-D [15:19:08] New patchset: Ryan Lane; "Enable multiple LDAP servers for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11868 [15:19:16] I parsed that in a different fashion :-D [15:19:29] yeah sure [15:19:39] o ha. /archive is currently the emergency backup for fundrasiing banner impressions [15:19:40] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11868 [15:19:54] New patchset: Ryan Lane; "Enable multiple LDAP servers for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11868 [15:20:02] there are a couple solutions in the pipeline for that, which will allow us to move it completely off of hume [15:20:21] gerrit-wm: faster! [15:20:23] :) [15:20:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11868 [15:20:30] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11868 [15:20:32] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11868 [15:21:01] now lets see if I break gerrit [15:21:09] how do you move extents around (as opposed to just a resize)? I guess the start point for the one extent will be different [15:21:46] it seems happy enough [15:23:46] Ryan_Lane: have you upgraded Gerrit ? :-) [15:23:53] hashar: upgraded? [15:23:59] I added a second ldap server to its config [15:24:06] so if the primary goes down, we can still use it [15:24:18] apergos: I'm don't know--also I don't know how to do it without exploding the xfs partitions that are on top of it. we can afford to blow away /archive short-term though [15:24:29] ah I see [15:24:32] Ryan_Lane: that might help provided they are both using a different domain name *grin* [15:25:13] um that would be a no [15:25:25] !log rebooting es1002 and es1003 [15:25:29] Logged the message, Master [15:25:36] there's not space to copy stuff elsewhere and these are used in the udp log processing pipeline it looks like [15:26:02] as in 5 minutes ago. [15:26:33] lemme look at lv tools and xfs docs [15:26:36] apergos: ./udplogs is a partial replica of what is also on storage3 and aluminium, we can lose that for now [15:26:43] ohh [15:26:49] so you have a complete copy elsewhere? 
[15:26:55] hmmm [15:27:16] ./incoming_udplogs is indeed part of the udplog pipeline, but that can be stopped for a bit, OR I can move it back to storage3 which is where it was until storage3 was lost for a month [15:27:27] ok [15:27:30] hmmmm [15:28:47] PROBLEM - Host es1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:54] hashar: what do you mean? [15:29:23] New patchset: Hashar; "Added ffmpeg2theora to the imagescaler package list." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11870 [15:29:30] both ldap servers are in different datacenters [15:29:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11870 [15:30:00] Ryan_Lane: sorry I was just teasing you regarding to last week DNS outage [15:30:05] ah [15:30:05] heh [15:30:07] (outage /disruption) [15:30:14] well, we technically haven't fixed that [15:30:18] I am kind of like teasing people [15:30:29] that is not obvious over IRC though :-( sorry bout that [15:30:33] heh [15:30:48] I just added a secondary dns server too ;) [15:31:10] as a bind slave of the first/master/primary ? [15:31:15] no [15:31:19] don't bother moving stuff yet, lemme see if I can be clever about this (opportunity to get better at lv/pv stuff too) [15:31:20] they use LDAP as a backend [15:31:24] oh [15:31:28] does it work? 
*grin* [15:31:34] so, you set up LDAP replication and your dns replication is done automatically [15:31:35] yep [15:31:38] never managed to get LDAP replication to work properly :-( [15:31:50] so I always used bind replication [15:31:50] dig @labs-ns0.wikimedia.org bastion.wmflabs.org [15:31:53] dig @labs-ns1.wikimedia.org bastion.wmflabs.org [15:32:11] with a stealth primary server not reachable from outside and replicating to two public facing primary servers [15:32:17] I should really have failover for the LDAP lookups, too [15:32:56] when I redo the openstack puppet config I'll make ldap work sanely in the config [15:33:08] the difficult thing is, you want the primary in your same datacenter always :) [15:33:12] Ryan_Lane: sorry, just saw those [15:34:02] seems to work [15:34:03] great job :) [15:35:58] New patchset: Jgreen; "misc::fundraising::impressionlog::compress" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11871 [15:35:59] RECOVERY - Host es1002 is UP: PING OK - Packet loss = 0%, RTA = 26.75 ms [15:36:29] New review: Andrew Bogott; "Simple enough." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11870 [15:36:29] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11870 [15:36:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11871 [15:36:43] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11871 [15:36:45] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11871 [15:38:28] mark: hi [15:38:34] ottomata: ping [15:38:38] mark: Ryan told me you know how wikimedia irc works [15:38:49] ping! 
[15:38:57] New patchset: Ottomata; "statistics.pp - ensuring that gerrit-stats repo is at the latest head" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11872 [15:39:05] mark: I would like to insert feed from beta to wikimedia irc, but I have no idea how to register more channels there [15:39:11] ottomata: hhhmmm, I was looking for something more like pong... [15:39:13] :) [15:39:17] pung! [15:39:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11872 [15:39:28] perng! [15:39:30] anyway! [15:39:39] so hiii! [15:39:42] what's up with new udp2log instance? [15:39:44] paravoid: I'm going to change the NS records soon [15:39:44] PROBLEM - mysqld processes on es1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [15:39:51] whatcha mean? [15:39:55] soon/now [15:39:56] for the search results [15:39:59] ohho [15:40:00] waitin on this [15:40:02] er, queries [15:40:04] kk [15:40:07] the review [15:40:11] https://gerrit.wikimedia.org/r/#/c/11574/ [15:40:13] ja [15:40:22] yeah [15:40:46] actually, you made some of the original refactors of this stuff, right? [15:40:50] would love your eyes on it as well [15:41:01] if you can merge it, i'd be happy to babysit it right now with you [15:41:07] ok, I'll take a look [15:44:31] paravoid: Failed to change nameservers from: [virt0.wikimedia.org, labsconsole.wikimedia.org] to: [labs-ns0.wikimedia.org, labs-ns1.wikimedia.org] because: There was an error creating host labs-ns0.wikimedia.org at the registry. Any actions depending upon the successful creation of this host have been aborted. [15:44:50] PROBLEM - Host es1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:45:18] New patchset: Ottomata; "Refactoring udp2log classes and defines." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11574 [15:45:49] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11574 [15:46:20] RECOVERY - Host es1002 is UP: PING OK - Packet loss = 0%, RTA = 26.42 ms [15:46:22] notpeter, I found a couple of problems myself looking over it, just submitted that new patchset to fix them [15:46:44] kk [15:46:53] paravoid: any clue? [15:47:00] sigh, haven't seen this one before: ! [remote rejected] HEAD -> refs/for/production (missing Change-Id in commit message) [15:47:19] maybe they need service IPs so that forward and reverse perfectly match [15:48:01] any chance you committed before a rebase or a git pull? [15:48:47] (I got that complaint for just such a reason, a little unintuitive for sure) [15:49:10] !log assigned service IPs for labs-ns0/labs-ns1 [15:49:15] Logged the message, Master [15:49:26] New patchset: Hashar; "Ensure that /a exists on imagescalers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11873 [15:49:52] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/11873 [15:51:05] New patchset: Alex Monk; "(bug 37661) Change Vietnamese Wikibooks logo" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11747 [15:51:08] New patchset: Hashar; "Ensure that /a exists on imagescalers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11873 [15:51:24] Reedy: I love how we have more and more volunteers handling shell requests :-] [15:51:38] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11747 [15:51:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11873 [15:52:11] !log making the mailing list switch. 
mediawiki-cvs -> mediawiki-commits [15:52:16] Logged the message, Master [15:53:30] New patchset: Jgreen; "fixed definition naming conflict in fundraising.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11874 [15:54:00] New review: Dzahn; "BZ 19958 - RT-3097" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11861 [15:54:00] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11861 [15:54:01] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11874 [15:54:01] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11874 [15:54:57] Jeff_Green: can you merge all? mine is harmless but need it asap [15:55:11] both in sockpuppet [15:55:14] mutante: yep, was just gonna ask and do so [15:55:19] thx [15:55:28] np, done [15:56:52] New patchset: Ryan Lane; "Add service IP to labs dns servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11875 [15:57:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11875 [15:57:27] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11875 [15:57:29] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11875 [15:58:57] !log copied full config/users/passes from mediawiki-cvs to mediawiki-commits, merged redirects, added old list name to acceptable_aliases in recipient filters [15:59:02] Logged the message, Master [15:59:45] !log there have been no archives, so that should be it. 
there may be another issue in BZ 37690 but should be unchanged by renaming [15:59:50] Logged the message, Master [16:00:11] paravoid: I added in 153.135 to puppet, so we can re-enable puppet [16:01:06] I also made labs-ns0/1 service IPs, so that we can move them from system to system [16:01:51] ok well booooo can't shrink an xfs partition, you can only grow it so that puts the kabosh on that idea [16:04:00] notpeter, how's it looking? [16:04:13] Ryan_Lane: you rock [16:04:21] hm [16:04:33] pdns seems to hate this: query-local-address=208.80.152.32 208.80.152.33 208.80.153.135 [16:04:53] it's a comma, isn't it? [16:04:57] its separator [16:05:04] spaces work, I think [16:05:08] !log rebooting es1001 [16:05:13] Logged the message, Master [16:05:21] well, lemme rephrase, they do for local-address [16:05:31] ah, right [16:05:38] query-local-address can only be one [16:05:39] seems you can only have one address [16:05:40] that makes sense [16:05:41] yeah [16:05:48] I need to change that template [16:05:51] that's how it's going to do the outgoing requests [16:05:55] yeah [16:06:12] it doesn't matter which one, unless you have special access lists on other NS or weird multihoming [16:06:36] I'd prefer to use the service address [16:07:34] hey mutatne or maybe apergos [16:07:46] could one of you help me with this? [16:07:46] https://rt.wikimedia.org/Ticket/Display.html?id=2162#txn-66342 [16:07:52] mutante* [16:08:34] New patchset: Ryan Lane; "Add a new variable dns_auth_query_address" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11876 [16:09:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11876 [16:09:09] New patchset: Ryan Lane; "Add a new variable dns_auth_query_address" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11876 [16:09:39] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11876 [16:09:44] PROBLEM - mysqld processes on es1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [16:09:46] so Jeff_Green, you wanna move/do whatever with that data and I'll see if I can patch stuff up over on hume? [16:09:53] yep [16:10:06] i think I just got the pipeline moved to storage3 [16:10:09] ok [16:10:20] ottomata: got distracted... sorry. gimme a little bit [16:10:26] and you can nuke /archive temporarily [16:10:30] ok [16:10:48] np [16:11:14] RECOVERY - mysqld processes on es1001 is OK: PROCS OK: 1 process with command name mysqld [16:11:24] New patchset: Ryan Lane; "Add a new variable dns_auth_query_address" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11876 [16:11:47] paravoid: ^^ review? [16:11:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11876 [16:12:45] are you in the directory Jeff_Green? [16:12:57] out now [16:12:59] ok [16:13:07] I guess I'm going to have to nuke /usr/local/apache [16:13:13] hmm [16:13:19] I mean [16:13:29] yeah there's nowhere else to move it [16:13:37] we need DiskQuintupler [16:13:51] lemme think about this again for 2 secs [16:14:25] you *could* blast tank-archive, and create tank-apache-new, move data over, then blast [16:14:43] maybe not [16:14:49] Jeff_Green: that's basically how I did it on the apaches [16:14:55] except /a was empty [16:15:01] ya [16:15:04] need to look at something first [16:15:11] and /archive can be rendered empty [16:15:30] Ryan_Lane: why not just take the first local? 
[16:15:41] what you did is fine too though [16:15:47] whichever you prefer [16:15:56] I think I'll do it this way :) [16:16:13] the first local may not be the one you want to use [16:16:18] and it's not clear [16:16:39] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11876 [16:16:43] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11876 [16:17:45] I think you can also not specify query-local-address at all [16:17:47] lemme force run on ns2 to make sure I'm not going to kill everything [16:18:05] fuck [16:18:23] New patchset: Ryan Lane; "Revert "Add a new variable dns_auth_query_address"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11877 [16:18:45] see you later tonight [16:18:51] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11877 [16:18:51] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11877 [16:19:00] that change didn't work [16:22:20] PROBLEM - Auth DNS on labsconsole.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [16:23:27] New patchset: Ryan Lane; "Add a new variable dns_auth_query_address" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11878 [16:23:29] let's try this again [16:23:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11878 [16:24:56] -_- [16:25:03] for some reason puppet killed pdns [16:25:12] RECOVERY - Auth DNS on labsconsole.wikimedia.org is OK: DNS OK: 0.080 seconds response time. 
www.wikipedia.wmflabs.org returns 208.80.153.197 [16:25:34] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11878 [16:25:36] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11878 [16:29:51] !log restarting lighttpd on sodium - redirecting mediawiki-cvs list page [16:29:51] Jeff_Green: do you know if we use any special options to make.xfs? [16:29:56] Logged the message, Master [16:30:09] !log installing package upgrades on sodium [16:30:14] Logged the message, Master [16:30:51] ottomata: sorry i was busy and soon we have a meeting [16:30:58] hey apergos, would you have some time today to resolve a NFS mounting issue on stat1? [16:31:04] not right now [16:31:06] maybe in a bit [16:31:14] nm jeff I am going to steal from the opts on wikitech [16:32:09] apergos: i hope so because it's a bit urgent [16:34:51] New patchset: Ryan Lane; "Setting query-local-address for labs eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11879 [16:35:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11879 [16:35:32] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11879 [16:35:34] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11879 [16:40:33] New patchset: Ryan Lane; "Fix ip address for labs-ns1 on virt1000" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11881 [16:40:46] why do people let me do dns? for real. [16:41:04] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11881 [16:41:11] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11881 [16:41:14] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11881 [16:44:20] Jeff_Green: /archive is back, sorry that took so long [16:44:25] we should be in business now [16:44:57] a little bit of panicking over volume and filesystem labels, turns out I could ignore them [16:45:33] apergos: no problem at all [16:46:50] New review: Hashar; "This come from the test branch. On second though, maybe we need a dedicated class to describe /a" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/11873 [16:47:29] to be clear I did not copy data from wherever you put it, just mounted the empty filesystem. [16:50:26] drdee: ok, done with my task... what's up? [16:54:44] ah [16:54:44] hey [16:54:45] so [16:54:51] https://rt.wikimedia.org/Ticket/Display.html?id=2162#txn-66331 [16:54:55] apergos: ottomata has a question:D [16:54:56] there used to be 2 nfs mounts on stat1 [16:55:01] sure [16:55:02] we reinstalled the OS last week [16:55:06] and then there was the install [16:55:08] and now..? [16:55:10] now they don't work/dont' exist [16:55:13] the /mnt/data one [16:55:23] ok, letme have a look [16:55:24] $ sudo mount /mnt/data [16:55:24] mount: Connection refused [16:55:34] obviously dataset2 or whatever will be fine so it's jsut on the stat1 side [16:55:34] the other one…I think we may have set it up manually, rather than in puppet [16:55:41] since it was meant to be a temporary thing [16:55:49] hmmm, really? [16:55:51] unless you have anew ip [16:55:52] ? [16:55:54] i dunno [16:56:00] well let me poke at it. 
[16:56:58] stat1.wikimedia.org still listed on dataset2 so that's dine [16:56:59] fine [16:57:32] stat1.wikimedia.org is still the hostname [16:58:46] svc: failed to register lockdv1 RPC service [16:59:16] paravoid: so, markmonitor isn't letting me set labs-ns0/labs-ns1 for some reason [16:59:36] maybe it hates -? [16:59:38] New patchset: Jgreen; "added log creation to fundraising impression log compression class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11883 [17:00:09] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11883 [17:00:09] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11883 [17:02:16] Ryan_Lane: OH COME ON [17:02:23] agreed [17:02:25] wtf [17:02:36] do they have some kind of support? [17:02:39] There was an error creating host http://labs-ns0.wikimedia.org/ at the registry. Any actions depending upon the successful creation of this host have been aborted. [17:02:45] ignore the url [17:02:52] eh!?! [17:02:53] that's copy/paste BS in my client [17:03:10] I don't see why. [17:03:44] apergos, sorry, in a meeting [17:03:58] yeah, stat1 is still the hostname, would an IP change cause a problem? [17:04:00] i'm not sure if it changed [17:04:14] because NFS auth is likely set on the IP [17:04:18] not on the hostname [17:04:27] (it's more secure and it's faster) [17:04:33] ottomata: ok that looks pretty reasonable to me (re: udp2log changes) [17:05:51] ok cool, think we can merge and babysit them in a few minutes here (in a meeting at the moment) [17:06:44] sure (although I have a meeting in 54 minutes) [17:06:51] want to after that? 
[17:07:05] ok [17:07:09] cool [17:08:46] cmjohnson1: dunno if you saw, but search32 is dead again :( [17:11:41] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [17:11:59] the permissions are granted by hostname [17:12:05] so no that'snot going to be the issue [17:12:17] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [17:13:32] Ryan_Lane: support? [17:13:42] cmjohnson1: around? [17:13:44] I emailed support [17:13:51] well, who they told me is our support [17:14:10] who monitors core-ops@rt.wikimedia.org? [17:14:35] i am expecting an email from asana, maybe this can be forwarded to me? [17:15:29] cmjohnson1, RobH: see above nagios ^^^ srv278, RT #24, it's been randomly rebooting and is depooled for many months [17:16:02] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [17:17:38] New patchset: Jgreen; "aluminium should collect from both storage3/hume in case we flip log compression" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11884 [17:18:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11884 [17:18:23] ACKNOWLEDGEMENT - Apache HTTP on srv278 is CRITICAL: Connection refused daniel_zahn RT #24 - hardware [17:18:23] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11884 [17:18:25] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11884 [17:18:58] so I have no idea why it doesn't work now [17:19:04] cmjohnson1: cool! thanks. [17:19:44] I rebooted a couple snapshot hosts and they came back fine with the mounts [17:19:53] hmm [17:20:01] maybe it's something in the precise packages [17:20:20] are there logs on dataset2 [17:20:23] showing the attempted mount? 
[17:20:29] drdee: nothing for content or requestor email like "asana" [17:20:50] yes, it shows an auth request just like for all of them [17:20:53] nthing exceptions [17:20:57] exceptional [17:21:15] paravoid: you read the reply? [17:23:22] wtf? [17:23:32] so, we need to register and change glue records with Verisign? [17:24:04] mutante: thanks, maybe shortly later, do you also monitor bugzilla-daemon? [17:24:13] same request for that email account :) [17:24:24] Ryan_Lane: will you ask her how to do that ourselves next time or will you? [17:24:33] drdee: no, i dont. that would be more a bugmeister request (hexmode) [17:24:42] Ryan_Lane: and how to modify glue records if needed [17:24:45] or should I that is :) [17:25:05] mutante: thanks! [17:25:46] hexmode: are you around? [17:27:04] so [17:27:11] nfs-common was not installed [17:27:12] sooooo [17:27:15] AH! [17:27:17] on stat1 [17:27:31] ? [17:27:36] I installed manually, you'll want to fix that in puppet [17:27:37] yes, stat1 [17:27:38] i will include it in statistics classes [17:27:38] yeah [17:28:01] drdee: yes [17:28:01] yer mount is now available for /mnt/daa [17:28:06] drdee: sup? [17:28:06] yayy [17:28:09] I assume you can fix the other one now [17:28:15] whatever it was.... [17:28:18] yes! [17:28:22] sweet [17:28:24] eees ogoooood [17:28:49] thank you! sorry, I could have fixed that on my end without your help then, didn't realize what was wrong [17:28:50] thanks so much! [17:28:53] sure [17:28:56] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.043 second response time [17:29:04] there was a ticket for this? 
[17:29:47] * hexmode wonders if drdee doesn't really need him [17:30:05] yeah [17:30:10] i will update it [17:30:11] if you like [17:30:16] https://rt.wikimedia.org/Ticket/Display.html?id=2162#txn-66331 [17:30:53] sure [17:31:15] New patchset: Ottomata; "statistics.pp - including nfs::common" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11885 [17:31:21] could you merge that real quick? [17:31:23] hexmode: do you see an email from asana in bugzilla-daemon email account, if yes can you forward it to me? [17:31:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11885 [17:35:27] paravoid: BAH [17:35:35] "Not for .org names, this is a manual process." [17:36:40] apergos, while you are still around and thinking about it [17:36:54] https://gerrit.wikimedia.org/r/11885 [17:37:03] argh :( [17:37:06] if you do it now it won't get stuck for in the queue for weeks [17:37:18] seems to be working perfectly fine in all the others [17:37:25] yeah sorry, I didn't see the message here, I was off typing [17:37:57] Ryan_Lane: the "good" news are that we know how to change glue records in the future [17:38:03] yeah [17:38:08] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11885 [17:38:10] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11885 [17:38:26] thanks apergos! 
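Note on the stat1 NFS thread above: the failed `sudo mount /mnt/data` was traced in the log to nfs-common not being installed, which was then fixed by including nfs::common in the statistics classes. Below is a minimal Python sketch of a pre-flight check for that kind of gap. It assumes a Debian/Ubuntu-style host where the NFS client package ships a `mount.nfs` helper; the function names and the directory list are hypothetical and do not come from the puppet repo. It only looks for the helper binary and does not check rpcbind, rpc.statd, or the server-side export list.

```python
import os
import shutil

# Hypothetical search list: the helper normally lives in /sbin, which is
# often missing from an unprivileged user's PATH, so we search explicitly.
HELPER_DIRS = ["/sbin", "/usr/sbin", "/bin", "/usr/bin"]


def find_mount_helper(fstype="nfs", dirs=HELPER_DIRS):
    """Return the path of an executable mount.<fstype> helper, or None."""
    return shutil.which("mount." + fstype, path=os.pathsep.join(dirs))


def diagnose(fstype="nfs", dirs=HELPER_DIRS):
    """Return a one-line, human-readable result of the helper check."""
    helper = find_mount_helper(fstype, dirs)
    if helper is None:
        return ("mount.%s not found: the NFS client package "
                "(nfs-common on Debian/Ubuntu) may be missing" % fstype)
    return "mount.%s found at %s" % (fstype, helper)


if __name__ == "__main__":
    print(diagnose())
```

Running this before debugging exports or hostnames on the server side would rule out the client-package case early; it is a convenience check, not a replacement for declaring the package in configuration management.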
[17:38:32] mark: ping [17:38:35] bleh [17:38:36] no done yet [17:38:37] mark: ^^^ :) [17:38:48] wikiemedialabs.org too [17:38:49] heh [17:38:59] mark: markmonitor replied that "for .org names [changing a glue record] is a manual process" [17:39:03] no, the glue should be the same [17:39:10] since you register the nameserver [17:39:14] yeah [17:39:16] and then reference it from the domain [17:39:17] it's not done yet [17:39:25] -> RT-3126 "wikmedia.org", created today [17:39:31] so I get the same error for now [17:39:33] now done, will be in effect after next puppet run [17:39:45] we have wikimedialabs.org/com/net [17:40:00] robla: can you approve https://rt.wikimedia.org/Ticket/Display.html?id=3119 [17:40:26] robla: can you also approve https://rt.wikimedia.org/Ticket/Display.html?id=3116 [17:40:36] hexmode: do you see an email from asana in bugzilla-daemon email account, if yes can you forward it to me? [17:40:54] Ryan_Lane: well, we could use ns0.wikimedialabs.com/net, but feels like a big hack [17:41:09] no thanks :) [17:41:15] I can wait [17:41:17] Ryan_Lane: btw, how come you chose wmflabs.org over labs.wikimedia.org? [17:41:30] we didn't want wikimedia.org in the url [17:41:36] legal? [17:41:46] same domain [17:41:51] security [17:41:55] ah, right [17:42:01] <^demon|zzz> Someone could do nasty things with cookies and such for starters :) [17:42:02] makes total sense [17:42:13] btw, in Debian our NS are ns1/2/3.debian.org + ns4.debian.com [17:42:23] just in case .org goes down :) [17:42:28] great, danke [17:43:09] New patchset: Jgreen; "reenable fundraising banner log compression on hume, as secondary processor" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11886 [17:43:16] spread over Canada, MIT, Germany & Greece [17:43:28] ok review/merge time [17:43:38] i have a few changes waiting in gerrit [17:43:39] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11886 [17:43:39] https://gerrit.wikimedia.org/r/#/c/11872/ [17:43:43] <^demon|zzz> paravoid: MIT is a country? [17:43:43] is pretty simple and harmless [17:43:51] paravoid: 1.ns.debian ? re: gTLDs [17:44:15] apergos, can you do that one for me too? [17:44:16] https://gerrit.wikimedia.org/r/#/c/11872/ [17:44:16] ? [17:44:25] ^demon|zzz: kind of, they don't recognize ARIN for example :-) [17:44:34] and that's why they don't have IPv6 [17:44:58] New patchset: Ryan Lane; "Revoke Andrew Bogott's key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11887 [17:45:07] eh? [17:45:08] ewll they don't need ipv6 with all of their ip's [17:45:10] what happened? [17:45:23] who feels like merging https://gerrit.wikimedia.org/r/#/c/11564/ it is giving Jonathan (a global dev analyst) access to stat1 [17:45:27] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11886 [17:45:27] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11886 [17:45:28] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11872 [17:45:28] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11872 [17:45:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11887 [17:45:40] thanks apergos! [17:45:45] whose fundraising change? [17:45:54] cause I'm about to merge it [17:46:03] + nrpe, [17:46:03] + misc::fundraising::impressionlog::compress [17:46:16] Jeff_Green: ? 
[17:47:50] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11887 [17:47:52] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11887 [17:48:05] now they are stackign up [17:48:29] apergos: yes, mine, and it's already gone [17:48:40] i merged it before i noticed your mention here [17:49:04] heh [17:49:32] eh someone already merged ottomata's too [17:49:34] ok then thanks [17:50:29] RobH: heya [17:51:07] https://rt.wikimedia.org/Ticket/Display.html?id=3068 ? [17:51:56] LeslieCarr, RobH: yes the analytics team would love to see this fixed! :D [17:52:32] i cannot do that from tampa. [17:52:38] oh you're still in tampa ? [17:52:40] so its not going to get done until next week [17:52:41] i thoguht you flew back this week [17:52:42] nm [17:52:48] i mean i thoguht you flew back this weekend [17:52:52] nope, next [17:52:56] cool [17:53:03] New patchset: Ottomata; "site.pp - Giving access on stat1 to Ryan Faulkner. RT306" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9628 [17:53:04] how's tampa treating you ? [17:53:19] i'm very excited that we finally got the switch ring working [17:53:21] :) [17:53:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/9628 [17:53:56] though turns out, even thoguh they say minor revision differences will still work with each other… it won't :( you can attach a minor revision different switch to a ring, but it won't pass ring traffic through it [17:54:32] yea chris said you guys were working on it awhile [17:57:50] LeslieCarr: yeah [17:58:08] it's always a pita to get them working initially.. 
have to install the same software on all before start [17:58:27] yeah, also for some reaosn when you install via a usb key, it deletes the package from teh usb key [17:58:34] so weird [17:58:45] had to have chris copy it over again each time [17:58:53] brb [18:01:33] drdee: what's Ori want on locke, emery, oxygen etc? [18:04:29] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [18:14:38] drdee: fyi, I don't see anything from asana. [18:16:52] New review: Dereckson; "(no comment)" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/11747 [18:22:31] paravoid: I need to ask markmonitor to remove the old glue records, right? [18:22:54] dunno, maybe they garbage collect unreferenced glues [18:22:56] or not, no idea [18:23:02] probably a good idea to ask... [18:23:03] I'll ask, just in case [18:23:25] they're not going to be in the additional section for wmflabs.org replies though [18:23:29] so that's a good thing :-) [18:24:58] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=&s=by+name&c=Application%2520servers%2520pmtpa&tab=m&vn= [18:25:07] ^ quite a few of the apaches look quite loaded [18:26:30] New patchset: Andrew Bogott; "Added updated key for abogott." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11892 [18:26:36] hi guys [18:26:43] i betcha you are all in a meeting eh? [18:26:48] so maybe this q will not be answered, hmmm [18:26:49] ops are yeah [18:26:55] yeah [18:27:00] vvv: and of course ops are in a meeting :-D [18:27:01] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11892 [18:27:05] i have a generic 'stats' user on stat1 [18:27:27] i need to puppetize some rsyncing of log files from some of the udp2log machines over to stat1 [18:27:50] I manually created an ssh key on stat1 for the stats user [18:28:09] so, i need the stats user to exist on the udp2log machiens [18:28:25] and it needs to be able to ssh/rsync between udp2log machines and stat1 [18:28:28] i don't care if I push or pull [18:28:31] probably pull is easier [18:28:37] so the rsync cron jobs are all on stat1 [18:28:42] I think I had something similar setup for Jenkins / Gerrit [18:29:04] but that means I need to install the stats user and stats' public key on the udp2log machines [18:29:11] so [18:29:15] 1. shoudl I do that [18:29:16] if so [18:29:25] 2. should the private key be puppetized? [18:29:31] 3. should the public key be puppetized? [18:29:44] 4. should I move the stats user class into a more global location, maybe admins.pp [18:29:44] ? [18:29:51] (it is currently in misc/statistics.pp) [18:30:09] ottomata, i guess the public key should be [18:30:14] if yes to 2. and/or 3., then how do I puppetize? [18:30:23] should I use the private repo? [18:30:33] not sure about the private one, but it shouldn't be in a public repo, for obvious reasons [18:30:38] aye [18:31:27] or [18:31:43] should we maybe do the same that we do for real users for this user? [18:31:46] instead of puppetizing the private key [18:31:48] can we use ldap? [18:31:55] ummm…how's that work, through labsconsole? [18:32:07] i think this stats user does have a labsconsole account [18:35:04] Reedy: still look bad? [18:35:15] (I'm mostly not here cause we have our meeting) [18:35:24] and q2: can this be a side effect of enwiki deploy? [18:36:21] Individual load looks to have been higher for longer [18:36:25] huh [18:36:50] oh it's just the apis is it? 
[18:37:12] I was looking at the app pmtpa pool [18:37:30] load, memory and network graphs are very noisy too ( [18:37:52] api pool looks reasonable [18:37:52] http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=load_one&s=by+name&c=Application+servers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [18:37:58] the weeklies don' t look tooo crazy [18:38:40] true [18:38:52] it's not as if all the machines are at high load [18:39:07] i created a ticket about this [18:39:14] https://rt.wikimedia.org/Ticket/Display.html?id=3137 [18:39:23] i'm guessing there's a pool or two that needs machines and some that have too few [18:39:31] i mean some have too many ;) [18:39:32] sounds likely [18:39:41] heh [18:55:29] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [19:06:43] !log updating dns [19:06:48] Logged the message, RobH [19:09:47] New patchset: Ottomata; "Setting up cron jobs to rsync some udp2log archived logs to stat1." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11898 [19:10:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11898 [19:11:37] apergos (or anyone), if you are around, could you check that out for me? [19:11:49] !log updating several Jenkins plugins [19:11:54] Logged the message, Master [19:12:21] gone (sorry), 10 pm and off the clock.. trying to burn a dvd with f17 on it, my dvd burner seems on the blink [19:12:52] bye bye apergos :) [19:13:06] see yas [19:16:20] yeah no probs, thanks for your help today [19:27:42] yw (see, I really was gone :-D) [19:29:53] Jenkins down, working on it :/ [19:38:13] hexmode: thanks [19:38:53] robla: about Ori, yeah i guess so, i didn't create the ticket but he should have access to at least emery to check the AFT / clicktracking log [19:39:30] drdee: clicktracking is moving over to emery? 
[19:40:04] not moving over, always been running there [19:41:11] I'm assuming RoanKattouw moved it to emery when he did his last fixup pass on it. it wasn't *always* there [19:41:40] It's been on emery for a long long time [19:41:54] Actually I think it's always been there, at least in its UDP days [19:42:02] Prior to being collected over UDP on emery, it was in a DB table [19:42:49] * robla looks in email, and finds that it moved over December 2011. While that qualifies as "a while", it's not "a long long time" [19:43:08] fair enough :) [19:44:16] Between Dec 2011 and now, I circumnavigated the world, got a visa, moved countries, rented an apartment, got my passport stolen, got a visa again, went to Europe twice, and got a driver's license. So it felt like a long time to me :D [19:44:57] anyway, I guess we know that emery can handle the current load. my main concern with these boxes in particular is making sure that everyone who is on them knows how fragile they are, and that even seemingly innocent activities can induce loss [19:46:08] * robla must get food and eat in 15 min, so will pick this up later [19:46:31] robla: i am on it, we are not deploying any new filters on emery anymore [19:46:32] uhoh RoanKattouw can drive ? I'm scared [19:54:49] LeslieCarr: The license I'm getting expires in 2022, when I'll be 31. That's what scares *me* [19:54:58] New patchset: Aaron Schulz; "Tweaked rsync params per man page." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11919 [19:55:19] haha [19:55:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11919 [19:55:41] how'd you get a 10 year license ? [19:56:20] Aren't CA DLs 10y in general? [19:57:27] i didn't think so ? [19:58:12] i thought it was 5 years [19:59:02] I'll know in a few weeks when I get the license [19:59:15] Anyway my NL license is 10y so it expires in 2020, that's scary enough [19:59:46] binasher: are you around? 
[20:00:48] drdee: is this meeting happening? [20:01:05] yes, if you guys are in a room and tell me who i should call [20:01:24] preilly: is binasher around? [20:01:28] and terry [20:01:29] ? [20:01:55] drdee: not yet [20:01:58] drdee: terry is [20:02:09] preilly: we sort of need binasher :) [20:02:28] drdee: I just texted him [20:02:32] drdee: he is on his way [20:02:35] who shall i skype? [20:02:49] drdee: Skype @preillyme [20:23:05] PROBLEM - Host db1047 is DOWN: PING CRITICAL - Packet loss = 100% [20:33:26] New patchset: Platonides; "(Bug 37700) - Change logo for stewardwiki to http://commons.wikimedia.org/wiki/File:Steward_wiki_logo_3.svg and favicon to meta one." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11943 [20:33:33] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11943 [20:37:33] New patchset: Petrb; "change logo and favicon of steward wiki (bug 37700)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11944 [20:37:38] New review: jenkins-bot; "Build Failed " [operations/mediawiki-config] (master); V: -1 C: 0; - https://gerrit.wikimedia.org/r/11944 [20:39:31] New review: Petrb; "duplicate of https://gerrit.wikimedia.org/r/11944 but fine" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/11943 [20:40:05] Ryan_Lane: btw, if you're going to do ssh userkeys in prod too [20:40:14] ? [20:40:15] would you open in changing the path a bit [20:40:37] to /etc/ssh/userkeys/%u instead of /etc/ssh/userkeys/%u/.ssh/authorized_keys ? [20:40:37] if the userkeys support in puppet works with it, yeah [20:41:38] Change abandoned: Petrb; "duplicate" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11944 [20:42:17] hahaha. my rlane@wikimedia.org account has the wrong picture :D [20:42:55] ? 
[20:43:30] it's me with a giant hello kitty statue [20:43:38] not the best image for my work email :D [20:44:11] where does it have a picture? [20:44:35] gmail will automatically show a pic of people, if they have one available by some means [20:45:56] how it got that one, I'll never know [20:46:10] google+? [20:46:16] it was blank [20:46:29] and my personal one doesn't have that pic either [20:46:33] facebook has that pic [20:46:36] ah [20:46:37] I know [20:46:45] google chat [20:46:45] mutt doesn't show ay pictures [20:46:50] and I just fired gmail [20:46:55] and it's a different picture [20:47:00] fwiw [20:47:01] yeah. I just changed it [20:47:15] ah [20:47:19] if you set your google+ one, it'll override the gchat one [20:48:16] I like my email with no pictures [20:48:45] back in a while [21:02:06] hey maplebed: would you have some spare minutes to approve https://gerrit.wikimedia.org/r/#/c/11564/ and https://gerrit.wikimedia.org/r/#/c/9628/ ? [21:09:41] * maplebed looks [21:11:31] drdee: please move jmorgan to UID order within admins.pp [21:11:50] looking [21:12:20] also, did you verify the key? (via a secure medium) [21:12:54] ? [21:13:03] New patchset: Ottomata; "Creating new user Jonathan Morgan, including him on stat1." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11564 [21:13:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11564 [21:16:59] drdee: those two commits will fight on merge - they both try and add a line in the same place. if they do merge, one will be missing a comma. [21:17:25] ottomata: can you fix that? [21:18:18] ottomata: the format of jmorgan's key is wrong; it shouldn't have the comment in the key string. [21:19:36] uhh [21:19:39] can fix the comment [21:20:21] not sure how to fix the conflict [21:20:27] New patchset: Ottomata; "Creating new user Jonathan Morgan, including him on stat1." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/11564 [21:20:29] we can abondon faulkner's commit [21:20:34] and I can do a new one after this one is merged [21:20:42] that'd do it. [21:20:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11564 [21:23:54] drdee: was there an RT ticket for jmorgan? [21:24:25] maplebed: yes, 1 sec [21:24:57] maplebed: https://rt.wikimedia.org/Ticket/Display.html?id=3003 [21:25:42] woosters: would you reply to both ^^^ and https://rt.wikimedia.org/Ticket/Display.html?id=3063 with your approval? [21:27:00] New review: Hashar; "Looks rsync indeed use the same syntax as bash (and other shells?). Man page from fenari:" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/11919 [21:27:34] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11564 [21:27:36] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11564 [21:33:20] ottomata: you want to redo falkner's change now that jmorgan's been merged? [21:33:35] New patchset: Jgreen; "whee, fixing the config that resulted in aluminium:/ 100%" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11950 [21:33:47] mmk [21:34:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11950 [21:34:26] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11950 [21:34:29] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11950 [21:34:33] Change abandoned: Ottomata; "redoing this in another commit to avoid conflicts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9628 [21:36:14] New patchset: Ottomata; "Giving access to Ryan Faulkner on stat1." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/11951 [21:36:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11951 [21:38:47] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11951 [21:38:49] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11951 [21:39:00] ottomata: ugh, why abandon? ;( [21:39:20] you could cherry pick or just start from scratch but keep the change id [21:39:30] maplebed: if you haven't already, can you delay making any changes to es1001 until tomorrow? [21:39:40] * jeremyb runs away [21:40:04] binasher: np. [21:43:17] New patchset: Bhartshorne; "Revert "Creating new user Jonathan Morgan, including him on stat1."" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11952 [21:43:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11952 [21:43:57] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11952 [21:46:48] Change abandoned: Bhartshorne; "just got confirmation I can merge." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11952 [21:49:20] wait what's the problem there? [21:50:01] ottomata: no problem. [21:50:06] k [21:50:10] just me fighting gerrit. [21:51:08] fwiw though, puppet's currently broken on stat1 and so those accounts won't be enabled until it runs successfully. [21:51:29] growl [21:53:42] New patchset: Ottomata; "statistics.pp - Fixing duplicate require" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11953 [21:53:46] ok, that should fix it [21:53:46] maplebed [21:53:57] looking [21:54:14] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11953 [21:54:31] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11953 [21:54:33] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11953 [21:57:36] ottomata: still broken, but in a different way. [21:58:06] (broken in a way that didn't impede creating the new accounts) [21:59:41] oh yeah i saw some of those [21:59:44] been there since reinstall [21:59:48] haven't had a chance to fix them yet [22:00:06] will do soon [22:00:07] thanks ben [22:00:08] i gotta run [22:09:47] maplebed: final request, could you maybe review https://gerrit.wikimedia.org/r/#/c/11898/ (we are getting real close in getting stat1 fully operational) [22:10:47] drdee: no, that one requires more conversation. [22:17:32] !log rebooting es1002 to look at the raid setup [22:17:37] Logged the message, Master [22:19:02] PROBLEM - Host es1002 is DOWN: PING CRITICAL - Packet loss = 100% [22:23:14] binasher: from the bios the raid on es1002 looks totally normal. (except for the degraded disk, of course.) [22:23:33] interesting [22:24:52] did every span appear degraded via MegaCli for you too? [22:26:32] RECOVERY - Host es1002 is UP: PING OK - Packet loss = 0%, RTA = 26.38 ms [22:26:37] I started with the bios. [22:28:11] maplebed: what kind of conversation is needed :) ? [22:30:41] drdee: due diligence. aren't files already getting copied over to stat1? how has it been happening and why should it change? [22:30:58] but right now I need to get some other shit done, so can't actually have that conversation with you. [22:31:02] okay [23:22:19] New patchset: Jdlrobson; "update varnish config to match DeviceDetection.php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11963 [23:22:50] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11963 [23:56:13] Ryan_Lane: can you create a gerrit repo for me? [23:58:29] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours