[00:08:29] RECOVERY - Puppet freshness on spence is OK: puppet ran at Mon Mar 4 00:08:18 UTC 2013
[00:08:29] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[00:09:43] RECOVERY - Puppet freshness on spence is OK: puppet ran at Mon Mar 4 00:09:34 UTC 2013
[00:10:28] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[00:10:58] RECOVERY - Puppet freshness on spence is OK: puppet ran at Mon Mar 4 00:10:51 UTC 2013
[00:11:28] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[00:12:08] RECOVERY - Puppet freshness on spence is OK: puppet ran at Mon Mar 4 00:12:05 UTC 2013
[00:12:35] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[00:13:06] PROBLEM - Puppet freshness on db27 is CRITICAL: Puppet has not run in the last 10 hours
[00:42:45] PROBLEM - Squid on brewster is CRITICAL: Connection refused
[01:32:35] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 197 seconds
[01:32:35] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 200 seconds
[01:37:35] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 194 seconds
[01:37:35] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 195 seconds
[01:39:51] PROBLEM - Puppet freshness on db27 is CRITICAL: Puppet has not run in the last 10 hours
[01:44:41] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 23 seconds
[01:44:41] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 18 seconds
[02:00:51] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa
[02:04:11] !log LocalisationUpdate failed (1.21wmf10) at Mon Mar 4 02:04:10 UTC 2013
[02:04:20] Logged the message, Master
[02:09:11] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours
[02:09:11] PROBLEM - Puppet freshness on amssq46 is CRITICAL: Puppet has not run in the last 10 hours
[02:09:11] PROBLEM - Puppet freshness on ssl3003 is CRITICAL: Puppet has not run in the last 10 hours
[02:09:11] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:11] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:11] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:12] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:12] PROBLEM - Puppet freshness on amssq32 is CRITICAL: Puppet has not run in the last 10 hours
[02:10:12] PROBLEM - Puppet freshness on amssq38 is CRITICAL: Puppet has not run in the last 10 hours
[10:11:52] !log reinstalled wikimedia-app-taskserver on fenari (how did it get removed?), needed for various cron jobs among other things
[10:12:02] Logged the message, Master
[10:12:44] !log cleared some space on brewster by rmeoving old squid access/storage logs (2.2gb worth in 3 files!), its root partition was full, restarted squid
[10:12:50] Logged the message, Master
[10:50:48] New patchset: Silke Meyer; "Adding Babel, Translate, AbuseFilter to Wikidata repo." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52026
[10:51:03] New review: Silke Meyer; "OK, I'll abandone this change and start a new one." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51648
[10:51:24] Change abandoned: Silke Meyer; "https://gerrit.wikimedia.org/r/52026" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51648
[10:51:33] Silke_WMDE: is that for the dev repo?
[10:52:00] legoktm: No, for test-repo
[10:52:01] err test repo. the one at http://wikidata-test-repo.wikimedia.de
[10:52:04] oh ok
[10:52:09] yes
[10:52:27] will it be possible for sysops on wikidata to get sysop access on test-repo so we can test how abusefilter works?
[10:52:58] because that would be nice :)
[10:54:55] Sure. PLease just send me a list of user names. At the moment I have to have puppet recreate the whole MediaWiki installation, so people will have to change their passwords weekly. Or keep the original one.
[10:55:31] wrong channel, btw ;)
[10:55:38] oh right :P
[10:55:50] ill start a thread on WD:AN and get you a list in a few days, thanks :)
[10:56:03] ok
[13:07:40] New review: QChris; "As the upstream patch to get consistent row numbers in gerrit seems to be withheld upstream further,..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49993
[13:43:57] New patchset: Jeremyb; "(bug 38114) gerrit: alternate change list row colors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49993
[14:47:46] New patchset: Ottomata; "Adding puppet Limn module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49710
[15:30:07] New patchset: ArielGlenn; "harmon into production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52042
[15:43:25] New patchset: Silke Meyer; "Install Solr and Solarium on Wikidata test repos." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52043
[15:45:19] New review: Silke Meyer; "If it's not okay to use puppet to download composer to install solarium, please tell me about the pr..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52043
[16:14:36] !log db27 being taken down for decom per: https://rt.wikimedia.org/Ticket/Display.html?id=4623
[16:14:41] Logged the message, Master
[16:24:14] !log rebooting db70 to reset idrac: https://rt.wikimedia.org/Ticket/Display.html?id=4628
[16:24:20] Logged the message, Master
[16:26:45] cmjohnson1: db70 idrac is back up
[16:27:01] awesome!
[16:27:02] thanks
[16:29:13] sbernardin: what was the issue?
[16:30:02] New review: MaxSem; "(4 comments)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/52043
[16:30:10] Silke_WMDE, ^^^
[16:30:28] cmjohnson1: not sure ...I wasn't able to get in either before I rebooted the server
[16:31:36] MaxSem: Thanks for the feedback!
[16:31:49] odd..but not the first time
[16:33:04] * apergos tries getting onto db70 to poke around...must not be all the way up yet (?)
[16:37:37] Silke_WMDE, wouldn't get_composer be executed on every puppet run?
[16:38:48] Does that matter? If a file exists already, it wouldn't be downloaded again, would it?
[16:39:19] But good to know I can use that extension!
[16:39:49] hey sbernardin.. is db70 all the way up?
[16:40:38] apergos: should be by now....rebooted it about 10 minutes ago
[16:40:44] hmm wll
[16:40:48] ssh ain't happening
[16:41:07] wanna peek since you're already there?
[16:42:51] MaxSem: As to your comment about using the solr module - I was confused there are two things manifest/role/solr.pp and modules/solr/
[16:43:04] I used the second one
[16:43:21] apergos: no db70 is in installer atm
[16:43:56] MaxSem: in the module example it is used the way I used it (at least from what I iunderstood)
[16:44:44] sbernardin: the dell pickup is ready to go right?
[16:45:15] RobH: the c2100s?...yes those are
[16:45:16] cmjohnson1: ok thanks, that solves that riddle
[16:45:43] sbernardin: so frieght folks called me just now
[16:45:47] they will be there shortly =]
[16:46:10] RobH: today?
[16:46:14] yep
[16:46:48] RobH: I thought dell was going to send a heads up via email
[16:47:02] I just got a phone call from CEVA about showing up today
[16:47:11] no idea what Dell said, just relating events as they occur ;]
[16:47:22] steve, did you schedule the p/up for today w/dell?
[16:47:23] If we need an outbound shipment ticket, would you please email 365 Main to create it?
[16:47:30] since you can explain where it is and such.
[16:47:46] RobH: OK...cool...Will create a ticket for 365 about the pickup now
[16:47:51] I think this is another case of Dell saying one thing, then doing something entirely different.
[16:48:06] =P
[16:48:41] robh it is!
[16:52:49] New patchset: Pyoungmeister; "depooling db1009 while investigating failure" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52048
[16:55:07] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52048
[16:56:17] !log py synchronized wmf-config/db-eqiad.php 'depooling db1009 while investigating error'
[16:56:23] Logged the message, Master
[16:59:01] Robh: They just got here...it's old dominion and not Ceva
[16:59:17] sbernardin: ok, thats fine
[16:59:19] RobH: just wanted to let you know
[16:59:24] cool
[17:00:07] are these the last of the c2100s?
[17:02:10] New patchset: Cmjohnson; "Adding db70 entry" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52050
[17:02:56] robh: No...we have 4 more in tampa...next one will not come offline for another 10-14 days
[17:03:05] gettin there
[17:03:22] slowly! but according to apergos...we are ahead of schedule
[17:05:10] heh
[17:05:20] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52050
[17:05:35] it's cause I redid the scheudle on friday thinking this last update would take two days longer than it did
[17:05:47] but that's going to be the exception to the rule I'm afraid
[17:10:52] RobH: no...I believe we have 5 more
[17:44:45] !log authdns update
[17:44:51] Logged the message, Master
[17:51:00] robh:
[17:51:39] ?
[17:51:40] the osm-cp1001/1002 and osm-db's have all been update in dns...did you want to fix tampa?
[17:51:52] only mgmt is done there
[17:52:03] yep, make it so osm web and db are internal
[17:52:07] and cp is external
[17:52:13] please fix for both sites =]
[17:52:39] ok
[18:02:08] robh: need your help giving external ips
[18:02:23] for eqiad or tampa?
[18:02:27] eqiad
[18:02:38] ok, so this is the easy one =]
[18:03:40] so osm-web1001-1002 are in 16
[18:03:43] a6 even
[18:03:54] where 1003-1004 are3 in b6
[18:04:06] yes
[18:04:08] lets handle row A first, you will find a section in the 154 reverse file called 'public1-a
[18:04:17] public1-a-eqiad
[18:04:38] i see that .42 and .44 are free
[18:04:47] so i would make those the 1001 and 1002
[18:05:00] so 208.80.154.42 & 44
[18:05:17] in eqiad each row is its own public and private subnet
[18:05:36] ok..i like that
[18:05:37] I'd explain why, but frankly I don't quite get it when it is a major pita when we run out of IPs ;]
[18:05:42] see, i dislike.
[18:06:01] but, luckily, that is a mar_k and leslie problem.
[18:06:16] so snag two public ips in public1-a-eqiad for the two row A web servers
[18:06:28] then you can find the public1-b-eiqad
[18:06:37] also, these do NOT have to be in sequence
[18:06:45] so best to snagg a lowest IP when available kind of thing
[18:07:00] so in row B, I see .146 and .159
[18:07:18] 208.80.154.146 & .159 for web1003 & 1004
[18:07:27] then add them into the foward wikimedia.org file normally
[18:07:39] ok..simple enough
[18:07:43] i can review your change before you svn commit if you like
[18:07:47] similar to private
[18:07:51] yeah...that would be good
[18:08:07] similar, but fuck this one up and the entire world may not be able to reach us ;]
[18:08:17] lemme do eqiad first than tampa cuz they need to private entries as well
[18:08:36] heh...yeah...don't wanna do that...will never live it down
[18:12:25] New patchset: ArielGlenn; "actually add sha1.{h,c}, base36.c, sql2txt.c from my local repo >_<" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/52057
[18:17:13] New review: ArielGlenn; "sheesh" [operations/dumps] (ariel) C: 2; - https://gerrit.wikimedia.org/r/52057
[18:17:13] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/52057
[18:19:46] robh: please review
[18:29:47] cmjohnson1: sorry, got distracted
[18:29:48] looking now
[18:30:40] cmjohnson1: so where you have osm is wrong in wikimeida.org
[18:30:44] notice those are service aliases
[18:30:53] these arent aliases, we are entering the actual servers.
[18:31:12] i thought that was wrong but i wasn't certain
[18:31:17] so yeah let me fix
[18:32:44] the actual server entires still go under the servers section of the forward (wikimedia.org)
[18:34:50] robh: k...i noticed a few things out of order
[18:35:11] but i fixed so plz check
[18:40:19] robh: ping
[18:40:29] cehckin
[18:40:31] checking even
[18:40:54] what do you mean out of order?
[18:40:59] i only see your osm move
[18:41:27] looks good to me.
[18:42:28] oh..i see knsq...not in alphabetical order
[18:42:34] nothing important
[18:44:11] !log authdns update
[18:44:20] Logged the message, Master
[18:47:32] New patchset: Dzahn; "make check_http (80) and check_tcp (8080) on install hosts a critical (paging) service" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52069
[19:07:21] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52069
[19:13:44] !log taking down amaranth for hw replacement: https://rt.wikimedia.org/Ticket/Display.html?id=4497
[19:13:49] Logged the message, Master
[19:20:06] !log installing package upgrades on sockpuppet
[19:20:11] Logged the message, Master
[19:22:53] wooooooot
[19:23:01] re amaranth :)
[19:23:07] !log reedy synchronized wmf-config/
[19:23:13] Logged the message, Master
[19:30:41] !log on gallium: stopping puppet, cleaning defunct and restarting puppet
[19:30:46] Logged the message, Master
[19:33:31] !log reedy synchronized docroot
[19:33:37] Logged the message, Master
[19:34:04] !log reedy synchronized live-1.5/
[19:34:09] Logged the message, Master
[19:34:48] !log reedy synchronized wmf-config/
[19:34:53] Logged the message, Master
[19:41:20] !log reedy synchronized php-1.21wmf11/
[19:41:25] Logged the message, Master
[19:43:42] New patchset: Pyoungmeister; "db59 pmtpa s1 snapshot host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52077
[19:43:58] New patchset: Pyoungmeister; "db59 pmtpa s1 snapshot host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52077
[19:54:59] !log reedy Started syncing Wikimedia installation... : Build 1.21wmf11 message cache
[19:55:04] Logged the message, Master
[20:03:19] Looking at search.pp, I can't tell how often the lucene monitoring service (ie check_lucene) is run ...
[20:03:28] Anyone know where that is defined ?
[20:07:55] sbernardin: woot. so ordering more?
[20:08:20] (i thought you only had 1GBs?)
[20:09:09] jeremyb_: have a few more....just decommissioned a server today that had what we needed
[20:09:21] db27?
[20:09:25] anyway, yay
[20:09:29] :)
[20:11:02] xyzram: Presumably it should be in puppet..
[20:11:33] Doesn't look to be though
[20:11:35] Yes, search.pp is the puppet file I looked at.
[20:11:51] hi, what's the status of php 5.4 on the servers? has anyone looked into it?
[20:12:09] No
[20:12:48] xyzram: It might be in with the other LVS checks..
[20:12:58] jeremyb_: yes...db27
[20:12:59] Reedy: I see " monitor_service { "lucene": description => "Lucene", ..."
[20:13:09] yurik: It'd require backport with all dependancies until we upgraded to 14.04
[20:13:38] Reedy: sorry, what's 14.04?
[20:13:43] Ubuntu
[20:13:45] Next LTS
[20:14:11] Reedy: I looked also at lvs.pp but don't see the frequency there either.
[20:15:16] xyzram: Looks to be every minute
[20:15:28] Last Check Time: 2013-03-04 20:14:07
[20:15:28] Next Scheduled Check: 2013-03-04 20:15:07
[20:15:52] based on icinga for one of the lvs checks
[20:15:52] Where are you seeing this ?
[20:16:10] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=api.svc.pmtpa.wmnet&service=LVS+HTTP+IPv4
[20:16:51] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=search-pool2.svc.eqiad.wmnet
[20:28:34] !log reedy Started syncing Wikimedia installation... : Rebuild Message Cache for 1.21wmf11
[20:28:38] Logged the message, Master
[20:29:33] Reedy: thanks. search-pool4 seems to have a 5m interval; that's the one causing paging problems.
[20:33:49] Suddenly I'm getting the "Whoops" page for icinga, https://icinga.wikimedia.org/icinga/
[20:42:40] Looks like it died
[20:45:32] !log restarted icinga
[20:45:50] grrrr.
[20:47:39] !log restarted icinga
[20:48:31] Ryan_Lane: morebots seems fubar dude.
[20:49:20] =/ damn you morebots
[20:50:30] !log i hate you morebots
[20:50:37] Logged the message, RobH
[20:50:52] !log restarted icinga
[20:51:09] !log Why arent you taking all the !log
[20:51:14] Logged the message, RobH
[20:51:21] !log had to restart icinga
[20:51:26] Logged the message, RobH
[20:51:50] interesting, morebots won't parse a log with restarted as the initial word, it doesnt pick it up in the script.
[20:52:01] !log had to restart morebots, but it wasnt really dead, opps
[20:52:02] New patchset: Reedy; "Symlinks yo" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52095
[20:52:06] Logged the message, RobH
[20:52:18] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52095
[20:52:53] !log reedy Finished syncing Wikimedia installation... : Rebuild Message Cache for 1.21wmf11
[20:52:59] Logged the message, Master
[20:52:59] sbernardin: So it seems that professor is the graphite server
[20:53:22] sbernardin: we dont want it offline for long, have you confirmed where the fan you need to replace is, and that we have a spare to go in there?
[20:54:14] RobH: I have spares from db27
[20:54:30] and can see on that diagram which fan you are swapping right?
[20:54:48] RobH: likely a netsplit
[20:54:54] stupid bot has problems with that
[20:54:56] Ryan_Lane: it wasnt that
[20:55:03] what was it, then?
[20:55:04] it was here, and wont take a log starting with 'restarted'
[20:55:14] no idea why, but i tried a few different sentences and no dice.
[20:55:21] !log restarted icinga
[20:55:25] !log restarted icinga
[20:55:25] RobH: "restarted" isn't the problem. the issue is you were putting leading spaces before the !
[20:55:26] or it just hates that
[20:55:32] Logged the message, Master
[20:55:32] oh, was i?
[20:55:33] jeremyb_: indeed
[20:55:36] yes
[20:55:36] well, damn.
[20:56:00] oh well
[20:56:10] is lesliecarr back yet?
[20:56:47] RobH: blaming the bot. tsk tsk
[20:56:56] to be fair
[20:57:01] haha
[20:57:02] i was perfectly prepared to just blame you.
[20:57:31] 'Ryan broke morebots.'
[20:57:41] ;)
[20:57:45] * jeremyb_ heads back to J&R
[20:57:47] see, i put it in quotes, so its somehow not bullshit.
[20:57:55] not optimistic :(
[20:57:57] hey atleast all you need to do now is "/etc/init.d/adminbot restart"
[20:58:01] it's better than it was before
[20:58:03] yes, it was nice
[20:58:08] no pretending to be roan's user.
[20:58:13] yeah
[21:01:26] sbernardin: So if you know what fan slot it is
[21:01:36] I can more than likely just do a proper shutdown for you
[21:02:46] OK
[21:03:07] I'm still here now...so lets do it
[21:04:19] sbernardin: you know what fan is being replaced?
[21:05:21] I want to be sure we know what fan slot is coming out and being swapped before I take it down
[21:05:33] !log stopping puppet on oxygen to see if we can disable filters and reduce packet loss
[21:05:39] Logged the message, Master
[21:07:11] sbernardin: ?
[21:09:01] sbernardin: http://docs.oracle.com/cd/E19121-01/sf.x4240/820-3835-14/820-3835-14.pdf
[21:09:04] RobH: there are indicators once you open up the fan portion of the case
[21:09:21] and if the power is unplugged, they lose capacity in seconds
[21:09:29] so then we have a down server, and you have to now lookup what slot to swap
[21:09:32] we know the slot now
[21:09:41] i want you to know what is happening before we start the downtime window
[21:09:44] this is why i asked.
[21:09:50] So, I just linked the service manual
[21:09:57] that will have a layout of the fan trays
[21:10:14] and we know from the logs that FB1/FM2 is bad.
[21:10:41] sction 3.3 of the manual has the info
[21:11:04] The indicators are not reliable if we dont have the system powered on. You advised that we cannot keep the system powered on and give you access to the fan trays
[21:11:29] page 3-11 has the diagram
[21:11:51] !log authdns update
[21:11:53] I got the manual
[21:11:57] Logged the message, Master
[21:12:13] So on 3-11 we see from front of case, the first row is fan board 0, and second row is fan board 1
[21:12:39] and we need to replace fb1/fm2
[21:12:49] which is the far right fan
[21:13:01] Correct
[21:13:01] so facing case, its the back right fan
[21:13:15] I just want you to know exactly what we are doing before we cause a downtime window
[21:13:24] so you are in and out as fast as possible
[21:14:08] OK...got it
[21:14:13] ok, good to go?>
[21:14:24] !log shutting down professor for a fan swap, should be a short window
[21:14:29] Logged the message, RobH
[21:14:32] sbernardin: once the system powers down its all yours
[21:15:04] OK
[21:15:08] hrmm, maybe not
[21:15:20] oh hell, this is some odd ass non precise install
[21:15:22] notpeter: are you about?
[21:15:43] !log reedy synchronized php-1.21wmf10
[21:15:46] professor is Asher's graphite log server and some mysql host, but its not using standard init scripts for mysql
[21:15:47] Logged the message, Master
[21:16:00] sbernardin: sorry, delay, seems this is not normal server.
[21:16:20] OK ...no problem
[21:16:32] !log reedy synchronized php-1.21wmf11
[21:16:38] Logged the message, Master
[21:16:52] RobH: it doesn't appear to have mysqld, just mysql-client
[21:17:03] hrmm, odd server.
[21:17:54] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki and mediawikiwiki to 1.21wmf11
[21:17:57] Logged the message, Master
[21:17:58] well, if i take it offline, the graphite frontend is gone for a bit
[21:18:15] but i dunno if thats ok
[21:18:42] cmjohnson1: hey, so what's up with the copper link ? did they figure out where our cage is ?
[21:18:42] sbernardin: So I rather err on the side of caution. This system is online right now, and I am not sure how well the service will handle interruption. Rather than cause a potential issue, I'll update the ticket with the info we have
[21:18:58] sbernardin: and we will get asher or peter to sign off that its ok for this to go offline for a short window
[21:19:07] lesliecarr: idk..i didn't go to DC today
[21:19:10] sbernardin: are you at the dc now ? when you leaving ?
[21:19:14] OK...I'll put the fan aside for it
[21:19:26] cmjohnson1: ok, lemme know when you're in dc next if they fixed it ?
[21:19:29] lesliecarr: what is the easiest way to change vlan from private to public
[21:19:43] delete the interface?
[21:19:49] delete the port from the interface-range public and put it in interface-range private
[21:19:53] or vice versa in this case
[21:19:59] LeslieCarr: yes...im at the DC now
[21:20:02] ok
[21:20:02] not the interface description itself though
[21:20:11] k
[21:20:17] sbernardin: awesome, going to do a network thing and may need an emergency unplugging if something goes south
[21:20:18] :)
[21:20:38] LeslieCarr: cool
[21:20:41] Very technical there LeslieCarr
[21:21:28] New patchset: Reedy; "testwiki, test2wiki and mediawikiwiki to 1.21wmf11" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52101
[21:21:46] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52101
[21:21:58] New patchset: Dzahn; "sort ServerAlias alphabetically in redirects.conf" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/52102
[21:22:27] i try my best
[21:22:36] Ryan_Lane: ready to slauighter virt2 for a little bit ? :)
[21:22:43] also ready for me to misspell words ?
[21:22:55] LeslieCarr: heh. yes
[21:25:48] New review: Dzahn; "+116,-116, just sorting them" [operations/apache-config] (master) C: 2; - https://gerrit.wikimedia.org/r/52102
[21:25:49] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/52102
[21:27:13] New patchset: Lcarr; "bonding virt2's second interface group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52103
[21:29:57] !log authdns update
[21:29:59] sbernardin: can you let me know what is plugged into mrjp-c2-sdtpa port 28 please ?
[21:30:03] Logged the message, Master
[21:32:27] LeslieCarr: db58
[21:33:00] thanks
[21:33:07] our cabling in tampa is a bit insane :)
[21:33:46] just a bit :-]
[21:34:40] Lol
[21:34:43] RobH: sup?
[21:35:01] notpeter: Do you happen to know anything about professor? It is the graphite server it appears
[21:35:09] no
[21:35:11] sbernardin & I need to take it down to swap a bad fan.
[21:35:17] I think that should be just fine
[21:35:21] but that's an asher thing
[21:35:21] i saw asher had done most the puppetization
[21:35:25] should prob ask him
[21:35:30] ja
[21:35:30] yea, its not emergency
[21:35:32] Ryan_Lane: merging the interface bonding now - in almost every case it doesn't cause an outage but i hold no guarantees with the "lovely" foundry switch
[21:35:33] cool
[21:35:37] yeah, i'd just shoot him an email
[21:35:39] so it can wait a bit for him to answer email
[21:35:44] should be fine to just turn off and on
[21:35:45] did via ticket (you too ;)
[21:35:46] cool cool
[21:35:46] are you ok with a 5% chance of the network for virt2 going down now ?"
[21:35:50] LeslieCarr: heh
[21:35:51] hyeah
[21:35:52] *yeah
[21:36:05] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52103
[21:36:08] LeslieCarr: computing: it's deterministic. usually.
[21:36:12] LeslieCarr: that means you can totally screw up now, cuz ryan agreed ;]
[21:36:18] haha
[21:36:18] New patchset: Dzahn; "redirect indiawikipedia.com to http://wikimedia.in/wikipedia.html (RT-1395)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/52141
[21:38:49] !seen jenkins
[21:41:25] mutante: yeah
[21:41:33] mutante: so turns out your change on apache-config at https://gerrit.wikimedia.org/r/52141
[21:41:45] mutante: it is a Jenkins success but somehow it does not report back :-(
[21:41:53] New patchset: Pyoungmeister; "reshuffling pmtpa s1 DBs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52147
[21:41:57] Change abandoned: Pyoungmeister; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52077
[21:43:15] New patchset: Ryan Lane; "Split virt2 away from other virt nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52148
[21:45:16] New patchset: Pyoungmeister; "moving db36 and db38 to shard x1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52149
[21:45:37] hashar: now it did, but this is the thing now that order has to be correct to merge
[21:45:55] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52148
[21:46:02] if a human does +2 code review before jenkins does Verified+2 it wont merge, only if the other way around
[21:46:48] New review: Dzahn; "repeat +2" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/52141
[21:46:48] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/52141
[21:47:38] New patchset: Ottomata; "Disabling Vodaphone India in response to packet loss." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52150
[21:48:10] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52150
[21:50:13] !log staring puppet back up on oxygen, packet loss is better after disabling a filter
[21:50:19] Logged the message, Master
[21:50:42] !log messing around with the network of virt2 - expect possible pmtpa labs network issues
[21:50:48] Logged the message, Mistress of the network gear.
[21:52:16] y u so slow, jenkins?
[21:52:44] he's only going to serve up fail
[21:53:36] Reedy: is broken?
[21:53:53] Seems to be in a bad mood with MediaWiki
[21:54:02] aha! I got a success! https://gerrit.wikimedia.org/r/#/c/52147/
[21:54:31] just needs another +1 modifier
[21:54:37] and then I'll be able to slay these orcs
[21:54:38] er
[21:54:42] whatever it is that I'm doing
[21:54:42] New patchset: Dzahn; "let's also make that work for www.indiawikipedia (RT-1395)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/52152
[21:55:44] * cmjohnson1 wonders why it takes so long to go through post on r510's
[21:55:58] "Ghetto Puppet
[21:57:51] mutante: regarding your nagios -> monitoring change in puppet, why don't you create a new module ? :-]
[21:58:30] ok that might need one module per software (nape,nagios,incinga,ganglia)
[21:58:33] nape -> nrpe
[21:59:32] well nagios is basically decommissioned
[21:59:39] Change abandoned: Dzahn; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51792
[21:59:41] New patchset: Pyoungmeister; "reshuffling pmtpa s1 DBs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52147
[21:59:41] congrat
[22:00:05] New patchset: Ryan Lane; "Order bonding before tagging for network node" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52153
[22:00:53] also could you guys possibly vote on having ack-grep included on all servers? https://gerrit.wikimedia.org/r/#/c/50306/
[22:01:01] that is a useful replacement to grep
[22:01:11] and a tiny package :-]
[22:01:37] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/52147
[22:01:40] New patchset: Ryan Lane; "Order bonding before tagging for network node" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52153
[22:03:45] !log py synchronized wmf-config/db-pmtpa.php 'shuffling around pmtpa enwiki slaves'
[22:03:50] Logged the message, Master
[22:04:39] Change abandoned: Hashar; "Upstream feature does not work that well and need to be slightly enhanced." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49993
[22:05:29] hey RobH or cmjohnson1 want to do something for me? :)
[22:05:45] that is very open ended?
[22:05:54] indeed it is. indeed it is......
[22:05:55] uh
[22:06:04] db36 and db38 need to be reimaged
[22:06:09] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52153
[22:06:13] but they're sun boxes
[22:06:22] and I have no idea how their consoles work....
[22:06:26] do you know?
[22:06:31] we have a page on them ;]
[22:06:44] ^ yeah...but if robh is busy i can do
[22:06:51] RobH: I tried this recently, and pulled out what little remaining hair I had...
[22:06:55] you have to tell the truth 'i fucking hate sun boxen and want someone else to handle this' ;]
[22:07:01] well
[22:07:02] it's more
[22:07:06] hehe
[22:07:09] if someone else knows the magic incantations
[22:07:10] https://wikitech.wikimedia.org/wiki/Sun_install_howto hehe
[22:07:13] it'll just take a lot less time
[22:07:15] cmjohnson1 you wanna take them you can feel free
[22:07:26] yeah..i can do them
[22:07:26] I mean, i totally can as well
[22:07:31] cmjohnson1: sweet! thank you
[22:07:39] otherwise if you dont get them done by tomorrow end of you day lemme konw and i can poke
[22:08:08] no..i am imaging db70 now or at leaste attempting after figuring out wtf was wrong with it
[22:08:13] so i can do it
[22:08:17] cool cool
[22:08:19] awesome! thank you
[22:09:46] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52149
[22:11:35] New patchset: Ryan Lane; "Move the ip into the role and below the bond" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52155
[22:11:56] notpeter: are they Cisco?
[22:12:23] notpeter: ignore me
[22:12:26] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52155
[22:13:33] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/52152
[22:14:22] New review: Dzahn; "apache-fast-test indiawikipedia.com.url mw1044" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/52152
[22:14:46] mutante: I sadly know the cisco imaging dance...
[22:17:09] dzahn is doing a graceful restart of all apaches
[22:17:51] !log dzahn gracefulled all apaches
[22:17:56] Logged the message, Master
[22:18:19] !log gracefulling eqiad Apaches to push indiawikipedia redirect
[22:18:24] Logged the message, Master
[22:25:55] binasher: hi, people wanted to know if professor can just go down for maintenance anytime or if it should be scheduled (graphite)
[22:28:12] mutante: it probably shouldn't be down during or within a few hours after any platform or new mediawiki branch deployments
[22:28:39] thanks
[22:28:41] RobH: ^
[22:29:10] cool
[22:29:26] otherwise, have at it
[22:29:45] New patchset: Ryan Lane; "Set tag on bond interface, rather than eth1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52157
[22:30:31] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52157
[22:32:16] New patchset: Hashar; "adapts lucene classes for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51677
[22:37:06] New review: Hashar; "rebased" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51677
[22:44:22] notpeter or robh: have a question...db36 and db38 fail to pxe boot...when i check the log on brewster, the installer seems to try and pull from the wrong host file in dhcpd
[22:44:36] buh?
[22:45:52] identify wrong?
[22:45:55] they arent fast
[22:46:43] this is the error i get /etc/dhcp3/linux-host-entries.ttyS1-115200 line 4476: host declarations not allowed here.
[22:47:17] they shouldn't even be in there...
[22:47:35] they arent.
[22:47:38] they are not
[22:47:39] right
[22:47:40] i checked
[22:47:54] and that looks like a normal line in the git repo config
[22:47:58] yep
[22:48:04] cmjohnson1: does the local file on brewster match the git repo?
[22:48:08] ie: has puppet run lately
[22:48:11] hrmmm..i should've checked the time stamp...wrong box
[22:48:25] heh
[22:49:24] doesn't explain my time out on them but different issue all together
[22:57:05] Ryan_Lane: https://www.kernel.org/doc/Documentation/networking/bonding.txt
[23:00:26] cmjohnson1: yeah, doesn't look like a dhcp request is ever getting to brewster from db36/38
[23:01:15] nope...it doesn't and not sure why (yet)
[23:01:42] probably solar flares.
[23:01:52] heh
[23:01:53] probably
[23:03:40] so i see db38 atttempting to pxe boot now and nothing in the log
[23:04:11] misbehaving network gnomes?
[23:05:01] but you can still ssh root@db38
[23:05:58] i can't
[23:06:10] I can't
[23:06:22] well maybe not now cuz i am trying to pxe boot it
[23:06:53] but try db36
[23:07:09] /Stage[main]/Misc::Install-server::Dhcp-server/Service[dhcp3-server]/ensure) change from stopped to running failed: Could not start Service[dhcp3-server]: Execution of '/etc/init.d/dhcp3-server start' returned 1: at /var/lib/git/operations/puppet/manifests/misc/install-server.pp:240
[23:07:42] dhcp server isnt running
[23:07:54] oh..that makes sense
[23:07:58] /etc/dhcp3/dhcpd.conf line 291: /etc/dhcp3/linux-host-entries.ttyS1-115200: bad parse.
[23:08:06] there is an issue in that file above
[23:08:27] mutante..that file is related to db70
[23:08:36] some typo?
[23:08:43] there was an issue cuz someone try to see it would work for swift and renamed it ms-be13
[23:08:46] missing ; or so?
[23:09:01] i thought i cleaned it all up but i may have missed something
[23:09:10] cmjohnson1: i see it
[23:09:16] db70 misses the closing }
[23:09:45] yep, started dhcp after live hack
[23:10:08] typo..thx for checking..did you fix or do i need to do that now?
[23:10:43] i can do it,, hold on
[23:10:49] !log bringing down virt2 network to bond it
[23:10:54] Logged the message, Mistress of the network gear.
[23:10:56] !log p.s. bonding on foundry sucks
[23:11:00] Logged the message, Mistress of the network gear.