[00:01:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:03:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.322 seconds
[00:37:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:41:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.595 seconds
[00:45:49] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 20 seconds
[00:46:52] RECOVERY - MySQL Replication Heartbeat on db24 is OK: OK replication delay 0 seconds
[00:55:34] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 16 seconds
[00:56:55] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds
[01:16:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:23:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.612 seconds
[01:56:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:03:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.781 seconds
[02:37:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:38:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.025 seconds
[04:15:44] RECOVERY - MySQL disk space on es1004 is OK: DISK OK
[04:47:32] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[04:48:44] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[04:52:02] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[04:57:35] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[04:57:35] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[05:04:38] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.022 second response time
[05:12:35] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[05:12:35] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[06:29:49] PROBLEM - Puppet freshness on search1022 is CRITICAL: Puppet has not run in the last 10 hours
[06:45:32] PROBLEM - Puppet freshness on search1021 is CRITICAL: Puppet has not run in the last 10 hours
[07:07:27] New review: Hashar; "(no comment)" [operations/software] (master) C: 1; - https://gerrit.wikimedia.org/r/4415
[07:56:49] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours
[08:30:56] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 190 seconds
[08:31:14] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 196 seconds
[08:53:17] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[09:46:15] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds
[09:47:10] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds
[11:15:11] hi has meta been upgraded ?
[11:24:38] PROBLEM - Puppet freshness on lvs1002 is CRITICAL: Puppet has not run in the last 10 hours
[12:17:59] ^demon: time for more datacenter work! Sure you don't wanna join ;] (I'm kidding, but I had to ask ;)
[12:19:02] <^demon> I'm not leaving my bedroom all day. Way too much to get done :\
[12:21:42] so....many....inappropriate....comments...must....sign.....off........ ;]
[13:48:39] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 452849 MB (3% inode=99%):
[14:51:56] PROBLEM - Host search1001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:05] PROBLEM - Host search1006 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:14] PROBLEM - Host search1005 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:14] PROBLEM - Host search1004 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:23] PROBLEM - Host search1002 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:23] PROBLEM - Host search1003 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:23] PROBLEM - Host search1007 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:32] PROBLEM - Host search1011 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:32] PROBLEM - Host search1012 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:32] PROBLEM - Host search1008 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:32] PROBLEM - Host search1010 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:32] PROBLEM - Host search1009 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:41] PROBLEM - Host search1015 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:41] PROBLEM - Host search1014 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:41] PROBLEM - Host search1017 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:41] PROBLEM - Host search1016 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:41] PROBLEM - Host search1013 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:42] PROBLEM - Host search1020 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:42] PROBLEM - Host search1021 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:43] PROBLEM - Host search1018 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:43] PROBLEM - Host search1019 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:44] PROBLEM - Host search1022 is DOWN: PING CRITICAL - Packet loss = 100%
[14:52:44] PROBLEM - Host search1023 is DOWN: PING CRITICAL - Packet loss = 100%
[14:53:35] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: No route to host
[14:53:35] PROBLEM - LVS Lucene on search-pool3.svc.eqiad.wmnet is CRITICAL: No route to host
[14:54:11] PROBLEM - Host search1024 is DOWN: PING CRITICAL - Packet loss = 100%
[14:54:20] PROBLEM - LVS Lucene on search-pool2.svc.eqiad.wmnet is CRITICAL: No route to host
[14:54:33] !log i killed eqiad search nodes, woooo
[14:54:37] Logged the message, RobH
[14:54:42] notpeter: maybe we should have disabled paging on those
[14:54:51] ahem, yes
[14:55:21] yes you proly should have
[14:55:34] Good Morning San Francisco
[14:55:44] its only 8am there
[14:55:50] thats not early
[14:55:59] yeah still
[14:56:12] How long before someone comes to check in? :D
[14:56:20] it's good it's a workday
[14:57:06] ok, I believe that notification is disabled for all of those vips now
[14:57:18] RobH: want to page-all with info?
[14:58:59] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[14:58:59] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[14:59:11] hrmm
[14:59:15] im not in terminal
[14:59:17] someone else do it
[15:01:27] done
[15:02:26] thx
[15:13:59] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[15:13:59] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[15:16:06] New patchset: Pyoungmeister; "less cronspam" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4562
[15:16:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4562
[15:18:36] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4562
[15:18:39] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4562
[15:34:13] well glad i ordered the mushkin adapters
[15:34:23] the ssh adapter that comes with the SSDs doesnt fit in the R410s
[15:35:44] ssd even
[15:36:24] RobH: that's the great thing about standards: so many to choose from!
[15:38:17] robh: when will the 410's arrive? any idea
[15:51:34] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 14.49945925 (gt 8.0)
[16:07:10] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.96619491667
[16:37:10] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 13.4719135833 (gt 8.0)
[17:31:40] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.09619025 (gt 8.0)
[17:49:44] !log moving es monitoring to nrpe and variables, may cause false pages if i did it wrong :)
[17:49:46] Logged the message, Mistress of the network gear.
[17:50:07] New review: Lcarr; "asher also thinks these monitors are proper" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4498
[17:50:10] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4498
[17:50:39] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4380
[17:50:42] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4380
[17:51:41] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 14.4821545833 (gt 8.0)
[17:57:23] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours
[18:00:52] * AaronSchulz hopes Ryan_Lane is having a Ruby-licious morning
[18:01:00] ?
[18:01:08] * AaronSchulz is just trolling
[18:01:13] :D
[18:01:53] woosters: meeting time?
[18:02:14] 11am pst :-)
[18:27:32] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.57089641667 (gt 8.0)
[18:43:53] New patchset: Lcarr; "putting ubuntu.png in" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4565
[18:44:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4565
[18:44:21] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4565
[18:44:24] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4565
[18:54:26] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[19:03:01] does anybody know whether we have enabled XMLRPC on Bugzilla?
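(Editor's note on the XML-RPC question just above: an XML-RPC endpoint answers POSTed method calls, so an empty page on a plain browser visit does not by itself show that the interface is off. The sketch below is one hedged way to probe it from Python's standard library. `Bugzilla.version` is a standard Bugzilla WebService method; the helper names are hypothetical, and the endpoint URL in the final comment assumes an `https://` scheme that the log does not state.)

```python
import xmlrpc.client


def build_version_request() -> str:
    """Return the XML-RPC request body for a Bugzilla.version call."""
    # Bugzilla.version takes no parameters, hence the empty tuple.
    return xmlrpc.client.dumps((), methodname="Bugzilla.version")


def parse_version_response(body: str) -> str:
    """Extract the version string from an XML-RPC response body."""
    params, _method = xmlrpc.client.loads(body)
    # The method returns a single struct with a "version" key.
    return params[0]["version"]


def fetch_version(endpoint: str) -> str:
    """Call Bugzilla.version on a live endpoint (performs a network request)."""
    proxy = xmlrpc.client.ServerProxy(endpoint)
    return proxy.Bugzilla.version()["version"]


# Hypothetical usage, not executed here because it needs network access:
# fetch_version("https://bugzilla.wikimedia.org/xmlrpc.cgi")
```

(If the call returns a version string, the interface is enabled; an `xmlrpc.client.Fault` or a protocol error would point to a server-side or transport problem instead.)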
[19:07:19] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 15.3202081667 (gt 8.0)
[19:07:25] New patchset: Lcarr; "trying to fix facilities.pp line 53" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4567
[19:07:37] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/4567
[19:08:23] New patchset: Lcarr; "trying to fix facilities.pp line 53" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4567
[19:08:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4567
[19:09:03] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4567
[19:09:05] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4567
[19:11:40] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 2.99581689076
[19:17:00] New patchset: Lcarr; "once again, trying to fix facilities.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4568
[19:17:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4568
[19:17:39] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4568
[19:17:42] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4568
[19:23:38] New patchset: Lcarr; "fixing for newmonitor" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4569
[19:23:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4569
[19:24:15] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4569
[19:24:17] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4569
[19:28:16] for some insane reason, when doing conditionals in puppet, you can't use fully scoped variables
[19:28:20] only local or top level
[19:28:30] which makes all of my for x in y's sort of fail
[19:28:46] oh noes
[19:28:52] yeah
[19:29:01] well good thing i pushed them today instead of friday afternoon when writing them :)
[19:29:05] hahaha
[19:32:07] New patchset: Lcarr; "fixing lvs monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4571
[19:32:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4571
[19:34:06] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4571
[19:34:09] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4571
[19:37:26] New patchset: Lcarr; "i hate your sometimes explicit sometimes not variables puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4572
[19:37:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4572
[19:37:51] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4572
[19:37:54] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4572
[19:53:03] New patchset: Lcarr; "fixing monitoring class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4573
[19:53:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4573
[19:53:29] hey maplebed: do you know if xmlrpc on bugzilla is enabled?
[19:53:56] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4573
[19:53:58] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4573
[19:57:16] what does Message may contain administrivia mean ?
[20:02:11] drdee: it is
[20:08:23] LeslieCarr, where is it that?
[20:08:47] ah it was a mailman error - trying to figure wtf in my message made mailman think it was a subscribe/unsubscribe thing
[20:09:12] Reedy: how can I verify that, because if I visit bugzilla.wikimedia.org/xmlrpc.cgi I get a completely empty page with absolutely nothing
[20:09:37] apergos, I think you're also located in Athens?
[20:10:16] Reedy: I am trying to integrate bugzilla and pivotal tracker using the xmlrpc interface but pivotal tracker throws errors (but very unclear ones)
[20:10:27] drdee: I don't think it's supposed to display anything... Ignoring that, I know hexmode uses the BZ API for stuff
[20:10:35] So he can probably give you a bit more advice
[20:11:21] okay,thx
[20:11:39] yes I am
[20:12:10] Athens_contractors++, :P
[20:12:35] (I guess you read about Faidon)
[20:13:44] yes
[20:13:57] I'm actually full time though
[20:13:59] but
[20:14:06] greek_speakers++
[20:14:22] we're only.. 30 people behind the french speakers now
[20:14:31] maybe 40 :-P
[20:15:09] On staff? nowai
[20:15:17] you better count non-English native
[20:15:33] There's an officewiki page somewhere with a language map of most staff
[20:15:52] By a liberal count we probably have 15, at most 20 French speakers at WMF
[20:16:21] And by "liberal" I mean counting people like me who sort of speak French but not nearly fluently
[20:17:05] well I will count if you can read it fluently
[20:17:16] which is easier for people than speaking if they learned it as a second language
[20:17:36] anyways, 20, 30, 15, we still have a long ways to go :-D
[20:21:55] apergos, you have the inconvenience of the foreign alphabet
[20:22:10] it's not at all inconvenient
[20:22:12] still, i find it way easier to recognise the letters than cyrillic
[20:22:28] kanji would be inconvenient
[20:22:29] apergos, for foreign speakers
[20:22:35] ah
[20:22:42] a little I guess
[20:23:09] :)
[20:29:40] binasher: switch to twitter fork!
[20:31:32] New patchset: Lcarr; "fixing gsb monitoring and moving to nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4574
[20:31:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4574
[20:32:11] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4574
[20:32:13] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4574
[20:33:45] drdee: can you email me your problem? I'm about to go ride. bbi1h
[20:37:00] PROBLEM - mysqld processes on es1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[20:37:58] hexmode: i did
[20:43:30] oh no mysql did not just die on es1003 after all our work
[20:45:24] RECOVERY - mysqld processes on es1003 is OK: PROCS OK: 1 process with command name mysqld
[20:45:54] oh it's still on the old slaving
[20:45:54] nm
[20:49:36] PROBLEM - mysqld processes on es1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[20:52:21] RobH: RT-2407 and 2408 are the ones we were talking about this morning, right?
[20:54:18] RobH: yeah, I think they are. I want to make sure that there's nothing I need to contribute that might cause racking (and provisioning) them to stall until I return from vacation.
[20:54:41] I know we talked about the requirements I've got for where they're racked, but I don't see the notes on the ticket.
[20:55:13] Should I write them down on that ticket, on another ticket, in an email, or some other way?
[20:55:31] (or are you good and confident that you need no additional info from me while I'm gone)
[21:06:06] RECOVERY - Host search1004 is UP: PING WARNING - Packet loss = 80%, RTA = 26.51 ms
[21:07:38] maplebed: feel free to comment in ticket on minimum racking requirements
[21:07:42] i plan to farm that ticket to mark
[21:09:06] PROBLEM - SSH on search1004 is CRITICAL: Connection refused
[21:09:59] RobH: you are el hombre
[21:11:37] RobH: ok, will do. I didn't want it lost in the sea of quotes but so long as you know it's there, I'll be happy!
[21:11:39] RECOVERY - Host search1002 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms
[21:11:39] RECOVERY - Host search1003 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms
[21:11:39] RECOVERY - Host search1001 is UP: PING OK - Packet loss = 0%, RTA = 26.73 ms
[21:12:36] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/4430
[21:12:57] ^demon: should we reenable the cronjob that is listing extensions?
[21:13:12] (change 4430)
[21:13:13] <^demon> https://gerrit.wikimedia.org/r/#change,4430
[21:13:15] <^demon> Yeah
[21:13:22] maplebed: oh, those are procurement tickets
[21:13:28] maplebed: dont put notes on there, indeed
[21:13:39] there should be a racking ticket, i will look for it later and email you with details regarding it
[21:13:45] it may not be until tomorrow
[21:13:49] if any ops are reading this, can you have a look at the very simple change #4430 ? :-] Thanks!
[21:14:03] PROBLEM - Lucene on search1004 is CRITICAL: Connection refused
[21:14:03] tomorrow I plan to run down every open task and detail off a handoff email to ops, including that one
[21:14:19] maplebed: indeed, adding racking notes to procurement ticket will result in no one seeing them ;]
[21:14:39] PROBLEM - SSH on search1001 is CRITICAL: Connection refused
[21:14:39] PROBLEM - SSH on search1003 is CRITICAL: Connection refused
[21:15:24] PROBLEM - SSH on search1002 is CRITICAL: Connection refused
[21:18:51] PROBLEM - Lucene on search1003 is CRITICAL: Connection refused
[21:19:18] PROBLEM - Lucene on search1001 is CRITICAL: Connection refused
[21:20:00] RobH: are you imaging those search boxxies?
[21:20:03] PROBLEM - Lucene on search1002 is CRITICAL: Connection refused
[21:20:04] or shall I do that tomorrow?
[21:24:24] PROBLEM - MySQL master status on es1001 is CRITICAL: CRITICAL: Read only: expected OFF, got ON
[21:24:51] PROBLEM - MySQL replication status on es1003 is CRITICAL: (Return code of 255 is out of bounds)
[21:25:06] notpeter: yea, as you can see, those hdds are installed
[21:25:10] PROBLEM - MySQL slave status on es1003 is CRITICAL: CRITICAL: Lost connection to MySQL server at reading initial communication packet, system error: 111
[21:25:15] so those are all yours
[21:25:27] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No
[21:25:36] PROBLEM - Puppet freshness on lvs1002 is CRITICAL: Puppet has not run in the last 10 hours
[21:26:28] RobH: awe! some!
[21:28:27] PROBLEM - NTP on search1004 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:33:15] PROBLEM - NTP on search1003 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:33:33] perhaps the idea to highlight PROBLEM for ping was a bad one
[21:33:42] my laptop has done nothing but beep at me now for days.
[21:33:51] PROBLEM - NTP on search1002 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:33:55] well, problem directly from the nagios-wm user only.
[21:34:09] notpeter: I am not, they are all yours man
[21:34:18] PROBLEM - NTP on search1001 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:34:27] I am trying to update all my onsite tickets to sit there for a week
[21:36:33] RECOVERY - MySQL disk space on es1004 is OK: DISK OK
[21:37:09] RECOVERY - MySQL slave status on es1004 is OK: OK:
[21:38:12] PROBLEM - MySQL replication status on es1004 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 135451s
[21:39:22] !log restarted mysql on es1004 and cleared out its disk space
[21:39:25] Logged the message, Mistress of the network gear.
[21:39:51] PROBLEM - MySQL Slave Delay on es1004 is CRITICAL: CRIT replication delay 108269 seconds
[21:48:15] RECOVERY - MySQL replication status on es1004 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[21:48:24] RECOVERY - MySQL Slave Delay on es1004 is OK: OK replication delay 0 seconds
[22:09:19] New patchset: Ryan Lane; "Adding in cisco hardware" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4577
[22:09:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4577
[22:09:42] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4577
[22:09:45] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4577
[22:11:33] New patchset: Ryan Lane; "Fix typo and the typo i copy/pasted" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4579
[22:11:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4579
[22:11:54] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4579
[22:11:57] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4579
[22:14:50] New patchset: Ryan Lane; "Seems this hasn't been tested... Making the upstart end with .conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4580
[22:15:05] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4580
[22:15:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4580
[22:15:20] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4580
[22:15:23] apergos: if you're around, any report on es1003?
[22:15:41] hello
[22:15:54] oh, I mentioned in the chat on the etherpad
[22:16:05] I musta missed it. sorry.
[22:16:23] it would have come due this weekend, and this morning I didn't remember I had it on my todo list, so it's going to be first thing tomorrow
[22:16:37] ok.
[22:16:37] you won't be here, but I assume there shouldn't be anything too tricky
[22:16:43] New patchset: Ryan Lane; "/me sighs. fix the require too." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4581
[22:16:55] if you would prefer I wait til you are around and I am awake, I can do it in the evening my time
[22:16:56] the only tricky bit is what we already talked about - choosing the right numbers to use in the 'change master to' line.
[22:16:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4581
[22:17:01] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4581
[22:17:04] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4581
[22:17:05] don't wait for me. I have faith.
[22:17:07] ok
[22:17:22] puppet's shitty lint checks could surely be better
[22:17:25] you'll see the results in the log
[22:17:27] dschoon: what was the name of the company you had mentioned with regards to hadoop ?
[22:17:31] cool.
[22:17:32] didn't see link in email
[22:18:54] RobH: ok. it's fixed for you now
[22:19:03] it only took me like 6 changes
[22:20:22] LeslieCarr links should be in first forwarded email, iirc
[22:21:02] read the docs, but didn't see the links to the outside world info ?
[22:21:04] LeslieCarr: oh sorry, you need to dereference an additional pointer. they're footnotes to https://www.mediawiki.org/wiki/Analytics/2012-2013_Roadmap/Hardware#Server_Hardware_Guidelines
[22:21:16] ahha
[22:21:20] http://www.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/
[22:21:23] http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
[22:21:27] http://wiki.apache.org/cassandra/CassandraHardware
[22:21:29] s'all
[22:22:14] New patchset: Ryan Lane; "Adding virt5 into compute cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4582
[22:22:21] ^^ \o/
[22:22:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4582
[22:22:41] and, in fact, the network section of that wiki page has relevant quotes to what we were talking about
[22:22:43] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4582
[22:22:46] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4582
[22:26:03] New patchset: Dzahn; "url redirect foundation-l listinfo page (and only this) to new wikimedia-l page" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4583
[22:26:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4583
[22:29:36] RECOVERY - MySQL Slave Delay on db12 is OK: OK replication delay 26 seconds
[22:30:18] New review: Dzahn; "tested" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4583
[22:30:21] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4583
[22:30:48] RECOVERY - MySQL Replication Heartbeat on db12 is OK: OK replication delay 0 seconds
[22:35:58] New patchset: Bhartshorne; "installing mysql on iron to use for testing and development." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4584
[22:36:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4584
[22:36:27] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4584
[22:36:31] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4584
[22:49:26] New patchset: Bhartshorne; "mysql class requires but doesn't pull in stuff for nrpe via the percona files class. try pulling in the nrpe class." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4585
[22:49:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4585
[22:50:18] New patchset: Bhartshorne; "mysql class requires but doesn't pull in stuff for nrpe via the percona files class. try pulling in the nrpe class." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4585
[22:50:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4585
[22:50:43] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4585
[22:50:46] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4585
[22:51:49] binasher: just in case something I do makes mysql shit a brick, I'm trying to get mysql installed on iron at the moment and puppet is making me fight for it. ^^^^
[22:52:32] I don't think I'll break any other servers, but you never know...
[22:52:44] won't that break every server that already includes nrpe?
[22:52:48] which should be every server
[22:53:34] I don't think so, but I'm not sure.
[22:53:37] hence the warning.
[22:53:37] New patchset: Dzahn; "mail forward - foundation-l to wikimedia-l list" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4586
[22:53:38] :)
[22:53:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4586
[22:54:26] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4586
[22:55:55] arr, why "change requires a recursive merge to resolve."?
[22:58:39] ok, the current puppet docs say it's fine if include is called for a class multiple times
[22:58:55] binasher: so far a puppet run on es1002 and db1033 went ok.
[22:59:14] of course, this doesn't completely solve my problem because just including nrpe also fails (/etc/nagios doesn't exist).
[22:59:16] ::sigh::
[22:59:28] I think I'm solving this problem the wrong way.
[22:59:49] what's iron? it should probably be in base
[23:00:11] iron is a 'random host for ops to run stuff and build stuff'.
[23:00:26] actually, maybe "include standard" instead of base
[23:00:27] and yes, it probably should be in base.
[23:00:32] or standard.
[23:00:35] I don't know the difference.
[23:00:50] standard includes base
[23:01:08] my goal is just to get mysql on it. when I tried include::mysql, it complained that the percona files couldn't find the nrpe service.
[23:01:44] it already includes standard.
[23:01:49] New patchset: Dzahn; "mail forward - foundation-l to wikimedia-l list, fix recursive merge" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4586
[23:02:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4586
[23:02:17] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4586
[23:02:18] binasher: do you have a suggestion for the right way to get mysql on it (since I feel like I'm going to cause more trouble than not doing it the way I thought I should).
[23:02:19] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4586
[23:05:25] New patchset: Ryan Lane; "Add cert for virt5" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4589
[23:05:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4589
[23:06:43] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4589
[23:06:46] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4589
[23:11:18] New patchset: Asher; "don't include mysql nrpe monitoring files from base mysql class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4591
[23:11:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4591
[23:11:43] maplebed: i'm removing that include from the base mysql class
[23:12:12] but
[23:12:23] i think there should be a generic mysql class
[23:12:36] in the way that we have generic webserver classes
[23:14:46] !log migrated foundation-l to wikimedia-l (users/passwords/archive urls/settings stay, old mail address & siteinfo redirect)
[23:14:48] Logged the message, Master
[23:14:51] phhhew :)
[23:15:03] and bye for tonight
[23:15:04] binasher: are the important files linked in other spots ?
[23:15:07] if not, i am worried
[23:20:46] Change abandoned: Asher; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4591
[23:21:47] New patchset: Bhartshorne; "Going about this the wrong way." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4593
[23:22:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4593
[23:22:18] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4593
[23:22:21] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4593
[23:23:55] New patchset: Lcarr; "Fixing up snmp monitoring for new icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4594
[23:24:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4594
[23:25:57] hi!
[23:26:00] howdy
[23:26:06] finally arrived
[23:26:11] well flown.
[23:26:12] cool. good trip?
[23:26:21] and, more importantly, checked in to the hotel
[23:26:32] that took about an hour and a half :)
[23:26:38] wow
[23:26:41] are you at the triton?
[23:26:49] well, not entirely their fault
[23:26:56] their checkin time is 15:00
[23:27:02] and I arrived at 14:20
[23:27:16] ah
[23:27:49] New patchset: Lcarr; "Fixing up snmp monitoring for new icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4594
[23:28:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4594
[23:29:10] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4594
[23:29:13] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4594
[23:31:05] New patchset: Lcarr; "removing dupes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4595
[23:31:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4595
[23:32:28] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4595
[23:32:31] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4595
[23:33:02] paravoid: well, we're over here at the office, but I'm heading off in about two hours
[23:33:03] so close to making icinga work proper
[23:33:06] so close!
[23:33:07] heh
[23:33:27] just another 50 or so commits away...
[23:34:20] Ryan_Lane: yes, CT told me, so I decided to not mess with your schedule
[23:36:01] paravoid: i can't wait to meet you … i hear you managed to get another place away from exported resources ?
[23:36:09] we are relying on them heavily and it is killing our puppet server
[23:40:57] what do you mean?
[23:41:09] I've been using them extensively, killing my puppetmasters too :)
[23:41:14] heh
[23:41:19] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[23:41:29] you implemented something on the puppetmaster side for nagios, right?
[23:41:38] LeslieCarr: I think he's still using exported resources for that
[23:41:38] oh, that
[23:41:52] I ported and heavily modified naggen
[23:42:09] naggen?
[23:42:10] and used generate() on the server [23:42:40] it's an executable found on puppet's source tree but it's unmaintained and doesn't work nowadays [23:42:44] ah [23:42:56] you run it on the server side [23:43:02] (I did it with generate) [23:43:17] and then it dumps all of the naginator resources [23:43:28] and creates the nagios_service.cfg (for example) from scratch [23:43:32] * Ryan_Lane nods [23:43:34] every time [23:43:52] it's blazingly fast, since it doesn't do any kind of parsing of the old file, just replaces it [23:44:06] but, you can't mix it with manual resources, since those would get overridden [23:44:22] you can work around that with nagios include directories though [23:44:26] I'm fairly sure we don't use manual resources [23:45:28] even if we did, we put almost all our checks into their own directory [23:45:36] right [23:46:19] New patchset: Lcarr; "removing facilities" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4596 [23:46:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4596 [23:46:40] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4596 [23:46:43] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4596 [23:47:31] ok…. added virt5. let's see if I can bring an instance up on it [23:49:14] mkdir: cannot create directory `/var/lib/nova/instances/instance-000001eb/': Permission denied\n [23:49:17] that would be a no [23:49:17] heh [23:51:34] * Ryan_Lane groans [23:51:49] all the damn systems have different uid/gids for the service accounts [23:54:13] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:58:59] looks like it's working