[00:00:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:07:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.830 seconds
[00:18:11] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 273 MB (3% inode=61%): /var/lib/ureadahead/debugfs 273 MB (3% inode=61%):
[00:20:26] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 225 MB (3% inode=57%):
[00:25:14] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 270 MB (3% inode=61%): /var/lib/ureadahead/debugfs 270 MB (3% inode=61%):
[00:26:35] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 273 MB (3% inode=61%): /var/lib/ureadahead/debugfs 273 MB (3% inode=61%):
[00:30:47] RECOVERY - Disk space on srv223 is OK: DISK OK
[00:30:56] RECOVERY - Disk space on srv221 is OK: DISK OK
[00:36:56] RECOVERY - Disk space on srv220 is OK: DISK OK
[00:42:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:49:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.825 seconds
[01:25:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:29:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.119 seconds
[01:40:50] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[01:55:25] PROBLEM - Puppet freshness on searchidx1001 is CRITICAL: Puppet has not run in the last 10 hours
[02:03:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:10:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.887 seconds
[02:13:25] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.059 second response time
[03:30:22] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 168 MB (2% inode=61%): /var/lib/ureadahead/debugfs 168 MB (2% inode=61%):
[03:32:46] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%):
[03:36:40] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 232 MB (3% inode=61%): /var/lib/ureadahead/debugfs 232 MB (3% inode=61%):
[03:38:46] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 268 MB (3% inode=61%): /var/lib/ureadahead/debugfs 268 MB (3% inode=61%):
[03:38:55] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 263 MB (3% inode=61%): /var/lib/ureadahead/debugfs 263 MB (3% inode=61%):
[03:40:52] RECOVERY - Disk space on srv219 is OK: DISK OK
[03:46:47] RECOVERY - Disk space on srv220 is OK: DISK OK
[03:46:47] RECOVERY - Disk space on srv222 is OK: DISK OK
[03:50:50] RECOVERY - Disk space on srv223 is OK: DISK OK
[04:14:41] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[04:25:38] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[04:25:38] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[04:39:18] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[04:39:18] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[04:54:36] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 3227 MB (2% inode=99%):
[04:54:36] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 3219 MB (2% inode=99%):
[05:23:42] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:40] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 3917 MB (3% inode=99%):
[06:15:35] !log killed vi on fenari owned by awjrichards, locking CommonSettings.php for two days
[06:15:38] Logged the message, Master
[07:21:41] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours
[07:52:57] hi, anybody checked out storage3 yet?
[07:53:04] or i will now
[07:53:25] it is down and backups on aluminium and grosley were affected
[07:54:26] I did not notice that
[07:54:44] it's not in the pages
[07:55:11] ok, checking mgmt now
[07:57:17] !log powercycling storage3
[07:57:19] Logged the message, Master
[07:59:47] RECOVERY - Host storage3 is UP: PING WARNING - Packet loss = 44%, RTA = 91.71 ms
[08:02:35] nagios-wm: there are more recoveries to report ..
[08:03:59] PROBLEM - MySQL replication status on storage3 is CRITICAL: (Return code of 255 is out of bounds)
[08:04:17] PROBLEM - MySQL slave status on storage3 is CRITICAL: CRITICAL: Lost connection to MySQL server at reading initial communication packet, system error: 111
[08:05:20] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: (Return code of 255 is out of bounds)
[08:05:38] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours
[08:06:21] !log storage3 - gmond unable to find the metric information for any mysql_* .."module has not been loaded", starting mysql, running puppet ...
[08:06:23] Logged the message, Master
[08:06:23] RECOVERY - MySQL slave status on storage3 is OK: OK:
[08:06:32] RECOVERY - Puppet freshness on storage3 is OK: puppet ran at Fri Mar 30 08:06:06 UTC 2012
[08:06:57] no obvious reason for the shutdown when glancing at syslog etc
[08:07:39] no hardware issues reported on boot
[08:15:50] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 26s
[08:16:35] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 33s
[08:18:41] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[09:09:25] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 7 MB (0% inode=61%): /var/lib/ureadahead/debugfs 7 MB (0% inode=61%):
[09:09:34] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 240 MB (3% inode=61%): /var/lib/ureadahead/debugfs 240 MB (3% inode=61%):
[09:09:43] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 77 MB (1% inode=61%): /var/lib/ureadahead/debugfs 77 MB (1% inode=61%):
[09:17:04] PROBLEM - Apache HTTP on srv224 is CRITICAL: Connection refused
[09:18:05] !log srv224,srv219,srv220, upgrade apache, dist-upgrading w/ kernel, disabling ureadahead, rebooting one by one
[09:18:07] Logged the message, Master
[09:19:55] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 232 MB (3% inode=57%):
[09:22:01] RECOVERY - Disk space on srv221 is OK: DISK OK
[09:22:01] RECOVERY - Disk space on srv219 is OK: DISK OK
[09:22:10] RECOVERY - Disk space on srv224 is OK: DISK OK
[09:28:28] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 278 MB (3% inode=61%):
[09:33:01] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%):
[09:34:22] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.099 second response time
[09:35:07] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=57%): /var/lib/ureadahead/debugfs 0 MB (0% inode=57%):
[09:35:07] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 279 MB (3% inode=61%): /var/lib/ureadahead/debugfs 279 MB (3% inode=61%):
[09:41:43] RECOVERY - Disk space on srv222 is OK: DISK OK
[09:47:43] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 177 MB (2% inode=61%): /var/lib/ureadahead/debugfs 177 MB (2% inode=61%):
[09:48:01] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%):
[09:51:55] RECOVERY - Disk space on srv219 is OK: DISK OK
[09:55:49] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 179 MB (2% inode=61%): /var/lib/ureadahead/debugfs 179 MB (2% inode=61%):
[09:56:16] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 239 MB (3% inode=57%):
[09:57:37] PROBLEM - Apache HTTP on srv219 is CRITICAL: Connection refused
[10:00:28] RECOVERY - Disk space on srv224 is OK: DISK OK
[10:01:49] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.105 second response time
[10:02:07] RECOVERY - Disk space on srv223 is OK: DISK OK
[10:06:37] RECOVERY - Disk space on srv220 is OK: DISK OK
[10:06:55] RECOVERY - Disk space on srv222 is OK: DISK OK
[10:22:22] PROBLEM - Apache HTTP on srv220 is CRITICAL: Connection refused
[10:30:46] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.057 second response time
[10:39:19] PROBLEM - Apache HTTP on srv223 is CRITICAL: Connection refused
[10:41:00] !log same for srv223
[10:41:02] Logged the message, Master
[11:04:31] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.093 second response time
[11:48:14] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 147 MB (2% inode=57%):
[11:52:26] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 135 MB (1% inode=57%):
[11:52:35] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%):
[11:56:38] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 242 MB (3% inode=57%):
[11:57:05] PROBLEM - Puppet freshness on searchidx1001 is CRITICAL: Puppet has not run in the last 10 hours
[11:58:44] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 136 MB (1% inode=57%):
[12:00:59] RECOVERY - Disk space on srv220 is OK: DISK OK
[12:02:47] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 55 MB (0% inode=57%):
[12:06:59] RECOVERY - Disk space on srv223 is OK: DISK OK
[12:06:59] RECOVERY - Disk space on srv219 is OK: DISK OK
[12:07:08] RECOVERY - Disk space on srv222 is OK: DISK OK
[12:28:04] !log deleted old kernel sources on upgraded srvs for that little extra space during peaks, suggesting to nuke /usr/share/doc if there should be more disk space warnings
[12:28:06] Logged the message, Master
[12:28:44] reinstall the boxes with a decent partitioning scheme instead :P
[12:29:00] heh, fair enough
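The !log entries above describe clearing space on the srv2xx root partitions (old kernel packages removed, /usr/share/doc suggested as the next candidate). A minimal sketch of that kind of cleanup follows; the package pattern and the choice to purge everything except the running kernel are assumptions for illustration, not what was actually run:

```bash
# Illustrative only: free a little space on / on a Lucid-era Ubuntu apache box.
df -h /                                    # confirm how tight the root partition is
du -sh /usr/share/doc /var/cache/apt       # other candidates mentioned in the log
dpkg -l 'linux-image-2.6.*' | awk '/^ii/ {print $2}'          # installed kernel images
# Purge every installed kernel image except the one currently running
# (the running kernel must stay, otherwise the next reboot fails).
dpkg -l 'linux-image-2.6.*' | awk '/^ii/ {print $2}' \
  | grep -v "$(uname -r)" | xargs -r sudo apt-get -y purge
sudo apt-get clean                         # drop cached .debs as well
```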
[12:29:09] hi mark
[12:42:14] PROBLEM - Host db1047 is DOWN: PING CRITICAL - Packet loss = 100%
[12:44:38] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:46:35] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[12:47:04] !log powercycling db1047
[12:47:06] Logged the message, Master
[12:50:47] RECOVERY - Host db1047 is UP: PING OK - Packet loss = 0%, RTA = 26.80 ms
[12:52:03] PROBLEM - NTP on db1047 is CRITICAL: NTP CRITICAL: Offset unknown
[12:53:06] PROBLEM - mysqld processes on db1047 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[12:55:12] RECOVERY - mysqld processes on db1047 is OK: PROCS OK: 1 process with command name mysqld
[12:56:15] RECOVERY - NTP on db1047 is OK: NTP OK: Offset -0.0309022665 secs
[12:56:36] !log db1047 - added system startup for /etc/init.d/mysql
[12:56:38] Logged the message, Master
[12:59:06] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 294 seconds
[12:59:24] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 230 seconds
[13:01:12] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds
[13:01:30] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds
[13:09:00] sees "
[13:09:21] a new check_to_check_nagios_paging by maplebed
[13:09:59] hi mutante
[13:10:02] back in germany yet?
[13:10:23] since like a day
[13:10:27] or 2
[13:10:57] depending if you count the jetlag :p
[13:11:41] hehe
[13:11:56] db1047 just froze there..i saw it was a core db even
[13:12:03] from the motd
[13:19:50] [14:18:14] (NEW) Massive loss of data in the list of contributions - https://bugzilla.wikimedia.org/35610 major; Wikimedia: General/Unknown; (marc.galli35)
[13:20:45] and nosy was here earlier asking about possible issues with replication db47->amaranth.toolserver, checked a few things for her, no obvious problem on our side, now it's waiting for dab for amaranth
[13:22:05] robh won't like to hear about db1047
[13:25:24] Reedy: scary bug title?
[13:25:37] mutante: indeed. Though, he seems to be correct
[13:26:52] oh, sounds like Importance should go up?
[13:26:59] he already did
[13:26:59] =/
[13:28:05] uh oh
[13:29:28] Yeah...
[13:31:11] disappeared when?
[13:31:57] so are those other revisions in there with some bad user id I wonder
[13:33:56] http://fr.wikisource.org/w/index.php?title=Wikisource:Accueil&diff=1466&oldid=1465 that one for example
[13:41:21] http://fr.wikisource.org/w/index.php?title=Auteur:Ren%C3%A9_Descartes&dir=prev&action=history
[13:42:59] apergos: he seems to be mentioning 2 different accounts
[13:43:11] I wonder if it's some form of an incomplete rename
[13:43:42] I don't see where you get that
[13:44:28] He's saying he is/was Caton, and there are contribs for https://fr.wikisource.org/wiki/Utilisateur:Marc
[13:44:38] https://fr.wikisource.org/wiki/Utilisateur:Caton?redirect=no
[13:45:02] https://fr.wikisource.org/w/index.php?limit=50&tagfilter=&title=Sp%C3%A9cial%3AContributions&contribs=user&target=Marc&namespace=&tagfilter=&year=&month=-1
[13:45:17] where does he say that he is Marc?
[13:45:51] His display name is "MArc" on the bug... He refers to the edits by Caton as being his
[13:46:05] at the bug entry
[13:47:25] I guess someone has to ask him about a rename
[13:47:29] mysql> SELECT count(*) FROM revision WHERE rev_user_text = 'Caton';
[13:47:31] I tried to look in the logs but
[13:47:34] | 3220 |
[13:47:40] nothing showed otoh I cannot read anything over there
[13:48:08] He's got some edits as user id 0
[13:48:11] ERROR 1305 (42000): FUNCTION frwikisource.distinc does not exist
[13:48:11] mysql> SELECT distinct(rev_user) FROM revision WHERE rev_user_text = 'Caton';
[13:48:11] +----------+
[13:48:11] | rev_user |
[13:48:12] +----------+
[13:48:15] | 0 |
[13:48:17] | 44 |
[13:48:19] +----------+
[13:48:29] only 44?
[13:48:37] no, user id 0 and user id 44
[13:48:40] oh
[13:48:49] how many as user 0?
[13:49:09] 3216
[13:49:12] sorry, I should either pay full attention or leave this alone I guess
[13:49:16] heh :)
[13:49:33] so wanna ask him about whether he was renamed?
[13:49:40] Yeah, will do
[13:55:46] "It's difficult to remember, because it is a former account. Sometimes for some reason, I look what I have done with this account. The last time was perhaps last year."
[13:55:57] oh gee
[13:56:06] I'm presuming renames don't usually go via user 0 and user text != an ip address?
[13:56:58] failed renames will often have uid 0 on a bunch of the revs
[13:57:02] I don't remember why that is
[13:57:24] and depending when the rename was done, the user name can appear in different parts of the log
[13:57:28] it's kind of annoying that way
[13:58:17] we'd have to hunt around to see when the marc account was created
[14:00:17] "My account was not renamed. It is an former account from old wikisource. But all was fine on fr: until now."
[14:00:43] eeewww
[14:01:30] I am liking this less and less
[14:01:58] And the other account he mentions also has edits on user id 0
[14:02:00] what about the other account mentioned in the report? maybe that user is more active?
[14:02:14] I'm gonna drop the severity, cause the data is certainly not missing
[14:02:25] 2010
[14:02:43] (show/hide) 11:44, 4 May 2009 Account Shaihulud (Talk | contribs) was created automatically
[14:03:10] Marc/Caton don't seem to be in the account creation logs
[14:03:20] there was a point where we turned them off
[14:03:23] for automatic
[14:03:27] then turned them back on
[14:03:41] but if he has edits going way back maybe there weren't good logs then
[14:04:47] https://bugzilla.wikimedia.org/show_bug.cgi?id=35610#c6
[14:06:27] select * from revision where rev_user = 0 and rev_user_text NOT LIKE an ip address...
[14:06:29] so one could run a job I guess that found all rev 0
[14:06:34] I mean user id 0 revs
[14:06:42] yeah, I see you are thinking the same thing
[14:07:31] that everything was good a year ago... well... a year is a long time
[14:07:44] who knows what happened during that time
[14:08:19] both are indexed, but separately..
[14:08:24] I also can't believe it just broke today or something
[14:08:38] seems very doubtful
[14:09:17] anyways a script could find them all, reassigning them would be a bit more of a pain but still scriptable
[14:10:03] and then we would know that as of x date things were clean
[14:10:18] right now what we know is that there might have been revisions in there like that for years and years
[14:10:38] Yeah.. It also makes me wonder how widespread this may be
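The ad-hoc queries above check a single user; the job being sketched in the conversation would sweep for every revision credited to user id 0 whose rev_user_text is not an IP address. A minimal version of that sweep, assuming the standard MediaWiki revision schema; the database host is a placeholder and the IPv4-only pattern is a simplification (IPv6 contributors would need a second pattern):

```bash
# Hypothetical sweep for the orphaned attributions discussed above: revisions
# with rev_user = 0 whose rev_user_text does not look like an IPv4 address.
# "db-placeholder" stands in for the real slave; IPv6 addresses are ignored here.
mysql -h db-placeholder frwikisource -e "
  SELECT rev_user_text, COUNT(*) AS revs
  FROM revision
  WHERE rev_user = 0
    AND rev_user_text NOT REGEXP '^[0-9]{1,3}([.][0-9]{1,3}){3}$'
  GROUP BY rev_user_text
  ORDER BY revs DESC
  LIMIT 50;"
```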
[14:11:01] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 18 MB (0% inode=57%):
[14:13:13] oh I'm sure it's across all wikis
[14:15:13] RECOVERY - Disk space on srv221 is OK: DISK OK
[14:15:49] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[14:16:34] Yeah...
[14:17:08] I seem to recall a rename-ish bug that had some revs like these that were renames and some that were "who knows what caused these"
[14:26:46] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[14:26:46] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[14:40:43] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[14:40:43] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[15:25:27] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours
[16:46:10] New patchset: Pyoungmeister; "let java do the logrotation! (working well in testing.)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3989
[16:46:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3989
[16:47:08] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3989
[16:47:11] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3989
[16:47:53] maplebed: ok if I merge your stuff?
[16:47:58] looks innocuous
[16:48:10] ack, I submitted something without merging?!
[16:48:25] ja, is just changes to the notification stuffs
[16:48:29] I'm gonna merge
[16:48:37] doh!
[16:48:40] I'm sorry.
[16:48:44] that sat there since yesterday!
[16:48:51] heh
[16:48:52] no biggie
[17:01:33] New patchset: preilly; "Add all Digi Malaysia IP addresses" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4004
[17:01:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4004
[17:03:39] New patchset: preilly; "Add all Digi Malaysia IP addresses" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4004
[17:03:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4004
[17:07:27] New patchset: preilly; "Add all Digi Malaysia IP addresses" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4004
[17:07:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4004
[17:08:25] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4004
[17:08:28] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4004
[17:18:57] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 81 MB (1% inode=61%): /var/lib/ureadahead/debugfs 81 MB (1% inode=61%):
[17:23:45] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours
[17:30:12] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:31:24] RECOVERY - Disk space on srv222 is OK: DISK OK
[17:34:15] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[18:04:00] New patchset: Bhartshorne; "telling sanger what IP to bind to for opendj ldap" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4009
[18:04:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4009
[18:06:37] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4009
[18:06:40] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4009
[18:06:49] ^^^ trying to puppetize the fix I put in for sanger and ldap yesterday
[18:11:15] New patchset: Lcarr; "Revert "TESTING seeing if relative variables make a difference to icinga"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4010
[18:11:29] New patchset: Lcarr; "Revert "TESTING icinga stuff"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4011
[18:11:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4010
[18:11:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4011
[18:13:11] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4010
[18:13:13] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4010
[18:13:18] !log killing puppet daemon on brewster, i need to hack at local configuration for cisco server stuff
[18:13:19] Logged the message, RobH
[18:13:27] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4011
[18:13:30] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4011
[18:15:57] I'd need a shell command that i can execute on any labs instance, that returns the "instance_name" / the "nice" name, and replaces something that just works as `hostname` in prod. like on labs instances the hostnames are the resource names, but not the instance names
[18:18:07] can a nova instance find out its own instance_name? as opposed to asking the controller
[18:20:36] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[18:22:12] im not sure an instance knows that without the controller
[18:22:20] i never see it mentioned in instance specific files
[18:22:28] but my experience in labs is quite limited.
[18:23:52] hmm, yeah, then either labs nagios needs to be changed to use the resource names as "host names" from Nagios' point of view ..
[18:24:32] or snmp traps for puppet freshness won't work
[18:25:18] stupid OSX questions for someone who actually uses it: What's the story with the ssh key agent these days, is there a built in agent? Does it start by default?
[18:25:38] it took quite a bit to this point, multiple other issues, and now _that_ close, after the traps arrive, submit_check_result works.. and all... just the damn hostname mismatch keeps it from working :p
[18:26:30] because this is a passive check the instances themselves need to sent their (nagios) "hostname" out by themselves
[18:26:40] send
[18:31:40] Jeff_Green: I think my agent is actually part of the osx keychain stuff.
[18:31:48] huh
[18:32:12] all i know is that when I start a new window in the terminal my SSH_AUTH_SOCK is valid.
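A compact version of the agent check being discussed in this exchange; the host names come from the log, while the key path is a placeholder (keys with non-default file names generally have to be added to the agent by hand before they are forwarded):

```bash
# Sketch of the hop-by-hop agent check discussed here.
# ~/.ssh/my_custom_key is a placeholder, not a real key name.
ssh-add ~/.ssh/my_custom_key    # load the key into the local agent
ssh-add -l                      # confirm the agent actually holds it
ssh -A aluminium 'ssh-add -l'   # same check on the first hop, with forwarding (-A)
ssh -A aluminium                # then hop onward interactively...
ssh storage3                    # ...which should authenticate via the forwarded agent
```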
[18:32:12] i know ryan discovered you need to exec independent agents
[18:32:16] or it shares it across all terminal sessions
[18:32:20] which is undesirable
[18:32:23] I'm trying to help zack get to storage3 so he can work with impression logs that are stored there. he's got an account and his public key is there
[18:32:31] ssh-add -l
[18:32:37] that's the best check
[18:32:55] does he have to manually attach keys to the keyring?
[18:33:16] it will automatically use keys with the standard key name
[18:33:25] but you have to manually add keys with custom names.
[18:33:30] ok
[18:33:35] strange
[18:33:38] (i.e. it'll add .ssh/id_rsa but not .ssh/mykey)
[18:33:49] so in theory he should be able to ssh -A aluminium, and then ssh to storage3
[18:34:04] I would ssh-add -l at each hop and make sure the agent is coming along for the ride.
[18:34:11] ok
[18:35:09] and of course he's gone offline now :-P
[18:35:51] hmph. feel free to tell him to ping me for help if you're not around when he gets back.
[18:35:59] ok. thank you
[18:38:00] apropos storage3, it crashed today and there were no daily aluminium / grosley offhost backups
[18:38:19] mutante: yeah, I saw--it's ok, not a big deal
[18:38:33] it'll catch up tonight
[18:39:17] i'm bummed that we didn't get any real notification
[18:39:30] i got email on it.
[18:39:38] but when i read it you were already chatting about it in channel
[18:39:46] i get emails on all that stuff
[18:39:51] the backup fail stuff or from nagios?
[18:39:51] New patchset: Lcarr; "removing nagios_config_dir from neon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4012
[18:40:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4012
[18:40:12] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4012
[18:40:14] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4012
[18:40:25] yes to both, will catch up and got mail
[18:40:50] the only mail I saw was from the backup jobs and your updates as you fixed it
[18:40:55] nothing from nagios
[18:41:25] maybe they were eaten by gmail somehow
[18:48:02] hello ops :-D
[18:48:15] does anyone know which smtp server I could use to send emails?
[18:48:25] the sending host would be gallium, software is jenkins
[18:49:02] localhost?
[18:49:12] ohh
[18:49:15] I should try that :D
[18:49:28] javax.mail.MessagingException: Could not connect to SMTP host: localhost, port: 25;
[18:49:32] do you need to shell out to mail or connect to localhost 25?
[18:49:36] ah. the latter.
[18:49:46] it seems
[18:49:51] I don't know how we do outgoing mail here.
[18:50:00] I will have to find out
[18:50:02] * maplebed reads http://wikitech.wikimedia.org/view/Mail
[18:50:05] there might be a puppet class just for that
[18:50:30] oh we actually have documentation \O/
[18:50:41] sadly I don't think it documents what I wanted it to.
[18:50:52] I would look at how mediawiki sends mail and use that as the template.
[18:51:09] hashar: i think we ordinarily use the server on localhost, which is configured to relay
[18:51:34] strange that it's not working
[18:51:41] * hashar starts pretending to be an ops by looking at puppet conf
[18:51:58] sure enough, exim's not running on gallium.
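A quick way to confirm that finding from the host itself — a sketch only; the package and service names assume the stock Debian/Ubuntu exim4 packaging that the puppet class wraps:

```bash
# Sketch: is an MTA installed, running, and listening on port 25 on this host?
dpkg -l 'exim4*' | grep '^ii'        # any exim4 packages installed at all?
pgrep -fl exim                       # is a daemon process running?
sudo netstat -plnt | grep ':25 '     # is anything bound to the SMTP port?
# Once the right class (e.g. exim::simple-mail-sender) is applied to the node,
# a one-shot puppet run should install and start it:
sudo puppetd --test                  # Lucid-era puppet agent one-shot run
```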
[18:52:04] the host must be missing some exim package
[18:52:23] exim::simple-mail-sender sounds good
[18:52:47] that should be installed by puppet, i wonder if it was intentionally left off gallium
[18:53:40] ordinarily hosts accept mail locally, then relay out through mchenry, and fail to lily
[18:55:05] maplebed: I got zack to run ssh-add -l and indeed no keys are being forwarded to his first hop (aluminium)
[18:55:09] New patchset: Hashar; "allow gallium to send mails (for Jenkins notifications)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4014
[18:55:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4014
[18:55:35] and lily is down
[18:55:37] he's doing ssh -A, so I'm not sure why his agent is failing to attach them
[18:55:37] change 4014 ^^^ should enable exim on gallium if one of you is willing to merge / apply it :-D
[18:55:42] Jeff_Green: my guess - ssh is specifically prohibiting forwarding for some reason (such as a mismatched host key)
[18:56:16] they should fail to sodium
[18:56:24] maplebed: huh, ok
[18:56:44] I think the ssh daemon can prohibit agent forwarding too, but I don't think we do that.
[18:57:30] yeah it doesn't for me on those hosts
[18:57:44] I just checked too. and I agree.
[18:58:08] I think command line flags are supposed to override the .ssh_config, but he could check his ~/.ssh/ssh_config file and see if he's squashing his agent.
[18:58:10] should he have to manually attach the key locally?
[18:58:27] or is it auto-attached the first time he uses it?
[18:58:28] at this point, it might be easier to just take over his laptop to poke around rather than playing telephone.
[18:58:32] (eg. using ichat)
[18:58:46] heheh yes I agree, and he's in a meeting to boot
[18:59:14] I'll ask him to track you down if that's ok
[18:59:56] "calling ssh technician to cubicle 79B"
[19:00:06] np.
[19:01:39] maplebed: pgehres just offered to help Zack, you're off the hook!
[19:02:16] alright, well, the offer's still open.
[19:03:10] thanks
[19:03:56] maplebed: Jeff_Green: can one of you review my change that installs exim on gallium : https://gerrit.wikimedia.org/r/#q,4014,n,z
[19:04:00] please ? :-D
[19:04:07] sure, sorry got distracted
[19:04:41] exim::simple-mail-sender is part of the "standard" class
[19:04:57] New review: Jgreen; "winning!" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4014
[19:04:59] but gallium is not really standard :-D so it does not have that exim class
[19:05:00] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4014
[19:06:11] Jeff_Green/ hashar why not just put include standard on gallium?
[19:06:14] i detest ubuntu release names
[19:06:18] Jeff_Green: actually, that storage3 mail wasn't from nagios@ in this case, but from root@, Subject: FAILURE ...
[19:06:20] and how they are used in apt and the like
[19:06:22] i think you now have all 4 components of the standard class on there
[19:06:26] should just use the damned version numbers.
[19:06:33] LeslieCarr: oh maybe we have all of them
[19:07:03] mutante: ah that's my backup script. I have half a mind to make that thing send email-to-sms notifications so we actually get some monitoring on fundraising db's :-P
[19:07:31] LeslieCarr: all but the evil generic::tcptweaks
[19:07:41] New patchset: preilly; "Add mobile channel to log" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4015
[19:07:45] haha
[19:07:47] that's not evil at all
[19:07:51] it's good for you
[19:07:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4015
[19:08:07] RobH: I like making up absurd names sorta in their scheme
[19:08:08] unless there's a reason gallium needs a huge syn timeout
[19:08:12] it has both TCP and "tweak" !!! That triggers an EVILNESS warning to me :-D
[19:08:16] haha
[19:08:29] i have enough trouble recalling that lucid is 10.04
[19:08:36] i dont need to think of another sarcastic name ;]
[19:08:42] LeslieCarr: amending change to use standard
[19:08:49] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4015
[19:08:52] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4015
[19:09:10] since it's already merged, just do a new change (unless you were using amending in the english sense, instead of the git sense)
[19:09:22] New patchset: Pyoungmeister; "adding some hashes for with which to generate some confs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4017
[19:09:23] Jeff_Green: I'm committing your gallium change.
[19:09:38] New patchset: Pyoungmeister; "first pass at dynamic conf generation for lucene" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4018
[19:09:43] maplebed: thx
[19:09:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4017
[19:09:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4018
[19:10:53] New patchset: Hashar; "make gallium "standard"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4019
[19:10:57] would someone who has strong puppet fu be willing to look at https://gerrit.wikimedia.org/r/#change,4018
[19:11:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4019
[19:11:21] LeslieCarr: change 4019 uses the standard class on gallium :D
[19:12:00] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4019
[19:12:02] reviewed and submitted
[19:12:03] :)
[19:12:03] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4019
[19:12:11] all your tcp are belong to us
[19:12:15] good
[19:12:37] maplebed: by committing do you mean having the change made available to puppet ? You might as well get change 4019 so
[19:12:55] maplebed: ends up adding exim class is just the same as adding the standard class
[19:13:37] hashar: after we mark a change reviewed in gerrit we manually pull it into the repo on sockpuppet. If two of us merge changes at the same time, they both show up when either of us does the diff review on sockpuppet.
[19:13:57] I was just telling jeff I was enabling his already-merged change.
[19:14:30] that's separate from actually reviewing and publishing the change in gerrit...
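A hypothetical illustration of that manual step — the repository path on sockpuppet is a placeholder, not the real layout; only the fetch/diff/merge pattern itself is what the explanation above describes:

```bash
# Sketch of the "pull it into the repo on sockpuppet" step described above.
# /srv/puppet-placeholder stands in for wherever the working copy actually lives.
ssh sockpuppet
cd /srv/puppet-placeholder
git fetch origin
git log --oneline HEAD..origin/production   # changes merged in Gerrit but not yet pulled
git diff HEAD origin/production             # the diff review mentioned above
git merge --ff-only origin/production       # apply it once it looks sane
```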
[19:14:59] ah that is all the magic ops stuff happening in the backstage
[19:15:17] * hashar feels like a special guest at a rock concert
[19:15:45] though in our case there are millions of fans asking for their wiki pages
[19:16:44] New patchset: Demon; "Create hook for l10n-autoreview in gerrit." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4020
[19:16:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4020
[19:17:28] Jeff_Green: indeed, Nagios did not get
[19:17:31] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4017
[19:17:34] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4017
[19:18:01] Jeff_Green: indeed, Nagios did not get to more than a HOST ALERT Warning..
[19:18:33] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4018
[19:18:36] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4018
[19:18:47] mutante: odd
[19:18:50] Jeff_Green: so it's not the notification stuff, it did not get the HOST ALERT done
[19:19:10] what would have sent that?
[19:19:12] wait, what happened?
[19:19:12] ^demon: you rocks :-)
[19:19:13] and this one, ouch
[19:19:17] Program End[03-30-2012 06:29:41] Bailing out due to one or more errors encountered in the configuration files. Run Nagios from the command line with the -v option to verify your config before restarting. (PID=3865)
[19:19:24] <^demon> hashar: :)
[19:19:27] New patchset: preilly; "Change project match string" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4021
[19:19:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4021
[19:19:49] ^demon: hookconfig.py is generated by puppet based on a template in operations/puppet.git see templates/gerrit/hookconfig.py.erb
[19:19:51] mutante: that's logged on spence or stafford or whatever the nagios host is these days?
[19:20:08] can I merge the gallium change?
[19:20:19] ^demon: you could probably hardcode the value in that template
[19:20:22] <^demon> hashar: Ah ok. I'll fix that and submit a second patchset.
[19:20:28] <^demon> Yeah that stuff's fine being hardcoded.
[19:20:37] Jeff_Green: in this case i see it even in web cgi's
[19:20:43] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4021
[19:20:46] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4021
[19:21:22] to whoever changed a thing about gallium in site.pp, your code is now live
[19:23:58] New patchset: Demon; "Create hook for l10n-autoreview in gerrit." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4020
[19:24:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4020
[19:25:57] New review: Hashar; "(no comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4020
[19:26:03] ^demon: added comments to patchset1 sorry
[19:26:38] notpeter: I did :-D Supposed to have exim installed but it does not seem to have installed anything
[19:27:21] well exim does run according to ps :D just not listening on port 25
[19:27:23] I am cursed!!
[19:27:39] notpeter: so exim runs at least. Thx
[19:28:06] <^demon> hashar: `gerrit approve` is just a back-compat alias for `gerrit review`
[19:28:10] <^demon> They're functionally identical.
[19:28:17] hashar: woop! welcome
[19:28:23] <^demon> approve is just deprecated in favor of review.
[19:28:26] !log puppet daemon restarted on brewster
[19:28:28] Logged the message, RobH
[19:28:40] ^demon: so that is a valid but unrelated change :-)
[19:29:04] <^demon> Well I was going to write review in my bit, but I figured it was best to be consistent ~20 lines up :)
[19:29:58] New patchset: Pyoungmeister; "incorrect syntax. believe this is the right one" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4022
[19:30:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4022
[19:30:55] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/4020
[19:30:59] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4022
[19:31:02] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4022
[19:32:13] ^demon: looks good to me, will check with Ryan next week I guess
[19:34:33] New review: Demon; "Just replying to patchset 1 to clear up any ambiguity." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4020
[19:37:02] !log configured jenkins on gallium to use smtp.pmtpa.wmnet as outgoing SMTP server
[19:37:04] Logged the message, Master
[19:37:31] that is probably the easiest for now :-]
[19:37:40] Gerrit uses the same host
[19:37:49] <^demon> Ouch, I still need to write a pre-commit hook for svn :\
[19:37:56] * ^demon thought he was done with python for the day
[19:39:20] you could be doing perl
[19:40:55] hey I *am* doing perl
[19:41:33] New patchset: Pyoungmeister; "one more try" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4023
[19:41:42] Jeff_Green: switch to python :-)))
[19:41:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4023
[19:41:51] hashar: why?
[19:42:08] <^demon> Anyone wanna write an svn pre-commit hook? :p
[19:42:12] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4023
[19:42:15] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4023
[19:43:42] ^demon: you could write it using python haha
[19:43:56] Jeff_Green: I like python syntax, easier to read compared to perl
[19:44:03] anyway I am off for the week-end
[19:44:06] already late for the party
[20:12:47] <^demon> notpeter: Still around?
[20:13:06] New review: Demon; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2990
[20:13:56] ja
[20:13:59] whatup?
[20:14:01] <^demon> Mind merging another gallium change? ^^
[20:14:10] <^demon> s/gallium/formey/
[20:16:24] I can merge that, but I think you want subscribe instead of require
[20:18:30] like, you want to have it do the pull and then the clone, yes?
[20:19:33] <^demon> No, clone then pull.
[20:19:42] <^demon> Pull requires clone.
[20:20:21] Hell, do the clone manually, it doesn't matter :p
[20:20:29] hrm, I'm not sure if this will accomplish what you want, but we shall see!
[20:20:44] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2990
[20:20:47] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2990
[20:21:31] I'm not sure how it'll behave as there will be a svn checkout of phase3 there already
[20:22:34] <^demon> Oh dur, yeah...it's gonna 'splode on that.
[20:22:40] <^demon> I believe I can fix that.
[20:23:18] yeah, i think you want to run the clone from cron, then subscribe the pull to it
[20:23:23] but, you're the boss!
[20:23:42] <^demon> Crap, permissions on this are a mess.
[20:23:51] Though, hashar changed the paths...
[20:24:24] <^demon> Ha, /var/mwdocs/phase3 isn't a svn or git repo.
[20:25:02] /home/mwdocs/phase3 should be svn
[20:25:08] <^demon> Neither is /home/mwdocs/phase3
[20:25:09] <^demon> No .svn
[20:25:18] <^demon> Oh thur it is.
[20:25:58] ohai
[20:27:31] <^demon> Oh /home/mwdocs/phase3 is a symlink to /var anyway
[20:27:47] ah
[20:39:45] PROBLEM - MySQL disk space on neon is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:46:18] hey, just fyi, I'm going to kick opendj on sanger again to test that my change made it into puppet. Mail will be rejected for a few minutes while I clear out the iptables stuff before restarting opendj. andrew_wmf_ RobH
[20:46:52] ok, thx for heads up =]
[20:47:01] eep
[20:47:06] mapblebedafk, hey there
[20:47:08] y u hate mailz maplebed ?
[20:47:13] would it be possible to do this later today?
[20:47:30] preferably like past 5?
[20:47:32] andrew_wmf_: it'll really only be a few minutes, and the queueing of mail will mean most people won't notice.
[20:47:49] I'm not going to be here past 5. (at least not without some bourbon.)
[20:48:01] ok. i just want to avoid mass panic :-)
[20:48:03] think of it as protection against the weekend.
[20:48:08] plus now he has others of us checking over his shoulder too
[20:48:17] if it's later, the rest of us will be drunk
[20:48:28] wheeee!!!!!
[20:48:29] * andrew_wmf_ starts maplebed's bourbon
[20:49:01] opendj stopped
[20:49:15] omg panic !
[20:49:25] iptables rules cleared
[20:49:33] opendj started
[20:49:56] nat rules in iptables confirmed
[20:50:14] test mail sent
[20:50:26] and received.
[20:50:31] it works!
[20:50:40] all done. andrew_wmf_ RobH LeslieCarr
[20:50:41] :)
[20:50:44] huzzah, icinga is sort of working
[20:50:51] and yay maplebed :)
[20:51:42] sigh, of course i can't see if it will break next time for another 2 hours because that's how long it takes puppet to rerun on this fucking thing
[20:52:45] If I want a new instance on Labs to test Bugzilla, is Ryan the only person who can help me?
[20:52:52] oh, wrong channel
[20:52:53] hexmode: nope.
[20:52:59] and ... wrong channel. ;)
[20:53:05] hexmode: you can make a new one yourself
[20:53:33] LeslieCarr: maplebed: ty, if I have more you'll see me in -labs ;)
[21:11:34] LeslieCarr: do you remember where and what the error messages were we saw when mail was broken?
[21:12:10] on mchenry
[21:12:16] saw a ton in the /var/log/exim
[21:12:53] ::sigh:: I was looking in /var/log/mail like an idiot.
[21:13:02] thank you.
[21:19:28] LeslieCarr: if you have a few, would you review http://wikitech.wikimedia.org/index.php?title=LDAP&diff=45214&oldid=44483 and make sure that whatever would have been helpful during yesterday's mail outage is there? andrew_wmf_ maybe you too?
[21:20:08] yes
[21:21:12] making a few little changes
[21:21:23] excellent.
[21:21:41] fwiw i'm in talks with an opendj company about upgrading ldap and making the IT instance highly available
[21:22:36] nothing concrete yet but if i get approval to move forward it will need some careful ops/it coordination
[21:28:40] hey maplebed: could you help me early next week with deploying some udp-filters? i'll have a nice debian package ready by then
[21:29:00] sure, want to drop a time on my calendar?
[21:29:18] yeah, what do you prefer?
[21:29:59] just that you find an empty time on my calendar and put it there. :)
[21:30:56] okidoki
[21:57:46] maplebed: so the new icinga instance has a brand new fucked problem - it's commenting out all of the nodes it puts in, except for a few
[21:58:39] PROBLEM - Puppet freshness on searchidx1001 is CRITICAL: Puppet has not run in the last 10 hours
[21:58:44] that is awesome.
[21:59:11] I'm thinking it's high time we just completely redo the way nagios is configured.
[22:00:03] yeah, awesome is one word
[22:00:12] ah the ones that are working are old copies
[22:00:33] so everything new is being commented
[22:00:44] i hate you nagiospuppet
[22:41:30] maplebed Jeff_Green: FYI, zack is all fixed up
[22:41:40] good to hear.
[22:41:51] anything interesting in the resolution or just a misunderstanding of some sort?
[22:41:58] s/interesting/tricky/
[22:42:43] nothing too interesting, it seems that OS X doesn't forward keys unless you add them with "ssh-add /path/to/key"
[22:42:56] after that, ssh -A worked like a charm
[22:43:31] huh.
[22:43:33] mine does.
[22:43:44] I wonder what's different...
[22:43:47] Lion?
[22:43:53] I'm running lion, yes.
[22:43:58] hmm, so are we
[22:44:21] I tested it both on mine and zack's, worked the same way both times
[22:44:40] my test :
[22:44:46] * empty my ssh agent with ssh-add -D
[22:44:54] * ssh to bast1001
[22:45:03] * a popup window appears asking me to unlock my key
[22:45:10] * I'm connected to bast with my key forwarded
[22:45:35] maplebed: can you give me a link to ganglia's copper graphs?
[22:45:55] AaronSchulz: if you start from ganglia.wikimedia.org and enter 'copper' in the search tab, you can get there on your own!
[22:45:56] :)
[22:46:13] search gave a 404 or something
[22:46:14] but here's one anyways.
[22:46:15] http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=&c=Swift+eqiad&h=copper.wikimedia.org&tab=s&vn=&mc=2&z=medium&metric_group=ALLGROUPS
[22:46:17] but I found it so nvm
[22:46:42] give a man a ganglia link and he will be amused for 15 minutes, teach a man to use ganglia and he will be amused for an hour, every few days
[22:47:15] s/amused/confused/.
[22:47:32] AaronSchulz: i can replicate the 404
[22:47:47] the ganglia search interface is a little weird.
[22:47:52] you don't type in a search term and hit return.
[22:48:03] you type in a search term, then wait and javascript pops up matching hosts
[22:48:07] oh dear god
[22:48:07] then you click on the host.
[22:48:14] yeah, tell me about it.
[22:48:27] and then submit a patch to change it here: https://github.com/ganglia/ganglia-web
[22:48:41] can we add to it the same JS that we use on fundraising landing pages to unset a form submit on an enter keypress?
[22:48:53] patchy patchy!
[22:48:57] ;)
[22:49:04] fine
[22:49:21] (not speaking javascript, I have no opinion on whether we could do that.)
[22:49:31] * pgehres adds that to his TODO list which stretches into 2013 at this point
[22:49:47] fwiw, vvuksan in #ganglia is very receptive to not just patches but also merely ideas and pointers towards how to do it.
[22:49:57] (he's the primary author of the current version of the ganglia web interface)
[22:50:04] I know the functionality works since we use it, but no idea about its use on that particular page
[22:51:31] he's EST though, so it being friday night and all he's not around.
[22:51:47] could you point me at the specific javascript file you're talking about and I'll pass it on next week?
[22:52:07] maplebed: sure, one sec
[22:54:05] maplebed: the first function in http://donate.wikimedia.org/w/index.php?title=MediaWiki:Resources/landingpage.js&action=edit
[22:54:28] or even http://donate.wikimedia.org/wiki/MediaWiki:Resources/landingpage.js since you can't edit that page
[22:56:15] ok.
[22:56:51] i believe Kaldari wrote it originally, but not 100% sure
[23:15:55] New patchset: preilly; "Add X-Carrier to response from Varnish" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4032
[23:16:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4032
[23:28:21] maplebed: it's good to not show up a few minutes after the passport services close for the day :)
[23:29:43] you should go at noon.
[23:29:50] or 10a
[23:30:54] maplebed: is no one there at noon?
[23:31:18] dunno. but I'd bet they take lunch from 11:30 to 1:30.
[23:31:20] :P