[00:00:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:07:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 8.830 seconds
[00:18:11] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 273 MB (3% inode=61%): /var/lib/ureadahead/debugfs 273 MB (3% inode=61%):
[00:20:26] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 225 MB (3% inode=57%):
[00:25:14] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 270 MB (3% inode=61%): /var/lib/ureadahead/debugfs 270 MB (3% inode=61%):
[00:26:35] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 273 MB (3% inode=61%): /var/lib/ureadahead/debugfs 273 MB (3% inode=61%):
[00:30:47] RECOVERY - Disk space on srv223 is OK: DISK OK
[00:30:56] RECOVERY - Disk space on srv221 is OK: DISK OK
[00:36:56] RECOVERY - Disk space on srv220 is OK: DISK OK
[00:42:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:49:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.825 seconds
[01:25:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:29:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.119 seconds
[01:40:50] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[01:55:25] PROBLEM - Puppet freshness on searchidx1001 is CRITICAL: Puppet has not run in the last 10 hours
[02:03:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:10:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.887 seconds
[02:13:25] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.059 second response time
[03:30:22] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 168 MB (2% inode=61%): /var/lib/ureadahead/debugfs 168 MB (2% inode=61%):
[03:32:46] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%):
[03:36:40] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 232 MB (3% inode=61%): /var/lib/ureadahead/debugfs 232 MB (3% inode=61%):
[03:38:46] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 268 MB (3% inode=61%): /var/lib/ureadahead/debugfs 268 MB (3% inode=61%):
[03:38:55] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 263 MB (3% inode=61%): /var/lib/ureadahead/debugfs 263 MB (3% inode=61%):
[03:40:52] RECOVERY - Disk space on srv219 is OK: DISK OK
[03:46:47] RECOVERY - Disk space on srv220 is OK: DISK OK
[03:46:47] RECOVERY - Disk space on srv222 is OK: DISK OK
[03:50:50] RECOVERY - Disk space on srv223 is OK: DISK OK
[04:14:41] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[04:25:38] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[04:25:38] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[04:39:18] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[04:39:18] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[04:54:36] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 3227 MB (2% inode=99%):
[04:54:36] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 3219 MB (2% inode=99%):
[05:23:42] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours
[05:56:40] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 3917 MB (3% inode=99%):
[06:15:35] !log killed vi on fenari owned by awjrichards, locking CommonSettings.php for two days
[06:15:38] Logged the message, Master
[07:21:41] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours
[07:52:57] hi, anybody checked out storage3 yet?
[07:53:04] or i will now
[07:53:25] it is down and backups on aluminium and grosley were affected
[07:54:26] I did not notice that
[07:54:44] it's not in the pages
[07:55:11] ok, checking mgmt now
[07:57:17] !log powercycling storage3
[07:57:19] Logged the message, Master
[07:59:47] RECOVERY - Host storage3 is UP: PING WARNING - Packet loss = 44%, RTA = 91.71 ms
[08:02:35] nagios-wm: there are more recoveries to report ..
[08:03:59] PROBLEM - MySQL replication status on storage3 is CRITICAL: (Return code of 255 is out of bounds)
[08:04:17] PROBLEM - MySQL slave status on storage3 is CRITICAL: CRITICAL: Lost connection to MySQL server at reading initial communication packet, system error: 111
[08:05:20] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: (Return code of 255 is out of bounds)
[08:05:38] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours
[08:06:21] !log storage3 - gmond unable to find the metric information for any mysql_* .."module has not been loaded", starting mysql, running puppet ...
[08:06:23] Logged the message, Master
[08:06:23] RECOVERY - MySQL slave status on storage3 is OK: OK:
[08:06:32] RECOVERY - Puppet freshness on storage3 is OK: puppet ran at Fri Mar 30 08:06:06 UTC 2012
[08:06:57] no obvious reason for the shutdown when glancing at syslog etc
[08:07:39] no hardware issues reported on boot
[08:15:50] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 26s
[08:16:35] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 33s
[08:18:41] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[09:09:25] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 7 MB (0% inode=61%): /var/lib/ureadahead/debugfs 7 MB (0% inode=61%):
[09:09:34] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 240 MB (3% inode=61%): /var/lib/ureadahead/debugfs 240 MB (3% inode=61%):
[09:09:43] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 77 MB (1% inode=61%): /var/lib/ureadahead/debugfs 77 MB (1% inode=61%):
[09:17:04] PROBLEM - Apache HTTP on srv224 is CRITICAL: Connection refused
[09:18:05] !log srv224,srv219,srv220, upgrade apache, dist-upgrading w/ kernel, disabling ureadahead, rebooting one by one
[09:18:07] Logged the message, Master
[09:19:55] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 232 MB (3% inode=57%):
[09:22:01] RECOVERY - Disk space on srv221 is OK: DISK OK
[09:22:01] RECOVERY - Disk space on srv219 is OK: DISK OK
[09:22:10] RECOVERY - Disk space on srv224 is OK: DISK OK
[09:28:28] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 278 MB (3% inode=61%):
[09:33:01] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%):
[09:34:22] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.099 second response time
[09:35:07] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=57%): /var/lib/ureadahead/debugfs 0 MB (0% inode=57%):
[09:35:07] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 279 MB (3% inode=61%): /var/lib/ureadahead/debugfs 279 MB (3% inode=61%):
[09:41:43] RECOVERY - Disk space on srv222 is OK: DISK OK
[09:47:43] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 177 MB (2% inode=61%): /var/lib/ureadahead/debugfs 177 MB (2% inode=61%):
[09:48:01] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%):
[09:51:55] RECOVERY - Disk space on srv219 is OK: DISK OK
[09:55:49] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 179 MB (2% inode=61%): /var/lib/ureadahead/debugfs 179 MB (2% inode=61%):
[09:56:16] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 239 MB (3% inode=57%):
[09:57:37] PROBLEM - Apache HTTP on srv219 is CRITICAL: Connection refused
[10:00:28] RECOVERY - Disk space on srv224 is OK: DISK OK
[10:01:49] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.105 second response time
[10:02:07] RECOVERY - Disk space on srv223 is OK: DISK OK
[10:06:37] RECOVERY - Disk space on srv220 is OK: DISK OK
[10:06:55] RECOVERY - Disk space on srv222 is OK: DISK OK
[10:22:22] PROBLEM - Apache HTTP on srv220 is CRITICAL: Connection refused
[10:30:46] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.057 second response time
[10:39:19] PROBLEM - Apache HTTP on srv223 is CRITICAL: Connection refused
[10:41:00] !log same for srv223
[10:41:02] Logged the message, Master
[11:04:31] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.093 second response time
[11:48:14] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 147 MB (2% inode=57%):
[11:52:26] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 135 MB (1% inode=57%):
[11:52:35] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%):
[11:56:38] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 242 MB (3% inode=57%):
[11:57:05] PROBLEM - Puppet freshness on searchidx1001 is CRITICAL: Puppet has not run in the last 10 hours
[11:58:44] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 136 MB (1% inode=57%):
[12:00:59] RECOVERY - Disk space on srv220 is OK: DISK OK
[12:02:47] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 55 MB (0% inode=57%):
[12:06:59] RECOVERY - Disk space on srv223 is OK: DISK OK
[12:06:59] RECOVERY - Disk space on srv219 is OK: DISK OK
[12:07:08] RECOVERY - Disk space on srv222 is OK: DISK OK
[12:28:04] !log deleted old kernel sources on upgraded srvs for that little extra space during peaks, suggesting to nuke /usr/share/doc if there should be more disk space warnings
[12:28:06] Logged the message, Master
[12:28:44] reinstall the boxes with a decent partitioning scheme instead :P
[12:29:00] heh, fair enough
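The !log entries above describe clearing space on the srv2xx root partitions (old kernel packages removed, /usr/share/doc suggested as the next candidate). A minimal sketch of that kind of cleanup follows; the package pattern and the choice to purge everything except the running kernel are assumptions for illustration, not what was actually run:

```bash
# Illustrative only: free a little space on / on a Lucid-era Ubuntu apache box.
df -h /                                    # confirm how tight the root partition is
du -sh /usr/share/doc /var/cache/apt       # other candidates mentioned in the log
dpkg -l 'linux-image-2.6.*' | awk '/^ii/ {print $2}'          # installed kernel images
# Purge every installed kernel image except the one currently running
# (the running kernel must stay, otherwise the next reboot fails).
dpkg -l 'linux-image-2.6.*' | awk '/^ii/ {print $2}' \
  | grep -v "$(uname -r)" | xargs -r sudo apt-get -y purge
sudo apt-get clean                         # drop cached .debs as well
```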
[12:29:09] hi mark
[12:42:14] PROBLEM - Host db1047 is DOWN: PING CRITICAL - Packet loss = 100%
[12:44:38] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:46:35] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[12:47:04] !log powercycling db1047
[12:47:06] Logged the message, Master
[12:50:47] RECOVERY - Host db1047 is UP: PING OK - Packet loss = 0%, RTA = 26.80 ms
[12:52:03] PROBLEM - NTP on db1047 is CRITICAL: NTP CRITICAL: Offset unknown
[12:53:06] PROBLEM - mysqld processes on db1047 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[12:55:12] RECOVERY - mysqld processes on db1047 is OK: PROCS OK: 1 process with command name mysqld
[12:56:15] RECOVERY - NTP on db1047 is OK: NTP OK: Offset -0.0309022665 secs
[12:56:36] !log db1047 - added system startup for /etc/init.d/mysql
[12:56:38] Logged the message, Master
[12:59:06] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 294 seconds
[12:59:24] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 230 seconds
[13:01:12] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds
[13:01:30] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds
[13:09:00] sees "
[13:09:21] a new check_to_check_nagios_paging by maplebed
[13:09:59] hi mutante
[13:10:02] back in germany yet?
[13:10:23] since like a day
[13:10:27] or 2
[13:10:57] depending if you count the jetlag :p
[13:11:41] hehe
[13:11:56] db1047 just froze there..i saw it was a core db even
[13:12:03] from the motd
[13:19:50] [14:18:14] (NEW) Massive loss of data in the list of contributions - https://bugzilla.wikimedia.org/35610 major; Wikimedia: General/Unknown; (marc.galli35)
[13:20:45] and nosy was here earlier asking about possible issues with replication db47->amaranth.toolserver, checked a few things for her, no obvious problem on our side, now it's waiting for dab for amaranth
[13:22:05] robh won't like to hear about db1047
[13:25:24] Reedy: scary bug title?
[13:25:37] mutante: indeed. Though, he seems to be correct
[13:26:52] oh, sounds like Importance should go up?
[13:26:59] he already did
[13:26:59] =/
[13:28:05] uh oh
[13:29:28] Yeah...
[13:31:11] disappeared when?
[13:31:57] so are those other revisions in there with some bad user id I wonder
[13:33:56] http://fr.wikisource.org/w/index.php?title=Wikisource:Accueil&diff=1466&oldid=1465 that one for example
[13:41:21] http://fr.wikisource.org/w/index.php?title=Auteur:Ren%C3%A9_Descartes&dir=prev&action=history
[13:42:59] apergos: he seems to be mentioning 2 different accounts
[13:43:11] I wonder if it's some form of an incomplete rename
[13:43:42] I don't see where you get that
[13:44:28] He's saying he is/was Caton, and there are contribs for https://fr.wikisource.org/wiki/Utilisateur:Marc
[13:44:38] https://fr.wikisource.org/wiki/Utilisateur:Caton?redirect=no
[13:45:02] https://fr.wikisource.org/w/index.php?limit=50&tagfilter=&title=Sp%C3%A9cial%3AContributions&contribs=user&target=Marc&namespace=&tagfilter=&year=&month=-1
[13:45:17] where does he say that he is Marc?
[13:45:51] His display name is "MArc" on the bug... He refers to the edits by Caton as being his
[13:46:05] at the bug entry
[13:47:25] I guess someone has to ask him about a rename
[13:47:29] mysql> SELECT count(*) FROM revision WHERE rev_user_text = 'Caton';
[13:47:31] I tried to look in the logs but
[13:47:34] | 3220 |
[13:47:40] nothing showed otoh I cannot read anything over there
[13:48:08] He's got some edits as user id 0
[13:48:11] ERROR 1305 (42000): FUNCTION frwikisource.distinc does not exist
[13:48:11] mysql> SELECT distinct(rev_user) FROM revision WHERE rev_user_text = 'Caton';
[13:48:11] +----------+
[13:48:11] | rev_user |
[13:48:12] +----------+
[13:48:15] | 0 |
[13:48:17] | 44 |
[13:48:19] +----------+
[13:48:29] only 44?
[13:48:37] no, user id 0 and user id 44
[13:48:40] oh
[13:48:49] how many as user 0?
[13:49:09] 3216
[13:49:12] sorry, I should either pay full attention or leave this alone I guess
[13:49:16] heh :)
[13:49:33] so wanna ask him about whether he was renamed?
[13:49:40] Yeah, will do
[13:55:46] "It's difficult to remember, because it is a former account. Sometimes for some reason, I look what I have done with this account. The last time was perhaps last year."
[13:55:57] oh gee
[13:56:06] I'm presuming renames don't usually go via user 0 and user text != an ip address?
[13:56:58] failed renames will often have uid 0 on a bunch of the revs
[13:57:02] I don't remember why that is
[13:57:24] and depending when the rename was done, the user name can appear in different parts of the log
[13:57:28] it's kind of annoying that way
[13:58:17] we'd have to hunt around to see when the marc account was created
[14:00:17] "My account was not renamed. It is an former account from old wikisource. But all was fine on fr: until now."
[14:00:43] eeewww
[14:01:30] I am liking this less and less
[14:01:58] And the other account he mentions also has edits on user id 0
[14:02:00] what about the other account mentioned in the report? maybe that user is more active?
[14:02:14] I'm gonna drop the severity, cause the data is certainly not missing
[14:02:25] 2010
[14:02:43] (show/hide) 11:44, 4 May 2009 Account Shaihulud (Talk | contribs) was created automatically
[14:03:10] Marc/Caton don't seem to be in the account creation logs
[14:03:20] there was a point where we turned them off
[14:03:23] for automatic
[14:03:27] then turned them back on
[14:03:41] but if he has edits going way back maybe there weren't good logs then
[14:04:47] https://bugzilla.wikimedia.org/show_bug.cgi?id=35610#c6
[14:06:27] select * from revision where rev_user = 0 and rev_user_text NOT LIKE an ip address...
[14:06:29] so one could run a job I guess that found all rev 0
[14:06:34] I mean user id 0 revs
[14:06:42] yeah, I see you are thinking the same thing
[14:07:31] that everything was good a year ago... well... a year is a long time
[14:07:44] who knows what happened during that time
[14:08:19] both are indexed, but separately..
[14:08:24] I also can't believe it just broke today or something
[14:08:38] seems very doubtful
[14:09:17] anyways a script could find them all, reassigning them would be a bit more of a pain but still scriptable
[14:10:03] and then we would know that as of x date things were clean
[14:10:18] right now what we know is that there might have been revisions in there like that for years and years
[14:10:38] Yeah.. It also makes me wonder how widespread this may be
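The ad-hoc queries above check a single user; the job being sketched in the conversation would sweep for every revision credited to user id 0 whose rev_user_text is not an IP address. A minimal version of that sweep, assuming the standard MediaWiki revision schema; the database host is a placeholder and the IPv4-only pattern is a simplification (IPv6 contributors would need a second pattern):

```bash
# Hypothetical sweep for the orphaned attributions discussed above: revisions
# with rev_user = 0 whose rev_user_text does not look like an IPv4 address.
# "db-placeholder" stands in for the real slave; IPv6 addresses are ignored here.
mysql -h db-placeholder frwikisource -e "
  SELECT rev_user_text, COUNT(*) AS revs
  FROM revision
  WHERE rev_user = 0
    AND rev_user_text NOT REGEXP '^[0-9]{1,3}([.][0-9]{1,3}){3}$'
  GROUP BY rev_user_text
  ORDER BY revs DESC
  LIMIT 50;"
```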
[14:11:01] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 18 MB (0% inode=57%):
[14:13:13] oh I'm sure it's across all wikis
[14:15:13] RECOVERY - Disk space on srv221 is OK: DISK OK
[14:15:49] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[14:16:34] Yeah...
[14:17:08] I seem to recall a rename-ish bug that had some revs like these that were renames and some that were "who knows what caused these"
[14:26:46] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[14:26:46] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[14:40:43] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[14:40:43] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[15:25:27] PROBLEM - Puppet freshness on search1001 is CRITICAL: Puppet has not run in the last 10 hours
[16:46:10] New patchset: Pyoungmeister; "let java do the logrotation! (working well in testing.)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3989
[16:46:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3989
[16:47:08] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3989
[16:47:11] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3989
[16:47:53] maplebed: ok if I merge your stuff?
[16:47:58] looks innocuous
[16:48:10] ack, I submitted something without merging?!
[16:48:25] ja, is just changes to the notification stuffs
[16:48:29] I'm gonna merge
[16:48:37] doh!
[16:48:40] I'm sorry.
[16:48:44] that sat there since yesterday!
[16:48:51] heh
[16:48:52] no biggie
[17:01:33] New patchset: preilly; "Add all Digi Malaysia IP addresses" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4004
[17:01:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4004
[17:03:39] New patchset: preilly; "Add all Digi Malaysia IP addresses" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4004
[17:03:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4004
[17:07:27] New patchset: preilly; "Add all Digi Malaysia IP addresses" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4004
[17:07:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4004
[17:08:25] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4004
[17:08:28] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4004
[17:18:57] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 81 MB (1% inode=61%): /var/lib/ureadahead/debugfs 81 MB (1% inode=61%):
[17:23:45] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours
[17:30:12] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:31:24] RECOVERY - Disk space on srv222 is OK: DISK OK
[17:34:15] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[18:04:00] New patchset: Bhartshorne; "telling sanger what IP to bind to for opendj ldap" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4009
[18:04:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4009
[18:06:37] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4009
[18:06:40] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4009
[18:06:49] ^^^ trying to puppetize the fix I put in for sanger and ldap yesterday
[18:11:15] New patchset: Lcarr; "Revert "TESTING seeing if relative variables make a difference to icinga"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4010
[18:11:29] New patchset: Lcarr; "Revert "TESTING icinga stuff"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4011
[18:11:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4010
[18:11:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4011
[18:13:11] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4010
[18:13:13] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4010
[18:13:18] !log killing puppet daemon on brewster, i need to hack at local configuration for cisco server stuff
[18:13:19] Logged the message, RobH
[18:13:27] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4011
[18:13:30] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4011
[18:15:57] I'd need a shell command that i can execute on any labs instance, that returns the "instance_name" / the "nice" name, and replaces something that just works as `hostname` in prod. like on labs instances the hostnames are the resource names, but not the instance names
[18:18:07] can a nova instance find out its own instance_name? as opposed to asking the controller
[18:20:36] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[18:22:12] im not sure an instance knows that without the controller
[18:22:20] i never see it mentioned in instance specific files
[18:22:28] but my experience in labs is quite limited.
[18:23:52] hmm, yeah, then either labs nagios needs to be changed to use the resource names as "host names" from Nagios' point of view ..
[18:24:32] or snmp traps for puppet freshness won't work
[18:25:18] stupid OSX questions for someone who actually uses it: What's the story with the ssh key agent these days, is there a built in agent? Does it start by default?
[18:25:38] it took quite a bit to this point, multiple other issues, and now _that_ close, after the traps arrive, submit_check_result works.. and all... just the damn hostname mismatch keeps it from working :p
[18:26:30] because this is a passive check the instances themselves need to sent their (nagios) "hostname" out by themselves
[18:26:40] send
[18:31:40] Jeff_Green: I think my agent is actually part of the osx keychain stuff.
[18:31:48] huh
[18:32:12] all i know is that when I start a new window in the terminal my SSH_AUTH_SOCK is valid.
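A compact version of the agent check being discussed in this exchange; the host names come from the log, while the key path is a placeholder (keys with non-default file names generally have to be added to the agent by hand before they are forwarded):

```bash
# Sketch of the hop-by-hop agent check discussed here.
# ~/.ssh/my_custom_key is a placeholder, not a real key name.
ssh-add ~/.ssh/my_custom_key    # load the key into the local agent
ssh-add -l                      # confirm the agent actually holds it
ssh -A aluminium 'ssh-add -l'   # same check on the first hop, with forwarding (-A)
ssh -A aluminium                # then hop onward interactively...
ssh storage3                    # ...which should authenticate via the forwarded agent
```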
[18:32:12] i know ryan discovered you need to exec independent agents
[18:32:16] or it shares it across all terminal sessions
[18:32:20] which is undesirable
[18:32:23] I'm trying to help zack get to storage3 so he can work with impression logs that are stored there. he's got an account and his public key is there
[18:32:31] ssh-add -l
[18:32:37] that's the best check
[18:32:55] does he have to manually attach keys to the keyring?
[18:33:16] it will automatically use keys with the standard key name
[18:33:25] but you have to manually add keys with custom names.
[18:33:30] ok
[18:33:35] strange
[18:33:38] (i.e. it'll add .ssh/id_rsa but not .ssh/mykey)
[18:33:49] so in theory he should be able to ssh -A aluminium, and then ssh to storage3
[18:34:04] I would ssh-add -l at each hop and make sure the agent is coming along for the ride.
[18:34:11] ok
[18:35:09] and of course he's gone offline now :-P
[18:35:51] hmph. feel free to tell him to ping me for help if you're not around when he gets back.
[18:35:59] ok. thank you
[18:38:00] apropos storage3, it crashed today and there were no daily aluminium / grosley offhost backups
[18:38:19] mutante: yeah, I saw--it's ok, not a big deal
[18:38:33] it'll catch up tonight
[18:39:17] i'm bummed that we didn't get any real notification
[18:39:30] i got email on it.
[18:39:38] but when i read it you were already chatting about it in channel
[18:39:46] i get emails on all that stuff
[18:39:51] the backup fail stuff or from nagios?
[18:39:51] New patchset: Lcarr; "removing nagios_config_dir from neon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4012
[18:40:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4012
[18:40:12] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4012
[18:40:14] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4012
[18:40:25] yes to both, will catch up and got mail
[18:40:50] the only mail I saw was from the backup jobs and your updates as you fixed it
[18:40:55] nothing from nagios
[18:41:25] maybe they were eaten by gmail somehow
[18:48:02] hello ops :-D
[18:48:15] does anyone know which smtp server I could use to send emails?
[18:48:25] the sending host would be gallium, software is jenkins
[18:49:02] localhost?
[18:49:12] ohh
[18:49:15] I should try that :D
[18:49:28] javax.mail.MessagingException: Could not connect to SMTP host: localhost, port: 25;
[18:49:32] do you need to shell out to mail or connect to localhost 25?
[18:49:36] ah. the latter.
[18:49:46] it seems
[18:49:51] I don't know how we do outgoing mail here.
[18:50:00] I will have to find out
[18:50:02] * maplebed reads http://wikitech.wikimedia.org/view/Mail
[18:50:05] there might be a puppet class just for that
[18:50:30] oh we actually have documentation \O/
[18:50:41] sadly I don't think it documents what I wanted it to.
[18:50:52] I would look at how mediawiki sends mail and use that as the template.
[18:51:09] hashar: i think we ordinarily use the server on localhost, which is configured to relay
[18:51:34] strange that it's not working
[18:51:41] * hashar starts pretending to be an ops by looking at puppet conf
[18:51:58] sure enough, exim's not running on gallium.
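A quick way to confirm that finding from the host itself — a sketch only; the package and service names assume the stock Debian/Ubuntu exim4 packaging that the puppet class wraps:

```bash
# Sketch: is an MTA installed, running, and listening on port 25 on this host?
dpkg -l 'exim4*' | grep '^ii'        # any exim4 packages installed at all?
pgrep -fl exim                       # is a daemon process running?
sudo netstat -plnt | grep ':25 '     # is anything bound to the SMTP port?
# Once the right class (e.g. exim::simple-mail-sender) is applied to the node,
# a one-shot puppet run should install and start it:
sudo puppetd --test                  # Lucid-era puppet agent one-shot run
```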
[18:52:04] the host must be missing some exim package
[18:52:23] exim::simple-mail-sender sounds good
[18:52:47] that should be installed by puppet, i wonder if it was intentionally left off gallium
[18:53:40] ordinarily hosts accept mail locally, then relay out through mchenry, and fail to lily
[18:55:05] maplebed: I got zack to run ssh-add -l and indeed no keys are being forwarded to his first hop (aluminium)
[18:55:09] New patchset: Hashar; "allow gallium to send mails (for Jenkins notifications)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4014
[18:55:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4014
[18:55:35] and lily is down
[18:55:37] he's doing ssh -A, so I'm not sure why his agent is failing to attach them
[18:55:37] change 4014 ^^^ should enable exim on gallium if one of you is willing to merge / apply it :-D
[18:55:42] Jeff_Green: my guess - ssh is specifically prohibiting forwarding for some reason (such as a mismatched host key)
[18:56:16] they should fail to sodium
[18:56:24] maplebed: huh, ok
[18:56:44] I think the ssh daemon can prohibit agent forwarding too, but I don't think we do that.
[18:57:30] yeah it doesn't for me on those hosts
[18:57:44] I just checked too. and I agree.
[18:58:08] I think command line flags are supposed to override the .ssh_config, but he could check his ~/.ssh/ssh_config file and see if he's squashing his agent.
[18:58:10] should he have to manually attach the key locally?
[18:58:27] or is it auto-attached the first time he uses it?
[18:58:28] at this point, it might be easier to just take over his laptop to poke around rather than playing telephone.
[18:58:32] (eg. using ichat)
[18:58:46] heheh yes I agree, and he's in a meeting to boot
[18:59:14] I'll ask him to track you down if that's ok
[18:59:56] "calling ssh technician to cubicle 79B"
[19:00:06] np.
[19:01:39] maplebed: pgehres just offered to help Zack, you're off the hook!
[19:02:16] alright, well, the offer's still open.
[19:03:10] thanks
[19:03:56] maplebed: Jeff_Green: can one of you review my change that installs exim on gallium : https://gerrit.wikimedia.org/r/#q,4014,n,z
[19:04:00] please ? :-D
[19:04:07] sure, sorry got distracted
[19:04:41] exim::simple-mail-sender is part of the "standard" class
[19:04:57] New review: Jgreen; "winning!" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4014
[19:04:59] but gallium is not really standard :-D so it does not have that exim class
[19:05:00] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4014
[19:06:11] Jeff_Green/ hashar why not just put include standard on gallium?
[19:06:14] i detest ubuntu release names
[19:06:18] Jeff_Green: actually, that storage3 mail wasn't from nagios@ in this case, but from root@, Subject: FAILURE ...
[19:06:20] and how they are used in apt and the like
[19:06:22] i think you now have all 4 components of the standard class on there
[19:06:26] should just use the damned version numbers.
[19:06:33] LeslieCarr: oh maybe we have all of them
[19:07:03] mutante: ah that's my backup script. I have half a mind to make that thing send email-to-sms notifications so we actually get some monitoring on fundraising db's :-P
[19:07:31] LeslieCarr: all but the evil generic::tcptweaks
[19:07:41] New patchset: preilly; "Add mobile channel to log" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4015
[19:07:45] haha
[19:07:47] that's not evil at all
[19:07:51] it's good for you
[19:07:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4015
[19:08:07] RobH: I like making up absurd names sorta in their scheme
[19:08:08] unless there's a reason gallium needs a huge syn timeout
[19:08:12] it has both TCP and "tweak" !!! That triggers an EVILNESS warning to me :-D
[19:08:16] haha
[19:08:29] i have enough trouble recalling that lucid is 10.04
[19:08:36] i dont need to think of another sarcastic name ;]
[19:08:42] LeslieCarr: amending change to use standard
[19:08:49] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4015
[19:08:52] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4015
[19:09:10] since it's already merged, just do a new change (unless you were using amending in the english sense, instead of the git sense)
[19:09:22] New patchset: Pyoungmeister; "adding some hashes for with which to generate some confs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4017
[19:09:23] Jeff_Green: I'm committing your gallium change.
[19:09:38] New patchset: Pyoungmeister; "first pass at dynamic conf generation for lucene" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4018
[19:09:43] maplebed: thx
[19:09:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4017
[19:09:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4018
[19:10:53] New patchset: Hashar; "make gallium "standard"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4019
[19:10:57] would someone who has strong puppet fu be willing to look at https://gerrit.wikimedia.org/r/#change,4018
[19:11:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4019
[19:11:21] LeslieCarr: change 4019 uses the standard class on gallium :D
[19:12:00] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4019
[19:12:02] reviewed and submitted
[19:12:03] :)
[19:12:03] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4019
[19:12:11] all your tcp are belong to us
[19:12:15] good
[19:12:37] maplebed: by committing do you mean having the change made available to puppet ? You might as well get change 4019 so
[19:12:55] maplebed: ends up adding exim class is just the same as adding the standard class
[19:13:37] hashar: after we mark a change reviewed in gerrit we manually pull it into the repo on sockpuppet. If two of us merge changes at the same time, they both show up when either of us does the diff review on sockpuppet.
[19:13:57] I was just telling jeff I was enabling his already-merged change.
[19:14:30] that's separate from actually reviewing and publishing the change in gerrit...
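A hypothetical illustration of that manual step — the repository path on sockpuppet is a placeholder, not the real layout; only the fetch/diff/merge pattern itself is what the explanation above describes:

```bash
# Sketch of the "pull it into the repo on sockpuppet" step described above.
# /srv/puppet-placeholder stands in for wherever the working copy actually lives.
ssh sockpuppet
cd /srv/puppet-placeholder
git fetch origin
git log --oneline HEAD..origin/production   # changes merged in Gerrit but not yet pulled
git diff HEAD origin/production             # the diff review mentioned above
git merge --ff-only origin/production       # apply it once it looks sane
```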
[19:14:59] ah that is all the magic ops stuff happening in the backstage
[19:15:17] * hashar feels like a special guest at a rock concert
[19:15:45] though in our case there are millions of fans asking for their wiki pages
[19:16:44] New patchset: Demon; "Create hook for l10n-autoreview in gerrit." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4020
[19:16:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4020
[19:17:28] Jeff_Green: indeed, Nagios did not get
[19:17:31] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4017
[19:17:34] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4017
[19:18:01] Jeff_Green: indeed, Nagios did not get to more than a HOST ALERT Warning..
[19:18:33] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4018
[19:18:36] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4018
[19:18:47] mutante: odd
[19:18:50] Jeff_Green: so it's not the notification stuff, it did not get the HOST ALERT done
[19:19:10] what would have sent that?
[19:19:12] wait, what happened?
[19:19:12] ^demon: you rocks :-)
[19:19:13] and this one, ouch
[19:19:17] Program End[03-30-2012 06:29:41] Bailing out due to one or more errors encountered in the configuration files. Run Nagios from the command line with the -v option to verify your config before restarting. (PID=3865)
[19:19:24] <^demon> hashar: :)
[19:19:27] New patchset: preilly; "Change project match string" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4021
[19:19:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4021
[19:19:49] ^demon: hookconfig.py is generated by puppet based on a template in operations/puppet.git see templates/gerrit/hookconfig.py.erb
[19:19:51] mutante: that's logged on spence or stafford or whatever the nagios host is these days?
[19:20:08] can I merge the gallium change?
[19:20:19] ^demon: you could probably hardcode the value in that template
[19:20:22] <^demon> hashar: Ah ok. I'll fix that and submit a second patchset.
[19:20:28] <^demon> Yeah that stuff's fine being hardcoded.
[19:20:37] Jeff_Green: in this case i see it even in web cgi's
[19:20:43] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4021
[19:20:46] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4021
[19:21:22] to whoever changed a thing about gallium in site.pp, your code is now live
[19:23:58] New patchset: Demon; "Create hook for l10n-autoreview in gerrit." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4020
[19:24:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4020
[19:25:57] New review: Hashar; "(no comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4020
[19:26:03] ^demon: added comments to patchset1 sorry
[19:26:38] notpeter: I did :-D Supposed to have exim installed but it does not seem to have installed anything
[19:27:21] well exim does run according to ps :D just not listening on port 25
[19:27:23] I am cursed!!
[19:27:39] notpeter: so exim runs at least. Thx
[19:28:06] <^demon> hashar: `gerrit approve` is just a back-compat alias for `gerrit review`
[19:28:10] <^demon> They're functionally identical.
[19:28:17] hashar: woop! welcome
[19:28:23] <^demon> approve is just deprecated in favor of review.
[19:28:26] !log puppet daemon restarted on brewster
[19:28:28] Logged the message, RobH
[19:28:40] ^demon: so that is a valid but unrelated change :-)
[19:29:04] <^demon> Well I was going to write review in my bit, but I figured it was best to be consistent ~20 lines up :)
[19:29:58] New patchset: Pyoungmeister; "incorrect syntax. believe this is the right one" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4022
[19:30:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4022
[19:30:55] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/4020
[19:30:59] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4022
[19:31:02] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4022
[19:32:13] ^demon: looks good to me, will check with Ryan next week I guess
[19:34:33] New review: Demon; "Just replying to patchset 1 to clear up any ambiguity." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4020
[19:37:02] !log configured jenkins on gallium to use smtp.pmtpa.wmnet as outgoing SMTP server
[19:37:04] Logged the message, Master
[19:37:31] that is probably the easiest for now :-]
[19:37:40] Gerrit uses the same host
[19:37:49] <^demon> Ouch, I still need to write a pre-commit hook for svn :\
[19:37:56] * ^demon thought he was done with python for the day
[19:39:20] you could be doing perl
[19:40:55] hey I *am* doing perl
[19:41:33] New patchset: Pyoungmeister; "one more try" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4023
[19:41:42] Jeff_Green: switch to python :-)))
[19:41:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4023
[19:41:51] hashar: why?
[19:42:08] <^demon> Anyone wanna write an svn pre-commit hook? :p
[19:42:12] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4023
[19:42:15] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4023
[19:43:42] ^demon: you could write it using python haha
[19:43:56] Jeff_Green: I like python syntax, easier to read compared to perl
[19:44:03] anyway I am off for the week-end
[19:44:06] already late for the party
[20:12:47] <^demon> notpeter: Still around?
[20:13:06] New review: Demon; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2990
[20:13:56] ja
[20:13:59] whatup?
[20:14:01] <^demon> Mind merging another gallium change? ^^
[20:14:10] <^demon> s/gallium/formey/
[20:16:24] I can merge that, but I think you want subscribe instead of require
[20:18:30] like, you want to have it do the pull and then the clone, yes?
[20:19:33] <^demon> No, clone then pull.
[20:19:42] <^demon> Pull requires clone.
[20:20:21] Hell, do the clone manually, it doesn't matter :p
[20:20:29] hrm, I'm not sure if this will accomplish what you want, but we shall see!
[20:20:44] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2990
[20:20:47] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2990
[20:21:31] I'm not sure how it'll behave as there will be a svn checkout of phase3 there already
[20:22:34] <^demon> Oh dur, yeah...it's gonna 'splode on that.
[20:22:40] <^demon> I believe I can fix that.
[20:23:18] yeah, i think you want to run the clone from cron, then subscribe the pull to it
[20:23:23] but, you're the boss!
[20:23:42] <^demon> Crap, permissions on this are a mess.
[20:23:51] Though, hashar changed the paths...
[20:24:24] <^demon> Ha, /var/mwdocs/phase3 isn't a svn or git repo.
[20:25:02] /home/mwdocs/phase3 should be svn
[20:25:08] <^demon> Neither is /home/mwdocs/phase3
[20:25:09] <^demon> No .svn
[20:25:18] <^demon> Oh thur it is.
[20:25:58] ohai
[20:27:31] <^demon> Oh /home/mwdocs/phase3 is a symlink to /var anyway
[20:27:47] ah
[20:39:45] PROBLEM - MySQL disk space on neon is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[20:46:18] hey, just fyi, I'm going to kick opendj on sanger again to test that my change made it into puppet. Mail will be rejected for a few minutes while I clear out the iptables stuff before restarting opendj. andrew_wmf_ RobH
[20:46:52] ok, thx for heads up =]
[20:47:01] eep
[20:47:06] mapblebedafk, hey there
[20:47:08] y u hate mailz maplebed ?
[20:47:13] would it be possible to do this later today?
[20:47:30] preferably like past 5?
[20:47:32] andrew_wmf_: it'll really only be a few minutes, and the queueing of mail will mean most people won't notice.
[20:47:49] I'm not going to be here past 5. (at least not without some bourbon.)
[20:48:01] ok. i just want to avoid mass panic :-)
[20:48:03] think of it as protection against the weekend.
[20:48:08] plus now he has others of us checking over his shoulder too
[20:48:17] if it's later, the rest of us will be drunk
[20:48:28] wheeee!!!!!
[20:48:29] * andrew_wmf_ starts maplebed's bourbon
[20:49:01] opendj stopped
[20:49:15] omg panic !
[20:49:25] iptables rules cleared
[20:49:33] opendj started
[20:49:56] nat rules in iptables confirmed
[20:50:14] test mail sent
[20:50:26] and received.
[20:50:31] it works!
[20:50:40] all done. andrew_wmf_ RobH LeslieCarr
[20:50:41] :)
[20:50:44] huzzah, icinga is sort of working
[20:50:51] and yay maplebed :)
[20:51:42] sigh, of course i can't see if it will break next time for another 2 hours because that's how long it takes puppet to rerun on this fucking thing
[20:52:45] If I want a new instance on Labs to test Bugzilla, is Ryan the only person who can help me?
[20:52:52] oh, wrong channel
[20:52:53] hexmode: nope.
[20:52:59] and ... wrong channel. ;)
[20:53:05] hexmode: you can make a new one yourself
[20:53:33] LeslieCarr: maplebed: ty, if I have more you'll see me in -labs ;)
[21:11:34] LeslieCarr: do you remember where and what the error messages were we saw when mail was broken?
[21:12:10] on mchenry
[21:12:16] saw a ton in the /var/log/exim
[21:12:53] ::sigh:: I was looking in /var/log/mail like an idiot.
[21:13:02] thank you.
[21:19:28] LeslieCarr: if you have a few, would you review http://wikitech.wikimedia.org/index.php?title=LDAP&diff=45214&oldid=44483 and make sure that whatever would have been helpful during yesterday's mail outage is there? andrew_wmf_ maybe you too?
[21:20:08] yes
[21:21:12] making a few little changes
[21:21:23] excellent.
[21:21:41] fwiw i'm in talks with an opendj company about upgrading ldap and making the IT instance highly available
[21:22:36] nothing concrete yet but if i get approval to move forward it will need some careful ops/it coordination
[21:28:40] hey maplebed: could you help me early next week with deploying some udp-filters? i'll have a nice debian package ready by then
[21:29:00] sure, want to drop a time on my calendar?
[21:29:18] yeah, what do you prefer?
[21:29:59] just that you find an empty time on my calendar and put it there. :)
[21:30:56] okidoki
[21:57:46] maplebed: so the new icinga instance has a brand new fucked problem - it's commenting out all of the nodes it puts in, except for a few
[21:58:39] PROBLEM - Puppet freshness on searchidx1001 is CRITICAL: Puppet has not run in the last 10 hours
[21:58:44] that is awesome.
[21:59:11] I'm thinking it's high time we just completely redo the way nagios is configured.
[22:00:03] yeah, awesome is one word
[22:00:12] ah the ones that are working are old copies
[22:00:33] so everything new is being commented
[22:00:44] i hate you nagiospuppet
[22:41:30] maplebed Jeff_Green: FYI, zack is all fixed up
[22:41:40] good to hear.
[22:41:51] anything interesting in the resolution or just a misunderstanding of some sort?
[22:41:58] s/interesting/tricky/
[22:42:43] nothing too interesting, it seems that OS X doesn't forward keys unless you add them with "ssh-add /path/to/key"
[22:42:56] after that, ssh -A worked like a charm
[22:43:31] huh.
[22:43:33] mine does.
[22:43:44] I wonder what's different...
[22:43:47] Lion?
[22:43:53] I'm running lion, yes.
[22:43:58] hmm, so are we
[22:44:21] I tested it both on mine and zack's, worked the same way both times
[22:44:40] my test :
[22:44:46] * empty my ssh agent with ssh-add -D
[22:44:54] * ssh to bast1001
[22:45:03] * a popup window appears asking me to unlock my key
[22:45:10] * I'm connected to bast with my key forwarded
[22:45:35] maplebed: can you give me a link to ganglia's copper graphs?
[22:45:55] AaronSchulz: if you start from ganglia.wikimedia.org and enter 'copper' in the search tab, you can get there on your own!
[22:45:56] :)
[22:46:13] search gave a 404 or something
[22:46:14] but here's one anyways.
[22:46:15] http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=&c=Swift+eqiad&h=copper.wikimedia.org&tab=s&vn=&mc=2&z=medium&metric_group=ALLGROUPS
[22:46:17] but I found it so nvm
[22:46:42] give a man a ganglia link and he will be amused for 15 minutes, teach a man to use ganglia and he will be amused for an hour, every few days
[22:47:15] s/amused/confused/.
[22:47:32] AaronSchulz: i can replicate the 404
[22:47:47] the ganglia search interface is a little weird.
[22:47:52] you don't type in a search term and hit return.
[22:48:03] you type in a search term, then wait and javascript pops up matching hosts
[22:48:07] oh dear god
[22:48:07] then you click on the host.
[22:48:14] yeah, tell me about it.
[22:48:27] and then submit a patch to change it here: https://github.com/ganglia/ganglia-web
[22:48:41] can we add to it the same JS that we use on fundraising landing pages to unset a form submit on an enter keypress?
[22:48:53] patchy patchy!
[22:48:57] ;)
[22:49:04] fine
[22:49:21] (not speaking javascript, I have no opinion on whether we could do that.)
[22:49:31] * pgehres adds that to his TODO list which stretches into 2013 at this point
[22:49:47] fwiw, vvuksan in #ganglia is very receptive to not just patches but also merely ideas and pointers towards how to do it.
[22:49:57] (he's the primary author of the current version of the ganglia web interface)
[22:50:04] I know the functionality works since we use it, but no idea about its use on that particular page
[22:51:31] he's EST though, so it being friday night and all he's not around.
[22:51:47] could you point me at the specific javascript file you're talking about and I'll pass it on next week?
[22:52:07] maplebed: sure, one sec
[22:54:05] maplebed: the first function in http://donate.wikimedia.org/w/index.php?title=MediaWiki:Resources/landingpage.js&action=edit
[22:54:28] or even http://donate.wikimedia.org/wiki/MediaWiki:Resources/landingpage.js since you can't edit that page
[22:56:15] ok.
[22:56:51] i believe Kaldari wrote it originally, but not 100% sure
[23:15:55] New patchset: preilly; "Add X-Carrier to response from Varnish" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4032
[23:16:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4032
[23:28:21] maplebed: it's good to not show up a few minutes after the passport services close for the day :)
[23:29:43] you should go at noon.
[23:29:50] or 10a
[23:30:54] maplebed: is no one there at noon?
[23:31:18] dunno. but I'd bet they take lunch from 11:30 to 1:30.
[23:31:20] :P