[00:01:20] New patchset: Asher; "class to install percona nagios monitors (just the files so far)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1723
[00:01:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1723
[00:05:20] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1723
[00:05:20] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1723
[00:40:34] LeslieCarr: Can you pop in #mediawiki for a second if you're not busy, I need your help
[00:41:10] okay
[00:48:36] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[01:03:06] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[01:04:01] New patchset: Lcarr; "Removing star from planet as it is already defined" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1724
[01:04:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1724
[01:04:21] anyone around to review ?
[01:05:41] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1724
[01:05:41] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1724
[01:06:46] RECOVERY - Puppet freshness on singer is OK: puppet ran at Wed Dec 28 01:06:40 UTC 2011
[01:07:14] yay singer is happy again
[02:13:38] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1667s
[02:21:48] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 2157s
[02:31:49] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[02:32:58] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[03:42:43] PROBLEM - mobile traffic loggers on cp1041 is CRITICAL: PROCS CRITICAL: 7 processes with args varnishncsa
[03:42:43] PROBLEM - mobile traffic loggers on cp1043 is CRITICAL: PROCS CRITICAL: 7 processes with args varnishncsa
[03:42:43] PROBLEM - mobile traffic loggers on cp1042 is CRITICAL: PROCS CRITICAL: 7 processes with args varnishncsa
[03:52:33] RECOVERY - mobile traffic loggers on cp1042 is OK: PROCS OK: 4 processes with args varnishncsa
[04:02:13] RECOVERY - mobile traffic loggers on cp1043 is OK: PROCS OK: 3 processes with args varnishncsa
[04:11:35] RECOVERY - mobile traffic loggers on cp1041 is OK: PROCS OK: 3 processes with args varnishncsa
[04:36:55] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No
[06:59:19] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 272 MB (3% inode=60%): /var/lib/ureadahead/debugfs 272 MB (3% inode=60%):
[07:09:19] RECOVERY - Disk space on srv221 is OK: DISK OK
[07:29:18] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1687
[07:34:18] RECOVERY - Disk space on hume is OK: DISK OK
[07:35:38] New patchset: Dzahn; "additional generic check_procs with -C option & fix "mobile traffic logger" checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1687
[07:35:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1687
[07:36:16] !log live-hacked /usr/local/bin/copy_impression_logs_from_storage3.pl on hume, it was rsyncing everything into /a/static/uncompressed/2... do we need this job? there is also /usr/local/bin/offhost_backups on storage3 that seems to copy to the same dir, can whoever set this up take a look?
[07:36:26] Logged the message, Master
[07:36:26] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1687
[07:36:26] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1687
[07:38:00] morning apergos, just fixing that nagios check for mobile traffic loggers
[07:38:08] yay
[07:39:08] well in another 2 minutes I'll know if my typo fix on hume worked out
[07:39:33] it would be nice not to have to panic about space over there for a while
[07:40:23] cool
[07:40:31] guess I'm going to ask for some of those juniper credits
[07:41:03] I don't necessarily want to become a juniper network equipment expert but if they have courses that teach more about networking that would be fine
[07:42:08] yes, totally
[07:43:13] yep looks like that took care of it...
[07:44:54] nice apergos
[07:46:38] and yes, i think it does teach a lot about networking
[07:47:04] great
[07:47:06] a friend of mine did the JNCIE cert, and i remember how he was learning for it, and all general routing and switching protocols
[07:47:12] like here: http://www.juniper.net/us/en/training/certification/e_track.html#jncieent
[07:49:10] hehe, a volunteer made this: class nagios::monitor::check_wiki_user_last_edit_time
[07:49:18] monitor wiki users via Nagios? :p
[07:49:54] !change 1712
[07:49:54] https://gerrit.wikimedia.org/r/1712
[07:49:56] :-D
[07:51:27] guess I should look at those more closely later
[07:52:25] do you already know details about the juniper credits?
[07:52:39] like which classes you can use them for / how many / ...
[08:13:57] apergos: found something out about puppet / nagios / duplicate service checks ..from official docs:
[08:14:07] "You can purge Nagios resources using the resources type, but only in the default file locations. This is an architectural limitation."
[08:15:51] I knew that. hmph
[08:15:54] "By default, the statements will be added to /etc/nagios/nagios_service.cfg, but you can send them to a different file by setting their target attribute."
[08:16:05] so since we don't use the default, we can't purge..ack
[08:17:13] i thought we had something set up to work around that
[08:17:56] somebody deleted .cfg files in between it seems
[08:18:20] before: thousands of lines, after: just hundreds
[08:18:38] huh
[08:19:23] the work-around i would know about was just to fix permissions
[08:19:42] that used to break nagios, and doesn't anymore
[08:20:41] oh, yeah, and we have the purge script that is being started from the init script
[08:20:59] what about that?
[08:21:39] it uses NagiosPurgeFiles= .. and that has among other paths, also /etc/nagios/puppet_checks.d/*
[08:22:51] it is not made for the job "remove duplicate service definitions for existing hosts", just for "remove services for hosts which do not exist"
[08:22:58] bah
[08:23:38] maybe it should just delete all those files ..hmm
[08:24:28] hmm..or better, puppet should delete right before it re-creates each one
[08:25:00] (or go back to one huge nagios_service.cfg file)
[08:25:15] eww not one huge one
[08:25:31] if it's going to create a new file then it could move the old one out of the way
[08:25:53] (this way after a test run you can compare the old and new one by hand and see if it sucks or you got what you wanted)
[08:28:21] yeah, hmm, just that the file is created by the built-in "@@nagios_service" type, and not by a "file" ..checking
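(A sketch of the two Puppet built-ins quoted above, for reference; the resource title and check command are illustrative, not copied from the real manifests. Purging via the resources metatype only works against the default /etc/nagios/nagios_service.cfg, and it is exactly the target attribute on exported nagios_service resources, as used for /etc/nagios/puppet_checks.d/, that moves checks out of the purgeable location.)

    # purge unmanaged checks - only effective for the default target file
    resources { 'nagios_service':
      purge => true,
    }

    # an exported check of the kind being discussed; setting 'target'
    # routes it into puppet_checks.d and out of reach of the purge above
    @@nagios_service { "check_disk_${::hostname}":
      host_name           => $::hostname,
      service_description => "Disk space on ${::hostname}",
      check_command       => 'nrpe_check_disk_space',  # hypothetical command name
      target              => "/etc/nagios/puppet_checks.d/${::hostname}.cfg",
    }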
[08:49:38] New patchset: Dzahn; "nagios: purge resources using puppet (instead of .py script), comments on duplicate definitions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1726
[08:49:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1726
[09:06:42] New patchset: Dzahn; "nagios: purge resources using puppet (instead of .py script), try to avoid duplicate definitions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1726
[09:54:03] RECOVERY - MySQL slave status on es1004 is OK: OK:
[10:00:22] PROBLEM - mobile traffic loggers on cp1043 is CRITICAL: PROCS CRITICAL: 6 processes with args varnishncsa
[10:00:22] PROBLEM - mobile traffic loggers on cp1041 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa
[10:00:22] PROBLEM - mobile traffic loggers on cp1042 is CRITICAL: PROCS CRITICAL: 7 processes with args varnishncsa
[10:28:55] RECOVERY - mobile traffic loggers on cp1043 is OK: PROCS OK: 2 processes with args varnishncsa
[10:55:52] PROBLEM - mobile traffic loggers on cp1043 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa
[10:55:52] PROBLEM - mobile traffic loggers on cp1044 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa
[11:00:45] still weird..that's not what you get when checking manually..sigh
[11:12:42] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[12:04:16] PROBLEM - Puppet freshness on srv191 is CRITICAL: Puppet has not run in the last 10 hours
[12:47:27] PROBLEM - Auth DNS on ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[12:47:34] New patchset: Dzahn; "nagios - check procs via nrpe -checkcommands - can't use (broken) generic check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1727
[12:47:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1727
[12:48:47] New patchset: Dzahn; "nagios - check procs via nrpe -checkcommands - can't use (broken) generic check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1727
[12:49:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1727
[12:50:09] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours
[12:50:27] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1727
[12:50:27] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1727
[12:57:32] New patchset: Dzahn; "nagios proc checks - use nrpe commands, this just seemed to work..duh" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1728
[12:57:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1728
[13:02:43] New patchset: Dzahn; "add nrpe to tarin and cp104[1-4] for process checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1729
[13:02:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1729
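(For context, a sketch of the kind of process check these patchsets move to. The counts above hint at why the generic check misbehaved: matching on args (-a) can count anything with "varnishncsa" in its argument list, and a plain check_procs defined on the Nagios server likely counts processes on the Nagios server itself rather than on the cache host, which would explain results that don't match a manual check. NRPE runs the plugin on the monitored host. Command names and thresholds below are invented for illustration; the actual definitions live in the merged changes.)

    # nrpe.cfg on the monitored host (e.g. cp1041-44, tarin)
    command[check_varnishncsa]=/usr/lib/nagios/plugins/check_procs -c 1:20 -C varnishncsa

    # command definition on the Nagios server, calling out via NRPE
    define command{
        command_name    nrpe_check_varnishncsa
        command_line    /usr/lib/nagios/plugins/check_nrpe -H $HOSTADDRESS$ -c check_varnishncsa
    }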
[13:08:35] RECOVERY - Auth DNS on ns0.wikimedia.org is OK: DNS OK: 6.477 seconds response time. www.wikipedia.org returns 208.80.152.201
[13:15:10] PROBLEM - RAID on db1009 is CRITICAL: Connection refused by host
[13:16:51] PROBLEM - Disk space on mw1113 is CRITICAL: Connection refused by host
[13:17:05] PROBLEM - Disk space on srv225 is CRITICAL: Connection refused by host
[13:17:11] PROBLEM - Disk space on mw1157 is CRITICAL: Connection refused by host
[13:17:11] PROBLEM - DPKG on srv259 is CRITICAL: Connection refused by host
[13:17:32] PROBLEM - Disk space on srv259 is CRITICAL: Connection refused by host
[13:18:42] PROBLEM - RAID on es1003 is CRITICAL: Connection refused by host
[13:19:51] PROBLEM - RAID on mw1015 is CRITICAL: Connection refused by host
[13:20:32] PROBLEM - DPKG on ms1002 is CRITICAL: Connection refused by host
[13:21:44] PROBLEM - MySQL disk space on es1003 is CRITICAL: Connection refused by host
[13:21:54] PROBLEM - RAID on srv229 is CRITICAL: Connection refused by host
[13:22:23] PROBLEM - Disk space on es1003 is CRITICAL: Connection refused by host
[13:22:23] PROBLEM - DPKG on srv274 is CRITICAL: Connection refused by host
[13:22:32] PROBLEM - Disk space on aluminium is CRITICAL: Connection refused by host
[13:22:32] PROBLEM - DPKG on db1019 is CRITICAL: Connection refused by host
[13:22:52] PROBLEM - DPKG on nfs1 is CRITICAL: Connection refused by host
[13:22:52] PROBLEM - DPKG on mw1015 is CRITICAL: Connection refused by host
[13:23:02] PROBLEM - DPKG on mw1088 is CRITICAL: Connection refused by host
[13:23:22] PROBLEM - DPKG on db47 is CRITICAL: Connection refused by host
[13:23:22] PROBLEM - RAID on mw1129 is CRITICAL: Connection refused by host
[13:23:32] PROBLEM - RAID on mw1104 is CRITICAL: Connection refused by host
[13:23:32] PROBLEM - RAID on mw1157 is CRITICAL: Connection refused by host
[13:23:42] PROBLEM - Disk space on virt3 is CRITICAL: Connection refused by host
[13:23:54] PROBLEM - RAID on srv259 is CRITICAL: Connection refused by host
[13:24:02] PROBLEM - RAID on nfs1 is CRITICAL: Connection refused by host
[13:24:24] PROBLEM - RAID on srv208 is CRITICAL: Connection refused by host
[13:24:34] PROBLEM - MySQL disk space on db1019 is CRITICAL: Connection refused by host
[13:24:42] PROBLEM - RAID on mw1112 is CRITICAL: Connection refused by host
[13:24:42] PROBLEM - Disk space on db47 is CRITICAL: Connection refused by host
[13:24:42] PROBLEM - Disk space on db1019 is CRITICAL: Connection refused by host
[13:25:02] PROBLEM - DPKG on db1002 is CRITICAL: Connection refused by host
[13:25:50] PROBLEM - Disk space on mw1088 is CRITICAL: Connection refused by host
[13:28:10] PROBLEM - RAID on mw29 is CRITICAL: Connection refused by host
[13:28:10] PROBLEM - Disk space on mw1076 is CRITICAL: Connection refused by host
[13:28:33] PROBLEM - RAID on srv211 is CRITICAL: Connection refused by host
[13:28:41] PROBLEM - Disk space on db1002 is CRITICAL: Connection refused by host
[13:28:59] PROBLEM - MySQL disk space on db18 is CRITICAL: Connection refused by host
[13:28:59] PROBLEM - RAID on srv240 is CRITICAL: Connection refused by host
[13:29:09] PROBLEM - DPKG on db11 is CRITICAL: Connection refused by host
[13:29:09] PROBLEM - RAID on srv235 is CRITICAL: Connection refused by host
[13:29:18] PROBLEM - RAID on db16 is CRITICAL: Connection refused by host
[13:29:49] PROBLEM - DPKG on mw1141 is CRITICAL: Connection refused by host
[13:30:17] PROBLEM - DPKG on mw1116 is CRITICAL: Connection refused by host
[13:30:18] PROBLEM - DPKG on mw1127 is CRITICAL: Connection refused by host
[13:30:49] PROBLEM - Disk space on srv227 is CRITICAL: Connection refused by host
[13:30:59] PROBLEM - DPKG on mw1144 is CRITICAL: Connection refused by host
[13:31:09] PROBLEM - Disk space on db1022 is CRITICAL: Connection refused by host
[13:31:10] PROBLEM - DPKG on snapshot4 is CRITICAL: Connection refused by host
[13:31:20] PROBLEM - RAID on db1033 is CRITICAL: Connection refused by host
[13:31:20] PROBLEM - DPKG on db1033 is CRITICAL: Connection refused by host
[13:31:20] PROBLEM - DPKG on db1022 is CRITICAL: Connection refused by host
[13:31:20] RECOVERY - RAID on db1009 is OK: OK: State is Optimal, checked 2 logical device(s)
[13:31:30] PROBLEM - Disk space on mw1116 is CRITICAL: Connection refused by host
[13:31:59] PROBLEM - Disk space on db25 is CRITICAL: Connection refused by host
[13:31:59] PROBLEM - Disk space on mw1011 is CRITICAL: Connection refused by host
[13:32:20] PROBLEM - Disk space on db34 is CRITICAL: Connection refused by host
[13:32:23] PROBLEM - Disk space on srv273 is CRITICAL: Connection refused by host
[13:32:38] PROBLEM - Disk space on mw1158 is CRITICAL: Connection refused by host
[13:32:38] PROBLEM - Disk space on mw1144 is CRITICAL: Connection refused by host
[13:32:38] RECOVERY - Disk space on mw1113 is OK: DISK OK
[13:32:59] RECOVERY - Disk space on srv225 is OK: DISK OK
[13:33:00] RECOVERY - Disk space on mw1157 is OK: DISK OK
[13:33:00] RECOVERY - DPKG on srv259 is OK: All packages OK
[13:33:11] PROBLEM - MySQL disk space on db34 is CRITICAL: Connection refused by host
[13:33:11] PROBLEM - Disk space on srv230 is CRITICAL: Connection refused by host
[13:33:20] RECOVERY - Disk space on srv259 is OK: DISK OK
[13:33:20] PROBLEM - Disk space on mw1135 is CRITICAL: Connection refused by host
[13:33:33] PROBLEM - DPKG on db13 is CRITICAL: Connection refused by host
[13:33:42] PROBLEM - RAID on db25 is CRITICAL: Connection refused by host
[13:33:42] PROBLEM - RAID on db1035 is CRITICAL: Connection refused by host
[13:34:57] PROBLEM - RAID on db13 is CRITICAL: Connection refused by host
[13:34:57] RECOVERY - RAID on es1003 is OK: OK: State is Optimal, checked 2 logical device(s)
[13:35:16] RECOVERY - RAID on mw1015 is OK: OK: no RAID installed
[13:35:26] RECOVERY - DPKG on ms1002 is OK: All packages OK
[13:35:26] PROBLEM - DPKG on searchidx2 is CRITICAL: Connection refused by host
[13:35:26] RECOVERY - MySQL disk space on es1003 is OK: DISK OK
[13:35:35] RECOVERY - RAID on srv229 is OK: OK: no RAID installed
[13:35:45] PROBLEM - Disk space on db13 is CRITICAL: Connection refused by host
[13:35:45] PROBLEM - Disk space on es1001 is CRITICAL: Connection refused by host
[13:36:06] RECOVERY - DPKG on srv274 is OK: All packages OK
[13:36:06] RECOVERY - Disk space on es1003 is OK: DISK OK
[13:36:06] PROBLEM - MySQL disk space on es1001 is CRITICAL: Connection refused by host
[13:36:06] RECOVERY - Disk space on aluminium is OK: DISK OK
[13:36:16] PROBLEM - RAID on mw1065 is CRITICAL: Connection refused by host
[13:36:16] PROBLEM - RAID on srv283 is CRITICAL: Connection refused by host
[13:36:16] PROBLEM - RAID on db1006 is CRITICAL: Connection refused by host
[13:36:16] RECOVERY - DPKG on db1019 is OK: All packages OK
[13:36:27] PROBLEM - DPKG on mw15 is CRITICAL: Connection refused by host
[13:36:27] RECOVERY - DPKG on nfs1 is OK: All packages OK
[13:36:35] PROBLEM - RAID on srv215 is CRITICAL: Connection refused by host
[13:36:46] RECOVERY - DPKG on mw1015 is OK: All packages OK
[13:36:46] RECOVERY - DPKG on mw1088 is OK: All packages OK
[13:36:56] PROBLEM - DPKG on srv215 is CRITICAL: Connection refused by host
[13:37:06] PROBLEM - RAID on db48 is CRITICAL: Connection refused by host
[13:37:06] PROBLEM - DPKG on srv196 is CRITICAL: Connection refused by host
[13:37:06] PROBLEM - RAID on srv234 is CRITICAL: Connection refused by host
[13:37:06] RECOVERY - RAID on mw1129 is OK: OK: no RAID installed
[13:37:06] RECOVERY - RAID on mw1104 is OK: OK: no RAID installed
[13:37:06] RECOVERY - RAID on mw1157 is OK: OK: no RAID installed
[13:37:16] RECOVERY - DPKG on db47 is OK: All packages OK
[13:37:26] PROBLEM - DPKG on srv210 is CRITICAL: Connection refused by host
[13:37:26] RECOVERY - RAID on srv259 is OK: OK: no RAID installed
[13:37:36] RECOVERY - Disk space on virt3 is OK: DISK OK
[13:37:36] PROBLEM - DPKG on srv234 is CRITICAL: Connection refused by host
[13:37:36] RECOVERY - RAID on nfs1 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[13:37:56] PROBLEM - RAID on mw1156 is CRITICAL: Connection refused by host
[13:38:06] PROBLEM - DPKG on db1029 is CRITICAL: Connection refused by host
[13:38:06] PROBLEM - DPKG on mw65 is CRITICAL: Connection refused by host
[13:38:16] RECOVERY - RAID on srv208 is OK: OK: no RAID installed
[13:38:16] PROBLEM - DPKG on snapshot2 is CRITICAL: Connection refused by host
[13:38:16] RECOVERY - RAID on mw1112 is OK: OK: no RAID installed
[13:38:25] PROBLEM - RAID on snapshot2 is CRITICAL: Connection refused by host
[13:38:25] PROBLEM - Disk space on db1029 is CRITICAL: Connection refused by host
[13:38:25] RECOVERY - MySQL disk space on db1019 is OK: DISK OK
[13:38:36] RECOVERY - Disk space on db47 is OK: DISK OK
[13:38:36] PROBLEM - DPKG on bast1001 is CRITICAL: Connection refused by host
[13:38:36] RECOVERY - Disk space on db1019 is OK: DISK OK
[13:38:46] RECOVERY - DPKG on db1002 is OK: All packages OK
[13:38:56] PROBLEM - Disk space on mw1059 is CRITICAL: Connection refused by host
[13:38:56] PROBLEM - Disk space on mw1070 is CRITICAL: Connection refused by host
[13:39:06] PROBLEM - RAID on es3 is CRITICAL: Connection refused by host
[13:39:06] PROBLEM - DPKG on es3 is CRITICAL: Connection refused by host
[13:39:06] PROBLEM - RAID on db44 is CRITICAL: Connection refused by host
[13:39:06] RECOVERY - Disk space on mw1088 is OK: DISK OK
[13:39:26] RECOVERY - Disk space on mw1076 is OK: DISK OK
[13:39:36] RECOVERY - RAID on srv211 is OK: OK: no RAID installed
[13:39:36] PROBLEM - Disk space on mw65 is CRITICAL: Connection refused by host
[13:39:37] RECOVERY - RAID on mw29 is OK: OK: no RAID installed
[13:39:46] PROBLEM - Disk space on mw1050 is CRITICAL: Connection refused by host
[13:39:46] RECOVERY - Disk space on db1002 is OK: DISK OK
[13:40:05] RECOVERY - MySQL disk space on db18 is OK: DISK OK
[13:40:06] RECOVERY - RAID on srv240 is OK: OK: no RAID installed
[13:40:16] PROBLEM - Disk space on mw1032 is CRITICAL: Connection refused by host
[13:40:16] RECOVERY - RAID on srv235 is OK: OK: no RAID installed
[13:40:26] PROBLEM - Disk space on mw1067 is CRITICAL: Connection refused by host
[13:40:26] RECOVERY - DPKG on db11 is OK: All packages OK
[13:40:36] RECOVERY - RAID on db16 is OK: OK: 1 logical device(s) checked
[13:41:06] PROBLEM - DPKG on srv264 is CRITICAL: Connection refused by host
[13:41:06] RECOVERY - DPKG on mw1141 is OK: All packages OK
[13:41:16] RECOVERY - DPKG on mw1127 is OK: All packages OK
[13:41:16] RECOVERY - DPKG on mw1116 is OK: All packages OK
[13:41:36] RECOVERY - Disk space on srv227 is OK: DISK OK
[13:41:36] PROBLEM - Disk space on srv264 is CRITICAL: Connection refused by host
[13:41:36] PROBLEM - RAID on srv228 is CRITICAL: Connection refused by host
[13:41:36] PROBLEM - DPKG on mw1153 is CRITICAL: Connection refused by host
[13:41:46] RECOVERY - DPKG on mw1144 is OK: All packages OK
[13:41:46] PROBLEM - RAID on srv271 is CRITICAL: Connection refused by host
[13:41:56] RECOVERY - DPKG on snapshot4 is OK: All packages OK
[13:41:56] RECOVERY - Disk space on db1022 is OK: DISK OK
[13:42:06] RECOVERY - DPKG on db1022 is OK: All packages OK
[13:42:06] they will all recover now.. was overload on nagios host
[13:42:16] PROBLEM - DPKG on grosley is CRITICAL: Connection refused by host
[13:42:16] RECOVERY - DPKG on db1033 is OK: All packages OK
[13:42:16] RECOVERY - Disk space on mw1116 is OK: DISK OK
[13:42:16] RECOVERY - RAID on db1033 is OK: OK: State is Optimal, checked 2 logical device(s)
[13:42:26] PROBLEM - MySQL disk space on db1020 is CRITICAL: Connection refused by host
[13:42:26] PROBLEM - Disk space on db1017 is CRITICAL: Connection refused by host
[13:42:36] PROBLEM - Disk space on mw1153 is CRITICAL: Connection refused by host
[13:42:46] RECOVERY - Disk space on mw1011 is OK: DISK OK
[13:42:56] RECOVERY - Disk space on db25 is OK: DISK OK
[13:43:06] RECOVERY - Disk space on srv273 is OK: DISK OK
[13:43:26] RECOVERY - Disk space on db34 is OK: DISK OK
[13:43:26] RECOVERY - Disk space on mw1158 is OK: DISK OK
[13:43:36] PROBLEM - DPKG on db1005 is CRITICAL: Connection refused by host
[13:43:36] RECOVERY - Disk space on mw1144 is OK: DISK OK
[13:43:46] PROBLEM - RAID on db1005 is CRITICAL: Connection refused by host
[13:43:56] PROBLEM - DPKG on srv228 is CRITICAL: Connection refused by host
[13:44:06] RECOVERY - Disk space on mw1135 is OK: DISK OK
[13:44:06] PROBLEM - RAID on es4 is CRITICAL: Connection refused by host
[13:44:16] RECOVERY - Disk space on srv230 is OK: DISK OK
[13:44:26] RECOVERY - MySQL disk space on db34 is OK: DISK OK
[13:44:26] RECOVERY - RAID on db1035 is OK: OK: State is Optimal, checked 2 logical device(s)
[13:44:26] RECOVERY - RAID on db25 is OK: OK: 1 logical device(s) checked
[13:44:29] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1728
[13:44:30] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1728
[13:44:36] RECOVERY - DPKG on db13 is OK: All packages OK
[13:44:36] RECOVERY - RAID on db13 is OK: OK: 1 logical device(s) checked
[13:44:46] PROBLEM - RAID on es2 is CRITICAL: Connection refused by host
[13:44:46] PROBLEM - Disk space on es4 is CRITICAL: Connection refused by host
[13:44:46] PROBLEM - RAID on db1041 is CRITICAL: Connection refused by host
[13:44:53] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1729
[13:44:53] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1729
[13:45:06] PROBLEM - RAID on srv207 is CRITICAL: Connection refused by host
[13:45:06] RECOVERY - DPKG on searchidx2 is OK: All packages OK
[13:45:16] PROBLEM - Disk space on es2 is CRITICAL: Connection refused by host
[13:45:26] RECOVERY - Disk space on es1001 is OK: DISK OK
[13:45:26] PROBLEM - RAID on mw7 is CRITICAL: Connection refused by host
[13:45:26] RECOVERY - Disk space on db13 is OK: DISK OK
[13:45:36] PROBLEM - DPKG on srv263 is CRITICAL: Connection refused by host
[13:45:36] PROBLEM - MySQL disk space on db1008 is CRITICAL: Connection refused by host
[13:45:36] PROBLEM - DPKG on db50 is CRITICAL: Connection refused by host
[13:45:36] PROBLEM - RAID on srv263 is CRITICAL: Connection refused by host
[13:45:36] PROBLEM - Disk space on snapshot1 is CRITICAL: Connection refused by host
[13:45:36] PROBLEM - RAID on srv272 is CRITICAL: Connection refused by host
[13:45:46] PROBLEM - Disk space on srv260 is CRITICAL: Connection refused by host
[13:45:46] RECOVERY - MySQL disk space on es1001 is OK: DISK OK
[13:45:56] PROBLEM - DPKG on srv218 is CRITICAL: Connection refused by host
[13:45:56] RECOVERY - RAID on srv283 is OK: OK: no RAID installed
[13:46:06] PROBLEM - MySQL disk space on es2 is CRITICAL: Connection refused by host
[13:46:06] PROBLEM - MySQL disk space on db1004 is CRITICAL: Connection refused by host
[13:46:06] RECOVERY - DPKG on mw15 is OK: All packages OK
[13:46:06] RECOVERY - RAID on mw1065 is OK: OK: no RAID installed
[13:46:06] RECOVERY - RAID on db1006 is OK: OK: State is Optimal, checked 2 logical device(s)
[13:46:16] PROBLEM - DPKG on db1004 is CRITICAL: Connection refused by host
[13:46:26] PROBLEM - MySQL disk space on es4 is CRITICAL: Connection refused by host
[13:46:26] RECOVERY - RAID on srv215 is OK: OK: no RAID installed
[13:46:36] PROBLEM - DPKG on mw1026 is CRITICAL: Connection refused by host
[13:46:56] PROBLEM - RAID on srv289 is CRITICAL: Connection refused by host
[13:46:56] PROBLEM - RAID on db1007 is CRITICAL: Connection refused by host
[13:46:56] PROBLEM - DPKG on mw1097 is CRITICAL: Connection refused by host
[13:46:56] RECOVERY - DPKG on srv196 is OK: All packages OK
[13:46:56] RECOVERY - RAID on srv234 is OK: OK: no RAID installed
[13:46:56] RECOVERY - RAID on db48 is OK: OK: State is Optimal, checked 2 logical device(s)
[13:46:56] PROBLEM - RAID on db42 is CRITICAL: Connection refused by host
[13:46:57] PROBLEM - RAID on fenari is CRITICAL: Connection refused by host
[13:46:57] RECOVERY - DPKG on srv215 is OK: All packages OK
[13:47:16] PROBLEM - DPKG on snapshot1 is CRITICAL: Connection refused by host
[13:47:16] PROBLEM - DPKG on srv289 is CRITICAL: Connection refused by host
[13:47:16] RECOVERY - DPKG on srv234 is OK: All packages OK
[13:47:26] PROBLEM - MySQL disk space on db42 is CRITICAL: Connection refused by host
[13:47:26] PROBLEM - Disk space on srv218 is CRITICAL: Connection refused by host
[13:47:26] RECOVERY - DPKG on srv210 is OK: All packages OK
[13:47:36] PROBLEM - DPKG on mw1012 is CRITICAL: Connection refused by host
[13:47:36] PROBLEM - Disk space on db50 is CRITICAL: Connection refused by host
[13:47:46] RECOVERY - DPKG on mw65 is OK: All packages OK
[13:47:56] PROBLEM - RAID on srv280 is CRITICAL: Connection refused by host
[13:47:56] RECOVERY - DPKG on snapshot2 is OK: All packages OK
[13:47:56] RECOVERY - RAID on mw1156 is OK: OK: no RAID installed
[13:48:06] RECOVERY - Disk space on db1029 is OK: DISK OK
[13:48:16] PROBLEM - DPKG on db1038 is CRITICAL: Connection refused by host
[13:48:16] PROBLEM - RAID on srv247 is CRITICAL: Connection refused by host
[13:48:16] RECOVERY - DPKG on bast1001 is OK: All packages OK
[13:48:16] RECOVERY - RAID on snapshot2 is OK: OK: no RAID installed
[13:48:17] New patchset: Dzahn; "fix duplicate definition of monitor_service for mobile traffic loggers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1730
[13:48:26] PROBLEM - DPKG on mw7 is CRITICAL: Connection refused by host
[13:48:26] PROBLEM - MySQL disk space on db50 is CRITICAL: Connection refused by host
[13:48:26] PROBLEM - DPKG on db42 is CRITICAL: Connection refused by host
[13:48:26] PROBLEM - DPKG on fenari is CRITICAL: Connection refused by host
[13:48:26] PROBLEM - Disk space on mw1026 is CRITICAL: Connection refused by host
[13:48:26] PROBLEM - DPKG on srv272 is CRITICAL: Connection refused by host
[13:48:26] PROBLEM - RAID on mw30 is CRITICAL: Connection refused by host
[13:48:36] PROBLEM - RAID on db1038 is CRITICAL: Connection refused by host
[13:48:36] RECOVERY - DPKG on db1029 is OK: All packages OK
[13:48:44] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1730
[13:48:44] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1730
[13:48:46] RECOVERY - Disk space on mw1070 is OK: DISK OK
[13:48:46] RECOVERY - RAID on db44 is OK: OK: State is Optimal, checked 2 logical device(s)
[13:48:56] PROBLEM - DPKG on mw30 is CRITICAL: Connection refused by host
[13:49:16] RECOVERY - Disk space on mw65 is OK: DISK OK
[13:49:16] RECOVERY - Disk space on mw1059 is OK: DISK OK
[13:49:26] PROBLEM - DPKG on srv280 is CRITICAL: Connection refused by host
[13:49:26] RECOVERY - Disk space on mw1050 is OK: DISK OK
[13:49:26] PROBLEM - Disk space on mw7 is CRITICAL: Connection refused by host
[13:49:26] RECOVERY - DPKG on es3 is OK: All packages OK
[13:49:36] RECOVERY - RAID on es3 is OK: OK: State is Optimal, checked 2 logical device(s)
[13:50:06] RECOVERY - Disk space on mw1032 is OK: DISK OK
[13:50:16] PROBLEM - Disk space on srv289 is CRITICAL: Connection refused by host
[13:50:16] PROBLEM - RAID on locke is CRITICAL: Connection refused by host
[13:50:16] PROBLEM - RAID on snapshot1 is CRITICAL: Connection refused by host
[13:50:26] RECOVERY - Disk space on mw1067 is OK: DISK OK
[13:50:46] RECOVERY - DPKG on srv264 is OK: All packages OK
[13:51:15] RECOVERY - Disk space on srv264 is OK: DISK OK
[13:51:15] RECOVERY - RAID on srv228 is OK: OK: no RAID installed
[13:51:15] RECOVERY - DPKG on mw1153 is OK: All packages OK
[13:51:26] PROBLEM - Disk space on fenari is CRITICAL: Connection refused by host
[13:51:36] PROBLEM - DPKG on srv247 is CRITICAL: Connection refused by host
[13:51:46] RECOVERY - RAID on srv271 is OK: OK: no RAID installed
[13:51:56] RECOVERY - DPKG on grosley is OK: All packages OK
[13:52:06] RECOVERY - MySQL disk space on db1020 is OK: DISK OK
[13:52:06] RECOVERY - Disk space on db1017 is OK: DISK OK
[13:52:16] PROBLEM - Disk space on mw1012 is CRITICAL: Connection refused by host
[13:52:16] RECOVERY - Disk space on mw1153 is OK: DISK OK
[13:53:06] PROBLEM - Disk space on srv247 is CRITICAL: Connection refused by host
[13:53:16] RECOVERY - DPKG on db1005 is OK: All packages OK
[13:53:26] RECOVERY - RAID on db1005 is OK: OK: State is Optimal, checked 2 logical device(s)
[13:53:36] RECOVERY - DPKG on srv228 is OK: All packages OK
[13:53:46] RECOVERY - RAID on es4 is OK: OK: State is Optimal, checked 2 logical device(s)
[13:54:26] RECOVERY - RAID on db1041 is OK: OK: State is Optimal, checked 2 logical device(s)
[13:54:46] RECOVERY - RAID on srv207 is OK: OK: no RAID installed
[13:54:46] RECOVERY - Disk space on es4 is OK: DISK OK
[13:54:56] RECOVERY - Disk space on es2 is OK: DISK OK
[13:55:06] RECOVERY - RAID on mw7 is OK: OK: no RAID installed
[13:55:06] RECOVERY - RAID on es2 is OK: OK: State is Optimal, checked 2 logical device(s)
[13:55:16] RECOVERY - DPKG on db50 is OK: All packages OK
[13:55:16] RECOVERY - RAID on srv263 is OK: OK: no RAID installed
[13:55:16] RECOVERY - Disk space on snapshot1 is OK: DISK OK
[13:55:16] RECOVERY - RAID on srv272 is OK: OK: no RAID installed
[13:55:26] RECOVERY - Disk space on srv260 is OK: DISK OK
[13:55:36] RECOVERY - DPKG on srv218 is OK: All packages OK
[13:55:36] RECOVERY - DPKG on srv263 is OK: All packages OK
[13:55:36] RECOVERY - MySQL disk space on db1008 is OK: DISK OK
[13:55:46] RECOVERY - MySQL disk space on es2 is OK: DISK OK
[13:55:46] RECOVERY - MySQL disk space on db1004 is OK: DISK OK
[13:55:56] RECOVERY - DPKG on db1004 is OK: All packages OK
[13:56:16] RECOVERY - DPKG on mw1026 is OK: All packages OK
[13:56:36] RECOVERY - RAID on db42 is OK: OK: State is Optimal, checked 2 logical device(s)
[13:56:36] RECOVERY - DPKG on mw1097 is OK: All packages OK
[13:56:36] RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[13:56:36] RECOVERY - RAID on db1007 is OK: OK: State is Optimal, checked 2 logical device(s)
[13:56:57] RECOVERY - DPKG on srv289 is OK: All packages OK
[13:57:06] RECOVERY - Disk space on srv218 is OK: DISK OK
[13:57:06] RECOVERY - DPKG on snapshot1 is OK: All packages OK
[13:57:06] RECOVERY - RAID on srv289 is OK: OK: no RAID installed
[13:57:16] RECOVERY - Disk space on db50 is OK: DISK OK
[13:57:16] RECOVERY - MySQL disk space on db42 is OK: DISK OK
[13:57:16] RECOVERY - DPKG on mw1012 is OK: All packages OK
[13:57:16] RECOVERY - MySQL disk space on es4 is OK: DISK OK
[13:57:46] RECOVERY - RAID on srv280 is OK: OK: no RAID installed
[13:57:56] RECOVERY - RAID on srv247 is OK: OK: no RAID installed
[13:57:56] RECOVERY - DPKG on db1038 is OK: All packages OK
[13:58:06] RECOVERY - DPKG on fenari is OK: All packages OK
[13:58:06] RECOVERY - RAID on mw30 is OK: OK: no RAID installed
[13:58:06] RECOVERY - DPKG on srv272 is OK: All packages OK
[13:58:06] RECOVERY - Disk space on mw1026 is OK: DISK OK
[13:58:16] RECOVERY - DPKG on db42 is OK: All packages OK
[13:58:26] RECOVERY - MySQL disk space on db50 is OK: DISK OK
[13:58:26] RECOVERY - RAID on db1038 is OK: OK: State is Optimal, checked 2 logical device(s)
[13:58:36] RECOVERY - DPKG on mw7 is OK: All packages OK
[13:58:36] RECOVERY - DPKG on mw30 is OK: All packages OK
[13:59:06] RECOVERY - Disk space on mw7 is OK: DISK OK
[13:59:16] RECOVERY - DPKG on srv280 is OK: All packages OK
[13:59:56] RECOVERY - RAID on snapshot1 is OK: OK: no RAID installed
[13:59:56] RECOVERY - RAID on locke is OK: OK: State is Optimal, checked 8 logical device(s)
[14:00:06] RECOVERY - Disk space on srv289 is OK: DISK OK
[14:01:06] RECOVERY - Disk space on fenari is OK: DISK OK
[14:01:36] RECOVERY - DPKG on srv247 is OK: All packages OK
[14:02:46] RECOVERY - Disk space on mw1012 is OK: DISK OK
[14:03:56] RECOVERY - Disk space on srv247 is OK: DISK OK
[14:13:26] google: (sqrt(cos(x))*cos(200*x)+sqrt(abs(x))-0.7)*(4-x*x)^0.2042, from -4.5 to 4.5
[14:24:16] ha
[14:33:39] RECOVERY - mobile traffic loggers on cp1041 is OK: PROCS OK: 2 processes with command name varnishncsa
[14:34:24] it's like I was just rickrolled or something...
[14:35:10] people who send autoresponders in response to logwatch oughta have automatic email forwarding of copies of all cron email...
[14:35:18] too bad there's not as much of it as there used to be
[14:55:01] !log fixed mobile traffic logger checks - what they report is for real now. 2 procs on first server, 4 procs on the other three
[14:55:15] Logged the message, Master
[15:01:07] ah, Jeff or apergos.. if you look at https://integration.mediawiki.org and especially check the certificate, do you see a difference to https://www.mediawiki.org ?
[15:02:00] because it used to have a wrong one, that is fixed now, but there seemed to be a difference left, could not really verify (unlike the www. host)
[15:03:07] looks like you're using the same one now
[15:03:14] md5 and sha1s are identical
[15:03:19] openssl s_client -showcerts -CAfile Equifax_Secure_Certificate_Authority.cer -connect integration.mediawiki.org:443
[15:03:26] Verify return code: 21 (unable to verify the first certificate)
[15:03:35] the same with www. was: Verify return code: 0 (ok)
[15:03:46] and i didn't get why, because, yes, they seemed to be identical
[15:04:00] and using the "-chained.pem" instead did not change it either
[15:04:59] well if it's the same cert it must have to do with the hostname or the chain
[15:05:03] there's nothing else it can be
[15:05:37] yea, and after I saw one .pem with, and one without "-chained", i tried the other, and it didn't change it
[15:05:51] and then ..shrugged
[15:06:00] shrugging sounds good
[15:06:10] ff does not complain for me, both are treated as valid
[15:06:23] ok, cool
[15:06:37] i don't worry about it anymore then, thx
[15:06:42] ah wait
[15:06:44] very interesting
[15:06:57] ah?
[15:07:00] for integration.mw.org:
[15:07:16] it doesn't complain but it doesn't list in the verification chain
[15:07:26] geotrust inc
[15:07:29] which it does for the other
[15:07:37] maybe check the setup over there?
[15:07:46] i checked for RapidSSL / GeoTrust CA file to make sure..etc
[15:08:22] then used Equifax_Secure_Certificate_Authority.cer ..which works for the www. host
[15:08:31] and it seemed to be the same cert.. hmm
[15:08:42] so you are using the identical host cert
[15:09:13] but the setup for the ca cert must be wrong somehow, maybe it isn't sending that information on integration.mw.o
[15:10:47] put the ca cert in /etc/ssl/certs on the integration host already
[15:14:12] don't i have to check in /etc/nginx/sites-enabled/mediawiki on say, ssl1001
[15:14:29] I have no idea...
[15:14:46] and there it is server_name *.mediawiki.org and ONE ssl_cert and key line
[15:14:58] so that also seems like it would not make a diff
[15:21:22] apergos: any difference on integration now?
[15:21:51] nope
[15:22:21] that is when using .chained.pem instead .. like ssl1001 does it
[15:22:32] and the rest all looks the same.. shrug again..ok
[15:25:10] related to www being loadbalanced, and integration serving directly
[15:25:31] <^demon> Well integration's a misc. server :)
[15:25:36] <^demon> No need for load balancing
[15:26:15] ok, but what's missing to make the cert check perfect...hmm
[15:26:57] <^demon> Whoops, loading mixed content too.
[15:27:00] <^demon> Lemme fix that.
[15:27:03] google chrome shows it verified
[15:27:52] just used "openssl s_client -showcerts" to avoid browser comparison
[15:28:04] uh huh
[15:28:50] was never sure if i still had stuff in cache ..hrmm
[15:29:28] New patchset: Demon; "Use https so we don't complain about mixed content" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1731
[15:29:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1731
[15:30:30] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1731
[15:30:30] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1731
[15:49:42] fixed it
[15:50:08] what was it?
[15:50:29] missing SSLCACertificateFile
[15:50:33] in apache config
[15:50:49] so there was something different after all
[15:50:56] good catch
[15:51:14] and then i got error 20 instead of 21
[15:51:31] and that was due to using Equifax instead of RapidSSL_CA
[15:51:37] right
[15:54:10] New patchset: Dzahn; "integration.mw - add SSLCACertificateFile to fix cert verification" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1732
[15:54:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1732
[15:55:58] New review: Dzahn; "this makes openssl say: Verify return code: 0 (ok)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1732
[15:55:59] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1732
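(In outline, the fix that lands in change 1732: Apache on the integration host wasn't configured with the intermediate CA certificate, so clients that didn't already hold the RapidSSL/GeoTrust intermediate could not build the chain, hence the "error 21" above. The directives are standard mod_ssl; the file paths here are placeholders, not the real ones.)

    # Apache vhost on the integration host (paths illustrative)
    SSLEngine               on
    SSLCertificateFile      /etc/ssl/certs/star.mediawiki.org.pem
    SSLCertificateKeyFile   /etc/ssl/private/star.mediawiki.org.key
    SSLCACertificateFile    /etc/ssl/certs/RapidSSL_CA.pem

    # re-check from a client, as done in the channel above; with the
    # right CA file it now ends in: Verify return code: 0 (ok)
    #   openssl s_client -showcerts -CAfile RapidSSL_CA.pem -connect integration.mediawiki.org:443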
[16:05:46] !log hume:/usr/local/bin/copy_impression_logs_from_storage3.pl deprecated, t'was an emergency hack deployed when storage3 dropped a disk
[16:05:55] Logged the message, Master
[16:06:06] hey Jeff_Green
[16:06:12] hi apergos
[16:06:18] feel free to toss out of puppet/cron/etc whichever thing should be tossed
[16:06:49] re. that script? it wasn't puppetized
[16:07:00] oh, the crontab over there isn't in puppet?
[16:07:06] well . . .
[16:07:12] it is sort of puppetized
[16:07:21] ok, well the only thing I did was fix it so it didn't write into /a/blahblah/2
[16:07:33] our use of puppet+cron is inconsistent, and this was no exception
[16:07:38] yeah I saw
[16:07:59] i had subsequently deployed a more reliable script, the one you found on storage3
[16:08:07] okey dokey
[16:08:19] although I don't think it's actually doing the --delete part, that script
[16:08:41] I was just going to double-check that
[16:08:41] I deleted some stuff manually today (leaving two weeks worth on hume, as the script calls for)
[16:08:55] i'm wondering if it was a timing issue
[16:09:03] shouldn't be
[16:09:08] it was a few days worth still in there
[16:09:12] right right
[16:09:14] which were left from the last time I did that
[16:09:22] i see
[16:09:33] oh wait
[16:10:20] this would all suck much less if there were somewhere reasonable to store copies of logs
[16:10:27] so instead of the 13 through today it was like the 9th through today over there.
[16:10:31] right
[16:10:41] well now there's copies of everything over there cause I took the contents of 2
[16:10:45] which was *everything*
[16:10:49] and just moved em up a level :-D
[16:11:09] hey, without knowing which you wanted for sure... I figured... best not to make any rash moves :-D
[16:11:19] i'm totally confused
[16:11:24] :-D
[16:11:36] ok so
[16:11:44] what's *supposed* to happen is storage3 keeps everything, and hume keeps only the past 2 weeks
[16:11:49] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 187 MB (2% inode=60%): /var/lib/ureadahead/debugfs 187 MB (2% inode=60%):
[16:12:02] intended mechanism: one script on storage3 brushes stuff > 2W into ./old dir
[16:12:09] so /a/static/uncompressed had two jobs going
[16:12:18] right, the one on hume was a mistake
[16:12:21] one which copied to .../2
[16:12:29] and the other which copied to the right dir.
[16:12:39] i entirely forgot i'd done that one, it was really "oh shit, storage3 is about to croak"
[16:12:41] one had a --delete flag (the one that copied to the right dir)
[16:12:51] right, that's the 'good' one on storage3
[16:12:52] one didn't (the one copying to 2/)
[16:12:59] oh ha
[16:13:00] hahahahh
[16:13:01] but neither of them actually seemed to delete stuff :-D
[16:13:12] I saw that typo on the stderr redirect
[16:13:13] anyways so... since I moved the contents of 2/ up a level
[16:13:18] well
[16:13:26] now /a/static/uncompressed has *everything*
[16:13:26] rsync -a --delete /archive/udplogs/*gz file_mover@hume.wikimedia.org:/a/static/uncompressed/
[16:13:54] I need to reexamine how rsync handles syntax
[16:13:59] -rw-r--r-- 1 10000 wikidev 569758 2011-11-01 00:00 bannerImpressions-2011-11-01-12AM--00.log.gz
[16:14:05] right right
[16:14:06] that's the earliest thing in there and it's in there right now.
[16:14:18] i'll run these by hand again and see what it does
[16:14:25] sounds good to me
[16:16:28] New review: Dzahn; "fixed now -> https://gerrit.wikimedia.org/r/#change,1732" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1669
[16:16:41] This tells rsync to delete extraneous files from the receiving side (ones that aren't on the sending side), but only for the directories that are being synchronized. You must have asked rsync to send the whole directory (e.g. "dir" or "dir/") without using a wildcard for the directory's contents (e.g. "dir/*") since the wildcard is expanded by the shell and rsync thus gets a request to transfer individual files, not the files' parent directory
[16:16:47] what do we think about that?
[16:16:58] fail
[16:17:00] clearly
[16:17:02] :-D
[16:17:12] I think someone got bit by this a couple weeks ago too
[16:17:24] that is rsync usability fail
[16:17:41] eh, hard to blame them
[16:17:50] it's an ambiguous request on my part
[16:18:21] I guess
[16:18:31] but we know what it *ought* to do dang it :-P
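(The man page excerpt above, made concrete. The first form is the one from the crontab: the shell expands the glob before rsync starts, so rsync receives a list of individual files, has no source directory to mirror, and --delete does nothing. Sending the directory itself restores deletion; the include/exclude filters are one way to keep the transfer limited to the gzipped logs, not necessarily what was actually deployed.)

    # broken: glob is expanded by the shell, so --delete is a no-op
    rsync -a --delete /archive/udplogs/*gz file_mover@hume.wikimedia.org:/a/static/uncompressed/

    # works: send the whole directory, filtered down to the .gz logs
    rsync -a --delete --include='*.gz' --exclude='*' \
        /archive/udplogs/ file_mover@hume.wikimedia.org:/a/static/uncompressed/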
[16:21:29] RECOVERY - Disk space on srv222 is OK: DISK OK
[16:22:24] more betterer: 22% /a/static/uncompressed
[16:22:31] yay :-D
[16:23:22] which reminds me, i think storage3 has more raidfail
[16:23:27] bah
[16:23:36] * apergos headdesks
[16:23:59] ha
[16:24:15] even though it's doing what it's supposed to do, i'm a RAID disbeliever
[16:25:22] so there are a bunch of the (older) logs that are only on storage3 right now? is that correct?
[16:25:29] yup
[16:25:41] ugghhhh
[16:25:41] several years worth of impression logs
[16:25:58] wonder how that netapp is these days
[16:26:17] we need a tridge2 or something
[16:26:26] I'm a big fan of data purge
[16:26:45] distill the data and drop it
[16:26:49] I'm a big fan of "keep stuff you don't want to lose in (at least) two places"
[16:26:55] that too
[16:28:03] well now that the script actually sorta does what it's supposed to I guess I shall puppetize it
[16:28:08] heh
[16:30:42] you can tell my throwaway cron scripts by their cronspam
[16:30:51] * Jeff_Green hangs head in shame
[16:31:37] hmm there were some no space left emails when I got up but I didn't see them
[16:31:56] I saw ct's pm to me, which he left at like 4am my time
[16:32:07] re. hume?
[16:32:59] uh huh
[16:33:15] unfortunately there's still so much random cron spam I don't try to wade through it every day
[16:33:19] right
[16:33:38] my goal is to make scripts that syslog and notify only if necessary
[16:33:55] i guess in this case it was necessary, but . . .
[16:34:19] it sure was
[16:34:48] I have one that runs every week or two weeks and sends mail on success
[16:34:57] I figure that's a low enough noise level not to bug people
[16:35:02] yeah
[16:35:16] I like the paradigm where you see Success/Fail in the subject line
[16:35:36] so (assuming the script itself isn't defective) you can skim your inbox and delete everything ^success:
[16:38:44] that's a good approach
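(A minimal sketch of that Success/Fail-subject pattern as a generic cron wrapper; the script name, recipient, and mailer are placeholders, not anything deployed here. Run as: cron-wrap "job-name" command args...)

    #!/bin/bash
    # cron-wrap: run a job and mail its output with Success:/Fail: in the
    # subject, so "^Success:" mail can be skimmed and deleted as described above
    name="$1"; shift
    output=$("$@" 2>&1)
    status=$?
    if [ "$status" -eq 0 ]; then
        subject="Success: $name"
    else
        subject="Fail: $name (exit $status)"
    fi
    printf '%s\n' "$output" | mail -s "$subject" root
    exit "$status"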
[16:40:56] hmm. I think I just ran out of brainpower for reading networking crap
[16:41:03] ha
[16:43:29] apergos: mutante: i haven't finished reading backlog but there are differences between integration and www: 1 sends an extra cert in the chain and they end up using different ciphers with s_client (i guess they advertise different abilities?)
[16:43:40] diff -u <(openssl s_client -showcerts -CApath /etc/ssl/certs -connect www.mediawiki.org:443 I didn't check the ciphers
[16:44:21] we were just looking at the ca cert issue
[16:44:37] they send a different # of certs in the chain
[16:44:41] and I crashed my tv
[16:44:46] how the *&^% did I do that
[16:44:48] (for me they both show as ok)
[16:44:49] * apergos unplugs it
[16:45:13] but i was using my local ca path not the ca cert you had locally
[16:45:21] ok
[16:47:27] instead of computers getting more like tvs, tvs are getting more like computers
[16:47:29] stupid things
[16:47:58] * apergos is going to afk for a while...
[16:52:28] next up: kernel panic
[16:52:41] hateses the kernel panic
[16:53:05] already have one in the queue
[16:53:16] waiting for SM to get back to me... sooooomedaaaaay
[16:53:50] i once had a USB stick that had a guaranteed kernel panic of the host when you plugged it in iff the drivers were installed
[16:54:01] nice
[16:54:01] (air card)
[16:55:08] srv191 has a readonly filesystem
[16:55:40] just complained at me when doing sync-file
[16:55:46] * jeremyb imagines doing datavis on vendor support calls. each call could bleed into surrounding days so that it shows up even when you zoom out. and multiple calls on the same day change the color of the day. each vendor has its own color
[16:56:06] New patchset: Jgreen; "cron scripts for use on storage3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1733
[16:58:14] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1733
[16:58:15] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1733
[17:00:22] srv191 does look faily
[17:04:03] looks like a failed hdd
[17:05:12] ticketed
[17:05:45] seems like I should shut it down but I'm not positive. thoughts?
[17:07:06] what is it? 193 is test i think?
[17:07:29] (therefore 191 is not special?)
[17:07:34] * jeremyb checks on memcache
[17:08:26] yeah, 193 is test
[17:08:47] how does 193 relate to 191?
[17:08:48] ugh, i can't remember the ip numbering scheme
[17:09:14] it doesn't
[17:09:15] Jeff_Green: what's the IP?
[17:09:25] 10.0.2.191
[17:09:35] if 193 is down, we've got some issues
[17:09:48] so, it's not in rotation in mc.php
[17:10:19] and isn't in db.php so it's not ES either?
[17:10:58] Indeed
[17:12:39] $bits_appservers = [ "srv191.pmtpa.wmnet",...]
[17:13:07] it's also a ganglia aggregator
[17:13:23] that's about it from me
[17:13:36] (from https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=manifests/site.pp;hb=HEAD )
[17:14:46] so . . . if I shut it down, what happens? :-$
[17:15:08] absence of ganglia reports doesn't kill us
[17:15:10] It might need manual depooling
[17:15:21] I can't remember what's automatic and what isn't
[17:15:22] well i guess the 2 issues are: 1) can bits handle the reduced load? and 2) lvs will still be pointing at it?
[17:15:34] s/load/backends/
[17:15:44] idk where pybal fits in
[17:34:35] !log srv191 shutdown, looks like failed hdd, see RT #2193
[17:34:44] Logged the message, Master
[17:38:44] RECOVERY - mobile traffic loggers on cp1044 is OK: PROCS OK: 2 processes with command name varnishncsa
[17:39:04] PROBLEM - Host srv191 is DOWN: PING CRITICAL - Packet loss = 100%
[17:43:24] RECOVERY - mobile traffic loggers on cp1043 is OK: PROCS OK: 2 processes with command name varnishncsa
[17:45:14] RECOVERY - mobile traffic loggers on cp1042 is OK: PROCS OK: 2 processes with command name varnishncsa
[19:39:13] New patchset: Jgreen; "attempting to establish a role class and start by moving fundraising db stuff there" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1738
[19:39:22] rolls the dice . . . and . . .
[19:40:26] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1738
[19:40:27] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1738
[19:40:52] wow gerrit didn't barf. amazing.
[20:02:01] New patchset: Ryan Lane; "Adding initial live migration support" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1739
[20:02:40] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1739
[20:02:41] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1739
[20:13:00] New patchset: Ryan Lane; "Update for diablo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1740
[20:13:37] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1740
[20:13:38] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1740
[20:15:19] New patchset: Jgreen; "added role::db::fundraising::dump class and applied to storage3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1741
[20:16:15] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1741
[20:16:16] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1741
[20:16:39] New patchset: Ryan Lane; "Add forgotten ppa key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1742
[20:19:47] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1742
[20:19:48] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1742
[20:35:27] New patchset: Jgreen; "added conf file for fundraisingdb dump script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1743
[20:35:59] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1743
[20:35:59] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1743
[20:57:28] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Dec 28 20:57:26 UTC 2011
[21:21:48] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours
[22:14:14] New patchset: Ryan Lane; "Install hostname certificate on compute nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1744
[22:14:55] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1744
[22:14:56] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1744
[22:51:57] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1303s
[22:57:35] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1643s
[22:59:06] PROBLEM - MySQL replication status on db1025 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1733s
[23:03:15] !log db31 wasn't puppetized, fixed
[23:03:24] Logged the message, Master
[23:33:45] New patchset: Ryan Lane; "Enable libvirt tls remote access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1745
[23:35:56] New patchset: Ryan Lane; "Enable libvirt tls remote access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1745
[23:36:16] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1745
[23:36:16] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1745
[23:39:09] New patchset: Ryan Lane; "Add libvirt as a service so that we can notify it on config file change." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1746
[23:39:29] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1746
[23:39:29] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1746
[23:40:35] New patchset: Ryan Lane; "That's actually libvirt-bin" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1747
[23:40:58] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1747
[23:40:58] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1747
[23:42:56] New patchset: Ryan Lane; "Missed .key in the location" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1748
[23:43:11] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1748
[23:43:11] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1748
[23:46:15] New patchset: Ryan Lane; "Using the system's install of the public cert" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1749
[23:46:33] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1749
[23:46:33] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1749
[23:46:54] New patchset: Ryan Lane; "Ugh" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1750
[23:47:12] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1750
[23:47:13] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1750
[23:47:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1750
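(The pattern the 1746/1747 pair converges on, sketched generically: declare libvirt as a Puppet service - the Ubuntu package and service of that era were named libvirt-bin, per the follow-up commit - and have the config file notify it so libvirtd restarts when the TLS configuration changes. The source path is a placeholder, not the real manifest's.)

    package { 'libvirt-bin':
      ensure => present,
    }

    service { 'libvirt-bin':
      ensure  => running,
      require => Package['libvirt-bin'],
    }

    file { '/etc/libvirt/libvirtd.conf':
      source  => 'puppet:///files/libvirt/libvirtd.conf',  # placeholder path
      require => Package['libvirt-bin'],
      notify  => Service['libvirt-bin'],  # restart libvirtd on config change
    }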