[00:39:08] New patchset: Asher; "adding percona mysql checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1850
[00:39:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1850
[00:41:45] !log added ganglia1002 and ganglia1001 to dns
[00:41:47] Logged the message, Mistress of the network gear.
[00:42:23] RobH: still there ?
[00:43:44] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1850
[00:43:45] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1850
[01:33:23] New patchset: Asher; "install just the new mysql check files on eqiad dbs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1851
[01:45:42] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1851
[01:45:43] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1851
[02:18:42] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1455s
[02:24:02] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1775s
[02:28:32] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[02:33:52] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[02:53:58] RECOVERY - Puppet freshness on srv272 is OK: puppet ran at Thu Jan 12 02:53:41 UTC 2012
[04:17:51] RECOVERY - MySQL disk space on es1004 is OK: DISK OK
[04:23:31] RECOVERY - Disk space on es1004 is OK: DISK OK
[04:36:40] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No
[08:42:42] hi :) If anyone could review / merge / deploy https://gerrit.wikimedia.org/r/#change,1847 that would be great
[09:09:36] PROBLEM - Puppet freshness on db22 is CRITICAL: Puppet has not run in the last 10 hours
[10:00:32] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 440055 MB (3% inode=99%):
[10:07:02] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 408755 MB (3% inode=99%):
[10:24:00] RECOVERY - MySQL slave status on es1004 is OK: OK:
[11:25:44] PROBLEM - Puppet freshness on db1044 is CRITICAL: Puppet has not run in the last 10 hours
[11:26:44] PROBLEM - Puppet freshness on db1006 is CRITICAL: Puppet has not run in the last 10 hours
[11:27:44] PROBLEM - Puppet freshness on db1001 is CRITICAL: Puppet has not run in the last 10 hours
[11:27:44] PROBLEM - Puppet freshness on db1018 is CRITICAL: Puppet has not run in the last 10 hours
[11:36:47] PROBLEM - Puppet freshness on db1007 is CRITICAL: Puppet has not run in the last 10 hours
[11:36:47] PROBLEM - Puppet freshness on db1005 is CRITICAL: Puppet has not run in the last 10 hours
[11:36:47] PROBLEM - Puppet freshness on db1010 is CRITICAL: Puppet has not run in the last 10 hours
[11:36:47] PROBLEM - Puppet freshness on db1009 is CRITICAL: Puppet has not run in the last 10 hours
[11:36:47] PROBLEM - Puppet freshness on db1020 is CRITICAL: Puppet has not run in the last 10 hours
[11:36:47] PROBLEM - Puppet freshness on db1021 is CRITICAL: Puppet has not run in the last 10 hours
[11:36:48] PROBLEM - Puppet freshness on db1022 is CRITICAL: Puppet has not run in the last 10 hours
[11:36:48] PROBLEM - Puppet freshness on db1038 is CRITICAL: Puppet has not run in the last 10 hours
[11:36:49] PROBLEM - Puppet freshness on db1034 is CRITICAL: Puppet has not run in the last 10 hours
[11:36:49] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[11:36:50] PROBLEM - Puppet freshness on db1033 is CRITICAL: Puppet has not run in the last 10 hours
[11:36:51] PROBLEM - Puppet freshness on db1043 is CRITICAL: Puppet has not run in the last 10 hours
[11:36:51] PROBLEM - Puppet freshness on db1025 is CRITICAL: Puppet has not run in the last 10 hours
[11:36:52] PROBLEM - Puppet freshness on db1048 is CRITICAL: Puppet has not run in the last 10 hours
[11:40:27] PROBLEM - Puppet freshness on db1002 is CRITICAL: Puppet has not run in the last 10 hours
[11:40:27] PROBLEM - Puppet freshness on db1035 is CRITICAL: Puppet has not run in the last 10 hours
[11:40:28] PROBLEM - Puppet freshness on db1042 is CRITICAL: Puppet has not run in the last 10 hours
[11:42:37] PROBLEM - Puppet freshness on db1004 is CRITICAL: Puppet has not run in the last 10 hours
[11:43:37] PROBLEM - Puppet freshness on db1011 is CRITICAL: Puppet has not run in the last 10 hours
[11:44:27] PROBLEM - Puppet freshness on db1030 is CRITICAL: Puppet has not run in the last 10 hours
[11:46:37] PROBLEM - Puppet freshness on db1019 is CRITICAL: Puppet has not run in the last 10 hours
[11:46:37] PROBLEM - Puppet freshness on db1041 is CRITICAL: Puppet has not run in the last 10 hours
[11:46:37] PROBLEM - Puppet freshness on db1008 is CRITICAL: Puppet has not run in the last 10 hours
[11:47:27] PROBLEM - Puppet freshness on db1017 is CRITICAL: Puppet has not run in the last 10 hours
[11:48:27] PROBLEM - Puppet freshness on db1012 is CRITICAL: Puppet has not run in the last 10 hours
[11:48:27] PROBLEM - Puppet freshness on db1028 is CRITICAL: Puppet has not run in the last 10 hours
[11:49:27] PROBLEM - Puppet freshness on db1027 is CRITICAL: Puppet has not run in the last 10 hours
[11:49:27] PROBLEM - Puppet freshness on db1015 is CRITICAL: Puppet has not run in the last 10 hours
[11:50:37] PROBLEM - Puppet freshness on db1046 is CRITICAL: Puppet has not run in the last 10 hours
[11:51:37] PROBLEM - Puppet freshness on db1003 is CRITICAL: Puppet has not run in the last 10 hours
[11:51:37] PROBLEM - Puppet freshness on db1026 is CRITICAL: Puppet has not run in the last 10 hours
[11:51:37] PROBLEM - Puppet freshness on db1045 is CRITICAL: Puppet has not run in the last 10 hours
[11:51:37] PROBLEM - Puppet freshness on db1039 is CRITICAL: Puppet has not run in the last 10 hours
[11:51:37] PROBLEM - Puppet freshness on db1013 is CRITICAL: Puppet has not run in the last 10 hours
[11:52:27] PROBLEM - Puppet freshness on db1014 is CRITICAL: Puppet has not run in the last 10 hours
[11:52:27] PROBLEM - Puppet freshness on db1024 is CRITICAL: Puppet has not run in the last 10 hours
[11:53:37] PROBLEM - Puppet freshness on db1029 is CRITICAL: Puppet has not run in the last 10 hours
[11:54:37] PROBLEM - Puppet freshness on db1031 is CRITICAL: Puppet has not run in the last 10 hours
[11:54:38] PROBLEM - Puppet freshness on db1016 is CRITICAL: Puppet has not run in the last 10 hours
[11:55:37] PROBLEM - Puppet freshness on db1040 is CRITICAL: Puppet has not run in the last 10 hours
[12:41:22] New patchset: Mark Bergsma; "Add generic::debconf::set definition for preseeding" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1852
[12:41:59] New patchset: Mark Bergsma; "Install all Mailman languages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1853
[12:42:21] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1852
[12:42:22] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1852
[12:45:59] mark: hi :) can you possibly have a look at https://gerrit.wikimedia.org/r/#change,1847 please ?
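The Seconds_Behind_Master alerts earlier in the log come from a Nagios-style replication lag check. A minimal sketch of such a check is below; the thresholds, the `check_lag` helper, and the commented `mysql` invocation are assumptions for illustration, not the actual percona check files merged in r/1850.

```shell
#!/bin/bash
# Sketch of a Nagios-style MySQL replication lag check.
# Warning/critical thresholds in seconds (hypothetical defaults).
WARN=${1:-300}
CRIT=${2:-600}

check_lag() {
    local lag=$1
    # Seconds_Behind_Master is NULL when the slave threads are stopped.
    if [ "$lag" = "NULL" ]; then
        echo "CRITICAL - replication not running"
        return 2
    elif [ "$lag" -ge "$CRIT" ]; then
        echo "CRITICAL - Seconds_Behind_Master : ${lag}s"
        return 2
    elif [ "$lag" -ge "$WARN" ]; then
        echo "WARNING - Seconds_Behind_Master : ${lag}s"
        return 1
    fi
    echo "OK - Seconds_Behind_Master : ${lag}s"
    return 0
}

# In real use the value would come from the server, roughly:
# lag=$(mysql -e 'SHOW SLAVE STATUS\G' | awk '/Seconds_Behind_Master/ {print $2}')
# check_lag "$lag"; exit $?
```

Exit codes 0/1/2 map to Nagios OK/WARNING/CRITICAL, which is what produces the PROBLEM/RECOVERY lines above.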
[12:46:17] mark: would let us administer the Postgres database on gallium
[12:51:58] New patchset: Mark Bergsma; "Install all Mailman languages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1853
[12:52:27] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1853
[12:52:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1853
[12:53:35] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1847
[12:53:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1847
[12:56:10] New patchset: Mark Bergsma; "Escape value var" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1854
[12:56:35] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1854
[12:56:36] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1854
[13:00:48] New patchset: Mark Bergsma; "Use the noninteractive debconf frontend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1855
[13:01:26] New patchset: Mark Bergsma; "Use the noninteractive debconf frontend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1855
[13:02:13] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1855
[13:02:13] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1855
[13:10:15] New patchset: Mark Bergsma; "Correct mailman default_server_language" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1856
[13:10:30] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1856
[13:10:31] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1856
[13:12:28] New patchset: Mark Bergsma; "Add newline in comparison string" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1857
[13:12:52] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1857
[13:12:53] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1857
[13:21:26] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[13:23:50] mark: thanks mark :-)
[13:27:55] New patchset: Mark Bergsma; "Attempt to get the debconf test to work" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1858
[13:27:57] I hate bash
[13:28:16] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1858
[13:28:16] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1858
[13:41:10] New patchset: Mark Bergsma; "Simplify test again" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1859
[13:41:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1859
[13:41:31] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1859
[13:41:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1859
[13:45:52] New patchset: Mark Bergsma; "Add quotes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1860
[13:46:09] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1860
[13:46:10] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1860
[14:00:40] New patchset: Mark Bergsma; "Install a DNS recursor on sodium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1861
[14:00:59] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1861
[14:01:00] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1861
[14:07:14] New review: Mark Bergsma; "Why does it need permissions for that dir anyway? e.g. df works without that..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/1845
[14:31:35] New review: Dzahn; "on sodium:" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1845
[14:31:36] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1845
[14:35:38] mutante: so why does it need permission to that dir?
[14:38:31] mark: investigating, saw your comment right after merge
[14:40:36] it belongs to Debian-exim Debian-exim and does not allow others
[14:41:07] would you rather change that instead of ignoring tmpfs in check_disk?
[14:41:36] no, I want to know why check_disk needs permission to that dir
[14:41:39] what does it do?
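The string of debconf commits above ("Use the noninteractive debconf frontend", "Add newline in comparison string", "Add quotes", "I hate bash") points at a common pattern and a common bash trap. The sketch below illustrates both; the mailman question name is taken from the log, but the exact commands in generic::debconf::set are an assumption, not the merged code.

```shell
#!/bin/bash
# Preseeding pattern (shown as comments since it needs debconf/root):
# echo "mailman mailman/default_server_language select en" | debconf-set-selections
# DEBIAN_FRONTEND=noninteractive apt-get install -y mailman

# The "add newline / add quotes" commits hint at the classic pitfall:
# $(...) command substitution strips trailing newlines, so comparing
# against raw tool output that ends in a newline never matches unless
# you account for the newline (and quote your variables).
value="en"
printf -v expected 'en\n'      # raw output as a tool might emit it
got="$(printf 'en\n')"         # substitution strips the trailing \n

[ "$got" = "$value" ] && echo "matches after stripping"
[ "$got" = "$expected" ] || echo "literal-newline comparison fails"
```

This is why an `unless`-style idempotency test around debconf can need several rounds of newline and quoting fixes before it behaves.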
[14:41:43] i figured it does not really belong in check_disk where we expect to check the physical disk
[14:41:58] well I'd prefer it if check_disk worked for tmpfs as well
[14:42:08] and seeing as df works fine, why doesn't check_disk?
[14:42:25] ah
[14:42:27] nevermind
[14:42:30] df does not work
[14:43:17] which user does the check run as?
[14:43:21] nagios
[14:43:34] perhaps add 'nagios' to the Debian-exim group on sodium then
[14:43:36] (in mail.pp)
[14:44:08] alright
[14:46:42] PROBLEM - MySQL disk space on db16 is CRITICAL: Connection refused by host
[14:47:02] PROBLEM - Disk space on mw1092 is CRITICAL: Connection refused by host
[14:47:32] PROBLEM - RAID on srv196 is CRITICAL: Connection refused by host
[14:47:32] PROBLEM - Disk space on srv200 is CRITICAL: Connection refused by host
[14:48:02] PROBLEM - Disk space on db18 is CRITICAL: Connection refused by host
[14:48:22] PROBLEM - DPKG on es1001 is CRITICAL: Connection refused by host
[14:49:52] PROBLEM - RAID on es1001 is CRITICAL: Connection refused by host
[14:51:22] PROBLEM - mobile traffic loggers on cp1044 is CRITICAL: Connection refused by host
[14:51:42] PROBLEM - DPKG on db16 is CRITICAL: Connection refused by host
[14:54:32] PROBLEM - RAID on mw70 is CRITICAL: Connection refused by host
[14:55:22] PROBLEM - DPKG on bast1001 is CRITICAL: Connection refused by host
[14:55:22] PROBLEM - RAID on cp1043 is CRITICAL: Connection refused by host
[14:56:02] PROBLEM - RAID on db25 is CRITICAL: Connection refused by host
[14:56:02] PROBLEM - Disk space on db34 is CRITICAL: Connection refused by host
[14:56:03] PROBLEM - RAID on db46 is CRITICAL: Connection refused by host
[14:56:22] PROBLEM - MySQL disk space on db11 is CRITICAL: Connection refused by host
[14:56:22] PROBLEM - DPKG on db13 is CRITICAL: Connection refused by host
[14:56:22] PROBLEM - Disk space on mw1001 is CRITICAL: Connection refused by host
[14:56:32] PROBLEM - DPKG on db18 is CRITICAL: Connection refused by host
[14:56:52] PROBLEM - Disk space on mw46 is CRITICAL: Connection refused by host
[14:57:12] RECOVERY - Disk space on sodium is OK: DISK OK
[14:57:13] PROBLEM - Disk space on snapshot4 is CRITICAL: Connection refused by host
[14:57:13] PROBLEM - Disk space on srv195 is CRITICAL: Connection refused by host
[14:57:22] PROBLEM - Disk space on mw1080 is CRITICAL: Connection refused by host
[14:57:32] PROBLEM - DPKG on srv271 is CRITICAL: Connection refused by host
[14:57:32] PROBLEM - Disk space on bast1001 is CRITICAL: Connection refused by host
[14:57:32] PROBLEM - DPKG on cp1041 is CRITICAL: Connection refused by host
[14:57:42] PROBLEM - Disk space on db13 is CRITICAL: Connection refused by host
[14:57:52] PROBLEM - DPKG on es4 is CRITICAL: Connection refused by host
[14:58:02] PROBLEM - Disk space on es3 is CRITICAL: Connection refused by host
[14:58:22] PROBLEM - DPKG on mw1146 is CRITICAL: Connection refused by host
[14:58:22] PROBLEM - DPKG on mw1134 is CRITICAL: Connection refused by host
[14:58:32] PROBLEM - RAID on mw55 is CRITICAL: Connection refused by host
[14:58:32] PROBLEM - DPKG on mw1121 is CRITICAL: Connection refused by host
[14:58:32] PROBLEM - DPKG on mw67 is CRITICAL: Connection refused by host
[14:58:32] PROBLEM - RAID on mw69 is CRITICAL: Connection refused by host
[14:58:32] PROBLEM - DPKG on mw70 is CRITICAL: Connection refused by host
[14:58:33] PROBLEM - RAID on mw72 is CRITICAL: Connection refused by host
[14:58:52] PROBLEM - DPKG on srv196 is CRITICAL: Connection refused by host
[14:58:52] PROBLEM - RAID on srv210 is CRITICAL: Connection refused by host
[14:59:22] PROBLEM - Disk space on virt3 is CRITICAL: Connection refused by host
[14:59:22] PROBLEM - Disk space on mw1082 is CRITICAL: Connection refused by host
[14:59:32] PROBLEM - Disk space on virt2 is CRITICAL: Connection refused by host
[14:59:33] PROBLEM - MySQL disk space on db34 is CRITICAL: Connection refused by host
[14:59:33] PROBLEM - DPKG on db46 is CRITICAL: Connection refused by host
[14:59:52] PROBLEM - Disk space on db46 is CRITICAL: Connection refused by host
[15:00:02] PROBLEM - RAID on mw1001 is CRITICAL: Connection refused by host
[15:00:12] PROBLEM - RAID on mw1075 is CRITICAL: Connection refused by host
[15:00:12] PROBLEM - RAID on mw1082 is CRITICAL: Connection refused by host
[15:00:12] PROBLEM - Disk space on mw1121 is CRITICAL: Connection refused by host
[15:00:12] PROBLEM - Disk space on mw1134 is CRITICAL: Connection refused by host
[15:00:24] *sigh*
[15:00:32] PROBLEM - Disk space on mw1146 is CRITICAL: Connection refused by host
[15:00:32] PROBLEM - Disk space on mw67 is CRITICAL: Connection refused by host
[15:00:32] PROBLEM - Disk space on mw70 is CRITICAL: Connection refused by host
[15:00:42] PROBLEM - DPKG on mw55 is CRITICAL: Connection refused by host
[15:00:42] PROBLEM - RAID on snapshot4 is CRITICAL: Connection refused by host
[15:00:52] PROBLEM - Disk space on srv196 is CRITICAL: Connection refused by host
[15:00:52] PROBLEM - RAID on srv208 is CRITICAL: Connection refused by host
[15:00:52] PROBLEM - DPKG on srv210 is CRITICAL: Connection refused by host
[15:01:02] PROBLEM - Disk space on srv236 is CRITICAL: Connection refused by host
[15:01:12] PROBLEM - RAID on srv236 is CRITICAL: Connection refused by host
[15:01:22] PROBLEM - Disk space on cp1043 is CRITICAL: Connection refused by host
[15:01:22] PROBLEM - RAID on bast1001 is CRITICAL: Connection refused by host
[15:01:22] PROBLEM - RAID on virt2 is CRITICAL: Connection refused by host
[15:01:22] PROBLEM - RAID on virt3 is CRITICAL: Connection refused by host
[15:01:32] PROBLEM - RAID on srv272 is CRITICAL: Connection refused by host
[15:01:42] PROBLEM - DPKG on srv236 is CRITICAL: Connection refused by host
[15:01:42] PROBLEM - MySQL disk space on db46 is CRITICAL: Connection refused by host
[15:01:52] PROBLEM - MySQL disk space on db18 is CRITICAL: Connection refused by host
[15:01:52] PROBLEM - DPKG on db11 is CRITICAL: Connection refused by host
[15:01:52] PROBLEM - Disk space on es1 is CRITICAL: Connection refused by host
[15:01:52] PROBLEM - MySQL disk space on es4 is CRITICAL: Connection refused by host
[15:01:52] PROBLEM - Disk space on es1002 is CRITICAL: Connection refused by host
[15:01:53] PROBLEM - MySQL disk space on es2 is CRITICAL: Connection refused by host
[15:02:02] PROBLEM - RAID on db16 is CRITICAL: Connection refused by host
[15:02:02] PROBLEM - RAID on es1 is CRITICAL: Connection refused by host
[15:02:02] PROBLEM - RAID on mw1092 is CRITICAL: Connection refused by host
[15:02:02] PROBLEM - RAID on es1002 is CRITICAL: Connection refused by host
[15:02:02] PROBLEM - RAID on db11 is CRITICAL: Connection refused by host
[15:02:12] PROBLEM - MySQL disk space on db13 is CRITICAL: Connection refused by host
[15:02:22] PROBLEM - RAID on mw46 is CRITICAL: Connection refused by host
[15:02:32] PROBLEM - Disk space on mw55 is CRITICAL: Connection refused by host
[15:02:42] PROBLEM - RAID on es4 is CRITICAL: Connection refused by host
[15:02:52] PROBLEM - DPKG on srv208 is CRITICAL: Connection refused by host
[15:02:52] PROBLEM - Disk space on srv210 is CRITICAL: Connection refused by host
[15:03:02] PROBLEM - Disk space on mw1075 is CRITICAL: Connection refused by host
[15:03:12] PROBLEM - Disk space on srv272 is CRITICAL: Connection refused by host
[15:03:12] PROBLEM - Disk space on es4 is CRITICAL: Connection refused by host
[15:03:22] PROBLEM - DPKG on srv272 is CRITICAL: Connection refused by host
[15:03:22] PROBLEM - Disk space on es1001 is CRITICAL: Connection refused by host
[15:03:23] someone messing with nagios?
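The fix agreed on above (add the nagios user to the Debian-exim group so check_disk can stat the exim tmpfs mount, done via mail.pp in puppet) boils down to a couple of commands. This is a hedged sketch of the manual equivalent, not the merged puppet change; the live code is just a small membership-check helper, with the root-only steps commented out.

```shell
#!/bin/bash
# Helper: is user $1 a member of group $2?
in_group() {
    id -nG "$1" | tr ' ' '\n' | grep -qx "$2"
}

# Root-only steps, as they would run on sodium (commented out here):
# if ! in_group nagios Debian-exim; then
#     usermod -a -G Debian-exim nagios
#     # group changes only apply to new processes, so restart the daemon:
#     service nagios-nrpe-server restart
# fi
```

Granting group membership is narrower than loosening the directory's permissions, which is why it was preferred over making the Debian-exim dir world-readable.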
[15:03:33] PROBLEM - RAID on db18 is CRITICAL: Connection refused by host
[15:03:34] arr, i am checking
[15:03:44] i just made a change to an nrpe command
[15:03:52] PROBLEM - DPKG on mw1048 is CRITICAL: Connection refused by host
[15:04:03] PROBLEM - RAID on mw1104 is CRITICAL: Connection refused by host
[15:04:03] PROBLEM - RAID on mw1121 is CRITICAL: Connection refused by host
[15:04:13] PROBLEM - RAID on mw1134 is CRITICAL: Connection refused by host
[15:04:32] PROBLEM - RAID on mw30 is CRITICAL: Connection refused by host
[15:04:32] PROBLEM - RAID on mw67 is CRITICAL: Connection refused by host
[15:04:32] PROBLEM - Disk space on mw69 is CRITICAL: Connection refused by host
[15:04:42] PROBLEM - Disk space on mw72 is CRITICAL: Connection refused by host
[15:04:42] RECOVERY - RAID on mw70 is OK: OK: no RAID installed
[15:04:42] PROBLEM - Disk space on db11 is CRITICAL: Connection refused by host
[15:04:52] PROBLEM - MySQL disk space on es1 is CRITICAL: Connection refused by host
[15:05:02] PROBLEM - DPKG on srv195 is CRITICAL: Connection refused by host
[15:05:02] PROBLEM - DPKG on mw1074 is CRITICAL: Connection refused by host
[15:05:12] PROBLEM - DPKG on virt3 is CRITICAL: Connection refused by host
[15:05:22] PROBLEM - RAID on srv271 is CRITICAL: Connection refused by host
[15:05:22] PROBLEM - DPKG on mw46 is CRITICAL: Connection refused by host
[15:05:22] PROBLEM - RAID on mw1146 is CRITICAL: Connection refused by host
[15:05:24] New patchset: Dzahn; "revert change to check_disk nrpe command" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1862
[15:05:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1862
[15:05:42] PROBLEM - MySQL disk space on es1002 is CRITICAL: Connection refused by host
[15:05:42] PROBLEM - DPKG on mw1075 is CRITICAL: Connection refused by host
[15:05:52] PROBLEM - DPKG on mw1001 is CRITICAL: Connection refused by host
[15:05:52] PROBLEM - DPKG on snapshot4 is CRITICAL: Connection refused by host
[15:05:52] RECOVERY - RAID on db46 is OK: OK: State is Optimal, checked 2 logical device(s)
[15:06:01] New review: Dzahn; "this worked fine on sodium, but obviously something happened, and we want to solve this in another w..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1862
[15:06:01] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1862
[15:06:12] PROBLEM - DPKG on mw72 is CRITICAL: Connection refused by host
[15:06:12] PROBLEM - RAID on db13 is CRITICAL: Connection refused by host
[15:06:22] PROBLEM - Disk space on srv208 is CRITICAL: Connection refused by host
[15:06:22] RECOVERY - MySQL disk space on db16 is OK: DISK OK
[15:06:32] RECOVERY - DPKG on db18 is OK: All packages OK
[15:06:32] RECOVERY - Disk space on mw1092 is OK: DISK OK
[15:06:42] PROBLEM - RAID on srv195 is CRITICAL: Connection refused by host
[15:06:42] PROBLEM - mobile traffic loggers on cp1043 is CRITICAL: Connection refused by host
[15:06:42] PROBLEM - Disk space on mw1048 is CRITICAL: Connection refused by host
[15:06:59] eh, then it wasn't related. starts recovering before i merged the revert
[15:07:02] RECOVERY - Disk space on snapshot4 is OK: DISK OK
[15:07:12] RECOVERY - RAID on srv196 is OK: OK: no RAID installed
[15:07:12] PROBLEM - Disk space on mw1088 is CRITICAL: Connection refused by host
[15:07:22] RECOVERY - Disk space on srv200 is OK: DISK OK
[15:07:42] PROBLEM - Disk space on cp1044 is CRITICAL: Connection refused by host
[15:07:42] RECOVERY - Disk space on db13 is OK: DISK OK
[15:07:42] RECOVERY - Disk space on db18 is OK: DISK OK
[15:07:42] PROBLEM - DPKG on db25 is CRITICAL: Connection refused by host
[15:07:49] i think just spence being so busy that it refuses then
[15:07:52] PROBLEM - DPKG on es2 is CRITICAL: Connection refused by host
[15:07:52] RECOVERY - DPKG on es4 is OK: All packages OK
[15:07:52] PROBLEM - MySQL disk space on es3 is CRITICAL: Connection refused by host
[15:08:02] PROBLEM - DPKG on mw30 is CRITICAL: Connection refused by host
[15:08:12] RECOVERY - DPKG on es1001 is OK: All packages OK
[15:08:12] RECOVERY - DPKG on mw1146 is OK: All packages OK
[15:08:22] PROBLEM - DPKG on mw1142 is CRITICAL: Connection refused by host
[15:08:22] RECOVERY - DPKG on mw1134 is OK: All packages OK
[15:08:22] RECOVERY - RAID on mw55 is OK: OK: no RAID installed
[15:08:22] PROBLEM - DPKG on mw33 is CRITICAL: Connection refused by host
[15:08:22] RECOVERY - DPKG on mw1121 is OK: All packages OK
[15:08:32] PROBLEM - RAID on mw33 is CRITICAL: Connection refused by host
[15:08:32] PROBLEM - DPKG on mw1136 is CRITICAL: Connection refused by host
[15:08:33] RECOVERY - DPKG on mw67 is OK: All packages OK
[15:08:42] RECOVERY - DPKG on mw70 is OK: All packages OK
[15:08:52] RECOVERY - RAID on srv210 is OK: OK: no RAID installed
[15:08:52] RECOVERY - DPKG on srv196 is OK: All packages OK
[15:09:12] RECOVERY - Disk space on virt3 is OK: DISK OK
[15:09:12] RECOVERY - Disk space on mw1082 is OK: DISK OK
[15:09:20] or it just took a little while until nagios-nrpe was restarted
[15:09:22] RECOVERY - DPKG on db46 is OK: All packages OK
[15:09:32] PROBLEM - Disk space on db25 is CRITICAL: Connection refused by host
[15:09:32] RECOVERY - RAID on es1001 is OK: OK: State is Optimal, checked 2 logical device(s)
[15:09:36] (when puppet triggered that due to a config change)
[15:09:42] PROBLEM - MySQL disk space on db45 is CRITICAL: Connection refused by host
[15:09:42] PROBLEM - DPKG on srv190 is CRITICAL: Connection refused by host
[15:09:42] PROBLEM - Disk space on mw1074 is CRITICAL: Connection refused by host
[15:09:52] PROBLEM - Disk space on srv190 is CRITICAL: Connection refused by host
[15:09:52] PROBLEM - RAID on srv229 is CRITICAL: Connection refused by host
[15:09:52] PROBLEM - RAID on mw1048 is CRITICAL: Connection refused by host
[15:09:52] PROBLEM - RAID on mw1037 is CRITICAL: Connection refused by host
[15:09:52] RECOVERY - Disk space on virt2 is OK: DISK OK
[15:10:02] PROBLEM - RAID on mw1074 is CRITICAL: Connection refused by host
[15:10:02] PROBLEM - Disk space on srv276 is CRITICAL: Connection refused by host
[15:10:02] RECOVERY - RAID on mw1075 is OK: OK: no RAID installed
[15:10:02] RECOVERY - RAID on mw1082 is OK: OK: no RAID installed
[15:10:02] RECOVERY - Disk space on mw1121 is OK: DISK OK
[15:10:02] PROBLEM - Disk space on mw1136 is CRITICAL: Connection refused by host
[15:10:03] RECOVERY - Disk space on mw1134 is OK: DISK OK
[15:10:12] PROBLEM - Disk space on mw1141 is CRITICAL: Connection refused by host
[15:10:13] PROBLEM - Disk space on mw33 is CRITICAL: Connection refused by host
[15:10:13] PROBLEM - DPKG on db51 is CRITICAL: Connection refused by host
[15:10:13] PROBLEM - RAID on db51 is CRITICAL: Connection refused by host
[15:10:22] RECOVERY - Disk space on mw67 is OK: DISK OK
[15:10:22] RECOVERY - Disk space on mw70 is OK: DISK OK
[15:10:22] RECOVERY - Disk space on db46 is OK: DISK OK
[15:10:22] PROBLEM - RAID on mw1080 is CRITICAL: Connection refused by host
[15:10:32] RECOVERY - DPKG on mw55 is OK: All packages OK
[15:10:32] RECOVERY - RAID on snapshot4 is OK: OK: no RAID installed
[15:10:42] RECOVERY - Disk space on srv196 is OK: DISK OK
[15:10:42] PROBLEM - Disk space on srv207 is CRITICAL: Connection refused by host
[15:10:42] RECOVERY - RAID on srv208 is OK: OK: no RAID installed
[15:10:42] RECOVERY - DPKG on srv210 is OK: All packages OK
[15:10:42] PROBLEM - Disk space on mw1142 is CRITICAL: Connection refused by host
[15:10:52] RECOVERY - Disk space on srv236 is OK: DISK OK
[15:10:52] PROBLEM - RAID on mw1079 is CRITICAL: Connection refused by host
[15:11:02] RECOVERY - Disk space on mw1146 is OK: DISK OK
[15:11:02] RECOVERY - RAID on mw1001 is OK: OK: no RAID installed
[15:11:12] PROBLEM - Disk space on cp1041 is CRITICAL: Connection refused by host
[15:11:12] PROBLEM - mobile traffic loggers on cp1041 is CRITICAL: Connection refused by host
[15:11:13] PROBLEM - RAID on cp1044 is CRITICAL: Connection refused by host
[15:11:13] RECOVERY - RAID on bast1001 is OK: OK: no RAID installed
[15:11:13] RECOVERY - RAID on virt3 is OK: OK: State is Optimal, checked 2 logical device(s)
[15:11:13] RECOVERY - RAID on virt2 is OK: OK: State is Optimal, checked 2 logical device(s)
[15:11:22] RECOVERY - RAID on srv272 is OK: OK: no RAID installed
[15:11:22] PROBLEM - RAID on mw1088 is CRITICAL: Connection refused by host
[15:11:32] RECOVERY - DPKG on db16 is OK: All packages OK
[15:11:32] PROBLEM - RAID on db34 is CRITICAL: Connection refused by host
[15:11:32] RECOVERY - MySQL disk space on db46 is OK: DISK OK
[15:11:32] PROBLEM - Disk space on db51 is CRITICAL: Connection refused by host
[15:11:32] PROBLEM - RAID on srv267 is CRITICAL: Connection refused by host
[15:11:42] RECOVERY - Disk space on es1 is OK: DISK OK
[15:11:42] RECOVERY - MySQL disk space on es4 is OK: DISK OK
[15:11:42] PROBLEM - RAID on es2 is CRITICAL: Connection refused by host
[15:11:42] RECOVERY - Disk space on es1002 is OK: DISK OK
[15:11:52] RECOVERY - MySQL disk space on db18 is OK: DISK OK
[15:11:52] RECOVERY - RAID on es1 is OK: OK: State is Optimal, checked 2 logical device(s)
[15:11:52] RECOVERY - RAID on mw1092 is OK: OK: no RAID installed
[15:11:52] RECOVERY - RAID on es1002 is OK: OK: State is Optimal, checked 2 logical device(s)
[15:11:52] RECOVERY - RAID on db16 is OK: OK: 1 logical device(s) checked
[15:11:52] RECOVERY - RAID on db11 is OK: OK: 1 logical device(s) checked
[15:12:02] RECOVERY - MySQL disk space on db13 is OK: DISK OK
[15:12:12] PROBLEM - Disk space on mw1104 is CRITICAL: Connection refused by host
[15:12:12] PROBLEM - RAID on mw41 is CRITICAL: Connection refused by host
[15:12:12] RECOVERY - RAID on mw46 is OK: OK: no RAID installed
[15:12:22] RECOVERY - DPKG on db11 is OK: All packages OK
[15:12:22] PROBLEM - Disk space on db45 is CRITICAL: Connection refused by host
[15:12:22] RECOVERY - Disk space on mw55 is OK: DISK OK
[15:12:32] RECOVERY - RAID on srv236 is OK: OK: no RAID installed
[15:12:32] PROBLEM - DPKG on mw1104 is CRITICAL: Connection refused by host
[15:12:32] RECOVERY - RAID on es4 is OK: OK: State is Optimal, checked 2 logical device(s)
[15:12:42] PROBLEM - DPKG on cp1043 is CRITICAL: Connection refused by host
[15:12:42] RECOVERY - Disk space on srv210 is OK: DISK OK
[15:12:42] RECOVERY - DPKG on srv208 is OK: All packages OK
[15:12:52] RECOVERY - Disk space on mw1075 is OK: DISK OK
[15:13:02] RECOVERY - DPKG on srv236 is OK: All packages OK
[15:13:02] PROBLEM - Disk space on srv267 is CRITICAL: Connection refused by host
[15:13:02] RECOVERY - Disk space on srv272 is OK: DISK OK
[15:13:02] RECOVERY - Disk space on es4 is OK: DISK OK
[15:13:12] RECOVERY - DPKG on srv272 is OK: All packages OK
[15:13:12] RECOVERY - Disk space on es1001 is OK: DISK OK
[15:13:22] RECOVERY - RAID on db18 is OK: OK: 1 logical device(s) checked
[15:13:42] PROBLEM - RAID on db45 is CRITICAL: Connection refused by host
[15:13:42] PROBLEM - DPKG on mw1036 is CRITICAL: Connection refused by host
[15:13:43] PROBLEM - DPKG on mw1037 is CRITICAL: Connection refused by host
[15:13:43] RECOVERY - DPKG on mw1048 is OK: All packages OK
[15:13:52] RECOVERY - RAID on mw1104 is OK: OK: no RAID installed
[15:13:52] RECOVERY - RAID on mw1121 is OK: OK: no RAID installed
[15:14:02] RECOVERY - RAID on mw1134 is OK: OK: no RAID installed
[15:14:02] PROBLEM - Disk space on srv283 is CRITICAL: Connection refused by host
[15:14:12] PROBLEM - RAID on mw1142 is CRITICAL: Connection refused by host
[15:14:12] PROBLEM - RAID on mw1141 is CRITICAL: Connection refused by host
[15:14:12] PROBLEM - DPKG on mw41 is CRITICAL: Connection refused by host
[15:14:12] PROBLEM - RAID on srv235 is CRITICAL: Connection refused by host
[15:14:22] RECOVERY - RAID on mw30 is OK: OK: no RAID installed
[15:14:23] RECOVERY - Disk space on mw69 is OK: DISK OK
[15:14:23] RECOVERY - RAID on mw67 is OK: OK: no RAID installed
[15:14:23] PROBLEM - RAID on es3 is CRITICAL: Connection refused by host
[15:14:32] PROBLEM - DPKG on srv267 is CRITICAL: Connection refused by host
[15:14:32] RECOVERY - Disk space on db11 is OK: DISK OK
[15:14:32] PROBLEM - DPKG on fenari is CRITICAL: Connection refused by host
[15:14:42] PROBLEM - RAID on srv190 is CRITICAL: Connection refused by host
[15:14:42] RECOVERY - MySQL disk space on es1 is OK: DISK OK
[15:14:52] PROBLEM - DPKG on srv235 is CRITICAL: Connection refused by host
[15:14:52] RECOVERY - DPKG on srv195 is OK: All packages OK
[15:14:52] RECOVERY - Disk space on mw72 is OK: DISK OK
[15:14:52] RECOVERY - DPKG on mw1074 is OK: All packages OK
[15:15:02] RECOVERY - DPKG on virt3 is OK: All packages OK
[15:15:02] PROBLEM - DPKG on mw1067 is CRITICAL: Connection refused by host
[15:15:12] RECOVERY - RAID on srv271 is OK: OK: no RAID installed
[15:15:13] PROBLEM - RAID on srv276 is CRITICAL: Connection refused by host
[15:15:13] RECOVERY - DPKG on mw46 is OK: All packages OK
[15:15:13] RECOVERY - DPKG on bast1001 is OK: All packages OK
[15:15:13] RECOVERY - RAID on mw1146 is OK: OK: no RAID installed
[15:15:22] PROBLEM - MySQL disk space on db51 is CRITICAL: Connection refused by host
[15:15:22] PROBLEM - DPKG on cp1044 is CRITICAL: Connection refused by host
[15:15:22] PROBLEM - Disk space on mw1079 is CRITICAL: Connection refused by host
[15:15:22] PROBLEM - DPKG on mw1003 is CRITICAL: Connection refused by host
[15:15:32] RECOVERY - MySQL disk space on es1002 is OK: DISK OK
[15:15:32] RECOVERY - DPKG on mw1075 is OK: All packages OK
[15:15:42] RECOVERY - DPKG on snapshot4 is OK: All packages OK
[15:15:43] RECOVERY - DPKG on mw1001 is OK: All packages OK
[15:15:43] RECOVERY - Disk space on db34 is OK: DISK OK
[15:15:43] PROBLEM - DPKG on db45 is CRITICAL: Connection refused by host
[15:15:43] RECOVERY - RAID on db25 is OK: OK: 1 logical device(s) checked
[15:15:52] PROBLEM - RAID on mw1152 is CRITICAL: Connection refused by host
[15:15:52] PROBLEM - DPKG on mw1079 is CRITICAL: Connection refused by host
[15:16:02] PROBLEM - RAID on db50 is CRITICAL: Connection refused by host
[15:16:02] RECOVERY - DPKG on mw72 is OK: All packages OK
[15:16:02] RECOVERY - MySQL disk space on db11 is OK: DISK OK
[15:16:02] PROBLEM - DPKG on mw1088 is CRITICAL: Connection refused by host
[15:16:02] RECOVERY - RAID on db13 is OK: OK: 1 logical device(s) checked
[15:16:12] RECOVERY - DPKG on db13 is OK: All packages OK
[15:16:12] RECOVERY - Disk space on srv208 is OK: DISK OK
[15:16:13] PROBLEM - Disk space on mw1037 is CRITICAL: Connection refused by host
[15:16:22] PROBLEM - Disk space on mw1003 is CRITICAL: Connection refused by host
[15:16:22] PROBLEM - Disk space on mw1002 is CRITICAL: Connection refused by host
[15:16:23] RECOVERY - Disk space on mw1001 is OK: DISK OK
[15:16:32] RECOVERY - RAID on srv195 is OK: OK: no RAID installed
[15:16:32] PROBLEM - Disk space on mw1036 is CRITICAL: Connection refused by host
[15:16:32] RECOVERY - Disk space on mw1048 is OK: DISK OK
[15:16:42] RECOVERY - Disk space on mw46 is OK: DISK OK
[15:16:52] PROBLEM - Disk space on mw41 is CRITICAL: Connection refused by host
[15:17:02]
RECOVERY - Disk space on mw1080 is OK: DISK OK [15:17:02] RECOVERY - Disk space on mw1088 is OK: DISK OK [15:17:12] RECOVERY - Disk space on srv195 is OK: DISK OK [15:17:12] PROBLEM - RAID on srv207 is CRITICAL: Connection refused by host [15:17:12] PROBLEM - RAID on srv254 is CRITICAL: Connection refused by host [15:17:12] PROBLEM - Disk space on mw1050 is CRITICAL: Connection refused by host [15:17:22] RECOVERY - DPKG on srv271 is OK: All packages OK [15:17:22] PROBLEM - DPKG on srv276 is CRITICAL: Connection refused by host [15:17:22] PROBLEM - Disk space on srv285 is CRITICAL: Connection refused by host [15:17:32] PROBLEM - mobile traffic loggers on cp1042 is CRITICAL: Connection refused by host [15:17:32] RECOVERY - Disk space on bast1001 is OK: DISK OK [15:17:32] RECOVERY - DPKG on cp1041 is OK: All packages OK [15:17:32] RECOVERY - DPKG on db25 is OK: All packages OK [15:17:42] RECOVERY - MySQL disk space on es3 is OK: DISK OK [15:17:42] PROBLEM - Disk space on es1003 is CRITICAL: Connection refused by host [15:17:52] PROBLEM - Disk space on fenari is CRITICAL: Connection refused by host [15:17:52] RECOVERY - DPKG on mw30 is OK: All packages OK [15:17:52] PROBLEM - DPKG on es1003 is CRITICAL: Connection refused by host [15:17:52] PROBLEM - RAID on es1003 is CRITICAL: Connection refused by host [15:18:02] PROBLEM - DPKG on mw1152 is CRITICAL: Connection refused by host [15:18:12] PROBLEM - DPKG on mw1111 is CRITICAL: Connection refused by host [15:18:12] RECOVERY - DPKG on mw1142 is OK: All packages OK [15:18:12] RECOVERY - DPKG on mw33 is OK: All packages OK [15:18:12] RECOVERY - Disk space on es3 is OK: DISK OK [15:18:12] RECOVERY - DPKG on es2 is OK: All packages OK [15:18:33] PROBLEM - DPKG on mw1141 is CRITICAL: Connection refused by host [15:18:33] RECOVERY - DPKG on mw1136 is OK: All packages OK [15:18:42] PROBLEM - DPKG on srv207 is CRITICAL: Connection refused by host [15:18:42] PROBLEM - DPKG on mw42 is CRITICAL: Connection refused by host 
[15:18:42] RECOVERY - RAID on mw72 is OK: OK: no RAID installed [15:18:42] RECOVERY - RAID on mw69 is OK: OK: no RAID installed [15:18:42] RECOVERY - RAID on mw33 is OK: OK: no RAID installed [15:18:52] PROBLEM - DPKG on mw1159 is CRITICAL: Connection refused by host [15:18:52] PROBLEM - Disk space on mw1067 is CRITICAL: Connection refused by host [15:19:02] PROBLEM - DPKG on db50 is CRITICAL: Connection refused by host [15:19:22] RECOVERY - Disk space on db25 is OK: DISK OK [15:19:32] PROBLEM - RAID on srv261 is CRITICAL: Connection refused by host [15:19:32] PROBLEM - DPKG on db47 is CRITICAL: Connection refused by host [15:19:32] PROBLEM - Disk space on srv235 is CRITICAL: Connection refused by host [15:19:32] PROBLEM - Disk space on mw1073 is CRITICAL: Connection refused by host [15:19:42] RECOVERY - MySQL disk space on db34 is OK: DISK OK [15:19:42] RECOVERY - Disk space on mw1074 is OK: DISK OK [15:19:42] RECOVERY - Disk space on srv190 is OK: DISK OK [15:19:42] RECOVERY - RAID on srv229 is OK: OK: no RAID installed [15:19:42] PROBLEM - RAID on mw1036 is CRITICAL: Connection refused by host [15:19:42] RECOVERY - RAID on mw1048 is OK: OK: no RAID installed [15:19:43] RECOVERY - RAID on mw1037 is OK: OK: no RAID installed [15:19:52] PROBLEM - RAID on srv231 is CRITICAL: Connection refused by host [15:19:52] PROBLEM - Disk space on mw1111 is CRITICAL: Connection refused by host [15:19:52] RECOVERY - RAID on mw1074 is OK: OK: no RAID installed [15:19:52] RECOVERY - Disk space on mw1136 is OK: DISK OK [15:20:02] RECOVERY - DPKG on srv190 is OK: All packages OK [15:20:02] PROBLEM - Disk space on mw1159 is CRITICAL: Connection refused by host [15:20:03] RECOVERY - Disk space on mw1141 is OK: DISK OK [15:20:12] PROBLEM - RAID on mw1002 is CRITICAL: Connection refused by host [15:20:12] PROBLEM - Disk space on mw42 is CRITICAL: Connection refused by host [15:20:22] RECOVERY - Disk space on mw33 is OK: DISK OK [15:20:22] PROBLEM - Disk space on db50 is CRITICAL: 
Connection refused by host [15:20:22] RECOVERY - RAID on mw1080 is OK: OK: no RAID installed [15:20:32] RECOVERY - DPKG on db51 is OK: All packages OK [15:20:32] RECOVERY - RAID on db51 is OK: OK: State is Optimal, checked 2 logical device(s) [15:20:42] PROBLEM - RAID on mw1067 is CRITICAL: Connection refused by host [15:20:42] PROBLEM - RAID on mw1050 is CRITICAL: Connection refused by host [15:20:43] RECOVERY - Disk space on mw1142 is OK: DISK OK [15:20:52] RECOVERY - Disk space on srv207 is OK: DISK OK [15:20:52] PROBLEM - RAID on mw1003 is CRITICAL: Connection refused by host [15:20:52] PROBLEM - DPKG on srv261 is CRITICAL: Connection refused by host [15:20:52] PROBLEM - DPKG on srv254 is CRITICAL: Connection refused by host [15:20:52] PROBLEM - DPKG on srv283 is CRITICAL: Connection refused by host [15:21:02] PROBLEM - RAID on mw1073 is CRITICAL: Connection refused by host [15:21:02] PROBLEM - Disk space on srv263 is CRITICAL: Connection refused by host [15:21:02] RECOVERY - RAID on mw1079 is OK: OK: no RAID installed [15:21:02] PROBLEM - MySQL disk space on storage3 is CRITICAL: Connection refused by host [15:21:02] RECOVERY - Disk space on cp1041 is OK: DISK OK [15:21:02] RECOVERY - mobile traffic loggers on cp1041 is OK: PROCS OK: 2 processes with command name varnishncsa [15:21:03] RECOVERY - RAID on cp1044 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [15:21:12] PROBLEM - RAID on cp1042 is CRITICAL: Connection refused by host [15:21:22] PROBLEM - MySQL disk space on db50 is CRITICAL: Connection refused by host [15:21:22] PROBLEM - Disk space on db47 is CRITICAL: Connection refused by host [15:21:22] RECOVERY - Disk space on db51 is OK: DISK OK [15:21:22] RECOVERY - RAID on db34 is OK: OK: 1 logical device(s) checked [15:21:32] RECOVERY - RAID on srv267 is OK: OK: no RAID installed [15:21:32] PROBLEM - RAID on srv283 is CRITICAL: Connection refused by host [15:21:32] RECOVERY - RAID on mw1088 is OK: OK: no RAID installed [15:21:32] RECOVERY - MySQL 
disk space on es2 is OK: DISK OK [15:21:32] RECOVERY - RAID on es2 is OK: OK: State is Optimal, checked 2 logical device(s) [15:21:42] PROBLEM - RAID on fenari is CRITICAL: Connection refused by host [15:22:02] PROBLEM - Disk space on srv258 is CRITICAL: Connection refused by host [15:22:03] RECOVERY - RAID on mw41 is OK: OK: no RAID installed [15:22:03] RECOVERY - Disk space on mw1104 is OK: DISK OK [15:22:12] PROBLEM - MySQL disk space on es1003 is CRITICAL: Connection refused by host [15:22:13] RECOVERY - Disk space on db45 is OK: DISK OK [15:22:32] PROBLEM - DPKG on srv231 is CRITICAL: Connection refused by host [15:22:42] RECOVERY - DPKG on mw1104 is OK: All packages OK [15:22:52] PROBLEM - Disk space on emery is CRITICAL: Connection refused by host [15:22:52] RECOVERY - Disk space on srv267 is OK: DISK OK [15:23:02] PROBLEM - Disk space on srv261 is CRITICAL: Connection refused by host [15:23:02] PROBLEM - RAID on mw1023 is CRITICAL: Connection refused by host [15:23:02] PROBLEM - Disk space on srv231 is CRITICAL: Connection refused by host [15:23:12] PROBLEM - Disk space on srv238 is CRITICAL: Connection refused by host [15:23:32] PROBLEM - MySQL disk space on db47 is CRITICAL: Connection refused by host [15:23:32] RECOVERY - RAID on db45 is OK: OK: State is Optimal, checked 2 logical device(s) [15:23:42] PROBLEM - DPKG on mw1046 is CRITICAL: Connection refused by host [15:23:42] RECOVERY - DPKG on mw1036 is OK: All packages OK [15:23:42] PROBLEM - DPKG on mw1053 is CRITICAL: Connection refused by host [15:23:42] PROBLEM - DPKG on mw1050 is CRITICAL: Connection refused by host [15:23:42] RECOVERY - DPKG on mw1037 is OK: All packages OK [15:23:52] PROBLEM - DPKG on mw1073 is CRITICAL: Connection refused by host [15:23:52] PROBLEM - RAID on mw1111 is CRITICAL: Connection refused by host [15:24:02] PROBLEM - RAID on mw1143 is CRITICAL: Connection refused by host [15:24:02] PROBLEM - RAID on mw1133 is CRITICAL: Connection refused by host [15:24:02] PROBLEM - 
DPKG on mw18 is CRITICAL: Connection refused by host [15:24:02] RECOVERY - RAID on mw1142 is OK: OK: no RAID installed [15:24:02] RECOVERY - RAID on mw1141 is OK: OK: no RAID installed [15:24:02] RECOVERY - DPKG on mw41 is OK: All packages OK [15:24:03] PROBLEM - DPKG on cp1042 is CRITICAL: Connection refused by host [15:24:12] PROBLEM - RAID on srv285 is CRITICAL: Connection refused by host [15:24:12] PROBLEM - RAID on mw4 is CRITICAL: Connection refused by host [15:24:22] PROBLEM - Disk space on srv254 is CRITICAL: Connection refused by host [15:24:22] RECOVERY - RAID on es3 is OK: OK: State is Optimal, checked 2 logical device(s) [15:24:22] RECOVERY - DPKG on srv267 is OK: All packages OK [15:24:32] PROBLEM - DPKG on mw1002 is CRITICAL: Connection refused by host [15:24:32] RECOVERY - RAID on srv190 is OK: OK: no RAID installed [15:24:42] RECOVERY - DPKG on fenari is OK: All packages OK [15:24:42] PROBLEM - Disk space on mw1152 is CRITICAL: Connection refused by host [15:24:42] PROBLEM - RAID on srv238 is CRITICAL: Connection refused by host [15:24:52] PROBLEM - DPKG on mw1023 is CRITICAL: Connection refused by host [15:24:52] PROBLEM - DPKG on srv285 is CRITICAL: Connection refused by host [15:25:02] RECOVERY - DPKG on mw1067 is OK: All packages OK [15:25:02] RECOVERY - RAID on srv276 is OK: OK: no RAID installed [15:25:02] PROBLEM - Disk space on cp1042 is CRITICAL: Connection refused by host [15:25:02] PROBLEM - DPKG on mw1008 is CRITICAL: Connection refused by host [15:25:12] PROBLEM - DPKG on storage3 is CRITICAL: Connection refused by host [15:25:12] RECOVERY - MySQL disk space on db51 is OK: DISK OK [15:25:13] PROBLEM - RAID on emery is CRITICAL: Connection refused by host [15:25:13] RECOVERY - DPKG on cp1044 is OK: All packages OK [15:25:13] RECOVERY - Disk space on mw1079 is OK: DISK OK [15:25:13] RECOVERY - DPKG on mw1003 is OK: All packages OK [15:25:22] PROBLEM - RAID on mw18 is CRITICAL: Connection refused by host [15:25:32] PROBLEM - RAID on mw1159 
is CRITICAL: Connection refused by host [15:25:32] PROBLEM - DPKG on mw1009 is CRITICAL: Connection refused by host [15:25:42] RECOVERY - DPKG on db45 is OK: All packages OK [15:25:42] RECOVERY - RAID on mw1152 is OK: OK: no RAID installed [15:25:52] RECOVERY - DPKG on mw1079 is OK: All packages OK [15:25:52] PROBLEM - DPKG on emery is CRITICAL: Connection refused by host [15:25:52] RECOVERY - DPKG on mw1088 is OK: All packages OK [15:25:52] RECOVERY - RAID on db50 is OK: OK: State is Optimal, checked 2 logical device(s) [15:26:02] RECOVERY - Disk space on mw1037 is OK: DISK OK [15:26:12] PROBLEM - Disk space on mw1008 is CRITICAL: Connection refused by host [15:26:12] RECOVERY - Disk space on mw1003 is OK: DISK OK [15:26:22] PROBLEM - Disk space on mw1046 is CRITICAL: Connection refused by host [15:26:33] PROBLEM - RAID on searchidx2 is CRITICAL: Connection refused by host [15:26:33] PROBLEM - RAID on mw42 is CRITICAL: Connection refused by host [15:26:33] PROBLEM - DPKG on mw4 is CRITICAL: Connection refused by host [15:26:33] PROBLEM - Disk space on mw1009 is CRITICAL: Connection refused by host [15:26:42] RECOVERY - Disk space on mw1036 is OK: DISK OK [15:26:42] RECOVERY - Disk space on mw41 is OK: DISK OK [15:26:52] PROBLEM - Disk space on mw1023 is CRITICAL: Connection refused by host [15:26:52] PROBLEM - DPKG on searchidx2 is CRITICAL: Connection refused by host [15:27:02] PROBLEM - RAID on srv263 is CRITICAL: Connection refused by host [15:27:02] RECOVERY - RAID on srv207 is OK: OK: no RAID installed [15:27:12] RECOVERY - DPKG on srv276 is OK: All packages OK [15:27:12] RECOVERY - Disk space on srv285 is OK: DISK OK [15:27:22] RECOVERY - Disk space on mw1050 is OK: DISK OK [15:27:22] RECOVERY - Disk space on cp1044 is OK: DISK OK [15:27:32] RECOVERY - Disk space on es1003 is OK: DISK OK [15:27:42] RECOVERY - Disk space on fenari is OK: DISK OK [15:27:42] RECOVERY - DPKG on es1003 is OK: All packages OK [15:27:43] RECOVERY - RAID on es1003 is OK: OK: State 
is Optimal, checked 2 logical device(s) [15:27:52] PROBLEM - RAID on srv258 is CRITICAL: Connection refused by host [15:27:52] RECOVERY - DPKG on mw1152 is OK: All packages OK [15:28:12] PROBLEM - Disk space on mw1053 is CRITICAL: Connection refused by host [15:28:12] PROBLEM - Disk space on mw4 is CRITICAL: Connection refused by host [15:28:22] PROBLEM - DPKG on mw1143 is CRITICAL: Connection refused by host [15:28:32] RECOVERY - DPKG on mw1141 is OK: All packages OK [15:28:33] RECOVERY - DPKG on srv207 is OK: All packages OK [15:28:42] PROBLEM - DPKG on mw1133 is CRITICAL: Connection refused by host [15:28:42] PROBLEM - DPKG on srv258 is CRITICAL: Connection refused by host [15:28:52] RECOVERY - Disk space on mw1067 is OK: DISK OK [15:28:52] PROBLEM - DPKG on srv263 is CRITICAL: Connection refused by host [15:28:52] PROBLEM - Disk space on searchidx2 is CRITICAL: Connection refused by host [15:28:52] PROBLEM - Disk space on mw18 is CRITICAL: Connection refused by host [15:29:02] RECOVERY - DPKG on db50 is OK: All packages OK [15:29:02] RECOVERY - DPKG on mw42 is OK: All packages OK [15:29:02] PROBLEM - RAID on storage3 is CRITICAL: Connection refused by host [15:29:02] PROBLEM - DPKG on srv238 is CRITICAL: Connection refused by host [15:29:03] mutante: so this is still from your changes? 
[15:29:22] RECOVERY - Disk space on srv235 is OK: DISK OK
[15:29:22] RECOVERY - MySQL disk space on db45 is OK: DISK OK
[15:29:22] RECOVERY - Disk space on mw1073 is OK: DISK OK
[15:29:32] PROBLEM - RAID on mw1008 is CRITICAL: Connection refused by host
[15:29:32] PROBLEM - RAID on mw1046 is CRITICAL: Connection refused by host
[15:29:32] RECOVERY - RAID on mw1036 is OK: OK: no RAID installed
[15:29:32] PROBLEM - Disk space on storage3 is CRITICAL: Connection refused by host
[15:29:42] RECOVERY - Disk space on srv276 is OK: DISK OK
[15:29:42] PROBLEM - RAID on mw1009 is CRITICAL: Connection refused by host
[15:29:42] PROBLEM - RAID on mw1053 is CRITICAL: Connection refused by host
[15:29:42] RECOVERY - Disk space on mw1111 is OK: DISK OK
[15:29:52] PROBLEM - Disk space on mw1133 is CRITICAL: Connection refused by host
[15:29:52] RECOVERY - RAID on srv231 is OK: OK: no RAID installed
[15:30:02] RECOVERY - Disk space on mw42 is OK: DISK OK
[15:30:02] RECOVERY - RAID on mw1002 is OK: OK: no RAID installed
[15:30:12] RECOVERY - Disk space on db50 is OK: DISK OK
[15:30:23] RobH: the problem is that the nagios-nrpe service needs to be started..it is doing that now..one by one
[15:30:32] RECOVERY - RAID on mw1067 is OK: OK: no RAID installed
[15:30:32] RECOVERY - RAID on mw1050 is OK: OK: no RAID installed
[15:30:39] New patchset: Mark Bergsma; "Install files in web docroot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1863
[15:30:42] RECOVERY - RAID on mw1003 is OK: OK: no RAID installed
[15:30:42] RECOVERY - DPKG on srv261 is OK: All packages OK
[15:30:42] RECOVERY - DPKG on srv283 is OK: All packages OK
[15:30:52] RECOVERY - RAID on srv261 is OK: OK: no RAID installed
[15:30:52] RECOVERY - MySQL disk space on storage3 is OK: DISK OK
[15:30:52] RECOVERY - RAID on mw1073 is OK: OK: no RAID installed
[15:30:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1863
[15:31:02] RECOVERY - RAID on cp1042 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[15:31:07] RobH: not exactly sure why it was stopped, but seemed to happen right after config change. yeah
[15:31:09] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1863
[15:31:09] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1863
[15:31:11] mutante: cool, just was wondering why, now i know =]
[15:31:12] RECOVERY - MySQL disk space on db50 is OK: DISK OK
[15:31:12] RECOVERY - Disk space on db47 is OK: DISK OK
[15:31:22] RECOVERY - RAID on srv283 is OK: OK: no RAID installed
[15:31:23] PROBLEM - Disk space on mw1143 is CRITICAL: Connection refused by host
[15:31:32] RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[15:31:42] PROBLEM - DPKG on mw1107 is CRITICAL: Connection refused by host
[15:32:02] RECOVERY - MySQL disk space on es1003 is OK: DISK OK
[15:32:22] RECOVERY - DPKG on srv231 is OK: All packages OK
[15:32:42] RECOVERY - Disk space on emery is OK: DISK OK
[15:32:42] PROBLEM - RAID on srv278 is CRITICAL: Connection refused by host
[15:32:42] RECOVERY - Disk space on srv258 is OK: DISK OK
[15:32:52] RECOVERY - Disk space on srv261 is OK: DISK OK
[15:32:52] RECOVERY - Disk space on srv231 is OK: DISK OK
[15:32:52] RECOVERY - RAID on mw1023 is OK: OK: no RAID installed
[15:33:01] !log starting nagios-nrpe-server on srv's via dsh
[15:33:02] PROBLEM - RAID on mw1025 is CRITICAL: Connection refused by host
[15:33:03] RECOVERY - Disk space on srv238 is OK: DISK OK
[15:33:03] Logged the message, Master
[15:33:22] RECOVERY - MySQL disk space on db47 is OK: DISK OK
[15:33:32] PROBLEM - DPKG on mw1035 is CRITICAL: Connection refused by host
[15:33:32] RECOVERY - DPKG on mw1046 is OK: All packages OK
[15:33:32] RECOVERY - DPKG on mw1050 is OK: All packages OK
[15:33:32] RECOVERY - DPKG on mw1053 is OK: All packages OK
[15:33:42] RECOVERY - Disk space on srv283 is OK: DISK OK
[15:33:42] RECOVERY - DPKG on mw1073 is OK: All packages OK
[15:33:42] RECOVERY - RAID on mw1111 is OK: OK: no RAID installed
[15:33:52] RECOVERY - DPKG on mw18 is OK: All packages OK
[15:33:52] RECOVERY - RAID on mw1143 is OK: OK: no RAID installed
[15:33:52] PROBLEM - RAID on mw1107 is CRITICAL: Connection refused by host
[15:33:52] RECOVERY - RAID on mw1133 is OK: OK: no RAID installed
[15:33:52] RECOVERY - DPKG on cp1042 is OK: All packages OK
[15:34:02] RECOVERY - RAID on srv285 is OK: OK: no RAID installed
[15:34:02] RECOVERY - RAID on srv235 is OK: OK: no RAID installed
[15:34:02] RECOVERY - RAID on mw4 is OK: OK: no RAID installed
[15:34:12] RECOVERY - Disk space on srv254 is OK: DISK OK
[15:34:22] RECOVERY - DPKG on mw1002 is OK: All packages OK
[15:34:32] RECOVERY - Disk space on mw1152 is OK: DISK OK
[15:34:32] RECOVERY - DPKG on srv235 is OK: All packages OK
[15:34:32] RECOVERY - RAID on srv238 is OK: OK: no RAID installed
[15:34:42] PROBLEM - DPKG on mw1096 is CRITICAL: Connection refused by host
[15:34:42] RECOVERY - DPKG on srv285 is OK: All packages OK
[15:34:43] RECOVERY - DPKG on mw1023 is OK: All packages OK
[15:34:52] RECOVERY - Disk space on cp1042 is OK: DISK OK
[15:34:52] RECOVERY - DPKG on mw1008 is OK: All packages OK
[15:35:02] RECOVERY - DPKG on storage3 is OK: All packages OK
[15:35:02] PROBLEM - DPKG on mw1025 is CRITICAL: Connection refused by host
[15:35:02] RECOVERY - RAID on emery is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
[15:35:12] RECOVERY - RAID on mw18 is OK: OK: no RAID installed
[15:35:22] RECOVERY - DPKG on mw1009 is OK: All packages OK
[15:35:42] RECOVERY - DPKG on emery is OK: All packages OK
[15:36:02] RECOVERY - Disk space on mw1008 is OK: DISK OK
[15:36:02] RECOVERY - Disk space on mw1002 is OK: DISK OK
[15:36:12] RECOVERY - Disk space on mw1046 is OK: DISK OK
[15:36:13] PROBLEM - Disk space on mw1096 is CRITICAL: Connection refused by host
[15:36:13] PROBLEM - RAID on mw1096 is CRITICAL: Connection refused by host
[15:36:22] PROBLEM - Disk space on mw1025 is CRITICAL: Connection refused by host
[15:36:22] RECOVERY - RAID on mw42 is OK: OK: no RAID installed
[15:36:22] RECOVERY - DPKG on mw4 is OK: All packages OK
[15:36:22] RECOVERY - Disk space on mw1009 is OK: DISK OK
[15:36:22] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[15:36:32] PROBLEM - Disk space on mw1035 is CRITICAL: Connection refused by host
[15:36:42] RECOVERY - Disk space on mw1023 is OK: DISK OK
[15:36:52] RECOVERY - RAID on srv263 is OK: OK: no RAID installed
[15:37:02] RECOVERY - DPKG on searchidx2 is OK: All packages OK
[15:37:12] RECOVERY - mobile traffic loggers on cp1042 is OK: PROCS OK: 2 processes with command name varnishncsa
[15:37:42] RECOVERY - RAID on srv258 is OK: OK: no RAID installed
[15:37:52] RECOVERY - DPKG on mw1111 is OK: All packages OK
[15:38:21] RECOVERY - Disk space on mw1053 is OK: DISK OK
[15:38:21] RECOVERY - Disk space on mw4 is OK: DISK OK
[15:38:51] PROBLEM - RAID on ms1004 is CRITICAL: Connection refused by host
[15:38:51] PROBLEM - Disk space on mw1029 is CRITICAL: Connection refused by host
[15:38:51] PROBLEM - Disk space on mw1032 is CRITICAL: Connection refused by host
[15:38:51] PROBLEM - Disk space on mw1047 is CRITICAL: Connection refused by host
[15:39:21] PROBLEM - Disk space on sodium is CRITICAL: DISK CRITICAL - /var/spool/exim4/db is not accessible: Permission denied
[15:39:41] RECOVERY - Disk space on mw18 is OK: DISK OK
[15:40:01] PROBLEM - Disk space on mw1070 is CRITICAL: Connection refused by host
[15:40:31] RECOVERY - DPKG on mw1133 is OK: All packages OK
[15:40:31] RECOVERY - DPKG on mw1143 is OK: All packages OK
[15:41:01] RECOVERY - Disk space on storage3 is OK: DISK OK
[15:41:01] RECOVERY - Disk space on searchidx2 is OK: DISK OK
[15:41:11] PROBLEM - Disk space on mw1077 is CRITICAL: Connection refused by host
[15:41:31] New patchset: Mark Bergsma; "Add rewrite rules for eqiad and esams" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1864
[15:41:41] PROBLEM - Disk space on mw1068 is CRITICAL: Connection refused by host
[15:41:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1864
[15:41:51] RECOVERY - DPKG on db47 is OK: All packages OK
[15:42:01] PROBLEM - RAID on mw1032 is CRITICAL: Connection refused by host
[15:42:01] RECOVERY - RAID on mw1008 is OK: OK: no RAID installed
[15:42:01] RECOVERY - RAID on mw1009 is OK: OK: no RAID installed
[15:42:03] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1864
[15:42:04] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1864
[15:42:11] PROBLEM - RAID on mw1068 is CRITICAL: Connection refused by host
[15:42:11] PROBLEM - RAID on mw1070 is CRITICAL: Connection refused by host
[15:42:11] RECOVERY - RAID on mw1046 is OK: OK: no RAID installed
[15:42:11] RECOVERY - RAID on mw1053 is OK: OK: no RAID installed
[15:42:11] RECOVERY - RAID on mw1096 is OK: OK: no RAID installed
[15:42:21] RECOVERY - Disk space on mw1133 is OK: DISK OK
[15:42:21] RECOVERY - Disk space on mw1143 is OK: DISK OK
[15:43:11] RECOVERY - Disk space on srv263 is OK: DISK OK
[15:43:11] RECOVERY - DPKG on srv254 is OK: All packages OK
[15:43:11] RECOVERY - DPKG on srv258 is OK: All packages OK
[15:43:11] RECOVERY - RAID on storage3 is OK: OK: State is Optimal, checked 14 logical device(s)
[15:43:21] RECOVERY - DPKG on srv263 is OK: All packages OK
[15:43:31] RECOVERY - Disk space on cp1043 is OK: DISK OK
[15:43:31] RECOVERY - DPKG on srv238 is OK: All packages OK
[15:44:01] RECOVERY - DPKG on cp1043 is OK: All packages OK
[15:46:01] PROBLEM - DPKG on mw1047 is CRITICAL: Connection refused by host
[15:46:01] PROBLEM - DPKG on mw1051 is CRITICAL: Connection refused by host
[15:46:01] PROBLEM - DPKG on mw1057 is CRITICAL: Connection refused by host
[15:46:01] RECOVERY - DPKG on mw1035 is OK: All packages OK
[15:46:05] New patchset: Jgreen; "adding one-size-fits-all offhost_backups script for aluminium, grosley, storage3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1865
[15:46:11] RECOVERY - RAID on mw1107 is OK: OK: no RAID installed
[15:47:01] PROBLEM - DPKG on mw1070 is CRITICAL: Connection refused by host
[15:47:21] RECOVERY - RAID on cp1043 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[15:47:31] RECOVERY - mobile traffic loggers on cp1043 is OK: PROCS OK: 2 processes with command name varnishncsa
[15:47:41] RECOVERY - DPKG on mw1096 is OK: All packages OK
[15:48:01] PROBLEM - Disk space on db44 is CRITICAL: Connection refused by host
[15:48:11] PROBLEM - DPKG on mw1077 is CRITICAL: Connection refused by host
[15:48:21] RECOVERY - Disk space on mw1025 is OK: DISK OK
[15:48:41] PROBLEM - Disk space on mw1051 is CRITICAL: Connection refused by host
[15:48:41] RECOVERY - Disk space on mw1096 is OK: DISK OK
[15:48:51] PROBLEM - Disk space on mw1044 is CRITICAL: Connection refused by host
[15:48:51] RECOVERY - DPKG on mw1025 is OK: All packages OK
[15:49:01] PROBLEM - Disk space on mw1042 is CRITICAL: Connection refused by host
[15:49:01] RECOVERY - Disk space on mw1035 is OK: DISK OK
[15:49:21] PROBLEM - Disk space on mw1093 is CRITICAL: Connection refused by host
[15:49:31] PROBLEM - RAID on srv269 is CRITICAL: Connection refused by host
[15:49:31] PROBLEM - Disk space on mw1071 is CRITICAL: Connection refused by host
[15:49:31] PROBLEM - DPKG on srv288 is CRITICAL: Connection refused by host
[15:49:51] PROBLEM - Disk space on mw1057 is CRITICAL: Connection refused by host
[15:49:51] PROBLEM - Disk space on mw1090 is CRITICAL: Connection refused by host
[15:49:51] RECOVERY - Disk space on mw1070 is OK: DISK OK
[15:50:01] PROBLEM - Disk space on mw1064 is CRITICAL: Connection refused by host
[15:50:21] PROBLEM - DPKG on mw1134 is CRITICAL: Connection refused by host
[15:50:21] PROBLEM - DPKG on mw1148 is CRITICAL: Connection refused by host
[15:51:01] PROBLEM - Disk space on mw1069 is CRITICAL: Connection refused by host
[15:51:11] RECOVERY - DPKG on mw1107 is OK: All packages OK
[15:51:31] RECOVERY - Disk space on mw1068 is OK: DISK OK
[15:51:51] PROBLEM - DPKG on ms1004 is CRITICAL: Connection refused by host
[15:51:51] PROBLEM - RAID on mw1029 is CRITICAL: Connection refused by host
[15:51:51] RECOVERY - RAID on mw1025 is OK: OK: no RAID installed
[15:52:01] PROBLEM - MySQL disk space on es1001 is CRITICAL: Connection refused by host
[15:52:01] PROBLEM - RAID on mw1071 is CRITICAL: Connection refused by host
[15:52:01] PROBLEM - RAID on mw1069 is CRITICAL: Connection refused by host
[15:52:01] PROBLEM - RAID on mw1090 is CRITICAL: Connection refused by host
[15:52:01] PROBLEM - RAID on mw1077 is CRITICAL: Connection refused by host
[15:52:01] PROBLEM - RAID on mw1093 is CRITICAL: Connection refused by host
[15:52:01] RECOVERY - RAID on mw1068 is OK: OK: no RAID installed
[15:52:02] RECOVERY - RAID on mw1070 is OK: OK: no RAID installed
[15:55:04] New patchset: Hashar; "testswarm: update fetcher to r108075" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1866
[15:55:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1866
[15:55:41] PROBLEM - DPKG on db44 is CRITICAL: Connection refused by host
[15:55:51] PROBLEM - DPKG on mw1029 is CRITICAL: Connection refused by host
[15:55:51] PROBLEM - DPKG on mw1042 is CRITICAL: Connection refused by host
[15:55:51] PROBLEM - DPKG on mw1044 is CRITICAL: Connection refused by host
[15:55:51] RECOVERY - DPKG on mw1047 is OK: All packages OK
[15:55:51] RECOVERY - DPKG on mw1051 is OK: All packages OK
[15:56:01] RECOVERY - DPKG on mw1057 is OK: All packages OK
[15:56:51] RECOVERY - DPKG on mw1070 is OK: All packages OK
[15:56:58] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1865
[15:56:58] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1865
[15:57:01] PROBLEM - Disk space on srv275 is CRITICAL: Connection refused by host
[15:57:21] PROBLEM - DPKG on mw1071 is CRITICAL: Connection refused by host
[15:57:21] PROBLEM - DPKG on mw1093 is CRITICAL: Connection refused by host
[15:57:41] PROBLEM - DPKG on mw1090 is CRITICAL: Connection refused by host
[15:57:41] PROBLEM - RAID on mw1158 is CRITICAL: Connection refused by host
[15:57:51] PROBLEM - DPKG on mw1086 is CRITICAL: Connection refused by host
[15:58:21] PROBLEM - Disk space on mw1012 is CRITICAL: Connection refused by host
[15:58:21] RECOVERY - Disk space on mw1032 is OK: DISK OK
[15:58:21] RECOVERY - Disk space on mw1029 is OK: DISK OK
[15:58:21] RECOVERY - RAID on ms1004 is OK: OK: State is Optimal, checked 2 logical device(s)
[15:58:31] PROBLEM - Disk space on mw1001 is CRITICAL: Connection refused by host
[15:58:31] RECOVERY - DPKG on mw1077 is OK: All packages OK
[15:58:31] RECOVERY - Disk space on mw1051 is OK: DISK OK
[15:58:41] RECOVERY - Disk space on mw1047 is OK: DISK OK
[15:59:01] PROBLEM - RAID on srv201 is CRITICAL: Connection refused by host
[15:59:11] RECOVERY - Disk space on mw1093 is OK: DISK OK
[15:59:21] PROBLEM - RAID on mw19 is CRITICAL: Connection refused by host
[15:59:21] RECOVERY - Disk space on mw1071 is OK: DISK OK
[15:59:41] PROBLEM - DPKG on mw74 is CRITICAL: Connection refused by host
[15:59:41] PROBLEM - Disk space on mw1075 is CRITICAL: Connection refused by host
[15:59:41] PROBLEM - RAID on es1001 is CRITICAL: Connection refused by host
[15:59:41] RECOVERY - Disk space on mw1057 is OK: DISK OK
[15:59:41] RECOVERY - Disk space on mw1090 is OK: DISK OK
[16:00:11] PROBLEM - DPKG on mw1147 is CRITICAL: Connection refused by host
[16:00:11] PROBLEM - DPKG on mw1158 is CRITICAL: Connection refused by host
[16:00:12] RECOVERY - DPKG on mw1159 is OK: All packages OK
[16:00:31] PROBLEM - Disk space on mw1086 is CRITICAL: Connection refused by host
[16:00:51] RECOVERY - Disk space on mw1069 is OK: DISK OK
[16:01:01] PROBLEM - DPKG on es1001 is CRITICAL: Connection refused by host
[16:01:01] PROBLEM - MySQL disk space on db44 is CRITICAL: Connection refused by host
[16:01:01] RECOVERY - Disk space on mw1077 is OK: DISK OK
[16:01:11] PROBLEM - DPKG on srv269 is CRITICAL: Connection refused by host
[16:01:11] PROBLEM - Disk space on srv256 is CRITICAL: Connection refused by host
[16:01:11] PROBLEM - DPKG on aluminium is CRITICAL: Connection refused by host
[16:01:11] PROBLEM - RAID on aluminium is CRITICAL: Connection refused by host
[16:01:31] PROBLEM - Disk space on mw74 is CRITICAL: Connection refused by host
[16:01:41] PROBLEM - RAID on srv272 is CRITICAL: Connection refused by host
[16:01:41] PROBLEM - RAID on mw1012 is CRITICAL: Connection refused by host
[16:01:42] PROBLEM - RAID on mw1001 is CRITICAL: Connection refused by host
[16:01:42] RECOVERY - DPKG on ms1004 is OK: All packages OK
[16:01:42] RECOVERY - RAID on mw1032 is OK: OK: no RAID installed
[16:01:42] RECOVERY - RAID on mw1029 is OK: OK: no RAID installed
[16:01:51] PROBLEM - DPKG on es1002 is CRITICAL: Connection refused by host
[16:01:51] PROBLEM - RAID on mw1064 is CRITICAL: Connection refused by host
[16:01:51] PROBLEM - RAID on mw1086 is CRITICAL: Connection refused by host
[16:01:51] PROBLEM - RAID on mw1075 is CRITICAL: Connection refused by host
[16:01:51] RECOVERY - RAID on mw1071 is OK: OK: no RAID installed
[16:01:51] RECOVERY - RAID on mw1069 is OK: OK: no RAID installed
[16:01:51] RECOVERY - RAID on mw1077 is OK: OK: no RAID installed
[16:01:52] RECOVERY - RAID on mw1090 is OK: OK: no RAID installed
[16:01:52] RECOVERY - RAID on mw1093 is OK: OK: no RAID installed
[16:02:01] PROBLEM - Disk space on mw1134 is CRITICAL: Connection refused by host
[16:02:01] PROBLEM - Disk space on mw1148 is CRITICAL: Connection refused by host
[16:02:01] PROBLEM - Disk space on mw1147 is CRITICAL: Connection refused by host
[16:02:11] RECOVERY - Disk space on mw1159 is OK: DISK OK
[16:02:21] PROBLEM - Disk space on mw1158 is CRITICAL: Connection refused by host
[16:02:41] PROBLEM - RAID on srv202 is CRITICAL: Connection refused by host
[16:02:42] PROBLEM - Disk space on srv201 is CRITICAL: Connection refused by host
[16:02:51] PROBLEM - Disk space on srv236 is CRITICAL: Connection refused by host
[16:02:51] PROBLEM - RAID on srv256 is CRITICAL: Connection refused by host
[16:03:01] PROBLEM - RAID on srv270 is CRITICAL: Connection refused by host
[16:03:02] PROBLEM - RAID on srv275 is CRITICAL: Connection refused by host
[16:03:02] PROBLEM - DPKG on srv272 is CRITICAL: Connection refused by host
[16:03:11] PROBLEM - RAID on bast1001 is CRITICAL: Connection refused by host
[16:03:12] PROBLEM - Disk space on es1001 is CRITICAL: Connection refused by host
[16:03:31] PROBLEM - Disk space on es1002 is CRITICAL: Connection refused by host
[16:03:31] PROBLEM - RAID on es1002 is CRITICAL: Connection refused by host
[16:04:12] PROBLEM - Disk space on srv288 is CRITICAL: Connection refused by host
[16:04:51] PROBLEM - DPKG on srv270 is CRITICAL: Connection refused by host
[16:04:51] PROBLEM - Disk space on srv272 is CRITICAL: Connection refused by host
[16:04:56] !log starting nagios-nrpe-server on ALL via dsh to speed up nagios recovery [16:04:57] Logged the message, Master [16:05:11] PROBLEM - DPKG on srv275 is CRITICAL: Connection refused by host [16:05:31] RECOVERY - DPKG on db44 is OK: All packages OK [16:05:38] Change abandoned: Hashar; "wrong change :b" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1866 [16:05:41] RECOVERY - DPKG on mw1029 is OK: All packages OK [16:05:41] RECOVERY - DPKG on mw1042 is OK: All packages OK [16:05:41] RECOVERY - DPKG on mw1044 is OK: All packages OK [16:05:51] PROBLEM - DPKG on mw1064 is CRITICAL: Connection refused by host [16:06:51] PROBLEM - Disk space on srv270 is CRITICAL: Connection refused by host [16:06:51] RECOVERY - Disk space on srv275 is OK: DISK OK [16:07:01] PROBLEM - RAID on srv288 is CRITICAL: Connection refused by host [16:07:01] PROBLEM - DPKG on bast1001 is CRITICAL: Connection refused by host [16:07:11] RECOVERY - DPKG on mw1071 is OK: All packages OK [16:07:11] RECOVERY - DPKG on mw1093 is OK: All packages OK [16:07:21] PROBLEM - MySQL disk space on es1002 is CRITICAL: Connection refused by host [16:07:31] RECOVERY - DPKG on mw1090 is OK: All packages OK [16:07:31] RECOVERY - RAID on mw1158 is OK: OK: no RAID installed [16:07:41] RECOVERY - Disk space on db44 is OK: DISK OK [16:07:41] RECOVERY - DPKG on mw1086 is OK: All packages OK [16:07:51] PROBLEM - RAID on mw1147 is CRITICAL: Connection refused by host [16:07:51] PROBLEM - DPKG on mw1075 is CRITICAL: Connection refused by host [16:07:51] PROBLEM - DPKG on mw1012 is CRITICAL: Connection refused by host [16:07:51] PROBLEM - DPKG on mw1001 is CRITICAL: Connection refused by host [16:07:51] RECOVERY - RAID on mw1159 is OK: OK: no RAID installed [16:08:01] PROBLEM - RAID on mw1134 is CRITICAL: Connection refused by host [16:08:21] RECOVERY - Disk space on mw1001 is OK: DISK OK [16:08:31] RECOVERY - Disk space on mw1044 is OK: DISK OK [16:08:41] RECOVERY - Disk space on mw1042 is OK: 
DISK OK [16:08:51] RECOVERY - RAID on srv201 is OK: OK: no RAID installed [16:09:01] RECOVERY - RAID on mw19 is OK: OK: no RAID installed [16:09:21] PROBLEM - Disk space on bast1001 is CRITICAL: Connection refused by host [16:09:22] RECOVERY - RAID on srv269 is OK: OK: no RAID installed [16:09:22] RECOVERY - DPKG on srv288 is OK: All packages OK [16:09:32] RECOVERY - RAID on es1001 is OK: OK: State is Optimal, checked 2 logical device(s) [16:09:42] RECOVERY - Disk space on mw1064 is OK: DISK OK [16:10:01] RECOVERY - DPKG on mw74 is OK: All packages OK [16:10:01] RECOVERY - DPKG on mw1134 is OK: All packages OK [16:10:01] RECOVERY - DPKG on mw1148 is OK: All packages OK [16:10:01] RECOVERY - DPKG on mw1147 is OK: All packages OK [16:10:01] RECOVERY - DPKG on mw1158 is OK: All packages OK [16:10:21] RECOVERY - Disk space on mw1086 is OK: DISK OK [16:10:51] RECOVERY - MySQL disk space on db44 is OK: DISK OK [16:10:51] RECOVERY - DPKG on es1001 is OK: All packages OK [16:11:01] RECOVERY - DPKG on srv269 is OK: All packages OK [16:11:01] RECOVERY - Disk space on srv256 is OK: DISK OK [16:11:01] RECOVERY - DPKG on aluminium is OK: All packages OK [16:11:01] RECOVERY - RAID on aluminium is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [16:11:21] RECOVERY - Disk space on mw74 is OK: DISK OK [16:11:31] RECOVERY - RAID on srv272 is OK: OK: no RAID installed [16:11:31] RECOVERY - RAID on mw1012 is OK: OK: no RAID installed [16:11:31] RECOVERY - RAID on mw1001 is OK: OK: no RAID installed [16:11:41] RECOVERY - MySQL disk space on es1001 is OK: DISK OK [16:11:41] RECOVERY - RAID on mw1064 is OK: OK: no RAID installed [16:11:41] RECOVERY - DPKG on es1002 is OK: All packages OK [16:11:41] RECOVERY - RAID on mw1075 is OK: OK: no RAID installed [16:11:41] RECOVERY - RAID on mw1086 is OK: OK: no RAID installed [16:11:51] RECOVERY - Disk space on mw1134 is OK: DISK OK [16:11:51] RECOVERY - Disk space on mw1148 is OK: DISK OK [16:11:51] RECOVERY - Disk space on mw1147 is OK: DISK 
OK [16:12:11] RECOVERY - Disk space on mw1158 is OK: DISK OK [16:12:31] RECOVERY - Disk space on srv201 is OK: DISK OK [16:12:31] RECOVERY - RAID on srv202 is OK: OK: no RAID installed [16:12:41] RECOVERY - Disk space on srv236 is OK: DISK OK [16:12:41] RECOVERY - RAID on srv256 is OK: OK: no RAID installed [16:12:51] RECOVERY - RAID on srv270 is OK: OK: no RAID installed [16:12:51] RECOVERY - RAID on srv275 is OK: OK: no RAID installed [16:12:51] RECOVERY - DPKG on srv272 is OK: All packages OK [16:13:01] RECOVERY - RAID on bast1001 is OK: OK: no RAID installed [16:13:01] RECOVERY - Disk space on es1001 is OK: DISK OK [16:13:21] RECOVERY - Disk space on es1002 is OK: DISK OK [16:13:21] RECOVERY - RAID on es1002 is OK: OK: State is Optimal, checked 2 logical device(s) [16:14:01] RECOVERY - Disk space on srv288 is OK: DISK OK [16:14:41] RECOVERY - DPKG on srv270 is OK: All packages OK [16:14:41] RECOVERY - Disk space on srv272 is OK: DISK OK [16:15:01] RECOVERY - DPKG on srv275 is OK: All packages OK [16:15:41] RECOVERY - DPKG on mw1064 is OK: All packages OK [16:16:41] RECOVERY - Disk space on srv270 is OK: DISK OK [16:16:51] RECOVERY - RAID on srv288 is OK: OK: no RAID installed [16:16:51] RECOVERY - DPKG on bast1001 is OK: All packages OK [16:17:03] !log after a config change to nrpe_local.cfg and puppet applying the change, the service was not restarted but for some reason all nagios-nrpe-server processes caught SIGTERM. manually applying the same config change does not cause problems. 
that caused a Nagios outage until nrpe servers were started again (via dsh) [16:17:05] Logged the message, Master [16:17:11] RECOVERY - MySQL disk space on es1002 is OK: DISK OK [16:17:41] RECOVERY - DPKG on mw1012 is OK: All packages OK [16:17:41] RECOVERY - DPKG on mw1001 is OK: All packages OK [16:18:34] RECOVERY - RAID on mw1134 is OK: OK: no RAID installed [16:18:34] RECOVERY - RAID on mw1147 is OK: OK: no RAID installed [16:19:04] RECOVERY - Disk space on mw1012 is OK: DISK OK [16:20:14] RECOVERY - Disk space on mw1075 is OK: DISK OK [16:21:04] RECOVERY - Disk space on bast1001 is OK: DISK OK [16:25:55] New patchset: Hashar; "testswarm: update fetcher to r108075" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1867 [16:26:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1867 [16:26:24] RECOVERY - DPKG on mw1075 is OK: All packages OK [16:27:05] mutante: can you possibly merge https://gerrit.wikimedia.org/r/#change,1867 ? Change made by Timo & reviewed by me in CodeReview [16:28:17] New patchset: Jgreen; "adding root@grosley's key to logmover authorized_keys" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1868 [16:28:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1868 [16:28:36] hashar: looking.. [16:28:49] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1868 [16:28:50] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1868 [16:31:02] hashar: of course i dont know much about testswarm, but yeah, if you're here to check a puppet run after merge.. [16:31:34] mutante: hopefully it will just work :b [16:31:53] New patchset: Jgreen; "removing stale root@grosley ssh key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1869 [16:31:58] hashar: which host is this again? 
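The recovery !logged above (restarting nagios-nrpe-server on every host via dsh) amounts to fanning one command out over a host list. A rough, hypothetical Python equivalent of that pattern is sketched below; the service name comes from the log, while the ssh invocation and the injectable-runner design are assumptions, not how dsh itself works:

```python
import subprocess

# Service name taken from the !log entry; the restart command itself is assumed.
SERVICE_RESTART = "sudo service nagios-nrpe-server restart"

def fan_out(hosts, command, run=None):
    """Run `command` on every host, dsh-style; returns {host: exit_code}.

    `run` defaults to invoking ssh via subprocess; a fake runner can be
    injected for testing without touching any real hosts.
    """
    if run is None:
        run = lambda host, cmd: subprocess.call(["ssh", host, cmd])
    return {host: run(host, command) for host in hosts}
```

With a fake runner this can be exercised offline, e.g. `fan_out(["mw1001", "mw1002"], SERVICE_RESTART, run=lambda h, c: 0)` reports an exit code per host without opening any connections.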
[16:32:02] gallium [16:32:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1869 [16:32:33] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1869 [16:32:34] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1869 [16:33:13] New review: Dzahn; "Change made by Timo & reviewed by hashar in CodeReview." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1867 [16:33:13] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1867 [16:33:15] !log adjusting all power strip humidity sensor 2 (floor level) to 12% humidity, as the center rack has the proper levels, floor levels always are low in humidity. [16:33:17] Logged the message, RobH [16:33:53] mutante: thx ) [16:34:20] yay [16:34:29] if that stops the flood of those emails I will be a happy camper [16:34:39] hashar: Caching catalog for gallium [16:35:08] hashar: and.. applied. please check on web :) [16:35:58] i need to thank ben about foxyproxy again =P [16:36:03] this shit is so much easier than how i used to do it. [16:36:27] mutante: thanks for the merge :) [16:36:35] mutante: now waiting for cronjob to kick in [16:38:28] New review: Dzahn; "careful, you need to make sure a key is defined as absent to make sure it's gone. just removing it h..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1869 [16:52:29] New patchset: Pyoungmeister; "commenting out sodium preseed for manual install" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1870 [16:52:43] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1870 [16:53:54] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1870 [16:53:54] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1870 [16:55:53] New patchset: Hashar; "testswarm: minor fix following change r1867" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1871 [16:57:38] New review: Hashar; "SVN changes:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/1871 [16:58:38] New patchset: Jgreen; "offhost backups crons, adjustments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1872 [16:58:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1872 [16:58:53] mutante: if you are still around, can you also merge https://gerrit.wikimedia.org/r/#change,1867 on gallium please? [16:59:22] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 0; - https://gerrit.wikimedia.org/r/1872 [16:59:26] mutante: one is an unhandled exception, the other is just a cosmetic change [16:59:39] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1872 [16:59:39] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1872 [17:00:13] !log stop sodium to do manual reinstall [17:00:17] Logged the message, and now dispatching a T1000 to your position to terminate you. [17:01:14] New review: Demon; "I'm wondering if we should make an integration/testswarm repo (like we did with integration/jenkins)..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1871 [17:02:56] ^demon: can puppet fetch changes from another repo? [17:03:04] <^demon> No :( [17:03:14] even by using git submodules ? :-b [17:03:24] <^demon> Oh, hrm. Perhaps? 
[17:03:27] <^demon> Dunno about that. [17:03:33] <^demon> If so that'd be super-useful. [17:03:42] right now our procedure is to cherry pick changes into the main repo [17:04:36] ^demon: I have updated the testswarm script with some above gerrit changes [17:04:52] <^demon> LeslieCarr: How would you go about doing that from a puppet manifest? Examples? [17:04:52] ^demon: also added a new jenkins job with a PostgreSQL backend [17:04:58] <^demon> I saw, great. [17:05:11] ^demon: that should be fine hopefully. [17:05:14] oh was just thinking gerrit changes [17:05:24] need to go grab my daughter, will be back in roughly 4 hours hopefully [17:06:37] <^demon> LeslieCarr: Being able to import a specific commit hash from a second repo (presumably only ones we run) would be really useful. [17:07:01] <^demon> You could deploy MW using puppet as long as you can specify the commit hash :) [17:07:27] New review: Dzahn; "catching exceptions is a good thing and cosmetic fix, sure" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1871 [17:07:28] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1871 [17:07:39] hashar, now it shows up like Running MediaWiki trunk r108726"> for me [17:07:39] mutante: you are my new hero :-) [17:07:55] Nemo_bis: yeah :) Thanks for reporting this [17:08:03] Nemo_bis: the change is in the deployment queue ( https://gerrit.wikimedia.org/r/1871 ) [17:08:09] :) [17:08:22] hashar: yw. go get daughter:) [17:08:34] oh yeah my daughter [17:08:40] hehe [17:08:42] thx for the merge! [17:10:08] hashar: it ran on gallium. didnt break. run:) [17:11:55] * hashar checks logs [17:13:14] Nemo_bis: fixed! thanks mutante for deploying this so fast !! 
[17:13:15] http://integration.mediawiki.org/testswarm/user/MediaWiki/ [17:13:16] ;) [17:13:28] 3 2 1 spriiiint [17:15:07] hope his daughter wasnt waiting outside in the rain:p [17:16:22] PROBLEM - mailman on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:29:14] mark: I know there's gitweb integration in gerrit, but is there any 'front page' for the repositories (as opposed to the changes)? Something like what you normally see in github etc. [17:29:31] i'm not aware of it, but perhaps Ryan knows [17:29:43] I think that the gitweb in gerrit is a bit ugly and hidden indeed [17:30:02] PROBLEM - spamassassin on sodium is CRITICAL: Connection refused by host [17:30:02] PROBLEM - HTTPS on sodium is CRITICAL: Connection refused [17:31:27] hmm that does suck indeed [17:31:47] this is the best I can find: https://gerrit.wikimedia.org/r/#admin,projects [17:32:32] PROBLEM - HTTP on sodium is CRITICAL: Connection refused [17:32:37] but then the link to gitweb isn't obvious [17:32:43] PROBLEM - DPKG on sodium is CRITICAL: Connection refused by host [17:32:52] PROBLEM - RAID on sodium is CRITICAL: Connection refused by host [17:56:51] RECOVERY - HTTP on sodium is OK: HTTP OK HTTP/1.1 200 OK - 452 bytes in 0.054 seconds [18:02:28] New patchset: Mark Bergsma; "Convert spamassassin's local.cf to a template" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1873 [18:02:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1873 [18:04:46] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1873 [18:04:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1873 [18:08:36] New patchset: Mark Bergsma; "include network::constants" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1874 [18:08:50] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1874 [18:08:57] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1874 [18:08:58] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1874 [18:14:41] RECOVERY - mailman on sodium is OK: PROCS OK: 10 processes with args mailman [18:16:21] RECOVERY - spamassassin on sodium is OK: PROCS OK: 4 processes with args spamd [18:20:31] RECOVERY - RAID on sodium is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [18:20:41] RECOVERY - DPKG on sodium is OK: All packages OK [18:21:04] New patchset: Mark Bergsma; "require lighttpd to be installed before mailman" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1875 [18:21:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1875 [18:22:03] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1875 [18:22:04] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1875 [18:25:06] New review: Demon; "Shun the nonbeliever!" [test/mediawiki/core] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/1841 [19:12:10] hi binasher! [19:12:43] hey [19:12:45] i sent a request to RT for access to locke & emery, is that something you can arrange? [19:13:04] New patchset: Asher; "remove duplicate file definition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1876 [19:13:21] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1876 [19:14:08] drdee: i'll make sure its done today [19:14:13] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1876 [19:14:13] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1876 [19:15:46] RECOVERY - Puppet freshness on db1018 is OK: puppet ran at Thu Jan 12 19:15:21 UTC 2012 [19:16:05] binasher: super kewl! [19:16:47] RECOVERY - Puppet freshness on db1011 is OK: puppet ran at Thu Jan 12 19:16:39 UTC 2012 [19:16:47] RECOVERY - Puppet freshness on db1040 is OK: puppet ran at Thu Jan 12 19:16:43 UTC 2012 [19:18:16] RECOVERY - Puppet freshness on db1019 is OK: puppet ran at Thu Jan 12 19:17:53 UTC 2012 [19:19:16] PROBLEM - Puppet freshness on db22 is CRITICAL: Puppet has not run in the last 10 hours [19:19:46] RECOVERY - Puppet freshness on db1020 is OK: puppet ran at Thu Jan 12 19:19:42 UTC 2012 [19:20:16] RECOVERY - Puppet freshness on db1029 is OK: puppet ran at Thu Jan 12 19:19:56 UTC 2012 [19:20:16] RECOVERY - Puppet freshness on db1034 is OK: puppet ran at Thu Jan 12 19:20:00 UTC 2012 [19:22:46] RECOVERY - Puppet freshness on db1042 is OK: puppet ran at Thu Jan 12 19:22:21 UTC 2012 [19:23:17] RECOVERY - Puppet freshness on db1027 is OK: puppet ran at Thu Jan 12 19:22:52 UTC 2012 [19:23:17] RECOVERY - Puppet freshness on db1010 is OK: puppet ran at Thu Jan 12 19:23:09 UTC 2012 [19:23:46] RECOVERY - Puppet freshness on db1021 is OK: puppet ran at Thu Jan 12 19:23:18 UTC 2012 [19:23:46] RECOVERY - Puppet freshness on db1002 is OK: puppet ran at Thu Jan 12 19:23:34 UTC 2012 [19:24:17] RECOVERY - Puppet freshness on db1015 is OK: puppet ran at Thu Jan 12 19:24:07 UTC 2012 [19:24:46] RECOVERY - Puppet freshness on db1012 is OK: puppet ran at Thu Jan 12 19:24:36 UTC 2012 [19:25:06] New patchset: Asher; "fix password scope" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1877 
[19:25:16] RECOVERY - Puppet freshness on db1047 is OK: puppet ran at Thu Jan 12 19:25:09 UTC 2012 [19:25:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1877 [19:25:51] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1877 [19:25:52] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1877 [19:26:22] RECOVERY - Puppet freshness on db1046 is OK: puppet ran at Thu Jan 12 19:25:52 UTC 2012 [19:27:22] RECOVERY - Puppet freshness on db1016 is OK: puppet ran at Thu Jan 12 19:27:19 UTC 2012 [19:27:22] RECOVERY - Puppet freshness on db1028 is OK: puppet ran at Thu Jan 12 19:27:20 UTC 2012 [19:27:52] RECOVERY - Puppet freshness on db1001 is OK: puppet ran at Thu Jan 12 19:27:39 UTC 2012 [19:28:22] RECOVERY - Puppet freshness on db1041 is OK: puppet ran at Thu Jan 12 19:28:06 UTC 2012 [19:28:52] RECOVERY - Puppet freshness on db1033 is OK: puppet ran at Thu Jan 12 19:28:23 UTC 2012 [19:28:52] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Thu Jan 12 19:28:24 UTC 2012 [19:29:52] RECOVERY - Puppet freshness on db1026 is OK: puppet ran at Thu Jan 12 19:29:36 UTC 2012 [19:29:52] RECOVERY - Puppet freshness on db1031 is OK: puppet ran at Thu Jan 12 19:29:52 UTC 2012 [19:30:52] RECOVERY - Puppet freshness on db1048 is OK: puppet ran at Thu Jan 12 19:30:23 UTC 2012 [19:31:43] New patchset: Asher; "and include the pw class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1878 [19:31:52] RECOVERY - Puppet freshness on db1013 is OK: puppet ran at Thu Jan 12 19:31:44 UTC 2012 [19:31:58] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1878 [19:32:52] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Thu Jan 12 19:32:24 UTC 2012 [19:33:22] RECOVERY - Puppet freshness on db1038 is OK: puppet ran at Thu Jan 12 19:33:11 UTC 2012 [19:33:50] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1878 [19:33:51] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1878 [19:34:52] RECOVERY - Puppet freshness on db1043 is OK: puppet ran at Thu Jan 12 19:34:33 UTC 2012 [19:36:22] RECOVERY - Puppet freshness on db1003 is OK: puppet ran at Thu Jan 12 19:35:56 UTC 2012 [19:36:52] RECOVERY - Puppet freshness on db1025 is OK: puppet ran at Thu Jan 12 19:36:33 UTC 2012 [19:36:53] RECOVERY - Puppet freshness on db1044 is OK: puppet ran at Thu Jan 12 19:36:43 UTC 2012 [19:37:52] RECOVERY - Puppet freshness on db1004 is OK: puppet ran at Thu Jan 12 19:37:28 UTC 2012 [19:37:52] RECOVERY - Puppet freshness on db1008 is OK: puppet ran at Thu Jan 12 19:37:36 UTC 2012 [19:41:22] RECOVERY - Puppet freshness on db1022 is OK: puppet ran at Thu Jan 12 19:41:03 UTC 2012 [19:41:22] RECOVERY - Puppet freshness on db1005 is OK: puppet ran at Thu Jan 12 19:41:10 UTC 2012 [19:41:22] RECOVERY - Puppet freshness on db1045 is OK: puppet ran at Thu Jan 12 19:41:12 UTC 2012 [19:41:22] RECOVERY - Puppet freshness on db1017 is OK: puppet ran at Thu Jan 12 19:41:21 UTC 2012 [19:42:22] RECOVERY - Puppet freshness on db1039 is OK: puppet ran at Thu Jan 12 19:42:00 UTC 2012 [19:42:22] RECOVERY - Puppet freshness on db1014 is OK: puppet ran at Thu Jan 12 19:42:13 UTC 2012 [19:42:23] RECOVERY - Puppet freshness on db1030 is OK: puppet ran at Thu Jan 12 19:42:18 UTC 2012 [19:44:52] RECOVERY - Puppet freshness on db1024 is OK: puppet ran at Thu Jan 12 19:44:33 UTC 2012 [19:45:22] RECOVERY - Puppet freshness on db1035 is OK: puppet ran at Thu Jan 12 19:44:56 UTC 2012 
[19:45:23] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Thu Jan 12 19:45:05 UTC 2012 [19:51:08] New patchset: Pyoungmeister; "giving diedrik access to various boxes a la rt 2256" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1879 [19:51:59] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1879 [19:52:40] New patchset: Asher; "username switch for these is -l, not -u" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1880 [19:53:46] Change abandoned: Pyoungmeister; "wrong branch" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1879 [19:54:56] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1880 [19:54:56] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1880 [19:55:42] New patchset: Pyoungmeister; "adding shell access for diedrik rt 2256" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1881 [19:56:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1881 [19:56:18] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1881 [19:56:18] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1881 [19:57:27] New patchset: Ryan Lane; "test" [operations/software] (master) - https://gerrit.wikimedia.org/r/1882 [19:57:53] robh: I want to power down srv178-189...please let me know if there are any issues. [19:58:09] those are the ones causing the overage in that rack right? [19:58:51] New review: Bhartshorne; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1882 [19:58:51] Change merged: Bhartshorne; [operations/software] (master) - https://gerrit.wikimedia.org/r/1882 [20:01:10] hi apergos! [20:01:35] hello! 
[20:01:44] question, question [20:01:57] let's see if I have any answers! [20:01:57] cmjohnson1: those boxes are no longer in apache pool, and external storage has also fully migrated off of them [20:02:12] thanks to binasher, we have a very simple GLAM filter running on Locke [20:02:20] cool [20:02:23] yes [20:02:24] ! [20:02:40] i would like to make it available at downloads.wikimedia.org [20:02:53] cmjohnson1: they are all ok for you to pull the network on, and wipe [20:02:58] it does not contain privacy information, just timestamp, URL, referral [20:03:04] just pull network first [20:03:17] robh: cool thanks... got it! [20:03:21] how can we fix this? [20:03:22] cmjohnson1: then when they are wiped, unrack and box up, we have decommissions we can send them to, thanks =] [20:03:39] lemme take a quick look [20:04:01] apergos: /var/log/squid/glam_nara.log [20:04:05] okay...can you get a ticket to me when you get the opportunity...thx [20:04:15] sorry it's on emery [20:04:31] yeah I was having trouble finding that on locke :-D [20:04:50] robh: temp leads will take some time... i have to trace black cables on a black rack ...I am going to label as I go along for future ref [20:05:08] they should be labeled [20:05:16] not sure why they arent [20:05:18] if they arent [20:05:29] cmjohnson1: you dont need to trace [20:05:33] so here is the trick [20:05:33] k [20:05:40] you pull up the mgmt http interface for that power strip [20:05:45] yep...an unplug [20:05:54] well, may need to trace some [20:05:58] but it makes it a lot easier [20:06:15] so yea, feel free to unplug that stuff, just admin log when you are poking in a rack on that stuff [20:06:17] i thought of that as well...I wonder if I can bring up http interface on ipad [20:06:25] its not flash, should work [20:06:30] we need to get you wifi. [20:06:35] i will take care of it today. 
[20:06:44] i have a mifi with unlimited 4g...so not too terribly concerned [20:07:00] !log reassigning ports on asw-b-sdtpa [20:07:02] Logged the message, Mistress of the network gear. [20:07:12] well, the mifi wont get you mgmt interfaces [20:07:30] so makes this more painful, i just need to get it approved [20:08:00] i have a really long cable....i will push the laptop around on the cart =] [20:08:02] LeslieCarr: wanna do me a favor? [20:08:03] https://rt.wikimedia.org/Ticket/Display.html?id=1582 [20:08:16] if you can confirm this is good enough for us to use in pmtpa and sdtpa, i will buy a couple. [20:08:17] I think it's ok [20:08:28] there's nothing weird that can really get into the api.php params I guess [20:08:28] lemme look.. [20:08:36] drdee: [20:08:40] cmjohnson1: yea but that sucks long term, having wifi in dc is so much easier on you =] [20:09:16] cmjohnson1: also, pull the model and info off the crash cart that eq provides [20:09:23] RobH: out of stock [20:09:26] the tall one [20:09:34] LeslieCarr: bleeeeeehhhhh old [20:09:35] I'm not as sure about the referrers [20:09:42] LeslieCarr: care to suggest one off newegg.com? [20:09:44] just cause there's stuff like [20:09:49] one that you wouldnt hate deploying twice [20:09:50] okay...that would be great! I have looked for those [20:10:05] from an edit on some external site [20:10:11] i dunno if that's cool to publish [20:11:03] like people's wordpress blogs and such [20:11:12] mmmm [20:11:49] I know they're only viewing it but then I guess we're revealing info... maybe we could toss referrers that are from external sites? [20:12:19] yes, and just write 'external'? [20:12:23] New patchset: Bhartshorne; "another test" [operations/software] (master) - https://gerrit.wikimedia.org/r/1883 [20:12:23] yep [20:12:24] New review: gerrit2; "Lint check passed." 
[operations/software] (master); V: 1 - https://gerrit.wikimedia.org/r/1883 [20:12:36] that doesn't mean you can't keep it locally for analysis [20:12:44] just to publish them I think we'd want to toss that info [20:12:47] sure [20:13:00] but the analysis will be done by GLAM people, not me [20:13:06] ohhhh [20:13:10] mmm [20:13:36] I guess we'd want them to have a contract or [20:13:45] meh whatever the toolserver people have that can access sensitive data [20:13:48] one or the other [20:14:08] which toolserver ppl? [20:14:28] i think they've hidden more and more from toolserver users (with DB views) [20:14:30] any toolserver user that can get to pieces of the db that most users can't [20:14:38] idk first hand though [20:14:40] I don't know which class of user that is any more [20:14:43] !log granted the "process" priv to nagios@localhost on all production db clusters [20:14:44] Logged the message, Master [20:14:52] but if we replace referral with internal/external then they are happy and we are not disclosing any sensitive information [20:14:54] or what process they go through to get that access.. it's been years since I've had an active account [20:15:01] sure, that's find [20:15:03] *fine [20:15:07] binasher: ^^ need to tell DaB about that? [20:15:07] let's do that [20:15:09] ok [20:15:27] you have access to bayes right? [20:15:30] nope [20:15:31] New patchset: Bhartshorne; "removing test file" [operations/software] (master) - https://gerrit.wikimedia.org/r/1884 [20:15:32] New review: gerrit2; "Lint check passed." [operations/software] (master); V: 1 - https://gerrit.wikimedia.org/r/1884 [20:15:35] jeremyb: shouldn't need to, it was a grant that didn't include the "identified by" portion [20:15:42] hmm what hosts do you have access to? 
[20:15:47] none [20:15:52] huh [20:15:58] New review: Bhartshorne; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1883 [20:15:58] Change merged: Bhartshorne; [operations/software] (master) - https://gerrit.wikimedia.org/r/1883 [20:16:02] binasher: idk even what that priv is. just got me thinking... [20:16:05] but i put an RT request for Locke & Emery today, Asher is going to help me with that [20:16:06] and these are generated on emery [20:16:14] New review: Bhartshorne; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1884 [20:16:15] Change merged: Bhartshorne; [operations/software] (master) - https://gerrit.wikimedia.org/r/1884 [20:16:22] binasher: is `show processlist` ? [20:16:33] ok well we can set up a little job that copies them to download I guess, once you have something that sanitizes them and a location they live [20:16:44] (copies from emery) [20:17:16] that should take care of it, we'll wind up putting them in blah blah /other over there [20:17:26] some subdirectory with a nice name [20:17:27] awesome! [20:17:30] glam? [20:17:36] sounds great :-D [20:17:43] shall i write the cleaning process? [20:17:46] comes with glitter? [20:17:46] sure [20:17:56] comes with david bowie :-P [20:18:30] and david ferriero! [20:18:32] jeremyb: yup, that's what it grants.. without that, a user can only see their own threads [20:18:34] when you're ready to go drop an rt ticket and assign it to me (I hope you can do that) [20:18:40] binasher: and kill? [20:19:03] yup [20:19:21] k. 
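The cleaning process discussed above (keep internal referrers, collapse everything else to 'external' before publishing the GLAM logs) could be sketched roughly as below. The internal-domain whitelist, the field order of the squid log lines, and the function names are all assumptions for illustration, not the actual filter running on emery:

```python
from urllib.parse import urlparse

# Hypothetical whitelist of referrer domains treated as "internal";
# the real criteria used for the published GLAM logs are not stated in
# the channel.
INTERNAL_SUFFIXES = (".wikimedia.org", ".wikipedia.org")
INTERNAL_HOSTS = ("wikimedia.org", "wikipedia.org")

def sanitize_referrer(referrer):
    """Keep internal referrers; collapse anything else to 'external'."""
    host = urlparse(referrer).hostname or ""
    if host in INTERNAL_HOSTS or host.endswith(INTERNAL_SUFFIXES):
        return referrer
    return "external"

def sanitize_line(line):
    # Assumed layout: timestamp, URL, referrer, whitespace-separated,
    # matching the "just timestamp, URL, referral" description above.
    ts, url, referrer = line.split()
    return " ".join((ts, url, sanitize_referrer(referrer)))
```

The unsanitized logs can still be kept locally for analysis; only the copy shipped to the downloads subdirectory would pass through a filter like this.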
hopefully nagios doesn't start doing that ;-P [20:19:23] er, no [20:19:48] super is required to kill threads owned by other users [20:20:00] nagios can kill things belonging to nagios [20:20:50] sure, that sounds better :) [20:20:53] so I'm running vde_switch (very cool program) and qemu with vde compiled in, and it works but I can't figure out why, because I didn't give a socket path name to qemu and yet it apparently found the vde switch socket... but I can't see it in lsof that it has it open [20:20:57] stumped! [20:21:17] apergos: lsof -p ? [20:21:29] I looked at all fds by both processes [20:21:35] apergos: also lsof /path/to/socket [20:21:49] PROBLEM - Host srv187 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:57] I looked at everything holding open the particular path or containing the dir to the socket [20:21:59] PROBLEM - Host srv189 is DOWN: PING CRITICAL - Packet loss = 100% [20:21:59] PROBLEM - Host srv188 is DOWN: PING CRITICAL - Packet loss = 100% [20:22:04] it ain't there, not from qemu [20:22:19] I mean I went to /proc/nnn/fd [20:22:23] right, it's just not there [20:22:39] http://virtualsquare.org/virtualsquare.png is rather broken [20:22:42] and yet I can telnet to the dang thing :-D [20:23:02] nice [20:23:22] also, > Last Update:Sun May 27 16:57:15 CEST 2007 [20:23:52] well the wiki was updated jan 3 [20:24:32] New patchset: Bhartshorne; "initial import of geturls" [operations/software] (master) - https://gerrit.wikimedia.org/r/1885 [20:25:14] i see that now.
otoh, i see "MediaWiki: 1.9.3" [20:25:20] I'm running 2.3.2 out of svn, pretty current [20:25:23] ahahahaha [20:25:30] oh it hurts my soul :-D [20:25:59] see I can see what it has open, just that how qemu is actually hooked up to it is a mystery :-D [20:26:13] New review: Bhartshorne; "(no comment)" [operations/software] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1885 [20:26:13] Change merged: Bhartshorne; [operations/software] (master) - https://gerrit.wikimedia.org/r/1885 [20:26:22] fire and forget [20:26:39] they apparently have a mailing list [20:26:43] !log shutting down srv178-189 for decommissioning [20:26:44] Logged the message, Master [20:26:50] Ryan_Lane: https://gerrit.wikimedia.org/r/gitweb?p=operations/software.git;a=tree;f=geturls;h=18b5d0f534be6ac86a03564a2b2869a237a29468;hb=HEAD [20:27:48] robh: nagios is showing srv187-189 down...i thought they were out already removed [20:27:54] cmjohnson1: LeslieCarr just found two APs for tampa [20:27:57] ordering them now [20:28:05] robh: cool [20:28:38] guess I might have to read the code real quick [20:28:43] tomorrow. need to make [20:28:47] cmjohnson1: you asked me about up to srv181 i thought [20:28:52] lunch *and* dinner? oh nooooes [20:28:59] ahh, 189 [20:29:03] my bad, lemme check [20:29:10] drdee: are you now set up with all of the access that you need? [20:29:13] can I close the ticket? [20:29:16] i still think yer fine though, its just not in decomissioning.pp in puppet is all [20:29:18] checking [20:29:43] cmjohnson1: did you touch srv187-srv189 yet? [20:29:46] cmjohnson1: if not, leave them. 
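The lsof mystery above has a plausible explanation worth noting: the *connecting* end of a Unix domain socket is unnamed, so tools that report the socket path (`lsof /path/to/socket`, or a readlink on `/proc/<pid>/fd/*`) only show the path for the endpoint that bind()ed it, i.e. vde_switch itself — qemu's side appears as an anonymous `socket:[inode]`. A small Linux-only demonstration, not specific to qemu/vde:

```python
import os
import socket
import tempfile

# Bind a listener to a filesystem path, then connect a client to it.
path = os.path.join(tempfile.mkdtemp(), "ctl")
server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(path)
server.listen(1)

client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(path)

# The client's fd carries no path: /proc shows only "socket:[inode]",
# which is why scanning the client's fds never turns up the socket file.
link = os.readlink(f"/proc/self/fd/{client.fileno()}")
print(link)
```

The server's bound path does show up (via the kernel's unix socket table), which matches what apergos saw: the switch holds the path open, while the qemu end is invisible to a path-based search.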
[20:29:56] I pulled network on them [20:30:01] plug em back in ;] [20:30:05] they are still online [20:30:10] okay [20:30:12] it wont hurt anything, they just depooled is all [20:30:16] will come back in no worries [20:30:22] okay [20:30:33] they are one of 40 spi cluster servers [20:30:35] api even [20:30:49] cmjohnson1: so those are still in use, even at a lower weight, so we may as well leave them for the moment [20:30:55] pull the rest though and lets see how we are on power [20:31:11] (these will go eventually but do not need to go today, they are still doing a job well enough for the moment) [20:31:43] maplebed: what's wrong with ab? [20:34:30] cmjohnson1: So before I complete this newegg order [20:34:36] is there anything you need down there that i dunno about? [20:35:14] cmjohnson1: I see some mgmt switch is needed [20:35:19] RECOVERY - Host srv189 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [20:35:21] robh: lemme see [20:35:29] RECOVERY - Host srv188 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [20:35:39] yes....i need a new switch [20:35:48] got that on this order as well [20:35:53] just a mgmt switch [20:35:59] yes...just mgmt [20:36:02] and the two APs, one for sdtpa and one for pmtpa [20:36:17] when they come in leslie will be working with you to get them setup [20:36:24] I think I am good [20:36:25] you good for blank cds and such? [20:36:34] yes..plenty of those. [20:36:39] cool [20:36:39] I need a couple new thumb drives [20:36:54] I have debian on one and memtest on another [20:37:26] but that is not that big of a deal. thx for checking [20:37:31] couple cheap 5gb good enough? [20:37:39] RECOVERY - Host srv187 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [20:37:40] perfect [20:39:01] meh, 8g is 15 bucks [20:39:09] and they are the nicer mushkin, which arent dog slow [20:39:18] i dont buy the shitty slow ones, they are annoying. [20:40:32] robh: do you want that crash cart info now?
[20:40:41] sure =] [20:40:52] i am tired of going to sign one out in eqiad and them being all gone [20:40:55] ordering my own. [20:41:05] i like the tall one they have there, will fit well down aisles [20:44:55] robh: http://www.globalindustrial.com/g/office/computer-furniture/workstations/mobile-computer-workstation [20:45:29] thats pretty cheap [20:45:45] and i know that one is not going to snap in half either, which is nice. [20:46:17] robh: !2261 [20:46:27] i forgot how to get the rt to show up [20:46:43] it has industrial in the name, it can't snap in half! [20:46:43] !rt2261 [20:46:49] !rt 2261 [20:46:49] https://rt.wikimedia.org/Ticket/Display.html?id=2261 [20:46:50] !rt 2261 [20:46:50] https://rt.wikimedia.org/Ticket/Display.html?id=2261 [20:46:58] ahh..thats it [20:47:01] !wm-bot stop sucking [20:47:01] http://meta.wikimedia.org/wiki/WM-Bot [20:47:19] @regsearch rt [20:47:19] Results (found 4): unicorn, socks-proxy, stucked, rt, [20:47:22] regarding !rt2261 where is that going [20:48:01] ? [20:48:09] what do you mean? is it set to go through at that time? [20:48:46] or the fact the ticket has no details on where its moving, so and so? [20:48:54] I see it is being relocated but I don't see a ticket regarding where it's moving to [20:49:06] yea, i didnt know this was scheduled for this time [20:49:14] but hey, good enough, i will drop the relevant tickets [20:50:39] hrmm [20:50:59] i think i am going to make this easy on you and move from pmtpa c1 to pmtpa d1 [20:51:20] hrmm need to see if its redundant power [20:51:22] or single [20:51:51] New patchset: Asher; "install percona-toolkit on hardy db's" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1886 [20:52:03] holy crap this is old drac.... [20:52:05] really old. [20:52:06] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1886 [20:52:15] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1886 [20:52:16] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1886 [20:53:56] yeah it is ...the dell 1950 apache srv's have the same DRAC [20:54:11] well, this is even older [20:54:19] a revision behind those, these were first dells i ordered for wmf [20:54:20] heh [20:54:28] !log installing percona-toolkit on few remaining hardy dbs [20:54:30] Logged the message, Master [20:54:31] ahhh...memories [20:54:52] RobH: locke, aka db6 [20:54:59] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Thu Jan 12 20:54:29 UTC 2012 [20:55:04] yep [20:56:15] mark: do you have any issue with it going in d1-pmtpa [20:56:23] plenty of power left in that rack, dual feeds [20:56:29] quick downtime too since its close [20:56:48] i dont think there is an issue, but since yer here ;] [20:57:28] I don't [20:57:35] and indeed, downtime should be short [20:57:41] chris should setup rails first and prewire [20:57:46] so the system can move and bootup fast [20:57:50] hrmm [20:57:56] rails can be taken from one of the other db7, 8 [20:57:59] cmjohnson1: do you have a spare set of 1950 rails? [20:58:02] those are to be decommissioned [20:58:04] it's a 2950 [20:58:09] that works. [20:58:12] db6-10 are in that rack [20:58:16] well, db10 no longer is [20:58:20] db9 is still in use [20:58:24] but db7, db8 are decommissioned [20:58:26] just take those rails [20:59:38] I will have pre-wired....am i understanding that db7 and 8 are decommissioned so I can pull one of them out and set rails.
[20:59:46] yes [20:59:55] New patchset: Asher; "install percona nagios scripts on all dbs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1887 [20:59:56] okay [21:00:11] cmjohnson1: https://rt.wikimedia.org/Ticket/Display.html?id=2264 [21:00:12] New patchset: Jgreen; "adding khorn shell access to storage3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1888 [21:00:19] it has the info on using rails, where to put it, prewire, etc... [21:00:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1888 [21:00:30] links to the ticket on notification to users and to network ticket [21:00:34] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1887 [21:00:34] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1887 [21:00:49] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1888 [21:00:50] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1888 [21:01:50] robh: thanks [21:06:09] ordered my crash cart =] [21:11:09] PROBLEM - Host mobile1 is DOWN: PING CRITICAL - Packet loss = 100% [21:15:43] New patchset: Ryan Lane; "Decommissioning mobile1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1889 [21:15:58] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1889 [21:16:47] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1889 [21:16:47] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1889 [21:22:52] LeslieCarr: updated the ticket for the labs switch, its still in route to juniper it seems [21:23:24] okay cool [21:23:26] thanks [21:27:11] j is seriously slow on that [21:27:22] perhaps we should get better support [21:27:27] one spare may not always be enough ;) [21:30:15] robh: mobile1 is down is anyone working on it? [21:30:37] doubtful, pretty sure its offline cuz we dont use them for anything right now [21:30:56] okay...anything in sdtpa I want to make sure is not a problem =] [21:30:58] Ryan_Lane renamed mobile1, maybe that's related? [21:31:11] I just renamed it [21:31:20] chris should rename it too [21:31:22] relabel [21:31:22] hm. it failed to pxe [21:31:40] good idea mark [21:31:45] did you update dns, brewster dhcp? [21:31:51] crap. dhcp [21:31:52] I hate renaming servers [21:31:54] +1 [21:31:56] I do too [21:33:23] ryan_lane what is the new name? racktables updated? [21:33:34] cmjohnson1: virt0 [21:33:44] hm. I don't see mobile1 in dhcp conf [21:33:48] thx [21:35:18] christ. how do I get the MAC? [21:36:04] ah. 
see it on boot [21:36:31] Ryan_Lane: if you rename a server, need to drop a ticket in the relevant datacenter queue to relabel [21:36:36] also have to update the mgmt dns [21:36:38] and racktables [21:37:18] you also need to wire $50 to my bank account [21:37:21] it's not in the dhcp config at all :( [21:37:32] that seems unlikely [21:37:40] it's in the 57600 I think [21:37:45] not 115200 [21:37:46] mark: us americans can't wire so easy [21:37:59] good [21:38:04] that will teach you not to rename servers then [21:38:08] apparently I need to take mobile2, not mobile1 [21:38:14] alternatively you can send pralines [21:38:29] something is wrong with eth0 on mobile1, says binasher [21:38:31] :( [21:38:43] mark: I checked all files. the mac isn't in any of them [21:38:44] chris and rob will want to know ;) [21:38:47] weird [21:39:02] if it's in warranty, let's get it fixed [21:39:07] if it's out of warranty, let's get it decommissioned [21:39:07] I'm just gonna take mobile2 [21:39:11] ok [21:39:16] can you handle the switch? ;) [21:39:17] i'm going off [21:39:19] heh [21:39:20] sure [21:39:21] it was occasional packetloss, might just need recabling [21:39:32] oh [21:39:35] in tampa, that's very possible ;) [21:39:56] I'm going to keep it decommissioned [21:40:13] ? 
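The failure mode Ryan_Lane hit at [21:38:43] — the host's MAC turning out to be in none of the DHCP config files — is the kind of thing a quick scan can confirm before a PXE attempt. A rough sketch (a regex pass over ISC dhcpd-style config text, not a full parser; brewster's actual file layout isn't shown in the log):

```python
import re

# Matches "host <name> { ... hardware ethernet <mac> ..." blocks.
HOST_RE = re.compile(
    r"host\s+(\S+)\s*\{[^}]*?hardware\s+ethernet\s+([0-9A-Fa-f:]{17})",
    re.S,
)

def find_host_macs(conf_text):
    """Map host names to their 'hardware ethernet' MACs in an ISC
    dhcpd-style config. A regex scan, not a real dhcpd.conf parser."""
    return {name: mac.lower() for name, mac in HOST_RE.findall(conf_text)}
```

Feeding it each config file in turn answers "is this host (or this MAC) in DHCP at all" — which here it wasn't, hence the switch to mobile2.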
[21:40:18] if the ethernet is bad drop a ticket =p [21:40:19] in puppet [21:40:22] just did [21:40:23] hey now ...tampa is getting somewhat tolerable now [21:40:28] ahh, ok, cuz i think its under warranty [21:40:34] i know chris [21:40:41] but I doubt you touched mobile1 before ;) [21:40:44] :) [21:41:03] no, I haven't but it is an older Dell 1950 [21:41:14] i really do not like those servers [21:41:33] they were nice [21:41:37] just getting a bit old now [21:41:45] meh [21:41:49] it happens to the best of us [21:41:54] my trusty old 1650 in my basement [21:42:08] I don't want to throw it away ;( [21:43:24] I had a sparc classic once [21:43:31] but you know what, for all things there is a season [21:43:36] and a time for the dumpster :-P [21:44:02] you know how much money I spent on that thing [21:44:05] back when I was a student ;p [21:48:03] jeremyb: sorry for the delayed reply. ab only takes a single URL. abmulti can take a file but it's limited to 20,000 lines. I've got 20 million or so. [21:48:35] maplebed: aha [21:48:47] jeremyb: lastly, ab always goes as fast as it can, but I'm hitting what is ultimately a production backend, so I want to make sure that I can limit the speed. [21:49:28] oh wait - there's more. I want to be able to stop and restart the process; ab doesn't let me start in the middle of a file. [21:49:57] maplebed: i've used siege in the past with large numbers of urls (playing back an accesslog basically) but i don't know if it has a url limit off the top of my head [21:50:35] I don't think I looked at siege [21:50:42] anyway, it's written now. [21:50:44] :P [21:50:59] PROBLEM - RAID on ms1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
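The requirements maplebed lists at [21:48:03]–[21:49:28] — millions of URLs (past abmulti's 20,000-line limit), a bounded request rate against a production backend, and the ability to stop and restart mid-file — fit in a few lines. This is a toy sketch, not the tool maplebed actually wrote; the fetch step is stubbed out:

```python
import time

def replay_urls(urls, start_at=0, rate_per_sec=10.0, fetch=print):
    """Replay urls[start_at:] at no more than rate_per_sec requests/sec.
    Returns the next index, so a stopped run can resume where it left off."""
    interval = 1.0 / rate_per_sec
    index = start_at - 1          # so an empty slice resumes at start_at
    for index, url in enumerate(urls[start_at:], start=start_at):
        fetch(url)                # stand-in: swap in a real HTTP GET here
        time.sleep(interval)      # crude pacing; a token bucket is smoother
    return index + 1
```

For 20 million URLs you would stream the file and periodically persist the returned index to disk rather than hold the list in memory, but the shape — offset in, offset out, sleep between requests — is the same.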
[21:51:27] maplebed: if you ever want to look in the future, http://www.joedog.org/index/siege-home [21:51:48] it also supports cookie files, posts, and actions as multiple different logged in users within a single set of urls [21:52:01] not that those features matter for swift testing [22:04:36] anyone need anything done here b4 i leave? [22:06:44] PROBLEM - Host mobile2 is DOWN: PING CRITICAL - Packet loss = 100% [22:23:47] New patchset: Lcarr; "putting ganglia1001 in puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1890 [22:24:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1890 [22:24:21] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1890 [22:24:22] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1890 [23:01:39] New patchset: Asher; "mysql conf reorg, define clusters and masters for all prod clusters s1-7" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1891 [23:01:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1891 [23:05:45] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1891 [23:05:46] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1891 [23:09:39] New patchset: Asher; "Revert "mysql conf reorg, define clusters and masters for all prod clusters s1-7" - var scope issue" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1892 [23:09:52] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1892 [23:11:46] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1892 [23:11:47] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1892 [23:12:57] * binasher just had a "it would be good to use labs" moment [23:24:58] New patchset: Asher; "mysql conf reorg, define clusters and masters for all prod clusters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1893 [23:25:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1893 [23:26:12] maplebed: Ryan_Lane: could one of you kindly review ^^^ [23:26:41] should I be scared? [23:27:22] oomph. I'm kneedeep in something else - can I review later? [23:28:04] I see where you're going, but to really review I'll need to settle into it for a bit. [23:32:03] maplebed: no problem [23:39:08] bit of background - eqiad is now using chained replication and i'm setting up pt-heartbeat for lag monitoring of slaves of slaves, and to log the bin and relay log positions every host is at, regardless of tier, to make it easier to promote slaves in nasty failure cases or across colos. pt-heartbeat requires knowing the serverid of the master you want to measure lag from when two levels deep. so.. hosts will get a file marking what cluster they're [23:39:08] in, and serverid determined in a python script that does gethostbyname($cluster-master) etc.. [23:39:29] tldr - that stuff will become more important, and need to be easily editable in one place [23:52:54] New patchset: Asher; "enable and force ssl for graphite" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1894 [23:53:32] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1894
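binasher's scheme at [23:39:08] — each host carries a file naming its cluster, and a Python script resolves `$cluster-master` via gethostbyname to find the master whose serverid pt-heartbeat should measure lag against — can be sketched like this. Only the `<cluster>-master` naming convention and the use of gethostbyname come from the log; the function names and the marker-file path are hypothetical:

```python
import socket

def cluster_master_ip(cluster, resolve=socket.gethostbyname):
    """Resolve the conventional '<cluster>-master' DNS name described in
    the log; `resolve` is injectable so this runs without real DNS."""
    return resolve(f"{cluster}-master")

def read_cluster(path="/etc/mysql-cluster-name"):
    """Read the per-host cluster marker file (the path is a guess; the
    log only says hosts get a file marking their cluster)."""
    with open(path) as f:
        return f.read().strip()
```

From the master's address, the real script would then look up that host's server_id (e.g. for pt-heartbeat's master-server-id requirement when a slave is two replication tiers below the master).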