[00:00:03] New patchset: Bhartshorne; "first draft of the swift cleaner stuff. I know this doesn't work but I want to check it in for reviews." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3134 [00:00:14] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.404 seconds [00:00:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3134 [00:01:07] mutante: that was quick :) [00:03:23] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.462 seconds [00:06:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:06:30] Anyone know if "hume: sudo: no tty present and no askpass program specified" when doing sync-file etc has been reported? [00:09:04] Reedy: dont think so [00:12:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.067 seconds [00:27:30] New patchset: Lcarr; "Splitting off the icinga packages and specific config files into their own manifest file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3135 [00:27:39] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3135 [00:28:51] New patchset: Lcarr; "Splitting off the icinga packages and specific config files into their own manifest file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3135 [00:29:03] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3135 [00:29:20] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3135 [00:36:19] New patchset: Lcarr; "Moving icinga specific files to own class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3136 [00:36:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3136 [00:37:58] Change abandoned: Lcarr; "i hate you gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3135 [00:38:58] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3136 [00:39:00] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3136 [00:46:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:47:19] New patchset: Lcarr; "explicitly calling out nagios_mysql_check_pass" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3137 [00:47:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3137 [00:48:00] * AaronSchulz wonders why copper is so unreliable [00:48:23] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3137 [00:48:26] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3137 [00:49:03] AaronSchulz: in which way? since just 48 hours or all the time? [00:49:25] probably the later ;) [00:49:52] maybe it just sucks when connecting from here [00:51:25] the machine or internet via copper cables ? 
[00:51:57] hehe [00:52:01] the boxen [00:52:10] * AaronSchulz loves metal copper [00:52:50] well, i installed some upgrades on it like 2 says ago, thats why i said "since 48 hours" [00:52:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.545 seconds [00:52:54] days [00:54:25] New patchset: Lcarr; "fixing out of scope variable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3138 [00:54:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3138 [00:55:20] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3138 [00:55:23] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3138 [00:58:43] New patchset: Lcarr; "fixing other out of scope variables" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3139 [00:58:55] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3139 [00:59:44] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3139 [00:59:47] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3139 [01:06:13] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:08:55] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:13:07] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.859 seconds [01:14:23] mutante: it's hard for me to even connect [01:15:31] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Puppet has not run in the last 10 hours [01:15:31] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: Puppet has not run in the last 10 hours [01:15:31] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [01:15:31] PROBLEM - Puppet freshness on amssq32 is CRITICAL: Puppet has not run in the last 10 hours [01:15:31] PROBLEM - Puppet freshness on amssq31 is CRITICAL: Puppet has not run in the last 10 hours [01:16:16] AaronSchulz: works for me (via fenari that is, right) [01:18:29] New patchset: Lcarr; "removing unneeded /etc/icinga check" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3140 [01:18:41] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3140 [01:18:48] AaronSchulz: 9 packets transmitted, 9 received, 0% packet loss rtt min/avg/max/mdev = 26.350/26.429/26.466/0.114 ms [01:19:05] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3140 [01:19:08] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3140 [01:19:25] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:24:17] New patchset: Ryan Lane; "Upping svn rev for ldap tools to update manage-volumes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3141 [01:24:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3141 [01:24:33] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3141 [01:24:35] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3141 [01:28:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:31:52] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.194 seconds [01:34:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.667 seconds [01:43:05] PROBLEM - Misc_Db_Lag on db10 is CRITICAL: (Return code of 255 is out of bounds) [01:43:05] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:43:14] PROBLEM - MySQL slave status on es2 is CRITICAL: CRITICAL: Access denied for user nagios@208.80.152.161 (using password: YES) [01:43:32] PROBLEM - Misc_Db_Slave on db10 is CRITICAL: CRITICAL: Access denied for user nagios@spence.wikimedia.org (using password: YES) [01:43:32] PROBLEM - MySQL replication status on es1003 is CRITICAL: (Return code of 255 is out of bounds) [01:43:32] 
PROBLEM - MySQL master status on es3 is CRITICAL: CRITICAL: Access denied for user nagios@208.80.152.161 (using password: YES) [01:43:32] PROBLEM - Misc_Db_Master on db9 is CRITICAL: CRITICAL: Access denied for user nagios@spence.wikimedia.org (using password: YES) [01:43:50] PROBLEM - MySQL replication status on es1 is CRITICAL: (Return code of 255 is out of bounds) [01:43:50] PROBLEM - MySQL replication status on es4 is CRITICAL: (Return code of 255 is out of bounds) [01:44:08] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: (Return code of 255 is out of bounds) [01:44:08] PROBLEM - MySQL slave status on es1003 is CRITICAL: CRITICAL: Access denied for user nagios@208.80.152.161 (using password: YES) [01:44:08] PROBLEM - MySQL slave status on es1 is CRITICAL: CRITICAL: Access denied for user nagios@208.80.152.161 (using password: YES) [01:44:08] PROBLEM - MySQL replication status on es1004 is CRITICAL: (Return code of 255 is out of bounds) [01:44:26] PROBLEM - MySQL slave status on es4 is CRITICAL: CRITICAL: Access denied for user nagios@208.80.152.161 (using password: YES) [01:44:26] PROBLEM - MySQL replication status on db1025 is CRITICAL: (Return code of 255 is out of bounds) [01:44:35] PROBLEM - MySQL master status on db1008 is CRITICAL: CRITICAL: Access denied for user nagios@208.80.152.161 (using password: YES) [01:44:44] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.300 seconds [01:44:44] PROBLEM - MySQL slave status on db1025 is CRITICAL: CRITICAL: Access denied for user nagios@208.80.152.161 (using password: YES) [01:44:44] PROBLEM - MySQL replication status on storage3 is CRITICAL: (Return code of 255 is out of bounds) [01:44:53] PROBLEM - MySQL replication status on es2 is CRITICAL: (Return code of 255 is out of bounds) [01:45:02] PROBLEM - MySQL slave status on storage3 is CRITICAL: CRITICAL: Access denied for user nagios@208.80.152.161 (using password: YES) [01:51:02] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [01:55:14] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.601 seconds [01:59:53] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.266 seconds [02:01:32] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:06:11] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:10:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:15:11] New patchset: Lcarr; "Revert "fixing out of scope variable"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3142 [02:15:24] New patchset: Lcarr; "Revert "fixing other out of scope variables"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3143 [02:15:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3142 [02:15:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3143 [02:16:06] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3142 [02:16:09] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3142 [02:16:13] New patchset: Dzahn; "swift process monitoring (RT-2593)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3144 [02:16:25] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3143 [02:16:25] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3144 [02:16:25] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3143 [02:16:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.038 seconds [02:16:41] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.313 seconds [02:17:35] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [02:18:20] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.581 seconds [02:18:47] New review: Dzahn; "expect NRPE to break when merging any change to nrpe_local.cfg - be prepared to restart nagios-nrpe-..." [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/3144 [02:22:59] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:24:26] New patchset: Lcarr; "fixing the nagios checkcommands template again" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3145 [02:24:38] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3145 [02:24:38] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:25:06] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3145 [02:25:08] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3145 [02:40:59] PROBLEM - Puppet freshness on virt4 is CRITICAL: Puppet has not run in the last 10 hours [02:44:53] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours [02:47:42] RECOVERY - Misc_Db_Lag on db10 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:47:42] RECOVERY - MySQL slave status on es2 is OK: OK: [02:48:00] RECOVERY - Misc_Db_Slave on db10 is OK: OK: [02:48:00] RECOVERY - MySQL master status on es3 is OK: OK: [02:48:00] RECOVERY - Misc_Db_Master on db9 is OK: OK: [02:48:00] RECOVERY - MySQL replication status on es1003 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:48:27] RECOVERY - MySQL replication status on es1 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:48:27] RECOVERY - MySQL replication status on es4 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:48:27] RECOVERY - MySQL slave status on es1003 is OK: OK: [02:48:45] RECOVERY - MySQL slave status on es4 is OK: OK: [02:48:45] RECOVERY - MySQL replication status on es1004 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : s [02:48:45] RECOVERY - MySQL slave status on es1 is OK: OK: [02:48:53] !log realized i forgot to log hours ago that cp1029-cp1036 are installed with puppet run, ready for varnish deployment tomorrow [02:48:54] RECOVERY - MySQL replication status on db1025 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:48:56] Logged the message, RobH [02:49:03] RECOVERY - MySQL master status on db1008 is OK: OK: 
[02:49:03] RECOVERY - MySQL slave status on db1025 is OK: OK: [02:49:21] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 11s [02:49:39] RECOVERY - MySQL slave status on storage3 is OK: OK: [02:49:48] RECOVERY - MySQL replication status on es2 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [02:49:51] !log revoked, cp1032 is some reason in grub error, and its too late at night for me to work on it, will troubleshoot tomorrow [02:49:54] Logged the message, RobH [02:50:42] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 50s [02:52:09] !log cp1032-cp1035 reinstall issue wiped mbr causing issues, will reinstall in my AM [02:52:12] Logged the message, RobH [02:53:51] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.047 second response time [03:01:39] PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours [03:10:12] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.862 seconds [03:16:39] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.242 seconds [03:31:12] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:33:31] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.136 seconds [03:34:43] RECOVERY - Puppet freshness on mw53 is OK: puppet ran at Wed Mar 14 03:34:21 UTC 2012 [03:38:36] !log free some disk space on spence - deleted user.log.1 on spence, compressing messages.1, apt-get clean,... 
[03:38:39] Logged the message, Master [03:40:07] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:42:04] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:44:10] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.363 seconds [03:44:16] !log ekrem - user agent "AppleDictionaryService" requests cause temp. WAP outage ..it seems [03:44:19] Logged the message, Master [03:46:20] mutante: do you know much about appledictionaryservice ? [03:46:34] New patchset: Lcarr; "fixing paths" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3146 [03:46:36] i don't myself :( [03:46:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3146 [03:46:50] LeslieCarr: no, i just see "ekrem is a Wikimedia Apple Dictionary to API OpenSearch bridge (misc::apple-dictionary-bridge)." [03:47:02] ah yeah :) and that's all that's in the apache logs [03:47:14] my totally uncertain guess is it's overloaded and we should set up varnish or something [03:47:15] LeslieCarr: and when i looked a bit at tail -f access.log on it, i see a lot of requests ""AppleDictionaryService/158.2"" [03:47:24] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3146 [03:47:27] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3146 [03:47:50] LeslieCarr: yeah, it does look overloaded, but only while that DictioaryService comes to index all pages or something... [03:48:10] LeslieCarr: when that happens we get the flapping WAP service on ekrem, and after its done its good a gain until next time.. 
afaik [03:49:11] ah so it's very sporadic [03:49:27] i would check but when hitting history.cgi i probaly kill spence :) [03:49:37] but yeah [03:50:10] :) [03:50:19] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.630 seconds [03:51:06] "Dictionary running under Mac OS X Leopard, showing Wikipedia's page on Wikipedia." [03:51:26] http://en.wikipedia.org/wiki/Dictionary_%28software%29 [03:52:06] http://en.wikipedia.org/wiki/Dictionary_%28software%29#Wikipedia [03:55:10] LeslieCarr: when fixing stuff on spence and adding new checks, i did not want to mess with your changes on the icinga files, but also i dont want us having to do stuff twice, i'd rather help with Icinga right way [03:55:42] cool - so most files that are in puppet should transfer over [03:55:47] however, many checks aren't in puppet ;) [03:56:01] and i have new process checks for swift sitting in gerrit.. but they will break NRPE again.. because every config change on nrpe_local.cfg does [03:56:46] so i will kill the bot, stop paging, wait for it to break.. [03:56:46] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:57:10] then restart the service on ALL, using scripts... OR ..find out the root cause first.. 
hrmm [03:58:19] ah yeah, nrpe local changes, so frustrating [03:58:31] and not having dsh [03:58:35] i mean dsh groups [03:58:43] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 5.447 seconds [03:58:49] anyways, i'm gonna go afk for a bit to see if the latest change fixed neon [03:59:34] grep host_name /etc/nagios/puppet_hosts.cfg | cut -d " " -f23 [03:59:38] kk, ttyl [04:06:58] meh, gotta fix users and stuff - that sounds likea tomorrow issue :) [04:07:00] bye [04:11:28] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:13:34] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.855 seconds [04:45:30] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:49:33] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [05:41:39] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [06:26:57] New patchset: Dzahn; "also allow public esams net (91.198.174.0./25), not just private, snmp access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3147 [06:27:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3147 [06:28:56] New patchset: Dzahn; "also allow public esams net (91.198.174.0./25), not just private, snmp access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3147 [06:29:09] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3147 [06:30:29] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3147 [06:30:32] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3147 [07:17:17] RECOVERY - Puppet freshness on amslvs1 is OK: puppet ran at Wed Mar 14 07:16:44 UTC 2012 [07:17:44] RECOVERY - Puppet freshness on ssl3001 is OK: puppet ran at Wed Mar 14 07:17:28 UTC 2012 [07:17:44] RECOVERY - Puppet freshness on knsq20 is OK: puppet ran at Wed Mar 14 07:17:33 UTC 2012 [07:18:11] RECOVERY - Puppet freshness on ssl3002 is OK: puppet ran at Wed Mar 14 07:17:54 UTC 2012 [07:18:47] RECOVERY - Puppet freshness on amssq45 is OK: puppet ran at Wed Mar 14 07:18:13 UTC 2012 [07:19:41] RECOVERY - Puppet freshness on amssq35 is OK: puppet ran at Wed Mar 14 07:19:27 UTC 2012 [07:22:50] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [07:24:56] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [07:25:14] RECOVERY - Puppet freshness on amssq55 is OK: puppet ran at Wed Mar 14 07:24:59 UTC 2012 [07:25:14] RECOVERY - Puppet freshness on amssq51 is OK: puppet ran at Wed Mar 14 07:25:08 UTC 2012 [07:25:14] RECOVERY - Puppet freshness on amssq61 is OK: puppet ran at Wed Mar 14 07:25:09 UTC 2012 [07:25:41] RECOVERY - Puppet freshness on amssq34 is OK: puppet ran at Wed Mar 14 07:25:16 UTC 2012 [07:25:41] RECOVERY - Puppet freshness on amssq58 is OK: puppet ran at Wed Mar 14 07:25:22 UTC 2012 [07:27:47] RECOVERY - Puppet freshness on ssl3003 is OK: puppet ran at Wed Mar 14 07:27:27 UTC 2012 [07:27:47] RECOVERY - Puppet freshness on ssl3004 is OK: puppet ran at Wed Mar 14 07:27:32 UTC 2012 [07:27:56] PROBLEM - Puppet freshness on knsq24 is CRITICAL: Puppet has not run in the last 10 hours [07:27:56] PROBLEM - Puppet freshness on ms6 is CRITICAL: Puppet has not run in the last 
10 hours [07:28:05] RECOVERY - Puppet freshness on amssq57 is OK: puppet ran at Wed Mar 14 07:27:55 UTC 2012 [07:28:41] RECOVERY - Puppet freshness on amssq54 is OK: puppet ran at Wed Mar 14 07:28:15 UTC 2012 [07:28:41] RECOVERY - Puppet freshness on amslvs3 is OK: puppet ran at Wed Mar 14 07:28:19 UTC 2012 [07:29:17] RECOVERY - Puppet freshness on knsq16 is OK: puppet ran at Wed Mar 14 07:28:46 UTC 2012 [07:29:17] RECOVERY - Puppet freshness on cp3002 is OK: puppet ran at Wed Mar 14 07:28:57 UTC 2012 [07:29:17] RECOVERY - Puppet freshness on amssq33 is OK: puppet ran at Wed Mar 14 07:28:59 UTC 2012 [07:30:11] RECOVERY - Puppet freshness on maerlant is OK: puppet ran at Wed Mar 14 07:29:41 UTC 2012 [07:30:47] RECOVERY - Puppet freshness on amssq37 is OK: puppet ran at Wed Mar 14 07:30:14 UTC 2012 [07:30:47] RECOVERY - Puppet freshness on knsq17 is OK: puppet ran at Wed Mar 14 07:30:32 UTC 2012 [07:31:41] RECOVERY - Puppet freshness on amssq52 is OK: puppet ran at Wed Mar 14 07:31:31 UTC 2012 [07:31:50] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [07:31:50] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [07:32:17] RECOVERY - Puppet freshness on amssq59 is OK: puppet ran at Wed Mar 14 07:32:07 UTC 2012 [07:34:41] RECOVERY - Puppet freshness on amssq53 is OK: puppet ran at Wed Mar 14 07:34:14 UTC 2012 [07:34:41] RECOVERY - Puppet freshness on cp3001 is OK: puppet ran at Wed Mar 14 07:34:27 UTC 2012 [07:35:44] RECOVERY - Puppet freshness on knsq25 is OK: puppet ran at Wed Mar 14 07:35:25 UTC 2012 [07:35:44] RECOVERY - Puppet freshness on knsq22 is OK: puppet ran at Wed Mar 14 07:35:40 UTC 2012 [07:37:14] RECOVERY - Puppet freshness on nescio is OK: puppet ran at Wed Mar 14 07:36:45 UTC 2012 [07:40:14] RECOVERY - Puppet freshness on knsq18 is OK: puppet ran at Wed Mar 14 07:40:02 UTC 2012 [07:40:41] RECOVERY - Puppet freshness on ms6 is OK: puppet ran at Wed Mar 14 07:40:18 UTC 2012 
[07:40:41] RECOVERY - Puppet freshness on amssq47 is OK: puppet ran at Wed Mar 14 07:40:40 UTC 2012 [07:40:50] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours [07:41:44] RECOVERY - Puppet freshness on knsq24 is OK: puppet ran at Wed Mar 14 07:41:30 UTC 2012 [07:41:44] RECOVERY - Puppet freshness on amssq36 is OK: puppet ran at Wed Mar 14 07:41:31 UTC 2012 [07:42:47] RECOVERY - Puppet freshness on amssq39 is OK: puppet ran at Wed Mar 14 07:42:26 UTC 2012 [07:43:14] RECOVERY - Puppet freshness on amssq32 is OK: puppet ran at Wed Mar 14 07:43:04 UTC 2012 [07:43:41] RECOVERY - Puppet freshness on knsq29 is OK: puppet ran at Wed Mar 14 07:43:16 UTC 2012 [07:43:41] RECOVERY - Puppet freshness on hooft is OK: puppet ran at Wed Mar 14 07:43:26 UTC 2012 [07:45:11] RECOVERY - Puppet freshness on amssq43 is OK: puppet ran at Wed Mar 14 07:44:43 UTC 2012 [07:46:41] RECOVERY - Puppet freshness on amssq31 is OK: puppet ran at Wed Mar 14 07:46:16 UTC 2012 [07:49:50] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [07:51:05] !log fixing owa[1-3] Swift HTTP commands manually [07:51:08] Logged the message, Master [07:51:11] !log restarting mecached on marmontel [07:51:14] Logged the message, Master [07:58:50] RECOVERY - Memcached on srv254 is OK: TCP OK - 0.002 second response time on port 11000 [07:59:08] RECOVERY - Memcached on srv255 is OK: TCP OK - 0.001 second response time on port 11000 [08:02:26] RECOVERY - Memcached on srv257 is OK: TCP OK - 0.003 second response time on port 11000 [08:02:49] !log stop/start memcached on srv254,srv255,srv257 [08:02:52] Logged the message, Master [08:19:10] apergos: does snapshot3 need mysql-client at all? currently mysql-client and mysql-common are on it in a "iU" (unpacked) state, but depends on mysql-client-core which it does not have, causing broken dpkg, can just remove it all? [08:19:17] hello [08:19:34] that's bad [08:19:45] as of when? 
[08:19:46] hi:) i didnt expect this message to reach you right this second:) [08:20:00] but didnt check time either [08:20:21] as of today? [08:20:25] or as of a week ago? [08:21:35] as of 1d 16 h [08:21:41] all the snapshots need mysql client [08:21:52] if it gets removed (and how did it get removed?) everything breaks [08:22:04] in this case it will be the "adds changes" dumps that broke [08:22:12] lemme try reinstalling it [08:22:33] i have no idea how it got removed, just Nagios told me "dpkg broken packages" and then i checked why [08:24:16] !log running "apt-get -f install" on snapshot3 to fix dpkg, which installed mysql-client- and client-core-5.1 [08:24:19] Logged the message, Master [08:24:36] that's pretty frustrating. I would love to know how they just disappeared [08:24:37] looks like Nagios should soon report that fixed [08:24:41] RECOVERY - DPKG on snapshot3 is OK: All packages OK [08:27:06] nothing whatsoever in the sysadmin log [08:27:57] Start-Date: 2012-03-12 15:43:31 [08:27:58] yeah my incrementals haven't been running for two days. [08:27:59] Upgrade: libmysqlclient16 (5.1.53-fb3753-wm1, 5.1.61-0ubuntu0.10.04.1), mysql-common (5.1.53-fb3753-wm1, 5.1.61-0ubuntu0.10.04.1), mysql-client-5.1 (5.1.53-fb3753-wm1, 5.1.61-0ubuntu0.10.04.1) [08:28:03] Error: Sub-process /usr/bin/dpkg returned an error code (1) [08:28:08] /var/log/apt/history.log [08:28:15] but who ran it? [08:28:53] and why not follow up to be sure it worked? [08:31:40] I can get today's to run I think but it's going to include yesterday's data [08:36:32] apergos: hmm, see how the Start-Date above was at 15:_43_ ? [08:36:51] uh huh [08:36:53] apergos: and there is also a cronjob at "43" [08:37:03] this gets done by cron? [08:37:11] which [ -f /var/lib/puppet/state/puppetdlock ] && find /var/lib/puppet/state/puppetdlock -ctime +1 -delete [08:37:32] and then you can see in auth.log how root , started by cron ... [08:38:09] but i dont know yet how that would do the package upgrade..
but i dont see a manual user login around that time.. and the timestamps fit too good [08:38:40] I wonder if the other snaps got upgraded or are liabl to break for no reason shortly [08:39:12] I have rerun the one incremental step and tweaked the configured delay so that today's job should run [08:40:03] mm, how could deletin a puppet lock file, cause a apt command [08:40:10] lets run puppet manually again now [08:40:48] ok [08:41:04] you don't see apt in the log file as part of puppet? [08:41:50] Mar 12 15:43:34 snapshot3 puppet-agent[14592]: (/Stage[main]/Snapshots::Packages/Package[mysql-client-5.1]/ensure) change from 5.1.53-fb3753-wm1 to 5.1.61-0ubuntu0.10.04.1 failed: Could not update: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install mysql-client-5.1' returned 100: [08:41:58] uh huh [08:42:00] puppet did it.. somehow [08:42:26] from wm1 to ubuntu .. its that again. our vs. distro [08:42:51] but you do have apt preferences saying it should prefer ours [08:44:03] (/Stage[main]/Snapshots::Sync/Exec[snapshot-trigger-mw-sync]) Dependency Package[mysql-client-5.1] has failures: true [08:44:24] so Exec[snapshot..] has a dependecy on that package [08:44:27] on snapshot2 I see this: [08:44:28] Mar 12 15:24:26 snapshot2 puppet-agent[28349]: (/Stage[main]/Snapshots::Packages/Package[mysql-client-5.1]/ensure) ensure changed '5.1.41-3ubuntu12.10' to '5.1.61-0ubuntu0.10.04.1 [08:45:15] which therefore "just worked" [08:46:32] I don't care which version it uses as long as it upgrades cleanly and nothing breaks [08:50:10] apt sources.list and apt preferences are identical [11:17:28] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [11:55:53] New patchset: Hashar; "gerrit played ping pong between http / https URL" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3148 [11:56:06] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3148 [12:04:06] PROBLEM - Disk space on search1017 is CRITICAL: DISK CRITICAL - free space: /a 5002 MB (3% inode=99%): [12:08:27] PROBLEM - Disk space on search1017 is CRITICAL: DISK CRITICAL - free space: /a 5000 MB (3% inode=99%): [12:43:14] PROBLEM - Puppet freshness on virt4 is CRITICAL: Puppet has not run in the last 10 hours [12:46:32] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours [12:55:43] apergos: ping ? :-] [12:56:49] hashar: pong [12:57:09] I would need some root magic to list files on manganese, that is the server hosting gerrit [12:57:23] what exactly do you need? [12:57:27] I need to hack a script that would list all git repositories hosted there that begins with 'mediawiki' [12:57:44] it seems they are hold in /var/lib/gerrit2/git , I would like to have a confirmation :) [12:57:48] ah [12:57:48] ssh manganese.wikimedia.org /bin/ls -1 /var/lib/gerrit2/git [12:57:50] lemme see [12:58:24] then I will write a php script that output to the public: /bin/ls -1 | grep mediawiki [12:58:24] nope [12:58:40] root@manganese:/var/lib/gerrit2/review_site# ls [12:58:40] bin cache etc git hooks lib logs static tmp [12:58:55] there is no /var/lib/gerrit2/git [12:59:07] maybe that 'git' subdirectory so [12:59:13] /var/lib/gerrit2/review_site/git [12:59:15] ls git [12:59:15] All-Projects.git analytics analytics.git integration labs mediawiki mediawiki.git operations test [12:59:24] wonderful! [12:59:29] ok [12:59:34] can you ls mediawiki/extensions ? [13:00:04] should have something like /var/lib/gerrit2/review_site/git/mediawiki/extensions/*.git [13:02:41] root@manganese:/var/lib/gerrit2/review_site/git/mediawiki/extensions# ls -a [13:02:42] . ContributionReporting.git Narayam.git SubPageList3.git [13:02:42] etc. [13:02:51] you rocks! 
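A minimal sketch of the listing discussed above (every git repository under the gerrit directory whose path begins with 'mediawiki'), written in Python rather than the PHP script mentioned in the log. The base path /var/lib/gerrit2/review_site/git is the one confirmed in the conversation; the function name `list_git_repos` and the directory-walk logic are assumptions for illustration only, not the script submitted for review.

```python
import os


def list_git_repos(base_dir, prefix="mediawiki"):
    """Return repository paths (relative to base_dir) that start with prefix.

    A directory counts as a repository here if its name ends in ".git";
    that naming convention is taken from the ls output in the log.
    """
    repos = []
    for root, dirs, _files in os.walk(base_dir):
        for name in list(dirs):
            if name.endswith(".git"):
                rel = os.path.relpath(os.path.join(root, name), base_dir)
                if rel.startswith(prefix):
                    repos.append(rel)
                # Do not descend into the bare repository itself.
                dirs.remove(name)
    return sorted(repos)


if __name__ == "__main__":
    # Path confirmed in the conversation; adjust for other installations.
    for repo in list_git_repos("/var/lib/gerrit2/review_site/git"):
        print(repo)
```

Run on the layout shown in the log, this would be expected to report mediawiki.git plus the entries under mediawiki/extensions/, while skipping operations, analytics and the other top-level projects.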
[13:03:29] PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours [13:03:51] thanks [13:03:51] !log cp1029-cp1035 all installed and ready for varnish deployment, puppet has been run [13:03:53] anyting else? [13:03:55] Logged the message, RobH [13:04:14] apergos: should be good for now [13:04:20] I am going to push a gerrit change for review [13:06:11] New patchset: Hashar; "publicly list mediawiki extensions git repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3149 [13:06:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3149 [13:10:32] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:10:59] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:28:59] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:38:58] New patchset: Hashar; "publicly list mediawiki extensions git repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3149 [13:39:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3149 [13:39:20] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [13:39:37] New review: Hashar; "That second patch set makes the change no more dependent on another pending one." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/3149 [13:39:56] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 4.174 seconds [13:40:32] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.182 seconds [13:44:00] apergos: would you be willing to review some PHP hack I have submitted https://gerrit.wikimedia.org/r/#change,3149 :-] [13:44:33] I could but it would carry very little weight I'm afraid [13:46:32] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:50] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:49:31] apergos: I will ask RobH :-]] [13:50:03] more seriously, going to hold till some SF friend way e up [13:50:04] asking me to look at php code is like asking a blind person to describe the mona lisa. [13:50:22] i know it exists, but i have no idea what i am lookin at. [13:50:35] haha [13:50:54] which languages are you fluent with? Beside english and dmesg output ? :) [13:51:45] i am ashamed to say [13:52:05] vb [13:52:30] before I got the job just before wiki i was employed in a microsoft shop. [13:52:41] did .net hacking and sql mining [13:52:53] my hands will never be clean. [13:52:54] I know someone who has made that is expert field [13:53:13] and he does vb hacking as a second job :-] Pay off very well since real experts are rare [13:53:14] I was not a talented programmer. [13:53:19] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.178 seconds [13:54:06] most of my programming was bare minimum to plug in sql quieries in to applications our customers used with our organization [13:54:14] it was SQL heavy and VB light [13:54:26] and I don't recall any of my sql knowledge anymore. [13:54:30] ahah [13:54:36] it was well..... 7 years ago. [13:54:44] you should attempt to learn perl. 
that saves a lot of time when doing sysadmin [13:54:53] and I am sure you eventually come back to SQL very easily [13:54:59] heh, what about python! [13:55:06] or python even better [13:55:17] that would make mark_ happier, he is a python fan [13:58:01] and I just wrote a script using perl ... [13:59:37] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:59:46] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.504 seconds [14:00:49] RobH: do you have any apache skill ? :-) [14:01:01] got a funny mod_rewrite issue for you if so : https://gerrit.wikimedia.org/r/#change,3148 [14:01:54] yer just forcing https ? [14:05:03] it is already forced [14:05:15] but accessing HTTPS / we are redirect to HTTP / [14:05:21] yea, thats crappy. [14:05:30] and you are fixing that with this patch is what it looks like to me. [14:05:50] hmm I am wrong above sorry [14:06:03] well the commit message have it right. [14:06:11] yea, i see what you mean [14:06:13] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:06:26] it does a loop to and bback from http url for no reason due to bad rule [14:06:29] is how i am reading it [14:06:39] though its not really noticable to most users [14:06:49] yes? [14:06:53] yup. I suspect the :443 virtual host was copy pasted from the :80 virtualhost [14:07:07] indeed, thats the usual culprit, i will review and approve =] [14:07:08] probabl [14:07:27] probably nobody notice, the URL bar is just bilking a bit while the redirects are being done [14:07:33] New review: RobH; "Someone fixing apache host files, huzzah!" 
[operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3148 [14:07:36] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3148 [14:07:45] ideally we should use some apache configuration snippets [14:07:50] yea i never noticed until now and paid reallllly close attention [14:08:05] well, puppet has a way of generating vhost files, mark_ is presently toying iwth it [14:08:08] its not fully working yet [14:08:21] but once its done it should standardize all our vhost configurations on the cluster [14:08:31] puppet labs has a forge to host puppet "receipes" [14:09:01] \o/ [14:09:05] heh, i guess i need to push your change live eh? [14:09:29] if you want, though it is not urgent [14:09:38] will be live next time apache restart :-]]] [14:09:40] yea but when someone else goes to push later they will see your change [14:09:44] in puppet [14:09:51] and its confusing to folks cuz they dunno where it came from [14:10:13] so i will merge it into production on puppet now [14:11:49] and going to force puppet update on gerrit to ensure we didnt break it, i dont think we did but rather break it now than it do it later on auto puppet run [14:12:51] hrmm, i wonder if puppet makes apache restart to take the change... [14:13:11] hashar: check gerrit for the redirect if ya dont mind, it shoudl be live now (if puppet restarts apache) [14:13:34] i imagine you have the curl test running from the comment to test, i just load it browser which isnt proper test [14:13:44] checking [14:14:05] if its not live, it means puppet doesnt force apache to reload, and i need to do it manually, im just curious if it did [14:14:14] I am not sure apache is restated by puppet [14:14:18] we have to subscribe it or something [14:14:30] so its not live? 
[14:14:33] nop [14:14:37] ok, manual restart apache [14:14:41] still the same output for: curl -sIL http://gerrit.wikimedia.org/ | grep Location [14:14:44] ok, try now =] [14:14:54] $ curl -sIL http://gerrit.wikimedia.org/ | grep Location [14:14:54] Location: https://gerrit.wikimedia.org/ [14:14:54] Location: https://gerrit.wikimedia.org/r/ [14:14:56] well done!!! [14:15:00] one less redirect! [14:15:03] thx for fixing it =] [14:15:18] New review: Hashar; "Yeah one less redirect!!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3148 [14:15:47] a puppet template would be to lint the apache configuration and automatically reload apache [14:16:02] thanks for applying the patch !! [14:16:09] glad to help [14:18:46] shit its already 1018 [14:18:50] where is the day going =P [14:19:08] i even started early today [14:22:41] been awake at 6am, and it is 3pm already :-/ [14:25:00] i feel your pain [14:25:10] i may need to take a nap for lunch [14:25:31] though the weather is absolutely stunning today, i may go for a walk instead. [14:25:42] which is the opposite of napping, but the weather is just that awesome today. [14:27:53] i just noticed from chris's irc quit he has fios [14:27:56] i am jealous. [14:31:18] ahh man, i was wondering why i couldnt hit my labs instance, i have to apply the webserver port rules... [14:32:16] can we only apply a security group to an instance at creation? [14:33:04] grr, yes, how annoying. [14:33:12] there goes all my work in my lab instance, oh well. [14:33:31] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.087 seconds [14:43:00] hrmm [14:43:11] i fubar'd up my labs, robh1 is resolving to an IP that the instance isnt running [14:43:13] what the hell. 
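The curl check used earlier in this exchange can be turned into a hop counter. This is a rough sketch only: count_redirects is a hypothetical helper name, and it assumes each redirect response carries exactly one Location header.

```shell
#!/bin/sh
# Count the redirect hops in a header dump such as the output of
# `curl -sIL URL`. Reads headers on stdin and prints the number of
# Location: headers seen. count_redirects is a hypothetical helper name.
count_redirects() {
    # -i: header names are case-insensitive; -c: print only the match count.
    # grep exits non-zero when the count is 0, so mask that status.
    grep -ci '^location:' || true
}

# Example (needs network access):
#   curl -sIL http://gerrit.wikimedia.org/ | count_redirects
```

Comparing the count before and after a vhost change is a quick way to confirm that a redundant hop is gone.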
[14:46:07] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:54:13] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.621 seconds [14:54:22] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.960 seconds [15:00:58] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:01:07] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:01:34] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.45025310924 (gt 8.0) [15:04:46] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.883 seconds [15:04:50] RobH: it is sunny outside too [15:04:58] i missed the sun [15:05:01] RobH: I am out to read something [15:05:13] have a nice time =] [15:08:35] danke! [15:11:04] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:28:19] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.683 seconds [15:34:10] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.766 seconds [15:34:37] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:38:04] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 14.3817131667 (gt 8.0) [15:43:28] PROBLEM - Puppet freshness on mw1020 is CRITICAL: Puppet has not run in the last 10 hours [15:46:46] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:04:53] do we have a wiki page on how to fix broken ssh host keys after reinstallation? (broken from the client's perspective - i.e. on bast1001 it doesn't like ms-be5's new key) [16:06:06] apergos or Jeff_Green or RobH maybe you know? [16:06:07] :) [16:06:22] oh [16:06:33] you sorted out the cert and stuff right? [16:06:36] doesnt puppet copy over all the keys? 
[16:06:47] the host itself is happy, but everyone else still has the old key [16:06:55] it's complaining because of /etc/something or /root/.ssh/known_hosts or whatever? [16:07:06] i'm not aware of such a page [16:07:10] well maybe it's wrong but I edit the file by hand [16:07:28] yeah, I do that too, but I figured there was probably a "right" way to do it. [16:07:41] I have never heard of one [16:08:08] ah well. [16:08:12] i've always edited the offending known_hosts by hand, but I believe there is a ssh-* command to remove a specific key [16:08:25] i don't believe it matters which way you do it [16:08:30] oh yeah, I'd forgotten about that. [16:08:34] sure, but I guess you are looking to ddsh across all hosts or something [16:08:50] I doubt there is any such script [16:08:55] it's easy enough to do it by hand with 'vi known_hosts +1623' (or whatever line it's at) [16:08:57] maybe there should be [16:11:03] re. puppet--from what I've seen our config will add keys but not remove them [16:12:26] huh. I hadn't seen ssh-copy-id before. [16:12:34] (as I'm scanning ssh man pages) [16:12:45] :-) [16:19:57] !log updating dns for new domain wikimediacommons.pt (nameservers not yet pointed at us) [16:20:00] Logged the message, RobH [16:21:23] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.181 seconds [16:27:56] function clean_ssh_key() { linenumber=$1; echo -e "${linenumber}d\nw" | ed ~/.ssh/known_hosts 2> /dev/null; echo -e "${linenumber}d\nw" | ed /etc/ssh/ssh_known_hosts 2> /dev/null; } [16:27:59] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:10] not exactly elegant, but hey. [16:28:50] that relies on knowing the line number and it being the same on all hosts [16:29:10] I will bet dollars to donuts (I think I've never used that expression before!) that the line number varies [16:29:23] it's only for localhost, and the ssh error spits out the line number. 
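A possible variant of the one-liner above, sketched under the assumption that the offending line number is read from the ssh error message. drop_known_hosts_line is a hypothetical name; unlike the pasted function it takes the file as an argument, since a line number reported for one known_hosts file need not match the other.

```shell
#!/bin/sh
# Delete one numbered line from a single known_hosts-style file.
# drop_known_hosts_line is a hypothetical helper, not an existing tool.
drop_known_hosts_line() {
    lineno=$1
    file=$2
    # Accept only a positive integer line number.
    case $lineno in
        ''|*[!0-9]*|0)
            echo "usage: drop_known_hosts_line LINE FILE" >&2
            return 2
            ;;
    esac
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
    tmp=$(mktemp) || return 1
    # Write the file minus that line to a temp file, then copy it back in
    # place so the original file's owner and permissions are kept.
    sed "${lineno}d" "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Example (line number as reported by the ssh warning):
#   drop_known_hosts_line 1623 ~/.ssh/known_hosts
```

OpenSSH also ships `ssh-keygen -R hostname`, which removes a host's keys from a known_hosts file by name rather than by line number; that is probably the ssh-* command half-remembered above.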
[16:29:26] ok [16:29:34] oh I see, a little scriptlet [16:29:46] I was thinking of a toll one would run to clean up the whole cluster [16:29:46] i.e. you try and ssh somewhere, it says SCREWU! and you run that to clean it out. [16:29:49] *boom* [16:30:00] *tool [16:30:23] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.722 seconds [16:31:23] hrmph. neither the new_install key nor my key works on ms-be5. [16:32:00] well that sounds permissions-y no? [16:32:08] yeah, could be. [16:32:30] I'm gonna presumptively blame puppet [16:32:36] just because. [16:33:34] hmm. no login prompt on the console either. [16:33:58] hey cmjohnson1 - in what state did you leave ms-be5? [16:35:23] maplebed: i don't remember...let me plug in the cart [16:35:32] thanks. [16:40:44] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:40:45] maplebed: i left the testing software running...rebooting now [16:40:53] cool. tnx. [16:41:29] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [16:47:38] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 1.28 ms [16:47:43] maplebed: good to go [16:47:56] RECOVERY - DPKG on ms-be5 is OK: All packages OK [16:48:14] RECOVERY - Disk space on ms-be5 is OK: DISK OK [16:48:50] RECOVERY - RAID on ms-be5 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [16:50:56] RECOVERY - Puppet freshness on ms-be5 is OK: puppet ran at Wed Mar 14 16:50:46 UTC 2012 [17:08:18] * cmjohnson1 is moving to pmtpa [17:11:29] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.3287510084 [17:18:41] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.751 seconds [17:24:05] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [17:25:08] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:25:44] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: 
packet_loss_average is 9.79999766667 (gt 8.0) [17:26:02] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [17:31:53] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.57810478992 [17:33:05] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [17:33:05] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [17:34:01] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [17:34:46] New patchset: Lcarr; "Allowing icinga in sudoers as with nagios + group gammu" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3153 [17:34:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3153 [17:35:40] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [17:35:45] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3153 [17:35:48] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3153 [17:40:46] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.6636874167 (gt 8.0) [17:42:52] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.85144512605 [17:49:55] New patchset: Bhartshorne; "bumping up the number of replicator processes running on swift storage bricks to improve time to balance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3154 [17:50:07] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3154 [17:50:55] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3154 [17:50:58] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3154 [17:51:43] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [18:16:25] New patchset: Lcarr; "pushing http to http /icinga and https to https /icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3155 [18:16:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3155 [18:17:11] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3155 [18:17:14] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3155 [18:18:07] PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:21:12] New patchset: Bhartshorne; "dropping down to 2 from 4. put latency increased too much." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3156 [18:21:25] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3156 [18:21:33] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3156 [18:21:36] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3156 [18:23:13] PROBLEM - Swift HTTP on magnesium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:49] PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:24:16] PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:24:43] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.623 seconds [18:26:13] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.165386 (gt 8.0) [18:31:01] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:32:31] PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:35:49] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.601 seconds [18:36:34] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 20.0373309167 (gt 8.0) [18:41:24] New patchset: Asher; "db18,19 decom" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3157 [18:41:36] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3157 [18:42:16] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:52] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.51411680672 [18:49:37] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.532 seconds [18:55:15] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:56:54] New patchset: Lcarr; "making sure conf.d exists" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3158 [18:57:07] New patchset: Lcarr; "more icinga tweaks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3159 [18:57:12] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.112 seconds [18:57:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3158 [18:57:19] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3158 [18:57:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3159 [18:57:20] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3158 [18:57:35] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3159 [18:57:37] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3159 [18:58:36] !log ms-be5 is back in rotatino [18:58:39] Logged the message, Master [18:59:26] RobH: ms-be1 is reporting a different number of CPUs than the rest of the ms-be hosts. Do you know if that's hyperthreading or an actual difference? 
http://ganglia.wikimedia.org/latest/?r=20min&cs=&ce=&m=load_report&s=by+name&c=Swift+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [18:59:51] hyerpthreading [18:59:59] they are identical hosts and cpus [19:00:04] k. [19:00:51] so unless swift can make use of HT we should reboot and turn it off on them. (Or schedule to do so when convient) [19:01:02] since it tends to give a false impression, imho [19:01:29] but i dunno, can swift do anything with hyperthreading that is beneficial? [19:01:35] or does it really not matter [19:01:42] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.146 seconds [19:03:04] well, all I know is that ms-be1 is hurting much worse than the others. [19:03:20] I don't think it has anything to do with hyperthreading, but it is screwingc up my config [19:03:30] (there's a part of the swift config that says numworkers == numcpus) [19:03:43] so ms-be1 only has half the workers that the rest do. [19:03:57] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:06:30] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 335 seconds [19:06:39] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 344 seconds [19:15:48] PROBLEM - DPKG on db58 is CRITICAL: Connection refused by host [19:16:15] PROBLEM - Disk space on db58 is CRITICAL: Connection refused by host [19:16:33] PROBLEM - MySQL disk space on db58 is CRITICAL: Connection refused by host [19:17:27] PROBLEM - RAID on db58 is CRITICAL: Connection refused by host [19:17:36] PROBLEM - SSH on db58 is CRITICAL: Connection refused [19:18:21] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:21:39] RECOVERY - RAID on db58 is OK: OK: State is Optimal, checked 12 logical device(s) [19:21:48] RECOVERY - SSH on db58 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:22:15] RECOVERY - DPKG on db58 is OK: All packages OK 
[19:22:33] RECOVERY - Disk space on db58 is OK: DISK OK [19:22:51] RECOVERY - MySQL disk space on db58 is OK: DISK OK [19:28:15] PROBLEM - Swift HTTP on magnesium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:29:54] !log rebooting ms-be1 to enable hyperthreading (and make it the same as all the other ms-be hosts) [19:29:58] Logged the message, Master [19:30:57] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.885 seconds [19:30:57] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.729 seconds [19:32:27] PROBLEM - Host ms-be1 is DOWN: PING CRITICAL - Packet loss = 100% [19:37:15] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:15] RECOVERY - Host ms-be1 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [19:37:24] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:45:20] New patchset: Lcarr; "more icinga fixes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3160 [19:45:32] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3160 [19:46:26] hello [19:46:31] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3160 [19:46:34] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3160 [19:47:00] PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:27] PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:36] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.531 seconds [19:47:36] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.137 seconds [19:53:54] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:46] New patchset: Lcarr; "another apache updated" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3161 [19:55:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3161 [19:56:00] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3161 [19:56:02] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3161 [20:00:03] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:06:10] New patchset: Lcarr; "making check all spelled out for purging nagios resources" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3162 [20:06:18] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." 
[operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3162 [20:07:06] New patchset: Lcarr; "making check all spelled out for purging nagios resources" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3162 [20:07:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3162 [20:07:26] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3162 [20:07:29] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3162 [20:07:37] woosters: think this is pretty trivial https://rt.wikimedia.org/Ticket/Display.html?id=2631 [20:07:48] but would help wikisource :) [20:08:04] let me take a look and will get back to u [20:08:09] PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:09:04] woosters: tyvm [20:12:39] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.953 seconds [20:17:38] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:32] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.447 seconds [20:21:50] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.018 seconds [20:24:42] New patchset: Lcarr; "Remove default icinga conf file from apache" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3163 [20:24:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3163 [20:25:20] New patchset: Lcarr; "Remove default icinga conf file from apache" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3163 [20:25:32] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3163 [20:25:42] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3163 [20:25:44] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3163 [20:29:57] looks like we had a site bump (bunch of people complained and then back up) [20:30:32] PROBLEM - MySQL Replication Heartbeat on db12 is CRITICAL: CRIT replication delay 186 seconds [20:31:08] PROBLEM - MySQL Replication Heartbeat on db1017 is CRITICAL: CRIT replication delay 221 seconds [20:31:08] PROBLEM - MySQL Replication Heartbeat on db36 is CRITICAL: CRIT replication delay 221 seconds [20:31:08] PROBLEM - MySQL Replication Heartbeat on db1043 is CRITICAL: CRIT replication delay 221 seconds [20:31:08] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 221 seconds [20:31:26] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 240 seconds [20:31:53] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 265 seconds [20:31:53] PROBLEM - MySQL Replication Heartbeat on db52 is CRITICAL: CRIT replication delay 265 seconds [20:32:11] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 285 seconds [20:33:56] ^^^ is ok now [20:33:59] PROBLEM - MySQL Replication Heartbeat on db38 is CRITICAL: CRIT replication delay 391 seconds [20:34:17] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:34:44] RECOVERY - MySQL Replication Heartbeat on db12 is OK: OK replication delay 0 seconds [20:35:11] RECOVERY - MySQL Replication Heartbeat on db1017 is OK: OK replication delay 0 seconds [20:35:11] RECOVERY - MySQL Replication Heartbeat on db36 is OK: OK replication delay 0 seconds [20:35:11] RECOVERY - MySQL Replication Heartbeat on db1043 is OK: OK replication delay 0 seconds [20:35:11] RECOVERY - MySQL 
Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds [20:35:38] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds [20:35:56] RECOVERY - MySQL Replication Heartbeat on db52 is OK: OK replication delay 0 seconds [20:35:56] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [20:35:56] RECOVERY - MySQL Replication Heartbeat on db38 is OK: OK replication delay 0 seconds [20:36:14] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.614 seconds [20:36:23] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [20:36:23] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:16] New patchset: Asher; "disabling log_queries_not_using_indexes for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3164 [20:39:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3164 [20:41:00] !log disabled log_queries_not_using_indexes on all core dbs [20:41:03] Logged the message, Master [20:41:58] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3130 [20:42:01] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3130 [20:42:18] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3157 [20:42:21] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3157 [20:42:32] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:44:29] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.117 seconds [20:45:29] New patchset: Asher; "disabling log_queries_not_using_indexes for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3164 [20:45:41] New review: gerrit2; "Lint 
check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3164 [20:46:01] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3164 [20:46:03] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3164 [20:50:56] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:51:42] New review: Ryan Lane; "Let's make this a cron that generates a static file, so that we don't need to include php on the ger..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/3149 [20:57:14] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.924 seconds [21:03:23] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 2.372 seconds [21:09:32] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:09:41] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:13:06] New patchset: Hashar; "publicly list mediawiki extensions git repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3149 [21:13:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3149 [21:18:50] New patchset: Hashar; "publicly list mediawiki extensions git repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3149 [21:18:59] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [21:19:02] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3149 [21:20:51] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3149 [21:20:54] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3149 [21:24:05] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.406 seconds [21:24:05] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.122 seconds [21:25:24] Anyone familiar with xinetd? [21:29:40] Reedy: haven't used it for years and years sorry :-( [21:30:10] Just trying to finish replicating how we have the extdist remote client setup [21:30:31] xinetd says the service is running, but there's nothing listening on the right port, and hence, can't telnet into it [21:32:14] TimStarling: hello [21:32:20] hi [21:36:32] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:36:32] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:38:46] New patchset: Ryan Lane; "Upping tools again" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3165 [21:38:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3165 [21:40:06] New patchset: Lcarr; "fixing ordering for icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3166 [21:40:18] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3166 [21:40:49] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3166 [21:40:52] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3166 [21:40:53] New patchset: Ryan Lane; "Adding docroot for 443 on gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3167 [21:41:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3167 [21:41:10] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3167 [21:41:25] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3165 [21:41:27] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3167 [21:41:28] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3165 [21:46:22] New patchset: Ryan Lane; "Changing ls one-liner to only output extension names, rather than directory contents" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3168 [21:46:32] Error undeleting file: Could not connect to storage backend "swift-local-backend-copper". [21:46:34] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3168 [21:46:35] * AaronSchulz grumbles [21:47:27] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3168 [21:47:29] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3168 [21:49:27] New patchset: Bhartshorne; "trying to puppetize a cronjob to run the swift cleaner changed swiftcleanermanager to only allow one instance at a time (using a pidfile)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3134 [21:49:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3134 [21:52:10] New patchset: Lcarr; "removing icinga default conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3169 [21:52:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3169 [21:52:40] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3169 [21:52:43] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3169 [21:53:37] New patchset: Bhartshorne; "trying to puppetize a cronjob to run the swift cleaner changed swiftcleanermanager to only allow one instance at a time (using a pidfile)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3134 [21:53:49] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3134 [21:54:28] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.226 seconds [21:58:47] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3134 [21:58:49] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3134 [22:00:55] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:02:24] New patchset: Bhartshorne; "installing the swift cleaner on iron" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3170 [22:02:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3170 [22:02:40] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3170 [22:02:43] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3170 [22:09:10] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.817 seconds [22:09:19] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.094 seconds [22:10:37] New patchset: Bhartshorne; "correcting variable scope in template file, correcting path to conf file in cron invocation." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3171 [22:10:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3171 [22:11:07] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3171 [22:11:10] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3171 [22:21:29] New patchset: Bhartshorne; "trying with local scope" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3172 [22:21:41] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3172 [22:22:19] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3172 [22:22:22] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3172 [22:38:25] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:38:34] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:42:37] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.622 seconds [22:44:25] PROBLEM - Puppet freshness on virt4 is CRITICAL: Puppet has not run in the last 10 hours [22:48:28] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours [22:48:46] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:50:43] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 5.565 seconds [22:51:01] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 7.418 seconds [22:55:39] New patchset: Bhartshorne; "added option to ignore previous state when running swiftcleaner, fixed bug in pidfile detection" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3173 [22:55:52] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3173 [22:56:19] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3173 [22:56:22] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3173 [22:57:10] PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:57:10] PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:05:25] PROBLEM - Puppet freshness on virt2 is CRITICAL: Puppet has not run in the last 10 hours [23:11:17] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:11:35] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:13:23] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.334 seconds [23:13:32] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.164 seconds [23:14:17] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds [23:14:26] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds [23:16:52] !log installed the swiftcleaner to run daily from iron. see root's crontab for more info. 
[23:16:55] Logged the message, Master [23:22:14] PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:22:14] PROBLEM - Swift HTTP on zinc is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:31:50] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:32:08] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:34:05] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.726 seconds [23:37:50] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.003 seconds [23:42:38] PROBLEM - Swift HTTP on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:45:20] New patchset: Bhartshorne; "correcting scrubstate logic" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3177 [23:45:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3177 [23:46:30] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3177 [23:46:33] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3177 [23:48:29] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:48:29] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:50:26] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.631 seconds [23:50:26] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.636 seconds