[00:02:55] RECOVERY - Lucene on search1015 is OK: TCP OK - 0.027 second response time on port 8123 [01:09:19] New review: Reedy; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3238 [01:28:04] New patchset: Reedy; "Change doxygen checkout of core to checkout/update via git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2990 [01:28:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2990 [01:29:24] in puppet, if we add a directory and add all files in it recursively,would that also handle symlinks just as any other file? [01:34:13] New patchset: Reedy; "Change doxygen checkout of core to checkout/update via git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2990 [01:34:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2990 [01:34:56] otherwise I'd have to ensure => link and target => for each one if i have a lot fo them [01:35:38] New patchset: Reedy; "Change doxygen checkout of core to checkout/update via git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2990 [01:35:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2990 [01:39:58] ah.. 
http://projects.puppetlabs.com/issues/93 [01:40:16] 6 years ago :) ok [01:49:19] PROBLEM - Disk space on stafford is CRITICAL: DISK CRITICAL - free space: /var/lib/puppet 758 MB (3% inode=92%): [01:55:36] !log deleting puppet report files older than 60hours on stafford to free disk space [01:55:37] RECOVERY - Disk space on stafford is OK: DISK OK [01:55:37] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [01:55:40] Logged the message, Master [01:57:43] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [02:06:43] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [02:06:43] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [02:37:56] New patchset: Dzahn; "split swift-storage process monitoring into separate classes and call from role class (1st attempt)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3510 [02:38:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3510 [02:40:16] New patchset: Dzahn; "split swift-storage process monitoring into separate classes and call from role class (1st attempt)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3510 [02:40:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3510 [02:42:18] New review: Dzahn; "this is most likely not the final solution yet, but i hope it's at least the right direction. this i..." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/3510 [03:20:58] RECOVERY - Puppet freshness on linne is OK: puppet ran at Fri Mar 23 03:20:52 UTC 2012 [03:23:50] New patchset: Dzahn; "call swift proxy http monitoring from role class, remove hostname condition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3511 [03:24:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3511 [03:34:13] New patchset: Dzahn; "add misc::statistics::plotting to install 'ploticus' per RT-2163" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3512 [03:34:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3512 [03:44:04] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [03:46:10] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [03:47:20] New patchset: Dzahn; "add role class for statistics server, move includes from site.pp to role class and add plotting class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3513 [03:47:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3513 [03:49:02] New patchset: Dzahn; "add role class for statistics server, move includes from site.pp to role class and add plotting class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3513 [03:49:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3513 [03:49:52] New review: Dzahn; "ok? should the special accounts also be in a separate class?" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/3513 [04:47:25] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:49:22] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [05:13:41] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours [05:52:23] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [07:37:54] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 199 MB (2% inode=61%): /var/lib/ureadahead/debugfs 199 MB (2% inode=61%): [07:45:01] Reedy: hi [07:50:30] RECOVERY - Disk space on srv223 is OK: DISK OK [08:01:08] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/3272 [08:59:35] New patchset: Hashar; "Change doxygen checkout of core to checkout/update via git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2990 [08:59:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2990 [09:01:21] New review: Hashar; "I have tweaked Sam patch to name the git clone command ("clone mediawiki for doc") and use that to m..." 
[operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2990 [09:49:13] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 182 seconds [09:53:43] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 0 seconds [10:19:39] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk, [10:30:09] RECOVERY - udp2log processes on locke is OK: OK: all filters present [10:36:27] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk, [10:38:24] RECOVERY - udp2log processes on locke is OK: OK: all filters present [10:44:42] PROBLEM - udp2log processes on locke is CRITICAL: CRITICAL: filters absent: /a/squid/urjc.awk, [10:46:48] RECOVERY - udp2log processes on locke is OK: OK: all filters present [11:57:38] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [11:59:17] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:59:35] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [12:03:29] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:09:39] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [12:09:39] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [13:37:24] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 189 seconds [13:37:51] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 201 seconds [13:40:31] New review: Demon; "I'm wondering if we should move this somewhere else anyway. svn.wm.o doesn't feel like the right pla..." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2990 [13:41:00] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 276 MB (3% inode=61%): /var/lib/ureadahead/debugfs 276 MB (3% inode=61%): [13:45:57] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [13:46:24] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [13:48:57] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 197 MB (2% inode=61%): /var/lib/ureadahead/debugfs 197 MB (2% inode=61%): [13:55:42] RECOVERY - Disk space on srv221 is OK: DISK OK [13:57:30] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:01:24] RECOVERY - Disk space on srv222 is OK: DISK OK [14:01:42] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [14:27:03] New patchset: Hashar; "reindent / align hookconfig.py $filename hash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3526 [14:27:17] New patchset: Hashar; "abstract logic getting irc filename, add tests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3527 [14:27:23] somehow [14:27:27] I am a huge fan of git-review :-] [14:27:31] New patchset: Hashar; "support project wildcard for irc notifications" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3528 [14:27:46] New patchset: Hashar; "use wildcards for gerrit IRC notifications" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3529 [14:28:00] New patchset: Hashar; "gerrit IRC bot no join #wikimedia-dev" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3530 [14:28:14] New patchset: Hashar; "analytics/integration IRC notificiation in #wikimedia-dev" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3531 [14:28:28] New patchset: Hashar; "remove +x bits from files of /srv/org/mediawiki/integration" [operations/puppet] (production) 
- https://gerrit.wikimedia.org/r/3433 [14:28:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3526 [14:28:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3527 [14:28:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3528 [14:28:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3529 [14:28:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3530 [14:28:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3531 [14:28:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3433 [14:29:03] New review: Demon; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3529 [14:30:05] New review: Demon; "I think we should keep the bot out of -dev. Part of the reason -dev exists is to have discussion wit..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/3530 [15:01:02] New review: Hashar; "A jenkins job is somehow on my roadmap. I would love to have a doc.mediawiki.org domain, probably w..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2990 [15:10:20] New review: Hashar; "The reason is to move notifications for analytics/* and integration/* (see follow up https://gerrit...." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/3530 [15:14:47] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours [15:18:59] RECOVERY - RAID on db1020 is OK: OK: State is Optimal, checked 2 logical device(s) [15:19:08] RECOVERY - DPKG on db1020 is OK: All packages OK [15:19:08] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay seconds [15:19:26] RECOVERY - MySQL Slave Running on db1020 is OK: OK replication [15:19:26] RECOVERY - Disk space on db1020 is OK: DISK OK [15:19:26] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay seconds [15:19:44] RECOVERY - MySQL disk space on db1020 is OK: DISK OK [15:19:53] RECOVERY - Full LVS Snapshot on db1020 is OK: OK no full LVM snapshot volumes [15:20:05] * mark cries [15:20:06] LVS [15:20:20] RECOVERY - MySQL Idle Transactions on db1020 is OK: OK longest blocking idle transaction sleeps for seconds [15:20:29] RECOVERY - SSH on db1020 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:20:38] RECOVERY - MySQL Recent Restart on db1020 is OK: OK seconds since restart [15:25:35] PROBLEM - Host db1020 is DOWN: PING CRITICAL - Packet loss = 100% [15:26:56] RECOVERY - Host db1020 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms [15:29:05] !log db20 memory error on raid controller resolved with firmware updarte [15:29:09] Logged the message, RobH [15:30:19] !log db1020 can go back into whatever rotation Asher wants it in [15:30:23] Logged the message, RobH [15:37:40] !log firmware updating on cp1017, no one touch it please [15:37:44] Logged the message, RobH [15:38:47] RECOVERY - Host cp1017 is UP: PING OK - Packet loss = 0%, RTA = 27.32 ms [15:43:04] PROBLEM - RAID on cp1017 is CRITICAL: Connection refused by host [15:43:58] PROBLEM - Disk space on cp1017 is CRITICAL: Connection refused by host [15:44:18] !log shutting down magnesium for disk swap [15:44:22] Logged the message, RobH [15:44:25] PROBLEM - Frontend 
Squid HTTP on cp1017 is CRITICAL: Connection refused [15:45:28] PROBLEM - SSH on cp1017 is CRITICAL: Connection refused [15:45:28] PROBLEM - Backend Squid HTTP on cp1017 is CRITICAL: Connection refused [15:45:46] PROBLEM - DPKG on cp1017 is CRITICAL: Connection refused by host [15:49:04] PROBLEM - Host magnesium is DOWN: PING CRITICAL - Packet loss = 100% [15:53:27] !log magnesium coming back online [15:53:31] Logged the message, RobH [15:53:52] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [15:59:34] RECOVERY - Host magnesium is UP: PING OK - Packet loss = 0%, RTA = 26.91 ms [16:03:26] hi [16:03:50] end of the log is interesting http://bots.wmflabs.org/~petrb/logs/%23wikimedia-dev/20120323.txt [16:04:23] would it be possible register domain wm.org for that? :) [16:04:53] I am pretty sure lot of people objected against it [16:05:00] really? [16:05:01] why [16:05:15] it would be quite useful, I myself would use it often [16:05:21] plus wm.org is not free :) [16:05:25] oh [16:05:29] it just doesn't lookup [16:05:35] Created On:07-Mar-2000 11:50:28 UTC [16:05:36] Last Updated On:06-Feb-2012 21:35:14 UTC [16:05:36] Expiration Date:07-Mar-2013 11:50:28 UTC [16:05:40] hm [16:05:47] try the whois command [16:05:50] heh [16:05:55] whois wm.org [16:06:12] ok anyway if there was a support for that, we could probably find a short url [16:06:37] there was a thread on wikitech-l about short url some months ago [16:06:50] end result was "no" :-D [16:06:52] there needs to be a search engine for wikitech [16:06:59] link me [16:07:13] wait, if it was few months ago it could be in my inbox [16:07:49] PROBLEM - NTP on cp1017 is CRITICAL: NTP CRITICAL: No response from NTP server [16:08:03] or on some mail archiver such as markmail [16:08:23] [ANN] MediaWiki Short URL Builder configuration to ? [16:08:24] this one? 
[16:08:40] I can't find anything else [16:08:59] !log raid rebuilding on magnesium, however swift stuff is kind of black box mystery right now to me, need Ben to review magnesium later for that [16:09:03] Logged the message, RobH [16:10:20] I can't think of a reason why not to implement it [16:10:30] it couldn't cause any harm [16:12:21] the idea of a url shortener came up before [16:12:36] question is why it didn't pass [16:12:39] it was shot down because no one had the time or inclination to research and find an open source url software [16:12:51] since it would require someone in ops to do it, and we have a ton of other stuff right now [16:12:56] now we have labs though [16:12:56] oh I am talking about something else [16:13:07] short domain, like wp.org [16:13:10] ahh, i misunderstood your conversation [16:13:15] wp.org/databasename/pagename [16:13:18] example: [16:13:18] versus what, wikipedia? [16:13:22] wp.org/enwp/Article [16:13:32] wp.org/frwp/Article [16:13:37] and such [16:13:39] that sounds even more compelex than a shortening service. [16:13:43] complex even [16:13:51] it would be quite simple to create a redirecting system for that [16:13:55] would have to write in a bunch of redirect rules [16:14:02] and our apache config files need a major overhaul [16:14:08] it would just redirect wp.org/enwp/* to en.wikipedia.org/wiki/* [16:14:12] they are a damned mess =P [16:14:29] it could live on own server on labs [16:14:38] 9 versus 17 characters [16:14:40] we could just redirect the domain to isntance on labs [16:14:50] it could be tested on labs [16:14:58] but it cannot live on labs as a public facing long term service. [16:15:00] labs isnt for that. 
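Note on the redirect idea above ("it would just redirect wp.org/enwp/* to en.wikipedia.org/wiki/*"): a minimal sketch of that mapping, not anything that was actually deployed. The function name `expand_short_path`, the `WIKI_HOSTS` table, and the `https` scheme are assumptions made for illustration; only the `enwp` and `frwp` codes and the `/wiki/` target path come from the conversation.

```python
from urllib.parse import quote

# Hypothetical table of short codes to wiki hosts. Only "enwp" and "frwp"
# are mentioned in the discussion; the host names are assumed expansions.
WIKI_HOSTS = {
    "enwp": "en.wikipedia.org",
    "frwp": "fr.wikipedia.org",
}


def expand_short_path(path):
    """Map a short path such as '/enwp/Article' to a full article URL.

    Returns None when the code is unknown or the page name is missing,
    so the caller can answer with a 404 instead of a redirect.
    """
    code, _, title = path.lstrip("/").partition("/")
    host = WIKI_HOSTS.get(code)
    if host is None or not title:
        return None
    # Percent-encode the title, leaving '/' (subpages) and ':' (namespaces) as-is.
    return "https://{}/wiki/{}".format(host, quote(title, safe="/:"))


if __name__ == "__main__":
    for short in ("/enwp/Article", "/frwp/Article", "/xxwp/Article"):
        print(short, "->", expand_short_path(short))
```

A front-end web server would call something like this per request and issue an HTTP redirect to the returned URL. Keeping the code table as data rather than as per-wiki rewrite rules sidesteps the "bunch of redirect rules" concern raised in the channel, at the cost of running a small service.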
[16:15:05] probably not [16:15:23] but if someone wanted to work on making that work, labs would be the testbed indeed [16:15:27] I know, I just meant, that if no one from ops is willing to make it, someone from volunteers would like make it on labs [16:15:33] then you could move it somewhere on prod [16:15:43] yep =] [16:16:03] Im not saying it can happen thats not my call, but that sounds exactly why labs was made =] [16:16:25] and if some volunteer does all the work to make it function, it sounds like a win win situation [16:16:37] problem is with domain [16:16:44] though you guys are only gaining 9 versus 17 characters ;] [16:17:01] yes, a shorter way would be maybe better, if there is any [16:17:14] I myself use enwp.org very often [16:17:24] looks like telepathy.com owns wp.org [16:17:39] yes, we would need to find a free domain for this [16:17:44] if there is any [16:18:08] maybe it could be something like wp.org/en/Article [16:18:12] that would be shorter [16:18:37] maybe there is a free tld for wp [16:18:50] wp.it would sound funny :) [16:18:54] wiki it :D [16:21:49] en.wikipedia.org/wiki wp.it/en/ 21 vs 8 chars [16:21:54] :) [16:23:32] wi.ki/enwp/ [16:23:35] :) [16:23:41] I like that [16:24:06] mobile users would love it [16:25:30] if there is no one going to make it I will likely register that domain myself and dedicate it to wikimedia project :D [16:27:15] that would be cool [16:27:29] i actually do like wi.ki - where is .ki ? [16:27:39] (and i know the irony that i could use wp to find that out) [16:28:17] but it's taken :/ [16:28:27] some company https://domaininfo.com/ registered it [16:28:34] they say: wi.ki No Please contact Domaininfo for more information [16:28:42] probably expensive [16:29:24] bitches [16:29:26] yeah [16:29:41] let's find some... 
:) [16:33:57] petan|wk: we can register it [16:34:09] if you come up with a good one, just drop me an email about it if im not on irc [16:34:13] i handle all the domain mgmt for wmf [16:34:29] otherwise if you do, when we go to put it in real service, we will be asking you to migrate it to us anyhow [16:34:34] may as well skip a step ;] [16:34:57] (either works though) [16:37:49] RECOVERY - RAID on cp1017 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [16:38:16] RECOVERY - DPKG on cp1017 is OK: All packages OK [16:38:16] RECOVERY - Disk space on cp1017 is OK: DISK OK [16:38:25] RECOVERY - SSH on cp1017 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:40:31] RECOVERY - Lucene on search1001 is OK: TCP OK - 0.027 second response time on port 8123 [16:41:25] RECOVERY - NTP on cp1017 is OK: NTP OK: Offset 0.04408991337 secs [16:44:17] !log cp1017 memory error seems to have cleared post firmware update, will keep an eye on it for the rest of the day [16:44:20] Logged the message, RobH [16:44:25] !log cp1019 in middle of firmware update, please dont touch [16:44:28] Logged the message, RobH [16:44:28] New review: Demon; "Ah ok, all makes sense now." 
[operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3530 [16:45:28] RECOVERY - Host cp1019 is UP: PING OK - Packet loss = 0%, RTA = 26.49 ms [16:45:33] New review: Demon; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3433 [16:49:49] PROBLEM - Disk space on cp1019 is CRITICAL: Connection refused by host [16:50:16] PROBLEM - Frontend Squid HTTP on cp1019 is CRITICAL: Connection refused [16:50:52] PROBLEM - RAID on cp1019 is CRITICAL: Connection refused by host [16:51:28] PROBLEM - SSH on cp1019 is CRITICAL: Connection refused [16:51:28] PROBLEM - Backend Squid HTTP on cp1019 is CRITICAL: Connection refused [16:51:37] PROBLEM - DPKG on cp1019 is CRITICAL: Connection refused by host [16:57:41] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:58:08] RECOVERY - Frontend Squid HTTP on cp1017 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.162 seconds [16:59:02] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:59:11] RECOVERY - Backend Squid HTTP on cp1017 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 0.162 seconds [16:59:29] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:59:38] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 424 seconds [17:00:23] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[17:00:41] PROBLEM - Host cp1019 is DOWN: PING CRITICAL - Packet loss = 100% [17:26:20] RECOVERY - RAID on cp1019 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [17:26:29] RECOVERY - Host cp1019 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms [17:27:23] RECOVERY - SSH on cp1019 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [17:27:41] RECOVERY - DPKG on cp1019 is OK: All packages OK [17:28:08] RECOVERY - Disk space on cp1019 is OK: DISK OK [17:35:38] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 22.8082502521 (gt 8.0) [17:41:24] New patchset: Pyoungmeister; "spam reduction" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3638 [17:41:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3638 [17:43:15] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3638 [17:43:18] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3638 [17:43:35] New patchset: Ryan Lane; "Adding andrewb as a root" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3639 [17:43:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3639 [17:52:24] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3639 [17:52:27] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3639 [17:55:44] RECOVERY - Frontend Squid HTTP on cp1019 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 2.824 seconds [17:56:29] RECOVERY - Backend Squid HTTP on cp1019 is OK: HTTP OK HTTP/1.0 200 OK - 27399 bytes in 2.300 seconds [18:11:59] !log removed ms1 and most of ms2 from the production swift rings. no effect expected. 
[18:12:02] Logged the message, Master [18:17:03] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 2.98938149123 [18:19:09] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:20:30] PROBLEM - Host cp1019 is DOWN: PING CRITICAL - Packet loss = 100% [18:21:06] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [19:33:57] New patchset: Pyoungmeister; "adding log rotation for search indexer. how novel." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3660 [19:34:10] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3660 [19:51:52] !log all power strips in eqiad are now properly grounded [19:51:56] Logged the message, RobH [19:53:12] yay! [19:53:46] now just every door, side panel, roof, and roof top cable manager to go =P [19:54:14] my back is going to be killing me tonight [19:54:19] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 7948912 seconds since restart [19:54:19] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [19:54:25] to put the screw in the bottom of the power strips required lying on the floor [19:54:33] and reaching into the underside of the enclosures [19:54:38] im gettin too old for this shit ;P [19:57:03] these are the ones in eqiad? [19:57:09] oh yeah I remember how those are mounted [19:57:11] heh [19:57:19] New patchset: Pyoungmeister; "adding log rotation for search indexer. how novel." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3660 [19:57:34] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." 
[operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3660 [19:59:00] yea since we had to put them upside down [19:59:05] the grounding is a pain to get to [20:00:42] New patchset: Pyoungmeister; "adding log rotation for search indexer. how novel." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3660 [20:00:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3660 [20:03:01] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3660 [20:03:04] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3660 [20:09:06] RobH: do you have kneepads? [20:09:15] they seem like they might be useful to keep in the dc [20:10:26] they would [20:10:30] but i dont =P [20:10:56] buy some this weekend? [20:11:10] scraped and bruised knees are no fun [20:12:01] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: Device does not support ifTable - try without -I option [20:13:58] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 10.65.0.1, interfaces up: 32, down: 0, dormant: 0, excluded: 0, unused: 0 [20:16:13] the more work i see done by equinix the less impressed i am with them [20:16:20] they put in the fiber trays wrong, didnt match the diagram [20:16:22] so they fixed it [20:16:38] but now the tops of the racks have plastic bits, bolts, and paper bits all over them from the techs doing the install [20:16:45] they didnt even bother to clean up after themselves. [20:18:30] take a pix of it and send it to eriksilver [20:29:02] i already cleaned it up [20:37:05] it really never entered my mind to take photos to escalate, its just trash so i swept it up and cleaned it away [20:37:14] though i suppose i should have snapped a few shots of their lack of pride. [20:37:25] i just cannot really comprehend doing something so half assed. 
[20:38:08] mark: are you still here? [20:40:13] ^demon|away: why does FlaggedRevs have some bogus directories in git? [20:41:45] robh: you miss miquel and renee? [20:45:59] when miguel did something it was clean. [20:46:15] so yes. [20:46:15] heh [20:51:35] woosters: i just CC'd you and mark on another issue. [20:52:26] * not surprised by their inaction [20:54:57] <^demon|away> AaronSchulz: It does? [20:55:40] * AaronSchulz moves to #mediawiki [21:41:48] anybody have a minute for a puppet review? [21:42:25] New patchset: Bhartshorne; "moving swift ring files into puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3672 [21:42:35] mutante: it's already saturday, isn't it? [21:42:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3672 [21:42:47] maybe LeslieCarr? ^^ [21:42:57] looking [21:45:27] New review: Lcarr; "LGTM" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/3672 [21:45:32] thanks. [21:45:48] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3672 [21:45:51] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3672 [21:47:33] Ok, now all the rooftop cable managers are properly grounded. [21:47:43] just every single door and enclosure roof left =P [21:52:16] LeslieCarr: you aer the only person i can think of that may be using serial in eqiad right now [21:52:28] just making sure you arent, as i am going to take down the serial console to ground it [21:52:34] RobH: yay, yes I often use serial in eqiad, but I am not currently [21:52:37] thanks for the check [21:52:44] it has no grounding, so i need to tie the wire into the screws of the rackmount kit [21:52:46] cool, thx [21:53:17] LeslieCarr: i didnt check rt, but just checking [21:53:26] were you able (or mark) to setup the virt sandbox stuff? [21:53:35] and allocate a subnet? 
[21:53:54] ryan is having major issues with gettin gthe ciscos to work, and its easier for me to troubleshoot servers local to me =] [21:54:03] rather than the single virt5 in tampa [21:54:08] i haven't yet, is that high pri for you ? [21:54:18] the virt1001-1008, si ? [21:54:18] I am not sure if I will get to it tomorrow [21:54:24] but i will be on it by monday [21:54:47] its high priority in that we need to figure them out, but since ryan is gone on vacation, not sure if i fix it there will be any further moviement [21:54:50] movement even [21:55:04] but i also dunno how hard it will be to make work, so ;] [21:55:08] okay cool [21:55:26] well i will make it a get to it by monday thing then :) [21:55:29] ie: dont work late or nuthing, but if you have a bunch of medium priority stuff this is in there too [21:55:34] cool [21:55:39] i want a [21:55:41] PA [21:55:50] if you end up fixing monday for me to work on it on tuesday, that is cool =] [21:56:15] ooo, intern …. "come do all my organizing and scan business cards and make my tea, i won't pay you but you can inflate it on your resume" [21:56:17] i have enough other items to keep me busy for awhile, i just know they are a concern, so i would like to have ryan come back to a bunch of working ciscos that only lack his labs software sertup =] [21:56:24] hehe [21:56:29] i'm sure he'd like that [21:56:31] techops intern would be nice [21:56:54] New patchset: Bhartshorne; "specifying user and group for all swift::base files; without this they're getting set to root:root" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3673 [21:57:07] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." 
[operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/3673 [21:58:09] New patchset: Bhartshorne; "specifying user and group for all swift::base files; without this they're getting set to root:root" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3673 [21:58:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3673 [21:58:25] !log scs-a8-eqiad coming down for re-grounding [21:58:28] Logged the message, RobH [21:58:37] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3673 [21:58:40] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3673 [21:58:57] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [22:00:54] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [22:10:57] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [22:10:57] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [22:13:02] *wave* [22:13:39] any update on the state of the boot issues with the cisco machines? (RobH i guess?) [22:14:48] dschoon: so its on my list to tackle on monday or tuesday, as it requires some network stuff to be done by our net admins [22:14:56] and since ryan is on vacation, even if i fix it today [22:15:01] they wont go in service until he gets back [22:15:21] though i paln to have them up with the OS installed and ready for him to push into service when he gets back =] [22:15:42] sorry if its not a fantastic answer ;[ [22:16:39] but it is a high priority [22:16:53] great, thank you! [22:16:59] it's not a terrible answer. 
[22:17:02] unfortunately, its on my plate, and i had to fix our grounding issues in our cage this week as its causing undesirable operation errors [22:17:21] now if i get an alarm for what i thought was a grounding issue since i fixed it [22:17:33] I care because getting the labs machine up and running is a prerequisite for getting my machines up. [22:17:34] im just gonna burst into tears and perhaps slam my head in an enclosure door =P [22:17:35] so! [22:17:38] aw :( [22:17:40] no worries [22:17:56] i keep losing connection to some of our power strips snmp traps [22:18:13] i've been occupied with budget stuff [22:18:18] i half fixed the grounding and it seemed to lessen the issue [22:18:27] so it's not time lost. i just want to be able to get into it once that lets up [22:18:29] i wont really inkow though until I get a good 5 days of no alarms [22:18:33] yep =] [22:18:42] yo! [22:24:28] !log scs-a1-eqiad back online [22:24:32] Logged the message, RobH [22:48:59] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay seconds [22:48:59] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay seconds [22:49:26] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [22:55:08] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 21069 seconds [22:55:17] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 21054 seconds [23:30:23] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: No response from remote host 10.65.0.1 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [23:32:20] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 10.65.0.1, interfaces up: 32, down: 0, dormant: 0, excluded: 0, unused: 0 [23:32:39] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3272 [23:32:41] Change merged: Lcarr; [operations/puppet] (production) - 
https://gerrit.wikimedia.org/r/3272