[00:09:27] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2425*
[00:13:39] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2413*
[00:13:58] AaronSchulz: http://wikitech.wikimedia.org/view/Swift/Deploy_Plan_-_Originals
[00:14:14] woosters: ^^^ ctw1
[00:19:23] hi maplebed
[00:19:46] woosters: aaron and I talked through stuff that is between us and originals in swift; I wrote a summary at that URL.
[00:20:24] thks
[00:26:15] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2413*
[00:29:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:35:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.018 seconds
[00:50:53] ^^^ that power nagios alert is ok for now; robh is on it.
[00:55:48] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 223 seconds
[00:55:57] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 226 seconds
[00:56:15] PROBLEM - MySQL Replication Heartbeat on db24 is CRITICAL: CRIT replication delay 267 seconds
[00:56:42] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 287 seconds
[01:11:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:15:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.169 seconds
[01:36:18] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours
[01:38:15] RECOVERY - MySQL Replication Heartbeat on db24 is OK: OK replication delay 0 seconds
[01:38:42] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 0 seconds
[01:39:54] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds
[01:40:03] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds
[01:42:31] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection refused
[01:43:16] PROBLEM - LVS Lucene on search-prefix.svc.eqiad.wmnet is CRITICAL: Connection refused
[01:45:30] hrmm
[01:45:34] notpeter: are you about?
[01:45:50] poking at search doesn't sound fun in the least ;]
[01:46:41] that's not in prod yet
[01:46:43] heh, yay!
[01:46:47] got your sms ;]
[01:50:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:54:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.006 seconds
[01:55:31] New review: Demon; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/4083
[02:10:07] PROBLEM - udp2log log age on emery is CRITICAL: CRITICAL: log files /var/log/squid/countries-100.log, /var/log/squid/countries-10.log, /var/log/squid/countries-1.log, have not been written to in 6 hours
[02:30:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:35:19] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[02:36:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.618 seconds
[04:53:14] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 3795 MB (3% inode=99%):
[05:55:03] New patchset: Tim Starling; "Split apache out of local syslogs and limit file sizes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4149
[05:55:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4149
[06:00:23] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 3272 MB (2% inode=99%):
[06:02:47] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:04:44] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[06:11:29] PROBLEM - Puppet freshness on searchidx1001 is CRITICAL: Puppet has not run in the last 10 hours
[06:14:21] New review: Tim Starling; "The rsyslog configuration is tested, except for the rotation script." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4149
[06:14:23] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4149
[06:49:24] RECOVERY - udp2log log age on emery is OK: OK: all log files active
[06:49:33] RECOVERY - udp2log processes on emery is OK: OK: all filters present
[07:01:06] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:05:09] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 0.013 seconds
[07:24:14] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:30:23] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 0.010 seconds
[07:52:35] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:54:32] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[07:59:27] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4096
[07:59:30] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/4096
[08:24:21] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours
[08:30:21] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[08:40:24] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[08:40:24] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[08:55:24] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[08:55:24] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[10:24:41] ACKNOWLEDGEMENT - Host cp1017 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT #2727 - has alarm status
[10:58:15] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2413*
[11:10:51] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2413*
[11:39:17] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours
[12:00:26] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[12:10:47] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[12:11:50] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2425*
[12:16:02] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2438*
[12:36:35] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[12:37:38] PROBLEM - Puppet freshness on db9 is CRITICAL: Puppet has not run in the last 10 hours
[12:45:17] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2413*
[13:16:47] New patchset: Tim Starling; "Add notify and change conf file mode" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4160
[13:17:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4160
[13:17:27] New patchset: ArielGlenn; "continue rsyncs after failed job but don't delete related dirs" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/4161
[13:18:05] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4160
[13:18:07] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4160
[13:22:40] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4161
[13:22:43] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/4161
[13:26:25] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2413*
[13:31:11] New patchset: Mark Bergsma; "Initial implementation of a consistent hashing director" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4162
[13:31:14] New patchset: Tim Starling; "Revert "Add notify and change conf file mode"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4163
[13:31:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4163
[13:31:40] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4163
[13:31:43] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4163
[13:35:26] !log manually reloaded rsyslogd on all apaches
[13:35:28] Logged the message, Master
[13:53:11] New patchset: Jgreen; "checking in some search QA scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4164
[13:53:26] New patchset: Jgreen; "install search qa scripts on iron" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4165
[13:53:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4164
[13:53:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4165
[13:53:41] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4164
[13:53:43] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4164
[13:54:21] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4165
[13:54:24] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4165
[14:02:29] New patchset: Hashar; "gerrit: use uid to login instead of display name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4166
[14:02:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4166
[14:03:58] New review: Hashar; "Dear lord of Gerrit," [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4166
[14:07:34] New patchset: Pyoungmeister; "oh, dependencies..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4167
[14:07:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4167
[14:09:45] New patchset: Pyoungmeister; "oh, dependencies..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4167
[14:10:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4167
[14:10:21] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4167
[14:10:23] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4167
[14:12:46] RECOVERY - Puppet freshness on searchidx1001 is OK: puppet ran at Tue Apr 3 14:12:25 UTC 2012
[14:14:28] New patchset: Mark Bergsma; "Initial implementation of a consistent hashing director" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4168
[14:14:51] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4168
[14:15:36] New patchset: Mark Bergsma; "Initial implementation of a consistent hashing director" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4162
[14:40:31] RECOVERY - DPKG on cp1021 is OK: All packages OK
[14:40:31] PROBLEM - Varnish HTTP upload-frontend on cp1021 is CRITICAL: Connection refused
[15:09:48] New patchset: Pyoungmeister; "putting toe in to test the waters of scoped var. if doesn't blow up, will change rest in this conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4170
[15:10:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4170
[15:11:52] RECOVERY - Varnish HTTP upload-frontend on cp1021 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.054 seconds
[15:17:23] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4170
[15:17:26] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4170
[15:19:36] New patchset: Mark Bergsma; "Initial implementation of a consistent hashing director" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4162
[15:24:17] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Tue Apr 3 15:24:14 UTC 2012
[15:29:59] PROBLEM - Varnish HTTP upload-frontend on cp1021 is CRITICAL: Connection refused
[15:30:26] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2425*
[15:32:14] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=63%):
[15:32:32] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 89 MB (1% inode=66%):
[15:33:44] Rarrgh ^
[15:34:38] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 206 MB (2% inode=63%):
[15:34:47] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 207 MB (2% inode=66%):
[15:40:38] RECOVERY - Disk space on srv222 is OK: DISK OK
[15:40:56] RECOVERY - Disk space on srv221 is OK: DISK OK
[15:42:46] New patchset: Pyoungmeister; "scopin' more vars, cleaning up white space" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4174
[15:43:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4174
[15:43:31] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4174
[15:43:33] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4174
[15:46:40] RECOVERY - Disk space on srv219 is OK: DISK OK
[15:46:49] RECOVERY - Disk space on srv223 is OK: DISK OK
[16:06:27] New patchset: RobH; "added cisco ucs c250 m1 server ttyS0-115200 entries" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4176
[16:06:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4176
[16:07:22] New patchset: Jgreen; "added search qa mode where we can test through LVS rather than direct to lucene hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4177
[16:07:31] RECOVERY - Varnish HTTP upload-frontend on cp1021 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds
[16:07:38] New review: RobH; "self review is the best review" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4176
[16:07:38] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4176
[16:07:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4177
[16:08:16] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4177
[16:08:19] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4177
[16:09:08] why is stafford asking for login when suckpuppet updates its puppet repo =/
[16:09:32] i mean, i know the pass, but odd that its suddenly asking.
[16:10:20] !log updating brewster to use new dhcp files for cisco, no more local hackin.
[16:10:22] Logged the message, RobH
[16:13:09] New patchset: Mark Bergsma; "Initial implementation of a consistent hashing director" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4162
[16:21:38] New patchset: Jgreen; "really an impressive collection of typos, fine work!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4180
[16:21:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4180
[16:22:17] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4180
[16:22:20] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4180
[16:23:44] robh: plz bring down srv230
[16:24:10] !rt 2759
[16:24:10] https://rt.wikimedia.org/Ticket/Display.html?id=2759
[16:24:26] hrmm, its in mid rack
[16:24:40] think you are juggling it to another phase or removing from rack?
[16:24:51] juggling to a new phase
[16:25:14] cool, ok lemme ensure its not an image scaler then will be able to take it down
[16:26:30] ok, its a general use apache
[16:26:48] !log shutting down srv230 for power phase move per rt 2759
[16:26:50] Logged the message, RobH
[16:26:59] cmjohnson1: when it powers down its all yours
[16:27:12] once you are done moving it and power looks good, lemme know and I can ensure its back in service pool(s)
[16:27:16] k..if this doesn't work..i am going to have to remove srv226
[16:27:26] but let's hope this works first
[16:28:16] New patchset: Jgreen; "more typos" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4181
[16:28:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4181
[16:28:51] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4181
[16:28:54] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4181
[16:29:43] PROBLEM - Host srv230 is DOWN: PING CRITICAL - Packet loss = 100%
[16:31:30] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Can not connect to 10.0.2.230:11000 (Connection refused)
[16:32:06] RECOVERY - Host srv230 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[16:32:15] PROBLEM - NTP on srv230 is CRITICAL: NTP CRITICAL: Offset unknown
[16:33:00] PROBLEM - Apache HTTP on srv230 is CRITICAL: Connection refused
[16:33:36] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[16:34:35] robh: d2 is in better shape but I could probably juggle another srv to a different phase to better balance...check srv230 and if you could bring down for me
[16:34:42] srv231
[16:35:22] it looks like you need to move something from YZ to YX
[16:35:31] that is srv 231
[16:36:01] make that 237
[16:36:18] robh ^
[16:36:21] i dont want to bring down more servers until 230 is back.
[16:36:27] RECOVERY - NTP on srv230 is OK: NTP OK: Offset 0.01653683186 secs
[16:36:27] checking it now
[16:37:03] RECOVERY - Apache HTTP on srv230 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time
[16:37:07] !log srv230 back in rotation
[16:37:09] Logged the message, RobH
[16:38:06] PROBLEM - Varnish HTTP upload-frontend on cp1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:38:40] !log bringing down srv237 for phase balancing
[16:38:42] Logged the message, RobH
[16:38:56] cmjohnson1: when it powers off its all yours
[16:38:58] k
[16:42:00] PROBLEM - Host srv237 is DOWN: PING CRITICAL - Packet loss = 100%
[16:43:34] robh: better now but z is still just over 20A. we can still add to xy...let me know what you think
[16:43:39] RECOVERY - Host srv237 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms
[16:46:11] cmjohnson1: 20amp is ok
[16:46:21] if you look at http://ps1-d2-sdtpa.mgmt.pmtpa.wmnet/ it shows the warning levels
[16:46:41] they are 18.change, 19/20 and 19/20.
and 19/20 [16:46:53] so the phases are in balance now [16:47:05] atleast close enough to clear any alarms [16:47:21] i did see that...but z phase goes over 20A from time to time [16:47:51] PROBLEM - Apache HTTP on srv237 is CRITICAL: Connection refused [16:48:25] yea, but it has to go up quite a bit from there to go into full alarm [16:48:36] so the ticket should be able to get resolved, now later today during peak will be the real test [16:48:44] you also have another ticket for pmtpa [16:48:49] yes [16:48:59] that requires relocation of servers [16:51:17] hrmm [16:51:34] ok, well, work on something else for now then, i need to go get lunch ;] [16:51:41] if you can figure out the math [16:51:49] try to determine how many watts each server is pulling in the rack [16:51:50] i did...sent you an email...i will dig it up [16:51:56] oh? [16:51:58] maybe a ticket [16:52:10] you prolly did, i am drowning in tickets and emails ;_; [16:52:27] !rt 2692 [16:52:27] https://rt.wikimedia.org/Ticket/Display.html?id=2692 [16:52:57] spiffy, hrmm [16:53:13] d1 isnt an apache rack [16:53:20] i dont wanna toss apaches in there if we can help it [16:53:25] is there room in d3 for all of them? [16:53:44] cmjohnson1: d3 shows a bunch of old servers in it. [16:53:56] if they are no longer in that rack, they need to move to a decommission rack [16:54:13] then we should be able to move all three of the servers you need to rebalance power to d3 [16:54:27] i have room...checking racktables now [16:54:37] d3-pmtpa [16:54:55] not sdtpa...i have some squids in there but they are still being used [16:55:53] right pmtpa [16:56:02] it shows a bunch of sq69-sq86 [16:56:18] so those are in use? [16:56:35] ahh, when you list racks in ticekts [16:56:43] please list full name d3-pmpta, d1-sdtpa [16:56:47] re-reading. [16:57:11] cmjohnson1: so you want to move them to d3 sdtpa or d3pmtpa? [16:57:22] d3-pmtpa [16:57:46] hrmm, that has less room than d1-sdtpa [16:58:01] i think it may be best to just move all the overload to d1-sdtpa [16:59:11] we can do that [17:00:29] cmjohnson1: so the easiest way for you to check for image scalers [17:00:37] on these we are moving, its easiest if you are moving plain old apaches [17:00:49] image scalers are in a much smaller pool of servers [17:01:00] so, check out ganglia.wikimeida.org when you pick the servers to move [17:01:07] ensure they are in apaches, not imagescaling or whatnot [17:01:25] 90% of the time they are fine, but easier for you to check rather than wait on me to check when i go to take them down [17:01:36] (granted, i will check anyhow, but it saves us unneeded delays ;) [17:01:49] k [17:02:04] if you had root you would just grep the pybal configs [17:02:15] and then there are a few other things to check, like if they are jobrunners and the like [17:02:35] but there isnt a way for you to check that with your (unnecessarily) limited access levels. [17:02:57] * RobH beats the dead horse about access levels some more [17:03:14] ok, im goin to lunch. [17:03:29] so either mark or LeslieCarr have to handle the network ticket you will be dropping for those relocations [17:03:38] once they setup the new ports, we can move them [17:03:43] cmjohnson1: just bug me about the ports :) [17:03:53] ok [17:05:15] RECOVERY - Varnish HTTP upload-frontend on cp1021 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.054 seconds [17:06:30] RobH: are you at the dc today ? 
[17:06:37] yes
[17:06:46] we're seeing no light at the port from XO
[17:07:18] ok, can we troubleshoot in an hour
[17:07:23] im trying to leave for lunch ;]
[17:07:48] oh wait
[17:07:53] can you try flipping the fiber real quick
[17:07:56] on a conference call
[17:08:04] have peeps on hold
[17:08:07] ok
[17:08:33] New patchset: Mark Bergsma; "Initial implementation of a consistent hashing director" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4162
[17:08:50] they are flipped
[17:08:51] RobH: yay that fixed it
[17:08:52] thanks
[17:09:00] thats the opposite of all the other punchdowns
[17:09:24] ok, im leaving for lunch.
[17:09:46] RobH: I'm sorry to have to ask again, but could you remind me of the current state of the new swift hardware in both pmtpa and eqiad? (the hardware for the 12-node cluster, not the current prod hw)
[17:10:57] (or perhaps I just missed you.)
[17:14:51] RECOVERY - Apache HTTP on srv237 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time
[17:38:09] LeslieCarr:
[17:38:21] !rt 2761
[17:38:21] https://rt.wikimedia.org/Ticket/Display.html?id=2761
[17:40:54] cool
[17:43:26] * schoolcraftT would slap Thehelpfulone, but is not being violent today
[17:44:05] * schoolcraftT breaks out the slapping rod and looks sternly at Krinkle
[17:55:40] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4020
[17:55:44] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4020
[18:00:31] cmjohnson1: the ports should be ready now
[18:06:41] maplebed: All the hardware is on site but not racked yet, as I need to allocate space for it.
[18:06:52] the existing hardware needs firmware updated, but i have not found time to do so
[18:07:04] (anyone can update the firmware, it doesn't require physical access)
[18:07:48] New review: Hashar; "Can be tested by amending a change in the test branch: https://gerrit.wikimedia.org/r/4144" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4145
[18:10:13] robh: mw servers
[18:10:17] !rt 2692
[18:10:17] https://rt.wikimedia.org/Ticket/Display.html?id=2692
[18:10:52] cmjohnson1: i would set the network ticket to a blocker.
[18:11:05] like i just did, depends on.
[18:13:57] cmjohnson1: hey do you want to work on the wireless right now (or else i will head into the office and we can work on it later)
[18:14:34] sure
[18:14:43] let's get this done...at least in sdtpa
[18:14:57] tired of tethering to a single spot in the DC ;]
[18:15:20] makes it difficult at times...running back and forth
[18:17:23] ok :)
[18:17:24] cool
[18:20:07] cmjohnson1: you are linking the ticket to me cuz when you finish wifi stuff you are ready to relocate them?
[18:20:25] I am about to set fire to the rack of cisco servers
[18:20:32] im getting really tired of fighting with them.
[18:20:40] correct robh
[18:23:45] hrmm, how the hell do i determine jobrunner nodes...
[18:25:11] great, they are all three job runners
[18:25:40] Ok, I see that Tim supposedly fixed the package
[18:25:46] so i assume they can be restarted...
[18:26:38] ahh, yep, init scripts start them ok
[18:26:40] so non issue.
[18:26:52] cmjohnson1: I am going to go ahead and shut all three down, you will be moving them today right?
[18:27:13] yes
[18:28:31] !log shutting down mw28, mw49, & mw58 for rack relocation due to power overload in d2-pmtpa, relocation to d1-sdtpa per rt 2692
[18:28:33] Logged the message, RobH
[18:28:38] oxford comma ftw.
[18:29:08] cmjohnson1: all yours, they are powering off now.
[18:29:29] once you bring them back up, ping me or someone else in ops to ensure they are repooling correctly (and jobrunning scripts are working)
[18:30:54] PROBLEM - Host mw28 is DOWN: PING CRITICAL - Packet loss = 100%
[18:30:54] PROBLEM - Host mw49 is DOWN: PING CRITICAL - Packet loss = 100%
[18:31:12] PROBLEM - Host mw58 is DOWN: PING CRITICAL - Packet loss = 100%
[18:32:24] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[18:34:21] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Can not connect to 10.0.11.28:11000 (Connection timed out)
[18:35:17] ohh god damn it
[18:35:21] thats my fault, fixing
[18:37:48] !log syncing new mc.php, forgot to check for all three of the servers i took down, oops.
[18:37:50] Logged the message, Master
[18:38:06] cmjohnson1: now when you move them please ping me before you turn them up
[18:38:14] actually, puppet should fix the mc.php, but ping me anyhow
[18:38:20] cuz they will have outdated files and i worry.
[18:38:33] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[18:38:42] weeeee
[18:38:51] robh: ok
[18:39:05] RobH: problem "fixed" ;)
[18:39:14] cmjohnson1: just FYI when you move an apache there are a few things you have to check, which service pool it will affect, if its a job runner (usually no big deal, but recently those scripts were fubar), and if its memcached
[18:39:24] i of course did all but the last, missing one of the 3 in mc.php
[18:39:41] its not the end of the world, but memcached issues can lead to performance issues.
[18:40:04] much greater deal in the past before some overhaul on parser if i understand how things changed, and i may not =P
[18:42:18] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[18:42:18] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[18:56:24] New patchset: Jgreen; "search qa tool now accepts IP addresses as pool assignments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4195
[18:56:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4195
[18:57:17] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4195
[18:57:18] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[18:57:18] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[18:57:20] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4195
[19:15:12] New patchset: Jgreen; "added run mode info to search api_sweep_test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4196
[19:15:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4196
[19:16:08] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4196
[19:16:11] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4196
[19:22:35] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4166
[19:36:02] binasher: https://gerrit.wikimedia.org/r/#change,4162,patchset=5
[19:36:27] oooh
[19:36:50] i have that installed on cp1021 now, a test build
[19:36:55] (if puppet didn't wipe it, I don't think it does)
[19:37:05] appears to work, but not thoroughly tested yet
[19:37:15] feel free to comment/review/etc :)
[19:38:18] wrote it yesterday, if no one finds any serious problems with it I might put a little load on it on thursday
[19:51:04] New review: Demon; "Please reformat your commit message per the guidelines." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/3815
[19:51:18] mark: did you install it by dpkg on cp1021? if so then i think puppet did wipe it out
[19:52:06] dpkg -i
[19:52:08] yeah might have
[19:52:10] it'll be in /tmp
[19:53:00] if the VCL is gone too, just replace 'hash' by 'chash' in the director config
[19:53:05] params are compatible
[19:54:52] ok, awesome! i'll test it out
[19:54:56] the general idea is that backend names (as defined in the VCL) are hashed with sha256, converted to a double between 0.0 and 1.0, and placed on a continuum
[19:55:08] every backend "weight" times, so 10 times for a backend with weight 10
[19:55:30] and then each url has a hash too, and the next/closest backend for the given hash is used
[19:55:39] and a server with weight 10 has 10 hash values on that circle, etc
[19:56:15] so as long as the backend names don't change, the hash values don't change
[19:56:36] and adding or removing a backend will only affect a small portion of all URLs
[19:58:09] i'm going off now, see you tomorrow
[19:58:26] have a good night. it looks great
[19:58:32] tnx... c'ya
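[Editor's note: Mark's description translates almost line for line into code. Below is a minimal Python sketch of the scheme as he states it (sha256 of the backend name mapped to a double in [0.0, 1.0), one point per unit of weight, next point on the circle wins). It is illustrative only, not the actual C director in change 4162, and the "name-i" per-weight salt is an assumption; the log doesn't show how the real director derives its points.]

```python
import bisect
import hashlib

def point(s):
    """Hash a string to a point on the [0.0, 1.0) continuum via sha256."""
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class ChashDirector:
    def __init__(self, backends):
        # backends: dict mapping backend name -> weight.  Each backend is
        # placed on the continuum `weight` times, so a weight-10 backend
        # owns roughly twice the URL space of a weight-5 one.
        self._points = sorted(
            (point("%s-%d" % (name, i)), name)
            for name, weight in backends.items()
            for i in range(weight)
        )
        self._keys = [p for p, _ in self._points]

    def pick(self, url):
        # Take the next backend point at or after the URL's hash,
        # wrapping around the circle at 1.0.
        i = bisect.bisect_left(self._keys, point(url)) % len(self._points)
        return self._points[i][1]

# Hypothetical backends and URL, for illustration only:
d = ChashDirector({"cp1021": 10, "cp1022": 10, "cp1023": 5})
print(d.pick("/wikipedia/commons/a/ab/Example.jpg"))
```

[The stability property Mark calls out falls straight out of this: the points depend only on backend names and weights, so dropping a backend re-maps only the URLs whose nearest point belonged to it. Per his 19:53 note, switching over in VCL is then just replacing 'hash' with 'chash' in the director block, since the parameters are compatible.]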
[20:03:38] maplebed: finally got concurrent ops not sucking, seeing a 3x improvement for "store" ops in copper and 2x for moves/deletes
[20:03:52] that's still less than I would have expected.
[20:04:05] but good to hear.
[20:04:40] I bet pmtpa will show much more gain
[20:05:01] oh yeah, I forgot you were working against the eqiad cluster
[20:05:21] actually there may be more than 3x for store, I haven't even maxed it out yet
[20:05:32] * AaronSchulz could try more files at once
[20:08:45] nope, just 3x
[20:13:09] !log restarting lsearchd on search7. was toasted
[20:13:11] Logged the message, notpeter
[20:24:18] hey maplebed, do you have some time for attempt 2 of the new filters?
[20:24:39] yes. gimme 5.
[20:25:27] okay
[20:43:20] RECOVERY - Disk space on search1021 is OK: DISK OK
[20:44:45] ping drdee
[20:44:52] alright, sorry, 5 turned into 20.
[20:45:56] ready
[20:47:13] i put the new deb package in my home folder on emery, it has today's date in the filename
[20:47:28] * maplebed looks
[20:47:32] RECOVERY - Disk space on search1022 is OK: DISK OK
[20:48:47] drdee: it looks like the tgz file has the date but the .deb doesn't. that's ok, right?
[20:50:16] maplebed: yes
[20:51:33] the usage output for the binary says "Either --path or --domain are mandatory (you can use them both, the other command line parameters are optional:" but the new filters you added don't specify either (specifying only -i instead). is that ok?
[20:51:33] RECOVERY - Host mw49 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[20:51:33] RECOVERY - Host mw28 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[20:51:33] RECOVERY - Host mw58 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms
[20:51:42] PROBLEM - NTP on mw58 is CRITICAL: NTP CRITICAL: Offset unknown
[20:51:59] robh: mw servers have been relocated
[20:55:54] RECOVERY - NTP on mw58 is OK: NTP OK: Offset -0.01417863369 secs
[20:56:39] PROBLEM - Apache HTTP on mw58 is CRITICAL: Connection refused
[20:58:38] drdee: did you see my question?
[20:58:51] hm, rob left (the chat) 90 minutes ago... he said something about those servers having outdated mc.php and other files before he left, saying puppet should fix, but he wanted to make sure
[20:58:53] i missed it
[20:59:06] yeah, need to update docs :)
[20:59:10] (see log at around 20:35 utc+2)
[21:00:52] drdee: the new package is in our repo. Do you want to prepare or re-submit the puppet changes or do you want me to do it?
[21:01:07] Ryan_Lane: is there anything i can bribe you with to see if you can help figure out wtf is going on with puppet writing nagios information multiple times to one file ?
[21:01:20] maplebed: i'll push the changes
[21:01:31] ugh
[21:01:41] I really don't want to dig through nagios stuff right now :(
[21:01:46] it'll eat up my entire day
[21:01:55] ok
[21:02:10] I'm trying to get something going for a GSOC person right now
[21:04:28] maplebed: sorry what was the folder of the config again?
[21:05:25] which folder for which config?
[21:05:25] LeslieCarr: i think puppet has always done that
[21:05:40] maplebed: got it
[21:05:41] (grep will probably get you an answer quicker than me)
[21:06:18] binasher: nope, it's done it for a lot of other files but if it did that for the nagios puppet_hosts, nagios wouldn't run
[21:06:38] binasher: and i double checked on spence, it's definitely acting differently than on neon :(
[21:07:03] LeslieCarr: ah, didn't read enough, was just thinking of the per-host service config files
[21:07:21] ah yeah, for some reason those duplicate, but nagios is smart enough to do each check once
[21:07:48] but it flips out if a host is defined more than once
[21:08:25] New patchset: Diederik; "Update udp-filter config to 0.2 Deploy Wikipedia Zero and Teahouse filters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4205
[21:08:37] maplebed: pushed changes :)
[21:08:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4205
[21:08:47] reviewing.
[21:09:15] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time
[21:10:32] maplebed: new package also contains a man entry --> man udp-filter
[21:14:13] still reviewing.
[21:14:28] New patchset: Jgreen; "search qa stuff minor tweaks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4207
[21:14:44] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4207
[21:14:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4207
[21:14:44] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4207
[21:15:47] diederik: can you tell me about what [a-z0-9//]* means in the teahouse log?
[21:16:00] mostly I'm wondering whether the two // is a typo.
[21:16:14] maplebed: give me 5 minutes
[21:16:20] sure.
[21:26:02] maplebed: that was to escape the forward slash
[21:26:14] \ is the escape character, isn't it?
[21:26:29] also, in normal regex I don't think you have to escape forward slashes
[21:26:38] (inside character sets, that is)
[21:26:46] of course, this likely isn't regex....
[21:27:52] maplebed: sorry, so i need a regex for /wiki/Teahouse*
[21:28:26] I'm happy to deploy this one if you like and you can change it if it's wrong. Alternatively, we can leave the teahouse commented out until the udp-filter package is deployed then try it before putting it in puppet.
[21:29:16] let's go with option 2
[21:29:30] ok. you should submit a git amend to comment out that line.
[21:29:40] then we can try it by hand and get it right.
[21:30:58] check
[21:34:03] New patchset: Diederik; "Update udp-filter config to 0.2 Deploy Wikipedia Zero and Teahouse filters Comment out Teahouse filter Change-Id: I4b8f35a7bc71eb740cba01286be46ad4f06a0ff6" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4205
[21:34:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4205
[21:35:07] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4205
[21:35:10] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4205
[21:36:10] drdee: if you just need wiki/Teahouse*, why not use string match instead of regex? strstr is substring, not full string, so will match if you just leave off the character set at the end.
[21:36:28] (not to mention being computationally way simpler...)
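[Editor's note: a toy illustration of maplebed's point, in Python rather than udp-filter's C; the sample URLs and the cleaned-up character class are made up for the example (the doubled slash in the original [a-z0-9//]* class is exactly what prompted the question). A substring test already matches regardless of what follows the prefix, so it does the job of the regex without the per-line regex cost.]

```python
import re

# Three sample URLs; the last one should not match.
urls = ["/wiki/Teahouse", "/wiki/Teahouse/Questions", "/wiki/Tea"]

# Trailing-class version, roughly what the [a-z0-9//]* filter was after:
pattern = re.compile(r"/wiki/Teahouse[A-Za-z0-9_/]*")
print([u for u in urls if pattern.search(u)])      # first two match

# Substring version, which is what strstr() does in C: anything after the
# prefix is accepted automatically, so no trailing character class needed.
print([u for u in urls if "/wiki/Teahouse" in u])  # same two match
```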
[21:37:11] damn.
[21:37:21] what I thought was a weird output from gerrit is actually a problem with the configs.
[21:37:42] The third country matching line ends with a $ (you probably copy/pasted from emacs, or something, right?)
[21:37:57] +pipe 100 /usr/bin/udp-filter -f -c BD,BH,IQ,JO,KE,KW,LK,NG,QA,SN,TN,UG,ZA -g -m /var/log/squid/filters/GeoIPLibs/GeoIP.dat -b country >> /var/log/squid/coun$
[21:38:34] but I've already merged the change.
[21:38:42] so I'll have to introduce a new change to fix it.
[21:38:43] ::sigh::
[21:40:05] New patchset: Bhartshorne; "correcting copy/paste typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4210
[21:40:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4210
[21:40:34] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4210
[21:40:37] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4210
[21:41:20] drdee: running puppet on emery now to deploy.
[21:41:34] !log deploying new udp-filter and teahouse filters to emery for diederik
[21:41:37] Logged the message, Master
[21:42:24] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours
[21:42:43] damnit.
[21:42:49] drdee: diederik " udp-filter: Depends: libgeoip1 (>= 1.4.7~beta5+dfsg) but 1.4.6.dfsg-17 is to be installed"
[21:44:20] maplebed: is it possible to upgrade?
[21:44:34] I overrode the dependency and installed anyways.
[21:44:45] it should work
[21:45:01] emphasis on should :)
[21:45:03] why is the dependency set so strictly?
[21:45:27] that is done automatically by the debian build config
[21:45:39] you're the one building the package - you can set it to whatever you want.
[21:46:17] it's generally a bad idea to set dependencies to "exactly what I have on my build system that doesn't match production".
[21:46:18] true :) but it happens because of variable substitution
[21:46:42] i'll figure out how to do it in a more relaxed way
[21:46:47] it's a useful trick when you're developing for your own box, but a bad idea when building packages for a production environment.
[21:47:39] (mostly because it allows you to be ignorant about what stuff you're using is in what version of the dependency, which is useful for rapid development but causes all sorts of problems later on when you're trying to figure out what you actually need.)
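[Editor's note: for context on the "variable substitution" drdee mentions: a Debian package's debian/control typically declares Depends: ${shlibs:Depends}, and dpkg-shlibdeps expands that substvar at build time from the library versions present on the build machine, which is how the libgeoip1 (>= 1.4.7~beta5+dfsg) requirement got generated. A hedged sketch of the two options; the udp-filter stanza below is assumed, not copied from its actual packaging:]

```
# The substvar form inherits whatever libgeoip the build box has,
# e.g. "libgeoip1 (>= 1.4.7~beta5+dfsg)" in this case:
Package: udp-filter
Depends: ${shlibs:Depends}, ${misc:Depends}

# A relaxed alternative: pin only the oldest version known to work,
# so the package installs against production's libgeoip1 1.4.6:
Package: udp-filter
Depends: libgeoip1 (>= 1.4.6), ${misc:Depends}
```

[The cleaner fix, per maplebed's point above, is to build in a chroot matching the production distribution, so the auto-generated dependency is satisfiable without hand-editing.]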
[21:46:17] it's generally a bad idea to set dependencies to "exactly what I have on my build system that doesn't match production". [21:46:18] true :) but it happens because of variable substitution [21:46:42] i'll figure out how to do it in more relaxed way [21:46:47] it's a useful trick when you're developing for your own box, but a bad idea when building packages for a production environment. [21:47:39] (mostly because it allows you to be ignorant about what stuff you're using is in what version of the dependency, which is useful for rapid development but causes all sorts of problems later on when you're trying to figure out what you actually need.) [21:47:57] PROBLEM - udp2log log age on emery is CRITICAL: CRITICAL: log files /var/log/squid/orange-ivory-coast.log, /var/log/squid/digi-malaysia.log, have not been written to in 6 hours [21:48:06] PROBLEM - udp2log processes on emery is CRITICAL: CRITICAL: filters absent: /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, [21:48:07] got it, i'll look into it [21:48:18] interesting nagios alerts. [21:48:20] :) [21:48:29] particularly the 2nd one [21:49:37] ps doesn't match what's in the config. [21:50:10] ok, that's better [21:50:21] RECOVERY - udp2log processes on emery is OK: OK: all filters present [21:50:21] it is installed in /usr/bin and it does run [21:50:25] !log ran /etc/init.d/udp2log reload on emery to enact the puppetted changes [21:50:27] Logged the message, Master [21:50:39] ps shows the 3 country filters running now. [21:50:45] would you check the output and see if it looks right? [21:50:49] sure [21:51:49] digi-malaysia.log is capturing date [21:52:03] date == data, that looks good [21:52:41] orange-ivory-coast nothing yet [21:52:53] countries-10 and 100 don't look like they're working. [21:53:26] oh, maybe I'm just waiting on a flush to disk. [21:54:19] but it seems that only 1 is running of those country filters [21:57:23] yeah, they haven't updated in 8 minutes. [21:57:25] that's too long. [21:57:48] i did a ps -aux and they didn't show up [21:57:55] Pipe terminated, suspending output: /usr/bin/udp-filter -f -c KH,BW,CM,MG,ML,MU,NE,VU -g -m /var/log/squid/filters/GeoIPLibs/GeoIP.dat -b country >> /var/log/squid/countries-10.log [21:58:01] (that's from udp2log.log) [21:58:15] mmmmm [21:58:16] for both 10 and 100. [21:58:26] can you relaunch one and see what it says? [21:58:44] *** glibc detected *** /usr/bin/udp-filter: double free or corruption (out): 0x00000000007fc040 *** [21:59:19] there are a bunch of those in the log. [21:59:47] reloading the config [21:59:47] not good [21:59:56] (to watch it happen again) [22:00:24] it keeps restarting them and they keep crashing. [22:00:35] time to roll back? [22:00:57] yep [22:01:05] ok. reverting the puppet changes in gerritmw. [22:01:44] New patchset: Bhartshorne; "Revert "correcting copy/paste typo"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4214 [22:01:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4214 [22:02:04] New patchset: Bhartshorne; "Revert "Update udp-filter config to 0.2 Deploy Wikipedia Zero and Teahouse filters Comment out Teahouse filter Change-Id: I4b8f35a7bc71eb740cba01286be46ad4f06a0ff6"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4215 [22:02:18] New review: gerrit2; "Lint check passed." 
[22:06:51] RECOVERY - udp2log log age on emery is OK: OK: all log files active
[22:07:28] diederik: roll back complete; emery looks functional again.
[22:08:52] maplebed: ty
[22:37:45] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[22:39:42] PROBLEM - Puppet freshness on db9 is CRITICAL: Puppet has not run in the last 10 hours
[22:44:52] i give up and am going to reinstall neon
[22:44:57] !log reinstalling neon
[22:44:59] Logged the message, Mistress of the network gear.
[22:54:41] PROBLEM - HTTP on neon is CRITICAL: Connection refused
[22:54:50] PROBLEM - SSH on neon is CRITICAL: Connection refused
[22:54:58] that's me, can ignore those
[22:59:15] <^demon> Ok good, I was about to panic ;-)
[22:59:16] hey, quick question: why do i find an ip6 address in our squid logs?
[23:02:34] drdee: is it a log from europe ? we have ipv6 on some whitelisted resolvers for upload.wikimedia.org there
[23:02:47] let me check, hold on
[23:04:56] never mind, it's from ipv6and4.labs.wikimedia.org, didn't realize that
[23:15:15] maplebed: ok, the concurrency patch passes all unit tests
[23:15:22] * AaronSchulz has to trace down an annoying bug
[23:15:22] \o/
[23:15:24] *had
[23:15:34] all concurrency bugs are annoying.
[23:15:55] it was actually a bug in the core which I fixed, not just the code in that branch
[23:17:51] !log updating bgp policies on cr1.sdtpa
[23:17:53] Logged the message, Mistress of the network gear.
[23:43:56] Ryan_Lane: http://i.imgur.com/5C0py.jpg