[00:09:27] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2425*
[00:13:39] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2413*
[00:13:58] AaronSchulz: http://wikitech.wikimedia.org/view/Swift/Deploy_Plan_-_Originals
[00:14:14] woosters: ^^^ ctw1
[00:19:23] hi maplebed
[00:19:46] woosters: aaron and I talked through stuff that is between us and originals in swift; I wrote a summary at that URL.
[00:20:24] thks
[00:26:15] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2413*
[00:29:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:35:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.018 seconds
[00:50:53] ^^^ that power nagios alert is ok for now; robh is on it.
[00:55:48] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 223 seconds
[00:55:57] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 226 seconds
[00:56:15] PROBLEM - MySQL Replication Heartbeat on db24 is CRITICAL: CRIT replication delay 267 seconds
[00:56:42] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 287 seconds
[01:11:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:15:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.169 seconds
[01:36:18] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours
[01:38:15] RECOVERY - MySQL Replication Heartbeat on db24 is OK: OK replication delay 0 seconds
[01:38:42] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 0 seconds
[01:39:54] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds
[01:40:03] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds
[01:42:31] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection refused
[01:43:16] PROBLEM - LVS Lucene on search-prefix.svc.eqiad.wmnet is CRITICAL: Connection refused
[01:45:30] hrmm
[01:45:34] notpeter: are you about?
[01:45:50] poking at search doesn't sound fun in the least ;]
[01:46:41] that's not in prod yet
[01:46:43] heh, yay!
[01:46:47] got your sms ;]
[01:50:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:54:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.006 seconds
[01:55:31] New review: Demon; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/4083
[02:10:07] PROBLEM - udp2log log age on emery is CRITICAL: CRITICAL: log files /var/log/squid/countries-100.log, /var/log/squid/countries-10.log, /var/log/squid/countries-1.log, have not been written to in 6 hours
[02:30:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:35:19] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[02:36:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.618 seconds
[04:53:14] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 3795 MB (3% inode=99%):
[05:55:03] New patchset: Tim Starling; "Split apache out of local syslogs and limit file sizes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4149
[05:55:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4149
[06:00:23] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 3272 MB (2% inode=99%):
[06:02:47] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:04:44] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[06:11:29] PROBLEM - Puppet freshness on searchidx1001 is CRITICAL: Puppet has not run in the last 10 hours
[06:14:21] New review: Tim Starling; "The rsyslog configuration is tested, except for the rotation script." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4149
[06:14:23] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4149
[06:49:24] RECOVERY - udp2log log age on emery is OK: OK: all log files active
[06:49:33] RECOVERY - udp2log processes on emery is OK: OK: all filters present
[07:01:06] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:05:09] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 0.013 seconds
[07:24:14] PROBLEM - Swift HTTP on ms-fe2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:30:23] RECOVERY - Swift HTTP on ms-fe2 is OK: HTTP OK HTTP/1.1 200 OK - 366 bytes in 0.010 seconds
[07:52:35] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:54:32] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[07:59:27] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4096
[07:59:30] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/4096
[08:24:21] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours
[08:30:21] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[08:40:24] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[08:40:24] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[08:55:24] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[08:55:24] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[10:24:41] ACKNOWLEDGEMENT - Host cp1017 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT #2727 - has alarm status
[10:58:15] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2413*
[11:10:51] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2413*
[11:39:17] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours
[12:00:26] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[12:10:47] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[12:11:50] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2425*
[12:16:02] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2438*
[12:36:35] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[12:37:38] PROBLEM - Puppet freshness on db9 is CRITICAL: Puppet has not run in the last 10 hours
[12:45:17] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2413*
[13:16:47] New patchset: Tim Starling; "Add notify and change conf file mode" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4160
[13:17:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4160
[13:17:27] New patchset: ArielGlenn; "continue rsyncs after failed job but don't delete related dirs" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/4161
[13:18:05] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4160
[13:18:07] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4160
[13:22:40] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4161
[13:22:43] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/4161
[13:26:25] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2413*
[13:31:11] New patchset: Mark Bergsma; "Initial implementation of a consistent hashing director" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4162
[13:31:14] New patchset: Tim Starling; "Revert "Add notify and change conf file mode"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4163
[13:31:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4163
[13:31:40] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4163
[13:31:43] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4163
[13:35:26] !log manually reloaded rsyslogd on all apaches
[13:35:28] Logged the message, Master
[13:53:11] New patchset: Jgreen; "checking in some search QA scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4164
[13:53:26] New patchset: Jgreen; "install search qa scripts on iron" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4165
[13:53:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4164
[13:53:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4165
[13:53:41] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4164
[13:53:43] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4164
[13:54:21] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4165
[13:54:24] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4165
[14:02:29] New patchset: Hashar; "gerrit: use uid to login instead of display name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4166
[14:02:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4166
[14:03:58] New review: Hashar; "Dear lord of Gerrit," [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4166
[14:07:34] New patchset: Pyoungmeister; "oh, dependencies..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4167
[14:07:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4167
[14:09:45] New patchset: Pyoungmeister; "oh, dependencies..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4167
[14:10:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4167
[14:10:21] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4167
[14:10:23] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4167
[14:12:46] RECOVERY - Puppet freshness on searchidx1001 is OK: puppet ran at Tue Apr 3 14:12:25 UTC 2012
[14:14:28] New patchset: Mark Bergsma; "Initial implementation of a consistent hashing director" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4168
[14:14:51] Change abandoned: Mark Bergsma; "(no reason)" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4168
[14:15:36] New patchset: Mark Bergsma; "Initial implementation of a consistent hashing director" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4162
[14:40:31] RECOVERY - DPKG on cp1021 is OK: All packages OK
[14:40:31] PROBLEM - Varnish HTTP upload-frontend on cp1021 is CRITICAL: Connection refused
[15:09:48] New patchset: Pyoungmeister; "putting toe in to test the waters of scoped var. if doesn't blow up, will change rest in this conf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4170
[15:10:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4170
[15:11:52] RECOVERY - Varnish HTTP upload-frontend on cp1021 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.054 seconds
[15:17:23] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4170
[15:17:26] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4170
[15:19:36] New patchset: Mark Bergsma; "Initial implementation of a consistent hashing director" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4162
[15:24:17] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Tue Apr 3 15:24:14 UTC 2012
[15:29:59] PROBLEM - Varnish HTTP upload-frontend on cp1021 is CRITICAL: Connection refused
[15:30:26] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2425*
[15:32:14] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=63%):
[15:32:32] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 89 MB (1% inode=66%):
[15:33:44] Rarrgh ^
[15:34:38] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 206 MB (2% inode=63%):
[15:34:47] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 207 MB (2% inode=66%):
[15:40:38] RECOVERY - Disk space on srv222 is OK: DISK OK
[15:40:56] RECOVERY - Disk space on srv221 is OK: DISK OK
[15:42:46] New patchset: Pyoungmeister; "scopin' more vars, cleaning up white space" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4174
[15:43:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4174
[15:43:31] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4174
[15:43:33] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4174
[15:46:40] RECOVERY - Disk space on srv219 is OK: DISK OK
[15:46:49] RECOVERY - Disk space on srv223 is OK: DISK OK
[16:06:27] New patchset: RobH; "added cisco ucs c250 m1 server ttyS0-115200 entries" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4176
[16:06:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4176
[16:07:22] New patchset: Jgreen; "added search qa mode where we can test through LVS rather than direct to lucene hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4177
[16:07:31] RECOVERY - Varnish HTTP upload-frontend on cp1021 is OK: HTTP OK HTTP/1.1 200 OK - 643 bytes in 0.053 seconds
[16:07:38] New review: RobH; "self review is the best review" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4176
[16:07:38] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4176
[16:07:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4177
[16:08:16] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4177
[16:08:19] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4177
[16:09:08] why is stafford asking for login when suckpuppet updates its puppet repo =/
[16:09:32] i mean, i know the pass, but odd that its suddenly asking.
[16:10:20] !log updating brewster to use new dhcp files for cisco, no more local hackin.
[16:10:22] Logged the message, RobH
[16:13:09] New patchset: Mark Bergsma; "Initial implementation of a consistent hashing director" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4162
[16:21:38] New patchset: Jgreen; "really an impressive collection of typos, fine work!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4180
[16:21:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4180
[16:22:17] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4180
[16:22:20] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4180
[16:23:44] robh: plz bring down srv230
[16:24:10] !rt 2759
[16:24:10] https://rt.wikimedia.org/Ticket/Display.html?id=2759
[16:24:26] hrmm, its in mid rack
[16:24:40] think you are juggling it to another phase or removing from rack?
[16:24:51] juggling to a new phase
[16:25:14] cool, ok lemme ensure its not an image scaler then will be able to take it down
[16:26:30] ok, its a general use apache
[16:26:48] !log shutting down srv230 for power phase move per rt 2759
[16:26:50] Logged the message, RobH
[16:26:59] cmjohnson1: when it powers down its all yours
[16:27:12] once you are done moving it and power looks good, lemme know and I can ensure its back in service pool(s)
[16:27:16] k..if this doesn't work..i am going to have to remove srv226
[16:27:26] but let's hope this works first
[16:28:16] New patchset: Jgreen; "more typos" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4181
[16:28:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4181
[16:28:51] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4181
[16:28:54] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4181
[16:29:43] PROBLEM - Host srv230 is DOWN: PING CRITICAL - Packet loss = 100%
[16:31:30] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Can not connect to 10.0.2.230:11000 (Connection refused)
[16:32:06] RECOVERY - Host srv230 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[16:32:15] PROBLEM - NTP on srv230 is CRITICAL: NTP CRITICAL: Offset unknown
[16:33:00] PROBLEM - Apache HTTP on srv230 is CRITICAL: Connection refused
[16:33:36] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[16:34:35] robh: d2 is in better shape but I could probably juggle another srv to a different phase to better balance...check srv230 and if you could bring down for me
[16:34:42] srv231
[16:35:22] it looks like you need to move something from YZ to YX
[16:35:31] that is srv 231
[16:36:01] make that 237
[16:36:18] robh ^
[16:36:21] i dont want to bring down more servers until 230 is back.
[16:36:27] RECOVERY - NTP on srv230 is OK: NTP OK: Offset 0.01653683186 secs
[16:36:27] checking it now
[16:37:03] RECOVERY - Apache HTTP on srv230 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time
[16:37:07] !log srv230 back in rotation
[16:37:09] Logged the message, RobH
[16:38:06] PROBLEM - Varnish HTTP upload-frontend on cp1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:38:40] !log bringing down srv237 for phase balancing
[16:38:42] Logged the message, RobH
[16:38:56] cmjohnson1: when it powers off its all yours
[16:38:58] k
[16:42:00] PROBLEM - Host srv237 is DOWN: PING CRITICAL - Packet loss = 100%
[16:43:34] robh: better now but z is still just over 20A. we can still add to xy...let me know what you think
[16:43:39] RECOVERY - Host srv237 is UP: PING OK - Packet loss = 0%, RTA = 0.71 ms
[16:46:11] cmjohnson1: 20amp is ok
[16:46:21] if you look at http://ps1-d2-sdtpa.mgmt.pmtpa.wmnet/ it shows the warning levels
[16:46:41] they are 18.change, 19/20 and 19/20.
and 19/20 [16:46:53] so the phases are in balance now [16:47:05] atleast close enough to clear any alarms [16:47:21] i did see that...but z phase goes over 20A from time to time [16:47:51] PROBLEM - Apache HTTP on srv237 is CRITICAL: Connection refused [16:48:25] yea, but it has to go up quite a bit from there to go into full alarm [16:48:36] so the ticket should be able to get resolved, now later today during peak will be the real test [16:48:44] you also have another ticket for pmtpa [16:48:49] yes [16:48:59] that requires relocation of servers [16:51:17] hrmm [16:51:34] ok, well, work on something else for now then, i need to go get lunch ;] [16:51:41] if you can figure out the math [16:51:49] try to determine how many watts each server is pulling in the rack [16:51:50] i did...sent you an email...i will dig it up [16:51:56] oh? [16:51:58] maybe a ticket [16:52:10] you prolly did, i am drowning in tickets and emails ;_; [16:52:27] !rt 2692 [16:52:27] https://rt.wikimedia.org/Ticket/Display.html?id=2692 [16:52:57] spiffy, hrmm [16:53:13] d1 isnt an apache rack [16:53:20] i dont wanna toss apaches in there if we can help it [16:53:25] is there room in d3 for all of them? [16:53:44] cmjohnson1: d3 shows a bunch of old servers in it. [16:53:56] if they are no longer in that rack, they need to move to a decommission rack [16:54:13] then we should be able to move all three of the servers you need to rebalance power to d3 [16:54:27] i have room...checking racktables now [16:54:37] d3-pmtpa [16:54:55] not sdtpa...i have some squids in there but they are still being used [16:55:53] right pmtpa [16:56:02] it shows a bunch of sq69-sq86 [16:56:18] so those are in use? [16:56:35] ahh, when you list racks in ticekts [16:56:43] please list full name d3-pmpta, d1-sdtpa [16:56:47] re-reading. [16:57:11] cmjohnson1: so you want to move them to d3 sdtpa or d3pmtpa? [16:57:22] d3-pmtpa [16:57:46] hrmm, that has less room than d1-sdtpa [16:58:01] i think it may be best to just move all the overload to d1-sdtpa [16:59:11] we can do that [17:00:29] cmjohnson1: so the easiest way for you to check for image scalers [17:00:37] on these we are moving, its easiest if you are moving plain old apaches [17:00:49] image scalers are in a much smaller pool of servers [17:01:00] so, check out ganglia.wikimeida.org when you pick the servers to move [17:01:07] ensure they are in apaches, not imagescaling or whatnot [17:01:25] 90% of the time they are fine, but easier for you to check rather than wait on me to check when i go to take them down [17:01:36] (granted, i will check anyhow, but it saves us unneeded delays ;) [17:01:49] k [17:02:04] if you had root you would just grep the pybal configs [17:02:15] and then there are a few other things to check, like if they are jobrunners and the like [17:02:35] but there isnt a way for you to check that with your (unnecessarily) limited access levels. [17:02:57] * RobH beats the dead horse about access levels some more [17:03:14] ok, im goin to lunch. [17:03:29] so either mark or LeslieCarr have to handle the network ticket you will be dropping for those relocations [17:03:38] once they setup the new ports, we can move them [17:03:43] cmjohnson1: just bug me about the ports :) [17:03:53] ok [17:05:15] RECOVERY - Varnish HTTP upload-frontend on cp1021 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.054 seconds [17:06:30] RobH: are you at the dc today ? 
[17:06:37] yes
[17:06:46] we're seeing no light at the port from XO
[17:07:18] ok, can we troubleshoot in an hour
[17:07:23] im trying to leave for lunch ;]
[17:07:48] oh wait
[17:07:53] can you try flipping the fiber real quick
[17:07:56] on a conference call
[17:08:04] have peeps on hold
[17:08:07] ok
[17:08:33] New patchset: Mark Bergsma; "Initial implementation of a consistent hashing director" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4162
[17:08:50] they are flipped
[17:08:51] RobH: yay that fixed it
[17:08:52] thanks
[17:09:00] thats the opposite of all the other punchdowns
[17:09:24] ok, im leaving for lunch.
[17:09:46] RobH: I'm sorry to have to ask again, but could you remind me of the current state of the new swift hardware in both pmtpa and eqiad? (the hardware for the 12-node cluster, not the current prod hw)
[17:10:57] (or perhaps I just missed you.)
[17:14:51] RECOVERY - Apache HTTP on srv237 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time
[17:38:09] LeslieCarr:
[17:38:21] !rt 2761
[17:38:21] https://rt.wikimedia.org/Ticket/Display.html?id=2761
[17:40:54] cool
[17:43:26] * schoolcraftT would slap Thehelpfulone, but is not being violent today
[17:44:05] * schoolcraftT breaks out the slapping rod and looks sternly at Krinkle
[17:55:40] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4020
[17:55:44] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4020
[18:00:31] cmjohnson1: the ports should be ready now
[18:06:41] maplebed: All the hardware is on site but not racked yet, as I need to allocate space for it.
[18:06:52] the existing hardware needs firmware updated, but i have not found time to do so
[18:07:04] (anyone can update the firmware, it doesn't require physical access)
[18:07:48] New review: Hashar; "Can be tested by amending a change in the test branch: https://gerrit.wikimedia.org/r/4144" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4145
[18:10:13] robh: mw servers
[18:10:17] !rt 2692
[18:10:17] https://rt.wikimedia.org/Ticket/Display.html?id=2692
[18:10:52] cmjohnson1: i would set the network ticket to a blocker.
[18:11:05] like i just did, depends on.
[18:13:57] cmjohnson1: hey do you want to work on the wireless right now (or else i will head into the office and we can work on it later)
[18:14:34] sure
[18:14:43] let's get this done...at least in sdtpa
[18:14:57] tired of tethering to a single spot in the DC ;]
[18:15:20] makes it difficult at times...running back and forth
[18:17:23] ok :)
[18:17:24] cool
[18:20:07] cmjohnson1: you are linking the ticket to me cuz when you finish wifi stuff you are ready to relocate them?
[18:20:25] I am about to set fire to the rack of cisco servers
[18:20:32] im getting really tired of fighting with them.
[18:20:40] correct robh
[18:23:45] hrmm, how the hell do i determine jobrunner nodes...
[18:25:11] great, they are all three job runners
[18:25:40] Ok, I see that Tim supposedly fixed the package
[18:25:46] so i assume they can be restarted...
[18:26:38] ahh, yep, init scripts start them ok
[18:26:40] so non issue.
[18:26:52] cmjohnson1: I am going to go ahead and shut all three down, you will be moving them today right?
[18:27:13] yes
[18:28:31] !log shutting down mw28, mw49, & mw58 for rack relocation due to power overload in d2-pmtpa, relocation to d1-sdtpa per rt 2692
[18:28:33] Logged the message, RobH
[18:28:38] oxford comma ftw.
[18:29:08] cmjohnson1: all yours, they are powering off now.
[18:29:29] once you bring them back up, ping me or someone else in ops to ensure they are repooling correctly (and jobrunning scripts are working)
[18:30:54] PROBLEM - Host mw28 is DOWN: PING CRITICAL - Packet loss = 100%
[18:30:54] PROBLEM - Host mw49 is DOWN: PING CRITICAL - Packet loss = 100%
[18:31:12] PROBLEM - Host mw58 is DOWN: PING CRITICAL - Packet loss = 100%
[18:32:24] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[18:34:21] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Can not connect to 10.0.11.28:11000 (Connection timed out)
[18:35:17] ohh god damn it
[18:35:21] thats my fault, fixing
[18:37:48] !log syncing new mc.php, forgot to check for all three of the servers i took down, oops.
[18:37:50] Logged the message, Master
[18:38:06] cmjohnson1: now when you move them please ping me before you turn them up
[18:38:14] actually, puppet should fix the mc.php, but ping me anyhow
[18:38:20] cuz they will have outdated files and i worry.
[18:38:33] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online
[18:38:42] weeeee
[18:38:51] robh: ok
[18:39:05] RobH: problem "fixed" ;)
[18:39:14] cmjohnson1: just FYI when you move an apache there are a few things you have to check, which service pool it will affect, if its a job runner (usually no big deal, but recently those scripts were fubar), and if its memcached
[18:39:24] i of course did all but the last, missing one of the 3 in mc.php
[18:39:41] its not the end of the world, but memcached issues can lead to performance issues.
[18:40:04] much greater deal in the past before some overhaul on parser if i understand how things changed, and i may not =P
[18:42:18] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[18:42:18] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[18:56:24] New patchset: Jgreen; "search qa tool now accepts IP addresses as pool assignments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4195
[18:56:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4195
[18:57:17] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4195
[18:57:18] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[18:57:18] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[18:57:20] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4195
[19:15:12] New patchset: Jgreen; "added run mode info to search api_sweep_test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4196
[19:15:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4196
[19:16:08] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4196
[19:16:11] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4196
[19:22:35] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4166
[19:36:02] binasher: https://gerrit.wikimedia.org/r/#change,4162,patchset=5
[19:36:27] oooh
[19:36:50] i have that installed on cp1021 now, a test build
[19:36:55] (if puppet didn't wipe it, I don't think it does)
[19:37:05] appears to work, but not thoroughly tested yet
[19:37:15] feel free to comment/review/etc :)
[19:38:18] wrote it yesterday, if no one finds any serious problems with it I might put a little load on it on thursday
[19:51:04] New review: Demon; "Please reformat your commit message per the guidelines." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/3815
[19:51:18] mark: did you install it by dpkg on cp1021? if so then i think puppet did wipe it out
[19:52:06] dpkg -i
[19:52:08] yeah might have
[19:52:10] it'll be in /tmp
[19:53:00] if the VCL is gone too, just replace 'hash' by 'chash' in the director config
[19:53:05] params are compatible
[19:54:52] ok, awesome! i'll test it out
[19:54:56] the general idea is that backend names (as defined in the VCL) are hashed with sha256, converted to a double between 0.0 and 1.0, and placed on a continuum
[19:55:08] every backend "weight" times, so 10 times for a backend with weight 10
[19:55:30] and then each url has a hash too, and the next/closest backend for the given hash is used
[19:55:39] and a server with weight 10 has 10 hash values on that circle, etc
[19:56:15] so as long as the backend names don't change, the hash values don't change
[19:56:36] and adding or removing a backend will only affect a small portion of all URLs
[19:58:09] i'm going off now, see you tomorrow
[19:58:26] have a good night. it looks great
[19:58:32] tnx... c'ya
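[Editor's note: Mark's description translates almost line for line into code. Below is a minimal Python sketch of the scheme as he states it (sha256 of the backend name mapped to a double in [0.0, 1.0), one point per unit of weight, next point on the circle wins). It is illustrative only, not the actual C director in change 4162, and the "name-i" per-weight salt is an assumption; the log doesn't show how the real director derives its points.]

```python
import bisect
import hashlib

def point(s):
    """Hash a string to a point on the [0.0, 1.0) continuum via sha256."""
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

class ChashDirector:
    def __init__(self, backends):
        # backends: dict mapping backend name -> weight.  Each backend is
        # placed on the continuum `weight` times, so a weight-10 backend
        # owns roughly twice the URL space of a weight-5 one.
        self._points = sorted(
            (point("%s-%d" % (name, i)), name)
            for name, weight in backends.items()
            for i in range(weight)
        )
        self._keys = [p for p, _ in self._points]

    def pick(self, url):
        # Take the next backend point at or after the URL's hash,
        # wrapping around the circle at 1.0.
        i = bisect.bisect_left(self._keys, point(url)) % len(self._points)
        return self._points[i][1]

# Hypothetical backends and URL, for illustration only:
d = ChashDirector({"cp1021": 10, "cp1022": 10, "cp1023": 5})
print(d.pick("/wikipedia/commons/a/ab/Example.jpg"))
```

[The stability property Mark calls out falls straight out of this: the points depend only on backend names and weights, so dropping a backend re-maps only the URLs whose nearest point belonged to it. Per his 19:53 note, switching over in VCL is then just replacing 'hash' with 'chash' in the director block, since the parameters are compatible.]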
[20:03:38] maplebed: finally got concurrent ops not sucking, seeing a 3x improvement for "store" ops in copper and 2x for moves/deletes
[20:03:52] that's still less than I would have expected.
[20:04:05] but good to hear.
[20:04:40] I bet pmtpa will show much more gain
[20:05:01] oh yeah, I forgot you were working against the eqiad cluster
[20:05:21] actually there may be more than 3x for store, I haven't even maxed it out yet
[20:05:32] * AaronSchulz could try more files at once
[20:08:45] nope, just 3x
[20:13:09] !log restarting lsearchd on search7. was toasted
[20:13:11] Logged the message, notpeter
[20:24:18] hey maplebed, do you have some time for attempt 2 of the new filters?
[20:24:39] yes. gimme 5.
[20:25:27] okay
[20:43:20] RECOVERY - Disk space on search1021 is OK: DISK OK
[20:44:45] ping drdee
[20:44:52] alright, sorry, 5 turned into 20.
[20:45:56] ready
[20:47:13] i put the new deb package in my home folder on emery, it has today's date in the filename
[20:47:28] * maplebed looks
[20:47:32] RECOVERY - Disk space on search1022 is OK: DISK OK
[20:48:47] drdee: it looks like the tgz file has the date but the .deb doesn't. that's ok, right?
[20:50:16] maplebed: yes
[20:51:33] the usage output for the binary says "Either --path or --domain are mandatory (you can use them both, the other command line parameters are optional:" but the new filters you added don't specify either (specifying only -i instead). is that ok?
[20:51:33] RECOVERY - Host mw49 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[20:51:33] RECOVERY - Host mw28 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[20:51:33] RECOVERY - Host mw58 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms
[20:51:42] PROBLEM - NTP on mw58 is CRITICAL: NTP CRITICAL: Offset unknown
[20:51:59] robh: mw servers have been relocated
[20:55:54] RECOVERY - NTP on mw58 is OK: NTP OK: Offset -0.01417863369 secs
[20:56:39] PROBLEM - Apache HTTP on mw58 is CRITICAL: Connection refused
[20:58:38] drdee: did you see my question?
[20:58:51] hm, rob left (the chat) 90 minutes ago... he said something about those servers having outdated mc.php and other files before he left, saying puppet should fix, but he wanted to make sure
[20:58:53] i missed it
[20:59:06] yeah, need to update docs :)
[20:59:10] (see log at around 20:35 utc+2)
[21:00:52] drdee: the new package is in our repo. Do you want to prepare or re-submit the puppet changes or do you want me to do it?
[21:01:07] Ryan_Lane: is there anything i can bribe you with to see if you can help figure out wtf is going on with puppet writing nagios information multiple times to one file ?
[21:01:20] maplebed: i'll push the changes
[21:01:31] ugh
[21:01:41] I really don't want to dig through nagios stuff right now :(
[21:01:46] it'll eat up my entire day
[21:01:55] ok
[21:02:10] I'm trying to get something going for a GSOC person right now
[21:04:28] maplebed: sorry what was the folder of the config again?
[21:05:25] which folder for which config?
[21:05:25] LeslieCarr: i think puppet has always done that
[21:05:40] maplebed: got it
[21:05:41] (grep will probably get you an answer quicker than me)
[21:06:18] binasher: nope, it's done it for a lot of other files but if it did that for the nagios puppet_hosts, nagios wouldn't run
[21:06:38] binasher: and i double checked on spence, it's definitely acting differently than on neon :(
[21:07:03] LeslieCarr: ah, didn't read enough, was just thinking of the per-host service config files
[21:07:21] ah yeah, for some reason those duplicate, but nagios is smart enough to do each check once
[21:07:48] but it flips out if a host is defined more than once
[21:08:25] New patchset: Diederik; "Update udp-filter config to 0.2 Deploy Wikipedia Zero and Teahouse filters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4205
[21:08:37] maplebed: pushed changes :)
[21:08:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4205
[21:08:47] reviewing.
[21:09:15] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.037 second response time
[21:10:32] maplebed: new package also contains a man entry --> man udp-filter
[21:14:13] still reviewing.
[21:14:28] New patchset: Jgreen; "search qa stuff minor tweaks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4207
[21:14:44] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4207
[21:14:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4207
[21:14:44] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4207
[21:15:47] diederik: can you tell me about what [a-z0-9//]* means in the teahouse log?
[21:16:00] mostly I'm wondering whether the two // is a typo.
[21:16:14] maplebed: give me 5 minutes
[21:16:20] sure.
[21:26:02] maplebed: that was to escape the forward slash
[21:26:14] \ is the escape character, isn't it?
[21:26:29] also, in normal regex I don't think you have to escape forward slashes
[21:26:38] (inside character sets, that is)
[21:26:46] of course, this likely isn't regex....
[21:27:52] maplebed: sorry, so i need a regex for /wiki/Teahouse*
[21:28:26] I'm happy to deploy this one if you like and you can change it if it's wrong. Alternatively, we can leave the teahouse commented out until the udp-filter package is deployed then try it before putting it in puppet.
[21:29:16] let's go with option 2
[21:29:30] ok. you should submit a git amend to comment out that line.
[21:29:40] then we can try it by hand and get it right.
[21:30:58] check
[21:34:03] New patchset: Diederik; "Update udp-filter config to 0.2 Deploy Wikipedia Zero and Teahouse filters Comment out Teahouse filter Change-Id: I4b8f35a7bc71eb740cba01286be46ad4f06a0ff6" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4205
[21:34:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4205
[21:35:07] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4205
[21:35:10] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4205
[21:36:10] drdee: if you just need wiki/Teahouse*, why not use string match instead of regex? strstr is substring, not full string, so will match if you just leave off the character set at the end.
[21:36:28] (not to mention being computationally way simpler...)
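[Editor's note: a toy illustration of maplebed's point, in Python rather than udp-filter's C; the sample URLs and the cleaned-up character class are made up for the example (the doubled slash in the original [a-z0-9//]* class is exactly what prompted the question). A substring test already matches regardless of what follows the prefix, so it does the job of the regex without the per-line regex cost.]

```python
import re

# Three sample URLs; the last one should not match.
urls = ["/wiki/Teahouse", "/wiki/Teahouse/Questions", "/wiki/Tea"]

# Trailing-class version, roughly what the [a-z0-9//]* filter was after:
pattern = re.compile(r"/wiki/Teahouse[A-Za-z0-9_/]*")
print([u for u in urls if pattern.search(u)])      # first two match

# Substring version, which is what strstr() does in C: anything after the
# prefix is accepted automatically, so no trailing character class needed.
print([u for u in urls if "/wiki/Teahouse" in u])  # same two match
```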
[21:37:11] damn.
[21:37:21] what I thought was a weird output from gerrit is actually a problem with the configs.
[21:37:42] The third country matching line ends with a $ (you probably copy/pasted from emacs, or something, right?)
[21:37:57] +pipe 100 /usr/bin/udp-filter -f -c BD,BH,IQ,JO,KE,KW,LK,NG,QA,SN,TN,UG,ZA -g -m /var/log/squid/filters/GeoIPLibs/GeoIP.dat -b country >> /var/log/squid/coun$
[21:38:34] but I've already merged the change.
[21:38:42] so I'll have to introduce a new change to fix it.
[21:38:43] ::sigh::
[21:40:05] New patchset: Bhartshorne; "correcting copy/paste typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4210
[21:40:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4210
[21:40:34] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4210
[21:40:37] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4210
[21:41:20] drdee: running puppet on emery now to deploy.
[21:41:34] !log deploying new udp-filter and teahouse filters to emery for diederik
[21:41:37] Logged the message, Master
[21:42:24] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours
[21:42:43] damnit.
[21:42:49] drdee: diederik " udp-filter: Depends: libgeoip1 (>= 1.4.7~beta5+dfsg) but 1.4.6.dfsg-17 is to be installed"
[21:44:20] maplebed: is it possible to upgrade?
[21:44:34] I overrode the dependency and installed anyways.
[21:44:45] it should work
[21:45:01] emphasis on should :)
[21:45:03] why is the dependency set so strictly?
[21:45:27] that is done automatically by the debian build config
[21:45:39] you're the one building the package - you can set it to whatever you want.
[21:46:17] it's generally a bad idea to set dependencies to "exactly what I have on my build system that doesn't match production".
[21:46:18] true :) but it happens because of variable substitution
[21:46:42] i'll figure out how to do it in a more relaxed way
[21:46:47] it's a useful trick when you're developing for your own box, but a bad idea when building packages for a production environment.
[21:47:39] (mostly because it allows you to be ignorant about what stuff you're using is in what version of the dependency, which is useful for rapid development but causes all sorts of problems later on when you're trying to figure out what you actually need.)
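[Editor's note: for context on the "variable substitution" drdee mentions: a Debian package's debian/control typically declares Depends: ${shlibs:Depends}, and dpkg-shlibdeps expands that substvar at build time from the library versions present on the build machine, which is how the libgeoip1 (>= 1.4.7~beta5+dfsg) requirement got generated. A hedged sketch of the two options; the udp-filter stanza below is assumed, not copied from its actual packaging:]

```
# The substvar form inherits whatever libgeoip the build box has,
# e.g. "libgeoip1 (>= 1.4.7~beta5+dfsg)" in this case:
Package: udp-filter
Depends: ${shlibs:Depends}, ${misc:Depends}

# A relaxed alternative: pin only the oldest version known to work,
# so the package installs against production's libgeoip1 1.4.6:
Package: udp-filter
Depends: libgeoip1 (>= 1.4.6), ${misc:Depends}
```

[The cleaner fix, per maplebed's point above, is to build in a chroot matching the production distribution, so the auto-generated dependency is satisfiable without hand-editing.]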
[21:46:17] it's generally a bad idea to set dependencies to "exactly what I have on my build system that doesn't match production". [21:46:18] true :) but it happens because of variable substitution [21:46:42] i'll figure out how to do it in more relaxed way [21:46:47] it's a useful trick when you're developing for your own box, but a bad idea when building packages for a production environment. [21:47:39] (mostly because it allows you to be ignorant about what stuff you're using is in what version of the dependency, which is useful for rapid development but causes all sorts of problems later on when you're trying to figure out what you actually need.) [21:47:57] PROBLEM - udp2log log age on emery is CRITICAL: CRITICAL: log files /var/log/squid/orange-ivory-coast.log, /var/log/squid/digi-malaysia.log, have not been written to in 6 hours [21:48:06] PROBLEM - udp2log processes on emery is CRITICAL: CRITICAL: filters absent: /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, [21:48:07] got it, i'll look into it [21:48:18] interesting nagios alerts. [21:48:20] :) [21:48:29] particularly the 2nd one [21:49:37] ps doesn't match what's in the config. [21:50:10] ok, that's better [21:50:21] RECOVERY - udp2log processes on emery is OK: OK: all filters present [21:50:21] it is installed in /usr/bin and it does run [21:50:25] !log ran /etc/init.d/udp2log reload on emery to enact the puppetted changes [21:50:27] Logged the message, Master [21:50:39] ps shows the 3 country filters running now. [21:50:45] would you check the output and see if it looks right? [21:50:49] sure [21:51:49] digi-malaysia.log is capturing date [21:52:03] date == data, that looks good [21:52:41] orange-ivory-coast nothing yet [21:52:53] countries-10 and 100 don't look like they're working. [21:53:26] oh, maybe I'm just waiting on a flush to disk. [21:54:19] but it seems that only 1 is running of those country filters [21:57:23] yeah, they haven't updated in 8 minutes. [21:57:25] that's too long. [21:57:48] i did a ps -aux and they didn't show up [21:57:55] Pipe terminated, suspending output: /usr/bin/udp-filter -f -c KH,BW,CM,MG,ML,MU,NE,VU -g -m /var/log/squid/filters/GeoIPLibs/GeoIP.dat -b country >> /var/log/squid/countries-10.log [21:58:01] (that's from udp2log.log) [21:58:15] mmmmm [21:58:16] for both 10 and 100. [21:58:26] can you relaunch one and see what it says? [21:58:44] *** glibc detected *** /usr/bin/udp-filter: double free or corruption (out): 0x00000000007fc040 *** [21:59:19] there are a bunch of those in the log. [21:59:47] reloading the config [21:59:47] not good [21:59:56] (to watch it happen again) [22:00:24] it keeps restarting them and they keep crashing. [22:00:35] time to roll back? [22:00:57] yep [22:01:05] ok. reverting the puppet changes in gerritmw. [22:01:44] New patchset: Bhartshorne; "Revert "correcting copy/paste typo"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4214 [22:01:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4214 [22:02:04] New patchset: Bhartshorne; "Revert "Update udp-filter config to 0.2 Deploy Wikipedia Zero and Teahouse filters Comment out Teahouse filter Change-Id: I4b8f35a7bc71eb740cba01286be46ad4f06a0ff6"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4215 [22:02:18] New review: gerrit2; "Lint check passed." 
[22:06:51] RECOVERY - udp2log log age on emery is OK: OK: all log files active
[22:07:28] diederik: roll back complete; emery looks functional again.
[22:08:52] maplebed: ty
[22:37:45] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[22:39:42] PROBLEM - Puppet freshness on db9 is CRITICAL: Puppet has not run in the last 10 hours
[22:44:52] i give up and am going to reinstall neon
[22:44:57] !log reinstalling neon
[22:44:59] Logged the message, Mistress of the network gear.
[22:54:41] PROBLEM - HTTP on neon is CRITICAL: Connection refused
[22:54:50] PROBLEM - SSH on neon is CRITICAL: Connection refused
[22:54:58] that's me, can ignore those
[22:59:15] <^demon> Ok good, I was about to panic ;-)
[22:59:16] hey, quick question: why do i find an ip6 address in our squid logs?
[23:02:34] drdee: is it a log from europe ? we have ipv6 on some whitelisted resolvers for upload.wikimedia.org there
[23:02:47] let me check, hold on
[23:04:56] never mind, it's from ipv6and4.labs.wikimedia.org, didn't realize that
[23:15:15] maplebed: ok, the concurrency patch passes all unit tests
[23:15:22] * AaronSchulz has to trace down an annoying bug
[23:15:22] \o/
[23:15:24] *had
[23:15:34] all concurrency bugs are annoying.
[23:15:55] it was actually a bug in the core which I fixed, not just the code in that branch
[23:17:51] !log updating bgp policies on cr1.sdtpa
[23:17:53] Logged the message, Mistress of the network gear.
[23:43:56] Ryan_Lane: http://i.imgur.com/5C0py.jpg