[00:01:31] <icinga-wm>	 PROBLEM - Disk space on analytics1024 is CRITICAL: DISK CRITICAL - free space: / 1069 MB (3% inode=90%):  
[02:18:04] <logmsgbot>	 !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Aug 14 02:18:04 UTC 2013
[02:18:15] <morebots>	 Logged the message, Master
[02:40:07] <icinga-wm>	 PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours  
[03:39:49] <icinga-wm>	 PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours  
[03:39:49] <icinga-wm>	 PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours  
[03:39:49] <icinga-wm>	 PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours  
[03:39:49] <icinga-wm>	 PROBLEM - Puppet freshness on pdf3 is CRITICAL: No successful Puppet run in the last 10 hours  
[03:39:49] <icinga-wm>	 PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours  
[03:39:50] <icinga-wm>	 PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours  
[04:10:49] <icinga-wm>	 PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours  
[04:10:49] <icinga-wm>	 PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours  
[04:10:49] <icinga-wm>	 PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours  
[04:10:49] <icinga-wm>	 PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours  
[04:10:49] <icinga-wm>	 PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours  
[04:10:50] <icinga-wm>	 PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours  
[04:59:52] <icinga-wm>	 RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000  
[05:03:02] <icinga-wm>	 PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[05:25:12] <icinga-wm>	 PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[05:26:12] <icinga-wm>	 RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s)  
[06:20:51] <icinga-wm>	 PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[06:27:41] <icinga-wm>	 RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0)  
[06:28:41] <icinga-wm>	 RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000  
[06:29:31] <icinga-wm>	 PROBLEM - Disk space on analytics1024 is CRITICAL: DISK CRITICAL - free space: / 1052 MB (3% inode=90%):  
[06:31:51] <icinga-wm>	 PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[08:22:11] <icinga-wm>	 PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[08:24:11] <icinga-wm>	 PROBLEM - Puppet freshness on mchenry is CRITICAL: No successful Puppet run in the last 10 hours  
[08:24:11] <icinga-wm>	 RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s)  
[08:33:13] <grrrit-wm>	 (03CR) 10Mark Bergsma: "(1 comment)" [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon)
[08:45:51] <grrrit-wm>	 (03CR) 10Mark Bergsma: [C: 031] "(2 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/77975 (owner: 10BryanDavis)
[08:49:06] <grrrit-wm>	 (03PS2) 10Faidon: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 
[08:58:49] <icinga-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[08:59:39] <icinga-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time  
[09:13:46] <grrrit-wm>	 (03PS3) 10Faidon: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 
[09:39:58] <mark>	 !log Moved non-https, non-IPv6 eqiad traffic for *.wikimedia.org (wikimedialb) to text-varnish
[09:40:11] <morebots>	 Logged the message, Master
[10:01:19] <mark>	 !log Moved non-https, non-IPv6 eqiad traffic for *.wikimedia.org (wikimedialb) back to Squid
[10:01:30] <morebots>	 Logged the message, Master
[10:19:18] <icinga-wm>	 PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[10:20:08] <icinga-wm>	 RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s)  
[10:33:46] <grrrit-wm>	 (03PS1) 10Faidon: ceph: track the stable releases again [operations/puppet] - 10https://gerrit.wikimedia.org/r/79043 
[10:33:47] <grrrit-wm>	 (03CR) 10Faidon: [C: 032 V: 032] ceph: track the stable releases again [operations/puppet] - 10https://gerrit.wikimedia.org/r/79043 (owner: 10Faidon)
[10:34:24] <paravoid>	 oh dear
[10:34:42] <Thehelpfulone>	 Reedy, wikimania 2014 wiki doesn't seem to be showing up in Google search - do you happen to know if there was a change in the script used to create special wikis which means it's been noindexed?
[10:48:00] <paravoid>	 !log upgrading ceph to 0.67
[10:48:09] <morebots>	 Logged the message, Master
[10:57:07] <icinga-wm>	 PROBLEM - Ceph on ms-fe1001 is CRITICAL: Ceph HEALTH_WARN 24 pgs recovering: 4 pgs recovery_wait: 1234 pgs stale: 7 pgs stuck unclean: recovery 23/898167921 degraded (0.000%):  recovering 1 o/s, 0B/s, 61351 key/s: 10/142 in osds are down  
[10:57:21] <paravoid>	 hehe
[10:58:00] <paravoid>	 (that's normal)
[10:58:07] <icinga-wm>	 RECOVERY - Ceph on ms-fe1001 is OK: Ceph HEALTH_OK  
[10:59:00] <akosiaris>	 !log stoping etherpad-lite for maintainance (character set hell)
[10:59:10] <morebots>	 Logged the message, Master
[10:59:18] <icinga-wm>	 PROBLEM - Ceph on ms-fe1004 is CRITICAL: Ceph HEALTH_WARN 2736 pgs degraded: 121 pgs stuck unclean: recovery 49169908/898167933 degraded (5.474%)  
[11:01:07] <icinga-wm>	 PROBLEM - Ceph on ms-fe1001 is CRITICAL: Ceph HEALTH_WARN 1423 pgs stale: 12/142 in osds are down  
[11:02:28] <icinga-wm>	 PROBLEM - Ceph on ms-fe1003 is CRITICAL: Ceph HEALTH_WARN 1394 pgs stale: 12/142 in osds are down  
[11:03:37] <paravoid>	 I wonder if I should make this warning instead of critical
[11:04:07] <icinga-wm>	 RECOVERY - Ceph on ms-fe1001 is OK: Ceph HEALTH_OK  
[11:04:18] <icinga-wm>	 RECOVERY - Ceph on ms-fe1004 is OK: Ceph HEALTH_OK  
[11:04:28] <icinga-wm>	 RECOVERY - Ceph on ms-fe1003 is OK: Ceph HEALTH_OK  
[11:08:24] <paravoid>	 cmjohnson1: you were pinging me yesterday?
[11:09:29] <cmjohnson1>	 i did....but don't remember why
[11:10:50] <cmjohnson1>	 paravoid: now i remember...what log are you seeing the errors in ms-be1005 and 1008
[11:11:29] <paravoid>	 I pasted it on the tickets on RT
[11:11:37] <paravoid>	 typical i/o errors basically
[11:11:54] <grrrit-wm>	 (03PS1) 10Mark Bergsma: Don't cache login.wikimedia.org requests, for now [operations/puppet] - 10https://gerrit.wikimedia.org/r/79044 
[11:13:05] <grrrit-wm>	 (03CR) 10Mark Bergsma: [C: 032] Don't cache login.wikimedia.org requests, for now [operations/puppet] - 10https://gerrit.wikimedia.org/r/79044 (owner: 10Mark Bergsma)
[11:13:25] <paravoid>	 !log swift->ceph thumb sync (swiftrepl @ ms-fe1002)
[11:13:36] <morebots>	 Logged the message, Master
[11:46:31] <icinga-wm>	 PROBLEM - DPKG on ms-be3 is CRITICAL: DPKG CRITICAL dpkg reports broken packages  
[11:47:30] <icinga-wm>	 RECOVERY - DPKG on ms-be3 is OK: All packages OK  
[11:48:34] <paravoid>	 apergos: so I was looking at how ms-be1 uses far less CPU than all the others
[11:49:00] <paravoid>	 apergos: and it looks like it's the only one not running a container server
[11:49:33] <paravoid>	 how come?
[11:51:18] <apergos>	 yeah that was a reminder to me to understand better why the container servers take so much more cpu
[11:51:36] <paravoid>	 yeah it's crazy
[11:55:07] <paravoid>	 Aug 14 11:55:02 ms-be1 kernel: [6197403.078004] XFS (sdh1): xfs_log_force: error 5 returned.
[11:57:07] <paravoid>	 apergos: wanna handle it?
[11:57:13] <apergos>	 sure
[11:57:21] <paravoid>	 :)
[11:57:35] <paravoid>	 lunch for me, bbl
[11:57:42] <apergos>	 lenjoy
[12:23:13] <mark>	 !log Moved non-https, non-IPv6 eqiad traffic for *.wikimedia.org (wikimedialb) to text-varnish, not caching login.wikimedia.org
[12:23:25] <morebots>	 Logged the message, Master
[12:36:31] <icinga-wm>	 PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[12:37:30] <icinga-wm>	 RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0)  
[12:41:19] <icinga-wm>	 PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours  
[12:44:28] <paravoid>	 apergos: I just realized I was a bit cryptic, talking about two issues at the same time
[12:44:33] <paravoid>	 there's a failed disk in ms-be1
[12:44:50] <apergos>	 yes, I opened a ticket
[12:45:04] <paravoid>	 ah ok
[12:45:14] <paravoid>	 I was just rereading backlog
[12:45:19] <apergos>	 ok
[12:52:39] <icinga-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[12:53:30] <icinga-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time  
[13:05:58] <grrrit-wm>	 (03PS1) 10Mark Bergsma: Move wikimedialb to the text-varnish cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/79048 
[13:06:35] <paravoid>	 \o/
[13:07:17] <grrrit-wm>	 (03CR) 10Mark Bergsma: [C: 032] Move wikimedialb to the text-varnish cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/79048 (owner: 10Mark Bergsma)
[13:10:39] <icinga-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[13:11:29] <icinga-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time  
[13:13:26] <mark>	 i'm pass'ing login.wikimedia.org requests just in case
[13:22:39] <icinga-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[13:23:29] <icinga-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time  
[13:36:39] <icinga-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[13:37:29] <icinga-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time  
[13:40:19] <icinga-wm>	 PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours  
[13:40:19] <icinga-wm>	 PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours  
[13:40:19] <icinga-wm>	 PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours  
[13:40:19] <icinga-wm>	 PROBLEM - Puppet freshness on pdf3 is CRITICAL: No successful Puppet run in the last 10 hours  
[13:40:19] <icinga-wm>	 PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours  
[13:40:20] <icinga-wm>	 PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours  
[14:07:44] <mark>	 ottomata: so I guess we don't have a kafka 0.8 setup yet right?
[14:07:56] <ottomata>	 in labs, yes
[14:07:59] <ottomata>	 not in prod
[14:08:31] <ottomata>	 varnishkafka is running and producing to 0.8 in labs, with mirroring to a a second kafka cluster (of 1 node)
[14:09:31] <mark>	 yeah
[14:09:36] <mark>	 but for performance testing, labs is no good
[14:10:16] <mark>	 does it work fine in labs?
[14:11:19] <icinga-wm>	 PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours  
[14:11:19] <icinga-wm>	 PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours  
[14:11:19] <icinga-wm>	 PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours  
[14:11:19] <icinga-wm>	 PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours  
[14:11:19] <icinga-wm>	 PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours  
[14:11:20] <icinga-wm>	 PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours  
[14:32:39] <icinga-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[14:32:56] <grrrit-wm>	 (03PS1) 10Mark Bergsma: Add HTTPS service for misc_web [operations/puppet] - 10https://gerrit.wikimedia.org/r/79052 
[14:33:29] <icinga-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time  
[14:34:30] <grrrit-wm>	 (03CR) 10Mark Bergsma: [C: 032] Add HTTPS service for misc_web [operations/puppet] - 10https://gerrit.wikimedia.org/r/79052 (owner: 10Mark Bergsma)
[14:42:27] <ottomata>	 mark, sorry, ja works fine
[14:42:35] <ottomata>	 I talked a little with paravoid the other day
[14:42:44] <ottomata>	 if you want to perf test asap, we can set 0.8 up on a couple of the cisco boxes
[14:42:52] <mark>	 i don't want to
[14:42:56] <mark>	 but I believe diederik does ;)
[14:42:59] <ottomata>	 haha, yeah
[14:43:09] <mark>	 we have that meeting tomorrow
[14:43:11] <ottomata>	 aye
[14:43:14] <ottomata>	 i can do that pretty quick
[14:43:18] <mark>	 ok
[14:43:23] <ottomata>	 you ok if I don't do that puppetized?  those machines still need reinstalled anyway
[14:43:26] <mark>	 let me know when you have something, then i'll install varnishkafka on a varnish box
[14:43:29] <mark>	 yep
[14:43:33] <ottomata>	 we haven't put the .deb in apt yet, because it isn't 100% stable yet
[14:43:35] <ottomata>	 ok cool
[14:43:37] <mark>	 yeah
[14:44:14] <drdee>	 mark: i thought you wanted to do performance testing on a production box, i was just checking the status :D
[14:44:25] <drdee>	 buuuut we need some testing in a real env, right?
[14:44:25] <mark>	 correct
[14:44:37] <mark>	 i need a kafka install to send to
[14:44:46] <mark>	 i'll test varnishkafka on one production varnish box
[14:44:47] <drdee>	 sure
[14:44:51] <drdee>	 kool
[14:45:33] <paravoid>	 ottomata: I did a couple of updates in librdkafka & varnishkafka today
[14:45:46] <paravoid>	 and I have .debs in brewster's /root (but not in apt)
[14:48:23] <grrrit-wm>	 (03PS1) 10Mark Bergsma: Recommission cp104[34] [operations/puppet] - 10https://gerrit.wikimedia.org/r/79056 
[14:48:34] <ottomata>	 oh awesome
[14:48:43] <ottomata>	 thanks paravoid, i'll use those ones
[14:49:05] <ottomata>	 oh wait, i don't need those 
[14:49:09] <ottomata>	 ha, mark will use those ones :p
[14:49:23] <mark>	 ?
[14:49:28] <mark>	 no this is unrelated
[14:49:35] <grrrit-wm>	 (03CR) 10Mark Bergsma: [C: 032] Recommission cp104[34] [operations/puppet] - 10https://gerrit.wikimedia.org/r/79056 (owner: 10Mark Bergsma)
[14:50:02] <ottomata>	 i mean, i don't need those to install kafka 0.8 somewhere for you.  you can setup varnishkafka using those .debs (or however you want)
[14:50:35] <paravoid>	 yes, I just mentioned them since you were also playing with them in labs
[14:50:38] <ottomata>	 aye danke
[14:50:57] <grrrit-wm>	 (03CR) 10BryanDavis: "(2 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/77975 (owner: 10BryanDavis)
[14:51:08] <grrrit-wm>	 (03PS4) 10BryanDavis: Add ganglia monitoring for vhtcpd. [operations/puppet] - 10https://gerrit.wikimedia.org/r/77975 
[14:52:39] <icinga-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[14:53:19] <icinga-wm>	 PROBLEM - DPKG on analytics1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages  
[14:53:59] <icinga-wm>	 PROBLEM - DPKG on analytics1004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages  
[14:54:30] <icinga-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time  
[14:58:16] <chrismcmahon>	 apergos: thanks for the help with beta labs!  something still seems wonky there though, but it's inconsistent.  some browser instances seem fine, another gets no js/css, another gets consistent timeouts on pages.   I'm looking for some reliable repro for this, but it's sketchy
[14:58:47] <apergos>	 yw and it's weird you're getting that off and on behavior
[14:59:03] <apergos>	 I tried a few pages and it was ok for me but I didn't switch browsers or projects or anything
[15:01:09] <chrismcmahon>	 apergos: weird, my timeouts just stopped, now pages for that instance are loading.  guessing something needed to propagate. 
[15:01:14] <apergos>	 :-D
[15:01:23] <apergos>	 well that's irritating
[15:02:00] <apergos>	 http://ganglia.wmflabs.org/latest/?r=20min&cs=&ce=&m=load_one&s=by+name&c=deployment-prep&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4  these show some interesting spikes for the apaches
[15:02:09] <apergos>	 een when I look at the 4 hour graph it's the same
[15:02:12] <apergos>	 but why?  who knows
[15:02:40] <chrismcmahon>	 apergos:   I have this overly-complicated browser test that I am trying to sort by running against beta.    It seems to be cranking now after getting consistent timeouts until like literally seconds ago. 
[15:03:01] <chrismcmahon>	 So I'll be stressing it today. 
[15:05:30] <apergos>	 sounds good
[15:06:12] <grrrit-wm>	 (03PS1) 10Mark Bergsma: Add Misc Web caching Ganglia cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/79057 
[15:06:13] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Add Misc Web caching Ganglia cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/79057 (owner: 10Mark Bergsma)
[15:06:25] <chrismcmahon>	 apergos:  when you restarted apache on the apache33 host, was it a simple 'apachectl -k restart' or did you do something fancier than that?
[15:07:19] <apergos>	 oh I shot the parent, it wouldn't restart just with the init script
[15:07:29] <apergos>	 and I never remember the existence of apachectl
[15:07:58] <grrrit-wm>	 (03PS2) 10Mark Bergsma: Add Misc Web caching Ganglia cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/79057 
[15:08:43] <chrismcmahon>	 apergos:  so what exactly did you do to restart?  'kill -9'?  haha, only serious... 
[15:08:56] <grrrit-wm>	 (03CR) 10Mark Bergsma: [C: 032] Add Misc Web caching Ganglia cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/79057 (owner: 10Mark Bergsma)
[15:09:13] <apergos>	 after shooting the one process still around, /etc/init.d/apache<tab> start 
[15:09:23] <apergos>	 prolyl apache2, don't remember now
[15:10:18] <chrismcmahon>	 apergos: thanks, we should consider writing a troubleshooting guide to beta labs I think. 
[15:10:33] <apergos>	 er, once we understand it better :-D
[15:11:24] <chrismcmahon>	 apergos: this is a great example.  I'm pretty sure I did 'apachectl -k restart' on the apache32 host at least but it didn't dtrt. 
[15:11:40] <apergos>	 so I always check to see if the process actually restarted
[15:11:51] <apergos>	 as in, it's there with a current time stamp on it
[15:12:04] <apergos>	 and not just osme of the children but all of them
[15:12:33] <apergos>	 but for things like this /etc/init.d/something is almost always guaranteed to start something once you have stopped it by whatever means
[15:12:47] <chrismcmahon>	 apergos: yep, that would be good information to have for drive-by maintainers
[15:12:51] <apergos>	 right
[15:13:16] <apergos>	 I guess there might be some things in there which are upstart jobs and don't have anything in init.d any more but it's the same principle
[15:13:26] <chrismcmahon>	 it's been a lot of years since I cared much about the details of how apache works 
[15:13:36] <apergos>	 yup
[15:14:18] <grrrit-wm>	 (03PS1) 10Mark Bergsma: Make nginx listen on IPv6 as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/79058 
[15:15:09] <apergos>	 do people discuss operational stuff about it in the labs channel or not so much?
[15:15:13] <grrrit-wm>	 (03CR) 10Mark Bergsma: [C: 032] Make nginx listen on IPv6 as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/79058 (owner: 10Mark Bergsma)
[15:17:28] <grrrit-wm>	 (03PS1) 10Nemo bis: Remove long-buried $wgLogAutocreatedAccounts [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79059 
[15:20:07] <grrrit-wm>	 (03PS1) 10Mark Bergsma: Add IPv6 addresses to misc_web service [operations/puppet] - 10https://gerrit.wikimedia.org/r/79060 
[15:20:44] <grrrit-wm>	 (03CR) 10Mark Bergsma: [C: 032] Add IPv6 addresses to misc_web service [operations/puppet] - 10https://gerrit.wikimedia.org/r/79060 (owner: 10Mark Bergsma)
[15:22:39] <icinga-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[15:23:30] <icinga-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time  
[15:31:00] <ottomata>	 hookay mark
[15:31:06] <ottomata>	 0.8 running on analytics1003 and analytics1004
[15:31:13] <mark>	 cool :)
[15:31:17] <ottomata>	 i'll create a topic 'varnish' with replication factor 2
[15:31:18] <mark>	 I think i'll play with that tomorrow then
[15:31:24] <ottomata>	 you can configure varnishkafka to produce to that
[15:36:42] <paravoid>	 yay
[15:36:48] <paravoid>	 (sorry, was in a meeting)
[15:37:30] <icinga-wm>	 PROBLEM - SSH on searchidx1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[15:37:58] <grrrit-wm>	 (03PS1) 10Mark Bergsma: Revert "Add HTTPS service for misc_web" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79062 
[15:38:20] <icinga-wm>	 RECOVERY - SSH on searchidx1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)  
[15:39:18] <grrrit-wm>	 (03PS2) 10Mark Bergsma: Revert "Add HTTPS service for misc_web" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79062 
[15:40:16] <grrrit-wm>	 (03CR) 10Mark Bergsma: [C: 032] Revert "Add HTTPS service for misc_web" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79062 (owner: 10Mark Bergsma)
[15:52:25] <grrrit-wm>	 (03PS1) 10Mark Bergsma: Add the port number to the PyBal service name [operations/puppet] - 10https://gerrit.wikimedia.org/r/79064 
[15:53:14] <grrrit-wm>	 (03CR) 10Mark Bergsma: [C: 032] Add the port number to the PyBal service name [operations/puppet] - 10https://gerrit.wikimedia.org/r/79064 (owner: 10Mark Bergsma)
[15:56:46] <andre__>	 Hi opsen. Some nl.wikipedia users get old versions of pages sometimes, likely depending on the servers
[15:56:50] <andre__>	 see https://bugzilla.wikimedia.org/show_bug.cgi?id=52853
[15:57:12] <andre__>	 didn't see anything suspicious in SAL. Is there anything known going on? Any idea how to investigate further?
[16:00:18] <grrrit-wm>	 (03PS1) 10Mark Bergsma: "Add HTTPS service for misc_web"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79065 
[16:00:39] <grrrit-wm>	 (03PS2) 10Mark Bergsma: "Add HTTPS service for misc_web"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79065 
[16:01:38] <greg-g>	 apergos: thanks for the beta cluster diagnosis and fix!
[16:01:44] <apergos>	 yw
[16:01:56] <grrrit-wm>	 (03CR) 10Mark Bergsma: [C: 032] "Add HTTPS service for misc_web"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79065 (owner: 10Mark Bergsma)
[16:04:23] <andre__>	 Any opsen who might take a look at bug 52853 / the issue with nl.wikipedia that I mentioned 10min ago: valhallasw and Romaine are in this channel and could help
[16:04:26] <andre__>	 would be very welcome
[16:04:39] <paravoid>	 bblack: around?
[16:04:48] <Romaine>	 https://bugzilla.wikimedia.org/show_bug.cgi?id=52853
[16:05:07] <Romaine>	 page jumps back and forth in time
[16:06:28] <bblack>	 paravoid: yes
[16:06:51] <chrismcmahon>	 apergos:  beta labs is giving 503s again :(
[16:07:00] <apergos>	 grrr
[16:07:11] <chrismcmahon>	 but not as often.  I just saw one. 
[16:07:19] <paravoid>	 Romaine: X-Cache/X-Cache-Lookup headers would be useful 
[16:07:37] <Romaine>	 how do I get those?
[16:07:48] <valhallasw>	 paravoid: I posted a set for one of my last requests just a few secs ago
[16:08:10] <greg-g>	 chrismcmahon: apergos I thanked too soon/jinxed it!
[16:08:39] <valhallasw>	 Romaine: in chrome: F12, network tab, press F5 until you get a broken server, then select the top entry, and copy the 'Response Headers' under 'Headers' on the right.
[16:08:46] <valhallasw>	 not sure for other browsers...
[16:08:48] <apergos>	 these are going tobe forr some other reason. 
[16:09:28] <apergos>	 if you retry do you get the page?
[16:10:31] <Romaine>	 valhallasw: I have a broken version in front of me, where I see the 'Response Headers'?
[16:10:53] <Romaine>	 (I use Firefox)
[16:10:56] <apergos>	 chrismcmahon: 
[16:11:15] <valhallasw>	 Romaine: not sure about firefox...
[16:11:23] <chrismcmahon>	 greg-g apergos seeing about 1 out of 6 503, it's just too flaky to repro
[16:12:05] <chrismcmahon>	 and the more I use it the better it gets I think
[16:12:14] <chrismcmahon>	 caching maybe then
[16:12:17] <valhallasw>	 paravoid: http://pastebin.com/enpN1EEF / http://pastebin.com/U85bb1AF
[16:12:57] <apergos>	 if you keep livehttpheaders open you can see which of theseare cache miss for frnt/back end
[16:13:06] <mark>	 Romaine: are you logged in when you see the older page?
[16:13:13] <Romaine>	 yes
[16:13:20] <mark>	 then it's very unlikely to be squid...
[16:13:25] <valhallasw>	 mine are also all logged in
[16:13:28] <paravoid>	 it's not squid
[16:13:37] <paravoid>	 the "NOT OK" shows all MISSes
[16:13:49] * valhallasw  has to go now. I'll check back in in half an hour or so
[16:13:54] <paravoid>	 hm, your NOT OK list says Last-Modified:Wed, 14 Aug 2013 15:37:57 GMT
[16:14:03] <valhallasw>	 yes, that was surprising to me, too.
[16:14:16] <paravoid>	 parser cache?
[16:14:27] <mark>	 I think so
[16:14:30] <Romaine>	 (it is happening the past days)
[16:21:26] <mark>	 i'm logged in too, keep refreshing that page, but not seeing it
[16:22:56] <Romaine>	 mark: I just tried again and out of 4 I got one older version
[16:23:10] <grrrit-wm>	 (03PS1) 10Cmjohnson: adding hafnium to netboot [operations/puppet] - 10https://gerrit.wikimedia.org/r/79067 
[16:24:18] <mark>	 assuming the date at the bottom of the page is correct
[16:24:34] <^d>	 hi mark
[16:24:40] <mark>	 hi
[16:24:43] <Romaine>	 I have it both in http and with https
[16:24:47] <grrrit-wm>	 (03CR) 10Cmjohnson: [C: 032 V: 032] adding hafnium to netboot [operations/puppet] - 10https://gerrit.wikimedia.org/r/79067 (owner: 10Cmjohnson)
[16:25:10] <paravoid>	 I have zero clue about parser cache
[16:25:36] <^d>	 mark: So, since gitblit is being served by varnish now, want to move antimony to an internal address?
[16:25:39] <chrismcmahon>	 greg-g: fwiw, the more I'm using beta the better it performs. 
[16:25:41] <^d>	 paravoid: I know some things, shoot.
[16:25:45] <Romaine>	 even at this moment
[16:25:47] <mark>	 ^d: er, gitblit served by varnish?
[16:25:59] <^d>	 I thought it was being served by the varnish cache you had setup.
[16:26:01] <mark>	 i didn't change dns
[16:26:06] <^d>	 Ahhh, nevermind then :)
[16:26:06] <mark>	 i'm testing it
[16:26:10] <mark>	 we can move it over soon
[16:26:16] <mark>	 but still working on https and some tidbits
[16:26:21] * ^d  nods
[16:26:31] <mark>	 you can test it if you want
[16:26:59] <paravoid>	 ^d: https://bugzilla.wikimedia.org/show_bug.cgi?id=52853 the suspicions are on parser cache now
[16:27:02] <mark>	 change your /etc/hosts to 208.80.154.241 for git.wikimedia.org for that
[16:27:12] <mark>	 and/or 2620:0:861:ed1a::11 for ipv6
[16:27:33] <mark>	 Romaine: weird, I can't reproduce it
[16:27:43] <mark>	 http://nl.wikipedia.org/wiki/Wikipedia:Aanmelding_moderatoren, correct ?
[16:27:46] <greg-g>	 chrismcmahon: hah, I guess we just need google to index it contstantly, then ;)
[16:27:46] <Romaine>	 yes
[16:27:47] <paravoid>	 me neither
[16:27:59] <Romaine>	 I have reproduced at least 20 times
[16:28:00] <paravoid>	 I even tried multiple mwNNNN servers
[16:28:05] <Romaine>	 and a lot of users too
[16:28:53] <Romaine>	 it is randomly happen
[16:29:56] <^d>	 paravoid: So to summarize the bug: logged in users are seeing different versions of the page depending on which server they hit?
[16:30:20] <paravoid>	 Last-Modified is correct, content body sometimes is stale
[16:30:26] <paravoid>	 only happens on logged in users so far
[16:30:43] <mark>	 ha
[16:30:53] <mark>	 when I click edit, I see no votes past august 10
[16:31:18] <Romaine>	 yes
[16:31:28] <Romaine>	 it actually shows an older version of that page
[16:31:33] <Romaine>	 instead of the most recent one
[16:33:09] <^d>	 Hmm, I'm definitely seeing the latest info both in the page and when I click edit.
[16:34:08] <Romaine>	 how can I help with reproducing the problem?
[16:35:59] <mark>	 now i'm seeing the latest version edit page too
[16:38:15] <mark>	 seems like I didn't get a LM header...
[16:41:04] <mark>	 but I don't trust web-developer with that
[16:43:52] <mark>	 strange that it would happen for multiple URLs related to this one page
[16:44:03] <mark>	 perhaps some memcached key?
[16:49:29] <Romaine>	 valhallasw: not solved yet
[16:51:23] <chrismcmahon>	 apergos: beta now serving me all 503 and load went nuts http://ganglia.wmflabs.org/latest/?r=20min&cs=&ce=&m=load_one&s=by+name&c=deployment-prep&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[16:52:01] <Coren>	 chrismcmahon: Is NFS issue.  I'm going to deploy a workaround this afternoon that, if nothing else, will tell us definitely if the issue is at the driver or hardware level.
[16:52:32] <apergos>	 apache is dead on apache33
[16:53:16] <chrismcmahon>	 hi Coren thanks.  beta has been limping along all week, apergos and I were trying to make it at least plod. 
[16:53:56] <apergos>	 restarted
[16:54:01] <apergos>	 so you'll be k for a little while
[16:54:05] <Coren>	 Yeah, either way the problem is a paid.  If it /is/ a driver regression issue we're in trouble eventually since those controllers are all over both DCs.
[16:54:14] <Coren>	 pain*
[16:54:15] <chrismcmahon>	 apergos:  thanks
[16:54:31] <apergos>	 dead n apache32 too
[16:54:54] <chrismcmahon>	 Coren: my big project on beta right now is the CirrusSearch stuff.  wow apergos I killed *both* apaches?  nice. 
[16:55:02] <apergos>	 restarted there too
[16:55:08] <apergos>	 yes you did a number on em
[16:55:16] <chrismcmahon>	 lemme try that again
[16:55:17] <apergos>	 "congrats"? :-P
[16:55:26] <chrismcmahon>	 thus I work in QA
[16:55:29] <apergos>	 you get to restart em this time!
[16:55:34] <chrismcmahon>	 OK :)
[16:56:16] <chrismcmahon>	 Coren: just wondering, "this afternoon" in what time zone? 
[16:56:24] <grrrit-wm>	 (03PS1) 10Cmjohnson: Revert "adding hafnium to netboot" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79069 
[16:56:48] <grrrit-wm>	 (03CR) 10Cmjohnson: [C: 032 V: 032] Revert "adding hafnium to netboot" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79069 (owner: 10Cmjohnson)
[16:57:11] <Coren>	 Mine (GMT-4).  I should have specified.  Then again, the rsync I need to do will probably take quite some time so that's probably going to end up being afternoon PTD.
[16:57:24] * Coren  is about to send to labs-l about it.
[17:02:09] <grrrit-wm>	 (03PS1) 10Cmjohnson: adding hafnium with lvm cfg [operations/puppet] - 10https://gerrit.wikimedia.org/r/79070 
[17:03:38] <grrrit-wm>	 (03CR) 10Cmjohnson: [C: 032 V: 032] adding hafnium with lvm cfg [operations/puppet] - 10https://gerrit.wikimedia.org/r/79070 (owner: 10Cmjohnson)
[17:15:39] <icinga-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[17:16:30] <icinga-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time  
[17:27:39] <icinga-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[17:29:30] <icinga-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time  
[17:33:00] <grrrit-wm>	 (03PS1) 10RobH: aaron handed me a new pubkey, unless it was a pod person... [operations/puppet] - 10https://gerrit.wikimedia.org/r/79073 
[17:33:41] <RobH>	 AaronSchulz: are you a pod person or the real aaron... wait i bet you would lie if you were a pod person
[17:33:50] <RobH>	 i think ops is going to require dna sampling for ssh keys now.
[17:34:25] <grrrit-wm>	 (03CR) 10RobH: [C: 032] aaron handed me a new pubkey, unless it was a pod person... [operations/puppet] - 10https://gerrit.wikimedia.org/r/79073 (owner: 10RobH)
[17:36:00] <RobH>	 AaronSchulz: key is merged on puppetmaster, it'll be a number of hours for it to propogate across cluster
[17:36:12] <RobH>	 if you need a specific system access right away lemmeknow and i kick it until it updates
[17:37:44] <Romaine>	 mark: any news about the bug?
[17:39:15] <chrismcmahon>	 beta dying again it seems.  thanks Coren it is nice to know why at least.
[17:40:05] <Coren>	 Even better would be the problem not being there.  I'm working on the copying now; with the switch panned for 20:00 UTC
[17:40:13] <Coren>	 planned*
[17:40:39] <icinga-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[17:40:55] <Coren>	 chrismcmahon: That said, if beta is dying *now* it might not be the NFS after all: it's not stalled atm.
[17:41:30] <icinga-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time  
[17:41:33] <apergos>	 is apache running? 
[17:41:54] <chrismcmahon>	 apergos: not sure, but ganglia shows that familiar pattern http://ganglia.wmflabs.org/latest/?r=20min&cs=&ce=&m=load_one&s=by+name&c=deployment-prep&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[17:41:59] <apergos>	 (also I'm not really here. it's way past my afktime. but I wanna process some of these photos)
[17:42:07] <logmsgbot>	 !log csteipp synchronized php-1.22wmf12/includes
[17:42:08] <chrismcmahon>	 apergos: np
[17:42:19] <morebots>	 Logged the message, Master
[17:42:29] <apergos>	 you have rights to log onto the apache32 and 33 instances right?
[17:42:33] <apergos>	 just peek in and see
[17:42:56] <apergos>	 I did not look at the logs the last time but maybe the syslog or dmesg will have something interesting in it too
[17:43:18] <chrismcmahon>	 yes
[17:45:37] <chrismcmahon>	 I'll surf over to 32/33.  So many yaks.  What I'm actually *trying* to do is to optimize an automated browser test. :)
[17:45:52] <AaronSchulz>	 RobH: \o/
[17:47:23] <apergos>	 if you don't feel like looking into it, just restart em (if they are dead)
[17:47:35] <apergos>	 if you kill them tomorrow I can look at it then
[17:47:57] <apergos>	 I just know my brain has entered low gear for the night, so best to give it a rest
[17:53:34] <grrrit-wm>	 (03PS1) 10BBlack: add queue_size and queue_max_size stats output [operations/software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/79076 
[17:53:56] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] add per-connection purging limits for sanity [operations/software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/75128 (owner: 10BBlack)
[17:55:17] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] add queue_size and queue_max_size stats output [operations/software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/79076 (owner: 10BBlack)
[17:56:34] <grrrit-wm>	 (03CR) 10Demon: [C: 031] Configure elasticearch multicast per datacenter [operations/puppet] - 10https://gerrit.wikimedia.org/r/78966 (owner: 10Manybubbles)
[17:57:39] <icinga-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[17:58:30] <icinga-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time  
[17:59:32] <Romaine>	 valhallasw: one users reports a strange thing
[17:59:58] <Romaine>	 he says he sees: (huidig | vorige) 10 aug 2013 19:44? Glatisant (Overleg | bijdragen)? . . (59.018 bytes) (+59.018)? . . (??Voor moderatorschap Dqfn13) (ongedaan maken)
[18:00:08] <Romaine>	 (without the ?)
[18:00:35] <Romaine>	 while I only see (+124)?
[18:00:39] <icinga-wm>	 PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[18:01:01] <grrrit-wm>	 (03PS1) 10Ottomata: Using renamed geowiki repo instead of editor-geocoding [operations/puppet] - 10https://gerrit.wikimedia.org/r/79078 
[18:01:01] <Romaine>	 mark ^^
[18:01:03] <grrrit-wm>	 (03PS2) 10Ottomata: Using renamed geowiki repo instead of editor-geocoding [operations/puppet] - 10https://gerrit.wikimedia.org/r/79078 
[18:01:21] <valhallasw>	 Romaine: that's... really strange.
[18:01:22] <Romaine>	 that is specific the revision that keeps showing up randomly
[18:01:30] <icinga-wm>	 RECOVERY - Disk space on labstore3 is OK: DISK OK  
[18:01:57] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032 V: 032] Using renamed geowiki repo instead of editor-geocoding [operations/puppet] - 10https://gerrit.wikimedia.org/r/79078 (owner: 10Ottomata)
[18:02:23] <valhallasw>	 Romaine: no, that's the one just after
[18:02:37] <valhallasw>	 I also see the +59.018 there
[18:02:39] <icinga-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[18:03:01] <Romaine>	 you are right
[18:03:04] <Romaine>	 the one after
[18:03:27] <valhallasw>	 that implies the software interprets it as a new page (or a diff from an empty page)...
[18:03:29] <Romaine>	 yes, I have reproduced it
[18:03:30] <icinga-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.146 second response time  
[18:03:44] <Romaine>	 I have now a version in my screen with (huidig | vorige) 10 aug 2013 19:44? Glatisant (Overleg | bijdragen)? . . (59.018 bytes) (+59.018)? . . (??Voor moderatorschap Dqfn13) (ongedaan maken)
[18:04:34] <grrrit-wm>	 (03PS1) 10Cmjohnson: changing hafnium back to raid1-lvm cfg cuz there are 2 disk now [operations/puppet] - 10https://gerrit.wikimedia.org/r/79082 
[18:04:46] <grrrit-wm>	 (03PS1) 10BBlack: 0.0.9 stuff [operations/software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/79083 
[18:04:58] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] 0.0.9 stuff [operations/software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/79083 (owner: 10BBlack)
[18:05:28] <grrrit-wm>	 (03PS1) 10BBlack: Merge branch 'master' into debian [operations/software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/79084 
[18:05:29] <grrrit-wm>	 (03PS1) 10BBlack: bump pkg version [operations/software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/79085 
[18:05:42] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] Merge branch 'master' into debian [operations/software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/79084 (owner: 10BBlack)
[18:05:55] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] bump pkg version [operations/software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/79085 (owner: 10BBlack)
[18:06:54] <grrrit-wm>	 (03CR) 10Cmjohnson: [C: 032 V: 032] changing hafnium back to raid1-lvm cfg cuz there are 2 disk now [operations/puppet] - 10https://gerrit.wikimedia.org/r/79082 (owner: 10Cmjohnson)
[18:12:00] <Romaine>	 valhalla1w: I added this to https://bugzilla.wikimedia.org/show_bug.cgi?id=52853
[18:12:24] <valhalla1w>	 great!
[18:12:57] <Romaine>	 weird
[18:24:19] <icinga-wm>	 PROBLEM - Puppet freshness on mchenry is CRITICAL: No successful Puppet run in the last 10 hours  
[18:25:37] <cmjohnson1>	 robh: clarify for me...are sas disk backwards compatible for a sata backplane?
[18:25:50] <cmjohnson1>	 i know it works the other way around
[18:25:59] <cmjohnson1>	 bios settings are ata, ahci and raid
[18:29:09] <icinga-wm>	 PROBLEM - RAID on analytics1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[18:32:09] <icinga-wm>	 PROBLEM - RAID on analytics1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.  
[18:32:39] <icinga-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[18:33:30] <icinga-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time  
[18:39:29] <icinga-wm>	 PROBLEM - SSH on labstore3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[18:39:52] <grrrit-wm>	 (03CR) 10BBlack: "Can you update this patchset for https://git.wikimedia.org/blobdiff/operations%2Fsoftware%2Fvarnish%2Fvhtcpd.git/13892a2f274f75a850cebcb5d" [operations/puppet] - 10https://gerrit.wikimedia.org/r/77975 (owner: 10BryanDavis)
[18:40:20] <icinga-wm>	 RECOVERY - SSH on labstore3 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)  
[19:19:00] <grrrit-wm>	 (03PS5) 10BryanDavis: Add ganglia monitoring for vhtcpd. [operations/puppet] - 10https://gerrit.wikimedia.org/r/77975 
[19:19:41] <grrrit-wm>	 (03CR) 10BryanDavis: "patchset 5 adds the new `queue_size` and `queue_max_size` stats. It is also rebased against head of production branch." [operations/puppet] - 10https://gerrit.wikimedia.org/r/77975 (owner: 10BryanDavis)
[19:20:21] <bd808>	 bblack: ^
[19:20:34] <bblack>	 bd808: thanks :)
[19:20:42] <bd808>	 np
[19:32:43] <preilly>	 Anybody in the office want to grab lunch?
[19:35:21] <grrrit-wm>	 (03CR) 10Ori.livneh: "I'm not going to keep rebasing this. Just let me know when you want to merge it." [operations/puppet] - 10https://gerrit.wikimedia.org/r/75087 (owner: 10Ori.livneh)
[19:37:58] <preilly>	 ha ha ha ^^
[19:47:20] <valhalla1w>	 ori-l: you guys could also switch from 'merge on submit' to 'rebase on submit', I guess? :-)
[19:48:48] <Romaine>	 preilly: it is time here for a evening drink
[20:39:19] <icinga-wm>	 PROBLEM - Puppet freshness on zirconium is CRITICAL: No successful Puppet run in the last 10 hours  
[21:01:05] <bawolff>	 I think cache purging on cp1063 is broken
[21:04:00] <bawolff>	 Both  https://upload.wikimedia.org/wikipedia/test2/thumb/e/eb/BDavis-test-del.jpg/159px-BDavis-test-del.jpg and  https://upload.wikimedia.org/wikipedia/test2/thumb/e/eb/BDavis-test-del.jpg/120px-BDavis-test-del.jpg seem to go through that varnish (cp1063), and neither of them are responding to purges
[21:04:36] <bawolff>	 However, other sizes do get varnish cache cleared on purge, and those particular sizes appear to get swift cache of the thumb cleared on ?action=purge. But the varnish doesn't clear
[21:06:22] <bawolff>	 If somebody could check hit that particular varnish with a stick (or check vhtcpd is running on it properly, etc), that would be awesome
[21:12:28] <grrrit-wm>	 (03CR) 10Demon: "Ok, I think this is fine now with all my other changes merged in." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78083 (owner: 10Demon)
[21:12:53] <^d>	 AaronSchulz: Can you take a look at ^ again, and see if I resolved the issue in my next to last comment with the other changes we merged?
[21:12:59] <^d>	 I think we're all set now :)
[21:14:03] <grrrit-wm>	 (03CR) 10Demon: [C: 031] Turn on more default elasticsearch logging. [operations/puppet] - 10https://gerrit.wikimedia.org/r/78903 (owner: 10Manybubbles)
[21:14:33] <grrrit-wm>	 (03CR) 10Demon: [C: 031] Setup metrics collection for elasticserch [operations/puppet] - 10https://gerrit.wikimedia.org/r/78414 (owner: 10Manybubbles)
[21:16:19] <grrrit-wm>	 (03CR) 10Aaron Schulz: "Seems OK when those core changes are backported/deployed" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78083 (owner: 10Demon)
[21:17:14] <bawolff>	 I filed what i mentioned above as bug 52864, so it doesn't get forgotten
[21:17:28] <jeremyb>	 bawolff: sounds like something i've seen before
[21:17:39] <jeremyb>	 it's missing from swift but is in varnish
[21:17:55] <bawolff>	 I don't think its missing from swift
[21:18:06] <jeremyb>	 so it doesn't get purged when a purge is done. (it's not even attempted)
[21:18:09] <jeremyb>	 how do you know?
[21:18:10] <bawolff>	 I used an alternate url, which should have recreated it in swift
[21:18:52] <bawolff>	 When I download from the alternate url, I get the same file (based on file mod date in exif), when I hit purge, this seems to delete the swift file, since the mod date updates
[21:19:20] <bawolff>	 For example, do https://upload.wikimedia.org/wikipedia/test2/thumb/e/eb/BDavis-test-del.jpg/159px-BDavis-test-del.jpg?randomstring
[21:19:26] <bawolff>	 Then do https://upload.wikimedia.org/wikipedia/test2/thumb/e/eb/BDavis-test-del.jpg/159px-BDavis-test-del.jpg?randomstring2
[21:19:34] <bawolff>	 note how they have same exif modification date
[21:19:39] <bawolff>	 then ?action=purge
[21:19:49] <bawolff>	 then https://upload.wikimedia.org/wikipedia/test2/thumb/e/eb/BDavis-test-del.jpg/159px-BDavis-test-del.jpg?randomstring3 has a new exif modification date
[21:19:52] <jeremyb>	 what does ?action=purge mean exactly?
[21:19:54] <grrrit-wm>	 (03CR) 10Aaron Schulz: "Actually looks be fine regardless" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78083 (owner: 10Demon)
[21:20:04] <bawolff>	 https://test2.wikipedia.org/wiki/File:BDavis-test-del.jpg?action=purge
[21:20:36] <bawolff>	 And the entire time, the https://upload.wikimedia.org/wikipedia/test2/thumb/e/eb/BDavis-test-del.jpg/159px-BDavis-test-del.jpg does not have the age header reset
[21:21:06] <bawolff>	 However, this only happens with 120px and 159px. Other sizes I tried have all the cache clearing work as expected
[21:21:23] <bawolff>	 and the commonality between 120px and 159px is they both go through varnish cp1063
[21:21:39] <jeremyb>	 i just got a 500...
[21:22:27] <bawolff>	 interesting. Doing what?
[21:23:10] <jeremyb>	 generating a new thumb. refresh and it worked
[21:23:37] <bawolff>	 oh fun. A repeat of that bug from the weekend?
[21:23:53] <jeremyb>	 idk, maybe it was a one-off
[21:24:25] <jeremyb>	 would be nice if that page was changed to have the host not in a comment. i now lost the 500 body
[21:24:44] <bawolff>	 hopefully. Special:newfiles on commons seems fine, so I don't think its as widespread as last weekend's bug
[21:38:50] <bawolff>	 jeremyb: I also did a test of all thumbnails of Lysurus_castaneiceps.jpg between 96px to 106px. 2 out of the 10 did not purge properly
[21:39:11] <bawolff>	 both were served by cp1063
[21:39:51] <bawolff>	 (The two in question were 104px and 98px)
[21:40:56] <jeremyb>	 bawolff: welp, could be the varnish itself... can't be a cross atlantic problem because that's in eqiad
[21:42:07] <bawolff>	 Yeah, I'm in north america, but I also tested the original file going through esams, and there was no change
[21:42:26] <jeremyb>	 bawolff: regardless the hit is from eqiad
[21:42:41] <jeremyb>	 so the thing that needs purging is in eqiad
[21:43:27] <grrrit-wm>	 (03CR) 10Demon: [C: 032] Redo search configuration [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78083 (owner: 10Demon)
[21:43:49] <bawolff>	 So the esams caching servers forward to eqiad now? 
[21:43:58] <grrrit-wm>	 (03Merged) 10jenkins-bot: Redo search configuration [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78083 (owner: 10Demon)
[21:44:14] * bawolff  must admit, I'm not familar with the intricate set up of wmf's networking architecture
[21:44:57] <jeremyb>	 bawolff: they always did. no canonical data is in ams. that's a legal decision
[21:45:07] <jeremyb>	 bawolff: esams is only caching
[21:45:35] <bawolff>	 I mean its forwarding the cache hit to eqiad
[21:45:42] <jeremyb>	 i don't follow
[21:45:51] <bawolff>	 since the varnish store a copy of the file
[21:45:52] <jeremyb>	 you can see in the headers where it was a hit or a miss
[21:46:02] <jeremyb>	 some varnishes are backed by other varnishes
[21:46:11] <jeremyb>	 and for text squids are backed by other squids
[21:46:27] <jeremyb>	 i wonder if any squids are backed by varnish or varnish by squid. i guess not
[21:49:22] <bawolff>	 Ah. So you mean that the esams varnishes are backed by eqiad varnishes, and cp1063 happens to be a varnish that is physically in eqiad?
[21:49:45] <jeremyb>	 yes and yes
[21:49:48] <jeremyb>	 AIUI
[21:50:41] <bawolff>	 But in theory, an esams varnish could have a cached copy of a file, just in this case that's not relavent?
[21:51:04] <jeremyb>	 well it doesn't matter if esams has it if it's not being purged in eqiad
[21:51:16] <jeremyb>	 let me fetch again to be triply sure
[21:51:37] <jeremyb>	 but i think i should also be hitting eqiad, i'm in north america
[21:52:01] <bawolff>	 My question is probably irrelavent to the current situation, I'm just asking out of curiosity
[21:53:17] <bawolff>	 Most likely causes of this situation imo is either vhtcpd exploded, varnish is misconfigured, or less likely, varnish exploded
[21:53:25] <jeremyb>	 basically: if esams purges are failing that's bad, if eqiad purges are failing that's worse. we could be dealing with either or both or none of those
[21:53:42] <jeremyb>	 or something wrong with the network maybe
[21:53:50] <jeremyb>	 but i guess that's least likely
[21:55:30] <jeremyb>	 bawolff: i'm not sure who's awake right now. there are a few greeks that could look at this and maybe an AaronSchulz
[21:55:38] <jeremyb>	 i can't do much myself
[21:56:29] <bawolff>	 I'm sure someone will get to it within the next day
[21:56:33] <bawolff>	 Which is probably good enough
[21:57:02] <bawolff>	 The users haven't even started complaining yet
[21:57:12] <jeremyb>	 well test2 i don't care much about. but if it's effecting other stuff..
[21:57:54] <bawolff>	 test2 happened to be where we noticed it, because bd808 was testing other things, but it appears to affect all wikis equally I would assume
[22:01:34] <grrrit-wm>	 (03CR) 10Asher: [C: 032 V: 032] Configure elasticearch multicast per datacenter [operations/puppet] - 10https://gerrit.wikimedia.org/r/78966 (owner: 10Manybubbles)
[22:04:49] <grrrit-wm>	 (03PS2) 10Asher: Turn on more default elasticsearch logging. [operations/puppet] - 10https://gerrit.wikimedia.org/r/78903 (owner: 10Manybubbles)
[22:04:59] <grrrit-wm>	 (03CR) 10Asher: [C: 032 V: 032] Turn on more default elasticsearch logging. [operations/puppet] - 10https://gerrit.wikimedia.org/r/78903 (owner: 10Manybubbles)
[22:08:39] <icinga-wm>	 PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds  
[22:08:45] * Romaine  points at https://bugzilla.wikimedia.org/show_bug.cgi?id=52853
[22:09:39] <icinga-wm>	 RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.186 second response time  
[22:19:59] <bawolff>	 Romaine: wow, talk about a weird bug
[22:20:04] <Romaine>	 :p
[22:20:19] <Romaine>	 finnaly we have found the corpse in the closet of the MediaWiki software
[22:20:40] <jeremyb>	 we usually say "skeleton in the closet" i thinks
[22:20:42] <Romaine>	 "Lijk in de kast" is Dutch expression
[22:20:48] <Romaine>	 a right
[22:28:55] <bawolff>	 Romaine: If it makes you feel better, I was able to reproduce the weird history thing
[22:29:36] <Romaine>	 great :p
[22:29:44] <Romaine>	 do you also have a cure for it?
[22:30:42] <bawolff>	 no
[22:31:00] <bawolff>	 all signs point to something very screwed up
[22:42:19] <icinga-wm>	 PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours  
[22:47:40] <Romaine>	 it keeps on occuring with users messing it up
[23:32:19] <icinga-wm>	 PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours  
[23:41:19] <icinga-wm>	 PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours  
[23:41:19] <icinga-wm>	 PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours  
[23:41:19] <icinga-wm>	 PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours  
[23:41:19] <icinga-wm>	 PROBLEM - Puppet freshness on pdf3 is CRITICAL: No successful Puppet run in the last 10 hours  
[23:41:19] <icinga-wm>	 PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours  
[23:41:20] <icinga-wm>	 PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours