[00:01:31] PROBLEM - Disk space on analytics1024 is CRITICAL: DISK CRITICAL - free space: / 1069 MB (3% inode=90%):
[02:18:04] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Aug 14 02:18:04 UTC 2013
[02:18:15] Logged the message, Master
[02:40:07] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours
[03:39:49] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[03:39:49] PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours
[03:39:49] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[03:39:49] PROBLEM - Puppet freshness on pdf3 is CRITICAL: No successful Puppet run in the last 10 hours
[03:39:49] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours
[03:39:50] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours
[04:10:49] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[04:10:49] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[04:10:49] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[04:10:49] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[04:10:49] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[04:10:50] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[04:59:52] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[05:03:02] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:25:12] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
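The disk-space alert above is built around percent-free arithmetic over the root filesystem. A minimal sketch of that math over `df -P` output — `low_disk` is a hypothetical helper written for illustration, not the plugin icinga-wm actually runs:

```shell
#!/bin/sh
# low_disk: read `df -P` output on stdin and flag filesystems with less
# than MIN percent free, roughly the "3% free" condition reported above.
# Illustrative sketch only, not the production Nagios check.
low_disk() {
    awk -v min="$1" 'NR > 1 {
        used = $5 + 0                      # "97%" + 0 strips the "%"
        if (100 - used < min) print $6, (100 - used) "% free"
    }'
}

# usage: df -P | low_disk 5
```

Reading `df -P` from stdin keeps the threshold logic testable without touching a real filesystem.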
[05:26:12] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s)
[06:20:51] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:27:41] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0)
[06:28:41] RECOVERY - check_job_queue on fenari is OK: JOBQUEUE OK - all job queues below 10,000
[06:29:31] PROBLEM - Disk space on analytics1024 is CRITICAL: DISK CRITICAL - free space: / 1052 MB (3% inode=90%):
[06:31:51] PROBLEM - check_job_queue on fenari is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:22:11] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:24:11] PROBLEM - Puppet freshness on mchenry is CRITICAL: No successful Puppet run in the last 10 hours
[08:24:11] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s)
[08:33:13] (03CR) 10Mark Bergsma: "(1 comment)" [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon)
[08:45:51] (03CR) 10Mark Bergsma: [C: 031] "(2 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/77975 (owner: 10BryanDavis)
[08:49:06] (03PS2) 10Faidon: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782
[08:58:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:59:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[09:13:46] (03PS3) 10Faidon: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782
[09:39:58] !log Moved non-https, non-IPv6 eqiad traffic for *.wikimedia.org (wikimedialb) to text-varnish
[09:40:11] Logged the message, Master
[10:01:19] !log Moved non-https, non-IPv6 eqiad traffic for *.wikimedia.org (wikimedialb) back to Squid
[10:01:30] Logged the message, Master
[10:19:18] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:20:08] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s)
[10:33:46] (03PS1) 10Faidon: ceph: track the stable releases again [operations/puppet] - 10https://gerrit.wikimedia.org/r/79043
[10:33:47] (03CR) 10Faidon: [C: 032 V: 032] ceph: track the stable releases again [operations/puppet] - 10https://gerrit.wikimedia.org/r/79043 (owner: 10Faidon)
[10:34:24] oh dear
[10:34:42] Reedy, wikimania 2014 wiki doesn't seem to be showing up in Google search - do you happen to know if there was a change in the script used to create special wikis which means it's been noindexed?
[10:48:00] !log upgrading ceph to 0.67
[10:48:09] Logged the message, Master
[10:57:07] PROBLEM - Ceph on ms-fe1001 is CRITICAL: Ceph HEALTH_WARN 24 pgs recovering: 4 pgs recovery_wait: 1234 pgs stale: 7 pgs stuck unclean: recovery 23/898167921 degraded (0.000%): recovering 1 o/s, 0B/s, 61351 key/s: 10/142 in osds are down
[10:57:21] hehe
[10:58:00] (that's normal)
[10:58:07] RECOVERY - Ceph on ms-fe1001 is OK: Ceph HEALTH_OK
[10:59:00] !log stoping etherpad-lite for maintainance (character set hell)
[10:59:10] Logged the message, Master
[10:59:18] PROBLEM - Ceph on ms-fe1004 is CRITICAL: Ceph HEALTH_WARN 2736 pgs degraded: 121 pgs stuck unclean: recovery 49169908/898167933 degraded (5.474%)
[11:01:07] PROBLEM - Ceph on ms-fe1001 is CRITICAL: Ceph HEALTH_WARN 1423 pgs stale: 12/142 in osds are down
[11:02:28] PROBLEM - Ceph on ms-fe1003 is CRITICAL: Ceph HEALTH_WARN 1394 pgs stale: 12/142 in osds are down
[11:03:37] I wonder if I should make this warning instead of critical
[11:04:07] RECOVERY - Ceph on ms-fe1001 is OK: Ceph HEALTH_OK
[11:04:18] RECOVERY - Ceph on ms-fe1004 is OK: Ceph HEALTH_OK
[11:04:28] RECOVERY - Ceph on ms-fe1003 is OK: Ceph HEALTH_OK
[11:08:24] cmjohnson1: you were pinging me yesterday?
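The Ceph alerts above clearly report the raw `ceph health` status string, and the "warning instead of critical" question at 11:03 comes down to which Nagios exit code a transient HEALTH_WARN is mapped to. A hedged sketch of that mapping — `ceph_health_to_nagios` is made up for illustration, not the check actually deployed:

```shell
#!/bin/sh
# Sketch of a Nagios-style wrapper around `ceph health` output.
# Returning 1 (WARNING) rather than 2 (CRITICAL) for HEALTH_WARN is the
# change being considered above; illustrative only, not the real plugin.
ceph_health_to_nagios() {
    health="$1"    # e.g. the output of `ceph health`
    case "$health" in
        HEALTH_OK*)   echo "OK: $health";       return 0 ;;
        HEALTH_WARN*) echo "WARNING: $health";  return 1 ;;
        HEALTH_ERR*)  echo "CRITICAL: $health"; return 2 ;;
        *)            echo "UNKNOWN: $health";  return 3 ;;
    esac
}

# usage: ceph_health_to_nagios "$(ceph health)"
```

Exit codes 0/1/2/3 are the standard Nagios/NRPE OK/WARNING/CRITICAL/UNKNOWN convention.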
[11:09:29] i did....but don't remember why
[11:10:50] paravoid: now i remember...what log are you seeing the errors in ms-be1005 and 1008
[11:11:29] I pasted it on the tickets on RT
[11:11:37] typical i/o errors basically
[11:11:54] (03PS1) 10Mark Bergsma: Don't cache login.wikimedia.org requests, for now [operations/puppet] - 10https://gerrit.wikimedia.org/r/79044
[11:13:05] (03CR) 10Mark Bergsma: [C: 032] Don't cache login.wikimedia.org requests, for now [operations/puppet] - 10https://gerrit.wikimedia.org/r/79044 (owner: 10Mark Bergsma)
[11:13:25] !log swift->ceph thumb sync (swiftrepl @ ms-fe1002)
[11:13:36] Logged the message, Master
[11:46:31] PROBLEM - DPKG on ms-be3 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:47:30] RECOVERY - DPKG on ms-be3 is OK: All packages OK
[11:48:34] apergos: so I was looking at how ms-be1 uses far less CPU than all the others
[11:49:00] apergos: and it looks like it's the only one not running a container server
[11:49:33] how come?
[11:51:18] yeah that was a reminder to me to understand better why the container servers take so much more cpu
[11:51:36] yeah it's crazy
[11:55:07] Aug 14 11:55:02 ms-be1 kernel: [6197403.078004] XFS (sdh1): xfs_log_force: error 5 returned.
[11:57:07] apergos: wanna handle it?
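The kernel line pasted above (`XFS (sdh1): xfs_log_force: error 5 returned`) is the failed-disk symptom on ms-be1. A small sketch for tallying such errors per device from kernel log lines — it reads stdin, so it can be fed `dmesg` output or a kern.log file; the helper name is invented for this example:

```shell
#!/bin/sh
# count_xfs_errors: read kernel log lines on stdin and count XFS error
# mentions per block device, to spot a dying disk like sdh1 above.
# usage: dmesg | count_xfs_errors
#    or: count_xfs_errors < /var/log/kern.log
count_xfs_errors() {
    grep 'error' | grep -oE 'XFS \(sd[a-z]+[0-9]*\)' | sort | uniq -c | sort -rn
}
```

A device that dominates the count is the one to pull the SMART data for and ticket, as was done here.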
[11:57:13] sure
[11:57:21] :)
[11:57:35] lunch for me, bbl
[11:57:42] lenjoy
[12:23:13] !log Moved non-https, non-IPv6 eqiad traffic for *.wikimedia.org (wikimedialb) to text-varnish, not caching login.wikimedia.org
[12:23:25] Logged the message, Master
[12:36:31] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:37:30] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0)
[12:41:19] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours
[12:44:28] apergos: I just realized I was a bit cryptic, talking about two issues at the same time
[12:44:33] there's a failed disk in ms-be1
[12:44:50] yes, I opened a ticket
[12:45:04] ah ok
[12:45:14] I was just rereading backlog
[12:45:19] ok
[12:52:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:53:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[13:05:58] (03PS1) 10Mark Bergsma: Move wikimedialb to the text-varnish cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/79048
[13:06:35] \o/
[13:07:17] (03CR) 10Mark Bergsma: [C: 032] Move wikimedialb to the text-varnish cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/79048 (owner: 10Mark Bergsma)
[13:10:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:11:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[13:13:26] i'm pass'ing login.wikimedia.org requests just in case
[13:22:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:23:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time
[13:36:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:37:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time
[13:40:19] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[13:40:19] PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours
[13:40:19] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[13:40:19] PROBLEM - Puppet freshness on pdf3 is CRITICAL: No successful Puppet run in the last 10 hours
[13:40:19] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours
[13:40:20] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours
[14:07:44] ottomata: so I guess we don't have a kafka 0.8 setup yet right?
[14:07:56] in labs, yes
[14:07:59] not in prod
[14:08:31] varnishkafka is running and producing to 0.8 in labs, with mirroring to a a second kafka cluster (of 1 node)
[14:09:31] yeah
[14:09:36] but for performance testing, labs is no good
[14:10:16] does it work fine in labs?
[14:11:19] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[14:11:19] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[14:11:19] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[14:11:19] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[14:11:19] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[14:11:20] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[14:32:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:32:56] (03PS1) 10Mark Bergsma: Add HTTPS service for misc_web [operations/puppet] - 10https://gerrit.wikimedia.org/r/79052
[14:33:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[14:34:30] (03CR) 10Mark Bergsma: [C: 032] Add HTTPS service for misc_web [operations/puppet] - 10https://gerrit.wikimedia.org/r/79052 (owner: 10Mark Bergsma)
[14:42:27] mark, sorry, ja works fine
[14:42:35] I talked a little with paravoid the other day
[14:42:44] if you want to perf test asap, we can set 0.8 up on a couple of the cisco boxes
[14:42:52] i don't want to
[14:42:56] but I believe diederik does ;)
[14:42:59] haha, yeah
[14:43:09] we have that meeting tomorrow
[14:43:11] aye
[14:43:14] i can do that pretty quick
[14:43:18] ok
[14:43:23] you ok if I don't do that puppetized? those machines still need reinstalled anyway
[14:43:26] let me know when you have something, then i'll install varnishkafka on a varnish box
[14:43:29] yep
[14:43:33] we haven't put the .deb in apt yet, because it isn't 100% stable yet
[14:43:35] ok cool
[14:43:37] yeah
[14:44:14] mark: i thought you wanted to do performance testing on a production box, i was just checking the status :D
[14:44:25] buuuut we need some testing in a real env, right?
[14:44:25] correct
[14:44:37] i need a kafka install to send to
[14:44:46] i'll test varnishkafka on one production varnish box
[14:44:47] sure
[14:44:51] kool
[14:45:33] ottomata: I did a couple of updates in librdkafka & varnishkafka today
[14:45:46] and I have .debs in brewster's /root (but not in apt)
[14:48:23] (03PS1) 10Mark Bergsma: Recommission cp104[34] [operations/puppet] - 10https://gerrit.wikimedia.org/r/79056
[14:48:34] oh awesome
[14:48:43] thanks paravoid, i'll use those ones
[14:49:05] oh wait, i don't need those
[14:49:09] ha, mark will use those ones :p
[14:49:23] ?
[14:49:28] no this is unrelated
[14:49:35] (03CR) 10Mark Bergsma: [C: 032] Recommission cp104[34] [operations/puppet] - 10https://gerrit.wikimedia.org/r/79056 (owner: 10Mark Bergsma)
[14:50:02] i mean, i don't need those to install kafka 0.8 somewhere for you. you can setup varnishkafka using those .debs (or however you want)
[14:50:35] yes, I just mentioned them since you were also playing with them in labs
[14:50:38] aye danke
[14:50:57] (03CR) 10BryanDavis: "(2 comments)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/77975 (owner: 10BryanDavis)
[14:51:08] (03PS4) 10BryanDavis: Add ganglia monitoring for vhtcpd. [operations/puppet] - 10https://gerrit.wikimedia.org/r/77975
[14:52:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:53:19] PROBLEM - DPKG on analytics1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[14:53:59] PROBLEM - DPKG on analytics1004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[14:54:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[14:58:16] apergos: thanks for the help with beta labs! something still seems wonky there though, but it's inconsistent. some browser instances seem fine, another gets no js/css, another gets consistent timeouts on pages. I'm looking for some reliable repro for this, but it's sketchy
[14:58:47] yw and it's weird you're getting that off and on behavior
[14:59:03] I tried a few pages and it was ok for me but I didn't switch browsers or projects or anything
[15:01:09] apergos: weird, my timeouts just stopped, now pages for that instance are loading. guessing something needed to propagate.
[15:01:14] :-D
[15:01:23] well that's irritating
[15:02:00] http://ganglia.wmflabs.org/latest/?r=20min&cs=&ce=&m=load_one&s=by+name&c=deployment-prep&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 these show some interesting spikes for the apaches
[15:02:09] een when I look at the 4 hour graph it's the same
[15:02:12] but why? who knows
[15:02:40] apergos: I have this overly-complicated browser test that I am trying to sort by running against beta. It seems to be cranking now after getting consistent timeouts until like literally seconds ago.
[15:03:01] So I'll be stressing it today.
[15:05:30] sounds good
[15:06:12] (03PS1) 10Mark Bergsma: Add Misc Web caching Ganglia cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/79057
[15:06:13] (03CR) 10jenkins-bot: [V: 04-1] Add Misc Web caching Ganglia cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/79057 (owner: 10Mark Bergsma)
[15:06:25] apergos: when you restarted apache on the apache33 host, was it a simple 'apachectl -k restart' or did you do something fancier than that?
[15:07:19] oh I shot the parent, it wouldn't restart just with the init script
[15:07:29] and I never remember the existence of apachectl
[15:07:58] (03PS2) 10Mark Bergsma: Add Misc Web caching Ganglia cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/79057
[15:08:43] apergos: so what exactly did you do to restart? 'kill -9'? haha, only serious...
[15:08:56] (03CR) 10Mark Bergsma: [C: 032] Add Misc Web caching Ganglia cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/79057 (owner: 10Mark Bergsma)
[15:09:13] after shooting the one process still around, /etc/init.d/apache start
[15:09:23] prolyl apache2, don't remember now
[15:10:18] apergos: thanks, we should consider writing a troubleshooting guide to beta labs I think.
[15:10:33] er, once we understand it better :-D
[15:11:24] apergos: this is a great example. I'm pretty sure I did 'apachectl -k restart' on the apache32 host at least but it didn't dtrt.
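The restart trouble above ('apachectl -k restart' not doing the right thing) is exactly why it pays to confirm every worker process actually has a fresh start time afterwards. A hedged sketch of that check as a filter over per-process elapsed seconds — `all_recent` is a hypothetical helper, not an existing WMF script, and `ps -o etimes` needs a reasonably recent procps (older ones only offer the formatted `etime`):

```shell
#!/bin/sh
# all_recent: succeed only if every elapsed-time value (in seconds) on
# stdin is below the threshold, i.e. every worker restarted recently.
# Sketch of the "check the time stamps, all of them" habit; illustrative.
all_recent() {
    awk -v max="$1" '
        { n++; if ($1 >= max) stale++ }
        END { exit (n == 0 || stale > 0) ? 1 : 0 }'
}

# usage, shortly after a restart:
#   ps -C apache2 -o etimes= | all_recent 60 && echo "all workers fresh"
```

Failing when stdin is empty matters too: no matching processes means the restart did not happen at all.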
[15:11:40] so I always check to see if the process actually restarted
[15:11:51] as in, it's there with a current time stamp on it
[15:12:04] and not just osme of the children but all of them
[15:12:33] but for things like this /etc/init.d/something is almost always guaranteed to start something once you have stopped it by whatever means
[15:12:47] apergos: yep, that would be good information to have for drive-by maintainers
[15:12:51] right
[15:13:16] I guess there might be some things in there which are upstart jobs and don't have anything in init.d any more but it's the same principle
[15:13:26] it's been a lot of years since I cared much about the details of how apache works
[15:13:36] yup
[15:14:18] (03PS1) 10Mark Bergsma: Make nginx listen on IPv6 as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/79058
[15:15:09] do people discuss operational stuff about it in the labs channel or not so much?
[15:15:13] (03CR) 10Mark Bergsma: [C: 032] Make nginx listen on IPv6 as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/79058 (owner: 10Mark Bergsma)
[15:17:28] (03PS1) 10Nemo bis: Remove long-buried $wgLogAutocreatedAccounts [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/79059
[15:20:07] (03PS1) 10Mark Bergsma: Add IPv6 addresses to misc_web service [operations/puppet] - 10https://gerrit.wikimedia.org/r/79060
[15:20:44] (03CR) 10Mark Bergsma: [C: 032] Add IPv6 addresses to misc_web service [operations/puppet] - 10https://gerrit.wikimedia.org/r/79060 (owner: 10Mark Bergsma)
[15:22:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:23:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time
[15:31:00] hookay mark
[15:31:06] 0.8 running on analytics1003 and analytics1004
[15:31:13] cool :)
[15:31:17] i'll create a topic 'varnish' with replication factor 2
[15:31:18] I think i'll play with that tomorrow then
[15:31:24] you can configure varnishkafka to produce to that
[15:36:42] yay
[15:36:48] (sorry, was in a meeting)
[15:37:30] PROBLEM - SSH on searchidx1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:37:58] (03PS1) 10Mark Bergsma: Revert "Add HTTPS service for misc_web" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79062
[15:38:20] RECOVERY - SSH on searchidx1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[15:39:18] (03PS2) 10Mark Bergsma: Revert "Add HTTPS service for misc_web" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79062
[15:40:16] (03CR) 10Mark Bergsma: [C: 032] Revert "Add HTTPS service for misc_web" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79062 (owner: 10Mark Bergsma)
[15:52:25] (03PS1) 10Mark Bergsma: Add the port number to the PyBal service name [operations/puppet] - 10https://gerrit.wikimedia.org/r/79064
[15:53:14] (03CR) 10Mark Bergsma: [C: 032] Add the port number to the PyBal service name [operations/puppet] - 10https://gerrit.wikimedia.org/r/79064 (owner: 10Mark Bergsma)
[15:56:46] Hi opsen. Some nl.wikipedia users get old versions of pages sometimes, likely depending on the servers
[15:56:50] see https://bugzilla.wikimedia.org/show_bug.cgi?id=52853
[15:57:12] didn't see anything suspicious in SAL. Is there anything known going on? Any idea how to investigate further?
[16:00:18] (03PS1) 10Mark Bergsma: "Add HTTPS service for misc_web"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79065
[16:00:39] (03PS2) 10Mark Bergsma: "Add HTTPS service for misc_web"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79065
[16:01:38] apergos: thanks for the beta cluster diagnosis and fix!
[16:01:44] yw
[16:01:56] (03CR) 10Mark Bergsma: [C: 032] "Add HTTPS service for misc_web"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79065 (owner: 10Mark Bergsma)
[16:04:23] Any opsen who might take a look at bug 52853 / the issue with nl.wikipedia that I mentioned 10min ago: valhallasw and Romaine are in this channel and could help
[16:04:26] would be very welcome
[16:04:39] bblack: around?
[16:04:48] https://bugzilla.wikimedia.org/show_bug.cgi?id=52853
[16:05:07] page jumps back and forth in time
[16:06:28] paravoid: yes
[16:06:51] apergos: beta labs is giving 503s again :(
[16:07:00] grrr
[16:07:11] but not as often. I just saw one.
[16:07:19] Romaine: X-Cache/X-Cache-Lookup headers would be useful
[16:07:37] how do I get those?
[16:07:48] paravoid: I posted a set for one of my last requests just a few secs ago
[16:08:10] chrismcmahon: apergos I thanked too soon/jinxed it!
[16:08:39] Romaine: in chrome: F12, network tab, press F5 until you get a broken server, then select the top entry, and copy the 'Response Headers' under 'Headers' on the right.
[16:08:46] not sure for other browsers...
[16:08:48] these are going tobe forr some other reason.
[16:09:28] if you retry do you get the page?
[16:10:31] valhallasw: I have a broken version in front of me, where I see the 'Response Headers'?
[16:10:53] (I use Firefox)
[16:10:56] chrismcmahon:
[16:11:15] Romaine: not sure about firefox...
[16:11:23] greg-g apergos seeing about 1 out of 6 503, it's just too flaky to repro
[16:12:05] and the more I use it the better it gets I think
[16:12:14] caching maybe then
[16:12:17] paravoid: http://pastebin.com/enpN1EEF / http://pastebin.com/U85bb1AF
[16:12:57] if you keep livehttpheaders open you can see which of theseare cache miss for frnt/back end
[16:13:06] Romaine: are you logged in when you see the older page?
[16:13:13] yes
[16:13:20] then it's very unlikely to be squid...
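Collecting the X-Cache/X-Cache-Lookup headers that paravoid asked for above doesn't require a browser; `curl -sI` fetches just the response headers. A small sketch — `cache_headers` is a made-up helper name, and the header list is simply the ones mentioned in the discussion:

```shell
#!/bin/sh
# cache_headers: keep only the cache-relevant response headers from a
# header dump on stdin (X-Cache, X-Cache-Lookup, Last-Modified, Age).
# Helper name is invented for this sketch.
cache_headers() {
    grep -iE '^(x-cache|x-cache-lookup|last-modified|age):'
}

# usage, against the page from the bug report:
#   curl -sI 'http://nl.wikipedia.org/wiki/Wikipedia:Aanmelding_moderatoren' | cache_headers
```

Repeating the fetch a few times and comparing Last-Modified against the body, as done later in this log, is what pointed suspicion away from Squid and toward the parser cache.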
[16:13:25] mine are also all logged in
[16:13:28] it's not squid
[16:13:37] the "NOT OK" shows all MISSes
[16:13:49] * valhallasw has to go now. I'll check back in in half an hour or so
[16:13:54] hm, your NOT OK list says Last-Modified:Wed, 14 Aug 2013 15:37:57 GMT
[16:14:03] yes, that was surprising to me, too.
[16:14:16] parser cache?
[16:14:27] I think so
[16:14:30] (it is happening the past days)
[16:21:26] i'm logged in too, keep refreshing that page, but not seeing it
[16:22:56] mark: I just tried again and out of 4 I got one older version
[16:23:10] (03PS1) 10Cmjohnson: adding hafnium to netboot [operations/puppet] - 10https://gerrit.wikimedia.org/r/79067
[16:24:18] assuming the date at the bottom of the page is correct
[16:24:34] <^d> hi mark
[16:24:40] hi
[16:24:43] I have it both in http and with https
[16:24:47] (03CR) 10Cmjohnson: [C: 032 V: 032] adding hafnium to netboot [operations/puppet] - 10https://gerrit.wikimedia.org/r/79067 (owner: 10Cmjohnson)
[16:25:10] I have zero clue about parser cache
[16:25:36] <^d> mark: So, since gitblit is being served by varnish now, want to move antimony to an internal address?
[16:25:39] greg-g: fwiw, the more I'm using beta the better it performs.
[16:25:41] <^d> paravoid: I know some things, shoot.
[16:25:45] even at this moment
[16:25:47] ^d: er, gitblit served by varnish?
[16:25:59] <^d> I thought it was being served by the varnish cache you had setup.
[16:26:01] i didn't change dns
[16:26:06] <^d> Ahhh, nevermind then :)
[16:26:06] i'm testing it
[16:26:10] we can move it over soon
[16:26:16] but still working on https and some tidbits
[16:26:21] * ^d nods
[16:26:31] you can test it if you want
[16:26:59] ^d: https://bugzilla.wikimedia.org/show_bug.cgi?id=52853 the suspicions are on parser cache now
[16:27:02] change your /etc/hosts to 208.80.154.241 for git.wikimedia.org for that
[16:27:12] and/or 2620:0:861:ed1a::11 for ipv6
[16:27:33] Romaine: weird, I can't reproduce it
[16:27:43] http://nl.wikipedia.org/wiki/Wikipedia:Aanmelding_moderatoren, correct ?
[16:27:46] chrismcmahon: hah, I guess we just need google to index it contstantly, then ;)
[16:27:46] yes
[16:27:47] me neither
[16:27:59] I have reproduced at least 20 times
[16:28:00] I even tried multiple mwNNNN servers
[16:28:05] and a lot of users too
[16:28:53] it is randomly happen
[16:29:56] <^d> paravoid: So to summarize the bug: logged in users are seeing different versions of the page depending on which server they hit?
[16:30:20] Last-Modified is correct, content body sometimes is stale
[16:30:26] only happens on logged in users so far
[16:30:43] ha
[16:30:53] when I click edit, I see no votes past august 10
[16:31:18] yes
[16:31:28] it actually shows an older version of that page
[16:31:33] instead of the most recent one
[16:33:09] <^d> Hmm, I'm definitely seeing the latest info both in the page and when I click edit.
[16:34:08] how can I help with reproducing the problem?
[16:35:59] now i'm seeing the latest version edit page too
[16:38:15] seems like I didn't get a LM header...
[16:41:04] but I don't trust web-developer with that
[16:43:52] strange that it would happen for multiple URLs related to this one page
[16:44:03] perhaps some memcached key?
[16:49:29] valhallasw: not solved yet
[16:51:23] apergos: beta now serving me all 503 and load went nuts http://ganglia.wmflabs.org/latest/?r=20min&cs=&ce=&m=load_one&s=by+name&c=deployment-prep&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[16:52:01] chrismcmahon: Is NFS issue. I'm going to deploy a workaround this afternoon that, if nothing else, will tell us definitely if the issue is at the driver or hardware level.
[16:52:32] apache is dead on apache33
[16:53:16] hi Coren thanks. beta has been limping along all week, apergos and I were trying to make it at least plod.
[16:53:56] restarted
[16:54:01] so you'll be k for a little while
[16:54:05] Yeah, either way the problem is a paid. If it /is/ a driver regression issue we're in trouble eventually since those controllers are all over both DCs.
[16:54:14] pain*
[16:54:15] apergos: thanks
[16:54:31] dead n apache32 too
[16:54:54] Coren: my big project on beta right now is the CirrusSearch stuff. wow apergos I killed *both* apaches? nice.
[16:55:02] restarted there too
[16:55:08] yes you did a number on em
[16:55:16] lemme try that again
[16:55:17] "congrats"? :-P
[16:55:26] thus I work in QA
[16:55:29] you get to restart em this time!
[16:55:34] OK :)
[16:56:16] Coren: just wondering, "this afternoon" in what time zone?
[16:56:24] (03PS1) 10Cmjohnson: Revert "adding hafnium to netboot" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79069
[16:56:48] (03CR) 10Cmjohnson: [C: 032 V: 032] Revert "adding hafnium to netboot" [operations/puppet] - 10https://gerrit.wikimedia.org/r/79069 (owner: 10Cmjohnson)
[16:57:11] Mine (GMT-4). I should have specified. Then again, the rsync I need to do will probably take quite some time so that's probably going to end up being afternoon PTD.
[16:57:24] * Coren is about to send to labs-l about it.
[17:02:09] (03PS1) 10Cmjohnson: adding hafnium with lvm cfg [operations/puppet] - 10https://gerrit.wikimedia.org/r/79070
[17:03:38] (03CR) 10Cmjohnson: [C: 032 V: 032] adding hafnium with lvm cfg [operations/puppet] - 10https://gerrit.wikimedia.org/r/79070 (owner: 10Cmjohnson)
[17:15:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:16:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.131 second response time
[17:27:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:29:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time
[17:33:00] (03PS1) 10RobH: aaron handed me a new pubkey, unless it was a pod person... [operations/puppet] - 10https://gerrit.wikimedia.org/r/79073
[17:33:41] AaronSchulz: are you a pod person or the real aaron... wait i bet you would lie if you were a pod person
[17:33:50] i think ops is going to require dna sampling for ssh keys now.
[17:34:25] (03CR) 10RobH: [C: 032] aaron handed me a new pubkey, unless it was a pod person... [operations/puppet] - 10https://gerrit.wikimedia.org/r/79073 (owner: 10RobH)
[17:36:00] AaronSchulz: key is merged on puppetmaster, it'll be a number of hours for it to propogate across cluster
[17:36:12] if you need a specific system access right away lemmeknow and i kick it until it updates
[17:37:44] mark: any news about the bug?
[17:39:15] beta dying again it seems. thanks Coren it is nice to know why at least.
[17:40:05] Even better would be the problem not being there. I'm working on the copying now; with the switch panned for 20:00 UTC
[17:40:13] planned*
[17:40:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:40:55] chrismcmahon: That said, if beta is dying *now* it might not be the NFS after all: it's not stalled atm.
[17:41:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[17:41:33] is apache running?
[17:41:54] apergos: not sure, but ganglia shows that familiar pattern http://ganglia.wmflabs.org/latest/?r=20min&cs=&ce=&m=load_one&s=by+name&c=deployment-prep&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[17:41:59] (also I'm not really here. it's way past my afktime. but I wanna process some of these photos)
[17:42:07] !log csteipp synchronized php-1.22wmf12/includes
[17:42:08] apergos: np
[17:42:19] Logged the message, Master
[17:42:29] you have rights to log onto the apache32 and 33 instances right?
[17:42:33] just peek in and see
[17:42:56] I did not look at the logs the last time but maybe the syslog or dmesg will have something interesting in it too
[17:43:18] yes
[17:45:37] I'll surf over to 32/33. So many yaks. What I'm actually *trying* to do is to optimize an automated browser test. :)
[17:45:52] RobH: \o/
[17:47:23] if you don't feel like looking into it, just restart em (if they are dead)
[17:47:35] if you kill them tomorrow I can look at it then
[17:47:57] I just know my brain has entered low gear for the night, so best to give it a rest
[17:53:34] (03PS1) 10BBlack: add queue_size and queue_max_size stats output [operations/software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/79076
[17:53:56] (03CR) 10BBlack: [C: 032 V: 032] add per-connection purging limits for sanity [operations/software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/75128 (owner: 10BBlack)
[17:55:17] (03CR) 10BBlack: [C: 032 V: 032] add queue_size and queue_max_size stats output [operations/software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/79076 (owner: 10BBlack)
[17:56:34] (03CR) 10Demon: [C: 031] Configure elasticearch multicast per datacenter [operations/puppet] - 10https://gerrit.wikimedia.org/r/78966 (owner: 10Manybubbles)
[17:57:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:58:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time
[17:59:32] valhallasw: one users reports a strange thing
[17:59:58] he says he sees: (huidig | vorige) 10 aug 2013 19:44? Glatisant (Overleg | bijdragen)? . . (59.018 bytes) (+59.018)? . . (??Voor moderatorschap Dqfn13) (ongedaan maken)
[18:00:08] (without the ?)
[18:00:35] while I only see (+124)?
[18:00:39] PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:01:01] (03PS1) 10Ottomata: Using renamed geowiki repo instead of editor-geocoding [operations/puppet] - 10https://gerrit.wikimedia.org/r/79078
[18:01:01] mark ^^
[18:01:03] (03PS2) 10Ottomata: Using renamed geowiki repo instead of editor-geocoding [operations/puppet] - 10https://gerrit.wikimedia.org/r/79078
[18:01:21] Romaine: that's... really strange.
[18:01:22] that is specific the revision that keeps showing up randomly [18:01:30] RECOVERY - Disk space on labstore3 is OK: DISK OK [18:01:57] (03CR) 10Ottomata: [C: 032 V: 032] Using renamed geowiki repo instead of editor-geocoding [operations/puppet] - 10https://gerrit.wikimedia.org/r/79078 (owner: 10Ottomata) [18:02:23] Romaine: no, that's the one just after [18:02:37] I also see the +59.018 there [18:02:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:03:01] you are right [18:03:04] the one after [18:03:27] that implies the software interprets it as a new page (or a diff from an empty page)... [18:03:29] yes, I have reproduced it [18:03:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.146 second response time [18:03:44] I have now a version in my screen with (huidig | vorige) 10 aug 2013 19:44? Glatisant (Overleg | bijdragen)? . . (59.018 bytes) (+59.018)? . . (??Voor moderatorschap Dqfn13) (ongedaan maken) [18:04:34] (03PS1) 10Cmjohnson: changing hafnium back to raid1-lvm cfg cuz there are 2 disk now [operations/puppet] - 10https://gerrit.wikimedia.org/r/79082 [18:04:46] (03PS1) 10BBlack: 0.0.9 stuff [operations/software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/79083 [18:04:58] (03CR) 10BBlack: [C: 032 V: 032] 0.0.9 stuff [operations/software/varnish/vhtcpd] - 10https://gerrit.wikimedia.org/r/79083 (owner: 10BBlack) [18:05:28] (03PS1) 10BBlack: Merge branch 'master' into debian [operations/software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/79084 [18:05:29] (03PS1) 10BBlack: bump pkg version [operations/software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/79085 [18:05:42] (03CR) 10BBlack: [C: 032 V: 032] Merge branch 'master' into debian [operations/software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/79084 (owner: 10BBlack) [18:05:55] (03CR) 10BBlack: [C: 032 V: 032] bump pkg version 
[operations/software/varnish/vhtcpd] (debian) - 10https://gerrit.wikimedia.org/r/79085 (owner: 10BBlack) [18:06:54] (03CR) 10Cmjohnson: [C: 032 V: 032] changing hafnium back to raid1-lvm cfg cuz there are 2 disk now [operations/puppet] - 10https://gerrit.wikimedia.org/r/79082 (owner: 10Cmjohnson) [18:12:00] valhalla1w: I added this to https://bugzilla.wikimedia.org/show_bug.cgi?id=52853 [18:12:24] great! [18:12:57] weird [18:24:19] PROBLEM - Puppet freshness on mchenry is CRITICAL: No successful Puppet run in the last 10 hours [18:25:37] robh: clarify for me...are sas disk backwards compatible for a sata backplane? [18:25:50] i know it works the other way around [18:25:59] bios settings are ata, ahci and raid [18:29:09] PROBLEM - RAID on analytics1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:32:09] PROBLEM - RAID on analytics1011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:32:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:33:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [18:39:29] PROBLEM - SSH on labstore3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:39:52] (03CR) 10BBlack: "Can you update this patchset for https://git.wikimedia.org/blobdiff/operations%2Fsoftware%2Fvarnish%2Fvhtcpd.git/13892a2f274f75a850cebcb5d" [operations/puppet] - 10https://gerrit.wikimedia.org/r/77975 (owner: 10BryanDavis) [18:40:20] RECOVERY - SSH on labstore3 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:19:00] (03PS5) 10BryanDavis: Add ganglia monitoring for vhtcpd. [operations/puppet] - 10https://gerrit.wikimedia.org/r/77975 [19:19:41] (03CR) 10BryanDavis: "patchset 5 adds the new `queue_size` and `queue_max_size` stats. It is also rebased against head of production branch." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/77975 (owner: 10BryanDavis) [19:20:21] bblack: ^ [19:20:34] bd808: thanks :) [19:20:42] np [19:32:43] Anybody in the office want to grab lunch? [19:35:21] (03CR) 10Ori.livneh: "I'm not going to keep rebasing this. Just let me know when you want to merge it." [operations/puppet] - 10https://gerrit.wikimedia.org/r/75087 (owner: 10Ori.livneh) [19:37:58] ha ha ha ^^ [19:47:20] ori-l: you guys could also switch from 'merge on submit' to 'rebase on submit', I guess? :-) [19:48:48] preilly: it is time here for a evening drink [20:39:19] PROBLEM - Puppet freshness on zirconium is CRITICAL: No successful Puppet run in the last 10 hours [21:01:05] I think cache purging on cp1063 is broken [21:04:00] Both https://upload.wikimedia.org/wikipedia/test2/thumb/e/eb/BDavis-test-del.jpg/159px-BDavis-test-del.jpg and https://upload.wikimedia.org/wikipedia/test2/thumb/e/eb/BDavis-test-del.jpg/120px-BDavis-test-del.jpg seem to go through that varnish (cp1063), and neither of them are responding to purges [21:04:36] However, other sizes do get varnish cache cleared on purge, and those particular sizes appear to get swift cache of the thumb cleared on ?action=purge. But the varnish doesn't clear [21:06:22] If somebody could check hit that particular varnish with a stick (or check vhtcpd is running on it properly, etc), that would be awesome [21:12:28] (03CR) 10Demon: "Ok, I think this is fine now with all my other changes merged in." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78083 (owner: 10Demon) [21:12:53] <^d> AaronSchulz: Can you take a look at ^ again, and see if I resolved the issue in my next to last comment with the other changes we merged? [21:12:59] <^d> I think we're all set now :) [21:14:03] (03CR) 10Demon: [C: 031] Turn on more default elasticsearch logging. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/78903 (owner: 10Manybubbles) [21:14:33] (03CR) 10Demon: [C: 031] Setup metrics collection for elasticserch [operations/puppet] - 10https://gerrit.wikimedia.org/r/78414 (owner: 10Manybubbles) [21:16:19] (03CR) 10Aaron Schulz: "Seems OK when those core changes are backported/deployed" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78083 (owner: 10Demon) [21:17:14] I filed what i mentioned above as bug 52864, so it doesn't get forgotten [21:17:28] bawolff: sounds like something i've seen before [21:17:39] it's missing from swift but is in varnish [21:17:55] I don't think its missing from swift [21:18:06] so it doesn't get purged when a purge is done. (it's not even attempted) [21:18:09] how do you know? [21:18:10] I used an alternate url, which should have recreated it in swift [21:18:52] When I download from the alternate url, I get the same file (based on file mod date in exif), when I hit purge, this seems to delete the swift file, since the mod date updates [21:19:20] For example, do https://upload.wikimedia.org/wikipedia/test2/thumb/e/eb/BDavis-test-del.jpg/159px-BDavis-test-del.jpg?randomstring [21:19:26] Then do https://upload.wikimedia.org/wikipedia/test2/thumb/e/eb/BDavis-test-del.jpg/159px-BDavis-test-del.jpg?randomstring2 [21:19:34] note how they have same exif modification date [21:19:39] then ?action=purge [21:19:49] then https://upload.wikimedia.org/wikipedia/test2/thumb/e/eb/BDavis-test-del.jpg/159px-BDavis-test-del.jpg?randomstring3 has a new exif modification date [21:19:52] what does ?action=purge mean exactly? 
[21:19:54] (03CR) 10Aaron Schulz: "Actually looks be fine regardless" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78083 (owner: 10Demon)
[21:20:04] https://test2.wikipedia.org/wiki/File:BDavis-test-del.jpg?action=purge
[21:20:36] And the entire time, the https://upload.wikimedia.org/wikipedia/test2/thumb/e/eb/BDavis-test-del.jpg/159px-BDavis-test-del.jpg does not have the age header reset
[21:21:06] However, this only happens with 120px and 159px. Other sizes I tried have all the cache clearing work as expected
[21:21:23] and the commonality between 120px and 159px is they both go through varnish cp1063
[21:21:39] i just got a 500...
[21:22:27] interesting. Doing what?
[21:23:10] generating a new thumb. refresh and it worked
[21:23:37] oh fun. A repeat of that bug from the weekend?
[21:23:53] idk, maybe it was a one-off
[21:24:25] would be nice if that page was changed to have the host not in a comment. i now lost the 500 body
[21:24:44] hopefully. Special:newfiles on commons seems fine, so I don't think its as widespread as last weekend's bug
[21:38:50] jeremyb: I also did a test of all thumbnails of Lysurus_castaneiceps.jpg between 96px to 106px. 2 out of the 10 did not purge properly
[21:39:11] both were served by cp1063
[21:39:51] (The two in question were 104px and 98px)
[21:40:56] bawolff: welp, could be the varnish itself... can't be a cross atlantic problem because that's in eqiad
[21:42:07] Yeah, I'm in north america, but I also tested the original file going through esams, and there was no change
[21:42:26] bawolff: regardless the hit is from eqiad
[21:42:41] so the thing that needs purging is in eqiad
[21:43:27] (03CR) 10Demon: [C: 032] Redo search configuration [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78083 (owner: 10Demon)
[21:43:49] So the esams caching servers forward to eqiad now?
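The manual check bawolff walks through above (fetch the thumb under fresh cache-busting query strings, run ?action=purge, fetch again, and see whether the edge copy's Age header was reset) can be scripted. This is a hedged sketch, not tooling anyone in the channel actually used; the URL is the one quoted in the log, and the 60-second Age threshold is an arbitrary assumption:

```python
import urllib.request

# Thumb URL quoted in the log; append a cache-busting query string as needed.
THUMB = ("https://upload.wikimedia.org/wikipedia/test2/thumb/e/eb/"
         "BDavis-test-del.jpg/159px-BDavis-test-del.jpg")

def purge_took_effect(age_header, max_age=60):
    """After a successful purge, a fresh fetch should carry a small Age.

    A large Age means the edge cache is still serving its old copy.
    The 60s threshold is an assumption, not a site policy."""
    return int(age_header or 0) <= max_age

def check(url=THUMB):
    # Fetch once and report the Age header plus which cache answered
    # (X-Cache), mirroring the headers inspected in the log.
    with urllib.request.urlopen(url) as resp:
        age = resp.headers.get("Age")
        via = resp.headers.get("X-Cache", "")
        return purge_took_effect(age), age, via
```

Running `check()` before and after an ?action=purge should flip a stuck object from a large Age to a small one; on the sizes routed through cp1063 it reportedly did not.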
[21:43:58] (03Merged) 10jenkins-bot: Redo search configuration [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78083 (owner: 10Demon)
[21:44:14] * bawolff must admit, I'm not familiar with the intricate set up of wmf's networking architecture
[21:44:57] bawolff: they always did. no canonical data is in ams. that's a legal decision
[21:45:07] bawolff: esams is only caching
[21:45:35] I mean its forwarding the cache hit to eqiad
[21:45:42] i don't follow
[21:45:51] since the varnish store a copy of the file
[21:45:52] you can see in the headers where it was a hit or a miss
[21:46:02] some varnishes are backed by other varnishes
[21:46:11] and for text squids are backed by other squids
[21:46:27] i wonder if any squids are backed by varnish or varnish by squid. i guess not
[21:49:22] Ah. So you mean that the esams varnishes are backed by eqiad varnishes, and cp1063 happens to be a varnish that is physically in eqiad?
[21:49:45] yes and yes
[21:49:48] AIUI
[21:50:41] But in theory, an esams varnish could have a cached copy of a file, just in this case that's not relevant?
[21:51:04] well it doesn't matter if esams has it if it's not being purged in eqiad
[21:51:16] let me fetch again to be triply sure
[21:51:37] but i think i should also be hitting eqiad, i'm in north america
[21:52:01] My question is probably irrelevant to the current situation, I'm just asking out of curiosity
[21:53:17] Most likely causes of this situation imo is either vhtcpd exploded, varnish is misconfigured, or less likely, varnish exploded
[21:53:25] basically: if esams purges are failing that's bad, if eqiad purges are failing that's worse. we could be dealing with either or both or none of those
[21:53:42] or something wrong with the network maybe
[21:53:50] but i guess that's least likely
[21:55:30] bawolff: i'm not sure who's awake right now. there are a few greeks that could look at this and maybe an AaronSchulz
[21:55:38] i can't do much myself
[21:56:29] I'm sure someone will get to it within the next day
[21:56:33] Which is probably good enough
[21:57:02] The users haven't even started complaining yet
[21:57:12] well test2 i don't care much about. but if it's affecting other stuff..
[21:57:54] test2 happened to be where we noticed it, because bd808 was testing other things, but it appears to affect all wikis equally I would assume
[22:01:34] (03CR) 10Asher: [C: 032 V: 032] Configure elasticearch multicast per datacenter [operations/puppet] - 10https://gerrit.wikimedia.org/r/78966 (owner: 10Manybubbles)
[22:04:49] (03PS2) 10Asher: Turn on more default elasticsearch logging. [operations/puppet] - 10https://gerrit.wikimedia.org/r/78903 (owner: 10Manybubbles)
[22:04:59] (03CR) 10Asher: [C: 032 V: 032] Turn on more default elasticsearch logging. [operations/puppet] - 10https://gerrit.wikimedia.org/r/78903 (owner: 10Manybubbles)
[22:08:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:08:45] * Romaine points at https://bugzilla.wikimedia.org/show_bug.cgi?id=52853
[22:09:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 4.186 second response time
[22:19:59] Romaine: wow, talk about a weird bug
[22:20:04] :p
[22:20:19] finally we have found the corpse in the closet of the MediaWiki software
[22:20:40] we usually say "skeleton in the closet" i think
[22:20:42] "Lijk in de kast" is Dutch expression
[22:20:48] a right
[22:28:55] Romaine: If it makes you feel better, I was able to reproduce the weird history thing
[22:29:36] great :p
[22:29:44] do you also have a cure for it?
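Apergos's remark above that "you can see in the headers where it was a hit or a miss" refers to the X-Cache response header, which lists each cache layer a request traversed. A hedged sketch of reading it, assuming the comma-separated "hostname status (hits)" convention the Wikimedia caches reported at the time, with the origin-side layer listed first:

```python
def cache_chain(x_cache):
    """Split an X-Cache header into (host, status) pairs, origin-side first."""
    chain = []
    for entry in x_cache.split(","):
        parts = entry.split()
        if len(parts) >= 2:
            chain.append((parts[0], parts[1]))
    return chain

def served_from(x_cache):
    # Walk from the client-side cache back toward the origin and return
    # the first layer that answered out of its own cache.
    for host, status in reversed(cache_chain(x_cache)):
        if status == "hit":
            return host
    return None
```

In the situation discussed here, a response whose X-Cache shows a `hit` on cp1063 after an ?action=purge is exactly the symptom bawolff reported: the eqiad backend copy was never invalidated, so whether esams also holds a copy is moot.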
[22:30:42] no
[22:31:00] all signs point to something very screwed up
[22:42:19] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours
[22:47:40] it keeps on occurring with users messing it up
[23:32:19] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours
[23:41:19] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[23:41:19] PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours
[23:41:19] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[23:41:19] PROBLEM - Puppet freshness on pdf3 is CRITICAL: No successful Puppet run in the last 10 hours
[23:41:19] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours
[23:41:20] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours
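As a footnote to the purge discussion: the hypotheses jeremyb listed ("vhtcpd exploded", varnish misconfigured, "or something wrong with the network") can be separated by listening on the affected cache host itself. If no datagrams arrive on the purge multicast group, the fault is upstream of vhtcpd. This is a hedged sketch; the group and port are assumptions (the values commonly used for MediaWiki's HTCP purges), not read from this log:

```python
import socket
import struct

HTCP_GROUP = "239.128.0.112"   # assumed site purge multicast group
HTCP_PORT = 4827               # standard HTCP port (RFC 2756)

def membership_request(group, iface="0.0.0.0"):
    # struct ip_mreq: multicast group address followed by local interface.
    return struct.pack("4s4s", socket.inet_aton(group),
                       socket.inet_aton(iface))

def open_htcp_socket(group=HTCP_GROUP, port=HTCP_PORT):
    # Join the purge multicast group the way vhtcpd itself would.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group))
    return sock

def count_purges(sock, n=10):
    # Block until n purge datagrams arrive; prolonged silence suggests
    # the multicast traffic never reaches this host at all.
    for _ in range(n):
        sock.recvfrom(65535)
    return n
```

Datagrams arriving while cp1063 still serves stale objects would instead point at vhtcpd or the varnish configuration, narrowing the search the channel left for the next day.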