[00:06:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:07:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [00:13:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:14:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [00:21:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:24:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [00:52:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:53:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [01:13:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:14:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [01:22:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [01:32:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:33:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [01:34:03] (03PS1) 10Faidon: Link dynamically with librdkafka [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/78780 [01:38:11] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [01:38:11] PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours [01:38:11] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [01:38:11] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours [01:38:11] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [01:38:12] PROBLEM - Puppet freshness on pdf3 is CRITICAL: No successful Puppet run in the last 10 hours [01:47:26] (03PS1) 10Faidon: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 [01:52:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:53:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [02:09:11] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [02:09:11] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [02:09:11] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [02:09:11] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [02:09:11] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [02:09:12] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful 
Puppet run in the last 10 hours [02:10:38] !log LocalisationUpdate completed (1.22wmf12) at Mon Aug 12 02:10:38 UTC 2013 [02:10:50] Logged the message, Master [02:13:59] (03PS1) 10Reedy: Revert "Super secret Wikidata logo for Wikimania HK 2013" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78783 [02:14:12] (03CR) 10jenkins-bot: [V: 04-1] Revert "Super secret Wikidata logo for Wikimania HK 2013" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78783 (owner: 10Reedy) [02:14:22] (03Abandoned) 10Reedy: Revert "Super secret Wikidata logo for Wikimania HK 2013" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78783 (owner: 10Reedy) [02:14:58] * paravoid grumbles [02:19:49] (03PS1) 10Reedy: Go back to normal Wikidata logo [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78784 [02:20:42] (03CR) 10Reedy: [C: 032] Go back to normal Wikidata logo [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78784 (owner: 10Reedy) [02:20:52] (03Merged) 10jenkins-bot: Go back to normal Wikidata logo [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78784 (owner: 10Reedy) [02:22:01] !log reedy synchronized wmf-config/InitialiseSettings.php 'Revert back to normal wikidata logo' [02:22:12] Logged the message, Master [02:22:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:23:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [02:24:50] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Aug 12 02:24:49 UTC 2013 [02:25:01] Logged the message, Master [02:25:11] PROBLEM - Puppet freshness on mchenry is CRITICAL: No successful Puppet run in the last 10 hours [03:09:08] I'm getting a lot of 500 errors for generating thumbs, all seem to be from the server mw1153 [03:12:31] bawolff: example url? [03:13:04] http://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/2004-12-01_500-Millibar_Height_Contour_Map_NOAA.png/120px-2004-12-01_500-Millibar_Height_Contour_Map_NOAA.png [03:13:13] It seems second time around the request goes through [03:13:24] Its quite noticable on Special:newfiles on commons [03:13:30] where several thumbs are broken [03:14:15] In the response body of the files with the error, mw1153 is always the server serving it [03:15:23] bawolff: where are you getting 1153 from? [03:15:36] in the response body of the 500 error [03:15:44] huh. i got an empty body [03:15:46] for a 500 [03:15:58] i guess my accept header maybe was wrong [03:16:34] https://dpaste.de/hthPy/ [03:17:13] You might have to try a couple different files, it appears that once one request goes through succesfully, the thumb is generated, and issue no longer occours [03:17:34] well, sure it's cached then [03:17:39] in swift [03:18:24] Anyway, I have to go, but I'll file a bug as well just in case [03:18:55] now, i got it. same backend [03:19:03] https://bugzilla.wikimedia.org/show_bug.cgi?id=52740 [03:19:09] Thanks :) [03:27:32] ottomata: ping? [03:27:51] i poked elsewhere, no response yet. maybe you can just depool the bad host at least [03:30:45] https://commons.wikimedia.org/wiki/Special:NewFiles looks fine to me. [03:32:44] Elsie: i got a 500 in the last 20 secs [03:33:01] I forgive you. 
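The thumbnail debugging above boils down to two observations: the body of a failing 500 names the backend that rendered it (mw1153 here), and a second request for the same thumb succeeds because the first successful render is cached in Swift. Below is a minimal probe in that spirit, assuming nothing beyond the public thumb URL quoted at 03:13; the retry count and the mwNNNN regex are illustrative choices, not part of any Wikimedia tooling.

    #!/usr/bin/env python3
    """Probe a thumbnail URL a few times and report the status plus any backend
    named in an error body -- a rough reproduction of the debugging above."""
    import re
    import time
    import urllib.request
    import urllib.error

    # URL taken from the report at 03:13; any thumb URL works the same way.
    URL = ("http://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/"
           "2004-12-01_500-Millibar_Height_Contour_Map_NOAA.png/"
           "120px-2004-12-01_500-Millibar_Height_Contour_Map_NOAA.png")

    def probe(url, attempts=3, delay=1.0):
        for i in range(1, attempts + 1):
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    print("attempt %d: HTTP %d, %d bytes"
                          % (i, resp.status, len(resp.read())))
            except urllib.error.HTTPError as err:
                body = err.read().decode("utf-8", errors="replace")
                # The error pages discussed above named the serving backend, e.g. mw1153.
                backend = re.search(r"\bmw\d{4}\b", body)
                print("attempt %d: HTTP %d, backend %s"
                      % (i, err.code, backend.group(0) if backend else "unknown"))
            time.sleep(delay)

    if __name__ == "__main__":
        probe(URL)

Once any attempt returns 200 the rendered thumb sits in Swift, so later attempts should keep succeeding, which matches the "second time around the request goes through" observation above.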
[03:33:06] uhuh [03:36:25] i wonder what a vampire object is [03:45:43] Leslie is going to get her laptop [03:48:16] ok, danke [03:48:49] The thumbnail log seems to suggest it's multiple servers [03:48:56] 2013-08-12 03:48:50 mw1160 jawiki: thumbnail failed on mw1160: error 1 "convert: no decode delegate for this image format `/a/magick-tmp/magick-UKFgMFyc' @ error/constitute.c/ReadImage/532. [03:49:22] 013-08-12 03:49:12 mw1159 commonswiki: thumbnail failed on mw1159: error 1 "Error reading SVG:Error domain 1 code 96 on line 1 column 17 of file:///tmp/localcopy_8f2843e29844-1.svg: Malformed declaration expecting version" from "'/usr/bin/'rsvg-convert --no-external-files -w 120 -h 74 -o '/tmp/transform_c2db57ace573-1.png' '/tmp/localcopy_8f2843e29844-1.svg' 2>&1" [03:49:24] Jees [03:49:27] This log is noisy [03:50:16] i was hoping we'd have 500 or at least 5xx stats for the cluster or individual hosts [03:50:28] i couldn't find it unless it's in graphite where i can't see it [03:50:54] maybe a https://gdash.wikimedia.org/dashboards/reqerror/ limited to image scaling [03:51:07] online now [03:51:52] > MediaWiki error counts: http://ur1.ca/edq1f [03:51:54] i'm a thinking i may put in an access req for graphite :) or just redebate with binasher/faidon [03:52:02] What does the "m" indicate on the Y axis? [03:52:17] Elsie: milli i think [03:52:25] so, less than one per sec [03:52:29] don't quote me! [03:53:30] 2013-08-12 03:53:02 mw1153 commonswiki: thumbnail failed on mw1153: error 1 "" from "'/usr/bin/convert' -quality 80 -background white -define jpeg:size=461x768 '/tmp/localcopy_9494dcdd2156-1.jpg' -thumbnail '461x768!' -depth 8 -sharpen '0x0.8' -rotate -0 '/tmp/transform_b8b7c6f4f187-1.jpg' 2>&1" [03:53:30] 2013-08-12 03:53:02 mw1153 commonswiki: Removing bad 0-byte thumbnail "/tmp/transform_b8b7c6f4f187-1.jpg". unlink() succeeded [03:53:30] 2013-08-12 03:53:02 mw1153 commonswiki: thumbnail failed on mw1153: error 1 "" from "'/usr/bin/convert' -quality 80 -background white -define jpeg:size=500x627 '/tmp/localcopy_802df53317bc-1.jpeg' -thumbnail '500x627!' -depth 8 -sharpen '0x0.8' -rotate -0 '/tmp/transform_0498c5460b8a-1.jpeg' 2>&1" [03:53:30] 2013-08-12 03:53:02 mw1153 commonswiki: Removing bad 0-byte thumbnail "/tmp/transform_0498c5460b8a-1.jpeg". unlink() succeeded [03:53:35] well i'm going to reload imagescalers on mw1153 [03:53:44] because hitting witha hammer usually works [03:53:50] Elsie: anyway, i'm pretty sure it's some sort of multiplier [03:54:07] !log restarting apache2 on mw1153 [03:54:11] LeslieCarr: is there a disk full? [03:54:17] Logged the message, Mistress of the network gear. [03:54:17] inode or space [03:54:20] nope [03:54:21] very unfull [03:54:32] hrmmm [03:54:34] 9% is the owrst disk space and 5% is the worst inodes [03:54:34] tmp and sda1 < 10% [03:54:51] so, y u 0 bytes? [03:55:43] LeslieCarr: finished booting? [03:55:50] finished restarting apache [03:57:24] and now dist-upgrading for good measure [03:58:09] so, what's mw1153 look like now ? [03:58:52] hrmmmmmmmm [03:59:05] so... i now still see some broken images on [[special:newfiles]] [03:59:17] but, they're not longer matching to 500s in my console [03:59:27] Purge thumbs etc? [03:59:50] Reedy: ? [04:00:09] make sure they're not cached for some stupid reason [04:00:28] is that level of thumbnail failed normal ? 
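On the Y-axis question above, the "m" is read as milli, i.e. a rate below one event per second ("don't quote me!"). A quick sketch of the arithmetic behind that reading, not tied to Graphite's or gdash's actual rendering code: a raw 5xx count over a time window becomes a per-second rate, and an SI prefix is what keeps a 0.03/s rate readable as 30m. The sample numbers are made up for illustration.

    #!/usr/bin/env python3
    """Convert an error count over a window into a per-second rate and print it
    with an SI prefix, mirroring how a 0.03/s rate would show up as '30m'."""

    def si_rate(count, window_seconds):
        rate = count / float(window_seconds)
        for factor, suffix in ((1.0, ""), (1e-3, "m"), (1e-6, "u")):
            if rate >= factor or factor == 1e-6:
                return "%.3g%s" % (rate / factor, suffix)

    if __name__ == "__main__":
        # e.g. 18 thumbnail 500s seen over a 10-minute window
        print(si_rate(18, 600))   # -> "30m", i.e. 0.03 errors per second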
[04:00:44] ugh, the dangers of viewing [[special:newfiles]] [04:00:47] didn't want to see that [04:01:11] yup, we just had the same thing ;) [04:01:13] hahaha [04:01:53] Reedy: well i right click it and there's an extra option that's not there for a normal image. "reload image". if i do that then it loads fine [04:02:00] and instantly [04:02:41] and no, not normal. certainly not at night [04:03:25] i've debugged this in the past with apergos. it's normal (at least once broken boxes are fixed) to have no errors for long enough that you give up trying to find them [04:04:30] ok, so it looks like no more issues ? [04:04:58] LeslieCarr: can you tell if those graphs exist in graphite? [04:05:11] 12 03:50:16 < jeremyb> i was hoping we'd have 500 or at least 5xx stats for the cluster or individual hosts [04:05:15] 12 03:50:27 < jeremyb> i couldn't find it unless it's in graphite where i can't see it [04:05:16] not sure [04:05:18] 12 03:50:53 < jeremyb> maybe a https://gdash.wikimedia.org/dashboards/reqerror/ limited to image scaling [04:05:22] graphite is confusing [04:05:25] and i have to run [04:05:25] because i can't see graphite [04:05:28] yeah [04:05:29] ok [04:05:32] sorry [04:07:07] (i'm still getting these visible errors but not anything i can tie back to a 50x) [04:08:55] (thanks for fixing it though!) [04:38:09] damn you puppet [04:40:06] * paravoid was just bitten hard by http://projects.puppetlabs.com/issues/14518 [04:40:45] paravoid: i may have found a good case of something that might be in graphite but not gdash (so i can't see the graphs or even easily know if it exists). see above :) [04:43:17] hrmmm, no activity on that bug in over a year. did no one else get bitten? [04:58:56] (03PS1) 10Faidon: ceph: add ensure param to ceph::key [operations/puppet] - 10https://gerrit.wikimedia.org/r/78791 [04:58:57] (03PS1) 10Faidon: ceph: add ceph::nagios class [operations/puppet] - 10https://gerrit.wikimedia.org/r/78792 [04:58:58] (03PS1) 10Faidon: Re-enable LVS check for ms-fe.svc.eqiad.wmnet [operations/puppet] - 10https://gerrit.wikimedia.org/r/78793 [04:59:32] (03CR) 10Faidon: [C: 032] "Tested with puppet apply." [operations/puppet] - 10https://gerrit.wikimedia.org/r/78791 (owner: 10Faidon) [05:01:14] (03PS2) 10Faidon: Re-enable LVS check for ms-fe.svc.eqiad.wmnet [operations/puppet] - 10https://gerrit.wikimedia.org/r/78793 [05:01:31] poor gerrit [05:01:56] (03CR) 10Faidon: [C: 032 V: 032] Re-enable LVS check for ms-fe.svc.eqiad.wmnet [operations/puppet] - 10https://gerrit.wikimedia.org/r/78793 (owner: 10Faidon) [05:07:51] (03PS2) 10Faidon: ceph: add ceph::nagios class [operations/puppet] - 10https://gerrit.wikimedia.org/r/78792 [05:07:52] (03PS2) 10Faidon: ceph: add ensure param to ceph::key [operations/puppet] - 10https://gerrit.wikimedia.org/r/78791 [05:10:50] !log reedy synchronized php-1.22wmf12/extensions/WikimediaMaintenance/addToSites.php [05:11:01] Logged the message, Master [05:11:13] (03CR) 10Faidon: [C: 032 V: 032] ceph: add ensure param to ceph::key [operations/puppet] - 10https://gerrit.wikimedia.org/r/78791 (owner: 10Faidon) [05:11:27] is this a bad day to work? [05:11:31] multiple different outages [05:11:35] a puppet bug [05:11:51] gerrit is slow [05:11:54] Take the rest of the day off [05:12:04] We noticed that here [05:12:13] maganese looks pretty idle though [05:12:42] (03CR) 10Faidon: [C: 032] "Tested." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/78792 (owner: 10Faidon) [05:13:30] 100% CPU, doesn't look so idle [05:17:33] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Miscellaneous+eqiad&h=manganese.wikimedia.org&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [05:17:37] Ganglia lies then [05:18:16] no, ganglia divides the % with the CPUs [05:18:21] so it's a real percentage, rather than going > 100 [05:18:49] manganese's cpu is ~130% right now [05:19:01] it has 8 cpus, which makes ganglia report it as 16% [05:19:28] That's slightly irritating [05:20:09] at least links to viwikivoyage are appearing on frwikivoyagre [05:20:30] btw, regarding your RTs about lucene for the two new wikis [05:20:50] I think I may have seen some comments that suggested that new wikis are going to be CirrusSearch'ed now [05:21:15] orly? [05:23:35] https://gerrit.wikimedia.org/r/#/c/78083/2/wmf-config/CommonSettings.php [05:23:47] # New wikis are special and get Cirrus :) [05:24:00] that's what I remembered [05:26:03] Ah, but that's not merged yet [05:27:02] Which will come first... That patch being fixed or users of the new wikis complaining? :D [05:37:22] PROBLEM - Ceph on ms-fe1004 is CRITICAL: Ceph [05:38:02] PROBLEM - Ceph on ms-fe1001 is CRITICAL: Ceph [05:38:12] PROBLEM - Ceph on ms-fe1003 is CRITICAL: Ceph [05:55:00] hm [06:01:02] RECOVERY - Ceph on ms-fe1001 is OK: Ceph HEALTH_OK [06:01:22] (03PS1) 10Faidon: ceph: actually user ceph::key's owner/group/mode [operations/puppet] - 10https://gerrit.wikimedia.org/r/78794 [06:02:23] (03CR) 10Faidon: [C: 032] ceph: actually user ceph::key's owner/group/mode [operations/puppet] - 10https://gerrit.wikimedia.org/r/78794 (owner: 10Faidon) [06:11:12] RECOVERY - Ceph on ms-fe1003 is OK: Ceph HEALTH_OK [06:11:22] RECOVERY - Ceph on ms-fe1004 is OK: Ceph HEALTH_OK [06:14:02] PROBLEM - Ceph on ms-fe1001 is CRITICAL: Ceph HEALTH_WARN noup flag(s) set [06:14:12] PROBLEM - Ceph on ms-fe1003 is CRITICAL: Ceph HEALTH_WARN noup flag(s) set [06:14:22] PROBLEM - Ceph on ms-fe1004 is CRITICAL: Ceph HEALTH_WARN noup flag(s) set [06:14:26] hm, takes a while [06:15:02] RECOVERY - Ceph on ms-fe1001 is OK: Ceph HEALTH_OK [06:15:12] RECOVERY - Ceph on ms-fe1003 is OK: Ceph HEALTH_OK [06:15:22] RECOVERY - Ceph on ms-fe1004 is OK: Ceph HEALTH_OK [06:24:12] (03PS1) 10Faidon: ceph: fix sysctl invocations [operations/puppet] - 10https://gerrit.wikimedia.org/r/78796 [06:26:16] (03CR) 10Faidon: [C: 032] ceph: fix sysctl invocations [operations/puppet] - 10https://gerrit.wikimedia.org/r/78796 (owner: 10Faidon) [06:27:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:28:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.160 second response time [06:33:42] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:34:42] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [06:37:41] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:40:32] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [06:54:03] (03PS1) 10Faidon: uninstall os-prober from all machines [operations/puppet] - 10https://gerrit.wikimedia.org/r/78797 [06:55:01] PROBLEM - Puppet freshness on db9 is CRITICAL: No successful Puppet run in the last 10 hours [06:58:22] (03CR) 10Faidon: [C: 032] uninstall os-prober from all machines [operations/puppet] - 
10https://gerrit.wikimedia.org/r/78797 (owner: 10Faidon) [07:16:13] (03PS3) 10TTO: Continuing to clean up InitialiseSettings.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78637 [07:16:21] PROBLEM - SSH on searchidx1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:18:11] RECOVERY - SSH on searchidx1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [07:18:27] (03CR) 10jenkins-bot: [V: 04-1] Continuing to clean up InitialiseSettings.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78637 (owner: 10TTO) [07:27:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:28:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.153 second response time [07:31:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:31:49] (03CR) 10Edenhill: [C: 031] Link dynamically with librdkafka [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/78780 (owner: 10Faidon) [07:32:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [07:37:26] (03CR) 10Faidon: [C: 032 V: 032] Link dynamically with librdkafka [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/78780 (owner: 10Faidon) [07:49:31] Ryan_Lane: heya [07:49:41] paravoid: howdy [07:49:46] (03CR) 10Edenhill: [C: 031] "(5 comments)" [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon) [07:50:06] Ryan_Lane: ferm was merged last week [07:50:09] jfyi ;) [07:50:27] awesome :) [07:50:37] I'll need to switch to that for the openstack manifests [07:50:45] yep [07:50:51] I'd be happy to review [07:50:55] feel free to add me as a reviewer [07:57:55] !log krinkle synchronized wmf-config/interwiki.cdb 'Updating interwiki cache' [07:58:06] Logged the message, Master [07:58:39] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:00:39] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [08:00:47] !log Ran updateiwcache to fix issue with viwikivoyage interwiki links resolving to $lang.wikipedia.org instead of wikivoyage [08:00:58] Logged the message, Master [08:03:39] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:07:29] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [08:29:39] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:35:39] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [08:43:39] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:50:29] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [08:53:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:54:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [09:09:38] addshore: sure, when I do more puppet stuff :) [09:09:40] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:09:42] let me add for the current two things [09:13:39] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [09:14:30] YuviPanda_zz: you took about 3 days to reply xD thats rather 
funny :P [09:14:38] addshore: well, hong kong :P [09:14:43] xD [09:15:24] done [09:17:23] (03PS4) 10TTO: Continuing to clean up InitialiseSettings.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78637 [09:18:27] :> [09:18:30] cheers! [09:23:40] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:29] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [09:38:53] (03PS2) 10Andrew Bogott: Add VIPS / TIFF packages to toollabs exec_environ [operations/puppet] - 10https://gerrit.wikimedia.org/r/78629 (owner: 10Yuvipanda) [09:40:36] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:45:32] (03CR) 10Andrew Bogott: [C: 032] Add VIPS / TIFF packages to toollabs exec_environ [operations/puppet] - 10https://gerrit.wikimedia.org/r/78629 (owner: 10Yuvipanda) [09:52:50] !log stopping etherpad-lite to load old etherpad data [09:53:01] Logged the message, Master [10:11:16] PROBLEM - NTP on pdf2 is CRITICAL: NTP CRITICAL: No response from NTP server [10:13:16] RECOVERY - NTP on pdf2 is OK: NTP OK: Offset -0.001949667931 secs [10:13:26] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [10:19:36] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:23:06] checking out pdf2 [10:23:47] grrr, unresponsive on console [10:24:38] !log rebooting pdf2 [10:24:44] woo rebooting from the airport! [10:24:49] Logged the message, Mistress of the network gear. [10:26:40] hrm [10:26:44] this is taking a while to reboot [10:26:57] haha as soon as i type that it gets going [10:29:26] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [10:31:16] PROBLEM - NTP on pdf2 is CRITICAL: NTP CRITICAL: Offset unknown [10:32:16] RECOVERY - NTP on pdf2 is OK: NTP OK: Offset -0.001126885414 secs [10:32:59] !log dist-upgrading pdf2 [10:33:10] Logged the message, Mistress of the network gear. [10:33:57] !log uninstalling wpasupplicant from pdf2 [10:33:59] because wtf ? [10:34:08] Logged the message, Mistress of the network gear. [10:36:01] !log removing wirelesstools from pdf2 [10:36:02] also wtf [10:36:12] Logged the message, Mistress of the network gear. [10:36:59] !log removing wirelesstools from pdf1 [10:37:08] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:37:09] Logged the message, Mistress of the network gear. [10:37:15] !log removing wpasupplicant from pdf1 [10:37:26] Logged the message, Mistress of the network gear. [10:37:29] does anyone know how those got there ? [10:38:28] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [10:51:08] LeslieCarr: probably because the pediapress guys had root on those in the past [10:51:58] hehe [10:52:01] probably :) [10:53:40] I remember one time Tim went to try fix something on those boxes and came back with nopenopenope [11:05:03] i'm out! pdf machines better not die again [11:05:08] or regrow spasupplicant [11:05:14] !log removed wpasupplicant from pdf3 [11:05:24] Logged the message, Mistress of the network gear. 
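The wpasupplicant and wireless-tools packages turning up on the pdf hosts prompt the "does anyone know how those got there?" question above. A hedged sketch of a one-off audit for stray packages like these; the package list is illustrative, and the check shells out to dpkg-query, the standard Debian/Ubuntu way to ask whether a package is installed.

    #!/usr/bin/env python3
    """Report whether unexpected packages are installed on this host.
    Package choices below are illustrative, not a Wikimedia standard."""
    import subprocess

    UNWANTED = ["wpasupplicant", "wireless-tools", "os-prober"]

    def installed(package):
        # dpkg-query prints e.g. "install ok installed" for installed packages
        # and exits non-zero for ones it has never heard of.
        try:
            out = subprocess.check_output(
                ["dpkg-query", "-W", "-f=${Status}", package],
                stderr=subprocess.DEVNULL)
        except subprocess.CalledProcessError:
            return False
        return b"installed" in out and b"not-installed" not in out

    if __name__ == "__main__":
        for pkg in UNWANTED:
            print("%-16s %s" % (pkg, "INSTALLED" if installed(pkg) else "absent"))

The longer-term fix is of course to manage the removal in puppet, which is exactly what the os-prober change above (78797) does for one of these.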
[11:18:19] PROBLEM - NTP on pdf3 is CRITICAL: NTP CRITICAL: No response from NTP server [11:28:02] (03CR) 10Akosiaris: [C: 032] Introducing bacula module [operations/puppet] - 10https://gerrit.wikimedia.org/r/70840 (owner: 10Akosiaris) [11:38:12] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [11:38:12] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [11:38:12] PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours [11:38:12] PROBLEM - Puppet freshness on pdf3 is CRITICAL: No successful Puppet run in the last 10 hours [11:38:12] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours [11:38:13] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [12:09:12] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [12:09:12] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [12:09:12] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [12:09:12] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [12:09:12] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [12:09:13] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [12:25:12] PROBLEM - Puppet freshness on mchenry is CRITICAL: No successful Puppet run in the last 10 hours [13:27:01] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:27:51] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [13:53:13] ottomata: hey [13:53:19] ottomata: I have preliminary packages for varnishkafka [13:53:55] (I also just added bblack as a reviewer to Snaps' changesets, among us he's surely a much better C coder than I am :) [13:54:26] someone needs to start working on the varnishkafka puppet bits, should be easy [13:54:42] although we need to fix the init script first I guess [13:55:08] ottomata: what are your plans re: kafka 0.8 in prod? [13:55:39] this is kind of a blocker for varnishkafka performance testing, labs isn't really well suited for that [13:55:52] oh hm, well [13:56:09] if LeslieCarr is around this week, then we are going to 'repave' the analytics hadoop/kafka stuff together [13:56:16] ok [13:56:26] that will get us 2 0.8 kafka brokers up [13:56:47] but, since we still don't know exactly how we are going to consume into hdfs yet [13:56:57] using 0.8 means we won't really be able to [13:57:06] but hm [13:57:07] actually [13:57:10] what's the blocker? [13:57:17] you just need 0.8 brokers to send to? [13:57:21] I guess :) [13:57:38] i could probably fire up a couple on 2 of the cisco boxes [13:57:45] I think the plan was to just put varnishkafka on e.g. 
a varnish prod box [13:57:51] like one of the mobile ones [13:57:53] aye [13:57:57] yeah [13:57:58] hm [13:58:08] and leave it like that for a while, see how it does [13:58:16] aye ok [13:58:23] hm [13:58:44] I'll talk to Diederik today to see if we can basically just shut down Kraken this week [13:58:44] I guess we can wait until all the repaving work is done, but it'd be nice to get some early feedback for Snaps [13:58:47] til we get this worked out [13:58:48] yeah [13:59:02] if we can repave the kafka bit and don't mind the shut down (I think we can) [13:59:18] then the easiest thing to do woudl be just to start up 0.8 on the two kafka prod machines [13:59:28] k [13:59:30] how far is camus? [13:59:40] I have no idea what that entails [13:59:43] i'm still just playing with it, its a little young, requires coding to get right [13:59:44] its not hard [13:59:52] but, my java fu is lacking [14:00:06] gonna see if I can get qchris to help me out today with that [14:00:20] hello qchris! :) [14:00:34] what kind of coding? [14:00:37] Hi ottomata [14:01:09] basically, just implmenting our own etl/MapReduce subclass [14:01:19] the main bit is extracting the timestamp out of the dtta [14:01:20] data [14:01:29] so it depends on the final format of the data from kafka [14:01:36] i'd like to have just a 'timestamp' field in json [14:01:38] that woudl be best [14:01:46] right now varnishkafka is not using our usual timestamp format [14:02:00] oo, which reminds me, I need to ask Snaps to fix that [14:02:48] Camus is good, but not well documented and very young [14:03:06] its been a lot of code reading for me, and I don't yet fully grasp what exactly should be done [14:03:10] ottomata: Sure I can try to help you with that [14:03:26] ottomata: But my camus experience == 0 [14:03:38] yea, i'm only a 0.2 [14:03:44] so we'll do it together [14:04:00] :-) [14:04:50] Just let me know what/when/how/... I can do. [14:08:36] k, gimme 30 mins to get through emails and get back into it [14:57:16] (03PS1) 10Ottomata: Adding custom Zookeeper chroot support. [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/78816 [14:58:41] (03CR) 10Ottomata: [C: 032 V: 032] Adding custom Zookeeper chroot support. [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/78816 (owner: 10Ottomata) [15:06:56] ottomata: okay, specify what you need! [15:10:37] (03PS9) 10Ottomata: Adding role/analytics/kafka.pp Also adding modules/kafka [operations/puppet] - 10https://gerrit.wikimedia.org/r/77971 [15:12:12] Snaps, our timestamps are of the form: [15:12:12] 2013-08-11T06:26:58.123 [15:12:46] YYYY-mm-ddTHH:MM:SS(.ms0 [15:12:47] ) [15:13:04] that's iso 8601, isn't it? [15:13:08] ja think so [15:15:35] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:16:35] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [15:16:49] (03PS10) 10Ottomata: Adding role/analytics/kafka.pp Also adding modules/kafka [operations/puppet] - 10https://gerrit.wikimedia.org/r/77971 [15:21:25] paravoid, here's a quick kafka q [15:21:39] we'll have different kafka clusters in each datacenter, right? 
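The Camus work sketched above reduces to one concrete step: pull a timestamp out of each JSON record coming off Kafka, ideally from a plain "timestamp" field in the ISO 8601 shape quoted at 15:12 (2013-08-11T06:26:58.123). The snippet below shows just that extraction in isolation; the field name and sample record are assumptions, and the real thing would be the Java etl/MapReduce subclass mentioned above, not a Python script.

    #!/usr/bin/env python3
    """Parse the 'timestamp' field of a JSON log record into epoch milliseconds,
    the piece a Camus-style decoder needs in order to partition data by time.
    Field name and sample record are illustrative."""
    import json
    from datetime import datetime, timezone

    def extract_ts_millis(raw_record):
        record = json.loads(raw_record)
        ts = record["timestamp"]                 # e.g. "2013-08-11T06:26:58.123"
        dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f")
        dt = dt.replace(tzinfo=timezone.utc)     # assume UTC, as the logs above are
        return int(dt.timestamp() * 1000)

    if __name__ == "__main__":
        sample = '{"timestamp": "2013-08-11T06:26:58.123", "uri": "/wiki/Main_Page"}'
        print(extract_ts_millis(sample))         # -> 1376202418123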
[15:21:59] the broker nodes we have right now are in eqiad [15:22:08] so, the eqiad varnishes could produce directly to them [15:22:09] OR [15:22:21] (03CR) 10jenkins-bot: [V: 04-1] Adding role/analytics/kafka.pp Also adding modules/kafka [operations/puppet] - 10https://gerrit.wikimedia.org/r/77971 (owner: 10Ottomata) [15:22:24] they could be used for aggregation as the analytics boxes [15:22:38] and there could be separate brokers for eqiad for varnishes to produce to [15:25:03] why would we have two clusters? [15:25:16] PROBLEM - Etherpad HTTP on hooper is CRITICAL: Connection refused [15:26:09] in eqiad? [15:26:20] yes [15:26:35] not sure really, the only pro is maybe that then the analytics cluster is only an aggregator [15:26:37] from all the datacenters [15:26:56] I don't see the point but I don't know the arch that well [15:27:15] RECOVERY - Etherpad HTTP on hooper is OK: HTTP OK: HTTP/1.1 302 Found - 338 bytes in 0.080 second response time [15:27:22] there might not be a point really, it might be fine for eqiad varnishes to produce to analytics cluster [15:27:32] i'm just pointing out that there are 2 optoins on what to do in eqiad [15:32:06] bahhhhh hahah [15:32:06] https://github.com/philipl/pifs [15:35:47] gerrit seems not working [15:35:49] can't pull [15:35:53] or clone [15:36:13] HMMM [15:36:15] yes I can [15:36:15] hm [15:36:18] maybe not from labs? [15:54:22] (03PS11) 10Ottomata: Adding role/analytics/kafka.pp Also adding modules/kafka [operations/puppet] - 10https://gerrit.wikimedia.org/r/77971 [15:59:41] (03CR) 10jenkins-bot: [V: 04-1] Adding role/analytics/kafka.pp Also adding modules/kafka [operations/puppet] - 10https://gerrit.wikimedia.org/r/77971 (owner: 10Ottomata) [15:59:42] (03PS1) 10Jgreen: reenable otrs GenericAgent.pm [operations/puppet] - 10https://gerrit.wikimedia.org/r/78819 [16:05:02] (03CR) 10jenkins-bot: [V: 04-1] reenable otrs GenericAgent.pm [operations/puppet] - 10https://gerrit.wikimedia.org/r/78819 (owner: 10Jgreen) [16:07:00] ottomata: if I exposed the strftime formatting (allow any time format to be specified), would that be good? [16:07:34] %{!strfime:%Y-%m-%dT%...}t [16:10:18] (03CR) 10Jgreen: [C: 032 V: 032] reenable otrs GenericAgent.pm [operations/puppet] - 10https://gerrit.wikimedia.org/r/78819 (owner: 10Jgreen) [16:14:16] (03PS2) 10Petr Onderka: Deleting deleted pages [operations/dumps/incremental] (gsoc) - 10https://gerrit.wikimedia.org/r/78393 [16:14:52] oh my. puppet-merge looked pretty scary [16:14:58] (03CR) 10Petr Onderka: [C: 032 V: 032] Deleting deleted pages [operations/dumps/incremental] (gsoc) - 10https://gerrit.wikimedia.org/r/78393 (owner: 10Petr Onderka) [16:20:24] (03PS1) 10Jgreen: stupid jenkins-bot. [operations/puppet] - 10https://gerrit.wikimedia.org/r/78820 [16:21:46] is anyone working on the gerrit issue? [16:21:59] RECOVERY - Puppet freshness on mchenry is OK: puppet ran at Mon Aug 12 16:21:53 UTC 2013 [16:22:22] me [16:22:43] Jeff_Green: it is working better know [16:22:52] ah good. i've got the [publish & submit] button grayed out. any ideas? [16:23:28] https://gerrit.wikimedia.org/r/#/c/78820 [16:23:29] it had 5 stuck commands in queue... i killed one and it seems to move on with the rest but the other 4 are still running [16:23:32] (03CR) 10Jgreen: [C: 032 V: 032] stupid jenkins-bot. [operations/puppet] - 10https://gerrit.wikimedia.org/r/78820 (owner: 10Jgreen) [16:23:44] i think i will killed them all and see what happens [16:25:07] oddly it says "review in progress" even after jenkins-bot has verified. 
is that normal? [16:26:25] well it was a stuck state [16:26:35] some minutes ago i couldn't even pull [16:26:52] i see [16:27:22] (03Abandoned) 10Jgreen: stupid jenkins-bot. [operations/puppet] - 10https://gerrit.wikimedia.org/r/78820 (owner: 10Jgreen) [16:27:49] try once more... [16:28:21] now my checkout is fucked. sigh. 20 minutes to remove a pound sign. :-) [16:29:35] lol [16:32:08] git review looks like it's going to time out [16:32:24] (03PS1) 10Jgreen: redo the fantastic removal of a single pound sign [operations/puppet] - 10https://gerrit.wikimedia.org/r/78822 [16:32:29] miracle [16:34:20] lol [16:42:25] (03PS1) 10Akosiaris: Rewrite etherpad to etherpad-lite URLs for compat [operations/puppet] - 10https://gerrit.wikimedia.org/r/78823 [16:43:12] * AaronSchulz reads about the outage [16:45:11] Snaps, yeah taht would be awesome [16:46:16] (03CR) 10Akosiaris: [C: 032] Rewrite etherpad to etherpad-lite URLs for compat [operations/puppet] - 10https://gerrit.wikimedia.org/r/78823 (owner: 10Akosiaris) [16:49:28] (03CR) 10Jgreen: [C: 032 V: 031] redo the fantastic removal of a single pound sign [operations/puppet] - 10https://gerrit.wikimedia.org/r/78822 (owner: 10Jgreen) [16:51:35] PROBLEM - HTTP on zirconium is CRITICAL: Connection refused [16:51:55] akosiaris: that's you [16:52:01] i know [16:54:22] (03PS1) 10Akosiaris: Just the old hostname for etherpad [operations/puppet] - 10https://gerrit.wikimedia.org/r/78825 [16:55:45] PROBLEM - Puppet freshness on db9 is CRITICAL: No successful Puppet run in the last 10 hours [16:59:45] RECOVERY - Puppet freshness on db9 is OK: puppet ran at Mon Aug 12 16:59:41 UTC 2013 [16:59:56] (03CR) 10Akosiaris: [C: 032] Just the old hostname for etherpad [operations/puppet] - 10https://gerrit.wikimedia.org/r/78825 (owner: 10Akosiaris) [17:08:35] RECOVERY - HTTP on zirconium is OK: HTTP OK: HTTP/1.1 200 OK - 208455 bytes in 0.053 second response time [17:13:24] gah! gerrit is stupid slow again. [17:13:28] <^d> Yeah. [17:13:30] <^d> I just noticed. [17:13:32] <^d> Looking. [17:16:05] yeah... I issue some kill commands to some gerrit jobs that seemed stuck some 30 minutes ago and it got better but it's crappy again [17:18:54] known issue about the ssl cert mismatch on epl.wikimedia.org right now [17:18:57] ? [17:19:19] (it's the cert for *.planet.wikimedia) [17:19:33] also, first things first... [17:19:49] * greg-g goes to delete rsa keys that are associated with his dell XPS [17:20:01] greg-g: fixed [17:20:09] akosiaris: awesome sauce [17:21:54] <^d> !log gerrit restarting [17:21:59] <^d> *sigh* [17:22:05] Logged the message, Master [17:24:47] ^d: i though i could avoid that... Seems i was wrong. Any idea what happened ? [17:25:48] <^d> It's been freaking out for about the last hour or so complaining about database connections in various ways. [17:25:58] <^d> Concurrency exceptions, "server went away" crap, so forth. [17:26:27] :-( [17:26:28] <^d> We had this problem awhile ago, but we never found a root cause and it kinda just Went Away. Might have to devote some time this week to figuring it out. [17:26:45] ^d: thanks for making it better for now. 
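Returning to the varnishkafka timestamp thread above (15:12 and 16:07): exposing the strftime format would let the producer emit the standard layout directly, but C strftime stops at whole seconds, so the milliseconds have to be appended outside the format string. The sketch below uses the obvious candidate format; it illustrates the layout only and is not necessarily what varnishkafka ended up shipping.

    #!/usr/bin/env python3
    """Show the strftime layout matching the timestamp format quoted above,
    with milliseconds appended separately since strftime itself has no
    sub-second directive."""
    import time

    ISO_BASIC = "%Y-%m-%dT%H:%M:%S"   # candidate for a %{...}t-style strftime option

    def iso8601_millis(epoch=None):
        if epoch is None:
            epoch = time.time()
        millis = int(round(epoch * 1000)) % 1000
        return time.strftime(ISO_BASIC, time.gmtime(epoch)) + ".%03d" % millis

    if __name__ == "__main__":
        print(iso8601_millis(1376202418.123))   # -> 2013-08-11T06:26:58.123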
[17:27:02] <^d> yw [17:40:40] (03PS1) 10Akosiaris: OLD etherpad install to etherpad-old.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/78832 [17:42:50] (03CR) 10Akosiaris: [C: 032] OLD etherpad install to etherpad-old.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/78832 (owner: 10Akosiaris) [17:54:59] <^d> cmjohnson1: Ping [17:58:34] ^d what's up [18:00:17] <^d> So, testsearch1001 is a little funky. paravoid and I did some debugging last week, and he seemed to think it was either hardware or power settings. [18:00:31] <^d> Compare http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&c=Miscellaneous+eqiad&h=testsearch1002.eqiad.wmnet&tab=m&vn=&mc=2&z=small&metric_group=ALLGROUPS and http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&c=Miscellaneous+eqiad&h=testsearch1001.eqiad.wmnet&tab=m&vn=&mc=2&z=small&metric_group=ALLGROUPS [18:02:18] ^d it's possible it's a power setting...because of the lack for redundant power in tampa the power settings were tinkered with...can I take them down to look? [18:02:38] that is where those boxes came from (tampa) [18:03:06] <^d> Do whatever you need, they're not serving anything yet :) [18:03:12] okay..cool [18:03:19] will let you k now...is there a ticket? [18:03:52] <^d> I believe we were using rt #5555 for this. [18:05:17] cmjohnson1: #5555 [18:06:44] paravoid: why do you think the the increase in cpu utilization is related to the rt5555...if that were the case we would see this problem with all of our newer servers [18:07:10] it looks like it's only r320s [18:09:18] the cpu kernel messages are happening on the r420's as well [18:14:14] (03PS1) 10Petr Onderka: Progress reporting on standard error stream [operations/dumps/incremental] (gsoc) - 10https://gerrit.wikimedia.org/r/78835 [18:16:55] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:17:55] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [18:20:00] !log powering down testsearch1001 and 1002 to compare settings [18:20:11] Logged the message, Master [18:26:05] PROBLEM - Host testsearch1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:27:50] paravoid: http://kafka.apache.org/08/ops.html#datacenters (re kafka broker clusters & datacenters) [18:28:35] PROBLEM - Host testsearch1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:28:53] (03CR) 10Akosiaris: "I have added rspec tests which fail at this point (mostly due to some dependencies and the ::monitor_service thingy). I have also done VM " [operations/puppet] - 10https://gerrit.wikimedia.org/r/77720 (owner: 10Akosiaris) [18:30:31] (03PS2) 10Petr Onderka: Progress reporting on standard error stream [operations/dumps/incremental] (gsoc) - 10https://gerrit.wikimedia.org/r/78835 [18:34:19] paravoid ^d so the issue is related to the system profile setup. tsearch1001 was set to OS leading to all the cpu messages and tsearch1002 was set to dapc. 
Other than the both cfg's were the same [18:36:30] parvoid: we can set it to custom and allow the dell controller to maintain the cpu utilization or set it to max performance as a h/w temporary fix [18:40:04] (03CR) 10Petr Onderka: [C: 032 V: 032] Progress reporting on standard error stream [operations/dumps/incremental] (gsoc) - 10https://gerrit.wikimedia.org/r/78835 (owner: 10Petr Onderka) [18:49:55] RECOVERY - Host testsearch1001 is UP: PING WARNING - Packet loss = 61%, RTA = 0.29 ms [18:50:36] RECOVERY - Host testsearch1002 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [19:08:07] mutante: I assume the thing with John for etherpad didn't happen? [19:28:02] (03PS1) 10Manybubbles: Turn on more default elasticsearch logging. [operations/puppet] - 10https://gerrit.wikimedia.org/r/78903 [20:31:11] <^d> cmjohnson1: For what little my opinion matters, I say give it a shot :) [20:31:35] <^d> I know zilch about what we're doing here though :) [20:33:28] ^d changing the setting to max performance will get the cpu power notification msgs to stop that you see in dmesg. I have demonstrated that already. However, that is not exactly we want either. I don't think that last weeks high cpu utilization rate is related cuz we would've seen it elsewhere unless there is someting funky about the R320 but again..would've seen it other servers [20:36:38] <^d> cmjohnson1: Makes sense. I'd say set it for now so we can silence that problem, and if any other problems remain we should still see issues. [20:37:10] ^ my thoughts as well [20:38:18] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours [20:38:44] heya paravoid, you there? [20:38:59] wanna know what you think about maybe adding another init script to kafka .deb [20:39:02] for mirroring [20:39:18] https://cwiki.apache.org/confluence/display/KAFKA/Kafka+mirroring+%28MirrorMaker%29 [20:49:52] (03CR) 10Demon: "(1 comment)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78083 (owner: 10Demon) [21:38:44] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [21:38:44] PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours [21:38:44] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [21:38:44] PROBLEM - Puppet freshness on pdf3 is CRITICAL: No successful Puppet run in the last 10 hours [21:38:44] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours [21:38:45] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [21:41:39] Hello! In case people didn't know, git.wikimedia.org is down. [21:43:00] yaron: http://lists.wikimedia.org/pipermail/wikimania-l/2013-August/005064.html ;-) [21:43:52] Ah - I guess it takes people a while to recuperate from Wikimania. :) [21:50:14] ^d is in the office, presumably looking into it. I know it isn't as simple as just "kick it and done" anymore :/ [22:05:44] <^d> I thought we disallowed all robots. 
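git.wikimedia.org (gitblit) falling over and the "I thought we disallowed all robots" reaction lead into change 78919 below, which disallows all indexing. As a generic, from-the-outside check that a host's robots.txt really turns well-behaved crawlers away, something like the following works with only the standard library; the host name is simply the one from the outage report above, and the check itself makes no claim about what the puppet patch does internally.

    #!/usr/bin/env python3
    """Check whether a site's robots.txt denies everything to generic crawlers."""
    import urllib.robotparser

    def crawling_disallowed(base_url):
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(base_url.rstrip("/") + "/robots.txt")
        rp.read()
        # If a generic user agent may not fetch the root, indexing is effectively off.
        return not rp.can_fetch("*", base_url + "/")

    if __name__ == "__main__":
        # Host from the outage report above.
        print(crawling_disallowed("https://git.wikimedia.org"))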
[22:09:12] <^d> Someone mind looking at a one-liner: https://gerrit.wikimedia.org/r/78919 [22:09:44] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [22:09:44] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [22:09:44] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [22:09:44] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [22:09:44] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [22:09:45] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [22:10:23] (03PS1) 10Demon: Disallowing all indexing for now [operations/puppet] - 10https://gerrit.wikimedia.org/r/78919 [22:19:41] <^d> cmjohnson1: testsearch1001 seems to be much happier now fwiw. [22:20:04] we'll see if it lasts [22:22:22] <^d> Yeah. Thanks for your help :) [22:24:33] <^d> binasher: Could you look at a one-liner for puppet for me? Same thing paravoid did before, but in puppet now so it won't get reverted. [22:25:00] url? [22:25:06] <^d> https://gerrit.wikimedia.org/r/#/c/78919/ [22:37:18] (03PS3) 10Demon: Redo search configuration [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78083 [22:37:22] (03CR) 10jenkins-bot: [V: 04-1] Redo search configuration [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78083 (owner: 10Demon) [22:40:33] (03PS4) 10Demon: Redo search configuration [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78083 [22:40:55] (03CR) 10Demon: "PS3 fixes problems, PS4 a rebase." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78083 (owner: 10Demon) [22:44:53] RECOVERY - mysqld processes on db1009 is OK: PROCS OK: 1 process with command name mysqld [22:55:59] (03CR) 10Aaron Schulz: [C: 031] Redo search configuration [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78083 (owner: 10Demon) [23:43:20] (03PS1) 10Ottomata: kafka.init - Using $DEFAULT in error message. [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/78923 [23:43:21] (03PS1) 10Ottomata: Adding mirror-maker and consumer-offset-checker to kafka bin script [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/78924 [23:43:38] (03CR) 10Ottomata: [C: 032 V: 032] kafka.init - Using $DEFAULT in error message. [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/78923 (owner: 10Ottomata) [23:44:48] (03CR) 10Ottomata: [C: 032 V: 032] Adding mirror-maker and consumer-offset-checker to kafka bin script [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/78924 (owner: 10Ottomata)
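The day closes with MirrorMaker being wired into the Kafka packaging (a mirror-maker entry in the bin script, plus the earlier question of a second init script) for aggregating per-datacenter clusters, per the Kafka 0.8 datacenters doc linked at 18:27. Purely as a conceptual illustration: MirrorMaker itself is a Java tool driven by consumer and producer config files, but its core loop is just "consume from the local cluster, republish to the aggregate cluster", which the sketch below mimics with the third-party kafka-python client. Broker addresses and the topic name are placeholders, not the production setup.

    #!/usr/bin/env python3
    """Toy illustration of what Kafka mirroring does: read every message from a
    source cluster and republish it, unchanged, to an aggregate cluster.
    This is NOT MirrorMaker; it only mimics its core loop.
    Requires the third-party kafka-python package; hosts and topic are placeholders."""
    from kafka import KafkaConsumer, KafkaProducer

    SOURCE_BROKERS = ["source-kafka1:9092"]        # e.g. a per-datacenter cluster
    AGGREGATE_BROKERS = ["aggregate-kafka1:9092"]  # e.g. the analytics cluster
    TOPIC = "webrequest"                           # placeholder topic name

    def mirror():
        consumer = KafkaConsumer(TOPIC,
                                 bootstrap_servers=SOURCE_BROKERS,
                                 group_id="mirror-sketch",
                                 auto_offset_reset="earliest")
        producer = KafkaProducer(bootstrap_servers=AGGREGATE_BROKERS)
        for message in consumer:
            # Republish the raw bytes; a real mirror also preserves keys/partitioning.
            producer.send(TOPIC, value=message.value, key=message.key)

    if __name__ == "__main__":
        mirror()

In production the real MirrorMaker shipped with Kafka (the entry added in 78924 above) would be run via the init script under discussion, not anything like this sketch.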