[00:06:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:07:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [00:13:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:14:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [00:21:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:24:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [00:52:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:53:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [01:13:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:14:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [01:22:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [01:32:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:33:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [01:34:03] (03PS1) 10Faidon: Link dynamically with librdkafka [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/78780 [01:38:11] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [01:38:11] PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours [01:38:11] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [01:38:11] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours [01:38:11] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [01:38:12] PROBLEM - Puppet freshness on pdf3 is CRITICAL: No successful Puppet run in the last 10 hours [01:47:26] (03PS1) 10Faidon: (WIP) Initial Debian version [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 [01:52:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:53:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [02:09:11] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [02:09:11] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [02:09:11] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [02:09:11] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [02:09:11] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [02:09:12] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful 
Puppet run in the last 10 hours [02:10:38] !log LocalisationUpdate completed (1.22wmf12) at Mon Aug 12 02:10:38 UTC 2013 [02:10:50] Logged the message, Master [02:13:59] (03PS1) 10Reedy: Revert "Super secret Wikidata logo for Wikimania HK 2013" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78783 [02:14:12] (03CR) 10jenkins-bot: [V: 04-1] Revert "Super secret Wikidata logo for Wikimania HK 2013" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78783 (owner: 10Reedy) [02:14:22] (03Abandoned) 10Reedy: Revert "Super secret Wikidata logo for Wikimania HK 2013" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78783 (owner: 10Reedy) [02:14:58] * paravoid grumbles [02:19:49] (03PS1) 10Reedy: Go back to normal Wikidata logo [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78784 [02:20:42] (03CR) 10Reedy: [C: 032] Go back to normal Wikidata logo [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78784 (owner: 10Reedy) [02:20:52] (03Merged) 10jenkins-bot: Go back to normal Wikidata logo [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78784 (owner: 10Reedy) [02:22:01] !log reedy synchronized wmf-config/InitialiseSettings.php 'Revert back to normal wikidata logo' [02:22:12] Logged the message, Master [02:22:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:23:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [02:24:50] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Aug 12 02:24:49 UTC 2013 [02:25:01] Logged the message, Master [02:25:11] PROBLEM - Puppet freshness on mchenry is CRITICAL: No successful Puppet run in the last 10 hours [03:09:08] I'm getting a lot of 500 errors for generating thumbs, all seem to be from the server mw1153 [03:12:31] bawolff: example url? [03:13:04] http://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/2004-12-01_500-Millibar_Height_Contour_Map_NOAA.png/120px-2004-12-01_500-Millibar_Height_Contour_Map_NOAA.png [03:13:13] It seems second time around the request goes through [03:13:24] Its quite noticable on Special:newfiles on commons [03:13:30] where several thumbs are broken [03:14:15] In the response body of the files with the error, mw1153 is always the server serving it [03:15:23] bawolff: where are you getting 1153 from? [03:15:36] in the response body of the 500 error [03:15:44] huh. i got an empty body [03:15:46] for a 500 [03:15:58] i guess my accept header maybe was wrong [03:16:34] https://dpaste.de/hthPy/ [03:17:13] You might have to try a couple different files, it appears that once one request goes through succesfully, the thumb is generated, and issue no longer occours [03:17:34] well, sure it's cached then [03:17:39] in swift [03:18:24] Anyway, I have to go, but I'll file a bug as well just in case [03:18:55] now, i got it. same backend [03:19:03] https://bugzilla.wikimedia.org/show_bug.cgi?id=52740 [03:19:09] Thanks :) [03:27:32] ottomata: ping? [03:27:51] i poked elsewhere, no response yet. maybe you can just depool the bad host at least [03:30:45] https://commons.wikimedia.org/wiki/Special:NewFiles looks fine to me. [03:32:44] Elsie: i got a 500 in the last 20 secs [03:33:01] I forgive you. 
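The thumbnail debugging above boils down to two observations: the body of a failing 500 names the backend that rendered it (mw1153 here), and a second request for the same thumb succeeds because the first successful render is cached in Swift. Below is a minimal probe in that spirit, assuming nothing beyond the public thumb URL quoted at 03:13; the retry count and the mwNNNN regex are illustrative choices, not part of any Wikimedia tooling.

    #!/usr/bin/env python3
    """Probe a thumbnail URL a few times and report the status plus any backend
    named in an error body -- a rough reproduction of the debugging above."""
    import re
    import time
    import urllib.request
    import urllib.error

    # URL taken from the report at 03:13; any thumb URL works the same way.
    URL = ("http://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/"
           "2004-12-01_500-Millibar_Height_Contour_Map_NOAA.png/"
           "120px-2004-12-01_500-Millibar_Height_Contour_Map_NOAA.png")

    def probe(url, attempts=3, delay=1.0):
        for i in range(1, attempts + 1):
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    print("attempt %d: HTTP %d, %d bytes"
                          % (i, resp.status, len(resp.read())))
            except urllib.error.HTTPError as err:
                body = err.read().decode("utf-8", errors="replace")
                # The error pages discussed above named the serving backend, e.g. mw1153.
                backend = re.search(r"\bmw\d{4}\b", body)
                print("attempt %d: HTTP %d, backend %s"
                      % (i, err.code, backend.group(0) if backend else "unknown"))
            time.sleep(delay)

    if __name__ == "__main__":
        probe(URL)

Once any attempt returns 200 the rendered thumb sits in Swift, so later attempts should keep succeeding, which matches the "second time around the request goes through" observation above.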
[03:33:06] uhuh [03:36:25] i wonder what a vampire object is [03:45:43] Leslie is going to get her laptop [03:48:16] ok, danke [03:48:49] The thumbnail log seems to suggest it's multiple servers [03:48:56] 2013-08-12 03:48:50 mw1160 jawiki: thumbnail failed on mw1160: error 1 "convert: no decode delegate for this image format `/a/magick-tmp/magick-UKFgMFyc' @ error/constitute.c/ReadImage/532. [03:49:22] 013-08-12 03:49:12 mw1159 commonswiki: thumbnail failed on mw1159: error 1 "Error reading SVG:Error domain 1 code 96 on line 1 column 17 of file:///tmp/localcopy_8f2843e29844-1.svg: Malformed declaration expecting version" from "'/usr/bin/'rsvg-convert --no-external-files -w 120 -h 74 -o '/tmp/transform_c2db57ace573-1.png' '/tmp/localcopy_8f2843e29844-1.svg' 2>&1" [03:49:24] Jees [03:49:27] This log is noisy [03:50:16] i was hoping we'd have 500 or at least 5xx stats for the cluster or individual hosts [03:50:28] i couldn't find it unless it's in graphite where i can't see it [03:50:54] maybe a https://gdash.wikimedia.org/dashboards/reqerror/ limited to image scaling [03:51:07] online now [03:51:52] > MediaWiki error counts: http://ur1.ca/edq1f [03:51:54] i'm a thinking i may put in an access req for graphite :) or just redebate with binasher/faidon [03:52:02] What does the "m" indicate on the Y axis? [03:52:17] Elsie: milli i think [03:52:25] so, less than one per sec [03:52:29] don't quote me! [03:53:30] 2013-08-12 03:53:02 mw1153 commonswiki: thumbnail failed on mw1153: error 1 "" from "'/usr/bin/convert' -quality 80 -background white -define jpeg:size=461x768 '/tmp/localcopy_9494dcdd2156-1.jpg' -thumbnail '461x768!' -depth 8 -sharpen '0x0.8' -rotate -0 '/tmp/transform_b8b7c6f4f187-1.jpg' 2>&1" [03:53:30] 2013-08-12 03:53:02 mw1153 commonswiki: Removing bad 0-byte thumbnail "/tmp/transform_b8b7c6f4f187-1.jpg". unlink() succeeded [03:53:30] 2013-08-12 03:53:02 mw1153 commonswiki: thumbnail failed on mw1153: error 1 "" from "'/usr/bin/convert' -quality 80 -background white -define jpeg:size=500x627 '/tmp/localcopy_802df53317bc-1.jpeg' -thumbnail '500x627!' -depth 8 -sharpen '0x0.8' -rotate -0 '/tmp/transform_0498c5460b8a-1.jpeg' 2>&1" [03:53:30] 2013-08-12 03:53:02 mw1153 commonswiki: Removing bad 0-byte thumbnail "/tmp/transform_0498c5460b8a-1.jpeg". unlink() succeeded [03:53:35] well i'm going to reload imagescalers on mw1153 [03:53:44] because hitting witha hammer usually works [03:53:50] Elsie: anyway, i'm pretty sure it's some sort of multiplier [03:54:07] !log restarting apache2 on mw1153 [03:54:11] LeslieCarr: is there a disk full? [03:54:17] Logged the message, Mistress of the network gear. [03:54:17] inode or space [03:54:20] nope [03:54:21] very unfull [03:54:32] hrmmm [03:54:34] 9% is the owrst disk space and 5% is the worst inodes [03:54:34] tmp and sda1 < 10% [03:54:51] so, y u 0 bytes? [03:55:43] LeslieCarr: finished booting? [03:55:50] finished restarting apache [03:57:24] and now dist-upgrading for good measure [03:58:09] so, what's mw1153 look like now ? [03:58:52] hrmmmmmmmm [03:59:05] so... i now still see some broken images on [[special:newfiles]] [03:59:17] but, they're not longer matching to 500s in my console [03:59:27] Purge thumbs etc? [03:59:50] Reedy: ? [04:00:09] make sure they're not cached for some stupid reason [04:00:28] is that level of thumbnail failed normal ? 
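On the Y-axis question above, the "m" is read as milli, i.e. a rate below one event per second ("don't quote me!"). A quick sketch of the arithmetic behind that reading, not tied to Graphite's or gdash's actual rendering code: a raw 5xx count over a time window becomes a per-second rate, and an SI prefix is what keeps a 0.03/s rate readable as 30m. The sample numbers are made up for illustration.

    #!/usr/bin/env python3
    """Convert an error count over a window into a per-second rate and print it
    with an SI prefix, mirroring how a 0.03/s rate would show up as '30m'."""

    def si_rate(count, window_seconds):
        rate = count / float(window_seconds)
        for factor, suffix in ((1.0, ""), (1e-3, "m"), (1e-6, "u")):
            if rate >= factor or factor == 1e-6:
                return "%.3g%s" % (rate / factor, suffix)

    if __name__ == "__main__":
        # e.g. 18 thumbnail 500s seen over a 10-minute window
        print(si_rate(18, 600))   # -> "30m", i.e. 0.03 errors per second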
[04:00:44] ugh, the dangers of viewing [[special:newfiles]] [04:00:47] didn't want to see that [04:01:11] yup, we just had the same thing ;) [04:01:13] hahaha [04:01:53] Reedy: well i right click it and there's an extra option that's not there for a normal image. "reload image". if i do that then it loads fine [04:02:00] and instantly [04:02:41] and no, not normal. certainly not at night [04:03:25] i've debugged this in the past with apergos. it's normal (at least once broken boxes are fixed) to have no errors for long enough that you give up trying to find them [04:04:30] ok, so it looks like no more issues ? [04:04:58] LeslieCarr: can you tell if those graphs exist in graphite? [04:05:11] 12 03:50:16 < jeremyb> i was hoping we'd have 500 or at least 5xx stats for the cluster or individual hosts [04:05:15] 12 03:50:27 < jeremyb> i couldn't find it unless it's in graphite where i can't see it [04:05:16] not sure [04:05:18] 12 03:50:53 < jeremyb> maybe a https://gdash.wikimedia.org/dashboards/reqerror/ limited to image scaling [04:05:22] graphite is confusing [04:05:25] and i have to run [04:05:25] because i can't see graphite [04:05:28] yeah [04:05:29] ok [04:05:32] sorry [04:07:07] (i'm still getting these visible errors but not anything i can tie back to a 50x) [04:08:55] (thanks for fixing it though!) [04:38:09] damn you puppet [04:40:06] * paravoid was just bitten hard by http://projects.puppetlabs.com/issues/14518 [04:40:45] paravoid: i may have found a good case of something that might be in graphite but not gdash (so i can't see the graphs or even easily know if it exists). see above :) [04:43:17] hrmmm, no activity on that bug in over a year. did no one else get bitten? [04:58:56] (03PS1) 10Faidon: ceph: add ensure param to ceph::key [operations/puppet] - 10https://gerrit.wikimedia.org/r/78791 [04:58:57] (03PS1) 10Faidon: ceph: add ceph::nagios class [operations/puppet] - 10https://gerrit.wikimedia.org/r/78792 [04:58:58] (03PS1) 10Faidon: Re-enable LVS check for ms-fe.svc.eqiad.wmnet [operations/puppet] - 10https://gerrit.wikimedia.org/r/78793 [04:59:32] (03CR) 10Faidon: [C: 032] "Tested with puppet apply." [operations/puppet] - 10https://gerrit.wikimedia.org/r/78791 (owner: 10Faidon) [05:01:14] (03PS2) 10Faidon: Re-enable LVS check for ms-fe.svc.eqiad.wmnet [operations/puppet] - 10https://gerrit.wikimedia.org/r/78793 [05:01:31] poor gerrit [05:01:56] (03CR) 10Faidon: [C: 032 V: 032] Re-enable LVS check for ms-fe.svc.eqiad.wmnet [operations/puppet] - 10https://gerrit.wikimedia.org/r/78793 (owner: 10Faidon) [05:07:51] (03PS2) 10Faidon: ceph: add ceph::nagios class [operations/puppet] - 10https://gerrit.wikimedia.org/r/78792 [05:07:52] (03PS2) 10Faidon: ceph: add ensure param to ceph::key [operations/puppet] - 10https://gerrit.wikimedia.org/r/78791 [05:10:50] !log reedy synchronized php-1.22wmf12/extensions/WikimediaMaintenance/addToSites.php [05:11:01] Logged the message, Master [05:11:13] (03CR) 10Faidon: [C: 032 V: 032] ceph: add ensure param to ceph::key [operations/puppet] - 10https://gerrit.wikimedia.org/r/78791 (owner: 10Faidon) [05:11:27] is this a bad day to work? [05:11:31] multiple different outages [05:11:35] a puppet bug [05:11:51] gerrit is slow [05:11:54] Take the rest of the day off [05:12:04] We noticed that here [05:12:13] maganese looks pretty idle though [05:12:42] (03CR) 10Faidon: [C: 032] "Tested." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/78792 (owner: 10Faidon) [05:13:30] 100% CPU, doesn't look so idle [05:17:33] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Miscellaneous+eqiad&h=manganese.wikimedia.org&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [05:17:37] Ganglia lies then [05:18:16] no, ganglia divides the % with the CPUs [05:18:21] so it's a real percentage, rather than going > 100 [05:18:49] manganese's cpu is ~130% right now [05:19:01] it has 8 cpus, which makes ganglia report it as 16% [05:19:28] That's slightly irritating [05:20:09] at least links to viwikivoyage are appearing on frwikivoyagre [05:20:30] btw, regarding your RTs about lucene for the two new wikis [05:20:50] I think I may have seen some comments that suggested that new wikis are going to be CirrusSearch'ed now [05:21:15] orly? [05:23:35] https://gerrit.wikimedia.org/r/#/c/78083/2/wmf-config/CommonSettings.php [05:23:47] # New wikis are special and get Cirrus :) [05:24:00] that's what I remembered [05:26:03] Ah, but that's not merged yet [05:27:02] Which will come first... That patch being fixed or users of the new wikis complaining? :D [05:37:22] PROBLEM - Ceph on ms-fe1004 is CRITICAL: Ceph [05:38:02] PROBLEM - Ceph on ms-fe1001 is CRITICAL: Ceph [05:38:12] PROBLEM - Ceph on ms-fe1003 is CRITICAL: Ceph [05:55:00] hm [06:01:02] RECOVERY - Ceph on ms-fe1001 is OK: Ceph HEALTH_OK [06:01:22] (03PS1) 10Faidon: ceph: actually user ceph::key's owner/group/mode [operations/puppet] - 10https://gerrit.wikimedia.org/r/78794 [06:02:23] (03CR) 10Faidon: [C: 032] ceph: actually user ceph::key's owner/group/mode [operations/puppet] - 10https://gerrit.wikimedia.org/r/78794 (owner: 10Faidon) [06:11:12] RECOVERY - Ceph on ms-fe1003 is OK: Ceph HEALTH_OK [06:11:22] RECOVERY - Ceph on ms-fe1004 is OK: Ceph HEALTH_OK [06:14:02] PROBLEM - Ceph on ms-fe1001 is CRITICAL: Ceph HEALTH_WARN noup flag(s) set [06:14:12] PROBLEM - Ceph on ms-fe1003 is CRITICAL: Ceph HEALTH_WARN noup flag(s) set [06:14:22] PROBLEM - Ceph on ms-fe1004 is CRITICAL: Ceph HEALTH_WARN noup flag(s) set [06:14:26] hm, takes a while [06:15:02] RECOVERY - Ceph on ms-fe1001 is OK: Ceph HEALTH_OK [06:15:12] RECOVERY - Ceph on ms-fe1003 is OK: Ceph HEALTH_OK [06:15:22] RECOVERY - Ceph on ms-fe1004 is OK: Ceph HEALTH_OK [06:24:12] (03PS1) 10Faidon: ceph: fix sysctl invocations [operations/puppet] - 10https://gerrit.wikimedia.org/r/78796 [06:26:16] (03CR) 10Faidon: [C: 032] ceph: fix sysctl invocations [operations/puppet] - 10https://gerrit.wikimedia.org/r/78796 (owner: 10Faidon) [06:27:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:28:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.160 second response time [06:33:42] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:34:42] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [06:37:41] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:40:32] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [06:54:03] (03PS1) 10Faidon: uninstall os-prober from all machines [operations/puppet] - 10https://gerrit.wikimedia.org/r/78797 [06:55:01] PROBLEM - Puppet freshness on db9 is CRITICAL: No successful Puppet run in the last 10 hours [06:58:22] (03CR) 10Faidon: [C: 032] uninstall os-prober from all machines [operations/puppet] - 
10https://gerrit.wikimedia.org/r/78797 (owner: 10Faidon) [07:16:13] (03PS3) 10TTO: Continuing to clean up InitialiseSettings.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78637 [07:16:21] PROBLEM - SSH on searchidx1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:18:11] RECOVERY - SSH on searchidx1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [07:18:27] (03CR) 10jenkins-bot: [V: 04-1] Continuing to clean up InitialiseSettings.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78637 (owner: 10TTO) [07:27:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:28:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.153 second response time [07:31:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:31:49] (03CR) 10Edenhill: [C: 031] Link dynamically with librdkafka [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/78780 (owner: 10Faidon) [07:32:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [07:37:26] (03CR) 10Faidon: [C: 032 V: 032] Link dynamically with librdkafka [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/78780 (owner: 10Faidon) [07:49:31] Ryan_Lane: heya [07:49:41] paravoid: howdy [07:49:46] (03CR) 10Edenhill: [C: 031] "(5 comments)" [operations/software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/78782 (owner: 10Faidon) [07:50:06] Ryan_Lane: ferm was merged last week [07:50:09] jfyi ;) [07:50:27] awesome :) [07:50:37] I'll need to switch to that for the openstack manifests [07:50:45] yep [07:50:51] I'd be happy to review [07:50:55] feel free to add me as a reviewer [07:57:55] !log krinkle synchronized wmf-config/interwiki.cdb 'Updating interwiki cache' [07:58:06] Logged the message, Master [07:58:39] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:00:39] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [08:00:47] !log Ran updateiwcache to fix issue with viwikivoyage interwiki links resolving to $lang.wikipedia.org instead of wikivoyage [08:00:58] Logged the message, Master [08:03:39] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:07:29] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [08:29:39] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:35:39] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [08:43:39] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:50:29] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [08:53:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:54:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [09:09:38] addshore: sure, when I do more puppet stuff :) [09:09:40] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:09:42] let me add for the current two things [09:13:39] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [09:14:30] YuviPanda_zz: you took about 3 days to reply xD thats rather 
funny :P [09:14:38] addshore: well, hong kong :P [09:14:43] xD [09:15:24] done [09:17:23] (03PS4) 10TTO: Continuing to clean up InitialiseSettings.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78637 [09:18:27] :> [09:18:30] cheers! [09:23:40] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:31:29] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [09:38:53] (03PS2) 10Andrew Bogott: Add VIPS / TIFF packages to toollabs exec_environ [operations/puppet] - 10https://gerrit.wikimedia.org/r/78629 (owner: 10Yuvipanda) [09:40:36] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:45:32] (03CR) 10Andrew Bogott: [C: 032] Add VIPS / TIFF packages to toollabs exec_environ [operations/puppet] - 10https://gerrit.wikimedia.org/r/78629 (owner: 10Yuvipanda) [09:52:50] !log stopping etherpad-lite to load old etherpad data [09:53:01] Logged the message, Master [10:11:16] PROBLEM - NTP on pdf2 is CRITICAL: NTP CRITICAL: No response from NTP server [10:13:16] RECOVERY - NTP on pdf2 is OK: NTP OK: Offset -0.001949667931 secs [10:13:26] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [10:19:36] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:23:06] checking out pdf2 [10:23:47] grrr, unresponsive on console [10:24:38] !log rebooting pdf2 [10:24:44] woo rebooting from the airport! [10:24:49] Logged the message, Mistress of the network gear. [10:26:40] hrm [10:26:44] this is taking a while to reboot [10:26:57] haha as soon as i type that it gets going [10:29:26] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [10:31:16] PROBLEM - NTP on pdf2 is CRITICAL: NTP CRITICAL: Offset unknown [10:32:16] RECOVERY - NTP on pdf2 is OK: NTP OK: Offset -0.001126885414 secs [10:32:59] !log dist-upgrading pdf2 [10:33:10] Logged the message, Mistress of the network gear. [10:33:57] !log uninstalling wpasupplicant from pdf2 [10:33:59] because wtf ? [10:34:08] Logged the message, Mistress of the network gear. [10:36:01] !log removing wirelesstools from pdf2 [10:36:02] also wtf [10:36:12] Logged the message, Mistress of the network gear. [10:36:59] !log removing wirelesstools from pdf1 [10:37:08] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:37:09] Logged the message, Mistress of the network gear. [10:37:15] !log removing wpasupplicant from pdf1 [10:37:26] Logged the message, Mistress of the network gear. [10:37:29] does anyone know how those got there ? [10:38:28] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [10:51:08] LeslieCarr: probably because the pediapress guys had root on those in the past [10:51:58] hehe [10:52:01] probably :) [10:53:40] I remember one time Tim went to try fix something on those boxes and came back with nopenopenope [11:05:03] i'm out! pdf machines better not die again [11:05:08] or regrow spasupplicant [11:05:14] !log removed wpasupplicant from pdf3 [11:05:24] Logged the message, Mistress of the network gear. 
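The wpasupplicant and wireless-tools packages turning up on the pdf hosts prompt the "does anyone know how those got there?" question above. A hedged sketch of a one-off audit for stray packages like these; the package list is illustrative, and the check shells out to dpkg-query, the standard Debian/Ubuntu way to ask whether a package is installed.

    #!/usr/bin/env python3
    """Report whether unexpected packages are installed on this host.
    Package choices below are illustrative, not a Wikimedia standard."""
    import subprocess

    UNWANTED = ["wpasupplicant", "wireless-tools", "os-prober"]

    def installed(package):
        # dpkg-query prints e.g. "install ok installed" for installed packages
        # and exits non-zero for ones it has never heard of.
        try:
            out = subprocess.check_output(
                ["dpkg-query", "-W", "-f=${Status}", package],
                stderr=subprocess.DEVNULL)
        except subprocess.CalledProcessError:
            return False
        return b"installed" in out and b"not-installed" not in out

    if __name__ == "__main__":
        for pkg in UNWANTED:
            print("%-16s %s" % (pkg, "INSTALLED" if installed(pkg) else "absent"))

The longer-term fix is of course to manage the removal in puppet, which is exactly what the os-prober change above (78797) does for one of these.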
[11:18:19] PROBLEM - NTP on pdf3 is CRITICAL: NTP CRITICAL: No response from NTP server [11:28:02] (03CR) 10Akosiaris: [C: 032] Introducing bacula module [operations/puppet] - 10https://gerrit.wikimedia.org/r/70840 (owner: 10Akosiaris) [11:38:12] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [11:38:12] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [11:38:12] PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours [11:38:12] PROBLEM - Puppet freshness on pdf3 is CRITICAL: No successful Puppet run in the last 10 hours [11:38:12] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours [11:38:13] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [12:09:12] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [12:09:12] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [12:09:12] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [12:09:12] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [12:09:12] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [12:09:13] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [12:25:12] PROBLEM - Puppet freshness on mchenry is CRITICAL: No successful Puppet run in the last 10 hours [13:27:01] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:27:51] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [13:53:13] ottomata: hey [13:53:19] ottomata: I have preliminary packages for varnishkafka [13:53:55] (I also just added bblack as a reviewer to Snaps' changesets, among us he's surely a much better C coder than I am :) [13:54:26] someone needs to start working on the varnishkafka puppet bits, should be easy [13:54:42] although we need to fix the init script first I guess [13:55:08] ottomata: what are your plans re: kafka 0.8 in prod? [13:55:39] this is kind of a blocker for varnishkafka performance testing, labs isn't really well suited for that [13:55:52] oh hm, well [13:56:09] if LeslieCarr is around this week, then we are going to 'repave' the analytics hadoop/kafka stuff together [13:56:16] ok [13:56:26] that will get us 2 0.8 kafka brokers up [13:56:47] but, since we still don't know exactly how we are going to consume into hdfs yet [13:56:57] using 0.8 means we won't really be able to [13:57:06] but hm [13:57:07] actually [13:57:10] what's the blocker? [13:57:17] you just need 0.8 brokers to send to? [13:57:21] I guess :) [13:57:38] i could probably fire up a couple on 2 of the cisco boxes [13:57:45] I think the plan was to just put varnishkafka on e.g. 
a varnish prod box [13:57:51] like one of the mobile ones [13:57:53] aye [13:57:57] yeah [13:57:58] hm [13:58:08] and leave it like that for a while, see how it does [13:58:16] aye ok [13:58:23] hm [13:58:44] I'll talk to Diederik today to see if we can basically just shut down Kraken this week [13:58:44] I guess we can wait until all the repaving work is done, but it'd be nice to get some early feedback for Snaps [13:58:47] til we get this worked out [13:58:48] yeah [13:59:02] if we can repave the kafka bit and don't mind the shut down (I think we can) [13:59:18] then the easiest thing to do woudl be just to start up 0.8 on the two kafka prod machines [13:59:28] k [13:59:30] how far is camus? [13:59:40] I have no idea what that entails [13:59:43] i'm still just playing with it, its a little young, requires coding to get right [13:59:44] its not hard [13:59:52] but, my java fu is lacking [14:00:06] gonna see if I can get qchris to help me out today with that [14:00:20] hello qchris! :) [14:00:34] what kind of coding? [14:00:37] Hi ottomata [14:01:09] basically, just implmenting our own etl/MapReduce subclass [14:01:19] the main bit is extracting the timestamp out of the dtta [14:01:20] data [14:01:29] so it depends on the final format of the data from kafka [14:01:36] i'd like to have just a 'timestamp' field in json [14:01:38] that woudl be best [14:01:46] right now varnishkafka is not using our usual timestamp format [14:02:00] oo, which reminds me, I need to ask Snaps to fix that [14:02:48] Camus is good, but not well documented and very young [14:03:06] its been a lot of code reading for me, and I don't yet fully grasp what exactly should be done [14:03:10] ottomata: Sure I can try to help you with that [14:03:26] ottomata: But my camus experience == 0 [14:03:38] yea, i'm only a 0.2 [14:03:44] so we'll do it together [14:04:00] :-) [14:04:50] Just let me know what/when/how/... I can do. [14:08:36] k, gimme 30 mins to get through emails and get back into it [14:57:16] (03PS1) 10Ottomata: Adding custom Zookeeper chroot support. [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/78816 [14:58:41] (03CR) 10Ottomata: [C: 032 V: 032] Adding custom Zookeeper chroot support. [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/78816 (owner: 10Ottomata) [15:06:56] ottomata: okay, specify what you need! [15:10:37] (03PS9) 10Ottomata: Adding role/analytics/kafka.pp Also adding modules/kafka [operations/puppet] - 10https://gerrit.wikimedia.org/r/77971 [15:12:12] Snaps, our timestamps are of the form: [15:12:12] 2013-08-11T06:26:58.123 [15:12:46] YYYY-mm-ddTHH:MM:SS(.ms0 [15:12:47] ) [15:13:04] that's iso 8601, isn't it? [15:13:08] ja think so [15:15:35] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:16:35] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [15:16:49] (03PS10) 10Ottomata: Adding role/analytics/kafka.pp Also adding modules/kafka [operations/puppet] - 10https://gerrit.wikimedia.org/r/77971 [15:21:25] paravoid, here's a quick kafka q [15:21:39] we'll have different kafka clusters in each datacenter, right? 
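The Camus work sketched above reduces to one concrete step: pull a timestamp out of each JSON record coming off Kafka, ideally from a plain "timestamp" field in the ISO 8601 shape quoted at 15:12 (2013-08-11T06:26:58.123). The snippet below shows just that extraction in isolation; the field name and sample record are assumptions, and the real thing would be the Java etl/MapReduce subclass mentioned above, not a Python script.

    #!/usr/bin/env python3
    """Parse the 'timestamp' field of a JSON log record into epoch milliseconds,
    the piece a Camus-style decoder needs in order to partition data by time.
    Field name and sample record are illustrative."""
    import json
    from datetime import datetime, timezone

    def extract_ts_millis(raw_record):
        record = json.loads(raw_record)
        ts = record["timestamp"]                 # e.g. "2013-08-11T06:26:58.123"
        dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f")
        dt = dt.replace(tzinfo=timezone.utc)     # assume UTC, as the logs above are
        return int(dt.timestamp() * 1000)

    if __name__ == "__main__":
        sample = '{"timestamp": "2013-08-11T06:26:58.123", "uri": "/wiki/Main_Page"}'
        print(extract_ts_millis(sample))         # -> 1376202418123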
[15:21:59] the broker nodes we have right now are in eqiad [15:22:08] so, the eqiad varnishes could produce directly to them [15:22:09] OR [15:22:21] (03CR) 10jenkins-bot: [V: 04-1] Adding role/analytics/kafka.pp Also adding modules/kafka [operations/puppet] - 10https://gerrit.wikimedia.org/r/77971 (owner: 10Ottomata) [15:22:24] they could be used for aggregation as the analytics boxes [15:22:38] and there could be separate brokers for eqiad for varnishes to produce to [15:25:03] why would we have two clusters? [15:25:16] PROBLEM - Etherpad HTTP on hooper is CRITICAL: Connection refused [15:26:09] in eqiad? [15:26:20] yes [15:26:35] not sure really, the only pro is maybe that then the analytics cluster is only an aggregator [15:26:37] from all the datacenters [15:26:56] I don't see the point but I don't know the arch that well [15:27:15] RECOVERY - Etherpad HTTP on hooper is OK: HTTP OK: HTTP/1.1 302 Found - 338 bytes in 0.080 second response time [15:27:22] there might not be a point really, it might be fine for eqiad varnishes to produce to analytics cluster [15:27:32] i'm just pointing out that there are 2 optoins on what to do in eqiad [15:32:06] bahhhhh hahah [15:32:06] https://github.com/philipl/pifs [15:35:47] gerrit seems not working [15:35:49] can't pull [15:35:53] or clone [15:36:13] HMMM [15:36:15] yes I can [15:36:15] hm [15:36:18] maybe not from labs? [15:54:22] (03PS11) 10Ottomata: Adding role/analytics/kafka.pp Also adding modules/kafka [operations/puppet] - 10https://gerrit.wikimedia.org/r/77971 [15:59:41] (03CR) 10jenkins-bot: [V: 04-1] Adding role/analytics/kafka.pp Also adding modules/kafka [operations/puppet] - 10https://gerrit.wikimedia.org/r/77971 (owner: 10Ottomata) [15:59:42] (03PS1) 10Jgreen: reenable otrs GenericAgent.pm [operations/puppet] - 10https://gerrit.wikimedia.org/r/78819 [16:05:02] (03CR) 10jenkins-bot: [V: 04-1] reenable otrs GenericAgent.pm [operations/puppet] - 10https://gerrit.wikimedia.org/r/78819 (owner: 10Jgreen) [16:07:00] ottomata: if I exposed the strftime formatting (allow any time format to be specified), would that be good? [16:07:34] %{!strfime:%Y-%m-%dT%...}t [16:10:18] (03CR) 10Jgreen: [C: 032 V: 032] reenable otrs GenericAgent.pm [operations/puppet] - 10https://gerrit.wikimedia.org/r/78819 (owner: 10Jgreen) [16:14:16] (03PS2) 10Petr Onderka: Deleting deleted pages [operations/dumps/incremental] (gsoc) - 10https://gerrit.wikimedia.org/r/78393 [16:14:52] oh my. puppet-merge looked pretty scary [16:14:58] (03CR) 10Petr Onderka: [C: 032 V: 032] Deleting deleted pages [operations/dumps/incremental] (gsoc) - 10https://gerrit.wikimedia.org/r/78393 (owner: 10Petr Onderka) [16:20:24] (03PS1) 10Jgreen: stupid jenkins-bot. [operations/puppet] - 10https://gerrit.wikimedia.org/r/78820 [16:21:46] is anyone working on the gerrit issue? [16:21:59] RECOVERY - Puppet freshness on mchenry is OK: puppet ran at Mon Aug 12 16:21:53 UTC 2013 [16:22:22] me [16:22:43] Jeff_Green: it is working better know [16:22:52] ah good. i've got the [publish & submit] button grayed out. any ideas? [16:23:28] https://gerrit.wikimedia.org/r/#/c/78820 [16:23:29] it had 5 stuck commands in queue... i killed one and it seems to move on with the rest but the other 4 are still running [16:23:32] (03CR) 10Jgreen: [C: 032 V: 032] stupid jenkins-bot. [operations/puppet] - 10https://gerrit.wikimedia.org/r/78820 (owner: 10Jgreen) [16:23:44] i think i will killed them all and see what happens [16:25:07] oddly it says "review in progress" even after jenkins-bot has verified. 
is that normal? [16:26:25] well it was a stuck state [16:26:35] some minutes ago i couldn't even pull [16:26:52] i see [16:27:22] (03Abandoned) 10Jgreen: stupid jenkins-bot. [operations/puppet] - 10https://gerrit.wikimedia.org/r/78820 (owner: 10Jgreen) [16:27:49] try once more... [16:28:21] now my checkout is fucked. sigh. 20 minutes to remove a pound sign. :-) [16:29:35] lol [16:32:08] git review looks like it's going to time out [16:32:24] (03PS1) 10Jgreen: redo the fantastic removal of a single pound sign [operations/puppet] - 10https://gerrit.wikimedia.org/r/78822 [16:32:29] miracle [16:34:20] lol [16:42:25] (03PS1) 10Akosiaris: Rewrite etherpad to etherpad-lite URLs for compat [operations/puppet] - 10https://gerrit.wikimedia.org/r/78823 [16:43:12] * AaronSchulz reads about the outage [16:45:11] Snaps, yeah taht would be awesome [16:46:16] (03CR) 10Akosiaris: [C: 032] Rewrite etherpad to etherpad-lite URLs for compat [operations/puppet] - 10https://gerrit.wikimedia.org/r/78823 (owner: 10Akosiaris) [16:49:28] (03CR) 10Jgreen: [C: 032 V: 031] redo the fantastic removal of a single pound sign [operations/puppet] - 10https://gerrit.wikimedia.org/r/78822 (owner: 10Jgreen) [16:51:35] PROBLEM - HTTP on zirconium is CRITICAL: Connection refused [16:51:55] akosiaris: that's you [16:52:01] i know [16:54:22] (03PS1) 10Akosiaris: Just the old hostname for etherpad [operations/puppet] - 10https://gerrit.wikimedia.org/r/78825 [16:55:45] PROBLEM - Puppet freshness on db9 is CRITICAL: No successful Puppet run in the last 10 hours [16:59:45] RECOVERY - Puppet freshness on db9 is OK: puppet ran at Mon Aug 12 16:59:41 UTC 2013 [16:59:56] (03CR) 10Akosiaris: [C: 032] Just the old hostname for etherpad [operations/puppet] - 10https://gerrit.wikimedia.org/r/78825 (owner: 10Akosiaris) [17:08:35] RECOVERY - HTTP on zirconium is OK: HTTP OK: HTTP/1.1 200 OK - 208455 bytes in 0.053 second response time [17:13:24] gah! gerrit is stupid slow again. [17:13:28] <^d> Yeah. [17:13:30] <^d> I just noticed. [17:13:32] <^d> Looking. [17:16:05] yeah... I issue some kill commands to some gerrit jobs that seemed stuck some 30 minutes ago and it got better but it's crappy again [17:18:54] known issue about the ssl cert mismatch on epl.wikimedia.org right now [17:18:57] ? [17:19:19] (it's the cert for *.planet.wikimedia) [17:19:33] also, first things first... [17:19:49] * greg-g goes to delete rsa keys that are associated with his dell XPS [17:20:01] greg-g: fixed [17:20:09] akosiaris: awesome sauce [17:21:54] <^d> !log gerrit restarting [17:21:59] <^d> *sigh* [17:22:05] Logged the message, Master [17:24:47] ^d: i though i could avoid that... Seems i was wrong. Any idea what happened ? [17:25:48] <^d> It's been freaking out for about the last hour or so complaining about database connections in various ways. [17:25:58] <^d> Concurrency exceptions, "server went away" crap, so forth. [17:26:27] :-( [17:26:28] <^d> We had this problem awhile ago, but we never found a root cause and it kinda just Went Away. Might have to devote some time this week to figuring it out. [17:26:45] ^d: thanks for making it better for now. 
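Returning to the varnishkafka timestamp thread above (15:12 and 16:07): exposing the strftime format would let the producer emit the standard layout directly, but C strftime stops at whole seconds, so the milliseconds have to be appended outside the format string. The sketch below uses the obvious candidate format; it illustrates the layout only and is not necessarily what varnishkafka ended up shipping.

    #!/usr/bin/env python3
    """Show the strftime layout matching the timestamp format quoted above,
    with milliseconds appended separately since strftime itself has no
    sub-second directive."""
    import time

    ISO_BASIC = "%Y-%m-%dT%H:%M:%S"   # candidate for a %{...}t-style strftime option

    def iso8601_millis(epoch=None):
        if epoch is None:
            epoch = time.time()
        millis = int(round(epoch * 1000)) % 1000
        return time.strftime(ISO_BASIC, time.gmtime(epoch)) + ".%03d" % millis

    if __name__ == "__main__":
        print(iso8601_millis(1376202418.123))   # -> 2013-08-11T06:26:58.123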
[17:27:02] <^d> yw [17:40:40] (03PS1) 10Akosiaris: OLD etherpad install to etherpad-old.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/78832 [17:42:50] (03CR) 10Akosiaris: [C: 032] OLD etherpad install to etherpad-old.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/78832 (owner: 10Akosiaris) [17:54:59] <^d> cmjohnson1: Ping [17:58:34] ^d what's up [18:00:17] <^d> So, testsearch1001 is a little funky. paravoid and I did some debugging last week, and he seemed to think it was either hardware or power settings. [18:00:31] <^d> Compare http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&c=Miscellaneous+eqiad&h=testsearch1002.eqiad.wmnet&tab=m&vn=&mc=2&z=small&metric_group=ALLGROUPS and http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&c=Miscellaneous+eqiad&h=testsearch1001.eqiad.wmnet&tab=m&vn=&mc=2&z=small&metric_group=ALLGROUPS [18:02:18] ^d it's possible it's a power setting...because of the lack for redundant power in tampa the power settings were tinkered with...can I take them down to look? [18:02:38] that is where those boxes came from (tampa) [18:03:06] <^d> Do whatever you need, they're not serving anything yet :) [18:03:12] okay..cool [18:03:19] will let you k now...is there a ticket? [18:03:52] <^d> I believe we were using rt #5555 for this. [18:05:17] cmjohnson1: #5555 [18:06:44] paravoid: why do you think the the increase in cpu utilization is related to the rt5555...if that were the case we would see this problem with all of our newer servers [18:07:10] it looks like it's only r320s [18:09:18] the cpu kernel messages are happening on the r420's as well [18:14:14] (03PS1) 10Petr Onderka: Progress reporting on standard error stream [operations/dumps/incremental] (gsoc) - 10https://gerrit.wikimedia.org/r/78835 [18:16:55] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:17:55] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [18:20:00] !log powering down testsearch1001 and 1002 to compare settings [18:20:11] Logged the message, Master [18:26:05] PROBLEM - Host testsearch1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:27:50] paravoid: http://kafka.apache.org/08/ops.html#datacenters (re kafka broker clusters & datacenters) [18:28:35] PROBLEM - Host testsearch1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:28:53] (03CR) 10Akosiaris: "I have added rspec tests which fail at this point (mostly due to some dependencies and the ::monitor_service thingy). I have also done VM " [operations/puppet] - 10https://gerrit.wikimedia.org/r/77720 (owner: 10Akosiaris) [18:30:31] (03PS2) 10Petr Onderka: Progress reporting on standard error stream [operations/dumps/incremental] (gsoc) - 10https://gerrit.wikimedia.org/r/78835 [18:34:19] paravoid ^d so the issue is related to the system profile setup. tsearch1001 was set to OS leading to all the cpu messages and tsearch1002 was set to dapc. 
Other than the both cfg's were the same [18:36:30] parvoid: we can set it to custom and allow the dell controller to maintain the cpu utilization or set it to max performance as a h/w temporary fix [18:40:04] (03CR) 10Petr Onderka: [C: 032 V: 032] Progress reporting on standard error stream [operations/dumps/incremental] (gsoc) - 10https://gerrit.wikimedia.org/r/78835 (owner: 10Petr Onderka) [18:49:55] RECOVERY - Host testsearch1001 is UP: PING WARNING - Packet loss = 61%, RTA = 0.29 ms [18:50:36] RECOVERY - Host testsearch1002 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [19:08:07] mutante: I assume the thing with John for etherpad didn't happen? [19:28:02] (03PS1) 10Manybubbles: Turn on more default elasticsearch logging. [operations/puppet] - 10https://gerrit.wikimedia.org/r/78903 [20:31:11] <^d> cmjohnson1: For what little my opinion matters, I say give it a shot :) [20:31:35] <^d> I know zilch about what we're doing here though :) [20:33:28] ^d changing the setting to max performance will get the cpu power notification msgs to stop that you see in dmesg. I have demonstrated that already. However, that is not exactly we want either. I don't think that last weeks high cpu utilization rate is related cuz we would've seen it elsewhere unless there is someting funky about the R320 but again..would've seen it other servers [20:36:38] <^d> cmjohnson1: Makes sense. I'd say set it for now so we can silence that problem, and if any other problems remain we should still see issues. [20:37:10] ^ my thoughts as well [20:38:18] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours [20:38:44] heya paravoid, you there? [20:38:59] wanna know what you think about maybe adding another init script to kafka .deb [20:39:02] for mirroring [20:39:18] https://cwiki.apache.org/confluence/display/KAFKA/Kafka+mirroring+%28MirrorMaker%29 [20:49:52] (03CR) 10Demon: "(1 comment)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78083 (owner: 10Demon) [21:38:44] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [21:38:44] PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours [21:38:44] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [21:38:44] PROBLEM - Puppet freshness on pdf3 is CRITICAL: No successful Puppet run in the last 10 hours [21:38:44] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours [21:38:45] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [21:41:39] Hello! In case people didn't know, git.wikimedia.org is down. [21:43:00] yaron: http://lists.wikimedia.org/pipermail/wikimania-l/2013-August/005064.html ;-) [21:43:52] Ah - I guess it takes people a while to recuperate from Wikimania. :) [21:50:14] ^d is in the office, presumably looking into it. I know it isn't as simple as just "kick it and done" anymore :/ [22:05:44] <^d> I thought we disallowed all robots. 
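git.wikimedia.org (gitblit) falling over and the "I thought we disallowed all robots" reaction lead into change 78919 below, which disallows all indexing. As a generic, from-the-outside check that a host's robots.txt really turns well-behaved crawlers away, something like the following works with only the standard library; the host name is simply the one from the outage report above, and the check itself makes no claim about what the puppet patch does internally.

    #!/usr/bin/env python3
    """Check whether a site's robots.txt denies everything to generic crawlers."""
    import urllib.robotparser

    def crawling_disallowed(base_url):
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(base_url.rstrip("/") + "/robots.txt")
        rp.read()
        # If a generic user agent may not fetch the root, indexing is effectively off.
        return not rp.can_fetch("*", base_url + "/")

    if __name__ == "__main__":
        # Host from the outage report above.
        print(crawling_disallowed("https://git.wikimedia.org"))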
[22:09:12] <^d> Someone mind looking at a one-liner: https://gerrit.wikimedia.org/r/78919 [22:09:44] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [22:09:44] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [22:09:44] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [22:09:44] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [22:09:44] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [22:09:45] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [22:10:23] (03PS1) 10Demon: Disallowing all indexing for now [operations/puppet] - 10https://gerrit.wikimedia.org/r/78919 [22:19:41] <^d> cmjohnson1: testsearch1001 seems to be much happier now fwiw. [22:20:04] we'll see if it lasts [22:22:22] <^d> Yeah. Thanks for your help :) [22:24:33] <^d> binasher: Could you look at a one-liner for puppet for me? Same thing paravoid did before, but in puppet now so it won't get reverted. [22:25:00] url? [22:25:06] <^d> https://gerrit.wikimedia.org/r/#/c/78919/ [22:37:18] (03PS3) 10Demon: Redo search configuration [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78083 [22:37:22] (03CR) 10jenkins-bot: [V: 04-1] Redo search configuration [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78083 (owner: 10Demon) [22:40:33] (03PS4) 10Demon: Redo search configuration [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78083 [22:40:55] (03CR) 10Demon: "PS3 fixes problems, PS4 a rebase." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78083 (owner: 10Demon) [22:44:53] RECOVERY - mysqld processes on db1009 is OK: PROCS OK: 1 process with command name mysqld [22:55:59] (03CR) 10Aaron Schulz: [C: 031] Redo search configuration [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78083 (owner: 10Demon) [23:43:20] (03PS1) 10Ottomata: kafka.init - Using $DEFAULT in error message. [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/78923 [23:43:21] (03PS1) 10Ottomata: Adding mirror-maker and consumer-offset-checker to kafka bin script [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/78924 [23:43:38] (03CR) 10Ottomata: [C: 032 V: 032] kafka.init - Using $DEFAULT in error message. [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/78923 (owner: 10Ottomata) [23:44:48] (03CR) 10Ottomata: [C: 032 V: 032] Adding mirror-maker and consumer-offset-checker to kafka bin script [operations/debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/78924 (owner: 10Ottomata)
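The day closes with MirrorMaker being wired into the Kafka packaging (a mirror-maker entry in the bin script, plus the earlier question of a second init script) for aggregating per-datacenter clusters, per the Kafka 0.8 datacenters doc linked at 18:27. Purely as a conceptual illustration: MirrorMaker itself is a Java tool driven by consumer and producer config files, but its core loop is just "consume from the local cluster, republish to the aggregate cluster", which the sketch below mimics with the third-party kafka-python client. Broker addresses and the topic name are placeholders, not the production setup.

    #!/usr/bin/env python3
    """Toy illustration of what Kafka mirroring does: read every message from a
    source cluster and republish it, unchanged, to an aggregate cluster.
    This is NOT MirrorMaker; it only mimics its core loop.
    Requires the third-party kafka-python package; hosts and topic are placeholders."""
    from kafka import KafkaConsumer, KafkaProducer

    SOURCE_BROKERS = ["source-kafka1:9092"]        # e.g. a per-datacenter cluster
    AGGREGATE_BROKERS = ["aggregate-kafka1:9092"]  # e.g. the analytics cluster
    TOPIC = "webrequest"                           # placeholder topic name

    def mirror():
        consumer = KafkaConsumer(TOPIC,
                                 bootstrap_servers=SOURCE_BROKERS,
                                 group_id="mirror-sketch",
                                 auto_offset_reset="earliest")
        producer = KafkaProducer(bootstrap_servers=AGGREGATE_BROKERS)
        for message in consumer:
            # Republish the raw bytes; a real mirror also preserves keys/partitioning.
            producer.send(TOPIC, value=message.value, key=message.key)

    if __name__ == "__main__":
        mirror()

In production the real MirrorMaker shipped with Kafka (the entry added in 78924 above) would be run via the init script under discussion, not anything like this sketch.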