[00:05:54] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [00:22:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:23:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [00:26:04] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:30:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:31:54] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [00:32:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [00:32:44] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Thu Aug 8 00:32:40 UTC 2013 [00:32:54] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [00:52:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:53:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [00:56:04] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:06:16] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [01:08:46] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [01:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:23:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [01:32:46] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Thu Aug 8 01:32:38 UTC 2013 [01:33:16] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [01:36:31] 
PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:37:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [01:52:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:53:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [02:00:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:02:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [02:06:51] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [02:06:52] !log LocalisationUpdate completed (1.22wmf12) at Thu Aug 8 02:06:51 UTC 2013 [02:07:04] Logged the message, Master [02:11:38] (03CR) 10GWicke: "Ping!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/72653 (owner: 10Pyoungmeister) [02:16:52] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Aug 8 02:16:52 UTC 2013 [02:17:03] Logged the message, Master [02:22:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:23:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [02:33:21] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Thu Aug 8 02:33:11 UTC 2013 [02:33:51] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [02:49:11] PROBLEM - Puppet freshness on db9 is CRITICAL: No successful Puppet run in the last 10 hours [02:53:36] (03CR) 10MaxSem: "We might want to use https://github.com/lkarsten/libvmod-cookie here." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/75316 (owner: 10Mark Bergsma) [03:01:31] PROBLEM - Puppetmaste [03:08:27] eh, did icinga-wm just choke? [03:18:16] MaxSem: strange, checking [03:26:40] weird, the log line is complete [03:26:44] lemme check something [03:26:54] oh wait! [03:26:57] partition is full [03:27:05] heh [03:27:15] same thing as before [03:27:53] it doesn't monitor its own host? [03:28:54] MaxSem: i think there aren't a lot of disk use checks in general [03:29:25] jeremyb: is there an explicit list somewhere of "checks we should have but don't yet" [03:29:32] like, where should we add this one? :) [03:29:36] * MaxSem recalls that they actually helped not so long ago [03:30:13] https://wikitech.wikimedia.org/wiki/Projects#Basic_monitoring_.26_alerting [03:30:56] greg-g: idk, but there's https://rt.wikimedia.org/Ticket/Display.html?id=4728 :) [03:39:02] doesn't monitor own disk use [03:40:30] (03PS1) 10Lcarr: icinga needs logrotate [operations/puppet] - 10https://gerrit.wikimedia.org/r/78185 [03:41:36] (03CR) 10Lcarr: [C: 032] icinga needs logrotate [operations/puppet] - 10https://gerrit.wikimedia.org/r/78185 (owner: 10Lcarr) [03:47:12] also strangely enough our puppet logrotate doesn't actually rotate the puppet file [03:48:08] jeremyb: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=DISK [03:48:15] (03PS1) 10Lcarr: rotate puppet.log in logrotate [operations/puppet] - 10https://gerrit.wikimedia.org/r/78187 [03:48:31] mutante: Ryan_Lane ^^ ? [03:49:06] LeslieCarr: ah, has it been called /var/log/puppet but now its puppet.log ? 
[03:49:10] looks like it
[03:49:23] yeah
[03:49:50] (03CR) 10Dzahn: [C: 032] "makes sense, used to be /var/log/puppet and now changed to puppet.log it seems" [operations/puppet] - 10https://gerrit.wikimedia.org/r/78187 (owner: 10Lcarr)
[04:08:02] (03PS1) 10Lcarr: fixing icinga.log logrotate [operations/puppet] - 10https://gerrit.wikimedia.org/r/78189
[04:08:33] (03CR) 10Dzahn: [C: 032] fixing icinga.log logrotate [operations/puppet] - 10https://gerrit.wikimedia.org/r/78189 (owner: 10Lcarr)
[04:10:41] thanks mutante
[04:11:58] done on sockpuppet
[04:13:56] (03PS9) 10Dzahn: add virtual language subdomain redirects for wikidata [operations/apache-config] - 10https://gerrit.wikimedia.org/r/65443
[04:21:02] (03PS6) 10Dzahn: the existing etherpad-lite_1.0-wm2 package in a operations/debs repo for completeness [operations/debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/76654
[04:22:06] (03CR) 10Dzahn: "was supposed to be a patch file: https://gerrit.wikimedia.org/r/#/c/76654/6/debian/patches/log4js.js.patch" [operations/debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/76661 (owner: 10Dzahn)
[04:38:11] LeslieCarr: < wikibugs_> (NEW) please change mail FROM address to not use noc@ - https://bugzilla.wikimedia.org/52628 normal; Wikimedia: OTRS; ()
[04:38:24] so we don't get all the otrs mail
[04:38:58] yay
[04:39:00] *heart*
[04:39:16] andre_ told me to open it in component OTRS :)
[04:41:29] that looks annoying
[04:42:10] Oops.
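The logrotate fixes merged above (r78185, r78187, r78189) were prompted by icinga filling its own partition. As a rough illustration only — not the contents of those actual changes — a Puppet-managed logrotate rule for the offending logs could look like this; the file path, log locations, and weekly/4 rotation schedule are assumptions:

```puppet
# Hypothetical sketch, not the real Gerrit change. Install a logrotate
# rule so icinga.log and the renamed puppet.log stop filling the disk,
# as happened in the incident above. Paths and schedule are illustrative.
file { '/etc/logrotate.d/icinga':
  ensure  => present,
  owner   => 'root',
  group   => 'root',
  mode    => '0444',
  content => "/var/log/icinga/icinga.log /var/log/puppet.log {\n  weekly\n  rotate 4\n  compress\n  missingok\n  notifempty\n}\n",
}
```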
[04:42:56] can create any new alias, like otrs-admins@ if you like [04:57:04] hrm i think that there's some things that expect the old nagios.cmd file in nagios plugins being submitted by nsca [04:58:43] may need to revert that old change [05:00:11] (03PS1) 10Lcarr: Revert "Fixed the path to icinga cmd file all around +file" [operations/puppet] - 10https://gerrit.wikimedia.org/r/78200 [05:00:27] (03PS2) 10Lcarr: Revert "Fixed the path to icinga cmd file all around +file" [operations/puppet] - 10https://gerrit.wikimedia.org/r/78200 [05:02:01] (03CR) 10Lcarr: [C: 032] "Reverting for now - need to fix and repackage the nagios plugins" [operations/puppet] - 10https://gerrit.wikimedia.org/r/78200 (owner: 10Lcarr) [05:22:55] anybody around who can approve a new gerrit account? [05:25:39] How is the list on noc.wikimedia.org generated? I hope not by hand? [05:28:01] greg-g: Which list? [05:28:08] I think he means the conf/ index. [05:28:09] oh, /conf/ [05:28:14] I'm a mindreader. [05:28:16] Yes, it's manually [05:28:18] :) [05:28:27] Well, add a line to a file, run script to create symlinks [05:28:32] heh [05:28:35] gotcha [05:28:41] Tim and Krinkle discussed making that whole index page pointers to GitHub. [05:28:52] Or git.wm.o, I suppose, if we can keep that running for more than ten minutes. [05:28:53] I was just looking for our version of https://github.com/pediapress/mwlib.rl/blob/master/example-mwlib.config [05:29:04] We shell out to PediaPress (the company). [05:29:07] As I understand it. [05:29:19] Or maybe we have (also?) internal PDF servers? [05:29:21] I thought we had our own servers doing it [05:29:24] yeah [05:29:29] pdfXXXX [05:29:36] greg-g: Elsie: noc.wm.o/conf is mostly generated. 
The file contents are live from wmf-deployment, the html overview is generated based on a white list
[05:29:46] greg-g: that would be hiding away on the pdf boxes most likely
[05:29:51] it's all in version control, so look it up if you want to know more
[05:29:52] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=PDF%2520servers%2520pmtpa&tab=m&vn=
[05:29:55] which may or may not be a mess still
[05:29:55] Or perhaps in Puppet.
[05:30:04] wikitech.wm.o probably has docs about this.
[05:30:07] (Or really ought to.)
[05:30:25] https://github.com/wikimedia/operations-mediawiki-config/tree/master/docroot/noc/conf
[05:30:38] https://github.com/wikimedia/operations-mediawiki-config/blob/master/docroot/noc/conf/index.php
[05:31:11] https://github.com/wikimedia/operations-mediawiki-config/blob/master/docroot/noc/createTxtFileSymlinks.sh
[05:31:13] Krinkle: greg-g was investigating a PDF bug. noc.wm.o was a secondary question. ;-)
[05:31:20] Krinkle knows the symlinks.sh file well. ^_^
[05:33:39] but, like I said, sleep is a good idea
[05:33:41] g'night
[05:40:26] Elsie: yea, it's pediapress operated
[05:47:32] Elsie: /home/pp .. heiko@duck, volker@caramba .. and others
[05:50:20] Elsie: docs _would_ be here if they were http://simple.pediapress.com/wiki/RenderServerOperation
[05:50:35] been like that for months ..
[05:52:23] https://wikitech.wikimedia.org/wiki/Pdf1.wikimedia.org [06:00:39] test ircecho [06:00:45] yay [06:01:00] i'm a bot [06:08:54] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [06:29:15] (03CR) 10Lcarr: [C: 04-1] "still missing firewall rules" [operations/puppet] - 10https://gerrit.wikimedia.org/r/75777 (owner: 10Demon) [06:30:22] (03CR) 10Yuvipanda: [C: 04-1] "Needs a newer nginx-extras to work, hoping I can convince Kartik to work on it :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/78002 (owner: 10Yuvipanda) [06:32:55] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Thu Aug 8 06:32:47 UTC 2013 [06:33:54] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [07:05:45] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [07:32:25] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [07:32:25] PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours [07:32:25] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [07:32:25] PROBLEM - Puppet freshness on pdf3 is CRITICAL: No successful Puppet run in the last 10 hours [07:32:25] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours [07:32:26] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [07:33:05] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Thu Aug 8 07:32:59 UTC 2013 [07:33:45] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [07:45:45] Elsie: jeluf and elian are otrsadmin.. isn't that outdated? 
[07:46:04] mail alias that is [07:57:19] !log reedy synchronized docroot and w [07:57:30] Logged the message, Master [07:59:30] !log reedy synchronized wmf-config/ [07:59:41] Logged the message, Master [08:04:19] (03CR) 10Dzahn: [C: 032] "Rangilo, feel free to add me as reviewer next time you have changes." [operations/puppet] - 10https://gerrit.wikimedia.org/r/78009 (owner: 10Rangilo Gujarati) [08:04:40] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [08:04:40] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [08:04:40] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [08:04:40] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [08:04:40] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [08:04:41] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [08:05:22] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [08:08:44] (03CR) 10Dzahn: [C: 032] "http://docs.python-requests.org/en/latest/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/75632 (owner: 10Hashar) [08:12:57] (03CR) 10Dzahn: "ii python-requests 0.8.2-1" [operations/puppet] - 10https://gerrit.wikimedia.org/r/75632 (owner: 10Hashar) [08:15:05] (03CR) 10Dzahn: [C: 031] Add remaining skins (Cologne Blue and Modern) to sync script. [operations/puppet] - 10https://gerrit.wikimedia.org/r/69444 (owner: 10Mattflaschen) [08:16:20] let's block Googlebot from git.wikimedia ?! [08:16:41] https://gerrit.wikimedia.org/r/#/c/77909/1 [08:17:13] hmm. 
it's kind of down too
[08:20:27] !log attempted to restart gitblit on antimony
[08:20:37] Logged the message, Master
[08:20:44] !log git.wikimedia.org back
[08:20:55] Logged the message, Master
[08:24:44] mutante: I had suggested blocking the useragent for the zip urls
[08:25:08] looking at access.log , indeed it's pretty fast
[08:25:12] and full of Googlebot
[08:25:33] and java CPU usage high
[08:25:56] I'd say block it for now
[08:26:06] ^demon will fix it later
[08:26:11] (03CR) 10Dzahn: [C: 032] "indeed access.log is full of Googlebot" [operations/puppet] - 10https://gerrit.wikimedia.org/r/77909 (owner: 10Demon)
[08:26:18] alright, yeay, d
[08:36:14] (03CR) 10Reedy: [C: 032] cswikiquote: Add custom namespace for works [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78142 (owner: 10Danny B.)
[08:36:15] (03Merged) 10jenkins-bot: cswikiquote: Add custom namespace for works [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78142 (owner: 10Danny B.)
[08:37:25] Ryan_Lane:
[08:37:30] apergos: ?
[08:38:02] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Thu Aug 8 08:37:55 UTC 2013
[08:38:03] google supposedly has reduced their queries / sec and the number of parallel connections to prevent this
[08:38:14] til we get robots.txt fixed, which demon was gonna do today
[08:38:21] Ryan_Lane: ping?
[08:38:22] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours
[08:38:51] yuvipanda: ?
[08:39:01] apergos: ah, ok
[08:39:11] Ryan_Lane: I added KartikMistry to the project-proxy project, but he's still not able to ssh in to even bastion
[08:39:12] he has shell
[08:39:15] and the key works for gerrit
[08:39:18] is there more that needs to be done?
[08:39:26] what's the shell account name?
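The user-agent block merged just above (change 77909) lived in the git.wikimedia.org Apache template. A minimal sketch of one way to express that kind of block as a Puppet-managed Apache fragment — the file path, the choice of mod_rewrite, and the rule details are illustrative assumptions, not the contents of the real change:

```puppet
# Hypothetical sketch, not the actual git.wikimedia.org.erb change.
# Return 403 Forbidden to Googlebot on the expensive /zip/ archive URLs
# that were overloading gitblit, while leaving the rest of the site
# crawlable by other paths and agents.
file { '/etc/apache2/conf.d/gitblit-zip-block.conf':
  ensure  => present,
  content => "RewriteEngine On\nRewriteCond %{HTTP_USER_AGENT} Googlebot\nRewriteRule ^/zip/ - [F]\n",
}
```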
[08:39:28] if it's still falling over that's not good, I can let em know (but it might be a while before I get a response, wrong tz)
[08:39:44] otherwise we can *cough* ignore it til demon's fix later today
[08:39:47] Ryan_Lane: KartikMistry
[08:40:07] it's kartik
[08:40:09] ;)
[08:40:17] aaah
[08:40:21] so it is a username problem
[08:40:22] yuvipanda: FAIL
[08:40:45] yuvipanda: can the user try sshing in?
[08:41:00] (03PS2) 10Reedy: Set up flood flag for zhwikinews [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78018 (owner: 10TTO)
[08:41:06] (03CR) 10Reedy: [C: 032] Set up flood flag for zhwikinews [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78018 (owner: 10TTO)
[08:41:08] !log on overloaded antimony: stopped gitblit, ran puppet, blocked Googlebot, restarted apache and gitblit
[08:41:11] apergos:
[08:41:16] (03Merged) 10jenkins-bot: Set up flood flag for zhwikinews [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78018 (owner: 10TTO)
[08:41:18] Logged the message, Master
[08:41:19] it doesn't care so far :p
[08:41:29] Googlebot/2.1; +http://www.google.com/bot.html)"
[08:41:31] ok well lemme tell em we've blocked them temporarily
[08:41:35] Ryan_Lane: still doesn't work
[08:41:36] yuvipanda: yeah, using the wrong username
[08:41:41] Ryan_Lane: even with kartik@
[08:41:44] same error
[08:41:52] Invalid user KartikMistry from
[08:41:52] until demon can get his end fixed
[08:41:59] so, yeah, wrong username
[08:42:07] (03PS2) 10Reedy: (bug 52578) New user group 'botadmin' on ckb.wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78056 (owner: 10Dereckson)
[08:42:53] (03CR) 10Siebrand: "Added some platform engineers as in I3c156792." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/66339 (owner: 10Liangent)
[08:42:54] apergos: thanks, it seems to ignore the block
[08:42:56] now it's working
[08:43:02] Ryan_Lane: yeah, it is
[08:43:02] it does?
[08:43:04] I'm an idiot.
[08:43:10] er how did you block them?
[08:43:32] apergos: this is definitely in Apache https://gerrit.wikimedia.org/r/#/c/77909/1/templates/apache/sites/git.wikimedia.org.erb but access.log still ..
[08:43:39] (03CR) 10Reedy: [C: 032] (bug 52578) New user group 'botadmin' on ckb.wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78056 (owner: 10Dereckson)
[08:43:46] (03Merged) 10jenkins-bot: (bug 52578) New user group 'botadmin' on ckb.wikipedia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78056 (owner: 10Dereckson)
[08:43:47] apergos: oh.. Googlebot-Image/1.0 != Googlebot ?
[08:43:53] no
[08:44:00] googlebot is the one that got rate limited
[08:44:09] that should be enough, it's the /zip/ paths primarily that were the problem
[08:44:17] Googlebot-Image is the one still being very active
[08:44:21] ok
[08:44:25] it's minor compared to the rest
[08:44:34] alright
[08:44:35] I did a count from the logs yesterday
[08:44:42] ah:) good
[08:44:53] (03PS2) 10Reedy: Add autopatrol protection level for ckbwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/77701 (owner: 10TTO)
[08:44:56] (03CR) 10Reedy: [C: 032] Add autopatrol protection level for ckbwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/77701 (owner: 10TTO)
[08:45:16] (03Merged) 10jenkins-bot: Add autopatrol protection level for ckbwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/77701 (owner: 10TTO)
[08:45:29] please note it on the bz report so we don't forget
[08:46:09] java CPU usage isn't > 500% anymore though :)
[08:46:39] (03PS7) 10Reedy: Clean up headers in CommonSettings and InitialiseSettings [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/76342 (owner: 10TTO)
[08:46:51] (03CR) 10Reedy: [C: 032] Clean up headers in CommonSettings and InitialiseSettings [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/76342 (owner: 10TTO)
[08:46:56] I bet it isn't
[08:47:01] (03Merged) 10jenkins-bot: Clean up headers in CommonSettings and InitialiseSettings [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/76342 (owner: 10TTO)
[08:48:27] apergos: done
[08:53:09] !log reedy synchronized wmf-config/
[08:59:30] thanks
[09:02:42] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Thu Aug 8 09:02:37 UTC 2013
[09:03:22] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours
[09:10:59] (03PS1) 10Dzahn: fix atom.xml redirect on planet.wikimedia.org without language subdomain (bug 46721) [operations/puppet] - 10https://gerrit.wikimedia.org/r/78220
[09:12:46] (03PS2) 10Dzahn: fix atom.xml redirect on planet.wikimedia.org without language subdomain (bug 46721) [operations/puppet] - 10https://gerrit.wikimedia.org/r/78220
[09:13:25] (03CR) 10Dzahn: [C: 032] fix atom.xml redirect on planet.wikimedia.org without language subdomain (bug 46721) [operations/puppet] - 10https://gerrit.wikimedia.org/r/78220 (owner: 10Dzahn)
[09:23:38] apergos: is adding a (readable) robots.txt the long term solution for gitblit, or something else?
[09:23:54] the current status of the bugzilla ticket is not super-clear
[09:24:07] fixing the issue with the serving of the robots.txt file is the fix, yes
[09:24:24] at that point if we want to add a few other paths in there we can
[09:33:07] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Thu Aug 8 09:32:59 UTC 2013
[09:33:17] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours
[09:43:30] (03CR) 10Mark Bergsma: "No, I had looked at its code and it sucks. Large static arrays, linear searches, no sorting, etc.
But an improved/fixed up version of it c" [operations/puppet] - 10https://gerrit.wikimedia.org/r/75316 (owner: 10Mark Bergsma)
[09:48:30] (03CR) 10Mark Bergsma: [C: 032] Setup a misc services varnish cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/78090 (owner: 10Mark Bergsma)
[09:50:39] (03PS1) 10Mark Bergsma: Rename role::cache::ssl::text to ssl::unified [operations/puppet] - 10https://gerrit.wikimedia.org/r/78224
[09:55:09] apergos: should it be added to the bug summary?
[09:55:48] (03PS1) 10Mark Bergsma: Install cp1043/1044 as misc varnish servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/78225
[09:55:49] (03PS1) 10Mark Bergsma: Decommission cp1041-1044 [operations/puppet] - 10https://gerrit.wikimedia.org/r/78226
[09:55:59] I don't think it's necessary, unless demon doesn't get to it today for some reason
[09:56:03] (03CR) 10Mark Bergsma: [C: 032] Rename role::cache::ssl::text to ssl::unified [operations/puppet] - 10https://gerrit.wikimedia.org/r/78224 (owner: 10Mark Bergsma)
[09:56:41] (03CR) 10Mark Bergsma: [C: 032] Install cp1043/1044 as misc varnish servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/78225 (owner: 10Mark Bergsma)
[09:57:04] (03CR) 10Mark Bergsma: [C: 032] Decommission cp1041-1044 [operations/puppet] - 10https://gerrit.wikimedia.org/r/78226 (owner: 10Mark Bergsma)
[09:57:45] oki doki
[09:57:58] PROBLEM - Puppet freshness on mchenry is CRITICAL: No successful Puppet run in the last 10 hours
[10:08:51] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours
[10:09:41] (03PS1) 10TTO: Clean up first part of InitialiseSettings.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78227
[10:31:15] (03PS1) 10Mark Bergsma: Move cp1043 and cp1044 internal [operations/puppet] - 10https://gerrit.wikimedia.org/r/78228
[10:31:52] (03CR) 10Mark Bergsma: [C: 032] Move cp1043 and cp1044 internal [operations/puppet] - 10https://gerrit.wikimedia.org/r/78228 (owner:
10Mark Bergsma)
[10:33:21] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Thu Aug 8 10:33:16 UTC 2013
[10:33:51] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours
[10:34:20] (03CR) 10Mark Bergsma: "Why not just put a single file resource and use the "recurse" attribute? :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/76678 (owner: 10Tim Starling)
[10:39:10] I should resurrect this grnet script...
[10:39:40] file { "/home/faidon":
[10:39:40] ensure => directory,
[10:39:40] source => [
[10:39:40] "puppet:///files/userfiles/faidon/",
[10:39:40] "puppet:///files/userfiles/skel/",
[10:39:42] ],
[10:39:44] sourceselect => first,
[10:39:47] recurse => remote,
[10:39:49] is what we had
[10:39:58] yeah
[10:43:07] (03CR) 10Faidon: "I've used this previously and has worked well for me:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/76678 (owner: 10Tim Starling)
[11:06:06] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours
[11:35:16] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Thu Aug 8 11:35:09 UTC 2013
[11:36:06] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours
[12:01:53] PROBLEM - SSH on sq50 is CRITICAL: Server answer:
[12:02:53] RECOVERY - SSH on sq50 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[12:03:13] PROBLEM - SSH on sq60 is CRITICAL: Server answer:
[12:04:13] RECOVERY - SSH on sq60 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[12:05:23] PROBLEM - SSH on sq59 is CRITICAL: Server answer:
[12:06:14] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours
[12:06:24] RECOVERY - SSH on sq59 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[12:06:34] PROBLEM - SSH on virt1000 is CRITICAL: Server answer:
[12:07:34] RECOVERY - SSH on virt1000 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[12:11:34]
PROBLEM - SSH on virt1000 is CRITICAL: Server answer: [12:12:34] RECOVERY - SSH on virt1000 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [12:14:14] PROBLEM - SSH on lvs1003 is CRITICAL: Server answer: [12:15:14] RECOVERY - SSH on lvs1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [12:32:44] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Thu Aug 8 12:32:40 UTC 2013 [12:33:14] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [12:34:00] (03CR) 10MZMcBride: "This is a small chunk? :-)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78227 (owner: 10TTO) [12:34:32] mutante: Super-outdated. :-) [12:34:52] mutante: JeLuF and Elian haven't been involved since like 2005, I don't think. [12:49:44] PROBLEM - Puppet freshness on db9 is CRITICAL: No successful Puppet run in the last 10 hours [13:05:36] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [13:14:45] have you seen the size of the whole config file lately ( Elsie )? [13:32:56] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Thu Aug 8 13:32:46 UTC 2013 [13:33:36] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [13:56:56] (03CR) 10Ottomata: "(3 comments)" [operations/software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/78160 (owner: 10Edenhill) [14:06:29] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [14:32:49] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Thu Aug 8 14:32:47 UTC 2013 [14:33:29] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [15:08:08] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [15:12:28] (03CR) 10Demon: "Yeah just hadn't gotten back to it yet." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/75777 (owner: 10Demon) [15:13:15] <^d> manybubbles: I had a dream that we did something very very wrong in CirrusSearch [15:13:19] <^d> And it broke things. [15:13:29] <^d> Sadly it was a dream, so I'm having trouble remembering what we did wrong. [15:13:39] <^d> (And if it was a reflection on what we're actually doing) [15:13:47] ! [15:14:19] the only thing I'm worried about right now is making sure we run side by side correctly.... if that is ok then I think we'll be fine. [15:14:33] <^d> So I think today I'm going to look at a bunch of things with my paranoid hat on. [15:14:40] <^d> And assume the worst :) [15:14:41] but yeah, if you remember what your dream said it might be useful. unless it was about rainbows. [15:14:47] sounds good [15:17:12] <^d> Heh, I see what you were trying to do in https://gerrit.wikimedia.org/r/#/c/78151/ [15:32:48] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Thu Aug 8 15:32:43 UTC 2013 [15:33:08] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [15:40:41] (03PS1) 10Demon: Don't proxy robots.txt [operations/puppet] - 10https://gerrit.wikimedia.org/r/78239 [15:51:57] RECOVERY - Puppet freshness on mchenry is OK: puppet ran at Thu Aug 8 15:51:55 UTC 2013 [15:52:59] (03PS2) 10Demon: Don't proxy robots.txt [operations/puppet] - 10https://gerrit.wikimedia.org/r/78239 [15:55:08] (03CR) 10ArielGlenn: [C: 032] Don't proxy robots.txt [operations/puppet] - 10https://gerrit.wikimedia.org/r/78239 (owner: 10Demon) [16:06:37] (03PS1) 10Demon: Revert "Poor googlebot, spidering git too fast" [operations/puppet] - 10https://gerrit.wikimedia.org/r/78243 [16:07:50] (03PS1) 10Manybubbles: Turn down elasticsearch heap usage in production. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/78244
[16:08:29] ^d: ^^^
[16:08:37] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours
[16:09:23] (03CR) 10Demon: [C: 031] "Looks good and needed. Merge me plz :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/78244 (owner: 10Manybubbles)
[16:09:44] (03PS2) 10Demon: Revert "Poor googlebot, spidering git too fast" [operations/puppet] - 10https://gerrit.wikimedia.org/r/78243
[16:11:32] (03CR) 10ArielGlenn: [C: 032] Revert "Poor googlebot, spidering git too fast" [operations/puppet] - 10https://gerrit.wikimedia.org/r/78243 (owner: 10Demon)
[16:13:04] (03PS3) 10Akosiaris: Refactor nrpe to a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/77720
[16:13:36] <^d> akosiaris: Hey, got a backups question for you :)
[16:13:47] hit me
[16:14:19] <^d> So until yesterday, we had daily backups of svn copied off to tridge. We discontinued those since SVN is now a read-only service.
[16:14:32] <^d> So A) If you've got old backups laying around on tridge, feel free to kill them
[16:14:54] <^d> and B) I've got a final backup of SVN, it's about 14G. Where would be a safe place to store that $practicallyForever?
[16:16:30] A) cool. We are redesigning backup using a new infrastructure so that part will be implicit (just not migrated to the new infra)
[16:16:54] B) The new infra. I can create a $forever volume for such things
[16:17:13] let's call it archival (sounds better)
[16:17:59] <^d> Sounds like a plan.
[16:18:18] :-)
[16:19:44] I wonder what we want to do with rather larger data sets that we should keep around forever (a few T, more than a few T)
[16:20:24] RAID1 over the internet ?
[16:20:48] old slashdot joke...
[16:20:56] apergos: like ?
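The file-resource fragment Faidon pasted at 10:39 earlier in the log was cut off before its closing brace. Reassembled, it would read roughly as below; only the closing brace is supplied here, the attributes are as pasted. It populates a home directory from a per-user file tree, falling back to a shared skel tree when no per-user file exists:

```puppet
# Reassembly of the manifest fragment pasted at 10:39 (closing brace added).
# "sourceselect => first" makes Puppet take each file from the first source
# that provides it; "recurse => remote" copies the remote tree recursively.
file { "/home/faidon":
  ensure       => directory,
  source       => [
    "puppet:///files/userfiles/faidon/",
    "puppet:///files/userfiles/skel/",
  ],
  sourceselect => first,
  recurse      => remote,
}
```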
[16:21:05] page view stats for example [16:21:10] we keep them forever [16:21:29] having a safe place besides just "on spinning disks in two places" might be nice [16:21:30] <^d> akosiaris: We did an NFS mount between Amsterdam and Tampa once. ma rk didn't like that. [16:21:51] i would not like that either [16:22:22] wait til you hear where the labs (in tamp) gluster filesystem is mounted... [16:23:26] but i was referring to allowing people to mirror the data freely (through whatever technology). That's what CommanderTaco called RAID1 over the internet [16:23:50] we have a couple copies off site but they aren't guaranteed to be there [16:29:51] Am I in the right place to ask about logo changes? [16:29:54] To Wikidata [16:30:03] Or is it -tech [16:32:02] Wrong place, apparently. [16:32:47] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Thu Aug 8 16:32:43 UTC 2013 [16:33:37] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [17:09:22] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [17:15:50] ^d: antimony unhappy again [17:16:02] <^d> dangit. [17:17:16] <^d> apergos: We're gonna have to block some other things too, /zip/ wasn't enough :\ [17:18:15] (03PS1) 10Manybubbles: Turn down suggestion caching in labs. 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78252 [17:18:17] block all the things [17:32:52] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Thu Aug 8 17:32:44 UTC 2013 [17:33:12] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [17:33:12] PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours [17:33:12] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [17:33:12] PROBLEM - Puppet freshness on pdf3 is CRITICAL: No successful Puppet run in the last 10 hours [17:33:12] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours [17:33:13] PROBLEM - Puppet freshness on ssl1004 is CRITICAL: No successful Puppet run in the last 10 hours [17:33:22] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [17:37:16] <^d> paravoid: So, testsearch1001 is freaking out or something. [17:37:21] <^d> http://ganglia.wikimedia.org/latest/?r=week&cs=08%2F06%2F2013+00%3A00+&ce=08%2F08%2F2013+00%3A00+&c=Miscellaneous+eqiad&h=testsearch1001.eqiad.wmnet&tab=m&vn=&mc=2&z=small&metric_group=ALLGROUPS [17:37:25] <^d> 1002 and 1003 are fine [17:37:47] <^d> elastic search isn't running yet on any of them (unrelated issue, fix in puppet for that) [17:38:33] wow [17:38:44] <^d> top is reporting power_saving and watchdog as taking up >100% cpu in spikes. [17:41:00] cmjohnson1: around? [17:47:38] (03CR) 10Demon: "(1 comment)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78252 (owner: 10Manybubbles) [17:50:13] <^d> paravoid: Thinking it's hardware? [17:51:52] that [17:52:00] or the friggin power settings again [17:53:13] (03PS2) 10Manybubbles: Turn down suggestion caching in labs. 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78252 [17:53:20] i hate hardware [17:53:48] (03CR) 10Demon: [C: 032] Turn down suggestion caching in labs. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78252 (owner: 10Manybubbles) [17:53:57] (03Merged) 10jenkins-bot: Turn down suggestion caching in labs. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/78252 (owner: 10Manybubbles) [17:54:31] <^d> manybubbles: Ok, should have your 30m cache expiry in labs soon. [17:59:59] (03PS1) 10Mark Bergsma: Correct LVS service IP name [operations/puppet] - 10https://gerrit.wikimedia.org/r/78255 [18:00:10] <^d> Well that was annoying. [18:01:06] so should we put gitblit behind varnish? [18:01:12] (03CR) 10Mark Bergsma: [C: 032] Correct LVS service IP name [operations/puppet] - 10https://gerrit.wikimedia.org/r/78255 (owner: 10Mark Bergsma) [18:01:33] <^d> mark: It doesn't output sane enough caching headers yet. [18:01:49] what does it output? [18:02:18] <^d> Take https://git.wikimedia.org/commitdiff/mediawiki%2Fextensions%2FSpamBlacklist.git/29f69a35e7e115455796c8c04ab2f9ee7f53f213 for example. [18:02:31] <^d> Cache-Control: private, must-revalidate [18:02:32] <^d> Last-Modified: Thu, 08 Aug 2013 17:23:22 GMT [18:02:32] <^d> Expires: Thu, 08 Aug 2013 18:07:06 GMT [18:02:50] <^d> That's a diff page. The content will *never* change. [18:03:00] that's stupid indeed [18:03:05] we can override it in varnish, but it's hacky of course [18:03:12] is there anything that is actually private? [18:03:20] does it allow logins and such? [18:03:20] <^d> No, there's not. [18:03:26] <^d> Yes, but we have that disabled. [18:03:38] then we might as well force caching on it [18:03:53] hehe [18:03:58] varnish by default doesn't honor CC: private at all [18:04:01] I guess this is why [18:04:06] <^d> Upstream is nice and responsive.
I can ask him to fix this :) [18:04:13] yeah that would be nice [18:04:59] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [18:04:59] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [18:04:59] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [18:04:59] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [18:04:59] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [18:05:00] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [18:05:51] <^d> mark: So I'll prod him on this. But generally yes, I think gitblit is a good candidate for your "misc varnish cluster" you were talking about. [18:06:08] i'm setting up two boxes for that now [18:06:10] so we can start on that [18:07:29] <^d> I'm probably going to move the old svn stuff to that same box at some point, which could also sit behind varnish. [18:07:50] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [18:08:14] another advantage is that many services won't need public IPs anymore [18:08:18] mark: out of pure curiosity, how come went gitblit instead of cgit? [18:08:33] you're asking me? [18:08:37] er [18:08:38] <^d> That was me. [18:08:38] sorry ^d [18:08:40] <^d> :) [18:09:31] <^d> So, the original plan was to have gitblit as a gerrit plugin. But that basically didn't work well and increased gerrit's load. I still liked gitblit and the upstream was nice & active so I went for it. 
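[editor's note: the varnish override mentioned above ("we might as well force caching on it") could, in outline, look like the following Varnish 3-era VCL. This is a hedged sketch, not the config that was actually merged in gerrit 78262; the /commitdiff/ URL pattern and the 30-day TTL are illustrative assumptions.]

```vcl
# Hypothetical VCL 3.x sketch: force caching of immutable gitblit
# diff pages despite the backend's "Cache-Control: private" header.
# URL pattern and TTL are guesses, not the deployed configuration.
sub vcl_fetch {
    if (req.url ~ "^/commitdiff/") {
        # A diff is addressed by commit hash, so it never changes;
        # drop the backend's private/must-revalidate headers.
        unset beresp.http.Cache-Control;
        unset beresp.http.Expires;
        set beresp.ttl = 30d;
    }
}
```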
[18:09:39] ah [18:09:50] mark: see http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&h=testsearch1001.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=cpu_report&c=Miscellaneous+eqiad btw [18:09:57] looks like http://comments.gmane.org/gmane.linux.hardware.dell.poweredge/43926 [18:10:05] i saw [18:10:07] {hard,firm}ware bug [18:10:14] <^d> paravoid: Also possibly I've drank the java cool-aid and I figured "what's another jvm app running behind apache?" ;-) [18:10:15] annoying [18:10:23] ^d: haha [18:11:19] mark: so, I think we need to change our mail infrastructure if we want to confuse google less... [18:11:42] we need to separate lists & wikimedia.org handling completely [18:12:32] it's architecturally more sound for us too [18:12:50] but it'd help with google's spam scoring too... [18:18:30] PROBLEM - Squid on brewster is CRITICAL: Connection timed out [18:19:30] RECOVERY - Squid on brewster is OK: TCP OK - 1.026 second response time on port 8080 [18:20:24] perhaps we should put squid behind varnish too ;) [18:20:53] whoops [18:20:54] broken disk in brewster [18:21:46] for 3 months already [18:21:47] wtf [18:23:30] PROBLEM - Squid on brewster is CRITICAL: Connection timed out [18:24:43] !log Removed /dev/sda from sw raid arrays on brewster [18:24:55] Logged the message, Master [18:26:47] (03PS1) 10Demon: Also disallow blobdiff from indexing [operations/puppet] - 10https://gerrit.wikimedia.org/r/78256 [18:28:06] (03CR) 10ArielGlenn: [C: 032] Also disallow blobdiff from indexing [operations/puppet] - 10https://gerrit.wikimedia.org/r/78256 (owner: 10Demon) [18:30:30] RECOVERY - Squid on brewster is OK: TCP OK - 0.026 second response time on port 8080 [18:31:05] also squid ran out of space [18:32:43] * mark wipes everything [18:33:00] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Thu Aug 8 18:32:51 UTC 2013 [18:33:30] PROBLEM - Squid on brewster is CRITICAL: Connection refused [18:33:50] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful 
Puppet run in the last 10 hours [18:35:30] RECOVERY - Squid on brewster is OK: TCP OK - 0.027 second response time on port 8080 [18:39:59] (03PS1) 10Mark Bergsma: Set memory_storage_size for the misc cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/78257 [18:43:37] (03CR) 10Mark Bergsma: [C: 032] Set memory_storage_size for the misc cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/78257 (owner: 10Mark Bergsma) [18:44:01] yay misc cluster yay [18:46:02] heh [18:46:06] what's the default backend for the misc cluster [18:49:25] ^d: does gitblit care about the source ip at all? [18:49:32] and/or does it parse XFF? [18:50:23] <^d> Nope, shouldn't care at all. [18:50:43] so it lives on antimony port 8080? [18:50:48] i need something to test against, might as well use that [18:50:50] even if it doesn't cache yet [18:51:45] <^d> Oh, one minor weird thing if you're pointing directly at 8080. [18:52:04] <^d> I have to set the following in apache: [18:52:04] <^d> RequestHeader set X-Forwarded-Proto https [18:52:11] <^d> RequestHeader set X-Forwarded-Port 443 [18:52:38] i can do the same in varnish [18:53:40] blog is also a good candidate [18:53:45] there's a varnish there already [18:54:07] yeah [18:54:27] also, it's broken wrt XFF [18:54:38] has been since it was moved into a different server [18:54:47] :) [18:54:56] (i.e. all comments appear from 127.0.0.1) [18:55:14] the old machine had mod-rpaf I think [18:56:03] * Jeff_Green seeks opinions on where to store some of the OTRS config [18:56:23] they're giving us what was done before in the form of a single OTRS package [18:56:28] oh hi Jeff_Green [18:56:39] oh hai [18:56:42] it'd help with spam scoring if we implemented dkim across our mailservers [18:56:45] are you interested? 
:) [18:56:57] i think it's a great idea [18:57:02] take it to the bridge :-P [18:57:22] I've done it before but frankly I'm too busy with other engagements atm [18:57:54] so I was wondering if you had some spare cycles, since you're probably the only other person in the team that has done so before :) [18:58:00] me too. i already put in my 20 hour day this week, so I'm pretty well out of hours before my vacation starts next monday [18:58:11] oh, shame [18:58:20] I'm happy to do it when I return. iirc i puppetized it already, just a matter of making the keys and stuff [18:58:30] and DNS entries of course [18:59:45] huh, i don't see it...where [18:59:58] (03PS1) 10Mark Bergsma: Configure the misc varnish cluster for gitblit [operations/puppet] - 10https://gerrit.wikimedia.org/r/78262 [18:59:59] ^d: ^ [19:00:16] (03CR) 10jenkins-bot: [V: 04-1] Configure the misc varnish cluster for gitblit [operations/puppet] - 10https://gerrit.wikimedia.org/r/78262 (owner: 10Mark Bergsma) [19:01:04] <^d> mark: Missing a comma in cache.pp, but otherwise lgtm. [19:01:07] ah ha. it's in templates/exim/exim4.donate.erb [19:01:54] (03PS2) 10Mark Bergsma: Configure the misc varnish cluster for gitblit [operations/puppet] - 10https://gerrit.wikimedia.org/r/78262 [19:02:00] note that I'll need to setup ssl too [19:02:11] but i'll make this work first [19:02:48] (03CR) 10Mark Bergsma: [C: 032] Configure the misc varnish cluster for gitblit [operations/puppet] - 10https://gerrit.wikimedia.org/r/78262 (owner: 10Mark Bergsma) [19:03:17] paravoid: put in an RT ticket for the DKIM/domainkeys thing and if nobody has looked at it I can do so when I get back [19:04:03] well [19:04:05] i wonder how that would work with gmail though? 
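[editor's note: the DKIM signing discussed above is configured in exim (4.70 or newer — which fits the remark below that exim on mchenry is too old) as options on the outbound SMTP transport. A minimal sketch; the selector name and key path are placeholders, not the actual puppetized config.]

```
# Hypothetical exim (>= 4.70) DKIM signing on the outbound transport.
# Selector "wikimedia" and the key path are placeholders.
remote_smtp:
  driver = smtp
  dkim_domain = wikimedia.org
  dkim_selector = wikimedia
  dkim_private_key = /etc/exim4/dkim/wikimedia.org.key
  dkim_canon = relaxed
```

[the matching public key would then be published in DNS as a TXT record at wikimedia._domainkey.wikimedia.org — the "DNS entries of course" mentioned above.]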
[19:04:07] exim on mchenry is too old [19:04:13] so we need to move that to eqiad anyway [19:04:13] true [19:04:22] i sort of said i'd do that [19:04:28] so I can take care of it, next week or the week after [19:04:37] even better :-P [19:04:38] it's been on my list for a while ;) [19:05:44] most people seem scared of exim [19:05:57] i don't get it, it's like the best documented piece of software, just about ;) [19:06:01] it'd be nice to do a few other changes along the way [19:06:27] like split into three clusters: incoming, incoming wikimedia.org, outgoing [19:06:42] mark i'm not scared of it but I generally prefer postfix [19:06:54] or, rather, incoming *, outgoing *, incoming/outgoing lists.wikimedia.org [19:06:55] and I have a good history of breaking exim :-P [19:07:16] paravoid: that's pretty much already the case [19:07:16] sbernardin: can you check to see if we have 1TB Seagate Barracuda Model ST31000333AS in the cabinet. I recall having a spare [19:07:22] it's not really [19:07:26] yeah it is [19:07:31] mchenry is a backup of lists and lists is a backup of mchenry [19:07:43] yes [19:07:44] mchenry is a backup of lists (for lists.wm.org) and lists/sodium is a backup of mchenry (for wikimedia.org) [19:07:52] yeah, that has presented a problem now [19:08:08] so, apparently, on google apps [19:08:13] yeah but i mean, the config is pretty well prepared for that [19:08:17] you declare your "inbound gateways" [19:08:17] it just needs to be moved to other boxes [19:08:26] it's not like it's hard to untangle [19:08:45] oh yeah I didn't mean that [19:09:12] it's not that complicated in general, exim makes it all easy (to me at least) [19:09:14] mark do you see any reason to keep the separation of lists vs wikimedia ? [19:09:22] hell yes [19:09:26] i made that separation [19:09:28] yes, they need to be separated [19:09:30] because it was hell before [19:09:42] hahahaha. what was so hellish? 
[19:10:19] mark: also, having two wikimedia.org mtas that can both perform deliveries would be nice [19:10:34] so the plan for that was, one in eqiad, one in pmtpa [19:10:37] that's still the plan [19:10:42] of course the pmtpa one is going away now :) [19:10:44] i.e. not just queue for later delivery but actually deliver to e.g. google [19:10:54] right, that sounds nice [19:11:23] but lists only in e.g. eqiad, right? [19:11:33] syncing mailman would be a pita :) [19:11:34] lists is harder to scale yeah [19:13:24] and the other thing is splitting off labs :) [19:13:43] we should quarantine labs from the world [19:13:48] (03CR) 10Michał Łazowik: "Per JanZerebecki:" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/65443 (owner: 10Dzahn) [19:14:00] make it just deliver to an imap server people can access :-P [19:17:34] back to my original question: what's the appropriate place to keep small package files for OTRS, which are applied through the web UI? [19:18:21] Jeff_Green: the idea that was floating was for using ldap to route emails, so project admins would get their boxes' emails [19:19:19] basically the modern way to deal with OTRS site changes is to apply an OTRS package, which any admin user can install from their web browser as a file upload [19:19:40] so I'd like to put the current package somewhere I can point to from wikitech [19:19:41] that sounds horrible [19:19:48] I'm guessing that's code, right? [19:19:59] seems to be XML [19:20:28] i don't know that it's worse than quilt? [19:20:47] most of the customization is done through the web UI anyway [19:20:53] what's the contents of this XML? [19:22:13] the current one installs two files within the install directory, sets perms, etc [19:22:29] what kind of files? [19:22:51] looks like XML to me [19:23:29] I'm not sure if they're clobbered or patched [19:24:33] ^demon|lunch: i'm getting a redirect to https?
[19:32:59] (03PS1) 10Mark Bergsma: Don't run the default VCL for gitblit [operations/puppet] - 10https://gerrit.wikimedia.org/r/78266 [19:34:26] doh [19:35:59] (03PS2) 10Mark Bergsma: Connect to gitblit directly, don't run the default VCL for gitblit [operations/puppet] - 10https://gerrit.wikimedia.org/r/78266 [19:36:51] (03CR) 10Mark Bergsma: [C: 032] Connect to gitblit directly, don't run the default VCL for gitblit [operations/puppet] - 10https://gerrit.wikimedia.org/r/78266 (owner: 10Mark Bergsma) [19:37:24] typo [19:37:37] s/antimoney/antimony/ [19:37:43] oh thanks [19:37:54] are you antimoney? [19:38:04] I found myself very pro-money [19:38:13] ok, bad joke [19:38:15] (03PS1) 10Mark Bergsma: Fix typo [operations/puppet] - 10https://gerrit.wikimedia.org/r/78267 [19:38:16] before or after joining the wmf [19:38:26] and before or after the last bernanke speech [19:38:40] (03CR) 10Mark Bergsma: [C: 032] Fix typo [operations/puppet] - 10https://gerrit.wikimedia.org/r/78267 (owner: 10Mark Bergsma) [19:38:51] (03CR) 10Mark Bergsma: [V: 032] Fix typo [operations/puppet] - 10https://gerrit.wikimedia.org/r/78267 (owner: 10Mark Bergsma) [19:41:04] analytics will be happy [19:43:33] we should have a concept of merit badges here, like the scouts [19:43:46] where you get one for each system you've suffered through dealing with [19:44:05] I'd get . . . fundraising, civicrm, otrs, exim, mysql . . . [19:51:31] we will be happy for money? [20:24:33] !log working on carbon [20:24:44] Logged the message, Master [20:26:51] PROBLEM - Host carbon is DOWN: CRITICAL - Host Unreachable (208.80.154.10) [20:27:21] RECOVERY - Host carbon is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [20:33:36] <^d> mark: For gitblit? Yeah I forced https.
[20:43:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:44:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [20:47:07] what's the point of forcing https? [20:48:06] <^d> Not much I suppose. [20:49:23] root@cp1044:~# telnet antimony.wikimedia.org 8080 [20:49:23] Trying 208.80.154.7... [20:49:23] telnet: Unable to connect to remote host: Connection refused [20:49:32] localhost only? :) [20:50:34] <^d> Quite possibly. [20:55:04] yeah it's listening on localhost [21:02:43] (03PS1) 10Mark Bergsma: Correct gitblit file modes [operations/puppet] - 10https://gerrit.wikimedia.org/r/78314 [21:08:17] (03CR) 10Demon: [C: 031] Correct gitblit file modes [operations/puppet] - 10https://gerrit.wikimedia.org/r/78314 (owner: 10Mark Bergsma) [21:08:28] cmjohnson1: Ok, so carbon is up [21:08:34] of course I don't want to leave gitblit listening open on eth0 [21:11:23] robh: you will see there is no sdb [21:11:36] why isnt there? [21:11:47] did you add hw raid? [21:12:03] cmjohnson1: ... uh [21:12:09] Disk /dev/sdb: 500.1 GB, 500107862016 bytes [21:12:12] there is an sdb. [21:12:21] fdisk -l [21:12:26] cat /proc/mdstat [21:12:59] right, it fell out [21:13:02] sdb died [21:13:04] you replaced it [21:13:07] yes [21:13:08] now the new sdb is there [21:13:13] but it doesnt have the partition table of the old one [21:13:19] so it cannot be placed into an array [21:13:28] this is normal and will happen every single time we have a sw raid disk failure [21:13:29] right when i tried to add it ...it failed [21:13:33] (03PS1) 10Jgreen: remove /opt/otrs/bin/otrs.GenericAgent.pl cron job b/c it's barfing [operations/puppet] - 10https://gerrit.wikimedia.org/r/78315 [21:13:39] did you just try to add without making the partition mapping? 
cuz you cannot do that [21:13:51] it has no sdb1 or sdb2 yet [21:14:43] (03PS2) 10Mark Bergsma: Correct gitblit file modes [operations/puppet] - 10https://gerrit.wikimedia.org/r/78314 [21:14:49] (03CR) 10Jgreen: [C: 032 V: 031] remove /opt/otrs/bin/otrs.GenericAgent.pl cron job b/c it's barfing [operations/puppet] - 10https://gerrit.wikimedia.org/r/78315 (owner: 10Jgreen) [21:15:09] robh: [21:15:10] (03CR) 10Mark Bergsma: [C: 032 V: 032] Correct gitblit file modes [operations/puppet] - 10https://gerrit.wikimedia.org/r/78314 (owner: 10Mark Bergsma) [21:15:21] ? [21:15:23] are you doing anything on carbon? [21:15:26] yes [21:15:43] trying to recall the damned command to copy the entire partition mapping including raid superblock info from sda to sdb [21:15:49] so we can just add it back to array and rebuild [21:15:52] ok...so i tried this awhile ago but yes I tried to create the same partition table [21:15:55] and its escaping me [21:15:56] sfdisk -d /dev/sda | sfdisk /dev/sdb [21:16:27] but will try again unless you are doing something now [21:17:09] you don't need to copy the superblock info [21:17:09] im doin for test [21:17:14] just partitions [21:17:26] cmjohnson1: so that works for me [21:17:34] had to force cuz the disks are not the identical same [21:17:40] but the newer disk is larger, so its ok. [21:17:48] hrm ...maybe i had too much coffee that day [21:18:00] so now you can just rebuild the array with mdadm [21:18:09] i bet i forgot the part about copying the partition table [21:18:27] mark: yea i guess mdadm handles that [21:18:38] i misread something and misrecalled [21:19:02] cmjohnson1: if you need help with the raid rebuild lemme know [21:19:17] but anytime we have a raid system, and we can get it to boot [21:19:23] we can resurrect without reinstall [21:19:45] yeah..i've replaced a few before...but i must've missed a step w/carbon.
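[editor's note: pieced together, the sw-raid disk replacement walked through above amounts to the following command sketch. It assumes, per the conversation, that sda is the surviving member and sdb the blank replacement, and that md0/md1 are built from the first and second partitions; do not run blindly — sfdisk here overwrites sdb's partition table.]

```shell
# Copy the partition layout (not the md superblocks) from the good
# disk to the replacement; --force because the disks aren't identical,
# as noted above (the new disk is larger, which is fine).
sfdisk -d /dev/sda | sfdisk --force /dev/sdb

# Re-add the new partitions; mdadm writes fresh raid superblocks.
mdadm /dev/md0 --add /dev/sdb1
mdadm /dev/md1 --add /dev/sdb2

# Watch the rebuild progress (as shown in the mdstat paste below).
cat /proc/mdstat
```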
[21:19:49] (03PS1) 10Mark Bergsma: Serve the Wikimedia error page on misc [operations/puppet] - 10https://gerrit.wikimedia.org/r/78317 [21:19:52] thx for helping [21:21:11] md1 : active raid1 sdb2[2] sda2[0] [21:21:11] 482526144 blocks [2/1] [U_] [21:21:11] resync=DELAYED [21:21:12] [21:21:14] md0 : active raid1 sdb1[2] sda1[0] [21:21:16] 5858240 blocks [2/1] [U_] [21:21:18] [========>............] recovery = 42.8% (2513088/5858240) finish=0.5min speed=96657K/sec [21:21:29] robh ^ [21:21:40] (03CR) 10Mark Bergsma: [C: 032] Serve the Wikimedia error page on misc [operations/puppet] - 10https://gerrit.wikimedia.org/r/78317 (owner: 10Mark Bergsma) [21:28:28] sweet [21:40:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:41:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [22:50:34] PROBLEM - Puppet freshness on db9 is CRITICAL: No successful Puppet run in the last 10 hours [23:20:45] AaronSchulz: paravoid: (not sure who else) I wanted to make sure you knew about a swift hackathon we're having. http://swifthackathon.eventbrite.com