[00:03:08] maybe it is load order dependent [00:03:26] are you going to spend your day debugging that? [00:03:31] maybe it breaks for me and paravoid because we have a higher RTT than Aaron [00:04:00] oi [00:04:08] New patchset: Dzahn; "move misc::udpprofile::collector to mediawiki::udpprofile::collector and into mediawiki.pp (RT-720)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36118 [00:04:16] New patchset: Pyoungmeister; "setting coredb::common to inherit coredb::config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36119 [00:04:24] I am working on lucene, for your information ;) [00:04:34] TimStarling: then why did I have the same problem with the same box in the same office before without a prior os/FF? [00:04:57] mysterious [00:05:19] I'm not aware of the office wired connection latency going down...though maybe it went done just enough microseconds that it started working ;) [00:06:02] TimStarling: so why does it use OAI instead of the usual search triggers? [00:06:15] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36119 [00:06:19] dunno [00:06:30] well ok, there were search triggers but they sucked [00:06:31] it all seems like a black whole to me [00:06:58] any useful information is beyond the event horizon [00:06:58] page save would stall when lucene went down (which it often did) [00:07:16] so someone introduced the OAI thing as a replacement [00:07:21] why would it be synchronous? [00:07:35] I don't think the hooks have to be used that way [00:07:36] right, they could have just fixed it instead [00:08:15] or use a separate unmaintained extension that keeps an changelog in an innodb table [00:08:39] *keeps a [00:08:42] it was rainman who added it [00:09:11] translation: "I'm not the witch, she actually lives yonder!" 
[00:09:35] right [00:10:25] anyway, we'll probably switch to some new shiny thing like solr in a few months, so there's not much point doing a lot of work on it [00:10:55] will that involve dismembering OAI? [00:11:08] we should make sure it is on the requirements list [00:11:59] New patchset: Ryan Lane; "Add shared sync hook" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36120 [00:11:59] New patchset: Ryan Lane; "Change some sync and log directory locations" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36121 [00:12:30] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36111 [00:12:43] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36120 [00:13:01] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36121 [00:14:18] paravoid: when i go to wikipedia.org, i get a response that starts out with 200 OK. can you help? [00:15:45] TimStarling: can you glance at https://gerrit.wikimedia.org/r/#/c/34062/ later? It's a bit rough but seems to work. 
[00:16:10] New patchset: Dzahn; "rename misc::dc-cam-transcoder to facilities::dc-cam-transcoder and move to facilities.pp (RT-720)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36122 [00:16:26] ok [00:16:41] !log aaron synchronized wmf-config/InitialiseSettings.php 'LST for test2wiki' [00:16:48] Logged the message, Master [00:17:29] New review: Dzahn; "yeah, it actually does not appear to be in use" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/36122 [00:17:30] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36122 [00:20:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:22:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.781 seconds [00:25:59] could someone with mod rights on ops-l OK https://lists.wikimedia.org/mailman/confirm/ops/df000f82dbcef58bac39d402df1074f5b056d224 ? i replied-all from my personal e-mail account, so the CC bounced, and i don't want to spam other people included in the thread. [00:27:08] ori-l: ok [00:27:25] thanks [00:27:44] !log olivneh Finished syncing Wikimedia installation... : [00:27:51] Logged the message, Master [00:28:14] ori-l: done, one by you and one by Nemo_bis [00:34:15] RECOVERY - mysqld processes on db68 is OK: PROCS OK: 1 process with command name mysqld [00:35:13] wikivoyage could use a squid purge maybe (see wikitech-l) [00:36:51] http://lists.wikimedia.org/pipermail/wikitech-l/2012-November/064731.html [00:38:13] New patchset: Ryan Lane; "Fix hookdir location" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36125 [00:38:27] PROBLEM - MySQL Replication Heartbeat on db68 is CRITICAL: CRIT replication delay 2787 seconds [00:40:36] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36125 [00:41:27] jeremyb: squid purge done. yes, caching. 
and wow how people notice right away..that domain wasnt even in our DNS until a few hours ago [00:41:52] works for me,, at least if they also refresh in browser [00:43:24] PROBLEM - MySQL Slave Delay on db68 is CRITICAL: CRIT replication delay 1843 seconds [00:47:54] RECOVERY - MySQL Replication Heartbeat on db68 is OK: OK replication delay 0 seconds [00:48:12] RECOVERY - MySQL Slave Delay on db68 is OK: OK replication delay 0 seconds [00:50:24] New patchset: Ryan Lane; "Add sync hook links for repos" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36126 [00:51:46] New patchset: Ryan Lane; "Add sync hook links for repos" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36126 [00:52:12] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36126 [00:56:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:58:51] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [00:58:51] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [01:06:20] New patchset: Ryan Lane; "Add the linking into the correct class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36127 [01:06:41] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36127 [01:07:00] New patchset: Pyoungmeister; "s7: removing db26, adding db68, and lowering weight on db56 (new snapshot)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36128 [01:07:42] New patchset: Pyoungmeister; "setting db56 as new snapshot host for s7" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36129 [01:08:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.022 seconds [01:10:47] New patchset: Ryan Lane; "Fix scope of definition" [operations/puppet] (production) - 
https://gerrit.wikimedia.org/r/36130 [01:11:07] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36130 [01:12:44] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36128 [01:12:52] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36129 [01:14:04] !log py synchronized wmf-config/db.php 's7: db26 out, db68 in, db56 lower weight' [01:14:13] Logged the message, Master [01:14:45] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [01:14:45] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [01:31:05] New patchset: Ryan Lane; "Fix syntax in deployment pillar template" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36138 [01:34:45] New patchset: Dzahn; "move misc::udpprofile::collector to mediawiki::udpprofile::collector and into mediawiki.pp (RT-720)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36118 [01:36:28] New patchset: Dzahn; "move misc::udpprofile::collector to mediawiki::udpprofile::collector and into mediawiki.pp (RT-720)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36118 [01:37:05] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36118 [01:39:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:40:47] !log olivneh synchronized php-1.21wmf5/extensions/EventLogging/content/JsonSchemaHooks.php [01:40:53] Logged the message, Master [01:42:31] New patchset: Dzahn; "move rsync image classes to misc/images and DELETE misc-servers.pp ..yay..done (RT-720)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36141 [01:47:16] New patchset: Dzahn; "move rsync image classes to misc/images and DELETE misc-servers.pp ..yay..done (RT-720)" [operations/puppet] (production) - 
https://gerrit.wikimedia.org/r/36141 [01:48:08] New review: Dzahn; "no more misc-servers.pp :)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/36141 [01:48:09] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36141 [01:54:18] New patchset: Ryan Lane; "Add top file for pillars" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36142 [01:54:26] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36138 [01:54:44] soooooo cloooooooseeeeee [01:58:02] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36142 [01:58:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds [01:59:42] New patchset: Ryan Lane; "Refresh pillars on top change" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36143 [02:00:05] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36143 [02:08:21] New patchset: Dzahn; "remove ./misc/images.pp with rsync image classes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36146 [02:09:25] New review: Dzahn; "Mark, do we still need these? misc::images::rsync(d). Those were the last i had in misc-servers.pp a..." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/36146 [02:11:37] New patchset: Ryan Lane; "Up to wmf5 version of package" [operations/debs/git-deploy] (master) - https://gerrit.wikimedia.org/r/36147 [02:11:50] Change merged: Ryan Lane; [operations/debs/git-deploy] (master) - https://gerrit.wikimedia.org/r/36147 [02:13:41] New patchset: Ryan Lane; "Configure git-deploy sync hook via pillar data" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36148 [02:16:15] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [02:16:43] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36148 [02:16:46] New patchset: Tim Starling; "RMI timeout tweaks" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/36149 [02:25:31] New patchset: Ryan Lane; "Fix repo urls and module use of urls" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36150 [02:26:45] !log LocalisationUpdate completed (1.21wmf5) at Fri Nov 30 02:26:44 UTC 2012 [02:26:51] Logged the message, Master [02:28:29] New patchset: Ryan Lane; "Fix repo urls and module use of urls" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36150 [02:29:09] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36150 [02:31:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:38:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.069 seconds [02:49:43] !log LocalisationUpdate completed (1.21wmf4) at Fri Nov 30 02:49:43 UTC 2012 [02:49:50] Logged the message, Master [03:20:46] mutante: when i was testing with curl, sending a req w/ Cache-Control: no-cache made it give a correct response but then when I did a normal fetch again it gave the default httpd conf again [03:23:33] i never actually tested in a browser (just went 
straight to curl) until after i saw you said it was purged by email. was on my phone then so I checked with phone browser and it worked. but didn't curl again or check IRC until now ;) [03:25:25] mutante: also, wikipedians can never be /too/ fast. (michael jordan was before your time. barack obama i guess too) but sometimes we're very slow too. e.g. there's still articles that need updating to show the results of elections from this month [03:25:48] heh [03:26:26] I'm sure I could find some examples of staleness older than a month [03:26:40] I went through a brief period fixing old newsy articles for staleness [03:26:51] sure but elections are good impetus to update [03:26:56] I'd look through current events from a year ago, finding articles about disasters and things like that [03:27:13] elections provide a hit list of things to fix [03:27:14] and then I'd be the first person to proof read them in about 6 months [03:27:23] mostly it was about changing all the tenses from present to past [03:27:43] whereas e.g. sandy developed more over time [03:27:59] TimStarling: did you see the article on wikipedia's coverage of sandy? 
[03:28:05] no [03:28:13] * jeremyb fetches link [03:28:29] but here is some of my past work: https://en.wikipedia.org/w/index.php?title=Hurricane_Katrina_disaster_relief&diff=prev&oldid=338310939 [03:29:11] TimStarling: http://www.popsci.com/technology/article/2012-11/wikipedia-sandy?cmpid=mobify [03:29:25] that edit was 2010, 5 years after the event [03:32:50] I don't think you have to be a climate change denier to think it is ridiculous to link every instance of bad weather to global warming [03:33:41] i think the mentions of the possibility of a link were themselves notable [03:33:58] sure [03:36:19] > January 3, 2013 – [03:36:20] Present [03:38:36] that was added in https://en.wikipedia.org/w/index.php?title=Arizona%27s_2nd_congressional_district&diff=next&oldid=523531144 [03:40:30] so it was over 2 weeks after election before that article listed the representative-elect and it still doesn't list election tallies [03:40:55] [05:14:07] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [05:28:48] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [05:35:41] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [06:28:28] hi [06:28:55] woops, wrong channel. 
hi ops :) [07:20:42] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [08:05:57] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 261 seconds [08:06:24] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 285 seconds [08:09:24] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [08:10:45] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [08:12:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:17:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.288 seconds [08:19:45] PROBLEM - MySQL Replication Heartbeat on db56 is CRITICAL: CRIT replication delay 267 seconds [08:20:22] PROBLEM - MySQL Slave Delay on db56 is CRITICAL: CRIT replication delay 303 seconds [08:31:26] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 324 seconds [08:31:44] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 342 seconds [08:34:59] fundraising [08:48:17] and getting worse. I have no access to those boxes nor any idea aobut the setup [08:48:21] *sigh* [08:50:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:02:05] apergos, aren't you a root? [09:02:15] I don't have access to the fr boxes [09:02:22] only a few folks do [09:02:38] wow. 
serioz bizniss [09:02:46] yes, it is [09:04:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.644 seconds [09:08:29] RECOVERY - MySQL Slave Delay on db56 is OK: OK replication delay 8 seconds [09:08:29] RECOVERY - MySQL Replication Heartbeat on db56 is OK: OK replication delay 5 seconds [09:08:39] whew [09:26:42] New patchset: MaxSem; "Postgres module for OSM" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36155 [09:31:17] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [09:40:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:46:50] what's going on? [09:47:25] apergos: [09:47:33] hello [09:47:38] morning [09:47:40] what's up? [09:47:45] with fr/ [09:48:02] looks like the fundraising dbs were backed up waiting on db56 replcation [09:49:00] db56 is happy again but [09:49:09] oh that, I thought something more important :) [09:49:10] dunno about the fr boxes, I guess you have access to them [09:49:18] I do [09:49:27] no, if it was more important I would have called you! 
[09:49:35] :-) [09:56:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.033 seconds [10:14:47] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [10:20:56] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [10:28:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:41:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.361 seconds [10:59:29] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [10:59:29] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [11:15:32] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [11:15:32] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [11:17:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:30:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.715 seconds [12:04:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:17:23] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [12:19:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.272 seconds [12:53:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:08:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.635 seconds [13:44:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:57:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.734 seconds [14:32:20] PROBLEM - Puppetmaster 
HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:47:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.889 seconds [15:14:56] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [15:17:47] New patchset: Cmjohnson; "Adding partman cfg for pc1-3 and pc1001-3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36189 [15:22:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:29:56] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [15:36:59] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [15:38:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds [15:38:51] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35293 [15:49:35] New patchset: Andrew Bogott; "Add the install_path param for single-mode mediawiki." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/35313 [15:50:06] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35313 [16:11:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [16:58:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:00:36] New patchset: Dereckson; "(bug 42579) Set upload URL on pl.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36197 [17:13:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.210 seconds [17:21:42] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [17:45:37] New patchset: Aude; "update settings for wikibase client and repo, in prep for next deployment" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36205 [17:46:26] RobH: What's the status of https://rt.wikimedia.org/Ticket/Display.html?id=3873 (wtp1001)? The ticket seems to have been stalled for a few weeks now, and I don't see wtp1001 in DNS or Ganglia... and the only dependent ticket that isn't already closed is not viewable for me [17:47:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:47:36] seems like its just waiting for actual deployment, hrmm... lemme see what wtp1 uses for partmap [17:48:05] huh, not automapped [17:49:56] i'll see if i cannot get it spun up with OS install shortly. [17:50:30] New review: Aude; "todo items:" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/36205 [17:51:53] RobH: Excellent, thanks [17:52:36] RobH: Once it's up, I'll do some stress-testing, so we'll have an idea of how many servers we'll need when we go live with it. 
I guess for that I'll also need to create an LVS group in puppet and get a service IP assigned and all that [17:53:01] so wtp1 was manually partitioned (eww) [17:53:07] with small swap and rest / [17:53:16] i was just going to use the raid1-1partition receipe [17:53:24] unless you know of a reason it should be different? [17:53:28] I should warn you [17:53:37] Chris repartitioned it because it had software RAID on top of hardware RAID [17:53:45] AND a failed disk [17:53:56] yea so who knows what it started as [17:54:12] but i dont have a wtp1 reference in netboot.cfg, so i imagine it was always manually done [17:55:52] partman/lvm.cfg [17:56:11] commit f27ff758110ae36d9f68fb1688e652f6381b2599 [17:56:17] Date: Fri Nov 16 00:20:39 2012 +0000 [17:57:06] thanks to ash er for writing a good commit entry [18:03:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.019 seconds [18:21:05] !log authdns-update to add wtp1001 info to dns [18:21:13] Logged the message, RobH [18:24:39] Change abandoned: RobH; "this was done already by someone else, and names are already back in use, pushing this will cause is..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29600 [18:29:22] New patchset: RobH; "adding wtp1001 basic install stuff" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36216 [18:30:56] cmjohnson1: review that will ya? ^ [18:31:15] I know you can only +1 but it is good enough for me in terms of a second set of eyes reviewing what I normally self review ;] [18:31:30] lgtm but I've never seen these things before :) [18:32:01] yea chris has to do these changes more often now than before, as he does reinstalls [18:32:13] and im sure its right, but im making myself get reviews. 
[18:32:48] looks good (robh) [18:36:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:36:51] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36216 [18:36:58] cmjohnson1: thx, i dont mind merging it myself [18:37:03] but atleast someone else eyeballed it [18:38:14] New patchset: MaxSem; "WIP: OSM module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36222 [18:42:20] robh: i didn't merge it [18:50:49] !log analytics1007 troubleshooting hdd/pxe boot issue (going to power cycle several times) [18:50:56] Logged the message, Master [18:51:19] apergos: in re the replag on fundraising dbs, that happens daily when db78 and db1025 get backups dumped. nothing to worry about [18:51:35] ok, thanks for the info [18:51:46] thanks for keeping an eye on them overnight [18:53:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [19:11:20] !log authdns-update for server antimony [19:11:27] Logged the message, RobH [19:18:25] !log pgehres synchronized php-1.21wmf5/extensions/CentralNotice/ 'Updating CN to master. More banner buckets for A/B testing' [19:18:31] Logged the message, Master [19:20:29] RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [19:21:16] !log pgehres synchronized php-1.21wmf5/resources/startup.js 'Touching to update RL' [19:21:23] Logged the message, Master [19:22:30] New patchset: Ottomata; "Including base::sysctl in misc::udp2log::sysctl to make sure that start procps can be notified." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36228 [19:22:55] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36228 [19:23:51] !log pgehres synchronized php-1.21wmf4/extensions/CentralNotice/ 'Updating CN to master. 
More banner buckets for A/B testing' [19:23:58] Logged the message, Master [19:24:22] !log pgehres synchronized php-1.21wmf4/resources/startup.js 'Touching to update RL' [19:24:30] Logged the message, Master [19:25:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:16] !log temp stopping puppet on brewster [19:30:23] Logged the message, notpeter [19:32:20] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [19:34:39] New patchset: RobH; "wtp1001 to lvm.cfg since its hwraid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36231 [19:35:18] New review: RobH; "this isnt self review, and these arent the droids you are looking for" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/36231 [19:35:18] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36231 [19:37:11] ^demon: bleh, we manually partitioned the current gerrit server. [19:37:22] the new server is hardware raid dual 500GB [19:37:33] what partman should we use? [19:37:46] <^demon> I don't know, really :( [19:37:47] (if you dunno the ones available just describe what is ideal and i will find a match) [19:37:55] so its hwraid1 [19:39:28] looks like a basic swap with rest / [19:39:32] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [19:39:49] did you wanna make it identical to that, or do something with a bit more complexity? [19:40:11] (you dont have to have the answer now, just asking cuz i assume i'll do the install for you later) [19:44:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [19:47:12] <^demon> RobH: I can't really think of why it'd need to be more complex than that. [19:48:16] ok, so just swap and rest / ext3? [19:48:34] cuz the lvm recipe will do that [19:50:54] btrfs all the way. [19:50:56] <^demon> Sounds fine. 
[19:53:16] ^demon: you should bribe LeslieCarr to do https://rt.wikimedia.org/Ticket/Display.html?id=3995 [19:53:22] so i can install your server ;] [19:54:02] New patchset: Catrope; "Add an LVS group for Parsoid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36233 [19:59:22] New review: Ottomata; "Ah, Asher, I just realized that these fields aren't tab separated. I'm not sure how puppet gets thi..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36053 [19:59:37] binasher ^ [20:00:00] can someone +2 this for me https://gerrit.wikimedia.org/r/36189 [20:01:30] i gotcha cmjohnson1 [20:01:38] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36189 [20:01:38] thx [20:01:45] need sockpuppet merge or do you have that? [20:02:00] i can do that [20:02:01] thx [20:02:02] cool [20:02:21] ottomata: was that gerrit comment a request for tab delimiters? [20:02:53] yup [20:03:10] it was in the original email too, I just realized that this format didn't have them [20:03:13] didn't see that yesterday [20:03:18] no prob [20:03:27] I could make the change in puppet, but i'm not sure i should merge it, not sure where to babysit [20:03:47] in my tests, LOG_FORMAT was being set for varnish in /etc/default/varnishncsa [20:04:06] and i'm not sure where this puppet log_fmt parameter pushes that line through to on varnishes [20:04:13] tabs and quoting can be tricky sometimes [20:04:45] New patchset: Asher; "analytics wants varnishncsa fields to be tab delimited" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36235 [20:05:02] danke [20:05:18] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36235 [20:12:24] cool, thanks binasher, I see the tabs already :) [20:14:54] New patchset: Ryan Lane; "Remove update-manager-core in labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36238 [20:15:47] PROBLEM - Puppetmaster HTTPS on stafford is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:10] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36238 [20:16:21] https://bugzilla.wikimedia.org/show_bug.cgi?id=42585 Is the code for this in Gerrit somewhere? [20:16:25] (http://dumps.wikimedia.org/other/pagecounts-raw/) [20:20:32] Ryan_Lane: so, two spaces huh [20:20:38] ? [20:20:47] that's the upstream standard [20:20:54] I know that [20:21:01] mark objected to that :-) [20:21:07] and you didn't like it either iirc [20:21:10] I hate it [20:21:12] so we never actually decided :-) [20:21:19] but I use the standard when there's one [20:21:38] I don't like pep8 much either, but I use that on projects that use it [20:29:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.336 seconds [20:44:32] New patchset: Raimond Spekking; "Enable AFTv5 for dewiki (WiP - DO NOT MERGE)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34964 [20:44:56] New patchset: Ryan Lane; "Add upstream apache module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36246 [20:55:39] !log for log completeness - wikivoyage.com usable, redirects to .org [20:55:47] Logged the message, Master [21:00:20] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [21:00:20] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [21:04:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:16:23] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [21:16:23] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [21:20:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds [21:23:32] Lesliecarr: gotta a minute [21:37:14] New patchset: Ottomata; "filters.oxygen.erb 
- adding new TIM Brasil Wikipedia Zero filter." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36301 [21:37:53] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36301 [21:38:41] Ryan_Lane [21:38:46] ok to merge this on sockpuppet? [21:38:46] Remove update-manager-core in labs [21:38:47] ottomata [21:38:48] :) [21:38:51] yes [21:38:53] cool [21:49:23] has anyone worked with the nagios 'scheduled downtime' nagios feature? [21:49:41] Jeff_Green: yep [21:50:08] it looks like it's appropriate for a recurring outage thing, does it actually work? [21:50:16] Jeff_Green: via webui is easier,, unless you need mass scheduling [21:50:22] it does [21:50:26] ok cool [21:50:29] should turn off notifications [21:50:48] add https to nagios url to get the login screen [21:51:05] New patchset: Raimond Spekking; "Enable AFTv5 for dewiki (WiP - DO NOT MERGE)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34964 [21:51:16] mutante: got it--thanks! [21:51:21] after login go to host page and use the link, optionally you can make it fixed or not and leave a comment [21:51:26] sure,np [21:53:11] mutante: I'm confused about how to make it recurring [21:53:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:54:43] bah, it's not recurring [21:54:51] gargabharga. [21:55:03] Jeff_Green: uhm..i'm afraid then you need to do the shell version and cron [21:55:22] but its possible.. lemme find it [21:55:48] http://wikitech.wikimedia.org/view/Nagios#Scheduling_downtimes_with_a_shell_command [21:55:54] New patchset: Ryan Lane; "Add apache vhost for deployment system" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36303 [21:56:00] there you got a loop to mass schedule a group of hosts [21:56:22] so this gets put on a cron job? [21:56:36] via shell command.. 
you would need a wrapper around that that changes the timestamps [21:56:43] and cron, yea [21:56:53] alright--thank you [21:56:57] yw [21:57:23] it is all about writing the right lines into /var/lib/nagios3/rw/nagios.cmd [21:57:47] yeah [21:58:05] this just became a whole lot more painful than I can deal with today :-( [21:58:05] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36246 [22:02:43] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36303 [22:04:59] Jeff_Green: actually.. you dont need cron.. if you know all the start and end times of all periods you want it down in advance, you just need to write them all to the file once i think [22:05:17] just need to have all start and end times in unix timestamp [22:05:34] mutante: it's daily forever--to ignore replag while a mysql dump happens [22:06:00] so, cron to work around nagios glaring feature hole [22:06:09] i'm going to unmonitor for now and figure it out next week [22:06:40] New review: Krinkle; "So which one is this?" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/36105 [22:06:42] http://exchange.nagios.org/directory/Addons/Scheduled-Downtime/Schedule-Downtime-via-cron/details [22:07:02] yup, there are several solutions.. its been done by others in a couple ways..also on the nagios exchange [22:07:11] i guess ideally we wanted it in puppet [22:07:30]
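
Editor's sketch of the cron approach discussed above: a small wrapper that recomputes the Unix timestamps each day and builds one Nagios external command line per service. It uses the standard `SCHEDULE_SVC_DOWNTIME` external command and the command file path mentioned in the log. The hosts and the "MySQL Slave Delay" service name are taken from the alerts earlier in the log; the 08:00 UTC start, the 150-minute length, the `jgreen` author label, and all function names are assumptions made for illustration, not what was actually deployed.

```python
from datetime import date, datetime, timedelta, timezone
import time

# Path quoted in the channel; adjust if the Nagios install differs.
NAGIOS_CMD_FILE = "/var/lib/nagios3/rw/nagios.cmd"


def daily_window(day, start_hour_utc, length_minutes):
    """Return (start, end) Unix timestamps for a window on the given UTC day."""
    start = datetime(day.year, day.month, day.day, start_hour_utc,
                     tzinfo=timezone.utc)
    end = start + timedelta(minutes=length_minutes)
    return int(start.timestamp()), int(end.timestamp())


def svc_downtime_line(host, service, start, end, author, comment, now=None):
    """Build one SCHEDULE_SVC_DOWNTIME line (fixed downtime, no trigger)."""
    if end <= start:
        raise ValueError("downtime must end after it starts")
    now = int(time.time()) if now is None else int(now)
    # Field order: host;service;start;end;fixed;trigger_id;duration;author;comment
    return (f"[{now}] SCHEDULE_SVC_DOWNTIME;{host};{service};"
            f"{start};{end};1;0;{end - start};{author};{comment}\n")


def submit(lines, cmd_file=NAGIOS_CMD_FILE):
    """Append command lines to the Nagios command pipe (needs write access)."""
    with open(cmd_file, "a") as pipe:
        pipe.writelines(lines)


if __name__ == "__main__":
    # Hypothetical daily backup window: 08:00 UTC for 150 minutes.
    start, end = daily_window(date.today(), 8, 150)
    for host in ("db78", "db1025"):
        # Printed rather than submitted, so the sketch is safe to run anywhere.
        print(svc_downtime_line(host, "MySQL Slave Delay", start, end,
                                "jgreen", "daily backup dump"), end="")
```

Run once a day from cron shortly before the backup starts, with `print` swapped for `submit`, this would cover the recurring case. As noted in the channel, the same lines for many future days could also be written in a single pass, since Nagios only needs the start and end timestamps in advance.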