[00:03:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [00:09:50] Change merged: Tim Starling; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/36149 [00:11:03] PROBLEM - Puppet freshness on lvs3 is CRITICAL: Puppet has not run in the last 10 hours [00:11:03] PROBLEM - Puppet freshness on lvs4 is CRITICAL: Puppet has not run in the last 10 hours [00:23:57] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [00:35:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:50:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.478 seconds [01:24:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:38:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.877 seconds [02:13:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:24:47] !log LocalisationUpdate completed (1.21wmf5) at Mon Dec 3 02:24:47 UTC 2012 [02:25:01] Logged the message, Master [02:28:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.924 seconds [02:46:35] !log LocalisationUpdate completed (1.21wmf4) at Mon Dec 3 02:46:35 UTC 2012 [02:46:45] Logged the message, Master [03:20:55] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [03:21:24] New review: Tim Starling; "Fine apart from inline comment." [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/35818 [03:33:59] New review: Tim Starling; "I wouldn't like to see this change merged and I2c6ab07d ignored, since I2c6ab07d is ready to deploy ..." [operations/apache-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/34113 [03:35:55] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [03:42:58] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [05:27:56] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [07:38:58] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [08:07:30] jeremyb: https://bugzilla.wikimedia.org/show_bug.cgi?id=40582#c5 [08:32:16] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [08:51:28] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.013 second response time on port 11000 [09:07:04] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [09:07:04] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [09:22:53] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [09:22:53] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [10:11:56] PROBLEM - Puppet freshness on lvs3 is CRITICAL: Puppet has not run in the last 10 hours [10:11:56] PROBLEM - Puppet freshness on lvs4 is CRITICAL: Puppet has not run in the last 10 hours [10:25:30] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [12:01:36] New review: Aude; "note that we want to deploy the mw1.21-wmf5 branch, which is what was tagged on Friday + security fi..." 
[operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/36205 [12:57:58] !log Shutdown slauerhoff's switchports [12:58:10] Logged the message, Master [13:08:33] New patchset: Dereckson; "(bug 42644) Throttle for Library of Israel editathon" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36548 [13:10:23] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36548 [13:12:07] !log demon synchronized wmf-config/throttle.php 'Deploy I826b9114' [13:12:16] Logged the message, Master [13:20:44] New review: Dereckson; "This doesn't match community desire, as expressed on " [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/33390 [13:22:18] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [13:36:37] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [13:44:34] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [13:58:01] New review: Hashar; "recheck" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/31575 [14:00:05] New patchset: Hashar; "validating new Jenkns job (do not submit)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36552 [14:04:53] New review: Hashar; "recheck" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/36552 [14:07:55] Change abandoned: Hashar; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36552 [14:12:11] Change restored: Hashar; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36552 [14:12:23] New patchset: Hashar; "validating new Jenkns job (do not submit)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36552 [14:12:59] Change abandoned: Hashar; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36552 [14:36:22] cmjohnson1: are we still making the switch today? [14:36:52] not sure...apergos: how are we looking for replacing be3 and be5 [14:37:37] cmjohnson1: I will set weights to zero on ms-be3 toay [14:37:43] after that you can pull it [14:37:58] I'm not going to do the other host at the same time, I'll wait two days and we;ll see how it looks [14:38:08] *grumble* [14:38:08] k..sounds good [14:42:02] why not do it at the same time? [14:45:28] because the new machines are still far from being caught up [14:48:30] strange [14:51:10] yeah, I am not excited about it either [14:53:32] doesnt' the process take about 4 weeks [14:53:33] ? [14:57:19] so we're doing ms-be3 today or tomorrow and ms-be5 at the end of the week? [14:57:48] sbernardin: be3..but not yet [14:58:57] ok [15:00:29] ms-be3 today [15:00:50] ms-be5 we'll see, if I think we can get away with it in two days we'll do it in two days (your schedules permitting) [15:01:39] ok...sounds good [15:14:56] !log upgrade mwlib to 0.14.2 [15:15:05] Logged the message, Master [15:28:28] New patchset: Mark Bergsma; "Add ms-be3004" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36568 [15:29:04] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [15:29:31] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36568 [15:29:56] sbernardin: let's work on mw16 now. rt3951...bad dimm on mw16 [15:34:36] cmjohnson1: ok [15:35:23] sbernardin: i am going to take the server down. 
I want you to swap the DIMM in A1 to B1 and than turn the server back on and let's see if the error moves w/the dimmm or sticks around [15:35:49] !log mw16 shutting down to troubleshoot DIMM error rt3951 [15:35:58] Logged the message, Master [15:36:51] sbernardin: you should be good to go...ping me once it's back on [15:39:40] PROBLEM - Host mw16 is DOWN: PING CRITICAL - Packet loss = 100% [15:43:43] cmjohnson1: swapped DIMM A1 and B1 [15:43:43] RECOVERY - Puppet freshness on lvs3 is OK: puppet ran at Mon Dec 3 15:43:32 UTC 2012 [15:43:57] cmjohnson1: mw16 powered back on [15:45:06] okay..great....now we will wait a few days and see if the error returns and where..that will determine whether or not it is simple DIMM error or a problem with the main board or cpu1 [15:45:22] RECOVERY - Host mw16 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [15:54:40] RECOVERY - Puppet freshness on lvs4 is OK: puppet ran at Mon Dec 3 15:54:13 UTC 2012 [15:57:47] did you ran memtest86+? [16:10:13] platonides: the r410 has a history of bad DIMM modules on the mainboard...the easiest troubleshoot is to swap the DIMM and see if the error follows [16:13:52] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [16:30:04] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.049 second response time [16:39:45] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36485 [16:49:37] !log Installed ms-be3004 [16:49:48] Logged the message, Master [16:57:12] !log Installed ms-be3003 [16:57:21] Logged the message, Master [17:28:26] New patchset: ArielGlenn; "object replication with up to 3 rsyncs at once, up from 2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36581 [17:30:15] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36581 [17:30:55] !log reedy synchronized wmf-config/InitialiseSettings.php [17:31:04] Logged the message, Master [17:31:05] snapshot1002: rsync: change_dir#3 "/apache/common-local" failed: No such file or directory (2) [17:31:05] snapshot1002: rsync error: errors selecting input/output files, dirs (code 3) at main.c(643) [Receiver=3.0.9] [17:31:31] that sounds like a message for apergos [17:31:53] no, it's a message for "yeah I gotta get back to working on that bax" [17:32:03] it was pulled for upgrade, ignore for now, sorry for the noise [17:33:22] alright [17:33:28] *box ! [17:33:35] weird typos are weird [17:34:16] * jeremyb sometimes wonders if paravoid knows what you mean when you type english in greek letters [17:35:19] well it's not the normal way you would type english words in greek [17:35:26] but still one gets used to it (layout typos) [17:35:49] I get screwed once in awhile by someone with a french keyboard who makes the opposite mistake: gr but with latin letters [17:35:55] his mapping is relaly bizarre [17:36:08] my DSL flapped again [17:36:09] sigh [17:36:24] ugh [17:36:30] still from the storms? [17:36:34] ;/ [17:36:44] it's either that humidity goes up or that people are getting back from work and crosstalk is increased [17:37:06] telco said they're going to fix it on Wednesday, but it worked more or less today [17:37:25] that will make it harder for them to chase it down, if it works when they show up [17:37:36] no, it's easy [17:37:42] go to the flooded junction box [17:37:45] remove the water [17:38:10] oh, there's not a general problem with the cables? 
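For context on the ms-be3 drain apergos describes above: taking a Swift storage node out of service is normally done by dropping its devices to weight 0 in the ring and rebalancing, after which replication gradually moves partitions elsewhere, which is why it takes weeks rather than hours. A minimal shell sketch, assuming the standard object ring; the device IDs are placeholders, not the actual ms-be3 devices:

    swift-ring-builder object.builder                    # list devices and note the ids belonging to ms-be3
    swift-ring-builder object.builder set_weight d12 0   # repeat for each of that node's devices
    swift-ring-builder object.builder set_weight d13 0
    swift-ring-builder object.builder rebalance          # reassign partitions, then distribute the new ring files

The same steps would be repeated for the account and container builders before physically pulling the box.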
[17:38:38] (I lived in an apartment where things got worse with storms, but it was not that simple a fix) [17:38:57] i think you forgot "replace the copper" [17:39:56] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [17:40:07] http://www.theverge.com/2012/11/17/3655442/restoring-verizon-service-manhattan-hurricane-sandy [17:41:08] LeslieCarr: I think I finally found out what was wrong with that Parsoid LVS thing: https://gerrit.wikimedia.org/r/36376 [17:41:28] hahaha [17:41:29] :) [17:41:39] oh humans and our inability to sometimes see obvious [17:42:11] hah [17:42:43] Yeah [17:42:48] I pulled my hair out over this for hours [17:43:05] I see Patrick merged it on Saturday so it should already be live [17:43:41] New patchset: Jarry1250; "Add librsvg to contint puppet manifest" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36583 [17:45:45] But somehow it still doesn't work. wtp1 is listening on the service IP correctly, but pinging the service IP doesn't work (I get destination unreachable from the router) and HTTP requests don't work either [17:53:09] Hmm, I found out that I apparently had to create https://noc.wikimedia.org/pybal/pmtpa/parsoid [17:53:26] I'm suddenly very concerned about pybal apparently being dependent on noc [17:53:47] ACKNOWLEDGEMENT - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn ack test via android app on tablet [17:59:54] yay, nagios acknowledgement by hitting touchscreen on tablet [18:00:09] Jeff_Green: [18:00:57] http://exchange.nagios.org/directory/Addons/Frontends-%28GUIs-and-CLIs%29/Mobile-Device-Interfaces/aNag/details [18:03:03] RoanKattouw: sounds like a good thing to put in a ticket to puppetize that part [18:03:11] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [18:04:14] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [18:04:20] Well, also [18:04:38] I'm very curious why that stuff is on /home and on a web server that. last I knew, ran on fenari [18:05:40] i usually find with those questions that the answer is that 5 years ago there either was a very valid reason or it was way easier at the time [18:08:17] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [18:09:18] mutante: neat [18:14:52] New patchset: Aaron Schulz; "Enabled retries for jobs that fail to be acknowledged as done." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/35818 [18:16:08] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/35818 [18:16:50] !log aaron synchronized wmf-config/CommonSettings.php 'deployed b999dbca1f8ca0a76fe327dd1ae30beb6eed462f' [18:16:58] Logged the message, Master [18:19:29] Ryan_Lane, any sign of Mike? Do you know if he's been invited to IRC and/or to the meeting? [18:19:48] andrewbogott: not sure. ct may have more info [18:20:06] woosters, got an email address for Mike W? [18:21:11] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.054 second response time [18:23:25] anyone knows which hardware had been purchased for OSM? [18:23:59] ping wangatlargo [18:25:25] LeslieCarr: Aha, I found out there are a few hoops to jump through when changing LVS https://wikitech.wikimedia.org/view/LVS#Deploy_a_change_to_an_existing_service [18:25:42] woosters: Ok, at least he knows where to find us; that was my main concern. 
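RoanKattouw's discovery above, that he had to create https://noc.wikimedia.org/pybal/pmtpa/parsoid before the service would pool, plus the unreachable service IP, suggest a quick checklist when bringing up a new LVS service. A rough sketch; the service IP and port are placeholders rather than the real Parsoid values:

    curl -s https://noc.wikimedia.org/pybal/pmtpa/parsoid    # pybal's host list for the pool: does it exist and list backends?
    sudo ipvsadm -L -n | grep -A5 '<service-ip>'             # on the LVS host: is the virtual service with its realservers present?
    curl -sI http://<service-ip>:<port>/ | head -n1          # does the VIP answer HTTP at all?

If ipvsadm shows nothing, pybal never picked the service up (config or restart missing); if the service is there but the VIP still fails, the problem is more likely route advertisement or the backends themselves.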
[18:25:47] !log installing package upgrades on spence [18:25:55] Logged the message, Master [18:26:45] hi, [18:27:24] andrewbogott, ryan_lane - wangatlargo is in the channel now [18:27:32] wangatlargo: hi there [18:27:49] Hi, Ryan [18:27:55] folks, wangatlargo is our part time consultant, helping out labs [18:28:01] welcome aboard! [18:28:01] LeslieCarr: You wanna sit down and go through those steps after the ops meeting? I can theoretically do that by myself but that doesn't seem like a very responsible thing to do [18:28:02] Hi Andrew, [18:28:12] sure Roan [18:28:23] Excellent [18:30:06] wangatlargo: after the meeting let's start setting your accounts up [18:30:25] good [18:33:47] anyone knows which hardware had been purchased for OSM? [18:33:55] package upgrade broke apache on spence.. missing apache config.. looking [18:34:00] MaxSem: i can field that [18:34:08] MaxSem: Computers [18:34:12] MaxSem: so we have the servers for db, caching, etc [18:34:14] r310s [18:34:27] MaxSem: however, the hard disks had issues, they didnt ship with the carriers or cables [18:34:40] we have gotten dell to fix this, and the carriers have arrived onsite, but the cables have not yet [18:34:51] otherwise they are all racked up and ready to have the OS installed [18:35:34] 4 osm caching, 4 osm database, and 4 osm apache servers [18:35:42] in each site (eqiad and sdtpa) [18:36:00] RobH, thank you [18:36:04] nice :) [18:36:07] (osm-db#, osm-cp#, and osm-web#) [18:36:17] 1-4 in tampa, 1001-1004 in eqiad [18:36:19] RobH: that's per datacenter? [18:36:20] ah [18:36:21] ok [18:36:22] * Ryan_Lane nods [18:36:54] yea both sdtpa and eqiad have a total of 12 osm servers each site, total 24 osm servers in wmf cluster [18:37:00] :D [18:37:08] now if we can get the damned cables in [18:40:01] RobH, why in both datacenters? for migration or redundancy? [18:40:09] .... [18:40:18] !log deleting broken link to nagios3.conf on spence and restarting apache [18:40:28] Logged the message, Master [18:40:28] because everything we have needs to be in multiple datacenters? [18:40:46] we need to build all services in both sites for cross site redundancy. [18:40:54] or else whats the point? [18:41:42] ideally whatever service it is can balance between both being live [18:41:49] but i have no idea how thats going to be structured, so no clue. [18:41:52] New patchset: Reedy; "enwiki to 1.21wmf5" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36588 [18:42:05] but at miniumum if one site goes dark we have the other to run it [18:42:47] yeah, load balancing... http://wiki.openstreetmap.org/wiki/Developer_FAQ#Why_don.27t_we_spread_the_load_on_the_OpenStreetMap_database_across_a_number_of_servers.3F [18:44:32] well, that's for people making changes on the maps, right? [18:44:48] and we're just going to be providing tile services? [18:45:10] if that's the case I can't see why this can't be load balanced geographically fairly easily [18:45:13] Ryan_Lane, actually OSM runs on just one tileserver:P [18:45:35] heh. we just told you why we'd have more than one [18:47:36] MaxSem: there will be multiple tile servers [18:48:04] mutante: would you be interested in being primary for open street maps? 
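On the spence breakage logged above (a package upgrade leaving Apache with a dangling include), the usual shape of the fix is sketched here with Debian-era paths; the exact location of the broken nagios3.conf link is an assumption:

    find /etc/apache2 -xtype l                 # -xtype l matches symlinks whose target no longer exists
    sudo rm /etc/apache2/conf.d/nagios3.conf   # hypothetical path for the dangling link named in the !log entry
    sudo apache2ctl configtest && sudo service apache2 restart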
[18:48:53] New patchset: Jgreen; "offhost_backups should only copy gzipped db dumps" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36591 [18:49:51] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36591 [18:58:25] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36588 [19:00:14] * MaxSem reads about Postgres replication and pukes [19:01:14] MaxSem: don't replicate [19:01:22] wangatlargo, you there? [19:01:33] they don't need replication [19:01:48] mutante: please take over osm [19:04:59] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: enwiki to 1.21wmf5 [19:05:08] Logged the message, Master [19:07:52] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [19:07:52] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [19:09:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:10:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.049 seconds [19:22:36] New patchset: Reedy; "Defrag commented out readonly statements" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36594 [19:23:46] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [19:23:46] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [19:24:25] New patchset: Reedy; "Remove $wgLegacySchemaConversion" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36595 [19:28:21] binasher: http://ganglia.wikimedia.org/latest/?c=MySQL%20pmtpa&h=db63.pmtpa.wmnet&m=cpu_report&r=week&s=by%20name&hc=4&mc=2 [19:28:27] I wonder what's with the 28th [19:28:50] hmm, mysql uptime reset...must have been a restart [19:29:48] !log reedy synchronized wmf-config/CommonSettings.php 'wikidatawiki marked readonly' [19:29:57] Logged the message, Master [19:30:47] New review: Aaron Schulz; "Summary typo" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/36595 [19:40:51] binasher: Don't suppose I can get away with adding a column (and index on that column) to a wikidata table with around 10M rows, can I? [19:41:10] (I know you're in a meeting atm too) [19:41:43] Reedy: no… and the wikibase extension schema is kind of fucked up [19:42:26] Just kind of? ;) [19:42:26] hrm .... [19:42:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:56] do we need a term_id column there? or what? 
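Reedy's question above, adding a column and an index to a roughly 10M-row table on a live master, is the classic case for a trigger-based online schema change rather than a plain ALTER. A hedged sketch with pt-online-schema-change (not necessarily the tool used here); the host, table and column names are placeholders, and the tool needs the table to already have a primary key or unique index so it can chunk the copy:

    # dry run first: creates and alters the new (empty) table, but copies no rows and swaps nothing
    pt-online-schema-change --dry-run \
        --alter "ADD COLUMN new_col INT UNSIGNED NOT NULL DEFAULT 0, ADD INDEX (new_col)" \
        h=localhost,D=wikidatawiki,t=some_table
    # rerun with --execute once the dry run is clean; watch replication lag throughout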
[19:43:02] * aude not a db expert [19:43:26] having a PK column makes using online schema change possible (IIRC) [19:43:32] ok [19:44:24] New review: Hashar; "indeed :)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/36583 [19:45:18] binasher: https://gdash.wikimedia.org/dashboards/filebackend/ those lockmanger graphs should be split out from the streamfile ones [19:47:00] * AaronSchulz lols at the stats taking 1500ms [19:47:55] * AaronSchulz sighs at multiwrite slowing things down too [19:51:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.048 seconds [19:53:50] New patchset: Raimond Spekking; "Enable AFTv5 for dewiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/34964 [19:54:14] New patchset: Jdlrobson; "ensure all photo uploads go to commons" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36103 [19:57:24] sbernardin: you are good to set up the new ms-be3 [19:57:40] the mgmt ip is in the ticket [19:57:55] thanks for working on it! [19:58:39] sbernardin: rt3855 [19:58:59] AaronSchulz: heh, as expected... [19:59:21] binasher: Does truncating a large table replicate "well" (quickly)? [20:00:28] paravoid: can you take that bug? [20:03:42] Reedy: yeah, truncate should generally be ok [20:09:00] binasher: in https://ishmael.wikimedia.org what is the number in parenthesis for % time ? [20:11:33] AaronSchulz: time in seconds [20:11:41] ah, seconds [20:11:58] clicking "more" gives confusing numbers then [20:13:33] that time in parenthesis on the main page isn't very useful [20:14:02] it's per sample not per instance right? [20:14:13] as in, a sum [20:14:44] yeah, a sum over the time period, or a sum over the time period for queries that took > the slow query threshold [20:15:19] it would be more interesting if ishmael ingested 100% of queries [20:19:25] what's the prognosis for deploying Wikibase to test2 today? are the schema problems blockers? 
[20:19:53] robla: seems so [20:20:12] first step is to update the repo (wikidata.org) and it requires the schema changes / [20:20:24] or plan is probably truncate one of the tables and rebuild it with a script [20:23:26] !log reedy synchronized wmf-config/CommonSettings.php 'wikidatawiki back to writeable' [20:23:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:33] Logged the message, Master [20:26:56] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [20:30:16] AGHHHHH [20:30:17] MEETING [20:30:19] AGGGGH [20:30:23] WHY DOES MY CALENDAR FAIL ME SO [20:32:07] !log kaldari synchronized php-1.21wmf5/resources/startup.js 'syncing startup.js per Roan' [20:32:15] Logged the message, Master [20:32:31] woosters, so sorry about missing the meeting, trying to figure out why calendar doesn't notify me now [20:32:41] I'm updating the etherpad with my updates right now too [20:32:44] np [20:32:46] we gave you all the work [20:32:50] no worries [20:32:57] haha [20:33:13] !log added max_user_connections to wikiadmin grants on s1 [20:33:22] Logged the message, Master [20:39:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.019 seconds [20:42:03] mutante, so will you be the OSM guy?:) [20:42:53] !log kaldari synchronized php-1.21wmf5/extensions/Vector/modules/ext.vector.collapsibleNav.css 'syncing vector css for menus' [20:43:01] Logged the message, Master [20:43:14] AaronSchulz or Reedy, can you think of a reason we would not want to provide a copy of the interwiki cdb file for download to the public? [20:43:50] I am thinking that this with some instructions about a few lines to add, is the quickest way to get a modern mw release up to date on interwikis, since the sql table for a wiki they import might well be empty [20:45:48] wangatlargo: ok, let's start with getting you labs and bugzilla access [20:46:17] wangatlargo: do you have an iphone or android, or something else? [20:47:38] I am using computer now. Is that work? [20:47:57] !log jenkins / gallium : updating Zuul repository 2747501..f73a04d , should be updated by puppet automatically [20:48:04] !log deploying squid config for the updated mobile ACL [20:48:05] well, for one part of this you're going to need a mobile device for two-factor-authentication [20:48:08] Logged the message, Master [20:48:15] Logged the message, Master [20:48:36] wangatlargo: make an account here: https://bugzilla.wikimedia.org/ [20:48:51] also, add a request here: http://www.mediawiki.org/wiki/Developer_access [20:49:16] Ryan_Lane: hi :) I have marked you as on duty in this channel topic. [20:49:31] Ryan_Lane: someone from ops told me so :-D [20:49:39] yeah. I'm on duty [20:50:11] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [20:51:41] Ryan_Lane: and here is a very simple one, adding librsvg on the contint server https://gerrit.wikimedia.org/r/36583 :-] [20:51:53] Ryan_Lane: basically adding librsvg2-bin [20:52:27] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36583 [20:52:37] done [20:53:20] thanks ryan! [20:53:24] yw [20:53:29] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.050 second response time [20:53:46] Ryan_Lane: and I got zuul in production for real, it is triggering a few extensions already. 
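The s1 grant change !logged above ("added max_user_connections to wikiadmin grants") corresponds to MySQL's per-account resource limit. A minimal sketch; the host pattern and the limit value are assumptions, not what was actually applied:

    mysql -e "GRANT USAGE ON *.* TO 'wikiadmin'@'10.%' WITH MAX_USER_CONNECTIONS 80;"
    mysql -e "SHOW GRANTS FOR 'wikiadmin'@'10.%';"   # confirm the limit is attached to the account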
[20:53:51] sweet [20:54:12] Ryan_Lane: fixing up ops/puppet right now (Zuul did not like `production` as a HEAD) [20:54:22] Ryan_Lane: will most probably enable mediawiki core this week :-] [20:54:26] Ryan: account created at https://bugzilla.wikimedia.org/ with email mwang@wikimedia.org. I already created an account at http://www.mediawiki.org/wiki/Developer_access using mwang@wikimedia.org. I can login http://www.mediawiki.org/wiki/Developer_access now [20:54:52] wangatlargo: well, you want to add a request to that page [20:57:41] wangatlargo: you should also join #wikimedia-labs channel [20:57:54] and subscribe to labs-l: https://lists.wikimedia.org/mailman/listinfo/Labs-l [20:57:59] I added a request to that page on Sat. It already done. [20:58:17] ah [20:58:24] what's your labsconsole username? [20:58:35] http://www.mediawiki.org/wiki/Developer_access#User:Mwang [20:58:36] ? [20:58:44] Mwang [20:58:48] ok [20:58:53] did you log into labsconsole yet? [20:59:03] you should also log into gerrit: gerrit.wikimedia.org [20:59:18] gerrit/labsconsole use the same username/password [20:59:30] you should upload an ssh key to each [21:00:25] I've added you to a bunch of groups on labsconsole [21:00:38] you'll need to enable two-factor authentication before you can use most of the interfaces there [21:01:01] https://labsconsole.wikimedia.org/wiki/Special:OATH [21:01:03] I can log into labsconsole. [21:03:29] I can log in gerrit.wikimedia.org now. [21:04:55] !log adding mwang to the ops ldap group [21:05:03] Logged the message, Master [21:05:33] I am creating keys now. Then I will upload my public key. [21:05:39] ok [21:07:27] apergos: I guess it can't hurt [21:08:54] ok, I'll see if Reedy knows any reasons why not, otherwise out it goes [21:08:55] thanks [21:13:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:15:58] wangatlargo: welcome! [21:28:56] !log DNS update - adding doc and integration CNAMEs [21:29:00] Ryan: pub_key uploaded to labconsole and gerrit.wikimedia.org. two-factor authentication enabled. I wrote down the TOKEN and kept it in a safe place. [21:29:04] Logged the message, Master [21:29:17] wangatlargo: did you write down all scratch tokens? [21:29:48] mutante: thanks :-] [21:30:05] hashar: doc.wikimedia, doc.mediawiki.. integration.. as you requested [21:30:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.042 seconds [21:31:38] mutante: works for me, you can close the RT :) [21:31:54] done:) [21:32:00] no I have to send the apache conf snippet to serve the new entries! [21:32:01] \O/ [21:32:56] yea, so you have that? nice [21:33:51] cmjohnson1: ms-be3 is up [21:34:21] cmjohnson1: plugged into mrjp-b2-sdtpa port 12 [21:34:47] sbenardin: go ahead and finish the setup...once you are done. create a network ticket w/the above informaiton [21:35:20] cmjohnson1: ok [21:35:25] LeslieCarr: Let me know when you're ready to do the LVS thing (looks like you're out to lunch now) [21:36:35] !log aaron synchronized php-1.21wmf5/extensions/CentralAuth/CentralAuthUser.php 'added exception' [21:36:43] Logged the message, Master [21:36:56] Ryan: Yes, I wrote down all 5 scratch TOKENS. 
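For the key-upload step being walked through above, the client side is just generating a keypair and pasting the public half into the labsconsole and Gerrit SSH settings pages; the filename and comment below are arbitrary:

    ssh-keygen -t rsa -b 4096 -C "mwang@wikimedia.org" -f ~/.ssh/id_rsa_wmf
    cat ~/.ssh/id_rsa_wmf.pub    # paste this line into each web interface; the private key never leaves the laptop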
[21:37:01] great [21:37:23] wangatlargo: let's discuss things further at #wikimedia-labs [21:39:11] * Damianz scratches wangatlargo's token [21:43:18] 2012-12-03 21:43:03 refreshLinks2 Module:Taxobox-fr table=templatelinks start= end= STARTING [21:43:20] The luasandbox extension is not present, this engine cannot be used. [21:43:34] notpeter: lua is not on srv193 [21:55:53] RoanKattouw: hey i'm here now [21:55:58] Awesome [21:56:00] want to sit physically together and do this ? [21:56:06] Yes, I'll come over [22:00:35] RoanKattouw: why are exceptions suppressed even in cli mode? [22:00:58] New patchset: Ryan Lane; "Add sync_all deploy function" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36650 [22:01:02] <^demon|brb> The exception output stuff probably doesn't differ per sapi. [22:01:46] when I run jobs, all I get is the "enable $wgShowExceptionDetails" crap [22:01:49] <^demon|brb> Well, the check for outputting a stacktrace, prolly. [22:01:52] enotif jobs [22:02:10] are problem to Special:BlockList after wmf5 update already known? [22:02:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:03:13] anyone know why lvs1005 and lvs1006 are not actually bgp peering ? are they supposed to be down ? [22:03:23] or would i kill stuff if i restarted pybal ? [22:03:25] http://en.wikipedia.org/wiki/Special:BlockList / http://www.wikidata.org/wiki/Special:BlockList [22:04:24] New patchset: Catrope; "Add the eqiad parsoid VIP to lvs1003/1004 as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36651 [22:05:50] LeslieCarr: ---^^ [22:07:05] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36651 [22:07:06] !log aaron synchronized wmf-config/CommonSettings.php [22:07:14] Logged the message, Master [22:12:30] sbernardin: how are things coming along w/ms-be3 [22:12:44] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/35135 [22:15:05] PROBLEM - Host ms-be3 is DOWN: PING CRITICAL - Packet loss = 100% [22:15:49] yeah I ran into that (cli exception not being useful) also, it was pretty annoying [22:15:55] I would like to see that changed [22:17:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.346 seconds [22:21:54] RobH: So just to check, what are the specs on the remaining misc boxes you have? Are they the same as wtp1/wtp1001 (6-core 1.90GHz)? [22:22:09] same as wtp1001 [22:22:20] I ask because Parsoid (the service I wanna run on a couple of them) is highly CPU-bound and every core helps [22:22:29] !log restarting pybal on lvs3 [22:22:37] Logged the message, Mistress of the network gear. [22:22:39] yep, same servers as wtp1001 [22:22:45] dual 6core [22:22:59] dual? [22:23:08] /proc/cpuinfo only shows me 1 proc w/ 6 cores [22:23:10] i thought it was [22:23:19] maybe its single, its whatever wtp1001 is [22:23:23] Right, OK [22:23:42] if its cpu bound it may be better to order for it [22:23:51] as we can get that platform (r320) dual cpu [22:25:29] Yeah [22:25:36] Didn't cadmium have like 16 or 32 cores? [22:25:39] Some ridiculous number [22:25:59] We will order for it, but we'll need misc boxes in the interim. About 3-4 of them probably [22:26:21] !log restarting lvs4 for parsoid vip [22:26:29] Logged the message, Mistress of the network gear. 
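On the single-versus-dual-CPU question about the wtp boxes above, /proc/cpuinfo can settle it directly; hyper-threading is what usually makes the bare "processor" count misleading:

    grep -c '^processor' /proc/cpuinfo                   # logical CPUs = sockets x cores x threads
    grep '^physical id' /proc/cpuinfo | sort -u | wc -l  # number of distinct sockets
    grep -m1 'cpu cores' /proc/cpuinfo                   # cores per socket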
[22:26:33] !log modify above - restarting PYBAL on lvs for parsoid vip [22:26:41] Logged the message, Mistress of the network gear. [22:27:00] I'll put a ticket in today [22:30:07] !log restarting pybal on lvs1006 for parsoid vip [22:30:15] Logged the message, Mistress of the network gear. [22:33:22] PROBLEM - LVS HTTP IPv4 on parsoid.svc.pmpta.wmnet is CRITICAL: (null) [22:33:46] (that paged) [22:33:59] (just fwiw, I think it's good to know that it did :-) [22:34:11] Yeah, that one needs to not page [22:34:35] how come? [22:34:41] what's the plan with that btw? [22:35:19] I know nothing about it other than what parsoid is supposed to be [22:37:43] oh hey [22:37:47] we're putting parsoid up right now [22:37:52] so yeah, ignore pages [22:37:53] sorry [22:37:55] root@lvs1006:/etc/pybal# Unhandled error in Deferred: [22:37:55] Unhandled Error [22:37:56] Traceback (most recent call last): [22:37:57] Failure: twisted.names.error.DNSNameError: [22:38:08] anyone know what that is ^^ ? getting that error when starting pybal [22:38:56] i am killing dvd.wikimedia.org .. it was once supposed to be a "rotation script" for different mirrors of a WP DVD .. but it was gone [22:39:05] and just pointed to a German company [22:40:24] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/35135 [22:40:38] !log restarted pybal on lvs1006 and lvs1005 [22:40:46] Logged the message, Mistress of the network gear. [22:40:53] !log DNS update - removing broken dvd.wikimeedia.org [22:41:01] Logged the message, Master [22:44:17] New patchset: Catrope; "Fix another pmpta typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36657 [22:45:53] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36657 [22:48:11] !log restarting pybal on lvs1003 for parsoid vip [22:48:19] Logged the message, Mistress of the network gear. [22:50:22] LeslieCarr: I've not seen that error before, no [22:50:53] they fixed it [22:50:55] New patchset: Reedy; "Remove comment about temp being 200, it's actually 100 * 1024" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36659 [22:51:03] it was basically nxdomain [22:52:26] ah [22:52:27] ok [22:52:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:55:32] huzzah, parsoid vip success! [22:55:42] New patchset: Pyoungmeister; "renaming coredb module to coredb_mysql to avoid module/role class namespace fuckery" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36661 [22:57:39] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36661 [23:05:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.042 seconds [23:08:19] New patchset: Pyoungmeister; "need to change name in role class as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36664 [23:11:23] heya, binasher, you there? [23:11:31] i'm not sure, but just looking at some of the event.gif logs [23:11:39] it looks like the tab as delimeter change never made it to some of the hosts [23:11:54] e.g. 
strontium looks good, I see the tabs [23:12:05] palladium and niobium are still using spaces [23:12:57] New patchset: Andrew Bogott; "Add a class to generate a puppet docs website" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36666 [23:14:29] New patchset: Aaron Schulz; "Set wgShowExceptionDetails on for cli mode" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36668 [23:15:55] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/36668 [23:16:30] New review: Andrew Bogott; "This is one of those patches which I can't properly test with puppetmaster::self. So, merging witho..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/36666 [23:16:51] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36666 [23:16:54] andrewbogott: you are awesome:) [23:17:17] hashar: You mean, awesomely apologetic? [23:17:40] andrewbogott: I got DNS entries for doc.wikimedia.org so we could get the puppet doc available at http://doc.wikimedia.org/ops/puppet/ or something :) [23:17:53] andrewbogott: can even make Jenkins to build them for us :-) [23:18:05] shutdown -h hashar -m out of work time [23:18:20] Oh, great! I didn't duplicate your work I hope? [23:18:24] killall `pidof shutdown` [23:18:36] andrewbogott: na I did nothing, just been thinking about it :-) [23:18:38] New review: Aaron Schulz; "I remember tim doing some napkin math calculations based on the scalar apache's memory and maxclient..." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/36659 [23:18:56] hashar, btw, the site that my puppet class generates is incredibly ugly. I was hoping to only use it with deep links in order to avoid anyone seeing the… frames [23:19:05] andrewbogott: I guess it will be all a matter of replacing the wmflabs.org domain by wikimedia.org and apply your class on the contint server :-] [23:19:34] andrewbogott: I am sure #puppet can help you find out a script that generate some nice documentation [23:19:55] That's the script I'm using. It works great except for the 1995-era look and feel. [23:20:05] indeed [23:20:20] If you have tunneling set up, you can view here: http://puppetdoc2.pmtpa.wmflabs/ [23:20:23] 1995 was a good year. [23:20:27] (probably you've see that exact site before) [23:20:31] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36664 [23:21:05] New patchset: Catrope; "Parsoid puppetization changes:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36670 [23:21:20] hashar: andrewbogott: want me to just put integration.wikimedia.org into the integration.mediawiki.org file [23:21:37] as ServerAlias [23:21:46] andrewbogott: Also, http://puppetdoc2.instance-proxy.wmflabs.org/ [23:22:10] RoanKattouw: Whoah, since when does /that/ work? [23:22:20] A while ago apparently [23:22:30] It's been around forever but no one knows about it [23:22:35] holy crap! [23:22:40] Do you know who wrote it? [23:22:44] Liangent [23:22:46] Seems like we should document it or something... 
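Back on the event.gif delimiter question at the start of this stretch (tabs on strontium, still spaces on palladium and niobium), a quick spot-check on each host makes the difference visible; the log path here is hypothetical:

    head -n1 /a/log/event.gif.log | cat -A | cut -c1-200   # cat -A renders tabs as ^I, spaces stay as spaces
    awk -F'\t' '{print NF; exit}' /a/log/event.gif.log     # >1 means the first line really is tab-delimited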
[23:22:51] * andrewbogott makes a note [23:23:01] Ya [23:23:15] mutante: not sure how it will work :-) [23:23:19] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [23:23:20] mutante: but you can give it a try [23:23:28] mutante: I don't think I know what you mean/what that would do [23:23:35] mutante: I think I will migrate everything to use the wikimedia.org domain [23:24:24] Now I feel /really/ bad for torturing all those volunteers with foxyproxy instructions [23:26:45] anyway, that rdoc site looks promising [23:27:03] I am sure rdoc output can be enhanced in one way or another [23:27:44] Yeah. There's a text mode output from the tool which we could format however we want. That exceeds my ambitions for the moment though. [23:27:56] Mostly I want to add doc links to the 'configure instance' page in labs. [23:28:03] I mean, per-class links. [23:28:16] that is a great idea [23:28:44] And we'll need to update the style guide so that people comment classes in a way that gets parsed correctly. I haven't investigated that yet. [23:29:10] or you can get the markdown syntax dumped and use a tool to convert the markdown to some nice html [23:29:46] I thought briefly about feeding it into the mediawiki api. But that would get us editable pages which would be very wrong. [23:30:48] hashar: andrewbogott: integration.wikimedia.org is now "it works" because we added DNS but not to Apache config.. i am gonna make that work.. independent from the "doc." virtual hosts [23:31:06] I wonder, is there a way to mark a section of a wiki so that it's editable by only a single editor? That would work, and then a bot could upload and update & such [23:31:39] andrewbogott: you can protect the pages and give administrator to that user and not the others [23:31:55] mutante: I'm still a beat behind… what is integration an alias for? [23:32:10] mutante: The bot would need to create new pages, though… is that problem? [23:32:41] mutante: sounds good, thanks :) [23:32:44] the bot user would also need the right flags.. i am not sure if a bot flag means you can edit protected pages [23:33:40] New patchset: Dzahn; "add integration.wikimedia.org ServerAlias to integration.mw" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36671 [23:34:00] andrewbogott: that's all https://gerrit.wikimedia.org/r/#/c/36671/1/files/apache/sites/integration.mediawiki.org [23:34:19] oh, I see. [23:34:51] You said that before, but my eyes transposed the words and I was wondering why you were aliasing something to itself [23:34:52] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36671 [23:37:16] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [23:40:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:43:20] New patchset: Pyoungmeister; "attempt to hack around more coredb issues" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36673 [23:43:39] hashar: Any idea where we should host this? [23:43:58] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36673 [23:44:09] andrewbogott: the puppet docs ? 
[23:44:46] I would go for something like http://doc.wikimedia.org/ops/puppet/ [23:44:54] or simply http://doc.wikimedia.org/puppet/ [23:45:11] feel free to have them hosted on gallium.wikimedia.org server which is the host running Jenkins [23:45:13] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [23:45:29] (doc.wm.o points to gallium) [23:45:41] OK. Hm... [23:45:57] Thinking about having the docs update via jenkins vs. just update via puppet [23:47:41] And, are you expecting to have a MW installed at the top level? [23:47:48] nop [23:47:56] at least not yet :-] [23:48:10] just want to migrate the MediaWiki automatically generated doc from formey to gallium [23:48:28] http://svn.wikimedia.org/doc/ [23:48:45] which is legacy and need to be phased out [23:49:10] Right now my class generates a whole site, with a sites-enabled entry. It would work just as well as a simple subdir of an existing site... [23:49:34] docs sounds better [23:49:35] I think [23:49:42] I guess I can just install the site at that subdir too, I just have to learn how to do that :) [23:50:59] Reedy: lets bikeshed about doc vs docs :-] [23:51:21] Reedy: I am not sure which one looks more natural. I guess I choose 'doc' simply because it is shorter [23:51:30] hashar: actually. We should use documentation.wikimedia.org and make apache change it [23:52:28] andrewbogott: the integration website is somewhere under /srv/ [23:52:52] andrewbogott: maybe /srv/org/wikimedia/doc/puppet [23:53:02] andrewbogott: but you will have to provide the apache conf [23:53:38] doh 1am, bed time for real [23:53:44] * hashar waves [23:53:50] hashar: I think that's the part I don't understand. But I'll catch up with you tomorrow about this. [23:54:12] sure :) [23:54:18] have a good evening! [23:54:25] g'night [23:56:01] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36650 [23:56:42] bleh. Looks like I snuck another change in with that one accidentally [23:57:16] !log aaron synchronized php-1.21wmf5/extensions/CentralAuth/CentralAuthUser.php 'revert live hacks, exception already dumped in bugzilla' [23:57:24] Logged the message, Master [23:57:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.042 seconds
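The docs site discussed above, 1995-era frames and all, is what Puppet's built-in rdoc generator produces. A sketch of the sort of command the class presumably wraps; the output and manifest paths follow hashar's /srv suggestion and are not necessarily what the merged change does:

    puppet doc --mode rdoc \
        --outputdir /srv/org/wikimedia/doc/puppet \
        --modulepath /etc/puppet/modules \
        --manifestdir /etc/puppet/manifests
    # then serve the outputdir as a plain static subdirectory under doc.wikimedia.org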