[00:16:06] New patchset: Tim Starling; "Add CodeEditor and Scribunto to extension-list" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19691
[00:17:29] Change merged: Catrope; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19691
[00:17:33] TimStarling: ---^^
[00:18:05] yeah, don't run scap on that just yet
[00:18:23] I think it will break since I only added those two extensions to wmf9, not wmf8
[00:18:41] Hmm right
[00:18:47] Yeah not touching the cluster, just approving
[00:20:23] thanks for approving
[00:22:55] I think mergeMessageFileList is fairly tolerant of missing i18n files, but you might want to add it to wmf8 just to be sure
[00:29:04] mutante: the dupe is back!
[00:29:18] kill both, do puppetrun?
[00:29:58] jeremyb: nagios-wm? i just see one
[00:30:11] 15 00:30:05 [+nagios-wm ] [ closedmouth ] [ joernjoern ] [ mutante ] [ phuzion ] [ T3rminat0r ]
[00:30:14] 15 00:30:05 [+nagios-wm_ ] [ CodeBlock ] [ kaldari ] [ Nemo_bis ] [ preilly ] [ tfinc ]
[00:30:32] errr, maybe that wasn't such a good idea (collateral hilight)
[00:33:07] by nagios-wm_!!!
[00:33:08] jeremyb: i could not see it in /names for some reason
[00:33:14] gah, bye*
[00:33:28] doing what you suggested
[00:33:51] damn, this lag is really horrible
[00:34:17] puppet runs on spence take a while..
[00:35:55] i imagine!
[00:36:03] i hope you didn't get the extra long variant
[00:37:26] /usr/ircecho/bin/ircecho vs. /home/wikipedia/bin/ircecho
[00:37:26] 14 17:30:15 -!- nagios-wm_ [~nagios-wm@spence.wikimedia.org] has joined #wikimedia-operations
[00:37:30] 14 20:38:03 -!- nagios-wm_ [~nagios-wm@spence.wikimedia.org] has quit [Remote host closed the connection]
[00:37:33] 14 21:26:02 -!- nagios-wm_ [~nagios-wm@spence.wikimedia.org] has joined #wikimedia-operations
[00:37:36] UTC
[01:07:01] New patchset: Ryan Lane; "Rework openstack and ldap manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19501
[01:07:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19501
[01:09:18] New patchset: Ryan Lane; "Fix linking of CA certs by using an unless that will match" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19696
[01:09:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19696
[01:10:48] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19696
[01:11:14] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[01:11:14] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours
[01:11:14] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[01:40:56] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 215 seconds
[01:41:32] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 223 seconds
[01:47:59] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 610s
[01:55:38] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 16 seconds
[01:57:23] New patchset: Tim Starling; "Make fluorine an MW log host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19698
[01:57:53] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 3 seconds
[01:58:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19698
[01:58:29] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 8s
[01:59:46] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19698
[02:06:20] New patchset: Ryan Lane; "Rework openstack and ldap manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19501
[02:07:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19501
[02:12:12] New patchset: Ryan Lane; "Rework openstack and ldap manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19501
[02:12:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19501
[02:17:14] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[02:20:37] New patchset: Ryan Lane; "Add the upstream stdlib module to the repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19699
[02:21:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19699
[02:39:16] New patchset: Tim Starling; "Enable Scribunto and CodeEditor on test2wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19700
[02:39:31] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19700
[02:49:01] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[03:21:47] New patchset: Ryan Lane; "Rework openstack and ldap manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19501
[03:22:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19501
[03:30:43] New patchset: Ryan Lane; "Rework openstack and ldap manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19501
[03:31:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19501
[04:06:28] New patchset: Tim Starling; "Use standard iptables rules for mw loggers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19703
[04:07:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19703
[04:08:22] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19703
[04:36:40] New patchset: Tim Starling; "Add exemption for Wikimedia Philippines event." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19704
[04:37:19] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19704
[05:14:48] New patchset: Tim Starling; "Send MW logs to fluorine per RT 2400" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19705
[05:15:20] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19705
[05:15:46] morning
[05:17:05] hi
[05:19:11] New patchset: Tim Starling; "Fix log file location" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19706
[05:19:50] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19706
[05:22:25] !log moving MediaWiki logs to fluorine
[05:22:33] Logged the message, Master
[05:27:04] New patchset: Tim Starling; "Send wmerrors logs to fluorine and notify apache" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19707
[05:27:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19707
[05:28:07] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19707
[05:38:56] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours
[05:41:56] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[05:41:56] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours
[05:41:56] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[06:06:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:08:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.391 seconds
[06:44:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:47:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.697 seconds
[07:00:03] PROBLEM - Puppet freshness on srv281 is CRITICAL: Puppet has not run in the last 10 hours
[07:10:14] RECOVERY - Apache HTTP on srv281 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.009 seconds
[07:12:02] RECOVERY - Puppet freshness on srv281 is OK: puppet ran at Wed Aug 15 07:11:34 UTC 2012
[07:21:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:30:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.477 seconds
[07:35:37] PROBLEM - Apache HTTP on srv281 is CRITICAL: Connection refused
[07:37:34] RECOVERY - NTP on srv281 is OK: NTP OK: Offset -0.04735791683 secs
[07:53:46] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[07:59:44] jeremyb: hrhr, no, not a too good idea, I'm happy the pc IRC runs on isn't in the room my GF and me slept tonight, else she would have asked me who woke us at 2:30 ;)
[08:05:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:17:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.120 seconds
[08:17:46] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[08:50:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:02:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.071 seconds
[09:11:45] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours
[09:26:45] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours
[09:36:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:41:39] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[09:46:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.617 seconds
[10:20:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:32:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.760 seconds
[11:08:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:12:05] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours
[11:12:05] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours
[11:12:05] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours
[11:20:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.049 seconds
[11:39:31] !log Destroyed aggregate 'labs' on nas1-a
[11:39:41] Logged the message, Master
[11:52:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:03:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.939 seconds
[12:11:09] !log Restricted NFS mounting of /vol/root on all NetApps
[12:11:18] Logged the message, Master
[12:15:00] !log Started rsync of /home to nas1-a:/vol/home_pmtpa on nfs1
[12:15:11] Logged the message, Master
[12:16:36] PROBLEM - Host ms-be1 is DOWN: PING CRITICAL - Packet loss = 100%
[12:17:57] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours
[12:26:24] \o/
[12:38:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:49:54] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[12:50:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.057 seconds
[13:19:46] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours
[13:24:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:34:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.938 seconds
[13:46:10] !log authdns-update for wikipedia.co.za
[13:46:19] Logged the message, RobH
[14:00:36] robh: regarding new es 7 & 8. how do you choose their ip's?
[14:01:23] i would think it is 7.19 and 20 for mgmt
[14:01:28] cmjohnson1: those should be easy, lemme take a look at the files
[14:02:10] cmjohnson1: you seem to be on the right track for that
[14:02:30] cmjohnson1: es7/8?
[14:02:44] cmjohnson1: we already have es7/8 in tampa
[14:02:51] the new ones are es9 & es10
[14:02:59] es7/8 are in c1 pmtpa
[14:03:03] yep..9/10
[14:03:18] so yea, 9 and 10 go in the 10 reverse file
[14:03:20] under $ORIGIN 7.1.$zonename.
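The reverse-zone layout being described at this point can be sanity-checked with Python's standard ipaddress module. This is only a sketch: it assumes that $zonename in the "10 reverse file" expands to 10.in-addr.arpa, so that the mgmt addresses "7.19 and 20" are 10.1.7.19 and 10.1.7.20. The helper names are made up for illustration and nothing here comes from the actual zone files.

```python
import ipaddress

def ptr_owner(ip):
    """Return the full PTR owner name for an IPv4 address."""
    return ipaddress.ip_address(ip).reverse_pointer

def origin_and_label(ip):
    """Split a PTR owner name into the $ORIGIN it sits under and the host label."""
    label, origin = ptr_owner(ip).split(".", 1)
    return origin + ".", label

# Assumed mgmt addresses for the two new hosts (see the note above).
for ip in ("10.1.7.19", "10.1.7.20"):
    origin, label = origin_and_label(ip)
    print("$ORIGIN %s -> record label %s" % (origin, label))
```

Under that assumption both addresses fall under the same $ORIGIN 7.1.10.in-addr.arpa. block, and only the final label (19 or 20) differs per host.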
[14:03:30] you can see that .19 and .20 are free, so thats easy
[14:03:49] cmjohnson1: but once you add that, stay in the file, we need to give them their main IP as well
[14:04:19] cmjohnson1: Now, those are a bit more complicated, if you look in $ORIGIN 0.0.$zonename. section of the file
[14:04:34] you can see it has a bunch of gaps in the ranges as servers have been decommissioned.
[14:04:50] since there isnt room in the series of IPs adjacent to the other ES servers, we will just put them in the same subnet
[14:05:20] (like 55 to make es9 into .220 and line 59 to set .224 to nas10
[14:05:21] )
[14:05:42] cmjohnson1: Also, once you make all these changes do NOT svn commit, need to review them and its easiest with svn diff
[14:06:01] did you mean line 50 to be es10?
[14:06:06] line 59
[14:06:31] ?
[14:06:42] line 59 is the blank spot for IP .224
[14:06:50] that can be es10
[14:06:55] (i think we are on same page?)
[14:06:55] ok
[14:06:59] we are
[14:07:01] cool
[14:07:17] to be honest, i find it easier to have two terminal windows to sockpuppet for this
[14:07:23] i open the reverse file in one vim session
[14:07:34] then in the other tab i open the wmnet file
[14:07:53] so i have the 10.in-addr.arpa and wmnet both open, makes it easier to ensure they match
[14:09:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:09:44] once you have those IPs added in the reverse (10) file, you can add the servers and the server mgmt to the appropriate places in the wmnet file
[14:10:01] but the initial question of HOW we decide to allocate IPs
[14:10:18] If there was no subnet setup for this stuff, that falls to our networking folks, they make the decision on what subnet to use where
[14:10:34] but once its setup like now, its just adding servers to the logical place, in this case the ES servers get added to the same subnets as the existing es servers
[14:10:44] (eqiad is slightly more complex, in that every row has its own subnets)
[14:10:57] we'll worry about eqiad some other time though ;]
[14:13:58] T3rminat0r: hah. yeah, i thought about that just a little too late ;)
[14:14:24] ^^
[14:21:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.107 seconds
[14:23:49] robh: can you check
[14:24:11] checking
[14:24:21] cmjohnson1: did you save the 10 file
[14:24:25] i dont see your changes
[14:24:37] yes
[14:24:50] exit the file
[14:24:52] i will save again
[14:24:57] cmjohnson1: I see NO changes to it
[14:25:01] the wmnet file has changes
[14:25:10] yep, no changes still
[14:25:14] are your changes still in there?
[14:25:15] ;]
[14:25:24] yep..i don't see them either..odd
[14:25:28] svn diff showed them
[14:25:31] give me a sec
[14:25:32] i dont see in svn diff
[14:25:43] i see the wmnet changes though, so easy enough to redo
[14:25:47] lemme know when to recheck
[14:28:35] robh: okay..i think i exited w/out saving changes b4
[14:29:42] ok, now it looks good. You can svn commit, then do the update (remember to -A your key into dobson)
[14:29:44] and admin log it
[14:30:02] rk
[14:33:06] !log authdns-update for es9 and es10
[14:33:15] Logged the message, Master
[14:33:58] dont forget to dig against them post update
[14:34:12] yep
[14:38:01] robh: they are good both mgmt and main ip's
[14:38:19] cmjohnson1: ? whatcha mean?
[14:38:49] the dig
[14:39:20] they seem ok
[14:39:26] i didnt dig every fqdn
[14:39:36] but seems fine, (ns2 answered ok for them)
[14:39:59] yeah..i was saying I checked them all and they're ok
[14:40:02] ahhh, ok
[14:40:08] cool
[14:43:59] New patchset: Jeremyb; "bug 39359 - Philippine event is over" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19751
[14:48:00] New review: Jeremyb; "This was from bug 39359." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19704
[14:51:09] New patchset: Jeremyb; "bug 39359 - Philippine event is over" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19751
[14:53:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:07:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.226 seconds
[15:39:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:40:07] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours
[15:43:07] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours
[15:43:07] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours
[15:43:07] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours
[15:51:23] New patchset: Jeremyb; "add comments pointing to throttle.php for some cases" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/19765
[15:51:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.906 seconds
[16:10:30] !log Increased aggregate 'prod1' by 4 drives on both nas1-a and nas1001-a
[16:10:39] Logged the message, Master
[16:10:50] !log Created 4 TB volumes 'fr_archive' in aggregate prod1 on both nas1-a and nas1001-a
[16:10:59] Logged the message, Master
[16:11:06] !log Setup SnapMirror replication from nas1-a:fr_archive to nas1001-a:fr_archive
[16:11:15] Logged the message, Master
[16:25:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:35:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.369 seconds
[17:00:39] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 13.6870988688 (gt 8.0)
[17:01:50] New patchset: Pyoungmeister; "appserver module: no inits in subdirs. rejiggering" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19773
[17:02:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19773
[17:03:03] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.331772362205
[17:06:14] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19773
[17:08:45] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100%
[17:10:24] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms
[17:10:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:11:09] !log ms-be1 dropped offline yesterday; powercycling.
[17:11:17] Logged the message, Master
[17:13:33] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused
[17:14:27] RECOVERY - Host ms-be1 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms
[17:19:15] apergos or paravoid: did you have an opinion on how to test the swift upgrade on the eqiad cluster?
[17:19:53] good morning ben
[17:20:09] I don't but haven't thought about it much
[17:21:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.893 seconds
[17:21:39] PROBLEM - swift-object-server on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[17:26:36] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.059 second response time
[17:28:20] maplebed: so what's the problem exactly?
[17:28:31] I'm not sure I completely understand
[17:28:46] just the mechanics of upgrading.
[17:28:59] our puppet configs can't support multiple versions of swift
[17:29:16] you asked me about that before haven't you?
[17:29:26] sigh, sorry :/
[17:29:29] I sent mail
[17:29:42] I think that's the only time.
[17:29:54] so the options as I see them are
[17:29:58] * don't test with puppet
[17:30:09] * put in all the shiton of conditionals necessary
[17:30:18] * stop puppet on the prod cluster until the testing's done.
[17:30:48] I don't really like any of the three choices. I'm hoping there's another path I haven't thought of.
[17:32:31] I'd probably pick (b)
[17:32:47] but I've been known to be a perfectionist, so...
[17:33:39] hm.
[17:34:05] so going off http://wikitech.wikimedia.org/view/User:Bhartshorne/swift_upgrade_notes_2012-08, the things that would need to vary based on version are
[17:34:07] which puppet classes are relevant here?
[17:34:12] * rewrite.py
[17:34:19] * all the configs
[17:34:33] * whether the packages are ensure => present or ensure => latest
[17:34:36] all the configs?!
[17:35:21] yeah, there's a new flag necessary in each of the swift server configs.
[17:35:48] but we can use templates for that, can't we?
[17:35:50] (there are only 4 config files)
[17:36:24] sure! but there has to be a conditional somewhere that says "if new version, use this extra config option" for each of the config files.
[17:36:29] that's all I mean...
[17:36:29] right
[17:37:10] I'm trying to find where puppet provisions rewrite.py
[17:37:12] any hints?
[17:37:15] I wouldn't want to use templates for rewrite.py; I'd want to have two separate files
[17:37:52] ah
[17:37:52] # pull in the SwiftMedia python bits
[17:38:12] yeah, that's it.
[17:38:20] what's changing in rewrite.py?
[17:39:06] https://gerrit.wikimedia.org/r/#/c/18264/ <-- all the changes (assuming no conditionals)
[17:39:29] heh
[17:40:11] can you explain the WSGI a bit? they switched from their own thing to WSGI?
[17:40:59] no, it was wsgi the whole time. but something about the handoff between modules became more strict
[17:41:13] and they required starting to read the body before the headers are available
[17:41:19] so is that backwards-compatible?
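As a generic illustration of the WSGI handoff issue mentioned just above, the sketch below shows one way headers can be unavailable until the body is read in plain WSGI. It is not the actual rewrite.py or any swift code; lazy_app, HeaderPeek and noop_start_response are hypothetical names. When the wrapped application is a generator, start_response is only called once iteration begins, so middleware that inspects the status right after calling the app sees nothing yet.

```python
def lazy_app(environ, start_response):
    """A WSGI app written as a generator: no code runs until it is iterated."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    yield b"hello"

class HeaderPeek:
    """Hypothetical middleware that records what the app passes to start_response."""

    def __init__(self, app):
        self.app = app
        self.status = None
        self.headers = None

    def __call__(self, environ, start_response):
        def capture(status, headers, exc_info=None):
            # Remember the status and headers, then pass them upstream unchanged.
            self.status, self.headers = status, headers
            return start_response(status, headers, exc_info)
        return self.app(environ, capture)

def noop_start_response(status, headers, exc_info=None):
    """Stand-in for the server's start_response; returns a dummy write callable."""
    return lambda data: None

mw = HeaderPeek(lazy_app)
body = mw({}, noop_start_response)
status_before = mw.status        # still None: the generator has not started
first_chunk = next(iter(body))   # pulling the first body chunk runs the app
status_after = mw.status         # now the status set by the app
print(status_before, first_chunk, status_after)
```

Middleware written like this has to pull the first chunk of the body before it looks at the headers, which is roughly the ordering constraint being described in the conversation.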
[17:41:28] the original code tested the headers before reading the body.
[17:41:41] along with making things more strict, they added some helper functions to make it easier to deal with the strictness
[17:41:49] so no, sadly, not backwards compatible.
[17:43:07] we could make rewrite.py work with both...
[17:43:26] yes, we could.
[17:44:01] my struggle is this - how much extra work is worth while for a few days testing?
[17:44:08] as soon as the tests pass, all the clusters are going to get upgraded.
[17:44:36] which tests?
[17:44:43] testing the upgrade.
[17:45:19] 1) upgrade a cluster half way 2) test it, make sure everything works. 3) upgrade all the way 4) repeat tests 5) upgrade everywhere.
[17:45:42] * paravoid nods
[17:45:55] I've done those tests in labs without using puppet, but the labs environment is different.
[17:46:16] my goal is to test that the puppet configs are correct and verify that none of the differences between labs and our regular network environment matter.
[17:47:11] that's why all the conditionals stuff feels like extra and wasted work.
[17:47:36] but it also sucks to turn off puppet for 5 days (basically today through monday)
[17:49:24] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100%
[17:50:45] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[17:51:41] you raise a good point...
[17:54:03] here's another option...
[17:54:21] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused
[17:54:25] do the testing on monday, then I only turn puppet off in production for 1 day (the duration of the tests).
[17:54:27] rather than 5.
[17:54:30] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[17:54:37] schedule tuesday for the production upgrade.
[17:54:52] is 1 day of testing enough?
[17:55:02] I've already been testing in labs for a while.
[17:55:04] but why is the conditionals so much work then?
[17:55:10] I'd say it's about 10 mins
[17:55:16] I know I can't do it in 10 minutes.
[17:55:41] (not saying it wouldn't be 10m for you though; you're much better at puppet than I am)
[17:56:29] but yeah
[17:56:33] environments would be nice here
[17:56:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:57:09] the one that scares me the most isn't actually puppet but putting the new packages in our .deb repo and making sure nothing in production accidentally does an apt upgrade.
[17:57:22] cuz if it gets the new packages without the puppet configs, it's fucked.
[17:57:26] mark: yeah, I thought of that already :/
[17:57:37] hehe
[17:58:04] I guess you can pin packages for that
[17:58:14] but indeed, it does become a bit of work like this
[17:58:45] and esp. the part where you make sure it won't break everything
[17:58:58] New patchset: Pyoungmeister; "appserver module: smooshing sync back into server to avoid circular deps" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19777
[17:59:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19777
[18:01:24] ok, here's what I propose:
[18:02:01] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19777
[18:02:13] * do the half way test on eqiad (upgrade 1/2 of the proxies and 1/3 of the back ends) without puppet (dpkg -i + scp)
[18:02:23] most of the testing I need to do can be accomplished in that state
[18:02:46] on monday I disable puppet in production and test that puppet gets an eqiad host to the same state the manual upgrade did
[18:03:04] then I can do the upgrade with puppet in production.
[18:03:16] does that sound reasonable?
[18:03:33] yes
[18:03:41] hopefully next time around we'll have environments
[18:03:54] or I'll mess with the manifests at that point
[18:04:38] yeah
[18:06:53] paravoid: to decrease the chance of the puppet tests failing, would you mind reviewing that gerrit change closely?
[18:07:13] okay
[18:07:24] RECOVERY - swift-object-server on ms-be1 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[18:13:04] New patchset: Ottomata; "analytics-dell.cfg - these have a single SSD as sda. Use it for / and swap" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19779
[18:13:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19779
[18:21:49] New patchset: Ottomata; "analytics-dell.cfg - these have a single SSD as sda. Use it for / and swap" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19779
[18:22:35] New patchset: Asher; "increase es server range" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19782
[18:23:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19779
[18:23:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19782
[18:25:55] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19779
[18:31:01] New patchset: Asher; "increase es server range" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19782
[18:31:41] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19782
[18:32:38] New patchset: Pyoungmeister; "fix for dependency cycle" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19783
[18:33:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19783
[18:33:29] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19783
[18:36:25] New patchset: Ryan Lane; "Rework openstack and ldap manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19501
[18:37:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19501
[18:39:10] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19699
[18:40:14] New patchset: Ottomata; "analytics-dell.cfg - need . at end" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19785
[18:40:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19785
[18:41:45] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19785
[18:53:59] New patchset: Ryan Lane; "Change puppetmaster::self's branch to production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19786
[18:54:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19786
[18:59:02] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/7772
[18:59:58] ah, ok, so this had a Verified and +2 by Tim but was not merged. Out of curiosity i just did the Verified part but no +1 or +2, and .. it gets merged by that
[19:00:33] it is the redirect from /w/ /w/index.php
[19:01:51] one day of testing instead of five sounds good
[19:01:56] and I would also pick option (b)
[19:02:00] maplebed:
[19:02:08] (I was actually "outside" for awhile
[19:02:10] )
[19:03:20] apergos: did you catch my message an hour ago (now - 62 minutes)
[19:03:37] unless you object, that's the path I'll take.
[19:04:16] urgh you are a google docs fan?
[19:04:18] *sigh*
[19:04:44] dude, get me a decent spreadsheet editor on a wiki and I'll use it.
[19:04:55] I go blind trying to deal with wiki table markup.
[19:05:03] anyways it must not be this message
[19:05:28] in irc was it?
[19:05:49] [11:02 AM] * do the half way test on eqiad (upgrade 1/2 of the proxies and 1/3 of the back ends) without puppet (dpkg -i + scp)
[19:05:49] [11:02 AM] most of the testing I need to do can be accomplished in that state
[19:05:49] [11:02 AM] on monday I disable puppet in production and test that puppet gets an eqiad host to the same state the manual upgrade did
[19:06:00] [11:03 AM] then I can do the upgrade with puppet in production.
[19:06:00] [11:03 AM] does that sound reasonable?
[19:06:19] what day does that first bit happen?
[19:06:31] the half way testing? now.
[19:06:40] ah
[19:06:50] ms-fe1001's already upgraded.
[19:06:52] ok
[19:07:27] yeah that sounds decent ... given the state of things
[19:11:16] maplebed: what's db_preallocation?
[19:12:07] paravoid: when the account and container servers make their db files to store Stuff, it controls whether they glob chunks of disk at a time or grow only when necessary.
[19:12:14] New patchset: Pyoungmeister; "appserver/mediawiki module: sadly, more cross-module deps are needed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19788
[19:12:16] maplebed: and is proxy-logging related to statsd or is it something completely different?
[19:12:45] it's more efficient when it's on but uses more disk space. it used to default to on and now defaults to off.
[19:12:56] we have enough disk space (even on the SSDs) so I'm leaving it on.
[19:13:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19788
[19:13:15] logging was removed from the proxy server and placed into its own module. It's independent of the statsd stuff.
[19:16:17] and the swift egg is python-swift I presume?
[19:16:42] yup.
[19:16:48] okay [19:16:50] so everything looks good [19:16:55] but from what I can see [19:17:31] the only things that are backwards-incompatible from 1.5->1.4 is this proxy-logging stanza [19:17:37] and rewrite.py [19:17:47] !log Moved VRRP mastership from cr2-pmtpa to cr1-sdtpa by reassigning VRRP priorities, relieving the sdtpa-pmtpa link [19:17:55] Logged the message, Master [19:18:24] mark: storage admin/network admin/puppet expert? [19:18:39] I think that's true, though I don't remember how it behaves when it finds unfamiliar config directives. [19:18:54] i'm not a puppet expert [19:18:57] what would be an unfamiliar config directive? [19:18:57] I just use it a lot ;) [19:19:16] aiui, the logging stuff are commented-out and db_preallocation existed, just with a different default value [19:20:44] paravoid: the db_preallocation. it was not configurable before. When changing the default they added the ability for you to choose. [19:25:49] uh, oh well [19:26:04] btw, do we have RT tickets for all the tasks? [19:29:02] !log Removed OSPF/OSPFv3 metric 60 on cr1-sdtpa:xe-1/1/0 (eqiad link) [19:29:10] Logged the message, Master [19:30:56] there [19:31:01] that should remove our packet loss a bit [19:31:39] New patchset: Ottomata; "analytics-dell.cfg - swap on primary partition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19791 [19:31:43] the rt ticket titled "hey robla, ping" is surreal! [19:32:04] paravoid: no, I haven't made RTs. [19:32:21] okay, I can do that if you find them a good idea [19:32:22] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19791 [19:33:25] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19791 [19:49:25] paravoid: robla and jeremyb wanted to test if they can mail to RT, because when i said 'why not just mail to the existing ticket' robla said that did not work for him [19:50:06] thanks, although I figured there was some reason. the RT was just very funny to read [19:51:36] ;-) [19:52:49] Reedy: idk if you saw, gave you a few review requests for mediawiki config [20:04:43] !log stopping puppet on brewster to experiment with partman on new analytics dells [20:04:52] Logged the message, Master [20:14:54] maplebed: so confirm this for me [20:15:22] !log sync-apache pushing out new redirects from /w/ to Main_Page [20:15:31] Logged the message, Master [20:15:46] tomorrow MW will write to both swift and nfs and read from swift [20:16:01] yes. [20:16:06] but we'll continue serving upload traffic from nfs [20:16:13] due to squid's config [20:16:23] upload traffic for originals. yes. [20:16:57] so, we'll leave it like that for what? a week? [20:17:28] and then a) switch squids to read from swift b) switch MW to read to/write from swift exclusively [20:17:33] I'd be happy switching to upload/originals traffic to swift anytime next week [20:17:56] I think MW will keep writing to both for a while (to allow us to roll back) [20:18:04] (where a while == several weeks) [20:18:09] okay [20:18:41] but when to turn off NFS writes should be a conversation between robla, aaron, you, ariel, ct, and maybe me if I'm around. [20:18:41] so, switching squids is 1) a minor config change, 2) a change in rewrite.py [20:19:14] what's the change in rewrite.py about? [20:19:31] do you have a copy open? 
[20:19:58] I do know [20:20:03] er, now [20:20:04] !log apache-graceful-all [20:20:15] Logged the message, Master [20:20:48] line 225 is where it starts processing an incoming URL [20:20:48] srv281: apache2: Syntax error on line 328 of /etc/apache2/apache2.conf: Could not open configuration file /etc/apache2/wmf/all.conf: No such file or directory [20:20:55] <-- just on 281 [20:21:04] yeah [20:21:16] line 266 is where it decides what it should do with the URL [20:21:30] the regex there only catches thumbs, not originals [20:21:50] the if starting on 267 is closed on 340 [20:21:59] mutante: #3336 [20:22:01] where it basically says "whoops, 404." [20:22:29] !log http://en.wikipedia.org/w/ now redirects to Main_page on all languages. It was a 403 before. [20:22:38] Logged the message, Master [20:22:44] paravoid: thanks:) [20:23:00] maplebed: we probably have a different version [20:23:16] right; sorry. I'm looking at the version modified for 1.5 [20:23:44] * maplebed opens the other version. [20:24:42] the if starting on 279 is closed on 353 [20:25:24] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19788 [20:25:36] isn't /thumb/ optional? [20:25:40] and then zone is set to public? [20:26:27] oh shit, you're right. [20:26:32] !log starting puppet on brewster. PARTMAN IS THE ENEMY. will continue the fight tomorrow. [20:26:34] New patchset: Cmjohnson; "adding es9 and es10 to dhcpd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19931 [20:26:39] maybe the last time we patched rewrite we actually did the originals work too! [20:26:41] Logged the message, Master [20:27:16] I forgot about that since it's not used yet. [20:27:19] New review: Dzahn; "merged and pushed out to cluster. works fine for me." [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/7772 [20:27:19] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19931 [20:27:32] why is it called "public" though? [20:28:29] I don't know. mediawiki thing I think. [20:28:56] okay [20:30:08] so public is originals, good to know [20:30:21] AaronSchulz: ^^^ is that right? [20:30:26] New patchset: Cmjohnson; "adding additional servers for es install to accomodate es10" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19932 [20:30:36] ottomata: reportcard on stat1. what's the status here? i guess you still need it? [20:30:52] bwer? [20:30:58] is that in gerrit? [20:31:06] !change 11042 | ottomata [20:31:06] ottomata: https://gerrit.wikimedia.org/r/#q,11042,n,z [20:31:14] ah yes [20:31:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19932 [20:31:23] we are actually going to do that on stat1001, now that is live [20:31:33] maplebed: local-public/local-deleted are originals yes [20:31:45] aaand, i'v ebeen super busy with other stuff [20:31:49] so haven't been able to work on this [20:31:52] i think just leave it in there for now? [20:31:55] i'll add a comment [20:32:21] New review: Ottomata; "This will actually go on stat1001 now that it is live. Busy with other things. Will get to this ev..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/11042 [20:32:27] ottomata: that's fine. just checking since i see it in "my" changes being a reviewer [20:32:36] ayyye cool [20:32:39] k [20:32:41] gotta run, laters! [20:32:45] cya [20:45:13] bedtime. see folks tomorrow [20:45:26] Change abandoned: Cmjohnson; "change is not needed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19932 [20:47:05] can someone please take a look at https://gerrit.wikimedia.org/r/#/c/17902/ ? 
[20:47:55] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19931 [20:48:07] cmjohnson1: looks good, can you take care of merging on sockpuppet? [20:48:32] coolness, i recalled you doing that atleast once before, change is merged [20:48:50] can a kind opsen please review/merge https://gerrit.wikimedia.org/r/#/c/17902/ ? it's simply to ensure git and git-svn on yttrium [20:48:55] MaxSem: looking now [20:49:02] cmjohnson1: hold up on sockpuppet merge a moment if y ou would [20:49:06] oh ha max beat me to it. [20:49:10] im going to review MaxSem's change so you can merge it too [20:49:27] thanks for looking RobH [20:49:41] New review: RobH; "looks good, merged" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/17902 [20:49:42] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17902 [20:49:46] awjr: welcome [20:49:49] huzzah! [20:49:53] cmjohnson1: Ok, now you can proceed with the update on sockpuppet [20:50:12] RobH, many thanks [20:50:14] * AaronSchulz loves how gedit highlites .py files [20:50:24] MaxSem: you need someone to force the puppet run on yttrium? [20:50:33] New patchset: Pyoungmeister; "on faidon's advice, include not require." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19934 [20:50:47] awjr, do you need it right now?^^ [20:51:09] RobH, MaxSem that would be great [20:51:16] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19934 [20:51:34] cmjohnson1: ping me when you finish updating sockpuppet so i can force puppet run on another host [20:52:38] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19934 [20:53:11] just merged someone's stuff [20:53:48] some wlm stuff and your stuff cmjohnson1 [20:55:07] notpeter: and a change to yttrium in site.pp [20:55:09] i hope [20:55:18] cmjohnson1: yep, es servers get raid10 w/256k stripe size [20:55:34] cmjohnson1: hey :) [20:55:37] you busy ? [20:55:43] RobH: actually, I don't see that [20:56:02] notpeter: hrmm [20:56:31] can you remove the stacking module from the ex4500 in d3 and powercycle it ? [20:56:40] notpeter: hrmmmmmm.... [20:56:43] i merged it [20:56:44] wtf. [20:58:09] thanks cmjohnson1 --- turns out with the stacking module, if you install 10.4 on the switch, it freezes [20:58:10] do we really want to add "admins::roots" to every single node instead of including it in "standard"? [20:59:07] wait until i revert the sw [20:59:08] thanks [20:59:09] :) [21:07:07] RoanKattouw: https://bugzilla.wikimedia.org/show_bug.cgi?id=39221 Is that still running? [21:11:38] Oh thanks for the reminder [21:11:51] It should've finished about 14 hours ago, let me check [21:16:17] hoo: It's done, I should go back and see if there are any bad ones left [21:16:31] RoanKattouw: There are [21:16:45] http://commons.wikimedia.org/wiki/Special:MovePage/File:D%C4%9B%C4%8D%C3%ADn,_Dlouh%C3%A1_j%C3%ADzda,_pivovarsk%C3%BD_kom%C3%ADn.jpg [21:16:57] https://commons.wikimedia.org/wiki/File:D%C4%9B%C4%8D%C3%ADn,_Dlouh%C3%A1_j%C3%ADzda,_pivovarsk%C3%BD_kom%C3%ADn.jpg [21:16:59] Right [21:17:10] Some files with strange characters didn't get picked up right [21:17:24] !log chown on ms7 has finished, running another find to find remaining bad files [21:17:33] Logged the message, Mr. 
Obvious [21:19:08] !log find /export/upload/wik*/*/{0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f,archive,math,temp,timeline} ! -user apache -exec chown apache \{\} \; [21:19:16] Logged the message, Mr. Obvious [21:20:05] LOL [21:20:52] RoanKattouw: Any ETA? [21:21:33] Another 12h? I don't know [21:21:49] * hoo wonders how that could even happen [21:21:50] Ryan_Lane: https://bugzilla.wikimedia.org/show_bug.cgi?id=39327 [21:22:56] hoo: Mostly because I dumped the output of find into a file, then ran the chown a day later [21:23:16] So some files may have moved around and stuff, and it seemed to have issues with non-Latin characters [21:23:56] RoanKattouw: I moreover meant, that files were uploaded with another user than apache [21:24:15] Oh [21:24:17] Import from shell? [21:24:20] Yes [21:24:51] When Aaron made me aware of this problem on Monday, I went "oh crap, just *one* of the batches that I imported in 2010 was a quarter of a million files" [21:25:07] So I knew the problem was large, I didn't know it was 7x larger, but whatever [21:25:18] (We ended up finding 1.85 million bad files) [21:25:57] Just run apache as root and you're fine :D [21:26:07] Or whatever web server servs files :P [21:26:27] haha [21:26:43] Well in early 2012 we added a locking mechanism, which caused imports as non-apache shell users to fail [21:26:56] That's when we went "oops, guess we should've been doing as apache all along" [21:27:11] But I didn't know it was actually causing user-visible problems until I was looped in on that bug on Monday [21:29:05] New patchset: Ryan Lane; "Rework openstack and ldap manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19501 [21:29:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19501 [21:31:26] RoanKattouw: Wouldn't a plain chown apache:apache be faster? 
At least that's what I usually do and it's damn fast (for the few files I got) (ext3 or 4 drives) [21:32:16] hoo: I'm not quite sure I trust chown not to touch files that already have the right owner. Especially on Solaris [21:32:57] On Linux it doesn't, it runs over 1500 files in 0m0.018s on my server [21:32:58] find ensures you gotta go read all the metadata for a few million files, xfs ftw [21:33:37] I'm cautious because 1) it's Solaris, 2) the filesystem is huge (24T, couple dozen million files), 3) it's Solaris, 4) it's a replicated filesystem, 5) it's running in production, 6) it's shared over NFS and 7) it's Solaris [21:33:58] Damianz: Well it seems to me you need to read that metadata anyway to see who the owner is [21:34:26] Except if you just blindly set the ownership on everything, in which case I'm not all that confident it won't cause more writes than necessary [21:34:28] Yeah, you're doing a read then a write as oppose to just a write (I think was the point). [21:34:33] Solaris is evil though [21:34:36] Yes [21:34:39] 1, 3 and 7 are good points, indeed [21:34:49] * hoo never had troubles with chown on Linux, at least [21:34:54] I forgot how much Solaris sucked until I had to deal with it this week [21:35:35] Well by the fact you're going to read all files then write a percentage, writing all would be less iops but it depends if reads of writes are more effected which is pretty much fs type dependant. [21:35:49] Yeah [21:36:06] Well since the FS is both replicated and shared, I feel much safer doing excessive reads than doing excessive writes [21:36:38] True, depending on how smart the replication works, you might screw it [21:38:15] New patchset: Pyoungmeister; "refixing page triage cron" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19938 [21:38:57] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19938 [21:39:54] I'd probably be more worried about nfs having kittens than replication assuming it's drdb/raid etc, glusterfs would die with a heavy write load though. [21:40:16] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19938 [21:40:31] New patchset: Ryan Lane; "Rework openstack and ldap manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19501 [21:41:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19501 [21:47:04] New patchset: Andrew Bogott; "Reconfigure mysql after changing the config." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19940 [21:47:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19940 [21:55:52] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19501 [21:59:02] New patchset: Ryan Lane; "Revert "Rework openstack and ldap manifests"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19943 [21:59:43] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19943 [22:03:48] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19940 [22:12:57] New patchset: Ryan Lane; "Rework openstack and ldap manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19947 [22:13:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19947 [22:15:43] New patchset: Ryan Lane; "Rework openstack and ldap manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19947 [22:16:24] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19947 [22:21:51] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19947 [22:23:17] New patchset: Ryan Lane; "Revert "Rework openstack and ldap manifests"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19950 [22:23:29] god damn it [22:23:56] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19950 [22:23:59] part of this change is untestable by puppetmaster::self [22:26:22] New patchset: Ryan Lane; "Rework openstack and ldap manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19951 [22:27:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19951 [22:27:12] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19951 [22:51:31] New patchset: Ryan Lane; "Switch dns ldap config to use new ldap hash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19955 [22:52:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19955 [22:52:46] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19955 [23:03:17] New patchset: Ryan Lane; "Making virt0 and virt1000 most minimal ldap clients" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19959 [23:04:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19959 [23:04:11] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19959 [23:10:13] New patchset: Ryan Lane; "Fix template location for libvirt info" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19960 [23:10:57] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19960 [23:11:00] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19960 [23:11:17] !log authdns-update, moving zirconium to public IP and removing selenium [23:11:26] Logged the message, Master [23:29:08] New patchset: Ryan Lane; "Fix labs controller hostname in pmtpa" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19961 [23:29:51] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19961 [23:35:49] !log restarted pdns-recursor on ns0 [23:35:57] Logged the message, Master [23:38:29] New patchset: Ryan Lane; "Make zones based on datacenter name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19964 [23:39:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19964 [23:39:24] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19964 [23:44:10] !log brought nagios back up by removing "misc_pmtpa" host and servicegroups from srv194 manually [23:44:18] Logged the message, Master [23:44:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [23:44:57] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.035 second response time [23:45:06] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [23:45:06] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [23:45:06] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [23:45:06] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [23:45:06] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [23:45:07] PROBLEM - Puppet freshness on srv281 is CRITICAL: 
Puppet has not run in the last 10 hours [23:45:07] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [23:45:08] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [23:45:08] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [23:45:09] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [23:45:09] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [23:46:18] PROBLEM - Host ms-be3 is DOWN: PING CRITICAL - Packet loss = 100% [23:46:27] PROBLEM - Host srv259 is DOWN: PING CRITICAL - Packet loss = 100% [23:47:39] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.9:11000 (Connection timed out) [23:52:10] !log powercycled srv259 [23:52:19] Logged the message, Master [23:53:20] New patchset: Ryan Lane; "Fix puppet db host setting for labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19966 [23:54:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/19966 [23:54:02] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/19966 [23:55:09] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [23:55:27] RECOVERY - Host srv259 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [23:56:03] Crap [23:56:05] chown: /export/upload/wikipedia/da/0/0e/Dameskr�ddersyning4stof.gif: No such file or directory [23:56:13] Why isn't this working *grumble* [23:56:34] RoanKattouw: there is some special character in there [23:57:10] Dameskr�dde.. [23:57:25] What the ... [23:57:27] No that's not it [23:57:28] They're symlinks [23:58:20] O.o [23:58:30] Broken symlinks [23:59:44] AaronSchulz: http://pastebin.com/5ES2tt6i
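
Editor's note on the chown thread above: the "No such file or directory" error on a path that find had just listed is consistent with a dangling symlink, since chown follows symlinks by default and the link target does not exist. A minimal Python sketch of the same "read the owner first, write only when it differs" approach that the find command in the log takes, with symlinks handled explicitly, might look like the following. The function name fix_owner and its dry_run parameter are hypothetical and are not what was run on ms7; the sketch also assumes a tree small enough to walk in one process.

```python
import os
import stat
import tempfile


def fix_owner(root, uid, dry_run=True):
    """Walk `root` and collect entries whose owner is not `uid`.

    Returns (changed, broken): `changed` lists paths whose owner differs
    from `uid` (and was updated unless dry_run is set), `broken` lists
    dangling symlinks found along the way.
    """
    changed, broken = [], []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat reads the metadata of the link itself, so a dangling
            # symlink does not raise "No such file or directory" here.
            st = os.lstat(path)
            if stat.S_ISLNK(st.st_mode) and not os.path.exists(path):
                broken.append(path)
            # Only write when the owner is actually wrong, to avoid
            # needless writes on a replicated, NFS-shared filesystem.
            if st.st_uid != uid:
                if not dry_run:
                    # lchown changes the link itself rather than its
                    # target, so it also works on dangling symlinks.
                    os.lchown(path, uid, -1)
                changed.append(path)
    return changed, broken


if __name__ == "__main__":
    # Small demonstration in a throwaway directory (no root needed,
    # because dry_run stays on).
    demo = tempfile.mkdtemp()
    open(os.path.join(demo, "file.gif"), "w").close()
    os.symlink(os.path.join(demo, "missing"), os.path.join(demo, "link.gif"))
    print(fix_owner(demo, os.geteuid()))
```

With dry_run left on, the function only reports what it would change, which gives a list to review before anything is written. On the command line, the comparable switch is chown -h, which acts on the symlink instead of its target.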