[01:47:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:48:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time
[01:56:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:57:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[02:15:55] !log LocalisationUpdate completed (1.22wmf3) at Thu May 9 02:15:54 UTC 2013
[02:16:03] Logged the message, Master
[02:28:44] PiRSquared: meet morebots
[02:29:17] morebots: help
[02:29:17] I am a logbot running on wikitech-static.
[02:29:17] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[02:29:17] To log a message, type !log <msg>.
[02:36:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:37:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.137 second response time
[03:39:56] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu May 9 03:39:55 UTC 2013
[03:40:04] Logged the message, Master
[04:08:02] New review: Faidon; "I don't think this is bad per se, but note that a few lines above there is a commented out section a..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62103
[04:09:36] New patchset: Faidon; "Swift: pep8 clean rewrite.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61889
[04:11:34] New patchset: Faidon; "Swift: pep8 clean rewrite.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61889
[04:13:37] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61889
[04:15:51] New patchset: Faidon; "Swift: remove wikimania2006 exception" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62946
[04:17:23] New patchset: Faidon; "Swift: remove wikimania2006 exception" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62946
[04:17:45] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62946
[04:20:43] Change abandoned: Faidon; "Too obsolete by now. Getting this up-to-date would be a similar amount of effort to making it from s..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16411
[04:24:34] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours
[04:34:45] New review: Yurik; "Faidon, we use X-CS (X-Carrier is going away) to test for specific ID, but setting it already skews ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62103
[04:41:48] New patchset: Faidon; "swiftrepl: initialize connection before use" [operations/software] (master) - https://gerrit.wikimedia.org/r/62947
[04:41:48] New patchset: Faidon; "swiftrepl: add an error message for IncompleteSend" [operations/software] (master) - https://gerrit.wikimedia.org/r/62948
[04:41:48] New patchset: Faidon; "swiftrepl: set NOBJECT to 1000" [operations/software] (master) - https://gerrit.wikimedia.org/r/62949
[04:42:17] Change merged: Faidon; [operations/software] (master) - https://gerrit.wikimedia.org/r/62418
[04:42:30] Change merged: Faidon; [operations/software] (master) - https://gerrit.wikimedia.org/r/62419
[04:42:57] Change merged: Faidon; [operations/software] (master) - https://gerrit.wikimedia.org/r/62947
[04:43:15] Change merged: Faidon; [operations/software] (master) - https://gerrit.wikimedia.org/r/62948
[04:43:34] Change merged: Faidon; [operations/software] (master) - https://gerrit.wikimedia.org/r/62949
[05:11:34] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours
[05:11:34] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[05:26:07] paravoid, ping
[05:50:34] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[06:44:34] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours
[06:58:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:59:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[07:25:14] !log Jenkins HTTP interface died :/
[07:25:22] Logged the message, Master
[07:26:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:28:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.591 second response time
[07:33:53] morning hashar
[07:34:03] jenkins is dead again :(
[07:34:14] well technically it works but it isn o more accepting http connection
[07:34:16] :(
[07:35:07] no longer! :D
[07:37:40] I wish there was a way to abort a thread in java
[07:38:42] gonna restart it :(
[07:39:05] !log restarting jenkins. All its http thread are locked :/
[07:39:13] Logged the message, Master
[07:39:15] :(
[07:47:05] hmmm
[07:49:24] how is zuul's memory consumption?
[07:49:29] does it ever leak?
[07:53:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:54:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[08:01:49] New patchset: Faidon; "Add a third-party cron module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62955
[08:04:52] New review: Faidon; "puppetdoc, rspec, manifest, Modulefile, Rakefile, README for a 80 lines of pure puppet code but not ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62955
[08:13:24] PROBLEM - Packetloss_Average on analytics1003 is CRITICAL: CRITICAL: packet_loss_average is 10.9447852672 (gt 8.0)
[08:13:34] PROBLEM - Packetloss_Average on analytics1006 is CRITICAL: CRITICAL: packet_loss_average is 12.2033658647 (gt 8.0)
[08:14:34] PROBLEM - Packetloss_Average on analytics1004 is CRITICAL: CRITICAL: packet_loss_average is 11.1855046154 (gt 8.0)
[08:16:34] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[08:17:14] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 11.8206513386 (gt 8.0)
[08:17:24] RECOVERY - Packetloss_Average on analytics1003 is OK: OK: packet_loss_average is 1.17573723577
[08:17:34] PROBLEM - Packetloss_Average on gadolinium is CRITICAL: CRITICAL: packet_loss_average is 10.530239633 (gt 8.0)
[08:17:35] RECOVERY - Packetloss_Average on analytics1006 is OK: OK: packet_loss_average is -0.169203090909
[08:18:34] RECOVERY - Packetloss_Average on analytics1004 is OK: OK: packet_loss_average is 0.134039166667
[08:18:44] PROBLEM - Packetloss_Average on analytics1008 is CRITICAL: CRITICAL: packet_loss_average is 10.7626705385 (gt 8.0)
[08:20:52] !log upgrading Jenkins plugins 'promoted build' and 'M2 release'. That should fix the Gerrit jobs disappearing (ping qchris)
[08:21:00] Logged the message, Master
[08:21:14] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is 0.927818596491
[08:21:34] RECOVERY - Packetloss_Average on gadolinium is OK: OK: packet_loss_average is 0.605147391304
[08:21:54] ori-l: I don't think Zuul leaks memory.
[08:22:01] !log restarting Jenkins again
[08:22:08] Logged the message, Master
[08:22:40] yeah, i checked out the code to see if it's lingering connections to jenkins but it doesn't look like it
[08:22:44] RECOVERY - Packetloss_Average on analytics1008 is OK: OK: packet_loss_average is 0.048695042735
[08:23:14] PROBLEM - Packetloss_Average on analytics1005 is CRITICAL: CRITICAL: packet_loss_average is 9.58535916667 (gt 8.0)
[08:23:36] ori-l: I will also make zuul to use the private IP instead of the https / public one
[08:23:47] that will let Zuul connect directly to Jenkins
[08:23:52] instead of via the apache proxy
[08:24:09] yeah, that's nice; probably won't make a huge difference tho
[08:26:21] Change restored: Hashar; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60866
[08:26:24] New patchset: Hashar; "Jenkins job validation (DO NOT SUBMIT)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60866
[08:27:00] Change abandoned: Hashar; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60866
[08:27:14] RECOVERY - Packetloss_Average on analytics1005 is OK: OK: packet_loss_average is 0.18957173913
[08:29:59] so jenkins restarts just fine now
[08:32:45] shower time
[08:54:07] !log Jenkins apparently working fine. I have upgraded a few plugins, that seems to have fixed the Gerrit jobs.
[08:54:13] off again bbl
[08:54:15] Logged the message, Master
[08:57:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:59:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[09:07:34] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours
[09:26:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:28:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[09:50:34] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:34] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:34] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[10:31:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:32:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.140 second response time
[11:15:50] New patchset: ArielGlenn; "kiwix mirror stanza moved to mirror role and added system_role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62963
[11:20:39] New review: Hashar; "Looks fine, I will let you deploy it :-]" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/62814
[11:23:56] New patchset: ArielGlenn; "kiwix mirror stanza moved to mirror role and added system_role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62963
[11:24:28] !log upgrading Zuul to latest master (makes report times friendlier)
[11:24:37] Logged the message, Master
[11:27:44] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62963
[11:43:02] New patchset: Hashar; "nginx_site now uses boolean values" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62583
[11:43:46] New review: Hashar; "I have converted the $enable and $install parameters of nginx_site to be boolean values and updated ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62583
[11:44:50] New patchset: ArielGlenn; "Revert "kiwix mirror stanza moved to mirror role and added system_role"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62965
[11:46:07] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62965
[11:49:22] New patchset: Hashar; "proxy_configuration $ipv6_enabled is now boolean" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62966
[11:51:14] New patchset: Hashar; "labs: hardcode nginx types_hash_bucket_size to 64" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62603
[11:51:29] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62583
[11:51:52] New review: Hashar; "I have rebased this change against tip of production. It is an independent change we can merge right..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62603
[11:52:17] apergos: and you might want to pick https://gerrit.wikimedia.org/r/#/c/62603/ as well
[11:52:31] apergos: that is tweaking the bucket sizes for nginx on labs. must be harmless in production
[11:52:48] I need to rebase my lame change now :/
[11:54:08] yes, I had only wited because it was commited as part of the series (with dependency), otherwise I would have merged it right away
[11:54:45] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62603
[11:55:00] have you sorted out how to get the nginx changes deployed in production?
[11:55:06] err
[11:55:16] you don't read your pms do you
[11:55:19] :-P
[11:55:24] :-]
[11:55:30] must have dropped it sorry
[12:06:34] New patchset: Hashar; "proxy_configuration $ipv6_enabled is now boolean" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62966
[12:49:58] New patchset: Hashar; "puppet-lint protoproxy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62973
[13:07:11] New patchset: Hashar; "puppet-lint protoproxy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62973
[13:07:29] New review: Hashar; "PS2 removes the remaining trailing semi columns." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62973
[13:31:53] New patchset: Hashar; "doc for proxy_configuration define" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62976
[13:31:54] New patchset: Hashar; "protoproxy proxy_addresses is now optional" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62977
[13:32:21] New review: Hashar; "That one is straightforward. Might want to write more documentation though." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62976
[13:32:42] New review: Hashar; "Need to be tested out in labs first." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/62977
[14:01:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time
[14:06:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error
[14:16:50] Q: Has anyone tested removing files from the servers without suppressing them first?
[14:17:22] I assume things would break pretty badly if it happened, wouldn't they?
[14:19:02] ^demon, mutante, and paravoid – you guys should know things, right? :)
[14:20:17] <^demon> Huh?
[14:22:37] ^^ what would happen if someone removed files server-side without deleting them via MediaWiki, I wonder
[14:23:42] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62161
[14:24:52] <^demon> Like on the cluster?
[14:25:22] <^demon> Next time someone checked stuff out, it'd check the file out again.
[14:25:31] <^demon> Oh, in MW.
[14:25:34] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours
[14:25:35] <^demon> Oh, it'll probably barf.
[14:25:53] <^demon> (I thought you meant like delete MW classes in deployment)
[14:25:58] <^demon> Not uploaded files.
[14:26:04] no, should've made myself clearer
[14:26:12] I assume people would get shit errors
[14:26:43] <^demon> For a period of time, while the caches were inconsistent with what's on disk.
[14:27:22] ah, and then a simple 'file does not exist, you can upload it' message?
[14:29:04] <^demon> I dunno tbh, since the image table entries would still be inconsistent.
[14:29:37] <^demon> I imagine we at least handle the case semi-decently.
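The "puppet-lint protoproxy" patchsets hashar uploaded earlier in this stretch (12:49 and 13:07) are pure style cleanups; PS2 "removes the remaining trailing semi columns". A sketch of running the linter locally before uploading such a change; puppet-lint is a Ruby gem, and the manifest path here is an assumption:

    gem install puppet-lint
    puppet-lint manifests/misc/protoproxy.pp    # path assumed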
[14:31:03] New patchset: Umherirrender; "$wgFlaggedRevsNamespaces for dewiki: Add NS_MODULE" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62979
[14:32:32] odder, it will fail to retrieve the file, create thumbs, etc. if we don't have it in squid (varnish?) cache then there will be a failure to retrieve from upload, and so it won't display. In an article with a thumb you'll get a placeholder of some sort with the name of the file, on the file description page you might get the ''broken image' icon (don't remember now), from upload.wm.org you'll get a whine about being unable to serve the m
[14:32:50] New review: Umherirrender; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62979
[14:33:48] if you try to suppress after removing (hard removal from filesytem) the media, the suppresion might fail
[14:33:53] depends on the order that happens
[14:34:01] best to test it on a local instance
[14:34:08] *after dong
[14:35:19] doing !
[14:35:26] thanks apergos
[14:35:32] yw
[14:36:48] there is a maintenance script that fixes up that sort of thing iirc
[14:40:21] cleanupImages.php might do it but I have not ever tested it myselff
[14:40:46] there is also this scary script written by AaronSchulz coming up :)
[14:41:04] which one is that?
[14:41:31] * odder searches Bugzilla
[14:42:05] https://gerrit.wikimedia.org/r/#/c/62549/
[14:42:24] New review: Raimond Spekking; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62979
[14:43:27] er
[14:43:31] really? ummm
[14:43:36] I mean sometimes those get undeleted
[14:43:52] example: the great commons jimbo purge of 2010 (or whenever it was)
[14:43:57] some of those got restored
[14:44:52] yes, but these were MediaWiki-deleted
[14:44:56] ok well if it's jus provided as a script for third party I guess... :-D
[14:45:02] that script is to be used by legal
[14:45:17] scary script for scary stuff.
[14:45:18] ah
[14:45:33] so after deleted via mw then this goes through and gets the rest
[14:45:46] makes sense, right now they have to drop an rt ticket every time
[14:45:51] yep.
[14:46:18] this is just BTW; I was asked a question whether the WMF uses different ways to delete stuff than suppression + server-side, which I'm sure you don't
[14:46:27] and wondered what would happen :)
[14:46:54] delete/suppress and then "nuke with file" = remove from nfs or swift or whever it happens to live on real storage
[14:46:58] that's about how it works
[14:47:19] I know I have in the past nuked without suppression
[14:47:36] it only affects folks trying to retrieve the specific image so....
[14:47:55] ooo, how long in the past?
[14:48:20] I wonder if 'scubVersion' should be 'scrubVersion' everywhere in there
[14:48:25] oh not this year
[14:48:55] I guess it must have been more than a year ago
[14:49:27] this script would help a lot, I remember having a lot of fun with mutante the other day, trying to figure out where a file was kept :)
[14:49:36] yes, it's harder now
[14:49:45] you have to dig it out of ceph as well as swift
[15:12:34] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours
[15:12:34] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[15:18:27] Anyone mind if I deploy gerrit change 62814 quick? It's preparation for enabling CodeEditor for core at some point in the future.
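For the cleanupImages.php route apergos mentions at 14:40, maintenance scripts on the cluster are normally run through the mwscript wrapper. A sketch; the target wiki and the flag names are assumptions based on the TableCleanup family of scripts, so check --help before trusting them:

    # report-only pass over the image table for rows whose files are broken
    mwscript cleanupImages.php --wiki=commonswiki --dry-run
    # rerun without --dry-run to actually repair the rows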
[15:22:51] New review: Anomie; "per Hashar" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/62814
[15:23:04] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62814
[15:26:10] !log anomie synchronized wmf-config/InitialiseSettings-labs.php 'Make $wgCodeEditorEnableCore configurable per wiki, prep for bug 39653'
[15:26:17] Logged the message, Master
[15:26:24] !log anomie synchronized wmf-config/InitialiseSettings.php 'Make $wgCodeEditorEnableCore configurable per wiki, prep for bug 39653'
[15:26:31] Logged the message, Master
[15:26:38] !log anomie synchronized wmf-config/CommonSettings.php 'Make $wgCodeEditorEnableCore configurable per wiki, prep for bug 39653'
[15:26:45] Logged the message, Master
[15:32:33] New patchset: Hoo man; "$wgFlaggedRevsNamespaces for dewiki: Add NS_MODULE" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62979
[15:33:51] New review: Hoo man; "Fixed encoding problem" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62979
[15:42:32] anomie: looks good (as in, it didn't break anything) ;)
[15:43:07] greg-g: Yeah, I even remembered to check basic functionality right after deploying ;)
[15:50:12] :)
[15:51:34] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[15:56:09] !log powering down barium /relocate server to c1
[15:56:16] Logged the message, Master
[16:08:10] New patchset: RobH; "hooper no long racktables host, removing old racktables stuff" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62983
[16:08:14] cmjohnson1: I got shipment notification of the hard disks
[16:08:16] for barium
[16:08:24] yep..installing now
[16:08:29] oh, nice!
[16:08:32] i get the notices to :-P
[16:08:40] well, i meant the newegg one
[16:08:41] not the eq one
[16:08:45] (notification)
[16:08:55] i saw you handled the eq one already, wasnt sure if it was same thing
[16:08:55] oh..yeah..i got an eq notice...
[16:09:07] cool
[16:09:18] yep..while we're on topic....have your heard from dominion freight?
[16:09:31] and still no addt'l ssds from amazon
[16:09:58] dominion for what?
[16:10:09] the remaining 10 parsoid servers should arrive today as well
[16:10:12] (i already put a ticket in)
[16:10:21] we are short 1 400GB ssd right?
[16:10:23] yes..the other 10 server
[16:10:32] yes on the ssd
[16:12:01] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62983
[16:13:14] New patchset: RobH; "Revert "hooper no long racktables host, removing old racktables stuff"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62984
[16:13:41] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62984
[16:29:58] New patchset: Aaron Schulz; "Added NS_MODULE as a reviewable namespace." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62985
[16:31:26] New review: Aaron Schulz; "WIP" [operations/mediawiki-config] (master) C: -2; - https://gerrit.wikimedia.org/r/62985
[16:45:34] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours
[16:46:18] New patchset: Aaron Schulz; "Added NS_MODULE as a reviewable namespace." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62985
[16:55:36] paravoid, hi, are you around?
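The "anomie synchronized wmf-config/..." entries at 15:26 are the log lines emitted by the sync-file deployment script, which pushes a single file from the deploy host to the apaches and records the comment in the Server Admin Log. A sketch of the invocation behind them; the checkout path is an assumption about the deployment setup of the time:

    cd /home/wikipedia/common    # common config checkout on the deploy host, path assumed
    sync-file wmf-config/InitialiseSettings.php \
        'Make $wgCodeEditorEnableCore configurable per wiki, prep for bug 39653'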
[17:21:13] New review: Umherirrender; "Unneeded, when I59989f7c gets merged" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62979
[17:33:44] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer:
[17:38:45] yurik: yes?
[17:40:07] paravoid, hi! i replied to your comment on https://gerrit.wikimedia.org/r/#/c/62103/ -- basically spoofing X-CS skews the results and allows us to test only very limited aspects of the zero. Spoofing IP would do a full testing
[17:41:44] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[17:44:44] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer:
[17:45:44] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[17:49:21] New patchset: Pgehres; "Enabling wgCentralAuthAutoMigrate on all wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62995
[17:50:11] heya ops: this page looks hand written, and I don't know where to submit a diff... so...
[17:50:39] https://noc.wikimedia.org/conf/ the link for the puppet configs links to the lucene specific folder, it should be https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=tree;hb=HEAD instead
[17:51:05] (3rd href on the page)
[17:54:48] New review: Pgehres; "Looks like private and fishbowl don't even call CA, so it should be a no-op, but I can match this fo..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62995
[17:57:56] New review: CSteipp; "This shouldn't effect private wikis, and will help with unification. If we see any issues we can rev..." [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/62995
[17:58:20] New patchset: Alex Monk; "Add throttle exception for Haifa University workshop on 12/5/13" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62998
[17:58:25] Change merged: Pgehres; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62995
[17:59:48] New patchset: Alex Monk; "Add throttle exception for Haifa University workshop on 12/5/13" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62998
[18:00:05] !log pgehres synchronized wmf-config/InitialiseSettings.php 'Enabling wgCentralAuthAutoMigrate on all wikis that use CentralAuth'
[18:00:12] Logged the message, Master
[18:17:34] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[18:20:00] New patchset: Pgehres; "Enabling wfDebugLog for CentralAuth" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63001
[18:26:11] Change merged: Pgehres; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63001
[18:27:12] !log pgehres synchronized wmf-config/InitialiseSettings.php 'Enabling CentralAuth debug log'
[18:27:20] Logged the message, Master
[18:37:09] New patchset: Diederik; "Ensure that all UMAPI files belong to wikidev group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63005
[18:55:42] !log attempting to restart broken puppetmaster on stafford
[18:55:50] Logged the message, Master
[19:05:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.829 second response time
[19:08:34] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours
[19:09:14] !log restarted Apache and then puppetmaster on stafford, that revived it
[19:09:23] Logged the message, Master
[19:16:26] !log fixing file permissions in /a/e3/E3Analysis repo on stat1001
[19:16:34] Logged the message, Master
[19:22:20] New patchset: Diederik; "Ensure that all UMAPI files belong to wikidev group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63005
[19:22:46] New review: Hashar; "I can confirm this is working in labs. I have tried out calling proxy_configuration{} with no proxy_..." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/62977
[19:24:24] New review: Dzahn; "yea, i just fixed those file permissions manually using find. some files were owned by group root. i..." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/63005
[19:28:27] New review: Dzahn; "recheck" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63005
[19:31:29] New review: Hashar; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62582
[19:31:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:32:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[19:32:47] hashar: do you approve of using "recurse" in puppet inside git::clone to ensure file permissions?
[19:33:18] hmm
[19:33:30] I had that issue a few months ago I think
[19:35:06] i just told Diederik to just avoid the recurse if its not really needed
[19:35:12] mutante: so git::clone is supposed to create the files according to use and group passed. I have added the group part with https://gerrit.wikimedia.org/r/#/c/30988/
[19:35:13] and fixed some permissions by hand using find
[19:35:25] and puppet didnt rebreak them
[19:35:27] recurse tend to be evil IIRC
[19:35:34] thats what i thought , yea
[19:35:42] cause on each puppet run it will traverse all the tree to find out what is wrong
[19:35:45] i didnt really want to encourage its use
[19:35:56] drdee: ^
[19:36:10] if they got altered somehow, I guess the best is to fix it manually
[19:36:32] https://gerrit.wikimedia.org/r/#/c/63005/
[19:36:40] this looks harmless without recurse now
[19:36:44] can has validate?
[19:37:12] yep, the manual fix has been applied, thanks for input hashar
[19:38:03] and in this case, I guess the repo should be owned by root:root
[19:38:11] since the git::clone has ensure => latest
[19:38:22] probably don't want anyone to play with that repo beside root (aka puppet)
[19:38:33] the issue is he wants both
[19:38:39] manual git pull as human
[19:38:43] and puppet doing it
[19:38:52] ah to force an update I gues
[19:38:53] s
[19:38:56] yup
[19:39:03] yea, doesnt want to wait 30 min when deploying
[19:39:06] but then if something goes wrong, puppet might in turn fail
[19:39:50] it's all owned by stats:wikidev
[19:40:01] and most of it was, just some were group root
[19:40:14] the fix was to make them all stats:wikidev so far
[19:40:42] so
[19:40:57] drdee: so when manually pulling, do it as the stats user?
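The manual fix Dzahn describes in the 19:24 review ("fixed those file permissions manually using find. some files were owned by group root") boils down to resetting group ownership, and optionally group write, on whatever the clone left behind. A sketch against the repo path from the 19:16 !log; the exact modes are assumptions:

    # hand files that ended up group root back to wikidev
    find /a/e3/E3Analysis -group root -exec chgrp wikidev {} +
    # ensure the group can write, so any deployer can update the checkout
    find /a/e3/E3Analysis ! -perm -g=w -exec chmod g+w {} +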
[19:41:00] if puppet maintain the clone, the files belong to root:root to prevent humans to alter the git repo and thus potentially make puppet file
[19:41:07] if a human need to update it, it needs root
[19:41:13] preferably as own user
[19:41:30] not root because then ottomata is the only on the team who can do it
[19:41:34] if the repo is to be updated by human, that should be out of puppet; possibly with git-deploy
[19:41:39] and there it is a mixed case :-]
[19:42:01] i agree, i said earlier you can also just say deploying is always done by human as sanity-check
[19:42:14] and not let puppet do the pull automatically
[19:42:35] but have multiple deployers
[19:42:53] Zuul used to be deployed by a git::clone ensure => latest
[19:43:09] with a repo to which i pushed master
[19:43:11] ensure => latest in general, also with packages, has surprised us in the past
[19:43:15] then had to run puppet myself
[19:43:16] not ideal
[19:43:25] nowadays I su and git pull then install
[19:43:49] which in turns need root access grbmbl
[19:43:50] on blog or bugzilla i also just git pull manually, didnt really want puppet do automatically do it if somebody merges
[19:44:05] hmm
[19:44:13] in this case I would get it out of puppet
[19:44:18] and let the team manually deploy
[19:45:22] said all that i'd merge that little change anyways, just adding the group
[19:45:34] but jenkins..
[19:45:42] doesnt like to verify
[19:46:03] and Zuul / Jenkins could potentially be used to refresh the local repo after merge
[19:46:25] I did that for the integration.wikimedia.org website which is updated whenever a change is merged in integration/docroot.git
[19:46:35] but that needs a jenkins slave installed on the host :(
[19:46:48] so yeah just wikidev it but I would remove the ensure latest and let the team update manually
[19:47:42] drdee: so, i'm going for lunch, up to you if you want to add another patch set to remove "ensure => latest" now or not, i can merge it when i'm back
[19:48:04] i will have ottomata look at next week
[19:48:10] thanks for your ideas
[19:48:29] sure, no problem, thank hashar
[19:48:30] bbiaw
[19:51:34] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[19:51:34] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[19:51:34] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[19:55:38] New patchset: Cmjohnson; "Changing mac address for aluminium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63019
[19:56:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:57:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[20:04:01] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63019
[20:19:57] !log Jenkins died :/
[20:20:05] Logged the message, Master
[20:20:25] !log restarting Jenkins.
[20:20:32] Logged the message, Master
[20:21:07] jeff_green: aluminum has mainboard swapped. not sure how to fix w/out a reinstall due to mac address nic change...other than maybe changing if interface to eth1 instead of eth0 in networking/interfaces?
[20:21:46] I don't understand?
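The pattern hashar and mutante converge on above (drop ensure => latest, let the team update the clone) comes down to pulling as the owning user rather than root, so the stats:wikidev ownership puppet expects never drifts. A minimal sketch, assuming the stats user owns the checkout:

    # update the checkout on demand instead of waiting ~30 min for a puppet run
    sudo -u stats sh -c 'cd /a/e3/E3Analysis && git pull'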
[20:22:25] we should be able to console in and tweak any necessary network settings, but afaik it's just binding the IP to what the kernel sees as eth0
[20:23:04] if you can get the DRAC up on its normal IP with the normal login, I think I can take it from there
[20:24:33] the drac is up...but i can't ping the server
[20:24:39] it's all yours than!
[20:24:48] k. will you be around for a bit if I get stuck?
[20:25:06] yep
[20:25:08] like 20 minutes == while
[20:25:12] thx. looking now
[20:30:24] mutante: jenkins failed: https://gerrit.wikimedia.org/r/#/c/63005/
[20:36:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:37:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.146 second response time
[20:40:08] hashar: heh, hey come it says LOST now instead of FAILED ?
[20:40:21] how
[20:40:28] Jenkins got restarted
[20:40:30] :D
[20:40:41] so LOST means like i lost memory of the change?
[20:40:42] :)
[20:41:02] says so on 63005, lemme recheck
[20:41:09] New review: Dzahn; "recheck" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63005
[20:41:50] <^demon> hashar: That might explain why jmap wouldn't connect to the proc :p
[20:42:40] ^demon: I think I killed Jenkins by using strace :(
[20:43:08] New patchset: coren; "Tool Labs: Add libjson-perl to exec nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63063
[20:43:37] <^demon> strace isn't usually super useful a multithreaded program running in the jvm.
[20:43:44] <^demon> You get way too much jvm noise.
[20:43:49] I guess
[20:44:06] using top (and showing threads with H )
[20:44:11] I managed to get the pid of the thread
[20:44:13] http://itsecureadmin.com/2010/12/using-strace-to-attach-to-a-multi-threaded-process-like-a-jvmjava/
[20:44:22] but could not find it in jstack nor in Jenkins thread dump
[20:44:42] ( http://integration.wikimedia.org/ci/threadDump )
[20:44:49] <^demon> http://www.fromdev.com/2008/12/debugging-java-on-unixlinux-my-favorite.html, mostly duh info
[20:45:30] !log olivneh synchronized php-1.22wmf3/extensions/ConfirmEdit 'Updating ConfirmEdit to master'
[20:45:38] Logged the message, Master
[20:45:45] !log olivneh synchronized php-1.22wmf3/extensions/EventLogging 'Updating EventLogging to master'
[20:45:53] Logged the message, Master
[20:46:00] !log olivneh synchronized php-1.22wmf3/extensions/EventLogging 'Updating GuidedTour to master'
[20:46:07] Logged the message, Master
[20:46:07] ^demon: ahh I am learning some new commands :-D
[20:46:19] jmap to connecto live java process sounds nice
[20:47:30] jstack is very nice
[20:47:42] !log olivneh synchronized php-1.22wmf3/extensions/GuidedTour 'Actually updating GuidedTour to master'
[20:47:47] gives you a full stack trace of all the threads together with the source filename / line
[20:47:50] Logged the message, Master
[20:48:10] I guess I will try pstack next time
[20:48:15] <^demon> Yeah, jstack and jmap are your two big tools in the "debug a java problem" belt.
[20:48:35] bah pstack is not on gallium
[20:48:45] you know what would be fancy? a shared bookmark server in the form of a mozilla sync server. so we could put useful bookmarks in common place and sync to browsers
[20:51:09] http://en.wikipedia.org/wiki/Firefox_Sync + http://docs.services.mozilla.com/howtos/run-sync.html but i just thought of the bookmarks part, not the passwords etc i suppose;)
[20:51:34] RECOVERY - Host aluminium is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[20:52:11] ^ \o/
[20:52:13] New patchset: Hashar; "install pstack on gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63064
[20:52:48] New patchset: Petrb; "inserted 2 more packages needed to compile" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63065
[20:52:50] mutante: a simpler solution could be to publish the bookmarks as RSS feed and use ff's 'live bookmarks' feature http://www.mozilla.org/en-US/firefox/livebookmarks.html
[20:53:18] all you'd need is the bookmarks formatted as an rss feed on noc.wikimedia.org or whatever
[20:53:46] mutante: I could use `pstack` on gallium. That is an utility to display a stack trace of a running process ( https://gerrit.wikimedia.org/r/63064 )
[20:55:45] ori-l: i suppose with the sync server we could upload new bookmarks via browser while putting them in a file/feed would mean we'd have to actually put them in git
[20:55:58] mutante: yeah, that's true
[20:56:29] hashar: yea, let's do that, jenkins debugging ++
[20:56:38] will run puppet :-]
[20:56:42] just need sock puppet merge
[20:57:04] New review: Dzahn; "debugging jenkins is a good thing" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/63064
[20:57:30] New review: Dzahn; "manual verify, debug jenkins" [operations/puppet] (production); V: 2 - https://gerrit.wikimedia.org/r/63064
[20:57:31] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63064
[20:58:33] hashar: you can run it
[20:58:45] mutante: thanks!
[20:59:16] sure, you should have debug tools when needed
[21:01:17] New patchset: Brion VIBBER; "Update FirefoxOS app with language fix" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63067
[21:01:24] are there known issues with bits.wikimedia.org being slow? Asking as there's https://bugzilla.wikimedia.org/show_bug.cgi?id=48257
[21:01:42] * andre__ was offline yesterday (public holidays), so wondering if something on IRC was going on
[21:02:28] speaking of bits.wikimedia.org, I have a bug fix update for the FirefoxOS app which is hosted there: https://gerrit.wikimedia.org/r/63067
[21:02:34] <^demon> andre__: There was some networking issues yesterday that cascaded to a couple of other things.
[21:02:43] <^demon> Timing looks about right.
[21:03:41] <^demon> brion: We include code from github as a submodule in mediawiki-config? o_O
[21:03:52] ^demon: we do! blame preilly probably :)
[21:04:05] ^demon: I see, thanks for the info. Didn't see anything on ops@ either. Has this been mostly fixed?
[21:04:06] <^demon> That scares me.
[21:04:15] <^demon> andre__: Yes. See engineering@
[21:04:42] ^demon: argh, thanks!
[21:05:01] <^demon> brion: Ugh, over git://?
[21:05:03] <^demon> Fuck. That.
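The top/jstack workflow hashar and ^demon trade above goes: find the hot thread's LWP id with top in threads mode, convert it to hex, then look for the matching nid=0x... frame in a full JVM stack dump. A sketch; the pgrep pattern for the Jenkins process is an assumption:

    JPID=$(pgrep -f jenkins.war | head -1)     # Jenkins JVM pid, pattern assumed
    top -H -p "$JPID"                          # -H lists individual threads
    printf '%x\n' 12345                        # busy LWP id from top, in hex
    sudo -u jenkins jstack "$JPID" > /tmp/jenkins-threads.txt

jstack has to run as the JVM's own user, which is one reason the thread hunt above also leaned on Jenkins' built-in /ci/threadDump page.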
[21:05:36] ^demon: feel free to change how it's included
[21:05:41] as long as it goes out i'm happy :)
[21:06:14] <^demon> subtree merge :)
[21:06:20] oh my
[21:06:39] New patchset: coren; "Tool Labs: Add missing packages for tools" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63063
[21:07:11] New patchset: Hashar; "enable Jenkins access log" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63068
[21:07:22] Is jenkins broken?
[21:07:25] Or just slow?
[21:08:18] Coren: slow :-D
[21:08:50] Coren: it is being debugged right now , hashar just got pstack installed on gallium
[21:09:13] well it is slow right now because l10n-bot submitted ton of translations
[21:09:29] mutante: https://gerrit.wikimedia.org/r/63068
[21:09:41] mutante: that will enable access log on jenkins whenever I restart it :-D
[21:09:57] and I will make another change following up
[21:10:26] andre__: i saw Leslie forwarded a report about some networking issues to Mark/Faidon a couple hours ago
[21:13:22] New review: Dzahn; "looks reasonable for debugging purposes" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/63068
[21:13:28] New review: Dzahn; "looks reasonable for debugging purposes" [operations/puppet] (production); V: 2 - https://gerrit.wikimedia.org/r/63068
[21:13:28] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63068
[21:14:09] hashar: you can restart
[21:15:30] New patchset: Hashar; "make Zuul query Jenkins directly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63069
[21:16:11] mutante: and the last is https://gerrit.wikimedia.org/r/63069 :D
[21:16:22] that is not going to speed up things but that will help
[21:16:39] you are quick writing those commit messages :p
[21:16:41] !log olivneh synchronized php-1.22wmf3/extensions/WikimediaMessages 'Updating WikimediaMessages to master'
[21:16:50] Logged the message, Master
[21:16:58] remember I had a job where I was mostly writing hehe
[21:18:26] mutante: the reason I write long commit, is that it is very helpful when coming back on a change a few months after :D
[21:18:29] New review: Dzahn; "per commit message, makes sense, yeah. and per hashar "that is not going to speed up things but that..." [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/63069
[21:18:30] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63069
[21:18:32] I am addicted to git blame
[21:19:01] done
[21:19:07] thx
[21:19:14] yw
[21:19:34] mutante: on a different topic, I have seen a puppet install that showed the git revision as a catalog version instead of a timestamp :-D
[21:20:24] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62669
[21:20:41] aah, that sounds nice actually
[21:20:54] http://rcrowley.org/talks/sv-puppet-2011-01-11/#76
[21:20:58] 'caching catalog terminus'
[21:21:13] which uses git rev-parse HEAD as an identifier
[21:21:17] instead of a timestamp
[21:21:31] I am wondering if that could speed up the operations on the puppetmaster
[21:21:45] cause I suspect it is currently recompiling the full catalog on each requests made during the same second
[21:22:33] ah, now i ended up looking at this http://linux.die.net/man/8/puppet-catalog
[21:22:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:22:42] :-D
[21:22:58] afaik we'd still want puppetdb to speed up puppetmaster
[21:23:09] but i'd have to be postgres, no mysql/maria
[21:23:15] maybe that can buy you some time
[21:23:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.142 second response time
[21:23:32] should bring it up on list
[21:24:58] some friends told me they get a puppet master on each server
[21:25:03] then just rsync the git repo
[21:25:17] hashar: any knowledge about this ? https://gerrit.wikimedia.org/r/#/c/62032/
[21:25:59] the ticket it refers to is called "#5073: updateinterwikicache missing from fenari" created by reedy
[21:26:13] i'd be resolved by just installing those scripts, but.. shrug
[21:27:05] mutante: no idea
[21:27:06] hashar: on _each_ server? and rsync? ehmm.. well that sounds ..different...
[21:27:11] so, quick question. Difference between test and test2 now is only the database they point to? (ie: no longer any difference like "test reads from NFS while test2 is on production cluster like the rest of them")?
[21:27:23] greg-g: test is still only on a specific host
[21:27:32] AFAIK..
[21:27:39] srv193
[21:27:45] afair
[21:27:47] hrm, ok
[21:27:54] yeah
[21:27:58] <^demon> !log gerrit repositories will now recursively merge
[21:28:06] all it's neighbour servers have been decomissioned
[21:28:07] Logged the message, Master
[21:28:09] somewhat amusingly
[21:28:27] sibling?
[21:28:50] Reedy: so this page is accurate or no? https://wikitech.wikimedia.org/wiki/Test.wikipedia.org
[21:28:54] !log pdns-update to cut back to aluminium, restarted pdns on nescio b/c it failed
[21:29:02] Logged the message, Master
[21:31:10] !log restarting Zuul to make it query Jenkins directly without SSL + apache proxy {{gerrit|63069}}.
[21:31:17] Logged the message, Master
[21:32:26] Reedy: "relatively" is an acceptable answer, btw :)
[21:33:23] Looks to be
[21:33:33] actually,... test.wikipedia.org is an alias for wikipedia-lb.wikimedia.org.
[21:33:38] test2.wikipedia.org is an alias for wikipedia-lb.wikimedia.org.
[21:33:47] would have expected a single server..
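The rcrowley trick mutante links at 21:20 (git revision as the catalog version) corresponds to puppet's config_version setting, a command the master runs once per compile. A sketch, using the stafford repo path mutante quotes a bit further down; none of this is what was actually deployed:

    # append to /etc/puppet/puppet.conf on the master (sketch only)
    cat >> /etc/puppet/puppet.conf <<'EOF'
    [master]
    config_version = /usr/bin/git --git-dir=/var/lib/git/operations/puppet/.git rev-parse HEAD
    EOF

On its own this only changes the reported version string; the caching terminus from the slides is what would actually skip redundant compiles.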
[21:33:54] i must have missed that
[21:33:56] our caching redirects
[21:34:05] ah
[21:34:35] Reedy: thanks
[21:40:05] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63005
[21:40:26] drdee: and that was your change
[21:42:04] !log forced Zuul to stop. All events have been dropped :(
[21:42:11] Logged the message, Master
[21:44:44] !log restarted Jenkins to enable access logs in /var/log/jenkins/access.log {{gerrit|63068}}
[21:44:52] Logged the message, Master
[21:52:42] Reedy: i suppose this has not happened yet? 'setSiteInfoForWiki in multiversion/MWMultiVersion.php has been updated to work with the new docroot layout'
[21:52:54] Nope
[21:53:02] Have you looked at that code? :p
[21:53:19] no, i just looked at "simplify wikimania apache conf" you wrote
[21:53:38] says it is dependent
[21:54:50] !log Jenkins / Zuul is back up (for now).
[21:54:58] Logged the message, Master
[21:55:00] :)
[21:55:10] so that is crazy
[21:55:14] I had to restart Jenkins twice
[21:55:25] the first time it was getting wild
[21:55:52] mutante: thanks for your help :-]
[21:56:11] going to get a few hours of sleep, if it crash tomorrow I will get some traces with pstack
[21:56:19] and finally be able to fill a bug upstream
[21:57:32] ok, go get sleep, we'll discuss favicon.php vs. favicon.ico another day, heh:)
[21:57:47] thanks for fixing it !!
[21:58:02] mutante: what's the path to the puppet repo's gitdir on the puppet master?
[21:58:06] upstream bug sounds good
[21:59:02] ori-l: on sockpuppet: /root/puppet/ on stafford: /var/lib/git/operations/puppet/
[21:59:17] why do you ask
[21:59:42] trying to think of an elegant way to emit the git sha1, as hashar suggested earlier
[22:00:17] http://docs.puppetlabs.com/references/latest/function.html#file is nice in that it accepts multiple paths and outputs the first one that it finds; trying to determine what happens if none are found.
[22:00:41] ah
[22:02:28] New review: Dzahn; "dunno, "same like in production" and "already made" make sense, but not adding more complexity also ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62125
[22:02:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:02:39] ori-l: git rev-parse HEAD ? :D
[22:03:02] hashar: yes, that would work if passed to 'generate', but you need to specify --git-dir and full path to git
[22:03:11] :/
[22:03:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time
[22:03:44] a custom facter fact is probably the way to go, but i'm too lazy for that
[22:05:35] anyway bed time for me *wave*
[22:05:38] <^demon> ori-l: If you're poking git::clone, I would beg that you could set ensure => $sha1 and have it check out.
[22:05:41] ciao hashar
[22:06:00] ^demon: oh, that sounds like fun
[22:06:02] i'll do that
[22:06:02] New review: Dzahn; "Hey Jan, just wondering if you are planning to create new patchsets here or abandon it." [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53403
[22:06:27] ori-l: na there is no such thing :-D
[22:06:44] I had a patch for git::clone to ensure => $sha1 but it did not get merged
[22:07:06] https://gerrit.wikimedia.org/r/#/c/27175/
[22:07:40] now I am sleeping
[22:07:47] !change https://gerrit.wikimedia.org/r/#/c/33713/5 | platform-engineering
[22:07:47] platform-engineering: https://gerrit.wikimedia.org/r/#q,https://gerrit.wikimedia.org/r/#/c/33713/5,n,z
[22:08:56] ^demon: shrug. it's a good change. i don't agree with faidon's review, but oh well.
[22:09:11] hashar's, above, i mean.
[22:10:19] <^demon> Meh
[22:14:11] New patchset: Yurik; "* Optimized opera_mini ACLs and added more IPv4 & IPv6 ranges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63077
[22:14:33] local bazaar commit ? oh wow, ori...
[22:14:58] i know, it's crazy, but things are the way they are, and i wasn't sure what would be better
[22:15:18] there are various tools that can translate git trees to bazaar repos, but going that route i ended up in sci-fi land pretty quickly
[22:15:27] i mean, thanks a lot for even attacking this one
[22:15:45] but using another versioning system seemed .. well ..surprising
[22:15:52] mutante: boogs.wmflabs.org :)
[22:16:09] sweet:)
[22:16:18] andre__: ^!! woot
[22:17:27] New patchset: Yurik; "* Optimized opera_mini ACLs and added more IPv4 & IPv6 ranges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63077
[22:18:10] ori-l: you know, instead of doing exec { 'a2enmod rewrite': etc. you could also just do apache_module { rewrite: name => 'rewrite' }
[22:18:23] we usually use that to activate modules
[22:19:00] yeah, i know. there are two reasons for that -- one, i do most of my development/testing on vagrant, which doesn't share the same puppet codebase as operations/puppet
[22:19:24] two, some of the ones in operations/puppet aren't very well organized, or aren't worth the loss of portability
[22:20:09] i see. yea, when it comes to configuring apache things in general we have different methods in ops/puppet and i would never say it's well organized
[22:20:40] i heard apergos just installed vagrant to test and stuff , btw
[22:21:30] i think otto uses it for puppet dev too. i my pattern is vagrant -> labs -> beg for merge to prod on #wikimedia-operations
[22:22:39] the thing with operations/puppet is that it's very large, so there's a lot of 'action-at-a-distance' resources, where you use something but it's hard to chase down where you're getting it from
[22:22:50] do you use puppetmaster::self on labs or the central master or both?
[22:23:15] puppetmaster::self
[22:24:16] it'd be nice sometimes to be able to switch a ::self instance back to actual puppetmaster without having to use another one, and just make auto-gerrit my current status on ::self :)
[22:25:56] New patchset: Mattflaschen; "Update submodules recursively in Beta labs autoupdater." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63080
[22:26:16] examples of my points above: Systemuser resource (you loss portability in exchange for not having to type "group { 'groupname': ensure => present, }, otherwise it's like the 'User' built-in", Upstart_job (doesn't actually create an upstart service, but instead uses Ubuntu's 'upstart-job' init compatibility script)
[22:26:20] * lose
[22:27:56] Service { 'foo': provider => 'upstart', } is the way to go, but if you changed it now things would probably break, since it's used all over the place
[22:29:13] (because puppet is treating all of those services as init services)
[22:37:48] ori-l: were you ware of wikimedia/bugzilla/triagescripts and wikimedia/bugzilla/wikibugs (where the latter maybe shouldn't even be considered part of bugzilla)
[22:38:35] oh no
[22:38:38] triagescripts = "Greasemonkey browser scripts for triaging bug reports in bugzilla.wikimedia.org", oh well
[22:38:40] moar stuff
[22:38:47] probably also not really part of a server install
[22:40:01] New patchset: MaxSem; "Update mobile logos to 72px high, set default to WMF logo instead of WP-specific W" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63081
[22:45:03] New review: Dzahn; "using bazaar was surprising at first, same with using exec a2enmod instead of our usual apache_modul..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/62404
[22:45:12] New patchset: coren; "Tool Labs: Add missing packages for tools" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63063
[22:46:44] New review: coren; "Just packages." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/63063
[22:46:45] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63063
[22:46:54] New review: Dzahn; "also works here: http://boogs.wmflabs.org/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62404
[22:47:01] mutante: btw, be advised that it doesn't work out of the box, because our patches reference 'url_filter', which was removed from bugzilla in 4.2
[22:47:41] i had to change references to 'url_filter' to 'uri' by hand. i have no idea how it works in production. i explained in https://rt.wikimedia.org/Ticket/Display.html?id=5115
[22:48:22] er, 'url_quote' rather. it should work if we do this: https://gerrit.wikimedia.org/r/#/c/62837/ (which is the same change bugzilla itself made: https://bugzilla.mozilla.org/show_bug.cgi?id=679096)
[22:49:13] lemme look it up in prod
[22:50:26] ori-l: it already uses "uri"
[22:51:13] i wonder if the SVN repo of wikimedia/bugzilla/modifications is more up-to-date
[22:51:16] * ori-l checks
[22:51:41] hrmm, this should have been synced and converted.sigh
[22:52:21] we could always add a third SCM to that puppet module :)
[22:52:29] hah:)
[22:53:12] grep url_quote * in /srv/org/wikimedia/bugzilla/template/en/custom/global = nothing
[22:53:44] has it maybe been checked in gerrit but not in 4.2 but in 4.1 or something?
[22:53:49] looking
[22:55:02] New patchset: coren; "Tool Labs: replace outated libpng2 with libpng3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63083
[22:56:14] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63083
[22:57:30] mutante: i think it was just never committed. svn is not more up-to-date, and there are no obscure branches in git.
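The url_quote-to-uri template fix being traced here is mechanical: Bugzilla's Template Toolkit templates invoke it as a FILTER, so the prod-vs-gerrit difference can be audited or applied with grep and sed. A sketch; the directory is the one mutante grepped at 22:53, and the exact FILTER spelling is an assumption:

    cd /srv/org/wikimedia/bugzilla/template/en/custom
    grep -rl 'FILTER url_quote' . | xargs sed -i 's/FILTER url_quote/FILTER uri/g'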
[22:57:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:58:07] ori-l: the difference between prod and bugzilla-4.2 in gerrit is almost exactly what you did there
[22:58:11] checked all 3 files
[22:58:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time
[22:58:45] it replaces url_quote with uri and tiny thing like "My votes" -> "My Votes"
[22:59:39] let's merge your change then? it's not like it changes anything about escaping
[23:00:07] and grr..this should'nt even have happened..
[23:00:17] well, should we just kill two birds with one stone ? i can update the patch to include the nbsp
[23:00:46] Violence doesn't solve anything
[23:01:19] Reedy: http://paste.debian.net/3267/
[23:01:21] arg..
[23:01:23] <^demon> Or solves everything.
[23:01:26] ori-l: http://paste.debian.net/3267/
[23:01:27] s/Violence/live-hacking in prod/g
[23:01:37] One way to find out
[23:01:39] Test it on enwiki!
[23:01:57] New review: Jdlrobson; "The existing image is 129*129. Thus we shouldn't make it smaller and probably should use 129px to no..." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/63081
[23:01:59] <^demon> Violence on enwiki?
[23:02:06] We don't usually test our code .. but if we do..
[23:02:24] i'll update the patch
[23:02:28] cool
[23:04:06] http://en.wikipedia.org/wiki/Workplace_violence#Aggression :)
[23:06:39] !log pgehres synchronized php-1.22wmf3/extensions/CentralAuth/ 'Updates to CentralAuth'
[23:06:47] Logged the message, Master
[23:12:34] mutante, do you have a minute?
[23:14:11] andrewbogott: what's up
[23:14:29] I'm making dumb mysql mistakes. Can you log into labs instance rt-testing13?
[23:14:53] And then sudo to root and run $rt-setup-database-4 --prompt-for-dba-password --action upgrade
[23:15:58] give me a minute to get my labs key loaded etc
[23:16:23] It wants there to be an 'rtuser' user. I keep trying to create that user but the rt script is never satisfied.
[23:18:21] andrewbogott: i can't sudo to root yet on that instance
[23:18:29] and now i locked myself out it seems :p
[23:18:35] channel 0: open failed: administratively prohibited: open failed
[23:19:19] i can on another instance in another project
[23:19:45] That project allows all users to sudo any command on any instance...
[23:19:56] i could login to rt-testing3 but not sudo -s
[23:20:03] i tried 3 times
[23:20:11] then i tried on another instance, worked fine
[23:20:14] New patchset: Kaldari; "Replacing 16x16 officewiki favicon with 48x48 favicon" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63091
[23:20:16] then back to rt-testing3 i was locked out
[23:20:25] for trying too often i suppose
[23:20:26] well, anyway probably you can run that command without sudo since it authenticates for the db.
[23:20:46] but now i can't login at all anymore :p
[23:21:00] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63091
[23:21:04] We're talking about rt-testing13 right?
[23:21:12] 13 ?
[23:21:26] yep, luck 13.
[23:21:33] *lucky
[23:21:43] !log mwalker synchronized php-1.22wmf3/extensions/CentralNotice/ 'Updating CentralNotice to master for some bugfixes'
[23:21:43] heh, no i tried 3
[23:21:50] Logged the message, Master
[23:22:07] and i'm root :)
[23:22:07] mutante: updated https://gerrit.wikimedia.org/r/#/c/62837/
[23:22:48] So… can you see what's happening with rtuser?
[23:22:57] feel free to delete/recreate
[23:23:37] Enter RT version you're upgrading from:
[23:23:58] 3.8.11
[23:24:12] sweet that this even exists:)
[23:24:26] the upgrading script stuff..getting the right patches etc
[23:24:49] In theory the process is straightforward...
[23:26:13] New review: MaxSem; "If someone could explain how resizing at the irrgular ratio of 1.791(6) can make anything sharper...." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63081
[23:26:45] !log pgehres synchronized php-1.22wmf3/extensions/CentralAuth/CentralAuthUser.php 'Moving one more CentralAuth log line'
[23:26:53] Logged the message, Master
[23:27:59] New patchset: Mwalker; "Enable siteNotice div on all mobile sites" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63092
[23:32:05] andrewbogott: for now.. confirming it's weird.. it doesn't work like one would expect.. still trying things
[23:32:11] i see the issue
[23:32:21] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63092
[23:32:26] I can't figure out what it wants :)
[23:33:54] !log mwalker synchronized wmf-config/InitialiseSettings.php 'Enabling siteNotice div on all alpha and beta mobile sites'
[23:34:02] Logged the message, Master
[23:34:22] The world needs more site notices.
[23:34:52] * MaxSem notices Susan
[23:34:59] Now the other cheek.
[23:40:31] stat("/usr/local/share/request-tracker4/lib/auto/DBD/mysql", 0x187e138) = -1 ENOENT (No such file or directory)
[23:40:34] stat("/usr/share/request-tracker4/lib/auto/DBD/mysql", 0x187e138) = -1 ENOENT (No such file or directory)
[23:40:37] stat("/etc/perl/auto/DBD/mysql", 0x187e138) = -1 ENOENT (No such file or directory)
[23:40:40] stat("/usr/local/lib/perl/5.14.2/auto/DBD/mysql", 0x187e138) = -1 ENOENT (No such file or directory)
[23:40:44] stat("/usr/local/share/perl/5.14.2/auto/DBD/mysql", 0x187e138) = -1 ENOENT (No such file or directory)
[23:40:47] andrewbogott:
[23:40:56] can't find the DBD stuff and stupidly reports it as "access denied" ?
[23:41:07] hm...
[23:41:18] because yea, it won't take anything
[23:41:34] i tried with empty pass and "foo" and from host % and 127.0.0.1 and localhost ..bla bla
[23:41:47] and i can connect just fine using mysql manually
[23:41:53] with the same user/pass i set
[23:41:58] Are those stat() calls from the script? Does it think the db is in the wrong place?
[23:41:59] package { 'libdbd-mysql-perl': ensure => present, }
[23:42:28] andrewbogott: yes, it's adding strace to the whole upgrade commandline and entering the info and continue ...
[23:42:29] ori-l, it's there already
[23:43:01] open("/usr/lib/perl5/auto/DBD/mysql/mysql.so", O_RDONLY|O_CLOEXEC) = 4
[23:43:04] ah, ok.. uhm
[23:43:18] the heck, why would it use an explicit path for the db?
[23:43:23] "DBI connect('dbname=rtdb;host=lo".
[23:43:27] note it's just "lo"
[23:43:46] no, that is just cut off..
[23:44:47] can you tell it to use socket instead of TCP?
[23:45:50] wait, localhost should be socket and 127.0.0.1 should be TCP
[23:46:03] i bet it's somewhere there, but already tried both in grants
[23:46:54] /etc/request-tracker4/RT_SiteConfig.d/51-dbconfig-common
[23:47:09] That is a copy of the one used by 3.8.11
[23:47:32] Oh, there's a password in there too. Interesting...
[23:47:59] Although that doesn't seem to help
[23:48:12] heh, yea, i just saw that
[23:48:13] trying
[23:49:35] do you need 'skip-create'?
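The stat()/ENOENT lines and the truncated DBI connect('dbname=rtdb;host=lo string above are what "adding strace to the whole upgrade commandline" looks like: wrap the RT upgrade in strace and watch perl probe for DBD::mysql and open the DSN. A sketch; the option list is trimmed down, and -s widens captured strings so the DSN is not cut off at the default 32 characters:

    strace -f -s 256 -e trace=open,stat,connect,write \
        rt-setup-database-4 --prompt-for-dba-password --action upgrade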
[23:51:01] I am (approximately) following this guide here: http://guzaho.wordpress.com/2011/07/03/upgrading-rt3-to-rt4-on-ubuntu-10-04-64bit-lts/
[23:52:11] andrewbogott: i got it:)
[23:52:13] it's upgrading
[23:52:18] but i see it happening in strace. hah
[23:52:18] What did you change?
[23:52:52] i used the password from that file in a grant
[23:53:08] and i used @'localhost' instead of @'%'
[23:53:17] and i deleted all other users from User table
[23:53:22] and flush privileges
[23:53:56] Weird...
[23:54:00] because i noticed that i couldn't do mysql -h localhost -u rtuser ...
[23:54:09] localhost is socket and not TCP
[23:54:10] Well, I guess as long as you remember how to do it, we only need to do it once more, ever :)
[23:54:15] therefore it doesnt apply to %
[23:54:25] heh, ok
[23:54:34] I was doing @'localhost' but guess I had the wrong password too
[23:54:48] anyway, lemme see if I can fix up the apache config so we can see how it looks
[23:55:00] write(1, "Done.\n", 6Done.
[23:56:28] exit_group(0) = ?
[23:57:25] My single ticket and queue survived the upgrade.
[23:58:08] So I guess that's it -- I'll puppetize version 4 and we can upgrade production anytime.
[23:58:22] weee!:) that's awesome
[23:58:30] I guess we should schedule a window… do you know how to do that?
[23:58:41] e.g. which calendar to use?
[23:58:47] with people or with monitoring?
[23:58:53] ah.. ehm..
[23:59:08] i don't think it fits the deployment calendar, so no
[23:59:19] announce via mail on list should be good enough
[23:59:19] It's not super important, but will result in a bit of downtime. We'll want to take the interface down so we can get a clean backup and such.
[23:59:57] OK. Monday good for you?
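For the record, the grant that finally satisfied the upgrade script, per mutante's play-by-play above: rtuser scoped to 'localhost' (socket connections never match '%') with the password taken from 51-dbconfig-common. A sketch; the privilege list is an assumption, the database name comes from the DSN in the strace output, and the placeholder is obviously not the real password:

    mysql -u root -p -e "GRANT ALL PRIVILEGES ON rtdb.* TO 'rtuser'@'localhost' \
        IDENTIFIED BY '<password from 51-dbconfig-common>'; FLUSH PRIVILEGES;"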