[01:47:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:48:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time
[01:56:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:57:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[02:15:55] !log LocalisationUpdate completed (1.22wmf3) at Thu May 9 02:15:54 UTC 2013
[02:16:03] Logged the message, Master
[02:28:44] PiRSquared: meet morebots
[02:29:17] morebots: help
[02:29:17] I am a logbot running on wikitech-static.
[02:29:17] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[02:29:17] To log a message, type !log <msg>.
[02:36:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:37:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.137 second response time
[03:39:56] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu May 9 03:39:55 UTC 2013
[03:40:04] Logged the message, Master
[04:08:02] New review: Faidon; "I don't think this is bad per se, but note that a few lines above there is a commented out section a..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62103
[04:09:36] New patchset: Faidon; "Swift: pep8 clean rewrite.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61889
[04:11:34] New patchset: Faidon; "Swift: pep8 clean rewrite.py" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61889
[04:13:37] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/61889
[04:15:51] New patchset: Faidon; "Swift: remove wikimania2006 exception" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62946
[04:17:23] New patchset: Faidon; "Swift: remove wikimania2006 exception" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62946
[04:17:45] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62946
[04:20:43] Change abandoned: Faidon; "Too obsolete by now. Getting this up-to-date would be a similar amount of effort to making it from s..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16411
[04:24:34] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours
[04:34:45] New review: Yurik; "Faidon, we use X-CS (X-Carrier is going away) to test for specific ID, but setting it already skews ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62103
[04:41:48] New patchset: Faidon; "swiftrepl: initialize connection before use" [operations/software] (master) - https://gerrit.wikimedia.org/r/62947
[04:41:48] New patchset: Faidon; "swiftrepl: add an error message for IncompleteSend" [operations/software] (master) - https://gerrit.wikimedia.org/r/62948
[04:41:48] New patchset: Faidon; "swiftrepl: set NOBJECT to 1000" [operations/software] (master) - https://gerrit.wikimedia.org/r/62949
[04:42:17] Change merged: Faidon; [operations/software] (master) - https://gerrit.wikimedia.org/r/62418
[04:42:30] Change merged: Faidon; [operations/software] (master) - https://gerrit.wikimedia.org/r/62419
[04:42:57] Change merged: Faidon; [operations/software] (master) - https://gerrit.wikimedia.org/r/62947
[04:43:15] Change merged: Faidon; [operations/software] (master) - https://gerrit.wikimedia.org/r/62948
[04:43:34] Change merged: Faidon; [operations/software] (master) - https://gerrit.wikimedia.org/r/62949
[05:11:34] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours
[05:11:34] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[05:26:07] paravoid, ping
[05:50:34] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[06:44:34] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours
[06:58:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:59:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[07:25:14] !log Jenkins HTTP interface died :/
[07:25:22] Logged the message, Master
[07:26:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:28:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.591 second response time
[07:33:53] morning hashar
[07:34:03] jenkins is dead again :(
[07:34:14] well technically it works but it isn o more accepting http connection
[07:34:16] :(
[07:35:07] no longer! :D
[07:37:40] I wish there was a way to abort a thread in java
[07:38:42] gonna restart it :(
[07:39:05] !log restarting jenkins. All its http thread are locked :/
[07:39:13] Logged the message, Master
[07:39:15] :(
[07:47:05] hmmm
[07:49:24] how is zuul's memory consumption?
[07:49:29] does it ever leak?
[07:53:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:54:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[08:01:49] New patchset: Faidon; "Add a third-party cron module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62955
[08:04:52] New review: Faidon; "puppetdoc, rspec, manifest, Modulefile, Rakefile, README for a 80 lines of pure puppet code but not ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62955
[08:13:24] PROBLEM - Packetloss_Average on analytics1003 is CRITICAL: CRITICAL: packet_loss_average is 10.9447852672 (gt 8.0)
[08:13:34] PROBLEM - Packetloss_Average on analytics1006 is CRITICAL: CRITICAL: packet_loss_average is 12.2033658647 (gt 8.0)
[08:14:34] PROBLEM - Packetloss_Average on analytics1004 is CRITICAL: CRITICAL: packet_loss_average is 11.1855046154 (gt 8.0)
[08:16:34] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[08:17:14] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 11.8206513386 (gt 8.0)
[08:17:24] RECOVERY - Packetloss_Average on analytics1003 is OK: OK: packet_loss_average is 1.17573723577
[08:17:34] PROBLEM - Packetloss_Average on gadolinium is CRITICAL: CRITICAL: packet_loss_average is 10.530239633 (gt 8.0)
[08:17:35] RECOVERY - Packetloss_Average on analytics1006 is OK: OK: packet_loss_average is -0.169203090909
[08:18:34] RECOVERY - Packetloss_Average on analytics1004 is OK: OK: packet_loss_average is 0.134039166667
[08:18:44] PROBLEM - Packetloss_Average on analytics1008 is CRITICAL: CRITICAL: packet_loss_average is 10.7626705385 (gt 8.0)
[08:20:52] !log upgrading Jenkins plugins 'promoted build' and 'M2 release'. That should fix the Gerrit jobs disappearing (ping qchris)
[08:21:00] Logged the message, Master
[08:21:14] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is 0.927818596491
[08:21:34] RECOVERY - Packetloss_Average on gadolinium is OK: OK: packet_loss_average is 0.605147391304
[08:21:54] ori-l: I don't think Zuul leaks memory.
[08:22:01] !log restarting Jenkins again
[08:22:08] Logged the message, Master
[08:22:40] yeah, i checked out the code to see if it's lingering connections to jenkins but it doesn't look like it
[08:22:44] RECOVERY - Packetloss_Average on analytics1008 is OK: OK: packet_loss_average is 0.048695042735
[08:23:14] PROBLEM - Packetloss_Average on analytics1005 is CRITICAL: CRITICAL: packet_loss_average is 9.58535916667 (gt 8.0)
[08:23:36] ori-l: I will also make zuul to use the private IP instead of the https / public one
[08:23:47] that will let Zuul connect directly to Jenkins
[08:23:52] instead of via the apache proxy
[08:24:09] yeah, that's nice; probably won't make a huge difference tho
[08:26:21] Change restored: Hashar; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60866
[08:26:24] New patchset: Hashar; "Jenkins job validation (DO NOT SUBMIT)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60866
[08:27:00] Change abandoned: Hashar; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/60866
[08:27:14] RECOVERY - Packetloss_Average on analytics1005 is OK: OK: packet_loss_average is 0.18957173913
[08:29:59] so jenkins restarts just fine now
[08:32:45] shower time
[08:54:07] !log Jenkins apparently working fine. I have upgraded a few plugins, that seems to have fixed the Gerrit jobs.
[08:54:13] off again bbl
[08:54:15] Logged the message, Master
[08:57:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:59:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[09:07:34] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours
[09:26:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:28:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[09:50:34] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:34] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[09:50:34] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[10:31:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:32:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.140 second response time
[11:15:50] New patchset: ArielGlenn; "kiwix mirror stanza moved to mirror role and added system_role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62963
[11:20:39] New review: Hashar; "Looks fine, I will let you deploy it :-]" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/62814
[11:23:56] New patchset: ArielGlenn; "kiwix mirror stanza moved to mirror role and added system_role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62963
[11:24:28] !log upgrading Zuul to latest master (makes report times friendlier)
[11:24:37] Logged the message, Master
[11:27:44] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62963
[11:43:02] New patchset: Hashar; "nginx_site now uses boolean values" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62583
[11:43:46] New review: Hashar; "I have converted the $enable and $install parameters of nginx_site to be boolean values and updated ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62583
[11:44:50] New patchset: ArielGlenn; "Revert "kiwix mirror stanza moved to mirror role and added system_role"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62965
[11:46:07] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62965
[11:49:22] New patchset: Hashar; "proxy_configuration $ipv6_enabled is now boolean" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62966
[11:51:14] New patchset: Hashar; "labs: hardcode nginx types_hash_bucket_size to 64" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62603
[11:51:29] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62583
[11:51:52] New review: Hashar; "I have rebased this change against tip of production. It is an independent change we can merge right..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62603
[11:52:17] apergos: and you might want to pick https://gerrit.wikimedia.org/r/#/c/62603/ as well
[11:52:31] apergos: that is tweaking the bucket sizes for nginx on labs. must be harmless in production
[11:52:48] I need to rebase my lame change now :/
[11:54:08] yes, I had only wited because it was commited as part of the series (with dependency), otherwise I would have merged it right away
[11:54:45] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62603
[11:55:00] have you sorted out how to get the nginx changes deployed in production?
[11:55:06] err
[11:55:16] you don't read your pms do you
[11:55:19] :-P
[11:55:24] :-]
[11:55:30] must have dropped it sorry
[12:06:34] New patchset: Hashar; "proxy_configuration $ipv6_enabled is now boolean" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62966
[12:49:58] New patchset: Hashar; "puppet-lint protoproxy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62973
[13:07:11] New patchset: Hashar; "puppet-lint protoproxy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62973
[13:07:29] New review: Hashar; "PS2 removes the remaining trailing semi columns." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62973
[13:31:53] New patchset: Hashar; "doc for proxy_configuration define" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62976
[13:31:54] New patchset: Hashar; "protoproxy proxy_addresses is now optional" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62977
[13:32:21] New review: Hashar; "That one is straightforward. Might want to write more documentation though." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62976
[13:32:42] New review: Hashar; "Need to be tested out in labs first." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/62977
[14:01:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time
[14:06:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error
[14:16:50] Q: Has anyone tested removing files from the servers without suppressing them first?
[14:17:22] I assume things would break pretty badly if it happened, wouldn't they?
[14:19:02] ^demon, mutante, and paravoid – you guys should know things, right? :)
[14:20:17] <^demon> Huh?
[14:22:37] ^^ what would happen if someone removed files server-side without deleting them via MediaWiki, I wonder
[14:23:42] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62161
[14:24:52] <^demon> Like on the cluster?
[14:25:22] <^demon> Next time someone checked stuff out, it'd check the file out again.
[14:25:31] <^demon> Oh, in MW.
[14:25:34] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours
[14:25:35] <^demon> Oh, it'll probably barf.
[14:25:53] <^demon> (I thought you meant like delete MW classes in deployment)
[14:25:58] <^demon> Not uploaded files.
[14:26:04] no, should've made myself clearer
[14:26:12] I assume people would get shit errors
[14:26:43] <^demon> For a period of time, while the caches were inconsistent with what's on disk.
[14:27:22] ah, and then a simple 'file does not exist, you can upload it' message?
[14:29:04] <^demon> I dunno tbh, since the image table entries would still be inconsistent.
[14:29:37] <^demon> I imagine we at least handle the case semi-decently.
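The "puppet-lint protoproxy" patchsets hashar uploaded earlier in this stretch (12:49 and 13:07) are pure style cleanups; PS2 "removes the remaining trailing semi columns". A sketch of running the linter locally before uploading such a change; puppet-lint is a Ruby gem, and the manifest path here is an assumption:

    gem install puppet-lint
    puppet-lint manifests/misc/protoproxy.pp    # path assumed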
[14:31:03] New patchset: Umherirrender; "$wgFlaggedRevsNamespaces for dewiki: Add NS_MODULE" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62979
[14:32:32] odder, it will fail to retrieve the file, create thumbs, etc. if we don't have it in squid (varnish?) cache then there will be a failure to retrieve from upload, and so it won't display. In an article with a thumb you'll get a placeholder of some sort with the name of the file, on the file description page you might get the ''broken image' icon (don't remember now), from upload.wm.org you'll get a whine about being unable to serve the m
[14:32:50] New review: Umherirrender; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62979
[14:33:48] if you try to suppress after removing (hard removal from filesytem) the media, the suppresion might fail
[14:33:53] depends on the order that happens
[14:34:01] best to test it on a local instance
[14:34:08] *after dong
[14:35:19] doing !
[14:35:26] thanks apergos
[14:35:32] yw
[14:36:48] there is a maintenance script that fixes up that sort of thing iirc
[14:40:21] cleanupImages.php might do it but I have not ever tested it myselff
[14:40:46] there is also this scary script written by AaronSchulz coming up :)
[14:41:04] which one is that?
[14:41:31] * odder searches Bugzilla
[14:42:05] https://gerrit.wikimedia.org/r/#/c/62549/
[14:42:24] New review: Raimond Spekking; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62979
[14:43:27] er
[14:43:31] really? ummm
[14:43:36] I mean sometimes those get undeleted
[14:43:52] example: the great commons jimbo purge of 2010 (or whenever it was)
[14:43:57] some of those got restored
[14:44:52] yes, but these were MediaWiki-deleted
[14:44:56] ok well if it's jus provided as a script for third party I guess... :-D
[14:45:02] that script is to be used by legal
[14:45:17] scary script for scary stuff.
[14:45:18] ah
[14:45:33] so after deleted via mw then this goes through and gets the rest
[14:45:46] makes sense, right now they have to drop an rt ticket every time
[14:45:51] yep.
[14:46:18] this is just BTW; I was asked a question whether the WMF uses different ways to delete stuff than suppression + server-side, which I'm sure you don't
[14:46:27] and wondered what would happen :)
[14:46:54] delete/suppress and then "nuke with file" = remove from nfs or swift or whever it happens to live on real storage
[14:46:58] that's about how it works
[14:47:19] I know I have in the past nuked without suppression
[14:47:36] it only affects folks trying to retrieve the specific image so....
[14:47:55] ooo, how long in the past?
[14:48:20] I wonder if 'scubVersion' should be 'scrubVersion' everywhere in there
[14:48:25] oh not this year
[14:48:55] I guess it must have been more than a year ago
[14:49:27] this script would help a lot, I remember having a lot of fun with mutante the other day, trying to figure out where a file was kept :)
[14:49:36] yes, it's harder now
[14:49:45] you have to dig it out of ceph as well as swift
[15:12:34] PROBLEM - Puppet freshness on cp3003 is CRITICAL: No successful Puppet run in the last 10 hours
[15:12:34] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[15:18:27] Anyone mind if I deploy gerrit change 62814 quick? It's preparation for enabling CodeEditor for core at some point in the future.
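For the cleanupImages.php route apergos mentions at 14:40, maintenance scripts on the cluster are normally run through the mwscript wrapper. A sketch; the target wiki and the flag names are assumptions based on the TableCleanup family of scripts, so check --help before trusting them:

    # report-only pass over the image table for rows whose files are broken
    mwscript cleanupImages.php --wiki=commonswiki --dry-run
    # rerun without --dry-run to actually repair the rows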
[15:22:51] New review: Anomie; "per Hashar" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/62814
[15:23:04] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62814
[15:26:10] !log anomie synchronized wmf-config/InitialiseSettings-labs.php 'Make $wgCodeEditorEnableCore configurable per wiki, prep for bug 39653'
[15:26:17] Logged the message, Master
[15:26:24] !log anomie synchronized wmf-config/InitialiseSettings.php 'Make $wgCodeEditorEnableCore configurable per wiki, prep for bug 39653'
[15:26:31] Logged the message, Master
[15:26:38] !log anomie synchronized wmf-config/CommonSettings.php 'Make $wgCodeEditorEnableCore configurable per wiki, prep for bug 39653'
[15:26:45] Logged the message, Master
[15:32:33] New patchset: Hoo man; "$wgFlaggedRevsNamespaces for dewiki: Add NS_MODULE" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62979
[15:33:51] New review: Hoo man; "Fixed encoding problem" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62979
[15:42:32] anomie: looks good (as in, it didn't break anything) ;)
[15:43:07] greg-g: Yeah, I even remembered to check basic functionality right after deploying ;)
[15:50:12] :)
[15:51:34] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[15:56:09] !log powering down barium /relocate server to c1
[15:56:16] Logged the message, Master
[16:08:10] New patchset: RobH; "hooper no long racktables host, removing old racktables stuff" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62983
[16:08:14] cmjohnson1: I got shipment notification of the hard disks
[16:08:16] for barium
[16:08:24] yep..installing now
[16:08:29] oh, nice!
[16:08:32] i get the notices to :-P
[16:08:40] well, i meant the newegg one
[16:08:41] not the eq one
[16:08:45] (notification)
[16:08:55] i saw you handled the eq one already, wasnt sure if it was same thing
[16:08:55] oh..yeah..i got an eq notice...
[16:09:07] cool
[16:09:18] yep..while we're on topic....have your heard from dominion freight?
[16:09:31] and still no addt'l ssds from amazon
[16:09:58] dominion for what?
[16:10:09] the remaining 10 parsoid servers should arrive today as well
[16:10:12] (i already put a ticket in)
[16:10:21] we are short 1 400GB ssd right?
[16:10:23] yes..the other 10 server
[16:10:32] yes on the ssd
[16:12:01] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62983
[16:13:14] New patchset: RobH; "Revert "hooper no long racktables host, removing old racktables stuff"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62984
[16:13:41] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62984
[16:29:58] New patchset: Aaron Schulz; "Added NS_MODULE as a reviewable namespace." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62985
[16:31:26] New review: Aaron Schulz; "WIP" [operations/mediawiki-config] (master) C: -2; - https://gerrit.wikimedia.org/r/62985
[16:45:34] PROBLEM - Puppet freshness on db44 is CRITICAL: No successful Puppet run in the last 10 hours
[16:46:18] New patchset: Aaron Schulz; "Added NS_MODULE as a reviewable namespace." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62985
[16:55:36] paravoid, hi, are you around?
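The "anomie synchronized wmf-config/..." entries at 15:26 are the log lines emitted by the sync-file deployment script, which pushes a single file from the deploy host to the apaches and records the comment in the Server Admin Log. A sketch of the invocation behind them; the checkout path is an assumption about the deployment setup of the time:

    cd /home/wikipedia/common    # common config checkout on the deploy host, path assumed
    sync-file wmf-config/InitialiseSettings.php \
        'Make $wgCodeEditorEnableCore configurable per wiki, prep for bug 39653'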
[17:21:13] New review: Umherirrender; "Unneeded, when I59989f7c gets merged" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62979
[17:33:44] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer:
[17:38:45] yurik: yes?
[17:40:07] paravoid, hi! i replied to your comment on https://gerrit.wikimedia.org/r/#/c/62103/ -- basically spoofing X-CS skews the results and allows us to test only very limited aspects of the zero. Spoofing IP would do a full testing
[17:41:44] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[17:44:44] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer:
[17:45:44] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[17:49:21] New patchset: Pgehres; "Enabling wgCentralAuthAutoMigrate on all wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62995
[17:50:11] heya ops: this page looks hand written, and I don't know where to submit a diff... so...
[17:50:39] https://noc.wikimedia.org/conf/ the link for the puppet configs links to the lucene specific folder, it should be https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=tree;hb=HEAD instead
[17:51:05] (3rd href on the page)
[17:54:48] New review: Pgehres; "Looks like private and fishbowl don't even call CA, so it should be a no-op, but I can match this fo..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62995
[17:57:56] New review: CSteipp; "This shouldn't effect private wikis, and will help with unification. If we see any issues we can rev..." [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/62995
[17:58:20] New patchset: Alex Monk; "Add throttle exception for Haifa University workshop on 12/5/13" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62998
[17:58:25] Change merged: Pgehres; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62995
[17:59:48] New patchset: Alex Monk; "Add throttle exception for Haifa University workshop on 12/5/13" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/62998
[18:00:05] !log pgehres synchronized wmf-config/InitialiseSettings.php 'Enabling wgCentralAuthAutoMigrate on all wikis that use CentralAuth'
[18:00:12] Logged the message, Master
[18:17:34] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[18:20:00] New patchset: Pgehres; "Enabling wfDebugLog for CentralAuth" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63001
[18:26:11] Change merged: Pgehres; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63001
[18:27:12] !log pgehres synchronized wmf-config/InitialiseSettings.php 'Enabling CentralAuth debug log'
[18:27:20] Logged the message, Master
[18:37:09] New patchset: Diederik; "Ensure that all UMAPI files belong to wikidev group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63005
[18:55:42] !log attempting to restart broken puppetmaster on stafford
[18:55:50] Logged the message, Master
[19:05:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 7.829 second response time
[19:08:34] PROBLEM - Puppet freshness on neon is CRITICAL: No successful Puppet run in the last 10 hours
[19:09:14] !log restarted Apache and then puppetmaster on stafford, that revived it
[19:09:23] Logged the message, Master
[19:16:26] !log fixing file permissions in /a/e3/E3Analysis repo on stat1001
[19:16:34] Logged the message, Master
[19:22:20] New patchset: Diederik; "Ensure that all UMAPI files belong to wikidev group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63005
[19:22:46] New review: Hashar; "I can confirm this is working in labs. I have tried out calling proxy_configuration{} with no proxy_..." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/62977
[19:24:24] New review: Dzahn; "yea, i just fixed those file permissions manually using find. some files were owned by group root. i..." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/63005
[19:28:27] New review: Dzahn; "recheck" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63005
[19:31:29] New review: Hashar; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62582
[19:31:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:32:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[19:32:47] hashar: do you approve of using "recurse" in puppet inside git::clone to ensure file permissions?
[19:33:18] hmm
[19:33:30] I had that issue a few months ago I think
[19:35:06] i just told Diederik to just avoid the recurse if its not really needed
[19:35:12] mutante: so git::clone is supposed to create the files according to use and group passed. I have added the group part with https://gerrit.wikimedia.org/r/#/c/30988/
[19:35:13] and fixed some permissions by hand using find
[19:35:25] and puppet didnt rebreak them
[19:35:27] recurse tend to be evil IIRC
[19:35:34] thats what i thought , yea
[19:35:42] cause on each puppet run it will traverse all the tree to find out what is wrong
[19:35:45] i didnt really want to encourage its use
[19:35:56] drdee: ^
[19:36:10] if they got altered somehow, I guess the best is to fix it manually
[19:36:32] https://gerrit.wikimedia.org/r/#/c/63005/
[19:36:40] this looks harmless without recurse now
[19:36:44] can has validate?
[19:37:12] yep, the manual fix has been applied, thanks for input hashar
[19:38:03] and in this case, I guess the repo should be owned by root:root
[19:38:11] since the git::clone has ensure => latest
[19:38:22] probably don't want anyone to play with that repo beside root (aka puppet)
[19:38:33] the issue is he wants both
[19:38:39] manual git pull as human
[19:38:43] and puppet doing it
[19:38:52] ah to force an update I gues
[19:38:53] s
[19:38:56] yup
[19:39:03] yea, doesnt want to wait 30 min when deploying
[19:39:06] but then if something goes wrong, puppet might in turn fail
[19:39:50] it's all owned by stats:wikidev
[19:40:01] and most of it was, just some were group root
[19:40:14] the fix was to make them all stats:wikidev so far
[19:40:42] so
[19:40:57] drdee: so when manually pulling, do it as the stats user?
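The manual fix Dzahn describes in the 19:24 review ("fixed those file permissions manually using find. some files were owned by group root") boils down to resetting group ownership, and optionally group write, on whatever the clone left behind. A sketch against the repo path from the 19:16 !log; the exact modes are assumptions:

    # hand files that ended up group root back to wikidev
    find /a/e3/E3Analysis -group root -exec chgrp wikidev {} +
    # ensure the group can write, so any deployer can update the checkout
    find /a/e3/E3Analysis ! -perm -g=w -exec chmod g+w {} +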
[19:41:00] if puppet maintain the clone, the files belong to root:root to prevent humans to alter the git repo and thus potentially make puppet file
[19:41:07] if a human need to update it, it needs root
[19:41:13] preferably as own user
[19:41:30] not root because then ottomata is the only on the team who can do it
[19:41:34] if the repo is to be updated by human, that should be out of puppet; possibly with git-deploy
[19:41:39] and there it is a mixed case :-]
[19:42:01] i agree, i said earlier you can also just say deploying is always done by human as sanity-check
[19:42:14] and not let puppet do the pull automatically
[19:42:35] but have multiple deployers
[19:42:53] Zuul used to be deployed by a git::clone ensure => latest
[19:43:09] with a repo to which i pushed master
[19:43:11] ensure => latest in general, also with packages, has surprised us in the past
[19:43:15] then had to run puppet myself
[19:43:16] not ideal
[19:43:25] nowadays I su and git pull then install
[19:43:49] which in turns need root access grbmbl
[19:43:50] on blog or bugzilla i also just git pull manually, didnt really want puppet do automatically do it if somebody merges
[19:44:05] hmm
[19:44:13] in this case I would get it out of puppet
[19:44:18] and let the team manually deploy
[19:45:22] said all that i'd merge that little change anyways, just adding the group
[19:45:34] but jenkins..
[19:45:42] doesnt like to verify
[19:46:03] and Zuul / Jenkins could potentially be used to refresh the local repo after merge
[19:46:25] I did that for the integration.wikimedia.org website which is updated whenever a change is merged in integration/docroot.git
[19:46:35] but that needs a jenkins slave installed on the host :(
[19:46:48] so yeah just wikidev it but I would remove the ensure latest and let the team update manually
[19:47:42] drdee: so, i'm going for lunch, up to you if you want to add another patch set to remove "ensure => latest" now or not, i can merge it when i'm back
[19:48:04] i will have ottomata look at next week
[19:48:10] thanks for your ideas
[19:48:29] sure, no problem, thank hashar
[19:48:30] bbiaw
[19:51:34] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[19:51:34] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[19:51:34] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[19:55:38] New patchset: Cmjohnson; "Changing mac address for aluminium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63019
[19:56:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:57:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[20:04:01] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63019
[20:19:57] !log Jenkins died :/
[20:20:05] Logged the message, Master
[20:20:25] !log restarting Jenkins.
[20:20:32] Logged the message, Master
[20:21:07] jeff_green: aluminum has mainboard swapped. not sure how to fix w/out a reinstall due to mac address nic change...other than maybe changing if interface to eth1 instead of eth0 in networking/interfaces?
[20:21:46] I don't understand?
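The pattern hashar and mutante converge on above (drop ensure => latest, let the team update the clone) comes down to pulling as the owning user rather than root, so the stats:wikidev ownership puppet expects never drifts. A minimal sketch, assuming the stats user owns the checkout:

    # update the checkout on demand instead of waiting ~30 min for a puppet run
    sudo -u stats sh -c 'cd /a/e3/E3Analysis && git pull'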
[20:22:25] we should be able to console in and tweak any necessary network settings, but afaik it's just binding the IP to what the kernel sees as eth0
[20:23:04] if you can get the DRAC up on its normal IP with the normal login, I think I can take it from there
[20:24:33] the drac is up...but i can't ping the server
[20:24:39] it's all yours than!
[20:24:48] k. will you be around for a bit if I get stuck?
[20:25:06] yep
[20:25:08] like 20 minutes == while
[20:25:12] thx. looking now
[20:30:24] mutante: jenkins failed: https://gerrit.wikimedia.org/r/#/c/63005/
[20:36:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:37:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.146 second response time
[20:40:08] hashar: heh, hey come it says LOST now instead of FAILED ?
[20:40:21] how
[20:40:28] Jenkins got restarted
[20:40:30] :D
[20:40:41] so LOST means like i lost memory of the change?
[20:40:42] :)
[20:41:02] says so on 63005, lemme recheck
[20:41:09] New review: Dzahn; "recheck" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63005
[20:41:50] <^demon> hashar: That might explain why jmap wouldn't connect to the proc :p
[20:42:40] ^demon: I think I killed Jenkins by using strace :(
[20:43:08] New patchset: coren; "Tool Labs: Add libjson-perl to exec nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63063
[20:43:37] <^demon> strace isn't usually super useful a multithreaded program running in the jvm.
[20:43:44] <^demon> You get way too much jvm noise.
[20:43:49] I guess
[20:44:06] using top (and showing threads with H )
[20:44:11] I managed to get the pid of the thread
[20:44:13] http://itsecureadmin.com/2010/12/using-strace-to-attach-to-a-multi-threaded-process-like-a-jvmjava/
[20:44:22] but could not find it in jstack nor in Jenkins thread dump
[20:44:42] ( http://integration.wikimedia.org/ci/threadDump )
[20:44:49] <^demon> http://www.fromdev.com/2008/12/debugging-java-on-unixlinux-my-favorite.html, mostly duh info
[20:45:30] !log olivneh synchronized php-1.22wmf3/extensions/ConfirmEdit 'Updating ConfirmEdit to master'
[20:45:38] Logged the message, Master
[20:45:45] !log olivneh synchronized php-1.22wmf3/extensions/EventLogging 'Updating EventLogging to master'
[20:45:53] Logged the message, Master
[20:46:00] !log olivneh synchronized php-1.22wmf3/extensions/EventLogging 'Updating GuidedTour to master'
[20:46:07] Logged the message, Master
[20:46:07] ^demon: ahh I am learning some new commands :-D
[20:46:19] jmap to connecto live java process sounds nice
[20:47:30] jstack is very nice
[20:47:42] !log olivneh synchronized php-1.22wmf3/extensions/GuidedTour 'Actually updating GuidedTour to master'
[20:47:47] gives you a full stack trace of all the threads together with the source filename / line
[20:47:50] Logged the message, Master
[20:48:10] I guess I will try pstack next time
[20:48:15] <^demon> Yeah, jstack and jmap are your two big tools in the "debug a java problem" belt.
[20:48:35] bah pstack is not on gallium
[20:48:45] you know what would be fancy? a shared bookmark server in the form of a mozilla sync server. so we could put useful bookmarks in common place and sync to browsers
[20:51:09] http://en.wikipedia.org/wiki/Firefox_Sync + http://docs.services.mozilla.com/howtos/run-sync.html but i just thought of the bookmarks part, not the passwords etc i suppose;)
[20:51:34] RECOVERY - Host aluminium is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[20:52:11] ^ \o/
[20:52:13] New patchset: Hashar; "install pstack on gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63064
[20:52:48] New patchset: Petrb; "inserted 2 more packages needed to compile" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63065
[20:52:50] mutante: a simpler solution could be to publish the bookmarks as RSS feed and use ff's 'live bookmarks' feature http://www.mozilla.org/en-US/firefox/livebookmarks.html
[20:53:18] all you'd need is the bookmarks formatted as an rss feed on noc.wikimedia.org or whatever
[20:53:46] mutante: I could use `pstack` on gallium. That is an utility to display a stack trace of a running process ( https://gerrit.wikimedia.org/r/63064 )
[20:55:45] ori-l: i suppose with the sync server we could upload new bookmarks via browser while putting them in a file/feed would mean we'd have to actually put them in git
[20:55:58] mutante: yeah, that's true
[20:56:29] hashar: yea, let's do that, jenkins debugging ++
[20:56:38] will run puppet :-]
[20:56:42] just need sock puppet merge
[20:57:04] New review: Dzahn; "debugging jenkins is a good thing" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/63064
[20:57:30] New review: Dzahn; "manual verify, debug jenkins" [operations/puppet] (production); V: 2 - https://gerrit.wikimedia.org/r/63064
[20:57:31] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63064
[20:58:33] hashar: you can run it
[20:58:45] mutante: thanks!
[20:59:16] sure, you should have debug tools when needed
[21:01:17] New patchset: Brion VIBBER; "Update FirefoxOS app with language fix" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63067
[21:01:24] are there known issues with bits.wikimedia.org being slow? Asking as there's https://bugzilla.wikimedia.org/show_bug.cgi?id=48257
[21:01:42] * andre__ was offline yesterday (public holidays), so wondering if something on IRC was going on
[21:02:28] speaking of bits.wikimedia.org, I have a bug fix update for the FirefoxOS app which is hosted there: https://gerrit.wikimedia.org/r/63067
[21:02:34] <^demon> andre__: There was some networking issues yesterday that cascaded to a couple of other things.
[21:02:43] <^demon> Timing looks about right.
[21:03:41] <^demon> brion: We include code from github as a submodule in mediawiki-config? o_O
[21:03:52] ^demon: we do! blame preilly probably :)
[21:04:05] ^demon: I see, thanks for the info. Didn't see anything on ops@ either. Has this been mostly fixed?
[21:04:06] <^demon> That scares me.
[21:04:15] <^demon> andre__: Yes. See engineering@
[21:04:42] ^demon: argh, thanks!
[21:05:01] <^demon> brion: Ugh, over git://?
[21:05:03] <^demon> Fuck. That.
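The top/jstack workflow hashar and ^demon trade above goes: find the hot thread's LWP id with top in threads mode, convert it to hex, then look for the matching nid=0x... frame in a full JVM stack dump. A sketch; the pgrep pattern for the Jenkins process is an assumption:

    JPID=$(pgrep -f jenkins.war | head -1)     # Jenkins JVM pid, pattern assumed
    top -H -p "$JPID"                          # -H lists individual threads
    printf '%x\n' 12345                        # busy LWP id from top, in hex
    sudo -u jenkins jstack "$JPID" > /tmp/jenkins-threads.txt

jstack has to run as the JVM's own user, which is one reason the thread hunt above also leaned on Jenkins' built-in /ci/threadDump page.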
[21:05:36] ^demon: feel free to change how it's included
[21:05:41] as long as it goes out i'm happy :)
[21:06:14] <^demon> subtree merge :)
[21:06:20] oh my
[21:06:39] New patchset: coren; "Tool Labs: Add missing packages for tools" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63063
[21:07:11] New patchset: Hashar; "enable Jenkins access log" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63068
[21:07:22] Is jenkins broken?
[21:07:25] Or just slow?
[21:08:18] Coren: slow :-D
[21:08:50] Coren: it is being debugged right now , hashar just got pstack installed on gallium
[21:09:13] well it is slow right now because l10n-bot submitted ton of translations
[21:09:29] mutante: https://gerrit.wikimedia.org/r/63068
[21:09:41] mutante: that will enable access log on jenkins whenever I restart it :-D
[21:09:57] and I will make another change following up
[21:10:26] andre__: i saw Leslie forwarded a report about some networking issues to Mark/Faidon a couple hours ago
[21:13:22] New review: Dzahn; "looks reasonable for debugging purposes" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/63068
[21:13:28] New review: Dzahn; "looks reasonable for debugging purposes" [operations/puppet] (production); V: 2 - https://gerrit.wikimedia.org/r/63068
[21:13:28] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63068
[21:14:09] hashar: you can restart
[21:15:30] New patchset: Hashar; "make Zuul query Jenkins directly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63069
[21:16:11] mutante: and the last is https://gerrit.wikimedia.org/r/63069 :D
[21:16:22] that is not going to speed up things but that will help
[21:16:39] you are quick writing those commit messages :p
[21:16:41] !log olivneh synchronized php-1.22wmf3/extensions/WikimediaMessages 'Updating WikimediaMessages to master'
[21:16:50] Logged the message, Master
[21:16:58] remember I had a job where I was mostly writing hehe
[21:18:26] mutante: the reason I write long commit, is that it is very helpful when coming back on a change a few months after :D
[21:18:29] New review: Dzahn; "per commit message, makes sense, yeah. and per hashar "that is not going to speed up things but that..." [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/63069
[21:18:30] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63069
[21:18:32] I am addicted to git blame
[21:19:01] done
[21:19:07] thx
[21:19:14] yw
[21:19:34] mutante: on a different topic, I have seen a puppet install that showed the git revision as a catalog version instead of a timestamp :-D
[21:20:24] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62669
[21:20:41] aah, that sounds nice actually
[21:20:54] http://rcrowley.org/talks/sv-puppet-2011-01-11/#76
[21:20:58] 'caching catalog terminus'
[21:21:13] which uses git rev-parse HEAD as an identifier
[21:21:17] instead of a timestamp
[21:21:31] I am wondering if that could speed up the operations on the puppetmaster
[21:21:45] cause I suspect it is currently recompiling the full catalog on each requests made during the same second
[21:22:33] ah, now i ended up looking at this http://linux.die.net/man/8/puppet-catalog
[21:22:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:22:42] :-D
[21:22:58] afaik we'd still want puppetdb to speed up puppetmaster
[21:23:09] but i'd have to be postgres, no mysql/maria
[21:23:15] maybe that can buy you some time
[21:23:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.142 second response time
[21:23:32] should bring it up on list
[21:24:58] some friends told me they get a puppet master on each server
[21:25:03] then just rsync the git repo
[21:25:17] hashar: any knowledge about this ? https://gerrit.wikimedia.org/r/#/c/62032/
[21:25:59] the ticket it refers to is called "#5073: updateinterwikicache missing from fenari" created by reedy
[21:26:13] i'd be resolved by just installing those scripts, but.. shrug
[21:27:05] mutante: no idea
[21:27:06] hashar: on _each_ server? and rsync? ehmm.. well that sounds ..different...
[21:27:11] so, quick question. Difference between test and test2 now is only the database they point to? (ie: no longer any difference like "test reads from NFS while test2 is on production cluster like the rest of them")?
[21:27:23] greg-g: test is still only on a specific host
[21:27:32] AFAIK..
[21:27:39] srv193
[21:27:45] afair
[21:27:47] hrm, ok
[21:27:54] yeah
[21:27:58] <^demon> !log gerrit repositories will now recursively merge
[21:28:06] all it's neighbour servers have been decomissioned
[21:28:07] Logged the message, Master
[21:28:09] somewhat amusingly
[21:28:27] sibling?
[21:28:50] Reedy: so this page is accurate or no? https://wikitech.wikimedia.org/wiki/Test.wikipedia.org
[21:28:54] !log pdns-update to cut back to aluminium, restarted pdns on nescio b/c it failed
[21:29:02] Logged the message, Master
[21:31:10] !log restarting Zuul to make it query Jenkins directly without SSL + apache proxy {{gerrit|63069}}.
[21:31:17] Logged the message, Master
[21:32:26] Reedy: "relatively" is an acceptable answer, btw :)
[21:33:23] Looks to be
[21:33:33] actually,... test.wikipedia.org is an alias for wikipedia-lb.wikimedia.org.
[21:33:38] test2.wikipedia.org is an alias for wikipedia-lb.wikimedia.org.
[21:33:47] would have expected a single server..
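The rcrowley trick mutante links at 21:20 (git revision as the catalog version) corresponds to puppet's config_version setting, a command the master runs once per compile. A sketch, using the stafford repo path mutante quotes a bit further down; none of this is what was actually deployed:

    # append to /etc/puppet/puppet.conf on the master (sketch only)
    cat >> /etc/puppet/puppet.conf <<'EOF'
    [master]
    config_version = /usr/bin/git --git-dir=/var/lib/git/operations/puppet/.git rev-parse HEAD
    EOF

On its own this only changes the reported version string; the caching terminus from the slides is what would actually skip redundant compiles.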
[21:33:54] i must have missed that
[21:33:56] our caching redirects
[21:34:05] ah
[21:34:35] Reedy: thanks
[21:40:05] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63005
[21:40:26] drdee: and that was your change
[21:42:04] !log forced Zuul to stop. All events have been dropped :(
[21:42:11] Logged the message, Master
[21:44:44] !log restarted Jenkins to enable access logs in /var/log/jenkins/access.log {{gerrit|63068}}
[21:44:52] Logged the message, Master
[21:52:42] Reedy: i suppose this has not happened yet? 'setSiteInfoForWiki in multiversion/MWMultiVersion.php has been updated to work with the new docroot layout'
[21:52:54] Nope
[21:53:02] Have you looked at that code? :p
[21:53:19] no, i just looked at "simplify wikimania apache conf" you wrote
[21:53:38] says it is dependent
[21:54:50] !log Jenkins / Zuul is back up (for now).
[21:54:58] Logged the message, Master
[21:55:00] :)
[21:55:10] so that is crazy
[21:55:14] I had to restart Jenkins twice
[21:55:25] the first time it was getting wild
[21:55:52] mutante: thanks for your help :-]
[21:56:11] going to get a few hours of sleep, if it crash tomorrow I will get some traces with pstack
[21:56:19] and finally be able to fill a bug upstream
[21:57:32] ok, go get sleep, we'll discuss favicon.php vs. favicon.ico another day, heh:)
[21:57:47] thanks for fixing it !!
[21:58:02] mutante: what's the path to the puppet repo's gitdir on the puppet master?
[21:58:06] upstream bug sounds good
[21:59:02] ori-l: on sockpuppet: /root/puppet/ on stafford: /var/lib/git/operations/puppet/
[21:59:17] why do you ask
[21:59:42] trying to think of an elegant way to emit the git sha1, as hashar suggested earlier
[22:00:17] http://docs.puppetlabs.com/references/latest/function.html#file is nice in that it accepts multiple paths and outputs the first one that it finds; trying to determine what happens if none are found.
[22:00:41] ah
[22:02:28] New review: Dzahn; "dunno, "same like in production" and "already made" make sense, but not adding more complexity also ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62125
[22:02:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:02:39] ori-l: git rev-parse HEAD ? :D
[22:03:02] hashar: yes, that would work if passed to 'generate', but you need to specify --git-dir and full path to git
[22:03:11] :/
[22:03:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time
[22:03:44] a custom facter fact is probably the way to go, but i'm too lazy for that
[22:05:35] anyway bed time for me *wave*
[22:05:38] <^demon> ori-l: If you're poking git::clone, I would beg that you could set ensure => $sha1 and have it check out.
[22:05:41] ciao hashar
[22:06:00] ^demon: oh, that sounds like fun
[22:06:02] i'll do that
[22:06:02] New review: Dzahn; "Hey Jan, just wondering if you are planning to create new patchsets here or abandon it." [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/53403
[22:06:27] ori-l: na there is no such thing :-D
[22:06:44] I had a patch for git::clone to ensure => $sha1 but it did not get merged
[22:07:06] https://gerrit.wikimedia.org/r/#/c/27175/
[22:07:40] now I am sleeping
[22:07:47] !change https://gerrit.wikimedia.org/r/#/c/33713/5 | platform-engineering
[22:07:47] platform-engineering: https://gerrit.wikimedia.org/r/#q,https://gerrit.wikimedia.org/r/#/c/33713/5,n,z
[22:08:56] ^demon: shrug. it's a good change. i don't agree with faidon's review, but oh well.
[22:09:11] hashar's, above, i mean.
[22:10:19] <^demon> Meh
[22:14:11] New patchset: Yurik; "* Optimized opera_mini ACLs and added more IPv4 & IPv6 ranges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63077
[22:14:33] local bazaar commit ? oh wow, ori...
[22:14:58] i know, it's crazy, but things are the way they are, and i wasn't sure what would be better
[22:15:18] there are various tools that can translate git trees to bazaar repos, but going that route i ended up in sci-fi land pretty quickly
[22:15:27] i mean, thanks a lot for even attacking this one
[22:15:45] but using another versioning system seemed .. well ..surprising
[22:15:52] mutante: boogs.wmflabs.org :)
[22:16:09] sweet:)
[22:16:18] andre__: ^!! woot
[22:17:27] New patchset: Yurik; "* Optimized opera_mini ACLs and added more IPv4 & IPv6 ranges" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63077
[22:18:10] ori-l: you know, instead of doing exec { 'a2enmod rewrite': etc. you could also just do apache_module { rewrite: name => 'rewrite' }
[22:18:23] we usually use that to activate modules
[22:19:00] yeah, i know. there are two reasons for that -- one, i do most of my development/testing on vagrant, which doesn't share the same puppet codebase as operations/puppet
[22:19:24] two, some of the ones in operations/puppet aren't very well organized, or aren't worth the loss of portability
[22:20:09] i see. yea, when it comes to configuring apache things in general we have different methods in ops/puppet and i would never say it's well organized
[22:20:40] i heard apergos just installed vagrant to test and stuff , btw
[22:21:30] i think otto uses it for puppet dev too. i my pattern is vagrant -> labs -> beg for merge to prod on #wikimedia-operations
[22:22:39] the thing with operations/puppet is that it's very large, so there's a lot of 'action-at-a-distance' resources, where you use something but it's hard to chase down where you're getting it from
[22:22:50] do you use puppetmaster::self on labs or the central master or both?
[22:23:15] puppetmaster::self
[22:24:16] it'd be nice sometimes to be able to switch a ::self instance back to actual puppetmaster without having to use another one, and just make auto-gerrit my current status on ::self :)
[22:25:56] New patchset: Mattflaschen; "Update submodules recursively in Beta labs autoupdater." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63080
[22:26:16] examples of my points above: Systemuser resource (you loss portability in exchange for not having to type "group { 'groupname': ensure => present, }, otherwise it's like the 'User' built-in", Upstart_job (doesn't actually create an upstart service, but instead uses Ubuntu's 'upstart-job' init compatibility script)
[22:26:20] * lose
[22:27:56] Service { 'foo': provider => 'upstart', } is the way to go, but if you changed it now things would probably break, since it's used all over the place
[22:29:13] (because puppet is treating all of those services as init services)
[22:37:48] ori-l: were you ware of wikimedia/bugzilla/triagescripts and wikimedia/bugzilla/wikibugs (where the latter maybe shouldn't even be considered part of bugzilla)
[22:38:35] oh no
[22:38:38] triagescripts = "Greasemonkey browser scripts for triaging bug reports in bugzilla.wikimedia.org", oh well
[22:38:40] moar stuff
[22:38:47] probably also not really part of a server install
[22:40:01] New patchset: MaxSem; "Update mobile logos to 72px high, set default to WMF logo instead of WP-specific W" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63081
[22:45:03] New review: Dzahn; "using bazaar was surprising at first, same with using exec a2enmod instead of our usual apache_modul..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/62404
[22:45:12] New patchset: coren; "Tool Labs: Add missing packages for tools" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63063
[22:46:44] New review: coren; "Just packages." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/63063
[22:46:45] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63063
[22:46:54] New review: Dzahn; "also works here: http://boogs.wmflabs.org/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/62404
[22:47:01] mutante: btw, be advised that it doesn't work out of the box, because our patches reference 'url_filter', which was removed from bugzilla in 4.2
[22:47:41] i had to change references to 'url_filter' to 'uri' by hand. i have no idea how it works in production. i explained in https://rt.wikimedia.org/Ticket/Display.html?id=5115
[22:48:22] er, 'url_quote' rather. it should work if we do this: https://gerrit.wikimedia.org/r/#/c/62837/ (which is the same change bugzilla itself made: https://bugzilla.mozilla.org/show_bug.cgi?id=679096)
[22:49:13] lemme look it up in prod
[22:50:26] ori-l: it already uses "uri"
[22:51:13] i wonder if the SVN repo of wikimedia/bugzilla/modifications is more up-to-date
[22:51:16] * ori-l checks
[22:51:41] hrmm, this should have been synced and converted.sigh
[22:52:21] we could always add a third SCM to that puppet module :)
[22:52:29] hah:)
[22:53:12] grep url_quote * in /srv/org/wikimedia/bugzilla/template/en/custom/global = nothing
[22:53:44] has it maybe been checked in gerrit but not in 4.2 but in 4.1 or something?
[22:53:49] looking
[22:55:02] New patchset: coren; "Tool Labs: replace outated libpng2 with libpng3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63083
[22:56:14] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/63083
[22:57:30] mutante: i think it was just never committed. svn is not more up-to-date, and there are no obscure branches in git.
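The url_quote-to-uri template fix being traced here is mechanical: Bugzilla's Template Toolkit templates invoke it as a FILTER, so the prod-vs-gerrit difference can be audited or applied with grep and sed. A sketch; the directory is the one mutante grepped at 22:53, and the exact FILTER spelling is an assumption:

    cd /srv/org/wikimedia/bugzilla/template/en/custom
    grep -rl 'FILTER url_quote' . | xargs sed -i 's/FILTER url_quote/FILTER uri/g'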
[22:57:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:58:07] ori-l: the difference between prod and bugzilla-4.2 in gerrit is almost exactly what you did there
[22:58:11] checked all 3 files
[22:58:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time
[22:58:45] it replaces url_quote with uri and tiny thing like "My votes" -> "My Votes"
[22:59:39] let's merge your change then? it's not like it changes anything about escaping
[23:00:07] and grr..this should'nt even have happened..
[23:00:17] well, should we just kill two birds with one stone ? i can update the patch to include the nbsp
[23:00:46] Violence doesn't solve anything
[23:01:19] Reedy: http://paste.debian.net/3267/
[23:01:21] arg..
[23:01:23] <^demon> Or solves everything.
[23:01:26] ori-l: http://paste.debian.net/3267/
[23:01:27] s/Violence/live-hacking in prod/g
[23:01:37] One way to find out
[23:01:39] Test it on enwiki!
[23:01:57] New review: Jdlrobson; "The existing image is 129*129. Thus we shouldn't make it smaller and probably should use 129px to no..." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/63081
[23:01:59] <^demon> Violence on enwiki?
[23:02:06] We don't usually test our code .. but if we do..
[23:02:24] i'll update the patch
[23:02:28] cool
[23:04:06] http://en.wikipedia.org/wiki/Workplace_violence#Aggression :)
[23:06:39] !log pgehres synchronized php-1.22wmf3/extensions/CentralAuth/ 'Updates to CentralAuth'
[23:06:47] Logged the message, Master
[23:12:34] mutante, do you have a minute?
[23:14:11] andrewbogott: what's up
[23:14:29] I'm making dumb mysql mistakes. Can you log into labs instance rt-testing13?
[23:14:53] And then sudo to root and run $rt-setup-database-4 --prompt-for-dba-password --action upgrade
[23:15:58] give me a minute to get my labs key loaded etc
[23:16:23] It wants there to be an 'rtuser' user. I keep trying to create that user but the rt script is never satisfied.
[23:18:21] andrewbogott: i can't sudo to root yet on that instance
[23:18:29] and now i locked myself out it seems :p
[23:18:35] channel 0: open failed: administratively prohibited: open failed
[23:19:19] i can on another instance in another project
[23:19:45] That project allows all users to sudo any command on any instance...
[23:19:56] i could login to rt-testing3 but not sudo -s
[23:20:03] i tried 3 times
[23:20:11] then i tried on another instance, worked fine
[23:20:14] New patchset: Kaldari; "Replacing 16x16 officewiki favicon with 48x48 favicon" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63091
[23:20:16] then back to rt-testing3 i was locked out
[23:20:25] for trying too often i suppose
[23:20:26] well, anyway probably you can run that command without sudo since it authenticates for the db.
[23:20:46] but now i can't login at all anymore :p
[23:21:00] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63091
[23:21:04] We're talking about rt-testing13 right?
[23:21:12] 13 ?
[23:21:26] yep, luck 13.
[23:21:33] *lucky
[23:21:43] !log mwalker synchronized php-1.22wmf3/extensions/CentralNotice/ 'Updating CentralNotice to master for some bugfixes'
[23:21:43] heh, no i tried 3
[23:21:50] Logged the message, Master
[23:22:07] and i'm root :)
[23:22:07] mutante: updated https://gerrit.wikimedia.org/r/#/c/62837/
[23:22:48] So… can you see what's happening with rtuser?
[23:22:57] feel free to delete/recreate
[23:23:37] Enter RT version you're upgrading from:
[23:23:58] 3.8.11
[23:24:12] sweet that this even exists:)
[23:24:26] the upgrading script stuff..getting the right patches etc
[23:24:49] In theory the process is straightforward...
[23:26:13] New review: MaxSem; "If someone could explain how resizing at the irrgular ratio of 1.791(6) can make anything sharper...." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63081
[23:26:45] !log pgehres synchronized php-1.22wmf3/extensions/CentralAuth/CentralAuthUser.php 'Moving one more CentralAuth log line'
[23:26:53] Logged the message, Master
[23:27:59] New patchset: Mwalker; "Enable siteNotice div on all mobile sites" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63092
[23:32:05] andrewbogott: for now.. confirming it's weird.. it doesn't work like one would expect.. still trying things
[23:32:11] i see the issue
[23:32:21] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/63092
[23:32:26] I can't figure out what it wants :)
[23:33:54] !log mwalker synchronized wmf-config/InitialiseSettings.php 'Enabling siteNotice div on all alpha and beta mobile sites'
[23:34:02] Logged the message, Master
[23:34:22] The world needs more site notices.
[23:34:52] * MaxSem notices Susan
[23:34:59] Now the other cheek.
[23:40:31] stat("/usr/local/share/request-tracker4/lib/auto/DBD/mysql", 0x187e138) = -1 ENOENT (No such file or directory)
[23:40:34] stat("/usr/share/request-tracker4/lib/auto/DBD/mysql", 0x187e138) = -1 ENOENT (No such file or directory)
[23:40:37] stat("/etc/perl/auto/DBD/mysql", 0x187e138) = -1 ENOENT (No such file or directory)
[23:40:40] stat("/usr/local/lib/perl/5.14.2/auto/DBD/mysql", 0x187e138) = -1 ENOENT (No such file or directory)
[23:40:44] stat("/usr/local/share/perl/5.14.2/auto/DBD/mysql", 0x187e138) = -1 ENOENT (No such file or directory)
[23:40:47] andrewbogott:
[23:40:56] can't find the DBD stuff and stupidly reports it as "access denied" ?
[23:41:07] hm...
[23:41:18] because yea, it won't take anything
[23:41:34] i tried with empty pass and "foo" and from host % and 127.0.0.1 and localhost ..bla bla
[23:41:47] and i can connect just fine using mysql manually
[23:41:53] with the same user/pass i set
[23:41:58] Are those stat() calls from the script? Does it think the db is in the wrong place?
[23:41:59] package { 'libdbd-mysql-perl': ensure => present, }
[23:42:28] andrewbogott: yes, it's adding strace to the whole upgrade commandline and entering the info and continue ...
[23:42:29] ori-l, it's there already
[23:43:01] open("/usr/lib/perl5/auto/DBD/mysql/mysql.so", O_RDONLY|O_CLOEXEC) = 4
[23:43:04] ah, ok.. uhm
[23:43:18] the heck, why would it use an explicit path for the db?
[23:43:23] "DBI connect('dbname=rtdb;host=lo".
[23:43:27] note it's just "lo"
[23:43:46] no, that is just cut off..
[23:44:47] can you tell it to use socket instead of TCP?
[23:45:50] wait, localhost should be socket and 127.0.0.1 should be TCP
[23:46:03] i bet it's somewhere there, but already tried both in grants
[23:46:54] /etc/request-tracker4/RT_SiteConfig.d/51-dbconfig-common
[23:47:09] That is a copy of the one used by 3.8.11
[23:47:32] Oh, there's a password in there too. Interesting...
[23:47:59] Although that doesn't seem to help
[23:48:12] heh, yea, i just saw that
[23:48:13] trying
[23:49:35] do you need 'skip-create'?
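The stat()/ENOENT lines and the truncated DBI connect('dbname=rtdb;host=lo string above are what "adding strace to the whole upgrade commandline" looks like: wrap the RT upgrade in strace and watch perl probe for DBD::mysql and open the DSN. A sketch; the option list is trimmed down, and -s widens captured strings so the DSN is not cut off at the default 32 characters:

    strace -f -s 256 -e trace=open,stat,connect,write \
        rt-setup-database-4 --prompt-for-dba-password --action upgrade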
[23:51:01] I am (approximately) following this guide here: http://guzaho.wordpress.com/2011/07/03/upgrading-rt3-to-rt4-on-ubuntu-10-04-64bit-lts/
[23:52:11] andrewbogott: i got it:)
[23:52:13] it's upgrading
[23:52:18] but i see it happening in strace. hah
[23:52:18] What did you change?
[23:52:52] i used the password from that file in a grant
[23:53:08] and i used @'localhost' instead of @'%'
[23:53:17] and i deleted all other users from User table
[23:53:22] and flush privileges
[23:53:56] Weird...
[23:54:00] because i noticed that i couldn't do mysql -h localhost -u rtuser ...
[23:54:09] localhost is socket and not TCP
[23:54:10] Well, I guess as long as you remember how to do it, we only need to do it once more, ever :)
[23:54:15] therefore it doesnt apply to %
[23:54:25] heh, ok
[23:54:34] I was doing @'localhost' but guess I had the wrong password too
[23:54:48] anyway, lemme see if I can fix up the apache config so we can see how it looks
[23:55:00] write(1, "Done.\n", 6Done.
[23:56:28] exit_group(0) = ?
[23:57:25] My single ticket and queue survived the upgrade.
[23:58:08] So I guess that's it -- I'll puppetize version 4 and we can upgrade production anytime.
[23:58:22] weee!:) that's awesome
[23:58:30] I guess we should schedule a window… do you know how to do that?
[23:58:41] e.g. which calendar to use?
[23:58:47] with people or with monitoring?
[23:58:53] ah.. ehm..
[23:59:08] i don't think it fits the deployment calendar, so no
[23:59:19] announce via mail on list should be good enough
[23:59:19] It's not super important, but will result in a bit of downtime. We'll want to take the interface down so we can get a clean backup and such.
[23:59:57] OK. Monday good for you?
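For the record, the grant that finally satisfied the upgrade script, per mutante's play-by-play above: rtuser scoped to 'localhost' (socket connections never match '%') with the password taken from 51-dbconfig-common. A sketch; the privilege list is an assumption, the database name comes from the DSN in the strace output, and the placeholder is obviously not the real password:

    mysql -u root -p -e "GRANT ALL PRIVILEGES ON rtdb.* TO 'rtuser'@'localhost' \
        IDENTIFIED BY '<password from 51-dbconfig-common>'; FLUSH PRIVILEGES;"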