[00:09:50] !log bsitu Finished syncing Wikimedia installation... : IE ajax browser cache fix for 1.22wmf11 and scap to update message cache for 1.22wmf12 [00:10:00] Logged the message, Master [00:11:25] bsitu, please ping me when you're done [00:11:49] MaxSem: the scap was just done [00:12:08] so you don't need to do anything else? [00:12:10] MaxSem: you can go ahead if you want to do some more deploys [00:12:15] thanks [00:12:17] nope [00:13:06] nope is to "so you don't need to do anything else?" , :) [00:15:13] !log maxsem synchronized php-1.22wmf12/extensions/MobileFrontend/includes/MobileContext.php 'https://gerrit.wikimedia.org/r/#/c/76043/' [00:15:24] Logged the message, Master [00:16:39] !log maxsem synchronized php-1.22wmf11/extensions/MobileFrontend/includes/MobileContext.php 'https://gerrit.wikimedia.org/r/#/c/76043/' [00:16:50] Logged the message, Master [00:25:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [00:27:56] Reedy: btw, how did the wikivoyage outage turn out? I don't see an incident doc yet. And the bug is still open and the patch last linked in the bug is unmerged. [00:28:07] (also, was it limited to wikivoyage?) [00:30:41] turn out? surely you can tell that by the fact you can visit the sites without an error? [00:31:12] the wikivoyage wikis were the only ones on wmf11 that were running wikidata code [00:31:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:34:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [00:35:13] * twkozlowski takes notes for Tech News #31 [00:35:38] http://ur1.ca/edq1f [00:36:36] "A big blue rectangle appeared on Thursday which caused an hour-long outage of Wikivoyage." [00:36:52] I can't believe we have that many people using wikivoyage [00:37:19] 500m/sec errors? [00:37:31] twkozlowski: :D [00:38:10] :-) That's still too complicated, basile would bite me if I wrote something like that. [00:38:37] Reedy: the 'm' is milli, or 0.001 [00:38:39] "A big blue rectangle appeared on Thursday; Wikivoyage was offline for an hour." [00:38:52] so 500m = an error every two seconds [00:38:59] Pfft [00:39:02] That's nothing [00:39:21] but i artificially capped the graph at 500m [00:39:52] otherwise the occasional bad deploy pushes the boundaries such that the graph becomes unreadable [00:39:59] MaxSem: are you done? One of the JS files is not updated in resourceloader, I need to touch and sync the file [00:40:19] bsitu, go ahead [00:40:25] MaxSem: thx [00:40:29] Reedy: http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Miscellaneous+eqiad&h=vanadium.eqiad.wmnet&jr=&js=&v=29597&m=exception&vl=errors&ti=Exceptions [00:40:31] 7/s [00:40:53] twkozlowski: "Today a dark grey squiggly thing caused ..." :) [00:41:22] Just be grateful wikipedias weren't on 1.22wmf11 at that point ;) [00:42:28] ori-l: squiggly ain't simple enough! but nice capture anyway [00:42:34] Reedy: yes yes, I know the wmf11 sites are back up, but that's just the first step. I don't handle these things, but a report and a closed bug should be the eventual end result [00:42:44] for all I know you rolled back and it is a temp solution [00:43:07] Rolled what back?
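[Editor's note: what bsitu describes at 00:39:59 is a standard ResourceLoader cache-busting trick: bump the file's mtime so the module gets re-hashed, then deploy the file. A minimal shell sketch, assuming the sync-file helper behind the "synchronized" !log entries that follow; the exact invocation is an assumption, not taken from the log:]

    # Bump the mtime so ResourceLoader sees a new version, then push it out.
    touch php-1.22wmf11/extensions/Echo/modules/overlay/ext.echo.overlay.js
    sync-file php-1.22wmf11/extensions/Echo/modules/overlay/ext.echo.overlay.js 'update cache for resourceloader'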
[00:43:14] Technically, aude's commit is a hack [00:43:21] ie the one at the bottom of the bug comments [00:44:33] just suppressing the exceptions and carrying on [00:46:08] try { $mediaWiki = new MediaWiki(); $mediaWiki->run(); } catch ( Exception $e ) { echo "Reedy did it!\n"; } [00:46:32] {{cn|date=July 2013}} [00:46:38] !log bsitu synchronized php-1.22wmf11/extensions/Echo/modules/overlay/ext.echo.overlay.js 'update cache for resourceloader' [00:46:49] Logged the message, Master [00:47:02] Heh. [00:47:18] !log bsitu synchronized php-1.22wmf11/extensions/Echo/modules/special/ext.echo.special.js 'update cache for resourceloader' [00:47:29] Logged the message, Master [00:47:57] Reedy: Well, I suppose (unrelated to this incident) the least we could do is not serve a useless 1-line text/plain response from apache/php, but serve wmerror or something like it. [00:48:25] something more useful to users, not to us though [00:48:26] Krinkle: Apparently outages are a great time to ask for donations [00:48:41] Reedy: indeed, our wmerror page does that afaik [00:49:01] "[hash] gizmo wizardry" does not. [00:49:22] Reedy: So how was it fixed? That commit isn't merged afaik [00:49:33] Rebuilding the localisation cache [00:50:12] Another example where it'd be great if we had cache backups to revert to rather than waiting for them to rebuild and re-sync [00:50:36] sure [00:51:00] Reedy: How did that fix it though, didn't you do anything else? Why did it fail if we built it without changing anything. [00:51:18] It got upset with the Wikibase entries in extension-messages [00:51:27] Or was something unlisted, or listed between it going down from i18n rebuild and it being rebuilt by you. [00:51:37] Effectively ignoring some of them (rather than fatalling like it did last time around) [00:51:51] Which led to a missing magic word in the wmf11 cache that wikivoyage was using [00:53:12] Reedy: Was there anything wrong with those entries in ext-messages? e.g. was this caused by something being off in wikibase master that they put in master but didn't intend to get live? [00:53:22] or a race condition that was solved by rebuilding the exact same thing. [00:53:41] We tried a workaround for the fatal that wikibase threw complaining about load order [00:53:54] Which for l10n, we couldn't really care less about [00:54:07] So Katie removed that as part of the fixup [01:00:59] So how long was the outage? 40 minutes? [01:05:31] Reedy: https://wikitech.wikimedia.org/wiki/Incident_documentation/20130725-Wikivoyage If you could elaborate/narrow down when you have a minute. [01:07:16] Reedy: I don't see any merged commits in the Wikibase repo, so I guess that sync was a local patch? [01:07:26] Nope.. [01:07:51] k, I'll let you handle it. [01:08:03] (* 17:12 logmsgbot: reedy synchronized php-1.22wmf11/extensions/Wikibase ) [01:08:07] https://git.wikimedia.org/commit/mediawiki%2Fextensions%2FWikibase.git/refs%2Fheads%2Fmw1.22-wmf11 [01:08:57] I see [01:09:14] That's branch only so it will happen again next branch if it isn't addressed? [01:09:45] !log reedy synchronized php-1.22wmf11/languages/messages/MessagesTl.php [01:09:48] Press the cherry pick button! [01:09:56] Logged the message, Master [01:10:56] Reedy: So the localization cache rebuild included extensions in an order that caused this defined check to fail, thus threw an exception and aborted early, and as such various extensions' i18n data was not part of the resulting cache that got synced by l10nbot?
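[Editor's note: the fix Reedy names at 00:49:33 is a localisation cache rebuild. In stock MediaWiki that is the rebuildLocalisationCache.php maintenance script; the sketch below assumes the WMF mwscript wrapper and the script's --force flag, neither of which appears in the log:]

    # Rebuild all l10n cache entries, not just the out-of-date ones.
    mwscript rebuildLocalisationCache.php --wiki=enwikivoyage --force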
[01:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [01:51:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:53:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [02:03:00] !log LocalisationUpdate failed: git pull of extensions failed [02:03:11] Logged the message, Master [02:21:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:24] anyone looking into that ^^ ? [02:21:32] !log LocalisationUpdate failed: git pull of extensions failed [02:23:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [02:24:57] error is from line 26 of files/misc/l10nupdate/l10nupdate-1 in operations/puppet [02:25:01] https://git.wikimedia.org/blob/operations%2Fpuppet.git/9f866900e745b78ec95a7dc51bfdea4107dde870/files%2Fmisc%2Fl10nupdate%2Fl10nupdate-1#L26 [02:35:08] !log ran the git-clone update component of l10nupdate (l10nupdate-1, lines 17-45) as 'l10nupdate' user to try and reproduce "git pull of extensions failed" error, but it completed successfully for me. [02:35:19] Logged the message, Master [02:52:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:53:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [02:57:58] PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours [02:58:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:59:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [03:56:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:57:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.169 second response time [04:05:56] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [04:22:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:23:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.168 second response time [04:44:11] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [04:52:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:53:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.172 second response time [04:58:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:59:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.160 second response time [05:09:14] (PS1) Ori.livneh: Re-write of MongoDB module [operations/puppet] - https://gerrit.wikimedia.org/r/76059 [05:11:10] (cue nosql joke in 3, 2, 1...) 
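[Editor's note: a sketch of the reproduction ori-l logs at 02:35:08, running the git update step as the same user the cron job uses. The checkout path and the per-extension loop are assumptions; the real logic lives in lines 17-45 of l10nupdate-1, linked above:]

    sudo -u l10nupdate bash -s <<'EOF'
    cd /var/lib/l10nupdate/mediawiki/extensions   # hypothetical working copy path
    for ext in */; do
        ( cd "$ext" && git pull --quiet ) || echo "git pull of extensions failed: $ext" >&2
    done
    EOF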
[05:24:42] ori-l: I would but can't think of one...I guess my brain isn't fast enough at processing large amounts of semi-structured data [05:27:30] Aaron|home: I can't relate. [05:27:41] ;) [05:52:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:53:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.170 second response time [06:19:08] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours [06:21:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:23:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.162 second response time [07:27:04] happy sysadmin day! [07:27:13] it is? huh [07:28:01] hah, so it is! happy sysadmin day [07:28:02] https://en.wikipedia.org/wiki/System_Administrator_Appreciation_Day [07:28:20] http://sysadminday.com/ "Your network is secure, your computer is up and running, and your printer is jam-free." [07:28:24] I didn't know you guys did printers [07:28:27] let us all take the words of DevOpBorat to heart on this fine day: https://twitter.com/DEVOPS_BORAT/status/123869351462961152 [07:28:29] I'd have filed more RT tickets :D [07:29:06] ori-l: http://theoatmeal.com/comics/printers [07:30:22] Celebrations: Cake and ice cream # :-) [07:30:40] heh [07:31:22] ♥ the oatmeal [08:02:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:03:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [08:12:56] good morning [08:16:24] (Abandoned) Hashar: (bug 50929) Remove 'visualeditor-enable' from $wgHiddenPrefs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/73565 (owner: Odder) [08:31:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:32:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.156 second response time [08:35:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:38:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.018 second response time [08:57:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:58:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [09:17:08] (PS6) Hashar: (bug 41285) adapt `foreachwiki` for labs [operations/puppet] - https://gerrit.wikimedia.org/r/55059 [09:17:20] (PS7) Hashar: adapt `foreachwiki` for labs [operations/puppet] - https://gerrit.wikimedia.org/r/55059 [09:20:44] (CR) Hashar: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/71968 (owner: Hashar) [09:20:48] (CR) Hashar: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/71968 (owner: Hashar) [09:21:37] PROBLEM - MySQL Replication Heartbeat on db1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:21:57] PROBLEM - Disk space on db1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:22:07] PROBLEM - MySQL disk space on db1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:22:17] PROBLEM - MySQL Slave Delay on db1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:22:17] PROBLEM - mysqld processes on db1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:22:17] PROBLEM - RAID on db1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:22:37] PROBLEM - SSH on db1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:22:37] PROBLEM - MySQL Slave Running on db1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:22:37] PROBLEM - Full LVS Snapshot on db1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:22:37] PROBLEM - MySQL Recent Restart on db1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:22:37] PROBLEM - MySQL Idle Transactions on db1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:22:47] PROBLEM - DPKG on db1042 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:24:31] (PS10) Hashar: contint: publish Zuul git repositories [operations/puppet] - https://gerrit.wikimedia.org/r/71968 [09:24:48] (CR) Hashar: "Allowed gallium and rebased." [operations/puppet] - https://gerrit.wikimedia.org/r/71968 (owner: Hashar) [09:26:57] (CR) Patrick87: "Wrong, https://gerrit.wikimedia.org/r/#/c/75541/ will only expose a preference during beta period." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/73565 (owner: Odder) [09:29:32] yeah one more person [09:32:35] (CR) Hashar: "We talked on the thread how that preference should not be configured in operations/mediawiki-config but in the extension itself." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/73565 (owner: Odder) [09:35:17] PROBLEM - NTP on db1042 is CRITICAL: NTP CRITICAL: No response from NTP server [09:51:04] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [09:51:04] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [09:51:04] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [09:51:04] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [09:51:04] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [09:51:05] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [09:55:52] (CR) Mark Bergsma: [C: 2] Don't install the Ganglia Apache plugin for now [operations/puppet] - https://gerrit.wikimedia.org/r/75915 (owner: Mark Bergsma) [10:27:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:37:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.132 second response time [10:39:16] (PS1) Hashar: always collect ssh keys on tin (no more on fenari) [operations/puppet] - https://gerrit.wikimedia.org/r/76072 [10:42:09] (PS3) Mark Bergsma: Fix XFF handling on all Varnish clusters [operations/puppet] - https://gerrit.wikimedia.org/r/75860 [10:43:04] (CR) Mark Bergsma: [C: -1] "fenari is still a bastion host, so should keep the keys as well (for the time being). Please add, not replace. :)" [operations/puppet] - https://gerrit.wikimedia.org/r/76072 (owner: Hashar) [10:43:59] (PS2) Hashar: always collect ssh keys on tin (no more on fenari) [operations/puppet] - https://gerrit.wikimedia.org/r/76072 [10:44:03] which beta cache had puppetmaster::self again, hashar?
[10:44:17] mark: and I am not sure how puppet knows about tin hostname. Is it really tin or tin.eqiad.wmnet ? :-D [10:44:27] $::hostname is tin [10:44:31] $::fqdn is tin.eqiad.wmnet [10:44:39] ah! [10:44:53] on beta all caches are using the regular production branch [10:45:06] I am not sure there is any with ::self [10:45:26] (CR) Mark Bergsma: [C: 2] always collect ssh keys on tin (no more on fenari) [operations/puppet] - https://gerrit.wikimedia.org/r/76072 (owner: Hashar) [10:45:41] hmm [10:45:46] how am I supposed to test unmerged changes then [10:45:52] indeed none with ::self [10:45:55] i can merge them but then they hit production before I'm done testing [10:46:57] we could create another cache instance using ::self [10:47:02] and have it added in mediawiki config [10:47:17] also note that I'm working on putting ssl proxies on the caches [10:47:22] i know this already exists in labs [10:47:25] but production will use a different config [10:47:31] and of course that'll hit beta as well [10:47:51] on beta it is a bit hacky since nginx always uses 127.0.0.1 as an upstream peer [10:48:01] on production that will be the same [10:48:05] and we don't have lvs so nginx listens on :443 for *beta.wmflabs.org [10:48:23] same same [10:48:37] but the puppet manifest will be a bit different [10:48:38] (and cleaner) [10:49:02] i don't see any problems with labs using the new stuff production will get, but it might clash in the beginning ;) [10:49:27] for ::self, I guess an instance without a public IP would work. You could test it out using curl queries maybe and a hacked /etc/hosts [10:49:43] there was one, right? [10:49:45] or we can set up another subdomain for not yet merged changes [10:49:46] did you convert it back? [10:49:53] I might have deleted it [10:50:40] deployment-cache-text1 is a m1.medium instance with classes: role::cache::text, role::protoproxy::ssl::beta [10:50:57] let me create one :-] [10:51:17] not sure about the name, deployment-staging-cache-text1 ? that is looong [10:51:30] fenari has been deprecated in favor of tin? [10:51:31] for bastion? [10:51:36] mobile would be best now [10:51:44] paravoid: for deployment [10:51:54] but collected keys are useful for bastion too [10:52:07] paravoid: na I was wrong :/ I meant it is the main work machine where MW folks do the file sync so I thought having an up to date known_hosts there was good [10:52:09] hence my review... [10:52:25] tin does not have a public address anyway so hard to be used as a bastion [10:52:33] I should rephrase my commit summary [10:52:44] a bit late [10:52:47] i already merged it [10:53:01] poor grrrit-wm did not notify the merge :( [10:53:23] we ops people review and merge so fast, not even the bots can keep up [10:53:26] and people are still whining! ;-) [10:53:37] oh, I didn't see your first review [10:53:37] Created instance i-0000085a with image "ubuntu-12.04-precise" and hostname i-0000085a.pmtpa.wmflabs. [10:53:48] okay, I'll go back in my hole :) [10:54:15] paravoid: will you be able to get out of your cavern this afternoon to review some CI related puppet changes ?
:D [10:54:29] yeah [10:55:38] mark: deployment-staging-cache-text1.pmtpa.wmflabs is building, you can then apply the role::puppet::self there and run puppet manually [10:55:45] thanks [10:55:48] but I needed mobile ;) [10:55:50] mark: then I guess you only want to use role::cache::text [10:55:52] ahh [10:55:58] text I can still test in production, noone's using it hehe [10:56:10] deleting [10:57:23] deployment-staging-cache-mobile01 same process :-] [11:01:11] ty! [11:01:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:02:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [11:04:17] mark: also could you check on the status of our Ubuntu mirror https://launchpad.net/ubuntu/+mirror/ubuntu.wikimedia.org-archive ? [11:04:34] our raring copy seems to be lagging behind and is missing the python-d2to1 package :-] [11:04:43] http://packages.ubuntu.com/search?keywords=python-d2to1 (quantal + raring) [11:04:53] ok [11:09:28] hey the source package is named d2to1 and is in our mirror :-] [11:09:38] still seems sync is broken somehow though [11:12:27] backporting packages is so easy once you have a doc ( https://wikitech.wikimedia.org/wiki/Backport_packages ) [11:20:23] off to lunch [11:23:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:24:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.136 second response time [11:31:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:32:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [12:00:22] (PS1) MaxSem: Icinga check for host's presence in needed dsh groups [operations/puppet] - https://gerrit.wikimedia.org/r/76084 [12:01:30] hey, can someone tell me if ^^ is a good idea? I found a dreadfully out of sync appserver yesterday, this would not have happened if we had the above check:) [12:21:24] re [12:21:37] mark: congrats on syncing the ubuntu mirror :-] [12:21:47] i didn't do anything [12:22:03] https://launchpad.net/ubuntu/+mirror/ubuntu.wikimedia.org-archive is almost all green "Up to date" [12:58:45] PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours [13:08:56] paravoid: I am making progress regarding Debian packages. I now know how to update my cow builder images :] [13:08:58] BASEPATH=/var/cache/pbuilder/precise.cow cowbuilder --distribution=precise --update [13:09:22] still have to find out how to inject packages from the /var/cache/pbuilder/result hehe [13:16:32] wth [13:16:38] now the new beta instance doesn't let me login anymore [13:17:35] :-( [13:17:59] after applying puppet ::self ? :( [13:19:24] lets try applying the role::labsnfs::client [13:19:40] that will make /home point to the NFS server instead of GlusterFS [13:20:16] mark: can we reboot it?
Or if you can connect as root, you could run puppetd -tv to get the NFS class installed [13:20:27] I suspect the /home is unreadable / missing your ssh key [13:20:54] can't login as root either [13:21:04] rebooting it [13:21:06] will see [13:26:06] mark: I logged in [13:26:20] it now has a NFS home labnfs.pmtpa.wmnet:/deployment-prep/home on /home type nfs4 (rw,port=0,nfsvers=4,hard,rsize=8192,wsize=8192,sec=sys,sloppy,addr=10.0.0.45,clientaddr=10.4.1.68) [13:26:23] yeah me too [13:26:24] thanks [13:26:27] sorry [13:26:53] I phased out GlusterFS but there is no good way to provide NFS mounts instead of GlusterFS mounts. That is provisioned by LDAP [13:27:05] so one has to remember to use the role::labsnfs::client to override LDAP config :/ [13:27:21] you probably want to dist-upgrade the instance to get the latest varnish packages [13:40:47] (PS5) Akosiaris: Introducing bacula module [operations/puppet] - https://gerrit.wikimedia.org/r/70840 [13:48:16] (PS4) Mark Bergsma: Fix XFF handling on all Varnish clusters [operations/puppet] - https://gerrit.wikimedia.org/r/75860 [13:59:23] (CR) Faidon: [C: -1] "Backup client is not a role, maybe a base.pp class included from individual role classes." [operations/puppet] - https://gerrit.wikimedia.org/r/70840 (owner: Akosiaris) [14:07:10] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [14:44:59] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [15:31:09] !log reedy synchronized php-1.22wmf11/extensions/Wikibase [15:31:19] Logged the message, Master [15:33:09] !log reedy synchronized php-1.22wmf12/extensions/Wikibase [15:33:18] Logged the message, Master [16:12:15] I am off, see you on monday [16:18:45] PROBLEM - Varnish HTTP mobile-backend on cp3011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:18:45] PROBLEM - Varnish HTCP daemon on cp3011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:19:25] PROBLEM - Varnish traffic logger on cp3011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:20:05] PROBLEM - Puppet freshness on sq41 is CRITICAL: No successful Puppet run in the last 10 hours [16:35:04] (PS1) Ottomata: Creating account on stat1 and stat1001 for qchris. RT 5474 [operations/puppet] - https://gerrit.wikimedia.org/r/76113 [16:35:16] (CR) Ottomata: [C: 2 V: 2] Creating account on stat1 and stat1001 for qchris.
RT 5474 [operations/puppet] - https://gerrit.wikimedia.org/r/76113 (owner: Ottomata) [16:43:49] where's Ryan to complain to when you need him [16:53:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:55:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [17:05:43] !log Rebooting cp3011 [17:05:53] Logged the message, Master [17:07:17] PROBLEM - SSH on cp3011 is CRITICAL: Connection refused [17:10:57] PROBLEM - Host cp3011 is DOWN: PING CRITICAL - Packet loss = 100% [17:12:07] RECOVERY - Host cp3011 is UP: PING OK - Packet loss = 0%, RTA = 87.80 ms [17:12:17] RECOVERY - Varnish traffic logger on cp3011 is OK: PROCS OK: 2 processes with command name varnishncsa [17:12:17] RECOVERY - SSH on cp3011 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [17:12:37] RECOVERY - Varnish HTTP mobile-backend on cp3011 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.181 second response time [17:12:38] RECOVERY - Varnish HTCP daemon on cp3011 is OK: PROCS OK: 1 process with UID = 110 (vhtcpd), args vhtcpd [17:27:17] heya ori-l got a puppet preference question for you [17:27:23] i'm on the fence about something [17:28:33] (CR) Aaron Schulz: [C: 1] Update protection configs for core change I6bf650a3 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/71538 (owner: Anomie) [17:41:49] Hey All, Happy Sysadmin day! :D [17:42:37] Obligatory link (even though it's more IT focused) http://www.youtube.com/watch?v=udhd9fmOdCs [18:10:13] (PS1) Asher: pulling db1042 from s4 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/76119 [18:10:36] (CR) Asher: [ C: 2 V: 2] pulling db1042 from s4 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/76119 (owner: Asher) [18:11:07] ottomata: what's up?
[18:11:49] !log asher synchronized wmf-config/db-eqiad.php 'pulling db1042 from s4' [18:12:00] Logged the message, Master [18:13:52] !log db1042 unresponsive on serial console, power cycling [18:14:06] Logged the message, Master [18:16:50] RECOVERY - SSH on db1042 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:17:00] RECOVERY - Disk space on db1042 is OK: DISK OK [18:17:00] RECOVERY - DPKG on db1042 is OK: All packages OK [18:17:00] RECOVERY - MySQL Recent Restart on db1042 is OK: OK seconds since restart [18:17:10] RECOVERY - MySQL Idle Transactions on db1042 is OK: OK longest blocking idle transaction sleeps for seconds [18:17:10] RECOVERY - MySQL Slave Delay on db1042 is OK: OK replication delay seconds [18:17:20] RECOVERY - MySQL Slave Running on db1042 is OK: OK replication [18:17:20] RECOVERY - Full LVS Snapshot on db1042 is OK: OK no full LVM snapshot volumes [18:17:30] RECOVERY - MySQL disk space on db1042 is OK: DISK OK [18:17:30] RECOVERY - MySQL Replication Heartbeat on db1042 is OK: OK replication delay seconds [18:17:40] RECOVERY - RAID on db1042 is OK: OK: State is Optimal, checked 2 logical device(s) [18:21:05] !log installing package upgrades on iron [18:21:15] Logged the message, Master [18:25:00] RECOVERY - mysqld processes on db1042 is OK: PROCS OK: 1 process with command name mysqld [18:29:29] (PS1) Andrew Bogott: Remove default_interface fact [operations/puppet] - https://gerrit.wikimedia.org/r/76120 [18:48:49] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:49:40] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [18:52:49] PROBLEM - SSH on pdf2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:54:49] RECOVERY - SSH on pdf2 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [18:55:36] ori-l, just saw your response, hang on [19:05:26] ori-l [19:05:27] so [19:05:41] i'm puppetizing up HA NameNode stuff [19:05:43] there is a new service [19:05:47] called JournalNode [19:06:01] which runs as a quorum, so it needs at least 3 nodes [19:06:10] the hadoop configs are still all the same on all of the hadoop nodes [19:06:30] I will be specifying the list of $journalnode_hostnames on all hadoop nodes [19:06:44] this means that I *could* conditionally include the journalnode class [19:06:48] based on hostname [19:06:57] which would be slick, but a little magical [19:07:03] something like [19:07:22] if ($fqdn in $journalnode_hostnames) { include cdh4::hadoop::journalnode } [19:07:25] or.
[19:07:35] i could just let people include that manually themselves [19:07:45] so they'd have to specify the list of journalnode hostnames to each hadoop node [19:07:59] AND manually include the journalnode class on each of the journalnode nodes [19:22:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [19:26:18] ottomata: stepped into a meeting in the interim but i'll reply once i'm out [19:27:53] k [19:39:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [19:43:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [19:51:08] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:08] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:08] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:08] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:08] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:09] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [19:51:49] (PS1) Dzahn: add visualwikipedia.com and .net redirects (RT #4677) [operations/apache-config] - https://gerrit.wikimedia.org/r/76125 [19:52:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:54:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [19:55:16] (CR) Dzahn: [ C: 2] add visualwikipedia.com and .net redirects (RT #4677) [operations/apache-config] - https://gerrit.wikimedia.org/r/76125 (owner: Dzahn) [20:01:26] syncs apache and it takes ...long [20:04:34] !log synced apache, gracefull'ed, activate visualwikipedia.com and .net [20:04:45] Logged the message, Master [20:05:01] visualwikipedia? [20:05:50] heh, yea [20:05:56] perhaps you should activate wikipedia.hr? [20:05:58] they must have existed before sometime in the past [20:06:16] they were in DNS but we didnt have them in between [20:06:21] Oops, I forgot this channel is logged. [20:17:56] (PS1) Andrew Bogott: Support modularized private repo. [operations/puppet] - https://gerrit.wikimedia.org/r/76129 [20:18:55] (CR) Andrew Bogott: [ C: -2] "Do not merge!" [operations/puppet] - https://gerrit.wikimedia.org/r/76129 (owner: Andrew Bogott) [20:30:05] (CR) Dzahn: "class nrpe::packages {" [operations/puppet] - https://gerrit.wikimedia.org/r/75777 (owner: Demon) [20:34:01] (PS1) Ori.livneh: Update my SSH key (user 'olivneh') [operations/puppet] - https://gerrit.wikimedia.org/r/76138 [20:39:54] (PS1) Dzahn: add wikimediacommons.pt and merge wikimediacommons.co.uk into existing regex to simplify.
[operations/apache-config] - https://gerrit.wikimedia.org/r/76191 [20:40:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:41:04] when did gerrit-wm become grrrit-wm ? did i miss another trademark issue? [20:41:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [20:42:30] (CR) Lcarr: [ C: 2] Update my SSH key (user 'olivneh') [operations/puppet] - https://gerrit.wikimedia.org/r/76138 (owner: Ori.livneh) [20:56:21] (PS1) Dzahn: add softwarewikipedia.org and .com , just like existing .net [operations/apache-config] - https://gerrit.wikimedia.org/r/76195 [20:56:38] MaxSem: i like the dsh check - https://gerrit.wikimedia.org/r/#/c/76084/1 - shall i merge it ? [20:56:52] LeslieCarr, wee [20:57:08] sure, feel free to - however it needs to be used [20:57:35] ? [20:58:06] RobH: any reason to not merge https://gerrit.wikimedia.org/r/#/c/69591/1 ? [20:58:33] LeslieCarr, it works by slapping dsh_groups into role classes [20:58:37] LeslieCarr: nah, it just went in on one of the zuul down days [20:58:40] and i never got back to it. [20:58:43] the commit doesn't add anything so far [20:58:53] but it only commits the cert [20:58:56] not enables [20:58:59] yep to both [21:00:35] Would that change have helped prevent something like https://gerrit.wikimedia.org/r/#/c/75791/ ? [21:00:47] is the matching key already in private repo [21:00:58] mutante: pretty sure it is yea, but would have to check [21:01:02] i think i committed both [21:01:06] and private doesnt care for zuul stuff [21:01:09] so it went live when i did it [21:01:22] doesn't make sense for me to half do it, and i vaguely recall this [21:01:31] RoanKattouw: not yet [21:01:41] i somehow always pick the zuul issue period to do my commits ;_; [21:01:43] stats.wikimedia.org.key looks good, confirmed [21:01:48] (PS1) Manybubbles: Fix in process runjobs in singlenode mediawiki. [operations/puppet] - https://gerrit.wikimedia.org/r/76196 [21:01:50] (CR) Lcarr: [ C: 2] RT 5337 stats.wikimedia.org ssl cert [operations/puppet] - https://gerrit.wikimedia.org/r/69591 (owner: RobH) [21:01:57] hehe [21:02:03] heh, i've never seen someone commit my stuff before [21:02:05] i'm just wanting to clear out the gerrit queue a little [21:02:08] didnt realize it would still ping me [21:02:24] i dont pay attention to irc output when i work in gerrit. [21:02:58] (PS2) Manybubbles: Fix in process runjobs in singlenode mediawiki. [operations/puppet] - https://gerrit.wikimedia.org/r/76196 [21:03:14] LeslieCarr: https://gerrit.wikimedia.org/r/#/c/72666/ [21:03:14] RobH: The bot was rewritten recently, the new version was deployed like this week, so maybe that's why [21:03:19] (PS2) Lcarr: Add eqiad bits caches to /etc/dsh/group/bits [operations/puppet] - https://gerrit.wikimedia.org/r/75791 (owner: Catrope) [21:03:25] ottomata: I don't think it's too magical [21:03:36] (CR) Lcarr: [ C: 2] Add eqiad bits caches to /etc/dsh/group/bits [operations/puppet] - https://gerrit.wikimedia.org/r/75791 (owner: Catrope) [21:03:48] I think it makes sense to auto-include the class based on membership in an array of journalnode hosts [21:03:53] RoanKattouw: ahh, makes sense [21:04:03] cuz someone must have somehow committed my code before [21:04:06] doesnt seem reasonable otherwise.
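[Editor's note: one common way to perform the "key looks good, confirmed" check mutante mentions at 21:01:43 is to compare the RSA modulus of the certificate and the private key; the file names below are guesses, not paths from the private repo:]

    # The two digests must be identical for the key to match the cert.
    openssl x509 -noout -modulus -in stats.wikimedia.org.crt | md5sum
    openssl rsa  -noout -modulus -in stats.wikimedia.org.key | md5sum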
[21:04:13] ok thanks ori-l, i'm working on that now [21:04:14] danke [21:04:33] Yeah I think pinging the owner is a new feature [21:04:40] (it is) [21:04:49] (CR) Lcarr: [ C: -1] "needs a rebase :(" [operations/puppet] - https://gerrit.wikimedia.org/r/72666 (owner: Dzahn) [21:04:50] ottomata: you could also require that journalnodes have 'journalnode' in their hostname and then regex match on that [21:05:10] LeslieCarr: visiting https://stats.wikimedia.org/ throws a warning saying "You attempted to reach stats.wikimedia.org, but instead you actually reached a server identifying itself as metrics.wikimedia.org." [21:05:39] naw, journalnode can be anywhere [21:05:50] will probably run on a few datanodes somewhere [21:06:05] uh oh [21:06:09] PROBLEM - DPKG on cp1051 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:06:29] PROBLEM - Varnish HTTP upload-backend on cp1051 is CRITICAL: Connection refused [21:06:55] drdee, you sure it didn't do that before? [21:07:30] drdee - we haven't updated anything yet though [21:07:36] no because LeslieCarr just installed the SSL certificate for stats.wikimedia.org [21:07:38] mmmm [21:07:44] yeah but that's just a cert [21:07:52] looks like an ssl server misconfig on stat1 [21:07:54] no i just put it in the repo [21:07:57] k [21:07:59] not actually installed it anywhere [21:08:09] ironically i hadn't puppet merged yet [21:08:13] so it wasn't even in the repo [21:08:33] hashar: is https://gerrit.wikimedia.org/r/#/c/51668/5 still needed ? [21:09:02] ja drdee, we'll have to do some ssl mods there [21:09:10] aight [21:09:15] looks like stat1001's apache is just set up to host metrics from 443 [21:09:18] so stats.wm won't work from there [21:09:35] will probably have to put an ssl proxy there or something [21:10:06] (CR) Lcarr: [ C: 2] "doesn't really do anything yet , will be a good start :)" [operations/puppet] - https://gerrit.wikimedia.org/r/76084 (owner: MaxSem) [21:10:22] LeslieCarr, thanks:) [21:12:16] LeslieCarr: yup [21:12:43] LeslieCarr: we do not have syslog on the beta project, for some reason the hack is not welcomed. I need to get the issue raised again :/ [21:13:23] I got too lazy to open up the can of worms again :-] [21:13:46] if I get highly motivated, I might well phase out syslog-ng in favor of rsyslog which is used everywhere else. [21:14:57] ok [21:15:34] (PS7) Lcarr: beta: syslog-ng on deployment-bastion host [operations/puppet] - https://gerrit.wikimedia.org/r/51668 (owner: Hashar) [21:16:16] (CR) Lcarr: [ C: 2 V: 2] "approved with the caveat that hashar will be working on switching beta labs to rsyslog" [operations/puppet] - https://gerrit.wikimedia.org/r/51668 (owner: Hashar) [21:16:19] there you go :p [21:16:52] LeslieCarr hashar +1 thank you [21:17:15] oh my god [21:17:17] :-] [21:17:31] we are going to get syslog traces on beta \O/ [21:17:57] LeslieCarr: make sure puppet does not do something weird on nfs1/nfs2 though [21:18:32] checking a puppet run now [21:18:41] same on the instance [21:19:07] chrismcmahon: mark has set up a staging instance on beta to test out the varnish text cache :-] [21:19:08] looks fine on nfs1 [21:19:25] chrismcmahon: or maybe it is the mobile cache, i can't remember.
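[Editor's note: MaxSem's actual check lives in change 76084, merged above at 21:10; this is only an illustrative sketch of the idea it implements: alert when a host is missing from the dsh group files its role says it should be in. The group list and the fqdn assumption are hypothetical:]

    #!/bin/bash
    # Icinga-style plugin: CRITICAL (exit 2) if this host is absent from a group.
    host=$(hostname -f)                      # assumes group files list fqdns
    for group in mediawiki-installation; do  # per-role list, e.g. fed in by puppet
        if ! grep -qx "$host" "/etc/dsh/group/$group"; then
            echo "CRITICAL: $host missing from dsh group $group"
            exit 2
        fi
    done
    echo "OK: $host present in all required dsh groups"
    exit 0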
[21:20:18] da bug report is https://bugzilla.wikimedia.org/show_bug.cgi?id=36748 :-] [21:21:31] PROBLEM - DPKG on nfs2 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:21:58] (PS1) Jforrester: Enable anonymous use of VisualEditor on de/es/fr/he/it/pl/ru/sv [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/76199 [21:22:01] that above is me upgrading packages [21:22:08] nfs2 had like 5 million security updates [21:22:31] RECOVERY - DPKG on nfs2 is OK: All packages OK [21:25:08] (CR) Alex Monk: "Might be a good idea to hold off on de for now - see wikitech-l response to this week's deployment highlights." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/76199 (owner: Jforrester) [21:25:11] PROBLEM - DPKG on cp1062 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:25:11] PROBLEM - DPKG on cp1061 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:25:21] PROBLEM - Varnish HTTP upload-backend on cp1061 is CRITICAL: Connection refused [21:25:22] PROBLEM - Varnish HTTP upload-backend on cp1062 is CRITICAL: Connection refused [21:26:21] RECOVERY - Varnish HTTP upload-backend on cp1061 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.014 second response time [21:26:22] RECOVERY - Varnish HTTP upload-backend on cp1062 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.005 second response time [21:26:26] LeslieCarr: I got syslog messages \O/ [21:26:36] huzzah! [21:27:14] (CR) MZMcBride: "I just came here to say the same as Alex. Link, for reference: ." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/76199 (owner: Jforrester) [21:28:43] LeslieCarr: thank you very much [21:29:57] yw hashar :) [21:30:15] can you believe I had to use tcpdump to stream syslog messages being sent there ? :-] [21:30:19] hehe [21:30:21] PROBLEM - DPKG on mw1153 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:32:11] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection refused [21:33:21] RECOVERY - DPKG on mw1153 is OK: All packages OK [21:34:12] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.092 second response time [21:37:07] RECOVERY - DPKG on cp1051 is OK: All packages OK [21:37:27] RECOVERY - Varnish HTTP upload-backend on cp1051 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.002 second response time [21:43:10] (PS2) Dzahn: remove non-root users from kaulen since it's not used as download server [operations/puppet] - https://gerrit.wikimedia.org/r/72666 [21:44:38] (PS3) Dzahn: remove non-root users from kaulen since it's not used as download server [operations/puppet] - https://gerrit.wikimedia.org/r/72666 [21:47:04] (PS4) Dzahn: remove non-root users from kaulen since it's not used as download server [operations/puppet] - https://gerrit.wikimedia.org/r/72666 [21:47:08] dang formatting [21:48:06] ;) [21:48:57] mutante: if you could find some misc box to host the mediawiki tar ball, that would be nice. Cant remember the RT ticket though. [21:49:23] ah https://rt.wikimedia.org/Ticket/Display.html?id=1839 :-] [21:49:41] hashar: https://rt.wikimedia.org/Ticket/Display.html?id=1839 [21:50:10] i guess it could be zirconium [21:50:58] whatever works [21:50:58] planet + etherpad + ..download ? [21:51:00] RobH: [21:51:30] uhh, dont we host a ton of large files for download? [21:51:32] we would need a way to upload the tar balls there from the jenkins hosts. either via rsync or scp :) [21:51:47] RobH: We certainly do at dumps.wikimedia.org. :-) [21:51:59] a misc system isnt good for that. 
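[Editor's note: the kind of one-liner hashar presumably means at 21:30:15: syslog is clear-text UDP on port 514, so the messages can be watched on the wire. The interface name is a guess:]

    tcpdump -n -A -i eth0 udp port 514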
[21:52:05] it doesn't have the long term disk capacity. [21:52:42] is it on nfs now i suppose and thats why we're asking? [21:53:25] no idea, but we would need a place where platform people and jenkins would be able to write to [21:53:30] also putting it on a misc server, over keeping on datasets, is the same solution [21:53:36] and get the dir bound to download.mediawiki.org [21:53:39] you have something where you require manual intervention [21:54:06] so a puppet class has to get written that will introduce the directory structure and give the appropriate group permissions for that... [21:54:14] and it could still live wherever it lives now no? [21:54:17] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:54:29] (or is its current location non ideal for disk space reasons, dataset2?) [21:55:00] I have no idea :/ [21:55:06] and i dont have that many misc left in eqiad [21:55:09] in fact i have less than 5 right now [21:55:18] so i dont wanna give one of those limited servers to a temp solution [21:55:36] we're shipping more up from tampa to ashburn, but it'll be another week or so before they are racked and ready [21:55:39] (fyi ;) [21:56:07] I'll update the ticket with the requirements I need answered to assign a misc host [21:56:07] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [21:56:08] or any host [21:56:18] I think one of the ideas was to not give us access on dataset2 [21:56:18] hashar: i dont expect you to know all the answers, no worries ;] [21:56:57] Well, it may be in fact that dataset2 doesn't have the space needed for long term either. [21:57:09] yea, the idea was to not have it on dataset [21:57:12] That being said, its a fairly low cpu bound, its just lots of space for storage [21:57:22] so we may be able to use incoming misc server, and purchase 2tb disks [21:57:25] I am not even sure it requires that much space [21:57:32] a couple GB would probably be enough [21:57:32] is it that much? [21:57:34] well, that info needs to be on ticket. [21:57:46] one would have to look at wherever the mediawiki tar balls are hosted [21:57:49] cuz no tech specs on requirements = no way im giving out a server ;] [21:58:13] we have in the past had a habit of overallocating servers to projects that dont need even half the horsepower [21:58:19] http://dumps.wikimedia.org/mediawiki/ [21:58:22] trying to prevent that in future [21:59:00] im not gonna try to figure it out, thats for the folks requesting the server (sorry ;) [21:59:01] (PS1) MaxSem: Whitelist our IPv6 range [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/76205 [21:59:07] (PS2) Ottomata: Puppetizing HA NameNode via Quorum Based JournalNode. [operations/puppet/cdh4] - https://gerrit.wikimedia.org/r/76018 [22:00:06] RobH: could you possibly update the RT 1839 ? like: no way we give a misc if ops has no clue about the needed spec (especially disk space required). [22:00:10] would be nice [22:00:13] hashar: RobH 1.3G [22:00:14] im updating now [22:00:22] great [22:00:44] mutante: thats all of it in total? [22:00:58] and is it trending larger or whats projected space requirements? [22:01:12] it will grow with every mediawiki release [22:01:23] that is all below ./mediawiki/ [22:01:23] are those releases growing themselves much? [22:01:26] so the actual mediawiki tarballs [22:01:27] this is stuff that should be in the ticket.
[22:01:42] so when we get asked why we allocated space on something in 9 months [22:01:50] we wont all be like 'it seemed like a great idea in irc chat' [22:01:51] ;] [22:02:03] already added a comment [22:02:03] heh [22:02:36] 117M 1.16 [22:02:36] 117M 1.17 [22:02:36] 154M 1.18 [22:02:36] 198M 1.19 [22:02:36] 141M 1.20 [22:02:38] 224M 1.21 [22:02:43] yo mutante; lately i notice that I cannot view more and more RT tickets, i know the topics and this is never security related but more like access requests (for qchris) or the SSL certificate for stats.wikimedia.org; can I get a little bit more privilege so i can see these types of tickets as well? [22:02:47] so much for the trend .. 1.20 was kind of small? [22:02:57] or less releases? [22:03:08] mutante: yea one of the mediawiki devs who control that needs to chime in i think [22:03:38] cuz they have some idea (i hope) of how much space will be needed as each release grows itself in size, in addition to keeping historic copies of each release ever. [22:03:51] i hope this doesnt sound dickish, im not trying to block shit [22:03:54] =[ [22:04:10] na [22:04:17] (cuz like i said, we'll have the server to allocate in a couple weeks or less) [22:04:28] you have expectations, want to make sure the solution will be sustainable for the years to come [22:04:35] so there is nothing to complain about :-] [22:04:39] it is actually most welcome. [22:04:51] drdee_: access request tickets work by giving the role of the requestor the permissions, so you want to be added as a requestor, got some examples for me? query if you like [22:05:20] anyway, too late for me, I am heading to sleep. Thanks for your time and enjoy the week-end! [22:05:26] cool, im summarizing our discussion and findings (plus what daniel already added) to the ticket [22:05:44] so we can document it and get answers, and we'll get something figured out so releasing the software isnt a root level requirement =] [22:05:53] cuz yea, that sucks for devs. [22:06:25] RobH: cool, thanks [22:06:43] btw, the actual title of the ticket, i should rename it [22:07:11] what sucks for dev is having an unreliable infrastructure. If you need more inputs to make it reliable, I guess we will be more than happy to provide said input [22:07:11] download.mediawiki.org, i just made it a redirect until this is set up [22:07:21] one of the problems is that we usually have no clue what could be needed :-] [22:07:40] I am pretty sure nobody wondered how much disk space would need to be provisioned for the next few years hehe [22:07:44] I never thought about it. [22:08:36] hashar: and if thats the answer, thats cool [22:08:42] but i'd be remiss to not ask the question =] [22:08:59] cuz if we guess wrong, i wanna be able to say 'yes, but it was an actual guess, not a bad allocation on the part of ops' [22:09:10] cuz we've had both in past (like every company) [22:10:58] :-] [22:11:25] (CR) Dzahn: [ C: 2] add softwarewikipedia.org and .com , just like existing .net [operations/apache-config] - https://gerrit.wikimedia.org/r/76195 (owner: Dzahn) [22:12:13] (CR) Dzahn: [ C: 2] add wikimediacommons.pt and merge wikimediacommons.co.uk into existing regex to simplify. [operations/apache-config] - https://gerrit.wikimedia.org/r/76191 (owner: Dzahn) [22:13:08] mutante: zirconium appears to have two 1tb disks [22:13:21] would need to grow its lvm cuz its tiny now [22:13:25] but could handle this i think....
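[Editor's note: a quick back-of-the-envelope on the per-release sizes mutante pasted at 22:02, supporting the conclusion that zirconium's disks are ample: roughly 160M per tarball directory and a couple of releases per year is well under 1G/year of growth:]

    printf '%s\n' 117 117 154 198 141 224 |
        awk '{ s += $1 } END { printf "%d MB over %d releases, ~%d MB each\n", s, NR, s/NR }'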
[22:13:59] its memory and cpu utilization are low enough [22:14:11] im guessing the inbound spikes on network are its planet fetches? [22:14:15] yea, we did that on purpose (tiny lvm) afair [22:14:42] i guess so, it shouldnt do much else [22:14:47] that would be once per hour nowadays [22:14:52] i think growing the lvm and putting it there is fine, someone has to write up the puppet stuff to create the vhost and special group that some folks belong to [22:14:57] after starting out with just once daily [22:15:02] you could even craft a dedicated volume for the mw tar balls to finely restrict the amount of disk space they are going to use :D [22:15:04] maybe different than wikidev, cuz that box may host other things like it [22:15:13] yea, separate volume [22:15:17] but wikidev is an ok stop gap [22:15:37] (as there arent shared services yet on that box, but there will be some day, this is just first one ;) [22:15:55] be bold, mediawiki-release group ? :-D [22:15:59] the other services on it should by definition not have private data [22:16:10] well, as long as people dont forget about etherpad [22:16:23] not being the place to put private stuff [22:16:44] it should probably be the same group that deploys mediawiki [22:16:59] providing the tarball is part of deploying kind of, isnt it [22:17:12] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:17:31] i dont think it is [22:17:43] not everyone who can deploy can deploy mediawiki tarballs [22:17:46] thats intentional i thought [22:17:53] (only release managers versus all deployers) [22:17:54] ? [22:17:58] because it used to be mixed with dataset [22:18:08] even outside of that i think its independent [22:18:14] and this would be about not having that anymore [22:18:29] its about a release manager signing off on the release [22:18:35] well, its a question for devs anyhow [22:18:38] we wont decide ;] [22:18:55] well, if you can install it on prod servers ... [22:18:58] yea, agree [22:19:12] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [22:21:14] i dont see us projecting space to go larger than what zirconium can handle, but i also have no idea about the roadmap for software development. [22:21:15] effectively it's the security team.. csteipp creates them and we upload for him by request [22:21:16] =] [22:21:39] so then not everyone who deploys to cluster for our projects should be allowed to push updates for tarballs live [22:21:49] mutante: ? [22:22:03] or you say effectively cuz of historic reasons regarding dataset2? [22:22:21] the latter [22:22:31] ahh, ok, then i understand what you mean.
[22:23:03] and it makes me think that maybe there should be a puppet group for members of security [22:23:10] indeed [22:23:18] but yea, it should be continued in a meeting [22:26:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:27:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.389 second response time [22:29:47] (PS1) Dzahn: add mw1041 to dsh group mediawiki-installation as requested per RT #5522 [operations/puppet] - https://gerrit.wikimedia.org/r/76209 [22:30:41] (CR) Dzahn: [ C: 2] add mw1041 to dsh group mediawiki-installation as requested per RT #5522 [operations/puppet] - https://gerrit.wikimedia.org/r/76209 (owner: Dzahn) [22:40:00] (CR) Dzahn: "http://wikimediacommons.co.uk" [operations/apache-config] - https://gerrit.wikimedia.org/r/76191 (owner: Dzahn) [22:54:15] (PS1) Dzahn: redirect softwarewikipedia.net/.com/.org to mediawiki [operations/apache-config] - https://gerrit.wikimedia.org/r/76213 [22:55:06] (CR) Dzahn: [ C: 2] redirect softwarewikipedia.net/.com/.org to mediawiki [operations/apache-config] - https://gerrit.wikimedia.org/r/76213 (owner: Dzahn) [22:59:29] PROBLEM - Puppet freshness on holmium is CRITICAL: No successful Puppet run in the last 10 hours [23:01:33] !log DNS update - removing zones that link to other links. use wikipedia.org not wikipedia.com [23:01:44] Logged the message, Master [23:04:58] mutante: :-) [23:06:10] twkozlowski: hah, yea, now i had to :p [23:07:07] !log activated http://wikimediacommons.pt/ [23:07:19] Logged the message, Master [23:10:50] !log activated softwarewikipedia.com/.org [23:11:02] Logged the message, Master [23:21:30] (PS1) Lcarr: removing andrew's old key [operations/puppet] - https://gerrit.wikimedia.org/r/76216 [23:22:13] (CR) Lcarr: [ C: 2] removing andrew's old key [operations/puppet] - https://gerrit.wikimedia.org/r/76216 (owner: Lcarr) [23:23:37] (CR) Lcarr: [ C: 2] remove non-root users from kaulen since it's not used as download server [operations/puppet] - https://gerrit.wikimedia.org/r/72666 (owner: Dzahn)