[00:03:34] New patchset: Reedy; "Bug 44460 - Create Wikiversity Korean" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47349
[00:03:49] Change merged: Tim Starling; [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/45599
[00:03:53] voluptuous?
[00:04:13] Change merged: Tim Starling; [operations/debs/nginx] (master) - https://gerrit.wikimedia.org/r/45598
[00:09:26] New patchset: Reedy; "Remove wgEnableUpload entries same as default" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47678
[00:10:22] New patchset: Reedy; "Remove wgEnableUpload entries same as default" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47678
[00:13:44] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47678
[00:14:14] Thanks Tim
[00:37:37] !log stopping mailman to scrub an entry out of archives, will restart shortly
[00:37:38] Logged the message, RobH
[00:38:20] RobH, one sec
[00:38:25] too late.
[00:38:27] why?
[00:38:39] did you do it following wikitech instructions and replace the text with something else
[00:38:44] instead of just deleting the email?
[00:38:49] im doing it now, via those yes
[00:38:51] why do you ask?
[00:39:10] good thanks, because doing it by deleting the email breaks the archives
[00:39:40] im just ripping out the address info the dude left
[00:39:46] and leaving the majority of the message intact anyhow
[00:40:28] mostly done, rebuilding mbox file now
[00:40:42] (shouldnt break as no actual full email was removed, headers all intact)
[00:41:09] yep
[00:41:15] so slow....
[00:41:22] wikitech-l was a huge ass list.
[00:41:26] was/is/whatever
[00:41:35] New patchset: Andrew Bogott; "In mediawiki::singlenode use a more modest memcached size." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47680
[00:41:52] I thought you could rebulid just for a particular list?
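The point about replacing text instead of deleting the email can be sketched in Python. This is a minimal illustration, not the wikitech procedure: it assumes a raw mbox whose message bodies escape a leading `From ` as `>From `, and the function name `scrub_mbox_text` and the `[redacted]` placeholder are hypothetical. It redacts a pattern only in message bodies, copies separator and header lines unchanged, and never drops a message, so the number and order of messages the archive is rebuilt from stay the same (presumably why deleting a whole email breaks the archives).

```python
import re


def scrub_mbox_text(text, pattern, replacement="[redacted]"):
    """Redact `pattern` in the bodies of a raw mbox string.

    Assumes bodies escape lines starting with "From " as ">From ", so any
    line starting with "From " is a message separator. Separators and
    headers are copied unchanged and no message is removed.
    """
    regex = re.compile(pattern)
    out = []
    in_headers = False
    for line in text.splitlines(keepends=True):
        if line.startswith("From "):
            # Separator line: a new message begins and its headers follow.
            in_headers = True
            out.append(line)
        elif in_headers:
            # The first truly empty line ends the header block.
            if line.rstrip("\r\n") == "":
                in_headers = False
            out.append(line)
        else:
            # Body line: redact matches but keep the line itself.
            out.append(regex.sub(replacement, line))
    return "".join(out)
```

Because only body text changes, the line count and the message separators of the output match the input; the archive still has to be rebuilt afterwards, as happens in the log.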
[00:41:56] !log done with data removal on mailing list server, rebuilding mbox
[00:41:56] Logged the message, RobH
[00:42:02] its just this list.
[00:42:04] its a huge list
[00:42:13] up to nov 2012, almost to present
[00:42:51] !log mailing lists returned to normal
[00:42:51] Logged the message, RobH
[00:43:02] New patchset: Andrew Bogott; "In mediawiki::singlenode use a more modest memcached size." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47680
[00:43:40] Thehelpfulone: did daniel say to reassign rt 4478 to him?
[00:43:40] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47680
[00:43:53] cuz if not, dont go assigning them on a whim, cuz we have an RT triage person every week to do that
[00:44:00] otherwise it will sit forever, ie: daniel is at conference now
[00:44:08] so dispatching tickets to him isnt great.
[00:44:10] nah I just did it because he's always done it - but yeah fair enough
[00:44:21] are you on duty this week (topic?)
[00:44:25] andrew otto
[00:44:26] I thought you were last week?
[00:44:36] yea, i just happen to watch all the RT queues as they come in
[00:44:42] as RT admin i get every queue every update.
[00:45:09] ah fun
[00:46:31] that isnt the word i'd use ;P
[00:46:43] so yea, im not saying dont dispatch tickets, but you wanna find out with the person before you do
[00:46:54] ie: the rt triage person won't assign someone a ticket without touching base with said person
[00:47:06] so you will wanna follow same procedure if possible
[00:47:37] (or it may sit without notice ;_; )
[00:50:03] heh fair enough
[00:50:20] MW config change deployment should be working, only Apache config change deployment is still broken
[00:50:36] hah, wrong channel, sorry
[00:51:10] yet i saw it anyhow
[00:51:13] so it counts
[01:32:19] RECOVERY - Puppet freshness on mw1128 is OK: puppet ran at Wed Feb 6 01:31:47 UTC 2013
[01:36:39] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[01:58:38] New patchset: J; "add cgroup to limit memory of sub processes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40784
[02:04:01] New patchset: Reedy; "Rename $wmgVectorEditSectionLinks to $wmgVectorSectionEditLinks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47690
[02:04:26] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47690
[02:04:33] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[02:05:00] !log reedy synchronized wmf-config/
[02:05:03] Logged the message, Master
[02:28:06] !log LocalisationUpdate completed (1.21wmf8) at Wed Feb 6 02:28:05 UTC 2013
[02:28:08] Logged the message, Master
[02:51:37] !log LocalisationUpdate completed (1.21wmf9) at Wed Feb 6 02:51:36 UTC 2013
[02:51:38] Logged the message, Master
[05:49:52] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 195 seconds
[05:51:40] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds
[07:03:13] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[07:11:19] morning
[08:00:27] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours
[08:00:27] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[08:00:28] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[08:00:28] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[08:00:28] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours
[08:02:24] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours
[08:41:24] PROBLEM - Puppet freshness on cp3020 is CRITICAL: Puppet has not run in the last 10 hours
[09:57:00] New patchset: Hashar; "cleanout testswarm from the manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47665
[09:57:51] New review: Hashar; "PS2 make this change independent for easier merging." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/47665
[09:59:08] paravoid: hi :-] If you are around I did my first "per software" module with https://gerrit.wikimedia.org/r/#/c/47665/ :-]
[09:59:39] that clean out the awful manifests/misc/contint.pp of anything related to the "testswarm" software and put the remaining part we are interested in in a testswarm module :-]
[10:10:07] New review: Hashar; "The idea was merely to let me execute mw-update-l10n on my Mac laptop for debugging purpose. I am p..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/46907
[10:21:37] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 200 seconds
[10:22:04] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 209 seconds
[10:51:38] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds
[10:52:05] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[11:05:55] Ryan_Lane: what did you remove? libpam-ldapd?
[11:06:00] yes
[11:06:15] which should also purge the ldap config from the pam files
[11:06:16] libnss-ldapd Recommends libpam-ldapd
[11:06:24] which means that it gets installed by default
[11:06:30] -_
[11:06:30] err
[11:06:32] -_-
[11:06:41] so this needs to be ensure => absent in puppet
[11:07:03] is this something new?
[11:07:09] virt0 doesn't have it
[11:07:15] neither does virt1000
[11:07:18] or mchenry
[11:08:22] either way, yeah, it needs to be set absent
[11:08:49] The following NEW packages will be installed: ldap-utils libnss-ldapd libpam-ldapd nscd nslcd
[11:08:56] -_-
[11:08:56] that's what I get for apt-get install libnss-ldapd
[11:09:21] virt0/virt1000/mchenry don't have libnss-ldapd installed
[11:09:22] well, that explains that
[11:09:25] ah
[11:09:31] * Ryan_Lane hates recommended packages
[11:09:34] as far as I can see
[11:09:51] installing recommends by default can be disabled
[11:09:53] but I don't think we should
[11:09:56] it's usually a good idea
[11:09:59] sometimes it isn't :)
[11:10:03] rightr
[11:10:07] *right
[11:10:15] from what I've read it'll break ubuntu, generally
[11:10:39] oh, I don't know about that
[11:10:42] it works in Debian
[11:10:48] I have it on some systems at least
[11:16:00] man. I *really* despise that the package changes the pam config, too
[11:16:04] what the fuck
[11:19:30] New patchset: Ryan Lane; "If pam ldap isn't enabled, then ensure it's absent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47722
[11:20:04] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47722
[11:22:20] New patchset: Hashar; "(bug 44061) initial release" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/44408
[11:22:35] New review: Hashar; "typo in debian/control" [operations/debs/python-voluptuous] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44408
[11:24:47] hashar: give me a few, then I'll be yours :)
[11:24:50] definitely today
[11:25:17] good :-]
[11:25:34] trying to figure out the commands to get the package to build manually
[11:25:34] :-]
[11:27:19] dpkg-buildpackage -uc -us is the canonical one
[11:27:24] git-buildpackage for git
[11:27:46] I came up with: uscan --verbose --rename --download-current-version && dpkg-buildpackage -rfakeroot -us -uc -b
[11:28:41] that's okay
[11:28:44] drop the -b though
[11:28:53] I always prefer including source packages to our apt too
[11:29:18] it's also necessary from a licensing perspective for GPLed binaries
[11:30:38] ultimately I would like Jenkins to build packages on each change set / after merge
[11:38:21] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[11:41:30] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 181 seconds
[11:41:30] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 181 seconds
[11:42:03] New patchset: Hashar; "gallium blessed with misc::package-builder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47725
[11:46:11] New patchset: Hashar; "(bug 44061) initial release" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/44408
[11:47:13] New review: Hashar; "PS8 debian/copyright now points to /usr/share/common-licenses/GPL-2 to make lintian happy" [operations/debs/python-voluptuous] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44408
[11:49:04] bah empty binary package
[11:52:15] New patchset: Silke Meyer; "Added documentation to the Wikidata roles" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47726
[12:05:39] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[12:24:42] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[12:25:00] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[12:43:04] New patchset: Hashar; "gallium blessed with misc::package-builder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47725
[12:43:17] New review: Hashar; "fixed up space / tabs" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/47725
[12:46:13] New review: Hashar; "works for me, pending mark approval." [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/47567
[12:48:06] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 231 seconds
[12:48:24] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 240 seconds
[12:49:50] * paravoid is having lunch while watching Ceph talks :)
[12:59:55] paravoid: while eating, could you potentially merge in https://gerrit.wikimedia.org/r/47725 :-] that get misc::package-builder on gallium
[13:00:02] so jenkins can build / lint deb packages eventually
[13:07:14] why do you want to build packages with jenkins?
[13:09:29] to run the linting there and have them build automatically instead of mannually ? :-D
[13:09:53] then people can send their patchset, receive a temp .deb as a result and test it out in labs :-]
[13:11:55] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47726
[13:12:44] New review: Faidon; "Why do we configure MediaWiki via puppet? We don't generally do that, please explain why Wikidata is..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/47585
[13:13:11] forgot to eat myself damn
[13:13:11] thanks for the reminder
[13:13:31] http://commons.wikimedia.org/wiki/File:2006_sardines_can_open.jpg miam
[13:14:00] that's... scary
[13:14:07] the automated building of debs
[13:14:42] well this way we can have jenkins report about lintian failure in Gerrit
[13:15:03] that is really the only things I wanted to achieve
[13:19:05] New review: Faidon; "See inline for a few comments. Additional to that, debian/python-jsonschema.* and debian/python-json..." [operations/debs/python-jsonschema] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/47662
[13:21:51] New review: Faidon; "Any reason to have a separate class for "systemuser" and reference that directly? Maybe just put it ..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/47665
[13:22:01] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 203 seconds
[13:22:18] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 208 seconds
[13:22:40] <^demon> paravoid: Speaking of debs...I think some hands-on training in how to build debs would be useful (and I suspect I'm not alone).
[13:22:51] <^demon> I've tried reading the docs before but they're not very user-friendly.
[13:22:57] <^demon> (various docs, for that matter)
[13:22:59] maybe we can set up a workshop while we are all in SF ?
[13:23:33] I guess you're right
[13:23:43] although we have a pretty busy schedule planned for those days
[13:23:50] how long are you staying?
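The libnss-ldapd exchange above turns on the difference between Depends and Recommends. The toy resolver below is a heavily simplified sketch, not how apt actually resolves packages: `install_set` and `TOY_INDEX` are hypothetical, and only the "libnss-ldapd Recommends libpam-ldapd" edge comes from the log (the nslcd Depends edge is an assumption for illustration). It shows why the recommended package lands on the system unless recommends are turned off, which apt allows per command with `apt-get install --no-install-recommends` or globally with `APT::Install-Recommends "false";`.

```python
from collections import deque

# Toy package index. Only "libnss-ldapd Recommends libpam-ldapd" is taken
# from the log; the "nslcd" Depends edge is an assumption for illustration.
TOY_INDEX = {
    "libnss-ldapd": {"depends": ["nslcd"], "recommends": ["libpam-ldapd"]},
    "nslcd": {"depends": [], "recommends": []},
    "libpam-ldapd": {"depends": [], "recommends": []},
}


def install_set(root, index, with_recommends=True):
    """Return the packages pulled in by installing `root` (breadth-first).

    Depends are always followed; Recommends only when `with_recommends`
    is true, a very simplified model of apt's default behaviour.
    """
    seen = {root}
    queue = deque([root])
    while queue:
        fields = index.get(queue.popleft(), {})
        wanted = list(fields.get("depends", []))
        if with_recommends:
            wanted += fields.get("recommends", [])
        for pkg in wanted:
            if pkg not in seen:
                seen.add(pkg)
                queue.append(pkg)
    return seen
```

The change merged in the log takes the other route: recommends stay enabled and puppet sets the package to `ensure => absent` when pam ldap is not wanted.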
[13:23:51] <^demon> I know, and it doesn't have to be that week.
[13:23:51] but it seems it might be needed, that's true
[13:24:17] I'll be there for three weeks, starting from the next one
[13:24:19] I am there from Sunday 24th feb till Sat 9th march
[13:24:50] oh heh
[13:25:00] <^demon> I'm on the same schedule as hashar.
[13:25:31] until this happens though, feel free to ask me and/or put me as a reviewer
[13:25:41] and I'll happily provide feedback
[13:26:07] well we better schedule that in advance if we want it to happens :-]
[13:26:30] so
[13:26:36] I just reviewed two different debs
[13:26:42] and they apparently are for the exact same purpose
[13:26:42] cause I suspect we will all be very busy and that might be hard to all gather at some place
[13:26:57] https://gerrit.wikimedia.org/r/#/c/44408/
[13:26:59] https://gerrit.wikimedia.org/r/#/c/47662/
[13:27:13] oh
[13:27:19] ori is packaging too :-]
[13:28:23] which highlight how I skipped "git-buildpackage" from your review
[13:28:24] doh
[13:28:29] need to find out the doc for that one
[13:30:02] New review: Hashar; "in site.pp we have:" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/47665
[13:30:06] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds
[13:30:24] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds
[13:30:38] New review: Faidon; "This has nothing to do with packaging but also on the Gerrit queue is Antoine's packaging of Voluptu..." [operations/debs/python-jsonschema] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/47662
[13:31:51] hmmm? system user with /bin/bash for a shell?
[13:32:03] is that really needed here?
[13:32:06] (sometimes it is)
[13:32:10] cant remember
[13:32:17] i merely copy pasted it
[13:32:40] we can move it back to sillyshell / /bin/false
[13:32:48] and put back bash later on if that is needed
[13:32:54] /bin/false is fine
[13:32:58] that or a FIXME
[13:33:00] amending
[13:33:18] sorry, I guess I'm too strict of a reviewer... :)
[13:33:22] na it is ok
[13:33:32] strictness is cool :-]
[13:33:37] that ensure we don't produce crap
[13:33:46] considering I found a few vulnerabilities today I'm not feeling bad
[13:33:48] <^demon> Don't use sillyshell for anything except svn.
[13:33:54] <^demon> It's not meant for anything but svn.
[13:34:06] nod
[13:34:12] I just read sillyshell today
[13:34:34] <^demon> sillyshell is silly ;-)
[13:34:36] New patchset: Hashar; "cleanout testswarm from the manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47665
[13:34:55] New review: Hashar; "cleanup: made the shell /bin/false" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/47665
[13:35:02] paravoid: ^^^^
[13:35:22] New review: Faidon; "That's a fair reason. Maybe we can simplify it when it gets complete." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/47665
[13:35:24] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47665
[13:35:45] goood
[13:36:03] contint.pp is slightly lighter now :-)
[13:36:32] Ryan_Lane: merging libpam-ldapd on sockpuppet btw
[13:37:30] wanna look at my other puppet changes? :-]
[13:38:14] Change abandoned: Hashar; "yeah this is being split in smaller modules." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43429
[13:41:00] yes
[13:41:02] which ones?
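The /bin/bash versus /bin/false question above can be turned into a quick audit. This is a minimal sketch, assuming `/etc/passwd`-format input and the Debian convention that UIDs below 1000 belong to system accounts; the function name, the shell list and the sample entries used with it are hypothetical. Accounts that deliberately keep a shell, such as the jenkins user that people `sudo -s -u` into later in the log, go in `allow`.

```python
# Shells treated as "no interactive login" (assumed list; adjust per distro).
NON_LOGIN_SHELLS = {"/bin/false", "/usr/sbin/nologin", "/sbin/nologin"}


def system_users_with_shell(passwd_text, first_regular_uid=1000, allow=("root",)):
    """Return system accounts (uid < first_regular_uid) that have a login shell.

    `allow` lists accounts that are expected to keep a shell.
    """
    flagged = []
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # passwd format: name:password:uid:gid:gecos:home:shell
        name, _pw, uid, _gid, _gecos, _home, shell = line.split(":")
        if int(uid) >= first_regular_uid or name in allow:
            continue
        if shell not in NON_LOGIN_SHELLS:
            flagged.append(name)
    return flagged
```

A flagged account is not necessarily wrong; as the log shows, the answer may be "/bin/false is fine" or a FIXME explaining why the shell is needed.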
[13:44:00] so yesterday
[13:44:07] we rejected the wikimedia module :-]
[13:44:18] I started extracting stuff out of misc/contint.pp to some new modules
[13:44:38] Change abandoned: Hashar; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43420
[13:45:08] paravoid: https://gerrit.wikimedia.org/r/#/c/47663/ that moves ton of packages definitions to the new "contint" modules
[13:45:19] that merely list random packages we need for jenkins jobs
[13:45:23] such as php-* rake ..
[13:45:38] I haven't moved everything though cause some packages are a bit scary and I have no idea where to move them
[13:46:57] I think I will try to make several small changes, I guess that will be easier to review / apply
[13:48:35] New patchset: Demon; "Install git on all servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37247
[13:51:31] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds
[13:51:40] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
[13:52:18] <^demon> paravoid: I cleaned up that git change ^. Now puts it in $packages like you suggested.
[13:54:26] sec
[13:57:19] ^demon: that wasn't the reason I didn't merge it, but okay :)
[13:57:25] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37247
[13:58:15] New patchset: Tpt; "(bug 44032) Deploy Universal Language Selector to oldwikiource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47732
[13:58:27] hashar: so...
[13:58:50] paravoid: I am not sure if you prefer ton of small patchets
[13:59:04] paravoid: or just a massive refactoring one that split misc/contint.pp to various small modules
[14:02:35] wikidata seems down :(
[14:02:38] PROBLEM - Apache HTTP on mw1078 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:38] PROBLEM - Apache HTTP on mw1109 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:38] PROBLEM - Apache HTTP on mw1066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:47] PROBLEM - Apache HTTP on mw1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:47] PROBLEM - Apache HTTP on mw1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:47] PROBLEM - Apache HTTP on mw1045 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:47] PROBLEM - Apache HTTP on mw1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:47] PROBLEM - Apache HTTP on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:47] PROBLEM - Apache HTTP on mw1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:47] PROBLEM - Apache HTTP on mw1061 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:48] PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:48] PROBLEM - Apache HTTP on mw1069 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:49] PROBLEM - Apache HTTP on mw1059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:49] yeah
[14:02:53] well maybe more
[14:02:55] PROBLEM - Apache HTTP on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:56] PROBLEM - Apache HTTP on mw1080 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:56] PROBLEM - Apache HTTP on mw1111 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:56] PROBLEM - Apache HTTP on mw1065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:56] PROBLEM - Apache HTTP on mw1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:56] PROBLEM - Apache HTTP on mw1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:04] PROBLEM - Apache HTTP on mw1097 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:04] PROBLEM - Apache HTTP on mw1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:04] PROBLEM - Apache HTTP on mw1052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:04] PROBLEM - Apache HTTP on mw1068 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:04] PROBLEM - Apache HTTP on mw1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:05] PROBLEM - Apache HTTP on mw1102 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:05] PROBLEM - Apache HTTP on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:06] PROBLEM - Apache HTTP on mw1088 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:06] PROBLEM - Apache HTTP on mw1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:07] PROBLEM - Apache HTTP on mw1076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:07] PROBLEM - Apache HTTP on mw1048 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:08] PROBLEM - Apache HTTP on mw1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:08] PROBLEM - Apache HTTP on mw1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:13] PROBLEM - Apache HTTP on mw1103 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:22] panic!
[14:03:22] PROBLEM - Apache HTTP on mw1107 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:23] PROBLEM - Apache HTTP on mw1092 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:23] PROBLEM - Apache HTTP on mw1084 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:31] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:32] PROBLEM - Apache HTTP on mw1110 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:32] PROBLEM - Apache HTTP on mw1073 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:32] what the hell
[14:03:41] PROBLEM - Apache HTTP on mw1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:41] PROBLEM - Apache HTTP on mw1091 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:41] PROBLEM - Apache HTTP on mw1099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:41] PROBLEM - Apache HTTP on mw1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:49] PROBLEM - Apache HTTP on mw1105 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:49] PROBLEM - Apache HTTP on mw1051 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:58] PROBLEM - Apache HTTP on mw1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:04:07] PROBLEM - Apache HTTP on mw1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:04:07] PROBLEM - Apache HTTP on mw1062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:04:16] PROBLEM - Apache HTTP on mw1108 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:04:21] at least one of them has apaches running
[14:04:24] might just be overloadded
[14:04:25] RECOVERY - Apache HTTP on mw1033 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.096 second response time
[14:04:26] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.407 second response time
[14:04:26] RECOVERY - Apache HTTP on mw1025 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.072 second response time
[14:04:26] RECOVERY - Apache HTTP on mw1029 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.632 second response time
[14:04:34] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.109 second response time
[14:04:34] PROBLEM - Apache HTTP on mw1049 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:04:43] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.106 second response time
[14:04:44] RECOVERY - Apache HTTP on mw1068 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.110 second response time
[14:04:44] RECOVERY - Apache HTTP on mw1102 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.131 second response time
[14:04:44] RECOVERY - Apache HTTP on mw1036 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.379 second response time
[14:04:52] PROBLEM - Apache HTTP on mw1106 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:05:01] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.111 second response time
[14:05:01] RECOVERY - Apache HTTP on mw1092 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.123 second response time
[14:05:01] RECOVERY - Apache HTTP on mw1084 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.094 second response time
[14:05:10] RECOVERY - Apache HTTP on mw1110 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.099 second response time
[14:05:11] RECOVERY - Apache HTTP on mw1073 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.106 second response time
[14:05:20] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.825 second response time
[14:05:20] RECOVERY - Apache HTTP on mw1035 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.102 second response time
[14:05:20] RECOVERY - Apache HTTP on mw1091 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.102 second response time
[14:05:20] RECOVERY - Apache HTTP on mw1087 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.099 second response time
[14:05:20] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.111 second response time
[14:05:29] RECOVERY - Apache HTTP on mw1105 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.105 second response time
[14:05:29] RECOVERY - Apache HTTP on mw1051 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.506 second response time
[14:05:31] seems to be better now
[14:05:37] RECOVERY - Apache HTTP on mw1042 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.042 second response time
[14:06:04] RECOVERY - Apache HTTP on mw1078 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.101 second response time
[14:06:05] RECOVERY - Apache HTTP on mw1066 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.109 second response time
[14:06:13] RECOVERY - Apache HTTP on mw1059 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.103 second response time
[14:06:14] RECOVERY - Apache HTTP on mw1037 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.093 second response time
[14:06:14] RECOVERY - Apache HTTP on mw1045 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.099 second response time
[14:06:14] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.097 second response time
[14:06:14] RECOVERY - Apache HTTP on mw1049 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.115 second response time
[14:06:14] RECOVERY - Apache HTTP on mw1069 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.093 second response time
[14:06:14] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.855 second response time
[14:06:14] maybe a rack switch in eqiad
[14:06:15] RECOVERY - Apache HTTP on mw1061 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.742 second response time
[14:06:22] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.116 second response time
[14:06:22] RECOVERY - Apache HTTP on mw1057 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.116 second response time
[14:06:22] RECOVERY - Apache HTTP on mw1077 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.400 second response time
[14:06:31] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.110 second response time
[14:06:32] RECOVERY - Apache HTTP on mw1048 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.094 second response time
[14:06:32] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.112 second response time
[14:06:32] RECOVERY - Apache HTTP on mw1076 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.099 second response time
[14:06:32] RECOVERY - Apache HTTP on mw1032 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.103 second response time
[14:06:32] RECOVERY - Apache HTTP on mw1106 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.456 second response time
[14:06:32] RECOVERY - Apache HTTP on mw1088 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.509 second response time
[14:06:33] RECOVERY - Apache HTTP on mw1044 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.768 second response time
[14:06:33] RECOVERY - Apache HTTP on mw1060 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.440 second response time
[14:06:34] RECOVERY - Apache HTTP on mw1052 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.603 second response time
[14:07:34] RECOVERY - Apache HTTP on mw1062 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.108 second response time
[14:07:34] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.300 second response time
[14:07:43] RECOVERY - Apache HTTP on mw1108 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.721 second response time
[14:08:10] RECOVERY - Apache HTTP on mw1080 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.110 second response time
[14:08:29] RECOVERY - Apache HTTP on mw1103 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.111 second response time
[14:09:31] PROBLEM - Apache HTTP on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:09:58] RECOVERY - Apache HTTP on mw1111 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.194 second response time
[14:11:11] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.736 second response time
[14:14:02] New patchset: Silke Meyer; "Definition of a function that gets MW extensions with less code" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46809
[14:14:33] db1027's mysql was killed by OOM
[14:15:38] but that was 15' before the alerts, hmm
[14:20:49] paravoid: are you investigating that more or can we proceed on the puppet cleanup I have sent ? :-∆
[14:20:50] New review: Demon; "I really don't think this is necessary anymore. Can we just abandon this?" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/8120
[14:20:51] ∆∆∆∆
[14:20:52] ho
[14:37:10] hashar: go ahead :)
[14:37:26] just noticed you reviewed one of the changes ;-)
[14:37:33] refactoring
[14:44:33] New patchset: Hashar; "move contint packages under a submodule" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47663
[14:44:44] New patchset: Hashar; "Jenkins module created out of contint manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47664
[14:44:49] bah
[14:46:33] New patchset: Hashar; "move contint packages under a submodule" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47663
[14:47:21] New review: Hashar; "rebased" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/47663
[14:47:47] paravoid: updated https://gerrit.wikimedia.org/r/#/c/47663/1..3/modules/contint/manifests/packages.pp,unified
[14:48:07] New patchset: Hashar; "Jenkins module created out of contint manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47664
[14:50:28] what about the generic:: and misc:: includes?
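To put numbers on the Apache alert flood, a small parser over the bot's lines can pair each PROBLEM with the RECOVERY that clears it. This is a sketch, assuming the `[HH:MM:SS] PROBLEM - <check> on <host> is <STATE>:` layout seen in this log and that all lines fall on the same day; `outage_seconds` is a hypothetical helper, and it keys on host only, ignoring which check fired.

```python
import re
from datetime import datetime

# Matches e.g. "[14:02:47] PROBLEM - Apache HTTP on mw1033 is CRITICAL: ..."
ALERT = re.compile(
    r"\[(?P<ts>\d{2}:\d{2}:\d{2})\] (?P<kind>PROBLEM|RECOVERY) - "
    r"(?P<check>.+?) on (?P<host>\S+) is (?P<state>[A-Z]+):"
)


def outage_seconds(lines):
    """Map host -> seconds between a PROBLEM and the RECOVERY that clears it.

    Non-alert chat lines are skipped. If a host goes down more than once,
    the latest outage wins.
    """
    down_since = {}
    durations = {}
    for line in lines:
        m = ALERT.search(line)
        if not m:
            continue
        t = datetime.strptime(m.group("ts"), "%H:%M:%S")
        host = m.group("host")
        if m.group("kind") == "PROBLEM":
            # Keep the first PROBLEM timestamp of an ongoing outage.
            down_since.setdefault(host, t)
        elif host in down_since:
            durations[host] = (t - down_since.pop(host)).total_seconds()
    return durations
```

Fed the mw1033 and mw1026 lines above, this gives roughly a minute and a half of downtime each, which fits the "seems to be better now" remark a few minutes into the incident.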
[14:51:09] looking at the jenkins one too [14:51:17] this starts to feel much cleaner, doesn't it? [14:51:53] well the generic::packages::ant18 I have done that explicitly for others to use it if needed [14:51:59] I can clean it out in another chnage [14:52:12] the misc::irc::wikibugs::packages , I am not sure what you mean :-] [14:52:24] that is used by a job that validate wikibugs (a perl script) [14:52:30] which has its own puppet class [14:52:41] so instead of copy pasting the list of dependencies, I am just including them [14:52:47] save us from some code duplication [14:53:24] ideally we shouldn't have modules depending on manifests at all [14:53:36] until we get there, sure, we can do that [14:54:49] yeah in an ideal word :-] [14:55:01] hashar: for jenkins: use multiline definitions everywhere, don't use compression (i.e. multiple files in a file { } stanza) [14:55:06] I can move the wikibugs stuff to a wikibugs module :) [14:55:08] we haven't been consistent *at all* for those [14:55:17] but since you're cleaning up and adhering to the style guide more or less [14:55:26] let's do that too if you don't mind [14:55:47] oh and same thing about /bin/bash for jenkins [14:55:48] so one file{} per file ? [14:55:48] :) [14:55:58] http://docs.puppetlabs.com/guides/style_guide.html 9.4 Compression [14:56:11] bah [14:56:16] puppet-lint does not complain about it [14:58:10] mashing the jenkins change [14:58:39] New review: Faidon; "Ideally most of the stuff there could use other modules (like PHP, Node.js etc.). But this would do ..." 
[operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/47663 [14:59:05] \O/ [15:02:41] paravoid: bah you voted CR+2 on https://gerrit.wikimedia.org/r/#/c/47663/ but that does not merge it apparently :-] [15:02:56] New patchset: Hashar; "Jenkins module created out of contint manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47664 [15:03:20] I did that on purpose, I'll merge them together with Jenkins :) [15:03:38] New review: Hashar; "one file{} per file" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/47664 [15:04:47] paravoid: here is the jenkins one [15:04:51] there is still a lot of mess there [15:05:07] we have several files under /var/lib/jenkins but they don't belong to the 'jenkins' module [15:13:16] <^demon> paravoid: Another minor puppet change: https://gerrit.wikimedia.org/r/#/c/45786/ [15:16:49] hm, I pressed review a few mins ago, but gerrit-wm didn't say anything [15:17:46] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45786 [15:18:10] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47663 [15:19:07] New patchset: Hashar; "Jenkins module created out of contint manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47664 [15:19:28] paravoid: updated the jenkins change https://gerrit.wikimedia.org/r/#/c/47664/ [15:19:41] paravoid: we need jenkins user to have bash to be able to sudo as him [15:19:43] and get a shell [15:20:01] I do something like: sudo sudo -s -u jenkins [15:20:16] ok [15:20:18] what about groups? [15:20:22] groups::jenkins [15:20:33] yeah replied on that one, we need it in manifests/admins.pp [15:20:40] to add us (hashar, chad ..) in the jenkins group [15:20:55] class admins::jenkins { [15:21:55] the manifests are going to be very clean :-] [15:22:17] where? 
[15:22:21] I don't see it [15:22:30] the needing jenkins to add you in the group [15:22:37] manifests/admins.pp [15:22:45] class admins::jenkins { [15:22:45] include groups::jenkins [15:22:48] include accounts::hashar [15:22:51] and such [15:23:25] this two includes are separate from each other [15:23:33] what do I miss? [15:23:38] how are you guys added to that group? [15:24:26] hmm [15:24:36] maybe we are not :-] [15:24:46] $ id [15:24:47] uid=519(hashar) gid=500(wikidev) groups=500(wikidev),561(jenkins) [15:24:48] ah I am [15:25:14] ottomata: hey, around? [15:25:14] maybe we have been added manually ? [15:25:24] maybe [15:25:34] I can't remember honestly [15:25:36] but in any case, I don't see why groups::jenkins is needed anywhere [15:25:39] hm [15:25:44] maybe for gid uniqueness across [15:26:32] yup hi [15:26:44] I have added the jenkins madness with https://gerrit.wikimedia.org/r/#/c/3733/ [15:27:12] but that might not actually add the user to the group :/ [15:27:30] ottomata: so, switching to IRC to save us both some time :) [15:27:39] ottomata: I really lack background in Hadoop [15:27:50] but that thing you quoted assumes no authn at all, and we do LDAP [15:27:53] or do I miss something? [15:28:38] if the ldap thing works, then that will solve my problem completely [15:28:52] the blog post also referred to the fact that hadoop doesn't actually verify that the user connecting is who they say they are [15:28:53] so [15:29:02] if a user has sudo on a machine that can talk to the hadoop namenode [15:29:13] they can create and become a user account on that client machine called 'hdfs' [15:29:21] (or whatever) [15:29:26] paravoid: anything else missing on the jenkins change? 
https://gerrit.wikimedia.org/r/#/c/47664/ ;)D [15:29:34] and then sudo -u hdfs hadoop fs [15:29:49] (hdfs is the hadoop superuser) [15:29:51] so they could do [15:30:05] sudo -u hdfs hadoop fs -get /wmf/raw [15:30:13] so wait [15:30:18] which would download all the raw webrequest data, which currently has IPs in it [15:30:31] who will have sudo power to hdfs? [15:30:39] if it's just root/ops, that's okay [15:30:50] that's it [15:30:50] yeah [15:30:56] only people who would have power to sudo [15:30:57] or root [15:31:08] would third-parties researchers/volunteers able to sudo to hdfs? [15:31:09] but that would mean any machine [15:31:18] not unless they had sudo on a machine that could access analytics1010 [15:31:41] er [15:31:51] what do you mean? [15:31:51] afaik they don't, but there is an upcoming discussion about granting sudo powers more selectively, right? [15:31:57] *any* system they have root on? [15:31:57] ok [15:31:58] so [15:32:04] let's say stat1 [15:32:06] like e.g. a ceph node? [15:32:27] well [15:32:38] hashar: give me a sec, sorry :) [15:32:41] all hadoop does is look at the username that is running the hadoop command [15:32:55] hashar: the group thing troubles me, I need to do a bit more of a research [15:32:56] so if someone on stat1 has sudo powers [15:33:04] they could run [15:33:07] sudo useradd hdfs [15:33:07] paravoid: can't we fill a bug and sort it out later ? :-] [15:33:12] sudo -u hdfs hadoop fs ... [15:33:33] but that node has to be part of hadoop somehow? [15:33:42] it just has to be configured to talk to namenode [15:33:42] is there authn on the wire protocol or not? [15:34:00] node authentication, not user [15:34:05] sudo -u hdfs hadoop fs -get hdfs://analytics1010.eqiad.wmnet:8080/wmf/raw ... 
[15:34:07] (I think 8080) [15:34:32] no, i think it can be setup to use kerberos for that [15:34:39] or we could iptables it [15:35:10] so any random box across the infrastructure can decide it's a hadoop node and its root can decide it's a superuser on hadoop? [15:35:23] i believe so, yes [15:35:34] jesus [15:36:01] i don't know much more about this either, other than what that blog post and a few other things say [15:36:20] that post seems to say the way around that is to use kerberos [15:36:23] do we have data in /wmf/raw already? [15:36:25] yes [15:36:31] that's annoying [15:36:32] we want to anonymize it [15:36:39] but we've been told to hold off on that because it would take too much time [15:36:43] put private data first, check our security later [15:36:44] and we have to show value sooner rather than later [15:37:20] owned private data will show some value alright [15:37:30] sorry [15:37:34] I'm just annoyed [15:37:40] yeah, its cool, rightfully so [15:38:17] there are only a few hadoop ports [15:38:24] we could soonish setup up iptables rules [15:38:31] to only allow access to them from anlaytics nodes [15:38:40] actually, its only analytics1010 that matters [15:38:51] its the namenode and handles all access requests [15:39:10] https://ccp.cloudera.com/display/CDH4DOC/Configuring+Ports+for+CDH4 [15:40:33] which version of Hadoop do we use? [15:43:51] Change abandoned: Hashar; "will move it somewhere else" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29937 [15:44:47] oh [15:45:02] I have no idea where to move that php linting script now :-] [16:08:02] I am out [16:08:04] daughter time [16:08:06] and I am late [16:08:10] New patchset: Ottomata; "Bringing over analytics haproxy configuration from analytics puppet branch." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/47741 [16:08:13] New patchset: Hashar; "contint::website regroup apache + basic files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47742 [16:12:16] !log Dropped user_rights from enwiki [16:13:01] New patchset: Ottomata; "Bringing over analytics haproxy configuration from analytics puppet branch." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47741 [16:13:37] !log Dropped user_rights from dewiki and frwiki [16:15:50] paravoid (or someone?): [16:15:50] https://gerrit.wikimedia.org/r/#/c/47741/ [16:16:07] would appreciate a review on that soon, so I can repuppetize analytics1001 and bring it back online [16:19:19] i'm sure there are things there I should change, would like to hear opsen feedback [16:21:37] !log Dropped user_rights table from all wikis [16:23:26] it's obsolete now? I feel myself obsoleted:P [16:24:13] ah, user_rights vs. user_groups [16:25:49] Reedy: no morebots... [16:26:08] MaxSem: It only lived for MW 1.4 [16:26:52] I'm less obsolete than I think, apparently [16:27:14] * Nemo_bis defers upgrading of MaxSem [16:36:21] RECOVERY - Puppet freshness on analytics1001 is OK: puppet ran at Wed Feb 6 16:36:12 UTC 2013 [16:47:00] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [16:48:03] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [16:50:54] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 26.69 ms [16:54:12] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [16:56:18] RECOVERY - NTP on analytics1001 is OK: NTP OK: Offset -0.04787909985 secs [17:19:04] New patchset: Hashar; "contint::website regroups apache + basic files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47742 [17:19:29] New review: Hashar; "ps2 : typo in commit message."
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/47742 [17:23:08] New patchset: Cmjohnson; "Adding new dhcp entries for mw1161-mw1200 and updating mac for ms-be11" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47746 [17:27:12] RECOVERY - Puppet freshness on cp3020 is OK: puppet ran at Wed Feb 6 17:26:33 UTC 2013 [17:37:23] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47602 [17:42:34] New review: Silke Meyer; "Hm. For Wikidata, I can either not use puppet at all (which would have been the easiest way). Or I d..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/47585 [17:45:04] New review: Andrew Bogott; "Faidon, this is an elaboration of mediawiki::singlenode which is used for quick, simple labs deploym..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/47585 [17:45:07] !log testing out morebots [17:45:36] LeslieCarr: morebots? [17:45:47] did you move it? [17:45:51] or restart it? :) [17:45:56] didn't restart it yet [17:45:57] about to [17:46:01] ah. ok [17:46:05] since obviously not working [17:46:08] was confirming the not working [17:46:09] * Ryan_Lane nods [17:48:25] !log restarted morebots [17:48:27] Logged the message, Mistress of the network gear. [17:48:32] woot [17:48:39] I envy your personalized message [17:49:36] Ryan actually made that, i cracked up the first time i saw that [17:49:47] ok. there *must* be something wrong with opendj on virt1000 [17:50:52] http://etherpad.wmflabs.org/pad/p/ldap-speed [17:51:19] Search: 3183165 Avg: 448.941 <— wtf [17:51:45] >1000ms: 203845 (6%) [17:52:00] 6% of requests are over 1s? no fucking way. [17:59:28] ah. 
it seems that it got itself into some weird state and restarting the service fixed it [17:59:58] so, the speed issue isn't related to cross-data center issues, it's due to the service breaking [18:01:32] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [18:01:33] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [18:01:33] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [18:01:33] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [18:01:33] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [18:01:48] RobH: why do we still get alerts for msfe1002? [18:03:29] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [18:07:23] paravoid: ? i dunno, why you asking me =] [18:07:27] i thought you disabled it? [18:08:34] that's msfe1002, not ms-fe1002 [18:12:06] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47746 [18:19:51] Ryan_Lane, could you help me with a review? [18:19:52] https://gerrit.wikimedia.org/r/#/c/47741/ [18:23:18] ... [18:24:22] ffs [18:24:34] password hashes in your config [18:29:48] New review: Faidon; "Generic classes should be parameterized and their files templates rather than having hardcoded hostn..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/47741 [18:35:22] <^demon> Ryan_Lane: https://gerrit.wikimedia.org/r/#/c/47343/ workaround until upstream issue is fixed. [18:36:29] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47343 [18:41:35] !log authdns-update [18:41:36] Logged the message, RobH [18:42:35] <^demon> Ryan_Lane: Thanks. Is that merged on sockpuppet yet? 
[18:43:06] ^demon: yep [18:43:41] <^demon> !log running puppet on manganese, restarting gerrit [18:43:42] Logged the message, Master [18:44:32] cmjohnson1: So asw-c-eqiad isnt updated with the new misc servers (understandably) [18:44:40] So i need to know the ports for two of them that im trying to spin up now [18:45:05] wmf3568/neodymium in c4 and wmf3571/promethium in c7 [18:45:14] so 4/0/X and 7/0/X [18:45:39] now, if the other hpm SSDs are in the next two ports up from those (as they are the lowest in rack hpm ssd systems) [18:45:46] i can label those too with the asset tags for now in switch. [18:46:30] New patchset: Demon; "Remove custom hacks for search bar" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35562 [18:48:30] robh: yep...i have that information w/me...wait 1 [18:48:39] cool [18:50:46] New patchset: Andrew Bogott; "Remove some whitespace that was confounding the doc generator." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47750 [18:51:28] hrmm, seems parsoid really needs a lvs [18:51:36] poor celsus is the only caching server for it. [18:52:18] Ryan_Lane: you are aware that the virt pmtpa cluster isnt rendering updates in ganglia right? [18:52:36] RobH: so msfe1002? [18:52:40] yes it is [18:52:40] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Virtualization%2520cluster%2520pmtpa&tab=m&vn= [18:52:52] Ryan_Lane: all those are broken images [18:52:54] perhaps its just me then. [18:52:57] hm [18:52:58] robh..on c4 0-7 starting from bottom server up....and c8 17-24 bottom up [18:53:03] we previously agreed to decom msfe* (*not* ms-fe*) [18:53:04] work s for me [18:53:06] and I think you do those, no? [18:53:19] "we" as in you and me [18:53:31] someone is doing something bad: http://ganglia.wikimedia.org/latest/?c=Virtualization%20cluster%20pmtpa&h=virt7.pmtpa.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [18:54:42] cmjohnson1: Ok, got it, thanks! 
I'll update the switch for just the hpm ssd [18:54:47] as the others arent wired [18:54:59] paravoid: Ahh, yes, ok, so we have alerts for msfe servers and they shoudl be gone, yes? [18:55:14] yes [18:55:18] I did rename the servers and such, but i guess decom.pp didnt get the update, lemme check [18:55:25] ah [18:57:04] paravoid: thx for reminder of wtf was going on =] [18:57:06] fixing now [18:57:13] New patchset: RobH; "msfe1002 renamed back to element" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47751 [18:58:25] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47751 [18:59:52] paravoid: so its on decomissioning.pp, so when spence runs update it will stop alerting [19:00:03] great [19:00:21] I was basically worrying that the rename process was somewhere halt in the middle [19:00:31] so that's why I didn't add it to decom.pp myself without consulting you [19:00:50] apparently doing an apt-get update on all nodes at the same time is a bad call: http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=virt2.pmtpa.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1360177198&g=network_report&z=large&c=Virtualization%20cluster%20pmtpa [19:01:00] *all instances [19:04:13] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: closed, private and fishbowl wikis to 1.21wmf9 [19:04:13] Logged the message, Master [19:04:56] robh: ok to take mw1 down? not api or bits. [19:06:04] ? [19:06:09] uhh, steve is doing that? [19:06:14] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46809 [19:06:29] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47750 [19:06:32] cmjohnson1: Why? [19:06:33] bleh. 
all the virt boxes are in the middle of checking their raids [19:07:17] robh: there are rails from the old srv's underneath that won't come out...also has a dimm error that needs to be troubleshooting [19:07:54] cmjohnson1: Ok, but I want RT tickets for this stuff! (Se we can see whats happening all week on site) [19:08:26] i just noticed the DIMM error when I went to add new rails [19:08:29] but yea, if it has a dimm error, it can get shut down and troubleshot. but needs RT made and needs to be admin logged by sbernardin when he starts and completes the work [19:08:30] that will get a ticket [19:08:36] uhh, are you in tampa right now? [19:08:40] yes [19:08:50] i should really write this shit down. [19:08:59] you should ;-] [19:09:05] yea go for it [19:09:06] Lol [19:09:26] cmjohnson1: I guess I shouldnt expect those misc servers later today then eh? [19:09:28] ;p [19:09:43] probably not [19:09:45] LeslieCarr: So if you are about, when adding a port into a vlan, do i not need to remove from 'default' vlan? 
[19:09:55] cuz its not letting me do the delete member for it in 'default' [19:11:03] !log powerdowing mw1 to reseat DIMM B1 [19:11:04] Logged the message, Master [19:11:15] hah...spelling ^ [19:12:04] answered my own question [19:12:24] cmjohnson1: fyi: when adding new ports to vlans, if they are already in the 'default' then you dont have to delete from default [19:12:32] merely assigning and committing to new vlan will remove from default vlan [19:12:45] unlike normal vlans where you have to remove from old, then assign to new (afaik) [19:12:57] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: special and wikimedia wikis to 1.21mwf9 [19:12:58] Logged the message, Master [19:13:11] PROBLEM - Host mw1 is DOWN: PING CRITICAL - Packet loss = 100% [19:18:21] New patchset: RobH; "adding neodymium & promethium to dhcp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47752 [19:18:51] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikinews, wikivoyage and wikiversity to 1.21wmf9 [19:18:52] Logged the message, Master [19:19:17] <^demon> RobH: Got a minute? [19:19:34] New review: RobH; "make it so" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/47752 [19:19:35] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47752 [19:19:40] ^demon: sup? [19:19:49] <^demon> Could you add Antoine (amusso@wm.o) to the gerritadmin alias? [19:20:29] ^demon: done [19:20:38] <^demon> Thanks! 
[19:20:40] welcome [19:22:56] RECOVERY - Host mw1 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [19:28:49] cmjohnson1: Soooo, dell bios now has a misc section [19:28:52] with an 'asset tag' field [19:29:02] im totally updating with our asset tags as i come across them [19:29:11] you guys may want to start populating it [19:29:14] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikisource and wikiquote to 1.21wmf9 [19:29:14] k [19:29:15] Logged the message, Master [19:29:16] (i just noticed it) [19:29:27] i imagine then we can poll via drac and get the asset tag as well [19:29:29] which is kinda neat. [19:31:19] time to upgrade all the bioses? [19:31:27] that would be helpful...i wonder if we can add via racadm [19:33:29] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wiktionary and wikibooks to 1.21wmf9 [19:33:30] Logged the message, Master [19:34:21] New patchset: Reedy; "everything non 'pedia to 1.21wmf9" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47755 [19:36:03] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47755 [19:36:17] \O/ [19:41:33] New patchset: Reedy; "Bug 44460 - Create Wikiversity Korean" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47349 [19:47:19] !log authdns-update [19:47:20] Logged the message, RobH [19:49:20] New patchset: RobH; "neodymium & promethium updated in netboot.cfg" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47759 [19:50:37] New review: RobH; "\o/" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/47759 [19:50:38] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47759 [20:01:49] New patchset: Hashar; "contint::website regroups apache + basic files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47742 [20:02:16] New review: Hashar; "rebased on top of 
https://gerrit.wikimedia.org/r/#/c/47664/ "Jenkins module created out of contint m..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/47742 [20:02:47] hashar: can you please fix the sudo on beta? [20:03:07] not sure how I can fix what [20:03:08] :-D [20:03:45] I can sudo at least :-] [20:04:01] Ryan_Lane: can you past again the sudo failure message? [20:04:12] and how could I receive the cron mails myself? [20:04:29] i-00000390.pmtpa.wmflabs : Feb  6 18:25:45 : mwdeploy : 3 incorrect password attempts ; TTY=pts/0 ; PWD=/ ; USER=apache ; COMMAND=/usr/local/bin/mwscript mergeMessageFileList.php --wiki=aawiki --list-file=/home/wikipedia/common/wmf-config/extension-list [20:05:09] ah yeah hmm [20:06:14] mwdeploy@deployment-bastion:~$ mw-update-l10n [20:06:15] Updating ExtensionMessages-master.php... [20:06:16] [sudo] password for mwdeploy: [20:06:18] that is unfortunate [20:08:34] USER=apache .. [20:08:49] RobH: don't need to remove from default vlan [20:08:56] (switching to -labs) [20:08:57] default vlan is just when no other vlan exists [20:09:08] yep, found that out [20:09:12] thx though =] [20:09:29] :) [20:10:44] New patchset: Ori.livneh; "Initial commit of debian/" [operations/debs/python-jsonschema] (master) - https://gerrit.wikimedia.org/r/47662 [20:12:50] New review: Ori.livneh; "Re: Voluptuous -- there's "JSON Schema" as in "the draft JSON Schema standard" and "JSON schema" as ..." [operations/debs/python-jsonschema] (master) C: 0; - https://gerrit.wikimedia.org/r/47662 [20:13:50] robh: are you around or at lunch? [20:16:14] both! [20:16:20] eating at desk todfay, whats up? 
[20:16:35] New patchset: RobH; "fixing camera fqdn in row b/c" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47776 [20:17:07] i see the row b-c cameras hitting dhcp, i bet i can fix them remotely [20:17:12] \o/ [20:17:56] New review: RobH; "big brother may be watching, if this fixes the cameras" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/47776 [20:17:57] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47776 [20:18:38] cmjohnson1: whats up? [20:19:21] robh: so looking at the apaches in d2 and 15 are mc servers...what should we do for replacements [20:26:35] cmjohnson1: sory, had a phone call [20:26:51] cmjohnson1: d2-sdtpa? [20:26:59] yep [20:27:03] those are all srv* [20:27:09] where are you seeing memcached? [20:27:10] yes [20:27:15] mc.pp [20:28:07] where is mc.pp? [20:28:15] i have mc-sdtpa.php for the files in mc cluster [20:28:40] Are you about to decommissoin and wipe? [20:28:45] mediawiki-config git repo [20:28:58] got this from peter....to check b4 doing anything [20:29:06] i see mc-pmtpa.php [20:29:06] yes...we are ready to wipe [20:29:09] not mc.php [20:29:12] where exactly is it? [20:29:14] path. [20:29:58] cmjohnson1: I do not see what file you mean, I need the full file path so I can reference it directly. [20:30:16] go to /home/wikipedia/common/wmf-cong [20:30:29] wmf-config/mc.php [20:30:30] yep, im there on the git checkout. [20:31:16] cmjohnson1: there is no mc.pp there. [20:31:44] there is no mc.pp [20:31:53] there is mc-eqad, mc-labs, and mc-sdtpa.php [20:31:53] mc.php [20:32:01] mc.php~ is an old file [20:32:04] there is no mc.php [20:32:12] that is the old, invlaid configuration. [20:32:13] hrmm..okay..got that from peter to check [20:32:16] so good [20:32:18] its changed since then [20:32:23] now we have the mc-site [20:32:24] .php [20:33:01] Now, before you pull those and decom them, some steps need to happen [20:33:04] what RT ticket is this on? 
[20:34:06] cmjohnson1: I have to empty another rack, NOT d2-sdtpa [20:34:07] well it will be on the one you will create to block 4436 [20:34:08] that i can see. [20:34:16] https://rt.wikimedia.org/Ticket/Display.html?id=4438 [20:34:25] I don't have any ticket saying to decom rack d2-sdtpa. [20:34:33] I hope you guys didnt do anthing yet. [20:34:54] nope [20:35:10] I don't understand why you are asking about d2-sdtpa. [20:35:22] robh: let me go through all of the tickets here and get back to you [20:35:32] i am not ready for anything hyet [20:35:32] Ok, because there is no plan to decom d2-sdtpa yet. [20:35:46] the plan is rack a5-sdtpa, which has older servers than d2-sdtpa [20:36:04] on 4436 it says that was the plan...but I didnt see any follow on tickets so I need to look into it more [20:36:12] cmjohnson1: also when you pasted all that stuff, your client throttled it and i got them line by line for a minute or so ;] [20:36:37] ahh, yes [20:36:41] that changed, and linked ticket changed [20:36:51] but ticket history doesnt show it unless you lok at assignments, updating ticket [20:37:21] udpated ticket [20:37:24] sorry for confusion [20:37:47] cmjohnson1: So a5-sdtpa will instead be decommissioned, except srv193 which is test.w.o and needs to move [20:37:52] because i dont feel like deploying a new test.w.o [20:38:22] * RobH is no longer capable of working anything without RT histories [20:38:25] too much shit goin on. [20:39:43] cmjohnson1: I fixed the cameras in eqiad b/c row [20:39:45] \o/ [20:39:54] i see your fqdn change [20:39:56] though the view is fubar, updating the ticket [20:39:59] k [20:40:00] needs you to connect and set view [20:40:05] when yer onsite there next [20:40:12] plz create ticket [20:40:22] 4417, updating. [20:57:29] robh: rt4439 [20:58:01] cool, you ready to move it? [20:58:25] ie: have the rails already in place, cables, etc? 
[20:58:31] yep [20:59:04] ok, so we wanna do a clean shutdoown, after admin logging, and get it moved and back online asap [20:59:19] cmjohnson1: did you already set the port and such and confirm its right? [20:59:59] robh: no i did not [21:00:02] !g Ibdd956d131fd0f53473d1f12b13e009378cd818f [21:00:02] https://gerrit.wikimedia.org/r/#q,Ibdd956d131fd0f53473d1f12b13e009378cd818f,n,z [21:05:11] New patchset: Hashar; "beta: let mwdeploy run mwscript as the apache user" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47795 [21:05:26] Ryan_Lane: the sudo issue should be solved by https://gerrit.wikimedia.org/r/#/c/47795/ [21:05:38] cool [21:11:10] cmjohnson1: so if you wanna set the vlan go for it [21:11:30] is tampa different than eqiad? [21:12:56] maybe, lemme check [21:13:22] i think so cos the switch is foundry [21:14:28] ohh, good times [21:14:38] lemme pull up the switch [21:15:30] cmjohnson1: Ok, so yea, its a foundry. [21:16:01] cmjohnson1: What port will it be connected to on asw-a4-sdtpa? [21:16:33] 29 [21:16:56] may still have gilman in description [21:17:44] yep, sure does, correcting now, once its fixed i let you know so you can take down srv193 [21:18:07] ok [21:20:46] cmjohnson1: Ok, it should be ok to move now. You will want to do a proper shutdown, and admin log it before you start [21:21:03] yep..got it [21:21:19] !log srv193 power down to relocate to different rack [21:21:20] Logged the message, Master [21:22:44] test.w.o is down.... oh noes! [21:23:01] cmjohnson1: So once its back up and online http://test.wikipedia.org should work [21:23:54] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 188 seconds [21:24:30] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 196 seconds [21:25:43] RobH: did you have a chance to have a look at my commit? 
[21:25:44] robh: booting now [21:26:54] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 683713 seconds [21:27:30] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 681782 seconds [21:27:40] yay, nagios finally sees the problme [21:27:49] Jeff_Green: ^ [21:28:12] i wish to punch nagios in the the head [21:28:19] poor nagios [21:28:24] nagios blows [21:28:31] some people like being blown [21:28:40] don't piss at nagios [21:29:00] no piss, punch. [21:29:01] nagios didn't spot replication being broken on the fundraising databases [21:29:09] why? [21:29:11] FOR A WEEK [21:29:18] matanya: i was able to glance, but its going to be a bit more detailed check needed. i dunno if it will happen this week, cuz it should have someone with a more in depth knowledge of nagios than me [21:29:29] and everyone is traveling for fosdem and the like =[ [21:29:33] matanya: because the mysql check is defective [21:29:39] * RobH is alone in office [21:29:42] I can fix it [21:29:44] it's supposed to check seconds behind master [21:29:53] it works as long as the value is non-NULL [21:29:58] thanks RobH [21:30:23] cmjohnson1: So on the first misc server (neodymium) in 4/0/0 c4-eqiad [21:30:24] but it's NULL when replication is stopped, which is what happened, because a query misbehaved on the master but made it into the replication stream [21:30:32] it gets media check fail, it may not be connected properly [21:31:01] grrr [21:31:08] Jeff_Green: so the check should be fixed [21:31:21] ge-4/0/0 up down neodymium [21:31:25] matanya: indeed. [21:31:31] cmjohnson1: So that was the port the first hpm ssd is in right? [21:31:36] it sounds like an easy fix [21:31:42] PROBLEM - Apache HTTP on srv193 is CRITICAL: Connection refused [21:31:45] if NULL exit 2 [21:32:12] RobH: Don't forget the ghost [21:32:17] in theory yeah. I believe it's the stock module, which is compiled [21:32:22] Damianz: eh? [21:32:23] robh: i just remembered...no it is not. i left the cabling for the other misc servers i moved to c7
[21:32:40] so are any of the misc servers connected? [21:32:40] Every office has a ghost, just not sure who killed someone in yours [21:33:30] RECOVERY - Apache HTTP on srv193 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.051 second response time [21:33:44] cmjohnson1: so test.w.o is back online [21:33:59] yes they are all connected but not sure of their port assignments now (not 100%0 [21:34:10] yep...i just refresed the page [21:34:14] hrmm. [21:34:20] when are you back in ashburn? [21:34:51] admin log that its back ;] [21:36:50] !log srv193 is back online [21:36:50] Logged the message, Master [21:37:05] robh: i will be back sunday [21:37:15] so none of the servers have production network connections? [21:37:24] (the new misc eqiad servers in row c that is) [21:38:27] ? [21:38:38] ah it's the percona test, not the stock nagios one [21:38:56] cmjohnson1: I need to know so I can shelf these projects if none of them are network ready. [21:40:06] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [21:40:30] robh: they are all connected but having had to remove half i don't recall which ports their in...I am 99.9% certain that they start at 4/0/8 [21:41:00] i see 4/0/8 as down [21:41:05] not sure if its due to being disabled, checking [21:41:13] i put in ticket 4486 to correct and label these [21:41:34] so these will prolly wait on you to fix on switch when you are back onsite next monday [21:41:44] next monday? or you taking a day due to travel or what? [21:41:50] no [21:41:51] (Just trying to plan my projects) [21:41:57] New review: Hashar; "I have resent the original patch with https://gerrit.wikimedia.org/r/#/c/47564/ which is adding the ..." [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/47538 [21:42:32] New review: Hashar; "I guess I will merge / deploy that in the morning of Feb 7th." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/47564
[21:42:57] Change abandoned: Demon; "You can re-use the same change-id :\" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47538 [21:50:45] so something is on 4/0/8, going to try to pxe and pull mac address out of the routing table [21:50:48] and confirm its the right server [21:53:09] cmjohnson1: fwiw, you are right [21:53:19] server neodymuim is in 4/0/8 [21:53:24] good memory ;] [21:53:32] so i can get that one online today as planned \o/ [21:53:42] i was 99% certain [21:53:46] woo [21:54:05] you have 40 mw servers as well ...if you haven't done those already [21:54:17] i can work on half of those tonight if you like? [21:55:01] no need to work off hours, you have enough there ;] [21:55:10] im not sure how im allocating them between api and general use yet either [21:55:21] i need to sit down and figure it out [21:55:33] (and then push said proposal to mark/asher for review) [21:55:55] cmjohnson1: Though you are more than welcome to update all the dhcpd files and netboot files for them [21:55:56] cool [21:56:05] just nothign in site.pp yet cuz i dunno whats api/general [21:56:19] then when we are ready to spin them up, its minimal configuration changes [21:58:04] robh: rt4438 i want to start pulling these offline....any objections?
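The replication check failure discussed at 21:29-21:31 comes down to how a NULL `Seconds_Behind_Master` is handled: MySQL reports NULL while the replication threads are not running, so a check that only compares a number against thresholds passes silently. A minimal Python sketch, assuming illustrative 30/180 second thresholds and a hypothetical function name; it maps the reading to the standard Nagios plugin exit codes and treats NULL (`None`) as CRITICAL, i.e. "if NULL exit 2". It is not the Percona check or the `check_mysql_slave_running` patch.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_slave_delay(seconds_behind_master, warn=30, crit=180):
    """Return (exit_code, message) for a Seconds_Behind_Master reading.

    None stands for SQL NULL, which MySQL reports while the replication
    threads are stopped. The thresholds are illustrative defaults.
    """
    if seconds_behind_master is None:
        # The case the defective check missed: NULL must not count as OK.
        return CRITICAL, "CRIT replication stopped (Seconds_Behind_Master is NULL)"
    delay = int(seconds_behind_master)
    if delay >= crit:
        return CRITICAL, f"CRIT replication delay {delay} seconds"
    if delay >= warn:
        return WARNING, f"WARN replication delay {delay} seconds"
    return OK, f"OK replication delay {delay} seconds"


# Readings taken from the alerts in this log, plus the NULL case.
for reading in (None, 683713, 7, 0):
    print(check_slave_delay(reading))
```

A real plugin wrapper would print the message and call `sys.exit(code)` so Nagios picks up the state.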
[21:58:27] they need to be decommissioned in software before we pull them [21:58:35] So lets see, im going to tell you what needs to change [21:58:45] you can change and i ll review the patchset for ya [21:59:28] k [22:00:26] so these need to come out of site.pp so they dont bork up [22:00:31] as well as go in decommissioning.pp [22:00:42] so in site.pp you will remove the srv190-192 entires [22:00:43] yeah..i have those 2 on my list [22:00:44] leave 193 [22:00:54] and pull the rest up to 225 [22:01:13] (you will have to edit a range stanza on line 2037 in site.pp for that) [22:01:28] New patchset: Jgreen; "adding check_mysql_slave_running for fundraising slave db's" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47796 [22:01:34] so site.pp update to remove them, as well as decommissioning.pp, and the dhcpd lease file(s) [22:01:42] and netboot.cfg [22:01:46] 4 files to update [22:02:02] cmjohnson1: So you can have steve look over your shoulder to see what you are doing [22:02:15] but once those are done, and we commit the patch, then we can start killing off those servers [22:02:36] k [22:03:26] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47796 [22:04:35] cmjohnson1: So I am emailing ops list [22:04:40] sbernardin1: you will be cc'd as well [22:04:49] so while those are offline, we are undercommited for apache power in tampa [22:05:00] ie: if eqiad blows up, we may have trouble falling back to tampa [22:05:12] so lets try to keep this rack offline no more than a couple days [22:05:37] gotcha [22:05:43] so does sbernardin1 [22:06:18] coolness [22:06:26] i dont think its a major issue, but its something to be concerned about =] [22:06:57] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [22:13:18] New patchset: Cmjohnson; "removing srv190-192 and 194-224 from site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47797 [22:13:37] cmjohnson1: include in the same patchset the dhcpd file update and netboot.cfg update please
[22:13:51] eaiser if we keep them in a single changeset, since its related. [22:14:09] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [22:14:18] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [22:14:36] cmjohnson1: Also, if you dont mind, remove my line for yttrium [22:14:40] since you are editing that file anyhow =] [22:14:48] (in decom.pp) [22:14:53] ahh..oka [22:14:54] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [22:15:15] so you should be able to just commit --amend -a your current changes to add and git review again [22:15:27] (since you havent made other changes, dont need to cherry pick back out iirc) [22:15:59] cmjohnson1: Also, in site.pp, line 2003 [22:16:05] srv225-srv230 [22:16:12] you want to modify that regex to be srv226 [22:16:21] k [22:18:11] cmjohnson1: So I already checked, but so you know what i did [22:18:21] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 7 seconds [22:18:23] if you notice, one of those stanzas takes out the ganglia aggregator for image scalers [22:18:39] so i had to confirm we had another set of apaches that have image scaling assignment, and have ganglia aggregator [22:18:52] in this case, line 1562 answers my question [22:18:59] # mw 1153-1160 are imagescalers (precise) [22:19:10] if $hostname =~ /^mw115[34]$/ { [22:19:10] $ganglia_aggregator = "true" [22:19:24] I didnt want us to take these offline, and suddly have no ganglia aggregators for image scalers [22:19:32] but we do, so we are fine, just fyi =] [22:21:55] New review: RobH; "Please update the netboot.cfg and dhcpd lease files to remove the old server entries for those as well." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/47797
[22:22:19] New patchset: Cmjohnson; " removing srv190-192 194-225 from site.pp and dhcpd file adding to decommissioning.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47798 [22:22:25] i love gerrit. [22:22:31] this shit was so much harder when we had to do it on cluster. [22:23:05] cmjohnson1: ok, i was wrong about the amending without cherry pick [22:23:12] it did a new patchset, so i am just abandoning and closing the old one [22:23:25] Change abandoned: RobH; "handled via another patchset" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47797 [22:23:32] ok [22:24:36] cmjohnson1: well damn it [22:24:40] so that was based on old change [22:24:46] so now that i dumped old change, both are borked [22:24:53] ah..ok..np..i can fix [22:25:00] lemme unabaondon [22:25:03] or that [22:25:12] Change restored: RobH; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47797 [22:25:17] cmjohnson1: Ok, lets do this right [22:25:27] we are going to kill your new change, you are going to cherry pick back out your old change [22:25:32] then add your changes.
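The decommission above touches four files: site.pp, decommissioning.pp, the dhcpd lease file(s) and netboot.cfg. Before pushing a patchset, scanning the files that should no longer mention the retired hosts can catch one that was missed. This is a minimal Python sketch, assuming the file contents have already been read into a dict. The function name and sample contents are hypothetical. decommissioning.pp is left out on purpose, because the hosts are supposed to be listed there. The scan only finds literal hostnames, so range or regex stanzas in site.pp still need to be read by hand.

```python
import re


def find_leftovers(files, retired_hosts):
    """Map each file name to the retired hosts it still mentions literally."""
    leftovers = {}
    for name, text in files.items():
        # \b keeps "srv19" from matching inside "srv190".
        found = sorted(
            host for host in retired_hosts
            if re.search(rf"\b{re.escape(host)}\b", text)
        )
        if found:
            leftovers[name] = found
    return leftovers


# Placeholder contents standing in for the real files.
files = {
    "netboot.cfg": "example line mentioning srv190 and srv191",
    "dhcpd": "example entry for srv193.pmtpa.wmnet",
}
print(find_leftovers(files, {"srv190", "srv191", "srv192"}))
```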
[22:25:34] New patchset: MF-Warburg; "(bug 44725) Enable Geocrumbs extension on Incubator" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47799 [22:25:53] Change abandoned: RobH; "this shoudl have chery picked into another change, abandoning" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47798 [22:26:09] cmjohnson1: Ok, once more with feeling, cherry pick out https://gerrit.wikimedia.org/r/#/c/47797/ [22:26:13] then make your dhcp changes and amend the commit [22:26:23] sorry about the confusion [22:38:32] New patchset: Cmjohnson; "removing srv190-192 and 194-224 from site.pp and dhcpd adding to decommission.pp Change-Id: I6f1a8986d0f1333a6f11982f6390383820ae39e8" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47797 [22:39:41] cmjohnson1: Ok, so I am looking at it now [22:39:49] notice how it has a dependency on some other patchset? [22:40:05] yeah...look at site.pp..i think it's still wrong [22:40:11] I think we can remove it by rebasing, so why dont you try to do that? (in gerrit web interface, you can click rebase change under the patchset2) [22:40:15] oh, lemme see [22:40:33] yep [22:40:52] New patchset: Cmjohnson; "removing srv190-192 and 194-224 from site.pp and dhcpd adding to decommission.pp Change-Id: I6f1a8986d0f1333a6f11982f6390383820ae39e8" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47797 [22:40:52] cmjohnson1: line 2026 [22:41:20] bleh, it keeps adding other dependencies, oh well [22:41:25] doesnt matter, dont need to keep rebasing [22:41:29] okay [22:41:35] but the regex for srv225 is still there [22:41:55] yeah line 2003 on the updated side [22:42:13] cmjohnson1: uhh, you didnt remove yttrium [22:42:16] just removed my comment [22:42:21] i wanted the entire line gone ;] [22:42:36] hah..you're right! 
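The site.pp stanza at line 2003 has to stop matching srv225 once that host is decommissioned. RobH also confirmed that the image scaler ganglia aggregators come from the `mw115[34]` condition instead. The real site.pp regex is not quoted in the channel, so `^srv(22[6-9]|230)$` below is only a hypothetical replacement. The aggregator pattern is copied from the log. Puppet node regexes use Ruby syntax, but these particular patterns behave the same under Python's `re`. That makes it easy to test them against a hostname range before pushing.

```python
import re

# Hypothetical replacement for the site.pp stanza that used to start at srv225.
NEW_NODE_RE = re.compile(r"^srv(22[6-9]|230)$")
# Copied from the log: if $hostname =~ /^mw115[34]$/ { $ganglia_aggregator = "true" }
AGGREGATOR_RE = re.compile(r"^mw115[34]$")


def matching(pattern, hostnames):
    """Return the hostnames a node regex would select, in input order."""
    return [h for h in hostnames if pattern.search(h)]


print(matching(NEW_NODE_RE, [f"srv{n}" for n in range(224, 232)]))
print(matching(AGGREGATOR_RE, [f"mw{n}" for n in range(1153, 1161)]))
```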
[22:43:05] k..going to cherry pick it again [22:43:14] New review: RobH; "regex for srv226+ is still 225+ in site.pp" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/47797 [22:43:17] k [22:46:44] robh: got an error cannot ammend ..in the middle of a cherry pick [22:47:07] hrmm, try doign the reset [22:47:09] then cherry pick [22:47:28] git reset is your pal [22:50:29] New patchset: Cmjohnson; "removing srv190-192 and 194-224 from site.pp and dhcpd adding to decommission.pp Change-Id: I6f1a8986d0f1333a6f11982f6390383820ae39e8" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47797 [22:51:14] robh: ^ [22:51:34] jenkins bot failed it [22:52:01] cmjohnson1: So, to find out why, jenkins-bot always fails the first of 3 checks for us [22:52:13] failure (non-viting) [22:52:16] voting even [22:52:24] thats normal, but the http://integration.mediawiki.org/ci/job/operations-puppet-validate/1453/console : FAILURE [22:52:34] pupet validate is valid, and it fails [22:52:41] if we click on that, the details tell you why [22:53:04] 22:50:38 err: Could not parse for environment production: Syntax error at end of file; expected ']' at /var/lib/jenkins/jobs/operations-puppet-validate/workspace/manifests/decommissioning.pp:237 [22:53:09] i see it [22:53:25] pulled the trailing ] [22:53:35] cherry pick and fix ;] [22:55:57] New patchset: Reedy; "Bug 43812 - Create Sanskrit Wikiquote" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47346 [22:56:01] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47346 [22:56:07] New patchset: Cmjohnson; "removing srv190-192 and 194-224 from site.pp and dhcpd adding to decommission.pp Change-Id: I6f1a8986d0f1333a6f11982f6390383820ae39e8" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47797 [22:57:16] robh: ^ [22:57:39] New review: RobH; "meeeeerging" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/47797
[22:57:40] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47797 [22:58:10] yay \o/ [22:58:17] cmjohnson1: merged onsockpuppet [22:58:20] you are all set. [22:58:24] awesome [22:58:28] (I was already on sockpuppet mergfing someting else) [22:58:31] wow, typos. [22:58:41] it's that kind of day [22:58:56] ok, so now we have one more place to pull them [22:59:03] (out of lvs) [22:59:24] cmjohnson1: So for this, you need to be root, and you need to be careful [22:59:29] because doing it wrong can do bad things. [22:59:48] su into root on fenari and edit the files located in /home/wikipedia/conf/pybal/pmtpa [22:59:59] these files are referenced live by the site, so any changes are immediately live. [23:00:08] New patchset: Reedy; "Add sawikiquote to dblists" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47804 [23:00:34] cmjohnson1: You will be editing a few files, specifically the apaches, api, and rendering files [23:00:41] okay [23:00:45] hrmm... [23:00:47] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47804 [23:00:51] why is rendering all old... [23:00:52] wtf. [23:00:58] i thought there were mw rendering servers [23:01:12] cmjohnson1: Are any of the new apaches racked up there already in d1-sdtpa? [23:01:25] yes [23:01:39] new as the ones we just got or the ones from last year?
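The puppet-validate failure at 22:53 ("expected ']' at ... decommissioning.pp:237") came from deleting the closing bracket together with the last host entry. This is a minimal Python sketch of two safeguards. The first regenerates the whole host array instead of hand-editing its tail. The second is a naive bracket-balance check. The variable name `decommissioned_servers`, the tab-indented layout and the example host list are assumptions about decommissioning.pp. The balance check ignores brackets inside strings or comments, so it does not replace the puppet parser that the Jenkins job runs.

```python
def puppet_array(var, hosts):
    """Render a Puppet array assignment, one quoted host per line."""
    body = ",\n".join(f'\t"{h}"' for h in hosts)
    return f"${var} = [\n{body}\n]"


def brackets_balanced(text):
    """Naive check that every '[' has a matching ']' (ignores quoting and comments)."""
    depth = 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


# Example list: srv190-srv225 without srv193.
hosts = [f"srv{n}" for n in range(190, 226) if n != 193]
manifest = puppet_array("decommissioned_servers", hosts)
print(brackets_balanced(manifest))       # the full array
print(brackets_balanced(manifest[:-1]))  # the same text with the trailing ']' removed
```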
[23:01:49] ok, it seems that the ONLY image rendering servers are srv219-224 [23:02:05] cmjohnson1: as in brand new in this batch [23:02:20] cuz we wanna spin some up and bring them online as rendering systems [23:02:27] i should have caught this before now ;] [23:02:47] but, lemme tell you what to do in these files first, then you can do the new servers, then come back to these [23:02:57] So, you are going to do the basic install on 6 of the new servers [23:03:05] New patchset: Demon; "Updating for gerrit 2.5.1-1225-gd52acbc" [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/47806 [23:03:08] and we will press them into image rendering service before we pull these offline [23:03:17] (although they are offline per nagios, they are still pooled) [23:03:23] make sense? [23:03:35] yep [23:03:37] New review: Demon; "The build we're deploying can be obtained from jenkins: https://integration.mediawiki.org/nightly/ge..." [operations/debs/gerrit] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/47806 [23:03:58] wait..the new servers are not cfg'd yet [23:03:58] So next step: spin up 6 new servers in d1-sdtpa. once they are basic install, dont sign puppet certs [23:04:00] <^demon> \o/ [23:04:14] cmjohnson1: right, you're going to have to do that before we wipe srv219-224 [23:04:23] okay..so we need to get those up first....let's stop there [23:04:23] <^demon> 2.5.1-1225-gd52acbc. That's the version we're installing :p [23:04:40] cmjohnson1: yep, stop there, and once that is online we come back to these [23:04:49] okay [23:04:52] cmjohnson1: actually, wait [23:04:58] lets start the wipe on non image servers [23:05:01] so we dont have to wait another day [23:05:09] k [23:05:16] good idea [23:05:32] so leave srv219-224 alone, and we can wipe srv190-213. open up apaches
[23:05:38] we want to depool these first [23:05:51] so in the apaches file, you want to set enabled to False on all those [23:05:57] !log reedy synchronized php-1.21wmf9/extensions/WikimediaMaintenance/ [23:05:58] Logged the message, Master [23:06:04] hrmm, i think you can just comment them out entirely actually [23:06:12] cmjohnson1: yea, comment them out like srv193 is in the file [23:06:16] so pybal stops checking them entirely [23:06:43] cmjohnson1: you can also comment out srv225 [23:06:43] k [23:06:55] that tells the pybal service on our lvs servers to stop looking for them [23:07:01] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: [23:07:01] and dont dispatch requests to them [23:07:02] Logged the message, Master [23:07:43] cmjohnson1: You can also open up the 'api' file and comment out servers 214-218 [23:08:18] then you can start the wipe on srv192-225, NOT counting the rendering servers 219-224 [23:08:29] (also not counting srv193, but its moved now so no danger) [23:09:04] should i add any comments? [23:09:14] so those can start wipe, and you guys can do new install on 6 of the new servers in d1-sdtpa so they can replace the current rendering servers [23:09:23] i'd admin log that you are doing it is all [23:09:38] maybe reference the RT ticket in the admin log [23:09:43] so folks can look there for more details [23:10:11] log removing srv192-218 and srv225 per rt # [23:10:13] so and such [23:10:29] removing so and such from pybal per rt # even [23:10:49] so that will start wipe on those, so tomorrow they can get pulled out of the rack [23:10:51] and replaced entirely [23:11:25] cmjohnson1: That all makes sense right?
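Depooling here means commenting out the host entries in the pybal files under /home/wikipedia/conf/pybal/pmtpa, the same way srv193 already is, so that pybal stops health-checking those hosts. This is a minimal Python sketch that assumes each active entry is a single Python-dict literal per line with a `'host'` key. That line format, the `.pmtpa.wmnet` hostnames and the function name are assumptions, not copied from the real files.

```python
import ast


def depool(lines, hosts_to_remove):
    """Comment out pool entries whose short hostname is in hosts_to_remove.

    Assumes one Python-dict literal per active line, for example
    { 'host': 'srv192.pmtpa.wmnet', 'weight': 10, 'enabled': True }
    Blank lines and lines already commented out are passed through.
    """
    out = []
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            entry = ast.literal_eval(stripped)
            short_name = entry["host"].split(".")[0]
            if short_name in hosts_to_remove:
                out.append("#" + line)
                continue
        out.append(line)
    return out


pool = [
    "{ 'host': 'srv192.pmtpa.wmnet', 'weight': 10, 'enabled': True }",
    "#{ 'host': 'srv193.pmtpa.wmnet', 'weight': 10, 'enabled': True }",
    "{ 'host': 'srv219.pmtpa.wmnet', 'weight': 10, 'enabled': True }",
]
for line in depool(pool, {"srv192", "srv225"}):
    print(line)
```

Because these files are read live, writing the result to a temporary file in the same directory and renaming it over the original (for example with `os.replace`) should keep pybal from ever reading a half-written file.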
[23:11:36] if not lemme know and we can go in more detail [23:11:56] New patchset: Reedy; "Bug 44413 - Create Wikivoyage Romanian" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47347 [23:12:49] !log removing srv192-218 and srv225 per rt 4438 [23:12:50] Logged the message, Master [23:13:00] robh: yep it does [23:13:02] !log removed from pybal specifially [23:13:03] Logged the message, RobH [23:13:08] wow, i cannot spell today. [23:13:12] oh well. [23:13:33] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47347 [23:13:35] !log specifically (i can spell, maybe) [23:13:36] Logged the message, RobH [23:14:55] New patchset: Reedy; "Add sawikiquote" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47809 [23:15:05] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47809 [23:16:15] robh: okay...so, i am going to start the wipe on those servers (carefully and a bit redundantly specified above). Steve is cabling and configuring at least 6 of the new servers now. [23:16:41] New patchset: Reedy; "Add rowikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47810 [23:16:45] :-P needs to be added to that ^^ [23:16:51] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47810 [23:17:26] !log reedy synchronized wmf-config/InitialiseSettings.php [23:17:26] Logged the message, Master [23:18:17] cmjohnson1: cool, and we will leave srv219-srv224 alone until we bring the new ones online to replace them [23:18:23] i should have caught that before we started =[ [23:18:29] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: [23:18:30] Logged the message, Master [23:19:10] sounds good! 
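The host ranges in this part of the conversation are easy to mix up. The batch being pulled runs from srv190 to srv225, srv193 has moved, and srv219-srv224 stay up as image scalers. This short Python sketch derives the wipe list from those three facts by set arithmetic, then compresses it back into the `srvA-B` form used in `!log` lines. A mechanical cross-check like this can catch a mismatch between the plan and the range that gets typed into the admin log. The helper names are hypothetical.

```python
def srv_range(start, end):
    """Hostnames srv<start> .. srv<end>, inclusive."""
    return {f"srv{n}" for n in range(start, end + 1)}


def compress(hosts):
    """Collapse srvNNN names into the 'srvA-B' style used in !log lines."""
    nums = sorted(int(h[3:]) for h in hosts)
    if not nums:
        return ""
    ranges, start, prev = [], nums[0], nums[0]
    for n in nums[1:]:
        if n != prev + 1:
            ranges.append((start, prev))
            start = n
        prev = n
    ranges.append((start, prev))
    return ", ".join(f"srv{a}" if a == b else f"srv{a}-{b}" for a, b in ranges)


batch = srv_range(190, 225)                # hosts covered by the decommission
keep = {"srv193"} | srv_range(219, 224)    # moved test host + remaining image scalers
to_wipe = sorted(batch - keep, key=lambda h: int(h[3:]))
print(compress(to_wipe))
```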
[23:19:11] cmjohnson1: we should technically put them back in site.pp, as they will fall out of sync from puppet runs [23:19:20] but the sync files for mediawiki are in node groups, and thus still affected. [23:19:35] so we need to bring the new ones online tomorrow at latest. [23:19:43] cuz half deployed bugs the shit outta me [23:19:46] =] [23:20:22] they will be ready [23:21:26] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47348 [23:22:38] mw1153-mw1160 are supposedly image scalers [23:22:42] New patchset: Reedy; "Add plwikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47811 [23:22:44] which i saw before in site.pp, and assumed they were [23:22:48] but pybal config doesnt include them [23:22:53] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47811 [23:22:59] i think they should be in service though, but i asked peter and we'll see what he replies [23:23:21] having all of them off and gone is a pain sometimes. [23:23:21] PROBLEM - Host srv192 is DOWN: PING CRITICAL - Packet loss = 100% [23:23:38] heh [23:23:50] loads more cannot connect to host errors ;) [23:23:57] ? [23:24:07] srv190: ssh: connect to host srv190 port 22: Connection timed out [23:24:08] srv191: ssh: connect to host srv191 port 22: Connection timed out [23:24:08] srv192: ssh: connect to host srv192 port 22: Connection timed out [23:24:16] Expected, I know [23:24:18] i took them out of node groups [23:24:25] what is throwing that? 
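RobH's observation that mw1153-mw1160 are labelled image scalers in site.pp but are absent from the pybal rendering pool amounts to a set difference between two inventories. This is a minimal Python sketch in which the inputs are plain hostname lists that you would extract from site.pp and the pybal file yourself. The extraction step is left out, and the function name is hypothetical. The sketch also assumes that srv219-srv224 still carry the scaler role in site.pp.

```python
def pool_gaps(declared, pooled):
    """Compare hosts puppet assigns a role to with hosts present in an LVS pool file."""
    declared, pooled = set(declared), set(pooled)
    return {
        "declared_not_pooled": sorted(declared - pooled),
        "pooled_not_declared": sorted(pooled - declared),
    }


# From the log: site.pp labels mw1153-mw1160 as image scalers, while the
# rendering pool holds only srv219-srv224 (assumed to keep their role in site.pp).
rendering_pool = [f"srv{n}" for n in range(219, 225)]
scalers_in_site_pp = [f"mw{n}" for n in range(1153, 1161)] + rendering_pool
print(pool_gaps(scalers_in_site_pp, rendering_pool))
```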
[23:24:31] sync-file [23:24:42] presumably fenari hasn't been updated [23:24:43] !log reedy synchronized wmf-config/InitialiseSettings.php [23:24:43] Logged the message, Master [23:24:59] Reedy: ahh, mediawiki-installation node file [23:25:01] still updating [23:25:07] yeah [23:25:09] PROBLEM - Host srv200 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:18] PROBLEM - Host srv201 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:19] PROBLEM - Host srv190 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:19] PROBLEM - Host srv199 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:19] PROBLEM - Host srv195 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:19] PROBLEM - Host srv198 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:45] PROBLEM - Host srv202 is DOWN: PING CRITICAL - Packet loss = 100% [23:25:54] PROBLEM - Host srv207 is DOWN: PING CRITICAL - Packet loss = 100% [23:26:03] PROBLEM - Host srv208 is DOWN: PING CRITICAL - Packet loss = 100% [23:26:03] PROBLEM - Host srv203 is DOWN: PING CRITICAL - Packet loss = 100% [23:26:16] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: [23:26:17] Logged the message, Master [23:26:26] PROBLEM - Host srv205 is DOWN: PING CRITICAL - Packet loss = 100% [23:26:26] PROBLEM - Host srv210 is DOWN: PING CRITICAL - Packet loss = 100% [23:26:30] PROBLEM - Host srv211 is DOWN: PING CRITICAL - Packet loss = 100% [23:26:48] PROBLEM - Host srv204 is DOWN: PING CRITICAL - Packet loss = 100% [23:26:48] PROBLEM - Host srv209 is DOWN: PING CRITICAL - Packet loss = 100% [23:26:48] PROBLEM - Host srv213 is DOWN: PING CRITICAL - Packet loss = 100% [23:27:06] PROBLEM - Host srv212 is DOWN: PING CRITICAL - Packet loss = 100% [23:27:15] PROBLEM - Host srv216 is DOWN: PING CRITICAL - Packet loss = 100% [23:27:15] PROBLEM - Host srv225 is DOWN: PING CRITICAL - Packet loss = 100% [23:27:24] PROBLEM - Host srv218 is DOWN: PING CRITICAL - Packet loss = 100% [23:27:33] PROBLEM - Host srv215 is DOWN: PING CRITICAL - Packet loss = 100%
[23:27:59] oh shit...robh: did we miss something? [23:28:06] nah [23:28:11] PROBLEM - Host srv214 is DOWN: PING CRITICAL - Packet loss = 100% [23:28:11] decom.pp hasnt run on spence is all [23:28:17] scheduled downtimes? :) [23:28:22] ok [23:28:38] :-D [23:28:41] !log srv190-srv223 & srv225 should be offline, they are being decommissioned, can ignore nagios errors [23:28:42] Logged the message, RobH [23:28:46] :-P [23:28:51] cmjohnson1: luckily, we dont page on those [23:28:57] just on lvs for those ;] [23:29:01] imagine the hostility! [23:29:23] who would be paged? two people already awake and in this channel :-P [23:29:28] and the rest are in a good time zone [23:29:51] hm maybe some people are travelling after brussels [23:30:20] Reedy: so the node groups are updated now [23:30:23] yes they are :) [23:30:32] apergos: most are [23:30:36] office is empty! [23:30:41] i like it. [23:30:43] nice, enjoy it [23:31:05] good time for pranks :-P [23:31:40] cmjohnson1: So the other place to remove things for decom is the node file lits for dsh [23:31:47] in /usr/local/dsh/node_groups on fenari [23:31:56] all the sync scripts we use reference those [23:32:06] so, i just pulled out all but srv219-224 [23:32:13] as those are still active image scalers for the next 24 hours [23:32:54] yep...is there a wikitech on this?
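The dsh node group files in /usr/local/dsh/node_groups drive the sync scripts. That is why sync-file kept trying to reach the wiped hosts until they were removed from those files. This is a minimal Python sketch of pruning decommissioned hosts from such a file. It assumes the usual dsh layout of one hostname per line with `#` comments. The function name is hypothetical.

```python
def prune_node_group(text, decommissioned):
    """Drop decommissioned hosts from a dsh node-group file (one host per line)."""
    kept = []
    for line in text.splitlines():
        host = line.strip()
        # Compare short names so "srv190" and "srv190.pmtpa.wmnet" both match.
        if host and not host.startswith("#") and host.split(".")[0] in decommissioned:
            continue
        kept.append(line)
    return "\n".join(kept) + "\n" if kept else ""


group = "srv190\nsrv219\n# image scalers stay until replaced\nsrv225\n"
print(prune_node_group(group, {"srv190", "srv225"}), end="")
```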
[23:33:04] in some fashion ues [23:33:13] in its entirety, dunno [23:33:23] hahahahaha [23:33:23] http://wikitech.wikimedia.org/view/Decommissioning_Servers [23:33:29] thats so outdated its funny [23:33:50] * RobH bookmarks to update later [23:34:13] cmjohnson1: i basically wrote it all down in front of me to ensure i did it all, so the online docs are outdated [23:34:40] the one part we have not done is clean the puppet certificates off [23:34:43] figured as much [23:34:47] but i was gonna do that in one big ass for loop later [23:34:48] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47349 [23:35:07] also, the image scalers coming offline wont break anythign, but meh [23:35:12] i rather we get new ones online first [23:35:43] agreed [23:36:20] New patchset: Reedy; "Add kowikiversity" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47813 [23:36:46] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47813 [23:37:38] !log reedy synchronized wmf-config/InitialiseSettings.php [23:37:39] Logged the message, Master [23:37:46] yay, puppet caught up [23:38:42] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: [23:38:43] Logged the message, Master [23:41:59] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47350 [23:43:55] !log reedy synchronized langlist [23:43:56] Logged the message, Master [23:44:23] !log reedy synchronized wmf-config/InitialiseSettings.php [23:44:24] Logged the message, Master [23:46:02] New patchset: Reedy; "Add minwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47814 [23:46:23] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47814 [23:50:03] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: [23:50:05] Logged the message, Master [23:50:25] Oooh, crap, I forgot about that 
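One cleanup step was still outstanding: removing the retired hosts' puppet certificates, "in one big ass for loop later". This dry-run Python sketch builds one `puppet cert clean <certname>` command per host. That is the Puppet 2.x/3.x CLI; newer Puppet releases moved the equivalent command under `puppetserver ca`. The `pmtpa.wmnet` suffix is an assumption about how the certnames look, so compare it against `puppet cert list --all` on the puppetmaster before running anything. `shlex.join` needs Python 3.8 or later.

```python
import shlex


def cert_clean_commands(hosts, domain):
    """Build (but do not run) one 'puppet cert clean' command per host."""
    return [["puppet", "cert", "clean", f"{host}.{domain}"] for host in hosts]


def as_shell(commands):
    """Render the commands as shell lines for review before running them."""
    return "\n".join(shlex.join(cmd) for cmd in commands)


# Assumed certname domain; confirm with 'puppet cert list --all' first.
retired = [f"srv{n}" for n in range(190, 193)]
print(as_shell(cert_clean_commands(retired, "pmtpa.wmnet")))
```

Once the list has been reviewed, each command could be run with `subprocess.run(cmd, check=True)` on the puppetmaster.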
[23:50:48] Reedy: we've been getting an "output from your job" cronspam [23:50:50] Reedy: i keep gettin emails output for your job ;p [23:50:51] with you as the sender [23:50:53] hehe [23:51:02] Yeah, addwiki.php does it [23:51:05] I'm not quite sure why... [23:51:13] yep, i recall that it does [23:51:18] i just hate that it never tells you why [23:51:27] (unless you happen to know what the output is from already) [23:51:36] heh [23:51:42] It does email a mailing list at somepoint [23:51:47] yep [23:51:54] New patchset: Reedy; "Add symlinks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47816 [23:52:02] i recall when Roan made the changes (i think it was roan, wasnt it?) [23:52:17] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47816 [23:52:23] back when the number of folks adding wikis was counted on a single hand [23:52:33] It is still is. Me ;) [23:53:10] ahh but the number of folks who are capable is huggggge [23:53:14] they just dont ;] [23:53:21] heh [23:53:42] Why doesn't wikipedia have a wildcard dns entry? :( [23:55:13] RobH: Any chance you could make the dns entries for min.wikipedia.org please? [23:55:29] sure, you ahve a ticket or something [23:55:34] ie: where are they pointing? [23:55:57] Reedy: Because 'ilovebigtities.wikipedia.org' might not look so good pr wise? [23:55:58] !log reedy synchronized php-1.21wmf9/cache/interwiki.cdb 'Updating 1.21wmf9 interwiki cache' [23:55:59] Logged the message, Master [23:56:19] !log reedy synchronized php-1.21wmf8/cache/interwiki.cdb 'Updating 1.21wmf8 interwiki cache' [23:56:20] Logged the message, Master [23:56:37] Damianz: Some do have some level of wildcards.. [23:57:12] * Damianz notes to find obscure dns entires when more awake and his brain is working better [23:57:23] its not hard [23:57:36] our dns servers will just tell you all the zones [23:57:40] transparency! [23:57:49] Reedy: So what am I adding to DNS now? \
[23:57:52] * RobH has file open [23:58:25] min.wikipedia.org needs to resolve to the load balancers (like en etc) [23:58:41] ok, thats a language type? [23:58:45] and its on langlist and such? [23:58:56] cuz if so, i just authdns-update and it should go and pull automatically from langlist [23:59:03] aha [23:59:05] (i shoudlnt have to manually add) [23:59:18] It is in the langlist, and I thought I had synced it already [23:59:42] ok, running update [23:59:56] !log authdns-update to add min.wikipedia.org via addwiki automation [23:59:57] Logged the message, RobH
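RobH expects authdns-update to pick up new language subdomains from langlist instead of needing a manual record. The channel does not show how the real tooling does this. The Python sketch below only illustrates the idea: assuming langlist holds one language code per line, it lists the `<code>.wikipedia.org` names that a zone does not contain yet. The function name and the zone-name set are hypothetical.

```python
def missing_language_hosts(langlist_text, zone_names, project="wikipedia.org"):
    """List <code>.<project> names implied by langlist that a zone does not contain yet."""
    codes = [
        line.strip()
        for line in langlist_text.splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]
    wanted = {f"{code}.{project}" for code in codes}
    return sorted(wanted - set(zone_names))


langlist = "en\nsa\nmin\n"
zone = {"en.wikipedia.org", "sa.wikipedia.org"}
print(missing_language_hosts(langlist, zone))
```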