[00:14:51] PROBLEM - MySQL Slave Delay on es1004 is CRITICAL: CRIT replication delay 252 seconds [00:20:51] RECOVERY - MySQL Slave Delay on es1004 is OK: OK replication delay 1 seconds [00:28:44] Reedy: https://gerrit.wikimedia.org/r/#/c/10724/ [01:19:37] New patchset: Dereckson; "(bug 37401) Babel configuration for fo.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11133 [01:19:47] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11133 [01:24:00] New patchset: Dereckson; "(bug 37401) Babel configuration for fo.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11133 [01:24:09] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11133 [01:39:40] New patchset: Dereckson; "(bug 37384) - Collection default format is ODT for gu. projects" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11134 [01:39:46] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11134 [01:40:45] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 212 seconds [01:43:09] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 261 seconds [01:46:36] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [01:56:12] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 1 seconds [03:50:21] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [03:50:21] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [03:50:21] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [04:31:22] PROBLEM - MySQL Slave Delay on es1004 is CRITICAL: CRIT replication delay 303 seconds [04:35:52] RECOVERY - MySQL 
Slave Delay on es1004 is OK: OK replication delay 0 seconds [04:45:18] New patchset: Logicwiki; "(bug 37384) - Collection default format is ODT for gu. projects" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11134 [04:45:25] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11134 [04:45:47] New review: Hashar; "Please stop spamming random people with review requests. Thanks!" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/11082 [04:54:19] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [06:58:12] PROBLEM - Puppet freshness on es1004 is CRITICAL: Puppet has not run in the last 10 hours [07:51:44] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [07:57:14] ACKNOWLEDGEMENT - Host srv206 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn srv206 has a loong history of trouble in RT #241. get rid of it?! [08:06:24] New review: Dereckson; "@Hashar Please make clear to Sumanah what kind of notifications you wish to have or what kind of act..." 
[operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/11082 [08:13:23] New patchset: Hashar; "import CommonSettings from wmflabs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9131 [08:13:24] New patchset: Hashar; "vary wgUploadStashScalerBaseUrl based on cluster" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11035 [08:13:24] New patchset: Hashar; "wmfHostnames array to easily change hostnames on a cluster basis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11034 [08:13:30] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9131 [08:13:32] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11035 [08:13:34] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11034 [08:13:56] New review: Hashar; "rebased" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9131 [08:14:08] New review: Hashar; "rebased" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11035 [08:14:23] New review: Hashar; "rebased" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11034 [08:14:26] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9131 [08:14:27] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11035 [08:14:29] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11034 [08:18:23] New patchset: Hashar; "cleanup whitespace in mobile-pmtpa.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11037 [08:18:24] New patchset: Hashar; "move mobile related conf to their own files" [operations/mediawiki-config] 
(master) - https://gerrit.wikimedia.org/r/11036 [08:18:30] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11037 [08:18:32] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11036 [08:18:48] New review: Hashar; "rebased" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11036 [08:18:51] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11036 [08:19:01] New review: Hashar; "rebased" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11037 [08:19:04] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11037 [08:21:37] New patchset: Hashar; "specific shell configuration for transcoding boxes" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9132 [08:21:39] New patchset: Hashar; "import overriding system for InitialiseSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9237 [08:21:40] New patchset: Hashar; "move throttling related conf to throttle.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9136 [08:21:46] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9132 [08:21:48] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9237 [08:21:50] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/9136 [08:22:34] New review: Hashar; "rebased" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9237 [08:22:55] New review: Hashar; "rebased" [operations/mediawiki-config] (master); V: 1 C: 2; - 
https://gerrit.wikimedia.org/r/9136 [08:23:04] New review: Hashar; "rebased" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/9132 [08:23:06] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9237 [08:23:07] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9136 [08:23:09] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/9132 [08:24:12] and now I am going to deploy that to the cluster [08:24:51] !log deploying several changes made to mediawiki-config gerrit changes 11034 11035 9131 11036 11037 9132 9136 and 9237 [08:24:57] Logged the message, Master [08:26:17] GRRR [08:35:27] gr? [08:36:44] someone merged changes in mediawiki-config but did not deploy them :( [08:36:48] so I have to do it hehe [08:36:53] and of course now I have an issue [08:37:20] !log installing samba-common-bin, smbclient package upgrades on tridge [08:37:25] Logged the message, Master [08:39:59] so that was the change being incorrect :] [08:41:00] New patchset: Hashar; "Revert "(bug 37482) Adding Proofread Page ext. namespaces on nl.wikisource"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11142 [08:41:06] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11142 [08:41:58] Standard mw deploy steps; 1) push 2) run :D [08:41:58] ugh [08:42:27] yeah [08:42:47] well that was merged -> run [08:42:53] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [08:42:54] leaving the actual deploy to someone else :-]] [08:43:01] New review: Hashar; "I have reverted the change. It uses the wrong global variable ($wgNamespaceAliases instead of $wgExt..." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11067 [08:43:18] New review: Hashar; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11142 [08:43:21] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11142 [08:43:34] New review: Hashar; "Reverted with https://gerrit.wikimedia.org/r/#/c/11142/" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11067 [08:43:34] OH [08:43:45] Gerrit has a [Revert Change] button [08:43:46] bah [08:44:02] New review: Hashar; "reverts https://gerrit.wikimedia.org/r/#/c/11067/" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11142 [08:45:16] !log reverted '(bug 37482) Adding Proofread Page ext. namespaces on nl.wikisource' --> used the wrong configuration setting. [08:45:21] Logged the message, Master [08:45:30] * hashar proceed to next commit [08:49:52] PROBLEM - MySQL Slave Delay on es1004 is CRITICAL: CRIT replication delay 298 seconds [08:50:18] New review: Dzahn; "blocked by RT 3106 (precise upgrade) for nodejs package." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/11042 [08:52:43] RECOVERY - MySQL Slave Delay on es1004 is OK: OK replication delay 0 seconds [09:08:44] New patchset: Dereckson; "(bug 37482) Adding Proofread Page ext. namespaces on nl.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11143 [09:08:49] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11143 [09:22:51] New review: Hashar; "Thanks, you rock!" 
[operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11143 [09:22:53] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11143 [09:23:01] New patchset: Dereckson; "(bug 37363) Uninstall CongressLookup extension" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11144 [09:23:07] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11144 [09:51:04] New patchset: Hashar; "boostrap placeholder for PHPUnit" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11145 [09:51:05] New patchset: Hashar; "move utilities functions in common files" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11146 [09:51:11] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11145 [09:51:13] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11146 [10:07:32] New patchset: Dereckson; "(bug 37336) Install Narayam ext. 
in te.wiktionary & te.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11147 [10:07:38] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11147 [10:15:25] New patchset: Hashar; "ant target to trigger PHPUnit tests under Jenkins" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11149 [10:15:31] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11149 [10:16:00] New review: Hashar; "Yeah we have tests in Jenkins now :-]" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11149 [10:16:02] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11149 [10:18:17] New patchset: Hashar; "boostrap placeholder for PHPUnit" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11145 [10:18:18] New patchset: Hashar; "move utilities functions in common files" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11146 [10:18:24] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11145 [10:18:26] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11146 [10:19:41] New review: Hashar; "Patchset 2 is a rebase to take advantage of Ie7d96750 which enables PHPUnit tests in Jenkins." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/11145 [10:19:49] New review: Hashar; "Patchset 2 is a rebase to take advantage of Ie7d96750 which enables PHPUnit tests in Jenkins." 
[operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/11146 [10:21:11] New review: Hashar; "Since this is fixing two bugs, I would prefer we have two separate Gerrit changes :-]" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/10756 [10:31:12] Thehelpfulone: do you want to make / have a wiki table with all listinfo owners and moderators or something similar? [10:31:42] i can fetch raw info but i mean the "turn it into nice wiki" part [10:32:54] just cause you mentioned you already started a project on wiki on similar things [10:52:51] New patchset: Dereckson; "(bug 36972) activate the patroller group on nn.wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11150 [10:52:57] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11150 [10:56:49] PROBLEM - MySQL Slave Delay on es1004 is CRITICAL: CRIT replication delay 203 seconds [10:58:19] RECOVERY - MySQL Slave Delay on es1004 is OK: OK replication delay 0 seconds [11:34:46] so, I'd like to setup a tool that provides an overview of the infrastructure by exposing data coming from the puppet db [11:34:55] we've discussed this in the past too [11:35:15] how would I pick a server to do that? [11:37:56] paravoid: well, you could use a new misc server for that [11:38:06] you'd ask RobH [11:38:24] or you could stick it on another server that's doing something similar [11:38:27] it's really a simple django app, do we need a whole separate server for that? [11:39:48] as long as it's not spence i guess [11:40:20] maybe neon? [11:40:21] * Ryan_Lane goes to get food [11:43:17] maybe ask Leslie if she thinks its fine on neon (upcoming monitoring server to replace spence, as it might fit monitoring and historically spence also had multiple tools and webserver and i expect neon to have more resources. 
my 2 cents [11:45:39] that's more than 2 cents, thanks :-) [11:59:13] there are things like observium and (other end of the spectrum as far as resources) http://noc.wikimedia.org/dbtree/ [11:59:17] hmm noc, eh? grrr [11:59:28] anyways it would be nice I guess to have those generally on one host, I think [12:03:56] true, but noc = fenari and potentially wants cleanup rather than new stuff [12:05:01] yeah I don't want things on fenari [12:05:07] what I want is that not to be on noc [12:06:41] and not spence [12:07:20] indeed [12:08:34] paravoid: pretty likely you know it, but win32-loader.exe to install Debian?:) quickest office OS migration for a guy from XP to Debian: mail to ALL: "Please go to http://goodbye-microsoft.com/, click install, use your local windows admin rights (!sic) and tell me if any issues. ttyl, your admin". :) [12:09:04] yep, it's fun :) [12:31:13] !log backing up wikitech dir locally on linode instance [12:31:29] Logged the message, Master [12:31:52] upgrade time? [12:32:12] soon [12:33:11] /me sees "wikitech-static" as well [12:33:33] and "wikitech-broken" :p [12:33:46] they sound useful [12:33:48] oh yeah [12:33:54] the static one was a dumphtml copy I think [12:34:00] so we could grab it and stuff it wherever [12:34:01] which would you back up besides the main? [12:34:07] ok [12:34:24] it should get regenerated once a week by cron (in an ideal world, which we don't live in) [12:34:50] gah [12:34:59] really can it take me all day to get this one set of lists generated? [12:35:03] answer: yes, yes it can. [12:35:04] >_< [12:40:56] will just take all of /srv/org/wikimedia, first get it, reduce size later [12:41:07] ok [13:10:08] New review: Dereckson; "Don't merge, shellpolicy issue." 
[operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/11150 [13:25:28] mutante, just responded to RT about stat1 [13:25:43] Pretty sure Mark convinced Diederik to do a full reinstall, rather than an upgrade [13:26:19] ottomata: well and Ryan convinced to do an in-place upgrade, i don't know which was first:) [13:26:32] either or [13:26:33] hahah, we were first, then diederik talked to mark yesterday I think [13:26:39] well, we can do upgrade now, right? [13:26:42] a full install is always better [13:26:47] but we'd have to wait for who knows who to do the install? [13:27:01] and doing the upgrade doesn't mean we can't do the reinstall if the upgrade doesn't work... [13:27:05] meh? what should we do? [13:27:06] it's the best way to ensure puppetization is proper [13:27:20] ok [13:27:24] i guess we'll do reinstall then [13:27:30] we are ready to do that asap [13:27:41] do we need mark to do that? who can do that? [13:28:01] I dunno. add an rt for it [13:29:38] ottomata: turn my upgrade ticket into reinstall ticket. you can just rename the subject in RT [13:31:36] i think there is an RT for it, but it might have been closed [13:31:38] I can reopen it [13:32:05] ottomata: check out 2165, i turned it into "stats master ticket" and all others are linked there [13:32:31] awesome, thanks [13:32:32] also added you as CC / requestor in several places [13:33:04] https://rt.wikimedia.org/Ticket/Display.html?id=2946 [13:33:55] ah, ok, that was missing. linked and closing the other [13:34:21] ok [13:34:32] i will reopen mine then [13:34:58] rejected upgrade [13:35:18] reinstall is open [13:35:43] but it could have a better comment than "Not doing this, just going with Lucid" :) [13:38:28] yeah, i'm adding [13:39:13] is mark really the only one who can reinstall? is there anyone else we can poke? 
[13:39:23] Erik Z has suspended some work he was doing on stat1 so we could do this [13:39:27] he asked us to have it done in 24 hours [13:41:36] are you satisfied with backups yet? [13:41:59] !rt 3098 [13:42:00] http://rt.wikimedia.org/Ticket/Display.html?id=3098 [13:43:44] cool! [13:44:03] hmm, i think we should see the /a partition on there too [13:44:05] not just /home [13:44:07] i'd rather not start it in late afternoon with an appointment later, i can do it in the morning but i think everybody could if anybody in US timezone is avail. [13:44:39] ottomata: and yes, make sure of that as well [13:45:07] how do I make sure? [13:45:13] do I need access to tridge? [13:45:39] should also receive mail from amanda on ops list [13:49:42] ottomata: you see these "amanda mail report" mails? i see them but no "stat1" in there [13:49:56] while we do see those daily files on tridge itself [13:50:01] hmm, looking [13:51:26] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [13:51:26] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [13:51:26] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [13:51:38] FAILURE DUMP SUMMARY: [13:51:40] http://stat1.wikimedia.org/ /a lev 0 FAILED "[dump larger than available tape space, 233263375 KB, but cannot incremental dump new disk]" [13:52:08] stat1.wikime /home 0 4917710 1895211 38.5 1:42 18520.5 1:42 18527.6 [13:52:12] so yeah, not enough space for /a [13:55:42] hm, or you do want to let it span multiple tapes [13:55:57] um, doesn't matter to me [13:56:19] apergos: how much did you remove from tridge approx? [13:57:04] I dunno, but we had a few T free instead of 1/2 T [13:57:06] ottomata: anything that can be dropped from /a that is just outdated or is enough to be stored _once_ elsewhere -> archive vs. daily backup [13:57:09] how much space do they need? 
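(Aside on the /a failure above: a rough sketch of how many virtual tapes a 233263375 KB level-0 dump would have to span. The dump size comes from the Amanda report in this log; the 100 GiB virtual tape size and the helper name tapes_needed are assumptions for illustration only, since the actual tape length configured on tridge is not stated here.)

```python
# Size of the failed level-0 dump of /a, in KB, as reported by Amanda in the log.
DUMP_KB = 233_263_375

# ASSUMPTION: a hypothetical virtual tape length of 100 GiB, expressed in KB.
# The real tapetype length on the backup server is not given in the log.
ASSUMED_TAPE_KB = 100 * 1024 * 1024


def tapes_needed(dump_kb: int, tape_kb: int) -> int:
    """Return the minimum number of tapes of size tape_kb that hold dump_kb.

    Uses integer ceiling division, so no floating-point rounding is involved.
    """
    if dump_kb < 0:
        raise ValueError("dump_kb must not be negative")
    if tape_kb <= 0:
        raise ValueError("tape_kb must be positive")
    return (dump_kb + tape_kb - 1) // tape_kb


if __name__ == "__main__":
    count = tapes_needed(DUMP_KB, ASSUMED_TAPE_KB)
    print(f"/a would span at least {count} virtual tapes of the assumed size")
```

(Under that assumed tape size, a setting such as runtapes would presumably need to be at least this count, before any per-tape overhead; treat the number as an estimate, not a config recommendation.)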
[13:57:45] "There are TWO parameters that must be set so that the archive spans [13:57:48] multiple virtual tapes. " [13:58:07] probably, I don't really know what is in there [13:58:12] i've just been told to back it up [13:58:14] "Next, the parameter tape_splitsize must be set in the dumptypeamanda.conf, runtapes must be set to an appropriate value > 1. | [13:58:17] configuration in amanda.conf [13:58:46] http://www.backupcentral.com/phpBB2/two-way-mirrors-of-external-mailing-lists-3/amanda-20/dump-larger-than-available-tape-space-68176/ [14:01:32] ok... [14:02:26] would virtual tape size mean the size of the backup? or the size of all of the available tapes? [14:04:27] http://wiki.zmanda.com/index.php/How_To:Set_Up_Virtual_Tapes [14:04:37] (amanda = software, zmanda = company) [14:10:27] still not sure what I should do about this, I don't really feel comfortable changing backup server configs, and as far as I can tell this has to do with reconfiguring the backup server tape configs? [14:17:18] !log storage3 dist-upgrade and reboot [14:17:26] Logged the message, Master [14:19:40] ottomata: can you find out the answer to the "how much do they need" question? [14:20:03] ok, will ask them [14:20:04] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100% [14:26:07] New patchset: Krinkle; "(bug 37304) Set $wgTranslateDisablePreSaveTransform = true;" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11155 [14:26:15] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11155 [14:27:38] it.wiki seems to have died [14:31:19] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [14:34:37] PROBLEM - SSH on storage3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:34:55] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:35:04] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: (Return code of 255 is out of bounds) [14:35:22] PROBLEM - MySQL disk space on storage3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:35:24] Vito_away: works for me [14:35:31] PROBLEM - mysqld processes on storage3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:38:22] Change abandoned: Hashar; "Thanks MaxSem :)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7773 [14:38:40] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100% [14:43:28] RECOVERY - SSH on storage3 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:43:37] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [14:43:46] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay seconds [14:44:13] RECOVERY - MySQL disk space on storage3 is OK: DISK OK [14:46:02] !log lowering ttl for virt0 [14:46:08] Logged the message, Master [14:55:01] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [15:08:26] New patchset: Hashar; "(bug 37545) farsi: change default Collection namespace" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11161 [15:08:32] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11161 [15:27:40] New patchset: Dzahn; "disable shell access for raindrift per RT-3088" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11165 [15:28:04] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11165 [15:28:59] !log Unstuck torrus [15:29:06] Logged the message, Master [15:30:25] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11165 [15:30:28] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11165 [15:53:24] New review: Dzahn; "does "Newer changeset added in." mean RT-2512 is resolved?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3238 [16:12:53] !log adding gerrit@wikimedia.org to accepted nonmembers of mediawiki-cvs list [16:12:58] Logged the message, Master [16:13:50] mutante: what's mediawiki-cvs? [16:14:14] huh [16:14:26] jeremyb: "This list sends out notifications of commits to MediaWiki's CVS repository" (yes, its gonna be renamed) [16:14:27] it exists on lists.wm.o [16:14:45] a few decades out of date? ;) [16:14:48] yes:) [16:14:57] what it just getting no traffic from gerrit until now? [16:15:19] yea, demon is enabling it [16:16:40] i wonder what wikimedia-commits is [16:16:46] cmjohnson1: i cant really see anyone having a problem with it [16:17:04] cmjohnson1: we cant do more than you anyways as mgmt itself is like broken [16:17:44] New patchset: Hashar; "tweak logging for wmflabs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11173 [16:17:50] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11173 [16:17:52] how to get a quiet afternoon: don't launch your IRC client :-] [16:18:17] heh! 
,,,:p [16:18:20] Logged the message, Master [16:19:02] !log shut down sq33 [16:19:07] Logged the message, Master [16:19:17] yea, that order may look weird now, but i just did :) [16:21:19] PROBLEM - Host sq33 is DOWN: PING CRITICAL - Packet loss = 100% [16:22:54] New review: Hashar; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11173 [16:22:56] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11173 [16:24:34] Logged the message, Master [16:28:31] PROBLEM - Backend Squid HTTP on sq48 is CRITICAL: Connection refused [16:28:49] PROBLEM - Frontend Squid HTTP on sq48 is CRITICAL: Connection refused [16:32:36] New patchset: Demon; "Allow some logs to supress comment-added notifications" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11176 [16:33:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11176 [16:34:49] RECOVERY - Host sq33 is UP: PING OK - Packet loss = 0%, RTA = 2.78 ms [16:38:10] Logged the message, Master [16:38:16] PROBLEM - Frontend Squid HTTP on sq33 is CRITICAL: Connection refused [16:38:52] PROBLEM - Backend Squid HTTP on sq33 is CRITICAL: Connection refused [16:43:31] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100% [16:44:52] RECOVERY - Frontend Squid HTTP on sq48 is OK: HTTP OK HTTP/1.0 200 OK - 604 bytes in 0.003 seconds [16:45:55] RECOVERY - Backend Squid HTTP on sq48 is OK: HTTP OK HTTP/1.0 200 OK - 459 bytes in 0.005 seconds [16:46:10] !log changing virt0's ip address and vlan [16:46:15] Logged the message, Mistress of the network gear. [16:46:40] PROBLEM - Host virt0 is DOWN: PING CRITICAL - Packet loss = 100% [16:47:30] LeslieCarr: you killed our sessions!! 
;-] [16:48:34] New patchset: Ryan Lane; "Changing virt0's address" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11178 [16:49:04] PROBLEM - Host labsconsole.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [16:49:15] !log restarting mysql on virt0 with correct bind address [16:49:20] Logged the message, Master [16:49:35] hashar: sorry [16:49:59] but now our ip address utilization is better [16:50:03] due to switching ip ranges [16:50:08] * Ryan_Lane sighs [16:50:14] labsconsole is now broken [16:50:14] feel the ipv4 optimization! [16:50:53] might be because we used the IP for virt0 instead of the DNS entry? [16:51:12] no. mysql was bound to the wrong ip [16:51:19] and the /etc/hosts file had the wrong ip too [16:51:22] working now [16:51:55] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11178 [16:51:59] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11178 [16:52:02] hey mark, can we start installing precise on stat1? [16:56:13] drdee: can we do that tomorrow? 
[16:56:28] also I thought that erik z requested to start on stat1001 first [16:57:06] mark: i am giving up on this project [16:57:18] i have tried everything to do this in a reasonable time [16:57:25] i don't care about it [16:58:40] PROBLEM - Puppet freshness on es1004 is CRITICAL: Puppet has not run in the last 10 hours [16:59:29] !log restarting opendj on virt0 [16:59:34] Logged the message, Master [16:59:36] me and andrew have spent on and off for about 6 weeks to get precise installed on stat1, we don't need it, we just want to help out but we cannot keep waiting for this forever [17:00:19] we can just set a time for it [17:00:25] "can we do it now" doesn't always work so well [17:00:40] especially not when I'm about to go off for dinner and a soccer match :P [17:01:43] I don't care about that box either, but it's gonna be even bumpier if we don't do it now [17:02:16] RECOVERY - Backend Squid HTTP on sq33 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.010 seconds [17:02:57] mark: anyways [17:03:01] RECOVERY - Frontend Squid HTTP on sq33 is OK: HTTP OK HTTP/1.0 200 OK - 27546 bytes in 0.006 seconds [17:03:04] let me know when it is done [17:04:50] mark, how are we supposed to set a time for it? [17:04:54] do we just set one and someone will do it? [17:05:08] we need to talk to you guys to set a time, which I guess is what we are trying to do [17:05:17] I tried to get someone to respond to this ticket to set a time 6 weeks ago as drdee says [17:05:24] but no one seemed interested, so we gave up [17:05:33] now it is back (and we do kinda need it this time) [17:05:48] so yesterday we got someone to plug in a USB drive so we could back up our data [17:05:51] this is done [17:05:56] so now the machine is sitting there [17:06:01] inactive until someone can do the reinstall [17:06:04] cmjohnson1: what's up? 
[17:07:45] cmjohnson1: "host search32.mgmt.pmtpa.wmnet" on any server [17:08:13] ottomata: i can do it tomorrow, that's no problem [17:08:38] cmjohnson1: what mark said, but yeah: 10.1.4.43 [17:08:50] ok, thanks mark, drdee, is that ok? tell erik? [17:09:04] my impression was that erik z was nervous about it [17:09:13] I don't know what the deal is with that [17:09:20] we should be totally backed up with what he is nervous about [17:09:24] alright [17:09:26] but, erik z has suspended his work [17:09:27] then i'll do it tomorrow [17:09:31] until this is done [17:09:44] mark: want me to grab it ? [17:09:49] if you want [17:09:57] but whoever drdee talked to yesterday said that it'd be done in 24 hours I guess, but whatevs [17:10:00] just needs a reinstall, with precise, with LVM partitioning [17:10:04] oh cool, thanks Leslie! [17:10:06] cool [17:10:08] (can be done manual, as long as data ends up in LVM LVs) [17:10:18] and there's data attached to an external drive [17:10:23] that needs to be copied back after the install [17:10:25] and puppet runs, etc [17:10:28] i can do that part [17:10:29] ottomata has all the details [17:10:32] right [17:10:33] so format the usb drive and get cmjohnson1 to destroy it [17:10:35] got it ;) [17:10:38] haha [17:11:12] actually cmjohnson1 can you remove the usb drive from stat1 for now ? i just want to be uber paranoid about not destroying anything [17:11:16] LeslieCarr: you know how precise installs work? [17:11:18] by default it'll do lucid [17:11:25] you need to add two lines to the node entry in dhcpd.conf [17:11:33] ok [17:11:47] check e.g. the lvs servers for examples [17:11:52] those have been reinstalled with precise last week [17:13:31] LeslieCarr: also, make sure you don't fill the entire logical volume with LVs [17:13:38] always good to keep at least 10-20% of free space [17:13:50] allocate what's needed, not what's available :) [17:14:06] that allows for LVM snapshots and "oops, disk ran out of space" quick fixes... 
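(Aside on the LVM advice above: a minimal sketch of the "keep at least 10-20% of the volume group free" rule as a sanity check on a partitioning plan. The function name check_lv_plan, the volume names, and all sizes are hypothetical; nothing here reflects the actual disk layout of stat1.)

```python
def check_lv_plan(vg_size_gb, lv_sizes_gb, reserve_fraction=0.2):
    """Return the free space (GB) left in the volume group after allocation.

    Raises ValueError if the planned logical volumes would leave less than
    reserve_fraction of the volume group unallocated, i.e. too little room
    for snapshots or for growing an LV that ran out of space.
    """
    if not 0 <= reserve_fraction < 1:
        raise ValueError("reserve_fraction must be in [0, 1)")
    allocated = sum(lv_sizes_gb.values())
    free = vg_size_gb - allocated
    reserve = vg_size_gb * reserve_fraction
    if free < reserve:
        raise ValueError(
            f"plan leaves {free} GB free, below the {reserve} GB reserve"
        )
    return free


if __name__ == "__main__":
    # ASSUMPTION: illustrative sizes only, for a hypothetical 1000 GB volume group.
    plan = {"root": 50, "home": 200, "a": 500}
    print(check_lv_plan(1000, plan), "GB left unallocated")
```

(The point of checking against the volume group rather than the disk is the one made in the channel: allocate what is needed, and leave the rest unassigned so it can be handed out later.)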
[17:14:25] * Damianz seds entire logical volume to entire volume group and returns to a happy place [17:15:00] yes, entire volume group [17:17:17] thanks leslie! [17:18:40] thanks cmjohnson1 [17:20:12] LeslieCarr: can you approve this for me — https://gerrit.wikimedia.org/r/#/c/9640/ [17:20:22] New review: preilly; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/9640 [17:20:45] in about 5 minutes or so [17:21:10] Ryan_Lane: you around? [17:21:12] New patchset: Lcarr; "Switching stat1 to precise install" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11186 [17:21:17] yes. not for long, though [17:21:24] and am in the middle of changing the ip for virt0 [17:21:26] Ryan_Lane: can you approve https://gerrit.wikimedia.org/r/#/c/9640/ [17:21:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11186 [17:21:46] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11186 [17:21:49] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11186 [17:23:14] hm [17:23:38] goodbye stat1, long live stat1 ! [17:23:46] !log rebooting stat1 for wipe and reinstall into precise [17:23:51] Logged the message, Mistress of the network gear. [17:24:10] !log restarting gerrit [17:24:14] Logged the message, Master [17:24:30] really need to sort out our LDAP situation [17:25:13] odd. virt0 is refusing ldap connections? [17:25:22] crap [17:25:24] I know why [17:25:33] Isn't that the only ldap server currently? :( [17:25:50] yes. 
as I said, we really need to sort that out :) [17:26:18] PROBLEM - Host stat1 is DOWN: CRITICAL - Host Unreachable (208.80.152.146) [17:26:54] LeslieCarr: the backups arent fixed yet, they need /a [17:27:32] New patchset: Ryan Lane; "Fixing virt0's address in another spot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11187 [17:28:19] mutante: oh ? [17:28:25] is this going to be a problem? [17:28:30] RoanKattouw: can you come to my desk for a minute [17:28:42] Sure [17:28:57] LeslieCarr: i hope not but i pointed out to otto to check backups first and he has been told to backup /a and that did NOT work [17:29:04] LeslieCarr: it might be :/ [17:29:14] !log restarting opendj again on virt0 [17:29:19] Logged the message, Master [17:29:51] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11187 [17:30:01] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11187 [17:30:06] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11187 [17:30:53] Ryan_Lane: can you approve both https://gerrit.wikimedia.org/r/#/c/10729/ and https://gerrit.wikimedia.org/r/#/c/9640/ [17:31:00] I'm fixing shit right now [17:31:05] you're going to need to get someone else to do it [17:31:51] RECOVERY - Host stat1 is UP: PING OK - Packet loss = 0%, RTA = 2.81 ms [17:32:01] Ryan_Lane: okay [17:32:13] New patchset: Jeroen De Dauw; "Updated location of wikidata repositories" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11188 [17:32:17] maplebed: you around? [17:32:24] yeah. [17:32:35] maplebed: can you approve both https://gerrit.wikimedia.org/r/#/c/10729/ and https://gerrit.wikimedia.org/r/#/c/9640/ ? [17:32:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11188 [17:33:50] maybe. [17:34:09] cmjohnson1: cool. 
I can do that now [17:34:22] checking. [17:35:01] !log restarting opendj on virt0 again... [17:35:07] Logged the message, Master [17:35:09] PROBLEM - SSH on stat1 is CRITICAL: Connection refused [17:35:13] no clue what's fucking up the nat rules [17:38:27] preilly: no, sorry, I can't. [17:38:36] ah. I think I know the problem [17:38:50] New patchset: Pyoungmeister; "updating mac for search32" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11190 [17:39:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11190 [17:39:42] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11190 [17:39:44] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11190 [17:40:04] cmjohnson1: updated [17:41:12] yep! definitely [17:43:31] ottomata: for stat1 is it a 2 partition (need separate /a) or 1 parittion ? [17:43:53] there we go. virt0 is finally good :) [17:45:46] separate /a [17:45:57] /a was quite large [17:46:54] /a on LVM then eh [17:49:39] cmjohnson1: rlly? [17:49:51] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [17:49:55] Ryan_Lane: ns2 is taking longer than the other 2 did. [17:49:57] i logged in earlier with the usual drac password [17:50:05] it's dns [17:50:11] * Ryan_Lane sighs [17:50:12] it was set to pxeboot which caught me by surprise [17:50:15] now they're all failing fast [17:52:15] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [17:53:36] PROBLEM - NTP on stat1 is CRITICAL: NTP CRITICAL: No response from NTP server [17:55:24] Ops folks, why u no review one line change? https://gerrit.wikimedia.org/r/#/c/11188/1 :) [17:56:11] and then some reviews for me after JeroenDeDauw ? 
:) [17:56:13] In most cases you have to ask [17:56:29] Reedy: that's his way of asking [17:56:48] lol [17:56:59] 50% of my unsubmitted commits are in the puppet repo [17:57:45] because you didn't deliver whiskey …. that's teh rule [17:57:48] whiskey for commits [17:58:04] LeslieCarr: i think Reedy offered whiskey? [17:58:29] Ooh whiskey [17:58:35] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11188 [17:58:38] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11188 [17:58:38] * Damianz hugs his bottle of laphroaig [17:58:57] LeslieCarr: https://gerrit.wikimedia.org/r/#/c/7823/ [17:59:01] I brought Lebkuchen! [17:59:13] LeslieCarr: https://twitter.com/tehreedy/status/207495457088352256 [17:59:33] shit, Reedy has me for another 15 reviews :) [17:59:44] this is why: if you review one you get more /me is seriously out at 8.pm /away [17:59:47] xD [18:00:03] AaronSchulz: asher had commented on that [18:00:56] well we already use the flags in that patch elsewhere [18:01:09] not sure I want to experiment with others [18:01:21] New patchset: Ryan Lane; "Fixing virt0's address in the recusor" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11193 [18:01:24] * AaronSchulz can't wait for git-deploy to replace all this stuff anyway [18:01:45] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11193 [18:02:23] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11193 [18:02:26] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11193 [18:03:03] LeslieCarr: 3 out of 4 are still waiting: http://lists.wikimedia.org/pipermail/wikitech-l/2012-June/060915.html [18:03:19] let me see if they're all still a clean merge on master [18:03:24] or production rather [18:03:49] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/9640 [18:03:51] i need to fix up stat1 first, then i can check some other stuff out [18:03:57] RECOVERY - SSH on stat1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [18:05:00] LeslieCarr: well i'm double checking the mergeability. poke me if you have questions [18:05:45] LeslieCarr: are you still really busy? [18:06:12] yeah, stat1 has finally rebooted, need to fix it up first [18:06:40] can anybody in operations take a look at: https://gerrit.wikimedia.org/r/#/c/9640/ and https://gerrit.wikimedia.org/r/#/c/10729/2 ? [18:06:45] !log db1025 dist-upgrade & reboot [18:06:49] Logged the message, Master [18:07:45] cmjohnson1: can you plug the usb key back in ? [18:07:49] to stat1 ? [18:08:00] PROBLEM - Host db1025 is DOWN: PING CRITICAL - Packet loss = 100% [18:08:48] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/10729 [18:10:00] New review: SPQRobin; "Everything in MediaWiki properly redirects, so the only visible difference should be the lang attrib..." 
[operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/10707 [18:10:42] RECOVERY - Host db1025 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [18:11:48] thanks :) [18:13:53] so I guess no one can help me with the changes that I need https://gerrit.wikimedia.org/r/#/c/9640/ and https://gerrit.wikimedia.org/r/#/c/10729/2 ? [18:14:28] PROBLEM - mysqld processes on db1025 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [18:14:36] RECOVERY - NTP on stat1 is OK: NTP OK: Offset 0.1059136391 secs [18:14:47] LeslieCarr: I guess I'll have to poke asher more...you're no fun! [18:15:48] AaronSchulz: what wa the path / name of the metric for purge timings? [18:16:33] wmfOnLocalFilePurgeThumbnails-list [18:16:44] though lately, there are no data points for some reason [18:18:22] ottomata: hey [18:18:30] stat1 should be ready for you to copy data back over [18:18:39] woot woot! thanks LeslieCarr! [18:18:50] we can expand /a if needed [18:19:49] ottomata: please let me know when you're done and if you see any problems [18:20:06] LeslieCarr: hi! [18:20:10] LeslieCarr: question for you [18:20:28] I want to setup a python/django tool that will give us an overview of the infrastructure [18:20:33] and possibly package updates too [18:20:37] :) [18:20:39] that uses the puppet db [18:20:43] and then you give me a pony ? [18:20:55] I'm looking for a machine to install it into [18:21:02] I was told that neon might be suitable [18:21:26] and that I should talk with you :) [18:21:58] hehehe neon should be (for some reason icinga borked, need to look at why), it can be a bit cpu intensive but memory and disk are totally open [18:22:01] so no one is willing or able to look at https://gerrit.wikimedia.org/r/#/c/9640/ and https://gerrit.wikimedia.org/r/#/c/10729/2 ? [18:22:25] preilly: looking [18:22:36] LeslieCarr, thanks so much! 
looking now [18:23:19] paravoid: thanks [18:24:04] eh, 10729 sounds wrong to me [18:24:23] so, we're trusting arbitrary X-F-F just because you happen to have Opera in your UA string? [18:24:38] is this restricted to certain IP ranges? [18:24:43] paravoid: yes [18:24:47] (I presume this is for the Opera Mini acceleration) [18:25:17] paravoid: it's because Opera Mini uses a proxy and we need to get the actual client IP address for detection [18:25:26] * jeremyb wonders if CU stores both XFF IP and the real remote [18:25:28] of the carrier network [18:25:34] right, I got that [18:25:45] but is this restricted to the Opera Mini proxies IPs? [18:26:12] this is a change that is already live [18:26:19] I just moved it up a few lines [18:26:25] saw that too [18:26:25] as it was occurring too late [18:26:28] RECOVERY - Host virt0 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [18:26:36] RECOVERY - Host labsconsole.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [18:26:39] but since I'm here, I thought I should ask [18:26:46] I understand [18:27:19] maplebed: anyway, the graph doesn't seem to be any better [18:28:05] these are the IPs for Opera Mini 195.189.142.0"/23; "91.203.96.0"/22; "80.239.242.0"/23; "217.212.230.0"/23; 141.0.8.0"/21; "82.145.208.0"/20; [18:28:25] we could create another ACL and match those on the Opera Mini requests [18:28:54] right [18:29:21] but I don't really want to do that right this second [18:29:22] no need for a UA match either [18:29:26] okay [18:29:31] I'll push this now [18:29:42] can you just merge [18:29:43] how do you prefer to track that thing? should I open a bugzilla bug for you? [18:29:46] or an RT? [18:29:47] and hold off on the push [18:29:56] open a RT ticket for it [18:30:12] we shouldn't merge but not push stuff. [18:30:23] it lays a trap. [18:30:36] New review: Faidon; "The stanza move looks good, however trusting X-F-F just because the UA says "Opera" is certainly wro..." 
[operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10729 [18:30:39] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10729 [18:30:45] if it's merged but not pushed, the next thing that needs to get pushed can't without pushing that too. [18:30:57] unless you go and stop puppet on all the mobile varnish servers. [18:31:13] there's only 4 of them! ;P [18:31:18] maplebed: right [18:31:34] maplebed: no I just have another change that I want to make after these two are merged [18:31:43] maplebed: I just wanted a minute to do that is all [18:32:21] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/9640 [18:32:24] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/9640 [18:32:35] preilly: go ahead :) [18:32:35] just make the change and then merge them all together! that's much better than putting a global lock on all puppet changes until you're done. [18:33:22] okay [18:33:39] RECOVERY - mysqld processes on db1025 is OK: PROCS OK: 1 process with command name mysqld [18:37:04] paravoid: actually those two can be pushed [18:37:09] sorry for the confusion [18:37:17] is it opera mini or opera mobi? [18:37:23] the thing with the "accelerating" proxies? [18:37:33] LeslieCarr, I think it is good [18:37:35] I am restoring the data now [18:38:00] cool [18:40:08] New patchset: Faidon; "mobile: trust X-F-F only for Opera Mini subnets" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11197 [18:40:22] preilly: want to review ^? mind if I push that too with your changes? [18:40:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11197 [18:40:41] or do you prefer messing with it later? 
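[editor's note] The point raised in the review above is that X-Forwarded-For should be trusted based on where the request comes from, not on what the User-Agent claims. A minimal sketch of that idea, using the Opera Mini ranges quoted in the channel, might look like this. It is not the VCL from change 11197; the function name and the choice of the rightmost X-F-F entry (the address the directly connected proxy reported) are assumptions made for illustration.

```python
import ipaddress

# Ranges as quoted in the discussion; treat them as a snapshot, not a current list.
OPERA_MINI_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "195.189.142.0/23",
        "91.203.96.0/22",
        "80.239.242.0/23",
        "217.212.230.0/23",
        "141.0.8.0/21",
        "82.145.208.0/20",
    )
]


def effective_client_ip(remote_addr, x_forwarded_for=None):
    """Return the IP to use for carrier detection.

    X-Forwarded-For is honoured only when the TCP peer (remote_addr) is
    inside one of the trusted proxy networks; the User-Agent plays no role.
    """
    peer = ipaddress.ip_address(remote_addr)
    trusted = any(peer in net for net in OPERA_MINI_NETWORKS)
    if not (trusted and x_forwarded_for):
        return str(peer)

    # Assumption: use the last entry, i.e. the one appended by the trusted proxy.
    candidate = x_forwarded_for.split(",")[-1].strip()
    try:
        return str(ipaddress.ip_address(candidate))
    except ValueError:
        # Malformed header: fall back to the peer address.
        return str(peer)


if __name__ == "__main__":
    print(effective_client_ip("82.145.209.10", "10.1.2.3, 41.66.1.5"))
    print(effective_client_ip("203.0.113.7", "41.66.1.5"))
```

A request from outside the listed ranges keeps its own peer address even if it sends an X-Forwarded-For header, which is the behaviour the reviewer asked for.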
[18:41:21] New review: preilly; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/11197 [18:41:50] paravoid: nope that looks great and is exactly what I would have switched it to [18:43:03] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11197 [18:43:06] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11197 [18:43:34] both of them merged [18:43:40] paravoid: okay great [18:43:40] er, make that all 3 of them [18:43:45] paravoid: are you going to push it now [18:44:00] what do you mean? [18:44:11] I pushed/merged [18:44:18] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [18:44:20] paravoid: I mean push it live to sock puppet [18:44:31] I did [18:44:39] paravoid: oh okay sorry for the confusion [18:46:42] Getting https://gerrit.wikimedia.org/r/#/c/7820/ https://gerrit.wikimedia.org/r/#/c/7823/ https://gerrit.wikimedia.org/r/#/c/7831/ dealt with would be great if someone can. Need to do a cleanup revision to tidy up the debian specific stuff, but that can come in the near future... [18:49:28] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7820 [18:49:31] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7820 [18:49:59] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7823 [18:50:02] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7823 [18:50:10] Yaaay :D [18:50:29] I think there are probably a lot of scripts doing bad things.. 
[18:51:48] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7831 [18:51:50] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7831 [19:06:57] PROBLEM - Host srv190 is DOWN: PING CRITICAL - Packet loss = 100% [19:07:07] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10688 [19:07:09] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/10688 [19:08:07] New patchset: ArielGlenn; "unused thumb purge helper (generates lists for purge)" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/11202 [19:08:40] ok *now* off the clock for real sheesh [19:26:44] !log drop table enwiki.exlogging [19:26:48] Logged the message, Master [19:28:22] !log converted enwiki.interwiki to innodb [19:28:27] Logged the message, Master [19:29:21] !log drop table enwiki.trackbacks [19:29:26] Logged the message, Master [19:31:50] New patchset: Aaron Schulz; "Added a generic debugging UDP log." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11203 [19:31:56] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11203 [19:32:20] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11203 [19:32:22] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11203 [19:36:58] PROBLEM - Apache HTTP on mw2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:58] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:58] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:16] PROBLEM - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:38:19] RECOVERY - Apache HTTP on mw2 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.334 second response time [19:38:19] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.649 second response time [19:38:28] PROBLEM - Apache HTTP on mw5 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:38:46] RECOVERY - LVS HTTP IPv4 on foundation-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 38971 bytes in 8.966 seconds [19:39:49] RECOVERY - Apache HTTP on mw5 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.573 second response time [19:39:49] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.088 second response time [19:41:19] PROBLEM - Apache HTTP on srv203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:40] RECOVERY - Apache HTTP on srv203 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.328 second response time [19:47:28] PROBLEM - Apache HTTP on srv203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:47:48] heya RobH / woosters / whoever -- any update on the status of 
the Dell analytics boxes we ordered back in March/April? [19:48:13] PROBLEM - Memcached on srv203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:49:34] RECOVERY - Memcached on srv203 is OK: TCP OK - 0.109 second response time on port 11000 [19:50:28] RECOVERY - Apache HTTP on srv203 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.305 second response time [19:50:52] LeslieCarr: maybe just /away could do? :) [19:51:44] !log payments cluster dist-upgrades & reboots [19:51:48] i was cycling through all my nicks to make sure they didn't get cleaned up in the big purge this weekend [19:51:48] Logged the message, Master [19:57:13] srv203 hanging sync [19:57:31] PROBLEM - MySQL Slave Delay on es1004 is CRITICAL: CRIT replication delay 306 seconds [19:57:40] PROBLEM - SSH on srv203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:59:01] RECOVERY - SSH on srv203 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:59:10] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.2.203:11000 (Connection timed out) [19:59:31] Reedy: https://gerrit.wikimedia.org/r/#/c/8348/ [20:00:02] Ya? [20:02:27] oh do you still want it ? [20:02:51] going through old changes in production/puppet … but i figured i'd ask before merging anything [20:03:31] RECOVERY - MySQL Slave Delay on es1004 is OK: OK replication delay 0 seconds [20:03:35] We'll need it in future... When https://gerrit.wikimedia.org/r/#/c/7826/ is actually in use [20:04:09] is it ok to merge now or should it be on hold ? 
[20:04:31] Presumably having a module loaded but not used is maybe only going to use some more memory [20:04:36] Probably safer just to keep it on hold for now [20:04:56] New review: Lcarr; "on hold until https://gerrit.wikimedia.org/r/#/c/7826/ is actually in use" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/8348 [20:07:27] SimpleSurvey (Version 0.1) (1cd3d04) Nimish Gautam [20:07:33] how can I figure out when this was installed on enwiki? [20:08:07] Why? [20:08:10] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [20:08:15] Around the time of the usability initiative [20:08:44] well I just noticed it on the special pages, but http://en.wikipedia.org/wiki/Special:SimpleSurvey is empty - so I wanted to know if it's got a use in the future, or if it had a use? [20:09:31] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10226 [20:09:33] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10226 [20:10:02] I don't think it was every directly acessible via that [20:10:33] I think it's in a stupid extension dependancy chain [20:10:45] ottomata: https://gerrit.wikimedia.org/r/#/c/10958/ ? still needed ? ok to merge? [20:12:13] Thehelpfulone: articlefeedback includes it [20:12:14] for one [20:12:32] ok [20:12:59] PROBLEM - Apache HTTP on srv203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:14:22] yes please! [20:14:34] LeslieCarr: yes please merge it! [20:14:38] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/10958 [20:14:41] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/10958 [20:14:41] cool, merging [20:14:47] danke! [20:14:49] anyone else have any gerrit review needs ? 
[20:15:15] this one is old [20:15:16] https://gerrit.wikimedia.org/r/#/c/9628/ [20:15:19] someone asked if I could do that [20:15:20] so I did [20:15:35] ah i checked out the RT, need to get an ok from CT [20:16:10] reedy@fenari:/var/log/mw$ du --si fatal.log [20:16:10] 34G fatal.log [20:16:28] I've got a revision adding logrotate to that file... [20:16:31] LeslieCarr: thanks for the review! [20:17:05] np jeremyb-phone [20:17:07] ooo [20:17:11] yes logrotate would be good [20:17:13] let me find that [20:17:19] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.2.203:11000 (Connection timed out) [20:18:24] https://gerrit.wikimedia.org/r/#/c/6061/ [20:18:40] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [20:18:48] I somewhat guessed with the settings [20:18:59] RECOVERY - Apache HTTP on srv203 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.486 second response time [20:19:51] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6061 [20:19:54] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6061 [20:20:02] wow the merge worked :) [20:20:08] i was waiting for the fail :) [20:20:38] nice [20:20:59] does anyone have experience with redis? [20:21:56] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5492 [20:23:10] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.2.203:11000 (Connection timed out) [20:23:46] PROBLEM - Apache HTTP on srv203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:55] New patchset: Lcarr; "Merge commit 'refs/changes/92/5492/1' of https://gerrit.wikimedia.org/r/operations/puppet into production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11207 [20:24:18] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11207 [20:24:28] Change abandoned: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11207 [20:27:58] RECOVERY - Apache HTTP on srv203 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.750 second response time [20:32:01] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [20:35:55] PROBLEM - udp2log log age for oxygen on oxygen is CRITICAL: CRITICAL: log files /var/log/squid/orange-ivory-coast.log, /var/log/squid/digi-malaysia.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. [20:46:58] ottomata: is stat1 happy ? [20:48:10] looks good [20:48:13] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.2.203:11000 (Connection timed out) [20:48:13] its copying data over now [20:48:17] will take a while [20:48:57] ok, i'm closing 2946, let me know if you need more help with anything [20:49:07] cool, thank you so so sos much [20:49:13] we were waiting forever on that one [20:49:22] i've got a couple of oxygen related changes coming up [20:49:58] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [20:53:03] New patchset: Ottomata; "filters.oxygen.erb - agh, oxygen logs to a different directory than emery!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11208 [20:53:25] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.2.203:11000 (Connection timed out) [20:53:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11208 [20:54:28] LeslieCarr: we might have a DNS issue with the wmflabs.org domain [20:54:46] https://gerrit.wikimedia.org/r/11208 [20:55:05] LeslieCarr, before you go save the world for hashar, quick approve of that one please! 
[20:55:05] hashar: again? [20:55:08] LeslieCarr: chrismcmahon can not resolve ee-prototype.wikipedia.beta.wmflabs.org . I have noticed wmflabs.org has LABSCONSOLE.WIKIMEDIA.ORG and VIRT0.WIKIMEDIA.ORG [20:55:08] i had a booboo [20:55:22] I can't reproduce the issue from DNS resolver though [20:56:41] google resolver works though. So must be some issue at chrismcmahon resolver ( google resolve: dig ee-prototype.wikipedia.beta.wmflabs.org @8.8.8.8 ) [20:56:48] * ^demon hands ottomata a band-aid. [20:59:41] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11208 [20:59:43] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11208 [20:59:55] have a good night :) [21:00:18] apergos: are you about? if so, can you upload some mediawiki tarballs for me please? [21:00:24] ottomata: do you want that pushed to oxygen? [21:00:27] seems pretty clearly not chris's problem to me: http://dpaste.com/759290/plain/ [21:01:07] maplebed, yes please! [21:01:20] Ryan_Lane: LeslieCarr: ping [21:01:57] jeremyb: pong [21:02:17] ottomata: puppet successfully run. Do you recall whether I need to kick udp2log? [21:02:43] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [21:02:55] i think it subscribes [21:02:56] so no [21:03:01] but i got more changes coming in in a minute [21:03:01] RECOVERY - udp2log log age for oxygen on oxygen is OK: OK: all log files active [21:03:05] for logrotate [21:03:20] ungh, too bad log directories aren't consistent [21:03:23] arrghhh [21:03:25] LeslieCarr: your vlan change broke DNS, ryan fixed it, now it's broke approximately the same way again i think. (labs) [21:03:39] LeslieCarr: 13 21:00:27 < jeremyb> seems pretty clearly not chris's problem to me: http://dpaste.com/759290/plain/ [21:04:19] things breaking again after you fixed them usually means puppet i thought? 
;) [21:05:06] hehe [21:05:18] !g 11187 [21:05:18] https://gerrit.wikimedia.org/r/11187 [21:05:20] that's the fix [21:05:27] reedy. I think for everything after midnight you owe me a drink :-P [21:05:41] It's only 10pm here! [21:05:42] * Reedy grins [21:05:44] apergos: have a scorekeeper? [21:05:52] so this looks different, this looks like dns failing or something on virt0.... [21:06:09] andrewbogott: pinging you as you are knoweldgeable in the ways of labs (as i look as well) [21:06:15] don't make me take a screenshot of my desktop :-P [21:06:31] LeslieCarr: well earlier and again now all 3 auth NS returned NXDOMAIN for bastion.wmflabs.org [21:07:04] LeslieCarr: Catching up... [21:07:07] chrismcmahon: are you paying attention? [21:07:22] so where are they? [21:07:34] after 00:30 is two drinks! [21:07:38] the drinks? i think there's some latency [21:07:42] oh, the tarballs [21:07:45] no, the tarballs [21:07:46] right [21:07:59] drinks are much harder to serialize [21:08:01] http://noc.wikimedia.org/~reedy/upload-1.17.5.tar http://noc.wikimedia.org/~reedy/upload-1.18.4.tar http://noc.wikimedia.org/~reedy/upload-1.19.1.tar [21:08:08] jeremyb: you're just not trying hard enough [21:08:24] speaking of which... anyone know protobuf? [21:08:34] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.2.203:11000 (Connection timed out) [21:09:05] jeremyb: thanks, seems like that would be it [21:10:19] hhmm [21:10:27] that LVS alert for foundation-lb before [21:10:32] should have paged me, shouldn't it? [21:10:58] i think other things like DB masters are supposed to page but don't? [21:11:04] doh, anyone checking out srv203 ? [21:11:12] no 20, right? [21:11:25] paravoid: maybe not for a <90 sec outage? maybe after 3 mins or something? 
[21:11:25] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [21:11:37] Reedy: ^ [21:11:51] Yup, no 20 [21:11:59] check please [21:12:01] we're still iterating 20 on the site [21:12:09] that they are the way you want them [21:12:48] yup, looks good as usual! :) Thanks! [21:12:59] 20 what? [21:13:09] virgins [21:13:10] mediawiki 1.20 [21:13:48] (or 1.20.x) [21:13:58] ah [21:13:58] <^demon> 1.20wmfX [21:14:01] huh I know we're in 1.20wmf4 and yet in my mind I'm "that means there's a release, right?" :-/ [21:14:16] apergos: 1.20wmf5* [21:14:20] <^demon> apergos: 1.20.0 final isn't slated til fall-ish. [21:14:22] it's the election propaganda rotting my brain [21:14:29] 5? we're already at 5? [21:14:43] indeed [21:14:46] great [21:14:58] check your SAL ;) [21:14:59] everything but wikipedias are on wmf5 [21:15:05] every 2 weeks we go up var!! [21:16:04] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.2.203:11000 (Connection timed out) [21:16:18] i'm going to replace srv203 with another memcached [21:17:40] Time for some email spammage [21:17:53] and people replying OMG UNSUBSCRIBE MEH!!!oneone1! [21:18:51] LeslieCarr: Ok, now that I've caught up, I'm still not totally clear what problem we're talking about :) Failure of ns1.wikimedia.org & co. to resolve? [21:20:03] well right now i'm trying to find a working memcached spare … but yeah it looks like failure of them to resolve wmflabs.org at all .... [21:20:23] andrewbogott: for ns in {0..2}; do printf "ns$ns: %s\n" "$(dig @ns${ns}.wikimedia.org bastion.wmflabs.org 2>&1 | fgrep 'status: ')"; done [21:20:41] should be NXDOMAIN x3 [21:20:56] LeslieCarr, a memcached box should be easy to setup, assuming you have the machine [21:21:01] normally would be something like NOERROR [21:21:15] Platonides: there are spares usually [21:21:20] yeah, trying to find a working spare.... 
[21:21:22] :-/ [21:22:12] * AaronSchulz hands LeslieCarr a consistent hash ring [21:23:34] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [21:23:34] what's going on? [21:23:54] definitely can't reach any labs site from the wmf network, including labsconsole [21:24:23] Eloquence: you probably can. just DNS is broken [21:24:26] Eloquence: looking. [21:24:32] mhh [21:25:30] labsconsole works for me [21:25:33] LeslieCarr: trying to figure out the relevant lines from backlog [21:25:47] LeslieCarr: ns on virt0 is broken you say? [21:26:08] i believe so [21:26:22] did you renumber virt0? [21:26:25] yes [21:26:30] what's in its place now? [21:26:42] blank space that will get reallocated [21:26:59] i can get in fine with the IP hardcoded ;) [21:27:29] hrmmm, seems to be better now? [21:28:08] hrmmm, maybe just my recursor being wonky. AUTH is still broke [21:29:02] yay, we only have one NS for labs [21:29:33] jeremyb: I'm feeling a little dim here, can you explain why a result of your little script should differ from the result for e.g. "dig ns1.wikimedia.org | grep status" ? [21:31:06] so, [21:31:19] LeslieCarr renumbered virt0 [21:31:22] which is present in NS records [21:31:25] !log replacing srv203 with srv250 in memcache rotation since srv203 is broken [21:31:30] Logged the message, Mistress of the network gear. 
[21:31:46] andrewbogott: ns1's resolution ain't the problem [21:31:53] but its A entry is still cached into resolvers such as the one in the wmf office [21:31:59] andrewbogott: it's labs resolution on all 3 that are the problem [21:32:22] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10401 [21:32:23] dig @ns1.wikimedia.org bastion.wmflabs.org != dig ns1.wikimedia.org [21:32:25] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10401 [21:32:32] .wmflabs.org is delegated to virt0.wikimedia.org & labsconsole.wikimedia.org (CNAME to virt0) so at least his is a problem [21:32:36] ok, back [21:32:46] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11003 [21:32:58] paravoid: by NS? [21:33:09] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11147 [21:33:12] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11147 [21:33:13] jeremyb: It was the @ that I was missing; I'm less confused now. [21:33:16] yes. [21:33:42] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11133 [21:33:44] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11133 [21:34:06] so wipe-cache of virt0.wikimedia.org should fix it ? [21:34:13] no [21:34:15] oh [21:34:17] wait a bit more [21:34:23] what was the old IP of virt0? 
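The point about the missing @ can be sketched as a small helper. With @, dig sends the query to that server; without it, the hostname is simply the name being looked up. This is a sketch only: the function names dig_status and check_auth are made up for illustration, and the header format is assumed to be dig's usual ";; ->>HEADER<<- ... status: X, id: N" line.

```shell
#!/usr/bin/env bash

# Pull the response code out of dig's header comment, which looks like:
#   ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 4242
# Reads dig output on stdin, prints e.g. NXDOMAIN or NOERROR.
dig_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p'
}

# Query each listed server directly for one name and print its status.
# "@$ns" selects the server to ask; the name itself is the first argument.
check_auth() {
    local name=$1; shift
    local ns
    for ns in "$@"; do
        printf '%s: %s\n' "$ns" "$(dig "@$ns" "$name" 2>&1 | dig_status)"
    done
}
```

Run against the three servers from the loop above, it would be called as: check_auth bastion.wmflabs.org ns0.wikimedia.org ns1.wikimedia.org ns2.wikimedia.org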
[21:34:27] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11134 [21:34:29] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11134 [21:34:38] oh, should be in my mail [21:34:38] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11086 [21:34:39] 208.80.153.135 i think [21:34:40] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11086 [21:37:07] New patchset: Ottomata; "Setting up udp2log logrotate on oxygen" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11251 [21:37:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11251 [21:37:46] !g 11187 | paravoid [21:37:46] paravoid: https://gerrit.wikimedia.org/r/11187 [21:37:49] there's the IP [21:38:10] (same one i linked before) [21:38:52] - $dns_auth_ipaddress = "208.80.153.135" [21:38:53] + $dns_auth_ipaddress = "208.80.152.32" [21:38:57] yeah yeah found it already [21:39:07] virt0's ttl is 60 [21:39:51] what unit is that? [21:39:56] New patchset: Reedy; "Merge "(bug 37340) Set default search options on vi.wikibooks"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11252 [21:40:06] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11252 [21:40:43] New patchset: Ottomata; "filters.oxygen.erb - moving wikipedia zero filter log files into their own sub directory" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11253 [21:40:53] maplebed, could you approve these two as well please? [21:40:53] https://gerrit.wikimedia.org/r/11251 [21:40:53] https://gerrit.wikimedia.org/r/11253 [21:41:06] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11253 [21:41:19] Reedy: i never figured out fawiki (even after i got him to come to #-tech) [21:41:49] Change abandoned: Reedy; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11252 [21:42:02] hm, mchenry aka recursor1 fails to resolve wmflabs/wmflabs.org [21:42:46] PROBLEM - udp2log log age for oxygen on oxygen is CRITICAL: CRITICAL: log files /a/squid/orange-ivory-coast.log, /a/squid/digi-malaysia.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. [21:42:48] this ;; ANSWER SECTION: [21:42:48] virt0.wikimedia.org. 69315 IN A 208.80.153.135 [21:42:57] argh :) [21:43:01] that's mchenry [21:43:10] what are the actual names for ns[0-2]? [21:43:18] New patchset: Ottomata; "filters.oxygen.erb - moving wikipedia zero filter log files into their own sub directory" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11253 [21:43:22] that's recursor1 [21:43:25] auth dns has nothing to do with this [21:43:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11253 [21:44:11] LeslieCarr: when did you lower the TTL? [21:44:15] yesterday I presume? 
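The number in the second column of that answer is the time the recursor will keep serving the cached record, so it bounds how long the stale address can linger. A rough sketch of the arithmetic follows. The helper names answer_ttl and fmt_hm are invented here, and the 86400 figure is only a hypothesis about what the record's TTL was when it was cached, not something the answer itself proves.

```shell
#!/usr/bin/env bash

# Read dig answer lines ("name. TTL IN A addr") on stdin and print the
# remaining TTL of the first A record.
answer_ttl() {
    awk '$3 == "IN" && $4 == "A" { print $2; exit }'
}

# Format a number of seconds as hours and minutes, e.g. 3660 -> 1h01m.
fmt_hm() {
    printf '%dh%02dm\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 ))
}

line='virt0.wikimedia.org. 69315 IN A 208.80.153.135'
ttl=$(printf '%s\n' "$line" | answer_ttl)

# Time until this cache entry expires on its own.
echo "expires in: $(fmt_hm "$ttl")"

# ASSUMPTION: if the record was cached with a one-day TTL (86400 s),
# this is how long ago the recursor fetched it.
echo "cached for: $(fmt_hm $(( 86400 - ttl )))"
```

Whatever the original TTL was, it cannot have been lower than the 69315 seconds still remaining, which is why a 60 second or one hour TTL does not fit this answer.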
[21:44:38] New patchset: Reedy; "(bug 37456) Enable Narayam on kn.wikisource.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11003 [21:44:44] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/11003 [21:45:00] paravoid: i didn't touch that myself [21:45:01] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11003 [21:45:04] maplebed, lemme know if/when you have a sec to check those out [21:45:08] i think I shoudl baby sit those [21:45:18] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11003 [21:45:21] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11003 [21:45:21] otherwise we might get lots of false alerts about aged logs [21:45:46] things should be better now [21:45:49] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11155 [21:45:51] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11155 [21:46:06] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11144 [21:46:08] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/11144 [21:46:31] paravoid: how so? [21:46:43] I wiped virt0.wikimedia.org from mchenry's cache [21:47:00] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10855 [21:47:00] i mean it's still broke for me i think? 
[21:47:02] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10855 [21:47:05] however, the entry was changed before a full TTL expiry [21:47:14] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/10755 [21:47:14] so random nameservers will be broken for a while [21:47:16] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/10755 [21:47:20] well, labs will be [21:47:30] jeremyb: can you cat /etc/resolv.conf? [21:47:33] no, AUTH NS [21:47:38] and for every NS there do a dig for virt0.wikimedia.org? [21:48:53] oh, huh. auth isn't what i was saying it was [21:49:06] can you do ^^^ for me please? [21:49:42] i guess... i have a local router which is set to my by my NS i think [21:50:39] New review: Reedy; "This won't have the desired effect." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/11161 [21:50:52] "I don't know how common, but Facebook's IPv6 addresses contain "face:b00c". Anomie⚔ 03:02, 13 June 2012 (UTC)" [21:50:52] jeremyb: sed -nr 's/^nameserver //p' /etc/resolv.conf | while read ns; do dig +short virt0.wikimedia.org @$ns; done [21:50:55] aww, cute :) [21:51:07] AaronSchulz: and ours have ed1a [21:51:41] * AaronSchulz doesn't get it [21:52:53] edia? [21:53:04] sans the "wikip" [21:54:42] New review: Reedy; "Why is every man and his dog listed as a reviewer?" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/11082 [21:55:06] New review: Aaron Schulz; "Wait, so who is the dog?" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/11082 [21:55:23] paravoid: http://dpaste.com/759300/plain/ [21:56:06] trace with @? no point [21:56:21] trace does a trace and asks the actual auth NSes [21:56:23] so, what's broken for you? 
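The resolv.conf one-liner above can be stretched into a check that only reports the resolvers still handing out an unexpected address. This is a sketch under assumptions: nameservers and stale_resolvers are illustrative names, and the expected address has to be supplied by the caller (the gerrit diff quoted earlier suggests 208.80.152.32 as the new one).

```shell
#!/usr/bin/env bash

# Print the nameserver addresses from a resolv.conf-style file
# (defaults to /etc/resolv.conf). Commented-out lines are skipped
# because they do not start with the keyword.
nameservers() {
    sed -n 's/^nameserver[[:space:]]\{1,\}//p' "${1:-/etc/resolv.conf}"
}

# For every configured resolver, look up a name and print the resolver
# only if its answer differs from the expected address.
stale_resolvers() {
    local name=$1 expected=$2 conf=${3:-/etc/resolv.conf} ns got
    nameservers "$conf" | while read -r ns; do
        got=$(dig +short "$name" "@$ns" 2>/dev/null | tail -n 1)
        [ "$got" = "$expected" ] || printf '%s: %s\n' "$ns" "${got:-no answer}"
    done
}
```

Usage would be along the lines of: stale_resolvers virt0.wikimedia.org 208.80.152.32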
[21:57:12] paravoid: no "w", no deal [21:57:50] ottomata: sorry, I think I missed a middle piece. Why do you want to move the zero logs to their own subdir? [21:58:24] i need to also set up a cron to copy them to stat1 [21:58:26] $ echo -n wiki | xxd -ps [21:58:26] 77696b69 [21:58:28] for analysis [21:58:32] now you have a w ;) [21:58:43] it'll be easier to rsync the rotated files if they are all in the same palce [21:58:53] so I don't have to specifically name each one, or do a complicated --exclude [21:59:02] i could just name them all something consistent [21:59:07] zero_blabla.log [21:59:13] but directory seems like it should work just as well, no? [21:59:33] I suppose... [21:59:55] I just hate having multiple versions of essentially the same thing, but introducing tiny differences in each. [22:00:09] paravoid: nvm, was all me querying the wrong place. was it just mchenry that did it? [22:00:19] eg lock and oxygen are /a/squid but emery is /var/log/squid. [22:00:35] locke and emery all have log files in .../squid/ but oxygen has them in .../squid/zeroblah. [22:00:42] that kind of thing just bugs me big time. [22:01:05] if we're going to have two things the same but the third one different, I'd hope for a really good reason. [22:01:25] jeremyb: no, every resolver out there that has virt0 in its cache [22:01:45] if only I knew who changed the TTL yesterday, I'd tell you an ETA for a complete fix [22:02:05] maybe ryan ? [22:02:11] oh i hate that too [22:02:20] er, Ryan probably but I meant when :) [22:02:48] you'd rather have the filenames just changed then? [22:02:49] hm, but how did it work before commiting it? [22:02:49] hmmmm [22:02:51] yeah i think you are right [22:03:00] because then we don't need a custom logrotate [22:03:06] and I can rsync based on filename match [22:03:08] ah! [22:03:11] ok.... 
[22:03:24] 2012-06-13 14:45:27 +0000 [22:04:12] hm, that's going to suck [22:04:18] this is going to mess with us for the next 18h [22:04:38] LeslieCarr: any chance we can rollback the renumbering? [22:05:13] i guess we could [22:05:19] won't that cause just as much truoble? [22:05:22] New patchset: Ottomata; "filters.oxygen.erb - renaming wikipedia zero log files so that they all match a consistent name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11253 [22:05:29] no, the TTL is 60 now [22:05:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11253 [22:05:47] maplebed, ah, yeah I still need a custom logrotate [22:05:48] can't we just have the box own both IPs? [22:05:54] but I don't need a special directory at least, right? [22:06:00] it's either full or partial (i.e. add IP instead of replace) rollback or a 0-18h downtime on some users [22:06:01] and bind NS to 0.0.0.0 or both IPs? [22:06:06] actually, ungh, yeah that is more trouble [22:06:11] i forgot i need to rotate these monthly [22:06:19] well, ergh, [22:06:24] so [22:06:32] we plan on combining all of these zero filters into one file [22:06:33] eventually [22:06:43] but we need to change the logging sources to add a special header to the log line first [22:06:47] and that will probably be a while before that happens [22:06:49] ottomata: you still have the directory creation stuff in site.pp [22:07:01] yeah, realized that, but I think I want to revert that last patch [22:07:04] and put it back at a directory [22:07:06] that would have been much better if we had two different NS for wmflabs.org, grmf. [22:07:12] because I need to different logrotate rules [22:07:13] why do you need a custom logrotate? 
[22:07:16] monthly for zero filters [22:07:31] paravoid: or just a secondary [22:07:32] the regular logfiles rotate daily [22:07:38] the zero log files don't have much in them [22:07:41] jeremyb: we can't have that sadly, two different vlans …. [22:07:44] and they only need to be analyzed once a month [22:07:51] jeremyb: that's the same thing [22:07:52] grumble. [22:07:53] LeslieCarr: 2 interfaces? [22:08:06] i mean, ha, if I had time right now [22:08:11] you're imposing the consumer's rules on the producer. [22:08:13] i'd fix up the logging instance even more [22:08:18] so more of that is configurable [22:08:19] two are already used … hrm, if this is a new box we could put another one in ? [22:08:20] LeslieCarr: (i mean physical ports) [22:08:30] yeah, the new ones have 4 ports.... [22:08:41] there's not 802.1q tags? [22:08:43] everything pulls in logs daily; how they're interpreted (hourly, daily, monthly) isn't the responsibilty of the aggregator but the analyzer, isn't it? [22:08:58] i guess... [22:09:28] i mean, generally I agree with you [22:09:38] right now, at least, the udp2log hosts have a really simple job - catch packets and write them to disk according to filters. [22:09:41] but this filtery stuff is all so mungy anyway [22:10:10] so, i should [22:10:21] give them a zero- prefix [22:10:21] those files then get transferred off to some other host to be interpreted. [22:10:25] rotate daily [22:10:29] rsync monthly [22:10:36] and let whomever deal with piecing them together [22:10:55] +1 keep interpretation separate from collection. [22:11:23] (I suppose we do need to check though, and amke sure that logrotate keeps >30d...) [22:11:23] LeslieCarr: we could do tagged, it won't be very pretty but it'll work [22:11:32] and then we'll have to modify pdns.conf and disable puppet too [22:11:45] maxage 180 [22:11:45] rotate 1000 [22:11:45] ? 
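The plan sketched in the discussion (a zero- prefix, daily rotation, a monthly copy to stat1) relies on the prefix being enough to pick the files out. A minimal illustration of that selection, assuming the logs live flat in /a/squid and rotated copies keep the original name plus a suffix; zero_logs is a made-up helper name:

```shell
#!/usr/bin/env bash

# List the zero-* log files, live and rotated, directly under a log
# directory (defaults to /a/squid). Other logs in the same directory
# are left out, so no per-file naming or custom subdirectory is needed.
zero_logs() {
    find "${1:-/a/squid}" -maxdepth 1 -type f -name 'zero-*.log*' | sort
}
```

The same pattern could drive the monthly copy, for example rsync -a --include='zero-*.log*' --exclude='*' /a/squid/ stat1:DEST/ where DEST is a placeholder for whatever directory the analysis side wants.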
[22:11:48] it should for sure [22:11:55] on all other hosts its got stuff back to dec [22:12:04] LeslieCarr: and I don't see eth1 having an IP? [22:12:09] New patchset: Ottomata; "Setting up udp2log logrotate on oxygen" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11251 [22:12:22] ... are you sure it's logrotate? [22:12:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11251 [22:12:59] eh? [22:13:02] whatcha mean? [22:13:44] New patchset: Ottomata; "filters.oxygen.erb - renaming wikipedia zero log files so that they all match a consistent name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11253 [22:14:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/11253 [22:14:27] ottomata: on oxygen, 'grep -r squid /etc/logrotate*' gives an empty set. [22:14:57] oh, it's just not installed yet. [22:14:58] ::sigh:: [22:15:05] :q [22:16:24] right, [22:16:29] it wasn't setup before [22:17:07] LeslieCarr: ping? :-) [22:20:14] ottomata: do you want to include whatever's necessary to set up logrotate in that patchset too? or should I submit it as is and you'll do another for logrotate? [22:20:47] hm, didn't I? [22:20:53] oh there are two commits [22:20:56] they are separate [22:21:13] ok, ok. 11253 going in then. [22:21:24] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/11253 [22:21:24] this is the logrotate one: [22:21:25] https://gerrit.wikimedia.org/r/11251 [22:21:26] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11253 [22:24:10] ottomata: you don't use the same email all the time for gerrit? [22:25:40] i don't? 
[22:25:47] correct [22:25:57] oh hmm, i just re cloned puppet repo [22:26:02] cause I borked something in the other one [22:26:15] ahhhh poo [22:26:16] yeah [22:26:49] paravoid: hey [22:26:52] hi [22:27:00] do you want to do vlans maybe? [22:27:01] sorry [22:27:03] to add that ip back? [22:27:11] how good is the support of vlans in linux ? [22:27:12] or we could setup a new server in that IP [22:27:19] ottomata: are you sure that the has_logrotate thing is what turns on logrotate for the logs? locke has it set to false, and yet it has something in logrotate.d/udp2log [22:27:27] also, I see eth1 being up but not having an IP [22:27:29] should be fixed now [22:27:31] yeah that's different [22:27:34] so [22:27:40] (802.1q is perfect in Linux, I have used it for years) [22:27:42] so eth1 is the internal vlan, it doesn't have an ip but it still talks on that subnet [22:27:44] ok [22:27:48] has_logrotate puts a ${name}-udp2log file set [22:27:50] okay [22:27:53] file in there [22:27:56] have you reused that subnet? [22:27:59] where $name is the name of the logging instance [22:28:00] any other server would work too [22:28:04] which in this case is 'oxygen' [22:28:06] so this will put [22:28:09] although virt0 is better [22:28:15] (= easier to setup) [22:28:16] yeah, that's how I read the puppet configs, but then I don't know how locke got its configs... [22:28:22] ::sigh:: [22:28:27] haven't reused the subnet yet [22:28:30] probably leftoever from before someone change the udp2log instance stuff [22:28:33] i htink notpeter did that [22:28:42] ok, I'll merge that one too. [22:29:09] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/11251 [22:29:11] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/11251 [22:29:19] LeslieCarr: I see a VLAN 103 on eth1 too [22:29:31] yeah, that's the internal to labs vlan [22:29:40] 10.4/16 iirc [22:29:44] so eth1 runs tagged? 
[22:29:44] * jeremyb just verified they still apply clean on production HEAD: 8120, 8339, 8344 [22:29:55] ottomata: loading onto oxygen now. [22:29:56] don't think so …. double checking [22:31:39] hah, i just realized "fai" is in faidon. ;) [22:32:02] (i think you were hating on it the other day?) [22:32:53] hrm, paravoid it actually looks like on virt0 eth1 is unused [22:33:18] ottomata: that didn't work; it's not pulling in the logrotate conf file. (the name changes went through.) [22:33:29] great [22:33:49] paravoid: what broke? [22:33:57] I swore I fixed everything before I left [22:34:06] Ryan_Lane: you changed the TTL from 86400 to 60, 2 hours before the renumbering [22:34:09] did you expect it to work? :-) [22:34:21] why would that cause problems? [22:34:30] it was 1 hour previously [22:34:41] I did that 2 hours before I changed it [22:35:04] why wouldn't that work? [22:35:31] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:35:45] I fixed a whole bunch of shit before I left [22:36:00] everything was working and resolving with the new address [22:36:11] and I ensured I forced ran puppet till everything was working [22:36:23] so, what actually broke> [22:36:41] we're having issues with resolving anything on the wmflabs.org domain [22:37:02] hm [22:37:09] Ryan_Lane: ttl was not 1h [22:37:16] no idea why not [22:37:16] yes, yes it was [22:37:34] mapebed, its there now [22:37:37] virt0 is still cached with the wrong IP in the WMF office [22:37:45] cat /etc/logrotate.d/oxygen-udp2log [22:37:52] that's their fuckup then [22:37:52] huh. so it is. [22:37:54] mchenry had virt0 with ttl 69000s when I forced-expired it [22:38:06] I didn't see puppet put it there. [22:38:09] ok! all done. [22:38:25] paravoid: +135 1H IN PTR virt0.wikimedia.org. 
[22:39:14] also, I ensured things resolved properly from my own desktop, which definitely hits a different resolver [22:39:24] the office's DNS must be fucked up [22:39:26] wrong entry but yes, I just saw 1H on svn [22:39:30] Ryan_Lane: mchenry too [22:39:32] yay, ok cool [22:39:33] is there anything else that force changes the ttl ? [22:39:40] other than a badly configured resolver ? [22:39:44] thanks maplebed! [22:40:08] Ryan_Lane: I forced expired mchenry myself, after seeing this: [22:40:09] virt0.wikimedia.org. 69315 IN A 208.80.153.135 [22:40:17] I'm totally cool taking an outage, but I'm pretty sure this one wasn't caused by me [22:40:17] 69315, far more than 3600 [22:41:07] paravoid: was this due to nscd on mchenry? [22:41:16] that's dig [22:41:22] that makes no sense [22:41:31] it hits resolver0 [22:41:35] which is on dobson [22:41:39] no, that's dig on itself [22:41:46] eh> [22:41:50] dobson is recursor0, which was okay (because is a slave) [22:41:54] mchenry is recursor1 [22:42:12] could it be that the resolver was somehow hitting nscd? [22:42:24] pdns_recursor? [22:42:28] I don't see how [22:42:40] I don't see how it would get a ttl longer than what the server specified [22:43:09] LeslieCarr or maplebed, whichever from you is in the office, can you do a "cat /etc/resolv.conf" [22:43:22] I'm at home, sorry. [22:43:22] yeah it goes to the local firewall [22:43:24] and then dig virt0.wikimedia.org @nameserver for each nameserver there? [22:43:28] paravoid: kaldari in #wikimedia-labs is saying it still isn't resolving [22:43:35] Ryan_Lane: I know [22:43:39] for office IT [22:43:40] ahha [22:43:50] (at least) [22:44:10] well, wmflabs.org doesn't [22:44:12] LeslieCarr: dig @firewall-ip virt0.wikimedia.org [22:44:15] http://pastebin.com/bsZ1BUVy [22:44:16] but bastion.wmflabs.org does [22:44:25] that's absurd [22:44:28] Ryan_Lane: wmflabs.org has two NS records to the same nameserver, virt0 [22:44:42] Ryan_Lane: see the pastebin? :) [22:44:43] yes, I know. 
we'll solve that when we bring up eqiad [22:44:54] hrm, lemme walk over to office it [22:45:14] andrew_wmf is here in the channel [22:45:32] if a fucked up resolver sees a host entry that's lower than its allowed ttl, does it switch to its default? [22:45:39] no. [22:45:46] LeslieCarr: wait [22:45:52] LeslieCarr: waait, don't expire the cache [22:46:01] LeslieCarr: can you do the same for williams.wikimedia.org? [22:46:27] kaldari is now saying they are all working [22:46:33] oh [22:46:52] hm. no. ignore me [22:48:59] it's indeed weird [22:49:04] it makes no sense to me [22:49:19] RECOVERY - udp2log log age for oxygen on oxygen is OK: OK: all log files active [22:49:56] hm. I can't look up the NS records for wmflabs.org/ [22:50:02] eh? [22:50:37] yes, the zone is broken on virt0. [22:50:51] doesn't have NS and has a broken SOA [22:50:52] that's an issue, for sure [22:50:55] this is leslie [22:50:56] sorry [22:50:59] i expired the cache [22:51:03] on the office box [22:51:13] Ryan_Lane: hostmaster\@wikimedia.org. instead of hostmaster.wikimedia.org [22:51:16] i don't know why it was holding it so long…. it's running pdns-server [22:51:20] hostmaster.wikimedia.org. that is [22:52:15] lemme check the SOA record in LDAP [22:52:25] i'm back on my laptop [22:52:48] sOARecord: hostmaster@wikimedia.org 20120612192602 1800 3600 86400 7200 [22:53:47] not that that would affect anything [22:54:52] nope [22:55:00] so, the address is probably fucked up at the registrar [22:55:10] AH, I have an idea [22:55:17] glue records with a different TTL [22:55:23] glue records? [22:55:42] shouldn't have glue records, but maybe they do because they suck [22:56:45] nope, not that either [22:57:06] so, other recursors handled this properly [22:57:13] for all we know [22:57:15] what's the difference here? [22:57:18] no. 
I know they did [22:57:23] because I tested it before I left [22:57:25] you know /some/ of them did [22:57:28] yes [22:57:33] that's what I'm saying [22:57:43] what made some handle it properly, and not others? [22:58:07] I fucking hate dns [22:58:57] New review: Sumanah; "For future reference: I know I added a few reviewers to this changeset, but I did not intentionally ..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/11082 [23:01:46] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:05:32] strange indeed [23:05:36] heh [23:05:50] how many times have I been bitten by DNS now? [23:05:54] it's my fucking nemesis. [23:06:01] mine's the 208 days :P [23:06:06] hahaha [23:06:08] indeed [23:06:53] maybe a pdns recursor bug? seems like a stretch though… [23:06:57] yeah [23:07:03] I dunno [23:07:37] both forward and reverse show 1H [23:07:44] well restarting pdns-recursor is when it seemed to fix itself [23:07:49] same for the domain [23:07:55] (on the office dns server) [23:08:13] we should check its configuration [23:08:24] maybe we have it configured improperly [23:08:42] it makes no sense that google's dns and my service provider's dns worked and the office and mchenry's didn't [23:11:03] in fact, it sounds like a problem ;) [23:11:18] this is what I get for not being in the office :D [23:20:20] Ryan_Lane: hm, I think I should attempt raising it back to virt0 [23:20:32] er, 1h [23:20:38] yeah, that's fine [23:20:43] it was only 60s for the switch [23:20:48] right [23:21:09] I also went back on svn history to see if it was 86400 at any point [23:21:13] so strange. 
[23:21:30] it was never that long [23:21:40] we make basically all entries 1H [23:21:43] no it wasn't, went as back as its addition [23:22:10] there's a select few that are longer, but are obvious ones like 127.0.0.1 [23:22:50] I have code waiting over a week for a review [23:23:03] I'll remember this the next time a dev wants a review :) [23:24:16] mcherny says 3600 now [23:24:25] after switching it [23:24:29] which is okay [23:24:31] o.O [23:24:39] what. the. fuck. [23:24:45] you hate DNS, I hate heisenbugs :) [23:24:48] hahaha [23:24:50] yeah [23:24:59] it's seriously out to get me [23:25:06] I didn't even do anything wrong this time :( [23:25:12] no, you didn't, I concur [23:25:43] I initially thought you lowered it from 86400, because that was the evidence I was getting [23:25:55] 65000-something pointed to 86400 to me [23:26:01] I'd prefer that, to things just being broken [23:26:16] I'd much rather that I fucked up somehow [23:26:21] I dislike not knowing what went wrong [23:34:29] i would bet on pdns being misconfigured somehow [23:34:59] i hope to have a new sys admin in a couple weeks to help me replace it [23:35:07] andrew_wmf: we had the same issue with mchenry [23:35:20] and I checked pdns there and I don't see any misconfiguration [23:35:44] yeah odd [23:38:02] andrew_wmf: could you do a dpkg -l |grep pdns for me please? [23:38:14] just wondering what version of pdns you're running [23:38:19] and if that matches mchenry's [23:38:26] in case this is a pdns bug [23:38:55] PROBLEM - udp2log log age for oxygen on oxygen is CRITICAL: CRITICAL: log files /a/squid/zero-saudi-telecom.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days. [23:39:04] yeah, I'm guessing it's more than simple misconfiguration [23:39:26] just a sec [23:40:32] pm'd you paravoid [23:40:52] thanks. [23:40:59] so, even newer version that mchenry [23:41:05] unbelievable [23:41:24] lol. 
i want to ditch pdns for bind asap [23:41:33] bind is a scary thing [23:41:39] bind is fine [23:41:47] s/bind/dns/ ? [23:41:49] you always like niche things Ryan [23:41:55] ;-) [23:41:55] heh [23:42:02] you and your opendj :P [23:42:06] well, pdns isn't necessarily my choice ;) [23:42:11] openldap is a gigantic POS [23:42:14] too mainstream for you, is it? [23:42:20] pdns that is :) [23:42:24] i like d bind 5 the best ;) [23:42:29] I'd happily move to fedoraDS [23:42:41] 389server nowadays [23:42:42] or some other resonable LDAP server [23:42:44] bind is a very tested/stable dns server. and being quite well known is a plus [23:42:45] right [23:42:46] that one [23:43:02] indeed [23:43:06] bind has been so, so insecure for so long, though :( [23:43:21] I've worked with at least bind, pdns, nsd, unbound [23:43:27] let's use djbdns [23:43:33] every one of them has its plus and minuses [23:43:45] paravoid: what LDAP does DSA use? [23:43:49] * andrew_wmf just wants something simple :D [23:43:52] jeremyb: openldap [23:44:02] and bind as auth DNS [23:44:04] openldap isn't simple :( [23:44:10] and unbound for local dnssec-verifying resolvers [23:44:26] Ryan_Lane: oh come on [23:44:36] you can say that it's a POS, doesn't have ACIs or whatnot [23:44:42] but you can't say it's not simple [23:44:46] it's not. [23:44:54] you can run a slapd with a 10-line config [23:44:56] ACI? [23:44:59] you have to deal with master/slave [23:45:07] that's /simple/? [23:45:14] which immediately makes it more difficult [23:45:34] master/master in LDAP is the norm nowadays [23:45:39] jeremyb: access lists [23:45:47] paravoid: not ACL? [23:45:56] nope :) [23:46:02] uhuh... [23:46:05] access control information I believe it stands [23:46:11] you need to reconfigure the clients when the master switches, and that makes it harder [23:46:11] for [23:46:19] was it made by the government? 
;) [23:46:42] you can alternatively make DNS entries for master/slave, but that's a lame way of handling things [23:47:01] master/master allows immediate failover [23:47:31] can you not tell a client explicitly that a host is a master and another is a slave? [23:47:33] also, it sucks to have to restart the server to add an ACI [23:47:39] what about split brain? [23:47:41] I love how you care that much about multi-master LDAP replication but we still have only one NS for the wmflabs domains :P [23:48:10] well, to be fair, I thought I'd have a replica up quite a while ago [23:49:01] I'd like to separate the LDAP servers from the controllers also [23:49:08] one per datacenter [23:49:24] then remove the LDAP servers from nfs1/2 [23:50:07] either make them totally internal, or make them have public IPs and have a reasonable firewall solution [23:50:30] I'm sure you've heard me bitch about this before ;) [23:50:34] yep :) [23:50:40] sounds reasonable to me too [23:50:54] I'm not totally against switching to openldap [23:51:01] I didn't say that we should [23:51:04] I don't care that much [23:51:11] there's reasons I prefer what we're using, though :) [23:51:11] * andrew_wmf is open to openldap [23:51:23] I've worked a lot with SunDS in the past and it's a bit similar [23:51:39] sans the Java [23:51:46] yeah. 
opendj is basically a rewrite of sunDS [23:52:08] slightly better, I'd say, though [23:52:18] sunDS took weird turns towards the end [23:52:29] the IT ldap server is several versions out of date [23:52:40] you're running the same version as us [23:52:43] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [23:52:43] PROBLEM - Puppet freshness on es1003 is CRITICAL: Puppet has not run in the last 10 hours [23:52:43] PROBLEM - Puppet freshness on professor is CRITICAL: Puppet has not run in the last 10 hours [23:52:50] ok [23:53:02] I looked at the change log, there isn't any major functionality difference between the newest and ours [23:53:07] all I'm saying is that openldap /is/ simpler, esp. if you count the cost of you maintain packages, talking with upstreams about bugs etc. :-) [23:53:13] I do have plans on updating at some point, tough [23:53:15] *though [23:53:34] that's one reason I'd consider switching [23:53:39] not for less common configs such as multimaster ones (or even replicated ones!) [23:53:39] to not need to maintain the version [23:53:51] *package [23:54:01] you say multimaster is the norm, while I'd say that not having replication /at all/ is the norm [23:54:11] seriously? [23:54:15] yes :) [23:54:22] what LDAP server except openldap doesn't have multi-master? [23:54:31] does apache's server not? [23:54:32] I'm talking users, not software [23:54:48] that's terrifying [23:54:52] the percentage of users that need replication and the percentage of them that need multi-master [23:55:04] is it? [23:55:18] even apacheds has multimaster :D [23:55:19] hahaha [23:55:41] we don't do replication in Debian [23:55:52] you don't have more than one server? [23:55:55] nope [23:55:57] why would we/ [23:56:06] we don't have runtime dependencies on LDAP :-) [23:56:10] ah [23:56:14] I see LDAP like DNS [23:56:25] failover is on the client [23:56:47] how many clients do you have that do writes on labs? 
[23:56:48] I really don't have much issue having the runtime dependency on it [23:57:05] well, in the architecture that I used to work in, all of them [23:57:16] usually even setups that need replication rarely need multi-master [23:57:23] over 1000 [23:57:35] I'm not saying there aren't use cases [23:57:37] of course there are [23:57:42] and AD does that too, they're not stupid [23:57:58] all modern ldap servers support it [23:58:12] I'm just saying, for the 95% of users that don't need that, openldap *is* simple :) [23:58:31] true [23:59:03] in our case we could likely get away without having multi-master as well [23:59:22] we have 3-4 systems that write to ldap