[00:00:37] No, because it didn't break the site completely [00:00:41] and I have other work [00:00:48] so it gets stuck in the mud. [00:00:55] I'm not sure I get the argument [00:01:07] pretty much the same reason hashar has root on it already. [00:01:10] you don't ask anyone but you wait days or weeks to get it fixed? :) [00:01:17] because we don't have a representative emulation [00:01:21] so we test in production basically [00:02:06] I think you should try asking ops to fix things for you when you encounter them for a while [00:02:15] and to get that puppetised better. It would likely go a whole lot easier to not be stabbing in the dark via puppet when puppetising the very thing itself. [00:02:20] and if it becomes a burden for us or we end up blocking you, then grant you access. [00:02:25] the fixing issues was just an example [00:02:35] like what? [00:02:39] please be specific [00:02:43] If I'm puppetising testswarm for example [00:02:47] I did that at jquery a few weeks back [00:03:08] nobody can review those changes because nobody else has even used testswarm around here before [00:03:08] the only way to ensure it works is to try it out [00:03:11] well, how? [00:03:20] you have integration.wmflabs don't you? [00:03:22] in production basically, at least until everything else is puppetised [00:03:31] yes, which is an empty apache server with some stuff on it [00:03:42] ? [00:03:48] not at all like gallium [00:03:52] but you're talking about bits you're puppetizing [00:03:57] yep [00:04:26] I agree it shouldn't be needed and in a few months I'll likely ask for it to be revoked and expect hashar to do the same. [00:04:27] so what prevents you from testing them in labs? [00:04:39] the environment isn't there [00:04:46] what environment?
[00:04:58] everything gallium that isn't puppetised right now [00:05:13] and stuff that is puppetised is often hardcoded in production settings, meaning it doesn't work in labs [00:05:15] but you're talking about *new* things [00:05:26] Yes, but they interact with existing things [00:05:35] I can't test zuul without jenkins and gerrit as well, for example. [00:07:04] so when hashar got that in production, we (he) basically set it up on gallium (using sudo for parts that need it) and put it in puppet once it works. [00:07:29] okay [00:07:38] I'm not entirely convinced [00:07:46] but it's limited root, so I'm not going to veto [00:07:46] that's okay, there's no rush. [00:07:53] it will get delayed a bit though [00:08:07] since we want to replace the certificate first [00:08:12] there's so much interaction with these apps you have to have it run to know it works, it's hard to review. And until everything is in puppet, it can't be tested in labs either. [00:08:17] sure [00:08:52] though hashar has access as well, which means in most cases we assign integration features for him to create because I can't right now. [00:09:45] New patchset: Dzahn; "remove boards.wp->boards.wm redirect (bug 46341) neither of them exist in DNS, there is "board", but not "boards"" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/54801 [00:09:46] New patchset: Ryan Lane; "Add ssh key and authorized_keys for nova" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54802 [00:09:58] Anyway, no rush. For now I've got enough other work to focus on. [00:11:05] Krinkle: so i'm going to remove that from svn now.. right [00:11:10] the docs stuff [00:11:33] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54802 [00:11:40] mutante: doesn't puppet do that? [00:12:01] Krinkle: after a human merges it on sockpuppet, yes.
[00:12:09] and i was going to look at it to confirm it works [00:12:16] cool [00:13:00] mutante: is puppet for the svn server different then? Or do you always have to merge it on sockpuppet in addition to the puppet repo in git? [00:13:07] err: Failed to apply catalog: Could not find dependency Package[apache2] for File[/var/cache/svnusers] at /var/lib/git/operations/puppet/manifests/svn.pp:44 [00:13:10] there we go :p [00:13:19] Krinkle: always have to [00:13:34] now how did that ..looking [00:14:01] last puppet run was yesterday .. not an hour ago .. hmm [00:16:45] New patchset: Ryan Lane; "Properly reference private repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54803 [00:16:49] New patchset: MaxSem; "Beta doesn't have SSL" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54804 [00:19:28] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54803 [00:22:34] New patchset: Matthias Mullie; "Completely disable AFTv5 on enwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54807 [00:23:02] New patchset: Dzahn; "after removing docs generation this require for apache2 package broke pupet runs - remove dependency" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54808 [00:23:37] paravoid: I'm afraid something went wrong [00:23:43] https://doc.wikimedia.org/VisualEditor/master/ [00:23:46] Uncaught ReferenceError: Ext is not defined [00:24:11] Same run on http://integration.wmflabs.org/mw/extensions/VisualEditor/docs/ works fine [00:24:25] http://integration.wmflabs.org/mw/extensions/VisualEditor/docs/extjs/ext-all.js [00:24:28] https://doc.wikimedia.org/VisualEditor/master/extjs/ext-all.js [00:24:37] Ext JS Library 3.0.3 vs. 
Ext JS 4.1 [00:25:06] New patchset: Ryan Lane; "Change ownership of nova ssh config to nova" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54809 [00:25:53] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54809 [00:28:14] New patchset: Dzahn; "after removing docs generation this require for apache2 package broke puppet runs - remove dependency" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54808 [00:29:08] New review: Dzahn; "fix formey puppet runs" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/54808 [00:29:21] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54808 [00:31:56] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [00:33:06] paravoid: The version mismatch isn't the problem actually [00:33:48] paravoid: Looks like libjs-extjs has an odd layout in that the file you reference from /usr/share/javascript/extjs is incomplete [00:33:59] (in the patch) [00:34:37] libjs-extjs's distribution (non-standard from extjs point of view) has a subdirectory with adaptors, which need to be prepended to the main file to work. 
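The v3 layout Krinkle describes (an adaptor file that must be prepended to the main ext-all.js before it works) can be sketched as follows. All paths and file contents here are invented for illustration; they are not taken from the real libjs-extjs package.

```shell
# Illustrative sketch of the libjs-extjs v3 layout described above: the main
# ext-all.js is incomplete on its own and only works once an adaptor file is
# prepended to it. Everything below is a made-up stand-in for demonstration.
mkdir -p /tmp/extjs-demo/adapter/ext
printf '/* adaptor: Ext.lib base */\n' > /tmp/extjs-demo/adapter/ext/ext-base.js
printf '/* main library body */\n'     > /tmp/extjs-demo/ext-all.js
# Prepend the adaptor to produce a single usable file:
cat /tmp/extjs-demo/adapter/ext/ext-base.js /tmp/extjs-demo/ext-all.js \
  > /tmp/extjs-demo/ext-all-combined.js
cat /tmp/extjs-demo/ext-all-combined.js
```

In v4 (as noted below in the log) this concatenation step disappears, which is why referencing the packaged file directly only misbehaves with the older package.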
[00:34:39] Crap [00:36:38] New patchset: Milimetric; "cron job to regenerate mobile apps stats" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54811 [00:37:40] New patchset: Dzahn; "remove another requirement for apache2 package and fix unquoted resource names and unquoted file modes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54812 [00:38:54] New review: Dzahn; "yea, we hopefully don't need the entire svn.pp anymore soon :)" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/54812 [00:39:06] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54812 [00:41:05] RECOVERY - Puppet freshness on formey is OK: puppet ran at Wed Mar 20 00:41:02 UTC 2013 [00:42:31] paravoid: Okay, figured it out. It is indeed just a version mismatch. Forget the whole "adaptors" directory; that's something specific to v3 of extjs. In v4 (which jsduck depends on) there is just one extjs.js file which works as intended. [00:42:51] paravoid: The problem is basically that the libjs-extjs package in your repo is rather outdated. [00:43:14] 3.0.3 instead of 4.0+ (latest stable is 4.2.0) [00:43:31] our* [00:43:56] I'll take the debs you created as an example to see if I can package this one [00:50:44] New patchset: Milimetric; "cron job to regenerate mobile apps stats" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54811 [00:50:50] New patchset: Pyoungmeister; "run puppet by cron instead of via the agent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54815 [00:52:08] puppet run on formey (SVN) is .. taking ..just a little .. while [00:52:30] filebucketing like every single file in there [00:52:38] Krinkle: it's still running :p [00:59:20] New patchset: Pyoungmeister; "run puppet by cron instead of via the agent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54815 [01:05:49] rfaulkner: ping [01:06:14] preilly: what's up?
[01:07:20] New review: Matthias Mullie; "I won't merge it in myself - if you think current load may be "dangerous", feel free to merge it in,..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54807 [01:07:46] rfaulkner: I sent you a PM [01:08:00] Ryan_Lane: do you have a second for a puppet question? [01:09:54] give me a bit [01:12:21] preilly: what's up? [01:12:46] puppet agent --onetime --verbose --no-daemonize --no-splay --show_diff [01:13:02] Ryan_Lane: is there an easy way to get puppet to just ignore one file? [01:13:28] Ryan_Lane: e.g., can I get puppet to ignore php.ini in a VM [01:13:38] Ryan_Lane: but still do everything else as it normally would [01:14:03] hm [01:14:31] I don't believe so, if it's in the catalogue [01:14:51] like when you are doing recurse => true in a file {}? [01:14:54] file {} has an ignore => [01:14:56] Ryan_Lane: so I guess I could have php.ini include a new file that puppet doesn't manage, right? And put my changes in that file? [01:15:05] yep [01:15:11] Ryan_Lane: okay cool [01:15:18] Ryan_Lane: can I send you a PM? [01:15:21] sure [01:16:51] Krinkle: 19748 files later .. notice: /Stage[main]/Svn::Server/File[/var/mwdocs]/ensure: removed [01:16:55] :) [01:17:06] nice! [01:17:13] oh, right /var/mwdocs [01:17:19] that's all of svn :D [01:17:27] and then some [01:17:30] it went through every single file in there :P [01:17:38] Finished catalog run in 2203.85 seconds [01:17:52] Way to go puppet, who cares about efficiency [01:18:05] now it will check that it is absent until the end of time [01:18:07] Why would it catalog the entire subtree only to remove it?
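The workaround agreed on above (leave php.ini itself unmanaged and keep local changes in a separate file that php.ini includes) might look roughly like this as a Puppet manifest. The path, mode, and source below are hypothetical, not taken from the actual operations/puppet tree:

```
# Hypothetical sketch: puppet manages only a drop-in file, so hand edits to
# php.ini itself are left alone. Path and source are illustrative only.
file { '/etc/php5/conf.d/local-overrides.ini':
    ensure => present,
    owner  => 'root',
    group  => 'root',
    mode   => '0444',
    source => 'puppet:///files/php/local-overrides.ini',
}
```

This sidesteps the problem that a `file {}` resource already in the catalog cannot be selectively skipped (`ignore =>` only filters recursive copies).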
[01:18:15] but that is just until we can drop the whole svn.pp :) [01:18:16] yeah, but only the top [01:18:32] and afaik for that we just need to make pywikipedia people finally move [01:19:04] well, it checked, and then found that: info: FileBucket got a duplicate file {md5} [01:19:10] for everything [01:19:41] New patchset: Pyoungmeister; "run puppet by cron instead of via the agent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54815 [01:23:07] New patchset: Pyoungmeister; "run puppet by cron instead of via the agent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54815 [01:34:14] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:46:26] paravoid: So, where in the debian package (debs-ruby-jsduck) do you specify what version it fetches from rubygems? [01:46:39] In ./watch I see [01:46:41] version=3 [01:46:41] http://pkg-ruby-extras.alioth.debian.org/cgi-bin/gemwatch/jsduck .*/jsduck-(.*).tar.gz [01:46:48] yet it fetched 4.6.2 during the build [01:47:05] (as it should) [01:51:01] Krinkle: i think version=3 is the format of the watch file [01:51:07] yeah I know [01:51:38] I'm trying to find what is telling it to fetch that version [01:51:43] though I'm afraid nothing is [01:52:04] which means if you re-build the package without changing anything after a release is made, it will silently update [01:52:20] it would be automatically updating in a counter-intuitive way (at build time) [02:06:54] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [02:07:37] Change merged: Andrew Bogott; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54804 [02:08:56] PROBLEM - Puppet freshness on db43 is CRITICAL: Puppet has not run in the last 10 hours [02:09:46] New review: Krinkle; "(1 comment)" [operations/debs/ruby-jsduck] (master) - https://gerrit.wikimedia.org/r/54691 [02:16:50] New patchset: Krinkle; "ExtJS: Use provided
copy instead of overwriting with libjs-extjs." [operations/debs/ruby-jsduck] (master) - https://gerrit.wikimedia.org/r/54818 [02:32:16] !log LocalisationUpdate completed (1.21wmf11) at Wed Mar 20 02:32:16 UTC 2013 [02:32:24] Logged the message, Master [02:34:14] PROBLEM - MySQL Slave Delay on db66 is CRITICAL: CRIT replication delay 207 seconds [02:34:25] PROBLEM - MySQL Replication Heartbeat on db66 is CRITICAL: CRIT replication delay 215 seconds [02:35:15] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 185 seconds [02:35:54] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 209 seconds [02:37:54] PROBLEM - Puppet freshness on xenon is CRITICAL: Puppet has not run in the last 10 hours [02:38:54] PROBLEM - Puppet freshness on cp1034 is CRITICAL: Puppet has not run in the last 10 hours [02:44:04] !log LocalisationUpdate completed (1.21wmf12) at Wed Mar 20 02:44:03 UTC 2013 [02:44:10] Logged the message, Master [02:45:55] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 20 seconds [02:46:15] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 1 seconds [03:08:18] New review: Ori.livneh; "Thanks very much for this, Andrew."
[operations/debs/python-jsonschema] (master) - https://gerrit.wikimedia.org/r/54782 [03:12:50] New patchset: Ryan Lane; "Give nova a shell on nova-compute" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54820 [03:14:39] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54820 [03:56:14] RECOVERY - MySQL Slave Delay on db66 is OK: OK replication delay 0 seconds [03:56:25] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay 0 seconds [04:23:49] Change merged: Tim Starling; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/54522 [04:57:54] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [05:03:54] PROBLEM - SSH on lvs6 is CRITICAL: Server answer: [05:04:13] LeslieCarr: ^ [05:04:45] she's idle [05:05:52] i agree [05:06:55] RECOVERY - SSH on lvs6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [05:13:56] RECOVERY - Puppet freshness on sq79 is OK: puppet ran at Wed Mar 20 05:13:51 UTC 2013 [05:39:25] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 199 seconds [05:48:25] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 21 seconds [05:49:26] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 23 seconds [05:49:29] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 23 seconds [06:00:54] PROBLEM - Puppet freshness on mw1094 is CRITICAL: Puppet has not run in the last 10 hours [06:01:56] PROBLEM - Puppet freshness on colby is CRITICAL: Puppet has not run in the last 10 hours [06:01:56] PROBLEM - Puppet freshness on mw1052 is CRITICAL: Puppet has not run in the last 10 hours [06:01:56] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [06:02:55] PROBLEM - Puppet freshness on mw1008 is CRITICAL: Puppet has not run in the last 10 hours [06:04:01] PROBLEM - Puppet freshness on mw1056 is CRITICAL: 
Puppet has not run in the last 10 hours [06:05:56] PROBLEM - Puppet freshness on cp1026 is CRITICAL: Puppet has not run in the last 10 hours [06:36:54] PROBLEM - Puppet freshness on db66 is CRITICAL: Puppet has not run in the last 10 hours [06:38:35] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 196 seconds [06:59:55] PROBLEM - Puppet freshness on europium is CRITICAL: Puppet has not run in the last 10 hours [07:00:34] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [07:25:35] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [07:25:54] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [07:25:54] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [07:25:54] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [07:25:55] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [07:30:35] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [07:41:53] hello [07:46:37] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 181 seconds [07:46:37] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 181 seconds [07:55:20] New review: Hashar; "indeed :)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/54798 [07:56:31] !log gallium: reloaded apache2 to make sure recent changes are applied. [07:56:37] Logged the message, Master [08:00:31] New review: Hashar; "(2 comments)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54466 [08:01:56] New review: Hashar; "Not sure it is needed but why not. Make sure to get Timo trained by ops about the do and don't on a ..." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/53861 [08:05:38] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [08:05:38] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [08:07:54] PROBLEM - Puppet freshness on mc1002 is CRITICAL: Puppet has not run in the last 10 hours [08:07:54] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [08:38:55] PROBLEM - Puppet freshness on db35 is CRITICAL: Puppet has not run in the last 10 hours [08:38:55] PROBLEM - Puppet freshness on labstore1 is CRITICAL: Puppet has not run in the last 10 hours [08:44:43] New review: Hashar; "It seems there is already such a packaging work being made:" [operations/debs/python-jsonschema] (master) C: -1; - https://gerrit.wikimedia.org/r/54782 [08:48:49] Waiting for 10.64.16.15: 111 seconds lagged [08:48:51] from https://www.wikidata.org/?maxlag=-1 [08:52:02] hm back to normal now, i guess thanks [08:59:13] legoktm: db33 had some lag one hour ago [08:59:22] legoktm: maybe that is the DB for wikidata :-] [08:59:33] well it all works fine now :) [09:33:11] Change abandoned: Hashar; "No point in keeping this change around, will craft a new one if needed." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/54466 [10:32:54] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [10:36:56] PROBLEM - Puppet freshness on mw1130 is CRITICAL: Puppet has not run in the last 10 hours [10:37:55] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [10:45:40] New patchset: Hashar; "slightly refactor Zuul daemon web access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54845 [10:46:04] New review: Hashar; "New change is https://gerrit.wikimedia.org/r/54845" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54466 [11:10:10] New patchset: Hashar; "adjust Zuul daemon web access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54845 [11:10:36] New review: Hashar; "patchset 2 makes it MUCH simpler" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54845 [11:22:59] New patchset: Hashar; "adjust Zuul daemon web access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54845 [11:23:47] New review: Hashar; "Turns out ProxyPassMatch require a $1 :-] That last patchset works fine, verified it directly on g..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54845 [11:25:24] apergos: hi :-]  I could use a merge for the contint server, a slight tweak to a mod proxy rule https://gerrit.wikimedia.org/r/#/c/54845/3/modules/contint/files/apache/proxy_zuul,unified [11:25:38] apergos: which I have already tested on the server so that is not going to cause any trouble :-] [11:28:53] what gets served if the main doc root is requested (or some subpage under it)? [11:30:24] apergos some 404 :-] [11:30:53] no index pages or anything like that? 
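For readers unfamiliar with the directive under review here: a ProxyPassMatch rule needs the $1 back-reference hashar mentions, so that only matching requests reach the backend while everything else falls through to the DocumentRoot. The rule below is a guessed sketch of that general shape; the real proxy_zuul file, port, and paths may differ:

```
# Hypothetical sketch of a ProxyPassMatch rule like the one being merged:
# only requests matching the pattern are proxied to the Zuul daemon, and the
# captured path ($1, which ProxyPassMatch requires) is passed along. Other
# /zuul/ requests fall back to the Apache DocumentRoot.
ProxyPassMatch ^/zuul/(status.*)$ http://localhost:8001/$1
```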
[11:31:00] the /zuul/randomthing will resolve back to the Apache DocumentRoot which is /org/wikimedia/integration/zuul/ [11:31:10] which I am writing right now :-] [11:31:15] ok [11:31:19] so yeah the change will make /zuul/ a 404 :-] [11:31:29] but that is not referenced anywhere and will soon be populated [11:31:38] all right [11:31:51] ProxyPassMatch is a nice trick [11:32:35] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54845 [11:33:04] apergos: i will take care of puppet on the box to save you some time [11:33:31] ok, it's all yours [11:36:05] New review: Hashar; "Works fine thanks!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54845 [11:36:10] thank you apergos [11:36:48] yw [11:38:33] my english is really awful [11:38:56] it took me a few minutes to understand the lyrics of http://youtu.be/4Xy6AH_tZWs [11:39:27] http://en.wikipedia.org/wiki/Step_into_My_Office,_Baby :-] [11:40:57] lyrics are usually hard [11:41:07] even in your mother tongue [11:41:25] they don't exactly sing them with clear articulation a lot of the time [11:41:51] at least I managed to find a few sentences and that was enough for google to find out the song for me :-] [11:42:08] :) [11:42:39] Platonides: while you are around, I removed the python pool counter daemon https://gerrit.wikimedia.org/r/54832 :-D [11:42:49] I don't mind [11:42:56] but beware of domas ;) [11:43:11] just cast your vote on the change ? :-D [11:47:27] btw, do you have many conference calls? [11:49:49] 2 on monday evening [11:49:56] some hangout from time to time during the evening [11:50:00] maybe once per week [11:50:04] rest is IRC :-] [11:50:15] most of my team seems to prefer writing over voice [11:54:36] Platonides: ^^^ [11:55:24] looks like you're trying to answer my question on https://xkcd.com/802/ in WMF :p [11:55:33] spoken language vs. written etc. [11:58:44] Nemo_bis: what question ?
:-] [11:59:31] hashar: if the proportion in WMF is the same as there [11:59:52] oh, I found IRC: it's between the sea of protocol confusion and the sea of memes [11:59:59] yeah seeing it :-] [12:00:01] I love those islands [12:00:10] usenet was a nice place to live [12:00:20] together with troll bay and Wikipedia talkpages [12:00:51] yeah troll bay is enjoyable [12:00:56] much more than the sea of memes :-] [12:01:22] bah I got 95 bugs :/ [12:01:26] heh [12:01:33] you flooder non-+q-user [12:02:16] argh [12:02:22] PHP_CodeSniffer uses the pear bugtracker :/ [12:02:55] <^demon> Why anyone's still using anything related to pear is beyond me :\ [12:04:12] <^demon> At least it's not pear2 ;-) [12:07:34] F*** PEAR [12:07:51] Email is already in use for an existing account [12:07:54] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [12:07:58] grmblb [12:10:02] PROBLEM - Puppet freshness on db43 is CRITICAL: Puppet has not run in the last 10 hours [12:10:27] ^demon: I have worked a bit on having the old parser tests system report junit :D [12:10:39] ^demon: that is not a pretty patch though [12:10:48] <^demon> Heh. [12:11:56] will add you to review :-D [12:12:58] writing it, I figured out I haven't written any new feature in MediaWiki for quite a long time now [12:21:57] I guess you don't know why -with parserfunctions installed- a number of tests are failing with "" is not a valid magic word for "if" [12:22:27] I hate these test bugs which only happen due to some interaction with other tests [12:24:39] <^demon> hashar: Will upgrading zuul require some downtime? [12:24:41] New patchset: Milimetric; "cron job to regenerate mobile apps stats" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54811 [12:24:50] thanks gerrit-wm! :) [12:31:03] ^demon: yeah [12:31:11] ^demon: but I can do it during European mornings :-] [12:31:14] <^demon> Maybe we should schedule a gerrit update for the same window.
I've got a couple of things I'm wanting to pull into our install. Makes sense to take both services down at the same time to minimize disruption. [12:31:23] ^demon: and I need some python modules to be installed on gallium first. [12:38:54] PROBLEM - Puppet freshness on xenon is CRITICAL: Puppet has not run in the last 10 hours [12:39:54] PROBLEM - Puppet freshness on cp1034 is CRITICAL: Puppet has not run in the last 10 hours [13:03:16] New review: Faidon; "+2 on the idea, that's how I've been running puppet on every other setup I've ever been." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/54815 [13:07:23] Change merged: Faidon; [operations/debs/ruby-jsduck] (master) - https://gerrit.wikimedia.org/r/54818 [13:09:23] New review: Faidon; "Definitely. Please just import the package from e.g. Debian experimental and fork your changes from ..." [operations/debs/python-jsonschema] (master) C: -1; - https://gerrit.wikimedia.org/r/54782 [13:12:29] New review: Faidon; "I don't see the point for this (and what's makes it different than the rest), and I'm not a fan of e..." [operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/54798 [13:12:32] New patchset: Mark Bergsma; "Purge buffered data when fflush() doesn't work" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/54683 [13:13:20] found the memleak? [13:13:34] no [13:13:54] i have a suspicion that it could be it [13:24:10] hello? [13:24:15] I run git status on stat1 [13:24:20] and it doesn't work [13:24:25] on valid git repos [13:24:33] what should I do? [13:24:48] "it doesn't work"? [13:24:56] New review: Faidon; "I don't see the rest of my comments addressed, pip being the most important one. Also, you seemed to..." 
[operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/53587 [13:26:11] mark: average_drifter means that it just sits there hanging [13:26:26] I've noticed the same thing, and additionally, saving files from vim takes a crazy long time [13:26:55] it's almost like the filesystem is borked somehow [13:27:01] or ssh [13:27:13] New review: Faidon; "Looks great; can someone (ottomata?) merge, build a package and put it up in apt?" [operations/debs/python-voluptuous] (master) C: 2; - https://gerrit.wikimedia.org/r/44408 [13:27:17] which git repo can I try? [13:27:29] so in vim issuing a save makes it hang for up to 60 seconds [13:28:02] saving where? [13:28:26] i'm working in my home directory, /home/dandreescu [13:28:43] /home/spetrea/wikistats/pageviews_reports is an example of a git repository where you can try git status [13:30:13] i can keep issuing vim commands and after about 60 seconds the screen refreshes and reflects my changes. so it's exciting, but not very productive :) [13:30:20] no that's a symlink [13:30:38] try /home/spetrea/wikistats/wikistats [13:31:07] ha, yeah, now it's fine - of course! 
[13:31:20] that symlink is best removed [13:31:36] but the problem seems intermittent, with the most consistent thing being the vim problem I described [13:31:42] i'll try editing with nano [13:31:53] i doubt vi is the problem [13:32:52] heh, crazy - editing with nano works perfectly [13:33:30] note that stat1 is running some jobs and its disks are very loaded at times [13:33:51] 100% utilized actually [13:33:56] that may easily explain some slowness [13:34:30] wonder if it's just vim's .swp and history tracking that makes it choke [13:34:39] anyway, thanks, I suspected it was spiked [13:49:02] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [13:49:48] New review: Hashar; "rebased" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/44709 [13:57:15] hashar: can we do loopback-mounted filesystem images in labs? [13:57:18] then we could use XFS [13:57:31] mount -o ? [13:57:35] grrp [13:57:53] mount -t xfs /some/file.img /srv/sda3 -o loop,... [13:58:05] that would be a lot closer to production than what you do now [13:58:05] we can try it out :-] [13:58:59] I am not sure whether XFS matters or not [13:59:08] since we are mostly testing out MediaWiki itself [13:59:34] each instance has a /dev/vdb virtual device, maybe it can be formatted using XFS [13:59:35] it's definitely a goal to get the two environments as close to each other as possible [13:59:56] why not use loopback images? [13:59:58] then you can have two, format them with xfs however you want [14:00:13] and since it's already virtualized anyway, another layer of indirection is not really gonna matter anymore anyway [14:00:51] that is for varnish isn't it ?
[14:00:57] for varnish yes [14:01:06] so we can keep the storage backends the same as in production [14:01:09] (just a bit smaller I guess ;) [14:01:27] VCL refers to storage for example, it sucks if it has to be different in labs [14:01:32] and I don't see why it needs to be [14:01:41] ahh now I see what you mean [14:02:01] so /dev/vdb will hold some XFS images which would be mounted as /srv/sda3 and /srv/sdb3 [14:02:09] yes [14:02:12] XFS images as normal files [14:02:15] linux supports that [14:02:25] those images can live wherever [14:02:25] if you ever find out how to generate the file using puppet .. go for it :-] [14:02:32] easy [14:02:36] just run mkfs :) [14:02:51] we already do that in puppet in some places [14:02:53] at least for swift [14:02:58] and squid I think [14:03:19] so make the file, give it a certain size [14:03:21] then run mkfs on it [14:03:23] ah files/squid/setup-aufs-cachedirs [14:03:31] yeah but you don't need that script [14:03:39] just run dd to create the file [14:03:43] then mkfs it [14:03:45] then mount it [14:05:59] aharhazrar [14:06:04] I hate our role::cache class [14:06:09] i like it [14:06:25] and then the size of the storage backends can be put in a hash in the ::configuration class [14:06:29] the upload cache uses the LVS service IPs which do not exist in labs :-] [14:06:55] PROBLEM - Puppet freshness on mw1006 is CRITICAL: Puppet has not run in the last 10 hours [14:07:00] for the bits cache I had to do some case $::realm to have labs use the role::cache::configuration::backends instead of LVS IPs [14:07:06] damn that is going to be ugly again [14:07:15] we should really make LVS work [14:07:54] yeah ideally [14:08:41] first filing a bug about that XFS trick [14:11:55] mark: I have logged the XFS trick at https://bugzilla.wikimedia.org/show_bug.cgi?id=46359 [14:34:15] is there some current or recent caching problem where e.g. the enwiki [[Main Page]] was stale?
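The dd / mkfs / mount recipe mark spells out above can be sketched as a few commands. The image size and mount point are made up for illustration, and the privileged steps are shown as comments since they need root and xfsprogs:

```shell
# Sketch of the dd / mkfs / mount loopback recipe described above, with an
# assumed 1 GiB size and illustrative paths. Creating the sparse image file
# itself needs no privileges:
IMG=/tmp/varnish-storage.img
dd if=/dev/zero of="$IMG" bs=1M count=0 seek=1024 2>/dev/null  # 1 GiB sparse file
ls -lh "$IMG"
# The remaining steps need root and xfsprogs, so they are left as comments:
#   mkfs.xfs -f "$IMG"
#   mkdir -p /srv/sda3
#   mount -t xfs -o loop "$IMG" /srv/sda3
```

Because the file is sparse, it occupies almost no space until the filesystem inside it is actually written to, which suits a labs instance with limited disk.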
[14:34:31] there's a handful (at least) of OTRS tickets about it within the last day or so [14:34:57] actually, they were all within the last 6 hours [14:35:26] someone sent a response saying it was being investigated. but he didn't give a bug # or anything and he's offline now [14:35:30] * jeremyb_ scrolls up [14:47:47] New patchset: Hashar; "upload cache in labs no more rely on LVS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54863 [14:54:48] hmm the topics are a bit misleading [14:54:59] New patchset: Hashar; "upload cache in labs no more rely on LVS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54863 [14:55:00] New patchset: Hashar; "mobile cache no more rely on LVS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54864 [14:56:11] mark: so in labs I need to get rid of lvs_service_ips in the role::cache classes. That is very simple for mobile https://gerrit.wikimedia.org/r/#/c/54864/1/manifests/role/cache.pp,unified :-] [14:56:59] and a bit uglier for upload :-( [14:57:00] https://gerrit.wikimedia.org/r/#/c/54863/2/manifests/role/cache.pp,unified [14:57:50] New review: Hashar; "You might want to first look at the mobile change which is simpler https://gerrit.wikimedia.org/r/..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54863 [14:58:55] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [14:59:43] i don't really see the problem right now, i'll have to look at it in more detail later [15:01:58] both squid and varnish upload cache broken https://bugzilla.wikimedia.org/show_bug.cgi?id=46350 [15:04:44] New patchset: Hashar; "(bug 44041) adapt role::cache::mobile for beta" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44709 [15:04:51] mutante: hehehe, silly people thinking they can undo SMTP transactions :-) [15:05:04] PROBLEM - MySQL Recent Restart on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:06:19] jeremyb_? who tries to do such thing? [15:06:55] New patchset: Hashar; "Varnish rules for Beta cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47567 [15:06:55] RECOVERY - MySQL Recent Restart on db1047 is OK: OK 143225 seconds since restart [15:07:08] New review: Hashar; "rebased" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47567 [15:07:13] Platonides: we got a message on an already existing thread that said: [15:07:16] $fname $lname would like to recall the message, "$subject". [15:07:31] Platonides: i assume it was autogenerated by MS outlook or something [15:09:40] jeremyb_: That's an outlookism. It has magic in it that tells an Exchange server to actually "unsend" the message from its local mailboxes. [15:09:51] right, i figured [15:10:04] orwell! [15:10:46] jeremyb_: I think that was a feature they added when the PHBs started yelling about their not being able to undo unwise reply-all. :-) [15:11:02] PHB? [15:11:10] Pointy-Haired Bosses [15:11:12] Platonides: what say you? [15:11:15] hah [15:11:58] !log Fixed purging for all sites but eqiad (which wasn't broken) [15:12:06] Logged the message, Master [15:16:32] mark: did you see my question earlier about stale pages? [15:16:47] 14:34:14 UTC [15:17:18] yes [15:17:23] * jeremyb_ is not quite understanding that last !log. it was broken and not broken? [15:17:37] it was broken for all caches except those in eqiad [15:17:49] in practice that just means, esams [15:17:56] ohhh, i guess i'm just half asleep still [15:18:22] so, !log is related to OTRS you think? [15:18:45] i have no idea what's in OTRS, but i've heard of caching problems in esams before [15:18:51] and that's definitely caused by the problem I just fixed [15:18:57] hrmm, ok [15:18:59] unfortunately the root of the problem is not fixed... [15:19:03] well, we'll see if more come in [15:19:09] what's that? 
[15:20:27] multicast routing not working well with asymmetric routing [15:22:35] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 192 seconds [15:24:35] ok, thanks [15:25:34] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds [15:32:40] New patchset: Hashar; "zuul: need python-voluptuous 0.6.1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54866 [15:32:46] New review: Hashar; "Once the package is available in our APT repository, we can get it deployed with https://gerrit.wik..." [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/44408 [15:33:21] New review: Hashar; "That change requires the python-voluptuous module to be added in our repository https://gerrit.wikim..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/54866 [15:34:25] Change merged: Ottomata; [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/44408 [15:35:39] New patchset: Hashar; "zuul: need python-voluptuous 0.6.*" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54866 [15:50:37] hmm, hashar, [15:50:48] python-voluptuous debuild didn't actually put anything in the .deb [15:50:53] ohh [15:50:55] $ dpkg -c python-voluptuous_0.6.1-1_all.deb [15:50:55] drwxr-xr-x root/root 0 2013-03-20 15:50 ./ [15:50:55] drwxr-xr-x root/root 0 2013-03-20 15:50 ./usr/ [15:50:55] drwxr-xr-x root/root 0 2013-03-20 15:50 ./usr/share/ [15:50:55] drwxr-xr-x root/root 0 2013-03-20 15:50 ./usr/share/doc/ [15:50:56] drwxr-xr-x root/root 0 2013-03-20 15:50 ./usr/share/doc/python-voluptuous/ [15:50:56] -rw-r--r-- root/root 225 2013-03-20 15:50 ./usr/share/doc/python-voluptuous/changelog.Debian.gz [15:50:57] -rw-r--r-- root/root 1936 2013-03-20 15:34 ./usr/share/doc/python-voluptuous/copyright [15:51:02] :-( [15:51:28] and I gave it a +2? 
[15:51:29] yay me [15:51:41] I am pretty sure I once managed to get a .deb out of it [15:51:48] here's what I did [15:51:54] could it be related to the debuild command used? [15:52:01] i jsut ran [15:52:02] debuild [15:52:09] did you merge with the source? [15:52:20] just use git-buildpackage [15:52:23] with --git-overlay [15:52:28] add a debian/gbp.conf [15:52:31] i uscaned to get the source [15:52:35] as hashar told me to [15:53:32] https://gerrit.wikimedia.org/r/gitweb?p=operations/debs/ruby-dimensions.git;a=blob;f=debian/gbp.conf;h=c87b3c6ecfdb2188e44e34cb3c33b64a93278e20;hb=942961588bcfba03e3bc6968db0532b8d600ed67 [15:53:40] add this [15:54:32] mkdir ../{tarballs,build-area}; uscan --download-current-version --destdir ../tarballs; git-buildpackage [15:56:09] i can probably hack to fix this: [15:56:10] dpkg-source: info: local changes detected, the modified files are: [15:56:11] python-voluptuous/.gitreview [15:56:11] python-voluptuous/gbp.conf [15:56:11] dpkg-source: error: aborting due to unexpected upstream changes, see /tmp/python-voluptuous_0.6.1-1.diff.C78bKU [15:56:14] but what is the proper thing? [15:56:22] git review, gbp.conf aren't in the tarball [15:56:50] and drop .gitreview? [15:56:53] k [15:56:54] or add it to .gitignore [15:57:06] can't we ignore it when building ? [15:57:07] you can have a post script in gbp.conf that cleans that up [15:58:09] ahh [15:58:10] filter= [15:58:13] no [15:58:37] prebuild = rm .gitreview [15:58:41] iirc [15:58:47] filter is for import-orig [15:58:49] (again, iirc) [15:58:52] paravoid, hashar, that looks way better [15:59:11] voluptuous.py and symlinks etc. [15:59:14] ok cool, adding to apt... [15:59:17] the other way is to not use overlay = True [15:59:23] and import all of the source into a different branch [15:59:33] e.g. 
upstream for the upstream source / master for the Debian branch [15:59:36] or master / debian [15:59:42] but this would mess too much with gerrit I think [15:59:57] that's what i'm doing for my packages [15:59:57] might want to update the gbp.conf to filter it out or use the branch so [16:00:03] non-overlay is the standard in Debian [16:00:32] Heja opsen, looks like there are still new cache purging issues for Europe, see https://bugzilla.wikimedia.org/show_bug.cgi?id=46350 - could somebody take a look at it, or shall I copy that to an RT ticket? [16:00:35] non-overlay should also be used if we're upstream and upstream source is in git/gerrit and going through our processes [16:00:54] andre__: I think mark fixed this some time ago [16:01:03] SAL indicates that [16:01:08] 15:12 mark: Fixed purging for all sites but eqiad (which wasn't broken) [16:01:50] ottomata: debian/gbp.conf belongs in the git tree btw [16:01:54] paravoid: Oh awesome timing, thanks for the hint! I should read up on SAL first next time, I guess. :) [16:01:54] PROBLEM - Puppet freshness on mw1094 is CRITICAL: Puppet has not run in the last 10 hours [16:01:56] add it and amend [16:02:02] andre__: that's okay :) [16:02:30] it's nice to actually have someone look at what people report as problems and escalate when needed [16:02:55] PROBLEM - Puppet freshness on colby is CRITICAL: Puppet has not run in the last 10 hours [16:02:55] PROBLEM - Puppet freshness on mw1052 is CRITICAL: Puppet has not run in the last 10 hours [16:02:56] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [16:03:03] !log added python-voluptuous package to apt repo [16:03:09] Logged the message, Master [16:03:23] both dsc and deb, right? [16:03:23] paravoid, I should push my changes?
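The overlay workflow paravoid describes above (tarball from uscan, debian/ directory in git, git-buildpackage with `--git-overlay`) can be captured in a `debian/gbp.conf`. This is a sketch modelled on the ruby-dimensions example linked in the log; the branch names and directory paths are assumptions for illustration:

```ini
# debian/gbp.conf -- overlay layout, per the discussion above.
[DEFAULT]
upstream-branch = master
debian-branch = master

[git-buildpackage]
overlay = True
tarball-dir = ../tarballs
export-dir = ../build-area
# drop files that live only in git, not in the upstream tarball,
# so dpkg-source does not abort on "unexpected upstream changes"
prebuild = rm -f .gitreview
```

With that in place, the build flow quoted in the log is: `mkdir ../{tarballs,build-area}; uscan --download-current-version --destdir ../tarballs; git-buildpackage`.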
[16:03:50] the debian/gbp.conf , I guess so [16:03:54] hm, ok [16:03:55] PROBLEM - Puppet freshness on mw1008 is CRITICAL: Puppet has not run in the last 10 hours [16:03:58] debian/gbp.conf should be committed [16:04:09] i also modified changelog and removed .gitreview... [16:04:21] (had to modify changelog to put my name in there) [16:04:29] and you should include in apt both debian binaries (.deb) and debian sources (.dsc + .orig.tar.gz + .diff.gz or .debian.tar.gz) [16:04:40] i imported from the changes file [16:04:42] that does both, right? [16:04:54] PROBLEM - Puppet freshness on mw1056 is CRITICAL: Puppet has not run in the last 10 hours [16:05:16] ottomata: thank you andrew :-] [16:05:28] ottomata: right [16:05:33] then I could use a review of https://gerrit.wikimedia.org/r/#/c/54866/ which installs the package making sure to use 0.6* (not 0.7 or later) [16:06:19] hashar, can't you just say to ensure that a particular version of that package is installed [16:06:22] rather than using apt::pin? [16:06:33] ensure => 0.6.1 for example ? [16:06:54] PROBLEM - Puppet freshness on cp1026 is CRITICAL: Puppet has not run in the last 10 hours [16:07:01] On packaging systems that can retrieve new packages on their own, you can choose which package to retrieve by specifying a version number or latest as the ensure value [16:07:02] yeah [16:07:20] yes you can use that [16:07:25] even better, fix zuul to function with 0.7 :) [16:07:38] the pin is wrong [16:07:51] just do package { ...: ensure => '0.6.1-1' } [16:07:53] yeah 0.7 support is work in progress [16:08:41] patch coming [16:08:51] New review: Ottomata; "Use ensure => 0.6.1 (or 0.6.1-1?) on package { 'python-voluptuous' ...
}, instead of using apt::pin" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54866 [16:09:02] slooow [16:09:04] New patchset: Hashar; "zuul: need python-voluptuous 0.6.1-1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54866 [16:09:15] ottomata: https://gerrit.wikimedia.org/r/#/c/54866/3/modules/zuul/manifests/init.pp,unified [16:10:01] 0.6.1-1 [16:12:52] moving out, will follow up later tonight :] [16:12:57] thanks for the packaging work! [16:14:26] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54866 [16:16:31] hashar, merged, laters! [16:34:22] mark ? [16:34:35] yes? [16:34:46] hi [16:35:00] i had some problems with stat1 today [16:35:00] New patchset: Ottomata; "Allowing rsync to /var/www from stat1 for hosting of public data." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54871 [16:35:47] mark: can you help me please to find a way to fix stat1 so i can roll out reports ? [16:36:05] or is it fixed now ? [16:36:06] what problems? [16:36:08] average_doc (are you also average_drifter?) [16:36:11] what's da prob? [16:36:16] git and vim were acing up on it [16:36:34] acting* [16:37:36] ottomata: its me , average , but im at the doc in the waiting room [16:37:40] haha, ok [16:37:45] what do you mean git and vim were acting up? [16:37:54] PROBLEM - Puppet freshness on db66 is CRITICAL: Puppet has not run in the last 10 hours [16:38:12] well i ran git status [16:38:21] and it took like 2minutes [16:38:38] that's probably just because erik zachte (or others?) 
is running jobs on stat1 [16:38:43] and i ran git checkout on some branch and it took centuries [16:38:44] utilizing all disk i/o bandwidth [16:38:49] vim also was unresponsive [16:39:08] not much we can do about that [16:39:38] mark , erik ran a job that only used 100% of one cpu out of the 16 available [16:39:53] and 100% of disk usage [16:40:05] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54871 [16:40:14] about disk, i myself ran very intense stuff on stat1 and it never behaved like that [16:41:07] it's not busy right now [16:41:13] so it probably works fine now [16:41:13] average_doc, if the process erik was running isn't multi threaded/ multi proc, its only going to run on one cpy [16:41:15] cpu [16:41:48] yeah, and there are 16 so.. its not likely that it would be causing this [16:41:57] as i said... [16:42:02] it's not cpu, but disk i/o that was the problem [16:42:14] it doesn't matter that 15 cpu cores are idling if it's disk i/o that you need [16:47:02] mark how did you assess disk i/o to be the problem on stat1 ? did you run iotop ? [16:47:19] iostat, iotop work [16:48:21] and did it confirm loads of io ? [16:48:32] New review: Ottomata; "(6 comments)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54811 [16:48:48] yes it did at the time [16:49:30] mark can you try again now with iostat or iotop on halfak's process please ? [16:50:04] iotop shows a few percent of disk i/o on sda [16:50:07] you should be fine right now [16:51:46] paravoid: I don't know what it is but we keep hitting more doom. [16:51:50] jsduck still won't run [16:51:55] /usr/lib/ruby/vendor_ruby/parallel.rb:275:in `write': Broken pipe (Errno::EPIPE) [16:52:03] " Aborted jsduck --" [16:52:13] i know what it is [16:52:16] it's ruby!
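When iostat or iotop aren't installed, the same disk-saturation check mark describes can be approximated by sampling /proc/diskstats twice. A rough sketch (the 1-second interval is arbitrary; field positions are per the kernel's iostats documentation: field 3 is the device name, 6 sectors read, 10 sectors written, cumulative since boot):

```shell
# Estimate per-device sector throughput from two /proc/diskstats samples,
# roughly what "iostat -d 1" would report.
interval=1
t0=$(awk '{print $3, $6, $10}' /proc/diskstats)
sleep "$interval"
report=$(awk -v t0="$t0" -v dt="$interval" '
BEGIN {
    # rebuild the first sample: name, sectors-read, sectors-written triples
    n = split(t0, f, /[ \t\n]+/)
    for (i = 1; i + 2 <= n; i += 3) { rd[f[i]] = f[i+1]; wr[f[i]] = f[i+2] }
    printf "%-12s %12s %12s\n", "device", "rd_sect/s", "wr_sect/s"
}
$3 in rd { printf "%-12s %12.0f %12.0f\n", $3, ($6 - rd[$3]) / dt, ($10 - wr[$3]) / dt }
' /proc/diskstats)
echo "$report"
```

A device pegged near its sequential sector rate while 15 cpus idle is exactly the pattern described above: the box is i/o-bound, not cpu-bound.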
[16:52:28] Krinkle: I've spent a considerable amount of my week for this [16:52:58] we don't really have the resources as a team to spend so many hours for a tiny fraction of a tiny project [16:53:20] and this sounds to me like a problem with the app itself, not the packaging [16:53:38] I'll be happy to build a new package for you when you find and fix that bug [16:53:53] but I can't write more ruby, at least not this week :) [16:54:22] I don't know. From a general perspective what I see is that it works when installed from gem, and it's not working after packaging. version mismatch or incorrect bundling. [16:54:38] The app itself is fine, it's probably another outdated package in the debian repo [16:54:43] I'll fix it. [16:54:58] YuviPanda: Can you vouch an irc nick for me? Email is easy to spoof headers [17:00:54] PROBLEM - Puppet freshness on europium is CRITICAL: Puppet has not run in the last 10 hours [17:04:24] paravoid: OK, I think I fixed it. It was a problem in the way I called the program, the package is fine [17:04:29] paravoid: Thanks for merging the extjs fix [17:04:40] "outdated package" [17:04:48] mark: That was the problem yesterday [17:04:50] and the day before [17:04:53] everything is outdated in the ruby world if it's not a git checkout from less than a week ago [17:04:58] that's just not reasonable [17:06:13] Krinkle: so what was the fix then? [17:06:14] the problem is that the software was depending on another gem, both of which are under active development. One of them didn't have a debian package so paravoid created it. 
The other did have a debian package so we just added a dependency on it [17:06:17] mark++ [17:06:37] but then it turned out that that package was 3 years old and not updated (the debian package didn't keep up with upstream) [17:06:37] New patchset: Ottomata; "Fixing rsync module name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54876 [17:06:50] the API had changed completely, [17:06:52] anyway [17:07:05] for this latest problem [17:07:28] paravoid: Though you merged it, I'm still seeing the old error of yesterday. Did you push it to the apt repo? Or should a root run apt-get upgrade on gallium? [17:07:48] I didn't [17:07:50] later [17:09:16] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54876 [17:15:44] RECOVERY - Puppet freshness on db35 is OK: puppet ran at Wed Mar 20 17:15:38 UTC 2013 [17:24:36] New patchset: Milimetric; "cron job to regenerate mobile apps stats" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54811 [17:25:25] RECOVERY - Puppet freshness on mc1002 is OK: puppet ran at Wed Mar 20 17:25:17 UTC 2013 [17:26:39] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54811 [17:26:55] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [17:26:55] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [17:26:56] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [17:26:56] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [17:33:28] New patchset: Milimetric; "run every hour and fix single quotes problems" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54877 [17:36:15] RECOVERY - Puppet freshness on labstore1 is OK: puppet ran at Wed Mar 20 17:36:04 UTC 2013 [17:37:43] Change merged: Ottomata; [operations/puppet] (production) -
https://gerrit.wikimedia.org/r/54877 [17:39:30] New patchset: Milimetric; "better synchronize running of cron job with sql" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54878 [17:49:07] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54878 [17:56:10] Change merged: RobH; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/54801 [18:06:54] PROBLEM - Puppet freshness on hume is CRITICAL: Puppet has not run in the last 10 hours [18:08:56] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [18:15:37] RobH: http://bots.wmflabs.org/~wm-bot/searchlog/index.php?action=search&channel=%23wikimedia-labs [18:22:11] hashar: im getting your ssl certs for gallium [18:22:23] RobH: nice! thanks :-] [18:22:24] the mediawiki.org ones are forwarded? [18:22:32] RobH: yeah I commented on the RT ticket [18:22:32] redirected i mean. [18:22:45] RobH: we can get rid of the mediawiki.org entries if you want, no need to put cert there. [18:22:50] if it can save some certs [18:23:02] heh, i think they are free on the place im on [18:23:07] if not, i pay 60 and we get as many as we want [18:23:10] {doc,integration}.mediawiki.org use a permanent redirect of /* to their wm.o equivalents [18:23:16] but they wont be wildcard, so its one cert per fqdn [18:23:20] okk [18:23:23] works for me :-] [18:23:24] cool [18:23:28] thx [18:23:36] er? [18:23:36] we can't do one cert per fqdn [18:23:43] we won't give 4 IPs to gallium [18:23:44] RobH: might need to do some apache changes too [18:24:02] and we shouldn't do SNI for this either [18:24:02] paravoid: we already have multiple certs on the single IP of gallium [18:24:08] paravoid: added comment to not put them in /etc/ssl/certs :p [18:24:15] I think that is not supported by IE8 though but not really a problem [18:24:17] that was interesting.. [18:24:34] can we please just do StartSSL and be done with it?
[18:24:48] i think that's what rob is doing [18:24:52] StartSSL [18:24:55] paravoid: what do you think im doing dude [18:24:57] this is startssl. [18:25:00] oh? [18:25:07] i resolved the startssl ticket. [18:25:18] class 2? [18:25:31] just the first one so far [18:25:34] do we need class 2 and why? [18:25:42] (they took days to get this approved ;) [18:25:49] RobH: can you also update the apache configurations for gallium? They are somewhere under /modules/contint/ ;)D [18:25:57] because a) how else are you going to do gallium's fqdns? [18:26:07] paravoid: I validated wikimedia.org [18:26:11] i can make any wikimedia.org cert i want. [18:26:13] and b) we shouldn't float valid certificates for the wikimedia.org name without the foundation's name [18:26:36] hop time to get my daughter to bed :-) [18:26:40] paravoid: you lost me. [18:27:00] (a) is 20:23 < RobH> but they wont be wildcard, so its one cert per fqdn [18:27:04] I don't see what the issue is when I signed up and validated that my email and wikimedia.org domain are indeed under my(our) control [18:27:14] paravoid: because we dont do wildcard wikimedia.org certs on any server anymore [18:27:27] if we did, then there is no real reason not to use the same wildcard cert on everything [18:27:31] which is what everyone said we needed to stop doing. [18:27:43] no, but gallium needs 4 fqdns [18:27:49] at least 4 [18:28:00] paravoid: So what do you want me to do? [18:28:07] get a certificate with 4 SANs? [18:28:10] Because I thought I was doing what you guys asked. [18:28:19] i'll try to do that if this supports it [18:28:22] i have not gotten that far. [18:28:25] but sure, will try [18:28:31] needs 2, not 4. [18:28:39] how come? [18:28:45] how come what? [18:28:52] what are the two? [18:29:07] we want doc.wikimedia.org and integration.wikimedia.org. [18:29:07] and mediawiki.org [18:29:08] no [18:29:09] for the redirects [18:29:10] paravoid: what if we want to move doc to another server? [18:29:11] yes [18:29:17] ...
[18:29:17] Ryan_Lane: revoke and reissue [18:29:22] we want to certify a redirect? [18:29:25] yes [18:29:27] why not just get multiple individual certs? [18:29:28] why? [18:29:32] Ryan_Lane: and? SNI? [18:29:54] just give each their own ip [18:29:55] RobH: because people will type it and get certificate errors and will not cost us more than 30 extra seconds [18:29:56] * Ryan_Lane shrugs [18:30:02] Ryan_Lane: 4 IPs for gallium? :) [18:30:06] or even two? [18:30:14] I guess we could [18:30:14] so apache serves the https cert, then redirects the url [18:30:15] ? [18:30:22] meh. that does kind of suck... [18:30:25] rather than just rewriting/redirecting [18:30:26] ? [18:30:51] RobH: get one with all the FQDNs as SANs [18:30:53] RobH: you type doc.mediawiki.org, you have HTTPS Everywhere installed [18:31:04] wait... [18:31:08] so you connect to https://doc.mediawiki.org/ [18:31:12] why is doc.mediawiki.org on gallium? [18:31:26] why don't we put that on the cluster and redirect from there? [18:31:50] RobH: if you don't have that as a SAN, you'll get a certificate error; if you do, your browser will connect and get a HTTP redirect for doc.wikimedia.org [18:32:19] we should definitely move the redirect away from gallium [18:32:32] Ryan_Lane: the redirect is there because doc.wikimedia.org is there [18:32:49] that's not a really good reason, though. [18:32:58] no, just simpler so far :) [18:33:09] not if we need to deal with ssl because of it ;) [18:33:15] heh, I guess so [18:33:28] that eliminates one fqdn at least :) [18:33:36] otoh I'm not sure how complicated I'd like to see on nginx configs [18:33:37] so are we not including doc now? [18:33:47] so, we have doc.wm.o and integration.wm.o [18:34:07] paravoid: we can push it back to apache on the appservers if you would prefer ;) [18:34:30] I'll defer that to you :) [18:34:37] hahaha [18:34:41] no seriously [18:35:00] other than integration and doc, what else is there?
[18:35:10] you're taking point on https and I'm okay with you making that decision :) [18:35:24] we're doing all redirects at the appservers as is. [18:35:30] including silly ones like this [18:35:34] RobH: and the other point about class 2 is [18:35:41] class 2 means that we get O=Wikimedia Foundation [18:35:43] eventually we should move all of these on to varnish [18:35:46] you don't get that on class 1s [18:35:52] I don't think the https servers should be doing this at all [18:35:57] and I'm not sure I'd like wikimedia.org certificates without O=Wikimedia Foundation out there [18:36:06] dont we already have that? [18:36:21] we never did extended validation for rapidssl [18:36:27] I'm not talking about EV [18:36:51] it's DV vs. OV if you want to speak CA lingo [18:37:14] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.10149775862 (gt 8.0) [18:37:43] the difference being, the former will have a subject of description=(random string), C=US, CN=doc.wikimedia.org [18:37:51] while the latter will also say O=Wikimedia Foundation [18:37:57] paravoid: so in startssl lingo, you want me to escalate us to startssl verified? [18:38:04] I think so [18:38:09] that opens class 2/3 [18:38:17] (I'm sure that's the difference, I think that I prefer that) [18:38:42] ok, well, we have used class 1 i think, but if thats what we want [18:38:45] then we can do that. [18:38:50] Ryan_Lane: other than integration and doc, there is going to be "zuul.wm" [18:39:07] no [18:39:09] hashar was planning a zuul stats page, added to DNS as cname for gallium [18:39:16] ah. ok [18:39:17] we agreed not to do that [18:39:20] the other day [18:39:24] heh [18:39:27] ..o..k [18:39:38] sec [18:39:53] mutante: https://gerrit.wikimedia.org/r/#/c/54466/ [18:40:09] see my & Krinkle's comments [18:40:30] (and how it's abandoned) [18:41:05] i see.. gotcha [18:41:52] then let me just add for completeness..
we already removed the OLD mediawiki docs yesterday, on http://svn.mediawiki.org/doc/ [18:41:57] it redirects to the new one now [18:43:29] PROBLEM - Full LVS Snapshot on rdb1001 is CRITICAL: NRPE: Command check_lvs not defined [18:49:54] !log apache-graceful-all, remove boards.wm redirect [18:50:00] Logged the message, Master [18:50:50] welcome back logmsgbot [18:51:24] PROBLEM - Apache HTTP on mw1176 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:24] PROBLEM - Apache HTTP on mw1063 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:24] PROBLEM - Apache HTTP on mw1181 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:24] PROBLEM - Apache HTTP on mw1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:24] PROBLEM - Apache HTTP on mw1082 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:24] PROBLEM - Apache HTTP on mw1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:24] PROBLEM - Apache HTTP on mw1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:25] PROBLEM - Apache HTTP on mw1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:25] PROBLEM - Apache HTTP on mw1048 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:26] PROBLEM - Apache HTTP on mw1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:26] PROBLEM - Apache HTTP on mw1097 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:27] PROBLEM - Apache HTTP on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:27] PROBLEM - Apache HTTP on mw1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:28] PROBLEM - Apache HTTP on mw1110 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:28] PROBLEM - Apache HTTP on mw1099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:29] PROBLEM - Apache HTTP on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:29] PROBLEM - Apache HTTP on mw1035 is 
CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:30] PROBLEM - Apache HTTP on mw1074 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:30] PROBLEM - Apache HTTP on mw1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:31] PROBLEM - Apache HTTP on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:31] PROBLEM - Apache HTTP on mw1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:32] PROBLEM - Apache HTTP on mw1078 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:32] PROBLEM - Apache HTTP on mw1108 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:34] PROBLEM - LVS HTTPS IPv4 on wikivoyage-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:40] RECOVERY - Apache HTTP on mw1035 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.886 second response time [18:51:41] PROBLEM - Apache HTTP on mw1046 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:41] PROBLEM - Apache HTTP on mw1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:41] PROBLEM - Apache HTTP on mw1068 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:41] RECOVERY - Apache HTTP on mw1083 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.088 second response time [18:51:41] RECOVERY - Apache HTTP on mw1033 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 4.848 second response time [18:51:41] PROBLEM - Apache HTTP on mw1095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:51:42] RECOVERY - LVS HTTPS IPv4 on wikivoyage-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 757 bytes in 4.969 second response time [18:51:42] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.077 second response time [18:51:43] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 7.115 second response time 
[18:51:50] RECOVERY - Apache HTTP on mw1030 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 4.945 second response time [18:51:50] RECOVERY - Apache HTTP on mw1082 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 5.616 second response time [18:51:50] RECOVERY - Apache HTTP on mw1074 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.565 second response time [18:51:50] RECOVERY - Apache HTTP on mw1110 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.607 second response time [18:51:50] RECOVERY - Apache HTTP on mw1090 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.439 second response time [18:51:51] RECOVERY - Apache HTTP on mw1108 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.992 second response time [18:51:51] RECOVERY - Apache HTTP on mw1063 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.063 second response time [18:51:52] RECOVERY - Apache HTTP on mw1029 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.202 second response time [18:51:58] w t f [18:52:17] looks like what happened to me the other day [18:53:57] RECOVERY - Apache HTTP on mw1068 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time [18:53:57] RECOVERY - Apache HTTP on mw1095 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.066 second response time [18:54:07] RECOVERY - Apache HTTP on mw1181 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.058 second response time [18:54:08] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.072 second response time [18:54:08] RECOVERY - Apache HTTP on mw1054 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.092 second response time [18:54:08] RECOVERY - Apache HTTP on mw1048 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.051 second response time [18:54:08] RECOVERY - Apache HTTP on mw1055 is OK: HTTP OK: 
HTTP/1.1 301 Moved Permanently - 747 bytes in 0.075 second response time [18:54:08] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.053 second response time [18:54:08] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.054 second response time [18:54:09] RECOVERY - Apache HTTP on mw1078 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time [18:54:09] RECOVERY - Apache HTTP on mw1176 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.062 second response time [18:54:10] RECOVERY - Apache HTTP on mw1044 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.389 second response time [18:54:17] RECOVERY - Apache HTTP on mw1046 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.062 second response time [18:54:18] RECOVERY - Apache HTTP on mw1079 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.217 second response time [18:54:43] testing random ones, they look fine and normal, no config problem [18:55:16] ok, now startssl is processing my personal details [18:55:24] then they'll let me authenticate our org. [18:57:28] ? [18:57:32] I thought chris was going to do that? [18:57:33] or me? [18:57:47] i decided if they steal my identity [18:57:50] i'll blame you ;] [18:57:51] why are you doing something you weren't comfortable doing? :) [18:57:59] because im tired of dealing with it. 
[18:58:07] dude, that wasn't what I was trying to accomplish [18:58:34] it's not my place to force you to do anything, much less anything with your personal data [18:58:48] i'm also very worries about my passport scan sitting on the office scanner memory, but meh [18:58:56] worried even [18:59:00] for purely technical reasons I think class 2 is much better [18:59:03] heh, that should have been fixed:) [18:59:10] there is now scan2mail for reals [18:59:13] mutante: not fileserver, in device memory [18:59:27] browse to printer, read local memory, have rob's passport data [18:59:34] * RobH has trust issues [18:59:39] RobH: there's an android app that uses the NFC to scan passports [18:59:42] powercycle it ? [18:59:45] hehe [18:59:50] and if you put your date of birth to the app to unlock the basic key [18:59:53] you even see your picture [18:59:55] and details [18:59:57] in your phone [19:00:04] jpeg2000 formatted [19:00:08] i am just waiting for NFC pick pockets in the subway :p [19:00:12] even had on the comment the software the passport authority used [19:00:13] * RobH has iphone for now [19:00:16] oops, touched your wallet there [19:00:21] gimme two more years on iphone4s then i jump to droid. [19:00:51] you can also change your name every once in a while :) [19:01:52] well, legally that creates even more papertrail and personal details [19:02:08] i would make a banshee reference, but pretty sure no one watches that. [19:03:19] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 192 seconds [19:04:09] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 212 seconds [19:05:02] New patchset: Asher; "converting the reset of s1 pmtpa to mariadb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54889 [19:06:02] cmjohnson1: did the cat5 from equinix finally get dealt with ? [19:06:05] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54889 [19:06:06] the xc ? 
[19:06:12] binasher: ! [19:06:15] nice :) [19:06:19] lesliecarr [19:06:37] yes, it has been done for a while [19:06:41] okay [19:06:41] thanks [19:06:46] i keep forgetting [19:06:50] is it patched into the srx100 ? [19:07:38] patched to mr1 0/7 [19:07:38] iirc [19:07:46] yay [19:07:49] thank you [19:07:52] and sorry if i ask you again [19:08:17] so I updated https://rt.wikimedia.org/Ticket/Display.html?id=4773 [19:08:18] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 11.00767 (gt 8.0) [19:08:35] and i kinda need some input on the ticket from either Ryan_Lane or paravoid (when you have time, im waiting on startssl to verify so its not urgent) [19:09:29] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 13 seconds [19:09:29] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 8 seconds [19:10:40] !log installing package upgrades on formey [19:10:47] Logged the message, Master [19:11:08] !log installing package upgrades on mw1170-1179 [19:11:14] Logged the message, Mistress of the network gear. [19:11:15] Ryan_Lane: I'll leave it up to you :) [19:13:12] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 182 seconds [19:13:31] PROBLEM - mysqld processes on db60 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [19:13:31] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 184 seconds [19:15:32] paravoid: what? the https stuff? 
[19:15:37] the ticket above [19:16:04] PROBLEM - Apache HTTP on mw117 is CRITICAL: Connection refused [19:16:15] PROBLEM - Apache HTTP on mw1177 is CRITICAL: Connection refused [19:16:15] PROBLEM - Apache HTTP on mw1179 is CRITICAL: Connection refused [19:16:15] PROBLEM - Apache HTTP on mw1176 is CRITICAL: Connection refused [19:16:15] PROBLEM - Apache HTTP on mw1178 is CRITICAL: Connection refused [19:16:24] PROBLEM - Apache HTTP on mw1171 is CRITICAL: Connection refused [19:16:25] PROBLEM - Apache HTTP on mw1174 is CRITICAL: Connection refused [19:16:33] that's mine [19:16:34] PROBLEM - Apache HTTP on mw1175 is CRITICAL: Connection refused [19:16:35] yay [19:16:43] turns out the upgrades aren't as graceful as hoped [19:17:15] PROBLEM - Apache HTTP on mw1172 is CRITICAL: Connection refused [19:17:15] PROBLEM - Apache HTTP on mw1173 is CRITICAL: Connection refused [19:17:15] RECOVERY - Apache HTTP on mw117 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.187 second response time [19:17:27] starting [19:17:55] started them back up [19:18:14] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.057 second response time [19:18:15] RECOVERY - Apache HTTP on mw1173 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.053 second response time [19:18:15] RECOVERY - Apache HTTP on mw1177 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.083 second response time [19:18:15] RECOVERY - Apache HTTP on mw1179 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time [19:18:15] RECOVERY - Apache HTTP on mw1176 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.054 second response time [19:18:15] RECOVERY - Apache HTTP on mw1178 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.055 second response time [19:18:24] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.054 second response time 
[19:18:25] RECOVERY - Apache HTTP on mw1174 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.057 second response time [19:18:39] RECOVERY - Apache HTTP on mw1175 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.052 second response time [19:18:39] !log upgrading mw118* and then restarting apache2 on these boxes [19:18:42] Logged the message, Mistress of the network gear. [19:19:01] New review: Dzahn; "so, /var/mwdocs/phase3 is still there and a /var/mwdocs/phase3-svn which isn't mentioned on this. i ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53954 [19:20:49] New patchset: Aude; "add allowDataTransclusion Wikibase setting, off by default, on for test2wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54893 [19:21:14] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [19:21:25] PROBLEM - Apache HTTP on mw118 is CRITICAL: Connection refused [19:21:25] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [19:21:34] PROBLEM - Host db60 is DOWN: PING CRITICAL - Packet loss = 100% [19:22:15] PROBLEM - mysqld processes on db32 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [19:23:24] RECOVERY - Apache HTTP on mw118 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.174 second response time [19:23:24] RECOVERY - Host db60 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [19:24:09] !log .tar.gz'ing /var/mwdocs on formey (SVN) and then manually delete it [19:24:10] Logged the message, Master [19:27:12] Can someone run sync-common as root on mw1085 please? 
[19:27:14] PROBLEM - Apache HTTP on mw1193 is CRITICAL: Connection refused [19:27:15] PROBLEM - Apache HTTP on mw1194 is CRITICAL: Connection refused [19:27:15] PROBLEM - Apache HTTP on mw1199 is CRITICAL: Connection refused [19:27:15] PROBLEM - Apache HTTP on mw1196 is CRITICAL: Connection refused [19:27:15] PROBLEM - Apache HTTP on mw1197 is CRITICAL: Connection refused [19:27:15] PROBLEM - Apache HTTP on mw1190 is CRITICAL: Connection refused [19:27:15] PROBLEM - Apache HTTP on mw1198 is CRITICAL: Connection refused [19:27:24] PROBLEM - Apache HTTP on mw1195 is CRITICAL: Connection refused [19:27:25] PROBLEM - Apache HTTP on mw1191 is CRITICAL: Connection refused [19:27:25] PROBLEM - Apache HTTP on mw1192 is CRITICAL: Connection refused [19:27:27] Reedy: ok [19:27:58] !log sync-common on mw1085 [19:28:05] Logged the message, Mistress of the network gear. [19:28:15] RECOVERY - Apache HTTP on mw1193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.064 second response time [19:28:15] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.070 second response time [19:28:15] RECOVERY - Apache HTTP on mw1199 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.120 second response time [19:28:15] RECOVERY - Apache HTTP on mw1196 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.064 second response time [19:28:15] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.104 second response time [19:28:15] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.058 second response time [19:28:15] RECOVERY - Apache HTTP on mw1198 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.108 second response time [19:28:24] RECOVERY - Apache HTTP on mw1195 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.059 second response time [19:28:25] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: 
HTTP/1.1 301 Moved Permanently - 747 bytes in 0.064 second response time [19:28:25] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.072 second response time [19:28:43] LeslieCarr: mutante Reedy we need help to get https://gerrit.wikimedia.org/r/#/c/52797/ merged and deployed [19:29:04] it's a second cron job for wikidata dispatching [19:29:07] to clients [19:30:45] aude: why do you want the same one every 5 and every 7 minutes ? [19:30:54] PROBLEM - Host db60 is DOWN: PING CRITICAL - Packet loss = 100% [19:31:24] LeslieCarr: the duration doesn't matter at all. the task runs continuously [19:31:43] some point, maybe we want it to be a daemon but not yet [19:32:04] okay, i'll deploy it but with a comment that we really should have it be a daemon [19:32:10] if it's running continuously [19:32:19] are you around to monitor this ? [19:32:24] RECOVERY - Host db60 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms [19:32:29] LeslieCarr: agree [19:32:34] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.0 [19:32:34] yes we are around [19:33:02] New review: Lcarr; "Approving this with the caveat that we would like this to be a proper daemon. aude is here to monito..." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/52797 [19:33:05] sigh [19:33:14] now startssl is saying they need a copy of my cell phone bill [19:33:17] w.t.f. [19:33:21] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52797 [19:33:21] thanks LeslieCarr :) [19:33:33] these guys are a pain in the ass [19:34:33] root@formey:/var/mwdocs# rm -rf phase3 [19:34:47] Huh. I didn't know startssl. [19:34:58] cheap certs [19:35:11] I must say that free "normal" certs sounds... nice! 
[19:35:24] PROBLEM - Apache HTTP on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:35:37] well, we pay 60 to verify im actually rob halsell [19:35:47] then we pay an additional 60 to verify i am an agent for wikimedia foundation [19:36:02] per year, and thats the cost of one cert [19:36:07] so yea, cheap (nearly free) [19:36:36] !log restarting hung apache on mw1202 [19:36:42] Logged the message, Master [19:37:14] Can someone run sync-common as root on mw23 as well please? [19:37:15] RECOVERY - Apache HTTP on mw1202 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.068 second response time [19:37:20] 2 * 60? [19:37:21] yep [19:37:23] are you sure? [19:37:28] Reedy: will do now [19:37:29] I've never done OV [19:37:37] if cacert.org would just get their root ca cert into browsers (they have been trying for years, it's just complicated)... [19:38:14] PROBLEM - Apache HTTP on mw23 is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 50035 bytes in 0.152 second response time [19:38:22] https://bugzilla.mozilla.org/show_bug.cgi?id=215243 [19:38:25] !log sync-common on mw23 [19:38:31] Logged the message, RobH [19:39:49] oh fenari... *** System restart required *** [19:40:02] every time i login and see that, it makes me feel bad. [19:40:56] wiki.cacert.org/InclusionStatus [19:41:04] .hushlogin ? [19:41:20] cacert is almost abandoned [19:41:24] they still use MD5 for example [19:41:32] hasn't made any progress for years [19:41:37] Is logmsgbot broken? [19:41:38] I used to support it a lot, but not lately [19:41:40] sigh, i would have assurer points for it [19:41:45] I'm a trusted assurer for years though [19:41:54] so if anyone wants assuring... [19:42:05] Reedy: hey, welcome back [19:42:15] I'd like to be assured that the world is inherently good and that it will all work out. [19:42:21] same here, i should have points. once got my passport checked by a cacert guy [19:42:25] I never left! ;) [19:42:39] to europe? 
[19:42:40] :) [19:43:08] 27 hour day and counting! [19:43:15] ouch [19:43:34] 3 flights, 4 baggage screening/security [19:43:52] 90 minutes driving [19:43:53] ? [19:43:54] What else... [19:44:02] how's screening/security > flights?! [19:44:20] had to go into the other side of the international terminal at sfo [19:44:32] because there's no way to move between them without going through security [19:44:52] * Reedy kicks logmsgbot [19:45:05] Reedy: thats cuz sfo is horrible airport [19:45:24] !log closed.dblist to 1.21wmf12 [19:45:31] Logged the message, Master [19:45:47] !log wikimania, private and fishbowl dblists to 1.21wmf12 [19:45:53] Logged the message, Master [19:46:35] !log special and wikimedia dblists to 1.21wmf12 [19:46:42] Logged the message, Master [19:46:57] !log Synchroni[sz]ed php-1.21wmf12/extensions for Wikidata deploy [19:47:03] Logged the message, Master [19:47:12] !log wikinews and wikisource to 1.21wmf12 [19:47:14] Reedy: logmsgbot seems to work. [19:47:20] Logged the message, Master [19:47:42] I've not seen it say anything [19:48:08] you never stop do you [19:48:08] Oh, I'm getting my bots confused. [19:48:13] morebots is fine. [19:48:13] I am a logbot running on wikitech-static. [19:48:13] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [19:48:13] To log a message, type !log . 
[19:48:46] !log wikivoyage and wiktionary to 1.21wmf12 [19:48:53] Logged the message, Master [19:49:03] New patchset: Aude; "add allowDataTransclusion Wikibase setting, off by default, on for test2wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54893 [19:49:07] paravoid: I came off the SFO-LHR flight with a red spotted t-shirt [19:49:14] Reedy: https://gerrit.wikimedia.org/r/#/c/54893/ [19:49:16] please [19:49:32] err, one se [19:49:33] c [19:49:35] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54893 [19:49:39] Reedy: earlier i could summon it by logging :) 11:59 < mutante> !log apache-graceful-all, remove boards.wm redirect 11:59 -!- logmsgbot [~logmsgbot@fenari.wikimedia.org] has joined #wikimedia-operations [19:49:40] lol [19:50:00] ready [19:50:05] whatever, it's fine [19:50:21] i just wanted to update the commit message but not a big deal [19:50:32] !log wikiversity and wikiquote to 1.21wmf12 [19:50:39] Logged the message, Master [19:51:14] !log sync'd wmf-config/InitialiseSettings.php [19:51:20] Logged the message, Master [19:51:28] thanks Reedy [19:52:47] !log wikibooks to 1.21wmf12 [19:52:53] Logged the message, Master [19:53:10] mutante: so yeah the zuul.wikimedia.org got abandoned :D [19:53:29] New patchset: Reedy; "Everything non wikipedia to 1.21wmf12" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54899 [19:54:08] Reedy: what do we do to rebuild localisation for wmf12 (or just wikibase)? [19:54:08] hashar: yea, paravoid told me. DNS clean up? [19:54:18] i think that's needed to make our new parser function work [19:54:53] scap [19:54:55] as usual [19:55:02] can do that in a minute [19:55:06] ok [19:55:08] thanks [19:55:17] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54899 [19:56:13] mutante: yup. 
Cname can die https://rt.wikimedia.org/Ticket/Display.html?id=4776 [19:59:35] !log DNS update - remove zuul CNAME [19:59:41] Logged the message, Master [20:08:34] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.30974991525 (gt 8.0) [20:13:14] RECOVERY - Apache HTTP on mw23 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.117 second response time [20:17:12] RobH, mind if I do 4726 real quick? Stat1 Access for Henrique Andrade [20:17:14] ? [20:17:59] New patchset: RobH; "adding new apaches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54903 [20:18:02] ottomata: go for it [20:18:08] no one complained, means compliance =] [20:18:16] cool [20:19:14] cmjohnson1: So I have a task for ya, but no RT # [20:19:25] if ya wanna do some installs, it seems mw1209-1220 arent installed yet [20:19:31] i don't work w/out rt ;-] [20:19:44] yeah..okay i will do it [20:19:50] biting the hand that feeds, i see how it is [20:19:52] ;] [20:20:07] checking to see if they are setup in puppet already [20:20:24] RECOVERY - mysqld processes on db60 is OK: PROCS OK: 1 process with command name mysqld [20:20:25] yep, in site.pp [20:21:05] and in dhcp lease file [20:21:22] ok [20:21:57] https://rt.wikimedia.org/Ticket/Display.html?id=4777 [20:22:07] So dunno if dns is setup (prolly) or network ports [20:22:24] but, those should be able to be installed and puppet signed and such [20:22:35] chmod g+rw /home/wikipedia/common/php-1.21wmf11/.git/modules/extensions/FormPreloadPostCache/index.lock [20:22:41] cmjohnson1: So do OS setup, then puppet runs, and once they are fully puppetized we can push into service [20:22:50] (i'll walk ya through it when we get to there) [20:22:53] ^ Can someone fix the permissions on that file please [20:22:59] robh: ok [20:23:04] PROBLEM - Host db32 is DOWN: PING CRITICAL - Packet loss = 100% [20:23:31] Reedy: done [20:23:48] Reedy: it gave error for that on the sync script, needs rerun on mw23? 
[20:23:54] or not vital just annoying? [20:24:09] (i assumed the latter) [20:24:51] ^^^^ Thing we need to fix [20:24:56] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54903 [20:25:12] greg-g: ? [20:25:16] "what do these error messages mean? there's so many of them and I've been taught to ignore them..." (not your fault RobH ) [20:25:23] its a .git file [20:25:29] its a known issue, happens all the time. [20:25:34] RECOVERY - Host db32 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [20:26:10] not saying its not annoying [20:26:24] RECOVERY - Apache HTTP on mw1085 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 742 bytes in 0.272 second response time [20:26:27] but its not exactly a major issue that im aware of [20:26:48] right, and for new people doing deploys, or those who do them infrequently, too much mental load on error messages that are not important. That's all, ignore my interjection :) [20:27:11] right, not major, and hopefully dealt with through git-deploy-glorious-future [20:27:29] i thought they only happened when folks did shit as root [20:27:47] (syncs) [20:27:50] but dunno [20:27:57] well, syncing is part of deploying, no? [20:28:01] * aude waiting on localisation cache update [20:28:12] yep, just saying i dunno why it came up with permission error [20:28:21] ah, right, I see [20:28:27] root's root and all [20:28:29] New patchset: Ottomata; "Adding Henrique Andrade and giving access on stat1." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54908 [20:29:50] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54908 [20:32:42] aude: / Reedy: everything all good on your deploys? aude are you still waiting to do something? 
[20:33:15] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [20:34:15] localisation cache seems to be fixed [20:36:14] RECOVERY - mysqld processes on db32 is OK: PROCS OK: 1 process with command name mysqld [20:36:39] New patchset: Aude; "enable Wikibase data inclusion on test2wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54909 [20:37:12] aude: thanks [20:37:15] PROBLEM - Puppet freshness on mw1130 is CRITICAL: Puppet has not run in the last 10 hours [20:37:49] not sure how to avoid that in the future, when introducing new parser functions [20:37:59] except be sure to run localisation cache rebuild before enabling new stuff [20:38:15] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [20:38:21] seems super nasty that not finding a magic word is a fatal error [20:38:35] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.6879117094 (gt 8.0) [20:39:39] New patchset: Ottomata; "Changing handrade's ssh key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54910 [20:40:20] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54909 [20:40:24] PROBLEM - mysqld processes on db60 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [20:41:01] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54910 [20:41:25] RECOVERY - mysqld processes on db60 is OK: PROCS OK: 1 process with command name mysqld [20:41:57] !log Sync'd wmf-config/InitialiseSettings.php [20:42:03] Logged the message, Master [20:45:14] PROBLEM - MySQL Replication Heartbeat on db67 is CRITICAL: CRIT replication delay 229 seconds [20:46:59] New patchset: Brion VIBBER; "Updated IP test range for Vimpelcom" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54911 [20:47:24] PROBLEM - mysqld processes on db63 is CRITICAL: PROCS CRITICAL: 0 
processes with command name mysqld [20:49:24] RECOVERY - mysqld processes on db63 is OK: PROCS OK: 1 process with command name mysqld [20:49:37] New review: Brion VIBBER; "We think it's right. :)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/54911 [20:49:42] New review: Yurik; "so do i" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/54911 [20:52:15] PROBLEM - MySQL Slave Running on db69 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error You cannot ALTER a log table if logging is enabled on query [20:52:15] PROBLEM - MySQL Slave Running on db71 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error You cannot ALTER a log table if logging is enabled on query [20:52:24] PROBLEM - MySQL Slave Running on db59 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error You cannot ALTER a log table if logging is enabled on query [20:52:57] !log updated CentralAuth to master on 1.21wmf11 and 1.21wmf12 [20:53:03] Logged the message, Master [20:53:15] PROBLEM - MySQL Slave Delay on db67 is CRITICAL: CRIT replication delay 693 seconds [20:53:15] PROBLEM - MySQL Slave Delay on db63 is CRITICAL: CRIT replication delay 459 seconds [20:53:58] broadcast: I am done with my deploy - all clear [20:56:29] Reedy: pgehres: Is logmsgbot not working? I noticed you both made manual entries for something that is usually triggered by the sync- script. [20:56:42] well, the sync script did not log anything [20:57:00] And you didn't think that was strange? Or has that been broken for longer? [20:57:11] I did think that was strange, yes [20:57:16] ok :) [20:57:29] Change abandoned: Demon; "Solution in search of a problem. If we ever rewrite this from the beginning, we can use maven, but f..." 
[operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/53572 [20:57:29] I have no way of kicking the bot [20:57:48] Just checking if you noticed it before or it being new [20:57:51] So its new [20:57:53] nope [20:57:55] yeah [20:58:10] [ logmsgbot Idle: 2:07:31 ] [20:58:16] Hm.. at least 2 hours [20:58:16] odd. [20:59:15] RECOVERY - MySQL Slave Delay on db63 is OK: OK replication delay 0 seconds [20:59:21] Reedy: while you're around, I need to run a foreachwiki on hume. anything other than "foreachwiki extensions/CentralAuth/maintenance/migratePass0.php" ? [20:59:27] New patchset: Asher; "db1049 -> mariadb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54946 [20:59:58] New patchset: Pyoungmeister; "run puppet by cron instead of via the agent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54815 [21:00:17] paravoid: about? [21:00:22] pgehres: Your entries are written to /var/log/logmsg on fenari [21:00:23] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54946 [21:00:31] !log reedy synchronized wmf-config/InitialiseSettings.php [21:00:31] !log pgehres synchronized php-1.21wmf12/extensions/CentralAuth 'Updating CentralAuth to master' [21:00:31] !log pgehres synchronized php-1.21wmf11/extensions/CentralAuth 'Updating CentralAuth to master' [21:00:33] !log asher synchronized wmf-config/db-eqiad.php 'pulling db1049 from s1' [21:00:37] Logged the message, Master [21:00:40] Oh, I didn't mean to enter them [21:00:42] :-/ [21:00:43] Logged the message, Master [21:00:46] lol [21:00:51] Logged the message, Master [21:00:57] Logged the message, Master [21:01:11] pgehres: Whatever daemon is supposed to read them/buffer them from there to logmsgbot isn't working [21:01:21] notpeter: ? [21:01:42] paravoid: patchset for you! 
[21:01:50] :) [21:01:52] I do puppet the way you want :) [21:02:44] oh [21:02:48] except for the logging thing [21:03:03] I like the idea of putting puppet into its own log [21:03:23] and I see no reason to not do so? [21:03:30] and Ryan_Lane wants it ;) [21:03:35] this isn't going to be the puppet logs [21:03:44] well, it might be some of them [21:03:49] puppet does support syslog though [21:03:55] and will keep writing to syslog [21:04:11] I just tailed a run with that invocation, and the output all showed up in there [21:04:16] so [21:04:31] it might also wind up in syslog... not sure, though [21:04:35] okay, push it like that [21:04:41] The logmsgbot process is running fine on fenari [21:04:42] when we fix logging properly we'll take a look then [21:04:46] Must've lost connection to irc I guess [21:04:52] Could someone who's root on fenari restart it? [21:04:53] nobody 4177 0.0 0.1 195328 4436 ? S Feb27 6:52 python /usr/ircecho/bin/ircecho --infile=/var/log/logmsg #wikimedia-operations logmsgbot irc.freenode.net [21:04:56] paravoid: ok, sounds reasonable [21:04:56] because right now logging is all messed up :) [21:05:08] truuuueeeeeee [21:05:18] who removed zuul.w.o and didnt update the ticket ;P [21:05:29] @hourly root [ ! 
-d /var/cache/dsa ] || touch /var/cache/dsa/cron.alive [21:05:32] 34 */4 * * * root if [ -x /usr/sbin/puppetd ]; then sleep $(( $RANDOM \% 7200 )); if [ -x /usr/bin/timeout ]; then TO="timeout --kill-after=900 3600"; else TO=""; fi; tmp="$(tempfile)"; egrep -v '^(#|$)' /etc/dsa/cron.ignore.dsa-puppet-stuff > "$tmp" && $TO /usr/sbin/puppetd -o --no-daemonize 2>&1 | egrep --text -v -f "$tmp"; rm -f "$tmp"; fi [21:05:37] @daily root find /var/lib/puppet/clientbucket/ -type f -mtime +30 -atime +30 -exec rm {} \+ [21:05:40] that's what we have in debian btw [21:06:28] thanks for pushing this btw, I've been slacking off not doing it for a long time :) [21:06:42] ARGH [21:06:51] all the new apaches im about to deploy attach to a fastiron [21:06:55] ;_; [21:07:15] PROBLEM - mysqld processes on db1049 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [21:07:20] paravoid: well, any chance I can get to free up roughly 400 gigs of ram across the cluster seems pretty worthwhile to me :) [21:07:43] and ryan was ranting about how much ram it was eating up [21:08:27] puppet 3.0 changed that [21:08:29] in a very silly way :) [21:08:32] heh [21:08:37] the agent doesn't do puppet runs [21:08:40] it just forks [21:08:44] and spawns a puppet run [21:08:49] :/ [21:08:51] so it's basically an advanced cron now [21:08:55] yeah [21:08:59] cron + kick I guess [21:10:04] PROBLEM - Host db1049 is DOWN: PING CRITICAL - Packet loss = 100% [21:11:25] RECOVERY - Host db1049 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [21:12:14] RECOVERY - mysqld processes on db1049 is OK: PROCS OK: 1 process with command name mysqld [21:14:53] pgehres: Probably not. You might need to sudo -u apache first [21:15:11] Reedy: k, thx. found a bug, fixing that first [21:15:27] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54815 [21:15:44] New patchset: Mattflaschen; "Add labs redis subclassing the main one and setting directory." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/54970 [21:19:06] New review: Demon; "Should rebase this on top of https://gerrit.wikimedia.org/r/#/c/53973/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37570 [21:19:45] RECOVERY - Puppet freshness on cp1034 is OK: puppet ran at Wed Mar 20 21:19:42 UTC 2013 [21:22:09] I'm an asshole. please don't merge anything on sockpuppet for a sec [21:24:16] notpeter: ha ha ha [21:24:42] New patchset: Pyoungmeister; "correct my invocation of fqdn_rand" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54972 [21:25:15] New review: Ottomata; "If a package already exists, do we need to rebuild it ourselves? Can we just stick it in apt?" [operations/debs/python-jsonschema] (master) - https://gerrit.wikimedia.org/r/54782 [21:26:03] paravoid ^? [21:26:09] New review: Faidon; "We can if the version suffices. I think (but I'm not sure) it's not, so we just need to forward-port..." [operations/debs/python-jsonschema] (master) - https://gerrit.wikimedia.org/r/54782 [21:26:12] you are so fast [21:26:15] PROBLEM - NTP on db1049 is CRITICAL: NTP CRITICAL: Offset unknown [21:26:51] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54972 [21:27:04] New review: Ottomata; "When I took this over yesterday, Ori was building the 0.8.0 version anyway. I think he might be fin..." [operations/debs/python-jsonschema] (master) - https://gerrit.wikimedia.org/r/54782 [21:27:25] notpeter: so I've used that erb trick in the past [21:27:29] notpeter: it has two nice things [21:27:40] one is that you don't see sleep all the time in ps [21:27:55] the other is that you can easily cat /etc/cron.d/puppet and see when runs are happening for that machine [21:28:42] ori-l [21:28:48] are you ok with version 0.8.0 of python-jsonschema? 
[21:29:00] if so, there is a .deb already in debian experimental that we can just stick in our apt repo [21:29:08] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54678 [21:29:17] cmjohnson1: hey you at the dc today ? [21:29:27] not now [21:29:29] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54677 [21:29:39] New patchset: Pyoungmeister; "Revert "run puppet by cron instead of via the agent"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54974 [21:29:48] New patchset: Pyoungmeister; "Revert "correct my invocation of fqdn_rand"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54975 [21:29:53] I can go if needed? I am not far away (lesliecarr) [21:30:09] paravoid: what is the correct way to use fqdn_rand in a template? my version of the template is failing, and there is a seemingly open bug about using it in templates [21:30:14] no emergency, just trying to get that albatross of a copper link up [21:30:15] RECOVERY - NTP on db1049 is OK: NTP OK: Offset 2.193450928e-05 secs [21:30:24] RECOVERY - Puppet freshness on db43 is OK: puppet ran at Wed Mar 20 21:30:16 UTC 2013 [21:30:34] i shall ticketize [21:31:01] ok [21:31:04] last time I didn't do it with fqdn_rand [21:31:06] New review: Dr0ptp4kt; "Agreed looks good." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/54911 [21:31:11] it didn't exist back then [21:32:16] (looking now) [21:32:42] cmjohnson1: https://rt.wikimedia.org/Ticket/Display.html?id=4780 [21:33:21] paravoid: I can also just pull the use of fqdn_rand into the class and then sub it into the template [21:33:24] that is sure to work [21:33:26] okay..will look at in the morning [21:33:57] New patchset: Asher; "adding db105[12]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54976 [21:34:16] Can someone with root on fenari kill/restart logmsgbot? It's been broken for about 2 hours now. 
Deployments are not being logged to server admin log. [21:34:27] notpeter: or even not use the template at all I guess [21:34:30] got it Krinkle [21:34:35] cron { ...: minute => fqdn_rand(NNN) } [21:34:37] should work [21:35:01] !log restarted ircecho on fenari [21:35:08] Logged the message, Mistress of the network gear. [21:35:09] I suggested a template since that's how I did it last time [21:35:17] I don't see a good reason now though [21:35:24] paravoid: ok, cool [21:35:42] (although I would like to see the cron resource to be replaced by something that puts it in /etc/cron.d instead of crontab's root) [21:36:27] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54976 [21:36:33] Krinkle: happy now ? [21:37:57] New patchset: Pyoungmeister; "pulling fqdn_rand into class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54977 [21:38:48] LeslieCarr: Assuming it automatically restarts..., yes :) [21:38:57] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54977 [21:39:10] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54514 [21:40:05] hehe [21:40:46] service { "ircecho": require => Package[ircecho], ensure => running; [21:40:50] Seems so [21:41:10] Though I suppose that's only ensured on puppet run, not in the system itself [21:41:21] so will take < 30 minutes [21:42:28] exactly but it should be running right now [21:42:31] New patchset: Asher; "missing curly bracket" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54978 [21:42:46] i see it went back on #tech but didn't see #operations in its init file [21:42:49] lemme check its puppet [21:43:04] New review: Dzahn; "i deleted /var/mwdocs/phase3 and /var/mwdocs/phase3-svn manually after tar.gz'ing them and moving to..." 
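The `fqdn_rand` suggestion above works because the function returns a stable pseudo-random number per host: each machine always lands on the same cron minute, while the fleet as a whole spreads across the interval, so no `sleep $RANDOM` splay is needed. The idea can be sketched as follows (a simplification, not Puppet's exact algorithm; real `fqdn_rand` hashes the node's FQDN together with optional extra seed arguments):

```python
import hashlib

def fqdn_rand(max_value: int, fqdn: str, seed: str = "") -> int:
    """Stable per-host pseudo-random integer in [0, max_value).

    Hashing the hostname (plus an optional seed so different cron
    jobs on one host get different offsets) keeps each host's
    schedule fixed across runs while spreading hosts roughly
    evenly over the interval.
    """
    digest = hashlib.md5((seed + fqdn).encode("utf-8")).hexdigest()
    return int(digest, 16) % max_value

# Pick the cron minute for a hypothetical host:
minute = fqdn_rand(60, "mw1018.eqiad.wmnet")
```

This also explains the advantages mentioned earlier: nothing sleeps in `ps`, and the chosen minute is visible directly in the generated crontab.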
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/53954 [21:43:13] yeah, it only is going to #wikimedia-tech according to its puppet config [21:43:31] need me to fix that up to both tech and operations ? [21:43:32] ^demon: I'm not really a huge fan of this: https://gerrit.wikimedia.org/r/#/c/53173/4/manifests/gerrit.pp,unified [21:43:43] paravoid: any thoughts on this? https://gerrit.wikimedia.org/r/#/c/53173/4/manifests/gerrit.pp,unified [21:44:15] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54978 [21:44:15] umm, yes [21:44:24] not a big fan either :) [21:44:37] logrotate? [21:44:44] RECOVERY - Puppet freshness on mw1094 is OK: puppet ran at Wed Mar 20 21:44:34 UTC 2013 [21:44:46] binasher: https://rt.wikimedia.org/Ticket/Display.html?id=4712 [21:44:48] rdp1/2 [21:44:54] need you to tech review please =] [21:45:09] <^demon> Ryan_Lane: We can skip that. [21:45:24] RECOVERY - Puppet freshness on mw1130 is OK: puppet ran at Wed Mar 20 21:45:20 UTC 2013 [21:45:39] Failed to parse template base/puppet.cron.erb: undefined method `+' for nil:NilClass [21:45:40] \o/ [21:45:41] Krinkle: want me to add in operations in logmsgbot's configuration ? [21:46:24] RECOVERY - Puppet freshness on mw1056 is OK: puppet ran at Wed Mar 20 21:46:22 UTC 2013 [21:46:31] LeslieCarr: uh? So it did restart, but came back in -tech [21:46:42] New patchset: Pyoungmeister; "Revert "pulling fqdn_rand into class"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54980 [21:47:01] LeslieCarr: that's odd. 
Yeah, it should be updated in site.pp fenari logmsgbot #wikimedia-operations [21:47:04] that's where it was [21:47:17] presumably someone updated it from -tech to -operations, but only did so on fenari [21:47:31] https://wikitech.wikimedia.org/wiki/Logmsgbot [21:47:43] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54980 [21:47:51] yep [21:47:56] that's what looks like happened [21:48:01] somebody with root on fenari [21:48:12] improperly using their root powers instead of puppetizing!!! [21:48:15] Krinkle: i removed /var/mwdocs manually, puppet run down to 34 seconds [21:48:17] LeslieCarr: I can submit a puppet change, or are you doing so already? [21:48:26] i am doing one right now :) [21:48:29] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54975 [21:48:31] but thank you for the offering! [21:48:39] New patchset: Pyoungmeister; "Revert "run puppet by cron instead of via the agent"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54974 [21:49:24] RECOVERY - MySQL Slave Running on db59 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [21:49:43] New patchset: Jforrester; "Add wikibugs IRC bot to #mediawiki-visualeditor" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37570 [21:50:26] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54974 [21:50:31] New review: Jforrester; "Rebased onto r 53973 per Chad." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37570 [21:50:49] New patchset: Lcarr; "fixing logmsgbot to be in #operations and icinga's mysql packages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54982 [21:50:55] andrewbogott: RT needs PHP? 
[21:51:00] my broken crap is backed out [21:51:10] feel free to re-fetch and then merge on sockpuppet [21:51:25] PROBLEM - MySQL Slave Delay on db59 is CRITICAL: CRIT replication delay 3412 seconds [21:52:20] RECOVERY - MySQL Slave Running on db71 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [21:52:20] RECOVERY - MySQL Slave Running on db69 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [21:52:20] mutante: I don't remember at the moment… leave a comment in gerrit and I'll see if I can figure out why that's in there. [21:52:24] RECOVERY - Puppet freshness on mw1052 is OK: puppet ran at Wed Mar 20 21:52:16 UTC 2013 [21:52:54] RECOVERY - Puppet freshness on mw1006 is OK: puppet ran at Wed Mar 20 21:52:47 UTC 2013 [21:53:01] merging what's on sockpuppet binasher Ryan_Lane [21:53:15] ok [21:54:07] sigh, nothing feels more wrong than installing something on my nice pretty macbook with gem [21:54:24] PROBLEM - MySQL Slave Delay on db69 is CRITICAL: CRIT replication delay 3396 seconds [21:54:25] PROBLEM - MySQL Slave Delay on db71 is CRITICAL: CRIT replication delay 3482 seconds [21:54:45] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54982 [21:55:35] New review: Krinkle; "The log buffers the perl script writes to are determined by wikimedia/bugzilla/wikibugs.git:/wikibugs." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/37570 [21:57:04] hey dudlees, are sq*.wikimedia.org machines in pmtpa? [21:57:07] (notpeter?)
[21:57:58] ottomata: si senor [21:58:27] the easiest way to tell if a machine with a number is in tampa/eqiad/esams is that tampa are 0-999 , eqiad 1000-1999, esams 3000-3999 [21:58:33] sq = pmtpasq [21:58:51] New review: Krinkle; "Depends on I04e82cb392f1ad in wikimedia/bugzilla/wikibugs.git" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37570 [21:58:54] esams es = amssq and knsq [21:59:59] New patchset: Mattflaschen; "Add labs redis subclassing the main one and setting directory." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54970 [22:00:17] ok cool, tanks! [22:00:17] andrewbogott: ok, i got one other comment, --prompt-for-dba-password kind of sounds like manual interaction needed [22:00:34] RECOVERY - Puppet freshness on searchidx2 is OK: puppet ran at Wed Mar 20 22:00:26 UTC 2013 [22:00:36] New patchset: Pyoungmeister; "run puppet by cron instead of via the agent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54985 [22:01:25] RECOVERY - MySQL Slave Delay on db59 is OK: OK replication delay 17 seconds [22:01:34] New review: Dzahn; "(2 comments)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47026 [22:02:58] !log stopping puppet on gadolinium to test for packet loss weirdness [22:03:04] Logged the message, Master [22:04:24] !log restarting ircecho on fenari to ensure logmsgbot is on operations and tech irc channels [22:04:25] RECOVERY - MySQL Slave Delay on db69 is OK: OK replication delay 0 seconds [22:04:27] and yay [22:04:28] it's back [22:04:29] :) [22:04:30] Logged the message, Mistress of the network gear. [22:04:39] lol [22:04:54] RECOVERY - Puppet freshness on neon is OK: puppet ran at Wed Mar 20 22:04:47 UTC 2013 [22:05:24] RECOVERY - MySQL Slave Delay on db71 is OK: OK replication delay 0 seconds [22:08:09] New review: Mattflaschen; "This is essentially meant to subclass modules/redis/manifests/init.pp." 
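notpeter's numbering convention above (Tampa 0-999, eqiad 1000-1999, esams 3000-3999, with `amssq*`/`knsq*` also living in esams) can be sketched as a small helper. This is a hypothetical function for illustration; it encodes only the ranges quoted in the channel:

```python
import re

def guess_site(hostname):
    """Guess the datacenter for a Wikimedia-style hostname, per the
    convention above: 0-999 pmtpa, 1000-1999 eqiad, 3000-3999 esams,
    with amssq*/knsq* also in esams. Hypothetical helper, not a real tool."""
    short = hostname.split('.')[0]
    if short.startswith(('amssq', 'knsq')):
        return 'esams'
    m = re.search(r'(\d+)$', short)
    if not m:
        return None
    n = int(m.group(1))
    if n < 1000:
        return 'pmtpa'
    if n < 2000:
        return 'eqiad'
    if 3000 <= n < 4000:
        return 'esams'
    return None

print(guess_site('sq67.wikimedia.org'))  # squids numbered below 1000 -> pmtpa
print(guess_site('mw1211'))              # 1000-1999 -> eqiad
print(guess_site('knsq23'))              # knsq prefix -> esams
```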
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/54970 [22:14:44] <^demon> ori-l: You still need operations/debs/python-jsonschema deleted, right? [22:15:05] Ok, so as far as I can tell, we have 5 servers not replicated from Tampa to Ashburn (not counting labs) [22:15:11] putting in here in case someone sees something I missed [22:15:46] ekrem - irc server, kaulen - bz server, streber - rt server, hooper - etherpad & racktables, & hume - batch dev host [22:16:06] mutante: You worked on BZ and its puppetizations recently? [22:16:11] there's a few miscellaneous things [22:16:19] what misc stuff am i missin? [22:16:28] * RobH is making tickets for these items now [22:16:32] multicast relay [22:16:38] observium, torrus [22:16:44] maybe some db masters [22:16:44] recall server for the relay? [22:16:56] maerlant? [22:16:59] yea the db masters for most misc is replicated to slaves in ashburn, i think. [22:17:16] maerlant sounds correct [22:18:22] RobH: well, kind of, that bugzilla reporter because it's our custom stuff [22:18:54] rancid as well is on streber [22:19:14] RobH: fenari ?:) [22:20:09] (squid::cachemgr) [22:20:15] oh, on spence ishmael is unpuppetized [22:20:17] (misc::noc-wikimedia) [22:20:22] (misc::extension-distributor) [22:20:31] https://rt.wikimedia.org/Ticket/Display.html?id=2183 and https://rt.wikimedia.org/Ticket/Display.html?id=4616 [22:20:55] New review: Mattflaschen; "This fails on the puppet1 machine (https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000005e8) with:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54970 [22:21:38] New review: Mattflaschen; "Non-gerritized labs link: http://goo.gl/zcaqO" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54970 [22:23:44] RobH: formey (svn), yes, it's still not dead :p [22:23:53] pywikipedia using it [22:24:10] (role::gerrit::replicationdest) [22:24:16] (svn::server) [22:25:27] emery, ersch, tarin [22:25:35] erzurumi [22:28:02] mutante:
http://etherpad.wikimedia.org/EQIAD-rollout-sequence [22:31:03] New patchset: Demon; "People only want notifs on new changes, not new patches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54989 [22:31:17] <^demon> Ryan_Lane: We're still too spammy :\ ^ [22:38:42] PROBLEM - Puppet freshness on arsenic is CRITICAL: Puppet has not run in the last 10 hours [22:39:42] PROBLEM - Puppet freshness on xenon is CRITICAL: Puppet has not run in the last 10 hours [22:40:03] PROBLEM - Apache HTTP on mw1211 is CRITICAL: Connection refused [22:40:15] PROBLEM - Apache HTTP on mw1210 is CRITICAL: Connection refused [22:40:16] binasher: jfyi - I am running migratePass0, but the slaves seem to be doing just peachy [22:40:32] PROBLEM - udp2log log age for gadolinium on gadolinium is CRITICAL: NRPE: Command check_udp2log_log_age-gadolinium not defined [22:40:53] Ryan_Lane: ping [22:41:05] PROBLEM - Apache HTTP on mw1212 is CRITICAL: Connection refused [22:41:06] PROBLEM - Apache HTTP on mw1209 is CRITICAL: Connection refused [22:41:50] pgehres: thanks for the heads up [22:42:35] PROBLEM - Apache HTTP on mw1213 is CRITICAL: Connection refused [22:43:15] PROBLEM - Apache HTTP on mw1214 is CRITICAL: Connection refused [22:44:27] checking out the mediawikis [22:44:40] notpeter: So I am going to be moving some of our misc services to eqiad [22:44:49] was wondering what we will be doing for the db9 bound hosts for this. 
[22:44:55] oh docroot fail [22:44:59] (if you have time to discuss now, if not this can wait) [22:45:09] I have a plan [22:45:23] and I want to implement it after I am done spinning up the pre-labsdb-dbs [22:45:28] and the labsdbs [22:45:34] oh nm, those are new :) [22:45:35] !log ignore mw1209+ they arent deployed [22:45:41] Logged the message, RobH [22:45:48] :) [22:45:57] im operating on a 3 minute time delay [22:45:58] I want to migrate the m1 shard (the db9/10 stuff) to galera cluster [22:46:11] <^demon> Can I get a merge on https://gerrit.wikimedia.org/r/#/c/54989/? [22:46:29] notpeter: What is the projected time frame on that? CT would like me to move a number of services to eqiad by the end of two weeks time (from our meeting today) [22:46:41] so if we can start a service at a time that would rock. [22:47:07] example: we'd like to get bugzilla and RT running out of ashburn by 4/3 [22:47:34] granted, RT needs a bit of puppet polish, but bz should be mostly ok. [22:47:41] uh [22:47:45] RobH: is your time delay for fcc buzzing? [22:47:54] ok, that's much sooner than I have time to do anything significant [22:47:55] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.0 [22:48:02] nope, just lack of I/O capability. [22:48:20] notpeter: Thats fine, but then I need to pass that back to CT so he knows why it isnt done. [22:48:26] hrm, we need to get you a ssd [22:48:51] RobH: ok, cool [22:51:14] hey all! I've got some mobile varnish config updates I need to get merged in time to go live for a 9pm test: https://gerrit.wikimedia.org/r/#/c/54911/ [22:51:18] who wants to +2 me? [22:51:30] looking [22:51:36] it's the gerrit equivalent of a high-five [22:51:36] who is Dr0ptp4kt [22:51:40] notpeter: heh https://rt.wikimedia.org/Ticket/Display.html?id=2187 [22:51:42] it had a ticket.
[22:51:49] LeslieCarr: that's adam baso, our new guy [22:51:59] RobH: cool [22:52:36] PROBLEM - NTP on mw1214 is CRITICAL: NTP CRITICAL: Offset unknown [22:52:46] done [22:52:46] \o/ thanks [22:52:47] shall i merge it ? [22:52:53] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54911 [22:53:05] PROBLEM - NTP on mw1213 is CRITICAL: NTP CRITICAL: Offset unknown [22:56:30] \o/ whee [22:57:07] RECOVERY - NTP on mw1213 is OK: NTP OK: Offset 0.004801511765 secs [22:58:50] RECOVERY - NTP on mw1214 is OK: NTP OK: Offset -0.001222968102 secs [23:00:59] PROBLEM - Apache HTTP on mw1216 is CRITICAL: Connection refused [23:01:20] PROBLEM - Apache HTTP on mw1215 is CRITICAL: Connection refused [23:01:51] * TimStarling stabs logrotate [23:02:44] http://paste.tstarling.com/p/AeVKDA.html [23:03:59] if you replace that if body with: [23:04:06] /* seen this log file before */ [23:04:09] continue; [23:04:35] then it would be possible to, say, have a different log expiry for API logs than other logs [23:04:44] and thus avoid exhausting disk space on fluorine [23:06:04] * TimStarling stabs again [23:08:43] is there any software that's like logrotate except better? [23:09:37] maybe I'll just have a cron job delete some extra log files [23:12:43] why not have logs in different directories? [23:13:23] RoanKattouw: so we reclaimed xenon and caesium iirc [23:13:29] PROBLEM - NTP on mw1215 is CRITICAL: NTP CRITICAL: Offset unknown [23:13:33] and only wtp1001-1104 are doing parsoid in eqiad [23:13:43] I think some other things use the same multiplexer script so it's not so easy to hack [23:13:48] i think im right on this, but wanted to confirm [23:13:57] and it would be adding special cases to it when it's not really needed [23:14:16] I think a cron job is the best solution, because there are some other issues with logrotate that a cron job is best suited to fix [23:14:47] like what? 
[23:15:25] if a file is deleted from the log directory, logrotate stops rotating it [23:15:30] RECOVERY - NTP on mw1215 is OK: NTP OK: Offset -0.002371907234 secs [23:15:41] which means that old archives are kept forever, which is a potential legal compliance issue [23:16:32] also if MW stops sending log entries for a particular log, say because the code is gone, the file stays in the log directory with zero size forever [23:16:36] so a cron job can clean those up [23:18:43] Ryan_Lane: ping (2) [23:19:22] this could be an annoyance when viewing [23:19:37] but you could also instruct rsyslog to write to /year/month/day/foo.log or something [23:19:48] New patchset: Asher; "pulling es100[69], returning pmtpa dbs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54996 [23:19:50] and then just remove periods via cron [23:20:12] preilly: hey :) [23:21:15] paravoid: howdy [23:21:25] New patchset: Asher; "pulling es100[69], returning pmtpa s1 dbs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54996 [23:23:36] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/54996 [23:29:07] pgehres: Wed Mar 20 23:21:49 UTC 2013 hume commonswiki Error selecting database commonswiki on server 10.0.6.47 [23:29:23] pgehres: the dberror log is filling up with stuff like that [23:29:33] huh [23:29:41] i am done with commons ... [23:29:43] attempts to access various db's from s7 slaves that aren't on s7 [23:29:58] latest is Wed Mar 20 23:26:52 UTC 2013 hume mediawikiwiki Error selecting database mediawikiwiki on server 10.0.6.78 [23:30:09] RoanKattouw: ^^ [23:30:18] any for metawiki? 
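The cron-job cleanup Tim describes (which became the patchset at the end of this log, r/55003) boils down to two rules: delete rotated archives past a retention window, and prune logs that have sat empty for a while. A hedged Python sketch of that shape, run against a throwaway directory rather than the real fluorine path `/a/mw-log`; the retention day counts here are illustrative, not the values in the actual change:

```python
import os
import tempfile
import time

def clean_logs(logdir, archive_days=30, empty_days=7):
    """Delete rotated archives (*.gz) older than archive_days, and
    zero-length *.log files untouched for empty_days -- the two logrotate
    failure modes described above (archives kept forever, dead empty logs).
    Day counts are illustrative, not those of the real change."""
    now = time.time()
    removed = []
    for name in os.listdir(logdir):
        path = os.path.join(logdir, name)
        st = os.stat(path)
        age_days = (now - st.st_mtime) / 86400
        if name.endswith('.gz') and age_days > archive_days:
            os.remove(path)
            removed.append(name)
        elif name.endswith('.log') and st.st_size == 0 and age_days > empty_days:
            os.remove(path)
            removed.append(name)
    return sorted(removed)

# Demo: one stale archive, one fresh archive, one long-empty log.
d = tempfile.mkdtemp()
for name, age_days, size in [('exception.log-20130208.gz', 40, 1),
                             ('exception.log-20130320.gz', 0, 1),
                             ('dead-feature.log', 10, 0)]:
    path = os.path.join(d, name)
    with open(path, 'wb') as f:
        f.write(b'x' * size)
    past = time.time() - age_days * 86400
    os.utime(path, (past, past))     # back-date the mtime

print(clean_logs(d))  # stale archive and empty log go; fresh archive stays
```

Run from cron, a job like this also addresses the compliance concern: an archive whose live log was deleted still ages out, instead of being kept forever.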
[23:30:23] that is what is currently running [23:30:46] pgehres: metawiki is actually on s7 [23:30:54] ugh, right [23:31:08] we already fixed one bug related to that [23:31:17] !log authdns-update for lanthanum [23:31:23] Logged the message, RobH [23:31:25] pgehres: are you going to give us even more of those sweet account unification stats? [23:31:41] of course, once i unify some accounts :-) [23:32:29] binasher: i am going to let metawiki finish since it isn't exploding and then explore a theory [23:33:16] pgehres: ok, but try to kill as soon after finishing as you can [23:34:07] running wiki by wiki :-) [23:34:22] any stacktrace to the line that is exploding? [23:34:46] PROBLEM - Apache HTTP on mw1217 is CRITICAL: Connection refused [23:35:17] PROBLEM - Apache HTTP on mw1220 is CRITICAL: Connection refused [23:35:28] New patchset: Asher; "es100[69] to mariadb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54998 [23:35:36] PROBLEM - Apache HTTP on mw1218 is CRITICAL: Connection refused [23:35:39] !log asher synchronized wmf-config/db-pmtpa.php 'returning s1 servers' [23:35:45] Logged the message, Master [23:35:55] binasher: do I have access to the dberror log? 
[23:36:27] !log asher synchronized wmf-config/db-eqiad.php 'pulling es100[69]' [23:36:37] Logged the message, Master [23:36:49] pgehres: fluorine:/a/mw-log/dberror.log [23:36:52] looks like your account is there [23:37:29] There was a bug previously where MW connected to the wrong server [23:37:32] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54998 [23:37:52] But then it complained it couldn't find centralauth on the target DB, now it's reversed [23:37:54] New patchset: RobH; "lanthanum to replace ekrem, after it works" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54999 [23:38:36] New patchset: Pyoungmeister; "run puppet by cron instead of via the agent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54985 [23:38:55] binasher: my account may be there but: "ssh_exchange_identification: Connection closed by remote host" [23:39:06] soooo why are the labstores the only thing in dhcpd leases with ip addresses? (the rest are standardized to fqdn) notpeter (or binasher)? [23:39:17] PROBLEM - Apache HTTP on mw1219 is CRITICAL: Connection refused [23:39:20] oh wiat, labstore, not labdb [23:39:27] sorry guys, wrong folks to ping. [23:39:33] Ryan_Lane: ^? (you the right person?) 
[23:39:36] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54985 [23:39:59] RobH: me no no labstores [23:40:09] yea, i ralized after pinging you it wasnt yers [23:40:12] sorry about that ;] [23:40:16] no prob [23:40:39] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54999 [23:41:06] PROBLEM - mysqld processes on es1006 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:41:16] PROBLEM - mysqld processes on es1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:42:18] New patchset: Asher; "class typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55000 [23:42:28] pgehres: Does it give you any more info than "error selecting database"? [23:42:35] Like, a stack trace or something [23:42:38] RoanKattouw: i see no error [23:42:43] :O [23:42:45] i knew nothing until asher yelled at me [23:42:54] hah [23:42:59] and /me cannot see flourine logs [23:43:05] I looked but there's nothing there [23:43:10] MW seems to be ignoring the error [23:43:18] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55000 [23:43:46] RoanKattouw: any ideas on instrumenting this a bit more [23:43:58] Not really, let me see [23:44:23] i wish we could just add --verbose :-) [23:47:07] RECOVERY - mysqld processes on es1006 is OK: PROCS OK: 1 process with command name mysqld [23:47:17] RECOVERY - mysqld processes on es1009 is OK: PROCS OK: 1 process with command name mysqld [23:47:38] PROBLEM - NTP on mw1219 is CRITICAL: NTP CRITICAL: Offset unknown [23:47:38] PROBLEM - NTP on mw1220 is CRITICAL: NTP CRITICAL: Offset unknown [23:47:57] pgehres: I am going to do some instrumentation using live hacks on hume [23:48:05] * pgehres approves [23:48:21] i would avoid using commonswiki as a test [23:49:52] LeslieCarr: so its delete interface-range vlan-private1-eqiad member-range ge-4/0/38 to ge-4/0/42 right? 
[23:50:01] cuz it throws error at the to [23:50:07] syntax error, expecting ';', [Enter], or '|'. [23:50:09] New patchset: Asher; "return es100[69], pmtpa s2 dbs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55001 [23:50:26] get rid of everything after the "to" [23:50:36] RECOVERY - NTP on mw1220 is OK: NTP OK: Offset -0.002127170563 secs [23:50:43] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/55001 [23:51:22] !log asher synchronized wmf-config/db-pmtpa.php 'returning s2 servers' [23:51:28] Logged the message, Master [23:52:05] !log asher synchronized wmf-config/db-eqiad.php 'returning es100[69]' [23:52:11] Logged the message, Master [23:52:39] RECOVERY - NTP on mw1219 is OK: NTP OK: Offset -0.002117753029 secs [23:53:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:54:50] robh: mw1209-20 ready to be put in service...anything besides adding to node_groups? [23:56:07] a few things, yep [23:56:16] we have to add to lvs as well, lets plan on doing that tomorrow morning [23:56:21] * RobH is getting ready to head out [23:56:27] !log freeing up roughly 500G of ram across our cluster by calling puppet via cron :) [23:56:29] i rather not push new services this late in day [23:56:33] Logged the message, notpeter [23:56:38] cmjohnson1: we have to add to nodegroups and lvs/pybal [23:56:47] !log yay peter! [23:56:53] Logged the message, RobH [23:56:58] \o/ [23:57:06] robh: okay sounds good [23:57:16] you cant hear it, but the server fans, they sound like applause [23:57:18] notpeter: ^ [23:57:28] hahahaha [23:57:30] awesome :) [23:58:17] New patchset: Pyoungmeister; "WIP: first bit of stuff for taming the mysql module and making the SANITARIUM" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53907 [23:58:38] \m/ [23:59:09] notpeter: does the SANITARIUM use \m/ariadb?
[23:59:45] New patchset: Tim Starling; "Add a cron job to clean up old MW logs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55003