[00:00:58] notpeter: done with newegg chat :) [00:01:51] cool [00:02:41] so, the precise boxes *had* libapache2-mod-php5filter installed on them [00:02:48] yeah, so the drive that had linux was replaced with an ssd, so I've only got vista on that box while I download kubuntu [00:02:53] this conflicts with libapache2-mod-php5 [00:03:13] * Aaron|laptop hopes the ram will work next time...grr [00:03:13] but the confs weren't purged. is it possible that they're still being used and that's what's fucking things up? [00:03:27] hmm [00:03:44] see mw1:/etc/php5/apache2filter/php.ini [00:03:59] seems like a longshot, but it's a difference between them [00:04:25] but has [00:04:31] allow_url_fopen = On [00:04:37] and allow_url_include = Off [00:04:44] not seeing that dir [00:04:50] which look like they could affect how they talk to swift? [00:05:06] oh, weird [00:05:10] look at mw16 [00:05:13] is present there [00:05:26] (difference between nodes could be the order in which puppet installed things...) [00:06:49] * Aaron|laptop looks up php5filter [00:08:23] hhhhmmmm, no. that can't be it [00:08:33] as fluorine shows same error [00:08:44] and that junk isn't present on all precise mw hosts [00:08:55] sorry to point you in the wrong direction :/ [00:09:03] anyway, we don't use fopen or url_include for cloudfiles [00:09:13] ah, well [00:10:51] i shall keep looking [00:15:12] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [00:23:09] RECOVERY - SSH on virt5 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:25:22] New patchset: Asher; "* redirector.c from commit f31f165d827a51ce002ed14351218503de4a5fe3 * disables redirection of officewiki * this thing really needs a config file.." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25201 [00:26:17] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25201 [00:31:41] New patchset: Dzahn; "remove webserver::php5 from ./misc/secure.pp, conflicts with same in planet class which is also applied to singer" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25202 [00:32:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25202 [00:33:05] New review: Dzahn; "old secure.wm" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/25202 [00:33:06] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25202 [00:34:33] RECOVERY - Puppet freshness on singer is OK: puppet ran at Wed Sep 26 00:33:58 UTC 2012 [00:34:41] !log fixing puppet on singer..finally [00:34:52] Logged the message, Master [00:43:51] Hi, any db expert here? I could use some advice for a schema for an analytics tool. Tracking usage of user properties over time. [00:43:57] https://github.com/Krinkle/ts-krinkle-wmfPrefStats/blob/master/tables.sql [00:44:03] https://github.com/Krinkle/ts-krinkle-wmfPrefStats/blob/master/queries.txt [00:46:12] binasher: Maybe you can take a peek? [00:46:52] I've basically just scraped together bits and pieces that make sense to me. But since it would be in for the long run, I'd rather do it "right-ish" at once [00:48:13] Krinkle: i'll have to look at it later, but I've got tabs open with those two urls [00:48:15] there's about 150 distinct user properties on an average wiki. It would be populated once a week.
So that adds [ 150 props * N values * 2 (active and total) * 800 wikis ] rows a week [00:49:03] that sounds like a lot, but I'm not sure there is a more efficient approach. I've already minimized the number of rows a lot by aggregating into a sum column (instead of leaving them separate as in the original table) [00:49:20] binasher: Thanks! Just ping me here when you have a minute, I'll be here for another couple hours [00:51:20] btw, fyi: 150 * 3 (avg. diff values) * 2 * 800 = 720,000; you don't want to know what it was when N users was still in there. Also nice that there are no rows where sum=0; because it is aggregated, properties not used don't exist. [00:53:29] Krinkle: is that adding ~ 720k new rows every week, or just updating 720k? [00:53:40] adding, it is for historical analysis and statistics [00:54:09] to see how usage rises and falls over time (its primary output would be a graph) [00:54:25] (generated on the client-side with JSON data from the server from this table) [00:56:36] RECOVERY - NTP on virt5 is OK: NTP OK: Offset -0.03804636002 secs [00:57:14] New patchset: Ryan Lane; "Setting correct compute nodes for pmtpa" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25204 [00:58:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25204 [00:58:17] Treat me like a newbie, I don't defend this by any means. It's just what I came up with in about an hour – based on the little database experience I have. Any radical change suggestions are welcome. [00:59:45] PROBLEM - Host cp1030 is DOWN: PING CRITICAL - Packet loss = 100% [01:00:27] !log reinstalling cp1030 with precise [01:00:37] Logged the message, Master [01:00:39] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25204 [01:05:27] RECOVERY - Host cp1030 is UP: PING OK - Packet loss = 0%, RTA = 26.69 ms [01:05:54] PROBLEM - Host cp1029 is DOWN: PING CRITICAL - Packet loss = 100% [01:06:46] thats also me [01:08:45] PROBLEM - SSH on cp1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:10:06] RECOVERY - SSH on cp1030 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [01:11:36] RECOVERY - Host cp1029 is UP: PING OK - Packet loss = 0%, RTA = 26.65 ms [01:14:59] New patchset: Andrew Bogott; "Restart compute after installing wikinotifier." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25205 [01:15:21] PROBLEM - SSH on cp1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:15:54] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25205 [01:18:12] RECOVERY - SSH on cp1029 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [01:21:21] PROBLEM - Host cp1032 is DOWN: PING CRITICAL - Packet loss = 100% [01:22:24] RECOVERY - Host cp1032 is UP: PING OK - Packet loss = 0%, RTA = 26.78 ms [01:26:00] PROBLEM - Host cp1034 is DOWN: PING CRITICAL - Packet loss = 100% [01:28:42] PROBLEM - NTP on cp1030 is CRITICAL: NTP CRITICAL: No response from NTP server [01:29:44] Ryan_Lane: here's the actual varnishlog line I was running on the mobile varnish servers [01:29:47] varnishlog -c -m 'TxStatus:^503$' -n frontend [01:29:54] PROBLEM - SSH on cp1032 is CRITICAL: Connection refused [01:29:58] that provides the entire request log for anything that 503's [01:30:32] which is about 30-40 lines, but really lets you see what's going on [01:30:37] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25205 [01:31:42] RECOVERY - Host cp1034 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [01:34:24] PROBLEM - NTP on cp1029 is CRITICAL: NTP CRITICAL: No response from NTP server [01:35:54] PROBLEM - SSH on cp1034 is CRITICAL: Connection refused [01:38:36] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [01:40:06] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 189 seconds [01:40:06] PROBLEM - Host cp1035 is DOWN: PING CRITICAL - Packet loss = 100% [01:41:54] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 258 seconds [01:44:36] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 11 seconds [01:44:54] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [01:45:48] RECOVERY - Host cp1035 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [01:45:57] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [01:46:33] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 11 seconds [01:46:51] RECOVERY - SSH on cp1032 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [01:48:03] RECOVERY - SSH on cp1034 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [01:48:21] PROBLEM - NTP on cp1032 is CRITICAL: NTP CRITICAL: No response from NTP server [01:49:24] PROBLEM - Host cp1033 is DOWN: PING CRITICAL - Packet loss = 100% [01:50:27] PROBLEM - SSH on cp1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:50:36] PROBLEM - Host cp1036 is DOWN: PING CRITICAL - Packet loss = 100% [01:51:39] RECOVERY - Host cp1036 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [01:53:27] RECOVERY - SSH on cp1035 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [01:54:39] PROBLEM - NTP on cp1034 is CRITICAL: NTP CRITICAL: No response from NTP server [01:55:06] RECOVERY - Host cp1033 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [02:00:03] PROBLEM - SSH on cp1036 is CRITICAL: Connection refused [02:02:25] Krinkle: ping me tomorrow about wmfPrefStats, i'm about to sign off for the day. but to start things out, it sounds like query cases such as "Usage of all preferences on all wikis over time (between 2012-09-01 and 2012-10-01)" will be pretty costly. regarding the schema, a cheap primary key might be better than the current unique index, although how data should be ordered on disk depends on which of the query cases is most common. 
a complex [02:02:26] primary key hurts insert performance (doesn't matter if something is only written to via batch updates once a week) but also makes secondary indexes take up a lot more space which might matter here. a unique index shouldn't include a blob column (or anything that can be null) as i_ph_item does now. and if using a blob for ph_value could be avoided, that would be nice as well. [02:03:39] binasher: Not sure what kind of primary key you mean? [02:03:51] I don't have any other data [02:04:04] you mean a primary auto-increment integer? [02:04:17] there could even be a garbage auto increment pk that isn't typically used for anything [02:05:18] that will result in data on disk being sorted by insertion time, and will make using secondary keys cheaper [02:05:40] right [02:06:17] I'm not sure I can change the type of any columns (specifically the "blob" for _value). That comes from mw_user_properties.up_value which is also blob [02:06:47] though it only contains text afaik, I suppose I can merge null and empty string [02:07:09] so would a varchar or varbinary be better (I can't give a width though) [02:07:25] i wonder what the max length would be in that case [02:07:41] in mediawiki it doesn't have a max length [02:07:42] RECOVERY - SSH on cp1036 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [02:08:18] I can find out, but it wouldn't be stable, there can be some extreme values. one of them is fancysig which contains arbitrary wikitext. [02:08:29] binasher: but the index can be shorter than the value, right? [02:09:04] (I'll make it skip fields that take arbitrary input of course) [02:10:39] yep, and you should define an index length if adding one to a text or blob field [02:11:05] but i wouldn't recommend adding one to a unique field [02:11:19] Oh that's right. I shouldn't trim the unique one [02:13:48] binasher: btw, what's the deal with varchar and varbinary? I see both used in mediawiki kinda mixed, and then sometimes varchar() binary. [02:14:25] New review: Faidon; "Typo: "is none" is not valid Python, it needs to be "is None"." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/24303 [02:15:53] (btw, don't let me hold you up) - we'll talk tomorrow. [02:16:06] PROBLEM - NTP on cp1035 is CRITICAL: NTP CRITICAL: No response from NTP server [02:17:51] Krinkle: we set binary to the default char set in mysqld which forces char to binary regardless of the schema definition.. going to run now, tty tmw [02:18:17] ah, okay. That's what I suspected (since it has charset=binary).
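
A minimal sketch of the table shape binasher is suggesting above — a cheap auto-increment surrogate primary key instead of the wide unique index, no blob (or nullable) column in any unique key, varbinary with a prefix index instead of an indexed blob for the value, and the binary charset that mysqld forces anyway. All table and column names here are assumptions for illustration; the real ones live in the linked tables.sql.

```sql
-- Hypothetical reconstruction of the wmfPrefStats history table;
-- names are illustrative, not taken from the actual tables.sql.
CREATE TABLE pref_history (
    -- "garbage" auto-increment pk: rows land on disk in insertion
    -- (i.e. weekly batch) order, and secondary indexes stay small
    -- because each entry only carries this 4-byte id.
    ph_id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ph_wiki      VARBINARY(32)  NOT NULL,             -- e.g. 'enwiki'
    ph_property  VARBINARY(255) NOT NULL,             -- preference name
    ph_value     VARBINARY(255) NOT NULL DEFAULT '',  -- varbinary, not blob; NULL merged into ''
    ph_active    TINYINT UNSIGNED NOT NULL,           -- 1 = active users, 0 = all users
    ph_sum       INT UNSIGNED NOT NULL,               -- aggregated user count (sum=0 rows never written)
    ph_timestamp BINARY(14) NOT NULL,                 -- weekly snapshot, MediaWiki-style timestamp
    -- non-unique secondary key with a prefix length on the value
    -- column; per the advice above, never prefix-trim a UNIQUE key.
    KEY ph_lookup (ph_wiki, ph_property, ph_value(64), ph_timestamp)
) ENGINE=InnoDB DEFAULT CHARSET=binary;
```

At ~720k inserted rows per weekly batch this stays append-only, so the insert-performance cost of a wider key would be tolerable either way; the bigger win of the surrogate key is the smaller secondary indexes.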
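And a sketch of the query case binasher flags as pretty costly — usage of all preferences on all wikis over a month — against the hypothetical table above. It has to touch four or five full weekly snapshots, which is why he notes that on-disk ordering should follow the most common query case.

```sql
-- Usage of all preferences on all wikis between 2012-09-01 and 2012-10-01.
-- Scans every row of ~4-5 weekly snapshots (~720k rows each); with the
-- insertion-ordered surrogate pk, each snapshot is at least contiguous.
SELECT ph_property,
       ph_timestamp,
       SUM(ph_sum) AS users
FROM pref_history
WHERE ph_timestamp BETWEEN '20120901000000' AND '20121001000000'
  AND ph_active = 1
GROUP BY ph_property, ph_timestamp
ORDER BY ph_property, ph_timestamp;
```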
[02:19:06] PROBLEM - NTP on cp1033 is CRITICAL: NTP CRITICAL: No response from NTP server [02:30:39] PROBLEM - NTP on cp1036 is CRITICAL: NTP CRITICAL: No response from NTP server [02:40:24] RECOVERY - Puppet freshness on cadmium is OK: puppet ran at Wed Sep 26 02:40:10 UTC 2012 [03:00:57] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [03:00:57] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [03:29:00] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , plwiktionary (47132), zhwiki (24178) [03:30:12] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , plwiktionary (47178), zhwiki (26889) [03:50:52] PROBLEM - Puppet freshness on mw10 is CRITICAL: Puppet has not run in the last 10 hours [03:50:52] PROBLEM - Puppet freshness on mw1 is CRITICAL: Puppet has not run in the last 10 hours [03:50:52] PROBLEM - Puppet freshness on mw13 is CRITICAL: Puppet has not run in the last 10 hours [03:50:52] PROBLEM - Puppet freshness on mw16 is CRITICAL: Puppet has not run in the last 10 hours [03:50:52] PROBLEM - Puppet freshness on mw4 is CRITICAL: Puppet has not run in the last 10 hours [03:50:53] PROBLEM - Puppet freshness on mw14 is CRITICAL: Puppet has not run in the last 10 hours [03:50:53] PROBLEM - Puppet freshness on mw6 is CRITICAL: Puppet has not run in the last 10 hours [03:50:54] PROBLEM - Puppet freshness on mw7 is CRITICAL: Puppet has not run in the last 10 hours [03:50:54] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours [03:50:55] PROBLEM - Puppet freshness on mw11 is CRITICAL: Puppet has not run in the last 10 hours [03:50:55] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours [03:50:56] PROBLEM - Puppet freshness on mw12 is CRITICAL: Puppet has not run in the last 10 hours [03:50:56] PROBLEM - Puppet freshness on mw2 is CRITICAL: Puppet has not run in the last 10 hours [03:50:57] PROBLEM - Puppet freshness on mw3 is CRITICAL: Puppet has not run in the last 10 hours [03:50:57] PROBLEM - Puppet freshness on mw5 is CRITICAL: Puppet has not run in the last 10 hours [03:50:58] PROBLEM - Puppet freshness on mw9 is CRITICAL: Puppet has not run in the last 10 hours [04:07:13] PROBLEM - Squid on brewster is CRITICAL: Connection refused [06:23:13] New patchset: ArielGlenn; "file offset uses off_t so we can deal with huge files" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/25226 [06:38:34] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/25226 [06:43:34] RECOVERY - Squid on brewster is OK: TCP OK - 0.004 second response time on port 8080 [07:07:45] New review: Tim Starling; "Interesting TODO item. So the script as it stands pulls code from gerrit every 3 minutes and immedia..." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/22116 [07:24:42] hello [07:27:20] New patchset: Dereckson; "(bug 40521) Add "deletelogentry" right to "Eliminator" group on ja.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25231 [07:27:46] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [07:27:46] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [07:27:46] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [07:27:46] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [07:27:46] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [07:30:53] !g Iaee6bd06725aa8d95070af3306bdd35db2af6730 [07:30:53] https://gerrit.wikimedia.org/r/#q,Iaee6bd06725aa8d95070af3306bdd35db2af6730,n,z [07:36:38] New patchset: Hashar; "avoid duplicate apache2 definitions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25188 [07:37:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25188 [07:45:26] New patchset: Dereckson; "(bug 29902) Cleaning InitialiseSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23059 [07:47:12] New patchset: Hashar; "avoid duplicate apache2 definitions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25188 [07:48:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25188 [08:01:12] New review: Hashar; "bah we need to change $name apparently." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/25188 [08:04:39] New patchset: Hashar; "avoid duplicate apache2 definitions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25188 [08:05:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25188 [08:09:40] Change merged: Hashar; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25188 [08:09:41] Change merged: Hashar; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24934 [08:09:41] Change merged: Hashar; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24878 [08:09:41] Change merged: Hashar; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25173 [08:09:41] Change merged: Hashar; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24213 [08:09:50] ahahha [08:09:53] such a hacker [08:22:02] stupid puppet [08:22:08] so I guess I have to resubmit all those changes now [08:23:41] New review: Hashar; "So that just does not work and I am abandoning that change." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25188 [08:28:25] New patchset: Hashar; "Redo SSL certs for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25233 [08:29:26] New patchset: Hashar; "role::gerrit:labs::jenkins" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25234 [08:30:20] New review: Hashar; "resent as https://gerrit.wikimedia.org/r/25233" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24213 [08:30:21] New review: Hashar; "resent as https://gerrit.wikimedia.org/r/25234" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25173 [08:30:21] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25233 [08:30:21] New review: Hashar; "This was https://gerrit.wikimedia.org/r/#/c/24213/" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/25233 [08:30:21] New review: Hashar; "This was https://gerrit.wikimedia.org/r/#/c/25173/" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/25234 [08:30:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25234 [08:32:31] New patchset: Hashar; "import zuul module from OpenStack" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25235 [08:33:23] New patchset: Hashar; "zuul role for labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25236 [08:34:17] New review: Hashar; "resent as https://gerrit.wikimedia.org/r/#/c/25235/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24878 [08:34:17] New review: Hashar; "resent as https://gerrit.wikimedia.org/r/#/c/25236/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24934 [08:34:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25235 [08:34:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25236 [08:35:41] Change abandoned: Hashar; "Abandoning this, I cant wait for puppet so will setup Gerrit manually." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25234 [09:07:51] New review: Dereckson; "The shellpolicy issue is now resolved." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/23076 [09:27:47] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [10:15:49] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [10:35:55] PROBLEM - Puppet freshness on cp1030 is CRITICAL: Puppet has not run in the last 10 hours [10:54:46] PROBLEM - Puppet freshness on cp1029 is CRITICAL: Puppet has not run in the last 10 hours [11:07:40] PROBLEM - Puppet freshness on cp1032 is CRITICAL: Puppet has not run in the last 10 hours [11:09:28] PROBLEM - SSH on searchidx2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:10:58] RECOVERY - SSH on searchidx2 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:23:43] PROBLEM - Puppet freshness on cp1034 is CRITICAL: Puppet has not run in the last 10 hours [11:23:43] PROBLEM - Puppet freshness on cp1035 is CRITICAL: Puppet has not run in the last 10 hours [11:34:04] New review: Demon; "And pushing it to the branch makes it *look* merged here. Looks like we're hitting https://code.goog..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24213 [11:39:36] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [11:42:36] PROBLEM - Puppet freshness on cp1033 is CRITICAL: Puppet has not run in the last 10 hours [11:52:39] PROBLEM - Puppet freshness on cp1036 is CRITICAL: Puppet has not run in the last 10 hours [12:32:18] bad cr2-eqiad ! i set your prefix limit for our new neighbor to 50k, yet it keeps thinking it's 10k and shutting the session down [12:32:29] why are you being dumb cr2 ? [12:32:32] they're confused [12:32:37] they want a support contract so badly [12:32:54] awww, it just wants to know it's loved :) [12:32:59] but it is! 
[12:33:13] i love you, cr2-eqiad [12:33:50] you were my first juniper device that i could really call a router [12:33:57] except your twin, cr1-eqiad [12:34:02] but you got most of the traffic anyway! [12:34:06] (don't be mad, cr1) [12:36:12] now cr1 is crying :( [12:36:23] argh [12:36:25] can't please everyone [12:36:27] thank you for coming out so strongly in support of use of open source software in your email mark [12:36:46] i want to see wikimedia doing the right thing [12:37:26] me too. your word has a lot of weight to it, it's good to see you using it [12:38:33] uh back in a few [12:46:00] RobH_: are you gonna be at eqiad today? [13:01:39] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [13:01:39] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [13:35:38] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:38] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [13:39:14] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:34] mark: yep, leaving to head down there in a few minutes [13:40:44] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:44] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.343 second response time [13:43:44] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.926 second response time [13:44:22] New patchset: Demon; "Refactor gerrit2 account stuff so we can reuse it on other hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24557 [13:44:38] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:23] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:24] New review: Demon; "PS2 is just a rebase." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/24557 [13:45:24] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24557 [13:45:59] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.229 second response time [13:46:53] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.554 second response time [13:52:08] PROBLEM - Puppet freshness on mw13 is CRITICAL: Puppet has not run in the last 10 hours [13:52:08] PROBLEM - Puppet freshness on mw10 is CRITICAL: Puppet has not run in the last 10 hours [13:52:08] PROBLEM - Puppet freshness on mw12 is CRITICAL: Puppet has not run in the last 10 hours [13:52:08] PROBLEM - Puppet freshness on mw14 is CRITICAL: Puppet has not run in the last 10 hours [13:52:08] PROBLEM - Puppet freshness on mw11 is CRITICAL: Puppet has not run in the last 10 hours [13:52:09] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours [13:52:09] PROBLEM - Puppet freshness on mw1 is CRITICAL: Puppet has not run in the last 10 hours [13:52:10] PROBLEM - Puppet freshness on mw5 is CRITICAL: Puppet has not run in the last 10 hours [13:52:10] PROBLEM - Puppet freshness on mw2 is CRITICAL: Puppet has not run in the last 10 hours [13:52:11] PROBLEM - Puppet freshness on mw3 is CRITICAL: Puppet has not run in the last 10 hours [13:52:11] PROBLEM - Puppet freshness on mw16 is CRITICAL: Puppet has not run in the last 10 hours [13:52:12] PROBLEM - Puppet freshness on mw4 is CRITICAL: Puppet has not run in the last 10 hours [13:52:12] PROBLEM - Puppet freshness on mw6 is CRITICAL: Puppet has not run in the last 10 hours [13:52:13] PROBLEM - Puppet freshness on mw9 is CRITICAL: Puppet has not run in the last 10 hours [13:52:13] PROBLEM - Puppet freshness on mw7 is CRITICAL: Puppet has not run in the last 10 hours [13:52:14] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours [13:56:05] !log ms-be8 powercycling [13:56:16] Logged the message, Master [13:56:20] PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:56:34] hello cmjohnson1 [13:56:38] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:56:50] good morning apergos [13:56:57] or afternoon [13:57:32] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100% [13:57:41] RECOVERY - Apache HTTP on mw29 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [13:57:50] lemme know when/what you need me to do as far as these boxes [13:57:59] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.246 second response time [13:58:08] PROBLEM - Puppet freshness on locke is CRITICAL: Puppet has not run in the last 10 hours [13:58:26] RECOVERY - Host ms-be8 is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [13:59:05] apergos: paravoid says that he was still seeing disk slow to mount but I am not...can you poke around be8 [13:59:35] sure. is it all installed and stuff? [13:59:48] cause I'll just do some reboots and watch, check the logs, etc [13:59:58] ye [14:00:18] yes, should be able to reboot and watch [14:00:44] okey dokey [14:01:47] i am out of console [14:02:37] cool [14:02:52] jeremyb: ping [14:02:56] i updated http://meta.wikimedia.org/wiki/Www.wikidata.org_template [14:03:03] when does it get synced? [14:03:15] or anyone else know?
[14:07:49] New patchset: Hashar; "should enable AFTv5 100% on beta only" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25048 [14:07:57] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25048 [14:08:03] and here we go, reboot number one [14:09:05] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100% [14:10:44] RECOVERY - Host ms-be8 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [14:13:08] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:13:17] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:13:26] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:14:47] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 47095 bytes in 2.047 seconds [14:14:56] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.305 second response time [14:16:08] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:08] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.243 second response time [14:17:09] apergos: is the lvs stuff related to you or should i check stuff out [14:17:29] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39743 bytes in 6.087 seconds [14:18:02] not me [14:18:16] LeslieCarr: [14:18:32] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:18:38] okay [14:19:53] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.378 second response time [14:20:02] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:20:11] PROBLEM - Apache HTTP on srv258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:20:38] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:23] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.050 second response time [14:21:41] RECOVERY - Apache HTTP on srv258 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.839 second response time [14:21:50] RECOVERY - Host ms-be8 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [14:23:48] i don't see the problems that are complaining ... 
[14:25:17] PROBLEM - Apache HTTP on mw39 is CRITICAL: Connection refused [14:26:33] the uplink on spence is also okay … [14:26:47] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:26:47] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:28:07] hmm scrying the ganglia graphs isn't netting me anything [14:28:17] RECOVERY - Apache HTTP on mw52 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.857 second response time [14:28:17] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.727 second response time [14:29:56] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.820 second response time [14:30:41] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:30:50] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:32:02] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.640 second response time [14:32:54] checked syslog on a random one of the mw hosts that had complained, did not see anything interesting (eg not a pile of errors spewing etc) [14:33:19] it looks like about 2 hours ago there were a bunch of OOM errors but not right now AFAICT [14:33:50] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time [14:34:53] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100% [14:36:23] !g I4913fb6521f1f8bb98bb6a9480edd8e22eb2f7dd [14:36:23] https://gerrit.wikimedia.org/r/#q,I4913fb6521f1f8bb98bb6a9480edd8e22eb2f7dd,n,z [14:37:17] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [14:40:44] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:38] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:38] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:43:08] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.678 second response time [14:43:53] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.925 second response time [14:46:08] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [14:48:33] cmjohnson1: after three reboots, still not seeing the delays [14:48:36] for ms-be8 [14:49:03] cool [14:49:27] i know i rebooted about 4 times yesterday and once this morning and no delays either [14:50:02] what was done with this box, remind me? [14:50:03] apergos: should we pull the disks and put them in another machine and see how it does? [14:50:12] ms-be8...nothing just new sata drives [14:50:26] oh, sata instead of nl sata eh? [14:50:31] these ones are wd too [14:51:55] yep [14:52:00] well it's interesting [14:52:12] I'd like to see what happens with ms-be6 and the new controller also [14:52:15] are they sending that? [14:52:34] we had nl sas drives in there b4 [14:54:05] er nl sas, yeah [14:57:49] um so are they sending the new controller? or are they waiting for further word from us first? [14:59:10] the controller will be here today [14:59:16] sometime around 2p usually [14:59:19] sweet [14:59:38] 9 pm for me [14:59:40] what box do you want to put it in? [14:59:48] well I was thinking we could try ms-be6 [14:59:51] the sata drives appear to be working on 8...
[15:00:00] k....let's go w/6 [15:00:01] it's been through everything else and still has that one drive that takes its time [15:05:29] New patchset: Ottomata; "Removing unused misc/analytics.pp file." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25260 [15:06:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25260 [15:06:46] New patchset: Ottomata; "role/analytics.pp - including geoip" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25261 [15:07:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25261 [15:08:33] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25260 [15:09:03] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25261 [15:12:01] !log sq37 has fatal error...powering down to replace disk controller card [15:12:11] Logged the message, Master [15:13:35] * Damianz pokes LeslieCarr [15:13:43] LeslieCarr: I want to work on bast1001 network cable 100mbit issue [15:13:50] and you are the only user other than me on the host [15:13:53] eep! [15:13:56] ;] [15:14:00] sure, go for it [15:14:03] thanks for letting me know [15:15:43] LeslieCarr: Just wondering, are there any network restrictions between the toolserver and labs network wise? Seems you can't ssh from labs to ts, but can to the outside world (I know prod is restricted for ssh access from labs and wonder if ts is caught up in that). [15:16:15] Damianz: quite possibly, give me about 10 minutes to check it out [15:16:24] :) No rush. [15:19:40] !log troubleshooting bast1001 link speed per 3414 [15:19:51] Logged the message, RobH [15:20:14] how the hell does one check link speed in the command line without ethtool ;_; [15:21:40] !log bast1001 link speed fixed, done working on it [15:21:50] Logged the message, RobH [15:21:52] you could grep dmesg [15:22:35] indeed, that works [15:22:39] thx =] [15:23:06] not quite as stateless as an ethtool check but since we dont put ethtool in production servers, it works =] [15:23:18] ethtool is just generally awesome [15:33:03] RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 57.86 ms [15:40:42] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:24] RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [15:52:24] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [15:54:48] PROBLEM - Host payments2 is DOWN: PING CRITICAL - Packet loss = 100% [15:56:36] RECOVERY - Host payments2 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [15:57:13] !log dist-upgrade & reboot most payments boxes [15:57:23] Logged the message, Master [16:06:12] PROBLEM - Host payments3 is DOWN: PING CRITICAL - Packet loss = 100% [16:06:48] RECOVERY - Host payments3 is UP: PING WARNING - Packet loss = 64%, RTA = 1.39 ms [16:09:30] PROBLEM - NTP on analytics1007 is CRITICAL: NTP CRITICAL: No response from NTP server [16:12:39] PROBLEM - Host cp1037 is DOWN: PING CRITICAL - Packet loss = 100% [16:14:18] PROBLEM - Host cp1038 is DOWN: PING CRITICAL - Packet loss = 100% [16:14:36] PROBLEM - Host cp1039 is DOWN: PING CRITICAL - Packet loss = 100% [16:14:45] !log carbon rebooting for console redirection issues, any eqiad based installs will fail during this downtime [16:14:55] Logged the message, RobH [16:17:27] PROBLEM - Host carbon is DOWN: PING CRITICAL -
Packet loss = 100% [16:24:48] RECOVERY - Host carbon is UP: PING OK - Packet loss = 0%, RTA = 26.81 ms [16:43:05] New patchset: Aaron Schulz; "Added support for timeline/math extensions." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24303 [16:43:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24303 [16:45:14] New review: Brian Wolff; "Looks good to me to the best of my knowledge." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/24561 [16:47:55] New review: Aaron Schulz; "Random file deleted for no reason (maybe since I'm doing this from a win7 box?)." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/24303 [16:52:03] !log mc1002 offline, memory bad per rt3612, new memory will arrive tomorrow [16:52:13] Logged the message, RobH [17:26:53] New patchset: ArielGlenn; "Added support for timeline/math extensions." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24303 [17:27:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24303 [17:28:47] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [17:28:47] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [17:28:47] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [17:28:47] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [17:28:47] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [18:01:43] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [18:06:30] !log mc1006 offline for memory troubleshooting per rt 3613 [18:06:41] Logged the message, RobH [18:09:38] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24818 [18:17:25] New patchset: Andrew Bogott; "Point out that this file shouldn't be used" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25276 [18:18:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25276 [18:19:38] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25276 [18:39:13] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [18:42:50] PROBLEM - SSH on ms-be6 is CRITICAL: Connection refused [18:43:56] !log mc1006 testing done, single bad dimm, support ticket filed rt3614 [18:44:06] Logged the message, RobH [18:44:15] !log correction rt 3613 [18:44:25] Logged the message, RobH [18:48:32] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [18:55:53] RECOVERY - SSH on ms-be6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:56:02] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [18:56:32] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24308 [19:20:46] New patchset: Pyoungmeister; "correcting mac for mw1103" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25282 [19:21:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25282 [19:21:50] PROBLEM - SSH on ms-be6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:39] Change abandoned: Pyoungmeister; "weird...." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/25282 [19:24:59] RECOVERY - SSH on ms-be6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:25:18] * Damianz wonders if Pyoungmeister is feeling ok [19:26:25] that commit couldn't be merged because of a dependency that was totally blank [19:26:42] don't feel like figurin' for a change of 10 characters.... [19:27:08] heh [19:27:22] New patchset: Pyoungmeister; "correcting mac for mw1103" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25283 [19:28:17] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [19:28:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25283 [19:28:32] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25283 [19:45:41] RECOVERY - Puppet freshness on ms-be6 is OK: puppet ran at Wed Sep 26 19:45:34 UTC 2012 [19:47:11] RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [19:47:11] RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [19:47:20] RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [19:47:29] RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [19:47:47] RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [19:47:56] RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [19:47:56] RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [19:47:56] RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:48:14] RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [19:48:14] RECOVERY - swift-container-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [19:51:05] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [19:52:35] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [19:52:53] RECOVERY - swift-account-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [19:53:11] RECOVERY - swift-object-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [20:02:11] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:02:56] PROBLEM - SSH on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:47] RECOVERY - SSH on fenari is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [20:06:41] RECOVERY - HTTP on fenari is OK: HTTP OK HTTP/1.1 200 OK - 4416 bytes in 0.003 seconds [20:09:43] why doesn't MegaCli64 report any adapters? 
[20:17:20] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [20:28:15] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24998 [20:28:45] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24440 [20:30:12] !log cp1031 will remain offline until replacement parts arrive tomorrow rt3614 [20:30:23] Logged the message, RobH [20:30:23] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25233 [20:31:32] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24557 [20:35:09] * apergos updates the tickets and calls it quits for the day [20:36:41] PROBLEM - Puppet freshness on cp1030 is CRITICAL: Puppet has not run in the last 10 hours [20:39:43] !log mc1009, 1010 drac fixed [20:39:53] Logged the message, RobH [20:39:54] RobH: woo! [20:39:55] thanks [20:40:09] np, trying to get 1011 fixed before i have to prepare for this call [20:40:26] the reboot takes a few minutes, then you have 5 seconds to enter a key combo =P [20:42:29] !log mc1011 drac fixed [20:42:33] woo! [20:42:38] then I get to do shit by hand on it too! [20:42:38] Logged the message, RobH [20:43:55] RobH: can you do mw1014 [20:44:03] so that I can test the cable thingy? [20:53:38] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: Connection refused [20:55:35] PROBLEM - Puppet freshness on cp1029 is CRITICAL: Puppet has not run in the last 10 hours [20:59:20] RobH: mc1009 can't see its 10g nic :/ [20:59:32] .... [20:59:43] sigh...i flashed them alllllll [20:59:46] whyyyyyyyy [20:59:51] WWWWWWWWWRRRRRRRRRRRRRYYYYYYYYYYYYY [20:59:56] *headdesk* [20:59:58] dunno, lemme check mc1010 as well [21:01:44] notpeter: lemme know, i am headed back into dc floor to work on the rest of them. [21:01:52] kk [21:02:04] RobH: let's try to focus on the one for the cable testing [21:02:11] so that we can definitely get that data [21:07:06] RobH: yeah, mc1010 also not able to see its 10g nic for pxe [21:08:03] damn it. [21:08:38] PROBLEM - Puppet freshness on cp1032 is CRITICAL: Puppet has not run in the last 10 hours [21:09:09] ok, i will finish fixing drac details then pull the flash media for the pxe 10g and check them out [21:09:26] sweet. thanks! [21:09:31] sorry that this is such a pain.... [21:09:32] i dunno why they arent working i flashed those the same as the others [21:09:37] since i could do that without dca [21:09:41] meh, not yer fault [21:09:46] yeah, it's weird.... [21:09:50] more dell servers not working [21:09:55] this is my not shocked face. [21:10:00] lulz [21:11:19] New patchset: Dzahn; "planet: add translations of the index.html, turn index.html.tmpl.erb into a puppet template. add a definition to create planet theme. add hash of translations to role class,..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25428 [21:12:13] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/25428 [21:15:23] New patchset: Dzahn; "planet: add translations of the index.html, turn index.html.tmpl.erb into a puppet template. add a definition to create planet theme. add hash of translations to role class,..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25428 [21:16:14] New review: gerrit2; "Change did not pass lint check. 
You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/25428 [21:24:41] PROBLEM - Puppet freshness on cp1034 is CRITICAL: Puppet has not run in the last 10 hours [21:24:41] PROBLEM - Puppet freshness on cp1035 is CRITICAL: Puppet has not run in the last 10 hours [21:26:28] notpeter: did you verify the boot order or disable pxe on nic1-4? [21:26:38] disabled [21:27:03] cmjohnson1: we need you on 2002 [21:27:11] k [21:29:47] notpeter: Im checking out the bios flash on 1009 now [21:30:17] RobH: sweet! thanks [21:34:03] notpeter: these are flashed [21:34:10] shows the same setting in the intel bios as the others. [21:34:19] hrm...... [21:34:30] cuz i go to enable flash [21:34:36] and its unsupported, since its already enabled [21:34:43] (i saw this when i flashed them awhile ago) [21:34:46] and shows PXE supported [21:34:48] like it should. [21:35:05] but i have not compared bios settings [21:35:14] can i reboot any of the 1001-1008? [21:35:20] any, sure [21:36:50] rebooting 1008 as well [21:36:55] kk [21:37:20] this is strange that its not working. [21:37:27] agreed [21:37:59] are all the others working? robh do you have link lights? [21:38:24] i had a bad sfp at the switch ...and had no link lights [21:38:45] cmjohnson1: nah, it's weird. it's not like it doesn't get link, its like it thinks there are absolutely no network devices [21:39:48] hrmm...is the mac correct? [21:40:35] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [21:40:35] can't get mac because can't hit ^-s [21:40:35] it doesn't show using it in post [21:40:45] notpeter: you saw the option to enter intel bios? [21:40:49] cuz im seeing it [21:40:56] nope [21:41:05] oh what box? [21:41:08] *on [21:41:09] 1009 [21:41:14] im on it now confirming [21:41:37] hopping on mc1010 [21:42:55] New patchset: Andrew Bogott; "Fix the public_ip value in instance status." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25435 [21:43:02] notpeter: so im in the bios on 1009 [21:43:09] the intel bios [21:43:12] not system bios. [21:43:17] it says its setup for pxe [21:43:35] huh [21:43:52] New patchset: Dzahn; "planet: add translations of the index.html, turn index.html.tmpl.erb into a puppet template. add a definition to create planet theme. add hash of translations to role class,..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25428 [21:43:52] gerrit-wm: come on [21:43:59] PROBLEM - Puppet freshness on cp1033 is CRITICAL: Puppet has not run in the last 10 hours [21:44:14] trying one more thing [21:44:22] now that i can see it telling it pxe again [21:44:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25435 [21:44:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25428 [21:44:43] though i didnt change any values, other than confirm the intel firmware via the bootutil cd [21:45:22] what do you hit to get into it? [21:45:22] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25428 [21:45:31] well, i loaded bootutil off cd [21:45:38] but the normal intel bios is ctrl+s after the SAS display [21:45:44] yeah [21:45:50] I never saw the ^-s option [21:45:54] I've also sat there trying to hit it [21:45:59] w/o success....
i dont recall if i saw it on 10, will check it in a moment [21:46:29] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25435 [21:46:32] on 1010 [21:46:44] but I also didn't see it on 1009 [21:46:59] heh, i have 1009 'init and establishing link' [21:47:02] but media fail. [21:47:35] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.010 seconds [21:47:48] notpeter: so the links have no lights [21:47:55] i dont think that mark or leslie set these up. [21:48:18] notpeter: lemme take over 1010 ok? [21:48:21] just to confirm [21:48:24] sure [21:49:40] so if i can see the attempt on 1010 without entering bootutil [21:49:46] then it's just the switch that needs to be configured [21:49:53] hhhmmm, ok [21:49:54] as the bulk are identical cables to 1001-1008 [21:50:07] but the 1001-1008 have link lights [21:50:10] 1009+ no lights [21:50:18] seems like a network issue [21:52:06] interesting, i dont see the ctrl+s on 1010 [21:52:09] lemme redo the flash and see [21:52:26] seems 1009 has the issue you reported fixed by flash, but still not working since it has no network link light [21:52:32] will see if this fixes 1010 shortly [21:53:23] i miss having a network admin around in both my morning and evening. [21:53:26] =P [21:53:53] PROBLEM - Puppet freshness on cp1036 is CRITICAL: Puppet has not run in the last 10 hours [21:53:58] Maybe they don't feel like peering with you anymore :D [21:53:58] MaxSem: Can you fix your umask on fenari please? [21:54:17] there are various paths of yours, chmod 644 and 755, that only roots can modify [21:54:24] (trying to update docroot/bits/favicons) [21:54:34] drwxr-xr-x 3 maxsem wikidev 4.0K Sep 12 22:57 favicon/ [21:54:45] -rw-r--r-- 1 maxsem wikidev 1.2K Sep 12 22:57 piece.ico [21:54:57] maxsem@fenari:~$ cat ~/.bashrc [21:54:57] umask 002 [21:55:22] old files I suppose [21:55:27] fix em anyway [21:55:32] maxsem@fenari:~$ umask [21:55:32] 0022 [21:55:33] its not retroactive [21:55:45] yes the umask is right, but these files were created before you did that [21:57:31] I just tried running find -user maxsem -exec chmod g+w {} \; [21:57:31] notpeter: ok, confirmed [21:57:37] seems i must have done this batch only half right [21:57:43] cuz it doesnt need the flash enabled [21:57:49] but needs pxe set to the nic so it shows the ctrl+s [21:57:57] so 1009 and 1010 are fixed, i will fix the rest now [21:57:58] ...but it seems I can't chmod my own files :P [21:58:11] RobH: I know! leslie should come back to the US [21:58:20] RobH: thanks! [21:58:21] MaxSem: the group bit is something else [21:59:04] find . -type d -exec chmod 755 {} \; [21:59:09] notpeter: ok, try 1009 now [21:59:11] i have link lights [21:59:12] find .
-type f -exec chmod 644 {} \; [21:59:16] i just reseated them all, and links away [21:59:25] so 1009 and 1010 can be tested, fixing rest [21:59:26] (on /h/w/c/docroot) [21:59:36] eh 775 and 664 of course [22:00:59] mutante: and you, /h/w/c/docroot/www.wikidata.org [22:01:13] git pull is still stuck on fenari because of a chmod issue there [22:01:55] docroot/www.wikidata.org/ [22:01:55] -rw-r--r-- 1 dzahn wikidev 318 Sep 25 20:45 favicon.ico [22:01:55] -rw-r--r-- 1 dzahn wikidev 271 Jul 4 13:11 robots.txt [22:02:24] Krinkle, done [22:03:45] thx [22:17:39] Here's the full list, if you see any others of yours, please fix those too :) [22:17:39] https://office.wikimedia.org/wiki/User:Ttijhof/dump#perm_issues_in_.2Fh.2Fw.2Fc.2Fdocroot [22:18:30] (pipe the command to grep username for a live stat) [22:19:19] Reedy: (most of you) [22:21:38] mine are fixed [22:21:57] And most are now non-existent brion [22:26:00] ran it on common now instead of common/docroot, dumped on ^ [22:26:10] yeah yours are all fixed now [22:26:31] many preilly, asher and dzahn [22:31:43] binasher: just fyi, I'm here now (ready whenever you are regarding wmfPrefStats) - updated files in https://github.com/Krinkle/ts-krinkle-wmfPrefStats [22:41:45] Reedy: so, how are you going to commit this? [22:41:55] I wouldn't know [22:42:10] or can it just work from fenari? [22:42:18] commit what? [22:42:25] those perm changes [22:42:33] oh [22:42:34] they show up in git status [22:43:29] done [22:50:50] New patchset: Krinkle; "killscripts.php: Add in version control, used for the italian blacklist" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25448 [22:51:17] New review: Krinkle; "Was already on fenari, adding to version control." [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/25448 [22:51:17] Change merged: Krinkle; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25448 [22:54:09] New patchset: Krinkle; "killscripts.php: Old file, no longer referenced anywhere" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25450 [22:54:23] Change merged: Krinkle; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25450 [23:02:35] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [23:02:35] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [23:11:47] New review: Krinkle; "What're we waiting for?" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/23425 [23:16:17] New patchset: Demon; "Fix I696578f4: Don't need to pass ssh_key to gerrit::jetty anymore" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25453 [23:17:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25453 [23:18:10] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25453 [23:37:27] New patchset: Dzahn; "planet: now that planet_languages is a hash with the values for translation, need to split out just the keys and use them to iterate through" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25458 [23:38:28] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25458 [23:38:45] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25458 [23:39:05] New patchset: Pyoungmeister; "decom db17" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25459 [23:39:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25459 [23:40:56] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25459 [23:53:35] PROBLEM - Puppet freshness on mw1 is CRITICAL: Puppet has not run in the last 10 hours [23:53:35] PROBLEM - Puppet freshness on mw10 is CRITICAL: Puppet has not run in the last 10 hours [23:53:35] PROBLEM - Puppet freshness on mw11 is CRITICAL: Puppet has not run in the last 10 hours [23:53:35] PROBLEM - Puppet freshness on mw12 is CRITICAL: Puppet has not run in the last 10 hours [23:53:35] PROBLEM - Puppet freshness on mw14 is CRITICAL: Puppet has not run in the last 10 hours [23:53:36] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours [23:53:36] PROBLEM - Puppet freshness on mw13 is CRITICAL: Puppet has not run in the last 10 hours [23:53:37] PROBLEM - Puppet freshness on mw2 is CRITICAL: Puppet has not run in the last 10 hours [23:53:37] PROBLEM - Puppet freshness on mw16 is CRITICAL: Puppet has not run in the last 10 hours [23:53:38] PROBLEM - Puppet freshness on mw4 is CRITICAL: Puppet has not run in the last 10 hours [23:53:38] PROBLEM - Puppet freshness on mw5 is CRITICAL: Puppet has not run in the last 10 hours [23:53:39] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours [23:53:39] PROBLEM - Puppet freshness on mw7 is CRITICAL: Puppet has not run in the last 10 hours [23:53:40] PROBLEM - Puppet freshness on mw6 is CRITICAL: Puppet has not run in the last 10 hours [23:53:40] PROBLEM - Puppet freshness on mw9 is CRITICAL: Puppet has not run in the last 10 hours [23:53:41] PROBLEM - Puppet freshness on mw3 is CRITICAL: Puppet has not run in the last 10 hours [23:59:35] PROBLEM - Puppet freshness on locke is CRITICAL: Puppet has not run in the last 10 hours