[00:00:58] notpeter: done with newegg chat :) [00:01:51] cool [00:02:41] so, the precise boxes *had* libapache2-mod-php5filter installed on them [00:02:48] yeah, so the drive that had linux was replaced with an ssd, so I've only got vista on that box while I download kubuntu [00:02:53] this conflicts with libapache2-mod-php5 [00:03:13] * Aaron|laptop hopes the ram will work next time...grr [00:03:13] but the confs weren't purged. is it possible that they're still being used and that's what's fucking things up? [00:03:27] hmm [00:03:44] see mw1:/etc/php5/apache2filter/php.ini [00:03:59] seems like a longshot, but it's a difference between them [00:04:25] but has [00:04:31] allow_url_fopen = On [00:04:37] and allow_url_include = Off [00:04:44] not seeing that dir [00:04:50] which look like they could affect how they talk to swift? [00:05:06] oh, weird [00:05:10] look at mw16 [00:05:13] is present there [00:05:26] (difference between nodes could be the order in which puppet installed things...) [00:06:49] * Aaron|laptop looks up php5filter [00:08:23] hhhhmmmm, no. that can't be it [00:08:33] as fluorine shows same error [00:08:44] and that junk isn't present on all precise mw hosts [00:08:55] sorry to point you in the wrong direction :/ [00:09:03] anyway, we don't use fopen or url_include for cloudfiles [00:09:13] ah, well [00:10:51] i shall keep looking [00:15:12] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [00:23:09] RECOVERY - SSH on virt5 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:25:22] New patchset: Asher; "* redirector.c from commit f31f165d827a51ce002ed14351218503de4a5fe3 * disables redirection of officewiki * this thing really needs a config file.." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25201 [00:26:17] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25201 [00:31:41] New patchset: Dzahn; "remove webserver::php5 from ./misc/secure.pp, conflicts with same in planet class which is also applied to singer" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25202 [00:32:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25202 [00:33:05] New review: Dzahn; "old secure.wm" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/25202 [00:33:06] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25202 [00:34:33] RECOVERY - Puppet freshness on singer is OK: puppet ran at Wed Sep 26 00:33:58 UTC 2012 [00:34:41] !log fixing puppet on singer..finally [00:34:52] Logged the message, Master [00:43:51] Hi, any db expert here? I could use some advice for a schema for an analytics tool. Tracking usage of user properties over time. [00:43:57] https://github.com/Krinkle/ts-krinkle-wmfPrefStats/blob/master/tables.sql [00:44:03] https://github.com/Krinkle/ts-krinkle-wmfPrefStats/blob/master/queries.txt [00:46:12] binasher: Maybe you can take a peek? [00:46:52] I've basically just scraped together bits and pieces that make sense to me. But since it would be in for the long run, I'd rather do it "right-ish" at once [00:48:13] Krinkle: i'll have to look at it later, but I've got tabs open with those two urls [00:48:15] there's about 150 distinct user properties on an average wiki. It would be populated once a week.
So that adds [ 150 props * N values * 2 (active and total) * 800 wikis ] rows a week [00:49:03] that sounds like a lot, but I'm not sure there is a more efficient approach. I've already minimized the number of rows a lot by aggregating into a sum column (instead of leaving them separate as in the original table) [00:49:20] binasher: Thanks! Just ping me here when you have a minute, I'll be here for another couple hours [00:51:20] btw, fyi: 150 * 3 (avg. diff values) * 2 * 800 = 720,000; you don't want to know what it was when N users was still in there. Also nice that there are no rows where sum=0; because it is aggregated, properties not used don't exist. [00:53:29] Krinkle: is that adding ~ 720k new rows every week, or just updating 720k? [00:53:40] adding, it is for historical analysis and statistics [00:54:09] to see how usage rises and falls over time (its primary output would be a graph) [00:54:25] (generated on the client-side with JSON data from the server from this table) [00:56:36] RECOVERY - NTP on virt5 is OK: NTP OK: Offset -0.03804636002 secs [00:57:14] New patchset: Ryan Lane; "Setting correct compute nodes for pmtpa" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25204 [00:58:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25204 [00:58:17] Treat me like a newbie, I don't defend this by any means. It's just what I came up with in about an hour – based on the little database experience I have. Any radical change suggestions are welcome. [00:59:45] PROBLEM - Host cp1030 is DOWN: PING CRITICAL - Packet loss = 100% [01:00:27] !log reinstalling cp1030 with precise [01:00:37] Logged the message, Master [01:00:39] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25204 [01:05:27] RECOVERY - Host cp1030 is UP: PING OK - Packet loss = 0%, RTA = 26.69 ms [01:05:54] PROBLEM - Host cp1029 is DOWN: PING CRITICAL - Packet loss = 100% [01:06:46] thats also me [01:08:45] PROBLEM - SSH on cp1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:10:06] RECOVERY - SSH on cp1030 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [01:11:36] RECOVERY - Host cp1029 is UP: PING OK - Packet loss = 0%, RTA = 26.65 ms [01:14:59] New patchset: Andrew Bogott; "Restart compute after installing wikinotifier." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25205 [01:15:21] PROBLEM - SSH on cp1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:15:54] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25205 [01:18:12] RECOVERY - SSH on cp1029 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [01:21:21] PROBLEM - Host cp1032 is DOWN: PING CRITICAL - Packet loss = 100% [01:22:24] RECOVERY - Host cp1032 is UP: PING OK - Packet loss = 0%, RTA = 26.78 ms [01:26:00] PROBLEM - Host cp1034 is DOWN: PING CRITICAL - Packet loss = 100% [01:28:42] PROBLEM - NTP on cp1030 is CRITICAL: NTP CRITICAL: No response from NTP server [01:29:44] Ryan_Lane: here's the actual varnishlog line I was running on the mobile varnish servers [01:29:47] varnishlog -c -m 'TxStatus:^503$' -n frontend [01:29:54] PROBLEM - SSH on cp1032 is CRITICAL: Connection refused [01:29:58] that provides the entire request log for anything that 503's [01:30:32] which is about 30-40 lines, but really lets you see what's going on [01:30:37] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25205 [01:31:42] RECOVERY - Host cp1034 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [01:34:24] PROBLEM - NTP on cp1029 is CRITICAL: NTP CRITICAL: No response from NTP server [01:35:54] PROBLEM - SSH on cp1034 is CRITICAL: Connection refused [01:38:36] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [01:40:06] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 189 seconds [01:40:06] PROBLEM - Host cp1035 is DOWN: PING CRITICAL - Packet loss = 100% [01:41:54] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 258 seconds [01:44:36] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 11 seconds [01:44:54] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [01:45:48] RECOVERY - Host cp1035 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [01:45:57] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [01:46:33] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 11 seconds [01:46:51] RECOVERY - SSH on cp1032 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [01:48:03] RECOVERY - SSH on cp1034 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [01:48:21] PROBLEM - NTP on cp1032 is CRITICAL: NTP CRITICAL: No response from NTP server [01:49:24] PROBLEM - Host cp1033 is DOWN: PING CRITICAL - Packet loss = 100% [01:50:27] PROBLEM - SSH on cp1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:50:36] PROBLEM - Host cp1036 is DOWN: PING CRITICAL - Packet loss = 100% [01:51:39] RECOVERY - Host cp1036 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [01:53:27] RECOVERY - SSH on cp1035 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [01:54:39] PROBLEM - NTP on cp1034 is CRITICAL: NTP CRITICAL: No response from NTP server [01:55:06] RECOVERY - Host cp1033 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [02:00:03] PROBLEM - SSH on cp1036 is CRITICAL: Connection refused [02:02:25] Krinkle: ping me tomorrow about wmfPrefStats, i'm about to sign off for the day. but to start things out, it sounds like query cases such as "Usage of all preferences on all wikis over time (between 2012-09-01 and 2012-10-01)" will be pretty costly. regarding the schema, a cheap primary key might be better than the current unique index, although how data should be ordered on disk depends on which of the query cases is most common. 
a complex [02:02:26] primary key hurts insert performance (doesn't matter if something is only written to via batch updates once a week) but also makes secondary indexes take up a lot more space which might matter here. a unique index shouldn't include a blob column (or anything that can be null) as i_ph_item does now. and if using a blob for ph_value could be avoided, that would be nice as well. [02:03:39] binasher: Not sure what kind of primary key you mean? [02:03:51] I don't have any other data [02:04:04] you mean a primary auto-increment integer? [02:04:17] there could even be a garbage auto increment pk that isn't typically used for anything [02:05:18] that will result in data on disk being sorted by insertion time, and will make using secondary keys cheaper [02:05:40] right [02:06:17] I'm not sure I can change the type of any columns (specifically the "blob" for _value). That comes from mw_user_properties.up_value which is also blob [02:06:47] though it only contains text afaik, I suppose I can merge null and empty string [02:07:09] so would a varchar or varbinary be better (I can't give a width though) [02:07:25] i wonder what the max length would be in that case [02:07:41] in mediawiki it doesn't have a max length [02:07:42] RECOVERY - SSH on cp1036 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [02:08:18] I can find out, but it wouldn't be stable, there can be some extreme values. one of them is fancysig which contains arbitrary wikitext. [02:08:29] binasher: but the index can be shorter than the value, right? [02:09:04] (I'll make it skip fields that take arbitrary input of course) [02:10:39] yep, and you should define an index length if adding one to a text or blob field [02:11:05] but i wouldn't recommend adding one to a unique field [02:11:19] Oh that's right. I shouldn't trim the unique one [02:13:48] binasher: btw, what's the deal with varchar and varbinary? I see both used in mediawiki kinda mixed, and then sometimes varchar() binary. [02:14:25] New review: Faidon; "Typo: "is none" is not valid Python, it needs to be "is None"." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/24303 [02:15:53] (btw, don't let me hold you up) - we'll talk tomorrow. [02:16:06] PROBLEM - NTP on cp1035 is CRITICAL: NTP CRITICAL: No response from NTP server [02:17:51] Krinkle: we set binary to the default char set in mysqld which forces char to binary regardless of the schema definition.. going to run now, tty tmw [02:18:17] ah, okay. That's what I suspected (since it has charset=binary).
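
A minimal sketch of the table shape binasher is suggesting above — a cheap auto-increment surrogate primary key instead of the wide unique index, no blob (or nullable) column in any unique key, varbinary with a prefix index instead of an indexed blob for the value, and the binary charset that mysqld forces anyway. All table and column names here are assumptions for illustration; the real ones live in the linked tables.sql.

```sql
-- Hypothetical reconstruction of the wmfPrefStats history table;
-- names are illustrative, not taken from the actual tables.sql.
CREATE TABLE pref_history (
    -- "garbage" auto-increment pk: rows land on disk in insertion
    -- (i.e. weekly batch) order, and secondary indexes stay small
    -- because each entry only carries this 4-byte id.
    ph_id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ph_wiki      VARBINARY(32)  NOT NULL,             -- e.g. 'enwiki'
    ph_property  VARBINARY(255) NOT NULL,             -- preference name
    ph_value     VARBINARY(255) NOT NULL DEFAULT '',  -- varbinary, not blob; NULL merged into ''
    ph_active    TINYINT UNSIGNED NOT NULL,           -- 1 = active users, 0 = all users
    ph_sum       INT UNSIGNED NOT NULL,               -- aggregated user count (sum=0 rows never written)
    ph_timestamp BINARY(14) NOT NULL,                 -- weekly snapshot, MediaWiki-style timestamp
    -- non-unique secondary key with a prefix length on the value
    -- column; per the advice above, never prefix-trim a UNIQUE key.
    KEY ph_lookup (ph_wiki, ph_property, ph_value(64), ph_timestamp)
) ENGINE=InnoDB DEFAULT CHARSET=binary;
```

At ~720k inserted rows per weekly batch this stays append-only, so the insert-performance cost of a wider key would be tolerable either way; the bigger win of the surrogate key is the smaller secondary indexes.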
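And a sketch of the query case binasher flags as pretty costly — usage of all preferences on all wikis over a month — against the hypothetical table above. It has to touch four or five full weekly snapshots, which is why he notes that on-disk ordering should follow the most common query case.

```sql
-- Usage of all preferences on all wikis between 2012-09-01 and 2012-10-01.
-- Scans every row of ~4-5 weekly snapshots (~720k rows each); with the
-- insertion-ordered surrogate pk, each snapshot is at least contiguous.
SELECT ph_property,
       ph_timestamp,
       SUM(ph_sum) AS users
FROM pref_history
WHERE ph_timestamp BETWEEN '20120901000000' AND '20121001000000'
  AND ph_active = 1
GROUP BY ph_property, ph_timestamp
ORDER BY ph_property, ph_timestamp;
```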
[02:19:06] PROBLEM - NTP on cp1033 is CRITICAL: NTP CRITICAL: No response from NTP server [02:30:39] PROBLEM - NTP on cp1036 is CRITICAL: NTP CRITICAL: No response from NTP server [02:40:24] RECOVERY - Puppet freshness on cadmium is OK: puppet ran at Wed Sep 26 02:40:10 UTC 2012 [03:00:57] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [03:00:57] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [03:29:00] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , plwiktionary (47132), zhwiki (24178) [03:30:12] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , plwiktionary (47178), zhwiki (26889) [03:50:52] PROBLEM - Puppet freshness on mw10 is CRITICAL: Puppet has not run in the last 10 hours [03:50:52] PROBLEM - Puppet freshness on mw1 is CRITICAL: Puppet has not run in the last 10 hours [03:50:52] PROBLEM - Puppet freshness on mw13 is CRITICAL: Puppet has not run in the last 10 hours [03:50:52] PROBLEM - Puppet freshness on mw16 is CRITICAL: Puppet has not run in the last 10 hours [03:50:52] PROBLEM - Puppet freshness on mw4 is CRITICAL: Puppet has not run in the last 10 hours [03:50:53] PROBLEM - Puppet freshness on mw14 is CRITICAL: Puppet has not run in the last 10 hours [03:50:53] PROBLEM - Puppet freshness on mw6 is CRITICAL: Puppet has not run in the last 10 hours [03:50:54] PROBLEM - Puppet freshness on mw7 is CRITICAL: Puppet has not run in the last 10 hours [03:50:54] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours [03:50:55] PROBLEM - Puppet freshness on mw11 is CRITICAL: Puppet has not run in the last 10 hours [03:50:55] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours [03:50:56] PROBLEM - Puppet freshness on mw12 is CRITICAL: Puppet has not run in the last 10 hours [03:50:56] PROBLEM - Puppet freshness on mw2 is CRITICAL: Puppet has not run in the last 10 hours [03:50:57] PROBLEM - Puppet freshness on mw3 is CRITICAL: Puppet has not run in the last 10 hours [03:50:57] PROBLEM - Puppet freshness on mw5 is CRITICAL: Puppet has not run in the last 10 hours [03:50:58] PROBLEM - Puppet freshness on mw9 is CRITICAL: Puppet has not run in the last 10 hours [04:07:13] PROBLEM - Squid on brewster is CRITICAL: Connection refused [06:23:13] New patchset: ArielGlenn; "file offset uses off_t so we can deal with huge files" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/25226 [06:38:34] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/25226 [06:43:34] RECOVERY - Squid on brewster is OK: TCP OK - 0.004 second response time on port 8080 [07:07:45] New review: Tim Starling; "Interesting TODO item. So the script as it stands pulls code from gerrit every 3 minutes and immedia..." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/22116 [07:24:42] hello [07:27:20] New patchset: Dereckson; "(bug 40521) Add "deletelogentry" right to "Eliminator" group on ja.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25231 [07:27:46] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [07:27:46] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [07:27:46] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [07:27:46] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [07:27:46] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [07:30:53] !g Iaee6bd06725aa8d95070af3306bdd35db2af6730 [07:30:53] https://gerrit.wikimedia.org/r/#q,Iaee6bd06725aa8d95070af3306bdd35db2af6730,n,z [07:36:38] New patchset: Hashar; "avoid duplicate apache2 definitions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25188 [07:37:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25188 [07:45:26] New patchset: Dereckson; "(bug 29902) Cleaning InitialiseSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/23059 [07:47:12] New patchset: Hashar; "avoid duplicate apache2 definitions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25188 [07:48:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25188 [08:01:12] New review: Hashar; "bah we need to change $name apparently." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/25188 [08:04:39] New patchset: Hashar; "avoid duplicate apache2 definitions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25188 [08:05:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25188 [08:09:40] Change merged: Hashar; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25188 [08:09:41] Change merged: Hashar; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24934 [08:09:41] Change merged: Hashar; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24878 [08:09:41] Change merged: Hashar; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25173 [08:09:41] Change merged: Hashar; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24213 [08:09:50] ahahha [08:09:53] such a hacker [08:22:02] stupid puppet [08:22:08] so I guess I have to resubmit all those changes now [08:23:41] New review: Hashar; "So that just does not work and I am abandoning that change." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25188 [08:28:25] New patchset: Hashar; "Redo SSL certs for gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25233 [08:29:26] New patchset: Hashar; "role::gerrit:labs::jenkins" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25234 [08:30:20] New review: Hashar; "resent as https://gerrit.wikimedia.org/r/25233" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24213 [08:30:21] New review: Hashar; "resent as https://gerrit.wikimedia.org/r/25234" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25173 [08:30:21] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25233 [08:30:21] New review: Hashar; "This was https://gerrit.wikimedia.org/r/#/c/24213/" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/25233 [08:30:21] New review: Hashar; "This was https://gerrit.wikimedia.org/r/#/c/25173/" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/25234 [08:30:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25234 [08:32:31] New patchset: Hashar; "import zuul module from OpenStack" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25235 [08:33:23] New patchset: Hashar; "zuul role for labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25236 [08:34:17] New review: Hashar; "resent as https://gerrit.wikimedia.org/r/#/c/25235/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24878 [08:34:17] New review: Hashar; "resent as https://gerrit.wikimedia.org/r/#/c/25236/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24934 [08:34:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25235 [08:34:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25236 [08:35:41] Change abandoned: Hashar; "Abandoning this, I cant wait for puppet so will setup Gerrit manually." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25234 [09:07:51] New review: Dereckson; "The shellpolicy issue is now resolved." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/23076 [09:27:47] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [10:15:49] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [10:35:55] PROBLEM - Puppet freshness on cp1030 is CRITICAL: Puppet has not run in the last 10 hours [10:54:46] PROBLEM - Puppet freshness on cp1029 is CRITICAL: Puppet has not run in the last 10 hours [11:07:40] PROBLEM - Puppet freshness on cp1032 is CRITICAL: Puppet has not run in the last 10 hours [11:09:28] PROBLEM - SSH on searchidx2 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:10:58] RECOVERY - SSH on searchidx2 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:23:43] PROBLEM - Puppet freshness on cp1034 is CRITICAL: Puppet has not run in the last 10 hours [11:23:43] PROBLEM - Puppet freshness on cp1035 is CRITICAL: Puppet has not run in the last 10 hours [11:34:04] New review: Demon; "And pushing it to the branch makes it *look* merged here. Looks like we're hitting https://code.goog..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24213 [11:39:36] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [11:42:36] PROBLEM - Puppet freshness on cp1033 is CRITICAL: Puppet has not run in the last 10 hours [11:52:39] PROBLEM - Puppet freshness on cp1036 is CRITICAL: Puppet has not run in the last 10 hours [12:32:18] bad cr2-eqiad ! i set your prefix limit for our new neighbor to 50k, yet it keeps thinking it's 10k and shutting the session down [12:32:29] why are you being dumb cr2 ? [12:32:32] they're confused [12:32:37] they want a support contract so badly [12:32:54] awww, it just wants to know it's loved :) [12:32:59] but it is! 
[12:33:13] i love you, cr2-eqiad [12:33:50] you were my first juniper device that i could really call a router [12:33:57] except your twin, cr1-eqiad [12:34:02] but you got most of the traffic anyway! [12:34:06] (don't be mad, cr1) [12:36:12] now cr1 is crying :( [12:36:23] argh [12:36:25] can't please everyone [12:36:27] thank you for coming out so strongly in support of use of open source software in your email mark [12:36:46] i want to see wikimedia doing the right thing [12:37:26] me too. your word has a lot of weight to it, it's good to see you using it [12:38:33] uh back in a few [12:46:00] RobH_: are you gonna be at eqiad today? [13:01:39] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [13:01:39] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [13:35:38] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:38] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [13:39:14] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:39:34] mark: yep, leaving to head down there in a few minutes [13:40:44] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:40:44] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.343 second response time [13:43:44] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.926 second response time [13:44:22] New patchset: Demon; "Refactor gerrit2 account stuff so we can reuse it on other hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24557 [13:44:38] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:23] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:45:24] New review: Demon; "PS2 is just a rebase." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/24557 [13:45:24] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24557 [13:45:59] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.229 second response time [13:46:53] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.554 second response time [13:52:08] PROBLEM - Puppet freshness on mw13 is CRITICAL: Puppet has not run in the last 10 hours [13:52:08] PROBLEM - Puppet freshness on mw10 is CRITICAL: Puppet has not run in the last 10 hours [13:52:08] PROBLEM - Puppet freshness on mw12 is CRITICAL: Puppet has not run in the last 10 hours [13:52:08] PROBLEM - Puppet freshness on mw14 is CRITICAL: Puppet has not run in the last 10 hours [13:52:08] PROBLEM - Puppet freshness on mw11 is CRITICAL: Puppet has not run in the last 10 hours [13:52:09] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours [13:52:09] PROBLEM - Puppet freshness on mw1 is CRITICAL: Puppet has not run in the last 10 hours [13:52:10] PROBLEM - Puppet freshness on mw5 is CRITICAL: Puppet has not run in the last 10 hours [13:52:10] PROBLEM - Puppet freshness on mw2 is CRITICAL: Puppet has not run in the last 10 hours [13:52:11] PROBLEM - Puppet freshness on mw3 is CRITICAL: Puppet has not run in the last 10 hours [13:52:11] PROBLEM - Puppet freshness on mw16 is CRITICAL: Puppet has not run in the last 10 hours [13:52:12] PROBLEM - Puppet freshness on mw4 is CRITICAL: Puppet has not run in the last 10 hours [13:52:12] PROBLEM - Puppet freshness on mw6 is CRITICAL: Puppet has not run in the last 10 hours [13:52:13] PROBLEM - Puppet freshness on mw9 is CRITICAL: Puppet has not run in the last 10 hours [13:52:13] PROBLEM - Puppet freshness on mw7 is CRITICAL: Puppet has not run in the last 10 hours [13:52:14] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours [13:56:05] !log ms-be8 powercycling [13:56:16] Logged the message, Master [13:56:20] PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:56:34] hello cmjohnson1 [13:56:38] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:56:50] good morning apergos [13:56:57] or afternoon [13:57:32] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100% [13:57:41] RECOVERY - Apache HTTP on mw29 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time [13:57:50] lemme know when/what you need me to do as far as these boxes [13:57:59] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.246 second response time [13:58:08] PROBLEM - Puppet freshness on locke is CRITICAL: Puppet has not run in the last 10 hours [13:58:26] RECOVERY - Host ms-be8 is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [13:59:05] apergos: paravoid says that he was still seeing disk slow to mount but I am not...can you poke around be8 [13:59:35] sure. is it all installed and stuff? [13:59:48] cause I'll just do some reboots and watch, check the logs, etc [13:59:58] ye [14:00:18] yes, should be able to reboot and watch [14:00:44] okey dokey [14:01:47] i am out of console [14:02:37] cool [14:02:52] jeremyb: ping [14:02:56] i updated http://meta.wikimedia.org/wiki/Www.wikidata.org_template [14:03:03] when does it get synced? [14:03:15] or anyone else know?
[14:07:49] New patchset: Hashar; "should enable AFTv5 100% on beta only" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25048 [14:07:57] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25048 [14:08:03] and here we go, reboot number one [14:09:05] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100% [14:10:44] RECOVERY - Host ms-be8 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [14:13:08] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:13:17] PROBLEM - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:13:26] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:14:47] RECOVERY - LVS HTTP IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 47095 bytes in 2.047 seconds [14:14:56] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.305 second response time [14:16:08] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:16:08] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.243 second response time [14:17:09] apergos: is the lvs stuff related to you or should i check stuff out [14:17:29] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39743 bytes in 6.087 seconds [14:18:02] not me [14:18:16] LeslieCarr: [14:18:32] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:18:38] okay [14:19:53] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.378 second response time [14:20:02] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:20:11] PROBLEM - Apache HTTP on srv258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:20:38] PROBLEM - Host ms-be8 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:23] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.050 second response time [14:21:41] RECOVERY - Apache HTTP on srv258 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.839 second response time [14:21:50] RECOVERY - Host ms-be8 is UP: PING OK - Packet loss = 0%, RTA = 0.37 ms [14:23:48] i don't see the problems that are complaining ... 
[14:25:17] PROBLEM - Apache HTTP on mw39 is CRITICAL: Connection refused [14:26:33] the uplink on spence is also okay … [14:26:47] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:26:47] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:28:07] hmm scrying the ganglia graphs isn't netting me anything [14:28:17] RECOVERY - Apache HTTP on mw52 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.857 second response time [14:28:17] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.727 second response time [14:29:56] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.820 second response time [14:30:41] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:30:50] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:32:02] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 3.640 second response time [14:32:54] checked syslog on a random one of the mw hosts that had complained, did not see anything interesting (eg not a pile of errors spewing etc) [14:33:19] it looks like about 2 hours ago there were a bunch of OOM errors but not right now AFAICT [14:33:50] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time [14:34:53] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100% [14:36:23] !g I4913fb6521f1f8bb98bb6a9480edd8e22eb2f7dd [14:36:23] https://gerrit.wikimedia.org/r/#q,I4913fb6521f1f8bb98bb6a9480edd8e22eb2f7dd,n,z [14:37:17] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [14:40:44] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:38] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:38] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:43:08] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.678 second response time [14:43:53] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 6.925 second response time [14:46:08] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [14:48:33] cmjohnson1: after three reboots, still not seeing the delays [14:48:36] for ms-be8 [14:49:03] cool [14:49:27] i know i rebooted about 4 times yesterday and once this morning and no delays either [14:50:02] what was done with this box, remind me? [14:50:03] apergos: should we pull the disks and put them in another machine and see how it does? [14:50:12] ms-be8...nothing just new sata drives [14:50:26] oh, sata instead of nl sata eh? [14:50:31] these ones are wd too [14:51:55] yep [14:52:00] well it's interesting [14:52:12] I'd like to see what happens with ms-be6 and the new controller also [14:52:15] are they sending that? [14:52:34] we had nl sas drives in there b4 [14:54:05] er nl sas, yeah [14:57:49] um so are they sending the new controller? or are they waiting for further word from us first? [14:59:10] the controller will be here today [14:59:16] sometime around 2p usually [14:59:19] sweet [14:59:38] 9 pm for me [14:59:40] what box do you want to put it in? [14:59:48] well I was thinking we could try ms-be6 [14:59:51] the sata drives appear to be working on 8...
[15:00:00] k....let's go w/6 [15:00:01] it's been through everything else and still has that one drive that takes its time [15:05:29] New patchset: Ottomata; "Removing unused misc/analytics.pp file." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25260 [15:06:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25260 [15:06:46] New patchset: Ottomata; "role/analytics.pp - including geoip" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25261 [15:07:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25261 [15:08:33] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25260 [15:09:03] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25261 [15:12:01] !log sq37 has fatal error...powering down to replace disk controller card [15:12:11] Logged the message, Master [15:13:35] * Damianz pokes LeslieCarr [15:13:43] LeslieCarr: I want to work on bast1001 network cable 100mbit issue [15:13:50] and you are the only user other than me on the host [15:13:53] eep! [15:13:56] ;] [15:14:00] sure, go for it [15:14:03] thanks for letting me know [15:15:43] LeslieCarr: Just wondering, are there any network restrictions between the toolserver and labs network wise? Seems you can't ssh from labs to ts, but can to the outside world (I know prod is restricted for ssh access from labs and wonder if ts is caught up in that). [15:16:15] Damianz: quite possibly, give me about 10 minutes to check it out [15:16:24] :) No rush. [15:19:40] !log troubleshooting bast1001 link speed per 3414 [15:19:51] Logged the message, RobH [15:20:14] how the hell does one check link speed in the command line without ethtool ;_; [15:21:40] !log bast1001 link speed fixed, done working on it [15:21:50] Logged the message, RobH [15:21:52] you could grep dmesg [15:22:35] indeed, that works [15:22:39] thx =] [15:23:06] not quite as stateless as an ethtool check but since we dont put ethtool in production servers, it works =] [15:23:18] ethtool is just generally awesome [15:33:03] RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 57.86 ms [15:40:42] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [15:46:24] RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [15:52:24] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [15:54:48] PROBLEM - Host payments2 is DOWN: PING CRITICAL - Packet loss = 100% [15:56:36] RECOVERY - Host payments2 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [15:57:13] !log dist-upgrade & reboot most payments boxes [15:57:23] Logged the message, Master [16:06:12] PROBLEM - Host payments3 is DOWN: PING CRITICAL - Packet loss = 100% [16:06:48] RECOVERY - Host payments3 is UP: PING WARNING - Packet loss = 64%, RTA = 1.39 ms [16:09:30] PROBLEM - NTP on analytics1007 is CRITICAL: NTP CRITICAL: No response from NTP server [16:12:39] PROBLEM - Host cp1037 is DOWN: PING CRITICAL - Packet loss = 100% [16:14:18] PROBLEM - Host cp1038 is DOWN: PING CRITICAL - Packet loss = 100% [16:14:36] PROBLEM - Host cp1039 is DOWN: PING CRITICAL - Packet loss = 100% [16:14:45] !log carbon rebooting for console redirection issues, any eqiad based installs will fail during this downtime [16:14:55] Logged the message, RobH [16:17:27] PROBLEM - Host carbon is DOWN: PING CRITICAL -
Packet loss = 100% [16:24:48] RECOVERY - Host carbon is UP: PING OK - Packet loss = 0%, RTA = 26.81 ms [16:43:05] New patchset: Aaron Schulz; "Added support for timeline/math extensions." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24303 [16:43:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24303 [16:45:14] New review: Brian Wolff; "Looks good to me to the best of my knowledge." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/24561 [16:47:55] New review: Aaron Schulz; "Random file deleted for no reason (maybe since I'm doing this from a win7 box?)." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/24303 [16:52:03] !log mc1002 offline, memory bad per rt3612, new memory will arrive tomorrow [16:52:13] Logged the message, RobH [17:26:53] New patchset: ArielGlenn; "Added support for timeline/math extensions." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24303 [17:27:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/24303 [17:28:47] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [17:28:47] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [17:28:47] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [17:28:47] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [17:28:47] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [18:01:43] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [18:06:30] !log mc1006 offline for memory troubleshooting per rt 3613 [18:06:41] Logged the message, RobH [18:09:38] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24818 [18:17:25] New patchset: Andrew Bogott; "Point out that this file shouldn't be used" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25276 [18:18:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25276 [18:19:38] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25276 [18:39:13] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [18:42:50] PROBLEM - SSH on ms-be6 is CRITICAL: Connection refused [18:43:56] !log mc1006 testing done, single bad dimm, support ticket filed rt3614 [18:44:06] Logged the message, RobH [18:44:15] !log correction rt 3613 [18:44:25] Logged the message, RobH [18:48:32] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [18:55:53] RECOVERY - SSH on ms-be6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [18:56:02] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [18:56:32] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24308 [19:20:46] New patchset: Pyoungmeister; "correcting mac for mw1103" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25282 [19:21:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25282 [19:21:50] PROBLEM - SSH on ms-be6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:39] Change abandoned: Pyoungmeister; "weird...." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/25282 [19:24:59] RECOVERY - SSH on ms-be6 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:25:18] * Damianz wonders if Pyoungmeister is feeling ok [19:26:25] that commit couldn't be merged because of a dependency that was totally blank [19:26:42] don't feel like figurin' for a change of 10 characters.... [19:27:08] heh [19:27:22] New patchset: Pyoungmeister; "correcting mac for mw1103" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25283 [19:28:17] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [19:28:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25283 [19:28:32] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25283 [19:45:41] RECOVERY - Puppet freshness on ms-be6 is OK: puppet ran at Wed Sep 26 19:45:34 UTC 2012 [19:47:11] RECOVERY - swift-account-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [19:47:11] RECOVERY - swift-object-auditor on ms-be6 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [19:47:20] RECOVERY - swift-container-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [19:47:29] RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [19:47:47] RECOVERY - swift-account-reaper on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [19:47:56] RECOVERY - swift-object-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [19:47:56] RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [19:47:56] RECOVERY - swift-container-auditor on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:48:14] RECOVERY - swift-container-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [19:48:14] RECOVERY - swift-container-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [19:51:05] PROBLEM - Host ms-be6 is DOWN: PING CRITICAL - Packet loss = 100% [19:52:35] RECOVERY - Host ms-be6 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [19:52:53] RECOVERY - swift-account-replicator on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [19:53:11] RECOVERY - swift-object-updater on ms-be6 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [20:02:11] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:02:56] PROBLEM - SSH on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:47] RECOVERY - SSH on fenari is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [20:06:41] RECOVERY - HTTP on fenari is OK: HTTP OK HTTP/1.1 200 OK - 4416 bytes in 0.003 seconds [20:09:43] why doesn't MegaCli64 report any adapters? 
[20:17:20] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [20:28:15] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24998 [20:28:45] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24440 [20:30:12] !log cp1031 will remain offline until replacement parts arrive tomorrow rt3614 [20:30:23] Logged the message, RobH [20:30:23] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25233 [20:31:32] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24557 [20:35:09] * apergos updates the tickets and calls it quits for the day [20:36:41] PROBLEM - Puppet freshness on cp1030 is CRITICAL: Puppet has not run in the last 10 hours [20:39:43] !log mc1009, 1010 drac fixed [20:39:53] Logged the message, RobH [20:39:54] RobH: woo! [20:39:55] thanks [20:40:09] np, trying to get 1011 fixed before i have to prepare for this call [20:40:26] the reboot takes a few minutes, then you have 5 seconds to enter a key combo =P [20:42:29] !log mc1011 drac fixed [20:42:33] woo! [20:42:38] then I get to do shit by hand on it too! [20:42:38] Logged the message, RobH [20:43:55] RobH: can you do mw1014 [20:44:03] so that I can test the cable thingy? [20:53:38] PROBLEM - Swift HTTP on ms-fe1 is CRITICAL: Connection refused [20:55:35] PROBLEM - Puppet freshness on cp1029 is CRITICAL: Puppet has not run in the last 10 hours [20:59:20] RobH: mc1009 can't see its 10g nic :/ [20:59:32] .... [20:59:43] sigh...i flashed them alllllll [20:59:46] whyyyyyyyy [20:59:51] WWWWWWWWWRRRRRRRRRRRRRYYYYYYYYYYYYY [20:59:56] *headdesk* [20:59:58] dunno, lemme check mc1010 as well [21:01:44] notpeter: lemme know, i am headed back into dc floor to work on the rest of them. [21:01:52] kk [21:02:04] RobH: let's try to focus on the one for the cable testing [21:02:11] so that we can definitely get that data [21:07:06] RobH: yeah, mc1010 also not able to see its 10g nic for pxe [21:08:03] damn it. [21:08:38] PROBLEM - Puppet freshness on cp1032 is CRITICAL: Puppet has not run in the last 10 hours [21:09:09] ok, i will finish fixing drac details then pull the flash media for the pxe 10g and check them out [21:09:26] sweet. thanks! [21:09:31] sorry that this is such a pain.... [21:09:32] i dunno why they arent working i flashed those the same as the others [21:09:37] since i could do that without dca [21:09:41] meh, not yer fault [21:09:46] yeah, it's weird.... [21:09:50] more dell servers not working [21:09:55] this is my not shocked face. [21:10:00] lulz [21:11:19] New patchset: Dzahn; "planet: add translations of the index.html, turn index.html.tmpl.erb into a puppet template. add a definition to create planet theme. add hash of translations to role class,..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25428 [21:12:13] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/25428 [21:15:23] New patchset: Dzahn; "planet: add translations of the index.html, turn index.html.tmpl.erb into a puppet template. add a definition to create planet theme. add hash of translations to role class,..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25428 [21:16:14] New review: gerrit2; "Change did not pass lint check. 
You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/25428 [21:24:41] PROBLEM - Puppet freshness on cp1034 is CRITICAL: Puppet has not run in the last 10 hours [21:24:41] PROBLEM - Puppet freshness on cp1035 is CRITICAL: Puppet has not run in the last 10 hours [21:26:28] notpeter: did you verify the boot order or disable pxe on nic1-4? [21:26:38] disabled [21:27:03] cmjohnson1: we need you on 2002 [21:27:11] k [21:29:47] notpeter: Im checking out the bios flash on 1009 now [21:30:17] RobH: sweet! thanks [21:34:03] notpeter: these are flashed [21:34:10] shows the same setting in the intel bios as the others. [21:34:19] hrm...... [21:34:30] cuz i go to enable flash [21:34:36] and its unsupported, since its already enabled [21:34:43] (i saw this when i flashed them awhile ago) [21:34:46] and shows PXE supported [21:34:48] like it should. [21:35:05] but i have not compared bios settings [21:35:14] can i reboot any of the 1001-1008? [21:35:20] any, sure [21:36:50] rebooting 1008 as well [21:36:55] kk [21:37:20] this is strange that its not working. [21:37:27] agreed [21:37:59] are all the others working? robh do you have link lights? [21:38:24] i had a bad sfp at the switch ...and had no link lights [21:38:45] cmjohnson1: nah, it's weird. it's not like it doesn't get link, its like it thinks there are absolutely no network devices [21:39:48] hrmm...is the mac correct? [21:40:35] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [21:40:35] can't get mac because can't hit ^-s [21:40:35] it doesn't show using it in post [21:40:45] notpeter: you saw the option to enter intel bios? [21:40:49] cuz im seeing it [21:40:56] nope [21:41:05] oh what box? [21:41:08] *on [21:41:09] 1009 [21:41:14] im on it now confirming [21:41:37] hopping on mc1010 [21:42:55] New patchset: Andrew Bogott; "Fix the public_ip value in instance status." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25435 [21:43:02] notpeter: so im in the bios on 1009 [21:43:09] the intel bios [21:43:12] not system bios. [21:43:17] it says its setup for pxe [21:43:35] huh [21:43:52] New patchset: Dzahn; "planet: add translations of the index.html, turn index.html.tmpl.erb into a puppet template. add a definition to create planet theme. add hash of translations to role class,..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25428 [21:43:52] gerrit-wm: come on [21:43:59] PROBLEM - Puppet freshness on cp1033 is CRITICAL: Puppet has not run in the last 10 hours [21:44:14] trying one more thing [21:44:22] now that i can see it telling it pxe again [21:44:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25435 [21:44:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25428 [21:44:43] though i didnt change any values, other than confirm the intel firmware via the bootutil cd [21:45:22] what do you hit to get into it? [21:45:22] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25428 [21:45:31] well, i loaded bootutil off cd [21:45:38] but the normal intel bios is ctrl+s after the SAS display [21:45:44] yeah [21:45:50] I never saw the ^-s option [21:45:54] I've also sat there trying to hit it [21:45:59] w/o success....
i dont recall if i saw it on 10, will check it in a moment [21:46:29] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25435 [21:46:32] on 1010 [21:46:44] but I also didn't see it on 1009 [21:46:59] heh, i have 1009 'init and establishing link' [21:47:02] but media fail. [21:47:35] RECOVERY - Swift HTTP on ms-fe1 is OK: HTTP OK HTTP/1.1 200 OK - 397 bytes in 0.010 seconds [21:47:48] notpeter: so the links have no lights [21:47:55] i dont think that mark or leslie set these up. [21:48:18] notpeter: lemme take over 1010 ok? [21:48:21] just to confirm [21:48:24] sure [21:49:40] so if i can see the attempt on 1010 without entering bootutil [21:49:46] then it's just the switch that needs to be configured [21:49:53] hhhmmm, ok [21:49:54] as the bulk are identical cables to 1001-1008 [21:50:07] but the 1001-1008 have link lights [21:50:10] 1009+ no lights [21:50:18] seems like a network issue [21:52:06] interesting, i dont see the ctrl+s on 1010 [21:52:09] lemme redo the flash and see [21:52:26] seems 1009 has the issue you reported fixed by flash, but still not working since it has no network link light [21:52:32] will see if this fixes 1010 shortly [21:53:23] i miss having a network admin around in both my morning and evening. [21:53:26] =P [21:53:53] PROBLEM - Puppet freshness on cp1036 is CRITICAL: Puppet has not run in the last 10 hours [21:53:58] Maybe they don't feel like peering with you anymore :D [21:53:58] MaxSem: Can you fix your umask on fenari please? [21:54:17] there are various paths of yours, chmod 644 and 755, that only roots can modify [21:54:24] (trying to update docroot/bits/favicons) [21:54:34] drwxr-xr-x 3 maxsem wikidev 4.0K Sep 12 22:57 favicon/ [21:54:45] -rw-r--r-- 1 maxsem wikidev 1.2K Sep 12 22:57 piece.ico [21:54:57] maxsem@fenari:~$ cat ~/.bashrc [21:54:57] umask 002 [21:55:22] old files I suppose [21:55:27] fix em anyway [21:55:32] maxsem@fenari:~$ umask [21:55:32] 0022 [21:55:33] its not retroactive [21:55:45] yes the umask is right, but these files were created before you did that [21:57:31] I just tried running find -user maxsem -exec chmod g+w {} \; [21:57:31] notpeter: ok, confirmed [21:57:37] seems i must have done this batch only half right [21:57:43] cuz it doesnt need the flash enabled [21:57:49] but needs pxe set to the nic so it shows the ctrl+s [21:57:57] so 1009 and 1010 are fixed, i will fix the rest now [21:57:58] ...but it seems I can't chmod my own files :P [21:58:11] RobH: I know! leslie should come back to the US [21:58:20] RobH: thanks! [21:58:21] MaxSem: the group bit is something else [21:59:04] find . -type d -exec chmod 755 {} \; [21:59:09] notpeter: ok, try 1009 now [21:59:11] i have link lights [21:59:12] find .
-type f -exec chmod 644 {} \; [21:59:16] i just reseated them all, and links away [21:59:25] so 1009 and 1010 can be tested, fixing rest [21:59:26] (on /h/w/c/docroot) [21:59:36] eh 775 and 664 of course [22:00:59] mutante: and you, /h/w/c/docroot/www.wikidata.org [22:01:13] git pull is still stuck on fenari because of a chmod issue there [22:01:55] docroot/www.wikidata.org/ [22:01:55] -rw-r--r-- 1 dzahn wikidev 318 Sep 25 20:45 favicon.ico [22:01:55] -rw-r--r-- 1 dzahn wikidev 271 Jul 4 13:11 robots.txt [22:02:24] Krinkle, done [22:03:45] thx [22:17:39] Here's the full list, if you see any others of yours, please fix those too :) [22:17:39] https://office.wikimedia.org/wiki/User:Ttijhof/dump#perm_issues_in_.2Fh.2Fw.2Fc.2Fdocroot [22:18:30] (pipe the command to grep username for a live stat) [22:19:19] Reedy: (most of you) [22:21:38] mine are fixed [22:21:57] And most are now non-existent brion [22:26:00] ran it on common now instead of common/docroot, dumped on ^ [22:26:10] yeah yours are all fixed now [22:26:31] many preilly, asher and dzahn [22:31:43] binasher: just fyi, I'm here now (ready whenever you are regarding wmfPrefStats) - updated files in https://github.com/Krinkle/ts-krinkle-wmfPrefStats [22:41:45] Reedy: so, how are you going to commit this? [22:41:55] I wouldn't know [22:42:10] or can it just work from fenari? [22:42:18] commit what? [22:42:25] those perm changes [22:42:33] oh [22:42:34] they show up in git status [22:43:29] done [22:50:50] New patchset: Krinkle; "killscripts.php: Add in version control, used for the italian blacklist" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25448 [22:51:17] New review: Krinkle; "Was already on fenari, adding to version control." [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/25448 [22:51:17] Change merged: Krinkle; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25448 [22:54:09] New patchset: Krinkle; "killscripts.php: Old file, no longer referenced anywhere" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25450 [22:54:23] Change merged: Krinkle; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25450 [23:02:35] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [23:02:35] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [23:11:47] New review: Krinkle; "What're we waiting for?" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/23425 [23:16:17] New patchset: Demon; "Fix I696578f4: Don't need to pass ssh_key to gerrit::jetty anymore" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25453 [23:17:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25453 [23:18:10] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25453 [23:37:27] New patchset: Dzahn; "planet: now that planet_languages is a hash with the values for translation, need to split out just the keys and use them to iterate through" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25458 [23:38:28] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25458 [23:38:45] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25458 [23:39:05] New patchset: Pyoungmeister; "decom db17" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25459 [23:39:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25459 [23:40:56] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25459 [23:53:35] PROBLEM - Puppet freshness on mw1 is CRITICAL: Puppet has not run in the last 10 hours [23:53:35] PROBLEM - Puppet freshness on mw10 is CRITICAL: Puppet has not run in the last 10 hours [23:53:35] PROBLEM - Puppet freshness on mw11 is CRITICAL: Puppet has not run in the last 10 hours [23:53:35] PROBLEM - Puppet freshness on mw12 is CRITICAL: Puppet has not run in the last 10 hours [23:53:35] PROBLEM - Puppet freshness on mw14 is CRITICAL: Puppet has not run in the last 10 hours [23:53:36] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours [23:53:36] PROBLEM - Puppet freshness on mw13 is CRITICAL: Puppet has not run in the last 10 hours [23:53:37] PROBLEM - Puppet freshness on mw2 is CRITICAL: Puppet has not run in the last 10 hours [23:53:37] PROBLEM - Puppet freshness on mw16 is CRITICAL: Puppet has not run in the last 10 hours [23:53:38] PROBLEM - Puppet freshness on mw4 is CRITICAL: Puppet has not run in the last 10 hours [23:53:38] PROBLEM - Puppet freshness on mw5 is CRITICAL: Puppet has not run in the last 10 hours [23:53:39] PROBLEM - Puppet freshness on mw8 is CRITICAL: Puppet has not run in the last 10 hours [23:53:39] PROBLEM - Puppet freshness on mw7 is CRITICAL: Puppet has not run in the last 10 hours [23:53:40] PROBLEM - Puppet freshness on mw6 is CRITICAL: Puppet has not run in the last 10 hours [23:53:40] PROBLEM - Puppet freshness on mw9 is CRITICAL: Puppet has not run in the last 10 hours [23:53:41] PROBLEM - Puppet freshness on mw3 is CRITICAL: Puppet has not run in the last 10 hours [23:59:35] PROBLEM - Puppet freshness on locke is CRITICAL: Puppet has not run in the last 10 hours