[00:00:17] *affect [00:01:32] preilly: got you https://gerrit.wikimedia.org/r/46472 as well :-] [00:01:32] maybe the claim TTL was exceeded [00:01:40] while the job processes were still running [00:01:55] preilly: the linting tool "pylint" reports a bunch of minor issues, might be worth a look at [00:02:03] how many emails are in one jobs? heh [00:02:09] * AaronSchulz looks [00:02:29] so maybe those emails were actually sent, some multiple times [00:02:33] preilly: and you might want to enforce fast-forward merge on sartoris repo to prevents the mad merges Gerrit creats [00:02:43] 3 times perhaps [00:03:17] TimStarling: if that is true it would be in runJobs.log [00:03:23] since it gives the run time [00:03:42] * AaronSchulz looks [00:05:55] hashar: you want Fast Forward Only? [00:06:01] rfaulkner: so stat1001 wont run puppet and update with your name [00:06:10] until ottomata fixes whatever this duplicate puppet webserver thing is [00:06:21] (im advised ottomata knows of this issue?) [00:06:31] ottomata: err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: Apache_site[000_default] is already defined in file /var/lib/git/operations/puppet/manifests/webserver.pp at line 320; cannot redefine at /var/lib/git/operations/puppet/manifests/webserver.pp:320 on node stat1001.wikimedia.org [00:06:33] that you? [00:06:46] RobH: that's easy to fix you just pass --incredible-faulk [00:06:55] haha [00:07:10] it's an undocumented option [00:07:54] preilly: I enforce fast forward on the integration/* repository, but that is just to have a clean history :-] [00:08:14] preilly: you might want to have long topic branches that get merged form time to time [00:08:36] hashar: yeah [00:12:34] preilly: bah I created the ohloh project but can't figure out how to add you as a project manager :D [00:12:35] https://www.ohloh.net/p/wikimedia-sartoris [00:12:43] TimStarling: not seeing enotifNotify jobs that took that long around that day in the logs [00:14:26] hashar: this is my account: https://www.ohloh.net/accounts/preillyme [00:14:48] preilly: maybe you have to apply for the manager position [00:15:34] well probably not a big issue [00:15:40] hashar: https://www.ohloh.net/p/wikimedia-sartoris/managers (pending) [00:15:55] preilly: you are such a hacker [00:16:11] granted! [00:16:11] :-] [00:16:13] hashar: thanks [00:16:23] * preilly isn't sure if hashar just insulted him [00:16:24] which server is the otrs database [00:16:32] this way if I die while sleeping, someone can take over easily \O/ [00:16:33] preilly: always assume yes and get mad. [00:16:50] RobH: re duplicate 000-default, yeah i know about that one, I can figure it out, [00:17:08] ottomata: cool, its affecting stat1001 and rfaulkner's access to it as well [00:17:13] just fyi [00:17:18] ok [00:17:21] RobH: Okay [00:17:23] i'll get to both of those tomorrow [00:17:28] coolness, thanks! [00:17:38] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46220 [00:17:51] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 221 seconds [00:17:55] RobH: you do know that I'm now going to call you, "Runkle" right? 
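Tim's hypothesis above is that the enotifNotify jobs ran longer than the queue's claim TTL, so the claim expired while the first runner was still working and another runner picked the same job up again, sending the same mails two or three times. A minimal sketch of how a claim-TTL queue produces exactly that duplicate execution; the class and the numbers are illustrative, not MediaWiki's actual JobQueue code:

class ClaimTtlQueue:
    """Toy job queue: a claimed job becomes claimable again once its claim
    is older than claim_ttl, even if the first runner has not finished --
    which is how one slow enotifNotify job can run more than once."""

    def __init__(self, claim_ttl):
        self.claim_ttl = claim_ttl
        self.jobs = []  # each entry: {"id": ..., "claimed_at": None or timestamp}

    def push(self, job_id):
        self.jobs.append({"id": job_id, "claimed_at": None})

    def pop(self, now):
        for job in self.jobs:
            claimed = job["claimed_at"]
            # Unclaimed, or the claim outlived its TTL: hand the job out (again).
            if claimed is None or now - claimed > self.claim_ttl:
                job["claimed_at"] = now
                return job["id"]
        return None

    def ack(self, job_id):
        # Only an explicit ack removes the job; a slow runner acks too late.
        self.jobs = [j for j in self.jobs if j["id"] != job_id]

queue = ClaimTtlQueue(claim_ttl=120)   # 120s claim TTL, purely illustrative
queue.push("enotifNotify-1234")

first = queue.pop(now=0)      # runner A claims the job at t=0
second = queue.pop(now=200)   # runner B re-claims it at t=200, after the TTL expired
print(first, second)          # both runners now send the same notifications

The run times recorded in runJobs.log are what would confirm or rule this out, which is what AaronSchulz goes looking for above.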
[00:18:10] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 191 seconds [00:18:18] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46221 [00:18:40] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 267 seconds [00:19:16] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 253 seconds [00:20:10] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [00:20:26] RobH: are you working on an RT ticket with that OTRS question? :D maybe? [00:20:28] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 14 seconds [00:20:30] 4430 [00:20:34] probably not :P [00:20:51] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [00:20:55] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [00:21:41] RD: the run the query and attach output? [00:21:45] i wish fundraising would alter table engine=innodb and run non-blocking db backups [00:22:15] RECOVERY - Solr on solr3 is OK: HTTP OK HTTP/1.1 200 OK - 3078 bytes in 0.008 seconds [00:22:16] RECOVERY - Solr on solr1001 is OK: HTTP OK HTTP/1.1 200 OK - 3078 bytes in 0.054 seconds [00:22:34] Yeah, that's just the latest other thing I heard about. I'm just 'interested' in the subject [00:22:42] RECOVERY - Solr on solr1 is OK: HTTP OK HTTP/1.1 200 OK - 3078 bytes in 0.007 seconds [00:22:43] RECOVERY - Solr on solr2 is OK: HTTP OK HTTP/1.1 200 OK - 3078 bytes in 0.015 seconds [00:23:09] RECOVERY - Solr on solr1003 is OK: HTTP OK HTTP/1.1 200 OK - 3078 bytes in 0.057 seconds [00:23:36] RECOVERY - Solr on solr1002 is OK: HTTP OK HTTP/1.1 200 OK - 3078 bytes in 0.058 seconds [00:25:58] Reedy: https://rt.wikimedia.org/Ticket/Display.html?id=4430 is resolved, attached output of otrs closed account query [00:27:19] Great, thankyou :) [00:27:50] glad to help [00:35:05] err: /Stage[main]/Squid/File[/etc/squid/squid.conf]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///volatile/squid/squid.conf/knsq18.esams.wikimedia.org at /var/lib/git/operations/puppet/manifests/squid.pp:44 [00:35:07] sigh. [00:36:35] !log crazy memory and cpu spike on knsq18 for no apparent reason, restarting process (memory leak?) [00:36:46] Logged the message, RobH [00:44:40] !log squid backend on knsq48 rebuilding [00:44:50] Logged the message, RobH [00:44:50] !log typo, knsq18 [00:45:01] Logged the message, RobH [01:17:03] TimStarling: can you look at https://gerrit.wikimedia.org/r/#/c/45510/ (I'm not 100% sure on it) [01:20:27] 1024 seems quite low [01:23:24] oh, yeah I forgot to bump that up [01:23:44] so the whole think should be < 1000kb, so 1kb is too low [01:23:57] maybe 100kb? [01:24:15] yeah, that should be enough [01:24:23] 1KB would hit JPEGs with EXIF metadata [01:24:27] i.e. pretty much every image [01:29:42] AaronSchulz: http://paste.tstarling.com/p/RCQcOb.html [01:30:13] yeak, 100k looks good [01:30:28] I like your sha1 sampling trick [01:52:23] TimStarling: it may help to backport that after all wikis are on wmf8 to avoid dueling caches when wmf9 is rolled out (e.g. 
increased invalidations) [01:53:02] crosswiki memcached use and het deploy and key name or format changes can be tricky [01:53:17] yeah [01:53:25] I think it's what what was confusing the job queue last week [01:53:44] * AaronSchulz will removes his job hack from puppet when wmf8 is on everything [01:53:54] make sure Reedy knows [01:54:02] Reedy: ^ [01:54:06] TimStarling: there ;p [01:55:01] !log adding ldap entries for combined sysadmin/netadmin role, projectadmin [01:55:14] Logged the message, Master [01:59:39] TimStarling: so asher and I have talked about possibly moving the job queue to redis. Would it be worth talking about that some time? [02:00:29] * AaronSchulz would like to get that load of the primary dbs [02:00:32] *off [02:02:59] You've done what now? [02:03:17] Reedy: just a few landmines, nothing you can't handle [02:05:30] Reedy: I think I scared tim off [02:10:09] * AaronSchulz will pester tim later [02:17:41] !log deleting sysadmin and netadmin roles from ldap [02:17:54] Logged the message, Master [02:18:00] \o/ [02:18:02] LeslieCarr: hey, back [02:23:16] PROBLEM - Puppet freshness on db1031 is CRITICAL: Puppet has not run in the last 10 hours [02:24:28] PROBLEM - Puppet freshness on db1037 is CRITICAL: Puppet has not run in the last 10 hours [02:27:28] PROBLEM - Puppet freshness on db1012 is CRITICAL: Puppet has not run in the last 10 hours [02:28:01] paravoid: is the \o/ for me, or LeslieCarr ? [02:28:02] heh [02:28:07] for you [02:28:11] heh [02:28:22] yeah. this should make things a little easier [02:28:31] yeah, I hadn't thought it before [02:28:31] PROBLEM - Puppet freshness on db1014 is CRITICAL: Puppet has not run in the last 10 hours [02:28:32] PROBLEM - Puppet freshness on db1015 is CRITICAL: Puppet has not run in the last 10 hours [02:28:44] but when you threw the idea it totally made sense to me [02:29:05] maybe it'll again make sense at some point in our distant future [02:29:34] PROBLEM - Puppet freshness on db1023 is CRITICAL: Puppet has not run in the last 10 hours [02:30:05] yeah [02:30:20] well, when labs first started, those roles were hardcoded in labs [02:30:21] err [02:30:22] in nova [02:30:28] PROBLEM - Puppet freshness on db1030 is CRITICAL: Puppet has not run in the last 10 hours [02:30:38] they turned that into a policy based system in essex [02:30:43] !log LocalisationUpdate completed (1.21wmf8) at Tue Jan 29 02:30:42 UTC 2013 [02:30:46] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 188 seconds [02:30:54] Logged the message, Master [02:30:54] or possibly diablo [02:31:01] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 191 seconds [02:31:04] 2013-01-29 02:30:54.611314 mon.0 [INF] pgmap v2048808: 16952 pgs: 16911 active+clean, 12 active+remapped+wait_backfill, 15 active+remapped+backfilling, 3 active+degraded+backfilling, 2 active+clean+scrubbing, 1 active+degraded+remapped+backfilling, 8 active+clean+scrubbing+deep; 25008 GB data, 53347 GB used, 186 TB / 238 TB avail; 71454/122575532 degraded (0.058%) [02:31:04] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 192 seconds [02:31:11] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 193 seconds [02:31:13] * paravoid waits for that go to 0% [02:41:34] PROBLEM - Puppet freshness on db1029 is CRITICAL: Puppet has not run in the last 10 hours [02:44:34] PROBLEM - Puppet freshness on db1045 is CRITICAL: Puppet has not run in the last 10 hours [02:44:34] PROBLEM - Puppet freshness on db1044 
is CRITICAL: Puppet has not run in the last 10 hours [02:48:28] PROBLEM - Puppet freshness on db1016 is CRITICAL: Puppet has not run in the last 10 hours [02:55:45] !log LocalisationUpdate completed (1.21wmf7) at Tue Jan 29 02:55:44 UTC 2013 [02:55:55] Logged the message, Master [02:57:19] paravoid: are you logged on to analytics1001 ? [02:57:24] I am [02:57:27] I stopped apache [02:57:32] just mailed abou that. [02:57:34] paravoid: Okay thanks [03:12:05] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [03:12:24] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [03:12:55] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [03:13:04] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [03:13:24] PROBLEM - Parsoid Varnish on celsus is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:41:24] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:41:33] that'd be me [03:41:36] sorry for paging [03:42:15] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [03:42:16] paravoid: how are the bugs going? [03:42:19] I'm trying to load the ceph cluster [03:42:34] good actually [03:43:06] finally got rid of h310 boxes [03:51:31] Aaron|home: quick silly question [03:51:55] Aaron|home: if I request a thumb size from thumb.php that exists in swift, will it be regenerated or served from swift? [03:52:17] it won't be regenerated [03:52:24] I guessed as much [03:52:27] what happens with multiwrite? [03:52:38] does it check on all backend stores? [03:52:46] no, just the main one [03:53:16] okay [03:53:29] I'm wondering if we can abuse all that to gradually fill up ceph with thumbs [03:53:33] instead of one large sync process [03:55:11] Ryan_Lane: salt-minion dies and dmesg complains every 30' [03:55:28] logs are filled with salt-minion terminated with status 42 [04:00:28] Aaron|home: what's your take? [04:01:21] how long would this be running? [04:01:46] with the current avg rate > 1 month [04:01:56] right [04:02:09] I'll try to optimize it [04:02:35] I thought we weren't going to use multiwrite? [04:03:06] but copying them from swift on ceph misses for a month or two while serving originals and ceph thumb hits [04:03:13] might be a nice alternative [04:03:24] or very ugly [04:03:44] erm what do you mean? [04:03:57] oh for thumbs [04:04:08] yeah but that assumes we have some other way of doing this... [04:05:46] anyway, I have to add 5 more boxes to the cluster (up from 7 right now) [04:05:52] and find ways to optimize this [04:06:04] i.e. find the bottleneck first :) [04:06:27] I guess we can use multiwrite, *shrug* and call resyncFile() via some hook in thumb.php [04:07:02] *resyncFiles() [04:07:11] nod [04:07:17] that fill ceph with anything in swift [04:07:23] (for the given thumbnail) [04:07:29] nod [04:07:37] I wonder what that will do to user-facing latency... [04:08:14] so the ceph cluster will be a single-dc cluster in ashburn? [04:08:38] yes [04:08:49] TimStarling: so asher and I have talked about possibly moving the job queue to redis. Would it be worth talking about that some time? [04:08:51] * Aaron|home is just being sure [04:08:57] yes, that would be worth talking about [04:09:03] paravoid: what does that give us over swift atm? 
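The plan Aaron and paravoid converge on here -- keep multiwrite but hook a resyncFiles()-style copy into thumb.php -- means every thumbnail request lazily copies the object from Swift into Ceph, so the new cluster fills itself over a month or two of normal traffic instead of via one huge sync. A rough sketch of that lazy-backfill pattern, using made-up store objects rather than MediaWiki's real FileBackendMultiWrite/thumb.php code:

class LazyBackfillBackend:
    """Reads are answered from the primary store only (as multiwrite does);
    whatever is read gets copied into the secondary store if missing, so the
    secondary gradually accumulates every thumbnail that is actually requested."""

    def __init__(self, primary, secondary):
        self.primary = primary      # stand-in for Swift
        self.secondary = secondary  # stand-in for Ceph/radosgw

    def get_thumb(self, path):
        data = self.primary.get(path)
        if data is None:
            return None             # a real 404 would go to the image scalers instead
        if not self.secondary.exists(path):
            # the resyncFiles()-style step, triggered from the request path
            self.secondary.put(path, data)
        return data

class DictStore(dict):
    # Tiny in-memory stand-in so the sketch is runnable on its own.
    def get(self, path):        return super().get(path)
    def exists(self, path):     return path in self
    def put(self, path, data):  self[path] = data

swift, ceph = DictStore(), DictStore()
swift.put("thumb/a/ab/Foo.jpg/120px-Foo.jpg", b"jpeg bytes")

backend = LazyBackfillBackend(swift, ceph)
backend.get_thumb("thumb/a/ab/Foo.jpg/120px-Foo.jpg")
print(ceph.exists("thumb/a/ab/Foo.jpg/120px-Foo.jpg"))   # True: copied on first request

The open question raised in the log -- what this does to user-facing latency -- is the cost of doing that copy synchronously in the request path.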
[04:09:17] sage said they're planning for async replication in the radosgw layer by the summer [04:09:33] TimStarling: maybe we can bring back the weekly calls ;) [04:09:35] to be implemented until then [04:09:54] async on the rados layer is a long-term goal of theirs [04:09:55] so any specific month? [04:09:59] not going to come anytime soon [04:10:18] does it seem to have lower latency? [04:10:27] ceph, that is [04:10:31] "this spring/summer" [04:11:15] 03:42 there will eventually be dr in rados itself, but it's a harder problem than rgw, and won't look quite the same. [04:11:31] lower latency than what? [04:11:32] swift? [04:11:39] yes [04:11:46] haven't really benchmarked [04:12:15] I'd guess writes are better because of journaling in SSDs [04:12:28] so subsequently reads are also faster because of less disk seeks [04:12:34] but that's just guesses so far [04:12:40] does rgw do connection pooling? [04:12:50] of what? [04:12:58] connections to osds [04:14:00] it keeps connections with all osds it communicates with [04:14:10] TimStarling: you can look at https://gerrit.wikimedia.org/r/#/c/39174/ see what I have in mind [04:17:21] paravoid: so I was thinking about using ffmpeg + swift tempurls for oggs in addition to webm (as now), which reminded me of the double GET bug in swift for RANGE requests [04:18:55] it's awkward since upgraded swift is low-priority now but it will be a while before we are on ceph [04:19:13] upgraded to what? [04:19:19] 1.7.5? [04:20:02] yep [04:20:22] I'm not sure if I'd do that [04:20:34] well yeah it's kind of tedious atm [04:20:35] it's an interim release, we currently run off openstack releases [04:20:46] with canonical's packages and everything [04:21:04] I guess it wouldn't be much work... [04:21:15] https://bugs.launchpad.net/swift/+bug/1065869 [04:21:24] really quite an obnoxious bug [04:21:36] I think we should wait until all swift hardware swaps are done though... [04:21:46] what's left? [04:21:52] *sigh* [04:21:59] a few more boxes [04:22:00] plus [04:22:08] replacing all the ones we just replaced [04:22:16] because of h310->h710 swaps [04:22:22] gaaah [04:22:28] did I *sigh*? [04:22:44] yeah, fun [04:22:50] too much fighting for each inch [04:24:12] huh [04:24:19] I'm looking at the patch for that bug [04:24:25] it's three lines [04:24:29] applies cleanly on 1.7.4 [04:24:38] maybe that's a nice option in the meantime [04:24:52] five lines maybe [04:24:54] https://review.openstack.org/#/c/14497/3/swift/proxy/controllers/obj.py [04:25:44] oh and there's 1.7.6 already [04:28:12] okay [04:28:14] 6:30am [04:28:17] time to sleep [04:29:02] heh [04:29:10] strange hours [04:29:55] you think? :) [04:30:33] paravoid: let me know if you merge that patch [04:30:53] $ ls -l php-1.21wmf7/cache/*.cdb [04:30:54] -rw-rw-r-- 1 reedy wikidev 891138 Jan 17 22:17 php-1.21wmf7/cache/interwiki.cdb [04:30:54] -rw-rw-r-- 1 reedy wikidev 891142 Jan 10 19:12 php-1.21wmf7/cache/trusted-xff.cdb [04:31:00] suspiciously similar sizes [04:31:18] $ ls -l php-1.21wmf6/cache/trusted-xff.cdb [04:31:19] -rw-rw-r-- 1 demon wikidev 2319902 Oct 3 05:21 php-1.21wmf6/cache/trusted-xff.cdb [04:32:26] * Aaron|home slowly grins [04:33:02] md5sum *.cdb ? 
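The md5sum comparison that follows shows what went wrong: wmf7's trusted-xff.cdb has the same content hash as wmf6's interwiki.cdb, i.e. the wrong file is sitting under that name. A hypothetical helper in the spirit of that manual check -- run from a MediaWiki deploy checkout, it groups the cache/*.cdb files by content hash and flags any hash shared by files with different basenames:

import hashlib
from pathlib import Path
from collections import defaultdict

def md5_of(path):
    # Hash in 1 MiB chunks so big CDB files need not fit in memory at once.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def find_suspect_cdbs(root="."):
    """Scan php-1.21wmf*/cache/*.cdb under the deploy root and report any
    content hash that appears under more than one file name."""
    by_hash = defaultdict(list)
    for cdb in Path(root).glob("php-1.21wmf*/cache/*.cdb"):
        by_hash[md5_of(cdb)].append(cdb)
    for digest, files in sorted(by_hash.items()):
        if len({f.name for f in files}) > 1:
            print(digest, "->", ", ".join(str(f) for f in files))

find_suspect_cdbs()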
[04:33:15] gah, that would be useless [04:33:24] * Aaron|home gets tired [04:33:58] $ md5sum php-1.21wmf6/cache/interwiki.cdb php-1.21wmf7/cache/trusted-xff.cdb [04:33:58] 849e2d2a39f9efda9e0b290f5158983d php-1.21wmf6/cache/interwiki.cdb [04:33:58] 849e2d2a39f9efda9e0b290f5158983d php-1.21wmf7/cache/trusted-xff.cdb [04:34:22] ahem [04:36:02] !log tstarling synchronized php-1.21wmf7/cache/trusted-xff.cdb 'fixing corrupted CDB file' [04:36:13] Logged the message, Master [04:36:44] !log tstarling synchronized php-1.21wmf8/cache/trusted-xff.cdb 'fixing corrupted CDB file' [04:36:54] Logged the message, Master [04:38:16] paravoid: on which system [04:38:17] ? [04:38:40] last time I checked this was due to puppet running and changing values in the config file [04:39:58] that should have been fixed a while ago [04:47:39] !log tstarling synchronized php-1.21wmf8/cache/trusted-xff.cdb [04:47:49] Logged the message, Master [04:51:39] !log tstarling synchronized php-1.21wmf8/cache/trusted-xff.cdb [04:51:49] Logged the message, Master [04:53:45] !log tstarling synchronized php-1.21wmf8/cache/trusted-xff.cdb [04:53:55] Logged the message, Master [04:57:52] !log tstarling synchronized php-1.21wmf7/cache/trusted-xff.cdb [04:58:01] Logged the message, Master [04:58:13] !log tstarling synchronized php-1.21wmf6/cache/trusted-xff.cdb [04:58:23] Logged the message, Master [05:57:36] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45330 [06:17:48] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [06:18:36] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [06:36:16] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 188 seconds [06:36:26] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 198 seconds [06:37:01] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 211 seconds [06:37:10] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 214 seconds [06:41:16] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [06:41:26] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [06:41:40] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [06:42:07] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [06:53:17] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 181 seconds [06:53:40] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 188 seconds [06:53:46] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 189 seconds [06:55:01] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 204 seconds [07:00:47] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [07:00:52] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [07:01:16] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [07:02:04] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [08:20:35] !log pgehres synchronized php-1.21wmf8/extensions/CentralNotice/ 'Updating CentralNotice to master' [08:20:46] Logged the message, Master [09:00:23] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [09:12:58] New patchset: Hashar; "(bug 44424) wikiversions.cdb for labs" 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46240 [09:14:08] New review: Hashar; "Oh I forgot the full path stuff, that should be fixed with PS2." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/46240 [09:36:32] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [09:36:32] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: Puppet has not run in the last 10 hours [09:36:32] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [09:36:33] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [09:36:33] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [09:36:33] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [09:36:33] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [09:36:34] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [10:13:41] New review: Hashar; "Sorry for not having followed up on that issue. Here is a summary of a discussion we had in a restri..." [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/31580 [10:41:11] Change abandoned: MaxSem; "Per the above." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31580 [12:24:00] PROBLEM - Puppet freshness on db1031 is CRITICAL: Puppet has not run in the last 10 hours [12:26:06] PROBLEM - Puppet freshness on db1037 is CRITICAL: Puppet has not run in the last 10 hours [12:30:11] PROBLEM - Puppet freshness on db1012 is CRITICAL: Puppet has not run in the last 10 hours [12:30:11] PROBLEM - Puppet freshness on db1015 is CRITICAL: Puppet has not run in the last 10 hours [12:30:11] PROBLEM - Puppet freshness on db1023 is CRITICAL: Puppet has not run in the last 10 hours [12:30:11] PROBLEM - Puppet freshness on db1014 is CRITICAL: Puppet has not run in the last 10 hours [12:31:23] PROBLEM - Puppet freshness on db1030 is CRITICAL: Puppet has not run in the last 10 hours [12:34:04] New review: Brian Wolff; ">"whenever a file is updated, we have to purge each thumbnails ever generated"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31580 [12:42:29] PROBLEM - Puppet freshness on db1029 is CRITICAL: Puppet has not run in the last 10 hours [12:45:29] PROBLEM - Puppet freshness on db1045 is CRITICAL: Puppet has not run in the last 10 hours [12:45:29] PROBLEM - Puppet freshness on db1044 is CRITICAL: Puppet has not run in the last 10 hours [12:49:23] PROBLEM - Puppet freshness on db1016 is CRITICAL: Puppet has not run in the last 10 hours [14:03:00] paravoid: ahhh faidon :-] [14:03:03] hey [14:03:12] slept well ? ready for some README.md madness and a few debian packages reviews ? :-] [14:10:33] PROBLEM - spamassassin on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:43] PROBLEM - Exim SMTP on sodium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:10:53] PROBLEM - HTTPS on sodium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:11:13] PROBLEM - HTTP on sodium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:11:23] PROBLEM - mailman on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:11:24] PROBLEM - SSH on sodium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:12:36] PROBLEM - mailman on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:12:54] looking [14:13:12] PROBLEM - SSH on sodium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:13:12] PROBLEM - spamassassin on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:13:47] PROBLEM - HTTP on sodium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:13:48] PROBLEM - Exim SMTP on sodium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:14:53] !log powercycled sodium [14:14:59] PROBLEM - HTTPS on sodium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:04] Logged the message, Master [14:22:13] RECOVERY - SSH on sodium is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:22:24] RECOVERY - spamassassin on sodium is OK: PROCS OK: 4 processes with args spamd [14:22:33] RECOVERY - Exim SMTP on sodium is OK: SMTP OK - 0.230 sec. response time [14:22:43] RECOVERY - HTTPS on sodium is OK: OK - Certificate will expire on 08/22/2015 22:23. [14:23:03] RECOVERY - HTTP on sodium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 190 bytes in 0.003 second response time [14:23:04] RECOVERY - SSH on sodium is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:23:04] RECOVERY - Exim SMTP on sodium is OK: SMTP OK - 0.254 sec. response time [14:23:14] RECOVERY - mailman on sodium is OK: PROCS OK: 10 processes with args mailman [14:23:31] RECOVERY - HTTP on sodium is OK: HTTP OK - HTTP/1.0 301 Moved Permanently - 0.054 second response time [14:23:32] RECOVERY - mailman on sodium is OK: PROCS OK: 10 processes with args mailman [14:23:58] RECOVERY - HTTPS on sodium is OK: OK - Certificate will expire on 08/22/2015 22:23. 
[14:23:58] RECOVERY - spamassassin on sodium is OK: PROCS OK: 4 processes with args spamd [14:34:13] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 185 seconds [14:34:28] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 192 seconds [14:34:33] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 195 seconds [14:37:13] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [14:37:33] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [14:37:55] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [14:48:07] New patchset: Mark Bergsma; "Handle RADOSGW (Swift API) url rewriting for the basic case" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44067 [14:48:08] New patchset: Mark Bergsma; "Implement thumb 404 image scaler handling" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44068 [14:48:08] New patchset: Mark Bergsma; "Add timeline, math, score rewrites" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44072 [14:48:08] New patchset: Mark Bergsma; "Remove double slashes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44076 [14:48:09] New patchset: Mark Bergsma; "Support project/language prefixes for math" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44077 [14:48:09] New patchset: Mark Bergsma; "Set CORS header in vcl_deliver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44078 [14:54:19] New review: Hashar; ":-]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45330 [15:30:43] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:33:43] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:50:06] ottomata: could you please start getting us a list of destinations/sources/ports/protocols the analytics cluster needs to communicate with? [15:50:14] I'd like to have an ACL in place by the end of the week [15:50:30] RT #4433 can be used to assemble the list [15:51:11] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44067 [15:53:00] mark: [15:53:02] https://ccp.cloudera.com/display/CDH4DOC/Configuring+Ports+for+CDH4 [15:53:03] https://www.mediawiki.org/wiki/Analytics/Kraken/JMX_Ports [15:53:17] New patchset: Mark Bergsma; "Revert "Cleanup ldap script formatting"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46530 [15:53:31] sure, but mark you mean outside of the analytics cluster, right? [15:53:37] yes [15:53:40] sure [15:53:50] anything between the analytics cluster and other WMF, non-analytics servers [15:54:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46530 [15:54:18] you can assume that the basics will be there, i.e. 
puppetmaster, dns, ntp [15:54:28] anything every server does [16:12:42] New patchset: Mark Bergsma; "Add the image_scalers backend to varnish upload backends" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46533 [16:13:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46533 [16:18:16] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [16:19:30] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [16:21:41] New patchset: Mark Bergsma; "Add image_scalers as backend also, not just director" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46534 [16:22:04] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46534 [16:32:14] mark: Pardon my question, but what do these two image scaler patches just mentioned here actually do/change? I'd like to understand it, as there's been reports this weekend about reuploads to commons plus invalidating old cached thumbnails failing [16:32:34] they do nothing yet [16:32:45] this is in preparation for introducing ceph [16:32:49] New patchset: Jgreen; "flipped payments.wikimedia.org lvs monitoring from ptmpa to eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46535 [16:33:21] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46535 [16:34:04] ah. Gotcha. Thanks! [16:37:03] New patchset: Mark Bergsma; "Implement thumb 404 image scaler handling" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44068 [16:41:36] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44068 [16:45:49] New patchset: Silke Meyer; "Add a variable to enable/disable experimental Wikidata features in labsconsole" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46536 [16:47:23] New review: Silke Meyer; "In wikidata-client-requires.php I also moved the respective lines up to make sure they are included ..." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/46536 [16:54:10] MaxSem: hey [16:54:15] paravoid, hey [16:54:32] I asked Tomasz about Copenhagen on Sunday [16:54:39] and he said it wasn't sure it was going to happen yet [16:55:16] now the only way it can't happen is that they don't give me a visa [16:56:45] New patchset: Mark Bergsma; "return (restart);" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46537 [17:02:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46537 [17:10:10] !log demon synchronized wmf-config/InitialiseSettings.php [17:10:21] Logged the message, Master [17:10:36] New patchset: Demon; "Simplewikiquote was still editable" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46538 [17:10:53] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46538 [17:15:56] New patchset: Mark Bergsma; "Add timeline, math, score rewrites" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44072 [17:15:56] New patchset: Mark Bergsma; "Remove double slashes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44076 [17:15:56] New patchset: Mark Bergsma; "Support project/language prefixes for math" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44077 [17:15:56] New patchset: Mark Bergsma; "Set CORS header in vcl_deliver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44078 [17:17:19] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44072 [17:25:00] New patchset: Mark Bergsma; "Remove double slashes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44076 [17:25:00] New patchset: Mark Bergsma; "Support project/language prefixes for math" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44077 [17:25:00] New patchset: Mark Bergsma; "Set CORS header in vcl_deliver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44078 [17:26:08] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44077 [17:26:30] mark: What kind of requests does that serve Access-Control-Allow-Origin: * for? [17:26:39] all [17:26:47] .... oh on upload.wm.o only though? 
[17:26:50] yes [17:26:55] this is already set on all swift objects [17:27:01] * RoanKattouw just noticed the filename [17:27:01] phew [17:27:01] OK [17:27:01] hehe :) [17:27:03] Nothing to worry about then :) [17:27:07] but it's a bit pointless to do that for every object [17:27:12] just costs cache space [17:28:14] that's already the case since weeks ago btw [17:32:16] New patchset: Mark Bergsma; "Remove double slashes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44076 [17:32:16] New patchset: Mark Bergsma; "Set CORS header in vcl_deliver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44078 [17:57:20] !log restarted apache and memcached on virt0 (apparently fixing main page errors) [17:57:31] Logged the message, Master [18:03:24] New patchset: MaxSem; "A couple of MobileFrontend tweaks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46542 [18:20:45] sbernardin: So when I do rack determination with cmjohnson1 I tend to do it in channel [18:20:50] so he knows why I am picking the racks I pick [18:21:10] so will review my picks in here so you know whats up (but it will all go in a ticket as well so no need to take notes) [18:21:52] Robh: OK...and the boxes are being tossed? [18:21:57] We'll have a total of 60 new apache servers coming in, which will be mw75-mw95 [18:22:08] sbernardin: I forgot, we have decom servers from them [18:22:17] so you may wanna keep the boxes to box up the servers we are turning off and unracking [18:22:50] Robh: OK...that's what I figured we would need [18:24:08] Robh: once we know how many we're replacing I can put boxes asside [18:24:34] We'll be replacing atleast 60 [18:24:54] so keep them all, the old 1950 'srv###' are going away, but just the older ones, will have more of a list shortly [18:25:08] i know srv257 and below are out of warranty, checkign the others now [18:25:37] looks like srv258+ are under warranty until april this year [18:25:41] so we won't replace those yet [18:26:37] we also have room in d1-sdtpa, so we'll end up filling that in with these as well, just gotta see how much power overhead we have there [18:26:44] to do that, i can use torrus, which is historical data [18:26:50] and i can login to the power strip directly. [18:27:05] sbernardin: if you are onsite, you can login to the power strip easily, if you are NOT onsite, you ahve to have a proxy into the cluster [18:28:08] sbernardin: So every power strip is on the mgmt network and accessible with http://ps1-rack-datacenter.pmtpa.wmnet [18:28:17] so in this case http://ps1-d1-sdtpa.mgmt.pmtpa.wmnet [18:28:58] I also want to ensure we have not spiked power during peak load times [18:29:05] which I check out on torrus, example: http://torrus.wikimedia.org/torrus/Facilities?nodeid=device//ps1-d1-sdtpa [18:31:37] sbernardin: the torrus link is public viewable [18:31:43] so can click that when not on site and still works [18:32:01] we can see the power historically in this rack never peaks above 3.3kw [18:32:37] it is a single power feed, not dual =[ [18:32:45] so math time! [18:32:52] it is volts * amps * 80% = usable watts [18:33:09] You never want to load a circuit above 80% of its capacity, as it will spike during power up [18:33:22] OK [18:33:25] so 208V 3Phase 30A is [18:33:25] 3*120*30*0.8 = 8640 WATTS [18:33:33] our ceiling is 8.6kw [18:33:50] * hashar today I learned: american people uses volts, amperes and watts just like everyone else. 
[18:35:19] so lets do more simple math, we have 30 servers currently in d1-sdtpa running at 3.3 peak kw 3300 / 30 = 110w per server [18:35:39] hashar: we're only backward with weight, volume, and distance. electricity is OK [18:36:18] if we fill the rack, it adds 15 more servers, 15 * 110 = 1650w + 3300 = 4950W [18:36:32] which is below our 8.6 so we should be able to fill up D1 with new apaches without an issue [18:36:43] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100% [18:36:54] of course, my math is a bit of guesswork, the new apaches are slightly beefier than the older ones currently in d1 (410 veruss 420) [18:36:56] RobH: sorry to bother, do you monitor the hardware of the dell machines? [18:37:04] !log ms-be1001 going down to replace h/w [18:37:14] Logged the message, Master [18:37:16] matanya: yes, nagios tells of of bad disks, etc [18:37:26] though we need to add more comprehensive checking [18:37:38] I can help out if you want [18:37:47] * cmjohnson1 loves robh's tampa power lesson [18:37:54] just done it in my company for 100+ servers [18:38:13] matanya: you can see what we are monitoring @ nagios.wikimedia.org [18:38:16] which will show its pretty basic [18:38:19] sure [18:38:29] we should be monitoring for memory failures and the like [18:38:31] I monitor every part of the server [18:38:34] but those tend to only be caught when they crash servers [18:38:38] temp as well, etc [18:38:43] I do that [18:38:44] we only do the most basic stuff [18:39:09] can I put here an output of my checks? [18:39:26] can i come back to you on it [18:39:34] sure [18:39:36] i have to find a home for a bunch of servers before sbernardin can do anything today =] [18:39:48] but im on rt duty, so following up on stuff like this is my job this week ^_^ [18:40:01] sbernardin: So, we now have a home for 15 of the 60 servers, d1 [18:40:03] sound interesting [18:40:53] OK [18:40:54] ok, so on rack d2-sdtpa, it is all old old poweredge1950s [18:40:57] which i want gone [18:41:15] unfortunately, racktables is private, as it cannot do anything other than login with full admin, or no login at all [18:41:19] (sorry for non staff following along) [18:41:22] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:41:37] so https://racktables.wikimedia.org/index.php?page=rack&rack_id=41 is old crap [18:41:52] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [18:42:00] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:42:08] so just to know whats up in the rack [18:42:09] http://torrus.wikimedia.org/torrus/Facilities?path=/Power_strips/sdtpa/ps1-d2-sdtpa/System/ [18:42:25] the power draw is 5.5kw [18:42:38] but, if we can help it, I want to pull out ALL of those old servers, so we need to see what they are doing [18:42:48] ie: are they all in api cluster, or some in normal cluster [18:42:59] as we want to make sure if we pull 10 out of api, we put 10 back (atleast) [18:43:52] RobH: ping me when you can. 
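RobH's rack-power walkthrough above boils down to two small formulas: usable watts are volts x amps x phases x 0.8 (never load a circuit past 80% of its capacity), and the projected draw after adding servers is the observed peak plus the per-server average times the number added. The same arithmetic, with the d1-sdtpa numbers from the log:

def usable_watts(volts, amps, phases=1, derate=0.8):
    # 80% derating: leave headroom for power-up spikes.
    return volts * amps * phases * derate

def projected_draw(peak_watts, current_servers, added_servers):
    # Assume new boxes draw roughly what the existing ones do at peak.
    per_server = peak_watts / current_servers
    return peak_watts + added_servers * per_server

ceiling = usable_watts(volts=120, amps=30, phases=3)   # 8640 W for 208V 3-phase 30A
after = projected_draw(peak_watts=3300, current_servers=30, added_servers=15)

print(ceiling, after, after <= ceiling)   # 8640.0 4950.0 True -> d1 can take 15 more apaches

As RobH notes, the estimate is rough because the incoming servers are slightly beefier than the ones already in the rack.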
[18:43:58] OK [18:44:09] matanya: will do thx =] [18:44:30] sbernardin: So the easy way to check this is via our puppet files, (site.pp in puppet, which you may not be setup yet to pull that data but we'll get you setup sometime soon) [18:44:55] OK [18:45:27] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100% [18:45:49] srv226-247 are normal application servers, 248 and 249 are bits app servers, and 250 to 257 are api [18:46:05] so when we add the new servers back to the cluster, we need to replace those ranges accordingly (just fyi) [18:46:49] brb [18:47:13] OK [18:48:30] ok, so lemme see [18:49:01] so 15 of them are going in the free space in d1 [18:49:10] then the remaining 45 into d2-sdtpa [18:49:10] RobH, just going through some older RT tickets, once you're done with the current stuff you're working on can you check if https://rt.wikimedia.org/Ticket/Display.html?id=2787 has been done? [18:50:47] Thehelpfulone: done, just got someone to volunteer to take it ;] [18:51:10] done with check that is, heh [18:51:27] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [18:51:36] sbernardin: Ok, so I'll drop a ticket with all the instructions but just to continue conversation in here [18:51:48] since we will be killing the servers in d2-sdtpa, once I update our software for that [18:51:55] ah :P [18:52:10] you will pull the network (blue production, not green mgmt) [18:52:19] and reboot them into boot and nuke [18:52:30] sbernardin: http://www.dban.org/ [18:52:41] you will want to make a usb stick of that [18:52:59] then reboot each old server into it, once its up and running the wipe you can usually just pull the usb stick and move to next [18:53:06] as the wipe software resides in memory once launched [18:53:18] (atleast that is what i used to do, cmjohnson1 has done it more recently) [18:53:24] cmjohnson1: that how you handled it? [18:53:45] we could put a boot and nuke into pxe to run on things, but having something data destructive like that as a boot option scares me. [18:59:12] sorry to interrupt again, wouldn't it be easier to use a master nuke server? [18:59:50] yes and no [18:59:52] yes overall [19:00:14] no in that for it to be on network comfortably i would want it in its own vlan so dhcp/pxe server wont serve that image to normal subnets [19:00:18] to prevent accidental data loss [19:00:32] its on my list, because i want that vlan to be the 'test' vlan that also spins up automated burn in testing on new systems [19:00:45] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [19:00:59] got it [19:01:14] we just have not had time to project and work on it ;_; [19:01:18] if you had the env, it was best [19:01:39] but having a nuke server on the normal vlans, that will serve to any of our systems, that makes me nervous [19:01:46] so its been shelved for awhile =[ [19:01:51] New patchset: Alex Monk; "(bug 44472) Add throttle exception for a wikipedia editing workshop in Mumbai" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46547 [19:02:06] sbernardin: I am writing up the ticket(s) now for the decom and new server add, will ping you with links when done [19:02:14] see your point. 
I use a local switch for such cases [19:03:12] New patchset: MaxSem; "Add WikiVoyage to GeoData update scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46548 [19:04:05] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100% [19:04:09] can someone take a look at ^^^ please? [19:06:36] robh sebenardin: yep..there is a dban stick already there [19:09:15] !log ms-be1002 going down to replace h/w [19:09:16] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [19:09:26] Logged the message, Master [19:12:36] PROBLEM - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:13:03] PROBLEM - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:13:15] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [19:15:55] PROBLEM - SSH on ms-be1001 is CRITICAL: Connection refused [19:15:56] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100% [19:17:30] RECOVERY - SSH on ms-be1008 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:17:40] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [19:17:51] RECOVERY - SSH on ms-be1008 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:20:01] RECOVERY - SSH on ms-be1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:20:42] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [19:29:55] !log upgrading glusterfs to 3.3.1 from 3.3.0 and replacing old glusterfs package with new packages [19:30:09] New review: Demon; "I don't think we'll need to add the cdb files to git. The .cdb file is built easily, so I think we c..." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/46240 [19:30:10] Logged the message, Master [19:30:28] New patchset: Asher; "sync gdash graph definitions and actually deploy from puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46552 [19:31:46] !log ms-be1003 going down to replace h/w [19:31:56] Logged the message, Master [19:34:01] RECOVERY - Host ms-be1002 is UP: PING WARNING - Packet loss = 61%, RTA = 0.39 ms [19:34:20] PROBLEM - NTP on ms-be1001 is CRITICAL: NTP CRITICAL: No response from NTP server [19:34:30] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [19:35:06] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [19:35:33] sbernardin: So lets not decom d2-sdtpa until you rack and have the 15 new servers in d1-sdtpa [19:35:45] just peace of mind to have some more online before we pull a bunch [19:35:50] sbernardin: https://rt.wikimedia.org/Ticket/Display.html?id=4436 [19:36:29] and https://rt.wikimedia.org/Ticket/Display.html?id=4437 [19:36:30] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46552 [19:36:40] PROBLEM - SSH on ms-be1002 is CRITICAL: Connection refused [19:37:03] RECOVERY - Host ms-be1002 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms [19:37:21] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [19:37:21] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [19:37:21] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: Puppet has not run in the last 10 hours [19:37:21] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [19:37:21] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [19:37:22] PROBLEM - Puppet freshness on msfe1002 is 
CRITICAL: Puppet has not run in the last 10 hours [19:37:22] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [19:37:23] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [19:38:08] OK [19:41:24] PROBLEM - SSH on ms-be1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:41:56] sbernardin: Actually, I fucked up and missed older servers [19:42:06] We'll not be putting things in d2-sdtpa, or decomming those servers [19:42:15] turns out srv190-srv225 are in s5-sdtpa [19:42:17] and much older. [19:42:40] RECOVERY - SSH on ms-be1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:42:52] New review: Hashar; "Ryan idea was probably to have the .cdb in the git repo so we can git-deploy it to the other hosts. ..." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/46240 [19:43:03] RECOVERY - SSH on ms-be1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:43:28] hashar: for beta you want to switch back to the old deployment methof [19:43:31] *method [19:43:49] yeah I was wondering if you / sam were still working on the newdeploy branch [19:43:57] not for now [19:44:09] from a talk with Rob yesterday, we would want to keep git-deploy on beta [19:44:24] so we can use the beta cluster as a test/preprod host for git-deploy stuff [19:44:29] I'd really like to replace how we're handling l10n [19:44:30] so I guess we want to stick to it [19:44:41] sticking it into git was a short-term thing [19:44:50] BUT, I will probably want to use mediawiki-config @ master [19:44:55] yes [19:45:04] there's a few issues with it right now [19:45:28] I will probably hack that tomorrow morning [19:45:37] sticking it into git was a short-term thing (but now 18 years of child support) [19:45:50] hahaha [19:46:07] :-] [19:46:18] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46240 [19:46:36] binasher: well, I was going to marry l10n off to bittorrent [19:46:45] 90 minutes of pleasure, 9months of waiting time, 3 * 9 years of troubles [19:47:06] PROBLEM - NTP on ms-be1001 is CRITICAL: NTP CRITICAL: No response from NTP server [19:47:42] !log jenkins: updating all jobs [19:47:52] Logged the message, Master [19:48:31] PROBLEM - NTP on ms-be1002 is CRITICAL: NTP CRITICAL: No response from NTP server [19:49:40] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 182 seconds [19:49:50] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 187 seconds [19:50:33] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 199 seconds [19:50:51] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 207 seconds [19:51:28] Ryan_Lane: can I pm? [19:51:38] !log ms-be1004 going down for h/w replacement [19:51:42] matanya: it can't be asked in here? [19:51:48] Logged the message, Master [19:52:11] binasher: can you add job-pop-duplicate and job-insert-duplicate stats to gdash? [19:52:28] it can, but topic says: serious stuff. I don't want to bother serious stuff. [19:52:38] doh, aaron isn't in here [19:54:01] PROBLEM - Host ms-be1004 is DOWN: PING CRITICAL - Packet loss = 100% [19:54:19] PROBLEM - Host ms-be1004 is DOWN: PING CRITICAL - Packet loss = 100% [19:55:41] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:56:09] anyway, are all wikimedia servers dell servers? in which way nagios is working, snmp? nrpe? 
ssh? are all servers 12.04? [19:56:47] took many questions :-D [19:56:49] too [19:57:41] PROBLEM - SSH on ms-be1003 is CRITICAL: Connection refused [19:57:53] matanya: our nagios configuration is in operations/puppet.git [19:58:13] matanya: I guess we use mostly nrpe [19:58:22] matanya: and most servers are running 12.04 (Precise) [19:59:06] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 26.48 ms [19:59:18] thanks hashar [19:59:50] matanya: if you know about nagios, Damianz is maintaining the nagios system on wmflabs [20:00:23] is he around? I'd like to submit quite a lot of scripts and checks [20:01:35] matanya: he has been away for two days, he is #wikimedia-labs often [20:02:00] thanks [20:02:06] matanya: ah his script is in labs/nagios-builder.git [20:02:20] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [20:02:24] PROBLEM - NTP on ms-be1002 is CRITICAL: NTP CRITICAL: No response from NTP server [20:03:09] PROBLEM - SSH on ms-be1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:27] hashar: who maintain the production nagios? [20:03:31] *s [20:03:57] matanya: we are migrating to icinga (the nagios community fork). LeslieCarr has the most knowledge about it [20:04:10] else random ops people update the puppet manifests [20:04:21] any blog post why to migrate? [20:04:33] iirc I suggested it to Leslie [20:04:58] compare http://nagios.wikimedia.org versus http://neon.wikimedia.org/icinga/ [20:05:04] icinga looks a bit more modern [20:05:17] don't have access [20:05:29] oh [20:05:36] unfortunate, you should [20:05:38] hi guys, i'm looking into merging the analytics branch of operations/puppet into production for review [20:05:38] or maybe I am whitelisted [20:05:51] got some qs on how to best go about it [20:06:00] hashar: who should I speak regarding scripts and checks donation? [20:06:15] is that Q&A session today? ;;-] [20:06:28] isn't it? :) [20:06:33] first off [20:06:34] matanya: what do you mean by donating scripts and checks ? [20:06:45] New patchset: Dereckson; "Maintenance for http://fr.planet.wikimedia.org/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46558 [20:06:50] matanya: if you want to update the hosts, you can hack the puppet manifests and the labs/nagios-builder.git [20:07:01] there are 4 puppet modules that I am maintaining as separate repositories and using git submodules to clone them [20:07:19] there was a discussion back in october relating to using git submodules in puppet [20:07:27] I just finished a nagios review in my company, for 100+ servers [20:07:30] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [20:07:31] ottomata: I don't think ops will allow submodules, they would be considered third parties / untrusted [20:07:38] ottomata: are they upstream modules? [20:07:41] they would be hosted in gerrit [20:07:43] ottomata: one could update the submodule to grant itself root in turn giving him root on the cluster. 
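matanya is offering to contribute hardware checks, and hashar notes the production fleet is monitored mostly via NRPE. NRPE (and the Icinga setup being migrated to) runs ordinary executables that follow the Nagios plugin convention: one status line on stdout, optional |perfdata, and exit code 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN. A minimal, hypothetical example of such a check -- the temperature source and thresholds are placeholders, not anything WMF actually deploys:

#!/usr/bin/env python
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def read_temperature():
    # Placeholder: a real hardware check would query IPMI, OMSA or /sys/class/hwmon.
    return 61.0

def main(warn=65.0, crit=75.0):
    try:
        temp = read_temperature()
    except Exception as exc:
        print("TEMP UNKNOWN - %s" % exc)
        return UNKNOWN
    if temp >= crit:
        status, label = CRITICAL, "CRITICAL"
    elif temp >= warn:
        status, label = WARNING, "WARNING"
    else:
        status, label = OK, "OK"
    # One status line plus perfdata, which is all NRPE relays back to Nagios/Icinga.
    print("TEMP %s - %.1f C |temp=%.1f;%.1f;%.1f" % (label, temp, temp, warn, crit))
    return status

if __name__ == "__main__":
    sys.exit(main())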
[20:07:44] no, they are mine [20:07:45] the main problem is that of review, and especially final review of diffs [20:07:49] on sockpuppet [20:07:56] would like to help install the scripts and checks on wmf servers [20:08:16] so, i can commit them directly to ops/puppet if I have to [20:08:30] i'd rather not, because they are useful as standalone modules [20:08:39] and have nothing wmf specific in them [20:09:09] so, the git submodules problem mainly has to do with how diffs are reviewed on sockpuppet before we merge into the working copy there [20:09:44] i wrote a wikipage on how this would work as is [20:09:45] http://wikitech.wikimedia.org/view/User:Ottomata/Git_Submodules_and_Puppet [20:10:20] so, this is less of a problem if you do development in labs [20:10:21] submodules won't be automatically updated as part of our regular merge process [20:10:35] matanya: and here is a nagios vs icinga chart https://www.icinga.org/nagios/feature-comparison/ [20:10:40] this sounds like a list question, honestly [20:11:00] yeah, i think so too, hmm [20:11:00] but [20:11:10] lots of our modules are/will be useful to third parties [20:11:30] since I'm already in lots of trouble, i guess i'm trying to feel out how much I should rock the boat here. [20:11:37] it feels wrong to commit these directly to ops/puppet [20:11:40] why? [20:11:44] but that is the path of least resistance [20:11:46] that's how all the rest of our modules are done [20:12:11] unless we're going to change all modules to submodules it doesn't make sense [20:12:12] i guess for open source discovery purposes, the modules have nothing to do with WMF [20:12:19] i don't think thats true [20:12:22] lots of our modules don't, really [20:12:22] many of our modules are WMF specific [20:12:46] one way would be to stick the modules in operations/puppet.git with the rest [20:12:54] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [20:13:05] and if we want to make them available to third parties, we could script something that would publish them to different git repos [20:13:32] that would be great actually [20:13:33] redis, salt, mongodb, java, deployment <— current modules that aren't wmf specific [20:13:50] right, and they are difficult for others to discover or use [20:13:51] hashar: that's not terribly easy [20:13:56] since it would be difficult to keep history [20:13:58] hmmm [20:13:59] right [20:14:10] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [20:14:14] Ryan_Lane: I am pretty sure I did something like that already using git-filter branch [20:14:38] oh and for wmf specific stuff I proposed a "wikimedia" module [20:14:56] got to catch paravoid to get it merged in :-D [20:15:29] New patchset: Pyoungmeister; "db box bookkeeping" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46561 [20:15:29] the change is at https://gerrit.wikimedia.org/r/#/c/43420/ feel free to cast your voice there [20:17:29] hashar, I think that's a really good idea [20:17:46] although, i'm not so sure how different it is than just manifests/ [20:18:09] RECOVERY - Host ms-be1004 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [20:18:46] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [20:19:20] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [20:20:09] ottomata: the point is mainly to stop including everything from manifests/site.pp [20:20:14] yeah [20:20:17] and takes advantage of puppet autoloading [20:20:17] that makes sense for sure [20:20:22] that 
would make puppet a bit faster [20:20:31] RECOVERY - Host ms-be1004 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [20:20:31] annnnd let us finally start writing integration tests [20:22:05] oh? [20:22:05] how so? [20:22:50] PROBLEM - SSH on ms-be1004 is CRITICAL: Connection timed out [20:23:29] regarding analytics branch, Ryan_Lane and hashar, I guess I'm trying to ask if you think I should restart the submodules discussion. I had planned to do so in a few months when we were ready to merge in analytics branch, but due to everything that's happening at the moment we should do this now [20:23:42] PROBLEM - SSH on ms-be1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:54] I think that is the right way to do things, but if it is going to just cause more friction it might not be worth it [20:24:02] sigh, ha I really want to ask paravoid [20:24:07] maybe I should just email him :p [20:25:20] or mail ops list ? [20:25:40] ottomata: other people could cast their voices by mailing to ops list [20:25:40] RECOVERY - SSH on ms-be1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:25:48] RECOVERY - SSH on ms-be1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:26:33] RECOVERY - SSH on ms-be1004 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:26:35] yeah, there is a discussion thread about it [20:26:50] RECOVERY - SSH on ms-be1004 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:26:57] right, that's what I would do if I was ok with causing friction [20:27:11] at the moment errrghhh, i dunno [20:27:55] PROBLEM - NTP on ms-be1004 is CRITICAL: NTP CRITICAL: No response from NTP server [20:29:52] hashar Ryan_Lane, i'm going to email faidon and just ask what he thinks, and if he thinks I should start a discussion I will [20:31:00] sound sgood [20:31:17] or send the patch in Gerrit and add him as a reviewer ;-]]]]] [20:32:46] RECOVERY - swift-account-reaper on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [20:35:46] PROBLEM - swift-account-reaper on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [20:39:09] RECOVERY - Puppet freshness on ms-be1008 is OK: puppet ran at Tue Jan 29 20:38:54 UTC 2013 [20:41:15] heads up: we'll start deploying MobileFrontend updates, will someone be around to help with Varnish? [20:43:19] !log updated rt config, pmtpa comments now come from pmtpa-comment, not rt-comment to match the rest of the queues configuration [20:43:30] Logged the message, RobH [20:44:06] PROBLEM - NTP on ms-be1004 is CRITICAL: NTP CRITICAL: Offset unknown [20:44:36] New review: awjrichards; "Looks good, I'm happy to merge once we're in the clear for deployment." [operations/mediawiki-config] (master); V: 2 C: 1; - https://gerrit.wikimedia.org/r/46542 [20:47:42] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 182 seconds [20:48:09] RECOVERY - NTP on ms-be1001 is OK: NTP OK: Offset -0.02065074444 secs [20:48:23] RECOVERY - NTP on ms-be1001 is OK: NTP OK: Offset -0.02181839943 secs [20:48:23] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 196 seconds [20:48:27] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 196 seconds [20:48:36] robh: so I am looking at where to put the mw servers in eqiad and while we want to keep them together w/other apaches. 
I don't have enough space or available power for 60 [20:48:45] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 203 seconds [20:49:38] cmjohnson1: im looking at that right now actually [20:49:38] k [20:50:22] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 2 seconds [20:50:34] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [20:50:43] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [20:51:00] RECOVERY - NTP on ms-be1002 is OK: NTP OK: Offset -0.02260315418 secs [20:51:43] RECOVERY - NTP on ms-be1002 is OK: NTP OK: Offset -0.02382671833 secs [20:51:54] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [20:52:04] cmjohnson1: So looking at power in a5/a6/b6/b7/b8 they seem ok for power [20:52:17] sorry..should [20:52:26] for adding 10 each should be ok [20:52:26] have been more clear...no available power outlets [20:52:32] ohh, no power outlets in those? [20:52:45] yeah..give me a sec i will give you a count [20:53:46] I thought we had 84 outlet plugs in those, but that may just be row c... [20:55:43] RECOVERY - NTP on ms-be1004 is OK: NTP OK: Offset -0.02612674236 secs [20:56:15] RECOVERY - NTP on ms-be1004 is OK: NTP OK: Offset -0.02628481388 secs [20:56:53] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 16 seconds [20:57:22] robh: i have a total of 33 [20:57:22] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 11 seconds [20:57:36] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 8 seconds [20:57:36] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 8 seconds [20:57:52] cmjohnson1: need more info, 33 pairs of ports across all 5 of those racks? [20:57:54] a7/7 a6/5 b6/5 b7/8 b8/8 [20:58:01] damn. [20:58:13] yes pairs [20:58:20] so its 48 ports per side iirc? [20:58:30] sorry, 42 [20:58:32] 42 [20:58:40] yea, ok, thats to be expected then [20:58:47] row c is the same [20:58:51] w/42 [20:59:02] RECOVERY - NTP on ms-be1008 is OK: NTP OK: Offset -0.02078998089 secs [20:59:24] cmjohnson1: Ok, so with the mw servers, fill in those 5 racks with the space you have [20:59:33] and put the remainder mw servers in rack c6 [20:59:38] which will be row c apache rack. [20:59:51] RECOVERY - NTP on ms-be1008 is OK: NTP OK: Offset -0.02116334438 secs [21:00:19] k [21:03:17] put new mediawiki servers in row C [21:03:22] bleehehhh [21:03:29] mark: he is [21:03:38] but shouldnt we fill in row A and B first so we dont have open space? [21:03:44] no [21:03:47] wh... [21:03:48] why? [21:03:52] if we have a row switch failure we're dead [21:03:57] ..... in one rack [21:04:04] its 5 racks [21:04:04] the entire row [21:04:17] but this is my point, why not order app servers by the rack? [21:04:29] but you are arguing against filling a rack with apaches [21:04:37] we want to fill the existing racks before expanding [21:04:39] yes, i'm arguing for a site with good uptime [21:04:47] you just said order by the rack? [21:04:50] those older apaches will be gone first [21:04:53] .... [21:04:57] and then it will be mixed with newer ones, blehh [21:05:09] so instead we leave open racks [21:05:10] buy this stuff by the rack, spread out over rows [21:05:10] wasted amps [21:05:15] and wasted real estate in each rack? [21:05:15] now row C is wasted [21:05:19] what's the difference? [21:05:24] row C will get filled in over time. 
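To make the module-publishing idea from the submodules discussion above concrete (Ryan_Lane's suggestion to script something that republishes modules to different git repos, and hashar's mention of having done it with git filter-branch), here is a minimal sketch of splitting one module out of operations/puppet into its own repository while keeping its history. The module path, the standalone repository name and the clone/push URLs are illustrative assumptions, not existing repos:

    # work in a throwaway clone, since filter-branch rewrites history
    git clone https://gerrit.wikimedia.org/r/operations/puppet puppet-split
    cd puppet-split
    # keep only the history that touches modules/redis (hypothetical choice of module)
    git filter-branch --prune-empty --subdirectory-filter modules/redis -- --all
    # push the rewritten history to a standalone repository created beforehand in Gerrit
    git remote add standalone ssh://<user>@gerrit.wikimedia.org:29418/operations/puppet-redis
    git push standalone HEAD:refs/heads/master

Because the module's commit history survives the split, this sidesteps the "difficult to keep history" objection raised above, and the resulting repo carries nothing WMF-specific, so third parties can discover and reuse it without pulling in the rest of operations/puppet.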
[21:05:28] so will a/b [21:05:32] and a/b will clear out sooner [21:05:33] not as much [21:05:37] look at sdtpa [21:05:46] we have the issue that we have racks half filled all over the place [21:05:55] and its hard to allocate and rack new stuff in groups [21:06:05] apache racks? [21:06:07] yes [21:06:16] that's because it wasn't done the way I wanted ;) [21:06:20] .... [21:06:22] thats bullshit [21:06:23] i've said for years to buy by the rack [21:06:32] and decommission by the rack [21:06:47] we cannot fit 60 servers in a rack [21:06:51] no [21:06:52] so we take two of row C [21:06:54] we can fit like 32-40 [21:06:55] and waste the space in a and b [21:06:59] sitting empty and we pay for it. [21:06:59] why 60 anyway [21:07:03] who came up with that silly number [21:07:06] because its what asher/ct told me to do [21:07:10] sigh [21:07:20] they said order 60 for tampa. [21:07:25] then asher said order another 60 for eqiad. [21:07:32] amen [21:07:39] and ct said that you and asher get to tech review [21:07:44] and didnt say i have to have you both sign off [21:07:47] so i did what they said. [21:08:18] So my problem is if we put all the new 60 servers in row C, we take up 2 of the racks for that and now we have 7 apache racks, all filled to 75% capacity [21:08:28] instead of 6 apache racks, 5 of which are 100% full [21:08:35] and one of which is 45% full [21:08:41] Change merged: awjrichards; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46542 [21:09:33] now if we want them spaced evenly in all three rows, thats fine, but we need to at least fill up the racks [21:09:33] if you keep mixing old and new across racks, you also keep moving shit around all the time [21:09:49] proper way is to calculate needed power capacity, fill a rack, then be happy with it [21:09:56] ok, im trying to do that. [21:09:57] fill a rack. [21:10:10] it has to happen over the course of multiple orders, because we didnt fill them to capacity before [21:10:15] and I didnt do the initial ordering for eqiad [21:10:41] when those were ordered, we did not properly fill a rack to capacity, an issue i am now trying to correct. [21:10:49] how much are they using? [21:11:21] a6-eqiad is at roughly 7.1kW [21:11:33] it can take at least 5 more servers [21:12:29] we're also limited by number of ports, we have 42 per tower [21:12:37] so in a6 we have 5 ports left. [21:12:52] that should (though we may end up having to pull one once we test the power up levels) [21:12:57] be able to handle all being filled [21:13:01] then a5 is at capacity. [21:13:06] so you want to do one rack in row C, then spread the remainder over existing racks? [21:13:18] I want to first fill rows A and B apache racks to capacity [21:13:25] then take the remainder and create an apache rack in row C [21:13:27] don't try to max up that precisely, if you're not sure one will trip the 8.6 kW, just don't put it there [21:13:44] ok, so we can put 4 and leave a set of plugs open too [21:13:50] (for usb cdroms, etc) [21:13:52] just fill one new C rack to the max [21:13:55] then spread the rest [21:14:00] so at least one is sane :P [21:14:16] that is close enough to splitting the difference to me. [21:14:27] I just hate having all the racks at 75% capacity with wasted overhead [21:14:34] your solution comes close enough to addressing that to me. [21:14:48] cmjohnson1: got it? c6 is row C apache rack [21:14:53] yep [21:14:58] so fill that then populate a5/6/7 and b6/7 [21:16:11] cmjohnson1: Soooo did all the new misc go in c4?
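For the capacity math being discussed: a6-eqiad at roughly 7.1 kW against the 8.6 kW trip threshold leaves about 1.5 kW of headroom, and 5 open outlet pairs, so whichever limit is reached first caps how many more servers fit. A small sketch of that bookkeeping; the ~250 W per-server draw is an assumed figure for illustration, not a measured one:

    # rough per-rack headroom check (numbers other than the quoted ones are assumptions)
    RACK_LIMIT_W=8600      # 8.6 kW breaker/trip threshold
    RACK_CURRENT_W=7100    # a6-eqiad measured load, roughly 7.1 kW
    PER_SERVER_W=250       # assumed average draw per apache (hypothetical)
    PORTS_FREE=5           # open outlet pairs left in the rack
    power_fit=$(( (RACK_LIMIT_W - RACK_CURRENT_W) / PER_SERVER_W ))
    # the binding constraint is whichever runs out first: power or outlet pairs
    echo "room for $(( power_fit < PORTS_FREE ? power_fit : PORTS_FREE )) more servers"

With these assumed numbers the answer is 5, matching the "at least 5 more servers" estimate, with the outlet count rather than power being the binding limit.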
[21:16:17] and none in a4/b4? [21:16:40] (or did those get populated to capacity first?) [21:16:42] yep....not enough power outlets [21:16:50] damn lack of power plugs! [21:16:50] they are at max capacity [21:16:55] heh [21:17:02] cool [21:17:02] so all are in c4 [21:17:10] populating racktables now [21:17:30] mark: recall when we ran out of Uspace before power? [21:17:34] heh [21:17:36] yes [21:17:43] those days are gone [21:17:44] we may have to look at 20A instead of 30A in the future [21:17:48] servers have gotten more efficient [21:18:14] pmtpa row c and d are good examples of 20amp [21:19:04] cmjohnson1: So when you are done racking them in row C [21:19:24] you wanna do the network labeling and vlan tagging? (I can walk you through it, LeslieCarr set you up with a limited access account like mine for this) [21:19:54] granted, still must be careful, the limited access is still more than enough to take down the site. [21:19:59] that would be cool [21:20:09] cool, lemme know when you are at that point and we can work on it [21:20:21] k [21:20:43] now where the hell am i gonna get 10 misc servers in tampa... [21:21:54] New patchset: MaxSem; "Revert "(bug 44424) wikiversions.cdb for labs"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46616 [21:22:04] well, i suppose we dont have many items without a service processor [21:22:09] which was b1-sdtpa rack [21:22:14] as it has per outlet power control. [21:22:21] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46616 [21:22:24] its also a rack of misc crap, so it may be the only place for them. [21:23:00] sbernardin: When you have a moment, I would like to know how many power outlets are available in b1-sdtpa, so I dont assign a racking space to them and not have the power [21:23:19] (b1-sdtpa may have less plugs than other racks, as its per outlet power control) [21:25:37] robh: iirc b1 is maxed out or really close (maybe 1...since storage3 is decom'd) [21:27:40] max'd out in plugs ya mean right? [21:27:44] that sucks. [21:28:33] cmjohnson1: you sure? i see lots of open ports [21:29:21] i see 9 ports on tower a open [21:29:33] since this is per outlet control, its also per port sensing [21:29:39] but it doesnt count stuff that isnt powered on [21:29:44] hence my asking sbernardin to confirm [21:29:52] i am pretty confident...but yep..ask sbernardin [21:30:10] different power strips...2 individual strips on each side [21:30:41] yep, and i know that there is a c15 plug on them [21:30:47] so we lose at least 1 or 2 ports per strip to that. [21:30:59] oh well, he is afk unboxing servers im sure ;] [21:31:12] cmjohnson1: you onsite? fix my row C/D cameras so we can all watch you all the time.
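For the "network labeling and vlan tagging" step offered above, the per-port work on a typical Juniper access switch amounts to setting a description and an access VLAN for each new server port, then committing carefully. A rough sketch of what that walkthrough might cover; the switch name, interface, server hostname and VLAN are made-up examples, the syntax varies across Junos versions, and this is not the actual WMF procedure:

    # push the port config to the row C access switch over ssh
    # (assumes the switch CLI accepts configuration commands on stdin)
    ssh asw-c-eqiad.mgmt.example <<'EOF'
    configure
    set interfaces ge-2/0/10 description "mw1161 eth0"
    set interfaces ge-2/0/10 unit 0 family ethernet-switching port-mode access
    set interfaces ge-2/0/10 unit 0 family ethernet-switching vlan members private1-c-eqiad
    commit confirmed 5
    EOF

The "commit confirmed" form rolls the change back automatically unless it is confirmed within the given number of minutes, which is a sensible safety net given that even the limited access described above is, as noted, more than enough to take down the site.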
[21:31:15] bwahahahaha [21:31:45] 'whats that ladder in row A coldrow for???!?!?; (mostly kidding ;) [21:31:56] haha....yeah...so you can sit at your desk in sfo and reminisce [21:32:12] RobH: there are a few ports available....but only one with a normal outlet [21:32:34] ladder in a cold row...no it's in the hot row [21:32:51] the console is in the cold row right now at a1 [21:33:26] cmjohnson1: ladder is totally in cold row a [21:33:29] in front of a8 [21:33:31] !log stopping gluster services on labstore1, this will affect nfs mounts (like ssh keys and public data exports) [21:33:35] you cannot hide from me (except in the camera blind spots) [21:33:36] Normal plug I mean...the rest of the available one's have an alternate looking plug that I don't know about [21:33:42] Logged the message, Master [21:33:47] sbernardin: yep, ok cmjohnson1 was right that rack is full, crappy! [21:33:49] hah...gonna work in the storage room now! [21:34:45] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46561 [21:37:46] RECOVERY - Puppet freshness on db1037 is OK: puppet ran at Tue Jan 29 21:37:32 UTC 2013 [21:38:25] !log authdns update adding mgmt ip's to zone files for hpm misc servers in eqiad [21:38:35] Logged the message, Master [21:40:12] cmjohnson1: in retrospect, we should have named the cameras after aisles [21:40:22] have aisle 1 be row a intake, aisle 2 row a and b output [21:40:24] but whatever. [21:40:26] semantics [21:41:13] RECOVERY - Puppet freshness on db1015 is OK: puppet ran at Tue Jan 29 21:40:58 UTC 2013 [21:42:24] Ok, I do not want to add misc servers to any old rack [21:42:34] but the only rack i have any measure of free space is c2-pmtpa. [21:43:05] i can fit a single server into b1/b3/b4-sdtpa [21:43:12] thats 3 of 10 [21:43:16] =P [21:44:13] RECOVERY - Puppet freshness on db1045 is OK: puppet ran at Tue Jan 29 21:44:00 UTC 2013 [21:46:59] New patchset: Ottomata; "Adding kafka module for review." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/46618 [21:48:07] RECOVERY - Puppet freshness on db62 is OK: puppet ran at Tue Jan 29 21:47:51 UTC 2013 [21:54:16] RECOVERY - Puppet freshness on db1016 is OK: puppet ran at Tue Jan 29 21:54:00 UTC 2013 [21:57:09] !log stopping all gluster services [21:57:19] Logged the message, Master [21:57:48] New patchset: RobH; "adding in eqiad row c, pmtpa rows c and d" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23426 [21:58:10] RECOVERY - Puppet freshness on db1044 is OK: puppet ran at Tue Jan 29 21:57:40 UTC 2013 [21:58:11] RECOVERY - Puppet freshness on db1030 is OK: puppet ran at Tue Jan 29 21:58:03 UTC 2013 [21:58:11] RECOVERY - Puppet freshness on db1023 is OK: puppet ran at Tue Jan 29 21:58:06 UTC 2013 [21:58:22] !log rebooting labstore1 [21:58:32] Logged the message, Master [21:58:48] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23426 [21:59:13] RECOVERY - Puppet freshness on db1031 is OK: puppet ran at Tue Jan 29 21:58:45 UTC 2013 [21:59:23] New patchset: Pyoungmeister; "testing: making db1012 fake es1 host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46624 [22:00:29] PROBLEM - Host labstore1 is DOWN: PING CRITICAL - Packet loss = 100% [22:00:30] !log temp stopping puppet on brewster [22:00:42] Logged the message, notpeter [22:01:01] !log updating torrus files via puppet, sorry if torrus crashes out, its bitchy [22:01:12] Logged the message, RobH [22:01:48] RECOVERY - Host labstore1 is UP: PING OK - Packet loss = 0%, RTA = 26.64 ms [22:02:13] RECOVERY - Puppet freshness on db1012 is OK: puppet ran at Tue Jan 29 22:01:56 UTC 2013 [22:02:30] !log rebooting labstore2 [22:02:40] Logged the message, Master [22:02:55] torrus recompile takes a long time, and doesnt update console during it when via puppet. [22:04:52] PROBLEM - Host labstore2 is DOWN: PING CRITICAL - Packet loss = 100% [22:05:49] PROBLEM - Host db1012 is DOWN: PING CRITICAL - Packet loss = 100% [22:06:09] RECOVERY - Host labstore2 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [22:06:21] RECOVERY - Puppet freshness on db1014 is OK: puppet ran at Tue Jan 29 22:05:51 UTC 2013 [22:07:10] RECOVERY - Puppet freshness on db1029 is OK: puppet ran at Tue Jan 29 22:06:47 UTC 2013 [22:09:58] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [22:10:19] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [22:11:18] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.61 ms [22:11:21] New patchset: RobH; "invalid ps1-a4-pmtpa reference left over" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46627 [22:11:27] !log maxsem Started syncing Wikimedia installation... 
: https://www.mediawiki.org/wiki/Extension:MobileFrontend/Deployments/2013-01-29 [22:11:28] PROBLEM - Host labstore2 is DOWN: PING CRITICAL - Packet loss = 100% [22:11:38] Logged the message, Master [22:11:40] RECOVERY - Host db1012 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [22:11:49] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [22:12:17] RECOVERY - Host labstore2 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [22:15:43] PROBLEM - SSH on db1012 is CRITICAL: Connection refused [22:15:52] PROBLEM - MySQL disk space on db1012 is CRITICAL: Connection refused by host [22:17:18] PROBLEM - SSH on labstore3 is CRITICAL: Connection timed out [22:18:08] RECOVERY - SSH on labstore3 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [22:19:23] New patchset: RobH; "invalid ps1-a4-pmtpa reference left over" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46627 [22:19:37] PROBLEM - Host db1012 is DOWN: PING CRITICAL - Packet loss = 100% [22:20:58] RECOVERY - SSH on db1012 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [22:21:07] RECOVERY - Host db1012 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [22:21:08] PROBLEM - SSH on labstore2 is CRITICAL: Connection timed out [22:21:21] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46627 [22:21:32] <^demon> Someone mind taking a look at the apache log on wikitech? [22:21:36] <^demon> It's giving 500s. [22:21:59] RECOVERY - SSH on labstore2 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [22:25:18] !log maxsem Finished syncing Wikimedia installation... : https://www.mediawiki.org/wiki/Extension:MobileFrontend/Deployments/2013-01-29 [22:25:28] Logged the message, Master [22:26:31] RECOVERY - MySQL disk space on db1012 is OK: DISK OK [22:27:20] !log torrus is all happy again, huzzah [22:27:31] Logged the message, RobH [22:29:42] New patchset: Pyoungmeister; "testing: swapping db1012 to coredb es1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46629 [22:31:43] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46624 [22:34:54] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46629 [22:38:21] !log manual "logrotate" of SAL on wikitech [22:40:02] !log rotated SAL wikitech page - test logging [22:40:31] morebots: log!!! 
[22:41:48] slow bots get beaten [22:42:31] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [22:43:05] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [22:45:44] mutante: maybe you confused it by deleting all the date headers [22:47:02] !log rotated SAL wikitech page - test logging [22:47:03] Logged the message, Master [22:47:12] there you go [22:48:38] !log maxsem synchronized php-1.21wmf8/extensions/MobileFrontend 'https://gerrit.wikimedia.org/r/46630' [22:48:39] Logged the message, Master [22:49:27] thanks TIm [22:49:37] Jan 29 22:48:13 wikitech apache2: PHP Fatal error: Maximum execution time of 30 seconds exceeded in /srv/org/wikimedia/wikitech-1.19.2/includes/diff/DairikiDiff.php on line 1272 [22:50:27] preilly: https://gerrit.wikimedia.org/r/46637 [22:50:31] latest commit [22:51:48] !log maxsem synchronized php-1.21wmf7/extensions/MobileFrontend 'https://gerrit.wikimedia.org/r/46630' [22:51:49] Logged the message, Master [22:52:49] New patchset: Reedy; "Cleanup of InitialiseSettings" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46638 [22:53:01] 1 file changed, 74 insertions(+), 215 deletions(-) [22:53:13] Oh yeah, I'd meant to archive the SAL. [22:57:32] New patchset: MaxSem; "Enable Special:Nearby everywhere" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46639 [22:57:57] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46639 [23:00:00] !log maxsem synchronized wmf-config/InitialiseSettings.php 'Enable Special:Nearby everywhere' [23:00:01] Logged the message, Master [23:03:06] PROBLEM - mysqld processes on db1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:05:01] New patchset: Reedy; "Remove some old stuff from CommonSettings too" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46641 [23:08:24] New patchset: Pyoungmeister; "switching es1 shard to coredb module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46642 [23:11:36] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [23:12:08] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [23:14:06] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46642 [23:14:24] paravoid: have you seen http://ceph.com/community/ceph-bobtail-performance-io-scheduler-comparison/#4mbradoswrite ? 
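Since the wikitech 500s above were tracked down by reading the apache log, a quick sketch of that kind of check; the log paths are typical Debian/Ubuntu defaults and may not match the actual host:

    # look for recent PHP fatals behind HTTP 500s
    tail -n 200 /var/log/apache2/error.log | grep -i 'PHP Fatal'
    # or, where apache logs through syslog as in the line quoted above
    grep 'apache2: PHP Fatal' /var/log/syslog | tail -n 20

In this case the quoted fatal points at DairikiDiff.php hitting the 30-second execution limit, presumably while diffing the very large SAL page edits mentioned above.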
[23:14:39] might be interesting [23:17:42] !log pgehres synchronized php-1.21wmf7/extensions/CentralNotice/ 'Updating CentralNotice to master, resolving namespace issues' [23:17:43] Logged the message, Master [23:18:46] PROBLEM - Host labstore2 is DOWN: PING CRITICAL - Packet loss = 100% [23:19:07] RECOVERY - Host labstore2 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms [23:21:28] !log pgehres synchronized php-1.21wmf8/extensions/CentralNotice/ 'Updating CentralNotice to master, resolving namespace issues' [23:21:28] Logged the message, Master [23:27:17] PROBLEM - Host labstore4 is DOWN: PING CRITICAL - Packet loss = 100% [23:27:46] PROBLEM - Host labstore1 is DOWN: PING CRITICAL - Packet loss = 100% [23:28:35] PROBLEM - Host labstore4 is DOWN: PING CRITICAL - Packet loss = 100% [23:29:16] RECOVERY - Host labstore1 is UP: PING OK - Packet loss = 0%, RTA = 26.62 ms [23:31:45] RECOVERY - Host labstore4 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [23:31:46] RECOVERY - Host labstore4 is UP: PING OK - Packet loss = 0%, RTA = 26.60 ms [23:37:07] New patchset: Mwalker; "Add CNBanner namespace by hand" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46647 [23:38:06] Change merged: Pgehres; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46647 [23:41:19] Another potential fallout issue from datacenter migration: https://bugzilla.wikimedia.org/show_bug.cgi?id=44485 - fatalmonitor currently displays only errors from testwiki [23:41:23] Would anybody be willing to investigate? [23:42:43] MaxSem: Critical? Really? [23:42:57] The apache syslogs are presumably being written somewhere else... [23:43:04] Reedy, what if there were a problem during deployment? [23:43:16] fatal logs on fluorine? [23:43:25] exception logs on fluorine? [23:43:41] I already had to revert an undeployed config change because I wasn't confident of deploying it w/o fatalmonitor [23:43:41] I wonder if udp2log works properly cross-DC [23:43:55] Or if the traffic isn't filtered by some rule not expecting 10.64/16 addresses [23:44:33] Looking at puppet, the location is still /h/w/l/syslog/apache.log [23:44:47] So presumably the traffic not reaching the collector is the issue [23:45:15] ^demon: you about? I am looking at your access request ticket https://rt.wikimedia.org/Ticket/Display.html?id=4066 [23:45:30] mutante submitted a patchset that would do what you think i need. [23:45:37] RoanKattouw: Might be easier just moving the relevant puppet classes to bast1001 [23:46:01] bleh... nm [23:46:08] i see faidon commented with an alternative. [23:46:25] mutante: Did you wanna adjust your patchset accordingly? https://gerrit.wikimedia.org/r/#/c/42791/1 ? [23:46:30] New patchset: Reedy; "Cleanup of InitialiseSettings" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46638 [23:46:35] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46638 [23:46:53] Reedy: Possibly.... but we wouldn't want logging to break if/when we have to do a switchover to pmtpa [23:47:30] That's not likely [23:47:42] I guess we have fluorine for that [23:47:46] Wait wait [23:47:51] fluorine is *in eqiad* [23:47:53] wtf [23:47:54] ja [23:47:58] OK now I don't get it any more [23:48:05] what's wrong with fluorine? 
its been collecting logs [23:48:20] andre__ Another potential fallout issue from datacenter migration: https://bugzilla.wikimedia.org/show_bug.cgi?id=44485 - fatalmonitor currently displays only errors from testwiki [23:48:30] binasher: That's what's supposedly wrong [23:48:44] if eqiad is wrong, I don't wanna be right [23:48:51] * RoanKattouw_away actually goes away [23:49:07] binasher: apache syslogs weren't moved [23:49:12] looks at my tail at dberror.log on fluorine.. it has recent stuff: Tue Jan 29 23:38:16 UTC 2013 mw1129 commonswiki WikiPage::updateCategoryCounts 10.64.16.27 1213 Deadlock found when trying to get lock; try restarting transaction (10.64.16.27) [23:49:22] binasher: everything else seems fine [23:49:27] <^demon> RobH: `sudo -u apache strace` sounds sane, if everyone agrees. [23:49:51] binasher: /home/wikipedia/syslog/apache.log [23:50:09] binasher: manifests/misc/logging.pp [23:51:29] ^demon: i think so too, no one wants to chat or take it over [23:51:54] New patchset: Reedy; "Remove some old stuff from CommonSettings too" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46641 [23:53:01] New patchset: Reedy; "Remove some old stuff from CommonSettings too" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46641 [23:53:54] ^demon: so yea [23:53:54] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46641 [23:54:00] if i can figure out how the hell to do it [23:54:04] im gonna do it and merge it. [23:54:16] its over 3 days old and only paravoid felt the need to comment on mutante's patchset [23:54:21] that means to me that everyone else in ops agrees [23:54:32] as the rule we all agreed on was 3 working days without argument = approval. [23:54:55] !log reedy synchronized wmf-config/ [23:54:56] Logged the message, Master [23:56:16] Reedy: the syslog problem is $syslog_remote_real in base.pp [23:56:31] eqiad apaches send their syslog stream to syslog.eqiad.wmnet [23:56:33] which doesn't exist [23:56:46] wheee [23:56:59] i wonder why syslog was never moved off fenari to fluorine [23:57:59] i guess the quickest logging fix would be to add a syslog.eqiad cname to fenari that can be moved to a local box later [23:58:32] rsyslogd on fluorine should take over though [23:59:02] Indeed [23:59:10] I know Tim migrated most of the stuff, just not everything [23:59:26] Could you add a cname for the time being? Or are you going to fix it properly? ;)
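To close the loop on the fix binasher sketches above: the quick version is a CNAME so syslog.eqiad.wmnet resolves again, plus a check that the apaches' forwarded stream actually arrives somewhere. The zone record, hostnames and destination file below are illustrative of the idea, not the exact records or paths that ended up being used:

    # 1. confirm the forwarding target currently fails to resolve
    host syslog.eqiad.wmnet
    # 2. illustrative record for the eqiad.wmnet zone, pointing the name at an existing
    #    collector (fenari, per the suggestion above) until a local box takes over:
    #        syslog    IN    CNAME    fenari.wikimedia.org.
    # 3. after the authdns update, send a test message from an eqiad apache
    logger -t syslog-test "cross-DC syslog check from $(hostname)"
    # 4. and confirm it shows up on the collector; the exact destination file depends on
    #    the collector's rsyslog rules (apache.log is the path mentioned above)
    grep syslog-test /home/wikipedia/syslog/apache.log | tail

If rsyslogd on fluorine is already listening, pointing the name there instead of fenari would skip the later migration step, which seems to be what "rsyslogd on fluorine should take over" is hinting at.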