[00:00:17] *affect [00:01:32] preilly: got you https://gerrit.wikimedia.org/r/46472 as well :-] [00:01:32] maybe the claim TTL was exceeded [00:01:40] while the job processes were still running [00:01:55] preilly: the linting tool "pylint" reports a bunch of minor issues, might be worth a look at [00:02:03] how many emails are in one jobs? heh [00:02:09] * AaronSchulz looks [00:02:29] so maybe those emails were actually sent, some multiple times [00:02:33] preilly: and you might want to enforce fast-forward merge on sartoris repo to prevents the mad merges Gerrit creats [00:02:43] 3 times perhaps [00:03:17] TimStarling: if that is true it would be in runJobs.log [00:03:23] since it gives the run time [00:03:42] * AaronSchulz looks [00:05:55] hashar: you want Fast Forward Only? [00:06:01] rfaulkner: so stat1001 wont run puppet and update with your name [00:06:10] until ottomata fixes whatever this duplicate puppet webserver thing is [00:06:21] (im advised ottomata knows of this issue?) [00:06:31] ottomata: err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: Apache_site[000_default] is already defined in file /var/lib/git/operations/puppet/manifests/webserver.pp at line 320; cannot redefine at /var/lib/git/operations/puppet/manifests/webserver.pp:320 on node stat1001.wikimedia.org [00:06:33] that you? [00:06:46] RobH: that's easy to fix you just pass --incredible-faulk [00:06:55] haha [00:07:10] it's an undocumented option [00:07:54] preilly: I enforce fast forward on the integration/* repository, but that is just to have a clean history :-] [00:08:14] preilly: you might want to have long topic branches that get merged form time to time [00:08:36] hashar: yeah [00:12:34] preilly: bah I created the ohloh project but can't figure out how to add you as a project manager :D [00:12:35] https://www.ohloh.net/p/wikimedia-sartoris [00:12:43] TimStarling: not seeing enotifNotify jobs that took that long around that day in the logs [00:14:26] hashar: this is my account: https://www.ohloh.net/accounts/preillyme [00:14:48] preilly: maybe you have to apply for the manager position [00:15:34] well probably not a big issue [00:15:40] hashar: https://www.ohloh.net/p/wikimedia-sartoris/managers (pending) [00:15:55] preilly: you are such a hacker [00:16:11] granted! [00:16:11] :-] [00:16:13] hashar: thanks [00:16:23] * preilly isn't sure if hashar just insulted him [00:16:24] which server is the otrs database [00:16:32] this way if I die while sleeping, someone can take over easily \O/ [00:16:33] preilly: always assume yes and get mad. [00:16:50] RobH: re duplicate 000-default, yeah i know about that one, I can figure it out, [00:17:08] ottomata: cool, its affecting stat1001 and rfaulkner's access to it as well [00:17:13] just fyi [00:17:18] ok [00:17:21] RobH: Okay [00:17:23] i'll get to both of those tomorrow [00:17:28] coolness, thanks! [00:17:38] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46220 [00:17:51] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 221 seconds [00:17:55] RobH: you do know that I'm now going to call you, "Runkle" right? 
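Tim's hypothesis above is that the enotifNotify jobs ran longer than the queue's claim TTL, so the claim expired while the first runner was still working and another runner picked the same job up again, sending the same mails two or three times. A minimal sketch of how a claim-TTL queue produces exactly that duplicate execution; the class and the numbers are illustrative, not MediaWiki's actual JobQueue code:

class ClaimTtlQueue:
    """Toy job queue: a claimed job becomes claimable again once its claim
    is older than claim_ttl, even if the first runner has not finished --
    which is how one slow enotifNotify job can run more than once."""

    def __init__(self, claim_ttl):
        self.claim_ttl = claim_ttl
        self.jobs = []  # each entry: {"id": ..., "claimed_at": None or timestamp}

    def push(self, job_id):
        self.jobs.append({"id": job_id, "claimed_at": None})

    def pop(self, now):
        for job in self.jobs:
            claimed = job["claimed_at"]
            # Unclaimed, or the claim outlived its TTL: hand the job out (again).
            if claimed is None or now - claimed > self.claim_ttl:
                job["claimed_at"] = now
                return job["id"]
        return None

    def ack(self, job_id):
        # Only an explicit ack removes the job; a slow runner acks too late.
        self.jobs = [j for j in self.jobs if j["id"] != job_id]

queue = ClaimTtlQueue(claim_ttl=120)   # 120s claim TTL, purely illustrative
queue.push("enotifNotify-1234")

first = queue.pop(now=0)      # runner A claims the job at t=0
second = queue.pop(now=200)   # runner B re-claims it at t=200, after the TTL expired
print(first, second)          # both runners now send the same notifications

The run times recorded in runJobs.log are what would confirm or rule this out, which is what AaronSchulz goes looking for above.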
[00:18:10] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 191 seconds [00:18:18] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46221 [00:18:40] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 267 seconds [00:19:16] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 253 seconds [00:20:10] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [00:20:26] RobH: are you working on an RT ticket with that OTRS question? :D maybe? [00:20:28] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 14 seconds [00:20:30] 4430 [00:20:34] probably not :P [00:20:51] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 9 seconds [00:20:55] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [00:21:41] RD: the run the query and attach output? [00:21:45] i wish fundraising would alter table engine=innodb and run non-blocking db backups [00:22:15] RECOVERY - Solr on solr3 is OK: HTTP OK HTTP/1.1 200 OK - 3078 bytes in 0.008 seconds [00:22:16] RECOVERY - Solr on solr1001 is OK: HTTP OK HTTP/1.1 200 OK - 3078 bytes in 0.054 seconds [00:22:34] Yeah, that's just the latest other thing I heard about. I'm just 'interested' in the subject [00:22:42] RECOVERY - Solr on solr1 is OK: HTTP OK HTTP/1.1 200 OK - 3078 bytes in 0.007 seconds [00:22:43] RECOVERY - Solr on solr2 is OK: HTTP OK HTTP/1.1 200 OK - 3078 bytes in 0.015 seconds [00:23:09] RECOVERY - Solr on solr1003 is OK: HTTP OK HTTP/1.1 200 OK - 3078 bytes in 0.057 seconds [00:23:36] RECOVERY - Solr on solr1002 is OK: HTTP OK HTTP/1.1 200 OK - 3078 bytes in 0.058 seconds [00:25:58] Reedy: https://rt.wikimedia.org/Ticket/Display.html?id=4430 is resolved, attached output of otrs closed account query [00:27:19] Great, thankyou :) [00:27:50] glad to help [00:35:05] err: /Stage[main]/Squid/File[/etc/squid/squid.conf]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///volatile/squid/squid.conf/knsq18.esams.wikimedia.org at /var/lib/git/operations/puppet/manifests/squid.pp:44 [00:35:07] sigh. [00:36:35] !log crazy memory and cpu spike on knsq18 for no apparent reason, restarting process (memory leak?) [00:36:46] Logged the message, RobH [00:44:40] !log squid backend on knsq48 rebuilding [00:44:50] Logged the message, RobH [00:44:50] !log typo, knsq18 [00:45:01] Logged the message, RobH [01:17:03] TimStarling: can you look at https://gerrit.wikimedia.org/r/#/c/45510/ (I'm not 100% sure on it) [01:20:27] 1024 seems quite low [01:23:24] oh, yeah I forgot to bump that up [01:23:44] so the whole think should be < 1000kb, so 1kb is too low [01:23:57] maybe 100kb? [01:24:15] yeah, that should be enough [01:24:23] 1KB would hit JPEGs with EXIF metadata [01:24:27] i.e. pretty much every image [01:29:42] AaronSchulz: http://paste.tstarling.com/p/RCQcOb.html [01:30:13] yeak, 100k looks good [01:30:28] I like your sha1 sampling trick [01:52:23] TimStarling: it may help to backport that after all wikis are on wmf8 to avoid dueling caches when wmf9 is rolled out (e.g. 
increased invalidations) [01:53:02] crosswiki memcached use and het deploy and key name or format changes can be tricky [01:53:17] yeah [01:53:25] I think it's what what was confusing the job queue last week [01:53:44] * AaronSchulz will removes his job hack from puppet when wmf8 is on everything [01:53:54] make sure Reedy knows [01:54:02] Reedy: ^ [01:54:06] TimStarling: there ;p [01:55:01] !log adding ldap entries for combined sysadmin/netadmin role, projectadmin [01:55:14] Logged the message, Master [01:59:39] TimStarling: so asher and I have talked about possibly moving the job queue to redis. Would it be worth talking about that some time? [02:00:29] * AaronSchulz would like to get that load of the primary dbs [02:00:32] *off [02:02:59] You've done what now? [02:03:17] Reedy: just a few landmines, nothing you can't handle [02:05:30] Reedy: I think I scared tim off [02:10:09] * AaronSchulz will pester tim later [02:17:41] !log deleting sysadmin and netadmin roles from ldap [02:17:54] Logged the message, Master [02:18:00] \o/ [02:18:02] LeslieCarr: hey, back [02:23:16] PROBLEM - Puppet freshness on db1031 is CRITICAL: Puppet has not run in the last 10 hours [02:24:28] PROBLEM - Puppet freshness on db1037 is CRITICAL: Puppet has not run in the last 10 hours [02:27:28] PROBLEM - Puppet freshness on db1012 is CRITICAL: Puppet has not run in the last 10 hours [02:28:01] paravoid: is the \o/ for me, or LeslieCarr ? [02:28:02] heh [02:28:07] for you [02:28:11] heh [02:28:22] yeah. this should make things a little easier [02:28:31] yeah, I hadn't thought it before [02:28:31] PROBLEM - Puppet freshness on db1014 is CRITICAL: Puppet has not run in the last 10 hours [02:28:32] PROBLEM - Puppet freshness on db1015 is CRITICAL: Puppet has not run in the last 10 hours [02:28:44] but when you threw the idea it totally made sense to me [02:29:05] maybe it'll again make sense at some point in our distant future [02:29:34] PROBLEM - Puppet freshness on db1023 is CRITICAL: Puppet has not run in the last 10 hours [02:30:05] yeah [02:30:20] well, when labs first started, those roles were hardcoded in labs [02:30:21] err [02:30:22] in nova [02:30:28] PROBLEM - Puppet freshness on db1030 is CRITICAL: Puppet has not run in the last 10 hours [02:30:38] they turned that into a policy based system in essex [02:30:43] !log LocalisationUpdate completed (1.21wmf8) at Tue Jan 29 02:30:42 UTC 2013 [02:30:46] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 188 seconds [02:30:54] Logged the message, Master [02:30:54] or possibly diablo [02:31:01] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 191 seconds [02:31:04] 2013-01-29 02:30:54.611314 mon.0 [INF] pgmap v2048808: 16952 pgs: 16911 active+clean, 12 active+remapped+wait_backfill, 15 active+remapped+backfilling, 3 active+degraded+backfilling, 2 active+clean+scrubbing, 1 active+degraded+remapped+backfilling, 8 active+clean+scrubbing+deep; 25008 GB data, 53347 GB used, 186 TB / 238 TB avail; 71454/122575532 degraded (0.058%) [02:31:04] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 192 seconds [02:31:11] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 193 seconds [02:31:13] * paravoid waits for that go to 0% [02:41:34] PROBLEM - Puppet freshness on db1029 is CRITICAL: Puppet has not run in the last 10 hours [02:44:34] PROBLEM - Puppet freshness on db1045 is CRITICAL: Puppet has not run in the last 10 hours [02:44:34] PROBLEM - Puppet freshness on db1044 
is CRITICAL: Puppet has not run in the last 10 hours [02:48:28] PROBLEM - Puppet freshness on db1016 is CRITICAL: Puppet has not run in the last 10 hours [02:55:45] !log LocalisationUpdate completed (1.21wmf7) at Tue Jan 29 02:55:44 UTC 2013 [02:55:55] Logged the message, Master [02:57:19] paravoid: are you logged on to analytics1001 ? [02:57:24] I am [02:57:27] I stopped apache [02:57:32] just mailed abou that. [02:57:34] paravoid: Okay thanks [03:12:05] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [03:12:24] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [03:12:55] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [03:13:04] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [03:13:24] PROBLEM - Parsoid Varnish on celsus is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:41:24] PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:41:33] that'd be me [03:41:36] sorry for paging [03:42:15] RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 247 bytes in 0.001 second response time [03:42:16] paravoid: how are the bugs going? [03:42:19] I'm trying to load the ceph cluster [03:42:34] good actually [03:43:06] finally got rid of h310 boxes [03:51:31] Aaron|home: quick silly question [03:51:55] Aaron|home: if I request a thumb size from thumb.php that exists in swift, will it be regenerated or served from swift? [03:52:17] it won't be regenerated [03:52:24] I guessed as much [03:52:27] what happens with multiwrite? [03:52:38] does it check on all backend stores? [03:52:46] no, just the main one [03:53:16] okay [03:53:29] I'm wondering if we can abuse all that to gradually fill up ceph with thumbs [03:53:33] instead of one large sync process [03:55:11] Ryan_Lane: salt-minion dies and dmesg complains every 30' [03:55:28] logs are filled with salt-minion terminated with status 42 [04:00:28] Aaron|home: what's your take? [04:01:21] how long would this be running? [04:01:46] with the current avg rate > 1 month [04:01:56] right [04:02:09] I'll try to optimize it [04:02:35] I thought we weren't going to use multiwrite? [04:03:06] but copying them from swift on ceph misses for a month or two while serving originals and ceph thumb hits [04:03:13] might be a nice alternative [04:03:24] or very ugly [04:03:44] erm what do you mean? [04:03:57] oh for thumbs [04:04:08] yeah but that assumes we have some other way of doing this... [04:05:46] anyway, I have to add 5 more boxes to the cluster (up from 7 right now) [04:05:52] and find ways to optimize this [04:06:04] i.e. find the bottleneck first :) [04:06:27] I guess we can use multiwrite, *shrug* and call resyncFile() via some hook in thumb.php [04:07:02] *resyncFiles() [04:07:11] nod [04:07:17] that fill ceph with anything in swift [04:07:23] (for the given thumbnail) [04:07:29] nod [04:07:37] I wonder what that will do to user-facing latency... [04:08:14] so the ceph cluster will be a single-dc cluster in ashburn? [04:08:38] yes [04:08:49] TimStarling: so asher and I have talked about possibly moving the job queue to redis. Would it be worth talking about that some time? [04:08:51] * Aaron|home is just being sure [04:08:57] yes, that would be worth talking about [04:09:03] paravoid: what does that give us over swift atm? 
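The plan Aaron and paravoid converge on here -- keep multiwrite but hook a resyncFiles()-style copy into thumb.php -- means every thumbnail request lazily copies the object from Swift into Ceph, so the new cluster fills itself over a month or two of normal traffic instead of via one huge sync. A rough sketch of that lazy-backfill pattern, using made-up store objects rather than MediaWiki's real FileBackendMultiWrite/thumb.php code:

class LazyBackfillBackend:
    """Reads are answered from the primary store only (as multiwrite does);
    whatever is read gets copied into the secondary store if missing, so the
    secondary gradually accumulates every thumbnail that is actually requested."""

    def __init__(self, primary, secondary):
        self.primary = primary      # stand-in for Swift
        self.secondary = secondary  # stand-in for Ceph/radosgw

    def get_thumb(self, path):
        data = self.primary.get(path)
        if data is None:
            return None             # a real 404 would go to the image scalers instead
        if not self.secondary.exists(path):
            # the resyncFiles()-style step, triggered from the request path
            self.secondary.put(path, data)
        return data

class DictStore(dict):
    # Tiny in-memory stand-in so the sketch is runnable on its own.
    def get(self, path):        return super().get(path)
    def exists(self, path):     return path in self
    def put(self, path, data):  self[path] = data

swift, ceph = DictStore(), DictStore()
swift.put("thumb/a/ab/Foo.jpg/120px-Foo.jpg", b"jpeg bytes")

backend = LazyBackfillBackend(swift, ceph)
backend.get_thumb("thumb/a/ab/Foo.jpg/120px-Foo.jpg")
print(ceph.exists("thumb/a/ab/Foo.jpg/120px-Foo.jpg"))   # True: copied on first request

The open question raised in the log -- what this does to user-facing latency -- is the cost of doing that copy synchronously in the request path.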
[04:09:17] sage said they're planning for async replication in the radosgw layer by the summer [04:09:33] TimStarling: maybe we can bring back the weekly calls ;) [04:09:35] to be implemented until then [04:09:54] async on the rados layer is a long-term goal of theirs [04:09:55] so any specific month? [04:09:59] not going to come anytime soon [04:10:18] does it seem to have lower latency? [04:10:27] ceph, that is [04:10:31] "this spring/summer" [04:11:15] 03:42 there will eventually be dr in rados itself, but it's a harder problem than rgw, and won't look quite the same. [04:11:31] lower latency than what? [04:11:32] swift? [04:11:39] yes [04:11:46] haven't really benchmarked [04:12:15] I'd guess writes are better because of journaling in SSDs [04:12:28] so subsequently reads are also faster because of less disk seeks [04:12:34] but that's just guesses so far [04:12:40] does rgw do connection pooling? [04:12:50] of what? [04:12:58] connections to osds [04:14:00] it keeps connections with all osds it communicates with [04:14:10] TimStarling: you can look at https://gerrit.wikimedia.org/r/#/c/39174/ see what I have in mind [04:17:21] paravoid: so I was thinking about using ffmpeg + swift tempurls for oggs in addition to webm (as now), which reminded me of the double GET bug in swift for RANGE requests [04:18:55] it's awkward since upgraded swift is low-priority now but it will be a while before we are on ceph [04:19:13] upgraded to what? [04:19:19] 1.7.5? [04:20:02] yep [04:20:22] I'm not sure if I'd do that [04:20:34] well yeah it's kind of tedious atm [04:20:35] it's an interim release, we currently run off openstack releases [04:20:46] with canonical's packages and everything [04:21:04] I guess it wouldn't be much work... [04:21:15] https://bugs.launchpad.net/swift/+bug/1065869 [04:21:24] really quite an obnoxious bug [04:21:36] I think we should wait until all swift hardware swaps are done though... [04:21:46] what's left? [04:21:52] *sigh* [04:21:59] a few more boxes [04:22:00] plus [04:22:08] replacing all the ones we just replaced [04:22:16] because of h310->h710 swaps [04:22:22] gaaah [04:22:28] did I *sigh*? [04:22:44] yeah, fun [04:22:50] too much fighting for each inch [04:24:12] huh [04:24:19] I'm looking at the patch for that bug [04:24:25] it's three lines [04:24:29] applies cleanly on 1.7.4 [04:24:38] maybe that's a nice option in the meantime [04:24:52] five lines maybe [04:24:54] https://review.openstack.org/#/c/14497/3/swift/proxy/controllers/obj.py [04:25:44] oh and there's 1.7.6 already [04:28:12] okay [04:28:14] 6:30am [04:28:17] time to sleep [04:29:02] heh [04:29:10] strange hours [04:29:55] you think? :) [04:30:33] paravoid: let me know if you merge that patch [04:30:53] $ ls -l php-1.21wmf7/cache/*.cdb [04:30:54] -rw-rw-r-- 1 reedy wikidev 891138 Jan 17 22:17 php-1.21wmf7/cache/interwiki.cdb [04:30:54] -rw-rw-r-- 1 reedy wikidev 891142 Jan 10 19:12 php-1.21wmf7/cache/trusted-xff.cdb [04:31:00] suspiciously similar sizes [04:31:18] $ ls -l php-1.21wmf6/cache/trusted-xff.cdb [04:31:19] -rw-rw-r-- 1 demon wikidev 2319902 Oct 3 05:21 php-1.21wmf6/cache/trusted-xff.cdb [04:32:26] * Aaron|home slowly grins [04:33:02] md5sum *.cdb ? 
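The md5sum comparison that follows shows what went wrong: wmf7's trusted-xff.cdb has the same content hash as wmf6's interwiki.cdb, i.e. the wrong file is sitting under that name. A hypothetical helper in the spirit of that manual check -- run from a MediaWiki deploy checkout, it groups the cache/*.cdb files by content hash and flags any hash shared by files with different basenames:

import hashlib
from pathlib import Path
from collections import defaultdict

def md5_of(path):
    # Hash in 1 MiB chunks so big CDB files need not fit in memory at once.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def find_suspect_cdbs(root="."):
    """Scan php-1.21wmf*/cache/*.cdb under the deploy root and report any
    content hash that appears under more than one file name."""
    by_hash = defaultdict(list)
    for cdb in Path(root).glob("php-1.21wmf*/cache/*.cdb"):
        by_hash[md5_of(cdb)].append(cdb)
    for digest, files in sorted(by_hash.items()):
        if len({f.name for f in files}) > 1:
            print(digest, "->", ", ".join(str(f) for f in files))

find_suspect_cdbs()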
[04:33:15] gah, that would be useless [04:33:24] * Aaron|home gets tired [04:33:58] $ md5sum php-1.21wmf6/cache/interwiki.cdb php-1.21wmf7/cache/trusted-xff.cdb [04:33:58] 849e2d2a39f9efda9e0b290f5158983d php-1.21wmf6/cache/interwiki.cdb [04:33:58] 849e2d2a39f9efda9e0b290f5158983d php-1.21wmf7/cache/trusted-xff.cdb [04:34:22] ahem [04:36:02] !log tstarling synchronized php-1.21wmf7/cache/trusted-xff.cdb 'fixing corrupted CDB file' [04:36:13] Logged the message, Master [04:36:44] !log tstarling synchronized php-1.21wmf8/cache/trusted-xff.cdb 'fixing corrupted CDB file' [04:36:54] Logged the message, Master [04:38:16] paravoid: on which system [04:38:17] ? [04:38:40] last time I checked this was due to puppet running and changing values in the config file [04:39:58] that should have been fixed a while ago [04:47:39] !log tstarling synchronized php-1.21wmf8/cache/trusted-xff.cdb [04:47:49] Logged the message, Master [04:51:39] !log tstarling synchronized php-1.21wmf8/cache/trusted-xff.cdb [04:51:49] Logged the message, Master [04:53:45] !log tstarling synchronized php-1.21wmf8/cache/trusted-xff.cdb [04:53:55] Logged the message, Master [04:57:52] !log tstarling synchronized php-1.21wmf7/cache/trusted-xff.cdb [04:58:01] Logged the message, Master [04:58:13] !log tstarling synchronized php-1.21wmf6/cache/trusted-xff.cdb [04:58:23] Logged the message, Master [05:57:36] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45330 [06:17:48] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [06:18:36] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [06:36:16] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 188 seconds [06:36:26] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 198 seconds [06:37:01] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 211 seconds [06:37:10] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 214 seconds [06:41:16] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [06:41:26] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [06:41:40] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [06:42:07] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [06:53:17] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 181 seconds [06:53:40] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 188 seconds [06:53:46] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 189 seconds [06:55:01] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 204 seconds [07:00:47] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [07:00:52] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [07:01:16] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [07:02:04] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [08:20:35] !log pgehres synchronized php-1.21wmf8/extensions/CentralNotice/ 'Updating CentralNotice to master' [08:20:46] Logged the message, Master [09:00:23] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [09:12:58] New patchset: Hashar; "(bug 44424) wikiversions.cdb for labs" 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46240 [09:14:08] New review: Hashar; "Oh I forgot the full path stuff, that should be fixed with PS2." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/46240 [09:36:32] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [09:36:32] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: Puppet has not run in the last 10 hours [09:36:32] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [09:36:33] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [09:36:33] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [09:36:33] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [09:36:33] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [09:36:34] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [10:13:41] New review: Hashar; "Sorry for not having followed up on that issue. Here is a summary of a discussion we had in a restri..." [operations/mediawiki-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/31580 [10:41:11] Change abandoned: MaxSem; "Per the above." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31580 [12:24:00] PROBLEM - Puppet freshness on db1031 is CRITICAL: Puppet has not run in the last 10 hours [12:26:06] PROBLEM - Puppet freshness on db1037 is CRITICAL: Puppet has not run in the last 10 hours [12:30:11] PROBLEM - Puppet freshness on db1012 is CRITICAL: Puppet has not run in the last 10 hours [12:30:11] PROBLEM - Puppet freshness on db1015 is CRITICAL: Puppet has not run in the last 10 hours [12:30:11] PROBLEM - Puppet freshness on db1023 is CRITICAL: Puppet has not run in the last 10 hours [12:30:11] PROBLEM - Puppet freshness on db1014 is CRITICAL: Puppet has not run in the last 10 hours [12:31:23] PROBLEM - Puppet freshness on db1030 is CRITICAL: Puppet has not run in the last 10 hours [12:34:04] New review: Brian Wolff; ">"whenever a file is updated, we have to purge each thumbnails ever generated"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31580 [12:42:29] PROBLEM - Puppet freshness on db1029 is CRITICAL: Puppet has not run in the last 10 hours [12:45:29] PROBLEM - Puppet freshness on db1045 is CRITICAL: Puppet has not run in the last 10 hours [12:45:29] PROBLEM - Puppet freshness on db1044 is CRITICAL: Puppet has not run in the last 10 hours [12:49:23] PROBLEM - Puppet freshness on db1016 is CRITICAL: Puppet has not run in the last 10 hours [14:03:00] paravoid: ahhh faidon :-] [14:03:03] hey [14:03:12] slept well ? ready for some README.md madness and a few debian packages reviews ? :-] [14:10:33] PROBLEM - spamassassin on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:10:43] PROBLEM - Exim SMTP on sodium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:10:53] PROBLEM - HTTPS on sodium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:11:13] PROBLEM - HTTP on sodium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:11:23] PROBLEM - mailman on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:11:24] PROBLEM - SSH on sodium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:12:36] PROBLEM - mailman on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:12:54] looking [14:13:12] PROBLEM - SSH on sodium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:13:12] PROBLEM - spamassassin on sodium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:13:47] PROBLEM - HTTP on sodium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:13:48] PROBLEM - Exim SMTP on sodium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:14:53] !log powercycled sodium [14:14:59] PROBLEM - HTTPS on sodium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:15:04] Logged the message, Master [14:22:13] RECOVERY - SSH on sodium is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:22:24] RECOVERY - spamassassin on sodium is OK: PROCS OK: 4 processes with args spamd [14:22:33] RECOVERY - Exim SMTP on sodium is OK: SMTP OK - 0.230 sec. response time [14:22:43] RECOVERY - HTTPS on sodium is OK: OK - Certificate will expire on 08/22/2015 22:23. [14:23:03] RECOVERY - HTTP on sodium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 190 bytes in 0.003 second response time [14:23:04] RECOVERY - SSH on sodium is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [14:23:04] RECOVERY - Exim SMTP on sodium is OK: SMTP OK - 0.254 sec. response time [14:23:14] RECOVERY - mailman on sodium is OK: PROCS OK: 10 processes with args mailman [14:23:31] RECOVERY - HTTP on sodium is OK: HTTP OK - HTTP/1.0 301 Moved Permanently - 0.054 second response time [14:23:32] RECOVERY - mailman on sodium is OK: PROCS OK: 10 processes with args mailman [14:23:58] RECOVERY - HTTPS on sodium is OK: OK - Certificate will expire on 08/22/2015 22:23. 
[14:23:58] RECOVERY - spamassassin on sodium is OK: PROCS OK: 4 processes with args spamd [14:34:13] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 185 seconds [14:34:28] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 192 seconds [14:34:33] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 195 seconds [14:37:13] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [14:37:33] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 0 seconds [14:37:55] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [14:48:07] New patchset: Mark Bergsma; "Handle RADOSGW (Swift API) url rewriting for the basic case" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44067 [14:48:08] New patchset: Mark Bergsma; "Implement thumb 404 image scaler handling" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44068 [14:48:08] New patchset: Mark Bergsma; "Add timeline, math, score rewrites" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44072 [14:48:08] New patchset: Mark Bergsma; "Remove double slashes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44076 [14:48:09] New patchset: Mark Bergsma; "Support project/language prefixes for math" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44077 [14:48:09] New patchset: Mark Bergsma; "Set CORS header in vcl_deliver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44078 [14:54:19] New review: Hashar; ":-]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45330 [15:30:43] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:33:43] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:50:06] ottomata: could you please start getting us a list of destinations/sources/ports/protocols the analytics cluster needs to communicate with? [15:50:14] I'd like to have an ACL in place by the end of the week [15:50:30] RT #4433 can be used to assemble the list [15:51:11] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44067 [15:53:00] mark: [15:53:02] https://ccp.cloudera.com/display/CDH4DOC/Configuring+Ports+for+CDH4 [15:53:03] https://www.mediawiki.org/wiki/Analytics/Kraken/JMX_Ports [15:53:17] New patchset: Mark Bergsma; "Revert "Cleanup ldap script formatting"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46530 [15:53:31] sure, but mark you mean outside of the analytics cluster, right? [15:53:37] yes [15:53:40] sure [15:53:50] anything between the analytics cluster and other WMF, non-analytics servers [15:54:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46530 [15:54:18] you can assume that the basics will be there, i.e. 
puppetmaster, dns, ntp [15:54:28] anything every server does [16:12:42] New patchset: Mark Bergsma; "Add the image_scalers backend to varnish upload backends" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46533 [16:13:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46533 [16:18:16] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [16:19:30] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [16:21:41] New patchset: Mark Bergsma; "Add image_scalers as backend also, not just director" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46534 [16:22:04] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46534 [16:32:14] mark: Pardon my question, but what do these two image scaler patches just mentioned here actually do/change? I'd like to understand it, as there's been reports this weekend about reuploads to commons plus invalidating old cached thumbnails failing [16:32:34] they do nothing yet [16:32:45] this is in preparation for introducing ceph [16:32:49] New patchset: Jgreen; "flipped payments.wikimedia.org lvs monitoring from ptmpa to eqiad" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46535 [16:33:21] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46535 [16:34:04] ah. Gotcha. Thanks! [16:37:03] New patchset: Mark Bergsma; "Implement thumb 404 image scaler handling" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44068 [16:41:36] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44068 [16:45:49] New patchset: Silke Meyer; "Add a variable to enable/disable experimental Wikidata features in labsconsole" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46536 [16:47:23] New review: Silke Meyer; "In wikidata-client-requires.php I also moved the respective lines up to make sure they are included ..." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/46536 [16:54:10] MaxSem: hey [16:54:15] paravoid, hey [16:54:32] I asked Tomasz about Copenhagen on Sunday [16:54:39] and he said it wasn't sure it was going to happen yet [16:55:16] now the only way it can't happen is that they don't give me a visa [16:56:45] New patchset: Mark Bergsma; "return (restart);" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46537 [17:02:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46537 [17:10:10] !log demon synchronized wmf-config/InitialiseSettings.php [17:10:21] Logged the message, Master [17:10:36] New patchset: Demon; "Simplewikiquote was still editable" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46538 [17:10:53] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46538 [17:15:56] New patchset: Mark Bergsma; "Add timeline, math, score rewrites" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44072 [17:15:56] New patchset: Mark Bergsma; "Remove double slashes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44076 [17:15:56] New patchset: Mark Bergsma; "Support project/language prefixes for math" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44077 [17:15:56] New patchset: Mark Bergsma; "Set CORS header in vcl_deliver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44078 [17:17:19] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44072 [17:25:00] New patchset: Mark Bergsma; "Remove double slashes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44076 [17:25:00] New patchset: Mark Bergsma; "Support project/language prefixes for math" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44077 [17:25:00] New patchset: Mark Bergsma; "Set CORS header in vcl_deliver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44078 [17:26:08] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44077 [17:26:30] mark: What kind of requests does that serve Access-Control-Allow-Origin: * for? [17:26:39] all [17:26:47] .... oh on upload.wm.o only though? 
[17:26:50] yes [17:26:55] this is already set on all swift objects [17:27:01] * RoanKattouw just noticed the filename [17:27:01] phew [17:27:01] OK [17:27:01] hehe :) [17:27:03] Nothing to worry about then :) [17:27:07] but it's a bit pointless to do that for every object [17:27:12] just costs cache space [17:28:14] that's already the case since weeks ago btw [17:32:16] New patchset: Mark Bergsma; "Remove double slashes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44076 [17:32:16] New patchset: Mark Bergsma; "Set CORS header in vcl_deliver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44078 [17:57:20] !log restarted apache and memcached on virt0 (apparently fixing main page errors) [17:57:31] Logged the message, Master [18:03:24] New patchset: MaxSem; "A couple of MobileFrontend tweaks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46542 [18:20:45] sbernardin: So when I do rack determination with cmjohnson1 I tend to do it in channel [18:20:50] so he knows why I am picking the racks I pick [18:21:10] so will review my picks in here so you know whats up (but it will all go in a ticket as well so no need to take notes) [18:21:52] Robh: OK...and the boxes are being tossed? [18:21:57] We'll have a total of 60 new apache servers coming in, which will be mw75-mw95 [18:22:08] sbernardin: I forgot, we have decom servers from them [18:22:17] so you may wanna keep the boxes to box up the servers we are turning off and unracking [18:22:50] Robh: OK...that's what I figured we would need [18:24:08] Robh: once we know how many we're replacing I can put boxes asside [18:24:34] We'll be replacing atleast 60 [18:24:54] so keep them all, the old 1950 'srv###' are going away, but just the older ones, will have more of a list shortly [18:25:08] i know srv257 and below are out of warranty, checkign the others now [18:25:37] looks like srv258+ are under warranty until april this year [18:25:41] so we won't replace those yet [18:26:37] we also have room in d1-sdtpa, so we'll end up filling that in with these as well, just gotta see how much power overhead we have there [18:26:44] to do that, i can use torrus, which is historical data [18:26:50] and i can login to the power strip directly. [18:27:05] sbernardin: if you are onsite, you can login to the power strip easily, if you are NOT onsite, you ahve to have a proxy into the cluster [18:28:08] sbernardin: So every power strip is on the mgmt network and accessible with http://ps1-rack-datacenter.pmtpa.wmnet [18:28:17] so in this case http://ps1-d1-sdtpa.mgmt.pmtpa.wmnet [18:28:58] I also want to ensure we have not spiked power during peak load times [18:29:05] which I check out on torrus, example: http://torrus.wikimedia.org/torrus/Facilities?nodeid=device//ps1-d1-sdtpa [18:31:37] sbernardin: the torrus link is public viewable [18:31:43] so can click that when not on site and still works [18:32:01] we can see the power historically in this rack never peaks above 3.3kw [18:32:37] it is a single power feed, not dual =[ [18:32:45] so math time! [18:32:52] it is volts * amps * 80% = usable watts [18:33:09] You never want to load a circuit above 80% of its capacity, as it will spike during power up [18:33:22] OK [18:33:25] so 208V 3Phase 30A is [18:33:25] 3*120*30*0.8 = 8640 WATTS [18:33:33] our ceiling is 8.6kw [18:33:50] * hashar today I learned: american people uses volts, amperes and watts just like everyone else. 
[18:35:19] so lets do more simple math, we have 30 servers currently in d1-sdtpa running at 3.3 peak kw 3300 / 30 = 110w per server [18:35:39] hashar: we're only backward with weight, volume, and distance. electricity is OK [18:36:18] if we fill the rack, it adds 15 more servers, 15 * 110 = 1650w + 3300 = 4950W [18:36:32] which is below our 8.6 so we should be able to fill up D1 with new apaches without an issue [18:36:43] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100% [18:36:54] of course, my math is a bit of guesswork, the new apaches are slightly beefier than the older ones currently in d1 (410 veruss 420) [18:36:56] RobH: sorry to bother, do you monitor the hardware of the dell machines? [18:37:04] !log ms-be1001 going down to replace h/w [18:37:14] Logged the message, Master [18:37:16] matanya: yes, nagios tells of of bad disks, etc [18:37:26] though we need to add more comprehensive checking [18:37:38] I can help out if you want [18:37:47] * cmjohnson1 loves robh's tampa power lesson [18:37:54] just done it in my company for 100+ servers [18:38:13] matanya: you can see what we are monitoring @ nagios.wikimedia.org [18:38:16] which will show its pretty basic [18:38:19] sure [18:38:29] we should be monitoring for memory failures and the like [18:38:31] I monitor every part of the server [18:38:34] but those tend to only be caught when they crash servers [18:38:38] temp as well, etc [18:38:43] I do that [18:38:44] we only do the most basic stuff [18:39:09] can I put here an output of my checks? [18:39:26] can i come back to you on it [18:39:34] sure [18:39:36] i have to find a home for a bunch of servers before sbernardin can do anything today =] [18:39:48] but im on rt duty, so following up on stuff like this is my job this week ^_^ [18:40:01] sbernardin: So, we now have a home for 15 of the 60 servers, d1 [18:40:03] sound interesting [18:40:53] OK [18:40:54] ok, so on rack d2-sdtpa, it is all old old poweredge1950s [18:40:57] which i want gone [18:41:15] unfortunately, racktables is private, as it cannot do anything other than login with full admin, or no login at all [18:41:19] (sorry for non staff following along) [18:41:22] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:41:37] so https://racktables.wikimedia.org/index.php?page=rack&rack_id=41 is old crap [18:41:52] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [18:42:00] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:42:08] so just to know whats up in the rack [18:42:09] http://torrus.wikimedia.org/torrus/Facilities?path=/Power_strips/sdtpa/ps1-d2-sdtpa/System/ [18:42:25] the power draw is 5.5kw [18:42:38] but, if we can help it, I want to pull out ALL of those old servers, so we need to see what they are doing [18:42:48] ie: are they all in api cluster, or some in normal cluster [18:42:59] as we want to make sure if we pull 10 out of api, we put 10 back (atleast) [18:43:52] RobH: ping me when you can. 
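RobH's rack-power walkthrough above boils down to two small formulas: usable watts are volts x amps x phases x 0.8 (never load a circuit past 80% of its capacity), and the projected draw after adding servers is the observed peak plus the per-server average times the number added. The same arithmetic, with the d1-sdtpa numbers from the log:

def usable_watts(volts, amps, phases=1, derate=0.8):
    # 80% derating: leave headroom for power-up spikes.
    return volts * amps * phases * derate

def projected_draw(peak_watts, current_servers, added_servers):
    # Assume new boxes draw roughly what the existing ones do at peak.
    per_server = peak_watts / current_servers
    return peak_watts + added_servers * per_server

ceiling = usable_watts(volts=120, amps=30, phases=3)   # 8640 W for 208V 3-phase 30A
after = projected_draw(peak_watts=3300, current_servers=30, added_servers=15)

print(ceiling, after, after <= ceiling)   # 8640.0 4950.0 True -> d1 can take 15 more apaches

As RobH notes, the estimate is rough because the incoming servers are slightly beefier than the ones already in the rack.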
[18:43:58] OK [18:44:09] matanya: will do thx =] [18:44:30] sbernardin: So the easy way to check this is via our puppet files, (site.pp in puppet, which you may not be setup yet to pull that data but we'll get you setup sometime soon) [18:44:55] OK [18:45:27] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100% [18:45:49] srv226-247 are normal application servers, 248 and 249 are bits app servers, and 250 to 257 are api [18:46:05] so when we add the new servers back to the cluster, we need to replace those ranges accordingly (just fyi) [18:46:49] brb [18:47:13] OK [18:48:30] ok, so lemme see [18:49:01] so 15 of them are going in the free space in d1 [18:49:10] then the remaining 45 into d2-sdtpa [18:49:10] RobH, just going through some older RT tickets, once you're done with the current stuff you're working on can you check if https://rt.wikimedia.org/Ticket/Display.html?id=2787 has been done? [18:50:47] Thehelpfulone: done, just got someone to volunteer to take it ;] [18:51:10] done with check that is, heh [18:51:27] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [18:51:36] sbernardin: Ok, so I'll drop a ticket with all the instructions but just to continue conversation in here [18:51:48] since we will be killing the servers in d2-sdtpa, once I update our software for that [18:51:55] ah :P [18:52:10] you will pull the network (blue production, not green mgmt) [18:52:19] and reboot them into boot and nuke [18:52:30] sbernardin: http://www.dban.org/ [18:52:41] you will want to make a usb stick of that [18:52:59] then reboot each old server into it, once its up and running the wipe you can usually just pull the usb stick and move to next [18:53:06] as the wipe software resides in memory once launched [18:53:18] (atleast that is what i used to do, cmjohnson1 has done it more recently) [18:53:24] cmjohnson1: that how you handled it? [18:53:45] we could put a boot and nuke into pxe to run on things, but having something data destructive like that as a boot option scares me. [18:59:12] sorry to interrupt again, wouldn't it be easier to use a master nuke server? [18:59:50] yes and no [18:59:52] yes overall [19:00:14] no in that for it to be on network comfortably i would want it in its own vlan so dhcp/pxe server wont serve that image to normal subnets [19:00:18] to prevent accidental data loss [19:00:32] its on my list, because i want that vlan to be the 'test' vlan that also spins up automated burn in testing on new systems [19:00:45] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [19:00:59] got it [19:01:14] we just have not had time to project and work on it ;_; [19:01:18] if you had the env, it was best [19:01:39] but having a nuke server on the normal vlans, that will serve to any of our systems, that makes me nervous [19:01:46] so its been shelved for awhile =[ [19:01:51] New patchset: Alex Monk; "(bug 44472) Add throttle exception for a wikipedia editing workshop in Mumbai" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46547 [19:02:06] sbernardin: I am writing up the ticket(s) now for the decom and new server add, will ping you with links when done [19:02:14] see your point. 
I use a local switch for such cases [19:03:12] New patchset: MaxSem; "Add WikiVoyage to GeoData update scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46548 [19:04:05] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100% [19:04:09] can someone take a look at ^^^ please? [19:06:36] robh sebenardin: yep..there is a dban stick already there [19:09:15] !log ms-be1002 going down to replace h/w [19:09:16] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [19:09:26] Logged the message, Master [19:12:36] PROBLEM - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:13:03] PROBLEM - Host ms-be1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:13:15] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [19:15:55] PROBLEM - SSH on ms-be1001 is CRITICAL: Connection refused [19:15:56] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100% [19:17:30] RECOVERY - SSH on ms-be1008 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:17:40] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [19:17:51] RECOVERY - SSH on ms-be1008 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:20:01] RECOVERY - SSH on ms-be1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:20:42] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [19:29:55] !log upgrading glusterfs to 3.3.1 from 3.3.0 and replacing old glusterfs package with new packages [19:30:09] New review: Demon; "I don't think we'll need to add the cdb files to git. The .cdb file is built easily, so I think we c..." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/46240 [19:30:10] Logged the message, Master [19:30:28] New patchset: Asher; "sync gdash graph definitions and actually deploy from puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46552 [19:31:46] !log ms-be1003 going down to replace h/w [19:31:56] Logged the message, Master [19:34:01] RECOVERY - Host ms-be1002 is UP: PING WARNING - Packet loss = 61%, RTA = 0.39 ms [19:34:20] PROBLEM - NTP on ms-be1001 is CRITICAL: NTP CRITICAL: No response from NTP server [19:34:30] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [19:35:06] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [19:35:33] sbernardin: So lets not decom d2-sdtpa until you rack and have the 15 new servers in d1-sdtpa [19:35:45] just peace of mind to have some more online before we pull a bunch [19:35:50] sbernardin: https://rt.wikimedia.org/Ticket/Display.html?id=4436 [19:36:29] and https://rt.wikimedia.org/Ticket/Display.html?id=4437 [19:36:30] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46552 [19:36:40] PROBLEM - SSH on ms-be1002 is CRITICAL: Connection refused [19:37:03] RECOVERY - Host ms-be1002 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms [19:37:21] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [19:37:21] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [19:37:21] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: Puppet has not run in the last 10 hours [19:37:21] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [19:37:21] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [19:37:22] PROBLEM - Puppet freshness on msfe1002 is 
CRITICAL: Puppet has not run in the last 10 hours [19:37:22] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [19:37:23] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [19:38:08] OK [19:41:24] PROBLEM - SSH on ms-be1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:41:56] sbernardin: Actually, I fucked up and missed older servers [19:42:06] We'll not be putting things in d2-sdtpa, or decomming those servers [19:42:15] turns out srv190-srv225 are in s5-sdtpa [19:42:17] and much older. [19:42:40] RECOVERY - SSH on ms-be1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:42:52] New review: Hashar; "Ryan idea was probably to have the .cdb in the git repo so we can git-deploy it to the other hosts. ..." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/46240 [19:43:03] RECOVERY - SSH on ms-be1002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:43:28] hashar: for beta you want to switch back to the old deployment methof [19:43:31] *method [19:43:49] yeah I was wondering if you / sam were still working on the newdeploy branch [19:43:57] not for now [19:44:09] from a talk with Rob yesterday, we would want to keep git-deploy on beta [19:44:24] so we can use the beta cluster as a test/preprod host for git-deploy stuff [19:44:29] I'd really like to replace how we're handling l10n [19:44:30] so I guess we want to stick to it [19:44:41] sticking it into git was a short-term thing [19:44:50] BUT, I will probably want to use mediawiki-config @ master [19:44:55] yes [19:45:04] there's a few issues with it right now [19:45:28] I will probably hack that tomorrow morning [19:45:37] sticking it into git was a short-term thing (but now 18 years of child support) [19:45:50] hahaha [19:46:07] :-] [19:46:18] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46240 [19:46:36] binasher: well, I was going to marry l10n off to bittorrent [19:46:45] 90 minutes of pleasure, 9months of waiting time, 3 * 9 years of troubles [19:47:06] PROBLEM - NTP on ms-be1001 is CRITICAL: NTP CRITICAL: No response from NTP server [19:47:42] !log jenkins: updating all jobs [19:47:52] Logged the message, Master [19:48:31] PROBLEM - NTP on ms-be1002 is CRITICAL: NTP CRITICAL: No response from NTP server [19:49:40] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 182 seconds [19:49:50] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 187 seconds [19:50:33] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 199 seconds [19:50:51] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 207 seconds [19:51:28] Ryan_Lane: can I pm? [19:51:38] !log ms-be1004 going down for h/w replacement [19:51:42] matanya: it can't be asked in here? [19:51:48] Logged the message, Master [19:52:11] binasher: can you add job-pop-duplicate and job-insert-duplicate stats to gdash? [19:52:28] it can, but topic says: serious stuff. I don't want to bother serious stuff. [19:52:38] doh, aaron isn't in here [19:54:01] PROBLEM - Host ms-be1004 is DOWN: PING CRITICAL - Packet loss = 100% [19:54:19] PROBLEM - Host ms-be1004 is DOWN: PING CRITICAL - Packet loss = 100% [19:55:41] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:56:09] anyway, are all wikimedia servers dell servers? in which way nagios is working, snmp? nrpe? 
ssh? are all servers 12.04? [19:56:47] took many questions :-D [19:56:49] too [19:57:41] PROBLEM - SSH on ms-be1003 is CRITICAL: Connection refused [19:57:53] matanya: our nagios configuration is in operations/puppet.git [19:58:13] matanya: I guess we use mostly nrpe [19:58:22] matanya: and most servers are running 12.04 (Precise) [19:59:06] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 26.48 ms [19:59:18] thanks hashar [19:59:50] matanya: if you know about nagios, Damianz is maintaining the nagios system on wmflabs [20:00:23] is he around? I'd like to submit quite a lot of scripts and checks [20:01:35] matanya: he has been away for two days, he is #wikimedia-labs often [20:02:00] thanks [20:02:06] matanya: ah his script is in labs/nagios-builder.git [20:02:20] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [20:02:24] PROBLEM - NTP on ms-be1002 is CRITICAL: NTP CRITICAL: No response from NTP server [20:03:09] PROBLEM - SSH on ms-be1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:27] hashar: who maintain the production nagios? [20:03:31] *s [20:03:57] matanya: we are migrating to icinga (the nagios community fork). LeslieCarr has the most knowledge about it [20:04:10] else random ops people update the puppet manifests [20:04:21] any blog post why to migrate? [20:04:33] iirc I suggested it to Leslie [20:04:58] compare http://nagios.wikimedia.org versus http://neon.wikimedia.org/icinga/ [20:05:04] icinga looks a bit more modern [20:05:17] don't have access [20:05:29] oh [20:05:36] unfortunate, you should [20:05:38] hi guys, i'm looking into merging the analytics branch of operations/puppet into production for review [20:05:38] or maybe I am whitelisted [20:05:51] got some qs on how to best go about it [20:06:00] hashar: who should I speak regarding scripts and checks donation? [20:06:15] is that Q&A session today? ;;-] [20:06:28] isn't it? :) [20:06:33] first off [20:06:34] matanya: what do you mean by donating scripts and checks ? [20:06:45] New patchset: Dereckson; "Maintenance for http://fr.planet.wikimedia.org/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46558 [20:06:50] matanya: if you want to update the hosts, you can hack the puppet manifests and the labs/nagios-builder.git [20:07:01] there are 4 puppet modules that I am maintaining as separate repositories and using git submodules to clone them [20:07:19] there was a discussion back in october relating to using git submodules in puppet [20:07:27] I just finished a nagios review in my company, for 100+ servers [20:07:30] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [20:07:31] ottomata: I don't think ops will allow submodules, they would be considered third parties / untrusted [20:07:38] ottomata: are they upstream modules? [20:07:41] they would be hosted in gerrit [20:07:43] ottomata: one could update the submodule to grant itself root in turn giving him root on the cluster. 
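matanya is offering to contribute hardware checks, and hashar notes the production fleet is monitored mostly via NRPE. NRPE (and the Icinga setup being migrated to) runs ordinary executables that follow the Nagios plugin convention: one status line on stdout, optional |perfdata, and exit code 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN. A minimal, hypothetical example of such a check -- the temperature source and thresholds are placeholders, not anything WMF actually deploys:

#!/usr/bin/env python
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def read_temperature():
    # Placeholder: a real hardware check would query IPMI, OMSA or /sys/class/hwmon.
    return 61.0

def main(warn=65.0, crit=75.0):
    try:
        temp = read_temperature()
    except Exception as exc:
        print("TEMP UNKNOWN - %s" % exc)
        return UNKNOWN
    if temp >= crit:
        status, label = CRITICAL, "CRITICAL"
    elif temp >= warn:
        status, label = WARNING, "WARNING"
    else:
        status, label = OK, "OK"
    # One status line plus perfdata, which is all NRPE relays back to Nagios/Icinga.
    print("TEMP %s - %.1f C |temp=%.1f;%.1f;%.1f" % (label, temp, temp, warn, crit))
    return status

if __name__ == "__main__":
    sys.exit(main())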
[20:07:44] no, they are mine [20:07:45] the main problem is that of review, and especially final review of diffs [20:07:49] on sockpuppet [20:07:56] would like to help install the scripts and checks on wmf servers [20:08:16] so, i can commit them directly to ops/puppet if I have to [20:08:30] i'd rather not, because they are useful as standalone modules [20:08:39] and have nothing wmf specific in them [20:09:09] so, the git submodules problem mainly has to do with how diffs are reviewed on sockpuppet before we merge into the working copy there [20:09:44] i wrote a wikipage on how this would work as is [20:09:45] http://wikitech.wikimedia.org/view/User:Ottomata/Git_Submodules_and_Puppet [20:10:20] so, this is less of a problem if you do development in labs [20:10:21] submodules won't be automatically updated as part of our regular merge process [20:10:35] matanya: and here is a nagios vs icinga chart https://www.icinga.org/nagios/feature-comparison/ [20:10:40] this sounds like a list question, honestly [20:11:00] yeah, i think so too, hmm [20:11:00] but [20:11:10] lots of our modules are/will be useful to third parties [20:11:30] since I'm already in lots of trouble, i guess i'm trying to feel out how much I should rock the boat here. [20:11:37] it feels wrong to commit these directly to ops/puppet [20:11:40] why? [20:11:44] but that is the path of least resistance [20:11:46] that's how all the rest of our modules are done [20:12:11] unless we're going to change all modules to submodules it doesn't make sense [20:12:12] i guess for open source discovery purposes, the modules have nothing to do with WMF [20:12:19] i don't think thats true [20:12:22] lots of our modules don't, really [20:12:22] many of our modules are WMF specific [20:12:46] one way would be to stick the modules in operations/puppet.git with the rest [20:12:54] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [20:13:05] and if we want to make them available to third parties, we could script something that would publish them to different git repos [20:13:32] that would be great actually [20:13:33] redis, salt, mongodb, java, deployment <— current modules that aren't wmf specific [20:13:50] right, and they are difficult for others to discover or use [20:13:51] hashar: that's not terribly easy [20:13:56] since it would be difficult to keep history [20:13:58] hmmm [20:13:59] right [20:14:10] PROBLEM - Host ms-be1003 is DOWN: PING CRITICAL - Packet loss = 100% [20:14:14] Ryan_Lane: I am pretty sure I did something like that already using git-filter branch [20:14:38] oh and for wmf specific stuff I proposed a "wikimedia" module [20:14:56] got to catch paravoid to get it merged in :-D [20:15:29] New patchset: Pyoungmeister; "db box bookkeeping" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46561 [20:15:29] the change is at https://gerrit.wikimedia.org/r/#/c/43420/ feel free to cast your voice there [20:17:29] hashar, I think that's a really good idea [20:17:46] although, i'm not so sure how different it is than just manifests/ [20:18:09] RECOVERY - Host ms-be1004 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [20:18:46] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms [20:19:20] RECOVERY - Host ms-be1003 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [20:20:09] ottomata: the point is mainly to stop including everything from manifests/site.pp [20:20:14] yeah [20:20:17] and takes advantage of puppet autoloading [20:20:17] that makes sense for sure [20:20:22] that 
would make puppet a bit faster [20:20:31] RECOVERY - Host ms-be1004 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [20:20:31] annnnd let us finally start writing integration tests [20:22:05] oh? [20:22:05] how so? [20:22:50] PROBLEM - SSH on ms-be1004 is CRITICAL: Connection timed out [20:23:29] regarding analytics branch, Ryan_Lane and hashar, I guess I'm trying to ask if you think I should restart the submodules discussion. I had planned to do so in a few months when we were ready to merge in analytics branch, but due to everything that's happening at the moment we should do this now [20:23:42] PROBLEM - SSH on ms-be1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:54] I think that is the right way to do things, but if it is going to just cause more friction it might not be worth it [20:24:02] sigh, ha I really want to ask paravoid [20:24:07] maybe I should just email him :p [20:25:20] or mail ops list ? [20:25:40] ottomata: other people could cast their voices by mailing to ops list [20:25:40] RECOVERY - SSH on ms-be1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:25:48] RECOVERY - SSH on ms-be1003 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:26:33] RECOVERY - SSH on ms-be1004 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:26:35] yeah, there is a discussion thread about it [20:26:50] RECOVERY - SSH on ms-be1004 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [20:26:57] right, that's what I would do if I was ok with causing friction [20:27:11] at the moment errrghhh, i dunno [20:27:55] PROBLEM - NTP on ms-be1004 is CRITICAL: NTP CRITICAL: No response from NTP server [20:29:52] hashar Ryan_Lane, i'm going to email faidon and just ask what he thinks, and if he thinks I should start a discussion I will [20:31:00] sound sgood [20:31:17] or send the patch in Gerrit and add him as a reviewer ;-]]]]] [20:32:46] RECOVERY - swift-account-reaper on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [20:35:46] PROBLEM - swift-account-reaper on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [20:39:09] RECOVERY - Puppet freshness on ms-be1008 is OK: puppet ran at Tue Jan 29 20:38:54 UTC 2013 [20:41:15] heads up: we'll start deploying MobileFrontend updates, will someone be around to help with Varnish? [20:43:19] !log updated rt config, pmtpa comments now come from pmtpa-comment, not rt-comment to match the rest of the queues configuration [20:43:30] Logged the message, RobH [20:44:06] PROBLEM - NTP on ms-be1004 is CRITICAL: NTP CRITICAL: Offset unknown [20:44:36] New review: awjrichards; "Looks good, I'm happy to merge once we're in the clear for deployment." [operations/mediawiki-config] (master); V: 2 C: 1; - https://gerrit.wikimedia.org/r/46542 [20:47:42] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 182 seconds [20:48:09] RECOVERY - NTP on ms-be1001 is OK: NTP OK: Offset -0.02065074444 secs [20:48:23] RECOVERY - NTP on ms-be1001 is OK: NTP OK: Offset -0.02181839943 secs [20:48:23] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 196 seconds [20:48:27] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 196 seconds [20:48:36] robh: so I am looking at where to put the mw servers in eqiad and while we want to keep them together w/other apaches. 
I don't have enough space or available power for 60 [20:48:45] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 203 seconds [20:49:38] cmjohnson1: im looking at that right now actually [20:49:38] k [20:50:22] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 2 seconds [20:50:34] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [20:50:43] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [20:51:00] RECOVERY - NTP on ms-be1002 is OK: NTP OK: Offset -0.02260315418 secs [20:51:43] RECOVERY - NTP on ms-be1002 is OK: NTP OK: Offset -0.02382671833 secs [20:51:54] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [20:52:04] cmjohnson1: So looking at power in a5/a6/b6/b7/b8 they seem ok for power [20:52:17] sorry..should [20:52:26] for adding 10 each should be ok [20:52:26] have been more clear...no available power outlets [20:52:32] ohh, no power outlets in those? [20:52:45] yeah..give me a sec i will give you a count [20:53:46] I thought we had 84 outlet plugs in those, but that may just be row c... [20:55:43] RECOVERY - NTP on ms-be1004 is OK: NTP OK: Offset -0.02612674236 secs [20:56:15] RECOVERY - NTP on ms-be1004 is OK: NTP OK: Offset -0.02628481388 secs [20:56:53] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 16 seconds [20:57:22] robh: i have a total of 33 [20:57:22] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 11 seconds [20:57:36] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 8 seconds [20:57:36] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 8 seconds [20:57:52] cmjohnson1: need more info, 33 pairs of ports across all 5 of those racks? [20:57:54] a7/7 a6/5 b6/5 b7/8 b8/8 [20:58:01] damn. [20:58:13] yes pairs [20:58:20] so its 48 ports per side iirc? [20:58:30] sorry, 42 [20:58:32] 42 [20:58:40] yea, ok, thats to be expected then [20:58:47] row c is the same [20:58:51] w/42 [20:59:02] RECOVERY - NTP on ms-be1008 is OK: NTP OK: Offset -0.02078998089 secs [20:59:24] cmjohnson1: Ok, so with the mw servers, fill in those 5 racks with the space you have [20:59:33] and put the remainder mw servers in rack c6 [20:59:38] which will be row c apache rack. [20:59:51] RECOVERY - NTP on ms-be1008 is OK: NTP OK: Offset -0.02116334438 secs [21:00:19] k [21:03:17] put new mediawiki servers in row C [21:03:22] bleehehhh [21:03:29] mark: he is [21:03:38] but shouldnt we fill in row A and B first so we dont have open space? [21:03:44] no [21:03:47] wh... [21:03:48] why? [21:03:52] if we have a row switch failure we're dead [21:03:57] ..... in one rack [21:04:04] its 5 racks [21:04:04] the entire row [21:04:17] but this is my point, why not order app servers by the rack? [21:04:29] but you are arguing against filling a rack with apaches [21:04:37] we want to fill the existing racks before expanding [21:04:39] yes, i'm arguing for a site with good uptime [21:04:47] you just said order by the rack? [21:04:50] those older apaches will be gone first [21:04:53] .... [21:04:57] and then it will be mixed with newer ones, blehh [21:05:09] so instead we leave open racks [21:05:10] buy this stuff by the rack, spread out over rows [21:05:10] wasted amps [21:05:15] and wasted real estate in each rack? [21:05:15] now row C is wasted [21:05:19] what's the difference? [21:05:24] row C will get filled in over time. 
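To make the module-publishing idea from the submodules discussion above concrete (Ryan_Lane's suggestion to script something that republishes modules to different git repos, and hashar's mention of having done it with git filter-branch), here is a minimal sketch of splitting one module out of operations/puppet into its own repository while keeping its history. The module path, the standalone repository name and the clone/push URLs are illustrative assumptions, not existing repos:

    # work in a throwaway clone, since filter-branch rewrites history
    git clone https://gerrit.wikimedia.org/r/operations/puppet puppet-split
    cd puppet-split
    # keep only the history that touches modules/redis (hypothetical choice of module)
    git filter-branch --prune-empty --subdirectory-filter modules/redis -- --all
    # push the rewritten history to a standalone repository created beforehand in Gerrit
    git remote add standalone ssh://<user>@gerrit.wikimedia.org:29418/operations/puppet-redis
    git push standalone HEAD:refs/heads/master

Because the module's commit history survives the split, this sidesteps the "difficult to keep history" objection raised above, and the resulting repo carries nothing WMF-specific, so third parties can discover and reuse it without pulling in the rest of operations/puppet.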
[21:05:28] so will a/b [21:05:32] and a/b will clear out sooner [21:05:33] not as much [21:05:37] look at sdtpa [21:05:46] we have the issue that we have racks half filled all over the place [21:05:55] and its hard to allocate and rack new stuff in groups [21:06:05] apache racks? [21:06:07] yes [21:06:16] that's because it wasn't done the way I wanted ;) [21:06:20] .... [21:06:22] thats bullshit [21:06:23] i've said for years to buy by the rack [21:06:32] and decommission by the rack [21:06:47] we cannot fit 60 servers in a rack [21:06:51] no [21:06:52] so we take two of row C [21:06:54] we can fit like 32-40 [21:06:55] and waste the space in a and b [21:06:59] sitting empty and we pay for it. [21:06:59] why 60 anyway [21:07:03] who came up with that silly number [21:07:06] because its what asher/ct told me to do [21:07:10] sigh [21:07:20] they said order 60 for tampa. [21:07:25] then asher said order another 60 for eqiad. [21:07:32] amen [21:07:39] and ct said that you and asher get to tech review [21:07:44] and didnt say i have to have you both sign off [21:07:47] so i did what they said. [21:08:18] So my problem is if we put all the new 60 servers in row C, we take up 2 of the racks for that and now we have 7 apache racks, all filled to 75% capacity [21:08:28] instead of 6 apache racks, 5 of which are 100% full [21:08:35] and one of which is 45% full [21:08:41] Change merged: awjrichards; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46542 [21:09:33] now if we want them spaced evenly in all three rows, thats fine, but we need to at least fill up the racks [21:09:33] if you keep mixing old and new across racks, you also keep moving shit around all the time [21:09:49] proper way is to calculate needed power capacity, fill a rack, then be happy with it [21:09:56] ok, im trying to do that. [21:09:57] fill a rack. [21:10:10] it has to happen over the course of multiple orders, because we didnt fill them to capacity before [21:10:15] and I didnt do the initial ordering for eqiad [21:10:41] when those were ordered, we did not properly fill a rack to capacity, an issue i am now trying to correct. [21:10:49] how much are they using? [21:11:21] a6-eqiad is at roughly 7.1kW [21:11:33] it can take at least 5 more servers [21:12:29] we're also limited by number of ports, we have 42 per tower [21:12:37] so in a6 we have 5 ports left. [21:12:52] that should (though we may end up having to pull one once we test the power up levels) [21:12:57] be able to handle all being filled [21:13:01] then a5 is at capacity. [21:13:06] so you want to do one rack in row C, then spread the remainder over existing racks? [21:13:18] I want to first fill rows A and B apache racks to capacity [21:13:25] then take the remainder and create an apache rack in row C [21:13:27] don't try to max up that precisely, if you're not sure one will trip the 8.6 kW, just don't put it there [21:13:44] ok, so we can put 4 and leave a set of plugs open too [21:13:50] (for usb cdroms, etc) [21:13:52] just fill one new C rack to the max [21:13:55] then spread the rest [21:14:00] so at least one is sane :P [21:14:16] that is close enough to splitting the difference to me. [21:14:27] I just hate having all the racks at 75% capacity with wasted overhead [21:14:34] your solution comes close enough to addressing that to me. [21:14:48] cmjohnson1: got it? c6 is row C apache rack [21:14:53] yep [21:14:58] so fill that then populate a5/6/7 and b6/7 [21:16:11] cmjohnson1: Soooo did all the new misc go in c4?
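For the capacity math being discussed: a6-eqiad at roughly 7.1 kW against the 8.6 kW trip threshold leaves about 1.5 kW of headroom, and 5 open outlet pairs, so whichever limit is reached first caps how many more servers fit. A small sketch of that bookkeeping; the ~250 W per-server draw is an assumed figure for illustration, not a measured one:

    # rough per-rack headroom check (numbers other than the quoted ones are assumptions)
    RACK_LIMIT_W=8600      # 8.6 kW breaker/trip threshold
    RACK_CURRENT_W=7100    # a6-eqiad measured load, roughly 7.1 kW
    PER_SERVER_W=250       # assumed average draw per apache (hypothetical)
    PORTS_FREE=5           # open outlet pairs left in the rack
    power_fit=$(( (RACK_LIMIT_W - RACK_CURRENT_W) / PER_SERVER_W ))
    # the binding constraint is whichever runs out first: power or outlet pairs
    echo "room for $(( power_fit < PORTS_FREE ? power_fit : PORTS_FREE )) more servers"

With these assumed numbers the answer is 5, matching the "at least 5 more servers" estimate, with the outlet count rather than power being the binding limit.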
[21:16:17] and none in a4/b4? [21:16:40] (or did those get populated to capacity first?) [21:16:42] yep....not enough power outlets [21:16:50] damn lack of power plugs! [21:16:50] they are at max capacity [21:16:55] heh [21:17:02] cool [21:17:02] so all are in c4 [21:17:10] populating racktables now [21:17:30] mark: recall when we ran out of Uspace before power? [21:17:34] heh [21:17:36] yes [21:17:43] those days are gone [21:17:44] we may have to look at 20A instead of 30A in the future [21:17:48] servers have gotten more efficient [21:18:14] pmtpa row c and d are good examples of 20amp [21:19:04] cmjohnson1: So when you are done racking them in row C [21:19:24] you wanna do the network labeling and vlan tagging? (I can walk you through it, LeslieCarr set you up with a limited access account like mine for this) [21:19:54] granted, still must be careful, the limited access is still more than enough to take down the site. [21:19:59] that would be cool [21:20:09] cool, lemme know when you are at that point and we can work on it [21:20:21] k [21:20:43] now where the hell am i gonna get 10 misc servers in tampa... [21:21:54] New patchset: MaxSem; "Revert "(bug 44424) wikiversions.cdb for labs"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46616 [21:22:04] well, i suppose we dont have many items without a service processor [21:22:09] which was b1-sdtpa rack [21:22:14] as it has per outlet power control. [21:22:21] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46616 [21:22:24] its also a rack of misc crap, so it may be the only place for them. [21:23:00] sbernardin: When you have a moment, I would like to know how many power outlets are available in b1-sdtpa, so I dont assign a racking space to them and not have the power [21:23:19] (b1-sdtpa may have less plugs than other racks, as its per outlet power control) [21:25:37] robh: iirc b1 is maxed out or really close (maybe 1...since storage3 is decom'd) [21:27:40] max'd out in plugs ya mean right? [21:27:44] that sucks. [21:28:33] cmjohnson1: you sure? i see lots of open ports [21:29:21] i see 9 ports on tower a open [21:29:33] since this is per outlet control, its also per port sensing [21:29:39] but it doesnt count stuff that isnt powered on [21:29:44] hence my asking sbernardin to confirm [21:29:52] i am pretty confident...but yep..ask sbernardin [21:30:10] different power strips...2 individual strips on each side [21:30:41] yep, and i know that there is a c15 plug on them [21:30:47] so we lose at least 1 or 2 ports per strip to that. [21:30:59] oh well, he is afk unboxing servers im sure ;] [21:31:12] cmjohnson1: you onsite? fix my row C/D cameras so we can all watch you all the time.
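For the "network labeling and vlan tagging" step offered above, the per-port work on a typical Juniper access switch amounts to setting a description and an access VLAN for each new server port, then committing carefully. A rough sketch of what that walkthrough might cover; the switch name, interface, server hostname and VLAN are made-up examples, the syntax varies across Junos versions, and this is not the actual WMF procedure:

    # push the port config to the row C access switch over ssh
    # (assumes the switch CLI accepts configuration commands on stdin)
    ssh asw-c-eqiad.mgmt.example <<'EOF'
    configure
    set interfaces ge-2/0/10 description "mw1161 eth0"
    set interfaces ge-2/0/10 unit 0 family ethernet-switching port-mode access
    set interfaces ge-2/0/10 unit 0 family ethernet-switching vlan members private1-c-eqiad
    commit confirmed 5
    EOF

The "commit confirmed" form rolls the change back automatically unless it is confirmed within the given number of minutes, which is a sensible safety net given that even the limited access described above is, as noted, more than enough to take down the site.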
[21:31:15] bwahahahaha [21:31:45] 'whats that ladder in row A coldrow for???!?!?; (mostly kidding ;) [21:31:56] haha....yeah...so you can sit at your desk in sfo and reminisce [21:32:12] RobH: there are a few ports available....but only one with a normal outlet [21:32:34] ladder in a cold row...no it's in the hot row [21:32:51] the console is in the cold row right now at a1 [21:33:26] cmjohnson1: ladder is totally in cold row a [21:33:29] in front of a8 [21:33:31] !log stopping gluster services on labstore1, this will affect nfs mounts (like ssh keys and public data exports) [21:33:35] you cannot hide from me (except in the camera blind spots) [21:33:36] Normal plug I mean...the rest of the available one's have an alternate looking plug that I don't know about [21:33:42] Logged the message, Master [21:33:47] sbernardin: yep, ok cmjohnson1 was right that rack is full, crappy! [21:33:49] hah...gonna work in the storage room now! [21:34:45] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46561 [21:37:46] RECOVERY - Puppet freshness on db1037 is OK: puppet ran at Tue Jan 29 21:37:32 UTC 2013 [21:38:25] !log authdns update adding mgmt ip's to zone files for hpm misc servers in eqiad [21:38:35] Logged the message, Master [21:40:12] cmjohnson1: in retrospect, we should have named the cameras after aisles [21:40:22] have aisle 1 be row a intake, aisle 2 row a and b output [21:40:24] but whatever. [21:40:26] semantics [21:41:13] RECOVERY - Puppet freshness on db1015 is OK: puppet ran at Tue Jan 29 21:40:58 UTC 2013 [21:42:24] Ok, I do not want to add misc servers to any old rack [21:42:34] but the only rack i have any measure of free space is c2-pmtpa. [21:43:05] i can fit a single server into b1/b3/b4-sdtpa [21:43:12] thats 3 of 10 [21:43:16] =P [21:44:13] RECOVERY - Puppet freshness on db1045 is OK: puppet ran at Tue Jan 29 21:44:00 UTC 2013 [21:46:59] New patchset: Ottomata; "Adding kafka module for review." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/46618 [21:48:07] RECOVERY - Puppet freshness on db62 is OK: puppet ran at Tue Jan 29 21:47:51 UTC 2013 [21:54:16] RECOVERY - Puppet freshness on db1016 is OK: puppet ran at Tue Jan 29 21:54:00 UTC 2013 [21:57:09] !log stopping all gluster services [21:57:19] Logged the message, Master [21:57:48] New patchset: RobH; "adding in eqiad row c, pmtpa rows c and d" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23426 [21:58:10] RECOVERY - Puppet freshness on db1044 is OK: puppet ran at Tue Jan 29 21:57:40 UTC 2013 [21:58:11] RECOVERY - Puppet freshness on db1030 is OK: puppet ran at Tue Jan 29 21:58:03 UTC 2013 [21:58:11] RECOVERY - Puppet freshness on db1023 is OK: puppet ran at Tue Jan 29 21:58:06 UTC 2013 [21:58:22] !log rebooting labstore1 [21:58:32] Logged the message, Master [21:58:48] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23426 [21:59:13] RECOVERY - Puppet freshness on db1031 is OK: puppet ran at Tue Jan 29 21:58:45 UTC 2013 [21:59:23] New patchset: Pyoungmeister; "testing: making db1012 fake es1 host" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46624 [22:00:29] PROBLEM - Host labstore1 is DOWN: PING CRITICAL - Packet loss = 100% [22:00:30] !log temp stopping puppet on brewster [22:00:42] Logged the message, notpeter [22:01:01] !log updating torrus files via puppet, sorry if torrus crashes out, its bitchy [22:01:12] Logged the message, RobH [22:01:48] RECOVERY - Host labstore1 is UP: PING OK - Packet loss = 0%, RTA = 26.64 ms [22:02:13] RECOVERY - Puppet freshness on db1012 is OK: puppet ran at Tue Jan 29 22:01:56 UTC 2013 [22:02:30] !log rebooting labstore2 [22:02:40] Logged the message, Master [22:02:55] torrus recompile takes a long time, and doesnt update console during it when via puppet. [22:04:52] PROBLEM - Host labstore2 is DOWN: PING CRITICAL - Packet loss = 100% [22:05:49] PROBLEM - Host db1012 is DOWN: PING CRITICAL - Packet loss = 100% [22:06:09] RECOVERY - Host labstore2 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [22:06:21] RECOVERY - Puppet freshness on db1014 is OK: puppet ran at Tue Jan 29 22:05:51 UTC 2013 [22:07:10] RECOVERY - Puppet freshness on db1029 is OK: puppet ran at Tue Jan 29 22:06:47 UTC 2013 [22:09:58] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [22:10:19] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [22:11:18] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.61 ms [22:11:21] New patchset: RobH; "invalid ps1-a4-pmtpa reference left over" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46627 [22:11:27] !log maxsem Started syncing Wikimedia installation... 
: https://www.mediawiki.org/wiki/Extension:MobileFrontend/Deployments/2013-01-29 [22:11:28] PROBLEM - Host labstore2 is DOWN: PING CRITICAL - Packet loss = 100% [22:11:38] Logged the message, Master [22:11:40] RECOVERY - Host db1012 is UP: PING OK - Packet loss = 0%, RTA = 26.46 ms [22:11:49] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [22:12:17] RECOVERY - Host labstore2 is UP: PING OK - Packet loss = 0%, RTA = 26.52 ms [22:15:43] PROBLEM - SSH on db1012 is CRITICAL: Connection refused [22:15:52] PROBLEM - MySQL disk space on db1012 is CRITICAL: Connection refused by host [22:17:18] PROBLEM - SSH on labstore3 is CRITICAL: Connection timed out [22:18:08] RECOVERY - SSH on labstore3 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [22:19:23] New patchset: RobH; "invalid ps1-a4-pmtpa reference left over" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46627 [22:19:37] PROBLEM - Host db1012 is DOWN: PING CRITICAL - Packet loss = 100% [22:20:58] RECOVERY - SSH on db1012 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [22:21:07] RECOVERY - Host db1012 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [22:21:08] PROBLEM - SSH on labstore2 is CRITICAL: Connection timed out [22:21:21] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46627 [22:21:32] <^demon> Someone mind taking a look at the apache log on wikitech? [22:21:36] <^demon> It's giving 500s. [22:21:59] RECOVERY - SSH on labstore2 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [22:25:18] !log maxsem Finished syncing Wikimedia installation... : https://www.mediawiki.org/wiki/Extension:MobileFrontend/Deployments/2013-01-29 [22:25:28] Logged the message, Master [22:26:31] RECOVERY - MySQL disk space on db1012 is OK: DISK OK [22:27:20] !log torrus is all happy again, huzzah [22:27:31] Logged the message, RobH [22:29:42] New patchset: Pyoungmeister; "testing: swapping db1012 to coredb es1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46629 [22:31:43] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46624 [22:34:54] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46629 [22:38:21] !log manual "logrotate" of SAL on wikitech [22:40:02] !log rotated SAL wikitech page - test logging [22:40:31] morebots: log!!! 
[22:41:48] slow bots get beaten [22:42:31] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [22:43:05] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [22:45:44] mutante: maybe you confused it by deleting all the date headers [22:47:02] !log rotated SAL wikitech page - test logging [22:47:03] Logged the message, Master [22:47:12] there you go [22:48:38] !log maxsem synchronized php-1.21wmf8/extensions/MobileFrontend 'https://gerrit.wikimedia.org/r/46630' [22:48:39] Logged the message, Master [22:49:27] thanks TIm [22:49:37] Jan 29 22:48:13 wikitech apache2: PHP Fatal error: Maximum execution time of 30 seconds exceeded in /srv/org/wikimedia/wikitech-1.19.2/includes/diff/DairikiDiff.php on line 1272 [22:50:27] preilly: https://gerrit.wikimedia.org/r/46637 [22:50:31] latest commit [22:51:48] !log maxsem synchronized php-1.21wmf7/extensions/MobileFrontend 'https://gerrit.wikimedia.org/r/46630' [22:51:49] Logged the message, Master [22:52:49] New patchset: Reedy; "Cleanup of InitialiseSettings" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46638 [22:53:01] 1 file changed, 74 insertions(+), 215 deletions(-) [22:53:13] Oh yeah, I'd meant to archive the SAL. [22:57:32] New patchset: MaxSem; "Enable Special:Nearby everywhere" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46639 [22:57:57] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46639 [23:00:00] !log maxsem synchronized wmf-config/InitialiseSettings.php 'Enable Special:Nearby everywhere' [23:00:01] Logged the message, Master [23:03:06] PROBLEM - mysqld processes on db1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:05:01] New patchset: Reedy; "Remove some old stuff from CommonSettings too" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46641 [23:08:24] New patchset: Pyoungmeister; "switching es1 shard to coredb module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46642 [23:11:36] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [23:12:08] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [23:14:06] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/46642 [23:14:24] paravoid: have you seen http://ceph.com/community/ceph-bobtail-performance-io-scheduler-comparison/#4mbradoswrite ? 
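Since the wikitech 500s above were tracked down by reading the apache log, a quick sketch of that kind of check; the log paths are typical Debian/Ubuntu defaults and may not match the actual host:

    # look for recent PHP fatals behind HTTP 500s
    tail -n 200 /var/log/apache2/error.log | grep -i 'PHP Fatal'
    # or, where apache logs through syslog as in the line quoted above
    grep 'apache2: PHP Fatal' /var/log/syslog | tail -n 20

In this case the quoted fatal points at DairikiDiff.php hitting the 30-second execution limit, presumably while diffing the very large SAL page edits mentioned above.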
[23:14:39] might be interesting [23:17:42] !log pgehres synchronized php-1.21wmf7/extensions/CentralNotice/ 'Updating CentralNotice to master, resolving namespace issues' [23:17:43] Logged the message, Master [23:18:46] PROBLEM - Host labstore2 is DOWN: PING CRITICAL - Packet loss = 100% [23:19:07] RECOVERY - Host labstore2 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms [23:21:28] !log pgehres synchronized php-1.21wmf8/extensions/CentralNotice/ 'Updating CentralNotice to master, resolving namespace issues' [23:21:28] Logged the message, Master [23:27:17] PROBLEM - Host labstore4 is DOWN: PING CRITICAL - Packet loss = 100% [23:27:46] PROBLEM - Host labstore1 is DOWN: PING CRITICAL - Packet loss = 100% [23:28:35] PROBLEM - Host labstore4 is DOWN: PING CRITICAL - Packet loss = 100% [23:29:16] RECOVERY - Host labstore1 is UP: PING OK - Packet loss = 0%, RTA = 26.62 ms [23:31:45] RECOVERY - Host labstore4 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [23:31:46] RECOVERY - Host labstore4 is UP: PING OK - Packet loss = 0%, RTA = 26.60 ms [23:37:07] New patchset: Mwalker; "Add CNBanner namespace by hand" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46647 [23:38:06] Change merged: Pgehres; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46647 [23:41:19] Another potential fallout issue from datacenter migration: https://bugzilla.wikimedia.org/show_bug.cgi?id=44485 - fatalmonitor currently displays only errors from testwiki [23:41:23] Would anybody be willing to investigate? [23:42:43] MaxSem: Critical? Really? [23:42:57] The apache syslogs are presumably being written somewhere else... [23:43:04] Reedy, what if there were a problem during deployment? [23:43:16] fatal logs on fluorine? [23:43:25] exception logs on fluorine? [23:43:41] I already had to revert an undeployed config change because I wasn't confident of deploying it w/o fatalmonitor [23:43:41] I wonder if udp2log works properly cross-DC [23:43:55] Or if the traffic isn't filtered by some rule not expecting 10.64/16 addresses [23:44:33] Looking at puppet, the location is still /h/w/l/syslog/apache.log [23:44:47] So presumably the traffic not reaching the collector is the issue [23:45:15] ^demon: you about? I am looking at your access request ticket https://rt.wikimedia.org/Ticket/Display.html?id=4066 [23:45:30] mutante submitted a patchset that would do what you think i need. [23:45:37] RoanKattouw: Might be easier just moving the relevant puppet classes to bast1001 [23:46:01] bleh... nm [23:46:08] i see faidon commented with an alternative. [23:46:25] mutante: Did you wanna adjust your patchset accordingly? https://gerrit.wikimedia.org/r/#/c/42791/1 ? [23:46:30] New patchset: Reedy; "Cleanup of InitialiseSettings" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46638 [23:46:35] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46638 [23:46:53] Reedy: Possibly.... but we wouldn't want logging to break if/when we have to do a switchover to pmtpa [23:47:30] That's not likely [23:47:42] I guess we have fluorine for that [23:47:46] Wait wait [23:47:51] fluorine is *in eqiad* [23:47:53] wtf [23:47:54] ja [23:47:58] OK now I don't get it any more [23:48:05] what's wrong with fluorine? 
its been collecting logs [23:48:20] andre__ Another potential fallout issue from datacenter migration: https://bugzilla.wikimedia.org/show_bug.cgi?id=44485 - fatalmonitor currently displays only errors from testwiki [23:48:30] binasher: That's what's supposedly wrong [23:48:44] if eqiad is wrong, I don't wanna be right [23:48:51] * RoanKattouw_away actually goes away [23:49:07] binasher: apache syslogs weren't moved [23:49:12] looks at my tail at dberror.log on fluorine.. it has recent stuff: Tue Jan 29 23:38:16 UTC 2013 mw1129 commonswiki WikiPage::updateCategoryCounts 10.64.16.27 1213 Deadlock found when trying to get lock; try restarting transaction (10.64.16.27) [23:49:22] binasher: everything else seems fine [23:49:27] <^demon> RobH: `sudo -u apache strace` sounds sane, if everyone agrees. [23:49:51] binasher: /home/wikipedia/syslog/apache.log [23:50:09] binasher: manifests/misc/logging.pp [23:51:29] ^demon: i think so too, no one wants to chat or take it over [23:51:54] New patchset: Reedy; "Remove some old stuff from CommonSettings too" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46641 [23:53:01] New patchset: Reedy; "Remove some old stuff from CommonSettings too" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46641 [23:53:54] ^demon: so yea [23:53:54] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/46641 [23:54:00] if i can figure out how the hell to do it [23:54:04] im gonna do it and merge it. [23:54:16] its over 3 days old and only paravoid felt the need to comment on mutante's patchset [23:54:21] that means to me that everyone else in ops agrees [23:54:32] as the rule we all agreed on was 3 working days without argument = approval. [23:54:55] !log reedy synchronized wmf-config/ [23:54:56] Logged the message, Master [23:56:16] Reedy: the syslog problem is $syslog_remote_real in base.pp [23:56:31] eqiad apaches send their syslog stream to syslog.eqiad.wmnet [23:56:33] which doesn't exist [23:56:46] wheee [23:56:59] i wonder why syslog was never moved off fenari to fluorine [23:57:59] i guess the quickest logging fix would be to add a syslog.eqiad cname to fenari that can be moved to a local box later [23:58:32] rsyslogd on fluorine should take over though [23:59:02] Indeed [23:59:10] I know Tim migrated most of the stuff, just not everything [23:59:26] Could you add a cname for the time being? Or are you going to fix it properly? ;)
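To close the loop on the fix binasher sketches above: the quick version is a CNAME so syslog.eqiad.wmnet resolves again, plus a check that the apaches' forwarded stream actually arrives somewhere. The zone record, hostnames and destination file below are illustrative of the idea, not the exact records or paths that ended up being used:

    # 1. confirm the forwarding target currently fails to resolve
    host syslog.eqiad.wmnet
    # 2. illustrative record for the eqiad.wmnet zone, pointing the name at an existing
    #    collector (fenari, per the suggestion above) until a local box takes over:
    #        syslog    IN    CNAME    fenari.wikimedia.org.
    # 3. after the authdns update, send a test message from an eqiad apache
    logger -t syslog-test "cross-DC syslog check from $(hostname)"
    # 4. and confirm it shows up on the collector; the exact destination file depends on
    #    the collector's rsyslog rules (apache.log is the path mentioned above)
    grep syslog-test /home/wikipedia/syslog/apache.log | tail

If rsyslogd on fluorine is already listening, pointing the name there instead of fenari would skip the later migration step, which seems to be what "rsyslogd on fluorine should take over" is hinting at.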