[00:05:05] PROBLEM host: precise-test is DOWN address: i-0000022d CRITICAL - Host Unreachable (i-0000022d)
[00:05:05] compatibility between gluster 3.3 and 3.2 seems to work perfectly fine
[00:05:24] I just made two servers, made a volume, mounted it on two clients
[00:05:33] tested it was working normally
[00:05:46] then upgraded gluster on the servers
[00:05:53] and restarted their processes
[00:06:02] on the clients, the mounts still works
[00:06:03] *worked
[00:06:08] then I tried unmounting and remounting them
[00:06:19] tested that replication was working properly and such as well
[00:07:08] so, based on that, I think we're ready to upgrade gluster on labstore1-4
[00:07:22] and then push the package into the repo, so it'll get picked up by the instances
[00:07:31] I'll also dsh to all the instances and force an upgrade
[00:09:30] hm. I need to do the same on the dataset systems
[00:21:23] ugghh. load is spiking
[00:21:31] really need to upgrade the hardware
[00:25:33] PROBLEM dpkg-check is now: CRITICAL on webserver-lcarr i-00000134 output: DPKG CRITICAL dpkg reports broken packages
[00:25:58] PROBLEM dpkg-check is now: CRITICAL on bastion-restricted1 i-0000019b output: DPKG CRITICAL dpkg reports broken packages
[00:27:28] PROBLEM dpkg-check is now: CRITICAL on opengrok-web i-000001e1 output: DPKG CRITICAL dpkg reports broken packages
[00:30:24] RECOVERY dpkg-check is now: OK on webserver-lcarr i-00000134 output: All packages OK
[00:30:54] RECOVERY dpkg-check is now: OK on bastion-restricted1 i-0000019b output: All packages OK
[00:32:12] paravoid: util % of the disks on the virt hosts actually shows between 50-90%
[00:32:24] RECOVERY dpkg-check is now: OK on opengrok-web i-000001e1 output: All packages OK
[00:32:37] I wonder how much of that is due to swap
[00:32:48] it's *ALWAYS* swap.
[00:32:56] swap has two effects here
[00:33:06] well, memory pressure has two effects
[00:33:12] one is increased load due to swapping
[00:33:31] the other one is page cache eviction, which results in more I/O for the normal I/O of the VMs
[00:33:39] * Ryan_Lane nods
[00:35:13] we can probably live migrate some things across
[00:35:14] PROBLEM host: precise-test is DOWN address: i-0000022d CRITICAL - Host Unreachable (i-0000022d)
[00:35:20] we just need to be smart about it
[00:35:39] I have no idea why precise-test is down
[00:35:41] for sure, we can live migrate the largest instances across
[00:35:48] it could just be a nagios issue
[00:35:59] hm
[00:35:59] though that's ping
[00:36:17] let me see which instances we can move across
[00:36:29] first let me see which instances are hurting us the most
[00:36:38] and which hosts are the most overloaded
[00:42:03] PROBLEM Current Load is now: WARNING on bots-sql3 i-000000b4 output: WARNING - load average: 6.73, 6.62, 5.59
[01:01:01] PROBLEM Free ram is now: WARNING on mobile-enwp i-000000ce output: Warning: 18% free memory
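The compatibility test described at the top of this log (two servers, one replicated volume, two clients, then a server-side upgrade) would look roughly like the following with the gluster 3.x CLI. The hostnames, brick path, and volume name here are placeholders, not the ones actually used:

    # on server1: peer the second server and create a 2-way replicated volume
    gluster peer probe server2
    gluster volume create testvol replica 2 server1:/data/brick server2:/data/brick
    gluster volume start testvol

    # on each client: mount the volume and write a canary file
    mount -t glusterfs server1:/testvol /mnt/testvol
    echo hello > /mnt/testvol/canary

    # on each server, one at a time: upgrade the package and restart the daemon
    apt-get install glusterfs-server
    service glusterfs-server restart

    # back on the clients: the existing mount should still serve the file;
    # then unmount/remount and confirm replication still works
    cat /mnt/testvol/canary
    umount /mnt/testvol
    mount -t glusterfs server1:/testvol /mnt/testvol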
[01:04:29] paravoid: hm. that image must not be right in some kind of way
[01:06:41] PROBLEM host: precise-test is DOWN address: i-0000022d CRITICAL - Host Unreachable (i-0000022d)
[01:07:00] 37 ubuntu-12.04-precise qcow2 ovf 105385
[01:07:03] it's way too small
[01:07:11] glance index
[01:21:11] RECOVERY Free ram is now: OK on mobile-enwp i-000000ce output: OK: 29% free memory
[01:36:41] PROBLEM host: precise-test is DOWN address: i-0000022d CRITICAL - Host Unreachable (i-0000022d)
[01:54:47] drdee: you're going to kill me
[01:54:58] I mean almost definitely
[01:55:43] drdee: I thought I was on another page to delete precise-test, and it was reportcard1....
[01:56:16] please tell me you had that stuff backed up
[02:06:43] PROBLEM host: precise-test is DOWN address: i-0000022d CRITICAL - Host Unreachable (i-0000022d)
[02:07:04] RECOVERY Current Load is now: OK on bots-sql3 i-000000b4 output: OK - load average: 4.74, 4.35, 4.93
[02:17:31] !log puppet deleted instance precise-test
[02:17:35] Logged the message, Master
[02:17:47] !log reportcard stupidly deleted reportcard1 by accident
[02:17:48] Logged the message, Master
[02:33:54] PROBLEM Current Load is now: CRITICAL on precise-test i-00000231 output: Connection refused or timed out
[02:34:17] Ryan_Lane: okay, what's the problem with the image?
[02:34:24] dunno
[02:34:24] PROBLEM Current Users is now: CRITICAL on precise-test i-00000231 output: Connection refused by host
[02:34:27] I downloaded the same one
[02:34:30] and did the same thing as you
[02:34:40] did you wait until it was totally finished downloading?
[02:34:45] of course
[02:34:47] hm
[02:34:49] I dunno
[02:34:52] I'm not that big of an idiot
[02:34:55] did it work for you?
[02:34:56] heh
[02:34:57] yeah
[02:35:04] PROBLEM Disk Space is now: CRITICAL on precise-test i-00000231 output: Connection refused by host
[02:35:06] really?
[02:35:10] i did the same thing as you
[02:35:19] ah
[02:35:20] I see
[02:35:23] look in root :)
[02:35:42] you canceled once, didn't you? :)
[02:35:44] PROBLEM Free ram is now: CRITICAL on precise-test i-00000231 output: Connection refused by host
[02:35:50] I placed mine in /var/tmo
[02:35:51] oh?
[02:35:55] maybe I am that big of an idiot
[02:35:56] /var/tmp
[02:36:00] hahaha
[02:36:37] still, the one on /var/tmp is 219M
[02:36:42] while the .1 in /root is 216M
[02:36:51] odd
[02:36:54] PROBLEM Total Processes is now: CRITICAL on precise-test i-00000231 output: Connection refused by host
[02:37:06] ugh. it's so fucking slow
[02:37:14] precise?
[02:37:16] or everything?
[02:37:21] no. everything
[02:37:28] yes it is
[02:37:30] I'm force running a package upgrade on all instances
[02:37:34] but, they are running serially
[02:37:38] there's no reason for this
[02:37:53] PROBLEM dpkg-check is now: CRITICAL on precise-test i-00000231 output: Connection refused by host
[02:38:27] I'm going to live migrate some instances
[02:38:33] how?
[02:39:36] nova-manage vm live_migration i-000000e2 virt5
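The live migration command above is the one actually used; a minimal sketch of moving a single instance and checking where it ended up, assuming KVM/libvirt compute hosts:

    # migrate one instance at a time to a less loaded compute host
    nova-manage vm live_migration i-000000e2 virt5

    # on the target host, the guest should now show up under libvirt
    # (the exact domain name depends on nova's instance_name_template setting)
    virsh list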
[02:40:48] and what's the thing that doesn't work?
[02:41:03] PROBLEM Current Load is now: WARNING on bots-sql3 i-000000b4 output: WARNING - load average: 6.40, 6.19, 5.40
[02:41:38] well, it does work
[02:42:43] RECOVERY Total Processes is now: OK on scribunto i-0000022c output: PROCS OK: 125 processes
[02:42:53] RECOVERY dpkg-check is now: OK on scribunto i-0000022c output: All packages OK
[02:42:54] so, it says it just migrated it
[02:43:26] so, it worked fine
[02:43:43] RECOVERY Free ram is now: OK on scribunto i-0000022c output: OK: 88% free memory
[02:43:57] as long as you only do it on one at a time, it's fine
[02:44:03] RECOVERY Current Load is now: OK on scribunto i-0000022c output: OK - load average: 0.25, 0.62, 0.51
[02:44:14] for some reason the fucking scheduler launched the precise instance on virt4, not on virt5
[02:44:25] RECOVERY Current Users is now: OK on scribunto i-0000022c output: USERS OK - 2 users currently logged in
[02:45:53] RECOVERY Disk Space is now: OK on scribunto i-0000022c output: DISK OK
[02:47:41] though, some really busy instances can take ages
[02:47:51] RECOVERY dpkg-check is now: OK on gluster-client2 i-00000228 output: All packages OK
[02:47:51] and it'll pause them
[02:48:19] RECOVERY Current Load is now: OK on gluster-client2 i-00000228 output: OK - load average: 0.53, 0.83, 0.38
[02:48:42] RECOVERY Current Users is now: OK on gluster-client2 i-00000228 output: USERS OK - 0 users currently logged in
[02:49:17] RECOVERY Disk Space is now: OK on gluster-client2 i-00000228 output: DISK OK
[02:49:43] RECOVERY Free ram is now: OK on gluster-client2 i-00000228 output: OK: 87% free memory
[02:51:29] RECOVERY Total Processes is now: OK on gluster-client2 i-00000228 output: PROCS OK: 82 processes
[02:52:44] RECOVERY Current Load is now: OK on gluster-server2 i-0000022a output: OK - load average: 0.20, 0.07, 0.02
[02:52:54] RECOVERY Current Users is now: OK on gluster-server2 i-0000022a output: USERS OK - 1 users currently logged in
[02:53:12] RECOVERY dpkg-check is now: OK on gluster-server2 i-0000022a output: All packages OK
[02:53:12] RECOVERY Disk Space is now: OK on gluster-server2 i-0000022a output: DISK OK
[02:53:24] RECOVERY Free ram is now: OK on gluster-server2 i-0000022a output: OK: 81% free memory
[02:56:24] PROBLEM dpkg-check is now: UNKNOWN on gluster-server1 i-00000229 output: Invalid host name i-00000229
[02:59:34] PROBLEM host: gluster-client2 is DOWN address: i-00000228 check_ping: Invalid hostname/address - i-00000228
[02:59:54] PROBLEM host: gluster-client1 is DOWN address: i-00000227 check_ping: Invalid hostname/address - i-00000227
[03:01:25] labs-nagios-wm_: catch up, you, I deleted them already ;)
[03:21:05] RECOVERY Current Load is now: OK on bots-sql3 i-000000b4 output: OK - load average: 1.57, 2.61, 4.41
[03:28:05] RECOVERY Disk Space is now: OK on wikidata-dev-3 i-00000225 output: DISK OK
[03:28:05] RECOVERY Total Processes is now: OK on wikidata-dev-3 i-00000225 output: PROCS OK: 99 processes
[03:28:10] RECOVERY dpkg-check is now: OK on wikidata-dev-3 i-00000225 output: All packages OK
[03:28:55] RECOVERY Current Load is now: OK on wikidata-dev-3 i-00000225 output: OK - load average: 0.15, 0.38, 0.32
[03:29:25] RECOVERY Current Users is now: OK on wikidata-dev-3 i-00000225 output: USERS OK - 0 users currently logged in
[03:30:55] RECOVERY Free ram is now: OK on wikidata-dev-3 i-00000225 output: OK: 75% free memory
[03:37:08] RECOVERY Current Users is now: OK on pediapress-ocg2 i-00000226 output: USERS OK - 0 users currently logged in
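The dpkg-check recoveries above are the forced package upgrade from earlier in the log landing on each instance. Pushed out over dsh, and parallelized so it does not run serially, that kind of run might look roughly like this; the group name "all-instances" is a placeholder, not the real dsh group:

    # fan the upgrade out concurrently across the instance group
    dsh -g all-instances -M -c -- \
        'sudo apt-get update && sudo DEBIAN_FRONTEND=noninteractive apt-get -y upgrade'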
[03:37:55] RECOVERY Disk Space is now: OK on pediapress-ocg2 i-00000226 output: DISK OK
[03:38:05] RECOVERY Free ram is now: OK on pediapress-ocg2 i-00000226 output: OK: 84% free memory
[03:39:05] RECOVERY Total Processes is now: OK on pediapress-ocg2 i-00000226 output: PROCS OK: 82 processes
[03:39:25] RECOVERY dpkg-check is now: OK on pediapress-ocg2 i-00000226 output: All packages OK
[03:40:55] RECOVERY Current Load is now: OK on pediapress-ocg2 i-00000226 output: OK - load average: 0.08, 0.31, 0.18
[03:42:25] PROBLEM Free ram is now: WARNING on utils-abogott i-00000131 output: Warning: 15% free memory
[03:49:25] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f output: Warning: 14% free memory
[03:51:05] PROBLEM Free ram is now: WARNING on nova-daas-1 i-000000e7 output: Warning: 14% free memory
[03:53:25] PROBLEM Free ram is now: WARNING on test-oneiric i-00000187 output: Warning: 14% free memory
[03:57:25] PROBLEM Free ram is now: CRITICAL on utils-abogott i-00000131 output: Critical: 5% free memory
[04:07:25] RECOVERY Free ram is now: OK on utils-abogott i-00000131 output: OK: 97% free memory
[04:08:25] PROBLEM Free ram is now: CRITICAL on test-oneiric i-00000187 output: Critical: 5% free memory
[04:09:25] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f output: Critical: 3% free memory
[04:11:05] PROBLEM Free ram is now: CRITICAL on nova-daas-1 i-000000e7 output: Critical: 5% free memory
[04:13:25] RECOVERY Free ram is now: OK on test-oneiric i-00000187 output: OK: 97% free memory
[04:14:25] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f output: OK: 96% free memory
[04:16:05] PROBLEM Free ram is now: CRITICAL on test3 i-00000093 output: Critical: 2% free memory
[04:21:05] RECOVERY Free ram is now: OK on nova-daas-1 i-000000e7 output: OK: 93% free memory
[04:21:05] RECOVERY Free ram is now: OK on test3 i-00000093 output: OK: 96% free memory
[07:25:10] Platonides: he
[07:25:24] deployment is broken
[07:25:42] it doesn't look up IP
[08:08:15] PROBLEM Puppet freshness is now: CRITICAL on swift-be4 i-000001ca output: Puppet has not run in last 20 hours
[10:39:05] PROBLEM Free ram is now: WARNING on mobile-enwp i-000000ce output: Warning: 15% free memory
[10:49:11] RECOVERY Free ram is now: OK on mobile-enwp i-000000ce output: OK: 30% free memory
[11:02:11] PROBLEM Free ram is now: WARNING on mobile-enwp i-000000ce output: Warning: 17% free memory
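The free-ram flapping above ties back to the earlier point about memory pressure on the virt hosts: swapping raises load directly, and page cache eviction turns the guests' ordinary I/O into extra disk I/O. A quick way to check both on a host, using only standard procps/sysstat tools:

    free -m        # memory and swap actually in use
    vmstat 5       # si/so columns: pages swapped in/out per interval
    iostat -x 5    # %util per disk, the 50-90% figure quoted earlier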
[11:53:55] Reedy: some idea why the deployment site started to see the local IP of the squid server for all users as their IP, since we updated the puppet class?
[11:54:17] did you see hashar?
[11:54:23] because I think it's his work
[11:56:36] It's probably because you haven't updated $wgSquidServers
[12:33:49] Change on mediawiki a page Wikimedia Labs/status was modified, changed by 101.160.61.96 link https://www.mediawiki.org/w/index.php?diff=530071 edit summary: /* 2012-04-21 */ new section
[12:33:50] Change on mediawiki a page Wikimedia Labs/status was modified, changed by 101.160.61.96 link https://www.mediawiki.org/w/index.php?diff=530071 edit summary: /* 2012-04-21 */ new section
[12:35:20] Change on mediawiki a page Wikimedia Labs/status was modified, changed by SVG link https://www.mediawiki.org/w/index.php?diff=530072 edit summary: Undo revision 530071 by [[Special:Contributions/101.160.61.96|101.160.61.96]]
[12:35:21] Change on mediawiki a page Wikimedia Labs/status was modified, changed by SVG link https://www.mediawiki.org/w/index.php?diff=530072 edit summary: Undo revision 530071 by [[Special:Contributions/101.160.61.96|101.160.61.96]]
[12:56:48] @RC- mediawiki Wikimedia Labs/status
[12:56:49] Can't find item in a list
[12:56:53] @RC- mediawiki WikimediaLabs/status
[12:56:53] Can't find item in a list
[12:57:13] :O
[12:57:29] @RC- mediawiki Wikimedia_Labs/status
[12:57:30] Deleted item from feed
[12:57:34] here we go
[13:13:16] New review: Dzahn; "looks ok to me now. note though that using the PPA repos will not be possible if you plan to ever ge..." [operations/puppet] (test); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5599
[13:13:20] Change merged: Dzahn; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/5599
[13:21:26] New review: Dzahn; "and for future enhancement: would be nicer to split the "add the repo"-part into it's own class." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/5599
[15:16:11] PROBLEM Free ram is now: WARNING on mobile-enwp i-000000ce output: Warning: 19% free memory
[15:26:11] RECOVERY Free ram is now: OK on mobile-enwp i-000000ce output: OK: 26% free memory
[17:05:13] PROBLEM Free ram is now: CRITICAL on mobile-enwp i-000000ce output: Critical: 5% free memory
[17:10:13] PROBLEM Free ram is now: WARNING on mobile-enwp i-000000ce output: Warning: 15% free memory
[17:20:13] RECOVERY Free ram is now: OK on mobile-enwp i-000000ce output: OK: 27% free memory
[17:28:18] * Ryan_Lane sighs
[17:28:26] nslcd.conf has changed in precise
[17:28:42] map group uniquemember member <— doesn't work and causes the service to fail
[17:28:50] apparently member is now the default
[17:29:02] I'm still having some issues with precise :(
[17:29:03] that is *fucking ridiculous*
[17:29:52] why not just ignore it?
[17:30:03] * Ryan_Lane shakes his fist in the air
[17:31:05] * ^demon shakes his fist in solidarity, but for totally unrelated reasons
[17:31:27] * Damianz is still shaking his fist for ubuntu dropping his keyboard on every upgrade
[17:32:35] It's rather awesome when you have to grant something access via the something that needs access, which results in you then having to search through the rack to find a USB keyboard
[17:37:26] Change on mediawiki a page Wikimedia Labs/status was modified, changed by Sumanah link https://www.mediawiki.org/w/index.php?diff=530135 edit summary: latest
[17:57:02] <^demon> Ryan_Lane: I updated [[wikitech:Gerrit]] with some extra links and a "how do I upgrade" section.
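For the nslcd.conf breakage mentioned above: on precise the "map group uniquemember member" line makes nslcd fail to start, and member is already the default attribute, so the line can simply be dropped. On Labs the file is presumably puppet-managed, so the real fix belongs in the template, but the manual equivalent would be roughly:

    # comment out the now-invalid mapping and restart nslcd
    sed -i '/^map group uniquemember member/s/^/# /' /etc/nslcd.conf
    service nslcd restart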
[18:09:13] PROBLEM Puppet freshness is now: CRITICAL on swift-be4 i-000001ca output: Puppet has not run in last 20 hours
[18:59:18] 04/27/2012 - 18:59:18 - Creating a home directory for otto at /export/home/nginx/otto
[19:00:21] 04/27/2012 - 19:00:21 - Updating keys for otto
[19:30:53] RECOVERY Disk Space is now: OK on precise-test i-00000231 output: DISK OK
[19:30:53] RECOVERY Free ram is now: OK on precise-test i-00000231 output: OK: 92% free memory
[19:31:53] RECOVERY Total Processes is now: OK on precise-test i-00000231 output: PROCS OK: 85 processes
[19:32:53] RECOVERY dpkg-check is now: OK on precise-test i-00000231 output: All packages OK
[19:33:43] RECOVERY Current Load is now: OK on precise-test i-00000231 output: OK - load average: 0.16, 0.47, 0.30
[19:34:23] RECOVERY Current Users is now: OK on precise-test i-00000231 output: USERS OK - 2 users currently logged in
[22:46:07] New patchset: Sara; "Iteration of adding ganglia webfrontend for labs." [operations/puppet] (test) - https://gerrit.wikimedia.org/r/6048
[22:46:21] New review: gerrit2; "Lint check passed." [operations/puppet] (test); V: 1 - https://gerrit.wikimedia.org/r/6048
[22:48:21] New review: Sara; "(no comment)" [operations/puppet] (test); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6048
[22:48:24] Change merged: Sara; [operations/puppet] (test) - https://gerrit.wikimedia.org/r/6048
[22:59:04] Change on mediawiki a page OAuth/User stories was modified, changed by CSteipp link https://www.mediawiki.org/w/index.php?diff=530238 edit summary:
[23:14:04] Ryan_Lane: hypothetically speaking, would it be difficult to move a wiki from labs to the production cluster?
[23:16:28] well, we could create the wiki in production, export the content from labs, and add it to the production wiki
[23:16:31] Thehelpfulone: why's that?
[23:18:30] it's an idea for potential new sister projects Ryan_Lane
[23:19:32] I mean a way that sister projects could be developed and a community shown etc to make them viable to become a Foundation sister project
[23:20:02] so if it was set up with the idea that it may be imported into production, it could be doable?
[23:20:30] ah
[23:20:31] yeah
[23:20:53] possible. but we're entering dangerous territory there ;)
[23:21:36] we should talk to erik about that
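The export-then-import path sketched in that last answer maps onto MediaWiki's standard maintenance scripts. Roughly, and assuming shell access to both wikis (uploaded files and user accounts need a separate pass, since an XML dump does not carry them):

    # on the labs wiki: dump all pages with full history
    php maintenance/dumpBackup.php --full > labswiki-pages.xml

    # on the freshly created production wiki: import the dump, then rebuild derived tables
    php maintenance/importDump.php labswiki-pages.xml
    php maintenance/rebuildrecentchanges.php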