[00:05:50] PROBLEM Free ram is now: WARNING on bots-3 i-000000e5.pmtpa.wmflabs output: Warning: 16% free memory [00:29:32] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [00:59:32] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [01:29:32] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [01:45:12] RECOVERY Disk Space is now: OK on conventionextension-trial i-000003bf.pmtpa.wmflabs output: DISK OK [01:50:43] RECOVERY Free ram is now: OK on bots-3 i-000000e5.pmtpa.wmflabs output: OK: 21% free memory [01:53:13] PROBLEM Disk Space is now: WARNING on conventionextension-trial i-000003bf.pmtpa.wmflabs output: DISK WARNING - free space: / 77 MB (5% inode=51%): [01:59:32] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [02:29:32] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [02:40:43] RECOVERY Free ram is now: OK on bots-sql2 i-000000af.pmtpa.wmflabs output: OK: 20% free memory [02:53:42] PROBLEM Free ram is now: WARNING on bots-sql2 i-000000af.pmtpa.wmflabs output: Warning: 14% free memory [02:59:32] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [03:29:35] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [03:53:52] PROBLEM Free ram is now: WARNING on orgcharts-dev i-0000018f.pmtpa.wmflabs output: Warning: 15% free memory [03:59:33] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [04:18:57] PROBLEM Free ram is now: CRITICAL on orgcharts-dev i-0000018f.pmtpa.wmflabs output: Critical: 5% free memory [04:28:53] RECOVERY Free ram is now: OK on orgcharts-dev i-0000018f.pmtpa.wmflabs output: OK: 95% free memory [04:29:35] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [04:59:35] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [05:29:33] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [05:58:52] RECOVERY dpkg-check is now: OK on aggregator2 i-000002c0.pmtpa.wmflabs output: All packages OK [05:59:32] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [05:59:32] RECOVERY Total processes is now: OK on aggregator2 i-000002c0.pmtpa.wmflabs output: PROCS OK: 229 processes [05:59:42] PROBLEM Free ram is now: WARNING on aggregator2 i-000002c0.pmtpa.wmflabs output: Warning: 8% free memory [06:00:55] RECOVERY Disk Space is now: OK on aggregator2 i-000002c0.pmtpa.wmflabs output: DISK OK [06:01:32] RECOVERY Current Users is now: OK on aggregator2 i-000002c0.pmtpa.wmflabs output: USERS OK - 0 users currently logged in [06:02:05] RECOVERY SSH is now: OK on aggregator2 i-000002c0.pmtpa.wmflabs output: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [06:02:22] RECOVERY 
Current Load is now: OK on aggregator2 i-000002c0.pmtpa.wmflabs output: OK - load average: 0.12, 0.24, 0.41 [06:29:36] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [06:32:25] PROBLEM Total processes is now: WARNING on parsoid-spof i-000004d6.pmtpa.wmflabs output: PROCS WARNING: 154 processes [06:38:52] PROBLEM Disk Space is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake. [06:39:32] PROBLEM Current Users is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake. [06:39:42] PROBLEM Free ram is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake. [06:40:25] PROBLEM Current Load is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake. [06:41:52] PROBLEM dpkg-check is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake. [06:42:32] PROBLEM Total processes is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: CHECK_NRPE: Error - Could not complete SSL handshake. [06:45:12] PROBLEM SSH is now: CRITICAL on aggregator2 i-000002c0.pmtpa.wmflabs output: Server answer: [06:47:22] RECOVERY Total processes is now: OK on parsoid-spof i-000004d6.pmtpa.wmflabs output: PROCS OK: 150 processes [06:52:38] Change on 12mediawiki a page Developer access was modified, changed by Devsundar link https://www.mediawiki.org/w/index.php?diff=596196 edit summary: [06:57:45] PROBLEM Disk Space is now: WARNING on mw1-21beta-lucid i-00000416.pmtpa.wmflabs output: DISK WARNING - free space: / 78 MB (5% inode=51%): [06:59:37] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [07:12:42] RECOVERY dpkg-check is now: OK on wikidata-dev-2 i-00000259.pmtpa.wmflabs output: All packages OK [07:29:33] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [07:36:31] hello [07:36:36] hi [07:36:41] !ping [07:36:41] pong [07:40:25] PROBLEM Total processes is now: WARNING on parsoid-spof i-000004d6.pmtpa.wmflabs output: PROCS WARNING: 151 processes [07:43:22] hi [07:44:52] When you install mediawiki on a labs instance do you use the puppet thing called "role::mediawiki-install::labs"?? [07:45:21] * Silke_WMDE is determined to understand this puppet thing now [07:46:05] I'm trying to write a puppet module for wikidata. [07:47:57] For which I obviously need a fresh mediawiki. For which there is probably some existing puppet stuff already there... [07:59:33] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [08:00:44] Silke_WMDE: I think there is a mediawiki_new puppet class [08:01:23] yes. It looks... short. [08:05:20] ah! it comes from a special apt source. I see. 
[08:18:17] Silke_WMDE: so that is a module in puppet modules/mediawiki_new/ but you probably figured that out :-] [08:18:30] yep [08:18:32] I have no experience with it myself [08:18:37] might just be for production usage [08:18:44] since it uses sync-common [08:19:39] ohh Silke_WMDE there is also a role::labs-mediawiki-install [08:19:55] in manifests/role/labsmw.pp [08:20:18] and a role::mediawiki-install::labs in manifests/role/labsmediawiki.pp [08:20:23] ah, I didn't find out where that one came from, thx! [08:20:40] not sure which one should be used. The later seems to get more content [08:21:17] yeah role::labs-mediawiki-install is marked about how it should not be used [08:21:27] so role::mediawiki-install::labs is most probably what you want [08:21:50] errr [08:22:12] role::mediawiki-install::labs is the one that is integrated into the clicky interface [08:22:33] seems to be the correct one [08:23:03] you apparently need to add a password in /srv/mediawiki/orig/adminpass [08:23:16] Andrew most probably wrote some documentation about that class somewhere [08:23:46] which timezone is Andrew in? [08:25:37] And you, is it early, late or something in between for you? [08:29:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [08:34:20] Silke_WMDE: I think Andrew is in SF [08:36:10] Silke_WMDE: seems so according to his linked in profile [08:36:24] I never had any interaction with him :-( [08:39:18] hashar: hey! have you checked out the vagrant thing i posted to wikitech-l the other week by any chance? [08:39:31] ori-l: not at all :-] [08:39:45] happily delegating that to the feature team and volunteers to test it out :-] [08:39:56] definitely something I need to look at though [08:40:15] https://github.com/wikimedia/wmf-vagrant <- spins up ubuntu vm and provisions working mediawiki instance using puppet [08:41:30] ori-l: so the vagrant VM boot with those puppet files ? [08:42:41] it starts with a base box, configures it as vm for virtualbox, will mount the puppet modules and manifest as a shared folder inside the vm, and run puppet to provision the machine [08:43:11] nice [08:43:17] how long does it take to get the VM ready? [08:43:35] port 8080 on the host (i.e., your machine) is mapped to port 80 in the guest, so if you browse to 127.0.0.1 you get mediawiki served by the vm [08:43:49] depends if you have the ubuntu base image or not. you need to get that once. that's about 300mb iirc [08:43:54] after that it's about 2-3 minutes [08:44:08] ahh [08:44:11] only because i include an "apt-get update" run [08:44:23] would it be possible to have a VM which already has been bootstrapped ? [08:44:26] that might speed up thing [08:44:47] then whenever a change is made to the wikimedia/wmf-vagrant repo, we could recreate a new VM file [08:44:58] that might make it faster to boot a new instance [08:44:59] (pure supposition) [08:45:03] yes, just provision it once and then "vagrant package" [08:45:33] supposedly, that prebuilt vm will be faster to boot up :) [08:45:56] can you somehow pass parameters to "vagrant up" ? [08:45:57] maybe i'm misunderstanding. once the vm is provisioned you can suspend / boot it in seconds [08:46:21] you can, but most of the behavior is configured in the Vagrantfile (see repo root) [08:46:35] yeah fast boot is what we will want for continuous integration. We will want to simply copy an existing file then resume it and start tests. 
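A minimal sketch of the wmf-vagrant workflow ori-l describes above, including the "vagrant package" step suggested for producing a prebuilt box. The clone URL is the one linked above; everything else is stock Vagrant usage:

  # First run fetches the ~300 MB Ubuntu base box once, then provisions MediaWiki with puppet (about 2-3 minutes).
  git clone https://github.com/wikimedia/wmf-vagrant.git
  cd wmf-vagrant
  vagrant up
  # Host port 8080 is forwarded to guest port 80, so the wiki is served at http://127.0.0.1:8080/
  # A provisioned VM suspends and resumes in seconds:
  vagrant suspend
  vagrant resume
  # Export the provisioned VM as a reusable box so later machines can skip the slow provisioning step:
  vagrant package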
[08:46:39] brb [08:54:22] ori-l: sorry daughter is sick :-] [08:54:35] ori-l: will follow up by email, I don't want to get to bed too late :-] [08:54:49] me neither. sorry to hear about your daughter [08:54:54] hope she feels better soon [08:54:58] ttyl [08:55:04] she is fine :-] [08:55:12] just has a bit of fever [08:55:13] (aka she is hot) [08:56:24] ah, good. get some rest then & ttyl [08:59:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [09:16:20] Change on 12mediawiki a page Developer access was modified, changed by Devsundar link https://www.mediawiki.org/w/index.php?diff=596215 edit summary: /* User:Devsundar */ [09:16:49] Change on 12mediawiki a page Developer access was modified, changed by Devsundar link https://www.mediawiki.org/w/index.php?diff=596216 edit summary: /* User:Devsundar */ [09:29:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [09:59:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [10:00:48] Change on 12mediawiki a page Developer access was modified, changed by Kota Tolikara Papua link https://www.mediawiki.org/w/index.php?diff=596224 edit summary: /* Developer Tolikara */ new section [10:02:43] RECOVERY Disk Space is now: OK on mw1-21beta-lucid i-00000416.pmtpa.wmflabs output: DISK OK [10:10:43] PROBLEM Disk Space is now: WARNING on mw1-21beta-lucid i-00000416.pmtpa.wmflabs output: DISK WARNING - free space: / 78 MB (5% inode=51%): [10:25:53] PROBLEM Current Load is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: WARNING - load average: 5.22, 5.33, 5.09 [10:29:43] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [10:35:54] RECOVERY Current Load is now: OK on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: OK - load average: 4.16, 4.52, 4.83 [10:53:38] Change on 12mediawiki a page Developer access was modified, changed by Tegel link https://www.mediawiki.org/w/index.php?diff=596235 edit summary: Reverted edits by [[Special:Contributions/Kota Tolikara Papua|Kota Tolikara Papua]] ([[User talk:Kota Tolikara Papua|talk]]) to last revision by [[User:Devsundar|Devsundar]] [11:02:05] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [11:32:02] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [12:02:05] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [12:03:51] Hey puppet people! I get an error message referring to /etc/puppet/manifests/generic-definitions.pp - a timeout. Any idea what could be wrong? [12:04:27] It happens when trying to make puppet pull from git. [12:04:42] PROBLEM Free ram is now: WARNING on bots-3 i-000000e5.pmtpa.wmflabs output: Warning: 15% free memory [12:04:55] PROBLEM Free ram is now: WARNING on dumps-bot1 i-000003ed.pmtpa.wmflabs output: Warning: 19% free memory [12:19:58] Pulling an extension works while pulling core doesn't. 
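A small diagnostic sketch for Silke_WMDE's timeout: run the same clone by hand on the instance, which separates a puppet problem from a problem with the git transfer itself. The URL is gerrit's anonymous-HTTP path for core (quoted later in this log); the target directory is illustrative.

  # If this also stalls or dies partway through, the timeout is on the git/gerrit side
  # (see the Apache proxy timeout discussion further down), not in the puppet manifest.
  time git clone https://gerrit.wikimedia.org/r/p/mediawiki/core.git /tmp/core-clone-test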
[12:34:13] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [13:04:22] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [13:05:34] PROBLEM Total processes is now: WARNING on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS WARNING: 196 processes [13:10:32] RECOVERY Total processes is now: OK on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS OK: 103 processes [13:15:17] @seen Ryan_Lane [13:15:17] petan: Last time I saw Ryan_Lane they were quiting the network N/A at 10/22/2012 7:19:18 AM (05:55:58.9526000 ago) [13:29:18] andrewbogott_afk: [13:29:24] Oops [13:30:19] andrewbogott_afk: I have got a question concerning the puppet file labsmediawiki.pp which you might have written. [13:30:53] Would be cool if we could chat for a second. [13:34:22] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [13:54:42] RECOVERY Free ram is now: OK on bots-3 i-000000e5.pmtpa.wmflabs output: OK: 21% free memory [14:04:25] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [14:34:32] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [14:46:52] PROBLEM Current Load is now: WARNING on parsoid-roundtrip3 i-000004d8.pmtpa.wmflabs output: WARNING - load average: 6.70, 6.92, 5.87 [14:56:53] RECOVERY Current Load is now: OK on parsoid-roundtrip3 i-000004d8.pmtpa.wmflabs output: OK - load average: 4.00, 4.20, 4.88 [15:04:32] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [15:04:32] PROBLEM Total processes is now: WARNING on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS WARNING: 196 processes [15:26:00] Change on 12mediawiki a page Developer access was modified, changed by Jimmy xu wrk link https://www.mediawiki.org/w/index.php?diff=596282 edit summary: [15:28:57] PROBLEM Current Load is now: CRITICAL on mwreview-param i-000004da.pmtpa.wmflabs output: Connection refused by host [15:29:39] RECOVERY Total processes is now: OK on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS OK: 103 processes [15:29:39] PROBLEM Current Users is now: CRITICAL on mwreview-param i-000004da.pmtpa.wmflabs output: Connection refused by host [15:30:22] PROBLEM Disk Space is now: CRITICAL on mwreview-param i-000004da.pmtpa.wmflabs output: Connection refused by host [15:30:52] PROBLEM Free ram is now: CRITICAL on mwreview-param i-000004da.pmtpa.wmflabs output: Connection refused by host [15:32:32] PROBLEM Total processes is now: CRITICAL on mwreview-param i-000004da.pmtpa.wmflabs output: Connection refused by host [15:33:14] PROBLEM dpkg-check is now: CRITICAL on mwreview-param i-000004da.pmtpa.wmflabs output: Connection refused by host [15:34:32] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [15:34:42] RECOVERY Current Users is now: OK on mwreview-param i-000004da.pmtpa.wmflabs output: USERS OK - 1 users currently logged in [15:35:22] RECOVERY Disk Space is now: OK on mwreview-param i-000004da.pmtpa.wmflabs output: DISK OK [15:35:52] RECOVERY Free ram is now: OK on mwreview-param i-000004da.pmtpa.wmflabs output: OK: 931% free 
memory [15:37:32] RECOVERY Total processes is now: OK on mwreview-param i-000004da.pmtpa.wmflabs output: PROCS OK: 93 processes [15:38:53] RECOVERY Current Load is now: OK on mwreview-param i-000004da.pmtpa.wmflabs output: OK - load average: 1.43, 1.03, 0.68 [15:43:12] RECOVERY dpkg-check is now: OK on mwreview-param i-000004da.pmtpa.wmflabs output: All packages OK [16:04:33] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [16:34:42] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [17:04:43] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [17:07:43] PROBLEM Free ram is now: WARNING on bots-3 i-000000e5.pmtpa.wmflabs output: Warning: 19% free memory [17:16:18] Change on 12mediawiki a page Developer access was modified, changed by Sharihareswara (WMF) link https://www.mediawiki.org/w/index.php?diff=596304 edit summary: /* User:Devsundar */ [17:32:42] RECOVERY Free ram is now: OK on bots-3 i-000000e5.pmtpa.wmflabs output: OK: 21% free memory [17:35:14] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [17:40:29] !log deployment-prep csteipp: cherry picked in LBFactory_Multi config [18:06:15] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [18:28:32] PROBLEM Current Users is now: WARNING on parsoid-spof i-000004d6.pmtpa.wmflabs output: USERS WARNING - 6 users currently logged in [18:35:22] PROBLEM Total processes is now: WARNING on parsoid-spof i-000004d6.pmtpa.wmflabs output: PROCS WARNING: 153 processes [18:37:02] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [18:41:52] RECOVERY Current Load is now: OK on parsoid-roundtrip4 i-000004d7.pmtpa.wmflabs output: OK - load average: 0.92, 3.52, 4.89 [18:57:53] csteipp: hey :-] [18:58:06] csteipp: so yeah I guess voyage on beta will need a wgLBFactoryConf configuration like in production [18:58:16] csteipp: wmf-config/db-wmflabs.php [18:58:20] Ryan_Lane, hello sir [18:58:25] howdy [18:58:25] csteipp: that is a bit of work though :-/ [18:58:29] I think so... I cherrypicked in the change, and it's working [19:07:02] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [19:07:32] PROBLEM Total processes is now: WARNING on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS WARNING: 196 processes [19:08:45] * Damianz streatches [19:12:00] Jan_Luca: sorry that I keep making this same mistake… is this you, or the other Jan? https://gerrit.wikimedia.org/r/#/c/28355/1 [19:13:12] There are 2 Jan's? [19:14:09] hmm reminds me I should merge in my puppet stuff eventhough it's a bit shite [19:14:12] Jan Gerber, Jan Luca. [19:15:25] csteipp: oh you already submitted a change for wikivoyage LBFactoryConf. I reviewed it https://gerrit.wikimedia.org/r/#/c/29344/ :-] Seems almost good to me [19:22:43] Ryan_Lane: Can you tell me (again) your thoughts about how to avoid the Apache timeout when cloning MW? You thought we should try to host a repo on shared storage? 
[19:23:12] andrewbogott: ^demon's idea was to enable gerrit replication to labs [19:23:21] then you could clone from a global read-only share [19:23:26] which would be way faster [19:23:41] As long as our storage isn't crappy that is :) [19:23:48] *hehem nfs* [19:23:56] it would be nfs [19:24:01] but nfs shared from gluster [19:24:09] not a central instance [19:24:17] andrewbogott: That's me [19:24:23] Damianz: and yes, I need to take care of that before I leave for vacation [19:24:23] dumb question… do we even have a labs-wide sharedfs at the moment? [19:24:25] yes [19:24:31] public data [19:24:37] Jan_Luca: Ok! Just wanted to let you know that I have the refactor in process; I'm testing now. [19:25:01] Ryan_Lane: Hm, thought that was project-specific for some reason. I guess I haven't been paying attention :) [19:25:05] Ryan_Lane: :) [19:25:08] ok but I go sleep now [19:25:20] andrewbogott: I think git clone as a way to only clone a single branch (aka master), that might save up some time when cloning. [19:25:23] andrewbogott: oh and Hi :-] [19:25:26] Jan_Luca: Sure; I'll attach my updated patch to your patchset. I just wanted to avoid you duplicating effort. [19:25:43] andrewbogott: it has the public dumps and datasets [19:25:56] andrewbogott: also I have been asked by the wikidata folks if you have any documentation about the mediawiki puppet class. Maybe on labsconsole ? [19:25:57] it needs to be global and read-only because it comes from another source [19:26:04] I still wonder how well beta pulling from git is going to work long term with migrations/cache puring etc =/ [19:26:38] Ryan_Lane: OK, I will read about gerrit replication. [19:26:53] Damianz: we are going to use Gerrit build-in replication system instead of the lame "poll every 3 minutes" loop [19:27:08] hashare: Was that Silke? [19:27:14] andrewbogott: yeah Silke [19:27:23] hashar: Well yeah that would help, I was thinking more maintaince that needs running to stop updates borking the site though. [19:27:39] andrewbogott: I have pointed her to one of the class that seemed to be up to date. Something like role::labs-install::mediawiki [19:27:58] OK -- we exchanged emails this morning, hopefully He/She (?) is caught up. The documentation such as it is is here: https://labsconsole.wikimedia.org/wiki/Help:InstanceConfigMediawiki [19:27:58] andrewbogott: talk with ^demon|lunch about it [19:28:11] Ryan_Lane: ok! [19:28:21] <^demon|lunch> What are we replicating? [19:28:27] ^demon|lunch: gerrit to labs [19:28:35] ^demon|lunch, shouldn't you be eating? [19:28:41] <^demon|lunch> I stopped eating like 2.5h ago :p [19:29:03] lunch is a state of mind [19:29:18] I guess it should be ^demon|teatime where you are [19:29:46] We use to have snacks in the channel, Ryan would be food for a few hours a day :D [19:30:16] ^demon|lunch Right now I pretty much can't clone mediawiki into a labs instance because Apache times out. I guess Gerrit can push an up-to-date repo into labs storage? [19:30:18] * Ryan_Lane Ryan|flamebait [19:30:20] err [19:31:15] <^demon|lunch> andrewbogott: Should be feasible. All gerrit needs is a receiving box that has SSH + Git. [19:31:24] PROBLEM Disk Space is now: WARNING on labs-nfs1 i-0000005d.pmtpa.wmflabs output: DISK WARNING - free space: /export 1006 MB (5% inode=59%): /home/SAVE 1006 MB (5% inode=59%): [19:31:33] labs-home-wm: Blame Ryan :D [19:31:41] heh [19:31:53] <^demon|lunch> If https://gerrit.wikimedia.org/r/#/c/25508/ went in, this stuff would be easier. 
[19:32:07] <^demon|lunch> I made a new role that installs a gerritslave user + ssh key. [19:32:30] you have a −1 on it;) [19:32:41] <^demon|lunch> Yes I know. [19:32:44] <^demon|lunch> It's not perfect yet [19:32:50] ^demon|lunch: If you want me to just ask you about this again in two weeks that's perfectly acceptable. [19:33:04] No sense in solving the problem twice. [19:33:22] <^demon|lunch> Well most of the problems have been solved. Just need to clean up the rough edges on Ie91b0077. [19:33:33] <^demon|lunch> Then doing replication will be easy to in-cluster things. [19:34:10] Ryan_Lane: When the time comes, what host would you like us to use as the conduit to nfs? [19:34:16] hm [19:34:36] well, gluster would be configured for this [19:34:44] like publicdata is [19:34:57] yeah but gerrit won't have the volume mounted will it? [19:34:57] we can use the same project for that [19:35:17] ok [19:35:20] Hmm, as the automounts for gluster are in ldap do you have a global as well as per project ones? [19:35:40] Damianz: per-project ones are automount in a non-ghosted way [19:35:49] public datasets are keyed automounts [19:36:01] ah [19:36:02] so they are ghosted [19:36:12] can't ghost wildcard mounts [19:37:06] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [19:40:13] <^demon|lunch> There is a Github api client written in ActionScript. [19:40:18] <^demon|lunch> I was unaware anyone used ActionScript [19:40:33] ^demon|lunch: OK, I'm adding myself as a watcher on that patch and will bug you about replication once it's settled. [19:41:00] ^demon|lunch: o.0 [19:41:26] <^demon|lunch> https://github.com/cbrammer/api-github-as3 [19:41:48] meanwhile… hashar, do you know how I tell git to clone only a single branch vs. a whole rep? [19:42:40] <^demon|lunch> Clone is always going to grab all branches. You can explicitly tell it which branch you want to initially checkout though. [19:42:49] <^demon|lunch> `git clone -b ` [19:43:23] That's what I feared… I'm doing that already. [19:43:28] apparenetly --single-branch [19:43:33] need to try out [19:43:37] ooh! [19:43:38] * andrewbogott tries [19:43:41] Why would you want to only clone 1 branch? [19:43:53] another way, if you don't care about updating it, is to use "git snapshot" [19:44:02] which takes a copy at a specific revision [19:44:20] something like git snapshot REL1_19 [19:44:24] or v1.19.2 [19:44:26] Damianz: bandwidth. [19:44:50] Downloading the entire linux kernel repo doesn't take /that/ long. [19:44:58] so some puppet parameterized class could be written to accept a branch / tag as a parameter [19:45:15] Damianz: The problem is when puppet clones it's non-interactive, which makes Apache on the server impatient. [19:45:19] And it hangs up halfway through. [19:45:27] <^demon|lunch> We upped the timeout :\ [19:45:29] Puppet needs a life [19:45:46] ~demon|lunch: The timeout on the server? [19:45:49] <^demon|lunch> Yes. [19:45:55] ^demon|lunch: Is it /over 9000/? [19:46:03] Well, shit, why's it still happening then? Maybe I've misdiagnosed the problem. [19:46:12] <^demon|lunch> Dzahn and I raised the timeout to like 5 minutes awhile back. [19:46:29] andrewbogott: Did you forget to feed the poor hamsters again? [19:46:31] <^demon|lunch> `git clone --quiet ` to your local system can replicate it, most likely. Doesn't require puppet. [19:46:38] Are there maybe two timeouts, one for git server and one for apache? 
[19:46:43] <^demon|lunch> In any case, cloning core shouldn't take *nearly* so long. It's not that big a repo. [19:47:20] ^demon|lunch: I did have multiple timeouts when cloning from gallium (in eqiad iirc) using the CLI [19:47:23] though that was a few months ago [19:47:27] Tbf gerrit is fucking slow sometimes [19:47:48] <^demon|lunch> hashar: Yeah this was known. It was why we upped the Apache timeout like ~2mo ago [19:48:06] <^demon|lunch> Damianz: I'm actually pretty sure it's not gerrit's fault entirely. At least with MediaWiki core. [19:48:11] <^demon|lunch> Other repos clone fast (as they should). [19:48:24] <^demon|lunch> I think we've got a really crappy history that's making jgit misbehave. [19:48:26] * ^demon|lunch shrugs [19:48:32] Not entirely, but I don't think it helps at all. [19:48:45] I'm running a quiet clone right now, will see what happens. [19:49:46] <^demon|lunch> Damianz: Cloning the same repo from github takes forever too. It's the repo, not gerrit. [19:49:57] andrewbogott: git snapshot does not exist :-] It is "git archive". Sorry for the confusion. You do something like: git archive --remote= master | (cd /srv/mediawiki && tar xf - ) [19:50:03] Also java sucks for doing random shit :( I spent the entire week debugging an AD issue that turn out to be 'java's implimentation is stoopid'. [19:50:22] Maybe we just need to run git-gc on the master repo? [19:50:24] git archive is basically svn export [19:50:30] <^demon|lunch> NO [19:50:44] That erases history, I take it? [19:50:51] andrewbogott: They see me trolling [19:50:57] I'm not trolling, just ignorant! [19:51:09] <^demon|lunch> git gc can make things inexplicably explode. [19:51:13] * andrewbogott thought that git-gc archived/compressed. [19:51:13] ^demon|lunch totally wants a week of fixing the repo again :D [19:51:19] <^demon|lunch> 2 weeks. [19:51:21] <^demon|lunch> And no, I don't. [19:51:24] ok, noted! [19:51:34] So it isn't /supposed/ to erase history, it just does. [19:51:37] andrewbogott: that also deleted unclaimed objects. Seems to be a problem with gerrit :/ [19:51:40] andrewbogott: You didn't notice when ^demon|lunch imploded both copies of the repos for a week? [19:51:42] or at least with our gerrit setup [19:52:17] http://www.youtube.com/watch?v=03yZK18PX-Y [19:52:17] anyway, running "git gc", although dismissed, is a good idea :-) [19:52:46] andrewbogott: that video is eligible for the international caps lock day (which is today) http://capslockday.com [19:53:56] * andrewbogott HAD NO IDEA [19:54:11] Change on 12mediawiki a page Developer access was modified, changed by Lockal link https://www.mediawiki.org/w/index.php?diff=596336 edit summary: [19:55:46] <^demon|lunch> Actually, there might be a safer way to run git-gc. [19:55:50] <^demon|lunch> https://gerrit.googlesource.com/gerrit/+/master/contrib/git-exproll.sh [19:55:50] fun fact: Back when I said I was doing a quiet clone, I had already started the clone. And I am still waiting. [19:55:57] So that's seven minutes and counting... [19:56:05] <^demon|lunch> I haven't tested it yet, so could very well Do Bad Things. [19:56:33] ^demon|lunch: Isn't it really a feature that should be rolled into gerrit as a scheduled job. [19:57:02] <^demon|lunch> Not something gerrit should do. [19:57:03] <^demon|lunch> Jgit itself should do it. 
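The checkout-trimming options mentioned above, collected as a rough sketch. URLs and target paths are illustrative, and the last form only works if the remote server permits git-upload-archive:

  # Check out only one branch initially (all refs are still fetched):
  git clone -b master https://gerrit.wikimedia.org/r/p/mediawiki/core.git
  # Fetch only that branch's history (needs git 1.7.10 or newer):
  git clone --single-branch -b master https://gerrit.wikimedia.org/r/p/mediawiki/core.git
  # Export one revision with no history at all, roughly an "svn export":
  git archive --remote=ssh://gerrit.wikimedia.org:29418/mediawiki/core.git master | (cd /srv/mediawiki && tar xf -)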
[19:57:17] "And every night at 2AM, it automatically breaks everything" [19:57:46] andrewbogott: I know a setup where you can time network outages down to the second when backups start :) [19:58:21] ^demon|lunch: OK, I finally got myself a failure from git clone --quiet. RPC failed; result=22, HTTP code = 502 [19:58:39] So I want to say… make the timeout 12 minutes? [19:58:50] <^demon|lunch> Let's ask Ryan_Lane [19:58:58] * andrewbogott subscribes to the 'get a bigger hammer' school of troubleshooting [19:59:05] <^demon|lunch> Ryan_Lane: Can we raise the apache proxy timeout to like 12m instead of 10m? [19:59:15] <^demon|lunch> Or whatever I set it to last. [19:59:36] Problem with bumping the time out is you risk getting to the point when you're either dropping connections right down and can handle shit or you risk ooming the box holding sessions. [19:59:36] probably, yeah [19:59:44] that's true [19:59:47] <^demon|lunch> Oh, 5m. [19:59:56] but likely not a huge deal [19:59:56] <^demon|lunch> So yeah, even 10m would be nice. [20:00:00] go for it [20:00:10] It's never a big deal until it explodes in your face [20:00:11] heh [20:00:17] if that happens we'll lower it [20:00:52] <^demon|lunch> andrewbogott: Ok, it's in the puppet repo. templates/apache/sites/gerrit.wikimedia.org.erb (look for TimeOut) [20:01:49] ^demon|lunch: do you mean you just changed it, or are you suggesting that I change it? [20:01:53] you know what? I want to kill all our apahe stuff with a hammer and actually puppetize apache not the half assed lets include static files all over the place effort :( [20:01:54] <^demon|lunch> Oh, I'll change it now. [20:01:57] <^demon|lunch> Sorry for the confusion. [20:02:11] I'm happy to, just couldn't tell which [20:02:33] PROBLEM Total processes is now: CRITICAL on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS CRITICAL: 289 processes [20:03:45] <^demon|lunch> https://gerrit.wikimedia.org/r/29432 [20:04:02] <^demon|lunch> Ah crap, I forgot to grab a fresh branch. [20:04:03] <^demon|lunch> Second. [20:04:42] ^ this is why I hate the not 'branch, push, pull request' workflow [20:04:51] <^demon|lunch> Ok, rebased. [20:06:29] * andrewbogott will merge [20:07:04] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [20:07:56] * andrewbogott predicts that 7 won't be enough [20:08:02] <^demon|lunch> I miscounted. We were already at 10. [20:08:05] <^demon|lunch> It was 600s. [20:08:20] <^demon|lunch> So I bumped it to 12m = 720s [20:08:26] Oh wait, you're right. [20:08:32] I forgot how many seconds are in a minute, apparently. [20:08:40] well, as did you :) [20:08:56] <^demon|lunch> The problem is really apache giving up since it thinks nothing is happening. [20:09:31] So the delay isn't data transfer, it's gerrit psyching itself up? [20:09:51] hm. maybe I shouldn't have scheduled all instances to update salt exactly a minute from now [20:09:51] <^demon|lunch> No, it's actually transferring, but --quiet means nothing over stdout. [20:10:04] <^demon|lunch> So apache is being stupid and thinks nothing is being proxied. [20:10:13] oh well [20:10:19] too late :) [20:10:33] RECOVERY Disk Space is now: OK on testing-arky i-0000033b.pmtpa.wmflabs output: DISK OK [20:11:07] <^demon|lunch> Ryan_Lane: When do you think would be a good time to upgrade manganese & formey to precise? 
[20:11:30] some time this week, I guess [20:11:34] my schedule is so screwed up [20:12:03] I <3 salt's event system [20:12:12] <^demon|lunch> I've already tested precise. We've only got one minor config tweak to do to make gerrit work (jre moved paths :\) [20:12:26] I had all instances schedule something via at, then made them fire an event when they were done [20:13:52] Ryan_Lane: Sounds like an easy way to trash every instance at once :) [20:14:05] Damianz: yep [20:15:44] PROBLEM Free ram is now: WARNING on bots-3 i-000000e5.pmtpa.wmflabs output: Warning: 19% free memory [20:16:16] well, seems that upgrade didn't go so well [20:16:25] none of the minions are connecting [20:16:40] ah. needed to restart the master [20:17:32] PROBLEM Total processes is now: WARNING on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS WARNING: 196 processes [20:17:57] I really like the idea of salt from the point of being able to mass execute stuff in a replicatable fasion and build an interface for auditing/access control purposes around it. It also scares the shit out of me and I'd rather use pdsh with key auth... though tehnically it's key auth with zeromq should be sound. [20:18:12] yeah [20:18:41] seems to be good software so far [20:18:42] PROBLEM Disk Space is now: WARNING on testing-arky i-0000033b.pmtpa.wmflabs output: DISK WARNING - free space: / 77 MB (5% inode=51%): [20:19:10] hm. most of the minions didn't upgrade [20:19:31] I'm thinking about rolling a simple web interface in django ontop of the master for our management stuff. When we build out our cdn it would provide a clean way to mass purge data from the edge nodes that are distributed around the world. [20:22:14] Damianz: a rest api is also being added right now [20:22:38] From what I've seen of the rest api code it's not really in the right direction atm... [20:22:50] no? [20:22:52] why not? [20:23:12] RECOVERY Disk Space is now: OK on conventionextension-trial i-000003bf.pmtpa.wmflabs output: DISK OK [20:23:34] It needs some form of authorization/authentication system that's modular, so you can support say ldap and give groups access to run certain commands. The code I looked at last seemed very basic and more of a POC of how to do a rest interface against the main codebase rather than a suitable to deploy system. [20:23:42] RECOVERY Disk Space is now: OK on testing-arky i-0000033b.pmtpa.wmflabs output: DISK OK [20:24:01] Damianz: yeah, it's very basic right now [20:24:02] Also needs lots of logging that you can hook into for auditing purposes adding for enterprise use. [20:24:05] what you are suggesting is being worked on [20:24:16] we're going to add keystone auth at some point [20:24:38] the groups/roles stuff will be in salt core [20:24:45] I think it could be freaking awesome... eventually, was kinda hoping to get some time to look at working on it but I'm kinda busy atm. [20:25:00] * Ryan_Lane nods [20:25:42] RECOVERY Disk Space is now: OK on mw1-21beta-lucid i-00000416.pmtpa.wmflabs output: DISK OK [20:25:59] hm. some of these minions just don't want to upgrade [20:26:15] I think roles should be extended in the core some more also in some cases. Like grains are useful, but sometimes you can't trust the end machine. If you could control the grains code centrally then use that data for roles (think puppet/facter) it would be awesome. [20:26:15] ah. 
they may be down [20:26:26] connect to host i-000001cc.pmtpa.wmflabs port 22: Connection refused [20:26:31] or they may have bad security groups [20:27:25] Nagios says 1 down, though about 10 are ignored. [20:27:43] * Damianz likes how it works, people ignore him, he makes nagios ignore their instances :) [20:28:01] hm [20:28:09] it shouldn't refuse my connection... [20:28:52] weird [20:29:07] I wonder if its a new instance and its security groups are screwed up [20:29:09] Tbh I kinda dislike security groups, they're awesome for public stuff but really we should use iptables and keep setups standard against prod or we split away every time and make sadface. [20:29:39] we hardly use iptables in production [20:30:05] and security is more of an issue in labs [20:30:12] since people tend to run random shit [20:30:32] Mostly on stuff that isn't behind lbs afaik, but yeah [20:30:41] Sometimes I just want to stab people [20:30:47] Actually I mostly just want to stab people, but that's another story [20:31:10] Though I think in some cases we should try to replicate stuff as close as possible and it's a shame we can't vlan off projects. [20:31:13] ah. this instance is running a dpkg command [20:31:24] I need to change puppet to install a specific version of salt [20:31:59] Damianz: it's possible to disable security groups per project [20:32:08] Damianz: make a rule that allows all in [20:32:26] then you can manage the groups in the instances [20:32:35] err [20:32:35] the rules [20:32:41] I'd argue that a filter allowing doesn't equal no filtering but that's semantics of the kernel over practicality [20:32:52] right [20:33:19] we can't *currently* vlan off projects [20:33:24] when we can start using quantum, we could [20:33:37] quantum is way more powerful than nova-network [20:34:32] It would be sort of nice to support vlans in labs, make some of the network nodes act as routers and peer with the real wmf routers for labs public ip space. Though it's probably just me that thinks labs sits too close to prod and is restircted in places currently. [20:34:52] PROBLEM Current Load is now: WARNING on parsoid-roundtrip4 i-000004d7.pmtpa.wmflabs output: WARNING - load average: 5.94, 6.39, 5.44 [20:34:52] PROBLEM Current Load is now: WARNING on parsoid-roundtrip3 i-000004d8.pmtpa.wmflabs output: WARNING - load average: 3.44, 4.93, 5.07 [20:35:12] Damianz: In theory I'm on the hook for adding keystone to salt-api, but I'm waiting for the api development to settle down a bit. [20:35:43] andrewbogott: yeah. likely a good idea [20:35:50] andrewbogott: Assuming they have an awesome auth setup like the django system it should be easy as pie... mmm pie [20:36:10] we need it for creating a database service [20:36:24] I guess we could shell out [20:36:29] * Ryan_Lane really hates shelling out [20:36:42] You could just do a mysql connection as root remotly *shudder* [20:36:53] no thanks :) [20:37:07] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [20:37:45] If we actually impliment salt we could do some cool stuff with bots etc for a start/stop/status page hookable by authors etc... oh yeah did I mention we need oauth/openid still? [20:37:50] * Damianz looks at csteipp [20:38:18] Damianz: yeah. we need to make salt multi-tenant before we can do that, though [20:38:24] <^demon|lunch> Speaking of OpenID, the OpenID extension is in git now. 
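A sketch of the "allow everything at the security-group layer, filter inside the instance" approach Ryan_Lane describes above. The novaclient commands assume CLI access to the project (labs normally manages groups through the web interface), and the iptables rules are purely illustrative:

  # Open the project's default security group completely:
  nova secgroup-add-rule default tcp 1 65535 0.0.0.0/0
  nova secgroup-add-rule default udp 1 65535 0.0.0.0/0
  nova secgroup-add-rule default icmp -1 -1 0.0.0.0/0
  # ...then do the real filtering on each instance, e.g. allow loopback, established traffic and SSH, drop the rest:
  iptables -A INPUT -i lo -j ACCEPT
  iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
  iptables -A INPUT -p tcp --dport 22 -j ACCEPT
  iptables -A INPUT -j DROP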
[20:38:32] \o/ [20:38:32] <^demon|lunch> (As of today) [20:38:36] :o [20:38:47] Damianz: technicaly, we can do it without making it multi-tenant [20:38:57] if we use grains and runners [20:39:18] See I like your idea behind grain based access control, but I also hate it from a security point of view [20:39:32] well, it's fine, depending on how you implement it [20:39:33] Being able to hook a middleware that pulls our project members and assigns roles to instances based on that == ftw [20:40:13] so, if you are sending sensitive information, it's a fail [20:40:43] if you use a runner, which is configured to only use a specific grain [20:41:00] and that grain is on the systems you want to control [20:41:05] and the runner only takes very specific commands [20:41:28] it should be ok [20:41:35] any instance that adds the grain will pick up the call [20:41:51] I can see how it could be ok, but personally that still yells security risk too much. Trusting editable content is generally bad, but yes only if you're sending private data (say mysql root passwords ;)) [20:41:56] but if you don't pass sensitive info, it doesn't matter [20:42:11] that's why all of this would be code reviewed ;) [20:42:19] I'd rather just load up instances, map to projects, map to users. Bang central control users can't touch. [20:42:26] to make sure people aren't doing things that would be insecure [20:42:43] Code review is boring when you can't get people to review your code :P [20:42:45] Damianz: well, that's not how zeromq works [20:43:05] when you send data out, it gets sent to all minions [20:43:11] the minions subscribe [20:43:38] Oh, I was thinking a level higher than that. [20:43:44] ? [20:43:44] IE pre-authing random ass end user to run command x on y. [20:43:51] yeah [20:43:53] that's the end-goal [20:43:59] with adding keystone auth [20:44:15] there's already an ACL system [20:44:28] it isn't fine grained enough, though [20:44:35] you can say user x can run command y [20:44:39] but not on system z [20:44:47] hmm [20:44:49] I have bug reports in for this [20:44:52] RECOVERY Current Load is now: OK on parsoid-roundtrip3 i-000004d8.pmtpa.wmflabs output: OK - load average: 6.57, 4.75, 4.95 [20:45:19] See with a concept of groups and user x can run y on z where z is or and y is or , say like how sudo works. That would be fucking awesome. [20:45:30] yep [20:45:35] that's what I'd like as well [20:45:58] I'd like to be able to assign a set of commands to a role [20:46:03] then be able to assign that role to people [20:46:06] within a tenant [20:47:09] You know this would be awesome for stuff like labs-home-wm where we're doing crazy ssh stuff (though he can diaf instead), but you know what I mean. [20:47:18] ssh stuff? [20:47:24] what's doing ssh stuff? 
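On "there's already an ACL system": a sketch of the per-user allowlist in the salt master config. The option is named client_acl in Salt releases of this era; the user and function names are made-up examples, and as noted above it cannot restrict which minions a user may target:

  # Append an allowlist to the master config (YAML) and restart the master:
  printf '%s\n' \
    'client_acl:' \
    '  damianz:' \
    '    - test.ping' \
    '    - status.*' >> /etc/salt/master
  service salt-master restart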
[20:47:38] glusterfs management code that creates shares [20:47:38] ah [20:47:38] that [20:47:44] that's not the same as labs-home-wm [20:47:53] and we're fixing that another way [20:47:55] though salt could replace ssh there, yes [20:48:08] We have too many bots :P [20:48:43] if events didn't need to be watched by a root process, I'd replace most of the bots with salt, too [20:49:15] hmm [20:49:24] I could technically run a secondary salt-master as non-root, I guess [20:49:27] I suppose doing a sudo tail over salt and streaming the data back would be crazy [20:49:39] http://docs.saltstack.org/en/latest/topics/event/index.html [20:49:52] you can raise events to the master [20:49:55] with tags [20:50:05] That's pretty cool [20:50:09] hmm I wonder [20:50:17] so, ircecho could raise an event to the master for log messages [20:50:31] Could we use logstash to grab the logs, trigger a call to saltmaster from that instance based on tags and do it that way? [20:50:32] then the master could spit it out to irc [20:50:44] Could also use salt to trigger nagios alerts for log stuff that way too [20:50:58] yeah, but nagios already has mechanisms for doing so [20:51:03] no need to reinvent that wheel ;) [20:51:16] :P [20:51:26] I wouldn't mind replacing the smptrap with a salt trigger [20:51:33] now, that said.... [20:51:34] That pile of crap dies too often [20:51:54] you could do auto-scaling with salt [20:51:57] and nagios and ganglia [20:52:08] I think ganglia is a little broken atm [20:52:09] have nagios send alerts on ganglia thresholds [20:52:15] But yes, esp for bots when we have a scheduler and prod instances [20:52:37] salt could create an instance, add it to a pool [20:52:52] PROBLEM Current Load is now: WARNING on parsoid-roundtrip3 i-000004d8.pmtpa.wmflabs output: WARNING - load average: 8.60, 7.86, 6.45 [20:52:54] if the threshold isn't met, create another and add to the pool [20:53:15] You'd just have to be careful downscaling in steps to not loadspike everything. [20:53:20] can do the same for the opposite direction [20:53:20] yeah [20:53:40] Draining traffic off backends to cleanly take them out of service is nice when possible. [20:53:40] but events and salt-cloud could handle that [20:53:57] you could use an event to say when traffic is drained, too ;) [20:54:46] it's definitely possible to do some awesome things with events [20:55:16] the first thing I want to do is to notify labsconsole when an instance is actually fully built [20:55:51] I'd actually like to take a hammer to labsconsole [20:55:58] I want to switch to using salt to install puppet, then do a rn, etc [20:55:58] *run [20:56:02] heh [20:56:06] wish it wasn't mw tbh [20:56:14] We could make the ux 200000000000x better [20:56:18] we *could* switch to horizon [20:56:20] like live stepping though build stages etc [20:56:22] PROBLEM Disk Space is now: CRITICAL on labs-nfs1 i-0000005d.pmtpa.wmflabs output: DISK CRITICAL - free space: /export 504 MB (2% inode=59%): /home/SAVE 504 MB (2% inode=59%): [20:56:35] we could do that in mw too [20:56:35] Because for devs/noobs it's a horrid interface that's confusing and overcomplicated [20:56:44] just need to make an api and use js [20:57:22] horizon's interface isn't much better IMO [20:57:29] neither is amazon's web console [20:57:34] or rackspace's [20:57:45] horizon is bland, amazon's console is confusing and a mash of services [20:57:57] hp cloud's is embarrasingly bad [20:58:55] Totally not a designer but seriously, keep it simple, keep a nice workflow, keep it clean. 
We can do better than this (yes, the same line I use in relation to Captchas) [20:59:52] * Damianz thinks interfaces should be accessible from idiot to expert with minimal changes and encourage a workflow or learning, seperating docs with interfaces with lots of buttons makes kittens die [20:59:55] the current workflow for many tasks is somewhat simple [20:59:55] * Damianz also stops ranting [21:00:11] for managing project membership it is *terrible* [21:00:15] It's little things that annoy me [21:00:35] creating instances is way easier than it was before :) [21:00:42] liek create an instance -> oh need a security group -> WHY DO I HAVE TO GO TO ANOTHER PAGE, WHY DO I HAVE TO REFRESH LOOSING MY INFO ARGH, KITTENS DIED [21:00:50] * gwicke wished performance on glusterfs was better [21:00:56] gwicke: is it currently very slow? [21:01:04] little js, dropdown, add inline, ajaxy validate, happy face [21:01:05] distributed filesystems are slow :( [21:01:07] Damianz: yep [21:01:12] I am a fussy bastard though [21:01:14] Ryan_Lane: last time I measured it was 300kB/s [21:01:21] gwicke: when did you measure? [21:01:33] Saturday [21:01:33] we had a problem with it over the last two weeks [21:01:34] gimme a sec [21:01:55] ok. memory usage seems reasonable [21:02:53] yeah. it's unfortunate that gluster is so slow [21:03:01] restarting a sqlite-backed server takes a looong time [21:03:01] ceph likely won't be much better [21:03:16] I guess the cache is thrown away [21:03:24] I recommend using local storage, rather than gluster storage for things that need to be high-speed [21:03:31] sqlite doesn't really say 'performance' tbf [21:03:32] ok [21:03:38] Use a real relation database [21:03:51] Damianz: it is purely IO-bound [21:04:07] you'll be surprised, but SQLite is relational and database [21:04:27] Still sqlite relies on file locking, it doens't scale, it's slow, formats are dodgy, it doesn't do caching. [21:04:32] * Damianz hugs his postgre [21:04:56] * MaxSem burns that heretic [21:05:11] * gwicke moves the db to local storage [21:05:14] oh yeah [21:05:17] anyone else wanting to say anything about unholy PG? [21:05:25] Just because you mw dudes dislike real database and kiss oracle's bum :D [21:05:25] RECOVERY Total processes is now: OK on parsoid-spof i-000004d6.pmtpa.wmflabs output: PROCS OK: 149 processes [21:05:25] for definite, you should use local storage for databases [21:05:39] using gluster for DB access is bad :) [21:05:55] I should likely write a performance guide for labs [21:05:56] Ryan_Lane: MySQL agrees with you :P [21:06:11] labs has performance? 
I remember when we where worried about stability [21:06:11] :D [21:06:17] heh [21:06:24] well, it does if you use local storage [21:06:28] it does not if you use gluster [21:06:28] ahhh, much faster now [21:06:36] local storage could be better still tbf [21:06:40] Damianz: how so> [21:06:45] Like lvms vs images on fs [21:06:52] yeah [21:06:56] it's supported in openstack [21:07:11] but you lose features when you do that [21:07:20] or you know, buy us some ssds and use ceph :D totally need like 50tb of ssds [21:07:26] heh [21:07:38] we'd lose some space if we switched to SSDs [21:07:46] yeah :( [21:07:53] Give it another 10years and we'll have 4tb ssds [21:07:54] and on virt6, for instance, we're at 60% used space [21:09:13] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [21:10:41] Hmm I seriously have strange interests =\ Need to get a day job outside of my general interest or I spend all day thinking of crazy stuff to build. [21:11:17] heh [21:11:44] Ryan_Lane, so in which folder should databases be stored? [21:11:52] Platonides: /mnt [21:12:10] we'll have a database service eventually (hopefully in the next 6 months) [21:12:26] I might get around to re-installing the bots sql servers before then [21:12:28] not much good if it's read-only [21:12:35] it should be r/w [21:12:49] Platonides: no, a user database service [21:13:22] PROBLEM Total processes is now: WARNING on parsoid-spof i-000004d6.pmtpa.wmflabs output: PROCS WARNING: 155 processes [21:13:34] Ryan_Lane: Btw I was thinking the other day with people complaining, what do you think about using federated tables to pull the replicas onto the userdb servers and let people do w/e the hell they want and let mysql 'proxy' the access? [21:13:48] that's a possibility [21:13:50] need to talk to asher about that, though [21:13:56] he's handling this part [21:14:16] I guess asher just wanted to use "normal" db servers [21:14:22] * Damianz leaves binasher cookies on the table for when he comes around again [21:14:26] that would explain the requisite of no user tables [21:14:39] user tables complicated having a clean replica [21:14:46] s/ed// [21:15:13] not really [21:15:20] as they would be in a different db [21:16:13] Imo it's messy and also breaks away from the standard replica setup which means master changes etc are more complicated [21:16:59] as far as you don't try to make it a master, it should be fine [21:17:36] it would be cool to direct some rr connections to normal slaves if they have little load [21:17:51] but who knows what could reach there :P [21:18:16] Loadbalancing over real slaves would be win, sadly the risk of impacting production probably makes that a straight no [21:18:35] It's bad enough when people do template changes and take down wikipedia [21:18:52] PROBLEM Current Load is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: WARNING - load average: 5.56, 5.54, 5.17 [21:19:19] Seriously hate binlog replication sometimes for all its bad points [21:19:19] oooooo [21:19:21] "The pillar communication has been updated to add some extra levels of verification that the intended minion is the only one allowed to gather the data." [21:19:25] yes, even having some limited labs-shared slaves [21:19:37] win [21:19:39] the wait might stop, waiting for those highly lagged slaves [21:20:01] *the master might stop, waiting for those highly lagged slaves [21:20:23] (well, actually mw logic...) 
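A quick way to put numbers on the gluster-versus-local-disk difference discussed above (gwicke measured about 300 kB/s on the shared storage). /data/project is the shared project storage and /mnt the instance-local disk recommended above for databases; the test file names are illustrative:

  # Write 64 MB with a final fdatasync, once against shared storage and once against local disk:
  dd if=/dev/zero of=/data/project/ddtest bs=1M count=64 conv=fdatasync
  dd if=/dev/zero of=/mnt/ddtest bs=1M count=64 conv=fdatasync
  rm -f /data/project/ddtest /mnt/ddtest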
[21:20:23] doing queries against production databases is not gonna happen [21:20:37] I think having another teir of slaves that labs uses primarly and prod can fall back onto could be worthwhile in some cases (for huge read load spikes, which is usually what impacts mysql). [21:20:45] if we don't have enough capacity in Labs, we'll need to buy more database servers [21:20:53] Since real hardware with valid data might as well be used [21:21:37] Ryan_Lane: I think you should just give us root access to the prod db as long as we promise not to break it =D [21:21:38] they would be trivial to connect [21:21:43] Totally how it use to work [21:21:47] heh [21:22:00] labs is purposely disconnected from production [21:22:11] he should allow you to break it, but just once [21:22:14] :D [21:22:31] that's a rite of passage for wmf roots, breaking wikipedia :) [21:22:34] true [21:22:40] You can't say you work in ops until you take down the production cluster at least once :) [21:22:44] you're not officially a member of the team until you've broken the site [21:23:14] that said, I support the petition :D [21:23:51] You get extra points if you break the side by pushing bad code just for faling in testing, staging and qa =D [21:25:47] bleh. events are broken in 0.10.3 :( [21:26:06] thankfully 0.10.4 is going to be released like today or tomorrow [21:26:06] Ryan just took the red and blue pill [21:28:38] good night [21:28:51] nn plat [21:35:29] https://github.com/saltstack/salt/issues/2139 [21:35:31] external_auth [21:36:22] PROBLEM Disk Space is now: WARNING on labs-nfs1 i-0000005d.pmtpa.wmflabs output: DISK WARNING - free space: /export 521 MB (3% inode=59%): /home/SAVE 521 MB (3% inode=59%): [21:39:16] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs) [21:39:44] Change on 12mediawiki a page Developer access was modified, changed by Irate link https://www.mediawiki.org/w/index.php?diff=596367 edit summary: [21:41:23] :o [21:41:23] PROBLEM Disk Space is now: CRITICAL on labs-nfs1 i-0000005d.pmtpa.wmflabs output: DISK CRITICAL - free space: /export 498 MB (2% inode=59%): /home/SAVE 498 MB (2% inode=59%): [21:41:39] smexy [21:42:32] Change on 12mediawiki a page Developer access was modified, changed by Kitchen Knife link https://www.mediawiki.org/w/index.php?diff=596368 edit summary: /* User:Kitchen Knife */ [21:48:52] RECOVERY Current Load is now: OK on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: OK - load average: 3.34, 3.78, 4.73 [21:55:31] I'm still getting Error: invalid magic word 'ISONLINE' from every page on http://en.wikipedia.beta.wmflabs.org/wiki after I attempted a login [21:55:59] * Damianz pokes hashar [21:56:23] he's probably sleeping though it's a hashar thing [21:57:50] Damianz: awake still [21:57:50] though not for long [21:57:53] :o [21:58:23] PROBLEM Disk Space is now: WARNING on maps-test-puppetosm2 i-000004d9.pmtpa.wmflabs output: DISK WARNING - free space: / 519 MB (5% inode=91%): [21:58:47] Hi, anybody in charge / aware of the labs/* projects in Gerrit? [21:59:03] aware yes, in charge what do you mean? 
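The external_auth feature linked above (saltstack issue 2139, expected with the 0.10.4 release mentioned earlier) is also configured in the master config; a rough sketch, with the backend, user and function names as made-up examples:

  # Authenticate against PAM and allow one user to run a couple of functions:
  printf '%s\n' \
    'external_auth:' \
    '  pam:' \
    '    damianz:' \
    '      - test.ping' \
    '      - status.uptime' >> /etc/salt/master
  service salt-master restart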
[21:59:09] I just need to know whether all / any / none of the projects under labs/* at Gerrit should be featured at Ohloh.net [21:59:11] spagewmf: ahh that is related to the l10n cache not being updated [21:59:15] they are just general projects [21:59:22] spagewmf: that ISONLINE magic word probably comes from a recent change [21:59:35] * Damianz shurg [21:59:39] Damianz, "just general projects", meaning? [21:59:58] spagewmf: the l10n cache is supposed to update automatically though :(( [22:00:06] vaugly labs related stuff, for example the nagios config building code is under labs/ [22:00:31] maybe it runs out of memor [22:00:43] The main thing about labs/* is it spams in here, which is sorta good and sorta rubbish [22:00:50] Damianz, yes, I'm seeing the list of projects in Gerrit but I can't asses which ones are relevant to be featured in Ohloh [22:01:10] Matter of opinion imo [22:01:23] Reedy: poke [22:01:26] they're all open and contributed too, worthyness depends on your application [22:01:27] I like your recursive sentence :) [22:01:41] * Damianz has lots of opinions, you'll get use to it [22:01:42] Reedy: the nice http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page gives a stacktrace about an invalid magic word ISONLINE [22:01:52] Damianz, I can easily create projects for each of them [22:01:52] PROBLEM Current Load is now: WARNING on ve-roundtrip2 i-0000040d.pmtpa.wmflabs output: WARNING - load average: 7.65, 6.89, 5.78 [22:01:56] Reedy: don't we just have to rebuild the l10n cache? to fix that ? [22:02:34] PROBLEM Total processes is now: CRITICAL on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS CRITICAL: 289 processes [22:02:43] It's kinda a LIST ALL THE THINGS vs LIST DECENT THINGS, Ohloh has a lot of crap so I guess either is fine *shrug* [22:03:17] * Damianz delegates the answer to Ryan or ^demon because they're like knowledgeable dudes [22:03:17] hashar: should do [22:03:26] might not be running for some reason [22:03:31] labs/centralauth and labs/incubator have at least a description [22:03:39] !log deployment-prep refreshing l10ncache manually [22:04:10] labs/maps labs/nagios-builder have none. At least this is an objective observation (that can be useless, yes) [22:04:49] See now I'd argue wth is the point of a description in gerrit, we have readme files for that.. the browsing just sucks and the overview page sucks [22:05:00] Hate duplicating info, it just gets out of date quickly [22:05:07] ok Damianz thanks, and I will wait for Ryan_Lane or ^demon|lunch opinions :) [22:05:24] Improve all the things! [22:05:44] I don't see why all wouldn't be added [22:06:16] Quality of code to pull in contributors [22:06:27] main and only reason really [22:06:32] for pure stats that's mute [22:07:25] <^demon|lunch> qgil: labs/* stuff is really random. labs/private should *not* be anywhere, for sure though. [22:07:41] We're a random bunch of people, what do you expect. [22:08:08] ok, forgetting about labs/* for Ohloh by now, then. 
[22:08:22] Change on mediawiki a page Developer access was modified, changed by Sharihareswara (WMF) link https://www.mediawiki.org/w/index.php?diff=596375 edit summary: /* User:Jimmy xu wrk */
[22:08:39] !log deployment-prep Fixed l10n cache permissions: chown l10nupdate:l10nupdate /home/wikipedia/common/php-master/cache/l10n/*
[22:09:00] Change on mediawiki a page Developer access was modified, changed by Sharihareswara (WMF) link https://www.mediawiki.org/w/index.php?diff=596376 edit summary: /* User:Lockal */ done
[22:09:22] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[22:09:42] Ryan_Lane: Oh btw did you pull my c0d3 today? I only ask cause if you did I'll resolve the bug before going to bed.
[22:09:48] bahh
[22:10:02] Damianz: I haven't yet
[22:10:08] Change on mediawiki a page Developer access was modified, changed by Sharihareswara (WMF) link https://www.mediawiki.org/w/index.php?diff=596377 edit summary: /* User:Kitchen Knife */
[22:10:12] we'd like to create another 8-core VM, but get a failure while doing so
[22:10:17] <^demon|lunch> Ryan_Lane: How does thursday sound for precise on gerrit?
[22:10:21] probably quoat related, ask Ryan
[22:10:25] project visualeditor
[22:10:31] s/quoat/quota/
[22:10:31] Ryan_Lane: ^^
[22:10:34] likely a quota, yes
[22:10:41] ^demon|lunch: that's likely fine
[22:10:59] <^demon|lunch> I'll go ahead and put us down for Thursday. We can shift it if need be.
[22:11:03] ok
[22:11:13] <^demon|lunch> Since it'll involve actual gerrit downtime, I'll e-mail engineering as well to see if anyone has objections.
[22:11:43] Because gerrit doesn't even randomly have downtime due to restarts :)
[22:12:26] <^demon|lunch> Yeah, but an actual upgrade is going to take a bit longer than a random restart.
[22:12:26] Damianz: aren't you the one that tweaked the nagios configurations on labs?
[22:12:27] ^demon|lunch: precise upgrade for gerrit precedes precise upgrade for gallium, yes?
[22:12:32] PROBLEM Total processes is now: WARNING on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS WARNING: 196 processes
[22:12:41] <^demon|lunch> chrismcmahon: Unrelated.
[22:12:57] Change on mediawiki a page Developer access was modified, changed by Sharihareswara (WMF) link https://www.mediawiki.org/w/index.php?diff=596378 edit summary: took away ones that are done
[22:12:58] ^demon|lunch: thanks! I thought they were related.
[22:12:59] chrismcmahon: for gerrit we needed a patch to be backported on our installation. That has been deployed by Chad
[22:13:03] If by tweaked you mean un-broke, bitched about, broke, un-broke, wrote something not grepping json, trying to puppetize/expand upon. Maybe.
[22:13:23] chrismcmahon: zuul is python software that requires various dependencies which are only in Precise :/
[22:13:47] <^demon|away> chrismcmahon: One does not block the other. So gallium can be scheduled independently of formey/manganese :)
[22:13:56] Damianz: yeah well hmm. Maybe you could help anyway. The hostnames are filled using the actual hostname, I was wondering if we could display the instance name instead. Ex: http://nagios.wmflabs.org/cgi-bin/nagios3/status.cgi?hostgroup=deployment-prep&style=detail
[22:14:21] hashar: As soon as Ryan_Lane renames everything.
[22:14:27] hashar, andrewbogott, did you try `git clone --depth=1`? should greatly reduce the amount of stuff to transfer
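(A minimal sketch of the shallow-clone suggestion, using the anonymous gerrit URL quoted further down; the repository and depth are illustrative.)

```
# Clone only the most recent commit instead of the full history,
# which keeps the initial transfer to a labs instance small.
git clone --depth=1 https://gerrit.wikimedia.org/r/p/mediawiki/core.git
cd core
# Subsequent updates are ordinary pulls; the history simply stays truncated.
git pull
```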
[22:14:36] Basically we had to switch to fqdns to support multiple regions
[22:15:01] So new instances are ..wmflabs, old ones are ..wmflabs (which sucks and will go away eventually).
[22:15:12] The naming changed in the openstack upgrade
[22:15:21] spagewmf: nice one. Never heard of the --depth option to "git clone" before. That is probably what andrewbogott should use :-]
[22:15:34] * andrewbogott tries it
[22:15:52] andrewbogott: probably better than my lame "git archive"
[22:16:03] Though I wonder if I can show the alias on that form instead since we don't really have duplicates *yet*.. would need to look at the C stuff.
[22:16:09] cause apparently one will be able to simply "git pull" to receive new updates.
[22:16:37] Damianz: ah that makes sense
[22:17:18] Damianz: (and we should probably upgrade/switch to Icinga, which is a nagios community fork)
[22:17:33] hashar, andrewbogott Also `git clone ssh://USER@gerrit.wikimedia.org:29418/mediawiki/core.git ` is meant to be lighter-weight than `git clone https://gerrit.wikimedia.org/r/p/mediawiki/core.git`
[22:17:36] yes
[22:17:56] I'd like to get a base puppet class role sorted as it's very hacked up from prod atm :(
[22:18:33] Ryan_Lane: can you ping me when it would make sense to retry the creation?
[22:23:04] aawiki-5f8a0015: 0.6235 15.8M SQL ERROR: Table 'aawikibooks.job' doesn't exist (10.4.0.53)
[22:23:07] poooor labs
[22:24:13] o.0
[22:25:14] !log deployment-prep test
[22:25:54] PROBLEM Current Load is now: WARNING on parsoid-roundtrip3 i-000004d8.pmtpa.wmflabs output: WARNING - load average: 4.95, 5.53, 5.97
[22:27:24] Reedy: csteipp: an interesting issue. Whenever wikiversions.dat is the one from production, we end up with the jobrunner trying to connect to aawikibooks.job which does not exist on beta :-]
[22:27:33] Reedy: csteipp: end result: ton of spam in dberror.log :-]
[22:27:47] mmmmm spam
[22:28:12] Oh, is that what was happening?
[22:28:37] I was trying to tail the file, and couldn't see anything on account of the errors flying by
[22:29:20] and logs are in /home/wikipedia/logs so that also fills the tiny (and shared) /home partition
[22:29:21] :-(
[22:29:49] Mon Oct 22 22:29:32 UTC 2012 deployment-video05 aawiki nextJobDB::getPendingDbs 10.4.0.53 1146 Table 'aawikibooks.job' doesn't exist (10.4.0.53)
[22:29:49] :-(
[22:31:22] RECOVERY Disk Space is now: OK on labs-nfs1 i-0000005d.pmtpa.wmflabs output: DISK OK
[22:34:39] hashar: make it a link to somewhere on /mnt or /data/project?
[22:36:07] mount bind all the things
[22:36:12] Ryan_Lane: yeah got a bug for it. Need to write the puppet stuff for it
[22:36:25] why does that need to be in puppet?
[22:38:15] Ryan_Lane: puppet ensures that /home/wikipedia/log is a directory
[22:38:17] so it will happily remove the link
[22:38:20] ugh
[22:39:22] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[22:39:32] imo /home/wikipedia is a stupid place and it should be configurable in puppet
[22:43:22] PROBLEM Disk Space is now: CRITICAL on maps-test-puppetosm2 i-000004d9.pmtpa.wmflabs output: DISK CRITICAL - free space: / 273 MB (2% inode=91%):
[22:44:39] found the issue
[22:47:32] RECOVERY Total processes is now: OK on wikistats-01 i-00000042.pmtpa.wmflabs output: PROCS OK: 102 processes
[23:00:10] spagewmf: --depth=1 seems to help. Do you know of any hazards involved in using such a repo? Do I need to post warning signs?
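(Sketching one conservative answer to that question, under the assumption of a reasonably modern git: detect whether the clone is shallow and fetch the missing history before trying to push anything from it.)

```
# A shallow clone is marked by a .git/shallow file; deepen it before pushing
# so the remote never has to deal with truncated history.
if [ -f "$(git rev-parse --git-dir)/shallow" ]; then
    git fetch --unshallow   # newer git; older versions can re-fetch with a very large --depth
fi
```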
[23:00:37] Forums seem to disagree about whether it's safe and/or possible to push from such a repo, I don't know what gerrit will think of it.
[23:01:31] <^demon|away> --depth *should* be ok since the upgrade to jgit. At least for cloning.
[23:01:43] <^demon|away> I have zero clue about pushing back though--never tried
[23:02:29] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Unprioritized - normal) [Bug 41285] foreachwiki on beta use all.dblist instead of all-wmflabs.dblist - https://bugzilla.wikimedia.org/show_bug.cgi?id=41285
[23:02:57] We don't really use 'push' anyway...
[23:04:52] RECOVERY Current Load is now: OK on parsoid-roundtrip4 i-000004d7.pmtpa.wmflabs output: OK - load average: 1.90, 3.52, 4.99
[23:04:55] Ryan_Lane: ping ;)
[23:05:07] pong
[23:05:37] <^demon|away> andrewbogott: Yes we do. Every `git review` is a push :p
[23:05:42] Ryan_Lane: we still get the error on instance creation
[23:05:43] <^demon|away> `git push origin HEAD:refs/for/$branch`
[23:06:06] Hm, guess I thought that 'git review' did something more abstract.
[23:06:06] "Failed to create instance. "
[23:06:07] Well, I will get to test this soon enough.
[23:06:20] gwicke: try now
[23:06:39] Ryan_Lane: ahhhh, thanks!!
[23:06:45] yw
[23:08:04] <^demon|away> andrewbogott: It tries to be more friendly and do other "magic"
[23:08:07] <^demon|away> It mostly just annoys me.
[23:08:47] <^demon|away> I kind of like http://engblog.nextdoor.com/post/27136956002/introducing-git-change, but haven't had the chance to really play with it much yet.
[23:09:23] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[23:13:29] !log deployment-prep log
[23:13:35] Logged the message, Master
[23:13:52] PROBLEM Current Load is now: CRITICAL on parsoid-roundtrip5-8core i-000004db.pmtpa.wmflabs output: Connection refused by host
[23:13:52] PROBLEM Current Load is now: CRITICAL on mwreview-param2 i-000004dc.pmtpa.wmflabs output: Connection refused by host
[23:13:59] !log deployment-prep Fixed the jobrunner spamming dberror.log (the all-wmflabs.dblist contained databases from production)
[23:14:05] Logged the message, Master
[23:14:23] !log deployment-prep Started a manual l10n rebuild in a screen on -dbdump
[23:14:25] Logged the message, Master
[23:14:33] PROBLEM Current Users is now: CRITICAL on parsoid-roundtrip5-8core i-000004db.pmtpa.wmflabs output: Connection refused by host
[23:14:33] PROBLEM Current Users is now: CRITICAL on mwreview-param2 i-000004dc.pmtpa.wmflabs output: Connection refused by host
[23:14:37] !log deployment-prep Applying database updates to all wiki (in a screen on -dbdump)
[23:14:39] Logged the message, Master
[23:14:49] !log deployment-prep getting to bed :-]
[23:14:49] Logged the message, Master
[23:14:57] we will see how beta is tomorrow morning :-]
[23:15:12] PROBLEM Disk Space is now: CRITICAL on parsoid-roundtrip5-8core i-000004db.pmtpa.wmflabs output: Connection refused by host
[23:15:12] PROBLEM Disk Space is now: CRITICAL on mwreview-param2 i-000004dc.pmtpa.wmflabs output: Connection refused by host
[23:16:02] PROBLEM Free ram is now: CRITICAL on parsoid-roundtrip5-8core i-000004db.pmtpa.wmflabs output: Connection refused by host
[23:16:02] PROBLEM Free ram is now: CRITICAL on mwreview-param2 i-000004dc.pmtpa.wmflabs output: Connection refused by host
[23:17:22] PROBLEM Total processes is now: CRITICAL on mwreview-param2 i-000004dc.pmtpa.wmflabs output: Connection refused by host
[23:18:12] PROBLEM dpkg-check is now: CRITICAL on mwreview-param2 i-000004dc.pmtpa.wmflabs output: Connection refused by host
[23:18:52] RECOVERY Current Load is now: OK on parsoid-roundtrip5-8core i-000004db.pmtpa.wmflabs output: OK - load average: 0.38, 0.73, 0.48
[23:19:32] RECOVERY Current Users is now: OK on parsoid-roundtrip5-8core i-000004db.pmtpa.wmflabs output: USERS OK - 0 users currently logged in
[23:19:32] RECOVERY Current Users is now: OK on mwreview-param2 i-000004dc.pmtpa.wmflabs output: USERS OK - 0 users currently logged in
[23:20:15] RECOVERY Disk Space is now: OK on parsoid-roundtrip5-8core i-000004db.pmtpa.wmflabs output: DISK OK
[23:20:15] RECOVERY Disk Space is now: OK on mwreview-param2 i-000004dc.pmtpa.wmflabs output: DISK OK
[23:21:03] RECOVERY Free ram is now: OK on parsoid-roundtrip5-8core i-000004db.pmtpa.wmflabs output: OK: 4923% free memory
[23:21:03] RECOVERY Free ram is now: OK on mwreview-param2 i-000004dc.pmtpa.wmflabs output: OK: 1121% free memory
[23:22:24] RECOVERY Total processes is now: OK on mwreview-param2 i-000004dc.pmtpa.wmflabs output: PROCS OK: 84 processes
[23:23:03] RECOVERY dpkg-check is now: OK on mwreview-param2 i-000004dc.pmtpa.wmflabs output: All packages OK
[23:23:53] RECOVERY Current Load is now: OK on mwreview-param2 i-000004dc.pmtpa.wmflabs output: OK - load average: 0.02, 0.49, 0.52
[23:28:24] PROBLEM Total processes is now: CRITICAL on parsoid-spof i-000004d6.pmtpa.wmflabs output: PROCS CRITICAL: 210 processes
[23:39:34] PROBLEM host: i-000003ef.pmtpa.wmflabs is DOWN address: i-000003ef.pmtpa.wmflabs CRITICAL - Host Unreachable (i-000003ef.pmtpa.wmflabs)
[23:53:10] andrewbogott, pushing from a shallow git can fail according to http://stackoverflow.com/questions/6941889 , I don't know if the failure is cryptic or whether it's easy to dig a deeper git repo.
[23:53:39] spagewmf: Yeah, there are other reports of it working, so maybe the issue is that it 'can' fail, as you say...
[23:53:57] I'm not sure that people will be trying to push from this application anyway, since the tree will be root-owned… hm.
[23:54:08] andrewbogott the stackoverflow questions suggest git clone --orphan is better, I've never tried that.
[23:54:42] Yeah, I must've read that same page. I don't see evidence that git clone --orphan actually exists, though, it certainly doesn't work with my current git install.
[23:55:05] I'm amused that a tool as young as git has so many mysterious, unplumbed depths.
[23:57:16] E3 team never pushes from our labs test instances, we only git fetch to them and paste those refs/changes/43/29343/3 lines from gerrit to try out patches. I don't even know that two people can git push from the same directory, the default permissions fight it.
[23:59:43] That's pretty much what I'm thinking. I always have a parallel private repo that I use to actually push from.
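(The fetch-a-patchset workflow described above, together with the push that `git review` wraps, sketched with the change ref quoted in the log; the target branch is an assumption.)

```
# Try a patchset out on a test instance: fetch the change ref straight from
# gerrit and check it out detached, without ever pushing from this tree.
git fetch https://gerrit.wikimedia.org/r/p/mediawiki/core.git refs/changes/43/29343/3
git checkout FETCH_HEAD

# From a separate full (non-shallow) clone, uploading a new patchset is the
# push `git review` performs under the hood:
git push origin HEAD:refs/for/master   # 'master' assumed; substitute the real target branch
```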