[00:14:36] !logs [00:14:36] logs http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-labs [00:22:33] server is back [00:30:08] yeah. I rebooted it to see if I could find bad memory [00:30:49] +/- 5 minutes off [00:30:55] eh? [00:31:18] my instance, and labsconsole.wikimedia.org were unavailable for that time [00:32:26] your instance should not have been [00:32:43] oops, yes the instance was 100% online [00:32:52] but not reachable via the web procy [00:32:54] proxy [00:33:29] THIS was off http://openid-wiki.instance-proxy.wmflabs.org/wiki/ [00:34:53] eh? that should have also had been fine [00:37:52] RECOVERY Free ram is now: OK on sube.pmtpa.wmflabs 10.4.0.245 output: OK: 27% free memory [00:38:22] RECOVERY Free ram is now: OK on bots-sql2.pmtpa.wmflabs 10.4.0.41 output: OK: 20% free memory [00:40:37] Ryan_Lane: to make you happy: I am working on E:OpenID towards fixing these bugs [00:40:48] It's only bad memory when you can't see it anymore [00:41:52] RECOVERY Free ram is now: OK on bots-nr1.pmtpa.wmflabs 10.4.1.2 output: OK: 20% free memory [00:46:23] PROBLEM Free ram is now: WARNING on bots-sql2.pmtpa.wmflabs 10.4.0.41 output: Warning: 19% free memory [00:49:53] PROBLEM Free ram is now: WARNING on bots-nr1.pmtpa.wmflabs 10.4.1.2 output: Warning: 17% free memory [00:50:53] PROBLEM Free ram is now: WARNING on sube.pmtpa.wmflabs 10.4.0.245 output: Warning: 14% free memory [01:04:42] PROBLEM Total processes is now: WARNING on bots-salebot.pmtpa.wmflabs 10.4.0.163 output: PROCS WARNING: 184 processes [01:14:43] RECOVERY Total processes is now: OK on bots-salebot.pmtpa.wmflabs 10.4.0.163 output: PROCS OK: 107 processes [01:29:53] PROBLEM host: followdb01d.pmtpa.wmflabs is DOWN address: 10.4.1.78 CRITICAL - Host Unreachable (10.4.1.78) [01:33:52] RECOVERY host: followdb01d.pmtpa.wmflabs is UP address: 10.4.1.78 PING OK - Packet loss = 0%, RTA = 0.63 ms [01:34:22] PROBLEM Total processes is now: CRITICAL on followdb01d.pmtpa.wmflabs 10.4.1.78 output: Connection refused by host [01:35:21] PROBLEM dpkg-check is now: CRITICAL on followdb01d.pmtpa.wmflabs 10.4.1.78 output: Connection refused by host [01:35:51] PROBLEM Current Load is now: CRITICAL on followdb01d.pmtpa.wmflabs 10.4.1.78 output: Connection refused by host [01:36:11] PROBLEM Free ram is now: WARNING on nova-precise2.pmtpa.wmflabs 10.4.1.57 output: Warning: 16% free memory [01:36:31] PROBLEM Current Users is now: CRITICAL on followdb01d.pmtpa.wmflabs 10.4.1.78 output: Connection refused by host [01:37:22] PROBLEM Disk Space is now: CRITICAL on followdb01d.pmtpa.wmflabs 10.4.1.78 output: Connection refused by host [01:38:01] PROBLEM Free ram is now: CRITICAL on followdb01d.pmtpa.wmflabs 10.4.1.78 output: Connection refused by host [01:43:38] :) labsconsole is back :) [01:45:52] RECOVERY Current Load is now: OK on followdb01d.pmtpa.wmflabs 10.4.1.78 output: OK - load average: 1.04, 1.06, 0.67 [01:46:32] RECOVERY Current Users is now: OK on followdb01d.pmtpa.wmflabs 10.4.1.78 output: USERS OK - 0 users currently logged in [01:46:42] RECOVERY dpkg-check is now: OK on followdb01d.pmtpa.wmflabs 10.4.1.78 output: All packages OK [01:47:12] RECOVERY Disk Space is now: OK on followdb01d.pmtpa.wmflabs 10.4.1.78 output: DISK OK [01:48:03] RECOVERY Free ram is now: OK on followdb01d.pmtpa.wmflabs 10.4.1.78 output: OK: 2861% free memory [01:48:14] 2861% free memory? o.O [01:49:17] :P [01:49:22] RECOVERY Total processes is now: OK on followdb01d.pmtpa.wmflabs 10.4.1.78 output: PROCS OK: 100 processes [01:55:09] Maybe it means 2861MB free memory? 
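The 2861% reading above is almost certainly the check printing a raw size (most likely megabytes, as guessed in the channel) into a template that says percent. For comparison, a minimal sketch of a percentage-based free-memory check reading /proc/meminfo; the thresholds and output wording mimic the bot messages in this log but are otherwise assumptions, not the plugin labs actually runs:

    #!/bin/bash
    # Sketch of a percentage-based free-memory check for a Linux host.
    # WARN/CRIT thresholds are illustrative only.
    warn=20
    crit=10

    # MemTotal/MemFree are reported in kB; dividing them gives a true
    # percentage, whereas printing the raw kB/MB figure would produce
    # output like "2861% free memory".
    total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
    pct=$(( free_kb * 100 / total_kb ))

    if   [ "$pct" -le "$crit" ]; then echo "Critical: ${pct}% free memory"; exit 2
    elif [ "$pct" -le "$warn" ]; then echo "Warning: ${pct}% free memory";  exit 1
    else                              echo "OK: ${pct}% free memory";       exit 0
    fi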
[01:56:41] would make more sense :P [02:09:51] It's that magic ram ;) [02:14:46] PROBLEM dpkg-check is now: CRITICAL on followdb01d.pmtpa.wmflabs 10.4.1.78 output: DPKG CRITICAL dpkg reports broken packages [02:26:15] RECOVERY Free ram is now: OK on nova-precise2.pmtpa.wmflabs 10.4.1.57 output: OK: 20% free memory [03:16:55] PROBLEM Total processes is now: WARNING on bots-4.pmtpa.wmflabs 10.4.0.64 output: PROCS WARNING: 155 processes [03:31:56] RECOVERY Total processes is now: OK on bots-4.pmtpa.wmflabs 10.4.0.64 output: PROCS OK: 144 processes [04:18:53] PROBLEM Total processes is now: WARNING on parsoid-roundtrip4-8core.pmtpa.wmflabs 10.4.0.39 output: PROCS WARNING: 151 processes [04:38:53] RECOVERY Total processes is now: OK on parsoid-roundtrip4-8core.pmtpa.wmflabs 10.4.0.39 output: PROCS OK: 147 processes [04:39:53] RECOVERY Free ram is now: OK on bots-nr1.pmtpa.wmflabs 10.4.1.2 output: OK: 20% free memory [04:40:54] RECOVERY Free ram is now: OK on sube.pmtpa.wmflabs 10.4.0.245 output: OK: 27% free memory [04:48:52] PROBLEM Free ram is now: WARNING on sube.pmtpa.wmflabs 10.4.0.245 output: Warning: 14% free memory [05:02:52] PROBLEM Free ram is now: WARNING on bots-nr1.pmtpa.wmflabs 10.4.1.2 output: Warning: 17% free memory [05:14:52] PROBLEM host: tstarling-puppet.pmtpa.wmflabs is DOWN address: 10.4.1.79 CRITICAL - Host Unreachable (10.4.1.79) [05:18:57] RECOVERY host: tstarling-puppet.pmtpa.wmflabs is UP address: 10.4.1.79 PING OK - Packet loss = 0%, RTA = 0.61 ms [05:19:27] PROBLEM Total processes is now: CRITICAL on tstarling-puppet.pmtpa.wmflabs 10.4.1.79 output: Connection refused by host [05:20:13] PROBLEM dpkg-check is now: CRITICAL on tstarling-puppet.pmtpa.wmflabs 10.4.1.79 output: Connection refused by host [05:20:52] PROBLEM Current Load is now: CRITICAL on tstarling-puppet.pmtpa.wmflabs 10.4.1.79 output: Connection refused by host [05:21:32] PROBLEM Current Users is now: CRITICAL on tstarling-puppet.pmtpa.wmflabs 10.4.1.79 output: Connection refused by host [05:22:12] PROBLEM Disk Space is now: CRITICAL on tstarling-puppet.pmtpa.wmflabs 10.4.1.79 output: Connection refused by host [05:23:02] PROBLEM Free ram is now: CRITICAL on tstarling-puppet.pmtpa.wmflabs 10.4.1.79 output: Connection refused by host [05:29:22] RECOVERY Total processes is now: OK on tstarling-puppet.pmtpa.wmflabs 10.4.1.79 output: PROCS OK: 84 processes [05:30:48] RECOVERY Current Load is now: OK on tstarling-puppet.pmtpa.wmflabs 10.4.1.79 output: OK - load average: 0.28, 0.88, 0.63 [05:30:49] RECOVERY dpkg-check is now: OK on tstarling-puppet.pmtpa.wmflabs 10.4.1.79 output: All packages OK [05:31:28] RECOVERY Current Users is now: OK on tstarling-puppet.pmtpa.wmflabs 10.4.1.79 output: USERS OK - 0 users currently logged in [05:32:18] RECOVERY Disk Space is now: OK on tstarling-puppet.pmtpa.wmflabs 10.4.1.79 output: DISK OK [05:32:58] RECOVERY Free ram is now: OK on tstarling-puppet.pmtpa.wmflabs 10.4.1.79 output: OK: 897% free memory [06:29:34] PROBLEM Total processes is now: WARNING on dumps-bot2.pmtpa.wmflabs 10.4.0.60 output: PROCS WARNING: 153 processes [06:31:54] PROBLEM Total processes is now: WARNING on parsoid-roundtrip4-8core.pmtpa.wmflabs 10.4.0.39 output: PROCS WARNING: 154 processes [06:33:44] PROBLEM Free ram is now: WARNING on newprojectsfeed-bot.pmtpa.wmflabs 10.4.0.232 output: Warning: 19% free memory [06:43:54] RECOVERY Free ram is now: OK on sube.pmtpa.wmflabs 10.4.0.245 output: OK: 31% free memory [06:48:39] RECOVERY Free ram is now: OK on newprojectsfeed-bot.pmtpa.wmflabs 10.4.0.232 
output: OK: 50% free memory [06:51:54] RECOVERY Total processes is now: OK on parsoid-roundtrip4-8core.pmtpa.wmflabs 10.4.0.39 output: PROCS OK: 150 processes [06:54:34] RECOVERY Total processes is now: OK on dumps-bot2.pmtpa.wmflabs 10.4.0.60 output: PROCS OK: 149 processes [07:21:55] PROBLEM Current Load is now: WARNING on wikidata-dev-9.pmtpa.wmflabs 10.4.1.41 output: WARNING - load average: 7.99, 6.96, 5.81 [07:34:43] RECOVERY dpkg-check is now: OK on followdb01d.pmtpa.wmflabs 10.4.1.78 output: All packages OK [08:11:03] Ryan_Lane: is it possible for you to open for me e.g. port 9001 http://openid-wiki.instance-proxy.wmflabs.org:9001 so that I can try my Etherpad installation? [08:11:54] RECOVERY Current Load is now: OK on wikidata-dev-9.pmtpa.wmflabs 10.4.1.41 output: OK - load average: 4.12, 4.34, 4.97 [08:15:53] PROBLEM Total processes is now: WARNING on parsoid-roundtrip4-8core.pmtpa.wmflabs 10.4.0.39 output: PROCS WARNING: 151 processes [08:17:33] PROBLEM Total processes is now: WARNING on dumps-bot2.pmtpa.wmflabs 10.4.0.60 output: PROCS WARNING: 151 processes [08:22:59] !bots [08:22:59] http://www.mediawiki.org/wiki/Wikimedia_Labs/Create_a_bot_running_infrastructure proposal for bots [08:23:09] bingo [08:37:32] RECOVERY Total processes is now: OK on dumps-bot2.pmtpa.wmflabs 10.4.0.60 output: PROCS OK: 146 processes [08:40:52] RECOVERY Total processes is now: OK on parsoid-roundtrip4-8core.pmtpa.wmflabs 10.4.0.39 output: PROCS OK: 147 processes [08:41:52] PROBLEM Free ram is now: WARNING on sube.pmtpa.wmflabs 10.4.0.245 output: Warning: 15% free memory [08:42:27] addshore here [08:42:29] :D [08:42:32] :) [08:42:34] needed to start my irc [08:42:47] actually I am running it in C# IDE [08:42:48] :D [08:42:54] so I needed to rebuild it [08:42:57] haha [08:43:06] iv been awake for 30 hours :D [08:43:11] lol [08:43:12] go sleep [08:43:20] nah, its nearly 9am [08:43:20] xD [08:49:44] addshore it's still unclear how it's going to be done, but for now consider *nr production [08:49:58] @labs-resolve nr [08:49:59] I don't know this instance - aren't you are looking for: I-0000049e (bots-nr1), I-00000567 (bots-nr2), I-0000056f (bots-bnr1), [08:50:15] bnr1 is big [08:50:34] nr instances for the live bots, some other instances for deving on? [08:51:04] I did notice you make that one :) [08:53:43] petan, labs console broke for me again >.< [08:55:11] how [08:55:24] cant login, it asks about cookies [08:55:38] last time It did this to me it lasted for about 24 hours :/ [08:56:59] ugh. it's back up [08:57:45] [= [09:10:27] !log proposals reboot mysql1 [09:10:34] >.< [09:14:24] !log deployment-prep Attempting to setup the testing mobile Varnish instance (deployment-varnish-t) using {{gerrit|44709}} [09:20:05] addshore what did you want to log? [09:20:34] exactly what I said :P Im just editing the console page for now [09:20:52] Ryan_Lane is it me or are there some problem with performance? [09:20:56] all vm's are so slow [09:21:22] !ping [09:21:22] pong [09:21:34] just wm-bot is fast as always :D [09:22:01] everything looks fine in ganglia... [09:22:35] petan, give it a reboot, worked for me, [09:22:43] reboot to what [09:22:51] your instance [09:22:56] :/ [09:23:01] that would be like 20 instances [09:23:03] xD [09:23:09] I don't want to reboot whole bots project [09:23:29] logbot is back [09:24:47] Ryan_Lane: get to bed! :-D [09:24:54] yeah. 
I am very soon [09:24:55] * hashar hands a pillow to Ryan [09:25:01] * hashar and a book [09:25:10] I'll take a look at performance issues in the morning [09:25:24] will ping andrew when he connects :-D [09:42:05] petan: so, I believe this was linked to ldap issues [09:42:18] the ldap server on virt0 didn't come up properly and was hung [09:42:28] nslcd was properly timing out and failing over to virt1000 [09:42:57] I had rebooted virt0 earlier in the day [09:43:57] so, nslcd (and related libraries) have a timeout that increases as it sees the primary is dead [09:44:08] that's what makes things slow [09:44:25] the next time it checks the primary it should see that it's up and things will go back to normal [09:45:20] aha [09:45:26] ok [09:45:30] well, I say that, but now I'm getting timeouts again [09:45:59] ah. because they were on virt1000 [09:46:08] which was also somehow stuck [09:46:12] bleh [09:46:57] there we go [09:47:09] as a side effect, I also had to restart dns [09:51:54] !log deployment-prep rebooting deployment-varnish-t to find out how well it goes on restart :-] [09:51:56] Logged the message, Master [09:55:22] PROBLEM Disk Space is now: CRITICAL on deployment-varnish-t.pmtpa.wmflabs 10.4.1.74 output: Connection refused by host [09:56:02] PROBLEM Free ram is now: CRITICAL on deployment-varnish-t.pmtpa.wmflabs 10.4.1.74 output: Connection refused by host [09:56:12] PROBLEM dpkg-check is now: CRITICAL on deployment-varnish-t.pmtpa.wmflabs 10.4.1.74 output: Connection refused by host [09:56:42] PROBLEM SSH is now: CRITICAL on deployment-varnish-t.pmtpa.wmflabs 10.4.1.74 output: Connection refused [09:57:09] !log relaying Ryan: he restarted ldap on virt0 (was hung after server restart). nscld was properly falling back to virt1000 but ldap was stuck there too. DNS got restarted. [09:57:10] relaying is not a valid project. [09:57:21] !log bastion relaying Ryan: he restarted ldap on virt0 (was hung after server restart). nscld was properly falling back to virt1000 but ldap was stuck there too. DNS got restarted. [09:57:22] Logged the message, Master [09:57:32] PROBLEM Total processes is now: CRITICAL on deployment-varnish-t.pmtpa.wmflabs 10.4.1.74 output: Connection refused by host [09:57:38] pooor instance [09:58:52] PROBLEM Current Load is now: CRITICAL on deployment-varnish-t.pmtpa.wmflabs 10.4.1.74 output: Connection refused by host [09:58:56] !log deployment-prep Rebooting deployment-varnish-t from labsconsole. I guess there is a mount for /dev/sda* :( [09:58:57] Logged the message, Master [09:59:32] PROBLEM Current Users is now: CRITICAL on deployment-varnish-t.pmtpa.wmflabs 10.4.1.74 output: Connection refused by host [10:09:53] PROBLEM Current Load is now: WARNING on wikidata-dev-9.pmtpa.wmflabs 10.4.1.41 output: WARNING - load average: 6.30, 6.22, 5.39 [10:18:43] PROBLEM host: deployment-varnish-t.pmtpa.wmflabs is DOWN address: 10.4.1.74 CRITICAL - Host Unreachable (10.4.1.74) [10:25:32] !log depoyment-prep re rebooting deployment-varnish-t [10:25:32] depoyment-prep is not a valid project. [10:25:49] ... 
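Back on the LDAP problem Ryan walks through above (the server on virt0 hung after the reboot, nslcd backing off with a growing timeout and failing over to virt1000, which later got stuck as well), a quick way to see which server is actually answering is to probe both directly with a hard time cap, so a hung server fails fast instead of stalling the way nslcd does. A sketch assuming the OpenLDAP client tools and an anonymous rootDSE read; the fully-qualified hostnames and the 5-second cap are guesses, not how labs monitors this:

    #!/bin/bash
    # Probe each LDAP server; "timeout" kills the query if the server is
    # hung rather than merely refusing connections.
    for host in virt0.wikimedia.org virt1000.wikimedia.org; do
        if timeout 5 ldapsearch -x -H "ldap://$host" -s base -b "" \
                '(objectclass=*)' > /dev/null 2>&1; then
            echo "$host: LDAP responding"
        else
            echo "$host: no answer (hung or unreachable)"
        fi
    done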
[10:25:58] !log deployment-prep re rebooting dpeloyment-varnish-t [10:26:01] Logged the message, Master [10:26:08] I should get that project renamed to simply "beta" [10:38:01] !log deployment-prep creating deployment-varnish-t2 to replace broken deployment-varnish-t [10:38:03] Logged the message, Master [10:48:14] !log deployment-prep moved 208.80.153.143 from deployment-varnish-t to deployment-varnish-t2 (IP is in DNS as *.m.beta.wmflabs.org ) [10:48:16] Logged the message, Master [10:48:52] PROBLEM host: deployment-varnish-t.pmtpa.wmflabs is DOWN address: 10.4.1.74 CRITICAL - Host Unreachable (10.4.1.74) [10:49:52] PROBLEM host: deployment-varnish-t2.pmtpa.wmflabs is DOWN address: 10.4.1.80 CRITICAL - Host Unreachable (10.4.1.80) [10:53:54] RECOVERY host: deployment-varnish-t2.pmtpa.wmflabs is UP address: 10.4.1.80 PING OK - Packet loss = 0%, RTA = 0.84 ms [10:54:24] PROBLEM Total processes is now: CRITICAL on deployment-varnish-t2.pmtpa.wmflabs 10.4.1.80 output: Connection refused by host [10:55:52] PROBLEM Current Load is now: CRITICAL on deployment-varnish-t2.pmtpa.wmflabs 10.4.1.80 output: Connection refused by host [10:55:52] PROBLEM dpkg-check is now: CRITICAL on deployment-varnish-t2.pmtpa.wmflabs 10.4.1.80 output: Connection refused by host [10:56:32] PROBLEM Current Users is now: CRITICAL on deployment-varnish-t2.pmtpa.wmflabs 10.4.1.80 output: Connection refused by host [10:57:12] PROBLEM Disk Space is now: CRITICAL on deployment-varnish-t2.pmtpa.wmflabs 10.4.1.80 output: Connection refused by host [10:58:02] PROBLEM Free ram is now: CRITICAL on deployment-varnish-t2.pmtpa.wmflabs 10.4.1.80 output: Connection refused by host [11:14:53] RECOVERY Current Load is now: OK on wikidata-dev-9.pmtpa.wmflabs 10.4.1.41 output: OK - load average: 4.00, 4.23, 4.78 [11:18:52] PROBLEM host: deployment-varnish-t.pmtpa.wmflabs is DOWN address: 10.4.1.74 CRITICAL - Host Unreachable (10.4.1.74) [11:26:44] labs piss me off sometime [11:27:52] !log deployment-prep created deployment-varnish-t3 , deleted deployment-varnish-t2 [11:27:53] Logged the message, Master [11:39:53] PROBLEM host: deployment-cache-mobile01.pmtpa.wmflabs is DOWN address: 10.4.1.82 CRITICAL - Host Unreachable (10.4.1.82) [11:39:53] PROBLEM host: deployment-varnish-t3.pmtpa.wmflabs is DOWN address: 10.4.1.83 CRITICAL - Host Unreachable (10.4.1.83) [11:43:53] RECOVERY host: deployment-varnish-t3.pmtpa.wmflabs is UP address: 10.4.1.83 PING OK - Packet loss = 0%, RTA = 0.70 ms [11:44:23] PROBLEM Total processes is now: CRITICAL on deployment-varnish-t3.pmtpa.wmflabs 10.4.1.83 output: Connection refused by host [11:45:52] PROBLEM Current Load is now: CRITICAL on deployment-varnish-t3.pmtpa.wmflabs 10.4.1.83 output: Connection refused by host [11:46:32] PROBLEM Current Users is now: CRITICAL on deployment-varnish-t3.pmtpa.wmflabs 10.4.1.83 output: Connection refused by host [11:46:32] PROBLEM dpkg-check is now: CRITICAL on deployment-varnish-t3.pmtpa.wmflabs 10.4.1.83 output: Connection refused by host [11:47:12] PROBLEM Disk Space is now: CRITICAL on deployment-varnish-t3.pmtpa.wmflabs 10.4.1.83 output: Connection refused by host [11:48:02] PROBLEM Free ram is now: CRITICAL on deployment-varnish-t3.pmtpa.wmflabs 10.4.1.83 output: Connection refused by host [11:52:13] RECOVERY Disk Space is now: OK on deployment-varnish-t3.pmtpa.wmflabs 10.4.1.83 output: DISK OK [11:53:03] RECOVERY Free ram is now: OK on deployment-varnish-t3.pmtpa.wmflabs 10.4.1.83 output: OK: 898% free memory [11:53:53] RECOVERY host: 
deployment-cache-mobile01.pmtpa.wmflabs is UP address: 10.4.1.82 PING OK - Packet loss = 0%, RTA = 0.55 ms [11:54:23] PROBLEM Total processes is now: CRITICAL on deployment-cache-mobile01.pmtpa.wmflabs 10.4.1.82 output: Connection refused by host [11:54:23] RECOVERY Total processes is now: OK on deployment-varnish-t3.pmtpa.wmflabs 10.4.1.83 output: PROCS OK: 84 processes [11:55:53] PROBLEM Current Load is now: CRITICAL on deployment-cache-mobile01.pmtpa.wmflabs 10.4.1.82 output: Connection refused by host [11:55:53] PROBLEM dpkg-check is now: CRITICAL on deployment-cache-mobile01.pmtpa.wmflabs 10.4.1.82 output: Connection refused by host [11:55:53] RECOVERY Current Load is now: OK on deployment-varnish-t3.pmtpa.wmflabs 10.4.1.83 output: OK - load average: 0.04, 0.55, 0.48 [11:56:33] PROBLEM Current Users is now: CRITICAL on deployment-cache-mobile01.pmtpa.wmflabs 10.4.1.82 output: Connection refused by host [11:56:34] RECOVERY Current Users is now: OK on deployment-varnish-t3.pmtpa.wmflabs 10.4.1.83 output: USERS OK - 0 users currently logged in [11:56:34] RECOVERY dpkg-check is now: OK on deployment-varnish-t3.pmtpa.wmflabs 10.4.1.83 output: All packages OK [11:57:13] PROBLEM Disk Space is now: CRITICAL on deployment-cache-mobile01.pmtpa.wmflabs 10.4.1.82 output: Connection refused by host [11:58:03] PROBLEM Free ram is now: CRITICAL on deployment-cache-mobile01.pmtpa.wmflabs 10.4.1.82 output: Connection refused by host [12:04:22] RECOVERY Total processes is now: OK on deployment-cache-mobile01.pmtpa.wmflabs 10.4.1.82 output: PROCS OK: 84 processes [12:05:54] RECOVERY Current Load is now: OK on deployment-cache-mobile01.pmtpa.wmflabs 10.4.1.82 output: OK - load average: 0.26, 0.95, 0.74 [12:05:54] RECOVERY dpkg-check is now: OK on deployment-cache-mobile01.pmtpa.wmflabs 10.4.1.82 output: All packages OK [12:06:34] RECOVERY Current Users is now: OK on deployment-cache-mobile01.pmtpa.wmflabs 10.4.1.82 output: USERS OK - 0 users currently logged in [12:07:13] RECOVERY Disk Space is now: OK on deployment-cache-mobile01.pmtpa.wmflabs 10.4.1.82 output: DISK OK [12:08:03] RECOVERY Free ram is now: OK on deployment-cache-mobile01.pmtpa.wmflabs 10.4.1.82 output: OK: 901% free memory [12:28:15] !log deployment-prep applying role::cache::mobile on deployment-varnish-t3 [12:28:18] Logged the message, Master [12:42:22] !log deployment-prep -varnish-t3 : removing /dev/sda* entries from /etc/fstab , applying {{gerrit|44709}} ps 6 and rerunning puppet [12:42:24] Logged the message, Master [15:05:03] PROBLEM host: wikidata-testclient.pmtpa.wmflabs is DOWN address: 10.4.0.23 CRITICAL - Host Unreachable (10.4.0.23) [15:25:42] RECOVERY host: wikidata-testclient.pmtpa.wmflabs is UP address: 10.4.0.23 PING OK - Packet loss = 0%, RTA = 3.12 ms [15:25:52] PROBLEM Current Load is now: CRITICAL on wikidata-testclient.pmtpa.wmflabs 10.4.0.23 output: Connection refused by host [15:25:52] PROBLEM dpkg-check is now: CRITICAL on wikidata-testclient.pmtpa.wmflabs 10.4.0.23 output: Connection refused by host [15:26:32] PROBLEM Current Users is now: CRITICAL on wikidata-testclient.pmtpa.wmflabs 10.4.0.23 output: Connection refused by host [15:27:12] PROBLEM Disk Space is now: CRITICAL on wikidata-testclient.pmtpa.wmflabs 10.4.0.23 output: Connection refused by host [15:28:02] PROBLEM Free ram is now: CRITICAL on wikidata-testclient.pmtpa.wmflabs 10.4.0.23 output: Connection refused by host [15:29:32] PROBLEM Total processes is now: CRITICAL on wikidata-testclient.pmtpa.wmflabs 10.4.0.23 output: CHECK_NRPE: Socket timeout 
after 10 seconds. [15:54:53] PROBLEM Free ram is now: WARNING on swift-be4.pmtpa.wmflabs 10.4.0.127 output: Warning: 19% free memory [16:00:00] !log wikidata-dev wikidata-dev-9: Memcached errors again earlier today. Restarted memcached manually. [16:00:01] Logged the message, Master [16:24:55] RECOVERY Free ram is now: OK on swift-be4.pmtpa.wmflabs 10.4.0.127 output: OK: 21% free memory [16:32:53] PROBLEM Free ram is now: WARNING on swift-be4.pmtpa.wmflabs 10.4.0.127 output: Warning: 19% free memory [16:41:52] RECOVERY Free ram is now: OK on sube.pmtpa.wmflabs 10.4.0.245 output: OK: 27% free memory [16:46:51] Hello! :) [16:47:13] hi :D [16:48:17] What's your experience with labs instances that fail to get ready for use? I sometimes throw some instances away and create new ones but at least 1/3 never gets to a state that I could log in via ssh. [16:49:00] The latest one says in the log output that puppet cannot run apt-get to install the necessary ldap stuff. [16:49:48] Are error message from newly created instances collected anywhere? In Bugzilla for example? [16:54:37] Actually, I'll ask this on the ML. [17:04:52] PROBLEM Free ram is now: WARNING on sube.pmtpa.wmflabs 10.4.0.245 output: Warning: 9% free memory [17:45:54] PROBLEM Free ram is now: WARNING on swift-be4.pmtpa.wmflabs 10.4.0.127 output: Warning: 19% free memory [18:55:54] RECOVERY Free ram is now: OK on swift-be4.pmtpa.wmflabs 10.4.0.127 output: OK: 21% free memory [19:01:44] PROBLEM Free ram is now: WARNING on wordpressbeta-precise.pmtpa.wmflabs 10.4.0.215 output: Warning: 19% free memory [19:06:42] RECOVERY Free ram is now: OK on wordpressbeta-precise.pmtpa.wmflabs 10.4.0.215 output: OK: 20% free memory [19:07:51] petan: I'd like to join the Webtools project - can you help me with that? [19:32:52] PROBLEM Free ram is now: CRITICAL on bots-nr1.pmtpa.wmflabs 10.4.1.2 output: Critical: 5% free memory [19:37:53] RECOVERY Free ram is now: OK on bots-nr1.pmtpa.wmflabs 10.4.1.2 output: OK: 59% free memory [19:38:53] PROBLEM Free ram is now: WARNING on swift-be4.pmtpa.wmflabs 10.4.0.127 output: Warning: 19% free memory [19:48:33] hashar: are you hoping for a prompt review of https://gerrit.wikimedia.org/r/#/c/44709/ or still tinkering? [19:48:53] * hashar looks up "hoping" and "tinkering" in his dictionnary [19:49:05] andrewbogott: so yeah that made mark cry [19:49:06] :-] [19:49:32] I have applied that change manually, but I guess Mark will have the final word before it lands in the repo [19:49:41] specially with eqiad being deployed [19:49:44] Pretty much the only way I can contribute is by nagging mark to review, so if you're already talking to mark then I'll keep out of it [19:50:04] if you have any idea how to enable varnish logging under /var/log/varnish , that would be nice [19:50:20] the role::cache::mobile has three varnish_logging() calls which I have disabled under beta [19:50:32] cause they point to prod machines and one points to multicast [19:50:45] I guess there is a way to make varnish log in a file [19:51:15] I know less than google, I'm sure [19:51:21] ahh [19:51:27] maybe it is better than yahoo.fr [19:51:39] I should buy that google book everyone is talking about [19:51:53] anyway, that is progressing [19:52:27] andrewbogott: unrelated, but memcached died again (I have seen your comment) [19:52:40] Is it dead right now? 
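On hashar's question a few lines up about getting the beta varnish instance to log under /var/log/varnish instead of the production varnish_logging() targets: one option is to run varnishncsa against the shared-memory log and write to a local file. A sketch assuming Varnish as packaged on the Precise instances; the flags and paths should be checked against the installed version:

    #!/bin/bash
    # Run varnishncsa as a daemon, appending Apache-style access-log lines
    # to a local file instead of shipping them to production collectors.
    # -a append, -w write to file, -D daemonize, -P pid file.
    mkdir -p /var/log/varnish
    varnishncsa -a -w /var/log/varnish/access.log -D -P /var/run/varnishncsa.pid

The Ubuntu varnish packages also ship an init script for varnishncsa (enabled through /etc/default/varnishncsa, if that file is present on the instance), which would be the more puppet-friendly way to keep it running.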
[19:52:40] and ryan fixed up virt0 ldap like 11 - 12 hours ago [19:52:56] just writing you a post mortem hehe [19:52:58] totally should just use redis [19:53:00] memcached is alive I believe [19:53:02] ok. [19:53:03] nagios died too I think [19:53:13] I'm not sure I know what the virt0 ldap thing is… maybe I'm behind on my email [19:53:38] not a big deal, that is fixed :-) [19:53:51] I had another issue but self fixed it (delete / recreate an instance haha) [19:54:04] so I guess you can enjoy your day off [19:54:34] nagios is alive ish [19:54:50] yeah [19:54:59] oh I meant the production one [19:55:32] Damianz: have you figured out why nagios does not receive the SNMP puppet freshness traps? [19:55:34] Damianz: http://nagios.wmflabs.org/cgi-bin/nagios3/status.cgi?hostgroup=deployment-prep&style=detail [19:55:47] oh, it recieves them just fine [19:55:52] it's just sent the wrong hostname [19:56:43] PROBLEM Free ram is now: WARNING on wordpressbeta-precise.pmtpa.wmflabs 10.4.0.215 output: Warning: 19% free memory [19:57:47] Damianz: ahhh that is cause of the instance id / alias ? [19:57:59] like i-0000abcd vs some-host-name-meaningful ? [19:58:49] actually it's not running atm, but yes it has the issue that somehost.someregion.wmflabs doesn't match i-xxxx [19:58:58] hashar: database sequence :) [20:00:43] Nope actually it is running, it just put a stupid error in the log for some reason -.- [20:04:35] Warning: Passive check result was received for service 'Puppet freshness' on host 'nagios.wikimedia.org', but the host could not be found! < for example is actually nagios-main.pmtpa.wmflabs [20:31:00] bah [20:31:04] I hate start-stop-daemon [20:31:09] I end up using strace to get the error message [20:31:10] write(2, "ERROR, no such host\n", 20) = 20 [20:31:11] nice [20:37:52] argh crap [20:38:52] RECOVERY Free ram is now: OK on swift-be4.pmtpa.wmflabs 10.4.0.127 output: OK: 22% free memory [20:50:16] hashar: stderr redirected to /dev/hell [21:06:53] PROBLEM Free ram is now: WARNING on swift-be4.pmtpa.wmflabs 10.4.0.127 output: Warning: 19% free memory [21:29:34] PROBLEM Current Users is now: CRITICAL on build-lucid1.pmtpa.wmflabs 10.4.1.53 output: Connection refused by host [21:30:13] PROBLEM Disk Space is now: CRITICAL on build-lucid1.pmtpa.wmflabs 10.4.1.53 output: Connection refused by host [21:30:53] PROBLEM Free ram is now: CRITICAL on build-lucid1.pmtpa.wmflabs 10.4.1.53 output: Connection refused by host [21:30:54] PROBLEM Current Load is now: CRITICAL on build-lucid1.pmtpa.wmflabs 10.4.1.53 output: Connection refused by host [21:30:54] saper: yeah kind of [21:31:13] saper: so I end up editing the init script to use bash -x [21:31:18] run it to get the full command [21:31:23] then strace that command :-] [21:32:23] PROBLEM Total processes is now: CRITICAL on build-lucid1.pmtpa.wmflabs 10.4.1.53 output: Connection refused by host [21:33:13] PROBLEM dpkg-check is now: CRITICAL on build-lucid1.pmtpa.wmflabs 10.4.1.53 output: Connection refused by host [21:34:33] RECOVERY Current Users is now: OK on build-lucid1.pmtpa.wmflabs 10.4.1.53 output: USERS OK - 0 users currently logged in [21:35:14] RECOVERY Disk Space is now: OK on build-lucid1.pmtpa.wmflabs 10.4.1.53 output: DISK OK [21:35:54] RECOVERY Free ram is now: OK on build-lucid1.pmtpa.wmflabs 10.4.1.53 output: OK: 746% free memory [21:35:54] RECOVERY Current Load is now: OK on build-lucid1.pmtpa.wmflabs 10.4.1.53 output: OK - load average: 0.54, 1.01, 0.54 [21:37:25] RECOVERY Total processes is now: OK on build-lucid1.pmtpa.wmflabs 
10.4.1.53 output: PROCS OK: 79 processes [21:38:14] RECOVERY dpkg-check is now: OK on build-lucid1.pmtpa.wmflabs 10.4.1.53 output: All packages OK [21:41:01] !log wlmjudging purged proftpd and webmin packages [21:41:02] Logged the message, Master [21:41:14] !log wlmjudging disabled phpmyadmin in apache [21:41:15] Logged the message, Master [21:41:27] webmin and phpmyadmin. bleh [21:51:53] RECOVERY Free ram is now: OK on swift-be4.pmtpa.wmflabs 10.4.0.127 output: OK: 21% free memory [22:23:50] andrewbogott_afk: did you merge the nginx proxy stuff? [22:24:35] it doesn't look dangerous, I'm merging it on sockpuppet [22:25:02] PROBLEM Free ram is now: WARNING on swift-be4.pmtpa.wmflabs 10.4.0.127 output: Warning: 19% free memory [22:40:42] Ryan_Lane, I'm moving the FBot project (now ContinuityBot) back to Labs, is there anything i need to know about where to put it in the bots project and which instances are more/less resource stricken? [22:41:05] JasonDC: I'm not totally sure [22:41:12] Damianz and petan would know best [22:46:09] alright, also is the wmflabs ganglia supposed to be how dead (inactive) it is? [22:56:40] it's broken currently [22:56:44] someone needs to fix it [22:56:56] s/currently/always/ [22:57:05] well, it wasn't broken originally ;) [23:05:36] !log bots ContinuityBot deployment server selected to be bots-nr2. [23:05:37] Logged the message, Master [23:19:08] Ryan_Lane, when you have a moment, please add this to the bots-nr2 puppet config [23:19:58] really need to get bots stuff merged [23:20:17] http://fpaste.org/N9H6/ [23:20:39] indeed [23:20:46] Damianz: want to do a sprint on that soon? [23:21:01] If you supply the drugs ;) [23:21:10] :D [23:21:20] are you going to be at the ams hackathon? [23:21:23] that stuff can go into default bots stuff [23:21:27] I almost wrote hackathong [23:21:35] ams? amsterdam [23:21:38] yeah [23:22:06] hmmm May, I can probably do May [23:22:18] Should have my new motorbike then so that's a good excuse to take it for a ride [23:22:31] cool [23:22:51] ugh. stupid ciscos disconnecting me from the console after a timeout [23:22:56] right in the middle of a fucking install [23:23:07] timeout 0 [23:23:12] or exec-timeout 0 [23:23:24] not in cisco [23:23:56] and they take like 10 minutes to boot [23:24:08] ah you're talking about the horrid server things [23:24:12] yes [23:24:12] * Damianz only does cisco switchy stuff [23:24:22] Though we might get ucs sooon... not sure how I feel about that [23:24:38] they are good for virtualization [23:24:43] not sure what else I'd ever use them for [23:25:49] yeah... I'd just be sticking vmware on them [23:49:03] awww europython is in July [23:59:16] PROBLEM Free ram is now: WARNING on nova-precise2.pmtpa.wmflabs 10.4.1.57 output: Warning: 19% free memory
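For the "Puppet freshness" passive results arriving under the wrong hostname (nagios.wikimedia.org instead of nagios-main.pmtpa.wmflabs, as Damianz notes above), one workaround is to remap the host field before the result reaches nagios's external command file. A sketch only: the map file, the paths and the wrapper itself are assumptions, not part of the actual SNMP trap handler:

    #!/bin/bash
    # Rewrite the host field of a PROCESS_SERVICE_CHECK_RESULT line using a
    # hand-maintained map of "reported-name canonical-labs-name" pairs, then
    # hand it to nagios through the external command file.
    # Usage: submit_result.sh <host> <service> <return_code> <plugin output>
    MAP=/etc/nagios3/hostname-map.txt          # assumed location
    CMDFILE=/var/lib/nagios3/rw/nagios.cmd     # assumed location

    host=$1; service=$2; code=$3; shift 3; output=$*

    mapped=$(awk -v h="$host" '$1 == h {print $2}' "$MAP")
    [ -n "$mapped" ] && host=$mapped

    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$(date +%s)" "$host" "$service" "$code" "$output" >> "$CMDFILE"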
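The start-stop-daemon debugging trick hashar describes above (re-run the init script under bash -x to capture the exact daemon invocation, then strace that command to see the error message that never reaches the console) looks roughly like this; the service and daemon names are placeholders, not the actual script being debugged:

    #!/bin/bash
    # 1. Re-run the init script under "bash -x" so every expanded command,
    #    including the full start-stop-daemon invocation, is echoed.
    bash -x /etc/init.d/some-service start 2>&1 | grep start-stop-daemon

    # 2. Run the daemon command it was about to exec under strace, following
    #    forks and capturing writes; the swallowed error shows up as a
    #    write(2, "ERROR, no such host\n", ...) on stderr.
    strace -f -e trace=write -o /tmp/some-service.strace /usr/sbin/some-daemon
    grep 'write(2' /tmp/some-service.strace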