[00:01:25] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42701 [00:03:36] RECOVERY - Host db62 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [00:07:11] PROBLEM - MySQL Idle Transactions on db62 is CRITICAL: Connection refused by host [00:07:21] PROBLEM - MySQL Recent Restart on db62 is CRITICAL: Connection refused by host [00:07:29] PROBLEM - MySQL Replication Heartbeat on db62 is CRITICAL: Connection refused by host [00:07:57] PROBLEM - SSH on db62 is CRITICAL: Connection refused [00:07:58] PROBLEM - Full LVS Snapshot on db62 is CRITICAL: Connection refused by host [00:08:15] PROBLEM - MySQL Slave Delay on db62 is CRITICAL: Connection refused by host [00:08:41] PROBLEM - MySQL Slave Running on db62 is CRITICAL: Connection refused by host [00:08:42] PROBLEM - MySQL disk space on db62 is CRITICAL: Connection refused by host [00:10:20] what's this apache_status ganglia module doing on the nginx boxes? [00:10:48] ah [00:11:06] seems it isn't working properly [00:12:40] on the nginx boxes, i wouldn't expect it to [00:13:11] RECOVERY - SSH on db62 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:13:12] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [00:14:53] http://wiki.nginx.org/HttpStubStatusModule [00:14:57] well, that's configured [00:15:01] but it returns a 400 error [00:16:08] !log starting innobackupex on db1004 [00:16:18] Logged the message, notpeter [00:19:02] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [00:20:03] Ryan_Lane: did you make the working copy changes in tin:/srv/deployment/mediawiki/common ? [00:20:59] TimStarling: yes, you can wipe them out [00:21:11] I was looking at how much work it would be to change the paths [00:22:16] I think we were considering committing them [00:22:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:22:27] to a dev branch of course [00:22:54] * Ryan_Lane nods [00:23:28] I was considering a way to handle test, as well, btw [00:23:38] git deploy start [00:23:50] git deploy test <— would run a test sync script [00:24:05] it would also write a tag that had -test appended [00:24:10] it can probably just die [00:24:10] and would write out a .deploy.test file [00:24:13] oh? [00:24:16] we don't really use it anymore [00:24:20] Just keep test2? [00:24:26] that makes things simpler [00:24:41] Or move it to a more normal config and keep 2 test wikis? [00:25:19] yeah, we can keep the actual wiki, people write stuff on there that they might want to keep, like gadgets [00:26:33] it's not like it'd be difficult to have the test wikis on different branches/versions most of the time if necessary [00:28:56] PROBLEM - NTP on db62 is CRITICAL: NTP CRITICAL: No response from NTP server [00:29:27] is there any documentation about Antoine's labs project? [00:29:56] it seems like it should have its own name rather than being called after what it runs on... [00:31:16] ori-l: Did you ask mutante about your bz account merge? [00:31:53] i created an RT ticket and someone from ops (can't remember who) responded wondering out loud if this is proper to do.. in a meeting atm tho. [00:33:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.844 seconds [00:33:01] is it the deployment-prep project in labsconsole? [00:33:05] yes [00:35:36] ori-l, that was me (not quite from ops, just curious!) 
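(Aside: a minimal sketch of the "git deploy test" flow Ryan_Lane proposes above. The -test tag suffix, the .deploy.test marker and the test sync script are all part of his proposal, not an existing feature, and the tag name shown is purely illustrative.)

    # on the deploy host, inside e.g. /srv/deployment/mediawiki/slot0
    git deploy start                    # open a deployment window as usual
    git deploy test                     # proposed: run the test sync script instead of a full sync
    # which, per the proposal, would roughly amount to:
    git tag "slot0-20130108-1234-test"  # the usual deploy tag scheme with -test appended (name illustrative)
    touch .deploy.test                  # the proposed marker file written alongside the tag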
[00:37:08] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40344 [00:38:05] RECOVERY - MySQL Slave Running on db62 is OK: OK replication [00:38:23] RECOVERY - MySQL disk space on db62 is OK: DISK OK [00:38:32] RECOVERY - MySQL Idle Transactions on db62 is OK: OK longest blocking idle transaction sleeps for seconds [00:38:41] RECOVERY - MySQL Replication Heartbeat on db62 is OK: OK replication delay seconds [00:38:50] RECOVERY - MySQL Recent Restart on db62 is OK: OK seconds since restart [00:39:08] RECOVERY - Full LVS Snapshot on db62 is OK: OK no full LVM snapshot volumes [00:39:26] RECOVERY - MySQL Slave Delay on db62 is OK: OK replication delay seconds [00:40:42] New patchset: Pyoungmeister; "testing: further testing on s2/db62" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42703 [00:41:28] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42703 [00:44:11] AaronSchulz: what about deleted containers? do we need to sync those? [00:44:51] which ones? [00:44:57] all of them? [00:45:01] you mean the "deletion" container? [00:45:07] which deleted files [00:45:08] *-*-local-deleted [00:45:10] *with [00:45:20] heh, I thought you meant "containers that were deleted" ;) [00:45:29] oh, sorry [00:45:31] sync to where? ceph? [00:45:45] yes [00:46:09] yeah [00:46:15] okay [00:46:16] thanks :) [00:46:49] New patchset: Pyoungmeister; "testing: db61 and db62 with s3 and lucid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42704 [00:47:53] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42704 [00:48:31] paravoid: wait, it's almost 4:48pm here [00:48:43] yeah? [00:48:56] you are -0 right? [00:49:05] I'm EET, UTC+2 [00:49:23] so almost 3am? [00:49:24] localtime is 02:49am [00:49:45] well to be fair I committed something at like 2:30am last night pst [00:50:24] is this the first time you're noticing that I tend to work late? :) [00:50:32] PROBLEM - Host db62 is DOWN: PING CRITICAL - Packet loss = 100% [00:50:41] paravoid? work late? [00:50:44] I don't believe it [00:51:05] haha [00:51:13] notpeter: did you just reboot db62? 
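(For reference, the "*-*-local-deleted" containers AaronSchulz and paravoid agree to sync can be enumerated with the swift client; the auth URL, account and key below are placeholders rather than production credentials.)

    # list the per-wiki deletion containers that would need syncing to ceph
    swift -A <auth-url> -U <account:user> -K <key> list | grep -- '-local-deleted$'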
[00:51:26] agh [00:51:34] yes [00:51:36] I'm sorry [00:51:37] please abort lucid reimage [00:51:41] ok [00:51:56] that would be a big fail [00:52:29] you can't run lucid on an r720 [00:52:39] ah, ok [00:52:52] and i'm doing mha experimentation on db62 [00:52:57] ah, sorry [00:53:13] if no lucid, then I'm done with db61 and db62 [00:53:16] sorry for not checking first [00:53:52] notpeter: take db1036, it's a r510 that was just repaired and isn't slaving anything [00:54:00] binasher: ok, cool [00:54:44] RECOVERY - Host db62 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [00:55:41] binasher: let me know when you have a sec [00:56:20] AaronSchulz: sup [01:00:32] New patchset: RobH; "adding caesium to parsoid eqiad cluster role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42358 [01:02:23] New patchset: Pyoungmeister; "testing: actually db1036 as fake s3 master (lucid test)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42705 [01:03:09] New patchset: RobH; "adding caesium to parsoid eqiad cluster role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42358 [01:04:22] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42705 [01:05:15] !log temp stopping puppet on brewster [01:05:27] Logged the message, notpeter [01:05:47] New patchset: RobH; "adding caesium to parsoid eqiad cluster role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42358 [01:07:09] is http://en.wikipedia.beta.wmflabs.org/ meant to work? [01:07:18] or was squid stopped on it deliberately? [01:07:24] New patchset: RobH; "adding caesium to parsoid eqiad cluster role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42358 [01:08:10] New review: RobH; "That took entirely too many patchsets, it must be the end of the day." [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/42358 [01:08:11] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42358 [01:08:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:12:18] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 270 seconds [01:13:29] PROBLEM - SSH on db1036 is CRITICAL: Connection refused [01:14:05] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [01:14:23] PROBLEM - MySQL disk space on db1036 is CRITICAL: Connection refused by host [01:14:44] TimStarling: I believe it should be working [01:15:04] I'll start squid on it [01:15:07] * Ryan_Lane nods [01:16:13] !log on deployment-squid.pmtpa.wmflabs: started squid [01:16:24] Logged the message, Master [01:16:52] New patchset: Dzahn; "add aaron, maryana, rfaulkner accounts to vanadium (RT-4139)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42708 [01:17:02] TimStarling, several of the VMs in that cluster were freaking out because of runaway logfiles resulting in full disks. 
That's probably why it is (or was) down [01:18:06] New patchset: Dzahn; "add aaron, maryana, rfaulkner accounts to vanadium (RT-4139)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42708 [01:18:17] (specifically, /var/log/glusterfs/*.log) [01:19:07] it seems fine now [01:19:40] New review: Dzahn; "RT-4139" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/42708 [01:19:41] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42708 [01:22:20] RECOVERY - SSH on db1036 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [01:25:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.268 seconds [01:28:11] RECOVERY - MySQL disk space on db1036 is OK: DISK OK [01:28:23] binasher: http://pastebin.com/Xkp7K2Kp [01:29:23] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [01:34:44] New patchset: Pyoungmeister; "testing: db1036 set to use coredb s3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42710 [01:35:15] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [01:36:16] binasher: oh, and it must be 2.6 since I see Lua options in the conf [01:37:14] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42710 [01:42:48] binasher: enabling "persistent" mode seems to fix it [01:42:50] odd [01:42:59] (in RedisBagOStuff) [01:43:25] New patchset: Pyoungmeister; "coredb: cp/paste error for lucid package" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42711 [01:43:58] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42711 [01:44:02] binasher: lots of our redis.log is 'Redis exception on server X' too [01:44:07] for session [01:44:20] maybe this problem is related [01:44:44] I hope phpredis is not making a bunch of new connections due to some bug [01:45:13] *all of our redis.log [01:45:22] though it's not terribly huge [01:47:36] New patchset: Pyoungmeister; "testing: cleanup from testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42712 [01:48:03] bumping maxclients to 4096 in redis.conf helps :) [01:48:36] oh, wait I left persistence on [01:48:41] no it doesn't then [01:48:43] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42712 [01:49:03] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [01:58:29] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [01:59:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:02:02] New patchset: Pyoungmeister; "coredb: hacking in temporary mariadb support" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42713 [02:03:03] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42713 [02:09:48] New patchset: Pyoungmeister; "coredb: also need mariadb-related config options in my.cnf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42714 [02:10:28] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42714 [02:15:21] New patchset: Pyoungmeister; "testing: making db1036 mariadb s4 slave" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42715 [02:15:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.032 seconds [02:17:03] 
Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42715 [02:19:03] PROBLEM - Host db1036 is DOWN: PING CRITICAL - Packet loss = 100% [02:24:54] RECOVERY - Host db1036 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [02:26:01] !log LocalisationUpdate completed (1.21wmf7) at Tue Jan 8 02:26:01 UTC 2013 [02:26:11] Logged the message, Master [02:28:58] PROBLEM - MySQL Slave Delay on db1036 is CRITICAL: Connection refused by host [02:28:59] PROBLEM - SSH on db1036 is CRITICAL: Connection refused [02:29:15] PROBLEM - Full LVS Snapshot on db1036 is CRITICAL: Connection refused by host [02:29:25] PROBLEM - MySQL Idle Transactions on db1036 is CRITICAL: Connection refused by host [02:29:25] PROBLEM - mysqld processes on db1036 is CRITICAL: Connection refused by host [02:29:25] PROBLEM - MySQL Slave Running on db1036 is CRITICAL: Connection refused by host [02:30:10] PROBLEM - MySQL Recent Restart on db1036 is CRITICAL: Connection refused by host [02:30:10] PROBLEM - MySQL disk space on db1036 is CRITICAL: Connection refused by host [02:30:27] PROBLEM - MySQL Replication Heartbeat on db1036 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:21] RECOVERY - SSH on db1036 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [02:51:09] !log LocalisationUpdate completed (1.21wmf6) at Tue Jan 8 02:51:08 UTC 2013 [02:51:19] Logged the message, Master [02:52:28] New patchset: Eloquence; "(bug 43565) Switch Wikivoyage to new logo." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42717 [02:59:24] PROBLEM - NTP on db1036 is CRITICAL: NTP CRITICAL: No response from NTP server [03:29:28] New review: Dereckson; "To be completed." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/42717 [04:00:55] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [04:04:22] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.008 second response time on port 11000 [04:21:03] New review: Aude; "I have protected http://commons.wikimedia.org/wiki/File:WikivoyageLogoSmall.png and we will need to ..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42717 [04:44:07] root@deployment-bastion:/data/project# time tar -xzf /var/tmp/mediawiki-1.20.2.tar.gz [04:44:07] real 2m19.383s [04:44:20] tstarling@deployment-bastion:/var/tmp$ time tar -xzf mediawiki-1.20.2.tar.gz [04:44:20] real 0m0.753s [04:46:01] * TimStarling suspects the gluster server is just Ryan_Lane's laptop with a few external USB hard drives ;) [04:47:28] New review: Dereckson; "Actually, it seems to be " // Bug" the most favored comment." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/42688 [04:51:06] TimStarling: maybe desktop? otherwise what happens when he goes home? [04:51:19] New patchset: Ori.livneh; "(bug 43565) Update Wikivoyage logo and favicon" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42717 [04:53:26] maybe an old laptop with a broken battery that he donated to labs [04:55:56] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [05:01:55] New patchset: Dereckson; "(bug 43565) Switch Wikivoyage to new logo." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42717 [05:04:29] New review: Dereckson; "PS3: to support IE browsers, favicon in icon format, instead PNG." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42717 [05:20:14] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [05:22:59] TimStarling: don't get me started on gluster [05:23:11] lolololol [05:23:37] I hate it with a passion [05:23:49] Gluster is banned in our office [05:25:00] I'm hoping cephfs doesn't suck [05:25:10] the root partition and /mnt are both local storage, right? [05:25:14] yes [05:25:25] it'll be faster, to a point [05:26:06] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.071 second response time [05:26:19] definitely much, much faster than gluster, for a couple reasons: 1. it's SAS. 2. it's not gluster [05:26:29] we should put the git deploy trees on one of those local partitions, I think [05:26:34] yes [05:26:54] unless you're masochistic, you don't want it on /data/project [05:27:39] that's where the MW tree is at the moment, and I think that's probably why parser cache hits take 0.8s [05:27:53] in labs? [05:28:03] yes, deployment-bastion etc. [05:28:07] yeah [05:28:20] /usr/local/apache is a symlink to /data/project/apache [05:28:54] well, git is known to be very slow on shared storage [05:29:06] php hitting the files themselves shouldn't be as bad [05:29:23] assuming gluster isn't freaking out again [05:29:42] APC is probably installed and working, otherwise it would be worse [05:29:46] yep [05:29:50] but there will still be a few stat calls per request [05:30:13] hm. no gluster is acting normally right now [05:30:53] it's so, so slow :( [05:31:49] if I had to choose again, I would have went with SAS disks and less storage for the project storage rather than more storage and SATA disks [05:35:36] <3 https://github.com/saltstack/salt/issues/3181 [05:35:47] 9 hours. what a great upstream. [05:36:30] salt++ [05:36:46] have you looked at it at all yet? [05:39:00] it's very clean python [05:39:10] but a pretty young project, still [05:39:16] yep [05:39:22] i was a bit surprised they have tshirts already but no api [05:39:30] no api> [05:39:49] https://github.com/saltstack/salt-api [05:39:56] yeah [05:39:58] not out yet [05:40:08] you mean a rest api [05:40:30] there's a client api: http://docs.saltstack.org/en/latest/ref/python-api.html [05:40:47] well, http would be nice, rest or not [05:40:53] agreed [05:41:11] or document the 0mq wire protocol better, it didn't look too hard [05:41:11] A windows installer that isn't painful would be nice too [05:41:40] Damianz: oh, UtahDave told me yesterday that they just made one [05:41:54] He said he might be soon last week to me [05:41:57] but nice sphinx docs, responsive upstream, tidy code, zeromq -> +1 from me. [05:42:10] ori-l: I've been waiting patiently for the rest api [05:42:15] I need it pretty badly [05:42:24] it has keystone auth now ;) [05:42:40] well, kind of ;) [05:42:43] it does for modules [05:42:56] I want it for the external_auth framework [05:43:08] that + rest api is what I need [05:43:20] https://github.com/saltstack/salt/tree/develop/salt/auth ? [05:43:35] this is for external_auth? 
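(Context for the external_auth discussion: eauth lets a non-root user run salt commands after authenticating against a configured backend; pam ships with salt, and a keystone backend is what Ryan_Lane and Damianz are talking about here. A minimal sketch, assuming a salt version with eauth support:)

    # prompts for the user's credentials and checks them against the pam backend;
    # the master's external_auth config decides which functions that user may run
    salt -a pam '*' test.ping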
[05:43:44] That's what rest api uses [05:43:44] Damianz: <3 [05:43:50] you are awesome [05:44:37] And on that note, I'll go get ready for work :P [05:44:48] ah, this does password auth against keystone [05:45:38] I need to add token support [05:45:43] where it takes a generic token and uses it for project tokens [05:45:58] or verified project tokens [05:46:02] *verifies [05:46:41] I think the auth stuff for keystone needs some wider work - did you see the comments on that PR? [05:46:58] Doing validation of a token is pretty easier, but you can't re-use it for auth in modules sadly atm [05:47:08] right [05:47:34] however, if you want to run a runner for a project, it would be able to verify the user is in that project [05:48:03] you can reuse generic tokens [05:49:29] I wonder if they'll take a backend for staticly configured users in the config.... that would be useful for something I'm working on atm hmmm [05:50:14] staticly config'd users? [05:50:46] ah, like user/pass combos? [05:51:20] yeah, just a list in the config of user -> pass [05:51:28] right [05:53:05] yeah, I could see that being useful [05:53:42] realistically, you could implement that as an external_auth module [05:58:03] eauth would be interesting to extend - in the case of ldap using groups as well as users for acl definitions would be amazingly useful [05:58:47] well, they were telling me it should support that. I didn't really see how [05:59:23] I think you'd have to move the auth logic tbh [06:35:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:37:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.270 seconds [06:41:43] New patchset: Tim Starling; "Directory structure reorganisation for git-deploy" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42727 [06:41:43] New patchset: Tim Starling; "Fix a path name from the prior commit" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42728 [06:41:44] New patchset: Tim Starling; "Cleanup for the removal of the testwiki NFS feature" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42729 [06:50:18] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [06:50:19] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [06:50:51] New patchset: Tim Starling; "Directory structure reorganisation for git-deploy" [operations/mediawiki-config] (newdeploy) - https://gerrit.wikimedia.org/r/42730 [06:50:52] New patchset: Tim Starling; "Fix a path name from the prior commit" [operations/mediawiki-config] (newdeploy) - https://gerrit.wikimedia.org/r/42731 [06:50:52] New patchset: Tim Starling; "Cleanup for the removal of the testwiki NFS feature" [operations/mediawiki-config] (newdeploy) - https://gerrit.wikimedia.org/r/42732 [06:51:11] Change abandoned: Tim Starling; "wrong branch" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42727 [06:51:13] Change abandoned: Tim Starling; "wrong branch" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42728 [06:51:19] Change abandoned: Tim Starling; "wrong branch" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42729 [06:52:27] Change merged: Tim Starling; [operations/mediawiki-config] (newdeploy) - https://gerrit.wikimedia.org/r/42730 [06:52:32] Change merged: Tim Starling; [operations/mediawiki-config] (newdeploy) - 
https://gerrit.wikimedia.org/r/42731 [06:52:38] Change merged: Tim Starling; [operations/mediawiki-config] (newdeploy) - https://gerrit.wikimedia.org/r/42732 [06:58:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:03:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.180 seconds [07:12:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:18:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.690 seconds [07:37:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:43:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds [07:53:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:02:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.077 seconds [08:12:22] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [08:12:22] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [08:12:22] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [08:12:22] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [08:12:22] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [08:12:22] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [08:15:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:18:49] Change merged: Matthias Mullie; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32362 [08:22:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.041 seconds [08:32:09] MaxSem: are you able to restart jetty on vanadium? looks like it's still using the old schema [08:33:39] no [08:33:55] apergos or paravoid should be able to [08:34:19] okay, let's see if they are around [08:34:24] PROBLEM - Puppet freshness on ms-be11 is CRITICAL: Puppet has not run in the last 10 hours [08:34:35] but if it's still using the old schema it means that Puppet hasn't been run [08:34:51] cuz schema.xml changes trigger a jetty restart [08:35:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:35:52] but I can't chech when it has been run either [08:37:53] Nikerabbit, but it's using the new schema [08:38:04] curl 'http://vanadium:8983/solr/admin/file/?contentType=text/xml;charset=utf-8&file=schema.xml'|less [08:38:25] I see _version_ there [08:39:31] MaxSem: oh cool, then I need to figure out why it is failing :() [08:39:47] and how does it fail? [08:40:02] was https://wikitech.wikimedia.org/view/Solr#Upgrading_schema followed? [08:40:23] I need to run delete *:* to see if that helps [08:40:35] New patchset: Aude; "Harmonize bugzilla links for $wgAutoConfirmCount" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42688 [08:41:23] you can nuke the index from fenari [08:41:48] does ttm support reindexing from scratch? 
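(The "delete *:*" Nikerabbit mentions below is a standard Solr delete-by-query; a sketch against the single-core layout visible in the earlier curl, with the update endpoint path assumed:)

    # wipe the whole index, then let Translate/ttmserver reindex from scratch
    curl 'http://vanadium:8983/solr/update?commit=true' \
         -H 'Content-Type: text/xml' \
         --data-binary '<delete><query>*:*</query></delete>'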
[08:42:15] MaxSem: yes [08:42:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.970 seconds [08:54:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:55:24] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: Puppet has not run in the last 10 hours [09:00:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [09:00:22] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: Puppet has not run in the last 10 hours [09:08:53] MaxSem: yup I just forgot to delete existing content, now it works [09:11:09] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 2.44 ms [09:15:14] yeah but that's a lie, just fooling with the install [09:18:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:23:05] New patchset: ArielGlenn; "ms-be5 gets ssd disk layout" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42737 [09:24:52] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42737 [09:25:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.036 seconds [09:29:38] !log nikerabbit synchronized php-1.21wmf7/extensions/Translate/resources/js/ext.translate.special.translate.js [09:29:49] Logged the message, Master [09:34:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:38:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.526 seconds [09:38:36] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [09:44:27] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [09:44:57] New patchset: Hashar; "(bug 43630) set autoconfirm count to 10 @ fawiki, per consensus" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42232 [09:54:06] !log mlitn synchronized wmf-config/InitialiseSettings.php 'Increase AbuseFilter disable threshold for AFT entries to 20%' [09:54:17] Logged the message, Master [09:56:31] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [09:56:31] matthiasmullie, only 100% will help:P [09:56:43] :D [09:57:13] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [09:57:31] MaxSem: then some admin would disable AFT completely [09:57:46] as if it's something bad;) [09:57:49] going to deploy https://gerrit.wikimedia.org/r/#/c/42232/ a trivial conf change [09:58:00] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42232 [09:58:50] !log hashar synchronized wmf-config/InitialiseSettings.php '{{gerrit|42232}} set autoconfirm count to 10 @ fawiki' [09:58:54] New review: Hashar; "deployed on live site." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42232 [09:59:00] Logged the message, Master [09:59:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:04:15] New patchset: Faidon; "reprepro: add jenkins to conf/updates" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42741 [10:04:20] hashar: ^^^ [10:04:49] paravoid: my hero :-] [10:04:55] +1? [10:05:17] paravoid: so that is something like the uuscan / uupdate ? 
[10:05:19] btw, we also now get gpg verification of the packages [10:05:26] not exactly, no [10:05:41] # reprepro checkupdate precise-wikimedia [10:05:41] Warning: Override-Files of 'precise-wikimedia' ignored as not yet supported while updating! [10:05:44] Calculating packages to get... [10:05:47] nothing new for 'precise-wikimedia|main|source' (use --noskipold to process anyway) [10:05:50] Updates needed for 'precise-wikimedia|main|amd64': [10:05:52] 'jenkins': newly installed as '1.480.2' (from 'jenkins'): [10:05:55] files needed: pool/main/j/jenkins/jenkins_1.480.2_all.deb [10:06:12] New patchset: Aude; "Harmonize bugzilla links for $wgAutoConfirmCount" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42688 [10:06:23] New review: Aude; "rebased" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42688 [10:06:27] New review: Faidon; "Tested & works as intended." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/42741 [10:06:31] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42741 [10:06:56] New review: Hashar; "Nice! Way better than having to manually download the package and run a command :-]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42741 [10:07:20] paravoid: can you possibly update the doc at https://wikitech.wikimedia.org/view/Jenkins ? [10:07:58] I am doing that already :-) [10:10:37] argh [10:11:19] Error or unsupported message received: '103 Redirect [10:11:19] URI: http://mirrors.jenkins-ci.org/debian-stable/jenkins_1.480.2_all.deb [10:11:22] New-URI: http://ftp-chi.osuosl.org/pub/jenkins/debian-stable/jenkins_1.480.2_all.deb [10:11:41] :-] [10:12:00] reprepro bug, fixed with http://anonscm.debian.org/gitweb/?p=mirrorer/reprepro.git;a=commitdiff;h=f8d2cff19654a9ed6f545ac0321589f4fbe49224, included in 4.12.4-1 [10:12:24] precise has 4.8.2-1build1, quantal 4.12.3-1, raring 4.12.5-1 [10:12:25] dammit [10:12:30] PROBLEM - LDAP on nfs1 is CRITICAL: Connection refused [10:13:11] seems like you are going to backport a package :-] [10:13:15] PROBLEM - LDAPS on nfs1 is CRITICAL: Connection refused [10:13:23] sigh [10:15:30] Change merged: Matthias Mullie; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42402 [10:17:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.695 seconds [10:18:48] RECOVERY - LDAPS on nfs1 is OK: TCP OK - 0.001 second response time on port 636 [10:19:24] RECOVERY - LDAP on nfs1 is OK: TCP OK - 0.005 second response time on port 389 [10:23:27] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [10:25:21] oh god, brewster is lucid [10:26:42] things are pilling up quickly :-] [10:27:13] New review: Aude; "both logos are protected now on Commons" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/42717 [10:29:42] !log installing new backported reprepro version to {lucid,precise}-wikimedia and upgrading it in brewster [10:29:52] Logged the message, Master [10:30:46] !log reprepro update for jenkins (fetched 1.480.2) [10:30:56] Logged the message, Master [10:34:34] New patchset: Faidon; "reprepro: update conf/updates for new version" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42742 [10:34:47] New review: Faidon; "Tested" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/42742 [10:34:57] Change merged: Faidon; [operations/puppet] (production) - 
https://gerrit.wikimedia.org/r/42742 [10:36:17] hashar: do you need anything else from me? [10:36:26] paravoid: so going to upgrade Jenkins [10:36:28] hashar: I think you can do apt-get upgrade yourself whenever you find it convienient, right? [10:36:39] paravoid: then will need you to update an API key in the private repo [10:36:54] but that's unrelated to all this, right? [10:37:15] !log shutdowning jenkins for an unscheduled maintenance and upgrading the installation. [10:37:19] paravoid: yeah that is unrelated. [10:37:24] k [10:37:25] Logged the message, Master [10:37:36] upgrading [10:37:53] !log stopping Zuul [10:38:03] Logged the message, Master [10:39:09] now I have to wait for Jenkins to finish loading :/ [10:42:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:46:39] New patchset: Hashar; "(bug 43729) create /mnt/srv and /srv on beta mw installs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42743 [10:47:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.995 seconds [10:50:55] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [10:58:42] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.058 second response time [10:59:43] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [11:06:11] RECOVERY - NTP on ms-be5 is OK: NTP OK: Offset -0.08418869972 secs [11:10:58] RECOVERY - swift-container-replicator on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [11:11:08] RECOVERY - swift-account-reaper on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [11:11:08] RECOVERY - swift-object-replicator on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [11:11:26] RECOVERY - swift-container-server on ms-be5 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [11:11:26] RECOVERY - swift-object-server on ms-be5 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [11:11:53] RECOVERY - swift-object-updater on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [11:12:01] RECOVERY - swift-account-server on ms-be5 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [11:12:11] RECOVERY - swift-container-updater on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [11:12:19] RECOVERY - swift-account-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [11:12:19] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:12:19] RECOVERY - swift-object-auditor on ms-be5 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [11:13:13] RECOVERY - swift-account-replicator on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [11:15:01] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [11:19:01] New patchset: ArielGlenn; "ms-be5 moved to stanza for 720xd with ssds" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42745 [11:19:41] Change merged: ArielGlenn; 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/42745 [11:25:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:28:59] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours [11:39:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.043 seconds [11:42:56] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [11:46:04] PROBLEM - Puppet freshness on db1036 is CRITICAL: Puppet has not run in the last 10 hours [11:46:59] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [11:57:10] PROBLEM - swift-account-reaper on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [12:00:01] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [12:11:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:23:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.053 seconds [12:34:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:36:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.051 seconds [12:43:32] re [12:46:05] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [12:52:51] !log cleared out /var/lg/messages.[1-x] on older ms-be hosts, several had full root partitions; disabled logging to that facility in rsyslogd conf [12:53:02] Logged the message, Master [12:57:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:09:37] RECOVERY - Puppet freshness on ms-be9 is OK: puppet ran at Tue Jan 8 13:09:18 UTC 2013 [13:12:10] RECOVERY - Puppet freshness on ms-be11 is OK: puppet ran at Tue Jan 8 13:11:57 UTC 2013 [13:12:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.055 seconds [13:18:10] New patchset: Ori.livneh; "(bug 43565) Update Wikivoyage logo and favicon" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42717 [13:30:37] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 198 seconds [13:30:47] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 196 seconds [13:31:23] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 223 seconds [13:31:31] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 220 seconds [13:33:34] New review: Ori.livneh; "favicon generated w/imagemagick's convert tool." 
[operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42717 [13:38:19] !log stopped and/or shot all swift processes on ms-be2, some were in an odd state, had load spike [13:38:29] !Log and restarted them [13:38:31] Logged the message, Master [13:38:31] grr [13:38:35] !log and restarted them [13:38:45] Logged the message, Master [13:38:54] apergos: I'm deleting some files from swift [13:39:16] lots of DELETEs [13:39:25] saw some of those go by [13:39:34] but some of the processes refused to die, [13:39:41] so something was off [13:39:48] other hosts look reasonable [13:40:30] I'm about to add to that load, going to give some weight back to ms5 which means more data moving around [13:43:34] looks like it had run out of disk space on / earlier [13:43:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:07] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 1 seconds [13:50:34] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 1 seconds [13:50:46] New review: Faidon; "Is there a way of making the script less noisy by default -or, alternatively, have it write errors t..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42600 [13:50:52] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [13:51:37] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [13:55:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.186 seconds [14:07:36] Change abandoned: Demon; "Abandoning rather than keeping this on everyone's review list when we have no intention on merging i..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/41978 [14:12:38] <^demon> Gerrit's going to be unavailable for just a short bit--rebooting manganese. [14:13:00] <^demon> !log rebooting manganese [14:13:10] Logged the message, Master [14:13:34] nice [14:13:39] will find out how Zuul catch up [14:15:26] <^demon> Gerrit's back up. [14:17:14] <^demon> !log manganese is back up, gerrit's running fine. [14:17:24] Logged the message, Master [14:17:26] <^demon> Caches may be cold for a bit, but nothing awful. [14:18:35] <^demon> hashar: Oh btw, two repositories are causing problems for replication to gallium. mediawiki/core and mediawiki/extensions/Wikibase. Could you run a git gc on both of those repositories in /var/lib/git? [14:18:47] sure [14:19:09] ^demon: can we potentially get a mail report whenever that fails ? [14:19:31] <^demon> Eh, it's in gerrit's error log. [14:19:43] <^demon> We could maybe hack up something to keep an eye on the error log to spot replication failures. [14:19:49] log rotate could post process it with a shell / perl script maybe [14:19:57] ops are probably aware of such a tool [14:20:00] <^demon> [2013-01-08 14:12:53,522] ERROR com.google.gerrit.server.git.PushReplication : Cannot replicate to gerritslave@gallium.wikimedia.org:/var/lib/git/mediawiki/extensions/Wikibase.git [14:20:15] <^demon> It's always in the form of ERROR com.google.gerrit.server.git.PushReplication : Cannot replicate to [14:20:41] warning: packfile ./objects/pack/pack-bb1cc28d27ec761d4836f0ae80dfb78f0e7ffb1e.pack cannot be accessed [14:20:41] error: refs/changes/98/34298/3 does not point to a valid object! 
[14:20:42] grmbmblbl [14:20:52] then after such spam : fatal: failed to read object acad0e7eed3ac84a5682ab4699afc013abd2931a: Too many open files [14:21:12] in /var/lib/git/mediawiki/extensions/Wikibase.git [14:21:35] <^demon> Bah, that too many open files is what I'm seeing on my end. [14:21:39] <^demon> I'm wondering what's happening. [14:22:00] on mw core I got warning: packfile ./objects/pack/pack-78ebf8c684d7019981d18c177b4dcb73e9a256dc.pack cannot be accessed [14:22:09] the file does exist [14:22:10] though it is r--r--r-- [14:22:27] !log rebooting solr1 to fix HT in bios [14:22:37] Logged the message, Master [14:23:03] [pid 4800] access("./objects/pack/pack-e37bf96475063946122db2fd20f6f038e27f0cb9.keep", F_OK) = -1 ENOENT (No such file or directory) [14:23:04] oh [14:24:27] ah it complains cause the file DOES exist [14:25:01] PROBLEM - Host solr1 is DOWN: PING CRITICAL - Packet loss = 100% [14:26:12] !log gallium: running git fsck in /var/lib/git/mediawiki/extensions/Wikibase.git . That repository might have ended up being corrupted somehow :( [14:26:22] Logged the message, Master [14:26:22] ^demon: ^^^ [14:26:42] <^demon> Bah. Lemme check out master. [14:27:28] would git repack fix something ? [14:27:46] <^demon> Packing should be independent of clones--you could try a repack on that side. [14:27:55] I have no idea how the gerrit replication stuff work though [14:28:47] <^demon> it's just pushing. [14:28:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:30:06] RECOVERY - Host solr1 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [14:32:48] <^demon> git gc seems to have cleaned up wikibase on manganese. [14:33:00] not on gallium :/ [14:34:24] ^demon: I am afraid the pack files got corrupted by the server [14:34:32] either a hardware fault / driver fault / raid issue [14:34:54] <^demon> I doubt that, could be a dozen other things. Try this. `git repack -a -d -f --depth=250 --window=250` [14:35:10] running [14:35:10] <^demon> Worst case, we'll just clone it fresh. [14:35:32] Delta compression using up to 8 threads. [14:35:33] \O/ [14:35:55] ahh your repack stuff worked [14:36:12] only 2 .pack files in .git/objects/pack [14:36:44] that is for mw/ext/Wikibase.git [14:37:00] ^demon: mw/core.git has some git process running right now on gallium [14:37:30] !log rebooting solr2 to fix HT in bios [14:37:39] Logged the message, Master [14:39:10] <^demon> I manually kicked off a replication of Wikibase. [14:39:43] PROBLEM - Host solr2 is DOWN: PING CRITICAL - Packet loss = 100% [14:39:47] k [14:39:56] Coke time + cig brb [14:41:19] <^demon> When you get back, run the same repack on core. [14:44:03] RECOVERY - Host solr2 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [14:45:38] ^demon: doing so [14:45:41] <^demon> The repack seems to have fixed the replication of wikibase. [14:45:44] <^demon> Just replicated fine. 
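(Recap of the recovery sequence hashar and ^demon used on gallium above, run inside the bare replica; all three commands are stock git:)

    cd /var/lib/git/mediawiki/extensions/Wikibase.git
    git fsck                                      # surface corrupt or unreachable objects
    git repack -a -d -f --depth=250 --window=250  # rewrite all pack files from scratch
    git count-objects -v                          # sanity check: loose objects and pack count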
[14:46:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.049 seconds [14:47:07] <^demon> Core repack might take a bit :) [14:48:34] Change abandoned: Hashar; "would let the file owner fix that one day :-)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39407 [14:48:46] Change abandoned: Hashar; "would let the files owner fix that one day :-)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39040 [14:48:57] Change abandoned: Hashar; "would let the files owner fix that one day :-)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39039 [14:50:26] Counting objects: 149355 [14:50:27] ... [14:50:47] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42743 [14:51:17] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42743 [14:51:30] stupid Zuul lost its connection [14:51:48] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42743 [14:52:49] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42743 [14:53:05] !log gallium: restarted Zuul which could not recover its Gerrit connection :/ [14:53:15] Logged the message, Master [14:53:59] <^demon> hashar: Have we filed a bug to have zuul handle that better? [14:55:20] <^demon> Also, did core finish repacking? [14:55:40] Counting objects: 483768 [14:55:54] hi yall [14:56:07] ^demon: I am filling a bug for Zuul [14:56:10] anyone awake know about the arbcom-l http auth thing? [14:56:11] there are instructions here: [14:56:11] https://wikitech.wikimedia.org/view/Lists.wikimedia.org#Alter_arbcom-l_archive_access_list [14:56:13] and I have a very small question [14:56:48] RECOVERY - Puppet freshness on ms-be12 is OK: puppet ran at Tue Jan 8 14:56:35 UTC 2013 [15:04:01] ^demon: Compressing objects: 84% (443942/524966) [15:11:28] ^demon: git repack completed for mw/core [15:12:08] !log updating Zuul code base : 6c4d51a...23ec1ba (forced update) [15:12:18] Logged the message, Master [15:13:24] <^demon> hashar: Yay, replication completed for core, no errors :) [15:13:31] should I git gc again ? [15:13:44] <^demon> Shouldn't need to, but you can. [15:14:09] <^demon> The repack did most of what a gc would accomplish. [15:14:52] yeah will skip that [15:15:00] + there is a git process running right now (there is a lock file) [15:15:23] <^demon> There's lots of git processes running all the time :) [15:15:55] <^demon> Usually they're very brief--replication keeps up nicely so pushes are small. [15:16:07] <^demon> Only when you've got weird failures--want those fixed quickly. [15:16:25] <^demon> Ok, that's all that was broken for gallium. Thanks hashar :) [15:17:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:21:06] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [15:24:51] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [15:31:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.081 seconds [15:35:45] New review: Dereckson; "According http://www.ie6countdown.com/ this is an acceptable solution." 
[operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/42717 [15:49:28] New patchset: Hashar; "(bug 43729) create /mnt/srv and /srv on beta mw installs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42743 [15:50:11] New review: Hashar; "PS2: move the $::realm == wmflabs exception from the submodule to the role::applicationserver::commo..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42743 [15:51:07] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.002 second response time on port 11000 [15:58:45] New patchset: ArielGlenn; "ssh key for dab" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42773 [15:59:40] !log upgraded Zuul [15:59:50] Logged the message, Master [16:05:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:06:51] New patchset: Dereckson; "(bug 43659) Import sources for eo.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42774 [16:07:38] New review: Dereckson; "shellpolicy" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/42774 [16:13:07] New review: Umherirrender; "Thanks for this one." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/42538 [16:15:01] New patchset: Ottomata; "Removing /a from backup list." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42776 [16:15:16] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42776 [16:16:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [16:17:24] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42773 [16:33:49] New patchset: Raimond Spekking; "Set WMF default of $wgUnwatchedPageThreshold" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42779 [16:40:55] New patchset: Matthias Mullie; "AFT test group permissions have been removed already; these lines no longer make sense" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42538 [16:44:10] New patchset: Ottomata; "Removing stats user from analytics nodes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42782 [16:44:16] New review: Umherirrender; "FYI: It is better to avoid rebase and changes in one Patch Set" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/42538 [16:45:01] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42782 [16:45:21] AaronSchulz: around? [16:50:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:51:29] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [16:51:29] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [17:03:24] New patchset: Silke Meyer; "Puppet files to install Wikidata repo / client on different labs instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42786 [17:04:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.088 seconds [17:09:21] New review: Silke Meyer; "Okay, I have adapted the files a lot to our needs. The mysql database name is now customizable in ro..." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/42786 [17:25:15] PROBLEM - SSH on pdf3 is CRITICAL: Server answer: [17:26:54] RECOVERY - SSH on pdf3 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [17:29:42] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42717 [17:30:42] New review: CSteipp; "For the image directories, we could also add an alias in alias on beta." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/38307 [17:37:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:58] !log olivneh synchronized wmf-config/InitialiseSettings.php 'New logo for wikivoyage & eswikivoyage' [17:38:08] Logged the message, Master [17:38:51] !log olivneh synchronized docroot/wikivoyage.org/favicon.ico 'New favicon.ico for wikivoyage' [17:39:01] Logged the message, Master [17:51:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.066 seconds [18:13:32] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [18:13:33] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [18:13:33] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [18:13:33] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [18:13:33] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [18:13:33] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [18:23:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:54] heads up: the mobile team will deploy today and might need your help with Varnish around 3 PM [18:25:18] it's way past 3 pm [18:27:04] PST;) [18:34:29] New patchset: Dzahn; "sudo for strace/tcpdump for demon in role appservers (RT-4066)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42791 [18:37:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.963 seconds [18:51:26] !log reedy synchronized php-1.21wmf7/extensions/TimedMediaHandler [18:51:36] Logged the message, Master [19:01:43] New review: MZMcBride; "Looks good to me as an initial (default) value." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42779 [19:02:09] New patchset: Andrew Bogott; "Rotate all gluster logfiles, and only keep three weeks' worth." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42796 [19:09:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:23] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42779 [19:16:24] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42538 [19:16:39] Reedy: Are you gonna scap? [19:17:03] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42497 [19:18:06] New patchset: Reedy; "Disable TorBlock on private wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42389 [19:18:33] Not scap [19:18:41] But I will sync the the config files in a few minutes [19:19:05] Sweet. 
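(A sketch of the config-only push Reedy describes below — syncing wmf-config without a full scap — assuming the sync-dir/sync-file wrapper scripts in use on the deployment host at the time; the log messages are illustrative:)

    # from the deployment host, push just the changed configuration
    sync-dir wmf-config/ 'Config updates: 42779, 42538, 42497'
    sync-file wmf-config/InitialiseSettings.php 'New logo for wikivoyage'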
[19:19:45] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42368 [19:19:53] This sync will make watcher largely deprecated. [19:20:03] I've been working on this for about a year. [19:20:05] \o/ [19:20:19] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42217 [19:20:31] New patchset: Reedy; "(bug 29692) Per-wiki namespace aliases shouldn't override (remove) global ones" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25737 [19:20:52] anomie: howdy [19:21:00] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42688 [19:21:01] hi Ryan_Lane [19:21:14] read your email [19:21:20] lol [19:21:28] it's long :) [19:21:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.190 seconds [19:22:27] I didn't know how much things were set up for the new system yet, so I wanted to cover everything that seemed to be needed for the l10n [19:22:33] !log reedy synchronized wmf-config/ [19:22:45] Logged the message, Master [19:23:01] * Ryan_Lane nods [19:23:09] I was about to push in a change [19:23:26] for what I'm calling "dependent repos" [19:23:39] when you git deploy one repo, it'll git-deploy all of its dependent repos first [19:23:47] New review: Hashar; "commit summary claims 3 weeks are kept but then the logrotate configuration has a daily rotation and..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/42796 [19:23:58] it doesn't just do a git-deploy for the repo, though [19:24:03] it really just runs a script [19:24:22] that script would git deploy start, , git deploy sync [19:24:40] heh. this is in bash [19:24:47] anomie: do you know python? :) [19:25:15] Ryan_Lane- Haven't had opportunity to learn it, although I've muddled through one or two things. I do know Perl pretty well though ;) [19:25:20] I think the l10n stuff needs to match what's in the repo that's being deployed [19:25:26] no perl, please no perl :) [19:25:54] no worries. we can use the bash stuff and over time convert it to python [19:26:09] Yeah, we need a l10n branch/repo/whatever for each slot [19:26:12] right [19:26:16] well, no [19:26:25] can you ssh into tin? [19:26:46] ok, there [19:27:03] <^demon> I should rewrite the gerrit hooks in perl. [19:27:07] <^demon> Just to annoy Ryan ;-) [19:27:16] ^demon: I would fucking stab you :) [19:27:57] anomie: so, go to /srv/deployment/mediawiki [19:28:10] I've added repos for l10n-slot0 and l10n-slot1 [19:28:15] that said, they are empty [19:28:25] they are only meant to hold the localization stuff. not mediawiki [19:28:36] notice that there is also slot0 and slot1 [19:28:47] hm [19:28:50] wait... [19:29:00] how does the current stuff work? does it always update against head? [19:29:08] or does it update against what's deployed? [19:29:23] s/head/master/ [19:29:44] Does anyone feel like adding uploads6/upload7 and thumbs/thumbs2 on hume please? [19:30:42] The current setup has [[mw:Extension:LocalisationUpdate]] installed, which is basically a setup to have these cdb files contain translations from master even though the actual deployed code is older. [19:32:08] What LocalisationUpdate does is basically clone master into a separate location and then compare all the message files to figure out just which translations to take from master and which to ignore because they might not match the deployed code. [19:32:22] oh [19:32:27] wow. 
that's absurdly simple, then [19:32:37] what you have is likely good, then [19:32:49] I was thinking it had to be pulled from what's deployed [19:34:35] It compares what's deployed to master. Anything in English that's changed gets ignored for all other languages from master (although if 123456 is deployed, then you run LocalisationUpdate on 234567, then in 345678 the English message gets changed, it will keep the non-English messages from 234567 rather than suddenly going back to 123456). [19:34:58] ^demon: Wouldn't re-writing them in Ruby be a better option? [19:35:16] Any English message that's the same between deployed and master, it will pull the corresponding non-English messages from master for the CDB files. [19:35:24] <^demon> Reedy: That would require me learning more ruby :p [19:37:25] So that's basically what the script tries to do: it pulls master, then runs extensions/LocalisationUpdate/update.php and maintenance/rebuildLocalisationCache.php for each slot to build the CDB files that go in the l10n-slot. [19:38:05] hm [19:38:11] ok. so, it pulls master into a different repo [19:38:18] then it checks that repo against the deployed repo? [19:38:23] Yes [19:38:32] let's assume we want to deploy slot0 [19:38:56] (on fenari, that master repo happens to be in /var/lib/l10nupdate/mediawiki) [19:39:06] let's assume on tin :) [19:39:16] we'd be in /srv/deployment/mediawiki/slot0 [19:39:22] and do: git deploy start [19:39:43] then we'd make whatever changes we're going to make [19:39:50] and would do: git deploy sync [19:39:55] that runs a sync script for slot0 [19:40:07] the sync script will pull its configuration down from salt [19:40:36] it'll then see which dependent repos it has [19:40:45] it'll run a script for each dependent repo [19:40:59] in this case, slot0 only has l10n as a dependency [19:41:07] so, it'll run the l10n script [19:41:46] the l10n script will update its master core repo (and all extensions, I guess) [19:41:51] yes [19:41:52] it'll compare it against slot0 [19:42:05] then, it'll do the following: [19:42:12] binasher: I think the connection bug was in MW [19:42:25] git deploy start (in /srv/deployment/mediawiki/l10n-slot0) [19:42:39] it'll add the updates into the repo and commit them [19:42:48] then it'll run: git deploy sync [19:43:23] AaronSchulz: what was it? [19:43:29] at that point, it returns a status back to the slot0 sync script [19:43:39] then slot0 will tell minions to fetch [19:43:42] then to checkout [19:43:45] <^demon> AaronSchulz: I added a minor feature to ExtDist today--lets us define mw-ns messages to describe the branches. [19:44:02] anomie: sane? [19:44:33] it may be necessary to have a core checkout for every slot [19:44:52] it's possible that someone will do a deployment on slot0 and slot1 at the same time [19:44:57] Right [19:45:01] though people should honestly know better [19:45:04] The core checkout is ro (read-only) though [19:45:06] Ryan_Lane- Sounds reasonable to me. The l10n script you mention could look roughly similar to the l10nupdate script, except it would run just on the one slot.
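A minimal sketch in Python of the comparison rule anomie describes above, purely for illustration: the real logic lives in the PHP LocalisationUpdate extension (extensions/LocalisationUpdate/update.php) and maintenance/rebuildLocalisationCache.php, and the function name and data shapes here are assumptions. The "keep previously pulled translations instead of reverting" refinement is left out.

```python
def merge_l10n(deployed_msgs, master_msgs):
    """Pick the translations that are safe to take from master.

    Both arguments are dicts shaped like:
        {'en': {'msg-key': 'English text', ...},
         'de': {'msg-key': 'German text', ...}, ...}
    """
    deployed_en = deployed_msgs.get('en', {})
    master_en = master_msgs.get('en', {})
    updated = {}
    for lang, messages in master_msgs.items():
        if lang == 'en':
            continue  # English always comes from the deployed branch
        for key, text in messages.items():
            # Only pull a translation from master when the English source
            # message is identical in the deployed code and in master; if
            # English changed, the newer translation may describe behaviour
            # the deployed code does not have yet, so it is skipped.
            if key in master_en and master_en.get(key) == deployed_en.get(key):
                updated.setdefault(lang, {})[key] = text
    return updated
```

The result of that merge is what would get committed into the per-slot l10n repo and rebuilt into the CDB files mentioned above.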
[19:45:16] RoanKattouw: true [19:45:31] anomie: btw, you can pull configuration from salt [19:45:36] As long as git won't blow up if two processes "git pull" on the core repo at the same time, it can be shared [19:45:41] using: sudo salt-call pillar.data --out json [19:45:46] Also, funnily enough, the slot0 and slot1 checkouts are themselves core.git checkouts, just of non-master branches [19:45:59] RoanKattouw: indeed [19:46:01] anomie: I'd imagine one of them fails with a failure to acquire a lock [19:46:48] Ryan_Lane: Also I would encourage you to make l10n-slot0 and l10n-slot1 checkouts of different branches of the same repo [19:46:55] With slotN building on top of slotN-1 [19:47:07] what's it matter? they are local repos [19:47:33] Better retention of history and presumably better delta compression [19:47:36] ah [19:47:45] RoanKattouw: want to take a stab at that? :) [19:48:04] Making l10n one repo? Shouldn't be too hard, right? [19:48:04] you guys know what's needed here more than me [19:48:19] I'm working on dependent repo support right now [19:48:22] about to push it in [19:48:48] What is the mechanism for ensuring a dependent repo is deployed at the right time? It's just deployed immediately before? [19:48:49] * anomie tries "sudo salt-call pillar.data -out json", but doesn't remember his password for sudo. If he even has one. [19:49:02] You can't sudo on tin [19:51:50] Ryan_Lane: Basically I'm proposing that the l10n repo work the same way as the wmf/* branches in the core repo [19:52:17] RoanKattouw: so, my way of handling this is.... [19:52:20] <^demon> It's still gonna be on its own repo, right? [19:52:31] * ^demon freaked out at seeing "like wmf/* branches" [19:52:35] Having just gotten back from vacation I don't exactly have much time to work on it, but I can talk to you, Brad or Chad [19:52:37] ^demon: Yes [19:52:39] * Ryan_Lane nods [19:52:43] I just wanted it to be analogous [19:52:46] <^demon> Ok, ok. [19:52:57] so, for dependent repos... [19:52:58] I don't want wmf5's l10n to be a separate repo from wmf4's l10n [19:53:05] Instead, wmf5's l10n should continue where wmf4's left off [19:53:13] <^demon> (mw/core is already huge, I didn't want to start stashing cdb objects in it ;-)) [19:53:18] when you do a git deploy sync on slot0, it would run scripts for its dependent repos [19:53:22] such that all l10n is in one repo [19:53:35] they would do a git deploy, but wouldn't actually tell the minions to fetch/checkout [19:53:45] Right [19:53:50] it would do nothing, and return to the sync for slot0 [19:53:58] slot0 would tell its minions to fetch/checkout [19:54:07] So it fetches the l10n data onto the minions but not actually check it out yet? [19:54:13] *but doesn't actually [19:54:27] in the fetch state on the minion, it will fetch all dependent repos and then the repo itself [19:54:38] in the checkout stage, it will checkout all dependent repos, then the repo itself [19:54:43] OK [19:54:44] that way all of the data is already downloaded [19:54:55] Right, that makes sense [19:54:57] and it all switches at the same time [19:55:08] So the end result is the same as with submodules, except that the checkout order is reversed, right?
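A minimal sketch in Python of the minion-side ordering Ryan lays out above: fetch every dependent repo before the repo itself, and likewise check the dependents out first so the l10n data and the code flip over together. The helper names and the hard-coded dependency map are hypothetical; in the real setup the dependency configuration would be pulled down from salt (e.g. as pillar data) rather than embedded in the script.

```python
# Hypothetical dependency map; the real mapping would come from salt pillar data.
DEPENDENT_REPOS = {
    'mediawiki/slot0': ['mediawiki/l10n-slot0'],
    'mediawiki/slot1': ['mediawiki/l10n-slot1'],
}


def fetch_repo(repo):
    # Placeholder for running "git fetch" inside /srv/deployment/<repo>.
    print('fetching /srv/deployment/%s' % repo)


def checkout_repo(repo):
    # Placeholder for checking out whatever tag "git deploy sync" wrote.
    print('checking out /srv/deployment/%s' % repo)


def minion_fetch(repo):
    # Fetch stage: download the dependent repos first, then the repo itself,
    # so all of the data is on the minion before anything is switched.
    for dep in DEPENDENT_REPOS.get(repo, []):
        fetch_repo(dep)
    fetch_repo(repo)


def minion_checkout(repo):
    # Checkout stage: switch the dependents first, then the repo itself,
    # so the l10n CDBs and the code change over at (nearly) the same time.
    for dep in DEPENDENT_REPOS.get(repo, []):
        checkout_repo(dep)
    checkout_repo(repo)


if __name__ == '__main__':
    minion_fetch('mediawiki/slot0')
    minion_checkout('mediawiki/slot0')
```

As the exchange above notes, the end result resembles submodules except that the dependents are checked out before the parent repo rather than after it.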
[19:55:20] yes [19:55:47] Alright, that makes perfect sense [19:56:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:46] cool, cause that's what I was just finishing up :) [20:09:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.465 seconds [20:15:46] New patchset: awjrichards; "Enable wgMFForceSecureLogin on testwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42803 [20:18:33] New patchset: Hashar; "(bug 43729) create /mnt/srv and /srv on beta mw installs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42743 [20:20:39] <^demon> RoanKattouw: I don't think you need https://gerrit.wikimedia.org/r/#/c/2141/ anymore, but can you check? [20:20:42] New review: Hashar; "PS3: move the file{} snippets directly in the role class instead of a new manifests of mediawiki_new..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42743 [20:22:31] Change abandoned: Catrope; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2141 [20:22:41] ^demon: Abandoned, it's an old piece of junk [20:22:50] <^demon> I thought so :) [20:22:51] Hi Roan. [20:23:34] Hi Susan [20:23:53] I see you've taken a new name in the 3 weeks I've been gone :) [20:25:59] :D [20:26:08] I've missed you. You're back now? [20:26:33] Yeah [20:26:41] I was on vacation Dec 14ish till yesterday [20:26:51] And I'm in school on Mon and Wed now [20:27:10] Cool, cool. Did you go anywhere for vacation? [20:28:00] Holland, visiting family for the holidays [20:44:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:56:56] okay, my deployment [20:58:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [21:08:22] New patchset: MaxSem; "Enable wgMFForceSecureLogin on testwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42803 [21:08:57] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42803 [21:15:51] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [21:29:48] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours [21:30:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:32:17] !log reedy synchronized php-1.21wmf7/extensions/TimedMediaHandler [21:32:27] Logged the message, Master [21:42:22] Change abandoned: Dzahn; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37153 [21:45:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.040 seconds [21:45:49] New patchset: Dzahn; "$wgCategoryCollation = 'identity' on iswiktionary for bug 30722" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42856 [21:46:54] PROBLEM - Puppet freshness on db1036 is CRITICAL: Puppet has not run in the last 10 hours [21:49:15] New review: Dzahn; "bug 30722" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/42856 [21:49:16] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42856 [21:53:59] !log dzahn synchronized ./wmf-config/InitialiseSettings.php [21:54:06] Logged the message, Master [21:55:03] New patchset: Ryan Lane; "Add support for dependent repos" [operations/puppet] 
(production) - https://gerrit.wikimedia.org/r/42859 [21:57:16] PROBLEM - Host ms-be1010 is DOWN: PING CRITICAL - Packet loss = 100% [21:59:31] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error BIGINT UNSIGNED value is out of range in (enwiki.article_f [22:00:51] New review: Hashar; "note my deployment role for beta which introduce "common" classes https://gerrit.wikimedia.org/r/#/c..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42859 [22:00:51] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [22:00:51] PROBLEM - MySQL Slave Running on db59 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error BIGINT UNSIGNED value is out of range in (enwiki.article_f [22:00:52] PROBLEM - MySQL Slave Running on db1043 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error BIGINT UNSIGNED value is out of range in (enwiki.article_f [22:01:12] hahahaha^^^ [22:01:35] AFT overflown past BIGINT [22:11:32] hashar: which instance should be the deployment host? [22:11:35] bastion? [22:11:40] deployment-bastion [22:11:42] ok [22:12:17] that is where people connect too, kind of like tin + fenari + hume + whatever [22:12:21] indeed [22:13:12] I got a lame change to get /mnt/srv created on instances and /srv as a symlink to it [22:13:31] would let us match the production file system [22:13:35] yep [22:13:45] and bypass GlusterFS in favor of a local disk ( /dev/vdb ) [22:13:53] feel free to amend the change / merge it whenever you feel [22:13:53] yep [22:14:09] well, we don't want it on a shared filesystem for a bunch of reasons [22:14:22] what else is on /mnt on these systems? [22:14:27] nothing [22:14:28] I checked [22:14:52] the instances are bootstrapped using puppet classes and or use /home/wikipedia which are symlinks to /data/project [22:14:58] we could probably just change the mountpoint to /src [22:14:59] bbl hangout with Rob [22:15:00] err [22:15:02] /srv [22:15:11] oh /srv is unused [22:15:30] yeah, so the disk on /mnt just becomes /srv [22:15:47] by default it's /mnt, no reason to keep it that way [22:16:00] ahh if you can mount /dev/vdb on /srv it would be nicer :-) [22:16:28] !log tstarling synchronized php-1.21wmf7/includes/AutoLoader.php [22:16:39] Logged the message, Master [22:16:46] !log tstarling synchronized php-1.21wmf7/includes/ArrayUtils.php [22:16:59] Logged the message, Master [22:17:03] !log tstarling synchronized php-1.21wmf7/includes/objectcache/RedisBagOStuff.php [22:17:04] hashar: actually, it can be mounted in both spots [22:17:14] Logged the message, Master [22:17:24] !log tstarling synchronized php-1.21wmf7/includes/objectcache/SqlBagOStuff.php [22:17:32] binasher: there you go [22:17:34] Logged the message, Master [22:17:56] hashar: use the mount directive [22:18:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:18:29] is tin puppetized at all? [22:18:36] almost completely [22:18:45] I need to add in the sudo config [22:19:35] Your branch is behind 'origin/production' by 747 commits, and can be fast-forwarded. [22:19:39] I'm going to set up deployment-prep soon, so we'll see if I missed anything [22:19:56] ok, tin is there now ;) [22:22:29] TimStarling: thanks! so the current 'server' array becomes a 'servers' array of arrays? does dbname, user, password, type need to be defined for each individually? 
[22:22:51] TimStarling: is it just me or is RedisBagOStuff:debug not defined? [22:22:52] yes and yes [22:23:29] just you [22:23:37] it is in the parent [22:23:45] ah, ok [22:24:13] right, I was just going crazy for a second :) [22:24:41] binasher: so you could do something like... [22:25:11] $template = array( 'type' => 'mysql', 'user' => $wgDBuser, 'password' => $wgDBpassword ); [22:25:38] foreach ( array( 'pc1001', 'pc1002', 'pc1003' ) as $host ) { [22:25:59] $servers[] = array( 'host' => $host ) + $template; [22:25:59] } [22:27:10] that looks much nicer than what i was contemplating [22:29:03] Change abandoned: Dzahn; "per CT" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28741 [22:29:09] New patchset: Brion VIBBER; "MobileFrontend/Zero setup for XL Axiata in Indonensia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42867 [22:32:30] binasher: fixing MW seems to make those connection errors go away [22:32:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.552 seconds [22:33:11] AaronSchulz: what was the problem in mw? [22:34:33] the queue objects were not singletons nor did they do any other connection sharing, so runJobs.php kept making new ones (so the connection cache per object was useless) [22:34:48] I whipped up a quick JobQueueRedisPool and that problem is gone now [22:35:05] ah [22:35:31] binasher: yeah, silly :) [22:35:39] so not at all a problem with how RedisBagOStuff uses php-redis [22:36:00] and glad it's not a php-redis bug! [22:36:02] no, since we basically singleton the bagostuff in one global [22:38:59] Ryan_Lane: I see you have your own personal indenting style for manifest files [22:39:15] no. I've switched to the upstream styling [22:39:35] shall I write some vim scripts to automatically detect the author and switch indent method accordingly? [22:39:42] sorry about that [22:40:04] I think we should switch all of our manifests to the upstream style [22:40:17] no matter what we'll have this problem with external modules [22:40:50] hm. maybe we can put tabstop info into the manifests? [22:41:07] I've started switching all of my python to pep8 standards as well [22:41:19] once everything is working properly, I'm going to fully pep8 the python [22:42:03] Ryan_Lane: IIRC superm401 recently amended the code guidelines to permit spaces in Python code [22:42:16] tabs you mean? [22:43:01] I'll be honest, I prefer spaces in python. it allows you to align things in a saner way [22:43:13] PEP8 suggests 4-space indents for all new code and tabs for existing codebases, but in my experience tab-indents are so fantastically uncommon in the Python world that many people (myself included) specify it in their vimrc, etc. [22:43:22] scapping [22:43:49] ori-l: Ryan_Lane and you can ask pep8 to ignore the tab/whitespaces errors [22:44:13] hashar: pep8 won't complain unless you're inconsistent, in which case it's right to complain :) [22:44:20] ye[ [22:44:21] *yep [22:44:42] ah even better [22:44:45] I know I'm probably going to drive people insane since I'm being inconsistent with how we normally do things [22:44:46] Ryan_Lane: our python code is mostly glue to integrate with various existing Python projects, and the bulk of them use 4-space indents [22:44:54] exactly [22:45:02] and that's why I'm switching [22:45:10] to 4 spaces ? [22:45:12] yes [22:45:28] could you possibly convince mark ?
:) [22:45:33] viva la revolucion [22:45:42] New patchset: Andrew Bogott; "Rotate all gluster logfiles, and only keep three weeks' worth." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42796 [22:45:44] next step puppet-lint!!! ;-] [22:46:08] this is using salt. should I use salt style indents for modules, returners, runners, etc, and use tabs otherwise? no matter what I'm being inconsistent [22:46:31] so, yeah, spaces. [22:46:54] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [22:47:11] New patchset: Tim Starling; "Split scap scripts from other scripts useful on deployment hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42871 [22:47:12] New review: Hashar; "Following our conversation on IRC I guess you tested / asked whether -HUP works :-]" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/42796 [22:47:14] i declare today day 0 of year 0 of our new calendar [22:47:44] (which will be decimal, naturally.) [22:47:50] the day we decided to migrate to python ? [22:48:16] New review: Hashar; "Note: that is for https://bugzilla.wikimedia.org/show_bug.cgi?id=41104" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/42796 [22:48:20] naw, tolerate 4-space indents in python code [22:49:32] !log maxsem Started syncing Wikimedia installation... : Weekly mobile deployment [22:49:43] Logged the message, Master [22:49:46] err, aborted [22:50:10] ok moving out now :) [22:50:32] can someone fix permissions on /home/wikipedia/common/wmf-config/ExtensionMessages-1.21wmf7.php please? [22:51:02] MaxSem: Looks like it's 664 to me [22:51:07] don't we have set-group-write ? [22:51:18] And yes, we do [22:51:21] PHP Warning: file_put_contents(/home/wikipedia/common/wmf-config/ExtensionMessages-1.21wmf7.php): failed to open stream: Permission denied in /home/wikipedia/common/php-1.21wmf7/maintenance/mergeMessageFileList.php on line 119 [22:51:32] -rw-rw-r-- 1 reedy wikidev 20592 Dec 27 22:53 /home/wikipedia/common/wmf-config/ExtensionMessages-1.21wmf7.php [22:51:37] MaxSem: set-group-write should g+w all files under /h/w/common [22:51:39] Running as whom? [22:51:45] hashar: Examples I find online look like this: "/usr/bin/killall -HUP glusterfs > /dev/null 2>&1 || true" <- any idea what the || true is for? [22:52:03] RoanKattouw, as myself [22:52:28] andrewbogott: killall returns exit code 1 if no process matched [22:52:37] RoanKattouw: it needs to be owned by mwdeploy [22:52:43] andrewbogott: if the first succeeds then do NOT do the second [22:52:44] andrewbogott: I guess logrotate will abort / complain if the postrotate script returns with exit != 0 [22:52:50] hashar: ok, makes sense. [22:52:52] Also, it's not really anything you need to worry about unless you're deploying a new extension [22:52:53] andrewbogott: and the || true is there to ignore the error [22:52:53] Oh I see [22:52:55] Fixing [22:53:04] Even in that case, just add to it manually as it's grouped for wikidev [22:53:06] andrewbogott: also one could use: || : [22:53:07] ;) [22:53:11] opposite of && [22:53:11] (in bash) [22:53:37] MaxSem: Fixed [22:53:46] thanks! [22:53:59] ugh. I hate how renaming files makes any diff previously useful no longer useful [22:54:08] New patchset: Andrew Bogott; "Rotate all gluster logfiles, and only keep three weeks' worth."
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/42796 [22:54:27] New patchset: Ryan Lane; "Add support for dependent repos" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42859 [22:57:05] Ryan_Lane: you can ask git diff to change the rename detection threshold: git diff -M90% iirc [22:57:20] though it is not possible in Gerrit diff view :/ [22:57:23] heh [22:57:29] New patchset: Ryan Lane; "Add support for dependent repos" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42859 [22:57:29] well, that's good to know [22:57:39] I usually review my patches on my laptop just to be able to do that [22:57:41] hashar: Ooooooh that's nice [22:57:46] -M90% works with "git show" too :-] [22:57:55] so you can git-review -d then git show -M90% [22:58:17] and still have a list of the changes in the renamed file (tweak percentage as needed) [22:58:20] jeremyb: so - i want to merge https://gerrit.wikimedia.org/r/#/c/22698/4 but my rebase-foo is subpar [22:58:34] RoanKattouw: appreciating your support :-] [22:58:45] i was considering just redoing that change manually [22:58:57] jeremyb: if you wouldn't mind [22:59:01] (going through old gerrit changes) [23:00:36] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42796 [23:01:33] New patchset: Ryan Lane; "Add support for dependent repos" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42859 [23:02:31] Change merged: preilly; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42867 [23:06:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:06:14] !log maxsem Started syncing Wikimedia installation... : Weekly mobile deployment [23:06:24] Logged the message, Master [23:08:12] New review: Anomie; "Not that I know much about this..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/42871 [23:14:47] New patchset: Ryan Lane; "Add support for dependent repos" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42859 [23:14:58] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 185 seconds [23:15:08] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 189 seconds [23:17:47] New patchset: Lcarr; "fixing some icinga errors - need libssl.so.0.9.8" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42881 [23:18:28] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42881 [23:18:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.116 seconds [23:19:09] preilly: i'm merging on sockpuppet - are your varnish acl changes ok to merge ?
[23:19:19] LeslieCarr: yes [23:19:32] ok, merging now [23:19:41] LeslieCarr: Okay cool thanks [23:25:56] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42859 [23:28:26] New patchset: Ryan Lane; "Add forgotten parameter to deployment::salt_master" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42882 [23:30:24] New patchset: Ryan Lane; "Follow up to I56379393" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42882 [23:30:49] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42882 [23:38:11] New patchset: Ryan Lane; "Another follow up to I56379393" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42883 [23:39:46] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42883 [23:45:47] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [23:45:55] binasher, why do you think the mobile HTML is mangled too? I didn't find anything on the page you've pointed at [23:46:51] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [23:49:11] New review: Andrew Bogott; "The content here looks fine; I've made a couple of suggestions to improve ease of reading and maint..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/42786 [23:49:39] MaxSem: ok, js is mangling it after the fact [23:49:56] binasher, currently scapping a fix [23:50:07] yay, don't have to flush varnish [23:51:05] dunno about flushes, scap doesn't mean flush isn't needed:) [23:51:34] New patchset: Tim Starling; "Script updates for the new deployment system" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42887 [23:51:34] New patchset: Tim Starling; "Split scap scripts from other scripts useful on deployment hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42871 [23:51:35] you shouldn't need to flush html cache to update js loaded from bits [23:52:32] MaxSem: so now that i'm actually looking at cached html from mobilefrontend… [23:52:51] "mobile-frontend-logged-in-toast-notification":"Logged in as 207.231.231.103." "stopMobileRedirectCookieName":"stopMobileRedirect","stopMobileRedirectCookieDuration":15552000,"stopMobileRedirectCookieDomain":".wikipedia.org","hookOptions":"","username":"207.231.231.103" [23:52:56] that is not my ip address… [23:53:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:53:17] information about users is being put in the html response directly in script tags and cached in varnish for others to see [23:53:20] that's bad [23:53:44] at least it's fast, who cares about cache poisoning [23:54:36] New patchset: Tim Starling; "Script updates for the new deployment system" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42887 [23:54:36] oh shi~~~ [23:55:30] first of all, it shouldn't tell anons they're logged in;) [23:56:45] !log maxsem Finished syncing Wikimedia installation... : Weekly mobile deployment [23:56:54] Logged the message, Master [23:59:10] LeslieCarr: sorry, idk why i didn't hear a bell [23:59:17] LeslieCarr: i'll be back in 15-20 mins [23:59:58] New review: Tim Starling; "PS2: sorry, accidental automatic rebase" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42871