[00:01:25] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42701 [00:03:36] RECOVERY - Host db62 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [00:07:11] PROBLEM - MySQL Idle Transactions on db62 is CRITICAL: Connection refused by host [00:07:21] PROBLEM - MySQL Recent Restart on db62 is CRITICAL: Connection refused by host [00:07:29] PROBLEM - MySQL Replication Heartbeat on db62 is CRITICAL: Connection refused by host [00:07:57] PROBLEM - SSH on db62 is CRITICAL: Connection refused [00:07:58] PROBLEM - Full LVS Snapshot on db62 is CRITICAL: Connection refused by host [00:08:15] PROBLEM - MySQL Slave Delay on db62 is CRITICAL: Connection refused by host [00:08:41] PROBLEM - MySQL Slave Running on db62 is CRITICAL: Connection refused by host [00:08:42] PROBLEM - MySQL disk space on db62 is CRITICAL: Connection refused by host [00:10:20] what's this apache_status ganglia module doing on the nginx boxes? [00:10:48] ah [00:11:06] seems it isn't working properly [00:12:40] on the nginx boxes, i wouldn't expect it to [00:13:11] RECOVERY - SSH on db62 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:13:12] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [00:14:53] http://wiki.nginx.org/HttpStubStatusModule [00:14:57] well, that's configured [00:15:01] but it returns a 400 error [00:16:08] !log starting innobackupex on db1004 [00:16:18] Logged the message, notpeter [00:19:02] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [00:20:03] Ryan_Lane: did you make the working copy changes in tin:/srv/deployment/mediawiki/common ? [00:20:59] TimStarling: yes, you can wipe them out [00:21:11] I was looking at how much work it would be to change the paths [00:22:16] I think we were considering committing them [00:22:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:22:27] to a dev branch of course [00:22:54] * Ryan_Lane nods [00:23:28] I was considering a way to handle test, as well, btw [00:23:38] git deploy start [00:23:50] git deploy test <— would run a test sync script [00:24:05] it would also write a tag that had -test appended [00:24:10] it can probably just die [00:24:10] and would write out a .deploy.test file [00:24:13] oh? [00:24:16] we don't really use it anymore [00:24:20] Just keep test2? [00:24:26] that makes things simpler [00:24:41] Or move it to a more normal config and keep 2 test wikis? [00:25:19] yeah, we can keep the actual wiki, people write stuff on there that they might want to keep, like gadgets [00:26:33] it's not like it'd be difficult to have the test wikis on different branches/versions most of the time if necessary [00:28:56] PROBLEM - NTP on db62 is CRITICAL: NTP CRITICAL: No response from NTP server [00:29:27] is there any documentation about Antoine's labs project? [00:29:56] it seems like it should have its own name rather than being called after what it runs on... [00:31:16] ori-l: Did you ask mutante about your bz account merge? [00:31:53] i created an RT ticket and someone from ops (can't remember who) responded wondering out loud if this is proper to do.. in a meeting atm tho. [00:33:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.844 seconds [00:33:01] is it the deployment-prep project in labsconsole? [00:33:05] yes [00:35:36] ori-l, that was me (not quite from ops, just curious!) 
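(Aside: a minimal sketch of the "git deploy test" flow Ryan_Lane proposes above. The -test tag suffix, the .deploy.test marker and the test sync script are all part of his proposal, not an existing feature, and the tag name shown is purely illustrative.)

    # on the deploy host, inside e.g. /srv/deployment/mediawiki/slot0
    git deploy start                    # open a deployment window as usual
    git deploy test                     # proposed: run the test sync script instead of a full sync
    # which, per the proposal, would roughly amount to:
    git tag "slot0-20130108-1234-test"  # the usual deploy tag scheme with -test appended (name illustrative)
    touch .deploy.test                  # the proposed marker file written alongside the tag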
[00:37:08] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40344 [00:38:05] RECOVERY - MySQL Slave Running on db62 is OK: OK replication [00:38:23] RECOVERY - MySQL disk space on db62 is OK: DISK OK [00:38:32] RECOVERY - MySQL Idle Transactions on db62 is OK: OK longest blocking idle transaction sleeps for seconds [00:38:41] RECOVERY - MySQL Replication Heartbeat on db62 is OK: OK replication delay seconds [00:38:50] RECOVERY - MySQL Recent Restart on db62 is OK: OK seconds since restart [00:39:08] RECOVERY - Full LVS Snapshot on db62 is OK: OK no full LVM snapshot volumes [00:39:26] RECOVERY - MySQL Slave Delay on db62 is OK: OK replication delay seconds [00:40:42] New patchset: Pyoungmeister; "testing: further testing on s2/db62" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42703 [00:41:28] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42703 [00:44:11] AaronSchulz: what about deleted containers? do we need to sync those? [00:44:51] which ones? [00:44:57] all of them? [00:45:01] you mean the "deletion" container? [00:45:07] which deleted files [00:45:08] *-*-local-deleted [00:45:10] *with [00:45:20] heh, I thought you meant "containers that were deleted" ;) [00:45:29] oh, sorry [00:45:31] sync to where? ceph? [00:45:45] yes [00:46:09] yeah [00:46:15] okay [00:46:16] thanks :) [00:46:49] New patchset: Pyoungmeister; "testing: db61 and db62 with s3 and lucid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42704 [00:47:53] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42704 [00:48:31] paravoid: wait, it's almost 4:48pm here [00:48:43] yeah? [00:48:56] you are -0 right? [00:49:05] I'm EET, UTC+2 [00:49:23] so almost 3am? [00:49:24] localtime is 02:49am [00:49:45] well to be fair I committed something at like 2:30am last night pst [00:50:24] is this the first time you're noticing that I tend to work late? :) [00:50:32] PROBLEM - Host db62 is DOWN: PING CRITICAL - Packet loss = 100% [00:50:41] paravoid? work late? [00:50:44] I don't believe it [00:51:05] haha [00:51:13] notpeter: did you just reboot db62? 
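(For reference, the "*-*-local-deleted" containers AaronSchulz and paravoid agree to sync can be enumerated with the swift client; the auth URL, account and key below are placeholders rather than production credentials.)

    # list the per-wiki deletion containers that would need syncing to ceph
    swift -A <auth-url> -U <account:user> -K <key> list | grep -- '-local-deleted$'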
[00:51:26] agh [00:51:34] yes [00:51:36] I'm sorry [00:51:37] please abort lucid reimage [00:51:41] ok [00:51:56] that would be a big fail [00:52:29] you can't run lucid on an r720 [00:52:39] ah, ok [00:52:52] and i'm doing mha experimentation on db62 [00:52:57] ah, sorry [00:53:13] if no lucid, then I'm done with db61 and db62 [00:53:16] sorry for not checking first [00:53:52] notpeter: take db1036, it's a r510 that was just repaired and isn't slaving anything [00:54:00] binasher: ok, cool [00:54:44] RECOVERY - Host db62 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [00:55:41] binasher: let me know when you have a sec [00:56:20] AaronSchulz: sup [01:00:32] New patchset: RobH; "adding caesium to parsoid eqiad cluster role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42358 [01:02:23] New patchset: Pyoungmeister; "testing: actually db1036 as fake s3 master (lucid test)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42705 [01:03:09] New patchset: RobH; "adding caesium to parsoid eqiad cluster role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42358 [01:04:22] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42705 [01:05:15] !log temp stopping puppet on brewster [01:05:27] Logged the message, notpeter [01:05:47] New patchset: RobH; "adding caesium to parsoid eqiad cluster role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42358 [01:07:09] is http://en.wikipedia.beta.wmflabs.org/ meant to work? [01:07:18] or was squid stopped on it deliberately? [01:07:24] New patchset: RobH; "adding caesium to parsoid eqiad cluster role" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42358 [01:08:10] New review: RobH; "That took entirely too many patchsets, it must be the end of the day." [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/42358 [01:08:11] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42358 [01:08:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:12:18] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 270 seconds [01:13:29] PROBLEM - SSH on db1036 is CRITICAL: Connection refused [01:14:05] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds [01:14:23] PROBLEM - MySQL disk space on db1036 is CRITICAL: Connection refused by host [01:14:44] TimStarling: I believe it should be working [01:15:04] I'll start squid on it [01:15:07] * Ryan_Lane nods [01:16:13] !log on deployment-squid.pmtpa.wmflabs: started squid [01:16:24] Logged the message, Master [01:16:52] New patchset: Dzahn; "add aaron, maryana, rfaulkner accounts to vanadium (RT-4139)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42708 [01:17:02] TimStarling, several of the VMs in that cluster were freaking out because of runaway logfiles resulting in full disks. 
That's probably why it is (or was) down [01:18:06] New patchset: Dzahn; "add aaron, maryana, rfaulkner accounts to vanadium (RT-4139)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42708 [01:18:17] (specifically, /var/log/glusterfs/*.log) [01:19:07] it seems fine now [01:19:40] New review: Dzahn; "RT-4139" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/42708 [01:19:41] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42708 [01:22:20] RECOVERY - SSH on db1036 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [01:25:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.268 seconds [01:28:11] RECOVERY - MySQL disk space on db1036 is OK: DISK OK [01:28:23] binasher: http://pastebin.com/Xkp7K2Kp [01:29:23] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [01:34:44] New patchset: Pyoungmeister; "testing: db1036 set to use coredb s3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42710 [01:35:15] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [01:36:16] binasher: oh, and it must be 2.6 since I see Lua options in the conf [01:37:14] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42710 [01:42:48] binasher: enabling "persistent" mode seems to fix it [01:42:50] odd [01:42:59] (in RedisBagOStuff) [01:43:25] New patchset: Pyoungmeister; "coredb: cp/paste error for lucid package" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42711 [01:43:58] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42711 [01:44:02] binasher: lots of our redis.log is 'Redis exception on server X' too [01:44:07] for session [01:44:20] maybe this problem is related [01:44:44] I hope phpredis is not making a bunch of new connections due to some bug [01:45:13] *all of our redis.log [01:45:22] though it's not terribly huge [01:47:36] New patchset: Pyoungmeister; "testing: cleanup from testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42712 [01:48:03] bumping maxclients to 4096 in redis.conf helps :) [01:48:36] oh, wait I left persistence on [01:48:41] no it doesn't then [01:48:43] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42712 [01:49:03] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [01:58:29] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [01:59:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:02:02] New patchset: Pyoungmeister; "coredb: hacking in temporary mariadb support" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42713 [02:03:03] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42713 [02:09:48] New patchset: Pyoungmeister; "coredb: also need mariadb-related config options in my.cnf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42714 [02:10:28] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42714 [02:15:21] New patchset: Pyoungmeister; "testing: making db1036 mariadb s4 slave" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42715 [02:15:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.032 seconds [02:17:03] 
Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42715 [02:19:03] PROBLEM - Host db1036 is DOWN: PING CRITICAL - Packet loss = 100% [02:24:54] RECOVERY - Host db1036 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [02:26:01] !log LocalisationUpdate completed (1.21wmf7) at Tue Jan 8 02:26:01 UTC 2013 [02:26:11] Logged the message, Master [02:28:58] PROBLEM - MySQL Slave Delay on db1036 is CRITICAL: Connection refused by host [02:28:59] PROBLEM - SSH on db1036 is CRITICAL: Connection refused [02:29:15] PROBLEM - Full LVS Snapshot on db1036 is CRITICAL: Connection refused by host [02:29:25] PROBLEM - MySQL Idle Transactions on db1036 is CRITICAL: Connection refused by host [02:29:25] PROBLEM - mysqld processes on db1036 is CRITICAL: Connection refused by host [02:29:25] PROBLEM - MySQL Slave Running on db1036 is CRITICAL: Connection refused by host [02:30:10] PROBLEM - MySQL Recent Restart on db1036 is CRITICAL: Connection refused by host [02:30:10] PROBLEM - MySQL disk space on db1036 is CRITICAL: Connection refused by host [02:30:27] PROBLEM - MySQL Replication Heartbeat on db1036 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:34:21] RECOVERY - SSH on db1036 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [02:51:09] !log LocalisationUpdate completed (1.21wmf6) at Tue Jan 8 02:51:08 UTC 2013 [02:51:19] Logged the message, Master [02:52:28] New patchset: Eloquence; "(bug 43565) Switch Wikivoyage to new logo." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42717 [02:59:24] PROBLEM - NTP on db1036 is CRITICAL: NTP CRITICAL: No response from NTP server [03:29:28] New review: Dereckson; "To be completed." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/42717 [04:00:55] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [04:04:22] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.008 second response time on port 11000 [04:21:03] New review: Aude; "I have protected http://commons.wikimedia.org/wiki/File:WikivoyageLogoSmall.png and we will need to ..." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42717 [04:44:07] root@deployment-bastion:/data/project# time tar -xzf /var/tmp/mediawiki-1.20.2.tar.gz [04:44:07] real 2m19.383s [04:44:20] tstarling@deployment-bastion:/var/tmp$ time tar -xzf mediawiki-1.20.2.tar.gz [04:44:20] real 0m0.753s [04:46:01] * TimStarling suspects the gluster server is just Ryan_Lane's laptop with a few external USB hard drives ;) [04:47:28] New review: Dereckson; "Actually, it seems to be " // Bug" the most favored comment." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/42688 [04:51:06] TimStarling: maybe desktop? otherwise what happens when he goes home? [04:51:19] New patchset: Ori.livneh; "(bug 43565) Update Wikivoyage logo and favicon" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42717 [04:53:26] maybe an old laptop with a broken battery that he donated to labs [04:55:56] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [05:01:55] New patchset: Dereckson; "(bug 43565) Switch Wikivoyage to new logo." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42717 [05:04:29] New review: Dereckson; "PS3: to support IE browsers, favicon in icon format, instead PNG." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42717 [05:20:14] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [05:22:59] TimStarling: don't get me started on gluster [05:23:11] lolololol [05:23:37] I hate it with a passion [05:23:49] Gluster is banned in our office [05:25:00] I'm hoping cephfs doesn't suck [05:25:10] the root partition and /mnt are both local storage, right? [05:25:14] yes [05:25:25] it'll be faster, to a point [05:26:06] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.071 second response time [05:26:19] definitely much, much faster than gluster, for a couple reasons: 1. it's SAS. 2. it's not gluster [05:26:29] we should put the git deploy trees on one of those local partitions, I think [05:26:34] yes [05:26:54] unless you're masochistic, you don't want it on /data/project [05:27:39] that's where the MW tree is at the moment, and I think that's probably why parser cache hits take 0.8s [05:27:53] in labs? [05:28:03] yes, deployment-bastion etc. [05:28:07] yeah [05:28:20] /usr/local/apache is a symlink to /data/project/apache [05:28:54] well, git is known to be very slow on shared storage [05:29:06] php hitting the files themselves shouldn't be as bad [05:29:23] assuming gluster isn't freaking out again [05:29:42] APC is probably installed and working, otherwise it would be worse [05:29:46] yep [05:29:50] but there will still be a few stat calls per request [05:30:13] hm. no gluster is acting normally right now [05:30:53] it's so, so slow :( [05:31:49] if I had to choose again, I would have went with SAS disks and less storage for the project storage rather than more storage and SATA disks [05:35:36] <3 https://github.com/saltstack/salt/issues/3181 [05:35:47] 9 hours. what a great upstream. [05:36:30] salt++ [05:36:46] have you looked at it at all yet? [05:39:00] it's very clean python [05:39:10] but a pretty young project, still [05:39:16] yep [05:39:22] i was a bit surprised they have tshirts already but no api [05:39:30] no api> [05:39:49] https://github.com/saltstack/salt-api [05:39:56] yeah [05:39:58] not out yet [05:40:08] you mean a rest api [05:40:30] there's a client api: http://docs.saltstack.org/en/latest/ref/python-api.html [05:40:47] well, http would be nice, rest or not [05:40:53] agreed [05:41:11] or document the 0mq wire protocol better, it didn't look too hard [05:41:11] A windows installer that isn't painful would be nice too [05:41:40] Damianz: oh, UtahDave told me yesterday that they just made one [05:41:54] He said he might be soon last week to me [05:41:57] but nice sphinx docs, responsive upstream, tidy code, zeromq -> +1 from me. [05:42:10] ori-l: I've been waiting patiently for the rest api [05:42:15] I need it pretty badly [05:42:24] it has keystone auth now ;) [05:42:40] well, kind of ;) [05:42:43] it does for modules [05:42:56] I want it for the external_auth framework [05:43:08] that + rest api is what I need [05:43:20] https://github.com/saltstack/salt/tree/develop/salt/auth ? [05:43:35] this is for external_auth? 
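(Context for the external_auth discussion: eauth lets a non-root user run salt commands after authenticating against a configured backend; pam ships with salt, and a keystone backend is what Ryan_Lane and Damianz are talking about here. A minimal sketch, assuming a salt version with eauth support:)

    # prompts for the user's credentials and checks them against the pam backend;
    # the master's external_auth config decides which functions that user may run
    salt -a pam '*' test.ping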
[05:43:44] That's what rest api uses [05:43:44] Damianz: <3 [05:43:50] you are awesome [05:44:37] And on that note, I'll go get ready for work :P [05:44:48] ah, this does password auth against keystone [05:45:38] I need to add token support [05:45:43] where it takes a generic token and uses it for project tokens [05:45:58] or verified project tokens [05:46:02] *verifies [05:46:41] I think the auth stuff for keystone needs some wider work - did you see the comments on that PR? [05:46:58] Doing validation of a token is pretty easier, but you can't re-use it for auth in modules sadly atm [05:47:08] right [05:47:34] however, if you want to run a runner for a project, it would be able to verify the user is in that project [05:48:03] you can reuse generic tokens [05:49:29] I wonder if they'll take a backend for staticly configured users in the config.... that would be useful for something I'm working on atm hmmm [05:50:14] staticly config'd users? [05:50:46] ah, like user/pass combos? [05:51:20] yeah, just a list in the config of user -> pass [05:51:28] right [05:53:05] yeah, I could see that being useful [05:53:42] realistically, you could implement that as an external_auth module [05:58:03] eauth would be interesting to extend - in the case of ldap using groups as well as users for acl definitions would be amazingly useful [05:58:47] well, they were telling me it should support that. I didn't really see how [05:59:23] I think you'd have to move the auth logic tbh [06:35:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:37:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.270 seconds [06:41:43] New patchset: Tim Starling; "Directory structure reorganisation for git-deploy" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42727 [06:41:43] New patchset: Tim Starling; "Fix a path name from the prior commit" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42728 [06:41:44] New patchset: Tim Starling; "Cleanup for the removal of the testwiki NFS feature" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42729 [06:50:18] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [06:50:19] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [06:50:51] New patchset: Tim Starling; "Directory structure reorganisation for git-deploy" [operations/mediawiki-config] (newdeploy) - https://gerrit.wikimedia.org/r/42730 [06:50:52] New patchset: Tim Starling; "Fix a path name from the prior commit" [operations/mediawiki-config] (newdeploy) - https://gerrit.wikimedia.org/r/42731 [06:50:52] New patchset: Tim Starling; "Cleanup for the removal of the testwiki NFS feature" [operations/mediawiki-config] (newdeploy) - https://gerrit.wikimedia.org/r/42732 [06:51:11] Change abandoned: Tim Starling; "wrong branch" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42727 [06:51:13] Change abandoned: Tim Starling; "wrong branch" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42728 [06:51:19] Change abandoned: Tim Starling; "wrong branch" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42729 [06:52:27] Change merged: Tim Starling; [operations/mediawiki-config] (newdeploy) - https://gerrit.wikimedia.org/r/42730 [06:52:32] Change merged: Tim Starling; [operations/mediawiki-config] (newdeploy) - 
https://gerrit.wikimedia.org/r/42731 [06:52:38] Change merged: Tim Starling; [operations/mediawiki-config] (newdeploy) - https://gerrit.wikimedia.org/r/42732 [06:58:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:03:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.180 seconds [07:12:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:18:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.690 seconds [07:37:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:43:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds [07:53:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:02:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.077 seconds [08:12:22] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [08:12:22] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [08:12:22] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [08:12:22] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [08:12:22] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [08:12:22] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [08:15:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:18:49] Change merged: Matthias Mullie; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/32362 [08:22:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.041 seconds [08:32:09] MaxSem: are you able to restart jetty on vanadium? looks like it's still using the old schema [08:33:39] no [08:33:55] apergos or paravoid should be able to [08:34:19] okay, let's see if they are around [08:34:24] PROBLEM - Puppet freshness on ms-be11 is CRITICAL: Puppet has not run in the last 10 hours [08:34:35] but if it's still using the old schema it means that Puppet hasn't been run [08:34:51] cuz schema.xml changes trigger a jetty restart [08:35:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:35:52] but I can't chech when it has been run either [08:37:53] Nikerabbit, but it's using the new schema [08:38:04] curl 'http://vanadium:8983/solr/admin/file/?contentType=text/xml;charset=utf-8&file=schema.xml'|less [08:38:25] I see _version_ there [08:39:31] MaxSem: oh cool, then I need to figure out why it is failing :() [08:39:47] and how does it fail? [08:40:02] was https://wikitech.wikimedia.org/view/Solr#Upgrading_schema followed? [08:40:23] I need to run delete *:* to see if that helps [08:40:35] New patchset: Aude; "Harmonize bugzilla links for $wgAutoConfirmCount" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42688 [08:41:23] you can nuke the index from fenari [08:41:48] does ttm support reindexing from scratch? 
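(The "delete *:*" Nikerabbit mentions below is a standard Solr delete-by-query; a sketch against the single-core layout visible in the earlier curl, with the update endpoint path assumed:)

    # wipe the whole index, then let Translate/ttmserver reindex from scratch
    curl 'http://vanadium:8983/solr/update?commit=true' \
         -H 'Content-Type: text/xml' \
         --data-binary '<delete><query>*:*</query></delete>'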
[08:42:15] MaxSem: yes [08:42:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.970 seconds [08:54:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:55:24] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: Puppet has not run in the last 10 hours [09:00:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [09:00:22] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: Puppet has not run in the last 10 hours [09:08:53] MaxSem: yup I just forgot to delete existing content, now it works [09:11:09] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 2.44 ms [09:15:14] yeah but that's a lie, just fooling with the install [09:18:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:23:05] New patchset: ArielGlenn; "ms-be5 gets ssd disk layout" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42737 [09:24:52] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42737 [09:25:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.036 seconds [09:29:38] !log nikerabbit synchronized php-1.21wmf7/extensions/Translate/resources/js/ext.translate.special.translate.js [09:29:49] Logged the message, Master [09:34:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:38:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.526 seconds [09:38:36] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [09:44:27] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [09:44:57] New patchset: Hashar; "(bug 43630) set autoconfirm count to 10 @ fawiki, per consensus" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42232 [09:54:06] !log mlitn synchronized wmf-config/InitialiseSettings.php 'Increase AbuseFilter disable threshold for AFT entries to 20%' [09:54:17] Logged the message, Master [09:56:31] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [09:56:31] matthiasmullie, only 100% will help:P [09:56:43] :D [09:57:13] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [09:57:31] MaxSem: then some admin would disable AFT completely [09:57:46] as if it's something bad;) [09:57:49] going to deploy https://gerrit.wikimedia.org/r/#/c/42232/ a trivial conf change [09:58:00] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42232 [09:58:50] !log hashar synchronized wmf-config/InitialiseSettings.php '{{gerrit|42232}} set autoconfirm count to 10 @ fawiki' [09:58:54] New review: Hashar; "deployed on live site." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42232 [09:59:00] Logged the message, Master [09:59:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:04:15] New patchset: Faidon; "reprepro: add jenkins to conf/updates" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42741 [10:04:20] hashar: ^^^ [10:04:49] paravoid: my hero :-] [10:04:55] +1? [10:05:17] paravoid: so that is something like the uuscan / uupdate ? 
[10:05:19] btw, we also now get gpg verification of the packages [10:05:26] not exactly, no [10:05:41] # reprepro checkupdate precise-wikimedia [10:05:41] Warning: Override-Files of 'precise-wikimedia' ignored as not yet supported while updating! [10:05:44] Calculating packages to get... [10:05:47] nothing new for 'precise-wikimedia|main|source' (use --noskipold to process anyway) [10:05:50] Updates needed for 'precise-wikimedia|main|amd64': [10:05:52] 'jenkins': newly installed as '1.480.2' (from 'jenkins'): [10:05:55] files needed: pool/main/j/jenkins/jenkins_1.480.2_all.deb [10:06:12] New patchset: Aude; "Harmonize bugzilla links for $wgAutoConfirmCount" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42688 [10:06:23] New review: Aude; "rebased" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42688 [10:06:27] New review: Faidon; "Tested & works as intended." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/42741 [10:06:31] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42741 [10:06:56] New review: Hashar; "Nice! Way better than having to manually download the package and run a command :-]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42741 [10:07:20] paravoid: can you possibly update the doc at https://wikitech.wikimedia.org/view/Jenkins ? [10:07:58] I am doing that already :-) [10:10:37] argh [10:11:19] Error or unsupported message received: '103 Redirect [10:11:19] URI: http://mirrors.jenkins-ci.org/debian-stable/jenkins_1.480.2_all.deb [10:11:22] New-URI: http://ftp-chi.osuosl.org/pub/jenkins/debian-stable/jenkins_1.480.2_all.deb [10:11:41] :-] [10:12:00] reprepro bug, fixed with http://anonscm.debian.org/gitweb/?p=mirrorer/reprepro.git;a=commitdiff;h=f8d2cff19654a9ed6f545ac0321589f4fbe49224, included in 4.12.4-1 [10:12:24] precise has 4.8.2-1build1, quantal 4.12.3-1, raring 4.12.5-1 [10:12:25] dammit [10:12:30] PROBLEM - LDAP on nfs1 is CRITICAL: Connection refused [10:13:11] seems like you are going to backport a package :-] [10:13:15] PROBLEM - LDAPS on nfs1 is CRITICAL: Connection refused [10:13:23] sigh [10:15:30] Change merged: Matthias Mullie; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42402 [10:17:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.695 seconds [10:18:48] RECOVERY - LDAPS on nfs1 is OK: TCP OK - 0.001 second response time on port 636 [10:19:24] RECOVERY - LDAP on nfs1 is OK: TCP OK - 0.005 second response time on port 389 [10:23:27] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [10:25:21] oh god, brewster is lucid [10:26:42] things are pilling up quickly :-] [10:27:13] New review: Aude; "both logos are protected now on Commons" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/42717 [10:29:42] !log installing new backported reprepro version to {lucid,precise}-wikimedia and upgrading it in brewster [10:29:52] Logged the message, Master [10:30:46] !log reprepro update for jenkins (fetched 1.480.2) [10:30:56] Logged the message, Master [10:34:34] New patchset: Faidon; "reprepro: update conf/updates for new version" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42742 [10:34:47] New review: Faidon; "Tested" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/42742 [10:34:57] Change merged: Faidon; [operations/puppet] (production) - 
https://gerrit.wikimedia.org/r/42742 [10:36:17] hashar: do you need anything else from me? [10:36:26] paravoid: so going to upgrade Jenkins [10:36:28] hashar: I think you can do apt-get upgrade yourself whenever you find it convienient, right? [10:36:39] paravoid: then will need you to update an API key in the private repo [10:36:54] but that's unrelated to all this, right? [10:37:15] !log shutdowning jenkins for an unscheduled maintenance and upgrading the installation. [10:37:19] paravoid: yeah that is unrelated. [10:37:24] k [10:37:25] Logged the message, Master [10:37:36] upgrading [10:37:53] !log stopping Zuul [10:38:03] Logged the message, Master [10:39:09] now I have to wait for Jenkins to finish loading :/ [10:42:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:46:39] New patchset: Hashar; "(bug 43729) create /mnt/srv and /srv on beta mw installs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42743 [10:47:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.995 seconds [10:50:55] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [10:58:42] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.058 second response time [10:59:43] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [11:06:11] RECOVERY - NTP on ms-be5 is OK: NTP OK: Offset -0.08418869972 secs [11:10:58] RECOVERY - swift-container-replicator on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [11:11:08] RECOVERY - swift-account-reaper on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [11:11:08] RECOVERY - swift-object-replicator on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [11:11:26] RECOVERY - swift-container-server on ms-be5 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [11:11:26] RECOVERY - swift-object-server on ms-be5 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [11:11:53] RECOVERY - swift-object-updater on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [11:12:01] RECOVERY - swift-account-server on ms-be5 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [11:12:11] RECOVERY - swift-container-updater on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [11:12:19] RECOVERY - swift-account-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [11:12:19] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:12:19] RECOVERY - swift-object-auditor on ms-be5 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [11:13:13] RECOVERY - swift-account-replicator on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [11:15:01] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [11:19:01] New patchset: ArielGlenn; "ms-be5 moved to stanza for 720xd with ssds" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42745 [11:19:41] Change merged: ArielGlenn; 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/42745 [11:25:04] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:28:59] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours [11:39:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.043 seconds [11:42:56] PROBLEM - Host ms-be5 is DOWN: PING CRITICAL - Packet loss = 100% [11:46:04] PROBLEM - Puppet freshness on db1036 is CRITICAL: Puppet has not run in the last 10 hours [11:46:59] RECOVERY - Host ms-be5 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [11:57:10] PROBLEM - swift-account-reaper on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [12:00:01] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [12:11:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:23:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.053 seconds [12:34:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:36:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.051 seconds [12:43:32] re [12:46:05] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [12:52:51] !log cleared out /var/lg/messages.[1-x] on older ms-be hosts, several had full root partitions; disabled logging to that facility in rsyslogd conf [12:53:02] Logged the message, Master [12:57:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:09:37] RECOVERY - Puppet freshness on ms-be9 is OK: puppet ran at Tue Jan 8 13:09:18 UTC 2013 [13:12:10] RECOVERY - Puppet freshness on ms-be11 is OK: puppet ran at Tue Jan 8 13:11:57 UTC 2013 [13:12:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.055 seconds [13:18:10] New patchset: Ori.livneh; "(bug 43565) Update Wikivoyage logo and favicon" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42717 [13:30:37] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 198 seconds [13:30:47] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 196 seconds [13:31:23] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 223 seconds [13:31:31] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 220 seconds [13:33:34] New review: Ori.livneh; "favicon generated w/imagemagick's convert tool." 
[operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42717 [13:38:19] !log stopped and/or shot all swift processes on ms-be2, some were in an odd state, had load spike [13:38:29] !Log and restarted them [13:38:31] Logged the message, Master [13:38:31] grr [13:38:35] !log and restarted them [13:38:45] Logged the message, Master [13:38:54] apergos: I'm deleting some files from swift [13:39:16] lots of DELETEs [13:39:25] saw some of those go by [13:39:34] but some of the processes refused to die, [13:39:41] so something was off [13:39:48] other hosts look reasonable [13:40:30] I'm about to add to that load, going to give some weight back to ms5 which means more data moving around [13:43:34] looks like it had run out of disk space on / earlier [13:43:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:07] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 1 seconds [13:50:34] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 1 seconds [13:50:46] New review: Faidon; "Is there a way of making the script less noisy by default -or, alternatively, have it write errors t..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42600 [13:50:52] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [13:51:37] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [13:55:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.186 seconds [14:07:36] Change abandoned: Demon; "Abandoning rather than keeping this on everyone's review list when we have no intention on merging i..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/41978 [14:12:38] <^demon> Gerrit's going to be unavailable for just a short bit--rebooting manganese. [14:13:00] <^demon> !log rebooting manganese [14:13:10] Logged the message, Master [14:13:34] nice [14:13:39] will find out how Zuul catch up [14:15:26] <^demon> Gerrit's back up. [14:17:14] <^demon> !log manganese is back up, gerrit's running fine. [14:17:24] Logged the message, Master [14:17:26] <^demon> Caches may be cold for a bit, but nothing awful. [14:18:35] <^demon> hashar: Oh btw, two repositories are causing problems for replication to gallium. mediawiki/core and mediawiki/extensions/Wikibase. Could you run a git gc on both of those repositories in /var/lib/git? [14:18:47] sure [14:19:09] ^demon: can we potentially get a mail report whenever that fails ? [14:19:31] <^demon> Eh, it's in gerrit's error log. [14:19:43] <^demon> We could maybe hack up something to keep an eye on the error log to spot replication failures. [14:19:49] log rotate could post process it with a shell / perl script maybe [14:19:57] ops are probably aware of such a tool [14:20:00] <^demon> [2013-01-08 14:12:53,522] ERROR com.google.gerrit.server.git.PushReplication : Cannot replicate to gerritslave@gallium.wikimedia.org:/var/lib/git/mediawiki/extensions/Wikibase.git [14:20:15] <^demon> It's always in the form of ERROR com.google.gerrit.server.git.PushReplication : Cannot replicate to [14:20:41] warning: packfile ./objects/pack/pack-bb1cc28d27ec761d4836f0ae80dfb78f0e7ffb1e.pack cannot be accessed [14:20:41] error: refs/changes/98/34298/3 does not point to a valid object! 
[14:20:42] grmbmblbl [14:20:52] then after such spam : fatal: failed to read object acad0e7eed3ac84a5682ab4699afc013abd2931a: Too many open files [14:21:12] in /var/lib/git/mediawiki/extensions/Wikibase.git [14:21:35] <^demon> Bah, that too many open files is what I'm seeing on my end. [14:21:39] <^demon> I'm wondering what's happening. [14:22:00] on mw core I got warning: packfile ./objects/pack/pack-78ebf8c684d7019981d18c177b4dcb73e9a256dc.pack cannot be accessed [14:22:09] the file does exist [14:22:10] though it is r--r--r-- [14:22:27] !log rebooting solr1 to fix HT in bios [14:22:37] Logged the message, Master [14:23:03] [pid 4800] access("./objects/pack/pack-e37bf96475063946122db2fd20f6f038e27f0cb9.keep", F_OK) = -1 ENOENT (No such file or directory) [14:23:04] oh [14:24:27] ah it complains cause the file DOES exist [14:25:01] PROBLEM - Host solr1 is DOWN: PING CRITICAL - Packet loss = 100% [14:26:12] !log gallium: running git fsck in /var/lib/git/mediawiki/extensions/Wikibase.git . That repository might have ended up being corrupted somehow :( [14:26:22] Logged the message, Master [14:26:22] ^demon: ^^^ [14:26:42] <^demon> Bah. Lemme check out master. [14:27:28] would git repack fix something ? [14:27:46] <^demon> Packing should be independent of clones--you could try a repack on that side. [14:27:55] I have no idea how the gerrit replication stuff work though [14:28:47] <^demon> it's just pushing. [14:28:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:30:06] RECOVERY - Host solr1 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [14:32:48] <^demon> git gc seems to have cleaned up wikibase on manganese. [14:33:00] not on gallium :/ [14:34:24] ^demon: I am afraid the pack files got corrupted by the server [14:34:32] either a hardware fault / driver fault / raid issue [14:34:54] <^demon> I doubt that, could be a dozen other things. Try this. `git repack -a -d -f --depth=250 --window=250` [14:35:10] running [14:35:10] <^demon> Worst case, we'll just clone it fresh. [14:35:32] Delta compression using up to 8 threads. [14:35:33] \O/ [14:35:55] ahh your repack stuff worked [14:36:12] only 2 .pack files in .git/objects/pack [14:36:44] that is for mw/ext/Wikibase.git [14:37:00] ^demon: mw/core.git has some git process running right now on gallium [14:37:30] !log rebooting solr2 to fix HT in bios [14:37:39] Logged the message, Master [14:39:10] <^demon> I manually kicked off a replication of Wikibase. [14:39:43] PROBLEM - Host solr2 is DOWN: PING CRITICAL - Packet loss = 100% [14:39:47] k [14:39:56] Coke time + cig brb [14:41:19] <^demon> When you get back, run the same repack on core. [14:44:03] RECOVERY - Host solr2 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [14:45:38] ^demon: doing so [14:45:41] <^demon> The repack seems to have fixed the replication of wikibase. [14:45:44] <^demon> Just replicated fine. 
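(Recap of the recovery sequence hashar and ^demon used on gallium above, run inside the bare replica; all three commands are stock git:)

    cd /var/lib/git/mediawiki/extensions/Wikibase.git
    git fsck                                      # surface corrupt or unreachable objects
    git repack -a -d -f --depth=250 --window=250  # rewrite all pack files from scratch
    git count-objects -v                          # sanity check: loose objects and pack count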
[14:46:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.049 seconds [14:47:07] <^demon> Core repack might take a bit :) [14:48:34] Change abandoned: Hashar; "would let the file owner fix that one day :-)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39407 [14:48:46] Change abandoned: Hashar; "would let the files owner fix that one day :-)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39040 [14:48:57] Change abandoned: Hashar; "would let the files owner fix that one day :-)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39039 [14:50:26] Counting objects: 149355 [14:50:27] ... [14:50:47] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42743 [14:51:17] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42743 [14:51:30] stupid Zuul lost its connection [14:51:48] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42743 [14:52:49] New review: Hashar; "recheck" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42743 [14:53:05] !log gallium: restarted Zuul which could not recover its Gerrit connection :/ [14:53:15] Logged the message, Master [14:53:59] <^demon> hashar: Have we filed a bug to have zuul handle that better? [14:55:20] <^demon> Also, did core finish repacking? [14:55:40] Counting objects: 483768 [14:55:54] hi yall [14:56:07] ^demon: I am filling a bug for Zuul [14:56:10] anyone awake know about the arbcom-l http auth thing? [14:56:11] there are instructions here: [14:56:11] https://wikitech.wikimedia.org/view/Lists.wikimedia.org#Alter_arbcom-l_archive_access_list [14:56:13] and I have a very small question [14:56:48] RECOVERY - Puppet freshness on ms-be12 is OK: puppet ran at Tue Jan 8 14:56:35 UTC 2013 [15:04:01] ^demon: Compressing objects: 84% (443942/524966) [15:11:28] ^demon: git repack completed for mw/core [15:12:08] !log updating Zuul code base : 6c4d51a...23ec1ba (forced update) [15:12:18] Logged the message, Master [15:13:24] <^demon> hashar: Yay, replication completed for core, no errors :) [15:13:31] should I git gc again ? [15:13:44] <^demon> Shouldn't need to, but you can. [15:14:09] <^demon> The repack did most of what a gc would accomplish. [15:14:52] yeah will skip that [15:15:00] + there is a git process running right now (there is a lock file) [15:15:23] <^demon> There's lots of git processes running all the time :) [15:15:55] <^demon> Usually they're very brief--replication keeps up nicely so pushes are small. [15:16:07] <^demon> Only when you've got weird failures--want those fixed quickly. [15:16:25] <^demon> Ok, that's all that was broken for gallium. Thanks hashar :) [15:17:05] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:21:06] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [15:24:51] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [15:31:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.081 seconds [15:35:45] New review: Dereckson; "According http://www.ie6countdown.com/ this is an acceptable solution." 
[operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/42717 [15:49:28] New patchset: Hashar; "(bug 43729) create /mnt/srv and /srv on beta mw installs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42743 [15:50:11] New review: Hashar; "PS2: move the $::realm == wmflabs exception from the submodule to the role::applicationserver::commo..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42743 [15:51:07] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.002 second response time on port 11000 [15:58:45] New patchset: ArielGlenn; "ssh key for dab" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42773 [15:59:40] !log upgraded Zuul [15:59:50] Logged the message, Master [16:05:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:06:51] New patchset: Dereckson; "(bug 43659) Import sources for eo.wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42774 [16:07:38] New review: Dereckson; "shellpolicy" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/42774 [16:13:07] New review: Umherirrender; "Thanks for this one." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/42538 [16:15:01] New patchset: Ottomata; "Removing /a from backup list." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42776 [16:15:16] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42776 [16:16:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [16:17:24] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42773 [16:33:49] New patchset: Raimond Spekking; "Set WMF default of $wgUnwatchedPageThreshold" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42779 [16:40:55] New patchset: Matthias Mullie; "AFT test group permissions have been removed already; these lines no longer make sense" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42538 [16:44:10] New patchset: Ottomata; "Removing stats user from analytics nodes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42782 [16:44:16] New review: Umherirrender; "FYI: It is better to avoid rebase and changes in one Patch Set" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/42538 [16:45:01] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42782 [16:45:21] AaronSchulz: around? [16:50:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:51:29] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [16:51:29] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [17:03:24] New patchset: Silke Meyer; "Puppet files to install Wikidata repo / client on different labs instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42786 [17:04:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.088 seconds [17:09:21] New review: Silke Meyer; "Okay, I have adapted the files a lot to our needs. The mysql database name is now customizable in ro..." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/42786 [17:25:15] PROBLEM - SSH on pdf3 is CRITICAL: Server answer: [17:26:54] RECOVERY - SSH on pdf3 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0) [17:29:42] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42717 [17:30:42] New review: CSteipp; "For the image directories, we could also add an alias in alias on beta." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/38307 [17:37:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:37:58] !log olivneh synchronized wmf-config/InitialiseSettings.php 'New logo for wikivoyage & eswikivoyage' [17:38:08] Logged the message, Master [17:38:51] !log olivneh synchronized docroot/wikivoyage.org/favicon.ico 'New favicon.ico for wikivoyage' [17:39:01] Logged the message, Master [17:51:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.066 seconds [18:13:32] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [18:13:33] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [18:13:33] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [18:13:33] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [18:13:33] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [18:13:33] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [18:23:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:23:54] heads up: the mobile team will deploy today and might need your help with Varnish around 3 PM [18:25:18] it's way past 3 pm [18:27:04] PST;) [18:34:29] New patchset: Dzahn; "sudo for strace/tcpdump for demon in role appservers (RT-4066)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42791 [18:37:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.963 seconds [18:51:26] !log reedy synchronized php-1.21wmf7/extensions/TimedMediaHandler [18:51:36] Logged the message, Master [19:01:43] New review: MZMcBride; "Looks good to me as an initial (default) value." [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/42779 [19:02:09] New patchset: Andrew Bogott; "Rotate all gluster logfiles, and only keep three weeks' worth." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42796 [19:09:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:23] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42779 [19:16:24] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42538 [19:16:39] Reedy: Are you gonna scap? [19:17:03] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42497 [19:18:06] New patchset: Reedy; "Disable TorBlock on private wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42389 [19:18:33] Not scap [19:18:41] But I will sync the the config files in a few minutes [19:19:05] Sweet. 
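(A sketch of the config-only push Reedy describes below — syncing wmf-config without a full scap — assuming the sync-dir/sync-file wrapper scripts in use on the deployment host at the time; the log messages are illustrative:)

    # from the deployment host, push just the changed configuration
    sync-dir wmf-config/ 'Config updates: 42779, 42538, 42497'
    sync-file wmf-config/InitialiseSettings.php 'New logo for wikivoyage'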
[19:19:45] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42368 [19:19:53] This sync will make watcher largely deprecated. [19:20:03] I've been working on this for about a year. [19:20:05] \o/ [19:20:19] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42217 [19:20:31] New patchset: Reedy; "(bug 29692) Per-wiki namespace aliases shouldn't override (remove) global ones" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25737 [19:20:52] anomie: howdy [19:21:00] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42688 [19:21:01] hi Ryan_Lane [19:21:14] read your email [19:21:20] lol [19:21:28] it's long :) [19:21:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 1.190 seconds [19:22:27] I didn't know how much things were set up for the new system yet, so I wanted to cover everything that seemed to be needed for the l10n [19:22:33] !log reedy synchronized wmf-config/ [19:22:45] Logged the message, Master [19:23:01] * Ryan_Lane nods [19:23:09] I was about to push in a change [19:23:26] for what I'm calling "dependent repos" [19:23:39] when you git deploy one repo, it'll git-deploy all of its dependent repos first [19:23:47] New review: Hashar; "commit summary claims 3 weeks are kept but then the logrotate configuration has a daily rotation and..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/42796 [19:23:58] it doesn't just do a git-deploy for the repo, though [19:24:03] it really just runs a script [19:24:22] that script would git deploy start, , git deploy sync [19:24:40] heh. this is in bash [19:24:47] anomie: do you know python? :) [19:25:15] Ryan_Lane- Haven't had opportunity to learn it, although I've muddled through one or two things. I do know Perl pretty well though ;) [19:25:20] I think the l10n stuff needs to match what's in the repo that's being deployed [19:25:26] no perl, please no perl :) [19:25:54] no worries. we can use the bash stuff and over time convert it to python [19:26:09] Yeah, we need a l10n branch/repo/whatever for each slot [19:26:12] right [19:26:16] well, no [19:26:25] can you ssh into tin? [19:26:46] ok, there [19:27:03] <^demon> I should rewrite the gerrit hooks in perl. [19:27:07] <^demon> Just to annoy Ryan ;-) [19:27:16] ^demon: I would fucking stab you :) [19:27:57] anomie: so, go to /srv/deployment/mediawiki [19:28:10] I've added repos for l10n-slot0 and l10n-slot1 [19:28:15] that said, they are empty [19:28:25] they are only meant to hold the localization stuff. not mediawiki [19:28:36] notice that there is also slot0 and slot1 [19:28:47] hm [19:28:50] wait... [19:29:00] how does the current stuff work? does it always update against head? [19:29:08] or does it update against what's deployed? [19:29:23] s/head/master/ [19:29:44] Does anyone feel like adding uploads6/upload7 and thumbs/thumbs2 on hume please? [19:30:42] The current setup has [[mw:Extension:LocalisationUpdate]] installed, which is basically a setup to have these cdb files contain translations from master even though the actual deployed code is older. [19:32:08] What LocalisationUpdate does is basically clone master into a separate location and then compare all the message files to figure out just which translations to take from master and which to ignore because they might not match the deployed code. [19:32:22] oh [19:32:27] wow. 
that's absurdly simple, then [19:32:37] what you have is likely good, then [19:32:49] I was thinking it had to be pulled from what's deployed [19:34:35] It compares what's deployed to master. Anything in English that's changed gets ignored for all other languages from master (although if 123456 is deployed, then you run LocalisationUpdate on 234567, then in 345678 the English message gets changed, it will keep the non-English messages from 234567 rather than suddenly going back to 123456). [19:34:58] ^demon: Wouldn't re-writing them in Ruby be a better option? [19:35:16] Any English message that's the same between deployed and master, it will pull the corresponding non-English messages from master for the CDB files. [19:35:24] <^demon> Reedy: That would require me learning more ruby :p [19:37:25] So that's basically what the script tries to do: it pulls master, then runs extensions/LocalisationUpdate/update.php and maintenance/rebuildLocalisationCache.php for each slot to build the CDB files that go in the l10n-slot. [19:38:05] hm [19:38:11] ok. so, it pulls master into a different repo [19:38:18] then it checks that repo against the deployed repo? [19:38:23] Yes [19:38:32] let's assume we want to deploy slot0 [19:38:56] (on fenari, that master repo happens to be in /var/lib/l10nupdate/mediawiki) [19:39:06] let's assume on tin :) [19:39:16] we'd be in /srv/deployment/mediawiki/slot0 [19:39:22] and do: git deploy start [19:39:43] then we'd make whatever changes we're going to make [19:39:50] and would do: git deploy sync [19:39:55] that runs a sync script for slot0 [19:40:07] the sync script will pull its configuration down from salt [19:40:36] it'll then see which dependent repos it has [19:40:45] it'll run a script for each dependent repo [19:40:59] in this case, slot0 only has l10n as a dependency [19:41:07] so, it'll run the l10n script [19:41:46] the l10n script will update its master core repo (and all extensions, I guess) [19:41:51] yes [19:41:52] it'll compare it against slot0 [19:42:05] then, it'll do the following: [19:42:12] binasher: I think the connection bug was in MW [19:42:25] git deploy start (in /srv/deployment/mediawiki/l10n-slot0) [19:42:39] it'll add the updates into the repo and commit them [19:42:48] then it'll run: git deploy sync [19:43:23] AaronSchulz: what was it? [19:43:29] at that point, it returns a status back to the slot0 sync script [19:43:39] then slot0 will tell minions to fetch [19:43:42] then to checkout [19:43:45] <^demon> AaronSchulz: I added a minor feature to ExtDist today--lets us define mw-ns messages to describe the branches. [19:44:02] anomie: sane? [19:44:33] it may be necessary to have a core checkout for every slot [19:44:52] it's possible that someone will do a deployment on slot0 and slot1 at the same time [19:44:57] Right [19:45:01] though people should honestly know better [19:45:04] The core checkout is ro (read-only) though [19:45:06] Ryan_Lane- Sounds reasonable to me. The l10n script you mention could look roughly similar to the l10nupdate script, except it would run just on the one slot.
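A minimal sketch in Python of the comparison rule anomie describes above, purely for illustration: the real logic lives in the PHP LocalisationUpdate extension (extensions/LocalisationUpdate/update.php) and maintenance/rebuildLocalisationCache.php, and the function name and data shapes here are assumptions. The "keep previously pulled translations instead of reverting" refinement is left out.

```python
def merge_l10n(deployed_msgs, master_msgs):
    """Pick the translations that are safe to take from master.

    Both arguments are dicts shaped like:
        {'en': {'msg-key': 'English text', ...},
         'de': {'msg-key': 'German text', ...}, ...}
    """
    deployed_en = deployed_msgs.get('en', {})
    master_en = master_msgs.get('en', {})
    updated = {}
    for lang, messages in master_msgs.items():
        if lang == 'en':
            continue  # English always comes from the deployed branch
        for key, text in messages.items():
            # Only pull a translation from master when the English source
            # message is identical in the deployed code and in master; if
            # English changed, the newer translation may describe behaviour
            # the deployed code does not have yet, so it is skipped.
            if key in master_en and master_en.get(key) == deployed_en.get(key):
                updated.setdefault(lang, {})[key] = text
    return updated
```

The result of that merge is what would get committed into the per-slot l10n repo and rebuilt into the CDB files mentioned above.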
[19:45:16] RoanKattouw: true [19:45:31] anomie: btw, you can pull configuration from salt [19:45:36] As long as git won't blow up if two processes "git pull" on the core repo at the same time, it can be shared [19:45:41] using: sudo salt-call pillar.data --out json [19:45:46] Also, funnily enough, the slot0 and slot1 checkouts are themselves core.git checkouts, just of non-master branches [19:45:59] RoanKattouw: indeed [19:46:01] anomie: I'd imagine one of them fails with a failure to acquire a lock [19:46:48] Ryan_Lane: Also I would encourage you to make l10n-slot0 and l10n-slot1 checkouts of different branches of the same repo [19:46:55] With slotN building on top of slotN-1 [19:47:07] what's it matter? they are local repos [19:47:33] Better retention of history and presumably better delta compression [19:47:36] ah [19:47:45] RoanKattouw: want to take a stab at that? :) [19:48:04] Making l10n one repo? Shouldn't be too hard, right? [19:48:04] you guys know what's needed here more than me [19:48:19] I'm working on dependent repo support right now [19:48:22] about to push it in [19:48:48] What is the mechanism for ensuring a dependent repo is deployed at the right time? It's just deployed immediately before? [19:48:49] * anomie tries "sudo salt-call pillar.data -out json", but doesn't remember his password for sudo. If he even has one. [19:49:02] You can't sudo on tin [19:51:50] Ryan_Lane: Basically I'm proposing that the l10n repo work the same way as the wmf/* branches in the core repo [19:52:17] RoanKattouw: so, my way of handling this is.... [19:52:20] <^demon> It's still gonna be on its own repo, right? [19:52:31] * ^demon freaked out at seeing "like wmf/* branches" [19:52:35] Having just gotten back from vacation I don't exactly have much time to work on it, but I can talk to you, Brad or Chad [19:52:37] ^demon: Yes [19:52:39] * Ryan_Lane nods [19:52:43] I just wanted it to be analogous [19:52:46] <^demon> Ok, ok. [19:52:57] so, for dependent repos... [19:52:58] I don't want wmf5's l10n to be a separate repo from wmf4's l10n [19:53:05] Instead, wmf5's l10n should continue where wmf4's left off [19:53:13] <^demon> (mw/core is already huge, I didn't want to start stashing cdb objects in it ;-)) [19:53:18] when you do a git deploy sync on slot0, it would run scripts for its dependent repos [19:53:22] such that all l10n is in one repo [19:53:35] they would do a git deploy, but wouldn't actually tell the minions to fetch/checkout [19:53:45] Right [19:53:50] it would do nothing, and return to the sync for slot0 [19:53:58] slot0 would tell its minions to fetch/checkout [19:54:07] So it fetches the l10n data onto the minions but not actually check it out yet? [19:54:13] *but doesn't actually [19:54:27] in the fetch state on the minion, it will fetch all dependent repos and then the repo itself [19:54:38] in the checkout stage, it will checkout all dependent repos, then the repo itself [19:54:43] OK [19:54:44] that way all of the data is already downloaded [19:54:55] Right, that makes sense [19:54:57] and it all switches at the same time [19:55:08] So the end result is the same as with submodules, except that the checkout order is reversed, right?
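A minimal sketch in Python of the minion-side ordering Ryan lays out above: fetch every dependent repo before the repo itself, and likewise check the dependents out first so the l10n data and the code flip over together. The helper names and the hard-coded dependency map are hypothetical; in the real setup the dependency configuration would be pulled down from salt (e.g. as pillar data) rather than embedded in the script.

```python
# Hypothetical dependency map; the real mapping would come from salt pillar data.
DEPENDENT_REPOS = {
    'mediawiki/slot0': ['mediawiki/l10n-slot0'],
    'mediawiki/slot1': ['mediawiki/l10n-slot1'],
}


def fetch_repo(repo):
    # Placeholder for running "git fetch" inside /srv/deployment/<repo>.
    print('fetching /srv/deployment/%s' % repo)


def checkout_repo(repo):
    # Placeholder for checking out whatever tag "git deploy sync" wrote.
    print('checking out /srv/deployment/%s' % repo)


def minion_fetch(repo):
    # Fetch stage: download the dependent repos first, then the repo itself,
    # so all of the data is on the minion before anything is switched.
    for dep in DEPENDENT_REPOS.get(repo, []):
        fetch_repo(dep)
    fetch_repo(repo)


def minion_checkout(repo):
    # Checkout stage: switch the dependents first, then the repo itself,
    # so the l10n CDBs and the code change over at (nearly) the same time.
    for dep in DEPENDENT_REPOS.get(repo, []):
        checkout_repo(dep)
    checkout_repo(repo)


if __name__ == '__main__':
    minion_fetch('mediawiki/slot0')
    minion_checkout('mediawiki/slot0')
```

As the exchange above notes, the end result resembles submodules except that the dependents are checked out before the parent repo rather than after it.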
[19:55:20] yes [19:55:47] Alright, that makes perfect sense [19:56:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:46] cool, cause that's what I was just finishing up :) [20:09:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.465 seconds [20:15:46] New patchset: awjrichards; "Enable wgMFForceSecureLogin on testwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42803 [20:18:33] New patchset: Hashar; "(bug 43729) create /mnt/srv and /srv on beta mw installs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42743 [20:20:39] <^demon> RoanKattouw: I don't think you need https://gerrit.wikimedia.org/r/#/c/2141/ anymore, but can you check? [20:20:42] New review: Hashar; "PS3: move the file{} snippets directly in the role class instead of a new manifests of mediawiki_new..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42743 [20:22:31] Change abandoned: Catrope; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2141 [20:22:41] ^demon: Abandoned, it's an old piece of junk [20:22:50] <^demon> I thought so :) [20:22:51] Hi Roan. [20:23:34] Hi Susan [20:23:53] I see you've taken a new name in the 3 weeks I've been gone :) [20:25:59] :D [20:26:08] I've missed you. You're back now? [20:26:33] Yeah [20:26:41] I was on vacation Dec 14ish till yesterday [20:26:51] And I'm in school on Mon and Wed now [20:27:10] Cool, cool. Did you go anywhere for vacation? [20:28:00] Holland, visiting family for the holidays [20:44:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:56:56] okay, my deployment [20:58:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [21:08:22] New patchset: MaxSem; "Enable wgMFForceSecureLogin on testwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42803 [21:08:57] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42803 [21:15:51] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [21:29:48] PROBLEM - Puppet freshness on knsq17 is CRITICAL: Puppet has not run in the last 10 hours [21:30:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:32:17] !log reedy synchronized php-1.21wmf7/extensions/TimedMediaHandler [21:32:27] Logged the message, Master [21:42:22] Change abandoned: Dzahn; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37153 [21:45:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.040 seconds [21:45:49] New patchset: Dzahn; "$wgCategoryCollation = 'identity' on iswiktionary for bug 30722" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42856 [21:46:54] PROBLEM - Puppet freshness on db1036 is CRITICAL: Puppet has not run in the last 10 hours [21:49:15] New review: Dzahn; "bug 30722" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/42856 [21:49:16] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42856 [21:53:59] !log dzahn synchronized ./wmf-config/InitialiseSettings.php [21:54:06] Logged the message, Master [21:55:03] New patchset: Ryan Lane; "Add support for dependent repos" [operations/puppet] 
(production) - https://gerrit.wikimedia.org/r/42859 [21:57:16] PROBLEM - Host ms-be1010 is DOWN: PING CRITICAL - Packet loss = 100% [21:59:31] PROBLEM - MySQL Slave Running on db1047 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error BIGINT UNSIGNED value is out of range in (enwiki.article_f [22:00:51] New review: Hashar; "note my deployment role for beta which introduce "common" classes https://gerrit.wikimedia.org/r/#/c..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42859 [22:00:51] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [22:00:51] PROBLEM - MySQL Slave Running on db59 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error BIGINT UNSIGNED value is out of range in (enwiki.article_f [22:00:52] PROBLEM - MySQL Slave Running on db1043 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error BIGINT UNSIGNED value is out of range in (enwiki.article_f [22:01:12] hahahaha^^^ [22:01:35] AFT overflown past BIGINT [22:11:32] hashar: which instance should be the deployment host? [22:11:35] bastion? [22:11:40] deployment-bastion [22:11:42] ok [22:12:17] that is where people connect too, kind of like tin + fenari + hume + whatever [22:12:21] indeed [22:13:12] I got a lame change to get /mnt/srv created on instances and /srv as a symlink to it [22:13:31] would let us match the production file system [22:13:35] yep [22:13:45] and bypass GlusterFS in favor of a local disk ( /dev/vdb ) [22:13:53] feel free to amend the change / merge it whenever you feel [22:13:53] yep [22:14:09] well, we don't want it on a shared filesystem for a bunch of reasons [22:14:22] what else is on /mnt on these systems? [22:14:27] nothing [22:14:28] I checked [22:14:52] the instances are bootstrapped using puppet classes and or use /home/wikipedia which are symlinks to /data/project [22:14:58] we could probably just change the mountpoint to /src [22:14:59] bbl hangout with Rob [22:15:00] err [22:15:02] /srv [22:15:11] oh /srv is unused [22:15:30] yeah, so the disk on /mnt just becomes /srv [22:15:47] by default it's /mnt, no reason to keep it that way [22:16:00] ahh if you can mount /dev/vdb on /srv it would be nicer :-) [22:16:28] !log tstarling synchronized php-1.21wmf7/includes/AutoLoader.php [22:16:39] Logged the message, Master [22:16:46] !log tstarling synchronized php-1.21wmf7/includes/ArrayUtils.php [22:16:59] Logged the message, Master [22:17:03] !log tstarling synchronized php-1.21wmf7/includes/objectcache/RedisBagOStuff.php [22:17:04] hashar: actually, it can be mounted in both spots [22:17:14] Logged the message, Master [22:17:24] !log tstarling synchronized php-1.21wmf7/includes/objectcache/SqlBagOStuff.php [22:17:32] binasher: there you go [22:17:34] Logged the message, Master [22:17:56] hashar: use the mount directive [22:18:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:18:29] is tin puppetized at all? [22:18:36] almost completely [22:18:45] I need to add in the sudo config [22:19:35] Your branch is behind 'origin/production' by 747 commits, and can be fast-forwarded. [22:19:39] I'm going to set up deployment-prep soon, so we'll see if I missed anything [22:19:56] ok, tin is there now ;) [22:22:29] TimStarling: thanks! so the current 'server' array becomes a 'servers' array of arrays? does dbname, user, password, type need to be defined for each individually? 
[22:22:51] TimStarling: is it just me or is RedisBagOStuff:debug not defined? [22:22:52] yes and yes [22:23:29] just you [22:23:37] it is in the parent [22:23:45] ah, ok [22:24:13] right, I was just going crazy for a second :) [22:24:41] binasher: so you could do something like... [22:25:11] $template = array( 'type' => 'mysql', 'user' => $wgDBuser, 'password' => $wgDBpassword ); [22:25:38] foreach ( array( 'pc1001', 'pc1002', 'pc1003' ) as $host ) { [22:25:59] $servers[] = array( 'host' => $host ) + $template; [22:25:59] } [22:27:10] that looks much nicer than what i was contemplating [22:29:03] Change abandoned: Dzahn; "per CT" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28741 [22:29:09] New patchset: Brion VIBBER; "MobileFrontend/Zero setup for XL Axiata in Indonensia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42867 [22:32:30] binasher: fixing MW seems to make those connection errors go away [22:32:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.552 seconds [22:33:11] AaronSchulz: what was the problem in mw? [22:34:33] the queue objects were not singletons nor did they do any other connection sharing, so runJobs.php kept making new ones (so the connection cache per object was useless) [22:34:48] I whipped up a quick JobQueueRedisPool and that problem is gone now [22:35:05] ah [22:35:31] binasher: yeah, silly :) [22:35:39] so not at all a problem with how RedisBagOStuff uses php-redis [22:36:00] and glad it's not a php-redis bug! [22:36:02] no, since we basically singleton the bagostuff in one global [22:38:59] Ryan_Lane: I see you have your own personal indenting style for manifest files [22:39:15] no. I've switched to the upstream styling [22:39:35] shall I write some vim scripts to automatically detect the author and switch indent method accordingly? [22:39:42] sorry about that [22:40:04] I think we should switch all of our manifests to the upstream style [22:40:17] no matter what we'll have this problem with external modules [22:40:50] hm. maybe we can put tabstop info into the manifests? [22:41:07] I've started switching all of my python to pep8 standards as well [22:41:19] once everything is working properly, I'm going to fully pep8 the python [22:42:03] Ryan_Lane: IIRC superm401 recently amended the code guidelines to permit spaces in Python code [22:42:16] tabs you mean? [22:43:01] I'll be honest, I prefer spaces in python. it allows you to align things in a saner way [22:43:13] PEP8 suggests 4-space indents for all new code and tabs for existing codebases, but in my experience tab-indents are so fantastically uncommon in the Python world that many people (myself included) specify it in their vimrc, etc. [22:43:22] scapping [22:43:49] ori-l: Ryan_Lane and you can ask pep8 to ignore the tab/whitespaces errors [22:44:13] hashar: pep8 won't complain unless you're inconsistent, in which case it's right to complain :) [22:44:20] ye[ [22:44:21] *yep [22:44:42] ah even better [22:44:45] I know I'm probably going to drive people insane since I'm being inconsistent with how we normally do things [22:44:46] Ryan_Lane: our python code is mostly glue to integrate with various existing Python projects, and the bulk of them use 4-space indents [22:44:54] exactly [22:45:02] and that's why I'm switching [22:45:10] to 4 spaces ? [22:45:12] yes [22:45:28] could you possibly convince mark ?
:) [22:45:33] viva la revolucion [22:45:42] New patchset: Andrew Bogott; "Rotate all gluster logfiles, and only keep three weeks' worth." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42796 [22:45:44] next step puppet-lint!!! ;-] [22:46:08] this is using salt. should I use salt style indents for modules, returners, runners, etc, and use tabs otherwise? no matter what I'm being inconsistent [22:46:31] so, yeah, spaces. [22:46:54] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [22:47:11] New patchset: Tim Starling; "Split scap scripts from other scripts useful on deployment hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42871 [22:47:12] New review: Hashar; "Following our conversation on IRC I guess you tested / asked whether -HUP works :-]" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/42796 [22:47:14] i declare today day 0 of year 0 of our new calendar [22:47:44] (which will be decimal, naturally.) [22:47:50] the day we decided to migrate to python ? [22:48:16] New review: Hashar; "Note: that is for https://bugzilla.wikimedia.org/show_bug.cgi?id=41104" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/42796 [22:48:20] naw, tolerate 4-space indents in python code [22:49:32] !log maxsem Started syncing Wikimedia installation... : Weekly mobile deployment [22:49:43] Logged the message, Master [22:49:46] err, aborted [22:50:10] ok moving out now :) [22:50:32] can someone fix permissions on /home/wikipedia/common/wmf-config/ExtensionMessages-1.21wmf7.php please? [22:51:02] MaxSem: Looks like it's 664 to me [22:51:07] don't we have set-group-write ? [22:51:18] And yes, we do [22:51:21] PHP Warning: file_put_contents(/home/wikipedia/common/wmf-config/ExtensionMessages-1.21wmf7.php): failed to open stream: Permission denied in /home/wikipedia/common/php-1.21wmf7/maintenance/mergeMessageFileList.php on line 119 [22:51:32] -rw-rw-r-- 1 reedy wikidev 20592 Dec 27 22:53 /home/wikipedia/common/wmf-config/ExtensionMessages-1.21wmf7.php [22:51:37] MaxSem: set-group-write should g+w all files under /h/w/common [22:51:39] Running as whom? [22:51:45] hashar: Examples I find online look like this: "/usr/bin/killall -HUP glusterfs > /dev/null 2>&1 || true" <- any idea what the || true is for? [22:52:03] RoanKattouw, as myself [22:52:28] andrewbogott: killall returns exit code 1 if no process matched [22:52:37] RoanKattouw: it needs to be owned by mwdeploy [22:52:43] andrewbogott: if the first succeeds then do NOT do the second [22:52:44] andrewbogott: I guess logrotate will abort / complain if the postrotate script returns with exit != 0 [22:52:50] hashar: ok, makes sense. [22:52:52] Also, it's not really anything you need to worry about unless you're deploying a new extension [22:52:53] andrewbogott: and the || true is there to ignore the error [22:52:53] Oh I see [22:52:55] Fixing [22:53:04] Even in that case, just add to it manually as it's grouped for wikidev [22:53:06] andrewbogott: also one could use: || : [22:53:07] ;) [22:53:11] opposite of && [22:53:11] (in bash) [22:53:37] MaxSem: Fixed [22:53:46] thanks! [22:53:59] ugh. I hate how renaming files makes any diff previously useful no longer useful [22:54:08] New patchset: Andrew Bogott; "Rotate all gluster logfiles, and only keep three weeks' worth."
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/42796 [22:54:27] New patchset: Ryan Lane; "Add support for dependent repos" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42859 [22:57:05] Ryan_Lane: you can ask git diff to change the rename detection threshold: git diff -M90% iirc [22:57:20] though it is not possible in Gerrit diff view :/ [22:57:23] heh [22:57:29] New patchset: Ryan Lane; "Add support for dependent repos" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42859 [22:57:29] well, that's good to know [22:57:39] I usually review my patches on my laptop just to be able to do that [22:57:41] hashar: Ooooooh that's nice [22:57:46] -M90% works with "git show" too :-] [22:57:55] so you can git-review -d then git show -M90% [22:58:17] and still have a list of the changes in the renamed file (tweak percentage as needed) [22:58:20] jeremyb: so - i want to merge https://gerrit.wikimedia.org/r/#/c/22698/4 but my rebase-foo is subpar [22:58:34] RoanKattouw: appreciating your support :-] [22:58:45] i was considering just redoing that change manually [22:58:57] jeremyb: if you wouldn't mind [22:59:01] (going through old gerrit changes) [23:00:36] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42796 [23:01:33] New patchset: Ryan Lane; "Add support for dependent repos" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42859 [23:02:31] Change merged: preilly; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42867 [23:06:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:06:14] !log maxsem Started syncing Wikimedia installation... : Weekly mobile deployment [23:06:24] Logged the message, Master [23:08:12] New review: Anomie; "Not that I know much about this..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/42871 [23:14:47] New patchset: Ryan Lane; "Add support for dependent repos" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42859 [23:14:58] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 185 seconds [23:15:08] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 189 seconds [23:17:47] New patchset: Lcarr; "fixing some icinga errors - need libssl.so.0.9.8" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42881 [23:18:28] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42881 [23:18:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.116 seconds [23:19:09] preilly: i'm merging on sockpuppet - are your varnish acl changes ok to merge ?
[23:19:19] LeslieCarr: yes [23:19:32] ok, merging now [23:19:41] LeslieCarr: Okay cool thanks [23:25:56] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42859 [23:28:26] New patchset: Ryan Lane; "Add forgotten parameter to deployment::salt_master" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42882 [23:30:24] New patchset: Ryan Lane; "Follow up to I56379393" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42882 [23:30:49] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42882 [23:38:11] New patchset: Ryan Lane; "Another follow up to I56379393" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42883 [23:39:46] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42883 [23:45:47] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [23:45:55] binasher, why do you think the mobile HTML is mangled too? I didn't find anything on the page you've pointed at [23:46:51] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [23:49:11] New review: Andrew Bogott; "The content here looks fine; I've made a couple of suggestions to improve ease of reading and maint..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/42786 [23:49:39] MaxSem: ok, js is mangling it after the fact [23:49:56] binasher, currently scapping a fix [23:50:07] yay, don't have to flush varnish [23:51:05] dunno about flushes, scap doesn't mean flush isn't needed:) [23:51:34] New patchset: Tim Starling; "Script updates for the new deployment system" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42887 [23:51:34] New patchset: Tim Starling; "Split scap scripts from other scripts useful on deployment hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42871 [23:51:35] you shouldn't need to flush html cache to update js loaded from bits [23:52:32] MaxSem: so now that i'm actually looking at cached html from mobilefrontend… [23:52:51] "mobile-frontend-logged-in-toast-notification":"Logged in as 207.231.231.103." "stopMobileRedirectCookieName":"stopMobileRedirect","stopMobileRedirectCookieDuration":15552000,"stopMobileRedirectCookieDomain":".wikipedia.org","hookOptions":"","username":"207.231.231.103" [23:52:56] that is not my ip address… [23:53:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:53:17] information about users is being put in the html response directly in script tags and cached in varnish for others to see [23:53:20] that's bad [23:53:44] at least it's fast, who cares about cache poisoning [23:54:36] New patchset: Tim Starling; "Script updates for the new deployment system" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42887 [23:54:36] oh shi~~~ [23:55:30] first of all, it shouldn't tell anons they're logged in;) [23:56:45] !log maxsem Finished syncing Wikimedia installation... : Weekly mobile deployment [23:56:54] Logged the message, Master [23:59:10] LeslieCarr: sorry, idk why i didn't hear a bell [23:59:17] LeslieCarr: i'll be back in 15-20 mins [23:59:58] New review: Tim Starling; "PS2: sorry, accidental automatic rebase" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/42871