[00:01:07] $wgTiffMaxMetaSize = 64*1024; [00:01:19] * AaronSchulz wonders what explodes when that is higher [00:01:19] so...the TIFF metadata thing may be related to a recent shell bug [00:01:34] * robla looks up the bug he's thinking of [00:02:04] the max thumb area? [00:02:20] AaronSchulz: why would that change with precise deployment? [00:02:26] these seem to be below 25 Mpx anyway [00:03:37] the overall rate of "thumbnail failed" messages in thumbnail.log is actually a bit lower than a week ago [00:04:19] Nemo_bis: yeah, that's the one: https://bugzilla.wikimedia.org/show_bug.cgi?id=41125 [00:04:54] TimStarling: well it does shell out in retrieveMetadata(), not sure why the metadata would enlarge [00:05:09] it probably isn't related...probably [00:05:26] binasher: which is always quite high :) [00:05:42] the obvious oggThumb error is "OggHandler requires oggThumb version 0.9 or later" [00:05:52] which it does, and the code that generates that error is pretty specific [00:06:08] that's all I get for the ogg files after a few refreshes [00:06:36] it's not something we can fix easily in MW [00:06:52] what is the version being used? Is it not present? [00:07:05] it's present [00:07:13] if ( count( $lines ) > 0 [00:07:13] && preg_match( '/invalid option -- \'n\'$/', $lines[0] ) ) [00:07:13] { [00:07:13] return wfMessage( 'ogg-oggThumb-version', '0.9' )->inContentLanguage()->text(); [00:07:19] see, very specific [00:07:27] it checks for an "invalid option" message [00:07:33] TimStarling: did we have a backported version in Lucid or something? [00:07:37] yes [00:08:32] root@srv220:~# dpkg-query -W oggvideotools [00:08:32] oggvideotools 0.8-1 [00:08:35] binasher: paravoid: would it be faster to build the new package, or downgrade the scalers? [00:09:00] new package [00:09:01] or set up existing lucid apaches as scalers [00:09:30] add them to the puppet group, create /a/magick-tmp [00:09:38] let me check quickly [00:09:46] modify pybal conf, then you're pretty much done, right? [00:10:00] lucid-wikimedia|main|amd64: oggvideotools 0.8a-1 [00:10:01] lucid-wikimedia|main|amd64: oggvideotools-dbg 0.8a-1 [00:10:01] lucid-wikimedia|main|source: oggvideotools 0.8a-1 [00:10:12] that's what we had in lucid [00:10:19] let's check what 0.8"a" is [00:10:29] maybe we had a patch from 9? [00:10:31] a==awesome [00:11:17] i made the packaging for 0.8a from scratch [00:11:27] it should just build directly on precise [00:11:29] oggvideotools (0.8a-1) lucid-wikimedia; urgency=low [00:11:29] * Initial packaging of CMake build based oggtools [00:11:29] - all new debian/* [00:11:29] * Minor bugfixes and enhancements [00:11:29] -- Asher Feldman Fri, 09 Sep 2011 14:06:00 -0700 [00:11:32] oggvideotools (0.8-1) unstable; urgency=low [00:12:54] so.....someone starting to build? [00:12:55] so why MW says that it needs 0.9? [00:13:05] 0.9 doesn't exist [00:13:12] yes robla [00:13:20] cool, thanks! [00:13:33] oh you beat me to it? [00:13:34] oh well [00:13:46] no [00:14:10] because I thought that they might get around to releasing some time in the two years after they made that change [00:14:29] and 0.9 is the version our changes would be released into [00:14:34] oh....we're using the 0.9 alpha [00:15:04] which is called 0.8a because....I'm assuming because it's alpha software [00:15:15] binasher: did you modify 0.8a in any way? 
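A minimal sketch of the package check behind the exchange above (the dpkg-query call is quoted at 00:08; the apt-cache call and the idea of comparing against the lucid-wikimedia build are assumptions added for illustration):

    # On an image scaler, confirm which oggvideotools build is installed and which
    # versions the configured repositories offer. The log shows plain 0.8-1 on the
    # precise scalers, while lucid-wikimedia carried the patched 0.8a-1 build that
    # OggHandler's "requires 0.9" probe accepts.
    dpkg-query -W oggvideotools
    apt-cache policy oggvideotools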
[00:15:19] 0.8a was an official release [00:15:27] no [00:15:28] I know [00:15:29] okay [00:15:45] I'll backport the quantal package then if you don't mind [00:16:13] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:16:14] not that anything's wrong with yours, it's just that if anything's wrong with the quantal ones I'd like them to get fixed before our next upgrade :) [00:19:36] I'm isolating that TIFF metadata thing [00:20:27] seems to be 450KB either way [00:20:32] I'll just increase the limit [00:21:23] Now it would be funny if stuff blew up after saying "I'll just increase the limit" :) [00:21:45] * AaronSchulz wonders why the limit had the value it did [00:23:10] New patchset: Tim Starling; "Increase $wgTiffMaxMetaSize" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29913 [00:23:34] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29913 [00:24:17] !log tstarling synchronized wmf-config/CommonSettings.php [00:24:25] Logged the message, Master [00:25:15] TimStarling: time to start testing tiffs? [00:25:46] I loaded a few of the test cases on commons, they all worked [00:25:50] but you can test too if you like [00:25:59] yeah, it seems to work [00:27:12] !log deploying updated squid mobile redirector [00:27:22] Logged the message, Master [00:27:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.404 seconds [00:28:52] the PDF issue could be OOM [00:33:11] maybe not, I tried a test case and it seems to be broken on both lucid and precise [00:34:10] with no memory limit [00:34:20] it's possible that the pdfs are an old problem [00:34:56] !log upgrading oggvideotools to 0.8a on all imagescalers, fixing regression on the lucid->precise upgrade [00:35:09] Logged the message, Master [00:35:09] ok, lets have our one-on-one meeting now [00:35:19] and done [00:35:35] TimStarling: yup [00:36:36] paravoid: already? [00:36:36] yes [00:36:36] Error creating thumbnail: oggThumb failed to create the thumbnail. [00:36:39] yay... 
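A hedged sketch of the kind of configuration bump synced above (Gerrit change 29913): the log gives the old value and the ~450KB metadata size, but the exact new limit is an assumption made for illustration.

    // wmf-config/CommonSettings.php (sketch, not the actual diff)
    // Old value quoted at the top of the log; 64KB was smaller than the ~450KB
    // of metadata seen on the failing TIFFs.
    // $wgTiffMaxMetaSize = 64 * 1024;
    $wgTiffMaxMetaSize = 1024 * 1024; // illustrative new value, comfortably above ~450KB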
[00:36:50] well no version error anymore [00:55:45] New patchset: Asher; "remove duplicate nagios grp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29920 [00:56:08] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29920 [01:02:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:03:46] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [01:14:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.439 seconds [01:26:52] PROBLEM - Puppet freshness on srv222 is CRITICAL: Puppet has not run in the last 10 hours [01:27:46] PROBLEM - Puppet freshness on srv221 is CRITICAL: Puppet has not run in the last 10 hours [01:40:40] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 295 seconds [01:49:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:51:55] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [01:52:13] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [02:00:41] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 306 seconds [02:03:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [02:35:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:37:52] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [02:37:52] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [02:38:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [02:42:10] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Thu Oct 25 02:41:35 UTC 2012 [02:42:10] RECOVERY - Puppet freshness on srv222 is OK: puppet ran at Thu Oct 25 02:41:36 UTC 2012 [02:44:17] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [02:48:10] RECOVERY - Puppet freshness on srv221 is OK: puppet ran at Thu Oct 25 02:48:02 UTC 2012 [02:50:43] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 214 seconds [02:50:43] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.004 second response time on port 11000 [02:50:43] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 217 seconds [02:54:46] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [02:57:19] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [02:57:19] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [03:11:52] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [03:24:46] PROBLEM - Puppet freshness on mw40 is CRITICAL: Puppet has not run in the last 10 hours [03:31:40] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 206 seconds [03:31:40] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 208 seconds [03:36:38] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds [03:36:38] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds [03:45:28] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 13 seconds [04:33:10] RECOVERY - Puppet freshness on spence is 
OK: puppet ran at Thu Oct 25 04:33:00 UTC 2012 [04:36:55] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [07:12:50] !log Stopping backend squid on sq82, sda I/O errors [07:13:11] Logged the message, Master [07:25:46] New patchset: Mark Bergsma; "Install ngrep on all machines" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29929 [07:29:58] New patchset: Mark Bergsma; "Fix memcached monitoring mess" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29930 [07:35:26] New patchset: Mark Bergsma; "Install ngrep on all machines" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29929 [07:35:26] New patchset: Mark Bergsma; "Fix memcached monitoring mess" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29930 [07:35:56] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29929 [07:36:25] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29930 [08:33:29] !log Added LVS service IPs for wikidata and wikivoyage (pmtpa/eqiad) to DNS [08:33:41] Logged the message, Master [08:35:18] ack, they look okay [08:35:41] ? [08:35:55] the DNS entries :) [08:36:51] New patchset: Mark Bergsma; "Fix nagios group description" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29934 [08:37:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29934 [08:39:27] paravoid: mark: I would like to extract a PHP linting script out of misc::deployment::scripts, I am wondering if should create a puppet module such as wmfscripts which would have a wmfscripts::phplinter or simply add a new manifest under misc (such as misc::phplinter ). [08:39:41] the change itself is pretty straightforward, simply need to require a package and copy two files :-] [08:40:10] I just can't make a choice between module or an additional class in the main config [08:41:19] we're slowly moving into modules but if you're not willing to clean things up and move other things besides phplinter to the module, I think a manifest under misc is ok. [08:43:04] I guess I should write the module so. No point in continuing pilling stuff I guess [08:43:18] would it be acceptable to write a base wmf scripts module that simply provide the PHP linter for now? [08:43:25] we could move the other scripts over time [08:43:54] !log Added georecords for wikidata-lb.wikimedia.org and wikivoyage-lb.wikimedia.org, their geomaps containing just a default entry pointing to eqiad [08:44:07] Logged the message, Master [08:45:50] mark: want me to do anything wrt wikidata/voyage? [08:46:07] no [08:46:08] I thought Daniel was doing most of it, but I'd be happy to if you want [08:46:29] yeah well but yesterday evening it was clear to me he didn't really understand how it (dns mostly) works [08:46:38] so I wasn't comfortable with him doing it without supervision [08:46:44] so I offered to do it today and tell him what I did [08:47:28] aha, okay [08:47:41] I think I know how DNS works, but I also think you did everything already, so... 
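A minimal sketch of the module layout hashar is weighing at 08:39–08:43 (the wmfscripts/phplinter names and the "one package plus copied files" shape come from the conversation; the package name, file path and mode are assumptions):

    # modules/wmfscripts/manifests/init.pp
    class wmfscripts {
    }

    # modules/wmfscripts/manifests/phplinter.pp
    class wmfscripts::phplinter {
        package { 'php5-cli':
            ensure => present,
        }
        # the second copied file mentioned in the log would follow the same pattern
        file { '/usr/local/bin/lint-php':
            source => 'puppet:///modules/wmfscripts/lint-php',
            owner  => 'root',
            group  => 'root',
            mode   => '0555',
        }
    }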
[08:51:04] New patchset: Mark Bergsma; "Add IPv6 LVS service IPs for wikidata/wikivoyage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29936 [08:54:51] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29936 [08:57:15] grr slow puppet [09:03:15] New patchset: Hashar; "move PHP linter to a new `wmfscripts` module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29937 [09:03:39] paravoid: here the lame change to move the PHP linter out of the main manifests to a new 'wmfscripts' module https://gerrit.wikimedia.org/r/29937 [09:30:48] New patchset: Mark Bergsma; "Add LVS realserver IPs to protoproxy hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29938 [09:31:21] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29938 [09:35:35] New patchset: Mark Bergsma; "Add IPv6 service IP to wikidata SSL service, enable IPv6" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29939 [09:36:09] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29939 [09:42:36] PROBLEM - Puppet freshness on cp1043 is CRITICAL: Puppet has not run in the last 10 hours [09:42:36] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [09:42:36] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [09:42:36] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [09:42:57] !log Hacked Nagios back up [09:43:01] it'll break again [09:43:06] Logged the message, Master [09:44:42] PROBLEM - LVS HTTPS IPv4 on wikivoyage-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused [09:45:11] PROBLEM - LVS HTTPS IPv4 on wikidata-lb.pmtpa.wikimedia.org is CRITICAL: Connection refused [09:45:18] which brilliant mind setup LVS monitoring of lvs services that hadn't even been created yet [09:45:36] PROBLEM - Backend Squid HTTP on sq82 is CRITICAL: Connection refused [09:45:36] PROBLEM - SSH on nickel is CRITICAL: Server answer: [09:45:54] PROBLEM - HTTP on nickel is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:48:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:53:57] ok can I put some traffic on eqiad now paravoid? [09:55:18] hm? [09:55:27] varnish [09:55:50] i.e. more swift traffic due to misses [09:55:55] caches are empty [09:56:11] ganglia doesn't work for me [09:56:33] indeed [09:56:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.041 seconds [09:56:52] nickel is down [09:57:06] is it "no monitoring day"? 
[09:57:34] people have screwed it up quite well with that memcached stuff [09:57:40] might be ganglia related also [09:58:30] RECOVERY - LVS HTTPS IPv4 on wikidata-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 2528 bytes in 0.030 seconds [09:58:45] looks like swapping [09:58:52] Inickel login: root [09:58:52] Password: [09:58:52] Last login: Wed Oct 24 21:39:27 UTC 2012 from ool-45755507.dyn.optonline.net on pts/1 [09:58:58] and can't get a shell [10:01:57] New patchset: Mark Bergsma; "Fix eqiad IPv6 addresses" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29941 [10:02:31] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29941 [10:03:57] !log powercycling nickel, OOM & unable to login [10:04:11] Logged the message, Master [10:05:08] mark: I don't see a problem with loading swift more, although I'd like to have ganglia back before you do that [10:05:31] me too [10:05:51] RECOVERY - SSH on nickel is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [10:06:47] RECOVERY - HTTP on nickel is OK: HTTP OK - HTTP/1.1 302 Found - 0.063 second response time [10:08:04] why did we lose 2½ hours? [10:08:40] (of ganglia data) [10:13:35] New patchset: Mark Bergsma; "Add LVS service monitoring for wikidata/wikivoyage HTTP" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29944 [10:14:23] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29944 [10:24:55] !log Sending Canadian upload traffic to eqiad [10:25:11] Logged the message, Master [10:25:17] yay [10:25:46] I should totally replace the prompt with "\o/, master" [10:29:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:31:35] I'm *so* looking forward to ditching squid [10:32:33] me too [10:32:43] there's almost no load now, as it's night [10:32:46] I think i'll add some spanish traffic as well [10:32:57] not that it really matters for upload anyway [10:37:11] !log Sending Brazil upload traffic to eqiad [10:37:24] Logged the message, Master [10:38:00] (ok ok portugese ;) [10:38:58] > 20% hit rate now [10:39:48] more like 44% actually [10:41:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.787 seconds [10:42:11] PROBLEM - Auth DNS on ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [10:42:39] restarted [10:43:39] RECOVERY - Auth DNS on ns2.wikimedia.org is OK: DNS OK: 0.137 seconds response time. www.wikipedia.org returns 208.80.154.225 [10:56:52] !log Sending Argentina upload traffic to eqiad [10:57:04] Logged the message, Master [11:00:31] added mexico [11:02:28] j^: btw, https://bugs.launchpad.net/bugs/1071085 (openstack-docs bug about the missing documentation for arbitrary headers) [11:03:43] ok [11:03:51] perhaps in an hour or so, i'll see if i can add the US [11:04:21] bbl [11:04:34] mark: I'm leaving for a few hours [11:04:44] I'll be back at 14:00 UTC for the gallium upgrade [11:05:17] (and after the upgrade booked with meetings until 1am localtime...) [11:05:31] don't hesitate to call me if anything's wrong with swift. 
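One hypothetical way to sanity-check the hit-rate figures quoted at 10:38–10:39 directly on an eqiad upload varnish while ganglia is unavailable (this command is an assumption, not something run in the log):

    # Ratio of cache hits to total lookups since varnishd started.
    varnishstat -1 | awk '$1 ~ /^cache_(hit|miss)$/ { v[$1] = $2 }
        END { printf "hit rate: %.0f%%\n", 100 * v["cache_hit"] / (v["cache_hit"] + v["cache_miss"]) }'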
[11:15:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:21:01] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [11:22:21] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [11:30:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [11:32:42] ok, all is well. see you in ~2h. [11:32:52] yep [11:47:35] ori-l, it's a bad idea to read in the bath about syslogd and get ideas. [11:50:52] 1. the well-known system fields should all be prefixed by one underscore. _browser_time , _user_anon_token, _page_id, etc. Mention any of these in your data model, and logEvent() will supply it. Two underscores are for PHP system fields for clicktracking on the server. [11:52:39] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [11:54:01] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29809 [11:59:20] ^demon: thanks for the integration gerrit user rights :) [11:59:42] PROBLEM - Puppet freshness on sq76 is CRITICAL: Puppet has not run in the last 10 hours [12:00:51] <^demon> hashar: yw [12:01:05] !log demon synchronized wmf-config/wgConf.php 'Syncing out new prefixes for wikidata/wikivoyage' [12:01:10] !log Sending US upload traffic to eqiad [12:01:20] Logged the message, Master [12:01:34] Logged the message, Master [12:03:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:08:15] well well [12:08:18] swift is doing 1000 req/s [12:17:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.045 seconds [12:38:42] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [12:38:42] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [12:49:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:55:48] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [13:04:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.053 seconds [13:14:03] New patchset: Hashar; "jenkins: OpenStack jenkins-job-builder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24620 [13:15:27] paravoid: ^demon aren't we doing the upgrades to day , [13:15:28] ? [13:15:31] or was it friday? [13:15:40] <^demon> Today, I thought. [13:16:13] been swamped in some puppet manifest, haven't seen the time yet ;-] [13:16:29] <^demon> In 45m :) [13:16:54] oh my god [13:16:56] I am so bad [13:16:58] with timezone [13:17:08] I though I was already in GMT+1 but still in +2 hehe [13:17:31] I need to get out by 14:50 UTC :-( [13:17:45] so left me only 50 minutes $$$$$ [13:17:59] <^demon> Well you've only got 1 box to dist-upgrade. I've got 2 :) [13:18:01] so cold outside that I already adjusted to the non DST time [13:18:07] true [13:18:55] once Gerrit has restarted, we will have to verify the Gerrit Trigger plugin is still communicating with Gerrit [13:19:09] it might need to be restarted via the Jenkins web interface [13:19:12] <^demon> If paravoid is willing, you could start gallium a bit earlier. I don't want to start gerrit early since I announce the window (and it affects more people) [13:19:37] yup [13:19:42] New review: TheDJ; "what about svgs with png thumbnails. Would that be a problem ?" 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/29805 [13:19:44] <^demon> But that's up to him :) [13:21:02] <^demon> Add this to the long list of reasons I *hate* DST :) [13:21:32] I love having an extra hour of light in the evening [13:27:56] I have soooo many pending changes https://gerrit.wikimedia.org/r/#/q/owner:hashar+is:open,n,z [13:36:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:38:53] <^demon> hashar: I merged all your tweaks to sql.php [13:38:59] <^demon> Looked nice :) [13:39:08] ^demon: thanks :-] [13:45:57] fresh air, brb [13:49:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.913 seconds [13:53:08] hashar: you mean, fresh air, mixed with tobacco? ;) [13:55:28] Reedy: indeed :/ [13:55:35] I must quit smoking [13:56:24] heh [13:56:33] Any more thoughts about the nl hackathon? [13:59:08] Reedy: are you coming ? [13:59:37] we had a few mail exchanges [14:00:10] I think I'm going to [14:00:59] I transferred you 4 mails [14:01:06] <^demon> hashar, paravoid: You guys ready to start? [14:01:19] Reedy: would be mostly about CI stuff / Jenkins :-] [14:01:33] I am [14:01:42] but I think paravoid disappeared :-] [14:03:15] <^demon> uh oh [14:05:09] !log Sending all non-European upload traffic to eqiad [14:05:23] Logged the message, Master [14:06:19] heya [14:06:28] <^demon> Ah there he is :) [14:07:33] <^demon> I was going to step out for 5 minutes to the store across the street, then I'll be ready. [14:08:15] <^demon> Feel free to start gallium though, hashar's on a short timeline today [14:08:32] yeah I'm starting [14:09:09] i'm finishing [14:09:17] hmm? [14:09:42] I'm impressed by swift. [14:09:51] yes [14:09:56] it behaved quite well today [14:14:16] New patchset: Hashar; "jenkins: OpenStack jenkins-job-builder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24620 [14:16:39] <^demon> paravoid: I'm going to start on formey now (gerrit slave) before doing manganese (master) [14:21:32] puppet I hate you :( [14:22:42] grr [14:23:24] hashar: why the hell do we have postgres on gallium? [14:23:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:23:38] you can remove it for now [14:23:53] we originally wanted to run unit tests using a postgre backend [14:24:01] but it is not used yet, so feel free to remove it [14:24:10] <^demon> Well, rephrase: "we wanted to run it on alternate backends" [14:24:21] <^demon> There wasn't much about wanting postgres itself :p [14:24:30] oh [14:24:36] forgot you are not part of that cabal [14:24:38] sorry ;-] [14:24:48] PROBLEM - HTTPS on formey is CRITICAL: Connection refused [14:25:21] PROBLEM - HTTP on formey is CRITICAL: Connection refused [14:25:30] pff [14:25:33] <^demon> Yes yes, hang tight nagios. [14:26:04] one day I will have to look at the nagios suite and make the service checks dependent on the host check [14:26:14] so the service stop reporting they are down when the host does not even ping [14:28:15] <^demon> Why the heck is exim running on formey? [14:28:20] <^demon> Probably something stupid for svn :\ [14:28:33] local mta perhaps? [14:28:36] <^demon> Possibly. [14:28:37] every server runs exim [14:28:39] isn't it installed by default on all instances so the box can send mails by themselves ? [14:28:47] aka smtp( host => localhost ) [14:28:48] just usually not as a deamon [14:29:07] <^demon> It stopped...started during the upgrade. 
[14:30:34] !g v [14:30:34] https://gerrit.wikimedia.org/r/#q,v,n,z [14:30:40] !g 26420 [14:30:40] https://gerrit.wikimedia.org/r/#q,26420,n,z [14:33:05] New review: Hashar; "being tested on instance `integration-jobbuilder`. The password is not expanded despite it being the..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/24620 [14:33:45] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [14:34:39] PROBLEM - Puppet freshness on db9 is CRITICAL: Puppet has not run in the last 10 hours [14:34:52] poor db9 [14:35:10] paravoid: how it is going on on gallium? [14:35:47] progressing. [14:36:17] New patchset: Hashar; "jenkins: OpenStack jenkins-job-builder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24620 [14:36:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.850 seconds [14:36:43] New review: Hashar; "Fixed ;] Simply dont put a dollar sign in templates!" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/24620 [14:37:48] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [14:37:49] <^demon> Done on formey, rebooting. [14:39:51] <^demon> Hmm, can't seem to SSH to it. Pinging ok. [14:42:21] <^demon> paravoid: ^ [14:42:27] PROBLEM - SSH on formey is CRITICAL: Connection refused [14:43:01] there you go paravoid [14:43:06] they can fix sq82 now ;-) [14:43:42] bah more broken packages :-( [14:44:14] jeff_green: db78 is ready for you [14:44:22] mark: ? [14:44:36] cmjohnson1: great--thank you! [14:44:39] ^demon: okay, let me login to the mgmt [14:45:06] well i got people off squid [14:45:12] ah [14:46:01] paravoid: according to apt log the gallium upgrade does work that well [14:46:06] we can resume after formey/ gerrit [14:46:16] must grab my daughter :/ [14:46:23] I have though we were out of DST already duh [14:47:05] it's progressing, I'm fixing problems as I go [14:47:20] I'm not worried [14:47:42] i noticed there is no candidate for the testswarm package not an issue though since we are no more using it [14:47:54] can be removed from the box [14:48:14] will reconnect in about 50 minutes [14:48:28] ^demon: nothing on the console. what did you do? [14:48:37] brb [14:48:58] <^demon> paravoid: I just did do-release-upgrade. Nothing notable happened. Got to the end and rebooted. [14:49:19] tried SSHing to 1022? [14:49:33] huh, doesn't work [14:49:36] ok then, I'll powercycle it [14:49:36] <^demon> Yes, didn't work. [14:52:43] !log shutting down srv194 to replace disk [14:52:55] Logged the message, Master [14:54:23] !log shutting down sq82 to replace /dev/sda [14:54:35] Logged the message, Master [14:55:30] PROBLEM - Host formey is DOWN: CRITICAL - Host Unreachable (208.80.152.147) [14:56:06] PROBLEM - LVS HTTP IPv4 on wikivoyage-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:56:33] The disk drive for /svnroot is not ready yet or not present. [14:56:49] The disk drive for /var/lib/gerrit2 is not ready yet or not present. [14:56:53] Continue to wait, or Press S to skip mounting or M for manual recovery [14:56:59] SSh should be back. 
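A hypothetical recovery sequence for the console prompt quoted at 14:56 ("The disk drive for /svnroot is not ready yet or not present"); none of these commands appear in the log, they only illustrate the usual way to bring LVM-backed mounts up by hand:

    lvs            # are the logical volumes behind /svnroot and /var/lib/gerrit2 active?
    vgchange -ay   # activate any volume groups that were skipped at boot
    mount -a       # retry the fstab entries, then let the boot continue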
[14:57:27] RECOVERY - SSH on formey is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:57:36] RECOVERY - Host formey is UP: PING OK - Packet loss = 0%, RTA = 0.15 ms [14:57:40] I'm going to silence that [14:57:54] !log silencing wikivoyage checks in nagios until deployed [14:58:08] Logged the message, notpeter [14:58:30] PROBLEM - Host sq82 is DOWN: PING CRITICAL - Packet loss = 100% [14:59:24] ^demon: are you fixing those mount errors or do you need help? [15:00:05] notpeter: kudos [15:00:40] <^demon> paravoid: I'm not sure how, I'm afraid :\ [15:00:59] don't worry, I'll have a look :-) [15:01:26] <^demon> Also, it's not letting me ssh (prompting for password) [15:01:30] <^demon> But at least ssh is up. [15:01:40] it doesn't? [15:01:47] it lets me just fine [15:05:21] <^demon> paravoid: I've got my key in /home/demon/.ssh/authorized_keys, right? [15:06:07] Oct 25 15:05:52 formey sshd[13950]: input_userauth_request: invalid user gerrit2 [preauth] [15:06:12] Oct 25 15:05:52 formey sshd[13906]: Failed password for invalid user demon from 208.80.152.165 port 40703 ssh2 [15:06:26] oh, ldap [15:06:28] yay [15:06:58] <^demon> Oh *fun* [15:07:23] <^demon> I kept all the current settings for LDAP, rather than distro-installed crap. [15:08:19] let me see about that lvm problem first [15:11:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:15:27] RECOVERY - Host sq82 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [15:17:57] friggin puppet [15:18:18] super extra slow [15:18:27] PROBLEM - SSH on sq82 is CRITICAL: Connection refused [15:19:30] PROBLEM - Frontend Squid HTTP on sq82 is CRITICAL: Connection refused [15:21:50] <^demon> Why is gerrit-wm_ joining? [15:21:59] <^demon> formey's not supposed to have the stupid irc bot. [15:22:02] <^demon> I'll fix that. [15:22:15] <^demon> Oh, can't still. [15:23:08] I'm waiting for puppet to run for ages now. [15:23:22] there's a ldap config change that's conditional on the ubuntu version [15:23:24] puppet should fix it [15:23:50] <^demon> Ah, gotcha. [15:24:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.347 seconds [15:25:27] but i-uri ldap://nfs1.pmtpa.wmnet:389 [15:25:27] +uri ldap://virt0.wikimedia.org:389 ldap://virt1000.wikimedia.org:389 [15:25:31] hmm [15:25:33] puppet has been extra painful today, dunno why [15:27:07] <^demon> It's supposed to use virt0/virt1000 since it's labs ldap. [15:27:43] let's see. [15:28:21] PROBLEM - Host sq82 is DOWN: PING CRITICAL - Packet loss = 100% [15:29:40] <^demon> paravoid: I'm in now. [15:30:12] the lvm issue isn't fixed yet unfortunately [15:38:58] sigh, ubuntu bug [15:43:38] I just loooove Ubuntu's QA [15:45:45] PROBLEM - SSH on formey is CRITICAL: Connection refused [15:47:06] RECOVERY - SSH on formey is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [15:47:30] * Starting MySQL Server [fail] [15:47:32] Cannot find a JRE or JDK. Please set JAVA_HOME to a >=1.6 JRE [15:47:35] otherwise done [15:47:45] i.e. lvm works, ssh should work [15:48:18] RECOVERY - HTTP on formey is OK: HTTP OK HTTP/1.1 200 OK - 3596 bytes in 0.012 seconds [15:48:27] RECOVERY - HTTPS on formey is OK: OK - Certificate will expire on 08/22/2015 22:23. [15:49:11] ^demon: is formey supposed to run a mysql? [15:49:26] it has 5.1 removed but no 5.5 installed [15:49:26] <^demon> Not for any purpose that I can remember. [15:49:47] <^demon> Yeah, I'll fix the java issue for gerrit too. 
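A hedged sketch (not the actual manifest) of the release-conditional LDAP client setting described at 15:23–15:27, where precise hosts get the labs LDAP servers shown in the diff while older hosts keep the previous one:

    $ldap_uri = $::lsbdistcodename ? {
        'lucid' => 'ldap://nfs1.pmtpa.wmnet:389',
        default => 'ldap://virt0.wikimedia.org:389 ldap://virt1000.wikimedia.org:389',
    }
    # A file/template resource would then interpolate $ldap_uri into the
    # LDAP client configuration's "uri" line.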
[15:50:00] great [15:50:09] also, you seem to have a newer gerrit than what is in apt, I presume on purpose. [15:51:20] <^demon> Yes. [15:51:34] <^demon> Which is why we just use ensure=>present now rather than pin it. [15:52:22] gallium is also rebooted but I have no idea what to check :-) [15:52:34] <^demon> Nor do I. [15:53:55] <^demon> And gerrit's back up on the slave. [15:54:00] :-) [15:54:08] <^demon> paravoid: If you could merge https://gerrit.wikimedia.org/r/#/c/28232/, it'll keep the next puppet run from killing it. [15:56:01] hashar: welcome back. gallium is back from the reboot for a while, can you check how's jenkins [15:56:09] and if anything's broken? [15:56:21] ^demon: did you do manganese too? [15:56:33] <^demon> Not yet. I didn't want to risk bringing them both down at once. [15:56:44] but that changeset is safe to merge? :) [15:57:01] back [15:57:03] <^demon> We can start manganese now, so it won't run puppet again :) [15:57:06] sorry about the timezone madness :/ [15:58:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:58:54] <^demon> paravoid: Although, we're at the end of our window. Maybe I should just do manganese another time? [15:59:01] your call [15:59:03] <^demon> Ugh, but then they'll be inconsistent. [15:59:10] <^demon> And puppet will break one or the other [15:59:17] <^demon> Let's just do it now. [15:59:56] paravoid: seems to be working fine from my first checks [16:01:16] paravoid: so Ubuntu "just" replace packages in place ? [16:01:19] <^demon> Ryan_Lane: Do you remember why on earth we might've had mysql-server running on formey? [16:01:24] ^demon: 503 on gerrit [16:01:30] ^demon: so, can't merge that :-) [16:02:26] <^demon> Bringing it back up for a moment. [16:02:33] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28232 [16:02:35] <^demon> Ok, up. [16:03:02] <^demon> Lemme know when you've pulled to puppetmaster. [16:04:05] I did. [16:04:32] I'm also on a conference call that's starting right about now [16:04:34] ^demon: it may have been for gerrt [16:04:36] so I'll be lagging [16:04:38] *gerrit [16:04:52] <^demon> Before we used db9 and then db1048? [16:05:00] <^demon> That was my best thought. [16:05:07] ok meeting [16:07:59] ^demon: are you still upgrading gerrit ? [16:08:08] <^demon> Master right now. [16:08:22] that explain the connection down in jenkins :] [16:08:40] <^demon> paravoid: http://p.defau.lt/?aS8ZtygqZuAObko2tz_0tg [16:09:47] I forced it with apt-get -f install [16:10:09] <^demon> Mmk. [16:10:16] I'm not sure if you can resume do-release-upgrade [16:11:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.540 seconds [16:12:17] <^demon> As long as you fail at a sane point. [16:13:37] New patchset: jan; "Add puppet config for PHP" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29975 [16:14:22] <^demon> paravoid: forcing the install worked. [16:14:47] so gallium seems fine to me :-] Will check up a bit more after dinner [16:14:51] thank you paravoid !!! [16:20:51] <^demon> Ok, done with the upgrade. Rebooting manganese. [16:23:14] <^demon> paravoid: manganese rebooted, needs puppet run again so I can login :) [16:23:19] <^demon> (like formey) [16:23:47] running puppet now [16:29:08] <^demon> paravoid: Everything seems stable now. [16:29:10] <^demon> Thanks for your help. 
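A tiny illustration of the packaging approach ^demon describes at 15:51 (the resource title is an assumption; the point is only present-versus-pinned):

    package { 'gerrit':
        # Pinning a specific version here would force apt's older build back on;
        # "present" leaves a hand-upgraded gerrit alone while still guaranteeing
        # the package exists on a fresh install.
        ensure => present,
    }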
[16:29:16] great [16:45:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:41] mw45 and mw 34 are spamming fwrite() expects parameter 1 to be resource, boolean warnings.. [16:50:08] New review: Nikerabbit; "I don't quite understand what exit $? does but that file was just renamed." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/29937 [16:53:48] http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=copper.wikimedia.org&m=load_one&r=hour&s=by%20name&hc=4&mc=2&st=1351184005&g=load_report&z=large&c=Swift%20eqiad [16:53:54] lol [16:54:17] (it's a test box afaik) [16:56:26] !log powercycling copper, load 700 [16:56:38] Logged the message, Master [16:58:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.340 seconds [17:00:13] New review: Andrew Bogott; "This looks good to me! It'd be good to get someone who contributed to webserver.pp to review as well." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/29975 [17:01:54] New review: Faidon; "This looks like a good, much-needed abstraction. Please make it into a puppet module though, as this..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/29975 [17:06:02] New patchset: Jgreen; "fixing ganglia last-octet snafu, sort hash by last octet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29983 [17:15:35] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29983 [17:25:28] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25939 [17:27:13] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/24432 [17:28:30] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29328 [17:31:46] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28377 [17:33:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:33:47] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28497 [17:48:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.051 seconds [17:52:53] !log preilly synchronized php-1.21wmf1/extensions/MobileFrontend 'update after deploy' [17:53:05] Logged the message, Master [17:53:17] !log preilly synchronized php-1.21wmf2/extensions/MobileFrontend 'update after deploy' [17:53:32] Logged the message, Master [18:01:03] New review: Pyoungmeister; "will also require a line in:" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/28741 [18:09:20] !log restarted gmetad on nickel [18:09:36] Logged the message, Master [18:10:11] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Invalid parameter check_command at /var/lib/git/operations/puppet/manifests/lvs.pp:976 on node spence.wikimedia.org [18:15:33] New patchset: Pyoungmeister; "fixing monitoring definitions for wikivoyage and wikidata lb's" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29995 [18:16:26] binasher: ^ [18:16:29] that should fix it [18:16:46] !log flushed mobile varnish cache per preilly [18:16:59] Logged the message, Master [18:17:10] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29995 [18:17:27] !log updated 
OpenStackManager on labsconsole to master version [18:17:35] Logged the message, Master [18:20:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:29:43] !log installing base OS on srv194 [18:29:49] Logged the message, Master [18:31:58] !log Put in a live-hack on labsconsole to remove m1.tiny from the list of instance types [18:32:14] Logged the message, Master [18:34:11] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29571 [18:34:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.083 seconds [18:36:04] !log preilly synchronized php-1.21wmf1/extensions/MobileFrontend 'update after deploy' [18:36:20] Logged the message, Master [18:36:38] !log preilly synchronized php-1.21wmf2/extensions/MobileFrontend 'update after deploy' [18:36:40] notpeter: thanks for your merges :-]  I have replied to your question about the udp2log reload [18:36:44] Logged the message, Master [18:39:01] LeslieCarr, you available to help me with ganglia some more today :) [18:40:15] hashar: yep! just saw. will poke around at it [18:42:16] ottomata: not at the moment - am at the datacenter [18:42:45] notpeter: and thanks for the other merges :-] [18:44:38] notpeter: are you on console srv194? [18:45:03] hashar: definitely [18:45:05] cmjohnson1: no [18:45:06] it's the bug [18:45:38] okay thx [18:45:39] http://wikitech.wikimedia.org/view/Dell_PowerEdge_1950 [18:45:53] mmmk, thanks [18:46:39] PROBLEM - LVS on payments.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:47:31] Jeff_Green: ^^ [18:47:48] hmmm [18:48:00] that's not a very informative message [18:48:48] i'm rebooting payments3 but I've already taken it out of lvs config [18:53:49] ah ha. payments3 is #2 of 4 on the db list in the mediawiki conf, apparently it doesn't fail out gracefully [18:53:52] Jeff_Green: hey! have you sorted out your mediawiki database balancing ? [18:54:23] hashar: conceptually yes, but the fr-tech folks have been too crazy-busy to implement the change yet [18:55:19] Jeff_Green: at least you are not blocked pending some information :-] [18:55:21] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 1 process with command name squid [18:55:29] yes totally, thanks for your help with that [18:55:34] New patchset: Pyoungmeister; "patching up ipv6 monitoring for wikidata and wikivoyage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29999 [18:55:57] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29999 [18:57:12] New patchset: jan; "Add puppet config for PHP" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29975 [19:00:19] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 1 process with command name squid [19:02:45] New patchset: Hashar; "admins.pp: annotate the include as disabled" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23789 [19:03:08] New review: Hashar; "Rebased." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/23789 [19:03:43] Jeff_Green: mind merging the ultra trivial change https://gerrit.wikimedia.org/r/#/c/23789/ ? It simply adds a comment :-} [19:04:24] can it wait a bit? 
i'm in the middle of rebooting the payments cluster [19:04:35] yeah sure :-] [19:04:38] and testing to see if they survive updates :-) [19:04:39] thx [19:04:41] Jeff_Green: even forget about it, it is not important hehe [19:05:15] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 1 process with command name squid [19:06:37] speakign of which . . . [19:06:53] LVS is still down [19:06:54] do you need any help? [19:07:16] if it's down then I think LVS itself is broken [19:07:42] were there changes to lvs.pp earlier? [19:08:12] New patchset: Ryan Lane; "Add pam_mkhomedir" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30004 [19:08:12] ya--if you could take a look at lvs itself that would be good [19:09:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:10:21] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 1 process with command name squid [19:11:18] paravoid: i'm not sure what test is failing--payments through lvs is responsive [19:11:55] New review: Hashar; "Yeahhh a module :-]" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/29975 [19:12:27] I know what this is [19:12:29] New patchset: Kaldari; "Adding $wgCentralBannerDispatcher for CentralNotice" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30005 [19:12:36] we have sh instead of wrr for payment [19:12:41] *payments [19:12:45] don't know why [19:12:51] legacy? [19:12:59] it's SSL? [19:13:06] but this means that it takes more for a server to be depooled iirc [19:13:09] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30005 [19:13:11] it hasn't in the past [19:13:36] in the past it togglign status in the pybal conf quickly worked [19:13:36] oh it's just SSL, it got truncated in the output, doh [19:13:52] TCP payments-lb.pmtpa.wikimedia. sh -> payments1.wikimedia.org:http Route 10 1 7 -> payments2.wikimedia.org:http Route 10 6 8 [19:13:58] the "s" got truncated, heh [19:14:07] ha [19:14:38] hm, telnet to 443 from spence works [19:15:12] meanwhile squid didn't like the latest package updates [19:15:26] squid on payments3 itself [19:15:27] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 1 process with command name squid [19:16:05] it's in a 404 state right now [19:16:13] what is? [19:16:18] [1351190871] SERVICE ALERT: payments.wikimedia.org;LVS;WARNING;HARD;20;HTTP WARNING: HTTP/1.1 404 Not Found [19:16:30] before it was [19:16:30] [1351190799] SERVICE ALERT: payments.wikimedia.org;LVS;CRITICAL;HARD;20;CRITICAL - Socket timeout after 10 seconds [19:16:42] what's the test exactly? [19:16:46] muttw [19:16:57] to what URL [19:16:57] & from where [19:18:33] 404s here too [19:19:51] yeah I see it too [19:20:20] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 0 processes with command name squid [19:22:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.785 seconds [19:25:12] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 0 processes with command name squid [19:30:19] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 0 processes with command name squid [19:31:14] New patchset: Ori.livneh; "Enable PostEdit on 15 additional wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30009 [19:31:53] Reedy: what's the question? 
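The "sh" in the truncated output at 19:13:52 is LVS's source-hashing scheduler (clients stick to one realserver by source IP), as opposed to the weighted round-robin ("wrr") mentioned at 19:12. A hypothetical way to confirm this on the active balancer (not a command run in the log):

    # -L lists virtual services and their realservers; -n skips DNS so the
    # scheduler column ("sh", "wrr", ...) is easy to spot next to the service IP.
    ipvsadm -L -n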
[19:32:13] Where the hell is the current docroot for www.wikidata.org [19:33:13] docroot/wikidata.org ? [19:33:25] what are you trying to do? [19:34:03] No, it isn't [19:34:26] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30009 [19:34:33] New patchset: Jgreen; "adjusting LVS url test for payments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30010 [19:35:13] Reedy: ahh, right you added that [19:35:15] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 0 processes with command name squid [19:35:19] Reedy: it's docroot/www.wikidata.org [19:35:30] No, it isn't [19:36:00] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30010 [19:37:16] Reedy: how do you figure? + DocumentRoot /usr/local/apache/common/docroot/www.wikidata.org [19:37:48] http://www.wikidata.org/ [19:37:53] That landing page doesn't exist there [19:38:02] so it can't be using that [19:38:07] the landing page is in meta [19:38:17] it's extract2.php [19:38:38] + RewriteRule ^/$ /w/extract2.php?title=Www.wikidata.org_portal&template=Www.wikidata.org_template [L] [19:38:53] ffs [19:38:54] thanks [19:39:55] hey i've forgotten too. and it was my idea to use extract2 for wikidata.org in the first place [19:39:59] ;-) [19:40:19] nice that wikidata.org is editable, even as a landing page [19:40:21] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 0 processes with command name squid [19:41:30] !log pgehres synchronized php-1.21wmf2/extensions/LandingCheck/ 'Updating LandingCheck to master' [19:41:42] Logged the message, Master [19:43:48] PROBLEM - Puppet freshness on cp1043 is CRITICAL: Puppet has not run in the last 10 hours [19:43:48] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [19:43:48] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [19:43:48] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [19:44:01] aude: can you create a new one? [19:44:45] ok [19:44:50] !log kaldari synchronized wmf-config/CommonSettings.php 'adding new var for CentralNotice' [19:45:03] Logged the message, Master [19:45:06] stupid buggy openstack [19:45:18] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 1 process with command name squid [19:45:32] build state [19:46:29] -_- [19:46:31] stupid nova client cache [19:47:15] !log pgehres synchronized php-1.21wmf2/extensions/ContributionTracking/ 'Updating ContributionTracking to master' [19:47:15] :( [19:47:20] ^demon: regarding formey upgrade, we might want to check the doxygen doc is still running properly. Or we can wait for the cron to kick in 4hours :-] [19:47:21] same IP [19:47:24] still in build state [19:47:31] Logged the message, Master [19:47:38] ^demon: will probably move that service to gallium anyway [19:47:49] listed as active [19:48:02] that build work [19:48:02] *worked [19:48:02] <^demon> hashar: Will wait, and yes, we should. [19:48:03] aude: under what domain are we supposed to be creating this wikidata wiki? https://bugzilla.wikimedia.org/show_bug.cgi?id=40137 "Contradictory to comment 0 we try to set up wikidata.org from the start." [19:48:04] no clue why [19:48:08] or why the other one failed [19:48:11] some bug in nova [19:48:20] (www\.)?wikidata\.org [19:48:25] * aude looks [19:48:32] ^demon: ok will check the rendering tomorrow morning. 
Will work on doc.wikimedia.org on monday under 20% [19:49:03] Reedy: yes, wikidata.org [19:49:39] our language setup may be different at some point than other wikis, but for now i think it's okay to assume english for the language [19:49:50] and wikidata as the site/project [19:49:58] or www = en maybe [19:50:21] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 1 process with command name squid [19:50:46] The apache configs need updating slightly then ;) [19:50:51] + # www -> en [19:50:51] + RewriteCond %{HTTP_HOST} www.wikidata.org [19:50:51] + RewriteRule ^/(.*$) http://en.wikidata.org/$1 [R=301,L] [19:50:54] assume you are using the mediawiki-multiversion stuff [19:50:55] New patchset: Pyoungmeister; "adding sudoers defs for code deploy to search indexer role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30011 [19:50:58] <^demon> Yes, we'll need to rewrite the whole thing [19:51:14] Reedy: it's okay i think for a start [19:51:22] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30011 [19:51:25] We can probably steal from the mediawiki.org config.. [19:51:32] * aude doesn't like to make these decisions and not sure denny really understands all the details [19:52:08] itsafuckingtestwiki.wikidata.org [19:52:17] Reedy: yes :) [19:52:24] whatever it's called is okay for now [19:52:41] just (whatever).wikidata.org [19:53:15] RECOVERY - Puppet freshness on spence is OK: puppet ran at Thu Oct 25 19:52:46 UTC 2012 [19:53:32] Dennys comment 7 suggests we're just gonna create the final one first [19:54:43] Reedy: ok [19:54:51] so, just www.wikidata.org? [19:55:21] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 0 processes with command name squid [19:56:17] New patchset: jan; "Add puppet config for PHP" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29975 [19:56:37] MWMultiVersion.php looks scary :o [19:57:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:33] Yeah, like mediawiki.org [19:57:40] if you give it no www, it takes you there [19:59:11] New review: Andrew Bogott; "This looks reasonable. Is it a module 'best practice' to have every class in a separate file? It s..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/29975 [19:59:57] Reedy: i think that works for now [20:00:18] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 1 process with command name squid [20:00:44] later on, we probably want de.wikidata.org/wiki/Q100 to redirect to wikidata.org/wiki/Q100?setlang=de or something fancy like that [20:01:02] but not to worry about that now [20:01:19] Q100 being an item page [20:02:44] brutal pagenames you have:) [20:03:00] http://meta.wikimedia.org/wiki/Wikidata/Notes/URI_scheme has notes but they are not that current [20:03:00] MaxSem: yes [20:03:54] New review: Hashar; "looks good to me." 
[operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/29975 [20:04:04] we also have pages like http://wikidata.org/wiki/Special:ItemByTitle/en/Berlin and those can have a short url of some schema [20:04:18] ultimately redirect to Q100 or whatever, magically [20:04:42] New review: jan; "The seperation in many files is needed by the puppet autoloader" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/29975 [20:05:19] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 1 process with command name squid [20:07:26] New patchset: Demon; "Adjust wikidata.org apache config" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/30056 [20:08:15] PROBLEM - LVS on payments.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:08:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.863 seconds [20:10:43] RECOVERY - check_squid on payments3 is OK: OK [20:11:39] Is there gerrit maintenance planned today? It's been up/down randomly today [20:11:48] got a 503 a couple of times a few hours ago, and again now. [20:25:14] PROBLEM - Memcached on marmontel is CRITICAL: Connection refused [20:25:28] PROBLEM - Swift HTTP on copper is CRITICAL: Connection refused [20:25:43] PROBLEM - LVS HTTP IPv6 on wikivoyage-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: Connection refused [20:25:43] PROBLEM - LVS HTTP IPv6 on wikivoyage-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection refused [20:26:06] Krinkle: some boxes were to be ugraded, the gerrit host among them [20:26:13] PROBLEM - LVS HTTPS IPv6 on wikivoyage-lb.pmtpa.wikimedia.org_ipv6 is CRITICAL: Connection refused [20:26:13] PROBLEM - LVS HTTPS IPv6 on wikivoyage-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection refused [20:26:17] PROBLEM - Swift HTTP on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:26:17] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:26:17] PROBLEM - Swift HTTP on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:26:17] PROBLEM - Swift HTTP on magnesium is CRITICAL: Connection refused [20:27:50] New patchset: Hashar; "rake disable colors on non TTY" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/30062 [20:27:56] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:28:25] no I need to find someone from op with some ruby knowledge :-] [20:29:23] hashar: I'm sure they all have knowledge it sucks ;) [20:30:18] what do you need? 
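A hedged illustration (not a deployed rule) of the "something fancy like that" idea from 20:00:44, written in the same mod_rewrite style as the snippets quoted earlier; the subdomain list, the Q-item pattern and the 302 are assumptions:

    # de.wikidata.org/wiki/Q100  ->  www.wikidata.org/wiki/Q100?setlang=de
    RewriteCond %{HTTP_HOST} ^(de|fr|ja)\.wikidata\.org$
    RewriteRule ^/wiki/(Q[0-9]+)$ http://www.wikidata.org/wiki/$1?setlang=%1 [R=302,L]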
[20:30:25] PROBLEM - check_apache2 on payments3 is CRITICAL: PROCS CRITICAL: 0 processes with command name apache2 [20:30:25] PROBLEM - check_nginx on payments3 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx [20:30:42] !log reedy synchronized php-1.21wmf2/extensions/ 'Dark deploy Wikibase, Diff and ULS' [20:30:48] * aude eager to see our wiki :) [20:30:58] Logged the message, Master [20:31:28] paravoid: a change to the ops/puppet rakefile https://gerrit.wikimedia.org/r/30062 [20:31:33] paravoid: to disable color on non tty [20:32:32] moreover, the Puppet::Util::Color was introduced in puppet 2.7.12 and Precise got 2.7.11 :-( [20:32:46] but since Jenkins is not interactive it will not have any troubles ;-] [20:35:26] RECOVERY - check_apache2 on payments3 is OK: PROCS OK: 9 processes with command name apache2 [20:35:26] RECOVERY - check_nginx on payments3 is OK: PROCS OK: 49 processes with command name nginx [20:36:11] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:36:11] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:36:26] Reedy: for this one, https://gerrit.wikimedia.org/r/#/c/29979/2 [20:36:41] does it matter if it's "WikiBase" here or "Wikibase" as the extension is named? [20:37:04] and the fact that we have them in subdirectories... WikibaseLib and Wikibase [20:37:36] New patchset: Demon; "wikidata needs special treatment by het deploy" [operations/mediawiki-multiversion] (master) - https://gerrit.wikimedia.org/r/30064 [20:43:38] New review: Siebrand; "Just checking, but can you confirm that a full Localisation level for the extension was required for..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30009 [20:44:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:44:41] New review: Aude; "or, no i think it will work" [operations/mediawiki-multiversion] (master) C: 0; - https://gerrit.wikimedia.org/r/30064 [20:44:47] ^demon: when can we kill that preg_match( '/^(.*)\.prototype\.wikimedia\.org$/', $serverName, $matches ) stuff? :) [20:45:01] <^demon> I dunno. [20:45:04] <^demon> Prototype was never me. [20:45:18] and then there is the if( $secure ) cruft [20:45:27] AaronSchulz: prototype is still around. chrismcmahon might know Ryan_Lane too [20:45:27] the multiversion stuff is scary [20:45:35] AaronSchulz: I would keep it around until it is confirmed. [20:45:45] prototype is definitely way out of date [20:45:49] well, yes, but can we try to get rid of some cruft [20:45:59] it's like we never clean anything up almost [20:45:59] aude: I had renamed it once. Grrr [20:46:02] chrismcmahon: isn't prototype still used by some people ? [20:46:07] hashar: beta labs should be a replacement for prototype [20:46:08] The subdirectories don't matter, it's still one git repo [20:46:34] <^demon> aude: For example, en.wikidata.org, it would've set $lang = 'en', $site = 'wikidata', $docroot = '.../wikidata.org/' [20:46:36] hashar: afaik, the only thing hosted on prototype now is an old version of AFTv5 and a REALLY old version of MediaWiki [20:46:48] AaronSchulz: I am all in when it comes to cleaning our conf. It is just that prototype is not dead yet AFAIK. [20:46:55] AaronSchulz: speaking of which, https://meta.wikimedia.org/wiki/Test_wikis [20:47:03] hashar: if we can guarantee an easy update for AFTv5 on beta labs, we are done with prototype. 
[20:47:06] Reedy: ok [20:47:18] <^demon> Reedy: $wgDBname will be "wikidata." [20:47:18] chrismcmahon: so we still need prototype :-] [20:47:18] (there's dozens of prototype wikis) [20:47:18] <^demon> Wanna go ahead and create? [20:47:27] chrismcmahon: I got a weird permission error with the beta autoupdater. [20:47:36] ^demon: makes sense [20:47:36] <^demon> AaronSchulz: Can you https://gerrit.wikimedia.org/r/#/c/30064/? [20:47:38] chrismcmahon: I somehow found the root cause, need to find out a workaround. [20:47:40] one database [20:47:41] ^demon: will wikidata have any uploads? [20:47:49] Not wikidatawiki [20:47:57] <^demon> Reedy: No, no need afaict. [20:48:05] so help icons and anything go to commons I guess [20:48:05] AaronSchulz: please keep the prototype stuff for now on :/ AFTv5 will still need it till beta is able to self update properly. [20:48:05] heh [20:48:09] AaronSchulz: i really doubt it [20:48:20] hashar: I'm not killing, I'm saying we should think about it [20:48:20] * aude doesn't make decisions though [20:48:24] <^demon> Reedy: Or it could be wikidatawiki, and it would be special-cased like commons. Either way it gets special-cased. [20:48:35] ^demon: well it better not, unless you want to change rewrite.py :) [20:48:35] AaronSchulz: we are very close to kill it :-] [20:48:37] I don't think we need the extra suffix [20:48:39] it was just amusing ;) [20:48:41] hashar: I'm really hoping we can deploy to beta using Jenkins builds now that gallium, formey etc are upgraded [20:48:42] <^demon> AaronSchulz: Btw, how does MWMultiversion handle mediawiki.org? I couldn't find any special casing. [20:49:12] break MWMultiversion and everything breaks :o [20:49:29] interesting how it works [20:49:52] ^demon: site=wikipedia, lang=mediawiki, so it's mediawikiwiki [20:50:07] that kind of pattern already works [20:50:17] for uploads [20:50:26] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 0 processes with command name squid [20:50:40] let me see how it works for page views [20:50:55] <^demon> AaronSchulz: So if we name it wikidatawiki, it should *just work* without the special-casing I added? [20:51:11] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:51:11] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:52:04] ^demon: wikidatawiki would certainly be easier to handle than wikidata, yes [20:52:13] New patchset: Reedy; "Wikidata config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30068 [20:52:16] chrismcmahon: was the request to apply a specific patchset before it get merged ? [20:52:59] <^demon> Reedy: Go ahead and run addWiki. [20:52:59] aude: Are we likely to have other non wikidata NS on the wiki? Just wondering if we should start from way NS 120? [20:53:00] <^demon> $wgDBname = "wikidatawiki" [20:53:09] ^demon: but we won't be able to see it ;) [20:53:15] <^demon> Yeah, let's pester ops. 
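The mediawiki.org case AaronSchulz describes is just the generic lang-plus-site naming rule, with 'wikipedia' mapping to the bare 'wiki' suffix. Illustratively (not the literal multiversion code):

    // Sketch: derive a database name from ( $lang, $site ).
    // lang 'mediawiki' + site 'wikipedia' => 'mediawikiwiki';
    // lang 'en' + site 'wikisource' => 'enwikisource'.
    $suffix = ( $site === 'wikipedia' ) ? 'wiki' : $site;
    $dbName = str_replace( '-', '_', $lang ) . $suffix;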
[20:53:47] Change abandoned: Demon; "Shouldn't be necessary if we name the database "wikidatawiki"" [operations/mediawiki-multiversion] (master) - https://gerrit.wikimedia.org/r/30064 [20:54:19] mwscript addWiki.php --wiki=aawiki en wikidata wikidatawiki wikidata.org [20:54:24] Reedy: not that i know of, but i could be forgetting something obvious or something i can't foresee now [20:54:37] aude: I guess it wouldn't hurt either way :) [20:54:52] Any advances on mwscript addWiki.php --wiki=aawiki en wikidata wikidatawiki wikidata.org [20:55:06] <^demon> AaronSchulz: ^? [20:55:12] elseif ( preg_match( "/^\/usr\/local\/apache\/(?:htdocs|common\/docroot)\/([a-z0-9\-_]*)$ [20:55:14] /", $docRoot, $matches ) ) { [20:55:23] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 1 process with command name squid [20:55:24] I guess mw.org must use that case right? [20:55:31] <^demon> I believe so. [20:56:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.357 seconds [20:56:10] <^demon> And actually, wikidata won't be a suffix anymore in wgConf. [20:56:12] <^demon> I can remove that too. [20:56:43] so yeah, users would be directed to docroot/wikidata and it could fall under that case I suppose [20:57:03] mw.org is probably a good example indeed :D [20:57:27] Ryan_Lane: is secure almost nuked yet? [20:57:47] AaronSchulz: someone just needs to jfdi [20:57:50] * Reedy smiles at paravoid [20:57:58] New patchset: Demon; "More wikidata fixes" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30069 [20:58:13] Error: 1007 Can't create database 'wikidatawiki'; database exists (10.0.6.44) [20:58:13] Seriously!? [20:58:38] wtf [20:58:51] <^demon> Maybe I did something earlier. [20:59:00] * AaronSchulz hands Reedy the swear jar [20:59:00] http://noc.wikimedia.org/conf/all.dblist :P [20:59:15] <^demon> hoo: I thought I had deleted it though. [20:59:23] DROB DB zOmmgs [20:59:30] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:59:30] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:59:31] Did you delete the ES stuff too? [20:59:35] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30069 [20:59:35] :o [20:59:45] ^demon: maybe you accidentally dropped some other wiki? [21:00:00] oO [21:00:03] lol [21:00:05] preilly, commented out WikiMiniAtlas for now until we know why it degrades site performance [21:00:20] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 1 process with command name squid [21:00:23] New patchset: Brion VIBBER; "Disable $wgMFEnableResourceLoader since it's broken currently" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30070 [21:00:32] AaronSchulz: I think someone just needs to apply the redirects [21:00:39] to the new https cluster [21:00:47] !log demon synchronized wmf-config/wgConf.php 'Syncing I39c9693a' [21:01:03] Logged the message, Master [21:01:28] !log demon synchronized wmf-config/CommonSettings.php 'Syncing I39c9693a' [21:01:42] Logged the message, Master [21:03:02] Ryan_Lane: I'd like to find a MLP picture of a unicorn with a bunch of nukes [21:03:12] Eloquence: what's the problem with WMA? 
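Applied by hand, the docroot fallback AaronSchulz pastes above is the case a wikidata docroot would hit; a quick sketch of what it matches:

    // Sketch: the pasted regex extracts the site name from the document root,
    // so /usr/local/apache/common/docroot/wikidata captures 'wikidata'.
    $docRoot = '/usr/local/apache/common/docroot/wikidata';
    if ( preg_match(
        "/^\/usr\/local\/apache\/(?:htdocs|common\/docroot)\/([a-z0-9\-_]*)$/",
        $docRoot, $matches
    ) ) {
        $site = $matches[1]; // 'wikidata'
    }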
[21:03:16] New patchset: Reedy; "Initial config for wikidatawiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30073 [21:03:56] Eloquence: Dispenser has been requesting/testing a few millions thumbnails (finding a dozen bugs) [21:04:01] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30073 [21:05:15] Nemo_bis, it uses an mw.loader directive that pulls from meta.wikimedia.org, which appears to result in problematic data-center roundtripping based on the request analysis we did. [21:05:47] Eloquence: yes I saw your edits now [21:05:48] thanks [21:05:50] I'm aware of Dispenser's analysis, it's very helpful [21:06:55] Eloquence: we have plenty of wikis loading scripts from Meta, so clues on how to do this better will be very helpful [21:08:02] New patchset: Pgehres; "Adding French territories to priority countries list for LandingCheck." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30075 [21:09:13] AaronSchulz: :D [21:09:32] Ryan_Lane: I bet at least something similar exists [21:09:42] but I don't want to be seen searching for it at the office [21:10:11] Eloquence: okay thanks [21:10:27] New patchset: Reedy; "Wikidata config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30068 [21:10:28] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30075 [21:10:36] Ryan_Lane: it would be perfect for when prototype dies [21:10:49] prototype needs to die [21:10:51] badly [21:11:12] so does internproxy [21:11:12] wtf is internproxy? [21:11:12] for the analinterns [21:11:45] * AaronSchulz had to reread that a few times [21:12:14] ;-] [21:14:02] https://gerrit.wikimedia.org/r/#/c/30056/ <- Can someone please review, deploy and graceful that for wikidata stuffs? [21:15:09] New patchset: Reedy; "Wikidata config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30068 [21:19:04] New review: Swalling; "@Siebrand" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30009 [21:19:14] Reedy: looking [21:19:51] New patchset: Reedy; "Wikidata config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30068 [21:21:07] RECOVERY - LVS on payments.wikimedia.org is OK: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error [21:22:03] RECOVERY - check_squid on payments3 is OK: OK [21:22:49] New review: Aude; "Looks to me like this should work, even if not 100% perfect. Perfection can come later and I think i..." [operations/apache-config] (master) C: 1; - https://gerrit.wikimedia.org/r/30056 [21:24:54] Reedy: when Timo says "(same)", it usually means "as above" [21:25:10] ^ indeed [21:25:19] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:25:25] (or below, if I type asynchronously) [21:25:41] :) [21:25:41] why did the parallel chicken cross the road? [21:25:45] below makes sense [21:25:56] well, above does [21:25:56] after you've made a comment [21:26:21] New patchset: Reedy; "Wikidata config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30068 [21:27:00] Eloquence: commented out WikiMiniAtlas where? [21:27:47] Krinkle: on enwiki [21:27:48] common.js [21:27:50] Why? 
[21:28:18] http://en.wikipedia.org/wiki/MediaWiki:Common.js [21:28:18] https://en.wikipedia.org/w/index.php?title=MediaWiki:Common.js&diff=prev&oldid=519826982 [21:28:18] some performance issue with mw.loader from meta wiki [21:28:18] https://en.wikipedia.org/w/index.php?title=User_talk:Dschwen&diff=prev&oldid=519827444 [21:28:22] * aude cries :( [21:28:24] Why would it matter which wiki? [21:28:34] no idea [21:28:35] New review: Hashar; "This change is still pending some logic simplifications per Ryan comment on patchset 6 https://gerri..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/8120 [21:28:53] last I checked load balancing is done based on geo of user, not of what wiki it is. [21:29:29] did commenting it out help? I'd like some data on this. [21:29:35] I doubt it. [21:29:39] Krinkle, we're doing some request analysis based on monitoring data collected via Nimsoft monitoring stations and are finding that the connect times for the meta.wikimedia.org request are often several seconds long. Patrick and Leslie are going to look a bit more at this this PM. [21:29:50] Reedy: almost 100% sure for phase 1, we don't need the property or query namespaces [21:30:05] definitely not query and 99% sure not property [21:30:14] lol [21:30:21] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:30:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:30:38] Eloquence: ok, strange. Only when referring from en.wiki or meta slow in general (i.e. direct page views on meta)? [21:30:45] i know those are in the example settings and not sure it matters, but probably best not to have them if we don't use the namespaces yet [21:30:59] Eloquence: anyway, that needs to be taken care of soon, as cross-wiki gadgets are coming up not so far from now [21:31:14] there is at leas half a dozen more of these cross-wiki loads in the wild, which should be fine. [21:31:20] i can double check with denny tomorrow [21:31:40] Krinkle, the request data we have right now is based on en.wp article views loaded in real browsers, not individual GETs to that object. [21:31:40] some to en.wiki, some to meta.wiki others to medawiki.org. Usually grabbing the central version from one of those. [21:31:56] yeah I know [21:32:26] Krinkle: I'll loop you in once Leslie and I have a change to talk about it a bit more [21:32:47] preilly: k, please do :) These should be caught by squid for anonymous users [21:33:03] even a cache miss will be fairly cheap as it is action=raw with high smaxage [21:33:34] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:33:36] Krinkle: yeah that is how it should be working for sure [21:33:48] Krinkle: but that doesn't seem to be the case in the real world right now [21:33:58] preilly: ok [21:34:17] preilly: what kind of 'loop' can I expect btw. So I can put a watch on it. [21:34:29] bugzilla, rt, mailling list, .. [21:35:15] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:35:15] Krinkle, do you want to help debug this? [21:35:17] New review: Siebrand; "That's wonderful, Steven. Thanks for getting it!" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30009 [21:35:21] Eloquence: If I can, sure. [21:35:33] notpeter, can you make Krinkle (Timo) a Watchmouse account? [21:35:50] ori-l: btw, I just got a PostEdit notification earlier today when working in CentralNotice today. 
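The cross-wiki loads under discussion are the long-standing Common.js pattern of pulling a central copy of a gadget from Meta over action=raw; roughly like this, using the WikiMiniAtlas URL Eloquence pastes a little further down:

    // The usual shape of a Common.js cross-wiki load. A cache miss is still
    // fairly cheap because action=raw is served with a high s-maxage, as
    // Krinkle notes above.
    mw.loader.load(
        'http://meta.wikimedia.org/w/index.php?title=MediaWiki:Wikiminiatlas.js' +
        '&action=raw&ctype=text/javascript&smaxage=21600&maxage=86400',
        'text/javascript'
    );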
[21:35:51] sure [21:36:01] Krinkle: via email or IRC [21:36:52] guys, can I deploy a quick configuration change? [21:37:07] Krinkle, prepare for wanting to stab your own face [21:37:19] as you get exposure to the wonderful watchmouse UI [21:37:19] eh.. [21:37:25] MaxSem: give us a minute, please [21:37:42] Krinkle: sorry, can you elaborate? are you reporting a bug? [21:37:50] ori-l: Looks like it should be guarded against the scenario of making an edit without being on action=edit (e.g. a script using the API, or Special pages like in CentralNotice that make an edit internally to store data) [21:38:14] ori-l: probably just needs an additional if check to see what $context>Action::getActionName is (or wherever that thing is) [21:38:27] Krinkle: it's *so bad* [21:38:31] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:38:31] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:38:48] notpeter: their frond-end is pretty modern though [21:38:53] (status.wm.o) [21:38:57] Krinkle: i thought the fact that the cookie is set from a BeforePageRedirect hook handler would ensure that it didn't affect API calls [21:39:09] here's an example request for WikiMiniAtlas: http://meta.wikimedia.org/w/index.php?title=MediaWiki:Wikiminiatlas.js&action=raw&ctype=text/javascript&smaxage=21600&maxage=86400 [21:39:39] notpeter: oww. cleartext password e-mail ftw [21:39:39] the connect time on that is 5-7 secs for a significant number of pageviews [21:39:55] I expected a "click here to set a password" link [21:39:59] instead I got my current password sent in plain text [21:40:10] headers: http://pastebin.com/MyS6Wvxw [21:40:19] for an example request to that URL [21:40:26] from Europe [21:40:41] Krinkle: hurray..... [21:40:49] anyway, changed the default and logged in [21:40:56] notpeter: where do I go from here? [21:41:14] lots of buttons to press ;-) [21:41:33] Krinkle: probably head to logs [21:41:50] k, I'm on logviewer [21:41:55] Krinkle, there's multiple monitors setup in watchmouse including the status monitors. K4-713 has been running a monitor called "banner load testing" [21:42:03] that loads an en.wiki article [21:42:33] here's an example probe from that monitor: https://dashboard.cloudmonitor.nimsoft.com/en/rootcause.php?mid=41159&vrid=271418&vlogid=12744824 [21:42:35] there's alos now banner load testing FF [21:42:41] which is firefox as opposed to chrome [21:42:49] right [21:43:20] The ones from India can't seem to even connect to en.wikipedia.org? [21:43:35] hm.. not all, seems random [21:44:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.061 seconds [21:45:11] notpeter: please let me know once I have an account [21:45:23] notpeter: thanks [21:45:23] Eloquence: ok, on the link you sent there, I see an error from WikiMiniAtlas indeed. [21:45:24] a dom exception [21:45:39] New patchset: Reedy; "Wikidata config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30068 [21:45:44] Krinkle: that is with local storage right? 
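The guard Krinkle suggests for PostEdit amounts to checking the action name before the cookie is set. A sketch, assuming the cookie really is set from the BeforePageRedirect handler as ori-l says (not the actual extension code):

    // Sketch only: bail out unless the redirect follows a real edit-form submit,
    // so API writes and internal saves (e.g. from CentralNotice) never set the cookie.
    public static function onBeforePageRedirect( OutputPage $out, &$redirect, &$code ) {
        if ( Action::getActionName( $out->getContext() ) !== 'edit' ) {
            return true;
        }
        // ... existing cookie-setting logic ...
        return true;
    }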
[21:45:48] yeah [21:45:55] !log mlitn synchronized php-1.21wmf2/extensions/ArticleFeedbackv5 'desc' [21:45:56] note the old chrome version as well [21:46:03] so it's possible that this only happens with very ancient chrome versions [21:46:04] Logged the message, Master [21:46:10] !log mlitn synchronized php-1.21wmf2/extensions/AbuseFilter 'desc' [21:46:16] but I doubt that this would cause the connect time on those meta.wiki resources to shoot up [21:46:25] Logged the message, Master [21:46:29] And Wiki.png takes 6 seconds to load? [21:46:30] Reedy: d before m? [21:46:41] ? [21:46:46] https://gerrit.wikimedia.org/r/#/c/30068/6/special.dblist [21:46:46] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:46:46] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:46:48] Or am I interpreting that waterfall chart wrong? [21:46:56] well, it is a waterfall chart [21:47:01] paravoid: what is with those swift notices? [21:47:11] AaronSchulz: I just added them all to the end [21:47:13] so that request may just be blocked on others [21:47:14] I'll alphasort them laster [21:47:26] preilly: ok, good to go [21:47:34] you better, or the ocd mob will get you [21:49:50] Reedy: can rename $baseNs to $wmfWikiBaseNs or something? [21:50:03] Krinkle, I suspect the India results are watchmouse artifacts. that's why we need additional monitoring services to compare [21:50:04] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:09] Feel free :p [21:50:22] I feel free to tell you to rename it, yes [21:50:49] note also that watchmouse overrepresents IPv6 massively [21:51:03] about half their monitoring stations are IPv6 [21:54:16] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [21:55:01] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:55:01] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:55:11] !log mlitn synchronized wmf-config 'desc' [21:55:16] !log olivneh synchronized php-1.21wmf2/extensions/PostEdit [21:55:20] Logged the message, Master [21:55:27] !log mlitn synchronized php-1.21wmf2/extensions/ArticleFeedbackv5 'desc' [21:55:32] Logged the message, Master [21:55:44] !log mlitn synchronized php-1.21wmf2/extensions/AbuseFilter 'desc' [21:55:44] Logged the message, Master [21:55:58] Logged the message, Master [21:59:29] AaronSchulz: Reedy it could probably be $wgWBBaseNS ? or really don't care what it is :) [21:59:43] It doesn't need to be $wg anything really [21:59:53] It's not going to be used by anything else [21:59:59] we could just unset it at the end of the code block [22:00:00] ok :) [22:00:01] Reedy: you could just call unset() then ;) [22:00:24] * AaronSchulz is jinxed [22:01:19] PROBLEM - Puppet freshness on sq76 is CRITICAL: Puppet has not run in the last 10 hours [22:02:29] Reedy: jeroen suggests we not have the property and query namespaces enabled just yet (for phase 1) [22:02:47] * aude thinks it's trivial to enable them later [22:03:02] It was in Dennys config/setup email [22:03:09] really? 
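What Reedy and AaronSchulz settle on above is a throwaway local in the config file rather than a real $wg* global, cleared with unset() once it has done its job. For example (names and numbers illustrative, not the merged change):

    // Sketch: a scratch variable in the wmf-config style, unset when the block ends.
    $wmfWikiBaseNs = 120; // hypothetical base namespace number
    $wgExtraNamespaces[$wmfWikiBaseNs] = 'Item';
    $wgExtraNamespaces[$wmfWikiBaseNs + 1] = 'Item_talk';
    unset( $wmfWikiBaseNs );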
[22:03:16] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:03:38] i think it was an oversight but he gets the final say [22:03:50] I can forward you the email if you want to see it [22:03:54] ok [22:04:08] * aude really doesn't think he thought of that [22:04:27] just copied our example settings i think [22:04:33] heh [22:04:37] forwarded to your gmail [22:04:49] ok [22:06:33] Reedy: if he says so, ok [22:06:51] Either way, we can add/remove/fix/whatever it later [22:07:02] for the beta wiki, but i suppose it's trivial to remove them [22:07:02] yeah [22:07:24] i sent him an email asking aobut it but it's rather late here to expect an answer now [22:14:00] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:14:16] New patchset: Ori.livneh; "Deploy EventLogging to en and test" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30087 [22:14:58] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30087 [22:15:28] Need the apache config updated first [22:16:39] New patchset: Ori.livneh; "Add wgEventLoggingBaseUri config var" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30088 [22:17:01] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:17:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:17:18] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:17:23] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30088 [22:17:25] the PHP libxml patch is pretty much done, I just need to do some more testing [22:18:24] I have to try to think of tests which would reliably break if I had done something wrong [22:18:56] !log olivneh synchronized php-1.21wmf2/extensions/EventLogging [22:19:10] Logged the message, Master [22:21:44] !log aaron synchronized php-1.21wmf2/includes/Revision.php 'temp query debug logging' [22:21:59] Logged the message, Master [22:22:16] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:22:19] !log aaron synchronized php-1.21wmf2/includes/Revision.php [22:22:33] Logged the message, Master [22:23:12] !log aaron synchronized php-1.21wmf2/includes/Revision.php 'tweak wfGetCaller param a level' [22:23:25] Logged the message, Master [22:24:06] ori-l: Would you mind making a follow-up commit to https://gerrit.wikimedia.org/r/#/c/29894/ that passes the jshint configuration you added? [22:24:07] !log aaron synchronized php-1.21wmf2/includes/Revision.php 'done' [22:24:21] Logged the message, Master [22:24:42] Ah, so it's LuceneResult::__construct and ApiOpenSearchXml::getExtract to a lesser extent spamming the master :) [22:24:56] ori-l: There is also "modules/dataModels.js: line 5, col 27, Extra comma." that will make IE crash. [22:25:15] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:25:42] Krinkle: good catch, thanks. i'll look after both, sec. [22:28:11] Krinkle: do you have a suggestion regarding jshint failing the 'enum' property? [22:28:21] (not sure if you saw my comment) [22:28:33] I didn't see it [22:28:38] ori-l: It depends on whether it is a reserved word in ES3 [22:29:11] ori-l: it isn't in ES5. 
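The "Extra comma" warning Krinkle quotes is a trailing comma in an object literal: modern engines ignore it, but old IE (ES3-era JScript) treats it as a syntax error. Schematically, not the actual dataModels.js contents:

    // Fine in ES5 browsers, a hard syntax error in IE 8 and older:
    var model = {
        id: 5,
        title: 'Example',   // <-- trailing comma after the last property
    };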
And as of ES5 javascript allows use of reserved words in variable names (which make sense, hitting these random words is super annoying) [22:29:36] ori-l: however in ES3 the script will fail if a reserved word is there, so if that is the case, you'll have to rename it. [22:29:56] it is; but i'm using it as an object attribute, not an identifier. i'd drop its use altogether, but i'm trying to conform to the json schema draft spec. i could silence jshint with ES5:true, but that would be too broad. [22:29:59] ES3 is bascically IE6-9, Firefox < 4 and older Opera. [22:30:10] !log aaron synchronized php-1.21wmf2/includes/Revision.php 'logging one more thing.' [22:30:24] Logged the message, Master [22:30:45] ori-l: that wouldn't just be too broad, it would ignore the fact that the script doesn't work IE6-8, Firefox < 4 and older Opera. [22:30:52] ok thats weird [22:30:56] !log aaron synchronized php-1.21wmf2/includes/Revision.php 'done' [22:31:02] ori-l: if you want to use that name, you'll have to quote it [22:31:10] Logged the message, Master [22:31:11] accessing the property: foo['enum'] [22:31:16] in object literal: { 'enum' : .. } [22:31:41] Krinkle: really? i thought "Reserved Words actually only apply to Identifiers (vs. IdentifierNames)" (https://developer.mozilla.org/en-US/docs/JavaScript/Reference/Reserved_Words) [22:31:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.064 seconds [22:32:00] though it's not clear from MDN article if they're referring to ES5 there. [22:32:12] ori-l: Look at the example at the bottom of that article [22:32:26] ori-l: in ES5 it is allowed in (almost) all places. [22:33:03] I admit, its a bit messy on the MDN article. [22:33:15] Let me see if an can pull up the es3 spec [22:33:29] (too bad it doesn't have a nice human readable version like the es5 spec: es5.github.com )_ [22:33:30] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:34:25] Krinkle: curious to you know the answer, but i think i'm going to quote the property regardless, just as another way of flagging the issue. [22:34:40] er, curious to know the answer, i meant. [22:36:57] !log aaron synchronized php-1.21wmf2/extensions/MWSearch/MWSearch_body.php 'use slave DB for revision queries.' [22:37:11] Logged the message, Master [22:39:11] ori-l: also, I'd recommend droppping the mw.foo = mw.foo || {}; pattern. It can be useful sometimes, but it is out of place in this case. [22:39:17] There are 2 files, one extends the other. [22:39:26] they are mis-ordered in the definition file [22:39:40] * ori-l takes a look. [22:39:40] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [22:39:40] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [22:39:40] that's probably why you added the line because it didn't work first [22:39:46] Krinkle: no, automatic habit [22:39:56] aii, dangerous habbit [22:40:02] it masks errors [22:40:04] !log aaron synchronized php-1.21wmf2/extensions/OpenSearchXml/ApiOpenSearchXml.php 'use slave DB for revision queries.' [22:40:18] Logged the message, Master [22:40:35] dataModels is useless without the rest of the module, and it if it to be used outside the module, then it can be namespaces differently. [22:40:42] namespaced* [22:41:18] ori-l: btw, is EventLogging only used in modern browsers? 
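Quoting the property, as agreed above, keeps the script parseable by ES3 engines; in context the two forms Krinkle pasted look like this (illustrative):

    // ES3-safe: 'enum' is a reserved word, so quote it in the literal and use
    // bracket access rather than dot notation. ES5 engines accept either form.
    var field = { 'enum': [ 'draft', 'published' ] };
    var allowed = field['enum'];   // not field.enum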
[22:41:31] I also see JSON.stringify which obviously doesn't exist yet in older browsers (new in HTML5) [22:41:45] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:42:07] binasher: enwiki master should be in euphoria now ;) [22:42:16] Krinkle: that's another excellent point :). [22:42:22] ok [22:42:33] binasher: https://graphite.wikimedia.org/dashboard/temporary-18 [22:42:54] ori-l: actually, I was about to start on another me-style clean up commit. But maybe its best if I let you do it. I don't want to be intrusive. [22:43:08] I'll send you my commit message I formulated so far for what I was about to do. [22:43:25] I thought changes to core made it in 1.21wmf2, but I guess they were after the branch point [22:43:32] use your own judgement of course, they're raw notes :) [22:43:46] Krinkle, that'd be awesome. thanks! [22:45:03] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:47:29] New patchset: CSteipp; "Enable InstantCommons on Wikivoyage beta" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30092 [22:50:00] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:53:19] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:55:19] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:56:36] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [22:58:15] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:58:38] !log olivneh synchronized php-1.21wmf2/extensions/E3Experiments [22:58:53] Logged the message, Master [23:00:52] New review: Kaldari; "Pending update of a couple of the logo graphics (wowiki for example). Nemo says these will be ready ..." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/23985 [23:01:33] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:03:26] binasher, did https://gerrit.wikimedia.org/r/#/c/29883/ get deployed? [23:03:30] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:03:53] MaxSem: yes [23:05:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:08:05] binasher: http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=MySQL%20pmtpa&h=db63.pmtpa.wmnet&v=11872&m=mysql_table_locks_immediate&r=hour&z=default&jr=&js=&st=1351206309&vl=count&ti=mysql_table_locks_immediate&z=large [23:08:11] heh :) [23:09:00] binasher, hmm. I see old behaviour [23:09:20] AaronSchulz: :D [23:09:48] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:10:59] about to run scap [23:11:14] binasher, the manifest doesn't require automatic service restart on redirector changes - maybe, that's the problem? [23:11:45] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:12:38] MaxSem: try now? 
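On Krinkle's JSON.stringify point near the top of this exchange: it is an ES5 addition, so it is missing from the same ES3-era browsers (old IE, Firefox < 4), and code relying on it needs a feature check or a shim. A minimal sketch, not what EventLogging actually ships:

    // Sketch: guard the call so old browsers fail quietly instead of throwing.
    function serializeEvent( event ) {
        if ( typeof JSON === 'undefined' || typeof JSON.stringify !== 'function' ) {
            return null; // or load a json2.js shim before calling this
        }
        return JSON.stringify( event );
    }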
[23:13:31] binasher, now it works, thanks [23:14:03] running scap [23:16:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.986 seconds [23:18:03] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:20:01] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:21:01] !log kaldari Started syncing Wikimedia installation... : [23:21:10] Logged the message, Master [23:23:18] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:26:18] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:28:20] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:31:20] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:31:36] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:36:30] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:41:27] RECOVERY - MySQL Slave Delay on es1001 is OK: OK replication delay NULL seconds [23:42:48] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:44:45] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:46:15] PROBLEM - MySQL Slave Delay on es1001 is CRITICAL: CRIT replication delay 2963988 seconds [23:48:03] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:49:02] !log installing base OS on sq82 [23:49:11] Logged the message, Master [23:51:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:51:06] RECOVERY - Host sq82 is UP: PING OK - Packet loss = 0%, RTA = 2.55 ms [23:53:00] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:55:39] !log kaldari Finished syncing Wikimedia installation... : [23:55:54] Logged the message, Master [23:56:00] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:56:18] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds