[00:01:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds [00:06:35] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [00:06:35] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [00:34:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:48:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [01:21:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:35:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [01:41:14] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 243 seconds [01:47:42] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 12 seconds [02:00:35] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 313 seconds [02:08:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:22:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.033 seconds [02:30:57] !log LocalisationUpdate completed (1.21wmf2) at Mon Oct 22 02:30:57 UTC 2012 [02:31:14] Logged the message, Master [02:37:20] $ git grep -ni php-fatal-error files [02:37:20] files/php/wmerrors.ini:5:wmerrors.message_file=/usr/local/apache/common-local/php-fatal-error.html [02:37:55] maybe i'm just sleepy but I'm having some trouble figuring out where to file php-fatal-error.html in version control. or where to even get a current copy of it [02:38:33] i don't see it in beta labs either [02:38:37] (poking around in the shell) [02:39:53] RECOVERY - Puppet freshness on search13 is OK: puppet ran at Mon Oct 22 02:39:39 UTC 2012 [02:40:20] RECOVERY - Puppet freshness on argon is OK: puppet ran at Mon Oct 22 02:40:19 UTC 2012 [02:41:50] RECOVERY - Puppet freshness on search1008 is OK: puppet ran at Mon Oct 22 02:41:33 UTC 2012 [02:41:50] RECOVERY - Puppet freshness on search1016 is OK: puppet ran at Mon Oct 22 02:41:42 UTC 2012 [02:42:53] RECOVERY - Puppet freshness on sq86 is OK: puppet ran at Mon Oct 22 02:42:21 UTC 2012 [02:43:20] RECOVERY - Puppet freshness on mw7 is OK: puppet ran at Mon Oct 22 02:43:01 UTC 2012 [02:43:20] RECOVERY - Puppet freshness on nitrogen is OK: puppet ran at Mon Oct 22 02:43:06 UTC 2012 [02:43:38] RECOVERY - Puppet freshness on analytics1003 is OK: puppet ran at Mon Oct 22 02:43:26 UTC 2012 [02:44:54] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Mon Oct 22 02:44:49 UTC 2012 [02:45:53] RECOVERY - Puppet freshness on search19 is OK: puppet ran at Mon Oct 22 02:45:47 UTC 2012 [02:48:26] RECOVERY - Puppet freshness on sq77 is OK: puppet ran at Mon Oct 22 02:48:02 UTC 2012 [02:51:32] !log LocalisationUpdate completed (1.21wmf1) at Mon Oct 22 02:51:32 UTC 2012 [02:51:45] Logged the message, Master [02:51:58] RECOVERY - Puppet freshness on sq62 is OK: puppet ran at Mon Oct 22 02:51:25 UTC 2012 [02:53:23] RECOVERY - Puppet freshness on sq76 is OK: puppet ran at Mon Oct 22 02:52:55 UTC 2012 [02:53:50] RECOVERY - Puppet freshness on analytics1009 is OK: puppet ran at Mon Oct 22 02:53:28 UTC 2012 [02:57:38] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [02:57:53] RECOVERY - Puppet freshness on sq75 is OK: puppet ran at Mon Oct 22 
02:57:43 UTC 2012 [02:58:02] RECOVERY - Puppet freshness on stat1001 is OK: puppet ran at Mon Oct 22 02:57:51 UTC 2012 [02:58:11] RECOVERY - Puppet freshness on search1006 is OK: puppet ran at Mon Oct 22 02:57:57 UTC 2012 [02:59:23] RECOVERY - Puppet freshness on search20 is OK: puppet ran at Mon Oct 22 02:59:16 UTC 2012 [03:00:27] RECOVERY - Puppet freshness on marmontel is OK: puppet ran at Mon Oct 22 03:00:03 UTC 2012 [03:02:23] RECOVERY - Puppet freshness on search1005 is OK: puppet ran at Mon Oct 22 03:02:16 UTC 2012 [03:02:50] RECOVERY - Puppet freshness on yvon is OK: puppet ran at Mon Oct 22 03:02:39 UTC 2012 [03:05:52] RECOVERY - Puppet freshness on sq51 is OK: puppet ran at Mon Oct 22 03:05:24 UTC 2012 [03:05:52] RECOVERY - Puppet freshness on kaulen is OK: puppet ran at Mon Oct 22 03:05:32 UTC 2012 [03:06:26] RECOVERY - Puppet freshness on search1024 is OK: puppet ran at Mon Oct 22 03:05:54 UTC 2012 [03:06:26] RECOVERY - Puppet freshness on search24 is OK: puppet ran at Mon Oct 22 03:06:15 UTC 2012 [03:07:58] RECOVERY - Puppet freshness on search32 is OK: puppet ran at Mon Oct 22 03:07:32 UTC 2012 [03:21:47] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 21 seconds [03:31:15] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [03:32:59] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:33:07] LeslieCarr: ^ [03:33:21] (lvs) [04:04:30] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [04:30:35] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [04:30:35] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [05:05:32] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [05:55:11] PROBLEM - MySQL disk space on db22 is CRITICAL: DISK CRITICAL - free space: /a 18374 MB (3% inode=99%): [06:30:35] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [06:37:02] PROBLEM - MySQL disk space on db22 is CRITICAL: DISK CRITICAL - free space: /a 18465 MB (3% inode=99%): [06:44:24] RECOVERY - MySQL disk space on db22 is OK: DISK OK [07:11:20] apergos: ping? [07:18:18] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [07:19:47] ponggg [07:19:52] paravoid: [07:20:10] you saw the ulimit email from tim I guess? [07:22:00] yeah, we already chatted about it yesterday [07:22:06] great [07:23:04] so, [07:24:19] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [07:29:42] so...? [07:30:12] do we think the upgrade on swift made ms-fe1 happy? I realize a bunch of other stuff happened in the meantime [07:30:12] ms-fe1 is looking good [07:31:38] so maybe we want to move over the other proxy servers in a day or two [07:31:50] what other stuff? [07:31:56] I did the 1.7.4 upgrade [07:32:27] I mean we had image scalers rebooting and etc [07:32:33] other sorts of failures [07:32:48] nah, that's completely separate [07:32:55] I did not keep good track, I was mostly afk [07:33:00] it just was a busy weekend for me [07:33:02] just saw lots of activity [07:33:23] the leak is looking good [07:33:24] great [07:33:36] so, this week we have to a) upgrade (some of) the rest of the proxies to precise/1.7.4 [07:33:59] b) provision the R720xd for the replacement of the 4 broken servers [07:34:11] did those come in?? [07:34:12] will you work this week or are you too busy with bureaucracy?
[07:34:18] oh I'm here [07:34:21] they came but Chris was in eqiad last week [07:34:24] ok [07:34:27] I think they're going to get racked up this week [07:34:36] so I'm planning for it [07:34:36] I expected I would work with him to get those set up [07:34:37] do you want to split some of the work there? [07:34:39] okay [07:34:41] I can do the proxies [07:34:50] sure [07:34:50] and communicate with swiftstack about that [07:34:51] if you need a hand you can let me know [07:34:55] and the rest of the issues I sent them [07:35:05] am I on cc on those emails? [07:35:12] hm, lemme see [07:35:13] (now you know how far behind I am in my emails :-P) [07:35:18] I think you are [07:35:19] but let me double-check [07:36:07] okay, you were in one, about the DELETEs and the posix_fadvise() bugs ("Two performance-related issues (non-sync-related)") [07:36:14] but not in the other because I just replied-all [07:36:16] let me fwd that [07:36:30] hello [07:36:37] hi hashar [07:36:41] I've seen your mail [07:36:46] but I'm trying to make a plan for the week (see above) [07:36:54] before I reply :) [07:37:08] I sent it out in a rush just before leaving my coworking place, it was probably unclear :-/ [07:37:09] I saw the posix_fadvise stuff [07:37:27] apergos: okay, sent [07:37:29] maybe I'm not so far behind after all [07:37:31] thanks! [07:37:38] sorry for not originally including you [07:37:42] no worries [07:37:59] ah let me send you the one that I replied to, too [07:38:05] from John [07:38:19] are you thinking you'll leave one proxy on 1.5 (so rings can be built on it) for this week? [07:38:22] paravoid: ping me whenever you have some time to chat :-) [07:38:26] yeah, and for contingency [07:38:29] yep [07:38:33] sounds great [07:38:33] we can always build rings in backends [07:38:47] but I'd like to play it safe and keep a 1.5 proxy for now [07:38:54] hashar: shoot, I can multitask :) [07:41:15] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [07:41:21] paravoid: as I understand it, ops are upgrading servers to Precise by reinstalling them from scratch. Which is a good thing :-) [07:42:03] paravoid: we will have to back up some directories though since some stuff is not in puppet (such as jenkins build data) and some files served on the integration.mediawiki.org site. [07:42:11] paravoid: and I need to package the Android SDK too :-] [07:43:19] I have listed a few steps in https://wikitech.wikimedia.org/view/Gallium/Upgrade_to_Precise [07:46:30] ah are you moving to precise at the same time for the proxy servers you upgrade btw?
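A side note on hashar's 07:42 remark: since the jenkins build data and the files served on integration.mediawiki.org live outside puppet, the pre-reinstall backup can be little more than a couple of rsync pulls. This is only a sketch; the directory paths and the destination are assumptions, not taken from the log:

    # pull gallium's unpuppetized state somewhere safe before reimaging
    rsync -avz gallium.wikimedia.org:/var/lib/jenkins/ /backup/gallium/jenkins/
    rsync -avz gallium.wikimedia.org:/srv/integration/ /backup/gallium/integration/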
[07:55:12] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [07:55:53] apergos: yes [07:58:18] sweet [08:45:18] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [08:55:59] New patchset: Mark Bergsma; "Add Range support to Varnish in streaming mode" [operations/debs/varnish] (patches/streaming-range) - https://gerrit.wikimedia.org/r/29273 [08:56:18] \o/ [08:58:16] Change merged: Mark Bergsma; [operations/debs/varnish] (patches/streaming-range) - https://gerrit.wikimedia.org/r/29273 [08:58:58] Change abandoned: Mark Bergsma; "Pushed into new branch patches/streaming-range instead" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/28379 [09:25:28] New patchset: ArielGlenn; "allow rsync from local filesystem, not just from remote host" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/29278 [09:35:33] !log Built new varnish 3.0.3plus~rc1-wm2 packages and inserted them into the precise-wikimedia APT repository [09:35:46] Logged the message, Master [09:51:24] mark: have a min to give a second opinion? [09:51:33] yes [09:51:33] https://wiki.ubuntu.com/ServerTeam/CloudArchive [09:51:56] so, Ubuntu's having an extra official repository, in which they'll add new Openstack releases [09:52:14] I'm wondering if I should get the packages from there and put them in our apt or just use that [09:52:30] if it follows the same practices as the ubuntu archives themselves pretty much... [09:52:33] I think the second's better [09:52:42] well, kind of :) [09:52:47] yeah [09:52:56] they have a section per ubuntu release per openstack release [09:53:05] but they do use the same SRU policy [09:53:14] (stable release updates) [10:00:28] paravoid: could you have a look at https://gerrit.wikimedia.org/r/#/c/28208/ at some point, as soon as its merged in i can test it on labs again and if all works will push a change to enable it on tmh1/2 [10:04:26] I cringe when I see the "class foo{" syntax, but it's like that everywhere in the file [10:07:14] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [10:07:14] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [10:07:42] Change abandoned: J; "upload is also an independent class that needs to be enabled on production, so leaving it out is mor..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28341 [10:08:11] New review: Faidon; "Looks good. Let's see how it'll work on labs :)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/28208 [10:08:12] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28208 [10:08:43] j^: :) [10:09:08] which TZ are you in btw? [10:14:48] meh [10:14:55] how do I select a random backend in VCL [10:16:29] paravoid: right now in Berlin(CEST) [10:22:37] oh well, get an error on labs: manifests/role/applicationserver.pp at line 118; cannot redefine at manifests/role/applicationserver.pp:101 [10:23:07] is class {"::jobrunner": loading the local jobrunner class and not the top level one? [10:25:53] no that can't be [10:25:55] let me check [10:26:00] in that case jobrunner would loop [10:29:38] which machine is that? [10:30:02] deployment-video05 [10:30:20] switching to puppetmaster::self right now so i can change things locally [10:30:27] did you do it already? 
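mark's 10:14 question ("how do I select a random backend in VCL") is usually answered with a random director. A minimal Varnish 3 sketch, with invented backend names rather than anything from the real upload VCL; the backends themselves are assumed to be declared elsewhere:

    director storage_random random {
        { .backend = be_sda3; .weight = 1; }
        { .backend = be_sdb3; .weight = 1; }
    }

    sub vcl_recv {
        set req.backend = storage_random;
    }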
[10:30:37] I wanted to do a puppet run to see the error myself [10:30:51] too late :) [10:31:06] New patchset: Mark Bergsma; "Cleanup, fix regexp match length" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29285 [10:31:23] paravoid: https://textb.org/t/puppeterror/ here the error [10:32:06] New patchset: Mark Bergsma; "Define separate storage backends for large objects" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29286 [10:32:54] what a mess [10:33:00] paravoid: also now switch to puppetmaster::self is done so the error shows up again on deployment-video05 [10:33:07] New patchset: Mark Bergsma; "Select a big-object storage backend for objects > 100 MB" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29287 [10:34:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29285 [10:34:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29285 [10:34:03] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/29286 [10:34:03] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/29287 [10:36:51] strange [10:39:23] it's the jobrunner indeed [10:39:30] but it doesn't make any sense [10:39:37] because jobrunner should also loop that way [10:42:23] smells like a puppet bug [10:42:26] one way would be to not have the classes written in a nested way but put them all toplevel in the file as role::applicationserver::jobrunner etc [10:45:07] if its a puppet bug, possibly moving videoscaler above the local jobrunner helps [10:45:38] I think it's a puppet bug [10:45:59] class { "::jobrunner": } has no reason to load role::applicationserver::jobrunner [10:47:29] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Invalid parameter type at /etc/puppet/manifests/role/applicationserver.pp:129 on node deployment-video05.pmtpa.wmflabs [10:47:36] that's if I try to run class { "jobrunner": } [10:47:42] include even [10:47:50] because role::applicationserver::jobrunner isn't parameterized [10:48:13] so, it loads the correct jobrunner, the top-scoped one [10:48:21] but then complains for a duplicate definition of another include [10:48:25] that's insane [10:56:23] hmmm [11:09:53] * j^ is reading https://groups.google.com/forum/?fromgroups=#!topic/puppet-users/ZYggexu5T2U and gets more and more confused about puppet parameterized classes [11:10:16] puppet's scoping is insane [11:10:21] they're supposed to have fixed that at 3.0 [11:10:31] but I'm scared to read more about it [11:10:45] what we're experiencing is a bug I think [11:11:28] I've commented-out everything but class { "::jobrunner" } [11:11:33] and now I get [11:11:35] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: Class[Jobrunner] is already defined in file /etc/puppet/manifests/role/applicationserver.pp at line 131; cannot redefine at /etc/puppet/manifests/role/applicationserver.pp:103 on node deployment-video05.pmtpa.wmflabs [11:11:42] which is *crazy* [11:19:39] New patchset: Mark Bergsma; "Define separate storage backends for large objects" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29288 [11:20:41] Change 
abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29286 [11:20:42] Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29287 [11:20:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29288 [11:21:39] New patchset: Mark Bergsma; "Define separate storage backends for large objects" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29288 [11:22:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29288 [11:23:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29288 [11:31:02] j^: look what I've done on the machine itself [11:31:05] and how that fails [11:31:07] yay for puppet bugs [11:45:47] New patchset: Mark Bergsma; "Upgrade Varnish on all upload caches to 3.0.3~rc1-wm2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29290 [11:46:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29290 [11:47:17] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29290 [11:52:51] paravoid: still fails with the same error? i would understand it if both role::applicationserver::videoscaler and role::applicationserver::jobrunner are loaded since one can only have one declaration of a parameterized class but we only load role::applicationserver::videoscaler [12:00:34] what i dont understand is that jobrunner fails now but role::applicationserver::common did not cause the same issues before, could be that this is because role::applicationserver::common is inside the same scope but jobrunner is not [12:02:46] paravoid: if you close your vi I would try moving ::jobrunner into role::applicationserver::jobrunnerr and include that in role::applicationserver::jobrunner and role::applicationserver::videoscaler [12:05:26] daughter is sick so I must move out early :/ [12:05:29] will be back tonight [12:13:39] PROBLEM - Host zinc is DOWN: PING CRITICAL - Packet loss = 100% [12:20:12] New patchset: J; "move ::jobrunner to role::applicationserver::jobs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29292 [12:21:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29292 [12:21:37] paravoid: ^ pushed a change that works on video05, what do you think? [12:22:45] New patchset: Mark Bergsma; "Support range requests as uncached passes on the frontend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29293 [12:23:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29293 [12:24:50] New patchset: Mark Bergsma; "Support range requests as uncached passes on the frontend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29293 [12:26:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29293 [12:26:28] New patchset: Mark Bergsma; "Support range requests as uncached passes on the frontend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29293 [12:27:24] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29293 [12:32:38] New patchset: Mark Bergsma; "Support range requests as uncached passes on the frontend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29293 [12:33:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29293 [12:33:49] New patchset: Mark Bergsma; "Support range requests as uncached passes on the frontend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29293 [12:34:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29293 [12:35:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29293 [12:38:40] New patchset: Mark Bergsma; "Correct package version" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29296 [12:39:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29296 [12:51:03] New patchset: Mark Bergsma; "Collect Via headers into one" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29297 [12:52:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29297 [12:52:13] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29297 [12:58:12] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [12:59:42] New patchset: Mark Bergsma; "Revert "Collect Via headers into one" - doesn't work in vcl_deliver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29301 [13:01:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29301 [13:03:18] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:04:57] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.368 second response time [13:10:48] PROBLEM - Apache HTTP on mw45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:12:20] RECOVERY - Apache HTTP on mw45 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.187 second response time [13:12:54] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:14:07] New patchset: Mark Bergsma; "Cleanup, logging to shmlog" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29305 [13:15:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29305 [13:15:50] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29305 [13:16:21] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:17:42] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [13:17:54] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [13:28:38] New patchset: Mark Bergsma; "Fix regexes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29307 [13:29:36] planning on upgrading more imagescalers. any objections? [13:29:46] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29307 [13:29:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29307 [13:30:00] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:30:03] the upgrade seems to be good on srv190 (I've been grepping through the logs a fair amount) [13:30:27] mark: ^ [13:30:49] go ahead [13:30:54] excellent [13:31:30] RECOVERY - Apache HTTP on mw26 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.232 second response time [13:31:57] New patchset: Mark Bergsma; "Fix missing semicolon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29308 [13:32:09] I wonder how many open scaler-esque bugs we have in BZ atm [13:32:26] !log removing srv219 and srv220 from rendering pool for upgrade to precise [13:32:38] Reedy: I can only imagine... there are like 6 rt tickets... [13:32:38] Logged the message, notpeter [13:33:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29308 [13:34:40] https://bugzilla.wikimedia.org/show_bug.cgi?id=36623 [13:35:12] 31122, 34792, 35622, 36580 at least [13:35:22] ja [13:35:22] soon! [13:35:22] 38010 is related, but not an "upgrade to fix" [13:36:04] I'm super stoked about getting memcache off of our apache boxes... [13:36:10] de-linking those things will be very nice [13:49:31] New patchset: Pyoungmeister; "setting srv219 and srv220 to use appalicationserver modules" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29310 [13:50:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29310 [13:54:00] PROBLEM - Host srv220 is DOWN: PING CRITICAL - Packet loss = 100% [13:55:53] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29310 [13:57:18] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:58:48] PROBLEM - Apache HTTP on srv219 is CRITICAL: Connection refused [13:58:48] PROBLEM - SSH on srv219 is CRITICAL: Connection refused [13:58:48] RECOVERY - Apache HTTP on mw21 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.520 second response time [13:59:43] RECOVERY - Host srv220 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [14:00:03] New patchset: Pyoungmeister; "setting srv220 as ganglia aggregator for imagescalers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29312 [14:01:06] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29312 [14:01:53] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29312 [14:03:10] PROBLEM - Apache HTTP on srv220 is CRITICAL: Connection refused [14:03:38] RECOVERY - SSH on srv219 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:04:05] PROBLEM - SSH on srv220 is CRITICAL: Connection refused [14:05:15] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [14:07:12] RECOVERY - SSH on srv220 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:08:51] New patchset: Mark Bergsma; "Add Range support to Varnish in streaming mode" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29315 [14:08:51] New patchset: Mark Bergsma; "varnish (3.0.3plus~rc1-wm2) precise; urgency=low" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29316 [14:09:30] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29315 [14:09:46] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29316 [14:10:12] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.018 seconds [14:14:09] j^: ping? [14:16:00] it's currently taking about 6s for requesting a range just under the 64 MB threshold [14:16:04] that's a bit long [14:16:04] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.009 seconds [14:16:04] i'm gonna halve that now [14:16:08] to 32 MB [14:17:33] paravoid: yes [14:18:27] PROBLEM - NTP on srv219 is CRITICAL: NTP CRITICAL: Offset unknown [14:19:10] New patchset: Mark Bergsma; "Halve the 64 MB stream/range pass threshold to 32 MB" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29317 [14:19:36] so, what should I see? [14:20:20] New patchset: Mark Bergsma; "Halve the 64 MB stream/range pass threshold to 32 MB" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29317 [14:20:35] paravoid: https://gerrit.wikimedia.org/r/#/c/29292/ [14:20:51] that way i can run puppetd again [14:21:19] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29317 [14:21:39] PROBLEM - Apache HTTP on srv219 is CRITICAL: Connection refused [14:25:48] PROBLEM - Apache HTTP on srv220 is CRITICAL: Connection refused [14:30:00] PROBLEM - NTP on srv220 is CRITICAL: NTP CRITICAL: Offset unknown [14:31:12] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [14:31:12] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [14:34:39] RECOVERY - NTP on srv219 is OK: NTP OK: Offset -0.03080141544 secs [14:38:53] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.050 second response time [14:39:49] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time [14:39:54] RECOVERY - NTP on srv220 is OK: NTP OK: Offset -0.02454566956 secs [14:40:50] New review: Faidon; "Nak, this does not belong to a role class." [operations/puppet] (production); V: 0 C: -2; - https://gerrit.wikimedia.org/r/29292 [14:46:11] paravoid: got other ideas how to work around the puppet limitations in that case? 
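For readers following along, this is roughly the shape of the conflict being debugged; a minimal reconstruction, not the real manifests, and the queue parameter is invented:

    class jobrunner($queue = "default") {
        # installs and configures the job runner daemon
    }

    class role::applicationserver::jobrunner {
        class { "::jobrunner": queue => "default" }
    }

    class role::applicationserver::videoscaler {
        class { "::jobrunner": queue => "webVideoTranscode" }
    }

    # Applying only role::applicationserver::videoscaler to a node should be
    # legal, yet the pre-3.0 puppetmaster (see the 11:10 remarks) rejects it
    # with "Duplicate definition: Class[Jobrunner] is already defined", as if
    # the sibling role class were also being evaluated.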
[14:48:14] hrm [14:48:21] I wonder where does jobrunner belong [14:48:34] notpeter: so, I think this falls under your area too, so you should probably join this conversation [14:48:45] since you're around too :) [14:48:56] sup [14:49:05] so, we have a very nice puppet bug it seems [14:49:17] role::application::jobrunner has class { "::jobrunner": ... } [14:49:25] and the new role::application::videoscaler also has class { "::jobrunner": ... } [14:49:37] this conflicts for whatever puppet bug [14:49:37] why? [14:49:42] why what? [14:49:51] why does it have the jobrunner class? [14:50:20] I guess because we'll do transcoding in jobs? [14:50:38] you can't really transcode hour-long videos in realtime :) but j^ would know more of the architecture [14:51:07] I mean, don't we not want jobrunner procs on any boxes but the jobrunners? [14:51:09] the jobrunner class is the infrastructure to run jobs [14:51:13] video transcodes are jobs [14:51:14] yes [14:51:21] so it needs the jobrunner class [14:51:27] but shouldn't that run on dedicated boxes? [14:51:27] yes [14:51:37] those boxes would have role::application::videoscaler [14:51:55] but the puppet bug prevents role::application::videoscaler from including ::jobrunner [14:52:11] or puppet design if i understand it right [14:52:25] so the videoscalers are significantly different than the imagescalers? [14:52:27] you can not have 2 declarations of the same class inside of one toplevel object or something like that [14:52:40] videoscalers are more like jobrunners [14:52:50] ok, this is making more sense now [14:52:51] cool [14:53:10] initially there was some idea that they would also do thumbnails but right now the idea is that the imagescalers can do that [14:53:11] (sorry, needed to figure out the context :) ) [14:53:19] ok, cool [14:54:30] notpeter: it's nice to hear the context too, I've always just assumed :-) [14:54:44] can you show me what your current working version is? [14:54:59] it's merged [14:55:04] in puppet [14:55:05] ok, cool [14:55:17] can you paste in the output from the puppet run? [14:55:41] there the plan is to run that puppet class on the tmh1 and tmh2 servers [14:55:45] they are already up [14:55:59] but need the right class to work [14:56:19] currently role::application::videoscaler looked like the right name but we now have the issue with ::jobrunner [14:56:28] notpeter: I've debugged it extensively [14:56:33] my conclusion is that puppet's buggy [14:56:47] so I could move role::application::videoscaler out of role::application or make ::jobrunner part of role::application [14:56:57] so we have to work around it, but I didn't like j^'s hack [14:57:09] (move the whole jobrunner class under the role class) [14:57:16] so, one thing that has me wondering is [14:57:18] https://gerrit.wikimedia.org/r/#/c/29292/ [14:57:20] ah, yes, I saw that [14:57:28] jobrunner falls under the role::applicationserver [14:57:38] but jobrunner is a top-level class? should it perhaps be mediawiki::jobrunner? [14:57:57] paravoid: could be [14:58:02] the lines are very blurry here.... [14:58:52] but yeah, that sounds reasonable [14:58:55] one of you said it should be top-level, can try if mediawiki:: works [14:58:57] you've kinda worked on that area lately, that's why I pinged you [14:59:55] paravoid: yeah. I mean, where things belong most is a judgement call... it's like var naming.
there are always several good options :) [15:00:18] but yeah, mediawiki::jobrunner sounds reasonable to me [15:00:32] same here [15:00:50] paravoid: hey do you have a few minutes to talk about gallium? [15:00:59] hashar: yes [15:01:06] sorry for not replying earlier [15:01:13] it is ok :-) [15:01:23] so, from what I understand, there are still a few things to puppetize from your side [15:01:27] then had to get my daughter from her nanny (daughter is sick hehe) [15:01:36] I'm wondering if we should just dist-upgrade the box... [15:01:43] Gallium is mostly about Jenkins [15:01:47] which I have installed using the puppet class [15:01:56] though we need PHPUnit to be installed from pear [15:02:05] and have to migrate the old build data [15:02:14] argh [15:02:15] I wrote a basic guideline at http://wikitech.wikimedia.org/view/Gallium/Upgrade_to_Precise [15:02:17] pear? really? [15:02:24] yeah [15:02:24] no packages for that either? [15:02:29] the versions provided by Ubuntu are too old [15:02:39] so I ask from time to time for someone to upgrade from pear [15:03:06] why haven't we built packages for this? [15:03:06] as we do with pretty much everything else? [15:03:36] I mean, the Android SDK sounds special and we can discuss it, but PHP extensions aren't [15:03:39] we use a ton [15:03:54] it is not really an extension [15:04:01] it is a PHP software [15:04:21] !log reedy synchronized php-1.21wmf2/extensions/ExtensionDistributor [15:04:23] though we can probably refresh the ubuntu package easily [15:04:33] Logged the message, Master [15:05:05] anyway, all of that kinda makes me want to do a dist-upgrade [15:05:11] not sure what we usually do on misc boxes [15:05:25] no idea either [15:05:30] but that is probably less trouble after all [15:05:47] paravoid: we've mostly used upgrades as a reason to puppetize and have reimaged [15:05:53] unless the upgrade from Lucid to Precise is troublesome [15:05:54] except for when it's really annoying ;) [15:06:00] hehehe [15:06:00] yep [15:06:22] notpeter: we are talking about gallium the continuous integration server [15:06:22] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [15:06:36] which is not entirely puppetized / packaged :-/ [15:06:41] gotcha [15:08:49] paravoid: the debian package has a `watch` file pointing to github, so it is probably easy to refresh it [15:09:06] dh $@ --buildsystem=phppear --with phppear [15:09:12] debian rocks [15:10:09] New patchset: Pyoungmeister; "removing notify => Service[apache] from all manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29325 [15:11:27] paravoid: renaming to mediawiki::jobrunner works too (tested on video05) https://gerrit.wikimedia.org/r/29324 [15:11:39] New patchset: J; "rename ::jobserver to mediawiki::jobserver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29324 [15:13:02] New review: Faidon; "Yes!" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/29325 [15:13:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29325 [15:13:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29324 [15:18:48] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29325 [15:19:02] notpeter: what's the status of mediawiki_new? [15:19:07] should that go to mediawiki_new?
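A side note on hashar's packaging remarks at 15:08-15:09: a pear-based refresh usually needs little more than the two files below. The rules line is quoted from the log; the watch file is a guess at what "pointing to github" would look like for phpunit:

    # debian/rules (the recipe line must be tab-indented)
    #!/usr/bin/make -f
    %:
            dh $@ --buildsystem=phppear --with phppear

    # debian/watch
    version=3
    https://github.com/sebastianbergmann/phpunit/tags .*/(\d[\d.]+)\.tar\.gz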
[15:19:11] yes [15:19:19] that's the stuff that's being used on precise [15:19:21] (j^ is going to kill me for suggesting it :P) [15:19:27] playing ping-pong with puppet classes [15:19:31] is called that for namespace reasons.... [15:19:49] well, we can have everything not be terrible once all apaches are on precise [15:20:02] yeah, sorry for not being more clear earlier... [15:20:05] videoscalers are targeting precise exclusively afaik [15:20:19] yeah [15:20:27] so that would be the correct one to use [15:20:33] j^: sorry :> [15:21:00] ok will update shortly, talking to robla right now [15:21:11] can't puppet be made to report the command std err / std out instead of the useless "returned 1 instead of one of [0]" ? [15:21:24] there was a bug about it iirc [15:21:33] yeah it can [15:21:39] log_output = true or something like that [15:21:43] yeah [15:21:52] that reminds me of something [15:22:28] mark: as an environment variable for puppetd -tv ? [15:22:36] http://projects.puppetlabs.com/issues/2359 that's the bug I was referring to [15:22:45] logoutput => true used to log just stdout, not stderr [15:22:50] but I see that this is fixed now [15:23:11] hashar: no, just exec { "...": ..., logoutput => true } [15:23:34] which means I need to change the manifest to find out what is going wrong :-] (or run the command line manually) [15:24:00] apparently defaults to on_failure in latest puppet [15:25:40] New patchset: Jgreen; "configure db1013 as a temporary fundraising db slave" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29327 [15:26:40] New patchset: Pyoungmeister; "adding apache scorecard stats to ganglia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28544 [15:27:50] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab.." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/29327 [15:27:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28544 [15:29:25] New patchset: Hashar; "misc::builder requires libcrypt-ssleay-perl package" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29328 [15:29:47] a simple package dependency at https://gerrit.wikimedia.org/r/29328 :-] [15:30:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29328 [15:32:02] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28544 [15:33:57] New patchset: Jgreen; "configure db1013 as a temporary fundraising db slave + fix typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29327 [15:34:55] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab.." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/29327 [15:39:07] New patchset: Jgreen; "configure db1013 as a temporary fundraising db slave + fix typo + arghbhabarhaga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29327 [15:40:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29327 [15:43:07] !log re-adding srv219 and srv220 to rendering pool [15:43:15] Logged the message, notpeter [15:43:40] notpeter: what's the situation that prompted you to tangle with apache-graceful-all?
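In context, the knob mark points hashar at (15:23:11) looks like the following; the resource name and command are invented for illustration:

    exec { "refresh-builder":
        command   => "/usr/local/bin/refresh-builder.sh",
        # print the command's stdout/stderr in the agent output instead of
        # the useless "returned 1 instead of one of [0]"; newer puppet
        # defaults this to on_failure, per the 15:24:00 comment
        logoutput => true,
    }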
[15:44:32] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29327 [15:45:03] Jeff_Green: I wanted to update the apache2.conf [15:45:07] to add some ganglia stuff [15:45:18] and I was staring at it and being like "I don't want to merge this....." [15:45:28] chance of disaster: high [15:45:38] ah, you want to stage it? [15:45:49] yeah [15:46:03] makes sense [15:46:04] and notify service apache just does a restart of apache, not a graceful [15:46:12] (although that could be added as an exec) [15:46:24] paravoid: so what do we do for gallium after all ? :-) [15:46:38] hashar: puppetize as much as you can [15:46:52] then we'll see, but probably a dist-upgrade [15:46:55] fine [15:46:57] I'll try to enlist someone else's help too [15:47:09] my main issue will be with the Debian packages [15:47:20] to pass it to someone else I mean, to avoid delaying it any longer [15:47:20] notpeter: that could probably be modified, but even so it sucks to have no way to stage changes [15:47:33] just spent 20 minutes trying to update the phpunit one without any success :-( [15:47:42] uscan / uupdate do not like me heh [15:48:15] !log reedy synchronized php-1.21wmf2/extensions/ExtensionDistributor [15:48:18] Jeff_Green: yeah... there seem to be a number of non-ideal solutions. I think that this is at least the safest [15:48:29] Logged the message, Master [15:48:33] and I think I just messed up ganglia on the apaches..... [15:48:35] wee! [15:49:11] you could remove the notify temporarily, while you stage and test, then enable it when you're happy with the result [15:49:25] yeah, that may be what ends up happening [15:49:53] although, imo letting the randomness of puppet decide when to restart apaches (even gracefully) is fugly [15:50:07] that is also true.... [15:50:15] I'd rather have to push a go button and do them all at once [15:52:20] ya, perhaps a couple at a time [15:53:10] New patchset: Pyoungmeister; "replaceing content with source for conf file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29332 [15:53:14] yeah, i want to write some kinda rolling restart script [15:53:17] where it restarts some [15:53:24] and then asks if you want to continue [15:53:29] could be handy tool [15:53:38] sure [15:54:09] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29324 [16:01:17] !log reedy synchronized php-1.21wmf2/extensions/ExtensionDistributor [16:01:30] Logged the message, Master [16:02:14] !log reedy synchronized wmf-config/ [16:02:27] Logged the message, Master [16:03:18] PROBLEM - Host db1013 is DOWN: PING CRITICAL - Packet loss = 100% [16:06:00] RECOVERY - Host db1013 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [16:09:14] New patchset: Pyoungmeister; "adding apaches::ganglia to imagescalers, api, bits, and test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29334 [16:10:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29334 [16:10:38] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29334 [16:14:44] New patchset: Pyoungmeister; "correcting port for apache server status" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29335 [16:15:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29335 [16:16:13] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29335 [16:24:17] j^: no it doesn't work that way [16:24:21] you have to move it to the module [16:24:39] are you tired by this shit already and you want me to do it for you? :) [16:28:00] New patchset: J; "rename ::jobserver to mediawiki_new::jobserver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29324 [16:28:14] paravoid: pushed one more try, after that you can do it. :) [16:29:10] heh, yeah, it needs another fix :) [16:29:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29324 [16:29:26] templates & files go in the module now [16:31:16] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [16:46:21] New patchset: J; "rename ::jobserver to mediawiki_new::jobserver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29324 [16:46:59] !log stopping mysql on db1025 to clone it to db1013 [16:47:13] Logged the message, Master [16:47:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29324 [16:49:11] py is doing a graceful restart of all apaches [16:49:25] !log py gracefulled all apaches [16:49:38] Logged the message, Master [16:51:56] PROBLEM - mysqld processes on db1025 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [17:19:12] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [17:25:12] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [17:28:46] Whom should I ping specifically about an IPv6 issue? https://bugzilla.wikimedia.org/show_bug.cgi?id=41270 "No IPv6 addresses on Wikimedia nameservers" [17:30:10] Probably should just have an RT ticket made and ops will deal with it... [17:30:52] New patchset: CSteipp; "LBFactory_Multi setup for labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29344 [17:35:37] Reedy: bad answer, apparently [17:39:59] K4-713: can you give me region as well? [17:40:11] yep [17:40:25] wait1 [17:42:20] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [17:42:52] notpeter: Did I forget the sudo? 
[17:43:05] Reedy: would seem so [17:56:15] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [18:10:04] ottomata: so the analytics c2100s [18:10:18] we dont have SSDs in there do we? (i dont think we do) [18:10:45] ummmm are there SSDs in the ciscos? [18:10:50] either the ciscos or the c2100s have ssds [18:10:52] can't remember which [18:11:13] you guys only spun up one of the c2100s [18:11:27] but either way, doesnt matter, cmjohnson1 and i were discussing the 720 replacements [18:11:42] but we are just going to require them to have the 2.5" internal bracket [18:11:46] i bet it was the dell's [18:11:56] i think so too [18:12:10] i have put ssd's in cisco, but it was for something else i think. [18:12:46] soon, soon the c2100s will be an unhappy memory [18:12:48] =P [18:15:27] hah, yeah [18:15:35] RobH, we have 12 of the C2100s running [18:15:41] umm, no 11 [18:15:43] analytics1011-1022 [18:16:10] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: enwiki to 1.21wmf2 [18:16:10] 12 [18:16:12] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.156 second response time [18:16:14] uhhh [18:16:24] Logged the message, Master [18:16:25] 1011-1022 [18:16:46] riiiigh, wait I am forgetting about 23-28 [18:17:12] analytics has a total of 27 boxes [18:17:27] 1001-1027 [18:17:34] 1001-1010 are cisco [18:17:48] 1011-1022 are c2100 [18:17:54] 1023-1027 are r310 [18:18:02] ahhh, right ok [18:18:04] danke [18:18:08] So my understanding is I am unracking 1011 LAST [18:18:16] and the rest can be pulled when i get to them [18:18:30] aaaand, i think the c2100s are SSDs (I don't actually care about SSD so much, we want sppaaaaaace) [18:18:30] ottomata: if you have data you need pulled off those, do so today please =] [18:18:36] naw, its cool, they can go anytime [18:18:40] cool [18:18:44] i'm running stuff on various ones [18:18:45] you want 1011 to stay till last right? [18:18:59] but its all volatile [18:19:05] it doesn't really matter [18:19:07] cool, im going to start unracking and boxing them tomorrow [18:19:13] they will go back next week [18:19:16] a couple of weeks ago we were doing stuff, but we have the data already [18:19:32] and the 720 replacements are scheduled to ship, but we are putting a slight hold while we get dell to add the 2.5" internal brackets to the order [18:19:44] I would assume about three weeks from now [18:19:49] but perhaps four if things go horribly wrong [18:20:02] what does the 2.5 bracket stuff mean for us? [18:20:06] on the plus side, the 720 seems a hell of a lot more stable (plus its part of the R line we usually use) [18:20:26] so these are hot swap disk slots, a bunch of 3.5" disks on front [18:20:28] much like the r510s for databases [18:20:32] and the c2100s were [18:20:44] but just like both of those, we want the ability to add in up to 2 2.5" disks internally [18:20:54] Ah cool, so [18:20:54] just means that dell includes a hardware mount inside the chassis [18:20:55] 2 2.5s for hotswap? [18:21:00] OS can go on those or something? [18:21:01] the internals dont hot swap [18:21:16] on the swift hosts its for quick data caching on ssd [18:21:22] oh!
[18:21:25] hm [18:21:26] OS on that for server is usually not worth it [18:21:40] didn't realize that meant the 2.5s were going to be SSD [18:21:40] not saying we have to include or will in the analytics hosts [18:21:41] nonono [18:21:43] right [18:21:45] um [18:21:45] ha [18:21:47] Ok, let me explain [18:21:54] you dont have SSDs now that i know of [18:21:55] because we didnt order them for you [18:22:00] i think we do... [18:22:04] (now, perhaps we have a single machine with one cuz you asked) [18:22:08] not that we need them, but I think we do [18:22:08] oh? [18:22:08] hmm [18:22:11] i didnt install them. [18:22:13] so i dunno how [18:22:13] dschoon, do you recall? [18:22:22] (that i recall) [18:22:22] i remember having to think about this when we set these up [18:22:23] buh. *reads* [18:22:31] do we have SSDs on any of our machines? (dells?) [18:23:20] i think the C2100s we have do, because we are not doing mirrored RAID for OS on them like we are the Ciscos [18:23:24] ..... [18:23:37] hrmm, lemme look [18:23:39] here's what i recall, but i never interacted with them. [18:23:52] I am pretty sure that all/some of the C2100s have an SSD in them [18:24:18] the plan was to use it for the Write Ahead Log, as that's the single IO bottleneck for all the distributed datastores [18:24:26] so the faster you can fsync it, the better [18:24:50] but! i believe the SSDs that went into the boxes were the same that the database servers usually have [18:24:54] and they do not respect fsync [18:25:09] (they ACTUALLY LIE TO YOU when you call it) [18:25:22] hrmm, i think you may be right [18:25:28] so, 1. if they *do* have SSDs, and i think they do, you are welcome to take them back [18:25:42] i see 8 seagate 2tb disks [18:25:47] 2. we *do* need space badly, so if we have normal platters lying around, that'd be awesome [18:25:47] and a single toshiba 160gb [18:25:50] right. [18:25:58] that toshiba is the SSD, yes? [18:26:01] dschoon: Ok, so we are going to pull those out of the c2100s, as they dont go back to dell with them [18:26:05] correct [18:26:10] and we will put them into the new 720s that come in [18:26:16] those came from the closet where we have just piles of them [18:26:18] cmjohnson1: ^ see above, we need those brackets for all of them [18:26:24] dschoon: yep, that would be them [18:26:34] well, confirm or deny -- they don't respect fsync [18:26:39] dschoon: I did that for all the swift backends as well, so not shocked i dont recall [18:26:40] dschoon: no clue. [18:26:42] hokay. [18:26:46] you would wanna chat with asher about them [18:26:48] welp. i guess we'll work it out. [18:26:53] yeah, this is coming from asher. [18:26:54] he has tested them and a bunch of others [18:26:55] ok [18:27:06] it was a few weeks ago, though, so i might recall incorrectly [18:27:14] well, if it turns out you dont need these in the new ones [18:27:20] find out and lemme know [18:27:23] saves me having to remove them later to put in something else [18:27:50] yeah. [18:27:52] exactly. [18:28:51] !log reedy synchronized php-1.21wmf2/skins/CologneBlue.php [18:29:03] Logged the message, Master [18:29:06] any chance we have a closet of normal spinning platters lying around? [18:29:29] not normal, BIG ONES [18:32:32] uhh.... [18:32:32] no. [18:32:33] for what?
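dschoon's claim at 18:24-18:26 (SSDs that acknowledge fsync before the data is on stable storage) is easy to sanity-check from a shell. A rough sketch, assuming the suspect disk is mounted at /mnt/test:

    # 1000 single-block writes, each forced out with O_DSYNC. A drive that
    # honestly flushes every write is bounded by its write latency; a rate
    # far above a few hundred writes/sec suggests the cache is lying.
    dd if=/dev/zero of=/mnt/test/syncfile bs=4k count=1000 oflag=dsync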
[18:32:42] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:33:01] heheh, i think he was thinking instead of the SSD, we'd be ok with more space :) [18:33:29] just use the 3.5" disks they come with then =P [18:33:34] 2.5" form factor isnt very good for space [18:34:26] i dunno if the 720s that we are getting for you are at full 2.5" disk capacity [18:34:53] can always budget for more disks to expand if there is space, but we dont keep that many disks on hand [18:35:02] technically we used SSD spares for you guys [18:35:08] ;] [18:36:04] wait, really? [18:36:13] they don't have any more drive slots? [18:37:43] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.864 second response time [18:38:14] dschoon: so on the c2100s [18:38:20] you guys have 8 of the 12 disk slots filled [18:38:26] New patchset: Asher; "adding db65 to s4" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29407 [18:38:26] its all you had budget for [18:38:27] yes. [18:38:40] cmjohnson1: did you get a quote for the replacement analytics or not yet? [18:39:16] dschoon: So the new 720XD can also hold 12 disks [18:39:22] but you guys are only populating 8 of them [18:39:25] with 2tb disks. [18:39:29] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29407 [18:39:58] dschoon: So, the new systems are identical to the old in the number of 2.5" slots (12 - hotswap) and internal 2.5" slots (non hot swap) [18:40:06] the systems come with 8 of those 12 filled for you [18:40:32] aiight. i thought you had said we didn't have any room for growth in these machines [18:40:32] phew [18:43:37] confusion \o/ >_< [18:46:15] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [18:53:47] !log streaming hotbackup of db1038 to db65, preparing to replace db22 [18:53:59] Logged the message, Master [18:58:29] robh: i did not get anything back yet [18:58:42] no worries, just curios [18:58:44] curious [18:59:19] mark, binasher, robla: hangout URL for the meeting: https://plus.google.com/hangouts/_/a5c3abf6f7c28dddd22c08f9e80b4aa87c246d63 [19:06:52] !log reedy synchronized php-1.21wmf2/includes/api/ApiParse.php [19:07:05] Logged the message, Master [19:13:44] now here was a meeting that could not have been done by email or irc ;-) [19:14:09] i agree! [19:14:09] you know how long that would have taken to *type*? [19:14:38] New review: Hashar; "Nice. Added a few inline comment. I am not sure what is the default charset when establishing a con.." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/29344 [19:26:42] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:28:12] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.255 second response time [19:39:45] RECOVERY - mysqld processes on db1025 is OK: PROCS OK: 1 process with command name mysqld [19:41:52] RECOVERY - mysqld processes on db1013 is OK: PROCS OK: 1 process with command name mysqld [19:44:07] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 6630 seconds [19:52:22] notpeter: hi, is there a chance that you could grep/find on the Wikimedia Commons server for SVG files that miss / do not include …? Asking as presumably the recent move to swift cleared some caches and now some (strictly speaking invalid) SVG files are not rendered anymore, and we get quite some bug reports as it's highly visible.
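One way to act on that request, anticipating the "download them one by one and inspect them" idea floated at 20:03 below; strictly a sketch, where svg-paths.txt would come from a database query for SVG titles (hashed into their upload paths) and the grep pattern stands in for whatever markup the broken files are missing:

    while read -r path; do
      curl -s "https://upload.wikimedia.org/wikipedia/commons/${path}" \
        | grep -q 'PATTERN_FOR_REQUIRED_MARKUP' || echo "suspect: ${path}"
    done < svg-paths.txt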
[19:53:22] Probably not the right person to ask [19:53:29] It's doable, it's just not gonna be quick [19:53:37] AaronSchulz: ^ [19:53:51] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 17 seconds [19:55:11] Reedy, thanks [19:55:22] background: https://bugzilla.wikimedia.org/show_bug.cgi?id=41174 [19:57:02] Aye [19:59:46] andre__: yeah, I think that aaron would be much better suited to this than I [19:59:56] glad it got to the right person :) [19:59:57] sorry then for pinging [20:00:04] no prob :) [20:00:25] notpeter: I suggested you because you were "on rt duty" in the /topic - what would have been the right way to go instead? [20:00:38] (in the "I know I need to talk to Ops but am not sure who" [20:00:38] ) [20:00:55] ah, yes, that's outdated. was me as of last week [20:01:07] binasher: change topic, plx :) [20:01:36] sumanah: but, yay! I'm glad that we're getting into the groove of using the ops rotation thing :) [20:01:53] ok! thanks! yes, I should pay more attention to the date the /topic was changed :) [20:02:04] and I'm very grateful for the ops rotation [20:02:24] andre__: ^ (you are absolved!) [20:02:53] still something that AaronSchulz will probably have to help with [20:03:25] oh, definitely :) [20:03:41] New patchset: Demon; "Upping gerrit timeout to 12m" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29432 [20:03:46] there's no way to grep all of the svg's in swift but code could be written to find them in the db, download them one by one, and inspect them.. sounds like a bit of a monster [20:04:51] New patchset: Demon; "Upping gerrit timeout to 12m" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29432 [20:05:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29432 [20:06:22] New review: Andrew Bogott; "More is more!" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/29432 [20:06:22] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29432 [20:08:15] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [20:12:44] seee come no [20:27:09] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:55] New patchset: CSteipp; "LBFactory_Multi setup for labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29344 [20:35:06] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.128 second response time [20:49:43] !log draining traffic from cr2-eqiad in preparation for maintenance and upgrade [20:49:55] Logged the message, Mistress of the network gear. [20:56:23] New patchset: CSteipp; "LBFactory_Multi setup for labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29344 [21:04:42] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:06:09] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.177 second response time [21:11:15] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:12:21] New patchset: Pyoungmeister; "correcting mac for pc3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29438 [21:12:45] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.037 second response time [21:13:18] New review: gerrit2; "Lint check passed." 
[21:13:33] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29438
[21:17:27] !log olivneh synchronized php-1.21wmf2/extensions/PostEdit
[21:17:39] Logged the message, Master
[21:19:47] !log olivneh synchronized php-1.21wmf2/extensions/Vector
[21:20:01] Logged the message, Master
[21:20:25] !log replacing disk13 on tridge array
[21:20:37] Logged the message, Master
[21:23:11] !log commit full on cr2-eqiad which will restart all routing processes
[21:23:23] Logged the message, Mistress of the network gear.
[21:48:03] !log switch routing engine mastership on cr2-eqiad
[21:48:16] Logged the message, Mistress of the network gear.
[21:49:03] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.197,
[21:50:55] PROBLEM - Host cr2-eqiad is DOWN: CRITICAL - Network Unreachable (208.80.154.197)
[21:51:54] PROBLEM - Host wikipedia-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::1
[21:52:07] PROBLEM - Host wikiquote-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::3
[21:52:21] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 29, down: 0, shutdown: 2
[21:52:30] RECOVERY - Host cr2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 35.07 ms
[21:52:39] RECOVERY - Host wikiquote-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 112.44 ms
[21:52:49] RECOVERY - Host wikipedia-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 112.54 ms
[21:53:17] sorry about the pages folks
[21:53:26] i guess the bgp flip made the connectivity unhappy
[21:53:32] during the RE switchover
[22:01:46] !log stopping puppet on aluminium so fr-tech can local-test a config change
[22:01:58] Logged the message, Master
[22:06:01] !log undraining cr2-eqiad
[22:06:13] Logged the message, Mistress of the network gear.
[22:19:06] !log asher synchronized wmf-config/db.php 'replacing db22 with db65'
[22:19:19] Logged the message, Master
[22:27:48] paravoid: why is https://office.wikimedia.org/w/img_auth.php/f/ff/TheHammock.jpg served from nginx?
[22:28:30] You have a hammock in the office? o.0
[22:29:22] AaronSchulz: https
[22:30:04] Damianz: one of the developers bought it and brought it in
[22:30:07] it is now "the coding hammock"
[22:31:25] also AaronSchulz it doesn't autocomplete in search
[22:32:57] ^demon|away: would it make sense to coordinate gerrit & jenkins downtimes?
[22:33:04] !log switching cr2-eqiad routing engine mastership
[22:33:11] hashar: ^^^
[22:33:16] Logged the message, Mistress of the network gear.
[22:33:36] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: Requested table is empty or does not exist,
[22:33:40] paravoid: gerrit downtime is probably more critical, but it would be great for me to upgrade them both at the same time
[22:33:40] I don't know much about that infrastructure but I think they depend on each other and I guess it'd suck to have two downtimes in the same week
[22:34:23] paravoid: I have some projects waiting on them both, so the sooner the better (thanks for bringing it up)
[22:34:32] ok, I'll reply to the mail, clearly the relevant people are not here :)
[22:34:55] thanks!
[22:35:54] paravoid: gerrit downtimes are usually less than an hour. The upgrade is quick :)
[22:37:21] !log asher synchronized wmf-config/db.php 'raising weight of db65'
[22:37:33] Logged the message, Master
[22:38:44] hashar: chad is doing a dist-upgrade to precise, not a gerrit upgrade
[22:38:59] so, a bit longer and about the same if we do a dist-upgrade on gallium :)
[22:39:17] New patchset: Asher; "adding db22 to decom list" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29459
[22:40:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29459
[22:40:51] paravoid hashar I understand there is also data to be backed up and restored, at least on gallium, that might affect downtime too?
[22:41:15] I think we're just going to do a dist-upgrade
[22:41:22] no point in copying 46G around for a reinstall
[22:43:29] sigh
[22:43:32] !log draining cr2-eqiad again for downgrade
[22:43:45] Logged the message, Mistress of the network gear.
[22:44:44] LeslieCarr: I was trying to imagine what "undraining" means, and here you go draining again :)
[22:44:46] hehe
[22:45:00] just a term for moving traffic on and off the router via configuration methods
[22:45:02] <^demon|away> paravoid: We can coordinate if you think that's best. The upgrades don't truly depend on one another though.
[22:45:07] rather than just pulling the plug with live traffic
[22:45:14] <^demon|away> (Other than in the "related systems" sense)
[22:45:20] figured.
[22:45:31] I just like "undraining" :)
[22:48:22] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29459
[22:48:51] <^demon|away> paravoid: It's dinnertime here, so let's just continue this discussion onlist :)
[22:50:32] New patchset: Hashar; "all-wmflabs.dblist now contains beta wiki list" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29463
[22:51:00] New patchset: Hashar; "beta: add enwikivoyage to the dblist" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29464
[22:51:14] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29463
[22:51:19] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29464
[22:51:36] csteipp: Reedy ^^^ had to fix the all-wmflabs.dblist which was providing the production database list
[22:51:56] added back enwikivoyage
[22:52:45] Cool. So going forward, should people add directly to those files in git, or let addWiki.php add to them?
[22:53:09] we still have to handle wikiversions.dat though
[22:53:15] Ah, that too...
[22:54:20] I have updated to the latest mediawiki-config
[22:56:01] restarted the job runner
[22:56:05] no moaar spam
[22:57:15] so mw-update-l10n now runs as root on -dbump in a screen
[22:57:59] Mon Oct 22 22:57:40 UTC 2012 deployment-jobrunner06 enwiki JobQueueDB::claim 10.4.0.53 105 Unknown column 'job_token' in 'where clause' (10.4.0.53)
[22:58:00] bahh
[22:58:24] Reedy: do you happen to know if the new job_token column has been validated by asher / deployed on production?
[22:58:29] yup
[22:58:31] it's live
[22:58:57] need to update beta now :-]
[22:59:08] I usually do something like "foreachwiki update.php"
[22:59:15] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[22:59:34] heh
[22:59:35] probably will be fine
[22:59:42] of course foreachwiki uses all.dblist :-]
[22:59:57] foreachwikiinlabsgoddamnit
[23:00:17] !log switching RE mastership on cr2-eqiad
[23:00:31] Logged the message, Mistress of the network gear.
[23:02:51] Reedy: I would totally write that method
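The wished-for "foreachwikiinlabs" could be a thin wrapper that reads the beta dblist instead of production's all.dblist. A sketch under assumed names: the dblist path is a guess, and the mwscript invocation is modeled on the production wrapper, not taken from real labs tooling.

    #!/usr/bin/env python3
    # Hypothetical "foreachwikiinlabs": run a maintenance script against
    # every wiki in the labs dblist. DBLIST path and mwscript usage are
    # assumptions for illustration.
    import subprocess
    import sys

    DBLIST = "/usr/local/apache/common/all-wmflabs.dblist"  # assumed path

    def foreachwikiinlabs(script, *args):
        with open(DBLIST) as f:
            wikis = [line.strip() for line in f if line.strip()]
        for wiki in wikis:
            print("-----", wiki, "-----")
            # mwscript resolves the right MediaWiki version for the wiki.
            subprocess.call(["mwscript", script, "--wiki=" + wiki, *args])

    if __name__ == "__main__":
        foreachwikiinlabs(sys.argv[1], *sys.argv[2:])

Run as, say, "foreachwikiinlabs update.php --quick" to apply pending schema changes (such as the missing job_token column above) across every beta wiki.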
[23:04:33] !log reedy synchronized php-1.21wmf2/includes/EditPage.php
[23:04:45] Logged the message, Master
[23:05:31] paravoid: sorry been busy with a labs issue :/
[23:05:54] paravoid: Gerrit downtimes are usually less than an hour iirc
[23:06:10] the gallium upgrade is probably going to take a bit more :/
[23:13:20] bed time for now
[23:13:27] bye hashar
[23:15:18] sumanah: thanks :-]
[23:19:59] New patchset: Asher; "adding pecl memcached ext to apaches, removing db22 cruft from mysql.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29471
[23:21:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29471
[23:21:07] binasher: is that going to be used by all apaches? if yes, please add it to manifests/apaches.pp
[23:21:21] Change abandoned: Reedy; "https://gerrit.wikimedia.org/r/#/c/29471/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7349
[23:21:41] oh, ah, added it in mediawiki.pp, nvm
[23:21:42] notpeter: should that be in place of either of the places where i added it?
[23:22:04] nope, sorry. just didn't see it in mediawiki.pp
[23:22:25] thanks for reviewing!
[23:22:39] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29471
[23:22:54] I'm trying to be hawkish about making sure that everything goes into the modules and not the older stuff :)
[23:23:08] as this is my transitional mess
[23:24:49] this needs to go everywhere mediawiki runs on the cluster, including boxes that might just do mwscript / cron but maybe don't have apache. if such a thing even exists
[23:25:20] ah, yes, then the mediawiki_new module would be better than the applicationserver module
[23:25:25] sorry for the naming... :/
[23:26:00] ah.. do new apaches get both modules?
[23:26:03] yes
[23:26:15] ok, i'll move it
[23:27:13] huh, i wonder if php-luasandbox should move there too
[23:27:49] there aren't that many cases where there's a mediawiki install and not apache...
[23:28:10] mostly the jobrunners and the random boxes for scripts
[23:29:08] do the new job runners currently get the applicationserver module?
[23:29:16] parts of it :/
[23:29:29] packages.pp?
[23:30:06] * notpeter sighs
[23:30:06] yes
[23:30:24] also, for the record, it's really hard to untangle two things that are completely interdependent... :/
[23:30:35] yeah
[23:31:11] applicationserver::packages has all of the php extensions, so it seems like php5-memcached should actually be in there too
[23:31:34] sure, seems reasonable
[23:31:49] 7 Warning: PHP Startup: apc.shm_size now uses M/G suffixes, please update your ini files in Unknown on line 0
[23:33:06] wait a second, it's 7pm and I'm drinking beer. why am I in work irc? ttfn!
[23:33:18] notpeter: good question
[23:38:24] RECOVERY - mysqld processes on es10 is OK: PROCS OK: 1 process with command name mysqld
[23:42:09] PROBLEM - MySQL disk space on db1001 is CRITICAL: DISK CRITICAL - free space: /a 54263 MB (3% inode=99%):
[23:42:54] PROBLEM - MySQL Replication Heartbeat on es10 is CRITICAL: CRIT replication delay 602849 seconds
[23:42:54] PROBLEM - MySQL Slave Delay on es10 is CRITICAL: CRIT replication delay 602850 seconds
[23:45:18] RECOVERY - MySQL disk space on db1001 is OK: DISK OK