[00:01:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds [00:06:35] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [00:06:35] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [00:34:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:48:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [01:21:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:35:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.028 seconds [01:41:14] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 243 seconds [01:47:42] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 12 seconds [02:00:35] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 313 seconds [02:08:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:22:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.033 seconds [02:30:57] !log LocalisationUpdate completed (1.21wmf2) at Mon Oct 22 02:30:57 UTC 2012 [02:31:14] Logged the message, Master [02:37:20] $ git grep -ni php-fatal-error files [02:37:20] files/php/wmerrors.ini:5:wmerrors.message_file=/usr/local/apache/common-local/php-fatal-error.html [02:37:55] maybe i'm just sleepy but I'm having some trouble figuring out where to file php-fatal-error.html in version control. or where to even get a current copy of it [02:38:33] i don't see it in beta labs either [02:38:37] (poking around in the shell) [02:39:53] RECOVERY - Puppet freshness on search13 is OK: puppet ran at Mon Oct 22 02:39:39 UTC 2012 [02:40:20] RECOVERY - Puppet freshness on argon is OK: puppet ran at Mon Oct 22 02:40:19 UTC 2012 [02:41:50] RECOVERY - Puppet freshness on search1008 is OK: puppet ran at Mon Oct 22 02:41:33 UTC 2012 [02:41:50] RECOVERY - Puppet freshness on search1016 is OK: puppet ran at Mon Oct 22 02:41:42 UTC 2012 [02:42:53] RECOVERY - Puppet freshness on sq86 is OK: puppet ran at Mon Oct 22 02:42:21 UTC 2012 [02:43:20] RECOVERY - Puppet freshness on mw7 is OK: puppet ran at Mon Oct 22 02:43:01 UTC 2012 [02:43:20] RECOVERY - Puppet freshness on nitrogen is OK: puppet ran at Mon Oct 22 02:43:06 UTC 2012 [02:43:38] RECOVERY - Puppet freshness on analytics1003 is OK: puppet ran at Mon Oct 22 02:43:26 UTC 2012 [02:44:54] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Mon Oct 22 02:44:49 UTC 2012 [02:45:53] RECOVERY - Puppet freshness on search19 is OK: puppet ran at Mon Oct 22 02:45:47 UTC 2012 [02:48:26] RECOVERY - Puppet freshness on sq77 is OK: puppet ran at Mon Oct 22 02:48:02 UTC 2012 [02:51:32] !log LocalisationUpdate completed (1.21wmf1) at Mon Oct 22 02:51:32 UTC 2012 [02:51:45] Logged the message, Master [02:51:58] RECOVERY - Puppet freshness on sq62 is OK: puppet ran at Mon Oct 22 02:51:25 UTC 2012 [02:53:23] RECOVERY - Puppet freshness on sq76 is OK: puppet ran at Mon Oct 22 02:52:55 UTC 2012 [02:53:50] RECOVERY - Puppet freshness on analytics1009 is OK: puppet ran at Mon Oct 22 02:53:28 UTC 2012 [02:57:38] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [02:57:53] RECOVERY - Puppet freshness on sq75 is OK: puppet ran at Mon Oct 22 
02:57:43 UTC 2012 [02:58:02] RECOVERY - Puppet freshness on stat1001 is OK: puppet ran at Mon Oct 22 02:57:51 UTC 2012 [02:58:11] RECOVERY - Puppet freshness on search1006 is OK: puppet ran at Mon Oct 22 02:57:57 UTC 2012 [02:59:23] RECOVERY - Puppet freshness on search20 is OK: puppet ran at Mon Oct 22 02:59:16 UTC 2012 [03:00:27] RECOVERY - Puppet freshness on marmontel is OK: puppet ran at Mon Oct 22 03:00:03 UTC 2012 [03:02:23] RECOVERY - Puppet freshness on search1005 is OK: puppet ran at Mon Oct 22 03:02:16 UTC 2012 [03:02:50] RECOVERY - Puppet freshness on yvon is OK: puppet ran at Mon Oct 22 03:02:39 UTC 2012 [03:05:52] RECOVERY - Puppet freshness on sq51 is OK: puppet ran at Mon Oct 22 03:05:24 UTC 2012 [03:05:52] RECOVERY - Puppet freshness on kaulen is OK: puppet ran at Mon Oct 22 03:05:32 UTC 2012 [03:06:26] RECOVERY - Puppet freshness on search1024 is OK: puppet ran at Mon Oct 22 03:05:54 UTC 2012 [03:06:26] RECOVERY - Puppet freshness on search24 is OK: puppet ran at Mon Oct 22 03:06:15 UTC 2012 [03:07:58] RECOVERY - Puppet freshness on search32 is OK: puppet ran at Mon Oct 22 03:07:32 UTC 2012 [03:21:47] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 21 seconds [03:31:15] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer: [03:32:59] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [03:33:07] LeslieCarr: ^ [03:33:21] (lvs) [04:04:30] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [04:30:35] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [04:30:35] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [05:05:32] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [05:55:11] PROBLEM - MySQL disk space on db22 is CRITICAL: DISK CRITICAL - free space: /a 18374 MB (3% inode=99%): [06:30:35] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [06:37:02] PROBLEM - MySQL disk space on db22 is CRITICAL: DISK CRITICAL - free space: /a 18465 MB (3% inode=99%): [06:44:24] RECOVERY - MySQL disk space on db22 is OK: DISK OK [07:11:20] apergos: ping? [07:18:18] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [07:19:47] ponggg [07:19:52] paravoid: [07:20:10] you saw the ulimit email from tim I guess? [07:22:00] yeah, we already chatted about it yesterday [07:22:06] great [07:23:04] so, [07:24:19] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [07:29:42] so...? [07:30:12] do we think the upgrade on swift made ms-fe1 happy? I realize a bunch of other stuff happened in the meantime [07:30:12] ms-fe1 is looking good [07:31:38] so maybe we want to move over the other proxy servers in a day or two [07:31:50] what other stuff? [07:31:56] I did the 1.7.4 upgrade [07:32:27] I mean we had image scalers rebooting and etc [07:32:33] other sorts of failures [07:32:48] nah, that's completely separate [07:32:55] I did not keep good track, I was mostly afk [07:33:00] it just was a busy weekend for me [07:33:02] just saw lots of activity [07:33:23] the leak is looking good [07:33:24] great [07:33:36] so, this week we have to a) upgrade (some of) the rest of the proxies to precise/1.7.4 [07:33:59] b) provision the R720xd for the replacement of the 4 broken servers [07:34:11] did those come in?? [07:34:12] will you work this week or are you too busy with bureaucracy?
[07:34:18] oh I'm here [07:34:21] they came but Chris was in eqiad last week [07:34:24] ok [07:34:27] I think they're going to get racked up this week [07:34:36] so I'm planning for it [07:34:36] I expected I would work with him to get those set up [07:34:37] do you want to split some of the work there? [07:34:39] okay [07:34:41] I can do the proxies [07:34:50] sure [07:34:50] and communicate with swiftstack about that [07:34:51] if you need a hand you can let me know [07:34:55] and the rest of the issues I sent them [07:35:05] am I on cc on those emails? [07:35:12] hm, lemme see [07:35:13] (now you know how far behind I am in my emails :-P) [07:35:18] I think you are [07:35:19] but let me double-check [07:36:07] okay, you were in one, about the DELETEs and the posix_fadvise() bugs ("Two performance-related issues (non-sync-related)") [07:36:14] but not in the other because I just replied-all [07:36:16] let me fwd that [07:36:30] hello [07:36:37] hi hashar [07:36:41] I've seen your mail [07:36:46] but I'm trying to make a plan for the week (see above) [07:36:54] before I reply :) [07:37:08] I sent it out in a rush just before leaving my coworking place, it was probably unclear :-/ [07:37:09] I saw the posix_fadvise stuff [07:37:27] apergos: okay, sent [07:37:29] maybe I'm not so far behind after all [07:37:31] thanks! [07:37:38] sorry for not originally including you [07:37:42] no worries [07:37:59] ah let me send you the one that I replied to, too [07:38:05] from John [07:38:19] are you thinking you'll leave one proxy on 1.5 (so rings can be built on it) for this week? [07:38:22] paravoid: ping me whenever you have some time to chat :-) [07:38:26] yeah, and for contingency [07:38:29] yep [07:38:33] sounds great [07:38:33] we can always build rings in backends [07:38:47] but I'd like to play it safe and keep a 1.5 proxy for now [07:38:54] hashar: shoot, I can multitask :) [07:41:15] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [07:41:21] paravoid: as I understand it, ops are upgrading servers to Precise by reinstalling them from scratch. Which is a good thing :-) [07:42:03] paravoid: we will have to back up some directories though since some stuff is not in puppet (such as jenkins build data) and some files served on the integration.mediawiki.org site. [07:42:11] paravoid: and I need to package the Android SDK too :-] [07:43:19] I have listed a few steps in https://wikitech.wikimedia.org/view/Gallium/Upgrade_to_Precise [07:46:30] ah are you moving to precise at the same time for the proxy servers you upgrade btw?
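A side note on hashar's 07:42 remark: since the jenkins build data and the files served on integration.mediawiki.org live outside puppet, the pre-reinstall backup can be little more than a couple of rsync pulls. This is only a sketch; the directory paths and the destination are assumptions, not taken from the log:

    # pull gallium's unpuppetized state somewhere safe before reimaging
    rsync -avz gallium.wikimedia.org:/var/lib/jenkins/ /backup/gallium/jenkins/
    rsync -avz gallium.wikimedia.org:/srv/integration/ /backup/gallium/integration/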
[07:55:12] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [07:55:53] apergos: yes [07:58:18] sweet [08:45:18] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [08:55:59] New patchset: Mark Bergsma; "Add Range support to Varnish in streaming mode" [operations/debs/varnish] (patches/streaming-range) - https://gerrit.wikimedia.org/r/29273 [08:56:18] \o/ [08:58:16] Change merged: Mark Bergsma; [operations/debs/varnish] (patches/streaming-range) - https://gerrit.wikimedia.org/r/29273 [08:58:58] Change abandoned: Mark Bergsma; "Pushed into new branch patches/streaming-range instead" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/28379 [09:25:28] New patchset: ArielGlenn; "allow rsync from local filesystem, not just from remote host" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/29278 [09:35:33] !log Built new varnish 3.0.3plus~rc1-wm2 packages and inserted them into the precise-wikimedia APT repository [09:35:46] Logged the message, Master [09:51:24] mark: have a min to give a second opinion? [09:51:33] yes [09:51:33] https://wiki.ubuntu.com/ServerTeam/CloudArchive [09:51:56] so, Ubuntu's having an extra official repository, in which they'll add new Openstack releases [09:52:14] I'm wondering if I should get the packages from there and put them in our apt or just use that [09:52:30] if it follows the same practices as the ubuntu archives themselves pretty much... [09:52:33] I think the second's better [09:52:42] well, kind of :) [09:52:47] yeah [09:52:56] they have a section per ubuntu release per openstack release [09:53:05] but they do use the same SRU policy [09:53:14] (stable release updates) [10:00:28] paravoid: could you have a look at https://gerrit.wikimedia.org/r/#/c/28208/ at some point, as soon as its merged in i can test it on labs again and if all works will push a change to enable it on tmh1/2 [10:04:26] I cringe when I see the "class foo{" syntax, but it's like that everywhere in the file [10:07:14] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [10:07:14] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [10:07:42] Change abandoned: J; "upload is also an independent class that needs to be enabled on production, so leaving it out is mor..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28341 [10:08:11] New review: Faidon; "Looks good. Let's see how it'll work on labs :)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/28208 [10:08:12] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28208 [10:08:43] j^: :) [10:09:08] which TZ are you in btw? [10:14:48] meh [10:14:55] how do I select a random backend in VCL [10:16:29] paravoid: right now in Berlin(CEST) [10:22:37] oh well, get an error on labs: manifests/role/applicationserver.pp at line 118; cannot redefine at manifests/role/applicationserver.pp:101 [10:23:07] is class {"::jobrunner": loading the local jobrunner class and not the top level one? [10:25:53] no that can't be [10:25:55] let me check [10:26:00] in that case jobrunner would loop [10:29:38] which machine is that? [10:30:02] deployment-video05 [10:30:20] switching to puppetmaster::self right now so i can change things locally [10:30:27] did you do it already? 
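mark's 10:14 question ("how do I select a random backend in VCL") is usually answered with a random director. A minimal Varnish 3 sketch, with invented backend names rather than anything from the real upload VCL; the backends themselves are assumed to be declared elsewhere:

    director storage_random random {
        { .backend = be_sda3; .weight = 1; }
        { .backend = be_sdb3; .weight = 1; }
    }

    sub vcl_recv {
        set req.backend = storage_random;
    }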
[10:30:37] I wanted to do a puppet run to see the error myself [10:30:51] too late :) [10:31:06] New patchset: Mark Bergsma; "Cleanup, fix regexp match length" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29285 [10:31:23] paravoid: https://textb.org/t/puppeterror/ here the error [10:32:06] New patchset: Mark Bergsma; "Define separate storage backends for large objects" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29286 [10:32:54] what a mess [10:33:00] paravoid: also now switch to puppetmaster::self is done so the error shows up again on deployment-video05 [10:33:07] New patchset: Mark Bergsma; "Select a big-object storage backend for objects > 100 MB" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29287 [10:34:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29285 [10:34:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29285 [10:34:03] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/29286 [10:34:03] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/29287 [10:36:51] strange [10:39:23] it's the jobrunner indeed [10:39:30] but it doesn't make any sense [10:39:37] because jobrunner should also loop that way [10:42:23] smells like a puppet bug [10:42:26] one way would be to not have the classes written in a nested way but put them all toplevel in the file as role::applicationserver::jobrunner etc [10:45:07] if its a puppet bug, possibly moving videoscaler above the local jobrunner helps [10:45:38] I think it's a puppet bug [10:45:59] class { "::jobrunner": } has no reason to load role::applicationserver::jobrunner [10:47:29] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Invalid parameter type at /etc/puppet/manifests/role/applicationserver.pp:129 on node deployment-video05.pmtpa.wmflabs [10:47:36] that's if I try to run class { "jobrunner": } [10:47:42] include even [10:47:50] because role::applicationserver::jobrunner isn't parameterized [10:48:13] so, it loads the correct jobrunner, the top-scoped one [10:48:21] but then complains for a duplicate definition of another include [10:48:25] that's insane [10:56:23] hmmm [11:09:53] * j^ is reading https://groups.google.com/forum/?fromgroups=#!topic/puppet-users/ZYggexu5T2U and gets more and more confused about puppet parameterized classes [11:10:16] puppet's scoping is insane [11:10:21] they're supposed to have fixed that at 3.0 [11:10:31] but I'm scared to read more about it [11:10:45] what we're experiencing is a bug I think [11:11:28] I've commented-out everything but class { "::jobrunner" } [11:11:33] and now I get [11:11:35] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: Class[Jobrunner] is already defined in file /etc/puppet/manifests/role/applicationserver.pp at line 131; cannot redefine at /etc/puppet/manifests/role/applicationserver.pp:103 on node deployment-video05.pmtpa.wmflabs [11:11:42] which is *crazy* [11:19:39] New patchset: Mark Bergsma; "Define separate storage backends for large objects" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29288 [11:20:41] Change 
abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29286 [11:20:42] Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29287 [11:20:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29288 [11:21:39] New patchset: Mark Bergsma; "Define separate storage backends for large objects" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29288 [11:22:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29288 [11:23:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29288 [11:31:02] j^: look what I've done on the machine itself [11:31:05] and how that fails [11:31:07] yay for puppet bugs [11:45:47] New patchset: Mark Bergsma; "Upgrade Varnish on all upload caches to 3.0.3~rc1-wm2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29290 [11:46:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29290 [11:47:17] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29290 [11:52:51] paravoid: still fails with the same error? i would understand it if both role::applicationserver::videoscaler and role::applicationserver::jobrunner are loaded since one can only have one declaration of a parameterized class but we only load role::applicationserver::videoscaler [12:00:34] what i dont understand is that jobrunner fails now but role::applicationserver::common did not cause the same issues before, could be that this is because role::applicationserver::common is inside the same scope but jobrunner is not [12:02:46] paravoid: if you close your vi I would try moving ::jobrunner into role::applicationserver::jobrunnerr and include that in role::applicationserver::jobrunner and role::applicationserver::videoscaler [12:05:26] daughter is sick so I must move out early :/ [12:05:29] will be back tonight [12:13:39] PROBLEM - Host zinc is DOWN: PING CRITICAL - Packet loss = 100% [12:20:12] New patchset: J; "move ::jobrunner to role::applicationserver::jobs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29292 [12:21:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29292 [12:21:37] paravoid: ^ pushed a change that works on video05, what do you think? [12:22:45] New patchset: Mark Bergsma; "Support range requests as uncached passes on the frontend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29293 [12:23:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29293 [12:24:50] New patchset: Mark Bergsma; "Support range requests as uncached passes on the frontend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29293 [12:26:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29293 [12:26:28] New patchset: Mark Bergsma; "Support range requests as uncached passes on the frontend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29293 [12:27:24] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29293 [12:32:38] New patchset: Mark Bergsma; "Support range requests as uncached passes on the frontend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29293 [12:33:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29293 [12:33:49] New patchset: Mark Bergsma; "Support range requests as uncached passes on the frontend" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29293 [12:34:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29293 [12:35:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29293 [12:38:40] New patchset: Mark Bergsma; "Correct package version" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29296 [12:39:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29296 [12:51:03] New patchset: Mark Bergsma; "Collect Via headers into one" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29297 [12:52:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29297 [12:52:13] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29297 [12:58:12] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [12:59:42] New patchset: Mark Bergsma; "Revert "Collect Via headers into one" - doesn't work in vcl_deliver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29301 [13:01:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29301 [13:03:18] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:04:57] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.368 second response time [13:10:48] PROBLEM - Apache HTTP on mw45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:12:20] RECOVERY - Apache HTTP on mw45 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.187 second response time [13:12:54] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:14:07] New patchset: Mark Bergsma; "Cleanup, logging to shmlog" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29305 [13:15:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29305 [13:15:50] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29305 [13:16:21] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:17:42] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [13:17:54] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [13:28:38] New patchset: Mark Bergsma; "Fix regexes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29307 [13:29:36] planning on upgrading more imagescalers. any objections? [13:29:46] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29307 [13:29:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29307 [13:30:00] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:30:03] the upgrade seems to be good on srv190 (I've been grepping through the logs a fair amount) [13:30:27] mark: ^ [13:30:49] go ahead [13:30:54] excellent [13:31:30] RECOVERY - Apache HTTP on mw26 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.232 second response time [13:31:57] New patchset: Mark Bergsma; "Fix missing semicolon" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29308 [13:32:09] I wonder how many open scaler-esque bugs we have in BZ atm [13:32:26] !log removing srv219 and srv220 from rendering pool for upgrade to precise [13:32:38] Reedy: I can only imagine... there are like 6 rt tickets... [13:32:38] Logged the message, notpeter [13:33:05] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29308 [13:34:40] https://bugzilla.wikimedia.org/show_bug.cgi?id=36623 [13:35:12] 31122, 34792, 35622, 36580 at least [13:35:22] ja [13:35:22] soon! [13:35:22] 38010 is related, but not an "upgrade to fix" [13:36:04] I'm super stoked about getting memcache off of our apache boxes... [13:36:10] de-linking those things will be very nice [13:49:31] New patchset: Pyoungmeister; "setting srv219 and srv220 to use appalicationserver modules" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29310 [13:50:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29310 [13:54:00] PROBLEM - Host srv220 is DOWN: PING CRITICAL - Packet loss = 100% [13:55:53] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29310 [13:57:18] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:58:48] PROBLEM - Apache HTTP on srv219 is CRITICAL: Connection refused [13:58:48] PROBLEM - SSH on srv219 is CRITICAL: Connection refused [13:58:48] RECOVERY - Apache HTTP on mw21 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.520 second response time [13:59:43] RECOVERY - Host srv220 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [14:00:03] New patchset: Pyoungmeister; "setting srv220 as ganglia aggregator for imagescalers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29312 [14:01:06] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29312 [14:01:53] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29312 [14:03:10] PROBLEM - Apache HTTP on srv220 is CRITICAL: Connection refused [14:03:38] RECOVERY - SSH on srv219 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:04:05] PROBLEM - SSH on srv220 is CRITICAL: Connection refused [14:05:15] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [14:07:12] RECOVERY - SSH on srv220 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:08:51] New patchset: Mark Bergsma; "Add Range support to Varnish in streaming mode" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29315 [14:08:51] New patchset: Mark Bergsma; "varnish (3.0.3plus~rc1-wm2) precise; urgency=low" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29316 [14:09:30] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29315 [14:09:46] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/29316 [14:10:12] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK HTTP/1.1 200 OK - 454 bytes in 0.018 seconds [14:14:09] j^: ping? [14:16:00] it's currently taking about 6s for requesting a range just under the 64 MB threshold [14:16:04] that's a bit long [14:16:04] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.009 seconds [14:16:04] i'm gonna halve that now [14:16:08] to 32 MB [14:17:33] paravoid: yes [14:18:27] PROBLEM - NTP on srv219 is CRITICAL: NTP CRITICAL: Offset unknown [14:19:10] New patchset: Mark Bergsma; "Halve the 64 MB stream/range pass threshold to 32 MB" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29317 [14:19:36] so, what should I see? [14:20:20] New patchset: Mark Bergsma; "Halve the 64 MB stream/range pass threshold to 32 MB" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29317 [14:20:35] paravoid: https://gerrit.wikimedia.org/r/#/c/29292/ [14:20:51] that way i can run puppetd again [14:21:19] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29317 [14:21:39] PROBLEM - Apache HTTP on srv219 is CRITICAL: Connection refused [14:25:48] PROBLEM - Apache HTTP on srv220 is CRITICAL: Connection refused [14:30:00] PROBLEM - NTP on srv220 is CRITICAL: NTP CRITICAL: Offset unknown [14:31:12] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [14:31:12] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [14:34:39] RECOVERY - NTP on srv219 is OK: NTP OK: Offset -0.03080141544 secs [14:38:53] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.050 second response time [14:39:49] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time [14:39:54] RECOVERY - NTP on srv220 is OK: NTP OK: Offset -0.02454566956 secs [14:40:50] New review: Faidon; "Nak, this does not belong to a role class." [operations/puppet] (production); V: 0 C: -2; - https://gerrit.wikimedia.org/r/29292 [14:46:11] paravoid: got other ideas how to work around the puppet limitations in that case? 
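For readers following along, this is roughly the shape of the conflict being debugged; a minimal reconstruction, not the real manifests, and the queue parameter is invented:

    class jobrunner($queue = "default") {
        # installs and configures the job runner daemon
    }

    class role::applicationserver::jobrunner {
        class { "::jobrunner": queue => "default" }
    }

    class role::applicationserver::videoscaler {
        class { "::jobrunner": queue => "webVideoTranscode" }
    }

    # Applying only role::applicationserver::videoscaler to a node should be
    # legal, yet the pre-3.0 puppetmaster (see the 11:10 remarks) rejects it
    # with "Duplicate definition: Class[Jobrunner] is already defined", as if
    # the sibling role class were also being evaluated.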
[14:48:14] hrm [14:48:21] I wonder where does jobrunner belong [14:48:34] notpeter: so, I think this falls under your area too, so you should probably join this conversation [14:48:45] since you're around too :) [14:48:56] sup [14:49:05] so, we have a very nice puppet bug it seems [14:49:17] role::application::jobrunner has class { "::jobrunner": ... } [14:49:25] and the new role::application::videoscaler also has class { "::jobrunner": ... } [14:49:37] this conflicts for whatever puppet bug [14:49:37] why? [14:49:42] why what? [14:49:51] why does it have the jobrunner class? [14:50:20] I guess because we'll do transcoding in jobs? [14:50:38] you can't really transcode hour-long videos in realtime :) but j^ would know more of the architecture [14:51:07] I mean, don't we not want jobrunner procs on any boxes but the jobrunners? [14:51:09] the jobrunner class is the infrastructure to run jobs [14:51:13] video transcodes are jobs [14:51:14] yes [14:51:21] so it needs the jobrunner class [14:51:27] but shouldn't that run on dedicated boxes? [14:51:27] yes [14:51:37] those boxes would have role::application::videoscaler [14:51:55] but the puppet bug prevents role::application::videoscaler from including ::jobrunner [14:52:11] or puppet design if i understand it right [14:52:25] so the videoscalers are significantly different than the imagescalers? [14:52:27] you can not have 2 declarations of the same class inside of one toplevel object or something like that [14:52:40] videoscalers are more like jobrunners [14:52:50] ok, this is making more sense now [14:52:51] cool [14:53:10] initially there was some idea that they would also do thumbnails but right now the idea is that the imagescalers can do that [14:53:11] (sorry, needed to figure out the context :) ) [14:53:19] ok, cool [14:54:30] notpeter: it's nice to hear the context too, I've always just assumed :-) [14:54:44] can you show me what your current working version is? [14:54:59] it's merged [14:55:04] in puppet [14:55:05] ok, cool [14:55:17] can you paste in the output from the puppet run? [14:55:41] there the plan is to run that puppet class on the tmh1 and tmh2 servers [14:55:45] they are already up [14:55:59] but need the right class to work [14:56:19] currently role::application::videoscaler looked like the right name but we now have the issue with ::jobrunner [14:56:28] notpeter: I've debugged it extensively [14:56:33] my conclusion is that puppet's buggy [14:56:47] so I could move role::application::videoscaler out of role::application or make ::jobrunner part of role::application [14:56:57] so we have to work around it, but I didn't like j^'s hack [14:57:09] (move the whole jobrunner class under the role class) [14:57:16] so, one thing that has me wondering is [14:57:18] https://gerrit.wikimedia.org/r/#/c/29292/ [14:57:20] ah, yes, I saw that [14:57:28] jobrunner falls under the role::applicationserver [14:57:38] but jobrunner is a top-level class? should it perhaps be mediawiki::jobrunner? [14:57:57] paravoid: could be [14:58:02] the lines are very blurry here.... [14:58:52] but yeah, that sounds reasonable [14:58:55] one of you said it should be top-level, can try if mediawiki:: works [14:58:57] you've kinda worked on that area lately, that's why I pinged you [14:59:55] paravoid: yeah. I mean, where things belong most is a judgement call... it's like var naming.
there are always several good options :) [15:00:18] but yeah, mediawiki::jobrunner sounds reasonable to me [15:00:32] same here [15:00:50] paravoid: hey do you have a few minutes to talk about gallium? [15:00:59] hashar: yes [15:01:06] sorry for not replying earlier [15:01:13] it is ok :-) [15:01:23] so, from what I understand, there are still a few things to puppetize from your side [15:01:27] then had to get my daughter from her nanny (daughter is sick hehe) [15:01:36] I'm wondering if we should just dist-upgrade the box... [15:01:43] Gallium is mostly about Jenkins [15:01:47] which I have installed using the puppet class [15:01:56] though we need PHPUnit to be installed from pear [15:02:05] and have to migrate the old build data [15:02:14] argh [15:02:15] I wrote a basic guideline at http://wikitech.wikimedia.org/view/Gallium/Upgrade_to_Precise [15:02:17] pear? really? [15:02:24] yeah [15:02:24] no packages for that either? [15:02:29] the versions provided by Ubuntu are too old [15:02:39] so I ask from time to time for someone to upgrade from pear [15:03:06] why haven't we built packages for this? [15:03:06] as we do with pretty much everything else? [15:03:36] I mean, the Android SDK sounds special and we can discuss it, but PHP extensions aren't [15:03:39] we use a ton [15:03:54] it is not really an extension [15:04:01] it is a PHP software [15:04:21] !log reedy synchronized php-1.21wmf2/extensions/ExtensionDistributor [15:04:23] though we can probably refresh the ubuntu package easily [15:04:33] Logged the message, Master [15:05:05] anyway, all of that kinda makes me want to do a dist-upgrade [15:05:11] not sure what we usually do on misc boxes [15:05:25] no idea either [15:05:30] but that is probably less trouble after all [15:05:47] paravoid: we've mostly used upgrades as a reason to puppetize and have reimaged [15:05:53] unless the upgrade from Lucid to Precise is troublesome [15:05:54] except for when it's really annoying ;) [15:06:00] hehehe [15:06:00] yep [15:06:22] notpeter: we are talking about gallium the continuous integration server [15:06:22] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [15:06:36] which is not entirely puppetized / packaged :-/ [15:06:41] gotcha [15:08:49] paravoid: the debian package has a `watch` file pointing to github, so it is probably easy to refresh it [15:09:06] dh $@ --buildsystem=phppear --with phppear [15:09:12] debian rocks [15:10:09] New patchset: Pyoungmeister; "removing notify => Service[apache] from all manifests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29325 [15:11:27] paravoid: renaming to mediawiki::jobrunner works too (tested on video05) https://gerrit.wikimedia.org/r/29324 [15:11:39] New patchset: J; "rename ::jobserver to mediawiki::jobserver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29324 [15:13:02] New review: Faidon; "Yes!" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/29325 [15:13:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29325 [15:13:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29324 [15:18:48] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29325 [15:19:02] notpeter: what's the status of mediawiki_new? [15:19:07] should that go to mediawiki_new?
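A side note on hashar's packaging remarks at 15:08-15:09: a pear-based refresh usually needs little more than the two files below. The rules line is quoted from the log; the watch file is a guess at what "pointing to github" would look like for phpunit:

    # debian/rules (the recipe line must be tab-indented)
    #!/usr/bin/make -f
    %:
            dh $@ --buildsystem=phppear --with phppear

    # debian/watch
    version=3
    https://github.com/sebastianbergmann/phpunit/tags .*/(\d[\d.]+)\.tar\.gz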
[15:19:11] yes [15:19:19] that's the stuff that's being used on precise [15:19:21] (j^ is going to kill me for suggesting it :P) [15:19:27] playing ping-pong with puppet classes [15:19:31] is called that for namespace reasons.... [15:19:49] well, we can have everything not be terrible once all apaches are on precise [15:20:02] yeah, sorry for not being more clear earlier... [15:20:05] videoscalers are targeting precise exclusively afaik [15:20:19] yeah [15:20:27] so that would be the correct one to use [15:20:33] j^: sorry :> [15:21:00] ok will update shortly, talking to robla right now [15:21:11] can't puppet be made to report the command std err / std out instead of the useless "returned 1 instead of one of [0]" ? [15:21:24] there was a bug about it iirc [15:21:33] yeah it can [15:21:39] log_output = true or something like that [15:21:43] yeah [15:21:52] that reminds me of something [15:22:28] mark: as an environment variable for puppetd -tv ? [15:22:36] http://projects.puppetlabs.com/issues/2359 that's the bug I was referring to [15:22:45] logoutput => true used to log just stdout, not stderr [15:22:50] but I see that this is fixed now [15:23:11] hashar: no, just exec { "...": ..., logoutput => true } [15:23:34] which means I need to change the manifest to find out what is going wrong :-] (or run the command line manually) [15:24:00] apparently defaults to on_failure in latest puppet [15:25:40] New patchset: Jgreen; "configure db1013 as a temporary fundraising db slave" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29327 [15:26:40] New patchset: Pyoungmeister; "adding apache scorecard stats to ganglia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28544 [15:27:50] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab.." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/29327 [15:27:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/28544 [15:29:25] New patchset: Hashar; "misc::builder requires libcrypt-ssleay-perl package" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29328 [15:29:47] a simple package dependency at https://gerrit.wikimedia.org/r/29328 :-] [15:30:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29328 [15:32:02] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/28544 [15:33:57] New patchset: Jgreen; "configure db1013 as a temporary fundraising db slave + fix typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29327 [15:34:55] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab.." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/29327 [15:39:07] New patchset: Jgreen; "configure db1013 as a temporary fundraising db slave + fix typo + arghbhabarhaga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29327 [15:40:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29327 [15:43:07] !log re-adding srv219 and srv220 to rendering pool [15:43:15] Logged the message, notpeter [15:43:40] notpeter: what's the situation that prompted you to tangle with apache-graceful-all?
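In context, the knob mark points hashar at (15:23:11) looks like the following; the resource name and command are invented for illustration:

    exec { "refresh-builder":
        command   => "/usr/local/bin/refresh-builder.sh",
        # print the command's stdout/stderr in the agent output instead of
        # the useless "returned 1 instead of one of [0]"; newer puppet
        # defaults this to on_failure, per the 15:24:00 comment
        logoutput => true,
    }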
[15:44:32] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29327 [15:45:03] Jeff_Green: I wanted to update the apache2.conf [15:45:07] to add some ganglia stuff [15:45:18] and I was staring at it and being like "I don't want to merge this....." [15:45:28] chance of disaster: high [15:45:38] ah, you want to stage it? [15:45:49] yeah [15:46:03] makes sense [15:46:04] and notify service apache just does a restart of apache, not a graceful [15:46:12] (although that could be added as an exec) [15:46:24] paravoid: so what do we do for gallium after all ? :-) [15:46:38] hashar: puppetize as much as you can [15:46:52] then we'll see, but probably a dist-upgrade [15:46:55] fine [15:46:57] I'll try to enlist someone else's help too [15:47:09] my main issue will be with the Debian packages [15:47:20] to pass it to someone else I mean, to avoid delaying it any longer [15:47:20] notpeter: that could probably be modified, but even so it sucks to have no way to stage changes [15:47:33] just spent 20 minutes trying to update the phpunit one without any success :-( [15:47:42] uscan / uupdate do not like me heh [15:48:15] !log reedy synchronized php-1.21wmf2/extensions/ExtensionDistributor [15:48:18] Jeff_Green: yeah... there seem to be a number of non-ideal solutions. I think that this is at least the safest [15:48:29] Logged the message, Master [15:48:33] and I think I just messed up ganglia on the apaches..... [15:48:35] wee! [15:49:11] you could remove the notify temporarily, while you stage and test, then enable it when you're happy with the result [15:49:25] yeah, that may be what ends up happening [15:49:53] although, imo letting the randomness of puppet decide when to restart apaches (even gracefully) is fugly [15:50:07] that is also true.... [15:50:15] I'd rather have to push a go button and do them all at once [15:52:20] ya, perhaps a couple at a time [15:53:10] New patchset: Pyoungmeister; "replaceing content with source for conf file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29332 [15:53:14] yeah, i want to write some kinda rolling restart script [15:53:17] where it restarts some [15:53:24] and then asks if you want to continue [15:53:29] could be handy tool [15:53:38] sure [15:54:09] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29324 [16:01:17] !log reedy synchronized php-1.21wmf2/extensions/ExtensionDistributor [16:01:30] Logged the message, Master [16:02:14] !log reedy synchronized wmf-config/ [16:02:27] Logged the message, Master [16:03:18] PROBLEM - Host db1013 is DOWN: PING CRITICAL - Packet loss = 100% [16:06:00] RECOVERY - Host db1013 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [16:09:14] New patchset: Pyoungmeister; "adding apaches::ganglia to imagescalers, api, bits, and test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29334 [16:10:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29334 [16:10:38] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29334 [16:14:44] New patchset: Pyoungmeister; "correcting port for apache server status" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29335 [16:15:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29335 [16:16:13] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29335 [16:24:17] j^: no it doesn't work that way [16:24:21] you have to move it to the module [16:24:39] are you tired by this shit already and you want me to do it for you? :) [16:28:00] New patchset: J; "rename ::jobserver to mediawiki_new::jobserver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29324 [16:28:14] paravoid: pushed one more try, after that you can do it. :) [16:29:10] heh, yeah, it needs another fix :) [16:29:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29324 [16:29:26] templates & files go in the module now [16:31:16] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [16:46:21] New patchset: J; "rename ::jobserver to mediawiki_new::jobserver" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29324 [16:46:59] !log stopping mysql on db1025 to clone it to db1013 [16:47:13] Logged the message, Master [16:47:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29324 [16:49:11] py is doing a graceful restart of all apaches [16:49:25] !log py gracefulled all apaches [16:49:38] Logged the message, Master [16:51:56] PROBLEM - mysqld processes on db1025 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [17:19:12] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [17:25:12] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours [17:28:46] Whom should I ping specifically about an IPv6 issue? https://bugzilla.wikimedia.org/show_bug.cgi?id=41270 "No IPv6 addresses on Wikimedia nameservers" [17:30:10] Probably should just have an RT ticket made and ops will deal with it... [17:30:52] New patchset: CSteipp; "LBFactory_Multi setup for labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29344 [17:35:37] Reedy: bad answer, apparently [17:39:59] K4-713: can you give me region as well? [17:40:11] yep [17:40:25] wait1 [17:42:20] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [17:42:52] notpeter: Did I forget the sudo? 
[17:43:05] Reedy: would seem so [17:56:15] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [18:10:04] ottomata: so the analytics c2100s [18:10:18] we dont have SSDs in there do we? (i dont think we do) [18:10:45] ummmm are there SSDs in the ciscos? [18:10:50] either the ciscos or the c2100s have ssds [18:10:52] can't remember which [18:11:13] you guys only spun up one of the c2100s [18:11:27] but either way, doesnt matter, cmjohnson1 and i were discussing the 720 replacements [18:11:42] but we are just going to require them to have the 2.5" internal bracket [18:11:46] i bet it was the dell's [18:11:56] i think so too [18:12:10] i have put ssd's in cisco, but it was for something else i think. [18:12:46] soon, soon the c2100s will be an unhappy memory [18:12:48] =P [18:15:27] hah, yeah [18:15:35] RobH, we have 12 of the C2100s running [18:15:41] umm, no 11 [18:15:43] analytics1011-1022 [18:16:10] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: enwiki to 1.21wmf2 [18:16:10] 12 [18:16:12] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.156 second response time [18:16:14] uhhh [18:16:24] Logged the message, Master [18:16:25] 1011-1022 [18:16:46] riiiigh, wait I am forgetting about 23-28 [18:17:12] analytics has a total of 27 boxes [18:17:27] 1001-1027 [18:17:34] 1001-1010 are cisco [18:17:48] 1011-1022 are c2100 [18:17:54] 1023-1027 are r310 [18:18:02] ahhh, right ok [18:18:04] danke [18:18:08] So my understanding is I am unracking 1011 LAST [18:18:16] and the rest can be pulled when i get to them [18:18:30] aaaand, i think the c2100s are SSDs (I don't actually care about SSD so much, we want sppaaaaaace) [18:18:30] ottomata: if you have data you need pulled off those, do so today please =] [18:18:36] naw, its cool, they can go anytime [18:18:40] cool [18:18:44] i'm running stuff on various ones [18:18:45] you want 1011 to stay till last right? [18:18:59] but its all volatile [18:19:05] it doesn't really matter [18:19:07] cool, im going to start unracking and boxing them tomorrow [18:19:13] they will go back next week [18:19:16] a couple of weeks ago we were doing stuff, but we have the data already [18:19:32] and the 720 replacements are scheduled to ship, but we are putting a slight hold while we get dell to add the 2.5" internal brackets to the order [18:19:44] I would assume about three weeks from now [18:19:49] but perhaps four if things go horribly wrong [18:20:02] what does the 2.5 bracket stuff mean for us? [18:20:06] on the plus side, the 720 seems a hell of a lot more stable (plus its part of the R line we usually use) [18:20:26] so these are hot swap disk slots, a bunch of 3.5" disks on front [18:20:28] much like the r510s for databases [18:20:32] and the c2100s were [18:20:44] but just like both of those, we want the ability to add in up to 2 2.5" disks internally [18:20:54] Ah cool, so [18:20:54] just means that dell includes a hardware mount inside the chassis [18:20:55] 2 2.5s for hotswap? [18:21:00] OS can go on those or something? [18:21:01] the internals dont hot swap [18:21:16] on the swift hosts its for quick data caching on ssd [18:21:22] oh!
[18:21:25] hm [18:21:26] OS on that for server is usually not worth it [18:21:40] didn't realize that meant the 2.5s were going to be SSD [18:21:40] not saying we have to include or will in the analytics hosts [18:21:41] nonono [18:21:43] right [18:21:45] um [18:21:45] ha [18:21:47] Ok, let me explain [18:21:54] you dont have SSDs now that i know of [18:21:55] because we didnt order them for you [18:22:00] i think we do... [18:22:04] (now, perhaps we have a single machine with one cuz you asked) [18:22:08] not that we need them, but I think we do [18:22:08] oh? [18:22:08] hmm [18:22:11] i didnt install them. [18:22:13] so i dunno how [18:22:13] dschoon, do you recall? [18:22:22] (that i recall) [18:22:22] i remember having to think about this when we set these up [18:22:23] buh. *reads* [18:22:31] do we have SSDs on any of our machines? (dells?) [18:23:20] i think the C2100s we have do, because we are not doing mirrored RAID for OS on them like we are the Ciscos [18:23:24] ..... [18:23:37] hrmm, lemme look [18:23:39] here's what i recall, but i never interacted with them. [18:23:52] I am pretty sure that all/some of the C2100s have an SSD in them [18:24:18] the plan was to use it for the Write Ahead Log, as that's the single IO bottleneck for all the distributed datastores [18:24:26] so the faster you can fsync it, the better [18:24:50] but! i believe the SSDs that went into the boxes were the same that the database servers usually have [18:24:54] and they do not respect fsync [18:25:09] (they ACTUALLY LIE TO YOU when you call it) [18:25:22] hrmm, i think you may be right [18:25:28] so, 1. if they *do* have SSDs, and i think they do, you are welcome to take them back [18:25:42] i see 8 seagate 2tb disks [18:25:47] 2. we *do* need space badly, so if we have normal platters lying around, that'd be awesome [18:25:47] and a single toshiba 160gb [18:25:50] right. [18:25:58] that toshiba is the SSD, yes? [18:26:01] dschoon: Ok, so we are going to pull those out of the c2100s, as they dont go back to dell with them [18:26:05] correct [18:26:10] and we will put them into the new 720s that come in [18:26:16] those came from the closet where we have just piles of them [18:26:18] cmjohnson1: ^ see above, we need those brackets for all of them [18:26:24] dschoon: yep, that would be them [18:26:34] well, confirm or deny -- they don't respect fsync [18:26:39] dschoon: I did that for all the swift backends as well, so not shocked i dont recall [18:26:40] dschoon: no clue. [18:26:42] hokay. [18:26:46] you would wanna chat with asher about them [18:26:48] welp. i guess we'll work it out. [18:26:53] yeah, this is coming from asher. [18:26:54] he has tested them and a bunch of others [18:26:55] ok [18:27:06] it was a few weeks ago, though, so i might recall incorrectly [18:27:14] well, if it turns out you dont need these in the new ones [18:27:20] find out and lemme know [18:27:23] saves me having to remove them later to put in something else [18:27:50] yeah. [18:27:52] exactly. [18:28:51] !log reedy synchronized php-1.21wmf2/skins/CologneBlue.php [18:29:03] Logged the message, Master [18:29:06] any chance we have a closet of normal spinning platters lying around? [18:29:29] not normal, BIG ONES [18:32:32] uhh.... [18:32:32] no. [18:32:33] for what?
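dschoon's claim at 18:24-18:26 (SSDs that acknowledge fsync before the data is on stable storage) is easy to sanity-check from a shell. A rough sketch, assuming the suspect disk is mounted at /mnt/test:

    # 1000 single-block writes, each forced out with O_DSYNC. A drive that
    # honestly flushes every write is bounded by its write latency; a rate
    # far above a few hundred writes/sec suggests the cache is lying.
    dd if=/dev/zero of=/mnt/test/syncfile bs=4k count=1000 oflag=dsync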
[18:32:42] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:33:01] heheh, i think he was thinking instead of the SSD, we'd be ok with more space :) [18:33:29] just use the 3.5" disks they come with then =P [18:33:34] 2.5" form factor isnt very good for space [18:34:26] i dunno if the 720s that we are getting for you are at full 2.5" disk capacity [18:34:53] can always budget for more disks to expand if there is space, but we dont keep that many disks on hand [18:35:02] technically we used SSD spares for you guys [18:35:08] ;] [18:36:04] wait, really? [18:36:13] they don't have any more drive slots? [18:37:43] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.864 second response time [18:38:14] dschoon: so on the c2100s [18:38:20] you guys have 8 of the 12 disk slots filled [18:38:26] New patchset: Asher; "adding db65 to s4" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29407 [18:38:26] its all you had budget for [18:38:27] yes. [18:38:40] cmjohnson1: did you get a quote for the replacement analytics or not yet? [18:39:16] dschoon: So the new 720XD can also hold 12 disks [18:39:22] but you guys are only populating 8 of them [18:39:25] with 2tb disks. [18:39:29] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29407 [18:39:58] dschoon: So, the new systems are identical to the old in the number of 2.5" slots (12 - hotswap) and internal 2.5" slots (non hot swap) [18:40:06] the systems come with 8 of those 12 filled for you [18:40:32] aiight. i thought you had said we didn't have any room for growth in these machines [18:40:32] phew [18:43:37] confusion \o/ >_< [18:46:15] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [18:53:47] !log streaming hotbackup of db1038 to db65, preparing to replace db22 [18:53:59] Logged the message, Master [18:58:29] robh: i did not get anything back yet [18:58:42] no worries, just curios [18:58:44] curious [18:59:19] mark, binasher, robla: hangout URL for the meeting: https://plus.google.com/hangouts/_/a5c3abf6f7c28dddd22c08f9e80b4aa87c246d63 [19:06:52] !log reedy synchronized php-1.21wmf2/includes/api/ApiParse.php [19:07:05] Logged the message, Master [19:13:44] now here was a meeting that could not have been done by email or irc ;-) [19:14:09] i agree! [19:14:09] you know how long that would have taken to *type*? [19:14:38] New review: Hashar; "Nice. Added a few inline comment. I am not sure what is the default charset when establishing a con.." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/29344 [19:26:42] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:28:12] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.255 second response time [19:39:45] RECOVERY - mysqld processes on db1025 is OK: PROCS OK: 1 process with command name mysqld [19:41:52] RECOVERY - mysqld processes on db1013 is OK: PROCS OK: 1 process with command name mysqld [19:44:07] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 6630 seconds [19:52:22] notpeter: hi, is there a chance that you could grep/find on the Wikimedia Commons server for SVG files that miss / do not include …? Asking as presumably the recent move to swift cleared some caches and now some (strictly speaking invalid) SVG files are not rendered anymore, and we get quite some bug reports as it's highly visible.
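One way to act on that request, anticipating the "download them one by one and inspect them" idea floated at 20:03 below; strictly a sketch, where svg-paths.txt would come from a database query for SVG titles (hashed into their upload paths) and the grep pattern stands in for whatever markup the broken files are missing:

    while read -r path; do
      curl -s "https://upload.wikimedia.org/wikipedia/commons/${path}" \
        | grep -q 'PATTERN_FOR_REQUIRED_MARKUP' || echo "suspect: ${path}"
    done < svg-paths.txt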
[19:53:22] Probably not the right person to ask [19:53:29] It's doable, it's just not gonna be quick [19:53:37] AaronSchulz: ^ [19:53:51] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 17 seconds [19:55:11] Reedy, thanks [19:55:22] background: https://bugzilla.wikimedia.org/show_bug.cgi?id=41174 [19:57:02] Aye [19:59:46] andre__: yeah, I think that aaron would be much better suited to this than I [19:59:56] glad it got to the right person :) [19:59:57] sorry then for pinging [20:00:04] no prob :) [20:00:25] notpeter: I suggested you because you were "on rt duty" in the /topic - what would have been the right way to go instead? [20:00:38] (in the "I know I need to talk to Ops but am not sure who" [20:00:38] ) [20:00:55] ah, yes, that's outdated. was me as of last week [20:01:07] binasher: change topic, plx :) [20:01:36] sumanah: but, yay! I'm glad that we're getting into the groove of using the ops rotation thing :) [20:01:53] ok! thanks! yes, I should pay more attention to the date the /topic was changed :) [20:02:04] and I'm very grateful for the ops rotation [20:02:24] andre__: ^ (you are absolved!) [20:02:53] still something that AaronSchulz will probably have to help with [20:03:25] oh, definitely :) [20:03:41] New patchset: Demon; "Upping gerrit timeout to 12m" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29432 [20:03:46] there's no way to grep all of the svg's in swift but code could be written to find them in the db, download them one by one, and inspect them.. sounds like a bit of a monster [20:04:51] New patchset: Demon; "Upping gerrit timeout to 12m" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29432 [20:05:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29432 [20:06:22] New review: Andrew Bogott; "More is more!" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/29432 [20:06:22] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29432 [20:08:15] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [20:12:44] seee come no [20:27:09] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:55] New patchset: CSteipp; "LBFactory_Multi setup for labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29344 [20:35:06] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.128 second response time [20:49:43] !log draining traffic from cr2-eqiad in preparation for maintenance and upgrade [20:49:55] Logged the message, Mistress of the network gear. [20:56:23] New patchset: CSteipp; "LBFactory_Multi setup for labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29344 [21:04:42] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:06:09] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.177 second response time [21:11:15] PROBLEM - Apache HTTP on srv194 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:12:21] New patchset: Pyoungmeister; "correcting mac for pc3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29438 [21:12:45] RECOVERY - Apache HTTP on srv194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.037 second response time [21:13:18] New review: gerrit2; "Lint check passed." 
[21:13:33] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29438
[21:17:27] !log olivneh synchronized php-1.21wmf2/extensions/PostEdit
[21:17:39] Logged the message, Master
[21:19:47] !log olivneh synchronized php-1.21wmf2/extensions/Vector
[21:20:01] Logged the message, Master
[21:20:25] !log replacing disk13 on tridge array
[21:20:37] Logged the message, Master
[21:23:11] !log commit full on cr2-eqiad which will restart all routing processes
[21:23:23] Logged the message, Mistress of the network gear.
[21:48:03] !log switch routing engine mastership on cr2-eqiad
[21:48:16] Logged the message, Mistress of the network gear.
[21:49:03] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.197,
[21:50:55] PROBLEM - Host cr2-eqiad is DOWN: CRITICAL - Network Unreachable (208.80.154.197)
[21:51:54] PROBLEM - Host wikipedia-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::1
[21:52:07] PROBLEM - Host wikiquote-lb.esams.wikimedia.org_ipv6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::3
[21:52:21] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 29, down: 0, shutdown: 2
[21:52:30] RECOVERY - Host cr2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 35.07 ms
[21:52:39] RECOVERY - Host wikiquote-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 112.44 ms
[21:52:49] RECOVERY - Host wikipedia-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 112.54 ms
[21:53:17] sorry about the pages folks
[21:53:26] i guess the bgp flip made the connectivity unhappy
[21:53:32] during the RE switchover
[22:01:46] !log stopping puppet on aluminium so fr-tech can local-test a config change
[22:01:58] Logged the message, Master
[22:06:01] !log undraining cr2-eqiad
[22:06:13] Logged the message, Mistress of the network gear.
[22:19:06] !log asher synchronized wmf-config/db.php 'replacing db22 with db65'
[22:19:19] Logged the message, Master
[22:27:48] paravoid: why is https://office.wikimedia.org/w/img_auth.php/f/ff/TheHammock.jpg served from nginx?
[22:28:30] You have a hammock in the office? o.0
[22:29:22] AaronSchulz: https
[22:30:04] Damianz: one of the developers bought it and brought it in
[22:30:07] it is now "the coding hammock"
[22:31:25] also AaronSchulz it doesn't autocomplete in search
[22:32:57] ^demon|away: would it make sense to coordinate gerrit & jenkins downtimes?
[22:33:04] !log switching cr2-eqiad routing engine mastership
[22:33:11] hashar: ^^^
[22:33:16] Logged the message, Mistress of the network gear.
[22:33:36] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: Requested table is empty or does not exist,
[22:33:40] paravoid: gerrit downtime is probably more critical, but it would be great for me to upgrade them both at the same time
[22:33:40] I don't know much about that infrastructure but I think they depend on each other and I guess it'd suck to have two downtimes in the same week
[22:34:23] paravoid: I have some projects waiting on them both, so the sooner the better (thanks for bringing it up)
[22:34:32] ok, I'll reply to the mail, clearly the relevant people are not here :)
[22:34:55] thanks!
[22:35:54] paravoid: gerrit downtimes are usually less than an hour. The upgrade is quick :)
[22:37:21] !log asher synchronized wmf-config/db.php 'raising weight of db65'
[22:37:33] Logged the message, Master
[22:38:44] hashar: chad is doing a dist-upgrade to precise, not a gerrit upgrade
[22:38:59] so, a bit longer and about the same if we do a dist-upgrade on gallium :)
[22:39:17] New patchset: Asher; "adding db22 to decom list" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29459
[22:40:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29459
[22:40:51] paravoid hashar I understand there is also data to be backed up and restored, at least on gallium, that might affect downtime too?
[22:41:15] I think we're just going to do a dist-upgrade
[22:41:22] no point in copying 46G around for a reinstall
[22:43:29] sigh
[22:43:32] !log draining cr2-eqiad again for downgrade
[22:43:45] Logged the message, Mistress of the network gear.
[22:44:44] LeslieCarr: I was trying to imagine what "undraining" means, and here you go draining again :)
[22:44:46] hehe
[22:45:00] just a term for moving traffic on and off the router via configuration methods
[22:45:02] <^demon|away> paravoid: We can coordinate if you think that's best. The upgrades don't truly depend on one another though.
[22:45:07] rather than just pulling the plug with live traffic
[22:45:14] <^demon|away> (Other than in the "related systems" sense)
[22:45:20] figured.
[22:45:31] I just like "undraining" :)
[22:48:22] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29459
[22:48:51] <^demon|away> paravoid: It's dinnertime here, so let's just continue this discussion onlist :)
[22:50:32] New patchset: Hashar; "all-wmflabs.dblist now contains beta wiki list" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29463
[22:51:00] New patchset: Hashar; "beta: add enwikivoyage to the dblist" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29464
[22:51:14] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29463
[22:51:19] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29464
[22:51:36] csteipp: Reedy ^^^ had to fix the all-wmflabs.dblist which was providing the production database list
[22:51:56] added back enwikivoyage
[22:52:45] Cool. So going forward, should people add directly to those files in git, or let addWiki.php add to them?
[22:53:09] we still have to handle wikiversions.dat though
[22:53:15] Ah, that too...
[22:54:20] I have updated to the latest mediawiki-config
[22:56:01] restarted the job runner
[22:56:05] no moaar spam
[22:57:15] so mw-update-l10n now runs as root on -dbump in a screen
[22:57:59] Mon Oct 22 22:57:40 UTC 2012 deployment-jobrunner06 enwiki JobQueueDB::claim 10.4.0.53 105 Unknown column 'job_token' in 'where clause' (10.4.0.53)
[22:58:00] bahh
[22:58:24] Reedy: do you happen to know if the new job_token column has been validated by asher / deployed on production?
[22:58:29] yup
[22:58:31] it's live
[22:58:57] need to update beta now :-]
[22:59:08] I usually do something like "foreachwiki update.php"
[22:59:15] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[22:59:34] heh
[22:59:35] probably will be fine
[22:59:42] of course foreachwiki uses all.dblist :-]
[22:59:57] foreachwikiinlabsgoddamnit
[23:00:17] !log switching RE mastership on cr2-eqiad
[23:00:31] Logged the message, Mistress of the network gear.
[23:02:51] Reedy: I would totally write that method
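The wished-for "foreachwikiinlabs" could be a thin wrapper that reads the beta dblist instead of production's all.dblist. A sketch under assumed names: the dblist path is a guess, and the mwscript invocation is modeled on the production wrapper, not taken from real labs tooling.

    #!/usr/bin/env python3
    # Hypothetical "foreachwikiinlabs": run a maintenance script against
    # every wiki in the labs dblist. DBLIST path and mwscript usage are
    # assumptions for illustration.
    import subprocess
    import sys

    DBLIST = "/usr/local/apache/common/all-wmflabs.dblist"  # assumed path

    def foreachwikiinlabs(script, *args):
        with open(DBLIST) as f:
            wikis = [line.strip() for line in f if line.strip()]
        for wiki in wikis:
            print("-----", wiki, "-----")
            # mwscript resolves the right MediaWiki version for the wiki.
            subprocess.call(["mwscript", script, "--wiki=" + wiki, *args])

    if __name__ == "__main__":
        foreachwikiinlabs(sys.argv[1], *sys.argv[2:])

Run as, say, "foreachwikiinlabs update.php --quick" to apply pending schema changes (such as the missing job_token column above) across every beta wiki.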
[23:04:33] !log reedy synchronized php-1.21wmf2/includes/EditPage.php
[23:04:45] Logged the message, Master
[23:05:31] paravoid: sorry been busy with a labs issue :/
[23:05:54] paravoid: Gerrit downtimes are usually less than an hour iirc
[23:06:10] the gallium upgrade is probably going to take a bit more :/
[23:13:20] bed time for now
[23:13:27] bye hashar
[23:15:18] sumanah: thanks :-]
[23:19:59] New patchset: Asher; "adding pecl memcached ext to apaches, removing db22 cruft from mysql.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29471
[23:21:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/29471
[23:21:07] binasher: is that going to be used by all apaches? if yes, please add it to manifests/apaches.pp
[23:21:21] Change abandoned: Reedy; "https://gerrit.wikimedia.org/r/#/c/29471/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7349
[23:21:41] oh, ah, added it in mediawiki.pp, nvm
[23:21:42] notpeter: should that be in place of either of the places where i added it?
[23:22:04] nope, sorry. just didn't see it in mediawiki.pp
[23:22:25] thanks for reviewing!
[23:22:39] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/29471
[23:22:54] I'm trying to be hawkish about making sure that everything goes into the modules and not the older stuff :)
[23:23:08] as this is my transitional mess
[23:24:49] this needs to go everywhere mediawiki runs on the cluster, including boxes that might just do mwscript / cron but maybe don't have apache. if such a thing even exists
[23:25:20] ah, yes, then the mediawiki_new module would be better than the applicationserver module
[23:25:25] sorry for the naming... :/
[23:26:00] ah.. do new apaches get both modules?
[23:26:03] yes
[23:26:15] ok, i'll move it
[23:27:13] huh, i wonder if php-luasandbox should move there too
[23:27:49] there aren't that many cases where there's a mediawiki install and not apache...
[23:28:10] mostly the jobrunners and the random boxes for scripts
[23:29:08] do the new job runners currently get the applicationserver module?
[23:29:16] parts of it :/
[23:29:29] packages.pp?
[23:30:06] * notpeter sighs
[23:30:06] yes
[23:30:24] also, for the record, it's really hard to untangle two things that are completely interdependent... :/
[23:30:35] yeah
[23:31:11] applicationserver::packages has all of the php extensions, so it seems like php5-memcached should actually be in there too
[23:31:34] sure, seems reasonable
[23:31:49] 7 Warning: PHP Startup: apc.shm_size now uses M/G suffixes, please update your ini files in Unknown on line 0
[23:33:06] wait a second, it's 7pm and I'm drinking beer. why am I in work irc? ttfn!
[23:33:18] notpeter: good question
[23:38:24] RECOVERY - mysqld processes on es10 is OK: PROCS OK: 1 process with command name mysqld
[23:42:09] PROBLEM - MySQL disk space on db1001 is CRITICAL: DISK CRITICAL - free space: /a 54263 MB (3% inode=99%):
[23:42:54] PROBLEM - MySQL Replication Heartbeat on es10 is CRITICAL: CRIT replication delay 602849 seconds
[23:42:54] PROBLEM - MySQL Slave Delay on es10 is CRITICAL: CRIT replication delay 602850 seconds
[23:45:18] RECOVERY - MySQL disk space on db1001 is OK: DISK OK