[00:01:38] binasher: no i left [00:01:50] what's up? did you see my msg about the main board? [00:02:11] New patchset: Reedy; "Move noc from /h/w/htdocs to /h/w/c/docroot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23425 [00:02:12] mutante: ^ Can you approve that to finish tidying up the noc docroot move please? :D [00:03:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23425 [00:06:02] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [00:08:08] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100% [00:08:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [00:09:29] !log storage3 being reimaged [00:09:41] Logged the message, Master [00:10:49] Warning: opendir(/mnt/originals/private/ExtensionDistributor/mw-snapshot/trunk/extensions) [function.opendir]: failed to open dir: No such file or directory in /usr/local/apache/common-local/php-1.21wmf1/extensions/ExtensionDistributor/ExtensionDistributor_body.php on line 80 [00:11:57] mutante: ALSO! did you move the noc folder over the new location? And if so, can we delete /h/w/htdocs/noc and noc.old? [00:14:07] Oct 11 23:42:00 10.0.8.33 apache2[1735]: PHP Warning: opendir(/mnt/originals/private/ExtensionDistributor/mw-snapshot/trunk/extensions) [function.opendir]: failed to open dir: No such file or directory in /usr/local/apache/common-local/php-1.21wmf1/extensions/ExtensionDistributor/ExtensionDistributor_body.php on line 80 [00:14:26] fenari has /mnt/originals, apaches still have /mnt/upload7 [00:29:03] !log reedy synchronized php-1.21wmf1/extensions/Collection 'Update collection to master' [00:29:14] Logged the message, Master [00:31:34] !log reedy synchronized php-1.21wmf1/extensions/Collection 'Revert due to error' [00:31:45] Logged the message, Master [00:37:28] !log reedy synchronized php-1.21wmf1/extensions/Collection 'And back to master' [00:37:40] Logged the message, Master [00:41:06] New patchset: Ryan Lane; "Check users against a trust list" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/27657 [00:41:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:42:19] !log switched payments back to pmtpa and increased TTL to 5 min [00:42:30] Logged the message, Master [00:43:55] New patchset: Ryan Lane; "Check users against a trust list" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/27657 [00:44:08] X-Storage-Url: http://localhost:8080:8080/swift/v1 [00:44:10] X-Storage-Token: AUTH_rgwtk0f0000006d6564696177696b693a7377696674de89c43e1b0a27b4756a77501802d330000ae4d0603c493a0472bc57f54665131a587560 [00:44:11] X-Auth-Token: AUTH_rgwtk0f0000006d6564696177696b693a7377696674de89c43e1b0a27b4756a77501802d330000ae4d0603c493a0472bc57f54665131a587560 [00:44:24] binasher: hmm, I wonder why the gateway is adding 8080 twice :/ [00:48:30] Reedy: Krinkle|detached : eh, i'm back. yes, we did not use the index.html from puppet repo. Erik made a change in mediawiki-config repo. (does noc. really belong in mediawiki though?) [00:48:59] Maybe, maybe not [00:49:01] i merged the change in mediawiki-config, then saw that the Apache config on fenari still pointed to a different directory [00:49:03] Better than it was before ;) [00:49:13] so it did not make the actual noc. 
change [00:49:37] to make Erik's change go live i made that change to the doc root [00:50:01] i saw one other related change sitting in gerrit, but it was not merged yet [00:50:30] note that https://noc.wikimedia.org/ still has the old content [00:52:09] Reedy: i also used sync-common on the index.html , but that makes no sense since it is just hosted on fenari itself and not the cluster .. ?! [00:52:22] did we also want to change that? [00:54:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.023 seconds [00:55:00] Hm? [00:55:34] the files are in the ./common directory now, right. so usually you would use sync-common after making changes there for cluster changes [00:55:44] but noc.wm is just hosted on fenari itself and not the cluster [00:55:47] yup [00:56:00] did you push the config change and restart apache too? [00:56:11] did we want to change the fact that it is hosted on fenari? [00:56:27] nope [00:56:30] that shouldn't change.. [00:56:30] i pushed the change, but did not restart apaches, just changes to an index.html [00:56:38] but that made no difference [00:56:52] well, except that it also syncs to fenari [00:57:09] yeah, currently it's pointing at /h/w/htdocs/noc... so the changes need to be made in there... [00:57:19] https is different from http now [00:57:56] Reedy: but that is what i changed earlier [00:58:03] https://gerrit.wikimedia.org/r/#/c/23425/6/files/apache/sites/noc.wikimedia.org,unified [00:58:09] You didn't update both versions :p [00:58:10] https://gerrit.wikimedia.org/r/#/c/27646/2/files/apache/sites/noc.wikimedia.org [00:58:28] you only updated it for http/port 80 [00:58:39] gotcha..yup [00:58:49] https://gerrit.wikimedia.org/r/#/c/23425/ [00:58:52] ^ that does it all ;) [00:59:47] cool [00:59:55] i guess it needs rebasing now? [01:00:04] i rebased it a few minutes ago [01:01:18] ok, looks good to me. want me to merge? [01:01:32] yeaah [01:01:36] please [01:01:39] then we can tidy it up [01:01:49] what exactly do you mean :Note, this also requires the files copying first! [01:01:58] oh [01:01:58] i just did a git pull in the directory [01:02:04] that was from before [01:02:06] after merge in mediawiki repo [01:02:12] ok [01:02:22] need to copy in any last changes from /h/w/htdocs/noc [01:02:26] but then we should delete it ;) [01:02:38] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23425 [01:05:08] merged on sockpuppet [01:06:20] yay [01:06:23] can you do the file sync? [01:06:28] running puppet on fenari [01:06:31] thanks [01:06:34] Yeah, I will do [01:06:39] np.yw [01:07:54] done on fenari. Filebucketed /etc/apache2/sites-available/noc.wikimedia.org [01:08:10] !log restarting Apache on fenari for noc.wm change [01:08:21] Logged the message, Master [01:08:41] Eloquence: https:// also looks good to me now [01:08:51] looks like both are up to date [01:08:53] sweet. [01:09:18] thanks guys. [01:09:21] thanks Reedy, did not see you had this change right in there waiting for merge [01:09:32] heh :) [01:09:55] rm -rf /home/wikipedia/htdocs/noc.old [01:10:01] rm -rf /home/wikipedia/htdocs/noc [01:10:08] ^ Mind tidying up the old files too? [01:11:45] if you say they are synced. sure [01:12:01] yeah, copied one over the other and no changes [01:12:06] heh, so old. 
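A dry-run rsync — the "simulated option" mentioned a little further down — is the quick way to confirm the old and new noc docroots really match before deleting anything. A minimal sketch, not the exact commands used; the new path under /home/wikipedia/common/docroot is the move destination discussed above:

    # Dry run only (-n): list what would change, copy nothing; -c compares by checksum.
    rsync -avnc /home/wikipedia/htdocs/noc/ /home/wikipedia/common/docroot/noc/

    # An empty file list means the trees already match, so the old copies can go:
    # rm -rf /home/wikipedia/htdocs/noc.old
    # rm -rf /home/wikipedia/htdocs/noc
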
should be old.old now?:) [01:12:24] alright, lets nuke it then [01:13:20] Most of the stuff in /h/w/htdocs is rubbish [01:13:39] I do like the full copy of bugzilla in /h/w/htdocs/bugzilla though [01:15:20] !log deleting old document root dirs on fenari. /home/wikipedia/htdocs/noc and noc.old (use /h/w/common/docroot/noc and mediawiki-config repo) [01:15:34] Logged the message, Master [01:15:36] done. i also did a quick rsync with simulated option [01:16:01] Bugzilla. heh..:p [01:19:36] lets move it to /root/backup first..you never know, people just asked for patches we applied in the past [01:19:42] just 28M [01:21:15] !log moving ancient copy of bugzilla out of /h/w/htdocs (to /root/backup on fenari temp. just in case) [01:21:27] Logged the message, Master [01:21:54] Reedy: looked at the timestamps?:) [01:22:03] Yup [01:22:19] there's quite a lot of random stuff around on there [01:22:27] that is almost for something like ./museum/ [01:28:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:41:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.227 seconds [01:42:20] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 308 seconds [01:42:28] New patchset: Kaldari; "Updating $wgNoticeHideBannersExpiration for new fundraiser" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27660 [01:43:59] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27660 [01:45:29] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 8 seconds [02:01:16] !log LocalisationUpdate failed: git pull of extensions failed [02:01:27] Logged the message, Master [02:15:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:31:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [02:41:17] RECOVERY - Puppet freshness on srv190 is OK: puppet ran at Fri Oct 12 02:40:49 UTC 2012 [02:42:47] RECOVERY - Puppet freshness on mw70 is OK: puppet ran at Fri Oct 12 02:42:46 UTC 2012 [02:45:47] RECOVERY - Puppet freshness on srv249 is OK: puppet ran at Fri Oct 12 02:45:40 UTC 2012 [02:50:53] RECOVERY - Puppet freshness on mw38 is OK: puppet ran at Fri Oct 12 02:50:45 UTC 2012 [02:53:17] RECOVERY - Puppet freshness on db55 is OK: puppet ran at Fri Oct 12 02:53:09 UTC 2012 [02:56:53] RECOVERY - Puppet freshness on srv263 is OK: puppet ran at Fri Oct 12 02:56:24 UTC 2012 [02:57:20] RECOVERY - Puppet freshness on srv291 is OK: puppet ran at Fri Oct 12 02:57:16 UTC 2012 [03:00:20] RECOVERY - Puppet freshness on mw44 is OK: puppet ran at Fri Oct 12 02:59:57 UTC 2012 [03:02:08] RECOVERY - Puppet freshness on db1038 is OK: puppet ran at Fri Oct 12 03:01:56 UTC 2012 [03:03:38] RECOVERY - Puppet freshness on db1043 is OK: puppet ran at Fri Oct 12 03:03:22 UTC 2012 [03:04:23] RECOVERY - Puppet freshness on mw74 is OK: puppet ran at Fri Oct 12 03:04:16 UTC 2012 [04:05:17] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [04:05:17] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [04:40:14] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [04:57:45] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [06:06:58] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not 
run in the last 10 hours [06:25:48] New patchset: Tim Starling; "Don't exit on fflush() error" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/27662 [06:40:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:50:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.611 seconds [06:55:00] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [07:24:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:39:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.043 seconds [08:12:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:21:24] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [08:25:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.282 seconds [08:50:00] hello [08:51:52] * Damianz waves at hashy [08:59:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:14:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.057 seconds [09:43:45] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [09:43:45] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [09:48:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:03:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.039 seconds [10:49:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.354 seconds [11:08:47] !log Disabled csw2-oe12-esams:vcp-1 [11:09:00] Logged the message, Master [11:24:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:38:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.247 seconds [12:12:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:16:04] !log Added new EX4200-48T asw-oe11-esams to csw2-esams stack [12:16:16] Logged the message, Master [12:26:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.727 seconds [12:27:55] !log Upgraded new stack member asw-oe11-esams to junos 11.4R5.5 and rebooted it [12:28:06] Logged the message, Master [13:01:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:17:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.039 seconds [13:24:27] !log storage3 powering down [13:24:38] Logged the message, Master [13:45:05] New patchset: Reedy; "Remove duplicate servername line" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27682 [13:46:07] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27682 [13:48:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:49:52] !log Moved AS38930 transit from csw1-esams:e8/4 to cr1-esams:xe-1/3/0 [13:50:04] Logged the message, Master [13:59:42] !log Activated aggregate generation on cr1-esams [13:59:54] Logged the message, Master [14:02:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.971 seconds [14:06:42] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [14:06:42] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [14:17:24] !log Upgrading JunOS on cr1-esams to 11.4R5.5, preparing for reboot [14:17:35] Logged the message, Master [14:19:48] mark: so, maxmind doesn't provide geoip city with ipv6 as far as I can see [14:20:01] I'm not sure what can we do regarding that RT then other than just fall back on geoiplookup.wm.org [14:21:10] btu they provide geoip country ipv6 right now, right ? [14:21:14] me neither [14:22:03] they do but what's the point? [14:23:40] New patchset: Hashar; "(bug 40686) zuul role for production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [14:24:39] New patchset: Hashar; "import zuul module from OpenStack" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25235 [14:25:36] New patchset: Hashar; "zuul role for labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25236 [14:26:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27611 [14:26:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25235 [14:26:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25236 [14:26:52] hello there. Got a small issue installing php5 packages on labs. I get weird dependency issue: [14:26:52] php5-mysql : Depends: php5-common (= 5.3.10-1ubuntu3.4) but 5.3.10-1ubuntu3.4+wmf1 is to be installed [14:27:13] php5-mysql seems to be pinned: Package pin: 5.3.10-1ubuntu3.4 [14:27:22] which instance is that? [14:27:37] integration-jenkins [14:27:49] cache-policy is at http://dpaste.org/KgG8s/ [14:28:05] maybe it is missing some apt pinning [14:29:06] what are you trying to install? [14:29:16] LeslieCarr: check RT #3702 [14:29:38] yeah, saw that, thanks for the super detail [14:29:57] paravoid: I am applying the misc::contint::test::packages class wich install php packages among other. Seems the class pin package to Ubuntu version. [14:30:25] but does not pin php5-common ;-] [14:31:12] trying [14:33:37] * mark is cabling two ceph servers ;) [14:33:44] oh! [14:35:16] hashar: what a mess... 
still trying to figure it out [14:35:53] paravoid: the class is not pinning php5-common [14:36:11] plus the actual pinning is not a requirement to the package installation :-D [14:36:34] that's not it, no [14:37:00] !log Rebooting cr1-esams [14:37:11] Logged the message, Master [14:37:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:37:53] paravoid: from pmy puppet run, it failed installing package, then eventually did a a pinning of php5-common [14:38:03] then managed to install some packages [14:39:38] * hashar nature's call [14:41:39] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [14:46:11] New patchset: Hashar; "pin php5-common in misc::contint::test::packages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27692 [14:46:28] paravoid: https://gerrit.wikimedia.org/r/27692 pins php5-common . That seems to fix the issue [14:47:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27692 [14:50:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.496 seconds [14:52:44] and had to fix the ant1.8 installation class ;-) [14:52:51] New patchset: Hashar; "Precise has ant1.8 by default" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27696 [14:53:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27696 [14:54:35] New patchset: Asher; "esams bits event log routing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27697 [14:55:39] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/27697 [14:56:29] hayaaaaaa mark [14:56:44] gonna start bringing in some analytics puppet stuff soon [14:57:05] just been chatting with paravoid a bit in a PM, thought we should ask you somethign to see if you had an opinion [14:57:21] I don't htink I have a need yet for an analytics module [14:57:30] so I was going to create manifests/analytics.pp now [14:57:46] and if/when things get larger and need more parameterization, move to a module [14:57:58] s'ok? or would you rather me just start with a module right now? [14:58:25] as probably the largest proponents of modules, I think that's okay for now [14:58:36] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [14:58:41] but I don't think it's too difficult to go either way [14:58:59] i.e. a module right now won't add complexity [14:59:08] yeah, if I do a module I just have to think about more files, no? or do you just put everythign in init.pp? [14:59:33] you have to split every class in a separate file, yes [15:01:04] hm, at the moment it doesn't make sense for me to even have a class in init.pp then... [15:01:12] i don't have a single class that can be installed on every analytics node [15:02:37] that's okay [15:02:42] you don't need an init.pp [15:02:58] oh, didn't know that [15:03:23] the only class i'm going with right now (until we settle that email I sent about git submodules) is analytics::http_proxy [15:03:25] so I could do [15:03:35] modules/analytics/manifests/http_proxy.pp [15:03:35] ? [15:03:41] yes [15:03:42] class analytics::http_proxy [15:03:43] hmmm [15:03:45] ok cool! [15:03:48] (not a big fan of underscores though :P) [15:03:53] no? 
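For readers following the php5 dependency tangle above: apt-cache policy shows which pin wins for each package, and a preferences stanza of the kind the php5-common pin adds looks roughly like the sketch below. The file name and priority are illustrative assumptions, not the actual puppet-managed configuration:

    # Which candidate version (and which pin) applies to each package?
    apt-cache policy php5-common php5-mysql

    # Roughly what pinning php5-common to the stock Ubuntu version looks like
    # (illustrative file name and priority, not the puppet-generated file):
    sudo tee /etc/apt/preferences.d/php5-common <<'EOF'
    Package: php5-common
    Pin: version 5.3.10-1ubuntu3.4
    Pin-Priority: 1001
    EOF
    apt-cache policy php5-common   # the pinned version should now be the candidate
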
[15:04:01] (bikeshedding) [15:04:03] heheh [15:04:20] um ok! actually [15:04:22] i don't mind [15:04:25] analytcs::web::proxy [15:04:35] there might be aother analytics::web stuff later [15:04:40] New patchset: Asher; "esams bits event log routing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27697 [15:04:43] yeah, having a multiple hierarchy there might make sense [15:04:53] woudl that be analytics/web/proxy.pp [15:04:54] ? [15:04:54] but as always, be careful to not introduce too many abstractions early on [15:05:00] yeah, right [15:05:10] that's kinda why I was leaning towards just the analytics.pp file rather than the module :p [15:05:19] at first. [15:05:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27697 [15:06:14] hashar: okay, I have a solution for your problem [15:06:21] drop all the pinnings [15:07:11] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27697 [15:07:20] paravoid: ohh we could do that indeed [15:07:46] works much better, try it [15:07:54] Change abandoned: Hashar; "will just stop pinning them" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27692 [15:08:54] (it's already removed on your box, everything works) [15:10:15] I am not sure what will be the impact on gallium [15:10:15] New patchset: Hashar; "misc::contint::test::packages stop pinning php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27707 [15:10:24] here is the change ^^^^^ [15:10:33] might want to merge it monday though [15:10:40] since that will change the PHP version on gallium which run the Jenkins jobs [15:11:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27707 [15:13:11] New review: Faidon; "Is $CI_PHP_packages still used? If not, remove it?" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/27707 [15:14:05] New review: Hashar; "they are used just above to ensure the packages are present." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/27707 [15:14:36] gah talk about layering violations [15:14:43] a random misc class installing php packages [15:21:31] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27707 [15:21:54] hashar: done & merged [15:22:02] thanks ;) [15:22:27] paravoid: also what are we going to do for gallium ? Should we simply upgrade it using ubuntu script ? [15:22:31] or do we setup a new server [15:22:50] is everything puppetized? [15:23:04] not really [15:23:15] what isn't? [15:23:30] well at first I am not sure everything works fine on Precise [15:23:41] such as installing ant 1.8 but that was a trivial change anyway [15:24:16] the jobs are in a specific repository, could just reinstall that manually [15:24:41] and then there is the Android SDK installed in my home dir. I have never took the time to puppetize it [15:25:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:25:58] bad hashar [15:26:24] have set that up during xmas ;-D [15:26:28] was a puppet noob [15:26:40] can we try if everything works on a precise labs instance? [15:26:48] and puppetize the remaining bits? 
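A minimal sketch of the module layout being settled on above: an analytics module with no init.pp and a single class, analytics::web::proxy, which Puppet's autoloader finds at modules/analytics/manifests/web/proxy.pp. The class body here is a placeholder, not the actual proxy configuration:

    cd operations/puppet
    mkdir -p modules/analytics/manifests/web
    cat > modules/analytics/manifests/web/proxy.pp <<'EOF'
    # == Class: analytics::web::proxy
    # Placeholder body; the real Apache/proxy resources would live here.
    class analytics::web::proxy {
    }
    EOF
    # A node (or role class) then just does:  include analytics::web::proxy
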
[15:26:49] anyway that use some autoupdate from android + maven which install a bunch of third party libraries [15:27:16] and by we I mean mostly you :P I can help but I know zilch about jenkins [15:27:37] i installed jenkins on labs using the puppet class :-] [15:27:43] on precise? [15:27:46] what I should do is migrate the contint to a module [15:28:00] yeah on precise [15:28:03] integration-jenkins [15:28:27] the jenkins configuration is in a separate repo ( integration/jenkins.git ) but that is not deployed by puppet ;-D [15:28:30] i handle that manually [15:28:42] so that should be mostly fine [15:28:51] (I mean, migrating to a new server would be mostly fine) [15:31:20] I'd like to use labs to test that everything will work as it, then possibly rebuild gallium in-place [15:31:47] paravoid: what do you mean by rebuild in place ? [15:32:08] rebuild gallium with precise [15:32:10] on the same box [15:32:21] by running the ubuntu upgraded ? [15:32:25] err upgrader ? [15:32:33] !log Disabled csw1-esams:e8/2 [15:32:45] Logged the message, Master [15:32:48] no, pxe boot [15:32:52] format the box [15:33:44] New patchset: Ottomata; "Adding analytics.pp and setting up internal proxy for analytics cluster." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27717 [15:34:27] PROBLEM - Host foundation-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.235) [15:34:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27717 [15:34:45] PROBLEM - Host bits.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.233) [15:34:45] PROBLEM - Host upload.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.234) [15:34:46] PROBLEM - Host bits.esams.wikimedia.org_https is DOWN: CRITICAL - Network Unreachable (91.198.174.233) [15:34:46] PROBLEM - Host wikiversity-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.231) [15:34:46] PROBLEM - Host wikiquote-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.227) [15:34:51] whoops [15:34:54] PROBLEM - Host mediawiki-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.232) [15:34:54] PROBLEM - Host wikibooks-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.228) [15:34:55] PROBLEM - Host wikinews-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.230) [15:35:00] mark: happy friday ;-] [15:35:02] I guess that's you? 
[15:35:03] PROBLEM - Host wikisource-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.229) [15:35:21] PROBLEM - Host wikimedia-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [15:35:21] PROBLEM - Host wikipedia-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [15:35:22] PROBLEM - Host wiktionary-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.226) [15:36:06] PROBLEM - Host foundation-lb.esams.wikimedia.org_https is DOWN: CRITICAL - Network Unreachable (91.198.174.235) [15:36:23] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27717 [15:36:27] * Damianz waves bye to the network [15:36:33] PROBLEM - Host mediawiki-lb.esams.wikimedia.org_https is DOWN: CRITICAL - Network Unreachable (91.198.174.232) [15:38:19] mark: just deactivated the aggregate reject route until ospf is back up [15:38:42] disabled the interface [15:38:48] PROBLEM - BGP status on csw2-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.244, [15:39:06] PROBLEM - Host text.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [15:39:11] that's not it [15:39:15] PROBLEM - Host upload.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [15:39:24] PROBLEM - Host wikibooks-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [15:39:24] PROBLEM - Host wikimedia-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [15:39:25] PROBLEM - Host wikinews-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [15:39:33] PROBLEM - Host wikiquote-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [15:39:33] PROBLEM - Host wikipedia-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [15:39:42] PROBLEM - Host wikiversity-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [15:39:43] PROBLEM - Host wikisource-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [15:39:51] PROBLEM - Host wiktionary-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [15:40:20] i'm not sure what the issue is, i'm rolling back [15:41:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.854 seconds [15:41:27] it's not getting the ibgp routes for the load balancers [15:41:29] assuming that mark has this under control. let me know if there's anything you need [15:41:38] i see them on csw2-esams in bgp [15:41:48] uh oh [15:41:48] RECOVERY - Host foundation-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 16%, RTA = 132.36 ms [15:41:57] RECOVERY - Puppet freshness on analytics1001 is OK: puppet ran at Fri Oct 12 15:41:44 UTC 2012 [15:41:57] RECOVERY - Host bits.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.32 ms [15:41:58] RECOVERY - Host wikinews-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.47 ms [15:42:06] RECOVERY - Host wiktionary-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.16 ms [15:42:06] RECOVERY - Host mediawiki-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.33 ms [15:42:07] hmm right [15:42:09] i hate that setup [15:42:15] it's different from pmtpa/eqiad [15:42:38] i checked those addresses on cr2-knams, that had it [15:42:45] but cr1-esams not indeed [15:43:02] we need to have them peer with cr1-esams i'd say ? 
[15:43:20] well [15:43:25] while that is not yet the case it should be getting them anyhow [15:43:27] PROBLEM - BGP status on csw2-esams is CRITICAL: CRITICAL: host 91.198.174.244, sessions up: 1, down: 3, shutdown: 0BRPeering with AS64600 not established - BRPeering with AS43821 not established - WIKIMEDIA-EUBRPeering with AS64600 not established - BR [15:43:27] RECOVERY - Host wikibooks-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.15 ms [15:43:45] PROBLEM - Host csw1-esams is DOWN: CRITICAL - Network Unreachable (91.198.174.247) [15:44:05] so very many pages [15:44:08] so yes [15:44:14] it should do iBGP peering with csw2-esams [15:44:18] but only those routes :) [15:44:40] if anyone didnt set the proper timezones for their paging, we'll find out shortly ;] [15:44:42] could you perhaps set that up while I cable stuff here? [15:44:48] RECOVERY - Host wikisource-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.61 ms [15:44:48] RECOVERY - Host wikimedia-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.17 ms [15:44:49] RECOVERY - Host foundation-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.18 ms [15:44:49] RECOVERY - Host text.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.26 ms [15:44:54] then we'll redo later [15:44:57] RECOVERY - Host upload.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 108.17 ms [15:45:06] RECOVERY - Host wikibooks-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 108.15 ms [15:45:06] RECOVERY - Host wikimedia-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 108.17 ms [15:45:07] RECOVERY - Host wikinews-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 108.39 ms [15:45:13] ok [15:45:15] RECOVERY - Host csw1-esams is UP: PING OK - Packet loss = 0%, RTA = 263.94 ms [15:45:15] RECOVERY - Host wikipedia-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 108.22 ms [15:45:15] RECOVERY - Host wikiquote-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 108.19 ms [15:45:24] RECOVERY - Host wikiversity-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 108.20 ms [15:45:24] RECOVERY - Host wikisource-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 108.18 ms [15:45:33] RECOVERY - Host wikiversity-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.23 ms [15:45:33] RECOVERY - Host wiktionary-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 108.16 ms [15:45:50] so in pmtpa/eqiad it gets them over ospf [15:45:51] PROBLEM - Varnish HTTP bits on cp3002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:45:51] RECOVERY - Host upload.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.61 ms [15:45:54] but that has issues as well, if you remember [15:45:58] oh klaxon, so annoying [15:45:59] yeah [15:46:00] RECOVERY - Host wikiquote-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.44 ms [15:46:06] paravoid: so if you want to erase out gallium, we will have to do a full backup of it and restore the jenkins var directory. 
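A rough sketch of the pre-reimage backup being discussed for gallium. The Jenkins home path (/var/lib/jenkins is the stock Ubuntu default) and the backup destination host are assumptions, not the actual procedure used:

    sudo service jenkins stop   # quiesce Jenkins so job configs and build history copy consistently
    sudo rsync -aHAX /var/lib/jenkins/ backuphost:/srv/backup/gallium/jenkins/
    sudo rsync -aHAX /home/            backuphost:/srv/backup/gallium/home/
    # After the reinstall and puppet run, rsync the data back and start Jenkins again.
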
[15:46:09] RECOVERY - Host wikipedia-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.35 ms [15:46:14] it's about time we clean that up and make it consistent :/ [15:46:16] i hate doing ibgp with partial routes … [15:46:17] paravoid: it has the history of the builds [15:46:18] RECOVERY - Host bits.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 108.43 ms [15:46:20] :-/ [15:46:20] me too [15:46:33] New patchset: Ottomata; "Escaping $1 in analytics::web::proxy RewriteRule" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27721 [15:46:36] PROBLEM - LVS HTTP IPv4 on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:21] RECOVERY - Varnish HTTP bits on cp3002 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 4.934 seconds [15:47:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27721 [15:47:33] paravoid: possibly /home too . I am not sure how it is setup [15:48:02] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27721 [15:48:06] RECOVERY - Host mediawiki-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 108.64 ms [15:48:15] PROBLEM - LVS HTTP IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:49:06] mark: bits hasn't come back since your rollback [15:49:15] stupid bits [15:49:20] (just trying to summarize the dozens of pages) [15:49:25] can you move its traffic to pmtpa and then move it back? [15:49:30] okay [15:49:42] actually keep it there while i'm working here [15:49:46] one less thing to worry about it [15:49:46] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27195 [15:50:00] just bits [15:50:06] yeah I remember the varnish issue [15:50:17] we're on the normal scenario right? [15:50:22] yes [15:50:25] just edit bits-lb [15:50:35] I remember someone playing with another scenario recently [15:50:46] i did, it's all back to normal now [15:52:09] PROBLEM - Varnish HTTP bits on cp3002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:52:26] hey mark, want to do a quick show| compare on csw2-esams before i commit ? [15:52:45] !log storage 3 replacing disk 12 and rebuilding raid [15:52:54] ok [15:52:57] Logged the message, Master [15:53:59] !log moving bits from esams to pmtpa (authdns normal scenario) [15:54:11] Logged the message, Master [15:54:36] LeslieCarr: i'm not sure why that would solve it where the group iBGP on csw2-esams doesn't? [15:54:45] note that iBGP on csw2-esams is a bit misleading, it's specific to lvs prefixes [15:55:05] ah that is [15:55:12] i did not read through the policies [15:55:21] can we rename that group ibgp_LVSonly or something ? [15:55:24] yes [15:55:27] cool [15:55:30] and then set that up on both cr1-esams and cr2-knams [15:55:41] if we want to continue doing filtered iBGP :/ [15:55:45] PROBLEM - Varnish HTTP bits on cp3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:55:48] it's just that the ospf setup has issues too [15:57:55] let me know if you need anything else [15:59:58] want to do some more show | compares mark ? 
[16:00:14] yes [16:01:25] not you who did this I think, but policy-statement LVS_import on csw2-esams has a weird final from statement [16:01:54] yeah then protocol bgp wouldn't ever get evaluated [16:01:57] i'll remove from both sides [16:02:03] since then reject basically terminates [16:02:15] yes [16:02:46] removed from both [16:03:25] will you set it up on cr2-knams as well? [16:03:30] since it's iBGP it's not transitive [16:03:33] RECOVERY - Varnish HTTP bits on cp3001 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 2.796 seconds [16:04:45] PROBLEM - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:04:53] sure, i'll make it peertastic with csw2 [16:05:11] then I think this is right [16:05:30] RECOVERY - LVS HTTP IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 3883 bytes in 0.334 seconds [16:06:15] RECOVERY - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 3890 bytes in 0.467 seconds [16:07:28] cr2-knams now has the same pending :) [16:07:36] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [16:07:48] mark , i'll commit confirmed 5 on csw2-esams and cr1-esams -- okay ? [16:08:18] ok [16:08:30] PROBLEM - Varnish HTTP bits on cp3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:09:44] LeslieCarr: need to set this up on csw1-esams too actually [16:09:51] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [16:09:59] but that should be fine, it already has the setup [16:10:10] hrm no [16:10:15] it already has a full ibgp peering [16:13:09] PROBLEM - LVS HTTPS IPv4 on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:13:27] PROBLEM - LVS HTTP IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:13:56] the prefix-list of LVS-service-Is is different on csw2-esams vs other routers -- fixing [16:14:06] !log running installer on storage3 [16:14:10] and i'll add the ipv6 ip's as another term as well [16:14:17] Logged the message, Master [16:14:39] RECOVERY - LVS HTTPS IPv4 on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3892 bytes in 0.556 seconds [16:15:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:16:54] RECOVERY - BGP status on csw2-esams is OK: OK: host 91.198.174.244, sessions up: 5, down: 0, shutdown: 0 [16:16:57] let me know when you're ready to move the link again [16:18:06] RECOVERY - LVS HTTP IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 3895 bytes in 0.226 seconds [16:18:06] RECOVERY - LVS HTTP IPv4 on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3892 bytes in 4.356 seconds [16:18:07] okay, looks like all the switches have the routes now [16:18:28] mark: second time's the charm ? 
[16:18:37] let's hope so ;) [16:18:53] ok [16:18:58] i'll need to reconfigure the routers for the ip change and mtu [16:19:09] PROBLEM - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:19:21] can't I pull in a previous change set ;) [16:20:39] RECOVERY - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 3888 bytes in 0.451 seconds [16:20:57] RECOVERY - Varnish HTTP bits on cp3001 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.218 seconds [16:21:51] ok i am ready [16:22:09] RECOVERY - Varnish HTTP bits on cp3002 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.627 seconds [16:22:11] hold on, i realized i didn't have ipv6 policy statement in there - don't want to break v6 :) [16:22:47] ok [16:24:03] let's do it mark :) [16:24:27] so you're ready? [16:24:40] ready [16:26:03] PROBLEM - Host knsq20 is DOWN: PING CRITICAL - Packet loss = 100% [16:26:38] done [16:27:40] !log reedy synchronized wmf-config/CommonSettings.php 'Fix extdist mount related warnings' [16:27:42] PROBLEM - Host knsq26 is DOWN: PING CRITICAL - Packet loss = 100% [16:27:42] PROBLEM - Host knsq17 is DOWN: PING CRITICAL - Packet loss = 100% [16:27:49] argh [16:27:52] Logged the message, Master [16:28:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.741 seconds [16:28:09] PROBLEM - Host knsq23 is DOWN: PING CRITICAL - Packet loss = 100% [16:28:17] New patchset: Hashar; "fix sudo rights for wmf-beta-autoupdater" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27731 [16:28:26] ... [16:28:27] PROBLEM - LVS HTTPS IPv4 on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:31] rolling back [16:28:33] argh [16:28:36] PROBLEM - LVS HTTP IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [16:28:36] PROBLEM - LVS HTTP IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:36] PROBLEM - LVS HTTP IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:36] PROBLEM - LVS HTTP IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:36] PROBLEM - LVS HTTP IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:37] PROBLEM - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:37] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:38] PROBLEM - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:45] PROBLEM - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [16:28:45] PROBLEM - HTTPS on ssl3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:46] PROBLEM - Frontend Squid HTTP on knsq16 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:46] PROBLEM - HTTPS on ssl3003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:46] PROBLEM - Varnish HTTP bits on cp3002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:46] PROBLEM - SSH on cp3002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:46] PROBLEM - SSH on ssl3002 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds [16:28:47] PROBLEM - Backend Squid HTTP on knsq28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:54] PROBLEM - Frontend Squid HTTP on knsq28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:54] PROBLEM - LVS HTTP IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [16:28:54] PROBLEM - Backend Squid HTTP on knsq29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:54] PROBLEM - LVS HTTPS IPv4 on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:54] PROBLEM - LVS HTTP IPv4 on wikibooks-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:55] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:55] PROBLEM - LVS HTTPS IPv4 on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:56] PROBLEM - LVS HTTPS IPv4 on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27731 [16:29:21] PROBLEM - SSH on knsq24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:21] PROBLEM - SSH on knsq18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:21] PROBLEM - LVS HTTP IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:21] PROBLEM - LVS HTTP IPv4 on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:22] RECOVERY - Host knsq26 is UP: PING WARNING - Packet loss = 93%, RTA = 506.46 ms [16:29:22] RECOVERY - Host knsq23 is UP: PING WARNING - Packet loss = 93%, RTA = 494.80 ms [16:29:22] RECOVERY - Host knsq17 is UP: PING WARNING - Packet loss = 80%, RTA = 483.96 ms [16:29:30] PROBLEM - LVS HTTP IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [16:29:30] PROBLEM - LVS HTTP IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [16:29:30] PROBLEM - LVS HTTPS IPv4 on wikisource-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:30] PROBLEM - SSH on ssl3003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:30] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:39] PROBLEM - LVS HTTPS IPv4 on wikibooks-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [16:29:39] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: (Return code of 141 is out of bounds) [16:29:48] RECOVERY - LVS HTTPS IPv4 on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 637 bytes in 0.438 seconds [16:29:48] RECOVERY - Host knsq20 is UP: PING OK - Packet loss = 0%, RTA = 108.21 ms [16:29:57] RECOVERY - LVS HTTP IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 58510 bytes in 0.565 seconds [16:29:57] RECOVERY - LVS HTTP IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 58511 bytes in 0.566 seconds [16:29:57] RECOVERY - LVS HTTP IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 58511 bytes in 0.569 seconds [16:29:58] RECOVERY - LVS HTTP IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 58512 bytes in 0.572 seconds 
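The flood of LVS checks above can be approximated by hand: hit the service IP directly with curl, send a matching Host header, and give up after the same 10-second window the checks use. A sketch only — 91.198.174.233 is the bits.esams service address from the alerts, and the Host header is an assumption about what the check sends:

    # HTTP check against the bits LVS service IP
    curl -s --max-time 10 -o /dev/null -w '%{http_code}\n' \
         -H 'Host: bits.wikimedia.org' http://91.198.174.233/

    # Same idea for the HTTPS check (-k: we hit the bare IP, so skip certificate name checks)
    curl -sk --max-time 10 -o /dev/null -w '%{http_code}\n' \
         -H 'Host: bits.wikimedia.org' https://91.198.174.233/
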
[16:29:58] RECOVERY - LVS HTTP IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 58510 bytes in 0.570 seconds [16:29:58] RECOVERY - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 69246 bytes in 0.768 seconds [16:29:58] RECOVERY - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 58511 bytes in 0.833 seconds [16:29:59] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 79040 bytes in 0.914 seconds [16:30:00] fail [16:30:15] RECOVERY - Frontend Squid HTTP on knsq16 is OK: HTTP OK HTTP/1.0 200 OK - 652 bytes in 0.217 seconds [16:30:15] RECOVERY - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 637 bytes in 0.229 seconds [16:30:16] RECOVERY - SSH on cp3002 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:30:16] RECOVERY - SSH on ssl3002 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:30:16] RECOVERY - Varnish HTTP bits on cp3002 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.218 seconds [16:30:16] RECOVERY - Backend Squid HTTP on knsq28 is OK: HTTP OK HTTP/1.0 200 OK - 1424 bytes in 0.218 seconds [16:30:16] RECOVERY - HTTPS on ssl3003 is OK: OK - Certificate will expire on 07/19/2016 16:13. [16:30:17] RECOVERY - HTTPS on ssl3001 is OK: OK - Certificate will expire on 07/19/2016 16:14. [16:30:24] RECOVERY - Frontend Squid HTTP on knsq28 is OK: HTTP OK HTTP/1.0 200 OK - 1583 bytes in 0.217 seconds [16:30:24] RECOVERY - Backend Squid HTTP on knsq29 is OK: HTTP OK HTTP/1.0 200 OK - 1423 bytes in 0.217 seconds [16:30:24] RECOVERY - LVS HTTP IPv4 on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 43611 bytes in 0.435 seconds [16:30:24] RECOVERY - LVS HTTPS IPv4 on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3900 bytes in 0.440 seconds [16:30:24] RECOVERY - LVS HTTP IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 58507 bytes in 0.567 seconds [16:30:25] RECOVERY - LVS HTTPS IPv4 on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 53385 bytes in 0.664 seconds [16:30:25] RECOVERY - LVS HTTPS IPv4 on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 58512 bytes in 0.878 seconds [16:30:26] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39696 bytes in 0.960 seconds [16:30:42] RECOVERY - SSH on knsq18 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:30:42] RECOVERY - SSH on knsq24 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:30:51] RECOVERY - LVS HTTP IPv4 on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3894 bytes in 0.218 seconds [16:30:51] RECOVERY - LVS HTTP IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 58372 bytes in 0.566 seconds [16:30:52] RECOVERY - SSH on ssl3003 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:30:52] RECOVERY - LVS HTTP IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 58511 bytes in 0.565 seconds [16:30:52] RECOVERY - LVS HTTPS IPv4 on wikisource-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 43242 bytes in 0.660 seconds [16:31:00] RECOVERY - LVS HTTP IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 79040 bytes in 0.681 seconds [16:31:00] RECOVERY - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 79040 bytes in 
0.766 seconds [16:31:09] RECOVERY - LVS HTTPS IPv4 on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 43617 bytes in 0.661 seconds [16:31:09] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 62433 bytes in 0.947 seconds [16:34:36] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100% [16:37:00] paravoid: still around? I could use a review for some minor fix I have done to the beta cluster autoupdater [16:37:03] list at https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:beta,n,z [16:37:11] I am but a bit busy atm [16:37:15] split in 3 changes. Could probably squash them in just one [16:37:19] okkkk ;-] [16:40:18] RECOVERY - Host storage3 is UP: PING WARNING - Packet loss = 28%, RTA = 0.30 ms [16:44:01] New patchset: Hashar; "make sure mwdeploy can write to the beta updater log" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27736 [16:44:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27736 [16:55:36] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [17:02:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:03:24] PROBLEM - NTP on storage3 is CRITICAL: NTP CRITICAL: No response from NTP server [17:14:15] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27736 [17:14:52] gerrit-wm: .. [17:15:29] ah, it just did not mention a pure comment [17:16:59] ottomata: merging your mongodb class that was merged but not in sockpuppet yet [17:17:48] bwerrr? [17:17:53] someone merged it? [17:18:05] asher merged it in gerrit [17:18:09] yea [17:18:12] oh! [17:18:13] cool ok [17:18:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.156 seconds [17:18:22] ok thanks [17:18:30] good to know, i'll have another commit to use it on stat1 coming soon [17:18:33] that shoudl be fine [17:18:56] Change abandoned: Ottomata; "Using MongoDB module instead:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26799 [17:19:01] heh:) [17:29:56] !log pc2 shutting down to replaced hdd w/ ssd rt3703 [17:30:08] Logged the message, Master [17:36:47] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27682 [17:38:55] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27696 [17:45:04] New patchset: Kaldari; "Removing UK from LandingCheck exemptions, since they aren't doing their own payment processing this year." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27748 [17:45:12] New review: Dzahn; "per brief talk with Ryan, please move the file and package definitions out of the role class into a ..." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/25236 [17:46:52] !log pc3 shutting down to replace hdd's with ssd's rt3703 [17:47:04] Logged the message, Master [17:47:06] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27372 [17:47:45] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27371 [17:48:58] New patchset: Kaldari; "Updating secure URL for banner loading" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27749 [17:49:42] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27748 [17:50:49] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27749 [17:52:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:04] notpeter: pc2 is ready for you...the ibm tech is here for db42 issues right now so I will finish pc3 shortly if you want to install pc2 and see if there are any other issues [18:01:37] New review: Dzahn; "wee.. manpages for our misc scripts :)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/16606 [18:01:38] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16606 [18:02:18] cmjohnson1 - we have onsite ibm support? [18:02:48] cmthank you! [18:08:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [18:16:05] New patchset: Dzahn; "add full sudo rights for demon on gerrit hosts per RT-3698" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27753 [18:17:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27753 [18:17:31] New patchset: Dzahn; "add full sudo rights for demon on gerrit hosts per RT-3698" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27753 [18:18:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27753 [18:19:53] !log troubleshooting db42 w/IBM tech [18:20:05] Logged the message, Master [18:21:48] New patchset: preilly; "Don't exit on fflush() error" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/27662 [18:22:36] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [18:22:53] New patchset: Dereckson; "Fixing typo in dologmsg(1)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27755 [18:24:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27755 [18:25:43] New patchset: Dereckson; "Fixing whitespace issue" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27756 [18:26:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27756 [18:26:51] How do I get +2 on operations/debs/varnish [18:28:42] !log reedy synchronized php-1.21wmf1/extensions/ArticleFeedback/ [18:28:53] Logged the message, Master [18:31:18] #3708: Grant preilly +2 on operations/debs/varnish [18:31:25] https://rt.wikimedia.org/Ticket/Display.html?id=3708&results=6b15a430df3a1f7249f3b82a2c690383 [18:32:13] <^demon|busy> preilly: I answered in the other channel....no need for RT. Just ask on-wiki (like normal) and get someone from ops to +1 the request. 
[18:33:26] ^demon|busy: I did what CT told me to do [18:33:28] !log reedy synchronized wmf-config/InitialiseSettings.php 'AFT (non v5) doesn't use clicktracking now' [18:33:39] Logged the message, Master [18:34:05] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27755 [18:35:10] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27756 [18:37:42] cmjohnson1: any news on db42? [18:39:25] mutante: here is what I know...there is no h/w issues [18:40:06] oh? but it doesnt boot? [18:40:20] mark: ping [18:40:21] it doesn't recognize our external dvd rom so i am going to try and create a bootable usb [18:40:35] mutante: no it will not boot [18:41:33] ^demon|busy: you replied to the wrong RT [18:41:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:41:42] <^demon|busy> Blargghhh [18:41:52] New patchset: Kaldari; "Some changes for CentralNotice..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27760 [18:42:30] <^demon|busy> paravoid: I did? I meant to reply to the one re: +2 permissions. [18:57:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds [18:59:21] ^demon|busy: sorry, my bad [18:59:25] i misread that [18:59:40] <^demon|busy> :) [19:01:38] !log powering down search32 to replace cpu1 [19:01:49] Logged the message, Master [19:03:00] cmjohnson1: re. thanks for info. got poked by Dario [19:03:22] plz tell Dario, still working on it thx [19:05:38] just did, np [19:12:29] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27760 [19:20:39] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [19:22:26] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27753 [19:29:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:29:17] ^demon|busy: running puppet on the boxes..takes a while:) [19:29:31] <^demon|busy> okie dokie :) [19:30:25] New patchset: Ottomata; "Installing MongoDB on stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27762 [19:31:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27762 [19:31:58] I ran puppet on manganese already [19:32:09] oh shit. we're supposed to upgrade gerrit today, aren't we? [19:32:14] hahaha [19:32:16] forgot about that [19:32:25] ^demon|busy: when are we doing this? [19:32:28] 30 mins from now? [19:32:42] <^demon|busy> Yeah, but I'm free whenever you're ready to start. [19:32:50] I need to make the package, I guess [19:32:50] ah, good timing then to have sudo [19:33:16] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27762 [19:33:37] <^demon|busy> Ryan_Lane: The *.war is built already. http://noc.wikimedia.org/~demon/gerrit-2.4.2-1-ga076b99.war [19:33:46] yep [19:34:17] <^demon|busy> There's no schema or database changes, so it should just be bring down -> install new *.war -> bring back up [19:34:22] <^demon|busy> *schema or data [19:36:15] notpeter: pc2 and pc3 are finished for you [19:36:25] ah [19:36:25] good [19:36:36] upgrade to what? 2.5? [19:36:45] no [19:36:46] <^demon|busy> Custom 2.4.2 build [19:36:53] ah, a shame [19:37:03] 2.5 will be soon enough, hopefully [19:37:03] <^demon|busy> Yes. 
2.5 has a nasty regression we need fixed first :( [19:37:34] cmjohnson1: woo! thanks [19:37:40] although pc2 is being really frustrating [19:37:42] yw...anytime [19:37:55] but... same as the analytics ciscos... [19:37:57] *sigh* [19:37:58] anything i can do to help? [19:38:04] heh [19:38:10] <^demon|busy> paravoid: I wouldn't bother updating but hashar needs a patch (that made the 2.5 release) pulled in sooner for his Zuul work. It's trivial, so I just made our own build :) [19:38:17] maybe eventually... but not yet. thanks, though! [19:38:52] k [19:42:15] <^demon|busy> Ryan_Lane: I'd also like to dist-upgrade manganese & formey sometime. Maybe next week? [19:44:02] ^demon|busy: I'm at openstack conference next week [19:44:16] <^demon|busy> Ah, no rush. We'll figure sometime out. [19:44:48] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [19:44:48] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [19:45:12] * AaronSchulz wonders why the rados gateway works via bash and curl but not fully in MW [19:45:40] SwiftFileBackend::doStoreInternal: Invalid response (0): (curl error: 0) : Failed to obtain valid HTTP response. [19:46:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.043 seconds [19:47:30] ^demon|busy: ok. should I upgrade formey first? [19:47:36] to make sure it's going to work? [19:48:00] <^demon|busy> Yeah, let's do formey first just in case. [19:48:04] * Ryan_Lane nods [19:48:09] hah, when I google that error I get a bunch of CF stuff [19:48:19] <^demon|busy> I've already tested this locally & on labs, so I don't think it'll be a problem [19:48:42] oh wait, that part of the message is custom so that makes sense ;) [19:48:47] ugh. stupid package [19:48:52] it tries to remove the gerrit user [19:48:56] and group [19:49:27] <^demon|busy> Package needs fixing then ;-) [19:49:54] and wtf isn't my token lasting for 24 hours [19:51:02] <^demon|busy> Ryan_Lane: Maybe we could bribe paravoid into fixing our package one day :) [19:52:04] ^demon|busy: seems to be working [19:52:35] in other news: http://imgur.com/gallery/hSCpf [19:53:01] ugh. I hate that these systems are set up for ldap [19:53:22] ^demon|busy: I did my best for that :P [19:53:25] <^demon|busy> And `gerrit query` now has the new --submit-records option. [19:54:57] I'm removing the nsswitch ldap config on manganese before I start [19:55:30] <^demon|busy> Ok? I have nfc what that does anyway [19:56:34] motherfucker [19:56:59] I really don't get why it tries to delete the group and user [19:57:14] ah [19:57:20] I'm not checking to see if this is an upgrade [19:57:42] rebuilding the package [19:58:31] <^demon|busy> Too bad MW's OpenID provider is totally broken or we could've used that from day 1. [19:58:42] !log reedy synchronized wmf-config/ [19:58:54] Logged the message, Master [19:58:57] ^demon|busy: we wouldn't have been able to use ldap groups, then [19:59:08] <^demon|busy> True. [19:59:29] <^demon|busy> But the number of ldap groups isn't *huge* compared to internal groups, so we could've managed I think. [19:59:36] not true [19:59:38] cmjohnson1: would you be willing to take a look at pc2? when I set it to netboot it seems to freeze/stops displaying to console right about when it should start pxebooting... 
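The packaging problem Ryan_Lane hits above, where the deb tries to delete the gerrit user and group during what is actually an upgrade, comes down to a maintainer script that never looks at its first argument. dpkg passes the action (remove, purge, upgrade, and so on) as $1, so the usual fix is to guard the account removal on it. A minimal postrm sketch under that assumption, not the actual WMF gerrit packaging:

    #!/bin/sh
    # postrm sketch -- the account name comes from the discussion above,
    # everything else is illustrative.
    set -e

    case "$1" in
        purge|remove)
            # Drop the service account only on real removal; on upgrade
            # (and failed-upgrade etc.) $1 is not remove/purge, so the
            # user and group survive the new version being installed.
            if getent passwd gerrit >/dev/null; then
                deluser --system gerrit || true
            fi
            if getent group gerrit >/dev/null; then
                delgroup --system gerrit || true
            fi
            ;;
    esac

    #DEBHELPER#
    exit 0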
[19:59:44] I want to use the labs project groups [19:59:58] for branch access at minumum [20:00:00] <^demon|busy> Oh, well future plans don't count :) [20:00:07] I had immediate plans for that [20:00:09] <^demon|busy> Right now we've got like 5 ldap groups we use in gerrit [20:00:19] yeah [20:00:59] <^demon|busy> Bummer you won't be coming to the user summit. I'm giving a talk on how the project listing page sucks :p [20:01:30] notpeter: sure..give me a few mins [20:03:41] ^demon|busy: done [20:03:54] ^demon|busy: mind checking it? [20:04:04] <^demon|busy> Prod still says 2.4.2, hrm [20:04:33] <^demon|busy> Or not, CLI reports custom build. [20:04:36] <^demon|busy> UI is being silly [20:05:35] ^demon|busy: silly in which way? [20:05:46] it's fine for me [20:05:53] <^demon|busy> UI says 2.4.2 [20:06:06] notpeter: rebooting now via keyboard/monitor [20:06:09] <^demon|busy> Should say 2.4.2-1-ga076b99 [20:06:12] cmjohnson1: cool! [20:06:13] ah [20:06:22] oh! pc3 is behaving nicely [20:06:29] <^demon|busy> `ssh -p 29418 gerrit.wikimedia.org gerrit version` is reporting the correct version though. [20:06:32] <^demon|busy> And the new feature shows up. [20:06:33] [2012-10-12 20:03:24,210] INFO com.google.gerrit.pgm.Daemon : Gerrit Code Review 2.4.2-1-ga076b99 ready [20:06:49] <^demon|busy> Yeah. Something in the UI might be cached. [20:07:10] * Ryan_Lane nods [20:07:13] ok. lunch time. [20:07:22] <^demon|busy> Thanks Ryan ! [20:07:54] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [20:10:51] notpeter: what do you want as your #1 boot option? [20:11:04] uuuuhhhh [20:11:08] it needs to pxe once [20:11:13] so pxe, Isuppose [20:11:59] ok..it was set that way [20:12:23] yeah, that part wasn't an issue [20:12:43] it was that it became unresponsive at some point in the booting [20:13:04] can you try booting it up with kvm plugged in and see if it actually PXE's? [20:13:09] if it doesn't then it's a console issue [20:13:16] if not, then it's something else... [20:14:14] yeah...pxe is timing out [20:14:27] getting the media failure/check cable error [20:14:45] weird.... [20:14:52] it PXE'd yesterday [20:14:56] well, might be a plug in error [20:15:32] odd..yeah, my thought is to replace the cables..it seems the most logical and simplest check [20:15:32] mutante: we have man pages now!!!!!!!!!!!!! yeah !!! \O/ rejoice :-] [20:15:53] hashar: heh, it was merge-Friday:) [20:16:02] well, code review time i mean [20:16:51] mutante: seems you have been busy reviewing operations/puppet.git today ;-] [20:18:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:18:38] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27768 [20:19:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27768 [20:21:58] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [20:23:10] New patchset: Hashar; "typo in dologmsg(1)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27769 [20:23:15] hashar: having man pages is cool. yay! thx [20:24:20] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [20:24:20] Change abandoned: DamianZaremba; "BLEH" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27768 [20:24:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27769 [20:25:43] New patchset: Hashar; "fix sudo rights for wmf-beta-autoupdater" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27731 [20:26:54] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [20:28:06] New review: Hashar; "and there was another typo in the asciidoc title that prevented the man page generation. Fixed it w..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27755 [20:28:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27731 [20:28:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [20:32:23] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27731 [20:32:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds [20:32:53] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27769 [20:43:59] heya paravoid, you still working? [20:44:05] i gotta a really strange puppet naming conflict [20:44:39] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [20:44:50] https://gerrit.wikimedia.org/r/#/c/27762/1/manifests/misc/statistics.pp [20:44:55] this [20:44:55] class misc::statistics::mongodb [20:44:57] is giving me [20:45:04] Duplicate definition: Class[Misc::Statistics::Mongodb] is already defined; cannot redefine at /var/lib/git/operations/puppet/manifests/misc/statistics.pp:241 [20:45:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [20:47:09] New patchset: Ottomata; "Renaming classes to avoid weird class naming conflict" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27771 [20:48:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27771 [20:48:59] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27771 [20:49:18] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [20:49:24] ottomata: nak [20:49:30] oh too late for that [20:49:52] you need class { "::mongodb": ... } [20:49:53] yeah, puppet sucks for that [20:50:04] wah? [20:50:10] to get the top-namespaced mongodb [20:50:15] hm [20:50:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [20:51:59] hm, ok, thanks paravoid, good to know [20:55:33] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [20:56:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [20:57:14] New review: Hashar; "would move that to the zuul module under a wmf_configuration sub class." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/25236 [20:58:30] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [20:59:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [20:59:40] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:00:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [21:02:42] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:03:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [21:06:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:07:18] New patchset: Cmjohnson; "Fixing dhcpd entry for pc2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27777 [21:08:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27777 [21:10:12] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:11:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [21:12:57] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:14:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [21:14:29] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:15:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [21:16:20] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:17:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [21:18:46] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:19:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.614 seconds [21:19:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [21:21:11] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:22:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [21:23:04] ironic how that is the "first hash" [21:23:21] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:23:54] Damianz: any particular reason you're pushing a patchset every two minutes or so? 
:) [21:24:18] patchsets are always welcome of course, just seemed a bit peculiar [21:24:22] paravoid: Commiting stuff from my test instance sucks so it's easier to push from my lapton and pull to the dev box [21:24:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [21:43:52] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:44:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [21:53:44] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:53:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:54:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [21:55:25] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:56:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [22:09:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.052 seconds [22:12:22] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [22:13:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [22:15:41] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [22:15:56] RECOVERY - Host db42 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [22:16:51] New patchset: Reedy; "Disable ReaderFeedback" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27830 [22:16:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [22:28:24] New patchset: DamianZaremba; "Puppetizing the bots setup for labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [22:29:29] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27777 [22:29:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [22:31:42] New patchset: DamianZaremba; "Puppetizing the bots setup for labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [22:32:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [22:33:12] New patchset: DamianZaremba; "Puppetizing the bots setup for labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [22:34:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [22:36:36] !log muted nagios [22:36:48] Logged the message, Mistress of the network gear. [22:36:53] !log switching europe traffic to pmtpa via authdns-scenario [22:37:04] Logged the message, Mistress of the network gear. [22:48:33] LeslieCarr: do you have a second to check something for me? 
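Circling back to the duplicate-definition error ottomata ran into further up: inside class misc::statistics::mongodb, an unqualified reference to mongodb resolves relative to the enclosing namespace and lands on the wrapper class itself, which is exactly the "Class[Misc::Statistics::Mongodb] is already defined" failure, and paravoid's suggestion of class { "::mongodb": } forces the top-scope class instead. A minimal sketch, with both class bodies invented for illustration rather than copied from the real manifests:

    # Top-scope class, e.g. provided by a mongodb module.
    class mongodb($dbpath = '/var/lib/mongodb') {
        package { 'mongodb-server':
            ensure => installed,
        }
    }

    class misc::statistics::mongodb {
        # class { 'mongodb': } here resolves to misc::statistics::mongodb
        # itself under Puppet's relative name lookup, triggering the
        # duplicate definition error quoted above.  The leading :: pins
        # the reference to the top-scope class instead:
        class { '::mongodb':
            dbpath => '/srv/mongodb',    # illustrative parameter
        }
    }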
[22:49:38] notpeter: sure [22:49:45] oh, second's up [22:49:53] it's the mc10* hosts... [22:49:58] :( [22:50:06] not my fault you didn't ask for more [22:50:10] do you hav ea schedule of when the seconds are available? [22:50:20] is there some kinda leslie's time installment plan that I can use? [22:50:35] Hrm, I need an AA [22:51:10] can we get those? [22:51:15] I probably can't... [22:51:20] you should give it a go! [22:52:02] :) [22:52:13] hrm, i guess i can give 240 more seconds [22:52:14] starting.... [22:52:15] NOW [22:52:19] so [22:52:35] mc1009 and mc1010 are both following media tests [22:52:41] they're happy from a network perspective, yes? [22:52:51] this is most likely cabling or network car in the host, yes? [22:53:47] so 1009 and 1010 aren't showing as having sfp's [22:54:04] need new cables i think [22:54:12] ok, cool [22:54:16] stupid cabling [22:54:36] WHY IT SO HARD!?!?!? [22:55:01] sweet. thank oyu [22:55:12] *thank you [22:56:09] yw [22:56:20] beccause vendors like to lock cables for no reason so they can charge 200x markups [22:56:27] it's like printer ink [22:57:00] these things make me so sad, leslie [22:57:18] "oh, this cable? $500. and you're gonna pay it." [22:57:21] *sigh* [22:57:38] :( [22:58:09] this is not what progress in the world looks like [23:00:21] progress is when they all cost the same amount because the same 10 year olds make them! [23:00:24] New patchset: Asher; "disabling session saves to redis in order to reconfigure test instance" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27833 [23:00:51] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27833 [23:01:18] LeslieCarr: :( [23:01:44] !log asher synchronized wmf-config/CommonSettings.php 'disabling session multiwrites to redis' [23:01:47] sorr,y, i'm awful [23:01:55] Logged the message, Master [23:01:58] so, in 5 minutes i'll reboot esams switch stack [23:03:57] * Damianz runs away screaming [23:04:22] most of the traffic is gone damianz [23:04:32] hopefully :) [23:15:57] well, mostly gone :) [23:16:10] !log rebooting esams core switch stack, expect reachability problems to esams [23:16:22] Logged the message, Mistress of the network gear. [23:18:55] New patchset: Asher; "re-enable session multiwrite to redis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27842 [23:19:35] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27842 [23:20:19] !log asher synchronized wmf-config/CommonSettings.php 'enabling session multiwrites to redis' [23:20:30] Logged the message, Master [23:21:27] * AaronSchulz hears binasher sneaking around [23:21:55] these are not the droids you're looking for [23:22:18] binasher: so after some cloudfiles fixes, ceph passes all MW tests and seems to work with manual testing btw [23:23:54] * binasher strokes beard [23:29:32] binasher: did you enable the redis debug log? [23:30:08] * AaronSchulz doesn't see it in fluorine [23:31:27] i just enabled logging locally on the redis server, but not via syslog or anything in mediawiki [23:31:46] might be nice to set wgDebugLogGroups for 'redis' [23:39:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:39:59] !log reenabled nagios alerts [23:40:11] Logged the message, Mistress of the network gear. [23:40:15] !log moving traffic back to esams via authdns-scenario [23:40:26] Logged the message, Mistress of the network gear. 
[23:41:20] AaronSchulz: will just 'redis' => 'udp://...' actually work? is RedisBagO.. mapped to 'redis'? [23:41:38] yes [23:42:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.472 seconds [23:45:28] New patchset: Asher; "redis debuglog group" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27845 [23:45:36] AaronSchulz: ^^ that right? [23:46:17] yep, you just have to create the empty file (with the same perms as the others) [23:47:15] Couldn't resolve host 'ms-fe.pmtpa.wmnet': Failed to obtain valid HTTP response [23:47:20] keeps coming up in swift-backend.log [23:47:32] yeah, ariel and peter looked at that and then gave up ;) [23:47:40] it only comes from the precise job runners [23:48:01] mostly just on zhwiki [23:48:08] yep, and nlwiki or something [23:48:18] but i guess zhwiki is still 90% of all jobs [23:48:34] hopefully i'll get an iphone 5 out of it [23:48:45] guessing is no fun, graph it! [23:49:54] :) [23:50:31] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27845 [23:51:11] !log asher synchronized wmf-config/CommonSettings.php 'enabling redis debug log' [23:51:23] Logged the message, Master
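For reference, the debug-log change binasher and AaronSchulz settle on above is the standard $wgDebugLogGroups mechanism: map the 'redis' group that MediaWiki's redis code writes to via wfDebugLog onto a udp2log destination, and pre-create the output file on the log host. A minimal sketch with an invented host, port and path rather than the production values:

    <?php
    // wmf-config sketch: route the 'redis' debug log group to udp2log.
    // Host, port and filename are placeholders, not the real values.
    $wgDebugLogGroups['redis'] = 'udp://logging.example.wmnet:8420/redis';

    // Callers then only need wfDebugLog( 'redis', $message ) and the line
    // is shipped to the udp2log collector.  On that host the output file
    // (e.g. /a/mw-log/redis.log) must already exist with the same owner
    // and permissions as the other log files, per the note above about
    // creating the empty file first.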