[00:01:38] binasher: no i left [00:01:50] what's up? did you see my msg about the main board? [00:02:11] New patchset: Reedy; "Move noc from /h/w/htdocs to /h/w/c/docroot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23425 [00:02:12] mutante: ^ Can you approve that to finish tidying up the noc docroot move please? :D [00:03:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/23425 [00:06:02] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [00:08:08] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100% [00:08:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds [00:09:29] !log storage3 being reimaged [00:09:41] Logged the message, Master [00:10:49] Warning: opendir(/mnt/originals/private/ExtensionDistributor/mw-snapshot/trunk/extensions) [function.opendir]: failed to open dir: No such file or directory in /usr/local/apache/common-local/php-1.21wmf1/extensions/ExtensionDistributor/ExtensionDistributor_body.php on line 80 [00:11:57] mutante: ALSO! did you move the noc folder over the new location? And if so, can we delete /h/w/htdocs/noc and noc.old? [00:14:07] Oct 11 23:42:00 10.0.8.33 apache2[1735]: PHP Warning: opendir(/mnt/originals/private/ExtensionDistributor/mw-snapshot/trunk/extensions) [function.opendir]: failed to open dir: No such file or directory in /usr/local/apache/common-local/php-1.21wmf1/extensions/ExtensionDistributor/ExtensionDistributor_body.php on line 80 [00:14:26] fenari has /mnt/originals, apaches still have /mnt/upload7 [00:29:03] !log reedy synchronized php-1.21wmf1/extensions/Collection 'Update collection to master' [00:29:14] Logged the message, Master [00:31:34] !log reedy synchronized php-1.21wmf1/extensions/Collection 'Revert due to error' [00:31:45] Logged the message, Master [00:37:28] !log reedy synchronized php-1.21wmf1/extensions/Collection 'And back to master' [00:37:40] Logged the message, Master [00:41:06] New patchset: Ryan Lane; "Check users against a trust list" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/27657 [00:41:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:42:19] !log switched payments back to pmtpa and increased TTL to 5 min [00:42:30] Logged the message, Master [00:43:55] New patchset: Ryan Lane; "Check users against a trust list" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/27657 [00:44:08] X-Storage-Url: http://localhost:8080:8080/swift/v1 [00:44:10] X-Storage-Token: AUTH_rgwtk0f0000006d6564696177696b693a7377696674de89c43e1b0a27b4756a77501802d330000ae4d0603c493a0472bc57f54665131a587560 [00:44:11] X-Auth-Token: AUTH_rgwtk0f0000006d6564696177696b693a7377696674de89c43e1b0a27b4756a77501802d330000ae4d0603c493a0472bc57f54665131a587560 [00:44:24] binasher: hmm, I wonder why the gateway is adding 8080 twice :/ [00:48:30] Reedy: Krinkle|detached : eh, i'm back. yes, we did not use the index.html from puppet repo. Erik made a change in mediawiki-config repo. (does noc. really belong in mediawiki though?) [00:48:59] Maybe, maybe not [00:49:01] i merged the change in mediawiki-config, then saw that the Apache config on fenari still pointed to a different directory [00:49:03] Better than it was before ;) [00:49:13] so it did not make the actual noc. 
change [00:49:37] to make Erik's change go live i made that change to the doc root [00:50:01] i saw one other related change sitting in gerrit, but it was not merged yet [00:50:30] note that https://noc.wikimedia.org/ still has the old content [00:52:09] Reedy: i also used sync-common on the index.html , but that makes no sense since it is just hosted on fenari itself and not the cluster .. ?! [00:52:22] did we also want to change that? [00:54:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.023 seconds [00:55:00] Hm? [00:55:34] the files are in the ./common directory now, right. so usually you would use sync-common after making changes there for cluster changes [00:55:44] but noc.wm is just hosted on fenari itself and not the cluster [00:55:47] yup [00:56:00] did you push the config change and restart apache too? [00:56:11] did we want to change the fact that it is hosted on fenari? [00:56:27] nope [00:56:30] that shouldn't change.. [00:56:30] i pushed the change, but did not restart apaches, just changes to an index.html [00:56:38] but that made no difference [00:56:52] well, except that it also syncs to fenari [00:57:09] yeah, currently it's pointing at /h/w/htdocs/noc... so the changes need to be made in there... [00:57:19] https is different from http now [00:57:56] Reedy: but that is what i changed earlier [00:58:03] https://gerrit.wikimedia.org/r/#/c/23425/6/files/apache/sites/noc.wikimedia.org,unified [00:58:09] You didn't update both versions :p [00:58:10] https://gerrit.wikimedia.org/r/#/c/27646/2/files/apache/sites/noc.wikimedia.org [00:58:28] you only updated it for http/port 80 [00:58:39] gotcha..yup [00:58:49] https://gerrit.wikimedia.org/r/#/c/23425/ [00:58:52] ^ that does it all ;) [00:59:47] cool [00:59:55] i guess it needs rebasing now? [01:00:04] i rebased it a few minutes ago [01:01:18] ok, looks good to me. want me to merge? [01:01:32] yeaah [01:01:36] please [01:01:39] then we can tidy it up [01:01:49] what exactly do you mean :Note, this also requires the files copying first! [01:01:58] oh [01:01:58] i just did a git pull in the directory [01:02:04] that was from before [01:02:06] after merge in mediawiki repo [01:02:12] ok [01:02:22] need to copy in any last changes from /h/w/htdocs/noc [01:02:26] but then we should delete it ;) [01:02:38] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/23425 [01:05:08] merged on sockpuppet [01:06:20] yay [01:06:23] can you do the file sync? [01:06:28] running puppet on fenari [01:06:31] thanks [01:06:34] Yeah, I will do [01:06:39] np.yw [01:07:54] done on fenari. Filebucketed /etc/apache2/sites-available/noc.wikimedia.org [01:08:10] !log restarting Apache on fenari for noc.wm change [01:08:21] Logged the message, Master [01:08:41] Eloquence: https:// also looks good to me now [01:08:51] looks like both are up to date [01:08:53] sweet. [01:09:18] thanks guys. [01:09:21] thanks Reedy, did not see you had this change right in there waiting for merge [01:09:32] heh :) [01:09:55] rm -rf /home/wikipedia/htdocs/noc.old [01:10:01] rm -rf /home/wikipedia/htdocs/noc [01:10:08] ^ Mind tidying up the old files too? [01:11:45] if you say they are synced. sure [01:12:01] yeah, copied one over the other and no changes [01:12:06] heh, so old. 
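A dry-run rsync — the "simulated option" mentioned a little further down — is the quick way to confirm the old and new noc docroots really match before deleting anything. A minimal sketch, not the exact commands used; the new path under /home/wikipedia/common/docroot is the move destination discussed above:

    # Dry run only (-n): list what would change, copy nothing; -c compares by checksum.
    rsync -avnc /home/wikipedia/htdocs/noc/ /home/wikipedia/common/docroot/noc/

    # An empty file list means the trees already match, so the old copies can go:
    # rm -rf /home/wikipedia/htdocs/noc.old
    # rm -rf /home/wikipedia/htdocs/noc
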
should be old.old now?:) [01:12:24] alright, lets nuke it then [01:13:20] Most of the stuff in /h/w/htdocs is rubbish [01:13:39] I do like the full copy of bugzilla in /h/w/htdocs/bugzilla though [01:15:20] !log deleting old document root dirs on fenari. /home/wikipedia/htdocs/noc and noc.old (use /h/w/common/docroot/noc and mediawiki-config repo) [01:15:34] Logged the message, Master [01:15:36] done. i also did a quick rsync with simulated option [01:16:01] Bugzilla. heh..:p [01:19:36] lets move it to /root/backup first..you never know, people just asked for patches we applied in the past [01:19:42] just 28M [01:21:15] !log moving ancient copy of bugzilla out of /h/w/htdocs (to /root/backup on fenari temp. just in case) [01:21:27] Logged the message, Master [01:21:54] Reedy: looked at the timestamps?:) [01:22:03] Yup [01:22:19] there's quite a lot of random stuff around on there [01:22:27] that is almost for something like ./museum/ [01:28:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:41:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.227 seconds [01:42:20] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 308 seconds [01:42:28] New patchset: Kaldari; "Updating $wgNoticeHideBannersExpiration for new fundraiser" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27660 [01:43:59] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27660 [01:45:29] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 8 seconds [02:01:16] !log LocalisationUpdate failed: git pull of extensions failed [02:01:27] Logged the message, Master [02:15:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:31:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [02:41:17] RECOVERY - Puppet freshness on srv190 is OK: puppet ran at Fri Oct 12 02:40:49 UTC 2012 [02:42:47] RECOVERY - Puppet freshness on mw70 is OK: puppet ran at Fri Oct 12 02:42:46 UTC 2012 [02:45:47] RECOVERY - Puppet freshness on srv249 is OK: puppet ran at Fri Oct 12 02:45:40 UTC 2012 [02:50:53] RECOVERY - Puppet freshness on mw38 is OK: puppet ran at Fri Oct 12 02:50:45 UTC 2012 [02:53:17] RECOVERY - Puppet freshness on db55 is OK: puppet ran at Fri Oct 12 02:53:09 UTC 2012 [02:56:53] RECOVERY - Puppet freshness on srv263 is OK: puppet ran at Fri Oct 12 02:56:24 UTC 2012 [02:57:20] RECOVERY - Puppet freshness on srv291 is OK: puppet ran at Fri Oct 12 02:57:16 UTC 2012 [03:00:20] RECOVERY - Puppet freshness on mw44 is OK: puppet ran at Fri Oct 12 02:59:57 UTC 2012 [03:02:08] RECOVERY - Puppet freshness on db1038 is OK: puppet ran at Fri Oct 12 03:01:56 UTC 2012 [03:03:38] RECOVERY - Puppet freshness on db1043 is OK: puppet ran at Fri Oct 12 03:03:22 UTC 2012 [03:04:23] RECOVERY - Puppet freshness on mw74 is OK: puppet ran at Fri Oct 12 03:04:16 UTC 2012 [04:05:17] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [04:05:17] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [04:40:14] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [04:57:45] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [06:06:58] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not 
run in the last 10 hours [06:25:48] New patchset: Tim Starling; "Don't exit on fflush() error" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/27662 [06:40:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:50:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.611 seconds [06:55:00] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [07:24:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:39:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.043 seconds [08:12:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:21:24] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [08:25:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.282 seconds [08:50:00] hello [08:51:52] * Damianz waves at hashy [08:59:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:14:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.057 seconds [09:43:45] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [09:43:45] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [09:48:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:03:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.039 seconds [10:49:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.354 seconds [11:08:47] !log Disabled csw2-oe12-esams:vcp-1 [11:09:00] Logged the message, Master [11:24:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:38:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.247 seconds [12:12:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:16:04] !log Added new EX4200-48T asw-oe11-esams to csw2-esams stack [12:16:16] Logged the message, Master [12:26:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.727 seconds [12:27:55] !log Upgraded new stack member asw-oe11-esams to junos 11.4R5.5 and rebooted it [12:28:06] Logged the message, Master [13:01:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:17:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.039 seconds [13:24:27] !log storage3 powering down [13:24:38] Logged the message, Master [13:45:05] New patchset: Reedy; "Remove duplicate servername line" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27682 [13:46:07] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27682 [13:48:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:49:52] !log Moved AS38930 transit from csw1-esams:e8/4 to cr1-esams:xe-1/3/0 [13:50:04] Logged the message, Master [13:59:42] !log Activated aggregate generation on cr1-esams [13:59:54] Logged the message, Master [14:02:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.971 seconds [14:06:42] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [14:06:42] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [14:17:24] !log Upgrading JunOS on cr1-esams to 11.4R5.5, preparing for reboot [14:17:35] Logged the message, Master [14:19:48] mark: so, maxmind doesn't provide geoip city with ipv6 as far as I can see [14:20:01] I'm not sure what can we do regarding that RT then other than just fall back on geoiplookup.wm.org [14:21:10] btu they provide geoip country ipv6 right now, right ? [14:21:14] me neither [14:22:03] they do but what's the point? [14:23:40] New patchset: Hashar; "(bug 40686) zuul role for production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [14:24:39] New patchset: Hashar; "import zuul module from OpenStack" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25235 [14:25:36] New patchset: Hashar; "zuul role for labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/25236 [14:26:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27611 [14:26:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25235 [14:26:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/25236 [14:26:52] hello there. Got a small issue installing php5 packages on labs. I get weird dependency issue: [14:26:52] php5-mysql : Depends: php5-common (= 5.3.10-1ubuntu3.4) but 5.3.10-1ubuntu3.4+wmf1 is to be installed [14:27:13] php5-mysql seems to be pinned: Package pin: 5.3.10-1ubuntu3.4 [14:27:22] which instance is that? [14:27:37] integration-jenkins [14:27:49] cache-policy is at http://dpaste.org/KgG8s/ [14:28:05] maybe it is missing some apt pinning [14:29:06] what are you trying to install? [14:29:16] LeslieCarr: check RT #3702 [14:29:38] yeah, saw that, thanks for the super detail [14:29:57] paravoid: I am applying the misc::contint::test::packages class wich install php packages among other. Seems the class pin package to Ubuntu version. [14:30:25] but does not pin php5-common ;-] [14:31:12] trying [14:33:37] * mark is cabling two ceph servers ;) [14:33:44] oh! [14:35:16] hashar: what a mess... 
still trying to figure it out [14:35:53] paravoid: the class is not pinning php5-common [14:36:11] plus the actual pinning is not a requirement to the package installation :-D [14:36:34] that's not it, no [14:37:00] !log Rebooting cr1-esams [14:37:11] Logged the message, Master [14:37:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:37:53] paravoid: from pmy puppet run, it failed installing package, then eventually did a a pinning of php5-common [14:38:03] then managed to install some packages [14:39:38] * hashar nature's call [14:41:39] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [14:46:11] New patchset: Hashar; "pin php5-common in misc::contint::test::packages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27692 [14:46:28] paravoid: https://gerrit.wikimedia.org/r/27692 pins php5-common . That seems to fix the issue [14:47:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27692 [14:50:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.496 seconds [14:52:44] and had to fix the ant1.8 installation class ;-) [14:52:51] New patchset: Hashar; "Precise has ant1.8 by default" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27696 [14:53:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27696 [14:54:35] New patchset: Asher; "esams bits event log routing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27697 [14:55:39] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/27697 [14:56:29] hayaaaaaa mark [14:56:44] gonna start bringing in some analytics puppet stuff soon [14:57:05] just been chatting with paravoid a bit in a PM, thought we should ask you somethign to see if you had an opinion [14:57:21] I don't htink I have a need yet for an analytics module [14:57:30] so I was going to create manifests/analytics.pp now [14:57:46] and if/when things get larger and need more parameterization, move to a module [14:57:58] s'ok? or would you rather me just start with a module right now? [14:58:25] as probably the largest proponents of modules, I think that's okay for now [14:58:36] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [14:58:41] but I don't think it's too difficult to go either way [14:58:59] i.e. a module right now won't add complexity [14:59:08] yeah, if I do a module I just have to think about more files, no? or do you just put everythign in init.pp? [14:59:33] you have to split every class in a separate file, yes [15:01:04] hm, at the moment it doesn't make sense for me to even have a class in init.pp then... [15:01:12] i don't have a single class that can be installed on every analytics node [15:02:37] that's okay [15:02:42] you don't need an init.pp [15:02:58] oh, didn't know that [15:03:23] the only class i'm going with right now (until we settle that email I sent about git submodules) is analytics::http_proxy [15:03:25] so I could do [15:03:35] modules/analytics/manifests/http_proxy.pp [15:03:35] ? [15:03:41] yes [15:03:42] class analytics::http_proxy [15:03:43] hmmm [15:03:45] ok cool! [15:03:48] (not a big fan of underscores though :P) [15:03:53] no? 
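For readers following the php5 dependency tangle above: apt-cache policy shows which pin wins for each package, and a preferences stanza of the kind the php5-common pin adds looks roughly like the sketch below. The file name and priority are illustrative assumptions, not the actual puppet-managed configuration:

    # Which candidate version (and which pin) applies to each package?
    apt-cache policy php5-common php5-mysql

    # Roughly what pinning php5-common to the stock Ubuntu version looks like
    # (illustrative file name and priority, not the puppet-generated file):
    sudo tee /etc/apt/preferences.d/php5-common <<'EOF'
    Package: php5-common
    Pin: version 5.3.10-1ubuntu3.4
    Pin-Priority: 1001
    EOF
    apt-cache policy php5-common   # the pinned version should now be the candidate
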
[15:04:01] (bikeshedding) [15:04:03] heheh [15:04:20] um ok! actually [15:04:22] i don't mind [15:04:25] analytcs::web::proxy [15:04:35] there might be aother analytics::web stuff later [15:04:40] New patchset: Asher; "esams bits event log routing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27697 [15:04:43] yeah, having a multiple hierarchy there might make sense [15:04:53] woudl that be analytics/web/proxy.pp [15:04:54] ? [15:04:54] but as always, be careful to not introduce too many abstractions early on [15:05:00] yeah, right [15:05:10] that's kinda why I was leaning towards just the analytics.pp file rather than the module :p [15:05:19] at first. [15:05:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27697 [15:06:14] hashar: okay, I have a solution for your problem [15:06:21] drop all the pinnings [15:07:11] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27697 [15:07:20] paravoid: ohh we could do that indeed [15:07:46] works much better, try it [15:07:54] Change abandoned: Hashar; "will just stop pinning them" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27692 [15:08:54] (it's already removed on your box, everything works) [15:10:15] I am not sure what will be the impact on gallium [15:10:15] New patchset: Hashar; "misc::contint::test::packages stop pinning php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27707 [15:10:24] here is the change ^^^^^ [15:10:33] might want to merge it monday though [15:10:40] since that will change the PHP version on gallium which run the Jenkins jobs [15:11:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27707 [15:13:11] New review: Faidon; "Is $CI_PHP_packages still used? If not, remove it?" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/27707 [15:14:05] New review: Hashar; "they are used just above to ensure the packages are present." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/27707 [15:14:36] gah talk about layering violations [15:14:43] a random misc class installing php packages [15:21:31] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27707 [15:21:54] hashar: done & merged [15:22:02] thanks ;) [15:22:27] paravoid: also what are we going to do for gallium ? Should we simply upgrade it using ubuntu script ? [15:22:31] or do we setup a new server [15:22:50] is everything puppetized? [15:23:04] not really [15:23:15] what isn't? [15:23:30] well at first I am not sure everything works fine on Precise [15:23:41] such as installing ant 1.8 but that was a trivial change anyway [15:24:16] the jobs are in a specific repository, could just reinstall that manually [15:24:41] and then there is the Android SDK installed in my home dir. I have never took the time to puppetize it [15:25:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:25:58] bad hashar [15:26:24] have set that up during xmas ;-D [15:26:28] was a puppet noob [15:26:40] can we try if everything works on a precise labs instance? [15:26:48] and puppetize the remaining bits? 
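A minimal sketch of the module layout being settled on above: an analytics module with no init.pp and a single class, analytics::web::proxy, which Puppet's autoloader finds at modules/analytics/manifests/web/proxy.pp. The class body here is a placeholder, not the actual proxy configuration:

    cd operations/puppet
    mkdir -p modules/analytics/manifests/web
    cat > modules/analytics/manifests/web/proxy.pp <<'EOF'
    # == Class: analytics::web::proxy
    # Placeholder body; the real Apache/proxy resources would live here.
    class analytics::web::proxy {
    }
    EOF
    # A node (or role class) then just does:  include analytics::web::proxy
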
[15:26:49] anyway that use some autoupdate from android + maven which install a bunch of third party libraries [15:27:16] and by we I mean mostly you :P I can help but I know zilch about jenkins [15:27:37] i installed jenkins on labs using the puppet class :-] [15:27:43] on precise? [15:27:46] what I should do is migrate the contint to a module [15:28:00] yeah on precise [15:28:03] integration-jenkins [15:28:27] the jenkins configuration is in a separate repo ( integration/jenkins.git ) but that is not deployed by puppet ;-D [15:28:30] i handle that manually [15:28:42] so that should be mostly fine [15:28:51] (I mean, migrating to a new server would be mostly fine) [15:31:20] I'd like to use labs to test that everything will work as it, then possibly rebuild gallium in-place [15:31:47] paravoid: what do you mean by rebuild in place ? [15:32:08] rebuild gallium with precise [15:32:10] on the same box [15:32:21] by running the ubuntu upgraded ? [15:32:25] err upgrader ? [15:32:33] !log Disabled csw1-esams:e8/2 [15:32:45] Logged the message, Master [15:32:48] no, pxe boot [15:32:52] format the box [15:33:44] New patchset: Ottomata; "Adding analytics.pp and setting up internal proxy for analytics cluster." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27717 [15:34:27] PROBLEM - Host foundation-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.235) [15:34:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27717 [15:34:45] PROBLEM - Host bits.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.233) [15:34:45] PROBLEM - Host upload.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.234) [15:34:46] PROBLEM - Host bits.esams.wikimedia.org_https is DOWN: CRITICAL - Network Unreachable (91.198.174.233) [15:34:46] PROBLEM - Host wikiversity-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.231) [15:34:46] PROBLEM - Host wikiquote-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.227) [15:34:51] whoops [15:34:54] PROBLEM - Host mediawiki-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.232) [15:34:54] PROBLEM - Host wikibooks-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.228) [15:34:55] PROBLEM - Host wikinews-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.230) [15:35:00] mark: happy friday ;-] [15:35:02] I guess that's you? 
[15:35:03] PROBLEM - Host wikisource-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.229) [15:35:21] PROBLEM - Host wikimedia-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [15:35:21] PROBLEM - Host wikipedia-lb.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [15:35:22] PROBLEM - Host wiktionary-lb.esams.wikimedia.org is DOWN: CRITICAL - Network Unreachable (91.198.174.226) [15:36:06] PROBLEM - Host foundation-lb.esams.wikimedia.org_https is DOWN: CRITICAL - Network Unreachable (91.198.174.235) [15:36:23] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27717 [15:36:27] * Damianz waves bye to the network [15:36:33] PROBLEM - Host mediawiki-lb.esams.wikimedia.org_https is DOWN: CRITICAL - Network Unreachable (91.198.174.232) [15:38:19] mark: just deactivated the aggregate reject route until ospf is back up [15:38:42] disabled the interface [15:38:48] PROBLEM - BGP status on csw2-esams is CRITICAL: CRITICAL: No response from remote host 91.198.174.244, [15:39:06] PROBLEM - Host text.esams.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [15:39:11] that's not it [15:39:15] PROBLEM - Host upload.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [15:39:24] PROBLEM - Host wikibooks-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [15:39:24] PROBLEM - Host wikimedia-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [15:39:25] PROBLEM - Host wikinews-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [15:39:33] PROBLEM - Host wikiquote-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [15:39:33] PROBLEM - Host wikipedia-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [15:39:42] PROBLEM - Host wikiversity-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [15:39:43] PROBLEM - Host wikisource-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [15:39:51] PROBLEM - Host wiktionary-lb.esams.wikimedia.org_https is DOWN: PING CRITICAL - Packet loss = 100% [15:40:20] i'm not sure what the issue is, i'm rolling back [15:41:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.854 seconds [15:41:27] it's not getting the ibgp routes for the load balancers [15:41:29] assuming that mark has this under control. let me know if there's anything you need [15:41:38] i see them on csw2-esams in bgp [15:41:48] uh oh [15:41:48] RECOVERY - Host foundation-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 16%, RTA = 132.36 ms [15:41:57] RECOVERY - Puppet freshness on analytics1001 is OK: puppet ran at Fri Oct 12 15:41:44 UTC 2012 [15:41:57] RECOVERY - Host bits.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.32 ms [15:41:58] RECOVERY - Host wikinews-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.47 ms [15:42:06] RECOVERY - Host wiktionary-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.16 ms [15:42:06] RECOVERY - Host mediawiki-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.33 ms [15:42:07] hmm right [15:42:09] i hate that setup [15:42:15] it's different from pmtpa/eqiad [15:42:38] i checked those addresses on cr2-knams, that had it [15:42:45] but cr1-esams not indeed [15:43:02] we need to have them peer with cr1-esams i'd say ? 
[15:43:20] well [15:43:25] while that is not yet the case it should be getting them anyhow [15:43:27] PROBLEM - BGP status on csw2-esams is CRITICAL: CRITICAL: host 91.198.174.244, sessions up: 1, down: 3, shutdown: 0BRPeering with AS64600 not established - BRPeering with AS43821 not established - WIKIMEDIA-EUBRPeering with AS64600 not established - BR [15:43:27] RECOVERY - Host wikibooks-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.15 ms [15:43:45] PROBLEM - Host csw1-esams is DOWN: CRITICAL - Network Unreachable (91.198.174.247) [15:44:05] so very many pages [15:44:08] so yes [15:44:14] it should do iBGP peering with csw2-esams [15:44:18] but only those routes :) [15:44:40] if anyone didnt set the proper timezones for their paging, we'll find out shortly ;] [15:44:42] could you perhaps set that up while I cable stuff here? [15:44:48] RECOVERY - Host wikisource-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.61 ms [15:44:48] RECOVERY - Host wikimedia-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.17 ms [15:44:49] RECOVERY - Host foundation-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.18 ms [15:44:49] RECOVERY - Host text.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.26 ms [15:44:54] then we'll redo later [15:44:57] RECOVERY - Host upload.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 108.17 ms [15:45:06] RECOVERY - Host wikibooks-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 108.15 ms [15:45:06] RECOVERY - Host wikimedia-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 108.17 ms [15:45:07] RECOVERY - Host wikinews-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 108.39 ms [15:45:13] ok [15:45:15] RECOVERY - Host csw1-esams is UP: PING OK - Packet loss = 0%, RTA = 263.94 ms [15:45:15] RECOVERY - Host wikipedia-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 108.22 ms [15:45:15] RECOVERY - Host wikiquote-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 108.19 ms [15:45:24] RECOVERY - Host wikiversity-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 108.20 ms [15:45:24] RECOVERY - Host wikisource-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 108.18 ms [15:45:33] RECOVERY - Host wikiversity-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.23 ms [15:45:33] RECOVERY - Host wiktionary-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 108.16 ms [15:45:50] so in pmtpa/eqiad it gets them over ospf [15:45:51] PROBLEM - Varnish HTTP bits on cp3002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:45:51] RECOVERY - Host upload.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.61 ms [15:45:54] but that has issues as well, if you remember [15:45:58] oh klaxon, so annoying [15:45:59] yeah [15:46:00] RECOVERY - Host wikiquote-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.44 ms [15:46:06] paravoid: so if you want to erase out gallium, we will have to do a full backup of it and restore the jenkins var directory. 
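A rough sketch of the pre-reimage backup being discussed for gallium. The Jenkins home path (/var/lib/jenkins is the stock Ubuntu default) and the backup destination host are assumptions, not the actual procedure used:

    sudo service jenkins stop   # quiesce Jenkins so job configs and build history copy consistently
    sudo rsync -aHAX /var/lib/jenkins/ backuphost:/srv/backup/gallium/jenkins/
    sudo rsync -aHAX /home/            backuphost:/srv/backup/gallium/home/
    # After the reinstall and puppet run, rsync the data back and start Jenkins again.
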
[15:46:09] RECOVERY - Host wikipedia-lb.esams.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.35 ms [15:46:14] it's about time we clean that up and make it consistent :/ [15:46:16] i hate doing ibgp with partial routes … [15:46:17] paravoid: it has the history of the builds [15:46:18] RECOVERY - Host bits.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 108.43 ms [15:46:20] :-/ [15:46:20] me too [15:46:33] New patchset: Ottomata; "Escaping $1 in analytics::web::proxy RewriteRule" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27721 [15:46:36] PROBLEM - LVS HTTP IPv4 on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:21] RECOVERY - Varnish HTTP bits on cp3002 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 4.934 seconds [15:47:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27721 [15:47:33] paravoid: possibly /home too . I am not sure how it is setup [15:48:02] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27721 [15:48:06] RECOVERY - Host mediawiki-lb.esams.wikimedia.org_https is UP: PING OK - Packet loss = 0%, RTA = 108.64 ms [15:48:15] PROBLEM - LVS HTTP IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:49:06] mark: bits hasn't come back since your rollback [15:49:15] stupid bits [15:49:20] (just trying to summarize the dozens of pages) [15:49:25] can you move its traffic to pmtpa and then move it back? [15:49:30] okay [15:49:42] actually keep it there while i'm working here [15:49:46] one less thing to worry about it [15:49:46] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27195 [15:50:00] just bits [15:50:06] yeah I remember the varnish issue [15:50:17] we're on the normal scenario right? [15:50:22] yes [15:50:25] just edit bits-lb [15:50:35] I remember someone playing with another scenario recently [15:50:46] i did, it's all back to normal now [15:52:09] PROBLEM - Varnish HTTP bits on cp3002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:52:26] hey mark, want to do a quick show| compare on csw2-esams before i commit ? [15:52:45] !log storage 3 replacing disk 12 and rebuilding raid [15:52:54] ok [15:52:57] Logged the message, Master [15:53:59] !log moving bits from esams to pmtpa (authdns normal scenario) [15:54:11] Logged the message, Master [15:54:36] LeslieCarr: i'm not sure why that would solve it where the group iBGP on csw2-esams doesn't? [15:54:45] note that iBGP on csw2-esams is a bit misleading, it's specific to lvs prefixes [15:55:05] ah that is [15:55:12] i did not read through the policies [15:55:21] can we rename that group ibgp_LVSonly or something ? [15:55:24] yes [15:55:27] cool [15:55:30] and then set that up on both cr1-esams and cr2-knams [15:55:41] if we want to continue doing filtered iBGP :/ [15:55:45] PROBLEM - Varnish HTTP bits on cp3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:55:48] it's just that the ospf setup has issues too [15:57:55] let me know if you need anything else [15:59:58] want to do some more show | compares mark ? 
[16:00:14] yes [16:01:25] not you who did this I think, but policy-statement LVS_import on csw2-esams has a weird final from statement [16:01:54] yeah then protocol bgp wouldn't ever get evaluated [16:01:57] i'll remove from both sides [16:02:03] since then reject basically terminates [16:02:15] yes [16:02:46] removed from both [16:03:25] will you set it up on cr2-knams as well? [16:03:30] since it's iBGP it's not transitive [16:03:33] RECOVERY - Varnish HTTP bits on cp3001 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 2.796 seconds [16:04:45] PROBLEM - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:04:53] sure, i'll make it peertastic with csw2 [16:05:11] then I think this is right [16:05:30] RECOVERY - LVS HTTP IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 3883 bytes in 0.334 seconds [16:06:15] RECOVERY - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 3890 bytes in 0.467 seconds [16:07:28] cr2-knams now has the same pending :) [16:07:36] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [16:07:48] mark , i'll commit confirmed 5 on csw2-esams and cr1-esams -- okay ? [16:08:18] ok [16:08:30] PROBLEM - Varnish HTTP bits on cp3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:09:44] LeslieCarr: need to set this up on csw1-esams too actually [16:09:51] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [16:09:59] but that should be fine, it already has the setup [16:10:10] hrm no [16:10:15] it already has a full ibgp peering [16:13:09] PROBLEM - LVS HTTPS IPv4 on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:13:27] PROBLEM - LVS HTTP IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:13:56] the prefix-list of LVS-service-Is is different on csw2-esams vs other routers -- fixing [16:14:06] !log running installer on storage3 [16:14:10] and i'll add the ipv6 ip's as another term as well [16:14:17] Logged the message, Master [16:14:39] RECOVERY - LVS HTTPS IPv4 on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3892 bytes in 0.556 seconds [16:15:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:16:54] RECOVERY - BGP status on csw2-esams is OK: OK: host 91.198.174.244, sessions up: 5, down: 0, shutdown: 0 [16:16:57] let me know when you're ready to move the link again [16:18:06] RECOVERY - LVS HTTP IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 3895 bytes in 0.226 seconds [16:18:06] RECOVERY - LVS HTTP IPv4 on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3892 bytes in 4.356 seconds [16:18:07] okay, looks like all the switches have the routes now [16:18:28] mark: second time's the charm ? 
[16:18:37] let's hope so ;) [16:18:53] ok [16:18:58] i'll need to reconfigure the routers for the ip change and mtu [16:19:09] PROBLEM - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:19:21] can't I pull in a previous change set ;) [16:20:39] RECOVERY - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 3888 bytes in 0.451 seconds [16:20:57] RECOVERY - Varnish HTTP bits on cp3001 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.218 seconds [16:21:51] ok i am ready [16:22:09] RECOVERY - Varnish HTTP bits on cp3002 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.627 seconds [16:22:11] hold on, i realized i didn't have ipv6 policy statement in there - don't want to break v6 :) [16:22:47] ok [16:24:03] let's do it mark :) [16:24:27] so you're ready? [16:24:40] ready [16:26:03] PROBLEM - Host knsq20 is DOWN: PING CRITICAL - Packet loss = 100% [16:26:38] done [16:27:40] !log reedy synchronized wmf-config/CommonSettings.php 'Fix extdist mount related warnings' [16:27:42] PROBLEM - Host knsq26 is DOWN: PING CRITICAL - Packet loss = 100% [16:27:42] PROBLEM - Host knsq17 is DOWN: PING CRITICAL - Packet loss = 100% [16:27:49] argh [16:27:52] Logged the message, Master [16:28:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.741 seconds [16:28:09] PROBLEM - Host knsq23 is DOWN: PING CRITICAL - Packet loss = 100% [16:28:17] New patchset: Hashar; "fix sudo rights for wmf-beta-autoupdater" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27731 [16:28:26] ... [16:28:27] PROBLEM - LVS HTTPS IPv4 on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:31] rolling back [16:28:33] argh [16:28:36] PROBLEM - LVS HTTP IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [16:28:36] PROBLEM - LVS HTTP IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:36] PROBLEM - LVS HTTP IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:36] PROBLEM - LVS HTTP IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:36] PROBLEM - LVS HTTP IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:37] PROBLEM - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:37] PROBLEM - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:38] PROBLEM - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:45] PROBLEM - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [16:28:45] PROBLEM - HTTPS on ssl3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:46] PROBLEM - Frontend Squid HTTP on knsq16 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:46] PROBLEM - HTTPS on ssl3003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:46] PROBLEM - Varnish HTTP bits on cp3002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:46] PROBLEM - SSH on cp3002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:46] PROBLEM - SSH on ssl3002 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds [16:28:47] PROBLEM - Backend Squid HTTP on knsq28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:54] PROBLEM - Frontend Squid HTTP on knsq28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:54] PROBLEM - LVS HTTP IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [16:28:54] PROBLEM - Backend Squid HTTP on knsq29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:54] PROBLEM - LVS HTTPS IPv4 on wikiquote-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:54] PROBLEM - LVS HTTP IPv4 on wikibooks-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:55] PROBLEM - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:55] PROBLEM - LVS HTTPS IPv4 on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:28:56] PROBLEM - LVS HTTPS IPv4 on wikipedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27731 [16:29:21] PROBLEM - SSH on knsq24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:21] PROBLEM - SSH on knsq18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:21] PROBLEM - LVS HTTP IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:21] PROBLEM - LVS HTTP IPv4 on bits.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:22] RECOVERY - Host knsq26 is UP: PING WARNING - Packet loss = 93%, RTA = 506.46 ms [16:29:22] RECOVERY - Host knsq23 is UP: PING WARNING - Packet loss = 93%, RTA = 494.80 ms [16:29:22] RECOVERY - Host knsq17 is UP: PING WARNING - Packet loss = 80%, RTA = 483.96 ms [16:29:30] PROBLEM - LVS HTTP IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [16:29:30] PROBLEM - LVS HTTP IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [16:29:30] PROBLEM - LVS HTTPS IPv4 on wikisource-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:30] PROBLEM - SSH on ssl3003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:30] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:29:39] PROBLEM - LVS HTTPS IPv4 on wikibooks-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Cannot make SSL connection [16:29:39] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: (Return code of 141 is out of bounds) [16:29:48] RECOVERY - LVS HTTPS IPv4 on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 637 bytes in 0.438 seconds [16:29:48] RECOVERY - Host knsq20 is UP: PING OK - Packet loss = 0%, RTA = 108.21 ms [16:29:57] RECOVERY - LVS HTTP IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 58510 bytes in 0.565 seconds [16:29:57] RECOVERY - LVS HTTP IPv6 on wikisource-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 58511 bytes in 0.566 seconds [16:29:57] RECOVERY - LVS HTTP IPv6 on wikipedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 58511 bytes in 0.569 seconds [16:29:58] RECOVERY - LVS HTTP IPv6 on wikinews-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 58512 bytes in 0.572 seconds 
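The flood of LVS checks above can be approximated by hand: hit the service IP directly with curl, send a matching Host header, and give up after the same 10-second window the checks use. A sketch only — 91.198.174.233 is the bits.esams service address from the alerts, and the Host header is an assumption about what the check sends:

    # HTTP check against the bits LVS service IP
    curl -s --max-time 10 -o /dev/null -w '%{http_code}\n' \
         -H 'Host: bits.wikimedia.org' http://91.198.174.233/

    # Same idea for the HTTPS check (-k: we hit the bare IP, so skip certificate name checks)
    curl -sk --max-time 10 -o /dev/null -w '%{http_code}\n' \
         -H 'Host: bits.wikimedia.org' https://91.198.174.233/
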
[16:29:58] RECOVERY - LVS HTTP IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 58510 bytes in 0.570 seconds [16:29:58] RECOVERY - LVS HTTPS IPv4 on wikinews-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 69246 bytes in 0.768 seconds [16:29:58] RECOVERY - LVS HTTPS IPv6 on wiktionary-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 58511 bytes in 0.833 seconds [16:29:59] RECOVERY - LVS HTTPS IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 79040 bytes in 0.914 seconds [16:30:00] fail [16:30:15] RECOVERY - Frontend Squid HTTP on knsq16 is OK: HTTP OK HTTP/1.0 200 OK - 652 bytes in 0.217 seconds [16:30:15] RECOVERY - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 637 bytes in 0.229 seconds [16:30:16] RECOVERY - SSH on cp3002 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:30:16] RECOVERY - SSH on ssl3002 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:30:16] RECOVERY - Varnish HTTP bits on cp3002 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.218 seconds [16:30:16] RECOVERY - Backend Squid HTTP on knsq28 is OK: HTTP OK HTTP/1.0 200 OK - 1424 bytes in 0.218 seconds [16:30:16] RECOVERY - HTTPS on ssl3003 is OK: OK - Certificate will expire on 07/19/2016 16:13. [16:30:17] RECOVERY - HTTPS on ssl3001 is OK: OK - Certificate will expire on 07/19/2016 16:14. [16:30:24] RECOVERY - Frontend Squid HTTP on knsq28 is OK: HTTP OK HTTP/1.0 200 OK - 1583 bytes in 0.217 seconds [16:30:24] RECOVERY - Backend Squid HTTP on knsq29 is OK: HTTP OK HTTP/1.0 200 OK - 1423 bytes in 0.217 seconds [16:30:24] RECOVERY - LVS HTTP IPv4 on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 43611 bytes in 0.435 seconds [16:30:24] RECOVERY - LVS HTTPS IPv4 on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3900 bytes in 0.440 seconds [16:30:24] RECOVERY - LVS HTTP IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 58507 bytes in 0.567 seconds [16:30:25] RECOVERY - LVS HTTPS IPv4 on wikiquote-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 53385 bytes in 0.664 seconds [16:30:25] RECOVERY - LVS HTTPS IPv4 on wikipedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 58512 bytes in 0.878 seconds [16:30:26] RECOVERY - LVS HTTPS IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 39696 bytes in 0.960 seconds [16:30:42] RECOVERY - SSH on knsq18 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:30:42] RECOVERY - SSH on knsq24 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:30:51] RECOVERY - LVS HTTP IPv4 on bits.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3894 bytes in 0.218 seconds [16:30:51] RECOVERY - LVS HTTP IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 58372 bytes in 0.566 seconds [16:30:52] RECOVERY - SSH on ssl3003 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [16:30:52] RECOVERY - LVS HTTP IPv6 on foundation-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 58511 bytes in 0.565 seconds [16:30:52] RECOVERY - LVS HTTPS IPv4 on wikisource-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 43242 bytes in 0.660 seconds [16:31:00] RECOVERY - LVS HTTP IPv6 on wikimedia-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 79040 bytes in 0.681 seconds [16:31:00] RECOVERY - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 79040 bytes in 
0.766 seconds [16:31:09] RECOVERY - LVS HTTPS IPv4 on wikibooks-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 43617 bytes in 0.661 seconds [16:31:09] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 62433 bytes in 0.947 seconds [16:34:36] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100% [16:37:00] paravoid: still around? I could use a review for some minor fix I have done to the beta cluster autoupdater [16:37:03] list at https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:beta,n,z [16:37:11] I am but a bit busy atm [16:37:15] split in 3 changes. Could probably squash them in just one [16:37:19] okkkk ;-] [16:40:18] RECOVERY - Host storage3 is UP: PING WARNING - Packet loss = 28%, RTA = 0.30 ms [16:44:01] New patchset: Hashar; "make sure mwdeploy can write to the beta updater log" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27736 [16:44:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27736 [16:55:36] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [17:02:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:03:24] PROBLEM - NTP on storage3 is CRITICAL: NTP CRITICAL: No response from NTP server [17:14:15] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27736 [17:14:52] gerrit-wm: .. [17:15:29] ah, it just did not mention a pure comment [17:16:59] ottomata: merging your mongodb class that was merged but not in sockpuppet yet [17:17:48] bwerrr? [17:17:53] someone merged it? [17:18:05] asher merged it in gerrit [17:18:09] yea [17:18:12] oh! [17:18:13] cool ok [17:18:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.156 seconds [17:18:22] ok thanks [17:18:30] good to know, i'll have another commit to use it on stat1 coming soon [17:18:33] that shoudl be fine [17:18:56] Change abandoned: Ottomata; "Using MongoDB module instead:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26799 [17:19:01] heh:) [17:29:56] !log pc2 shutting down to replaced hdd w/ ssd rt3703 [17:30:08] Logged the message, Master [17:36:47] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27682 [17:38:55] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27696 [17:45:04] New patchset: Kaldari; "Removing UK from LandingCheck exemptions, since they aren't doing their own payment processing this year." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27748 [17:45:12] New review: Dzahn; "per brief talk with Ryan, please move the file and package definitions out of the role class into a ..." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/25236 [17:46:52] !log pc3 shutting down to replace hdd's with ssd's rt3703 [17:47:04] Logged the message, Master [17:47:06] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27372 [17:47:45] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27371 [17:48:58] New patchset: Kaldari; "Updating secure URL for banner loading" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27749 [17:49:42] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27748 [17:50:49] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27749 [17:52:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:04] notpeter: pc2 is ready for you...the ibm tech is here for db42 issues right now so I will finish pc3 shortly if you want to install pc2 and see if there are any other issues [18:01:37] New review: Dzahn; "wee.. manpages for our misc scripts :)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/16606 [18:01:38] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16606 [18:02:18] cmjohnson1 - we have onsite ibm support? [18:02:48] cmthank you! [18:08:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [18:16:05] New patchset: Dzahn; "add full sudo rights for demon on gerrit hosts per RT-3698" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27753 [18:17:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27753 [18:17:31] New patchset: Dzahn; "add full sudo rights for demon on gerrit hosts per RT-3698" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27753 [18:18:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27753 [18:19:53] !log troubleshooting db42 w/IBM tech [18:20:05] Logged the message, Master [18:21:48] New patchset: preilly; "Don't exit on fflush() error" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/27662 [18:22:36] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [18:22:53] New patchset: Dereckson; "Fixing typo in dologmsg(1)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27755 [18:24:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27755 [18:25:43] New patchset: Dereckson; "Fixing whitespace issue" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27756 [18:26:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27756 [18:26:51] How do I get +2 on operations/debs/varnish [18:28:42] !log reedy synchronized php-1.21wmf1/extensions/ArticleFeedback/ [18:28:53] Logged the message, Master [18:31:18] #3708: Grant preilly +2 on operations/debs/varnish [18:31:25] https://rt.wikimedia.org/Ticket/Display.html?id=3708&results=6b15a430df3a1f7249f3b82a2c690383 [18:32:13] <^demon|busy> preilly: I answered in the other channel....no need for RT. Just ask on-wiki (like normal) and get someone from ops to +1 the request. 
[18:33:26] ^demon|busy: I did what CT told me to do [18:33:28] !log reedy synchronized wmf-config/InitialiseSettings.php 'AFT (non v5) doesn't use clicktracking now' [18:33:39] Logged the message, Master [18:34:05] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27755 [18:35:10] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27756 [18:37:42] cmjohnson1: any news on db42? [18:39:25] mutante: here is what I know...there is no h/w issues [18:40:06] oh? but it doesnt boot? [18:40:20] mark: ping [18:40:21] it doesn't recognize our external dvd rom so i am going to try and create a bootable usb [18:40:35] mutante: no it will not boot [18:41:33] ^demon|busy: you replied to the wrong RT [18:41:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:41:42] <^demon|busy> Blargghhh [18:41:52] New patchset: Kaldari; "Some changes for CentralNotice..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27760 [18:42:30] <^demon|busy> paravoid: I did? I meant to reply to the one re: +2 permissions. [18:57:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds [18:59:21] ^demon|busy: sorry, my bad [18:59:25] i misread that [18:59:40] <^demon|busy> :) [19:01:38] !log powering down search32 to replace cpu1 [19:01:49] Logged the message, Master [19:03:00] cmjohnson1: re. thanks for info. got poked by Dario [19:03:22] plz tell Dario, still working on it thx [19:05:38] just did, np [19:12:29] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27760 [19:20:39] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [19:22:26] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27753 [19:29:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:29:17] ^demon|busy: running puppet on the boxes..takes a while:) [19:29:31] <^demon|busy> okie dokie :) [19:30:25] New patchset: Ottomata; "Installing MongoDB on stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27762 [19:31:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27762 [19:31:58] I ran puppet on manganese already [19:32:09] oh shit. we're supposed to upgrade gerrit today, aren't we? [19:32:14] hahaha [19:32:16] forgot about that [19:32:25] ^demon|busy: when are we doing this? [19:32:28] 30 mins from now? [19:32:42] <^demon|busy> Yeah, but I'm free whenever you're ready to start. [19:32:50] I need to make the package, I guess [19:32:50] ah, good timing then to have sudo [19:33:16] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27762 [19:33:37] <^demon|busy> Ryan_Lane: The *.war is built already. http://noc.wikimedia.org/~demon/gerrit-2.4.2-1-ga076b99.war [19:33:46] yep [19:34:17] <^demon|busy> There's no schema or database changes, so it should just be bring down -> install new *.war -> bring back up [19:34:22] <^demon|busy> *schema or data [19:36:15] notpeter: pc2 and pc3 are finished for you [19:36:25] ah [19:36:25] good [19:36:36] upgrade to what? 2.5? [19:36:45] no [19:36:46] <^demon|busy> Custom 2.4.2 build [19:36:53] ah, a shame [19:37:03] 2.5 will be soon enough, hopefully [19:37:03] <^demon|busy> Yes. 
2.5 has a nasty regression we need fixed first :( [19:37:34] cmjohnson1: woo! thanks [19:37:40] although pc2 is being really frustrating [19:37:42] yw...anytime [19:37:55] but... same as the analytics ciscos... [19:37:57] *sigh* [19:37:58] anything i can do to help? [19:38:04] heh [19:38:10] <^demon|busy> paravoid: I wouldn't bother updating but hashar needs a patch (that made the 2.5 release) pulled in sooner for his Zuul work. It's trivial, so I just made our own build :) [19:38:17] maybe eventually... but not yet. thanks, though! [19:38:52] k [19:42:15] <^demon|busy> Ryan_Lane: I'd also like to dist-upgrade manganese & formey sometime. Maybe next week? [19:44:02] ^demon|busy: I'm at openstack conference next week [19:44:16] <^demon|busy> Ah, no rush. We'll figure sometime out. [19:44:48] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [19:44:48] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [19:45:12] * AaronSchulz wonders why the rados gateway works via bash and curl but not fully in MW [19:45:40] SwiftFileBackend::doStoreInternal: Invalid response (0): (curl error: 0) : Failed to obtain valid HTTP response. [19:46:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.043 seconds [19:47:30] ^demon|busy: ok. should I upgrade formey first? [19:47:36] to make sure it's going to work? [19:48:00] <^demon|busy> Yeah, let's do formey first just in case. [19:48:04] * Ryan_Lane nods [19:48:09] hah, when I google that error I get a bunch of CF stuff [19:48:19] <^demon|busy> I've already tested this locally & on labs, so I don't think it'll be a problem [19:48:42] oh wait, that part of the message is custom so that makes sense ;) [19:48:47] ugh. stupid package [19:48:52] it tries to remove the gerrit user [19:48:56] and group [19:49:27] <^demon|busy> Package needs fixing then ;-) [19:49:54] and wtf isn't my token lasting for 24 hours [19:51:02] <^demon|busy> Ryan_Lane: Maybe we could bribe paravoid into fixing our package one day :) [19:52:04] ^demon|busy: seems to be working [19:52:35] in other news: http://imgur.com/gallery/hSCpf [19:53:01] ugh. I hate that these systems are set up for ldap [19:53:22] ^demon|busy: I did my best for that :P [19:53:25] <^demon|busy> And `gerrit query` now has the new --submit-records option. [19:54:57] I'm removing the nsswitch ldap config on manganese before I start [19:55:30] <^demon|busy> Ok? I have nfc what that does anyway [19:56:34] motherfucker [19:56:59] I really don't get why it tries to delete the group and user [19:57:14] ah [19:57:20] I'm not checking to see if this is an upgrade [19:57:42] rebuilding the package [19:58:31] <^demon|busy> Too bad MW's OpenID provider is totally broken or we could've used that from day 1. [19:58:42] !log reedy synchronized wmf-config/ [19:58:54] Logged the message, Master [19:58:57] ^demon|busy: we wouldn't have been able to use ldap groups, then [19:59:08] <^demon|busy> True. [19:59:29] <^demon|busy> But the number of ldap groups isn't *huge* compared to internal groups, so we could've managed I think. [19:59:36] not true [19:59:38] cmjohnson1: would you be willing to take a look at pc2? when I set it to netboot it seems to freeze/stops displaying to console right about when it should start pxebooting... 
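The packaging problem Ryan_Lane hits above, where the deb tries to delete the gerrit user and group during what is actually an upgrade, comes down to a maintainer script that never looks at its first argument. dpkg passes the action (remove, purge, upgrade, and so on) as $1, so the usual fix is to guard the account removal on it. A minimal postrm sketch under that assumption, not the actual WMF gerrit packaging:

    #!/bin/sh
    # postrm sketch -- the account name comes from the discussion above,
    # everything else is illustrative.
    set -e

    case "$1" in
        purge|remove)
            # Drop the service account only on real removal; on upgrade
            # (and failed-upgrade etc.) $1 is not remove/purge, so the
            # user and group survive the new version being installed.
            if getent passwd gerrit >/dev/null; then
                deluser --system gerrit || true
            fi
            if getent group gerrit >/dev/null; then
                delgroup --system gerrit || true
            fi
            ;;
    esac

    #DEBHELPER#
    exit 0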
[19:59:44] I want to use the labs project groups [19:59:58] for branch access at minumum [20:00:00] <^demon|busy> Oh, well future plans don't count :) [20:00:07] I had immediate plans for that [20:00:09] <^demon|busy> Right now we've got like 5 ldap groups we use in gerrit [20:00:19] yeah [20:00:59] <^demon|busy> Bummer you won't be coming to the user summit. I'm giving a talk on how the project listing page sucks :p [20:01:30] notpeter: sure..give me a few mins [20:03:41] ^demon|busy: done [20:03:54] ^demon|busy: mind checking it? [20:04:04] <^demon|busy> Prod still says 2.4.2, hrm [20:04:33] <^demon|busy> Or not, CLI reports custom build. [20:04:36] <^demon|busy> UI is being silly [20:05:35] ^demon|busy: silly in which way? [20:05:46] it's fine for me [20:05:53] <^demon|busy> UI says 2.4.2 [20:06:06] notpeter: rebooting now via keyboard/monitor [20:06:09] <^demon|busy> Should say 2.4.2-1-ga076b99 [20:06:12] cmjohnson1: cool! [20:06:13] ah [20:06:22] oh! pc3 is behaving nicely [20:06:29] <^demon|busy> `ssh -p 29418 gerrit.wikimedia.org gerrit version` is reporting the correct version though. [20:06:32] <^demon|busy> And the new feature shows up. [20:06:33] [2012-10-12 20:03:24,210] INFO com.google.gerrit.pgm.Daemon : Gerrit Code Review 2.4.2-1-ga076b99 ready [20:06:49] <^demon|busy> Yeah. Something in the UI might be cached. [20:07:10] * Ryan_Lane nods [20:07:13] ok. lunch time. [20:07:22] <^demon|busy> Thanks Ryan ! [20:07:54] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [20:10:51] notpeter: what do you want as your #1 boot option? [20:11:04] uuuuhhhh [20:11:08] it needs to pxe once [20:11:13] so pxe, Isuppose [20:11:59] ok..it was set that way [20:12:23] yeah, that part wasn't an issue [20:12:43] it was that it became unresponsive at some point in the booting [20:13:04] can you try booting it up with kvm plugged in and see if it actually PXE's? [20:13:09] if it doesn't then it's a console issue [20:13:16] if not, then it's something else... [20:14:14] yeah...pxe is timing out [20:14:27] getting the media failure/check cable error [20:14:45] weird.... [20:14:52] it PXE'd yesterday [20:14:56] well, might be a plug in error [20:15:32] odd..yeah, my thought is to replace the cables..it seems the most logical and simplest check [20:15:32] mutante: we have man pages now!!!!!!!!!!!!! yeah !!! \O/ rejoice :-] [20:15:53] hashar: heh, it was merge-Friday:) [20:16:02] well, code review time i mean [20:16:51] mutante: seems you have been busy reviewing operations/puppet.git today ;-] [20:18:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:18:38] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27768 [20:19:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27768 [20:21:58] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [20:23:10] New patchset: Hashar; "typo in dologmsg(1)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27769 [20:23:15] hashar: having man pages is cool. yay! thx [20:24:20] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [20:24:20] Change abandoned: DamianZaremba; "BLEH" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27768 [20:24:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27769 [20:25:43] New patchset: Hashar; "fix sudo rights for wmf-beta-autoupdater" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27731 [20:26:54] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [20:28:06] New review: Hashar; "and there was another typo in the asciidoc title that prevented the man page generation. Fixed it w..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27755 [20:28:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27731 [20:28:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [20:32:23] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27731 [20:32:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds [20:32:53] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27769 [20:43:59] heya paravoid, you still working? [20:44:05] i gotta a really strange puppet naming conflict [20:44:39] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [20:44:50] https://gerrit.wikimedia.org/r/#/c/27762/1/manifests/misc/statistics.pp [20:44:55] this [20:44:55] class misc::statistics::mongodb [20:44:57] is giving me [20:45:04] Duplicate definition: Class[Misc::Statistics::Mongodb] is already defined; cannot redefine at /var/lib/git/operations/puppet/manifests/misc/statistics.pp:241 [20:45:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [20:47:09] New patchset: Ottomata; "Renaming classes to avoid weird class naming conflict" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27771 [20:48:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27771 [20:48:59] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27771 [20:49:18] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [20:49:24] ottomata: nak [20:49:30] oh too late for that [20:49:52] you need class { "::mongodb": ... } [20:49:53] yeah, puppet sucks for that [20:50:04] wah? [20:50:10] to get the top-namespaced mongodb [20:50:15] hm [20:50:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [20:51:59] hm, ok, thanks paravoid, good to know [20:55:33] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [20:56:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [20:57:14] New review: Hashar; "would move that to the zuul module under a wmf_configuration sub class." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/25236 [20:58:30] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [20:59:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [20:59:40] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:00:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [21:02:42] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:03:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [21:06:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:07:18] New patchset: Cmjohnson; "Fixing dhcpd entry for pc2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27777 [21:08:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27777 [21:10:12] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:11:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [21:12:57] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:14:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [21:14:29] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:15:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [21:16:20] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:17:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [21:18:46] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:19:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.614 seconds [21:19:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [21:21:11] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:22:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [21:23:04] ironic how that is the "first hash" [21:23:21] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:23:54] Damianz: any particular reason you're pushing a patchset every two minutes or so? 
:) [21:24:18] patchsets are always welcome of course, just seemed a bit peculiar [21:24:22] paravoid: Commiting stuff from my test instance sucks so it's easier to push from my lapton and pull to the dev box [21:24:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [21:43:52] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:44:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [21:53:44] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:53:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:54:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [21:55:25] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [21:56:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [22:09:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.052 seconds [22:12:22] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [22:13:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [22:15:41] New patchset: DamianZaremba; "First hash at starting to puppetize bots in labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [22:15:56] RECOVERY - Host db42 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [22:16:51] New patchset: Reedy; "Disable ReaderFeedback" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27830 [22:16:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [22:28:24] New patchset: DamianZaremba; "Puppetizing the bots setup for labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [22:29:29] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27777 [22:29:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [22:31:42] New patchset: DamianZaremba; "Puppetizing the bots setup for labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [22:32:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [22:33:12] New patchset: DamianZaremba; "Puppetizing the bots setup for labs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26441 [22:34:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/26441 [22:36:36] !log muted nagios [22:36:48] Logged the message, Mistress of the network gear. [22:36:53] !log switching europe traffic to pmtpa via authdns-scenario [22:37:04] Logged the message, Mistress of the network gear. [22:48:33] LeslieCarr: do you have a second to check something for me? 
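Circling back to the duplicate-definition error ottomata ran into further up: inside class misc::statistics::mongodb, an unqualified reference to mongodb resolves relative to the enclosing namespace and lands on the wrapper class itself, which is exactly the "Class[Misc::Statistics::Mongodb] is already defined" failure, and paravoid's suggestion of class { "::mongodb": } forces the top-scope class instead. A minimal sketch, with both class bodies invented for illustration rather than copied from the real manifests:

    # Top-scope class, e.g. provided by a mongodb module.
    class mongodb($dbpath = '/var/lib/mongodb') {
        package { 'mongodb-server':
            ensure => installed,
        }
    }

    class misc::statistics::mongodb {
        # class { 'mongodb': } here resolves to misc::statistics::mongodb
        # itself under Puppet's relative name lookup, triggering the
        # duplicate definition error quoted above.  The leading :: pins
        # the reference to the top-scope class instead:
        class { '::mongodb':
            dbpath => '/srv/mongodb',    # illustrative parameter
        }
    }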
[22:49:38] notpeter: sure [22:49:45] oh, second's up [22:49:53] it's the mc10* hosts... [22:49:58] :( [22:50:06] not my fault you didn't ask for more [22:50:10] do you hav ea schedule of when the seconds are available? [22:50:20] is there some kinda leslie's time installment plan that I can use? [22:50:35] Hrm, I need an AA [22:51:10] can we get those? [22:51:15] I probably can't... [22:51:20] you should give it a go! [22:52:02] :) [22:52:13] hrm, i guess i can give 240 more seconds [22:52:14] starting.... [22:52:15] NOW [22:52:19] so [22:52:35] mc1009 and mc1010 are both following media tests [22:52:41] they're happy from a network perspective, yes? [22:52:51] this is most likely cabling or network car in the host, yes? [22:53:47] so 1009 and 1010 aren't showing as having sfp's [22:54:04] need new cables i think [22:54:12] ok, cool [22:54:16] stupid cabling [22:54:36] WHY IT SO HARD!?!?!? [22:55:01] sweet. thank oyu [22:55:12] *thank you [22:56:09] yw [22:56:20] beccause vendors like to lock cables for no reason so they can charge 200x markups [22:56:27] it's like printer ink [22:57:00] these things make me so sad, leslie [22:57:18] "oh, this cable? $500. and you're gonna pay it." [22:57:21] *sigh* [22:57:38] :( [22:58:09] this is not what progress in the world looks like [23:00:21] progress is when they all cost the same amount because the same 10 year olds make them! [23:00:24] New patchset: Asher; "disabling session saves to redis in order to reconfigure test instance" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27833 [23:00:51] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27833 [23:01:18] LeslieCarr: :( [23:01:44] !log asher synchronized wmf-config/CommonSettings.php 'disabling session multiwrites to redis' [23:01:47] sorr,y, i'm awful [23:01:55] Logged the message, Master [23:01:58] so, in 5 minutes i'll reboot esams switch stack [23:03:57] * Damianz runs away screaming [23:04:22] most of the traffic is gone damianz [23:04:32] hopefully :) [23:15:57] well, mostly gone :) [23:16:10] !log rebooting esams core switch stack, expect reachability problems to esams [23:16:22] Logged the message, Mistress of the network gear. [23:18:55] New patchset: Asher; "re-enable session multiwrite to redis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27842 [23:19:35] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27842 [23:20:19] !log asher synchronized wmf-config/CommonSettings.php 'enabling session multiwrites to redis' [23:20:30] Logged the message, Master [23:21:27] * AaronSchulz hears binasher sneaking around [23:21:55] these are not the droids you're looking for [23:22:18] binasher: so after some cloudfiles fixes, ceph passes all MW tests and seems to work with manual testing btw [23:23:54] * binasher strokes beard [23:29:32] binasher: did you enable the redis debug log? [23:30:08] * AaronSchulz doesn't see it in fluorine [23:31:27] i just enabled logging locally on the redis server, but not via syslog or anything in mediawiki [23:31:46] might be nice to set wgDebugLogGroups for 'redis' [23:39:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:39:59] !log reenabled nagios alerts [23:40:11] Logged the message, Mistress of the network gear. [23:40:15] !log moving traffic back to esams via authdns-scenario [23:40:26] Logged the message, Mistress of the network gear. 
[23:41:20] AaronSchulz: will just 'redis' => 'udp://...' actually work? is RedisBagO.. mapped to 'redis'? [23:41:38] yes [23:42:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.472 seconds [23:45:28] New patchset: Asher; "redis debuglog group" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27845 [23:45:36] AaronSchulz: ^^ that right? [23:46:17] yep, you just have to create the empty file (with the same perms as the others) [23:47:15] Couldn't resolve host 'ms-fe.pmtpa.wmnet': Failed to obtain valid HTTP response [23:47:20] keeps coming up in swift-backend.log [23:47:32] yeah, ariel and peter looked at that and then gave up ;) [23:47:40] it only comes from the precise job runners [23:48:01] mostly just on zhwiki [23:48:08] yep, and nlwiki or something [23:48:18] but i guess zhwiki is still 90% of all jobs [23:48:34] hopefully i'll get an iphone 5 out of it [23:48:45] guessing is no fun, graph it! [23:49:54] :) [23:50:31] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27845 [23:51:11] !log asher synchronized wmf-config/CommonSettings.php 'enabling redis debug log' [23:51:23] Logged the message, Master
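For reference, the debug-log change binasher and AaronSchulz settle on above is the standard $wgDebugLogGroups mechanism: map the 'redis' group that MediaWiki's redis code writes to via wfDebugLog onto a udp2log destination, and pre-create the output file on the log host. A minimal sketch with an invented host, port and path rather than the production values:

    <?php
    // wmf-config sketch: route the 'redis' debug log group to udp2log.
    // Host, port and filename are placeholders, not the real values.
    $wgDebugLogGroups['redis'] = 'udp://logging.example.wmnet:8420/redis';

    // Callers then only need wfDebugLog( 'redis', $message ) and the line
    // is shipped to the udp2log collector.  On that host the output file
    // (e.g. /a/mw-log/redis.log) must already exist with the same owner
    // and permissions as the other log files, per the note above about
    // creating the empty file first.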