[00:02:00] !log catrope Started syncing Wikimedia installation... : Scap for VE update, contained i18n changes
[00:02:08] Logged the message, Master
[00:02:14] github is faster than gerrit
[00:04:54] New patchset: Legoktm; "Have gerrit-wm send all pywikibot/* commits to #pywikipediabot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70780
[00:09:30] !log catrope Finished syncing Wikimedia installation... : Scap for VE update, contained i18n changes
[00:09:33] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[00:09:38] Logged the message, Master
[00:15:18] New patchset: Legoktm; "Have gerrit-wm send all pywikibot/* commits to #pywikipediabot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70780
[00:15:59] !log catrope synchronized php-1.22wmf7/resources/startup.js 'touch'
[00:16:08] Logged the message, Master
[00:16:21] !log catrope synchronized php-1.22wmf8/resources/startup.js 'touch'
[00:16:30] Logged the message, Master
[00:23:09] !log updated Parsoid to eccca39
[00:23:17] Logged the message, Master
[01:01:56] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.003178834915 secs
[01:02:26] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.002837061882 secs
[01:06:17] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[01:22:07] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server
[01:31:57] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.001416444778 secs
[01:32:37] !log updated Parsoid to 091ebece
[01:32:46] Logged the message, Master
[01:34:19] !log Roan cleared Parsoid caches
[01:34:29] Logged the message, Master
[01:35:27] PROBLEM - Varnish HTTP parsoid-backend on titanium is CRITICAL: Connection refused
[01:35:57] RoanKattouw: ^^
[01:36:15] Ugh
[01:36:26] I also still get cached content
[01:36:27] Silly me
[01:36:35] I started Parsoid instead of Varnish
[01:36:44] ah ;)
[01:37:13] Oh, wait
[01:37:15] And!
[01:37:18] I did it on the wrong boxes
[01:37:22] We've moved to cpNNNN now
[01:37:26] yes, I was wondering about that ;)
[01:37:27] RECOVERY - Varnish HTTP parsoid-backend on titanium is OK: HTTP OK: HTTP/1.1 200 OK - 636 bytes in 0.005 second response time
[01:39:03] OK, done
[01:40:30] RoanKattouw: thanks, looks good
[02:07:05] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[02:07:37] !log LocalisationUpdate completed (1.22wmf8) at Thu Jun 27 02:07:37 UTC 2013
[02:07:48] Logged the message, Master
[02:13:40] !log LocalisationUpdate completed (1.22wmf7) at Thu Jun 27 02:13:40 UTC 2013
[02:13:49] Logged the message, Master
[02:19:05] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jun 27 02:19:05 UTC 2013
[02:19:14] Logged the message, Master
[02:55:41] !log added springle to wmf and ops LDAP groups
[02:55:50] Logged the message, Master
[03:08:03] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[03:12:07] New patchset: Tim Starling; "Use the /usr/local copy of MW for noc" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70792
[03:14:01] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70792
[03:28:53] PROBLEM - Disk space on ms-be1001 is CRITICAL: DISK CRITICAL - free space: / 5682 MB (3% inode=98%):
[03:33:09] noc.wikimedia.org/dbtree has stopped working
[03:33:41] TimStarling, ^ related to rt 70792?
[03:33:50] hey springle
[03:33:53] welcome :)
[03:34:02] hi paravoid, thanks :)
[03:34:04] (I'm Faidon)
[03:36:45] it was because I moved that MW source tree away
[03:36:48] I fixed it
[04:07:03] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[04:08:03] RECOVERY - Puppet freshness on mw1066 is OK: puppet ran at Thu Jun 27 04:07:59 UTC 2013
[04:18:23] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Jun 27 04:18:19 UTC 2013
[04:19:03] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[04:30:43] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Jun 27 04:30:35 UTC 2013
[04:31:04] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[05:03:22] apergos: redirects are gone I see!
[05:07:19] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:49] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:49] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:49] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:49] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:49] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:50] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:50] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:51] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:51] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:52] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:52] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[05:10:09] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server
[05:22:29] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server
[05:31:05] New patchset: BBlack; "more build/pkg fixes" [operations/software/varnish/libvmod-netmapper] (master) - https://gerrit.wikimedia.org/r/70795
[05:31:15] morning bblack :)
[05:31:28] Change merged: BBlack; [operations/software/varnish/libvmod-netmapper] (master) - https://gerrit.wikimedia.org/r/70795
[05:31:34] or evening :)
[05:31:45] thanks for the clarification, that's exactly what I meant, yes
[06:11:50] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[06:12:10] PROBLEM - Disk space on ms-be1002 is CRITICAL: DISK CRITICAL - free space: / 5699 MB (3% inode=98%):
[06:14:30] PROBLEM - SSH on mc15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:15:20] RECOVERY - SSH on mc15 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[06:21:53] apergos: I rewrote the Architecture section https://wikitech.wikimedia.org/wiki/Media_storage
[06:22:00] your content helped me set the direction, thanks :)
[06:22:27] I also added info regarding eraseArchivedFile.php
[06:22:41] we should send all this to Aaron when we're done
[06:22:51] I'm sure he'll have corrections and additions
[06:27:10] RECOVERY - Disk space on ms-be1002 is OK: DISK OK
[06:27:50] RECOVERY - Disk space on ms-be1001 is OK: DISK OK
[06:33:32] paravoid: awesome
[06:45:54] hey paravoid
[06:46:13] or apergos
[06:53:59] yess?
[06:55:19] hi
[07:00:36] nothing major, just wanted advice
[07:00:49] i have a re-write of the eventlogging puppet module that i've been tweaking
[07:01:11] one of the things it does is replace supervisord (a python-based process management thingabob) with upstart
[07:01:58] it seemed annoying to insist on using some other then upstart to manage services when upstart was already managing everything
[07:02:32] but the nice thing about that was that it gave me a management interface that was specific to the six or seven processes that i cared about
[07:02:54] without having to grep for or struggle to recall service names
[07:03:35] now they're just lost in the crowd of random system services i hardly ever care about
[07:04:14] does that sound sensible?
[07:06:14] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[07:06:43] to do what?
[07:06:48] management how?
[07:07:48] ensure they're running, restart if necessary (with groups), tail stderr
[07:08:39] 'with groups' meaning there's a notion of process groups in supervisor and you can scope a command to a group of processes
[07:08:42] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[07:08:57] I'm not sure I understand what you want to do exactly
[07:09:53] i was trying to move away from supervisor and just use upstart, but am now reconsidering, and wondering if my reservations are legitimate
[07:10:49] ori-l: name all of the services eventlogging- maybe?
[07:11:09] no, I mean, what do you usually do with supervisor?
[07:11:13] and is that manually?
[07:12:10] Ryan_Lane: i guess "service --status-all | grep eventlogging-", hrm
[07:12:33] paravoid: usually tail stderr and restart individual components for code upgrades
[07:12:39] and yes, manually
[07:13:32] oh, and e-mail alerts
[07:14:05] I'd be okay with shipping a shell script in /usr/local/sbin that had a few management commands
[07:14:22] wmel status, wmel debug, wmel restart, etc.
[07:14:33] tail stderr isn't something that supervisord does anyway :)
[07:14:42] sure it does
[07:14:50] it odes?
[07:15:51] yeah, supervisorctl has a 'tail -f' command
[07:16:23] ugh :)
[07:16:23] it beats fishing the right file in /var/log
[07:17:19] http://vanadium.eqiad.wmnet:2828/
[07:18:37] i like the idea of upstart + management script + consistent service name prefix though
[07:18:55] web interfaces for management, bleh
[07:19:07] (i don't use it :))
[07:20:19] I don't mind supervisord, but it is a bit kind of counter-intuitive
[07:20:31] both in general and in the sense that we don't generally use it so people are not familiar with it
[07:21:17] but if you have reasons to prefer it, that's okay
[07:21:28] your call :)
[07:21:54] worth a shot
[07:22:04] plus i like 'wmel'
[07:22:29] :)
[07:22:57] alright, thanks
[07:23:00] I'd also add a "wmel check" that would run the nagios checks
[07:23:15] there are no nagios checks
[07:23:26] time to add them!
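[Editor's note: a minimal sketch of the "wmel" wrapper discussed above, assuming the eventlogging processes become upstart jobs sharing one name prefix. The unit names and the INITCTL override are hypothetical, added only so the dispatch logic can be exercised without a real upstart; "wmel check" could later shell out to the NRPE checks once they exist.]

```shell
# Sketch of a /usr/local/sbin/wmel management script (assumed design,
# not the deployed one). UNITS is a made-up list of upstart job names;
# INITCTL can be overridden for testing on a machine without upstart.
PREFIX="eventlogging-"
UNITS="forwarder processor multiplexer consumer"
INITCTL=${INITCTL:-initctl}

wmel() {
    action=$1
    case "$action" in
        status|start|stop|restart)
            # Fan the action out to every unit in the group,
            # mirroring supervisord's process-group commands.
            for u in $UNITS; do
                "$INITCTL" "$action" "${PREFIX}${u}"
            done
            ;;
        tail)
            # upstart logs job stdout/stderr under /var/log/upstart/
            tail -f /var/log/upstart/${PREFIX}*.log
            ;;
        *)
            echo "usage: wmel {status|start|stop|restart|tail}" >&2
            return 1
            ;;
    esac
}
```

This keeps the one-prefix naming Ryan_Lane suggested while restoring the group-scoped commands that supervisord provided.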
[07:23:28] :-)
[07:24:06] yes, probably a good idea
[07:24:15] * Ryan_Lane scoffs
[07:24:17] nagios checks
[07:24:21] * Ryan_Lane scoffs
[07:24:25] * Ryan_Lane should really go to sleep
[08:01:45] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset 0.00316131115 secs
[08:02:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:03:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[08:03:25] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset 0.00300860405 secs
[08:07:31] hi :)
[08:07:32] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[08:08:46] yesterday Azatoth introduced to me a project that let you easily build Debian packages under Jenkins :-D
[08:09:02] took like 2 hours, but I got pybal packaged via Jenkins!
[08:09:33] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Jun 27 08:09:25 UTC 2013
[08:09:33] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[08:17:02] New review: Hashar; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70322
[08:17:12] New patchset: Hashar; "beta: tweak $wgLoadScript to use the bits cache" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70322
[08:22:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:23:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.138 second response time
[08:48:44] New review: Hashar; "I have filled an issue upstream to have them add tags in git https://github.com/facebook/buck/issues/37" [operations/debs/buck] (master) - https://gerrit.wikimedia.org/r/70673
[08:57:44] New patchset: Hashar; "beta: removes incubator wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70804
[09:10:12] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[09:27:32] RECOVERY - Disk space on cp1048 is OK: DISK OK
[09:27:32] RECOVERY - RAID on cp1048 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[09:28:22] RECOVERY - DPKG on cp1048 is OK: All packages OK
[09:32:32] PROBLEM - Host cp1048 is DOWN: PING CRITICAL - Packet loss = 100%
[09:34:17] apergos: finally catching up with the puppet "modules and roles" thread in ops list :-)
[09:34:23] RECOVERY - Host cp1048 is UP: PING OK - Packet loss = 0%, RTA = 1.23 ms
[09:34:23] hh
[09:34:26] *heh
[09:38:00] !log Pooled new eqiad upload caches with 1% load
[09:38:09] Logged the message, Master
[09:44:07] upped to 5% now
[09:44:12] let's try not to overload swift today
[09:44:19] I'll try to keep the load below 1000 req/s
[09:52:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:54:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time
[10:12:06] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[10:13:42] 10%...
[10:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:23:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[10:52:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:53:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[11:02:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:03:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[11:10:06] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[11:22:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:23:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time
[11:42:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:43:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time
[12:06:49] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[12:13:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:14:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[12:22:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:23:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[12:30:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:32:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.154 second response time
[12:39:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:40:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.140 second response time
[12:46:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:47:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[12:54:04] re$
[12:55:44] New patchset: Hashar; "contint: explicitly require php5-dev" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70182
[13:03:40] Change abandoned: Hashar; "cant be rebased, will just redo that patch." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50064
[13:03:54] AzaToth: hi. So i saw you solved the problem with the buck repo. Did you had the project destroyed and recreated ?
[13:07:53] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[13:22:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:24:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[13:31:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:32:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.433 second response time
[13:37:41] New patchset: Hashar; "gerrit-wm: pywikibot/* events to #pywikipediabot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70780
[13:38:25] New review: Hashar; "good to go." [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/70780
[13:38:29] apergos: regarding ms-be9...disk2 was unconfigured good so I cleared the foreign cfg and added back. should be good to go now
[13:40:14] akosiaris: we wiped it totally so the old changesets where pruned
[13:40:27] but I didn't had to, it was just we decided to do so
[13:45:54] cmjohnson1: thanks, that's excellent
[13:47:33] New patchset: Hashar; "beta: adapt role::cache::varnish::upload" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70818
[13:47:48] New review: Hashar; "follow up in https://gerrit.wikimedia.org/r/70818" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50064
[13:51:05] err: /Stage[main]/Nrpe::Service/Service[nagios-nrpe-server]/ensure: change from stopped to running failed: Could not start Service[nagios-nrpe-server]: Execution of '/etc/init.d/nagios-nrpe-server start' returned 2: at /etc/puppet/manifests/nrpe.pp:108
[13:51:06] :D
[13:51:11] poooor nrpe
[13:52:22] Jun 27 13:50:18 uploadtest07 nrpe[15058]: Unable to open config file '/etc/icinga/nrpe.cfg' for reading
[13:52:22] Jun 27 13:50:18 uploadtest07 nrpe[15058]: Config file '/etc/icinga/nrpe.cfg' contained errors, aborting...
[13:52:25] yeah that does not help
[13:54:30] <^demon> qchris: I'm having a phone call, then we'll do this thing :)
[13:56:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:56:37] ^demon: Upgrading, upgrading, upgrading, ... Yay \o/
[13:58:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[13:58:36] New review: Hashar; "I have applied that change to uploadtest07.pmtpa.wmflabs instance. Varnish instances managed to st..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70818
[14:00:11] qchris: helllo :-]
[14:00:22] Hi hashar :-)
[14:00:27] I filled an issue against Buck to have them tag versions https://github.com/facebook/buck/issues/37
[14:00:39] I saw that. Thanks.
[14:00:45] and AzaToth has submitted a change that would package Buck for debian
[14:00:57] somewhere in Gerrit, maybe operations/debs/buck
[14:00:57] I am curious whether they'll add them.
[14:01:13] we just need to catch a Google VP now :-]
[14:01:23] Yes, I am just upgrading an Ubuntu instance so I can have Java 7 there, so I can test that.
[14:01:35] oh
[14:01:47] * hashar checks whether gallium has java 7
[14:01:57] java version "1.6.0_27"
[14:01:57] :(
[14:02:03] Building buck requires Java 7.
[14:02:07] No way around it :-(
[14:02:24] openjdk-7-jre-headless is installed
[14:02:24] Building gerrit will also require it :-(
[14:02:29] Ah. Ok.
[14:02:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:45] so we will have to points the Gerrit job to the java7 install
[14:03:08] iirc it is already configured in Jenkins and the java runtime to use can be set on a per job basis
[14:03:12] using some kind of droplist
[14:03:26] Yes, the jenkins maven plugin had that IIRC.
[14:03:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time
[14:03:43] unfortunately, there is no buck plugin for maven :-)
[14:03:53] s/maven/jenkins/
[14:04:03] freestyle job we will use :-)
[14:04:10] So it shall be.
[14:04:18] ooor
[14:04:27] you could write a Jenkins plugin to nicely integrate buck
[14:04:57] Ok, you rewrite Jenkins in Python, and I'll write the buck plugin :-)
[14:05:39] I guess free style jobs will have to do for now. Given I detest buck, I do not really want to make it easier for people to migrate to it.
[14:06:20] <^demon> qchris: So, we'll try to get the change for replication merged today. But here's an example of what we were hitting:
[14:06:24] <^demon> [2013-06-27 14:02:50,503] ERROR com.googlesource.gerrit.plugins.replication.ReplicationQueue : Failed replicate of refs/heads/sandbox/anomie/merged2 to gerritslave@antimony.wikimedia.org:/var/lib/git/mediawiki/extensions/CentralAuth.git: status REJECTED_NONFASTFORWARD
[14:06:44] Do we force push there?
[14:07:09] ^demon: Are force pushes not allowed on sandbox branches?
[14:07:43] <^demon> They are
[14:07:54] <^demon> I've just been trying to fix the replication of them :)
[14:07:58] <^demon> anomie: Sorry for the ping :)
[14:09:16] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[14:10:07] ^demon: antimony seems to still use "push" => "refs/*:refs/*"
[14:10:15] ^demon so no force push :-(
[14:10:30] https://gerrit.wikimedia.org/r/#/c/70457/
[14:10:36] ^ Should solve the problem
[14:10:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:10:54] <^demon> !log gerrit: running puppet and restarting service
[14:10:54] <^demon> qchris: Yeah, we need to merge the change you pushed for it
[14:11:03] Logged the message, Master
[14:11:16] PROBLEM - Disk space on ms-be1001 is CRITICAL: DISK CRITICAL - free space: / 5698 MB (3% inode=98%):
[14:11:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time
[14:13:38] gerrit is up again :-)
[14:14:10] <^demon> Yeah, I added JenkinsBot to stream events.
[14:14:30] \o/
[14:14:43] <^demon> paravoid: Could you take a look at https://gerrit.wikimedia.org/r/#/c/70457/?
[14:15:27] zool is doing work.
[14:15:36] Looks good.
[14:15:56] <^demon> Yeah, zuul's fine. hooks-bz is giving me an exception tho
[14:16:01] <^demon> (not the auth problem from the other day)
[14:16:14] <^demon> Blah, misread...not hooks-bz
[14:16:34] <^demon> http://p.defau.lt/?KntwieWufnulHVgZkWFlXQ - when I updated commit summary on https://gerrit.wikimedia.org/r/#/c/66665/ to test stream events
[14:17:08] That's hooks-its
[14:17:46] Looks like the hooks-its jar /with isDraft/, while we are now running the gerrit without isDraft
[14:17:58] http://quelltextlich.at/gerrit/hooks-bugzilla-2.7-SNAPSHOT-84f08e8-hooks-its-3b7d4be.jar
[14:18:08] ^ That's the hooks-bz with hooks-its without isDrfat
[14:18:24] ^demon: I am also going to add yet another git replication destination
[14:18:33] <^demon> hashar: ok
[14:18:45] ^demon: I am going to receive a second CI server that will be a Jenkins Slave. will need the repos there :-]
[14:19:00] <^demon> qchris: Ah, did I grab the wrong build of hooks-bz?
[14:19:32] ^demon: Looks like you took the new double shiny one (which requires a modded gerrit)
[14:20:01] ^demon: The one for the unmodded gerrit will still give us the new event handling.
[14:20:08] So that should work fine.
[14:20:31] <^demon> Reloaded the plugin
[14:21:43] New patchset: Mark Bergsma; "Prepare the bits cache manifests for the new eqiad servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70819
[14:22:31] New patchset: Mark Bergsma; "Prepare the bits cache manifests for the new eqiad servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70819
[14:22:58] <^demon> Getting the ACL on stream-events right + not deploying the draft updates seems to be a much smoother rollout than last attempt.
[14:23:11] <^demon> Gustaf's change has some flaws, methinks.
[14:23:47] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70819
[14:24:05] * qchris likes smooth upgrades
[14:28:15] <^demon> I hate stupid exceptions.
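[Editor's note: for context on the REJECTED_NONFASTFORWARD errors above, gerrit's replication plugin only force-pushes when the push refspec carries a leading "+". A sketch of what the fix in change 70457 presumably amounts to; the remote name is hypothetical and the destination details are copied from the error message quoted in the log:]

```ini
# replication.config on the gerrit master (sketch, not change 70457 itself).
# The leading "+" makes each replication push forced, so rewritten sandbox
# branches stop failing with REJECTED_NONFASTFORWARD on the mirror.
[remote "antimony"]
    url = gerritslave@antimony.wikimedia.org:/var/lib/git/${name}.git
    push = +refs/*:refs/*
```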
[14:28:37] <^demon> That IOException in org.apache.sshd.server.session.ServerSession has annoyed me since day 1.
[14:28:57] :-)
[14:29:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:29:54] New patchset: Mark Bergsma; "Install cp1056/57, cp1069/70 as bits caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70820
[14:30:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70820
[14:34:53] ^demon: Notifications in bugzilla work. Do we want to test upgrading to the new event/comment mechanism as well?
[14:35:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.639 second response time
[14:35:40] <^demon> That's https://gerrit.wikimedia.org/r/#/c/69475/, right?
[14:36:04] Yes.
[14:36:22] That change turns on the new mechanism (and does not yet turn off the old one, so well get double comments)
[14:36:36] https://gerrit.wikimedia.org/r/#/c/69476/
[14:36:44] <^demon> 476 is to turn it off, right?
[14:36:45] ^ will turn off the old comments
[14:36:48] <^demon> Yeah
[14:37:01] Yes. I split it, so we can selectively revert if needed.
[14:37:04] * ^demon finds something to bribe an opsen with
[14:39:15] * hashar hides
[14:41:21] * apergos peeks in
[14:41:33] Hi apergos :-)
[14:42:00] what's the bribe? :-D
[14:42:01] We currently upgrading gerrit and want to test switching to a new way to add comments to bugzilla
[14:42:10] I'm looking at the change now
[14:42:17] * qchris puts on nice smile
[14:42:23] Great! thanks.
[14:42:30] <^demon> +1
[14:42:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:44:31] <^demon> qchris: Logs still quiet :)
[14:44:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.801 second response time
[14:44:58] And no one screaming "gerrit does not work" :-)
[14:45:11] the only part of this I can reasonably review is the gerrit.pp change; the vm and the config file I dunno the syntax or what they do
[14:45:40] apergos: That should be fine. In case they cause problems, we can revert back
[14:45:44] New patchset: Mark Bergsma; "Update caching proxy list" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70822
[14:45:54] actually I had a few gerrit 404s earlier
[14:45:56] but seems ok now
[14:46:46] needs rebase maybe
[14:46:58] New patchset: Mark Bergsma; "Update caching proxy list" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70822
[14:47:04] says gerrit
[14:47:39] ^demon or qchris: ^^
[14:47:56] Change merged: Mark Bergsma; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70822
[14:48:00] New patchset: QChris; "Take advantage of hook-bugzillas new event mechanism" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69475
[14:48:03] <^demon> mark: You possibly hit it during the like 2 minutes it was restarting.
[14:48:21] New patchset: QChris; "Turn off hooks-bugzilla legacy event handling" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69476
[14:48:46] New review: Hashar; "Tried on a fresh instance uploadtest08.ptmpa.wmflabs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70818
[14:48:47] possible
[14:49:02] apergos: I rebased the changes now.
[14:49:06] !log mark synchronized wmf-config/squid.php
[14:49:07] yep saw it
[14:49:14] Logged the message, Master
[14:49:46] !log mark synchronized wmf-config/squid.php
[14:50:13] uh oh
[14:50:25] I appear not to be logged in and I don't see a way to log in now
[14:50:27] that's really weird
[14:50:35] ah wait
[14:50:59] Gerrit requires wiiiide monitors :-(
[14:51:42] New patchset: Mark Bergsma; "Add the new bits servers to the $active_nodes list" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70823
[14:52:11] grrrr
[14:52:18] "working....."
[14:52:31] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70823
[14:53:09] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69475
[14:53:57] merged on sockpuppet
[14:54:04] apergos: thanks!
[14:54:32] running puppet on manganese
[14:55:05] PROBLEM - Host cp1056 is DOWN: PING CRITICAL - Packet loss = 100%
[14:56:26] RECOVERY - Host cp1056 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[14:58:04] New patchset: Hashar; "beta: adapt role::cache::varnish::upload" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70818
[14:58:06] changes now live
[14:58:24] Seems to work: https://bugzilla.wikimedia.org/show_bug.cgi?id=44441#c4
[14:58:35] Fantastic apergos. Thanks.
[14:58:43] <^demon> Everything looks great. Thanks apergos
[14:58:53] sure
[14:59:35] Should we turn off the old style comments as well?
[14:59:41] <^demon> Prolly
[14:59:52] qchris: while you are around would be nice to list the git repo on which the change has been made :)
[14:59:59] but I should probably feel a bug about it
[15:00:29] Yes. Now with the new system, we can change the message as we want :-)
[15:00:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:00:44] \O/
[15:01:15] PROBLEM - Host cp1069 is DOWN: PING CRITICAL - Packet loss = 100%
[15:01:22] New review: Hashar; "PS2 adds some symbolic links for /srv/sda3 and /srv/sdb3 that points to /srv/vdb . Result:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70818
[15:01:32] apergos: Could you please have a look at https://gerrit.wikimedia.org/r/#/c/69476/ as well?
[15:01:46] RECOVERY - Host cp1069 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms
[15:01:54] It stops gerrit from commenting in the old style.
[15:02:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.010 second response time
[15:02:38] right
[15:02:58] New review: Hashar; "I did the host tweak on upload08 and the same curl commands used before. Works fine :-] So I guess ..." [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/70818
[15:03:15] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69476
[15:03:38] mark: finally rebased my beta upload::cache change. I got it tested in labs and that seems to work fine https://gerrit.wikimedia.org/r/70818 :-D
[15:03:45] PROBLEM - Host cp1070 is DOWN: PING CRITICAL - Packet loss = 100%
[15:04:05] mark: I added you as a reviewer already
[15:04:12] excellent
[15:04:41] mark: there is still a bit of a hack for /srv/sda3 :-) resolved that by creating symlinks hehe
[15:04:58] why do you need the hack?
[15:04:58] qchris: change is live
[15:05:05] Thanks!
[15:05:11] * qchris hugs apergos
[15:05:15] RECOVERY - Host cp1070 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[15:05:41] New patchset: Hashar; "erb: cast string to array for ruby 1.9" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54692
[15:05:44] New patchset: Hashar; "Change link in notifyNewProjects to HTTPS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64462
[15:06:12] hashar: just use /srv/vdb directly?
[15:06:16] qchris: ^demon we have a new [Cherry Pick To button] \O/
[15:06:26] :-D
[15:06:31] mark: yeah I thought about that, but I would have to move the storage to a if ( :: realm )
[15:06:56] mark: and copy paste the default line for labs then replace the sda / sdd by vdb
[15:06:58] didn't we do that for the other clusters already?
[15:07:08] mark: Ithought it was easier to read / understand by using symlinks
[15:07:30] parsoid has it like that
[15:07:33] just copy it?
[15:07:55] mobile too
[15:08:03] then if you change the production one, the labs one will be out of sync :-D
[15:08:03] why do something different now?
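[Editor's note: the realm-conditional storage mark suggests, as the parsoid and mobile caches reportedly already do, might look roughly like this. The variable name, device paths, and sizes are illustrative assumptions, not the real role::cache manifests:]

```puppet
# Sketch: pick varnish storage per realm instead of symlinking
# /srv/sda3 -> /srv/vdb on labs instances (names are illustrative).
if $::realm == 'production' {
    $storage = '-s sda3=persistent,/srv/sda3/varnish.persist,100G -s sdb3=persistent,/srv/sdb3/varnish.persist,100G'
} else {
    # labs instances only have the virtual /srv/vdb disk
    $storage = '-s vdb=persistent,/srv/vdb/varnish.persist,10G'
}
```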
[15:08:47] ok ok :-) [15:08:49] will amend [15:09:05] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [15:09:07] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [15:09:07] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [15:09:07] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [15:09:07] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [15:09:07] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [15:09:07] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [15:09:08] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [15:09:08] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours [15:09:08] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [15:09:09] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [15:11:21] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [15:14:33] New patchset: Hashar; "beta: adapt role::cache::varnish::upload" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70818 [15:17:47] New patchset: Mark Bergsma; "Update ganglia aggregators for bits & upload caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70825 [15:19:25] New patchset: Hashar; "beta: adapt role::cache::varnish::upload" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70818 [15:20:02] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70825 [15:21:52] New review: Hashar; "fixed some puppet parsing error." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/70818 [15:22:08] mark: got it fixed :-] [15:22:13] thank you [15:22:17] i'll change some other things too [15:22:19] in the vcl [15:22:32] yeah you told me there was a nicer way to handle the upload domain difference [15:22:43] I haven't found out a better way though :( [15:22:46] hopefully anyway [15:23:01] or we could use hiera() :-D [15:23:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:24:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.138 second response time [15:25:48] Change merged: Akosiaris; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70457 [15:26:50] ^demon: the replication problem should be solved now ^ [15:27:24] (Next to puppet run and maybe restarting the plugin) [15:28:43] <^demon> Sweet [15:29:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:32:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.303 second response time [15:40:33] !log Pooled new eqiad bits caches [15:40:42] Logged the message, Master [15:42:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:44:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.690 second response time [16:04:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.587 second response time [16:05:06] !log Depooled row C bits caches, repooled old bits caches to investigate IPv6 problem [16:05:10] Logged the message, Master [16:05:10] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:05:33] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.054 second 
response time [16:06:59] ottomata, I note that you imported the rsync module -- do you know about how rsyncd.conf works? [16:07:04] The upstream module sets uid and gid to 'nobody' and can't be configured otherwise… our existing uses of rsyncd specify gids and uids. [16:07:22] I'm wondering if that's important, or if I can get by just specifying those in the subsections of the conf file [16:09:32] we imported that directly , i think you can modify the module as suits your needs [16:10:32] PROBLEM - Disk space on ms-be1002 is CRITICAL: DISK CRITICAL - free space: / 5688 MB (3% inode=98%): [16:11:38] !log reedy synchronized php-1.22wmf9/ [16:11:46] Logged the message, Master [16:13:18] !log reedy synchronized php-1.22wmf9/extensions/DataValues [16:13:25] Logged the message, Master [16:13:56] !log reedy synchronized php-1.22wmf9/extensions/Diff [16:14:04] hashar: thanks for fixing up my patchset. can i also talk to you about setting up jenkins tests for pywikibot? [16:14:04] Logged the message, Master [16:14:47] !log reedy synchronized php-1.22wmf9/extensions/Wikibase/ [16:14:55] Logged the message, Master [16:18:37] !log reedy synchronized php-1.22wmf9/extensions/WikibaseDataModel [16:18:45] Logged the message, Master [16:22:21] legoktm: I think I added a few basic tests already [16:22:28] legoktm: can definitely add mroe [16:23:26] hashar: can i see where those ones are listed? [16:24:15] the main one is to run our test suite with "python setup.py test", but that requires creating a file in ~/.pywikibot/user-config.py first [16:24:39] legoktm: hold on brb in a few minutes [16:24:51] ok [16:27:49] !log Corrected IPv6 problem, repooled the new servers, depooled the old bits servers [16:27:57] Logged the message, Master [16:29:09] !log reedy Started syncing Wikimedia installation... : test2wiki to 1.22wmf9 rebuild l10n cache [16:29:18] Logged the message, Master [16:29:46] ottomata, ok, but you don't know what the uid in the header of that conf file affects vs. 
the uid in the sections? [16:29:57] naw, i don't [16:30:46] New patchset: Mark Bergsma; "Correct regex" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70838 [16:31:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70838 [16:32:23] New patchset: Reedy; "Fixup writing of newlines and done to make output consistent and sensible" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70839 [16:32:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:33:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.264 second response time [16:35:14] New patchset: Akosiaris; "Introducing bacula module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70840 [16:37:56] New patchset: Reedy; "Add missing done to syntax/lint checking" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70841 [16:38:29] PROBLEM - Apache HTTP on mw1070 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:39:19] RECOVERY - Apache HTTP on mw1070 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.229 second response time [16:39:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:40:13] ^demon: Jenkins seems to be really slow this morning [16:40:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.462 second response time [16:40:40] <^demon> jenkins? or gerrit? 
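[editor's note] On andrewbogott's rsyncd.conf question above: parameters set before the first module section act as global defaults, and a module section can override them, so uid/gid can indeed be specified per module. A hedged sketch (the module name and paths are illustrative, not the actual swift configuration):

```
uid = nobody            # global default; applies to every module below
gid = nobody

[swift]
    path = /srv/swift
    read only = no
    uid = swift         # per-module override; takes precedence over the global uid
    gid = swift
```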
[16:40:50] New patchset: Reedy; "Remove superfluous comment which is repeated in sync-wikiversions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70842 [16:41:29] New patchset: Reedy; "Remove superfluous comment which is repeated in sync-wikiversions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70842 [16:41:30] ^demon: a merge takes a long time, apparently waiting for zuul [16:41:33] https://gerrit.wikimedia.org/r/#/c/70797/ [16:41:44] https://gerrit.wikimedia.org/r/#/c/70826/ [16:42:11] <^demon> Hmm, well gerrit got upgraded this AM, but it's been fine. [16:42:15] <^demon> hashar: Jenkins ok? [16:42:38] Aaron|home: hey [16:42:40] * hashar looks at status page at https://integration.wikimedia.org/zuul/ [16:42:46] 0 results, 0 events pending [16:42:56] suspicious [16:44:09] <^demon> Nothing in zuul log looks suspicious. [16:44:12] <^demon> to me [16:44:15] hmm [16:44:18] it got locked at some point [16:44:26] while 2013-06-27 16:42:04,304 INFO zuul.Gerrit: Getting information for 70842,2 [16:44:46] !log reedy Finished syncing Wikimedia installation... : test2wiki to 1.22wmf9 rebuild l10n cache [16:44:55] Logged the message, Master [16:45:10] 2013-06-27 16:43:58,338 DEBUG zuul.Scheduler: Adding trigger event: [16:45:30] that is the next event [16:45:42] ^demon: I guess Zuul was waiting for some Jenkins API Query to complete [16:45:47] hashar: have you done any more on the debs? [16:46:07] ^demon: the API query logs are in /var/log/jenkins/access.log [16:46:28] <^demon> k. [16:47:56] so I guess the usual slowness [16:48:08] I got a bunch of patches on python-jenkins to make some api queries a bit faster [16:48:13] have yet to test them out though [16:48:22] AzaToth: nop. Might have a look at it next week [16:48:25] ok [16:50:33] paravoid: hi [16:52:32] Can someone merge etc https://gerrit.wikimedia.org/r/#/c/67274/ ? It's getting annoying having to fix the new file every week. Thanks! 
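[editor's note] Regarding legoktm's earlier point that "python setup.py test" needs a ~/.pywikibot/user-config.py first: a minimal file might look like the sketch below. The values are placeholders, pywikibot itself injects the `usernames` dict when executing the file, and a temp HOME is used here so nothing real is written.

```shell
set -e
# Create a throwaway HOME and a minimal pywikibot user config in it.
HOME=$(mktemp -d)
mkdir -p "$HOME/.pywikibot"
cat > "$HOME/.pywikibot/user-config.py" <<'EOF'
# Placeholder pywikibot settings; pywikibot predefines the usernames
# dict when it executes this file.
mylang = 'en'
family = 'wikipedia'
usernames['wikipedia']['en'] = 'ExampleBot'
EOF
cat "$HOME/.pywikibot/user-config.py"
```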
[16:52:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:53:06] Coren: Could you merge https://gerrit.wikimedia.org/r/#/c/67274/1 (see Reedy's comment)? [16:53:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [16:56:33] New patchset: Andrew Bogott; "Convert swift's rsyncd from generic::rsyncd to the new rsync module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70846 [16:56:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:00:13] New review: Andrew Bogott; "(the log file still isn't set properly here.)" [operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/70846 [17:00:38] !log LocalisationUpdate completed (1.22wmf8) at Thu Jun 27 17:00:37 UTC 2013 [17:00:47] Logged the message, Master [17:01:20] !log LocalisationUpdate completed (1.22wmf7) at Thu Jun 27 17:01:20 UTC 2013 [17:01:29] Logged the message, Master [17:02:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.874 second response time [17:03:00] hashar: zuul still did not get to jobs submitted over an hour ago, and shows an empty queue again [17:03:28] gwicke: maybe it got lost ? Gerrit got restarted a few times earlier [17:05:12] ok, retrying [17:06:22] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [17:10:06] hashar: no luck again [17:10:27] now https://integration.wikimedia.org/zuul/ shows "Queue only mode: preparing to reconfigure, queue length: 1" [17:10:44] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [17:10:45] !log Jenkins/Zuul is no more merging changes due to --force-message options no more being available in Gerrit :/ [17:10:54] Logged the message, Master [17:11:01] wha? 
[17:11:07] we need CI for our CI [17:11:43] !log LocalisationUpdate completed (1.22wmf9) at Thu Jun 27 17:11:43 UTC 2013 [17:11:52] Logged the message, Master [17:12:50] !log Zuul/Jenkins merging bug is {{bug|50300}} [17:12:59] Logged the message, Master [17:15:23] RoanKattouw: Gimme a min. (was at lunch) [17:15:49] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jun 27 17:15:49 UTC 2013 [17:15:50] No worries [17:15:54] New review: coren; "Simple enough." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/67274 [17:15:55] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67274 [17:15:57] Logged the message, Master [17:16:01] No particular rush, it's just been waiting for a long time [17:16:03] Oh, there it goes [17:16:05] Thanks man [17:16:20] RoanKattouw: All done. [17:17:15] * Coren goes to the vet with the dogs for their checkup. [17:20:12] !log zuul: removing --force-message from layout {{gerrit|70849}} and reloading zuul. 
Caused {{bug|50300}} [17:20:23] Logged the message, Master [17:23:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [17:30:01] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: test2wiki back to 1.22wmf8 [17:30:09] Logged the message, Master [17:35:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [17:36:38] !log reedy synchronized php-1.22wmf7/extensions/Wikibase 'Updating to master of 1.22wmf6 branch' [17:36:48] Logged the message, Master [17:37:27] !log reedy synchronized php-1.22wmf8/extensions/Wikibase 'Updating to master of 1.22wmf6 branch' [17:37:37] Logged the message, Master [17:42:15] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: test2wiki back to 1.22wmf9 [17:42:24] Logged the message, Master [17:48:37] New patchset: Ottomata; "Adding Adam Baso to stats group on analytics nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70854 [17:49:08] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70854 [17:55:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.144 second response time [17:58:18] ^demon: loading patch pages seems to be broken with FF and latest gerrit [17:58:28] <^demon> gwicke: Roan thought so too. [17:58:31] <^demon> I couldn't replicate. 
[17:58:38] <^demon> And it worked for him when he logged out [17:58:39] both subbu and me see the same [17:58:45] even after a restart of the browser [17:58:51] works in Chromium though [17:59:06] <^demon> Yeah, that's what Roan said. [17:59:25] <^demon> I tried FF 21 and 22 on this machine and couldn't replicate. [17:59:54] I'm on 20 [18:00:13] Mozilla/5.0 (X11; Linux x86_64; rv:20.0) Gecko/20100101 Firefox/20.0 Iceweasel/20.0 [18:00:37] <^demon> Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:22.0) Gecko/20100101 Firefox/22.0 [18:01:05] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.22wmf8 [18:01:14] Logged the message, Master [18:02:05] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Rest of wikipedias to 1.22wmf8 [18:02:13] Logged the message, Master [18:03:19] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki, mediawikiwiki and testwikidatawiki to 1.22wmf9 [18:03:27] Logged the message, Master [18:03:32] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:03:33] bits down? [18:03:50] AzaToth: no [18:03:53] I'm also waiting on bits.... [18:04:01] New patchset: Reedy; "test2wiki to 1.22wmf9" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70855 [18:04:01] New patchset: Reedy; "testwiki, mediawikiwiki and testwikidatawiki to 1.22wmf9" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70856 [18:04:11] I've seen some requests fail.. 
[18:04:14] ori-l: well, it seems to be fuck'd [18:04:22] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.065 second response time [18:04:35] wfm *shrug* [18:04:43] New patchset: Krinkle; "Enable VE experimental mode on test2wiki per Bug 49963" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70440 [18:05:01] loaded now at least [18:05:01] RoanKattouw: https://gerrit.wikimedia.org/r/#/c/70440/2 [18:05:05] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70855 [18:05:10] was prolly some intermittent [18:05:32] New patchset: Reedy; "Wikipedias to 1.22wmf8" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70856 [18:05:33] New review: Krinkle; "It's great Doc. Thanks!" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70440 [18:05:34] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:05:41] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70856 [18:06:12] ^demon: is there a bug for it already? [18:06:22] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.815 second response time [18:06:33] <^demon> gwicke: No bug, no. [18:06:36] ^demon: eh? Verified and Code-Review are swapped now. Any idea why? [18:06:43] <^demon> They were before? [18:06:54] <^demon> Oh, they're consistent again. [18:07:21] <^demon> They've been broken for awhile, showing different on the dashboard and change page. 
[18:07:51] Verified happens on submission, CR afterwards [18:08:21] http://paste.debian.net/13009/ [18:08:22] I know we do it the other way around for new users, but that's a temporary exception for security reasons until we enter the next phase with Jenkins [18:08:29] 503 on bits again [18:08:38] (and no, I've not made twinkle too big again) [18:08:41] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:08:50] AzaToth: works for me [18:08:52] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [18:09:15] Krinkle: it's intermittent [18:09:20] there's a lot of servers :) [18:09:31] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.057 second response time [18:10:22] why not just use one big server? ツ [18:11:01] ^demon: https://bugzilla.wikimedia.org/show_bug.cgi?id=50309 [18:18:05] New patchset: Andrew Bogott; "Convert swift's rsyncd from generic::rsyncd to the new rsync module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70846 [18:18:05] New patchset: Andrew Bogott; "Add support for specifying a global log rsyncd log file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70858 [18:21:54] New review: Andrew Bogott; "For reference... the old rsyncd.conf file looked like this:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70846 [18:26:23] AzaToth: Yep, you guessed it. 
http://greg.porter.name/wordpress/wp-content/uploads/2009/10/Big-FC-Server.jpg [18:26:46] New patchset: Reedy; "add WikibaseDataModel extension to extension-list" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70829 [18:27:08] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70829 [18:27:55] New patchset: Reedy; "Clean the aliases of Proofread Page managed namespaces" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70377 [18:29:06] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70377 [18:29:30] New patchset: Reedy; "Fix links to Gitweb in highlight.php on noc.wikimedia.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70463 [18:30:05] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70463 [18:31:01] PROBLEM - Puppet freshness on mw1039 is CRITICAL: No successful Puppet run in the last 10 hours [18:31:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:32:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.151 second response time [18:34:14] AzaToth: like that big spike at 16.20? 
https://gdash.wikimedia.org/dashboards/reqerror/ [18:34:32] [assuming it's the right graph] [18:34:48] !log Created Echo tables on enwikivoyage [18:34:57] Logged the message, Master [18:41:03] !log reedy synchronized wmf-config/InitialiseSettings.php [18:41:12] Logged the message, Master [18:42:50] New patchset: Reedy; "Install Thanks and Echo extensions on enwikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70861 [18:43:26] !log reedy synchronized wmf-config/InitialiseSettings.php [18:45:26] New patchset: Reedy; "Add Extension:NewUserMessage to de.wikiversity" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70862 [18:46:00] !log reedy synchronized wmf-config/InitialiseSettings.php 'Add Extension:NewUserMessage to de.wikiversity' [18:46:08] Logged the message, Master [18:46:48] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70862 [18:48:11] If I was interested in making https://github.com/wikimedia/puppet-jmxtrans available to us, what would I have to do? [18:51:09] manybubbles: Do you mean by "us" as in use it on the WMF cluster/similar? [18:51:51] Reedy: yes. I mean to use it in labs and eventually on the production cluster. [18:54:37] manybubbles: Presumably add it as a submodule at modules/jmxtrans for starts [18:54:41] And throw things at ottomata [18:55:14] ok. I think we'd also need to package jmxtrans but that might be reasonably simple. [18:55:31] Ryan_Lane: if i wanted to use git-deploy to deploy ishmael instead of deb packaging it, are the directions on https://wikitech.wikimedia.org/wiki/Sartoris still good for setting it up? [18:55:44] good question [18:55:45] I'm just not sure how we go about packaging anything. is there a page for it? [18:56:02] manybubbles: ottomata might have done something like that, I'm presuming he created those manifests for analytics usage.. 
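[editor's note] Reedy's suggestion above — "add it as a submodule at modules/jmxtrans" — mechanically looks like the following. This is a sandbox demo with local temp repos standing in for gerrit's operations/puppet and operations/puppet/jmxtrans; the real change would use the gerrit URLs.

```shell
set -e
# Identity for the demo commits only.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
tmp=$(mktemp -d)

# A stand-in for the jmxtrans module repo.
git init -q "$tmp/jmxtrans"
git -C "$tmp/jmxtrans" commit -q --allow-empty -m "initial commit"

# A stand-in for operations/puppet; add jmxtrans under modules/.
git init -q "$tmp/puppet"
cd "$tmp/puppet"
# Newer git refuses file:// submodule clones by default; allow it for the sandbox.
git -c protocol.file.allow=always submodule --quiet add "$tmp/jmxtrans" modules/jmxtrans
git commit -q -m "Add jmxtrans as a submodule at modules/jmxtrans"
cat .gitmodules
```

`git submodule add` stages both the .gitmodules entry and the gitlink, so the single commit records which jmxtrans revision operations/puppet points at.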
[18:56:17] manybubbles: https://wikitech.wikimedia.org/wiki/Help:Packaging_software [18:56:20] Ryan_Lane: i'm also wondering if any manual salt configuration is needed? [18:56:20] binasher: seems I don't have info about adding new repos [18:56:45] there's a small amount of manual bootstrapping necessary for brand new repos, yes [18:56:52] manybubbles [18:57:05] if you can work on getting the .deb into our apt [18:57:07] binasher: it's mostly managed by puppet [18:57:26] ottomata: I can totally work on that. [18:57:31] I can walk you through it and document it while doing so [18:57:44] ottomata: I want so badly to have those graphs for solr. [18:58:02] i can get jmxtrans submitted for review [18:58:05] i actually already worked on it [18:58:07] just need to push it [18:58:12] (puppet module) [18:58:25] binasher: in puppet, it's configured via manifests/role/deployment.pp [18:58:52] ottomata: super! I'll go read that packaging stuff and get it ready [18:59:09] binasher: this has no submodules, right? [18:59:16] right [19:11:57] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [19:18:47] ottomata: were you guys working on puppet for zookeeper? [19:19:46] yes, that should be in [19:19:53] and useable [19:20:11] https://gerrit.wikimedia.org/r/#/admin/projects/operations/puppet/zookeeper [19:28:07] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 215 seconds [19:30:18] PROBLEM - Varnish HTTP mobile-frontend on cp3011 is CRITICAL: HTTP CRITICAL - No data received from host [19:32:07] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [19:32:58] Reedy, did you sync docroot/noc/conf/highlight.php? [19:33:03] No [19:33:06] noc is on fenari [19:33:34] I did notice the updated urls didn't seem to work [19:34:00] I'm not getting the new URLs... 
hmm [19:34:17] RECOVERY - Varnish HTTP mobile-frontend on cp3011 is OK: HTTP OK: HTTP/1.1 200 OK - 707 bytes in 0.178 second response time [19:34:18] Hmm [19:34:20] When did that change [19:34:21] DocumentRoot /usr/local/apache22/htdocs/noc [19:35:01] got them now [19:35:12] yeah, I'm running sync-common on fenari [19:36:21] DocumentRoot /usr/local/apache22/htdocs/noc [19:37:07] That doesn't even exist [19:37:08] reedy@fenari:/home/wikipedia/htdocs$ ls -al /usr/local/apache22/htdocs/noc [19:37:08] ls: cannot access /usr/local/apache22/htdocs/noc: No such file or directory [19:38:37] I wonder how that's even working.. [19:39:44] I wonder if I should just ignore this. [19:43:06] New patchset: QChris; "Make hook-bugzilla act on "bug" footers as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70870 [20:05:06] Reedy, greg-g all OK for E3 to begin its deploy ? [20:05:28] Yeah [20:05:39] I've not been doing anything deploy related for a couple of hours now [20:05:57] Just looking at fixing up crap the users broke ;) [20:06:51] just the typical day in the life of Reedy :) [20:07:28] Open shell bugs < 50 now! [20:08:16] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [20:10:11] Reedy: accidental submit before being done https://bugzilla.wikimedia.org/show_bug.cgi?id=49189#c2 ? [20:11:19] Or I go distracted and thought I'd finished.. [20:11:42] heh, or that [20:19:59] Zuul/Jenkins has 59 events in the queue [20:20:17] !log jenkins: migrating mediawiki-core-phpunit-api from master to slaves [20:20:25] Logged the message, Master [20:20:48] New patchset: Andrew Bogott; "Convert swift's rsyncd from generic::rsyncd to the new rsync module." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/70846 [20:20:48] New patchset: Andrew Bogott; "Add support for specifying a global rsyncd log file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70858 [20:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.135 second response time [20:23:24] spagewmf: so yeah Zuul is a bit busy :-] That is the hour when l10n-bot is submitting a bunch of changes :-] [20:24:19] NP, though for some reason I thought it ran around 1am PDT [20:24:54] I ordered ops a 2nd jenkins server :-] [20:25:11] I worked a bit this week to make it possible on Jenkins isde [20:25:21] now we just have to get the server installed and we will be fine :-) [20:31:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:38] !log jenkins: migrating mediawiki-core-phpunit-databaseless form master to slaves [20:32:46] Logged the message, Master [20:33:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [20:34:11] hashar: piuparts [20:34:56] AzaToth: what is that ? [20:35:05] http://wiki.debian.org/piuparts [20:35:24] hehe [20:35:27] default automagical setup of jenkins-debian-glue createsa a piuparts job [20:35:42] will have to do that later on I guess [20:35:48] !log spage synchronized php-1.22wmf9/extensions/GuidedTour 'updating GuidedTour in 1.22wmf9' [20:35:57] I will already be busy enough converting my yesterday hack in a normal job [20:35:58] Logged the message, Master [20:36:11] hashar sounds good. Zuul queue 0! BTW, join #wikimedia-e3 and type "!jenkins!%$#!!! " (thanks to Ori) [20:39:07] New patchset: Ottomata; "jmxtrans puppetization." 
[operations/puppet/jmxtrans] (master) - https://gerrit.wikimedia.org/r/70915 [20:40:23] New patchset: Ottomata; "jmxtrans puppetization." [operations/puppet/jmxtrans] (master) - https://gerrit.wikimedia.org/r/70915 [20:40:51] manybubbles: https://gerrit.wikimedia.org/r/#/c/70915/ [20:41:20] hashar: http://awesomescreenshot.com/02f1g1tb38 [20:41:20] thanks. Strill struggling with packaging. It might look like it is packaged normally but everything is backwards. [20:41:50] !log spage synchronized php-1.22wmf8/extensions/GuidedTour 'updating GuidedTour in 1.22wmf8' [20:41:58] !log deployment-prep jenkins: migrating mediawiki-core-phpunit-misc from master to slaves [20:41:59] Logged the message, Master [20:42:08] Logged the message, Master [20:42:19] AzaToth: sounds easy :-) [20:42:34] AzaToth: do you have a link to debian-glue related documentation? [20:42:36] yup [20:42:43] FWIW sync-dir reports "mw1173: ssh: connect to host mw1173 port 22: Connection timed out" [20:42:45] I will fill a bug about it to remember about that [20:43:53] !log spage synchronized php-1.22wmf8/extensions/Campaigns 'updating Campaigns in 1.22wmf8' [20:44:02] Logged the message, Master [20:45:16] AzaToth: logged https://bugzilla.wikimedia.org/show_bug.cgi?id=50318 [20:45:20] hashar: doesn't seems to exists a specific docs for that [20:45:30] I attached your screenshot to it [20:45:36] .. to the bug [20:46:19] I notice the screenshotter fnucked up in the middle due to the jenkins bottom overlay [20:47:10] hashar, FYI one of my branch extension updates failed in Jenkins doxygen: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-publish/1334/console . No worries, it merged [20:48:15] spagewmf: oops [20:48:30] spagewmf: looks like it always failed on wmf branches [20:48:48] spagewmf: would you mind filling a but about it against Wikimedia > Continuous integration please? 
:-) [20:48:54] just copy paste the console + URL [20:49:08] and summary something like: Jenkins: doxygen doesn't build on wmf branches [20:49:14] that would be sweet :-] [20:50:26] you're right, the doxygen job shows up later so I missed it on the other commits. Will file a bug. [20:50:56] thank you! [20:51:42] !log jenkins: migrating mediawiki-core-phpunit-parser from master to slaves [20:51:51] Logged the message, Master [20:53:01] hashar BTW I notice there are multiple jobs running for these updates, e.g. https://gerrit.wikimedia.org/r/#/c/70910/ : a Verified+2, a Starting gate-and-submit that seems to run identical jobs, and then the doxygen which repeates mediawiki-core-lint. [20:53:38] hashar: sad there aint any xml to yaml converter ヾ [20:53:57] spagewmf: so looking at that one [20:54:13] spagewmf: first jenkins-bot result is the one triggered by the patchset being submitted [20:54:26] spagewmf: the second one is the gate-and-submit result [20:54:38] spagewmf: the last one, I have NO idea :-] [20:55:01] spagewmf: ahhh [20:55:15] spagewmf: the last one is are tests being run after merge [20:55:36] we should add the pipeline name in the message [20:56:14] hashar can the Zuul Skynet AI™ notice the jobs and/or the individual CI commands are the same and magically coalesce them? [20:57:04] seems piuparts runs a chroot in a chroot [20:58:42] spagewmf: not really [20:58:58] spagewmf: zuul has a concept of a pipeline, the change enter in it and the result is a notification sent back to gerrit [20:59:12] the pipelines do not interact with each others [20:59:19] and there is no central process to report back to gerrit [21:00:41] spagewmf: might one day find a way to aggregates the different messages. [21:00:50] I am out for now :-) have a good afternoon everyone [21:00:51] someone had a patch to add the coalesce feature to Zuul and the head of Intel's server chip business had him killed. 
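[editor's note] hashar's explanation of the three jenkins-bot comments maps onto Zuul's layout configuration roughly as below. This is a hedged sketch in upstream Zuul v2 layout syntax, not the actual Wikimedia layout; pipeline managers and trigger details are illustrative.

```yaml
pipelines:
  - name: check                # first comment: runs when a patchset is uploaded
    manager: IndependentPipelineManager
    trigger:
      gerrit:
        - event: patchset-created

  - name: gate-and-submit      # second comment: runs again on approval, then submits
    manager: DependentPipelineManager
    trigger:
      gerrit:
        - event: comment-added
          approval:
            - code-review: 2

  - name: postmerge            # third comment: tests run after the change merged
    manager: IndependentPipelineManager
    trigger:
      gerrit:
        - event: change-merged
```

Each pipeline reports back to gerrit independently, which is why hashar says there is no central process that could coalesce the three messages.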
[21:01:05] hehe :-) [21:01:14] hashar: http://192.168.20.71:8080/job/jenkins-debian-glue-piuparts/1/tapResults/? [21:01:15] goodnight, thanks! [21:01:24] oops, wrong click [21:01:25] AzaToth: can't access a private address :D [21:01:58] spagewmf: and make sure to fill a bug about Doxygen not working for wmf branches! [21:02:47] hashar: can you access http://azatoth.net:8080/job/jenkins-debian-glue-piuparts/1/tapResults/? ? [21:04:30] yup [21:04:47] AzaToth: you should attach that to the bug report I opened ! [21:05:44] AzaToth: looks nice thanks! [21:05:59] I am off for real now! see you tomorrow or next week :] [21:07:59] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [21:09:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:10:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [21:32:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:33:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.135 second response time [21:40:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:41:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [21:54:32] New review: Faidon; "How silly of them to not allow arbitrary settings or a custom template. Oh well..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/70858 [21:56:10] New review: Faidon; "Awesome! Thanks, feel free to merge this whenever its dependency gets merged!" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/70846 [21:57:55] binasher: were all the job tables truncated? 
[21:57:57] * Aaron|home can't remember [21:59:21] i don't either, hmm [21:59:42] Aaron|home: nope [21:59:47] Aaron|home: want me to? [21:59:52] enjoy :) [21:59:56] woooo [22:00:05] Aaron|home: hey [22:00:17] Aaron|home: when you have some spare time, your input on https://wikitech.wikimedia.org/view/Media_storage would be greatly appreciated :) [22:00:21] binasher: try not to drop `page` though [22:00:30] I'm sure there are inaccuracies in there [22:00:41] like doc review? [22:00:53] I guess :) [22:01:05] also https://wikitech.wikimedia.org/wiki/Ceph although that'd be less interesting I'm guessing [22:01:53] next on the doc todo is to clean up all those swift dev pages that talk about ms7 or whatever :) [22:02:06] deploy plans etc. [22:02:31] and update swift to reflect reality, while moving the more general parts into Media_storage [22:02:47] and have swift be software-centric [22:03:45] Aaron|home: oh crap.. *double checks the contents of truncate-job.sql* [22:04:13] paravoid: so varnish no longer falls back to squid in any way? [22:04:17] for uploads [22:04:50] New review: Faidon; "I can't comment on jmxtrans itself (e.g. the config file) so whatever you say on that :-) puppet-wis..." [operations/puppet/jmxtrans] (master) C: 1; - https://gerrit.wikimedia.org/r/70915 [22:07:09] TimStarling: hi. did you mean to remove php-mail and php-mail-mime packages or was it just a side-effect when removing apaches.pp ? (RT-5338) [22:07:17] bsitu: ^ hey [22:07:20] Coren: Is it by design that `become` doesn't source the bashrc of the target? [22:07:40] I can understand that `sudo -su` doesn't do it as it preserves the current shell, but become creates a new one. [22:07:40] mutante: hi, saw your email, thx [22:08:05] * Aaron|home reads through puppet [22:08:16] Aaron|home: no [22:08:17] Krinkle: That's what become /does/.
:-) [22:08:18] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [22:08:32] Coren: Yes, but it doesn't source the target bashrc [22:08:33] Krinkle: It sources .profile and friends rather. [22:08:42] oh, so it does have that ability [22:08:44] Krinkle: Yes, that's normal -- it's a login shell. [22:08:54] ah, yeah, I confused bashrc and bash_profile [22:08:56] Krinkle: I moved the PATH setting to .profile [22:09:10] Or, I think I sourced bashrc from profile? I forget [22:09:15] Krinkle: What one normally does is source the .bashrc from the .profile so that it gets sourced both ways. [22:09:32] yeah, that's what I do normally as well [22:09:48] I just rarely re-create such a setup from scratch so I only did bashrc :) [22:10:11] https://github.com/Krinkle/dotfiles/blob/master/templates/bash_profile https://github.com/Krinkle/dotfiles/blob/master/templates/bashrc [22:10:13] thx :) [22:10:21] mutante: I only changed the search indexers [22:10:40] are you saying that the search indexers used php-mail and php-mail-mime, but not the main application servers?
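The .profile/.bashrc arrangement Coren and Krinkle settle on can be demonstrated end to end. The script below is a minimal sketch using a throwaway directory as a stand-in for a home directory; the file contents and variable names are illustrative, not taken from anyone's actual dotfiles:

```shell
#!/bin/sh
# A login shell (such as the one `become` starts) reads ~/.profile rather
# than ~/.bashrc, so the usual fix is to source ~/.bashrc from ~/.profile.
demo_home=$(mktemp -d)

# Interactive settings live in .bashrc.
cat > "$demo_home/.bashrc" <<'EOF'
export FROM_BASHRC=yes
EOF

# The login-shell startup file pulls them in.
cat > "$demo_home/.profile" <<'EOF'
if [ -n "$BASH_VERSION" ] && [ -f "$HOME/.bashrc" ]; then
    . "$HOME/.bashrc"
fi
EOF

# A login shell now sees the .bashrc settings too.
result=$(HOME=$demo_home bash --login -c 'echo "FROM_BASHRC=$FROM_BASHRC"' 2>/dev/null \
         | grep '^FROM_BASHRC=' | tail -n 1)
echo "login shell reports: $result"
rm -rf "$demo_home"
```

Without the sourcing line in `.profile`, the same login shell would see `FROM_BASHRC` empty, which is exactly the surprise Krinkle ran into.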
[22:11:43] TimStarling: so far i just saw the request to add those packages and remembered i added them a couple months ago already, then i saw apaches.pp, which had them, had been deleted and then looked at git log [22:12:08] well, when you duplicate your configuration, mistakes are going to happen [22:12:23] that's one of the reasons why I wanted to remerge [22:12:50] but I think most of the work was done by notpeter, like I say, I only cleaned up one last usage of apaches.pp [22:13:03] ok, just making sure you weren't opposed to using those packages in general or something [22:13:19] so obviously anything that was in apaches.pp at that point wasn't going to be useful for echo [22:14:14] ok, bsitu can just install them on the testing labs instances for today and as Coren already commented we should find out which role classes really make sense once we need it in production [22:14:18] bsitu: ^ [22:15:01] mutante: Or make a new one for the task, at need. [22:15:01] are they PECL or PEAR? [22:15:08] mutante: I just installed them in the labs instance [22:15:28] TimStarling: They're in apt, actually. [22:15:46] TimStarling: PEAR packaged as .deb [22:15:49] bsitu: cool [22:16:00] have they been reviewed? 
[22:16:21] I mean the code, I would expect PHP code to be reviewed before it is deployed [22:17:15] not specifically by us, but it's maintained by Debian [22:17:27] well, being in debian doesn't mean anyone has looked at the code [22:17:27] Debian PHP PEAR Maintainers [22:17:47] we have a fair few PHP developers who are capable of reviewing PEAR packages [22:18:06] mutante: I think TimStarling is volunteering [22:19:14] mutante: either that or he is implying that the requestors should find a reviewer for it before it ever hits ops [22:19:50] New patchset: Ori.livneh; "Add .gitreview" [operations/software/varnish/varnishkafka] (master) - https://gerrit.wikimedia.org/r/70926 [22:20:22] because it's surely not our responsibility to find a reviewer from dev to review some php when it's being requested by dev [22:21:09] and if that's the blocker, then maybe it should be brought up in the engineering meetings before it ever comes to us [22:21:18] otherwise our time is just being wasted [22:22:08] PROBLEM - Puppet freshness on mw8 is CRITICAL: No successful Puppet run in the last 10 hours [22:22:19] New review: Andrew Bogott; "Looks like Otto has been tracking local changes in CHANGELOG, so I've added an entry there, plus a R..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70858 [22:22:31] Change merged: Ori.livneh; [operations/software/varnish/varnishkafka] (master) - https://gerrit.wikimedia.org/r/70926 [22:22:55] I wasn't expecting an answer of "no" [22:23:29] ori-l: I don't think that I've been treating you or anyone else without root as an idiot, and although I'm not sure, I don't think it's the case for most of the people with +2 powers [22:23:43] New patchset: Andrew Bogott; "Convert swift's rsyncd from generic::rsyncd to the new rsync module."
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/70846 [22:23:43] New patchset: Andrew Bogott; "Add support for specifying a global rsyncd log file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70858 [22:24:08] TimStarling: what answer were you expecting? [22:24:44] in general we trust debian packages to be reviewed by debian unless we find a reason not to [22:24:52] "yes, we reviewed it months ago, during the design stage" [22:24:58] if dev wants to do further review, it should be on them [22:25:59] TimStarling: if you're giving us a requirement to block devs on this, at your direction as an architect, I'm cool with that [22:26:08] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: No successful Puppet run in the last 10 hours [22:26:08] PROBLEM - Puppet freshness on mw107 is CRITICAL: No successful Puppet run in the last 10 hours [22:26:08] PROBLEM - Puppet freshness on mw1132 is CRITICAL: No successful Puppet run in the last 10 hours [22:26:57] we've had trouble with 3rd party PHP code in the past [22:27:01] what I'm trying to avoid is ops being the bad guy in some inconsistent way. 
we have quite a few php packages that likely never went through review [22:27:03] DoS vulnerabilities in the CSS minifier [22:27:08] PROBLEM - Puppet freshness on amssq32 is CRITICAL: No successful Puppet run in the last 10 hours [22:27:08] PROBLEM - Puppet freshness on analytics1020 is CRITICAL: No successful Puppet run in the last 10 hours [22:27:08] PROBLEM - Puppet freshness on arsenic is CRITICAL: No successful Puppet run in the last 10 hours [22:27:08] PROBLEM - Puppet freshness on analytics1017 is CRITICAL: No successful Puppet run in the last 10 hours [22:27:08] PROBLEM - Puppet freshness on cp1035 is CRITICAL: No successful Puppet run in the last 10 hours [22:27:21] scary local file operations in the YAML decoder [22:27:27] yep [22:27:44] arbitrary script execution in a wordpress caching engine [22:27:57] the requirement seems sane, but I'd like us to have a policy that's consistent that we can enforce [22:28:22] so yes, I think I'm happy to make it a requirement that at least a very cursory security review be done for PHP code specifically [22:28:24] otherwise "ops is being dickish and is blocking my work" [22:29:05] New patchset: Edenhill; "Initial version of varnishkafka" [operations/software/varnish/varnishkafka] (master) - https://gerrit.wikimedia.org/r/70928 [22:29:08] PROBLEM - Puppet freshness on amssq36 is CRITICAL: No successful Puppet run in the last 10 hours [22:29:08] PROBLEM - Puppet freshness on amssq41 is CRITICAL: No successful Puppet run in the last 10 hours [22:29:08] PROBLEM - Puppet freshness on analytics1006 is CRITICAL: No successful Puppet run in the last 10 hours [22:29:08] PROBLEM - Puppet freshness on amssq58 is CRITICAL: No successful Puppet run in the last 10 hours [22:29:08] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: No successful Puppet run in the last 10 hours [22:29:15] Change merged: Dzahn; [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/70453 [22:29:19] TimStarling: tbh saying "PHP code 
specifically" smells a bit like double standards to me [22:29:20] hm. where to actually document stuff like this.... [22:29:31] well yeah, maybe it should be expanded [22:29:40] you know that I have reviewed a lot of C code before deployment [22:30:07] and rejected some solutions on the basis of such reviews -- specifically abcm2ps which was a huge pile of fail despite being in debian [22:30:52] !log updated Parsoid to 8b38dcc [22:30:53] I think it's sane to do some superficial review for sanity but downright impossible to review every piece of code that hits production [22:31:03] Logged the message, Master [22:31:08] PROBLEM - Puppet freshness on amssq38 is CRITICAL: No successful Puppet run in the last 10 hours [22:31:08] PROBLEM - Puppet freshness on amssq46 is CRITICAL: No successful Puppet run in the last 10 hours [22:31:08] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 10 hours [22:31:08] PROBLEM - Puppet freshness on amssq62 is CRITICAL: No successful Puppet run in the last 10 hours [22:31:08] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: No successful Puppet run in the last 10 hours [22:31:18] PHP code is where the cost/benefit analysis makes the most sense [22:31:39] since like I say, we have plenty of PHP devs who can do this, so the costs are low, and we have had trouble with it in the past, so the benefits are higher [22:32:08] PROBLEM - Puppet freshness on amssq33 is CRITICAL: No successful Puppet run in the last 10 hours [22:32:08] PROBLEM - Puppet freshness on amssq39 is CRITICAL: No successful Puppet run in the last 10 hours [22:32:08] PROBLEM - Puppet freshness on amssq42 is CRITICAL: No successful Puppet run in the last 10 hours [22:32:08] PROBLEM - Puppet freshness on amssq45 is CRITICAL: No successful Puppet run in the last 10 hours [22:32:08] PROBLEM - Puppet freshness on amssq49 is CRITICAL: No successful Puppet run in the last 10 hours [22:32:57] other programs can be easier to shield 
with eg. apparmor [22:33:08] PROBLEM - Puppet freshness on amssq55 is CRITICAL: No successful Puppet run in the last 10 hours [22:33:08] PROBLEM - Puppet freshness on amssq52 is CRITICAL: No successful Puppet run in the last 10 hours [22:33:08] PROBLEM - Puppet freshness on analytics1012 is CRITICAL: No successful Puppet run in the last 10 hours [22:33:08] PROBLEM - Puppet freshness on analytics1010 is CRITICAL: No successful Puppet run in the last 10 hours [22:33:08] PROBLEM - Puppet freshness on cerium is CRITICAL: No successful Puppet run in the last 10 hours [22:33:26] * Aaron|home tends to be suspicious of php libraries [22:33:30] New review: Dzahn; "just comments and saner output" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/70842 [22:33:31] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70842 [22:33:41] Platonides: it's too bad ubuntu/debian puts basically 0 effort into apparmor [22:34:06] New patchset: Dzahn; "Add missing done to syntax/lint checking" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70841 [22:34:08] PROBLEM - Puppet freshness on amssq34 is CRITICAL: No successful Puppet run in the last 10 hours [22:34:08] PROBLEM - Puppet freshness on amssq57 is CRITICAL: No successful Puppet run in the last 10 hours [22:34:08] PROBLEM - Puppet freshness on analytics1015 is CRITICAL: No successful Puppet run in the last 10 hours [22:34:08] PROBLEM - Puppet freshness on cp1003 is CRITICAL: No successful Puppet run in the last 10 hours [22:34:08] PROBLEM - Puppet freshness on cp1006 is CRITICAL: No successful Puppet run in the last 10 hours [22:34:08] PROBLEM - Puppet freshness on cp1013 is CRITICAL: No successful Puppet run in the last 10 hours [22:34:43] I thought ubuntu was in the apparmor field? [22:35:00] shell out to /usr/bin/mail ? 
shrug [22:35:06] I wasn't expecting anything spectacular [22:35:08] PROBLEM - Puppet freshness on amssq44 is CRITICAL: No successful Puppet run in the last 10 hours [22:35:08] PROBLEM - Puppet freshness on analytics1023 is CRITICAL: No successful Puppet run in the last 10 hours [22:35:08] PROBLEM - Puppet freshness on cp1036 is CRITICAL: No successful Puppet run in the last 10 hours [22:35:08] PROBLEM - Puppet freshness on cp1049 is CRITICAL: No successful Puppet run in the last 10 hours [22:35:08] PROBLEM - Puppet freshness on db1004 is CRITICAL: No successful Puppet run in the last 10 hours [22:35:10] it is, but have you ever looked at the apparmor coverage in ubuntu? [22:35:16] nope :P [22:35:18] it's relatively non-existent [22:35:32] especially compared to selinux coverage in fedora/rhel [22:35:41] (and I would prefer apparmor to be more flexible) [22:35:41] New review: Dzahn; "just comments and output format" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/70841 [22:35:42] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70841 [22:35:45] you know I have reported a couple of bugs in ubuntu's apparmor configuration [22:35:49] they kept breaking xubuntu [22:36:01] New patchset: Dzahn; "Fixup writing of newlines and done to make output consistent and sensible" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70839 [22:36:04] and it took many months for them to fix it [22:36:08] PROBLEM - Puppet freshness on analytics1014 is CRITICAL: No successful Puppet run in the last 10 hours [22:36:09] PROBLEM - Puppet freshness on cp1010 is CRITICAL: No successful Puppet run in the last 10 hours [22:36:09] PROBLEM - Puppet freshness on cp1002 is CRITICAL: No successful Puppet run in the last 10 hours [22:36:10] PROBLEM - Puppet freshness on cp1005 is CRITICAL: No successful Puppet run in the last 10 hours [22:36:10] PROBLEM - Puppet freshness on cp1024 is CRITICAL: No successful Puppet run in the
last 10 hours [22:36:10] PROBLEM - Puppet freshness on cp3012 is CRITICAL: No successful Puppet run in the last 10 hours [22:36:10] PROBLEM - Puppet freshness on cp1030 is CRITICAL: No successful Puppet run in the last 10 hours [22:36:17] I just mean, it is easy to add a profile of "you can read /usr/share, no, we don't allow you to execute other programs, create files or open sockets" [22:36:24] New review: Dzahn; "less newlines in output" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/70839 [22:36:31] Platonides: indeed [22:36:34] for a program we add for eg. calculating sha512 [22:36:49] TimStarling: profiles? or the functionality? [22:36:58] profiles [22:37:31] in fact, it should be so easy to create profiles like that for packaged programs... [22:38:03] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70839 [22:38:40] the problem was that when you "open" a downloaded file, under XFCE, it needs to run XFCE wrappers instead of gnome wrappers [22:39:05] and the configuration for running XFCE wrappers had bugs in it [22:39:11] And I'm guessing there isn't an xfce macro.. [22:39:22] and wasn't properly updated when XFCE was updated [22:39:58] they had made an effort, it was just untested [22:43:55] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [22:46:28] New patchset: Asher; "ishmael conf/vhost" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70931 [22:46:45] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [22:50:31] New patchset: Asher; "ishmael conf/vhost" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70931 [22:54:00] New review: Pyoungmeister; "do you even lift, bro?" 
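The "read /usr/share, no exec, no sockets" confinement Platonides sketches really is only a few lines of AppArmor policy. A hypothetical profile for an imaginary checksum tool; both the path `/usr/bin/sha512calc` and the tool itself are made up for illustration, not anything mentioned in the log:

```
# /etc/apparmor.d/usr.bin.sha512calc -- hypothetical minimal confinement
# for an imaginary checksum tool, per Platonides' description.
#include <tunables/global>

/usr/bin/sha512calc {
  #include <abstractions/base>

  /usr/bin/sha512calc mr,   # may map its own binary
  /usr/share/** r,          # read-only access to shared data
  deny network,             # no sockets
  # no 'x' rules at all, so it cannot execute other programs
}
```

A profile this restrictive is cheap to write and review, which is the point of his complaint: the hard part is not the policy language but that distributions ship so few profiles.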
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/70931 [23:00:36] New patchset: Asher; "ishmael conf/vhost" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70931 [23:03:06] New review: Pyoungmeister; "clearly, you lift." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/70931 [23:04:58] paravoid: maybe https://wikitech.wikimedia.org/wiki/Ceph can mention monitors a bit more and where container dbs are [23:05:34] container dbs? [23:06:11] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [23:07:13] New patchset: Asher; "ishmael conf/vhost" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70931 [23:23:06] paravoid: object listings [23:23:23] Aaron|home: these have nothing to do with monitors [23:23:33] I know [23:23:47] http://tracker.ceph.com/issues/4613 [23:23:48] I'm saying those both could be mentioned more, not that they are related [23:23:54] this was closed 3 days ago [23:24:02] (the ticket was opened per my request) [23:25:26] but points taken [23:25:33] I'll take a look at those tomorrow [23:25:36] thanks for the input :) [23:28:17] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70931 [23:41:39] New patchset: Dzahn; "redirect wiikipedia.com and wekipedia.com domains (RT #4679, RT #4681)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/70943 [23:46:36] New patchset: Dzahn; "redirect wiikipedia.com and wekipedia.com domains (RT #4679, RT #4681)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/70943 [23:47:50] New review: Dzahn; "testing 4 urls on 1 servers, totalling 4 requests" [operations/apache-config] (master) C: 2; - https://gerrit.wikimedia.org/r/70943 [23:49:50] New patchset: Dzahn; "redirect wiikipedia.com and wekipedia.com domains (RT #4679, RT #4681)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/70943 [23:51:29] New patchset: 
Asher; "enable site" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70944 [23:54:49] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70944 [23:55:37] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/70943 [23:59:19] !log graceful Apaches, activate wiikipedia and wekipedia [23:59:27] Logged the message, Master
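The apache-config diff behind the wiikipedia/wekipedia change isn't shown in the log. For illustration only, a typo-domain redirect of the kind Gerrit change 70943 describes might be sketched roughly like this; the vhost details below are assumptions, not the actual merged change:

```apache
# Hypothetical sketch: catch common Wikipedia typo domains and issue a
# permanent redirect to the real site. Details are illustrative.
<VirtualHost *:80>
    ServerName wiikipedia.com
    ServerAlias *.wiikipedia.com wekipedia.com *.wekipedia.com
    RewriteEngine On
    RewriteRule ^/(.*)$ http://www.wikipedia.org/$1 [R=301,L]
</VirtualHost>
```

After a config change like this, the `!log graceful Apaches` step at 23:59 reloads the running Apache processes so the new vhost takes effect without dropping in-flight requests.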