[00:02:00] !log catrope Started syncing Wikimedia installation... : Scap for VE update, contained i18n changes
[00:02:08] Logged the message, Master
[00:02:14] github is faster than gerrit
[00:04:54] New patchset: Legoktm; "Have gerrit-wm send all pywikibot/* commits to #pywikipediabot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70780
[00:09:30] !log catrope Finished syncing Wikimedia installation... : Scap for VE update, contained i18n changes
[00:09:33] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[00:09:38] Logged the message, Master
[00:15:18] New patchset: Legoktm; "Have gerrit-wm send all pywikibot/* commits to #pywikipediabot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70780
[00:15:59] !log catrope synchronized php-1.22wmf7/resources/startup.js 'touch'
[00:16:08] Logged the message, Master
[00:16:21] !log catrope synchronized php-1.22wmf8/resources/startup.js 'touch'
[00:16:30] Logged the message, Master
[00:23:09] !log updated Parsoid to eccca39
[00:23:17] Logged the message, Master
[01:01:56] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.003178834915 secs
[01:02:26] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.002837061882 secs
[01:06:17] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[01:22:07] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server
[01:31:57] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.001416444778 secs
[01:32:37] !log updated Parsoid to 091ebece
[01:32:46] Logged the message, Master
[01:34:19] !log Roan cleared Parsoid caches
[01:34:29] Logged the message, Master
[01:35:27] PROBLEM - Varnish HTTP parsoid-backend on titanium is CRITICAL: Connection refused
[01:35:57] RoanKattouw: ^^
[01:36:15] Ugh
[01:36:26] I also still get cached content
[01:36:27] Silly me
[01:36:35] I started Parsoid instead of Varnish
[01:36:44] ah ;)
[01:37:13] Oh, wait
[01:37:15] And!
[01:37:18] I did it on the wrong boxes
[01:37:22] We've moved to cpNNNN now
[01:37:26] yes, I was wondering about that ;)
[01:37:27] RECOVERY - Varnish HTTP parsoid-backend on titanium is OK: HTTP OK: HTTP/1.1 200 OK - 636 bytes in 0.005 second response time
[01:39:03] OK, done
[01:40:30] RoanKattouw: thanks, looks good
[02:07:05] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[02:07:37] !log LocalisationUpdate completed (1.22wmf8) at Thu Jun 27 02:07:37 UTC 2013
[02:07:48] Logged the message, Master
[02:13:40] !log LocalisationUpdate completed (1.22wmf7) at Thu Jun 27 02:13:40 UTC 2013
[02:13:49] Logged the message, Master
[02:19:05] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jun 27 02:19:05 UTC 2013
[02:19:14] Logged the message, Master
[02:55:41] !log added springle to wmf and ops LDAP groups
[02:55:50] Logged the message, Master
[03:08:03] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[03:12:07] New patchset: Tim Starling; "Use the /usr/local copy of MW for noc" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70792
[03:14:01] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70792
[03:28:53] PROBLEM - Disk space on ms-be1001 is CRITICAL: DISK CRITICAL - free space: / 5682 MB (3% inode=98%):
[03:33:09] noc.wikimedia.org/dbtree has stopped working
[03:33:41] TimStarling, ^ related to rt 70792?
[03:33:50] hey springle
[03:33:53] welcome :)
[03:34:02] hi paravoid, thanks :)
[03:34:04] (I'm Faidon)
[03:36:45] it was because I moved that MW source tree away
[03:36:48] I fixed it
[04:07:03] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[04:08:03] RECOVERY - Puppet freshness on mw1066 is OK: puppet ran at Thu Jun 27 04:07:59 UTC 2013
[04:18:23] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Jun 27 04:18:19 UTC 2013
[04:19:03] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[04:30:43] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Jun 27 04:30:35 UTC 2013
[04:31:04] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[05:03:22] apergos: redirects are gone I see!
[05:07:19] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:49] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:49] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:49] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:49] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:49] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:50] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:50] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:51] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:51] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:52] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours
[05:08:52] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours
[05:10:09] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server
[05:22:29] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server
[05:31:05] New patchset: BBlack; "more build/pkg fixes" [operations/software/varnish/libvmod-netmapper] (master) - https://gerrit.wikimedia.org/r/70795
[05:31:15] morning bblack :)
[05:31:28] Change merged: BBlack; [operations/software/varnish/libvmod-netmapper] (master) - https://gerrit.wikimedia.org/r/70795
[05:31:34] or evening :)
[05:31:45] thanks for the clarification, that's exactly what I meant, yes
[06:11:50] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[06:12:10] PROBLEM - Disk space on ms-be1002 is CRITICAL: DISK CRITICAL - free space: / 5699 MB (3% inode=98%):
[06:14:30] PROBLEM - SSH on mc15 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:15:20] RECOVERY - SSH on mc15 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[06:21:53] apergos: I rewrote the Architecture section https://wikitech.wikimedia.org/wiki/Media_storage
[06:22:00] your content helped me set the direction, thanks :)
[06:22:27] I also added info regarding eraseArchivedFile.php
[06:22:41] we should send all this to Aaron when we're done
[06:22:51] I'm sure he'll have corrections and additions
[06:27:10] RECOVERY - Disk space on ms-be1002 is OK: DISK OK
[06:27:50] RECOVERY - Disk space on ms-be1001 is OK: DISK OK
[06:33:32] paravoid: awesome
[06:45:54] hey paravoid
[06:46:13] or apergos
[06:53:59] yess?
[06:55:19] hi
[07:00:36] nothing major, just wanted advice
[07:00:49] i have a re-write of the eventlogging puppet module that i've been tweaking
[07:01:11] one of the things it does is replace supervisord (a python-based process management thingabob) with upstart
[07:01:58] it seemed annoying to insist on using some other then upstart to manage services when upstart was already managing everything
[07:02:32] but the nice thing about that was that it gave me a management interface that was specific to the six or seven processes that i cared about
[07:02:54] without having to grep for or struggle to recall service names
[07:03:35] now they're just lost in the crowd of random system services i hardly ever care about
[07:04:14] does that sound sensible?
[07:06:14] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours
[07:06:43] to do what?
[07:06:48] management how?
[07:07:48] ensure they're running, restart if necessary (with groups), tail stderr
[07:08:39] 'with groups' meaning there's a notion of process groups in supervisor and you can scope a command to a group of processes
[07:08:42] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[07:08:57] I'm not sure I understand what you want to do exactly
[07:09:53] i was trying to move away from supervisor and just use upstart, but am now reconsidering, and wondering if my reservations are legitimate
[07:10:49] ori-l: name all of the services eventlogging- maybe?
[07:11:09] no, I mean, what do you usually do with supervisor?
[07:11:13] and is that manually?
[07:12:10] Ryan_Lane: i guess "service --status-all | grep eventlogging-", hrm
[07:12:33] paravoid: usually tail stderr and restart individual components for code upgrades
[07:12:39] and yes, manually
[07:13:32] oh, and e-mail alerts
[07:14:05] I'd be okay with shipping a shell script in /usr/local/sbin that had a few management commands
[07:14:22] wmel status, wmel debug, wmel restart, etc.
[07:14:33] tail stderr isn't something that supervisord does anyway :)
[07:14:42] sure it does
[07:14:50] it odes?
[07:15:51] yeah, supervisorctl has a 'tail -f' command
[07:16:23] ugh :)
[07:16:23] it beats fishing the right file in /var/log
[07:17:19] http://vanadium.eqiad.wmnet:2828/
[07:18:37] i like the idea of upstart + management script + consistent service name prefix though
[07:18:55] web interfaces for management, bleh
[07:19:07] (i don't use it :))
[07:20:19] I don't mind supervisord, but it is a bit kind of counter-intuitive
[07:20:31] both in general and in the sense that we don't generally use it so people are not familiar with it
[07:21:17] but if you have reasons to prefer it, that's okay
[07:21:28] your call :)
[07:21:54] worth a shot
[07:22:04] plus i like 'wmel'
[07:22:29] :)
[07:22:57] alright, thanks
[07:23:00] I'd also add a "wmel check" that would run the nagios checks
[07:23:15] there are no nagios checks
[07:23:26] time to add them!
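[Editor's note: a minimal sketch of the "wmel" wrapper discussed above, assuming the eventlogging processes become upstart jobs sharing one name prefix. The unit names and the INITCTL override are hypothetical, added only so the dispatch logic can be exercised without a real upstart; "wmel check" could later shell out to the NRPE checks once they exist.]

```shell
# Sketch of a /usr/local/sbin/wmel management script (assumed design,
# not the deployed one). UNITS is a made-up list of upstart job names;
# INITCTL can be overridden for testing on a machine without upstart.
PREFIX="eventlogging-"
UNITS="forwarder processor multiplexer consumer"
INITCTL=${INITCTL:-initctl}

wmel() {
    action=$1
    case "$action" in
        status|start|stop|restart)
            # Fan the action out to every unit in the group,
            # mirroring supervisord's process-group commands.
            for u in $UNITS; do
                "$INITCTL" "$action" "${PREFIX}${u}"
            done
            ;;
        tail)
            # upstart logs job stdout/stderr under /var/log/upstart/
            tail -f /var/log/upstart/${PREFIX}*.log
            ;;
        *)
            echo "usage: wmel {status|start|stop|restart|tail}" >&2
            return 1
            ;;
    esac
}
```

This keeps the one-prefix naming Ryan_Lane suggested while restoring the group-scoped commands that supervisord provided.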
[07:23:28] :-)
[07:24:06] yes, probably a good idea
[07:24:15] * Ryan_Lane scoffs
[07:24:17] nagios checks
[07:24:21] * Ryan_Lane scoffs
[07:24:25] * Ryan_Lane should really go to sleep
[08:01:45] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset 0.00316131115 secs
[08:02:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:03:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[08:03:25] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset 0.00300860405 secs
[08:07:31] hi :)
[08:07:32] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[08:08:46] yesterday Azatoth introduced to me a project that let you easily build Debian packages under Jenkins :-D
[08:09:02] took like 2 hours, but I got pybal packaged via Jenkins!
[08:09:33] RECOVERY - Puppet freshness on celsus is OK: puppet ran at Thu Jun 27 08:09:25 UTC 2013
[08:09:33] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[08:17:02] New review: Hashar; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70322
[08:17:12] New patchset: Hashar; "beta: tweak $wgLoadScript to use the bits cache" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70322
[08:22:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:23:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.138 second response time
[08:48:44] New review: Hashar; "I have filled an issue upstream to have them add tags in git https://github.com/facebook/buck/issues/37" [operations/debs/buck] (master) - https://gerrit.wikimedia.org/r/70673
[08:57:44] New patchset: Hashar; "beta: removes incubator wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70804
[09:10:12] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[09:27:32] RECOVERY - Disk space on cp1048 is OK: DISK OK
[09:27:32] RECOVERY - RAID on cp1048 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[09:28:22] RECOVERY - DPKG on cp1048 is OK: All packages OK
[09:32:32] PROBLEM - Host cp1048 is DOWN: PING CRITICAL - Packet loss = 100%
[09:34:17] apergos: finally catching up with the puppet "modules and roles" thread in ops list :-)
[09:34:23] RECOVERY - Host cp1048 is UP: PING OK - Packet loss = 0%, RTA = 1.23 ms
[09:34:23] hh
[09:34:26] *heh
[09:38:00] !log Pooled new eqiad upload caches with 1% load
[09:38:09] Logged the message, Master
[09:44:07] upped to 5% now
[09:44:12] let's try not to overload swift today
[09:44:19] I'll try to keep the load below 1000 req/s
[09:52:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:54:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time
[10:12:06] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[10:13:42] 10%...
[10:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:23:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[10:52:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:53:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[11:02:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:03:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[11:10:06] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[11:22:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:23:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time
[11:42:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:43:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time
[12:06:49] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[12:13:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:14:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[12:22:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:23:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[12:30:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:32:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.154 second response time
[12:39:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:40:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.140 second response time
[12:46:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:47:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[12:54:04] re$
[12:55:44] New patchset: Hashar; "contint: explicitly require php5-dev" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70182
[13:03:40] Change abandoned: Hashar; "cant be rebased, will just redo that patch." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50064
[13:03:54] AzaToth: hi. So i saw you solved the problem with the buck repo. Did you had the project destroyed and recreated ?
[13:07:53] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[13:22:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:24:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time
[13:31:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:32:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.433 second response time
[13:37:41] New patchset: Hashar; "gerrit-wm: pywikibot/* events to #pywikipediabot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70780
[13:38:25] New review: Hashar; "good to go." [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/70780
[13:38:29] apergos: regarding ms-be9...disk2 was unconfigured good so I cleared the foreign cfg and added back. should be good to go now
[13:40:14] akosiaris: we wiped it totally so the old changesets where pruned
[13:40:27] but I didn't had to, it was just we decided to do so
[13:45:54] cmjohnson1: thanks, that's excellent
[13:47:33] New patchset: Hashar; "beta: adapt role::cache::varnish::upload" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70818
[13:47:48] New review: Hashar; "follow up in https://gerrit.wikimedia.org/r/70818" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/50064
[13:51:05] err: /Stage[main]/Nrpe::Service/Service[nagios-nrpe-server]/ensure: change from stopped to running failed: Could not start Service[nagios-nrpe-server]: Execution of '/etc/init.d/nagios-nrpe-server start' returned 2: at /etc/puppet/manifests/nrpe.pp:108
[13:51:06] :D
[13:51:11] poooor nrpe
[13:52:22] Jun 27 13:50:18 uploadtest07 nrpe[15058]: Unable to open config file '/etc/icinga/nrpe.cfg' for reading
[13:52:22] Jun 27 13:50:18 uploadtest07 nrpe[15058]: Config file '/etc/icinga/nrpe.cfg' contained errors, aborting...
[13:52:25] yeah that does not help
[13:54:30] <^demon> qchris: I'm having a phone call, then we'll do this thing :)
[13:56:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:56:37] ^demon: Upgrading, upgrading, upgrading, ... Yay \o/
[13:58:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time
[13:58:36] New review: Hashar; "I have applied that change to uploadtest07.pmtpa.wmflabs instance. Varnish instances managed to st..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70818
[14:00:11] qchris: helllo :-]
[14:00:22] Hi hashar :-)
[14:00:27] I filled an issue against Buck to have them tag versions https://github.com/facebook/buck/issues/37
[14:00:39] I saw that. Thanks.
[14:00:45] and AzaToth has submitted a change that would package Buck for debian
[14:00:57] somewhere in Gerrit, maybe operations/debs/buck
[14:00:57] I am curious whether they'll add them.
[14:01:13] we just need to catch a Google VP now :-]
[14:01:23] Yes, I am just upgrading an Ubuntu instance so I can have Java 7 there, so I can test that.
[14:01:35] oh
[14:01:47] * hashar checks whether gallium has java 7
[14:01:57] java version "1.6.0_27"
[14:01:57] :(
[14:02:03] Building buck requires Java 7.
[14:02:07] No way around it :-(
[14:02:24] openjdk-7-jre-headless is installed
[14:02:24] Building gerrit will also require it :-(
[14:02:29] Ah. Ok.
[14:02:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:45] so we will have to points the Gerrit job to the java7 install
[14:03:08] iirc it is already configured in Jenkins and the java runtime to use can be set on a per job basis
[14:03:12] using some kind of droplist
[14:03:26] Yes, the jenkins maven plugin had that IIRC.
[14:03:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time
[14:03:43] unfortunately, there is no buck plugin for maven :-)
[14:03:53] s/maven/jenkins/
[14:04:03] freestyle job we will use :-)
[14:04:10] So it shall be.
[14:04:18] ooor
[14:04:27] you could write a Jenkins plugin to nicely integrate buck
[14:04:57] Ok, you rewrite Jenkins in Python, and I'll write the buck plugin :-)
[14:05:39] I guess free style jobs will have to do for now. Given I detest buck, I do not really want to make it easier for people to migrate to it.
[14:06:20] <^demon> qchris: So, we'll try to get the change for replication merged today. But here's an example of what we were hitting:
[14:06:24] <^demon> [2013-06-27 14:02:50,503] ERROR com.googlesource.gerrit.plugins.replication.ReplicationQueue : Failed replicate of refs/heads/sandbox/anomie/merged2 to gerritslave@antimony.wikimedia.org:/var/lib/git/mediawiki/extensions/CentralAuth.git: status REJECTED_NONFASTFORWARD
[14:06:44] Do we force push there?
[14:07:09] ^demon: Are force pushes not allowed on sandbox branches?
[14:07:43] <^demon> They are
[14:07:54] <^demon> I've just been trying to fix the replication of them :)
[14:07:58] <^demon> anomie: Sorry for the ping :)
[14:09:16] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours
[14:10:07] ^demon: antimony seems to still use "push" => "refs/*:refs/*"
[14:10:15] ^demon so no force push :-(
[14:10:30] https://gerrit.wikimedia.org/r/#/c/70457/
[14:10:36] ^ Should solve the problem
[14:10:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:10:54] <^demon> !log gerrit: running puppet and restarting service
[14:10:54] <^demon> qchris: Yeah, we need to merge the change you pushed for it
[14:11:03] Logged the message, Master
[14:11:16] PROBLEM - Disk space on ms-be1001 is CRITICAL: DISK CRITICAL - free space: / 5698 MB (3% inode=98%):
[14:11:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time
[14:13:38] gerrit is up again :-)
[14:14:10] <^demon> Yeah, I added JenkinsBot to stream events.
[14:14:30] \o/
[14:14:43] <^demon> paravoid: Could you take a look at https://gerrit.wikimedia.org/r/#/c/70457/?
[14:15:27] zool is doing work.
[14:15:36] Looks good.
[14:15:56] <^demon> Yeah, zuul's fine. hooks-bz is giving me an exception tho
[14:16:01] <^demon> (not the auth problem from the other day)
[14:16:14] <^demon> Blah, misread...not hooks-bz
[14:16:34] <^demon> http://p.defau.lt/?KntwieWufnulHVgZkWFlXQ - when I updated commit summary on https://gerrit.wikimedia.org/r/#/c/66665/ to test stream events
[14:17:08] That's hooks-its
[14:17:46] Looks like the hooks-its jar /with isDraft/, while we are now running the gerrit without isDraft
[14:17:58] http://quelltextlich.at/gerrit/hooks-bugzilla-2.7-SNAPSHOT-84f08e8-hooks-its-3b7d4be.jar
[14:18:08] ^ That's the hooks-bz with hooks-its without isDrfat
[14:18:24] ^demon: I am also going to add yet another git replication destination
[14:18:33] <^demon> hashar: ok
[14:18:45] ^demon: I am going to receive a second CI server that will be a Jenkins Slave. will need the repos there :-]
[14:19:00] <^demon> qchris: Ah, did I grab the wrong build of hooks-bz?
[14:19:32] ^demon: Looks like you took the new double shiny one (which requires a modded gerrit)
[14:20:01] ^demon: The one for the unmodded gerrit will still give us the new event handling.
[14:20:08] So that should work fine.
[14:20:31] <^demon> Reloaded the plugin
[14:21:43] New patchset: Mark Bergsma; "Prepare the bits cache manifests for the new eqiad servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70819
[14:22:31] New patchset: Mark Bergsma; "Prepare the bits cache manifests for the new eqiad servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70819
[14:22:58] <^demon> Getting the ACL on stream-events right + not deploying the draft updates seems to be a much smoother rollout than last attempt.
[14:23:11] <^demon> Gustaf's change has some flaws, methinks.
[14:23:47] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70819
[14:24:05] * qchris likes smooth upgrades
[14:28:15] <^demon> I hate stupid exceptions.
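[Editor's note: for context on the REJECTED_NONFASTFORWARD errors above, gerrit's replication plugin only force-pushes when the push refspec carries a leading "+". A sketch of what the fix in change 70457 presumably amounts to; the remote name is hypothetical and the destination details are copied from the error message quoted in the log:]

```ini
# replication.config on the gerrit master (sketch, not change 70457 itself).
# The leading "+" makes each replication push forced, so rewritten sandbox
# branches stop failing with REJECTED_NONFASTFORWARD on the mirror.
[remote "antimony"]
    url = gerritslave@antimony.wikimedia.org:/var/lib/git/${name}.git
    push = +refs/*:refs/*
```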
[14:28:37] <^demon> That IOException in org.apache.sshd.server.session.ServerSession has annoyed me since day 1.
[14:28:57] :-)
[14:29:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:29:54] New patchset: Mark Bergsma; "Install cp1056/57, cp1069/70 as bits caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70820
[14:30:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70820
[14:34:53] ^demon: Notifications in bugzilla work. Do we want to test upgrading to the new event/comment mechanism as well?
[14:35:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.639 second response time
[14:35:40] <^demon> That's https://gerrit.wikimedia.org/r/#/c/69475/, right?
[14:36:04] Yes.
[14:36:22] That change turns on the new mechanism (and does not yet turn off the old one, so well get double comments)
[14:36:36] https://gerrit.wikimedia.org/r/#/c/69476/
[14:36:44] <^demon> 476 is to turn it off, right?
[14:36:45] ^ will turn off the old comments
[14:36:48] <^demon> Yeah
[14:37:01] Yes. I split it, so we can selectively revert if needed.
[14:37:04] * ^demon finds something to bribe an opsen with
[14:39:15] * hashar hides
[14:41:21] * apergos peeks in
[14:41:33] Hi apergos :-)
[14:42:00] what's the bribe? :-D
[14:42:01] We currently upgrading gerrit and want to test switching to a new way to add comments to bugzilla
[14:42:10] I'm looking at the change now
[14:42:17] * qchris puts on nice smile
[14:42:23] Great! thanks.
[14:42:30] <^demon> +1
[14:42:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:44:31] <^demon> qchris: Logs still quiet :)
[14:44:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 9.801 second response time
[14:44:58] And no one screaming "gerrit does not work" :-)
[14:45:11] the only part of this I can reasonably review is the gerrit.pp change; the vm and the config file I dunno the syntax or what they do
[14:45:40] apergos: That should be fine. In case they cause problems, we can revert back
[14:45:44] New patchset: Mark Bergsma; "Update caching proxy list" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70822
[14:45:54] actually I had a few gerrit 404s earlier
[14:45:56] but seems ok now
[14:46:46] needs rebase maybe
[14:46:58] New patchset: Mark Bergsma; "Update caching proxy list" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70822
[14:47:04] says gerrit
[14:47:39] ^demon or qchris: ^^
[14:47:56] Change merged: Mark Bergsma; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70822
[14:48:00] New patchset: QChris; "Take advantage of hook-bugzillas new event mechanism" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69475
[14:48:03] <^demon> mark: You possibly hit it during the like 2 minutes it was restarting.
[14:48:21] New patchset: QChris; "Turn off hooks-bugzilla legacy event handling" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69476
[14:48:46] New review: Hashar; "Tried on a fresh instance uploadtest08.ptmpa.wmflabs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70818
[14:48:47] possible
[14:49:02] apergos: I rebased the changes now.
[14:49:06] !log mark synchronized wmf-config/squid.php
[14:49:07] yep saw it
[14:49:14] Logged the message, Master
[14:49:46] !log mark synchronized wmf-config/squid.php
[14:50:13] uh oh
[14:50:25] I appear not to be logged in and I don't see a way to log in now
[14:50:27] that's really weird
[14:50:35] ah wait
[14:50:59] Gerrit requires wiiiide monitors :-(
[14:51:42] New patchset: Mark Bergsma; "Add the new bits servers to the $active_nodes list" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70823
[14:52:11] grrrr
[14:52:18] "working....."
[14:52:31] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70823
[14:53:09] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69475
[14:53:57] merged on sockpuppet
[14:54:04] apergos: thanks!
[14:54:32] running puppet on manganese
[14:55:05] PROBLEM - Host cp1056 is DOWN: PING CRITICAL - Packet loss = 100%
[14:56:26] RECOVERY - Host cp1056 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[14:58:04] New patchset: Hashar; "beta: adapt role::cache::varnish::upload" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70818
[14:58:06] changes now live
[14:58:24] Seems to work: https://bugzilla.wikimedia.org/show_bug.cgi?id=44441#c4
[14:58:35] Fantastic apergos. Thanks.
[14:58:43] <^demon> Everything looks great. Thanks apergos
[14:58:53] sure
[14:59:35] Should we turn off the old style comments as well?
[14:59:41] <^demon> Prolly
[14:59:52] qchris: while you are around would be nice to list the git repo on which the change has been made :)
[14:59:59] but I should probably feel a bug about it
[15:00:29] Yes. Now with the new system, we can change the message as we want :-)
[15:00:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:00:44] \O/
[15:01:15] PROBLEM - Host cp1069 is DOWN: PING CRITICAL - Packet loss = 100%
[15:01:22] New review: Hashar; "PS2 adds some symbolic links for /srv/sda3 and /srv/sdb3 that points to /srv/vdb . Result:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70818
[15:01:32] apergos: Could you please have a look at https://gerrit.wikimedia.org/r/#/c/69476/ as well?
[15:01:46] RECOVERY - Host cp1069 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms
[15:01:54] It stops gerrit from commenting in the old style.
[15:02:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.010 second response time
[15:02:38] right
[15:02:58] New review: Hashar; "I did the host tweak on upload08 and the same curl commands used before. Works fine :-] So I guess ..." [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/70818
[15:03:15] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/69476
[15:03:38] mark: finally rebased my beta upload::cache change. I got it tested in labs and that seems to work fine https://gerrit.wikimedia.org/r/70818 :-D
[15:03:45] PROBLEM - Host cp1070 is DOWN: PING CRITICAL - Packet loss = 100%
[15:04:05] mark: I added you as a reviewer already
[15:04:12] excellent
[15:04:41] mark: there is still a bit of a hack for /srv/sda3 :-) resolved that by creating symlinks hehe
[15:04:58] why do you need the hack?
[15:04:58] qchris: change is live
[15:05:05] Thanks!
[15:05:11] * qchris hugs apergos
[15:05:15] RECOVERY - Host cp1070 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[15:05:41] New patchset: Hashar; "erb: cast string to array for ruby 1.9" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54692
[15:05:44] New patchset: Hashar; "Change link in notifyNewProjects to HTTPS" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64462
[15:06:12] hashar: just use /srv/vdb directly?
[15:06:16] qchris: ^demon we have a new [Cherry Pick To button] \O/
[15:06:26] :-D
[15:06:31] mark: yeah I thought about that, but I would have to move the storage to a if ( :: realm )
[15:06:56] mark: and copy paste the default line for labs then replace the sda / sdd by vdb
[15:06:58] didn't we do that for the other clusters already?
[15:07:08] mark: Ithought it was easier to read / understand by using symlinks
[15:07:30] parsoid has it like that
[15:07:33] just copy it?
[15:07:55] mobile too
[15:08:03] then if you change the production one, the labs one will be out of sync :-D
[15:08:03] why do something different now?
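[Editor's note: the realm-conditional storage mark suggests, as the parsoid and mobile caches reportedly already do, might look roughly like this. The variable name, device paths, and sizes are illustrative assumptions, not the real role::cache manifests:]

```puppet
# Sketch: pick varnish storage per realm instead of symlinking
# /srv/sda3 -> /srv/vdb on labs instances (names are illustrative).
if $::realm == 'production' {
    $storage = '-s sda3=persistent,/srv/sda3/varnish.persist,100G -s sdb3=persistent,/srv/sdb3/varnish.persist,100G'
} else {
    # labs instances only have the virtual /srv/vdb disk
    $storage = '-s vdb=persistent,/srv/vdb/varnish.persist,10G'
}
```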
[15:08:47] ok ok :-) [15:08:49] will amend [15:09:05] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [15:09:07] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [15:09:07] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [15:09:07] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [15:09:07] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [15:09:07] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [15:09:07] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [15:09:08] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [15:09:08] PROBLEM - Puppet freshness on spence is CRITICAL: No successful Puppet run in the last 10 hours [15:09:08] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [15:09:09] PROBLEM - Puppet freshness on sodium is CRITICAL: No successful Puppet run in the last 10 hours [15:11:21] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [15:14:33] New patchset: Hashar; "beta: adapt role::cache::varnish::upload" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70818 [15:17:47] New patchset: Mark Bergsma; "Update ganglia aggregators for bits & upload caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70825 [15:19:25] New patchset: Hashar; "beta: adapt role::cache::varnish::upload" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70818 [15:20:02] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70825 [15:21:52] New review: Hashar; "fixed some puppet parsing error." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/70818 [15:22:08] mark: got it fixed :-] [15:22:13] thank you [15:22:17] i'll change some other things too [15:22:19] in the vcl [15:22:32] yeah you told me there was a nicer way to handle the upload domain difference [15:22:43] I haven't found out a better way though :( [15:22:46] hopefully anyway [15:23:01] or we could use hiera() :-D [15:23:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:24:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.138 second response time [15:25:48] Change merged: Akosiaris; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70457 [15:26:50] ^demon: the replication problem should be solved now ^ [15:27:24] (Next to puppet run and maybe restarting the plugin) [15:28:43] <^demon> Sweet [15:29:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:32:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 6.303 second response time [15:40:33] !log Pooled new eqiad bits caches [15:40:42] Logged the message, Master [15:42:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:44:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 2.690 second response time [16:04:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.587 second response time [16:05:06] !log Depooled row C bits caches, repooled old bits caches to investigate IPv6 problem [16:05:10] Logged the message, Master [16:05:10] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:05:33] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.054 second 
response time [16:06:59] ottomata, I note that you imported the rsync module -- do you know about how rsyncd.conf works? [16:07:04] The upstream module sets uid and gid to 'nobody' and can't be configured otherwise… our existing uses of rsyncd specify gids and uids. [16:07:22] I'm wondering if that's important, or if I can get by just specifying those in the subsections of the conf file [16:09:32] we imported that directly , i think you can modify the module as suits your needs [16:10:32] PROBLEM - Disk space on ms-be1002 is CRITICAL: DISK CRITICAL - free space: / 5688 MB (3% inode=98%): [16:11:38] !log reedy synchronized php-1.22wmf9/ [16:11:46] Logged the message, Master [16:13:18] !log reedy synchronized php-1.22wmf9/extensions/DataValues [16:13:25] Logged the message, Master [16:13:56] !log reedy synchronized php-1.22wmf9/extensions/Diff [16:14:04] hashar: thanks for fixing up my patchset. can i also talk to you about setting up jenkins tests for pywikibot? [16:14:04] Logged the message, Master [16:14:47] !log reedy synchronized php-1.22wmf9/extensions/Wikibase/ [16:14:55] Logged the message, Master [16:18:37] !log reedy synchronized php-1.22wmf9/extensions/WikibaseDataModel [16:18:45] Logged the message, Master [16:22:21] legoktm: I think I added a few basic tests already [16:22:28] legoktm: can definitely add mroe [16:23:26] hashar: can i see where those ones are listed? [16:24:15] the main one is to run our test suite with "python setup.py test", but that requires creating a file in ~/.pywikibot/user-config.py first [16:24:39] legoktm: hold on brb in a few minutes [16:24:51] ok [16:27:49] !log Corrected IPv6 problem, repooled the new servers, depooled the old bits servers [16:27:57] Logged the message, Master [16:29:09] !log reedy Started syncing Wikimedia installation... : test2wiki to 1.22wmf9 rebuild l10n cache [16:29:18] Logged the message, Master [16:29:46] ottomata, ok, but you don't know what the uid in the header of that conf file affects vs. 
the uid in the sections? [16:29:57] naw, i don't [16:30:46] New patchset: Mark Bergsma; "Correct regex" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70838 [16:31:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70838 [16:32:23] New patchset: Reedy; "Fixup writing of newlines and done to make output consistent and sensible" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70839 [16:32:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:33:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.264 second response time [16:35:14] New patchset: Akosiaris; "Introducing bacula module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70840 [16:37:56] New patchset: Reedy; "Add missing done to syntax/lint checking" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70841 [16:38:29] PROBLEM - Apache HTTP on mw1070 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:39:19] RECOVERY - Apache HTTP on mw1070 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.229 second response time [16:39:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:40:13] ^demon: Jenkins seems to be really slow this morning [16:40:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.462 second response time [16:40:40] <^demon> jenkins? or gerrit? 
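[editor's note] On andrewbogott's rsyncd.conf question above: parameters set before the first module section act as global defaults, and a module section can override them, so uid/gid can indeed be specified per module. A hedged sketch (the module name and paths are illustrative, not the actual swift configuration):

```
uid = nobody            # global default; applies to every module below
gid = nobody

[swift]
    path = /srv/swift
    read only = no
    uid = swift         # per-module override; takes precedence over the global uid
    gid = swift
```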
[16:40:50] New patchset: Reedy; "Remove superfluous comment which is repeated in sync-wikiversions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70842 [16:41:29] New patchset: Reedy; "Remove superfluous comment which is repeated in sync-wikiversions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70842 [16:41:30] ^demon: a merge takes a long time, apparently waiting for zuul [16:41:33] https://gerrit.wikimedia.org/r/#/c/70797/ [16:41:44] https://gerrit.wikimedia.org/r/#/c/70826/ [16:42:11] <^demon> Hmm, well gerrit got upgraded this AM, but it's been fine. [16:42:15] <^demon> hashar: Jenkins ok? [16:42:38] Aaron|home: hey [16:42:40] * hashar looks at status page at https://integration.wikimedia.org/zuul/ [16:42:46] 0 results, 0 events pending [16:42:56] suspicious [16:44:09] <^demon> Nothing in zuul log looks suspicious. [16:44:12] <^demon> to me [16:44:15] hmm [16:44:18] it got locked at some point [16:44:26] while 2013-06-27 16:42:04,304 INFO zuul.Gerrit: Getting information for 70842,2 [16:44:46] !log reedy Finished syncing Wikimedia installation... : test2wiki to 1.22wmf9 rebuild l10n cache [16:44:55] Logged the message, Master [16:45:10] 2013-06-27 16:43:58,338 DEBUG zuul.Scheduler: Adding trigger event: [16:45:30] that is the next event [16:45:42] ^demon: I guess Zuul was waiting for some Jenkins API Query to complete [16:45:47] hashar: have you done any more on the debs? [16:46:07] ^demon: the API query logs are in /var/log/jenkins/access.log [16:46:28] <^demon> k. [16:47:56] so I guess the usual slowness [16:48:08] I got a bunch of patches on python-jenkins to make some api queries a bit faster [16:48:13] have yet to test them out though [16:48:22] AzaToth: nop. Might have a look at it next week [16:48:25] ok [16:50:33] paravoid: hi [16:52:32] Can someone merge etc https://gerrit.wikimedia.org/r/#/c/67274/ ? It's getting annoying having to fix the new file every week. Thanks! 
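[editor's note] Regarding legoktm's earlier point that "python setup.py test" needs a ~/.pywikibot/user-config.py first: a minimal file might look like the sketch below. The values are placeholders, pywikibot itself injects the `usernames` dict when executing the file, and a temp HOME is used here so nothing real is written.

```shell
set -e
# Create a throwaway HOME and a minimal pywikibot user config in it.
HOME=$(mktemp -d)
mkdir -p "$HOME/.pywikibot"
cat > "$HOME/.pywikibot/user-config.py" <<'EOF'
# Placeholder pywikibot settings; pywikibot predefines the usernames
# dict when it executes this file.
mylang = 'en'
family = 'wikipedia'
usernames['wikipedia']['en'] = 'ExampleBot'
EOF
cat "$HOME/.pywikibot/user-config.py"
```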
[16:52:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:53:06] Coren: Could you merge https://gerrit.wikimedia.org/r/#/c/67274/1 (see Reedy's comment)? [16:53:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [16:56:33] New patchset: Andrew Bogott; "Convert swift's rsyncd from generic::rsyncd to the new rsync module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70846 [16:56:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:00:13] New review: Andrew Bogott; "(the log file still isn't set properly here.)" [operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/70846 [17:00:38] !log LocalisationUpdate completed (1.22wmf8) at Thu Jun 27 17:00:37 UTC 2013 [17:00:47] Logged the message, Master [17:01:20] !log LocalisationUpdate completed (1.22wmf7) at Thu Jun 27 17:01:20 UTC 2013 [17:01:29] Logged the message, Master [17:02:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.874 second response time [17:03:00] hashar: zuul still did not get to jobs submitted over an hour ago, and shows an empty queue again [17:03:28] gwicke: maybe it got lost ? Gerrit got restarted a few times earlier [17:05:12] ok, retrying [17:06:22] PROBLEM - Puppet freshness on manutius is CRITICAL: No successful Puppet run in the last 10 hours [17:10:06] hashar: no luck again [17:10:27] now https://integration.wikimedia.org/zuul/ shows "Queue only mode: preparing to reconfigure, queue length: 1" [17:10:44] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [17:10:45] !log Jenkins/Zuul is no more merging changes due to --force-message options no more being available in Gerrit :/ [17:10:54] Logged the message, Master [17:11:01] wha? 
[17:11:07] we need CI for our CI [17:11:43] !log LocalisationUpdate completed (1.22wmf9) at Thu Jun 27 17:11:43 UTC 2013 [17:11:52] Logged the message, Master [17:12:50] !log Zuul/Jenkins merging bug is {{bug|50300}} [17:12:59] Logged the message, Master [17:15:23] RoanKattouw: Gimme a min. (was at lunch) [17:15:49] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jun 27 17:15:49 UTC 2013 [17:15:50] No worries [17:15:54] New review: coren; "Simple enough." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/67274 [17:15:55] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67274 [17:15:57] Logged the message, Master [17:16:01] No particular rush, it's just been waiting for a long time [17:16:03] Oh, there it goes [17:16:05] Thanks man [17:16:20] RoanKattouw: All done. [17:17:15] * Coren goes to the vet with the dogs for their checkup. [17:20:12] !log zuul: removing --force-message from layout {{gerrit|70849}} and reloading zuul. 
Caused {{bug|50300}} [17:20:23] Logged the message, Master [17:23:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:24:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [17:30:01] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: test2wiki back to 1.22wmf8 [17:30:09] Logged the message, Master [17:35:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:36:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [17:36:38] !log reedy synchronized php-1.22wmf7/extensions/Wikibase 'Updating to master of 1.22wmf6 branch' [17:36:48] Logged the message, Master [17:37:27] !log reedy synchronized php-1.22wmf8/extensions/Wikibase 'Updating to master of 1.22wmf6 branch' [17:37:37] Logged the message, Master [17:42:15] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: test2wiki back to 1.22wmf9 [17:42:24] Logged the message, Master [17:48:37] New patchset: Ottomata; "Adding Adam Baso to stats group on analytics nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70854 [17:49:08] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70854 [17:55:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.144 second response time [17:58:18] ^demon: loading patch pages seems to be broken with FF and latest gerrit [17:58:28] <^demon> gwicke: Roan thought so too. [17:58:31] <^demon> I couldn't replicate. 
[17:58:38] <^demon> And it worked for him when he logged out [17:58:39] both subbu and me see the same [17:58:45] even after a restart of the browser [17:58:51] works in Chromium though [17:59:06] <^demon> Yeah, that's what Roan said. [17:59:25] <^demon> I tried FF 21 and 22 on this machine and couldn't replicate. [17:59:54] I'm on 20 [18:00:13] Mozilla/5.0 (X11; Linux x86_64; rv:20.0) Gecko/20100101 Firefox/20.0 Iceweasel/20.0 [18:00:37] <^demon> Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:22.0) Gecko/20100101 Firefox/22.0 [18:01:05] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.22wmf8 [18:01:14] Logged the message, Master [18:02:05] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Rest of wikipedias to 1.22wmf8 [18:02:13] Logged the message, Master [18:03:19] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki, mediawikiwiki and testwikidatawiki to 1.22wmf9 [18:03:27] Logged the message, Master [18:03:32] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:03:33] bits down? [18:03:50] AzaToth: no [18:03:53] I'm also waiting on bits.... [18:04:01] New patchset: Reedy; "test2wiki to 1.22wmf9" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70855 [18:04:01] New patchset: Reedy; "testwiki, mediawikiwiki and testwikidatawiki to 1.22wmf9" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70856 [18:04:11] I've seen some requests fail.. 
[18:04:14] ori-l: well, it seems to be fuck'd [18:04:22] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.065 second response time [18:04:35] wfm *shrug* [18:04:43] New patchset: Krinkle; "Enable VE experimental mode on test2wiki per Bug 49963" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70440 [18:05:01] loaded now at least [18:05:01] RoanKattouw: https://gerrit.wikimedia.org/r/#/c/70440/2 [18:05:05] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70855 [18:05:10] was prolly some intermittent [18:05:32] New patchset: Reedy; "Wikipedias to 1.22wmf8" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70856 [18:05:33] New review: Krinkle; "It's great Doc. Thanks!" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70440 [18:05:34] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:05:41] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70856 [18:06:12] ^demon: is there a bug for it already? [18:06:22] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.815 second response time [18:06:33] <^demon> gwicke: No bug, no. [18:06:36] ^demon: eh? Verified and Code-Review are swapped now. Any idea why? [18:06:43] <^demon> They were before? [18:06:54] <^demon> Oh, they're consistent again. [18:07:21] <^demon> They've been broken for awhile, showing different on the dashboard and change page. 
[18:07:51] Verified happens on submission, CR afterwards [18:08:21] http://paste.debian.net/13009/ [18:08:22] I know we do it the other way around for new users, but that's a temporary exception for security reasons until we enter the next phase with Jenkins [18:08:29] 503 on bits again [18:08:38] (and no, I've not made twinkle too big again) [18:08:41] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:08:50] AzaToth: works for me [18:08:52] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [18:09:15] Krinkle: it's intermittent [18:09:20] there's a lot of servers :) [18:09:31] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.057 second response time [18:10:22] why not just use one big server? ツ [18:11:01] ^demon: https://bugzilla.wikimedia.org/show_bug.cgi?id=50309 [18:18:05] New patchset: Andrew Bogott; "Convert swift's rsyncd from generic::rsyncd to the new rsync module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70846 [18:18:05] New patchset: Andrew Bogott; "Add support for specifying a global log rsyncd log file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70858 [18:21:54] New review: Andrew Bogott; "For reference... the old rsyncd.conf file looked like this:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70846 [18:26:23] AzaToth: Yep, you guessed it. 
http://greg.porter.name/wordpress/wp-content/uploads/2009/10/Big-FC-Server.jpg [18:26:46] New patchset: Reedy; "add WikibaseDataModel extension to extension-list" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70829 [18:27:08] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70829 [18:27:55] New patchset: Reedy; "Clean the aliases of Proofread Page managed namespaces" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70377 [18:29:06] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70377 [18:29:30] New patchset: Reedy; "Fix links to Gitweb in highlight.php on noc.wikimedia.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70463 [18:30:05] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70463 [18:31:01] PROBLEM - Puppet freshness on mw1039 is CRITICAL: No successful Puppet run in the last 10 hours [18:31:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:32:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.151 second response time [18:34:14] AzaToth: like that big spike at 16.20? 
https://gdash.wikimedia.org/dashboards/reqerror/ [18:34:32] [assuming it's the right graph] [18:34:48] !log Created Echo tables on enwikivoyage [18:34:57] Logged the message, Master [18:41:03] !log reedy synchronized wmf-config/InitialiseSettings.php [18:41:12] Logged the message, Master [18:42:50] New patchset: Reedy; "Install Thanks and Echo extensions on enwikivoyage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70861 [18:43:26] !log reedy synchronized wmf-config/InitialiseSettings.php [18:45:26] New patchset: Reedy; "Add Extension:NewUserMessage to de.wikiversity" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70862 [18:46:00] !log reedy synchronized wmf-config/InitialiseSettings.php 'Add Extension:NewUserMessage to de.wikiversity' [18:46:08] Logged the message, Master [18:46:48] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/70862 [18:48:11] If I was interested in making https://github.com/wikimedia/puppet-jmxtrans available to us, what would I have to do? [18:51:09] manybubbles: Do you mean by "us" as in use it on the WMF cluster/similar? [18:51:51] Reedy: yes. I mean to use it in labs and eventually on the production cluster. [18:54:37] manybubbles: Presumably add it as a submodule at modules/jmxtrans for starts [18:54:41] And throw things at ottomata [18:55:14] ok. I think we'd also need to package jmxtrans but that might be reasonably simple. [18:55:31] Ryan_Lane: if i wanted to use git-deploy to deploy ishmael instead of deb packaging it, are the directions on https://wikitech.wikimedia.org/wiki/Sartoris still good for setting it up? [18:55:44] good question [18:55:45] I'm just not sure how we go about packaging anything. is there a page for it? [18:56:02] manybubbles: ottomata might have done something like that, I'm presuming he created those manifests for analytics usage.. 
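[editor's note] Reedy's suggestion above — "add it as a submodule at modules/jmxtrans" — mechanically looks like the following. This is a sandbox demo with local temp repos standing in for gerrit's operations/puppet and operations/puppet/jmxtrans; the real change would use the gerrit URLs.

```shell
set -e
# Identity for the demo commits only.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.org
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.org
tmp=$(mktemp -d)

# A stand-in for the jmxtrans module repo.
git init -q "$tmp/jmxtrans"
git -C "$tmp/jmxtrans" commit -q --allow-empty -m "initial commit"

# A stand-in for operations/puppet; add jmxtrans under modules/.
git init -q "$tmp/puppet"
cd "$tmp/puppet"
# Newer git refuses file:// submodule clones by default; allow it for the sandbox.
git -c protocol.file.allow=always submodule --quiet add "$tmp/jmxtrans" modules/jmxtrans
git commit -q -m "Add jmxtrans as a submodule at modules/jmxtrans"
cat .gitmodules
```

`git submodule add` stages both the .gitmodules entry and the gitlink, so the single commit records which jmxtrans revision operations/puppet points at.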
[18:56:17] manybubbles: https://wikitech.wikimedia.org/wiki/Help:Packaging_software [18:56:20] Ryan_Lane: i'm also wondering if any manual salt configuration is needed? [18:56:20] binasher: seems I don't have info about adding new repos [18:56:45] there's a small amount of manual bootstrapping necessary for brand new repos, yes [18:56:52] manybubbles [18:57:05] if you can work on getting the .deb into our apt [18:57:07] binasher: it's mostly managed by puppet [18:57:26] ottomata: I can totally work on that. [18:57:31] I can walk you through it and document it while doing so [18:57:44] ottomata: I want so badly to have those graphs for solr. [18:58:02] i can get jmxtrans submitted for review [18:58:05] i actually already worked on it [18:58:07] just need to push it [18:58:12] (puppet module) [18:58:25] binasher: in puppet, it's configured via manifests/role/deployment.pp [18:58:52] ottomata: super! I'll go read that packaging stuff and get it ready [18:59:09] binasher: this has no submodules, right? [18:59:16] right [19:11:57] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [19:18:47] ottomata: were you guys working on puppet for zookeeper? [19:19:46] yes, that should be in [19:19:53] and useable [19:20:11] https://gerrit.wikimedia.org/r/#/admin/projects/operations/puppet/zookeeper [19:28:07] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 215 seconds [19:30:18] PROBLEM - Varnish HTTP mobile-frontend on cp3011 is CRITICAL: HTTP CRITICAL - No data received from host [19:32:07] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [19:32:58] Reedy, did you sync docroot/noc/conf/highlight.php? [19:33:03] No [19:33:06] noc is on fenari [19:33:34] I did notice the updated urls didn't seem to work [19:34:00] I'm not getting the new URLs... 
hmm [19:34:17] RECOVERY - Varnish HTTP mobile-frontend on cp3011 is OK: HTTP OK: HTTP/1.1 200 OK - 707 bytes in 0.178 second response time [19:34:18] Hmm [19:34:20] When did that change [19:34:21] DocumentRoot /usr/local/apache22/htdocs/noc [19:35:01] got them now [19:35:12] yeah, I'm running sync-common on fenari [19:36:21] DocumentRoot /usr/local/apache22/htdocs/noc [19:37:07] That doesn't even exist [19:37:08] reedy@fenari:/home/wikipedia/htdocs$ ls -al /usr/local/apache22/htdocs/noc [19:37:08] ls: cannot access /usr/local/apache22/htdocs/noc: No such file or directory [19:38:37] I wonder how that's even working.. [19:39:44] I wonder if I should just ignore this. [19:43:06] New patchset: QChris; "Make hook-bugzilla act on "bug" footers as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70870 [20:05:06] Reedy, greg-g all OK for E3 to begin its deploy ? [20:05:28] Yeah [20:05:39] I've not been doing anything deploy related for a couple of hours now [20:05:57] Just looking at fixing up crap the users broke ;) [20:06:51] just the typical day in the life of Reedy :) [20:07:28] Open shell bugs < 50 now! [20:08:16] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [20:10:11] Reedy: accidental submit before being done https://bugzilla.wikimedia.org/show_bug.cgi?id=49189#c2 ? [20:11:19] Or I go distracted and thought I'd finished.. [20:11:42] heh, or that [20:19:59] Zuul/Jenkins has 59 events in the queue [20:20:17] !log jenkins: migrating mediawiki-core-phpunit-api from master to slaves [20:20:25] Logged the message, Master [20:20:48] New patchset: Andrew Bogott; "Convert swift's rsyncd from generic::rsyncd to the new rsync module." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/70846 [20:20:48] New patchset: Andrew Bogott; "Add support for specifying a global rsyncd log file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70858 [20:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.135 second response time [20:23:24] spagewmf: so yeah Zuul is a bit busy :-] That is the hour when l10n-bot is submitting a bunch of changes :-] [20:24:19] NP, though for some reason I thought it ran around 1am PDT [20:24:54] I ordered ops a 2nd jenkins server :-] [20:25:11] I worked a bit this week to make it possible on Jenkins isde [20:25:21] now we just have to get the server installed and we will be fine :-) [20:31:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:38] !log jenkins: migrating mediawiki-core-phpunit-databaseless form master to slaves [20:32:46] Logged the message, Master [20:33:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [20:34:11] hashar: piuparts [20:34:56] AzaToth: what is that ? [20:35:05] http://wiki.debian.org/piuparts [20:35:24] hehe [20:35:27] default automagical setup of jenkins-debian-glue createsa a piuparts job [20:35:42] will have to do that later on I guess [20:35:48] !log spage synchronized php-1.22wmf9/extensions/GuidedTour 'updating GuidedTour in 1.22wmf9' [20:35:57] I will already be busy enough converting my yesterday hack in a normal job [20:35:58] Logged the message, Master [20:36:11] hashar sounds good. Zuul queue 0! BTW, join #wikimedia-e3 and type "!jenkins!%$#!!! " (thanks to Ori) [20:39:07] New patchset: Ottomata; "jmxtrans puppetization." 
[operations/puppet/jmxtrans] (master) - https://gerrit.wikimedia.org/r/70915 [20:40:23] New patchset: Ottomata; "jmxtrans puppetization." [operations/puppet/jmxtrans] (master) - https://gerrit.wikimedia.org/r/70915 [20:40:51] manybubbles: https://gerrit.wikimedia.org/r/#/c/70915/ [20:41:20] hashar: http://awesomescreenshot.com/02f1g1tb38 [20:41:20] thanks. Strill struggling with packaging. It might look like it is packaged normally but everything is backwards. [20:41:50] !log spage synchronized php-1.22wmf8/extensions/GuidedTour 'updating GuidedTour in 1.22wmf8' [20:41:58] !log deployment-prep jenkins: migrating mediawiki-core-phpunit-misc from master to slaves [20:41:59] Logged the message, Master [20:42:08] Logged the message, Master [20:42:19] AzaToth: sounds easy :-) [20:42:34] AzaToth: do you have a link to debian-glue related documentation? [20:42:36] yup [20:42:43] FWIW sync-dir reports "mw1173: ssh: connect to host mw1173 port 22: Connection timed out" [20:42:45] I will fill a bug about it to remember about that [20:43:53] !log spage synchronized php-1.22wmf8/extensions/Campaigns 'updating Campaigns in 1.22wmf8' [20:44:02] Logged the message, Master [20:45:16] AzaToth: logged https://bugzilla.wikimedia.org/show_bug.cgi?id=50318 [20:45:20] hashar: doesn't seems to exists a specific docs for that [20:45:30] I attached your screenshot to it [20:45:36] .. to the bug [20:46:19] I notice the screenshotter fnucked up in the middle due to the jenkins bottom overlay [20:47:10] hashar, FYI one of my branch extension updates failed in Jenkins doxygen: https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-publish/1334/console . No worries, it merged [20:48:15] spagewmf: oops [20:48:30] spagewmf: looks like it always failed on wmf branches [20:48:48] spagewmf: would you mind filling a but about it against Wikimedia > Continuous integration please? 
:-) [20:48:54] just copy paste the console + URL [20:49:08] and summary something like: Jenkins: doxygen doesn't build on wmf branches [20:49:14] that would be sweet :-] [20:50:26] you're right, the doxygen job shows up later so I missed it on the other commits. Will file a bug. [20:50:56] thank you! [20:51:42] !log jenkins: migrating mediawiki-core-phpunit-parser from master to slaves [20:51:51] Logged the message, Master [20:53:01] hashar BTW I notice there are multiple jobs running for these updates, e.g. https://gerrit.wikimedia.org/r/#/c/70910/ : a Verified+2, a Starting gate-and-submit that seems to run identical jobs, and then the doxygen which repeates mediawiki-core-lint. [20:53:38] hashar: sad there aint any xml to yaml converter ヾ [20:53:57] spagewmf: so looking at that one [20:54:13] spagewmf: first jenkins-bot result is the one triggered by the patchset being submitted [20:54:26] spagewmf: the second one is the gate-and-submit result [20:54:38] spagewmf: the last one, I have NO idea :-] [20:55:01] spagewmf: ahhh [20:55:15] spagewmf: the last one is are tests being run after merge [20:55:36] we should add the pipeline name in the message [20:56:14] hashar can the Zuul Skynet AI™ notice the jobs and/or the individual CI commands are the same and magically coalesce them? [20:57:04] seems piuparts runs a chroot in a chroot [20:58:42] spagewmf: not really [20:58:58] spagewmf: zuul has a concept of a pipeline, the change enter in it and the result is a notification sent back to gerrit [20:59:12] the pipelines do not interact with each others [20:59:19] and there is no central process to report back to gerrit [21:00:41] spagewmf: might one day find a way to aggregates the different messages. [21:00:50] I am out for now :-) have a good afternoon everyone [21:00:51] someone had a patch to add the coalesce feature to Zuul and the head of Intel's server chip business had him killed. 
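[editor's note] hashar's explanation of the three jenkins-bot comments maps onto Zuul's layout configuration roughly as below. This is a hedged sketch in upstream Zuul v2 layout syntax, not the actual Wikimedia layout; pipeline managers and trigger details are illustrative.

```yaml
pipelines:
  - name: check                # first comment: runs when a patchset is uploaded
    manager: IndependentPipelineManager
    trigger:
      gerrit:
        - event: patchset-created

  - name: gate-and-submit      # second comment: runs again on approval, then submits
    manager: DependentPipelineManager
    trigger:
      gerrit:
        - event: comment-added
          approval:
            - code-review: 2

  - name: postmerge            # third comment: tests run after the change merged
    manager: IndependentPipelineManager
    trigger:
      gerrit:
        - event: change-merged
```

Each pipeline reports back to gerrit independently, which is why hashar says there is no central process that could coalesce the three messages.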
[21:01:05] hehe :-) [21:01:14] hashar: http://192.168.20.71:8080/job/jenkins-debian-glue-piuparts/1/tapResults/? [21:01:15] goodnight, thanks! [21:01:24] oops, wrong click [21:01:25] AzaToth: can't access a private address :D [21:01:58] spagewmf: and make sure to fill a bug about Doxygen not working for wmf branches! [21:02:47] hashar: can you access http://azatoth.net:8080/job/jenkins-debian-glue-piuparts/1/tapResults/? ? [21:04:30] yup [21:04:47] AzaToth: you should attach that to the bug report I opened ! [21:05:44] AzaToth: looks nice thanks! [21:05:59] I am off for real now! see you tomorrow or next week :] [21:07:59] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [21:09:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:10:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [21:32:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:33:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.135 second response time [21:40:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:41:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [21:54:32] New review: Faidon; "How silly of them to not allow arbitrary settings or a custom template. Oh well..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/70858 [21:56:10] New review: Faidon; "Awesome! Thanks, feel free to merge this whenever its dependency gets merged!" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/70846 [21:57:55] binasher: were all the job tables truncated? 
[21:57:57] * Aaron|home can't remember [21:59:21] i don't either, hmm [21:59:42] Aaron|home: nope [21:59:47] Aaron|home: want me to? [21:59:52] enjoy :) [21:59:56] woooo [22:00:05] Aaron|home: hey [22:00:17] Aaron|home: when you have some spare time, your input on https://wikitech.wikimedia.org/view/Media_storage would be greatly appreciated :) [22:00:21] binasher: try not to drop `page` though [22:00:30] I'm sure there are inaccuracies in there [22:00:41] like doc review? [22:00:53] I guess :) [22:01:05] also https://wikitech.wikimedia.org/wiki/Ceph although that'd be less interesting I'm guessing [22:01:53] next on the doc todo is to clean up all those swift dev pages that talk about ms7 or whatever :) [22:02:06] deploy plans etc. [22:02:31] and update swift to reflect reality, while moving the more general parts into Media_storage [22:02:47] and have swift be software-centric [22:03:45] Aaron|home: oh crap.. *double checks the contents of truncate-job.sql* [22:04:13] paravoid: so varnish no longer falls back to squid in any way? [22:04:17] for uploads [22:04:50] New review: Faidon; "I can't comment on jmxtrans itself (e.g. the config file) so whatever you say on that :-) puppet-wis..." [operations/puppet/jmxtrans] (master) C: 1; - https://gerrit.wikimedia.org/r/70915 [22:07:09] TimStarling: hi. did you mean to remove php-mail and php-mail-mime packages or was it just a side-effect when removing apaches.pp ? (RT-5338) [22:07:17] bsitu: ^ hey [22:07:20] Coren: Is it by design that `become` doesn't source the bashrc of the target? [22:07:40] I can understand that `sudo -su` doesn't do it as it preserves the current shell, but become creates a new one. [22:07:40] mutante: hi, saw your email, thx [22:08:05] * Aaron|home reads through puppet [22:08:16] Aaron|home: no [22:08:17] Krinkle: That's what become /does/.
:-) [22:08:18] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [22:08:32] Coren: Yes, but it doesn't source the target bashrc [22:08:33] Krinkle: It sources .profile and friends rather. [22:08:42] oh, so it does have that ability [22:08:44] Krinkle: Yes, that's normal -- it's a login shell. [22:08:54] ah, yeah, I confused bashrc and bash_profile [22:08:56] Krinkle: I moved the PATH setting to .profile [22:09:10] Or, I think I sourced bashrc from profile? I forget [22:09:15] Krinkle: What one normally does is source the .bashrc from the .profile so that it gets sourced both ways. [22:09:32] yeah, that's what I do normally as well [22:09:48] I just rarely re-create such a setup from scratch so I only did bashrc :) [22:10:11] https://github.com/Krinkle/dotfiles/blob/master/templates/bash_profile https://github.com/Krinkle/dotfiles/blob/master/templates/bashrc [22:10:13] thx :) [22:10:21] mutante: I only changed the search indexers [22:10:40] are you saying that the search indexers used php-mail and php-mail-mime, but not the main application servers?
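The .profile/.bashrc arrangement Coren and Krinkle settle on can be demonstrated end to end. The script below is a minimal sketch using a throwaway directory as a stand-in for a home directory; the file contents and variable names are illustrative, not taken from anyone's actual dotfiles:

```shell
#!/bin/sh
# A login shell (such as the one `become` starts) reads ~/.profile rather
# than ~/.bashrc, so the usual fix is to source ~/.bashrc from ~/.profile.
demo_home=$(mktemp -d)

# Interactive settings live in .bashrc.
cat > "$demo_home/.bashrc" <<'EOF'
export FROM_BASHRC=yes
EOF

# The login-shell startup file pulls them in.
cat > "$demo_home/.profile" <<'EOF'
if [ -n "$BASH_VERSION" ] && [ -f "$HOME/.bashrc" ]; then
    . "$HOME/.bashrc"
fi
EOF

# A login shell now sees the .bashrc settings too.
result=$(HOME=$demo_home bash --login -c 'echo "FROM_BASHRC=$FROM_BASHRC"' 2>/dev/null \
         | grep '^FROM_BASHRC=' | tail -n 1)
echo "login shell reports: $result"
rm -rf "$demo_home"
```

Without the sourcing line in `.profile`, the same login shell would see `FROM_BASHRC` empty, which is exactly the surprise Krinkle ran into.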
[22:11:43] TimStarling: so far i just saw the request to add those packages and remembered i added them a couple months ago already, then i saw apaches.pp, which had them, had been deleted and then looked at git log [22:12:08] well, when you duplicate your configuration, mistakes are going to happen [22:12:23] that's one of the reasons why I wanted to remerge [22:12:50] but I think most of the work was done by notpeter, like I say, I only cleaned up one last usage of apaches.pp [22:13:03] ok, just making sure you weren't opposed to using those packages in general or something [22:13:19] so obviously anything that was in apaches.pp at that point wasn't going to be useful for echo [22:14:14] ok, bsitu can just install them on the testing labs instances for today and as Coren already commented we should find out which role classes really make sense once we need it in production [22:14:18] bsitu: ^ [22:15:01] mutante: Or make a new one for the task, at need. [22:15:01] are they PECL or PEAR? [22:15:08] mutante: I just installed them in the labs instance [22:15:28] TimStarling: They're in apt, actually. [22:15:46] TimStarling: PEAR packaged as .deb [22:15:49] bsitu: cool [22:16:00] have they been reviewed? 
[22:16:21] I mean the code, I would expect PHP code to be reviewed before it is deployed [22:17:15] not specifically by us, but it's maintained by Debian [22:17:27] well, being in debian doesn't mean anyone has looked at the code [22:17:27] Debian PHP PEAR Maintainers [22:17:47] we have a fair few PHP developers who are capable of reviewing PEAR packages [22:18:06] mutante: I think TimStarling is volunteering [22:19:14] mutante: either that or he is implying that the requestors should find a reviewer for it before it ever hits ops [22:19:50] New patchset: Ori.livneh; "Add .gitreview" [operations/software/varnish/varnishkafka] (master) - https://gerrit.wikimedia.org/r/70926 [22:20:22] because it's surely not our responsibility to find a reviewer from dev to review some php when it's being requested by dev [22:21:09] and if that's the blocker, then maybe it should be brought up in the engineering meetings before it ever comes to us [22:21:18] otherwise our time is just being wasted [22:22:08] PROBLEM - Puppet freshness on mw8 is CRITICAL: No successful Puppet run in the last 10 hours [22:22:19] New review: Andrew Bogott; "Looks like Otto has been tracking local changes in CHANGELOG, so I've added an entry there, plus a R..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70858 [22:22:31] Change merged: Ori.livneh; [operations/software/varnish/varnishkafka] (master) - https://gerrit.wikimedia.org/r/70926 [22:22:55] I wasn't expecting an answer of "no" [22:23:29] ori-l: I don't think that I've been treating you or anyone else without root as an idiot, and although I'm not sure, I don't think it's the case for most of the people with +2 powers [22:23:43] New patchset: Andrew Bogott; "Convert swift's rsyncd from generic::rsyncd to the new rsync module."
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/70846 [22:23:43] New patchset: Andrew Bogott; "Add support for specifying a global rsyncd log file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70858 [22:24:08] TimStarling: what answer were you expecting? [22:24:44] in general we trust debian packages to be reviewed by debian unless we find a reason not to [22:24:52] "yes, we reviewed it months ago, during the design stage" [22:24:58] if dev wants to do further review, it should be on them [22:25:59] TimStarling: if you're giving us a requirement to block devs on this, at your direction as an architect, I'm cool with that [22:26:08] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: No successful Puppet run in the last 10 hours [22:26:08] PROBLEM - Puppet freshness on mw107 is CRITICAL: No successful Puppet run in the last 10 hours [22:26:08] PROBLEM - Puppet freshness on mw1132 is CRITICAL: No successful Puppet run in the last 10 hours [22:26:57] we've had trouble with 3rd party PHP code in the past [22:27:01] what I'm trying to avoid is ops being the bad guy in some inconsistent way. 
we have quite a few php packages that likely never went through review [22:27:03] DoS vulnerabilities in the CSS minifier [22:27:08] PROBLEM - Puppet freshness on amssq32 is CRITICAL: No successful Puppet run in the last 10 hours [22:27:08] PROBLEM - Puppet freshness on analytics1020 is CRITICAL: No successful Puppet run in the last 10 hours [22:27:08] PROBLEM - Puppet freshness on arsenic is CRITICAL: No successful Puppet run in the last 10 hours [22:27:08] PROBLEM - Puppet freshness on analytics1017 is CRITICAL: No successful Puppet run in the last 10 hours [22:27:08] PROBLEM - Puppet freshness on cp1035 is CRITICAL: No successful Puppet run in the last 10 hours [22:27:21] scary local file operations in the YAML decoder [22:27:27] yep [22:27:44] arbitrary script execution in a wordpress caching engine [22:27:57] the requirement seems sane, but I'd like us to have a policy that's consistent that we can enforce [22:28:22] so yes, I think I'm happy to make it a requirement that at least a very cursory security review be done for PHP code specifically [22:28:24] otherwise "ops is being dickish and is blocking my work" [22:29:05] New patchset: Edenhill; "Initial version of varnishkafka" [operations/software/varnish/varnishkafka] (master) - https://gerrit.wikimedia.org/r/70928 [22:29:08] PROBLEM - Puppet freshness on amssq36 is CRITICAL: No successful Puppet run in the last 10 hours [22:29:08] PROBLEM - Puppet freshness on amssq41 is CRITICAL: No successful Puppet run in the last 10 hours [22:29:08] PROBLEM - Puppet freshness on analytics1006 is CRITICAL: No successful Puppet run in the last 10 hours [22:29:08] PROBLEM - Puppet freshness on amssq58 is CRITICAL: No successful Puppet run in the last 10 hours [22:29:08] PROBLEM - Puppet freshness on analytics1024 is CRITICAL: No successful Puppet run in the last 10 hours [22:29:15] Change merged: Dzahn; [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/70453 [22:29:19] TimStarling: tbh saying "PHP code 
specifically" smells a bit like double standards to me [22:29:20] hm. where to actually document stuff like this.... [22:29:31] well yeah, maybe it should be expanded [22:29:40] you know that I have reviewed a lot of C code before deployment [22:30:07] and rejected some solutions on the basis of such reviews -- specifically abcm2ps which was a huge pile of fail despite being in debian [22:30:52] !log updated Parsoid to 8b38dcc [22:30:53] I think it's sane to do some superficial review for sanity but downright impossible to review every piece of code that hits production [22:31:03] Logged the message, Master [22:31:08] PROBLEM - Puppet freshness on amssq38 is CRITICAL: No successful Puppet run in the last 10 hours [22:31:08] PROBLEM - Puppet freshness on amssq46 is CRITICAL: No successful Puppet run in the last 10 hours [22:31:08] PROBLEM - Puppet freshness on analytics1026 is CRITICAL: No successful Puppet run in the last 10 hours [22:31:08] PROBLEM - Puppet freshness on amssq62 is CRITICAL: No successful Puppet run in the last 10 hours [22:31:08] PROBLEM - Puppet freshness on analytics1027 is CRITICAL: No successful Puppet run in the last 10 hours [22:31:18] PHP code is where the cost/benefit analysis makes the most sense [22:31:39] since like I say, we have plenty of PHP devs who can do this, so the costs are low, and we have had trouble with it in the past, so the benefits are higher [22:32:08] PROBLEM - Puppet freshness on amssq33 is CRITICAL: No successful Puppet run in the last 10 hours [22:32:08] PROBLEM - Puppet freshness on amssq39 is CRITICAL: No successful Puppet run in the last 10 hours [22:32:08] PROBLEM - Puppet freshness on amssq42 is CRITICAL: No successful Puppet run in the last 10 hours [22:32:08] PROBLEM - Puppet freshness on amssq45 is CRITICAL: No successful Puppet run in the last 10 hours [22:32:08] PROBLEM - Puppet freshness on amssq49 is CRITICAL: No successful Puppet run in the last 10 hours [22:32:57] other programs can be easier to shield 
with eg. apparmor [22:33:08] PROBLEM - Puppet freshness on amssq55 is CRITICAL: No successful Puppet run in the last 10 hours [22:33:08] PROBLEM - Puppet freshness on amssq52 is CRITICAL: No successful Puppet run in the last 10 hours [22:33:08] PROBLEM - Puppet freshness on analytics1012 is CRITICAL: No successful Puppet run in the last 10 hours [22:33:08] PROBLEM - Puppet freshness on analytics1010 is CRITICAL: No successful Puppet run in the last 10 hours [22:33:08] PROBLEM - Puppet freshness on cerium is CRITICAL: No successful Puppet run in the last 10 hours [22:33:26] * Aaron|home tends to be suspicious of php libraries [22:33:30] New review: Dzahn; "just comments and saner output" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/70842 [22:33:31] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70842 [22:33:41] Platonides: it's too bad ubuntu/debian puts basically 0 effort into apparmor [22:34:06] New patchset: Dzahn; "Add missing done to syntax/lint checking" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70841 [22:34:08] PROBLEM - Puppet freshness on amssq34 is CRITICAL: No successful Puppet run in the last 10 hours [22:34:08] PROBLEM - Puppet freshness on amssq57 is CRITICAL: No successful Puppet run in the last 10 hours [22:34:08] PROBLEM - Puppet freshness on analytics1015 is CRITICAL: No successful Puppet run in the last 10 hours [22:34:08] PROBLEM - Puppet freshness on cp1003 is CRITICAL: No successful Puppet run in the last 10 hours [22:34:08] PROBLEM - Puppet freshness on cp1006 is CRITICAL: No successful Puppet run in the last 10 hours [22:34:08] PROBLEM - Puppet freshness on cp1013 is CRITICAL: No successful Puppet run in the last 10 hours [22:34:43] I thought ubuntu was in the apparmor field? [22:35:00] shell out to /usr/bin/mail ? 
shrug [22:35:06] I wasn't expecting anything spectacular [22:35:08] PROBLEM - Puppet freshness on amssq44 is CRITICAL: No successful Puppet run in the last 10 hours [22:35:08] PROBLEM - Puppet freshness on analytics1023 is CRITICAL: No successful Puppet run in the last 10 hours [22:35:08] PROBLEM - Puppet freshness on cp1036 is CRITICAL: No successful Puppet run in the last 10 hours [22:35:08] PROBLEM - Puppet freshness on cp1049 is CRITICAL: No successful Puppet run in the last 10 hours [22:35:08] PROBLEM - Puppet freshness on db1004 is CRITICAL: No successful Puppet run in the last 10 hours [22:35:10] it is, but have you ever looked at the apparmor coverage in ubuntu? [22:35:16] nope :P [22:35:18] it's relatively non-existent [22:35:32] especially compared to selinux coverage in fedora/rhel [22:35:41] (and I would prefer apparmor to be more flexible) [22:35:41] New review: Dzahn; "just comments and output format" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/70841 [22:35:42] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70841 [22:35:45] you know I have reported a couple of bugs in ubuntu's apparmor configuration [22:35:49] they kept breaking xubuntu [22:36:01] New patchset: Dzahn; "Fixup writing of newlines and done to make output consistent and sensible" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70839 [22:36:04] and it took many months for them to fix it [22:36:08] PROBLEM - Puppet freshness on analytics1014 is CRITICAL: No successful Puppet run in the last 10 hours [22:36:09] PROBLEM - Puppet freshness on cp1010 is CRITICAL: No successful Puppet run in the last 10 hours [22:36:09] PROBLEM - Puppet freshness on cp1002 is CRITICAL: No successful Puppet run in the last 10 hours [22:36:10] PROBLEM - Puppet freshness on cp1005 is CRITICAL: No successful Puppet run in the last 10 hours [22:36:10] PROBLEM - Puppet freshness on cp1024 is CRITICAL: No successful Puppet run in the
last 10 hours [22:36:10] PROBLEM - Puppet freshness on cp3012 is CRITICAL: No successful Puppet run in the last 10 hours [22:36:10] PROBLEM - Puppet freshness on cp1030 is CRITICAL: No successful Puppet run in the last 10 hours [22:36:17] I just mean, it is easy to add a profile of "you can read /usr/share, no, we don't allow you to execute other programs, create files or open sockets" [22:36:24] New review: Dzahn; "less newlines in output" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/70839 [22:36:31] Platonides: indeed [22:36:34] for a program we add for eg. calculating sha512 [22:36:49] TimStarling: profiles? or the functionality? [22:36:58] profiles [22:37:31] in fact, it should be so easy to create profiles like that for packaged programs... [22:38:03] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70839 [22:38:40] the problem was that when you "open" a downloaded file, under XFCE, it needs to run XFCE wrappers instead of gnome wrappers [22:39:05] and the configuration for running XFCE wrappers had bugs in it [22:39:11] And I'm guessing there isn't an xfce macro.. [22:39:22] and wasn't properly updated when XFCE was updated [22:39:58] they had made an effort, it was just untested [22:43:55] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [22:46:28] New patchset: Asher; "ishmael conf/vhost" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70931 [22:46:45] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [22:50:31] New patchset: Asher; "ishmael conf/vhost" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70931 [22:54:00] New review: Pyoungmeister; "do you even lift, bro?" 
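The "read /usr/share, no exec, no sockets" confinement Platonides sketches really is only a few lines of AppArmor policy. A hypothetical profile for an imaginary checksum tool; both the path `/usr/bin/sha512calc` and the tool itself are made up for illustration, not anything mentioned in the log:

```
# /etc/apparmor.d/usr.bin.sha512calc -- hypothetical minimal confinement
# for an imaginary checksum tool, per Platonides' description.
#include <tunables/global>

/usr/bin/sha512calc {
  #include <abstractions/base>

  /usr/bin/sha512calc mr,   # may map its own binary
  /usr/share/** r,          # read-only access to shared data
  deny network,             # no sockets
  # no 'x' rules at all, so it cannot execute other programs
}
```

A profile this restrictive is cheap to write and review, which is the point of his complaint: the hard part is not the policy language but that distributions ship so few profiles.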
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/70931 [23:00:36] New patchset: Asher; "ishmael conf/vhost" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70931 [23:03:06] New review: Pyoungmeister; "clearly, you lift." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/70931 [23:04:58] paravoid: maybe https://wikitech.wikimedia.org/wiki/Ceph can mention monitors a bit more and where container dbs are [23:05:34] container dbs? [23:06:11] PROBLEM - Puppet freshness on celsus is CRITICAL: No successful Puppet run in the last 10 hours [23:07:13] New patchset: Asher; "ishmael conf/vhost" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70931 [23:23:06] paravoid: object listings [23:23:23] Aaron|home: these have nothing to do with monitors [23:23:33] I know [23:23:47] http://tracker.ceph.com/issues/4613 [23:23:48] I'm saying those both could be mentioned more, not that they are related [23:23:54] this was closed 3 days ago [23:24:02] (the ticket was opened per my request) [23:25:26] but points taken [23:25:33] I'll take a look at those tomorrow [23:25:36] thanks for the input :) [23:28:17] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70931 [23:41:39] New patchset: Dzahn; "redirect wiikipedia.com and wekipedia.com domains (RT #4679, RT #4681)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/70943 [23:46:36] New patchset: Dzahn; "redirect wiikipedia.com and wekipedia.com domains (RT #4679, RT #4681)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/70943 [23:47:50] New review: Dzahn; "testing 4 urls on 1 servers, totalling 4 requests" [operations/apache-config] (master) C: 2; - https://gerrit.wikimedia.org/r/70943 [23:49:50] New patchset: Dzahn; "redirect wiikipedia.com and wekipedia.com domains (RT #4679, RT #4681)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/70943 [23:51:29] New patchset: 
Asher; "enable site" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70944 [23:54:49] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/70944 [23:55:37] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/70943 [23:59:19] !log graceful Apaches, activate wiikipedia and wekipedia [23:59:27] Logged the message, Master
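The apache-config diff behind the wiikipedia/wekipedia change isn't shown in the log. For illustration only, a typo-domain redirect of the kind Gerrit change 70943 describes might be sketched roughly like this; the vhost details below are assumptions, not the actual merged change:

```apache
# Hypothetical sketch: catch common Wikipedia typo domains and issue a
# permanent redirect to the real site. Details are illustrative.
<VirtualHost *:80>
    ServerName wiikipedia.com
    ServerAlias *.wiikipedia.com wekipedia.com *.wekipedia.com
    RewriteEngine On
    RewriteRule ^/(.*)$ http://www.wikipedia.org/$1 [R=301,L]
</VirtualHost>
```

After a config change like this, the `!log graceful Apaches` step at 23:59 reloads the running Apache processes so the new vhost takes effect without dropping in-flight requests.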