[00:05:32] I guess that fixed UW too [00:06:23] I missed one of the masters, I got it just now [00:07:14] so we should have no more errors after 00:05 [00:07:51] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 00:07:47 UTC 2013 [00:08:11] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [00:08:32] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 00:08:21 UTC 2013 [00:09:11] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [00:09:21] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 00:09:14 UTC 2013 [00:10:06] ok icinga, we get it [00:10:11] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [00:10:58] ah, ms1 is decommissioned, must be causing some weird funkyness [00:13:32] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [00:13:46] New patchset: MaxSem; "Postgres module for OSM" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36155 [00:18:32] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 221 seconds [00:18:32] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 188 seconds [00:19:31] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 247 seconds [00:20:17] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 279 seconds [00:21:32] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [00:21:55] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [00:22:19] New patchset: Dzahn; "remove index.html.tmpl from files, replaced by .erb template and drop the empty lines in the template causing issues (yes, really)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45271 [00:22:31] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 5 seconds [00:23:05] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45251 [00:23:07] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [00:23:56] New patchset: Lcarr; "fixing icinga restart condition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45272 [00:24:31] New patchset: CSteipp; "Lower email throttle" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45273 [00:24:36] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45271 [00:27:26] New patchset: Lcarr; "fixing icinga restart conditions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45274 [00:28:43] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45274 [00:29:15] gotta love bugs like that.. i got some template that creates an index.html, now if there are empty comment lines in it like "### " they will end up (and mess up) the resulting HTML, while comment lines that actually have comments like "### foo" don't cause any issue.. [00:29:29] TimStarling: I'm having a hard time triggering MWExceptionHandler::handle [00:29:54] set_exception_handler definitely gets called [00:33:23] strange [00:33:38] does it work for you? 
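A minimal standalone sketch of the set_exception_handler() behaviour being debugged above. This is not MediaWiki's MWExceptionHandler, just the bare mechanism with a made-up handler name; running it under both the CLI and a web SAPI is a quick way to compare the two cases that come up next in the discussion.

    <?php
    // Bare-bones check that an uncaught exception reaches the handler
    // registered with set_exception_handler(). MediaWiki layers its own
    // reporting on top of this; none of that is reproduced here.

    function pingHandler( $e ) {
        // Goes to the response body under a web SAPI, to stdout on the CLI.
        echo "custom handler caught: " . $e->getMessage() . "\n";
    }

    set_exception_handler( 'pingHandler' );

    throw new Exception( 'ping' );
    // Not reached: PHP stops the script once the handler has run.
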
[00:34:54] working in sapi mode [00:35:33] not cli [00:37:01] New patchset: Lcarr; "removed old style definition of removing generic site" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45275 [00:37:41] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45275 [00:46:46] New patchset: Lcarr; "fixing syntax error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45276 [00:46:49] on fenari "throw new MWException('ping');" with eval.php appears to work [00:47:11] the error output from MW is a bit different to the default PHP output [00:50:34] * AaronSchulz was testing on his laptop [00:53:34] I tested on my laptop also, I saw an error message in the debug log [00:53:45] TimStarling: definitely does not work on my laptop, using 2>/dev/null still shows the error (though if I throw a raw Exception it doesn't show) [00:55:51] New patchset: MaxSem; "WIP: OSM module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36222 [00:58:06] is anyone here familiar with tmh operationally? [00:59:41] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [00:59:55] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [00:59:55] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [01:00:11] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [01:01:01] binasher: I wouldn't say familiar, but it probably won't take me long to get up to speed [01:01:08] what is the problem? [01:02:04] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45276 [01:03:11] so tmh1001 is a job runner with JR_TYPE="webVideoTranscode" in /etc/default/mw-job-runner [01:03:16] TimStarling: the tmh servers in eqiad aren't getting code deploys (not sure since when) and are trying to talk to pmtpa db's via a now deleted db.php [01:03:41] i saw that notpeter exempted the tmh hosts from his puppet change enabling job runners in eqiad though, so i'm not sure if those hosts are even supposed to be running right now [01:03:45] sure, they're missing [01:04:14] I added them to mw-eqiad which I was going to use as a source for mediawiki-installation [01:04:23] but then someone else updated mediawiki-installation for me [01:05:26] ah. and was tmh processing only running in eqiad all this time? (so they didn't need enabling like the other eqiad job runners) [01:05:49] !log added tmh1001 and tmh1002 to mediawiki-installation [01:05:59] Logged the message, Master [01:06:08] there is also tmh1 and tmh2, I assume they are in pmtpa [01:07:43] !log ran scap-1 on tmh1001 and tmh1002 [01:07:53] Logged the message, Master [01:11:10] judging by ganglia, tmh1 and tmh2 were active up until today [01:11:27] someone stopped the job runners on tmh1 and tmh2 and started them on tmh1001 and tmh1002 [01:12:21] no /etc/rc3.d/*mw-job-runner on any of them [01:13:12] nor on the regular job runners [01:13:20] so I suppose they rely on being started manually on boot [01:14:56] New patchset: Dzahn; "use different CSS for Arabic, use a puppet selector within the file definition. took planet.css from old Arabic planet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45283 [01:15:43] TimStarling: looks like puppet has an ensure => running for mw-job-runner [01:15:52] right [01:16:29] TimStarling: maybe jobs-loop should check $? 
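The "maybe jobs-loop should check $?" point above (and the "Check the return status of nextJobDB.php" patch that follows) boils down to: don't trust a helper's stdout unless it also exited cleanly. The real jobs-loop is a shell wrapper; this is only a PHP sketch of the same idea, and the MWScript.php path and arguments are illustrative guesses, not the production invocation.

    <?php
    // Run the job-picking helper and check its exit status (the shell's $?)
    // before acting on its output.

    $cmd = 'php /usr/local/apache/common/multiversion/MWScript.php nextJobDB.php --type=webVideoTranscode';
    exec( $cmd, $output, $status );

    if ( $status !== 0 ) {
        fwrite( STDERR, "nextJobDB.php exited with status $status; skipping this cycle\n" );
        exit( 1 );
    }

    $db = trim( implode( "\n", $output ) );
    if ( $db === '' ) {
        exit( 0 );   // zero status and empty output just means no pending work
    }
    echo "next wiki with pending jobs: $db\n";
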
[01:16:54] seems that puppet tries to ensure sync-common has been run before starting it as well [01:17:04] AaronSchulz: I guess that would help [01:17:35] getting reports from wikidata users about random 404's some users are getting, but not all, and the ones who do say it is fixed by reload but happens quite a bit [01:17:53] binasher: https://gerrit.wikimedia.org/r/#/c/45159/ [01:18:17] notpeter switched over both sets of job runners [01:18:50] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45283 [01:19:21] mutante: are all the apaches that are enabled in LVS also in mediawiki-installation? [01:20:04] TimStarling: thanks, i had only actually read https://gerrit.wikimedia.org/r/#/c/45156/ which enabled job runners in eqiad, but left tmh disabled there [01:21:58] !log aaron synchronized php-1.21wmf7/includes/api/ApiUpload.php 'deployed 2a4ad3e32cf4f6814554ebcf09ac0250546a2549' [01:22:11] Logged the message, Master [01:23:02] !log aaron synchronized php-1.21wmf8/includes/api/ApiUpload.php 'deployed 2a4ad3e32cf4f6814554ebcf09ac0250546a2549' [01:23:12] Logged the message, Master [01:23:42] TimStarling: confirmed that everything host enabled in /h/w/conf/pybal/eqiad/apaches is in /etc/dsh/group/mediawiki-installation [01:23:57] same with api [01:26:23] ah, dsh group, first thought you are talking about a puppet class, thx asher [01:26:48] yea, and the reports are not limited to Europe, got one from Canada [01:28:38] do any of these reports come with a URL? [01:29:42] not yet..trying to get some .."17:03 < Jasper_Deng> but its sporadicness leads me to speculate about URL-rewrite rules not being executed correctly on all servers [01:29:58] 17:36 < Jasper_Deng> mutante: Special:Watchlist and http://www.wikidata.org/wiki/Wikidata:Administrators/Confirm_2013/1 for me at least [01:30:01] 17:36 < Jasper_Deng> it appears to be URL-independent [01:30:20] from #wikimedia-tech [01:30:31] TimStarling, I've seen it on http://www.wikidata.org/wiki/Special:Watchlist [01:43:22] mutante: I got 2 of those random 404's on wikivoyage from the office [01:45:18] Jamesofur: what url? and did a reload fix it? [01:46:08] binasher: A reload did fix it, on two different occasions it was http://en.wikivoyage.org/wiki/Wikivoyage:TOC [01:54:13] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [02:02:17] still moving servers? [02:04:52] Nope, we're done [02:04:59] We finished around, 11ish? [02:05:03] Before lunchtime I think [02:05:19] So about 6-7 hours ago [02:07:30] New patchset: Aaron Schulz; "Check the return status of nextJobDB.php." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/45292 [02:08:09] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [02:08:10] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [02:08:26] mutante: mw1017, 1018, and 1019 give 404's for wikivoyage urls every time and are pooled [02:08:50] the rest of the pooled eqiad apaches are ok [02:09:47] New patchset: Ryan Lane; "Combine sysadmin and netadmin into projectadmin" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45267 [02:10:42] * AaronSchulz wonders what's with http://commons.wikimedia.org/wiki/File:%E6%85%88%E6%BF%9F%E5%AE%AEwikipedia.jpg [02:11:16] AaronSchulz, The file exists, http://upload.wikimedia.org/wikipedia/commons/b/b5/%E6%85%88%E6%BF%9F%E5%AE%AEwikipedia.jpg [02:11:23] binasher: I'll fix them [02:11:25] RoanKattouw: if you're done, shouldn't all wikis be writeable now? [02:11:51] They are, aren't they? [02:11:57] testwiki isn't [02:12:12] The administrator who locked it offered this explanation: Wikimedia Sites are currently read-only during maintenance, please try again soon. [02:12:22] Haha [02:12:24] Right [02:12:27] Yeah, testwiki is a special case [02:12:33] :-( [02:12:58] The way it's set up is incompatible with how eqiad is set up I think [02:12:58] TimStarling: sync-apache uses the apaches dsh group which doesn't include any of the eqiad hosts. it looks like puppet is supposed to take care of this now though [02:13:08] You should file a bug about it being down if there isn't one already [02:13:17] techman224: http://commons.wikimedia.org/w/thumb.php?f=%E6%85%88%E6%BF%9F%E5%AE%AEwikipedia.jpg&w=400 [02:14:52] !log added eqiad apaches to /etc/dsh/group/apaches [02:15:04] Logged the message, Master [02:16:11] AaronSchulz, http://commons.wikimedia.org/wiki/Special:Undelete/File:%E6%85%88%E6%BF%9F%E5%AE%AEwikipedia.jpg [02:16:30] Same picture? 
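A sketch of the kind of per-backend probe used above to find that mw1017, mw1018 and mw1019 were serving 404s for wikivoyage while the rest of the pool was fine: request the same path from each apache directly, send the public Host header, and compare status codes. The internal hostnames are a guess at the naming scheme, not copied from the pool list.

    <?php
    // Hit each pooled apache directly with the public Host: header and
    // compare the HTTP status codes returned for the same path.

    $backends   = array( 'mw1017.eqiad.wmnet', 'mw1018.eqiad.wmnet', 'mw1019.eqiad.wmnet' );
    $path       = '/wiki/Wikivoyage:TOC';
    $hostHeader = 'en.wikivoyage.org';

    foreach ( $backends as $backend ) {
        $ch = curl_init( "http://$backend$path" );
        curl_setopt_array( $ch, array(
            CURLOPT_HTTPHEADER     => array( "Host: $hostHeader" ),
            CURLOPT_NOBODY         => true,   // only the status code matters here
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT        => 5,
        ) );
        curl_exec( $ch );
        $code = curl_getinfo( $ch, CURLINFO_HTTP_CODE );
        curl_close( $ch );
        printf( "%-22s %s\n", $backend, $code ?: 'no response' );
    }
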
[02:16:40] I can't see that [02:18:08] binasher: puppet will only run that sync command once [02:18:16] it has creates => "/usr/local/apache/conf", [02:18:19] so it won't run again [02:18:29] those servers were probably installed early [02:18:38] AaronSchulz, I can't get the thumbnails from the links under the picture [02:18:55] i suppose there could be other inconsistencies as well [02:19:10] i'll add the eqiad api/apps to the apaches group [02:19:16] !log running sync-apache to fix stale eqiad apache configuration [02:19:25] Logged the message, Master [02:19:27] I did already, see above log entry [02:19:51] so you did, thanks [02:23:10] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37566 [02:23:48] !log reloading all apaches [02:23:58] Logged the message, Master [02:26:05] !log LocalisationUpdate completed (1.21wmf7) at Wed Jan 23 02:26:04 UTC 2013 [02:26:07] PROBLEM - Puppet freshness on db50 is CRITICAL: Puppet has not run in the last 10 hours [02:26:08] PROBLEM - Puppet freshness on db68 is CRITICAL: Puppet has not run in the last 10 hours [02:26:15] Logged the message, Master [02:29:59] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [02:32:07] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [02:32:09] PROBLEM - swift-account-replicator on ms-be1008 is CRITICAL: Connection refused by host [02:32:10] PROBLEM - swift-container-server on ms-be1008 is CRITICAL: Connection refused by host [02:32:10] PROBLEM - swift-object-updater on ms-be1008 is CRITICAL: Connection refused by host [02:32:10] PROBLEM - swift-container-updater on ms-be1008 is CRITICAL: Connection refused by host [02:32:20] PROBLEM - swift-container-auditor on ms-be1008 is CRITICAL: Connection refused by host [02:32:20] PROBLEM - swift-object-auditor on ms-be1008 is CRITICAL: Connection refused by host [02:32:20] PROBLEM - swift-account-server on ms-be1008 is CRITICAL: Connection refused by host [02:32:39] PROBLEM - SSH on ms-be1008 is CRITICAL: Connection refused [02:32:40] PROBLEM - swift-object-replicator on ms-be1008 is CRITICAL: Connection refused by host [02:32:40] PROBLEM - swift-account-auditor on ms-be1008 is CRITICAL: Connection refused by host [02:32:49] PROBLEM - swift-object-server on ms-be1008 is CRITICAL: Connection refused by host [02:32:49] PROBLEM - swift-container-replicator on ms-be1008 is CRITICAL: Connection refused by host [02:32:49] PROBLEM - swift-account-reaper on ms-be1008 is CRITICAL: Connection refused by host [02:34:19] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [02:35:43] PROBLEM - swift-account-reaper on ms-be1008 is CRITICAL: Connection refused by host [02:35:43] PROBLEM - swift-object-replicator on ms-be1008 is CRITICAL: Connection refused by host [02:35:52] PROBLEM - swift-object-updater on ms-be1008 is CRITICAL: Connection refused by host [02:36:10] PROBLEM - swift-account-replicator on ms-be1008 is CRITICAL: Connection refused by host [02:36:28] PROBLEM - SSH on ms-be1008 is CRITICAL: Connection refused [02:36:29] PROBLEM - swift-container-updater on ms-be1008 is CRITICAL: Connection refused by host [02:36:38] PROBLEM - swift-container-server on ms-be1008 is CRITICAL: Connection refused by host [02:36:46] PROBLEM - swift-container-auditor on ms-be1008 is CRITICAL: Connection refused by host [02:36:47] PROBLEM - swift-container-replicator on ms-be1008 is CRITICAL: Connection refused by host [02:37:13] 
PROBLEM - swift-object-auditor on ms-be1008 is CRITICAL: Connection refused by host [02:37:19] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [02:37:22] PROBLEM - swift-account-auditor on ms-be1008 is CRITICAL: Connection refused by host [02:37:22] PROBLEM - swift-object-server on ms-be1008 is CRITICAL: Connection refused by host [02:37:40] PROBLEM - swift-account-server on ms-be1008 is CRITICAL: Connection refused by host [02:41:07] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours [02:42:05] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100% [02:45:55] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100% [02:46:49] RECOVERY - Puppet freshness on db65 is OK: puppet ran at Wed Jan 23 02:46:20 UTC 2013 [02:46:58] RECOVERY - MySQL Replication Heartbeat on db65 is OK: OK replication delay 0 seconds [02:47:05] RECOVERY - MySQL Replication Heartbeat on db65 is OK: OK replication delay 0 seconds [02:49:23] !log LocalisationUpdate completed (1.21wmf8) at Wed Jan 23 02:49:23 UTC 2013 [02:49:33] Logged the message, Master [02:51:05] RECOVERY - MySQL Replication Heartbeat on db68 is OK: OK replication delay 0 seconds [02:51:10] RECOVERY - Puppet freshness on db68 is OK: puppet ran at Wed Jan 23 02:50:46 UTC 2013 [02:51:46] RECOVERY - MySQL Replication Heartbeat on db68 is OK: OK replication delay 0 seconds [02:53:16] RECOVERY - Puppet freshness on db57 is OK: puppet ran at Wed Jan 23 02:53:13 UTC 2013 [02:53:56] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [02:54:15] RECOVERY - MySQL Replication Heartbeat on db57 is OK: OK replication delay 0 seconds [02:54:28] RECOVERY - MySQL Replication Heartbeat on db57 is OK: OK replication delay 0 seconds [02:55:22] RECOVERY - Puppet freshness on db50 is OK: puppet ran at Wed Jan 23 02:55:10 UTC 2013 [02:55:40] RECOVERY - MySQL Replication Heartbeat on db50 is OK: OK replication delay 0 seconds [02:55:56] RECOVERY - MySQL Replication Heartbeat on db50 is OK: OK replication delay 0 seconds [03:02:52] RECOVERY - Puppet freshness on db66 is OK: puppet ran at Wed Jan 23 03:02:21 UTC 2013 [03:02:55] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay 0 seconds [03:03:10] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay 0 seconds [03:07:49] RECOVERY - Puppet freshness on db60 is OK: puppet ran at Wed Jan 23 03:07:27 UTC 2013 [03:07:58] RECOVERY - MySQL Replication Heartbeat on db60 is OK: OK replication delay 0 seconds [03:08:25] RECOVERY - MySQL Replication Heartbeat on db60 is OK: OK replication delay 0 seconds [03:08:52] RECOVERY - Puppet freshness on db55 is OK: puppet ran at Wed Jan 23 03:08:39 UTC 2013 [03:09:05] RECOVERY - MySQL Replication Heartbeat on db55 is OK: OK replication delay 0 seconds [03:09:46] RECOVERY - MySQL Replication Heartbeat on db55 is OK: OK replication delay 0 seconds [03:11:23] ssh trouser.org [03:11:26] garg. [03:14:55] is that getting caught with your pants down? 
[03:15:10] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [03:15:10] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [03:15:10] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [03:15:10] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [03:15:10] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [03:15:11] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [03:15:11] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [03:16:12] binasher: the way db-*.php is selected is not quite what I expected [03:16:35] when eqiad is the active data centre, hosts in tampa need to be using the eqiad masters [03:16:37] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [03:16:42] it doesn't matter where the actual client is [03:16:52] binasher: I suppose so. At least this time it wasn't a passphrase. [03:16:57] so it's kind of weird to be using getRealmSpecificFilename() [03:17:22] presumably all the cron jobs on hume are broken, as well as testwiki [03:18:19] i thought it was unfortunate that function wants to stat realm files only used by labs before looking at site [03:19:09] there should really be an eqiad replacement for hume [03:19:29] i guess testwiki needs to stay in ptmpa while it depends on nfs [03:19:56] sure, there should be, but we discussed this, it was decided that there wouldn't be [03:20:15] not at first [03:21:27] that's annoying [03:21:55] ideally, maintenance scripts running in hume would use the eqiad masters but the pmtpa slaves [03:22:18] we could have puppet set /etc/mediawiki-site to $::mw_primary on specific hosts, instead of to $::site [03:22:44] bast1001 is still not anything like a replacement for fenari, it has no dsh for one thing [03:23:17] so it will be necessary to run maintenance scripts on fenari for deployment [03:23:49] I was thinking about just having it set to a string literal in CommonSettings.php [03:24:10] that sounds better [03:24:23] $wmgActiveCluster = 'eqiad'; [03:25:16] then you would just change it and sync it out [03:29:33] things like sectionsByDB shouldn't be duplicated between these two files [03:29:41] the inactive one will rot [03:30:59] I think the best thing to do is to merge them back into one file [03:32:30] db.php needs to be parseable by switch.php, right? [03:32:39] you still use that master switch script don't you? [03:33:42] last time I checked, it was still in subversion [04:01:33] New patchset: Tim Starling; "Quick hack to get pmtpa out of read-only mode" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45296 [04:04:26] anyone still around? [04:05:30] didn't think so [04:05:37] TimStarling: ahoy [04:05:42] probably not who you were hoping for [04:05:49] but if I can help, let me know. 
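The "$wmgActiveCluster = 'eqiad';" idea discussed above could look roughly like this in CommonSettings.php: one literal that says which data centre holds the masters, changed and synced out at switchover, so maintenance hosts in the passive site still write to the active masters. This is only a sketch; the variable name is from the discussion, but the file layout and $wmfConfigDir usage are assumptions about wmf-config, not its actual contents.

    <?php
    // Which data centre currently holds the masters. Edited by hand and
    // synced out during a switchover, regardless of where this code runs.
    $wmgActiveCluster = 'eqiad';

    // Every host, in either data centre, loads the db config for the active
    // cluster, so cron/maintenance hosts in pmtpa write to the eqiad masters
    // rather than picking a file based on their own location.
    require_once "$wmfConfigDir/db-$wmgActiveCluster.php";

It keeps the selection out of getRealmSpecificFilename() entirely, though it does nothing about the duplication (sectionsByDB and friends) between the two db files that the discussion also wants merged back into one.
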
[04:06:09] thanks ori-l [04:06:15] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45296 [04:07:47] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 04:07:37 UTC 2013 [04:07:57] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 04:07:47 UTC 2013 [04:08:08] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [04:08:11] apparently someone deleted docroot/noc locally on fenari and didn't commit it [04:08:27] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 04:08:20 UTC 2013 [04:08:37] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [04:08:48] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 04:08:43 UTC 2013 [04:08:55] that's fine! I will clean up while the ops team parties! ;) [04:09:08] RECOVERY - Puppet freshness on db65 is OK: puppet ran at Wed Jan 23 04:08:59 UTC 2013 [04:09:08] RECOVERY - Puppet freshness on db68 is OK: puppet ran at Wed Jan 23 04:09:01 UTC 2013 [04:09:08] RECOVERY - Puppet freshness on db57 is OK: puppet ran at Wed Jan 23 04:09:02 UTC 2013 [04:09:08] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [04:09:08] RECOVERY - Puppet freshness on db50 is OK: puppet ran at Wed Jan 23 04:09:03 UTC 2013 [04:09:08] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 04:09:06 UTC 2013 [04:09:08] RECOVERY - Puppet freshness on db66 is OK: puppet ran at Wed Jan 23 04:09:06 UTC 2013 [04:09:17] RECOVERY - Puppet freshness on db55 is OK: puppet ran at Wed Jan 23 04:09:09 UTC 2013 [04:09:18] RECOVERY - Puppet freshness on db60 is OK: puppet ran at Wed Jan 23 04:09:09 UTC 2013 [04:09:38] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [04:10:05] !log tstarling synchronized wmf-config/db-pmtpa.php 'back to read/write mode (with eqiad master)' [04:10:07] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [04:10:09] Logged the message, Master [04:10:26] heh. [04:11:07] !log tstarling synchronized wmf-config/db-eqiad.php 'cleanup, added pmtpa host/IP mappings' [04:11:19] Logged the message, Master [04:15:27] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 04:15:25 UTC 2013 [04:15:37] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [04:19:45] that had an interesting effect on http://noc.wikimedia.org/dbtree/ :) [04:20:39] binasher: do you know what is going on with the noc docroot? [04:21:22] i probably fucked it up, let me look [04:23:50] wow, i sure did. fixed. [04:25:34] there's still a complete copy of the mediawiki-config repository in docroot/mediawiki-config, should that be deleted? [04:25:51] hey I wonder if there's another copy in docroot/mediawiki-config/docroot/mediawiki-config [04:27:44] deleted [04:27:51] thanks [04:27:55] * binasher picked the wrong day to stop sniffing glue  [04:28:34] yeah that can really mess with your head [04:29:14] ok, what else is broken? 
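The stray copy of mediawiki-config found inside docroot/ above is the sort of thing a quick scan for unexpected .git directories will catch. A sketch, with an assumed docroot path:

    <?php
    // Walk a docroot and report any nested git checkouts below the top level.

    $docroot = '/home/wikipedia/common/docroot';   // assumed path

    $it = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator( $docroot, FilesystemIterator::SKIP_DOTS ),
        RecursiveIteratorIterator::SELF_FIRST
    );

    foreach ( $it as $entry ) {
        if ( $entry->isDir() && $entry->getFilename() === '.git' ) {
            echo "nested repository: " . dirname( $entry->getPathname() ) . "\n";
        }
    }
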
[04:58:18] New patchset: Tim Starling; "Fix 404s for nonexistent domains" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45298 [04:58:45] Change merged: Tim Starling; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45298 [05:06:54] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [05:07:24] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [05:09:19] New patchset: Tim Starling; "Allow sync-apache to be run as non-root" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45300 [05:11:32] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45300 [06:09:54] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours [06:16:15] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:25:45] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 06:25:42 UTC 2013 [06:25:55] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 06:25:50 UTC 2013 [06:26:16] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:26:16] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 06:26:10 UTC 2013 [06:26:26] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [06:26:35] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 06:26:25 UTC 2013 [06:27:15] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:27:26] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [06:35:15] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [06:35:20] New patchset: Tim Starling; "Bug 43448: don't use threads in varnishhtcpd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45302 [06:35:58] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45302 [06:36:04] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [06:41:01] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [06:43:45] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 06:43:37 UTC 2013 [06:44:15] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:46:33] TimStarling: I find it curious that you're not logging failures. 
[07:13:57] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45243 [07:15:59] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36146 [07:20:44] New patchset: ArielGlenn; "remove the images::rsync(d) classes from use too (see change I8bf98650)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45303 [07:21:57] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45303 [07:58:40] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [07:59:06] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [08:07:40] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 08:07:34 UTC 2013 [08:07:41] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [08:07:50] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 08:07:43 UTC 2013 [08:07:51] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 08:07:48 UTC 2013 [08:08:00] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [08:08:01] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 08:07:55 UTC 2013 [08:08:41] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [08:09:00] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [08:15:50] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 08:15:42 UTC 2013 [08:16:00] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [08:27:30] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 08:27:25 UTC 2013 [08:27:40] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [09:18:09] if (req.request == "PURGE") { [09:18:09] if (!client.ip ~ purge) { [09:18:09] error 405 "Denied."; [09:18:13] what is this meant to do? [09:19:07] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [09:25:26] never mind [09:40:08] PROBLEM - Varnish HTCP daemon on cp1021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (varnishhtcpd), args varnishhtcpd worker [09:42:38] PROBLEM - Varnish HTCP daemon on cp1021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (varnishhtcpd), args varnishhtcpd worker [10:00:56] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [10:03:06] TimStarling: ^^. Not sure if that's one of the instances you rolled out your rewrite to. [10:03:39] it's the one I'm using for testing [10:03:43] it's broken on all of them [10:04:15] oh. [10:04:39] also, re: client.ip ~ purge: see the third example on https://www.varnish-cache.org/docs/2.1/tutorial/vcl.html. '~' operator can match ACLs. [10:09:59] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [10:20:11] New patchset: Tim Starling; "Fix another bug in varnishhtcpd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45311 [10:23:48] New review: Tim Starling; "I confirmed that this bug affects the original version. I tested this version on cp1021." 
[operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45311 [10:23:49] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45311 [10:25:49] ori-l: in fact I can't see how it ever could have worked [10:26:38] I confirmed it with the original code, it definitely doesn't do what it says on the tin [10:26:45] RECOVERY - Varnish HTCP daemon on cp1021 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [10:27:09] RECOVERY - Varnish HTCP daemon on cp1021 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [10:27:53] argh [10:27:59] "more than half of requests result in a 404"! [10:28:02] https://gerrit.wikimedia.org/r/#/c/29786/1/files/varnish/varnishhtcpd,unified [10:28:07] there's the answer to that then [10:28:21] wikipedia dark matter [10:28:23] it would have worked before that change [10:29:19] * TimStarling hits mark with the testbat [10:31:21] how would it have worked before that change? proxy is still only set once, outside the loop [10:31:24] ah, except for the URL error [10:32:29] # This is a stupid hack to make varnishhtcpd work - it's using a perl mod that sends purge reqs like [10:32:30] # PURGE http://de.wikipedia.orghttp://de.wikipedia.org/w/index.php [10:32:30] } elsif (req.url ~ "^http://upload.wikimedia.org") { [10:32:30] set req.url = regsub ( req.url, "^http://[\w.]+(/.*)", "\1"); [10:33:02] instead of fixing two lines of perl code [10:34:26] '[\w.]+'? [10:37:14] is '.' in a character class a literal dot? [10:37:56] yes. nevermind. [10:43:17] right, not so easy to fix actually [10:43:24] the bug is in LWP, cooperating with varnish [10:43:38] two bugs, I guess [10:43:46] both will deny it though, if you report it to them [10:45:24] let's blame RFC 2616 [11:03:07] hey [11:04:31] hello [11:04:47] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [11:05:35] having fun with varnish? [11:05:38] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [11:05:41] yeah [11:05:55] done with puppet? [11:05:56] :P [11:06:12] this is definitely a varnish bug, RFC 2616 is actually really clear [11:06:40] what is? [11:07:00] say if I do: curl -vvv -x 10.64.0.143:80 'http://upload.wikimedia.org/kzizzle' [11:07:18] then in the log we see: cp1021.eqiad.wmnet 10676 2013-01-23T11:05:44 0.057930470 208.80.152.165 miss/404 882 GET http://upload.wikimedia.orghttp://upload.wikimedia.org/kzizzle - - - - curl/7.22.0%20(x86_64-pc-linux-gnu)%20libcurl/7.22.0%20OpenSSL/1.0.1%20zlib/1.2.3.4%20libidn/1.23%20librtmp/2.3 [11:07:34] the protocol and host is doubled [11:07:49] http://upload.wikimedia.orghttp://upload.wikimedia.org/kzizzle [11:07:56] aha [11:08:26] there's a comment in the varnish configuration that blames varnishhtcpd for sending such URLs [11:08:39] but actually varnishhtcpd is sending the exact right thing [11:08:54] TimStarling: seeing your two recent varnishhtcpd patches, do you have some concrete ideas what's needed to improve/fix https://bugzilla.wikimedia.org/show_bug.cgi?id=43449 ("Monitor effectiveness of HTCP purging")? 
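The request shape at issue above can be reproduced directly: the purge client sends an absolute request-URI plus a Host header (what LWP emits once a proxy is configured), and the question is how the cache interprets that request line, which is where the doubled protocol-plus-host URL in the log comes from. A sketch; the cache address is the test IP quoted above and should be treated as a placeholder.

    <?php
    // Send a PURGE with an absolute URI in the request line plus a Host
    // header, the form RFC 2616 section 5.1.2 says servers must accept, and
    // print the status line to see how the cache handled it.

    $cache = '10.64.0.143';   // frontend cache used for testing above; placeholder
    $url   = 'http://upload.wikimedia.org/kzizzle';
    $host  = 'upload.wikimedia.org';

    $fp = fsockopen( $cache, 80, $errno, $errstr, 5 );
    if ( !$fp ) {
        die( "connect failed: $errstr ($errno)\n" );
    }

    fwrite( $fp, "PURGE $url HTTP/1.1\r\n" .
                 "Host: $host\r\n" .
                 "Connection: close\r\n\r\n" );

    echo fgets( $fp );   // first line of the cache's response
    fclose( $fp );
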
[11:09:05] and RFC 2616 requires even non-proxy webservers to understand it [11:09:11] presumably it sends PURGE http://upload.wikimedia.org/foo HTTP/1.1 instead of PURGE /foo HTTP/1.1 [11:09:23] yes, with a Host header as well [11:09:34] so varnish does http:// + host header + request URI [11:10:12] but the spec says "To allow for transition to absoluteURIs in all requests in future versions of HTTP, all HTTP/1.1 servers MUST accept the absoluteURI form in requests, even though HTTP/1.1 clients will only generate them in requests to proxies." [11:10:51] what?! [11:11:07] wow, I didn't know that [11:11:22] andre__: just purge a URL, request it, and check its Age header [11:11:27] it should be less than some threshold [11:11:55] http://tools.ietf.org/rfcmarkup?doc=2616#section-5.1.2 [11:12:54] I didn't know there was a goal of transitioning to absolute URLs in future versions of HTTP either [11:13:00] although this might have changed since [11:13:29] maybe that's why every webserver except varnish seems to work just fine when you use it as a proxy [11:13:49] so LWP does the right thing according to the RFC, since it has a proxy configured [11:13:51] I thought it was a very useful defacto standard, but in fact it's a real standard [11:14:04] yes, LWP and curl are doing the right thing, the same thing [11:14:16] sending an absolute URI with a host header [11:14:37] the section on host headers in RFC 2616 makes it clear that a host header is required, there's no exception for absolute URIs [11:15:32] anyway, purging should be fixed now, other than that [11:20:49] I guess I'll close the bug [11:28:01] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 27.31 ms [11:30:38] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [11:45:18] PROBLEM - NTP on ms-be1008 is CRITICAL: NTP CRITICAL: No response from NTP server [11:54:28] PROBLEM - NTP on ms-be1008 is CRITICAL: NTP CRITICAL: No response from NTP server [11:54:48] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [12:01:00] https://www.varnish-cache.org/trac/ticket/1255 [12:06:20] I'm surprised you didn't fix that [12:06:24] ;) [12:06:37] it's getting late [12:06:42] maybe tomorrow ;) [12:07:48] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 12:07:46 UTC 2013 [12:08:49] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [12:08:59] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 12:08:51 UTC 2013 [12:09:49] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [12:16:01] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [12:27:50] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [12:28:31] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [12:29:10] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [12:29:16] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 2.63 ms [12:32:01] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [12:33:11] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [12:42:46] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours [12:54:00] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [12:54:01] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.105 second response time [12:54:02] 
RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.099 second response time [12:56:13] New patchset: Demon; "Refactor wikibugs manifest" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45320 [12:57:22] New patchset: Demon; "Refactor wikibugs manifest" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45320 [13:15:29] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45320 [13:16:49] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [13:16:50] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [13:16:50] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [13:16:50] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [13:16:50] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: Puppet has not run in the last 10 hours [13:16:50] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [13:16:50] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [13:16:51] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [13:30:49] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [13:39:44] New patchset: Demon; "Apply wikibugs manifest to mchenry" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45322 [13:40:01] New patchset: Demon; "Apply wikibugs manifest to mchenry" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45322 [14:05:37] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [14:27:50] Hi folks, we're getting scattered reports of incorrect date/time issues. Known issue? http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Users_reporting_site_time_issues [14:30:30] <^demon> Hmm, those reports all look like they're from anons. [14:30:35] <^demon> It's showing correct for me logged in. [14:31:54] Yeah, me too. [14:33:24] <^demon> Editing as an anon seems to work too, I'm getting the correct timestamp. [14:33:25] I'm seeing Typhoon Rusa as anon [14:33:39] varnishhtcpd troubles? [14:33:40] <^demon> I see Skye. [14:33:45] <^demon> As anon. [14:33:49] didn't purge Main Page [14:33:54] mark: ^^^ [14:34:52] squid [14:34:53] hmm [14:35:32] paravoid@serenity:~$ GET -H "User-Agent: foo" -H "Host: en.wikipedia.org" -Used 'http://wikipedia-lb.esams.wikimedia.org/w/index.php?title=Main_Page&action=history' |grep Last-Modified [14:35:36] Last-Modified: Tue, 22 Jan 2013 22:41:11 GMT [14:35:38] paravoid@serenity:~$ GET -H "User-Agent: foo" -H "Host: en.wikipedia.org" -Used 'http://wikipedia-lb.eqiad.wikimedia.org/w/index.php?title=Main_Page&action=history' |grep Last-Modified [14:35:42] Last-Modified: Wed, 23 Jan 2013 14:34:59 GMT [14:36:05] yeah, it's text [14:38:32] i'm seeing no purges for text indeed [14:40:14] no idea where to start debugging it [14:40:30] i'm on it [14:41:10] purges from eqiad apaches are not reaching dobson [14:42:10] same for file reuploads? [14:44:00] <^demon> Philippe|Away: I replied on-wiki. [14:44:06] Thanks, ^demon [14:44:11] <^demon> yw [14:45:25] Philippe|Away: and thank you for reporting it here :) [14:45:31] Thanks, all. Imma bug out now... :) [14:49:26] looks like a multicast routing problem [14:58:32] fixed [14:59:02] <^demon> Awesome, I'll report that. Will users need to purge any pages? 
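For bug 43449 ("Monitor effectiveness of HTCP purging"), the approach suggested earlier in the log is simply: purge a URL, request it again, and check that the Age header is under some threshold. A sketch of that check; the URL and the 60-second threshold are arbitrary, and issuing the purge itself (HTCP or action=purge) is left out.

    <?php
    // After purging a URL, fetch it and make sure the cache's Age header is
    // below a threshold; a large Age means the purge never took effect.

    $url    = 'http://en.wikipedia.org/wiki/Main_Page';
    $maxAge = 60;   // seconds

    $ch = curl_init( $url );
    curl_setopt_array( $ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER         => true,
        CURLOPT_NOBODY         => true,
        CURLOPT_USERAGENT      => 'purge-check-sketch',
    ) );
    $headers = curl_exec( $ch );
    curl_close( $ch );

    $age = 0;
    if ( preg_match( '/^Age:\s*(\d+)/mi', $headers, $m ) ) {
        $age = (int)$m[1];
    }

    if ( $age > $maxAge ) {
        echo "STALE: Age is {$age}s, the purge probably never arrived\n";
        exit( 1 );
    }
    echo "OK: Age is {$age}s\n";
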
[14:59:09] just a sec, let me verify [15:00:32] yeah looking much better [15:00:45] ^demon: yes... unfortunately all purges since the migration have gone lost, except within eqiad itself [15:00:50] so in practice, that affects europe [15:01:07] <^demon> Mmk. [15:02:53] multicast can be finicky [15:03:14] one missing pim interface statement somewhere in the edges of your network can cause problems [15:04:10] <^demon> Ugh. Just saw a report on enwiki of "No email notifs since yesterday." [15:04:22] right [15:04:26] probably a missing acl on the mail server [15:04:29] * mark checks [15:05:12] fixed, I think [15:05:14] <^demon> If not, jobqueue? *shudder* [15:07:52] for what? [15:08:57] perhaps it's not fixed, not sure [15:09:48] !log Added missing pim interface ae0.400 statement on cr2-pmtpa which prevented eqiad purges from reaching pmtpa [15:10:00] Logged the message, Master [15:10:33] <^demon> paravoid: E-mail notifs for watchlist changes. [15:10:50] <^demon> Ah, I just got a notification. Maybe it is fixed. [15:13:44] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [15:17:37] hi [15:18:31] hi [15:18:43] I've added another ssh key to labs.wikimedia.org [15:18:49] but it seems I can't use it on stat1 [15:19:15] says "permission denied" [15:19:33] so one of my keys, worked fine, but when I added this second one, it doesn't work now [15:19:36] paravoid: hi paravoid [15:19:41] my username is spetrea [15:20:52] is stat1 connected to labs somehow? [15:21:10] nope [15:21:57] your key is statically defined in puppet [15:22:06] nothing to do with labs [15:24:34] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100% [15:25:35] cmjohnson1: hey [15:25:41] cmjohnson1: ms-be1008 has been rebooting a lot since yesterday [15:29:41] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100% [15:35:15] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:36:41] New patchset: Matthias Mullie; "Reinstate AFTv5 test groups" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45329 [15:36:59] paravoid: what can I do to have it in the puppet ? [15:37:20] paravoid: is it hardcoded in puppet ? so I should create a gerrit patchset ? [15:38:07] paravoid: i've rebooted it once today..it took over 12 hours to come back up from yesterday. 
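The missing pim interface statement above meant eqiad purges never made it to pmtpa, and nothing noticed for hours. One low-tech way to check that purge traffic is reaching a host at all is to join the multicast group and count datagrams for a while. The group and port below are assumptions (4827 is the standard HTCP port, the group address is a guess at the purge group), and the multicast join via socket_set_option() needs PHP 5.4 or later.

    <?php
    // Join the purge multicast group on this host and count datagrams for
    // 30 seconds; zero received while edits are happening suggests a
    // multicast routing problem like the one fixed above.

    $group = '239.128.0.112';   // assumed purge group
    $port  = 4827;              // standard HTCP port

    $sock = socket_create( AF_INET, SOCK_DGRAM, SOL_UDP );
    socket_set_option( $sock, SOL_SOCKET, SO_REUSEADDR, 1 );
    socket_set_option( $sock, SOL_SOCKET, SO_RCVTIMEO, array( 'sec' => 1, 'usec' => 0 ) );
    socket_bind( $sock, '0.0.0.0', $port );
    socket_set_option( $sock, IPPROTO_IP, MCAST_JOIN_GROUP,
        array( 'group' => $group, 'interface' => 0 ) );

    $count = 0;
    $until = time() + 30;
    while ( time() < $until ) {
        // Returns false (warning suppressed) when the 1-second timeout hits.
        if ( @socket_recvfrom( $sock, $buf, 65535, 0, $from, $fromPort ) !== false ) {
            $count++;
        }
    }
    echo "received $count purge datagrams in 30s on $group:$port\n";
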
[15:38:08] yes [15:38:15] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:43:55] New patchset: Demon; "Cleanup ldap script formatting" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45330 [15:44:43] New patchset: Demon; "Cleanup ldap script formatting" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45330 [15:52:38] New patchset: Demon; "Remove legacy svn hooks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45332 [16:07:51] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 16:07:44 UTC 2013 [16:08:02] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 16:07:58 UTC 2013 [16:08:41] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [16:08:42] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [16:08:51] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 16:08:42 UTC 2013 [16:08:52] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 16:08:50 UTC 2013 [16:09:42] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [16:09:42] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [16:10:22] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours [16:10:42] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 16:10:39 UTC 2013 [16:10:42] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [16:21:31] New review: Cmcmahon; "I need this also" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/45329 [16:35:32] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [16:37:14] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [16:42:20] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [16:55:33] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [16:57:34] mutante: around? [17:07:34] anyone want to review those DNS changes before I authdns-update? [17:11:16] I guess [17:11:34] svn diff is empty on sockpuppet [17:13:07] I've commited them [17:13:13] they were a bunch of them [17:13:23] it didn't make sense to have them as a huge diff [17:14:16] svn diff -r 3860:3865 [17:14:21] or your inbox :-) [17:14:30] mark: ^ [17:15:12] you're removing download.wikimedia.org? [17:15:18] no [17:15:21] download.esams.wikimedia.org [17:15:25] ah [17:15:28] svn diff is nasty there [17:15:32] not enough context ;) [17:15:35] I know [17:15:48] commit messages help :-) [17:16:36] go ahead [17:21:07] oh dear [17:21:13] wikimedia.org has a whole esams section duplicated [17:22:50] yeah I really wonder how that happened [17:23:16] yyp [17:24:20] btw I guess swift is still sending requests to the tampa image scalers [17:25:11] yes [17:25:35] isn't it better than going all the way back to eqiad [17:25:37] ? [17:25:52] I think it's fine for now [17:26:06] just making sure everyone is aware [17:26:40] ok [17:38:41] PROBLEM - Varnish HTCP daemon on cp1041 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (varnishhtcpd), args varnishhtcpd worker [17:41:05] New patchset: Andrew Bogott; "Don't remove default apache site." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/45339 [17:41:05] mark: have the courage to review "svn diff -c 3867" too? [17:41:26] PROBLEM - Varnish HTCP daemon on cp1041 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (varnishhtcpd), args varnishhtcpd worker [17:42:03] gosh svn is annoying now [17:42:22] yes :) [17:43:28] i think that's fine [17:43:52] thanks a lot [17:44:18] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45339 [17:44:24] !log running authdns-update - a bunch of cleanups, r3861:3867 [17:44:35] Logged the message, Master [17:47:12] New patchset: Asher; "set SO_RCVBUF on mcast udp listen socket to minimize buffer overruns, add --buffer config option" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45340 [17:50:05] New patchset: Asher; "set SO_RCVBUF on mcast udp listen socket to minimize buffer overruns, add --buffer config option" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45340 [17:54:14] perl.. at least it's unixy and kinda follows libc [17:54:38] hehe [17:55:15] btw, in your stats mail you said "everyone hates perl" [17:55:20] I don't :) [17:55:32] RECOVERY - Varnish HTCP daemon on cp1041 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [17:55:36] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45340 [17:55:41] RECOVERY - Varnish HTCP daemon on cp1041 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [17:56:56] paravoid: let's port mediawiki to perl 6. it can have the slogan "it can't be any worse" [17:57:11] hahaha [17:57:27] there's plenty to hate about perl tbh [17:57:28] I remember a Perl6 presentation where they said you can make your own syntax [17:58:30] it's slightly amusing that the original varnishhtcpd with its thread deadlocks was written by the guy who actually wrote the threading support in perl5 [17:59:00] who's that? [17:59:02] Artur? [17:59:31] yeah [17:59:40] hey there: https://meta.wikimedia.org/wiki/Servers#Hosting should probably be updated to reflect reality. [18:00:09] And a quick question: does WMF own its servers, or is it renting them at the different facilities? [18:00:14] reality.. shit just got real [18:00:21] New patchset: Silke Meyer; "Minor fixes related to images" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45343 [18:03:10] anyone? 
[18:03:29] abartov - we own those servers [18:03:45] !log amended wikiadmin grants on s7 to allow new jobrunners access to centralauth [18:03:46] thanks, woosters [18:03:55] Logged the message, Master [18:04:02] that page is indeed dated [18:04:53] this page should probably be deleted or marked historical only - http://wikitech.wikimedia.org/view/Server_roles [18:05:47] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [18:06:07] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [18:12:31] binasher: updated [18:13:40] ori-l: thanks :) i'm not sure why i took the time to type into irc instead of onto the wiki page, heh [18:14:09] happy to help [18:14:24] oh gosh [18:14:29] having to edit that table sucked [18:14:37] abartov: nowadays one has to learn puppet to (more or less) know what servers do [18:15:59] Change abandoned: Silke Meyer; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35173 [18:18:48] New review: Andrew Bogott; "lgtm" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45343 [18:18:57] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45343 [18:21:32] mark re. OTRS upgrade testing...I'm going to need a way to move a very large db dump from db49 or db1048 to colby. I'm not entirely sure how to do that given the sandboxing. suggestions? [18:26:50] fuck. forget it for now. colby doesn't have enough disk space for this project. [18:28:16] RD: ^ [18:32:01] heya, notpeter, i gotta run and get some lunch real soon [18:32:17] got a minute now, or hsoudl I find you after I get back? [18:36:03] Jeff_Green, I realise you may still be busy post-migration but "forget it for now" isn't exactly the best answer for a project we've been waiting on for years, so I hope you're not going to forget it for too long... :( [18:36:27] ok lunchtime, be back in a bit [18:37:49] ottomata: sorry about that. ping me when you're back [18:38:13] no probs ok [18:38:28] actually, notpeter, i haven't left yet [18:38:33] if you want a description of the problem you can cehck it out while I'm gone [18:38:34] eh? [18:38:35] sure, whenever's good for you [18:38:37] ok so [18:38:44] i'm testing this on cp1044.wikimedia.org [18:38:48] diederik linked a ticket [18:38:53] ah right [18:38:53] cool, I'll poke around there [18:38:54] cool, yeah [18:38:57] basically [18:39:17] varnishncsa log format (/etc/defaut/varnishncsa) is set to log the %{X-Carrier}i header [18:39:20] which i think is what we want [18:39:30] if I start my own local varnishncsa instance with that log format [18:39:35] ottomata: that is NOT what we want [18:39:35] that header gets printed out [18:39:56] but, the logs going to the main firehose do not have that header [18:39:59] preilly, oh no? [18:40:03] you'd rather have X-CS? [18:40:06] ottomata: NO you want X-CS [18:40:08] X-CS does not have the country name in it [18:40:10] so [18:40:14] Thehelpfulone: I'm not referring to the project, I'm referring to my request to mark for guidance on moving a db around. there's no point moving a dump to a box that's not viable as currently configured. 
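Asher's patch merged above sets SO_RCVBUF on the multicast listen socket and adds a --buffer option so purge bursts don't overflow the kernel's default receive buffer. The daemon itself is Perl; this is just the same setsockopt call sketched with PHP's socket functions, including reading the value back, since the kernel silently clamps requests above net.core.rmem_max.

    <?php
    // Ask for a larger receive buffer on a UDP listen socket and report what
    // the kernel actually granted.

    $requested = 4 * 1024 * 1024;   // 4 MB; the real value would come from --buffer

    $sock = socket_create( AF_INET, SOCK_DGRAM, SOL_UDP );
    socket_set_option( $sock, SOL_SOCKET, SO_RCVBUF, $requested );

    // Linux reports roughly double the usable size (bookkeeping overhead) and
    // caps the request at net.core.rmem_max, so always check the result.
    $granted = socket_get_option( $sock, SOL_SOCKET, SO_RCVBUF );
    echo "asked for $requested bytes, kernel reports $granted\n";
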
[18:40:27] ottomata: that is the correct header to log [18:40:34] you won't be able to differnetiate between orange cameroon and orange tunisai [18:40:37] DO NOT LOG X-Carrier [18:41:00] erosen thought it would be more useful to have x-carrier [18:41:01] ottomata: you can change X-CS to be unique in Varnish [18:41:03] i mean, sure i'll do whatever [18:41:10] (no one told me this when I originally did it, btw) [18:41:15] so X-Carrier is happening currently [18:41:18] ottomata: it's in the RT ticket [18:41:46] https://rt.wikimedia.org/Ticket/Display.html?id=3158 [18:41:47] ? [18:42:00] but [18:42:00] anyway [18:42:10] asher mentioned it [18:42:13] apparently [18:42:15] yes [18:42:20] but i mean originally [18:42:22] this was deployed with x-carrier [18:42:25] he has since mentioned it [18:42:35] but, notpeter, the use of x-carrier or x-cs is irrelvant to the problem [18:42:38] the headers are being set [18:42:44] but they are not making into main firehose [18:42:50] ottomata: we should really make the X-CS header be the http://en.wikipedia.org/wiki/Mobile_country_code [18:42:52] we can argue about which one to log aside from that issue [18:43:05] ottomata: ok, I'll take a look at that. and you and preilly can hash out the rest [18:43:17] ottomata: like a X-CS = MCC-MNC [18:43:21] that's cool [18:43:26] its not that right now though [18:43:36] soooo! should we open a new ticket [18:43:40] notpeter: look to your right [18:43:41] i can make it log X-CS no prob [18:43:57] varnish .vcl has to be modified then to set the X-CS properly [18:44:41] ottomata: yes [18:45:20] ok cool, well, notpeter and I will sort this out then, and I'll submit a patch to make it log X-CS, and update the logging RT [18:45:25] you wanna take on fixing the .vcl? [18:46:26] I'll give it a shot and call in reinforcements as needed. and it seems like you, drdee, and preilly should chat about what data should be logged. as an architect, this is his job, afterall :) [18:46:47] ottomata: you'll need to modify, "templates/varnish/mobile-frontend.inc.vcl.erb" for the correct X-CS values based on: http://en.wikipedia.org/wiki/Mobile_country_code [18:47:13] I think preilly's suggestion is great! [18:48:01] ottomata: So, the X-CS value that is currently, "CL" will become, "502-13" [18:48:04] New patchset: Ryan Lane; "Add additional wikis for development in labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45349 [18:48:44] fiiiine w me! [18:48:50] ottomata: which equals the X-Carrier of, "Celcom Malaysia" [18:48:56] ok cool, thanks notpeter, [18:49:04] ottomata: Are you upset about this? [18:49:04] New patchset: Ryan Lane; "Add additional wikis for development in labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45349 [18:49:09] nonono [18:49:10] not at all [18:49:16] paravoid: You around? [18:49:17] that was not a snarky fine [18:49:26] preilly: yes [18:49:26] notpeter: ignore what ottomata says in the rt ticket, it's wrong. do what i said earlier. need to log this via the response head %{}o not %{}i [18:49:26] that was fine like a 'fiiiiiine southern spring day' [18:49:40] binasher, cool! [18:49:42] why? [18:49:56] when I run varnishncsa manually, I get lots for %{}i but not for %{}o [18:50:00] logs* [18:50:05] paravoid: Are you cool with X-CS being: MCC-MNC [18:50:07] preilly: what's up? [18:50:09] ottomata: you ran it for the wrong varnish instance [18:50:15] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45349 [18:50:20] oh? 
[18:50:26] on cp1044? [18:50:32] are there more than one running? [18:50:38] when you just run "varnishncsa -F whatever" on a production varnish host, you are getting the backend varnish instance [18:50:46] logging is always done from the frontend [18:51:01] ahhhh [18:51:25] notpeter: what i said in the ticket, cut and paste. ignore rest of ticket. [18:52:03] but isn't the frontend instance still setting the req header? [18:52:22] yes, that is indeed where it is being set [18:52:56] why would logging the resp header in the front end be correct then? [18:53:36] magic! [18:54:28] paravoid: Have you looked at: https://github.com/wikimedia/Sartoris at all? [18:54:43] preilly: I haven't [18:54:59] paravoid: You have pretty strong Python skills right? [18:55:34] I can manage [18:55:36] what's up? [18:55:50] paravoid: That is our GitDeploy replacement [18:56:02] paravoid: It would be cool if you could take a look at it at some point [18:56:27] paravoid: Ryan_Lane and rfaulkner are working on it [18:56:28] as in the git-deploy part of git-deploy? ;) [18:56:38] yes [18:56:43] paravoid: Yes as in the perl-kiss-oh-death part of it [18:57:12] I needed to task switch to labs for a bit this week, though [18:57:14] we don't use gerrit for that? [18:57:19] it needed some love [18:57:27] I honestly want to switch to gerrit for it [18:57:30] https://github.com/git-deploy/git-deploy go bye-bye [18:57:33] and just let the changes replicate across [18:57:57] I'm not much of a fan of git-hub's process [18:57:58] paravoid: We can move it to Gerrit [18:58:16] paravoid: I'm just NOT a huge fan of Gerrit [18:58:16] as sad as it is, I'm now used to gerrit and find it easier :D [18:58:20] New patchset: Dzahn; "add wikivoyager.org|.de redirects to wikivoyage - RT-4333" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45351 [18:58:28] * preilly cries  [18:58:43] wants to keep gerrit [18:58:51] I wanted to modify a pull request yesterday [18:58:55] apparently that's impossible [18:59:01] if it's someone else's PR [18:59:11] Ryan_Lane: it's NOT impossible [18:59:15] someone else's? [18:59:42] ottomata: varnishlog would let you see it as a request header from the frontend, because it can see request headers as sent from the client, and as sent to the backend as separate entities. In varnish 3.0, varnishncsa could see requests as seen from the front or back and had a -c switch for just from the client, but that became the default and only way it works in 3.0.1 or .2. varnishlog can log vcl set request headers a [18:59:43] request headers. current versions of varnishncsa don't see them. doesn't matter for this because they're always returned as set by varnish in the response object to be varied upon. [18:59:47] how do I push a change into their PR? it's their fork/branch. not mine [18:59:53] Ryan_Lane: yes [18:59:58] New patchset: Dzahn; "add wikivoyager.org|.de redirects to wikivoyage - RT-4333" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45351 [19:00:16] Ryan_Lane: Also we don't need to use PRs [19:00:28] ottomata: but no project had to be blocked on my typing that.. [19:02:14] well, I want code review [19:02:23] how do you do review without PRs? 
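The upshot of the explanation above is that varnishncsa only sees the client side of the transaction, so a header set in VCL has to be logged from the response object, %{X-CS}o rather than %{X-CS}i, and X-CS itself is to become an MCC-MNC code. A quick sanity check for both halves is to fetch a mobile page and look for X-CS on the response; the test URL is a placeholder, carrier detection depends on the source IP, and the only code-to-name pair taken from the log is "502-13" for Celcom Malaysia.

    <?php
    // Confirm X-CS is visible on the response (what %{X-CS}o will log) and
    // translate it via an MCC-MNC table.

    $mccMnc = array(
        '502-13' => 'Celcom Malaysia',   // the example pair given in the log
        // further 'MCC-MNC' => 'carrier' entries would go here
    );

    $ch = curl_init( 'http://en.m.wikipedia.org/wiki/Main_Page' );
    curl_setopt_array( $ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER         => true,
        CURLOPT_NOBODY         => true,
    ) );
    $headers = curl_exec( $ch );
    curl_close( $ch );

    if ( preg_match( '/^X-CS:\s*(\S+)/mi', $headers, $m ) ) {
        $code = $m[1];
        $name = isset( $mccMnc[$code] ) ? $mccMnc[$code] : 'unknown carrier';
        echo "X-CS: $code ($name)\n";
    } else {
        echo "no X-CS header on the response; varnishncsa would log '-' for %{X-CS}o\n";
    }
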
[19:03:03] Thanks binasher, i'll add your response to the ticket for posterity [19:05:38] !log authdns-update [19:05:54] Logged the message, RobH [19:07:04] !log db61 coming offline, and going into otrs sandbox service [19:07:14] Logged the message, RobH [19:09:46] paravoid: I could definitely use some help with the packaging [19:09:54] of? [19:09:56] satoris? [19:09:58] sartoris [19:09:59] yeah [19:10:09] are you good with python setup, btw? [19:10:18] that's something I've never needed to handle [19:10:41] I've done this before, yes [19:10:45] can't say I'm an expert [19:10:49] cool [19:10:53] well, you'll be better than me :) [19:11:22] we should probably wait till we have more of the code base in, though [19:11:53] I'm not terribly sure how to organize this repo. it's going to need puppet, salt, murder_client, sartoris, etc [19:12:00] I think it'll end up being 3-4 packages [19:12:22] the problem is that we'll need to pull this into puppet somehow. probably as a module. [19:13:16] but it would be nice for it to be useable with salt only, as well. I can add some states for that, but the organization of the repo is going to be rough [19:14:03] what has puppet to do with that? [19:14:06] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45351 [19:14:12] I thought you were just using puppet to provision salt minions [19:14:49] there is an untracked file in apache-config repo. "static.conf" [19:15:22] hah [19:15:23] interesting [19:15:31] Ubuntu is thinking of changing their release model [19:15:40] http://arstechnica.com/information-technology/2013/01/ubuntu-considers-huge-change-that-would-end-traditional-release-cycle/ [19:15:46] sound a lot like Debian's :-) [19:15:59] are they running out of silly animal names? [19:16:13] no more 6-month releases [19:16:32] "both stability and cutting-edge features" for real? [19:16:33] just a rolling release (like Debian testing) + LTS releases (like Debian stable) [19:17:21] can they roll out a release that doesn't send local data to amazon? [19:17:23] that'd be sweet [19:17:45] hahaha [19:19:21] but now you'll be able to do that from your phone! [19:19:28] notpeter: ha ha ha [19:19:42] ""In 13.04, we expect to enable instant payments, powered by Ubuntu One, for both applications from the Software Center and music from the Music Store - to deliver the fastest possible purchasing experience directly from the Dash." [19:19:46] The Inquirer (http://s.tt/1wrzj) [19:20:04] I put up with google getting all of my personal data from my phone because they actually know how to protect data [19:20:06] mutante: scary [19:20:11] canonical... yeah, probably not so much [19:20:23] wth, and it inserted that URL on the second line all by itself [19:20:46] mutante: did you recently install ubuntu? [19:20:50] that might be a new featuer [19:21:00] hehe, no:) sticks to Debian [19:21:26] many sites do that [19:21:29] via javascript I think [19:21:44] paravoid: I am [19:21:58] you are? [19:22:00] paravoid: but if we want others to reuse the system [19:22:10] ah, puppet [19:22:12] oh [19:22:26] I read that wrong [19:22:26] we're using puppet to configure salt [19:22:33] because I'm not supposed to be using salt states for things [19:26:30] New patchset: Diederik; "Replaced custom CS carrier codes with MCC-MNC mobile carrier codes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45353 [19:26:45] heh, nice [19:26:51] preilly: liked my MCC-MNC idea? 
:) [19:27:21] paravoid: Well to be fair it was my idea before you brought it up again but yes I like it [19:27:43] you guys should fight over whose idea it was first [19:27:55] so this will be a swift review :D [19:28:04] !log aaron rebuilt wikiversions.cdb and synchronized wikiversions files: Switched all non-wikipedias to 1.21-wmf8 [19:28:15] Logged the message, Master [19:28:20] notpeter: ha ha [19:29:34] I volunteer to be bookie for this fight, btw [19:34:37] mw1072 being broken is known, but mw1085 has same issue now [19:34:56] no Asher today? [19:35:32] MaxSem: he was on earlier. i think he'll be back online later [19:35:43] ok, thanks [19:35:52] oh, or not, just fresh install ? going to add to puppet [19:37:14] New review: CSteipp; "For some history on the old value, Tim made this comment:" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/45273 [19:37:38] New patchset: preilly; "move repo" [operations/debs/sartoris] (master) - https://gerrit.wikimedia.org/r/45357 [19:38:42] After running "sudo apt-get update; sudo apt-get upgrade" on my labs instance I get prompted about 3 new files: /etc/apt/apt.conf.d/20auto-upgrades, /boot/grub/menu.lst, /etc/default/grub saying there are local mods to these files and asking if the old or new files should be used. [19:39:04] Is there any reason to not use the new ones ? [19:40:05] xyzram: always use the default [19:40:11] which is to not replace [19:40:45] xyzram: also best to ask that on the #wikimedia-labs channel ;) [19:41:22] !log stopping puppet on brewster cuz i wanna locally hack and reboot 1 time on db61 without partman auto kicking in [19:41:33] Logged the message, RobH [19:43:23] Change abandoned: preilly; "(no reason)" [operations/debs/sartoris] (master) - https://gerrit.wikimedia.org/r/45357 [19:43:44] !log puppet signing mw1085 (was sitting there like a fresh install) [19:43:55] Logged the message, Master [19:47:19] The following packages have unmet dependencies: php5-xmlrpc : Depends: php5-common (= 5.3.10-1ubuntu3.4+wmf1) but 5.3.10-1ubuntu3.5 is to be installed [19:49:21] err: /Stage[main]/Mediawiki_new/Service[timidity]: Could not evaluate: Could not find init script for 'timidity' [19:49:59] ok. these do not stop puppet run though. this is just "first-time" stuff [19:51:44] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [19:51:54] heh [19:52:02] that's an interesting alert :) [19:52:28] hahaha [19:52:34] RECOVERY - Host google is UP: PING OK - Packet loss = 0%, RTA = 20.66 ms [19:56:12] New patchset: Ottomata; "Logging X-CS as a response header. RT 3158." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45358 [19:59:29] New patchset: preilly; "add .gitreview file" [operations/debs/sartoris] (master) - https://gerrit.wikimedia.org/r/45359 [19:59:54] binasher, this should be ok to merge then? [19:59:56] https://gerrit.wikimedia.org/r/#/c/45358/1/files/varnish/varnishncsa.default [20:00:05] without babysitting? :) [20:00:37] oh perhaps he is offline [20:01:22] New patchset: Dzahn; "fix wikivoyager.org/.de redirects" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45360 [20:01:37] ottomata: the squid part is trivial [20:01:43] yeah, [20:01:55] I can help you with that. 
[20:01:55] i know, but it won't make difference right now [20:01:56] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45360 [20:02:02] that header isn't being set by squid (or nginx for that matter) [20:02:12] and we've got some other changes coming in soon that will be relevant for squid [20:02:23] so no need to deploy something to squid that change anything for it [20:02:26] Change merged: preilly; [operations/debs/sartoris] (master) - https://gerrit.wikimedia.org/r/45359 [20:02:33] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [20:02:38] wha? [20:02:41] stat1001 puppet [20:02:42] psshhh [20:03:28] but thanks paravoid! [20:03:43] actually, if you think that change is ok, I'll just merge it and let puppet put the new one in place [20:03:48] maybe i'll run it on cp1044 to be sure [20:03:49] s'ok [20:03:50] ? [20:05:46] New patchset: preilly; "move https://github.com/wikimedia/Sartoris" [operations/debs/sartoris] (master) - https://gerrit.wikimedia.org/r/45361 [20:06:02] Change merged: preilly; [operations/debs/sartoris] (master) - https://gerrit.wikimedia.org/r/45361 [20:06:14] Ryan_Lane: https://gerrit.wikimedia.org/r/#/c/45361/ [20:06:34] ty [20:06:41] wait [20:06:53] I'm not sure this is correct [20:07:22] naming repos is hard. [20:07:45] Ryan_Lane: What's wrong with operations/debs/sartoris ? [20:07:45] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 20:07:36 UTC 2013 [20:07:53] I guess it's fine [20:08:04] I'm thinking how it's going to look on the github side [20:08:04] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 20:07:54 UTC 2013 [20:08:04] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [20:08:04] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:08:15] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 20:08:12 UTC 2013 [20:08:24] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 20:08:16 UTC 2013 [20:09:04] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:09:04] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 20:08:59 UTC 2013 [20:09:05] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [20:09:23] the real problem is how we're using most of this stuff [20:09:24] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 20:09:14 UTC 2013 [20:09:28] specifically us [20:09:43] some of the stuff in sartoris is going to be a puppet module [20:09:50] some is going to be debian packages [20:10:05] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:10:05] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [20:10:45] rfaulkner: take a look at https://gerrit.wikimedia.org/r/#/q/status:merged+project:operations/debs/sartoris,n,z [20:10:54] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [20:11:19] ^demon: ping [20:11:30] <^demon> Pongggg [20:11:53] ^demon: Can you set-up sync'ing between operations/debs/sartoris on gerrit and github [20:12:06] this doesn't really solve our problem... [20:12:18] New patchset: Diederik; "Replace custom CS carrier codes with MCC-MNC mobile carrier codes." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/45353 [20:12:20] the syncing is automatic [20:12:43] the problem is the organization of the repo and the fact that it's not just debian packages [20:12:45] <^demon> Not for repos outside mediawiki/* [20:12:51] ah [20:13:20] erm [20:13:32] I wouldn't name it operations/debs/sartoris [20:13:41] yeah, that's what I was thinking [20:13:47] it's not the "deb" part that's interesting here [20:14:00] <^demon> Bahhhhh.....and it already has commits. [20:14:04] <^demon> Makes moving it impossible. [20:14:13] it can be deleted, right? [20:14:28] paravoid: Well what about lucene-search-2 and squid for example [20:14:42] <^demon> Ryan_Lane: https://gerrit.wikimedia.org/r/#/q/sartoris,n,z - no. I don't want to clean this up by hand. [20:14:46] I didn't say that it's been done right in the past :-) [20:14:54] (but we're not upstream for squid) [20:14:54] paravoid: ha ha ha [20:15:09] this might attract more people from outside the wmf [20:15:15] it's an interesting project [20:15:16] ^demon: is it going to be possible in the newer version of gerrit/ [20:15:19] <^demon> Yes. [20:15:23] ok. cool [20:15:27] let's mark it as deleted, then [20:15:29] <^demon> delete project plugin works great. [20:15:56] <^demon> Marked read-only and description as "DELETE ME" [20:16:03] cool. thanks [20:16:11] ^demon: how's that new version coming along? :) [20:16:13] Ryan_Lane: Wait a second [20:16:24] Ryan_Lane: Why are you deleting it? [20:16:32] because of the name [20:16:36] <^demon> paravoid: Been testing. Pretty ready. Was waiting for eqiad excitement to die down. [20:16:38] Ryan_Lane: WTF [20:16:50] ^demon: that's 2.5 or 2.6? [20:16:55] did you just miss all of the backscroll? [20:17:01] <^demon> paravoid: master/2.6 [20:17:10] cool [20:17:12] Ryan_Lane: either we fix everything else in gerrit or we don't change it [20:17:13] does that include the proper mail diffs? [20:17:25] fix everything else? what do you mean? [20:17:37] Ryan_Lane: did you miss the back scroll? [20:17:40] <^demon> paravoid: Diffs in e-mail? 2.5 lets us add those, yeah. [20:17:50] yay [20:17:57] New patchset: Dzahn; "meh, trailing $ has gotta go from this redirect rule" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45364 [20:18:06] paravoid: thoughts on a name? just "sartoris" works for me [20:18:11] same here [20:18:47] preilly: naming it operations/debs/sartoris means it'll likely get no attention. 
also, it's not just a debian package [20:18:48] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45364 [20:18:57] that was my point as well [20:19:01] it's a puppet module, debian packages, and salt stuff as well [20:19:28] Ryan_Lane: well that sounds like three repos to me [20:19:35] Ryan_Lane: so operations/debs/sartoris is fine [20:19:49] Ryan_Lane: for the debian package part [20:20:03] most projects include the debian directory in with the project [20:20:07] it makes sense to do so here [20:20:20] at most it'd be a separate branch [20:20:23] can't be done otherwise [20:20:56] this is out of our normal naming conventions for our debs, but it's also outside of our conventions for puppet, too [20:20:56] I really don't understand why operations/debs/sartoris is an issue [20:21:12] and I'm getting really really annoyed [20:21:28] let's calm down :) [20:21:44] I think everyone hopes sartoris to be succesful [20:21:49] dzahn is doing a graceful restart of all apaches [20:22:07] and I guess this one of the reasons such a unique name (sartoris) was picked instead of git-deploy-deploy :) [20:22:29] !log dzahn gracefulled all apaches [20:22:39] New review: preilly; "Please find the appropriate MCC-MNC codes for the carriers that don't have them in this change set." [operations/puppet] (production); V: 1 C: -1; - https://gerrit.wikimedia.org/r/45353 [20:22:40] Logged the message, Master [20:22:41] in that sense, it might make sense to have it like a "real" project in gerrit as to ease third-party contributions [20:22:52] but personally, I don't particularly have a strong opinion [20:22:59] paravoid: I came up with the idea and the name [20:23:06] I know [20:23:06] the idea of what? [20:23:20] paravoid: the idea of doing it in Python and extending it [20:23:24] o.O [20:23:29] you're kidding, right? [20:23:35] s/paravoid/Ryan_Lane [20:23:42] Ryan_Lane: How am I kidding? [20:23:44] I talked about rewriting it well before then [20:23:55] Ryan_Lane: Not that I ever heard [20:24:06] does it matter? [20:24:11] !log removing mw1072 (hardware fail, RT-4381) from "apaches" dsh group [20:24:21] paravoid: Not at all [20:24:21] Logged the message, Master [20:24:57] as if I'd leave an all python system with a perl frontend [20:24:59] * Ryan_Lane shrugs [20:25:22] I remember finding git-deploy when I was in SF and pointing it to Ryan who was envisioning a deployment system based on git, plus the whole architecture with salt incl. a comment about how he'd like to rewrite the perl part in python [20:25:29] Ryan_Lane: it wouldn't really surprise me if you did [20:25:46] but I don't think it matters, since I presume we all have the same agenda here [20:25:51] let's calm down [20:26:02] this all boils down to a naming issue [20:26:12] it's my project. I think I get the choice here [20:26:26] <^demon> I should fix the operations/* repo acls. Right now they're all a hodge-podge since there's no top level operations. [20:26:33] <^demon> I also want a pony. [20:26:35] Ryan_Lane: yeah because Open Source is all about something being someones project WTF [20:27:09] yo preilly; about mcc-mnc, there is one missing, that is congo orange, i went to look at mcc-mnc.com and it's not mentioned there either [20:27:31] drdee: Can you ping Kul or Dan to ask the carrier for the correct value? 
[20:27:35] for tata in india, they have different id's per geography [20:27:53] that's why i put in asterisk [20:27:55] yes i'll do that [20:27:58] drdee: I'd leave it up to Kul, Dan or Amit [20:28:01] preilly: it seems to me in this discussion the vote is already 2-1. if it's democracy you want [20:28:19] I'd prefer consensus tbh [20:28:24] Ryan_Lane: I don't really give a flying fuck what you name it [20:28:35] I don't think this is an operations/ project, neither something debian-specific [20:28:46] I'm really missing why this is such a huge deal [20:28:51] but I don't mind either way [20:28:58] I don't particularly mind if it's github either [20:29:01] paravoid: completely agreed [20:29:17] I'd like it to be in gerrit for our own work, and to take PR from github [20:29:33] i almost don't dare to say it.. but github is non-free [20:29:37] I'm fine pulling and merging the PRs manually for now [20:29:43] <^demon> Ryan_Lane, paravoid, preilly: I'm going to name your repo unicorn-poop. You guys can deal with that now :) [20:30:18] I clone that [20:30:21] I'd* [20:30:51] remember "let me github that for you" ..cough [20:32:22] preilly: I apologize that I didn't discuss it with you before having it marked deleted. [20:32:44] aren't you guys sitting across each other? :) [20:32:53] is everyone ok with just naming it "sartoris"? [20:32:56] paravoid: I'm wfm [20:32:59] *wfh [20:37:40] <^demon> Ryan_Lane: I really did prefer unicorn-poop :\ [20:37:46] heh [20:38:26] is the novel still in copyright? [20:38:38] and if so does that even matter? I have no idea [20:38:55] it would matter for trademark maybe [20:39:05] I don't think so for copyright [20:42:09] "ice nine" :-) [20:42:18] heh [20:42:41] Shame etsy already used 'deployinator' [20:42:52] ha [20:50:15] BAH [20:50:19] db61 is on foundry network [20:50:30] bleehhh, now im used to juniper and dont wanna deal with foundry. [20:50:33] i blame LeslieCarr [20:56:47] !log swapped port 3/39 on csw1-sdtpa from vlan 2 to 102 [20:56:58] Logged the message, RobH [20:57:34] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45332 [21:02:19] preilly: wikimedia/Sartoris was deleted? [21:02:50] was operations/debs/sartoris imported with history? [21:03:03] !log authdns-update to push db61 into sandbox [21:03:15] Logged the message, RobH [21:05:18] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [21:06:08] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [21:11:38] PROBLEM - Auth DNS on ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [21:11:47] RobH: ^^ [21:12:28] !log restarting pdns on ns1 [21:12:38] Logged the message, Master [21:13:29] RECOVERY - Auth DNS on ns1.wikimedia.org is OK: DNS OK: 0.031 seconds response time. www.wikipedia.org returns 208.80.154.225 [21:24:36] RECOVERY - Puppet freshness on mw15 is OK: puppet ran at Wed Jan 23 21:24:21 UTC 2013 [21:30:22] Jeff_Green: So db61 is now online and in the sandbox vlan, db61.wikimedia.org [21:30:28] and your key is in the root auth key file [21:30:30] !log restarting opendj on virt0 [21:30:35] Ryan_Lane: It was deleted on Github [21:30:36] RobH: yayy, thanks! [21:30:40] I'll update the bugzilla ticket and re-request the key info for martin [21:30:41] Logged the message, Master [21:30:49] RobH: k [21:30:57] preilly: was the history imported into gerrit? [21:31:09] Ryan_Lane: nope [21:31:58] anyone have a full checkout? 
[21:32:11] Ryan_Lane: you can get it from https://github.com/preillyme/Sartoris [21:32:29] ah. that has the full history? [21:32:35] cool [21:32:36] Ryan_Lane: yes [21:34:07] Ryan_Lane: Have you created a new repo for Sartoris in gerrit? [21:34:12] not yet [21:34:26] Ryan_Lane: Where are you planning on putting it? [21:34:35] I was thinking just "sartoris" [21:34:46] then replicating it to wikimedia/sartoris on github [21:35:36] or Sartoris [21:35:39] either/or [21:40:15] when it comes to HTTP redirects is one year still "temporary"?:) [21:55:25] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [21:56:44] !log authdns update adding mgmt ip's to zone files for 10 misc servers in eqiad [21:56:45] RoanKattouw_away: are you still using yongle? [21:56:53] Logged the message, Master [21:56:55] it's supposed to be decommissioned [21:57:00] but it's still in site.pp [21:57:04] and your account is on there [21:58:06] New patchset: Pyoungmeister; "removing some decommed boxes from site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45372 [21:59:25] notpeter: yongle was decommissioned a very long time ago and is no longer w/us [21:59:49] cmjohnson1: cool! thanks [22:00:05] PROBLEM - Auth DNS on ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [22:00:28] who is doing the dns updates and not checking? [22:00:53] !log restarting pdns on ns0 [22:01:03] Logged the message, Master [22:01:26] notpeter: also, i installed the new 10Gb sfp cards in mc1017/1018 [22:01:27] cmjohnson1: when you do dns updates, you need to make sure all of the servers are returning answers [22:01:28] ryanlane [22:01:46] New patchset: Pyoungmeister; "removing some decommed boxes from site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45372 [22:01:47] if one doesn't it needs to have its pdns service restarted [22:01:55] RECOVERY - Auth DNS on ns0.wikimedia.org is OK: DNS OK: 0.037 seconds response time. www.wikipedia.org returns 208.80.154.225 [22:02:08] i did a dig on ns0 [22:03:46] hm. I wonder why it died later, then [22:03:48] stupid pdns [22:03:55] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45372 [22:04:11] hey... no beating the poor pdns, it's a lovely beast really [22:05:04] ryan_lane: dont know but notpeter pointed that out to me a long time ago [22:05:10] * Ryan_Lane nods [22:10:39] hey guyyys! i have an interesting problem that I think I need some ops networky eyes on [22:10:45] !log authdns-update correcting camera fqdn [22:10:56] Logged the message, RobH [22:10:59] LeslieCarr, maybe not today, but would you be available to help me check out a problem tomorrow or Friday? 
[22:12:05] ottomata: she isn't at work today afaik [22:12:21] ahhh right jury duty [22:12:22] drop her a mail might be best [22:12:26] yep [22:12:26] ja will do [22:15:38] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [22:15:57] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [22:17:17] New patchset: RobH; "adding colby back into standard installs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45436 [22:18:51] New review: RobH; "there's no crying in self review" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45436 [22:18:51] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45436 [22:19:05] * RobH is going to misquote a movie for each self review comment [22:19:21] out of original sarcastic material. [22:19:49] New patchset: Pyoungmeister; "moveing last role class out of site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45458 [22:20:04] !log netapp: nas1001-b cf giveback [22:20:16] Logged the message, Master [22:20:23] someone have a half merged site.pp change? [22:20:32] notpeter: ? [22:20:45] seems to be all dead servers, so legit. [22:21:14] yea, its all decom'd servers [22:21:16] so merging. [22:21:22] RobH: doh, sorry [22:21:33] no worries =] [22:21:46] i just like to know whose shit im merging so if it breaks [22:21:52] i can blame someone that isnt me. [22:21:54] ;] [22:23:08] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45458 [22:27:34] ottomata: ssh: connect to host analytics1007 port 22: Connection refused [22:28:14] uh huhhhhhh [22:28:30] that is an eternal message from the gods [22:28:42] https://rt.wikimedia.org/Ticket/Display.html?id=3946 [22:28:57] !log starting NTP on mw1008 (Tick tock, on the clock But the party don't stop, no) [22:29:08] Logged the message, Master [22:29:16] ottomata: gotcha, as long as its known [22:29:22] <^demon> mutante: You should really get out more :p [22:29:24] it is known [22:29:28] but I hear RobH's voice [22:29:40] perhaps he knows even more! [22:30:05] ^demon: what?:) can't i quote Kesha?:) [22:30:10] ? 
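As an aside on the authdns issue from around 22:00 above: the rule of thumb stated there is that after an authdns-update every authoritative server should be checked, and pdns restarted on any that stop answering. A small illustrative sketch of that check using dig; the two nameserver names and the test record come from the log, the script itself is hypothetical.

```python
# Illustrative check of the "make sure all authdns servers answer after an
# authdns-update" habit described above; uses dig via subprocess.
import subprocess

NAMESERVERS = ["ns0.wikimedia.org", "ns1.wikimedia.org"]
TEST_RECORD = "www.wikipedia.org"

def server_answers(ns, name=TEST_RECORD):
    """True if the nameserver returns at least one A record for name."""
    try:
        out = subprocess.check_output(
            ["dig", "+short", "+time=3", "+tries=1", "@" + ns, name, "A"]
        )
    except subprocess.CalledProcessError:
        return False
    return bool(out.strip())

for ns in NAMESERVERS:
    print("%s: %s" % (ns, "OK" if server_answers(ns) else "no answer - restart pdns?"))
```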
[22:30:13] New patchset: RobH; "calcium not smokeping server for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45476 [22:30:37] <^demon> mutante: I'm not sure you want everyone to know you *listen* to Ke$ha :p [22:31:07] ^demon: i dont, peter does:P) [22:31:12] New review: RobH; "mama's dont let your babies grow up to self-review" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45476 [22:31:13] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45476 [22:32:53] ACKNOWLEDGEMENT - SSH on analytics1007 is CRITICAL: Connection refused daniel_zahn #3946: analytics1007 wont hdd boot after install [22:32:54] New patchset: Pyoungmeister; "fixing ganglia for eqiad mc boxes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45477 [22:33:30] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45477 [22:34:23] New review: Dzahn; "you guys just removed "colby", but its still here" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45476 [22:39:07] PROBLEM - SSH on calcium is CRITICAL: Connection refused [22:39:41] PROBLEM - SSH on calcium is CRITICAL: Connection refused [22:42:07] New patchset: RobH; "calcium is software raid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45478 [22:42:50] New review: RobH; "the stuff you self-review ends up reviewing you" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/45478 [22:43:04] New review: RobH; "the stuff you self-review ends up reviewing you" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45478 [22:43:05] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45478 [22:47:02] PROBLEM - NTP on calcium is CRITICAL: NTP CRITICAL: No response from NTP server [22:49:12] RobH, we were referring to the 'analytics1007 has never been born' ticket: [22:49:20] https://rt.wikimedia.org/Ticket/Display.html?id=3946 [22:50:58] ^demon: which parent should I use for a new "sartoris" repo? [22:51:13] <^demon> It's going to be a top-level, right? [22:51:16] PROBLEM - NTP on calcium is CRITICAL: NTP CRITICAL: No response from NTP server [22:51:18] yeah [22:51:19] <^demon> Omit it, it'll fall back to All-Projects. [22:51:32] <^demon> Or specify All-Projects, if you prefer to be explicit :) [22:51:37] heh [22:52:10] how about owner? [22:52:18] <^demon> Who do you want to have CR on it? [22:52:35] myself, ryan falkner and preilly [22:52:42] <^demon> Make a new group then. [22:53:05] ok [22:53:18] !log calcium is my testbed server, ignore its bitching [22:53:28] Logged the message, RobH [22:54:45] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [22:55:06] ^demon: will this have any useful acls set up at all? :) [22:55:31] <^demon> Yeah, All-Projects is pretty useful these days. Grant Owner and Submit to your new group and you should be fine. [22:55:34] ah. cool [22:55:44] <^demon> CR/VR/Create Refs/Tags/etc all handled in All-Projects these days. [22:55:54] <^demon> (I got tired of maintaining ACLs and cleaned those up awhile ago) [22:56:22] nice [22:57:05] heh [22:57:09] no empty commit [22:57:16] how do I push in the old history? [22:57:40] <^demon> On your local repo, add a remote to the new gerrit repo, then `git push [remote] refs/*:refs/*` [22:58:22] I was just typing that [22:58:34] ^demon: What's the new repo location? 
[22:58:42] <^demon> Ask Ryan, he made it :) [22:58:48] New review: CSteipp; "Policy review by Fabrice, Matthias, Philippe, and Oliver. Code looks fine." [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45329 [22:58:56] "sartoris" [22:59:17] Change merged: CSteipp; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45329 [22:59:17] ah. I need to give myself some permissions for this, right? [22:59:35] create reference? [22:59:40] <^demon> Shitmuffins. Grant yourself "Forge Committer" and try again (you can un-grant that when you're done). [23:00:05] (can not create new references) [23:00:05] oh [23:00:17] do I need to logout and back in, since I have a new group? [23:00:25] well, this is a local group [23:00:30] it shouldn't require tat [23:00:32] *that [23:01:00] there we go [23:01:06] I needed create reference [23:01:35] <^demon> Hmm, thought I'd granted that already. [23:02:06] RECOVERY - SSH on calcium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [23:02:11] RECOVERY - SSH on calcium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [23:02:44] <^demon> Ryan_Lane: Turns out I didn't do Create Reference at All-Projects. If you want to create other non-master branches, you'll want to re-grant that. [23:02:53] well, that change isn't what I wanted to do. heh [23:02:53] <^demon> I really really need to audit all our acls one day. [23:03:16] make 'em pretty [23:05:24] ^demon: can you make sartoris replicate to wikimedia/sartoris ? [23:05:35] <^demon> I can, yes :) [23:05:39] I can probably figure out how [23:05:53] <^demon> I'll create the repo on github. [23:06:06] <^demon> Just grant 'mediawiki-replication' read permissions on refs/*. [23:06:45] doesn't it have that by default? [23:06:56] <^demon> Nope, since we're not replicating everything to github. [23:07:03] <^demon> (I was explicitly asked to *not* replicate some things) [23:07:06] ahhh. makes sense [23:07:17] done [23:07:47] <^demon> Forcing replication. [23:08:07] New patchset: Tim Starling; "Check the return status of nextJobDB.php." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45292 [23:08:13] <^demon> https://github.com/wikimedia/sartoris - done [23:08:13] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45292 [23:08:57] thanks [23:08:59] <^demon> yw. [23:12:59] RECOVERY - NTP on calcium is OK: NTP OK: Offset -0.01506149769 secs [23:13:15] RECOVERY - NTP on calcium is OK: NTP OK: Offset -0.01460146904 secs [23:17:07] RoanKattouw_away: are you still using testing-singer-puppetization ? 
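Picking up the earlier packaging question (around 19:10) about Python setup for sartoris: a bare-bones setup.py sketch. Every value here (version, package layout, console entry point) is hypothetical and not taken from the actual repository.

```python
# Hypothetical, minimal setup.py sketch for sartoris; all metadata below is
# illustrative only and not copied from the real repository.
from setuptools import setup, find_packages

setup(
    name="sartoris",
    version="0.1.0",  # hypothetical
    description="Git-based deployment tooling (Python port of git-deploy)",
    packages=find_packages(),
    install_requires=[],  # real dependencies would go here
    entry_points={
        "console_scripts": [
            "sartoris = sartoris.cli:main",  # hypothetical module path
        ],
    },
)
```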
[23:17:49] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [23:17:49] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [23:17:49] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [23:17:49] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [23:17:49] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [23:17:49] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [23:17:49] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [23:21:40] Ryan_Lane: Why do you have murder code in sartoris [23:21:44] New patchset: RobH; "adding user to calcium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45488 [23:21:57] because it's a dependency for the seeder [23:22:12] rfaulkner: you can get the code for sartoris at ssh://gerrit.wikimedia.org:29418/sartoris [23:22:16] and it's a target of the package [23:22:31] preilly: will do [23:22:33] New review: RobH; "the only way to win is not to self-review" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45488 [23:22:34] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45488 [23:22:51] ^demon|away: can you set-up replication to Github for sartoris.git [23:22:56] preilly: if there's a better way of handling it let's change that [23:23:03] preilly: already done [23:23:17] Ryan_Lane: But has the decision even been made to use Twitter's deployment system? [23:23:22] https://github.com/wikimedia/sartoris [23:23:39] that's what I was going to use for the bittorrent part [23:23:49] it's just a wrapper around bittornado [23:23:55] exit [23:23:56] <^demon|away> We should call it gitorrent :p [23:24:05] blaaaaah, glad that was an innocent command [23:24:10] Ryan_Lane: Has the decision been made to use bittorrent? [23:24:16] gotta be more careful of window focus =P [23:24:18] RobH: and not a netflix password? :) [23:24:18] RobH: ha ha [23:24:25] preilly: yes [23:24:32] heh, the maintainer of sartoris is named Faulkner? that's beyond awesome. [23:24:34] at minimum for the l10n cache [23:24:37] Ryan_Lane: who made that decision? [23:24:44] it was discussed on the list [23:24:53] chrismcmahon: that was on purpose [23:25:11] :) [23:25:23] <^demon|away> I keep reading it as santorum, and that can't be a good thing. [23:25:39] Ryan_Lane: I didn't see a decision in that thread [23:26:06] discussed on the list along with the current issues with using fetch that need to be solved before we'd be able to switch back [23:26:13] ^demon|away: nothing wrong with http://www.urbandictionary.com/define.php?term=santorum is there? [23:26:32] preilly: unless you are going to suggest an alternative, I'm going to continue on with my plans [23:27:00] Ryan_Lane: well for now I was thinking that sartoris would just be 1-to-1 of git-deploy but in Python [23:27:21] someone really made a word for that?! [23:27:24] why limit it to that? [23:27:32] I was going to put the entire system in there [23:27:46] which includes the normal git method, the git-bt method and the bt method [23:28:38] also the salt module, runner and returner [23:28:53] and the sync scripts [23:29:38] Ryan_Lane: well I'd like to have a larger discussion about any bt dependencies [23:29:55] it was on the ops list.
there were no major objections [23:29:57] it's over a week old [23:30:08] lack of objections is the same as acceptance [23:30:29] if I have to wait for someone to tell me it's ok to do something I'd never get anything done [23:30:34] preilly, Ryan_Lane: pulled the latest from ssh://gerrit.wikimedia.org:29418/sartoris [23:31:05] rfaulkner: Okay cool [23:31:14] preilly: also, the only person to give an alternative was mark, and he wasn't fully serious [23:31:46] Ryan_Lane: well for now I'd like to keep it simple and not include any bt related code [23:31:48] to clarify: all changes will go to this remote repo rather than https://github.com/wikimedia/sartoris [23:32:00] rfaulkner: yes that is correct [23:32:13] rfaulkner: they will be synced to github [23:32:34] sounds good [23:32:43] preilly: why? [23:34:09] Ryan_Lane: because that was not the original purpose of this port and I don't want other dependencies at this time [23:34:42] what was the original purpose of the port? [23:35:34] brb [23:36:24] this is the front-end of a system. Why are we only going to bother releasing it without any of the other work? [23:36:25] New patchset: Dzahn; "redirect wikimania.asia to wikimania2013 (RT-4228)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45490 [23:38:39] New patchset: Dzahn; "redirect wikimania.asia to wikimania2013 (RT-4228)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45490 [23:39:22] damn it, we should also use tabs instead of spaces in redirects.conf [23:40:52] New patchset: Dzahn; "redirect wikimania.asia to wikimania2013 (RT-4228)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45490 [23:41:28] New review: Dzahn; "RT-4228" [operations/apache-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45490 [23:41:28] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45490 [23:41:54] Ryan_Lane: my goal was to have a drop-in replacement of git-deploy written in Python [23:42:02] to what end? [23:42:22] Ryan_Lane: and to NOT have to rely on the git-deploy codebase from bookings.com [23:42:39] yes, I've had that goal since I started using the perl code [23:42:43] Ryan_Lane: then after that point in time we can look at extending it [23:42:53] it's 1% of the system that's already written [23:43:11] Ryan_Lane: why is my stance an issue? [23:43:44] I'd prefer to have consensus. [23:44:13] what are you planning on extending it into? [23:44:40] could just use scap [23:44:49] we could, yes [23:44:51] you know I got 4 Gbps out of scap the other day [23:44:58] TimStarling: +2 [23:45:13] it would be faster still if it didn't use the netapp as a code repository [23:45:19] I think you guys misunderstood the reasoning for replacing scap [23:45:19] TimStarling: that's awesome, was the fanout set to 30? [23:45:35] scap doesn't do any reporting [23:45:48] and we have to constantly deal with host key issues [23:45:54] the salt based system has full reporting [23:45:56] yes, fanout 30, with an rsyncd server for each row [23:46:06] we can make rsync a deployment method in the salt based system [23:47:03] we have no current way of knowing if all of the systems are actually running the correct version of mediawiki [23:47:19] <^demon|away> I think we can come up with a way to verify an install without using git. [23:47:26] yes, we can. [23:47:30] <^demon|away> Really, moving core's .git/objects around is a pain in the ass.
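Tim's numbers above (fanout 30, an rsyncd server per row) describe a two-stage sync: each target host pulls from the rsync proxy in its own row rather than from a single source. A rough sketch of that idea; the row-to-proxy mapping, module name and paths are made up for illustration.

```python
# Illustrative sketch of the per-row rsync fan-out described above: each host
# syncs from the rsyncd proxy in its own row. Proxy names, module and paths
# are hypothetical.
import subprocess

ROW_PROXIES = {  # hypothetical row -> rsyncd proxy mapping
    "a": "rsync-row-a.example",
    "b": "rsync-row-b.example",
}

def sync_host(host, row, module="common", dest="/path/to/mediawiki"):
    """Rsync the deployed tree to `host` from its row-local proxy."""
    src = "rsync://%s/%s/" % (ROW_PROXIES[row], module)
    return subprocess.call(["ssh", host, "rsync", "-a", "--delete", src, dest])
```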
[23:47:39] as I said, we can continue using rsync if we want [23:48:04] the salt based deployment system reports back to say a system is finished and gives status on its state [23:48:30] Ryan_Lane: we could do that without salt [23:48:37] * Ryan_Lane rolls his eyes [23:48:58] <^demon|away> I don't think salt's a bad idea, but salt+rysnc (with Tim's improvements) sounds promising. [23:49:15] <^demon|away> I just think salt + git-deploy is kind of a non-starter at this point. [23:49:24] ^demon|away: how so [23:49:35] the only part that doesn't work is git's fetch method [23:49:47] and l10n is too slow [23:49:52] <^demon|away> git fetch works fine for millions of people worldwide. [23:50:03] <^demon|away> It just doesn't scale well for deployments, I don't think. [23:50:08] agreed [23:50:25] TimStarling: how long did it take for scap and l10n when you tested it last? [23:50:26] it's not a matter of scaling. it's a matter of error correction [23:50:27] <^demon|away> And I think the fact that we're talking about replacing git-fetch with torrents is kind of silly when we've got a proven thing like rsync that *works* [23:50:53] how many systems are we using for rsync? [23:51:00] one per rack? [23:51:02] <^demon|away> I think l10n is slightly orthogonal to git-deploy. l10n-recache has sucked for awhile now, and should be solved regardless of the deploy system. [23:51:24] scap also writes files directly into the running copy [23:51:43] preilly: maybe the whole process took 10-15 mins, but the main rsync run was only 5 minutes of that [23:52:03] I think most of the time was spent waiting for the netapp, it would be much faster if the deployment host had its code on a local FS [23:52:36] TimStarling: that seems fairly doable [23:53:25] \o/ [23:53:25] Ryan_Lane: I set up one rsync server per row, 6 altogether [23:53:30] <^demon|away> Also, we could cut the fetch time on the deployment host if we replicated the git objects for core from manganese. [23:53:37] New patchset: Dzahn; "wikimania.asia redirect, there is no *.wikimania2013.wm" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45497 [23:53:41] <^demon|away> (And deployed extensions, probably) [23:53:42] plus one on nfs1 [23:54:17] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45497 [23:54:51] TimStarling: what is your opinion on saltstack? 
[23:56:09] I don't know, it could be useful for something [23:56:29] TimStarling: I could say that of a lot of things [23:56:32] I think it's complex, and so I think we need a clear justification for introducing it [23:56:40] dzahn is doing a graceful restart of all apaches [23:56:47] * AaronSchulz still keeping reading up on it to understand it [23:56:49] TimStarling: yeah I couldn't agree more [23:56:59] *keeps [23:57:15] it's not a drop-in replacement for dsh, it doesn't have the same privilege system [23:57:16] !log dzahn gracefulled all apaches [23:57:23] TimStarling: plus in general distributed remote execution scares me [23:57:26] Logged the message, Master [23:58:47] the salt minions all run as root, the method for reducing privileges is to write modules which do restricted things and to make those modules accessible to all users [23:59:05] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [23:59:28] all users of a specific system [23:59:36] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [23:59:47] controlled via sudo [23:59:50] As opposed to all users everywhere, heh.
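As a concrete illustration of the last point (minions run as root, and privileges are narrowed by exposing only specific module functions instead of arbitrary commands), here is a minimal custom Salt execution module sketch; the module name, functions and path are hypothetical. It also shows the kind of "which revision is this host actually running" report mentioned earlier.

```python
# _modules/deploy.py -- hypothetical Salt execution module sketch.
# Rather than letting users run arbitrary commands as root, only these narrow
# functions are exposed; the name and deploy path below are made up.
import subprocess

DEPLOY_DIR = "/path/to/deployed/code"  # hypothetical

def revision():
    """Report which git revision is currently checked out on this minion."""
    return subprocess.check_output(
        ["git", "--git-dir", DEPLOY_DIR + "/.git", "rev-parse", "HEAD"]
    ).strip()

def status():
    """Small status dict a deploy system could aggregate across minions."""
    return {"revision": revision(), "deploy_dir": DEPLOY_DIR}
```

Something like `salt 'mw*' deploy.revision` would then answer the "are all of the systems running the correct version" question raised above.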