[00:05:32] I guess that fixed UW too [00:06:23] I missed one of the masters, I got it just now [00:07:14] so we should have no more errors after 00:05 [00:07:51] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 00:07:47 UTC 2013 [00:08:11] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [00:08:32] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 00:08:21 UTC 2013 [00:09:11] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [00:09:21] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 00:09:14 UTC 2013 [00:10:06] ok icinga, we get it [00:10:11] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [00:10:58] ah, ms1 is decommissioned, must be causing some weird funkyness [00:13:32] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [00:13:46] New patchset: MaxSem; "Postgres module for OSM" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36155 [00:18:32] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 221 seconds [00:18:32] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 188 seconds [00:19:31] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 247 seconds [00:20:17] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 279 seconds [00:21:32] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [00:21:55] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [00:22:19] New patchset: Dzahn; "remove index.html.tmpl from files, replaced by .erb template and drop the empty lines in the template causing issues (yes, really)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45271 [00:22:31] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 5 seconds [00:23:05] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45251 [00:23:07] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [00:23:56] New patchset: Lcarr; "fixing icinga restart condition" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45272 [00:24:31] New patchset: CSteipp; "Lower email throttle" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45273 [00:24:36] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45271 [00:27:26] New patchset: Lcarr; "fixing icinga restart conditions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45274 [00:28:43] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45274 [00:29:15] gotta love bugs like that.. i got some template that creates an index.html, now if there are empty comment lines in it like "### " they will end up (and mess up) the resulting HTML, while comment lines that actually have comments like "### foo" don't cause any issue.. [00:29:29] TimStarling: I'm having a hard time triggering MWExceptionHandler::handle [00:29:54] set_exception_handler definitely gets called [00:33:23] strange [00:33:38] does it work for you? 
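A minimal standalone sketch of the set_exception_handler() behaviour being debugged above. This is not MediaWiki's MWExceptionHandler, just the bare mechanism with a made-up handler name; running it under both the CLI and a web SAPI is a quick way to compare the two cases that come up next in the discussion.

    <?php
    // Bare-bones check that an uncaught exception reaches the handler
    // registered with set_exception_handler(). MediaWiki layers its own
    // reporting on top of this; none of that is reproduced here.

    function pingHandler( $e ) {
        // Goes to the response body under a web SAPI, to stdout on the CLI.
        echo "custom handler caught: " . $e->getMessage() . "\n";
    }

    set_exception_handler( 'pingHandler' );

    throw new Exception( 'ping' );
    // Not reached: PHP stops the script once the handler has run.
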
[00:34:54] working in sapi mode [00:35:33] not cli [00:37:01] New patchset: Lcarr; "removed old style definition of removing generic site" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45275 [00:37:41] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45275 [00:46:46] New patchset: Lcarr; "fixing syntax error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45276 [00:46:49] on fenari "throw new MWException('ping');" with eval.php appears to work [00:47:11] the error output from MW is a bit different to the default PHP output [00:50:34] * AaronSchulz was testing on his laptop [00:53:34] I tested on my laptop also, I saw an error message in the debug log [00:53:45] TimStarling: definitely does not work on my laptop, using 2>/dev/null still shows the error (though if I throw a raw Exception it doesn't show) [00:55:51] New patchset: MaxSem; "WIP: OSM module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36222 [00:58:06] is anyone here familiar with tmh operationally? [00:59:41] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [00:59:55] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [00:59:55] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [01:00:11] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [01:01:01] binasher: I wouldn't say familiar, but it probably won't take me long to get up to speed [01:01:08] what is the problem? [01:02:04] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45276 [01:03:11] so tmh1001 is a job runner with JR_TYPE="webVideoTranscode" in /etc/default/mw-job-runner [01:03:16] TimStarling: the tmh servers in eqiad aren't getting code deploys (not sure since when) and are trying to talk to pmtpa db's via a now deleted db.php [01:03:41] i saw that notpeter exempted the tmh hosts from his puppet change enabling job runners in eqiad though, so i'm not sure if those hosts are even supposed to be running right now [01:03:45] sure, they're missing [01:04:14] I added them to mw-eqiad which I was going to use as a source for mediawiki-installation [01:04:23] but then someone else updated mediawiki-installation for me [01:05:26] ah. and was tmh processing only running in eqiad all this time? (so they didn't need enabling like the other eqiad job runners) [01:05:49] !log added tmh1001 and tmh1002 to mediawiki-installation [01:05:59] Logged the message, Master [01:06:08] there is also tmh1 and tmh2, I assume they are in pmtpa [01:07:43] !log ran scap-1 on tmh1001 and tmh1002 [01:07:53] Logged the message, Master [01:11:10] judging by ganglia, tmh1 and tmh2 were active up until today [01:11:27] someone stopped the job runners on tmh1 and tmh2 and started them on tmh1001 and tmh1002 [01:12:21] no /etc/rc3.d/*mw-job-runner on any of them [01:13:12] nor on the regular job runners [01:13:20] so I suppose they rely on being started manually on boot [01:14:56] New patchset: Dzahn; "use different CSS for Arabic, use a puppet selector within the file definition. took planet.css from old Arabic planet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45283 [01:15:43] TimStarling: looks like puppet has an ensure => running for mw-job-runner [01:15:52] right [01:16:29] TimStarling: maybe jobs-loop should check $? 
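The "maybe jobs-loop should check $?" point above (and the "Check the return status of nextJobDB.php" patch that follows) boils down to: don't trust a helper's stdout unless it also exited cleanly. The real jobs-loop is a shell wrapper; this is only a PHP sketch of the same idea, and the MWScript.php path and arguments are illustrative guesses, not the production invocation.

    <?php
    // Run the job-picking helper and check its exit status (the shell's $?)
    // before acting on its output.

    $cmd = 'php /usr/local/apache/common/multiversion/MWScript.php nextJobDB.php --type=webVideoTranscode';
    exec( $cmd, $output, $status );

    if ( $status !== 0 ) {
        fwrite( STDERR, "nextJobDB.php exited with status $status; skipping this cycle\n" );
        exit( 1 );
    }

    $db = trim( implode( "\n", $output ) );
    if ( $db === '' ) {
        exit( 0 );   // zero status and empty output just means no pending work
    }
    echo "next wiki with pending jobs: $db\n";
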
[01:16:54] seems that puppet tries to ensure sync-common has been run before starting it as well [01:17:04] AaronSchulz: I guess that would help [01:17:35] getting reports from wikidata users about random 404's some users are getting, but not all, and the ones who do say it is fixed by reload but happens quite a bit [01:17:53] binasher: https://gerrit.wikimedia.org/r/#/c/45159/ [01:18:17] notpeter switched over both sets of job runners [01:18:50] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45283 [01:19:21] mutante: are all the apaches that are enabled in LVS also in mediawiki-installation? [01:20:04] TimStarling: thanks, i had only actually read https://gerrit.wikimedia.org/r/#/c/45156/ which enabled job runners in eqiad, but left tmh disabled there [01:21:58] !log aaron synchronized php-1.21wmf7/includes/api/ApiUpload.php 'deployed 2a4ad3e32cf4f6814554ebcf09ac0250546a2549' [01:22:11] Logged the message, Master [01:23:02] !log aaron synchronized php-1.21wmf8/includes/api/ApiUpload.php 'deployed 2a4ad3e32cf4f6814554ebcf09ac0250546a2549' [01:23:12] Logged the message, Master [01:23:42] TimStarling: confirmed that everything host enabled in /h/w/conf/pybal/eqiad/apaches is in /etc/dsh/group/mediawiki-installation [01:23:57] same with api [01:26:23] ah, dsh group, first thought you are talking about a puppet class, thx asher [01:26:48] yea, and the reports are not limited to Europe, got one from Canada [01:28:38] do any of these reports come with a URL? [01:29:42] not yet..trying to get some .."17:03 < Jasper_Deng> but its sporadicness leads me to speculate about URL-rewrite rules not being executed correctly on all servers [01:29:58] 17:36 < Jasper_Deng> mutante: Special:Watchlist and http://www.wikidata.org/wiki/Wikidata:Administrators/Confirm_2013/1 for me at least [01:30:01] 17:36 < Jasper_Deng> it appears to be URL-independent [01:30:20] from #wikimedia-tech [01:30:31] TimStarling, I've seen it on http://www.wikidata.org/wiki/Special:Watchlist [01:43:22] mutante: I got 2 of those random 404's on wikivoyage from the office [01:45:18] Jamesofur: what url? and did a reload fix it? [01:46:08] binasher: A reload did fix it, on two different occasions it was http://en.wikivoyage.org/wiki/Wikivoyage:TOC [01:54:13] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [02:02:17] still moving servers? [02:04:52] Nope, we're done [02:04:59] We finished around, 11ish? [02:05:03] Before lunchtime I think [02:05:19] So about 6-7 hours ago [02:07:30] New patchset: Aaron Schulz; "Check the return status of nextJobDB.php." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/45292 [02:08:09] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [02:08:10] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [02:08:26] mutante: mw1017, 1018, and 1019 give 404's for wikivoyage urls every time and are pooled [02:08:50] the rest of the pooled eqiad apaches are ok [02:09:47] New patchset: Ryan Lane; "Combine sysadmin and netadmin into projectadmin" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45267 [02:10:42] * AaronSchulz wonders what's with http://commons.wikimedia.org/wiki/File:%E6%85%88%E6%BF%9F%E5%AE%AEwikipedia.jpg [02:11:16] AaronSchulz, The file exists, http://upload.wikimedia.org/wikipedia/commons/b/b5/%E6%85%88%E6%BF%9F%E5%AE%AEwikipedia.jpg [02:11:23] binasher: I'll fix them [02:11:25] RoanKattouw: if you're done, shouldn't all wikis be writeable now? [02:11:51] They are, aren't they? [02:11:57] testwiki isn't [02:12:12] The administrator who locked it offered this explanation: Wikimedia Sites are currently read-only during maintenance, please try again soon. [02:12:22] Haha [02:12:24] Right [02:12:27] Yeah, testwiki is a special case [02:12:33] :-( [02:12:58] The way it's set up is incompatible with how eqiad is set up I think [02:12:58] TimStarling: sync-apache uses the apaches dsh group which doesn't include any of the eqiad hosts. it looks like puppet is supposed to take care of this now though [02:13:08] You should file a bug about it being down if there isn't one already [02:13:17] techman224: http://commons.wikimedia.org/w/thumb.php?f=%E6%85%88%E6%BF%9F%E5%AE%AEwikipedia.jpg&w=400 [02:14:52] !log added eqiad apaches to /etc/dsh/group/apaches [02:15:04] Logged the message, Master [02:16:11] AaronSchulz, http://commons.wikimedia.org/wiki/Special:Undelete/File:%E6%85%88%E6%BF%9F%E5%AE%AEwikipedia.jpg [02:16:30] Same picture? 
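A sketch of the kind of per-backend probe used above to find that mw1017, mw1018 and mw1019 were serving 404s for wikivoyage while the rest of the pool was fine: request the same path from each apache directly, send the public Host header, and compare status codes. The internal hostnames are a guess at the naming scheme, not copied from the pool list.

    <?php
    // Hit each pooled apache directly with the public Host: header and
    // compare the HTTP status codes returned for the same path.

    $backends   = array( 'mw1017.eqiad.wmnet', 'mw1018.eqiad.wmnet', 'mw1019.eqiad.wmnet' );
    $path       = '/wiki/Wikivoyage:TOC';
    $hostHeader = 'en.wikivoyage.org';

    foreach ( $backends as $backend ) {
        $ch = curl_init( "http://$backend$path" );
        curl_setopt_array( $ch, array(
            CURLOPT_HTTPHEADER     => array( "Host: $hostHeader" ),
            CURLOPT_NOBODY         => true,   // only the status code matters here
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT        => 5,
        ) );
        curl_exec( $ch );
        $code = curl_getinfo( $ch, CURLINFO_HTTP_CODE );
        curl_close( $ch );
        printf( "%-22s %s\n", $backend, $code ?: 'no response' );
    }
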
[02:16:40] I can't see that [02:18:08] binasher: puppet will only run that sync command once [02:18:16] it has creates => "/usr/local/apache/conf", [02:18:19] so it won't run again [02:18:29] those servers were probably installed early [02:18:38] AaronSchulz, I can't get the thumbnails from the links under the picture [02:18:55] i suppose there could be other inconsistencies as well [02:19:10] i'll add the eqiad api/apps to the apaches group [02:19:16] !log running sync-apache to fix stale eqiad apache configuration [02:19:25] Logged the message, Master [02:19:27] I did already, see above log entry [02:19:51] so you did, thanks [02:23:10] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37566 [02:23:48] !log reloading all apaches [02:23:58] Logged the message, Master [02:26:05] !log LocalisationUpdate completed (1.21wmf7) at Wed Jan 23 02:26:04 UTC 2013 [02:26:07] PROBLEM - Puppet freshness on db50 is CRITICAL: Puppet has not run in the last 10 hours [02:26:08] PROBLEM - Puppet freshness on db68 is CRITICAL: Puppet has not run in the last 10 hours [02:26:15] Logged the message, Master [02:29:59] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [02:32:07] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [02:32:09] PROBLEM - swift-account-replicator on ms-be1008 is CRITICAL: Connection refused by host [02:32:10] PROBLEM - swift-container-server on ms-be1008 is CRITICAL: Connection refused by host [02:32:10] PROBLEM - swift-object-updater on ms-be1008 is CRITICAL: Connection refused by host [02:32:10] PROBLEM - swift-container-updater on ms-be1008 is CRITICAL: Connection refused by host [02:32:20] PROBLEM - swift-container-auditor on ms-be1008 is CRITICAL: Connection refused by host [02:32:20] PROBLEM - swift-object-auditor on ms-be1008 is CRITICAL: Connection refused by host [02:32:20] PROBLEM - swift-account-server on ms-be1008 is CRITICAL: Connection refused by host [02:32:39] PROBLEM - SSH on ms-be1008 is CRITICAL: Connection refused [02:32:40] PROBLEM - swift-object-replicator on ms-be1008 is CRITICAL: Connection refused by host [02:32:40] PROBLEM - swift-account-auditor on ms-be1008 is CRITICAL: Connection refused by host [02:32:49] PROBLEM - swift-object-server on ms-be1008 is CRITICAL: Connection refused by host [02:32:49] PROBLEM - swift-container-replicator on ms-be1008 is CRITICAL: Connection refused by host [02:32:49] PROBLEM - swift-account-reaper on ms-be1008 is CRITICAL: Connection refused by host [02:34:19] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [02:35:43] PROBLEM - swift-account-reaper on ms-be1008 is CRITICAL: Connection refused by host [02:35:43] PROBLEM - swift-object-replicator on ms-be1008 is CRITICAL: Connection refused by host [02:35:52] PROBLEM - swift-object-updater on ms-be1008 is CRITICAL: Connection refused by host [02:36:10] PROBLEM - swift-account-replicator on ms-be1008 is CRITICAL: Connection refused by host [02:36:28] PROBLEM - SSH on ms-be1008 is CRITICAL: Connection refused [02:36:29] PROBLEM - swift-container-updater on ms-be1008 is CRITICAL: Connection refused by host [02:36:38] PROBLEM - swift-container-server on ms-be1008 is CRITICAL: Connection refused by host [02:36:46] PROBLEM - swift-container-auditor on ms-be1008 is CRITICAL: Connection refused by host [02:36:47] PROBLEM - swift-container-replicator on ms-be1008 is CRITICAL: Connection refused by host [02:37:13] 
PROBLEM - swift-object-auditor on ms-be1008 is CRITICAL: Connection refused by host [02:37:19] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [02:37:22] PROBLEM - swift-account-auditor on ms-be1008 is CRITICAL: Connection refused by host [02:37:22] PROBLEM - swift-object-server on ms-be1008 is CRITICAL: Connection refused by host [02:37:40] PROBLEM - swift-account-server on ms-be1008 is CRITICAL: Connection refused by host [02:41:07] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours [02:42:05] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100% [02:45:55] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100% [02:46:49] RECOVERY - Puppet freshness on db65 is OK: puppet ran at Wed Jan 23 02:46:20 UTC 2013 [02:46:58] RECOVERY - MySQL Replication Heartbeat on db65 is OK: OK replication delay 0 seconds [02:47:05] RECOVERY - MySQL Replication Heartbeat on db65 is OK: OK replication delay 0 seconds [02:49:23] !log LocalisationUpdate completed (1.21wmf8) at Wed Jan 23 02:49:23 UTC 2013 [02:49:33] Logged the message, Master [02:51:05] RECOVERY - MySQL Replication Heartbeat on db68 is OK: OK replication delay 0 seconds [02:51:10] RECOVERY - Puppet freshness on db68 is OK: puppet ran at Wed Jan 23 02:50:46 UTC 2013 [02:51:46] RECOVERY - MySQL Replication Heartbeat on db68 is OK: OK replication delay 0 seconds [02:53:16] RECOVERY - Puppet freshness on db57 is OK: puppet ran at Wed Jan 23 02:53:13 UTC 2013 [02:53:56] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [02:54:15] RECOVERY - MySQL Replication Heartbeat on db57 is OK: OK replication delay 0 seconds [02:54:28] RECOVERY - MySQL Replication Heartbeat on db57 is OK: OK replication delay 0 seconds [02:55:22] RECOVERY - Puppet freshness on db50 is OK: puppet ran at Wed Jan 23 02:55:10 UTC 2013 [02:55:40] RECOVERY - MySQL Replication Heartbeat on db50 is OK: OK replication delay 0 seconds [02:55:56] RECOVERY - MySQL Replication Heartbeat on db50 is OK: OK replication delay 0 seconds [03:02:52] RECOVERY - Puppet freshness on db66 is OK: puppet ran at Wed Jan 23 03:02:21 UTC 2013 [03:02:55] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay 0 seconds [03:03:10] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay 0 seconds [03:07:49] RECOVERY - Puppet freshness on db60 is OK: puppet ran at Wed Jan 23 03:07:27 UTC 2013 [03:07:58] RECOVERY - MySQL Replication Heartbeat on db60 is OK: OK replication delay 0 seconds [03:08:25] RECOVERY - MySQL Replication Heartbeat on db60 is OK: OK replication delay 0 seconds [03:08:52] RECOVERY - Puppet freshness on db55 is OK: puppet ran at Wed Jan 23 03:08:39 UTC 2013 [03:09:05] RECOVERY - MySQL Replication Heartbeat on db55 is OK: OK replication delay 0 seconds [03:09:46] RECOVERY - MySQL Replication Heartbeat on db55 is OK: OK replication delay 0 seconds [03:11:23] ssh trouser.org [03:11:26] garg. [03:14:55] is that getting caught with your pants down? 
[03:15:10] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [03:15:10] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [03:15:10] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [03:15:10] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [03:15:10] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [03:15:11] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [03:15:11] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [03:16:12] binasher: the way db-*.php is selected is not quite what I expected [03:16:35] when eqiad is the active data centre, hosts in tampa need to be using the eqiad masters [03:16:37] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [03:16:42] it doesn't matter where the actual client is [03:16:52] binasher: I suppose so. At least this time it wasn't a passphrase. [03:16:57] so it's kind of weird to be using getRealmSpecificFilename() [03:17:22] presumably all the cron jobs on hume are broken, as well as testwiki [03:18:19] i thought it was unfortunate that function wants to stat realm files only used by labs before looking at site [03:19:09] there should really be an eqiad replacement for hume [03:19:29] i guess testwiki needs to stay in ptmpa while it depends on nfs [03:19:56] sure, there should be, but we discussed this, it was decided that there wouldn't be [03:20:15] not at first [03:21:27] that's annoying [03:21:55] ideally, maintenance scripts running in hume would use the eqiad masters but the pmtpa slaves [03:22:18] we could have puppet set /etc/mediawiki-site to $::mw_primary on specific hosts, instead of to $::site [03:22:44] bast1001 is still not anything like a replacement for fenari, it has no dsh for one thing [03:23:17] so it will be necessary to run maintenance scripts on fenari for deployment [03:23:49] I was thinking about just having it set to a string literal in CommonSettings.php [03:24:10] that sounds better [03:24:23] $wmgActiveCluster = 'eqiad'; [03:25:16] then you would just change it and sync it out [03:29:33] things like sectionsByDB shouldn't be duplicated between these two files [03:29:41] the inactive one will rot [03:30:59] I think the best thing to do is to merge them back into one file [03:32:30] db.php needs to be parseable by switch.php, right? [03:32:39] you still use that master switch script don't you? [03:33:42] last time I checked, it was still in subversion [04:01:33] New patchset: Tim Starling; "Quick hack to get pmtpa out of read-only mode" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45296 [04:04:26] anyone still around? [04:05:30] didn't think so [04:05:37] TimStarling: ahoy [04:05:42] probably not who you were hoping for [04:05:49] but if I can help, let me know. 
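The "$wmgActiveCluster = 'eqiad';" idea discussed above could look roughly like this in CommonSettings.php: one literal that says which data centre holds the masters, changed and synced out at switchover, so maintenance hosts in the passive site still write to the active masters. This is only a sketch; the variable name is from the discussion, but the file layout and $wmfConfigDir usage are assumptions about wmf-config, not its actual contents.

    <?php
    // Which data centre currently holds the masters. Edited by hand and
    // synced out during a switchover, regardless of where this code runs.
    $wmgActiveCluster = 'eqiad';

    // Every host, in either data centre, loads the db config for the active
    // cluster, so cron/maintenance hosts in pmtpa write to the eqiad masters
    // rather than picking a file based on their own location.
    require_once "$wmfConfigDir/db-$wmgActiveCluster.php";

It keeps the selection out of getRealmSpecificFilename() entirely, though it does nothing about the duplication (sectionsByDB and friends) between the two db files that the discussion also wants merged back into one.
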
[04:06:09] thanks ori-l [04:06:15] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45296 [04:07:47] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 04:07:37 UTC 2013 [04:07:57] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 04:07:47 UTC 2013 [04:08:08] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [04:08:11] apparently someone deleted docroot/noc locally on fenari and didn't commit it [04:08:27] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 04:08:20 UTC 2013 [04:08:37] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [04:08:48] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 04:08:43 UTC 2013 [04:08:55] that's fine! I will clean up while the ops team parties! ;) [04:09:08] RECOVERY - Puppet freshness on db65 is OK: puppet ran at Wed Jan 23 04:08:59 UTC 2013 [04:09:08] RECOVERY - Puppet freshness on db68 is OK: puppet ran at Wed Jan 23 04:09:01 UTC 2013 [04:09:08] RECOVERY - Puppet freshness on db57 is OK: puppet ran at Wed Jan 23 04:09:02 UTC 2013 [04:09:08] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [04:09:08] RECOVERY - Puppet freshness on db50 is OK: puppet ran at Wed Jan 23 04:09:03 UTC 2013 [04:09:08] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 04:09:06 UTC 2013 [04:09:08] RECOVERY - Puppet freshness on db66 is OK: puppet ran at Wed Jan 23 04:09:06 UTC 2013 [04:09:17] RECOVERY - Puppet freshness on db55 is OK: puppet ran at Wed Jan 23 04:09:09 UTC 2013 [04:09:18] RECOVERY - Puppet freshness on db60 is OK: puppet ran at Wed Jan 23 04:09:09 UTC 2013 [04:09:38] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [04:10:05] !log tstarling synchronized wmf-config/db-pmtpa.php 'back to read/write mode (with eqiad master)' [04:10:07] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [04:10:09] Logged the message, Master [04:10:26] heh. [04:11:07] !log tstarling synchronized wmf-config/db-eqiad.php 'cleanup, added pmtpa host/IP mappings' [04:11:19] Logged the message, Master [04:15:27] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 04:15:25 UTC 2013 [04:15:37] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [04:19:45] that had an interesting effect on http://noc.wikimedia.org/dbtree/ :) [04:20:39] binasher: do you know what is going on with the noc docroot? [04:21:22] i probably fucked it up, let me look [04:23:50] wow, i sure did. fixed. [04:25:34] there's still a complete copy of the mediawiki-config repository in docroot/mediawiki-config, should that be deleted? [04:25:51] hey I wonder if there's another copy in docroot/mediawiki-config/docroot/mediawiki-config [04:27:44] deleted [04:27:51] thanks [04:27:55] * binasher picked the wrong day to stop sniffing glue  [04:28:34] yeah that can really mess with your head [04:29:14] ok, what else is broken? 
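The stray copy of mediawiki-config found inside docroot/ above is the sort of thing a quick scan for unexpected .git directories will catch. A sketch, with an assumed docroot path:

    <?php
    // Walk a docroot and report any nested git checkouts below the top level.

    $docroot = '/home/wikipedia/common/docroot';   // assumed path

    $it = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator( $docroot, FilesystemIterator::SKIP_DOTS ),
        RecursiveIteratorIterator::SELF_FIRST
    );

    foreach ( $it as $entry ) {
        if ( $entry->isDir() && $entry->getFilename() === '.git' ) {
            echo "nested repository: " . dirname( $entry->getPathname() ) . "\n";
        }
    }
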
[04:58:18] New patchset: Tim Starling; "Fix 404s for nonexistent domains" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45298 [04:58:45] Change merged: Tim Starling; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45298 [05:06:54] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [05:07:24] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [05:09:19] New patchset: Tim Starling; "Allow sync-apache to be run as non-root" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45300 [05:11:32] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45300 [06:09:54] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours [06:16:15] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:25:45] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 06:25:42 UTC 2013 [06:25:55] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 06:25:50 UTC 2013 [06:26:16] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:26:16] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 06:26:10 UTC 2013 [06:26:26] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [06:26:35] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 06:26:25 UTC 2013 [06:27:15] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:27:26] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [06:35:15] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [06:35:20] New patchset: Tim Starling; "Bug 43448: don't use threads in varnishhtcpd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45302 [06:35:58] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45302 [06:36:04] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [06:41:01] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [06:43:45] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 06:43:37 UTC 2013 [06:44:15] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [06:46:33] TimStarling: I find it curious that you're not logging failures. 
[07:13:57] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45243 [07:15:59] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/36146 [07:20:44] New patchset: ArielGlenn; "remove the images::rsync(d) classes from use too (see change I8bf98650)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45303 [07:21:57] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45303 [07:58:40] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [07:59:06] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [08:07:40] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 08:07:34 UTC 2013 [08:07:41] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [08:07:50] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 08:07:43 UTC 2013 [08:07:51] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 08:07:48 UTC 2013 [08:08:00] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [08:08:01] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 08:07:55 UTC 2013 [08:08:41] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [08:09:00] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [08:15:50] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 08:15:42 UTC 2013 [08:16:00] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [08:27:30] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 08:27:25 UTC 2013 [08:27:40] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [09:18:09] if (req.request == "PURGE") { [09:18:09] if (!client.ip ~ purge) { [09:18:09] error 405 "Denied."; [09:18:13] what is this meant to do? [09:19:07] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [09:25:26] never mind [09:40:08] PROBLEM - Varnish HTCP daemon on cp1021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (varnishhtcpd), args varnishhtcpd worker [09:42:38] PROBLEM - Varnish HTCP daemon on cp1021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (varnishhtcpd), args varnishhtcpd worker [10:00:56] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [10:03:06] TimStarling: ^^. Not sure if that's one of the instances you rolled out your rewrite to. [10:03:39] it's the one I'm using for testing [10:03:43] it's broken on all of them [10:04:15] oh. [10:04:39] also, re: client.ip ~ purge: see the third example on https://www.varnish-cache.org/docs/2.1/tutorial/vcl.html. '~' operator can match ACLs. [10:09:59] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [10:20:11] New patchset: Tim Starling; "Fix another bug in varnishhtcpd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45311 [10:23:48] New review: Tim Starling; "I confirmed that this bug affects the original version. I tested this version on cp1021." 
[operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45311 [10:23:49] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45311 [10:25:49] ori-l: in fact I can't see how it ever could have worked [10:26:38] I confirmed it with the original code, it definitely doesn't do what it says on the tin [10:26:45] RECOVERY - Varnish HTCP daemon on cp1021 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [10:27:09] RECOVERY - Varnish HTCP daemon on cp1021 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [10:27:53] argh [10:27:59] "more than half of requests result in a 404"! [10:28:02] https://gerrit.wikimedia.org/r/#/c/29786/1/files/varnish/varnishhtcpd,unified [10:28:07] there's the answer to that then [10:28:21] wikipedia dark matter [10:28:23] it would have worked before that change [10:29:19] * TimStarling hits mark with the testbat [10:31:21] how would it have worked before that change? proxy is still only set once, outside the loop [10:31:24] ah, except for the URL error [10:32:29] # This is a stupid hack to make varnishhtcpd work - it's using a perl mod that sends purge reqs like [10:32:30] # PURGE http://de.wikipedia.orghttp://de.wikipedia.org/w/index.php [10:32:30] } elsif (req.url ~ "^http://upload.wikimedia.org") { [10:32:30] set req.url = regsub ( req.url, "^http://[\w.]+(/.*)", "\1"); [10:33:02] instead of fixing two lines of perl code [10:34:26] '[\w.]+'? [10:37:14] is '.' in a character class a literal dot? [10:37:56] yes. nevermind. [10:43:17] right, not so easy to fix actually [10:43:24] the bug is in LWP, cooperating with varnish [10:43:38] two bugs, I guess [10:43:46] both will deny it though, if you report it to them [10:45:24] let's blame RFC 2616 [11:03:07] hey [11:04:31] hello [11:04:47] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [11:05:35] having fun with varnish? [11:05:38] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [11:05:41] yeah [11:05:55] done with puppet? [11:05:56] :P [11:06:12] this is definitely a varnish bug, RFC 2616 is actually really clear [11:06:40] what is? [11:07:00] say if I do: curl -vvv -x 10.64.0.143:80 'http://upload.wikimedia.org/kzizzle' [11:07:18] then in the log we see: cp1021.eqiad.wmnet 10676 2013-01-23T11:05:44 0.057930470 208.80.152.165 miss/404 882 GET http://upload.wikimedia.orghttp://upload.wikimedia.org/kzizzle - - - - curl/7.22.0%20(x86_64-pc-linux-gnu)%20libcurl/7.22.0%20OpenSSL/1.0.1%20zlib/1.2.3.4%20libidn/1.23%20librtmp/2.3 [11:07:34] the protocol and host is doubled [11:07:49] http://upload.wikimedia.orghttp://upload.wikimedia.org/kzizzle [11:07:56] aha [11:08:26] there's a comment in the varnish configuration that blames varnishhtcpd for sending such URLs [11:08:39] but actually varnishhtcpd is sending the exact right thing [11:08:54] TimStarling: seeing your two recent varnishhtcpd patches, do you have some concrete ideas what's needed to improve/fix https://bugzilla.wikimedia.org/show_bug.cgi?id=43449 ("Monitor effectiveness of HTCP purging")? 
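The request shape at issue above can be reproduced directly: the purge client sends an absolute request-URI plus a Host header (what LWP emits once a proxy is configured), and the question is how the cache interprets that request line, which is where the doubled protocol-plus-host URL in the log comes from. A sketch; the cache address is the test IP quoted above and should be treated as a placeholder.

    <?php
    // Send a PURGE with an absolute URI in the request line plus a Host
    // header, the form RFC 2616 section 5.1.2 says servers must accept, and
    // print the status line to see how the cache handled it.

    $cache = '10.64.0.143';   // frontend cache used for testing above; placeholder
    $url   = 'http://upload.wikimedia.org/kzizzle';
    $host  = 'upload.wikimedia.org';

    $fp = fsockopen( $cache, 80, $errno, $errstr, 5 );
    if ( !$fp ) {
        die( "connect failed: $errstr ($errno)\n" );
    }

    fwrite( $fp, "PURGE $url HTTP/1.1\r\n" .
                 "Host: $host\r\n" .
                 "Connection: close\r\n\r\n" );

    echo fgets( $fp );   // first line of the cache's response
    fclose( $fp );
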
[11:09:05] and RFC 2616 requires even non-proxy webservers to understand it [11:09:11] presumably it sends PURGE http://upload.wikimedia.org/foo HTTP/1.1 instead of PURGE /foo HTTP/1.1 [11:09:23] yes, with a Host header as well [11:09:34] so varnish does http:// + host header + request URI [11:10:12] but the spec says "To allow for transition to absoluteURIs in all requests in future versions of HTTP, all HTTP/1.1 servers MUST accept the absoluteURI form in requests, even though HTTP/1.1 clients will only generate them in requests to proxies." [11:10:51] what?! [11:11:07] wow, I didn't know that [11:11:22] andre__: just purge a URL, request it, and check its Age header [11:11:27] it should be less than some threshold [11:11:55] http://tools.ietf.org/rfcmarkup?doc=2616#section-5.1.2 [11:12:54] I didn't know there was a goal of transitioning to absolute URLs in future versions of HTTP either [11:13:00] although this might have changed since [11:13:29] maybe that's why every webserver except varnish seems to work just fine when you use it as a proxy [11:13:49] so LWP does the right thing according to the RFC, since it has a proxy configured [11:13:51] I thought it was a very useful defacto standard, but in fact it's a real standard [11:14:04] yes, LWP and curl are doing the right thing, the same thing [11:14:16] sending an absolute URI with a host header [11:14:37] the section on host headers in RFC 2616 makes it clear that a host header is required, there's no exception for absolute URIs [11:15:32] anyway, purging should be fixed now, other than that [11:20:49] I guess I'll close the bug [11:28:01] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 27.31 ms [11:30:38] RECOVERY - Host ms-be1008 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [11:45:18] PROBLEM - NTP on ms-be1008 is CRITICAL: NTP CRITICAL: No response from NTP server [11:54:28] PROBLEM - NTP on ms-be1008 is CRITICAL: NTP CRITICAL: No response from NTP server [11:54:48] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [12:01:00] https://www.varnish-cache.org/trac/ticket/1255 [12:06:20] I'm surprised you didn't fix that [12:06:24] ;) [12:06:37] it's getting late [12:06:42] maybe tomorrow ;) [12:07:48] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 12:07:46 UTC 2013 [12:08:49] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [12:08:59] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 12:08:51 UTC 2013 [12:09:49] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [12:16:01] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [12:27:50] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [12:28:31] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [12:29:10] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [12:29:16] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 2.63 ms [12:32:01] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [12:33:11] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [12:42:46] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours [12:54:00] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [12:54:01] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.105 second response time [12:54:02] 
RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.099 second response time [12:56:13] New patchset: Demon; "Refactor wikibugs manifest" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45320 [12:57:22] New patchset: Demon; "Refactor wikibugs manifest" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45320 [13:15:29] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45320 [13:16:49] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [13:16:50] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [13:16:50] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [13:16:50] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [13:16:50] PROBLEM - Puppet freshness on ms-be1008 is CRITICAL: Puppet has not run in the last 10 hours [13:16:50] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [13:16:50] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [13:16:51] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [13:30:49] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [13:39:44] New patchset: Demon; "Apply wikibugs manifest to mchenry" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45322 [13:40:01] New patchset: Demon; "Apply wikibugs manifest to mchenry" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45322 [14:05:37] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [14:27:50] Hi folks, we're getting scattered reports of incorrect date/time issues. Known issue? http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Users_reporting_site_time_issues [14:30:30] <^demon> Hmm, those reports all look like they're from anons. [14:30:35] <^demon> It's showing correct for me logged in. [14:31:54] Yeah, me too. [14:33:24] <^demon> Editing as an anon seems to work too, I'm getting the correct timestamp. [14:33:25] I'm seeing Typhoon Rusa as anon [14:33:39] varnishhtcpd troubles? [14:33:40] <^demon> I see Skye. [14:33:45] <^demon> As anon. [14:33:49] didn't purge Main Page [14:33:54] mark: ^^^ [14:34:52] squid [14:34:53] hmm [14:35:32] paravoid@serenity:~$ GET -H "User-Agent: foo" -H "Host: en.wikipedia.org" -Used 'http://wikipedia-lb.esams.wikimedia.org/w/index.php?title=Main_Page&action=history' |grep Last-Modified [14:35:36] Last-Modified: Tue, 22 Jan 2013 22:41:11 GMT [14:35:38] paravoid@serenity:~$ GET -H "User-Agent: foo" -H "Host: en.wikipedia.org" -Used 'http://wikipedia-lb.eqiad.wikimedia.org/w/index.php?title=Main_Page&action=history' |grep Last-Modified [14:35:42] Last-Modified: Wed, 23 Jan 2013 14:34:59 GMT [14:36:05] yeah, it's text [14:38:32] i'm seeing no purges for text indeed [14:40:14] no idea where to start debugging it [14:40:30] i'm on it [14:41:10] purges from eqiad apaches are not reaching dobson [14:42:10] same for file reuploads? [14:44:00] <^demon> Philippe|Away: I replied on-wiki. [14:44:06] Thanks, ^demon [14:44:11] <^demon> yw [14:45:25] Philippe|Away: and thank you for reporting it here :) [14:45:31] Thanks, all. Imma bug out now... :) [14:49:26] looks like a multicast routing problem [14:58:32] fixed [14:59:02] <^demon> Awesome, I'll report that. Will users need to purge any pages? 
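For bug 43449 ("Monitor effectiveness of HTCP purging"), the approach suggested earlier in the log is simply: purge a URL, request it again, and check that the Age header is under some threshold. A sketch of that check; the URL and the 60-second threshold are arbitrary, and issuing the purge itself (HTCP or action=purge) is left out.

    <?php
    // After purging a URL, fetch it and make sure the cache's Age header is
    // below a threshold; a large Age means the purge never took effect.

    $url    = 'http://en.wikipedia.org/wiki/Main_Page';
    $maxAge = 60;   // seconds

    $ch = curl_init( $url );
    curl_setopt_array( $ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER         => true,
        CURLOPT_NOBODY         => true,
        CURLOPT_USERAGENT      => 'purge-check-sketch',
    ) );
    $headers = curl_exec( $ch );
    curl_close( $ch );

    $age = 0;
    if ( preg_match( '/^Age:\s*(\d+)/mi', $headers, $m ) ) {
        $age = (int)$m[1];
    }

    if ( $age > $maxAge ) {
        echo "STALE: Age is {$age}s, the purge probably never arrived\n";
        exit( 1 );
    }
    echo "OK: Age is {$age}s\n";
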
[14:59:09] just a sec, let me verify [15:00:32] yeah looking much better [15:00:45] ^demon: yes... unfortunately all purges since the migration have gone lost, except within eqiad itself [15:00:50] so in practice, that affects europe [15:01:07] <^demon> Mmk. [15:02:53] multicast can be finicky [15:03:14] one missing pim interface statement somewhere in the edges of your network can cause problems [15:04:10] <^demon> Ugh. Just saw a report on enwiki of "No email notifs since yesterday." [15:04:22] right [15:04:26] probably a missing acl on the mail server [15:04:29] * mark checks [15:05:12] fixed, I think [15:05:14] <^demon> If not, jobqueue? *shudder* [15:07:52] for what? [15:08:57] perhaps it's not fixed, not sure [15:09:48] !log Added missing pim interface ae0.400 statement on cr2-pmtpa which prevented eqiad purges from reaching pmtpa [15:10:00] Logged the message, Master [15:10:33] <^demon> paravoid: E-mail notifs for watchlist changes. [15:10:50] <^demon> Ah, I just got a notification. Maybe it is fixed. [15:13:44] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [15:17:37] hi [15:18:31] hi [15:18:43] I've added another ssh key to labs.wikimedia.org [15:18:49] but it seems I can't use it on stat1 [15:19:15] says "permission denied" [15:19:33] so one of my keys, worked fine, but when I added this second one, it doesn't work now [15:19:36] paravoid: hi paravoid [15:19:41] my username is spetrea [15:20:52] is stat1 connected to labs somehow? [15:21:10] nope [15:21:57] your key is statically defined in puppet [15:22:06] nothing to do with labs [15:24:34] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100% [15:25:35] cmjohnson1: hey [15:25:41] cmjohnson1: ms-be1008 has been rebooting a lot since yesterday [15:29:41] PROBLEM - Host ms-be1008 is DOWN: PING CRITICAL - Packet loss = 100% [15:35:15] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:36:41] New patchset: Matthias Mullie; "Reinstate AFTv5 test groups" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45329 [15:36:59] paravoid: what can I do to have it in the puppet ? [15:37:20] paravoid: is it hardcoded in puppet ? so I should create a gerrit patchset ? [15:38:07] paravoid: i've rebooted it once today..it took over 12 hours to come back up from yesterday. 
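The missing pim interface statement above meant eqiad purges never made it to pmtpa, and nothing noticed for hours. One low-tech way to check that purge traffic is reaching a host at all is to join the multicast group and count datagrams for a while. The group and port below are assumptions (4827 is the standard HTCP port, the group address is a guess at the purge group), and the multicast join via socket_set_option() needs PHP 5.4 or later.

    <?php
    // Join the purge multicast group on this host and count datagrams for
    // 30 seconds; zero received while edits are happening suggests a
    // multicast routing problem like the one fixed above.

    $group = '239.128.0.112';   // assumed purge group
    $port  = 4827;              // standard HTCP port

    $sock = socket_create( AF_INET, SOCK_DGRAM, SOL_UDP );
    socket_set_option( $sock, SOL_SOCKET, SO_REUSEADDR, 1 );
    socket_set_option( $sock, SOL_SOCKET, SO_RCVTIMEO, array( 'sec' => 1, 'usec' => 0 ) );
    socket_bind( $sock, '0.0.0.0', $port );
    socket_set_option( $sock, IPPROTO_IP, MCAST_JOIN_GROUP,
        array( 'group' => $group, 'interface' => 0 ) );

    $count = 0;
    $until = time() + 30;
    while ( time() < $until ) {
        // Returns false (warning suppressed) when the 1-second timeout hits.
        if ( @socket_recvfrom( $sock, $buf, 65535, 0, $from, $fromPort ) !== false ) {
            $count++;
        }
    }
    echo "received $count purge datagrams in 30s on $group:$port\n";
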
[15:38:08] yes [15:38:15] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [15:43:55] New patchset: Demon; "Cleanup ldap script formatting" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45330 [15:44:43] New patchset: Demon; "Cleanup ldap script formatting" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45330 [15:52:38] New patchset: Demon; "Remove legacy svn hooks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45332 [16:07:51] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 16:07:44 UTC 2013 [16:08:02] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 16:07:58 UTC 2013 [16:08:41] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [16:08:42] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [16:08:51] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 16:08:42 UTC 2013 [16:08:52] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 16:08:50 UTC 2013 [16:09:42] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [16:09:42] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [16:10:22] PROBLEM - Puppet freshness on mw15 is CRITICAL: Puppet has not run in the last 10 hours [16:10:42] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 16:10:39 UTC 2013 [16:10:42] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [16:21:31] New review: Cmcmahon; "I need this also" [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/45329 [16:35:32] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [16:37:14] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [16:42:20] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [16:55:33] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [16:57:34] mutante: around? [17:07:34] anyone want to review those DNS changes before I authdns-update? [17:11:16] I guess [17:11:34] svn diff is empty on sockpuppet [17:13:07] I've commited them [17:13:13] they were a bunch of them [17:13:23] it didn't make sense to have them as a huge diff [17:14:16] svn diff -r 3860:3865 [17:14:21] or your inbox :-) [17:14:30] mark: ^ [17:15:12] you're removing download.wikimedia.org? [17:15:18] no [17:15:21] download.esams.wikimedia.org [17:15:25] ah [17:15:28] svn diff is nasty there [17:15:32] not enough context ;) [17:15:35] I know [17:15:48] commit messages help :-) [17:16:36] go ahead [17:21:07] oh dear [17:21:13] wikimedia.org has a whole esams section duplicated [17:22:50] yeah I really wonder how that happened [17:23:16] yyp [17:24:20] btw I guess swift is still sending requests to the tampa image scalers [17:25:11] yes [17:25:35] isn't it better than going all the way back to eqiad [17:25:37] ? [17:25:52] I think it's fine for now [17:26:06] just making sure everyone is aware [17:26:40] ok [17:38:41] PROBLEM - Varnish HTCP daemon on cp1041 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (varnishhtcpd), args varnishhtcpd worker [17:41:05] New patchset: Andrew Bogott; "Don't remove default apache site." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/45339 [17:41:05] mark: have the courage to review "svn diff -c 3867" too? [17:41:26] PROBLEM - Varnish HTCP daemon on cp1041 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (varnishhtcpd), args varnishhtcpd worker [17:42:03] gosh svn is annoying now [17:42:22] yes :) [17:43:28] i think that's fine [17:43:52] thanks a lot [17:44:18] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45339 [17:44:24] !log running authdns-update - a bunch of cleanups, r3861:3867 [17:44:35] Logged the message, Master [17:47:12] New patchset: Asher; "set SO_RCVBUF on mcast udp listen socket to minimize buffer overruns, add --buffer config option" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45340 [17:50:05] New patchset: Asher; "set SO_RCVBUF on mcast udp listen socket to minimize buffer overruns, add --buffer config option" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45340 [17:54:14] perl.. at least it's unixy and kinda follows libc [17:54:38] hehe [17:55:15] btw, in your stats mail you said "everyone hates perl" [17:55:20] I don't :) [17:55:32] RECOVERY - Varnish HTCP daemon on cp1041 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [17:55:36] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45340 [17:55:41] RECOVERY - Varnish HTCP daemon on cp1041 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [17:56:56] paravoid: let's port mediawiki to perl 6. it can have the slogan "it can't be any worse" [17:57:11] hahaha [17:57:27] there's plenty to hate about perl tbh [17:57:28] I remember a Perl6 presentation where they said you can make your own syntax [17:58:30] it's slightly amusing that the original varnishhtcpd with its thread deadlocks was written by the guy who actually wrote the threading support in perl5 [17:59:00] who's that? [17:59:02] Artur? [17:59:31] yeah [17:59:40] hey there: https://meta.wikimedia.org/wiki/Servers#Hosting should probably be updated to reflect reality. [18:00:09] And a quick question: does WMF own its servers, or is it renting them at the different facilities? [18:00:14] reality.. shit just got real [18:00:21] New patchset: Silke Meyer; "Minor fixes related to images" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45343 [18:03:10] anyone? 
[18:03:29] abartov - we own those servers [18:03:45] !log amended wikiadmin grants on s7 to allow new jobrunners access to centralauth [18:03:46] thanks, woosters [18:03:55] Logged the message, Master [18:04:02] that page is indeed dated [18:04:53] this page should probably be deleted or marked historical only - http://wikitech.wikimedia.org/view/Server_roles [18:05:47] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [18:06:07] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [18:12:31] binasher: updated [18:13:40] ori-l: thanks :) i'm not sure why i took the time to type into irc instead of onto the wiki page, heh [18:14:09] happy to help [18:14:24] oh gosh [18:14:29] having to edit that table sucked [18:14:37] abartov: nowadays one has to learn puppet to (more or less) know what servers do [18:15:59] Change abandoned: Silke Meyer; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35173 [18:18:48] New review: Andrew Bogott; "lgtm" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45343 [18:18:57] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45343 [18:21:32] mark re. OTRS upgrade testing...I'm going to need a way to move a very large db dump from db49 or db1048 to colby. I'm not entirely sure how to do that given the sandboxing. suggestions? [18:26:50] fuck. forget it for now. colby doesn't have enough disk space for this project. [18:28:16] RD: ^ [18:32:01] heya, notpeter, i gotta run and get some lunch real soon [18:32:17] got a minute now, or hsoudl I find you after I get back? [18:36:03] Jeff_Green, I realise you may still be busy post-migration but "forget it for now" isn't exactly the best answer for a project we've been waiting on for years, so I hope you're not going to forget it for too long... :( [18:36:27] ok lunchtime, be back in a bit [18:37:49] ottomata: sorry about that. ping me when you're back [18:38:13] no probs ok [18:38:28] actually, notpeter, i haven't left yet [18:38:33] if you want a description of the problem you can cehck it out while I'm gone [18:38:34] eh? [18:38:35] sure, whenever's good for you [18:38:37] ok so [18:38:44] i'm testing this on cp1044.wikimedia.org [18:38:48] diederik linked a ticket [18:38:53] ah right [18:38:53] cool, I'll poke around there [18:38:54] cool, yeah [18:38:57] basically [18:39:17] varnishncsa log format (/etc/defaut/varnishncsa) is set to log the %{X-Carrier}i header [18:39:20] which i think is what we want [18:39:30] if I start my own local varnishncsa instance with that log format [18:39:35] ottomata: that is NOT what we want [18:39:35] that header gets printed out [18:39:56] but, the logs going to the main firehose do not have that header [18:39:59] preilly, oh no? [18:40:03] you'd rather have X-CS? [18:40:06] ottomata: NO you want X-CS [18:40:08] X-CS does not have the country name in it [18:40:10] so [18:40:14] Thehelpfulone: I'm not referring to the project, I'm referring to my request to mark for guidance on moving a db around. there's no point moving a dump to a box that's not viable as currently configured. 
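Asher's patch merged above sets SO_RCVBUF on the multicast listen socket and adds a --buffer option so purge bursts don't overflow the kernel's default receive buffer. The daemon itself is Perl; this is just the same setsockopt call sketched with PHP's socket functions, including reading the value back, since the kernel silently clamps requests above net.core.rmem_max.

    <?php
    // Ask for a larger receive buffer on a UDP listen socket and report what
    // the kernel actually granted.

    $requested = 4 * 1024 * 1024;   // 4 MB; the real value would come from --buffer

    $sock = socket_create( AF_INET, SOCK_DGRAM, SOL_UDP );
    socket_set_option( $sock, SOL_SOCKET, SO_RCVBUF, $requested );

    // Linux reports roughly double the usable size (bookkeeping overhead) and
    // caps the request at net.core.rmem_max, so always check the result.
    $granted = socket_get_option( $sock, SOL_SOCKET, SO_RCVBUF );
    echo "asked for $requested bytes, kernel reports $granted\n";
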
[18:40:27] ottomata: that is the correct header to log [18:40:34] you won't be able to differnetiate between orange cameroon and orange tunisai [18:40:37] DO NOT LOG X-Carrier [18:41:00] erosen thought it would be more useful to have x-carrier [18:41:01] ottomata: you can change X-CS to be unique in Varnish [18:41:03] i mean, sure i'll do whatever [18:41:10] (no one told me this when I originally did it, btw) [18:41:15] so X-Carrier is happening currently [18:41:18] ottomata: it's in the RT ticket [18:41:46] https://rt.wikimedia.org/Ticket/Display.html?id=3158 [18:41:47] ? [18:42:00] but [18:42:00] anyway [18:42:10] asher mentioned it [18:42:13] apparently [18:42:15] yes [18:42:20] but i mean originally [18:42:22] this was deployed with x-carrier [18:42:25] he has since mentioned it [18:42:35] but, notpeter, the use of x-carrier or x-cs is irrelvant to the problem [18:42:38] the headers are being set [18:42:44] but they are not making into main firehose [18:42:50] ottomata: we should really make the X-CS header be the http://en.wikipedia.org/wiki/Mobile_country_code [18:42:52] we can argue about which one to log aside from that issue [18:43:05] ottomata: ok, I'll take a look at that. and you and preilly can hash out the rest [18:43:17] ottomata: like a X-CS = MCC-MNC [18:43:21] that's cool [18:43:26] its not that right now though [18:43:36] soooo! should we open a new ticket [18:43:40] notpeter: look to your right [18:43:41] i can make it log X-CS no prob [18:43:57] varnish .vcl has to be modified then to set the X-CS properly [18:44:41] ottomata: yes [18:45:20] ok cool, well, notpeter and I will sort this out then, and I'll submit a patch to make it log X-CS, and update the logging RT [18:45:25] you wanna take on fixing the .vcl? [18:46:26] I'll give it a shot and call in reinforcements as needed. and it seems like you, drdee, and preilly should chat about what data should be logged. as an architect, this is his job, afterall :) [18:46:47] ottomata: you'll need to modify, "templates/varnish/mobile-frontend.inc.vcl.erb" for the correct X-CS values based on: http://en.wikipedia.org/wiki/Mobile_country_code [18:47:13] I think preilly's suggestion is great! [18:48:01] ottomata: So, the X-CS value that is currently, "CL" will become, "502-13" [18:48:04] New patchset: Ryan Lane; "Add additional wikis for development in labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45349 [18:48:44] fiiiine w me! [18:48:50] ottomata: which equals the X-Carrier of, "Celcom Malaysia" [18:48:56] ok cool, thanks notpeter, [18:49:04] ottomata: Are you upset about this? [18:49:04] New patchset: Ryan Lane; "Add additional wikis for development in labs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45349 [18:49:09] nonono [18:49:10] not at all [18:49:16] paravoid: You around? [18:49:17] that was not a snarky fine [18:49:26] preilly: yes [18:49:26] notpeter: ignore what ottomata says in the rt ticket, it's wrong. do what i said earlier. need to log this via the response head %{}o not %{}i [18:49:26] that was fine like a 'fiiiiiine southern spring day' [18:49:40] binasher, cool! [18:49:42] why? [18:49:56] when I run varnishncsa manually, I get lots for %{}i but not for %{}o [18:50:00] logs* [18:50:05] paravoid: Are you cool with X-CS being: MCC-MNC [18:50:07] preilly: what's up? [18:50:09] ottomata: you ran it for the wrong varnish instance [18:50:15] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45349 [18:50:20] oh? 
[18:50:26] on cp1044? [18:50:32] are there more than one running? [18:50:38] when you just run "varnishncsa -F whatever" on a production varnish host, you are getting the backend varnish instance [18:50:46] logging is always done from the frontend [18:51:01] ahhhh [18:51:25] notpeter: what i said in the ticket, cut and paste. ignore rest of ticket. [18:52:03] but isn't the frontend instance still setting the req header? [18:52:22] yes, that is indeed where it is being set [18:52:56] why would logging the resp header in the front end be correct then? [18:53:36] magic! [18:54:28] paravoid: Have you looked at: https://github.com/wikimedia/Sartoris at all? [18:54:43] preilly: I haven't [18:54:59] paravoid: You have pretty strong Python skills right? [18:55:34] I can manage [18:55:36] what's up? [18:55:50] paravoid: That is our GitDeploy replacement [18:56:02] paravoid: It would be cool if you could take a look at it at some point [18:56:27] paravoid: Ryan_Lane and rfaulkner are working on it [18:56:28] as in the git-deploy part of git-deploy? ;) [18:56:38] yes [18:56:43] paravoid: Yes as in the perl-kiss-oh-death part of it [18:57:12] I needed to task switch to labs for a bit this week, though [18:57:14] we don't use gerrit for that? [18:57:19] it needed some love [18:57:27] I honestly want to switch to gerrit for it [18:57:30] https://github.com/git-deploy/git-deploy go bye-bye [18:57:33] and just let the changes replicate across [18:57:57] I'm not much of a fan of git-hub's process [18:57:58] paravoid: We can move it to Gerrit [18:58:16] paravoid: I'm just NOT a huge fan of Gerrit [18:58:16] as sad as it is, I'm now used to gerrit and find it easier :D [18:58:20] New patchset: Dzahn; "add wikivoyager.org|.de redirects to wikivoyage - RT-4333" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45351 [18:58:28] * preilly cries  [18:58:43] wants to keep gerrit [18:58:51] I wanted to modify a pull request yesterday [18:58:55] apparently that's impossible [18:59:01] if it's someone else's PR [18:59:11] Ryan_Lane: it's NOT impossible [18:59:15] someone else's? [18:59:42] ottomata: varnishlog would let you see it as a request header from the frontend, because it can see request headers as sent from the client, and as sent to the backend as separate entities. In varnish 3.0, varnishncsa could see requests as seen from the front or back and had a -c switch for just from the client, but that became the default and only way it works in 3.0.1 or .2. varnishlog can log vcl set request headers a [18:59:43] request headers. current versions of varnishncsa don't see them. doesn't matter for this because they're always returned as set by varnish in the response object to be varied upon. [18:59:47] how do I push a change into their PR? it's their fork/branch. not mine [18:59:53] Ryan_Lane: yes [18:59:58] New patchset: Dzahn; "add wikivoyager.org|.de redirects to wikivoyage - RT-4333" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45351 [19:00:16] Ryan_Lane: Also we don't need to use PRs [19:00:28] ottomata: but no project had to be blocked on my typing that.. [19:02:14] well, I want code review [19:02:23] how do you do review without PRs? 
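The upshot of the explanation above is that varnishncsa only sees the client side of the transaction, so a header set in VCL has to be logged from the response object, %{X-CS}o rather than %{X-CS}i, and X-CS itself is to become an MCC-MNC code. A quick sanity check for both halves is to fetch a mobile page and look for X-CS on the response; the test URL is a placeholder, carrier detection depends on the source IP, and the only code-to-name pair taken from the log is "502-13" for Celcom Malaysia.

    <?php
    // Confirm X-CS is visible on the response (what %{X-CS}o will log) and
    // translate it via an MCC-MNC table.

    $mccMnc = array(
        '502-13' => 'Celcom Malaysia',   // the example pair given in the log
        // further 'MCC-MNC' => 'carrier' entries would go here
    );

    $ch = curl_init( 'http://en.m.wikipedia.org/wiki/Main_Page' );
    curl_setopt_array( $ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER         => true,
        CURLOPT_NOBODY         => true,
    ) );
    $headers = curl_exec( $ch );
    curl_close( $ch );

    if ( preg_match( '/^X-CS:\s*(\S+)/mi', $headers, $m ) ) {
        $code = $m[1];
        $name = isset( $mccMnc[$code] ) ? $mccMnc[$code] : 'unknown carrier';
        echo "X-CS: $code ($name)\n";
    } else {
        echo "no X-CS header on the response; varnishncsa would log '-' for %{X-CS}o\n";
    }
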
[19:03:03] Thanks binasher, i'll add your response to the ticket for posterity [19:05:38] !log authdns-update [19:05:54] Logged the message, RobH [19:07:04] !log db61 coming offline, and going into otrs sandbox service [19:07:14] Logged the message, RobH [19:09:46] paravoid: I could definitely use some help with the packaging [19:09:54] of? [19:09:56] satoris? [19:09:58] sartoris [19:09:59] yeah [19:10:09] are you good with python setup, btw? [19:10:18] that's something I've never needed to handle [19:10:41] I've done this before, yes [19:10:45] can't say I'm an expert [19:10:49] cool [19:10:53] well, you'll be better than me :) [19:11:22] we should probably wait till we have more of the code base in, though [19:11:53] I'm not terribly sure how to organize this repo. it's going to need puppet, salt, murder_client, sartoris, etc [19:12:00] I think it'll end up being 3-4 packages [19:12:22] the problem is that we'll need to pull this into puppet somehow. probably as a module. [19:13:16] but it would be nice for it to be useable with salt only, as well. I can add some states for that, but the organization of the repo is going to be rough [19:14:03] what has puppet to do with that? [19:14:06] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45351 [19:14:12] I thought you were just using puppet to provision salt minions [19:14:49] there is an untracked file in apache-config repo. "static.conf" [19:15:22] hah [19:15:23] interesting [19:15:31] Ubuntu is thinking of changing their release model [19:15:40] http://arstechnica.com/information-technology/2013/01/ubuntu-considers-huge-change-that-would-end-traditional-release-cycle/ [19:15:46] sound a lot like Debian's :-) [19:15:59] are they running out of silly animal names? [19:16:13] no more 6-month releases [19:16:32] "both stability and cutting-edge features" for real? [19:16:33] just a rolling release (like Debian testing) + LTS releases (like Debian stable) [19:17:21] can they roll out a release that doesn't send local data to amazon? [19:17:23] that'd be sweet [19:17:45] hahaha [19:19:21] but now you'll be able to do that from your phone! [19:19:28] notpeter: ha ha ha [19:19:42] ""In 13.04, we expect to enable instant payments, powered by Ubuntu One, for both applications from the Software Center and music from the Music Store - to deliver the fastest possible purchasing experience directly from the Dash." [19:19:46] The Inquirer (http://s.tt/1wrzj) [19:20:04] I put up with google getting all of my personal data from my phone because they actually know how to protect data [19:20:06] mutante: scary [19:20:11] canonical... yeah, probably not so much [19:20:23] wth, and it inserted that URL on the second line all by itself [19:20:46] mutante: did you recently install ubuntu? [19:20:50] that might be a new featuer [19:21:00] hehe, no:) sticks to Debian [19:21:26] many sites do that [19:21:29] via javascript I think [19:21:44] paravoid: I am [19:21:58] you are? [19:22:00] paravoid: but if we want others to reuse the system [19:22:10] ah, puppet [19:22:12] oh [19:22:26] I read that wrong [19:22:26] we're using puppet to configure salt [19:22:33] because I'm not supposed to be using salt states for things [19:26:30] New patchset: Diederik; "Replaced custom CS carrier codes with MCC-MNC mobile carrier codes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45353 [19:26:45] heh, nice [19:26:51] preilly: liked my MCC-MNC idea? 
:) [19:27:21] paravoid: Well to be fair it was my idea before you brought it up again but yes I like it [19:27:43] you guys should fight over whose idea it was first [19:27:55] so this will be a swift review :D [19:28:04] !log aaron rebuilt wikiversions.cdb and synchronized wikiversions files: Switched all non-wikipedias to 1.21-wmf8 [19:28:15] Logged the message, Master [19:28:20] notpeter: ha ha [19:29:34] I volunteer to be bookie for this fight, btw [19:34:37] mw1072 being broken is known, but mw1085 has same issue now [19:34:56] no Asher today? [19:35:32] MaxSem: he was on earlier. i think he'll be back online later [19:35:43] ok, thanks [19:35:52] oh, or not, just fresh install ? going to add to puppet [19:37:14] New review: CSteipp; "For some history on the old value, Tim made this comment:" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/45273 [19:37:38] New patchset: preilly; "move repo" [operations/debs/sartoris] (master) - https://gerrit.wikimedia.org/r/45357 [19:38:42] After running "sudo apt-get update; sudo apt-get upgrade" on my labs instance I get prompted about 3 new files: /etc/apt/apt.conf.d/20auto-upgrades, /boot/grub/menu.lst, /etc/default/grub saying there are local mods to these files and asking if the old or new files should be used. [19:39:04] Is there any reason to not use the new ones ? [19:40:05] xyzram: always use the default [19:40:11] which is to not replace [19:40:45] xyzram: also best to ask that on the #wikimedia-labs channel ;) [19:41:22] !log stopping puppet on brewster cuz i wanna locally hack and reboot 1 time on db61 without partman auto kicking in [19:41:33] Logged the message, RobH [19:43:23] Change abandoned: preilly; "(no reason)" [operations/debs/sartoris] (master) - https://gerrit.wikimedia.org/r/45357 [19:43:44] !log puppet signing mw1085 (was sitting there like a fresh install) [19:43:55] Logged the message, Master [19:47:19] The following packages have unmet dependencies: php5-xmlrpc : Depends: php5-common (= 5.3.10-1ubuntu3.4+wmf1) but 5.3.10-1ubuntu3.5 is to be installed [19:49:21] err: /Stage[main]/Mediawiki_new/Service[timidity]: Could not evaluate: Could not find init script for 'timidity' [19:49:59] ok. these do not stop puppet run though. this is just "first-time" stuff [19:51:44] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [19:51:54] heh [19:52:02] that's an interesting alert :) [19:52:28] hahaha [19:52:34] RECOVERY - Host google is UP: PING OK - Packet loss = 0%, RTA = 20.66 ms [19:56:12] New patchset: Ottomata; "Logging X-CS as a response header. RT 3158." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45358 [19:59:29] New patchset: preilly; "add .gitreview file" [operations/debs/sartoris] (master) - https://gerrit.wikimedia.org/r/45359 [19:59:54] binasher, this should be ok to merge then? [19:59:56] https://gerrit.wikimedia.org/r/#/c/45358/1/files/varnish/varnishncsa.default [20:00:05] without babysitting? :) [20:00:37] oh perhaps he is offline [20:01:22] New patchset: Dzahn; "fix wikivoyager.org/.de redirects" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45360 [20:01:37] ottomata: the squid part is trivial [20:01:43] yeah, [20:01:55] I can help you with that. 
[20:01:55] i know, but it won't make difference right now [20:01:56] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45360 [20:02:02] that header isn't being set by squid (or nginx for that matter) [20:02:12] and we've got some other changes coming in soon that will be relevant for squid [20:02:23] so no need to deploy something to squid that change anything for it [20:02:26] Change merged: preilly; [operations/debs/sartoris] (master) - https://gerrit.wikimedia.org/r/45359 [20:02:33] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [20:02:38] wha? [20:02:41] stat1001 puppet [20:02:42] psshhh [20:03:28] but thanks paravoid! [20:03:43] actually, if you think that change is ok, I'll just merge it and let puppet put the new one in place [20:03:48] maybe i'll run it on cp1044 to be sure [20:03:49] s'ok [20:03:50] ? [20:05:46] New patchset: preilly; "move https://github.com/wikimedia/Sartoris" [operations/debs/sartoris] (master) - https://gerrit.wikimedia.org/r/45361 [20:06:02] Change merged: preilly; [operations/debs/sartoris] (master) - https://gerrit.wikimedia.org/r/45361 [20:06:14] Ryan_Lane: https://gerrit.wikimedia.org/r/#/c/45361/ [20:06:34] ty [20:06:41] wait [20:06:53] I'm not sure this is correct [20:07:22] naming repos is hard. [20:07:45] Ryan_Lane: What's wrong with operations/debs/sartoris ? [20:07:45] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 20:07:36 UTC 2013 [20:07:53] I guess it's fine [20:08:04] I'm thinking how it's going to look on the github side [20:08:04] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 20:07:54 UTC 2013 [20:08:04] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [20:08:04] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:08:15] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 20:08:12 UTC 2013 [20:08:24] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 20:08:16 UTC 2013 [20:09:04] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:09:04] RECOVERY - Puppet freshness on ms2 is OK: puppet ran at Wed Jan 23 20:08:59 UTC 2013 [20:09:05] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [20:09:23] the real problem is how we're using most of this stuff [20:09:24] RECOVERY - Puppet freshness on ms1 is OK: puppet ran at Wed Jan 23 20:09:14 UTC 2013 [20:09:28] specifically us [20:09:43] some of the stuff in sartoris is going to be a puppet module [20:09:50] some is going to be debian packages [20:10:05] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [20:10:05] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [20:10:45] rfaulkner: take a look at https://gerrit.wikimedia.org/r/#/q/status:merged+project:operations/debs/sartoris,n,z [20:10:54] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Puppet has not run in the last 10 hours [20:11:19] ^demon: ping [20:11:30] <^demon> Pongggg [20:11:53] ^demon: Can you set-up sync'ing between operations/debs/sartoris on gerrit and github [20:12:06] this doesn't really solve our problem... [20:12:18] New patchset: Diederik; "Replace custom CS carrier codes with MCC-MNC mobile carrier codes." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/45353 [20:12:20] the syncing is automatic [20:12:43] the problem is the organization of the repo and the fact that it's not just debian packages [20:12:45] <^demon> Not for repos outside mediawiki/* [20:12:51] ah [20:13:20] erm [20:13:32] I wouldn't name it operations/debs/sartoris [20:13:41] yeah, that's what I was thinking [20:13:47] it's not the "deb" part that's interesting here [20:14:00] <^demon> Bahhhhh.....and it already has commits. [20:14:04] <^demon> Makes moving it impossible. [20:14:13] it can be deleted, right? [20:14:28] paravoid: Well what about lucene-search-2 and squid for example [20:14:42] <^demon> Ryan_Lane: https://gerrit.wikimedia.org/r/#/q/sartoris,n,z - no. I don't want to clean this up by hand. [20:14:46] I didn't say that it's been done right in the past :-) [20:14:54] (but we're not upstream for squid) [20:14:54] paravoid: ha ha ha [20:15:09] this might attract more people from outside the wmf [20:15:15] it's an interesting project [20:15:16] ^demon: is it going to be possible in the newer version of gerrit/ [20:15:19] <^demon> Yes. [20:15:23] ok. cool [20:15:27] let's mark it as deleted, then [20:15:29] <^demon> delete project plugin works great. [20:15:56] <^demon> Marked read-only and description as "DELETE ME" [20:16:03] cool. thanks [20:16:11] ^demon: how's that new version coming along? :) [20:16:13] Ryan_Lane: Wait a second [20:16:24] Ryan_Lane: Why are you deleting it? [20:16:32] because of the name [20:16:36] <^demon> paravoid: Been testing. Pretty ready. Was waiting for eqiad excitement to die down. [20:16:38] Ryan_Lane: WTF [20:16:50] ^demon: that's 2.5 or 2.6? [20:16:55] did you just miss all of the backscroll? [20:17:01] <^demon> paravoid: master/2.6 [20:17:10] cool [20:17:12] Ryan_Lane: either we fix everything else in gerrit or we don't change it [20:17:13] does that include the proper mail diffs? [20:17:25] fix everything else? what do you mean? [20:17:37] Ryan_Lane: did you miss the back scroll? [20:17:40] <^demon> paravoid: Diffs in e-mail? 2.5 lets us add those, yeah. [20:17:50] yay [20:17:57] New patchset: Dzahn; "meh, trailing $ has gotta go from this redirect rule" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45364 [20:18:06] paravoid: thoughts on a name? just "sartoris" works for me [20:18:11] same here [20:18:47] preilly: naming it operations/debs/sartoris means it'll likely get no attention. 
also, it's not just a debian package [20:18:48] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45364 [20:18:57] that was my point as well [20:19:01] it's a puppet module, debian packages, and salt stuff as well [20:19:28] Ryan_Lane: well that sounds like three repos to me [20:19:35] Ryan_Lane: so operations/debs/sartoris is fine [20:19:49] Ryan_Lane: for the debian package part [20:20:03] most projects include the debian directory in with the project [20:20:07] it makes sense to do so here [20:20:20] at most it'd be a separate branch [20:20:23] can't be done otherwise [20:20:56] this is out of our normal naming conventions for our debs, but it's also outside of our conventions for puppet, too [20:20:56] I really don't understand why operations/debs/sartoris is an issue [20:21:12] and I'm getting really really annoyed [20:21:28] let's calm down :) [20:21:44] I think everyone hopes sartoris to be succesful [20:21:49] dzahn is doing a graceful restart of all apaches [20:22:07] and I guess this one of the reasons such a unique name (sartoris) was picked instead of git-deploy-deploy :) [20:22:29] !log dzahn gracefulled all apaches [20:22:39] New review: preilly; "Please find the appropriate MCC-MNC codes for the carriers that don't have them in this change set." [operations/puppet] (production); V: 1 C: -1; - https://gerrit.wikimedia.org/r/45353 [20:22:40] Logged the message, Master [20:22:41] in that sense, it might make sense to have it like a "real" project in gerrit as to ease third-party contributions [20:22:52] but personally, I don't particularly have a strong opinion [20:22:59] paravoid: I came up with the idea and the name [20:23:06] I know [20:23:06] the idea of what? [20:23:20] paravoid: the idea of doing it in Python and extending it [20:23:24] o.O [20:23:29] you're kidding, right? [20:23:35] s/paravoid/Ryan_Lane [20:23:42] Ryan_Lane: How am I kidding? [20:23:44] I talked about rewriting it well before then [20:23:55] Ryan_Lane: Not that I ever heard [20:24:06] does it matter? [20:24:11] !log removing mw1072 (hardware fail, RT-4381) from "apaches" dsh group [20:24:21] paravoid: Not at all [20:24:21] Logged the message, Master [20:24:57] as if I'd leave an all python system with a perl frontend [20:24:59] * Ryan_Lane shrugs [20:25:22] I remember finding git-deploy when I was in SF and pointing it to Ryan who was envisioning a deployment system based on git, plus the whole architecture with salt incl. a comment about how he'd like to rewrite the perl part in python [20:25:29] Ryan_Lane: it wouldn't really surprise me if you did [20:25:46] but I don't think it matters, since I presume we all have the same agenda here [20:25:51] let's calm down [20:26:02] this all boils down to a naming issue [20:26:12] it's my project. I think I get the choice here [20:26:26] <^demon> I should fix the operations/* repo acls. Right now they're all a hodge-podge since there's no top level operations. [20:26:33] <^demon> I also want a pony. [20:26:35] Ryan_Lane: yeah because Open Source is all about something being someones project WTF [20:27:09] yo preilly; about mcc-mnc, there is one missing, that is congo orange, i went to look at mcc-mnc.com and it's not mentioned there either [20:27:31] drdee: Can you ping Kul or Dan to ask the carrier for the correct value? 
[20:27:35] for tata in india, they have different id's per geography [20:27:53] that's why i put in asterisk [20:27:55] yes i'll do that [20:27:58] drdee: I'd leave it up to Kul, Dan or Amit [20:28:01] preilly: it seems to me in this discussion the vote is already 2-1. if it's democracy you want [20:28:19] I'd prefer consensus tbh [20:28:24] Ryan_Lane: I don't really give a flying fuck what you name it [20:28:35] I don't think this is an operations/ project, neither something debian-specific [20:28:46] I'm really missing why this is such a huge deal [20:28:51] but I don't mind either way [20:28:58] I don't particularly mind if it's github either [20:29:01] paravoid: completely agreed [20:29:17] I'd like it to be in gerrit for our own work, and to take PR from github [20:29:33] i almost don't dare to say it.. but github is non-free [20:29:37] I'm fine pulling and merging the PRs manually for now [20:29:43] <^demon> Ryan_Lane, paravoid, preilly: I'm going to name your repo unicorn-poop. You guys can deal with that now :) [20:30:18] I clone that [20:30:21] I'd* [20:30:51] remember "let me github that for you" ..cough [20:32:22] preilly: I apologize that I didn't discuss it with you before having it marked deleted. [20:32:44] aren't you guys sitting across each other? :) [20:32:53] is everyone ok with just naming it "sartoris"? [20:32:56] paravoid: I'm wfm [20:32:59] *wfh [20:37:40] <^demon> Ryan_Lane: I really did prefer unicorn-poop :\ [20:37:46] heh [20:38:26] is the novel still in copyright? [20:38:38] and if so does that even matter? I have no idea [20:38:55] it would matter for trademark maybe [20:39:05] I don't think so for copyright [20:42:09] "ice nine" :-) [20:42:18] heh [20:42:41] Shame etsy already used 'deployinator' [20:42:52] ha [20:50:15] BAH [20:50:19] db61 is on foundry network [20:50:30] bleehhh, now im used to juniper and dont wanna deal with foundry. [20:50:33] i blame LeslieCarr [20:56:47] !log swapped port 3/39 on csw1-sdtpa from vlan 2 to 102 [20:56:58] Logged the message, RobH [20:57:34] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45332 [21:02:19] preilly: wikimedia/Sartoris was deleted? [21:02:50] was operations/debs/sartoris imported with history? [21:03:03] !log authdns-update to push db61 into sandbox [21:03:15] Logged the message, RobH [21:05:18] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [21:06:08] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [21:11:38] PROBLEM - Auth DNS on ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [21:11:47] RobH: ^^ [21:12:28] !log restarting pdns on ns1 [21:12:38] Logged the message, Master [21:13:29] RECOVERY - Auth DNS on ns1.wikimedia.org is OK: DNS OK: 0.031 seconds response time. www.wikipedia.org returns 208.80.154.225 [21:24:36] RECOVERY - Puppet freshness on mw15 is OK: puppet ran at Wed Jan 23 21:24:21 UTC 2013 [21:30:22] Jeff_Green: So db61 is now online and in the sandbox vlan, db61.wikimedia.org [21:30:28] and your key is in the root auth key file [21:30:30] !log restarting opendj on virt0 [21:30:35] Ryan_Lane: It was deleted on Github [21:30:36] RobH: yayy, thanks! [21:30:40] I'll update the bugzilla ticket and re-request the key info for martin [21:30:41] Logged the message, Master [21:30:49] RobH: k [21:30:57] preilly: was the history imported into gerrit? [21:31:09] Ryan_Lane: nope [21:31:58] anyone have a full checkout? 
[21:32:11] Ryan_Lane: you can get it from https://github.com/preillyme/Sartoris [21:32:29] ah. that has the full history? [21:32:35] cool [21:32:36] Ryan_Lane: yes [21:34:07] Ryan_Lane: Have you created a new repo for Sartoris in gerrit? [21:34:12] not yet [21:34:26] Ryan_Lane: Where are you planning on putting it? [21:34:35] I was thinking just "sartoris" [21:34:46] then replicating it to wikimedia/sartoris on github [21:35:36] or Sartoris [21:35:39] either/or [21:40:15] when it comes to HTTP redirects is one year still "temporary"?:) [21:55:25] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [21:56:44] !log authdns update adding mgmt ip's to zone files for 10 misc servers in eqiad [21:56:45] RoanKattouw_away: are you still using yongle? [21:56:53] Logged the message, Master [21:56:55] it's supposed to be decommissioned [21:57:00] but it's still in site.pp [21:57:04] and your account is on there [21:58:06] New patchset: Pyoungmeister; "removing some decommed boxes from site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45372 [21:59:25] notpeter: yongle was decommissioned a very long time ago and is no longer w/us [21:59:49] cmjohnson1: cool! thanks [22:00:05] PROBLEM - Auth DNS on ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [22:00:28] who is doing the dns updates and not checking? [22:00:53] !log restarting pdns on ns0 [22:01:03] Logged the message, Master [22:01:26] notpeter: also, i installed the new 10Gb sfp cards in mc1017/1018 [22:01:27] cmjohnson1: when you do dns updates, you need to make sure all of the servers are returning answers [22:01:28] ryanlane [22:01:46] New patchset: Pyoungmeister; "removing some decommed boxes from site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45372 [22:01:47] if one doesn't it needs to have its pdns service restarted [22:01:55] RECOVERY - Auth DNS on ns0.wikimedia.org is OK: DNS OK: 0.037 seconds response time. www.wikipedia.org returns 208.80.154.225 [22:02:08] i did a dig on ns0 [22:03:46] hm. I wonder why it died later, then [22:03:48] stupid pdns [22:03:55] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45372 [22:04:11] hey... no beating the poor pdns, it's a lovely beast really [22:05:04] ryan_lane: dont know but notpeter pointed that out to me a long time ago [22:05:10] * Ryan_Lane nods [22:10:39] hey guyyys! i have an interesting problem that I think I need some ops networky eyes on [22:10:45] !log authdns-update correcting camera fqdn [22:10:56] Logged the message, RobH [22:10:59] LeslieCarr, maybe not today, but would you be available to help me check out a problem tomorrow or Friday? 
[22:12:05] ottomata: she isn't at work today afaik [22:12:21] ahhh right jury duty [22:12:22] drop her a mail might be best [22:12:26] yep [22:12:26] ja will do [22:15:38] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [22:15:57] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [22:17:17] New patchset: RobH; "adding colby back into standard installs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45436 [22:18:51] New review: RobH; "there's no crying in self review" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45436 [22:18:51] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45436 [22:19:05] * RobH is going to misquote a movie for each self review comment [22:19:21] out of original sarcastic material. [22:19:49] New patchset: Pyoungmeister; "moveing last role class out of site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45458 [22:20:04] !log netapp: nas1001-b cf giveback [22:20:16] Logged the message, Master [22:20:23] someone have a half merged site.pp change? [22:20:32] notpeter: ? [22:20:45] seems to be all dead servers, so legit. [22:21:14] yea, its all decom'd servers [22:21:16] so merging. [22:21:22] RobH: doh, sorry [22:21:33] no worries =] [22:21:46] i just like to know whose shit im merging so if it breaks [22:21:52] i can blame someone that isnt me. [22:21:54] ;] [22:23:08] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45458 [22:27:34] ottomata: ssh: connect to host analytics1007 port 22: Connection refused [22:28:14] uh huhhhhhh [22:28:30] that is an eternal message from the gods [22:28:42] https://rt.wikimedia.org/Ticket/Display.html?id=3946 [22:28:57] !log starting NTP on mw1008 (Tick tock, on the clock But the party don't stop, no) [22:29:08] Logged the message, Master [22:29:16] ottomata: gotcha, as long as its known [22:29:22] <^demon> mutante: You should really get out more :p [22:29:24] it is known [22:29:28] but I hear RobH's voice [22:29:40] perhaps he knows even more! [22:30:05] ^demon: what?:) can't i quote Kesha?:) [22:30:10] ? 
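As an aside on the authdns issue from around 22:00 above: the rule of thumb stated there is that after an authdns-update every authoritative server should be checked, and pdns restarted on any that stop answering. A small illustrative sketch of that check using dig; the two nameserver names and the test record come from the log, the script itself is hypothetical.

```python
# Illustrative check of the "make sure all authdns servers answer after an
# authdns-update" habit described above; uses dig via subprocess.
import subprocess

NAMESERVERS = ["ns0.wikimedia.org", "ns1.wikimedia.org"]
TEST_RECORD = "www.wikipedia.org"

def server_answers(ns, name=TEST_RECORD):
    """True if the nameserver returns at least one A record for name."""
    try:
        out = subprocess.check_output(
            ["dig", "+short", "+time=3", "+tries=1", "@" + ns, name, "A"]
        )
    except subprocess.CalledProcessError:
        return False
    return bool(out.strip())

for ns in NAMESERVERS:
    print("%s: %s" % (ns, "OK" if server_answers(ns) else "no answer - restart pdns?"))
```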
[22:30:13] New patchset: RobH; "calcium not smokeping server for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45476 [22:30:37] <^demon> mutante: I'm not sure you want everyone to know you *listen* to Ke$ha :p [22:31:07] ^demon: i dont, peter does:P) [22:31:12] New review: RobH; "mama's dont let your babies grow up to self-review" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45476 [22:31:13] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45476 [22:32:53] ACKNOWLEDGEMENT - SSH on analytics1007 is CRITICAL: Connection refused daniel_zahn #3946: analytics1007 wont hdd boot after install [22:32:54] New patchset: Pyoungmeister; "fixing ganglia for eqiad mc boxes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45477 [22:33:30] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45477 [22:34:23] New review: Dzahn; "you guys just removed "colby", but its still here" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45476 [22:39:07] PROBLEM - SSH on calcium is CRITICAL: Connection refused [22:39:41] PROBLEM - SSH on calcium is CRITICAL: Connection refused [22:42:07] New patchset: RobH; "calcium is software raid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45478 [22:42:50] New review: RobH; "the stuff you self-review ends up reviewing you" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/45478 [22:43:04] New review: RobH; "the stuff you self-review ends up reviewing you" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45478 [22:43:05] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45478 [22:47:02] PROBLEM - NTP on calcium is CRITICAL: NTP CRITICAL: No response from NTP server [22:49:12] RobH, we were referring to the 'analytics1007 has never been born' ticket: [22:49:20] https://rt.wikimedia.org/Ticket/Display.html?id=3946 [22:50:58] ^demon: which parent should I use for a new "sartoris" repo? [22:51:13] <^demon> It's going to be a top-level, right? [22:51:16] PROBLEM - NTP on calcium is CRITICAL: NTP CRITICAL: No response from NTP server [22:51:18] yeah [22:51:19] <^demon> Omit it, it'll fall back to All-Projects. [22:51:32] <^demon> Or specify All-Projects, if you prefer to be explicit :) [22:51:37] heh [22:52:10] how about owner? [22:52:18] <^demon> Who do you want to have CR on it? [22:52:35] myself, ryan falkner and preilly [22:52:42] <^demon> Make a new group then. [22:53:05] ok [22:53:18] !log calcium is my testbed server, ignore its bitching [22:53:28] Logged the message, RobH [22:54:45] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [22:55:06] ^demon: will this have any useful acls set up at all? :) [22:55:31] <^demon> Yeah, All-Projects is pretty useful these days. Grant Owner and Submit to your new group and you should be fine. [22:55:34] ah. cool [22:55:44] <^demon> CR/VR/Create Refs/Tags/etc all handled in All-Projects these days. [22:55:54] <^demon> (I got tired of maintaining ACLs and cleaned those up awhile ago) [22:56:22] nice [22:57:05] heh [22:57:09] no empty commit [22:57:16] how do I push in the old history? [22:57:40] <^demon> On your local repo, add a remote to the new gerrit repo, then `git push [remote] refs/*:refs/*` [22:58:22] I was just typing that [22:58:34] ^demon: What's the new repo location? 
[22:58:42] <^demon> Ask Ryan, he made it :) [22:58:48] New review: CSteipp; "Policy review by Fabrice, Matthias, Philippe, and Oliver. Code looks fine." [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45329 [22:58:56] "sartoris" [22:59:17] Change merged: CSteipp; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45329 [22:59:17] ah. I need to give myself some permissions for this, right? [22:59:35] create reference? [22:59:40] <^demon> Shitmuffins. Grant yourself "Forge Committer" and try again (you can un-grant that when you're done). [23:00:05] (can not create new references) [23:00:05] oh [23:00:17] do I need to logout and back in, since I have a new group? [23:00:25] well, this is a local group [23:00:30] it shouldn't require tat [23:00:32] *that [23:01:00] there we go [23:01:06] I needed create reference [23:01:35] <^demon> Hmm, thought I'd granted that already. [23:02:06] RECOVERY - SSH on calcium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [23:02:11] RECOVERY - SSH on calcium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [23:02:44] <^demon> Ryan_Lane: Turns out I didn't do Create Reference at All-Projects. If you want to create other non-master branches, you'll want to re-grant that. [23:02:53] well, that change isn't what I wanted to do. heh [23:02:53] <^demon> I really really need to audit all our acls one day. [23:03:16] make 'em pretty [23:05:24] ^demon: can you make sartoris replicate to wikimedia/sartoris ? [23:05:35] <^demon> I can, yes :) [23:05:39] I can probably figure out how [23:05:53] <^demon> I'll create the repo on github. [23:06:06] <^demon> Just grant 'mediawiki-replication' read permissions on refs/*. [23:06:45] doesn't it have that by default? [23:06:56] <^demon> Nope, since we're not replicating everything to github. [23:07:03] <^demon> (I was explicitly asked to *not* replicate some things) [23:07:06] ahhh. makes sense [23:07:17] done [23:07:47] <^demon> Forcing replication. [23:08:07] New patchset: Tim Starling; "Check the return status of nextJobDB.php." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45292 [23:08:13] <^demon> https://github.com/wikimedia/sartoris - done [23:08:13] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45292 [23:08:57] thanks [23:08:59] <^demon> yw. [23:12:59] RECOVERY - NTP on calcium is OK: NTP OK: Offset -0.01506149769 secs [23:13:15] RECOVERY - NTP on calcium is OK: NTP OK: Offset -0.01460146904 secs [23:17:07] RoanKattouw_away: are you still using testing-singer-puppetization ? 
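Picking up the earlier packaging question (around 19:10) about Python setup for sartoris: a bare-bones setup.py sketch. Every value here (version, package layout, console entry point) is hypothetical and not taken from the actual repository.

```python
# Hypothetical, minimal setup.py sketch for sartoris; all metadata below is
# illustrative only and not copied from the real repository.
from setuptools import setup, find_packages

setup(
    name="sartoris",
    version="0.1.0",  # hypothetical
    description="Git-based deployment tooling (Python port of git-deploy)",
    packages=find_packages(),
    install_requires=[],  # real dependencies would go here
    entry_points={
        "console_scripts": [
            "sartoris = sartoris.cli:main",  # hypothetical module path
        ],
    },
)
```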
[23:17:49] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [23:17:49] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [23:17:49] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [23:17:49] PROBLEM - Puppet freshness on msfe1002 is CRITICAL: Puppet has not run in the last 10 hours [23:17:49] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [23:17:49] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [23:17:49] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [23:21:40] Ryan_Lane: Why do you have murder code in sartoris [23:21:44] New patchset: RobH; "adding user to calcium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45488 [23:21:57] because it's a dependency for the seeder [23:22:12] rfaulkner: you can get the code for sartoris at ssh://gerrit.wikimedia.org:29418/sartoris [23:22:16] and it's a target of the package [23:22:31] preilly: will do [23:22:33] New review: RobH; "the only way to win is not to self-review" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45488 [23:22:34] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45488 [23:22:51] ^demon|away: can you set-up replication to Github for sartoris.git [23:22:56] preilly: if there's a better way of handling it let's change that [23:23:03] preilly: already done [23:23:17] Ryan_Lane: But has the decision even been made to use Twitter's deployment system? [23:23:22] https://github.com/wikimedia/sartoris [23:23:39] that's what I was going to use for the bittorrent part [23:23:49] it's just a wrapper around bittornado [23:23:55] exit [23:23:56] <^demon|away> We should call it gitorrent :p [23:24:05] blaaaaah, glad that was an innocent command [23:24:10] Ryan_Lane: Has the decision been made to use bittorrent? [23:24:16] gotta be more careful of window focus =P [23:24:18] RobH: and not a netflix password? :) [23:24:18] RobH: ha ha [23:24:25] preilly: yes [23:24:32] heh, the maintainer of sartoris is named Faulkner? that's beyond awesome. [23:24:34] at minimum for the l10n cache [23:24:37] Ryan_Lane: who made that decision? [23:24:44] it was discussed on the list [23:24:53] chrismcmahon: that was on purpose [23:25:11] :) [23:25:23] <^demon|away> I keep reading it as santorum, and that can't be a good thing. [23:25:39] Ryan_Lane: I didn't see a decision in that thread [23:26:06] discussed on the list along with the current issues with using fetch that need to be solved before we'd be able to switch back [23:26:13] ^demon|away: nothing wrong with http://www.urbandictionary.com/define.php?term=santorum is there? [23:26:32] preilly: unless you are going to suggest an alternative, I'm going to continue on with my plans [23:27:00] Ryan_Lane: well for now I was thinking that sartoris would just be 1-to-1 of git-deploy but in Python [23:27:21] someone really made a word for that?! [23:27:24] why limit it to that? [23:27:32] I was going to put the entire system in there [23:27:46] which includes the normal git method, the git-bt method and the bt method [23:28:38] also the salt module, runner and returner [23:28:53] and the sync scripts [23:29:38] Ryan_Lane: well I'd like to have a larger discussion about any bt dependencies [23:29:55] it was on the ops list.
there were no major objections [23:29:57] it's over a week old [23:30:08] lack of objections is the same as acceptance [23:30:29] if I have to wait for someone to tell me it's ok to do something I'd never get anything done [23:30:34] preilly, Ryan_Lane: pulled the latest from ssh://gerrit.wikimedia.org:29418/sartoris [23:31:05] rfaulkner: Okay cool [23:31:14] preilly: also, the only person to give an alternative was mark, and he wasn't fully serious [23:31:46] Ryan_Lane: well for now I'd like to keep it simple and not include any bt related code [23:31:48] to clarify: all changes will go to this remote repo rather than https://github.com/wikimedia/sartoris [23:32:00] rfaulkner: yes that is correct [23:32:13] rfaulkner: they will be synced to github [23:32:34] sounds good [23:32:43] preilly: why? [23:34:09] Ryan_Lane: because that was not the original purpose of this port and I don't want other dependencies at this time [23:34:42] what was the original purpose of the port? [23:35:34] brb [23:36:24] this is the front-end of a system. Why are we only going to bother releasing it without any of the other work? [23:36:25] New patchset: Dzahn; "redirect wikimania.asia to wikimania2013 (RT-4228)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45490 [23:38:39] New patchset: Dzahn; "redirect wikimania.asia to wikimania2013 (RT-4228)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45490 [23:39:22] damn it, we should also use tabs instead of spaces in redirects.conf [23:40:52] New patchset: Dzahn; "redirect wikimania.asia to wikimania2013 (RT-4228)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45490 [23:41:28] New review: Dzahn; "RT-4228" [operations/apache-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/45490 [23:41:28] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45490 [23:41:54] Ryan_Lane: my goal was to have a drop-in replacement of git-deploy written in Python [23:42:02] to what end? [23:42:22] Ryan_Lane: and to NOT have to rely on the git-deploy codebase from bookings.com [23:42:39] yes, I've had that goal since I started using the perl code [23:42:43] Ryan_Lane: then after that point in time we can look at extending it [23:42:53] it's 1% of the system that's already written [23:43:11] Ryan_Lane: why is my stance an issue? [23:43:44] I'd prefer to have consensus. [23:44:13] what are you planning on extending it into? [23:44:40] could just use scap [23:44:49] we could, yes [23:44:51] you know I got 4 Gbps out of scap the other day [23:44:58] TimStarling: +2 [23:45:13] it would be faster still if it didn't use the netapp as a code repository [23:45:19] I think you guys misunderstood the reasoning for replacing scap [23:45:19] TimStarling: that's awesome, was the fanout set to 30? [23:45:35] scap doesn't do any reporting [23:45:48] and we have to constantly deal with host key issues [23:45:54] the salt based system has full reporting [23:45:56] yes, fanout 30, with an rsyncd server for each row [23:46:06] we can make rsync a deployment method in the salt based system [23:47:03] we have no current way of knowing if all of the systems are actually running the correct version of mediawiki [23:47:19] <^demon|away> I think we can come up with a way to verify an install without using git. [23:47:26] yes, we can. [23:47:30] <^demon|away> Really, moving core's .git/objects around is a pain in the ass.
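Tim's numbers above (fanout 30, an rsyncd server per row) describe a two-stage sync: each target host pulls from the rsync proxy in its own row rather than from a single source. A rough sketch of that idea; the row-to-proxy mapping, module name and paths are made up for illustration.

```python
# Illustrative sketch of the per-row rsync fan-out described above: each host
# syncs from the rsyncd proxy in its own row. Proxy names, module and paths
# are hypothetical.
import subprocess

ROW_PROXIES = {  # hypothetical row -> rsyncd proxy mapping
    "a": "rsync-row-a.example",
    "b": "rsync-row-b.example",
}

def sync_host(host, row, module="common", dest="/path/to/mediawiki"):
    """Rsync the deployed tree to `host` from its row-local proxy."""
    src = "rsync://%s/%s/" % (ROW_PROXIES[row], module)
    return subprocess.call(["ssh", host, "rsync", "-a", "--delete", src, dest])
```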
[23:47:39] as I said, we can continue using rsync if we want [23:48:04] the salt based deployment system reports back to say a system is finished and gives status on its state [23:48:30] Ryan_Lane: we could do that without salt [23:48:37] * Ryan_Lane rolls his eyes [23:48:58] <^demon|away> I don't think salt's a bad idea, but salt+rysnc (with Tim's improvements) sounds promising. [23:49:15] <^demon|away> I just think salt + git-deploy is kind of a non-starter at this point. [23:49:24] ^demon|away: how so [23:49:35] the only part that doesn't work is git's fetch method [23:49:47] and l10n is too slow [23:49:52] <^demon|away> git fetch works fine for millions of people worldwide. [23:50:03] <^demon|away> It just doesn't scale well for deployments, I don't think. [23:50:08] agreed [23:50:25] TimStarling: how long did it take for scap and l10n when you tested it last? [23:50:26] it's not a matter of scaling. it's a matter of error correction [23:50:27] <^demon|away> And I think the fact that we're talking about replacing git-fetch with torrents is kind of silly when we've got a proven thing like rsync that *works* [23:50:53] how many systems are we using for rsync? [23:51:00] one per rack? [23:51:02] <^demon|away> I think l10n is slightly orthogonal to git-deploy. l10n-recache has sucked for awhile now, and should be solved regardless of the deploy system. [23:51:24] scap also writes files directly into the running copy [23:51:43] preilly: maybe the whole process took 10-15 mins, but the main rsync run was only 5 minutes of that [23:52:03] I think most of the time was spent waiting for the netapp, it would be much faster if the deployment host had its code on a local FS [23:52:36] TimStarling: that seems fairly doable [23:53:25] \o/ [23:53:25] Ryan_Lane: I set up one rsync server per row, 6 altogether [23:53:30] <^demon|away> Also, we could cut the fetch time on the deployment host if we replicated the git objects for core from manganese. [23:53:37] New patchset: Dzahn; "wikimania.asia redirect, there is no *.wikimania2013.wm" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45497 [23:53:41] <^demon|away> (And deployed extensions, probably) [23:53:42] plus one on nfs1 [23:54:17] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/45497 [23:54:51] TimStarling: what is your opinion on saltstack? 
[23:56:09] I don't know, it could be useful for something [23:56:29] TimStarling: I could say that of a lot of things [23:56:32] I think it's complex, and so I think we need a clear justification for introducing it [23:56:40] dzahn is doing a graceful restart of all apaches [23:56:47] * AaronSchulz still keeping reading up on it to understand it [23:56:49] TimStarling: yeah I couldn't agree more [23:56:59] *keeps [23:57:15] it's not a drop-in replacement for dsh, it doesn't have the same privilege system [23:57:16] !log dzahn gracefulled all apaches [23:57:23] TimStarling: plus in general distributed remote execution scares me [23:57:26] Logged the message, Master [23:58:47] the salt minions all run as root, the method for reducing privileges is to write modules which do restricted things and to make those modules accessible to all users [23:59:05] PROBLEM - Puppet freshness on ms1 is CRITICAL: Puppet has not run in the last 10 hours [23:59:28] all users of a specific system [23:59:36] PROBLEM - Puppet freshness on ms2 is CRITICAL: Puppet has not run in the last 10 hours [23:59:47] controlled via sudo [23:59:50] As opposed to all users everywhere, heh.
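As a concrete illustration of the last point (minions run as root, and privileges are narrowed by exposing only specific module functions instead of arbitrary commands), here is a minimal custom Salt execution module sketch; the module name, functions and path are hypothetical. It also shows the kind of "which revision is this host actually running" report mentioned earlier.

```python
# _modules/deploy.py -- hypothetical Salt execution module sketch.
# Rather than letting users run arbitrary commands as root, only these narrow
# functions are exposed; the name and deploy path below are made up.
import subprocess

DEPLOY_DIR = "/path/to/deployed/code"  # hypothetical

def revision():
    """Report which git revision is currently checked out on this minion."""
    return subprocess.check_output(
        ["git", "--git-dir", DEPLOY_DIR + "/.git", "rev-parse", "HEAD"]
    ).strip()

def status():
    """Small status dict a deploy system could aggregate across minions."""
    return {"revision": revision(), "deploy_dir": DEPLOY_DIR}
```

Something like `salt 'mw*' deploy.revision` would then answer the "are all of the systems running the correct version" question raised above.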