[00:01:17] (CR) Dzahn: [C: 1] "mediawiki-lb.wikimedia.org is an alias for text-lb.eqiad.wikimedia.org." [dns] - https://gerrit.wikimedia.org/r/157981 (owner: BBlack) [00:01:49] (PS2) Dzahn: add cawikimedia to dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/158284 [00:02:00] andrewbogott: is that using the DB job queue? [00:02:55] AaronSchulz: I don't know enough to know how to answer that. [00:02:56] it doesn't support delayed jobs, only the redis one does [00:03:03] This is the wikitech config which we only just added. [00:03:12] if the wiki does not specifically use redis then it uses the db [00:03:22] It may have partial prod config [00:03:23] probably just inheriting the normal cluster config [00:03:25] I'm pretty sure that the old (circa 10 minutes ago) wikitech supported delayed jobs since there was a cron to run them [00:04:19] I assume they were delayed for a reason. But if y'all think it's fine as-is then I'll just remove that cron and have done. [00:05:37] I think "delayed" jobs is different than jobs that run via cron. I may be wrong though. AaronSchulz is the job queue expert [00:06:06] ok -- the command that I ran in that paste above is the same command that the cron runs. [00:06:09] I bet there is some prod config that is setting job config that we don't want on wikitech [00:06:55] y'r tekin away all or jerbs!
[00:07:07] * bd808 snorts [00:07:08] ^ what he said [00:07:16] well, not the snort part [00:07:19] (PS6) BBlack: Remove references to deprecated $project-lb.wm.o names [dns] - https://gerrit.wikimedia.org/r/157981 [00:08:16] (CR) BBlack: [C: 2] Remove references to deprecated $project-lb.wm.o names [dns] - https://gerrit.wikimedia.org/r/157981 (owner: BBlack) [00:09:43] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Epic puppet fail [00:11:38] (PS2) BBlack: Remove LVS/SSL defs for unused project-lb IPs [puppet] - https://gerrit.wikimedia.org/r/157978 [00:11:41] AaronSchulz: So what makes a job have "checkDelay" in the constructor call? Could this be from when we had 2 wikitech wikis running and one had prod jobrunner config? [00:11:56] a jobqueue has that, not a job [00:12:05] the site config decides that usually [00:12:41] maybe some came from having two wikis...I don't know too much about how that used to be set up or was merged [00:12:52] Config for MWEchoNotificationEmailBundleJob sets it in prod [00:13:01] andrewbogott: That's it I bet ^ [00:13:11] AaronSchulz: the whole config is in operations-mediawiki-config now. So no more shrugging allowed :) [00:14:16] andrewbogott: See line 2398 in CommonSettings.php. Another thing to undo in your wikitech config file [00:14:18] bd808: do we not want it to use the main job queue? [00:14:36] AaronSchulz: We can't.
Wikitech is isolated on its own vlan [00:14:37] wikitech/virt1000 is still standalone [00:15:04] but it reuses the wmf-config code...so this will be fun [00:15:26] we've dealt with most of the fun already :P [00:15:34] true dat [00:15:36] I guess it got by without it, so it doesn't need checkDelay [00:15:43] bd808: sorry, I don't follow yet [00:15:50] we can probably change wikitech to use its own redis job queue in the near future [00:15:56] it's got redis installed already for keystone IIRC [00:16:02] andrewbogott: Echo is making jobs that your runner can't run [00:16:25] Or asking for a runner that you can't support [00:16:42] * bd808 makes a patch [00:16:48] ok, so I need to revive our old echo config? [00:16:55] * andrewbogott supervises, again [00:16:59] I think we just need to variablise [00:17:00] $wgJobTypeConf['MWEchoNotificationEmailBundleJob'] = array( 'checkDelay' => true ) + $wgJobTypeConf['default']; [00:17:04] RECOVERY - puppet last run on amssq50 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [00:17:15] bd808: did we break some things in echo with our new jobs? [00:17:16] andrewbogott: you could review https://gerrit.wikimedia.org/r/#/c/158262/ :P [00:17:28] well not new, but they've recently been turned on in more places [00:17:37] ebernhardson: on wikitech [00:17:41] ori: I already loaded that patch once and then backed away afeared [00:17:42] (PS2) Dzahn: align the misc wiki section of wikimedia.org [dns] - https://gerrit.wikimedia.org/r/158275 [00:17:42] which is now (mostly) using cluster config [00:17:57] ebernhardson: Nah.
Reedy, andrewbogott and I broke wikitech [00:18:03] andrewbogott: all right, no worries then [00:18:17] (PS1) BBlack: Cleanup on DNS for LVS service IPs [dns] - https://gerrit.wikimedia.org/r/158295 [00:18:35] bd808: whew, not my problem then ;) [00:18:52] (PS1) BryanDavis: Don't set checkdelay on wikitech jobs [mediawiki-config] - https://gerrit.wikimedia.org/r/158296 [00:19:39] Reedy, andrewbogott: ^ [00:19:45] I was thinking of another way.. [00:20:06] Reedy: better? Less hacky? [00:20:14] ori: Sorry, I will review tomorrow if it's still in need of attention [00:20:14] cause I'm all for that [00:20:21] andrewbogott: np at all [00:20:34] if ( $wmgUseClusterJobqueue ) { $wgJobTypeConf['MWEchoNotificationEmailBundleJob'] = array( 'checkDelay' => true ) + $wgJobTypeConf['default']; } [00:20:47] * bd808 nods [00:20:53] I don't know if the job type conf needs assigning in all paths [00:21:04] That keeps all the mess in CommonSettings [00:21:33] RECOVERY - Puppet freshness on mw1053 is OK: puppet ran at Thu Sep 4 00:21:29 UTC 2014 [00:21:33] $wgJobTypeConf['MWEchoNotificationEmailBundleJob'] = $wgJobTypeConf['default']; if ( $wmgUseClusterJobqueue ) { $wgJobTypeConf['MWEchoNotificationEmailBundleJob'] += array( 'checkDelay' => true ); } [00:21:43] PROBLEM - puppet last run on mw1053 is CRITICAL: CRITICAL: Puppet last ran 340251 seconds ago, expected 14400 [00:21:55] I would guess wildly that you get default config unless you specify otherwise [00:22:06] $wgJobTypeConf['MWEchoNotificationEmailBundleJob']['checkDelay'] = true; [00:22:10] That's what I'm presuming [00:22:27] (PS1) BBlack: LVS/Protoproxy cleanup [puppet] - https://gerrit.wikimedia.org/r/158297 [00:22:35] I guess they just need to include $wgJobTypeConf['default'] so that it doesn't only have the one setting [00:22:47] yeah. [00:23:03] I'll abandon.
Your idea is better [00:23:20] (Abandoned) BryanDavis: Don't set checkdelay on wikitech jobs [mediawiki-config] - https://gerrit.wikimedia.org/r/158296 (owner: BryanDavis) [00:23:40] (PS2) BBlack: LVS/Protoproxy cleanup [puppet] - https://gerrit.wikimedia.org/r/158297 [00:23:41] only thing i'll mention is that disabling checkDelay means echo will no longer bundle email notifications [00:23:53] speaking of job queue, it looks like the beta queue is stuck again [00:23:54] you will get an email for each individual i think (without testing, just based on what that's used for) [00:24:25] (PS4) BBlack: Remove actual $project-lb.wm.o domainnames [dns] - https://gerrit.wikimedia.org/r/157982 [00:24:31] ebernhardson: which is what wikitech does currently [00:24:35] so no big shame [00:24:42] Reedy: oh that's fine then :) [00:25:31] (CR) BBlack: [C: -1] "This is on hold for TTL expiry + sniffer validation that nobody's looking up these names anymore. Will probably hold off through Monday j" [dns] - https://gerrit.wikimedia.org/r/157982 (owner: BBlack) [00:26:23] PROBLEM - puppet last run on mw1174 is CRITICAL: CRITICAL: Puppet has 1 failures [00:26:45] (PS1) Reedy: Don't alter MWEchoNotificationEmailBundleJob config for non redis backed queues [mediawiki-config] - https://gerrit.wikimedia.org/r/158298 [00:26:46] Hey ebernhardson, you may be just the guy to tell me how this patch broke all email notifications for OSM changes -- https://gerrit.wikimedia.org/r/#/c/144334/3/OpenStackManager.php [00:27:11] I was trying to make the emails not suck and succeeded in making them not happen at all [00:27:39] <^d> Sometimes less is more. [00:28:07] bd808: maybe, looking [00:28:45] bd808: shouldn't break anything with that :( [00:28:59] (Abandoned) Dzahn: align the misc.
services section of wikimedia.org [dns] - https://gerrit.wikimedia.org/r/158276 (owner: Dzahn) [00:29:34] (CR) Dzahn: "merging into 157275" [dns] - https://gerrit.wikimedia.org/r/158276 (owner: Dzahn) [00:29:37] bd808: how long ago was that deployed, or more accurately which logs should i be looking in for hints :) [00:29:43] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [00:30:11] (CR) Andrew Bogott: [C: 1] Don't alter MWEchoNotificationEmailBundleJob config for non redis backed queues [mediawiki-config] - https://gerrit.wikimedia.org/r/158298 (owner: Reedy) [00:30:25] wikitech logs that I think are only on virt1000. :( Deployed ~2 months ago I think? [00:31:09] (CR) BBlack: [C: 1] align the misc wiki section of wikimedia.org [dns] - https://gerrit.wikimedia.org/r/158275 (owner: Dzahn) [00:31:58] (PS3) Dzahn: align the misc section of wikimedia.org [dns] - https://gerrit.wikimedia.org/r/158275 [00:32:03] PROBLEM - Puppet freshness on silver is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:26:55 UTC [00:33:21] (PS4) Dzahn: align the misc section of wikimedia.org [dns] - https://gerrit.wikimedia.org/r/158275 [00:35:42] (CR) Dzahn: [C: 2] align the misc section of wikimedia.org [dns] - https://gerrit.wikimedia.org/r/158275 (owner: Dzahn) [00:36:03] PROBLEM - Puppet freshness on rcs1001 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:31:19 UTC [00:36:03] PROBLEM - Puppet freshness on rcs1002 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:30:54 UTC [00:37:23] bd808: my irc client crashed, did I miss anything? jobqueuewise? [00:37:57] andrewbogott: Nope. I got pulled away to look at the job queue on beta ironically [00:38:10] I thought Reedy was making a patch but maybe not [00:38:23] No, he did. Um... [00:38:24] https://gerrit.wikimedia.org/r/#/c/158298/ [00:38:25] ?
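Side note on the $wgJobTypeConf variants pasted earlier in the channel: PHP's array `+` operator is a left-biased union, so on a duplicate key the left operand's value is kept. The Python sketch below only illustrates that rule. The helper name `php_array_union` and the sample default values are hypothetical, not taken from wmf-config.

```python
def php_array_union(left: dict, right: dict) -> dict:
    """Mimic PHP's `$left + $right` for string-keyed arrays: left wins on duplicate keys."""
    result = dict(left)
    for key, value in right.items():
        # Only fill in keys the left operand does not already define.
        result.setdefault(key, value)
    return result


# Hypothetical stand-in for $wgJobTypeConf['default']; the real values live in wmf-config.
default_conf = {"class": "JobQueueDB", "order": "random"}

# array( 'checkDelay' => true ) + $wgJobTypeConf['default']
bundle_conf = php_array_union({"checkDelay": True}, default_conf)

# $conf = $wgJobTypeConf['default']; $conf += array( 'checkDelay' => true );
appended_conf = php_array_union(default_conf, {"checkDelay": True})

# If the left side already carried a checkDelay value, the `+=` form would keep it.
kept_conf = php_array_union({"checkDelay": False}, {"checkDelay": True})

if __name__ == "__main__":
    print(bundle_conf)
    print(appended_conf)
    print(kept_conf)
```

Under these assumed defaults both spellings give the same keys and values. They would differ only if the default config already defined checkDelay, in which case the `+=` form would not override it.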
[00:42:04] andrewbogott: Want me to merge it? [00:42:30] Reedy: I don't understand all the angles, but -- sure, if you think it'll let me run jobs :) [00:43:13] (CR) Reedy: [C: 2] Don't alter MWEchoNotificationEmailBundleJob config for non redis backed queues [mediawiki-config] - https://gerrit.wikimedia.org/r/158298 (owner: Reedy) [00:43:18] (Merged) jenkins-bot: Don't alter MWEchoNotificationEmailBundleJob config for non redis backed queues [mediawiki-config] - https://gerrit.wikimedia.org/r/158298 (owner: Reedy) [00:43:24] RECOVERY - puppet last run on mw1174 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [00:43:54] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 15s) [00:45:20] * Reedy glares at ori [00:45:32] oh, just 1 apache with errors [00:45:47] osmium [00:46:07] http://p.defau.lt/?8_3iDPRtJk5cxBdsyxSS7w [00:47:24] Reedy: runjobs fails in just the same way as before [00:48:47] Hmm [00:54:32] (PS1) Dzahn: add cawikimedia to wikiversion, MWMultiVersionTest [mediawiki-config] - https://gerrit.wikimedia.org/r/158303 [00:54:38] (CR) jenkins-bot: [V: -1] add cawikimedia to wikiversion, MWMultiVersionTest [mediawiki-config] - https://gerrit.wikimedia.org/r/158303 (owner: Dzahn) [00:55:35] (CR) Dzahn: "00:54:37 1) MWMultiVersionTests::testRealmFilenames with data set #75 ('cawikimedia', 'ca.wikimedia.org')" [mediawiki-config] - https://gerrit.wikimedia.org/r/158303 (owner: Dzahn) [00:56:26] (CR) Reedy: "00:54:37 1) MWMultiVersionTests::testRealmFilenames with data set #75 ('cawikimedia', 'ca.wikimedia.org')" [mediawiki-config] - https://gerrit.wikimedia.org/r/158303 (owner: Dzahn) [00:57:26] (CR) Dzahn: "actual "cawiki"?? no, where?"
[mediawiki-config] - https://gerrit.wikimedia.org/r/158303 (owner: Dzahn) [00:58:35] (CR) Reedy: "Think you'll need to add ca to https://github.com/wikimedia/operations-mediawiki-config/blob/master/multiversion/MWMultiVersion.php#L174" [mediawiki-config] - https://gerrit.wikimedia.org/r/158303 (owner: Dzahn) [00:59:13] !log testing the log [01:00:47] (CR) Dzahn: "grmbl, ok, just copied "uawiki" which is in the tests but not the actual file either" [mediawiki-config] - https://gerrit.wikimedia.org/r/158303 (owner: Dzahn) [01:01:17] (CR) Dzahn: "arr, i mean "uawikimedia"" [mediawiki-config] - https://gerrit.wikimedia.org/r/158303 (owner: Dzahn) [01:02:22] !log the SAL still works, but the bot fails to acknowledge. Something to do with a change on wikitech [01:02:48] (CR) Dzahn: "oooh, i see that array now starting line 173..." [mediawiki-config] - https://gerrit.wikimedia.org/r/158303 (owner: Dzahn) [01:04:15] bd808|AWAY, Reedy, I'm wrapping up for the night. Thanks for all your help -- this mostly works! [01:04:15] (PS2) Dzahn: add cawikimedia to wikiversion, MWMultiVersion [mediawiki-config] - https://gerrit.wikimedia.org/r/158303 [01:04:20] (CR) jenkins-bot: [V: -1] add cawikimedia to wikiversion, MWMultiVersion [mediawiki-config] - https://gerrit.wikimedia.org/r/158303 (owner: Dzahn) [01:04:29] Tomorrow we will sort out adminbot and the jobqueue :) [01:04:40] (CR) Dzahn: "nice, from 1 failure to 2 :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/158303 (owner: Dzahn) [01:04:43] what's up with adminbot?
[01:05:14] Reedy: it logs properly but doesn't like the response it gets from wikitech so panics and fails to ack [01:05:27] (CR) Reedy: [C: -1] add cawikimedia to wikiversion, MWMultiVersion (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/158303 (owner: Dzahn) [01:05:27] haha [01:06:06] Reedy: Looks like this: https://dpaste.de/bqpg [01:06:45] (PS3) Dzahn: add cawikimedia to wikiversion, MWMultiVersion [mediawiki-config] - https://gerrit.wikimedia.org/r/158303 [01:08:06] !log production wants project name? [01:08:21] mutante: ? [01:08:41] Reedy: the format for labs log is [01:08:44] adminlog.log(self.config, message, project, author) [01:08:55] like !log blabla [01:09:01] yeah [01:10:23] Reedy: extra comma, thanks :) [01:10:34] Any chance wikitech's api changed encoding from ascii to utf8? [01:19:13] PROBLEM - puppet last run on mw1128 is CRITICAL: CRITICAL: Puppet has 1 failures [01:20:18] Reedy: wgLanguageCode .. but Canada needs en AND fr :p [01:20:45] en-ca? [01:21:15] we don't have fr-ca ;) [01:21:46] sounds like it breaks if i add non-existing lang code :) [01:21:58] en-ca exists [01:22:01] oh, heh [01:22:05] 77 # Non-ISO language codes [01:22:35] Reedy: ok:) thx [01:30:00] andrewbogott_afk: that shouldn't affect parsing the JSON... [01:31:10] (PS1) Dzahn: add cawikimedia to InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/158312 [01:35:21] (PS2) Dzahn: add cawikimedia to InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/158312 [01:37:13] RECOVERY - puppet last run on mw1128 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [01:37:28] (PS1) Dzahn: retab InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/158313 [01:42:42] Needs a var dump of the returned json or similar and see what's what [01:46:46] python doesn't have a var_dump!!!
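On the var_dump remark: Python has no direct equivalent, but `repr()` plus `json.dumps(..., indent=2)` from the standard library covers most of the same ground. The sketch below is a hypothetical debugging helper, not adminbot's actual code. The name `dump_response` and the sample payload are assumptions made for illustration.

```python
import json


def dump_response(raw: bytes) -> str:
    """Return a readable dump of an API response body for debugging."""
    # repr() shows the exact bytes, including any non-ASCII escapes.
    lines = ["raw bytes: " + repr(raw)]
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        lines.append("not valid UTF-8: " + str(exc))
        return "\n".join(lines)
    try:
        parsed = json.loads(text)
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        lines.append("not valid JSON: " + str(exc))
        return "\n".join(lines)
    lines.append("parsed type: " + type(parsed).__name__)
    # Indented, key-sorted output is the closest thing to PHP's var_dump here.
    lines.append(json.dumps(parsed, indent=2, sort_keys=True))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical payload; the real wikitech response is not shown in the log.
    sample = b'{"edit": {"result": "Success"}}'
    print(dump_response(sample))
```

Logging the output of a helper like this just before the bot parses the reply would show whether the body is JSON at all, which is the open question in the discussion above.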
[01:48:01] (CR) Dzahn: ""trivial" +12338, -12338 :P" [mediawiki-config] - https://gerrit.wikimedia.org/r/158313 (owner: Dzahn) [01:49:03] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Epic puppet fail [01:49:50] (CR) Legoktm: "MediaWiki coding conventions are to use tabs..." [mediawiki-config] - https://gerrit.wikimedia.org/r/158313 (owner: Dzahn) [01:50:01] lol, waited for it [01:51:40] :P [01:54:05] (CR) Dzahn: "http://www.mediawiki.org/wiki/Manual:Coding_conventions#Tab_size says "Most MediaWiki developers find 4 spaces per tab to be best for read" [mediawiki-config] - https://gerrit.wikimedia.org/r/158313 (owner: Dzahn) [01:55:04] (CR) Dzahn: "let's change the convention, it's wiki, so i can say i tried before i abandon :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/158313 (owner: Dzahn) [01:55:54] mutante: I use different editors for python and php :P [02:05:43] (PS1) Ori.livneh: Update path references for new deployment root directory [apache-config] - https://gerrit.wikimedia.org/r/158315 [02:08:03] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [02:08:53] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Puppet has 1 failures [02:09:36] (PS1) Ori.livneh: mediawiki: /usr/local/apache/common-local => /srv/mediawiki [puppet] - https://gerrit.wikimedia.org/r/158317 [02:10:14] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3610 MB (3% inode=99%): [02:22:03] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [02:23:24] PROBLEM - puppet last run on mw1062 is CRITICAL: CRITICAL: Puppet has 1 failures [02:26:54] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [02:27:41] (PS1) Dzahn: wikidata monitoring - work-around wrapper [puppet] -
https://gerrit.wikimedia.org/r/158319 [02:30:22] (CR) Dzahn: [C: 2] icinga: Set default value for from in graphite threshold checks [puppet] - https://gerrit.wikimedia.org/r/158125 (owner: Yuvipanda) [02:32:46] (PS2) Dzahn: wikidata monitoring - work-around wrapper [puppet] - https://gerrit.wikimedia.org/r/158319 [02:33:03] PROBLEM - Puppet freshness on silver is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:26:55 UTC [02:34:21] (CR) Dzahn: [C: 2] "just a work-around for now. if that fixes the issue, we can turn that into a template and pass parameters to it" [puppet] - https://gerrit.wikimedia.org/r/158319 (owner: Dzahn) [02:37:03] PROBLEM - Puppet freshness on rcs1001 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:31:19 UTC [02:37:03] PROBLEM - Puppet freshness on rcs1002 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:30:54 UTC [02:37:44] PROBLEM - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.217 second response time [02:38:43] RECOVERY - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 67276 bytes in 0.222 second response time [02:39:10] Error: Could not find a service matching host name 'ms-be3004' and description 'very high load average' [02:39:40] i swear every time i touch icinga there is an unrelated error already that keeps it from reloading :p [02:41:24] RECOVERY - puppet last run on mw1062 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [02:41:55] !log LocalisationUpdate completed (1.24wmf15) at 2014-09-04 02:40:45+00:00 [02:43:28] (CR) Dzahn: "Failed to call refresh: Could not restart Service[icinga]:" [puppet] - https://gerrit.wikimedia.org/r/157672 (owner: Filippo Giunchedi) [02:45:20] (CR) Dzahn: "we have /etc/icinga AND /etc/nagios now?? why back to nagios?"
[puppet] - https://gerrit.wikimedia.org/r/157672 (owner: Filippo Giunchedi) [02:47:03] (CR) Dzahn: "lol, icinga/nagios is chaos, "(config file '/etc/nagios/puppet_servicegroups.cfg'," in error messages but: nagios/puppet_services.cfg: ERR" [puppet] - https://gerrit.wikimedia.org/r/157672 (owner: Filippo Giunchedi) [02:47:35] (CR) Dzahn: "servicegroups is in one place, services in another, and unrelated to this check" [puppet] - https://gerrit.wikimedia.org/r/157672 (owner: Filippo Giunchedi) [02:49:53] PROBLEM - puppet last run on amssq56 is CRITICAL: CRITICAL: Epic puppet fail [02:59:22] (CR) Dzahn: "oh damnit, it's actually the comma in the description text :p" [puppet] - https://gerrit.wikimedia.org/r/157672 (owner: Filippo Giunchedi) [03:00:13] RECOVERY - Disk space on virt0 is OK: DISK OK [03:01:59] (PS1) Dzahn: remove breaking comma from swift monitor [puppet] - https://gerrit.wikimedia.org/r/158321 [03:02:32] (PS2) Dzahn: remove breaking comma from swift monitor [puppet] - https://gerrit.wikimedia.org/r/158321 [03:02:56] (CR) Dzahn: [C: 2] remove breaking comma from swift monitor [puppet] - https://gerrit.wikimedia.org/r/158321 (owner: Dzahn) [03:03:34] (CR) Dzahn: [V: 2] remove breaking comma from swift monitor [puppet] - https://gerrit.wikimedia.org/r/158321 (owner: Dzahn) [03:03:52] (PS3) Dzahn: remove breaking comma from swift monitor [puppet] - https://gerrit.wikimedia.org/r/158321 [03:04:28] (CR) Dzahn: [V: 2] remove breaking comma from swift monitor [puppet] - https://gerrit.wikimedia.org/r/158321 (owner: Dzahn) [03:08:53] RECOVERY - puppet last run on amssq56 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [03:13:02] !log LocalisationUpdate completed (1.24wmf18) at 2014-09-04 03:11:58+00:00 [03:27:09] RECOVERY - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1243 bytes in 0.699 second
response time [03:27:25] jzerebecki: ^ :) [03:28:50] (CR) Dzahn: "fixed in Change-Id: Ic89dcee033a313 after like 5 puppet runs or so" [puppet] - https://gerrit.wikimedia.org/r/157672 (owner: Filippo Giunchedi) [03:29:01] (CR) Dzahn: "@neon:/etc/icinga# grep -r "very high" *" [puppet] - https://gerrit.wikimedia.org/r/158321 (owner: Dzahn) [03:29:30] (CR) Dzahn: "20:29 <+icinga-wm> RECOVERY - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 12" [puppet] - https://gerrit.wikimedia.org/r/158319 (owner: Dzahn) [03:43:37] !log LocalisationUpdate completed (1.24wmf19) at 2014-09-04 03:42:34+00:00 [04:22:59] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [04:32:34] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Sep 4 04:31:28 UTC 2014 (duration 31m 27s) [04:33:59] PROBLEM - Puppet freshness on silver is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:26:55 UTC [04:37:59] PROBLEM - Puppet freshness on rcs1001 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:31:19 UTC [04:37:59] PROBLEM - Puppet freshness on rcs1002 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:30:54 UTC [04:46:02] (PS1) Springle: repool db1035 [mediawiki-config] - https://gerrit.wikimedia.org/r/158328 [04:46:34] (CR) Springle: [C: 2] repool db1035 [mediawiki-config] - https://gerrit.wikimedia.org/r/158328 (owner: Springle) [04:46:38] (Merged) jenkins-bot: repool db1035 [mediawiki-config] - https://gerrit.wikimedia.org/r/158328 (owner: Springle) [04:47:34] !log springle Synchronized wmf-config/db-eqiad.php: repool db1035, warm up (duration: 00m 08s) [04:56:09] <_joe_> rcs? [04:56:21] <_joe_> who touched rcstream? :) [04:57:04] <_joe_> eheh, it was _me_ [04:57:06] <_joe_> :P [05:04:03] springle: hey, how are the *_content schema changes going?
[05:04:48] I was wondering that if it's finished on some smaller wikis, we could turn on the setting there and just wait for larger ones [05:05:30] andrewbogott_afk, ping [05:05:49] PROBLEM - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.196 second response time [05:06:49] RECOVERY - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 67276 bytes in 0.200 second response time [05:07:01] or Reedy. or whoever wants to investigate [05:07:24] Second appservers lvs page...somebody doing something or flaky business? [05:13:17] legoktm: the changes go out slave-by-slave, not wiki-by-wiki. would it help to do all of S3 first? [05:13:36] ah [05:14:17] springle: yeah, doing s3 first would be nice [05:14:36] (if it wouldn't be too much trouble) [05:15:05] revision fields are going out now. had to upgrade some slaves to mariadb 10 to get proper native online DDL. as it happens, 2 of 4 S3 slaves are done, so it could be hurried [05:15:30] just a script change. np [05:16:19] how is that different than OSC? [05:16:26] besides that it's built in [05:17:25] the native online DDL doesn't need to set triggers or create a copy of the table, and seems to run in about 1/3rd the time [05:18:49] PROBLEM - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.200 second response time [05:19:28] that still means ~8h for enwiki revision per slave [05:19:49] RECOVERY - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 67276 bytes in 0.217 second response time [05:23:59] sync-file on tin returned mkdir "/srv/mediawiki/wmf-config" failed: Permission denied (13), yet wmf-config/db-eqiad.php was synced correctly. anyone know what might be wrong? [05:25:41] springle: On which host did that fail? 
That path looks rather non-standard [05:25:52] !log temp hack fixed deployed for morebots (here and labs, not the other instances) [05:25:58] Logged the message, Master [05:26:24] hoo: mw1161 [05:27:23] hoo: at least, i think so... http://paste.debian.net/119260/ [05:27:46] maybe that meant mw1161 to osmium [05:28:58] I think mw1161 might be a scap proxy [05:29:15] it is [05:29:23] so that's probably not the failing host [05:30:12] osmium is the things :S [05:30:14] * thing [05:30:47] why the ... is it trying to rsync into that dir and not /a/common/ :S [05:34:03] _joe_: ^ happen to know? [05:34:15] springle: Ok, looked at it [05:34:34] seems to be a hhvm testing host not running our standard MW installation [05:34:42] or at least not in the way it's supposed to be [05:35:11] this shouldn't be in the dsh probably... or it should be fixed to be a standard deploy [05:38:51] Or maybe scap should be using /usr/local/apache/common-local which is probably the canonical path [05:38:53] but whatever [05:39:48] springle: Osmium currently can serve traffic (via hhvm), but it won't get code updates... that's bad and needs to be fixed [05:39:56] I'm not bold enough to mess with its /srv/ though [05:39:59] good night [05:40:00] does it look like we can sync-common osmium for now? [05:40:09] ok, thanks hoo [05:40:24] good question, let me try [05:41:34] nope, also not possible as that also tries to go to /srv/mediawiki (which kind of makes sense) [05:42:28] * springle emails ops@ [05:43:03] add my comments about the path, please [05:43:10] don't think I'll come to replying today [05:43:14] (too tired) [05:43:15] ok [06:05:03] <_joe_> hey I'm back [06:05:35] <_joe_> springle: what has happened?
[06:06:00] <_joe_> osmium is a playground of ori's [06:10:19] <_joe_> and , I'm going to have breakfast now, see you in ~ 1 hour [06:14:38] _joe_: np, emailed ops@ [06:23:49] PROBLEM - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.194 second response time [06:23:59] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [06:25:50] RECOVERY - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 67276 bytes in 0.214 second response time [06:28:30] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:30] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:30] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Epic puppet fail [06:28:39] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:39] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:49] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:49] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:49] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:49] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:59] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:00] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:09] PROBLEM - Disk space on elastic1004 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%): [06:29:09] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:09] PROBLEM - puppet last run on db1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:19] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures 
[06:29:49] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:59] PROBLEM - Puppet freshness on silver is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:26:55 UTC [06:36:30] PROBLEM - puppet last run on ssl1001 is CRITICAL: CRITICAL: Puppet has 2 failures [06:37:09] RECOVERY - Disk space on elastic1004 is OK: DISK OK [06:37:30] !log clear slowlog on elastic1004 [06:37:35] Logged the message, Master [06:38:59] PROBLEM - Puppet freshness on rcs1002 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:30:54 UTC [06:38:59] PROBLEM - Puppet freshness on rcs1001 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:31:19 UTC [06:39:20] (CR) Legoktm: [C: 1] "Yes please" [puppet] - https://gerrit.wikimedia.org/r/157013 (owner: Reedy) [06:43:09] RECOVERY - Disk space on ms1004 is OK: DISK OK [06:45:39] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:45:49] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [06:45:49] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:45:49] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:45:50] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:45:50] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:45:59] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:45:59] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:46:09] PROBLEM - Disk space on ms1004 is CRITICAL: DISK CRITICAL - free space: / 13 MB (0%
inode=94%): /var/lib/ureadahead/debugfs 13 MB (0% inode=94%): [06:46:09] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:46:09] RECOVERY - puppet last run on db1002 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:46:29] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:46:30] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:46:49] RECOVERY - puppet last run on searchidx1001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:47:19] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:47:19] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Puppet has 1 failures [06:47:39] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:54:29] RECOVERY - puppet last run on ssl1001 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [07:05:19] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [07:12:04] <_joe_> I don't get why we got those errors from the hhvm pool at all [07:16:49] I need some urgent help [07:17:09] PROBLEM - puppet last run on cp3011 is CRITICAL: CRITICAL: Epic puppet fail [07:18:07] a troll moved a user talk page to a bad name and i deleted it in order to restore [07:18:30] but i'm unable to restore due to the large number of edits the page has [07:20:03] <_joe_> matanya: uhm I have no idea how to do that sorry :( [07:20:17] this is a springle thing, i guess [07:20:26] <_joe_> oh directly in the db? [07:20:28] <_joe_> naughty [07:20:39] no other option i can think of [07:20:54] my browser can't even open the page [07:21:52] bug report. 
not doing anything ad-hoc :) [07:22:27] poor editor [07:30:07] <_joe_> of course bug-report :) [07:32:03] _joe_, are you familiar with bigdelete? [07:32:16] <_joe_> jeremyb: no sorry [07:32:45] <_joe_> or better, I know what it is [07:32:51] <_joe_> in general [07:33:04] internal_api_error_DBQueryError [07:33:05] ok :) [07:33:17] trolls, i hate you [07:33:21] sounds like maybe matanya was hovering around that threshold [07:33:22] hah [07:33:37] i can't restore her talk page! arrg [07:35:57] springle: https://bugzilla.wikimedia.org/show_bug.cgi?id=70387 [07:36:09] RECOVERY - puppet last run on cp3011 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [07:36:23] (PS1) Giuseppe Lavagetto: rcstream: remove duplicate package declaration [puppet] - https://gerrit.wikimedia.org/r/158335 [07:37:12] seems like i solved it [07:37:50] PROBLEM - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.194 second response time [07:37:53] you know that's going to show up in google? [07:38:10] for ${correctusername} [07:38:39] pages for hhvm-appservers ... [07:38:42] <_joe_> mmmmh [07:38:43] <_joe_> yes [07:38:44] _joe_: that is you :P [07:38:48] <_joe_> but without a reason [07:39:15] <_joe_> I'm seeing enwiki with hhvm without problems [07:39:50] RECOVERY - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 67276 bytes in 0.204 second response time [07:40:33] <_joe_> akosiaris: I really don't know what is this check btw [07:40:51] <_joe_> I did not set this up [07:41:49] aha [07:42:03] <_joe_> "hphp_invoke" [07:42:18] <_joe_> oh that's the status code [07:45:11] <_joe_> ok we do have one of the servers (testwiki) which is in a bad state apparently [07:47:02] WFM. how do i force hhvm for myself?
[07:47:04] <_joe_> or at least, pybal thinks so [07:48:02] <_joe_> 2014-09-04 07:47:19.459320 [hhvm_appservers_80 ProxyFetch] mw1017.eqiad.wmnet (enabled/partially up/pooled): Fetch failed, [07:48:05] <_joe_> meh [07:50:45] jeremyb: https://meta.wikimedia.org/wiki/User:Ori.livneh/global.js [07:50:51] <_joe_> ok, it's mw1017 [07:51:49] <_joe_> the bad thing is, we have nothing telling us it's in a bad state, apart from an lvs check [07:51:51] but I think testwiki is hhvm for everyone? [07:52:02] <_joe_> it is [07:52:11] <_joe_> legoktm: is testwiki down atm? [07:52:13] <_joe_> it should be [07:52:22] errr, it's working for me :P [07:53:02] <_joe_> lol [07:53:02] was up for me this whole time too [07:53:17] <_joe_> everything is working but en.wikipedia.org/wiki/Main_page [07:53:25] <_joe_> EWWW this is *horrible* [07:53:30] oh [07:53:33] that works for me using hhvm too... [07:53:47] <_joe_> legoktm: yes now I know what the problem is [07:53:53] :D [07:53:55] <_joe_> and it's *worse* than I expected [07:54:03] :/ [07:54:06] <_joe_> it's a frigging nightmare honestly [07:54:21] <_joe_> time for a cigarette :/ [07:54:37] <_joe_> akosiaris: you can safely ignore the pages for now [07:54:52] <_joe_> I'd like not to take mw1017 out of rotation though [07:54:57] <_joe_> I want to investigate this [07:56:01] ok [08:01:41] <_joe_> ok so this is funny. on testwiki, the main page and barack obama are failing. but https://en.wikipedia.org/wiki/Special:Export/Pulsar is not [08:01:55] <_joe_> any idea what different code patterns they may encounter?
[08:05:43] <_joe_> the error is spawned here https://github.com/facebook/hhvm/blob/master/hphp/runtime/server/http-request-handler.cpp#L441 [08:07:02] <_joe_> and the logs show Sep 4 08:06:23 mw1017 hhvm: message repeated 55 times: [ #012Fatal error: Argument 1 passed to Wikibase\RepoLinker::getEntityUrl() must be an instance of Wikibase\EntityId, Wikibase\DataModel\Entity\ItemId given in /usr/local/apache/common-local/php-1.24wmf18/extensions/Wikidata/extensions/Wikibase/client/includes/RepoLinker.php on line 173] [08:07:07] _joe_, front page works for me [08:07:25] <_joe_> jeremyb: yes it's just the testwiki host behaving crazily [08:10:40] _joe_, i don't follow. but anyway, works for me. (testwiki too) [08:11:28] legoktm: nice, that page finally told me the unit for https://www.mediawiki.org/wiki/Manual:WgBackendResponseTime [08:11:30] i wonder if testwiki uses parser cache? saw a 7+ sec load of barack obama according to ori's script [08:16:25] <_joe_> !log running sync-common on mw1017, trying to debug the hhvm bad state [08:16:31] Logged the message, Master [08:24:59] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [08:35:59] PROBLEM - Puppet freshness on silver is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:26:55 UTC [08:52:38] (PS2) Giuseppe Lavagetto: rcstream: remove duplicate package declaration [puppet] - https://gerrit.wikimedia.org/r/158335 [08:52:51] (CR) Giuseppe Lavagetto: [C: 2] rcstream: remove duplicate package declaration [puppet] - https://gerrit.wikimedia.org/r/158335 (owner: Giuseppe Lavagetto) [08:59:32] (PS1) Spage: Use Flow on frwiki and hewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/158341 [08:59:34] (PS1) Spage: Enable Flow on several pages [mediawiki-config] - https://gerrit.wikimedia.org/r/158342 [09:11:09] RECOVERY - Puppet freshness on rcs1002 is OK: puppet ran at Thu Sep 4 09:11:00 UTC 2014 [09:11:09] RECOVERY -
Puppet freshness on rcs1001 is OK: puppet ran at Thu Sep 4 09:11:00 UTC 2014 [09:11:49] RECOVERY - puppet last run on rcs1001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [09:12:09] RECOVERY - puppet last run on rcs1002 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [09:41:16] (PS1) JanZerebecki: Puppetize icinga log file permission fix. [puppet] - https://gerrit.wikimedia.org/r/158345 [10:01:05] (PS2) JanZerebecki: Puppetize icinga tmpfs mount. [puppet] - https://gerrit.wikimedia.org/r/158343 [10:25:59] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [10:36:59] PROBLEM - Puppet freshness on silver is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:26:55 UTC [10:45:41] hey wikitech is down I guess [10:45:46] is that known issue? [10:49:21] works for me (?) [10:52:07] <_joe_> petan: wfm as well [10:53:04] (PS6) Giuseppe Lavagetto: beta: use HHVM for all requests [puppet] - https://gerrit.wikimedia.org/r/157823 [10:54:38] yup it's back [11:04:09] PROBLEM - puppet last run on amssq54 is CRITICAL: CRITICAL: Epic puppet fail [11:07:31] (PS1) Giuseppe Lavagetto: trebuchet::packages: use ensure_packages from stdlib [puppet] - https://gerrit.wikimedia.org/r/158353 [11:24:10] RECOVERY - puppet last run on amssq54 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [11:47:19] (PS1) Yuvipanda: labmon: Add http:// prefix for graphite URL [puppet] - https://gerrit.wikimedia.org/r/158356 [12:04:33] (PS1) Yuvipanda: quarry: Remove unused directive from nginx conf [puppet] - https://gerrit.wikimedia.org/r/158360 [12:06:16] akosiaris: can I interest you in merging a couple of trivial patches?
[12:09:30] <_joe_> YuviPanda: I will do that for 10 euros less than what he asks for [12:09:31] (PS2) Giuseppe Lavagetto: trebuchet::packages: use ensure_packages from stdlib [puppet] - https://gerrit.wikimedia.org/r/158353 [12:10:01] YuviPanda: i would do it for free if i have root [12:10:43] <_joe_> YuviPanda: I'm here, which trivial patches? [12:11:05] _joe_: https://gerrit.wikimedia.org/r/#/c/158356/ and https://gerrit.wikimedia.org/r/#/c/158360/ [12:11:29] _joe_: didn't want to bug you since you seemed to have your hands full with HHVM, but yay you have time :) [12:12:05] <_joe_> well, I'm technically off after lunch, but a puppet problem was bugging me, so... [12:12:29] <_joe_> I'm here looking at puppet already [12:12:32] heh [12:12:48] (CR) Giuseppe Lavagetto: [C: 2] labmon: Add http:// prefix for graphite URL [puppet] - https://gerrit.wikimedia.org/r/158356 (owner: Yuvipanda) [12:14:40] (CR) Giuseppe Lavagetto: quarry: Remove unused directive from nginx conf (2 comments) [puppet] - https://gerrit.wikimedia.org/r/158360 (owner: Yuvipanda) [12:15:24]