[00:01:17] (CR) Dzahn: [C: 1] "mediawiki-lb.wikimedia.org is an alias for text-lb.eqiad.wikimedia.org." [dns] - https://gerrit.wikimedia.org/r/157981 (owner: BBlack)
[00:01:49] (PS2) Dzahn: add cawikimedia to dblists [mediawiki-config] - https://gerrit.wikimedia.org/r/158284
[00:02:00] andrewbogott: is that using the DB job queue?
[00:02:55] AaronSchulz: I don't know enough to know how to answer that.
[00:02:56] it doesn't support delayed jobs, only the redis one does
[00:03:03] This is the wikitech config which we only just added.
[00:03:12] if the wiki does not specifically use redis then it uses the db
[00:03:22] It may have partial prod config
[00:03:23] probably just inheriting the normal cluster config
[00:03:25] I'm pretty sure that the old (circa 10 minutes ago) wikitech supported delayed jobs since there was a cron to run them
[00:04:19] I assume they were delayed for a reason. But if y'all think it's fine as-is then I'll just remove that cron and have done.
[00:05:37] I think "delayed" jobs is different than jobs that run via cron. I may be wrong though. AaronSchulz is the job queue expert
[00:06:06] ok -- the command that I ran in that paste above is the same command that the cron runs.
[00:06:09] I bet there is some prod config that is setting job config that we don't want on wikitech
[00:06:55] y'r tekin away all or jerbs!
[00:07:07] * bd808 snorts
[00:07:08] ^ what he said
[00:07:16] well, not the snort part
[00:07:19] (PS6) BBlack: Remove references to deprecated $project-lb.wm.o names [dns] - https://gerrit.wikimedia.org/r/157981
[00:08:16] (CR) BBlack: [C: 2] Remove references to deprecated $project-lb.wm.o names [dns] - https://gerrit.wikimedia.org/r/157981 (owner: BBlack)
[00:09:43] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Epic puppet fail
[00:11:38] (PS2) BBlack: Remove LVS/SSL defs for unused project-lb IPs [puppet] - https://gerrit.wikimedia.org/r/157978
[00:11:41] AaronSchulz: So what makes a job have "checkDelay" in the constructor call? Could this be from when we had 2 wikitech wikis running and one had prod jobrunner config?
[00:11:56] a jobqueue has that, not a job
[00:12:05] the site config decides that usually
[00:12:41] maybe some came from having two wikis...I don't know too much about how that used to be set up or was merged
[00:12:52] Config for MWEchoNotificationEmailBundleJob sets it in prod
[00:13:01] andrewbogott: That's it I bet ^
[00:13:11] AaronSchulz: the whole config is in operations-mediawiki-config now. So no more shrugging allowed :)
[00:14:16] andrewbogott: See line 2398 in CommonSettings.php. Another thing to undo in your wikitech config file
[00:14:18] bd808: do we not want it to use the main job queue?
[00:14:36] AaronSchulz: We can't.
Wikitech is isolated on its own vlan
[00:14:37] wikitech/virt1000 is still standalone
[00:15:04] but it reuses the wmf-config code...so this will be fun
[00:15:26] we've dealt with most of the fun already :P
[00:15:34] true dat
[00:15:36] I guess it got by without it, so it doesn't need checkDelay
[00:15:43] bd808: sorry, I don't follow yet
[00:15:50] we can probably change wikitech to use its own redis job queue in the near future
[00:15:56] it's got redis installed already for keystone IIRC
[00:16:02] andrewbogott: Echo is making jobs that your runner can't run
[00:16:25] Or asking for a runner that you can't support
[00:16:42] * bd808 makes a patch
[00:16:48] ok, so I need to revive our old echo config?
[00:16:55] * andrewbogott supervises, again
[00:16:59] I think we just need to variablise
[00:17:00] $wgJobTypeConf['MWEchoNotificationEmailBundleJob'] = array( 'checkDelay' => true ) + $wgJobTypeConf['default'];
[00:17:04] RECOVERY - puppet last run on amssq50 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[00:17:15] bd808: did we break some things in echo with our new jobs?
[00:17:16] andrewbogott: you could review https://gerrit.wikimedia.org/r/#/c/158262/ :P
[00:17:28] well not new, but they've recently been turned on in more places
[00:17:37] ebernhardson: on wikitech
[00:17:41] ori: I already loaded that patch once and then backed away afeared
[00:17:42] (PS2) Dzahn: align the misc wiki section of wikimedia.org [dns] - https://gerrit.wikimedia.org/r/158275
[00:17:42] which is now (mostly) using cluster config
[00:17:57] ebernhardson: Nah.
Reedy, andrewbogott and I broke wikitech
[00:18:03] andrewbogott: all right, no worries then
[00:18:17] (PS1) BBlack: Cleanup on DNS for LVS service IPs [dns] - https://gerrit.wikimedia.org/r/158295
[00:18:35] bd808: whew, not my problem then ;)
[00:18:52] (PS1) BryanDavis: Don't set checkdelay on wikitech jobs [mediawiki-config] - https://gerrit.wikimedia.org/r/158296
[00:19:39] Reedy, andrewbogott: ^
[00:19:45] I was thinking of another way..
[00:20:06] Reedy: better? Less hacky?
[00:20:14] ori: Sorry, I will review tomorrow if it's still in need of attention
[00:20:14] cause I'm all for that
[00:20:21] andrewbogott: np at all
[00:20:34] if ( $wmgUseClusterJobqueue ) { $wgJobTypeConf['MWEchoNotificationEmailBundleJob'] = array( 'checkDelay' => true ) + $wgJobTypeConf['default']; }
[00:20:47] * bd808 nods
[00:20:53] I don't know if the job type conf needs assigning in all paths
[00:21:04] That keeps all the mess in CommonSettings
[00:21:33] RECOVERY - Puppet freshness on mw1053 is OK: puppet ran at Thu Sep 4 00:21:29 UTC 2014
[00:21:33] $wgJobTypeConf['MWEchoNotificationEmailBundleJob'] = $wgJobTypeConf['default']; if ( $wmgUseClusterJobqueue ) { $wgJobTypeConf['MWEchoNotificationEmailBundleJob'] += array( 'checkDelay' => true ); }
[00:21:43] PROBLEM - puppet last run on mw1053 is CRITICAL: CRITICAL: Puppet last ran 340251 seconds ago, expected 14400
[00:21:55] I would guess wildly that you get default config unless you specify otherwise
[00:22:06] $wgJobTypeConf['MWEchoNotificationEmailBundleJob']['checkDelay'] = true;
[00:22:10] That's what I'm presuming
[00:22:27] (PS1) BBlack: LVS/Protoproxy cleanup [puppet] - https://gerrit.wikimedia.org/r/158297
[00:22:35] I guess they just need to include $wgJobTypeConf['default'] so that it doesn't only have the one setting
[00:22:47] yeah.
[00:23:03] I'll abandon.
Your idea is better
[00:23:20] (Abandoned) BryanDavis: Don't set checkdelay on wikitech jobs [mediawiki-config] - https://gerrit.wikimedia.org/r/158296 (owner: BryanDavis)
[00:23:40] (PS2) BBlack: LVS/Protoproxy cleanup [puppet] - https://gerrit.wikimedia.org/r/158297
[00:23:41] only thing i'll mention is that disabling checkDelay means echo will no longer bundle email notifications
[00:23:53] speaking of job queue, it looks like the beta queue is stuck again
[00:23:54] you will get an email for each individual i think(without testing, just based on what thats used for)
[00:24:25] (PS4) BBlack: Remove actual $project-lb.wm.o domainnames [dns] - https://gerrit.wikimedia.org/r/157982
[00:24:31] ebernhardson: which is what wikitech does currently
[00:24:35] so no big shame
[00:24:42] Reedy: oh thats fine then :)
[00:25:31] (CR) BBlack: [C: -1] "This is on hold for TTL expiry + sniffer validation that nobody's looking up these names anymore. Will probably hold off through Monday j" [dns] - https://gerrit.wikimedia.org/r/157982 (owner: BBlack)
[00:26:23] PROBLEM - puppet last run on mw1174 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:26:45] (PS1) Reedy: Don't alter MWEchoNotificationEmailBundleJob config for non redis backed queues [mediawiki-config] - https://gerrit.wikimedia.org/r/158298
[00:26:46] Hey ebernhardson, you may be just the guy to tell me how this patch broke all email notifications for OSM changes -- https://gerrit.wikimedia.org/r/#/c/144334/3/OpenStackManager.php
[00:27:11] I was trying to make the emails not suck and succeeded in making them not happen at all
[00:27:39] <^d> Sometimes less is more.
[00:28:07] bd808: maybe, looking
[00:28:45] bd808: shouldn't break anything with that :(
[00:28:59] (Abandoned) Dzahn: align the misc.
services section of wikimedia.org [dns] - https://gerrit.wikimedia.org/r/158276 (owner: Dzahn)
[00:29:34] (CR) Dzahn: "merging into 157275" [dns] - https://gerrit.wikimedia.org/r/158276 (owner: Dzahn)
[00:29:37] bd808: how long ago was that deployed, or more accurately which logs should i be looking in for hints :)
[00:29:43] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[00:30:11] (CR) Andrew Bogott: [C: 1] Don't alter MWEchoNotificationEmailBundleJob config for non redis backed queues [mediawiki-config] - https://gerrit.wikimedia.org/r/158298 (owner: Reedy)
[00:30:25] wikitech logs that I think are only on virt1000. :( Deployed ~2 months ago I think?
[00:31:09] (CR) BBlack: [C: 1] align the misc wiki section of wikimedia.org [dns] - https://gerrit.wikimedia.org/r/158275 (owner: Dzahn)
[00:31:58] (PS3) Dzahn: align the misc section of wikimedia.org [dns] - https://gerrit.wikimedia.org/r/158275
[00:32:03] PROBLEM - Puppet freshness on silver is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:26:55 UTC
[00:33:21] (PS4) Dzahn: align the misc section of wikimedia.org [dns] - https://gerrit.wikimedia.org/r/158275
[00:35:42] (CR) Dzahn: [C: 2] align the misc section of wikimedia.org [dns] - https://gerrit.wikimedia.org/r/158275 (owner: Dzahn)
[00:36:03] PROBLEM - Puppet freshness on rcs1001 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:31:19 UTC
[00:36:03] PROBLEM - Puppet freshness on rcs1002 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:30:54 UTC
[00:37:23] bd808: my irc client crashed, did I miss anything? jobqueuewise?
[00:37:57] andrewbogott: Nope. I got pulled away to look at the job queue on beta ironically
[00:38:10] I thought Reedy was making a patch but maybe not
[00:38:23] No, he did. Um...
[00:38:24] https://gerrit.wikimedia.org/r/#/c/158298/
[00:38:25] ?
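(Editor's note: the config idea discussed above, i.e. start from the default job type settings and only ask for checkDelay when the cluster job queue is in use, can be modelled in a few lines. This is a rough Python sketch of the logic only, not the PHP that was merged; the function name job_type_conf and the "backend" key are made up for illustration.)

```python
def job_type_conf(default_conf, use_cluster_jobqueue):
    """Return settings for the Echo email bundle job type.

    default_conf stands in for $wgJobTypeConf['default'];
    use_cluster_jobqueue stands in for $wmgUseClusterJobqueue.
    """
    # Copy so the shared default settings are never modified.
    conf = dict(default_conf)
    if use_cluster_jobqueue:
        # Per the discussion, only the redis-backed queue supports
        # delayed jobs, so checkDelay is requested only in that case.
        conf["checkDelay"] = True
    return conf


# Hypothetical default settings, purely illustrative.
default = {"backend": "placeholder"}
print(job_type_conf(default, use_cluster_jobqueue=False))
print(job_type_conf(default, use_cluster_jobqueue=True))
```

(With the flag off the job type simply inherits the default settings, which matches the stated intent that a DB-backed wiki should not be asked for delayed jobs.)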
[00:42:04] andrewbogott: Want me to merge it?
[00:42:30] Reedy: I don't understand all the angles, but -- sure, if you think it'll let me run jobs :)
[00:43:13] (CR) Reedy: [C: 2] Don't alter MWEchoNotificationEmailBundleJob config for non redis backed queues [mediawiki-config] - https://gerrit.wikimedia.org/r/158298 (owner: Reedy)
[00:43:18] (Merged) jenkins-bot: Don't alter MWEchoNotificationEmailBundleJob config for non redis backed queues [mediawiki-config] - https://gerrit.wikimedia.org/r/158298 (owner: Reedy)
[00:43:24] RECOVERY - puppet last run on mw1174 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[00:43:54] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 15s)
[00:45:20] * Reedy glares at ori
[00:45:32] oh, just 1 apache with errors
[00:45:47] osmium
[00:46:07] http://p.defau.lt/?8_3iDPRtJk5cxBdsyxSS7w
[00:47:24] Reedy: runjobs fails in just the same way as before
[00:48:47] Hmm
[00:54:32] (PS1) Dzahn: add cawikimedia to wikiversion, MWMultiVersionTest [mediawiki-config] - https://gerrit.wikimedia.org/r/158303
[00:54:38] (CR) jenkins-bot: [V: -1] add cawikimedia to wikiversion, MWMultiVersionTest [mediawiki-config] - https://gerrit.wikimedia.org/r/158303 (owner: Dzahn)
[00:55:35] (CR) Dzahn: "00:54:37 1) MWMultiVersionTests::testRealmFilenames with data set #75 ('cawikimedia', 'ca.wikimedia.org')" [mediawiki-config] - https://gerrit.wikimedia.org/r/158303 (owner: Dzahn)
[00:56:26] (CR) Reedy: "00:54:37 1) MWMultiVersionTests::testRealmFilenames with data set #75 ('cawikimedia', 'ca.wikimedia.org')" [mediawiki-config] - https://gerrit.wikimedia.org/r/158303 (owner: Dzahn)
[00:57:26] (CR) Dzahn: "actual "cawiki"?? no, where?"
[mediawiki-config] - https://gerrit.wikimedia.org/r/158303 (owner: Dzahn)
[00:58:35] (CR) Reedy: "Think you'll need to add ca to https://github.com/wikimedia/operations-mediawiki-config/blob/master/multiversion/MWMultiVersion.php#L174" [mediawiki-config] - https://gerrit.wikimedia.org/r/158303 (owner: Dzahn)
[00:59:13] !log testing the log
[01:00:47] (CR) Dzahn: "grmbl, ok, just copied "uawiki" which is in the tests but not the actual file either" [mediawiki-config] - https://gerrit.wikimedia.org/r/158303 (owner: Dzahn)
[01:01:17] (CR) Dzahn: "arr, i mean "uawikimedia"" [mediawiki-config] - https://gerrit.wikimedia.org/r/158303 (owner: Dzahn)
[01:02:22] !log the SAL still works, but the bot fails to acknowledge. Something to do with a change on wikitech
[01:02:48] (CR) Dzahn: "oooh, i see that array now starting line 173..." [mediawiki-config] - https://gerrit.wikimedia.org/r/158303 (owner: Dzahn)
[01:04:15] bd808|AWAY, Reedy, I'm wrapping up for the night. Thanks for all your help -- this mostly works!
[01:04:15] (PS2) Dzahn: add cawikimedia to wikiversion, MWMultiVersion [mediawiki-config] - https://gerrit.wikimedia.org/r/158303
[01:04:20] (CR) jenkins-bot: [V: -1] add cawikimedia to wikiversion, MWMultiVersion [mediawiki-config] - https://gerrit.wikimedia.org/r/158303 (owner: Dzahn)
[01:04:29] Tomorrow we will sort out adminbot and the jobqueue :)
[01:04:40] (CR) Dzahn: "nice, from 1 failure to 2 :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/158303 (owner: Dzahn)
[01:04:43] what's up with adminbot?
[01:05:14] Reedy: it logs properly but doesn't like the response it gets from wikitech so panics and fails to ack
[01:05:27] (CR) Reedy: [C: -1] add cawikimedia to wikiversion, MWMultiVersion (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/158303 (owner: Dzahn)
[01:05:27] haha
[01:06:06] Reedy: Looks like this: https://dpaste.de/bqpg
[01:06:45] (PS3) Dzahn: add cawikimedia to wikiversion, MWMultiVersion [mediawiki-config] - https://gerrit.wikimedia.org/r/158303
[01:08:06] !log production wants project name?
[01:08:21] mutante: ?
[01:08:41] Reedy: the format for labs log is
[01:08:44] adminlog.log(self.config, message, project, author)
[01:08:55] like !log blabla
[01:09:01] yeah
[01:10:23] Reedy: extra comma, thanks :)
[01:10:34] Any chance wikitech's api changed encoding from asci to utf8?
[01:19:13] PROBLEM - puppet last run on mw1128 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:20:18] Reedy: wgLanguageCode .. but Canada needs en AND fr :p
[01:20:45] en-ca?
[01:21:15] we don't have fr-ca ;)
[01:21:46] sounds like it breaks if i add non-existing lang code :)
[01:21:58] en-ca exists
[01:22:01] oh, heh
[01:22:05] 77 # Non-ISO language codes
[01:22:35] Reedy: ok:) thx
[01:30:00] andrewbogott_afk: that shouldn't affect parsing the JSON...
[01:31:10] (PS1) Dzahn: add cawikimedia to InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/158312
[01:35:21] (PS2) Dzahn: add cawikimedia to InitialiseSettings [mediawiki-config] - https://gerrit.wikimedia.org/r/158312
[01:37:13] RECOVERY - puppet last run on mw1128 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[01:37:28] (PS1) Dzahn: retab InitialiseSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/158313
[01:42:42] Needs a var dump of the returned json or similar and see what's what
[01:46:46] python doesn't have a var_dump!!!
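(Editor's note: for the "var dump of the returned json" idea above, Python's standard library gets reasonably close to var_dump with json plus pprint. A minimal sketch follows; the helper name dump_response and the sample body are made up for illustration and are not what adminbot or the wikitech API actually use.)

```python
import json
from pprint import pformat


def dump_response(raw_bytes):
    """Return a readable dump of an API response body, or the parse error."""
    # Decode leniently so an unexpected encoding cannot crash the dump itself.
    text = raw_bytes.decode("utf-8", errors="replace")
    try:
        data = json.loads(text)
    except ValueError as exc:
        # Not JSON at all (for example an HTML error page): show the start of it.
        return "not JSON (%s): %r" % (exc, text[:200])
    # Show the top-level type plus a pretty-printed view of the structure.
    return "%s\n%s" % (type(data).__name__, pformat(data))


# Invented sample body, only to show the output shape.
print(dump_response(b'{"edit": {"result": "Success"}}'))
```

(Logging this string right before the bot inspects the response would show whether the body is still JSON and which keys it carries.)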
[01:48:01] (CR) Dzahn: ""trivial" +12338, -12338 :P" [mediawiki-config] - https://gerrit.wikimedia.org/r/158313 (owner: Dzahn)
[01:49:03] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Epic puppet fail
[01:49:50] (CR) Legoktm: "MediaWiki coding conventions are to use tabs..." [mediawiki-config] - https://gerrit.wikimedia.org/r/158313 (owner: Dzahn)
[01:50:01] lol, waited for it
[01:51:40] :P
[01:54:05] (CR) Dzahn: "http://www.mediawiki.org/wiki/Manual:Coding_conventions#Tab_size says "Most MediaWiki developers find 4 spaces per tab to be best for read" [mediawiki-config] - https://gerrit.wikimedia.org/r/158313 (owner: Dzahn)
[01:55:04] (CR) Dzahn: "let's change the convention, it's wiki, so i can say i tried before i abandon :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/158313 (owner: Dzahn)
[01:55:54] mutante: I use different editors for python and php :P
[02:05:43] (PS1) Ori.livneh: Update path references for new deployment root directory [apache-config] - https://gerrit.wikimedia.org/r/158315
[02:08:03] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[02:08:53] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:09:36] (PS1) Ori.livneh: mediawiki: /usr/local/apache/common-local => /srv/mediawiki [puppet] - https://gerrit.wikimedia.org/r/158317
[02:10:14] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3610 MB (3% inode=99%):
[02:22:03] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[02:23:24] PROBLEM - puppet last run on mw1062 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:26:54] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures
[02:27:41] (PS1) Dzahn: wikidata monitoring - work-around wrapper [puppet] -
https://gerrit.wikimedia.org/r/158319
[02:30:22] (CR) Dzahn: [C: 2] icinga: Set default value for from in graphite threshold checks [puppet] - https://gerrit.wikimedia.org/r/158125 (owner: Yuvipanda)
[02:32:46] (PS2) Dzahn: wikidata monitoring - work-around wrapper [puppet] - https://gerrit.wikimedia.org/r/158319
[02:33:03] PROBLEM - Puppet freshness on silver is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:26:55 UTC
[02:34:21] (CR) Dzahn: [C: 2] "just a work-around for now. if that fixes the issue, we can turn that into a template and pass parameters to it" [puppet] - https://gerrit.wikimedia.org/r/158319 (owner: Dzahn)
[02:37:03] PROBLEM - Puppet freshness on rcs1001 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:31:19 UTC
[02:37:03] PROBLEM - Puppet freshness on rcs1002 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:30:54 UTC
[02:37:44] PROBLEM - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.217 second response time
[02:38:43] RECOVERY - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 67276 bytes in 0.222 second response time
[02:39:10] Error: Could not find a service matching host name 'ms-be3004' and description 'very high load average'
[02:39:40] i swear every time i touch icinga there is an unrelated error already that keeps it from reloading :p
[02:41:24] RECOVERY - puppet last run on mw1062 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[02:41:55] !log LocalisationUpdate completed (1.24wmf15) at 2014-09-04 02:40:45+00:00
[02:43:28] (CR) Dzahn: "Failed to call refresh: Could not restart Service[icinga]:" [puppet] - https://gerrit.wikimedia.org/r/157672 (owner: Filippo Giunchedi)
[02:45:20] (CR) Dzahn: "we have /etc/icinga AND /etc/nagios now?? why back to nagios?"
[puppet] - https://gerrit.wikimedia.org/r/157672 (owner: Filippo Giunchedi)
[02:47:03] (CR) Dzahn: "lol, icinga/nagios is chaos, "(config file '/etc/nagios/puppet_servicegroups.cfg'," in error messages but: nagios/puppet_services.cfg: ERR" [puppet] - https://gerrit.wikimedia.org/r/157672 (owner: Filippo Giunchedi)
[02:47:35] (CR) Dzahn: "servicegroups is in one place, services in another, and unrelated to this check" [puppet] - https://gerrit.wikimedia.org/r/157672 (owner: Filippo Giunchedi)
[02:49:53] PROBLEM - puppet last run on amssq56 is CRITICAL: CRITICAL: Epic puppet fail
[02:59:22] (CR) Dzahn: "oh damnit, it's actually the comma in the description text :p" [puppet] - https://gerrit.wikimedia.org/r/157672 (owner: Filippo Giunchedi)
[03:00:13] RECOVERY - Disk space on virt0 is OK: DISK OK
[03:01:59] (PS1) Dzahn: remove breaking comma from swift monitor [puppet] - https://gerrit.wikimedia.org/r/158321
[03:02:32] (PS2) Dzahn: remove breaking comma from swift monitor [puppet] - https://gerrit.wikimedia.org/r/158321
[03:02:56] (CR) Dzahn: [C: 2] remove breaking comma from swift monitor [puppet] - https://gerrit.wikimedia.org/r/158321 (owner: Dzahn)
[03:03:34] (CR) Dzahn: [V: 2] remove breaking comma from swift monitor [puppet] - https://gerrit.wikimedia.org/r/158321 (owner: Dzahn)
[03:03:52] (PS3) Dzahn: remove breaking comma from swift monitor [puppet] - https://gerrit.wikimedia.org/r/158321
[03:04:28] (CR) Dzahn: [V: 2] remove breaking comma from swift monitor [puppet] - https://gerrit.wikimedia.org/r/158321 (owner: Dzahn)
[03:08:53] RECOVERY - puppet last run on amssq56 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[03:13:02] !log LocalisationUpdate completed (1.24wmf18) at 2014-09-04 03:11:58+00:00
[03:27:09] RECOVERY - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1243 bytes in 0.699 second
response time
[03:27:25] jzerebecki: ^ :)
[03:28:50] (CR) Dzahn: "fixed in Change-Id: Ic89dcee033a313 after like 5 puppet runs or so" [puppet] - https://gerrit.wikimedia.org/r/157672 (owner: Filippo Giunchedi)
[03:29:01] (CR) Dzahn: "@neon:/etc/icinga# grep -r "very high" *" [puppet] - https://gerrit.wikimedia.org/r/158321 (owner: Dzahn)
[03:29:30] (CR) Dzahn: "20:29 <+icinga-wm> RECOVERY - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 12" [puppet] - https://gerrit.wikimedia.org/r/158319 (owner: Dzahn)
[03:43:37] !log LocalisationUpdate completed (1.24wmf19) at 2014-09-04 03:42:34+00:00
[04:22:59] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[04:32:34] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Sep 4 04:31:28 UTC 2014 (duration 31m 27s)
[04:33:59] PROBLEM - Puppet freshness on silver is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:26:55 UTC
[04:37:59] PROBLEM - Puppet freshness on rcs1001 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:31:19 UTC
[04:37:59] PROBLEM - Puppet freshness on rcs1002 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:30:54 UTC
[04:46:02] (PS1) Springle: repool db1035 [mediawiki-config] - https://gerrit.wikimedia.org/r/158328
[04:46:34] (CR) Springle: [C: 2] repool db1035 [mediawiki-config] - https://gerrit.wikimedia.org/r/158328 (owner: Springle)
[04:46:38] (Merged) jenkins-bot: repool db1035 [mediawiki-config] - https://gerrit.wikimedia.org/r/158328 (owner: Springle)
[04:47:34] !log springle Synchronized wmf-config/db-eqiad.php: repool db1035, warm up (duration: 00m 08s)
[04:56:09] <_joe_> rcs?
[04:56:21] <_joe_> who touched rcstream? :)
[04:57:04] <_joe_> eheh, it was _me_
[04:57:06] <_joe_> :P
[05:04:03] springle: hey, how are the *_content schema changes going?
[05:04:48] I was wondering that if it's finished on some smaller wikis, we could turn on the setting there and just wait for larger ones
[05:05:30] andrewbogott_afk, ping
[05:05:49] PROBLEM - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.196 second response time
[05:06:49] RECOVERY - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 67276 bytes in 0.200 second response time
[05:07:01] or Reedy. or whoever wants to investigate
[05:07:24] Second appservers lvs page...somebody doing something or flaky business?
[05:13:17] legoktm: the changes go out slave-by-slave, not wiki-by-wiki. would it help to do all of S3 first?
[05:13:36] ah
[05:14:17] springle: yeah, doing s3 first would be nice
[05:14:36] (if it wouldn't be too much trouble)
[05:15:05] revision fields are going out now. had to upgrade some slaves to mariadb 10 to get proper native online DDL. as it happens, 2 of 4 S3 slaves are done, so it could be hurried
[05:15:30] just a script change. np
[05:16:19] how is that different than OSC?
[05:16:26] besides that it's built in
[05:17:25] the native online DDL doesn't need to set triggers or create a copy of the table, and seems to run in about 1/3rd the time
[05:18:49] PROBLEM - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.200 second response time
[05:19:28] that still means ~8h for enwiki revision per slave
[05:19:49] RECOVERY - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 67276 bytes in 0.217 second response time
[05:23:59] sync-file on tin returned mkdir "/srv/mediawiki/wmf-config" failed: Permission denied (13), yet wmf-config/db-eqiad.php was synced correctly. anyone know what might be wrong?
[05:25:41] springle: On which host did that fail?
That path looks rather non-standard
[05:25:52] !log temp hack fixed deployed for morebots (here and labs, not the other instances)
[05:25:58] Logged the message, Master
[05:26:24] hoo: mw1161
[05:27:23] hoo: at least, i think so... http://paste.debian.net/119260/
[05:27:46] maybe that meant mw1161 to osmium
[05:28:58] I think mw1161 might be a scap proxy
[05:29:15] it is
[05:29:23] so that's probably not the failing host
[05:30:12] osmium is the things :S
[05:30:14] * thing
[05:30:47] why the ... is it trying to rsync into that dir and not /a/common/ :S
[05:34:03] _joe_: ^ happen to know?
[05:34:15] springle: Ok, looked at it
[05:34:34] seems to be a hhvm testing host not running our standard MW installation
[05:34:42] or at least not in the way it's supposed to be
[05:35:11] this shouldn't be in the dsh probably... or it should be fixed to be a standard deploy
[05:38:51] Or maybe scap should be using /usr/local/apache/common-local which is probably the canonical path
[05:38:53] but whatever
[05:39:48] springle: Osmium currently can serve traffic (via hhvm), but it won't get code updates... that's bad and needs to be fixed
[05:39:56] I'm not bold enough to mess with its /srv/ though
[05:39:59] good night
[05:40:00] does it look like we can sync-common osmium for now?
[05:40:09] ok, thanks hoo
[05:40:24] good question, let me try
[05:41:34] nope, also not possible as that also tries to go to /srv/mediawiki (which kind of makes sense)
[05:42:28] * springle emails ops@
[05:43:03] add my commands about the path, please
[05:43:10] don't think I'll come to replying today
[05:43:14] (too tired)
[05:43:15] ok
[06:05:03] <_joe_> hey I'm back
[06:05:35] <_joe_> springle: what has happened?
[06:06:00] <_joe_> osmium is a playground of ori's
[06:10:19] <_joe_> and , I'm going to have breakfast now, see you in ~ 1 hour
[06:14:38] _joe_: np, emailed ops@
[06:23:49] PROBLEM - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.194 second response time
[06:23:59] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[06:25:50] RECOVERY - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 67276 bytes in 0.214 second response time
[06:28:30] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:30] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:30] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Epic puppet fail
[06:28:39] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:39] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:49] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:49] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:49] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:49] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:59] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:00] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:09] PROBLEM - Disk space on elastic1004 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%):
[06:29:09] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:09] PROBLEM - puppet last run on db1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:19] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:49] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:59] PROBLEM - Puppet freshness on silver is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:26:55 UTC
[06:36:30] PROBLEM - puppet last run on ssl1001 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:37:09] RECOVERY - Disk space on elastic1004 is OK: DISK OK
[06:37:30] !log clear slowlog on elastic1004
[06:37:35] Logged the message, Master
[06:38:59] PROBLEM - Puppet freshness on rcs1002 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:30:54 UTC
[06:38:59] PROBLEM - Puppet freshness on rcs1001 is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:31:19 UTC
[06:39:20] (CR) Legoktm: [C: 1] "Yes please" [puppet] - https://gerrit.wikimedia.org/r/157013 (owner: Reedy)
[06:43:09] RECOVERY - Disk space on ms1004 is OK: DISK OK
[06:45:39] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:45:49] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:45:49] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:45:49] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:45:50] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:45:50] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:45:59] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:45:59] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:46:09] PROBLEM - Disk space on ms1004 is CRITICAL: DISK CRITICAL - free space: / 13 MB (0%
inode=94%): /var/lib/ureadahead/debugfs 13 MB (0% inode=94%):
[06:46:09] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:46:09] RECOVERY - puppet last run on db1002 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:46:29] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:46:30] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:46:49] RECOVERY - puppet last run on searchidx1001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:47:19] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:47:19] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:47:39] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:54:29] RECOVERY - puppet last run on ssl1001 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[07:05:19] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[07:12:04] <_joe_> I don't get why we got those errors from the hhvm pool at all
[07:16:49] I need some urgent help
[07:17:09] PROBLEM - puppet last run on cp3011 is CRITICAL: CRITICAL: Epic puppet fail
[07:18:07] a troll moved a user talk page to a bad name and i deleted it in order to restore
[07:18:30] but i'm unable to restore due to the large number of edits the page has
[07:20:03] <_joe_> matanya: uhm I have no idea how to do that sorry :(
[07:20:17] this is a springle thing, i guess
[07:20:26] <_joe_> oh directly in the db?
[07:20:28] <_joe_> naughty
[07:20:39] no other option i can think of
[07:20:54] my browser can't even open the page
[07:21:52] bug report.
not doing anything ad-hoc :) [07:22:27] poor editor [07:30:07] <_joe_> of course bug-report :) [07:32:03] _joe_, are you familiar with bigdelete? [07:32:16] <_joe_> jeremyb: no sorry [07:32:45] <_joe_> or better, I knwo what it is [07:32:51] <_joe_> in general [07:33:04] internal_api_error_DBQueryError [07:33:05] ok :) [07:33:17] trolls, i hate you [07:33:21] sounds like maybe matanya was hovering around that threshold [07:33:22] hah [07:33:37] i can't restore her talk page! arrg [07:35:57] springle: https://bugzilla.wikimedia.org/show_bug.cgi?id=70387 [07:36:09] RECOVERY - puppet last run on cp3011 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [07:36:23] (03PS1) 10Giuseppe Lavagetto: rcstream: remove duplicate package declaration [puppet] - 10https://gerrit.wikimedia.org/r/158335 [07:37:12] seems like i solved it [07:37:50] PROBLEM - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 hphp_invoke - 332 bytes in 0.194 second response time [07:37:53] you know that's going to show up in google? [07:38:10] for ${correctusername} [07:38:39] pages for hhvm-appservers ... [07:38:42] <_joe_> mmmmh [07:38:43] <_joe_> yes [07:38:44] _joe_: that is you :P [07:38:48] <_joe_> but without a reason [07:39:15] <_joe_> I'm seeing enwiki with hhvm without problems [07:39:50] RECOVERY - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 67276 bytes in 0.204 second response time [07:40:33] <_joe_> akosiaris: I really don't know what is this check btw [07:40:51] <_joe_> I did not set this up [07:41:49] aha [07:42:03] <_joe_> "hphp_invoke" [07:42:18] <_joe_> oh that's the status code [07:45:11] <_joe_> ok we do have one of the servers (testwiki) which is in a bad state apparently [07:47:02] WFM. how do i force hhvm for myself? 
[07:47:04] <_joe_> or at least, pybal thinks so [07:48:02] <_joe_> 2014-09-04 07:47:19.459320 [hhvm_appservers_80 ProxyFetch] mw1017.eqiad.wmnet (enabled/partially up/pooled): Fetch failed, [07:48:05] <_joe_> meh [07:50:45] jeremyb: https://meta.wikimedia.org/wiki/User:Ori.livneh/global.js [07:50:51] <_joe_> ok, it's mw1017 [07:51:49] <_joe_> the bad thing is, we have nothing telling us it's in a bad state, apart an lvs check [07:51:51] but I think testwiki is hhvm for everyone? [07:52:02] <_joe_> it is [07:52:11] <_joe_> legoktm: is testwiki down atm? [07:52:13] <_joe_> it should be [07:52:22] errr, it's working for me :P [07:53:02] <_joe_> lol [07:53:02] was up for me this whole time too [07:53:17] <_joe_> everything is working but en.wikipedia.org/wiki/Main_page [07:53:25] <_joe_> EWWW this is *horrible* [07:53:30] oh [07:53:33] that works for me using hhvm too... [07:53:47] <_joe_> legoktm: yes now I know what the problem is [07:53:53] :D [07:53:55] <_joe_> and it's *worse* than I expected [07:54:03] :/ [07:54:06] <_joe_> it's a frigging nightmare honestly [07:54:21] <_joe_> time for a cigarette :/ [07:54:37] <_joe_> akosiaris: you cna safely ignore the pages for now [07:54:52] <_joe_> I'd like not to take mw1017 out of rotation though [07:54:57] <_joe_> I want to investigate this [07:56:01] ok [08:01:41] <_joe_> ok so this is funny. on testwiki, the main page and barack obama are failing. but https://en.wikipedia.org/wiki/Special:Export/Pulsar is not [08:01:55] <_joe_> any idea what different code patterns they may encounter? 
[08:05:43] <_joe_> the error is spawned here https://github.com/facebook/hhvm/blob/master/hphp/runtime/server/http-request-handler.cpp#L441 [08:07:02] <_joe_> and the logs show Sep 4 08:06:23 mw1017 hhvm: message repeated 55 times: [ #012Fatal error: Argument 1 passed to Wikibase\RepoLinker::getEntityUrl() must be an instance of Wikibase\EntityId, Wikibase\DataModel\Entity\ItemId given in /usr/local/apache/common-local/php-1.24wmf18/extensions/Wikidata/extensions/Wikibase/client/includes/RepoLinker.php on line 173] [08:07:07] _joe_, front page works for me [08:07:25] <_joe_> jeremyb: yes it's just the testwiki host behaving crazily [08:10:40] _joe_, i don't follow. but anyway, works for me. (testwiki too) [08:11:28] legoktm: nice, that page finally told me the unit for https://www.mediawiki.org/wiki/Manual:WgBackendResponseTime [08:11:30] i wonder if testwiki uses parser cache? saw a 7+ sec load of barack obama according to ori's script [08:16:25] <_joe_> !log running sync-common on mw1017, trying to debug the hhvm bad state [08:16:31] Logged the message, Master [08:24:59] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [08:35:59] PROBLEM - Puppet freshness on silver is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:26:55 UTC [08:52:38] (03PS2) 10Giuseppe Lavagetto: rcstream: remove duplicate package declaration [puppet] - 10https://gerrit.wikimedia.org/r/158335 [08:52:51] (03CR) 10Giuseppe Lavagetto: [C: 032] rcstream: remove duplicate package declaration [puppet] - 10https://gerrit.wikimedia.org/r/158335 (owner: 10Giuseppe Lavagetto) [08:59:32] (03PS1) 10Spage: Use Flow on frwiki and hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158341 [08:59:34] (03PS1) 10Spage: Enable Flow on several pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158342 [09:11:09] RECOVERY - Puppet freshness on rcs1002 is OK: puppet ran at Thu Sep 4 09:11:00 UTC 2014 [09:11:09] RECOVERY - 
Puppet freshness on rcs1001 is OK: puppet ran at Thu Sep 4 09:11:00 UTC 2014 [09:11:49] RECOVERY - puppet last run on rcs1001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [09:12:09] RECOVERY - puppet last run on rcs1002 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [09:41:16] (03PS1) 10JanZerebecki: Puppetize icinga log file permission fix. [puppet] - 10https://gerrit.wikimedia.org/r/158345 [10:01:05] (03PS2) 10JanZerebecki: Puppetize icinga tmpfs mount. [puppet] - 10https://gerrit.wikimedia.org/r/158343 [10:25:59] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [10:36:59] PROBLEM - Puppet freshness on silver is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:26:55 UTC [10:45:41] hey wikitech is down I guess [10:45:46] is that known issue? [10:49:21] works for me (?) [10:52:07] <_joe_> petan: wfm as well [10:53:04] (03PS6) 10Giuseppe Lavagetto: beta: use HHVM for all requests [puppet] - 10https://gerrit.wikimedia.org/r/157823 [10:54:38] yup it's back [11:04:09] PROBLEM - puppet last run on amssq54 is CRITICAL: CRITICAL: Epic puppet fail [11:07:31] (03PS1) 10Giuseppe Lavagetto: trebuchet::packages: use ensure_packages from stdlib [puppet] - 10https://gerrit.wikimedia.org/r/158353 [11:24:10] RECOVERY - puppet last run on amssq54 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [11:47:19] (03PS1) 10Yuvipanda: labmon: Add http:// prefix for graphite URL [puppet] - 10https://gerrit.wikimedia.org/r/158356 [12:04:33] (03PS1) 10Yuvipanda: quarry: Remove unused directive from nginx conf [puppet] - 10https://gerrit.wikimedia.org/r/158360 [12:06:16] akosiaris: can I interest you in merging a couple of trivial patches? 
[12:09:30] <_joe_> YuviPanda: I will do that for 10 euros less than what he asks for [12:09:31] (PS2) Giuseppe Lavagetto: trebuchet::packages: use ensure_packages from stdlib [puppet] - https://gerrit.wikimedia.org/r/158353 [12:10:01] YuviPanda: i would do it for free if i have root [12:10:43] <_joe_> YuviPanda: I'm here, which trivial patches? [12:11:05] _joe_: https://gerrit.wikimedia.org/r/#/c/158356/ and https://gerrit.wikimedia.org/r/#/c/158360/ [12:11:29] _joe_: didn't want to bug you since you seemed to have your hands full with HHVM, but yay you have time :) [12:12:05] <_joe_> well, I'm technically off after lunch, but a puppet problem was bugging me, so... [12:12:29] <_joe_> I'm here looking at puppet already [12:12:32] heh [12:12:48] (CR) Giuseppe Lavagetto: [C: 2] labmon: Add http:// prefix for graphite URL [puppet] - https://gerrit.wikimedia.org/r/158356 (owner: Yuvipanda) [12:14:40] (CR) Giuseppe Lavagetto: quarry: Remove unused directive from nginx conf (2 comments) [puppet] - https://gerrit.wikimedia.org/r/158360 (owner: Yuvipanda) [12:15:24] (CR) Yuvipanda: [C: 1] quarry: Remove unused directive from nginx conf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/158360 (owner: Yuvipanda) [12:16:03] (PS2) Giuseppe Lavagetto: quarry: Remove unused directive from nginx conf [puppet] - https://gerrit.wikimedia.org/r/158360 (owner: Yuvipanda) [12:16:11] (CR) Giuseppe Lavagetto: [C: 2 V: 2] quarry: Remove unused directive from nginx conf [puppet] - https://gerrit.wikimedia.org/r/158360 (owner: Yuvipanda) [12:16:19] _joe_: \o/ tyvm [12:16:31] * YuviPanda waits for a puppet run on neon [12:16:43] <_joe_> on neon? [12:16:54] _joe_: for the icinga change? 
[12:16:56] <_joe_> oh for graphite [12:17:00] <_joe_> ok :P [12:17:06] yeah :) [12:17:10] <_joe_> I was thinking of quarry [12:17:46] _joe_: ah :) that's a self puppetmaster (for now) so it's already running there [12:17:57] <_joe_> ok [12:18:25] <_joe_> YuviPanda: do you know anything about mobile::vumi in puppet? [12:18:47] <_joe_> it requires a very specific version of python-redis, and I can't understand why [12:19:00] _joe_: from what I understand it isn't used anymore, and the people responsible for it that I know of (preilly) aren't at the WMF anymore, and I haven't heard about that service inside the mobile team forever. [12:19:03] <_joe_> also, it's a very very bad decision [12:19:10] <_joe_> lol [12:19:22] <_joe_> so it's sitting there in puppet for no purpose? [12:19:26] _joe_: so I think it just needs some conversation with Tomasz / WP0 to understand where it is (I think it's fully outsourced now) and then remov eit [12:19:28] *remove it [12:20:06] <_joe_> ok, for now I'm just fixing it [12:20:10] cool [12:20:18] _joe_: want me to deal with the conversations / removal? [12:20:36] <_joe_> if possible, that would be great [12:20:48] _joe_: indeed, writing emails now [12:21:56] <_joe_> thanks! [12:22:28] _joe_: yw! 
[12:23:39] (PS3) Giuseppe Lavagetto: trebuchet::packages: use ensure_packages from stdlib [puppet] - https://gerrit.wikimedia.org/r/158353 [12:24:56] _joe_ ,YuviPanda the one before last comment on https://gerrit.wikimedia.org/r/#/c/117673/ might clarify a bit [12:25:25] matanya: _joe_ yeah, jerith is the person who was working on it (externally) [12:26:05] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC [12:26:22] matanya: thanks for pointing out, I've added that info to my email to tomasz / dan who should be able to clarify its status [12:27:00] <_joe_> bbl [12:27:02] YuviPanda: sure, if you don't mind, I would like to know as well, in order to fix that patch [12:27:39] matanya: yeah, I'll inform you when I get a response [12:27:52] thank you [12:30:36] (PS1) Yuvipanda: Remove all vumi related code [puppet] - https://gerrit.wikimedia.org/r/158365 [12:30:47] (PS2) Yuvipanda: Remove all vumi related code [puppet] - https://gerrit.wikimedia.org/r/158365 [12:31:16] (CR) Yuvipanda: [C: -1] "Awaiting confirmation that this code is no longer needed" [puppet] - https://gerrit.wikimedia.org/r/158365 (owner: Yuvipanda) [12:31:30] (CR) jenkins-bot: [V: -1] Remove all vumi related code [puppet] - https://gerrit.wikimedia.org/r/158365 (owner: Yuvipanda) [12:33:14] (PS3) Yuvipanda: Remove all vumi related code [puppet] - https://gerrit.wikimedia.org/r/158365 [12:34:41] (PS1) BBlack: Add geoip resources for direct address resolution [dns] - https://gerrit.wikimedia.org/r/158367 [12:34:43] (PS1) BBlack: add langs templates using DYNA [dns] - https://gerrit.wikimedia.org/r/158368 [12:35:04] (PS4) Yuvipanda: Remove all vumi related code [puppet] - https://gerrit.wikimedia.org/r/158365 [12:35:19] matanya: ^ removed 'em cleanly I think [12:35:28] (CR) Yuvipanda: [C: -1] Remove all vumi related code [puppet] - https://gerrit.wikimedia.org/r/158365 
(owner: Yuvipanda) [12:35:47] matanya: one less thing to move into a module!!1 [12:36:03] yay! thank you, i'll try to review later today [12:37:05] PROBLEM - Puppet freshness on silver is CRITICAL: Last successful Puppet run was Wed 03 Sep 2014 14:26:55 UTC [12:47:48] (CR) BBlack: [C: 2] Add geoip resources for direct address resolution [dns] - https://gerrit.wikimedia.org/r/158367 (owner: BBlack) [12:50:46] (PS2) BBlack: add langs templates using DYNA [dns] - https://gerrit.wikimedia.org/r/158368 [12:51:17] (CR) BBlack: [C: 2] add langs templates using DYNA [dns] - https://gerrit.wikimedia.org/r/158368 (owner: BBlack) [12:52:04] (PS4) Giuseppe Lavagetto: trebuchet::packages: use ensure_packages from stdlib [puppet] - https://gerrit.wikimedia.org/r/158353 [12:53:19] (Abandoned) BBlack: Remove actual $project-lb.wm.o domainnames [dns] - https://gerrit.wikimedia.org/r/157982 (owner: BBlack) [12:57:03] (PS5) Giuseppe Lavagetto: trebuchet::packages: use ensure_packages from stdlib [puppet] - https://gerrit.wikimedia.org/r/158353 [12:58:08] (PS1) BBlack: Comment on future removal of $project-lb.wm.o [dns] - https://gerrit.wikimedia.org/r/158371 [12:58:43] (CR) BBlack: [C: 2] Comment on future removal of $project-lb.wm.o [dns] - https://gerrit.wikimedia.org/r/158371 (owner: BBlack) [13:00:48] (PS6) Giuseppe Lavagetto: trebuchet::packages: use ensure_packages from stdlib [puppet] - https://gerrit.wikimedia.org/r/158353 [13:09:32] (CR) Giuseppe Lavagetto: [C: 2 V: 2] "Verified with the puppet compiler, should not introduce new issues." 
[puppet] - https://gerrit.wikimedia.org/r/158353 (owner: Giuseppe Lavagetto) [13:11:25] RECOVERY - Puppet freshness on silver is OK: puppet ran at Thu Sep 4 13:11:20 UTC 2014 [13:12:16] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [13:21:59] (PS1) Giuseppe Lavagetto: trebuchet::packages: fix conditional install [puppet] - https://gerrit.wikimedia.org/r/158374 [13:22:08] (CR) jenkins-bot: [V: -1] trebuchet::packages: fix conditional install [puppet] - https://gerrit.wikimedia.org/r/158374 (owner: Giuseppe Lavagetto) [13:22:13] (PS2) Giuseppe Lavagetto: trebuchet::packages: fix conditional install [puppet] - https://gerrit.wikimedia.org/r/158374 [13:22:24] (CR) Giuseppe Lavagetto: [C: 2 V: 2] trebuchet::packages: fix conditional install [puppet] - https://gerrit.wikimedia.org/r/158374 (owner: Giuseppe Lavagetto) [13:27:44] <_joe_> ugh, lucid, not hardy [13:27:50] * _joe_ facepalms [13:32:40] (PS1) Giuseppe Lavagetto: trebuchet::packages: s/hardy/lucid/ [puppet] - https://gerrit.wikimedia.org/r/158375 [13:33:56] (CR) Giuseppe Lavagetto: [C: 2] trebuchet::packages: s/hardy/lucid/ [puppet] - https://gerrit.wikimedia.org/r/158375 (owner: Giuseppe Lavagetto) [13:35:45] RECOVERY - puppet last run on linne is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [13:36:37] <_joe_> at long last [13:38:55] RECOVERY - puppet last run on nickel is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [13:39:46] RECOVERY - puppet last run on es4 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [13:41:05] RECOVERY - puppet last run on ms1001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [13:43:45] RECOVERY - puppet last run on tridge is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [13:44:45] RECOVERY - puppet last 
run on ms1004 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [13:45:51] it's hard to be precisely lucid sometimes :) [13:46:24] <_joe_> it's hardy in fact [13:46:32] <_joe_> oh my [13:47:35] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [13:48:55] RECOVERY - puppet last run on sodium is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [13:48:55] RECOVERY - puppet last run on nfs1 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [13:53:25] RECOVERY - puppet last run on sanger is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [14:02:37] PROBLEM - MySQL Processlist on db1030 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 0 copy to table, 83 statistics [14:03:35] RECOVERY - MySQL Processlist on db1030 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 0 statistics [14:26:48] Reedy, yt? Apparently we forgot to check if images worked :) [14:27:06] I thought they were back! [14:27:08] (03PS1) 10BBlack: Move all geoip-based resolution to DYNA [dns] - 10https://gerrit.wikimedia.org/r/158382 [14:27:27] marktrac1ur: they're back on normal wikis, but broken on wikitech [14:27:37] Huh [14:27:44] So, the next step will be to reverse that :( [14:28:02] Yay intentional breakage. [14:28:12] Good news is, there doesn't appear to be any SWAT [14:29:11] marktraceur: do /you/ want to help me fix images on wikitech? According to my recent email, it's now easier for non-Ops staff to maintain :) [14:29:34] "easier" [14:29:41] andrewbogott: What do you think is wrong? [14:30:37] Speaking in very general terms (since I don't know anything)… probably the new config is looking in a normal place for images, but the actual historic images are stored in a non-normal place. [14:30:44] Ah. [14:30:51] In a default mw install what happens to images? Surely they aren't packed into a db table... 
[14:31:00] andrewbogott: Do you have a pointer to the config for wikitech? [14:31:13] Default installs have an images/ directory with hashed subdirs [14:31:22] ok, lemme look, maybe I just have to copy that over [14:31:29] a/ab/foobar.jpg e.g. [14:31:55] the config is in comingled with the other configs now. Plus a supplementary wikitech.php [14:32:41] Cool. [14:32:43] Hm, default images dir (in the old wiki) is empty [14:32:45] * marktraceur will look [14:33:22] happen to know if that dir is configurable, and via what setting? [14:33:40] * andrewbogott is looking at the old config, which is not visible to normal human eyes, unfortunately [14:33:49] oh wgUploadDirectory [14:33:57] can I just cp -r everything in the old dir to the new one? [14:34:02] Since the db is still expecting them anyway? [14:35:53] Oh, bad news, it's already configured to point to the same place in the new config [14:35:58] unless it's getting overridden someplace else [14:37:11] andrewbogott: I am here [14:37:37] https://upload.wikimedia.org/wikipedia/labs/thumb/4/41/LabsProjectsInstance.png/1920px-LabsProjectsInstance.png [14:37:41] oops, wrong paste [14:37:50] Error: Could not load thumbnail data. could not load image from https://upload.wikimedia.org/wikipedia/labs/thumb/4/41/LabsProjectsInstance.png/1920px-LabsProjectsInstance.png [14:38:24] I left the images pointing at the old path in the apache config (I thought!) [14:38:40] is wgUploadDirectory not set correctly? [14:38:52] You did, and it looks right to me in the config... [14:40:10] andrewbogott: wgUploadPath [14:40:12] easily fixed [14:40:14] give me a minute [14:40:37] Oh, two different things huh? [14:40:45] yeah [14:40:55] path on disk vs path to expect them via the interwebs [14:41:10] Probably should also move those to someplace a little more sane now that we aren't using that old tree. Is it safe to just copy things when I do that? [14:41:27] yup, should be fine [14:41:36] Let's do that now since it's already broken. 
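For reference, the distinction worked out above (wgUploadDirectory is the path on disk, wgUploadPath is the path the files are expected at over the web) and the "a/ab/foobar.jpg" layout can be sketched roughly as below. This is only an illustration in Python, not MediaWiki's own code: the function names and the example directories are made up, and it assumes the usual two-level layout where the subdirectories come from the MD5 of the file name with spaces turned into underscores.

```python
import hashlib
import posixpath


def hashed_rel_path(filename: str) -> str:
    """Return 'a/ab/Name.ext' style relative path (hypothetical helper)."""
    # Assumption: titles are stored with underscores instead of spaces.
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    # First hex digit, then first two hex digits, then the file name.
    return f"{digest[0]}/{digest[:2]}/{name}"


def locate(filename: str, upload_directory: str, upload_path: str):
    """Return (path on disk, URL path) for one uploaded file.

    upload_directory plays the role of wgUploadDirectory and
    upload_path the role of wgUploadPath; both values are supplied
    by the caller and are not taken from any real config.
    """
    rel = hashed_rel_path(filename)
    return posixpath.join(upload_directory, rel), posixpath.join(upload_path, rel)


if __name__ == "__main__":
    # Example values only; the real directories are whatever the site config says.
    disk, url = locate("Foo bar.jpg", "/srv/example/images", "/w/images")
    print(disk)
    print(url)
```

The point of the sketch is that the relative part is shared: if only one of the two settings is right, the files can sit in the correct place on disk and still 404 over the web, which matches the symptom described here.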
[14:42:26] Is it safe to put them in /usr/local/apache/common/images or will scap hate that? [14:42:50] andrewbogott: lots of labswki dberror noise on fluorine. known? [14:43:01] (03PS1) 10Reedy: Set wgUploadPath for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158385 [14:43:07] wikiuser denied access, snapshots failing, etc [14:43:07] springle: no, I don't know what that's about [14:43:28] springle: andrewbogott: because it's iterating over all.dblist trying to access it [14:44:16] Is it breaking all dumps? ie the scripts are just stopping? [14:44:19] andrewbogott: /usr/local/apache/common/images will likely end up getting broken [14:44:30] don't know. apergos question [14:44:38] probably just noise [14:44:48] Reedy: ok, how about just /usr/local/apache/images then? [14:45:11] andrewbogott: WFM [14:45:13] * apergos goes to check on the dumps [14:45:22] Reedy: ok, I'll move things. Stay tuned... [14:46:18] (03PS2) 10Reedy: Set wgUploadPath for labswiki. Update wgUploadDirectory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158385 [14:46:42] andrewbogott: also, virt1000 has started trying to connect as wikiuser to a bunch of production dbs. should it? [14:46:51] it's being denied [14:47:04] springle: Mh... it has a MW install now [14:47:09] virt1000 labswiki Error connecting to 10.64.16.30 [14:47:10] so I guess it should be able [14:47:14] (a "real" one) [14:47:27] springle: I'd say that it shouldn't be trying. [14:47:31] But I don't know why it is [14:47:38] hoo: labswiki is still local to virt1000 [14:47:39] springle: I think there's a handful more extension that probably need disabling on it [14:48:11] seem ok, I see it listing labswiki [14:48:39] so removing that before the scripts try to dump it would be good, if it's a problem [14:49:29] Reedy: once this copy finishes, who should own all these files? 
[14:49:47] apache presumably [14:51:42] Ok, copy finished [14:52:35] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 0 unmerged changes in mediawiki_config (dir /a/common/). [14:52:44] ? [14:53:09] andrewbogott: https://gerrit.wikimedia.org/r/#/c/158385/ want to confirm it? [14:54:01] (03CR) 10Andrew Bogott: [C: 031] "I'm taking your word for the upload path" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158385 (owner: 10Reedy) [14:54:19] Re /srv/mediawiki in scap: ori made a change yesterday that I don't see SAL or email info about. Scap is now syncing to /srv/mediawiki instead of /usr/local/apache/common-local. He said he made /srv/mediawiki a symlink to /usr/local/apache/common-local everywhere. [14:54:41] (03CR) 10Reedy: [C: 032] Set wgUploadPath for labswiki. Update wgUploadDirectory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158385 (owner: 10Reedy) [14:54:46] (03Merged) 10jenkins-bot: Set wgUploadPath for labswiki. Update wgUploadDirectory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158385 (owner: 10Reedy) [14:55:35] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [14:56:00] !log ori updated scap to 773f95f (change deploy_dir to /srv/mediawiki) ~15 hours ago [14:56:05] Logged the message, Master [14:56:40] Reedy: ready for me to sync? [14:56:46] yeah [15:01:03] images: I haz them! [15:01:07] Thanks Reedy [15:01:17] yay [15:01:35] * anomie sees nothing for SWAT this morning [15:02:02] andrewbogott: Did you see the paste of the data that morebots is getting from wikitech? http://dpaste.com/078DJ46.txt [15:02:22] Seems like it may be a bug in that wmf15 branch? [15:02:45] bd808: Seems like. Or a misconfiguration of search [15:02:56] Do you know what JeremyB did to fix things? [15:03:11] We can disable the debugging flags btw [15:03:13] I think he hacked the crap out of morebots to ignore the junk [15:03:24] <^d> I've got something for swat. 
[15:03:28] <^d> That I'm going to do myself. [15:03:48] ^d: http://dpaste.com/078DJ46.txt Do you know if that's been fixed/still a problem in master? (the strict warnings) [15:03:53] (03Abandoned) 10Giuseppe Lavagetto: appservers: mediawiki config in puppet, debianized, 2.4-compatible [WIP] [puppet] - 10https://gerrit.wikimedia.org/r/144453 (owner: 10Giuseppe Lavagetto) [15:03:57] * marktraceur bows to ^d [15:04:03] andrewbogott: He did this -- http://dpaste.com/3RX66W5 [15:04:44] Will turning off debug flags make those extra API messages go away? [15:04:44] <^d> Reedy: Probably not. I forgot all about mwsearch :p [15:04:45] andrewbogott: Yeah, it's php warnings [15:04:48] might be worth fixing it anyway ;) [15:04:54] let me have a look [15:05:03] * ^d works on his patch [15:05:45] fucking seriously [15:05:47] unused variable [15:06:04] <^d> Oh yeah, I was removing those because they were mostly unused. [15:06:09] <^d> Just nuke it from mwsearch. [15:06:34] (03PS2) 10BBlack: Move all geoip-based resolution to DYNA [dns] - 10https://gerrit.wikimedia.org/r/158382 [15:07:27] https://gerrit.wikimedia.org/r/158387 [15:07:30] uh [15:07:44] No permission to view newly created ticket #8286. [15:07:48] can anyone fix that for me? [15:07:49] :P [15:07:57] this is why I dislike RT [15:07:59] * hoo hides [15:08:42] Anyone want to review https://gerrit.wikimedia.org/r/158387 for me? ;) [15:09:37] Reedy: Done [15:09:52] (CR) Andrew Bogott: [C: 1] "Obvious fix is obvious" [extensions/MWSearch] - https://gerrit.wikimedia.org/r/158387 (owner: Reedy) [15:09:52] (CR) Hoo man: [C: 2] "Trivial change is trivial" [extensions/MWSearch] - https://gerrit.wikimedia.org/r/158387 (owner: Reedy) [15:09:54] * Reedy grins [15:10:47] (03PS1) 10Reedy: Remove debugging stuff from wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158390 [15:11:36] (03CR) 10Andrew Bogott: [C: 031] "Such optimism!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/158390 (owner: 10Reedy) [15:12:02] Can any op give me permission to view RT#8286 ? [15:12:26] Which queue? [15:13:05] access requests [15:13:06] It's an access request, traditionally not publicly viewable. [15:13:15] which I'm not totally sure is the right one [15:13:25] andrewbogott: me != public :P [15:13:26] PROBLEM - Disk space on elastic1009 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 20145 MB (3% inode=99%): [15:13:35] True... [15:13:36] (03CR) 10BBlack: [C: 031] "LGTM, needs review" [dns] - 10https://gerrit.wikimedia.org/r/158382 (owner: 10BBlack) [15:13:46] hoo, ping Coren when he shows up, he's the RT boss of the week [15:13:57] I'm around and listening to pings. [15:14:02] andrewbogott: Not going to be around for long [15:14:02] * Coren reads scrollback. [15:14:10] * hoo actually is on vacation :P [15:14:11] andrewbogott: sync-common should fix MWSearch [15:14:18] (deployed on tin etc) [15:14:45] PROBLEM - puppet last run on mw1121 is CRITICAL: CRITICAL: Puppet has 1 failures [15:15:56] hoo: But andrewbogott is correct; we don't normally open access request tickets. [15:16:08] (03PS1) 10Chad: CirrusSearch to primary on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158393 [15:16:18] Coren: Not even to my own ticket(s)? [15:16:35] hoo: In this case, 'the public' == you, since I believe the goal is that we can freely discuss the merits of the requestor :) [15:16:41] https://git.wikimedia.org/raw/operations%2fdumps.git/HEAD/xmlfileutils%2fREADME [15:16:45] ^ Looks like it's fatal-ing [15:16:46] hoo: Well, the point of the three day wait period is to allow free discussion of the request. [15:17:16] I see [15:17:23] someone might want to say "no, because hoo is a terrible person", and not have you see it :) [15:17:30] Coren: Can you then CC apergos in? [15:17:33] bblack: uh :D [15:17:40] https://git.wikimedia.org/raw/operations%2fdumps.git [15:17:43] internal error! 
[15:18:04] heh [15:18:08] (03CR) 10Chad: [C: 032] CirrusSearch to primary on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158393 (owner: 10Chad) [15:18:12] (03Merged) 10jenkins-bot: CirrusSearch to primary on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158393 (owner: 10Chad) [15:18:13] If someone has an email address for JeremyB, could you pm me please? [15:18:21] gerrit was probably slain by trying to review my awesome dns patch [15:18:24] hoo: I can; though most ops look at access-requests tickets anyways. [15:18:46] !log demon Synchronized wmf-config/InitialiseSettings.php: plwiki gets Cirrus (duration: 00m 06s) [15:18:51] Logged the message, Master [15:18:58] Coren: Ok, might be better, though [15:19:18] <^d> osmium is giving permission errors on sync. [15:19:49] <^d> http://p.defau.lt/?b8x1FmeHEmlkXrwmVEZ3UQ [15:19:54] ^d: Known, see ops-l [15:19:56] hoo: No worries. I CC'ed Ariel. [15:20:41] <^d> hoo: Yeah, I skimmed that :) [15:21:33] ^d: by the way, search seems just fine on the All New wikitech, so I never got around to regenerating like you asked [15:21:48] <^d> No big deal if it's working. [15:22:35] <^d> Grr, where's manybubbles? [15:22:56] <^d> cpu crept up about 2 hours ago. [15:23:23] Reedy: my remaining issues are… 1) What's the deal with the jobqueue? 2) Quieting down all those attempts wikitech makes to hit other Dbs 3) Answering all of springle's other questions [15:24:25] 1) Is the only one of semi-urgent concern. I mean, it seems to be working, but… why? [15:25:11] <^d> manybubbles: plwiki is live. cluster barely noticed, but I noticed something. [15:25:12] <^d> http://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=cpu_report&s=by+name&c=Elasticsearch+cluster+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [15:25:16] <^d> cpu went up ~2h ago [15:25:30] (03CR) 10Yuvipanda: [C: 031] "Confirmed with Dan that this is not needed." 
[puppet] - 10https://gerrit.wikimedia.org/r/158365 (owner: 10Yuvipanda) [15:25:34] ^d: my fault I think [15:25:41] I'm reindexing some of the final wikis [15:25:49] _joe_: matanya ^ vumi can be removed [15:25:49] <^d> Ah ok. [15:26:01] thanks YuviPanda [15:26:10] ^d: and I'm not doing it with the kid gloves that I've been using for a while because I'm tired of waiting [15:26:16] <_joe_> ok, +2 ing that [15:26:17] (03CR) 10Yuvipanda: "Confirmed with Dan Foy. Quoting," [puppet] - 10https://gerrit.wikimedia.org/r/158365 (owner: 10Yuvipanda) [15:26:55] <_joe_> YuviPanda: the whole thing? [15:27:04] _joe_: yup [15:27:15] <_joe_> wow, seems scary [15:27:24] heh [15:27:31] <_joe_> can we merge that tomorrow morning? [15:27:34] yeah! [15:27:44] _joe_: I'm going away in an hour as well, so makes sense. [15:27:55] <_joe_> I'd really like to take a break in a few [15:28:03] _joe_: yeah, makes sense [15:28:06] I'll poke tomorrow [15:28:09] <_joe_> :) [15:28:23] _joe_: but I guess you don't have to fix the redis issue then :) [15:28:33] we might have to stop the services in silver by hand, I guess [15:28:37] andrewbogott: Pastebin runJobs again please? [15:28:39] <_joe_> oh I already did [15:29:23] here's a fresh one: https://dpaste.de/onhU [15:29:32] _joe_: oh, heh [15:29:48] _joe_: heh, merge conflict because of that, I guess [15:29:55] <_joe_> yes [15:30:12] should be easy enough to fix tomorrow [15:30:17] let me email ops@ about vumi [15:30:46] (03PS1) 10Ottomata: Installing libbz2-dev on stat servers. RT 8278 [puppet] - 10https://gerrit.wikimedia.org/r/158394 [15:32:17] (03PS2) 10Ottomata: Installing libbz2-dev on stat servers. RT 8278 [puppet] - 10https://gerrit.wikimedia.org/r/158394 [15:32:30] (03CR) 10Ottomata: [C: 032 V: 032] Installing libbz2-dev on stat servers. 
RT 8278 [puppet] - 10https://gerrit.wikimedia.org/r/158394 (owner: 10Ottomata) [15:32:46] $this->checkDelay = !empty( $params['checkDelay'] ); [15:32:46] if ( $this->checkDelay && !$this->supportsDelayedJobs() ) { [15:32:46] throw new MWException( __CLASS__ . " does not support delayed jobs." ); [15:32:46] } [15:32:57] andrewbogott: So something else is injecting a checkDelay parameter [15:32:59] The question is what [15:33:26] do we not want to support delayed jobs? [15:33:31] hmm, this looks concerning: [15:33:31] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=nickel&service=RAID [15:33:45] RECOVERY - puppet last run on mw1121 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:33:51] andrewbogott: We could, but we don't need to [15:34:18] So, the fact that there was a 'runJobs' cron on the old wikitech -- does that mean that we definitely were supporting delayed jobs there? [15:34:23] Or is a delayed job something else? [15:34:35] It's something else [15:34:39] Ah, ok. [15:34:44] It seems to be like AT for the jobqueue [15:34:47] ie don't run this before X [15:34:52] X being a time [15:34:52] So even without delayed jobs we'll still need the cron [15:34:54] ah, Coren noticed already [15:34:55] https://rt.wikimedia.org/Ticket/Display.html?id=8252 [15:34:58] oook, this makes more sense now [15:35:05] cmjohnson1: have you seen that? [15:35:36] yes ^ we need to schedule downtime [15:35:44] andrewbogott: Ah, CirrusSearch [15:36:21] CirrusSearch-common.php:$wgJobTypeConf['cirrusSearchLinksUpdateSecondary'] = array( 'checkDelay' => true ) + [15:36:21] also echo, right? [15:36:27] Yeah, I fixed the echo one though ;) [15:36:40] ok :) [15:37:35] cmjohnson1: Isn't that a box that can hotswap disks? 
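A rough Python sketch of what the quoted constructor check and the "don't run this before X" description amount to. It is not the MediaWiki implementation: the class names, `push` and `pop_ready` are invented for illustration, and the only behaviour taken from the log is that a queue configured with checkDelay must belong to a backend that supports delayed jobs (the DB-backed one does not, the redis one does).

```python
import time


class JobQueue:
    """Toy queue base class; names are hypothetical."""

    supports_delayed_jobs = False

    def __init__(self, check_delay=False):
        self.check_delay = bool(check_delay)
        # Mirrors the quoted check: asking for delay handling on a
        # backend that cannot do it fails at construction time.
        if self.check_delay and not self.supports_delayed_jobs:
            raise RuntimeError(
                f"{type(self).__name__} does not support delayed jobs."
            )
        self._jobs = []

    def push(self, job, release_at=0.0):
        # release_at is the earliest time the job may run ("X").
        self._jobs.append((release_at, job))

    def pop_ready(self, now=None):
        now = time.time() if now is None else now
        # Without check_delay every job is runnable immediately.
        ready = [j for t, j in self._jobs if not self.check_delay or t <= now]
        self._jobs = [(t, j) for t, j in self._jobs if self.check_delay and t > now]
        return ready


class DbLikeQueue(JobQueue):
    supports_delayed_jobs = False


class RedisLikeQueue(JobQueue):
    supports_delayed_jobs = True


if __name__ == "__main__":
    try:
        DbLikeQueue(check_delay=True)
    except RuntimeError as err:
        print(err)
```

Seen this way, a per-job-type config entry that sets checkDelay is enough to trigger the exception on a wiki that falls back to the DB queue, even if nothing on that wiki actually needs delayed jobs.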
[15:37:49] (PS1) Reedy: Wrap a few more checkDelay jobs in $wmgUseClusterJobqueue [mediawiki-config] - https://gerrit.wikimedia.org/r/158397
[15:37:53] (CR) Ottomata: [C: 1] Remove all vumi related code [puppet] - https://gerrit.wikimedia.org/r/158365 (owner: Yuvipanda)
[15:38:18] andrewbogott: sync-common and try again :)
[15:38:20] coren: no..disks are internal
[15:38:26] I hate that setup
[15:38:28] cmjohnson1: Bah.
[15:39:15] Reedy: new error! Exception from line 758 of /usr/local/apache/common-local/php-1.24wmf15/includes/jobqueue/JobQueueDB.php: DBConnectionError:DB connection error: Access denied for user 'wikiadmin'@'virt1000.wikimedia.org' (using password: YES) (208.80.154.18)
[15:39:31] heh
[15:39:49] springle: What did you set at the right hand side of the @ for wikiadmin on virt1000? :)
[15:40:04] * springle checks
[15:40:07] thanks
[15:40:16] (CR) Reedy: [C: 2] Wrap a few more checkDelay jobs in $wmgUseClusterJobqueue [mediawiki-config] - https://gerrit.wikimedia.org/r/158397 (owner: Reedy)
[15:40:20] (Merged) jenkins-bot: Wrap a few more checkDelay jobs in $wmgUseClusterJobqueue [mediawiki-config] - https://gerrit.wikimedia.org/r/158397 (owner: Reedy)
[15:40:56] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 14s)
[15:41:02] Logged the message, Master
[15:41:51] Reedy: %
[15:42:30] springle: thanks
[15:42:37] That sounds like it's using the wrong password then
[15:43:13] is that error connecting to virt1000 mysql, or some other master
[15:43:22] 208.80.154.18 is virt1000
[15:43:26] so it's to itself
[15:44:04] I wonder if the refactoring done for the ldap passwords might've upset it..
[15:44:49] Ah, yeah
[15:45:00] Can we update the wikiuser password on virt1000?
[15:45:05] to the same as production?
[15:45:23] then we can remove $wgDBpassword from WikitechPrivateSettings.php
[15:45:40] which is overriding whether CLI || Apache
[15:45:47] ok
[15:45:54] fine w/me
[15:46:14] * Reedy looks to update in puppet
[15:46:20] waiting for git pull
[15:48:14] (PS1) Reedy: Remove $wgDBpassword from wikitech private settings [puppet] - https://gerrit.wikimedia.org/r/158400
[15:48:15] Reedy: done
[15:49:00] andrewbogott: just need ^ merging and pushing to unbreak wikitech
[15:49:23] _joe_: puppet on silver can be fixed with
[15:49:31] (CR) Andrew Bogott: [C: 2] Remove $wgDBpassword from wikitech private settings [puppet] - https://gerrit.wikimedia.org/r/158400 (owner: Reedy)
[15:49:37] andrewbogott: You can delete $wgDBpassword from WikitechPrivateSettings.php in the meantime (as puppet won't overwrite it there)
[15:49:38] _joe_: i see now that silver will be decom'd, but the redis-python issue should be solved more comprehensively
[15:50:09] <_joe_> ori: well, probably yes.
[15:50:18] <_joe_> I solved a couple other problems too
[15:50:22] * andrewbogott runs puppet on tin
[15:50:50] <_joe_> but I can't find anything to fix the package puppet problem that satisfies me
[15:51:02] Reedy: why wouldn't puppet overwrite it there?
[15:51:05] check out https://gerrit.wikimedia.org/r/#/c/158262/ :)
[15:51:17] <_joe_> I was looking at it
[15:51:28] 14 13 # Drop this file onto the mediawiki deployment host so that the passwords are deployed
[15:51:29] 15 14 file { '/a/common/private/WikitechPrivateSettings.php':
[15:51:34] I presumed because of that...
[15:51:36] (PS1) Manybubbles: Lower throttle for Cirrus template update jobs [mediawiki-config] - https://gerrit.wikimedia.org/r/158401
[15:51:46] springle: thanks
[15:51:56] <_joe_> ori: it's "cleaner" to look at, the main problem still persists
[15:52:04] <_joe_> ori: +1 anyway
[15:52:17] _joe_: how so?
[15:53:09] !log andrew Synchronized private/WikitechPrivateSettings.php: (no message) (duration: 00m 01s)
[15:53:25] <_joe_> ori: a class wrapping a package define for each package you use?
[15:53:31] woo
[15:53:35] andrewbogott: try runJobs again :)
[15:53:36] (PS2) Manybubbles: Lower throttle for Cirrus template update jobs [mediawiki-config] - https://gerrit.wikimedia.org/r/158401
[15:53:44] Logged the message, Master
[15:53:50] <_joe_> (I think your solution is good btw)
[15:53:51] _joe_: just for a frequently-used client library that is part of many different software stacks
[15:54:03] There were a lot of jobs!
[15:54:08] heh
[15:54:12] how dare people use the wiki ;)
[15:54:13] <_joe_> ori: yes I said, this solution is very good in _this_ case
[15:54:18] (PS3) Manybubbles: Lower throttle for Cirrus template update jobs [mediawiki-config] - https://gerrit.wikimedia.org/r/158401
[15:54:25] <_joe_> I'd like to have a less-sucky package def
[15:54:36] RECOVERY - Puppet freshness on osmium is OK: puppet ran at Thu Sep 4 15:54:34 UTC 2014
[15:57:10] Reedy: can i reduce wikiuser grants on virt1000 now?
[15:57:24] springle: yup :D
[15:57:28] to match wikiuser everywhere else
[15:57:28] springle: Should be safe, yeah
[15:57:31] cool
[15:58:00] Reedy: Still running! But, seems good; I'll check again in an hour to make sure the cron is properly draining the queue.
[15:58:18] So, now, why are we still hitting other databases...
[16:00:12] andrewbogott: I suspect it's numerous extensions wikitech doesn't need
[16:00:26] EventLogging seems to be wanting to hit meta, for example
[16:00:35] Reedy: yep, I'm going through the list trying to find things that weren't in the old setup
[16:00:51] I think there'll be some we should probably leave
[16:00:52] hm, eventlogging was enabled in the old wikitech
[16:01:02] But different config :)
[16:01:20] probably, although there aren't any specific config settings that I see
[16:01:39] https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/CommonSettings.php#L2562-L2568
[16:03:53] PROBLEM - HTTP on zirconium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:04:08] Reedy: well, that's not right :)
[16:04:26] heh
[16:04:29] I'm guessing it will default to the local db and such
[16:06:27] (PS3) BBlack: Move all geoip-based resolution to DYNA [dns] - https://gerrit.wikimedia.org/r/158382
[16:07:10] (CR) BBlack: [C: 1] "(PS3 just removes the old config-geo resources that are no longer used)" [dns] - https://gerrit.wikimedia.org/r/158382 (owner: BBlack)
[16:08:07] (PS1) Ori.livneh: wmflib: add require_package() from vagrant [puppet] - https://gerrit.wikimedia.org/r/158404
[16:09:41] :-)
[16:10:42] RECOVERY - HTTP on zirconium is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 0.010 second response time
[16:12:21] Reedy: here are all the extensions that are running now but weren't on the old wikitech: https://dpaste.de/HBYC
[16:12:26] Long list! Probably not very useful
[16:15:24] There's some that are probably useful to keep enabled
[16:15:30] (CR) Giuseppe Lavagetto: [C: 1] "LGTM, just a small remark: you should probably use "require" rather than "include", as it sets an inclusion order that can be important." [puppet] - https://gerrit.wikimedia.org/r/158262 (owner: Ori.livneh)
[16:17:00] (CR) Giuseppe Lavagetto: [C: -2] "it would be +2 but...
we need to port this to operations/puppet :)" [apache-config] - https://gerrit.wikimedia.org/r/158315 (owner: Ori.livneh)
[16:17:37] cherry pick ;)
[16:17:56] <_joe_> Reedy: ori can do that :)
[16:18:10] andrewbogott: Shall I put it in etherpad and we can decide what to keep and what to disable again?
[16:18:40] Reedy: maybe? I feel like we should default in favor of keeping things unless we actually see a problem.
[16:18:45] Since… smaller diff is better
[16:19:15] https://etherpad.wikimedia.org/p/wikitech-extensions
[16:19:37] GlobalBlocking is another doing cross db requests
[16:20:23] hmm. I fixed MWSearch but it didn't use it before....
[16:21:31] (PS1) Andrew Bogott: Default eventlogging settings for labswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/158405
[16:26:31] Reedy: I'm not sure that ^ is proper syntax...
[16:27:19] (PS1) Reedy: Conditionally enable GlobalBlocking [mediawiki-config] - https://gerrit.wikimedia.org/r/158406
[16:27:42] PROBLEM - Puppet freshness on mw1053 is CRITICAL: Last successful Puppet run was Thu 04 Sep 2014 00:21:29 UTC
[16:27:51] (CR) Reedy: [C: 1] Default eventlogging settings for labswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/158405 (owner: Andrew Bogott)
[16:28:06] (CR) Reedy: [C: 2] Default eventlogging settings for labswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/158405 (owner: Andrew Bogott)
[16:28:11] (Merged) jenkins-bot: Default eventlogging settings for labswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/158405 (owner: Andrew Bogott)
[16:28:30] (PS2) Phuedx: Enable the Task Recommendations experiment v1 [mediawiki-config] - https://gerrit.wikimedia.org/r/156282
[16:28:40] (CR) Andrew Bogott: [C: 1] Conditionally enable GlobalBlocking [mediawiki-config] - https://gerrit.wikimedia.org/r/158406 (owner: Reedy)
[16:28:44] (CR) Phuedx: [C: -1] Enable the Task Recommendations experiment v1 [mediawiki-config] -
https://gerrit.wikimedia.org/r/156282 (owner: Phuedx)
[16:28:50] (PS2) Reedy: Conditionally enable GlobalBlocking [mediawiki-config] - https://gerrit.wikimedia.org/r/158406
[16:28:58] (PS1) Ori.livneh: Update path references in Apache configs for /srv/mediawiki [puppet] - https://gerrit.wikimedia.org/r/158407
[16:30:07] what changed between patch one and two for global blocking?
[16:30:45] rebase
[16:31:10] ok
[16:31:42] (PS3) Phuedx: Enable the Task Recommendations experiment v1 [mediawiki-config] - https://gerrit.wikimedia.org/r/156282
[16:32:00] (CR) Ori.livneh: "Heh. Re-done in https://gerrit.wikimedia.org/r/#/c/158407/" [apache-config] - https://gerrit.wikimedia.org/r/158315 (owner: Ori.livneh)
[16:36:23] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 14s)
[16:36:33] RECOVERY - Disk space on elastic1009 is OK: DISK OK
[16:37:53] (CR) Giuseppe Lavagetto: [C: -2] "Tried to cherry-pick this in beta, whenever trying to compile this manifest the puppet master would blow up eating 1.6 GB of memory. Altho" [puppet] - https://gerrit.wikimedia.org/r/157823 (owner: Giuseppe Lavagetto)
[16:40:59] (PS1) Andrew Bogott: Disable a bunch of extensions on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/158409
[16:44:01] (PS2) Reedy: Disable a bunch of extensions on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/158409 (owner: Andrew Bogott)
[16:44:14] (CR) Reedy: [C: 2] Disable a bunch of extensions on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/158409 (owner: Andrew Bogott)
[16:44:18] (Merged) jenkins-bot: Disable a bunch of extensions on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/158409 (owner: Andrew Bogott)
[16:44:54] !log reedy Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 17s)
[16:44:58] Logged the message, Master
[16:45:25] springle, is wikitech still reaching out to prod databases?
[16:46:16] if so, can you tell which cluster/db?
[16:47:55] Middle of the night for springle, he's probably gone.
[16:48:30] i wonder if they were surfacing in fluorine:/a/mw-log/dberror.log
[16:48:40] if so, should be easy enough to check
[16:48:54] ok, I'll look if you aren't already
[16:50:05] tail -f dberror.log | grep labswiki
[16:50:06] mmm
[16:50:07] Thu Sep 4 9:39:42 UTC 2014 virt1000 labswiki Error connecting to 10.64.16.30: :real_connect(): (28000/1045): Access denied for user 'wikiuser'@'208.80.154.18' (using password: YES)
[16:50:30] that's from quite a while ago
[16:50:38] indeed
[16:50:48] nothing since :23
[16:51:09] and those are snapshot trying to access labs
[16:51:59] hm… snapshots would be nice but I can't guess if that's possible
[16:52:40] Thu Sep 4 16:23:34 UTC 2014 snapshot1004 labswiki Error connecting to 208.80.154.18: :real_connect(): (HY000/2003): Can't connect to MySQL server on '208.80.154.18' (4)
[16:52:45] That looks like a firewall issue to me
[16:52:57] firewall/ports mysql is listening on
[16:53:13] uh, ports/sockets
[16:55:25] andrewbogott: IIRC default is like (localhost/127.0.0.1):3306
[16:55:30] So it's not binding to its public ip
[16:55:44] but that might not be the case with our mysql config
[16:55:49] And in which case, it could just be routing
[16:56:10] I think I'm happy to dump all this in an email to springle and let him sort it out (or decide not to)
[16:56:30] heh
[16:56:41] haven't people been wanting dumps from wikitech for a while? :)
[16:56:58] Oh, I don't know. Not that I'd heard.
[16:57:18] Carmela: ?
[16:57:19] ^^
[16:58:08] I love that thunderbird beachballs right after I compose a new email but not until after I type the first sentence
[16:58:25] I wonder if udp packets can make it to the irc host
[16:58:35] s/irc host/argon/
[16:59:19] andrewbogott: Anyway, what else is broken?
[16:59:37] Reedy: Maybe nothing? Nothing on my list at least.
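The `tail -f dberror.log | grep labswiki` check at 16:50:05 can be taken one step further by grouping the matching entries by source host, target address and MySQL error number, which separates the two cases seen above (1045 access denied from virt1000, 2003 cannot connect from snapshot1004). The following is a rough Python sketch. The regular expression is inferred only from the two log lines quoted in this channel, so the field layout is an assumption, and the function name `summarise` is made up for the example.

```python
import re
from collections import Counter

# Layout assumed from the two quoted lines:
#   <timestamp> <source host> <wiki> Error connecting to <target>: ... (<sqlstate>/<errno>)
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3}\s+\d+\s+[\d:]+ UTC \d{4}) "
    r"(?P<host>\S+) (?P<wiki>\S+) "
    r"Error connecting to (?P<target>[\d.]+): "
    r".*?\((?P<sqlstate>\w+)/(?P<errno>\d+)\)"
)


def summarise(lines, wiki="labswiki"):
    """Count connection errors for one wiki, keyed by (source host, target, errno)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None or match.group("wiki") != wiki:
            continue  # skip other wikis and lines in a different format
        key = (match.group("host"), match.group("target"), int(match.group("errno")))
        counts[key] += 1
    return counts
```

Fed the lines of the error log, the returned counter shows at a glance whether wikitech itself (virt1000) is still reaching out to other database hosts, or whether the remaining entries are other hosts failing to reach 208.80.154.18.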
[16:59:39] * Reedy is just making 1.24wmf20
[17:00:02] (PS3) Ori.livneh: Consolidate python-redis package declarations in ::redis::client::python [puppet] - https://gerrit.wikimedia.org/r/158262
[17:00:15] (CR) Aaron Schulz: [C: 1] Lower throttle for Cirrus template update jobs [mediawiki-config] - https://gerrit.wikimedia.org/r/158401 (owner: Manybubbles)
[17:00:43] Reedy: I'm going on holiday tomorrow for a few days. So I vote for us letting this sit for now. I'll schedule a window for the end of next week when we can move wikitech to a newer branch.
[17:01:06] sounds good to me :)
[17:01:19] I wasn't suggesting you moved wikitech to the new 1.24wmf20 ;)
[17:01:47] I know :) But given how many people have said "wmf15, really? Really?" in the last day...
[17:01:49] Bah. Use master!
[17:01:51] seems like we should plan on an update
[17:03:08] Coren: 1.24wmf20 is master
[17:03:11] for a few minutes...
[17:03:12] :D
[17:12:56] legoktm: Why are we still branching WikimediaShopLink? ;)
[17:13:01] oops
[17:13:11] well, the repository is empty
[17:13:33] when I get my shell back I'll remove it
[17:14:07] what happened to your shell?
[17:14:20] checking out a lot of fucking extensions
[17:14:28] ahaha
[17:14:37] (PS3) Dduvall: WIP Labs: Varnish backend/director for isolated security audits [puppet] - https://gerrit.wikimedia.org/r/158016 (https://bugzilla.wikimedia.org/70181)