[00:00:04] RoanKattouw, ^d, marktraceur, MaxSem, MaxSem: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141205T0000). Please do the needful. [00:00:40] * MaxSem takes it [00:01:05] MaxSem: Here. [00:01:14] * hoo around [00:01:30] * YuviPanda is around too [00:01:40] (PS1) BryanDavis: beta: enable php processing for bits/w [puppet] - https://gerrit.wikimedia.org/r/177700 [00:01:43] <^d> I'm around, regrettably. [00:02:23] YuviPanda: https://gerrit.wikimedia.org/r/#/c/177700/ for that beta bits problem. [00:03:00] bd808: some day, we hopefully won't have this many different configs :) [00:03:06] bd808: let me cherry-pick and test [00:04:06] bd808: I'm actually thinking of calling it a... morning? I'll take care of this in the morning if there's nobody else at it [00:04:20] YuviPanda: yeah no worries [00:04:23] !log maxsem Synchronized php-1.25wmf11/extensions/VisualEditor/: https://gerrit.wikimedia.org/r/#q,177643,n,z (duration: 00m 07s) [00:04:28] Logged the message, Master [00:04:29] James_F, ^^^ [00:04:34] If I really cared I'd test myself :) [00:04:41] bd808: heh :) [00:04:51] (PS1) Andrew Bogott: Use ext4 everywhere on the new hp virt nodes. [puppet] - https://gerrit.wikimedia.org/r/177701 [00:05:01] bd808: I'm going to uncherry-pick it now, I think. if it *did* fail, it would bork betalabs :) [00:05:16] done [00:06:02] o/ [00:06:48] MaxSem: Fixed. Thanks! [00:07:15] Which database servers do we use? cos I'm seeing intermittent behavior problems with a "set session group_concat_max_len=" query. [00:08:51] <^d> awight: Define "we" [00:09:31] !log maxsem Synchronized php-1.25wmf11/extensions/Wikidata/: https://gerrit.wikimedia.org/r/#q,177693,n,z (duration: 00m 14s) [00:09:36] Logged the message, Master [00:09:52] ^d: hehe. 
WMF, in particular metawiki [00:09:53] !log maxsem Synchronized php-1.25wmf10/extensions/Wikidata/: https://gerrit.wikimedia.org/r/#q,177693,n,z (duration: 00m 11s) [00:09:56] Logged the message, Master [00:10:08] awight, should be in mediawiki-config [00:10:10] <^d> awight: metawiki's s3 I think. [00:10:15] <^d> db.php in mediawiki-config. [00:10:33] hoo, ^^^ [00:10:36] <^d> db-eqiad.php [00:10:38] <^d> Rather [00:10:47] MaxSem: Thanks [00:10:58] <^d> awight: s7, actually. [00:11:01] well the freakish part is that the results flap between correct and incorrect. [00:11:14] happens [00:11:25] maybe "set session" is not to be trusted?? [00:11:27] ...when there are more than one slave [00:11:43] (CR) coren: [C: 2] "ext4 ftw" [puppet] - https://gerrit.wikimedia.org/r/177701 (owner: Andrew Bogott) [00:12:04] (PS2) coren: Use ext4 everywhere on the new hp virt nodes. [puppet] - https://gerrit.wikimedia.org/r/177701 (owner: Andrew Bogott) [00:13:05] well... anyone know of configuration differences between the s7 cluster which would cause "set session" to be ignored? Meanwhile, I'll revert my patch... [00:13:30] <^d> No, they're pretty much all the same afaik. [00:13:55] <^d> But considering you can end up with more than one database connection over the course of a request, it doesn't surprise me that a session-related setting might not persist. [00:13:57] (CR) MaxSem: [C: 2] Expand ConfirmEdit disabling test to all group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/177494 (owner: Chad) [00:14:10] (Merged) jenkins-bot: Expand ConfirmEdit disabling test to all group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/177494 (owner: Chad) [00:14:42] ^d: really? I grab a wfDatabase object--that has underlying stuff that might be reconnecting? 
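Editor's aside on the "set session" point above: a session-scoped setting lives on one connection, so a second connection to the same database does not see it. The sketch below is only an analogy, assuming SQLite as a stand-in for the MySQL servers in the log and using its per-connection cache_size pragma in place of group_concat_max_len; the function name is made up for illustration.

```python
import os
import sqlite3
import tempfile


def session_setting_demo():
    """Set a per-connection option on one connection and read it from two."""
    # A temporary file lets both connections open the *same* database.
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        conn_a = sqlite3.connect(path)
        conn_b = sqlite3.connect(path)
        # cache_size applies only to the connection that sets it.
        conn_a.execute("PRAGMA cache_size = 1234")
        seen_a = conn_a.execute("PRAGMA cache_size").fetchone()[0]
        seen_b = conn_b.execute("PRAGMA cache_size").fetchone()[0]
        conn_a.close()
        conn_b.close()
        return seen_a, seen_b
    finally:
        os.remove(path)
```

If the code that sets the option and the code that runs the query end up on different connections, the query runs with the default value, which would look like results flapping between correct and incorrect.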
[00:15:03] hoo, Fatal error: Argument 1 passed to Wikibase\Client\OtherProjectsSitesProvider::__construct() must implement interface SiteStore, SiteList given in /srv/mediawiki/php-1.25wmf10/extensions/Wikidata/extensions/Wikibase/client/includes/OtherProjectsSitesProvid [00:15:04] <^d> wfGetDB( DB_SLAVE ) and wfGetDB( DB_MASTER ) might return different connections. [00:15:04] sorry, "wfGetDB" [00:15:13] <^d> Multiple DB_SLAVE calls should return the same connection. [00:15:23] yeah but I'm ... oh I see. my code is buggy :) [00:15:28] MaxSem: Saw that... I guess that happened mid-deploy [00:16:17] (PS4) Yuvipanda: beta: Fix http uptime check [puppet] - https://gerrit.wikimedia.org/r/177692 [00:17:04] !log maxsem Synchronized wmf-config/: https://gerrit.wikimedia.org/r/#/c/177494/ (duration: 00m 07s) [00:17:10] Logged the message, Master [00:18:23] <^d> MaxSem: Mine looks good, thx [00:19:20] hoo, any changes in metrics? [00:19:40] Surprisingly not [00:21:31] (CR) MaxSem: [C: 2] Set $wgCentralAuthPreventUnattached = true on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/177660 (owner: Legoktm) [00:21:39] (Merged) jenkins-bot: Set $wgCentralAuthPreventUnattached = true on mediawiki.org [mediawiki-config] - https://gerrit.wikimedia.org/r/177660 (owner: Legoktm) [00:22:15] !log maxsem Synchronized wmf-config/: https://gerrit.wikimedia.org/r/#/c/177660/ (duration: 00m 06s) [00:22:19] legoktm, ^^^^ [00:22:19] Logged the message, Master [00:22:20] * legoktm tests [00:24:42] MaxSem: works :) [00:24:48] wee [00:26:49] mutante|travel, jzerebecki - yt? [00:28:03] (CR) MaxSem: [C: 2] Change ru.wikinews.org to HTTPS only. [mediawiki-config] - https://gerrit.wikimedia.org/r/173083 (owner: JanZerebecki) [00:28:13] (Merged) jenkins-bot: Change ru.wikinews.org to HTTPS only. 
[mediawiki-config] - https://gerrit.wikimedia.org/r/173083 (owner: JanZerebecki) [00:28:42] !log maxsem Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/173083/ (duration: 00m 05s) [00:28:47] Logged the message, Master [00:31:53] (CR) MaxSem: [C: 2] Disable gadgets caching [mediawiki-config] - https://gerrit.wikimedia.org/r/177600 (owner: MaxSem) [00:32:00] (Merged) jenkins-bot: Disable gadgets caching [mediawiki-config] - https://gerrit.wikimedia.org/r/177600 (owner: MaxSem) [00:32:35] !log maxsem Synchronized wmf-config/: https://gerrit.wikimedia.org/r/#/c/177600/ (duration: 00m 06s) [00:32:38] Logged the message, Master [00:32:47] woo [00:33:25] * legoktm still has his gadgets :) [00:34:20] (PS1) Hoo man: Revert "Expand ConfirmEdit disabling test to all group0 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/177708 [00:34:25] MaxSem: ^ [00:34:34] waaa? [00:34:44] Look at Mediawiki.org's RC [00:34:47] totally flooded [00:34:47] (CR) MaxSem: [C: 2] Revert "Expand ConfirmEdit disabling test to all group0 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/177708 (owner: Hoo man) [00:34:56] (Merged) jenkins-bot: Revert "Expand ConfirmEdit disabling test to all group0 wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/177708 (owner: Hoo man) [00:35:21] !log maxsem Synchronized wmf-config/: https://gerrit.wikimedia.org/r/#/c/177708/ (duration: 00m 07s) [00:35:24] Logged the message, Master [00:35:38] Wow... I never knew our captchas are that effective [00:35:55] lolololololololololololololololol [00:36:14] Spam bots are obviously very bad [00:36:42] hoo: well, we did just announce on wikitech-l how to beat our captcha's :) [00:36:43] Wow, I knew our captchas set a minor, low bar... but that's kinda telling [00:37:08] SimpleAntiSpam still blocks a good amount of spam :P [00:37:09] ebernhardson: Yeah sure... 
but it's still work, and they don't seem to bother [00:37:13] (PS5) Yuvipanda: beta: Fix http uptime check [puppet] - https://gerrit.wikimedia.org/r/177692 [00:37:35] hoo: true, and i'm sure most spammers would have tried some sort of OCR anyways without us saying a thing [00:38:12] but i suppose i was thinking it might just be some kid that saw the message and was curious. [00:38:23] (CR) Yuvipanda: [C: 2] beta: Fix http uptime check [puppet] - https://gerrit.wikimedia.org/r/177692 (owner: Yuvipanda) [00:39:14] anybody wants to review https://gerrit.wikimedia.org/r/177665 ? [00:39:38] I want to reduce logspam [00:40:55] (CR) Hoo man: [C: 1] "Looks good at a glance" [mediawiki-config] - https://gerrit.wikimedia.org/r/177665 (owner: MaxSem) [00:41:42] (CR) MaxSem: [C: 2] search-redirect: check for parameter existence consistently [mediawiki-config] - https://gerrit.wikimedia.org/r/177665 (owner: MaxSem) [00:41:51] (Merged) jenkins-bot: search-redirect: check for parameter existence consistently [mediawiki-config] - https://gerrit.wikimedia.org/r/177665 (owner: MaxSem) [00:42:14] !log maxsem Synchronized search-redirect.php: https://gerrit.wikimedia.org/r/#/c/177665/ (duration: 00m 06s) [00:42:21] Logged the message, Master [00:42:50] !log maxsem Synchronized search-redirect.php: Second attempt (duration: 00m 05s) [00:42:53] Logged the message, Master [00:42:58] (CR) EBernhardson: "looks fine, random suggestion for code readability in the future." 
(1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/177665 (owner: MaxSem) [00:43:06] (PS1) Yuvipanda: beta: typofix in shinken config [puppet] - https://gerrit.wikimedia.org/r/177717 [00:43:14] (CR) jenkins-bot: [V: -1] beta: typofix in shinken config [puppet] - https://gerrit.wikimedia.org/r/177717 (owner: Yuvipanda) [00:43:20] grmbl, mw1009 glitched [00:43:22] (PS2) Yuvipanda: beta: typofix in shinken config [puppet] - https://gerrit.wikimedia.org/r/177717 [00:43:49] YuviPanda: could I ask you to nuke the restbase/ hierachy in graphite? [00:44:07] gwicke: sure, give me a minute? [00:44:26] (CR) Yuvipanda: [C: 2] beta: typofix in shinken config [puppet] - https://gerrit.wikimedia.org/r/177717 (owner: Yuvipanda) [00:44:53] YuviPanda: no hurry; can also open a ticket if you prefer [00:45:05] gwicke: yeah, that would also be nice :) [00:47:36] (CR) MaxSem: search-redirect: check for parameter existence consistently (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/177665 (owner: MaxSem) [00:51:14] RECOVERY - Kafka Broker Messages In Per Second on tungsten is OK: OK: No anomaly detected [00:51:22] (PS1) Yuvipanda: nagios_common: Rename check_http_generic appropriately [puppet] - https://gerrit.wikimedia.org/r/177719 [00:51:33] RECOVERY - Kafka Broker Messages In Per Second on graphite1001 is OK: OK: No anomaly detected [00:51:49] bblack: ^ *kind* of touches varnish but not really (just renaming the check_command) [00:51:59] (PS2) Yuvipanda: nagios_common: Rename check_http_generic appropriately [puppet] - https://gerrit.wikimedia.org/r/177719 [00:52:02] (CR) EBernhardson: "are" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/177665 (owner: MaxSem) [00:56:00] <^demon|away> Aw fuck. [00:56:54] ^demon|away, ? [00:57:07] <^demon|away> captchas turned back on because people freaked out about spam :( [00:57:26] oh, we turned them off? 
yea expected :) [00:57:37] ^demon|away: Because it was a hell lot of spam [00:57:50] <^demon|away> Depends on your tolerance level I suppose :) [00:59:07] * ebernhardson re-assigns ^demon to huggle duty ;) [00:59:41] * ^demon|away was blocking and deleting before huggle was a twinkle in someone's eye. [01:00:04] awight, ejegg: Respected human, time to deploy CentralNotice (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141205T0100). Please do the needful. [01:00:23] <^demon|away> ebernhardson: The original "quick deletion userscript" with dropdown reasons was my invention :p [01:00:59] :) [01:09:30] error from https://en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/Module:HtmlBuilder&limit=500&from=49816&back=39501 A database query error has occurred. This may indicate a bug in the software. Function: SpecialWhatLinksHere::showIndirectLinks Error: 2013 Lost connection to MySQL server during query (10.64.48.28) [01:12:03] ergh, lots of slow query timeouts [01:31:34] !log awight Synchronized php-1.25wmf11/extensions/CentralNotice: rollback CentralNotice 'improvement' (duration: 00m 09s) [01:31:38] Logged the message, Master [01:31:43] !log awight Synchronized php-1.25wmf10/extensions/CentralNotice: rollback CentralNotice 'improvement' (duration: 00m 05s) [01:31:47] Logged the message, Master [02:01:11] !log awight Synchronized php-1.25wmf10/extensions/CentralNotice: rollback Googlebot cloaking (duration: 00m 05s) [02:01:19] Logged the message, Master [02:01:26] !log awight Synchronized php-1.25wmf11/extensions/CentralNotice: rollback Googlebot cloaking (duration: 00m 08s) [02:01:30] Logged the message, Master [02:16:45] !log l10nupdate Synchronized php-1.25wmf10/cache/l10n: (no message) (duration: 00m 01s) [02:16:49] !log LocalisationUpdate completed (1.25wmf10) at 2014-12-05 02:16:49+00:00 [02:16:50] Logged the message, Master [02:16:52] Logged the message, Master [02:20:53] !log l10nupdate Synchronized php-1.25wmf11/cache/l10n: (no 
message) (duration: 00m 01s) [02:20:57] !log LocalisationUpdate completed (1.25wmf11) at 2014-12-05 02:20:57+00:00 [02:20:58] Logged the message, Master [02:21:00] Logged the message, Master [02:24:27] (PS1) Springle: depool db1060 for upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/177736 [02:24:58] (CR) Springle: [C: 2] depool db1060 for upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/177736 (owner: Springle) [02:26:38] !log springle Synchronized wmf-config/db-eqiad.php: depool db1060 (duration: 00m 08s) [02:26:43] Logged the message, Master [02:35:55] !log manual sync-common on mw1203 (after apparently transient sync-file network error on tin) [02:35:59] Logged the message, Master [02:40:52] (PS1) Springle: upgrade db1060 to trusty and mariadb 10 [puppet] - https://gerrit.wikimedia.org/r/177738 [02:42:26] (CR) Springle: [C: 2] upgrade db1060 to trusty and mariadb 10 [puppet] - https://gerrit.wikimedia.org/r/177738 (owner: Springle) [02:55:02] !log upgrade db1060 trusty [02:55:08] Logged the message, Master [03:26:26] (CR) MZMcBride: "Why?" [mediawiki-config] - https://gerrit.wikimedia.org/r/177278 (owner: Ejegg) [03:28:35] (PS1) Springle: repool db1060 [mediawiki-config] - https://gerrit.wikimedia.org/r/177741 [03:29:11] (CR) Springle: [C: 2] repool db1060 [mediawiki-config] - https://gerrit.wikimedia.org/r/177741 (owner: Springle) [03:30:27] !log springle Synchronized wmf-config/db-eqiad.php: repool db1060, warm up (duration: 00m 06s) [03:30:31] Logged the message, Master [03:32:07] hoo: Where is the spam? [03:32:28] Log/delete, I guess. [03:32:48] probably [03:32:57] > buy tramadol buy tramadol for dogs - tramadol 50mg tablets ingredients [03:33:18] Like, you'd think we'd be able to stop that pretty easily. 
[03:33:50] We could add "Tramadol" to one of the abusefilters for spam [03:34:06] but replacing captchas with smart abusefilters is just not possible [03:34:13] You could just block people adding " akosiaris: I'll try out (2) today, I think [08:40:58] _joe_: all the perl I've encountered in our repo looks like grown up bash [08:41:09] I don't know if that's how perl generally looks like [08:41:15] <_joe_> YuviPanda: it's not [08:41:16] regexes. regexes EVERYWHERE as well. [08:41:27] yeah, that's perl [08:41:32] <_joe_> well, perl is THE tool for manipulating text [08:41:49] true, but parsing commandline options with regexes just feels wrrrronnnnggg [08:41:49] <_joe_> so yeah regexes are pretty near the core of the language [08:41:56] <_joe_> YuviPanda: LOL [08:42:00] wut ? [08:42:02] <_joe_> ok that's BAD perl [08:42:07] yeah... [08:42:24] commandline options in a file, not from the commandline [08:42:35] but with python I'd just pass them on to argparse... [08:42:51] <_joe_> YuviPanda: cpan.org [08:42:54] or optparse if you are old school but yeah... [08:43:00] <_joe_> there is one perl module for literally anything [08:43:11] getopt or one of the thousand other perl modules would do it better [08:43:22] _joe_: true, but... but... [08:44:10] _joe_: akosiaris is this idiomatic enough perl? https://github.com/wikimedia/operations-puppet/blob/production/modules/toollabs/files/bigbrother [08:44:37] I should probably learn perl at some point, but the way to do that is to write perl, and all I seem to be doing these days is WM stuff and don't want to add any more perl here. [08:45:07] <_joe_> what's this? [08:46:02] _joe_: toollabs service for dealing with terribly written tools. restarts services when they go down, trying X times in Y hours. [08:46:58] <_joe_> oh ok, your version of upstart, in perl [08:46:59] <_joe_> mh [08:47:01] http://paste.debian.net/hidden/c9ec99e9/ [08:47:08] gods.... 
[08:47:21] do note the -cp /etc/cassandra: [08:47:48] cause we obviously put jars and .class file in /etc/cassandra :-( [08:48:02] <_joe_> rotfl [08:48:53] YuviPanda: it does not look like very bad perl tbh [08:48:56] <_joe_> YuviPanda: well using perl as an el-cheapo alternative to a supervisor seems suboptimal, but I'm probably missing the details of the operating framework [08:48:59] it is actually readable... [08:49:05] <_joe_> no it's not bad perl at all [08:49:20] hmm, I guess then perl itself isn't too much my thing. [08:49:30] and when perl is readable, the author has achieved something close to impossible [08:49:44] <_joe_> nah [08:49:44] RECOVERY - puppet last run on db1060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:49:52] <_joe_> my perl is usually quite readable [08:50:04] but that rcfile approach is weird indeed [08:50:57] what are webservice and webservice2 ? [08:51:03] and how are they different ? [08:51:13] meaning /usr/bin/webservice{,2} [08:51:19] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (#3658) [10Gbps wave] [08:51:20] akosiaris: webservice is the original written in basah, webservice2 is replacement in python. will merge soon. [08:51:26] akosiaris: *bash [08:51:38] * _joe_ NIH NIH [08:51:41] akosiaris: only webservice2 has the ability to let you specify your tools want to run on precise or trusty. 
[08:52:00] _joe_: it's just a wrapper around qsub [08:52:09] so people don't have to specify the correct options every single time [08:52:17] because obviously they'll get it all wrong :) [08:52:17] RECOVERY - puppet last run on amssq36 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:52:24] <_joe_> eheh ok [08:52:27] <_joe_> fair enough [08:52:33] <_joe_> wrappers are a good thing [08:53:25] it's also fairly upstart-like, though. start, stop, status, restart [08:53:36] <_joe_> YuviPanda: why didn't you use something like monit? [08:54:00] <_joe_> if you can't use user upstart scripts, I mean [08:54:06] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 0, unused: 0 [08:54:08] aaah, hmm. [08:54:22] _joe_: that might work with wrappers around, since it shouldn't check for processes 'normally' but with qstat etc [08:54:25] (things run on grid) [08:54:39] <_joe_> oh ok, monit would do that [08:55:06] yeah. [08:55:08] <_joe_> but I was talking about the "bigbrother" part [08:55:20] <_joe_> you could use it as a supervisor [08:55:54] oh yeah, that's what I was talking about too. big brother does qstat to see if process is running that needs to be checked, and then if not starts it with appropriate wrapper [08:56:28] <_joe_> ok, monit could do that easily [08:56:31] alright, filed a task. 
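Editor's aside: the supervisor behaviour described above (check whether a job is running, start it through the wrapper if not, and give up after X restarts in Y hours) can be sketched as follows. This is a minimal illustration and not the bigbrother code; RestartLimiter, supervise_once, and the is_running/start callables are hypothetical names, and no real qstat or qsub call is made.

```python
import time
from collections import deque


class RestartLimiter:
    """Allow at most max_restarts restarts within a sliding time window."""

    def __init__(self, max_restarts, window_seconds):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._attempts = deque()  # timestamps of recent restarts

    def allow(self, now=None):
        now = time.time() if now is None else now
        # Forget restarts that have aged out of the window.
        while self._attempts and now - self._attempts[0] >= self.window_seconds:
            self._attempts.popleft()
        if len(self._attempts) >= self.max_restarts:
            return False
        self._attempts.append(now)
        return True


def supervise_once(is_running, start, limiter, now=None):
    """One supervision pass; is_running and start are caller-supplied callables."""
    if is_running():
        return "running"
    if limiter.allow(now):
        start()
        return "restarted"
    return "gave up"
```

In a grid setup, is_running would presumably wrap a job status query and start would call the submission wrapper; both are left to the caller here so the retry logic stays testable on its own.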
[08:56:45] <_joe_> not necessarily the best option though [08:56:55] yeah, task says 'investigate' :) [08:57:00] <_joe_> but at least the chore of maintaining it it's on others :) [08:57:06] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [08:57:39] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [08:58:27] _joe_: heh, 'Invented Here Syndrome' is probably a good thing [09:01:32] (PS1) Giuseppe Lavagetto: admins: grant access to Joaquin Otra Hernandez (RT #8951) [puppet] - https://gerrit.wikimedia.org/r/177755 [09:02:15] <_joe_> akosiaris: ^^ [09:04:45] akosiaris: http://paste.ubuntu.com/9379637/ is my script that provisioned mongo user accounts, should be trivial to adapt to postgres [09:05:05] <_joe_> he said mongo! [09:05:28] hehe :) [09:05:30] <_joe_> YuviPanda: mongodb is the only thing ops despise more than nodejs and puppet [09:05:40] yeah... if he had not directed it to me, I could pretend it did not happen [09:05:45] :-D [09:05:47] I spent a few days about... 6-8 months back trying to give everyone mongo accounts on toollabs [09:06:09] before realizing that it actually didn't give a fuck about user management in any form [09:06:16] <_joe_> YuviPanda: and then you discovered that those writes never reached the server :P [09:06:18] and abandoning it and removing all traces of that work from puppet [09:06:23] <_joe_> lol [09:06:46] but not giving a fuck about user management seems not very uncommon [09:06:50] ES too, for instance [09:11:22] <_joe_> mmmh what is this supposed to do? [09:11:30] <_joe_> $all_groups = split(inline_template("<%= (@always_groups+@groups).join(',') %>"),',') [09:11:47] <_joe_> it's creating a new array that is the sum of the two? [09:12:08] <_joe_> (always_groups and groups are two arrays) [09:12:13] heh, in interesting ways. 
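Editor's aside: the Puppet line quoted above adds two arrays, joins the result into one comma-separated string, then splits it back into an array. A rough Python analogy of that round trip next to plain concatenation is sketched below; the function names and group names are invented for illustration, and the edge cases follow Python's str.split, which may not match Puppet or Ruby exactly (the empty-input case in particular).

```python
def join_then_split(first, second, sep=","):
    """Round-trip two lists through a separator-joined string."""
    return sep.join(first + second).split(sep)


def concat(first, second):
    """Plain list concatenation, with no detour through a string."""
    return list(first) + list(second)


# Ordinary names: both approaches agree.
same = join_then_split(["wikidev"], ["ops", "deploy"]) == concat(["wikidev"], ["ops", "deploy"])

# An element containing the separator is silently torn apart by the round trip.
torn = join_then_split(["a,b"], ["c"])

# Two empty lists come back as one empty string in Python, not an empty list.
empty = join_then_split([], [])
```

The round trip matches concatenation only while no element contains the separator, which is the argument for concatenating the arrays directly.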
[09:13:01] (CR) Nemo bis: "Results being analysed at https://www.mediawiki.org/wiki/Extension:ConfirmEdit/FancyCaptcha_experiments" [mediawiki-config] - https://gerrit.wikimedia.org/r/177708 (owner: Hoo man) [09:16:15] (PS1) Giuseppe Lavagetto: admins: use concat() instead of an inline template [puppet] - https://gerrit.wikimedia.org/r/177757 [09:17:06] akosiaris: shinken still has a lot of work left to do, though. we currently have older packages, and it doesn't retain history of check results beyond current one yet. Plus we can't use ldap auth in labs so user accounts / actions aren't usable there (shouldn't be a problem in prod) [09:19:03] _joe_: you might like http://thedoomthatcametopuppet.tumblr.com/ if you don't know of it already [09:19:36] <_joe_> YuviPanda: oh this is new [09:20:05] <_joe_> aahhahahahahah [09:20:13] YuviPanda: well, with all the work you 've done, it is closer that it was 3 months ago :-) [09:21:03] <_joe_> !log depooling some appservers for maintenance/upgrades [09:21:06] Logged the message, Master [09:21:33] <_joe_> YuviPanda: we got something wrong, most newer appservers (the one you worked on) did have HT disabled [09:21:33] akosiaris: heh :) [09:21:42] _joe_: .... [09:22:01] they were ok on both lshw and your incantation. [09:22:07] how did you find out now, btw? [09:22:09] YuviPanda: hahah thanks that tumblr is genius [09:22:19] godog: :D seems very new as well. [09:22:27] <_joe_> YuviPanda: they have 16 procs in ganglia, they have 16 cores... [09:22:46] PROBLEM - puppet last run on amssq33 is CRITICAL: CRITICAL: puppet fail [09:23:23] _joe_: hmm, why does the software actually lie so much? is it *that* hard? [09:24:07] <_joe_> YuviPanda: we were a bit clueless I'd say [09:24:09] <_joe_> it's on us [09:24:11] <_joe_> not the sw [09:24:26] <_joe_> and honestly I didn't take the time to do that correctly [09:25:41] (CR) Alexandros Kosiaris: [C: 2] "LGTM, the 3 days period on RT have passed, merging. 
Thanks" [puppet] - https://gerrit.wikimedia.org/r/177755 (owner: Giuseppe Lavagetto) [09:25:47] ok. [09:26:13] _joe_: so do we reboot into bios and check? I could do some as well, unless you think co-ordinating would be too much of a hassle. [09:26:36] <_joe_> YuviPanda: one lame and fast way is using puppet facts [09:26:39] <_joe_> from the db [09:27:07] <_joe_> YuviPanda: for now wait, give me some time to work on it [09:27:09] ok [09:27:13] let me know if I can help. [09:27:20] <_joe_> thanks :) [09:30:24] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 680 [09:35:28] RECOVERY - check_mysql on db1008 is OK: Uptime: 4392505 Threads: 44 Questions: 89635773 Slow queries: 28297 Opens: 83580 Flush tables: 2 Open tables: 64 Queries per second avg: 20.406 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [09:37:46] RECOVERY - puppet last run on amssq33 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:45:38] unrelated, but will hopefully happen in ~2h http://www.nasa.gov/orion/ [09:45:39] akosiaris: got some time to chat about the postgres user access design / what rights we're going to give them to what? [09:49:31] (CR) Hashar: "Note the alarm will self acknowledge after an hour, and the committer date can be forged." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/177521 (owner: Ori.livneh) [09:54:58] (CR) Yuvipanda: "Should look for patches rather than merges, I think?" [puppet] - https://gerrit.wikimedia.org/r/177521 (owner: Ori.livneh) [09:57:47] YuviPanda: sure [09:58:26] YuviPanda: wanna walk a bit through how we do it now with mysql ? I 'd like us to mimic it as much as possible [09:58:30] akosiaris: so... we give them read perms on particular databases and createdb on databases prefixed with userid__. [09:58:36] akosiaris: yeah, ^ is what mysql does [09:59:18] particular databases ? the ones mirrored through santitarium ? 
[09:59:39] PROBLEM - puppet last run on mw1239 is CRITICAL: CRITICAL: Puppet has 1 failures [10:00:05] akosiaris: yeah [10:00:11] akosiaris: well, in postgis's case, the OSM ones? [10:00:13] err [10:00:15] postgres [10:01:06] has anyone requested that ? [10:01:33] cause if not, it might not even make sense to have the boxes replicate planet openstreetmap continuously [10:01:44] and eat up CPU and I/O [10:02:15] akosiaris: requested what? [10:02:16] akosiaris: OSM? [10:02:36] hey YuviPanda can I pm you? [10:02:39] yeah [10:02:40] akosiaris: hmm, so do we have one box that has both OSM and user databases? or are they two things? Don't we also want to give read access to OSM for everyone? [10:02:42] joakino_: sure! [10:03:24] YuviPanda: different boxes right now. 2 (master/slave) do only OSM stuff. 2 (master/slave) or generic postgres for labs [10:03:43] nothing stops us from upgrading the generic postgres to doing OSM also [10:03:47] akosiaris: aaha. so we're only handing out creds for generic postgres now? [10:04:10] akosiaris: hmm, I think we should keep it separate, to mimic labsdb/tools-db for mysql. [10:04:24] I agree [10:04:38] akosiaris: right, so in this case, just appropriate createdb rights with prefixing? [10:04:56] how do we technically enforce that in mysql's case ? [10:08:08] I am asking because we can not enforce that in postgres itself from what I see. The moment someone has the createdb privilege ... 
well they can create databases [10:12:31] RECOVERY - puppet last run on mw1239 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [10:14:11] PROBLEM - puppet last run on mw1213 is CRITICAL: CRITICAL: Puppet has 1 failures [10:15:42] (PS1) Alexandros Kosiaris: Give ebernhardson access to stat1003 [puppet] - https://gerrit.wikimedia.org/r/177762 [10:19:27] (CR) Alexandros Kosiaris: "The 3 day waiting period has passed so this can be merged" [puppet] - https://gerrit.wikimedia.org/r/177762 (owner: Alexandros Kosiaris) [10:24:08] <_joe_> !log repooling the appservers [10:24:12] Logged the message, Master [10:29:13] RECOVERY - puppet last run on mw1213 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:29:28] somone from opeartions here? [10:29:54] yeah [10:30:10] akosiaris: can you plese kill my GWT upload [10:30:30] in the job queue... it is processin moor then the 20 test files in the xml :/ [10:31:22] not sure how to do that... [10:33:00] (PS1) Filippo Giunchedi: codfw-prod: empty ms-be2013/2014/2015 sdm3/sdn3 [software/swift-ring] - https://gerrit.wikimedia.org/r/177765 [10:33:02] omg :( [10:33:08] __joe__ maybe [10:33:31] <_joe_> err [10:33:37] (CR) Filippo Giunchedi: [C: 2 V: 2] codfw-prod: empty ms-be2013/2014/2015 sdm3/sdn3 [software/swift-ring] - https://gerrit.wikimedia.org/r/177765 (owner: Filippo Giunchedi) [10:33:41] <_joe_> me neither [10:33:42] godog: any idea how to stop an gwt upload ? [10:34:20] now self stopped. O_O [10:34:28] uh, not offhand akosiaris :( [10:34:53] Steinsplitter: has this been done before ? [10:35:08] and someone knew how to stop a gwt upload ? [10:35:13] okay, i file a bug. there schould be a way to kill GWT jobs onwiki. 
[10:35:32] : i know only that ops has topped long time ago Fae's gwt upload [10:35:33] <_joe_> if it's running, I think that would mean finding which jobrunner is running it [10:35:39] <_joe_> and kill(1) [10:37:15] <_joe_> or removing the jobs from redis [10:37:18] Steinsplitter: ok thanks [10:37:25] <_joe_> but I have to search which one are [10:37:38] _joe_: well if it is already running, removing from redis wouldn't do much [10:37:54] assuming we talk about a single one job [10:38:01] <_joe_> akosiaris: it may be a list of jobs [10:38:28] in that case... the problem becomes NP complete :P [10:38:48] just joking but still... we are missing documentation [10:38:52] for example we got https://wikitech.wikimedia.org/wiki/Job_queue_runner [10:39:02] but it is only on verious serious problems [10:39:08] <_joe_> I don't think anyone ever thought of this [10:39:24] and https://wikitech.wikimedia.org/wiki/Job_queue is not helpful either [10:40:14] (PS1) Springle: depool db1059 for upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/177766 [10:41:51] (CR) Springle: [C: 2] depool db1059 for upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/177766 (owner: Springle) [10:42:01] akosiaris: so, joakino_ can't login to bast1001. I wonder if we need to add him to the bast group? [10:42:49] <_joe_> meh, yes [10:42:56] !log springle Synchronized wmf-config/db-eqiad.php: depool db1059 (duration: 00m 06s) [10:42:58] <_joe_> my bad [10:43:01] Logged the message, Master [10:43:03] mine as well [10:43:07] <_joe_> I always forget that [10:44:12] hhvm responds to db-eqiad.php changes much faster than the old setup [10:44:16] * springle like [10:44:54] <_joe_> really? [10:45:17] <_joe_> springle: can you check if mw1081 did respond as well? 10.64.0.48 [10:45:28] seems so. used to wait a while for db connections to die down. 
now, bang, traffic moves [10:47:03] (PS1) Hashar: contint: mediawik::multimedia on labs slaves [puppet] - https://gerrit.wikimedia.org/r/177770 [10:47:53] _joe_: can't tell. it isn't connected right now to the box i just depooled (db1027) [10:49:11] <_joe_> springle: ok thanks [10:49:12] (PS1) Alexandros Kosiaris: Followup commit for I3ee6cef4f89b [puppet] - https://gerrit.wikimedia.org/r/177771 [10:49:26] <_joe_> next time you depool something, lemme know [10:49:34] hmm, they need to get on bastion, don't they? [10:49:42] yeah, think I need to add him to the bastonly group [10:49:43] * YuviPanda does [10:50:45] YuviPanda: ^ [10:50:55] already done in 177771 [10:50:57] akosiaris: hah! [10:51:02] nice number ... [10:51:07] wanna merge ? [10:51:13] well, review and merge [10:51:19] yeah, doing [10:51:20] I am getting ahead of myself [10:51:38] akosiaris: let's break up that line? it's a bit too long now. [10:51:47] _joe_: something about mw1081 you expect to be different? related to hhvm.server.stat_cache ? [10:52:19] YuviPanda: I did... [10:52:27] oh [10:52:30] oh, that is gerrit and whitespace [10:52:33] <_joe_> springle: now hhvm.server.$something [10:52:35] heh [10:52:44] (CR) Yuvipanda: [C: 2] Followup commit for I3ee6cef4f89b [puppet] - https://gerrit.wikimedia.org/r/177771 (owner: Alexandros Kosiaris) [10:52:46] it sometimes displays whitespace changes weird [10:52:47] <_joe_> lemme check the name of the variable [10:53:00] yeaaah [10:53:13] there was no line number on left pane and that confused me [10:53:21] <_joe_> hhvm.check_sym_link = false [10:54:15] akosiaris: re: postgres again, I think the way to do this is to define a function people can use to create databases, and have that automatically prefix. [10:54:30] akosiaris: and we can set that function to run with previlages of the definer, so won't actually need to hand out createdb [10:55:15] _joe_: tnx. 
interesting [10:55:16] akosiaris: so we just generate privilageless logins for everyone, and then define this function, and then we're good to go. [10:55:52] YuviPanda: sounds fine to me [10:56:24] akosiaris: ok! do we have a password in puppet private already? [10:56:42] yes [10:56:48] <_joe_> !log depooling mw1161-mw1170 for enabling hyperthreading [10:56:51] Logged the message, Master [10:56:51] let me check the class [10:57:07] <_joe_> springle: give me an heads-up before depooling more servers, I may have servers rebooting [10:57:40] YuviPanda: passwords::postgres::labsadmin_password, [10:57:42] (03PS2) 10Hashar: contint: mediawik::multimedia on labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/177770 [10:57:46] akosiaris: cool, thanks! [10:57:54] labsadmin is the username [10:58:25] (03PS3) 10Hashar: contint: mw multimedia packages on labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/177770 [10:58:37] !log upgrade db1059 trusty [10:58:39] Logged the message, Master [11:01:23] _joe_: ok, won't be for a while [11:01:31] <_joe_> good [11:04:08] (03CR) 10Hashar: [C: 031 V: 032] "PS1 used mediawiki::multimedia which creates a /tmp/magick-tmp owned by apache when the jenkins jobs run as jenkins-deploy. So instead us" [puppet] - 10https://gerrit.wikimedia.org/r/177770 (owner: 10Hashar) [11:08:29] (03PS2) 10Hashar: Drop role::zuul::labs [puppet] - 10https://gerrit.wikimedia.org/r/173248 [11:08:46] (03PS3) 10Hashar: Drop role::zuul::labs [puppet] - 10https://gerrit.wikimedia.org/r/173248 [11:08:58] (03PS3) 10Hashar: role::ci::website::labs [puppet] - 10https://gerrit.wikimedia.org/r/173251 [11:09:32] (03CR) 10Hashar: [V: 032] "Cherry picked on integration puppetmaster (labs)." [puppet] - 10https://gerrit.wikimedia.org/r/173248 (owner: 10Hashar) [11:09:45] (03CR) 10Hashar: [C: 031 V: 032] "Cherry picked on integration puppetmaster (labs)." 
[puppet] - 10https://gerrit.wikimedia.org/r/173251 (owner: 10Hashar) [11:22:33] (03PS1) 10Springle: upgrade db1059 to trusty and mariadb 10 [puppet] - 10https://gerrit.wikimedia.org/r/177776 [11:24:20] (03CR) 10Springle: [C: 032] upgrade db1059 to trusty and mariadb 10 [puppet] - 10https://gerrit.wikimedia.org/r/177776 (owner: 10Springle) [11:42:56] sigh ganglia for misc eqiad looks like an ECG, what was the root cause the other day _joe_ for appservers? [11:43:02] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2 [11:44:30] <_joe_> godog: non-authorized aggregators I guess [11:45:32] (03PS1) 10Springle: repool db1059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/177778 [11:45:38] _joe_: ^ now, or are you rebooting? [11:45:51] no hurry [11:46:41] <_joe_> !log repooled mw1161-mw1170, depooling mw1171-80 [11:46:44] Logged the message, Master [11:46:48] <_joe_> springle: go on [11:46:52] ta [11:47:03] <_joe_> I'm just done with one batch [11:47:08] (03CR) 10Springle: [C: 032] repool db1059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/177778 (owner: 10Springle) [11:47:54] <_joe_> when you're done, I'll get back to my routine [11:48:47] !log springle Synchronized wmf-config/db-eqiad.php: repool db1059, warm up (duration: 00m 07s) [11:48:50] Logged the message, Master [11:48:55] _joe_: done [11:48:59] <_joe_> oook [11:50:01] _joe_: what was the fix for the non authorized aggregators? [11:50:24] <_joe_> removing $ganglia_aggregator=true from the corresponding node [11:50:29] <_joe_> in puppet [11:52:21] ack [12:06:22] letting wikidev people do stuff as nobody? i guess it's good https://gerrit.wikimedia.org/r/#/c/174896/1/modules/mediawiki/manifests/users.pp [12:31:57] andre__: helo. having a hard time finding that phab ticket about the community report queries [12:32:13] !log Reloading Zuul to deploy I9515542a1ac2ff [12:32:15] Logged the message, Master [12:32:19] mutante, yo! mean T1003 ? 
[12:32:30] or broader T28? [12:33:29] andre__: yes, i mean T1003. well i thought there are SQL queries now and what is needed is a little puppet to make a cron job that runs it [12:33:38] and sends some email out, like it did on BZ [12:34:03] or is that not true, because now comments from mukunda about a "stats app" [12:34:41] i'm not sure about " provide a simple ui for executing these queries" [12:34:50] sounds like phpmyadmin :o [12:35:43] mutante, let me clarify on that ticket. sorry for widening the scope [12:36:01] i'd rather just have queries executed by puppetized crons [12:36:04] thanks andre [12:37:07] <_joe_> sounds horrible [12:37:20] hey, how do i find out what exactly version of rsvg are we running? [12:37:25] (on the cluster) [12:37:35] <_joe_> MatmaRex: do you have shell access? [12:37:38] this file: https://commons.wikimedia.org/wiki/File:VisualEditor_MediaWiki_theme_clear_icon.svg ought to not be parsed correctly, but it is. [12:37:39] nope [12:37:46] <_joe_> then ask me :) [12:37:53] and i'm wondering why [12:38:48] <_joe_> 2.40.2-1 [12:39:07] <_joe_> librsvg2-2 2.40.2-1 [12:39:47] do we have custom patches? have we upgraded recently? [12:40:11] i vaguely recall we have something security-related [12:40:31] mutante, https://phabricator.wikimedia.org/T1003#821136 [12:40:42] MatmaRex: https://gerrit.wikimedia.org/r/#/c/173639/ was applied and rebuilt [12:41:01] hah [12:41:05] well this is wonderful [12:41:16] i was just investigating that bug and it turns out we fixed it. heh [12:41:20] https://phabricator.wikimedia.org/T76852 [12:41:22] <_joe_> MatmaRex: we upgraded to trusty [12:41:29] <_joe_> MatmaRex: thank HHVM! [12:41:40] _joe_: imagescalers too? [12:41:45] <_joe_> YuviPanda: nope [12:42:02] isn't rsvg running on those? 
[12:42:23] <_joe_> MatmaRex: yeah that may be the case, hold on [12:42:46] i know all i needed, thank you both :D [12:43:34] (03CR) 10Bartosz Dziewoński: "This fixes https://phabricator.wikimedia.org/T76852 :D" [debs/librsvg] - 10https://gerrit.wikimedia.org/r/173639 (owner: 10Ebrahim) [12:49:23] andre__: thanks, yea, looking into it [12:49:45] <_joe_> MatmaRex: I can take a look at the imagescalers if you need me to [12:50:12] _joe_: no no no, thank you [12:50:57] whatever the configuration is, it seems to be working wonderfully, i'll bother you if it ever breaks :) [13:01:25] <_joe_> !log repooling all appservers [13:01:27] Logged the message, Master [13:08:03] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [13:08:45] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [13:51:43] (03PS1) 10Filippo Giunchedi: codfw-prod: empty ms-be2013/2014/2015 sdm3/sdn3 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/177790 [13:51:59] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] codfw-prod: empty ms-be2013/2014/2015 sdm3/sdn3 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/177790 (owner: 10Filippo Giunchedi) [14:07:38] (03CR) 10Alexandros Kosiaris: [C: 032] contint: mw multimedia packages on labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/177770 (owner: 10Hashar) [14:10:58] (03PS1) 10Dzahn: phabricator: community metrics stats mail [puppet] - 10https://gerrit.wikimedia.org/r/177792 [14:11:02] !log rebooting copper [14:11:10] Logged the message, Master [14:11:16] (03CR) 10Dzahn: [C: 032] Drop role::zuul::labs [puppet] - 10https://gerrit.wikimedia.org/r/173248 (owner: 10Hashar) [14:11:31] PROBLEM - DPKG on copper is CRITICAL: Timeout while attempting connection [14:14:22] RECOVERY - DPKG on copper is OK: All packages OK [14:14:56] (03CR) 10Dzahn: [C: 031] "deploy later tonight ?:)" [puppet] - 10https://gerrit.wikimedia.org/r/177521 (owner: 10Ori.livneh) [14:19:39] (03CR) 
10Alexandros Kosiaris: [C: 04-1] role::ci::website::labs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/173251 (owner: 10Hashar) [14:20:58] PROBLEM - Host copper is DOWN: PING CRITICAL - Packet loss = 100% [14:24:22] RECOVERY - Host copper is UP: PING OK - Packet loss = 0%, RTA = 1.13 ms [14:26:15] (03CR) 10Dzahn: [C: 031] base: Make number of days acct logs are kept customizable [puppet] - 10https://gerrit.wikimedia.org/r/177427 (owner: 10Yuvipanda) [14:27:34] (03CR) 10Dzahn: [C: 031] nagios_common: Rename check_http_generic appropriately [puppet] - 10https://gerrit.wikimedia.org/r/177719 (owner: 10Yuvipanda) [14:33:42] (03CR) 10Dzahn: "i get the reason after reading the phabricator ticket but maybe it could be explained a bit in the commit message why this can't be expect" [puppet] - 10https://gerrit.wikimedia.org/r/177584 (owner: 10Andrew Bogott) [14:35:04] (03CR) 10Dzahn: Add an 'apache' user to eqiad labstores. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/177584 (owner: 10Andrew Bogott) [14:35:56] (03CR) 10Dzahn: [C: 04-2] "WIP" [puppet] - 10https://gerrit.wikimedia.org/r/177792 (owner: 10Dzahn) [14:39:26] (03PS1) 10Alexandros Kosiaris: Introduce heze.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/177798 [14:39:50] (03CR) 10Dzahn: [C: 031] "seems ok to me to just try it, worst case it can just be reverted" [puppet] - 10https://gerrit.wikimedia.org/r/177106 (owner: 10Krinkle) [14:44:23] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.004 second response time [14:47:33] PROBLEM - puppet last run on amssq43 is CRITICAL: CRITICAL: puppet fail [14:48:05] (03CR) 10Dzahn: [C: 031] "FHS 2.3 says /usr/src is optional and" [puppet] - 10https://gerrit.wikimedia.org/r/176624 (owner: 10Ori.livneh) [14:53:22] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.017 second response time [14:53:34] (03CR) 10Dzahn: "yes, it's 
"mediawiki-installation" that counts, so we need to have a salt grain that is on all appservers and tin,terbium,tmh,snapshot. ju" [puppet] - 10https://gerrit.wikimedia.org/r/160953 (owner: 10Alexandros Kosiaris) [14:57:42] (03CR) 10Dzahn: "also see https://wikitech.wikimedia.org/wiki/Trebuchet#Add_a_deployment_target_grain_to_the_minion_via_puppet and shouldn't there be a "" [puppet] - 10https://gerrit.wikimedia.org/r/160953 (owner: 10Alexandros Kosiaris) [15:02:53] RECOVERY - puppet last run on amssq43 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [15:03:12] (03PS6) 10Andrew Bogott: Add an 'apache' user to eqiad labstores. [puppet] - 10https://gerrit.wikimedia.org/r/177584 [15:04:39] (03CR) 10Dzahn: [C: 031] Add an 'apache' user to eqiad labstores. [puppet] - 10https://gerrit.wikimedia.org/r/177584 (owner: 10Andrew Bogott) [15:04:59] (03PS1) 10Dzahn: add salt grain 'mediawiki-installation' in mw role [puppet] - 10https://gerrit.wikimedia.org/r/177801 [15:06:30] <_joe_> !log depooling mw1209-1220, enabling hyperthreading and upgrading [15:06:35] Logged the message, Master [15:06:51] (03CR) 10Dzahn: "we should use a salt grain that we add in a puppet role instead of matching hostnames. this is trying to add one: https://gerrit.wikimedia" [puppet] - 10https://gerrit.wikimedia.org/r/160953 (owner: 10Alexandros Kosiaris) [15:08:11] (03PS2) 10Ottomata: Rename all webrequest varnishkafka instances as 'webrequest' [puppet] - 10https://gerrit.wikimedia.org/r/177546 [15:10:34] (03CR) 10Dzahn: "oh:) this is already merged, thank you Yuvi" [puppet] - 10https://gerrit.wikimedia.org/r/176863 (owner: 10Dzahn) [15:24:37] (03PS2) 10Dzahn: Give ebernhardson access to stat1003 [puppet] - 10https://gerrit.wikimedia.org/r/177762 (owner: 10Alexandros Kosiaris) [15:25:33] (03PS1) 10Alexandros Kosiaris: Introducing server heze [puppet] - 10https://gerrit.wikimedia.org/r/177808 [15:25:40] (03CR) 10Dzahn: [C: 032] "waiting period over. 
manager approval on ticket. group correct per ottomata. lgtm." [puppet] - 10https://gerrit.wikimedia.org/r/177762 (owner: 10Alexandros Kosiaris) [15:27:28] (03CR) 10Dzahn: "node name in puppet needs wmnet ?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/177808 (owner: 10Alexandros Kosiaris) [15:28:50] (03CR) 10Dzahn: "heze.codfw.wmnet ?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/177808 (owner: 10Alexandros Kosiaris) [15:32:44] (03PS1) 10Se4598: at the beginning of 2014, db.php was renamed to db-pmtpa.php, also creating a db-eqiad.php. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/177810 [15:33:17] (03PS2) 10Se4598: fix noc.wikimedia.org/db.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/177810 [15:35:58] (03PS3) 10Se4598: fix noc.wikimedia.org/db.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/177810 [15:39:17] (03CR) 10Dzahn: [C: 031] "http://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/177810 (owner: 10Se4598) [15:44:50] (03CR) 10Dzahn: [C: 04-1] "no wait, the pathes are different now. on terbium: -bash: cd: /usr/local/common: No such file or directory" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/177810 (owner: 10Se4598) [15:45:51] Shall I do SWAT again today? [15:45:57] I was pretty good at it yesterday. [15:46:10] <_joe_> !log repooling all the servers [15:46:14] Logged the message, Master [15:50:13] (03PS4) 10Se4598: fix noc.wikimedia.org/db.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/177810 [15:51:21] (03CR) 10Dzahn: "require_once( '/a/common/wmf-config/db-eqiad.php' );" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/177810 (owner: 10Se4598) [15:51:52] !log hack-fixed http://noc.wikimedia.org/db.php [15:51:58] Logged the message, Master [15:52:26] (03CR) 10Se4598: "@Dzahn: I now used the path from the https://noc.wikimedia.org/conf/db-eqiad.php.txt symlink (minus one ../ ). That should work too." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/177810 (owner: 10Se4598) [15:54:21] (03CR) 10Dzahn: [C: 032] "yes, that works too, thanks for the fix" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/177810 (owner: 10Se4598) [15:55:53] (03CR) 10Dzahn: "live-hack-fixed, will be noop on next deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/177810 (owner: 10Se4598) [15:59:53] (03PS1) 10RobH: setting mgmt/production entries for server nembus [dns] - 10https://gerrit.wikimedia.org/r/177817 [15:59:55] OK here we go. [15:59:59] * marktraceur breathes [16:00:02] SWAT IS OVER. [16:00:13] Thanks guys, good job, see you tomorrow [16:02:09] marktraceur: I take you up on that :D [16:02:39] hoo: OK, I might be bluffing. Tomorrow-Mark may be hung over. [16:03:52] (03CR) 10RobH: [C: 032] setting mgmt/production entries for server nembus [dns] - 10https://gerrit.wikimedia.org/r/177817 (owner: 10RobH) [16:13:17] (03PS1) 10Giuseppe Lavagetto: analytics: rename hiera file [puppet] - 10https://gerrit.wikimedia.org/r/177820 [16:13:23] <_joe_> godog: ^^ [16:13:24] (03PS2) 10Alexandros Kosiaris: Introducing server heze [puppet] - 10https://gerrit.wikimedia.org/r/177808 [16:14:21] _joe_: ah, related to ganglia aggregator I take it? [16:14:28] <_joe_> yep [16:14:50] (03CR) 10Giuseppe Lavagetto: [C: 032] analytics: rename hiera file [puppet] - 10https://gerrit.wikimedia.org/r/177820 (owner: 10Giuseppe Lavagetto) [16:15:10] cool, thanks! [16:19:56] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [16:22:58] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce heze.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/177798 (owner: 10Alexandros Kosiaris) [16:24:24] ah! _joe_, i was just looking into that [16:24:32] what happened? 
[16:27:03] <_joe_> ottomata: ".yml" vs ".yaml" [16:27:06] (03PS1) 10Cmjohnson: Adding virt1010 mac address [puppet] - 10https://gerrit.wikimedia.org/r/177826 [16:27:14] <_joe_> the file name :/ [16:27:15] aye, but, was that recent change? [16:27:35] <_joe_> recent... 1 week? 2? [16:27:41] <_joe_> I don't really remember [16:27:48] hm ok [16:27:56] (03CR) 10Cmjohnson: [C: 032] Adding virt1010 mac address [puppet] - 10https://gerrit.wikimedia.org/r/177826 (owner: 10Cmjohnson) [16:28:00] does that affect ganglia at all? or just icinga? [16:28:09] was just seeing some weirdness in ganglia [16:28:15] around cluster names [16:29:17] <_joe_> both [16:29:41] <_joe_> we didn't set the 'cluster' global var because the hiera lookup failed [16:30:12] k, cool, that would explain it then [16:30:13] thanks [16:30:52] (03CR) 10Ottomata: "Actually, seeing as today is Friday, I will wait til next week to deploy this." [puppet] - 10https://gerrit.wikimedia.org/r/177546 (owner: 10Ottomata) [16:31:12] (03PS3) 10Alexandros Kosiaris: Introducing server heze [puppet] - 10https://gerrit.wikimedia.org/r/177808 [16:38:06] gwicke: I was thinking about the symlink thing in cassandra today. How about using ::cassandra_data_file_directories, ::cassandra_saved_caches_directory, ::cassandra_commitlog_directory ? [16:38:20] this will achieve the same thing and without symlinks, right ? [16:40:54] akosiaris: yeah, but for these test hosts it might not be worth it [16:41:27] I personally like to keep things in a known location, as that makes it easy to discover across machines [16:41:53] even if it's just a symlink [16:41:57] we all do... that is why I dislike the manual symlink approach [16:42:06] but test hosts, your call [16:42:23] anyway, reimaging the other two right now then, are you fine with that ? [16:42:29] canonical data dir is /var/lib/cassandra [16:42:35] in parallel ? 
[16:42:43] akosiaris: yeah, go ahead [16:44:44] !log rebooting analytics1021 [16:44:48] Logged the message, Master [16:51:33] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave] [16:57:08] (03PS2) 10Dzahn: phabricator: community metrics stats mail [puppet] - 10https://gerrit.wikimedia.org/r/177792 [17:15:19] (03PS1) 10Cmjohnson: Fixing dhcp entries for virt1010-1012 [puppet] - 10https://gerrit.wikimedia.org/r/177838 [17:15:47] ACKNOWLEDGEMENT - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave] alexandros kosiaris RT #8987, Telia acknowledged major USA outage [17:16:14] (03CR) 10Cmjohnson: [C: 032] Fixing dhcp entries for virt1010-1012 [puppet] - 10https://gerrit.wikimedia.org/r/177838 (owner: 10Cmjohnson) [17:17:47] cmjohnson: puppet is disabled on carbon [17:17:53] as the message says [17:17:59] give me a sec [17:18:04] ok [17:18:12] thx for letting me know [17:19:54] ok, fixed [17:19:57] but why ttyS1? [17:26:45] (03PS4) 10Alexandros Kosiaris: Introducing server heze [puppet] - 10https://gerrit.wikimedia.org/r/177808 [17:27:17] (03CR) 10Alexandros Kosiaris: [C: 032] "Thanks for catching the mistake Daniel!" [puppet] - 10https://gerrit.wikimedia.org/r/177808 (owner: 10Alexandros Kosiaris) [17:37:11] akosiaris: just for my understanding: node 'heze.eqiad.wmnet' has fixed-address heze.codfw.wmnet; ?(re: https://gerrit.wikimedia.org/r/#/c/177808/4 ) [17:38:33] (03CR) 10Se4598: "Is this correct: node 'heze.eqiad.wmnet' has fixed-address heze.codfw.wmnet ?" [puppet] - 10https://gerrit.wikimedia.org/r/177808 (owner: 10Alexandros Kosiaris) [17:39:42] se4598: yup [17:39:58] sigh [17:40:16] no wait, it is correct... 
[17:40:26] damn..... [17:40:32] it is wrong... se4598 thanks! [17:41:19] (03Abandoned) 10Umherirrender: Add new user rights 'editsitejs' and 'editsitecss' to user groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/155913 (owner: 10Umherirrender) [17:43:51] (03PS1) 10Alexandros Kosiaris: Fix typo in f09525d [puppet] - 10https://gerrit.wikimedia.org/r/177842 [17:46:57] (03CR) 10Alexandros Kosiaris: [C: 032] Fix typo in f09525d [puppet] - 10https://gerrit.wikimedia.org/r/177842 (owner: 10Alexandros Kosiaris) [17:50:19] PROBLEM - puppet last run on praseodymium is CRITICAL: CRITICAL: Puppet has 1 failures [17:53:22] (03PS1) 10Faidon Liambotis: Add temp d-i-test hostname for d-i tests [dns] - 10https://gerrit.wikimedia.org/r/177844 [17:55:09] (03CR) 10BryanDavis: [C: 031] add salt grain 'mediawiki-installation' in mw role (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/177801 (owner: 10Dzahn) [17:55:55] (03CR) 10Faidon Liambotis: [C: 032] Add temp d-i-test hostname for d-i tests [dns] - 10https://gerrit.wikimedia.org/r/177844 (owner: 10Faidon Liambotis) [18:02:24] RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:03:05] !log uploaded apertium-eo-en, apertium-id-ms to apt.wikimedia.org [18:03:07] Logged the message, Master [18:18:24] (03PS3) 10Yuvipanda: nagios_common: Rename check_http_generic appropriately [puppet] - 10https://gerrit.wikimedia.org/r/177719 [18:18:46] (03CR) 10Yuvipanda: [C: 032] nagios_common: Rename check_http_generic appropriately [puppet] - 10https://gerrit.wikimedia.org/r/177719 (owner: 10Yuvipanda) [18:20:38] (03CR) 10Filippo Giunchedi: [C: 031] hiera: role-based backend, role keyword [puppet] - 10https://gerrit.wikimedia.org/r/176334 (owner: 10Giuseppe Lavagetto) [18:21:05] eh, gitblit is down [18:21:59] MaxSem: people still use that? :) [18:21:59] MaxSem: try now? I restarted. 
[18:22:00] !log restart gitblit on antimony [18:22:03] Logged the message, Master [18:23:11] YuviPanda, now Error: 503, Service Unavailable :P [18:23:24] MaxSem: give it 30s or so :) [18:30:21] YuviPanda, thanks:) [18:30:29] MaxSem: :) [18:34:40] hmm [18:34:43] so I broke icinga [18:34:44] * YuviPanda checks [18:34:51] check_icinga [18:34:59] (03CR) 10Hashar: "@YuviPanda That is a good idea, we can even do it as a Jenkins job since Zuul consume events and can trigger a job like 'mail during weeke" [puppet] - 10https://gerrit.wikimedia.org/r/177521 (owner: 10Ori.livneh) [18:35:30] I see. [18:35:45] so I changed names of one of our commands. [18:36:08] so that needs puppet to run on all the individual instances (of varnish, in this instancE) [18:36:13] and then have it be collected [18:36:15] and then have icinga run on neon [18:40:16] (03PS1) 10Yuvipanda: nagios_common: Add compat checkcommand definition temporarily [puppet] - 10https://gerrit.wikimedia.org/r/177854 [18:41:29] (03CR) 10Yuvipanda: [C: 032] nagios_common: Add compat checkcommand definition temporarily [puppet] - 10https://gerrit.wikimedia.org/r/177854 (owner: 10Yuvipanda) [18:47:30] (03PS2) 10Ori.livneh: Emit alert when Ori commits on a weekend [puppet] - 10https://gerrit.wikimedia.org/r/177521 [18:48:15] (03CR) 10Ori.livneh: [C: 032 V: 032] Emit alert when Ori commits on a weekend [puppet] - 10https://gerrit.wikimedia.org/r/177521 (owner: 10Ori.livneh) [18:52:27] (03PS1) 10Ori.livneh: Fix file name from I2c4fd0d6e [puppet] - 10https://gerrit.wikimedia.org/r/177857 [18:52:42] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix file name from I2c4fd0d6e [puppet] - 10https://gerrit.wikimedia.org/r/177857 (owner: 10Ori.livneh) [18:58:12] alright, neon's ok now [19:03:39] paravoid: is puppet still disabled on carbon? 
[19:25:18] (03PS1) 10Cmjohnson: updating dhcpd with virt1010-1012 [puppet] - 10https://gerrit.wikimedia.org/r/177862 [19:26:04] (03PS2) 10Cmjohnson: updating dhcpd with virt1010-1012 [puppet] - 10https://gerrit.wikimedia.org/r/177862 [19:27:32] (03PS3) 10Cmjohnson: updating dhcpd with virt1010-1012 [puppet] - 10https://gerrit.wikimedia.org/r/177862 [19:28:34] (03CR) 10Cmjohnson: [C: 032] updating dhcpd with virt1010-1012 [puppet] - 10https://gerrit.wikimedia.org/r/177862 (owner: 10Cmjohnson) [19:29:16] (03PS1) 10Alexandros Kosiaris: Remove ssh::server inclusion [puppet] - 10https://gerrit.wikimedia.org/r/177863 [19:32:25] ori: hah, you merged the "ori's deploying at night!" patch :) [19:33:07] (03PS1) 10Alexandros Kosiaris: Minor lint in base manifests [puppet] - 10https://gerrit.wikimedia.org/r/177864 [19:38:18] (03PS1) 10Alexandros Kosiaris: Add labs-support1-b-codfw to install-server [puppet] - 10https://gerrit.wikimedia.org/r/177865 [19:40:52] (03CR) 10Alexandros Kosiaris: [C: 032] Add labs-support1-b-codfw to install-server [puppet] - 10https://gerrit.wikimedia.org/r/177865 (owner: 10Alexandros Kosiaris) [20:03:00] PROBLEM - puppet last run on mw1236 is CRITICAL: CRITICAL: Puppet has 1 failures [20:17:40] RECOVERY - puppet last run on mw1236 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:34:04] (03PS1) 10Cmjohnson: Revert "updating dhcpd with virt1010-1012" [puppet] - 10https://gerrit.wikimedia.org/r/177873 [20:34:41] (03CR) 10Cmjohnson: [C: 032] Revert "updating dhcpd with virt1010-1012" [puppet] - 10https://gerrit.wikimedia.org/r/177873 (owner: 10Cmjohnson) [21:04:27] Hey folks. I'm looking at the requestlog data and trying to make sense of requests that come in view SSL. Is there a place that I can look up the IP addresses of our SSL terminators? [21:08:30] (03PS1) 10Cscott: Allow OCG binaries to send/receive signals (AppArmor fixes). 
[puppet] - 10https://gerrit.wikimedia.org/r/177876 [21:10:08] halfak: I *think* that's anything with role::cache::* applied in https://github.com/wikimedia/operations-puppet/blob/production/manifests/site.pp -- hopefully there's a better list somewhere. [21:10:34] Thanks bd808. taking a look [21:12:08] bd808, I see that the role is associated with hostnames, but the logs have IPs. Any chance that extracting the IPs for this would be trivial? [21:12:35] dig 'hostname' from inside the cluster [21:12:49] or grep the dns config [21:13:11] https://github.com/wikimedia/operations-dns/tree/master/templates [21:13:12] Ooh.. dns config would work well. Is that in puppet too? [21:13:49] "wmnet" I presume? [21:14:04] yeah I think that's what you'd want [21:14:22] Great. I think this will work for me. :) [21:14:34] Thank you! [21:14:41] yw [21:27:38] RECOVERY - Kafka Broker Under Replicated Partitions on analytics1022 is OK: kafka.server.ReplicaManager.UnderReplicatedPartitions.Value OKAY: 0.0 [21:39:40] PROBLEM - puppet last run on amssq38 is CRITICAL: CRITICAL: puppet fail [21:51:19] i'm getting fatals at http://www.mediawiki.org/wiki/Topic:S2oqpzg1uvj03g2a?action=purge but nothing in fluorine:/a/mw-log/fatal.log, where else could the error be written to? [21:54:25] turns out they are in hhvm.log now [21:54:50] RECOVERY - puppet last run on amssq38 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:55:16] ebernhardson, logstash.wikimedia.org [21:55:41] MaxSem: how do i tail a file, make a request, and see my logs there? 
[21:55:49] MaxSem: it seems i have to sort through a giant mess of everything else :P [21:56:13] just open the fatalmonitor dashboard [22:16:35] PROBLEM - puppet last run on mw1183 is CRITICAL: CRITICAL: Puppet has 1 failures [22:17:55] PROBLEM - puppet last run on mw1084 is CRITICAL: CRITICAL: Puppet has 1 failures [22:18:57] legoktm: Hey, you mentioned in phabricator that we might have google webmaster tools set up for english wikipedia. Any idea who we'd need to talk to, for more information there? [22:19:09] I think Reedy might have access? [22:19:11] PROBLEM - puppet last run on mw1243 is CRITICAL: CRITICAL: Puppet has 1 failures [22:20:01] PROBLEM - puppet last run on mw1105 is CRITICAL: CRITICAL: Puppet has 1 failures [22:20:21] https://old-bugzilla.wikimedia.org/show_bug.cgi?id=73305#c3 [22:20:49] And stu apparently? [22:20:56] Huh. [22:21:35] and in SAL: 14:26 YuviPanda: made reedy 'full' user on webmaster tools [22:22:02] a few other entries https://wikitech.wikimedia.org/w/index.php?search=google+webmaster+tools&title=Special%3ASearch&go=Go [22:29:35] legoktm: Thanks for the leads. I'll see if we can make use of the old setup somehow. 
[22:30:12] RECOVERY - puppet last run on mw1084 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:30:19] :) [22:31:31] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 [22:31:52] RECOVERY - puppet last run on mw1183 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:34:22] RECOVERY - puppet last run on mw1243 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:35:02] RECOVERY - puppet last run on mw1105 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:37:06] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [22:37:09] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [22:42:01] PROBLEM - puppet last run on mw1091 is CRITICAL: CRITICAL: Puppet has 1 failures [22:57:28] RECOVERY - puppet last run on mw1091 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [22:59:34] (03PS1) 10BryanDavis: robots.txt: Use max lastmod time [mediawiki-config] - 10https://gerrit.wikimedia.org/r/177940 [23:27:15] Jeff_Green: my patch is at https://gerrit.wikimedia.org/r/#/c/177940/ -- probably not worth the potential pain of a last Friday deploy if the purge and null edit worked [23:27:43] bd808: agreed, we're good for now and it can wait [23:28:14] We can try it out on Monday then [23:51:54] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [23:51:57] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected