[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151022T0000). [00:01:10] 6operations, 10ops-eqiad: Reclaim einsteinium.eqiad.wmnet for idle pool - https://phabricator.wikimedia.org/T116252#1744494 (10chasemp) 3NEW a:3RobH [00:02:06] 6operations, 10ops-eqiad: wipe einsteinium disks - https://phabricator.wikimedia.org/T116253#1744504 (10RobH) 3NEW a:3Cmjohnson [00:03:34] !log ori@tin Synchronized php-1.27.0-wmf.2/extensions/AbuseFilter: Ice1b6da43: AbuseFilter: don't install custom error handler and I0ecdcdd142: Use isset() to check array element exists rather than relying on @ operator (duration: 00m 18s) [00:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:05:15] !log ori@tin Synchronized php-1.27.0-wmf.3/extensions/AbuseFilter: Ice1b6da43: AbuseFilter: don't install custom error handler and I0ecdcdd142: Use isset() to check array element exists rather than relying on @ operator (duration: 00m 18s) [00:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:05:23] (03PS1) 10RobH: reclaim einsteinium to spares [dns] - 10https://gerrit.wikimedia.org/r/247944 [00:05:35] 6operations: Off Boarding - Remove users - https://phabricator.wikimedia.org/T116248#1744516 (10Krenair) Please add #operations to tickets that involve modifying exim aliases (or anything else requiring modification of the puppet-private repository) [00:05:57] PROBLEM - SSL-LDAP on pollux is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [00:06:08] 6operations, 10ops-eqiad: Reclaim einsteinium.eqiad.wmnet for idle pool - https://phabricator.wikimedia.org/T116252#1744519 (10RobH) [00:06:15] (03PS2) 10Alex Monk: reclaim einsteinium to spares [dns] - 10https://gerrit.wikimedia.org/r/247944 (https://phabricator.wikimedia.org/T116252) (owner: 10RobH) 
[00:06:34] 6operations, 10ops-eqiad: Reclaim einsteinium.eqiad.wmnet for spares - https://phabricator.wikimedia.org/T116252#1744527 (10RobH) [00:07:01] mutante, this will inevitably start alerting again at some point, fyi [00:07:02] meh i dislike the bug [00:07:02] (03CR) 10Dzahn: "in site.pp please use "role spare" now (the role keyword instead of an include)" [dns] - 10https://gerrit.wikimedia.org/r/247944 (https://phabricator.wikimedia.org/T116252) (owner: 10RobH) [00:07:06] PROBLEM - SSL-LDAP on plutonium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [00:07:08] i dont like having to pull the tag review off my tasks [00:07:09] =p [00:07:25] these are brand new checks, i'm on it [00:07:26] is it required? [00:07:29] SSL-LDAP checks [00:07:58] Krenair: what was that about? mira? [00:07:59] Krenair: is bug: required now? I thought it could still just be the task and no bug: ? [00:08:16] mutante, yes [00:08:27] Krenair: ok:) [00:08:30] robh, it's required to link the gerrit change from the ticket [00:08:34] AFAIK [00:08:40] certainly needs the 'T' [00:08:56] did i miss the t? my bad. i just dislike the patchset for review project appending onto things [00:09:01] ACKNOWLEDGEMENT - SSL-LDAP on plutonium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused daniel_zahn new checks for corp OIT mirror [00:09:01] ACKNOWLEDGEMENT - SSL-LDAP on pollux is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused daniel_zahn new checks for corp OIT mirror [00:09:06] once its merged it shows in the history of a task if its missing the bug: though [00:09:14] afaik, i may be wrong if it changed recently [00:09:15] oh well [00:09:40] mutante: spare? we wipe the system and its not calling into puppet [00:09:46] why would it stay in site.pp? [00:09:51] we dont list every spare in site.pp. 
[00:10:00] (we should do all of one or the other, not a mix of both) [00:10:02] (03PS1) 10Dzahn: wmfusercontent.org - add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247945 [00:10:20] and if its in spares, half the time they come out of spares with a different hostname [00:10:33] robh: i don't know. because others are. puppet/manifests$ grep "role spare" site.pp . my comment was just about using "role" instead of "include role" now [00:10:43] because we had to touch some existing spares for that [00:10:45] im removing it from site.pp entirely. [00:10:54] 7Puppet, 6operations: Move misc::maintenance into a module - https://phabricator.wikimedia.org/T107672#1744533 (10scfc) [00:10:55] 7Puppet, 6operations: Move misc::udp2log into a module - https://phabricator.wikimedia.org/T107671#1744534 (10scfc) [00:10:56] 7Puppet, 6operations: Move role::otrs into a module - https://phabricator.wikimedia.org/T107670#1744535 (10scfc) [00:10:58] 7Puppet, 6operations, 5Patch-For-Review: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#1744532 (10scfc) [00:11:06] having some spares in there and some not is pointless and im not about to volunteer to track them in site.pp and on my spares page. [00:11:40] 7Puppet, 6operations, 5Patch-For-Review: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#1744536 (10chasemp) 5Open>3Resolved a:3chasemp well played @scfc [00:11:42] (that is my viewpoint of course, i should have started with that ;) [00:12:20] (03CR) 10RobH: [C: 032] reclaim einsteinium to spares [dns] - 10https://gerrit.wikimedia.org/r/247944 (https://phabricator.wikimedia.org/T116252) (owner: 10RobH) [00:12:32] i guess it depends whether they change their hostname or not when they come back. but no real opinion. just saw existing ones. 
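Dzahn's review comment above ("use "role spare" now, the role keyword instead of an include") contrasts two site.pp styles; a minimal sketch of the difference (the node block below is illustrative, not the actual site.pp entry, and `role` is a Wikimedia-local puppet helper rather than core Puppet syntax):

```puppet
# Older style: pull the class in with a plain include
node 'einsteinium.eqiad.wmnet' {
    include role::spare
}

# Newer style requested in review: the 'role' keyword
node 'einsteinium.eqiad.wmnet' {
    role spare
}
```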
[00:13:19] yea we discussed it in the past, tracking in site.pp [00:13:22] 7Puppet, 6operations, 5Patch-For-Review: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#1744544 (10scfc) (Prior to c41b180660e96270e9d90cd66634caa90e0e7029, `--no-puppet_url_without_modules-check` was not enabled, so the blocking tasks made sense then.... [00:13:26] but then every time i go to allocate i [00:13:36] rob "bug|task: T123" is our bot thing and does that tag, "ref T123" is a phab thing and will link the task/commit post merge [00:13:37] (03CR) 10Dzahn: [C: 032] wmfusercontent.org - add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247945 (owner: 10Dzahn) [00:13:38] but then every time i go to allocate i'd have to check the git log and ensure no one took one that wasnt tracked in a task [00:13:38] fyi :) [00:13:44] just easier to not make it the authoritative thing [00:13:56] chasemp: Yep! i like the post commit mention myself [00:14:01] same [00:14:06] unless im trying to get someone to review for me, which is never the case for decoms [00:14:23] i know folks like and have real use for the patchset for review tag, i just rarely need it ;] [00:14:39] (that wasnt meant to sound shitty and elitist! its cuz i do simple fucking patches is all ;) [00:14:53] 6operations: Investigate idle/depooled eqiad api appserver - https://phabricator.wikimedia.org/T116254#1744545 (10Reedy) 3NEW [00:15:22] Reedy: thx for finding those [00:15:37] i may steal them for imagescalers ;] [00:15:40] they are idle after all! [00:15:59] There seems to be a handful of idle non api app servers [00:16:19] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 10hardware-requests: Assign 3 more servers to video scaler duty - https://phabricator.wikimedia.org/T114337#1744554 (10RobH) Alternatively, T116254 lists three idle apaches that we could use. 
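chasemp's note above distinguishes the two task-linking conventions; a sketch of both in a single Gerrit commit message (the task number is the one from this log, the subject line and Change-Id are placeholders):

```
reclaim einsteinium to spares

ref T116252

Bug: T116252
Change-Id: I0000000000000000000000000000000000000000
```

Per the discussion, the "Bug:"/"Task:" footer is what the channel bot keys on (and apparently what triggers the Patch-For-Review tagging robh dislikes), while "ref T116252" in the body is Phabricator's own syntax and only links task and commit once the change is merged.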
[00:16:21] 4 of those [00:16:31] oh, wait, api, nm, i wanna steal non api [00:16:46] yeah, like I say, I think there's 4 non api that are idle :P [00:16:46] 6operations, 7Monitoring: Monitor APC usage on application servers - https://phabricator.wikimedia.org/T116255#1744559 (10Krinkle) 3NEW [00:16:50] comment removed. [00:17:07] seems like an audit of all apache status is well in order. [00:17:08] Krinkle: Didn't bd808 write a script for that a while ago? [00:18:17] 6operations, 7Monitoring: Monitor APC usage on application servers - https://phabricator.wikimedia.org/T116255#1744566 (10Krinkle) [00:18:36] (03PS1) 10Ori.livneh: $wgMathCheckFiles = false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247948 [00:19:34] (03CR) 10Ori.livneh: [C: 032] $wgMathCheckFiles = false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247948 (owner: 10Ori.livneh) [00:19:39] (03Merged) 10jenkins-bot: $wgMathCheckFiles = false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247948 (owner: 10Ori.livneh) [00:20:33] !log ori@tin Synchronized wmf-config/CommonSettings.php: Ibf752b832: $wgMathCheckFiles = false (duration: 00m 18s) [00:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:21:03] 6operations: Investigate idle/depooled eqiad appservers - https://phabricator.wikimedia.org/T116256#1744570 (10Reedy) 3NEW [00:21:19] (03PS1) 10RobH: Reclaim einsteinium.eqiad.wmnet for spares [puppet] - 10https://gerrit.wikimedia.org/r/247949 [00:21:53] (03CR) 10RobH: [C: 032] Reclaim einsteinium.eqiad.wmnet for spares [puppet] - 10https://gerrit.wikimedia.org/r/247949 (owner: 10RobH) [00:22:24] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 10hardware-requests: Assign 3 more servers to video scaler duty - https://phabricator.wikimedia.org/T114337#1744581 (10Reedy) T116256 contains 3 eqiad appservers (non api) that have been idle for at least a month, and a 4th recently idle [00:24:09] 6operations, 7Monitoring: Monitor APC usage 
on application servers - https://phabricator.wikimedia.org/T116255#1744586 (10Reedy) I thought @bd808 wrote something for this previously... [00:24:31] 6operations, 10ops-eqiad: Reclaim einsteinium.eqiad.wmnet for spares - https://phabricator.wikimedia.org/T116252#1744588 (10RobH) [00:25:16] 6operations, 5Patch-For-Review: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1744592 (10Dzahn) wmfusercontent done: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=phab.wmfusercontent.org&service=HTTPS-wmfusercontent LDAP (... [00:30:14] (03PS1) 10Dzahn: Revert "openldap: add SSL cert expiry check" [puppet] - 10https://gerrit.wikimedia.org/r/247951 [00:30:46] (03PS2) 10Dzahn: Revert "openldap: add SSL cert expiry check" [puppet] - 10https://gerrit.wikimedia.org/r/247951 [00:31:06] (03PS3) 10Dzahn: Revert "openldap: add SSL cert expiry check" [puppet] - 10https://gerrit.wikimedia.org/r/247951 [00:31:29] (03CR) 10Dzahn: [C: 032] Revert "openldap: add SSL cert expiry check" [puppet] - 10https://gerrit.wikimedia.org/r/247951 (owner: 10Dzahn) [00:32:52] 6operations, 5Patch-For-Review: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1744596 (10Dzahn) re: LDAP-mirror: nothing to monitor here, because Icinga said: "SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connecti... [00:38:34] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [00:45:45] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
[00:49:36] !log krenair@mira Synchronized README: (no message) (duration: 00m 16s) [00:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:49:49] as expected, did nothing bd808 [00:51:04] (03PS1) 10Dzahn: site.pp: remove virt100[1-9] [puppet] - 10https://gerrit.wikimedia.org/r/247953 [00:52:09] (03PS2) 10Dzahn: site.pp: remove virt100[1-9] [puppet] - 10https://gerrit.wikimedia.org/r/247953 [00:52:27] (03PS3) 10Dzahn: site.pp: remove virt100[1-9] [puppet] - 10https://gerrit.wikimedia.org/r/247953 [00:55:06] !log krenair@mira Synchronized README: (no message) (duration: 00m 17s) [00:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:56:02] (03PS1) 10Ori.livneh: Add Diamond collector for HHVM APC stats [puppet] - 10https://gerrit.wikimedia.org/r/247956 (https://phabricator.wikimedia.org/T116255) [00:56:21] (03PS2) 10Ori.livneh: Add Diamond collector for HHVM APC stats [puppet] - 10https://gerrit.wikimedia.org/r/247956 (https://phabricator.wikimedia.org/T116255) [00:56:27] (03CR) 10Ori.livneh: [C: 032 V: 032] Add Diamond collector for HHVM APC stats [puppet] - 10https://gerrit.wikimedia.org/r/247956 (https://phabricator.wikimedia.org/T116255) (owner: 10Ori.livneh) [00:56:32] !log krenair@mira Synchronized README: (no message) (duration: 00m 16s) [00:58:58] (03PS1) 10Ori.livneh: Follow-up for I87682f90: tweak metric path [puppet] - 10https://gerrit.wikimedia.org/r/247957 [00:59:09] (03CR) 10Ori.livneh: [C: 032 V: 032] Follow-up for I87682f90: tweak metric path [puppet] - 10https://gerrit.wikimedia.org/r/247957 (owner: 10Ori.livneh) [01:00:14] !log ori@tin Synchronized php-1.27.0-wmf.3/includes/Hooks.php: I0e5f2d3b2: Make hookErrorHandler() only care about serious signature errors (duration: 00m 17s) [01:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:00:37] !log ori@tin Synchronized php-1.27.0-wmf.2/includes/Hooks.php: I0e5f2d3b2: Make 
hookErrorHandler() only care about serious signature errors (duration: 00m 17s) [01:01:46] 6operations, 7Monitoring, 5Patch-For-Review: Monitor APC usage on application servers - https://phabricator.wikimedia.org/T116255#1744613 (10Reedy) >>! In T116255#1744586, @Reedy wrote: > I thought @bd808 wrote something for this previously... https://gist.github.com/bd808/867dda34698717f11e8b [01:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:10:09] !log krenair@mira Synchronized README: testing sync from mira (duration: 00m 17s) [01:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:10:16] (03PS1) 10Ori.livneh: Fix typo in HHVM APC collector [puppet] - 10https://gerrit.wikimedia.org/r/247958 [01:10:32] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix typo in HHVM APC collector [puppet] - 10https://gerrit.wikimedia.org/r/247958 (owner: 10Ori.livneh) [01:17:50] 6operations, 5Patch-For-Review: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1744625 (10Dzahn) @andrewbogott which service on which port uses the virt-star cert? i see virt100x compute nodes had it, and exist in site.pp but _not in DNS_ and labv... 
[01:21:10] (03PS1) 10Ori.livneh: Follow-up for I87682f90: tweak metric path [puppet] - 10https://gerrit.wikimedia.org/r/247959 [01:21:22] (03CR) 10Ori.livneh: [C: 032 V: 032] Follow-up for I87682f90: tweak metric path [puppet] - 10https://gerrit.wikimedia.org/r/247959 (owner: 10Ori.livneh) [01:22:33] (03PS2) 10Dzahn: admin: add tjones to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/247455 (https://phabricator.wikimedia.org/T115880) [01:22:58] (03PS3) 10Dzahn: admin: add tjones to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/247455 (https://phabricator.wikimedia.org/T115880) [01:23:05] (03CR) 10Dzahn: [C: 032] admin: add tjones to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/247455 (https://phabricator.wikimedia.org/T115880) (owner: 10Dzahn) [01:23:56] (03PS1) 10Ori.livneh: Fix typo in HHVM APC collector [puppet] - 10https://gerrit.wikimedia.org/r/247960 [01:24:06] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix typo in HHVM APC collector [puppet] - 10https://gerrit.wikimedia.org/r/247960 (owner: 10Ori.livneh) [01:27:50] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: adding tjones to analytics-privatedata-users (hive and webrequests) - https://phabricator.wikimedia.org/T115880#1744638 (10Dzahn) @Tjones this is done now. on stat1002: [stat1002:~] $ id tjones uid=12510(tjones) gid=500(wikidev) groups=500(wikidev),725(... [01:27:59] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: adding tjones to analytics-privatedata-users (hive and webrequests) - https://phabricator.wikimedia.org/T115880#1744639 (10Dzahn) 5Open>3Resolved [01:28:08] 10Ops-Access-Requests, 6operations: adding tjones to analytics-privatedata-users (hive and webrequests) - https://phabricator.wikimedia.org/T115880#1735214 (10Dzahn) [01:33:01] 6operations, 7Monitoring, 5Patch-For-Review: Monitor APC usage on application servers - https://phabricator.wikimedia.org/T116255#1744646 (10ori) a:3ori Done. 
Check the `servers.*.hhvm.apc` hierarchy. (It'll take a while to roll out to all app servers, but mw1260 already has it, if you want to check it out.) [01:33:32] 6operations, 7Monitoring, 5Patch-For-Review: Monitor APC usage on application servers - https://phabricator.wikimedia.org/T116255#1744648 (10ori) 5Open>3Resolved [02:30:48] !log l10nupdate@tin Synchronized php-1.27.0-wmf.2/cache/l10n: l10nupdate for 1.27.0-wmf.2 (duration: 08m 44s) [02:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:33:56] 6operations, 10ops-eqiad: Reclaim einsteinium.eqiad.wmnet for spares - https://phabricator.wikimedia.org/T116252#1744668 (10RobH) [02:35:41] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.2) at 2015-10-22 02:35:41+00:00 [02:35:48] 6operations, 10ops-eqiad: Reclaim einsteinium.eqiad.wmnet for spares - https://phabricator.wikimedia.org/T116252#1744494 (10RobH) also i show in task history my removing @krenair but i don't recall that, i imagine it was a mistaken backspace when tabbing through editing the description? 
(adding him back) [02:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:37:13] !log Started running 5 threads of enwiki refreshLinks jobs on tin [02:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:37:22] 6operations, 10ops-eqiad: Reclaim einsteinium.eqiad.wmnet for spares - https://phabricator.wikimedia.org/T116252#1744671 (10Krenair) that'd just be because phabricator doesn't attempt to handle edit conflicts, it just overwrites [03:00:43] !log l10nupdate@tin Synchronized php-1.27.0-wmf.3/cache/l10n: l10nupdate for 1.27.0-wmf.3 (duration: 07m 52s) [03:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:05:21] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.3) at 2015-10-22 03:05:21+00:00 [03:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:52:48] 6operations, 10Traffic, 7Availability: Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#1744679 (10aaron) [03:53:25] 6operations, 10Traffic, 7Availability: Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#1096885 (10aaron) [04:33:53] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [04:34:03] (03PS1) 10Ori.livneh: varnish: add prototype cookie-based backend selection [puppet] - 10https://gerrit.wikimedia.org/r/247970 (https://phabricator.wikimedia.org/T91820) [04:34:53] bblack: ^ [04:35:34] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [05:10:01] !log aaron@tin Synchronized php-1.27.0-wmf.3/includes/deferred/LinksUpdate.php: fe323f9b68bbb (duration: 00m 17s) [05:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:13:41] woo! 
category additions on commons are now happening right away [05:18:01] bawolff: http://graphite.wikimedia.org/render/?width=1887&height=960&_salt=1445491071.663&target=MediaWiki.jobqueue.inserts.refreshLinksPrioritized.count&from=-3hours [05:18:29] http://graphite.wikimedia.org/render/?width=1887&height=960&_salt=1445491096.703&from=-3hours&target=MediaWiki.jobqueue.inserts.refreshLinksPrioritized.count&target=MediaWiki.jobqueue.inserts.refreshLinks.count is clearer [05:20:00] bawolff: I still wonder how git review -d and adding the timestamp part made the other change rollback [05:20:13] that kind of thing is very annoying [05:36:50] 6operations, 10Analytics, 6Discovery, 10EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1744752 (10Smalyshev) @GWicke I would be interested to participate. I'll be in the office, could you add me to the invite? [06:14:28] !log Restarted hhvm on mw1011, it was stuck doing nothing at 100 cpu [06:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:17:40] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Oct 22 06:17:39 UTC 2015 (duration 17m 38s) [06:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:18:42] <_joe_> AaronSchulz: did you by any chance take a quickstack dump before restarting it? [06:19:06] <_joe_> (also, thanks for acting :)) [06:22:05] no dump this time, what's the preferred command? 
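The render links pasted above use Graphite's HTTP API directly; the same query can be built programmatically. A minimal sketch in Python (format=json is a standard Graphite render parameter; the width, height and _salt arguments in the pasted links only affect the rendered PNG and are omitted here):

```python
from urllib.parse import urlencode

GRAPHITE = "http://graphite.wikimedia.org"

def render_url(targets, frm="-3hours", fmt="json"):
    """Build a Graphite /render URL for one or more metric targets."""
    params = [("from", frm), ("format", fmt)]
    params += [("target", t) for t in targets]
    return GRAPHITE + "/render/?" + urlencode(params)

# The two job-queue insert metrics compared in the log above:
url = render_url([
    "MediaWiki.jobqueue.inserts.refreshLinksPrioritized.count",
    "MediaWiki.jobqueue.inserts.refreshLinks.count",
])
print(url)
```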
[06:22:21] hhvm-dump-debug [06:22:51] gets you a quickstack dump, written to both stdout and /tmp/hhvm.$(pidof -s hhvm).bt [06:23:25] (feel free to try it on another app server) [06:29:13] <_joe_> sorry I disappeared, a cough crisis (again) [06:30:14] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:24] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:13] PROBLEM - puppet last run on ms-be1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:24] PROBLEM - puppet last run on mw2069 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:35] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:04] PROBLEM - puppet last run on mw1220 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:15] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:54] PROBLEM - puppet last run on mw1112 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:24] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:04] PROBLEM - puppet last run on mw2120 is CRITICAL: CRITICAL: Puppet has 1 failures [06:53:54] (03CR) 10Alexandros Kosiaris: "Yes, that's because that OpenLDAP installation supports the STARTTLS command present in LDAP v3 protocol which allows upgrading the unencr" [puppet] - 10https://gerrit.wikimedia.org/r/247951 (owner: 10Dzahn) [06:54:34] RECOVERY - puppet last run on ms-be1002 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:54:47] (03CR) 10Alexandros Kosiaris: "Or the equivalent of -T which is --starttls, which is clearer" [puppet] - 10https://gerrit.wikimedia.org/r/247951 (owner: 10Dzahn) [06:54:55] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:55:34] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:55:35] RECOVERY - 
puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:56:14] RECOVERY - puppet last run on mw1112 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:56:34] RECOVERY - puppet last run on mw2069 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:43] PROBLEM - puppet last run on mw2057 is CRITICAL: CRITICAL: puppet fail [06:56:45] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:57:07] 6operations, 5Patch-For-Review: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1744789 (10akosiaris) Yes there is. Don't reclaim it please. I plan to work on it next week [06:57:13] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:23] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:24] RECOVERY - puppet last run on mw2120 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:07:29] (03PS10) 10Alexandros Kosiaris: maps: Add tileratorui service [puppet] - 10https://gerrit.wikimedia.org/r/244436 (https://phabricator.wikimedia.org/T116062) [07:07:37] (03CR) 10Alexandros Kosiaris: [C: 032] maps: Add tileratorui service [puppet] - 10https://gerrit.wikimedia.org/r/244436 (https://phabricator.wikimedia.org/T116062) (owner: 10Alexandros Kosiaris) [07:07:43] (03CR) 10Alexandros Kosiaris: [V: 032] maps: Add tileratorui service [puppet] - 10https://gerrit.wikimedia.org/r/244436 (https://phabricator.wikimedia.org/T116062) (owner: 10Alexandros Kosiaris) [07:12:17] (03CR) 10Nikerabbit: cxserver: Add JWT token support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/247819 (https://phabricator.wikimedia.org/T116134) (owner: 10KartikMistry) [07:17:54] (03PS1) 
10Alexandros Kosiaris: service::node: use repo value for CWD and Env [puppet] - 10https://gerrit.wikimedia.org/r/247988 [07:22:33] (03CR) 10Alexandros Kosiaris: cxserver: Add JWT token support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/247819 (https://phabricator.wikimedia.org/T116134) (owner: 10KartikMistry) [07:23:11] (03CR) 10Alexandros Kosiaris: [C: 032] service::node: use repo value for CWD and Env [puppet] - 10https://gerrit.wikimedia.org/r/247988 (owner: 10Alexandros Kosiaris) [07:23:54] RECOVERY - puppet last run on mw2057 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [07:26:10] (03PS1) 10Ori.livneh: ~ori: updated deployment scripts [puppet] - 10https://gerrit.wikimedia.org/r/247989 [07:26:30] (03PS2) 10Ori.livneh: ~ori: updated deployment scripts [puppet] - 10https://gerrit.wikimedia.org/r/247989 [07:26:43] (03CR) 10Ori.livneh: [C: 032 V: 032] ~ori: updated deployment scripts [puppet] - 10https://gerrit.wikimedia.org/r/247989 (owner: 10Ori.livneh) [07:35:44] (03CR) 10Alexandros Kosiaris: [C: 032] tilerator should not expose admin UI [puppet] - 10https://gerrit.wikimedia.org/r/244884 (owner: 10Yurik) [07:35:50] (03PS2) 10Alexandros Kosiaris: tilerator should not expose admin UI [puppet] - 10https://gerrit.wikimedia.org/r/244884 (owner: 10Yurik) [07:36:25] (03CR) 10Alexandros Kosiaris: "Change https://gerrit.wikimedia.org/r/#/c/244436/ has been merged, merging this as well" [puppet] - 10https://gerrit.wikimedia.org/r/244884 (owner: 10Yurik) [07:41:13] !log ori@tin Synchronized php-1.27.0-wmf.2/includes/jobqueue/JobQueueRedis.php: Ie7c544fc8: jobqueue: track real job inserts as inserts_actual & I627e8f6ce: JobQueueRedis::doBatchPush(): report metrics even when failures occur (duration: 07m 33s) [07:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:41:40] !log ori@tin Synchronized php-1.27.0-wmf.3/includes/jobqueue/JobQueueRedis.php: Ie7c544fc8: jobqueue: track real 
job inserts as inserts_actual & I627e8f6ce: JobQueueRedis::doBatchPush(): report metrics even when failures occur (duration: 00m 17s) [07:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:45:56] (03PS6) 10KartikMistry: cxserver: Add JWT token support [puppet] - 10https://gerrit.wikimedia.org/r/247819 (https://phabricator.wikimedia.org/T116134) [07:46:56] (03PS7) 10KartikMistry: cxserver: Add JWT token support [puppet] - 10https://gerrit.wikimedia.org/r/247819 (https://phabricator.wikimedia.org/T116134) [07:49:49] (03CR) 10Nikerabbit: cxserver: Add JWT token support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/247819 (https://phabricator.wikimedia.org/T116134) (owner: 10KartikMistry) [08:04:18] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 out: 300 virgin: 25) [08:06:10] 6operations, 10Analytics, 6Discovery, 10EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1744891 (10mobrovac) >>! In T114443#1744752, @Smalyshev wrote: > @GWicke I would be interested to participate. I'll be in the office, could you add me to the invite? Done. [08:17:29] (03CR) 10Alexandros Kosiaris: cxserver: Add JWT token support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/247819 (https://phabricator.wikimedia.org/T116134) (owner: 10KartikMistry) [08:17:31] (03PS8) 10KartikMistry: cxserver: Add JWT token support [puppet] - 10https://gerrit.wikimedia.org/r/247819 (https://phabricator.wikimedia.org/T116134) [08:22:29] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. 
[08:22:59] (03PS1) 10Alexandros Kosiaris: aqs: Add LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/247994 (https://phabricator.wikimedia.org/T116245) [08:35:21] (03PS1) 10Muehlenhoff: Restrict access to redis on abacist [puppet] - 10https://gerrit.wikimedia.org/r/247995 [08:42:54] (03CR) 10Alexandros Kosiaris: [C: 04-1] "One very minor nitpick, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/247819 (https://phabricator.wikimedia.org/T116134) (owner: 10KartikMistry) [08:43:08] (03PS1) 10Mobrovac: RESTBase: make the port fully configurable and change for AQS to 7232 [puppet] - 10https://gerrit.wikimedia.org/r/247996 (https://phabricator.wikimedia.org/T116245) [08:47:38] (03PS9) 10KartikMistry: cxserver: Add JWT token support [puppet] - 10https://gerrit.wikimedia.org/r/247819 (https://phabricator.wikimedia.org/T116134) [08:48:28] (03CR) 10Alexandros Kosiaris: [C: 032] cxserver: Add JWT token support [puppet] - 10https://gerrit.wikimedia.org/r/247819 (https://phabricator.wikimedia.org/T116134) (owner: 10KartikMistry) [08:48:35] (03PS10) 10Alexandros Kosiaris: cxserver: Add JWT token support [puppet] - 10https://gerrit.wikimedia.org/r/247819 (https://phabricator.wikimedia.org/T116134) (owner: 10KartikMistry) [08:50:59] (03CR) 10Alexandros Kosiaris: [C: 032] RESTBase: make the port fully configurable and change for AQS to 7232 [puppet] - 10https://gerrit.wikimedia.org/r/247996 (https://phabricator.wikimedia.org/T116245) (owner: 10Mobrovac) [08:51:13] (03CR) 10Alexandros Kosiaris: "http://puppet-compiler.wmflabs.org/1052/ says it's doing what it is supposed to do, so merging" [puppet] - 10https://gerrit.wikimedia.org/r/247996 (https://phabricator.wikimedia.org/T116245) (owner: 10Mobrovac) [08:51:17] (03PS2) 10Alexandros Kosiaris: RESTBase: make the port fully configurable and change for AQS to 7232 [puppet] - 10https://gerrit.wikimedia.org/r/247996 (https://phabricator.wikimedia.org/T116245) (owner: 10Mobrovac) [08:57:48] (03CR) 
10Mobrovac: [C: 04-1] aqs: Add LVS configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/247994 (https://phabricator.wikimedia.org/T116245) (owner: 10Alexandros Kosiaris) [08:58:47] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7232): Max retries exceeded with url: /analytics.wikimedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [08:59:06] akosiaris: ^^^ [08:59:22] yes, that is expected [08:59:32] akosiaris: i'll restart restbase on aqs100x [08:59:38] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7232): Max retries exceeded with url: /analytics.wikimedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [08:59:48] I am fixing it already [08:59:59] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7232): Max retries exceeded with url: /analytics.wikimedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [09:00:21] ah [09:01:07] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [09:02:58] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [09:03:12] (03CR) 10Alexandros Kosiaris: aqs: Add LVS configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/247994 (https://phabricator.wikimedia.org/T116245) (owner: 10Alexandros Kosiaris) [09:03:17] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [09:03:41] (03PS2) 10Alexandros Kosiaris: aqs: Add LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/247994 (https://phabricator.wikimedia.org/T116245) [09:04:09] (03PS2) 10Muehlenhoff: Assign salt grains for deployment servers [puppet] - 
10https://gerrit.wikimedia.org/r/247877
[09:07:55] (03CR) 10Alexandros Kosiaris: [C: 032] aqs: Add LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/247994 (https://phabricator.wikimedia.org/T116245) (owner: 10Alexandros Kosiaris)
[09:12:27] PROBLEM - puppet last run on mw2151 is CRITICAL: CRITICAL: puppet fail
[09:12:36] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: puppet fail
[09:12:57] PROBLEM - puppet last run on mw2089 is CRITICAL: CRITICAL: puppet fail
[09:13:16] PROBLEM - puppet last run on lvs1001 is CRITICAL: CRITICAL: puppet fail
[09:13:27] PROBLEM - puppet last run on cp1063 is CRITICAL: CRITICAL: puppet fail
[09:13:28] PROBLEM - puppet last run on mw1130 is CRITICAL: CRITICAL: puppet fail
[09:13:38] PROBLEM - puppet last run on cp1069 is CRITICAL: CRITICAL: puppet fail
[09:13:56] PROBLEM - puppet last run on lvs1005 is CRITICAL: CRITICAL: puppet fail
[09:13:56] PROBLEM - puppet last run on lvs1002 is CRITICAL: CRITICAL: puppet fail
[09:13:57] PROBLEM - puppet last run on mw2041 is CRITICAL: CRITICAL: puppet fail
[09:13:57] PROBLEM - puppet last run on mw1081 is CRITICAL: CRITICAL: puppet fail
[09:13:57] PROBLEM - puppet last run on mw1159 is CRITICAL: CRITICAL: puppet fail
[09:14:06] PROBLEM - puppet last run on mw2103 is CRITICAL: CRITICAL: puppet fail
[09:14:07] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: puppet fail
[09:14:07] PROBLEM - puppet last run on mw1056 is CRITICAL: CRITICAL: puppet fail
[09:14:08] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: puppet fail
[09:14:08] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: puppet fail
[09:14:11] omg
[09:14:16] what happened
[09:14:17] PROBLEM - puppet last run on mw2194 is CRITICAL: CRITICAL: puppet fail
[09:14:17] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: puppet fail
[09:14:18] PROBLEM - puppet last run on mw2160 is CRITICAL: CRITICAL: puppet fail
[09:14:28] PROBLEM - puppet last run on mw2179 is CRITICAL: CRITICAL: puppet fail
[09:14:46] PROBLEM - puppet last run on mw2102 is CRITICAL: CRITICAL: puppet fail
[09:15:14] (03PS1) 10Alexandros Kosiaris: LVS: fix yaml indentation typo for aqs [puppet] - 10https://gerrit.wikimedia.org/r/247999
[09:15:17] PROBLEM - puppet last run on mw1089 is CRITICAL: CRITICAL: puppet fail
[09:15:27] PROBLEM - puppet last run on mw2189 is CRITICAL: CRITICAL: puppet fail
[09:15:33] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] LVS: fix yaml indentation typo for aqs [puppet] - 10https://gerrit.wikimedia.org/r/247999 (owner: 10Alexandros Kosiaris)
[09:15:37] PROBLEM - puppet last run on mw2201 is CRITICAL: CRITICAL: puppet fail
[09:15:46] RECOVERY - puppet last run on lvs1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:15:47] PROBLEM - puppet last run on mw1243 is CRITICAL: CRITICAL: puppet fail
[09:15:47] PROBLEM - puppet last run on mw1225 is CRITICAL: CRITICAL: puppet fail
[09:15:58] PROBLEM - puppet last run on cp1072 is CRITICAL: CRITICAL: puppet fail
[09:16:46] PROBLEM - puppet last run on mw1024 is CRITICAL: CRITICAL: puppet fail
[09:17:06] PROBLEM - puppet last run on mw2031 is CRITICAL: CRITICAL: puppet fail
[09:17:07] PROBLEM - puppet last run on mw2015 is CRITICAL: CRITICAL: puppet fail
[09:17:16] PROBLEM - puppet last run on mw2075 is CRITICAL: CRITICAL: puppet fail
[09:17:27] PROBLEM - puppet last run on mw2109 is CRITICAL: CRITICAL: puppet fail
[09:17:37] PROBLEM - puppet last run on mw1106 is CRITICAL: CRITICAL: puppet fail
[09:17:37] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: puppet fail
[09:17:37] PROBLEM - puppet last run on mw1150 is CRITICAL: CRITICAL: puppet fail
[09:17:50] (03PS1) 10Faidon Liambotis: Add A/AAAA for mr1-eqiad's OOB link [dns] - 10https://gerrit.wikimedia.org/r/248000 (https://phabricator.wikimedia.org/T113771)
[09:17:56] PROBLEM - puppet last run on mw2161 is CRITICAL: CRITICAL: puppet fail
[09:17:56] PROBLEM - puppet last run on mw2188 is CRITICAL: CRITICAL: puppet fail
[09:17:57] PROBLEM - puppet last run on mw1027 is CRITICAL: CRITICAL: puppet fail
[09:17:57] PROBLEM - puppet last run on mw2064 is CRITICAL: CRITICAL: puppet fail
[09:18:06] PROBLEM - puppet last run on mw1046 is CRITICAL: CRITICAL: puppet fail
[09:18:07] PROBLEM - puppet last run on cp2026 is CRITICAL: CRITICAL: puppet fail
[09:18:07] PROBLEM - puppet last run on mw2163 is CRITICAL: CRITICAL: puppet fail
[09:18:07] PROBLEM - puppet last run on mw2105 is CRITICAL: CRITICAL: puppet fail
[09:18:16] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: puppet fail
[09:18:17] PROBLEM - puppet last run on mw1107 is CRITICAL: CRITICAL: puppet fail
[09:18:17] PROBLEM - puppet last run on mw1122 is CRITICAL: CRITICAL: puppet fail
[09:18:17] PROBLEM - puppet last run on mw1204 is CRITICAL: CRITICAL: puppet fail
[09:18:17] PROBLEM - puppet last run on lvs1004 is CRITICAL: CRITICAL: puppet fail
[09:18:17] PROBLEM - puppet last run on mw2079 is CRITICAL: CRITICAL: puppet fail
[09:18:19] (03CR) 10Faidon Liambotis: [C: 032] Add A/AAAA for mr1-eqiad's OOB link [dns] - 10https://gerrit.wikimedia.org/r/248000 (https://phabricator.wikimedia.org/T113771) (owner: 10Faidon Liambotis)
[09:18:26] PROBLEM - puppet last run on mw2087 is CRITICAL: CRITICAL: puppet fail
[09:18:26] PROBLEM - puppet last run on mw2019 is CRITICAL: CRITICAL: puppet fail
[09:18:28] PROBLEM - puppet last run on mw1069 is CRITICAL: CRITICAL: puppet fail
[09:18:36] PROBLEM - puppet last run on mw2134 is CRITICAL: CRITICAL: puppet fail
[09:18:37] PROBLEM - puppet last run on mw2010 is CRITICAL: CRITICAL: puppet fail
[09:18:46] PROBLEM - puppet last run on mw1104 is CRITICAL: CRITICAL: puppet fail
[09:18:48] PROBLEM - puppet last run on mw2114 is CRITICAL: CRITICAL: puppet fail
[09:18:48] PROBLEM - puppet last run on mw2123 is CRITICAL: CRITICAL: puppet fail
[09:18:56] PROBLEM - puppet last run on mw1253 is CRITICAL: CRITICAL: puppet fail
[09:18:57] PROBLEM - puppet last run on mw2093 is CRITICAL: CRITICAL: puppet fail
[09:18:58] PROBLEM - puppet last run on mw1131 is CRITICAL: CRITICAL: puppet fail
[09:19:16] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: puppet fail
[09:19:16] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: puppet fail
[09:19:17] PROBLEM - puppet last run on mw2184 is CRITICAL: CRITICAL: puppet fail
[09:19:18] PROBLEM - puppet last run on mw2070 is CRITICAL: CRITICAL: puppet fail
[09:19:19] PROBLEM - puppet last run on cp2003 is CRITICAL: CRITICAL: puppet fail
[09:19:27] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: puppet fail
[09:19:27] PROBLEM - puppet last run on mw1137 is CRITICAL: CRITICAL: puppet fail
[09:19:28] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: puppet fail
[09:19:36] PROBLEM - puppet last run on mw2212 is CRITICAL: CRITICAL: puppet fail
[09:19:37] PROBLEM - puppet last run on mw2113 is CRITICAL: CRITICAL: puppet fail
[09:19:46] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: puppet fail
[09:19:46] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: puppet fail
[09:19:47] PROBLEM - puppet last run on mw2143 is CRITICAL: CRITICAL: puppet fail
[09:19:57] PROBLEM - puppet last run on cp2014 is CRITICAL: CRITICAL: puppet fail
[09:20:06] PROBLEM - puppet last run on mw2056 is CRITICAL: CRITICAL: puppet fail
[09:20:06] PROBLEM - puppet last run on mw1154 is CRITICAL: CRITICAL: puppet fail
[09:20:07] PROBLEM - puppet last run on lvs1006 is CRITICAL: CRITICAL: puppet fail
[09:20:07] PROBLEM - puppet last run on mw1194 is CRITICAL: CRITICAL: puppet fail
[09:20:07] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: puppet fail
[09:20:08] PROBLEM - puppet last run on mw1129 is CRITICAL: CRITICAL: puppet fail
[09:20:08] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: puppet fail
[09:20:08] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: puppet fail
[09:20:16] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail
[09:20:16] PROBLEM - puppet last run on mw2067 is CRITICAL: CRITICAL: puppet fail
[09:20:16] PROBLEM - puppet last run on mw2110 is CRITICAL: CRITICAL: puppet fail
[09:20:26] PROBLEM - puppet last run on mw2055 is CRITICAL: CRITICAL: puppet fail
[09:20:27] (03PS1) 10Alexandros Kosiaris: LVS: another indentation fix for aqs [puppet] - 10https://gerrit.wikimedia.org/r/248001
[09:20:27] PROBLEM - puppet last run on mw1047 is CRITICAL: CRITICAL: puppet fail
[09:20:37] PROBLEM - puppet last run on mw2142 is CRITICAL: CRITICAL: puppet fail
[09:20:37] PROBLEM - puppet last run on mw1054 is CRITICAL: CRITICAL: puppet fail
[09:20:47] PROBLEM - puppet last run on mw2030 is CRITICAL: CRITICAL: puppet fail
[09:20:47] PROBLEM - puppet last run on mw2131 is CRITICAL: CRITICAL: puppet fail
[09:20:47] PROBLEM - puppet last run on mw1128 is CRITICAL: CRITICAL: puppet fail
[09:20:47] PROBLEM - puppet last run on cp2011 is CRITICAL: CRITICAL: puppet fail
[09:20:47] PROBLEM - puppet last run on mw1075 is CRITICAL: CRITICAL: puppet fail
[09:20:48] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: puppet fail
[09:20:48] PROBLEM - puppet last run on mw2111 is CRITICAL: CRITICAL: puppet fail
[09:20:51] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] LVS: another indentation fix for aqs [puppet] - 10https://gerrit.wikimedia.org/r/248001 (owner: 10Alexandros Kosiaris)
[09:20:56] 6operations, 10netops, 5Patch-For-Review: setup new equinix out of band mgmt access - https://phabricator.wikimedia.org/T113771#1745008 (10faidon) The link is all set up on the SRX based on IP information sent to me just today inband. DNS is now pointed at it, so we can already use the OOB access \o/ The th...
[09:20:56] PROBLEM - puppet last run on mw1021 is CRITICAL: CRITICAL: puppet fail
[09:20:56] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: puppet fail
[09:20:57] PROBLEM - puppet last run on mw2049 is CRITICAL: CRITICAL: puppet fail
[09:20:57] PROBLEM - puppet last run on mw2168 is CRITICAL: CRITICAL: puppet fail
[09:20:57] PROBLEM - puppet last run on mw2203 is CRITICAL: CRITICAL: puppet fail
[09:21:17] PROBLEM - puppet last run on lvs1005 is CRITICAL: CRITICAL: puppet fail
[09:21:17] PROBLEM - puppet last run on cp2010 is CRITICAL: CRITICAL: puppet fail
[09:21:18] PROBLEM - puppet last run on mw1078 is CRITICAL: CRITICAL: puppet fail
[09:21:27] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: puppet fail
[09:21:27] PROBLEM - puppet last run on mw2101 is CRITICAL: CRITICAL: puppet fail
[09:21:27] PROBLEM - puppet last run on mw2182 is CRITICAL: CRITICAL: puppet fail
[09:21:36] PROBLEM - puppet last run on cp1071 is CRITICAL: CRITICAL: puppet fail
[09:21:36] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: puppet fail
[09:21:37] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: puppet fail
[09:21:47] PROBLEM - puppet last run on mw2047 is CRITICAL: CRITICAL: puppet fail
[09:21:47] PROBLEM - puppet last run on cp2022 is CRITICAL: CRITICAL: puppet fail
[09:21:48] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: puppet fail
[09:21:48] PROBLEM - puppet last run on mw1206 is CRITICAL: CRITICAL: puppet fail
[09:21:48] PROBLEM - puppet last run on ocg1002 is CRITICAL: CRITICAL: puppet fail
[09:22:06] PROBLEM - puppet last run on mw1020 is CRITICAL: CRITICAL: puppet fail
[09:22:06] PROBLEM - puppet last run on mw2098 is CRITICAL: CRITICAL: puppet fail
[09:22:07] PROBLEM - puppet last run on mw2130 is CRITICAL: CRITICAL: puppet fail
[09:22:27] PROBLEM - puppet last run on mw1230 is CRITICAL: CRITICAL: puppet fail
[09:22:28] PROBLEM - puppet last run on mw1084 is CRITICAL: CRITICAL: puppet fail
[09:22:36] PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: puppet fail
[09:22:37] PROBLEM - puppet last run on cp1099 is CRITICAL: CRITICAL: puppet fail
[09:22:38] PROBLEM - puppet last run on mw1169 is CRITICAL: CRITICAL: puppet fail
[09:22:48] PROBLEM - puppet last run on mw2053 is CRITICAL: CRITICAL: puppet fail
[09:22:57] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: puppet fail
[09:23:06] PROBLEM - puppet last run on mw2094 is CRITICAL: CRITICAL: puppet fail
[09:23:07] RECOVERY - puppet last run on lvs1005 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[09:23:07] RECOVERY - puppet last run on lvs1002 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[09:23:07] PROBLEM - puppet last run on mw1049 is CRITICAL: CRITICAL: puppet fail
[09:23:08] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: puppet fail
[09:23:08] PROBLEM - puppet last run on mw1165 is CRITICAL: CRITICAL: puppet fail
[09:23:16] PROBLEM - puppet last run on mw1138 is CRITICAL: CRITICAL: puppet fail
[09:23:17] PROBLEM - puppet last run on lvs4001 is CRITICAL: CRITICAL: puppet fail
[09:23:17] PROBLEM - puppet last run on mw2133 is CRITICAL: CRITICAL: puppet fail
[09:23:27] PROBLEM - puppet last run on mw2202 is CRITICAL: CRITICAL: puppet fail
[09:23:36] PROBLEM - puppet last run on mw2174 is CRITICAL: CRITICAL: puppet fail
[09:23:38] PROBLEM - puppet last run on cp2015 is CRITICAL: CRITICAL: puppet fail
[09:23:38] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:23:47] PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: puppet fail
[09:23:47] RECOVERY - puppet last run on lvs1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:23:47] PROBLEM - puppet last run on mw1074 is CRITICAL: CRITICAL: puppet fail
[09:23:47] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: puppet fail
[09:23:48] RECOVERY - puppet last run on lvs1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:23:57] PROBLEM - puppet last run on mw2048 is CRITICAL: CRITICAL: puppet fail
[09:24:07] PROBLEM - puppet last run on mw2107 is CRITICAL: CRITICAL: puppet fail
[09:24:07] PROBLEM - puppet last run on mw1030 is CRITICAL: CRITICAL: puppet fail
[09:24:17] RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:24:18] PROBLEM - puppet last run on mw1245 is CRITICAL: CRITICAL: puppet fail
[09:24:26] PROBLEM - puppet last run on mw1256 is CRITICAL: CRITICAL: puppet fail
[09:24:28] PROBLEM - puppet last run on cp1043 is CRITICAL: CRITICAL: puppet fail
[09:24:28] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: puppet fail
[09:24:34] (03CR) 10Filippo Giunchedi: [C: 031] "is there a labs instance already active to take a look at the resulting metrics?" [puppet] - 10https://gerrit.wikimedia.org/r/247823 (owner: 10Alexandros Kosiaris)
[09:24:37] PROBLEM - puppet last run on mw2106 is CRITICAL: CRITICAL: puppet fail
[09:24:37] PROBLEM - puppet last run on mw1080 is CRITICAL: CRITICAL: puppet fail
[09:24:38] PROBLEM - puppet last run on mw1132 is CRITICAL: CRITICAL: puppet fail
[09:24:38] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: puppet fail
[09:24:47] PROBLEM - puppet last run on mw2008 is CRITICAL: CRITICAL: puppet fail
[09:24:47] PROBLEM - puppet last run on mw2150 is CRITICAL: CRITICAL: puppet fail
[09:24:49] PROBLEM - puppet last run on mw2193 is CRITICAL: CRITICAL: puppet fail
[09:24:56] PROBLEM - puppet last run on mw1062 is CRITICAL: CRITICAL: puppet fail
[09:24:58] PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: puppet fail
[09:25:06] PROBLEM - puppet last run on mw1116 is CRITICAL: CRITICAL: puppet fail
[09:25:06] PROBLEM - puppet last run on mw2058 is CRITICAL: CRITICAL: puppet fail
[09:25:17] PROBLEM - puppet last run on mw1053 is CRITICAL: CRITICAL: puppet fail
[09:25:17]
PROBLEM - puppet last run on mw1198 is CRITICAL: CRITICAL: puppet fail
[09:25:17] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: puppet fail
[09:25:26] PROBLEM - puppet last run on mw1134 is CRITICAL: CRITICAL: puppet fail
[09:25:26] PROBLEM - puppet last run on mw2046 is CRITICAL: CRITICAL: puppet fail
[09:25:28] PROBLEM - puppet last run on mw2144 is CRITICAL: CRITICAL: puppet fail
[09:25:36] PROBLEM - puppet last run on mw1040 is CRITICAL: CRITICAL: puppet fail
[09:25:36] PROBLEM - puppet last run on mw2032 is CRITICAL: CRITICAL: puppet fail
[09:25:37] PROBLEM - puppet last run on mw2167 is CRITICAL: CRITICAL: puppet fail
[09:25:38] PROBLEM - puppet last run on mw1233 is CRITICAL: CRITICAL: puppet fail
[09:25:43] (03CR) 10Filippo Giunchedi: [C: 031] dumps: move ssl cert and config to role [puppet] - 10https://gerrit.wikimedia.org/r/247700 (owner: 10Dzahn)
[09:25:47] PROBLEM - puppet last run on lvs2003 is CRITICAL: CRITICAL: puppet fail
[09:25:57] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: puppet fail
[09:26:27] PROBLEM - puppet last run on mw2186 is CRITICAL: CRITICAL: puppet fail
[09:26:31] ignore the puppet errors, problem fixed
[09:26:36] (03PS3) 10Muehlenhoff: Assign salt grains for deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/247877
[09:27:17] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/247877 (owner: 10Muehlenhoff)
[09:31:04] 6operations, 6Discovery, 7Elasticsearch: unattended elasticsearch restarts - https://phabricator.wikimedia.org/T89845#1745030 (10fgiunchedi) the idea was to track work towards having unattended cluster restarts as much as possible, at this point I don't know where it stands though
[09:31:08] (03PS2) 10Muehlenhoff: Assign salt grains for logging hosts [puppet] - 10https://gerrit.wikimedia.org/r/247878
[09:34:42] (03PS3) 10Muehlenhoff: Assign salt grains for logging hosts [puppet] - 10https://gerrit.wikimedia.org/r/247878
[09:35:10] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 615
[09:36:14] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for logging hosts [puppet] - 10https://gerrit.wikimedia.org/r/247878 (owner: 10Muehlenhoff)
[09:38:57] (03PS2) 10Muehlenhoff: Assign salt grains for logstash [puppet] - 10https://gerrit.wikimedia.org/r/247879
[09:39:41] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[09:40:10] RECOVERY - check_mysql on db1008 is OK: Uptime: 7750340 Threads: 1 Questions: 62467002 Slow queries: 52617 Opens: 129256 Flush tables: 2 Open tables: 64 Queries per second avg: 8.059 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:40:20] RECOVERY - puppet last run on cp1063 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[09:40:31] RECOVERY - puppet last run on cp1069 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:40:42] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[09:41:12] RECOVERY - puppet last run on mw1081 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[09:41:12] RECOVERY - puppet last run on mw1159 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[09:41:21] RECOVERY - puppet last run on mw2089 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:41:21] RECOVERY - puppet last run on cp1072 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[09:41:50] RECOVERY - puppet last run on mw1056 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[09:41:51] RECOVERY - puppet last run on mw2194 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:42:10] RECOVERY - puppet last run on mw2102 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[09:42:10] RECOVERY - puppet last run on mw1130 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[09:42:21] RECOVERY - puppet last run on mw2189 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:42:33] (03PS3) 10Muehlenhoff: Assign salt grains for logstash [puppet] - 10https://gerrit.wikimedia.org/r/247879
[09:42:41] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:42:42] RECOVERY - puppet last run on mw2201 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[09:42:42] RECOVERY - puppet last run on mw2151 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:43:02] RECOVERY - puppet last run on mw2041 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[09:43:20] RECOVERY - puppet last run on mw2160 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[09:43:21] RECOVERY - puppet last run on mw2179 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:43:38] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for logstash [puppet] - 10https://gerrit.wikimedia.org/r/247879 (owner: 10Muehlenhoff)
[09:43:41] RECOVERY - puppet last run on mw2161 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[09:43:42] RECOVERY - puppet last run on mw2031 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[09:43:42] RECOVERY - puppet last run on mw1089 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:43:51] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:43:52] RECOVERY - puppet last run on mw2103 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:44:11] RECOVERY - puppet last run on mw1243 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:44:30] RECOVERY - puppet last run on mw1122 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[09:44:41] RECOVERY - puppet last run on mw2109 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[09:44:41] RECOVERY - puppet last run on cp2026 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[09:44:51] RECOVERY - puppet last run on mw2064 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[09:45:00] RECOVERY - puppet last run on mw1225 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:45:00] RECOVERY - puppet last run on mw1106 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:45:02] RECOVERY - puppet last run on mw1024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:45:32] RECOVERY - puppet last run on cp2011 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[09:45:41] RECOVERY - puppet last run on mw2015 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[09:45:47] (03PS2) 10Muehlenhoff: Assign salt grains for package::builder [puppet] - 10https://gerrit.wikimedia.org/r/247880
[09:45:51] RECOVERY - puppet last run on mw2075 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[09:46:11] RECOVERY - puppet last run on cp2016 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[09:46:20] RECOVERY - puppet last run on mw1027 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[09:46:21] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[09:46:21] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is
currently enabled, last run 1 minute ago with 0 failures
[09:46:21] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:46:30] RECOVERY - puppet last run on mw1069 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[09:46:30] RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[09:46:30] RECOVERY - puppet last run on mw2188 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:46:31] RECOVERY - puppet last run on cp2003 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[09:46:41] RECOVERY - puppet last run on mw2163 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:46:41] RECOVERY - puppet last run on mw2105 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:46:50] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:46:50] RECOVERY - puppet last run on mw1150 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:46:51] RECOVERY - puppet last run on mw1107 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[09:46:51] RECOVERY - puppet last run on mw2212 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[09:46:52] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[09:46:59] (03PS3) 10Muehlenhoff: Assign salt grains for package::builder [puppet] - 10https://gerrit.wikimedia.org/r/247880
[09:47:00] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:47:01] RECOVERY - puppet last run on mw2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:47:10] RECOVERY - puppet last run on mw2019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:47:11] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[09:47:21] RECOVERY - puppet last run on mw2123 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[09:47:21] RECOVERY - puppet last run on mw2114 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:47:21] RECOVERY - puppet last run on mw1253 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[09:47:32] RECOVERY - puppet last run on mw1131 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[09:47:40] RECOVERY - puppet last run on mw2087 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:47:41] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[09:47:50] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[09:47:51] RECOVERY - puppet last run on mw1046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:48:00] RECOVERY - puppet last run on mw2113 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[09:48:01] RECOVERY - puppet last run on mw1047 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[09:48:01] RECOVERY - puppet last run on mw1154 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:48:02] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:48:10] RECOVERY - puppet last run on cp2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:48:11] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for package::builder [puppet] - 10https://gerrit.wikimedia.org/r/247880 (owner: 10Muehlenhoff)
[09:48:12] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:48:20] RECOVERY - puppet last run on mw1129 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[09:48:20] RECOVERY - puppet last run on mw2184 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:48:21] RECOVERY - puppet last run on mw2143 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[09:48:21] RECOVERY - puppet last run on mw2070 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[09:48:31] RECOVERY - puppet last run on mw2182 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[09:48:31] RECOVERY - puppet last run on mw1137 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[09:48:41] RECOVERY - puppet last run on mw2055 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[09:48:41] RECOVERY - puppet last run on mw2067 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[09:48:50] RECOVERY - puppet last run on mw1104 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:48:51] RECOVERY - puppet last run on mw2134 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:48:51] RECOVERY - puppet last run on mw2142 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[09:49:00] RECOVERY - puppet last run on mw1054 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:49:00] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[09:49:00] RECOVERY - puppet last run on mw2079 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:49:01] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:49:01] RECOVERY - puppet last run on lvs4001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:49:01] RECOVERY - puppet last run on cp1071 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[09:49:02] RECOVERY - puppet last run on mw2047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:49:11] RECOVERY - puppet last run on mw2110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:49:11] RECOVERY - puppet last run on mw1128 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[09:49:11] RECOVERY - puppet last run on cp1099 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[09:49:11] RECOVERY - puppet last run on mw2131 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[09:49:12] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[09:49:12] RECOVERY - puppet last run on cp2010 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[09:49:12] RECOVERY - puppet last run on mw1075 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[09:49:20] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[09:49:21] RECOVERY - puppet last run on mw2093 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:49:21] RECOVERY - puppet last run on cp2022 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[09:49:22] RECOVERY - puppet last run on mw1021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:49:31] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 39 seconds ago
with 0 failures
[09:49:31] (03PS2) 10Muehlenhoff: Use testsystem role for ruthenium [puppet] - 10https://gerrit.wikimedia.org/r/247239
[09:49:31] RECOVERY - puppet last run on mw2111 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[09:49:31] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:49:31] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[09:49:32] RECOVERY - puppet last run on mw2049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:49:32] RECOVERY - puppet last run on cp4019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:49:41] RECOVERY - puppet last run on mw1020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:49:42] RECOVERY - puppet last run on mw2168 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[09:49:42] RECOVERY - puppet last run on mw2203 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[09:49:50] RECOVERY - puppet last run on ocg1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:49:50] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[09:49:51] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:49:52] RECOVERY - puppet last run on mw1194 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[09:50:01] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:50:11] RECOVERY - puppet last run on mw2056 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[09:50:11] RECOVERY - puppet last run on mw2130 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[09:50:21] RECOVERY - puppet last run on mw1206 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:50:21] RECOVERY - puppet last run on mw2101 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:50:21] RECOVERY - puppet last run on mw2174 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:50:22] RECOVERY - puppet last run on mw1078 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:50:30] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[09:50:40] 6operations, 10Analytics, 6Services, 5Patch-For-Review: Set up LVS for AQS - https://phabricator.wikimedia.org/T116245#1745040 (10akosiaris) 5Open>3Resolved a:3akosiaris LVS for AQS is up and running. We had to migrated restbase on AQS to port 7232 to avoid conflicting with the services restbase inst...
[09:50:41] RECOVERY - puppet last run on mw2098 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[09:50:42] RECOVERY - puppet last run on cp2015 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[09:50:50] RECOVERY - puppet last run on mw1084 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[09:50:50] RECOVERY - puppet last run on mw1230 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:50:51] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[09:50:51] RECOVERY - puppet last run on mw1049 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[09:51:01] RECOVERY - puppet last run on mw2030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:51:12] RECOVERY - puppet last run on cp1043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:51:31] RECOVERY - puppet last run on lvs2003 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[09:51:31] RECOVERY - puppet last run on mw2048 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[09:51:31] RECOVERY - puppet last run on mw2133 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[09:51:32] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[09:51:32] RECOVERY - puppet last run on mw2106 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[09:51:41] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:51:41] RECOVERY - puppet last run on mw2053 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[09:51:51] RECOVERY - puppet last run on mw1240 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[09:51:51] RECOVERY - puppet last run on mw1165 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[09:51:52] RECOVERY - puppet last run on mw1062 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[09:52:00] RECOVERY - puppet last run on mw2107 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[09:52:00] RECOVERY - puppet last run on mw2150 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:52:01] RECOVERY - puppet last run on mw2094 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:52:11] RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[09:52:12] RECOVERY - puppet last run on mw2046 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[09:52:12] RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is
currently enabled, last run 46 seconds ago with 0 failures [09:52:32] RECOVERY - puppet last run on mw1040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:52:41] RECOVERY - puppet last run on mw1245 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:52:51] RECOVERY - puppet last run on mw1134 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [09:52:51] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:52:51] RECOVERY - puppet last run on mw1256 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [09:52:52] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:52:52] RECOVERY - puppet last run on mw2032 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [09:53:00] RECOVERY - puppet last run on mw2144 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [09:53:02] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:53:10] RECOVERY - puppet last run on mw1074 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [09:53:10] RECOVERY - puppet last run on mw1030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:53:11] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:53:11] RECOVERY - puppet last run on mw2186 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [09:53:21] RECOVERY - puppet last run on mw1198 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [09:53:22] RECOVERY - puppet last run on mw1080 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [09:53:41] RECOVERY - puppet last 
run on mw1116 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:53:50] RECOVERY - puppet last run on mw2058 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [09:53:50] RECOVERY - puppet last run on mw2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:53:51] RECOVERY - puppet last run on mw2193 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:53:57] (03CR) 10Muehlenhoff: "No problems according to puppet compiler:" [puppet] - 10https://gerrit.wikimedia.org/r/247225 (owner: 10Muehlenhoff) [09:54:02] RECOVERY - puppet last run on mw2167 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:54:37] (03PS2) 10Muehlenhoff: Assign salt grains for racktables [puppet] - 10https://gerrit.wikimedia.org/r/247881 [09:55:06] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for racktables [puppet] - 10https://gerrit.wikimedia.org/r/247881 (owner: 10Muehlenhoff) [10:14:56] (03PS1) 10Muehlenhoff: Assign salt grains for rec DNS servers [puppet] - 10https://gerrit.wikimedia.org/r/248009 [10:14:58] (03PS1) 10Muehlenhoff: Assign salt grains for syslog servers [puppet] - 10https://gerrit.wikimedia.org/r/248010 [10:15:00] (03PS1) 10Muehlenhoff: Remove manually set salt grains during initial testing [puppet] - 10https://gerrit.wikimedia.org/r/248011 [10:15:29] hello Lcawte [10:16:00] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for rec DNS servers [puppet] - 10https://gerrit.wikimedia.org/r/248009 (owner: 10Muehlenhoff) [10:21:22] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for syslog servers [puppet] - 10https://gerrit.wikimedia.org/r/248010 (owner: 10Muehlenhoff) [10:22:21] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [10:24:11] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [10:25:43] (03CR) 10Muehlenhoff: 
[C: 032 V: 032] Remove manually set salt grains during initial testing [puppet] - 10https://gerrit.wikimedia.org/r/248011 (owner: 10Muehlenhoff) [10:29:50] (03PS4) 10Hashar: contint: install npm/grunt-cli with npm [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) [10:35:25] (03PS5) 10Hashar: contint: install npm/grunt-cli with npm [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) [10:38:03] (03PS6) 10Hashar: contint: install npm/grunt-cli with npm [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) [10:50:30] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [10:52:21] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [10:53:28] (03CR) 10Muehlenhoff: fluorine: Use the role keyword (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/246980 (owner: 10Muehlenhoff) [10:56:51] 6operations, 10Analytics, 6Services: Set up LVS for AQS - https://phabricator.wikimedia.org/T116245#1745136 (10mobrovac) [11:01:34] (03PS2) 10Mobrovac: RESTBase: Set up MobileApps storage and AQS public API [puppet] - 10https://gerrit.wikimedia.org/r/247935 (https://phabricator.wikimedia.org/T114830) [11:03:48] 7Puppet, 10Continuous-Integration-Config, 5Continuous-Integration-Scaling, 5Patch-For-Review: Puppetize npm/grunt manual setup - https://phabricator.wikimedia.org/T113903#1745139 (10hashar) On Trusty: ``` root@integration-slave-trusty-1011:~# ls -l /usr/{local/,}bin/{grunt,npm} ls: cannot access /usr/local... 
[11:04:53] (03PS1) 10Muehlenhoff: Assign salt grains for memcached [puppet] - 10https://gerrit.wikimedia.org/r/248018 [11:04:55] (03PS1) 10Muehlenhoff: Assign salt grains for ipv6relay [puppet] - 10https://gerrit.wikimedia.org/r/248019 [11:04:57] (03PS1) 10Muehlenhoff: Assign salt grains for parsercache [puppet] - 10https://gerrit.wikimedia.org/r/248020 [11:16:36] 6operations: Off Boarding: Remove user pbeaudette from aliases - https://phabricator.wikimedia.org/T116248#1745160 (10Aklapper) [11:16:50] 6operations: Off Boarding: Remove user pbeaudette from aliases - https://phabricator.wikimedia.org/T116248#1744433 (10Aklapper) [ Setting a descriptive task summary ] [11:20:03] 6operations, 10OTRS: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1745178 (10Aklapper) [11:20:40] 6operations, 10OTRS: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#771227 (10Aklapper) Constantly changing expectations on a task (in this case: version) is not helpful for discussion or planning. Hence I'm reverting the task summary changes.
[11:22:04] (03PS7) 10Hashar: contint: install npm/grunt-cli with npm [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) [11:31:06] (03CR) 10Filippo Giunchedi: [C: 04-1] RESTBase: Set up MobileApps storage and AQS public API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/247935 (https://phabricator.wikimedia.org/T114830) (owner: 10Mobrovac) [11:33:23] (03CR) 10Mobrovac: RESTBase: Set up MobileApps storage and AQS public API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/247935 (https://phabricator.wikimedia.org/T114830) (owner: 10Mobrovac) [11:35:33] !log restbase deployed 2bc05f40 [11:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:37:44] (03PS8) 10Hashar: contint: install npm/grunt-cli with npm [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) [11:39:16] I will apply T89986 schema change after breakfast [11:49:47] 6operations, 10OTRS: Upgrade OTRS to a more recent stable release - https://phabricator.wikimedia.org/T74109#1745220 (10akosiaris) So, to add some (not much) context into this, we have a 4.0.13 installation that is currently being evaluated by OTRS volunteers. It resides on https://otrs-test.wikimedia.org. It... [11:51:44] (03CR) 10Alexandros Kosiaris: "Why are we bundling 2 completely irrelevant things together ? I would expect a change to enable AQS and another for mobileapps storing, he" [puppet] - 10https://gerrit.wikimedia.org/r/247935 (https://phabricator.wikimedia.org/T114830) (owner: 10Mobrovac) [11:54:02] (03CR) 10Mobrovac: "Both are low-risk operations and affect only their respective endpoints. I bundled them together since both need only a config change. 
If " [puppet] - 10https://gerrit.wikimedia.org/r/247935 (https://phabricator.wikimedia.org/T114830) (owner: 10Mobrovac) [11:55:29] (03PS9) 10Hashar: contint: install npm/grunt-cli with npm [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) [11:56:43] 6operations, 10OTRS: move OTRS to a VM - https://phabricator.wikimedia.org/T105554#1745226 (10akosiaris) [11:56:46] 6operations, 10OTRS: upgrade iodine to jessie or find a new host with jessie for OTRS - https://phabricator.wikimedia.org/T105125#1745227 (10akosiaris) [12:01:07] 7Puppet, 10Continuous-Integration-Config, 5Continuous-Integration-Scaling, 5Patch-For-Review: Puppetize npm/grunt manual setup - https://phabricator.wikimedia.org/T113903#1745238 (10hashar) On Precise the installation of grunt-cli is done with the Precise package which fails because of the CA. But npm is p... [12:03:07] (03CR) 10Alexandros Kosiaris: "Let's do that then. Needing to restart restbase only once (for whatever reason) should not make us present irrelevant changes together." 
[puppet] - 10https://gerrit.wikimedia.org/r/247935 (https://phabricator.wikimedia.org/T114830) (owner: 10Mobrovac) [12:03:50] kk akosiaris, /me doing [12:10:56] (03PS10) 10Hashar: contint: install npm/grunt-cli with npm [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) [12:16:51] (03PS3) 10Mobrovac: RESTBase: Set up the AQS public API [puppet] - 10https://gerrit.wikimedia.org/r/247935 (https://phabricator.wikimedia.org/T114830) [12:16:59] (03PS11) 10Hashar: contint: install npm/grunt-cli with npm [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) [12:19:59] (03CR) 10Alex Monk: elasticsearch: apply elasticsearch::server role to codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/238616 (owner: 10Rush) [12:21:25] (03PS1) 10Mobrovac: RESTBase: Set up MobileApps storage [puppet] - 10https://gerrit.wikimedia.org/r/248026 (https://phabricator.wikimedia.org/T102130) [12:23:13] (03CR) 10Mobrovac: "@Alex, done. This patch now introduces only the AQS routes, while I885a3e7a2ea380bee4d60463902ec1874ce47eb1 adds the MobileApps storage." 
[puppet] - 10https://gerrit.wikimedia.org/r/247935 (https://phabricator.wikimedia.org/T114830) (owner: 10Mobrovac) [12:30:35] (03CR) 10BBlack: [C: 031] "This is fine for testing the cookie for now, but there's a whole lot of other things to heavily refactor before we get to using this for t" [puppet] - 10https://gerrit.wikimedia.org/r/247970 (https://phabricator.wikimedia.org/T91820) (owner: 10Ori.livneh) [12:31:36] !log Rolling schema change for GeoData on all wikis (geo_tags) [12:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:33:18] 7Puppet, 10Continuous-Integration-Config, 5Continuous-Integration-Scaling, 5Patch-For-Review: Puppetize npm/grunt manual setup - https://phabricator.wikimedia.org/T113903#1745277 (10hashar) All fine with PS 11 https://gerrit.wikimedia.org/r/#/c/244748/11 `salt '*slave*' cmd.run '/usr/bin/npm --version; /u... [12:33:55] (03CR) 10Hashar: [C: 031 V: 032] "Cherry picked on integration puppet master. It is not perfect but does the job." [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) (owner: 10Hashar) [12:34:51] 7Puppet, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 5Patch-For-Review, and 2 others: Puppetize npm/grunt manual setup - https://phabricator.wikimedia.org/T113903#1745278 (10hashar) [12:36:37] I got a warning while applying the changes on a large wiki- I think it is expected, but I am going to abort the schema change until we confirm it is ok to continue [12:36:45] (03PS3) 10Hashar: beta: point parsoid back to source code [puppet] - 10https://gerrit.wikimedia.org/r/243987 (https://phabricator.wikimedia.org/T92871) [12:36:53] What was it? [12:37:07] (03CR) 10Hashar: [C: 031 V: 032] "Basic rebase, cherry picked on integration puppetmaster." 
[puppet] - 10https://gerrit.wikimedia.org/r/243987 (https://phabricator.wikimedia.org/T92871) (owner: 10Hashar) [12:37:16] (03PS3) 10Hashar: beta: parsoid now uses modules defined in source [puppet] - 10https://gerrit.wikimedia.org/r/243992 (https://phabricator.wikimedia.org/T92871) [12:37:26] (03CR) 10Hashar: [C: 031 V: 032] "Basic rebase, cherry picked on integration puppetmaster." [puppet] - 10https://gerrit.wikimedia.org/r/243992 (https://phabricator.wikimedia.org/T92871) (owner: 10Hashar) [12:38:08] (03PS2) 10Hashar: contint: restore unattended upgrade on slaves [puppet] - 10https://gerrit.wikimedia.org/r/243925 (https://phabricator.wikimedia.org/T98885) [12:38:20] (03CR) 10Hashar: [V: 032] "Basic rebase, cherry picked on integration puppetmaster." [puppet] - 10https://gerrit.wikimedia.org/r/243925 (https://phabricator.wikimedia.org/T98885) (owner: 10Hashar) [12:42:14] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for memcached [puppet] - 10https://gerrit.wikimedia.org/r/248018 (owner: 10Muehlenhoff) [12:43:55] (03PS1) 10Glaisher: noc: change Gitblit links to Diffusion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248027 [12:44:03] jynus: :/ [12:44:17] aude, why sad? [12:44:26] no schema change :( [12:44:28] yet [12:44:44] what kind of warning? [12:45:34] 6operations, 10RESTBase, 6Services: Switch RESTBase to use Node.js 4 - https://phabricator.wikimedia.org/T107762#1745290 (10Pchelolo) [12:45:46] so, this is the issue: https://phabricator.wikimedia.org/T89986#1745288 [12:46:46] 81.65 float is not exact, and with the extra precision gets converted to 81.65000153 (and truncated) [12:47:19] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for ipv6relay [puppet] - 10https://gerrit.wikimedia.org/r/248019 (owner: 10Muehlenhoff) [12:47:20] hm [12:47:29] I suppose tagged pages can be reprocessed - that will depend on whether floats are used in the PHP code?
[12:47:49] so not now if it is an issue really [12:48:04] but I prefer to stop now rather than when it is too late [12:48:39] get some feedback from deployment/devel and community, and I can restart at any time [12:48:46] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for parsercache [puppet] - 10https://gerrit.wikimedia.org/r/248020 (owner: 10Muehlenhoff) [12:52:06] yeah, definitely should ask max [12:53:08] sorry, I am from ops and on top of that a DBA, so 120% conservative about changes [12:53:32] i think they are double in the php [12:54:23] that would explain the issue- which means the decimal on mysql makes little sense [12:54:25] can't imagine though that this is so urgent and can wait [12:54:33] exactly [12:54:52] I am not blocking it, I think the change should go through [12:55:13] in wikidata, this is why we have precision as part of the coordinate data there [12:55:15] but I would suggest using a fixed point /text on mw [12:55:26] and that makes total sense [12:55:53] the good thing about this is that, worst case scenario there is no data loss [12:56:06] only some time to reprocess the tags [12:56:13] yeah [12:56:54] I will also ping eswiki to see if they see something strange in the meantime [12:57:41] actually think a lot of there coordinates come via wikidata now [12:57:44] their [12:58:01] but still get added to geo_tags via templates ...
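The rounding behaviour discussed above can be reproduced directly: 81.65 has no exact binary representation, and round-tripping it through 32-bit precision (the precision of a MySQL FLOAT column) yields exactly the extra digits seen in the schema-change warning. A quick illustrative check, not the actual GeoData code:

```python
import struct

def as_float32(x):
    """Round-trip a Python float (IEEE 754 binary64) through binary32,
    i.e. the precision of a 32-bit FLOAT column."""
    return struct.unpack("f", struct.pack("f", x))[0]

# 81.65 is not exactly representable; the nearest binary32 value is
# 81.65000152587890625, which shows up as 81.65000153 in the warning.
print(f"{as_float32(81.65):.8f}")
```

This is why storing coordinates in a fixed-point or text column, as suggested above, avoids the spurious digits entirely.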
[13:00:56] I've made an edit to a page on a wiki with the new schema and it has been updated successfully, I think it is not an issue [13:01:06] ok [13:01:08] but I will wait and suggest code changes [13:01:36] (I have like 1 million other pending schema changes in the meantime) [13:16:41] ottomata, madhuvishy: I spent some time with Burrow this morning, but there's no simple solution: [13:17:10] dh-make-golang cannot easily be backported, it relies on newer features in dh-golang [13:17:43] which depend on newer features in go itself (like "go generate" introduced in 1.4, while jessie had 1.3) [13:17:55] so that's rather something for Debian unstable/testing [13:18:18] then I tried to package Burrow with a simple debhelper packaging using dh-golang from jessie [13:19:07] that would work per se, but Burrow relies on five additional libs which aren't packaged in Debian yet (one or two seem to have an ITP bug (i.e. someone will work on it in the future)) [13:19:33] !log performing schema change on x1-master (flowdb) [13:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:19:40] and I'm afraid even when packaging these five dependency libs we'll run into libs needed by those five libs [13:20:18] oh boy [13:20:20] so packaging this properly would involve a huge chunk of work and I'm not sure it's worth it [13:20:31] yikes, well, i mean [13:20:39] it might be better with Debian stretch since go packaging has gained some traction [13:20:43] generally we have to package it to use it, maybe just not properly [13:20:50] ? [13:21:03] we want to run this in prod, is stretch something that might be available?
[13:21:11] we could easily run this on ganeti i think [13:21:13] if that is an option [13:21:30] stretch has too much ongoing churn [13:21:38] moritzm: for kafka, we did the wrong thing [13:21:44] and manually included some binary dependencies in the git repo [13:21:55] because we were having the same problems [13:22:08] we used debian based deps where we could, but used included ones where we couldn't [13:22:12] could we do that for burrow? [13:22:46] the build method recommended by upstream simply fetches the external deps from github and builds them locally using gpm [13:23:03] https://github.com/pote/gpm [13:23:29] of course that's horrible for reproducibility and general sanity [13:24:17] but I'm afraid that's the state of the art for go ATM :-) [13:24:27] they don't even support shared libs... [13:24:31] aye [13:24:59] so what we could do is to make a debian/rules which makes the local build using gpm and ship the resulting binary into the deb package [13:25:52] yeah, that sucks though, i would rather use gpm to collect all the binary deps [13:25:54] and commit them to the repo [13:26:00] and make debian/rules just use those to build [13:26:08] that way at least the build is reproducible [13:27:53] or that way. it sucks a little less [13:27:58] but still sucks :-) [13:28:07] yeah [13:28:30] does gpm install install deps locally somewhere? [13:28:45] maybe in some place that is configurable by an env var, that we could set to be in the local repo [13:28:55] then go build or whatever rules does just sets that and deps will automatically be loaded? [13:29:12] chasemp: hiya [13:30:22] GOPATH perhaps? :) [13:30:25] hey man [13:30:26] reading myself...:p [13:30:30] chasemp: hiyaaa [13:30:32] so, reqstats.
[13:30:33] PING [13:30:33] :) [13:30:37] i'm going to work on that a bunch today [13:31:20] not sure, I think GOPATH is to manage multiple golang installs on a system [13:31:29] hm [13:31:40] but there's a plugin, which might do the trick: [13:31:59] https://github.com/technosophos/gpm-local [13:32:05] I'm locked into something a bit exploratory here ottomata, and then a meeting, but I will try to catch up w/ you briefly in a few to outline what may or may not be good points on my part :) and then more later? [13:33:01] keep an eye on flow-related activity. I do not see any errors or obvious regressions, but maybe a slightly higher load? Too soon to say something. [13:33:12] morebots: setting GOPATH did make gpm install there [13:33:12] I am a logbot running on tools-exec-1203. [13:33:12] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [13:33:12] To log a message, type !log . [13:35:16] jynus: don't see anything obvious now regarding flow, but have generally noticed it's becoming a target of spam bots esp. on mediawiki.org :( [13:35:18] I think it is just normal activity [13:35:31] that we are just reaching peak time [13:35:35] chasemp: k [13:35:38] i think users are setting up more abuse filters and maybe they help [13:35:38] ottomata: ok, even better [13:35:47] and work* [13:35:47] well, we'll see if it worked, it did install them there [13:35:50] but there was a recent outage related to it [13:35:52] not sure if i can use them easily, am experimenting [13:36:05] so I am double-checking the change was successful [13:37:14] it is just that my monitoring has a 5 minute lag :-) [13:37:33] ottomata: I checked the status in stretch, it has 3 out of the 5 needed packages: uuid, samuel-go-zookeeper and gcfg, plus Shopify/sarama has an ITP and seelog is missing.
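The vendored-GOPATH approach being worked out above can be sketched roughly as follows. This is an illustrative Python sketch of what a debian/rules override would do, not the actual Burrow packaging; the `debian/godeps` path is a hypothetical location for dependencies committed to the repo (gpm lays them out under `$GOPATH/src`):

```python
import os

def vendored_env(godeps_dir="debian/godeps"):
    """Return an environment in which `go build` resolves imports from
    dependencies vendored under the package tree, rather than fetching
    them from github at build time. The path is illustrative."""
    env = dict(os.environ)
    # Pointing GOPATH at the committed deps is the env-var trick
    # discussed above; the toolchain then finds packages in
    # $GOPATH/src without network access.
    env["GOPATH"] = os.path.abspath(godeps_dir)
    return env

# A debian/rules override could then invoke something like:
#   subprocess.run(["go", "build", "./..."], env=vendored_env())
```

This keeps the build reproducible in the sense discussed above: the binary deps are pinned by what is committed, not by whatever github serves at build time.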
so hopefully this can be packaged sanely in stretch [13:38:46] the number of full table scans has not changed [13:39:03] although there are now fewer adaptive hash hits [13:40:59] I think the main issue here is that x1 needs a good performance audit, in general [13:50:36] Hm, moritzm, on second thought, maybe your method is better (with gpm install in build process), since that will build the deps for a particular arch. [13:54:53] it even seems to be reproducible, the Godeps file contains git hashes of the revisions it expects [13:56:07] and to add some fun to the fix the Godeps file for go-zookeeper throws a 404 :-) [13:56:12] (to the mix) [13:57:36] 6operations, 5Patch-For-Review: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1745349 (10Andrew) virt-star is used by the nova-compute services to talk to each other, for example when migrating instances from one place to another. It's almost ce... [13:59:03] (03CR) 10Andrew Bogott: [C: 032] "Yep, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/247953 (owner: 10Dzahn) [13:59:08] (03PS4) 10Andrew Bogott: site.pp: remove virt100[1-9] [puppet] - 10https://gerrit.wikimedia.org/r/247953 (owner: 10Dzahn) [14:00:22] (03CR) 10Andrew Bogott: [C: 032] site.pp: remove virt100[1-9] [puppet] - 10https://gerrit.wikimedia.org/r/247953 (owner: 10Dzahn) [14:00:56] (03Abandoned) 10Muehlenhoff: Use testsystem role for virt100[5-7] [puppet] - 10https://gerrit.wikimedia.org/r/247240 (owner: 10Muehlenhoff) [14:01:30] !log performing schema change on officewiki-flow (s3) [14:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:24:19] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1745395 (10Ottomata) COOL.
As part of this discussion, I'd like us to think about not only fields that are relevant to edit events, but also those f... [14:25:45] !log setting thread_pool_size to 32 dynamically on all MariaDB hosts [14:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:25:57] (03CR) 10Dzahn: [C: 031] Remove auth.login-message - not supported by upstream anymore [puppet] - 10https://gerrit.wikimedia.org/r/247793 (https://phabricator.wikimedia.org/T116142) (owner: 10Aklapper) [14:26:40] (03PS2) 10Dzahn: subra/suhail: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247225 (owner: 10Muehlenhoff) [14:26:51] (03CR) 10Dzahn: [C: 032] subra/suhail: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247225 (owner: 10Muehlenhoff) [14:27:07] godog: Did you see where bd808 had responded to you re: rsync/scap? [14:28:26] moritzm: how can I pass an env var to dh [14:28:27] ? [14:28:35] just exporting it doesn't seem to be respected [14:28:43] or, maybe dh golang is overriding what I set [14:29:30] OO, moritzm i got it, but real hacky like [14:29:40] dpkg -c /var/cache/pbuilder/result/jessie-amd64/golang-burrow_0.1.0-1~otto1_all.deb [14:30:42] (03PS5) 10Dzahn: deactivate vikipedi[a].com.tr Turkish domains [dns] - 10https://gerrit.wikimedia.org/r/247903 (https://phabricator.wikimedia.org/T83077) [14:31:28] (03CR) 10Dzahn: [C: 032] deactivate vikipedi[a].com.tr Turkish domains [dns] - 10https://gerrit.wikimedia.org/r/247903 (https://phabricator.wikimedia.org/T83077) (owner: 10Dzahn) [14:34:09] (03PS6) 10Andrew Bogott: Keystone: Adopt a multi-domain model [puppet] - 10https://gerrit.wikimedia.org/r/244350 [14:38:15] ottomata: just export them in the rules file [14:39:42] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests, 3labs-sprint-117: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1745441 (10mark) So we're down to just one system (32 GB) in warranty now? This is Approved.
[14:41:32] !log mw1083 - Error: Could not run Puppet configuration client: Read-only file system [14:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:43:00] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1083 is CRITICAL: Host mw1083 is not in mediawiki-installation dsh group daniel_zahn https://phabricator.wikimedia.org/T116184 [14:43:01] ACKNOWLEDGEMENT - puppet last run on mw1083 is CRITICAL: CRITICAL: Puppet last ran 1 day ago daniel_zahn https://phabricator.wikimedia.org/T116184 [14:43:44] moritzm: i did that, but it doesn't seem to listen [14:43:46] ""Difference between raw and validated EventLogging overall message rates"" shrugs .. [14:44:25] morebots: i think dh golang (or something) mucks with the GOPATH [14:44:26] I am a logbot running on tools-exec-1203. [14:44:26] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [14:44:26] To log a message, type !log . [14:44:34] i had to do this [14:44:34] ln -s $(CURDIR)/debian/godeps $(CURDIR)/obj-x86_64-linux-gnu [14:44:39] hmmmmmm [14:45:04] i guess i could stick those deps in that dir directly, buuut, i was avoiding adding anything above the debian/ dir [14:45:53] ori: why do you not use the python statsd client in your varnish stat collectors? [14:46:03] why use socket lib directly? [14:46:50] !log performing schema change on eventlogging database on db1046 [14:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:47:32] mutante, didn't someone else already notice that read-only filesystem on mw1083? 
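The raw-socket approach questioned above ("why use socket lib directly?") is viable because statsd's wire format is just `name:value|type` sent over UDP, so a client library is optional. A minimal sketch (the metric name and endpoint here are made up, not taken from the actual varnish stat collectors):

```python
import socket

def send_counter(name, value=1, host="127.0.0.1", port=8125):
    """Emit a statsd counter increment over UDP.

    UDP is fire-and-forget: sendto() succeeds whether or not a statsd
    daemon is listening, which is one reason a stat collector can get
    away with using the socket module directly.
    """
    payload = f"{name}:{value}|c".encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload
```

A dedicated client mostly adds conveniences (prefixes, sampling, timers) on top of exactly this line format.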
[14:47:47] (03PS3) 10Dzahn: admin: create agomez and add to stats groups [puppet] - 10https://gerrit.wikimedia.org/r/247467 (https://phabricator.wikimedia.org/T115666) [14:47:52] Krenair: yes, i found a ticket and linked it [14:48:01] aude [14:48:06] ah, and then akosiaris depooled it [14:48:06] the "not in dsh" part is good [14:48:11] yes [14:48:22] ack'ed [14:49:50] I vaguely remember someone saying hw problems about one mw host [14:51:42] 6operations, 7Database, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1745446 (10JanZerebecki) For the cipher we might want to start with ECDHE-RSA-AES128-GCM-SHA256. ECDHE-ECDSA-AES128-GCM-SHA256 might be an alternate option if we want to try EC keys. We o... [14:51:51] (03PS1) 10Alexandros Kosiaris: icinga: Provide a check_gsb command and replace old commands [puppet] - 10https://gerrit.wikimedia.org/r/248036 (https://phabricator.wikimedia.org/T116099) [14:53:35] mutante: maybe this ^ will fix the mess a bit [14:53:41] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [14:53:41] needs some testing though first [14:57:30] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [15:00:04] anomie ostriches thcipriani marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151022T1500). [15:00:19] (03PS2) 10Alexandros Kosiaris: icinga: Provide a check_gsb command and replace old commands [puppet] - 10https://gerrit.wikimedia.org/r/248036 (https://phabricator.wikimedia.org/T116099) [15:00:39] 6operations, 6Phabricator, 7audits-data-retention: Enable mod_remoteip on Phabricator and ensure logs follow retention guidelines - https://phabricator.wikimedia.org/T114014#1745468 (10chasemp) >>! In T114014#1745461, @greg wrote: > I like Phab's 403 page: "Peace out" :) > > Since I can't see it: how much d... 
[15:01:19] 6operations, 6Phabricator, 7audits-data-retention: Enable mod_remoteip on Phabricator and ensure logs follow retention guidelines - https://phabricator.wikimedia.org/T114014#1745469 (10greg) good! [15:01:26] chasemp: g'morning :) [15:02:02] Glaisher: ping for SWAT [15:03:11] (03PS1) 10Muehlenhoff: Assign salt grains for mariadb::labs [puppet] - 10https://gerrit.wikimedia.org/r/248038 [15:03:13] (03PS1) 10Muehlenhoff: Assign salt grains for osm [puppet] - 10https://gerrit.wikimedia.org/r/248039 [15:03:15] (03PS1) 10Muehlenhoff: Assign salt grains for the LVS servers [puppet] - 10https://gerrit.wikimedia.org/r/248040 [15:07:03] (03PS3) 10Alexandros Kosiaris: icinga: Provide a check_gsb command and replace old commands [puppet] - 10https://gerrit.wikimedia.org/r/248036 (https://phabricator.wikimedia.org/T116099) [15:08:11] (03CR) 10Greg Grossmeier: [C: 031] noc: change Gitblit links to Diffusion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248027 (owner: 10Glaisher) [15:09:36] thcipriani: Hi. [15:09:48] (03CR) 10Rush: "glorious" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248027 (owner: 10Glaisher) [15:09:54] Glaisher here. [15:09:59] bnc troubles :( [15:10:09] (03PS1) 10John F. Lewis: mailman: run qdata cron later into 8am bounces [puppet] - 10https://gerrit.wikimedia.org/r/248044 [15:10:17] Philon: howdy! no problem. 
[15:10:18] mutante: good news, I get icinga email :) [15:11:00] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248027 (owner: 10Glaisher) [15:11:22] (03Merged) 10jenkins-bot: noc: change Gitblit links to Diffusion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248027 (owner: 10Glaisher) [15:11:33] !log uploaded openjdk-8 8u66-b17 for jessie-wikimedia to carbon [15:11:39] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:41] (03PS1) 10Milimetric: Exclude CentralNoticeBannerHistory from mysql [puppet] - 10https://gerrit.wikimedia.org/r/248045 (https://phabricator.wikimedia.org/T116241) [15:13:21] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 2.218 second response time [15:13:52] !log thcipriani@tin Synchronized docroot/noc/conf: SWAT: noc: change Gitblit links to Diffusion [[gerrit:248027]] (duration: 00m 17s) [15:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:23] ^ Philon Sync'd! Thanks for making that patch by the way! [15:14:38] 6operations, 7Database: Check, test and tune pool-of-connections and max_connections configuration - https://phabricator.wikimedia.org/T112479#1745511 (10jcrespo) I've set all nodes to 32 in configuration. According to all benchmarks, 32-36 are the maximum practical limit for concurrent threads. We could use p... [15:14:56] thcipriani: looks like something's broken :/ [15:15:03] yeah, urlencoding stuffs. 
[15:15:15] (03PS1) 10Muehlenhoff: Enable ferm on logstash1004 [puppet] - 10https://gerrit.wikimedia.org/r/248047 [15:15:16] yeah, looks like it [15:16:40] chasemp: q for you [15:16:45] since i am now not using diamond [15:18:20] i can emit counts to statsd whith whatever prefix i want [15:18:23] not just servers.cpxxxx [15:18:30] shoudl I have statsd aggregate across servers? [15:18:36] that would make viewing things in graphite easier [15:18:57] (03CR) 10Rush: [C: 032] Enable ferm on logstash1004 [puppet] - 10https://gerrit.wikimedia.org/r/248047 (owner: 10Muehlenhoff) [15:19:00] PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:19:40] (03PS2) 10Milimetric: Exclude CentralNoticeBannerHistory from mysql [puppet] - 10https://gerrit.wikimedia.org/r/248045 (https://phabricator.wikimedia.org/T116241) [15:20:24] ottomata: I'll get back to you in a few I'm in process of logstash firewall things [15:21:52] k [15:22:46] is somebody on the etherpad problem ? 
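[Editor's note: ottomata's question above — emitting counts under an arbitrary key instead of the per-host `servers.cpXXXX` tree so that statsd aggregates across servers — together with ori's socket-based collector mentioned earlier, can be sketched roughly as below. The statsd endpoint and the metric key are illustrative assumptions, not the actual production values.]

```python
import socket

# Minimal sketch (not the production collector): build statsd counter
# packets with a host-agnostic key so statsd sums the counts from every
# server into one Graphite series per flush interval.

def format_metric(key, value, metric_type='c'):
    """Render one metric in the statsd line protocol: <key>:<value>|<type>."""
    return ('%s:%d|%s' % (key, value, metric_type)).encode('utf-8')

def incr(sock, addr, key, count=1):
    """Fire-and-forget UDP send, roughly what using the socket lib directly looks like."""
    sock.sendto(format_metric(key, count), addr)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# Every cache host emits the same key; statsd aggregates them, instead of
# growing a per-host servers.cpXXXX... tree that grafana must sum itself.
# incr(sock, ('statsd.eqiad.wmnet', 8125), 'varnish.eqiad.misc.frontend.request')
print(format_metric('varnish.eqiad.misc.frontend.request', 1))
```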
[15:22:49] akosiaris: ^^ [15:22:50] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [5000000.0] [15:24:39] RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 4.518 second response time [15:24:46] cool [15:25:20] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 17 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 13, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 36, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 91, initializing_shards: 4, number_of_data_nodes: 3, [15:25:20] PROBLEM - ElasticSearch health check for shards on logstash1004 is CRITICAL: CRITICAL - elasticsearch inactive shards 17 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 13, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 36, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 91, initializing_shards: 4, number_of_data_nodes: 3, [15:25:35] (03PS1) 10Thcipriani: Remove urlencode from phabricator links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248049 [15:25:36] we know^ [15:25:37] on elastic [15:26:00] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 17 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 13, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 36, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 91, initializing_shards: 4, number_of_data_nodes: 3, [15:26:42] Philon: I think https://gerrit.wikimedia.org/r/248049 should fix, or I can revert. 
[15:26:59] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 17 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 13, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 36, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 91, initializing_shards: 4, number_of_data_nodes: 3, [15:27:00] PROBLEM - ElasticSearch health check for shards on logstash1005 is CRITICAL: CRITICAL - elasticsearch inactive shards 17 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 13, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 36, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 91, initializing_shards: 4, number_of_data_nodes: 3, [15:27:00] PROBLEM - ElasticSearch health check for shards on logstash1006 is CRITICAL: CRITICAL - elasticsearch inactive shards 17 threshold =0.1% breach: status: yellow, number_of_nodes: 6, unassigned_shards: 13, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 36, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 91, initializing_shards: 4, number_of_data_nodes: 3, [15:27:06] chasemp: hmmm... why is it syncing such old indexes? Those should have checkpointed a long time ago [15:27:43] well that's a good question [15:27:53] basically teh fw rules were not complete and the clsuter wigged out a bit [15:27:56] and now it's yellow [15:28:04] but the why of it's allocation [15:28:08] is a more complex issue [15:28:10] (03CR) 10Glaisher: [C: 031] "I can't think of another way to fix this so okay." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/248049 (owner: 10Thcipriani) [15:28:35] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248049 (owner: 10Thcipriani) [15:28:41] (03Merged) 10jenkins-bot: Remove urlencode from phabricator links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248049 (owner: 10Thcipriani) [15:28:43] (03PS1) 10Muehlenhoff: Don't enable ferm on logstash1004 yet [puppet] - 10https://gerrit.wikimedia.org/r/248050 [15:28:56] 1004 decided that 18 of its indexes needed to be completely rebuilt :/ [15:29:09] (03CR) 10Muehlenhoff: [C: 032 V: 032] Don't enable ferm on logstash1004 yet [puppet] - 10https://gerrit.wikimedia.org/r/248050 (owner: 10Muehlenhoff) [15:29:09] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 3 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [15:29:09] that's going to take a while to recover [15:29:17] bd808 so es has a problem where if a node disappears it basically disavows those shards [15:29:25] so if 1004 goes offline for a bit because fw things [15:29:36] it won't just reaccept back into the cluster those shards, it has to rebuild [15:29:45] I think this is where we are at, we see this on the prod cluster and it's hella painful [15:29:49] that's what the checkpoint stuff should handle though. 
we only have 1 active index at a time [15:29:50] there should be mitigation mechanisms for this [15:30:04] well, interesting [15:30:13] the ELK cluster is different than cirrus [15:30:27] !log thcipriani@tin Synchronized docroot/noc/conf/highlight.php: SWAT: Remove urlencode from phabricator links [[gerrit:248049]] (duration: 00m 17s) [15:30:30] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [5000000.0] [15:30:30] sure, but the stupidness of ES is general :) [15:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:34] but yeah [15:30:35] after it recovers I guess we should force a checkpoint [15:30:52] to save the other 2 from flipping out [15:31:17] my last rolling restart didn't have this issue, but I did force a checkpoint right at the start [15:31:33] ah maybe you can school me on teh whole mechanism sometime here [15:31:43] I don't grok entirely how logstash does the rollout time based stuff [15:32:15] wall clock based. events go into an index with a name matching the event's timestamp [15:34:11] bd808: tldr we rolled back the change and mortiz is looking into what is missing [15:34:21] obv we will let it sit at least long enough to go green :) [15:35:16] *nod* I'll keep my recovery monitor script running in a window [15:36:55] thcipriani: fixed now. thanks! [15:37:35] Philon: awesome. Thanks for double checking, and thanks for the patch, appreciated! [15:37:53] :) [15:38:01] hm ka20 eh? 
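[Editor's note: bd808's point that logstash events go into a wall-clock index named after the event's timestamp, and the `_cluster/health` fields the icinga checks above are parsing (`status`, `unassigned_shards`, ...), can be sketched as below. The host name, port, and index prefix are assumptions for illustration.]

```python
import json
import time
import urllib.request

def cluster_health(host='localhost', port=9200):
    """Fetch the same _cluster/health document the icinga shard check parses."""
    with urllib.request.urlopen('http://%s:%d/_cluster/health' % (host, port)) as r:
        return json.load(r)

def todays_index(prefix='logstash', ts=None):
    """Logstash-style wall-clock index name, e.g. logstash-2015.10.22."""
    return '%s-%s' % (prefix, time.strftime('%Y.%m.%d', time.gmtime(ts)))

# health = cluster_health('logstash1001.eqiad.wmnet')  # hypothetical host
# A yellow status with unassigned_shards > 0 is what tripped the alerts above.
print(todays_index(ts=1445528264))
```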
[15:38:11] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 1.00% above the threshold [1000000.0] [15:38:14] ottomata: ok so on this we have some loose standards which are not standards [15:38:17] see http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1445528264.835&target=elasticsearch.production-search-eqiad.elasticsearch.indices.count [15:38:56] and http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1445528327.538&target=servers.elastic1001.elasticsearch.http.current&target=servers.elastic1001.elasticsearch.http.current_open [15:39:09] so take elasticsearch as a slightly off kilter example [15:39:21] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [15:39:46] we do basically .. [15:40:02] which houses cluster related things for teh site, and then each node also has their own which are node specific [15:40:23] aye, guess i'm not sure if we want the node level breakdowns avail in graphite [15:40:30] if we do, i can make that happen just like diamond [15:40:36] but its a lot more metrics for grafana to graph :) [15:40:38] and sum [15:40:42] i could emit both! [15:40:50] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:40:52] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) [15:40:58] you could and there is actualy a native mechanism for graphite to do this the smart way [15:40:59] emit a node named stat and then also a generic stat that statsd will aggregate [15:41:02] oh? 
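[Editor's note: the "native mechanism" chasemp refers to is carbon's aggregator (linked just below). A hedged, illustrative `aggregation-rules.conf` entry — not deployed config here, and the metric paths are assumptions — might look like:]

```
# aggregation-rules.conf: derive a cluster-wide series from per-node
# series at carbon time.  Format: <output> (frequency) = <method> <input>
servers.logstash-cluster.elasticsearch.http.current (60) = sum servers.logstash*.elasticsearch.http.current
```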
[15:41:11] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy [15:41:14] take the node ones and equate a clsuter one at the carbon time but I dont' think we use it [15:41:20] anywhere [15:41:26] oh, hm. [15:41:30] that is a graphite setting? [15:41:39] slash carbon setting? [15:42:21] http://www.franklinangulo.com/blog/2014/6/6/graphite-series-6-carbon-aggregators [15:43:06] (03PS3) 10Ottomata: Exclude CentralNoticeBannerHistory from mysql [puppet] - 10https://gerrit.wikimedia.org/r/248045 (https://phabricator.wikimedia.org/T116241) (owner: 10Milimetric) [15:43:14] (03CR) 10Ottomata: [C: 032 V: 032] Exclude CentralNoticeBannerHistory from mysql [puppet] - 10https://gerrit.wikimedia.org/r/248045 (https://phabricator.wikimedia.org/T116241) (owner: 10Milimetric) [15:46:51] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:47:45] ottomata: fwiw the tradeoff on submission in the pattern you are suggesting is real, and the graphite aggregate is the consequence of a lot of ppl making this "we weill have later expensive" queries tradeoff [15:47:54] (03PS4) 10Alexandros Kosiaris: icinga: Provide a check_gsb command and replace old commands [puppet] - 10https://gerrit.wikimedia.org/r/248036 (https://phabricator.wikimedia.org/T116099) [15:48:35] and it's pretty tempting to have just one cluster tree but you can abstract that in grafana and it only stretches as far as counters (well) the moment you want to do a guage now you are in a nondeterministic situaiton [15:48:38] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "The command seems to work fine and a puppet catalog compilation in http://puppet-compiler.wmflabs.org/1055/neon.wikimedia.org/ showed the " [puppet] - 10https://gerrit.wikimedia.org/r/248036 (https://phabricator.wikimedia.org/T116099) (owner: 10Alexandros Kosiaris) [15:48:40] as it always overwrites but from wehre [15:48:41] RECOVERY - Restbase endpoints health on aqs1003 is 
OK: All endpoints are healthy [15:49:51] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy [15:49:55] hm, chasemp, simple statsd q [15:50:04] if i publish a metric from multiple hosts with the same key [15:50:05] via [15:50:11] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy [15:50:11] statsd_client.incr in python [15:50:31] statsd will collect all counts for that for 1 minute [15:50:40] and then publish the sum and computed values like rate and mean [15:50:42] to graphite? [15:50:55] 6operations, 10Wikimedia-Mailing-lists: Internal mailman-api - https://phabricator.wikimedia.org/T116288#1745598 (10Addshore) 3NEW [15:51:08] 6operations, 10Wikimedia-Mailing-lists: Install mailman-api for internal use - https://phabricator.wikimedia.org/T116288#1745605 (10Addshore) [15:51:18] so our statsd I'm not sure what stats it provides but essentially the increment function is for the '|c' counter type [15:52:02] and per $flush_interval it will perform this basic operation, counters are essentially rate (I think ours does a sum as well and includes other things) [15:52:15] we effectively do counter where we mean sum and get sum for free and dupe a lot of data [15:52:19] in some cases [15:53:17] (03PS2) 10Muehlenhoff: Move the ferm rules for elasticsearch internode traffic into role::logstash::elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/244412 (https://phabricator.wikimedia.org/T104964) [15:53:27] we have blended the boundaries of |c and |s (which isn't even really sum but "set") in the reference implementation [15:53:30] ottomata: https://github.com/etsy/statsd/blob/master/docs/metric_types.md [15:54:00] aye [15:54:02] chasemp: [15:54:02] https://graphite.wikimedia.org/render/?width=588&height=311&from=-1hours&target=test.reqstats.misc.cp1056.client.method.get.rate&target=test.reqstats.misc.cp1056.client.method.get.count&target=test.reqstats.misc.cp1056.client.method.get.mean [15:55:25] ottomata: keep 
in mind that could be confusing as I don't know how count works there but with sampling effectively rate and count do not work as 1:1...assuming it's not flushing a post ratio count which would be totally nuts [15:55:31] chasemp: Gauges are supported in the wikimedia implementation right? [15:55:32] hmm, chasemp, ah I have a bug. [15:55:37] its me, not statsd :) [15:56:00] chasemp: i am not sampling [15:56:02] our statsd supports the natives types afaik it just does a bit extra [15:56:09] cool! :) [15:56:47] ottomata: I understand :) but I'm saying if you count on the count being the same as a set for a submitted counter and the load grows to where we have to sample [15:57:11] it gets complicated :) you can always infer rate from a set [15:57:19] anyways [15:57:26] Ignore lag on dbstore2002, it is expected, and not critical [15:58:45] chasemp: are you User:Deamon? [15:59:02] chasemp: i don't understand the set. that would keep each individual reported metric between each flush separate? [15:59:04] ...no [15:59:14] Steinsplitter: ^ [16:00:04] _joe_ andrewbogott: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151022T1600). Please do the needful. [16:00:22] empty, woo [16:00:23] ottomata: so ignoring our slightly offkilter implementation here is why no one uses sum in the counter as authoritative. You start at 1:1 and sum is effectively the same as set. We submitted 10 values in 10s. We get a rate of 1/ps and a sum of 10 [16:00:34] now say it's useful at some point we have to start submitting a ratio [16:00:36] say 10% [16:01:23] now the sumitted counter changes to reflect. 
You submit 10 values at 10% and we get 100 and a rate of 10/ps, but the sum there is still 10 [16:01:38] so natively if you want a total count of things you would submit a set [16:01:48] and you could then use functions to determine rate if you needed [16:02:10] here everyone seems to submit counters and view the sum through rose colored glasses I guess and we don't use the types that much [16:02:36] but if we are talking normal vanilla statsd usage if you want a straight "how many of these things" you would submit as |s [16:03:12] counter is a terrible name for a thing that is meant to create a rate as output as it's primary objective :) [16:03:30] PROBLEM - Disk space on logstash1004 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 107254 MB (3% inode=99%) [16:09:42] PROBLEM - Host payments2002 is DOWN: PING CRITICAL - Packet loss = 100% [16:09:45] 6operations, 10Wikimedia-Mailing-lists: Install mailman-api for internal use - https://phabricator.wikimedia.org/T116288#1745645 (10JohnLewis) What do you want to achieve by this? I've set this up on labs to look at it and currently it seems you can only list subscribers to lists, (un)sub an address and send e... [16:09:49] PROBLEM - Host betelgeuse is DOWN: PING CRITICAL - Packet loss = 100% [16:09:55] PROBLEM - Host bellatrix is DOWN: PING CRITICAL - Packet loss = 100% [16:10:16] FR down? [16:10:18] no bueno... [16:10:21] Jeff_Green: ^ [16:10:23] you about? [16:10:52] oh, its codfw [16:10:56] i think it's all codfw, which last I knew was not in use [16:11:01] what chasemp just said [16:11:09] ah right, star names [16:11:12] I'd expect a very large email from FR-tech about codfw going online since then [16:11:15] RECOVERY - Disk space on logstash1004 is OK: DISK OK [16:11:25] (if it was live) [16:12:03] PROBLEM - Host mintaka is DOWN: PING CRITICAL - Packet loss = 100% [16:12:25] ??? 
[16:12:32] in an interview [16:12:38] 6operations, 10Wikimedia-Mailing-lists: Install mailman-api for internal use - https://phabricator.wikimedia.org/T116288#1745660 (10Krenair) Is this part of the version of mailman we're using? Does it provide access to data that's currently hidden such as the full list of lists? It sounds like it will let you... [16:13:13] PROBLEM - Host pay-lvs2002 is DOWN: PING CRITICAL - Packet loss = 100% [16:13:16] boy, that sure is a lot of host down alerts though. papaul, is this expected? [16:13:23] 6operations, 10Wikimedia-Mailing-lists: Install mailman-api for internal use - https://phabricator.wikimedia.org/T116288#1745662 (10Addshore) list subscribers is exactly what I would initially want to do here. See https://github.com/wikimedia/analytics-limn-wikidata-data/blob/master/src/social/mail/generate.php [16:13:33] seems a bit of a coincidence, network/firewall problems there perhaps? [16:13:51] I would check the switch first [16:13:52] akosiaris: its frack in codfw which isnt online for use [16:14:00] akosiaris: so you can likely return to interview =] [16:14:03] RECOVERY - Host pay-lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 34.73 ms [16:14:11] RECOVERY - Host mintaka is UP: PING OK - Packet loss = 0%, RTA = 35.15 ms [16:15:19] RECOVERY - Host payments2002 is UP: PING OK - Packet loss = 0%, RTA = 34.81 ms [16:15:29] RECOVERY - Host betelgeuse is UP: PING OK - Packet loss = 0%, RTA = 34.62 ms [16:15:46] So i wonder if this just had the issue that it had the other week [16:15:50] RECOVERY - Host bellatrix is UP: PING OK - Packet loss = 0%, RTA = 35.26 ms [16:15:51] paravoid: Are you around by chance? [16:16:37] bblack: Could you have a look at https://phabricator.wikimedia.org/T114995 please? 
[16:17:02] hoo: he's out sick [16:17:21] :( [16:17:29] probably need to escalate to paravoid [16:17:32] based on the description [16:17:59] (03PS1) 10Glaisher: Increase abusefilter emergency disable threshold on MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248060 [16:18:00] hoo: how long has this been an issue? [16:18:22] greg-g: Hi. Is it possible to get that deployed? ^ [16:18:26] i.e., is this possibly the consequence of a recent change, or is it some longstanding issue that has only now been discovered? [16:18:27] ori: Ever since... but we only enabled the mobile redirect on Oct. 8 or os [16:18:31] It's kind of an emergency.. [16:18:41] (03PS2) 10Ori.livneh: Increase abusefilter emergency disable threshold on MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248060 (owner: 10Glaisher) [16:18:44] hoo: also could you take a look at it? [16:18:47] (03CR) 10Ori.livneh: [C: 032] Increase abusefilter emergency disable threshold on MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248060 (owner: 10Glaisher) [16:18:53] (03Merged) 10jenkins-bot: Increase abusefilter emergency disable threshold on MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248060 (owner: 10Glaisher) [16:18:57] ori beat me to it :) [16:19:04] ori: oh, nice. thanks! :) [16:19:29] bleh... [16:19:46] !log ori@tin Synchronized wmf-config/InitialiseSettings.php: I8f690589: Increase abusefilter emergency disable threshold on MediaWiki.org (duration: 00m 17s) [16:19:51] ^ Glaisher [16:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:19:58] \o/ thanks a lot [16:20:03] thank you [16:20:04] ewk, is the mw.o attack still ongoing? [16:20:16] I blocked most of privateinternetaccess yesterday because of that [16:20:26] adrewbogott: did get you question was at the men's room [16:20:45] adrewbogott: ? 
[16:20:46] 6operations, 10Wikimedia-Mailing-lists: Install mailman-api for internal use - https://phabricator.wikimedia.org/T116288#1745674 (10Addshore) > It sounds like it will let you subscribe/unsubscribe people without authentication as well? I may try and throw some PRs at it quickly / file some issues. [16:20:59] 6operations, 10Wikimedia-Mailing-lists: Install mailman-api for internal use - https://phabricator.wikimedia.org/T116288#1745675 (10JohnLewis) >>! In T116288#1745660, @Krenair wrote: > Is this part of the version of mailman we're using? Yes. It is built for the 2.1.x branch and I have it working in labs [which... [16:23:19] hoo: could you perhaps update the task description to give that bit of background information and to link to I0998630ea7744d9fecfa50150b209bdbb8e2f71b ? [16:23:39] (03PS3) 10Muehlenhoff: Move the ferm rules for elasticsearch internode traffic into role::logstash::elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/244412 (https://phabricator.wikimedia.org/T104964) [16:23:46] i'm poking at it now, but i'm not sure i'll get anywhere, but giving these pointers will help others get up to speed too [16:24:23] robh/chasemp yah, codfw is tepid-standby at this point [16:24:46] maybe even warm, but mostly untested [16:24:56] 6operations, 10Wikimedia-Mailing-lists: Install mailman-api for internal use - https://phabricator.wikimedia.org/T116288#1745701 (10Addshore) See https://github.com/TracyWebTech/mailman-api/issues/10 [16:25:12] thanks ori [16:25:18] (re abusefilter) [16:25:49] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 13 failures [16:25:50] ori: Sure [16:25:52] greg-g: i asked myself WWGD (what would greg-g do) and figured you'd say yes [16:26:17] :) [16:26:24] WWG-GD [16:26:41] back. 
was driving a car when i got the texts [16:27:38] 6operations, 10Wikimedia-Mailing-lists: Install mailman-api for internal use - https://phabricator.wikimedia.org/T116288#1745709 (10Addshore) p:5Triage>3Low [16:30:40] RECOVERY - check_puppetrun on bellatrix is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [16:34:28] (03PS1) 10Ori.livneh: Add req.host rule for m.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/248064 (https://phabricator.wikimedia.org/T114995) [16:36:06] (03CR) 10Ori.livneh: [C: 032] Add req.host rule for m.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/248064 (https://phabricator.wikimedia.org/T114995) (owner: 10Ori.livneh) [16:37:16] (03CR) 10Nuria: ">Actually, on 2nd thought, should this be configurable? Is it something all >users of the puppet-cdh module will want to have turned on b" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/247758 (https://phabricator.wikimedia.org/T116202) (owner: 10Nuria) [16:39:27] !log kartik@tin Synchronized private/PrivateSettings.php: T116134: Set CX JWT token (duration: 00m 17s) [16:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:39:39] thcipriani: ^^ [16:40:43] hoo: i may have fixed it. can you test? [16:41:07] it's not retroactive, so you'd have to do something that would initiate a purge [16:41:51] addshore, aude, etc. ^ [16:42:52] chasemp: 1004 doesn't have any active shards recovering, but it is still missing 13 shards. I haven't logged in to check logs there yet to see what's up [16:44:50] ori: looks good to me [16:44:56] \o/ [16:45:22] https://m.wikidata.org/wiki/Q2128211 says "general" for the description (which someone just added) [16:45:31] \o/ [16:46:45] Nice! 
:) [16:47:04] (03PS1) 10Ottomata: The varnish reqstats diamond collector does not work, emit to statsd directly instead [puppet] - 10https://gerrit.wikimedia.org/r/248067 (https://phabricator.wikimedia.org/T83580) [16:47:41] so wait, chasemp, should I send both to statsd? node based and aggregate key based [16:47:41] ? [16:48:16] ottomata: probably only aggregate, imo [16:48:22] k [16:48:45] it will aggregate as "varnish.${::site}.misc.frontend.request", [16:48:54] !log Restarted Elasticsearch on logstash1004 [16:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:49:56] chasemp: restarting elasticsearch on 1004 seems to have got it recovering again [16:50:03] huh [16:51:31] there wasn't anything interesting in the logs but I saw in the head monitor screens that the cluster thought there was no host that could take the remaining shards. Not sure why [16:52:08] i.e. I saw unassigned_info: reason: NODE_LEFT [16:52:33] are you doing unicast? [16:52:58] It does fixed master list to find the cluster [16:53:20] (03PS1) 10Andrew Bogott: Partman attempt for the new labvirt hardware [puppet] - 10https://gerrit.wikimedia.org/r/248068 [16:53:21] so, yeah unicast I guess [16:53:41] not the multicast discovery stuff. that's too flakey [16:53:55] oh agreed but we use it for main cluster and haven't seen this exact behavior [16:54:03] I don't imagine either mechanism is without issue [16:55:09] (03PS2) 10Andrew Bogott: Partman attempt for the new labvirt hardware [puppet] - 10https://gerrit.wikimedia.org/r/248068 [17:01:16] godog: or chasemp, whenever you get a sec [17:01:16] https://gerrit.wikimedia.org/r/#/c/248067/1 [17:01:28] robh: no, frack was not the exact same issue as same week [17:01:46] as last week* [17:01:47] sigh [17:02:19] not sure what it was though [17:02:50] ahh, ok [17:03:27] (03CR) 10Alexandros Kosiaris: "nope, but I suppose we can do that in labs. 
In fact I am a bit worried about disk space usage if we enable this fleet wise, so I am gonna " [puppet] - 10https://gerrit.wikimedia.org/r/247823 (owner: 10Alexandros Kosiaris) [17:04:59] Jeff_Green: is eventdonations.wm.org on a wmf server? [17:05:47] mutante: no, it's a third-party service we use [17:07:22] ottomata: ok! I'll take a closer look tomorrow too [17:07:39] Jeff_Green: why i asked: i monitor the ssl cert because it appeared on a list of our certs. so when i add that to icinga and check HTTPS it works just fine. except that Icinga thinks the "host" is down because they dont respond to ICMP from us [17:08:13] service up, host down, just disabled the notification but it's meh [17:09:11] mutante: i don't know whether that's fixable, my guess is no [17:09:43] yea, i dont expect a fix, really. i'm just pondering whether i should leave it as it is or remove the cert check [17:10:04] i'd have everything covered except this .. [17:10:10] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [17:12:00] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [17:12:54] i'll leave it in and live with one host being shown as down but acknowledged [17:15:20] jynus: [17:15:31] jynus: hola, yt? [17:21:56] can any handy ops person let me know what physical machine m4-master.eqiad.wmnet maps to? is it db1046.eqiad.wmnet? [17:22:40] nuria: dbproxy1004.eqiad.wmnet [17:22:50] (03PS2) 10Rush: Remove auth.login-message - not supported by upstream anymore [puppet] - 10https://gerrit.wikimedia.org/r/247793 (https://phabricator.wikimedia.org/T116142) (owner: 10Aklapper) [17:22:55] and ahem .. how could i know that? [17:23:09] mutante: how can i find those mappings? [17:23:14] nuria: i grepped in the repository with the DNS templates [17:23:26] you can git clone the WMF DNS repo [17:23:28] ahahaha, where is taht depot? 
[17:23:31] *that [17:23:57] (03CR) 10Rush: [C: 032] Remove auth.login-message - not supported by upstream anymore [puppet] - 10https://gerrit.wikimedia.org/r/247793 (https://phabricator.wikimedia.org/T116142) (owner: 10Aklapper) [17:23:57] https://gerrit.wikimedia.org/r/p/operations/dns.git [17:24:09] PROBLEM - Disk space on logstash1004 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 111837 MB (3% inode=99%) [17:24:43] jynus: MySQL_at_Wikipedia.pdf , verry informative presenatation. thanks for sharing :) [17:24:57] nuria: git clone the above URL, then in "./templates/wmnet" search for m4-master [17:25:23] many thanks mutante [17:25:31] nuria: you're welcome [17:25:46] mutante: our labored over login text is now toast man :) [17:26:37] chasemp: ah! the phab login, yes, saw that from andre! are you also merging the new message? [17:26:45] nope didn't see it [17:26:52] just cleanup atm [17:27:37] mutante: since you are so kind .. is there an icinga page that best displays load for this db? [17:27:42] chasemp: https://gerrit.wikimedia.org/r/#/c/247794/ [17:28:04] chasemp: that's the new way to add a custom message, per upstream [17:28:19] so andre is just like doing what he has to do now to keep it, afaict [17:28:43] per https://secure.phabricator.com/T9346 [17:28:59] (03PS1) 10Alexandros Kosiaris: gsb: Fix typo in icinga command_line definition [puppet] - 10https://gerrit.wikimedia.org/r/248077 [17:29:07] nuria: not icinga, but there is tendril.wikimedia.org [17:29:29] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] gsb: Fix typo in icinga command_line definition [puppet] - 10https://gerrit.wikimedia.org/r/248077 (owner: 10Alexandros Kosiaris) [17:29:29] mutante: no clue if that pans out, I saw the new logic but no specifs [17:29:33] nuria: can you login there with labs credentials? 
[17:29:35] mutante: thanks gain [17:29:38] *again [17:29:50] RECOVERY - Disk space on logstash1004 is OK: DISK OK [17:30:10] RECOVERY - ElasticSearch health check for shards on logstash1006 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 5, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 96, initializing_shards: 4, number_of_data_nodes: 3, delayed_unassigned_sh [17:30:11] nuria: also https://dbtree.wikimedia.org/ might be useful [17:30:41] RECOVERY - ElasticSearch health check for shards on logstash1004 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 5, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 96, initializing_shards: 4, number_of_data_nodes: 3, delayed_unassigned_sh [17:30:49] RECOVERY - ElasticSearch health check for shards on logstash1005 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 5, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 96, initializing_shards: 4, number_of_data_nodes: 3, delayed_unassigned_sh [17:30:59] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 5, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 96, initializing_shards: 4, number_of_data_nodes: 3, delayed_unassigned_sh [17:31:10] 
RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 5, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 96, initializing_shards: 4, number_of_data_nodes: 3, delayed_unassigned_sh [17:31:11] chasemp: *nod*, sounds like it should be on phab-0x in labs first then [17:31:19] nice recoveries [17:31:50] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 5, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 35, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 96, initializing_shards: 4, number_of_data_nodes: 3, delayed_unassigned_sh [17:32:34] jouncebot: refresh [17:32:37] I refreshed my knowledge about deployments. [17:33:18] greg-g: I just put myself down for a deploy slot right after the train -- https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151022T2000 [17:36:11] mutante, do you need anything else for https://gerrit.wikimedia.org/r/#/c/244078/ ? [17:36:12] bd808: cool, glad we're taking care of that [17:36:27] for certain values of "we" ;) [17:37:30] MaxSem: i think.. no :) [17:37:43] reading the comments now [17:38:52] bd808: thank *you* :) [17:39:27] and Niharika! And for this round of fixes csteipp too! [17:40:37] (03CR) 10Dzahn: [C: 032] "thank you all for your input. going ahead with the "parking". 
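The RECOVERY lines above embed the Elasticsearch cluster health as a flat `key: value` list. As a small sketch, the overall status can be pulled out of one of those check strings like this; the input is copied from the check output format above (truncated as in the log), and the sed-based parsing is purely illustrative, not how Icinga itself evaluates the check:

```shell
# Health summary in the format reported by the check above (truncated)
health='elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 5'

# Extract the value following "status: " (green / yellow / red)
status=$(printf '%s\n' "$health" | sed -n 's/.*: status: \([a-z]*\),.*/\1/p')
echo "$status"
```

A `yellow` status, as here, means all primary shards are allocated but some replicas are still unassigned or initializing, which matches the `unassigned_shards: 5` / `initializing_shards: 4` fields in the recoveries above.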
that means no traffic, but yes, we keep the domain (unless legal decides oth" [dns] - 10https://gerrit.wikimedia.org/r/244078 (owner: 10Dzahn) [17:40:43] (03PS3) 10Dzahn: deactivate wikimaps.[com|net|org] domains [dns] - 10https://gerrit.wikimedia.org/r/244078 [17:40:45] bd808: awesome [17:41:59] (03PS2) 10Ori.livneh: Restrict access to redis on abacist [puppet] - 10https://gerrit.wikimedia.org/r/247995 (owner: 10Muehlenhoff) [17:42:07] (03CR) 10Ori.livneh: [C: 032 V: 032] Restrict access to redis on abacist [puppet] - 10https://gerrit.wikimedia.org/r/247995 (owner: 10Muehlenhoff) [17:43:46] (03PS1) 10Madhuvishy: [WIP] burrow: Add new module for burrow [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) [17:45:45] (03CR) 10Ottomata: "Looks like it is already on by default:" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/247758 (https://phabricator.wikimedia.org/T116202) (owner: 10Nuria) [17:48:21] (03Abandoned) 10Nuria: Enabling mapjoins in hive by default [puppet/cdh] - 10https://gerrit.wikimedia.org/r/247758 (https://phabricator.wikimedia.org/T116202) (owner: 10Nuria) [17:48:35] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1745944 (10Dzahn) @atgo thanks, i added that key to the code change waiting in code review on gerrit now. Everything looks ready to go here... [17:51:36] (03CR) 10Ottomata: "I'd say don't worry about parameterizing $config_dir or the systemd file. This is all standard and will come with the .deb package." 
(038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) (owner: 10Madhuvishy) [17:51:43] (03PS4) 10Dzahn: dumps: move ssl cert and config to role [puppet] - 10https://gerrit.wikimedia.org/r/247700 [17:54:32] (03CR) 10Dzahn: [C: 032] dumps: move ssl cert and config to role [puppet] - 10https://gerrit.wikimedia.org/r/247700 (owner: 10Dzahn) [17:58:42] that might not have worked.. hrm [17:58:56] but error doesnt even look related [17:59:36] 6operations: Photos of Servers - https://phabricator.wikimedia.org/T94694#1745986 (10chasemp) 5Open>3Resolved seems done [18:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151022T1800). [18:00:30] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: puppet fail [18:00:34] (03PS1) 10Dzahn: Revert "dumps: move ssl cert and config to role" [puppet] - 10https://gerrit.wikimedia.org/r/248081 [18:00:54] (03CR) 10Dzahn: [C: 032] "undefined method `join' for nil:NilClass" [puppet] - 10https://gerrit.wikimedia.org/r/248081 (owner: 10Dzahn) [18:01:49] (03CR) 10JanZerebecki: [C: 031] Publish bzip2 compressed Wikidata json dumps [puppet] - 10https://gerrit.wikimedia.org/r/245850 (https://phabricator.wikimedia.org/T115222) (owner: 10Hoo man) [18:03:27] nuria, sorry, still on an interview [18:03:48] jynus: np, mutante help us [18:04:20] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [18:04:20] is there any problem with m4? [18:04:32] I am running a schema change there [18:05:23] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1746016 (10atgo) If it's OK with you guy that @tnegrin approve, that would be super. Otherwise is there a way for Lisa to do it by email ins... 
[18:06:00] (03PS2) 10Madhuvishy: [WIP] burrow: Add new module for burrow [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) [18:07:49] 6operations, 6Labs, 7Database, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1746024 (10chasemp) [18:07:51] 6operations, 7Database: Replicate the Phabricator database to labsdb - https://phabricator.wikimedia.org/T52422#1746021 (10chasemp) 5Open>3Resolved a:3chasemp So we have visited this in a few other tickets and the tldr is there is a huge amount of data that is sensitive in Phab for fundraising, procureme... [18:08:20] PROBLEM - Disk space on logstash1004 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 107686 MB (3% inode=99%) [18:08:41] (03PS3) 10Madhuvishy: [WIP] burrow: Add new module for burrow [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) [18:10:10] RECOVERY - Disk space on logstash1004 is OK: DISK OK [18:10:11] (03CR) 10Madhuvishy: [WIP] burrow: Add new module for burrow (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) (owner: 10Madhuvishy) [18:12:10] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: puppet fail [18:12:40] I am reading the backlog: please note that dbproxy1004.eqiad.wmnet's 3306 port gets redirected, as a good proxy [18:13:19] Steinsplitter thanks for finding it useful [18:37:54] (03CR) 10Andrew Bogott: [C: 032] Partman attempt for the new labvirt hardware [puppet] - 10https://gerrit.wikimedia.org/r/248068 (owner: 10Andrew Bogott) [18:38:28] (03PS3) 10Andrew Bogott: Partman attempt for the new labvirt hardware [puppet] - 10https://gerrit.wikimedia.org/r/248068 [18:38:31] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [18:40:30] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above 
the critical threshold [500.0] [18:42:09] how do you add a tag to a new grafana dashboard? [18:43:40] nvm, found it [18:47:31] PROBLEM - puppet last run on mw2102 is CRITICAL: CRITICAL: Puppet has 1 failures [18:47:39] PROBLEM - Disk space on logstash1004 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 109605 MB (3% inode=99%) [18:53:39] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:53:47] ok time to get the train rolling again... [18:54:34] (03PS1) 1020after4: all wikis to 1.27.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248095 [18:55:30] (03CR) 1020after4: [C: 032] all wikis to 1.27.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248095 (owner: 1020after4) [18:55:35] (03Merged) 10jenkins-bot: all wikis to 1.27.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/248095 (owner: 1020after4) [18:55:42] (03PS4) 10Madhuvishy: [WIP] burrow: Add new module for burrow [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) [18:56:13] (03CR) 10jenkins-bot: [V: 04-1] [WIP] burrow: Add new module for burrow [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) (owner: 10Madhuvishy) [18:56:16] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: all wikis to 1.27.0-wmf.3 [18:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:57:07] (03PS5) 10Madhuvishy: [WIP] burrow: Add new module for burrow [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) [18:57:18] jynus: free? 
[18:57:31] for a few minutes, yes [18:57:39] (03CR) 10jenkins-bot: [V: 04-1] [WIP] burrow: Add new module for burrow [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) (owner: 10Madhuvishy) [19:00:41] (03PS2) 10Dzahn: mailman: run qdata cron later into 8am bounces [puppet] - 10https://gerrit.wikimedia.org/r/248044 (owner: 10John F. Lewis) [19:00:59] ^ nuria? [19:01:05] (03CR) 10Dzahn: [C: 032] mailman: run qdata cron later into 8am bounces [puppet] - 10https://gerrit.wikimedia.org/r/248044 (owner: 10John F. Lewis) [19:01:06] jynus: hola! [19:01:13] hola [19:01:27] jynus: question, was looking at https://tendril.wikimedia.org/host/view/db1046.eqiad.wmnet/3306 [19:01:37] (03PS3) 10Dzahn: Publish bzip2 compressed Wikidata json dumps [puppet] - 10https://gerrit.wikimedia.org/r/245850 (https://phabricator.wikimedia.org/T115222) (owner: 10Hoo man) [19:01:41] jynus: and it looks like you are doing work on EL master, is that done? [19:01:44] yes? [19:01:52] EL? [19:01:58] eventlogging, sorry [19:02:14] yes, I am doing some schema changes [19:02:17] (03PS4) 10Dzahn: Publish bzip2 compressed Wikidata json dumps [puppet] - 10https://gerrit.wikimedia.org/r/245850 (https://phabricator.wikimedia.org/T115222) (owner: 10Hoo man) [19:02:25] I announce them on the SAL [19:02:29] Server ADmin log [19:02:35] (03CR) 10Dzahn: [C: 032] Publish bzip2 compressed Wikidata json dumps [puppet] - 10https://gerrit.wikimedia.org/r/245850 (https://phabricator.wikimedia.org/T115222) (owner: 10Hoo man) [19:02:36] they will take 1 day [19:02:37] jynus: ok, can you let us know when you are done? we will start backfilling of some events then cc milimetric [19:02:40] RECOVERY - Disk space on logstash1004 is OK: DISK OK [19:02:58] schema changes are compatible with writes [19:03:07] you can write to the tables with no problem [19:03:08] jynus: or we can look at sal if you will announce the end there too [19:03:22] jynus: even at high load? 
[19:03:32] it is true that they will be slower [19:03:36] (03PS1) 10Andrew Bogott: Added labvirt1010 and 1011 to linux-host-entries.ttyS1-115200 [puppet] - 10https://gerrit.wikimedia.org/r/248096 [19:03:54] (03PS6) 10Madhuvishy: [WIP] burrow: Add new module for burrow [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) [19:03:58] (03PS2) 10Andrew Bogott: Added labvirt1010 and 1011 to linux-host-entries.ttyS1-115200 [puppet] - 10https://gerrit.wikimedia.org/r/248096 [19:03:59] jynus: right, that's what we noticed, so it kinds of makes sense to wait so as to reduce our baby sitting time [19:04:07] pk, then [19:04:13] so the ETA is [19:04:29] 12% 1+04:22:41 remain [19:04:29] 6operations, 10MediaWiki-Cache, 6Performance-Team, 10hardware-requests, 7Availability: Setup a 3 server Kafka instance in both eqiad and codfw for reliable purge streams - https://phabricator.wikimedia.org/T114191#1687618 (10Ottomata) [19:04:32] 6operations, 10Analytics, 6Discovery, 10EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1746131 (10Ottomata) [19:04:46] is that ok, or would you prefer me to cancel? [19:04:55] that is 1 day 4 hours [19:05:12] jynus: no, taht si fine, even couple three days would be fine too [19:05:16] *that [19:05:25] cc milimetric [19:05:31] so, let me give you the background [19:05:32] jynus: thank you! 
[19:05:39] jynus: yess [19:05:55] I will update the ticket as soon as posiible [19:06:27] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1746147 (10Ottomata) etherpad from today's meeting: https://etherpad.wikimedia.org/p/eventbus-events [19:06:36] nuria, subscribe to https://phabricator.wikimedia.org/T108856#1745519 [19:06:56] (03CR) 10Andrew Bogott: [C: 032] Added labvirt1010 and 1011 to linux-host-entries.ttyS1-115200 [puppet] - 10https://gerrit.wikimedia.org/r/248096 (owner: 10Andrew Bogott) [19:07:07] that is the place I will give the most up to date info regarding this work [19:07:21] man we really should trim that table, I disagree we need to keep so many events in mysql [19:07:26] we can move it to hdfs or something [19:07:43] thx jynus, "roger that" on all the other stuff you said above [19:07:47] actually, the next job is deleting many events [19:07:56] k, good [19:08:02] but this had to be done first [19:08:27] I hope I didn't impact much, but I was also surprised by the size of this table [19:12:10] RECOVERY - puppet last run on mw2102 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [19:15:06] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, 5iOS-5-app-production: Production achine to suport pywick analytics - https://phabricator.wikimedia.org/T116312#1746196 (10JMinor) 3NEW [19:16:05] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, 5iOS-5-app-production: Request one server to suport pywick analytics - https://phabricator.wikimedia.org/T116312#1746205 (10JMinor) [19:16:37] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, 5iOS-5-app-production: Request one server to suport pywick analytics - https://phabricator.wikimedia.org/T116312#1746196 (10JMinor) [19:20:49] moritzm: still around? 
[19:20:54] got some go packaging qs [19:21:02] sure, go ahead [19:21:46] so, ok, i got it to build, but [19:21:58] it wouldn't listen to my export GOPATH [19:22:03] dh_golang didn't [19:22:19] it expected go deps to be in $(CURDIR)/obj-x86_64-linux-gnu [19:22:34] that seems to be go-specific behaviour [19:22:38] so, i hacked around by symlinking that in rules [19:22:40] ln -s $(CURDIR)/debian/godeps $(CURDIR)/obj-x86_64-linux-gnu [19:22:42] that worked. but [19:22:47] that also happened when I built burrow locally [19:22:48] i'd rather not do that. [19:22:50] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [19:23:39] (03PS1) 10Dzahn: holmium: use role keyword for 2 classes [puppet] - 10https://gerrit.wikimedia.org/r/248099 [19:23:44] not sure, let me have a look at build logs of other go packages for comparison [19:23:54] kaldari: around? [19:24:26] k, i'm googling around too [19:24:30] yes [19:24:40] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [19:24:49] moritzm: also, maybe you are right about making the build process run gpm to get deps [19:24:56] with kafka, we just included the binaries [19:25:08] but, that is more ok there, since kafka deps are jvm based and cross platform [19:25:11] go deps are not [19:27:57] !log Forced ELK Elasticsearch to allocate replica of logstash-2015.10.22 shard 0 on logstash1004 [19:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:29:26] go doesn't even provide proper shared libs, from a distro point of view it's all really horrible. if e.g. golang is updated you effective need to rebuild all go packages, so rebuilding all the deps during the build is actually the sanest (or rather least insane) approach [19:30:54] go compiles everything down to a static binary right? 
It doesn't properly have shared libs jsut build time dependencies [19:31:23] yeah [19:31:37] the go packages I built are just static binaries that use .debs as glorified 'cp' [19:32:00] yes, they're aiming for something like shared libs in an upcoming release, but only to reduce the code size/deduplicate, but not provising the semantics of a shared lib as glibc does for C [19:32:20] the one completely cargo-cult built deb for a go package I made had a pretty simple rules file -- https://github.com/bd808/ggml/blob/master/debian/rules [19:32:48] it's pretty neat for the huge internal Google codebase, but pretty much sucks for everyone else :-) [19:37:19] ottomata: haven't found anything in the go logs I checked, but as said it also did that in the local build of burrow, so maybe that's standard behaviour [19:37:41] aye, am trying this [19:37:42] https://github.com/mkouhei/golang-ugorji-go-msgpack-debian/blob/master/debian/rules [19:37:47] sorta, we'll see how that goes [19:37:57] at least he infers the arch automatically [19:37:59] i could live with that [19:38:09] woudl you be ok with symlinking like that? [19:38:13] if we have to? [19:38:25] yeah, without symlink i still get error [19:38:31] cannot find package "github.com/samuel/go-zookeeper/zk" in any of: [19:38:31] /usr/lib/go/src/pkg/github.com/samuel/go-zookeeper/zk (from $GOROOT) [19:38:32] /tmp/buildd/golang-burrow-0.1.0/obj-x86_64-linux-gnu/src/github.com/samuel/go-zookeeper/zk (from $GOPATH) [19:38:44] twentyafterfour, there is a patch that is breaking prod - https://gerrit.wikimedia.org/r/#/c/248116 [19:38:53] *fixing [19:39:44] ottomata: but that's a bug in the URL? github.com/samuel/go-zookeeper/zk gives me a 404 at github [19:41:40] moritzm: [19:41:41] https://github.com/samuel/go-zookeeper [19:41:41] symlinks would certainly work, it's just different shades of messy :-) [19:41:55] moritzm: maybe if i used patches to make the symlink? [19:42:04] orrr, i guess if i do it in rules, its the same. 
[19:42:04] greg-g, who is doing the train depl? [19:42:33] moritzm: i could just put the deps directly in that dir. [19:42:37] instead of inside debian/ [19:42:41] I don't believe you can create a symlink using a patch, you can do that manually in debian/rules or using dh_link [19:42:45] i just was avoiding modifying above debian/ dir [19:42:51] or that way [19:43:03] which do you think is better? [19:44:36] ok, is anyone fixing the production? all of zero is broken, and we need to sync the patch https://gerrit.wikimedia.org/r/#/c/248116 CC: twentyafterfour greg-g [19:44:54] using symlinks still less unclean [19:46:49] less unclean, haha ok [19:47:21] http://bots.wmflabs.org/dump/%23wikimedia-operations.htm [19:47:21] @info hywiktionary [19:49:06] (03PS7) 10Madhuvishy: [WIP] burrow: Add new module for burrow [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) [19:53:37] !log yurik@tin Synchronized php-1.27.0-wmf.3/extensions/ZeroBanner: Deploying ZeroBanner T116309 patch 248116 (duration: 00m 18s) [19:53:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:57:36] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, 5iOS-5-app-production: Request one server to suport pywick analytics - https://phabricator.wikimedia.org/T116312#1746317 (10Legoktm) Do you mean piwik? https://github.com/sachdevs/pyWick seems unrelated to analytic... [19:58:55] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1746326 (10Dzahn) >>! In T115666#1746016, @atgo wrote: > Otherwise is there a way for Lisa to do it by email instead? She's having trouble w... [20:00:04] bd808: Respected human, time to deploy Grant review updates (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151022T2000). Please do the needful. 
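The symlink workaround ottomata settles on above can be sketched like this. The `obj-x86_64-linux-gnu` directory and the `go-zookeeper/zk` path come from the build error quoted in the log; keeping the vendored deps under `debian/godeps` laid out as a GOPATH `src` tree is an assumption about the package layout:

```shell
# Simulate the package build tree in a temp dir (stands in for $(CURDIR))
CURDIR=$(mktemp -d)

# Vendored go deps kept under debian/, laid out as a GOPATH src tree
mkdir -p "$CURDIR/debian/godeps/src/github.com/samuel/go-zookeeper/zk"

# Point the directory dh_golang treats as $GOPATH at the vendored deps
ln -s "$CURDIR/debian/godeps" "$CURDIR/obj-x86_64-linux-gnu"

# The dep is now resolvable where the failing build was looking for it
ls "$CURDIR/obj-x86_64-linux-gnu/src/github.com/samuel/go-zookeeper"
```

In the actual `debian/rules` this is the one-liner quoted earlier, `ln -s $(CURDIR)/debian/godeps $(CURDIR)/obj-x86_64-linux-gnu`, run before the build step so dh_golang's GOPATH resolves the vendored packages.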
[20:00:54] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1746333 (10atgo) Ok. Is @tnegrin not an option for sign off? [20:01:51] bd808, around? [20:01:58] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1746337 (10Tnegrin) Anne's doing this as part of her work for the readership team and I approve. [20:02:16] yurik: yeah, about to deploy some code [20:02:17] 6operations, 10Analytics, 6Discovery, 10EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1746338 (10GWicke) Some notes from the meeting: ## Framing, for all events - **uri**: string; path or url. Example: /en.wikipedia.org/v1/page/title/... [20:02:38] bd808, prod is slightly unstable - due to the train, zero started to fail [20:02:47] i deployed quick fix [20:02:57] but i am still seeing it in the logs (CC aude ) [20:03:09] k. my thing is a trebuchet deploy that is unrelated to MW [20:03:37] so we shouldn't crash into each other in any way [20:03:46] bd808, ok, i might need to push another urgent patch. CC greg-g [20:05:46] chasemp: the ELK elasticsearch cluster is finally healed. All shards are allocated and recovered [20:06:53] next time I guess we should run `curl -s -XPOST 'localhost:9200/_flush/synced'` before the restarts to mark things as synced. That really shouldn't be needed based on the docs but *shrug* [20:07:33] bd808, do you have any problems connecting to tin? its taking ages [20:07:41] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1746361 (10atgo) Thanks @tnegrin @dzahn @catrope lmk if that's not enough and I can get a time to walk through this with Lisa. 
[20:07:48] yurik: nope, I logged in with no issues [20:07:50] or is something wrong with my network stack again [20:08:09] greg-g: ping [20:08:09] hoo: You sent me a contentless ping. This is a contentless pong. Please provide a bit of information about what you want and I will respond when I am around. [20:08:13] meh [20:08:33] greg-g, I sent you meaningful pings, you didn't reply :-P [20:09:18] * yurik thinks hoo's approach is better - it gets greg-g's attention faster :))) [20:09:42] moritzm: what should Architecture: be in the control file? [20:09:44] I actually want greg-g to join a meeting right now :P [20:09:47] i thought x86_64-linux-gnu [20:09:48] but [20:09:53] that is not in [20:09:56] dpkg-architecture -L [20:10:04] its an autoresponse script for irssi [20:10:35] use "any", i.e. it can be build for any arch from source [20:10:58] yurik: did it not fix everything? [20:11:12] aude, i still see it in the logs for some reason [20:11:15] trying to figure out why [20:11:17] * aude looks [20:11:28] did you update the submodule? 
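moritzm's answer above ("any") goes in the `Architecture` field of `debian/control`: since the package is built from source per architecture rather than shipping a fixed binary, no concrete arch string like `amd64` is listed. A minimal hypothetical stanza for such a go package might look like this (every field except `Architecture` is illustrative, not taken from the real burrow packaging):

```
Source: burrow
Build-Depends: debhelper (>= 9), dh-golang, golang-go

Package: burrow
Architecture: any
Depends: ${shlibs:Depends}, ${misc:Depends}
Description: Kafka consumer lag monitoring daemon
 Buildable from source for whichever architecture the buildd
 targets, hence "any" rather than a fixed arch string.
```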
[20:11:38] Someone should look at Zero [20:11:40] aude, ah, there is a different call there (( [20:11:41] see fluorine [20:11:52] ottomata: I'm off, please send a mail if anything else comes up [20:12:12] yurik: aaaaa [20:12:19] aude, i'm fixing it now [20:12:25] aude, $skin->getTermsLink [20:12:25] k [20:12:30] !log Updated iegreview.wikimedia.org to bcaf23b (Fix logger usage in Controllers\Account\Recover) [20:12:31] ok moritzm, thanks [20:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:12:39] i'll probably just put this up for review and add you by end of my day [20:12:42] * aude doesn't completely understand how zero is supposed to work [20:12:52] so had trouble to test my fix [20:12:54] ok [20:12:59] yurik: in general, when you ping me with "I need to do an urgent thing" please tell me what it is/why [20:13:12] greg-g, i did - there was a link )) [20:13:28] greg-g, prod is broken for zero [20:13:36] logstash is unhappy [20:13:38] yurik: bah! quit being right (I didn't get back that far yet) [20:13:49] kk [20:13:49] sorry [20:14:05] * yurik note to self, ping greg-g every time with a full context [20:14:14] on every line [20:14:15] "logstash is unhappy"? [20:14:19] yep ) [20:14:24] exceptions, fatals [20:14:26] you know ) [20:14:35] ottomata: document your issues and stuff too since I'll have to be building a few of those too [20:14:45] logstash is unhappy == you see the unhappiness in logstash [20:14:49] YuviPanda: go package? [20:14:50] that sounds more like "lots of errors in logs" [20:14:53] ottomata: yeah [20:15:00] i think you can follow the readme i'm writing for any go package [20:15:02] it sucks though. [20:15:06] committing binary debs to git [20:15:10] deps* [20:15:20] ottomata: link? or is it still WIP? [20:15:23] bd808: yeah :) [20:15:24] still WIP [20:15:24] ottomata: this is for burrow, right? 
[20:15:26] yes [20:15:29] ottomata: kk do link when done [20:15:35] gonna have a commit up by the time i quit for the day [20:15:43] kk [20:17:12] dbbot? [20:20:09] ... [20:20:11] (03PS8) 10Madhuvishy: [WIP] burrow: Add new module for burrow [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) [20:21:40] (03CR) 10Dzahn: "@Muehlenhoff like you said, this was the test to confirm it's in this one class and not the other 2." [puppet] - 10https://gerrit.wikimedia.org/r/248099 (owner: 10Dzahn) [20:23:21] did https://gerrit.wikimedia.org/r/#/c/248116 get deployed? [20:23:23] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/248099/1 & http://puppet-compiler.wmflabs.org/1060/ shows this is an issue with role::labsdnsrecursor " [puppet] - 10https://gerrit.wikimedia.org/r/247198 (owner: 10Muehlenhoff) [20:23:38] yurik: ^ :) [20:23:59] apparently: https://tools.wmflabs.org/sal/log/AVCRHSMQ1oXzWjit6DMT [20:24:04] twentyafterfour, yes [20:24:13] but it didn't fully fix the problem, working on a fix now [20:24:18] gotcha [20:24:23] will need to deploy both mobifel front end & zero (( [20:25:34] yuck, but ok [20:26:02] !log Removed "zirconium.wikimedia.org" from Trebuchet's minion list for iegreview/iegreview [20:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:26:11] aude, MaxSem - could you +2 https://gerrit.wikimedia.org/r/248231 [20:26:33] and cherry pick it to the wmf3 [20:26:39] greg-g: I'm done with my deploy window. 
Grants is updated and happy [20:26:56] * aude looks [20:27:10] bd808: yay [20:28:27] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, 5iOS-5-app-production: Request one server to suport piwick analytics - https://phabricator.wikimedia.org/T116312#1746422 (10JMinor) [20:29:03] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, 5iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1746426 (10Milimetric) [20:29:07] 6operations, 6Analytics-Backlog, 10Wikipedia-iOS-App-Product-Backlog, 10hardware-requests, 5iOS-5-app-production: Request one server to suport piwik analytics - https://phabricator.wikimedia.org/T116312#1746427 (10JMinor) Yes, sorry http://piwik.org/ Corrected spelling... [20:29:18] (03CR) 10Dzahn: "so from the compiler output we see what changes is the "allow_from" part in the pdns config. looking at that in puppet code, it is:" [puppet] - 10https://gerrit.wikimedia.org/r/247198 (owner: 10Muehlenhoff) [20:32:16] (03PS2) 10Nemo bis: Add three new groups to ruwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247850 (https://phabricator.wikimedia.org/T116143) (owner: 10Luke081515) [20:34:00] greg-g, is there a way to sync two dirs at the same time? [20:34:12] not a biggie if its not [20:34:25] do a parent dir? [20:34:35] greg-g, the parent dir is /extensions )) [20:34:36] if that's / then ;) [20:34:56] yurik: no, it's annoying, something similar we've seen before [20:35:07] never mind, will sync mobile first [20:35:24] twentyafterfour: is that ^ (syncing two side by side dirs at the same time) a use case we need to care about? 
[20:35:48] okie, syncing now [20:37:23] greg-g: scap could do it but has no cli to make it happen [20:37:47] under the hood it would be fine with syncing multiple dirs, files whatever [20:38:07] !log yurik@tin Synchronized php-1.27.0-wmf.3/extensions/MobileFrontend: Deploying MobileFrontend T116309 patch 248238 (duration: 00m 36s) [20:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:38:34] making scap have a more sane and parameterized cli would allow it to do many new things with little code change [20:39:02] * bd808 regrets that he let himself get pulled off of that project 3 months too soon [20:39:05] !log yurik@tin Synchronized php-1.27.0-wmf.3/extensions/ZeroBanner/: Deploying ZeroBanner T116309 patch 248239 (duration: 00m 18s) [20:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:41:27] bd808: me too, my fault [20:41:46] (03CR) 10Dzahn: [C: 032] "has approval from Toby Negrin, this is work for the readership team" [puppet] - 10https://gerrit.wikimedia.org/r/247467 (https://phabricator.wikimedia.org/T115666) (owner: 10Dzahn) [20:41:51] (03PS4) 10Dzahn: admin: create agomez and add to stats groups [puppet] - 10https://gerrit.wikimedia.org/r/247467 (https://phabricator.wikimedia.org/T115666) [20:42:02] I don't even remember what OMG fix now thing I moved on to [20:42:26] greg-g, aude, MaxSem done i think, thx for your help! [20:42:37] bd808: ieg? scholarships? 
;) [20:42:46] * yurik wants to give jdlrobson a proper IDE so that this won't happen :-P [20:42:52] (03CR) 10Dzahn: [C: 032] admin: create agomez and add to stats groups [puppet] - 10https://gerrit.wikimedia.org/r/247467 (https://phabricator.wikimedia.org/T115666) (owner: 10Dzahn) [20:43:00] yurik: :) [20:43:13] yurik: unit/intergration tests, yo [20:43:35] greg-g, static code analysis ;) [20:43:36] * aude uses search on github for usages like this [20:44:09] which is essentially elastic search for all the code :) [20:44:11] greg-g: I bet it was SUL [20:44:12] php is fine for that, would have caught most of these issues, and forces people to use proper type documentation, making the code overall better [20:44:16] bd808: probably [20:45:18] Wait is yurik "just use lua on the wiki" advocating for ways to check for breaking changes? ;) [20:46:09] greg-g: I don't even know if we should really support syncing individual files or directories ;) ... sync should probably sync the entire state of the repo really [20:46:36] bd808: I'm working on a much nicer cli [20:46:41] that seems most sane [20:46:46] or you could have scap -- svn mode [20:46:47] :) [20:46:55] twentyafterfour: sometimes particular order of sync matters [20:47:17] twentyafterfour: there's a quip for "sync all the things" -- https://tools.wmflabs.org/bash/quip/AU7VV9Fn6snAnmqnK_1n [20:47:21] my goto excuse for "just sync it all" was previously "we wanna move to HHVM RepoAuthoratative mode soon anyways" but... 
[20:47:36] for certain values of soon [20:47:49] "before the heat death of the universe" [20:48:09] I'd like to see "soon" mean < 1 year [20:48:44] I think eventually everything expands and our atoms separate so there is that [20:48:53] as a deadline [20:48:58] the big rip [20:49:02] we need (a) a good way to ship big blobs quickly, (b) a reliable way to restart hhvm, and (c) the deploy server to have hhvm [20:49:21] a isn't that hard [20:49:25] I've prototyped it [20:49:28] fwiw, last i read facebook uses a torrent-like approach for distributing the images [20:49:32] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1746519 (10Dzahn) Thanks, yea, it's ok. I noted on the change that it's work for the Readership team and went ahead with Toby's approval. I... [20:49:42] ebernhardson: yup, rack-aware torrents [20:49:44] ebernhardson: yes that's what we would use [20:50:00] something similar to twitter [20:50:01] ebernhardson: yeah. I actually have notes and example code from the FB deploy team [20:50:06] "murder" from twitter [20:50:06] sweet! [20:50:11] Oh Emir [20:50:17] nice guy [20:50:18] that's really the name [20:50:40] he wrote stub/pseudocode for us [20:50:52] ebernhardson was actually in the meeting ;) [20:51:00] https://github.com/emiraga/hhvm-deploy-ideas [20:51:06] that [20:51:07] heh, yea but its been awhile. maybe thats when i remember hearing about that :) [20:51:08] Tyler and I played with that on staging and did a proof of concept mediawiki deploy using bittorrent. It worked well [20:51:41] * ebernhardson would love the atomic deploys that go along with RepoAuth [20:52:05] yeah that's what I would rather be working on, instead of implementing "sync two directories" in scap [20:52:05] yup [20:52:16] twentyafterfour: forget I said anything :) [20:52:42] I mean, sync >1 directory would be straightforward, I think, but still...
[20:52:58] once we move to repo authoritative then it becomes pointless [20:52:59] priorities :) [20:53:25] (in other news the hipster coffee shop is playing nirvana.... of course) [20:53:35] lol [20:54:13] * bd808 is playing nirvana too but unironically [20:54:54] * bd808 is also wearing flannel [20:55:54] (03PS9) 10Madhuvishy: [WIP] burrow: Add new module for burrow [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) [20:56:04] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1746537 (10Dzahn) 5Open>3Resolved @atgo Here's an example SSH config snippet, for your `/home/atgomez/.ssh/config` (assuming you use atg... [20:56:07] i got lunch from a white guy wearing a pagari, hipster enough? :P [20:56:11] bd808: yeaaaaaaaaaaaaaaaahhhhhh, yeaaayaaaaaaaaaaaaaaaaah, yeyeaayeaaaahahhhhhhhhhh yeyeaaahaahhhhh [20:56:49] ebernhardson: did he have a man-bun sticking out the top too? [20:57:44] greg-g: I was blaring http://lyrics.wikia.com/wiki/Nirvana:Negative_Creep [20:58:12] "You must enable javascript to view this page. This is a requirement of our licensing agreement with music Gracenote." great lyrics [20:59:02] does no one respect gopher users anymore [20:59:43] 6operations, 10Traffic, 10Wikimedia-General-or-Unknown, 7HTTPS: Set up "w.wiki" domain for usage with UrlShortener - https://phabricator.wikimedia.org/T108649#1746555 (10RobH) Adding another domain adds to the overall cost of the certificate, and thus would need additional budget approval. Also, T101048...
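[Editor's note] The "rack-aware torrents" idea discussed above (preferring peers in the same rack when fetching deployment blobs, to keep traffic off cross-rack uplinks) can be sketched minimally. This is an invented illustration, not the actual murder/Facebook implementation; hostnames and rack labels are hypothetical:

```python
# Hypothetical sketch of rack-aware peer selection for a torrent-style
# deploy: rank candidate peers so same-rack peers are tried first.
# Peer and rack names below are invented for illustration.

def rank_peers(peers, local_rack):
    """Order (hostname, rack) tuples so same-rack peers come first.

    sorted() is stable, so ordering within each group is preserved;
    the key is False (0) for same-rack peers and True (1) otherwise.
    """
    return sorted(peers, key=lambda p: p[1] != local_rack)

peers = [
    ("mw1017", "a3"),
    ("mw1018", "b2"),
    ("mw1019", "a3"),
]
print(rank_peers(peers, "a3"))
# [('mw1017', 'a3'), ('mw1019', 'a3'), ('mw1018', 'b2')]
```

A real implementation would combine this with the usual torrent piece-selection logic; this only shows the rack-affinity ordering.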
[21:00:13] we even have a snippet from it on wp.o -- https://en.wikipedia.org/wiki/File:Negative_Creep_%28Nirvana_song_-_sample%29.ogg [21:00:29] it's obviously quite notable [21:00:41] also the best nirvana song [21:02:51] This is apparently one of those music services stations, there was a station ID break (that I couldn't make out), and now we're playing Creep by Radiohead [21:03:07] anywho, enough reminiscing of high school [21:04:02] twentyafterfour: I was an inch from taking that bait on best nirvana song well played... [21:05:19] !log aaron@tin Synchronized php-1.27.0-wmf.3/includes: a6262272c9666d (duration: 00m 23s) [21:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:05:31] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1746594 (10Dzahn) [21:06:49] bd808: yes, man bun as well [21:09:12] (03PS2) 10Reedy: Add new dblist symlinks for noc conf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245584 [21:09:40] (03CR) 10Reedy: [C: 032] Add new dblist symlinks for noc conf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245584 (owner: 10Reedy) [21:09:45] (03Merged) 10jenkins-bot: Add new dblist symlinks for noc conf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/245584 (owner: 10Reedy) [21:10:28] !log aaron@tin Synchronized php-1.27.0-wmf.3/includes/changetags/ChangeTags.php: e7126ed331109 (duration: 00m 17s) [21:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:10:47] 6operations, 10Deployment-Systems, 3Scap3: Move scap target configuration to etcd - https://phabricator.wikimedia.org/T115899#1746621 (10greg) [21:11:10] !log reedy@tin Synchronized docroot and w: Add more dblist symlinks (duration: 00m 18s) [21:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:16:07] 6operations, 7Mail: Mails to any wikimedia.org
account/list from any account @wikimedia.org.ve bounces - https://phabricator.wikimedia.org/T62215#1746637 (10chasemp) 5Open>3declined a:3chasemp I'm marking as declined as it seems this ticket is shrouded in status mystery. Please reopen if there are things... [21:17:26] (03PS2) 10Reedy: Update comments/hints for WMF MW version format changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/244735 [21:19:15] lol my scrollback [21:20:29] (03PS1) 10Ottomata: Initial debian packaging [debs/golang-burrow] (debian) - 10https://gerrit.wikimedia.org/r/248245 [21:20:57] YuviPanda: https://gerrit.wikimedia.org/r/#/c/248245/ [21:23:01] (03PS2) 10Ottomata: Initial debian packaging [debs/golang-burrow] (debian) - 10https://gerrit.wikimedia.org/r/248245 (https://phabricator.wikimedia.org/T116084) [21:25:49] 10Ops-Access-Requests, 6operations: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1746680 (10atgo) Thanks @dzahn and @catrope! 
[21:57:50] 6operations, 5Patch-For-Review: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1746777 (10Dzahn) 3NEW a:3Dzahn [21:58:24] 6operations: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1746777 (10Dzahn) [21:59:22] 6operations: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1746777 (10Dzahn) We might as well just fix it with some simple work around for this one special case, such as a generic check that turns critical on a certain date, and we feed it the expiration date manually once [22:00:05] 6operations, 6Labs, 10Labs-Infrastructure, 7Monitoring: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1746787 (10Dzahn) [22:00:58] 6operations, 5Patch-For-Review: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1746795 (10Dzahn) We agreed to split this special case into a non-blocking subtask -> T116332 [22:01:21] 6operations, 6Labs, 10Labs-Infrastructure, 7Monitoring: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1746798 (10Dzahn) a:5Dzahn>3Andrew [22:02:10] !log ori@tin Synchronized php-1.27.0-wmf.2/extensions/WikimediaMaintenance/getJobQueueLengths.php: Ie95ec067da9: getJobQueueLengths: add '--report' option for StatsD reporting (duration: 00m 18s) [22:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:02:34] !log ori@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaMaintenance/getJobQueueLengths.php: Ie95ec067da9: getJobQueueLengths: add '--report' option for StatsD reporting (duration: 00m 18s) [22:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:03:41] (03PS1) 10Ori.livneh: Add a cron job for reporting the job queue size to Graphite [puppet] - 10https://gerrit.wikimedia.org/r/248256 [22:03:52] (03PS2) 
10Ori.livneh: Add a cron job for reporting the job queue size to Graphite [puppet] - 10https://gerrit.wikimedia.org/r/248256 [22:04:09] (03CR) 10Ori.livneh: [C: 032 V: 032] Add a cron job for reporting the job queue size to Graphite [puppet] - 10https://gerrit.wikimedia.org/r/248256 (owner: 10Ori.livneh) [22:04:21] 6operations, 6Labs, 10Labs-Infrastructure, 7Monitoring: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1746802 (10Dzahn) @neon:/usr/lib/nagios/plugins# ./check_http -I labvirt1001.eqiad.wmnet -p 5925 -S CRITICAL - Cannot make SSL connection ...error:140770FC:SSL... [22:07:35] (03PS1) 10Andrew Bogott: Initial puppet config for labvirt1010 and 1011 [puppet] - 10https://gerrit.wikimedia.org/r/248258 [22:08:32] 6operations, 5Patch-For-Review: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1746834 (10Dzahn) 5Open>3Resolved All the certs in the original list in the ticket are covered now. The only exception is the special case for labvirtstar. 
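[Editor's note] The "generic check that turns critical on a certain date" floated above for the labvirt-star cert (feed it the expiration date manually, alert as the date approaches) can be sketched like this. The date format matches the `notAfter` string Python's `ssl` module reports for a peer certificate; the warning/critical thresholds are invented for illustration:

```python
# Minimal sketch of a date-fed cert expiry check: given a manually
# supplied expiration timestamp, report OK/WARNING/CRITICAL based on
# days remaining. Thresholds here are illustrative, not Icinga defaults.
from datetime import datetime

def days_until_expiry(not_after, now):
    """Days from `now` until a cert 'notAfter' timestamp such as
    'Oct 22 23:59:59 2016 GMT' (the format ssl.getpeercert() uses)."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return (expires - now).days

def check_state(days_left, warn=30, crit=7):
    if days_left < crit:
        return "CRITICAL"
    if days_left < warn:
        return "WARNING"
    return "OK"

now = datetime(2015, 10, 22, 22, 0, 0)
left = days_until_expiry("Oct 22 23:59:59 2016 GMT", now)
print(check_state(left))  # prints "OK"
```

Wiring this into Icinga would just mean exiting with the conventional 0/1/2 status codes instead of printing a string.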
[22:08:57] (03CR) 10Andrew Bogott: [C: 032] Initial puppet config for labvirt1010 and 1011 [puppet] - 10https://gerrit.wikimedia.org/r/248258 (owner: 10Andrew Bogott) [22:09:12] 6operations, 7HTTPS, 7Icinga, 7Monitoring: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1746836 (10Dzahn) [22:09:50] 6operations, 7HTTPS, 7Icinga, 7Monitoring: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1683510 (10Dzahn) https://etherpad.wikimedia.org/p/T114059 [22:12:00] 6operations: removing alias - https://phabricator.wikimedia.org/T116334#1746848 (10Krenair) [22:12:48] 6operations, 6Labs, 10Labs-Infrastructure, 7Monitoring: monitor expiration of labvirt-star SSL cert - https://phabricator.wikimedia.org/T116332#1746850 (10Dzahn) tcp 0 0 *:5906 *:* LISTEN 8360/**kvm** [22:17:13] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1746860 (10ssastry) [22:18:43] 6operations: removing alias - https://phabricator.wikimedia.org/T116334#1746864 (10Dzahn) a:3Dzahn [22:18:57] 6operations: removing alias for roaaary - https://phabricator.wikimedia.org/T116334#1746865 (10Dzahn) [22:23:26] 6operations: removing alias for roaaary - https://phabricator.wikimedia.org/T116334#1746902 (10Dzahn) Hello @eliza, confirmed and done. before it was: ``` 486 roary: pbeaudette, mpaulson 487 roaaary: roary 488 roaaaary: roary ``` now it's gone on our side. As soon as the change is appl... 
[22:23:45] 6operations, 7Mail: removing alias for roaaary - https://phabricator.wikimedia.org/T116334#1746909 (10Dzahn) [22:23:52] 6operations, 7Mail: removing alias for roaaary - https://phabricator.wikimedia.org/T116334#1746919 (10Dzahn) 5Open>3Resolved [22:25:02] 6operations, 6Parsing-Team, 10hardware-requests: Dedicated server for running Parsoid's roundtrip tests to get reliable parse latencies and use as perf. benchmarking tests - https://phabricator.wikimedia.org/T116090#1746935 (10Dzahn) [22:25:09] 6operations, 6Parsing-Team, 10hardware-requests: Dedicated server for running Parsoid's roundtrip tests to get reliable parse latencies and use as perf. benchmarking tests - https://phabricator.wikimedia.org/T116090#1746937 (10Dzahn) p:5Triage>3Normal [22:26:13] (03PS10) 10Madhuvishy: [WIP] burrow: Add new module for burrow [puppet] - 10https://gerrit.wikimedia.org/r/248079 (https://phabricator.wikimedia.org/T115669) [22:26:15] 6operations, 6Parsing-Team, 10hardware-requests: Dedicated server for running Parsoid's roundtrip tests to get reliable parse latencies and use as perf. benchmarking tests - https://phabricator.wikimedia.org/T116090#1740321 (10Dzahn) I added the hardware-requests tag which should start the process [22:26:18] 6operations, 7Mail: removing alias for roaaary - https://phabricator.wikimedia.org/T116334#1746948 (10eliza) Thank you Dzahn [22:28:10] 6operations, 10Beta-Cluster-Infrastructure, 7WorkType-NewFunctionality: etcd/confd is not started on deployment-cache-mobile04 - https://phabricator.wikimedia.org/T116224#1746954 (10Dzahn) p:5Triage>3Normal [22:31:50] 7Puppet, 10Deployment-Systems, 6Release-Engineering-Team, 10Salt, 10Staging: provider => trebuchet doesn't work until manual 'git deploy start' on deployment-server - https://phabricator.wikimedia.org/T92978#1746960 (10mmodell) 5Open>3declined a:3mmodell trebuchet is completely deprecated. Scap3 wi... 
[22:32:36] RECOVERY - check google safe browsing for mediawiki.org on google is OK: https://mediawiki.org is OK [22:33:10] 6operations, 10Wikimedia-Mailing-lists, 7Upstream: Install mailman-api for internal use - https://phabricator.wikimedia.org/T116288#1746964 (10JohnLewis) 5Open>3stalled [22:36:01] 6operations, 6Commons, 10Wikimedia-Media-storage, 7Swift: Some files had disappeared from Commons after renaming - https://phabricator.wikimedia.org/T111838#1746971 (10Dzahn) [22:38:06] 6operations, 10hardware-requests: Site: 1 server hardware access request for initializing the codfw elasticsearch cluster. - https://phabricator.wikimedia.org/T116236#1746975 (10Dzahn) p:5Triage>3Normal [22:39:04] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [22:43:27] (03PS2) 10Dzahn: holmium: use role keyword for 2 classes [puppet] - 10https://gerrit.wikimedia.org/r/248099 [22:46:05] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [22:55:20] !log ori@tin Synchronized php-1.27.0-wmf.3/extensions/Cite: extensions/Cite update which fell off the SWAT train yesterday (duration: 00m 19s) [22:55:25] ^ MatmaRex [22:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:00:04] RoanKattouw ostriches rmoen Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151022T2300). Please do the needful. [23:00:04] Luke081515: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:35] jouncebot: what about me and matt_flaschen, you robotic bastard. [23:03:05] LOL [23:03:27] Also, I forgot to update it. I may have to leave during the hour, so mooeypoo agreed to be designee. 
[23:05:28] okay, guess I'm doing this then [23:05:43] let's start with people who are here [23:06:13] Luke081515|away: poke for SWAT [23:06:28] I can probably deal with this myself, but that means it goes last MatmaRex :P [23:07:36] right. just poking, since jouncebot pinged a different name [23:07:41] right [23:07:44] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [23:07:47] matt_flaschen's patch looks good, sending it through jenkins [23:08:23] * mooeypoo is here if needed when matt_flaschen leaves [23:08:29] 6operations: restore old mw config private svn repo from bacula - https://phabricator.wikimedia.org/T115937#1747042 (10Dzahn) 16:05 < mutante> Reedy: root@bast1001:/var/tmp/bacula-restores/srv/home_pmtpa/wikipedia/common-before-tin# 16:08 < mutante> ./wmf-config/.svn/text-base/missing.php.svn-base ? [23:12:40] !log krenair@tin Synchronized php-1.27.0-wmf.3/extensions/Flow/includes/SpamFilter/RateLimits.php: https://gerrit.wikimedia.org/r/#/c/248229/ (duration: 00m 17s) [23:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:13:21] !log krenair@tin Synchronized php-1.27.0-wmf.3/extensions/Flow/autoload.php: https://gerrit.wikimedia.org/r/#/c/248229/ (duration: 00m 17s) [23:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:13:48] !log krenair@tin Synchronized php-1.27.0-wmf.3/extensions/Flow/container.php: https://gerrit.wikimedia.org/r/#/c/248229/ (duration: 00m 17s) [23:13:53] matt_flaschen, ^ [23:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:14:00] Thanks, will test. [23:16:25] PROBLEM - puppet last run on db2069 is CRITICAL: CRITICAL: Puppet has 1 failures [23:16:35] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:17:33] Works, thanks. :) [23:17:39] great [23:18:34] MatmaRex, want me to do both of yours at once?
[23:19:15] Krenair: one depends on the other [23:19:38] i don't care, whichever is easier for you [23:25:06] dammit [23:25:14] I forgot we need to do the submodule update for VE [23:25:20] Every damn time... [23:25:27] * Krenair rages [23:26:42] Git is not supposed to say "Bus error (core dumped)" [23:26:56] Nor fail to insert files into the database when running git diff [23:27:04] hm, why do we need that? [23:27:10] oh, right [23:27:13] because it's special [23:27:20] i forgot too. want me to submit them? [23:27:38] that'd be helpful [23:27:41] it's https://wikitech.wikimedia.org/wiki/How_to_deploy_code/VE_MW_core_submodule_update [23:28:56] doing [23:30:26] I keep having this type of issue with git [23:30:45] it's worrying [23:30:48] I have no idea what causes it [23:31:06] git fsck? [23:32:59] busted hard drive? :( [23:33:48] nothing else has been complaining [23:33:49] just git [23:34:12] presumably most things don't check sha1s of everything all the time [23:34:17] legoktm, so sometimes if I run git diff enough times it goes away [23:35:38] Krenair: https://gerrit.wikimedia.org/r/248273 [23:36:33] thanks [23:38:15] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail [23:41:45] RECOVERY - puppet last run on db2069 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [23:47:15] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:48:52] MatmaRex, syncing [23:49:07] !log krenair@tin Synchronized php-1.27.0-wmf.3/extensions/VisualEditor/modules/ve-mw/ui/dialogs/ve.ui.MWMediaDialog.js: https://gerrit.wikimedia.org/r/#/c/248273/ (duration: 00m 18s) [23:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:49:26] MatmaRex, ^ [23:49:30] please test [23:51:25] Krenair: well, that's probably not the reply you were hoping for, but it doesn't seem to be working [23:51:40] no effect? 
[23:51:42] or everything broke? [23:51:45] no effect [23:51:55] cleared cache etc.? [23:52:21] i am testing https://en.wikipedia.org/wiki/The_Fighting_Temeraire , when i open the media dialog for the first image (on the left), i see a disabled "Upload" button, i should see an "Apply" button [23:52:26] yeah, tried in incognito [23:52:38] hmm, but it works in debug mode. [23:53:06] it just took longer for some reason. seems fine now. [23:54:12] works. crisis averted [23:55:04] :) [23:57:25] (03PS3) 10Alex Monk: Add three new groups to ruwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247850 (https://phabricator.wikimedia.org/T116143) (owner: 10Luke081515) [23:57:43] (03CR) 10Alex Monk: [C: 032] Add three new groups to ruwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247850 (https://phabricator.wikimedia.org/T116143) (owner: 10Luke081515) [23:57:49] (03Merged) 10jenkins-bot: Add three new groups to ruwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247850 (https://phabricator.wikimedia.org/T116143) (owner: 10Luke081515) [23:58:47] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/247850/ (duration: 00m 17s) [23:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
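[Editor's note] Earlier in the log, Krenair's "Bus error (core dumped)" from git prompted the aside that "most things don't check sha1s of everything all the time." That is exactly why disk corruption tends to surface in git before anywhere else: every object is content-addressed, so a flipped bit changes the hash and git notices. A blob's id is just the SHA-1 of a short header plus the file contents:

```python
# How git derives a blob id: SHA-1 over "blob <size>\0" + contents.
# This is why `git fsck` (and ordinary object reads) can detect
# on-disk corruption that other tools silently pass through.
import hashlib

def git_blob_id(data: bytes) -> str:
    """Compute the object id git assigns to a blob with these contents."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# `printf 'hello\n' | git hash-object --stdin` prints the same id.
print(git_blob_id(b"hello\n"))
# ce013625030ba8dba906f756967f9e9ca394464a
```

Trees and commits are hashed the same way with different headers, so corruption anywhere in the object store breaks the chain of hashes above it.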